[HN Gopher] Jagged AGI: o3, Gemini 2.5, and everything after
       ___________________________________________________________________
        
       Jagged AGI: o3, Gemini 2.5, and everything after
        
       Author : ctoth
       Score  : 137 points
       Date   : 2025-04-20 14:55 UTC (8 hours ago)
        
 (HTM) web link (www.oneusefulthing.org)
 (TXT) w3m dump (www.oneusefulthing.org)
        
       | sejje wrote:
       | In the last example (the riddle)--I generally assume the AI isn't
       | misreading, rather that it assumes you couldn't give it the
       | riddle correctly, but it has seen it already.
       | 
       | I would do the same thing, I think. It's too well-known.
       | 
       | The variation doesn't read like a riddle at all, so it's
       | confusing even to me as a human. I can't find the riddle part.
       | Maybe the AI is confused, too. I think it makes an okay
       | assumption.
       | 
        | I guess it would be nice if the AI asked a follow-up question
       | like "are you sure you wrote down the riddle correctly?", and I
       | think it could if instructed to, but right now they don't
       | generally do that on their own.
        
         | Jensson wrote:
         | > generally assume the AI isn't misreading, rather that it
         | assumes you couldn't give it the riddle correctly, but it has
         | seen it already.
         | 
          | An LLM doesn't assume, it's a text completer. It sees
          | something that looks almost like a well-known problem and it
          | will complete it as that well-known problem; that's a failure
          | mode specific to being a text completer that is hard to get
          | around.
        
           | simonw wrote:
           | These newer "reasoning" LLMs really don't feel like pure text
           | completers any more.
        
             | jordemort wrote:
             | And yet
        
             | gavinray wrote:
              | Is it not physically impossible for LLMs to be anything
             | but "plausible text completion"?
             | 
             | Neural Networks as I understand them are universal function
             | approximators.
             | 
             | In terms of text, that means they're trained to output what
             | they believe to be the "most probably correct" sequence of
             | text.
             | 
             | An LLM has no idea that it is "conversing", or "answering"
             | -- it relates some series of symbolic inputs to another
             | series of probabilistic symbolic outputs, aye?
        
             | Borealid wrote:
             | What your parent poster said is nonetheless true,
             | regardless of how it feels to you. Getting text from an LLM
             | is a process of iteratively attempting to find a likely
             | next token given the preceding ones.
             | 
             | If you give an LLM "The rain in Spain falls" the single
             | most likely next token is "mainly", and you'll see that one
             | proportionately more than any other.
             | 
             | If you give an LLM "Find an unorthodox completion for the
             | sentence 'The rain in Spain falls'", the most likely next
             | token is something other than "mainly" because the tokens
             | in "unorthodox" are more likely to appear before text that
             | otherwise bucks statistical trends.
             | 
             | If you give the LLM "blarghl unorthodox babble The rain in
             | Spain" it's likely the results are similar to the second
             | one but less likely to be coherent (because text obeying
             | grammatical rules is more likely to follow other text also
             | obeying those same rules).
             | 
             | In any of the three cases, the LLM is predicting text, not
             | "parsing" or "understanding" a prompt. The fact it will
             | respond similarly to a well-formed and unreasonably-formed
             | prompt is evidence of this.
             | 
             | It's theoretically possible to engineer a string of
             | complete gibberish tokens that will prompt the LLM to
              | recite song lyrics, or answer questions about mathematical
             | formulae. Those strings of gibberish are just difficult to
             | discover.
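              | 
              | A minimal sketch of what I mean, assuming the Hugging Face
              | transformers library and the small gpt2 checkpoint (both
              | illustrative choices), that just inspects the next-token
              | probabilities directly:
              | 
              |   import torch
              |   from transformers import (AutoModelForCausalLM,
              |                             AutoTokenizer)
              | 
              |   tok = AutoTokenizer.from_pretrained("gpt2")
              |   model = AutoModelForCausalLM.from_pretrained("gpt2")
              | 
              |   def top_next(prompt, k=5):
              |       # Probabilities the model assigns to the next token.
              |       ids = tok(prompt, return_tensors="pt")
              |       with torch.no_grad():
              |           logits = model(**ids).logits[0, -1]
              |       probs = torch.softmax(logits, dim=-1)
              |       vals, idx = torch.topk(probs, k)
              |       return [(tok.decode(int(i)), round(v.item(), 3))
              |               for i, v in zip(idx, vals)]
              | 
              |   # Compare how the probability mass shifts between prompts.
              |   print(top_next("The rain in Spain falls"))
              |   print(top_next("blarghl unorthodox babble The rain in Spain"))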
        
               | Workaccount2 wrote:
               | The problem is showing that humans aren't just doing next
               | word prediction too.
        
               | Borealid wrote:
               | I don't see that as a problem. I don't particularly care
               | how human intelligence works; what matters is what an LLM
               | is capable of doing and what a human is capable of doing.
               | 
               | If those two sets of accomplishments are the same there's
               | no point arguing about differences in means or terms.
               | Right now humans can build better LLMs but nobody has
               | come up with an LLM that can build better LLMs.
        
               | baq wrote:
                | That's literally the definition of takeoff: when it
                | starts, it gets us to singularity in a decade, and
                | there's no publicly available evidence that it's
                | started... emphasis on publicly available.
        
               | myk9001 wrote:
               | > it gets us to singularity
               | 
               | Are we sure it's actually taking us along?
        
               | johnisgood wrote:
               | > but nobody has come up with an LLM that can build
               | better LLMs.
               | 
               | Yet. Not that we know of, anyway.
        
               | dannyobrien wrote:
               | So I just gave your blarghl line to Claude, and it
               | replied "It seems like you included a mix of text
               | including "blarghl unorthodox babble" followed by the
               | phrase "The rain in Spain."
               | 
               | Did you mean to ask about the well-known phrase "The rain
               | in Spain falls mainly on the plain"? This is a famous
               | elocution exercise from the musical "My Fair Lady," where
               | it's used to teach proper pronunciation.
               | 
               | Or was there something specific you wanted to discuss
               | about Spain's rainfall patterns or perhaps something else
               | entirely? I'd be happy to help with whatever you intended
               | to ask. "
               | 
               | I think you have a point here, but maybe re-express it?
               | Because right now your argument seems trivially
               | falsifiable even under your own terms.
        
               | Borealid wrote:
                | If you feed that to Claude, you're getting Claude's
                | "system prompt" prepended to the text you give it.
               | 
               | If you want to test convolution you have to use a raw
               | model with no system prompt. You can do that with a Llama
               | or similar. Otherwise your context window is full of
               | words like "helpful" and "answer" and "question" that
               | guide the response and make it harder (not impossible) to
               | see the effect I'm talking about.
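                | 
                | A rough sketch of that kind of test, assuming the
                | transformers library; the base (non-instruct) checkpoint
                | name is only an illustrative placeholder:
                | 
                |   from transformers import (AutoModelForCausalLM,
                |                             AutoTokenizer)
                | 
                |   # Any base model with no chat template or baked-in
                |   # system prompt will do here.
                |   name = "meta-llama/Llama-3.2-1B"
                |   tok = AutoTokenizer.from_pretrained(name)
                |   model = AutoModelForCausalLM.from_pretrained(name)
                | 
                |   text = "blarghl unorthodox babble The rain in Spain"
                |   ids = tok(text, return_tensors="pt")
                |   out = model.generate(**ids, max_new_tokens=20,
                |                        do_sample=False)  # greedy
                |   print(tok.decode(out[0], skip_special_tokens=True))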
        
               | itchyjunk wrote:
                | At this point, you might as well be claiming that a
                | completions model behaves differently from a fine-tuned
                | model. Which is true, but a prompt sent via the API
                | without any system message also doesn't seem to match
                | your prediction.
        
               | tough wrote:
                | The point is that when there's a system prompt you didn't
                | write, you get autocompletion of your input + said system
                | prompt, and that biases all outputs.
        
               | dannyobrien wrote:
               | I'm a bit confused here. Are you saying that if I zero
               | out the system prompt on _any_ LLM, including those fine-
               | tuned to give answers in an instructional form, they will
               | follow your effect -- that nonsense prompts will get
               | similar results to coherent prompts?
               | 
               | Because I've tried it on a few local models I have handy,
               | and I don't see that happening at all. As someone else
               | says, some of that difference is almost certainly due to
               | supervised fine-tuning (SFT) and reinforcement learning
                | from human feedback (RLHF) -- but it's weird to me, given
                | the confidence with which you made your prediction, that
                | you didn't exclude those from your original statement.
               | 
               | I guess, maybe the real question here is: could you give
               | me a more explicit example of how to show what you are
               | trying to show? And explain why I'm not seeing it while
               | running local models without system prompts?
        
               | simonw wrote:
               | No, I think the "reasoning" step really does make a
               | difference here.
               | 
               | There's more than just next token prediction going on.
                | Those reasoning chains of thought have undergone their
               | own reinforcement learning training against a different
               | category of samples.
               | 
               | They've seen countless examples of how a reasoning chain
               | would look for calculating a mortgage, or searching a
               | flight, or debugging a Python program.
               | 
               | So I don't think it is accurate to describe the eventual
               | result as "just next token prediction". It is a
                | combination of next token prediction that has been
               | informed by a chain of thought that was based on a
               | different set of specially chosen examples.
        
               | Borealid wrote:
                | Do you believe it's possible to produce a given set of
                | model weights from an infinitely large number of
                | different sets of training examples?
               | 
               | If not, why not? Explain.
               | 
               | If so, how does your argument address the fact that this
               | implies any given "reasoning" model can be trained
               | without giving it a single example of something you would
               | consider "reasoning"? (in fact, a "reasoning" model may
               | be produced by random chance?)
        
               | simonw wrote:
               | I'm afraid I don't understand your question.
        
               | wongarsu wrote:
               | > The fact it will respond similarly to a well-formed and
               | unreasonably-formed prompt is evidence of this.
               | 
               | Don't humans do the same in conversation? How should an
               | intelligent being (constrained to the same I/O system)
               | respond here to show that it is in fact intelligent?
        
               | Borealid wrote:
               | Imagine a Rorschach Test of language, where a certain set
               | of non-recognizable-language tokens invariably causes an
               | LLM to talk about flowers. These strings exist by
               | necessity due to how the LLM's layers are formed.
               | 
               | There exists no similar set of tokens for humans, because
               | our process is to parse the incoming sounds into words,
               | use grammar to extract conceptual meaning from those
               | words, and then shape a response from that conceptual
               | meaning.
               | 
                | Artists like Lewis Carroll and Stanislaw Lem play with
               | this by inserting non-words at certain points in
               | sentences to get humans to infer the meaning of those
               | words from surrounding context, but the truth remains
               | that an LLM will gladly convolute a wholly non-language
               | input into a response as if it were well-formed, but a
               | human can't/won't do that.
               | 
               | I know this is hard to understand, but the current
               | generation of LLMs are working directly with language.
               | Their "brains" are built on language. Some day we might
               | have some kind of AI system that's built on some kind of
               | meaning divorced from language, but that's not what's
                | happening here. They're engineering matrices that
               | repeatedly perform "context window times model => one
               | more token" operations.
        
               | og_kalu wrote:
               | I think you are begging the question here.
               | 
               | For one thing, LLMs absolutely form responses from
               | conceptual meanings. This has been demonstrated
                | empirically multiple times now, including again by
                | Anthropic only a few weeks ago. 'Language' is just the
               | input and output, the first and last few layers of the
               | model.
               | 
               | So okay, there exists some set of 'gibberish' tokens that
               | will elicit meaningful responses from LLMs. How does your
                | conclusion, "Therefore, LLMs don't understand", fit the
                | bill here? Would you also conclude that humans have no
                | understanding of what they see because of the Rorschach
                | test?
               | 
               | >There exists no similar set of tokens for humans,
               | because our process is to parse the incoming sounds into
               | words, use grammar to extract conceptual meaning from
               | those words, and then shape a response from that
               | conceptual meaning.
               | 
                | Grammar is a useful fiction, an incomplete model of a
               | demonstrably probabilistic process. We don't use
               | 'grammar' to do anything.
        
               | wongarsu wrote:
               | > Imagine a Rorschach Test of language, where a certain
               | set of non-recognizable-language tokens invariably causes
               | an LLM to talk about flowers. These strings exist by
               | necessity due to how the LLM's layers are formed.
               | 
               | Maybe not for humanity as a species, but for individual
               | humans there are absolutely token sequences that lead
                | them to talk about certain topics, with nobody able
                | to bring them back on topic. Now you'd probably say those
               | are recognizable token sequences, but do we have a fair
               | process to decide what's recognizable that isn't
               | inherently biased towards making humans the only rational
               | actor?
               | 
                | I'm not disputing at all that LLMs are only built on
                | language. Their lack of a physical reference point is
               | sometimes laughably obvious. We could argue whether there
               | are signs they also form a world model and reasoning that
               | abstracts from language alone, but that's not even my
               | point. My point is rather that any test or argument that
               | attempts to say that LLMs can't "reason" or "assume" or
               | whatever has to be a test a human could pass. Preferably
               | a test a random human would pass with flying colors.
        
               | baq wrote:
               | This again.
               | 
               | It's predicting text. Yes. Nobody argues about that.
               | (You're also predicting text when you're typing it. Big
               | deal.)
               | 
                |  _How_ it is predicting the text is the question to ask,
                | and indeed it's being asked, and we're getting glimpses
                | of understanding, and lo and behold, it's a damn complex
                | process. See the recent Anthropic research paper for
                | details.
        
           | monkpit wrote:
            | This take really misses a key part of the implementation of
            | these LLMs, and I've been struggling to put my finger on it.
           | 
           | In every LLM thread someone chimes in with "it's just a
           | statistical token predictor".
           | 
           | I feel this misses the point and I think it dismisses
           | attention heads and transformers, and that's what sits weird
           | with me every time I see this kind of take.
           | 
           | There _is_ an assumption being made within the model at
           | runtime. Assumption, confusion, uncertainty - one camp might
           | argue that none of these exist in the LLM.
           | 
           | But doesn't the implementation constantly make assumptions?
           | And what even IS your definition of "assumption" that's not
           | being met here?
           | 
           | Edit: I guess my point, overall, is: what's even the purpose
           | of making this distinction anymore? It derails the discussion
           | in a way that's not insightful or productive.
        
             | Jensson wrote:
             | > I feel this misses the point and I think it dismisses
             | attention heads and transformers
             | 
              | Those just make it better at completing the text, but for
              | very common riddles those tools still get easily overruled
              | by pretty simple text-completion logic, since the weights
              | for those will be so extremely strong.
             | 
              | The point is that if you understand it's a text completer
              | then it's easy to understand why it fails at these. To fix
             | these properly you need to make it no longer try to
             | complete text, and that is hard to do without breaking it.
        
           | wongarsu wrote:
           | If you have the model output a chain of thought, whether it's
           | a reasoning model or you prompt a "normal" model to do so,
           | you will see examples of the model going "user said X, but
           | did they mean Y? Y makes more sense, I will assume Y".
           | Sometimes stretched over multiple paragraphs, consuming the
           | entire reasoning budget for that prompt.
           | 
           | Discussing whether models can "reason" or "think" is a
           | popular debate topic on here, but I think we can all at least
           | agree that they do something that at least resembles
           | "reasoning" and "assumptions" from our human point of view.
            | And if in its chain-of-thought it decides your prompt is
            | wrong, it will go ahead and answer what it assumes is the
            | right prompt.
        
           | sejje wrote:
           | > it's a text completer
           | 
           | Yes, and it can express its assumptions in text.
           | 
           | Ask it to make some assumptions, like about a stack for a
           | programming task, and it will.
           | 
           | Whether or not the mechanism behind it feels like real
           | thinking to you, it can definitely do this.
        
             | wobfan wrote:
              | If you count putting together text that reads like an
              | assumption, then yes. But it cannot express an assumption,
              | as it is not assuming. It is completing text, like OP said.
        
               | ToValueFunfetti wrote:
               | It's trained to complete text, but it does so by
               | constructing internal circuitry during training. We don't
               | have enough transparency into that circuitry or the human
               | brain's to positively assert that it doesn't assume.
               | 
               | But I'd wager it's there; assuming is not a particularly
               | impressive or computationally intense operation. There's
               | a tendency to bundle all of human consciousness into the
               | definitions of our cognitive components, but I would
                | argue that, e.g., a branch predictor is meeting the bar
               | for any sane definition of 'assume'.
        
           | og_kalu wrote:
           | Text Completion is just the objective function. It's not
           | descriptive and says nothing about how the models complete
           | text. Why people hang on this word, I'll never understand.
           | When you wrote your comment, you were completing text.
           | 
           | The problem you've just described is a problem with humans as
           | well. LLMs are assuming all the time. Maybe you would like to
           | call it another word, but it is happening.
        
             | codr7 wrote:
             | With a plan, aiming for something, that's the difference.
        
               | og_kalu wrote:
               | Again, you are only describing the _how_ here, not the
               | _what_ (text completion).
               | 
               | Also, LLMs absolutely 'plan' and 'aim for something' in
               | the process of completing text.
               | 
               | https://www.anthropic.com/research/tracing-thoughts-
               | language...
        
               | namaria wrote:
               | Yeah this paper is great fodder for the LLM pixel dust
               | argument.
               | 
               | They use a replacement model. It isn't even observing the
               | LLM itself but a different architecture model. And it is
               | very liberal with interpreting the patterns of
               | activations seen in the replacement model with flowery
                | language. It also includes some very relevant caveats,
               | such as:
               | 
               | "Our cross-layer transcoder is trained to mimic the
               | activations of the underlying model at each layer.
               | However, even when it accurately reconstructs the model's
               | activations, there is no guarantee that it does so via
               | the same mechanisms."
               | 
               | https://transformer-circuits.pub/2025/attribution-
               | graphs/met...
               | 
               | So basically the whole exercise might or might not be
               | valid. But it generates some pretty interactive graphics
               | and a nice blog post to reinforce the
                | anthropomorphization discourse.
        
               | og_kalu wrote:
               | 'So basically the whole exercise might or might not be
               | valid.'
               | 
               | Nonsense. Mechanistic faithfulness probes whether the
               | replacement model ("cross-layer transcoder") truly uses
               | the same internal functions as the original LLM. If it
                | doesn't, the attribution graphs it suggests might mislead
                | at a fine-grained level, but because every hypothesis
               | generated by those graphs is tested via direct
               | interventions on the real model, high-level causal
               | discoveries (e.g. that Claude plans its rhymes ahead of
               | time) remain valid.
        
               | losvedir wrote:
               | So do LLMs. "In the United States, someone whose job is
               | to go to space is called ____" it will say "an" not
               | because that's the most likely next word, but because
               | it's "aiming" (to use your terminology) for "astronaut"
               | in the future.
        
               | codr7 wrote:
               | I don't know about you, but I tend to make more elaborate
               | plans than the next word. I have a purpose, an idea I'm
               | trying to communicate. These things don't have ideas,
               | they're not creative.
        
             | Jensson wrote:
             | > When you wrote your comment, you were completing text.
             | 
              | I wasn't trained to complete text, though; I was primarily
              | trained to give accurate responses.
             | 
             | And no, writing a response is not "completing text", I
             | don't try to figure out what another person would write as
             | a response, I write what I feel people need to read. That
             | is a completely different thought process. If I tried to
             | mimic what another commenter would have written it would
             | look very different.
        
               | AstralStorm wrote:
               | Sometimes we also write what we really want people to not
               | read. That's usually called trolling though.
        
               | og_kalu wrote:
               | >And no, writing a response is not "completing text", I
               | don't try to figure out what another person would write
               | as a response, I write what I feel people need to read.
               | 
               | Functionally, it is. You're determining what text should
               | follow the prior text. Your internal reasoning ('what I
               | feel people need to read') is _how_ you decide on the
               | completion.
               | 
               | The core point isn't that your internal 'how' is the same
               | as an LLM's (Maybe, Maybe not), but that labeling the LLM
                | as a 'text completer' the way you have is essentially
               | meaningless.
               | 
                | You are just imposing your own ideas onto _how_ an LLM
                | works, not speaking any fundamental truth about being a
                | 'text completer'.
        
         | moffkalast wrote:
         | Yeah you need specific instruct training for that sort of
         | thing, Claude Opus being one of the rare examples that does
          | such a sanity check quite often and even admits when it
         | doesn't know something.
         | 
         | These days it's all about confidently bullshitting on
         | benchmarks and overfitting on common riddles to make pointless
         | numbers go up. The more impressive models get on paper, the
         | more rubbish they are in practice.
        
           | pants2 wrote:
           | Gemini 2.5 is actually pretty good at this. It's the only
           | model ever to tell me "no" to a request in Cursor.
           | 
           | I asked it to add websocket support for my app and it
           | responded like, "looks like you're using long polling now.
            | That's actually better and simpler. Let's leave it how it is."
           | 
           | I was genuinely amazed.
        
       | simonw wrote:
       | Coining "Jagged AGI" to work around the fact that nobody agrees
       | on a definition for AGI is a clever piece of writing:
       | 
       | > In some tasks, AI is unreliable. In others, it is superhuman.
       | You could, of course, say the same thing about calculators, but
       | it is also clear that AI is different. It is already
       | demonstrating general capabilities and performing a wide range of
       | intellectual tasks, including those that it is not specifically
       | trained on. Does that mean that o3 and Gemini 2.5 are AGI? Given
       | the definitional problems, I really don't know, but I do think
       | they can be credibly seen as a form of "Jagged AGI" - superhuman
       | in enough areas to result in real changes to how we work and
       | live, but also unreliable enough that human expertise is often
       | needed to figure out where AI works and where it doesn't.
        
         | shrx wrote:
         | >> It is already demonstrating general capabilities and
         | performing a wide range of intellectual tasks, including those
         | that it is not specifically trained on.
         | 
          | Huh? Isn't an LLM's capability fully constrained by the training
         | data? Everything else is hallucinated.
        
           | bbor wrote:
           | The critical discovery was a way to crack the "Frame
           | Problem", which roughly comes down to colloquial notions of
           | common sense or intuition. For the first time ever, we have
           | models that know if you jump off a stool, you will (likely!)
           | be standing on the ground afterwards.
           | 
           | In that sense, they absolutely know things that aren't in
           | their training data. You're correct about factual knowledge,
           | tho -- that's why they're not trained to optimize it! A
           | database(/pagerank?) solves that problem already.
        
           | simonw wrote:
           | You can argue that _everything_ output by an LLM is
            | hallucinated, since there's no difference under the hood
           | between outputting useful information and outputting
           | hallucinations.
           | 
           | The quality of the LLM then becomes how often it produces
           | useful information. That score has gone up a _lot_ in the
           | past 18 months.
           | 
           | (Sometimes hallucinations are what you want: "Tell me a fun
           | story about a dog learning calculus" is a valid prompt which
            | mostly isn't meant to produce real facts about the world.)
        
             | codr7 wrote:
             | Isn't it the case that the latest models actually
             | hallucinate more than the ones that came before? Despite
             | best efforts to prevent it.
        
               | simonw wrote:
               | The o3 model card reports a so far unexplained uptick in
               | hallucination rate from o1 - on page 4 of https://cdn.ope
               | nai.com/pdf/2221c875-02dc-4789-800b-e7758f372...
               | 
               | That is according to one specific internal OpenAI
               | benchmark, I don't know if it's been replicated
               | externally yet.
        
         | verdverm wrote:
         | Why not call it AJI instead of AGI then?
         | 
         | Certainly jagged does not imply general
         | 
         | It seems to me the bar for "AGI" has been lowered to measuring
         | what tasks it can do rather than the traits we normally
         | associate with general intelligence. People want it to be here
         | so bad they nerf the requirements...
        
           | bbor wrote:
           | Well I think the point being made is an instrumental one:
           | it's general enough to matter, so we should use the word
           | "general" to communicate that to laypeople.
           | 
           | Re:"traits we associate with general intelligence", I think
            | the exact issue is that there is no scientific (i.e. specific
            | and consistent) list of such traits. This is why Turing
           | wrote his famous 1950 paper and invoked the Imitation Game;
           | not to detail how one could test for a computer that's really
           | thinking(/truly general), but to show why that question isn't
           | necessary in the first place.
        
             | verdverm wrote:
              | I still disagree: being good at a number of tasks does not
              | make it intelligent.
             | 
             | Certainly creativity is missing, it has no internal
             | motivation, and it will answer the same simple question
             | both right and wrong, depending on unknown factors. What if
             | we reverse the framing from "it can do these tasks,
             | therefore it must be..." to "it lacks these traits,
             | therefore it is not yet..."
             | 
             | While I do not disagree that the LLMs have become advanced
             | enough to do a bunch of automation, I do not agree they are
             | intelligent or actually thinking.
             | 
              | I'm with Yann LeCun when he says that we won't reach AGI
             | until we move beyond transformers.
        
           | iknowstuff wrote:
           | AJI lol love it.
        
           | nearbuy wrote:
           | Human intelligence is jagged. You're raising the AGI bar to a
           | point where most people wouldn't qualify as having general
           | intelligence.
           | 
           | My partner and I work in different fields. AI has advanced to
           | the point where there are very few questions I could ask my
           | partner that o3 couldn't answer as well or better.
           | 
           | I can't ask expert level questions in her field, because I'm
           | not an expert in her field, and she couldn't ask expert level
           | questions in my field for the same reason. So when we're
           | communicating with each other, we're mostly at sub-o3 level.
           | 
           | > People want it to be here so bad they nerf the
           | requirements...
           | 
           | People want to claim it's overhyped (and protect their own
           | egos) so badly they raise the requirements...
           | 
           | But really, largely people just have different ideas of what
           | AGI is supposed to mean. It used to vaguely mean "human-level
           | intelligence", which was fine for talking about some
           | theoretical future event. Now we're at a point where that
           | definition is too vague to say whether AI meets it.
        
             | tasuki wrote:
             | > You're raising the AGI bar to a point where most people
             | wouldn't qualify as having general intelligence.
             | 
             | We kind of don't? Look how difficult it is for us to just
             | understand some basic math. Us humans mostly have
             | intelligence related to the ancestral environment we
             | developed in, nothing general about that.
             | 
             | I agree with you the term "AGI" is rather void of meaning
             | these days...
        
             | verdverm wrote:
             | You're using limited and anecdotal task based metrics as
             | some sort of evidence. Both of you are able to drive a car,
             | yet we need completely different AIs for such tasks.
             | 
              | I still find task-based measures insufficient; there are
              | very basic machines that can perform tasks humans cannot.
              | Should this be a measure of our or their intelligence?
             | 
              | I have another comment in this thread about trait-based
             | metrics being a possibly better method.
             | 
             | > People want to claim it's overhyped (and protect their
             | own egos) so badly they raise the requirements...
             | 
             | Shallow response. Seek to elevate the conversation. There
             | are also people who see it for what it is, a useful tool
             | but not intelligent...
        
               | nearbuy wrote:
               | > You're using limited and anecdotal task based metrics
               | as some sort of evidence.
               | 
               | And you presented no evidence at all. Not every comment I
               | make is going to contain a full lit review.
               | 
               | > Both of you are able to drive a car, yet we need
               | completely different AIs for such tasks.
               | 
               | This is like a bird complaining humans aren't intelligent
               | because they can't fly. How is Gemini or o3 supposed to
               | drive without real-time vision and a vehicle to control?
               | How are you supposed to fly without wings?
               | 
               | It lacks the sensors and actuators to drive, but this is
               | moving away from a discussion on intelligence. If you
               | want to argue that any system lacking real-time vision
               | isn't intelligent, you're just using a very unusual
               | definition of intelligence that excludes blind people.
               | 
               | > Shallow response. Seek to elevate the conversation.
               | 
               | This was an ironic response pointing out the shallowness
               | of your own unsubstantiated accusation that people just
               | disagree with you because they're biased or deluded
               | themselves. The next paragraph starting with "But really"
               | was supposed to convey it wasn't serious, just a jab
               | showing the silliness of your own jab.
        
         | qsort wrote:
         | I don't think that's a particularly honest line of thinking
         | though. It preempts the obvious counterargument, but very
         | weakly so. Calculators are different, but why? Can an ensemble
         | of a calculator, a Prolog interpreter, Alexnet and Stockfish be
         | considered "jagged superintelligence"? They are all clearly
         | superhuman, and yet require human experience to be wielded
         | effectively.
         | 
         | I'm guilty as charged of having looked at GPT 3.5 and having
         | thought "it's meh", but more than anything this is showing that
         | debating words rather than the underlying capabilities is an
         | empty discussion.
        
           | og_kalu wrote:
           | >Calculators are different, but why? Can an ensemble of a
           | calculator, a Prolog interpreter, Alexnet and Stockfish be
           | considered "jagged superintelligence"?
           | 
           | Those are all different things with little to nothing to do
           | with each other. It's like saying what if I ensemble a snake
            | and a cat? What does that even mean? GPT-N or whatever is a
           | single model that can do many things, no ensembling required.
           | That's the difference between it and a calculator or
           | stockfish.
        
             | AstralStorm wrote:
             | That is not true, the model is modular, thus an ensemble.
              | It uses DALL-E for graphics and specialized tokenizer
              | models for sound.
             | 
             | If you remove those tools, or cut its access to search
              | databases, it becomes quite a bit less capable.
             | 
              | A human would often still manage to do it without some of
              | that data, perhaps with less certainty, while GPT has more
             | problems than that without others filling in the holes.
        
               | og_kalu wrote:
                | >It uses DALL-E for graphics and specialized tokenizer
                | models for sound.
               | 
                | ChatGPT no longer uses DALL-E for image generation. I
                | don't understand your point about the tokenization; it
                | doesn't make the model an ensemble.
               | 
               | It's also just beside the point. Even if you restrict the
               | modalities to text alone, these models are still general
                | on their own in ways a calculator is not.
        
       | k2xl wrote:
       | There is a similar issue with image and video generation. Asking
       | the AI to "Generate an image of a man holding a pencil with his
       | left hand" or "Generate a clock showing the time 5 minutes past 6
       | o'clock" often fail due to so many images in the training set
       | being similar (almost all clock images on show 10:10
       | (https://generativeai.pub/in-the-ai-art-world-the-time-is-alm...)
        
       | fsmv wrote:
       | It's not AGI because it still doesn't understand anything. It can
       | only tell you things that can be found on the internet. These
       | "jagged" results expose the truth that these models have near 0
       | intelligence.
       | 
       | It is not a simple matter of patching the rough edges. We are
       | fundamentally not using an architecture that is capable of
       | intelligence.
       | 
       | Personally the first time I tried deep research on a real topic
       | it was disastrously incorrect on a key point.
        
         | falcor84 wrote:
         | >near 0 intelligence
         | 
         | What does that even mean? Do you actually have any particular
         | numeric test of intelligence that's somehow better than all the
         | others?
        
         | simonw wrote:
         | Is one of your personal requirements for AGI "never makes a
         | mistake?"
        
           | Arainach wrote:
           | I think determinism is an important element. You can ask the
           | same LLM the same question repeatedly and get different
           | answers - and not just different ways of stating the same
           | answer, very different answers.
           | 
           | If you ask an intelligent being the same question they may
           | occasionally change the precise words they use but their
           | answer will be the same over and over.
        
             | hdjjhhvvhga wrote:
             | If determinism is a hard requirement, then LLM-based AI
             | can't fulfill it by definition.
        
             | samuel wrote:
              | That's not an inherent property of the system. You can
              | choose the most likely token (top-k = 1) and it will be
              | deterministic (at least in theory; in some hardware setups
              | it might be trickier).
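              | 
              | A minimal sketch, assuming the transformers library with
              | gpt2 as an illustrative stand-in; greedy decoding (always
              | take the single most likely next token) returns the same
              | completion on every run:
              | 
              |   from transformers import (AutoModelForCausalLM,
              |                             AutoTokenizer)
              | 
              |   tok = AutoTokenizer.from_pretrained("gpt2")
              |   model = AutoModelForCausalLM.from_pretrained("gpt2")
              |   ids = tok("The capital of France is",
              |             return_tensors="pt")
              | 
              |   outs = set()
              |   for _ in range(3):
              |       # do_sample=False means top-k of 1: no randomness.
              |       o = model.generate(**ids, max_new_tokens=10,
              |                          do_sample=False)
              |       outs.add(tok.decode(o[0],
              |                           skip_special_tokens=True))
              | 
              |   print(len(outs))  # expect 1: identical every run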
        
             | beering wrote:
             | A human will give different answers to the same question,
             | so I'm not sure why it's fair to set a higher bar for an
             | LLM. Or rather, I'm not sure how you would design this test
             | in a way where humans would pass and the best LLM would
             | fail.
        
             | simonw wrote:
             | That's because "intelligent beings" have memory. If you ask
             | an LLM the same question within the same chat session
             | you'll get a consistent answer about it.
        
               | Arainach wrote:
               | I disagree. If you were to take a snapshot of someone's
               | knowledge and memory such that you could restore to it
               | over and over, that person would give the same answer to
               | the question. The same is not true for an LLM.
               | 
               | Heck, I can't even get LLMs to be consistent about *their
               | own capabilities*.
               | 
               | Bias disclaimer: I work at Google, but not on Gemini. If
               | I ask Gemini to produce an SVG file, it will sometimes do
               | so and sometimes say "sorry, I can't, I can only produce
               | raster images". I cannot deterministically produce either
               | behavior - it truly seems to vary randomly.
        
               | IanCal wrote:
                | You could run an LLM deterministically too.
               | 
               | We're often explicitly adding in randomness to the
               | results so it feels weird to then accuse them of not
               | being intelligent after we deliberately force them off
               | the path.
        
               | danielbln wrote:
               | You'd need to restore more than memory/knowledge. You'd
               | need to restore the full human, and in the exact same
               | condition (inside and out).
               | 
               | Ask me some question before bed and again after waking
               | up, I'll probably answer it at night but in the morning
               | tell you to sod off until I had coffee.
        
         | ben_w wrote:
         | The concept of "understand" is itself ill-defined -- or, to put
         | it another way, _not understood_.
        
         | danielbln wrote:
         | There are some very strong and very unfounded assumptions in
         | your comment. Is there anything more substantial there other
         | than "that's what it feels like to me"?
        
       | logicchains wrote:
       | I'd argue that it's not productive to use any definition of AGI
       | coined after 2020, to avoid the fallacy of shifting the
       | goalposts.
        
         | Borealid wrote:
         | I think there's a single definition of AGI that will stand
         | until the singularity:
         | 
         | "An AGI is a human-created system that demonstrates iteratively
         | improving its own conceptual design without further human
         | assistance".
         | 
         | Note that a "conceptual design" here does not include tweaking
         | weights within an already-externally-established formula.
         | 
         | My reasoning is thus:
         | 
         | 1. A system that is only capable of acting with human
         | assistance cannot have its own intelligence disentangled from
         | the humans'
         | 
         | 2. A system that is only intelligent enough to solve problems
         | that somehow exclude problems with itself is not "generally"
         | intelligent
         | 
         | 3. A system that can only generate a single round of
         | improvements to its own designs has not demonstrated
         | improvements to those designs, as if iteration N+1 were truly
         | superior to iteration N, it would be able to produce iteration
         | N+2
         | 
         | 4. A system that is not capable of changing its own design is
         | incapable of iterative improvement, as there is a maximum
         | efficacy within any single framework
         | 
         | 5. A system that could improve itself in theory and fails to do
         | so in practice has not demonstrated intelligence
         | 
         | It's pretty clear that no current-day system has hit this
         | milestone; if some program had, there would no longer be a need
          | for continued investment in algorithm design (or computer
         | science, or most of humanity...).
         | 
         | A program that randomly mutates its own code could self-improve
         | in theory but fails to do so in practice.
         | 
         | I don't think these goalposts have moved in the past or need to
         | move in the future. This is what it takes to cause the
         | singularity. The movement recently has been people trying to
         | sell something less than this as an AGI.
        
           | logicchains wrote:
           | AGI means "artificial general intelligence", it's got nothing
           | to do with the singularity (which requires "artificial
           | superior intelligence"; ASI). Requiring AGI to have
            | capabilities that most humans lack is moving the goalposts
            | relative to how it was originally defined.
        
             | jpc0 wrote:
              | I don't think these are capabilities humans do not have;
              | to me, this is the one capability humans distinctly have
              | over LLMs: the ability to introspect and shape their own
              | future.
              | 
              | I feel this definition doesn't require a current LLM to be
              | able to change its own workings, only to be able to
              | generate a guided next generation.
              | 
              | It's possible that LLMs can surpass human beings, purely
              | because I believe we will inevitably be limited by short-
              | term storage constraints which LLMs will not be. It will be
              | a bandwidth vs throughput question. An LLM will have a much
              | larger although slightly slower store of knowledge than
              | humans have, but it will be much quicker than a human at
              | looking up and validating the data.
             | 
             | We aren't there yet.
        
           | gom_jabbar wrote:
           | > The movement recently has been people trying to sell
           | something less than this as an AGI.
           | 
           | Selling something that does not yet exist is an essential
           | part of capitalism, which - according to the main thesis of
           | philosophical Accelerationism - is (teleologically) identical
            | to AI. [0] It's sometimes referred to as _Hyperstition_,
           | i.e. fictions that make themselves real.
           | 
           | [0] https://retrochronic.com
        
         | TheAceOfHearts wrote:
         | I really dislike this framing. Historically we've been very
         | confused about what AGI means because we don't actually
         | understand it. We're still confused so most working definitions
         | have been iterated upon as models acquire new capabilities.
          | It's akin to searching for something in the fog of war: you
          | set a course or destination because you think that's the
          | approximate direction where the thing will be found, but then
          | you get there and realize you were wrong, so you continue
          | exploring.
         | 
         | Most people have a rough idea of what AGI means, but we still
         | haven't figured out an exact definition that lines up with
         | reality. As we continue exploring the idea space, we'll keep
         | figuring out which parameters place boundaries and requirements
         | on what AGI means.
         | 
         | There's no reason to just accept an ancient definition from
         | someone who was confused and didn't know any better at the time
         | when they invented their definition. Older definitions were
         | just shots in the dark that pointed in a general direction, but
         | there's no guarantee that they would hit upon the exact
         | destination.
        
       | skybrian wrote:
       | What's clear is that AI is unreliable in general and must be
       | tested on specific tasks. That might be human review of a single
       | output or some kind of task-specific evaluation.
       | 
       | It's bad luck for those of us who want to talk about how good or
       | bad they are in general. Summary statistics aren't going to tell
       | us much more than a reasonable guess as to whether a new model is
       | worth trying on a task we actually care about.
        
         | simonw wrote:
         | Right: we effectively all need our own evals for the tasks that
         | matter to us... but writing those evals continues to be one of
         | the least well documented areas of how to effectively use LLMs.
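          | 
          | A minimal sketch of the shape such an eval can take; the
          | ask_model stub and the example cases here are placeholders,
          | not a recommendation of any particular framework:
          | 
          |   def ask_model(prompt: str) -> str:
          |       # Placeholder: swap in a real call to whatever
          |       # model or API you are evaluating.
          |       return "391"
          | 
          |   CASES = [
          |       ("What is 17 * 23?",
          |        lambda out: "391" in out),
          |       ("Reverse the string 'jagged'.",
          |        lambda out: "deggaj" in out),
          |   ]
          | 
          |   passed = sum(check(ask_model(q)) for q, check in CASES)
          |   print(f"pass rate: {passed / len(CASES):.0%}")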
        
       | mellosouls wrote:
        | The capabilities of AI post-GPT-3 have become extraordinary and
       | clearly in many cases superhuman.
       | 
       | However (as the article admits) there is still no general
        | agreement on what AGI is, or how we (or even if we can) get there
       | from here.
       | 
       | What there is is a growing and often naive excitement that
       | anticipates it as coming into view, and unfortunately that will
       | be accompanied by the hype-merchants desperate to be first to
       | "call it".
       | 
       | This article seems reasonable in some ways but unfortunately
       | falls into the latter category with its title and sloganeering.
       | 
       | "AGI" in the title of any article should be seen as a cautionary
       | flag. On HN - if anywhere - we need to be on the alert for this.
        
         | ashoeafoot wrote:
          | AGI is an anonymous good model coming around the corner with no
          | company and no LLM researchers attached. AGI is when the LLM
          | hype-train threads are replaced with CEOs and laid-off
          | researchers demanding UBI.
        
           | MichaelZuo wrote:
            | Yeah, formal agreement seems exceedingly unlikely, since there
            | isn't even agreement on the definition of "Artificial
            | Intelligence".
        
           | ben_w wrote:
           | It's easy to treat AGI as one thing -- I did so myself before
           | everyone's differing reaction to LLMs made me realise we all
           | mean different things by each of the three letters of the
           | initialism, and that none of those initials are really
           | boolean valued.
           | 
           | Given how Dutch disease[0] is described, I suspect that if
           | the "G" (general) increases with fixed "I" (intelligence), as
           | the proportion of economic activity for which the Pareto
            | frontier is AI rather than human expands, humans will get
            | pay rises for the remaining work right up until they become
            | unemployable.
           | 
           | On the other hand, if "G" is fully general and it's "I" which
           | rises for a suitable cost[1], it goes through IQ 55
           | (displacing no workers) to IQ 100 (probably close to half of
           | workers redundant, but mean of population doesn't have to
           | equal mean of workforce), to IQ 145 (almost everyone
           | redundant), to IQ 200 (definitionally renders everyone
           | redundant).
           | 
           | [0] https://en.wikipedia.org/wiki/Dutch_disease
           | 
           | [1] A fully-general AGI with the equivalent of IQ 200 on any
           | possible test, still can't replace a single human if it costs
           | 200 trillion USD per year to run.
        
         | jjeaff wrote:
         | I suspect AGI will be one of those things that you can't
          | describe exactly, but you'll know it when you see it.
        
           | ninetyninenine wrote:
           | I suspect everyone will call it a stochastic parrot because
            | it got this one thing wrong. And this will continue into
            | the far, far future; even when it becomes sentient, we will
            | completely miss it.
        
             | Jensson wrote:
              | Once it has pushed most humans out of white-collar labor,
              | so the remaining humans work in blue-collar jobs, people
              | won't say it's just a stochastic parrot.
        
               | myk9001 wrote:
                | Maybe, maybe not. The power loom pushed a lot of humans
                | out of textile factory jobs, yet no one claims the power
                | loom is AGI.
        
               | Jensson wrote:
               | Not a lot, I mean basically everyone, to the point where
                | most companies don't need to pay humans to think
               | anymore.
        
               | myk9001 wrote:
               | Well, I'm too lazy to look up how many weavers were
               | displaced back then and that's why I said a lot. Maybe
               | all of them, since they weren't trained to operate the
               | new machines.
               | 
                | Anyway, sorry for the digression; my point is that an LLM
                | replacing white-collar workers doesn't necessarily imply
                | it's generally intelligent -- it may be, but doesn't have
                | to be.
               | 
               | Although if it gets to a point where companies are
               | running dark office buildings (by analogy with dark
               | factories) -- yes, it's AGI by then.
        
               | jimbokun wrote:
               | Or become shocked to realize humans are basically
               | statistical parrots too.
        
             | AstralStorm wrote:
             | It's more than that but less than intelligence.
             | 
             | Its generalization capabilities are a bit on the low side,
             | and memory is relatively bad. But it is much more than just
              | a parrot now; it can handle _some_ basic logic, but it
              | cannot follow given patterns correctly for novel problems.
             | 
             | I'd liken it to something like a bird, extremely good at
             | specialized tasks but failing a lot of common ones unless
             | repeatedly shown the solution. It's not a corvid or a
             | parrot yet. Fails rather badly at detour tests.
             | 
             | It might be sentient already though. Someone needs to run a
             | test if it can discern itself and another instance of
             | itself in its own work.
        
               | Jensson wrote:
               | > It might be sentient already though. Someone needs to
               | run a test if it can discern itself and another instance
               | of itself in its own work.
               | 
               | It doesn't have any memory, how could it tell itself from
               | a clone of itself?
        
               | AstralStorm wrote:
               | Similarity match. For that you need to understand
               | reflexively how you think and write.
               | 
               | It's a fun test to give a person something they have
               | written but do not remember. Most people can still spot
               | it.
               | 
                | It's easier with images though. Especially a mirror. For
                | DALL-E, the test would be whether it can discern its own
                | work from a human-generated image, especially if you give
                | it an imaginative task like drawing a representation of
                | itself.
        
           | NitpickLawyer wrote:
           | > but you'll know it when you see it.
           | 
           | I agree, but with the caveat that it's getting harder and
           | harder with all the hype / doom cycles and all the goalpost
           | moving that's happening in this space.
           | 
           | IMO if you took gemini2.5 / claude / o3 and showed it to
           | people from ten / twenty years ago, they'd say that it is
           | unmistakably AGI.
        
             | Jensson wrote:
             | > IMO if you took gemini2.5 / claude / o3 and showed it to
             | people from ten / twenty years ago, they'd say that it is
             | unmistakably AGI.
             | 
             | No they wouldn't, since those still can't replace human
             | white collar workers even at many very basic tasks.
             | 
             | Once AGI is here most white collar jobs are gone, you'd
             | only need to hire geniuses at most.
        
               | zaptrem wrote:
               | Which part of "General Intelligence" requires replacing
               | white collar workers? A middle schooler has general
               | intelligence (they know about and can do a lot of things
               | across a lot of different areas) but they likely can't
               | replace white collar workers either. IMO GPT-3 was AGI,
               | just a pretty crappy one.
        
               | Jensson wrote:
               | > A middle schooler has general intelligence (they know
               | about and can do a lot of things across a lot of
               | different areas) but they likely can't replace white
               | collar workers either.
               | 
                | Middle schoolers replace white collar workers all the
                | time; it takes 10 years for them to do it, but they can
                | do it.
               | 
               | No current model can do the same since they aren't able
               | to learn over time like a middle schooler.
        
               | sebastiennight wrote:
                | Compared to someone who graduated middle school on
                | November 30th, 2022 (2.5 years ago), would you say that
                | today's gemini 2.5 pro has NOT gained intelligence
                | faster?
               | 
               | I mean, if you're a CEO or middle manager and you have
               | the choice of hiring this middle schooler for general
               | office work, or today's gemini-2.5-pro, are you 100%
               | saying the ex-middle-schooler is definitely going to give
               | you best bang for your buck?
               | 
               | Assuming you can either pay them $100k a year, or spend
               | the $100k on gemini inference.
        
               | Jensson wrote:
               | > would you say that today's gemini 2.5 pro has NOT
               | gained intelligence faster?
               | 
               | Gemini 2.5 pro the model has not gained any intelligence
               | since it is a static model.
               | 
                | New models are not the models learning; it is humans
                | creating new models. The models, once trained, have
                | access to all the same material and knowledge a middle
                | schooler has as they go on to learn how to do a job, yet
                | they fail to learn the job while the kid succeeds.
        
               | ben_w wrote:
               | > Gemini 2.5 pro the model has not gained any
               | intelligence since it is a static model.
               | 
               | Surely that's an irrelevant distinction, from the point
               | of view of a hiring manager?
               | 
               | If a kid takes ten years from middle school to being
               | worth hiring, then the question is "what new AI do you
               | expect will exist in 10 years?"
               | 
               | How the model comes to be, doesn't matter. Is it a fine
               | tune on more training data from your company docs and/or
               | an extra decade of the internet? A different
               | architecture? A different lab in a different country?
               | 
               | Doesn't matter.
               | 
               | Doesn't matter for the same reason you didn't hire the
               | kid immediately out of middle school, and hired someone
               | else who had already had another decade to learn more in
               | the meantime.
               | 
               | Doesn't matter for the same reason that different flesh
               | humans aren't perfectly substitutable.
               | 
               | You pay to solve a problem, not to specifically have a
               | human solve it. Today, not in ten years when today's
               | middle schooler graduates from university.
               | 
               | And that's even though I agree that AI today doesn't
               | learn effectively from as few examples as humans need.
        
             | bayarearefugee wrote:
             | There's no way to be sure in either case, but I suspect
             | their impressions of the technology ten or twenty years ago
             | would be not so different from my experience of first using
             | LLMs a few years ago...
             | 
             | Which is to say complete amazement followed quickly by
             | seeing all the many ways in which it absolutely falls flat
             | on its face revealing the lack of actual thinking, which is
             | a situation that hasn't fundamentally changed since then.
        
             | mac-mc wrote:
              | When it can replace a polite, diligent, experienced 120 IQ
              | human in all tasks. So it has a consistent long-term
              | narrative memory, doesn't "lose the plot" as you interact
              | longer and longer with it, can pilot robots to do physical
              | labor without much instruction (the current state of the
              | art is not that; a trained human will still do much
              | better), can drive cars, generate images without goofy
              | non-human style errors, etc.
        
               | NitpickLawyer wrote:
               | > experienced 120 IQ human in all tasks.
               | 
                | Well, that's the 91st percentile already. I know the terms
               | are hazy, but that seems closer to ASI than AGI from that
               | perspective, no?
               | 
               | I think I do agree with you on the other points.
        
             | sebastiennight wrote:
             | I don't think so, and here's my simple proof:
             | 
             | You and I could sit behind a keyboard, role-playing as the
             | AI in a reverse Turing test, typing away furiously at the
             | top of our game, and if you told someone that their job is
             | to assess our performance (thinking they're interacting
             | with a computer), they would _still_ conclude that we are
             | _definitely_ not AGI.
             | 
             | This is a battle that can't be won at any point because
             | it's a matter of faith for the forever-skeptic, not facts.
        
               | Jensson wrote:
               | > I don't think so, and here's my simple proof:
               | 
                | That isn't a proof since you haven't run that test; it is
                | just a thought experiment.
        
               | ben_w wrote:
               | I've been accused a few times of being an AI, even here.
               | 
                | (Have you not experienced being on the receiving end of
               | such accusations? Or do I just write weird?)
               | 
               | I think this demonstrates the same point.
        
           | afro88 wrote:
           | This is part of what the article is about
        
           | torginus wrote:
           | I still can't have an earnest conversation or bounce ideas
           | off of any LLM - all of them seem to be a cross between a
           | sentient encyclopedia and a constraint solver.
           | 
           | They might get more powerful but I feel like they're still
           | missing something.
        
             | itchyjunk wrote:
             | Why are you not able to have an earnest conversation with
              | an LLM? What kind of ideas are you not able to bounce off
              | LLMs? These seem to be the types of use cases where LLMs
             | have generally shined for me.
        
             | HDThoreaun wrote:
             | I felt this way until I tried gemini 2.5. Imo it fully
              | passes the Turing test unless you're specifically using
             | tricks that LLMs are known to fall for.
        
           | jimbokun wrote:
           | We have all seen it and are now just in severe denial.
        
           | dgs_sgd wrote:
           | This is actually how a supreme court justice defined the test
           | for obscenity.
           | 
           | > The phrase "I know it when I see it" was used in 1964 by
           | United States Supreme Court Justice Potter Stewart to
           | describe his threshold test for obscenity in Jacobellis v.
           | Ohio
        
             | sweetjuly wrote:
             | The reason why it's so famous though (and why some people
             | tend to use it in a tongue in cheek manner) is because "you
             | know it when you see it" is a hilariously unhelpful and
             | capricious threshold, especially when coming from the
             | Supreme Court. For rights which are so vital to the fabric
             | of the country, the Supreme Court recommending we hinge
             | free speech on--essentially--unquantifiable vibes is equal
             | parts bizarre and out of character.
        
           | DesiLurker wrote:
            | my 2c on this is that if you interact with any current llm
            | enough you can mentally 'place' its behavior and responses.
            | when we truly have AGI+/ASI my guess is that it will be like
            | that old adage of blind men feeling & describing an elephant
            | for the first time. we just won't be able to fully understand
            | its responses. there would always be something left hanging,
            | and eventually we'll just stop trying. that would be the time
            | when the exponential improvement really kicks in.
            | 
            | it should suffice to say we are nowhere near that, and I
            | don't even believe LLMs are the right architecture for that.
        
         | Zambyte wrote:
         | I think a reasonable definition of intelligence is the
         | application of reason on knowledge. An example of a system that
         | is highly knowledgeable but has little to no reason would be an
         | encyclopedia. An example of a system that is highly reasonable,
         | but has little knowledge would be a calculator. Intelligent
         | systems demonstrate both.
         | 
         | Systems that have general intelligence are ones that are
         | capable of applying reason to an unbounded domain of knowledge.
         | Examples of such systems include: libraries, wikis, and forums
         | like HN. These systems are not AGI, because the reasoning
         | agents in each of these systems are organic (humans); they are
         | more like a cyborg general intelligence.
         | 
          | Artificial general intelligences are just systems that are
          | fully artificial (i.e. computer programs) and can apply reason
          | to an
         | unbounded domain of knowledge. We're here, and we have been for
         | years. AGI sets no minimum as to how great the reasoning must
         | be, but it's obvious to anyone who has used modern generative
         | intelligence systems like LLMs that the technology can be used
         | to reason about an unbounded domain of knowledge.
         | 
         | If you don't want to take my word for it, maybe Peter Norvig
         | can be more convincing: https://www.noemamag.com/artificial-
         | general-intelligence-is-...
        
           | jimbokun wrote:
           | Excellent article and analysis. Surprised I missed it.
           | 
           | It is very hard to argue with Norvig's arguments that AGI has
           | been around since at least 2023.
        
           | conception wrote:
           | I think the thing missing would be memory. The knowledge of
           | current models is more or less static save for whatever you
            | can cram into their context window. I think if they had
            | memory, and thus the ability to learn ("oh hey, I've already
            | tried to solve this bug in these ways, maybe I won't get
            | stuck in a loop on them!"), that would be the AGI push for
            | me. Real-time incorporation of new knowledge into the model
            | is the missing piece.
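            | 
            | To make that concrete, a toy sketch of what such a memory
            | could look like (everything here is invented for
            | illustration, and retrieval is naive keyword overlap just
            | to keep the sketch dependency-free):
            | 
            |     attempts = []  # [(what was tried, outcome)]
            | 
            |     def remember(tried, outcome):
            |         attempts.append((tried, outcome))
            | 
            |     def recall(task, k=3):
            |         # rank past attempts by word overlap with the task
            |         words = set(task.lower().split())
            |         def score(a):
            |             return len(words & set(a[0].lower().split()))
            |         return sorted(attempts, key=score, reverse=True)[:k]
            | 
            |     remember("null check in config parse for missing key",
            |              "still crashes")
            |     remember("retry loop around fetch_data", "fixed timeout")
            | 
            |     task = "bug: crash when config key is missing"
            |     notes = "\n".join(
            |         f"- {t} -> {o}" for t, o in recall(task)
            |     )
            |     print(f"{task}\nAlready tried:\n{notes}")
            | 
            | The real thing would presumably use embeddings and feed back
            | into training, but the loop is the same: store outcomes and
            | surface them before the next attempt.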
        
         | yeahwhatever10 wrote:
         | This is the forum that fell the hardest for the superconductor
         | hoax a few years ago. HN has no superiority leg to stand on.
        
         | nightmunnas wrote:
         | Low agreeableness will actually be extremely useful in many use
         | cases, such as scientific discovery and of course programming
          | assistance. It's amazing that this avenue hasn't been explored
         | more deeply.
        
           | Jensson wrote:
            | It's much easier to sell an agreeable assistant than a
            | disagreeable one, so it isn't that strange that the
            | alternative isn't explored.
        
         | j_timberlake wrote:
         | The exact definition of AGI is pretty much the least
         | interesting thing about AGI. It's basically bike-shedding at
         | this point: arguing about something easy to understand instead
         | of tackling the really hard questions like "how competent can
         | AI get before it's too dangerous to be in the hands of flakey
         | tech companies?"
        
         | mrshadowgoose wrote:
         | I've always felt that trying to pin down the precise definition
         | of AGI is as useless as trying to pin down "what it means to
         | truly understand". It's a mental trap for smart people, that
         | distracts them from focusing on the impacts of hard-to-define
         | concepts like AGI.
         | 
         | AGI doesn't need to be "called", and there is no need for
         | anyone to come to an agreement as to what its precise
         | definition is. But at some point, we will cross that hard-to-
         | define threshold, and the economic effects will be felt almost
         | immediately.
         | 
         | We should probably be focusing on how to prepare society for
         | those changes, and not on academic bullshit.
        
           | throwup238 wrote:
           | It's definitely a trap for those who aren't familiar with the
           | existing academic work in philosophy, cognition, and
           | neuroscience. There are no definitive answers but there are
           | lots of relatively well developed ideas and concepts that
           | everyone here on HN seems completely ignorant of, even though
           | some of the ideas were developed by industry giants like
           | Marvin Minsky.
           | 
            | Stuff like the society of mind (Minsky), embodied cognition
           | (Varela, Rosch, and Thompson), connectionist or subsymbolic
           | views (Rumelhart), multiple intelligences (Gardner),
           | psychometric and factor-analytic theories (Carroll), and all
           | the other work like E. Hutchins. They're far from just
           | academic wankery, there's a lot of useful stuff in there,
           | it's just completely ignored by the AI crowd.
        
         | dheera wrote:
         | I spent some amount of time trying to create a stock/option
         | trading bot to exploit various market inefficiencies that
         | persist, and did a bunch of code and idea bouncing off these
          | LLMs. What I found is that all the various incarnations of
         | GPT 4+ and GPT o+ routinely kept falling for the "get rich
         | quick" option strategies all over the internet that don't work.
         | 
         | In cases where 95%+ of the information on the internet is
         | misinformation, the current incarnations of LLMs have a really
         | hard time sorting out and filtering out the 5% of information
         | that's actually valid and useful.
         | 
         | In that sense, current LLMs are not yet superhuman at all,
         | though I do think we can eventually get there.
        
           | jimbokun wrote:
           | So they are only as smart as most humans.
        
         | daxfohl wrote:
          | Until you can boot one up, give it access to a VM's video and
          | audio feeds and keyboard and mouse interfaces, give it an email
          | and chat account, tell it where the company onboarding docs
          | are, and expect it to be a productive team member, it's not
         | AGI. So long as we need special protocols like MCP and A2A,
         | rather than expecting them to figure out how to collaborate
         | like a human, they're not AGI.
         | 
         | The first step, my guess, is going to be the ability to work
         | through github issues like a human, identifying which issues
         | have high value, asking clarifying questions, proposing
         | reasonable alternatives, knowing when to open a PR, responding
         | to code review, merging or abandoning when appropriate. But
         | we're not even very close to that yet. There's some of it, but
         | from what I've seen most instances where this has been
         | successful are low level things like removing old feature
         | flags.
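          | 
          | As a rough sketch of just the triage half of that loop (the
          | repo name and scoring prompt are placeholders, and ask_llm is
          | a stand-in for whatever model call you'd actually use):
          | 
          |     import requests
          | 
          |     def ask_llm(prompt):
          |         # stub: replace with a real completion call
          |         return "value: ?, needs clarification: ?"
          | 
          |     repo = "someorg/somerepo"  # placeholder
          |     issues = requests.get(
          |         f"https://api.github.com/repos/{repo}/issues",
          |         params={"state": "open", "per_page": 20},
          |         timeout=30,
          |     ).json()
          | 
          |     for issue in issues:
          |         verdict = ask_llm(
          |             "Rate the value 0-10 and say whether it needs "
          |             "clarification:\n"
          |             f"{issue['title']}\n{issue.get('body') or ''}"
          |         )
          |         print(issue["number"], verdict)
          | 
          | The hard parts above (knowing when to open a PR, responding to
          | review, deciding to abandon) are exactly the parts a sketch
          | like this leaves out.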
        
           | rafaelmn wrote:
           | Just because we rely on vision to interface with computer
           | software doesn't mean it's optimal for AI models. Having a
           | specialized interface protocol is orthogonal to capability.
           | Just like you could theoretically write code in a
           | proportional font with notepad and run your tools through
           | windows CMD - having an editor with syntax highlighting and
           | monospaced font helps you read/navigate/edit, having
           | tools/navigation/autocomplete etc. optimized for your flow
           | makes you more productive and expands your capability, etc.
           | 
           | If I forced you to use unnatural interfaces it would severely
           | limit your capabilities as well because you'd have to
           | dedicate more effort towards handling basic editing tasks. As
           | someone who recently swapped to a split 36key keyboard with a
           | new layout I can say this becomes immediately obvious when
           | you try something like this. You take your typing/editing
           | skills for granted - try switching your setup and see how
           | your productivity/problem solving ability tanks in practice.
        
             | daxfohl wrote:
             | Agreed, but I also think to be called AGI, they should be
             | capable of working through human interfaces rather than
             | needing to have special interfaces created for them to get
             | around their lack of AGI.
             | 
             | The catch in this though isn't the ability to use these
             | interfaces. I expect that will be easy. The hard part will
             | be, once these interfaces are learned, the scope and search
             | space of what they will be able to do is infinitely larger.
             | And moreover our expectations will change in how we expect
             | an AGI to handle itself when our way of working with it
             | becomes more human.
             | 
             | Right now we're claiming nascent AGI, but really much of
              | what we're asking these systems to do has been laid out
             | for them. A limited set of protocols and interfaces, and a
             | targeted set of tasks to which we normally apply these
             | things. And moreover our expectations are as such. We don't
             | converse with them as with a human. Their search space is
             | much smaller. So while they appear AGI in specific tasks, I
             | think it's because we're subconsciously grading them on a
             | curve. The only way we have to interact with them
             | prejudices us to have a very low bar.
             | 
             | That said, I agree that video feed and mouse is a terrible
              | protocol for AI. But _that_ said, I wouldn't be surprised
             | if that's what we end up settling on. Long term, it's just
             | going to be easier for these bots to learn and adapt to use
             | human interfaces than for us to maintain two sets of
             | interfaces for things, except for specific bot-to-bot
             | cases. It's horribly inefficient, but in my experience
             | efficiency never comes out ahead with each new generation
             | of UIs.
        
           | Closi wrote:
           | This is an incredibly specific test/definition of AGI -
           | particularly remembering that I would probably say an octopus
           | classes as an intelligent being yet can't use outlook...
        
         | Rebuff5007 wrote:
         | > clearly in many cases superhuman
         | 
         | In what cases is it superhuman exactly? And what humans are you
         | comparing against?
         | 
          | I'd bet that for any discipline you choose, one could find an
          | expert in that field who can trick any of today's post-GPT-3
          | AIs.
        
       | tomrod wrote:
       | I agree with Professor Mollick that the capabilities in specific
       | task categories are becoming superhuman -- a precursor for AGI.
       | 
       | Until those capabilities are expanded for model self-improvement
       | -- including being able to adapt its own infrastructure, code,
       | storage, etc. -- then I think AGI/ASI are yet to be realized. My
       | POV is SkyNet, Traveler's "The Director", Person of Interest's
       | "The Machine" and "Samaritan." The ability to target a
       | potentially inscrutable goal along with the self-agency to direct
       | itself towards that is true "AGI" in my book. We have a lot of
       | components that we can reason are necessary, but it is unclear to
       | me that we get there in the next few months.
        
         | airstrike wrote:
         | I don't think we should take it as a given that these are truly
         | precursors for AGI.
         | 
         | We may be going about it the wrong way entirely and need to
         | backtrack and find a wholly new architecture, in which case
         | current capabilities would _predate_ AGI but not be
         | _precursors_.
        
       | Y_Y wrote:
       | Here's o4-mini-high solving riddles:
       | 
       | > What weighs more; a pound of courage or a pound coin?
       | 
       | > They weigh the same--each is a "pound." A pound of courage and
       | a pound coin both come out to one pound in weight!
       | 
       | https://chatgpt.com/share/68053ad0-aae4-800d-95ff-5b5b0a9a38...
       | 
        | I don't know what AGI is, but I know this isn't it.
        
         | hannofcart wrote:
         | Sonnet 3.7 (Thinking) gave me:
         | 
         | :::thinking This question is playing with words and units.
         | 
         | A "pound of courage" refers to a metaphorical measurement of
         | the abstract concept of courage, using the imperial weight unit
         | "pound" (which equals 453.59 grams).
         | 
         | A "pound coin" refers to the physical currency of the United
         | Kingdom, which is called a "pound" but is a unit of currency,
         | not weight. A physical pound coin weighs about 8-9 grams
         | (specifically, the current 12-sided pound coin weighs 8.75
         | grams).
         | 
         | So comparing them: - A pound of courage would weigh 453.59
         | grams (if courage had physical weight, which it doesn't since
         | it's an abstract concept) - A pound coin weighs approximately
         | 8.75 grams
         | 
         | The trick in this question is that it's comparing: 1. A
         | specific weight of an abstract concept (pound of courage) 2. A
         | physical coin whose name includes the word "pound" but refers
         | to currency, not weight
         | 
         | Since the question asks what weighs more, and weight is a
         | physical property, I should point out that courage is abstract
         | and doesn't have physical weight, while a pound coin is a
         | physical object with a measurable weight. ::: # Comparing "A
         | Pound of Courage" vs "A Pound Coin"
         | 
         | This is a play on words with different meanings of "pound":
         | 
         | - A *pound of courage* refers to a metaphorical pound (weight
         | unit) of an abstract concept - A *pound coin* is a physical
         | British currency coin (which weighs about 8.75 grams)
         | 
         | If we're talking about physical weight: - A pound of courage
         | doesn't have physical weight since courage is abstract - A
         | pound coin has a definite weight of approximately 8.75 grams
         | 
         | If we interpreted "pound of courage" as literally one pound
         | (453.59 grams) of something, it would weigh significantly more
         | than the coin.
         | 
         | This question is clever because it compares a metaphorical
         | measurement with a physical object whose name includes the word
         | "pound" but in a different context.
        
           | jpc0 wrote:
            | Yet I would draw the analogy the other way; maybe you can
            | prompt the AI into that chain of thought, but in my
            | experience that doesn't happen.
            | 
            | I would read it along the lines of the value of a human being
            | based on traits vs. payment, which is likely what the
            | original phrase intended.
            | 
            | Is paying someone more better than getting a better candidate
            | and paying them less?
        
           | boznz wrote:
           | If I ask a cancer specialist "Do I have Cancer?" I really
           | don't want to prompt them with "can you think a bit harder on
           | that"
        
         | pbhjpbhj wrote:
         | Courage is a beer, a kilo of Courage weighs a kilo.
        
       | simianwords wrote:
       | I thought o1 pro could have solved this riddle
       | 
       | > A young boy who has been in a car accident is rushed to the
       | emergency room. Upon seeing him, the surgeon says, "I can operate
       | on this boy!" How is this possible?
       | 
       | But it didn't!
        
         | simonw wrote:
         | Hah, yeah that still catches out o4-mini and o3 too. Amusingly,
         | adding "It's not the riddle." to the end fixes that.
         | 
         | (o4-mini high thought for 52 seconds and even cheated and
         | looked up the answer on Hacker News: https://chatgpt.com/share/
         | 68053c9a-51c0-8006-a7fc-75edb734c2...)
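          | 
          | For anyone who wants to reproduce this, a minimal sketch with
          | the OpenAI Python SDK (the model id is an assumption, and it
          | needs OPENAI_API_KEY set):
          | 
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          |     riddle = (
          |         "A young boy who has been in a car accident is "
          |         "rushed to the emergency room. Upon seeing him, "
          |         'the surgeon says, "I can operate on this boy!" '
          |         "How is this possible?"
          |     )
          |     # compare the plain riddle with the suffixed version
          |     for suffix in ("", " It's not the riddle."):
          |         resp = client.chat.completions.create(
          |             model="o4-mini",  # assumed model id
          |             messages=[{"role": "user",
          |                        "content": riddle + suffix}],
          |         )
          |         print(repr(suffix))
          |         print(resp.choices[0].message.content, "\n")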
        
       | VladimirOrlov wrote:
        | Is this for real? All this hype is very, very old hype, and
        | nothing fundamentally new (yet) since the 1960s. It looks like
        | every software upgrade is a "revolution" or a "revelation".
        | Please compare 'Win 3.1' and 'Win 11': some progress? Sure! Is
        | there any "Intelligence" there? No! No! No! What is the
        | difference? Who is constantly lying, and why? What is the reason
        | for this systematic and persistent lying? p.s. I personally
        | think that someday we will have "semi-smart" computer systems,
        | and that in 5-10 years we will learn more about what is possible
        | and real and what is not (regarding "semi-smart" computer
        | systems). Until then... hold your horses (please), so to speak.
        
       | low_tech_love wrote:
       | The first thing I want AGI to do is to be able to tell me when it
       | doesn't know something, or when it's not certain, so at least
       | give me a heads up to set expectations correctly. I ran my own
       | personal "benchmark" on Gemini 2.5 and it failed just like all
       | others. I told it that I was playing an old point-and-click
       | adventure game from the mid-90s and I was stuck on a certain
       | part, and asked for spoiler-light hints on what to do next. Not
       | only can they not give me hints, they hallucinate completely the
       | game, and invent some weird non-sensical solutions. Every single
       | model does this. Even if I tell them to give up and just give me
       | the solution, they come up with some non-existing solution.
       | 
        | I wonder how hard it is, objectively, to use information that
        | has been available online for 30 years? But the worst part is
        | how it lies and pretends it knows what it's talking about, and
        | when you point it out it simply turns in another direction and
        | starts lying
       | again. Maybe the use case here is not the main focus of modern
       | AI; maybe modern AI is about generating slop that does not
       | require verification, because it's "new" content. But to me it
       | just sounds like believable slop, not AGI.
        
         | irthomasthomas wrote:
         | Here's an example of how my agent handles this:
         | 
          | Gathering context for user request...
          | 
          | Context gathering - Attempting to answer question via LLM:
          |   Are there existing Conversation classes in the ecosystem
          |   this should extend?
          | Context gathering - LLM provided answer: "No"
          | 
          | Context gathering - Attempting to answer question via LLM:
          |   How should model selection work when continuing a previous
          |   conversation?
          | Context gathering - LLM answer was UNKNOWN, asking user.
          | Asking user: How should model selection work when continuing
          |   a previous conversation?
          | 
          | Context gathering - received user response to question: "How
          |   should model selection work when continuing a previous
          |   conversation?"
          | 
          | Context gathering - finished processing all user questions
          | Context gathering - processing command executions...
          | Context gathering - executing command:
          |   sqlite3 $(find . -name llm_conversations.db) .tables
          | Context gathering - command execution completed
          | Context gathering - executing command:
          |   grep -r Conversation tests/
          | Context gathering - command execution completed
          | Context gathering - executing command:
          |   grep -h conversation_id *py
          | Context gathering - command execution completed
          | Context gathering - finished processing all commands
          | Analyzing task complexity and requirements...
          | 
          | DEBUG: reasoning_model:
          |   openrouter/google/gemini-2.5-pro-preview-03-25
          | Task classified as coding (confidence: 1.0)
          | Task difficulty score: 98.01339999999999/100
          | Selected primary reasoning model: claude-3.7-sonnet
          | get_reasoning_assistance:[:214: integer expression expected:
          |   98.01339999999999
          | Reasoning assistance completed in 39 seconds
          | Calling LLM with model: claude-3.7-sonnet
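          | 
          | (The "[: integer expression expected" line looks like bash's
          | test builtin choking on the float score -- "[" only compares
          | integers with -gt/-ge. The routing step itself is simple once
          | the comparison happens somewhere float-friendly; a rough
          | sketch with made-up thresholds and a placeholder fallback:
          | 
          |     def pick_model(task_type, difficulty):
          |         # difficulty is a float on a 0-100 scale
          |         if task_type == "coding" and difficulty >= 90:
          |             return "claude-3.7-sonnet"
          |         if difficulty >= 60:
          |             return ("openrouter/google/"
          |                     "gemini-2.5-pro-preview-03-25")
          |         return "cheaper-default-model"  # placeholder
          | 
          |     print(pick_model("coding", 98.01339999999999))
          | 
          | which prints claude-3.7-sonnet, matching the log.)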
        
       | aylmao wrote:
       | > I've always been a staunch defender of capitalism and free
       | markets, even though that's historically been an unpopular
       | opinion in my particular social circle. Watching the LLM market,
       | I can't help but feel extremely vindicated.
       | 
       | > The brutal and bruising competition between the tech giants has
       | left nothing but riches for the average consumer.
       | 
       | Capitalism has always been great at this: creating markets,
        | growing them, producing new goods. This is widely acknowledged
       | amongst people who actually seek to gain an understanding of
       | Marxism, and don't just stay in the surface-level black-and-white
       | "socialism and capitalism are opposites" discourse that's very
        | common in the West, especially the USA, especially after
        | McCarthy's Red Scare.
       | 
        | The problem is what comes once the market has grown and the only
        | way for owners to keep profits growing is: 1. consolidating into
       | monopolies or cartels, so competition doesn't get in the way of
       | profits, 2. squeezing the working class, looking to pay less for
       | more work, and/or 3. abusing the natural world, to extract more
       | materials or energy for less money. This is evident in plenty of
       | developed industries: from health care, to broadcasting,
       | telecommunications, fashion, etc.
       | 
       | If we view Socialism for what it is, namely a system built to
       | replace Capitalism's bad parts but keep its good parts, China's
       | system, for example, starts to make more sense. Capitalism in a
        | similar way was an evolution from Feudalism that replaced its
        | bad parts to achieve greater liberty for everyone (liberty is
        | very much lost as Feudalism matures), which was great for
        | society as a whole. Socialism is meant to be similar, aiming to
        | achieve
       | greater equity, which it views as very much better for society as
       | a whole.
        
         | arrosenberg wrote:
         | Agree with most of what you wrote, but China isn't capitalist,
         | they're mercantilist with socialist policies. Capital is
         | heavily constrained under Xi.
        
       | myk9001 wrote:
        | Letting models interact with systems outside their sandbox brings
       | about some incredible applications. These applications truly seem
       | to have the potential to deeply change entire professions.
       | 
        | All that said, I wonder: if GPT4 had been integrated with the
        | same tools, would it have been any less capable?
       | 
       | It sure could give you a search prompt for Google if you asked it
       | to. Back then you had to copy and paste that search prompt
        | yourself. Today o3 can do it on its own. Cool! Does that imply,
        | though, that o3 is any closer to AGI than GPT4?
       | 
        | Models gaining access to external tools, however impressive from
        | an applications standpoint, feels like lateral movement, not a
        | step towards AGI.
       | 
       | On the other hand, a model remaining isolated in its sandbox
       | while actually learning to reason about that puzzle (assuming
        | it's not present in the training data) would give off those AGI
        | vibes.
        
         | joshuanapoli wrote:
         | The newer models are definitely more useful. Back in the GPT
         | 3.5 and 4 days, AutoGPT applied the same types of tools, but
         | you had to be pretty lucky for it to get anywhere. Now Claude
          | 3.7, Gemini 2.5, and GPT o3 make far fewer mistakes, and are
         | better able to get back on-track when a mistake is discovered.
         | So they're more convincing as intelligent helpers.
        
           | myk9001 wrote:
            | Good point. I still wonder whether o3 has improved command of
            | tools because it's significantly smarter in general, or
            | whether it's "just" trained with a specific focus on using
            | tools better, if that makes sense.
        
       | boznz wrote:
       | I'm surprised nobody mentioned the video interview. I only
       | watched the first 60 seconds and this is the first time I have
       | seen or heard the author, but if I hadn't been told this was AI
       | generated I would have assumed it was genuine and any 'twitching'
       | was the result of the video compression.
        
         | smusamashah wrote:
         | How???? I can believe the guy in the video being AI because his
         | lips are not perfectly synced. But the woman? Even with
          | continuous silly exaggerated movement I have a hard time
          | believing it's generated.
         | 
          | A strand of her hair fell on her shoulder, and because she was
          | moving continuously (like crazy) it was moving too, in a
          | perfectly believable way, and IT EVENTUALLY FELL OFF THE
          | SHOULDER/SHIRT LIKE REAL HAIR and got mixed into other fallen
          | hair. How is that generated? It's too small a detail. Are there
         | any artifacts on her side?
         | 
         | Edit: she has to be real. Her lip movements are definitely
         | forced/edited though. It has to be a video recording of her
         | talking. And then a tool/AI has modified her lips to match the
         | voice. If you look at her face and hand movements, her shut
         | lips seem forced.
        
         | -__---____-ZXyw wrote:
         | I went and watched 10 seconds on account of your comment, and
         | couldn't disagree more. The heads keep sort of rolling around
          | in a disconcerting and quite eerie fashion?
        
       | keernan wrote:
       | I fail to see how LLMs are anything beyond a lookup function
       | retrieving information from a huge database (containing, in
       | theory, all known human information), and then summarizing the
       | results using language algorithms.
       | 
       | While incredibly powerful and transformative, it is not
       | 'intelligence'. LLMs are forever knowledgebase bound. They are
       | encyclopedias with a fancy way of presenting information looked
       | up in the encyclopedia.
       | 
       | The 'presentation' has no concept, awareness, or understanding of
       | the information being presented - and never will. And this is the
       | critical line. Without comprehension, a LLM is incapable of being
       | creative. Of coming up with new ideas. It cannot ponder. Wonder.
       | Think.
        
       | dgs_sgd wrote:
       | While it's hard to agree on what AGI is I think we can more
       | easily agree on what AGI _is not_.
       | 
       | I don't consider an AI that fails the surgery brain teaser in the
       | article to be AGI, no matter how superhuman it is at other narrow
       | tasks. It doesn't satisfy the "G" part of AGI.
        
       | chrsw wrote:
       | What about all the things that aren't strictly intelligence but I
       | guess intelligence adjacent: autonomy, long term memory,
       | motivation, curiosity, resilience, goals, choice, and maybe the
        | biggest of them all: fear? Why would an AGI "want" to do anything
       | more than my calculator "wants" to compute an answer to some math
       | problem I gave it? Without these things an AGI, or whatever, is
       | just an extension of whoever is ultimately controlling it.
       | 
       | And that's when we return to a much older and much more important
       | question than whether Super LLM 10.0 Ultra Plus is AGI or not:
       | how much power should a person or group of people be allowed to
       | have?
        
         | hiAndrewQuinn wrote:
         | https://gwern.net/tool-ai is a quite comprehensive dive into
         | why.
        
       | snarg wrote:
       | I honestly thought that we were agreed on the definition of AGI.
       | My understanding classified it as a model that can build on its
       | knowledge and better itself, teaching itself new tasks and
       | techniques, adapting as necessary. I.e., not simply knowing
       | enough techniques to impress some humans. By this definition, it
       | doesn't matter if it's super-intelligent or if its knowledge is
       | rudimentary, because given enough add-on hardware and power, it
       | could become super-intelligent over time.
        
       | gilbetron wrote:
       | AGI that is bad at some things is still AGI. We have AGI, it is
       | just bad at some things and hallucinates. It is literally smarter
        | than many people I know, but that doesn't mean it can beat any
        | human at anything. That would be ASI, which, hopefully, will take
       | a while to get here.
       | 
       | Although, I could be argued into calling what we have already ASI
       | - take a human and Gemini 2.5, and put them through a barrage of
       | omni-disciplinary questions and situations and problems. Gemini
       | 2.5 will win, but not absolutely.
       | 
        | AGI: we have it.
        | ASI: we might have it.
        | AOI (Artificial Omniscient Intelligence): will hopefully take a
        | while to get here.
        
       ___________________________________________________________________
       (page generated 2025-04-20 23:00 UTC)