[HN Gopher] Jagged AGI: o3, Gemini 2.5, and everything after
___________________________________________________________________
Jagged AGI: o3, Gemini 2.5, and everything after
Author : ctoth
Score : 137 points
Date : 2025-04-20 14:55 UTC (8 hours ago)
(HTM) web link (www.oneusefulthing.org)
(TXT) w3m dump (www.oneusefulthing.org)
| sejje wrote:
| In the last example (the riddle)--I generally assume the AI isn't
| misreading, rather that it assumes you couldn't give it the
| riddle correctly, but it has seen it already.
|
| I would do the same thing, I think. It's too well-known.
|
| The variation doesn't read like a riddle at all, so it's
| confusing even to me as a human. I can't find the riddle part.
| Maybe the AI is confused, too. I think it makes an okay
| assumption.
|
| I guess it would be nice if the AI asked a follow up question
| like "are you sure you wrote down the riddle correctly?", and I
| think it could if instructed to, but right now they don't
| generally do that on their own.
| Jensson wrote:
| > generally assume the AI isn't misreading, rather that it
| assumes you couldn't give it the riddle correctly, but it has
| seen it already.
|
| LLMs don't assume; an LLM is a text completer. It sees something
| that looks almost like a well-known problem and it will complete
| it as that well-known problem. It's a problem specific to being a
| text completer that is hard to get around.
| simonw wrote:
| These newer "reasoning" LLMs really don't feel like pure text
| completers any more.
| jordemort wrote:
| And yet
| gavinray wrote:
| Is it not physically impossible for LLMs to be anything
| but "plausible text completion"?
|
| Neural Networks as I understand them are universal function
| approximators.
|
| In terms of text, that means they're trained to output what
| they believe to be the "most probably correct" sequence of
| text.
|
| An LLM has no idea that it is "conversing", or "answering"
| -- it relates some series of symbolic inputs to another
| series of probabilistic symbolic outputs, aye?
| Borealid wrote:
| What your parent poster said is nonetheless true,
| regardless of how it feels to you. Getting text from an LLM
| is a process of iteratively attempting to find a likely
| next token given the preceding ones.
|
| If you give an LLM "The rain in Spain falls" the single
| most likely next token is "mainly", and you'll see that one
| proportionately more than any other.
|
| If you give an LLM "Find an unorthodox completion for the
| sentence 'The rain in Spain falls'", the most likely next
| token is something other than "mainly" because the tokens
| in "unorthodox" are more likely to appear before text that
| otherwise bucks statistical trends.
|
| If you give the LLM "blarghl unorthodox babble The rain in
| Spain" it's likely the results are similar to the second
| one but less likely to be coherent (because text obeying
| grammatical rules is more likely to follow other text also
| obeying those same rules).
|
| In any of the three cases, the LLM is predicting text, not
| "parsing" or "understanding" a prompt. The fact it will
| respond similarly to a well-formed and unreasonably-formed
| prompt is evidence of this.
|
| It's theoretically possible to engineer a string of
| complete gibberish tokens that will prompt the LLM to
| recite song lyrics, or answer questions about mathematical
| formulae. Those strings of gibberish are just difficult to
| discover.
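|
| To make "likely next token" concrete, here's a rough sketch of
| how you could watch the distribution for those three prompts
| yourself (assuming the HuggingFace transformers library, with
| GPT-2 as a small stand-in base model; exact numbers will vary by
| model):
|
|   import torch
|   from transformers import AutoModelForCausalLM, AutoTokenizer
|
|   name = "gpt2"  # any base (non-instruct) model works for the demo
|   tok = AutoTokenizer.from_pretrained(name)
|   model = AutoModelForCausalLM.from_pretrained(name)
|
|   def top_next_tokens(prompt, k=5):
|       # Probabilities for the single token that would come next.
|       ids = tok(prompt, return_tensors="pt").input_ids
|       with torch.no_grad():
|           logits = model(ids).logits[0, -1]
|       probs = torch.softmax(logits, dim=-1)
|       top = torch.topk(probs, k)
|       return [(tok.decode(i), round(p.item(), 4))
|               for i, p in zip(top.indices, top.values)]
|
|   print(top_next_tokens("The rain in Spain falls"))
|   print(top_next_tokens("Find an unorthodox completion for the "
|                         "sentence 'The rain in Spain falls'"))
|   print(top_next_tokens("blarghl unorthodox babble The rain in Spain"))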
| Workaccount2 wrote:
| The problem is showing that humans aren't just doing next
| word prediction too.
| Borealid wrote:
| I don't see that as a problem. I don't particularly care
| how human intelligence works; what matters is what an LLM
| is capable of doing and what a human is capable of doing.
|
| If those two sets of accomplishments are the same there's
| no point arguing about differences in means or terms.
| Right now humans can build better LLMs but nobody has
| come up with an LLM that can build better LLMs.
| baq wrote:
| That's literally the definition of takeoff: once it
| starts, it gets us to the singularity within a decade, and
| there's no publicly available evidence that it's started...
| emphasis on publicly available.
| myk9001 wrote:
| > it gets us to singularity
|
| Are we sure it's actually taking us along?
| johnisgood wrote:
| > but nobody has come up with an LLM that can build
| better LLMs.
|
| Yet. Not that we know of, anyway.
| dannyobrien wrote:
| So I just gave your blarghl line to Claude, and it
| replied "It seems like you included a mix of text
| including "blarghl unorthodox babble" followed by the
| phrase "The rain in Spain."
|
| Did you mean to ask about the well-known phrase "The rain
| in Spain falls mainly on the plain"? This is a famous
| elocution exercise from the musical "My Fair Lady," where
| it's used to teach proper pronunciation.
|
| Or was there something specific you wanted to discuss
| about Spain's rainfall patterns or perhaps something else
| entirely? I'd be happy to help with whatever you intended
| to ask. "
|
| I think you have a point here, but maybe re-express it?
| Because right now your argument seems trivially
| falsifiable even under your own terms.
| Borealid wrote:
| If you feed that to Claude, you're getting Claude's "system
| prompt" prepended before the text you give it.
|
| If you want to test convolution you have to use a raw
| model with no system prompt. You can do that with a Llama
| or similar. Otherwise your context window is full of
| words like "helpful" and "answer" and "question" that
| guide the response and make it harder (not impossible) to
| see the effect I'm talking about.
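|
| Concretely, a minimal sketch of the difference I mean, via the
| HuggingFace transformers library (the Llama model names are just
| placeholders for any base/instruct pair; I'm not claiming these
| exact outputs):
|
|   from transformers import AutoTokenizer, pipeline
|
|   # Raw base model: no system prompt, no chat template, just
|   # next-token prediction continuing the bare string.
|   base = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
|   print(base("blarghl unorthodox babble The rain in Spain",
|              max_new_tokens=30)[0]["generated_text"])
|
|   # Instruct-tuned variant: the chat template wraps your text in
|   # role markers before the model ever sees it, which is where the
|   # helpful-assistant framing comes from.
|   tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
|   msgs = [{"role": "user",
|            "content": "blarghl unorthodox babble The rain in Spain"}]
|   print(tok.apply_chat_template(msgs, tokenize=False,
|                                 add_generation_prompt=True))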
| itchyjunk wrote:
| At this point, you might as well be claiming that a
| completions model behaves differently from a fine-tuned
| model. Which is true, but the prompt in the API without any
| system message also doesn't seem to match your prediction.
| tough wrote:
| the point is when there's a system prompt you didn't write,
| you get autocomplete of your input + said system prompt,
| which biases all outputs
| dannyobrien wrote:
| I'm a bit confused here. Are you saying that if I zero
| out the system prompt on _any_ LLM, including those fine-
| tuned to give answers in an instructional form, they will
| follow your effect -- that nonsense prompts will get
| similar results to coherent prompts?
|
| Because I've tried it on a few local models I have handy,
| and I don't see that happening at all. As someone else
| says, some of that difference is almost certainly due to
| supervised fine-tuning (SFT) and reinforcement learning
| from human feedback (RLHF) -- but it's weird to me, given
| the confidence you made your prediction, that you didn't
| exclude those from your original statement.
|
| I guess, maybe the real question here is: could you give
| me a more explicit example of how to show what you are
| trying to show? And explain why I'm not seeing it while
| running local models without system prompts?
| simonw wrote:
| No, I think the "reasoning" step really does make a
| difference here.
|
| There's more than just next token prediction going on.
| Those reasoning chain of thoughts have undergone their
| own reinforcement learning training against a different
| category of samples.
|
| They've seen countless examples of how a reasoning chain
| would look for calculating a mortgage, or searching a
| flight, or debugging a Python program.
|
| So I don't think it is accurate to describe the eventual
| result as "just next token prediction". It is a
| combination of next token prediction that has been
| informed by a chain of thought that was based on a
| different set of specially chosen examples.
| Borealid wrote:
| Do you believe it's possible to produce a given set of
| model weights with an infinitely large number of
| different training examples?
|
| If not, why not? Explain.
|
| If so, how does your argument address the fact that this
| implies any given "reasoning" model can be trained
| without giving it a single example of something you would
| consider "reasoning"? (in fact, a "reasoning" model may
| be produced by random chance?)
| simonw wrote:
| I'm afraid I don't understand your question.
| wongarsu wrote:
| > The fact it will respond similarly to a well-formed and
| unreasonably-formed prompt is evidence of this.
|
| Don't humans do the same in conversation? How should an
| intelligent being (constrained to the same I/O system)
| respond here to show that it is in fact intelligent?
| Borealid wrote:
| Imagine a Rorschach Test of language, where a certain set
| of non-recognizable-language tokens invariably causes an
| LLM to talk about flowers. These strings exist by
| necessity due to how the LLM's layers are formed.
|
| There exists no similar set of tokens for humans, because
| our process is to parse the incoming sounds into words,
| use grammar to extract conceptual meaning from those
| words, and then shape a response from that conceptual
| meaning.
|
| Artists like Lewis Carroll and Stanislaw Lem play with
| this by inserting non-words at certain points in
| sentences to get humans to infer the meaning of those
| words from surrounding context, but the truth remains
| that an LLM will gladly convolute a wholly non-language
| input into a response as if it were well-formed, but a
| human can't/won't do that.
|
| I know this is hard to understand, but the current
| generation of LLMs are working directly with language.
| Their "brains" are built on language. Some day we might
| have some kind of AI system that's built on some kind of
| meaning divorced from language, but that's not what's
| happening here. They're engineering matrices that
| repeatedly perform "context window times model => one
| more token" operations.
| og_kalu wrote:
| I think you are begging the question here.
|
| For one thing, LLMs absolutely form responses from
| conceptual meanings. This has been demonstrated
| empirically multiple times now including again by
| anthropic only a few weeks ago. 'Language' is just the
| input and output, the first and last few layers of the
| model.
|
| So okay, there exists some set of 'gibberish' tokens that
| will elicit meaningful responses from LLMs. How does your
| conclusion - "Therefore, LLMs don't understand" fit the
| bill here? You would also conclude that humans have no
| understanding of what they see because of the Rorschach
| test ?
|
| >There exists no similar set of tokens for humans,
| because our process is to parse the incoming sounds into
| words, use grammar to extract conceptual meaning from
| those words, and then shape a response from that
| conceptual meaning.
|
| Grammar is useful fiction, an incomplete model of a
| demonstrably probabilistic process. We don't use
| 'grammar' to do anything.
| wongarsu wrote:
| > Imagine a Rorschach Test of language, where a certain
| set of non-recognizable-language tokens invariably causes
| an LLM to talk about flowers. These strings exist by
| necessity due to how the LLM's layers are formed.
|
| Maybe not for humanity as a species, but for individual
| humans there are absolutely token sequences that lead
| them to talk about certain topics, with nobody able
| to bring them back on topic. Now you'd probably say those
| are recognizable token sequences, but do we have a fair
| process to decide what's recognizable that isn't
| inherently biased towards making humans the only rational
| actor?
|
| I'm not contending at all that LLMs are only built on
| language. Their lack of physical reference point is
| sometimes laughably obvious. We could argue whether there
| are signs they also form a world model and reasoning that
| abstracts from language alone, but that's not even my
| point. My point is rather that any test or argument that
| attempts to say that LLMs can't "reason" or "assume" or
| whatever has to be a test a human could pass. Preferably
| a test a random human would pass with flying colors.
| baq wrote:
| This again.
|
| It's predicting text. Yes. Nobody argues about that.
| (You're also predicting text when you're typing it. Big
| deal.)
|
| _How_ it is predicting the text is the question to ask
| and indeed it's being asked and we're getting glimpses of
| understanding and lo and behold it's a damn complex
| process. See the recent anthropic research paper for
| details.
| monkpit wrote:
| This take really misses a key part of the implementation of
| these LLMs, and I've been struggling to put my finger on it.
|
| In every LLM thread someone chimes in with "it's just a
| statistical token predictor".
|
| I feel this misses the point and I think it dismisses
| attention heads and transformers, and that's what sits weird
| with me every time I see this kind of take.
|
| There _is_ an assumption being made within the model at
| runtime. Assumption, confusion, uncertainty - one camp might
| argue that none of these exist in the LLM.
|
| But doesn't the implementation constantly make assumptions?
| And what even IS your definition of "assumption" that's not
| being met here?
|
| Edit: I guess my point, overall, is: what's even the purpose
| of making this distinction anymore? It derails the discussion
| in a way that's not insightful or productive.
| Jensson wrote:
| > I feel this misses the point and I think it dismisses
| attention heads and transformers
|
| Those just make it better at completing the text, but for
| very common riddles those tools still get easily overruled
| by pretty simple text completion logic, since the weights
| for those will be so extremely strong.
|
| The point is that if you understand it's a text completer,
| then it's easy to understand why it fails at these. To fix
| these properly you need to make it no longer try to
| complete text, and that is hard to do without breaking it.
| wongarsu wrote:
| If you have the model output a chain of thought, whether it's
| a reasoning model or you prompt a "normal" model to do so,
| you will see examples of the model going "user said X, but
| did they mean Y? Y makes more sense, I will assume Y".
| Sometimes stretched over multiple paragraphs, consuming the
| entire reasoning budget for that prompt.
|
| Discussing whether models can "reason" or "think" is a
| popular debate topic on here, but I think we can all at least
| agree that they do something that at least resembles
| "reasoning" and "assumptions" from our human point of view.
| And if in its chain of thought it decides your prompt is
| wrong, it will go ahead and answer what it assumes is the
| right prompt.
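|
| A rough sketch of what that looks like if you nudge a "normal"
| model to show its work (using the OpenAI Python SDK; the model
| name and tag format here are just illustrative):
|
|   from openai import OpenAI
|
|   client = OpenAI()  # assumes OPENAI_API_KEY is set
|   resp = client.chat.completions.create(
|       model="gpt-4o-mini",
|       messages=[
|           {"role": "system",
|            "content": "Think step by step inside <thinking> tags, "
|                       "then give your final answer."},
|           {"role": "user",
|            "content": "A man and a goat are on one side of a river. "
|                       "They have a boat. How do they cross?"},
|       ],
|   )
|   print(resp.choices[0].message.content)
|
| In the step-by-step portion you'll often see exactly that "user
| said X, but did they mean the classic version?" second-guessing.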
| sejje wrote:
| > it's a text completer
|
| Yes, and it can express its assumptions in text.
|
| Ask it to make some assumptions, like about a stack for a
| programming task, and it will.
|
| Whether or not the mechanism behind it feels like real
| thinking to you, it can definitely do this.
| wobfan wrote:
| If you call putting together text that reads like an
| assumption "expressing an assumption", then yes. But it
| cannot express an assumption, as it is not assuming. It is
| completing text, like OP said.
| ToValueFunfetti wrote:
| It's trained to complete text, but it does so by
| constructing internal circuitry during training. We don't
| have enough transparency into that circuitry or the human
| brain's to positively assert that it doesn't assume.
|
| But I'd wager it's there; assuming is not a particularly
| impressive or computationally intense operation. There's
| a tendency to bundle all of human consciousness into the
| definitions of our cognitive components, but I would
| argue that, eg., a branch predictor is meeting the bar
| for any sane definition of 'assume'.
| og_kalu wrote:
| Text Completion is just the objective function. It's not
| descriptive and says nothing about how the models complete
| text. Why people hang on this word, I'll never understand.
| When you wrote your comment, you were completing text.
|
| The problem you've just described is a problem with humans as
| well. LLMs are assuming all the time. Maybe you would like to
| call it another word, but it is happening.
| codr7 wrote:
| With a plan, aiming for something, that's the difference.
| og_kalu wrote:
| Again, you are only describing the _how_ here, not the
| _what_ (text completion).
|
| Also, LLMs absolutely 'plan' and 'aim for something' in
| the process of completing text.
|
| https://www.anthropic.com/research/tracing-thoughts-language...
| namaria wrote:
| Yeah this paper is great fodder for the LLM pixie dust
| argument.
|
| They use a replacement model. It isn't even observing the
| LLM itself but a different architecture model. And it is
| very liberal with interpreting the patterns of
| activations seen in the replacement model with flowery
| language. It also includes some very relevant caveats,
| such as:
|
| "Our cross-layer transcoder is trained to mimic the
| activations of the underlying model at each layer.
| However, even when it accurately reconstructs the model's
| activations, there is no guarantee that it does so via
| the same mechanisms."
|
| https://transformer-circuits.pub/2025/attribution-graphs/met...
|
| So basically the whole exercise might or might not be
| valid. But it generates some pretty interactive graphics
| and a nice blog post to reinforce the
| anthropomorphization discourse
| og_kalu wrote:
| 'So basically the whole exercise might or might not be
| valid.'
|
| Nonsense. Mechanistic faithfulness probes whether the
| replacement model ("cross-layer transcoder") truly uses
| the same internal functions as the original LLM. If it
| doesn't, the attribution graphs it suggests might mislead
| at a fine-grained level, but because every hypothesis
| generated by those graphs is tested via direct
| interventions on the real model, high-level causal
| discoveries (e.g. that Claude plans its rhymes ahead of
| time) remain valid.
| losvedir wrote:
| So do LLMs. "In the United States, someone whose job is
| to go to space is called ____" it will say "an" not
| because that's the most likely next word, but because
| it's "aiming" (to use your terminology) for "astronaut"
| in the future.
| codr7 wrote:
| I don't know about you, but I tend to make more elaborate
| plans than the next word. I have a purpose, an idea I'm
| trying to communicate. These things don't have ideas,
| they're not creative.
| Jensson wrote:
| > When you wrote your comment, you were completing text.
|
| I didn't train to complete text though, I was primarily
| trained to make accurate responses.
|
| And no, writing a response is not "completing text", I
| don't try to figure out what another person would write as
| a response, I write what I feel people need to read. That
| is a completely different thought process. If I tried to
| mimic what another commenter would have written it would
| look very different.
| AstralStorm wrote:
| Sometimes we also write what we really want people to not
| read. That's usually called trolling though.
| og_kalu wrote:
| >And no, writing a response is not "completing text", I
| don't try to figure out what another person would write
| as a response, I write what I feel people need to read.
|
| Functionally, it is. You're determining what text should
| follow the prior text. Your internal reasoning ('what I
| feel people need to read') is _how_ you decide on the
| completion.
|
| The core point isn't that your internal 'how' is the same
| as an LLM's (Maybe, Maybe not), but that labeling the LLM
| as a 'text completer' the way you have is essentially
| meaningless.
|
| You are just imposing your own ideas on _how_ an LLM
| works, not speaking any fundamental truth of being a
| 'text completer'.
| moffkalast wrote:
| Yeah you need specific instruct training for that sort of
| thing, Claude Opus being one of the rare examples that does
| such a sensibility check quite often and even admits when it
| doesn't know something.
|
| These days it's all about confidently bullshitting on
| benchmarks and overfitting on common riddles to make pointless
| numbers go up. The more impressive models get on paper, the
| more rubbish they are in practice.
| pants2 wrote:
| Gemini 2.5 is actually pretty good at this. It's the only
| model ever to tell me "no" to a request in Cursor.
|
| I asked it to add websocket support for my app and it
| responded like, "looks like you're using long polling now.
| That's actually better and simpler. Lets leave it how it is."
|
| I was genuinely amazed.
| simonw wrote:
| Coining "Jagged AGI" to work around the fact that nobody agrees
| on a definition for AGI is a clever piece of writing:
|
| > In some tasks, AI is unreliable. In others, it is superhuman.
| You could, of course, say the same thing about calculators, but
| it is also clear that AI is different. It is already
| demonstrating general capabilities and performing a wide range of
| intellectual tasks, including those that it is not specifically
| trained on. Does that mean that o3 and Gemini 2.5 are AGI? Given
| the definitional problems, I really don't know, but I do think
| they can be credibly seen as a form of "Jagged AGI" - superhuman
| in enough areas to result in real changes to how we work and
| live, but also unreliable enough that human expertise is often
| needed to figure out where AI works and where it doesn't.
| shrx wrote:
| >> It is already demonstrating general capabilities and
| performing a wide range of intellectual tasks, including those
| that it is not specifically trained on.
|
| Huh? Isn't a LLM's capability fully constrained by the training
| data? Everything else is hallucinated.
| bbor wrote:
| The critical discovery was a way to crack the "Frame
| Problem", which roughly comes down to colloquial notions of
| common sense or intuition. For the first time ever, we have
| models that know if you jump off a stool, you will (likely!)
| be standing on the ground afterwards.
|
| In that sense, they absolutely know things that aren't in
| their training data. You're correct about factual knowledge,
| tho -- that's why they're not trained to optimize it! A
| database(/pagerank?) solves that problem already.
| simonw wrote:
| You can argue that _everything_ output by an LLM is
| hallucinated, since there's no difference under-the-hood
| between outputting useful information and outputting
| hallucinations.
|
| The quality of the LLM then becomes how often it produces
| useful information. That score has gone up a _lot_ in the
| past 18 months.
|
| (Sometimes hallucinations are what you want: "Tell me a fun
| story about a dog learning calculus" is a valid prompt which
| mostly isn't meant to produce real facts about the world.)
| codr7 wrote:
| Isn't it the case that the latest models actually
| hallucinate more than the ones that came before? Despite
| best efforts to prevent it.
| simonw wrote:
| The o3 model card reports a so far unexplained uptick in
| hallucination rate from o1 - on page 4 of
| https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f372...
|
| That is according to one specific internal OpenAI
| benchmark, I don't know if it's been replicated
| externally yet.
| verdverm wrote:
| Why not call it AJI instead of AGI then?
|
| Certainly jagged does not imply general
|
| It seems to me the bar for "AGI" has been lowered to measuring
| what tasks it can do rather than the traits we normally
| associate with general intelligence. People want it to be here
| so bad they nerf the requirements...
| bbor wrote:
| Well I think the point being made is an instrumental one:
| it's general enough to matter, so we should use the word
| "general" to communicate that to laypeople.
|
| Re:"traits we associate with general intelligence", I think
| the exact issue is that there is no scientific (ie
| specific*consistent) list of such traits. This is why Turing
| wrote his famous 1950 paper and invoked the Imitation Game;
| not to detail how one could test for a computer that's really
| thinking(/truly general), but to show why that question isn't
| necessary in the first place.
| verdverm wrote:
| I still disagree: being good at a number of tasks does not
| make it intelligent.
|
| Certainly creativity is missing, it has no internal
| motivation, and it will answer the same simple question
| both right and wrong, depending on unknown factors. What if
| we reverse the framing from "it can do these tasks,
| therefore it must be..." to "it lacks these traits,
| therefore it is not yet..."
|
| While I do not disagree that the LLMs have become advanced
| enough to do a bunch of automation, I do not agree they are
| intelligent or actually thinking.
|
| I'm with Yann Lecun when he says that we won't reach AGI
| until we move beyond transformers.
| iknowstuff wrote:
| AJI lol love it.
| nearbuy wrote:
| Human intelligence is jagged. You're raising the AGI bar to a
| point where most people wouldn't qualify as having general
| intelligence.
|
| My partner and I work in different fields. AI has advanced to
| the point where there are very few questions I could ask my
| partner that o3 couldn't answer as well or better.
|
| I can't ask expert level questions in her field, because I'm
| not an expert in her field, and she couldn't ask expert level
| questions in my field for the same reason. So when we're
| communicating with each other, we're mostly at sub-o3 level.
|
| > People want it to be here so bad they nerf the
| requirements...
|
| People want to claim it's overhyped (and protect their own
| egos) so badly they raise the requirements...
|
| But really, largely people just have different ideas of what
| AGI is supposed to mean. It used to vaguely mean "human-level
| intelligence", which was fine for talking about some
| theoretical future event. Now we're at a point where that
| definition is too vague to say whether AI meets it.
| tasuki wrote:
| > You're raising the AGI bar to a point where most people
| wouldn't qualify as having general intelligence.
|
| We kind of don't? Look how difficult it is for us to just
| understand some basic math. Us humans mostly have
| intelligence related to the ancestral environment we
| developed in, nothing general about that.
|
| I agree with you the term "AGI" is rather void of meaning
| these days...
| verdverm wrote:
| You're using limited and anecdotal task based metrics as
| some sort of evidence. Both of you are able to drive a car,
| yet we need completely different AIs for such tasks.
|
| I still find task-based measures insufficient; there are
| very basic machines that can perform tasks humans cannot.
| Should this be a measure of our or their intelligence?
|
| I have another comment in this thread about trait based
| metrics being a possibly better method.
|
| > People want to claim it's overhyped (and protect their
| own egos) so badly they raise the requirements...
|
| Shallow response. Seek to elevate the conversation. There
| are also people who see it for what it is, a useful tool
| but not intelligent...
| nearbuy wrote:
| > You're using limited and anecdotal task based metrics
| as some sort of evidence.
|
| And you presented no evidence at all. Not every comment I
| make is going to contain a full lit review.
|
| > Both of you are able to drive a car, yet we need
| completely different AIs for such tasks.
|
| This is like a bird complaining humans aren't intelligent
| because they can't fly. How is Gemini or o3 supposed to
| drive without real-time vision and a vehicle to control?
| How are you supposed to fly without wings?
|
| It lacks the sensors and actuators to drive, but this is
| moving away from a discussion on intelligence. If you
| want to argue that any system lacking real-time vision
| isn't intelligent, you're just using a very unusual
| definition of intelligence that excludes blind people.
|
| > Shallow response. Seek to elevate the conversation.
|
| This was an ironic response pointing out the shallowness
| of your own unsubstantiated accusation that people just
| disagree with you because they're biased or deluded
| themselves. The next paragraph starting with "But really"
| was supposed to convey it wasn't serious, just a jab
| showing the silliness of your own jab.
| qsort wrote:
| I don't think that's a particularly honest line of thinking
| though. It preempts the obvious counterargument, but very
| weakly so. Calculators are different, but why? Can an ensemble
| of a calculator, a Prolog interpreter, Alexnet and Stockfish be
| considered "jagged superintelligence"? They are all clearly
| superhuman, and yet require human experience to be wielded
| effectively.
|
| I'm guilty as charged of having looked at GPT 3.5 and having
| thought "it's meh", but more than anything this is showing that
| debating words rather than the underlying capabilities is an
| empty discussion.
| og_kalu wrote:
| >Calculators are different, but why? Can an ensemble of a
| calculator, a Prolog interpreter, Alexnet and Stockfish be
| considered "jagged superintelligence"?
|
| Those are all different things with little to nothing to do
| with each other. It's like saying, what if I ensemble a snake
| and a cat? What does that even mean? GPT-N or whatever is a
| single model that can do many things, no ensembling required.
| That's the difference between it and a calculator or
| Stockfish.
| AstralStorm wrote:
| That is not true; the model is modular, and thus an ensemble.
| It uses DALL-E for graphics and specialized tokenizer models
| for sound.
|
| If you remove those tools, or cut its access to search
| databases, it becomes quite a bit less capable.
|
| A human would often still manage to do it without some of
| that data, perhaps with less certainty, while GPT has far
| more problems without other components filling in the holes.
| og_kalu wrote:
| > It uses DALL-E for graphics and specialized tokenizer models
| for sound.
|
| ChatGPT no longer uses DALL-E for image generation. I
| don't understand your point about tokenization; it
| doesn't make the model an ensemble.
|
| It's also just beside the point. Even if you restrict the
| modalities to text alone, these models are still general
| alone in ways a calculator is not.
| k2xl wrote:
| There is a similar issue with image and video generation. Asking
| the AI to "Generate an image of a man holding a pencil with his
| left hand" or "Generate a clock showing the time 5 minutes past 6
| o'clock" often fails because so many images in the training set
| are similar (almost all clock images show 10:10:
| https://generativeai.pub/in-the-ai-art-world-the-time-is-alm...)
| fsmv wrote:
| It's not AGI because it still doesn't understand anything. It can
| only tell you things that can be found on the internet. These
| "jagged" results expose the truth that these models have near 0
| intelligence.
|
| It is not a simple matter of patching the rough edges. We are
| fundamentally not using an architecture that is capable of
| intelligence.
|
| Personally the first time I tried deep research on a real topic
| it was disastrously incorrect on a key point.
| falcor84 wrote:
| >near 0 intelligence
|
| What does that even mean? Do you actually have any particular
| numeric test of intelligence that's somehow better than all the
| others?
| simonw wrote:
| Is one of your personal requirements for AGI "never makes a
| mistake?"
| Arainach wrote:
| I think determinism is an important element. You can ask the
| same LLM the same question repeatedly and get different
| answers - and not just different ways of stating the same
| answer, very different answers.
|
| If you ask an intelligent being the same question they may
| occasionally change the precise words they use but their
| answer will be the same over and over.
| hdjjhhvvhga wrote:
| If determinism is a hard requirement, then LLM-based AI
| can't fulfill it by definition.
| samuel wrote:
| That's not an inherent property of the system. You can
| choose the most likely token (top_k=1) and it will be
| deterministic (at least in theory; in some hardware setups
| it might be trickier).
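|
| A minimal sketch with the transformers library (greedy decoding;
| GPT-2 is just a stand-in model):
|
|   from transformers import AutoModelForCausalLM, AutoTokenizer
|
|   tok = AutoTokenizer.from_pretrained("gpt2")
|   model = AutoModelForCausalLM.from_pretrained("gpt2")
|
|   ids = tok("The rain in Spain falls", return_tensors="pt").input_ids
|   # do_sample=False takes the single most likely token each step,
|   # so repeated runs on the same setup give the same output.
|   out = model.generate(ids, do_sample=False, max_new_tokens=20)
|   print(tok.decode(out[0], skip_special_tokens=True))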
| beering wrote:
| A human will give different answers to the same question,
| so I'm not sure why it's fair to set a higher bar for an
| LLM. Or rather, I'm not sure how you would design this test
| in a way where humans would pass and the best LLM would
| fail.
| simonw wrote:
| That's because "intelligent beings" have memory. If you ask
| an LLM the same question within the same chat session
| you'll get a consistent answer about it.
| Arainach wrote:
| I disagree. If you were to take a snapshot of someone's
| knowledge and memory such that you could restore to it
| over and over, that person would give the same answer to
| the question. The same is not true for an LLM.
|
| Heck, I can't even get LLMs to be consistent about *their
| own capabilities*.
|
| Bias disclaimer: I work at Google, but not on Gemini. If
| I ask Gemini to produce an SVG file, it will sometimes do
| so and sometimes say "sorry, I can't, I can only produce
| raster images". I cannot deterministically produce either
| behavior - it truly seems to vary randomly.
| IanCal wrote:
| You could run an LLM deterministically too.
|
| We're often explicitly adding in randomness to the
| results so it feels weird to then accuse them of not
| being intelligent after we deliberately force them off
| the path.
| danielbln wrote:
| You'd need to restore more than memory/knowledge. You'd
| need to restore the full human, and in the exact same
| condition (inside and out).
|
| Ask me some question before bed and again after waking
| up, I'll probably answer it at night but in the morning
| tell you to sod off until I had coffee.
| ben_w wrote:
| The concept of "understand" is itself ill-defined -- or, to put
| it another way, _not understood_.
| danielbln wrote:
| There are some very strong and very unfounded assumptions in
| your comment. Is there anything more substantial there other
| than "that's what it feels like to me"?
| logicchains wrote:
| I'd argue that it's not productive to use any definition of AGI
| coined after 2020, to avoid the fallacy of shifting the
| goalposts.
| Borealid wrote:
| I think there's a single definition of AGI that will stand
| until the singularity:
|
| "An AGI is a human-created system that demonstrates iteratively
| improving its own conceptual design without further human
| assistance".
|
| Note that a "conceptual design" here does not include tweaking
| weights within an already-externally-established formula.
|
| My reasoning is thus:
|
| 1. A system that is only capable of acting with human
| assistance cannot have its own intelligence disentangled from
| the humans'
|
| 2. A system that is only intelligent enough to solve problems
| that somehow exclude problems with itself is not "generally"
| intelligent
|
| 3. A system that can only generate a single round of
| improvements to its own designs has not demonstrated
| improvements to those designs, as if iteration N+1 were truly
| superior to iteration N, it would be able to produce iteration
| N+2
|
| 4. A system that is not capable of changing its own design is
| incapable of iterative improvement, as there is a maximum
| efficacy within any single framework
|
| 5. A system that could improve itself in theory and fails to do
| so in practice has not demonstrated intelligence
|
| It's pretty clear that no current-day system has hit this
| milestone; if some program had, there would no longer be a need
| for continued investment in algorithm design (or computer
| science, or most of humanity...).
|
| A program that randomly mutates its own code could self-improve
| in theory but fails to do so in practice.
|
| I don't think these goalposts have moved in the past or need to
| move in the future. This is what it takes to cause the
| singularity. The movement recently has been people trying to
| sell something less than this as an AGI.
| logicchains wrote:
| AGI means "artificial general intelligence", it's got nothing
| to do with the singularity (which requires "artificial
| superior intelligence"; ASI). Requiring AGI to have
| capabilities that most humans lack is moving the goal post
| WRT how it was originally defined.
| jpc0 wrote:
| I don't think these are capabilities humans do not have; this
| to me is the one capability humans distinctly have over
| LLMs: the ability to introspect and shape their own future.
|
| I feel this definition doesn't require a current LLM model
| to be able to change its own workings, but to be able to
| generate a guided next generation.
|
| It's possible that LLMs can surpass human beings, purely
| because I believe we will inevitably be limited by short-
| term storage constraints, which LLMs will not. It will be a
| bandwidth vs throughput question. An LLM will have a much
| larger although slightly slower store of knowledge than
| what humans have, but will be much quicker than a human at
| looking up and validating the data.
|
| We aren't there yet.
| gom_jabbar wrote:
| > The movement recently has been people trying to sell
| something less than this as an AGI.
|
| Selling something that does not yet exist is an essential
| part of capitalism, which - according to the main thesis of
| philosophical Accelerationism - is (teleologically) identical
| to AI. [0] It's sometimes referred to as _Hyperstition_ ,
| i.e. fictions that make themselves real.
|
| [0] https://retrochronic.com
| TheAceOfHearts wrote:
| I really dislike this framing. Historically we've been very
| confused about what AGI means because we don't actually
| understand it. We're still confused so most working definitions
| have been iterated upon as models acquire new capabilities.
| It's akin to searching something in the fog of war: you set a
| course or destination because you think that's the approximate
| direction where the thing will be found, but then you get there
| and realize you were wrong so you continue exploring.
|
| Most people have a rough idea of what AGI means, but we still
| haven't figured out an exact definition that lines up with
| reality. As we continue exploring the idea space, we'll keep
| figuring out which parameters place boundaries and requirements
| on what AGI means.
|
| There's no reason to just accept an ancient definition from
| someone who was confused and didn't know any better at the time
| when they invented their definition. Older definitions were
| just shots in the dark that pointed in a general direction, but
| there's no guarantee that they would hit upon the exact
| destination.
| skybrian wrote:
| What's clear is that AI is unreliable in general and must be
| tested on specific tasks. That might be human review of a single
| output or some kind of task-specific evaluation.
|
| It's bad luck for those of us who want to talk about how good or
| bad they are in general. Summary statistics aren't going to tell
| us much more than a reasonable guess as to whether a new model is
| worth trying on a task we actually care about.
| simonw wrote:
| Right: we effectively all need our own evals for the tasks that
| matter to us... but writing those evals continues to be one of
| the least well documented areas of how to effectively use LLMs.
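|
| The shape of a personal eval can be tiny. A rough sketch, where
| call_model is a stub you'd wire up to whatever model or API
| you're actually testing:
|
|   def call_model(prompt: str) -> str:
|       raise NotImplementedError("plug in your model/API here")
|
|   # (prompt, check) pairs for a task you personally care about
|   CASES = [
|       ("What is 17 * 23?", lambda out: "391" in out),
|       ("Give the ISO 8601 date for Christmas 2025.",
|        lambda out: "2025-12-25" in out),
|   ]
|
|   def run_evals():
|       passed = 0
|       for prompt, check in CASES:
|           ok = check(call_model(prompt))
|           passed += ok
|           print("PASS" if ok else "FAIL", prompt)
|       print(f"{passed}/{len(CASES)} passed")
|
|   run_evals()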
| mellosouls wrote:
| The capabilities of AI post GPT-3 have become extraordinary and
| clearly in many cases superhuman.
|
| However (as the article admits) there is still no general
| agreement of what AGI is, or how we (or even if we can) get there
| from here.
|
| What there is is a growing and often naive excitement that
| anticipates it as coming into view, and unfortunately that will
| be accompanied by the hype-merchants desperate to be first to
| "call it".
|
| This article seems reasonable in some ways but unfortunately
| falls into the latter category with its title and sloganeering.
|
| "AGI" in the title of any article should be seen as a cautionary
| flag. On HN - if anywhere - we need to be on the alert for this.
| ashoeafoot wrote:
| AGI is an anonymous good model coming around the corner with no
| company and no LLM researchers attached. AGI is when the LLM
| hype train threads are replaced with CEOs and laid-off
| researchers demanding UBI.
| MichaelZuo wrote:
| Yeah, formal agreement seems exceedingly unlikely, since there
| isn't even agreement on the definition of "Artificial
| Intelligence".
| ben_w wrote:
| It's easy to treat AGI as one thing -- I did so myself before
| everyone's differing reaction to LLMs made me realise we all
| mean different things by each of the three letters of the
| initialism, and that none of those initials are really
| boolean valued.
|
| Given how Dutch disease[0] is described, I suspect that if
| the "G" (general) increases with fixed "I" (intelligence), as
| the proportion of economic activity for which the Pareto
| frontier is AI rather than human expands, I think humans will
| get pay rises for the remaining work right up until they get
| unemployable.
|
| On the other hand, if "G" is fully general and it's "I" which
| rises for a suitable cost[1], it goes through IQ 55
| (displacing no workers) to IQ 100 (probably close to half of
| workers redundant, but mean of population doesn't have to
| equal mean of workforce), to IQ 145 (almost everyone
| redundant), to IQ 200 (definitionally renders everyone
| redundant).
|
| [0] https://en.wikipedia.org/wiki/Dutch_disease
|
| [1] A fully-general AGI with the equivalent of IQ 200 on any
| possible test, still can't replace a single human if it costs
| 200 trillion USD per year to run.
| jjeaff wrote:
| I suspect AGI will be one of those things that you can't
| describe it exactly, but you'll know it when you see it.
| ninetyninenine wrote:
| I suspect everyone will call it a stochastic parrot because
| it did this one thing not right. And this will continue into
| the far, far future; even when it becomes sentient, we will
| completely miss it.
| Jensson wrote:
| Once it has pushed most humans out of white collar labor, so
| that the remaining humans work in blue collar jobs, people
| won't say it's just a stochastic parrot.
| myk9001 wrote:
| Maybe, maybe not. The power loom pushed a lot of humans out
| of textile factory jobs, yet no one claims the power loom
| is AGI.
| Jensson wrote:
| Not a lot, I mean basically everyone, to the point where
| most companies don't need to pay humans to think
| anymore.
| myk9001 wrote:
| Well, I'm too lazy to look up how many weavers were
| displaced back then and that's why I said a lot. Maybe
| all of them, since they weren't trained to operate the
| new machines.
|
| Anyway, sorry for a digression, my point is LLM replacing
| white collar workers doesn't necessarily imply it's
| generally intelligent -- it may but doesn't have to be.
|
| Although if it gets to a point where companies are
| running dark office buildings (by analogy with dark
| factories) -- yes, it's AGI by then.
| jimbokun wrote:
| Or become shocked to realize humans are basically
| statistical parrots too.
| AstralStorm wrote:
| It's more than that but less than intelligence.
|
| Its generalization capabilities are a bit on the low side,
| and memory is relatively bad. But it is much more than just
| a parrot now; it can handle _some_ basic logic, but not
| follow given patterns correctly for novel problems.
|
| I'd liken it to something like a bird, extremely good at
| specialized tasks but failing a lot of common ones unless
| repeatedly shown the solution. It's not a corvid or a
| parrot yet. Fails rather badly at detour tests.
|
| It might be sentient already, though. Someone needs to run a
| test of whether it can discern itself from another instance
| of itself in its own work.
| Jensson wrote:
| > It might be sentient already, though. Someone needs to
| run a test of whether it can discern itself from another
| instance of itself in its own work.
|
| It doesn't have any memory, how could it tell itself from
| a clone of itself?
| AstralStorm wrote:
| Similarity match. For that you need to understand
| reflexively how you think and write.
|
| It's a fun test to give a person something they have
| written but do not remember. Most people can still spot
| it.
|
| It's easier with images though. Especially a mirror. For
| DALL-E, the test would be whether it can discern its own work
| from a human-generated image, especially if you give it an
| imaginative task like drawing a representation of itself.
| NitpickLawyer wrote:
| > but you'll know it when you see it.
|
| I agree, but with the caveat that it's getting harder and
| harder with all the hype / doom cycles and all the goalpost
| moving that's happening in this space.
|
| IMO if you took gemini2.5 / claude / o3 and showed it to
| people from ten / twenty years ago, they'd say that it is
| unmistakably AGI.
| Jensson wrote:
| > IMO if you took gemini2.5 / claude / o3 and showed it to
| people from ten / twenty years ago, they'd say that it is
| unmistakably AGI.
|
| No they wouldn't, since those still can't replace human
| white collar workers even at many very basic tasks.
|
| Once AGI is here most white collar jobs are gone, you'd
| only need to hire geniuses at most.
| zaptrem wrote:
| Which part of "General Intelligence" requires replacing
| white collar workers? A middle schooler has general
| intelligence (they know about and can do a lot of things
| across a lot of different areas) but they likely can't
| replace white collar workers either. IMO GPT-3 was AGI,
| just a pretty crappy one.
| Jensson wrote:
| > A middle schooler has general intelligence (they know
| about and can do a lot of things across a lot of
| different areas) but they likely can't replace white
| collar workers either.
|
| Middle schoolers replace white collar workers all the
| time; it takes 10 years for them to do it, but they can do
| it.
|
| No current model can do the same since they aren't able
| to learn over time like a middle schooler.
| sebastiennight wrote:
| Compared to someone who graduated middle school on
| November 30th, 2022 (2.5 years ago), would you say that
| today's gemini 2.5 pro has NOT gained intelligence
| faster?
|
| I mean, if you're a CEO or middle manager and you have
| the choice of hiring this middle schooler for general
| office work, or today's gemini-2.5-pro, are you 100%
| saying the ex-middle-schooler is definitely going to give
| you best bang for your buck?
|
| Assuming you can either pay them $100k a year, or spend
| the $100k on gemini inference.
| Jensson wrote:
| > would you say that today's gemini 2.5 pro has NOT
| gained intelligence faster?
|
| Gemini 2.5 pro the model has not gained any intelligence
| since it is a static model.
|
| New models are not the models learning; it is humans
| creating new models. The trained models have access to all
| the same material and knowledge a middle schooler has as
| they go on to learn how to do a job, yet they fail to
| learn the job while the kid succeeds.
| ben_w wrote:
| > Gemini 2.5 pro the model has not gained any
| intelligence since it is a static model.
|
| Surely that's an irrelevant distinction, from the point
| of view of a hiring manager?
|
| If a kid takes ten years from middle school to being
| worth hiring, then the question is "what new AI do you
| expect will exist in 10 years?"
|
| How the model comes to be, doesn't matter. Is it a fine
| tune on more training data from your company docs and/or
| an extra decade of the internet? A different
| architecture? A different lab in a different country?
|
| Doesn't matter.
|
| Doesn't matter for the same reason you didn't hire the
| kid immediately out of middle school, and hired someone
| else who had already had another decade to learn more in
| the meantime.
|
| Doesn't matter for the same reason that different flesh
| humans aren't perfectly substitutable.
|
| You pay to solve a problem, not to specifically have a
| human solve it. Today, not in ten years when today's
| middle schooler graduates from university.
|
| And that's even though I agree that AI today doesn't
| learn effectively from as few examples as humans need.
| bayarearefugee wrote:
| There's no way to be sure in either case, but I suspect
| their impressions of the technology ten or twenty years ago
| would be not so different from my experience of first using
| LLMs a few years ago...
|
| Which is to say complete amazement followed quickly by
| seeing all the many ways in which it absolutely falls flat
| on its face revealing the lack of actual thinking, which is
| a situation that hasn't fundamentally changed since then.
| mac-mc wrote:
| When it can replace a polite, diligent, experienced 120 IQ
| human in all tasks. So it has a consistent long-term
| narrative memory, doesn't "lose the plot" as you interact
| longer and longer with it, can pilot robots to do physical
| labor without much instruction (the current state of the art
| is not that; a trained human will still do much better), can
| drive cars, can generate images without goofy non-human
| style errors, etc.
| NitpickLawyer wrote:
| > experienced 120 IQ human in all tasks.
|
| Well, that's the 91st percentile already. I know the terms
| are hazy, but that seems closer to ASI than AGI from that
| perspective, no?
|
| I think I do agree with you on the other points.
| sebastiennight wrote:
| I don't think so, and here's my simple proof:
|
| You and I could sit behind a keyboard, role-playing as the
| AI in a reverse Turing test, typing away furiously at the
| top of our game, and if you told someone that their job is
| to assess our performance (thinking they're interacting
| with a computer), they would _still_ conclude that we are
| _definitely_ not AGI.
|
| This is a battle that can't be won at any point because
| it's a matter of faith for the forever-skeptic, not facts.
| Jensson wrote:
| > I don't think so, and here's my simple proof:
|
| That isn't a proof since you haven't ran that test, it is
| just a thought experiment.
| ben_w wrote:
| I've been accused a few times of being an AI, even here.
|
| (Have you not experienced being on the receiving end of
| such accusations? Or do I just write weird?)
|
| I think this demonstrates the same point.
| afro88 wrote:
| This is part of what the article is about
| torginus wrote:
| I still can't have an earnest conversation or bounce ideas
| off of any LLM - all of them seem to be a cross between a
| sentient encyclopedia and a constraint solver.
|
| They might get more powerful but I feel like they're still
| missing something.
| itchyjunk wrote:
| Why are you not able to have an earnest conversation with
| an LLM? What kind of ideas are you not able to bounce of
| LLMs? These seem to be the type of use cases where LLMs
| have generally shined for me.
| HDThoreaun wrote:
| I felt this way until I tried Gemini 2.5. Imo it fully
| passes the Turing test unless you're specifically utilizing
| tricks that LLMs are known to fall for.
| jimbokun wrote:
| We have all seen it and are now just in severe denial.
| dgs_sgd wrote:
| This is actually how a supreme court justice defined the test
| for obscenity.
|
| > The phrase "I know it when I see it" was used in 1964 by
| United States Supreme Court Justice Potter Stewart to
| describe his threshold test for obscenity in Jacobellis v.
| Ohio
| sweetjuly wrote:
| The reason why it's so famous though (and why some people
| tend to use it in a tongue in cheek manner) is because "you
| know it when you see it" is a hilariously unhelpful and
| capricious threshold, especially when coming from the
| Supreme Court. For rights which are so vital to the fabric
| of the country, the Supreme Court recommending we hinge
| free speech on--essentially--unquantifiable vibes is equal
| parts bizarre and out of character.
| DesiLurker wrote:
| my 2c on this is that if you interact with any current llm
| enough you can mentally 'place' its behavior and responses.
| when we truly have AGI+/ASI my guess is that it will be like
| that old adage of blind men feeling & describing an elephant
| for the first time. we just won't be able to fully understand
| its responses. it would always be something left hanging and
| then eventually we'll just stop trying. that would be the time
| when the exponential improvement really kicks in.
|
| it should suffice to say we are nowhere near that, and I don't
| even believe LLMs are the right architecture for that.
| Zambyte wrote:
| I think a reasonable definition of intelligence is the
| application of reason on knowledge. An example of a system that
| is highly knowledgeable but has little to no reason would be an
| encyclopedia. An example of a system that is highly reasonable,
| but has little knowledge would be a calculator. Intelligent
| systems demonstrate both.
|
| Systems that have general intelligence are ones that are
| capable of applying reason to an unbounded domain of knowledge.
| Examples of such systems include: libraries, wikis, and forums
| like HN. These systems are not AGI, because the reasoning
| agents in each of these systems are organic (humans); they are
| more like a cyborg general intelligence.
|
| Artificial general intelligence are just systems that are fully
| artificial (ie: computer programs) that can apply reason to an
| unbounded domain of knowledge. We're here, and we have been for
| years. AGI sets no minimum as to how great the reasoning must
| be, but it's obvious to anyone who has used modern generative
| intelligence systems like LLMs that the technology can be used
| to reason about an unbounded domain of knowledge.
|
| If you don't want to take my word for it, maybe Peter Norvig
| can be more convincing:
| https://www.noemamag.com/artificial-general-intelligence-is-...
| jimbokun wrote:
| Excellent article and analysis. Surprised I missed it.
|
| It is very hard to argue with Norvig's arguments that AGI has
| been around since at least 2023.
| conception wrote:
| I think the thing missing would be memory. The knowledge of
| current models is more or less static save for whatever you
| can cram into their context window. I think if they had
| memory and thus the ability to learn - "oh hey, I've already
| tried to solve a bug in these ways maybe I won't get stuck in
| loop on them!" Would be the agi push for me. Real time
| incorporating new knowledge into the model is the missing
| piece.
| yeahwhatever10 wrote:
| This is the forum that fell the hardest for the superconductor
| hoax a few years ago. HN has no superiority leg to stand on.
| nightmunnas wrote:
| Low agreeableness will actually be extremely useful in many use
| cases, such as scientific discovery and of course programming
| assistance. It's amazing that this avenue hasn't been explored
| more deeply.
| Jensson wrote:
| It's much easier to sell an agreeable assistant than a
| disagreeable one, so it isn't that strange that the
| alternative isn't explored.
| j_timberlake wrote:
| The exact definition of AGI is pretty much the least
| interesting thing about AGI. It's basically bike-shedding at
| this point: arguing about something easy to understand instead
| of tackling the really hard questions like "how competent can
| AI get before it's too dangerous to be in the hands of flakey
| tech companies?"
| mrshadowgoose wrote:
| I've always felt that trying to pin down the precise definition
| of AGI is as useless as trying to pin down "what it means to
| truly understand". It's a mental trap for smart people, that
| distracts them from focusing on the impacts of hard-to-define
| concepts like AGI.
|
| AGI doesn't need to be "called", and there is no need for
| anyone to come to an agreement as to what its precise
| definition is. But at some point, we will cross that hard-to-
| define threshold, and the economic effects will be felt almost
| immediately.
|
| We should probably be focusing on how to prepare society for
| those changes, and not on academic bullshit.
| throwup238 wrote:
| It's definitely a trap for those who aren't familiar with the
| existing academic work in philosophy, cognition, and
| neuroscience. There are no definitive answers but there are
| lots of relatively well developed ideas and concepts that
| everyone here on HN seems completely ignorant of, even though
| some of the ideas were developed by industry giants like
| Marvin Minsky.
|
| Stuff like society of mind (Minsky), embodied cognition
| (Varela, Rosch, and Thompson), connectionist or subsymbolic
| views (Rumelhart), multiple intelligences (Gardner),
| psychometric and factor-analytic theories (Carroll), and all
| the other work like E. Hutchins. They're far from just
| academic wankery, there's a lot of useful stuff in there,
| it's just completely ignored by the AI crowd.
| dheera wrote:
| I spent some amount of time trying to create a stock/option
| trading bot to exploit various market inefficiencies that
| persist, and did a bunch of code and idea bouncing off these
| LLMs. What I found is that even all the various incarnations of
| GPT 4+ and GPT o+ routinely kept falling for the "get rich
| quick" option strategies all over the internet that don't work.
|
| In cases where 95%+ of the information on the internet is
| misinformation, the current incarnations of LLMs have a really
| hard time sorting out and filtering out the 5% of information
| that's actually valid and useful.
|
| In that sense, current LLMs are not yet superhuman at all,
| though I do think we can eventually get there.
| jimbokun wrote:
| So they are only as smart as most humans.
| daxfohl wrote:
| Until you can boot one up, give it access to a VM's video and
| audio feeds and keyboard and mouse interfaces, give it an email
| and chat account, tell it where the company onboarding docs
| are, and expect it to be a productive team member, they're not
| AGI. So long as we need special protocols like MCP and A2A,
| rather than expecting them to figure out how to collaborate
| like a human, they're not AGI.
|
| The first step, my guess, is going to be the ability to work
| through github issues like a human, identifying which issues
| have high value, asking clarifying questions, proposing
| reasonable alternatives, knowing when to open a PR, responding
| to code review, merging or abandoning when appropriate. But
| we're not even very close to that yet. There's some of it, but
| from what I've seen most instances where this has been
| successful are low level things like removing old feature
| flags.
| rafaelmn wrote:
| Just because we rely on vision to interface with computer
| software doesn't mean it's optimal for AI models. Having a
| specialized interface protocol is orthogonal to capability.
| Just like you could theoretically write code in a
| proportional font in Notepad and run your tools through
| Windows CMD - having an editor with syntax highlighting and
| monospaced font helps you read/navigate/edit, having
| tools/navigation/autocomplete etc. optimized for your flow
| makes you more productive and expands your capability, etc.
|
| If I forced you to use unnatural interfaces it would severely
| limit your capabilities as well because you'd have to
| dedicate more effort towards handling basic editing tasks. As
| someone who recently swapped to a split 36-key keyboard with a
| new layout I can say this becomes immediately obvious when
| you try something like this. You take your typing/editing
| skills for granted - try switching your setup and see how
| your productivity/problem solving ability tanks in practice.
| daxfohl wrote:
| Agreed, but I also think to be called AGI, they should be
| capable of working through human interfaces rather than
| needing to have special interfaces created for them to get
| around their lack of AGI.
|
| The catch in this though isn't the ability to use these
| interfaces. I expect that will be easy. The hard part will
| be, once these interfaces are learned, the scope and search
| space of what they will be able to do is infinitely larger.
| And moreover our expectations will change in how we expect
| an AGI to handle itself when our way of working with it
| becomes more human.
|
| Right now we're claiming nascent AGI, but really much of
| what we're asking these systems to do has been laid out
| for them: a limited set of protocols and interfaces, and a
| targeted set of tasks to which we normally apply these
| things. And our expectations are set accordingly. We don't
| converse with them as with a human. Their search space is
| much smaller. So while they appear AGI in specific tasks, I
| think it's because we're subconsciously grading them on a
| curve. The only way we have to interact with them
| prejudices us to have a very low bar.
|
| That said, I agree that video feed and mouse is a terrible
| protocol for AI. But _that_ said, I wouldn't be surprised
| if that's what we end up settling on. Long term, it's just
| going to be easier for these bots to learn and adapt to use
| human interfaces than for us to maintain two sets of
| interfaces for things, except for specific bot-to-bot
| cases. It's horribly inefficient, but in my experience
| efficiency never comes out ahead with each new generation
| of UIs.
| Closi wrote:
| This is an incredibly specific test/definition of AGI -
| particularly remembering that I would probably say an octopus
| classes as an intelligent being yet can't use Outlook...
| Rebuff5007 wrote:
| > clearly in many cases superhuman
|
| In what cases is it superhuman exactly? And what humans are you
| comparing against?
|
| I'd bet that for any discipline you chose, one could find an
| expert in that field that can trick any of today's post-GPT-3
| AIs.
| tomrod wrote:
| I agree with Professor Mollick that the capabilities in specific
| task categories are becoming superhuman -- a precursor for AGI.
|
| Until those capabilities are expanded for model self-improvement
| -- including being able to adapt its own infrastructure, code,
| storage, etc. -- then I think AGI/ASI are yet to be realized. My
| reference points are Skynet, Traveler's "The Director", and
| Person of Interest's "The Machine" and "Samaritan." The ability
| to target a potentially inscrutable goal, along with the
| self-agency to direct itself towards it, is true "AGI" in my
| book. We have a lot of
| components that we can reason are necessary, but it is unclear to
| me that we get there in the next few months.
| airstrike wrote:
| I don't think we should take it as a given that these are truly
| precursors for AGI.
|
| We may be going about it the wrong way entirely and need to
| backtrack and find a wholly new architecture, in which case
| current capabilities would _predate_ AGI but not be
| _precursors_.
| Y_Y wrote:
| Here's o4-mini-high solving riddles:
|
| > What weighs more; a pound of courage or a pound coin?
|
| > They weigh the same--each is a "pound." A pound of courage and
| a pound coin both come out to one pound in weight!
|
| https://chatgpt.com/share/68053ad0-aae4-800d-95ff-5b5b0a9a38...
|
| I don't know what AGI is, but I know this isn't it.
| hannofcart wrote:
| Sonnet 3.7 (Thinking) gave me:
|
| :::thinking This question is playing with words and units.
|
| A "pound of courage" refers to a metaphorical measurement of
| the abstract concept of courage, using the imperial weight unit
| "pound" (which equals 453.59 grams).
|
| A "pound coin" refers to the physical currency of the United
| Kingdom, which is called a "pound" but is a unit of currency,
| not weight. A physical pound coin weighs about 8-9 grams
| (specifically, the current 12-sided pound coin weighs 8.75
| grams).
|
| So comparing them:
|
| - A pound of courage would weigh 453.59 grams (if courage had
| physical weight, which it doesn't since it's an abstract concept)
| - A pound coin weighs approximately 8.75 grams
|
| The trick in this question is that it's comparing:
|
| 1. A specific weight of an abstract concept (pound of courage)
| 2. A physical coin whose name includes the word "pound" but
| refers to currency, not weight
|
| Since the question asks what weighs more, and weight is a
| physical property, I should point out that courage is abstract
| and doesn't have physical weight, while a pound coin is a
| physical object with a measurable weight. :::
|
| # Comparing "A Pound of Courage" vs "A Pound Coin"
|
| This is a play on words with different meanings of "pound":
|
| - A *pound of courage* refers to a metaphorical pound (weight
| unit) of an abstract concept
| - A *pound coin* is a physical British currency coin (which
| weighs about 8.75 grams)
|
| If we're talking about physical weight:
|
| - A pound of courage doesn't have physical weight since courage
| is abstract
| - A pound coin has a definite weight of approximately 8.75 grams
|
| If we interpreted "pound of courage" as literally one pound
| (453.59 grams) of something, it would weigh significantly more
| than the coin.
|
| This question is clever because it compares a metaphorical
| measurement with a physical object whose name includes the word
| "pound" but in a different context.
| jpc0 wrote:
| Yet I would draw the analogy the other way; maybe you can
| prompt the AI into that chain of thought, but in my experience
| that doesn't happen on its own.
|
| I would read it along the lines of the value of a human being
| based on traits vs. payment, which is likely what the original
| phrase intended.
|
| Is paying someone more better than getting a better candidate
| and paying them less?
| boznz wrote:
| If I ask a cancer specialist "Do I have Cancer?" I really
| don't want to prompt them with "can you think a bit harder on
| that"
| pbhjpbhj wrote:
| Courage is a beer, a kilo of Courage weighs a kilo.
| simianwords wrote:
| I thought o1 pro could have solved this riddle
|
| > A young boy who has been in a car accident is rushed to the
| emergency room. Upon seeing him, the surgeon says, "I can operate
| on this boy!" How is this possible?
|
| But it didn't!
| simonw wrote:
| Hah, yeah that still catches out o4-mini and o3 too. Amusingly,
| adding "It's not the riddle." to the end fixes that.
|
| (o4-mini high thought for 52 seconds and even cheated and
| looked up the answer on Hacker News: https://chatgpt.com/share/
| 68053c9a-51c0-8006-a7fc-75edb734c2...)
| VladimirOrlov wrote:
| Is this for real? All this hype is very, very old hype, and
| nothing fundamentally new (yet) since the 1960s. It looks like
| every software upgrade is a "revolution" or a "revelation".
| Please compare 'Win 3.1' and 'Win 11': some progress? Sure! Is
| there any "intelligence" there? No! No! No! What is the
| difference? Who is constantly lying, and why? What is the
| reason for these systematic and persistent lies? P.S. I
| personally think that someday we will have "semi-smart"
| computer systems, and that in 5-10 years we will learn more
| about what is possible and real and what is not (regarding
| "semi-smart" computer systems). Until then... hold your horses
| (please), so to speak.
| low_tech_love wrote:
| The first thing I want AGI to do is to be able to tell me when it
| doesn't know something, or when it's not certain, so at least
| give me a heads up to set expectations correctly. I ran my own
| personal "benchmark" on Gemini 2.5 and it failed just like all
| others. I told it that I was playing an old point-and-click
| adventure game from the mid-90s and I was stuck on a certain
| part, and asked for spoiler-light hints on what to do next. Not
| only can they not give me hints, they completely hallucinate the
| game and invent some weird, nonsensical solutions. Every single
| model does this. Even if I tell them to give up and just give me
| the solution, they come up with some non-existent solution.
|
| I wonder how hard it is to objectively use information that has
| been available online for 30 years. But the worst part is how it lies
| and pretends it knows what it's talking about, and when you point
| it out it simply turns into another direction and starts lying
| again. Maybe the use case here is not the main focus of modern
| AI; maybe modern AI is about generating slop that does not
| require verification, because it's "new" content. But to me it
| just sounds like believable slop, not AGI.
| irthomasthomas wrote:
| Here's an example of how my agent handles this:
|
| Gathering context for user request...
|
| Context gathering - Attempting to answer question via LLM:
|   Are there existing Conversation classes in the ecosystem
|   this should extend?
| Context gathering - LLM provided answer: "No"
|
| Context gathering - Attempting to answer question via LLM:
|   How should model selection work when continuing a previous
|   conversation?
| Context gathering - LLM answer was UNKNOWN, asking user.
| Asking user: How should model selection work when continuing
|   a previous conversation?
| Context gathering - received user response to question: "How
|   should model selection work when continuing a previous
|   conversation?"
|
| Context gathering - finished processing all user questions
| Context gathering - processing command executions...
| Context gathering - executing command:
|   sqlite3 $(find . -name llm_conversations.db) .tables
| Context gathering - command execution completed
| Context gathering - executing command: grep -r Conversation tests/
| Context gathering - command execution completed
| Context gathering - executing command: grep -h conversation_id *py
| Context gathering - command execution completed
| Context gathering - finished processing all commands
|
| Analyzing task complexity and requirements...
| DEBUG: reasoning_model: openrouter/google/gemini-2.5-pro-preview-03-25
| Task classified as coding (confidence: 1.0)
| Task difficulty score: 98.01339999999999/100
| Selected primary reasoning model: claude-3.7-sonnet
| get_reasoning_assistance:[:214: integer expression expected: 98.01339999999999
| Reasoning assistance completed in 39 seconds
| Calling LLM with model: claude-3.7-sonnet
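|
| The gist of the flow above: ask the LLM each open question, fall
| back to the user only when it answers UNKNOWN, then run read-only
| shell commands to gather more context. A rough sketch of that
| loop (ask_llm and ask_user are hypothetical callbacks, not the
| agent's real code):
|
|   import subprocess
|
|   def gather_context(questions, commands, ask_llm, ask_user):
|       # Try to answer each open question with the LLM first;
|       # escalate to the user only when the LLM answers UNKNOWN.
|       answers = {}
|       for q in questions:
|           a = ask_llm(f"Answer from the repo context or say UNKNOWN: {q}")
|           if a.strip().upper() == "UNKNOWN":
|               a = ask_user(q)
|           answers[q] = a
|       # Then run read-only shell commands (the sqlite3/grep calls
|       # above) to collect extra context for the main model call.
|       outputs = {}
|       for cmd in commands:
|           outputs[cmd] = subprocess.run(
|               cmd, shell=True, capture_output=True, text=True
|           ).stdout
|       return answers, outputs
|
| Incidentally, the "[: integer expression expected" line looks
| like bash's integer-only test being handed the float difficulty
| score; rounding the score first (or comparing it with awk or bc)
| would avoid that.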
| aylmao wrote:
| > I've always been a staunch defender of capitalism and free
| markets, even though that's historically been an unpopular
| opinion in my particular social circle. Watching the LLM market,
| I can't help but feel extremely vindicated.
|
| > The brutal and bruising competition between the tech giants has
| left nothing but riches for the average consumer.
|
| Capitalism has always been great at this: creating markets,
| growing them, producing new goods. It's widely acknowledged
| amongst people who actually seek to gain an understanding of
| Marxism, and don't just stay in the surface-level black-and-white
| "socialism and capitalism are opposites" discourse that's very
| common in the West, especially the USA, especially after
| McCarthy's Red Scare.
|
| The problem is what comes once the market is grown and the only
| way for owners to keep profits growing is: 1. consolidating into
| monopolies or cartels, so competition doesn't get in the way of
| profits, 2. squeezing the working class, looking to pay less for
| more work, and/or 3. abusing the natural world, to extract more
| materials or energy for less money. This is evident in plenty of
| developed industries: from health care, to broadcasting,
| telecommunications, fashion, etc.
|
| If we view Socialism for what it is, namely a system built to
| replace Capitalism's bad parts but keep its good parts, China's
| system, for example, starts to make more sense. Capitalism in a
| similar way was an evolution from Feudalism that replaced its
| bad parts to achieve greater liberty for everyone (liberty is
| largely absent under mature Feudalism, but great for society
| as a whole). Socialism is meant to be similar, aiming to
| achieve greater equity, which it views as much better for
| society as a whole.
| arrosenberg wrote:
| Agree with most of what you wrote, but China isn't capitalist,
| they're mercantilist with socialist policies. Capital is
| heavily constrained under Xi.
| myk9001 wrote:
| Letting models interact with systems outside their sandbox brings
| about some incredible applications. These applications truly seem
| to have the potential to deeply change entire professions.
|
| All that said, I wonder if GPT4 had been integrated with the same
| tools, would it've been any less capable?
|
| It sure could give you a search prompt for Google if you asked it
| to. Back then you had to copy and paste that search prompt
| yourself. Today o3 can do it on its own. Cool! Does that imply,
| though, that o3 is any closer to AGI than GPT4?
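|
| Put differently, the tool loop is part of the harness, not the
| model. A rough sketch (call_model and run_search are hypothetical
| stand-ins for whichever API and search backend you use):
|
|   def answer_with_search(question, call_model, run_search, max_steps=5):
|       # The harness does the searching; swapping GPT-4 for o3
|       # changes the quality of the queries the model emits, not
|       # the plumbing around it.
|       transcript = [f"User question: {question}"]
|       reply = ""
|       for _ in range(max_steps):
|           reply = call_model("\n".join(transcript))
|           if reply.startswith("SEARCH:"):
|               results = run_search(reply[len("SEARCH:"):].strip())
|               transcript.append(f"Search results: {results}")
|           else:
|               return reply
|       return reply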
|
| Models gaining access to external tools, however impressive from
| an applications standpoint, feels like lateral movement, not a
| step towards AGI.
|
| On the other hand, a model remaining isolated in its sandbox
| while actually learning to reason about that puzzle (assuming
| it's not present in the training data) would give off the
| AGI vibes.
| joshuanapoli wrote:
| The newer models are definitely more useful. Back in the GPT
| 3.5 and 4 days, AutoGPT applied the same types of tools, but
| you had to be pretty lucky for it to get anywhere. Now Claude
| 3.7, Gemini 2.5, GPT o3 make far fewer mistakes, and are
| better able to get back on-track when a mistake is discovered.
| So they're more convincing as intelligent helpers.
| myk9001 wrote:
| Good point. I still wonder if o3 has improved command of
| tools because it's significantly smarter in general. Or it's
| "just" trained with a specific focus on using tools better,
| if that makes sense.
| boznz wrote:
| I'm surprised nobody mentioned the video interview. I only
| watched the first 60 seconds and this is the first time I have
| seen or heard the author, but if I hadn't been told this was AI
| generated I would have assumed it was genuine and any 'twitching'
| was the result of the video compression.
| smusamashah wrote:
| How???? I can believe the guy in the video being AI because his
| lips are not perfectly synced. But the woman? Even with
| continuous silly exaggerated movement I have a hard time
| believing it's generated.
|
| A strand of her hair fell on her shoulder; because she was
| moving continuously (like crazy), it was moving too, in a
| perfectly believable way, and IT EVENTUALLY FELL OFF THE
| SHOULDER/SHIRT LIKE REAL HAIR and got mixed into the other
| fallen hair. How is that generated? It's too small a detail.
| Are there any artifacts on her side?
|
| Edit: she has to be real. Her lip movements are definitely
| forced/edited though. It has to be a video recording of her
| talking. And then a tool/AI has modified her lips to match the
| voice. If you look at her face and hand movements, her shut
| lips seem forced.
| -__---____-ZXyw wrote:
| I went and watched 10 seconds on account of your comment, and
| couldn't disagree more. The heads keep sort of rolling around
| in a disconcerting and quite eerie fashion?
| keernan wrote:
| I fail to see how LLMs are anything beyond a lookup function
| retrieving information from a huge database (containing, in
| theory, all known human information), and then summarizing the
| results using language algorithms.
|
| While incredibly powerful and transformative, it is not
| 'intelligence'. LLMs are forever knowledgebase bound. They are
| encyclopedias with a fancy way of presenting information looked
| up in the encyclopedia.
|
| The 'presentation' has no concept, awareness, or understanding of
| the information being presented - and never will. And this is the
| critical line. Without comprehension, a LLM is incapable of being
| creative. Of coming up with new ideas. It cannot ponder. Wonder.
| Think.
| dgs_sgd wrote:
| While it's hard to agree on what AGI is I think we can more
| easily agree on what AGI _is not_.
|
| I don't consider an AI that fails the surgery brain teaser in the
| article to be AGI, no matter how superhuman it is at other narrow
| tasks. It doesn't satisfy the "G" part of AGI.
| chrsw wrote:
| What about all the things that aren't strictly intelligence but I
| guess intelligence adjacent: autonomy, long term memory,
| motivation, curiosity, resilience, goals, choice, and maybe the
| biggest of them all: fear? Why would an AGI "want" do anything
| more than my calculator "wants" to compute an answer to some math
| problem I gave it? Without these things an AGI, or whatever, is
| just an extension of whoever is ultimately controlling it.
|
| And that's when we return to a much older and much more important
| question than whether Super LLM 10.0 Ultra Plus is AGI or not:
| how much power should a person or group of people be allowed to
| have?
| hiAndrewQuinn wrote:
| https://gwern.net/tool-ai is a quite comprehensive dive into
| why.
| snarg wrote:
| I honestly thought that we were agreed on the definition of AGI.
| My understanding classified it as a model that can build on its
| knowledge and better itself, teaching itself new tasks and
| techniques, adapting as necessary. I.e., not simply knowing
| enough techniques to impress some humans. By this definition, it
| doesn't matter if it's super-intelligent or if its knowledge is
| rudimentary, because given enough add-on hardware and power, it
| could become super-intelligent over time.
| gilbetron wrote:
| AGI that is bad at some things is still AGI. We have AGI, it is
| just bad at some things and hallucinates. It is literally smarter
| than many people I know, but that doesn't mean it can beat a
| human at everything. That would be ASI, which, hopefully, will take
| a while to get here.
|
| Although, I could be argued into calling what we have already ASI
| - take a human and Gemini 2.5, and put them through a barrage of
| omni-disciplinary questions and situations and problems. Gemini
| 2.5 will win, but not absolutely.
|
| AGI (we have). ASI (we might have). AOI (Artificial Omniscient
| Intelligence, will hopefully take a while to get here).
___________________________________________________________________
(page generated 2025-04-20 23:00 UTC)