[HN Gopher] Francois Chollet: The Arc Prize and How We Get to AG...
___________________________________________________________________
Francois Chollet: The Arc Prize and How We Get to AGI [video]
Author : sandslash
Score : 159 points
Date : 2025-07-03 14:00 UTC (4 days ago)
(HTM) web link (www.youtube.com)
(TXT) w3m dump (www.youtube.com)
| qoez wrote:
| I feel like I'm the only one who isn't convinced getting a high
| score on the ARC eval test means we have AGI. It's mostly about
| pattern matching (and some of it is ambiguous even for humans
| as to what the actual true response ought to be). It's like how
| in humans
| there's lots of different 'types' of intelligence, and just
| overfitting on IQ tests doesn't in my mind convince me a person
| is actually that smart.
| avmich wrote:
| Roughly speaking, the job of a medical doctor is to diagnose
| the patient - and then, after the diagnosis is made, to apply
| the healing from the book, corresponding to the diagnosis.
|
| The diagnosis is pattern matching (again, roughly). It kinda
| suggests that a lot of "intelligent" problems are focused on
| pattern matching, and (relatively straightforward) application
| of "previous experience". So, pattern matching can bring us a
| great deal towards AGI.
| AnimalMuppet wrote:
| Pattern matching is instinct. (Or at least, instinct is a
| kind of pattern matching. And once you learn the patterns,
| pattern matching can become almost instinctual). And that's
| fine, for things that fit the pattern. But a human-level
| intelligence can also deal with problems for which there is
| no pattern. (I mean, not always successfully - finding a
| correct solution to a novel problem is difficult. But it is
| within the capability of at least some humans.)
| yorwba wrote:
| I think the people behind the ARC Prize agree that getting a
| high score doesn't mean we have AGI. (They already updated the
| benchmark once to make it harder.) But an AGI should get a
| similarly high score as humans do. So current models that get
| very low scores are definitely not AGI, and likely quite far
| away from it.
| cubefox wrote:
| > I think the people behind the ARC Prize agree that getting
| a high score doesn't mean we have AGI
|
| The benchmark was literally called ARC-AGI. Only after OpenAI
| cracked it did they start backtracking and saying that it
| doesn't test for true AGI. Which undermines the whole premise
| of a benchmark.
| whiplash451 wrote:
| You're not the only one. ARC-AGI is a laudable effort, but its
| fundamental premise is indeed debatable:
|
| "We argue that human cognition follows strictly the same
| pattern as human physical capabilities: both emerged as
| evolutionary solutions to specific problems in specific
| evironments" (from page 22 of On the Measure of Intelligence)
|
| https://arxiv.org/pdf/1911.01547
| Davidzheng wrote:
| But I believe that because of this "uneven edge" thing people
| talk about, where AI weaknesses are not necessarily the same as
| human weaknesses, once we run out of tests on which AI is worse
| than humans, it will in effect already be very much superhuman.
| My main evidence for this is Leela Zero, the Go AI, which
| struggled with ladders and some other aspects of Go play well
| into the superhuman regime (in Go it's easier to tell when a
| system is superhuman because you have Elo ratings, win rates,
| etc., and there's less room for debate).
| energy123 wrote:
| https://en.m.wikipedia.org/wiki/AI_effect
|
| But on a serious note, I don't think Chollet would disagree.
| ARC is a necessary but not sufficient condition, and he says
| that, despite the unfortunate attention-grabbing name choice of
| the benchmark. I like Chollet's view that we will know that AGI
| is here when we can't come up with new benchmarks that separate
| humans from AI.
| loki_ikol wrote:
| Well for most, the next steps are probably towards removing the
| highly deterministic and discrete characteristics of current
| approaches (we certainly don't think in lockstep). There are no
| measures for those. Even the creative aspect is undermined by those
| characteristics.
| kubb wrote:
| AGI isn't defined anywhere, so it can be anything you want.
| FrustratedMonky wrote:
| Yes. And a lot of humans also don't pass for having AGI.
| mindcrime wrote:
| Oh, it's defined in lots of places. The problem is.. it's
| defined in _lots_ of places!
| oldge wrote:
| Today's LLMs are fancy autocomplete but lack test-time self-
| learning or persistent drive. By contrast, an AGI would
| require:
|
| - A goal-generation mechanism (G) that can propose objectives
| without external prompts
|
| - A utility function (U) and policy p(a|s) enabling action
| selection and hierarchy formation over extended horizons
|
| - Stateful memory (M) plus feedback integration to evaluate
| outcomes, revise plans, and execute real-world interventions
| autonomously
|
| Without G, U, p, and M operating together, LLMs remain reactive
| statistical predictors, not human-level intelligence.
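|
| A minimal sketch of that loop (the component names, the toy
| goals and actions, and the coin-flip environment are
| illustrative assumptions only, not an actual AGI design):
|
|   import random
|
|   class Memory:
|       """M: stateful memory with simple feedback integration."""
|       def __init__(self):
|           self.episodes = []
|
|       def record(self, goal, action, outcome):
|           self.episodes.append((goal, action, outcome))
|
|       def success_rate(self, goal):
|           hits = [o for g, _, o in self.episodes if g == goal]
|           return sum(hits) / len(hits) if hits else 0.0
|
|   def generate_goal(memory):
|       """G: propose an objective without an external prompt,
|       here by picking the goal with the worst track record."""
|       candidates = ["explore", "consolidate", "practice"]
|       return min(candidates, key=memory.success_rate)
|
|   def utility(goal, outcome):
|       """U: score an outcome relative to the current goal."""
|       return 1.0 if outcome else -0.1
|
|   def policy(state, goal):
|       """p(a|s): choose an action given state and goal (stub)."""
|       return random.choice(["act_a", "act_b", "act_c"])
|
|   # The loop: G -> p(a|s) -> environment -> U -> M, repeated.
|   memory = Memory()
|   state = {"step": 0}
|   for step in range(10):
|       goal = generate_goal(memory)
|       action = policy(state, goal)
|       outcome = random.random() > 0.5   # stand-in for the world
|       reward = utility(goal, outcome)
|       memory.record(goal, action, outcome)
|       state["step"] += 1
|       print(step, goal, action, reward)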
| KoolKat23 wrote:
| I'd say we're not far off.
|
| Looking at the human side, it takes a while to actually learn
| something. If you've recently read something it remains in
| your "context window". You need to dream about it, to think
| about, to revisit and repeat until you actually learn it and
| "update your internal model". We need a mechanism for
| continuous weight updating.
|
| Goal-generation is pretty much covered by your body
| constantly drip-feeding your brain various hormones as "ongoing
| input prompts".
| onemoresoop wrote:
| > I'd say we're not far off.
|
| How are we not far off? How can LLMs generate goals and
| based on what?
| NetRunnerSu wrote:
| Minimize prediction errors.
| tsurba wrote:
| But are we close to doing that in real-time on any
| reasonably large model? I don't think so.
| FeepingCreature wrote:
| You just train it on the goal. Then it has that goal.
|
| Alternately, you can train it on following a goal and
| then you have a system where you can specify a goal.
|
| At sufficient scale, a model will already contain goal-
| following algorithms because those help predict the next
| token when the model is base-trained on goal-following
| entities, i.e. humans. Goal-driven RL then brings those
| algorithms to prominence.
| kelseyfrog wrote:
| How do you figure goal generation and supervised goal
| training are interchangeable?
| kordlessagain wrote:
| Random goal use is proving to be more important than
| training. Although, last year someone trained on the fly
| during the competition, which is pretty awesome when you
| think about it.
| NetRunnerSu wrote:
| Yes, you're right, that's what we're doing.
|
| https://github.com/dmf-archive/PILF
| KoolKat23 wrote:
| Very interesting, thanks for the link.
| NetRunnerSu wrote:
| In fact, there is no technical threshold anymore. As long as
| the theory is in place, you could see such an AGI within half a
| year at most. It will even be more energy efficient than
| current dense models.
|
| https://dmf-archive.github.io/docs/posts/beyond-snn-
| plausibl...
| TheAceOfHearts wrote:
| Getting a high score on ARC doesn't mean we have AGI and
| Chollet has always said as much, AFAIK; it's meant to push the
| AI research space in a positive direction. Being able to solve
| ARC problems is probably a prerequisite to AGI. It's a
| directional push into the fog of war, with the claim being that
| we should explore that area because we expect it's relevant to
| building AGI.
| lostphilosopher wrote:
| We don't really have a true test that means "if we pass this
| test we have AGI" but we have a variety of tests (like ARC)
| that we believe any true AGI would be able to pass. It's a
| "necessary but not sufficient" situation. Also ties directly
| to the challenge in defining what AGI really means. You see a
| lot of discussions of "moving the goal posts" around AGI, but
| as I see it we've never had goal posts, we've just got a
| bunch of lines we'd expect to cross before reaching them.
| MPSimmons wrote:
| I don't think we actually even have a good definition of
| "This is what AGI is, and here are the stationary goal
| posts that, when these thresholds are met, then we will
| have AGI".
|
| If you judged human intelligence by our AI standards, then
| would humans even pass as Natural General Intelligence?
| Human intelligence tests are constantly changing, being
| invalidated, and rerolled as well.
|
| I maintain that today's modern LLMs would pass sufficiently
| for AGI and are also very close to passing a Turing Test, if
| measured as in 1950, when the test was proposed.
| fvdessen wrote:
| Turing test is not really that meaningful anymore because
| you can always detect the AI by text and timing patterns
| rather than actual intelligence. In fact the most
| reliable way to test for AI is probably to ask trivia
| questions on various niche topics; I don't think any
| human has as much breadth of general knowledge as current
| AIs.
| QuadmasterXLII wrote:
| The current definition and goal of AGI is "Artificial
| intelligence good enough to replace every employee for
| cheaper" and much of the difficulty people have in
| defining it is cognitive dissonance about the goal.
| fasterik wrote:
| _> I don't think we actually even have a good definition
| of "This is what AGI is, and here are the stationary goal
| posts that, when these thresholds are met, then we will
| have AGI"._
|
| Not only do we not have that, I don't think it's possible
| to have it.
|
| Philosophers have known about this problem for centuries.
| Wittgenstein recognized that most concepts don't have
| precise definitions but instead behave more like family
| resemblances. When we look at a family we recognize that
| they share physical characteristics, even if there's no
| single characteristic shared by all of them. They don't
| need to unanimously share hair color, skin complexion,
| mannerisms, etc. in order to have a family resemblance.
|
| Outside of a few well-defined things in logic and
| mathematics, concepts operate in the same way.
| Intelligence isn't a well-defined concept, but that
| doesn't mean we can't talk about different types of human
| intelligence, non-human animal intelligence, or machine
| intelligence in terms of family resemblances.
|
| Benchmarks are useful tools for assessing relative
| progress on well-defined tasks. But the decision of what
| counts as AGI will always come down to fuzzy comparisons
| and qualitative judgments.
| tedy1996 wrote:
| I have graduated with a degree in software engineering and
| I am bilingual (Bulgarian and English). Currently AI is
| better than me at everything except adding big numbers or
| writing code on really niche topics - for example, code
| golfing a Brainfuck interpreter or writing a Rubik's cube
| solver. I believe AGI has been here for at least a year
| now.
| fvdessen wrote:
| I suggest you try letting the AI think through race-condition
| scenarios in asynchronous programs; it is not that good at
| these abstract reasoning tasks.
| ummonk wrote:
| "Being able to solve ARC problems is probably a pre-requisite
| to AGI." - is it? Humans have general intelligence and most
| can't solve the harder ARC problems.
| adastra22 wrote:
| They, and the other posters posting similar things, don't
| mean human-like intelligence, or even the rigorously
| defined solving of unconstrained problem spaces that
| originally defined Artificial General Intelligence (in
| contrast to "narrow" intelligence").
|
| They mean an artificial god, and it has become a god of the
| gaps: we have made artificial general intelligence, and it
| is more human-like than god-like, and so to make a god we
| must have it do XYZ precisely because that is something
| which people can't do.
| ummonk wrote:
| Right, but there is a very clear term for that which they
| should be using: ASI
| satellite2 wrote:
| Didn't he say that 70% of a random sample of the population
| should get it right?
| singron wrote:
| https://arcprize.org/leaderboard
|
| "Avg. Mturker" has 77% on ARC1 and costs $3/task. "Stem
| Grad" has 98% on ARC1 and costs $10/task. I would love a
| segment like "typical US office employee" or something else
| in between, since I don't think you need a STEM degree to do
| better than 77%.
|
| It's also worth noting the "Human Panel" gets 100% on ARC2
| at $17/task. All the "Human" models are on the score/cost
| frontier and exceptional in their score range although too
| expensive to win the prize obviously.
|
| I think the real argument is that the ARC problems are too
| abstract and obscure to be relevant to useful AGI, but I
| think we need a little flexibility in that area so we can
| have tests that can be objectively and mechanically graded.
| E.g. "write a NYT bestseller" is an impractical test in
| many ways even if it's closer to what AGI should be.
| echelon wrote:
| My problem with AGI is the lack of a simple, concrete
| definition.
|
| Can we formalize it as giving out a task expressible in, say,
| n^m bytes of information that encodes a task of n^(m+q) real
| algorithmic and verification complexity -- then solving that
| task within a certain time, compute, and attempt bounds?
|
| Something that captures "the AI was able to unwind the
| underlying unspoken complexity of the novel problem".
|
| I feel like one could map a variety of easy human "brain
| teaser" type tasks to heuristics that fit within some
| mathematical framework and then grow the formalism from
| there.
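|
| One hedged way to write that down (n, m, q as in the comment;
| using a Kolmogorov-style description complexity K here is an
| assumption of mine): accept tasks d with |d| <= n^m bytes whose
| shortest solving-and-verifying program has complexity K(d) >=
| n^(m+q), and require them to be solved within fixed time,
| compute, and attempt budgets. Tracking how often a system
| clears that gap as q grows would be one concrete reading of
| "unwinding the underlying unspoken complexity".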
| kordlessagain wrote:
| After researching this a fair amount, my opinion is that
| consciousness/intelligence (can you have one without the
| other?) emerges from some sort of weird entropy exchange in
| domains in the brain. The theory goes that we aren't
| conscious, but we DO consciousness, sometimes. Maybe
| entropy, or the inverse of it, gives way to intelligence,
| somehow.
|
| This entropy angle has real theoretical backing. Some
| researchers propose consciousness emerges from the brain's
| ability to integrate information across different scales
| and timeframes. This would essentially create temporary
| "islands of low entropy" in neural networks. Giulio
| Tononi's Integrated Information Theory suggests
| consciousness corresponds to a system's ability to generate
| integrated information, which relates to how it reduces
| uncertainty (entropy) about its internal states. Then there
| is Hameroff and Penrose, which I commented about on here
| years ago and got blasted for it. Meh. I'm a learner, and I
| learn by entertaining truths. But I always remain critical
| of theories until I'm sold.
|
| I'm not selling any of this as a truth, because the fact
| remains we have no idea what "consciousness" is. We have a
| better handle on "intelligence", but as others point out,
| most humans aren't that intelligent. They still manage to
| drive to the store and feed their dogs, however.
|
| A lot of the current leading ARC solutions use random
| sampling, which sorta makes sense once you start thinking
| about having to handle all the different types of problems.
| At least it seems to be helping out in paring down the
| decision tree.
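|
| As a toy illustration of that random-sampling idea (the tiny
| primitive set and the single example task are illustrative
| assumptions, not an actual competition solver):
|
|   import random
|
|   def flip_h(g):
|       return [row[::-1] for row in g]
|
|   def flip_v(g):
|       return g[::-1]
|
|   def transpose(g):
|       return [list(r) for r in zip(*g)]
|
|   def identity(g):
|       return g
|
|   PRIMITIVES = [flip_h, flip_v, transpose, identity]
|
|   def sample_program(max_len=3):
|       # randomly sample a short composition of primitives
|       length = random.randint(1, max_len)
|       return [random.choice(PRIMITIVES) for _ in range(length)]
|
|   def run(program, grid):
|       for f in program:
|           grid = f(grid)
|       return grid
|
|   def fits(program, examples):
|       return all(run(program, i) == o for i, o in examples)
|
|   # one training pair: the hidden "rule" is a horizontal flip
|   examples = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
|
|   for _ in range(1000):
|       prog = sample_program()
|       if fits(prog, examples):
|           print([f.__name__ for f in prog])
|           break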
| glenstein wrote:
| >My problem with AGI is the lack of a simple, concrete
| definition.
|
| You can't always start from definitions. There are many
| research areas where the object of research is to know
| something well enough that you could converge on such a
| thing as a definition, e.g. dark matter, consciousness,
| intelligence, colony collapse syndrome, SIDS. We
| nevertheless can progress in our understanding of them in a
| whole motley of strategic ways, by case studies that best
| exhibit salient properties, trace the outer boundaries of
| the problem space, track the central cluster of "family
| resemblances" that seem to characterize the problem,
| entertain candidate explanations that are closer or further
| away, etc. Essentially a practical attitude.
|
| I don't doubt in principle that we could arrive at such a
| thing as a definition that satisfies most people, but I
| suspect you're more likely to have that at the end than the
| beginning.
| apwell23 wrote:
| One of those cases where defining it and solving it are the
| same. If you know how to define it then you've solved it.
| kordlessagain wrote:
| ARC is definitely about achieving AGI and it doesn't matter
| whether we "have" it or not right now. That is the goal:
|
| > where he introduced the "Abstract and Reasoning Corpus for
| Artificial General Intelligence" (ARC-AGI) benchmark to
| measure intelligence
|
| So, a high enough score is a threshold to claim AGI. And, if
| you use an LLM to work these types of problems, it becomes
| pretty clear that passing more tests indicates a level of
| "awareness" that goes beyond rational algorithms.
|
| I thought I had seen everything until I started working on
| some of the problems with agents. I'm still sorta in awe
| about how the reasoning manifests. (And don't get me wrong,
| LLMs like Claude still go completely off the rails where even
| a less intelligent human would know better.)
| MPSimmons wrote:
| >a high enough score is a threshold to claim AGI
|
| I'm pretty sure he said that AGI would achieve a high
| score, not that a high score was indicative of AGI
| cubefox wrote:
| > Getting a high score on ARC doesn't mean we have AGI and
| Chollet has always said as much AFAIK
|
| He only seems to say this recently, since OpenAI cracked the
| ARC-AGI benchmark. But in the original 2019 abstract he said
| this:
|
| > We argue that ARC can be used to measure a human-like form
| of general fluid intelligence and that it enables fair
| general intelligence comparisons between AI systems and
| humans.
|
| https://arxiv.org/abs/1911.01547
|
| Now he seems to backtrack, with the release of harder ARC-
| like benchmarks, implying that the first one didn't actually
| test for really general human-like intelligence.
|
| This sounds a bit like saying that a machine beating chess
| would require general intelligence -- but then adding, after
| Deep Blue beats chess, that chess doesn't actually count as a
| test for AGI, and that Go is the real AGI benchmark. And
| after a narrow system beats Go, moving the goalpost to
| beating Atari, and then to beating StarCraft II, then to
| MineCraft, etc.
|
| At some point, intuitively real "AGI" will be necessary to
| beat one of these increasingly difficult benchmarks, but only
| because otherwise yet another benchmark would have been
| invented. Which makes these benchmarks mostly post hoc
| rationalizations.
|
| A better approach would be to question what went wrong with
| coming up with the very first benchmark, and why a similar
| thing wouldn't occur with the second.
| ben_w wrote:
| You're not alone in this; I expect us to have not yet
| enumerated all the things that we ourselves mean by
| "intelligence".
|
| But conversely, _not_ passing this test is a proof of _not_
| being as general as a human's intelligence.
| NetRunnerSu wrote:
| Unfortunately, we did it. All that is left is to assemble the
| parts.
|
| https://news.ycombinator.com/item?id=44488126
| kypro wrote:
| I find the "what is intelligence?" discussion a little
| pointless if I'm honest. It's similar to asking a question
| like what it means to be a "good person", and how we would
| know whether an AI or person is really "good".
|
| While understanding why a person or AI is doing what it's
| doing can be important (perhaps specifically in safety
| contexts) at the end of the day all that's really going to
| matter to most people is the outcomes.
|
| So if an AI can use what appears to be intelligence to solve
| general problems and can act in ways that are broadly good
| for society, whether or not it meets some philosophical
| definition of "intelligent" or "good" doesn't matter much -
| at least in most contexts.
|
| That said, my own opinion on this is that the truth is likely
| in between. LLMs today seem extremely good at being glorified
| auto-completes, and I suspect most (95%+) of what they do is
| just recalling patterns in their weights. But unlike
| traditional auto-completes they do seem to have some ability
| to reason and solve truly novel problems. As it stands I'd
| argue that ability is fairly poor, but this might only
| represent 1-2% of what we use intelligence for.
|
| If I were to guess why this is I suspect it's not that LLM
| architecture today is completely wrong, but that the way LLMs
| are trained means that in general knowledge recall is
| rewarded more than reasoning. This is similar to the trade-
| off we humans have with education - do you prioritise the
| acquisition of knowledge or critical thinking? Many believe
| critical thinking is more important and should be prioritised
| more, but I suspect for the vast majority of tasks we're
| interested in solving knowledge storage and recall is
| actually more important.
| ben_w wrote:
| That's certainly a valid way of looking at their abilities
| at any given task -- "The question of whether a computer
| can think is no more interesting than the question of
| whether a submarine can swim".
|
| But when the question is "are they going to be more important
| to the economy than humans?", then they have to be good at
| basically everything a human can do, otherwise we just see
| a variant of Amdahl's law in action and the AI performs an
| arbitrary speed-up of n% of the economy while humans are
| needed for the remaining (100-n)%.
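|
| (For reference, Amdahl's law here would read: if AI speeds up a
| fraction p of the economy by a factor s, the overall speed-up
| is 1 / ((1 - p) + p/s), which stays below 1/(1 - p) no matter
| how large s gets.)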
|
| I may be wrong, but it seems to me that the ARC prize is
| more about the latter.
| IanCal wrote:
| > are they going to be more important to the economy than
| humans?", then they have to be good at basically
| everything a human can do,
|
| I really don't think that's the case. A robot that can
| stack shelves faster than a human is more valuable at
| that job than someone who can move items and also
| appreciate comedy. One that can write software more
| reliably than person X is more valuable than them at that
| job even if X is well rounded and can do cryptic
| crosswords and play the guitar.
|
| Also, for many tasks they can be worse but cheaper.
|
| I do wonder how many tasks something like o3 or o3 pro
| can't do as well as a median employee.
| ben_w wrote:
| > I really don't think that's the case. A robot that can
| stack shelves faster than a human is more valuable at
| that job than someone who can move items and also
| appreciate comedy.
|
| Yes, until all the shelves are stacked and that is no
| longer your limiting factor.
|
| > One that can write software more reliably than person X
| is more valuable than them at that job even if X is well
| rounded and can do cryptic crosswords and play the
| guitar.
|
| Cryptic crosswords and guitar playing are already
| something computers can do, so they're not great
| examples.
|
| Consider a different example: "computer" used to be a job
| title of a person who computes. A single Raspberry Pi
| model zero, given away for free on a magazine cover at
| launch, can do this faster than the entire human
| population combined even if we all worked at the speed of
| the world record holder 24/7. But that wasn't enough to
| replace all human labour.
| tedy1996 wrote:
| AFAIK 55% of PRs written by the latest GPT model get
| approved.
| OtomotO wrote:
| You're not alone in this, no.
|
| My definition of AGI is the one I was brought up with, not an
| ever moving goal post (to the "easier" side).
|
| And no, I also don't buy that we are just stochastic parrots.
|
| But whatever. I've seen many hypes and if I don't die and the
| world doesn't go to shit, I'll see a few more in the next
| couple of decades.
| NetRunnerSu wrote:
| To pass ARC, you need a living model with sentient abilities,
| not the dead frog we have now.
|
| https://news.ycombinator.com/item?id=44488126
| nxobject wrote:
| I understand Chollet is transparent that the "branding" of the
| ARC-AGI-n suites is meant to be suggestive of their purpose
| rather than substantive.
|
| However, it does rub me the wrong way - as someone who's
| cynical of how branding can enable breathless AI hype by bad
| journalism. A hypothetical comparison would be labelling
| SHRDLU's (1968) performance on Block World planning tasks as
| "ARC-AGI-(-1)".[0]
|
| A less loaded name like (bad strawman option) "ARC-
| VeryToughSymbolicReasoning" should capture how the ARC-AGI-n
| suite is genuinely and intrinsically very hard for current AIs,
| and what progress satisfactory performance on the benchmark
| suite would represent. Which Chollet has done, and has grounded
| him throughout! [1]
|
| [0] https://en.wikipedia.org/wiki/SHRDLU [1]
| https://arxiv.org/abs/1911.01547
| heymijo wrote:
| I get what you're saying about perception being reality and
| that ARC-AGI suggests beating it means AGI has been achieved.
|
| In practice when I have seen ARC brought up, it has more
| nuance than any of the other benchmarks.
|
| Unlike Humanity's Last Exam, which is the most egregious
| example I have seen, both in its naming and in how it is
| referenced in terms of an LLM's capability.
| maaaaattttt wrote:
| I've said this somewhere else, but we have the perfect test for
| AGI in the form of any open world game. Give the instructions
| to the AGI that it should finish the game and how to control
| it. Give the frames as input and wait. When I think of the
| latest Zelda games and especially how the Shrine challenges
| are designed, they feel like the perfect environment
| for an AGI test.
| Lerc wrote:
| And if someone makes a machine that does all that and another
| person says
|
| "That's not really AGI because xyz"
|
| What then? The difficulty in coming up with a test for AGI is
| coming up with something that people will accept a passing
| grade as AGI.
|
| In many respects I feel like all of the claims that models
| don't really understand or have internal representation or
| whatever tend to lean on nebulous or circular definitions of
| the properties in question. Trying to pin the arguments down
| usually end up with dualism and/or religion.
|
| Doing what Chollet has done is infinitely better: if a person
| can easily do something and a model cannot, then there is
| clearly something significant missing.
|
| It doesn't matter what the property is or what it is called.
| Such tests might even help us see what those properties are.
|
| Anyone who wants to claim the fundamental inability of these
| models should be able to provide a task where it is clearly
| possible to tell when it has been solved, and to show that
| humans can do it (if that's the bar we are claiming can't be
| met). If they are right, then no future model should be able
| to solve that class of problems.
| maaaaattttt wrote:
| Given your premise (which I agree with) I think the issue
| in general comes from the lack of a good, broadly accepted
| definition of what AGI is. My initial comment originates
| from the fact that in my internal definition, an AGI would
| have a de facto understanding of the physics of "our
| world". Or better, could infer them by trial and error.
| But, indeed, it doesn't have to be the case. (The other
| advantage of the Zelda games is that they introduce new
| abilities that don't exist in our world, and for which most
| children -I've seen- understand the mechanisms and how they
| could be applied to solve a problem quite naturally even though
| they've never had that ability before).
| wat10000 wrote:
| I'd say the issue is the lack of a good, broadly accepted
| definition of what the "I" is. We all know "smart" when we see
| it, but actually defining it in a rigorous way is tough.
| ta8645 wrote:
| This difficulty is interesting in and of itself.
|
| When people catalogue the deficiencies in AI systems,
| they often (at least implicitly) forgive all of our own
| such limitations. When someone points to something that
| an AI system clearly doesn't understand, they say that
| proves it isn't AGI. But if you point at any random
| human, who fails at the very same task, you wouldn't say
| they lack "HGI", even if they're too personally limited
| to ever be taught the skill.
|
| All of which, is to say, I don't think pointing at a
| limitation of an AI system, really proves it lacks AGI.
| It's a more slippery definition, than that.
| jcranmer wrote:
| > The difficulty in coming up with a test for AGI is coming
| up with something that people will accept a passing grade
| as AGI.
|
| The difficulty with intelligence is we don't even know what
| it is in the first place (in a psychology sense, we don't
| even have a reliable model of anything that corresponds to
| what humans point at and call intelligence; IQ and g are
| really poor substitutes).
|
| Add into that Goodhart's Law (essentially, propose a test
| as a metric for something, and people will optimize for the
| test rather than what the test is trying to measure), and
| it's really no surprise that there's no test for AGI.
| bonoboTP wrote:
| > It doesn't matter what the property is or what it is
| called. Such tests might even help us see what those
| properties are.
|
| This is a very good point and somewhat novel to me in its
| explicitness.
|
| There's no reason to think that we already have the
| concepts and terminology to point out the gaps between the
| current state and human-level intelligence and beyond. It's
| incredibly naive to think we have already armchair-generated
| those concepts by pure self-reflection and
| philosophizing. This is obvious in fields like physics.
| Experiments were necessary to even come up with the basic
| concepts of electromagnetism or relativity or quantum
| mechanics.
|
| I think the reason is that pure philosophizing is still
| more prestigious than getting down in the weeds and dirt
| and doing limited-scope well-defined experiments on
| concrete things. So people feel smart by wielding poorly
| defined concepts like "understanding" or "reasoning" or
| "thinking", contrasting it with "mere pattern matching", a
| bit like the stalemate that philosophy as a field often
| hits, as opposed to the more pragmatic approach in the
| sciences, where empirical contact with reality allows more
| consensus and clarity without getting caught up in mere
| semantics.
| davidclark wrote:
| In the video, Francois Chollet, creator of the ARC benchmarks,
| says that beating ARC does not equate to AGI. He specifically
| says they will be able to be beaten without AGI.
| cubefox wrote:
| He only says this because otherwise he would have to say that
|
| - OpenAI's o3 counts as "AGI" when it did unexpectedly beat
| the ARC-AGI benchmark or
|
| - Explicitly admit that he was wrong when assuming that ARC-
| AGI would test for AGI
| sweezyjeezy wrote:
| FWIW the original ARC was published in 2019, just after
| GPT-2 but a while before GPT-3. I work in the field, I
| think that discussing AGI seriously is actually kind of a
| recent thing (I'm not sure I ever heard the term 'AGI'
| until a few years ago). I'm not saying I know he didn't
| feel that, but he doesn't talk in such terms in the
| original paper.
| cainxinth wrote:
| > It's mostly about pattern matching...
|
| For all we know, human intelligence is just an emergent
| property of really good pattern matching.
| cttet wrote:
| The point is not that having a high score -> AGI, their ideas
| are more of having a low score -> we don't have AGI yet.
| CamperBob2 wrote:
| If you can write code to solve ARC by "overfitting," then give
| it a shot! There's prize money to be won, as long as your model
| does a good job on the hidden test set. Zuckerberg is said to
| be throwing around 8-figure signing bonuses for talent like
| that.
|
| But then, I guess it wouldn't be "overfitting" after all, would
| it?
| gonzobonzo wrote:
| I agree with you but I'll go a step further - these benchmarks
| are a good example of how far we are from AGI.
|
| A good base test would be to give a manager a mixed team of
| remote workers, half being human and half being AI, and seeing
| if the manager or any of the coworkers would be able to tell
| the difference. We wouldn't be able to say that AI that passed
| that test would necessarily be AGI, since we would have to test
| it in other situations. But we could say that AI that couldn't
| pass that test wouldn't qualify, since it wouldn't be able to
| successfully accomplish some tasks that humans are able to.
|
| But of course, current AI is nowhere near that level yet. We're
| left with benchmarks, because we all know how far away we are
| from actual AGI.
| criddell wrote:
| The AGI test I think makes sense is to put it in a robot body
| and let it navigate the world. Can I take the robot to my
| back yard and have it weed my vegetable garden? Can I show it
| how to fold my laundry? Can I take it to the grocery store
| and tell it "go pick up 4 yellow bananas and two avocados
| that will be ready to eat in the next day or two, and then
| meet me in dairy"? Can I ask it to dice an onion for me
| during meal prep?
|
| These are all things my kids would do when they were pretty
| young.
| gonzobonzo wrote:
| I agree, I think of that as the next level beyond the
| digital assistant test - a physical assistant test. Once
| there are sufficiently capable robots, hook one up to the
| AI. Tell it to mow your lawn, drive your car to the
| mechanic to get it checked, box up an item, take it to the
| post office and have it shipped, pick up your dry cleaning,
| buy ingredients from a grocery store, cook dinner, etc.
| Basic tasks a low-skilled worker would
| do as someone's assistant.
| bumby wrote:
| I think the next harder level in AGI testing would be
| "convince my kids to weed the garden and fold the laundry"
| :-)
| godshatter wrote:
| The problem with "spot the difference" tests, imho, is that I
| would expect an AGI to be easily spotted. There's going to be
| a speed of calculation difference, at the very least. If
| nothing else, typing speed would be completely different
| unless the AGI is supposed to be deceptive. Who knows what
| its personality would be like. I'd say it's a simple enough
| test just to see if an AGI could be hired as, for example, an
| entry-level software developer and keep its job based on the
| same criteria base-level humans have to meet.
|
| I agree that current AI is nowhere near that level yet. If AI
| isn't even trying to extract meaning from the words it smiths
| or the pictures it diffuses then it's nothing more than a
| cute (albeit useful) parlor trick.
| SubiculumCode wrote:
| [1]https://app.rescript.info/public/share/W_T7E1OC2Wj49ccqlIOOz
| ...
|
| Perhaps it's because the representations are fractured. The
| link above is to the transcript of an episode of Machine
| Learning Street Talk with Kenneth O. Stanley about The Fractured
| Entangled Representation Hypothesis [1].
| crazylogger wrote:
| I think next year's AI benchmarks are going to be like this
| project: https://www.anthropic.com/research/project-vend-1
|
| Give the AI tools and let it do _real stuff_ in the world:
|
| "FounderBench": Ask the AI to build a successful business,
| whatever that business may be - the AI decides. Maybe try to
| get funded by YC - hiring a human presenter for Demo Day is
| allowed. They will be graded on profit / loss, and valuation.
|
| Testing a plain LLM on whiteboard-style questions is meaningless
| now. Going forward, it will all be multi-agent systems with
| computer use, long-term memory & goals, and delegation.
| mindcrime wrote:
| > I feel like I'm the only one who isn't convinced getting a
| high score on the ARC eval test means we have AGI.
|
| Wait, what? Approximately nobody is claiming that "getting a
| high score on the ARC eval test means we have AGI". It's a
| useful eval for measuring progress along the way, but I don't
| think anybody considers it the final word.
| andoando wrote:
| Who says intelligence is anything more than "pattern matching"?
| Everything is patterns
| sva_ wrote:
| It is a necessary condition, but not a sufficient one.
| tippytippytango wrote:
| He's playing the game. You have to say AGI is your goal to get
| attention. It's just like the YouTube thumbnail game. You can
| hate it, but you still have to play if you want people to pay
| attention.
| hackinthebochs wrote:
| Has Chollet ever talked about his change of heart regarding AGI?
| It wasn't that long ago when he was one of the loudest voices
| decrying even the concept of AGI, let alone us being on the path
| to creating it. Now he's an advocate and has his own prize
| dataset? Seems rather convenient to change your tune once
| hundreds of billions are being thrown at AGI (not that I would
| blame him).
| zamderax wrote:
| People are allowed to evolve opinions. It seems to me he
| believes that a combination of transformer and program
| synthesis are key. The big unknown at the moment is how to do
| program search.
| hackinthebochs wrote:
| Absolutely. Presumably there is some specific considerations
| or evidence that helped him evolve his opinion. I would be
| interested in seeing a writeup about it. With him having been
| a very public advocate against AGI, a writeup of his
| evolution seems appropriate and would be very edifying for a
| lot of people.
| blibble wrote:
| > Presumably there is some specific considerations or
| evidence that helped him evolve his opinion.
|
| suitcases full of money?
| Bjorkbat wrote:
| I recall it as less an evolution and more a complete tonal
| shift the moment o3 was evaluated on ARC-AGI. I remember on
| Twitter Sam made some dumb post suggesting they had beaten
| the benchmark internally, and Francois called him out on
| his vagueposting. As soon as they publicly released the
| scores, it was like he was all-in on reasoning.
|
| Which I have to admit I was kind of disappointed by.
| cubefox wrote:
| ARC-AGI was introduced in 2019:
|
| https://arxiv.org/abs/1911.01547
|
| GPT-3 didn't come out until 2020.
| hackinthebochs wrote:
| In my view that just makes his evolution more interesting as
| it wasn't just a matter of being wow'ed by what ChatGPT could
| do.
| 0xCE0 wrote:
| He has recently co-founded the company Ndea, so he has to
| align himself with that. The same kind of vibe change can be
| felt with Joscha Bach after taking a position at Liquid AI.
| Communication is not so relaxed anymore.
|
| That said, I'd still listen to these two guys (+ Schmidhuber)
| more than any other AI guy.
| roenxi wrote:
| By both definitions of intelligence in the presentation we should
| be saying "how we got to AGI" in the past tense. We're already
| there. AIs can deal with situations they weren't prepared for in
| any sense that a human can. They might not do well, but they'll
| have a crack at it. We can trivially build systems that collect
| data and do a bit more offline training if that is what someone
| wants to see, but there doesn't really seem to be a commercial
| need for that right now. Similarly, AIs can whip most humans at
| most domains that require intelligence.
|
| I think the debate has been flat-footed by the speed at which
| all this happened. We're not talking AGI any more, we're
| talking about how
| to build superintelligences hitherto unseen in nature.
| cubefox wrote:
| Well, there is also robotics, active inference, online
| learning, etc. Things animals can do well.
| AIPedant wrote:
| Current robots perform very badly on my patented and highly
| scientific ROACH-AGI benchmark - "is this thing smarter at
| navigating unfamiliar 3D spaces than a cockroach?"
| tmvphil wrote:
| According to this presentation at least, ARC-AGI-2 shows that
| there is a big meaningful gap in fluid intelligence between
| normal non-genius humans and the best models currently, which
| seems to indicate we are not "already there".
| saberience wrote:
| There's already a big meaningful gap between the things AIs
| can do which humans can't, so why do you only count as
| "meaningful" the things humans can do which AIs can't?
|
| I enjoy seeing people repeatedly move the goalposts for
| "intelligence" as AIs simply get smarter and smarter every
| week. Soon AI will have to beat Einstein in Physics, Usain
| Bolt in running, and Steve Jobs in marketing to be considered
| AGI...
| tmvphil wrote:
| > There's already a big meaningful gap between the things
| AIs can do which humans can't, so why do you only count as
| "meaningful" the things humans can do which AIs can't?
|
| Where did I say there was nothing meaningful about current
| capabilities? I'm saying that what is novel about a claim
| of "AGI" (as opposed to a claim of "computer does something
| better than humans", which has been an obviously true
| statement since the ENIAC) is the ability to do at some
| level _everything_ a normal human intelligence can do.
| TheAceOfHearts wrote:
| The first highlight from this video is getting to see a preview
| of the next ARC dataset. Otherwise it feels like most of what
| Chollet says here has already been repeated in his other podcast
| appearances and videos. It's a good video if you're not
| familiar with his work, but if you've seen some of his recent
| interviews then you can probably skip the first 20 minutes.
|
| The second highlight from this video is the section from 29
| minutes onward, where he talks about designing systems that can
| build up rich libraries of abstractions which can be applied to
| new problems. I wish he had lingered more on exploring and
| explaining this approach, but maybe they're trying to keep a bit
| of secret sauce because it's what his company is actively working
| on.
|
| One of the major points which seems to be emerging from recent AI
| discourse is that the ability to integrate continuous learning
| seems like it'll be a key element in building AGI. Context is
| fine for short tasks, but if lessons are never preserved you're
| severely capped with how far the system can go.
| vixen99 wrote:
| Is the text available for those who don't hear so well?
| jasonlotito wrote:
| At the very least, YouTube provides a transcript and a "Show
| Transcript" button in the video description, which you can
| click on to follow along.
| heymijo wrote:
| When I watched the video I had the subtitles on. The
| automatic transcript is pretty good, though "test-time", which
| is used frequently, gets transcribed as "Tesla", so watch out
| for that.
| saberience wrote:
| The Arc prize/benchmark is a terrible judge of whether we got to
| AGI.
|
| If we assume that humans have "general intelligence", we would
| assume all humans could ace Arc... but they can't. Try asking
| your average person, i.e. supermarket workers, gas station
| attendants etc to do the Arc puzzles, they will do poorly,
| especially on the newer ones, but AI has to perform perfectly to
| prove it has general intelligence? (not trying to throw shade here
| but the reality is this test is more like an IQ test than an AGI
| test).
|
| Arc is a great example of AI researchers moving the goal posts
| for what we consider intelligent.
|
| Let's get real, Claude Opus is smarter than 99% of people right
| now, and I would trust its decision making over 99% of people I
| know in most situations, except perhaps emotion driven ones.
|
| The ARC-AGI benchmark is just a gimmick. Also, since it's a
| visual test and the current models are text-based, it's
| actually rigged against the AI models anyway, since their
| datasets were completely text-based.
|
| Basically, it's a test of some kind, but it doesn't mean quite as
| much as Chollet thinks it means.
| leumon wrote:
| He said in the video that they tested regular people (uber
| driver, etc.) on arc-agi2 and at least 2 people were able to
| solve each task (an average of 9-10 people saw each task). Also
| this quote from the paper: _None of the self-reported
| demographic factors recorded for all participants--including
| occupation, industry, technical experience, programming
| proficiency, mathematical background, puzzle-solving aptitude,
| and various other measured attributes--demonstrated clear,
| statistically significant relationships with performance
| outcomes. This finding suggests that ARC-AGI-2 tasks assess
| general problem-solving capabilities rather than domain-
| specific knowledge or specialized skills acquired through
| particular professional or educational experiences._
| daveguy wrote:
| It is not a judge of whether we got to AGI. _And literally no
| one except straw-manning critics is trying to claim it is_.
| The point is, an AGI should easily be able to pass it. But it
| can obviously be passed without getting to AGI. It's a
| necessary but not sufficient criterion. If something can't pass
| a test as simple as ARC (which _no AI currently can_) then
| it's definitely not AGI. Anyone claiming AGI should be able to
| point their AI at the problem and have an 80+% solution rate.
| Current attempts on the second ARC are less than 10% with zero
| shot attempts even worse. Even the better performing LLMs on
| the first ARC couldn't do well without significant pre-
| training. In short, the G in AGI stands for _general_.
| saberience wrote:
| So do you agree that a human that CANNOT solve ARC doesn't
| have general intelligence?
|
| If we think humans have "GI" then I think we have AIs right
| now with "GI" too. Just like humans do, AIs spike in various
| directions. They are amazing at some things and weak at
| visual/IQ test type problems like ARC.
| adamgordonbell wrote:
| It's a good question. But only complicated answers are
| possible. A puppy, a crow, and a raccoon all have
| intelligence but certainly can't all pass the ARC
| challenge.
|
| I think the charitable interpretation is that intelligence
| is made up of many skills, and AIs are superhuman at some,
| like image recognition.
|
| And that therefore, future efforts need to be on the areas
| where AIs are significantly less skilled. And also, since
| they are good at memorizing things, knowledge questions are
| the wrong direction and anything most humans could solve
| but that AIs can not, especially if as generic as pattern
| matching, should be an important target.
| cttet wrote:
| Maybe it is a cultural difference aspect, but I feel that
| "supermarket workers, gas station attendants" (in an Asian
| country) that I know of should be quite capable of most ARC
| tasks.
| profchemai wrote:
| Out of 100 evals, ARC is a very distinct and unique one; most
| frontier models are also visual now, so I don't see the harm in
| having this instead of another text eval.
| Workaccount2 wrote:
| This is what is called "spikey" intelligence, where a model
| might be able to crack phd physics problems and solve byzantine
| pattern matching games at the 90th percentile, but also can't
| figure out how to look up a company and copy their address on
| the "customer" line of an invoice.
| chromaton wrote:
| Current AI systems don't have a great ability to take
| instructions or information about the state of the world and
| produce new output based upon that. Benchmarks that emphasize
| this ability help greatly in progress toward AGI.
| jacquesm wrote:
| Let's not. Seriously. I absolutely love Francois and have used
| his work extensively. But looking around me at the social impact
| of AI I am really not convinced that this is what the world needs
| right now; if we can stave off the turning point for
| another decade or two, humanity will likely benefit from
| that. The last thing we need is to inject yet another instability
| into a planet that is already fighting existential crisis on a
| number of fronts.
| thatguy0900 wrote:
| It doesn't matter what should or should not happen. Technology
| will continue to race forward at breakneck speed while everyone
| involved pats each other on the back for making a bunch of
| money before the consequences hit
| nessbot wrote:
| technology doesn't just advance itself
| lo_zamoyski wrote:
| This is true. We have a choice...in principle.
|
| But in practice, it's like stopping an arms race.
| bnchrch wrote:
| No, but one thing is certain, in large human systems you
| can only redirect greed, you can't stop it.
| alex_duf wrote:
| If the incentive is there, the technology will advance. I
| hear "we need to slow down the progress of technology", but
| that's misunderstanding _why_ it progresses. I'm assuming
| the slow down camp really need to look into what's the
| incentive to slow down.
|
| Personally I don't think it's possible at this stage. The
| cat's out of the bag (this new class of tools are working)
| the economic incentive is way too strong.
| modeless wrote:
| ARC-AGI 3 reminds me of PuzzleScript games:
| https://www.puzzlescript.net/Gallery/index.html
|
| There are dozens of ready-made, well-designed, and very creative
| games there. All are tile-based and solved with only arrow keys
| and a single action button. Maybe someone should make a
| PuzzleScript AGI benchmark?
| mNovak wrote:
| This game is great!
|
| https://nebu-soku.itch.io/golfshall-we-golf
|
| Maybe someone can make an MCP connection for the AIs to
| practice. But I think the idea of the benchmark is to reserve
| some puzzles for private evaluation, so that they're not in the
| training data.
| visarga wrote:
| I think intelligence is search. Search is exploration + learning.
| So intelligence is not in the model or in the environment, but in
| their mutual dance. A river is not the banks, nor the water, but
| their relation. ARC is just a frozen snapshot of the banks, not
| the dynamic environment we have.
| ipunchghosts wrote:
| I agree strongly with this take but find it hard to convince
| others of it. Instead, people keep thinking there is a magic
| bullet to discover, resulting in a lot of wasted resources and
| money.
| bogtog wrote:
| I wonder how much slow progress on ARC can be explained by their
| visual properties making them easy for humans but hard for LLMs.
|
| My impression is that models are pretty bad at interpreting grids
| of characters. Yesterday, I was trying to get Claude to convert a
| message into a cipher where it converted a 98-character string
| into a 7x14 grid where the sequential letters moved 2 right and
| 1 down (i.e., like a knight in chess). Claude seriously
| struggled.
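|
| For instance, a minimal sketch of that kind of grid fill (the
| wrap-around and skip-to-next-free-cell collision rules are
| assumptions; the original prompt may have specified different
| ones):
|
|   ROWS, COLS = 7, 14
|
|   def knight_fill(message):
|       assert len(message) == ROWS * COLS
|       grid = [[None] * COLS for _ in range(ROWS)]
|       r = c = 0
|       for ch in message:
|           # scan forward for a free cell if this one is taken
|           while grid[r][c] is not None:
|               c = (c + 1) % COLS
|               if c == 0:
|                   r = (r + 1) % ROWS
|           grid[r][c] = ch
|           # knight-like step: 2 right, 1 down, wrapping around
|           r, c = (r + 1) % ROWS, (c + 2) % COLS
|       return grid
|
|   msg = ("the quick brown fox jumps over the lazy dog " * 3)[:98]
|   for row in knight_fill(msg):
|       print("".join(row))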
|
| Yet, Francois always pumps up the "fluid intelligence" component
| of this test and emphasizes how easy these are for humans. Yet,
| humans would presumably be terrible at the tasks if they looked
| at them character by character.
|
| This feels like a somewhat similar (intuition-lie?) case as the
| Apple paper showing how reasoning models can't do Tower of Hanoi
| past 10+ disks. Readers will intuitively think about how they
| themselves could tediously do an arbitrarily long Tower of Hanoi,
| which is what the paper is trying to allude to. However, the more
| appropriate analogy would be writing out all >1000 moves on a
| piece of paper at once and being 100% correct, which is obviously
| much harder
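|
| (For reference, the minimal Tower of Hanoi solution takes 2^n - 1
| moves, so 10 disks already means 1023 moves and 15 disks means
| 32767.)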
| krackers wrote:
| I thought so too back when the test was first released, but now
| that we have multimodal models which can take images directly
| as input, shouldn't this point be moot?
| ltbarcly3 wrote:
| There is some kind of massive brigading happening on this thread.
| Lots of thoughtful comments are downmodded or flagged (including
| mine, which I thought was pretty thoughtful. I even said poop
| instead of shit.).
|
| https://news.ycombinator.com/item?id=44492241
|
| My comment was basically instantly flagged. I see at least 3
| other flagged comments that I can't imagine deserve to be
| flagged.
| layer8 wrote:
| You didn't address anything from the actual talk.
| ltbarcly3 wrote:
| I addressed the entire concept of the talk, and made other
| relevant points. The correct response to "let me tell you
| something I can't possibly know" isn't to argue the points
| within that frame.
|
| If you see a talk like: "How we will develop diplomacy with
| the rat-people of TRAPPIST-5." you don't have to make some
| argument about super-earths and gravity and the rocket
| equation. You can just point out it's absurd to pretend to
| know something like whether there are rat-people there.
|
| Either way, it isn't flag-able!
| layer8 wrote:
| Did you actually watch the talk?
|
| The flagging is probably due to your aggressively indignant
| style.
| lawlessone wrote:
| How do we define AGI?
|
| I would have thought/considered AGI to be something that is
| constantly aware; a biological brain is always on. An LLM is on
| briefly while it's inferring.
|
| A biological brain constantly updates itself adds memories of
| things. Those memories generally stick around.
| khalic wrote:
| This quest for an ill-defined AGI is going to create a million
| Captain Ahabs.
| gtech1 wrote:
| This may be a silly question, I'm no expert. But why not simply
| define as AGI any system that can answer a question that no human
| can. So for example, ask AGI to find out, from current knowledge,
| how to reconcile gravity and QED.
| m11a wrote:
| That would be ASI I think.
|
| But consider: technically AlphaTensor found new algorithms no
| human had found before (https://en.wikipedia.org/wiki/Matrix_multipli
| cation_algorith...). So isn't it AGI by your definition of
| answering a question no human could before: how to do 4x4
| matrix multiplication in 47 steps?
| imiric wrote:
| "What is the meaning of life, the universe, and everything?"
| ta8645 wrote:
| 42
| soVeryTired wrote:
| Computers can already do a lot of things that no human can
| though. They can reliably find the best chess or go move better
| than a human.
|
| It's conceivable (though not likely) that given enough
| training in symbolic mathematics and some experimental data, an
| LLM-style AI could figure out a neat reconciliation of the two
| theories. I wouldn't say that makes it AGI though. You could
| achieve that unification with an AI that was limited to
| mathematics rather than being something that can function in
| many domains like a human can.
| layer8 wrote:
| Aside from other objections already mentioned, your example
| would require feasible experiments for verification, and likely
| the process of finding a successful theory of quantum gravity
| requires a back and forth between experimenters and theorists.
___________________________________________________________________
(page generated 2025-07-07 23:00 UTC)