[HN Gopher] Arc Prize 2024 Winners and Technical Report
___________________________________________________________________
Arc Prize 2024 Winners and Technical Report
Author : alphabetting
Score : 71 points
Date : 2024-12-06 19:20 UTC (3 hours ago)
(HTM) web link (arcprize.org)
(TXT) w3m dump (arcprize.org)
| mikeknoop wrote:
| Author here -- six months ago we launched ARC Prize, a huge $1M
| experiment, to test if we need new ideas for AGI. The ARC-AGI
| benchmark remains unbeaten, and I think we can now definitively
| say "yes".
|
| One big update since June is that progress is no longer stalled.
| Coming into 2024, the public consensus vibe was that pure deep
| learning / LLMs would continue scaling to AGI. The fundamental
| architecture of these systems hasn't changed since ~2019.
|
| But this flipped late summer. AlphaProof and o1 are evidence of
| this new reality. All frontier AI systems are now incorporating
| components beyond pure deep learning like program synthesis and
| program search.
|
| I believe ARC Prize played a role here too. All the winners this
| year are leveraging new AGI reasoning approaches like deep-
| learning-guided program synthesis and test-time training/fine-
| tuning. We'll be seeing a lot more of these in frontier AI
| systems in coming years.
|
| And I'm proud to say that all the code and papers from this
| year's winners are now open source!
|
| We're going to keep running this thing annually until it's
| defeated. And we've got ARC-AGI-2 in the works to improve on
| several of the v1 flaws (more here:
| https://arcprize.org/blog/arc-prize-2024-winners-technical-r...)
|
| The ARC-AGI community keeps surprising me. From initial launch,
| through o1 testing, to the final 48 hours when the winning team
| jumped 10% and both winning papers dropped out of nowhere. I'm
| incredibly grateful to everyone and we will do our best to
| steward this attention towards AGI.
|
| We'll be back in 2025!
| mrandish wrote:
| Congrats to you and Francois on the success of ARC-AGI 24 and
| thanks so much for doing it. I just finished the technical
| report and am encouraged! It's great to finally see some
| tangible progress in research that is both novel and plausibly
| headed in fruitful directions.
| tbalsam wrote:
| As a rather experienced ML researcher: ARC is a great benchmark
| on its own, but it is punching above its weight in claiming to
| be a gate (or, in the terms of this post, a "steward") towards
| AGI, and in my perspective, and that of several researchers
| near me, this has watered down the value of the ARC benchmark
| as a test.
|
| It is a great unit test for reasoning -- that's fantastic! And
| maybe it is indeed the best way to test for this -- who knows.
| But the claim is a little grandiose for what it is: it's
| somewhat like calling a string-parity test the One True Test of
| an optimizer's efficiency.
|
| I'd heartily recommend taking the marketing vibrance down a
| notch and keeping things a bit more measured. It's not entirely
| a meme, but some of the more serious researchers don't take it
| as seriously as a result. And that's the kind of people that
| you want to attract to this sort of thing!
|
| I think there is a potentially good future for ARC! But it
| might struggle to attract some of the kind of talent that you
| want to work on this problem as a result.
| mikeknoop wrote:
| > I'd heartily recommend taking the marketing vibrance down
| a notch and keeping things a bit more measured. It's not
| entirely a meme, but some of the more serious researchers
| don't take it as seriously as a result.
|
| This is a fair critique. ARC Prize's 2024 messaging was sharp
| in order to break through the noise floor -- ARC has been
| around since 2019, but most people only learned about it this
| summer. Now that it has garnered awareness, that sharpness is
| no longer useful, and in some cases is hurting progress, as
| you point out. The messaging needs to evolve and mature next
| year to be more neutral/academic.
| tbalsam wrote:
| I feel rather consternated that this response effectively
| boils down to "yes, we know we overhyped this to get
| people's attention, and now that we have it we can be more
| honest about it". Fighting for place in the attention
| economy is understandable, being deceptive about it is not.
|
| This is part of the ethical morass of why some more serious
| researchers aren't touching the benchmark. People are not
| going to take it seriously if it continues like this!
| mikeknoop wrote:
| I think we agree; to clarify, sharp messaging isn't
| inaccurate messaging. And I believe the story is not
| overhyped given the evidence: the benchmark resisted a
| $1M prize pool for ~6 months. But I concede we did obsess
| about the story to give it the best chance of survival in
| the marketplace of ideas against the incumbent AI
| research meme (LLM scaling). Now that the AI research
| field is coming around to the idea that something beyond
| deep learning is needed, the story matters less, and the
| benchmark, and future versions, can stand on their
| utility as a compass towards AGI.
| iwsk wrote:
| we live in a society
| trott wrote:
| Mike and Francois,
|
| Compute is limited during inference, and this naturally limits
| brute-force program search.
|
| But this doesn't prevent one from creating a huge ARC-like
| dataset ahead of time, like BARC did (but bigger), and training
| a correspondingly huge NN on it.
|
| Placing a limit on the submission size could foil this kind of
| brute-force approach though. I wonder if you are considering
| this for 2025?
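|
| As a toy illustration of that ahead-of-time pipeline (the
| primitives and sizes here are made up; BARC's actual generators
| are far richer), one can sample random hidden programs over a
| small DSL and execute them on random grids to manufacture
| synthetic tasks offline:
|
|     # Toy BARC-style data generation: sample a hidden program over
|     # a small DSL, apply it to random input grids, and emit
|     # (input, output) pairs as a synthetic ARC-like task. Scaled
|     # up, this yields a huge precomputed training set.
|     import random
|
|     PRIMITIVES = {
|         "flip_h":    lambda g: [row[::-1] for row in g],
|         "flip_v":    lambda g: g[::-1],
|         "transpose": lambda g: [list(r) for r in zip(*g)],
|         "recolor":   lambda g: [[(c + 1) % 10 for c in row] for row in g],
|     }
|
|     def random_grid(n):
|         return [[random.randrange(10) for _ in range(n)] for _ in range(n)]
|
|     def sample_task(n_pairs=3, depth=2):
|         program = random.choices(list(PRIMITIVES), k=depth)  # hidden rule
|         pairs = []
|         for _ in range(n_pairs):
|             x = random_grid(random.randint(3, 6))  # square grids only
|             y = x
|             for name in program:
|                 y = PRIMITIVES[name](y)
|             pairs.append((x, y))
|         return {"program": program, "pairs": pairs}
|
|     dataset = [sample_task() for _ in range(100_000)]  # built offline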
| padswo1 wrote:
| I don't think ARC has particularly advanced the research. The
| approaches that are successful were developed elsewhere and
| then applied to ARC. Happy to be shown somewhere this is not
| the case.
|
| In the case of TTT, I wouldn't really describe that as a 'new
| AGI reasoning approach'. People have been fine tuning deep
| learning models on specific tasks for a long time.
|
| The fundamental instinct driving the creation of ARC -- that
| 'deep learning cannot do system 2 thinking' -- is under threat
| of being proven wrong very soon. Attempts to define the
| approaches that are working as somehow not 'traditional deep
| learning' really seem like moving the goalposts.
| mikeknoop wrote:
| Correct, fine-tuning is not new. It's long been used to
| augment foundation LLMs with private data, e.g. private
| enterprise data. We do this at Zapier, for instance.
|
| The new and surprising thing about test-time training (TTT)
| is how effective an approach it is for dealing with novel
| abstract reasoning problems like ARC-AGI.
|
| TTT was pioneered by Jack Cole last year and popularized this
| year by several teams, including this winning paper:
| https://ekinakyurek.github.io/papers/ttt.pdf
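|
| A minimal sketch of the mechanic (illustrative only -- the toy
| model, augmentations, and hyperparameters below are assumptions,
| not the winning team's code):
|
|     # Test-time training: adapt a per-task copy of a trained model
|     # using only that task's demonstration pairs, expanded with
|     # simple geometric augmentations.
|     import copy
|     import torch
|     import torch.nn as nn
|
|     def augment(pairs):
|         # Rotations/flips are label-preserving for many ARC tasks;
|         # a real system would check which symmetries actually apply.
|         out = []
|         for x, y in pairs:
|             for k in range(4):
|                 xr = torch.rot90(x, k, (-2, -1))
|                 yr = torch.rot90(y, k, (-2, -1))
|                 out.append((xr, yr))
|                 out.append((torch.flip(xr, (-1,)), torch.flip(yr, (-1,))))
|         return out
|
|     def test_time_train(base_model, demo_pairs, steps=32, lr=1e-4):
|         model = copy.deepcopy(base_model)   # base weights stay frozen
|         opt = torch.optim.Adam(model.parameters(), lr=lr)
|         data = augment(demo_pairs)
|         loss_fn = nn.MSELoss()              # real systems use token-level CE
|         for step in range(steps):
|             x, y = data[step % len(data)]
|             opt.zero_grad()
|             loss_fn(model(x), y).backward()
|             opt.step()
|         return model                        # task-specialized model
|
|     # Toy usage: a 1x1-conv "model" over 1-channel 5x5 grids.
|     base = nn.Conv2d(1, 1, kernel_size=1)
|     demos = [(torch.rand(1, 1, 5, 5), torch.rand(1, 1, 5, 5))
|              for _ in range(3)]
|     adapted = test_time_train(base, demos)
|     prediction = adapted(torch.rand(1, 1, 5, 5))  # held-out test input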
| celeritascelery wrote:
| What surprises me about this is how poorly general-purpose LLMs
| do. The best one is OpenAI o1-preview at 18%. This is
| significantly worse than purpose-built models like ARChitects
| (which scored 53.5%). This model used TTT to train on the
| ARC-AGI task specification (among other things). It seems that
| even if someone creates a model that can "solve" ARC, it still
| is not indicative of AGI, since it is not "general" anymore; it
| is just specialized to this particular task -- similar to how
| chess engines are not AGI, despite being superhuman at chess.
| It will be much more convincing when general models not trained
| specifically for ARC can still score well on it.
|
| They do mention that some of the tasks here are susceptible to
| brute force and they plan to address that in ARC-AGI-2.
|
| > nearly half (49%) of the private evaluation set was solved by
| at least one team during the original 2020 Kaggle competition,
| all of which were using some variant of brute-force program
| search. This suggests a large fraction of ARC-AGI-1 tasks are
| susceptible to this kind of method and do not carry much useful
| signal towards general intelligence.
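|
| For concreteness, "brute-force program search" means roughly
| the following (a minimal sketch with a made-up five-primitive
| DSL; 2020-era solvers used far richer ones):
|
|     # Enumerate compositions of grid primitives; accept the first
|     # program that maps every demo input to its demo output.
|     from itertools import product
|
|     PRIMITIVES = {
|         "identity":  lambda g: g,
|         "rot90":     lambda g: [list(r) for r in zip(*g[::-1])],
|         "flip_h":    lambda g: [r[::-1] for r in g],
|         "flip_v":    lambda g: g[::-1],
|         "transpose": lambda g: [list(r) for r in zip(*g)],
|     }
|
|     def run(program, grid):
|         for name in program:
|             grid = PRIMITIVES[name](grid)
|         return grid
|
|     def search(demos, max_depth=3):
|         for depth in range(1, max_depth + 1):
|             for program in product(PRIMITIVES, repeat=depth):
|                 if all(run(program, x) == y for x, y in demos):
|                     return program
|         return None
|
|     # Toy task whose hidden rule is a clockwise 90-degree rotation.
|     demos = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]
|     print(search(demos))  # ('rot90',)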
| fchollet wrote:
| It is correct that the first model to beat ARC-AGI will only
| be able to handle ARC-AGI tasks. However, the idea is that the
| _architecture_ of that model should be repurposable to
| arbitrary problems. That is what makes ARC-AGI a good compass
| towards AGI (unlike chess).
|
| For instance, current top models use TTT, which is a completely
| general-purpose technique that provides the most significant
| boost to DL models' generalization power in recent memory.
|
| The other category of approach that is working well is program
| synthesis -- if pushed to the extent that it could solve ARC-
| AGI, the same system could be redeployed to solve arbitrary
| programming tasks, as well as tasks isomorphic to programming
| (such as theorem proving).
| scoobertdoobert wrote:
| Francois, have you coded and tested a solution yourself that
| you think will work best?
| mrandish wrote:
| > It seems that even if someone creates a model that can
| "solve" ARC, it still is not indicative of AGI since it is not
| "general" anymore
|
| I recently explained why I like ARC to a non-technical friend
| this way: "When an AI solves ARC it won't be proof of AGI. It's
| the opposite. As long as ARC remains unsolved I'm confident
| we're not even close to AGI."
|
| For the sake of being provocative, I'd even argue that ARC
| remaining unsolved is a sign we're not yet making meaningful
| progress in the right direction. AGI is the top of Everest. ARC
| is base camp.
| YeGoblynQueenne wrote:
| The first question I still have is what happened to core
| knowledge priors. The white paper that introduced ARC made a
| big to-do about how core knowledge priors are necessary to
| solve ARC tasks, but from what I can tell none of the best-
| performing (or at-all-performing) systems have anything to do
| with core knowledge priors.
|
| So what happened to that assumption? Is it dead?
|
| The second question I still have is about the defenses of ARC
| against memorisation-based, big-data approaches. I note that
| the second-best system is based on an LLM with "test time
| training", where the first two steps are: (1) initial
| finetuning on similar tasks, and (2) an auxiliary task format
| and augmentations.
|
| Which is to say, a data augmentation approach. With big data
| comes great responsibility, and the authors of the second-best
| system don't disappoint: they claim that by training on more
| examples they achieve reasoning.
|
| So what happened to the claim that ARC is secure against big-data
| approaches? Is it dead?
| fchollet wrote:
| What all top models do is recombine at test time the knowledge
| they already have. So they all possess Core Knowledge priors.
| Techniques to acquire them vary:
|
| * Use a pretrained LLM and hope that relevant programs will be
| memorized via exposure to text data (this doesn't work that
| well)
|
| * Pretrain an LLM on ARC-AGI-like data
|
| * Hardcode the priors into a DSL
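|
| As an illustration of the third option (hypothetical code, not
| any particular winner's DSL), the Core Knowledge "objectness"
| prior can be hardcoded as a connected-components primitive that
| programs then compose:
|
|     # Decompose a grid into same-colored, 4-connected components
|     # so downstream search reasons over objects, not raw cells.
|     def objects(grid):
|         h, w = len(grid), len(grid[0])
|         seen, objs = set(), []
|         for r0 in range(h):
|             for c0 in range(w):
|                 if (r0, c0) in seen or grid[r0][c0] == 0:  # 0 = background
|                     continue
|                 color, stack, cells = grid[r0][c0], [(r0, c0)], set()
|                 while stack:  # flood fill
|                     r, c = stack.pop()
|                     if (r, c) in seen:
|                         continue
|                     seen.add((r, c))
|                     cells.add((r, c))
|                     for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
|                         nr, nc = r + dr, c + dc
|                         if (0 <= nr < h and 0 <= nc < w
|                                 and grid[nr][nc] == color):
|                             stack.append((nr, nc))
|                 objs.append(cells)
|         return objs
|
|     grid = [[1, 1, 0],
|             [0, 0, 2],
|             [0, 2, 2]]
|     print([sorted(o) for o in objects(grid)])
|     # [[(0, 0), (0, 1)], [(1, 2), (2, 1), (2, 2)]]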
|
| > Which is to say, a data augmentation approach
|
| The key bit isn't the data augmentation but the TTT. TTT is a
| way to lift the #1 issue with DL models: that they cannot
| recombine their knowledge at test time to adapt to something
| they haven't seen before (strong generalization). You can argue
| whether TTT is the right way to achieve this, but there is no
| doubt that TTT is a major advance in this direction.
|
| The top ARC-AGI models perform well not because they're trained
| on tons of data, but because they can adapt to novelty at test
| time (usually via TTT). For instance, if you drop the TTT
| component you will see that these large models trained on
| millions of synthetic ARC-AGI tasks drop to <10% accuracy. This
| demonstrates empirically that ARC-AGI cannot be solved purely
| via memorization and interpolation.
| optimalsolver wrote:
| >This demonstrates empirically that ARC-AGI cannot be solved
| purely via memorization and interpolation
|
| Now that the current challenge is over, and a successor
| dataset is in the works, can we see how well the leading LLMs
| perform against the private test set?
| tuukkah wrote:
| I think the "semi-private" numbers here already measure
| that: https://arcprize.org/2024-results
|
| For example, Claude 3.5 gets 14% in semi-private eval vs
| 21% in public eval. I remember reading an explanation of
| "semi-private" earlier but cannot find it now.
| YeGoblynQueenne wrote:
| >> So they all possess Core Knowledge priors.
|
| Do you mean the ones from your white paper? The same ones
| that humans possess? How do you know this?
|
| >> The key bit isn't the data augmentation but the TTT.
|
| I haven't had the chance to read the papers carefully. Have
| they done ablation studies? For instance, is the following a
| guess or is it an empirical result?
|
| >> For instance, if you drop the TTT component you will see
| that these large models trained on millions of synthetic ARC-
| AGI tasks drop to <10% accuracy.
| aithrowawaycomm wrote:
| Even the strongest possible interpretation of the results
| wouldn't support the conclusion that "ARC-AGI is dead",
| because none of the submissions came especially close to
| human-level performance: the criterion was 85% success, but
| the best in 2024 was 55%.
|
| That said, I think there should be consideration via
| information thermodynamics: even with TTT these program-
| generating systems are using an enormous amount of bits
| compared to a human mind, a tiny portion of which solves ARC
| quickly and easily using causality-first principles of
| reasoning.
|
| Another point: suppose a system solves ARC-AGI with 99%
| accuracy. Then it should be tested on "HARC-HAGI," a variant
| that uses hexagons instead of squares. This likely wouldn't
| trip up a human very much - perhaps a small decrease due to
| increased surface area for brain farts. But if the AI needs to
| be retrained on a ton of hexagonal examples, then that AI can't
| be an AGI candidate.
| szvsw wrote:
| > That said, I think there should be consideration via
| information thermodynamics: even with TTT these program-
| generating systems are using an enormous amount of bits
| compared to a human mind, a tiny portion of which solves ARC
| quickly and easily using causality-first principles of
| reasoning.
|
| This isn't my area of expertise, but it seems plausible to me
| that what you said is completely erroneous or at the very
| least completely unverifiable at this point in time. How do
| you quantify how many bits it takes a human mind to solve one
| of the ARC problems?
|
| That seems likely to be beyond the level of insight we have
| into the structure of cognition, information storage, and so
| on in wetware. I could of course be wrong and would love to be
| corrected if so! You mentioned a "tiny portion" of the human
| mind, but (as far as I'm aware) any given "small" part of
| human cognition still involves huge amounts of complexity and
| compute.
|
| Maybe you are saying that the high-level decision making a
| human goes through when solving can be represented with a
| relatively small number of pieces of information or logical
| operations (as opposed to a much lower-level, closer-to-the-
| wetware notion of quantity of information). But then it seems
| unfair to compare that to the low-level equivalent (weights &
| biases, FLOPs, etc.) in the ML system, when there may be
| higher-order equivalents.
|
| I do appreciate the general notion of wanting to normalize
| against _something_, though, and some notion of information
| seems like a reasonable choice, but it's practically out of
| our reach. Maybe something like peak power or total energy
| consumption would be a more reasonable choice. In the human
| case we can at least get lower and upper bounds (metabolic
| rates are pretty well studied, and even if we don't know how
| much energy a given cognitive task consumes, we can bound it
| by the cost of running the entire system for that period of
| time), and in the ML case we can get close to a precise value.
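|
| A back-of-the-envelope version of that comparison (every
| number below is an assumption: ~20 W whole-brain metabolic
| rate, a few minutes per task, ~700 W for one H100-class
| accelerator):
|
|     # Rough per-task energy bounds, human vs. single-GPU solver.
|     BRAIN_WATTS = 20      # whole-brain power; an upper bound on
|                           # the task-relevant fraction
|     HUMAN_SECONDS = 180   # assume ~3 minutes per task
|     GPU_WATTS = 700       # one H100-class accelerator at full load
|     GPU_SECONDS = 300     # assume ~5 minutes of test-time compute
|
|     human_j = BRAIN_WATTS * HUMAN_SECONDS  # 3.6 kJ upper bound
|     gpu_j = GPU_WATTS * GPU_SECONDS        # 210 kJ lower bound
|
|     print(f"human <= {human_j/1e3:.1f} kJ/task")
|     print(f"1 GPU >= {gpu_j/1e3:.1f} kJ/task")
|     print(f"ratio >= {gpu_j/human_j:.0f}x")  # ~58x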
| hulium wrote:
| Were there any interesting non-neural approaches? I was wondering
| whether there is any underlying structure in the ARC tasks that
| could tell us something about algorithms for "reasoning" problems
| in general.
| neoneye2 wrote:
| The 3rd-place solution, by Agnis Liukis, solves 40 tasks:
| https://www.kaggle.com/code/gregkamradt/arc-prize-2024-solut...
___________________________________________________________________
(page generated 2024-12-06 23:00 UTC)