[HN Gopher] Arc Prize 2024 Winners and Technical Report
       ___________________________________________________________________
        
       Arc Prize 2024 Winners and Technical Report
        
       Author : alphabetting
       Score  : 71 points
       Date   : 2024-12-06 19:20 UTC (3 hours ago)
        
 (HTM) web link (arcprize.org)
 (TXT) w3m dump (arcprize.org)
        
       | mikeknoop wrote:
       | Author here -- six months ago we launched ARC Prize, a huge $1M
        | experiment to test whether we need new ideas for AGI. The ARC-AGI
        | benchmark remains unbeaten, and I think we can now definitively
        | say "yes".
       | 
       | One big update since June is that progress is no longer stalled.
       | Coming into 2024, the public consensus vibe was that pure deep
       | learning / LLMs would continue scaling to AGI. The fundamental
       | architecture of these systems hasn't changed since ~2019.
       | 
       | But this flipped late summer. AlphaProof and o1 are evidence of
       | this new reality. All frontier AI systems are now incorporating
       | components beyond pure deep learning like program synthesis and
       | program search.
       | 
       | I believe ARC Prize played a role here too. All the winners this
       | year are leveraging new AGI reasoning approaches like deep-
       | learning guided program synthesis, and test-time training/fine-
       | tuning. We'll be seeing a lot more of these in frontier AI
       | systems in coming years.
       | 
       | And I'm proud to say that all the code and papers from this
       | year's winners are now open source!
       | 
        | We're going to keep running this thing annually until it's
        | defeated. And we've got ARC-AGI-2 in the works to improve on
       | several of the v1 flaws (more here:
       | https://arcprize.org/blog/arc-prize-2024-winners-technical-r...)
       | 
       | The ARC-AGI community keeps surprising me. From initial launch,
       | through o1 testing, to the final 48 hours when the winning team
       | jumped 10% and both winning papers dropped out of nowhere. I'm
       | incredibly grateful to everyone and we will do our best to
       | steward this attention towards AGI.
       | 
       | We'll be back in 2025!
        
         | mrandish wrote:
         | Congrats to you and Francois on the success of ARC-AGI 24 and
         | thanks so much for doing it. I just finished the technical
         | report and am encouraged! It's great to finally see some
         | tangible progress in research that is both novel and plausibly
         | in fruitful directions.
        
         | tbalsam wrote:
          | Speaking as a rather experienced ML researcher: ARC is a great
          | benchmark on its own, but it punches below its weight when it
          | claims to be a gate (or, in this post's terms, a "steward")
          | towards AGI. In my view, and in the view of several researchers
          | near me, this has watered down the value of the ARC benchmark
          | as a test.
         | 
         | It is a great unit test for reasoning -- that's fantastic! And
         | maybe it is indeed the best way to test for this -- who knows
          | exactly. But the claim is a little grandiose for what it is;
          | it's somewhat like saying that testing on string parity is the
          | One True Test of an optimizer's efficiency.
         | 
          | I'd heartily recommend taking the marketing vibrance down a
          | notch and keeping things a bit more measured. It's not entirely
          | a meme, but some of the more serious researchers don't take it
          | as seriously as a result. And those are the kind of people you
          | want to attract to this sort of thing!
         | 
          | I think there is a potentially good future for ARC! But, as a
          | result, it might struggle to attract some of the talent you
          | want working on this problem.
        
           | mikeknoop wrote:
            | > I'd heartily recommend taking the marketing vibrance down
            | a notch and keeping things a bit more measured. It's not
            | entirely a meme, but some of the more serious researchers
            | don't take it as seriously as a result.
           | 
            | This is a fair critique. ARC Prize's 2024 messaging was sharp
            | in order to break through the noise floor -- ARC has been
            | around since 2019 but most people only learned about it this
            | summer. Now that it has garnered awareness, that sharpness is
            | no longer useful, and in some cases is hurting progress, as
            | you point out. The messaging needs to evolve and mature next
            | year to be more neutral/academic.
        
             | tbalsam wrote:
             | I feel rather consternated that this response effectively
             | boils down to "yes, we know we overhyped this to get
             | people's attention, and now that we have it we can be more
              | honest about it". Fighting for a place in the attention
              | economy is understandable; being deceptive about it is not.
             | 
             | This is part of the ethical morass of why some more serious
             | researchers aren't touching the benchmark. People are not
             | going to take it seriously if it continues like this!
        
               | mikeknoop wrote:
               | I think we agree; to clarify, sharp messaging isn't
               | inaccurate messaging. And I believe the story is not
               | overhyped given the evidence: the benchmark resisted a
               | $1M prize pool for ~6 months. But I concede we did obsess
               | about the story to give it the best chance of survival in
               | the marketplace of ideas against the incumbent AI
               | research meme (LLM scaling). Now that the AI research
               | field is coming around to the idea that something beyond
                | deep learning is needed, the story matters less, and the
                | benchmark and its future versions can stand on their own
                | utility as a compass towards AGI.
        
               | iwsk wrote:
               | we live in a society
        
         | trott wrote:
         | Mike and Francois,
         | 
         | Compute is limited during inference, and this naturally limits
         | brute-force program search.
         | 
         | But this doesn't prevent one from creating a huge ARC-like
         | dataset ahead of time, like BARC did (but bigger), and training
         | a correspondingly huge NN on it.
         | 
         | Placing a limit on the submission size could foil this kind of
         | brute-force approach though. I wonder if you are considering
         | this for 2025?
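          | 
          | To make that concrete, here is a toy sketch of what
          | "generating an ARC-like dataset ahead of time" could mean:
          | sample a hidden grid transformation, apply it to random
          | inputs, and keep the (input, output) pairs as one training
          | task. This is hypothetical illustration code, not BARC's
          | actual generator.
          | 
          |     import random
          | 
          |     def random_grid(h, w, n_colors=10):
          |         # Random h x w grid of color indices, like an ARC input.
          |         return [[random.randrange(n_colors) for _ in range(w)]
          |                 for _ in range(h)]
          | 
          |     TRANSFORMS = {
          |         "flip_h": lambda g: [row[::-1] for row in g],
          |         "flip_v": lambda g: g[::-1],
          |         "transpose": lambda g: [list(r) for r in zip(*g)],
          |     }
          | 
          |     def make_task(n_pairs=3):
          |         # One synthetic task: several demos of one hidden rule.
          |         name = random.choice(list(TRANSFORMS))
          |         f = TRANSFORMS[name]
          |         pairs = []
          |         for _ in range(n_pairs):
          |             g = random_grid(random.randint(2, 6),
          |                             random.randint(2, 6))
          |             pairs.append({"input": g, "output": f(g)})
          |         return {"transform": name, "train": pairs}
          | 
          |     dataset = [make_task() for _ in range(100_000)]  # offline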
        
         | padswo1 wrote:
         | I don't think ARC has particularly advanced the research. The
         | approaches that are successful were developed elsewhere and
         | then applied to ARC. Happy to be shown somewhere this is not
         | the case.
         | 
         | In the case of TTT, I wouldn't really describe that as a 'new
         | AGI reasoning approach'. People have been fine tuning deep
         | learning models on specific tasks for a long time.
         | 
          | The fundamental instinct driving the creation of ARC -- that
          | 'deep learning cannot do system 2 thinking' -- is under threat
          | of being proven wrong very soon. Attempts to define the
          | approaches that are working as somehow not 'traditional deep
          | learning' really seem like shifting the goalposts.
        
           | mikeknoop wrote:
            | Correct, fine-tuning is not new. It's long been used to
            | augment foundational LLMs with private data, e.g. private
            | enterprise data. We do this at Zapier, for instance.
            | 
            | The new and surprising thing about test-time training (TTT)
            | is how effective it is as an approach to dealing with novel
            | abstract reasoning problems like ARC-AGI.
           | 
           | TTT was pioneered by Jack Cole last year and popularized this
           | year by several teams, including this winning paper:
           | https://ekinakyurek.github.io/papers/ttt.pdf
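            | 
            | To make the mechanism concrete, here is a minimal sketch of
            | the idea: clone a trained model and fine-tune it briefly on
            | one task's own demonstration pairs before predicting that
            | task's test output. The tiny per-cell model and the identity
            | toy task below are illustrative assumptions of mine, not the
            | paper's or the winners' setup.
            | 
            |     import copy
            |     import torch
            |     import torch.nn as nn
            |     import torch.nn.functional as F
            | 
            |     N_COLORS, SIZE = 10, 10   # toy: pad every grid to 10x10
            | 
            |     class TinyGridModel(nn.Module):
            |         # Per-cell classifier: one-hot colors in, logits out.
            |         def __init__(self):
            |             super().__init__()
            |             self.net = nn.Sequential(
            |                 nn.Linear(N_COLORS, 64), nn.ReLU(),
            |                 nn.Linear(64, N_COLORS))
            |         def forward(self, grid):  # grid: (SIZE, SIZE) ints
            |             x = F.one_hot(grid, N_COLORS).float()
            |             return self.net(x)    # (SIZE, SIZE, N_COLORS)
            | 
            |     def test_time_train(base, demos, steps=50, lr=1e-3):
            |         # Clone the base model, fine-tune on this one task's
            |         # demonstration pairs only, return the adapted copy.
            |         model = copy.deepcopy(base)
            |         opt = torch.optim.Adam(model.parameters(), lr=lr)
            |         for _ in range(steps):
            |             for inp, out in demos:
            |                 loss = F.cross_entropy(
            |                     model(inp).reshape(-1, N_COLORS),
            |                     out.reshape(-1))
            |                 opt.zero_grad()
            |                 loss.backward()
            |                 opt.step()
            |         return model
            | 
            |     # Toy task where the hidden rule is "copy the input".
            |     demos = [(g, g.clone()) for g in
            |              (torch.randint(0, N_COLORS, (SIZE, SIZE))
            |               for _ in range(3))]
            |     test_input = torch.randint(0, N_COLORS, (SIZE, SIZE))
            |     adapted = test_time_train(TinyGridModel(), demos)
            |     prediction = adapted(test_input).argmax(dim=-1)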
        
       | celeritascelery wrote:
       | What surprises me about this is how poorly general-purpose LLMs
       | do. The best one is OpenAI o1-preview at 18%. This is
        | significantly worse than purpose-built models like ARChitects
        | (which scored 53.5%). This model used TTT to train on the ARC-AGI
        | task specification (among other things). It seems that even if
        | someone creates a model that can "solve" ARC, it still is not
        | indicative of AGI, since it is no longer "general"; it is just
        | specialized to this particular task. Similar to how chess engines
       | are not AGI, despite being superhuman at chess. It will be much
       | more convincing when general models not trained specifically for
       | ARC can still score well on it.
       | 
       | They do mention that some of the tasks here are susceptible to
       | brute force and they plan to address that in ARC-AGI-2.
       | 
       | > nearly half (49%) of the private evaluation set was solved by
       | at least one team during the original 2020 Kaggle competition all
       | of which were using some variant of brute-force program search.
       | This suggests a large fraction of ARC-AGI-1 tasks are susceptible
       | to this kind of method and does not carry much useful signal
       | towards general intelligence.
        
         | fchollet wrote:
         | It is correct that the first model that will beat ARC-AGI will
         | only be able to handle ARC-AGI tasks. However, the idea is that
         | the _architecture_ of that model should be able to be
         | repurposed to arbitrary problems. That is what makes ARC-AGI a
         | good compass towards AGI (unlike chess).
         | 
          | For instance, current top models use TTT, which is a completely
          | general-purpose technique that provides the most significant
          | boost to DL models' generalization power in recent memory.
         | 
         | The other category of approach that is working well is program
         | synthesis -- if pushed to the extent that it could solve ARC-
         | AGI, the same system could be redeployed to solve arbitrary
         | programming tasks, as well as tasks isomorphic to programming
         | (such as theorem proving).
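          | 
          | A toy illustration of that category: brute-force program
          | search over a small grid DSL -- enumerate short compositions
          | of primitives and return the first program consistent with
          | every demonstration pair. The mini-DSL below is a made-up
          | example of mine, not any competition entry's.
          | 
          |     from itertools import product
          | 
          |     # Made-up mini-DSL of grid -> grid primitives.
          |     PRIMITIVES = {
          |         "identity": lambda g: g,
          |         "flip_h": lambda g: [row[::-1] for row in g],
          |         "flip_v": lambda g: g[::-1],
          |         "transpose": lambda g: [list(r) for r in zip(*g)],
          |     }
          | 
          |     def run(program, grid):
          |         # A "program" is just a sequence of primitive names.
          |         for name in program:
          |             grid = PRIMITIVES[name](grid)
          |         return grid
          | 
          |     def search(demos, max_len=3):
          |         # Enumerate programs by length; keep the first one
          |         # that reproduces every demonstration output exactly.
          |         for length in range(1, max_len + 1):
          |             for prog in product(PRIMITIVES, repeat=length):
          |                 if all(run(prog, i) == o for i, o in demos):
          |                     return prog
          |         return None
          | 
          |     demos = [([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
          |              ([[5, 6, 7]], [[7, 6, 5]])]
          |     print(search(demos))   # ('flip_h',)
          | 
          | The deep-learning-guided variants discussed elsewhere in the
          | thread roughly replace this blind enumeration with a model
          | that proposes promising candidate programs first.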
        
           | scoobertdoobert wrote:
           | Francois, have you coded and tested a solution yourself that
           | you think will work best?
        
         | mrandish wrote:
         | > It seems that even if someone creates a model that can
         | "solve" ARC, it still is not indicative of AGI since it is not
         | "general" anymore
         | 
         | I recently explained why I like ARC to a non-technical friend
         | this way: "When an AI solves ARC it won't be proof of AGI. It's
         | the opposite. As long as ARC remains unsolved I'm confident
         | we're not even close to AGI."
         | 
         | For the sake of being provocative, I'd even argue that ARC
         | remaining unsolved is a sign we're not yet making meaningful
         | progress in the right direction. AGI is the top of Everest. ARC
         | is base camp.
        
       | YeGoblynQueenne wrote:
       | The first question I still have is what happened to core
        | knowledge priors. The white paper that introduced ARC made a big
        | to-do about how core knowledge priors are necessary to solve ARC
        | tasks, but from what I can tell none of the best-performing (or
        | at-all-performing) systems have anything to do with core
        | knowledge priors.
       | 
       | So what happened to that assumption? Is it dead?
       | 
       | The second question I still have is about the defenses of ARC
       | against memorisation-based, big-data approaches. I note that the
       | second best system is based on an LLM with "test time training"
        | where the first two steps are:
        | 
        |     initial finetuning on similar tasks
        |     auxiliary task format and augmentations
       | 
       | Which is to say, a data augmentation approach. With big data
       | comes great responsibility and the authors of the second-best
       | system don't disappoint: they claim that by training on more
       | examples they achieve reasoning.
       | 
       | So what happened to the claim that ARC is secure against big-data
       | approaches? Is it dead?
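        | 
        | For context, the augmentation step being referred to is
        | typically something like the following: expand each
        | (input, output) demo pair by applying the same symmetry and
        | color permutation to both grids. This is a minimal sketch under
        | my own assumptions, not the cited paper's actual pipeline.
        | 
        |     import random
        | 
        |     def rotate(grid):
        |         # 90-degree clockwise rotation.
        |         return [list(row) for row in zip(*grid[::-1])]
        | 
        |     def mirror(grid):
        |         return [row[::-1] for row in grid]
        | 
        |     def recolor(grid, mapping):
        |         return [[mapping[c] for c in row] for row in grid]
        | 
        |     def augment(pair, n=8, n_colors=10):
        |         # Make n variants of one demo pair, applying the same
        |         # random symmetry + color permutation to input and
        |         # output grids.
        |         inp, out = pair
        |         variants = []
        |         for _ in range(n):
        |             perm = list(range(n_colors))
        |             random.shuffle(perm)
        |             a, b = recolor(inp, perm), recolor(out, perm)
        |             for _ in range(random.randint(0, 3)):
        |                 a, b = rotate(a), rotate(b)
        |             if random.random() < 0.5:
        |                 a, b = mirror(a), mirror(b)
        |             variants.append((a, b))
        |         return variants
        | 
        |     extra = augment(([[0, 1], [2, 3]], [[3, 2], [1, 0]]))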
        
         | fchollet wrote:
         | What all top models do is recombine at test time the knowledge
         | they already have. So they all possess Core Knowledge priors.
         | Techniques to acquire them vary:
         | 
         | * Use a pretrained LLM and hope that relevant programs will be
         | memorized via exposure to text data (this doesn't work that
         | well)
         | 
          | * Pretrain an LLM on ARC-AGI-like data
         | 
         | * Hardcode the priors into a DSL
         | 
         | > Which is to say, a data augmentation approach
         | 
         | The key bit isn't the data augmentation but the TTT. TTT is a
         | way to lift the #1 issue with DL models: that they cannot
         | recombine their knowledge at test time to adapt to something
         | they haven't seen before (strong generalization). You can argue
         | whether TTT is the right way to achieve this, but there is no
         | doubt that TTT is a major advance in this direction.
         | 
         | The top ARC-AGI models perform well not because they're trained
         | on tons of data, but because they can adapt to novelty at test
         | time (usually via TTT). For instance, if you drop the TTT
         | component you will see that these large models trained on
         | millions of synthetic ARC-AGI tasks drop to <10% accuracy. This
         | demonstrates empirically that ARC-AGI cannot be solved purely
         | via memorization and interpolation.
        
           | optimalsolver wrote:
           | >This demonstrates empirically that ARC-AGI cannot be solved
           | purely via memorization and interpolation
           | 
           | Now that the current challenge is over, and a successor
           | dataset is in the works, can we see how well the leading LLMs
           | perform against the private test set?
        
             | tuukkah wrote:
             | I think the "semi-private" numbers here already measure
             | that: https://arcprize.org/2024-results
             | 
             | For example, Claude 3.5 gets 14% in semi-private eval vs
             | 21% in public eval. I remember reading an explanation of
             | "semi-private" earlier but cannot find it now.
        
           | YeGoblynQueenne wrote:
           | >> So they all possess Core Knowledge priors.
           | 
           | Do you mean the ones from your white paper? The same ones
           | that humans possess? How do you know this?
           | 
           | >> The key bit isn't the data augmentation but the TTT.
           | 
           | I haven't had the chance to read the papers carefully. Have
           | they done ablation studies? For instance, is the following a
           | guess or is it an empirical result?
           | 
           | >> For instance, if you drop the TTT component you will see
           | that these large models trained on millions of synthetic ARC-
           | AGI tasks drop to <10% accuracy.
        
         | aithrowawaycomm wrote:
         | Even the strongest possible interpretation of the results
         | wouldn't conclude "ARC-AGI is dead" because none of the
         | submissions came especially close to human-level performance;
          | the criterion was 85% success but the best in 2024 was 55%.
         | 
         | That said, I think there should be consideration via
         | information thermodynamics: even with TTT these program-
         | generating systems are using an enormous amount of bits
         | compared to a human mind, a tiny portion of which solves ARC
         | quickly and easily using causality-first principles of
         | reasoning.
         | 
         | Another point: suppose a system solves ARC-AGI with 99%
         | accuracy. Then it should be tested on "HARC-HAGI," a variant
         | that uses hexagons instead of squares. This likely wouldn't
         | trip up a human very much - perhaps a small decrease due to
         | increased surface area for brain farts. But if the AI needs to
         | be retrained on a ton of hexagonal examples, then that AI can't
         | be an AGI candidate.
        
           | szvsw wrote:
           | > That said, I think there should be consideration via
           | information thermodynamics: even with TTT these program-
           | generating systems are using an enormous amount of bits
           | compared to a human mind, a tiny portion of which solves ARC
           | quickly and easily using causality-first principles of
           | reasoning.
           | 
           | This isn't my area of expertise, but it seems plausible to me
           | that what you said is completely erroneous or at the very
           | least completely unverifiable at this point in time. How do
           | you quantify how many bits it takes a human mind to solve one
           | of the ARC problems?
           | 
           | That seems likely beyond the level of insight we have into
           | the structure of cognition and information storage etc etc in
           | wetware. I could of course be wrong and would love to be
           | corrected if so! You mentioned a "tiny portion" of the human
           | mind, but (as far as I'm aware), any given "small" part of
           | human cognition still involves huge amounts of complexity and
           | compute.
           | 
           | Maybe you are saying that the high level decision making a
           | human goes through when solving can be represented with a
           | relatively small number of pieces of information/logical
           | operations (as opposed to a much lower level notion closer to
           | the wetware of the quantity of information) but then it seems
           | unfair to compare to the low level equivalent (weights &
           | biases, FLOPs etc) in the ML system when there may be higher
           | order equivalents.
           | 
            | I do appreciate the general notion of wanting to normalize
            | against _something_, though, and some notion of information
            | seems like a reasonable choice, but it is practically out of
            | our reach. Maybe something like peak power or total energy
            | consumption would be a more reasonable choice, which we can
            | at least get lower and upper bounds on in the human case
            | (metabolic rates are pretty well studied, and even if we
            | don't have a good idea of how much energy is involved in
            | completing cognitive tasks, we can at least get bounds for
            | running the entire system in that period of time) and close
            | to a precise value in the ML case.
        
       | hulium wrote:
       | Were there any interesting non-neural approaches? I was wondering
       | whether there is any underlying structure in the ARC tasks that
       | could tell us something about algorithms for "reasoning" problems
       | in general.
        
         | neoneye2 wrote:
          | The 3rd place solution by Agnis Liukis solves 40 tasks.
         | https://www.kaggle.com/code/gregkamradt/arc-prize-2024-solut...
        
       ___________________________________________________________________
       (page generated 2024-12-06 23:00 UTC)