[HN Gopher] Arc-AGI-2 and ARC Prize 2025
___________________________________________________________________
Arc-AGI-2 and ARC Prize 2025
Author : gkamradt
Score : 16 points
Date : 2025-03-24 20:35 UTC (2 hours ago)
(HTM) web link (arcprize.org)
(TXT) w3m dump (arcprize.org)
| gkamradt wrote:
| Hey HN, Greg from ARC Prize Foundation here.
|
| Alongside Mike Knoop and Francois Chollet, we're
| launching ARC-AGI-2, a frontier AI benchmark that measures a
| model's ability to generalize on tasks it hasn't seen before, and
| the ARC Prize 2025 competition to beat it.
|
| In Dec '24, ARC-AGI-1 (launched in 2019) pinpointed the moment
| AI moved beyond pure memorization, as demonstrated by OpenAI's
| o3.
|
| ARC-AGI-2 targets test-time reasoning.
|
| My view is that good AI benchmarks don't just measure progress,
| they inspire it. Our mission is to guide research towards general
| systems.
|
| Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2.
| Specialized AI reasoning systems (like R1 or o3-mini) are <4%.
|
| Every ARC-AGI-2 task (100% of them), however, has been solved
| by at least two humans, quickly and easily. We know this
| because we tested 400 people live.
|
| Our belief is that once we can no longer come up with
| quantifiable problems that are "feasible for humans and hard for
| AI" then we effectively have AGI. ARC-AGI-2 proves that we do not
| have AGI.
|
| Change log from ARC-AGI-1 to ARC-AGI-2:
|
| * The two main evaluation sets (semi-private, private eval)
| have increased to 120 tasks
|
| * Solving tasks requires more reasoning vs pure intuition
|
| * Each task has been confirmed to have been solved by at
| least 2 people (many more) out of an average of 7 test
| takers in 2 attempts or less
|
| * Non-training task sets are now difficulty-calibrated
|
| The 2025 Prize ($1M, open-source required) is designed to drive
| progress on this specific gap. Last year's competition (also
| launched on HN) drew 1.5K participating teams and 40+
| published research papers.
|
| The Kaggle competition goes live later this week and you can sign
| up here: https://arcprize.org/competition
|
| We're in an idea-constrained environment. The next AGI
| breakthrough might come from you, not a giant lab.
|
| Happy to answer questions.
| artninja1988 wrote:
| What are you doing to prevent the test set being leaked? Will
| you still be offering API access to the semi private test set
| to the big model providers who presumably train on their API?
| gkamradt wrote:
| We have a few sets:
|
| 1. Public Train - 1,000 tasks that are public 2. Public Eval
| - 120 tasks that are public
|
| So for those two we don't have protections.
|
| 3. Semi Private Eval - 120 tasks that are exposed to 3rd
| parties. We sign data agreements where we can, but we
| understand this is exposed and not 100% secure. It's a risk
| we are open to in order to keep testing velocity. In theory
| it is very difficult to secure this 100%. The cost to create
| a new semi-private test set is lower than the effort needed
| to secure it 100%.
|
| 4. Private Eval - Only on Kaggle, not exposed to any 3rd
| parties at all. Very few people have access to this. Our
| trust vectors are with Kaggle and the internal team only.
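For anyone wanting to poke at the public sets, a minimal loader can be sketched as follows, assuming ARC-AGI-2 keeps the ARC-AGI-1 JSON layout (`"train"`/`"test"` arrays of `{"input": grid, "output": grid}` pairs, where a grid is a list of rows of ints 0-9). The file path is illustrative.

```python
import json

# Minimal ARC task loader, assuming the ARC-AGI-1 JSON layout:
# {"train": [{"input": grid, "output": grid}, ...], "test": [...]}
# where a grid is a list of rows of small ints (colors).

def load_task(path):
    with open(path) as f:
        task = json.load(f)
    return task["train"], task["test"]

def grid_shape(grid):
    # (rows, cols); ARC grids are rectangular.
    return (len(grid), len(grid[0]))

# Tiny inline example instead of reading a real file from disk:
task_json = (
    '{"train": [{"input": [[0, 1], [1, 0]],'
    ' "output": [[1, 0], [0, 1]]}],'
    ' "test": [{"input": [[1, 1]], "output": [[0, 0]]}]}'
)
task = json.loads(task_json)
demo = task["train"][0]
print(grid_shape(demo["input"]))  # (2, 2)
```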
| artificialprint wrote:
| Oh boy! Some of these tasks are not hard, but require full
| attention and a lot of counting just to get things right! ARC3
| will go 3D perhaps? JK
|
| Congrats on the launch, let's see how long it takes to get saturated
| fchollet wrote:
| ARC 3 is still spatially 2D, but it adds a time dimension, and
| it's interactive.
___________________________________________________________________
(page generated 2025-03-24 23:00 UTC)