[HN Gopher] Arc-AGI-2 and ARC Prize 2025
___________________________________________________________________
Arc-AGI-2 and ARC Prize 2025
Author : gkamradt
Score : 16 points
Date : 2025-03-24 20:35 UTC (2 hours ago)
(HTM) web link (arcprize.org)
(TXT) w3m dump (arcprize.org)
| gkamradt wrote:
| Hey HN, Greg from ARC Prize Foundation here.
|
| Alongside Mike Knoop and Francois Chollet, we're
| launching ARC-AGI-2, a frontier AI benchmark that measures a
| model's ability to generalize on tasks it hasn't seen before, and
| the ARC Prize 2025 competition to beat it.
|
| In Dec '24, ARC-AGI-1 (launched in 2019) pinpointed the moment
| AI moved beyond pure memorization, as demonstrated by OpenAI's
| o3.
|
| ARC-AGI-2 targets test-time reasoning.
|
| My view is that good AI benchmarks don't just measure progress,
| they inspire it. Our mission is to guide research towards general
| systems.
|
| Base LLMs (no reasoning) are currently scoring 0% on ARC-AGI-2.
| Specialized AI reasoning systems (like R1 or o3-mini) are <4%.
|
| Every ARC-AGI-2 task (100% of them), however, has been solved
| by at least two humans, quickly and easily. We know this
| because we tested 400 people live.
|
| Our belief is that once we can no longer come up with
| quantifiable problems that are "feasible for humans and hard for
| AI" then we effectively have AGI. ARC-AGI-2 proves that we do not
| have AGI.
|
| Change log from ARC-AGI-1 to ARC-AGI-2:
|
| * The two main evaluation sets (semi-private, private eval)
| have increased to 120 tasks
|
| * Solving tasks requires more reasoning vs pure intuition
|
| * Each task has been confirmed to have been solved by at
| least 2 people (many more) out of an average of 7 test
| takers in 2 attempts or less
|
| * Non-training task sets are now difficulty-calibrated
|
| The 2025 Prize ($1M, open-source required) is designed to drive
| progress on this specific gap. Last year's competition (also
| launched on HN) drew 1.5K participating teams and 40+
| published research papers.
|
| The Kaggle competition goes live later this week and you can sign
| up here: https://arcprize.org/competition
|
| We're in an idea-constrained environment. The next AGI
| breakthrough might come from you, not a giant lab.
|
| Happy to answer questions.
| artninja1988 wrote:
| What are you doing to prevent the test set being leaked? Will
| you still be offering API access to the semi private test set
| to the big model providers who presumably train on their API?
| gkamradt wrote:
| We have a few sets:
|
| 1. Public Train - 1,000 tasks that are public 2. Public Eval
| - 120 tasks that are public
|
| So for those two we don't have protections.
|
| 3. Semi Private Eval - 120 tasks that are exposed to 3rd
| parties. We sign data agreements where we can, but we
| understand this is exposed and not 100% secure. It's a risk
| we are open to in order to keep testing velocity. In theory
| it is very difficult to secure this 100%. The cost to create
| a new semi-private test set is lower than the effort needed
| to secure it 100%.
|
| 4. Private Eval - Only on Kaggle, not exposed to any 3rd
| parties at all. Very few people have access to this. Our
| trust vectors are with Kaggle and the internal team only.
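For anyone wanting to poke at the public sets, a minimal loader can be sketched as follows, assuming ARC-AGI-2 keeps the ARC-AGI-1 JSON layout (`"train"`/`"test"` arrays of `{"input": grid, "output": grid}` pairs, where a grid is a list of rows of ints 0-9). The file path is illustrative.

```python
import json

# Minimal ARC task loader, assuming the ARC-AGI-1 JSON layout:
# {"train": [{"input": grid, "output": grid}, ...], "test": [...]}
# where a grid is a list of rows of small ints (colors).

def load_task(path):
    with open(path) as f:
        task = json.load(f)
    return task["train"], task["test"]

def grid_shape(grid):
    # (rows, cols); ARC grids are rectangular.
    return (len(grid), len(grid[0]))

# Tiny inline example instead of reading a real file from disk:
task_json = (
    '{"train": [{"input": [[0, 1], [1, 0]],'
    ' "output": [[1, 0], [0, 1]]}],'
    ' "test": [{"input": [[1, 1]], "output": [[0, 0]]}]}'
)
task = json.loads(task_json)
demo = task["train"][0]
print(grid_shape(demo["input"]))  # (2, 2)
```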
| artificialprint wrote:
| Oh boy! Some of these tasks are not hard, but require full
| attention and a lot of counting just to get things right! ARC3
| will go 3D perhaps? JK
|
| Congrats on the launch, let's see how long it takes to get saturated
| fchollet wrote:
| ARC 3 is still spatially 2D, but it adds a time dimension, and
| it's interactive.
___________________________________________________________________
(page generated 2025-03-24 23:00 UTC)