[HN Gopher] ARC Prize - a $1M+ competition towards open AGI prog...
___________________________________________________________________
ARC Prize - a $1M+ competition towards open AGI progress
Hey folks! Mike here. Francois Chollet and I are launching ARC
Prize, a public competition to beat and open-source the solution to
the ARC-AGI eval. ARC-AGI is (to our knowledge) the only eval
which measures AGI: a system that can efficiently acquire new
skills and solve novel, open-ended problems. Most AI evals measure
skill directly rather than the acquisition of new skills. Francois
created the eval in 2019; SOTA was 20% at inception and is only 34%
today.
Humans score 85-100%. 300 teams attempted ARC-AGI last year and
several bigger labs have attempted it. While most other skill-
based evals have rapidly saturated to human-level, ARC-AGI was
designed to resist "memorization" techniques (e.g. LLMs). Solving
ARC-AGI tasks is quite easy for humans (even children) but
impossible for modern AI. You can try ARC-AGI tasks yourself here:
https://arcprize.org/play ARC-AGI consists of 400 public training
tasks, 400 public test tasks, and 100 secret test tasks. Every task
is novel. SOTA is measured against the secret test set which adds
to the robustness of the eval. Solving ARC-AGI tasks requires no
world knowledge, no understanding of language. Instead each puzzle
requires a small set of "core knowledge priors" (goal directedness,
objectness, symmetry, rotation, etc.) At minimum, a solution to
ARC-AGI opens up a completely new programming paradigm where
programs can perfectly and reliably generalize from an arbitrary
set of priors. At maximum, it unlocks the tech tree towards AGI. Our
goal with this competition is: 1. Increase the number of
researchers working on frontier AGI research (vs tinkering with
LLMs). We need new ideas and the solution is likely to come from an
outsider! 2. Establish a popular, objective measure of AGI progress
that the public can use to understand how close we are to AGI (or
not). Every new SOTA score will be published here:
https://x.com/arcprize 3. Beat ARC-AGI and learn something new
about the nature of intelligence. Happy to answer questions!
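For readers who want to see what a task looks like on disk: in the
public fchollet/ARC repository, each task is a JSON object with
"train" and "test" lists, where every pair holds an "input" and an
"output" grid of integers 0-9 (one integer per color), and a
submission scores only on an exact grid match. A minimal Python
sketch of that format (the tiny task and the fill rule below are
invented purely for illustration):

```python
import json

# Each ARC-AGI task is a JSON object with "train" and "test" lists;
# every pair holds an "input" and "output" grid of integers 0-9,
# where each integer denotes a color. This toy task is made up.
task = json.loads("""
{
  "train": [
    {"input": [[1, 0], [0, 0]], "output": [[1, 1], [1, 1]]},
    {"input": [[0, 2], [0, 0]], "output": [[2, 2], [2, 2]]}
  ],
  "test": [
    {"input": [[0, 0], [3, 0]], "output": [[3, 3], [3, 3]]}
  ]
}
""")

def solve(grid):
    # Toy solver for this toy task: fill the whole grid with the
    # single non-zero color. Real ARC tasks require inferring a new
    # rule per task from the handful of training pairs.
    color = max(c for row in grid for c in row)
    return [[color] * len(row) for row in grid]

# Scoring is exact match on the predicted output grid.
for pair in task["test"]:
    assert solve(pair["input"]) == pair["output"]
```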
Author : mikeknoop
Score : 151 points
Date : 2024-06-11 17:19 UTC (5 hours ago)
(HTM) web link (arcprize.org)
(TXT) w3m dump (arcprize.org)
| freediver wrote:
| This is amazing, and much needed. Thanks for organizing this.
| Makes me want to flex the programming muscle again.
| breck wrote:
| I can beat the SOTA using ICS
| (https://breckyunits.com/intelligence.html)
|
| If you make your site public domain, and drop the (C), I'll
| compete.
| Lerc wrote:
| I watched a video that covered ARC-AGI a few days ago. It had
| links to the old competition. It gave me much to think about.
| Nice to see a new run at it.
|
| Not sure if I have the skills to make an entry, but I'll be
| watching at least.
| lacker wrote:
| I really like the idea of ARC. But to me the problems seem like
| they require a lot of spatial world knowledge, more than they
| require abstract reasoning. Shapes overlapping each other,
| containing each other, slicing up and reassembling pieces,
| denoising regular geometric shapes: you can call this "core
| knowledge," but to me these seem more like "things that are
| intuitive to human visual processing".
|
| Would an intelligent but blind human be able to solve these
| problems?
|
| I'm worried that we will need more than 800 examples to solve
| these problems, not because the abstract reasoning is so
| difficult, but because the problems require spatial knowledge
| that we intelligent humans learn with far more than 800 training
| examples.
| nickpsecurity wrote:
| To parent: the spatial reasoning and blind-person points were
| great counterexamples. The benchmark might still be OK despite
| the blind exceptions if it demonstrated general reasoning.
|
| To OP: I like your project goal. I think you should look at
| prior reasoning engines that tried to build common sense. Cyc
| and OpenMind are examples. You also might find use for the list
| of AGI goals in Section 2 of this paper:
|
| https://arxiv.org/pdf/2308.04445
|
| When studying introductions to brain function, I also noted many
| regions tie into the hippocampus which might do both sense-
| neutral storage of concepts and make inner models (or
| approximations) of the external world. The former helps tie
| concepts together through various senses. The latter helps in
| planning when we are imagining possibilities to evaluate and
| iterate on them.
|
| Seems like AGI should have these hippocampus-like traits and
| those in the Cyc paper. One could test if an architecture could
| do such things in theory or on a small scale. It shouldn't tie
| into just one type of sensory input either. At least two with
| the ability to act on what only exists in one or what is in
| both.
|
| Edit: Children also have an enormous amount of unsupervised
| training on visual and spatial data. They get reinforcement
| through play and supervised training by parents. A realistic
| benchmark might similarly require gigabytes of pretraining data.
| Lerc wrote:
| I don't think the intent is to learn the entire problem domain
| from the examples, but the specific rule that is being applied.
|
| There may (almost certainly will be) additional knowledge
| encoded in the solver to cover the spatial concepts etc. The
| distinction with the ARC-AGI test is the disparity between
| human and AI performance, and that it focuses on puzzles that
| are easier for humans.
|
| It would be interesting to see a fine-tuned LLM just try to
| express the rule for each puzzle in English. It could have full
| knowledge of what ARC-AGI is and how the tests operate, but the
| proof of the pudding is simply how it does on the test set.
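| One prerequisite for that experiment is serializing a grid into
| text a language model can consume. A minimal sketch of one such
| encoding (purely illustrative; the digit-per-cell rendering is an
| assumption, not anything ARC Prize prescribes):

```python
def grid_to_text(grid):
    # Render a grid as rows of digits, one row per line, so a
    # text-only model can "see" the task. Other encodings (color
    # names, JSON) are equally plausible; this is one simple choice.
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

example = [[0, 0, 1],
           [0, 1, 0],
           [1, 0, 0]]
print(grid_to_text(example))
# prints:
# 001
# 010
# 100
```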
| CooCooCaCha wrote:
| "Would an intelligent but blind human be able to solve these
| problems?"
|
| This is the wrong way to think about it IMO. Spatial
| relationships are just another type of logical relationship and
| we should expect AGI to be able to analyze relationships and
| generate algorithms on the fly to solve problems.
|
| Just because humans can be biased in various ways doesn't mean
| these biases are inherent to all intelligences.
| janalsncm wrote:
| Part of the concern might be that visual reasoning problems
| are overrepresented in ARC in the space of all abstract
| reasoning problems.
|
| It's similar to how chess problems are technically reasoning
| problems but they are not representative of general
| reasoning.
| crazygringo wrote:
| > _Spatial relationships are just another type of logical
| relationship and we should expect AGI to be able to analyze
| relationships and generate algorithms on the fly to solve
| problems._
|
| Not really. By that reasoning, 5-dimensional spatial
| reasoning is "just another type of logical relationship" and
| yet humans mostly can't do that at all.
|
| It's clear that we have incredibly specialized capabilities
| for dealing with two- and three-dimensional spatiality that
| don't have much of anything to do with general logical
| intelligence at all.
| elicksaur wrote:
| I'm a big fan of the ARC as a problem set to tackle. The
| sparseness of the data and infinite-ness of the rules which could
| apply make it much tougher than existing ML problem sets.
|
| However, I do disagree that this problem represents "AGI". It's
| just a different dataset than what we've seen with existing ML
| successes, but the approaches are generally similar to what's
| come before. It could be that some truly novel breakthrough which
| is AGI solves the problem set, but I don't think solving the
| problem set is a guaranteed indicator of AGI.
| m3kw9 wrote:
| Lowballing the crowd with this, I see
| bigyikes wrote:
| What is the fundamental difference between ARC and a standard IQ
| test? On the surface they seem similar in that they both involve
| deducing and generalizing visual patterns.
|
| Is there something special about these questions that makes them
| resistant to memorization? Or is it more just the fact that there
| are 100 secret tasks?
| paxys wrote:
| While I agree with the spirit of the competition, a $1M prize
| seems a little too low considering tens of billions of dollars
| have already been invested in the race to AGI, and we will see
| many times that put into the space in the coming years. The
| impact of AGI will be measured in _trillions_ at minimum. So what
| you are ultimately rewarding isn't AGI research but fine-tuning
| the newest public LLM release to best meet the parameters of the
| test.
|
| I'd also urge you to use a different platform for communicating
| with the public because x.com links are now inaccessible without
| creating an account.
| ks2048 wrote:
| The submissions can't use the internet. And I imagine can't be
| too huge - so you can't use "newest public LLMs" on this task.
| mikeknoop wrote:
| That is correct for ARC Prize: limited Kaggle compute (to
| target efficiency) and no internet (to reduce cheating).
|
| We are also trialing a secondary leaderboard called ARC-AGI-
| Pub that imposes no limits or constraints. Not part of the
| prize today but could be in the future:
| https://arcprize.org/leaderboard
| mikeknoop wrote:
| I agree, $1M is ~trivial in AI. The primary goal with the prize
| is to raise public awareness about how close (or far today) we
| are from AGI: https://arcprize.org/leaderboard and we hope that
| understanding will shift more would-be AI researchers toward
| working on new ideas.
| bigyikes wrote:
| Dwarkesh just released an interview with Francois Chollet
| (partner of OP). I've only listened to a few minutes so far, but
| I'm very interested in hearing more about his conceptions of the
| limitations of LLMs.
|
| https://youtu.be/UakqL6Pj9xo
| pmayrgundter wrote:
| This claim that these tests are easy for humans seems dubious,
| and so I went looking a bit. Melanie Mitchell chimed in on
| Chollet's thread and posted their related test [ConceptARC].
|
| In it they question the ease of Chollet's tests: "One limitation
| on ARC's usefulness for AI research is that it might be too
| challenging. Many of the tasks in Chollet's corpus are difficult
| even for humans, and the corpus as a whole might be sufficiently
| difficult for machines that it does not reveal real progress on
| machine acquisition of core knowledge."
|
| ConceptARC is designed to be easier, but then also has to filter
| ~15% of its own test takers for "[failing] at solving two or more
| minimal tasks... or they provided empty or nonsensical
| explanations for their solutions"
|
| After this filtering, ConceptARC finds another 10-15% failure
| rate amongst humans on the main corpus questions, so they're
| seeing maybe 25-30% unable to solve these simpler questions meant
| to test for "AGI".
|
| ConceptARC's main results show GPT-4 scoring well below the
| filtered humans, which would agree with a [Mensa] test result
| that its IQ=85.
|
| Chollet and Mitchell could instead stratify their human groups
| to estimate IQ, then compare against the Mensa measures and see
| whether, e.g., Claude3 at IQ=100 matches their ARC scores for
| the average human.
|
| [ConceptARC] https://arxiv.org/pdf/2305.07141
| [Mensa]https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-
| passes-10...
| mikeknoop wrote:
| Here is some published research on the human difficulty of ARC-
| AGI:
| https://cims.nyu.edu/~brenden/papers/JohnsonEtAl2021CogSci.p...
|
| > We found that humans were able to infer the underlying
| program and generate the correct test output for a novel test
| input example, with an average of 84% of tasks solved per
| participant
| mark_l_watson wrote:
| I saw Melanie's post and I am intrigued by an easier AGI suite.
| I would like to see some experimenting done by individuals like
| myself and smaller organizations.
| salamo wrote:
| They claim that the average score for humans is between 85% and
| 100%, so I think there's a disagreement on whether the test is
| actually too hard. Taking them at their word, if no existing
| model can score even half what the average human can, the test
| is certainly measuring some kind of significant difference.
|
| I guess there might be a disagreement of whether the problems
| in ARC are a representative sample of all of the possible
| abstract programs which could be synthesized, but then again
| most LLMs are also trained on human data.
| lxe wrote:
| I've never done these before, or Kaggle competitions in general.
| Any recommendations before I dive in? I have pretty much zero
| low-level ML experience, but a good amount of practical software
| eng behind me.
| gkamradt wrote:
| We put a bunch of detail on getting started in the guide:
| https://arcprize.org/guide
|
| Happy to answer any questions you have along the way
|
| (I'm helping run ARC Prize)
| david_shi wrote:
| What is the fastest way to get up to speed with techniques that
| led to the current SOTA?
| gkamradt wrote:
| Check out the SOTA resources in the guide:
|
| https://arcprize.org/guide
|
| Happy to answer any questions you have along the way
|
| (I'm helping run ARC Prize)
| abtinf wrote:
| > requires no world knowledge, no understanding of language
|
| This is treating "intelligence" like some abstract, platonic
| thing divorced from reality. Whatever else solving these puzzles
| is indicative of, it's not intelligence.
| abtinf wrote:
| From the abstract of the "On the Measure of Intelligence"
| paper:
|
| > We then articulate a new formal definition of intelligence
| based on Algorithmic Information Theory, describing
| intelligence as skill-acquisition efficiency and highlighting
| the concepts of scope, generalization difficulty, priors, and
| experience.
|
| I'm afraid that definition forecloses the possibility of AGI.
| The immediate basic question is: why build skills at all?
| Phil_Latio wrote:
| Why does an AGI need to have any knowledge about our reality?
| The principle behind an AGI should work just as well in a made-
| up world of which those puzzles are a part.
| salamo wrote:
| This is super cool. I share Francois' intuition that the
| presently data-hungry learning paradigm is not only not
| generalizable but unsustainable: humans do not need 10,000
| examples to tell the difference between cats and dogs, and the
| main reason computers can today is that we have millions of
| examples. As a result, it may be hard to transfer knowledge to
| more esoteric domains where data is expensive, rare, and hard to
| synthesize.
|
| If I can make one criticism/observation of the tests, it seems
| that most of them reason about perfect information in a game-
| theoretic sense. However, many if not most of the more
| challenging problems we encounter involve hidden information.
| Poker and negotiations are examples of problem solving in
| imperfect information scenarios. Smoothly navigating social
| situations also requires a related problem of working with hidden
| information.
|
| One of the really interesting things we humans are able to do is
| to take the rules of a game and generate strategies. While we do
| have some algorithms which can "teach themselves" e.g. to play go
| or chess, those same self-play algorithms don't work on hidden
| information games. One of the really interesting capabilities of
| any generally-intelligent system would be synthesizing a general
| problem solver for those kinds of situations as well.
| empath75 wrote:
| This is like offering a one million dollar prize for curing
| cancer. It's sort of pointless to offer a prize for something
| people are spending orders of magnitude more on trying to do
| anyway.
| nojvek wrote:
| I love the ARC challenge. It's hard to beat by memorization.
| There aren't enough examples, so one has to train on a large
| dataset elsewhere and then train on ARC to generalize and figure
| out which rules are most applicable.
|
| I did a few human examples by hand, but gotta do more of them to
| start seeing patterns.
|
| The human visual and auditory systems are impressive. Most
| animals see, hear, and plan from that without having much
| language.
| Physical intelligence is the biggest leg up when it comes to
| evolution optimizing for survival.
| logicallee wrote:
| Thank you for this generous contest, which brings important
| attention to the field of testing for AGI.
|
| >Happy to answer questions!
|
| 1. Can humans take the complete test suite? Has any human done
| so? Is it timed? How long does it take a human? What is the
| highest a human who sat down and took the ARC-AGI test scored?
|
| 2. How surprised would you be if a new model jumped to scoring
| 100% or nearly 100% on ARC-AGI (including the secret test tasks)?
| What kind of test would you write next?
___________________________________________________________________
(page generated 2024-06-11 23:00 UTC)