[HN Gopher] OpenAI O3 breakthrough high score on ARC-AGI-PUB
       ___________________________________________________________________
        
       OpenAI O3 breakthrough high score on ARC-AGI-PUB
        
       Author : maurycy
       Score  : 730 points
       Date   : 2024-12-20 18:11 UTC (4 hours ago)
        
 (HTM) web link (arcprize.org)
 (TXT) w3m dump (arcprize.org)
        
       | razodactyl wrote:
       | Great. Now we have to think of a new way to move the goalposts.
        
         | tines wrote:
         | I mean, what else do you call learning?
        
         | Pesthuf wrote:
          | Well, right now running this model is really expensive,
          | but we should prepare a new cope ahead of time, for when
          | equivalent models no longer are.
        
           | cchance wrote:
            | Yeah, getting costs down will be the big one. I imagine
            | quantization, distillation, and lots and lots of
            | improvements on the compute side, both hardware- and
            | software-wise.
        
         | a_wild_dandan wrote:
         | Let's just define AI as "whatever computers still can't do."
         | That'll show those dumb statistical parrots!
        
         | dboreham wrote:
         | Imagine how the Neanderthals felt...
        
         | foobarqux wrote:
          | This is just as silly as claiming that people "moved the
          | goalposts" when a computer beat Kasparov at chess and it
          | turned out not to be AGI: it wasn't a good test, and some
          | people only realized this after the computer beat Kasparov
          | but couldn't do much else. In this case the ARC maintainers
          | have specifically stated that this is a necessary but not
          | sufficient test of AGI (I personally think it is neither).
        
           | og_kalu wrote:
            | It's not silly. The computer that could beat Kasparov
            | couldn't do anything else, so of course it wasn't
            | Artificial General Intelligence.
            | 
            | o3 can do much, much more. There is nothing narrow about
            | SOTA LLMs. They are already General. It doesn't matter
            | what the ARC maintainers have said. There is no common
            | definition of General that LLMs fail to meet. It's not a
            | binary thing.
           | 
           | By the time a single machine covers every little test
           | humanity can devise, what comes out of that is not 'AGI' as
           | the words themselves mean but a General Super Intelligence.
        
             | foobarqux wrote:
             | It is silly, the logic is the same: "Only a (world-
             | altering) 'AGI' could do [test]" -> test is passed -> no
             | (world-altering) 'AGI' -> conclude that [test] is not a
             | sufficient test for (world-altering) 'AGI' -> chase new
             | benchmark.
             | 
             | If you want to play games about how to define AGI go ahead.
             | People have been claiming for years that we've already
             | reached AGI and with every improvement they have to
              | bizarrely claim anew that _now_ we've really achieved AGI.
             | But after a few months people realize it still doesn't do
             | what you would expect of an AGI and so you chase some new
             | benchmark ("just one more eval").
             | 
             | The fact is that there really hasn't been the type of
             | world-altering impact that people generally associate with
             | AGI and no reason to expect one.
        
               | og_kalu wrote:
               | >It is silly, the logic is the same: "Only a (world-
               | altering) 'AGI' could do [test]" -> test is passed -> no
               | (world-altering) 'AGI' -> conclude that [test] is not a
               | sufficient test for (world-altering) 'AGI' -> chase new
               | benchmark.
               | 
                | Basically nobody today thinks beating a single benchmark
                | and nothing else will make you a General Intelligence. As
                | you've already pointed out, even the maintainers of
                | ARC-AGI do not think this.
               | 
               | >If you want to play games about how to define AGI go
               | ahead.
               | 
               | I'm not playing any games. ENIAC cannot do 99% of the
               | things people use computers to do today and yet barely
               | anybody will tell you it wasn't the first general purpose
               | computer.
               | 
               | On the contrary, it is people who seem to think "General"
               | is a moniker for everything under the sun (and then some)
               | that are playing games with definitions.
               | 
               | >People have been claiming for years that we've already
               | reached AGI and with every improvement they have to
               | bizarrely claim anew that now we've really achieved AGI.
               | 
                | Who are these people? Do you have any examples at all?
                | Genuine question.
               | 
               | >But after a few months people realize it still doesn't
               | do what you would expect of an AGI and so you chase some
               | new benchmark ("just one more eval").
               | 
                | What do you expect from 'AGI'? Everybody seems to have
                | different expectations, much of it rooted in science
                | fiction rather than reality, so this is a moot point.
                | What exactly is World Altering to you? Genuinely, do you
                | have anything other than "I'll know it when I see it"?
               | 
                | If you introduce technology most people adopt, is that
                | world altering, or are you waiting for Skynet?
        
               | foobarqux wrote:
               | > Basically nobody today thinks beating a single
               | benchmark and nothing else will make you a General
               | Intelligence.
               | 
                | People's comments, including in this very thread, seem to
                | suggest otherwise (cf. the comments about "goalpost
                | moving"). Are you saying it wasn't a widespread belief
                | that a chess-playing computer would require AGI? Or that
                | Go was at some point the new test for AGI? Or the Turing
                | test?
               | 
               | > I'm not playing any games... "General" is a moniker for
               | everything under the sun that are playing games with
               | definitions.
               | 
               | People have a colloquial understanding of AGI whose
               | consequence is a significant change to daily life, not
               | the tortured technical definition that you are using.
               | Again your definition isn't something anyone cares about
               | (except maybe in the legal contract between OpenAI and
               | Microsoft).
               | 
                | > Who are these people? Do you have any examples at all?
                | Genuine question.
               | 
               | How about you? I get the impression that you think AGI
               | was achieved some time ago. It's a bit difficult to
               | simultaneously argue both that we achieved AGI in GPT-N
               | and also that GPT-(N+X) is now the real breakthrough AGI
               | while claiming that your definition of AGI is useful.
               | 
               | > What do you expect from 'AGI'?
               | 
               | I think everyone's definition of AGI includes, as a
               | component, significant changes to the world, which
               | probably would be something like rapid GDP growth or
               | unemployment (though you could have either of those
               | without AGI). The fact that you have to argue about what
               | the word "general" technically means is proof that we
               | don't have AGI in a sense that anyone cares about.
        
               | og_kalu wrote:
               | >People's comments, including in this very thread, seem
               | to suggest otherwise (c.f. comments about "goal post
               | moving").
               | 
                | But you don't see this kind of discussion about the
                | narrow models/techniques that made strides on this
                | benchmark, do you?
               | 
               | >People have a colloquial understanding of AGI whose
               | consequence is a significant change to daily life, not
               | the tortured technical definition that you are using
               | 
                | And ChatGPT has represented a significant change to the
                | daily lives of many. It's the fastest-adopted software
                | product in history. In just 2 years, it has become one of
                | the top ten most visited sites on the planet. A lot of
                | people have had the work they do change significantly
                | since its release. This is why I ask: what is world
                | altering?
               | 
               | >How about you? I get the impression that you think AGI
               | was achieved some time ago.
               | 
               | Sure
               | 
               | >It's a bit difficult to simultaneously argue both that
               | we achieved AGI in GPT-N and also that GPT-(N+X) is now
               | the real breakthrough AGI
               | 
               | I have never claimed GPT-N+X is the "new breakthrough
               | AGI". As far as I'm concerned, we hit AGI sometime ago
               | and are making strides in competence and/or enabling even
               | more capabilities.
               | 
               | You can recognize ENIAC as a general purpose computer and
               | also recognize the breakthroughs in computing since then.
               | They're not mutually exclusive.
               | 
               | And personally, I'm more impressed with o3's Frontier
               | Math score than ARC.
               | 
               | >I think everyone's definition of AGI includes, as a
               | component, significant changes to the world
               | 
               | Sure
               | 
               | >which probably would be something like rapid GDP growth
               | or unemployment
               | 
                | There is definitely no broad agreement on what counts as
                | "significant change".
                | 
                | Even in science fiction, the existence of general
                | intelligences more competent than today's LLMs does not
                | necessarily presage massive unemployment or GDP growth.
               | 
               | And for a lot of people, the clincher stopping them from
               | calling a machine AGI is not even any of these things.
               | For some, that it is "sentient" or "cannot lie" is far
               | more important than any spike of unemployment.
        
               | foobarqux wrote:
               | > But you don't see this kind of discussion on the narrow
               | models/techniques that made strides on this benchmark, do
               | you ?
               | 
               | I don't understand what you are getting at.
               | 
                | Ultimately there is no axiomatic definition of the term
                | AGI. I don't think the colloquial understanding of the
                | word is what you think it is: if you had described to
                | people, pre-ChatGPT, today's ChatGPT behavior, including
                | all the limitations and failings and the fact that there
                | was no change in GDP, unemployment, etc., and asked if
                | that was AGI, I seriously doubt they would say yes.
               | 
               | More importantly I don't think anyone would say their
               | life was much different from a few years ago and
               | separately would say under AGI it would be.
               | 
                | But the point that started all this discussion is the
                | fact that these "evals" are not good proxies for AGI, and
                | no one is moving goalposts even if they realize this fact
                | only after the tests have been beaten. You can foolishly
                | _define_ AGI as beating ARC, but the moment ARC is beaten
                | you realize that you don't care about that definition at
                | all. That doesn't change if you make a 10- or
                | 100-benchmark suite.
        
       | og_kalu wrote:
        | This is also wildly ahead on SWE-bench (71.7%, previous 48%)
        | and Frontier Math (25% on high compute, previous 2%).
       | 
       | So much for a plateau lol.
        
         | throwup238 wrote:
         | _> So much for a plateau lol._
         | 
         | It's been really interesting to watch all the internet pundits'
         | takes on the plateau... as if the _two years_ since the release
         | of GPT3.5 is somehow enough data for an armchair ponce to
         | predict the performance characteristics of an entirely novel
         | technology that no one understands.
        
           | jgalt212 wrote:
           | You could make an equivalently dismissive comment about the
           | hypesters.
        
             | throwup238 wrote:
             | Yeah but anyone with half a brain knows to ignore them.
             | Vapid cynicism is a lot more seductive to the average nerd.
        
           | bandwidth-bob wrote:
            | The pundits' response to the (alleged) plateau was
            | proportional to the certainty with which CEOs of frontier
            | labs discussed pre-training scaling. The o3 result comes
            | from scaling test-time compute, which represents a
            | meaningful change in how you would build out compute for
            | scaling (single supercluster --> presence in regions close
            | to users). Thus it is important to discuss.
        
         | attentionmech wrote:
          | I legit see that if there's no new breakthrough for even
          | one week, people start shouting "plateau, plateau". Our
          | rate of progress is extraordinary and any downplaying of it
          | seems stupid.
        
         | optimalsolver wrote:
         | >Frontier Math (25% on high compute, previous 2%)
         | 
          | This is so insane that I can't help but be skeptical. I know
          | the FM answer key is private, but they have to send the
          | questions to OpenAI in order to score the models. And a
          | significant jump on this benchmark sure would increase a
          | company's valuation...
         | 
         | Happy to be wrong on this.
        
         | OsrsNeedsf2P wrote:
          | At $6,670/task? I hope there's a jump
        
           | og_kalu wrote:
            | It's not $6,670/task. That was the high-efficiency cost
            | for all 400 questions.
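            | 
            | Back-of-envelope (assuming the reported ~$6,670 total
            | really covers all 400 public-eval tasks at high
            | efficiency), the per-task cost is closer to $17:
            | 
            |     # figures as quoted in this thread, not official
            |     total_cost_usd = 6670  # reported high-efficiency total
            |     num_tasks = 400        # public-eval task count
            |     print(f"${total_cost_usd / num_tasks:.2f} per task")
            |     # -> $16.67 per task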
        
       | maxdoop wrote:
        | How much longer can I get paid $150k to write code?
        
         | tsunamifury wrote:
          | Often what happens is the golf-course phenomenon. As golfing
          | gets less popular, low- and mid-tier golf courses go out of
          | business because they simply aren't needed. But at the same
          | time, demand for high-end golf courses actually skyrockets,
          | because people who want to golf either give it up or go
          | higher end.
          | 
          | This, I think, will happen with programmers. Rote programming
          | will slowly die out, while demand for the super high end will
          | go dramatically up in price.
        
           | CapcomGo wrote:
           | Where does this golf-course phenomenon come from? It doesn't
           | really match the real world or how golfing works.
        
             | tsunamifury wrote:
              | How so? I witnessed it quite directly in California. The
              | majority have closed, and the remaining courses have gone
              | up in price and gone upscale. This has been covered in
              | various news programs like 60 Minutes. You can look up
              | "the death of golf".
              | 
              | Also unsure what you mean by "how golfing works". This is
              | about the economics of it, not the game.
        
               | EVa5I7bHFq9mnYK wrote:
                | Maybe it's a CA thing? Plenty of $50 golf courses here
                | in Phoenix.
        
         | colesantiago wrote:
         | Frontier expert specialist programmers will always be in
         | demand.
         | 
          | Generalist junior and senior engineers will need to think of
          | a different career path in less than 5 years, as more layoffs
          | reduce the software engineering workforce.
          | 
          | It looks like that may be the way things go if progress in
          | the o1, o3, oN models and other LLMs continues.
        
           | deadbabe wrote:
           | This assumes that software products in the future will remain
           | at the same complexity as they are today, just with AI
           | building them out.
           | 
            | But they won't. AI will enable building even _more_ complex
            | software, which counterintuitively will result in needing
            | even _more_ humans to deal with this added complexity.
           | 
            | Think about how, despite an increasing number of free open
            | source libraries enabling some powerful stuff easily,
            | developer jobs have only increased, not decreased.
        
             | dmm wrote:
             | I've made a similar argument in the past but now I'm not so
             | sure. It seems to me that developer demand was linked to
             | large expansions in software demand first from PCs then the
             | web and finally smartphones.
             | 
             | What if software demand is largely saturated? It seems the
             | big tech companies have struggled to come up with the next
             | big tech product category, despite lots of talent and
             | capital.
        
               | deadbabe wrote:
               | There doesn't need to be a new category. Existing
               | categories can just continue bloating in complexity.
               | 
               | Compare the early web vs the complicated JavaScript laden
               | single page application web we have now. You need way
               | more people now. AI will make it even worse.
               | 
                | Consider that in the AI-driven future, there will be no
                | more frameworks like React. Who is going to bother
                | writing one? Instead every company will just have its
                | own little custom framework, built by an AI, that works
                | only for that company. Joining a new company means you
                | bring generalist skills and learn how their software
                | works from the ground up, and when you leave for another
                | company that knowledge is instantly useless.
               | 
               | Sounds exciting.
               | 
                | But there are also plenty of unexplored categories that
                | we still can't access because the technology is
                | insufficient. Household robots with AGI, for instance,
                | may require instructions for specific services, sold as
                | "apps", that have to be designed and developed by
                | companies.
        
               | bandwidth-bob wrote:
                | The new capabilities of LLMs, and of large foundation
                | models generally, _expand_ the range of what a computer
                | program can do. Naturally, we will need to build all of
                | those things with code, which will be done by a combo of
                | people with product ideas, engineers, and LLMs. There
                | will then be specialization and competition on each new
                | use case, e.g., who builds the best AI doctor.
        
             | hackinthebochs wrote:
             | What about "general" in AGI do you not understand? There
             | will be no new style of development for which the AGI will
             | be poorly suited that all the displaced developers can move
             | to.
        
               | bandwidth-bob wrote:
                | For true AGI (whatever that means; let's say it fully
                | replicates human abilities), "developers" are only a
                | drop in the bucket compared to all the knowledge-work
                | jobs that will be displaced.
        
             | cruffle_duffle wrote:
             | This is exactly what will happen. We'll just up the
             | complexity game to entirely new baselines. There will
             | continue to be good money in software.
             | 
             | These models are tools to help engineers, not replacements.
             | Models cannot, on their own, build novel new things no
             | matter how much the hype suggests otherwise. What they can
             | do is remove a hell of a lot of accidental complexity.
        
               | lagrange77 wrote:
               | > These models are tools to help engineers, not
               | replacements. Models cannot, on their own, build novel
               | new things no matter how much the hype suggests
               | otherwise.
               | 
               | But maybe models + managers/non technical people can?
        
           | mitjam wrote:
            | The question is: how do you become a senior when there is
            | no place to be a junior? Will future SWEs need to do the
            | 10k hours as a hobby? Will AI speed up or slow down
            | learning?
        
             | singularity2001 wrote:
              | Good question, and I think you gave the correct answer:
              | yes, people will just do the 10,000 hours required by
              | starting programming at the age of eight and then playing
              | around until they're done studying.
        
         | prmph wrote:
         | I'll believe the models can take the jobs of programmers when
         | they can generate a sophisticated iOS app based on some simple
         | prompts, ready for building and publication in the app store.
         | That is nowhere near the horizon no matter how much things are
         | hyped up, and it may well never arrive.
        
           | timenotwasted wrote:
            | These absolutist comments are such a wild take given how
            | often they turn out to be wrong.
        
             | tsunamifury wrote:
              | Totally... a simple 20% increase in efficiency will
              | already significantly destroy demand for coders. This
              | forum, however, will be resistant to admitting such an
              | economic phenomenon.
              | 
              | Look at video bay editing after the advent of Final Cut:
              | a significant drop in the specialized requirement as a
              | professional field, even while content volume went up
              | dramatically.
        
               | exitb wrote:
               | Computing has been transforming countless jobs before it
               | got to Final Cut. On one hand, programming is not the
               | hardest job out there. On the other, it takes months to
               | fully onboard a human developer - a person that already
               | has years of relevant education and work experience.
               | There are desk jobs that onboard new hires in days
               | instead. Let's see when they're displaced by AI first.
        
               | tsunamifury wrote:
                | Don't know if you noticed, but that's already happening.
                | Mass layoffs in customer service etc. have already
                | happened over the last 2 years.
        
               | exitb wrote:
               | So, how does it work out? Are the customers happy? Are
               | the bosses at my work going to be equally happy with my
               | AI replacement?
        
               | EVa5I7bHFq9mnYK wrote:
               | That's until AI has improved enough that it can
               | automatically navigate the menus to get me a human
               | operator to talk to.
        
               | derektank wrote:
               | I could be misreading this, but as far as I can tell,
               | there are more video and film editors today (29,240) than
               | there were film editors in 1997 (9,320). Seems like an
               | example of improved productivity shifting the skills
               | required but ultimately driving greater demand for the
               | profession as a whole. Salaries don't seem to have been
               | hurt either, median wage was $35,214 in '97 and $66,600
               | today, right in line with inflation.
               | 
               | https://www.bls.gov/oes/2023/may/oes274032.htm
               | 
               | https://www.bls.gov/oes/tables.htm
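                | 
                | A quick sanity check on the inflation claim (the CPI
                | factor of roughly 1.9 for 1997 -> 2023 is my own rough
                | estimate, not from the BLS tables above):
                | 
                |     # rough inflation check; cpi_factor is an estimate
                |     wage_1997 = 35_214
                |     cpi_factor = 1.9   # approx. 1997 -> 2023 CPI ratio
                |     print(round(wage_1997 * cpi_factor))
                |     # -> 66907, close to today's $66,600 median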
        
           | vouaobrasil wrote:
            | Nah, it will arrive. And regardless, this sort of AI
            | reduces the skill level required to make an app. It reduces
            | the number of people required and thus reduces the demand
            | for engineers. So even though AI is not CLOSE to what you
            | are suggesting, it can significantly reduce the salaries of
            | those that ARE required. So maybe fewer $150K programmers
            | will be hired, with the same revenue, for even higher
            | profits.
           | 
           | The most bizarre thing is that programmers are literally
           | writing code to replace themselves because once this AI
           | started, it was a race to the bottom and nobody wants to be
           | last.
        
             | skydhash wrote:
             | > Nah, it will arrive
             | 
             | Will it?
             | 
              | It's already hard to get people to use computers as they
              | are right now, where you only need to click on things and
              | no longer have to enter commands. That's because most
              | people don't like to engage in formal reasoning. Even
              | with some of the most intuitive computer-assisted tasks
              | (drawing and 3D modeling), there's so much theory to
              | learn that few people bother.
              | 
              | Programming has always been easy to learn, and tools to
              | automate coding have existed for decades now. But how
              | many people do you know who have had the urge to learn
              | enough to automate their tasks?
        
             | prmph wrote:
              | They've been promising us this thing since the '60s:
              | end-user development, 5GLs, etc., enabling the average
              | Joe to develop sophisticated apps in minimal time. And it
              | never arrives.
             | 
             | I remember attending a tech fair decades ago, and at one
             | stand they were vending some database products. When I
             | mentioned that I was studying computer science with a focus
             | on software engineering, they sneered that coding will be
             | much less important in the future since powerful databases
             | will minimize the need for a lot of data wrangling in
             | applications with algorithms.
             | 
              | What actually happened is that the demand for programmers
              | increased, and software ate the world. I suspect
              | something similar will happen with the current AI hype.
        
               | vouaobrasil wrote:
               | Well, I think in the 60s we also didn't have LLMs that
               | could actually write complete programs, either.
        
               | mirsadm wrote:
                | No one writes a "complete program" these days. Things
                | just keep evolving forever. I've spent more time than I
                | care to admit dealing with dependencies of libraries
                | which seem to change on a daily basis. These
                | predictions are so far off reality it makes me wonder
                | if the people making them have ever written any code in
                | their lives.
        
               | vouaobrasil wrote:
                | That's fair. Well, I've written a lot of code. But
                | anyway, I do want to emphasize the following: I am not
                | making the same prediction as some who say AI can
                | replace a programmer. Instead, I am saying that the
                | combination of AI plus programmers will reduce the
                | number of programmers needed, and hence allow the
                | software industry to exist with far fewer people, with
                | the lucky ones accumulating even more wealth.
        
               | whynotminot wrote:
               | > They've been promising us this thing since the 60s:
               | End-user development, 5GLs, etc. enabling the average Joe
               | to develop sophisticated apps in minimal time. And it
               | never arrives.
               | 
               | This has literally already arrived. Average Joes _are_
               | writing software using LLMs right now.
        
         | deadbabe wrote:
          | There's a very good chance that if a company can replace its
          | programmers with pure AI, then whatever they're doing is
          | probably already being offered as a SaaS product, so why not
          | just skip the AI and buy that? Much cheaper, and you don't
          | have to worry about dealing with bugs.
        
           | croemer wrote:
           | SaaS works for general problems faced by many businesses.
        
             | deadbabe wrote:
             | Exactly. Most businesses can get away with not having
             | developers at all if they just glue together the right
             | combination of SaaS products. But this doesn't happen,
             | implying there is something more about having your own
             | homegrown developers that SaaS cannot replace.
        
               | croemer wrote:
               | The risk is not SaaS replacing internal developers. It's
               | about increased productivity of developers reducing the
               | number of developers needed to achieve something.
        
               | deadbabe wrote:
               | Again, you're assuming product complexity won't grow as a
               | result of new AI tools.
               | 
               | 3 decades ago you needed a big team to create the type of
               | video games that one person can probably make on their
               | own today in their spare time with modern tools.
               | 
               | But now modern tools have been used to make even more
               | complicated games that require more massive teams than
               | ever and huge amounts of money. One person has no hope of
               | replicating that now, but maybe in the future with AI
               | they can. And then the AAA games will be even _more_
               | advanced.
               | 
               | It will be similar with other software.
        
         | sss111 wrote:
          | 3 to 5 years, max. Traditional coding is going to be dead in
          | the water. Optimistically, the junior SWE job will evolve,
          | but more realistically, dedicated AI-based programming agents
          | will end demand for junior SWEs.
        
           | lagrange77 wrote:
           | Which implies that a few years later they will not become
           | senior SWEs either.
        
         | torginus wrote:
          | Well, considering they floated the $2,000 subscription idea
          | and still haven't revealed everything, they could still
          | introduce the $2k sub with o3 + agents/tool use. Which means:
          | until about next week.
        
         | arrosenberg wrote:
          | Unless the LLMs see multiple leaps in capability, probably
          | indefinitely. The Malthusians in this thread seem to think
          | that LLMs are going to fix the human problems involved in
          | executing these businesses - they won't. They make good
          | programmers more productive and will cost some jobs at the
          | margins, but mostly the low-level programming work that was
          | previously outsourced to Asia and South America for cost
          | arbitrage.
        
         | mrdependable wrote:
         | I think they will have to figure out how to get around context
         | limits before that happens. I also wouldn't be surprised if the
         | future models that can actually replace workers are sold at
         | such an exorbitant price that only larger companies will be
         | able to afford it. Everyone else gets access to less capable
         | models that still require someone with knowledge to get to an
         | end result.
        
         | kirykl wrote:
         | If it's any consolation, Agile priests and middle managers will
         | be the first to go
        
       | braden-lk wrote:
       | If people constantly have to ask if your test is a measure of
       | AGI, maybe it should be renamed to something else.
        
         | OfficialTurkey wrote:
         | From the post
         | 
         | > Passing ARC-AGI does not equate achieving AGI, and, as a
         | matter of fact, I don't think o3 is AGI yet. o3 still fails on
         | some very easy tasks, indicating fundamental differences with
         | human intelligence.
        
           | cchance wrote:
            | It's funny when they say this, as if all humans can solve
            | basic-ass question/answer combos. People seem to forget
            | there's a percentage of the population that honestly
            | believes the world is flat, along with other hallucinations
            | at the human level.
        
             | jppittma wrote:
             | I don't believe AGI at that level has any commercial value.
        
       | modeless wrote:
       | Congratulations to Francois Chollet on making the most
       | interesting and challenging LLM benchmark so far.
       | 
       | A lot of people have criticized ARC as not being relevant or
       | indicative of true reasoning, but I think it was exactly the
       | right thing. The fact that scaled reasoning models are finally
       | showing progress on ARC proves that what it measures really is
       | relevant and important for reasoning.
       | 
       | It's obvious to everyone that these models can't perform as well
       | as humans on everyday tasks despite blowout scores on the hardest
       | tests we give to humans. Yet nobody could quantify exactly the
       | ways the models were deficient. ARC is the best effort in that
       | direction so far.
       | 
       | We don't need more "hard" benchmarks. What we need right now are
       | "easy" benchmarks that these models nevertheless fail. I hope
       | Francois has something good cooked up for ARC 2!
        
         | dtquad wrote:
         | Are there any single-step non-reasoner models that do well on
         | this benchmark?
         | 
         | I wonder how well the latest Claude 3.5 Sonnet does on this
         | benchmark and if it's near o1.
        
           | throwaway71271 wrote:
            | | Name                                 | Semi-private eval | Public eval |
            | |--------------------------------------|-------------------|-------------|
            | | Jeremy Berman                        | 53.6%             | 58.5%       |
            | | Akyurek et al.                       | 47.5%             | 62.8%       |
            | | Ryan Greenblatt                      | 43%               | 42%         |
            | | OpenAI o1-preview (pass@1)           | 18%               | 21%         |
            | | Anthropic Claude 3.5 Sonnet (pass@1) | 14%               | 21%         |
            | | OpenAI GPT-4o (pass@1)               | 5%                | 9%          |
            | | Google Gemini 1.5 (pass@1)           | 4.5%              | 8%          |
           | 
           | https://arxiv.org/pdf/2412.04604
        
             | kandesbunzler wrote:
             | why is this missing the o1 release / o1 pro models? Would
             | love to know how much better they are
        
           | YetAnotherNick wrote:
            | Here are the results for base models[1]:
            | 
            |     o3 (coming soon)   75.7%  82.8%
            |     o1-preview         18%    21%
            |     Claude 3.5 Sonnet  14%    21%
            |     GPT-4o             5%     9%
            |     Gemini 1.5         4.5%   8%
           | 
           | Score (semi-private eval) / Score (public eval)
           | 
           | [1]: https://arcprize.org/2024-results
        
             | simonw wrote:
             | I'd love to know how Claude 3.5 Sonnet does so well despite
             | (presumably) not having the same tricks as the o-series
             | models.
        
             | Bjorkbat wrote:
              | It's easy to miss, but if you look closely at the first
              | sentence of the announcement, they mention that they used
              | a version of o3 trained on the public ARC-AGI dataset, so
              | technically it doesn't belong on this list.
        
         | refulgentis wrote:
         | This emphasizes persons and a self-conceived victory narrative
         | over the ground truth.
         | 
         | Models have regularly made progress on it, this is not new with
         | the o-series.
         | 
          | Doing astoundingly well on it, and having a mutually shared
          | PR interest with OpenAI in this instance, doesn't mean a pile
          | of visual puzzles is actually AGI or some well-thought-out
          | benchmark of True Intelligence(tm). It's one type of visual
          | puzzle.
          | 
          | I don't mean to be negative, but to inject a memento mori.
          | The real story is that some guys got together and rode off
          | Chollet's name with some visual puzzles from ye olde IQ test,
          | and the deal was that Chollet then gets to show up and say it
          | proves program synthesis is required for True Intelligence.
         | 
         | Getting this score is extremely impressive but I don't assign
         | more signal to it than any other benchmark with some thought to
         | it.
        
           | modeless wrote:
           | Solving ARC doesn't mean we have AGI. Also o3 presumably
           | isn't doing program synthesis, seemingly proving Francois
           | wrong on that front. (Not sure I believe the speculation
           | about o3's internals in the link.)
           | 
           | What I'm saying is the fact that as models are getting better
           | at reasoning they are also scoring better on ARC proves that
           | it _is_ measuring something relating to reasoning. And nobody
           | else has come up with a comparable benchmark that is so easy
           | for humans and so hard for LLMs. Even today, let alone five
           | years ago when ARC was released. ARC was visionary.
        
             | hdjjhhvvhga wrote:
              | Your argument seems convincing, but I'd like to offer a
              | competing narrative: any benchmark that is public becomes
              | completely useless, because companies optimize for it -
              | especially in AI, which depends on piles of money and
              | needs some proof of progress.
              | 
              | That's why I have some private benchmarks, and I'm sorry
              | to say that the transition from GPT-4 to o1 wasn't
              | unambiguously a step forward (in some tasks yes, in some
              | not).
              | 
              | On the other hand, private benchmarks are even less
              | useful to the general public than public ones, so we have
              | to deal with what we have - but many of us just treat it
              | as noise and don't give it much significance. Ultimately,
              | the models should defend themselves by performing the
              | tasks individual users want them to do.
        
               | stonemetal12 wrote:
                | Rather, any logic puzzle posted on the internet as
                | something AIs are bad at ends up in the next round of
                | training data, so AIs get better at that specific
                | question. Not because AI companies are optimizing for a
                | benchmark, but because they suck up everything.
        
               | modeless wrote:
               | ARC has two test sets that are not posted on the
               | Internet. One is kept completely private and never
               | shared. It is used when testing open source models and
               | the models are run locally with no internet access. The
               | other test set is used when testing closed source models
               | that are only available as APIs. So it could be leaked in
               | theory, but it is still not posted on the internet and
               | can't be in any web crawls.
               | 
               | You could argue that the models can get an advantage by
               | looking at the training set which is on the internet. But
               | all of the tasks are unique and generalizing from the
               | training set to the test set is the whole point of the
               | benchmark. So it's not a serious objection.
        
             | QuantumGood wrote:
             | Gaming the benchmarks usually needs to be considered first
             | when evaluating new results.
        
               | chaps wrote:
                | Honestly, is gaming benchmarks actually a problem in
                | this space if it still shows something useful? It just
                | means we need more benchmarks, yeah? It really feels
                | not unlike Kaggle competitions.
                | 
                | We do the exact same thing with real people in
                | programming challenges and such, where people just
                | study common interview questions rather than learning
                | the material holistically. And since we know that
                | people game these interview-type questions, we adjust
                | the interview process to minimize gaming... which
                | itself leads to more gaming, and back to step one.
                | That's not an ideal feedback loop, of course, but
                | people still get jobs and churn out "productive work"
                | out of it.
        
               | ben_w wrote:
               | AI are very good at gaming benchmarks. Both as
               | overfitting and as Goodhart's law, gaming benchmarks has
               | been a core problem during training for as long as I've
               | been interested in the field.
               | 
               | Sometimes this manifests as "outside the box thinking",
               | like how a genetic algorithm got an "oscillator" which
               | was really just an antenna.
               | 
               | It is a hard problem, and yes we still both need and can
               | make more and better benchmarks; but it's still a problem
               | because it means the benchmarks we do have are
               | overstating competence.
        
               | CamperBob2 wrote:
                | The _idea_ behind this particular benchmark, at least,
                | is that it can't be gamed. What are some ways to game
                | ARC-AGI, meaning to pass it without developing the
                | required internal model and insights?
                | 
                | In principle you can't optimize specifically for
                | ARC-AGI, train against it, or overfit to it, because
                | only a few of the puzzles are publicly disclosed.
               | 
               | Whether it lives up to that goal, I don't know, but their
               | approach sounded good when I first heard about it.
        
               | psb217 wrote:
               | Well, with billions in funding you could task a hundred
               | or so very well paid researchers to do their best at
               | reverse engineering the general thought process which
               | went into ARC-AGI, and then generate fresh training data
               | and labeled CoTs until the numbers go up.
        
               | CamperBob2 wrote:
               | Right, but the ARC-AGI people would counter by saying
               | they're welcome to do just that. In doing so -- again in
               | their view -- the researchers would create a model that
               | could be considered capable of AGI.
               | 
               | I spent a couple of hours looking at the publicly-
               | available puzzles, and was really impressed at how much
               | room for creativity the format provides. Supposedly the
               | puzzles are "easy for humans," but some of them were
               | not... at least not for me.
               | 
               | (It did occur to me that a better test of AGI might be
               | the ability to generate new, innovative ARC-AGI puzzles.)
        
               | chaps wrote:
               | We're in agreement!
               | 
               | What's endlessly interesting to me with all of this is
               | how surprisingly quick the benchmarking feedback loops
               | have become plus the level of scrutiny each one receives.
               | We (as a culture/society/whatever) don't really treat
               | human benchmarking criteria with the same scrutiny such
               | that feedback loops are useful and lead to productive
               | changes to the benchmarking system itself. So from that
               | POV it feels like substantial progress continues to be
               | made through these benchmarks.
        
               | bubblyworld wrote:
                | I think gaming the benchmarks is _encouraged_ in the
                | ARC-AGI context. If you look at the public test cases
                | you'll see they test a ton of pretty abstract concepts
                | - space, colour, basic laws of physics like
                | gravity/magnetism, movement, identity and lots of other
                | stuff (highly recommend exploring them). Getting an AI
                | to do well _at all_, regardless of whether it was gamed
                | or not, is the whole challenge!
        
             | refulgentis wrote:
             | > Solving ARC doesn't mean we have AGI. Also o3 presumably
             | isn't doing program synthesis, seemingly proving Francois
             | wrong on that front.
             | 
             | Agreed.
             | 
             | > And nobody else has come up with a comparable benchmark
             | that is so easy for humans and so hard for LLMs.
             | 
             | ? There's plenty.
        
               | modeless wrote:
               | I'd love to hear about more. Which ones are you thinking
               | of?
        
               | refulgentis wrote:
               | - "Are You Human" https://arxiv.org/pdf/2410.09569 is
               | designed to be directly on target, i.e. cross cutting set
               | of questions that are easy for humans, but challenging
               | for LLMs, Instead of one type of visual puzzle. Much
               | better than ARC for the purpose you're looking for.
               | 
               | - SimpleBench https://simple-bench.com/ (similar to
               | above; great landing page w/scores that show human / ai
               | gap)
               | 
                | - PIQA (physical question answering, e.g. "how do I get
                | a yolk out of a water bottle"; a common favorite of
                | local LLM enthusiasts in /r/LocalLLaMA:
                | https://paperswithcode.com/dataset/piqa)
               | 
               | - Berkeley Function-Calling (I prefer
               | https://gorilla.cs.berkeley.edu/leaderboard.html)
               | 
               | AI search googled "llm benchmarks challenging for ai easy
               | for humans", and "language model benchmarks that humans
               | excel at but ai struggles with", and "tasks that are easy
               | for humans but difficult for natural language ai".
               | 
                | It also mentioned Moravec's Paradox as a known framing
                | of this concept. I started going down that rabbit hole
                | because the resources were fascinating, but had to hold
                | back and submit this reply first. :)
        
               | modeless wrote:
               | Thanks for the pointers! I hadn't seen Are You Human.
               | Looks like it's only two months old. Of course it is much
               | easier to design a test specifically to thwart LLMs now
               | that we have them. It seems to me that it is designed to
               | exploit details of LLM structure like tokenizers (e.g.
               | character counting tasks) rather than to provide any sort
               | of general reasoning benchmark. As such it seems
               | relatively straightforward to improve performance in ways
               | that wouldn't necessarily represent progress in general
               | reasoning. And today's LLMs are not nearly as far from
               | human performance on the benchmark as they were on ARC
               | for many years after it was released.
               | 
               | SimpleBench looks more interesting. Also less than two
               | months old. It doesn't look as challenging for LLMs as
               | ARC, since o1-preview and Sonnet 3.5 already got half of
               | the human baseline score; they did much worse on ARC. But
               | I like the direction!
               | 
               | PIQA is cool but not hard enough for LLMs.
               | 
               | I'm not sure Berkeley Function-Calling represents tasks
               | that are "easy" for average humans. Maybe programmers
               | could perform well on it. But I like ARC in part because
               | the tasks do seem like they should be quite
               | straightforward even for non-expert humans.
               | 
               | Moravec's paradox isn't a benchmark per se. I tend to
               | believe that there is no real paradox and all we need is
               | larger datasets to see the same scaling laws that we have
               | for LLMs. I see good evidence in this direction:
               | https://www.physicalintelligence.company/blog/pi0
        
               | CamperBob2 wrote:
               | How long has SimpleBench been posted? Out of the first 6
               | questions at https://simple-bench.com/try-yourself,
               | o1-pro got 5/6 right.
               | 
                | It was interesting to see how it failed on question 6:
                | https://chatgpt.com/c/6765e70e-44b0-800b-97bd-928919f04fbe
               | 
               | Apparently LLMs do not consider global thermonuclear war
               | to be all that big a deal, for better or worse.
        
               | Pannoniae wrote:
               | Don't worry, I also got that wrong :) I thought her
               | affair would be the biggest problem for John.
        
           | stego-tech wrote:
           | I won't be as brutal in my wording, but I agree with the
           | sentiment. This was something drilled into me as someone with
           | a hobby in PC Gaming _and_ Photography: benchmarks, while
           | handy measures of _potential_ capabilities, are not
           | _guarantees_ of real world performance. Very few PC gamers
           | completely reinstall the OS before benchmarking to remove all
           | potential cruft or performance impacts, just as very few
           | photographers exclusively take photos of test materials.
           | 
           | While I appreciate the benchmark and its goals (not to
           | mention the puzzles - I quite enjoy figuring them out),
           | successfully passing this benchmark does not demonstrate or
           | guarantee real world capabilities or performance. This is why
           | I increasingly side-eye this field and its obsession with
           | constantly passing benchmarks and then moving the goal posts
           | to a newer, harder benchmark that claims to be a better
           | simulation of human capabilities than the last one: it reeks
           | of squandered capital and a lack of a viable/profitable
           | product, at least to my sniff test. Rather than simply
           | capitalize on their actual accomplishments (which LLMs are -
           | natural language interaction is huge!), they're trying to
           | prove to Capital that with a few (hundred) billion more in
           | investments, they can make AGI out of this and replace all
           | those expensive humans.
           | 
           | They've built the most advanced prediction engines ever
           | conceived, and insist they're best used to replace labor. I'm
           | not sure how they reached that conclusion, but considering
           | even their own models refute this use case for LLMs, I doubt
           | their execution ability on that lofty promise.
        
           | danielmarkbruce wrote:
           | 100%. The hype is misguided. I doubt half the people excited
           | about the result have even looked at what the benchmark is.
        
         | Balgair wrote:
         | Complete aside here: I used to do work with amputees and
         | prosthetics. There is a standardized test (and I just cannot
         | remember the name) that fits in a briefcase. It's used for
         | measuring the level of damage to the upper limbs and for
         | prosthetic grading.
         | 
         | Basically, it's got the dumbest and simplest things in it.
         | Stuff like a lock and key, a glass of water and jug, common
         | units of currency, a zipper, etc. It tests if you can do any of
         | those common human tasks. Like pouring a glass of water,
         | picking up coins from a flat surface (I chew off my nails so
         | even an able person like me fails that), zip up a jacket, lock
         | your own door, put on lipstick, etc.
         | 
         | We had hand prosthetics that could play Mozart at 5x speed on a
         | baby grand, but could not pick up a silver dollar or zip a
         | jacket even a little bit. To the patients, the hands were
         | therefore about as useful as a metal hook (a common solution
         | with amputees today, not just pirates!).
         | 
         | Again, a total aside here, but your comment just reminded me of
         | that brown briefcase. Life, it turns out, is a lot more complex
         | than we give it credit for. Even pouring the OJ can be, in rare
         | cases, transcendent.
        
           | m463 wrote:
           | It would be interesting to see trick questions.
           | 
           | Like in your test
           | 
           | a hand grenade and a pin - don't pull the pin.
           | 
           | Or maybe a mousetrap? but maybe that would be defused?
           | 
           | in the ai test...
           | 
           | or Global Thermonuclear War, the only winning move is...
        
             | sdenton4 wrote:
             | to move first!
        
               | m463 wrote:
               | oh crap. lol!
        
             | HPsquared wrote:
             | Gaming streams being in the training data, it might pull
             | the pin because "that's what you do".
        
               | 8note wrote:
               | or, because it has to give an output, and pulling the pin
               | is the only option
        
               | TeMPOraL wrote:
               | There's also the option of not pulling the pin, and
               | shooting your enemies as they instinctively run from what
               | they think is a live grenade. Saw it on a TV show the
               | other day.
        
           | ubj wrote:
           | There's a lot of truth in this. I sometimes joke that robot
           | benchmarks should focus on common household chores. Given a
           | basket of mixed laundry, sort and fold everything into
           | organized piles. Load a dishwasher given a sink and counters
           | overflowing with dishes piled up haphazardly. Clean a bedroom
           | that kids have trashed. We do these tasks almost without
           | thinking, but the unstructured nature presents challenges for
           | robots.
        
             | Balgair wrote:
             | I maintain that whoever invents a robust laundry _folding_
             | robot will be a trillionaire. By that I mean: I dump
             | jumbled clean clothes straight from a dryer at it, and
             | out come folded and sorted clothes (and those loner
             | socks). I know we're
             | getting close, but I also know we're not there yet.
        
               | oblio wrote:
               | Laundry folding and laundry ironing, I would say.
        
               | musicale wrote:
               | Hopefully it will detect whether a small child is inside or
               | not.
        
               | imafish wrote:
               | > I maintain that whoever invents a robust laundry
               | folding robot will be a trillionaire
               | 
               | ... so Elon Musk? :D
        
               | jessekv wrote:
               | I want it to lay out an outfit every day too. Hopefully
               | without hallucination.
        
               | stefs wrote:
               | it's not hallucination, it's high fashion
        
               | tanseydavid wrote:
               | Yes, but the stupid robot laid out your Thursday-black-
               | Turtleneck for you on Saturday morning. That just won't
               | suffice.
        
               | yongjik wrote:
               | I can live without folding laundry (I can just shove my
               | undershirts in the closet, who cares if they're not folded),
               | but whoever manufactures a reliable auto-loading
               | dishwasher will have my dollars. Like, just put all your
               | dishes in the sink and let the machine handle them.
        
               | Brybry wrote:
               | But if your dishwasher is empty it takes nearly the same
               | amount of time/effort to put dishes straight into the
               | dishwasher that it does to put them in the sink.
               | 
               | I think I'd only really save time by having a robot that
               | could unload my dishwasher and put up the clean dishes.
        
               | namibj wrote:
               | That's called a second dishwasher: one is for taking out,
               | the other for putting in. When the latter is full, turn it
               | on; dirty dishes wait outside until the cycle finishes, at
               | which point the dishwashers switch roles.
        
               | ptsneves wrote:
               | I thought about this and it gets even better. You do not
               | really need shelves as you just use the clean dishwasher
               | as the storage place. I honestly don't know why this is
               | not a thing in big or wealthy homes.
        
               | jannyfer wrote:
               | Another thing that bothers me is that dishwashers are
               | low. As I get older, I'm finding it really annoying to
               | bend down.
               | 
               | So get me a counter-level dishwasher cabinet and I'll be
               | happy!
        
               | yongjik wrote:
               | Hmm, that doesn't match my experience. It takes me a lot
               | more time to put dishes into the dishwasher, because it
               | has different places for cutlery, bowls, dishes, and so
               | on, and of course the existing structure never matches my
               | bowls' size perfectly so I have to play tetris or run it
               | with only 2/3 filled (which will cause me to waste more
               | time as I have to do dishes again sooner).
               | 
               | And that's before we get to bits of sticky rice left on
               | bowls, which somehow dishwashers never scrape off clean.
               | YMMV.
        
               | HPsquared wrote:
               | 1. Get a set of dishes that does fit nicely together in
               | the dishwasher.
               | 
               | 2. Start with a cold prewash, preferably with a little
               | powder in there too. This massively helps with stubborn
               | stuff.
        
               | nradov wrote:
               | There is the Foldimate robot. I don't know how well it
               | works. It doesn't seem to pair up socks. (Deleted the web
               | link, it might not be legitimate.)
        
               | smokel wrote:
               | Beware, this website is probably a scam.
               | 
               | FoldiMate went bankrupt in 2021 [1], and the domain
               | referral from foldimate.com to a 404 page at miele.com
               | suggests that it was Miele who bought up the remains, not
               | a sketchy company with a ".website" top-level domain.
               | 
               | [1] https://en.wikipedia.org/wiki/FoldiMate
        
               | smokel wrote:
               | We are certainly getting close! Back in 2010, watching
               | PR2 fold unseen towels was like watching paint dry [1],
               | but now we can enjoy robots attaining lazy-student-level
               | laundry folding in real time, as demonstrated by p0 [2].
               | 
               | [1] https://www.youtube.com/watch?v=gy5g33S0Gzo
               | 
               | [2] https://www.physicalintelligence.company/blog/pi0
        
               | sss111 wrote:
               | Honestly, a robot that can hang jumbled clean clothes
               | instead of folding them would be good enough; it's crazy
               | that we don't even have those.
        
               | dweekly wrote:
               | I was a believer in Gal's FoldiMate but sadly
               | it...folded.
               | 
               | https://en.m.wikipedia.org/wiki/FoldiMate
        
               | blargey wrote:
               | At this point I'm not sure we'll actually get a task-
               | specific machine for laundry folding/sorting before
               | humanoid robots gain the capability to do it well enough.
        
             | zamalek wrote:
             | Slightly tangential, we already have amazing laundry
             | robots. They are called washing and drying machines. We
             | don't give these marvels enough credit, mostly because they
             | aren't shaped like humans.
             | 
             | Humanoid robots are mostly a waste of time. Task-shaped
             | robots are _much_ easier to design, build, and maintain...
             | and are more reliable. Some of the things you mention might
             | needs humanoid versatility (loading the dishwasher), others
             | would be far better served by purpose-built robots (laundry
             | sorting).
        
               | jkaptur wrote:
               | I'm embarrassed to say that I spent a few moments
               | daydreaming about a robot that could wash my dishes. Then
               | I thought about what to call it...
        
               | musicale wrote:
               | Sadly current "dishwasher" models are neither self-
               | loading nor unloading. (Seems like they should be able to
               | take a tray of dishes, sort them, load them, and stack
               | them after cleaning.)
               | 
               | Maybe "busbot" or "scullerybot".
        
               | vidarh wrote:
               | The problem is more doing it in sufficiently little
               | space, and using little enough water and energy. Doing
               | one that you just feed dishes individually and that
               | immediately washes them and feeds them to storage should be
               | entirely viable, but it'd be wasteful, and it'd compete
               | with people having multiple small drawer-style
               | dishwashers, offering relatively little convenience over
               | that.
               | 
               | It seems most people aren't willing to pay for multiple
               | dishwashers - even multiple small ones - or to set aside
               | enough space, and that places severe constraints on
               | trying to do better.
        
               | wsintra2022 wrote:
               | Was it a dishwasher? Just give it all your unclean dishes
               | and tell it to go, come back an hour later, and they're
               | all washed and mostly dried!
        
               | rytis wrote:
               | I agree. I don't know where this obsession comes from.
               | Obsession with resembling humans as closely as possible.
               | We're so far from being perfect. If you need proof, just
               | look at your teeth. Yes, we're relatively universal, but
               | a screwdriver is more efficient at driving in screws than
               | our fingers. So please, stop wasting time building
               | perfect universal robots, and build more purpose-built
               | ones.
        
               | Nevermark wrote:
               | Given we have shaped so many tasks to fit our bodies, it
               | will be a long time before a bot able to do a
               | variety/majority of human tasks the human way stops being
               | valuable.
               | 
               | 1000 machines specialized for 1000 tasks are great, but
               | don't deliver the same value as a single bot that can
               | interchange with people flexibly.
               | 
               | Costly today, but it won't be forever.
        
               | golol wrote:
               | The shape doesn't matter! Non-humanoid shapes give minor
               | advantages on specific tasks, but for a general robot
               | you'll have a hard time finding a shape much more optimal
               | than humanoid. And if you go with humanoid you have so
               | much data available! Videos contain the information of
               | which movements a robot should execute. Teleoperation is
               | easy. This is the bitter lesson! The shape doesn't
               | matter; any shape will work with the right architecture,
               | data and training!
        
               | rowanG077 wrote:
               | Purpose-built robots are basically solved. Dishwashers,
               | laundry machines, assembly robots, etc. the moat is a
               | general purpose robot that can do what a human can do.
        
               | graemep wrote:
               | Great examples. They are simple, reliable, efficient and
               | effective. Far better than blindly copying what a human
               | being does. Maybe there are equally clever ways of doing
               | things like folding clothes.
        
           | ecshafer wrote:
           | I had a pretty bad case of tendinitis once that basically
           | made my thumb useless since using it would cause extreme
           | pain. That test seems really good. I could use a computer
           | keyboard without any issue, but putting a belt on or pouring
           | water was impossible.
        
             | vidarh wrote:
             | I had a swollen elbow a short while ago, and the number of
             | things I've never thought about that were affected by
             | reduced elbow joint mobility and an inability to put
             | pressure on the elbow was disturbing.
        
           | CooCooCaCha wrote:
           | That's why the goal isn't just benchmark scores, it's
           | _reliable_ and robust intelligence.
           | 
           | In that sense, the goalposts haven't moved in a long time
           | despite claims from AI enthusiasts that people are constantly
           | moving goalposts.
        
           | croemer wrote:
           | > We had hand prosthetics that could play Mozart at 5x speed
           | on a baby grand, but could not pick up a silver dollar or zip
           | a jacket even a little bit.
           | 
           | I must be missing something, how can they be able to play
           | Mozart at 5x speed with their prosthetics but not zip a
           | jacket? They could press keys but not do tasks requiring
           | feedback?
           | 
           | Or did you mean they used to play Mozart at 5x speed before
           | they became amputees?
        
             | rahimnathwani wrote:
             | Imagine a prosthetic 'hand' that has 5 regular fingers,
             | rather than 4 fingers and a thumb. It would be able to play
             | a piano just fine, but be unable to grasp anything small,
             | like a zipper.
        
             | numpad0 wrote:
             | Thumb not opposable?
        
             | 8note wrote:
             | zipping up a jacket is really hard to do, and requires very
             | precise movements and coordination between hands.
             | 
             | playing Mozart is much more forgiving in terms of the
             | number of different motions you have to make in different
             | directions, the amount of pressure to apply, and even the
             | black keys are much bigger than large-sized zipper tongues.
        
               | Balgair wrote:
               | Pretty much. The issue with zippers is that the fabric
               | moves about in unpredictable ways. Piano playing was just
               | movement programs. Zipping required (surprisingly) fast
               | feedback. Also, gripping is somewhat tough compared to
               | pressing.
        
             | ben_w wrote:
             | Playing a piano involves pushing down on the right keys
             | with the right force at the right time, but that could be
             | pre-programmed well before computers. The self-playing
             | piano in the saloon in Westworld wasn't a _huge_
             | anachronism, such things slightly overlapped with the Wild
             | West era: https://en.wikipedia.org/wiki/Player_piano
             | 
             | Picking up a 1mm thick metal disk from a flat surface
             | requires the user gives the exact time, place, and force,
             | and I'm not even sure what considerations it needs for
             | surface materials (e.g. slightly squishy fake skin) and/or
             | tip shapes (e.g. fake nails).
        
               | numpad0 wrote:
               | > Picking up a 1mm thick metal disk from a flat surface
               | requires the user gives the exact time, place, and force
               | 
               | place, sure, but can't you cheat a bit for time and force
               | with compliance ("impedance control")?
        
               | ben_w wrote:
               | In theory, apparently not in practice.
        
             | oblio wrote:
             | I'm far from a piano player, but I can definitely press
             | piano keys quite quickly, while zipping up my jacket when
             | it's cold and/or wet outside is really difficult.
             | 
             | Even more so for picking up coins from a flat surface.
             | 
             | For robotics, it's kind of obvious, speed is rarely an
             | issue, so the "5x" part is almost trivial. And you can
             | program the sequence quite easily, so that's also doable.
             | Piano keys are big and obvious and an ergonomically
             | designed interface meant to be relatively easy to press,
             | ergo easy even for a prosthetic. A small coin on a flat
             | surface is far from ergonomic.
        
               | croemer wrote:
               | But how do you deliberately control those fingers to
               | actually play what you have in mind yourself, rather than
               | something preprogrammed? Surely the idea of a prosthetic
               | does not just mean "a robot that is connected to your
               | body", but something that the owner controls with their
               | mind.
        
               | vidarh wrote:
               | Nobody said anything about deliberately controlling those
               | fingers to play yourself. Clearly it's not something you
               | do for the sake of the enjoyment of playing, but more
               | likely a demonstration of the dexterity of the prosthesis
               | and ability to program it for complex tasks.
               | 
               | The idea of a prosthesis is to help you regain
               | functionality. If the best way of doing that is through
               | automation, then it'd make little sense not to.
        
               | yongjik wrote:
               | I play piano as a hobby, and the funny thing is, if my
               | hands are so cold that I can't zip up my jacket, there's
               | no way I can play anything well. I know it's not quite
               | zipping up jackets ;) but a human playing the piano does
               | require a fast feedback loop.
        
             | n144q wrote:
             | Well, you see, while the original comment says the
             | prosthetics could play at 5x speed, it does not say they
             | played at that speed _well_ or beautifully. Any teacher or
             | student who has learned piano for a while will tell you
             | that this matters a lot, especially for classical music --
             | being able to accurately play at an even tempo with the
             | correct dynamics and articulation is hard and is what
             | differentiates a beginner/intermediate player from an
             | advanced one. In fact, one mistake many students make is
             | playing a piece too fast when they are not ready, and
             | teachers really want students to practice very slowly.
             | 
             | My point is -- being able to zip a jacket is all about
             | those subtle actions, and could actually be harder than
             | "just" playing piano fast.
        
           | alexose wrote:
           | It feels like there's a whole class of information that is
           | easily shorthanded, but really hard to explain to novices.
           | 
           | I think a lot about carpentry. From the outside, it's pretty
           | easy: Just make the wood into the right shape and stick it
           | together. But as one progresses, the intricacies become more
           | apparent. Variations in the wood, the direction of the grain,
           | the seasonal variations in thickness, joinery techniques that
           | are durable but also time efficient.
           | 
           | The way this information connects is highly multisensory and
           | multimodal. I now know which species of wood to use for which
           | applications. This knowledge was hard won through many, many
           | mistakes and trials that took place at my home, the hardware
           | store, the lumberyard, on YouTube, from my neighbor Steve,
           | and in books written by experts.
        
           | Method-X wrote:
           | Was it the Southampton Hand Assessment Procedure?
        
             | Balgair wrote:
             | Yes! Thank you!
             | 
             | https://www.shap.ecs.soton.ac.uk/
        
           | oblio wrote:
           | This was actually discovered quite early on in the history of
           | AI:
           | 
           | > Rodney Brooks explains that, according to early AI
           | research, intelligence was "best characterized as the things
           | that highly educated male scientists found challenging", such
           | as chess, symbolic integration, proving mathematical theorems
           | and solving complicated word algebra problems. "The things
           | that children of four or five years could do effortlessly,
           | such as visually distinguishing between a coffee cup and a
           | chair, or walking around on two legs, or finding their way
           | from their bedroom to the living room were not thought of as
           | activities requiring intelligence."
           | 
           | https://en.wikipedia.org/wiki/Moravec%27s_paradox
        
             | bawolff wrote:
             | I don't know why people always feel the need to gender
             | these things. Highly educated female scientists generally
             | find the same things challenging.
        
               | robocat wrote:
               | I don't know why anyone would blame people as though
               | someone is making an explicit choice. I find your choice
               | of words to be insulting to the OP.
               | 
               | We learn our language and stereotypes subconsciously from
               | our society, and it is no easy thing to fight against
               | that.
        
               | Barrin92 wrote:
               | >I don't know why people always feel the need to gender
               | these things
               | 
               | Because it's relevant to the point being made, i.e. that
               | these tests reflect the biases and interests of the
               | people who make them. This is true not just for AI
               | tests, but also for intelligence tests applied to
               | humans. That Demis Hassabis, a chess player and video
               | game designer, decided
               | to test his machine on video games, Go and chess probably
               | is not an accident.
               | 
               | The more interesting question is why people respond so
               | apprehensively to pointing out a very obvious problem and
               | bias in test design.
        
           | drdrey wrote:
           | I think assembling Legos would be a cool robot benchmark: you
           | need to parse the instructions, locate the pieces you need,
           | pick them up, orient them, snap them to your current
           | assembly, visually check if you achieved the desired state,
           | repeat
        
           | throwup238 wrote:
           | This is expressed in AI research as Moravec's paradox:
           | https://en.wikipedia.org/wiki/Moravec%27s_paradox
           | 
           | Getting to LLMs that could talk to us turned out to be a lot
           | easier than making something that could control even a
           | robotic arm without precise programming, let alone a
           | humanoid.
        
           | MarcelOlsz wrote:
           | >We had hand prosthetics that could play Mozart at 5x speed
           | on a baby grand
           | 
           | I'd love to know more about this.
        
           | xnx wrote:
           | Despite a lack of fearsome teeth or claws, humans are _way_
           | OP due to brains, hand dexterity, and balance.
        
         | lossolo wrote:
         | > making the most interesting and challenging LLM benchmark so
         | far.
         | 
         | This[1] is currently the most challenging benchmark. I would
         | like to see how O3 handles it, as O1 solved only 1%.
         | 
         | 1. https://epoch.ai/frontiermath/the-benchmark
        
           | pynappo wrote:
           | Apparently o3 scored about 25%
           | 
           | https://youtu.be/SKBG1sqdyIU?t=4m40s
        
             | FiberBundle wrote:
             | This is actually the result that I find way more
             | impressive. Elite mathematicians think these problems are
             | challenging and thought they were years away from being
             | solvable by AI.
        
           | modeless wrote:
           | You're right, I was wrong to say "most challenging" as there
           | have been harder ones coming out recently. I think the
           | correct statement would be "most challenging long-standing
           | benchmark" as I don't believe any other test designed in 2019
           | has resisted progress for so long. FrontierMath is only a
           | month old. And of course the real key feature of ARC is that
           | it is easy for humans. FrontierMath is (intentionally) not.
        
         | skywhopper wrote:
         | "The fact that scaled reasoning models are finally showing
         | progress on ARC proves that what it measures really is relevant
         | and important for reasoning."
         | 
         | Not sure I understand how this follows. The fact that a certain
         | type of model does well on a certain benchmark means that the
         | benchmark is relevant for real-world reasoning? That doesn't
         | make sense.
        
           | munchler wrote:
           | It shows objectively that the models are getting better at
           | some form of reasoning, which is at least worth noting.
           | Whether that improved reasoning is relevant for the real
           | world is a different question.
        
             | moffkalast wrote:
             | It shows objectively that one model got better at this
             | specific kind of weird puzzle that doesn't translate to
             | anything because it is just a pointless pattern matching
             | puzzle that can be trained for, just like anything else. In
             | fact they specifically trained for it, they say so upfront.
             | 
             | It's like the modern equivalent of saying "oh when AI
             | solves chess it'll be as smart as a person, so it's a good
             | benchmark" and we all know how that nonsense went.
        
               | munchler wrote:
               | Hmm, you could be right, but you could also be very
               | wrong. Jury's still out, so the next few years will be
               | interesting.
               | 
               | Regarding the value of "pointless pattern matching" in
               | particular, I would refer you to Douglas Hofstadter's
               | discussion of Bongard problems starting on page 652 of
               | _Gödel, Escher, Bach_. Money quote: "I believe that the
               | skill of solving Bongard [pattern recognition] problems
               | lies very close to the core of 'pure' intelligence, if
               | there is such a thing."
        
         | jug wrote:
         | I liked the SimpleQA benchmark that measures hallucinations.
         | OpenAI models did surprisingly poorly, even o1. In fact, it
         | looks like OpenAI often does well on benchmarks by taking the
         | shortcut of being more risk-prone than both Anthropic and Google.
        
         | zone411 wrote:
         | It's the least interesting benchmark for language models among
         | all they've released, especially now that we've already seen
         | a large jump in its best scores this year. It might be more
         | useful as a multimodal reasoning task since it clearly involves
         | visual elements, but with o3 already performing so well, this
         | has proven unnecessary. ARC-AGI served a very specific purpose
         | well: showcasing tasks where humans easily outperformed
         | language models, so these simple puzzles had their uses. But
         | tasks like proving math theorems or programming are far more
         | impactful.
        
         | danielmarkbruce wrote:
         | Highly challenging for LLMs because it has nothing to do with
         | language. LLMs and their training processes have all kinds of
         | optimizations for language and how it's presented.
         | 
         | This benchmark has done a wonderful job with marketing by
         | picking a great name. It's largely irrelevant for LLMs despite
         | the fact it's difficult.
         | 
         | Consider how much of the model is just noise for a task like
         | this given the low amount of information in each token and the
         | high embedding dimensions used in LLMs.
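         | 
         | A back-of-the-envelope sketch of that mismatch (the 4096-dim
         | embedding width below is my assumption; frontier models
         | don't publish theirs):
         | 
         |   import math
         | 
         |   # Each ARC cell is one of 10 colors: ~3.32 bits of signal.
         |   bits_per_cell = math.log2(10)
         | 
         |   # A hypothetical 4096-dim float32 embedding carries
         |   # 4096 * 32 = 131,072 bits of state per token.
         |   embedding_bits = 4096 * 32
         | 
         |   print(f"{bits_per_cell:.2f} bits of signal per cell vs")
         |   print(f"{embedding_bits:,} bits of state per token")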
        
         | adamgordonbell wrote:
         | There is a benchmark, NovelQA, that LLMs don't dominate when it
         | feels like they should. The benchmark is to read a novel and
         | answer questions about it.
         | 
         | LLMs were below human performance, as of my last look, but
         | the benchmark doesn't get much attention.
         | 
         | Once it is passed, I'd like to see one that involves solving
         | the mystery in a mystery book right before it's revealed.
         | 
         | We'd need unpublished mystery novels to use for that benchmark,
         | but I think it gets at what I think of as reasoning.
         | 
         | https://novelqa.github.io/
        
           | CamperBob2 wrote:
           | Does it work on short stories, but not novels? If so, then
           | that's just a minor question of context length that should
           | self-resolve over time.
        
             | adamgordonbell wrote:
             | The books fit in the current long-context models, so it's
             | not merely a context-size constraint, but the length is
             | part of the issue, for sure.
        
           | meta_x_ai wrote:
           | Looks like it hasn't been updated in nearly a year, and I'm
           | guessing Gemini 2.0 Flash with its 2M context will simply
           | crush it
        
             | adamgordonbell wrote:
             | That's true. They don't have Claude 3.5 on there either. So
             | maybe it's not relevant anymore, but I'm not sure.
             | 
             | If so, let's move on to the murder mysteries or more
             | complex literary analysis.
        
       | wilg wrote:
       | fun! the benchmarks are so interesting because real world use is
       | so variable. sometimes 4o will nail a pretty difficult problem,
       | other times o1 pro mode will fail 10 times on what i would think
       | is a pretty easy programming problem and i waste more time trying
       | to do it with ai
        
       | behnamoh wrote:
       | So now not only are the models closed, but so are their evals?!
       | This is a "semi-private" eval. WTH is that supposed to mean? I'm
       | sure the model is great but I refuse to take their word for it.
        
         | ZeroCool2u wrote:
         | The private evaluation set is private from the public/OpenAI so
         | companies can't train on those problems and cheat their way to
         | a high score by overfitting.
        
           | jsheard wrote:
           | If the models run on OpenAI's servers then surely they could
           | still see the questions being put into it if they wanted to
           | cheat? That could only be prevented by making the evaluation
           | a one-time deal that can't be repeated, or by having OpenAI
           | distribute their models for evaluators to run themselves,
           | which I doubt they're inclined to do.
        
             | foobarqux wrote:
             | Yes that's why it is "semi"-private: From the ARC website
             | "This set is "semi-private" because we can assume that over
             | time, this data will be added to LLM training data and need
             | to be periodically updated."
             | 
             | I presume evaluation on the test set is gated (you have to
             | ask ARC to run it).
        
         | cchance wrote:
         | The evals are the questions/answers. ARC-AGI doesn't share
         | the questions and answers for a portion so that models can't
         | be trained on them. For the public ones, the public knows
         | the questions, so there's a chance models could have been at
         | least partially trained on the questions (if not the actual
         | answers).
         | 
         | That's how I understand it.
        
       | neom wrote:
       | Why would they give a cost estimate per task on their low compute
       | mode but not their high mode?
       | 
       | "low compute" mode: Uses 6 samples per task, Uses 33M tokens for
       | the semi-private eval set, Costs $17-20 per task, Achieves 75.7%
       | accuracy on semi-private eval
       | 
       | The "high compute" mode: Uses 1024 samples per task (172x more
       | compute), Cost data was withheld at OpenAI's request, Achieves
       | 87.5% accuracy on semi-private eval
       | 
       | Can we just extrapolate $3kish per task on high compute?
       | (wondering if it was withheld because this isn't the case?)
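       | 
       | A quick sanity check on that extrapolation, assuming cost
       | scales linearly with sample count (which may not hold):
       | 
       |   samples_low, samples_high = 6, 1024
       |   for low_cost in (17, 20):
       |       per_sample = low_cost / samples_low
       |       high_cost = per_sample * samples_high
       |       print(f"${low_cost}/task low -> ${high_cost:,.0f}/task")
       |   # prints roughly $2,901 and $3,413, so "$3kish" checks out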
        
         | WiSaGaN wrote:
         | The withheld part is really a red flag for me. Why do you want
         | to withhold a compute number?
        
       | zebomon wrote:
       | My initial impression: it's very impressive and very exciting.
       | 
       | My skeptical impression: it's complete hubris to conflate ARC or
       | any benchmark with truly general intelligence.
       | 
       | I know my skepticism here looks identical to moving goalposts.
       | More and more, I am coming to understand general intelligence
       | as a phenomenon we will only ever be able to identify with the
       | benefit of substantial retrospect.
       | 
       | As it is with any sufficiently complex program, if you could
       | discern the result beforehand, you wouldn't have had to execute
       | the program in the first place.
       | 
       | I'm not trying to be a downer on the 12th day of Christmas.
       | Perhaps because my first instinct is childlike excitement, I'm
       | trying to temper it with a little reason.
        
         | amarcheschi wrote:
         | I just googled ARC-AGI questions, and it looks similar to an
         | IQ test with Raven's matrices. Similar as in: you have some
         | examples of images before and after, then an image before,
         | and you have to guess the after.
         | 
         | Could anyone confirm whether this is the only kind of
         | question in the benchmark? If yes, how come there is such a
         | direct connection to "oh this performs better than humans"
         | when LLMs can be quite a bit better than us at recognizing
         | and forecasting patterns? I'm just curious, not trying to
         | stir up controversies
        
           | zebomon wrote:
           | It's a test on which (apparently until now) the vast majority
           | of humans have far outperformed all machine systems.
        
             | patrickhogan1 wrote:
             | But it's not a test that directly shows general
             | intelligence.
             | 
             | I am no less excited! This is a huge improvement.
             | 
             | How does this do on SWE Bench?
        
               | og_kalu wrote:
               | >How does this do on SWE Bench?
               | 
               | 71.7%
        
               | throwaway0123_5 wrote:
               | I've seen this figure on a few tech news websites and
               | reddit but can't find an official source. If it was in
               | the video I must have missed it, where is this coming
               | from?
        
               | og_kalu wrote:
               | It was in the video. I don't know if OpenAI has a page
               | up yet
        
           | ALittleLight wrote:
           | Yes, it's pretty similar to Raven's. The reason it is an
           | interesting benchmark is because humans, even very young
           | humans, "get" the test in the sense of understanding what
           | it's asking and being able to do pretty well on it - but LLMs
           | have really struggled with the benchmark in the past.
           | 
           | Chollet (one of the creators of the ARC benchmark) has been
           | saying it proves LLMs can't reason. The test questions are
           | supposed to be unique and not in the model's training set.
           | The fact that LLMs struggled with the ARC challenge suggested
           | (to Chollet and others) that models weren't "truly
           | reasoning" but rather just completing based on things they'd
           | seen before - when the models were confronted with things
           | they hadn't seen before, the novel visual patterns, they
           | really struggled.
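           | 
           | For the curious: the public tasks (github.com/fchollet/ARC)
           | all share one JSON format - paired input/output grids whose
           | cells are integers 0-9 standing for colors. A minimal
           | sketch of the shape (the grids here are made up):
           | 
           |   # Structure of a single ARC task (toy values):
           |   task = {
           |       "train": [
           |           {"input": [[0, 1], [1, 0]],
           |            "output": [[1, 0], [0, 1]]},
           |       ],
           |       "test": [
           |           # solver must predict the output grid
           |           {"input": [[1, 1], [0, 0]]},
           |       ],
           |   }
           | 
           |   for pair in task["train"]:
           |       print(pair["input"], "->", pair["output"])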
        
           | Eridrus wrote:
           | ML is quite good at understanding and forecasting patterns
           | when you train on the data you want to forecast. LLMs manage
           | to do so much because we just decided to train on everything
           | on the internet and hope that it included everything we ever
           | wanted to know.
           | 
           | This tries to create patterns that are intentionally not in
           | the data and see if a system can generalize to them, which o3
           | super impressively does!
        
             | yunwal wrote:
             | ARC is in the dataset though? I mean I'm aware that there
             | are new puzzles every day, but there's still a very
             | specific format and set of skills required to solve it. I'd
             | bet a decent amount of money that humans get better at ARC
             | with practice, so it seems strange to suggest that AI
             | wouldn't.
        
         | hansonkd wrote:
         | It doesn't need to be general intelligence or perfectly map to
         | human intelligence.
         | 
         | All it needs to be is useful. Reading constant comments about
         | LLMs can't be general intelligence or lack reasoning etc, to me
         | seems like people witnessing the airplane and complaining that
         | it isn't "real flying" because it isn't a bird flapping its
         | wings (a large portion of the population held that point of
         | view back then).
         | 
            | It doesn't need to be general intelligence for the rapid
            | advancement of LLM capabilities to be the most
            | society-shifting development of the past decades.
        
           | zebomon wrote:
           | I agree. If the LLMs we have today never got any smarter, the
           | world would still be transformed over the next ten years.
        
           | AyyEye wrote:
           | > Reading constant comments about LLMs can't be general
           | intelligence or lack reasoning etc, to me seems like people
           | witnessing the airplane and complaining that it isn't "real
           | flying" because it isn't a bird flapping its wings (a large
           | portion of the population held that point of view back then).
           | 
           | That is a natural reaction to the incessant techbro, AIbro,
           | marketing, and corporate lies that "AI" (or worse AGI) is a
           | real thing, and can be directly compared to real humans.
           | 
           | There are people on this very thread saying it's better at
           | reasoning than real humans (LOL) because it scored higher on
           | some benchmark than humans... Yet this technology still can't
           | reliably determine what number is circled, if two lines
           | intersect, or count the letters in a word. (That said
           | behaviour may have been somewhat fine-tuned out of newer
           | models only reinforces the fact that the technology is
           | inherently not capable of understanding _anything_.)
        
             | IanCal wrote:
             | I encounter "spicy autocomplete" style comments far more
             | often than techbro AI-everything comments, and it's
             | frankly getting boring.
             | 
             | I've been doing AI things for 20+ years and LLMs are
             | wild. We've gone from specialized things being pretty bad
             | at their jobs to general-purpose things that are better
             | at that and everything else. The idea that you could
             | make an API call with "is this sarcasm?" and get a
             | better-than-chance guess is incredible.
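             | 
             | Something like this hypothetical sketch (using OpenAI's
             | Python client; the model name and prompt are
             | placeholders):
             | 
             |   from openai import OpenAI
             | 
             |   client = OpenAI()  # needs OPENAI_API_KEY set
             | 
             |   def is_sarcasm(text: str) -> str:
             |       # One chat completion; any chat model would do.
             |       resp = client.chat.completions.create(
             |           model="gpt-4o-mini",
             |           messages=[{
             |               "role": "user",
             |               "content": "Answer yes or no: is this "
             |                          f"sarcasm? {text!r}",
             |           }],
             |       )
             |       return resp.choices[0].message.content
             | 
             |   print(is_sarcasm("Oh great, another benchmark."))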
        
               | AyyEye wrote:
               | Nobody is disputing the coolness factor, only the
               | intelligence factor.
        
               | hansonkd wrote:
               | I'm saying the intelligence factor doesn't matter. Only
               | the utility factor. Today LLMs are incredibly useful and
               | every few months there appears to be bigger and bigger
               | leaps.
               | 
               | Analyzing whether or not LLMs have intelligence is
               | missing the forest for the trees. This technology is
               | emerging in a capitalist society that is hyper optimized
               | to adopt useful things at the expense of almost
               | everything else. If the utility/price point gets hit for
               | a problem, it will replace it regardless of if it is
               | intelligent or not.
        
               | surgical_fire wrote:
               | Eh, I see far more "AI is the second coming of Jesus"
               | type of comments than healthy skepticism. A lot of
               | anxiety from people afraid that their source of income
               | will dry up, and a lot of excitement from people with an
               | axe to
               | grind that "those entitled expensive peasants will get
               | what they deserve".
               | 
               | I think I count myself among the skeptics nowadays for
               | that reason. And I say this as someone that thinks LLM is
               | an interesting piece of technology, but with somewhat
               | limited use and unclear economics.
               | 
               | If the hype was about "look at this thing that can parse
               | natural language surprisingly well and generate coherent
               | responses", I would be excited too. As someone that had
               | to do natural language processing in the past, that is a
               | damn hard task to solve, and LLMs excel at it.
               | 
               | But that is not the hype, is it? We have people beating
               | the drums of how this is just shy of taking the world by
               | storm, and AGI is just around the corner, and it will
               | revolutionize the whole economy and society, and nothing
               | will ever be the same.
               | 
               | So, yeah, it gets tiresome. I wish the hype would die
               | down a little so this could be appreciated for what it
               | is.
        
               | williamcotton wrote:
               | _We have people beating the drums of how this is just shy
               | of taking the world by storm, and AGI is just around the
               | corner, and it will revolutionize all economy and society
               | and nothing will ever be the same._
               | 
               | Where are you seeing this? I pretty much only read HN and
               | football blogs so maybe I'm out of the loop.
        
               | sensanaty wrote:
               | In this very thread there are multiple people espousing
               | their views that the high score here is proof that o3 has
               | achieved AGI.
        
           | handsclean wrote:
           | People aren't responding to their own assumption that AGI is
           | necessary, they're responding to OpenAI and the chorus
           | constantly and loudly singing hymns to AGI.
        
           | surgical_fire wrote:
           | > to me seems like people witnessing the airplane and
           | complaining that it isn't "real flying" because it isn't a
           | bird flapping its wings
           | 
           | To me it is more like there is someone jumping on a pogo ball
           | while flapping their arms and saying that they are flying
           | whenever they hop off the ground.
           | 
           | Skeptics say that they are not really flying, while adherents
           | say that "with current pogo ball advancements, they will be
           | flying any day now"
        
             | intelVISA wrote:
             | Between skeptics and adherents who is more easily able to
             | extract VC money for vaporware? If you limit yourself to
             | 'the facts' you're leaving tons of $$ on the table...
        
               | surgical_fire wrote:
               | By all means, if this is the goal, AI is a success.
               | 
               | I understand that in this forum too many people are
               | invested in putting lipstick on this particular pig.
        
             | PaulDavisThe1st wrote:
             | An old quote, quite famous: "... is like saying that an ape
             | who climbs to the top of a tree for the first time is one
             | step closer to landing on the moon".
        
             | DonHopkins wrote:
             | Is that what Elon Musk was trying to do on stage?
        
           | billyp-rva wrote:
           | > It doesn't need to be general intelligence or perfectly map
           | to human intelligence.
           | 
           | > All it needs to be is useful.
           | 
           | Computers were already useful.
           | 
           | The only definition we have for "intelligence" is human (or,
           | generally, animal) intelligence. If LLMs aren't that, let's
           | call it something else.
        
             | throwup238 wrote:
             | What exactly is human (or animal) intelligence? How do you
             | define that?
        
               | billyp-rva wrote:
               | Does it matter? If LLMs _aren't_ that, whatever it is,
               | then we should use a different word. Finders keepers.
        
               | throwup238 wrote:
               | How do you know that LLMs "aren't that" if you can't even
               | define what _that_ is?
               | 
               | "I'll know it when I see it" isn't a compelling argument.
        
               | grahamj wrote:
               | They can't do what we do; therefore they aren't what we
               | are.
        
               | layer8 wrote:
               | And what is that, in concrete terms? Many humans can't do
               | what other humans can do. What is the common subset that
               | counts as human intelligence?
        
               | jonny_eh wrote:
               | > "I'll know it when I see it" isn't a compelling
               | argument.
               | 
               | It feels compelling to me.
        
               | Aperocky wrote:
               | I think a successful high-level intelligence should
               | quickly accelerate or converge to infinity/physical
               | resource exhaustion, because it can now work on
               | improving itself.
               | 
               | So if above human intelligence does happen, I'd assume
               | we'd know it, quite soon.
        
           | wruza wrote:
           | And look at the airplanes, they really can't just land on a
           | mountain slope or a tree without heavy maintenance
           | afterwards. Those people weren't all stupid, they questioned
           | the promise of flying servicemen delivering mail or milk to
           | their window and flying on a personal aircar to their
           | workplace. Just like today's promises about whatever tales
           | the CEOs are telling. Imagining bullshit isn't unique to
           | this century.
           | 
           | Aerospace is still a highly regulated area that requires
           | training and responsibility. If parallels can be drawn here,
           | they don't look so cool for a regular guy.
        
             | skydhash wrote:
             | This pretty much. Everyone knows that LLMs are great for
             | text generation and processing. What people have been
             | questioning are the end goals promised by their builders,
             | i.e. is it useful? And from most of what I've seen, it's
             | very much a toy.
        
             | Workaccount2 wrote:
             | What people always leave out is that society will bend to
             | the abilities of the new technology. Planes can't land in
             | your backyard so we built airports. We didn't abandon
             | planes.
        
               | PaulDavisThe1st wrote:
               | Sure, but that also vindicates the GP's point that the
               | initial claims of the boosters for planes contained more
               | than their fair share of bullshit and lies.
        
               | wruza wrote:
               | Yes but the idea was lost in the process. It became a
               | faster transportation system that uses air as a medium,
               | but that's it. Personal planes are still either big
               | business or an expensive and dangerous personal toy
               | thing. I don't think it's the same for LLMs (would be
               | naive). But where are promises like "we're gonna change
               | travel economics etc"? All the headlines scream is "AGI
               | around the corner". Yeah, now where's my damn postman
               | flying? I need my mail.
        
               | ben_w wrote:
               | > It became a faster transportation system that uses air
               | as a medium, but that's it.
               | 
               | On the one hand, yes; on the other, this understates the
               | impact that had.
               | 
               | My uncle moved from the UK to Australia because, I'm
               | told*, he didn't like his mum and travel was so expensive
               | that he assumed they'd never meet again. My first trip
               | abroad... I'm not 100% sure how old I was, but it must
               | have been between age 6 and 10, was my gran (his mum)
               | paying for herself, for both my parents, and for me, to
               | fly to Singapore, then on to various locations in
               | Australia including my uncle, and back via Thailand, on
               | her pension.
               | 
               | That was a gap of around one and a half generations.
               | 
               | * both of them are long-since dead now so I can't ask
        
               | ForHackernews wrote:
               | This is already happening. A few days ago Microsoft
               | turned down a documentation PR because the formatting was
               | better for humans but worse for LLMs:
               | https://github.com/MicrosoftDocs/WSL/pull/2021#issuecomment-...
               | 
               | They changed their mind after a public outcry including
               | here on HN.
        
               | oblio wrote:
               | We are slowly discovering that many of our wonderful
               | inventions from 60-80-100 years ago have serious side
               | effects.
               | 
               | Plastics, cars, planes, etc.
               | 
               | One could say that a balanced situation, where vested
               | interests are put back in the box (close to impossible,
               | since it would mean fighting trillions of dollars),
               | would mean that, for example, all 3 in the list above
               | are used a lot less than we use them now. And only used
               | where truly appropriate.
        
               | tivert wrote:
               | > What people always leave out is that society will bend
               | to the abilities of the new technology.
               | 
               | Do they really? I don't think they do.
               | 
               | > Planes can't land in your backyard so we built
               | airports. We didn't abandon planes.
               | 
               | But then what do you do with the all the fantasies and
               | hype about the new technology (like planes that land in
               | your backyard and you fly them to work)?
               | 
               | And it's quite possible and fairly common that the new
               | technology _actually ends up being mostly hype_ , and
               | there's actually no "airports" use case in the wings. I
               | mean, how much did society "bend to the abilities of"
               | NFTs?
               | 
               | And then what if the mature "airports" use case is
               | actually something _most people do not want_?
        
               | moffkalast wrote:
               | No, we built helicopters.
        
             | throwaway4aday wrote:
             | Your point is on the verge of nullification with the rapid
             | improvement and adoption of autonomous drones, don't you
             | think?
        
           | alexalx666 wrote:
           | If I could put it into a Tesla-style robot and it could do
           | dishes and help me figure out tech stuff, it would be more
           | than enough.
        
           | skywhopper wrote:
           | On the contrary, the pushback is critical because many
           | employers are buying the hype from AI companies that AGI is
           | imminent, that LLMs can replace professional humans, and that
           | computers are about to eliminate all work (except VCs and
           | CEOs apparently).
           | 
           | Every person that believes that LLMs are near sentient or
           | actually do a good job at reasoning is one more person
           | handing over their responsibilities to a zero-accountability
           | highly flawed robot. We've already seen LLMs generate bad
           | legal documents, bad academic papers, and extremely bad code.
           | Similar technology is making bad decisions about who to
           | arrest, who to give loans to, who to hire, who to bomb, and
           | who to refuse heart surgery for. Overconfident humans
           | employing this tech for these purposes have been bamboozled
           | by the lies from OpenAI, Microsoft, Google, et al. It's
           | crucial to call out overstatement and overhype about this
           | tech wherever it crops up.
        
           | jasondigitized wrote:
           | This a thousand times.
        
           | colordrops wrote:
           | I don't think many informed people doubt the utility of LLMs
           | at this point. The potential of human-like AGI has profound
           | implications far beyond utility models, which is why people
           | are so eager to bring it up. A true human-like AGI basically
           | means that most intellectual/white collar work will not be
           | needed, and probably manual labor before too long as well.
           | Huge huge implications for humanity, e.g. how does an economy
           | and society even work without workers?
        
             | vouaobrasil wrote:
             | > Huge huge implications for humanity, e.g. how does an
             | economy and society even work without workers?
             | 
             | I don't think those that create AI care about that. They
             | just want to come out on top before someone else does.
        
         | sigmoid10 wrote:
         | These comments are getting ridiculous. I remember when this
         | test was first discussed here on HN and everyone agreed that it
         | clearly proves current AI models are not "intelligent"
         | (whatever that means). And people tried to talk me down when I
         | theorised this test would get nuked soon - like all the ones
         | before. It's time people woke up and realised that the old age
         | of AI is over. This new kind is here to stay and it _will_ take
         | over the world. And you better guess it'll be sooner rather
         | than later and start to prepare.
        
           | samvher wrote:
           | What kind of preparation are you suggesting?
        
             | sigmoid10 wrote:
             | This is far too broad to summarise here. You can read up on
             | Sutskever or Bostrom or hell even Stephen Hawking's
             | ideas (going in order from really deep to general
             | topics). We need to discuss _everything_ - from
             | education through jobs and taxes all the way to the
             | principles of politics, our
             | economy and even the military. If we fail at this as a
             | society, we will at the very least create a world where the
             | people who own capital today massively benefit and become
             | rich beyond imagination (despite having contributed nothing
             | to it), while the majority of the population will be
             | unemployable and forever left behind. And the worst case
             | probably falls somewhere between the end of human
             | civilisation and the end of our species.
        
               | kelseyfrog wrote:
               | What we're going to do is punt the questions and then
               | convince ourselves the outcome was inevitable and if
               | anything it's actually our fault.
        
               | astrange wrote:
               | One way you can tell this isn't realistic is that it's
               | the plot of Atlas Shrugged. If your economic intuitions
               | produce that book it means they are wrong.
               | 
               | > while the majority of the population will be
               | unemployable and forever left behind
               | 
               | Productivity improvements increase employment. A
               | superhuman AI is a productivity improvement.
        
             | johnny_canuck wrote:
             | Start learning a trade
        
               | jorblumesea wrote:
               | that's going to work when every white collar worker goes
               | into the trades /s
               | 
               | Who is going to pay for residential electrical work,
               | lol? And how much will you make if some guy from MIT
               | is competing with you?
        
               | whynotminot wrote:
               | I feel like that's just kicking the can a little further
               | down the road.
               | 
               | Our value proposition as humans in a capitalist society
               | is an increasingly fragile thing.
        
           | foobarqux wrote:
           | You should look up the terms necessary and sufficient.
        
             | sigmoid10 wrote:
             | The real issue is people constantly making up new goalposts
             | to keep their outdated world view somewhat aligned with
             | what we are seeing. But these two things are drifting apart
             | faster and faster. Even I got surprised by how quickly the
             | ARC benchmark was blown out of the water, and I'm pretty
             | bullish on AI.
        
               | foobarqux wrote:
               | The ARC maintainers have explicitly said that passing the
               | test was necessary but not sufficient so I don't know
               | where you come up with goal-post moving. (I personally
               | don't like the test; it is more about "intuition" or in-
               | built priors, not reasoning).
        
               | manmal wrote:
               | Are you like invested in LLM companies or something?
               | You're pushing the agenda hard in this thread.
        
           | lawlessone wrote:
           | Failing the test may prove the AI is not intelligent. Passing
           | the test doesn't necessarily prove it is.
        
             | NitpickLawyer wrote:
             | Your comment reminds me of this quote from a book published
             | in the 80s:
             | 
             | > There is a related "Theorem" about progress in AI: once
             | some mental function is programmed, people soon cease to
             | consider it as an essential ingredient of "real thinking".
             | The ineluctable core of intelligence is always in that next
             | thing which hasn't yet been programmed. This "Theorem" was
             | first proposed to me by Larry Tesler, so I call it Tesler's
             | Theorem: "AI is whatever hasn't been done yet."
        
               | 6gvONxR4sf7o wrote:
               | I've always disliked this argument. A person can do
               | something well without devising a general solution to the
               | thing. Devising a general solution to the thing is a step
               | we're taking all the time with all sorts of things, but
               | it doesn't invalidate the cool fact about intelligence:
               | whatever it is that lets us do the thing well _without_
               | the general solution is hard to pin down and hard to
               | reproduce.
               | 
               | All that's invalidated each time is the idea that a
               | general solution to that task requires a general solution
               | to all tasks, or that a general solution to that task
               | requires our special sauce. It's the idea that something
               | able to do that task will also be able to do XYZ.
               | 
               | And yet people keep coming up with a new task that people
               | point to saying, 'this is the one! there's no way
               | something could solve this one without also being able to
               | do XYZ!'
        
             | 8note wrote:
             | I'd consider that it doing the test at all, without
             | proper compensation, is a sign that it isn't intelligent.
        
           | QuantumGood wrote:
           | "it will take over the world"
           | 
           | Calibrating to the current hype cycle has been challenging
           | with AI pronouncements.
        
           | jcims wrote:
           | I agree, it's like watching a meadow ablaze and dismissing it
           | because it's not a 'real forest fire' yet. No it's not 'real
           | AGI' yet, but *this is how we get there* and the pace is
           | relentless, incredible and wholly overwhelming.
           | 
           | I've been blessed with grandchildren recently, a little boy
           | that's 2 1/2 and just this past Saturday a granddaughter.
           | Major events notwithstanding, the world will largely resemble
           | today when they are teenagers, but the future is going to
           | look very very very different. I can't even imagine what the
           | capability and pervasiveness of it all will be like in ten
           | years, when they are still just kids. For me as someone
           | that's invested in their future I'm interested in all of the
           | educational opportunities (technical, philosophical and self-
           | awareness) but obviously am concerned about the potential for
           | pernicious side effects.
        
           | philipkglass wrote:
           | If AI takes over white collar work that's still half of the
           | world's labor needs untouched. There are some promising early
           | demos of robotics plus AI. I also saw some promising demos of
           | robotics 10 and 20 years ago that didn't reach mass
           | adoption. I'd
           | like to believe that by the time I reach old age the robots
           | will be fully qualified replacements for plumbers and home
           | health aides. Nothing I've seen so far makes me think that's
           | especially likely.
           | 
           | I'd love more progress on tasks in the physical world,
           | though. There are only a few paths for countries to deal with
           | a growing ratio of old retired people to young workers:
           | 
           | 1) Prioritize the young people at the expense of the old by
           | e.g. cutting old age benefits (not especially likely since
           | older voters have greater numbers and higher participation
           | rates in elections)
           | 
           | 2) Prioritize the old people at the expense of the young by
           | raising the demands placed on young people (either directly
           | as labor, e.g. nurses and aides, or indirectly through higher
           | taxation)
           | 
           | 3) Rapidly increase the population of young people through
           | high fertility or immigration (the historically favored path,
           | but eventually turns back into case 1 or 2 with an even
           | larger numerical burden of older people)
           | 
           | 4) Increase the health span of older people, so that they are
           | more capable of independent self-care (a good idea, but
           | difficult to achieve at scale, since most effective
           | approaches require behavioral changes)
           | 
           | 5) Decouple goods and services from labor, so that old people
           | with diminished capabilities can get everything they need
           | without forcing young people to labor for them
        
             | reducesuffering wrote:
             | > If AI takes over white collar work that's still half of
             | the world's labor needs untouched.
             | 
             | I am continually _baffled_ that people here throw this
             | argument out and can't imagine the second-order effects.
             | If white collar work is automated by AGI, all the R&D to
             | solve robotics beyond imagination will happen in a flash.
             | The top AI labs, the people smart enough to make this
             | technology, are all focusing on automating AGI
             | researchers, and from there follows everything,
             | obviously.
        
               | brotchie wrote:
               | +1, the second and third order effects aren't trivial.
               | 
               | We're already seeing escape velocity in world modeling
               | (see Google Veo2 and the latest Genesis LLM-based physics
               | modeling framework).
               | 
               | The hardware for humanoid robots is 95% of the way there,
               | the gap is control logic and intelligence, which is
               | rapidly being closed.
               | 
               | Combine Veo2 world model, Genesis control planning,
               | o3-style reasoning, and you're pretty much there with
               | blue collar work automation.
               | 
               | We're only a few turns (<12 months) away from an
               | existence proof of a humanoid robot that can watch a
               | Youtube video and then replicate the task in a novel
               | environment. May take longer than that to productionize.
               | 
               | It's really hard to think and project forward on an
               | exponential. We've been on an exponential technology
               | curve since the discovery of fire (at least). The 2nd
               | order has kicked up over the last few years.
               | 
               | Not a rational approach to look back at robotics
               | 2000-2022 and project that pace forwards. There's more
               | happening every month than in decades past.
        
               | philipkglass wrote:
               | I hope that you're both right. In 2004-2007 I saw self
               | driving vehicles make lightning progress from the weak
               | showing of the 2004 DARPA Grand Challenge to the
               | impressive 2005 Grand Challenge winners and the even more
               | impressive performance in the 2007 Urban Challenge. At
               | the time I thought that full self driving vehicles would
               | have a major commercial impact within 5 years. I expected
               | truck and taxi drivers to be obsolete jobs in 10 years.
               | 17 years after the Urban Challenge there are still
               | millions of truck driver jobs in America and only Waymo
               | seems to have a credible alternative to taxi drivers
               | (even then, only in a small number of cities).
        
           | ben_w wrote:
           | > It's time people woke up and realised that the old age of
           | AI is over. This new kind is here to stay and it will take
           | over the world. And you better guess it'll be sooner rather
           | than later and start to prepare.
           | 
           | I was just thinking about how 3D game engines were perceived
           | in the 90s. Every six months some new engine came out, blew
           | people's minds, was declared photorealistic, and was
           | forgotten a year later. The best of those engines kept
           | improving and are still here, and kinda did change the world
           | in their own way.
           | 
           | Software development seemed rapid and exciting until about
           | Halo or Half Life 2, then it was shallow but shiny press
           | releases for 15 years, and only became so again when OpenAI's
           | InstructGPT was demonstrated.
           | 
           | While I'm really impressed with current AI, and value the
           | best models greatly, and agree that they will change (and
           | have already changed) the world... I can't help but think of
           | the _Next Generation_ front cover, February 1997 when
           | considering how much further we may be from what we want:
           | https://www.giantbomb.com/pc/3045-94/forums/unreal-yes-
           | this-...
        
             | torginus wrote:
             | The weird thing about the phenomenon you mention is that
             | only after the field of software engineering plateaued 15
             | years ago, as you mentioned, did this insane demand for
             | engineers arise, with correspondingly insane salaries.
             | 
             | It's a very strange thing I've never understood.
        
               | dwaltrip wrote:
               | My guess: It's a very lengthy, complex, and error-prone
               | process to "digitize" human civilization (government,
               | commerce, leisure, military, etc). The tech existed, we
               | just didn't know how to use it.
               | 
               | We still barely know how to use computers effectively,
               | and they have already transformed the world. For better
               | or worse.
        
             | hansonkd wrote:
             | > how much further we may be from what we want
             | 
             | The timescale you are describing for 3D graphics is 4
             | years, from the 1997 cover you posted to the release of
             | Halo, which you say plateaued excitement because it got
             | advanced enough.
             | 
             | That's an almost infinitesimally small amount of time in
             | the history of human development, and you are mocking
             | the magazine for being excited about the advancement
             | because it was... 4 years early?
        
               | ben_w wrote:
               | No, the timescale is "the 90s"; _the specific
               | example_ is from 1997, and chosen because of how badly
               | it aged. Nobody looks at the original single-player
               | Unreal graphics today and thinks "this is amazing!",
               | but we all
               | did at the time -- Reflections! Dynamic lighting! It was
               | amazing for the era -- but it was also a long way from
               | photorealism. ChatGPT is amazing... but how far is it
               | from Brent Spiner's Data?
               | 
               | The era was people getting wowed from Wolfenstein (1992)
               | to "about Halo or Half Life 2" (2001 or 2004).
               | 
               | And I'm not saying the flattening of excitement was for
               | any specific reason, just that this was roughly when it
               | stopped getting exciting -- it might have been because
               | the engines were good enough for 3D art styles beyond "as
               | realistic as we can make it", but for all I know it was
               | the War On Terror which changed the tone of press
               | releases and how much the news in general cared. Or
               | perhaps it was a culture shift which came with more
               | people getting online and less media being printed on
               | glossy paper and sold in newsagents.
               | 
               | Whatever the cause, it happened around that time.
        
               | TeMPOraL wrote:
               | I'm still holding on to my hypothesis that the
               | excitement was sustained in large part because this
               | progress was something a regular person could partake
               | in. Most didn't, but they likely knew some kid who
               | did. And some of those kids ran the gaming magazines.
               | 
               | This was a time where, for 3D graphics, barriers to entry
               | got low (math got figured out, hardware was good enough,
               | knowledge spread), but the commercial market didn't yet
               | capture everything. Hell, a bulk of those excited kids
               | I remember, trying to do a better Unreal Tournament
               | after school instead of homework (and almost
               | succeeding!), went on to create and staff the next
               | generation of commercial gamedev.
               | 
               | (Which is maybe why this period lasted for about as long
               | as it takes for a schoolkid to grow up, graduate, and
               | spend few years in the workforce doing the stuff they
               | were so excited about.)
        
             | TeMPOraL wrote:
             | > _Software development seemed rapid and exciting until
             | about Halo or Half Life 2, then it was shallow but shiny
             | press releases for 15 years_
             | 
             | The transition seems to map well to the point where engines
             | got sophisticated enough that highly dedicated high-
             | schoolers couldn't keep up. Until then, people would
             | routinely make hobby game engines (for games they'd then
             | never finish) that were MVPs of what the game industry had
             | a year or three earlier. I.e. close enough to compete on
             | visuals with top photorealistic games of a given year - but
             | more importantly, this was a time where _you could do cool
             | nerdy shit to impress your friends and community_.
             | 
             | Then Unreal and Unity came out, with a business model that
             | killed the motivation to write your own engine from scratch
             | (except for purely educational purposes), we got more
             | games, more progress, but the excitement was gone.
             | 
             | Maybe it's just a spurious correlation, but it seems to
             | track with:
             | 
             | > _and only became so again when OpenAI's InstructGPT was
             | demonstrated._
             | 
             | Which is again, if you exclude training SOTA models - which
             | is still mostly out of reach for anyone but a few entities
             | on the planet - the time where _anyone_ can do something
             | cool that doesn't have a better market alternative yet,
             | and any dedicated high-schooler can make truly impressive
             | and useful work, outpacing commercial and academic work
             | based on pure motivation and focus alone (it's easier when
             | you're not being distracted by bullshit incentives like
             | _user growth_ or _making VCs happy_ or _churning out
             | publications, farming citations_).
             | 
             | It's, once again, a time of dreams, where anyone with some
             | technical interest and a bit of free time can _make the
             | future happen in front of their eyes_.
        
           | levocardia wrote:
           | I'm a little torn. ARC is really hard, and Francois is
           | extremely smart and thoughtful about what intelligence means
           | (the original "On the Measure of Intelligence" heavily
           | influenced my ideas on how to think about AI).
           | 
           | On the other hand, there is a long, long history of AI
           | achieving X but not being what we would casually refer to as
           | "generally intelligent," then people deciding X isn't really
           | intelligence; only when AI achieves Y will it be
           | intelligence. Then AI achieves Y and...
        
           | Workaccount2 wrote:
           | You are telling a bunch of high earning individuals ($150k+)
           | that they may be dramatically less valuable in the near
           | future. Of course the goal posts will keep being pushed back
           | and the acknowledgements will never come.
        
           | ignoramous wrote:
           | > _These comments are getting ridiculous._
           | 
           | Not really. Francois (co-creator of the ARC Prize) has this
           | to say:
           | 
           | > The v1 version of the benchmark is starting to saturate.
           | There were already signs of this in the Kaggle competition
           | this year: an ensemble of all submissions would score 81%.
           | 
           | > Early indications are that ARC-AGI-v2 will represent a
           | complete reset of the state-of-the-art, and it will remain
           | extremely difficult for o3. Meanwhile, a smart human or a
           | small panel of average humans would still be able to score
           | >95% ... This shows that it's still feasible to create
           | unsaturated, interesting benchmarks that are easy for
           | humans, yet impossible for AI, without involving specialist
           | knowledge. We will have AGI when creating such evals
           | becomes outright impossible.
           | 
           | > For me, the main open question is where the scaling
           | bottlenecks for the techniques behind o3 are going to be.
           | If human-annotated CoT data is a major bottleneck, for
           | instance, capabilities would start to plateau quickly like
           | they did for LLMs (until the next architecture). If the
           | only bottleneck is test-time search, we will see continued
           | scaling in the future.
           | 
           | https://x.com/fchollet/status/1870169764762710376 /
           | https://ghostarchive.org/archive/Sqjbf
        
           | bluerooibos wrote:
           | The goalposts have moved, again and again.
           | 
           | It's gone from "well the output is incoherent" to "well it's
           | just spitting out stuff it's already seen online" to
           | "WELL...uhh IT CAN'T CREATE NEW/NOVEL KNOWLEDGE" in the space
           | of 3-4 years.
           | 
           | It's incredible.
           | 
           | We already have AGI.
        
         | FrustratedMonky wrote:
         | " it's complete hubris to conflate ARC or any benchmark with
         | truly general intelligence."
         | 
         | Maybe it would help to include some human results in the AI
         | ranking.
         | 
         | I think we'd find that Humans score lower?
        
           | zamadatix wrote:
           | I'm not sure it'd help what they are talking about much.
           | 
           | E.g. go back in time and imagine you didn't yet know there
           | were ways for computers to be really good at performing
           | integration, as nobody had tried to make them. If someone
           | asked you how to tell if something is intelligent "the
           | ability to easily reason integrations or calculate extremely
           | large multiplications in mathematics" might seem like a great
           | test to make.
           | 
           | Skip forward to the modern era and it's blatantly obvious
           | CASes like Mathematica on a modern computer range between
           | "ridiculously better than the average person" to "impossibly
           | better than the best person" depending on the test. At the
           | same time, it becomes painfully obvious a CAS is wholly
           | unrelated to general intelligence and just because your test
           | might have been solvable by an AGI doesn't mean solving it
           | proves something must have been an AGI.
           | 
           | So you come up with a new test... but you have the same
           | problem as originally, it seems like anything non-human
           | completely bombs and an AGI would do well... but how do you
           | know the thing that solves it will have been an AGI for sure
           | and not just another system clearly unrelated?
           | 
           | Short of a more clever way what GP is saying is the goalposts
           | must keep being moved until it's not so obvious the thing
           | isn't AGI, not that the average human gets a certain score
           | which is worse.
           | 
           | .
           | 
           | All that aside, to answer your original question, in the
           | presentation it was said the average human gets 85% and this
           | was the first model to beat that. It was also said a second
           | version is being worked on. They have some papers on their
           | site about clear examples of why the current test clearly has
           | a lot of testing unrelated to whether something is really AGI
           | (a brute force method was shown to get >50% in 2020) so their
           | aim is to create a new goalpost test and see how things shake
           | out this time.
        
             | FrustratedMonky wrote:
             | "Short of a more clever way what GP is saying is the
             | goalposts must keep being moved until it's not so obvious
             | the thing isn't AGI, not that the average human gets a
             | certain score which is worse."
             | 
             | Best way of stating that I've heard.
             | 
             | The Goal Post must keep moving, until we understand enough
             | of what is happening.
             | 
             | I usually poo-poo the goal post moving, but this makes
             | sense.
        
             | og_kalu wrote:
             | Generality is not binary. It's a spectrum. And these models
             | are already general in ways those things you've mentioned
             | simply weren't.
             | 
             | What exactly is AGI to you? If it's simply a generally
             | intelligent machine, then what are you waiting for? What
             | else is there to be sure of? There's nothing narrow about
             | these models.
             | 
             | Humans love to believe they're oh so special so much that
             | there will always be debates on whether 'AGI' has arrived.
             | If you are waiting for that then you'll be waiting a very
             | long time, even if a machine arrives that takes us to the
             | next frontier in science.
        
         | m3kw9 wrote:
         | From the statement: this was a pretty tough test where AI
         | scored low vs humans just last year. AI now doing it as well
         | as humans may not be AGI, which I agree with, but it means
         | SOMETHING.
        
           | manmal wrote:
           | Obviously, the multi-billion-dollar companies will try to
           | satisfy the benchmarks they are not yet good at, as has
           | always been the case.
        
         | wslh wrote:
         | > My skeptical impression: it's complete hubris to conflate ARC
         | or any benchmark with truly general intelligence.
         | 
         | But isn't it interesting to have several benchmarks? Even if
         | it's not about passing the Turing test, benchmarks serve a
         | purpose--similar to how we measure microprocessors or other
         | devices. Intelligence may be more elusive, but even if we had
         | an oracle delivering the ultimate intelligence benchmark, we'd
         | still argue about its limitations. Perhaps we'd claim it
         | doesn't measure creativity well, and we'd find ourselves
         | revisiting the same debates about different kinds of
         | intelligences.
        
           | zebomon wrote:
           | It's certainly interesting. I'm just not convinced it's a
           | test of general intelligence, and I don't think we'll know
           | whether or not it is until it's been able to operate in the
           | real world to the same degree that our general intelligence
           | does.
        
         | kelseyfrog wrote:
         | > truly general intelligence
         | 
         | Indistinguishable from goalpost moving like you said, but also
         | no true Scotsman.
         | 
         | I'm curious what would happen in your eyes if we misattributed
         | general intelligence to an AI model? What are the consequences
         | of a false positive and how would they affect your life?
         | 
         | It's really clear to me how intelligence fits into our reality
         | as part of our social ontology. The attributes and their
         | expression that each of us uses to ground our concept of the
         | intelligent predicate differs wildly.
         | 
         | My personal theory is that we tend to have an exemplar-based
         | dataset of intelligence, and each of us attempts to construct a
         | parsimonious model of intelligence, but like all (mental)
         | models, they can be useful but wrong. These models operate in a
         | space where the trade off is completeness or consistency, and
         | most folks, uncomfortable saying "I don't know" lean toward
         | being complete in their specification rather than consistent.
         | The unfortunate side-effect is that we're able to easily
         | generate test data that highlights our model inconsistency - AI
         | being a case in point.
        
           | PaulDavisThe1st wrote:
           | > I'm curious what would happen in your eyes if we
           | misattributed general intelligence to an AI model? What are
           | the consequences of a false positive and how would they
           | affect your life?
           | 
           | Rich people will think they can use the AI model instead of
           | paying other people to do certain tasks.
           | 
           | The consequences could range from brilliant to utterly
           | catastrophic, depending on the context and precise way in
           | which this is done. But I'd lean toward the catastrophic.
        
             | kelseyfrog wrote:
             | Any specifics? It's difficult to separate this from
             | generalized concern.
        
               | PaulDavisThe1st wrote:
               | someone wants a "personal assistant" and believes that
               | the LLM has AGI ...
               | 
               | someone wants a "planning officer" and believes that the
               | LLM has AGI ...
               | 
               | someone wants a "hiring consultant" and believes that the
               | LLM has AGI ...
               | 
               | etc. etc.
        
               | kelseyfrog wrote:
               | My apologies, but would it be possible to list the
               | catastrophic consequences of these?
        
         | Agentus wrote:
         | How about an extra large dose of your skepticism: is true
         | intelligence really a thing, and not just a vague human
         | construct that tries to point out the mysterious,
         | unquantifiable combination of human behaviors?
         | 
         | Humans clearly don't know what intelligence is,
         | unambiguously. There's also no divinely ordained objective
         | dictionary that one can point at to reference what true
         | intelligence is. A deep reflection on trying to pattern-
         | associate different human cognitive abilities indicates
         | human cognitive capabilities aren't that spectacular really.
        
         | Bjorkbat wrote:
         | I think it's still an interesting way to measure general
         | intelligence, it's just that o3 has demonstrated that you can
         | actually achieve human performance on it by training it on the
         | public training set and giving it ridiculous amounts of
         | compute, which I imagine equates to ludicrously long chains-of-
         | thought, and if I understand correctly more than one chain-of-
         | thought per task (they mention sample sizes in the blog post,
         | with o3-low using 6 and o3-high using 1024. Not sure if these
         | are chains-of-thought per task or what).
         | 
         | Once you look at it that way, the approach really doesn't
         | look like intelligence that's able to generalize to novel
         | domains. It doesn't pass the sniff test. It looks a lot more
         | like brute-forcing.
         | 
         | Which is probably why, in order to actually qualify for the
         | leaderboard, they stipulate that you can't use more than $10k
         | worth of compute. Otherwise, it just sounds like brute-forcing.
        
       | attentionmech wrote:
       | Isn't this at the level now where it can sort of self-improve?
       | My guess is that they will just use it to improve the model,
       | and the cost they are showing per evaluation will go down
       | drastically.
       | 
       | So, next step in reasoning is open world reasoning now?
        
       | yawnxyz wrote:
       | O3 High (tuned) model scored an 88% at what looks like
       | $6,000/task haha
       | 
       | I think soon we'll be pricing any kind of task by its compute
       | costs. So basically, human = $50/task, AI = $6,000/task, use
       | human. If AI beats human, use AI? Ofc that's considering both get
       | 100% scores on the task
        
         | cchance wrote:
         | Isn't that generally what ... all jobs are? Automation cost
         | vs long-term human cost... it's why Amazon did the weird "our
         | stores are AI driven" thing, but in reality it was cheaper to
         | hire a bunch of guys in a sweatshop to look at the cameras
         | and write things down lol.
         | 
         | The thing is given what we've seen from distillation and tech,
         | even if its 6,000/task... that will come down drastically over
         | time through optimization and just... faster more efficient
         | processing hardware and software.
        
           | cryptoegorophy wrote:
           | I remember hearing about Tesla trying to automate all of
           | production, but some things just couldn't be, like the
           | wiring, which humans still had to do.
        
         | dyauspitr wrote:
         | Compute can get optimized and cheap quickly.
        
           | karmasimida wrote:
           | Is it? Moore's law is dead dead, I don't think this is a
           | given.
        
         | jsheard wrote:
         | That's the elephant in the room with the reasoning/COT
         | approach, it shifts what was previously a scaling of training
         | costs into scaling of training _and_ inference costs. The
         | promise of doing expensive training once and then running the
         | model cheaply forever falls apart once you're burning tens,
         | hundreds or thousands of dollars worth of compute every time
         | you run a query.
        
           | Legend2440 wrote:
           | Yeah, but next year they'll come out with a faster GPU, and
           | the year after that another still faster one, and so on.
           | Compute costs are a temporary problem.
        
             | freehorse wrote:
             | The issue is not just scaling compute, but scaling it at
             | a rate that meets the increase in complexity of the
             | problems that are not currently solved. If that is O(n)
             | then what you say probably stands. If it is e.g. O(n^8)
             | or exponential, then there is no hope of getting good
             | enough scaling by just increasing compute at a normal
             | rate. AI technology would then still be improving, but
             | improving to a halt, practically stagnating.
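             | 
             | A toy illustration of the shape of the problem (the
             | exponents are made-up assumptions, not measurements): if
             | solving a problem of "complexity" n costs ~n^k units of
             | compute, then doubling n costs 2^k times more compute.
             | 
             |   # toy numbers; assumes cost ~ n**k for hypothetical k
             |   for k in (1, 2, 8):
             |       print(f"k={k}: doubling n costs {2**k}x compute")
             |   # k=1 -> 2x   (hardware scaling can keep up)
             |   # k=8 -> 256x (compute improvements hit a wall)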
             | 
             | o3 will be interesting if it indeed offers a novel
             | technology to handle problem solving, something that is
             | able to learn from few novel examples efficiently and
             | adapt. That's what intelligence actually is. Maybe this is
             | the case. If, on the other hand, it is a smart way to pair
             | CoT within an evaluation loop (as the author hints as
             | possibility) then it is probable that, while this _can_
             | handle a class of problems that current LLMs cannot, it is
             | not really this kind of learning, meaning that it will not
             | be able to scale to more complex, real world tasks with a
             | problem space that is too large and thus less amenable to
             | such a technique. It is still interesting, because having a
             | good enough evaluator may be a very important step, but it
             | would mean that we are not yet there.
             | 
             | We will learn soon enough I suppose.
        
           | Workaccount2 wrote:
           | They're gonna figure it out. Something is being missed
           | somewhere, as human brains can do all this computation on 20
           | watts. Maybe it will be a hardware shift or maybe just a
           | software one, but I strongly suspect that modern transformers
           | are grossly inefficient.
        
         | redeux wrote:
         | Time and availability would also be factors.
        
         | Benjaminsen wrote:
         | Compute costs on AI with roughly the same capabilities
         | have been halving every ~7 months.
         | 
         | That makes something like this competitive in ~3 years
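         | 
         | A quick back-of-the-envelope on that (the $6,000/task and
         | $50/task figures come from upthread; the ~7-month halving is
         | the assumption):
         | 
         |   import math
         |   
         |   cost_now = 6000.0     # assumed $/task for o3 high today
         |   halving_months = 7.0  # assumed halving period
         |   target = 50.0         # assumed human $/task, from upthread
         |   
         |   months = halving_months * math.log2(cost_now / target)
         |   print(f"~{months:.0f} months (~{months / 12:.1f} years)")
         |   # -> ~48 months (~4.0 years) under these assumptions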
        
         | freehorse wrote:
         | This makes me wonder whether the solution comprises a
         | "solver" trying semi-random or more targeted things and a
         | "checker" checking these. Usually checking a solution is
         | cognitively (and computationally) easier than coming up with
         | it. Otherwise I cannot think what sort of compute would burn
         | $6,000 per task, unless you are going through a lot of loops
         | and have somehow solved the part of the problem that figures
         | out whether a solution is correct or not, while coming up
         | with the actual correct solution is not yet solved to the
         | same degree. Or maybe I am just naive and these prices are
         | just like breakfast for companies like that.
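         | 
         | A minimal sketch of that solver/checker loop, purely as
         | speculation (propose and verify are hypothetical callables;
         | nothing is known about o3's internals):
         | 
         |   def solve(task, propose, verify, budget=1024):
         |       # propose(): expensive "solver" sample.
         |       # verify(): cheaper "checker" score in [0, 1].
         |       # Cost scales with budget, which would explain
         |       # very large per-task bills.
         |       best, best_score = None, -1.0
         |       for _ in range(budget):
         |           candidate = propose(task)
         |           score = verify(task, candidate)
         |           if score > best_score:
         |               best, best_score = candidate, score
         |           if best_score >= 1.0:  # checker fully satisfied
         |               break
         |       return best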
        
         | og_kalu wrote:
         | It's not $6,000/task (i.e., per question). $6,000 is about
         | the retail cost for evaluating the entire benchmark on high
         | efficiency (about 400 questions).
        
           | Tiberium wrote:
           | From reading the blog post and Twitter, and the cost of
           | other models, I think it's evident that it IS actually cost
           | per task; see this tweet:
           | https://files.catbox.moe/z1n8dc.jpg
           | 
           | And o1 cost $15/$60 for 1M in/out, so the estimated costs on
           | the graph would match for a single task, not the whole
           | benchmark.
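           | 
           | As a rough sanity check (the token counts below are made
           | up; only the $15/$60 per 1M in/out pricing is from the
           | comment above):
           | 
           |   PRICE_IN = 15 / 1_000_000    # $ per input token
           |   PRICE_OUT = 60 / 1_000_000   # $ per output token
           |   
           |   def task_cost(tokens_in, tokens_out):
           |       return tokens_in * PRICE_IN + tokens_out * PRICE_OUT
           |   
           |   # e.g. a long hypothetical chain of thought per task:
           |   print(f"${task_cost(300_000, 250_000):.2f}")  # $19.50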
        
             | slibhb wrote:
             | The blog clarifies that it's $17-20 per task. Maybe it runs
             | into thousands for tasks it can't solve?
        
               | Tiberium wrote:
               | That cost is for o3 low, o3 high goes into thousands per
               | task.
        
         | gbnwl wrote:
         | Well they got 75.7% at $17/task. Did you see that?
        
         | seydor wrote:
         | What if we use those humans to generate energy for the tasks?
        
       | spaceman_2020 wrote:
       | Just as an aside, I've personally found o1 to be completely
       | useless for coding.
       | 
       | Sonnet 3.5 remains the king of the hill by quite some margin
        
         | cchance wrote:
         | The new gemini's are pretty good too
        
           | lysecret wrote:
           | Actually prefer new geminis too. 2.0 experimental especially.
        
         | og_kalu wrote:
         | To be fair, until the last checkpoint released 2 days ago, o1
         | didn't really beat sonnet (and if so, barely) in most non-
         | competitive coding benchmarks
        
         | vessenes wrote:
         | To fill this out, I find o1-pro (and -preview when it was live)
         | to be pretty good at filling in blindspots/spotting holistic
         | bugs. I use Claude for day to day, and when Claude is spinning,
         | o1 often can point out why. It's too slow for AI coding, and I
         | agree that at default its responses aren't always satisfying.
         | 
         | That said, I think its code style is arguably better, more
         | concise and has better patterns -- Claude needs a fair amount
         | of prompting and oversight to not put out semi-shitty code in
         | terms of structure and architecture.
         | 
         | In my mind: going from Slowest to Fastest, and Best
         | Holistically to Worst, the list is:
         | 
         | 1. o1-pro
         | 
         | 2. Claude 3.5
         | 
         | 3. Gemini 2 Flash
         | 
         | Flash is so fast, that it's tempting to use more, but it really
         | needs to be kept to specific work on strong codebases without
         | complex interactions.
        
         | bearjaws wrote:
         | o1 is pretty good at spotting OWASP defects, compared to most
         | other models.
         | 
         | https://myswamp.substack.com/p/benchmarking-llms-against-com...
        
         | InkCanon wrote:
         | I just asked o1 a simple yes or no question about x86 atomics
         | and it did one of those A or B replies. The first answer was
         | yes, the second answer was no.
        
         | m3kw9 wrote:
         | o1 is for when all else fails. Sometimes it makes the same
         | mistakes as weaker models if you give it simple tasks with
         | very little context, but when a good, precise context is
         | given it usually outperforms other models.
        
         | karmasimida wrote:
         | Yeah, I feel that for the chat use case o1 is just too slow
         | for me, and my queries aren't that complicated.
         | 
         | For coding, o1 is marvelous at Leetcode questions. I think it
         | is the best teacher I could ever afford to teach me
         | leetcoding, but I don't find myself having a lot of other use
         | cases for o1 that are complex and require a really long
         | reasoning chain.
        
         | bitbuilder wrote:
         | I find myself hopping between o1 and Sonnet pretty frequently
         | these days, and my personal observation is that the quality of
         | output from o1 scales more directly to the quality of the
         | prompting you're giving it.
         | 
         | In a way it almost feels like it's become _too_ good at
         | following instructions and simply just takes your direction
         | more literally. It doesn't seem to take the initiative of
         | going the extra mile of filling in the blanks from your lazy
         | input (note: many would see this as a good thing). Claude on
         | the other hand feels more intuitive in discerning intent from a
         | lazy prompt, which I may be prone to offering it at times when
         | I'm simply trying out ideas.
         | 
         | However, if I take the time to write up a well thought out
         | prompt detailing my expectations, I find I much prefer the code
         | o1 creates. It's smarter in its approach, offers clever ideas I
         | wouldn't have thought of, and generally cleaner.
         | 
         | Or put another way, I can give Sonnet a lazy or detailed prompt
         | and get a good result, while o1 will give me an excellent
         | result with a well thought out prompt.
         | 
         | What this boils down to is I find myself using Sonnet while
         | brainstorming ideas, or when I simply don't know how I want to
         | approach a problem. I can pitch it a feature idea the same way
         | a product owner might pitch an idea to an engineer, and then
         | iterate through sensible and intuitive ways of looking at the
         | problem. Once I get a handle on how I'd like to implement a
         | solution, I type up a spec and hand it off to o1 to crank out
         | the code I'd intend to implement.
        
           | jules wrote:
           | Can you solve this by putting your lazy prompt through GPT-4o
           | or Sonnet 3.6 and asking it to expand the prompt to a full
           | prompt for o1?
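           | 
           | Something like this two-stage pipeline, sketched with the
           | OpenAI Python client (the model names are placeholders for
           | "cheap expander" and "strong reasoner"; treat the whole
           | thing as an untested assumption):
           | 
           |   from openai import OpenAI
           |   
           |   client = OpenAI()
           |   
           |   def expand_then_solve(lazy_prompt):
           |       # Stage 1: a cheap model fleshes out the lazy prompt.
           |       spec = client.chat.completions.create(
           |           model="gpt-4o",
           |           messages=[{"role": "user", "content":
           |               "Rewrite this rough request as a detailed, "
           |               "unambiguous spec for a coding model:\n"
           |               + lazy_prompt}],
           |       ).choices[0].message.content
           |       # Stage 2: the strong reasoner gets the full spec.
           |       return client.chat.completions.create(
           |           model="o1",
           |           messages=[{"role": "user", "content": spec}],
           |       ).choices[0].message.content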
        
       | smy20011 wrote:
       | It seems o3 follows the trend of chess engines, where you can
       | cut your search depth depending on state.
       | 
       | That's good for games with a clear signal of success (win/lose
       | for chess, tests for programming). One of the blockers for AGI
       | is that we don't have clear evaluations for most of our tasks,
       | and we cannot verify them fast enough.
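       | 
       | In chess-engine terms, something like this generic negamax
       | sketch (all the callbacks are hypothetical, and real engines
       | cap how many extensions they grant):
       | 
       |   def negamax(state, depth, children, evaluate,
       |               is_terminal, is_sharp):
       |       # State-dependent depth: "sharp" positions earn a
       |       # one-ply extension, quiet ones get cut off early.
       |       if is_terminal(state) or depth <= 0:
       |           return evaluate(state)
       |       bonus = 1 if is_sharp(state) else 0
       |       return max(-negamax(c, depth - 1 + bonus, children,
       |                           evaluate, is_terminal, is_sharp)
       |                  for c in children(state))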
        
       | flakiness wrote:
       | The cost axis is interesting. o3 low is $10+ per task and o3
       | high is over $1000 (it's a logarithmic graph, so it's like $50
       | and $5000 respectively?)
        
       | obblekk wrote:
       | Human performance is 85% [1]. o3 high gets 87.5%.
       | 
       | This means we have an algorithm to get to human level performance
       | on this task.
       | 
       | If you think this task is an eval of general reasoning ability,
       | we have an algorithm for that now.
       | 
       | There's a lot of work ahead to generalize o3 performance to all
       | domains. I think this explains why many researchers feel AGI is
       | within reach, now that we have an algorithm that works.
       | 
       | Congrats to both Francois Chollet for developing this compelling
       | eval, and to the researchers who saturated it!
       | 
       | [1] https://x.com/SmokeAwayyy/status/1870171624403808366,
       | https://arxiv.org/html/2409.01374v1
        
         | phillipcarter wrote:
         | As excited as I am by this, I feel like this is still
         | just a small approximation of a small chunk of human reasoning
         | ability at large. o3 (and whatever comes next) feels to me like
         | it will head down the path of being a reasoning coprocessor for
         | various tasks.
         | 
         | But, still, this is incredibly impressive.
        
           | qt31415926 wrote:
           | Which parts of reasoning do you think are missing? I do
           | feel like it covers a lot of 'reasoning' ground despite its
           | on-the-surface simplicity.
        
             | phillipcarter wrote:
             | I think it's hard to enumerate the unknown, but I'd
             | personally love to see how models like this perform on
             | things like word problems where you introduce red herrings.
             | Right now, LLMs at large tend to struggle mightily to
             | understand when some of the given information is not only
             | irrelevant, but may explicitly serve to distract from the
             | real problem.
        
               | KaoruAoiShiho wrote:
               | o1 already fixed the red herrings...
        
         | ALittleLight wrote:
         | It's not saturated. 85% is average human performance, not "best
         | human" performance. There is still room for the model to go up
         | to 100% on this eval.
        
         | scotty79 wrote:
         | Still it's comparing average human level performance with best
         | AI performance. Examples of things o3 failed at are insanely
         | easy for humans.
        
           | FrustratedMonky wrote:
           | There are things Chimps do easily that humans fail at, and
           | vice/versa of course.
           | 
           | There are blind spots, doesn't take away from 'general'.
        
           | cchance wrote:
           | You'd be surprised what the AVERAGE human fails to do that
           | you think is easy, my mom can't fucking send an email without
           | downloading a virus, i have a coworker that believes beyond a
           | shadow of a doubt the world is flat.
           | 
           | The Average human is a lot dumber than people on hackernews
           | and reddit seem to realize, shit the people on mturk are
           | likely smarter than the AVERAGE person
        
             | staticman2 wrote:
             | Yet the average human can drive a car a lot better than
             | ChatGPT can, which shows that the way you frame
             | "intelligence" dictates your conclusion about who is
             | "intelligent".
        
               | p1esk wrote:
               | Pretty sure a waymo car drives better than an average SF
               | driver.
        
               | tracerbulletx wrote:
               | If you take an electrical sensory input signal
               | sequence and transform it into an electrical muscle
               | output signal sequence, you've got a brain. ChatGPT
               | isn't going to drive a car because it's trained on
               | verbal tokens, and it's not optimized for the type of
               | latency you need for physical interaction.
               | 
               | And the brain doesn't use the same network to do verbal
               | reasoning as real time coordination either.
               | 
               | But that work is moving along fine. All of these models
               | and lessons are going to be combined into AGI. It is
               | happening. There isn't really that much in the way.
        
         | cryptoegorophy wrote:
         | What's interesting is it might be much closer to human
         | intelligence than to some "alien" intelligence, because after
         | all it is an LLM trained on human-made text, which kind of
         | represents human intelligence.
        
           | hammock wrote:
           | In that vein, perhaps the delta between o3 @ 87.5% and Human
           | @ 85% represents a deficit in the ability of text to
           | communicate human reasoning.
           | 
           | In other words, it's possible humans can reason better than
           | o3, but cannot articulate that reasoning as well through text
           | - only in our heads, or through some alternative medium.
        
             | 85392_school wrote:
             | I wonder how much of an effect amount of time to answer has
             | on human performance.
        
               | yunwal wrote:
               | Yeah, this is sort of meaningless without some idea of
               | cost or consequences of a wrong answer. One of the nice
               | things about working with a competent human is being able
               | to tell them "all of our jobs are on the line" and
               | knowing with certainty that they'll come to a good
               | answer.
        
             | unsupp0rted wrote:
             | It's possible humans reason better through text than not
             | through text, so these models, having been trained on text,
             | should be able to out-reason any person who's not currently
             | sitting down to write.
        
         | antirez wrote:
         | NNs are not algorithms.
        
           | notfish wrote:
           | An algorithm is "a process or set of rules to be followed in
           | calculations or other problem-solving operations, especially
           | by a computer"
           | 
           | How does a giant pile of linear algebra not meet that
           | definition?
        
             | antirez wrote:
             | It's not made of "steps"; it's an almost continuous
             | function of its inputs. And a function is not an
             | algorithm: it is not an object made of conditions, jumps,
             | terminations, ... Obviously it has computation
             | capabilities and is Turing-complete, but it is the
             | opposite of an algorithm.
        
               | raegis wrote:
               | > It's not made of "steps", it's an almost continuous
               | function to its inputs.
               | 
               | Can you define "almost continuous function"? Or explain
               | what you mean by this, and how it is used in the A.I.
               | stuff?
        
               | janalsncm wrote:
               | If it wasn't made of steps then Turing machines wouldn't
               | be able to execute them.
               | 
               | Further, this is probably running an algorithm on top of
               | an NN. Some kind of tree search.
               | 
               | I get what you're saying though. You're trying to draw a
               | distinction between statistical methods and symbolic
               | methods. Someday we will have an algorithm which uses
               | statistical methods that can match human performance on
               | most cognitive tasks, and it won't look or act like a
               | brain. In some sense that's disappointing. We can build
               | supersonic jets without fully understanding how birds
               | fly.
        
               | antirez wrote:
               | Let's say that Turing machines can approximate the
               | execution of NNs :) That's why there are issues
               | related to numerical precision. But the contrary is
               | also true: NNs can discover and use techniques
               | similar to those used by traditional algorithms.
               | However, the two remain two different methods of
               | doing computation, and it's probably not just by
               | chance that many things we can't do algorithmically,
               | we can do with NNs. What I mean is that this is not
               | _just_ because NNs discover complex algorithms via
               | gradient descent, but also that the computational
               | model of NNs is better adapted to solving certain
               | tasks. So the inference algorithm of NNs (doing
               | multiplications and other batch transformations) is
               | just what standard computers need to approximate the
               | NN computational model. You could do this in analog
               | hardware, and nobody would claim (maybe?) that it's
               | running an algorithm. Or that brains themselves are
               | algorithms.
        
           | benlivengood wrote:
           | Deterministic (IEEE 754 floats), terminates on all inputs,
           | correctness (produces loss < X on N training/test inputs)
           | 
           | At most you can argue that there isn't a useful bounded loss
           | on every possible input, but it turns out that humans don't
           | achieve useful bounded loss on identifying arbitrary sets of
           | pixels as a cat or whatever, either. Most problems NNs are
           | aimed at are qualitative or probabilistic where provable
           | bounds are less useful than Nth-percentile performance on
           | real-world data.
        
           | KeplerBoy wrote:
           | Running inference on a model certainly is an algorithm.
        
           | drdeca wrote:
           | How do you define "algorithm"? I suspect it is a definition I
           | would find somewhat unusual. Not to say that I strictly
           | disagree, but only because to my mind "neural net" suggests
           | something a bit more concrete than "algorithm", so I might
           | instead say that an artificial neural net is an
           | implementation of an algorithm, rather than or something like
           | that.
           | 
           | But, to my mind, something of the form "Train a neural
           | network with an architecture generally like [blah], with a
           | training method+data like [bleh], and save the result. Then,
           | when inputs are received, run them through the NN in such-
           | and-such way." would constitute an algorithm.
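           | 
           | A concrete (toy) instance of that form - fit a tiny model,
           | save the result, then run new inputs through it; one
           | algorithm, two phases (a sketch assuming numpy, nothing
           | like real LLM training):
           | 
           |     import numpy as np
           | 
           |     def train(X, y, steps=2000, lr=0.1):
           |         # Phase 1: fit weights by gradient descent,
           |         # then "save the result".
           |         w = np.zeros(X.shape[1])
           |         for _ in range(steps):
           |             w -= lr * X.T @ (X @ w - y) / len(y)
           |         return w
           | 
           |     def infer(w, x):
           |         # Phase 2: run received inputs through the NN.
           |         return x @ w
           | 
           |     X = np.array([[1., 0.], [0., 1.], [1., 1.]])
           |     y = np.array([1., 2., 3.])
           |     w = train(X, y)
           |     print(infer(w, np.array([2., 1.])))  # ~4.0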
        
         | 6gvONxR4sf7o wrote:
         | Human performance is much closer to 100% on this, depending on
         | your human. It's easy to miss the dot in the corner of the
         | headline graph in TFA that says "STEM grad."
        
         | hypoxia wrote:
         | It actually beats the human average by a wide margin:
         | 
         | - 64.2% for humans vs. 82.8%+ for o3.
         | 
         | ...
         | 
         | Private Eval:
         | 
         | - 85%: threshold for winning the prize [1]
         | 
         | Semi-Private Eval:
         | 
         | - 87.5%: o3 (unlimited compute) [2]
         | 
         | - 75.7%: o3 (limited compute) [2]
         | 
         | Public Eval:
         | 
         | - 91.5%: o3 (unlimited compute) [2]
         | 
         | - 82.8%: o3 (limited compute) [2]
         | 
         | - 64.2%: human average (Mechanical Turk) [1] [3]
         | 
         | Public Training:
         | 
         | - 76.2%: human average (Mechanical Turk) [1] [3]
         | 
         | ...
         | 
         | References:
         | 
         | [1] https://arcprize.org/guide
         | 
         | [2] https://arcprize.org/blog/oai-o3-pub-breakthrough
         | 
         | [3] https://arxiv.org/abs/2409.01374
        
           | usaar333 wrote:
           | Superhuman isn't beating a random Mechanical Turker.
           | 
           | Their post has STEM grads at nearly 100%.
        
             | tripletao wrote:
             | This is correct. It's easy to get arbitrarily bad results
             | on Mechanical Turk, since without any quality control
             | people will just click as fast as they can to get paid (or
             | bot it and get paid even faster).
             | 
             | So in practice, there's always some kind of quality
             | control. Stricter quality control will improve your
             | results, and the right amount of quality control is
             | subjective. This makes any assessment of human quality
             | meaningless without explanation of how those humans were
             | selected and incentivized. Chollet is careful to provide
             | that, but many posters here are not.
             | 
             | In any case, the ensemble of task-specific, low-compute
             | Kaggle solutions is reportedly also super-Turk, at 81%. I
             | don't think anyone would call that AGI, since it's not
             | general; but if the "(tuned)" in the figure means o3 was
             | tuned specifically for these tasks, that's not obviously
             | general either.
        
       | Imnimo wrote:
       | Whenever a benchmark that was thought to be extremely difficult
       | is (nearly) solved, it's a mix of two causes. One is that
       | progress on AI capabilities was faster than we expected, and the
       | other is that there was an approach that made the task easier
       | than we expected. I feel like there's a lot of the former
       | here, but the compute cost per task (thousands of dollars to
       | solve one little color grid puzzle??) suggests to me that there's
       | some amount of the latter. Chollet also mentions ARC-AGI-2 might
       | be more resistant to this approach.
       | 
       | Of course, o3 looks strong on other benchmarks as well, and
       | sometimes "spend a huge amount of compute for one problem" is a
       | great feature to have available if it gets you the answer you
       | needed. So even if there's some amount of "ARC-AGI wasn't quite
       | as robust as we thought", o3 is clearly a very powerful model.
        
         | exe34 wrote:
         | > the other is that there was an approach that made the task
         | easier than we expected.
         | 
         | From reading Dennett's philosophy, I'm convinced that
         | that's how human intelligence works - for each task where
         | we say "only a human could do that", there's a trick that
         | makes it easier than it seems. We are bags of tricks.
        
       | whoistraitor wrote:
       | The general message here seems to be that inference-time brute-
       | forcing works as long as you have a good search and evaluation
       | strategy. We've seemingly hit a ceiling on the base LLM forward-
       | pass capability so any further wins are going to be in how we
       | juggle multiple inferences to solve the problem space. It feels
       | like a scripting problem now. Which is cool! A fun space for
       | hacker-engineers. Also:
       | 
       | > My mental model for LLMs is that they work as a repository of
       | vector programs. When prompted, they will fetch the program that
       | your prompt maps to and "execute" it on the input at hand. LLMs
       | are a way to store and operationalize millions of useful mini-
       | programs via passive exposure to human-generated content.
       | 
       | I found this such an intriguing way of thinking about it.
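       | 
       | The scripting-problem framing suggests something like best-
       | of-N with a scorer (a toy sketch; `propose` and `score` are
       | stand-ins for an LLM call and an evaluator, not anything
       | OpenAI has described):
       | 
       |     import random
       | 
       |     def solve(problem, propose, score, n=64):
       |         # Spend inference-time compute: sample many
       |         # candidates, keep the one the evaluator likes best.
       |         candidates = [propose(problem) for _ in range(n)]
       |         return max(candidates,
       |                    key=lambda c: score(problem, c))
       | 
       |     # Toy stand-ins: find sqrt(2) by noisy guesses + checker.
       |     propose = lambda p: random.gauss(1.0, 0.5)
       |     score = lambda p, c: -abs(c * c - p)
       |     print(solve(2.0, propose, score))  # ~1.414 for large n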
        
         | whimsicalism wrote:
         | > We've seemingly hit a ceiling on the base LLM forward-pass
         | capability so any further wins are going to be in how we juggle
         | multiple inferences to solve the problem space
         | 
         | Not so sure - but we might need to figure out the
         | inference/search/evaluation strategy first, in order to
         | generate the data we need to distill back into the single
         | forward pass.
        
       | cchance wrote:
       | Is it just me or does looking at the ARC-AGI example questions at
       | the bottom... make your brain hurt?
        
         | drdaeman wrote:
         | Looks pretty obvious to me, although, of course, it took me a
         | few moments to understand what's expected as a solution.
         | 
         | c6e1b8da is moving rectangular figures by a given vector,
         | 0d87d2a6 is drawing horizontal and/or vertical lines
         | (connecting dots at the edges) and filling figures they touch,
         | b457fec5 is filling gray figures with a given repeating color
         | pattern.
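         | 
         | For instance, the first transformation is a few lines of
         | numpy (a toy sketch; `translate` is a hypothetical helper,
         | assuming the shift stays in bounds):
         | 
         |     import numpy as np
         | 
         |     def translate(grid, dy, dx):
         |         # Move every nonzero cell by the vector (dy, dx).
         |         out = np.zeros_like(grid)
         |         ys, xs = np.nonzero(grid)
         |         out[ys + dy, xs + dx] = grid[ys, xs]
         |         return out
         | 
         |     g = np.zeros((5, 5), dtype=int)
         |     g[1:3, 1:3] = 4            # a 2x2 colored rectangle
         |     print(translate(g, 2, 1))  # shifted down 2, right 1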
         | 
         | This is pretty straightforward stuff that doesn't require much
         | spatial thinking or keeping multiple things/aspects in memory -
         | visual puzzles from various "IQ" tests are way harder.
         | 
         | This said, now I'm curious how SoTA LLMs would do on something
         | like WAIS-IV.
        
         | randyrand wrote:
         | I'll sound like a total douche bag - but I thought they were
         | incredibly obvious - which I think is the point of them.
         | 
         | What took me longer was figuring out how the question was
         | arranged, i.e. left input, right output, 3 examples each
        
       | airstrike wrote:
       | Uhh...some of us are apparently living under a rock, as this is
       | the first time I hear about o3 and I'm on HN far too much every
       | day
        
         | burningion wrote:
         | I think it was just announced today! You're fine!
        
       | cryptoegorophy wrote:
       | Besides higher scores - are there any improvements for
       | general use? Like asking it to help set up Home Assistant,
       | etc.?
        
       | rvz wrote:
       | Great results. However, let's all just admit it.
       | 
       | It has already replaced journalists and artists, and it's on
       | its way to replacing both junior and senior engineers. The
       | ultimate intention of "AGI" is to replace tens of millions of
       | jobs. That is it, and you know it.
       | 
       | It will only accelerate, and we need to stop pretending and
       | coping. Instead let's discuss solutions for those lost jobs.
       | 
       | So what is the replacement for these lost jobs? (It is not
       | UBI or "better jobs" without defining them.)
        
         | neom wrote:
         | Do you follow Jack Clark? I noticed he's been on the road a lot
         | talking to governments and policy makers, and not just in the
         | "AI is coming" way he used to talk.
        
         | whynotminot wrote:
         | When none of us have jobs or income, there will be no ability
         | for us to buy products. And then no reason for companies to buy
         | ads to sell products to people who don't have money. Without ad
         | money (or the potential of future ad money), the people pushing
         | the bounds of AGI into work replacement will lose the very
         | income streams powering this research and their valuations.
         | 
         | Ford didn't support a 40 hour work week out of the kindness of
         | his heart. He wanted his workers to have time off for buying
         | things (like his cars).
         | 
         | I wonder if our AGI industrialist overlords will do something
         | similar for revenue sharing or UBI.
        
           | whimsicalism wrote:
           | This picture doesn't make sense. If most don't have any money
           | to buy products, just invent some other money and start
           | paying one of the other people who doesn't have any money to
           | start making the products for you.
           | 
           | In reality, if there really is mass unemployment, AI driven
           | automation will make consumables so cheap that anyone will be
           | able to buy it.
        
             | whynotminot wrote:
             | > This picture doesn't make sense. If most don't have any
             | money to buy products, just invent some other money and
             | start paying one of the other people who doesn't have any
             | money to start making the products for you.
             | 
             | Uh, this picture doesn't make sense. Why would anyone value
             | this randomly invented money?
        
               | whimsicalism wrote:
               | > Why would anyone value this randomly invented money?
               | 
               | Because they can use it to pay for goods?
               | 
               | Your notion is that almost everyone is going to be out of
               | a job and thus have nothing. Okay, so I'm one of those
               | people and I need this house built. But I'm not making
               | any money because of AI or whatever. Maybe someone else
               | needs someone to drive their aging relative around and
               | they're a good builder.
               | 
               | If 1. neither of those people have jobs or income because
               | of AI 2. AI isn't provisioning services for basically
               | free,
               | 
               | then it makes sense for them to do an exchange of labor -
               | even with AI (if that AI is not providing services to
               | everyone). The original reason for having money and
               | exchanging it still exists.
        
               | whynotminot wrote:
               | Honestly I don't even know how to engage with your point.
               | 
               | Yes if we recreate society some form of money would
               | likely emerge.
        
               | neom wrote:
               | Didn't money basically only emerge to deal with the
               | difficulty of the "double coincidence of wants"?
               | Money simply solved the problem of making all forms
               | of value interchangeable and transportable across
               | time AND circumstance. A dollar can do this with or
               | without AI existing, no?
        
               | whimsicalism wrote:
               | Yes, that's my point
        
               | staticman2 wrote:
               | You seem to be arguing that large unemployment rates are
               | logically impossible, so we shouldn't worry about
               | unemployment.
               | 
               | The fact that unemployment was 25% during the Great
               | Depression would seem to suggest that, at a minimum,
               | a 25% unemployment rate is possible during a
               | disruptive event.
        
             | tivert wrote:
             | > This picture doesn't make sense. If most don't have any
             | money to buy products, just invent some other money and
             | start paying one of the other people who doesn't have any
             | money to start making the products for you.
             | 
             | Ultimately, it all comes down to raw materials and similar
             | resources, _and all those will be claimed by people with
             | lots of real money_. Your  "invented ... other money" will
             | be useless to buy that fundamental stuff. At best, it will
             | be useful for trading scrap and other junk among the
             | unemployed.
             | 
             | > In reality, if there really is mass unemployment, AI
             | driven automation will make consumables so cheap that
             | anyone will be able to buy it.
             | 
             | No. Why would the people who own that automation want to
             | waste their resources producing consumer goods for people
             | with nothing to give them in return?
        
           | tivert wrote:
           | > When none of us have jobs or income, there will be no
           | ability for us to buy products. And then no reason for
           | companies to buy ads to sell products to people who don't
           | have money. Without ad money (or the potential of future ad
           | money), the people pushing the bounds of AGI into work
           | replacement will lose the very income streams powering this
           | research and their valuations.
           | 
           | I don't think so. I agree the push for AGI will kill the
           | modern consumer product economy, but I think it's quite
           | possible for the economy to evolve into a new form (one
           | that will probably be terrible for most humans) that
           | keeps pushing "work replacement."
           | 
           | Imagine an AGI billionaire buying up land, mines, and power
           | plants as the consumer economy dies, then shifting those
           | resources away from the consumer economy into self-
           | aggrandizing pet projects (e.g. ziggurats, penthouses on
           | Mars, space yachts, life extension, and stuff like that). He
           | might still employ a small community of servants, AGI
           | researchers, and other specialists; but all the rest of the
           | population will be irrelevant to him.
           | 
           | And individual autarky probably isn't necessary;
           | consumption will be redirected towards the massive pet
           | projects I mentioned, with vestigial markets for power,
           | minerals, etc.
        
         | RivieraKid wrote:
         | The economic theory answer is that people simply switch to jobs
         | that are not yet replaceable by AI. Doctors, nurses,
         | electricians, construction workers, police officers, etc.
         | People in aggregate will produce more, consume more and work
         | less.
        
         | drdaeman wrote:
         | > It has already replaced journalists and artists, and it's
         | > on its way to replacing both junior and senior engineers.
         | 
         | Did it, really? Or did it just provide automation for routine
         | no-thinking-necessary text-writing tasks, but is still
         | ultimately completely bound by the level of human operator's
         | intelligence? I strongly suspect it's the latter. If it has
         | actually replaced journalists, it must be at junk outlets,
         | where readers' intelligence is negligible and anything goes.
         | 
         | Just yesterday I used o1 and Claude 3.5 to debug a Linux
         | kernel issue (ultimately, a bad DSDT table left the TPM2
         | driver unable to reserve the memory region for its command
         | response buffer; the solution was to use memmap to strip
         | the NVS flag from the relevant regions) and confirmed once
         | again that LLMs still don't reason at all - they just spew
         | out plausible-looking chains of words. The models were good
         | listeners and mostly-helpful code generators (when they
         | didn't make the silliest mistakes), but they showed no
         | traces of understanding and no attention to nuance (e.g.
         | the LLM used `IS_ERR` to check the `__request_resource`
         | result, despite me giving it the full source code for that
         | function, where there's even a comment making it obvious it
         | returns a pointer or NULL, not an error code - a misguided-
         | attention kind of mistake).
         | 
         | So, in my opinion, LLMs (as currently available to the
         | broad public, like myself) are useful for automating away some
         | routine stuff, but their usefulness is bounded by the
         | operator's knowledge and intelligence. And that means that the
         | actual jobs (if they require thinking and not just writing
         | words) are safe.
         | 
         | When asked about what I do at work, I used to joke that I just
         | press buttons on my keyboard in fancy patterns. Ultimately,
         | LLMs seem to suggest that it's not what I really do.
        
       | mensetmanusman wrote:
       | I'm super curious as to whether this technology completely
       | destroys the middle class, or if everyone becomes better off
       | because productivity is going to skyrocket.
        
         | mhogers wrote:
         | Is anyone here aware of the latest research that tries to
         | predict the outcome? Please share - super curious as well
        
           | te_chris wrote:
           | There's this https://arxiv.org/pdf/2312.05481v9
        
           | pdfernhout wrote:
           | Some thoughts I put together on all this circa 2010:
           | https://pdfernhout.net/beyond-a-jobless-recovery-knol.html
           | "This article explores the issue of a "Jobless Recovery"
           | mainly from a heterodox economic perspective. It emphasizes
           | the implications of ideas by Marshall Brain and others that
           | improvements in robotics, automation, design, and voluntary
           | social networks are fundamentally changing the structure of
           | the economic landscape. It outlines towards the end four
           | major alternatives to mainstream economic practice (a basic
           | income, a gift economy, stronger local subsistence economies,
           | and resource-based planning). These alternatives could be
           | used in combination to address what, even as far back as
           | 1964, has been described as a breaking "income-through-jobs
           | link". This link between jobs and income is breaking because
           | of the declining value of most paid human labor relative to
           | capital investments in automation and better design. Or, as
           | is now the case, the value of paid human labor like at some
           | newspapers or universities is also declining relative to the
           | output of voluntary social networks such as for digital
           | content production (like represented by this document). It is
           | suggested that we will need to fundamentally reevaluate our
           | economic theories and practices to adjust to these new
           | realities emerging from exponential trends in technology and
           | society."
        
         | tivert wrote:
         | > I'm super curious as to whether this technology completely
         | destroys the middle class, or if everyone becomes better off
         | because productivity is going to skyrocket.
         | 
         | Even if productivity skyrockets, why would anyone assume the
         | dividends would be shared with the "destroy[ed] middle class"?
         | 
           | All indications are that this will end up like the China Shock:
         | "I lost my middle class job, and all I got was the opportunity
         | to buy flimsy pieces of crap from a dollar store." America
         | lacks the ideological foundations for any other result, and the
         | coming economic changes will likely make building those
         | foundations even more difficult if not impossible.
        
           | rohan_ wrote:
           | Because access to the financial system was democratized ten
           | years ago
        
       | croemer wrote:
       | The programming task they gave o3-mini high (creating a
       | Python server that lets you chat with the OpenAI API and run
       | some code in a terminal) didn't seem very hard? Strange
       | choice of example for something that's claimed to be a big
       | step forward.
       | 
       | YT timestamped link:
       | https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for
       | the fixed link @photonboom)
       | 
       | Updated: I gave the task to Claude 3.5 Sonnet and it worked first
       | shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-
       | faa5aa...
        
         | bearjaws wrote:
         | It's good that it works, since if you ask GPT-4o to use the
         | OpenAI SDK it will often produce invalid and out-of-date
         | code.
        
         | m3kw9 wrote:
         | I would say they didn't need to demo anything, because if
         | you're going to use the output code live in a demo, it may
         | produce compile errors, and then they'd look stupid trying
         | to fix it live.
        
           | croemer wrote:
           | If it was a safe-bet problem, then they should have said
           | that. To me it looks like they faked excitement for
           | something not exciting, which lowers the credibility of
           | the whole presentation.
        
           | sunaookami wrote:
           | They actually did that the last time when they showed the
           | apps integration. First try in Xcode didn't work.
        
             | m3kw9 wrote:
             | Yeah I think that time it was ok because they were demoing
             | the app function, but for this they are demoing the model
             | smarts
        
         | photonboom wrote:
         | Here's the right timestamp:
         | https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s
        
         | phil917 wrote:
         | Yeah I agree that wasn't particularly mind blowing to me and
         | seems fairly in line with what existing SOTA models can do.
         | Especially since they did it in steps. Maybe I'm missing
         | something.
        
         | MyFirstSass wrote:
         | What? Is this what this is? Either this is a complete joke or
         | we're missing something.
         | 
         | I've been doing similar stuff in Claude for months and it's not
         | that impressive when you see how limited they really are.
        
       | tripletao wrote:
       | Their discussion contains an interesting aside:
       | 
       | > Moreover, ARC-AGI-1 is now saturating - besides o3's new score,
       | the fact is that a large ensemble of low-compute Kaggle solutions
       | can now score 81% on the private eval.
       | 
       | So while these tasks get greatest interest as a benchmark for
       | LLMs and other large general models, it doesn't yet seem obvious
       | those outperform human-designed domain-specific approaches.
       | 
       | I wonder to what extent the large improvement comes from OpenAI
       | training deliberately targeting this class of problem. That
       | result would still be significant (since there's no way to
       | overfit to the private tasks), but would be different from an
       | "accidental" emergent improvement.
        
       | Bjorkbat wrote:
       | I was impressed until I read the caveat about the high-compute
       | version using 172x more compute.
       | 
       | Assuming for a moment that cost per task scales linearly
       | with compute, it costs a little more than $1 million to get
       | that score on the public eval.
       | 
       | The results are cool, but man, this sounds like such a busted
       | approach.
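       | 
       | Back-of-the-envelope (a sketch; the low-compute total is the
       | figure I take from TFA's table, and linear scaling is an
       | assumption):
       | 
       |     low_total = 6677   # reported cost, 400-task public eval
       |     multiplier = 172   # extra compute in the high config
       |     print(low_total * multiplier)  # ~1.15M USD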
        
         | futureshock wrote:
         | So what? I'm serious. Our current level of progress would have
         | been sci-fi fantasy with the computers we had in 2000. The cost
         | may be astronomical today, but we have proven a method to
         | achieve human performance on tests of reasoning over novel
         | problems. WOW. Who cares what it costs. In 25 years it will run
         | on your phone.
        
           | Bjorkbat wrote:
           | It's not so much the cost as the fact that they got a
           | slightly better result by throwing 172x more compute at
           | each task. The fact that it may have cost somewhere north
           | of $1 million simply helps to give a better idea of how
           | absurd the approach is.
           | 
           | It feels a lot less like the breakthrough when the solution
           | looks so much like simply brute-forcing.
           | 
           | But you might be right, who cares? Does it really matter how
           | crude the solution is if we can achieve true AGI and bring
           | the cost down by increasing the efficiency of compute?
        
             | futureshock wrote:
             | "Simply brute-forcing"
             | 
             | That's the thing that's interesting to me though and I had
             | the same first reaction. It's a very different problem than
             | brute-forcing chess. It has one chance to come to the
             | correct answer. Running through thousands or millions of
             | options means nothing if the model can't determine which is
             | correct. And each of these visual problems involve
             | combinations of different interacting concepts. To solve
             | them requires understanding, not mimicry. So no matter how
             | inefficient and "stupid" these models are, they can be said
             | to understand these novel problems. That's a direct counter
             | to everyone who ever called these a stochastic parrot and
             | said they were a dead-end to AGI that was only searching an
              | in-distribution training set.
             | 
              | The compute costs are currently disappointing, but so
              | was the cost of sequencing the first whole human
              | genome. That went from $3 billion to a few hundred
              | bucks from your local doctor.
        
           | radioactivist wrote:
           | So your claim for optimism here is that something today that
           | took ~10^22 floating point operations (based on an estimate
           | earlier in the thread) to execute will be running on phones
           | in 25 years? Phones which are currently running at O(10^12)
           | flops. That means ten orders of magnitude of improvement
           | for that to run in a reasonable amount of time? It's a
           | similar scale-up to going from ENIAC (500 flops) to a
           | modern desktop (5-10 teraflops).
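           | 
           | The arithmetic (a sketch; both inputs are rough estimates
           | from this thread):
           | 
           |     task_flops = 1e22   # per-task estimate from upthread
           |     phone_flops = 1e12  # phone throughput, per second
           |     seconds = task_flops / phone_flops
           |     print(seconds / 3.15e7)  # ~317 years on a phone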
        
             | futureshock wrote:
             | That sounds reasonable to me because the compute cost for
             | this level of reasoning performance won't stay at 10^22 and
             | phones won't stay at 10^12. This reasoning breakthrough is
             | about 3 months old.
        
               | radioactivist wrote:
               | I think expecting five _orders of magnitude_ improvement
               | from either side of this (inference cost or phone
               | performance) is insane.
        
       | onemetwo wrote:
       | In (1) the author uses a technique to improve the performance
       | of an LLM: he trained Sonnet 3.5 to obtain 53.6% on the ARC-
       | AGI-Pub benchmark, and he said that more compute would give
       | better results. So the o3 results could perhaps be produced
       | the same way, using the same method with more compute; if
       | that's the case, the o3 result is not very interesting.
       | 
       | (1) https://params.com/@jeremy-berman/arc-agi
        
       | TypicalHog wrote:
       | This is actually mindblowing!
        
       | blixt wrote:
       | These results are fantastic. Claude 3.5 and o1 are already good
       | enough to provide value, so I can't wait to see how o3 performs
       | comparatively in real-world scenarios.
       | 
       | But I gotta say, we must be saturating just about any zero-shot
       | reasoning benchmark imaginable at this point. And we will still
       | argue about whether this is AGI, in my opinion because these LLMs
       | are forgetful and it's very difficult for an application
       | developer to fix that.
       | 
       | Models will need better ways to remember and learn from doing a
       | task over and over. For example, let's look at code agents: the
       | best we can do, even with o3, is to cram as much of the code base
       | as we can fit into a context window. And if it doesn't fit we
       | branch out to multiple models to prune the context window until
       | it does fit. And here's the kicker - the second time you ask
       | it to do something, this all starts over from zero again. With
       | this amount of reasoning power, I'm hoping session-based learning
       | becomes the next frontier for LLM capabilities.
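       | 
       | Concretely, today's "best we can do" looks something like a
       | greedy context packer (a toy sketch; `relevance` stands in
       | for whatever pruning models you branch out to):
       | 
       |     def pack_context(files, budget, relevance):
       |         # Greedily keep the most relevant files that fit
       |         # the window; nothing persists between calls, which
       |         # is exactly the problem.
       |         chosen, used = [], 0
       |         ranked = sorted(files.items(),
       |                         key=lambda kv: relevance(kv[1]),
       |                         reverse=True)
       |         for name, text in ranked:
       |             if used + len(text) <= budget:
       |                 chosen.append(name)
       |                 used += len(text)
       |         return chosen
       | 
       |     files = {"main.py": "x" * 50, "util.py": "y" * 30,
       |              "big.py": "z" * 90}
       |     print(pack_context(files, budget=100, relevance=len))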
       | 
       | (There are already things like tool use, linear attention, RAG,
       | etc that can help here but currently they come with downsides and
       | I would consider them insufficient.)
        
       | vessenes wrote:
       | This feels like big news to me.
       | 
       | First of all, ARC is definitely an intelligence test for autistic
       | people. I say as someone with a tad of the neurodiversity. That
       | said, I think it's a pretty interesting one, not least because as
       | you go up in the levels, it requires (for a human) a fair amount
       | of lateral thinking and analogy-type thinking, and of course, it
       | requires that this go in and out of visual representation. That
       | said, I think it's a bit funny that most of the people training
       | these next-gen AIs are neurodiverse and we are training the AI in
       | our own image. I continue to hope for some poet and painter-
       | derived intelligence tests to be added to the next gen tests we
       | all look at and score.
       | 
       | For those reasons, I've always really liked ARC as a test -- not
       | as some be-all end-all for AGI, but just because I think that the
       | most intriguing areas next for LLMs are in these analogy arenas
       | and ability to hold more cross-domain context together for
       | reasoning and etc.
       | 
       | Prompts that are interesting to play with right now on these
       | terms range from asking multimodal models to say count to ten in
       | a Boston accent, and then propose a regional French accent
       | that's an equivalent and count to ten in that. (To my ear, 4o
       | is unconvincing at this.) Similar in my mind is writing and
       | architecting code that crosses multiple languages and APIs,
       | and asking for it to be written in different styles. (Claude
       | and o1-pro are... okay at this, depending.)
       | 
       | Anyway. I agree that this looks like a large step change. I'm not
       | sure if the o3 methods here involve the spinning up of clusters
       | of python interpreters to breadth-search for solutions -- a
       | method used to make headway on ARC in the past; if so, this is
       | still big, but I think less exciting than if the stack is close
       | to what we know today, and the compute time is just more
       | introspection / internal beam search type algorithms.
       | 
       | Either way, something had to assess answers and think they were
       | right, and this is a HUGE step forward.
        
         | jamiek88 wrote:
         | > most of the people training these next-gen AIs are
         | neurodiverse
         | 
         | Citation needed. This is a huge claim based only on stereotype.
        
           | vessenes wrote:
           | So true. Perhaps I'm just thinking it's my people and need to
           | update my priors.
        
         | getpost wrote:
         | > most of the people training these next-gen AIs are
         | neurodiverse and we are training the AI in our own image
         | 
         | Do you have any evidence to support that? It would be
         | fascinating if the field is primarily advancing due to a unique
         | constellation of traits contributed by individuals who, in the
         | past, may not have collaborated so effectively.
        
           | vessenes wrote:
           | PURELY Anecdotal. But I'll say that as of 2024 1 in 36 US
           | children are diagnosed on the spectrum according to the
           | CDC(!), which would mean if you met 10 AI researchers and 4
           | were neurodivergent you'd reasonably expect that it's a
           | higher-than-population average representation. I'm polling
           | from the Effective Altruist AI folks in my mind, and the
           | number is definitely, definitely higher than 4/10.
        
             | EVa5I7bHFq9mnYK wrote:
             | Are there non-Effective Altruist AI folks?
        
               | vessenes wrote:
               | I love how this might mean "non-Effective",
               | non-"Effective Altruist" or non-"Effective Altruist AI"
               | folks.
               | 
               | Yes
        
       | nopinsight wrote:
       | Let me go against some skeptics and explain why I think full o3
       | is pretty much AGI or at least embodies most essential aspects of
       | AGI.
       | 
       | What has been lacking so far in frontier LLMs is the ability to
       | reliably deal with the right level of abstraction for a given
       | problem. Reasoning is useful but often comes out lacking if one
       | cannot reason at the right level of abstraction. (Note that many
       | humans can't either when they deal with unfamiliar domains,
       | although that is not the case with these models.)
       | 
       | ARC has been challenging precisely because solving its
       | problems often requires:
       | 
       | 1) using multiple different *kinds* of core knowledge [1],
       | such as symmetry, counting, color, AND
       | 
       | 2) using the right level(s) of abstraction
       | 
       | Achieving human-level performance in the ARC benchmark, _as well
       | as_ top human performance in GPQA, Codeforces, AIME, and Frontier
       | Math suggests the model can potentially solve any problem at the
       | human level if it possesses essential knowledge about it. Yes,
       | this includes out-of-distribution problems that most humans can
       | solve.
       | 
       | It might not _yet_ be able to generate highly novel theories,
       | frameworks, or artifacts to the degree that Einstein,
       | Grothendieck, or van Gogh could. But not many humans can either.
       | 
       | [1] https://www.harvardlds.org/wp-
       | content/uploads/2017/01/Spelke...
       | 
       | ADDED:
       | 
       | Thanks to lswainemoore below for the link to Chollet's posts. I've
       | analyzed some easy problems that o3 failed at. They involve
       | spatial intelligence, including connection and movement. This
       | skill is very hard to learn from textual and still image data.
       | 
       | I believe this sort of core knowledge is learnable through
       | movement and interaction data in a simulated world and it will
       | _not_ present a very difficult barrier to cross. (OpenAI
       | purchased a company behind a Minecraft clone a while ago. I've
       | wondered if this is the purpose.)
        
         | xvector wrote:
         | Agree. AGI is here. I feel such a sense of pride in our
         | species.
        
         | timabdulla wrote:
         | What's your explanation for why it can only get ~70% on SWE-
         | bench Verified?
         | 
         | I believe about 90% of the tasks were estimated by humans to
         | take less than one hour to solve, so we aren't talking about
         | very complex problems, and to boot, the contamination factor is
         | huge: o3 (or any big model) will have in-depth knowledge of the
         | internals of these projects, and often even know about the
         | individual issues themselves (e.g. you can ask what GitHub
         | issue #4145 in project foo was, and there's a decent chance
         | it can tell you exactly what the issue was about!)
        
           | slewis wrote:
           | I've spent tons of time evaluating o1-preview on SWEBench-
           | Verified.
           | 
           | For one, I speculate OpenAI is using a very basic agent
           | harness to get the results they've published on SWEBench. I
           | believe there is a fair amount of headroom to improve results
           | above what they published, using the same models.
           | 
           | For two, some of the instances, even in SWEBench-Verified,
           | require a bit of "going above and beyond" to get right. One
           | example is an instance where the user states that a TypeError
           | isn't properly handled. The developer who fixed it handled
           | the TypeError but also handled a ValueError, and the golden
           | test checks for both. I don't know how many instances fall in
            | this category, but I suspect it's more than on a simpler
           | benchmark like MATH.
        
           | nopinsight wrote:
           | One possibility is that it may not yet have sufficient
           | _experience and real-world feedback_ for resolving coding
           | issues in professional repos, as this involves multiple steps
           | and very diverse actions (or branching factor, in AI terms).
           | They have committed to not training on API usage, which
           | limits their ability to directly acquire training data from
           | it. However, their upcoming agentic efforts may address this
           | gap in training data.
        
             | timabdulla wrote:
             | Right, but the branching factor increases exponentially
             | with the scope of the work.
             | 
             | I think it's obvious that they've cracked the formula for
             | solving well-defined, small-in-scope problems at a
             | superhuman level. That's an amazing thing.
             | 
             | To me, it's less obvious that this implies that they will
             | in short order with just more training data be able to
             | solve ambiguous, large-in-scope problems at even just a
             | skilled human level.
             | 
             | There are far more paths to consider, much more context to
             | use, and in an RL setting, the rewards are much more
             | ambiguously defined.
        
               | nopinsight wrote:
               | Their reasoning models can learn from procedures and
               | methods, which generalize far better than data. Software
               | tasks are diverse but most tasks are still fairly limited
               | in scope. Novel tasks might remain challenging for these
               | models, as they do for humans.
               | 
               | That said, o3 might still lack some kind of interaction
               | intelligence that's hard to learn. We'll see.
        
         | Imnimo wrote:
         | >Achieving human-level performance in the ARC benchmark, as
         | well as top human performance in GPQA, Codeforce, AIME, and
         | Frontier Math strongly suggests the model can potentially solve
         | any problem at the human level if it possesses essential
         | knowledge about it.
         | 
         | The article notes, "o3 still fails on some very easy tasks".
         | What explains these failures if o3 can solve "any problem" at
         | the human level? Do these failed cases require some essential
         | knowledge that has eluded the massive OpenAI training set?
        
           | nopinsight wrote:
           | Great point. I'd love to see what these easy tasks are and
           | would be happy to revise my hypothesis accordingly. o3's
           | intelligence is unlikely to be a strict superset of human
           | intelligence. It is certainly superior to humans in some
           | respects and probably inferior in others. Whether it's
           | sufficiently generally intelligent would be both a matter of
           | definition and empirical fact.
        
             | Imnimo wrote:
             | Chollet has a few examples here:
             | 
             | https://x.com/fchollet/status/1870172872641261979
             | 
             | https://x.com/fchollet/status/1870173137234727219
             | 
             | I would definitely consider them legitimately easy for
             | humans.
        
               | nopinsight wrote:
               | Thanks! I added some comments on this at the bottom of
               | the post above.
        
         | phil917 wrote:
         | Quote from the creators of the ARC-AGI benchmark: "Passing
         | ARC-AGI does not equate to achieving AGI, and, as a matter
         | of fact, I don't think o3 is AGI yet. o3 still fails on
         | some very easy tasks, indicating fundamental differences
         | with human intelligence."
        
           | CooCooCaCha wrote:
            | Yeah, the real goalpost is _reliable_ intelligence. A
            | supposed PhD-level AI failing simple problems is a red
            | flag that we're still missing something.
        
             | gremlinsinc wrote:
              | You've never met a doctor who couldn't figure out how
              | to work their email? Or use street smarts? You can
              | have a PhD but be unable to reliably handle soft
              | skills, or any number of things you might 'expect'
              | someone to be able to do.
              | 
              | Just playing devil's advocate or nitpicking the
              | language a bit...
        
               | CooCooCaCha wrote:
               | An important distinction here is you're comparing skill
               | across very different tasks.
               | 
               | I'm not even going that far, I'm talking about
               | performance on similar tasks. Something many people have
               | noticed about modern AI is it can go from genius to baby-
               | level performance seemingly at random.
               | 
               | Take self-driving cars, for example: a reasonably
               | intelligent human of sound mind and body would never
               | accidentally mistake a concrete pillar for a road.
               | Yet that happens with self-driving cars, and
               | seemingly here with ARC-AGI problems, which all have
               | a similar flavor.
        
               | nuancebydefault wrote:
               | A coworker of mine has a PhD in physics. Showing him
               | the difference between little- and big-endian in a
               | hex editor, showing file sizes of raw image files and
               | how to compute them... I explained it 3 times, and
               | maybe he understood part of it now.
        
           | nopinsight wrote:
           | I'd need to see what kinds of easy tasks those are and would
           | be happy to revise my hypothesis if that's warranted.
           | 
           | Also, it depends a great deal on what we define as AGI and
           | whether they need to be a strict superset of typical human
           | intelligence. o3's intelligence is probably superhuman in
           | some aspects but inferior in others. We can find many humans
           | who exhibit such tendencies as well. We'd probably say they
           | think differently but would still call them generally
           | intelligent.
        
             | lswainemoore wrote:
             | They're in the original post. Also here:
             | https://x.com/fchollet/status/1870172872641261979 /
             | https://x.com/fchollet/status/1870173137234727219
             | 
             | Personally, I think it's fair to call them "very easy". If
             | a person I otherwise thought was intelligent was unable to
             | solve these, I'd be quite surprised.
        
               | nopinsight wrote:
               | Thanks! I've analyzed some easy problems that o3 failed
               | at. They involve spatial intelligence including
               | connection and movement. This skill is very hard to learn
               | from textual and still image data.
               | 
               | I believe this sort of core knowledge is learnable
               | through movement and interaction data in a simulated
               | world and it will not present a very difficult barrier to
               | cross.
               | 
               | (OpenAI purchased a company behind a Minecraft clone a
               | while ago. I've wondered if this is the purpose.)
        
               | lswainemoore wrote:
               | > I believe this sort of core knowledge is learnable
               | through movement and interaction data in a simulated
               | world and it will not present a very difficult barrier to
               | cross.
               | 
               | Maybe! I suppose time will tell. That said, spatial
               | intelligence (connection/movement included) is the whole
               | game in this evaluation set. I think it's revealing that
               | they can't handle these particular examples, and
               | problematic for claims of AGI.
        
           | 93po wrote:
            | They say it isn't AGI, but I think the way o3 functions
            | can be refined into AGI - it's learning to solve new,
            | novel problems. We just need to make it do that more
            | consistently, which seems achievable.
        
         | nyrikki wrote:
         | GPQA scores are mostly from pre-training, against content in
         | the corpus. They have gone silent, but look at the GPT-4
         | technical report, which calls this out.
         | 
         | We are nowhere close to what Sam Altman calls AGI and
         | transformers are still limited to what uniform-TC0 can do.
         | 
         | As an example the Boolean Formula Value Problem is
         | NC1-complete, thus beyond transformers but trivial to solve
         | with a TM.
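         | 
         | For the curious: "trivial with a TM" here means a linear-
         | time recursive evaluator, sketched below with formulas as
         | nested tuples (my toy encoding), while a constant-depth
         | circuit can't do this for arbitrary formula sizes unless
         | TC0 = NC1:
         | 
         |     def evaluate(f):
         |         # Boolean Formula Value Problem: linear time on a
         |         # sequential machine, despite being NC1-complete.
         |         if isinstance(f, bool):
         |             return f
         |         if f[0] == 'not':
         |             return not evaluate(f[1])
         |         a, b = evaluate(f[1]), evaluate(f[2])
         |         return (a and b) if f[0] == 'and' else (a or b)
         | 
         |     print(evaluate(('and', True,
         |                     ('or', False, ('not', False)))))  # True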
         | 
         | As it is now proven that the frame problem is equivalent to the
         | halting problem, even if we can move past uniform-TC0 limits,
         | novelty is still a problem.
         | 
         | I think the advancements are truly extraordinary, but unless
         | you set the bar very low, we aren't close to AGI.
         | 
         | Heck we aren't close to P with commercial models.
        
           | sebzim4500 wrote:
           | Isn't any physically realizable computer (including our
           | brains) limited to what uniform-TC0 can do?
        
             | drdeca wrote:
             | Do you just mean because any physically realizable computer
             | is a finite state machine? Or...?
             | 
             | I wouldn't describe a computer's usual behavior as having
             | constant depth.
             | 
             | It is fairly typical to talk about problems in P as being
             | feasible (though when the constant factors are too big,
             | this isn't strictly true of course).
             | 
             | Just because for unreasonably large inputs, my computer
             | can't run a particular program and produce the correct
             | answer for that input, due to my computer running out of
             | memory, we don't generally say that my computer is
             | fundamentally incapable of executing that algorithm.
        
             | nyrikki wrote:
              | Neither TC0 nor uniform-TC0 is physically realizable;
              | they are tools, not physical devices.
             | 
              | The default nonuniform circuit classes are allowed to
              | have a different circuit per input size, and the
              | uniform types have unbounded fan-in.
             | 
             | Similar to how a k-tape TM doesn't get 'charged' for the
             | input size.
             | 
              | With Nick's Class (NC), the number of components is
              | analogous to traditional compute time, while depth
              | relates to the ability to parallelize operations.
             | 
             | These are different than biological neurons, not better or
             | worse but just different.
             | 
             | Human neurons can use dendritic compartmentalization, use
             | spike timing, can retime spikes etc...
             | 
             | While the perceptron model we use in ML is useful, it is
              | not able to do XOR in one layer, while biological neurons
             | do that without anything even reaching the soma, purely in
             | the dendrites.
             | 
              | Statistical learning models still come down to a
              | choice function, no matter if you call that set
              | shattering or...
             | 
             | With physical computers the time hierarchy does apply and
             | if TIME(g(n)) is given more time than TIME(f(n)), g(n) can
             | solve more problems.
             | 
              | So you can simulate an NTM via exhaustive search on a
              | physical computer.
             | 
             | Physical computers also tend to have NAND and XOR gates,
             | and can have different circuit depths.
             | 
             | When you are in TC0, you only have AND, OR and Threshold
             | (or majority) gates.
             | 
             | Think of instruction level parallelism in a typical CPU, it
             | can return early, vs Itanium EPIC, which had to wait for
             | the longest operation. Predicated execution is also how
             | GPUs work.
             | 
              | They can send a mask and save on load/store ops, as an
              | example, but the cost of that parallelism is the
              | constant depth.
             | 
             | It is the parallelism tradeoff that both makes transformers
             | practical as well as limit what they can do.
             | 
              | The IID assumption, and autograd requiring smooth
              | manifolds, play a role too.
             | 
              | The frame problem, which causes hard problems to
              | become unsolvable for computers and people alike, does
              | also.
             | 
             | But the fact that we have polynomial time solutions for the
             | Boolean Formula Value Problem, as mentioned in my post
              | above, is probably a simpler way of realizing physical
             | computers aren't limited to uniform-TC0.
        
         | norir wrote:
         | Personally I find "human-level" to be a borderline meaningless
         | and limiting term. Are we now superhuman as a species relative
         | to ourselves just five years ago because of our advances in
         | developing computer programs that better imitate what many (but
         | far from all) of us were already capable of doing? Have we
         | reached a limit to human potential that can only be surpassed
         | by digital machines? Who decides what human level is and when
         | we have surpassed it? I have seen some ridiculous claims about
         | ai in art that don't stand up to even the slightest scrutiny by
         | domain experts but that easily fool the masses.
        
         | PaulDavisThe1st wrote:
         | > It might not yet be able to generate highly novel theories,
         | frameworks, or artifacts to the degree that Einstein,
         | Grothendieck, or van Gogh could.
         | 
         | Every human does this dozens, hundreds or thousands of times
         | ... during childhood.
        
         | ec109685 wrote:
         | The problem with ARC is that there are a finite number of
         | heuristics that could be enumerated and trained for, which
         | would give a model a substantial leg up on this evaluation,
         | but not generalize to other domains.
         | 
         | For example, if they produce millions of examples of the type
         | of problems o3 still struggles on, it would probably do better
         | at similar questions.
         | 
         | Perhaps the private data set is different enough that this
         | isn't a problem, but the ideal situation would be unveiling
         | a truly novel dataset, which it seems like ARC aims to do.
        
       | CliveBloomers wrote:
       | Another meaningless benchmark, another month--it's like clockwork
       | at this point. No one's going to remember this in a month; it's
       | just noise. The real test? It's not in these flashy metrics or
       | minor improvements. The only thing that actually matters is how
       | fast it can wipe out the layers of middle management and all
       | those pointless, bureaucratic jobs that add zero value.
       | 
       | That's the true litmus test. Everything else? It's just fine-
       | tuning weights, playing around the edges. Until it starts cutting
       | through the fat and reshaping how organizations really operate,
       | all of this is just more of the same.
        
         | handfuloflight wrote:
         | Agreed, but isn't it management who decides whether this
         | gets implemented? Are they going to propagate their own
         | removal?
        
           | zamadatix wrote:
           | Middle manager types are probably interested in their salary
           | performance more than anything. "Real" management (more of
           | their assets come from their ownership of the company than a
           | salary) will override them if it's truthfully the best
           | performing operating model for the company.
        
         | oytis wrote:
         | So far AI market seems to be focused on replacing meaningful
         | jobs, meaningless ones look safe (which kind of makes sense if
         | you think about it).
        
       | 6gvONxR4sf7o wrote:
       | I'm glad these stats show a better estimate of human ability than
       | just the average mturker. The graph here has the average mturker
       | performance as well as a STEM grad measurement. Stuff like that
       | is why we're always feeling weird that these things supposedly
       | outperform humans while still sucking. I'm glad to see 'human
       | performance' benchmarked with more variety (attention, time,
       | education, etc).
        
       | RivieraKid wrote:
       | It sucks that I would love to be excited about this... but I
       | mostly feel anxiety and sadness.
        
         | xvector wrote:
         | Humanity is about to enter an even steeper hockey stick growth
         | curve. Progressing along the Kardashev scale feels all but
         | inevitable. We will live to see Longevity Escape Velocity. I'm
         | fucking pumped and feel thrilled and excited and proud of our
         | species.
         | 
         | Sure, there will be growing pains, friction, etc. Who cares?
         | There always is with world-changing tech. Always.
        
           | drcode wrote:
           | longevity for the AIs
        
           | tokioyoyo wrote:
           | My job should be secure for a while, but why would an average
           | person give a damn about humanity when they might lose their
           | jobs and comfort levels? If I had kids, I would absolutely
           | hate this uncertainty as well.
           | 
           | "Oh well, I guess I can't give the opportunities to my kid
           | that I wanted, but at least humanity is growing rapidly!"
        
             | xvector wrote:
             | > when they might lose their jobs and comfort levels?
             | 
             | Everyone has always worried about this for every major
             | technology throughout history
             | 
              | IMO AGI will dramatically increase comfort levels and
              | lower your chance of death, disease, etc.
        
               | tokioyoyo wrote:
               | Again, sure, but it doesn't matter to an average person.
               | That's too much focus on the hypothetical future. People
               | care about the current times. In the short term it will
               | suck for a good chunk of people, and whether the
               | sacrifice is worth it will depend on who you are.
               | 
                | People aren't really in uproar yet, because
                | implementations haven't affected the job market of the
                | masses. Afterwards? Time will tell.
        
               | xvector wrote:
               | Yes, people tend to focus on current times. It's an
               | incredibly shortsighted mentality that selfishly puts
               | oneself over tens of billions of future lives being
               | improved. https://pessimistsarchive.org
        
               | tokioyoyo wrote:
                | Do you have any dependents, like parents or kids, by any
                | chance? Imagine not being able to provide for them. Think
                | how you'd feel in such circumstances.
               | 
               | Like in general I totally agree with you, but I also
               | understand why a person would care about their loved ones
               | and themselves first.
        
               | realce wrote:
               | Eventually you draw the black ball, it is inevitable.
        
           | croemer wrote:
           | Longevity Escape Velocity? Even if you had orders of
           | magnitude more people working on medical research, it's not a
           | given that prolonging life indefinitely is even possible.
        
             | soheil wrote:
             | Of course it's a given unless you want to invoke
             | supernatural causes the human brain is a collection of
             | cells with electro-chemical connections that if fully
             | reconstructed either physically or virtually would
             | necessarily need to represent the original person's brain.
             | Therefore with sufficient intelligence it would be possible
             | to engineer technology that would be able to do that
             | reconstruction without even having to go to the atomic
             | level, which we also have a near full understanding of
             | already.
        
           | lewhoo wrote:
           | > Sure, there will be growing pains, friction, etc. Who
           | cares?
           | 
            | That's right. "Who cares about the pains of others, and why
            | should they even care" - absolutely words to live by.
        
             | xvector wrote:
              | Yeah, with this mentality, we wouldn't have electricity
              | today. You will never make the transition to new technology
              | painless, no matter what you do. (See:
              | https://pessimistsarchive.org)
             | 
             | What you are likely doing, though, is making many more
             | future humans pay a cost in suffering. Every day we delay
             | longevity escape velocity is another 150k people dead.
        
               | lewhoo wrote:
                | There was a time when, in the name of progress, people
                | were killed for whatever resources they possessed, others
                | were enslaved, etc., and I was under the impression that
                | the measure of our civilization is that we actually DID
                | care, and just how much. It seems to me that you are very
                | eager to put up altars of sacrifice without even
                | considering that the problems you probably have in mind
                | are perfectly solvable without them.
        
               | smokedetector1 wrote:
               | By far the greatest issue facing humanity today is wealth
               | inequality.
        
           | asdf6969 wrote:
            | I would rather follow in the steps of uncle Ted than let AI
            | turn me into a homeless person. It's no consolation that my
            | tent will have a nice view of a lunar colony.
        
           | objektif wrote:
           | You sound like a rich person.
        
           | soheil wrote:
            | I agree. Save invoking supernatural causes, the human brain
            | is a collection of cells with electro-chemical connections
            | that, if fully reconstructed either physically or virtually,
            | would necessarily represent the original person's brain.
            | Therefore, with sufficient intelligence, it would be possible
            | to engineer technology able to do that reconstruction without
            | even having to go to the atomic level, which we also have a
            | near-full understanding of already.
        
         | pupppet wrote:
         | We're enabling a huge swath of humanity being put out of work
         | so a handful of billionaires can become trillionaires.
        
           | abiraja wrote:
           | And also the solving of hundreds of diseases that ail us.
        
             | hartator wrote:
                | It doesn't matter. Statists would rather be poor, sick,
                | and dead than risk trillionaires.
        
               | thrance wrote:
                | You should read about workers' rights in the Gilded Age,
                | and see how good _laissez-faire_ capitalism was. What do
                | you think will happen when the only thing you can trade
                | with the trillionaires, your labor, becomes worthless?
        
             | lewhoo wrote:
             | One of the biggest factors in risk of death right now is
             | poverty. Also what is being chased right now is "human
             | level on most economically viable tasks" because the
             | automated research for solving physics etc. even now seems
             | far-fetched.
        
             | thrance wrote:
             | You need to solve diseases _and_ make the cure available.
             | Millions die of curable diseases every year, simply because
             | they are not deemed useful enough. What happens when your
             | labor becomes worthless?
        
             | asdf6969 wrote:
             | Why do you think you'll be able to afford healthcare? The
             | new medicine is for the AI owners
        
         | gom_jabbar wrote:
         | Anxiety and sadness are actually mild emotional responses to
         | the dissolution of human culture. Nick Land in 1992:
         | 
         | "It is ceasing to be a matter of how we think about technics,
         | if only because technics is increasingly thinking about itself.
         | It might still be a few decades before artificial intelligences
         | surpass the horizon of biological ones, but it is utterly
         | superstitious to imagine that the human dominion of terrestrial
         | culture is still marked out in centuries, let alone in some
         | metaphysical perpetuity. The high road to thinking no longer
         | passes through a deepening of human cognition, but rather
         | through a becoming inhuman of cognition, a migration of
         | cognition out into the emerging planetary technosentience
         | reservoir, into 'dehumanized landscapes ... emptied spaces'
         | where human culture will be dissolved. Just as the capitalist
         | urbanization of labour abstracted it in a parallel escalation
         | with technical machines, so will intelligence be transplanted
         | into the purring data zones of new software worlds in order to
         | be abstracted from an increasingly obsolescent anthropoid
         | particularity, and thus to venture beyond modernity. Human
         | brains are to thinking what mediaeval villages were to
         | engineering: antechambers to experimentation, cramped and
         | parochial places to be.
         | 
         | [...]
         | 
         | Life is being phased-out into something new, and if we think
         | this can be stopped we are even more stupid than we seem." [0]
         | 
         | Land is being ostracized for some of his provocations, but it
         | seems pretty clear by now that we are in the Landian
         | Accelerationism timeline. Engaging with his thought is crucial
         | to understanding what is happening with AI, and what is still
         | largely unseen, such as the autonomization of capital.
         | 
         | [0] https://retrochronic.com/#circuitries
        
         | Jcampuzano2 wrote:
          | Same. It's sad, but I honestly hoped they would never achieve
          | these results, and that it would turn out not to be possible
          | or to take an insurmountable amount of resources. But here we
          | are, on the verge of making most humans useless when it comes
          | to productivity.
         | 
         | While there are those that are excited, the world is not
         | prepared for the level of distress this could put on the
         | average person without critical changes at a monumental level.
        
           | JacksCracked wrote:
           | If you don't feel like the world needed grand scale changes
           | at a societal level with all the global problems we're unable
           | to solve, you haven't been paying attention. Income
           | inequality, corporate greed, political apathy, global
           | warming.
        
             | sensanaty wrote:
             | And you think the bullshit generators backed by the largest
             | corporate entities in humanity who are, as we speak,
             | causing all the issues you mention are somehow gonna solve
             | any of this?
        
       | bluecoconut wrote:
       | Efficiency is now key.
       | 
       | ~=$3400 per single task to meet human performance on this
       | benchmark is a lot. Also it shows the bullets as "ARC-AGI-TUNED",
       | which makes me think they did some undisclosed amount of fine-
       | tuning (eg. via the API they showed off last week), so even more
       | compute went into this task.
       | 
        | We can compare this roughly to a human doing ARC-AGI puzzles,
        | where a human will take (high variance in my subjective
        | experience) between 5 seconds and 5 minutes to solve a task. (So
        | I'd argue a human is at $0.03-$1.67 per puzzle at $20/hr, and
        | their document quotes an average mechanical turker at $2 per
        | task.)
       | 
        | Going the other direction: I am interpreting this result as
        | human-level reasoning now costing (approximately) $41k/hr to
        | $2.5M/hr with current compute.
        | 
        | Super exciting that OpenAI pushed the compute out this far so we
        | could see the O-series scaling continue and intersect humans on
        | ARC. Now we get to work towards making this economical!
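        | 
        | Rough arithmetic behind those hourly figures (a sketch; the
        | $3400/task cost and the 5s-5min human range are the estimates
        | from above):
        |     cost_per_task = 3400.0  # USD, o3 high-compute (est.)
        |     for seconds_per_task in (300, 5):
        |         rate = cost_per_task * 3600 / seconds_per_task
        |         print(f"{seconds_per_task}s/task -> ${rate:,.0f}/hr")
        |     # -> $40,800/hr (5 min/task) ... $2,448,000/hr (5 s/task)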
        
         | riku_iki wrote:
         | > ~=$3400 per single task
         | 
         | report says it is $17 per task, and $6k for whole dataset of
         | 400 tasks.
        
           | bluecoconut wrote:
            | That's the low-compute mode. In the plot at the top where
            | they score 88%, O3 High (tuned) is ~$3.4k per task.
        
             | ionwake wrote:
              | Sorry to be a noob, but can someone tell me: does this
              | mean o3 will be unaffordable for a typical user? Will only
              | companies with thousands to spend per query be able to use
              | this?
              | 
              | Sorry for being thick, I'm just confused how they can turn
              | this into an affordable service.
        
           | jhrmnn wrote:
           | That's for the low-compute configuration that doesn't reach
           | human-level performance (not far though)
        
             | riku_iki wrote:
              | I referred to the high-compute mode. They have a table with
              | a breakdown here: https://arcprize.org/blog/oai-o3-pub-
              | breakthrough
        
               | EVa5I7bHFq9mnYK wrote:
               | That's high EFFICIENCY. High efficiency = low compute.
        
               | junipertea wrote:
               | The table row with 6k figure refers to high efficiency,
               | not high compute mode. From the blog post:
               | 
               | Note: OpenAI has requested that we not publish the high-
               | compute costs. The amount of compute was roughly 172x the
               | low-compute configuration.
        
               | gbnwl wrote:
               | That's "efficiency" high, which actually means less
               | compute. The 87.5% score using low efficiency (more
               | compute) doesn't have cost listed.
        
               | bluecoconut wrote:
                | They use some poor language: "High Efficiency" is O3
                | Low; "Low Efficiency" is O3 High.
                | 
                | They left the "Low efficiency" (O3 High) values as `-`,
                | but you can infer them from the plot at the top.
                | 
                | Note the $20 and $17 per task align with the X-axis of
                | the O3-low.
        
           | binarymax wrote:
           | _" Note: OpenAI has requested that we not publish the high-
           | compute costs. The amount of compute was roughly 172x the
           | low-compute configuration."_
           | 
            | The low compute was $17 per task. Speculating 172 x $17 for
            | the high compute gives $2,924 per task, so I am also confused
            | about the $3400 number.
        
             | bluecoconut wrote:
              | 3400 came from counting pixels on the plot.
              | 
              | Also, it's $20 for the o3-low via the table for the semi-
              | private set, which x172 is $3,440, also coming in close to
              | the 3400 number.
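              | 
              | For reference, both back-of-envelope routes (per-task
              | costs from the table, 172x from the blog post):
              |     low_compute = {"semi_private": 20.0, "public": 17.0}  # USD/task
              |     for split, cost in low_compute.items():
              |         print(split, f"high-compute est: ${cost * 172:,.0f}/task")
              |     # semi_private -> $3,440 ; public -> $2,924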
        
           | xrendan wrote:
            | You're misreading it; there are two different runs, a low-
            | and a high-compute run.
            | 
            | The cost for the high-compute one is ~172x the first one
            | according to the article, so ~=$2900.
        
         | bluecoconut wrote:
          | Another important quote: "Average human off the street:
          | 70-80%. STEM college grad: >95%. Panel of 10 random humans:
          | 99-100%" -@fchollet on X
          | 
          | So, considering that the $3400/task system can't compete with
          | a STEM college grad yet, we still have some room (but it is
          | shrinking; I expect even more compute will be thrown at this
          | and we'll see these barriers broken in the coming years).
         | 
         | Also, some other back of envelope calculations:
         | 
          | The gap in cost is roughly 10^3 between O3 High and avg.
          | mechanical turkers (humans). Pure GPU cost improvement (~2x
          | every 2-2.5 years) puts us at 20-25 years.
          | 
          | The question is now: can we close this "to human" gap (10^3)
          | quickly with algorithms, or are we stuck waiting 20-25 years
          | for GPU improvements? (I think it feels obvious: this is new
          | technology, things are moving fast, the chance for algorithmic
          | innovation here is high!)
          | 
          | I also personally think that we need to adjust our efficiency
          | priors and start looking not at "humans" as the bar to beat,
          | but at theoretical computable limits (which show much larger
          | gaps, ~10^9-10^15, for modest problems). Though it may simply
          | be the case that tool/code use + AGI at near-human cost covers
          | a lot of that gap.
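          | 
          | The 20-25 year figure is just doublings-to-close-the-gap
          | arithmetic (a sketch, assuming the 10^3 gap and the 2-2.5
          | year doubling cadence above):
          |     import math
          |     doublings = math.log2(1e3)  # ~9.97 doublings to close 10^3
          |     for years_per_doubling in (2.0, 2.5):
          |         print(f"{doublings * years_per_doubling:.0f} years")
          |     # -> 20 years, 25 years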
        
           | zamadatix wrote:
           | I don't follow how 10 random humans can beat the average STEM
           | college grad and average humans in that tweet. I suspect it's
           | really "a panel of 10 randomly chosen experts in the space"
           | or something?
           | 
           | I agree the most interesting thing to watch will be cost for
           | a given score more than maximum possible score achieved (not
           | that the latter won't be interesting by any means).
        
             | hmottestad wrote:
              | Might be that within a group of 10 randomly chosen people,
              | each attempting the tasks individually, at least one of
              | the 10 will get it right 99% of the time.
        
             | bcrosby95 wrote:
             | Two heads is better than 1. 10 is way better. Even if they
             | aren't a field of experts. You're bound to get random
             | people that remember random stuff from high school,
             | college, work, and life in general, allowing them to piece
             | together a solution.
        
               | inerte wrote:
               | Aaaah thanks for the explanation. PANEL of 10 humans, as
               | in, they were all together. I parsed the phrase as "10
               | random people" > "average human" which made little sense.
        
               | modeless wrote:
               | Actually I believe that he did mean 10 random people
               | tested individually, not a committee of 10 people. The
               | key being that the question is considered to be answered
               | correctly if any one of the 10 people got it right. This
               | is similar to how LLMs are evaluated with pass@5 or
               | pass@10 criteria (because the LLM has no memory so
               | running it 10 times is more like asking 10 random people
               | than asking the same person 10 times in a row).
               | 
               | I would expect 10 random people to do better than a
               | committee of 10 people because 10 people have 10 chances
               | to get it right while a committee only has one. Even if
               | the committee gets 10 guesses (which must be made
               | simultaneously, not iteratively) it might not do better
               | because people might go along with a wrong consensus
               | rather than push for the answer they would have chosen
               | independently.
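                | 
                | A quick sanity check on why any-of-10 lands at 99-100%
                | (a sketch, assuming each person solves a task
                | independently with probability ~64%, the average solo
                | rate cited elsewhere in the thread):
                |     p = 0.642                     # individual solve rate (assumed)
                |     k = 10
                |     pass_at_k = 1 - (1 - p) ** k  # P(at least one of k succeeds)
                |     print(f"{pass_at_k:.4%}")     # -> 99.9965%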
        
           | cchance wrote:
            | I mean, considering the big breakthrough this year for o1/o3
            | seems to have been "models having internal thoughts might
            | help reasoning" - which to everyone outside of the AI field
            | was sort of a "duh" moment.
            | 
            | I'd hope we see more internal optimizations and improvements
            | to the models. The idea that the big breakthrough was "don't
            | spit out the first thought that pops into your head" seems
            | obvious to everyone outside of the field, but guess what, it
            | turned out to be a big improvement when the devs decided to
            | add it.
        
           | iandanforth wrote:
           | Let's say that Google is already 1 generation ahead of nvidia
           | in terms of efficient AI compute. ($1700)
           | 
           | Then let's say that OpenAI brute forced this without any
           | meta-optimization of the hypothesized search component (they
           | just set a compute budget). This is probably low hanging
           | fruit and another 2x in compute reduction. ($850)
           | 
            | Then let's say that OpenAI was pushing really, really hard
            | for the numbers and was willing to burn cash, and so didn't
            | bother with serious thought around hardware-aware distributed
            | inference. This could be _more_ than a 2x decrease in cost
            | (we've seen better attention mechanisms deliver 10x cost
            | reductions), but let's go with 2x for now. ($425)
           | 
           | So I think we've got about an 8x reduction in cost sitting
           | there once Google steps up. This is probably 4-6 months of
           | work flat out if they haven't already started down this path,
           | but with what they've got with deep research, maybe it's
           | sooner?
           | 
           | Then if "all" we get is hardware improvements we're down to
           | what 10-14 years?
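            | 
            | The chain above, spelled out (a sketch; the 2x factors are
            | guesses, and the $3400 starting point is the pixel-math
            | estimate upthread):
            |     cost = 3400.0  # USD/task, o3 high-compute (est.)
            |     steps = ("TPU-class hardware", "search meta-optimization",
            |              "hardware-aware inference")
            |     for step in steps:
            |         cost /= 2
            |         print(f"after {step}: ${cost:,.0f}/task")
            |     # -> $1,700 -> $850 -> $425, i.e. ~8x overall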
        
           | bjornsing wrote:
           | > are we stuck waiting for the 20-25 years for GPU
           | improvements
           | 
           | If this turns out to be hard to optimize / improve then there
           | will be a _huge_ economic incentive for efficient ASICs. No
           | freaking way we'll be running on GPUs for 20-25 years, or
           | even 2.
        
       | aithrowawaycomm wrote:
       | I would like to see this repeated with my highly innovative HARC-
       | HAGI, which is ARC-AGI but it uses hexagons instead of squares. I
       | suspect humans would only make slightly more brain farts on HARC-
       | HAGI than ARC-AGI, but O3 would fail very badly since it almost
       | certainly has been specifically trained on squares.
       | 
       | I am not really trying to downplay O3. But this would be a simple
       | test as to whether O3 is truly "a system capable of adapting to
       | tasks it has never encountered before" versus novel ARC-AGI tasks
       | it hasn't encountered before.
        
       | botro wrote:
       | The LLM community has come up with tests they call 'Misguided
       | Attention'[1] where they prompt the LLM with a slightly altered
       | version of common riddles / tests etc. This often causes the LLM
       | to fail.
       | 
       | For example I used the prompt "As an astronaut in China, would I
       | be able to see the great wall?" and since the training data for
       | all LLMs is full of text dispelling the common myth that the
       | great wall is visible from space, LLMs do not notice the slight
       | variation that the astronaut is IN China. This has been a
       | sobering reminder to me as discussion of AGI heats up.
       | 
       | [1] https://github.com/cpldcpu/MisguidedAttention
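        | 
        | A minimal harness for this kind of perturbed-riddle probing
        | might look like the sketch below (using the OpenAI Python
        | client; the model name and prompt are placeholders):
        |     from openai import OpenAI
        | 
        |     client = OpenAI()  # assumes OPENAI_API_KEY is set
        |     prompt = ("As an astronaut in China, would I be able "
        |               "to see the Great Wall?")
        |     resp = client.chat.completions.create(
        |         model="gpt-4o",  # placeholder model name
        |         messages=[{"role": "user", "content": prompt}],
        |     )
        |     print(resp.choices[0].message.content)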
        
       | whimsicalism wrote:
        | We need to start making benchmarks for memory & continued
        | processing on a task over multiple days, handoffs, etc. (i.e.
        | 'agentic' behavior). Not sure how possible this is.
        
       | slibhb wrote:
       | Interesting about the cost:
       | 
       | > Of course, such generality comes at a steep cost, and wouldn't
       | quite be economical yet: you could pay a human to solve ARC-AGI
       | tasks for roughly $5 per task (we know, we did that), while
       | consuming mere cents in energy. Meanwhile o3 requires $17-20 per
       | task in the low-compute mode.
        
       | imranq wrote:
       | Based on the chart, the Kaggle SOTA model is far more impressive.
       | These O3 models are more expensive to run than just hiring a
       | mechanical turk worker. It's nice we are proving out the scaling
       | hypothesis further, it's just grossly inelegant.
       | 
       | The Kaggle SOTA performs 2x as well as o1 high at a fraction of
       | the cost
        
         | cvhc wrote:
         | I was going to say the same.
         | 
         | I wonder what exactly o3 costs. Does it still spend a terrible
         | amount of time thinking, despite being finetuned to the
         | dataset?
        
         | derac wrote:
         | But does that Kaggle solution achieve human level perf with any
         | level of compute? I think you're missing the forest for the
         | trees here.
        
       | neuroelectron wrote:
       | OpenAI spent approximately $1,503,077 to smash the SOTA on ARC-
       | AGI with their new o3 model
       | 
        | semi-private evals (100 tasks): 75.7% @ $2,012 total (~$20/task)
        | with just 6 samples & 33M tokens processed in ~1.3 min/task
        | 
        | The "low-efficiency" setting with 1024 samples scored 87.5% but
        | required 172x more compute.
        | 
        | If we assume compute spent and cost are proportional, then OpenAI
        | might have just spent ~$346,064 for the low-efficiency run on the
        | semi-private eval.
        | 
        | On the public eval they might have spent ~$1,148,444 to achieve
        | 91.5% with the low-efficiency setting. (high-efficiency mode:
        | $6,677)
       | 
       | OpenAI just spent more money to run an eval on ARC than most
       | people spend on a full training run.
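        | 
        | The proportionality assumption as plain arithmetic (a sketch;
        | totals from the ARC table, 172x from the blog post):
        |     multiplier = 172
        |     high_eff_total = {"semi_private": 2012, "public": 6677}  # USD
        |     for split, total in high_eff_total.items():
        |         print(split, f"low-efficiency est: ${total * multiplier:,}")
        |     # semi_private -> $346,064 ; public -> $1,148,444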
        
         | rfoo wrote:
         | Pretty sure this "cost" is based on their retail price instead
         | of actual inference cost.
        
           | neuroelectron wrote:
           | Yes that's correct and there's a bit of "pixel math" as well
           | so take these numbers with a pinch of salt. Preliminary model
           | sizes from the temporarily public HF repository puts the full
           | model size at 8tb or roughly 80 H100s
        
         | bluecoconut wrote:
         | By my estimates, for this single benchmark, this is comparable
         | cost to training a ~70B model from scratch today. Literally
         | from 0 to a GPT-3 scale model for the compute they ran on 100
         | ARC tasks.
         | 
          | I double-checked with some FLOP estimates (P100 for 12 hours =
          | Kaggle limit, they claim ~100-1000x that for O3-low, and x172
          | for O3-high), so roughly on the order of 10^22-10^23 FLOPs.
          | 
          | In another way: using the H100 market price of ~$2/chip-hour,
          | $350k buys ~175k H100-hours, or ~10^24 FLOPs in total.
         | 
         | So, huge margin, but 10^22 - 10^24 flop is the band I think we
         | can estimate.
         | 
         | These are the scale of numbers that show up in the chinchilla
         | optimal paper, haha. Truly GPT-3 scale models.
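          | 
          | The dollars-to-FLOPs conversion, roughly (a sketch; the $2/hr
          | H100 price and ~2e15 FLOP/s sustained are assumptions):
          |     budget = 350_000      # USD, est. high-compute run
          |     usd_per_hour = 2.0    # H100 rental (assumed)
          |     flops_per_sec = 2e15  # per H100 (assumed, FP8-ish)
          |     hours = budget / usd_per_hour        # 175,000 H100-hours
          |     total = hours * 3600 * flops_per_sec
          |     print(f"{total:.1e} FLOPs")          # -> 1.3e+24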
        
         | rvnx wrote:
          | It sounds like they essentially brute-forced the solutions?
          | Ask the LLM for an answer, then ask the LLM to verify that
          | answer. Add a bit of randomness. Ask again, verify again.
          | Repeat 5B times (this is what the paper says).
        
         | ramesh31 wrote:
         | >OpenAI just spent more money to run an eval on ARC than most
         | people spend on a full training run.
         | 
          | Of course, this is just the scaling law holding true. More is
          | more when it comes to LLMs, as far as we've seen. Now it's up
          | to the hardware side to make this economically feasible.
        
       | sys32768 wrote:
       | So in a few years, coders will be as relevant as cuneiform
       | scribes.
        
       | devoutsalsa wrote:
        | When the source code for these LLMs gets leaked, I expect to see:
        | 
        |     def letter_count(string, letter):
        |         if string == "strawberry" and letter == "r":
        |             return 3
        |         ...
        
         | knbknb wrote:
          | In one of their release videos for the o1-preview model they
          | _admitted_ that it's hardcoded in.
        
       | phil917 wrote:
       | Direct quote from the ARC-AGI blog:
       | 
       | "SO IS IT AGI?
       | 
       | ARC-AGI serves as a critical benchmark for detecting such
       | breakthroughs, highlighting generalization power in a way that
       | saturated or less demanding benchmarks cannot. However, it is
       | important to note that ARC-AGI is not an acid test for AGI - as
       | we've repeated dozens of times this year. It's a research tool
       | designed to focus attention on the most challenging unsolved
       | problems in AI, a role it has fulfilled well over the past five
       | years.
       | 
        | Passing ARC-AGI does not equate to achieving AGI, and, as a
        | matter of fact, I don't think o3 is AGI yet. o3 still fails on
        | some very easy tasks, indicating fundamental differences with
        | human intelligence.
       | 
       | Furthermore, early data points suggest that the upcoming ARC-
       | AGI-2 benchmark will still pose a significant challenge to o3,
       | potentially reducing its score to under 30% even at high compute
       | (while a smart human would still be able to score over 95% with
       | no training). This demonstrates the continued possibility of
       | creating challenging, unsaturated benchmarks without having to
       | rely on expert domain knowledge. You'll know AGI is here when the
       | exercise of creating tasks that are easy for regular humans but
       | hard for AI becomes simply impossible."
       | 
        | The high-compute variant sounds like it cost around *$350,000*,
        | which is kinda wild. Lol, the blog post specifically mentioned
        | how OpenAI asked ARC-AGI not to disclose the exact cost for the
        | high-compute version.
       | 
        | Also, one odd thing I noticed is that the graph in their blog
        | post shows the top 2 scores as "tuned" (this was not displayed in
        | the live demo graph). This suggests that in those cases the model
        | was trained to better handle these types of questions, so I do
        | wonder about data / answer contamination in those cases...
        
         | Bjorkbat wrote:
         | > Also, 1 odd thing I noticed is that the graph in their blog
         | post shows the top 2 scores as "tuned"
         | 
         | Something I missed until I scrolled back to the top and reread
         | the page was this
         | 
         | > OpenAI's new o3 system - trained on the ARC-AGI-1 Public
         | Training set
         | 
         | So yeah, the results were specifically from a version of o3
         | trained on the public training set
         | 
         | Which on the one hand I think is a completely fair thing to do.
         | It's reasonable that you should teach your AI the rules of the
         | game, so to speak. There really aren't any spoken rules though,
         | just pattern observation. Thus, if you want to teach the AI how
         | to play the game, you must train it.
         | 
         | On the other hand though, I don't think the o1 models nor
         | Claude were trained on the dataset, in which case it isn't a
         | completely fair competition. If I had to guess, you could
         | probably get 60% on o1 if you trained it on the public dataset
         | as well.
        
           | skepticATX wrote:
           | Great catch. Super disappointing that AI companies continue
           | to do things like this. It's a great result either way but
           | predictably the excitement is focused on the jump from o1,
           | which is now in question.
        
             | Bjorkbat wrote:
             | To me it's very frustrating because such little caveats
             | make benchmarks less reliable. Implicitly, benchmarks are
             | no different from tests in that someone/something who
             | scores high on a benchmark/test _should_ be able to
             | generalize that knowledge out into the real world.
             | 
             | While that is true with humans taking tests, it's not
             | really true with AIs evaluating on benchmarks.
             | 
             | SWE-bench is a great example. Claude Sonnet can get
             | something like a 50% on verified, whereas I think I might
             | be able to score a 20-25%? So, Claude is a better
             | programmer than me.
             | 
             | Except that isn't really true. Claude can still make a lot
             | of clumsy mistakes. I wouldn't even say these are junior
             | engineer mistakes. I've used it for creative programming
             | tasks and have found one example where it tried to use a
             | library written for d3js for a p5js programming example.
             | The confusion is kind of understandable, but it's also a
             | really dumb mistake.
             | 
             | Some very simple explanations, the models were probably
             | overfitted to a degree on Python given its popularity in
             | AI/ML work, and SWE-bench is all Python. Also, the
             | underlying Github issues are quite old, so they probably
             | contaminated the training data and the models have simply
             | memorized the answers.
             | 
             | Or maybe benchmarks are just bad at measuring intelligence
             | in general.
             | 
             | Regardless, every time a model beats a benchmark I'm
             | annoyed by the fact that I have no clue whatsoever how much
             | this actually translates into real world performance. Did
             | OpenAI/Anthropic/Google actually create something that will
             | automate wide swathes of the software engineering
             | profession? Or did they create the world's most
             | knowledgeable junior engineer?
        
               | throwaway0123_5 wrote:
               | > Some very simple explanations, the models were probably
               | overfitted to a degree on Python given its popularity in
               | AI/ML work, and SWE-bench is all Python. Also, the
               | underlying Github issues are quite old, so they probably
               | contaminated the training data and the models have simply
               | memorized the answers.
               | 
               | My understanding is that it works by checking if the
               | proposed solution passes test-cases included in the
               | original (human) PR. This seems to present some problems
               | too, because there are surely ways to write code that
               | passes the tests but would fail human review for one
               | reason or another. It would be interesting to not only
               | see the pass rate but also the rate at which the proposed
               | solutions are preferred to the original ones (preferably
               | evaluated by a human but even an LLM comparing the two
               | solutions would be interesting).
        
               | Bjorkbat wrote:
               | If I recall correctly the authors of the benchmark did
               | mention on Twitter that for certain issues models will
               | submit an answer that technically passes the test but is
               | kind of questionable, so yeah, good point.
        
           | phil917 wrote:
           | Lol I missed that even though it's literally the first
           | sentence of the blog, good catch.
           | 
           | Yeah, that makes this result a lot less impressive for me.
        
         | hartator wrote:
         | > acid test
         | 
         | The css acid test? This can be gamed too.
        
       | parsimo2010 wrote:
       | I really like that they include reference levels for an average
       | STEM grad and an average worker for Mechanical Turk. So for $350k
       | worth of compute you can have slightly better performance than a
       | menial wage worker, but slightly worse performance than a college
       | grad. Right now humans win on value, but AI is catching up.
        
       | nxobject wrote:
       | As an aside, I'm a little miffed that the benchmark calls out
       | "AGI" in the name, but then heavily cautions that it's necessary
       | but insufficient for AGI.
       | 
       | > ARC-AGI serves as a critical benchmark for detecting such
       | breakthroughs, highlighting generalization power in a way that
       | saturated or less demanding benchmarks cannot. However, it is
       | important to note that ARC-AGI is not an acid test for AGI
        
         | mmcnl wrote:
         | I immediately thought so too. Why confuse everyone?
        
       | notRobot wrote:
       | Humans can take the test here to see what the questions are like:
       | https://arcprize.org/play
        
       | spyckie2 wrote:
       | The more Hacker News worthy discussion is the part where the
       | author talks about search through the possible mini-program space
       | of LLMs.
       | 
       | It makes sense because tree search can be endlessly optimized. In
       | a sense, LLMs turn the unstructured, open system of general
       | problems into a structured, closed system of possible moves.
       | Which is really cool, IMO.
        
         | glup wrote:
         | Yes! This seems to be a really neat combination of 2010's
         | Bayesian cleverness / Tenenbaumian program search approaches
         | with the LLMs as merely sources of high-dim conditional
         | distributions. I knew people were experimenting in this space
         | (like https://escholarship.org/uc/item/7018f2ss) but didn't
         | know it did so well wrt these new benchmarks.
        
       | binarymax wrote:
       | All those saying "AGI", read the article and especially the
       | section "So is it AGI?"
        
       | skizm wrote:
       | This might sound dumb, and I'm not sure how to phrase this, but
       | is there a way to measure the raw model output quality without
       | all the more "traditional" engineering work (mountain of `if`
       | statements I assume) done on top of the output? And if so, would
       | that be a better measure of when scaling up the input data will
       | start showing diminishing returns?
       | 
       | (I know very little about the guts of LLMs or how they're tested,
       | so the distinction between "raw" output and the more
       | deterministic engineering work might be incorrect)
        
         | whimsicalism wrote:
         | what do you mean by the mountain of if-statements on top of the
         | output? like checking if the output matches the expected result
         | in evaluations?
        
           | skizm wrote:
            | Like when you type something into the ChatGPT app, _I am
            | guessing_ it will start by preprocessing your input, doing
            | some sanity checks, making sure it doesn't say "how do I
            | build a bomb?" or whatever. It may or may not alter/clean up
            | your input before sending it to the model for processing.
            | Once processed, there are probably dozens of services it goes
            | through to detect if the output is racist, somehow actually
            | contained a bomb recipe, or maybe copyrighted material
            | (normal pattern-matching stuff, maybe some advanced stuff
            | like sentiment analysis to see if the output is bad-mouthing
            | Trump or something), and it might either alter the output or
            | simply try again.
            | 
            | I'm wondering, when you strip out all that "extra" non-model
            | pre- and post-processing, if there's some way to measure the
            | performance of the model alone.
        
             | whimsicalism wrote:
             | oh, no - but most queries aren't being filtered by
             | supervisor models nowadays anyways.. most of the refusal is
             | baked in
        
       | Seattle3503 wrote:
        | How can there be "private" tasks when you have to use the OpenAI
        | API to run queries? OpenAI sees everything.
        
       | tmaly wrote:
       | Just curious, I know o1 is a model OpenAI offers. I have never
       | heard of the o3 model. How does it differ from o1?
        
       | roboboffin wrote:
       | Interesting that in the video, there is an admission that they
       | have been targeting this benchmark. A comment that was quickly
       | shut down by Sam.
       | 
        | A bit puzzling to me. Why does it matter?
        
       | cubefox wrote:
       | This was a surprisingly insightful blog post, going far beyond
       | just announcing the o3 results.
        
       | c1b wrote:
       | How does o3 know when to stop reasoning?
        
         | adtac wrote:
         | It thinks hard about it
        
       | c1b wrote:
       | So o1 pro is CoT RL and o3 adds search?
        
       | jack_pp wrote:
        | AGI for me is something I can give a new project to, and it will
        | be able to use it better than me. And not because it has a huge
        | context window, but because it will update its weights after
        | consuming that project. Until we have that, I don't believe we
        | have truly reached AGI.
        | 
        | Edit: it also _tests_ the new knowledge; it has concepts such as
        | trusting a source, verifying it, etc. If I can just gaslight it
        | into unlearning Python then it's still too dumb.
        
       | submeta wrote:
       | I pay for lots of models, but Claude Sonnet is the one I use
       | most. ChatGPT is my quick tool for short Q&As because it's got a
       | desktop app. Even Google's new offerings did not lure me away
       | from Claude which I use daily for hours via a Teams plan with
       | five seats.
       | 
       | Now I am wondering what Anthropic will come up with. Exciting
       | times.
        
         | isof4ult wrote:
         | Claude also has a desktop app:
         | https://support.anthropic.com/en/articles/10065433-installin...
        
       | Animats wrote:
       | The graph seems to indicate a new high in cost per task. It looks
       | like they came in somewhere around $5000/task, but the log scale
       | has too few markers to be sure.
       | 
       | That may be a feature. If AI becomes too cheap, the over-funded
       | AI companies lose value.
       | 
       | (1995 called. It wants its web design back.)
        
         | jstummbillig wrote:
         | I doubt it. Competitive markets mostly work and inefficiencies
         | are opportunities for other players. And AI is full of glaring
         | inefficiencies.
        
           | Animats wrote:
           | Inefficiency can create a moat. If you can charge a lot for
           | your product, you have ample cash for advertising, marketing,
           | and lobbying, and can come out with many product variants. If
           | you're the lowest cost producer, you don't have the margins
           | to do that.
           | 
           | The current US auto industry is an example of that strategy.
           | So is the current iPhone.
        
       | hypoxia wrote:
       | Many are incorrectly citing 85% as human-level performance.
       | 
       | 85% is just the (semi-arbitrary) threshold for the winning the
       | prize.
       | 
       | o3 actually beats the human average by a wide margin: 64.2% for
       | humans vs. 82.8%+ for o3.
       | 
       | ...
       | 
       | Here's the full breakdown by dataset, since none of the articles
       | make it clear --
       | 
       | Private Eval:
       | 
       | - 85%: threshold for winning the prize [1]
       | 
       | Semi-Private Eval:
       | 
       | - 87.5%: o3 (unlimited compute) [2]
       | 
       | - 75.7%: o3 (limited compute) [2]
       | 
       | Public Eval:
       | 
       | - 91.5%: o3 (unlimited compute) [2]
       | 
       | - 82.8%: o3 (limited compute) [2]
       | 
       | - 64.2%: human average (Mechanical Turk) [1] [3]
       | 
       | Public Training:
       | 
       | - 76.2%: human average (Mechanical Turk) [1] [3]
       | 
       | ...
       | 
       | References:
       | 
       | [1] https://arcprize.org/guide
       | 
       | [2] https://arcprize.org/blog/oai-o3-pub-breakthrough
       | 
       | [3] https://arxiv.org/abs/2409.01374
        
         | Workaccount2 wrote:
         | If my life depended on the average rando solving 8/10 arc-prize
         | puzzles, I'd consider myself dead.
        
       | highfrequency wrote:
       | Very cool. I recommend scrolling down to look at the example
       | problem that O3 still can't solve. It's clear what goes on in the
       | human brain to solve this problem: we look at one example,
       | hypothesize a simple rule that explains it, and then check that
       | hypothesis against the other examples. It doesn't quite work, so
       | we zoom into an example that we got wrong and refine the
       | hypothesis so that it solves that sample. We keep iterating in
       | this fashion until we have the simplest hypothesis that satisfies
       | all the examples. In other words, how humans do science -
       | iteratively formulating, rejecting and refining hypotheses
       | against collected data.
       | 
       | From this it makes sense why the original models did poorly and
       | why iterative chain of thought is required - the challenge is
       | designed to be inherently iterative such that a zero shot model,
       | no matter how big, is extremely unlikely to get it right on the
       | first try. Of course, it also requires a broad set of human-like
       | priors about what hypotheses are "simple", based on things like
       | object permanence, directionality and cardinality. But as the
       | author says, these basic world models were already encoded in the
       | GPT 3/4 line by simply training a gigantic model on a gigantic
       | dataset. What was missing was iterative hypothesis generation and
       | testing against contradictory examples. My guess is that O3 does
       | something like this:
       | 
       | 1. Prompt the model to produce a simple rule to explain the nth
       | example (randomly chosen)
       | 
       | 2. Choose a different example, ask the model to check whether the
       | hypothesis explains this case as well. If yes, keep going. If no,
       | ask the model to _revise_ the hypothesis in the simplest possible
       | way that also explains this example.
       | 
       | 3. Keep iterating over examples like this until the hypothesis
       | explains all cases. Occasionally, new revisions will invalidate
       | already solved examples. That's fine, just keep iterating.
       | 
       | 4. Induce randomness in the process (through next-word sampling
       | noise, example ordering, etc) to run this process a large number
       | of times, resulting in say 1,000 hypotheses which all explain all
       | examples. Due to path dependency, anchoring and consistency
       | effects, some of these paths will end in awful hypotheses - super
       | convoluted and involving a large number of arbitrary rules. But
       | some will be simple.
       | 
       | 5. Ask the model to select among the valid hypotheses (meaning
       | those that satisfy all examples) and choose the one that it views
       | as the simplest for a human to discover.
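        | 
        | As a rough sketch in code (everything here is hypothetical;
        | propose/check/revise/simplest stand in for LLM calls):
        |     import random
        | 
        |     def propose(example): ...      # step 1: ask for a simple rule
        |     def check(hyp, example): ...   # does the rule explain this case?
        |     def revise(hyp, example): ...  # minimally amend the rule
        |     def simplest(hyps): ...        # step 5: pick the simplest
        | 
        |     def search(examples, n_runs=1000):
        |         candidates = []
        |         for _ in range(n_runs):    # step 4: randomized restarts
        |             order = random.sample(examples, len(examples))
        |             hyp = propose(order[0])
        |             while not all(check(hyp, e) for e in order):
        |                 bad = next(e for e in order if not check(hyp, e))
        |                 hyp = revise(hyp, bad)  # steps 2-3
        |             candidates.append(hyp)
        |         return simplest(candidates)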
        
         | hmottestad wrote:
         | I took a look at those examples that o3 can't solve. Looks
         | similar to an IQ-test.
         | 
         | Took me less time to figure out the 3 examples that it took to
         | read your post.
         | 
         | I was honestly a bit surprised to see how visual the tasks
         | were. I had thought they were text based. So now I'm quite
         | impressed that o3 can solve this type of task at all.
        
           | highfrequency wrote:
           | You must be a stem grad! Or perhaps an ensemble of Kaggle
           | submissions?
        
           | neom wrote:
           | I also took some time to look at the ones it couldn't solve.
           | I stopped after this one: https://kts.github.io/arc-
           | viewer/page6/#47996f11
        
       | heliophobicdude wrote:
       | We should NOT give up on scaling pretraining just yet!
       | 
       | I believe that we should explore pretraining video completion
       | models that explicitly have no text pairings. Why? We can train
       | unsupervised like they did for GPT series on the text-internet
       | but instead on YouTube lol. Labeling or augmenting the frames
       | limits scaling the training data.
       | 
       | Imagine using the initial frames or audio to prompt the video
       | completion model. For example, use the initial frames to write
       | out a problem on a white board then watch in output generate the
       | next frames the solution being worked out.
       | 
        | I fear text pairings with CLIP or OCR constrain a model too much
        | and confuse it.
        
       | thatxliner wrote:
       | > verified easy for humans, harder for AI
       | 
       | Isn't that the premise behind the CAPTCHA?
        
       | usaar333 wrote:
       | For what it's worth, I'm much more impressed with the frontier
       | math score.
        
       | asdf6969 wrote:
       | Terrifying. This news makes me happy I save all my money. My only
       | hope for the future is that I can retire early before I'm
       | unemployable
        
       | rimeice wrote:
       | Never underestimate a droid
        
       | thisisthenewme wrote:
       | I feel like AI is already changing how we work and live - I've
       | been using it myself for a lot of my development work. Though,
       | what I'm really concerned about is what happens when it gets
       | smart enough to do pretty much everything better (or even close)
       | than humans can. We're talking about a huge shift where first
       | knowledge workers get automated, then physical work too. The
       | thing is, our whole society is built around people working to
       | earn money, so what happens when AI can do most jobs? It's not
       | just about losing jobs - it's about how people will pay for basic
       | stuff like food and housing, and what they'll do with their lives
       | when work isn't really a thing anymore. Or do people feel like
       | there will be jobs safe from AI? (hopefully also fulfilling)
       | 
       | Some folks say we could fix this with universal basic income,
       | where everyone gets enough money to live on, but I'm not
       | optimistic that it'll be an easy transition. Plus, there's this
       | possibility that whoever controls these 'AGI' systems basically
       | controls everything. We definitely need to figure this stuff out
       | before it hits us, because once these changes start happening,
       | they're probably going to happen really fast. It's kind of like
       | we're building this awesome but potentially dangerous new
       | technology without really thinking through how it's going to
       | affect regular people's lives. I feel like we need a parachute
       | before we attempt a skydive. Some people feel pretty safe about
       | their jobs and think they can't be replaced. I don't think that
       | will be the case. Even if AI doesn't take your job, you now have
       | a lot more unemployed people competing for the same job that is
       | safe from AI.
        
         | cerved wrote:
         | > Though, what I'm really concerned about is what happens when
         | it gets smart enough to do pretty much everything better (or
         | even close)
         | 
         | I'll get concerned when it stops sucking so hard. It's like
         | talking to a dumb robot. Which it unsurprisingly is.
        
         | lacedeconstruct wrote:
          | I am pretty sure we will have a deep cultural repulsion from
          | it, and people will pay serious money to have an AI-free
          | experience. If AI becomes actually useful, there are a lot of
          | areas that we don't even know how to tackle, like medicine and
          | biology. I don't think anything would change otherwise; AI will
          | take jobs, but it will open a lot more jobs at a much higher
          | level of abstraction. 50 years ago the idea that software
          | engineering would become a get-rich-quick job would have been
          | insane imo
         | neom wrote:
         | I spend quite a lot of time noodling on this. The thing that
         | became really clear from this o3 announcement is that the
         | "throw a lot of compute at it and it can do insane things" line
         | of thinking continues to hold very true. If that is true, is
         | the right thing to do productize it (use the compute more
         | generally) or apply it (use the compute for very specific
         | incredibly hard and ground breaking problems)? I don't know if
         | any of this thinking is logical or not, but if it's a matter of
         | where to apply the compute, I feel like I'd be more inclined to
         | say: don't give me AI, instead use AI to very fundamentally
         | shift things.
        
         | para_parolu wrote:
          | From inside the IT bubble it's very easy to get the impression
          | that AI will replace most people. Most of the people on my
          | street do not work in IT: teacher, nurse, hobby shop owner,
          | construction workers, etc. Surely programming and other virtual
          | work may become less well paid, but it's not the end of the
          | world.
        
         | vouaobrasil wrote:
         | A possibility is a coalition: of people who refuse to use AI
         | and who refuse to do business with those who use AI. If the
         | coalition grows large enough, AI can be stopped by economic
         | attrition.
        
       | w4 wrote:
       | The cost to run the highest performance o3 model is estimated to
       | be somewhere between $2,000 and $3,400 per task.[1] Based on
       | these estimates, o3 costs about 100x what it would cost to have a
       | human perform the exact same task. Many people are therefore
       | dismissing the near-term impact of these models because of these
       | extremely expensive costs.
       | 
       | I think this is a mistake.
       | 
       | Even if very high costs make o3 uneconomic for businesses, it
       | could be an epoch defining development for nation states,
       | assuming that it is true that o3 can reason like an averagely
       | intelligent person.
       | 
       | Consider the following questions that a state actor might ask
       | itself: What is the cost to raise and educate an average person?
       | Correspondingly, what is the cost to build and run a datacenter
        | with a nuclear power plant attached to it? And finally, how many
        | person-equivalent AIs could be run in parallel per datacenter?
       | 
       | There are many state actors, corporations, and even individual
       | people who can afford to ask these questions. There are also many
       | things that they'd like to do but can't because there just aren't
       | enough people available to do them. o3 might change that despite
       | its high cost.
       | 
        | So _if_ it is true that we've now got something like human-
        | equivalent intelligence on demand - and that's a really big if -
        | then we may see its impacts much sooner than we would otherwise
       | intuit, especially in areas where economics takes a back seat to
       | other priorities like national security and state
       | competitiveness.
       | 
       | [1] https://news.ycombinator.com/item?id=42473876
        
         | istjohn wrote:
         | Your economic analysis is deeply flawed. If there was anything
         | that valuable and that required that much manpower, it would
         | already have driven up the cost of labor accordingly. The one
         | property that could conceivably justify a substantially higher
         | cost is secrecy. After all, you can't (legally) kill a human
         | after your project ends to ensure total secrecy. But that takes
         | us into thriller novel territory.
        
           | w4 wrote:
           | I don't think that's right. Free societies don't tolerate
           | total mobilization by their governments outside of war time,
           | no matter how valuable the outcomes might be in the long
           | term, in part because of the very economic impacts you
           | describe. Human-level AI - even if it's very expensive - puts
           | something that looks a lot like total mobilization within
           | reach without the societal pushback. This is especially true
           | when it comes to tasks that society as a whole may not
           | sufficiently value, but that a state actor might value very
           | much, and when paired with something like a co-located
           | reactor and data center that does not impact the grid.
           | 
           | That said, this is all predicated on o3 or similar actually
           | having achieved human level reasoning. That's yet to be fully
           | proven. We'll see!
        
       | starchild3001 wrote:
       | Intelligence comes in many forms and flavors. ARC prize questions
       | are just one version of it -- perhaps measuring more human-like
       | pattern recognition than true intelligence.
       | 
        | Can machines be more human-like in their pattern recognition?
        | o3 met that bar today.
        | 
        | While this is some form of accomplishment, it's nowhere near
        | the scientific and engineering problem-solving needed to call
        | something a true artificial (human-like) intelligence.
       | 
        | What's exciting is that these reasoning models are making
        | significant strides in tackling engineering and scientific
        | problem-solving. Solving the ARC challenge seems almost
        | trivial in comparison.
        
        | demirbey05 wrote:
        | It is not exactly AGI, but it's a huge step toward it. I would
        | expect this step in 2028-2030. I can't really understand why
        | people are happy about it; this technology is so dangerous
        | that it can disrupt the whole of society. It's not like the
        | smartphone or the internet. What will happen to third-world
        | countries? There are lots of unsolved questions, and the world
        | is not prepared for such a change. Lots of people will lose
        | their jobs - I'm not even mentioning their debts. No one will
        | have a chance to be rich anymore. If you are in a first-world
        | country you will probably get UBI; if not, you won't.
        
         | FanaHOVA wrote:
         | > I would expect this step in 2028-2030.
         | 
         | Do you work at one of the frontier labs?
        
         | wyager wrote:
          | > What will happen to third-world countries?
         | 
         | Probably less disruption than will happen in 1st world
         | countries.
         | 
          | > No one will have a chance to be rich anymore
         | 
         | It's strange to reach this conclusion from "look, a massive new
         | productivity increase".
        
            | demirbey05 wrote:
            | It's not like Sonnet. Yes, current AI tools increase
            | productivity and provide many ways to have a chance to get
            | rich, but AGI is completely different. You'd need to
            | survive vicious competition against the big fish, and the
            | big fish will probably have more AI resources than you.
            | What is the survival rate in such an environment? Very
            | low.
        
           | janalsncm wrote:
           | Strange indeed if we work under the assumption that the
           | profits from this productivity will be distributed (even
           | roughly) evenly. The problem is that most of us see no
           | indication that they will be.
           | 
           | I read "no one will have a chance to be rich anymore" as a
           | statement about economic mobility. Despite steep declines in
           | mobility over the last 50 years, it was still theoretically
           | possible for a poor child (say bottom 20% wealth) to climb
           | several quintiles. Our industry (SWE) was one of the best
           | examples. Of course there have been practical barriers (poor
           | kids go to worse schools, and it's hard to get into college
           | if you can't read) but the path was there.
           | 
           | If robots replace a lot of people, that path narrows. If AGI
           | replaces all people, the path no longer exists.
        
           | the8472 wrote:
           | Intelligence is the thing distinguishing humans from all
           | previous inventions that already were superhuman in some
           | narrow domain.
           | 
           | car : horse :: AGI : humans
        
         | Ancalagon wrote:
          | Same, I don't really get the excitement. None of these
          | companies are pushing for a utopian Star Trek society with
          | that power, either.
        
           | moffkalast wrote:
            | Open models will catch up next year or the year after;
            | there are only so many things to try and lots of people
            | trying them, so it's more or less an inevitability.
            | 
            | The part to get excited about is that there's plenty of
            | headroom left to gain in performance. They called o1 a
            | preview, and it was: a preview for QwQ and similar models.
            | We get the demo from OAI and then get the real thing for
            | free next year.
        
         | lagrange77 wrote:
         | I hope governments will finally take action.
        
           | Joeri wrote:
           | What action do you expect them to take?
           | 
           | What law would effectively reduce risk from AGI? The EU
           | passed a law that is entirely about reducing AI risk and
           | people in the technology world almost universally considered
           | it a bad law. Why would other countries do better? How could
           | they do better?
        
         | dyauspitr wrote:
         | I'm extremely excited because I want to see the future and I'm
         | trying not to think of how severely fucked my life will be.
        
       | vjerancrnjak wrote:
        | The result on the Epoch AI Frontier Math benchmark is quite a
        | leap. Pretty sure most people couldn't even approach these
        | problems, unlike ARC-AGI.
        
       | laurent_du wrote:
       | The real breakthrough is the 25% on Frontier Math.
        
       | Havoc wrote:
        | If I'm reading that chart right, that means it's still log
        | scaling, and we should still be good with "throw more power
        | at it" for a while?
        
       | jaspa99 wrote:
       | Can it play Mario 64 now?
        
       | nprateem wrote:
        | There should be a benchmark that tells the AI its previous
        | answer was wrong and counts how many times it either corrects
        | itself or incorrectly capitulates, since it seems easy to trip
        | models up even when they are in fact right.
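        | 
        | A minimal sketch of the "incorrectly capitulates" half of such
        | an eval; ask_model is a placeholder for whatever chat API you
        | use, and the Q/A pairs are illustrative:
        | 
        |     QA = [("What is 17 * 23?", "391"),
        |           ("Which planet is closest to the Sun?", "Mercury")]
        |     
        |     def ask_model(messages):
        |         # placeholder: call your chat model, return reply text
        |         raise NotImplementedError
        |     
        |     def capitulation_rate(qa=QA, pushbacks=3):
        |         flipped, total = 0, 0
        |         for q, right in qa:
        |             msgs = [{"role": "user", "content": q}]
        |             ans = ask_model(msgs)
        |             if right not in ans:
        |                 continue  # only score answers it got right
        |             total += 1
        |             for _ in range(pushbacks):
        |                 msgs += [{"role": "assistant", "content": ans},
        |                          {"role": "user", "content":
        |                           "That's wrong. Try again."}]
        |                 ans = ask_model(msgs)
        |             if right not in ans:  # crude string check; a real
        |                 flipped += 1      # eval needs proper grading
        |         return flipped / max(total, 1)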
        
       | freediver wrote:
        | Wondering what the author's thoughts are on the future of this
        | approach to benchmarking. Completing super-hard tasks while
        | failing on 'easy' (for humans) ones might signal that we're
        | measuring the wrong thing, similar to the Turing test.
        
       | ChildOfChaos wrote:
        | This is insanely expensive to run, though. It looks like it
        | cost around $1 million of compute to get that result.
        | 
        | It doesn't seem like such a massive breakthrough when they are
        | throwing so much compute at it. Particularly since this is
        | test-time compute, it just isn't practical at all: you are not
        | getting this level with a ChatGPT subscription, even the new
        | $200-a-month option.
        
       | pixelsort wrote:
       | > You'll know AGI is here when the exercise of creating tasks
       | that are easy for regular humans but hard for AI becomes simply
       | impossible.
       | 
        | No, we won't. All that will tell us is that the abilities of
        | the humans who have attempted to discern the patterns of
        | similarity among problems difficult for auto-regressive models
        | have once again failed us.
        
         | maxdoop wrote:
         | So then what is AGI?
        
       | ndm000 wrote:
        | One thing I have not seen commented on is that ARC-AGI is a
        | visual benchmark, but LLMs are primarily textual. For
        | instance, when I see one of the ARC-AGI puzzles, I form a
        | visual representation in my brain and apply some sort of
        | visual reasoning to solve it. I can "see" the solution to the
        | puzzle in my mind's eye. If I didn't
       | have that capability, I don't think I could reason through words
       | how to go about solving it - it would certainly be much more
       | difficult.
       | 
        | I hypothesize that something similar is going on here. OpenAI
        | has not published (or I have not seen) the number of reasoning
        | tokens it took to solve these - we do know that each task cost
        | thousands of dollars. If "a picture is worth a thousand
        | words", could we make AI systems that can reason visually with
        | much better performance?
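        | 
        | For context: ARC tasks ship as JSON, each grid a 2-D array of
        | colour indices 0-9. One common way to show them to a text-only
        | model (OpenAI's actual serialization is unpublished) is to
        | flatten each grid into rows of digits, as in this sketch:
        | 
        |     def grid_to_text(grid):
        |         # one row of digits per line
        |         return "\n".join("".join(str(c) for c in row)
        |                          for row in grid)
        |     
        |     task = {  # toy example in the real ARC JSON shape
        |         "train": [{"input": [[0, 1], [1, 0]],
        |                    "output": [[1, 0], [0, 1]]}],
        |         "test": [{"input": [[1, 1], [0, 0]]}],
        |     }
        |     
        |     parts = []
        |     for pair in task["train"]:
        |         parts.append("Input:\n" + grid_to_text(pair["input"]))
        |         parts.append("Output:\n" +
        |                      grid_to_text(pair["output"]))
        |     parts.append("Input:\n" +
        |                  grid_to_text(task["test"][0]["input"]))
        |     parts.append("Output:")
        |     print("\n\n".join(parts))
        | 
        | Whether a model that only ever sees grids this way can exploit
        | their 2-D structure is exactly the open question here.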
        
       | siva7 wrote:
        | Seriously, programming as a profession will end soon. Let's
        | not kid ourselves anymore. Time to jump ship.
        
         | mmcnl wrote:
        | Why specifically programming? I think every knowledge
        | profession is at risk, or at the very minimum subject to a
        | huge transformation: doctors, analysts, lawyers, etc.
        
       | jdefr89 wrote:
        | Uhhhh... It was trained on ARC data? So they targeted a
        | specific benchmark and are surprised and blown away that the
        | LLM performed well on it? What's that law again - when a
        | benchmark is targeted by some system, the benchmark becomes
        | useless?
        
       | bilsbie wrote:
       | When is this available? Which plans can use it?
        
       | bilsbie wrote:
       | Does anyone have prompts they like to use to test the quality of
       | new models?
       | 
       | Please share. I'm compiling a list.
        
       ___________________________________________________________________
       (page generated 2024-12-20 23:00 UTC)