[HN Gopher] OpenAI O3 breakthrough high score on ARC-AGI-PUB
___________________________________________________________________
OpenAI O3 breakthrough high score on ARC-AGI-PUB
Author : maurycy
Score : 730 points
Date : 2024-12-20 18:11 UTC (4 hours ago)
(HTM) web link (arcprize.org)
(TXT) w3m dump (arcprize.org)
| razodactyl wrote:
| Great. Now we have to think of a new way to move the goalposts.
| tines wrote:
| I mean, what else do you call learning?
| Pesthuf wrote:
| Well, right now running this model is really expensive, but we
| should prepare a new cope, ahead of time, for when equivalent
| models no longer are.
| cchance wrote:
| Yeah, getting costs down will be the big one. I imagine
| quantization, distillation, and lots and lots of improvements
| on the compute side, both hardware- and software-wise.
| a_wild_dandan wrote:
| Let's just define AI as "whatever computers still can't do."
| That'll show those dumb statistical parrots!
| dboreham wrote:
| Imagine how the Neanderthals felt...
| foobarqux wrote:
| This is just as silly as claiming that people "moved the
| goalposts" when, after a computer beat Kasparov at chess, they
| said it wasn't AGI: chess wasn't a good test, and some people
| only realized this after the computer beat Kasparov but couldn't
| do much else. In this case the ARC maintainers have specifically
| stated that this is a necessary but not sufficient test of AGI
| (I personally think it is neither).
| og_kalu wrote:
| It's not silly. The computer that could beat Kasparov
| couldn't do anything else, so of course it wasn't Artificial
| General Intelligence.
|
| o3 can do much, much more. There is nothing narrow about SOTA
| LLMs. They are already General. It doesn't matter what the ARC
| maintainers have said. There is no common definition of
| General that LLMs fail to meet. It's not a binary thing.
|
| By the time a single machine covers every little test
| humanity can devise, what comes out of that is not 'AGI' as
| the words themselves mean, but a General Super Intelligence.
| foobarqux wrote:
| It is silly, the logic is the same: "Only a (world-
| altering) 'AGI' could do [test]" -> test is passed -> no
| (world-altering) 'AGI' -> conclude that [test] is not a
| sufficient test for (world-altering) 'AGI' -> chase new
| benchmark.
|
| If you want to play games about how to define AGI, go ahead.
| People have been claiming for years that we've already
| reached AGI, and with every improvement they have to
| bizarrely claim anew that _now_ we've really achieved AGI.
| But after a few months people realize it still doesn't do
| what you would expect of an AGI, and so you chase some new
| benchmark ("just one more eval").
|
| The fact is that there really hasn't been the type of
| world-altering impact that people generally associate with
| AGI, and there's no reason to expect one.
| og_kalu wrote:
| >It is silly, the logic is the same: "Only a (world-
| altering) 'AGI' could do [test]" -> test is passed -> no
| (world-altering) 'AGI' -> conclude that [test] is not a
| sufficient test for (world-altering) 'AGI' -> chase new
| benchmark.
|
| Basically nobody today thinks beating a single benchmark
| and nothing else will make you a General Intelligence. As
| you've already pointed out, even the maintainers of
| ARC-AGI do not think this.
|
| >If you want to play games about how to define AGI go
| ahead.
|
| I'm not playing any games. ENIAC cannot do 99% of the
| things people use computers to do today and yet barely
| anybody will tell you it wasn't the first general purpose
| computer.
|
| On the contrary, it is people who seem to think "General"
| is a moniker for everything under the sun (and then some)
| that are playing games with definitions.
|
| >People have been claiming for years that we've already
| reached AGI and with every improvement they have to
| bizarrely claim anew that now we've really achieved AGI.
|
| Who are these people? Do you have any examples at all?
| Genuine question.
|
| >But after a few months people realize it still doesn't
| do what you would expect of an AGI and so you chase some
| new benchmark ("just one more eval").
|
| What do you expect from 'AGI'? Everybody seems to have
| different expectations, much of them rooted in science
| fiction rather than reality, so this is a moot point.
| What exactly is world-altering to you? Genuinely, do you
| even have anything other than "I'll know it when I see
| it"?
|
| If you introduce technology most people adopt, is that
| world-altering, or are you waiting for Skynet?
| foobarqux wrote:
| > Basically nobody today thinks beating a single
| benchmark and nothing else will make you a General
| Intelligence.
|
| People's comments, including in this very thread, seem to
| suggest otherwise (cf. comments about "goalpost
| moving"). Are you saying that a widespread belief wasn't
| that a chess-playing computer would require AGI? Or that
| Go was at some point the new test for AGI? Or the Turing
| test?
|
| > I'm not playing any games... "General" is a moniker for
| everything under the sun that are playing games with
| definitions.
|
| People have a colloquial understanding of AGI whose
| consequence is a significant change to daily life, not
| the tortured technical definition that you are using.
| Again your definition isn't something anyone cares about
| (except maybe in the legal contract between OpenAI and
| Microsoft).
|
| > Who are these people? Do you have any examples at all?
| Genuine question.
|
| How about you? I get the impression that you think AGI
| was achieved some time ago. It's a bit difficult to
| simultaneously argue both that we achieved AGI in GPT-N
| and also that GPT-(N+X) is now the real breakthrough AGI
| while claiming that your definition of AGI is useful.
|
| > What do you expect from 'AGI'?
|
| I think everyone's definition of AGI includes, as a
| component, significant changes to the world, which
| probably would be something like rapid GDP growth or
| unemployment (though you could have either of those
| without AGI). The fact that you have to argue about what
| the word "general" technically means is proof that we
| don't have AGI in a sense that anyone cares about.
| og_kalu wrote:
| >People's comments, including in this very thread, seem
| to suggest otherwise (cf. comments about "goalpost
| moving").
|
| But you don't see this kind of discussion on the narrow
| models/techniques that made strides on this benchmark, do
| you?
|
| >People have a colloquial understanding of AGI whose
| consequence is a significant change to daily life, not
| the tortured technical definition that you are using
|
| And ChatGPT has represented a significant change to the
| daily lives of many. It's the fastest-adopted software
| product in history. In just two years, it has become one of
| the top ten most visited sites worldwide. A lot of people
| have seen the work they do change significantly since its
| release. This is why I ask: what is world-altering?
|
| >How about you? I get the impression that you think AGI
| was achieved some time ago.
|
| Sure
|
| >It's a bit difficult to simultaneously argue both that
| we achieved AGI in GPT-N and also that GPT-(N+X) is now
| the real breakthrough AGI
|
| I have never claimed GPT-N+X is the "new breakthrough
| AGI". As far as I'm concerned, we hit AGI some time ago
| and are making strides in competence and/or enabling even
| more capabilities.
|
| You can recognize ENIAC as a general purpose computer and
| also recognize the breakthroughs in computing since then.
| They're not mutually exclusive.
|
| And personally, I'm more impressed with o3's Frontier
| Math score than ARC.
|
| >I think everyone's definition of AGI includes, as a
| component, significant changes to the world
|
| Sure
|
| >which probably would be something like rapid GDP growth
| or unemployment
|
| There is definitely no broad agreement on what people
| imagine as "significant change".
|
| Even in science fiction, the existence of general
| intelligences more competent than today's LLMs is not
| necessarily a precursor to massive unemployment or GDP growth.
|
| And for a lot of people, the clincher stopping them from
| calling a machine AGI is not even any of these things.
| For some, whether it is "sentient" or "cannot lie" is far
| more important than any spike in unemployment.
| foobarqux wrote:
| > But you don't see this kind of discussion on the narrow
| models/techniques that made strides on this benchmark, do
| you?
|
| I don't understand what you are getting at.
|
| Ultimately there is no axiomatic definition of the term
| AGI. I don't think the colloquial understanding of the
| word is what you think it is (i.e., if you had described
| to people, pre-ChatGPT, today's ChatGPT behavior,
| including all the limitations and failings and the fact
| that there was no change in GDP, unemployment, etc., and
| asked if that was AGI, I seriously doubt they would say
| yes).
|
| More importantly, I don't think anyone would say their
| life is much different from a few years ago, yet they
| would separately say that under AGI it would be.
|
| But the point that started all this discussion is the
| fact that these "evals" are not good proxies for AGI, and
| no one is moving goalposts even if they realize this
| fact only after the tests have been beaten. You can
| foolishly _define_ AGI as beating ARC, but the moment ARC
| is beaten you realize that you don't care about that
| definition at all. That doesn't change if you make a 10-
| or 100-benchmark suite.
| og_kalu wrote:
| This is also wildly ahead in SWE-bench (71.7%, previous 48%) and
| Frontier Math (25% on high compute, previous 2%).
|
| So much for a plateau lol.
| throwup238 wrote:
| _> So much for a plateau lol._
|
| It's been really interesting to watch all the internet pundits'
| takes on the plateau... as if the _two years_ since the release
| of GPT-3.5 is somehow enough data for an armchair ponce to
| predict the performance characteristics of an entirely novel
| technology that no one understands.
| jgalt212 wrote:
| You could make an equivalently dismissive comment about the
| hypesters.
| throwup238 wrote:
| Yeah but anyone with half a brain knows to ignore them.
| Vapid cynicism is a lot more seductive to the average nerd.
| bandwidth-bob wrote:
| The pundits' response to the (alleged) plateau was
| proportional to the certainty with which CEOs of frontier
| labs discussed pre-training scaling. The o3 result comes from
| scaling test-time compute, which represents a meaningful
| change in how you would build out compute for scaling (single
| supercluster --> presence in regions close to users). Thus it
| is important to discuss.
| attentionmech wrote:
| I legit see that if there isn't a new breakthrough for even
| one week, people start shouting "plateau, plateau". Our rate of
| progress is extraordinary, and any downplaying of it seems
| stupid.
| optimalsolver wrote:
| >Frontier Math (25% on high compute, previous 2%)
|
| This is so insane that I can't help but be skeptical. I know the
| FM answer key is private, but they have to send the questions to
| OpenAI in order to score the models. And a significant jump on
| this benchmark sure would increase a company's valuation...
|
| Happy to be wrong on this.
| OsrsNeedsf2P wrote:
| At $6,670/task? I hope there's a jump
| og_kalu wrote:
| It's not $6,670/task. That was the high-efficiency cost for
| all 400 questions.
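|
| A rough back-of-the-envelope check in Python (assuming the
| ~$6,670 figure was the total high-efficiency cost across all
| 400 questions, not a per-task price):
|
|   total_cost_usd = 6670  # assumed total high-efficiency cost
|   num_tasks = 400        # number of questions in that run
|   print(total_cost_usd / num_tasks)  # ~16.7, i.e. roughly $17/task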
| maxdoop wrote:
| How much longer can I get paid $150k to write code?
| tsunamifury wrote:
| Often what happens is the golf-course phenomenon. As golfing
| gets less popular, low- and mid-tier golf courses go out of
| business as they simply aren't needed. But at the same time,
| demand for high-end golf courses actually skyrockets, because
| people who want to golf either give it up or go higher end.
|
| I think this will happen with programmers. Rote programming
| will slowly die out, while the super high end will go up
| dramatically in price.
| CapcomGo wrote:
| Where does this golf-course phenomenon come from? It doesn't
| really match the real world or how golfing works.
| tsunamifury wrote:
| How so? I witnessed it quite directly in California. The majority
| have closed, and the remaining ones have gone up in price and are
| upscale. This has been covered in various news programs like
| 60 Minutes. You can look up the death of golfing.
|
| Also unsure what you mean by... 'how golfing works'. This is
| the economics of it, not the game.
| EVa5I7bHFq9mnYK wrote:
| Maybe it's a CA thing? Plenty of $50 golf courses here in
| Phoenix.
| colesantiago wrote:
| Frontier expert specialist programmers will always be in
| demand.
|
| Generalist junior and senior engineers will need to think of a
| different career path in less than 5 years as more layoffs will
| reduce the software engineering workforce.
|
| It looks like it may be the way things are if progress in the
| o1, o3, oN models and other LLMs continues on.
| deadbabe wrote:
| This assumes that software products in the future will remain
| at the same complexity as they are today, just with AI
| building them out.
|
| But they won't. AI will enable building even _more_ complex
| software which, counterintuitively, will result in needing even
| _more_ human jobs to deal with this added complexity.
|
| Think about how despite an increasing amount of free open
| source libraries over time enabling some powerful stuff
| easily, developer jobs have only increased, not decreased.
| dmm wrote:
| I've made a similar argument in the past but now I'm not so
| sure. It seems to me that developer demand was linked to
| large expansions in software demand first from PCs then the
| web and finally smartphones.
|
| What if software demand is largely saturated? It seems the
| big tech companies have struggled to come up with the next
| big tech product category, despite lots of talent and
| capital.
| deadbabe wrote:
| There doesn't need to be a new category. Existing
| categories can just continue bloating in complexity.
|
| Compare the early web vs the complicated JavaScript laden
| single page application web we have now. You need way
| more people now. AI will make it even worse.
|
| Consider that in the AI driven future, there will be no
| more frameworks like React. Who is going to bother
| writing one? Instead every company will just have their
| own little custom framework built by an AI that works
| only for their company. Joining a new company means you
| bring generalist skills and learn how their software
| works from the ground up and when you leave to another
| company that knowledge is instantly useless.
|
| Sounds exciting.
|
| But there are also plenty of unexplored categories that we
| still can't access because the technology for them is
| insufficient. Household robots with AGI, for instance,
| may require instructions for specific services sold as
| "apps" that have to be designed and developed by
| companies.
| bandwidth-bob wrote:
| The new capabilities of LLMs, and of large foundation
| models generally, _expand_ the range of what a computer
| program can do. Naturally, we will need to build all of
| those things with code, which will be done by a combo of
| people with product ideas, engineers, and LLMs. There
| will then be specialization and competition on each new
| use case, e.g., who builds the best AI doctor, etc.
| hackinthebochs wrote:
| What about "general" in AGI do you not understand? There
| will be no new style of development for which the AGI will
| be poorly suited that all the displaced developers can move
| to.
| bandwidth-bob wrote:
| For true AGI (whatever that means; let's say it fully
| replicates human abilities), discussing only "developers"
| is a drop in the bucket compared to all the knowledge-work
| jobs that will be displaced.
| cruffle_duffle wrote:
| This is exactly what will happen. We'll just up the
| complexity game to entirely new baselines. There will
| continue to be good money in software.
|
| These models are tools to help engineers, not replacements.
| Models cannot, on their own, build novel new things no
| matter how much the hype suggests otherwise. What they can
| do is remove a hell of a lot of accidental complexity.
| lagrange77 wrote:
| > These models are tools to help engineers, not
| replacements. Models cannot, on their own, build novel
| new things no matter how much the hype suggests
| otherwise.
|
| But maybe models + managers/non technical people can?
| mitjam wrote:
| The question is: how do you become a senior when there is no
| place to be a junior? Will future SWEs need to do their 10k
| hours as a hobby? Will AI speed up or slow down learning?
| singularity2001 wrote:
| Good question, and I think you gave the correct answer: yes,
| people will just do the 10,000 hours required by starting
| programming at the age of eight and then playing around
| until they're done studying.
| prmph wrote:
| I'll believe the models can take the jobs of programmers when
| they can generate a sophisticated iOS app based on some simple
| prompts, ready for building and publication in the app store.
| That is nowhere near the horizon no matter how much things are
| hyped up, and it may well never arrive.
| timenotwasted wrote:
| These absolutist-type comments are such a wild take given how
| often they turn out to be wrong.
| tsunamifury wrote:
| Totally... simple 20% increases in efficiency will already
| significantly destroy demand for coders. This forum, however,
| will be resistant to admitting such an economic phenomenon.
|
| Look at video bay editing after the advent of Final Cut: a
| significant drop in the specialized requirement as a
| professional field, even while content volume went up
| dramatically.
| exitb wrote:
| Computing has been transforming countless jobs before it
| got to Final Cut. On one hand, programming is not the
| hardest job out there. On the other, it takes months to
| fully onboard a human developer - a person that already
| has years of relevant education and work experience.
| There are desk jobs that onboard new hires in days
| instead. Let's see when they're displaced by AI first.
| tsunamifury wrote:
| Don't know if you noticed, but that's already happening.
| Mass layoffs in customer service etc. have already
| happened over the last 2 years.
| exitb wrote:
| So, how does it work out? Are the customers happy? Are
| the bosses at my work going to be equally happy with my
| AI replacement?
| EVa5I7bHFq9mnYK wrote:
| That's until AI has improved enough that it can
| automatically navigate the menus to get me a human
| operator to talk to.
| derektank wrote:
| I could be misreading this, but as far as I can tell,
| there are more video and film editors today (29,240) than
| there were film editors in 1997 (9,320). Seems like an
| example of improved productivity shifting the skills
| required but ultimately driving greater demand for the
| profession as a whole. Salaries don't seem to have been
| hurt either, median wage was $35,214 in '97 and $66,600
| today, right in line with inflation.
|
| https://www.bls.gov/oes/2023/may/oes274032.htm
|
| https://www.bls.gov/oes/tables.htm
| vouaobrasil wrote:
| Nah, it will arrive. And regardless, this sort of AI reduces
| the skill level required to make the app. It reduces the
| number of people required and thus reduces the demand for
| engineers. So, even though AI is not CLOSE to what you are
| suggesting, it can significantly reduce the salaries of those
| that ARE required. So maybe fewer $150K programmers will be
| hired with the same revenue, for even higher profits.
|
| The most bizarre thing is that programmers are literally
| writing code to replace themselves, because once this AI
| started, it was a race to the bottom and nobody wants to be
| last.
| skydhash wrote:
| > Nah, it will arrive
|
| Will it?
|
| It's already hard to get people to use computers as they are
| right now, where you only need to click on things and no
| longer have to enter commands. That's because most people
| don't like to engage in formal reasoning. Even with some of
| the most intuitive computer-assisted tasks (drawing and 3D
| modeling), there's so much to learn regarding theory that
| few people bother.
|
| Programming has always been easy to learn, and tools to
| automate coding have existed for decades now. But how many
| people do you know who have had the urge to learn enough to
| automate their tasks?
| prmph wrote:
| They've been promising us this thing since the 60s: End-
| user development, 5GLs, etc. enabling the average Joe to
| develop sophisticated apps in minimal time. And it never
| arrives.
|
| I remember attending a tech fair decades ago, and at one
| stand they were vending some database products. When I
| mentioned that I was studying computer science with a focus
| on software engineering, they sneered that coding would be
| much less important in the future, since powerful databases
| would minimize the need for a lot of algorithmic data
| wrangling in applications.
|
| What actually happened is that the demand for programmers
| increased, and software ate the world. I suspect something
| similar will happen with the current AI hype.
| vouaobrasil wrote:
| Well, I think in the 60s we also didn't have LLMs that
| could actually write complete programs, either.
| mirsadm wrote:
| No one writes a "complete program" these days. Things
| just keep evolving forever. I spend more time than I
| care to admit dealing with dependencies of libraries
| which seem to change on a daily basis. These
| predictions are so far off from reality that it makes me
| wonder if the people making them have ever written any
| code in their life.
| vouaobrasil wrote:
| That's fair. Well, I've written a lot of code. But
| anyway, I do want to emphasize the following. I am not
| making the same prediction as those who say AI can
| replace a programmer. Instead, I am saying: the combination
| of AI plus programmers will reduce the number of
| programmers needed, and hence allow the software
| industry to exist with far fewer people, with the lucky
| ones accumulating even more wealth.
| whynotminot wrote:
| > They've been promising us this thing since the 60s:
| End-user development, 5GLs, etc. enabling the average Joe
| to develop sophisticated apps in minimal time. And it
| never arrives.
|
| This has literally already arrived. Average Joes _are_
| writing software using LLMs right now.
| deadbabe wrote:
| There's a very good chance that if a company can replace its
| programmers with pure AI, then whatever they're doing is
| probably already being offered as a SaaS product, so why not
| just skip the AI and buy that? Much cheaper, and you don't have
| to worry about dealing with bugs.
| croemer wrote:
| SaaS works for general problems faced by many businesses.
| deadbabe wrote:
| Exactly. Most businesses can get away with not having
| developers at all if they just glue together the right
| combination of SaaS products. But this doesn't happen,
| implying there is something more about having your own
| homegrown developers that SaaS cannot replace.
| croemer wrote:
| The risk is not SaaS replacing internal developers. It's
| about increased productivity of developers reducing the
| number of developers needed to achieve something.
| deadbabe wrote:
| Again, you're assuming product complexity won't grow as a
| result of new AI tools.
|
| 3 decades ago you needed a big team to create the type of
| video games that one person can probably make on their
| own today in their spare time with modern tools.
|
| But now modern tools have been used to make even more
| complicated games that require more massive teams than
| ever and huge amounts of money. One person has no hope of
| replicating that now, but maybe in the future with AI
| they can. And then the AAA games will be even _more_
| advanced.
|
| It will be similar with other software.
| sss111 wrote:
| 3 to 5 years, max. Traditional coding is going to be dead in
| the water. Optimistically, the junior SWE job will evolve, but
| more realistically, dedicated AI-based programming agents will
| end demand for junior SWEs.
| lagrange77 wrote:
| Which implies that a few years later they will not become
| senior SWEs either.
| torginus wrote:
| Well, considering they floated the $2,000 subscription idea, and
| they still haven't revealed everything, they could still
| introduce the $2k sub with o3 + agents/tool use - which means:
| until about next week.
| arrosenberg wrote:
| Unless the LLMs see multiple leaps in capability, probably
| indefinitely. The Malthusians in this thread seem to think that
| LLMs are going to fix the human problems involved in executing
| these businesses - they won't. They make good programmers more
| productive and will cost some jobs at the margins, but it will
| be the low-level programming work that was previously
| outsourced to Asia and South America for cost-arbitrage.
| mrdependable wrote:
| I think they will have to figure out how to get around context
| limits before that happens. I also wouldn't be surprised if the
| future models that can actually replace workers are sold at
| such an exorbitant price that only larger companies will be
| able to afford it. Everyone else gets access to less capable
| models that still require someone with knowledge to get to an
| end result.
| kirykl wrote:
| If it's any consolation, Agile priests and middle managers will
| be the first to go
| braden-lk wrote:
| If people constantly have to ask if your test is a measure of
| AGI, maybe it should be renamed to something else.
| OfficialTurkey wrote:
| From the post
|
| > Passing ARC-AGI does not equate to achieving AGI, and, as a
| matter of fact, I don't think o3 is AGI yet. o3 still fails on
| some very easy tasks, indicating fundamental differences with
| human intelligence.
| cchance wrote:
| It's funny when they say this, as if all humans can solve
| basic-ass question/answer combos. People seem to forget
| there's a percentage of the population that honestly believes
| the world is flat, along with other hallucinations at the
| human level.
| jppittma wrote:
| I don't believe AGI at that level has any commercial value.
| modeless wrote:
| Congratulations to Francois Chollet on making the most
| interesting and challenging LLM benchmark so far.
|
| A lot of people have criticized ARC as not being relevant or
| indicative of true reasoning, but I think it was exactly the
| right thing. The fact that scaled reasoning models are finally
| showing progress on ARC proves that what it measures really is
| relevant and important for reasoning.
|
| It's obvious to everyone that these models can't perform as well
| as humans on everyday tasks despite blowout scores on the hardest
| tests we give to humans. Yet nobody could quantify exactly the
| ways the models were deficient. ARC is the best effort in that
| direction so far.
|
| We don't need more "hard" benchmarks. What we need right now are
| "easy" benchmarks that these models nevertheless fail. I hope
| Francois has something good cooked up for ARC 2!
| dtquad wrote:
| Are there any single-step non-reasoner models that do well on
| this benchmark?
|
| I wonder how well the latest Claude 3.5 Sonnet does on this
| benchmark and if it's near o1.
| throwaway71271 wrote:
| Name                                  | Semi-private eval | Public eval
| --------------------------------------|-------------------|------------
| Jeremy Berman                         | 53.6%             | 58.5%
| Akyurek et al.                        | 47.5%             | 62.8%
| Ryan Greenblatt                       | 43%               | 42%
| OpenAI o1-preview (pass@1)            | 18%               | 21%
| Anthropic Claude 3.5 Sonnet (pass@1)  | 14%               | 21%
| OpenAI GPT-4o (pass@1)                | 5%                | 9%
| Google Gemini 1.5 (pass@1)            | 4.5%              | 8%
|
| https://arxiv.org/pdf/2412.04604
| kandesbunzler wrote:
| Why is this missing the o1 release / o1 pro models? I'd
| love to know how much better they are.
| YetAnotherNick wrote:
| Here are the results for base models[1], shown as score on the
| semi-private eval / score on the public eval:
|
|   o3 (coming soon)     75.7% / 82.8%
|   o1-preview           18%   / 21%
|   Claude 3.5 Sonnet    14%   / 21%
|   GPT-4o               5%    / 9%
|   Gemini 1.5           4.5%  / 8%
|
| [1]: https://arcprize.org/2024-results
| simonw wrote:
| I'd love to know how Claude 3.5 Sonnet does so well despite
| (presumably) not having the same tricks as the o-series
| models.
| Bjorkbat wrote:
| It's easy to miss, but if you look closely at the first
| sentence of the announcement they mention that they used a
| version of o3 trained on a public dataset of ARC-AGI, so
| technically it doesn't belong on this list.
| refulgentis wrote:
| This emphasizes persons and a self-conceived victory narrative
| over the ground truth.
|
| Models have regularly made progress on it; this is not new with
| the o-series.
|
| Doing astoundingly well on it, and having a mutually shared PR
| interest with OpenAI in this instance, doesn't mean a pile of
| visual puzzles is actually AGI or some well thought out and
| designed benchmark of True Intelligence(tm). It's one type of
| visual puzzle.
|
| I don't mean to be negative, but to inject a memento mori. The
| real story is that some guys got together and rode off Chollet's
| name with some visual puzzles from ye olde IQ test, and the deal
| was that Chollet then gets to show up and say it proves program
| synthesis is required for True Intelligence.
|
| Getting this score is extremely impressive, but I don't assign
| more signal to it than to any other benchmark with some thought
| put into it.
| modeless wrote:
| Solving ARC doesn't mean we have AGI. Also o3 presumably
| isn't doing program synthesis, seemingly proving Francois
| wrong on that front. (Not sure I believe the speculation
| about o3's internals in the link.)
|
| What I'm saying is that the fact that models score better on ARC
| as they get better at reasoning proves that it _is_ measuring
| something related to reasoning. And nobody else has come up with
| a comparable benchmark that is so easy for humans and so hard
| for LLMs - even today, let alone five years ago when ARC was
| released. ARC was visionary.
| hdjjhhvvhga wrote:
| Your argument seems convincing, but I'd like to offer a
| competing narrative: any benchmark that is public becomes
| completely useless because companies optimize for it -
| especially in AI, which depends on piles of money and needs
| some proof of progress.
|
| That's why I have some private benchmarks, and I'm sorry to
| say that the transition from GPT-4 to o1 wasn't
| unambiguously a step forward (in some tasks yes, in some
| not).
|
| On the other hand, private benchmarks are even less useful
| to the general public than the public ones, so we have to
| deal with what we have - but many of us just treat it as
| noise and don't give it much significance. Ultimately, the
| models should defend themselves by performing the tasks
| individual users want them to do.
| stonemetal12 wrote:
| Rather, any logic puzzle you post on the internet as
| something AIs are bad at ends up in the next round of training
| data, so AIs get better at that specific question - not
| because AI companies are optimizing for a benchmark, but
| because they suck up everything.
| modeless wrote:
| ARC has two test sets that are not posted on the
| Internet. One is kept completely private and never
| shared. It is used when testing open source models and
| the models are run locally with no internet access. The
| other test set is used when testing closed source models
| that are only available as APIs. So it could be leaked in
| theory, but it is still not posted on the internet and
| can't be in any web crawls.
|
| You could argue that the models can get an advantage by
| looking at the training set which is on the internet. But
| all of the tasks are unique and generalizing from the
| training set to the test set is the whole point of the
| benchmark. So it's not a serious objection.
| QuantumGood wrote:
| Gaming the benchmarks usually needs to be considered first
| when evaluating new results.
| chaps wrote:
| Honestly, is gaming benchmarks actually a problem in this
| space if it still shows something useful? It just means
| we need more benchmarks, yeah? It really feels not unlike
| Kaggle competitions.
|
| We do the exact same thing with real people in
| programming challenges and such, where people just study
| common interview questions rather than learning the
| material holistically. And since we know that people game
| these interview-type questions, we can adjust the
| interview processes to minimize gamification... which
| itself leads to gamification and back to step one. That's
| not an ideal feedback loop of course, but people still
| get jobs and churn out "productive work" out of it.
| ben_w wrote:
| AI are very good at gaming benchmarks. Both as
| overfitting and as Goodhart's law, gaming benchmarks has
| been a core problem during training for as long as I've
| been interested in the field.
|
| Sometimes this manifests as "outside the box thinking",
| like how a genetic algorithm got an "oscillator" which
| was really just an antenna.
|
| It is a hard problem, and yes we still both need and can
| make more and better benchmarks; but it's still a problem
| because it means the benchmarks we do have are
| overstating competence.
| CamperBob2 wrote:
| The _idea_ behind this particular benchmark, at least, is
| that it can't be gamed. What are some ways to game
| ARC-AGI, meaning to pass it without developing the required
| internal model and insights?
|
| In principle you can't optimize specifically for ARC-AGI,
| train against it, or overfit to it, because only a few of
| the puzzles are publicly disclosed.
|
| Whether it lives up to that goal, I don't know, but their
| approach sounded good when I first heard about it.
| psb217 wrote:
| Well, with billions in funding you could task a hundred
| or so very well paid researchers to do their best at
| reverse engineering the general thought process which
| went into ARC-AGI, and then generate fresh training data
| and labeled CoTs until the numbers go up.
| CamperBob2 wrote:
| Right, but the ARC-AGI people would counter by saying
| they're welcome to do just that. In doing so -- again in
| their view -- the researchers would create a model that
| could be considered capable of AGI.
|
| I spent a couple of hours looking at the publicly-
| available puzzles, and was really impressed at how much
| room for creativity the format provides. Supposedly the
| puzzles are "easy for humans," but some of them were
| not... at least not for me.
|
| (It did occur to me that a better test of AGI might be
| the ability to generate new, innovative ARC-AGI puzzles.)
| chaps wrote:
| We're in agreement!
|
| What's endlessly interesting to me with all of this is
| how surprisingly quick the benchmarking feedback loops
| have become plus the level of scrutiny each one receives.
| We (as a culture/society/whatever) don't really treat
| human benchmarking criteria with the same scrutiny such
| that feedback loops are useful and lead to productive
| changes to the benchmarking system itself. So from that
| POV it feels like substantial progress continues to be
| made through these benchmarks.
| bubblyworld wrote:
| I think gaming the benchmarks is _encouraged_ in the ARC-
| AGI context. If you look at the public test cases you'll
| see they test a ton of pretty abstract concepts - space,
| colour, basic laws of physics like gravity/magnetism,
| movement, identity, and lots of other stuff (I highly
| recommend exploring them). Getting an AI to do well _at
| all_, regardless of whether it was gamed or not, is the
| whole challenge!
| refulgentis wrote:
| > Solving ARC doesn't mean we have AGI. Also o3 presumably
| isn't doing program synthesis, seemingly proving Francois
| wrong on that front.
|
| Agreed.
|
| > And nobody else has come up with a comparable benchmark
| that is so easy for humans and so hard for LLMs.
|
| ? There's plenty.
| modeless wrote:
| I'd love to hear about more. Which ones are you thinking
| of?
| refulgentis wrote:
| - "Are You Human" https://arxiv.org/pdf/2410.09569 is
| designed to be directly on target, i.e. cross cutting set
| of questions that are easy for humans, but challenging
| for LLMs, Instead of one type of visual puzzle. Much
| better than ARC for the purpose you're looking for.
|
| - SimpleBench https://simple-bench.com/ (similar to
| above; great landing page w/scores that show human / ai
| gap)
|
| - PIQA (physical question answering, i.e. "how do I get a
| yolk out of a water bottle"); a common favorite of local LLM
| enthusiasts in /r/localllama:
| https://paperswithcode.com/dataset/piqa
|
| - Berkeley Function-Calling (I prefer
| https://gorilla.cs.berkeley.edu/leaderboard.html)
|
| AI search googled "llm benchmarks challenging for ai easy
| for humans", and "language model benchmarks that humans
| excel at but ai struggles with", and "tasks that are easy
| for humans but difficult for natural language ai".
|
| It also mentioned that Moravec's Paradox is a known framing of
| this concept. I started going down that rabbit hole because
| the resources were fascinating, but had to hold back and
| submit this reply first. :)
| modeless wrote:
| Thanks for the pointers! I hadn't seen Are You Human.
| Looks like it's only two months old. Of course it is much
| easier to design a test specifically to thwart LLMs now
| that we have them. It seems to me that it is designed to
| exploit details of LLM structure like tokenizers (e.g.
| character counting tasks) rather than to provide any sort
| of general reasoning benchmark. As such it seems
| relatively straightforward to improve performance in ways
| that wouldn't necessarily represent progress in general
| reasoning. And today's LLMs are not nearly as far from
| human performance on the benchmark as they were on ARC
| for many years after it was released.
|
| SimpleBench looks more interesting. Also less than two
| months old. It doesn't look as challenging for LLMs as
| ARC, since o1-preview and Sonnet 3.5 already got half of
| the human baseline score; they did much worse on ARC. But
| I like the direction!
|
| PIQA is cool but not hard enough for LLMs.
|
| I'm not sure Berkeley Function-Calling represents tasks
| that are "easy" for average humans. Maybe programmers
| could perform well on it. But I like ARC in part because
| the tasks do seem like they should be quite
| straightforward even for non-expert humans.
|
| Moravec's paradox isn't a benchmark per se. I tend to
| believe that there is no real paradox and all we need is
| larger datasets to see the same scaling laws that we have
| for LLMs. I see good evidence in this direction:
| https://www.physicalintelligence.company/blog/pi0
| CamperBob2 wrote:
| How long has SimpleBench been posted? Out of the first 6
| questions at https://simple-bench.com/try-yourself,
| o1-pro got 5/6 right.
|
| It was interesting to see how it failed on question 6:
| https://chatgpt.com/c/6765e70e-44b0-800b-97bd-928919f04fbe
|
| Apparently LLMs do not consider global thermonuclear war
| to be all that big a deal, for better or worse.
| Pannoniae wrote:
| Don't worry, I also got that wrong :) I thought her
| affair would be the biggest problem for John.
| stego-tech wrote:
| I won't be as brutal in my wording, but I agree with the
| sentiment. This was something drilled into me as someone with
| a hobby in PC Gaming _and_ Photography: benchmarks, while
| handy measures of _potential_ capabilities, are not
| _guarantees_ of real world performance. Very few PC gamers
| completely reinstall the OS before benchmarking to remove all
| potential cruft or performance impacts, just as very few
| photographers exclusively take photos of test materials.
|
| While I appreciate the benchmark and its goals (not to
| mention the puzzles - I quite enjoy figuring them out),
| successfully passing this benchmark does not demonstrate or
| guarantee real world capabilities or performance. This is why
| I increasingly side-eye this field and its obsession with
| constantly passing benchmarks and then moving the goal posts
| to a newer, harder benchmark that claims to be a better
| simulation of human capabilities than the last one: it reeks
| of squandered capital and a lack of a viable/profitable
| product, at least to my sniff test. Rather than simply
| capitalize on their actual accomplishments (which LLMs are -
| natural language interaction is huge!), they're trying to
| prove to Capital that with a few (hundred) billion more in
| investments, they can make AGI out of this and replace all
| those expensive humans.
|
| They've built the most advanced prediction engines ever
| conceived, and insist they're best used to replace labor. I'm
| not sure how they reached that conclusion, but considering
| even their own models refute this use case for LLMs, I doubt
| their execution ability on that lofty promise.
| danielmarkbruce wrote:
| 100%. The hype is misguided. I doubt half the people excited
| about the result have even looked at what the benchmark is.
| Balgair wrote:
| Complete aside here: I used to do work with amputees and
| prosthetics. There is a standardized test (and I just cannot
| remember the name) that fits in a briefcase. It's used for
| measuring the level of damage to the upper limbs and for
| prosthetic grading.
|
| Basically, it's got the dumbest and simplest things in it.
| Stuff like a lock and key, a glass of water and jug, common
| units of currency, a zipper, etc. It tests whether you can do
| those common human tasks: pouring a glass of water,
| picking up coins from a flat surface (I chew off my nails, so
| even an able person like me fails that), zipping up a jacket,
| locking your own door, putting on lipstick, etc.
|
| We had hand prosthetics that could play Mozart at 5x speed on a
| baby grand, but could not pick up a silver dollar or zip a
| jacket even a little bit. To the patients, the hands were
| therefore about as useful as a metal hook (a common solution
| with amputees today, not just pirates!).
|
| Again, a total aside here, but your comment just reminded me of
| that brown briefcase. Life, it turns out, is a lot more complex
| than we give it credit for. Even pouring the OJ can be, in rare
| cases, transcendent.
| m463 wrote:
| It would be interesting to see trick questions.
|
| Like in your test
|
| a hand grenade and a pin - don't pull the pin.
|
| Or maybe a mousetrap? but maybe that would be defused?
|
| in the ai test...
|
| or Global Thermonuclear War, the only winning move is...
| sdenton4 wrote:
| to move first!
| m463 wrote:
| oh crap. lol!
| HPsquared wrote:
| Gaming streams being in the training data, it might pull
| the pin because "that's what you do".
| 8note wrote:
| or, because it has to give an output, and pulling the pin
| is the only option
| TeMPOraL wrote:
| There's also the option of not pulling the pin, and
| shooting your enemies as they instinctively run from what
| they think is a live grenade. Saw it on a TV show the
| other day.
| ubj wrote:
| There's a lot of truth in this. I sometimes joke that robot
| benchmarks should focus on common household chores. Given a
| basket of mixed laundry, sort and fold everything into
| organized piles. Load a dishwasher given a sink and counters
| overflowing with dishes piled up haphazardly. Clean a bedroom
| that kids have trashed. We do these tasks almost without
| thinking, but the unstructured nature presents challenges for
| robots.
| Balgair wrote:
| I maintain that whoever invents a robust laundry _folding_
| robot will be a trillionaire. In that, I dump jumbled clean
| clothes straight from a dryer at it, and out come folded
| and sorted clothes (and those loner socks). I know we're
| getting close, but I also know we're not there yet.
| oblio wrote:
| Laundry folding and laundry ironing, I would say.
| musicale wrote:
| Hopefully it will detect whether a small child is inside or
| not.
| imafish wrote:
| > I maintain that whoever invents a robust laundry
| folding robot will be a trillionaire
|
| ... so Elon Musk? :D
| jessekv wrote:
| I want it to lay out an outfit every day too. Hopefully
| without hallucination.
| stefs wrote:
| it's not hallucination, it's high fashion
| tanseydavid wrote:
| Yes, but the stupid robot laid out your Thursday-black-
| Turtleneck for you on Saturday morning. That just won't
| suffice.
| yongjik wrote:
| I can live without folding laundry (I can just shove my
| undershirts in the closet, who cares if it's not folded),
| but whoever manufactures a reliable auto-loading
| dishwasher will have my dollars. Like, just put all your
| dishes in the sink and let the machine handle them.
| Brybry wrote:
| But if your dishwasher is empty, it takes nearly the same
| amount of time/effort to put dishes straight into the
| dishwasher as it does to put them in the sink.
|
| I think I'd only really save time by having a robot that
| could unload my dishwasher and put up the clean dishes.
| namibj wrote:
| That's called a second dishwasher: one is for taking out,
| the other for putting in. When the latter is full, turn
| it on, dirty dishes wait outside until the cycle
| finishes, when the dishwashers switch roles.
| ptsneves wrote:
| I thought about this and it gets even better. You do not
| really need shelves as you just use the clean dishwasher
| as the storage place. I honestly don't know why this is
| not a thing in big or wealthy homes.
| jannyfer wrote:
| Another thing that bothers me is that dishwashers are
| low. As I get older, I'm finding it really annoying to
| bend down.
|
| So get me a counter-level dishwasher cabinet and I'll be
| happy!
| yongjik wrote:
| Hmm, that doesn't match my experience. It takes me a lot
| more time to put dishes into the dishwasher, because it
| has different places for cutlery, bowls, dishes, and so
| on, and of course the existing structure never matches my
| bowls' size perfectly so I have to play tetris or run it
| with only 2/3 filled (which will cause me to waste more
| time as I have to do dishes again sooner).
|
| And that's before we get to bits of sticky rice left on
| bowls, which somehow dishwashers never scrape off clean.
| YMMV.
| HPsquared wrote:
| 1. Get a set of dishes that does fit nicely together in
| the dishwasher.
|
| 2. Start with a cold prewash, preferably with a little
| powder in there too. This massively helps with stubborn
| stuff.
| nradov wrote:
| There is the Foldimate robot. I don't know how well it
| works. It doesn't seem to pair up socks. (Deleted the web
| link, it might not be legitimate.)
| smokel wrote:
| Beware, this website is probably a scam.
|
| Foldimate went bankrupt in 2021 [1], and the domain
| referral from foldimate.com to a 404 page at miele.com
| suggests that it was Miele who bought up the remains, not
| a sketchy company with a ".website" top-level domain.
|
| [1] https://en.wikipedia.org/wiki/FoldiMate
| smokel wrote:
| We are certainly getting close! In 2010, watching PR2
| fold some unseen towels was similar to watching paint dry
| [1], but we can now enjoy robots attaining lazy student-
| level laundry folding in real time, as demonstrated by
| pi0 [2].
|
| [1] https://www.youtube.com/watch?v=gy5g33S0Gzo
|
| [2] https://www.physicalintelligence.company/blog/pi0
| sss111 wrote:
| Honestly, a robot that can hang jumbled clean clothes
| instead of folding them would be good enough; it's crazy
| that we don't even have those.
| dweekly wrote:
| I was a believer in Gal's FoldiMate but sadly
| it...folded.
|
| https://en.m.wikipedia.org/wiki/FoldiMate
| blargey wrote:
| At this point I'm not sure we'll actually get a task-
| specific machine for laundry folding/sorting before
| humanoid robots gain the capability to do it well enough.
| zamalek wrote:
| Slightly tangential, we already have amazing laundry
| robots. They are called washing and drying machines. We
| don't give these marvels enough credit, mostly because they
| aren't shaped like humans.
|
| Humanoid robots are mostly a waste of time. Task-shaped
| robots are _much_ easier to design, build, and maintain...
| and are more reliable. Some of the things you mention might
| needs humanoid versatility (loading the dishwasher), others
| would be far better served by purpose-built robots (laundry
| sorting).
| jkaptur wrote:
| I'm embarrassed to say that I spent a few moments
| daydreaming about a robot that could wash my dishes. Then
| I thought about what to call it...
| musicale wrote:
| Sadly current "dishwasher" models are neither self-
| loading nor unloading. (Seems like they should be able to
| take a tray of dishes, sort them, load them, and stack
| them after cleaning.)
|
| Maybe "busbot" or "scullerybot".
| vidarh wrote:
| The problem is more doing it in sufficiently little
| space, and using little enough water and energy. Making
| one that you just feed dishes individually and that
| immediately washes them and feeds them to storage should be
| entirely viable, but it'd be wasteful, and it'd compete
| with people having multiple small drawer-style
| dishwashers, offering relatively little convenience over
| that.
|
| It seems most people aren't willing to pay for multiple
| dishwashers - even multiple small ones - or to set aside
| enough space, and that places severe constraints on
| trying to do better.
| wsintra2022 wrote:
| Was it a dishwasher? Just give it all your unclean dishes
| and tell it to go; come back an hour later and they're all
| washed and mostly dried!
| rytis wrote:
| I agree. I don't know where this obsession comes from -
| the obsession with resembling humans as closely as possible.
| We're so far from being perfect. If you need proof, just
| look at your teeth. Yes, we're relatively universal, but
| a screwdriver is more efficient at driving in screws than
| our fingers. So please, stop wasting time building
| perfect universal robots; build more purpose-built ones.
| Nevermark wrote:
| Given we have shaped so many tasks to fit our bodies, it
| will be a long time before a bot able to do a
| variety/majority of human tasks the human way won't be
| valuable.
|
| 1000 machines specialized for 1000 tasks are great, but
| don't deliver the same value as a single bot that can
| interchange with people flexibly.
|
| Costly today, but it won't be forever.
| golol wrote:
| The shape doesn't matter! Non-humanoid shapes give minor
| advantages on specific tasks, but for a general robot
| you'll have a hard time finding a shape much more optimal
| than humanoid. And if you go with humanoid, you have so
| much data available! Videos contain the information about
| which movements a robot should execute. Teleoperation is
| easy. This is the bitter lesson! The shape doesn't
| matter; any shape will work with the right architecture,
| data, and training!
| rowanG077 wrote:
| Purpose-built robots are basically solved: dishwashers,
| laundry machines, assembly robots, etc. The moat is a
| general-purpose robot that can do what a human can do.
| graemep wrote:
| Great examples. They are simple, reliable, efficient and
| effective. Far better than blindly copying what a human
| being does. Maybe there are equally clever ways of doing
| things like folding clothes.
| ecshafer wrote:
| I had a pretty bad case of tendinitis once, that basically
| made my thumb useless since using it would cause extreme
| pain. That test seems really good. I could use a computer
| keyboard without any issue, but putting a belt on or pouring
| water was impossible.
| vidarh wrote:
| I had a swollen elbow a short while ago, and the number of
| things I'd never thought about that were affected by
| reduced elbow joint mobility and an inability to put
| pressure on the elbow was disturbing.
| CooCooCaCha wrote:
| That's why the goal isn't just benchmark scores, it's
| _reliable_ and robust intelligence.
|
| In that sense, the goalposts haven't moved in a long time
| despite claims from AI enthusiasts that people are constantly
| moving goalposts.
| croemer wrote:
| > We had hand prosthetics that could play Mozart at 5x speed
| on a baby grand, but could not pick up a silver dollar or zip
| a jacket even a little bit.
|
| I must be missing something, how can they be able to play
| Mozart at 5x speed with their prosthetics but not zip a
| jacket? They could press keys but not do tasks requiring
| feedback?
|
| Or did you mean they used to play Mozart at 5x speed before
| they became amputees?
| rahimnathwani wrote:
| Imagine a prosthetic 'hand' that has 5 regular fingers,
| rather than 4 fingers and a thumb. It would be able to play
| a piano just fine, but be unable to grasp anything small,
| like a zipper.
| numpad0 wrote:
| Thumb not opposable?
| 8note wrote:
| zipping up a jacket is really hard to do, and requires very
| precise movements and coordination between hands.
|
| playing mozart is much more forgiving in terms of the
| number of different motions you have to make in different
| directions, the amount of pressure to apply, and even the
| black keys are much bigger than large sized zipper tongues.
| Balgair wrote:
| Pretty much. The issue with zippers is that the fabric
| moves about in unpredictable ways. Piano playing was just
| movement programs. Zipping required (surprisingly) fast
| feedback. Also, gripping is somewhat tough compared to
| pressing.
| ben_w wrote:
| Playing a piano involves pushing down on the right keys
| with the right force at the right time, but that could be
| pre-programmed well before computers. The self-playing
| piano in the saloon in Westworld wasn't a _huge_
| anachronism, such things slightly overlapped with the Wild
| West era: https://en.wikipedia.org/wiki/Player_piano
|
| Picking up a 1mm thick metal disk from a flat surface
| requires the user to give the exact time, place, and force,
| and I'm not even sure what considerations it needs for
| surface materials (e.g. slightly squishy fake skin) and/or
| tip shapes (e.g. fake nails).
| numpad0 wrote:
| > Picking up a 1mm thick metal disk from a flat surface
| requires the user to give the exact time, place, and force
|
| Place, sure, but can't you cheat a bit on time and force
| with compliance ("impedance control")?
| ben_w wrote:
| In theory, apparently not in practice.
| oblio wrote:
| I'm far from a piano player, but I can definitely push
| piano keys quite quickly, while zipping up my jacket when
| it's cold and/or wet outside is really difficult.
|
| Even more so for picking up coins from a flat surface.
|
| For robotics, it's kind of obvious, speed is rarely an
| issue, so the "5x" part is almost trivial. And you can
| program the sequence quite easily, so that's also doable.
| Piano keys are big and obvious and an ergonomically
| designed interface meant to be relatively easy to press,
| ergo easy even for a prosthetic. A small coin on a flat
| surface is far from ergonomic.
| croemer wrote:
| But how do you deliberately control those fingers to
| actually play what you have in mind, rather than
| something preprogrammed? Surely the idea of a prosthetic
| does not just mean "a robot that is connected to your
| body", but something that the owner controls with their
| mind.
| vidarh wrote:
| Nobody said anything about deliberately controlling those
| fingers to play yourself. Clearly it's not something you
| do for the sake of the enjoyment of playing, but more
| likely a demonstration of the dexterity of the prosthesis
| and ability to program it for complex tasks.
|
| The idea of a prosthesis is to help you regain
| functionality. If the best way of doing that is through
| automation, then it'd make little sense not to.
| yongjik wrote:
| I play piano as a hobby, and the funny thing is, if my
| hands are so cold that I can't zip up my jacket, there's
| no way I can play anything well. I know it's not quite
| zipping up jackets ;) but a human playing the piano does
| require a fast feedback loop.
| n144q wrote:
| Well, you see, while the original comment says they could
| play at 5x speed, it does not say they played at that speed
| _well_ or played it beautifully. Any teacher or any student
| who has learned piano for a while will tell you that this
| matters a lot, especially for classical music -- being able
| to accurately play at an even tempo with the correct
| dynamics and articulation is hard and is what
| differentiates a beginner/intermediate player from an
| advanced one. In fact, one mistake many students make is
| playing a piece too fast when they are not ready, and
| teachers really want students to practice very slowly.
|
| My point is -- being able to zip a jacket is all about
| those subtle actions, and could actually be harder than
| "just" playing piano fast.
| alexose wrote:
| It feels like there's a whole class of information that is
| easily shorthanded, but really hard to explain to novices.
|
| I think a lot about carpentry. From the outside, it's pretty
| easy: Just make the wood into the right shape and stick it
| together. But as one progresses, the intricacies become more
| apparent. Variations in the wood, the direction of the grain,
| the seasonal variations in thickness, joinery techniques that
| are durable but also time efficient.
|
| The way this information connects is highly multisensory and
| multimodal. I now know which species of wood to use for which
| applications. This knowledge was hard won through many, many
| mistakes and trials that took place at my home, the hardware
| store, the lumberyard, on YouTube, from my neighbor Steve,
| and in books written by experts.
| Method-X wrote:
| Was it the Southampton hand assessment procedure?
| Balgair wrote:
| Yes! Thank you!
|
| https://www.shap.ecs.soton.ac.uk/
| oblio wrote:
| This was actually discovered quite early on in the history of
| AI:
|
| > Rodney Brooks explains that, according to early AI
| research, intelligence was "best characterized as the things
| that highly educated male scientists found challenging", such
| as chess, symbolic integration, proving mathematical theorems
| and solving complicated word algebra problems. "The things
| that children of four or five years could do effortlessly,
| such as visually distinguishing between a coffee cup and a
| chair, or walking around on two legs, or finding their way
| from their bedroom to the living room were not thought of as
| activities requiring intelligence."
|
| https://en.wikipedia.org/wiki/Moravec%27s_paradox
| bawolff wrote:
| I don't know why people always feel the need to gender
| these things. Highly educated female scientists generally
| find the same things challenging.
| robocat wrote:
| I don't know why anyone would blame people as though
| someone is making an explicit choice. I find your choice
| of words to be insulting to the OP.
|
| We learn our language and stereotypes subconsciously from
| our society, and it is no easy thing to fight against
| that.
| Barrin92 wrote:
| >I don't know why people always feel the need to gender
| these things
|
| Because it's relevant to the point being made, i.e. that
| these tests reflect the biases and interests of the
| people who make them. This is true not just for AI tests,
| but for intelligence tests applied to humans. That Demis
| Hassabis, a chess player and video game designer, decided
| to test his machine on video games, Go and chess probably
| is not an accident.
|
| The more interesting question is why people respond so
| apprehensively to pointing out a very obvious problem and
| bias in test design.
| drdrey wrote:
| I think assembling Legos would be a cool robot benchmark: you
| need to parse the instructions, locate the pieces you need,
| pick them up, orient them, snap them to your current
| assembly, visually check if you achieved the desired state,
| repeat
| throwup238 wrote:
| This is expressed in AI research as Moravec's paradox:
| https://en.wikipedia.org/wiki/Moravec%27s_paradox
|
| Getting to LLMs that could talk to us turned out to be a lot
| easier than making something that could control even a
| robotic arm without precise programming, let alone a
| humanoid.
| MarcelOlsz wrote:
| >We had hand prosthetics that could play Mozart at 5x speed
| on a baby grand
|
| I'd love to know more about this.
| xnx wrote:
| Despite lack of fearsome teeth or claws, humans are _way_ op
| due to brain, hand dexterity, and balance.
| lossolo wrote:
| > making the most interesting and challenging LLM benchmark so
| far.
|
| This[1] is currently the most challenging benchmark. I would
| like to see how O3 handles it, as O1 solved only 1%.
|
| 1. https://epoch.ai/frontiermath/the-benchmark
| pynappo wrote:
| Apparently o3 scored about 25%
|
| https://youtu.be/SKBG1sqdyIU?t=4m40s
| FiberBundle wrote:
| This is actually the result that I find way more
| impressive. Elite mathematicians think these problems are
| challenging and thought they were years away from being
| solvable by AI.
| modeless wrote:
| You're right, I was wrong to say "most challenging" as there
| have been harder ones coming out recently. I think the
| correct statement would be "most challenging long-standing
| benchmark" as I don't believe any other test designed in 2019
| has resisted progress for so long. FrontierMath is only a
| month old. And of course the real key feature of ARC is that
| it is easy for humans. FrontierMath is (intentionally) not.
| skywhopper wrote:
| "The fact that scaled reasoning models are finally showing
| progress on ARC proves that what it measures really is relevant
| and important for reasoning."
|
| Not sure I understand how this follows. The fact that a certain
| type of model does well on a certain benchmark means that the
| benchmark is relevant for real-world reasoning? That doesn't
| make sense.
| munchler wrote:
| It shows objectively that the models are getting better at
| some form of reasoning, which is at least worth noting.
| Whether that improved reasoning is relevant for the real
| world is a different question.
| moffkalast wrote:
| It shows objectively that one model got better at this
| specific kind of weird puzzle that doesn't translate to
| anything because it is just a pointless pattern matching
| puzzle that can be trained for, just like anything else. In
| fact they specifically trained for it, they say so upfront.
|
| It's like the modern equivalent of saying "oh when AI
| solves chess it'll be as smart as a person, so it's a good
| benchmark" and we all know how that nonsense went.
| munchler wrote:
| Hmm, you could be right, but you could also be very
| wrong. Jury's still out, so the next few years will be
| interesting.
|
| Regarding the value of "pointless pattern matching" in
| particular, I would refer you to Douglas Hofstadter's
| discussion of Bongard problems starting on page 652 of
| _Godel, Escher, Bach_. Money quote: "I believe that the
| skill of solving Bongard [pattern recognition] problems
| lies very close to the core of 'pure' intelligence, if
| there is such a thing."
| jug wrote:
| I liked the SimpleQA benchmark that measures hallucinations.
| OpenAI models did surprisingly poorly, even o1. In fact, it
| looks like OpenAI often does well on benchmarks by taking the
| shortcut of being more risk-prone than both Anthropic and Google.
| zone411 wrote:
| It's the least interesting benchmark for language models among
| all they've released, especially now that we already had a
| large jump in its best scores this year. It might be more
| useful as a multimodal reasoning task since it clearly involves
| visual elements, but with o3 already performing so well, this
| has proven unnecessary. ARC-AGI served a very specific purpose
| well: showcasing tasks where humans easily outperformed
| language models, so these simple puzzles had their uses. But
| tasks like proving math theorems or programming are far more
| impactful.
| danielmarkbruce wrote:
| Highly challenging for LLMs because it has nothing to do with
| language. LLMs and their training processes have all kinds of
| optimizations for language and how it's presented.
|
| This benchmark has done a wonderful job with marketing by
| picking a great name. It's largely irrelevant for LLMs despite
| the fact it's difficult.
|
| Consider how much of the model is just noise for a task like
| this given the low amount of information in each token and the
| high embedding dimensions used in LLMs.
| adamgordonbell wrote:
| There is a benchmark, NovelQA, that LLMs don't dominate when it
| feels like they should. The benchmark is to read a novel and
| answer questions about it.
|
| LLMs score below the human baseline, as of when I last looked,
| but the benchmark doesn't get much attention.
|
| Once it is passed, I'd like to see one that is solving the
| mystery in a mystery book right before it's revealed.
|
| We'd need unpublished mystery novels to use for that benchmark,
| but I think it gets at what I think of as reasoning.
|
| https://novelqa.github.io/
| CamperBob2 wrote:
| Does it work on short stories, but not novels? If so, then
| that's just a minor question of context length that should
| self-resolve over time.
| adamgordonbell wrote:
| The books fit in current long-context models, so it's not
| merely a context-size constraint, though the length is part
| of the issue, for sure.
| meta_x_ai wrote:
| Looks like it's not updated for nearly a year and I'm
| guessing Gemini 2.0 Flash with 2m context will simply crush
| it
| adamgordonbell wrote:
| That's true. They don't have Claude 3.5 on there either. So
| maybe it's not relevant anymore, but I'm not sure.
|
| If so, let's move on to the murder mysteries or more
| complex literary analysis.
| wilg wrote:
| fun! the benchmarks are so interesting because real world use is
| so variable. sometimes 4o will nail a pretty difficult problem,
| other times o1 pro mode will fail 10 times on what i would think
| is a pretty easy programming problem and i waste more time trying
| to do it with ai
| behnamoh wrote:
| So now not only are the models closed, but so are their evals?!
| This is a "semi-private" eval. WTH is that supposed to mean? I'm
| sure the model is great but I refuse to take their word for it.
| ZeroCool2u wrote:
| The private evaluation set is private from the public/OpenAI so
| companies can't train on those problems and cheat their way to
| a high score by overfitting.
| jsheard wrote:
| If the models run on OpenAIs servers then surely they could
| still see the questions being put into it if they wanted to
| cheat? That could only be prevented by making the evaluation
| a one-time deal that can't be repeated, or by having OpenAI
| distribute their models for evaluators to run themselves,
| which I doubt they're inclined to do.
| foobarqux wrote:
| Yes that's why it is "semi"-private: From the ARC website
| "This set is "semi-private" because we can assume that over
| time, this data will be added to LLM training data and need
| to be periodically updated."
|
| I presume evaluation on the test set is gated (you have to
| ask ARC to run it).
| cchance wrote:
| The evals are the questions/answers. ARC-AGI doesn't share the
| questions and answers for a portion so that models can't be
| trained on them. For the public ones... the public knows the
| questions, so there's a chance models could have been at least
| partially trained on the questions (if not the actual
| answers).
|
| That's how I understand it.
| neom wrote:
| Why would they give a cost estimate per task on their low compute
| mode but not their high mode?
|
| "low compute" mode: Uses 6 samples per task, Uses 33M tokens for
| the semi-private eval set, Costs $17-20 per task, Achieves 75.7%
| accuracy on semi-private eval
|
| The "high compute" mode: Uses 1024 samples per task (172x more
| compute), Cost data was withheld at OpenAI's request, Achieves
| 87.5% accuracy on semi-private eval
|
| Can we just extrapolate $3kish per task on high compute?
| (wondering if they're withheld because this isn't the case?)
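|
| (A rough back-of-the-envelope sketch of that extrapolation,
| assuming the per-task cost scales linearly with sample count;
| the 6/1024 sample counts and the $17-20 figure are from the
| post, the linear-scaling assumption is just a guess:)
|
|     # hypothetical linear extrapolation, not an official figure
|     low_cost_per_task = (17 + 20) / 2     # USD, 6 samples/task
|     multiplier = 1024 / 6                 # ~171x more samples
|     high_cost_per_task = low_cost_per_task * multiplier
|     print(f"~${high_cost_per_task:,.0f} per task")  # ~$3,157
|
| Which lands right around the "$3kish" guess, but only if cost
| really does scale linearly with samples.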
| WiSaGaN wrote:
| The withheld part is really a red flag for me. Why do you want
| to withhold a compute number?
| zebomon wrote:
| My initial impression: it's very impressive and very exciting.
|
| My skeptical impression: it's complete hubris to conflate ARC or
| any benchmark with truly general intelligence.
|
| I know my skepticism here is identical to moving goalposts. More
| and more I am shifting my personal understanding of general
| intelligence toward a phenomenon we will only ever be able to
| identify with the benefit of substantial retrospect.
|
| As it is with any sufficiently complex program, if you could
| discern the result beforehand, you wouldn't have had to execute
| the program in the first place.
|
| I'm not trying to be a downer on the 12th day of Christmas.
| Perhaps because my first instinct is childlike excitement, I'm
| trying to temper it with a little reason.
| amarcheschi wrote:
| I just googled ARC-AGI questions, and it looks like it is
| similar to an IQ test with Raven's matrices. Similar as in you
| have some examples of images before and after, then an image
| before, and you have to guess the after.
|
| Could anyone confirm whether this is the only kind of question
| in the benchmark? If yes, how come there is such a direct
| connection to "oh this performs better than humans" when LLMs
| can be quite a bit better than us at understanding and
| forecasting patterns? I'm just curious, not trying to stir up
| controversy.
| zebomon wrote:
| It's a test on which (apparently until now) the vast majority
| of humans have far outperformed all machine systems.
| patrickhogan1 wrote:
| But it's not a test that directly shows general
| intelligence.
|
| I am excited no less! This is huge improvement.
|
| How does this do on SWE Bench?
| og_kalu wrote:
| >How does this do on SWE Bench?
|
| 71.7%
| throwaway0123_5 wrote:
| I've seen this figure on a few tech news websites and
| reddit but can't find an official source. If it was in
| the video I must have missed it, where is this coming
| from?
| og_kalu wrote:
| It was in the video. I don't know if OpenAI has a page
| up yet
| ALittleLight wrote:
| Yes, it's pretty similar to Raven's. The reason it is an
| interesting benchmark is because humans, even very young
| humans, "get" the test in the sense of understanding what
| it's asking and being able to do pretty well on it - but LLMs
| have really struggled with the benchmark in the past.
|
| Chollet (one of the creators of the ARC benchmark) has been
| saying it proves LLMs can't reason. The test questions are
| supposed to be unique and not in the model's training set.
| The fact that LLMs struggled with the ARC challenge suggested
| (to Chollet and others) that models weren't "truly
| reasoning" but rather just completing based on things they'd
| seen before - when the models were confronted with things
| they hadn't seen before, the novel visual patterns, they
| really struggled.
| Eridrus wrote:
| ML is quite good at understanding and forecasting patterns
| when you train on the data you want to forecast. LLMs manage
| to do so much because we just decided to train on everything
| on the internet and hope that it included everything we ever
| wanted to know.
|
| This tries to create patterns that are intentionally not in
| the data and see if a system can generalize to them, which o3
| super impressively does!
| yunwal wrote:
| ARC is in the dataset though? I mean I'm aware that there
| are new puzzles every day, but there's still a very
| specific format and set of skills required to solve it. I'd
| bet a decent amount of money that humans get better at ARC
| with practice, so it seems strange to suggest that AI
| wouldn't.
| hansonkd wrote:
| It doesn't need to be general intelligence or perfectly map to
| human intelligence.
|
| All it needs to be is useful. Reading constant comments about
| how LLMs can't be general intelligence or lack reasoning, etc.,
| to me seems like people witnessing the airplane and complaining
| that
| it isn't "real flying" because it isn't a bird flapping its
| wings (a large portion of the population held that point of
| view back then).
|
| It doesn't need to be general intelligence for the rapid
| advancement of LLM capabilities to be the most societal
| shifting development in the past decades.
| zebomon wrote:
| I agree. If the LLMs we have today never got any smarter, the
| world would still be transformed over the next ten years.
| AyyEye wrote:
| > Reading constant comments about LLMs can't be general
| intelligence or lack reasoning etc, to me seems like people
| witnessing the airplane and complaining that it isn't "real
| flying" because it isn't a bird flapping its wings (a large
| portion of the population held that point of view back then).
|
| That is a natural reaction to the incessant techbro, AIbro,
| marketing, and corporate lies that "AI" (or worse AGI) is a
| real thing, and can be directly compared to real humans.
|
| There are people on this very thread saying it's better at
| reasoning than real humans (LOL) because it scored higher on
| some benchmark than humans... Yet this technology still can't
| reliably determine what number is circled, if two lines
| intersect, or count the letters in a word. (That said
| behaviour may have been somewhat fine-tuned out of newer
| models, which only reinforces the fact that the technology is
| inherently not capable of understanding _anything_.)
| IanCal wrote:
| I encounter "spicy auto complete" style comments far more
| often than techbro AI-everything comments and it's frankly
| getting boring.
|
| I've been doing AI things for about 20+ years and llms are
| wild. We've gone from specialized things being pretty bad
| as those jobs to general purpose things better at that and
| everything else. The idea you could make and API call with
| "is this sarcasm?" and get a better than chance guess is
| incredible.
| AyyEye wrote:
| Nobody is disputing the coolness factor, only the
| intelligence factor.
| hansonkd wrote:
| I'm saying the intelligence factor doesn't matter. Only
| the utility factor. Today LLMs are incredibly useful and
| every few months there appears to be bigger and bigger
| leaps.
|
| Analyzing whether or not LLMs have intelligence is
| missing the forest for the trees. This technology is
| emerging in a capitalist society that is hyper-optimized
| to adopt useful things at the expense of almost
| everything else. If the utility/price point gets hit for
| a problem, AI will replace it regardless of whether it is
| intelligent or not.
| surgical_fire wrote:
| Eh, I see far more "AI is the second coming of Jesus"
| type of comments than healthy skepticism. A lot of
| anxiety from people afraid that their source of income
| will dry up and a lot of excitement from people with an axe to
| grind that "those entitled expensive peasants will get
| what they deserve".
|
| I think I count myself among the skeptics nowadays for
| that reason. And I say this as someone that thinks LLM is
| an interesting piece of technology, but with somewhat
| limited use and unclear economics.
|
| If the hype was about "look at this thing that can parse
| natural language surprisingly well and generate coherent
| responses", I would be excited too. As someone that had
| to do natural language processing in the past, that is a
| damn hard task to solve, and LLMs excel at it.
|
| But that is not the hype is it? We have people beating
| the drums of how this is just shy of taking the world by
| storm, and AGI is just around the corner, and it will
| revolutionize all economy and society and nothing will
| ever be the same.
|
| So, yeah, it gets tiresome. I wish the hype would die
| down a little so this could be appreciated for what it
| is.
| williamcotton wrote:
| _We have people beating the drums of how this is just shy
| of taking the world by storm, and AGI is just around the
| corner, and it will revolutionize all economy and society
| and nothing will ever be the same._
|
| Where are you seeing this? I pretty much only read HN and
| football blogs so maybe I'm out of the loop.
| sensanaty wrote:
| In this very thread there are multiple people espousing
| their views that the high score here is proof that o3 has
| achieved AGI.
| handsclean wrote:
| People aren't responding to their own assumption that AGI is
| necessary, they're responding to OpenAI and the chorus
| constantly and loudly singing hymns to AGI.
| surgical_fire wrote:
| > to me seems like people witnessing the airplane and
| complaining that it isn't "real flying" because it isn't a
| bird flapping its wings
|
| To me it is more like there is someone jumping on a pogo ball
| while flapping their arms and saying that they are flying
| whenever they hop off the ground.
|
| Skeptics say that they are not really flying, while adherents
| say that "with current pogo ball advancements, they will be
| flying any day now"
| intelVISA wrote:
| Between skeptics and adherents who is more easily able to
| extract VC money for vaporware? If you limit yourself to
| 'the facts' you're leaving tons of $$ on the table...
| surgical_fire wrote:
| By all means, if this is the goal, AI is a success.
|
| I understand that in this forum too many people are
| invested in putting lipstick on this particular pig.
| PaulDavisThe1st wrote:
| An old quote, quite famous: "... is like saying that an ape
| who climbs to the top of a tree for the first time is one
| step closer to landing on the moon".
| DonHopkins wrote:
| Is that what Elon Musk was trying to do on stage?
| billyp-rva wrote:
| > It doesn't need to be general intelligence or perfectly map
| to human intelligence.
|
| > All it needs to be is useful.
|
| Computers were already useful.
|
| The only definition we have for "intelligence" is human (or,
| generally, animal) intelligence. If LLMs aren't that, let's
| call it something else.
| throwup238 wrote:
| What exactly is human (or animal) intelligence? How do you
| define that?
| billyp-rva wrote:
| Does it matter? If LLMs _aren't_ that, whatever it is,
| then we should use a different word. Finders keepers.
| throwup238 wrote:
| How do you know that LLMs "aren't that" if you can't even
| define what _that_ is?
|
| "I'll know it when I see it" isn't a compelling argument.
| grahamj wrote:
| they can't do what we do therefore they aren't what we
| are
| layer8 wrote:
| And what is that, in concrete terms? Many humans can't do
| what other humans can do. What is the common subset that
| counts as human intelligence?
| jonny_eh wrote:
| > "I'll know it when I see it" isn't a compelling
| argument.
|
| It feels compelling to me.
| Aperocky wrote:
| I think a successful high level intelligence should
| quickly accelerate or converge to infinity/physical
| resource exhaustion because they can now work on
| improving themselves.
|
| So if above human intelligence does happen, I'd assume
| we'd know it, quite soon.
| wruza wrote:
| And look at the airplanes, they really can't just land on a
| mountain slope or a tree without heavy maintenance
| afterwards. Those people weren't all stupid, they questioned
| the promise of flying servicemen delivering mail or milk to
| their window and flying on a personal aircar to their
| workplace. Just like today's promises, whatever tales the CEOs
| are telling. Imagining bullshit isn't unique to this
| century.
|
| Aerospace is still a highly regulated area that requires
| training and responsibility. If parallels can be drawn here,
| they don't look so cool for a regular guy.
| skydhash wrote:
| This pretty much. Everyone knows that LLMs are great for
| text generation and processing. What people have been
| questioning are the end goals as promised by their builders,
| i.e. is it useful? And from most of what I saw, it's very
| much a toy.
| Workaccount2 wrote:
| What people always leave out is that society will bend to
| the abilities of the new technology. Planes can't land in
| your backyard so we built airports. We didn't abandon
| planes.
| PaulDavisThe1st wrote:
| Sure, but that also vindicates the GP's point that the
| initial claims of the boosters for planes contained more
| than their fair share of bullshit and lies.
| wruza wrote:
| Yes but the idea was lost in the process. It became a
| faster transportation system that uses air as a medium,
| but that's it. Personal planes are still either big
| business or an expensive and dangerous personal toy
| thing. I don't think it's the same for LLMs (would be
| naive). But where are promises like "we're gonna change
| travel economics etc"? All headlines scream is "AGI
| around the corner". Yeah, now where's my damn postman
| flying? I need my mail.
| ben_w wrote:
| > It became a faster transportation system that uses air
| as a medium, but that's it.
|
| On the one hand, yes; on the other, this understates the
| impact that had.
|
| My uncle moved from the UK to Australia because, I'm
| told*, he didn't like his mum and travel was so expensive
| that he assumed they'd never meet again. My first trip
| abroad... I'm not 100% sure how old I was, but it must
| have been between age 6 and 10, was my gran (his mum)
| paying for herself, for both my parents, and for me, to
| fly to Singapore, then on to various locations in
| Australia including my uncle, and back via Thailand, on
| her pension.
|
| That was a gap of around one and a half generations.
|
| * both of them are long-since dead now so I can't ask
| ForHackernews wrote:
| This is already happening. A few days ago Microsoft
| turned down a documentation PR because the formatting was
| better for humans but worse for LLMs: https://github.com/
| MicrosoftDocs/WSL/pull/2021#issuecomment-...
|
| They changed their mind after a public outcry including
| here on HN.
| oblio wrote:
| We are slowly discovering that many of our wonderful
| inventions from 60-80-100 years ago have serious side
| effects.
|
| Plastics, cars, planes, etc.
|
| One could say that a balanced situation, where vested
| interests are put back in the box (close to impossible
| since it would mean fighting trillions of dollars), would
| mean that, for example, all 3 in the list above are used a
| lot less than we use them now. And only used
| where truly appropriate.
| tivert wrote:
| > What people always leave out is that society will bend
| to the abilities of the new technology.
|
| Do they really? I don't think they do.
|
| > Planes can't land in your backyard so we built
| airports. We didn't abandon planes.
|
| But then what do you do with the all the fantasies and
| hype about the new technology (like planes that land in
| your backyard and you fly them to work)?
|
| And it's quite possible and fairly common that the new
| technology _actually ends up being mostly hype_, and
| there's actually no "airports" use case in the wings. I
| mean, how much did society "bend to the abilities of"
| NFTs?
|
| And then what if the mature "airports" use case is
| actually something _most people do not want_?
| moffkalast wrote:
| No, we built helicopters.
| throwaway4aday wrote:
| Your point is on the verge of nullification with the rapid
| improvement and adoption of autonomous drones, don't you
| think?
| alexalx666 wrote:
| If I could put it into a Tesla-style robot and it could do
| dishes and help me figure out tech stuff, it would be more
| than enough.
| skywhopper wrote:
| On the contrary, the pushback is critical because many
| employers are buying the hype from AI companies that AGI is
| imminent, that LLMs can replace professional humans, and that
| computers are about to eliminate all work (except VCs and
| CEOs apparently).
|
| Every person that believes that LLMs are near sentient or
| actually do a good job at reasoning is one more person
| handing over their responsibilities to a zero-accountability
| highly flawed robot. We've already seen LLMs generate bad
| legal documents, bad academic papers, and extremely bad code.
| Similar technology is making bad decisions about who to
| arrest, who to give loans to, who to hire, who to bomb, and
| who to refuse heart surgery for. Overconfident humans
| employing this tech for these purposes have been bamboozled
| by the lies from OpenAI, Microsoft, Google, et al. It's
| crucial to call out overstatement and overhype about this
| tech wherever it crops up.
| jasondigitized wrote:
| This a thousand times.
| colordrops wrote:
| I don't think many informed people doubt the utility of LLMs
| at this point. The potential of human-like AGI has profound
| implications far beyond utility models, which is why people
| are so eager to bring it up. A true human-like AGI basically
| means that most intellectual/white collar work will not be
| needed, and probably manual labor before too long as well.
| Huge huge implications for humanity, e.g. how does an economy
| and society even work without workers?
| vouaobrasil wrote:
| > Huge huge implications for humanity, e.g. how does an
| economy and society even work without workers?
|
| I don't think those that create AI care about that. They
| just want to come out on top before someone else does.
| sigmoid10 wrote:
| These comments are getting ridiculous. I remember when this
| test was first discussed here on HN and everyone agreed that it
| clearly proves current AI models are not "intelligent"
| (whatever that means). And people tried to talk me down when I
| theorised this test would get nuked soon - like all the ones
| before. It's time people woke up and realised that the old age
| of AI is over. This new kind is here to stay and it _will_ take
| over the world. And you better guess it'll be sooner rather
| than later and start to prepare.
| samvher wrote:
| What kind of preparation are you suggesting?
| sigmoid10 wrote:
| This is far too broad to summarise here. You can read up on
| Sutskever or Bostrom or hell even Stephen Hawking's ideas
| (going in order from really deep to general topics). We
| need to discuss _everything_ - from education over jobs and
| taxes all the way to the principles of politics, our
| economy and even the military. If we fail at this as a
| society, we will at the very least create a world where the
| people who own capital today massively benefit and become
| rich beyond imagination (despite having contributed nothing
| to it), while the majority of the population will be
| unemployable and forever left behind. And the worst case
| probably falls somewhere between the end of human
| civilisation and the end of our species.
| kelseyfrog wrote:
| What we're going to do is punt the questions and then
| convince ourselves the outcome was inevitable and if
| anything it's actually our fault.
| astrange wrote:
| One way you can tell this isn't realistic is that it's
| the plot of Atlas Shrugged. If your economic intuitions
| produce that book it means they are wrong.
|
| > while the majority of the population will be
| unemployable and forever left behind
|
| Productivity improvements increase employment. A
| superhuman AI is a productivity improvement.
| johnny_canuck wrote:
| Start learning a trade
| jorblumesea wrote:
| that's going to work when every white collar worker goes
| into the trades /s
|
| who is going to pay for residential electrical work lol
| and how much will you make if some guy from MIT is going
| to compete with you
| whynotminot wrote:
| I feel like that's just kicking the can a little further
| down the road.
|
| Our value proposition as humans in a capitalist society
| is an increasingly fragile thing.
| foobarqux wrote:
| You should look up the terms necessary and sufficient.
| sigmoid10 wrote:
| The real issue is people constantly making up new goalposts
| to keep their outdated world view somewhat aligned with
| what we are seeing. But these two things are drifting apart
| faster and faster. Even I got surprised by how quickly the
| ARC benchmark was blown out of the water, and I'm pretty
| bullish on AI.
| foobarqux wrote:
| The ARC maintainers have explicitly said that passing the
| test was necessary but not sufficient so I don't know
| where you come up with goal-post moving. (I personally
| don't like the test; it is more about "intuition" or in-
| built priors, not reasoning).
| manmal wrote:
| Are you like invested in LLM companies or something?
| You're pushing the agenda hard in this thread.
| lawlessone wrote:
| Failing the test may prove the AI is not intelligent. Passing
| the test doesn't necessarily prove it is.
| NitpickLawyer wrote:
| Your comment reminds me of this quote from a book published
| in the 80s:
|
| > There is a related "Theorem" about progress in AI: once
| some mental function is programmed, people soon cease to
| consider it as an essential ingredient of "real thinking".
| The ineluctable core of intelligence is always in that next
| thing which hasn't yet been programmed. This "Theorem" was
| first proposed to me by Larry Tesler, so I call it Tesler's
| Theorem: "AI is whatever hasn't been done yet."
| 6gvONxR4sf7o wrote:
| I've always disliked this argument. A person can do
| something well without devising a general solution to the
| thing. Devising a general solution to the thing is a step
| we're taking all the time with all sorts of things, but
| it doesn't invalidate the cool fact about intelligence:
| whatever it is that lets us do the thing well _without_
| the general solution is hard to pin down and hard to
| reproduce.
|
| All that's invalidated each time is the idea that a
| general solution to that task requires a general solution
| to all tasks, or that a general solution to that task
| requires our special sauce. It's the idea that something
| able to do that task will also be able to do XYZ.
|
| And yet people keep coming up with a new task that people
| point to saying, 'this is the one! there's no way
| something could solve this one without also being able to
| do XYZ!'
| 8note wrote:
| I'd consider that it doing the test at all, without proper
| compensation, is a sign that it isn't intelligent
| QuantumGood wrote:
| "it will take over the world"
|
| Calibrating to the current hype cycle has been challenging
| with AI pronouncements.
| jcims wrote:
| I agree, it's like watching a meadow ablaze and dismissing it
| because it's not a 'real forest fire' yet. No it's not 'real
| AGI' yet, but *this is how we get there* and the pace is
| relentless, incredible and wholly overwhelming.
|
| I've been blessed with grandchildren recently, a little boy
| that's 2 1/2 and just this past Saturday a granddaughter.
| Major events notwithstanding, the world will largely resemble
| today when they are teenagers, but the future is going to
| look very very very different. I can't even imagine what the
| capability and pervasiveness of it all will be like in ten
| years, when they are still just kids. For me as someone
| that's invested in their future I'm interested in all of the
| educational opportunities (technical, philosophical and self-
| awareness) but obviously am concerned about the potential for
| pernicious side effects.
| philipkglass wrote:
| If AI takes over white collar work that's still half of the
| world's labor needs untouched. There are some promising early
| demos of robotics plus AI. I also saw some promising demos of
| robotics 10 and 20 years ago that didn't reach mass adoption. I'd
| like to believe that by the time I reach old age the robots
| will be fully qualified replacements for plumbers and home
| health aides. Nothing I've seen so far makes me think that's
| especially likely.
|
| I'd love more progress on tasks in the physical world,
| though. There are only a few paths for countries to deal with
| a growing ratio of old retired people to young workers:
|
| 1) Prioritize the young people at the expense of the old by
| e.g. cutting old age benefits (not especially likely since
| older voters have greater numbers and higher participation
| rates in elections)
|
| 2) Prioritize the old people at the expense of the young by
| raising the demands placed on young people (either directly
| as labor, e.g. nurses and aides, or indirectly through higher
| taxation)
|
| 3) Rapidly increase the population of young people through
| high fertility or immigration (the historically favored path,
| but eventually turns back into case 1 or 2 with an even
| larger numerical burden of older people)
|
| 4) Increase the health span of older people, so that they are
| more capable of independent self-care (a good idea, but
| difficult to achieve at scale, since most effective
| approaches require behavioral changes)
|
| 5) Decouple goods and services from labor, so that old people
| with diminished capabilities can get everything they need
| without forcing young people to labor for them
| reducesuffering wrote:
| > If AI takes over white collar work that's still half of
| the world's labor needs untouched.
|
| I am continually _baffled_ that people here throw this
| argument out and can 't imagine the second-order effects.
| If white collar work is automated by AGI, all the R&D to
| solve robotics beyond imagination will happen in a flash.
| The top AI labs, the people smart enough to make this
| technology, are all focusing on automating AGI researchers,
| and from there follows everything, obviously.
| brotchie wrote:
| +1, the second and third order effects aren't trivial.
|
| We're already seeing escape velocity in world modeling
| (see Google Veo2 and the latest Genesis LLM-based physics
| modeling framework).
|
| The hardware for humanoid robots is 95% of the way there,
| the gap is control logic and intelligence, which is
| rapidly being closed.
|
| Combine Veo2 world model, Genesis control planning,
| o3-style reasoning, and you're pretty much there with
| blue collar work automation.
|
| We're only a few turns (<12 months) away from an
| existence proof of a humanoid robot that can watch a
| Youtube video and then replicate the task in a novel
| environment. May take longer than that to productionize.
|
| It's really hard to think and project forward on an
| exponential. We've been on an exponential technology
| curve since the discovery of fire (at least). The 2nd
| order has kicked up over the last few years.
|
| Not a rational approach to look back at robotics
| 2000-2022 and project that pace forwards. There's more
| happening every month than in decades past.
| philipkglass wrote:
| I hope that you're both right. In 2004-2007 I saw self
| driving vehicles make lightning progress from the weak
| showing of the 2004 DARPA Grand Challenge to the
| impressive 2005 Grand Challenge winners and the even more
| impressive performance in the 2007 Urban Challenge. At
| the time I thought that full self driving vehicles would
| have a major commercial impact within 5 years. I expected
| truck and taxi drivers to be obsolete jobs in 10 years.
| 17 years after the Urban Challenge there are still
| millions of truck driver jobs in America and only Waymo
| seems to have a credible alternative to taxi drivers
| (even then, only in a small number of cities).
| ben_w wrote:
| > It's time people woke up and realised that the old age of
| AI is over. This new kind is here to stay and it will take
| over the world. And you better guess it'll be sooner rather
| than later and start to prepare.
|
| I was just thinking about how 3D game engines were perceived
| in the 90s. Every six months some new engine came out, blew
| people's minds, was declared photorealistic, and was
| forgotten a year later. The best of those engines kept
| improving and are still here, and kinda did change the world
| in their own way.
|
| Software development seemed rapid and exciting until about
| Halo or Half Life 2, then it was shallow but shiny press
| releases for 15 years, and only became so again when OpenAI's
| InstructGPT was demonstrated.
|
| While I'm really impressed with current AI, and value the
| best models greatly, and agree that they will change (and
| have already changed) the world... I can't help but think of
| the _Next Generation_ front cover, February 1997 when
| considering how much further we may be from what we want:
| https://www.giantbomb.com/pc/3045-94/forums/unreal-yes-
| this-...
| torginus wrote:
| The weird thing about the phenomenon you mention is that only
| after the field of software engineering plateaued 15 years
| ago, as you mentioned, did this insane demand for engineers
| arise, with corresponding insane salaries.
|
| It's a very strange thing I've never understood.
| dwaltrip wrote:
| My guess: It's a very lengthy, complex, and error-prone
| process to "digitize" human civilization (government,
| commerce, leisure, military, etc). The tech existed, we
| just didn't know how to use it.
|
| We still barely know how to use computers effectively,
| and they have already transformed the world. For better
| or worse.
| hansonkd wrote:
| > how much further we may be from what we want
|
| The timescale you are describing for 3D graphics is 4 years
| from the 1997 cover you posted to the release of Halo which
| you are saying plateaued excitement because it got advanced
| enough.
|
| An almost infinitesimally small amount of time in terms of
| the history of human development, and you are mocking the
| magazine for being excited about the advancement because it
| was... 4 years early?
| ben_w wrote:
| No, the timescale is "the 90s", the _specific
| example_ is from 1997, and chosen because of how badly it
| aged. Nobody looks at the original single-player Unreal
| graphics today and thinks "this is amazing!", but we all
| did at the time -- Reflections! Dynamic lighting! It was
| amazing for the era -- but it was also a long way from
| photorealism. ChatGPT is amazing... but how far is it
| from Brent Spiner's Data?
|
| The era was people getting wowed from Wolfenstein (1992)
| to "about Halo or Half Life 2" (2001 or 2004).
|
| And I'm not saying the flattening of excitement was for
| any specific reason, just that this was roughly when it
| stopped getting exciting -- it might have been because
| the engines were good enough for 3D art styles beyond "as
| realistic as we can make it", but for all I know it was
| the War On Terror which changed the tone of press
| releases and how much the news in general cared. Or
| perhaps it was a culture shift which came with more
| people getting online and less media being printed on
| glossy paper and sold in newsagents.
|
| Whatever the cause, it happened around that time.
| TeMPOraL wrote:
| I'm still holding on to my hypothesis that the
| excitement was sustained in large part because this
| progress was something a regular person could partake in.
| Most didn't, but they likely knew some kid who was. And
| some of those kids ran the gaming magazines.
|
| This was a time where, for 3D graphics, barriers to entry
| got low (math got figured out, hardware was good enough,
| knowledge spread), but the commercial market didn't yet
| capture everything. Hell, a bulk of those excited kids I
| remember, trying to do a better Unreal Tournament after
| school instead of homework (and almost succeeding!), they
| went on to create and staff the next generation of
| commercial gamedev.
|
| (Which is maybe why this period lasted for about as long
| as it takes for a schoolkid to grow up, graduate, and
| spend few years in the workforce doing the stuff they
| were so excited about.)
| TeMPOraL wrote:
| > _Software development seemed rapid and exciting until
| about Halo or Half Life 2, then it was shallow but shiny
| press releases for 15 years_
|
| The transition seems to map well to the point where engines
| got sophisticated enough, that highly dedicated high-
| schoolers couldn't keep up. Until then, people would
| routinely make hobby game engines (for games they'd then
| never finish) that were MVPs of what the game industry had
| a year or three earlier. I.e. close enough to compete on
| visuals with top photorealistic games of a given year - but
| more importantly, this was a time where _you could do cool
| nerdy shit to impress your friends and community_.
|
| Then Unreal and Unity came out, with a business model that
| killed the motivation to write your own engine from scratch
| (except for purely educational purposes), we got more
| games, more progress, but the excitement was gone.
|
| Maybe it's just a spurious correlation, but it seems to
| track with:
|
| > _and only became so again when OpenAI's InstructGPT was
| demonstrated._
|
| Which is again, if you exclude training SOTA models - which
| is still mostly out of reach for anyone but a few entities
| on the planet - the time where _anyone_ can do something
| cool that doesn't have a better market alternative yet,
| and any dedicated high-schooler can make truly impressive
| and useful work, outpacing commercial and academic work
| based on pure motivation and focus alone (it's easier when
| you're not being distracted by bullshit incentives like
| _user growth_ or _making VCs happy_ or _churning out
| publications, farming citations_).
|
| It's, once again, a time of dreams, where anyone with some
| technical interest and a bit of free time can _make the
| future happen in front of their eyes_.
| levocardia wrote:
| I'm a little torn. ARC is really hard, and Francois is
| extremely smart and thoughtful about what intelligence means
| (the original "On the Measure of Intelligence" heavily
| influenced my ideas on how to think about AI).
|
| On the other hand, there is a long, long history of AI
| achieving X but not being what we would casually refer to as
| "generally intelligent," then people deciding X isn't really
| intelligence; only when AI achieves Y will it be
| intelligence. Then AI achieves Y and...
| Workaccount2 wrote:
| You are telling a bunch of high earning individuals ($150k+)
| that they may be dramatically less valuable in the near
| future. Of course the goal posts will keep being pushed back
| and the acknowledgements will never come.
| ignoramous wrote:
| > _These comments are getting ridiculous._
|
| Not really. Francois (co-creator of the ARC Prize) has this
| to say: The v1 version of the benchmark is
| starting to saturate. There were already signs of this in the
| Kaggle competition this year: an ensemble of all submissions
| would score 81%. Early indications are that ARC-
| AGI-v2 will represent a complete reset of the state-of-the-
| art, and it will remain extremely difficult for o3.
| Meanwhile, a smart human or a small panel of average humans
| would still be able to score >95% ... This shows that it's
| still feasible to create unsaturated, interesting benchmarks
| that are easy for humans, yet impossible for AI, without
| involving specialist knowledge. We will have AGI when
| creating such evals becomes outright impossible.
| For me, the main open question is where the scaling
| bottlenecks for the techniques behind o3 are going to be. If
| human-annotated CoT data is a major bottleneck, for instance,
| capabilities would start to plateau quickly like they did for
| LLMs (until the next architecture). If the only bottleneck is
| test-time search, we will see continued scaling in the
| future.
|
| https://x.com/fchollet/status/1870169764762710376 /
| https://ghostarchive.org/archive/Sqjbf
| bluerooibos wrote:
| The goalposts have moved, again and again.
|
| It's gone from "well the output is incoherent" to "well it's
| just spitting out stuff it's already seen online" to
| "WELL...uhh IT CAN'T CREATE NEW/NOVEL KNOWLEDGE" in the space
| of 3-4 years.
|
| It's incredible.
|
| We already have AGI.
| FrustratedMonky wrote:
| " it's complete hubris to conflate ARC or any benchmark with
| truly general intelligence."
|
| Maybe it would help to include some human results in the AI
| ranking.
|
| I think we'd find that Humans score lower?
| zamadatix wrote:
| I'm not sure it'd help what they are talking about much.
|
| E.g. go back in time and imagine you didn't know there are
| ways for computers to be really good at performing
| integration yet as nobody had tried to make them. If someone
| asked you how to tell if something is intelligent "the
| ability to easily reason integrations or calculate extremely
| large multiplications in mathematics" might seem like a great
| test to make.
|
| Skip forward to the modern era and it's blatantly obvious
| CASes like Mathematica on a modern computer range between
| "ridiculously better than the average person" to "impossibly
| better than the best person" depending on the test. At the
| same time, it becomes painfully obvious a CAS is wholly
| unrelated to general intelligence and just because your test
| might have been solvable by an AGI doesn't mean solving it
| proves something must have been an AGI.
|
| So you come up with a new test... but you have the same
| problem as originally, it seems like anything non-human
| completely bombs and an AGI would do well... but how do you
| know the thing that solves it will have been an AGI for sure
| and not just another system clearly unrelated?
|
| Short of a more clever way what GP is saying is the goalposts
| must keep being moved until it's not so obvious the thing
| isn't AGI, not that the average human gets a certain score
| which is worse.
|
| .
|
| All that aside, to answer your original question, in the
| presentation it was said the average human gets 85% and this
| was the first model to beat that. It was also said a second
| version is being worked on. They have some papers on their
| site about clear examples of why the current test clearly has
| a lot of testing unrelated to whether something is really AGI
| (a brute force method was shown to get >50% in 2020) so their
| aim is to create a new goalpost test and see how things shake
| out this time.
| FrustratedMonky wrote:
| "Short of a more clever way what GP is saying is the
| goalposts must keep being moved until it's not so obvious
| the thing isn't AGI, not that the average human gets a
| certain score which is worse."
|
| Best way of stating that I've heard.
|
| The Goal Post must keep moving, until we understand enough
| what is happening.
|
| I usually poo-poo the goal post moving, but this makes
| sense.
| og_kalu wrote:
| Generality is not binary. It's a spectrum. And these models
| are already general in ways those things you've mentioned
| simply weren't.
|
| What exactly is AGI to you ? If it's simply a generally
| intelligent machine then what are you waiting for ? What
| else is there to be sure of ? There's nothing narrow about
| these models.
|
| Humans love to believe they're oh so special so much that
| there will always be debates on whether 'AGI' has arrived.
| If you are waiting for that then you'll be waiting a very
| long time, even if a machine arrives that takes us to the
| next frontier in science.
| m3kw9 wrote:
| From the statement - this was a pretty tough test where AI
| scored low vs humans just last year, and AI being able to do it
| as well as humans may not be AGI, which I agree with, but it
| means something, in all caps
| manmal wrote:
| Obviously, the multi billion dollar companies will try to
| satisfy the benchmarks they are not yet good in, as has
| always been the case.
| wslh wrote:
| > My skeptical impression: it's complete hubris to conflate ARC
| or any benchmark with truly general intelligence.
|
| But isn't it interesting to have several benchmarks? Even if
| it's not about passing the Turing test, benchmarks serve a
| purpose--similar to how we measure microprocessors or other
| devices. Intelligence may be more elusive, but even if we had
| an oracle delivering the ultimate intelligence benchmark, we'd
| still argue about its limitations. Perhaps we'd claim it
| doesn't measure creativity well, and we'd find ourselves
| revisiting the same debates about different kinds of
| intelligences.
| zebomon wrote:
| It's certainly interesting. I'm just not convinced it's a
| test of general intelligence, and I don't think we'll know
| whether or not it is until it's been able to operate in the
| real world to the same degree that our general intelligence
| does.
| kelseyfrog wrote:
| > truly general intelligence
|
| Indistinguishable from goalpost moving like you said, but also
| no true Scotsman.
|
| I'm curious what would happen in your eyes if we misattributed
| general intelligence to an AI model? What are the consequences
| of a false positive and how would they affect your life?
|
| It's really clear to me how intelligence fits into our reality
| as part of our social ontology. The attributes and their
| expression that each of us uses to ground our concept of the
| intelligent predicate differ wildly.
|
| My personal theory is that we tend to have an exemplar-based
| dataset of intelligence, and each of us attempts to construct a
| parsimonious model of intelligence, but like all (mental)
| models, they can be useful but wrong. These models operate in a
| space where the trade off is completeness or consistency, and
| most folks, uncomfortable saying "I don't know" lean toward
| being complete in their specification rather than consistent.
| The unfortunate side-effect is that we're able to easily
| generate test data that highlights our model inconsistency - AI
| being a case in point.
| PaulDavisThe1st wrote:
| > I'm curious what would happen in your eyes if we
| misattributed general intelligence to an AI model? What are
| the consequences of a false positive and how would they
| affect your life?
|
| Rich people will think they can use the AI model instead of
| paying other people to do certain tasks.
|
| The consequences could range from brilliant to utterly
| catastrophic, depending on the context and precise way in
| which this is done. But I'd lean toward the catastrophic.
| kelseyfrog wrote:
| Any specifics? It's difficult to separate this from
| generalized concern.
| PaulDavisThe1st wrote:
| someone wants a "personal assistant" and believes that
| the LLM has AGI ...
|
| someone wants a "planning officer" and believes that the
| LLM has AGI ...
|
| someone wants a "hiring consultant" and believes that the
| LLM has AGI ...
|
| etc. etc.
| kelseyfrog wrote:
| My apologies, but would it be possible to list the
| catastrophic consequences of these?
| Agentus wrote:
| how about an extra large dose of your skepticism. is true
| intelligence really a thing and not just a vague human
| construct that tries to point out the mysterious unquantifiable
| combination of human behaviors?
|
| humans clearly don't know what intelligence is unambiguously.
| there's also no divinely ordained objective dictionary that one
| can point at to reference what true intelligence is. a deep
| reflection of trying to pattern-associate different human
| cognitive abilities indicates human cognitive capabilities
| aren't that spectacular really.
| Bjorkbat wrote:
| I think it's still an interesting way to measure general
| intelligence; it's just that o3 has demonstrated that you can
| actually achieve human performance on it by training it on the
| public training set and giving it ridiculous amounts of
| compute, which I imagine equates to ludicrously long chains-of-
| thought, and if I understand correctly more than one chain-of-
| thought per task (they mention sample sizes in the blog post,
| with o3-low using 6 and o3-high using 1024. Not sure if these
| are chains-of-thought per task or what).
|
| Once you look at it that way, the approach really doesn't
| look like intelligence that's able to generalize to novel
| domains. It doesn't pass the sniff test. It looks a lot more
| like brute-forcing.
|
| Which is probably why, in order to actually qualify for the
| leaderboard, they stipulate that you can't use more than $10k
| of compute. Otherwise, it just sounds like brute-forcing.
| attentionmech wrote:
| Isn't this at the level now where it can sort of self-improve? My
| guess is that they will just use it to improve the model and the
| cost they are showing per evaluation will go down drastically.
|
| So, next step in reasoning is open world reasoning now?
| yawnxyz wrote:
| O3 High (tuned) model scored an 88% at what looks like
| $6,000/task haha
|
| I think soon we'll be pricing any kind of task by its compute
| cost. So basically: human = $50/task, AI = $6,000/task, use the
| human. If the AI beats the human, use the AI? Ofc that's
| assuming both get 100% scores on the task
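|
| (A minimal sketch of that pricing rule; the $50 and $6,000
| figures are just the rough numbers above, and the equal-quality
| assumption is doing a lot of work:)
|
|     # hypothetical cost-based routing, assuming equal accuracy
|     def cheaper_worker(human_cost: float, ai_cost: float) -> str:
|         return "AI" if ai_cost < human_cost else "human"
|
|     print(cheaper_worker(human_cost=50, ai_cost=6000))  # human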
| cchance wrote:
| Isn't that generally what ... all jobs are? Automation cost vs
| long-term human cost... it's why Amazon did the weird "our
| stores are AI driven" thing, but in reality it was cheaper to
| hire a bunch of guys in a sweatshop to look at the cameras and
| write things down lol.
|
| The thing is, given what we've seen from distillation and tech,
| even if it's $6,000/task... that will come down drastically
| over time through optimization and just... faster, more
| efficient processing hardware and software.
| cryptoegorophy wrote:
| I remember hearing about Tesla trying to automate all of
| production, but some things just couldn't be, like the wiring,
| which humans still had to do.
| dyauspitr wrote:
| Compute can get optimized and cheap quickly.
| karmasimida wrote:
| Is it? Moore's law is dead dead; I don't think this is a
| given.
| jsheard wrote:
| That's the elephant in the room with the reasoning/COT
| approach, it shifts what was previously a scaling of training
| costs into scaling of training _and_ inference costs. The
| promise of doing expensive training once and then running the
| model cheaply forever falls apart once you 're burning tens,
| hundreds or thousands of dollars worth of compute every time
| you run a query.
| Legend2440 wrote:
| Yeah, but next year they'll come out with a faster GPU, and
| the year after that another still faster one, and so on.
| Compute costs are a temporary problem.
| freehorse wrote:
| The issue is not just scaling compute, but scaling it at a
| rate that meets the increase in complexity of the problems
| that are not currently solved. If that is O(n), then what
| you say probably stands. If that is e.g. O(n^8) or
| exponential, then there is no hope of getting good enough
| scaling just by increasing compute at a normal rate. AI
| technology will still be improving, but improving ever more
| slowly, practically stagnating.
|
| o3 will be interesting if it indeed offers a novel
| technology for problem solving, something that is able to
| learn efficiently from a few novel examples and adapt.
| That's what intelligence actually is. Maybe this is the
| case. If, on the other hand, it is a smart way to pair CoT
| with an evaluation loop (as the author hints is a
| possibility), then it is probable that, while this _can_
| handle a class of problems that current LLMs cannot, it is
| not really that kind of learning, meaning it will not be
| able to scale to more complex, real-world tasks whose
| problem space is too large and thus less amenable to such a
| technique. It is still interesting, because having a good
| enough evaluator may be a very important step, but it would
| mean that we are not yet there.
|
| We will learn soon enough I suppose.
| Workaccount2 wrote:
| They're gonna figure it out. Something is being missed
| somewhere, as human brains can do all this computation on 20
| watts. Maybe it will be a hardware shift or maybe just a
| software one, but I strongly suspect that modern transformers
| are grossly inefficient.
| redeux wrote:
| Time and availability would also be factors.
| Benjaminsen wrote:
| Compute costs for AI with roughly the same capabilities
| have been halving every ~7 months.
|
| That makes something like this competitive in ~3 years.
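|
| As a rough back-of-the-envelope sketch (assuming the ~$6,000/task
| figure quoted elsewhere in this thread and a fixed ~7-month
| halving time, both of which are rough estimates):
|
|     # illustrative cost projection under a fixed halving time
|     def projected_cost(cost_now, months, halving_months=7):
|         return cost_now / 2 ** (months / halving_months)
|
|     # ~$6,000/task today -> roughly $170/task after 36 months
|     print(round(projected_cost(6000, 36)))
|
| That is presumably where the "~3 years" figure comes from.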
| freehorse wrote:
| This makes me wonder whether the solution comprises a
| "solver" trying semi-random or more targeted things and a
| "checker" verifying them. Usually checking a solution is
| cognitively (and computationally) easier than coming up with
| it. Otherwise I cannot think what sort of compute would burn
| $6,000 per task, unless you are going through a lot of loops
| and have somehow solved the part of the problem that figures
| out whether a solution is correct, while coming up with the
| actual correct solution is not yet solved to the same degree.
| Or maybe I am just naive and these prices are just breakfast
| money for companies like that.
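|
| To make the speculation concrete, a minimal sketch of such a
| generate-and-verify loop (purely hypothetical, not anything
| OpenAI has described; `propose` stands in for the expensive
| "solver", and replaying the worked examples is the cheap
| "checker"):
|
|     import random
|
|     def solve(train_pairs, propose, budget=1000):
|         """Sample candidates until one reproduces all worked examples."""
|         for _ in range(budget):
|             candidate = propose()  # "solver": sample a candidate program
|             if all(candidate(x) == y for x, y in train_pairs):
|                 return candidate   # "checker" accepted it
|         return None
|
|     # toy usage: recover "double it" from two worked examples
|     pairs = [(2, 4), (5, 10)]
|     guesses = [lambda x: x + 2, lambda x: x * 2, lambda x: x ** 2]
|     found = solve(pairs, lambda: random.choice(guesses))
|     print(found(7) if found else "no candidate passed")  # -> 14
|
| The asymmetry described above is visible here: the check is a
| cheap equality test, while all the budget goes into proposing
| candidates.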
| og_kalu wrote:
| It's not $6,000/task (i.e., per question). $6,000 is about the
| retail cost for evaluating the entire benchmark on high
| efficiency (about 400 questions).
| Tiberium wrote:
| From reading the blog post and Twitter, and the cost of other
| models, I think it's evident that it IS actually the cost per
| task; see this tweet: https://files.catbox.moe/z1n8dc.jpg
|
| And o1 costs $15/$60 for 1M tokens in/out, so the estimated
| costs on the graph would match for a single task, not the whole
| benchmark.
| slibhb wrote:
| The blog clarifies that it's $17-20 per task. Maybe it runs
| into thousands for tasks it can't solve?
| Tiberium wrote:
| That cost is for o3 low, o3 high goes into thousands per
| task.
| gbnwl wrote:
| Well they got 75.7% at $17/task. Did you see that?
| seydor wrote:
| What if we use those humans to generate energy for the tasks?
| spaceman_2020 wrote:
| Just as an aside, I've personally found o1 to be completely
| useless for coding.
|
| Sonnet 3.5 remains the king of the hill by quite some margin
| cchance wrote:
| The new Geminis are pretty good too.
| lysecret wrote:
| Actually prefer new geminis too. 2.0 experimental especially.
| og_kalu wrote:
| To be fair, until the last checkpoint released 2 days ago, o1
| didn't really beat sonnet (and if so, barely) in most non-
| competitive coding benchmarks
| vessenes wrote:
| To fill this out, I find o1-pro (and -preview when it was live)
| to be pretty good at filling in blindspots/spotting holistic
| bugs. I use Claude for day to day, and when Claude is spinning,
| o1 often can point out why. It's too slow for AI coding, and I
| agree that at default its responses aren't always satisfying.
|
| That said, I think its code style is arguably better, more
| concise and has better patterns -- Claude needs a fair amount
| of prompting and oversight to not put out semi-shitty code in
| terms of structure and architecture.
|
| In my mind: going from Slowest to Fastest, and Best
| Holistically to Worst, the list is:
|
| 1. o1-pro
| 2. Claude 3.5
| 3. Gemini 2 Flash
|
| Flash is so fast that it's tempting to use more, but it really
| needs to be kept to specific work on strong codebases without
| complex interactions.
| bearjaws wrote:
| o1 is pretty good at spotting OWASP defects, compared to most
| other models.
|
| https://myswamp.substack.com/p/benchmarking-llms-against-com...
| InkCanon wrote:
| I just asked o1 a simple yes or no question about x86 atomics
| and it did one of those A or B replies. The first answer was
| yes, the second answer was no.
| m3kw9 wrote:
| o1 is for when all else fails. It sometimes makes the same
| mistakes as weaker models if you give it simple tasks with very
| little context, but when a good, precise context is given it
| usually outperforms other models.
| karmasimida wrote:
| Yeah, I feel that for the chat use case o1 is just too slow for
| me, and my queries aren't that complicated.
|
| For coding, o1 is marvelous at Leetcode questions; I think it is
| the best teacher I could ever afford to teach me leetcoding.
| But I don't find myself having a lot of other use cases for o1
| that are complex and require a really long reasoning chain.
| bitbuilder wrote:
| I find myself hopping between o1 and Sonnet pretty frequently
| these days, and my personal observation is that the quality of
| output from o1 scales more directly to the quality of the
| prompting you're giving it.
|
| In a way it almost feels like it's become _too_ good at
| following instructions and simply just takes your direction
| more literally. It doesn't seem to take the initiative of
| going the extra mile of filling in the blanks from your lazy
| input (note: many would see this as a good thing). Claude on
| the other hand feels more intuitive in discerning intent from a
| lazy prompt, which I may be prone to offering it at times when
| I'm simply trying out ideas.
|
| However, if I take the time to write up a well thought out
| prompt detailing my expectations, I find I much prefer the code
| o1 creates. It's smarter in its approach, offers clever ideas I
| wouldn't have thought of, and generally cleaner.
|
| Or put another way, I can give Sonnet a lazy or detailed prompt
| and get a good result, while o1 will give me an excellent
| result with a well thought out prompt.
|
| What this boils down to is I find myself using Sonnet while
| brainstorming ideas, or when I simply don't know how I want to
| approach a problem. I can pitch it a feature idea the same way
| a product owner might pitch an idea to an engineer, and then
| iterate through sensible and intuitive ways of looking at the
| problem. Once I get a handle on how I'd like to implement a
| solution, I type up a spec and hand it off to o1 to crank out
| the code I'd intend to implement.
| jules wrote:
| Can you solve this by putting your lazy prompt through GPT-4o
| or Sonnet 3.6 and asking it to expand the prompt to a full
| prompt for o1?
| smy20011 wrote:
| It seems o3 follows the trend of chess engines, where you can cut
| your search depth depending on state.
|
| That works well for domains with a clear signal of success
| (win/lose for chess, tests for programming). One of the blockers
| for AGI is that we don't have clear evaluation for most of our
| tasks, and we cannot verify them fast enough.
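|
| A hedged sketch of what that looks like in a domain that does
| have a cheap verifier (unit tests); `ask_model` and `run_tests`
| are placeholders, not real APIs:
|
|     # spend more compute only when the cheap attempt fails the
|     # verifier, i.e. cut search depth/effort based on state
|     def solve_with_escalation(problem, ask_model, run_tests,
|                               efforts=(1, 8, 64)):
|         for effort in efforts:                # escalate only on failure
|             code = ask_model(problem, effort) # more compute per attempt
|             if run_tests(code):               # clear win/lose signal
|                 return code
|         return None                           # nothing verified
|
| Without something like `run_tests`, which is the situation for
| most real-world tasks, there is nothing to tell the loop when to
| stop, which is the point above.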
| flakiness wrote:
| The cost axis is interesting. o3 low is $10+ per task and o3 high
| is over $1,000 (it's a logarithmic graph, so it's more like $50
| and $5,000 respectively?).
| obblekk wrote:
| Human performance is 85% [1]. o3 high gets 87.5%.
|
| This means we have an algorithm to get to human level performance
| on this task.
|
| If you think this task is an eval of general reasoning ability,
| we have an algorithm for that now.
|
| There's a lot of work ahead to generalize o3 performance to all
| domains. I think this explains why many researchers feel AGI is
| within reach, now that we have an algorithm that works.
|
| Congrats to both Francois Chollet for developing this compelling
| eval, and to the researchers who saturated it!
|
| [1] https://x.com/SmokeAwayyy/status/1870171624403808366,
| https://arxiv.org/html/2409.01374v1
| phillipcarter wrote:
| As excited as I am by this, I still feel like this is just a
| small approximation of a small chunk of human reasoning
| ability at large. o3 (and whatever comes next) feels to me like
| it will head down the path of being a reasoning coprocessor for
| various tasks.
|
| But, still, this is incredibly impressive.
| qt31415926 wrote:
| Which parts of reasoning do you think are missing? I do feel
| like it covers a lot of 'reasoning' ground despite its
| on-the-surface simplicity.
| phillipcarter wrote:
| I think it's hard to enumerate the unknown, but I'd
| personally love to see how models like this perform on
| things like word problems where you introduce red herrings.
| Right now, LLMs at large tend to struggle mightily to
| understand when some of the given information is not only
| irrelevant, but may explicitly serve to distract from the
| real problem.
| KaoruAoiShiho wrote:
| o1 already fixed the red herrings...
| ALittleLight wrote:
| It's not saturated. 85% is average human performance, not "best
| human" performance. There is still room for the model to go up
| to 100% on this eval.
| scotty79 wrote:
| Still it's comparing average human level performance with best
| AI performance. Examples of things o3 failed at are insanely
| easy for humans.
| FrustratedMonky wrote:
| There are things chimps do easily that humans fail at, and
| vice versa, of course.
|
| There are blind spots; that doesn't take away from 'general'.
| cchance wrote:
| You'd be surprised what the AVERAGE human fails to do that
| you think is easy. My mom can't fucking send an email without
| downloading a virus, and I have a coworker who believes beyond
| a shadow of a doubt that the world is flat.
|
| The average human is a lot dumber than people on Hacker News
| and Reddit seem to realize; shit, the people on MTurk are
| likely smarter than the AVERAGE person.
| staticman2 wrote:
| Yet the average human can drive a car a lot better than
| ChatGPT can, which shows that the way you frame
| "intelligence" dictates your conclusion about who is
| "intelligent".
| p1esk wrote:
| Pretty sure a Waymo car drives better than an average SF
| driver.
| tracerbulletx wrote:
| If you take an electrical sensory input signal sequence
| and transform it into an electrical muscle output signal
| sequence, you've got a brain. ChatGPT isn't going to drive
| a car because it's trained on verbal tokens, and it's not
| optimized for the type of latency you need for physical
| interaction.
|
| And the brain doesn't use the same network to do verbal
| reasoning as real time coordination either.
|
| But that work is moving along fine. All of these models
| and lessons are going to be combined into AGI. It is
| happening. There isn't really that much in the way.
| cryptoegorophy wrote:
| What's interesting is that it might be much closer to human
| intelligence than to some "alien" intelligence, because after
| all it is an LLM trained on human-made text, which kind of
| represents human intelligence.
| hammock wrote:
| In that vein, perhaps the delta between o3 @ 87.5% and Human
| @ 85% represents a deficit in the ability of text to
| communicate human reasoning.
|
| In other words, it's possible humans can reason better than
| o3, but cannot articulate that reasoning as well through text
| - only in our heads, or through some alternative medium.
| 85392_school wrote:
| I wonder how much of an effect amount of time to answer has
| on human performance.
| yunwal wrote:
| Yeah, this is sort of meaningless without some idea of
| cost or consequences of a wrong answer. One of the nice
| things about working with a competent human is being able
| to tell them "all of our jobs are on the line" and
| knowing with certainty that they'll come to a good
| answer.
| unsupp0rted wrote:
| It's possible humans reason better through text than not
| through text, so these models, having been trained on text,
| should be able to out-reason any person who's not currently
| sitting down to write.
| antirez wrote:
| NNs are not algorithms.
| notfish wrote:
| An algorithm is "a process or set of rules to be followed in
| calculations or other problem-solving operations, especially
| by a computer"
|
| How does a giant pile of linear algebra not meet that
| definition?
| antirez wrote:
| It's not made of "steps"; it's an almost continuous
| function of its inputs. And a function is not an algorithm:
| it is not an object made of conditions, jumps,
| terminations, ... Obviously it has computational capabilities
| and is Turing-complete, but it is the opposite of an
| algorithm.
| raegis wrote:
| > It's not made of "steps", it's an almost continuous
| function to its inputs.
|
| Can you define "almost continuous function"? Or explain
| what you mean by this, and how it is used in the A.I.
| stuff?
| janalsncm wrote:
| If it weren't made of steps, then Turing machines wouldn't
| be able to execute it.
|
| Further, this is probably running an algorithm on top of
| an NN. Some kind of tree search.
|
| I get what you're saying though. You're trying to draw a
| distinction between statistical methods and symbolic
| methods. Someday we will have an algorithm which uses
| statistical methods that can match human performance on
| most cognitive tasks, and it won't look or act like a
| brain. In some sense that's disappointing. We can build
| supersonic jets without fully understanding how birds
| fly.
| antirez wrote:
| Let's say that Turing machines can approximate the
| execution of an NN :) That's why there are issues related
| to numerical precision; but the converse is also true:
| NNs can discover and use techniques similar to those used
| by traditional algorithms. However, the two remain two
| different methods of doing computation, and it's probably
| not just by chance that many things we can't do
| algorithmically, we can do with NNs. What I mean is that
| this is not _just_ down to the fact that NNs discover
| complex algorithms via gradient descent, but also that
| the computational model of NNs is better suited to solving
| certain tasks. So the inference algorithm of NNs (doing
| multiplications and other batch transformations) is just
| what standard computers need in order to approximate the
| NN computational model. You could do this with analog
| hardware, and hardly anybody (maybe?) would claim it's
| running an algorithm. Or that brains themselves are
| algorithms.
| benlivengood wrote:
| Deterministic (ieee 754 floats), terminates on all inputs,
| correctness (produces loss < X on N training/test inputs)
|
| At most you can argue that there isn't a useful bounded loss
| on every possible input, but it turns out that humans don't
| achieve useful bounded loss on identifying arbitrary sets of
| pixels as a cat or whatever, either. Most problems NNs are
| aimed at are qualitative or probabilistic where provable
| bounds are less useful than Nth-percentile performance on
| real-world data.
| KeplerBoy wrote:
| Running inference on a model certainly is an algorithm.
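|
| For what it's worth, a toy forward pass makes that concrete:
| strip away the learned weights and inference is a fixed, finite
| sequence of arithmetic steps (the weights below are made up
| purely for illustration):
|
|     def relu(xs):
|         return [max(0.0, v) for v in xs]
|
|     def matvec(W, xs):
|         return [sum(w * v for w, v in zip(row, xs)) for row in W]
|
|     W1 = [[0.5, -1.0], [2.0, 0.25]]  # layer 1 weights (made up)
|     W2 = [[1.0, 1.0]]                # layer 2 weights (made up)
|
|     def forward(xs):
|         # multiply, threshold, multiply: deterministic steps
|         return matvec(W2, relu(matvec(W1, xs)))
|
|     print(forward([1.0, 2.0]))  # -> [2.5]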
| drdeca wrote:
| How do you define "algorithm"? I suspect it is a definition I
| would find somewhat unusual. Not to say that I strictly
| disagree, but only because to my mind "neural net" suggests
| something a bit more concrete than "algorithm", so I might
| instead say that an artificial neural net is an
| implementation of an algorithm rather than an algorithm
| itself, or something like that.
|
| But, to my mind, something of the form "Train a neural
| network with an architecture generally like [blah], with a
| training method+data like [bleh], and save the result. Then,
| when inputs are received, run them through the NN in such-
| and-such way." would constitute an algorithm.
| 6gvONxR4sf7o wrote:
| Human performance is much closer to 100% on this, depending on
| your human. It's easy to miss the dot in the corner of the
| headline graph in TFA that says "STEM grad."
| hypoxia wrote:
| It actually beats the human average by a wide margin:
|
| - 64.2% for humans vs. 82.8%+ for o3.
|
| ...
|
| Private Eval:
|
| - 85%: threshold for winning the prize [1]
|
| Semi-Private Eval:
|
| - 87.5%: o3 (unlimited compute) [2]
|
| - 75.7%: o3 (limited compute) [2]
|
| Public Eval:
|
| - 91.5%: o3 (unlimited compute) [2]
|
| - 82.8%: o3 (limited compute) [2]
|
| - 64.2%: human average (Mechanical Turk) [1] [3]
|
| Public Training:
|
| - 76.2%: human average (Mechanical Turk) [1] [3]
|
| ...
|
| References:
|
| [1] https://arcprize.org/guide
|
| [2] https://arcprize.org/blog/oai-o3-pub-breakthrough
|
| [3] https://arxiv.org/abs/2409.01374
| usaar333 wrote:
| Superhuman isn't beating random Mechanical Turk workers.
|
| Their post has STEM grads at nearly 100%.
| tripletao wrote:
| This is correct. It's easy to get arbitrarily bad results
| on Mechanical Turk, since without any quality control
| people will just click as fast as they can to get paid (or
| bot it and get paid even faster).
|
| So in practice, there's always some kind of quality
| control. Stricter quality control will improve your
| results, and the right amount of quality control is
| subjective. This makes any assessment of human quality
| meaningless without explanation of how those humans were
| selected and incentivized. Chollet is careful to provide
| that, but many posters here are not.
|
| In any case, the ensemble of task-specific, low-compute
| Kaggle solutions is reportedly also super-Turk, at 81%. I
| don't think anyone would call that AGI, since it's not
| general; but if the "(tuned)" in the figure means o3 was
| tuned specifically for these tasks, that's not obviously
| general either.
| Imnimo wrote:
| Whenever a benchmark that was thought to be extremely difficult
| is (nearly) solved, it's a mix of two causes. One is that
| progress on AI capabilities was faster than we expected, and the
| other is that there was an approach that made the task easier
| than we expected. I feel like there's a lot of the former
| here, but the compute cost per task (thousands of dollars to
| solve one little color grid puzzle??) suggests to me that there's
| some amount of the latter. Chollet also mentions ARC-AGI-2 might
| be more resistant to this approach.
|
| Of course, o3 looks strong on other benchmarks as well, and
| sometimes "spend a huge amount of compute for one problem" is a
| great feature to have available if it gets you the answer you
| needed. So even if there's some amount of "ARC-AGI wasn't quite
| as robust as we thought", o3 is clearly a very powerful model.
| exe34 wrote:
| > the other is that there was an approach that made the task
| easier than we expected.
|
| from reading Dennett's philosophy, I'm convinced that that's
| how human intelligence works - for each task that "only a human
| could do that", there's a trick that makes it easier than it
| seems. We are bags of tricks.
| whoistraitor wrote:
| The general message here seems to be that inference-time brute-
| forcing works as long as you have a good search and evaluation
| strategy. We've seemingly hit a ceiling on the base LLM forward-
| pass capability so any further wins are going to be in how we
| juggle multiple inferences to solve the problem space. It feels
| like a scripting problem now. Which is cool! A fun space for
| hacker-engineers. Also:
|
| > My mental model for LLMs is that they work as a repository of
| vector programs. When prompted, they will fetch the program that
| your prompt maps to and "execute" it on the input at hand. LLMs
| are a way to store and operationalize millions of useful mini-
| programs via passive exposure to human-generated content.
|
| I found this such an intriguing way of thinking about it.
| whimsicalism wrote:
| > We've seemingly hit a ceiling on the base LLM forward-pass
| capability so any further wins are going to be in how we juggle
| multiple inferences to solve the problem space
|
| Not so sure - but we might need to figure out the
| inference/search/evaluation strategy in order to provide the
| data we need to distill to the single forward-pass data
| fitting.
| cchance wrote:
| Is it just me or does looking at the ARC-AGI example questions at
| the bottom... make your brain hurt?
| drdaeman wrote:
| Looks pretty obvious to me, although, of course, it took me a
| few moments to understand what's expected as a solution.
|
| c6e1b8da is moving rectangular figures by a given vector,
| 0d87d2a6 is drawing horizontal and/or vertical lines
| (connecting dots at the edges) and filling figures they touch,
| b457fec5 is filling gray figures with a given repeating color
| pattern.
|
| This is pretty straightforward stuff that doesn't require much
| spatial thinking or keeping multiple things/aspects in memory -
| visual puzzles from various "IQ" tests are way harder.
|
| This said, now I'm curious how SoTA LLMs would do on something
| like WAIS-IV.
| randyrand wrote:
| I'll sound like a total douche bag - but I thought they were
| incredibly obvious - which I think is the point of them.
|
| What took me longer was figuring out how the question was
| arranged, i.e. left input, right output, 3 examples each
| airstrike wrote:
| Uhh...some of us are apparently living under a rock, as this is
| the first time I hear about o3 and I'm on HN far too much every
| day
| burningion wrote:
| I think it was just announced today! You're fine!
| cryptoegorophy wrote:
| Besides higher scores, are there any improvements for general
| use? Like asking it to help set up Home Assistant, etc.?
| rvz wrote:
| Great results. However, let's all just admit it.
|
| It has already largely replaced journalists and artists, and it
| is on its way to replacing nearly all junior and senior
| engineers. The ultimate intention of "AGI" is that it is going
| to replace tens of millions of jobs. That is it, and you know it.
|
| It will only accelerate, and we need to stop pretending and
| coping. Instead, let's discuss solutions for those lost jobs.
|
| So what is the replacement for these lost jobs? (It is not UBI or
| "better jobs" without defining them.)
| neom wrote:
| Do you follow Jack Clark? I noticed he's been on the road a lot
| talking to governments and policy makers, and not just in the
| "AI is coming" way he used to talk.
| whynotminot wrote:
| When none of us have jobs or income, there will be no ability
| for us to buy products. And then no reason for companies to buy
| ads to sell products to people who don't have money. Without ad
| money (or the potential of future ad money), the people pushing
| the bounds of AGI into work replacement will lose the very
| income streams powering this research and their valuations.
|
| Ford didn't support a 40 hour work week out of the kindness of
| his heart. He wanted his workers to have time off for buying
| things (like his cars).
|
| I wonder if our AGI industrialist overlords will do something
| similar for revenue sharing or UBI.
| whimsicalism wrote:
| This picture doesn't make sense. If most don't have any money
| to buy products, just invent some other money and start
| paying one of the other people who doesn't have any money to
| start making the products for you.
|
| In reality, if there really is mass unemployment, AI driven
| automation will make consumables so cheap that anyone will be
| able to buy it.
| whynotminot wrote:
| > This picture doesn't make sense. If most don't have any
| money to buy products, just invent some other money and
| start paying one of the other people who doesn't have any
| money to start making the products for you.
|
| Uh, this picture doesn't make sense. Why would anyone value
| this randomly invented money?
| whimsicalism wrote:
| > Why would anyone value this randomly invented money?
|
| Because they can use it to pay for goods?
|
| Your notion is that almost everyone is going to be out of
| a job and thus have nothing. Okay, so I'm one of those
| people and I need this house built. But I'm not making
| any money because of AI or whatever. Maybe someone else
| needs someone to drive their aging relative around and
| they're a good builder.
|
| If 1. neither of those people have jobs or income because
| of AI 2. AI isn't provisioning services for basically
| free,
|
| then it makes sense for them to do an exchange of labor -
| even with AI (if that AI is not providing services to
| everyone). The original reason for having money and
| exchanging it still exists.
| whynotminot wrote:
| Honestly I don't even know how to engage with your point.
|
| Yes if we recreate society some form of money would
| likely emerge.
| neom wrote:
| Didn't money basically only emerge to deal with the
| difficulty of the "double coincidence of wants"? Money simply
| solved the problem of making all forms of value
| interchangeable and transportable across time AND
| circumstance. A dollar can do that with or without AI
| existing, no?
| whimsicalism wrote:
| Yes, that's my point
| staticman2 wrote:
| You seem to be arguing that large unemployment rates are
| logically impossible, so we shouldn't worry about
| unemployment.
|
| The fact unemployment was 25% during the great depression
| would seem to suggest that at a minimum, a 25%
| unemployment rate is possible during a disruptive event.
| tivert wrote:
| > This picture doesn't make sense. If most don't have any
| money to buy products, just invent some other money and
| start paying one of the other people who doesn't have any
| money to start making the products for you.
|
| Ultimately, it all comes down to raw materials and similar
| resources, _and all those will be claimed by people with
| lots of real money_. Your "invented ... other money" will
| be useless to buy that fundamental stuff. At best, it will
| be useful for trading scrap and other junk among the
| unemployed.
|
| > In reality, if there really is mass unemployment, AI
| driven automation will make consumables so cheap that
| anyone will be able to buy it.
|
| No. Why would the people who own that automation want to
| waste their resources producing consumer goods for people
| with nothing to give them in return?
| tivert wrote:
| > When none of us have jobs or income, there will be no
| ability for us to buy products. And then no reason for
| companies to buy ads to sell products to people who don't
| have money. Without ad money (or the potential of future ad
| money), the people pushing the bounds of AGI into work
| replacement will lose the very income streams powering this
| research and their valuations.
|
| I don't think so. I agree the push for AGI will kill the
| modern consumer product economy, but I think it's quite
| possible for the economy to evolve into a new form (that will
| probably be terrible for most humans) that keeps pushing "work
| replacement."
|
| Imagine an AGI billionaire buying up land, mines, and power
| plants as the consumer economy dies, then shifting those
| resources away from the consumer economy into self-
| aggrandizing pet projects (e.g. ziggurats, penthouses on
| Mars, space yachts, life extension, and stuff like that). He
| might still employ a small community of servants, AGI
| researchers, and other specialists; but all the rest of the
| population will be irrelevant to him.
|
| And individual autarky probably isn't necessary; consumption
| will be redirected towards the massive pet projects I
| mentioned, with vestigial markets for power, minerals, etc.
| RivieraKid wrote:
| The economic theory answer is that people simply switch to jobs
| that are not yet replaceable by AI. Doctors, nurses,
| electricians, construction workers, police officers, etc.
| People in aggregate will produce more, consume more and work
| less.
| drdaeman wrote:
| > It has well replaced journalists, artists and on its way to
| replace nearly both junior and senior engineers.
|
| Did it, really? Or did it just provide automation for routine
| no-thinking-necessary text-writing tasks, while still remaining
| ultimately bound by the level of the human operator's
| intelligence? I strongly suspect it's the latter. If it has
| actually replaced journalists, it must have been at junk
| outlets, where readers' intelligence is negligible and anything
| goes.
|
| Just yesterday I used o1 and Claude 3.5 to debug a Linux
| kernel issue (ultimately, a bad DSDT table leaving the TPM2
| driver unable to reserve a memory region for the command
| response buffer; the solution was to use memmap to remove the
| NVS flag from the relevant regions) and confirmed once again
| that LLMs still don't reason at all - they just spew out
| plausible-looking chains of words. The models were good
| listeners and mostly-helpful code generators (when they didn't
| make the silliest mistakes), but they showed no traces of
| understanding and paid no attention to any nuances (e.g. the
| LLM used `IS_ERR` to check the `__request_resource` result,
| despite me giving it the full source code for that function,
| where there's even a comment that makes it obvious it returns
| a pointer or NULL, not an error code - a misguided-attention
| kind of mistake).
|
| So, in my opinion, LLMs (as currently available to broad
| public, like myself) are useful for automating away some
| routine stuff, but their usefulness is bounded by the
| operator's knowledge and intelligence. And that means that the
| actual jobs (if they require thinking and not just writing
| words) are safe.
|
| When asked about what I do at work, I used to joke that I just
| press buttons on my keyboard in fancy patterns. Ultimately,
| LLMs seem to suggest that it's not what I really do.
| mensetmanusman wrote:
| I'm super curious as to whether this technology completely
| destroys the middle class, or if everyone becomes better off
| because productivity is going to skyrocket.
| mhogers wrote:
| Is anyone here aware of the latest research that tries to
| predict the outcome? Please share - super curious as well
| te_chris wrote:
| There's this https://arxiv.org/pdf/2312.05481v9
| pdfernhout wrote:
| Some thoughts I put together on all this circa 2010:
| https://pdfernhout.net/beyond-a-jobless-recovery-knol.html
| "This article explores the issue of a "Jobless Recovery"
| mainly from a heterodox economic perspective. It emphasizes
| the implications of ideas by Marshall Brain and others that
| improvements in robotics, automation, design, and voluntary
| social networks are fundamentally changing the structure of
| the economic landscape. It outlines towards the end four
| major alternatives to mainstream economic practice (a basic
| income, a gift economy, stronger local subsistence economies,
| and resource-based planning). These alternatives could be
| used in combination to address what, even as far back as
| 1964, has been described as a breaking "income-through-jobs
| link". This link between jobs and income is breaking because
| of the declining value of most paid human labor relative to
| capital investments in automation and better design. Or, as
| is now the case, the value of paid human labor like at some
| newspapers or universities is also declining relative to the
| output of voluntary social networks such as for digital
| content production (like represented by this document). It is
| suggested that we will need to fundamentally reevaluate our
| economic theories and practices to adjust to these new
| realities emerging from exponential trends in technology and
| society."
| tivert wrote:
| > I'm super curious as to whether this technology completely
| destroys the middle class, or if everyone becomes better off
| because productivity is going to skyrocket.
|
| Even if productivity skyrockets, why would anyone assume the
| dividends would be shared with the "destroy[ed] middle class"?
|
| All indications are that this will end up like the China Shock:
| "I lost my middle class job, and all I got was the opportunity
| to buy flimsy pieces of crap from a dollar store." America
| lacks the ideological foundations for any other result, and the
| coming economic changes will likely make building those
| foundations even more difficult if not impossible.
| rohan_ wrote:
| Because access to the financial system was democratized ten
| years ago
| croemer wrote:
| The programming task they gave o3-mini high (creating a Python
| server that allows chatting with the OpenAI API and running
| some code in a terminal) didn't seem very hard? Strange choice
| of example for something that's claimed to be a big step forward.
|
| YT timestamped link:
| https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for
| the fixed link @photonboom)
|
| Updated: I gave the task to Claude 3.5 Sonnet and it worked first
| shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-
| faa5aa...
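|
| For context, the described task is roughly this much code. A
| minimal sketch assuming Flask and the openai Python package; the
| endpoint names and the exact behavior the demo asked for are
| guesses on my part:
|
|     # small server that relays chat to the OpenAI API and can run
|     # a shell command locally; needs OPENAI_API_KEY in the env
|     import subprocess
|     from flask import Flask, request, jsonify
|     from openai import OpenAI
|
|     app = Flask(__name__)
|     client = OpenAI()
|
|     @app.route("/chat", methods=["POST"])
|     def chat():
|         msg = request.get_json()["message"]
|         resp = client.chat.completions.create(
|             model="gpt-4o-mini",
|             messages=[{"role": "user", "content": msg}],
|         )
|         return jsonify(reply=resp.choices[0].message.content)
|
|     @app.route("/run", methods=["POST"])
|     def run_command():
|         cmd = request.get_json()["command"]
|         out = subprocess.run(cmd, shell=True, capture_output=True,
|                              text=True, timeout=30)
|         return jsonify(stdout=out.stdout, stderr=out.stderr)
|
|     if __name__ == "__main__":
|         app.run(port=8000)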
| bearjaws wrote:
| It's good that it works since if you ask GPT-4o to use the
| openai sdk it will often produce invalid and out of date code.
| m3kw9 wrote:
| I would say they didn't need to demo anything, because if you
| are gonna use the output code live in a demo it may produce
| compile errors, and then they'd look stupid trying to fix it
| live.
| croemer wrote:
| If it was a safe-bet problem, then they should have said
| that. To me it looks like they faked excitement for something
| that isn't exciting, which lowers the credibility of the whole
| presentation.
| sunaookami wrote:
| They actually did that the last time when they showed the
| apps integration. First try in Xcode didn't work.
| m3kw9 wrote:
| Yeah I think that time it was ok because they were demoing
| the app function, but for this they are demoing the model
| smarts
| photonboom wrote:
| here's the right timestamp:
| https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s
| phil917 wrote:
| Yeah I agree that wasn't particularly mind blowing to me and
| seems fairly in line with what existing SOTA models can do.
| Especially since they did it in steps. Maybe I'm missing
| something.
| MyFirstSass wrote:
| What? Is this what this is? Either this is a complete joke or
| we're missing something.
|
| I've been doing similar stuff in Claude for months and it's not
| that impressive when you see how limited they really are.
| tripletao wrote:
| Their discussion contains an interesting aside:
|
| > Moreover, ARC-AGI-1 is now saturating - besides o3's new score,
| the fact is that a large ensemble of low-compute Kaggle solutions
| can now score 81% on the private eval.
|
| So while these tasks get greatest interest as a benchmark for
| LLMs and other large general models, it doesn't yet seem obvious
| those outperform human-designed domain-specific approaches.
|
| I wonder to what extent the large improvement comes from OpenAI
| training deliberately targeting this class of problem. That
| result would still be significant (since there's no way to
| overfit to the private tasks), but would be different from an
| "accidental" emergent improvement.
| Bjorkbat wrote:
| I was impressed until I read the caveat about the high-compute
| version using 172x more compute.
|
| Assuming for a moment that the cost per task has a linear
| relationship with compute, then it costs a little more than $1
| million to get that score on the public eval.
|
| The results are cool, but man, this sounds like such a busted
| approach.
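|
| Rough arithmetic behind that estimate (using the ~$17/task
| low-compute figure and the ~400-question public eval mentioned
| elsewhere in the thread; all numbers approximate):
|
|     low_cost_per_task = 17   # ~$17-20/task for o3 low
|     multiplier = 172         # high compute used ~172x more
|     tasks = 400              # public eval size
|     print(low_cost_per_task * multiplier * tasks)  # -> 1169600, a bit over $1M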
| futureshock wrote:
| So what? I'm serious. Our current level of progress would have
| been sci-fi fantasy with the computers we had in 2000. The cost
| may be astronomical today, but we have proven a method to
| achieve human performance on tests of reasoning over novel
| problems. WOW. Who cares what it costs. In 25 years it will run
| on your phone.
| Bjorkbat wrote:
| It's not so much the cost as the fact that they got a
| slightly better result by throwing 172x more compute at each
| task. The fact that it may have cost somewhere north of
| $1 million simply helps to give a better idea of how absurd
| the approach is.
|
| It feels a lot less like the breakthrough when the solution
| looks so much like simply brute-forcing.
|
| But you might be right, who cares? Does it really matter how
| crude the solution is if we can achieve true AGI and bring
| the cost down by increasing the efficiency of compute?
| futureshock wrote:
| "Simply brute-forcing"
|
| That's the thing that's interesting to me though and I had
| the same first reaction. It's a very different problem than
| brute-forcing chess. It has one chance to come to the
| correct answer. Running through thousands or millions of
| options means nothing if the model can't determine which is
| correct. And each of these visual problems involve
| combinations of different interacting concepts. To solve
| them requires understanding, not mimicry. So no matter how
| inefficient and "stupid" these models are, they can be said
| to understand these novel problems. That's a direct counter
| to everyone who ever called these a stochastic parrot and
| said they were a dead-end to AGI that was only searching an
| in distribution training set.
|
| The compute costs are currently disappointing, but so was
| the cost of sequencing the first whole human genome. That
| went from 3 billion to a few hundred bucks from your local
| doctor.
| radioactivist wrote:
| So your claim for optimism here is that something today that
| took ~10^22 floating point operations (based on an estimate
| earlier in the thread) to execute will be running on phones
| in 25 years? Phones which are currently running at O(10^12)
| flops. That means ten orders of magnitude of improvement for
| that to run in a reasonable amount of time? It's a similar
| scale-up to going from ENIAC (500 flops) to a modern desktop
| (5-10 teraflops).
| futureshock wrote:
| That sounds reasonable to me because the compute cost for
| this level of reasoning performance won't stay at 10^22 and
| phones won't stay at 10^12. This reasoning breakthrough is
| about 3 months old.
| radioactivist wrote:
| I think expecting five _orders of magnitude_ improvement
| from either side of this (inference cost or phone
| performance) is insane.
| onemetwo wrote:
| In (1) the author uses a technique to improve the performance of
| an LLM: he trained Sonnet 3.5 to obtain 53.6% on the arc-agi-pub
| benchmark, and moreover he said that more compute would give
| better results. So the results of o3 could perhaps be produced
| the same way, using the same method with more compute, and if
| that is the case, the result of o3 is not very interesting.
|
| (1) https://params.com/@jeremy-berman/arc-agi
| TypicalHog wrote:
| This is actually mindblowing!
| blixt wrote:
| These results are fantastic. Claude 3.5 and o1 are already good
| enough to provide value, so I can't wait to see how o3 performs
| comparatively in real-world scenarios.
|
| But I gotta say, we must be saturating just about any zero-shot
| reasoning benchmark imaginable at this point. And we will still
| argue about whether this is AGI, in my opinion because these LLMs
| are forgetful and it's very difficult for an application
| developer to fix that.
|
| Models will need better ways to remember and learn from doing a
| task over and over. For example, let's look at code agents: the
| best we can do, even with o3, is to cram as much of the code base
| as we can fit into a context window. And if it doesn't fit we
| branch out to multiple models to prune the context window until
| it does fit. And here's the kicker - the second time you ask for
| it to do something this all starts over from zero again. With
| this amount of reasoning power, I'm hoping session-based learning
| becomes the next frontier for LLM capabilities.
|
| (There are already things like tool use, linear attention, RAG,
| etc that can help here but currently they come with downsides and
| I would consider them insufficient.)
| vessenes wrote:
| This feels like big news to me.
|
| First of all, ARC is definitely an intelligence test for autistic
| people. I say as someone with a tad of the neurodiversity. That
| said, I think it's a pretty interesting one, not least because as
| you go up in the levels, it requires (for a human) a fair amount
| of lateral thinking and analogy-type thinking, and of course, it
| requires that this go in and out of visual representation. That
| said, I think it's a bit funny that most of the people training
| these next-gen AIs are neurodiverse and we are training the AI in
| our own image. I continue to hope for some poet and painter-
| derived intelligence tests to be added to the next gen tests we
| all look at and score.
|
| For those reasons, I've always really liked ARC as a test -- not
| as some be-all end-all for AGI, but just because I think that the
| most intriguing areas next for LLMs are in these analogy arenas
| and ability to hold more cross-domain context together for
| reasoning and etc.
|
| Prompts that are interesting to play with right now on these
| terms range from asking multimodal models to say count to ten in
| a Boston accent, and then propose a regional French accent that's
| an equivalent and count to ten in that. (To my ear, 4o is
| unconvincing on this). Similar in my mind is writing and
| architecting code that crosses multiple languages and APIs, and
| asking for it to be written in different styles. (Claude and
| o1-pro are .. okay at this, depending).
|
| Anyway. I agree that this looks like a large step change. I'm not
| sure if the o3 methods here involve the spinning up of clusters
| of python interpreters to breadth-search for solutions -- a
| method used to make headway on ARC in the past; if so, this is
| still big, but I think less exciting than if the stack is close
| to what we know today, and the compute time is just more
| introspection / internal beam search type algorithms.
|
| Either way, something had to assess answers and think they were
| right, and this is a HUGE step forward.
| jamiek88 wrote:
| > most of the people training these next-gen AIs are
| neurodiverse
|
| Citation needed. This is a huge claim based only on stereotype.
| vessenes wrote:
| So true. Perhaps I'm just thinking it's my people and need to
| update my priors.
| getpost wrote:
| > most of the people training these next-gen AIs are
| neurodiverse and we are training the AI in our own image
|
| Do you have any evidence to support that? It would be
| fascinating if the field is primarily advancing due to a unique
| constellation of traits contributed by individuals who, in the
| past, may not have collaborated so effectively.
| vessenes wrote:
| PURELY Anecdotal. But I'll say that as of 2024 1 in 36 US
| children are diagnosed on the spectrum according to the
| CDC(!), which would mean if you met 10 AI researchers and 4
| were neurodivergent you'd reasonably expect that it's a
| higher-than-population average representation. I'm polling
| from the Effective Altruist AI folks in my mind, and the
| number is definitely, definitely higher than 4/10.
| EVa5I7bHFq9mnYK wrote:
| Are there non-Effective Altruist AI folks?
| vessenes wrote:
| I love how this might mean "non-Effective",
| non-"Effective Altruist" or non-"Effective Altruist AI"
| folks.
|
| Yes
| nopinsight wrote:
| Let me go against some skeptics and explain why I think full o3
| is pretty much AGI or at least embodies most essential aspects of
| AGI.
|
| What has been lacking so far in frontier LLMs is the ability to
| reliably deal with the right level of abstraction for a given
| problem. Reasoning is useful but often comes out lacking if one
| cannot reason at the right level of abstraction. (Note that many
| humans can't either when they deal with unfamiliar domains,
| although that is not the case with these models.)
|
| ARC has been challenging precisely because solving its problems
| often requires: 1) using multiple different *kinds* of core
| knowledge [1], such as symmetry, counting, and color, AND
| 2) using the right level(s) of abstraction.
|
| Achieving human-level performance in the ARC benchmark, _as well
| as_ top human performance in GPQA, Codeforces, AIME, and Frontier
| Math suggests the model can potentially solve any problem at the
| human level if it possesses essential knowledge about it. Yes,
| this includes out-of-distribution problems that most humans can
| solve.
|
| It might not _yet_ be able to generate highly novel theories,
| frameworks, or artifacts to the degree that Einstein,
| Grothendieck, or van Gogh could. But not many humans can either.
|
| [1] https://www.harvardlds.org/wp-
| content/uploads/2017/01/Spelke...
|
| ADDED:
|
| Thanks to the link to Chollet's posts by lswainemoore below. I've
| analyzed some easy problems that o3 failed at. They involve
| spatial intelligence, including connection and movement. This
| skill is very hard to learn from textual and still image data.
|
| I believe this sort of core knowledge is learnable through
| movement and interaction data in a simulated world and it will
| _not_ present a very difficult barrier to cross. (OpenAI
| purchased a company behind a Minecraft clone a while ago. I've
| wondered if this is the purpose.)
| xvector wrote:
| Agree. AGI is here. I feel such a sense of pride in our
| species.
| timabdulla wrote:
| What's your explanation for why it can only get ~70% on SWE-
| bench Verified?
|
| I believe about 90% of the tasks were estimated by humans to
| take less than one hour to solve, so we aren't talking about
| very complex problems, and to boot, the contamination factor is
| huge: o3 (or any big model) will have in-depth knowledge of the
| internals of these projects, and often even know about the
| individual issues themselves (e.g. you can ask what GitHub
| issue #4145 in project foo was, and there's a decent chance it
| can tell you exactly what the issue was about!)
| slewis wrote:
| I've spent tons of time evaluating o1-preview on SWEBench-
| Verified.
|
| For one, I speculate OpenAI is using a very basic agent
| harness to get the results they've published on SWEBench. I
| believe there is a fair amount of headroom to improve results
| above what they published, using the same models.
|
| For two, some of the instances, even in SWEBench-Verified,
| require a bit of "going above and beyond" to get right. One
| example is an instance where the user states that a TypeError
| isn't properly handled. The developer who fixed it handled
| the TypeError but also handled a ValueError, and the golden
| test checks for both. I don't know how many instances fall in
| this category, but I suspect it's more than on a simpler
| benchmark like MATH.
| nopinsight wrote:
| One possibility is that it may not yet have sufficient
| _experience and real-world feedback_ for resolving coding
| issues in professional repos, as this involves multiple steps
| and very diverse actions (or branching factor, in AI terms).
| They have committed to not training on API usage, which
| limits their ability to directly acquire training data from
| it. However, their upcoming agentic efforts may address this
| gap in training data.
| timabdulla wrote:
| Right, but the branching factor increases exponentially
| with the scope of the work.
|
| I think it's obvious that they've cracked the formula for
| solving well-defined, small-in-scope problems at a
| superhuman level. That's an amazing thing.
|
| To me, it's less obvious that this implies that they will
| in short order with just more training data be able to
| solve ambiguous, large-in-scope problems at even just a
| skilled human level.
|
| There are far more paths to consider, much more context to
| use, and in an RL setting, the rewards are much more
| ambiguously defined.
| nopinsight wrote:
| Their reasoning models can learn from procedures and
| methods, which generalize far better than data. Software
| tasks are diverse but most tasks are still fairly limited
| in scope. Novel tasks might remain challenging for these
| models, as they do for humans.
|
| That said, o3 might still lack some kind of interaction
| intelligence that's hard to learn. We'll see.
| Imnimo wrote:
| >Achieving human-level performance in the ARC benchmark, as
| well as top human performance in GPQA, Codeforce, AIME, and
| Frontier Math strongly suggests the model can potentially solve
| any problem at the human level if it possesses essential
| knowledge about it.
|
| The article notes, "o3 still fails on some very easy tasks".
| What explains these failures if o3 can solve "any problem" at
| the human level? Do these failed cases require some essential
| knowledge that has eluded the massive OpenAI training set?
| nopinsight wrote:
| Great point. I'd love to see what these easy tasks are and
| would be happy to revise my hypothesis accordingly. o3's
| intelligence is unlikely to be a strict superset of human
| intelligence. It is certainly superior to humans in some
| respects and probably inferior in others. Whether it's
| sufficiently generally intelligent would be both a matter of
| definition and empirical fact.
| Imnimo wrote:
| Chollet has a few examples here:
|
| https://x.com/fchollet/status/1870172872641261979
|
| https://x.com/fchollet/status/1870173137234727219
|
| I would definitely consider them legitimately easy for
| humans.
| nopinsight wrote:
| Thanks! I added some comments on this at the bottom of
| the post above.
| phil917 wrote:
| Quote from the creators of the ARC-AGI benchmark: "Passing ARC-
| AGI does not equate achieving AGI, and, as a matter of fact, I
| don't think o3 is AGI yet. o3 still fails on some very easy
| tasks, indicating fundamental differences with human
| intelligence."
| CooCooCaCha wrote:
| Yeah the real goalpost is _reliable_ intelligence. A supposed
| PhD-level AI failing simple problems is a red flag that we're
| still missing something.
| gremlinsinc wrote:
| You've never met a doctor who couldn't figure out how to
| work their email? Or use street smarts? You can have a PhD
| but be unable to reliably handle soft skills, or any number
| of things you might 'expect' someone to be able to do.
|
| Just playing devil's advocate or nitpicking the language a
| bit...
| CooCooCaCha wrote:
| An important distinction here is you're comparing skill
| across very different tasks.
|
| I'm not even going that far, I'm talking about
| performance on similar tasks. Something many people have
| noticed about modern AI is it can go from genius to baby-
| level performance seemingly at random.
|
| Take self driving cars for example, a reasonably
| intelligent human of sound mind and body would never
| accidentally mistake a concrete pillar for a road. Yet
| that happens with self-driving cars, and seemingly here
| with ARC-AGI problems which all have a similar flavor.
| nuancebydefault wrote:
| A coworker of mine has a PhD in physics. Showing him the
| difference between little- and big-endian in a hex editor,
| showing file sizes of raw image files and how to compute
| them... I explained it 3 times, and maybe he understands part
| of it now.
| nopinsight wrote:
| I'd need to see what kinds of easy tasks those are and would
| be happy to revise my hypothesis if that's warranted.
|
| Also, it depends a great deal on what we define as AGI and
| whether they need to be a strict superset of typical human
| intelligence. o3's intelligence is probably superhuman in
| some aspects but inferior in others. We can find many humans
| who exhibit such tendencies as well. We'd probably say they
| think differently but would still call them generally
| intelligent.
| lswainemoore wrote:
| They're in the original post. Also here:
| https://x.com/fchollet/status/1870172872641261979 /
| https://x.com/fchollet/status/1870173137234727219
|
| Personally, I think it's fair to call them "very easy". If
| a person I otherwise thought was intelligent was unable to
| solve these, I'd be quite surprised.
| nopinsight wrote:
| Thanks! I've analyzed some easy problems that o3 failed
| at. They involve spatial intelligence including
| connection and movement. This skill is very hard to learn
| from textual and still image data.
|
| I believe this sort of core knowledge is learnable
| through movement and interaction data in a simulated
| world and it will not present a very difficult barrier to
| cross.
|
| (OpenAI purchased a company behind a Minecraft clone a
| while ago. I've wondered if this is the purpose.)
| lswainemoore wrote:
| > I believe this sort of core knowledge is learnable
| through movement and interaction data in a simulated
| world and it will not present a very difficult barrier to
| cross.
|
| Maybe! I suppose time will tell. That said, spatial
| intelligence (connection/movement included) is the whole
| game in this evaluation set. I think it's revealing that
| they can't handle these particular examples, and
| problematic for claims of AGI.
| 93po wrote:
| They say it isn't AGI, but I think the way o3 functions can be
| refined into AGI - it's learning to solve new, novel
| problems. We just need to make it do that more consistently,
| which seems achievable.
| nyrikki wrote:
| GPQA scores are mostly from pre-training, against content in
| the corpus. They have gone silent but look at the GPT4
| technical report which calls this out.
|
| We are nowhere close to what Sam Altman calls AGI and
| transformers are still limited to what uniform-TC0 can do.
|
| As an example the Boolean Formula Value Problem is
| NC1-complete, thus beyond transformers but trivial to solve
| with a TM.
|
| As it is now proven that the frame problem is equivalent to the
| halting problem, even if we can move past uniform-TC0 limits,
| novelty is still a problem.
|
| I think the advancements are truly extraordinary, but unless
| you set the bar very low, we aren't close to AGI.
|
| Heck we aren't close to P with commercial models.
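|
| To make the Boolean Formula Value Problem point concrete, the
| "trivial for a sequential machine" direction is just a textbook
| recursive evaluator; the tuple encoding of formulas here is my
| own, for illustration:
|
|     # evaluate a fully parenthesized boolean formula by a plain
|     # recursive walk; the claim above is that constant-depth
|     # threshold circuits, and hence transformers, cannot do this
|     # in general
|     def evaluate(f):
|         op = f[0]
|         if op == "const":
|             return f[1]
|         if op == "not":
|             return not evaluate(f[1])
|         if op == "and":
|             return evaluate(f[1]) and evaluate(f[2])
|         if op == "or":
|             return evaluate(f[1]) or evaluate(f[2])
|         raise ValueError("unknown operator: " + op)
|
|     # (x AND NOT y) OR y with x=True, y=False  ->  True
|     x = ("const", True)
|     y = ("const", False)
|     print(evaluate(("or", ("and", x, ("not", y)), y)))
|
| The hard part, per the argument above, is not computing the
| value but doing it within the circuit classes transformers
| correspond to.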
| sebzim4500 wrote:
| Isn't any physically realizable computer (including our
| brains) limited to what uniform-TC0 can do?
| drdeca wrote:
| Do you just mean because any physically realizable computer
| is a finite state machine? Or...?
|
| I wouldn't describe a computer's usual behavior as having
| constant depth.
|
| It is fairly typical to talk about problems in P as being
| feasible (though when the constant factors are too big,
| this isn't strictly true of course).
|
| Just because for unreasonably large inputs, my computer
| can't run a particular program and produce the correct
| answer for that input, due to my computer running out of
| memory, we don't generally say that my computer is
| fundamentally incapable of executing that algorithm.
| nyrikki wrote:
| Neither TC0 nor uniform-TC0 is physically realizable; they
| are tools, not physical devices.
|
| The default nonuniform circuit classes are allowed to have
| a different circuit per input size; the uniform types have
| unbounded fan-in.
|
| Similar to how a k-tape TM doesn't get 'charged' for the
| input size.
|
| With Nick Class (NC) the number of components is similar to
| traditional compute time while depth relates to the ability
| to parallelize operations.
|
| These are different than biological neurons, not better or
| worse but just different.
|
| Human neurons can use dendritic compartmentalization, use
| spike timing, can retime spikes etc...
|
| While the perceptron model we use in ML is useful, it is
| not able to do xor in one layer, while biological neurons
| do that without anything even reaching the soma, purely in
| the dendrites.
|
| Statistical learning models still come down to a choice
| function, no matter if you call that set shattering or...
|
| With physical computers the time hierarchy does apply and
| if TIME(g(n)) is given more time than TIME(f(n)), g(n) can
| solve more problems.
|
| So you can simulate a NTM with exhaustive search with a
| physical computer.
|
| Physical computers also tend to have NAND and XOR gates,
| and can have different circuit depths.
|
| When you are in TC0, you only have AND, OR and Threshold
| (or majority) gates.
|
| Think of instruction level parallelism in a typical CPU, it
| can return early, vs Itanium EPIC, which had to wait for
| the longest operation. Predicated execution is also how
| GPUs work.
|
| They can send a mask and save on load store ops as an
| example, but the cost of that parallelism is the constant
| depth.
|
| It is the parallelism tradeoff that both makes transformers
| practical and limits what they can do.
|
| The IID assumption and autograd requiring smooth manifolds
| play a role too.
|
| The frame problem, which causes hard problems to become
| unsolvable for computers and people alike does also.
|
| But the fact that we have polynomial time solutions for the
| Boolean Formula Value Problem, as mentioned in my post
| above, is probably a simpler way of realizing physical
| computers aren't limited to uniform-TC0.
| norir wrote:
| Personally I find "human-level" to be a borderline meaningless
| and limiting term. Are we now super human as a species relative
| to ourselves just five years ago because of our advances in
| developing computer programs that better imitate what many (but
| far from all) of us were already capable of doing? Have we
| reached a limit to human potential that can only be surpassed
| by digital machines? Who decides what human level is and when
| we have surpassed it? I have seen some ridiculous claims about
| ai in art that don't stand up to even the slightest scrutiny by
| domain experts but that easily fool the masses.
| PaulDavisThe1st wrote:
| > It might not yet be able to generate highly novel theories,
| frameworks, or artifacts to the degree that Einstein,
| Grothendieck, or van Gogh could.
|
| Every human does this dozens, hundreds or thousands of times
| ... during childhood.
| ec109685 wrote:
| The problem with ARC is that there are a finite number of
| heuristics that could be enumerated and trained for, which
| would give the model a substantial leg up on this evaluation, but
| not be generalized to other domains.
|
| For example, if they produce millions of examples of the type
| of problems o3 still struggles on, it would probably do better
| at similar questions.
|
| Perhaps the private data set is different enough that this
| isn't a problem, but the ideal situation would be unveiling a
| truly novel dataset, which it seems like arc aims to do.
| CliveBloomers wrote:
| Another meaningless benchmark, another month--it's like clockwork
| at this point. No one's going to remember this in a month; it's
| just noise. The real test? It's not in these flashy metrics or
| minor improvements. The only thing that actually matters is how
| fast it can wipe out the layers of middle management and all
| those pointless, bureaucratic jobs that add zero value.
|
| That's the true litmus test. Everything else? It's just fine-
| tuning weights, playing around the edges. Until it starts cutting
| through the fat and reshaping how organizations really operate,
| all of this is just more of the same.
| handfuloflight wrote:
| Agreed, but isn't it management who decides that this would be
| implemented? Are they going to propagate their own removal?
| zamadatix wrote:
| Middle manager types are probably interested in their salary
| performance more than anything. "Real" management (more of
| their assets come from their ownership of the company than a
| salary) will override them if it's truthfully the best
| performing operating model for the company.
| oytis wrote:
| So far AI market seems to be focused on replacing meaningful
| jobs, meaningless ones look safe (which kind of makes sense if
| you think about it).
| 6gvONxR4sf7o wrote:
| I'm glad these stats show a better estimate of human ability than
| just the average mturker. The graph here has the average mturker
| performance as well as a STEM grad measurement. Stuff like that
| is why we're always feeling weird that these things supposedly
| outperform humans while still sucking. I'm glad to see 'human
| performance' benchmarked with more variety (attention, time,
| education, etc).
| RivieraKid wrote:
| It sucks that I would love to be excited about this... but I
| mostly feel anxiety and sadness.
| xvector wrote:
| Humanity is about to enter an even steeper hockey stick growth
| curve. Progressing along the Kardashev scale feels all but
| inevitable. We will live to see Longevity Escape Velocity. I'm
| fucking pumped and feel thrilled and excited and proud of our
| species.
|
| Sure, there will be growing pains, friction, etc. Who cares?
| There always is with world-changing tech. Always.
| drcode wrote:
| longevity for the AIs
| tokioyoyo wrote:
| My job should be secure for a while, but why would an average
| person give a damn about humanity when they might lose their
| jobs and comfort levels? If I had kids, I would absolutely
| hate this uncertainty as well.
|
| "Oh well, I guess I can't give the opportunities to my kid
| that I wanted, but at least humanity is growing rapidly!"
| xvector wrote:
| > when they might lose their jobs and comfort levels?
|
| Everyone has always worried about this for every major
| technology throughout history
|
| IMO AGI will dramatically increase comfort levels, and lower
| your chance of death, disease, etc.
| tokioyoyo wrote:
| Again, sure, but it doesn't matter to an average person.
| That's too much focus on the hypothetical future. People
| care about the current times. In the short term it will
| suck for a good chunk of people, and whether the
| sacrifice is worth it will depend on who you are.
|
| People aren't really in uproar yet, because
| implementations haven't affected the job market of the
| masses. Afterwards? Time will tell.
| xvector wrote:
| Yes, people tend to focus on current times. It's an
| incredibly shortsighted mentality that selfishly puts
| oneself over tens of billions of future lives being
| improved. https://pessimistsarchive.org
| tokioyoyo wrote:
| Do you have any dependents, like parents or kids, by any
| chance? Imagine not being able to provide for them. Think
| how you'd feel in such circumstances.
|
| Like in general I totally agree with you, but I also
| understand why a person would care about their loved ones
| and themselves first.
| realce wrote:
| Eventually you draw the black ball, it is inevitable.
| croemer wrote:
| Longevity Escape Velocity? Even if you had orders of
| magnitude more people working on medical research, it's not a
| given that prolonging life indefinitely is even possible.
| soheil wrote:
| Of course it's a given. Unless you want to invoke
| supernatural causes, the human brain is a collection of
| cells with electro-chemical connections that, if fully
| reconstructed either physically or virtually, would
| necessarily represent the original person's brain.
| Therefore, with sufficient intelligence, it would be
| possible to engineer technology able to do that
| reconstruction without even having to go to the atomic
| level, which we also have a near-full understanding of
| already.
| lewhoo wrote:
| > Sure, there will be growing pains, friction, etc. Who
| cares?
|
| That's right. Who cares about the pains of others, and why
| they even should, are absolutely words to live by.
| xvector wrote:
| Yeah, with this mentality, we wouldn't have electricity
| today. You will never make transition to new technology
| painless, no matter what you do. (See:
| https://pessimistsarchive.org)
|
| What you are likely doing, though, is making many more
| future humans pay a cost in suffering. Every day we delay
| longevity escape velocity is another 150k people dead.
| lewhoo wrote:
| There was a time when in the name of progress people were
| killed for whatever resources they possessed, others were
| enslaved etc. and I was under the impression that the
| measure of our civilization is that we actually DID care
| and just how much. It seems to me that you are very eager
| to put up altars of sacrifice without even thinking that
| the problems you probably have in mind are perfectly
| solvable without them.
| smokedetector1 wrote:
| By far the greatest issue facing humanity today is wealth
| inequality.
| asdf6969 wrote:
| I would rather follow in the steps of uncle Ted than let AI
| turn me into a homeless person. It's no consolation that my
| tent will have a nice view of a lunar colony.
| objektif wrote:
| You sound like a rich person.
| soheil wrote:
| I agree, save invoking supernatural causes, the human brain
| is a collection of cells with electro-chemical connections
| that if fully reconstructed either physically or virtually
| would necessarily need to represent the original person's
| brain. Therefore with sufficient intelligence it would be
| possible to engineer technology that would be able to do that
| reconstruction without even having to go to the atomic level,
| which we also have a near full understanding of already.
| pupppet wrote:
| We're enabling a huge swath of humanity being put out of work
| so a handful of billionaires can become trillionaires.
| abiraja wrote:
| And also the solving of hundreds of diseases that ail us.
| hartator wrote:
| It doesn't matter. Statists would rather be poor, sick, and
| dead than risk trillionaires.
| thrance wrote:
| You should read about workers' rights in the Gilded Age,
| and see how good _laissez-faire_ capitalism was. What do
| you think will happen when the only thing you can trade
| with the trillionaires, your labor, becomes worthless?
| lewhoo wrote:
| One of the biggest factors in risk of death right now is
| poverty. Also what is being chased right now is "human
| level on most economically viable tasks" because the
| automated research for solving physics etc. even now seems
| far-fetched.
| thrance wrote:
| You need to solve diseases _and_ make the cure available.
| Millions die of curable diseases every year, simply because
| they are not deemed useful enough. What happens when your
| labor becomes worthless?
| asdf6969 wrote:
| Why do you think you'll be able to afford healthcare? The
| new medicine is for the AI owners
| gom_jabbar wrote:
| Anxiety and sadness are actually mild emotional responses to
| the dissolution of human culture. Nick Land in 1992:
|
| "It is ceasing to be a matter of how we think about technics,
| if only because technics is increasingly thinking about itself.
| It might still be a few decades before artificial intelligences
| surpass the horizon of biological ones, but it is utterly
| superstitious to imagine that the human dominion of terrestrial
| culture is still marked out in centuries, let alone in some
| metaphysical perpetuity. The high road to thinking no longer
| passes through a deepening of human cognition, but rather
| through a becoming inhuman of cognition, a migration of
| cognition out into the emerging planetary technosentience
| reservoir, into 'dehumanized landscapes ... emptied spaces'
| where human culture will be dissolved. Just as the capitalist
| urbanization of labour abstracted it in a parallel escalation
| with technical machines, so will intelligence be transplanted
| into the purring data zones of new software worlds in order to
| be abstracted from an increasingly obsolescent anthropoid
| particularity, and thus to venture beyond modernity. Human
| brains are to thinking what mediaeval villages were to
| engineering: antechambers to experimentation, cramped and
| parochial places to be.
|
| [...]
|
| Life is being phased-out into something new, and if we think
| this can be stopped we are even more stupid than we seem." [0]
|
| Land is being ostracized for some of his provocations, but it
| seems pretty clear by now that we are in the Landian
| Accelerationism timeline. Engaging with his thought is crucial
| to understanding what is happening with AI, and what is still
| largely unseen, such as the autonomization of capital.
|
| [0] https://retrochronic.com/#circuitries
| Jcampuzano2 wrote:
| Same, it's sad, but I honestly hoped they would never achieve
| these results and it would come out that it wasn't possible or
| would take an insurmountable amount of resources. But here we
| are, on the verge of making most humans useless when it comes
| to productivity.
|
| While there are those that are excited, the world is not
| prepared for the level of distress this could put on the
| average person without critical changes at a monumental level.
| JacksCracked wrote:
| If you don't feel like the world needed grand scale changes
| at a societal level with all the global problems we're unable
| to solve, you haven't been paying attention. Income
| inequality, corporate greed, political apathy, global
| warming.
| sensanaty wrote:
| And you think the bullshit generators backed by the largest
| corporate entities in humanity who are, as we speak,
| causing all the issues you mention are somehow gonna solve
| any of this?
| bluecoconut wrote:
| Efficiency is now key.
|
| ~=$3400 per single task to meet human performance on this
| benchmark is a lot. Also it shows the bullets as "ARC-AGI-TUNED",
| which makes me think they did some undisclosed amount of fine-
| tuning (eg. via the API they showed off last week), so even more
| compute went into this task.
|
| We can compare this roughly to a human doing ARC-AGI puzzles,
| where a human will take (high variance in my subjective
| experience) between 5 seconds and 5 minutes to solve the task.
| (So I'd argue a human is at $0.03 - $1.67 per puzzle at $20/hr,
| and their document quotes an average mechanical turker at $2 per
| task.)
|
| Going the other direction: I am interpreting this result as human-
| level reasoning now costing (approximately) $41k/hr to $2.5M/hr
| with current compute.
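|
| For anyone checking the arithmetic, a tiny sketch (all inputs
| assumed from above: $20/hr human, 5 s - 5 min per puzzle, and
| ~$3,400 per o3-high task):
|
|     HUMAN_RATE = 20.0                  # USD per hour, assumed
|     O3_TASK = 3400.0                   # USD per task, assumed
|     for seconds in (5, 300):           # 5 s to 5 min per puzzle
|         human = HUMAN_RATE * seconds / 3600
|         o3_hourly = O3_TASK * 3600 / seconds
|         print(f"{seconds:>3}s/puzzle: human ${human:.2f}, "
|               f"o3 ~${o3_hourly:,.0f}/hr")
|     #   5s/puzzle: human $0.03, o3 ~$2,448,000/hr
|     # 300s/puzzle: human $1.67, o3 ~$40,800/hr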
|
| Super exciting that OpenAI pushed the compute out this far so we
| could see the O-series scaling continue and intersect humans on
| ARC, now we get to work towards making this economical!
| riku_iki wrote:
| > ~=$3400 per single task
|
| report says it is $17 per task, and $6k for whole dataset of
| 400 tasks.
| bluecoconut wrote:
| That's the low-compute mode. In the plot at the top where
| they score 88%, O3 High (tuned) is ~3.4k
| ionwake wrote:
| Sorry to be a noob, but can someone tell me: does this mean
| o3 will be unaffordable for a typical user? Will only
| companies with thousands to spend per query be able to use
| this?
|
| Sorry for being thick, I'm just confused how they can turn
| this into an affordable service.
| jhrmnn wrote:
| That's for the low-compute configuration that doesn't reach
| human-level performance (not far though)
| riku_iki wrote:
| I referred to the high compute mode. They have a table with a
| breakdown here:
| https://arcprize.org/blog/oai-o3-pub-breakthrough
| EVa5I7bHFq9mnYK wrote:
| That's high EFFICIENCY. High efficiency = low compute.
| junipertea wrote:
| The table row with 6k figure refers to high efficiency,
| not high compute mode. From the blog post:
|
| Note: OpenAI has requested that we not publish the high-
| compute costs. The amount of compute was roughly 172x the
| low-compute configuration.
| gbnwl wrote:
| That's "efficiency" high, which actually means less
| compute. The 87.5% score using low efficiency (more
| compute) doesn't have cost listed.
| bluecoconut wrote:
| They use some confusing language.
|
| "High Efficiency" is O3 Low; "Low Efficiency" is O3 High.
|
| They left the "Low efficiency" (O3 High) values as `-`,
| but you can infer them from the plot at the top.
|
| Note the $20 and $17 per task align with the X-axis of
| the O3-low.
| binarymax wrote:
| _" Note: OpenAI has requested that we not publish the high-
| compute costs. The amount of compute was roughly 172x the
| low-compute configuration."_
|
| The low compute was $17 per task. Speculating 172 x $17 for the
| high compute gives $2,924 per task, so I am also confused about
| the $3400 number.
| bluecoconut wrote:
| 3400 came from counting pixels on the plot.
|
| Also, it's $20 for the o3-low on the semi-private set via the
| table, which x172 is $3,440, also coming in close to the
| $3,400 number.
| xrendan wrote:
| You're misreading it, there's two different runs, a low and a
| high compute run.
|
| The number for the high-compute one is ~172x the first one
| according to the article so ~=$2900
| bluecoconut wrote:
| Some other important quotes: "Average human off the street:
| 70-80%. STEM college grad: >95%. Panel of 10 random humans:
| 99-100%" -@fchollet on X
|
| So, considering that the $3400/task system isn't able to
| compete with STEM college grad yet, we still have some room
| (but it is shrinking, i expect even more compute will be thrown
| and we'll see these barriers broken in coming years)
|
| Also, some other back of envelope calculations:
|
| The gap in cost is roughly 10^3 between O3 High and avg.
| mechanical turkers (humans). Pure GPU cost improvement
| (~doubling every 2-2.5 years) puts us at 20-25 years.
|
| The question is now, can we close this "to human" gap (10^3)
| quickly with algorithms, or are we stuck waiting for the 20-25
| years for GPU improvements. (I think it feels obvious: this is
| new technology, things are moving fast, the chance for
| algorithmic innovation here is high!)
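|
| The back-of-envelope behind those 20-25 years, assuming a ~10^3
| cost gap and cost-efficiency doubling every 2-2.5 years:
|
|     import math
|
|     gap = 1e3                              # assumed cost gap
|     doublings = math.log2(gap)             # ~10 doublings needed
|     for years_per_doubling in (2.0, 2.5):
|         print(f"{doublings * years_per_doubling:.0f} years")
|     # prints 20 and 25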
|
| I also personally think that we need to adjust our efficiency
| priors, and start looking not at "humans" as the bar to beat,
| but at theoretical computable limits (which show much larger
| gaps, ~10^9-10^15, for modest problems). Though, it may simply
| be the case that tool/code use + AGI at near human cost covers
| a lot of that gap.
| zamadatix wrote:
| I don't follow how 10 random humans can beat the average STEM
| college grad and average humans in that tweet. I suspect it's
| really "a panel of 10 randomly chosen experts in the space"
| or something?
|
| I agree the most interesting thing to watch will be cost for
| a given score more than maximum possible score achieved (not
| that the latter won't be interesting by any means).
| hmottestad wrote:
| Might be that within a group of 10 randomly chosen people,
| when each person attempts the tasks, at least 99% of the
| time one of the 10 will get it right.
| bcrosby95 wrote:
| Two heads are better than one. 10 is way better. Even if they
| aren't a field of experts. You're bound to get random
| people that remember random stuff from high school,
| college, work, and life in general, allowing them to piece
| together a solution.
| inerte wrote:
| Aaaah thanks for the explanation. PANEL of 10 humans, as
| in, they were all together. I parsed the phrase as "10
| random people" > "average human" which made little sense.
| modeless wrote:
| Actually I believe that he did mean 10 random people
| tested individually, not a committee of 10 people. The
| key being that the question is considered to be answered
| correctly if any one of the 10 people got it right. This
| is similar to how LLMs are evaluated with pass@5 or
| pass@10 criteria (because the LLM has no memory so
| running it 10 times is more like asking 10 random people
| than asking the same person 10 times in a row).
|
| I would expect 10 random people to do better than a
| committee of 10 people because 10 people have 10 chances
| to get it right while a committee only has one. Even if
| the committee gets 10 guesses (which must be made
| simultaneously, not iteratively) it might not do better
| because people might go along with a wrong consensus
| rather than push for the answer they would have chosen
| independently.
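|
| A toy sketch of that "any of n" reading, treating attempts as
| independent (the per-person rates below are assumed, roughly the
| 70-80% quoted upthread):
|
|     def pass_at_n(p: float, n: int = 10) -> float:
|         # chance at least one of n independent attempts succeeds
|         return 1 - (1 - p) ** n
|
|     for p in (0.70, 0.80):
|         print(f"{p:.0%} per person -> panel of 10: {pass_at_n(p):.3%}")
|     # 70% per person -> panel of 10: 99.999%
|     # 80% per person -> panel of 10: 100.000%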
| cchance wrote:
| I mean, considering the big breakthrough this year for o1/o3
| seems to have been "models having internal thoughts might
| help reasoning", which to everyone outside of the AI field
| was sort of a "duh" moment.
|
| I'd hope we see more internal optimizations and improvements
| to the models. The idea behind the big breakthrough - "don't
| spit out the first thought that pops into your head" - seems
| obvious to everyone outside of the field, but guess what, it
| turns out it was a big improvement when the devs decided to
| add it.
| iandanforth wrote:
| Let's say that Google is already 1 generation ahead of nvidia
| in terms of efficient AI compute. ($1700)
|
| Then let's say that OpenAI brute forced this without any
| meta-optimization of the hypothesized search component (they
| just set a compute budget). This is probably low hanging
| fruit and another 2x in compute reduction. ($850)
|
| Then let's say that OpenAI was pushing really really hard for
| the numbers and was willing to burn cash and so didn't bother
| with serious thought around hardware aware distributed
| inference. This could be _more_ than a 2x decrease in cost -
| we've seen better attention mechanisms deliver 10x reductions
| in cost - but let's go with 2x for now. ($425).
|
| So I think we've got about an 8x reduction in cost sitting
| there once Google steps up. This is probably 4-6 months of
| work flat out if they haven't already started down this path,
| but with what they've got with deep research, maybe it's
| sooner?
|
| Then if "all" we get is hardware improvements we're down to
| what 10-14 years?
| bjornsing wrote:
| > are we stuck waiting for the 20-25 years for GPU
| improvements
|
| If this turns out to be hard to optimize / improve then there
| will be a _huge_ economic incentive for efficient ASICs. No
| freaking way we'll be running on GPUs for 20-25 years, or
| even 2.
| aithrowawaycomm wrote:
| I would like to see this repeated with my highly innovative HARC-
| HAGI, which is ARC-AGI but it uses hexagons instead of squares. I
| suspect humans would only make slightly more brain farts on HARC-
| HAGI than ARC-AGI, but O3 would fail very badly since it almost
| certainly has been specifically trained on squares.
|
| I am not really trying to downplay O3. But this would be a simple
| test as to whether O3 is truly "a system capable of adapting to
| tasks it has never encountered before" versus novel ARC-AGI tasks
| it hasn't encountered before.
| botro wrote:
| The LLM community has come up with tests they call 'Misguided
| Attention'[1] where they prompt the LLM with a slightly altered
| version of common riddles / tests etc. This often causes the LLM
| to fail.
|
| For example I used the prompt "As an astronaut in China, would I
| be able to see the great wall?" and since the training data for
| all LLMs is full of text dispelling the common myth that the
| great wall is visible from space, LLMs do not notice the slight
| variation that the astronaut is IN China. This has been a
| sobering reminder to me as discussion of AGI heats up.
|
| [1] https://github.com/cpldcpu/MisguidedAttention
| whimsicalism wrote:
| We need to start making benchmarks for memory & continued
| processing over a task over multiple days, handoffs, etc. (i.e.
| 'agentic' behavior). Not sure how possible this is.
| slibhb wrote:
| Interesting about the cost:
|
| > Of course, such generality comes at a steep cost, and wouldn't
| quite be economical yet: you could pay a human to solve ARC-AGI
| tasks for roughly $5 per task (we know, we did that), while
| consuming mere cents in energy. Meanwhile o3 requires $17-20 per
| task in the low-compute mode.
| imranq wrote:
| Based on the chart, the Kaggle SOTA model is far more impressive.
| These O3 models are more expensive to run than just hiring a
| mechanical turk worker. It's nice we are proving out the scaling
| hypothesis further, it's just grossly inelegant.
|
| The Kaggle SOTA performs 2x as well as o1 high at a fraction of
| the cost
| cvhc wrote:
| I was going to say the same.
|
| I wonder what exactly o3 costs. Does it still spend a terrible
| amount of time thinking, despite being finetuned to the
| dataset?
| derac wrote:
| But does that Kaggle solution achieve human level perf with any
| level of compute? I think you're missing the forest for the
| trees here.
| neuroelectron wrote:
| OpenAI spent approximately $1,503,077 to smash the SOTA on ARC-
| AGI with their new o3 model
|
| Semi-private evals (100 tasks): 75.7% @ $2,012 total
| (~$20/task), with just 6 samples & 33M tokens processed in
| ~1.3 min/task.
|
| The "low-efficiency" setting with 1024 samples scored 87.5% but
| required 172x more compute.
|
| If we assume compute spent and cost are proportional, then OpenAI
| might have just spent ~$346,064 for the low-efficiency run on the
| semi-private eval.
|
| On the public eval they might have spent ~$1,148,444 to achieve
| 91.5% with the low-efficiency setting. (high-efficiency mode:
| $6,677)
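|
| A quick sketch of where those figures come from (assuming cost
| scales linearly with the ~172x compute factor):
|
|     low_compute = {"semi-private (100 tasks)": 2_012,
|                    "public (400 tasks)": 6_677}
|     for name, usd in low_compute.items():
|         print(f"{name}: ${usd:,} low -> ~${usd * 172:,} high")
|     # semi-private (100 tasks): $2,012 low -> ~$346,064 high
|     # public (400 tasks): $6,677 low -> ~$1,148,444 high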
|
| OpenAI just spent more money to run an eval on ARC than most
| people spend on a full training run.
| rfoo wrote:
| Pretty sure this "cost" is based on their retail price instead
| of actual inference cost.
| neuroelectron wrote:
| Yes, that's correct, and there's a bit of "pixel math" as well,
| so take these numbers with a pinch of salt. Preliminary model
| sizes from the temporarily public HF repository put the full
| model size at 8 TB, or roughly 80 H100s.
| bluecoconut wrote:
| By my estimates, for this single benchmark, this is comparable
| cost to training a ~70B model from scratch today. Literally
| from 0 to a GPT-3 scale model for the compute they ran on 100
| ARC tasks.
|
| I double checked with some flop estimates (P100 for 12 hours =
| Kaggle limit, they claim ~100-1000x for O3-low, and x172 for
| O3-high) so roughly on the order of 10^22-10^23 flops.
|
| In another way, using an H100 market price of ~$2/hr per chip,
| $350k is ~175k GPU-hours. Or 10^24 FLOPs in total.
|
| So, huge margin, but 10^22 - 10^24 flop is the band I think we
| can estimate.
|
| These are the scale of numbers that show up in the chinchilla
| optimal paper, haha. Truly GPT-3 scale models.
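|
| A sketch of that last estimate (inputs assumed: ~$2/hr per H100
| and ~1e15 FLOP/s of usable throughput per H100):
|
|     spend_usd = 350_000
|     usd_per_gpu_hour = 2                # assumed H100 rental price
|     flops_per_gpu_second = 1e15         # assumed usable throughput
|
|     gpu_hours = spend_usd / usd_per_gpu_hour
|     total_flops = gpu_hours * 3600 * flops_per_gpu_second
|     print(f"{gpu_hours:,.0f} GPU-hours, ~{total_flops:.0e} FLOPs")
|     # 175,000 GPU-hours, ~6e+23 FLOPs -> order of 10^24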
| rvnx wrote:
| It sounds like they essentially brute-forced the solutions?
| Ask the LLM for an answer, ask the LLM to verify the answer.
| Ask the LLM for an answer, ask the LLM to verify the answer.
| Add a bit of randomness. Ask the LLM for an answer, ask the
| LLM to verify the answer. Add a bit of randomness. Repeat 5B
| times (this is what the paper says).
| ramesh31 wrote:
| >OpenAI just spent more money to run an eval on ARC than most
| people spend on a full training run.
|
| Of course, this is just the scaling law holding true. More is
| more when it comes to LLM's as far as we've seen. Now it's just
| on the hardware side to make this feasible economically.
| sys32768 wrote:
| So in a few years, coders will be as relevant as cuneiform
| scribes.
| devoutsalsa wrote:
| When the source code for these LLMs gets leaked, I expect to see:
|     def letter_count(string, letter):
|         if string == "strawberry" and letter == "r":
|             return 3
|         ...
| knbknb wrote:
| In one of their release videos for the o1-preview model they
| _admitted_ that it's hardcoded in.
| phil917 wrote:
| Direct quote from the ARC-AGI blog:
|
| "SO IS IT AGI?
|
| ARC-AGI serves as a critical benchmark for detecting such
| breakthroughs, highlighting generalization power in a way that
| saturated or less demanding benchmarks cannot. However, it is
| important to note that ARC-AGI is not an acid test for AGI - as
| we've repeated dozens of times this year. It's a research tool
| designed to focus attention on the most challenging unsolved
| problems in AI, a role it has fulfilled well over the past five
| years.
|
| Passing ARC-AGI does not equate achieving AGI, and, as a matter
| of fact, I don't think o3 is AGI yet. o3 still fails on some very
| easy tasks, indicating fundamental differences with human
| intelligence.
|
| Furthermore, early data points suggest that the upcoming ARC-
| AGI-2 benchmark will still pose a significant challenge to o3,
| potentially reducing its score to under 30% even at high compute
| (while a smart human would still be able to score over 95% with
| no training). This demonstrates the continued possibility of
| creating challenging, unsaturated benchmarks without having to
| rely on expert domain knowledge. You'll know AGI is here when the
| exercise of creating tasks that are easy for regular humans but
| hard for AI becomes simply impossible."
|
| The high compute variant sounds like it cost around *$350,000*,
| which is kinda wild. Lol, the blog post specifically mentioned how
| OpenAI asked ARC-AGI to not disclose the exact cost for the high-
| compute version.
|
| Also, 1 odd thing I noticed is that the graph in their blog post
| shows the top 2 scores as "tuned" (this was not displayed in the
| live demo graph). This suggests that in those cases the model was
| trained to better handle these types of questions, so I do wonder
| about data / answer contamination in those cases...
| Bjorkbat wrote:
| > Also, 1 odd thing I noticed is that the graph in their blog
| post shows the top 2 scores as "tuned"
|
| Something I missed until I scrolled back to the top and reread
| the page was this
|
| > OpenAI's new o3 system - trained on the ARC-AGI-1 Public
| Training set
|
| So yeah, the results were specifically from a version of o3
| trained on the public training set
|
| Which on the one hand I think is a completely fair thing to do.
| It's reasonable that you should teach your AI the rules of the
| game, so to speak. There really aren't any spoken rules though,
| just pattern observation. Thus, if you want to teach the AI how
| to play the game, you must train it.
|
| On the other hand though, I don't think the o1 models or
| Claude were trained on the dataset, in which case it isn't a
| completely fair competition. If I had to guess, you could
| probably get 60% on o1 if you trained it on the public dataset
| as well.
| skepticATX wrote:
| Great catch. Super disappointing that AI companies continue
| to do things like this. It's a great result either way but
| predictably the excitement is focused on the jump from o1,
| which is now in question.
| Bjorkbat wrote:
| To me it's very frustrating because such little caveats
| make benchmarks less reliable. Implicitly, benchmarks are
| no different from tests in that someone/something who
| scores high on a benchmark/test _should_ be able to
| generalize that knowledge out into the real world.
|
| While that is true with humans taking tests, it's not
| really true with AIs evaluating on benchmarks.
|
| SWE-bench is a great example. Claude Sonnet can get
| something like a 50% on verified, whereas I think I might
| be able to score a 20-25%? So, Claude is a better
| programmer than me.
|
| Except that isn't really true. Claude can still make a lot
| of clumsy mistakes. I wouldn't even say these are junior
| engineer mistakes. I've used it for creative programming
| tasks and have found one example where it tried to use a
| library written for d3js for a p5js programming example.
| The confusion is kind of understandable, but it's also a
| really dumb mistake.
|
| Some very simple explanations, the models were probably
| overfitted to a degree on Python given its popularity in
| AI/ML work, and SWE-bench is all Python. Also, the
| underlying Github issues are quite old, so they probably
| contaminated the training data and the models have simply
| memorized the answers.
|
| Or maybe benchmarks are just bad at measuring intelligence
| in general.
|
| Regardless, every time a model beats a benchmark I'm
| annoyed by the fact that I have no clue whatsoever how much
| this actually translates into real world performance. Did
| OpenAI/Anthropic/Google actually create something that will
| automate wide swathes of the software engineering
| profession? Or did they create the world's most
| knowledgeable junior engineer?
| throwaway0123_5 wrote:
| > Some very simple explanations, the models were probably
| overfitted to a degree on Python given its popularity in
| AI/ML work, and SWE-bench is all Python. Also, the
| underlying Github issues are quite old, so they probably
| contaminated the training data and the models have simply
| memorized the answers.
|
| My understanding is that it works by checking if the
| proposed solution passes test-cases included in the
| original (human) PR. This seems to present some problems
| too, because there are surely ways to write code that
| passes the tests but would fail human review for one
| reason or another. It would be interesting to not only
| see the pass rate but also the rate at which the proposed
| solutions are preferred to the original ones (preferably
| evaluated by a human but even an LLM comparing the two
| solutions would be interesting).
| Bjorkbat wrote:
| If I recall correctly the authors of the benchmark did
| mention on Twitter that for certain issues models will
| submit an answer that technically passes the test but is
| kind of questionable, so yeah, good point.
| phil917 wrote:
| Lol I missed that even though it's literally the first
| sentence of the blog, good catch.
|
| Yeah, that makes this result a lot less impressive for me.
| hartator wrote:
| > acid test
|
| The css acid test? This can be gamed too.
| parsimo2010 wrote:
| I really like that they include reference levels for an average
| STEM grad and an average worker for Mechanical Turk. So for $350k
| worth of compute you can have slightly better performance than a
| menial wage worker, but slightly worse performance than a college
| grad. Right now humans win on value, but AI is catching up.
| nxobject wrote:
| As an aside, I'm a little miffed that the benchmark calls out
| "AGI" in the name, but then heavily cautions that it's necessary
| but insufficient for AGI.
|
| > ARC-AGI serves as a critical benchmark for detecting such
| breakthroughs, highlighting generalization power in a way that
| saturated or less demanding benchmarks cannot. However, it is
| important to note that ARC-AGI is not an acid test for AGI
| mmcnl wrote:
| I immediately thought so too. Why confuse everyone?
| notRobot wrote:
| Humans can take the test here to see what the questions are like:
| https://arcprize.org/play
| spyckie2 wrote:
| The more Hacker News worthy discussion is the part where the
| author talks about search through the possible mini-program space
| of LLMs.
|
| It makes sense because tree search can be endlessly optimized. In
| a sense, LLMs turn the unstructured, open system of general
| problems into a structured, closed system of possible moves.
| Which is really cool, IMO.
| glup wrote:
| Yes! This seems to be a really neat combination of 2010's
| Bayesian cleverness / Tenenbaumian program search approaches
| with the LLMs as merely sources of high-dim conditional
| distributions. I knew people were experimenting in this space
| (like https://escholarship.org/uc/item/7018f2ss) but didn't
| know it did so well wrt these new benchmarks.
| binarymax wrote:
| All those saying "AGI", read the article and especially the
| section "So is it AGI?"
| skizm wrote:
| This might sound dumb, and I'm not sure how to phrase this, but
| is there a way to measure the raw model output quality without
| all the more "traditional" engineering work (mountain of `if`
| statements I assume) done on top of the output? And if so, would
| that be a better measure of when scaling up the input data will
| start showing diminishing returns?
|
| (I know very little about the guts of LLMs or how they're tested,
| so the distinction between "raw" output and the more
| deterministic engineering work might be incorrect)
| whimsicalism wrote:
| what do you mean by the mountain of if-statements on top of the
| output? like checking if the output matches the expected result
| in evaluations?
| skizm wrote:
| Like when you type something into the chat gpt app _I am
| guessing_ it will start by preprocessing your input, doing
| some sanity checks, making sure it doesn't say "how do I
| build a bomb?" or whatever. It may or may not alter/clean up
| your input before sending it to the model for processing.
| Once processed, there's probably dozens of services it goes
| through to detect if the output is racist, somehow actually
| contained a bomb recipe, or maybe copyrighted material, normal
| pattern matching stuff, maybe some advanced stuff like
| sentiment analysis to see if the output is bad mouthing Trump
| or something, and it might either alter the output or simply
| try again.
|
| I'm wondering, when you strip out all that "extra" non-model
| pre- and post-processing, if there's some way to measure the
| performance of that.
| whimsicalism wrote:
| oh, no - but most queries aren't being filtered by
| supervisor models nowadays anyways.. most of the refusal is
| baked in
| Seattle3503 wrote:
| How can there be "private" tasks when you have to use the OpenAI
| API to run queries? OpenAI sees everything.
| tmaly wrote:
| Just curious, I know o1 is a model OpenAI offers. I have never
| heard of the o3 model. How does it differ from o1?
| roboboffin wrote:
| Interesting that in the video, there is an admission that they
| have been targeting this benchmark. A comment that was quickly
| shut down by Sam.
|
| A bit puzzling to me. Why does it matter?
| cubefox wrote:
| This was a surprisingly insightful blog post, going far beyond
| just announcing the o3 results.
| c1b wrote:
| How does o3 know when to stop reasoning?
| adtac wrote:
| It thinks hard about it
| c1b wrote:
| So o1 pro is CoT RL and o3 adds search?
| jack_pp wrote:
| AGI for me is something I can give a new project to and be able
| to use it better than me. And not because it has a huge context
| window, because it will update its weights after consuming that
| project. Until we have that I don't believe we have truly reached
| AGI.
|
| Edit: it also _tests_ the new knowledge, it has concepts such as
| trusting a source, verifying it, etc. If I can just gaslight it
| into unlearning Python then it's still too dumb.
| submeta wrote:
| I pay for lots of models, but Claude Sonnet is the one I use
| most. ChatGPT is my quick tool for short Q&As because it's got a
| desktop app. Even Google's new offerings did not lure me away
| from Claude which I use daily for hours via a Teams plan with
| five seats.
|
| Now I am wondering what Anthropic will come up with. Exciting
| times.
| isof4ult wrote:
| Claude also has a desktop app:
| https://support.anthropic.com/en/articles/10065433-installin...
| Animats wrote:
| The graph seems to indicate a new high in cost per task. It looks
| like they came in somewhere around $5000/task, but the log scale
| has too few markers to be sure.
|
| That may be a feature. If AI becomes too cheap, the over-funded
| AI companies lose value.
|
| (1995 called. It wants its web design back.)
| jstummbillig wrote:
| I doubt it. Competitive markets mostly work and inefficiencies
| are opportunities for other players. And AI is full of glaring
| inefficiencies.
| Animats wrote:
| Inefficiency can create a moat. If you can charge a lot for
| your product, you have ample cash for advertising, marketing,
| and lobbying, and can come out with many product variants. If
| you're the lowest cost producer, you don't have the margins
| to do that.
|
| The current US auto industry is an example of that strategy.
| So is the current iPhone.
| hypoxia wrote:
| Many are incorrectly citing 85% as human-level performance.
|
| 85% is just the (semi-arbitrary) threshold for winning the
| prize.
|
| o3 actually beats the human average by a wide margin: 64.2% for
| humans vs. 82.8%+ for o3.
|
| ...
|
| Here's the full breakdown by dataset, since none of the articles
| make it clear --
|
| Private Eval:
|
| - 85%: threshold for winning the prize [1]
|
| Semi-Private Eval:
|
| - 87.5%: o3 (unlimited compute) [2]
|
| - 75.7%: o3 (limited compute) [2]
|
| Public Eval:
|
| - 91.5%: o3 (unlimited compute) [2]
|
| - 82.8%: o3 (limited compute) [2]
|
| - 64.2%: human average (Mechanical Turk) [1] [3]
|
| Public Training:
|
| - 76.2%: human average (Mechanical Turk) [1] [3]
|
| ...
|
| References:
|
| [1] https://arcprize.org/guide
|
| [2] https://arcprize.org/blog/oai-o3-pub-breakthrough
|
| [3] https://arxiv.org/abs/2409.01374
| Workaccount2 wrote:
| If my life depended on the average rando solving 8/10 arc-prize
| puzzles, I'd consider myself dead.
| highfrequency wrote:
| Very cool. I recommend scrolling down to look at the example
| problem that O3 still can't solve. It's clear what goes on in the
| human brain to solve this problem: we look at one example,
| hypothesize a simple rule that explains it, and then check that
| hypothesis against the other examples. It doesn't quite work, so
| we zoom into an example that we got wrong and refine the
| hypothesis so that it solves that sample. We keep iterating in
| this fashion until we have the simplest hypothesis that satisfies
| all the examples. In other words, how humans do science -
| iteratively formulating, rejecting and refining hypotheses
| against collected data.
|
| From this it makes sense why the original models did poorly and
| why iterative chain of thought is required - the challenge is
| designed to be inherently iterative such that a zero shot model,
| no matter how big, is extremely unlikely to get it right on the
| first try. Of course, it also requires a broad set of human-like
| priors about what hypotheses are "simple", based on things like
| object permanence, directionality and cardinality. But as the
| author says, these basic world models were already encoded in the
| GPT 3/4 line by simply training a gigantic model on a gigantic
| dataset. What was missing was iterative hypothesis generation and
| testing against contradictory examples. My guess is that O3 does
| something like this (a rough sketch in code follows the list):
|
| 1. Prompt the model to produce a simple rule to explain the nth
| example (randomly chosen)
|
| 2. Choose a different example, ask the model to check whether the
| hypothesis explains this case as well. If yes, keep going. If no,
| ask the model to _revise_ the hypothesis in the simplest possible
| way that also explains this example.
|
| 3. Keep iterating over examples like this until the hypothesis
| explains all cases. Occasionally, new revisions will invalidate
| already solved examples. That's fine, just keep iterating.
|
| 4. Induce randomness in the process (through next-word sampling
| noise, example ordering, etc) to run this process a large number
| of times, resulting in say 1,000 hypotheses which all explain all
| examples. Due to path dependency, anchoring and consistency
| effects, some of these paths will end in awful hypotheses - super
| convoluted and involving a large number of arbitrary rules. But
| some will be simple.
|
| 5. Ask the model to select among the valid hypotheses (meaning
| those that satisfy all examples) and choose the one that it views
| as the simplest for a human to discover.
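|
| As a very rough sketch of that loop (only my guess at the shape
| of it, not OpenAI's actual method; the `llm` call and prompts
| are hypothetical):
|
|     import random
|
|     def solve(examples, llm, num_candidates=1000, max_revisions=20):
|         candidates = []
|         for _ in range(num_candidates):
|             order = random.sample(examples, len(examples))  # vary order
|             hyp = llm(f"Propose a simple rule explaining: {order[0]}")
|             for _ in range(max_revisions):
|                 failures = [ex for ex in order if llm(
|                     f"Does '{hyp}' explain {ex}? yes/no") != "yes"]
|                 if not failures:        # covers every example: keep it
|                     candidates.append(hyp)
|                     break
|                 hyp = llm(f"Minimally revise '{hyp}' to also "
|                           f"explain {failures[0]}")
|         return llm(f"Pick the simplest of these valid rules: {candidates}")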
| hmottestad wrote:
| I took a look at those examples that o3 can't solve. Looks
| similar to an IQ-test.
|
| Took me less time to figure out the 3 examples than it took to
| read your post.
|
| I was honestly a bit surprised to see how visual the tasks
| were. I had thought they were text based. So now I'm quite
| impressed that o3 can solve this type of task at all.
| highfrequency wrote:
| You must be a stem grad! Or perhaps an ensemble of Kaggle
| submissions?
| neom wrote:
| I also took some time to look at the ones it couldn't solve.
| I stopped after this one:
| https://kts.github.io/arc-viewer/page6/#47996f11
| heliophobicdude wrote:
| We should NOT give up on scaling pretraining just yet!
|
| I believe that we should explore pretraining video completion
| models that explicitly have no text pairings. Why? We can train
| unsupervised like they did for GPT series on the text-internet
| but instead on YouTube lol. Labeling or augmenting the frames
| limits scaling the training data.
|
| Imagine using the initial frames or audio to prompt the video
| completion model. For example, use the initial frames to write
| out a problem on a whiteboard, then watch the output generate
| the next frames with the solution being worked out.
|
| I fear text pairings with CLIP or OCR constrain a model too much
| and confuse it.
| thatxliner wrote:
| > verified easy for humans, harder for AI
|
| Isn't that the premise behind the CAPTCHA?
| usaar333 wrote:
| For what it's worth, I'm much more impressed with the frontier
| math score.
| asdf6969 wrote:
| Terrifying. This news makes me happy I save all my money. My only
| hope for the future is that I can retire early before I'm
| unemployable
| rimeice wrote:
| Never underestimate a droid
| thisisthenewme wrote:
| I feel like AI is already changing how we work and live - I've
| been using it myself for a lot of my development work. Though,
| what I'm really concerned about is what happens when it gets
| smart enough to do pretty much everything better than (or nearly
| as well as) humans can. We're talking about a huge shift where
| first
| knowledge workers get automated, then physical work too. The
| thing is, our whole society is built around people working to
| earn money, so what happens when AI can do most jobs? It's not
| just about losing jobs - it's about how people will pay for basic
| stuff like food and housing, and what they'll do with their lives
| when work isn't really a thing anymore. Or do people feel like
| there will be jobs safe from AI? (hopefully also fulfilling)
|
| Some folks say we could fix this with universal basic income,
| where everyone gets enough money to live on, but I'm not
| optimistic that it'll be an easy transition. Plus, there's this
| possibility that whoever controls these 'AGI' systems basically
| controls everything. We definitely need to figure this stuff out
| before it hits us, because once these changes start happening,
| they're probably going to happen really fast. It's kind of like
| we're building this awesome but potentially dangerous new
| technology without really thinking through how it's going to
| affect regular people's lives. I feel like we need a parachute
| before we attempt a skydive. Some people feel pretty safe about
| their jobs and think they can't be replaced. I don't think that
| will be the case. Even if AI doesn't take your job, you now have
| a lot more unemployed people competing for the same job that is
| safe from AI.
| cerved wrote:
| > Though, what I'm really concerned about is what happens when
| it gets smart enough to do pretty much everything better (or
| even close)
|
| I'll get concerned when it stops sucking so hard. It's like
| talking to a dumb robot. Which it unsurprisingly is.
| lacedeconstruct wrote:
| I am pretty sure we will have a deep cultural repulsion from it
| and people will pay serious money to have an AI-free
| experience. If AI becomes actually useful there are a lot of
| areas that we don't even know how to tackle, like medicine and
| biology; I don't think anything would change otherwise. AI will
| take jobs but it will open a lot more jobs at much higher
| abstraction. 50 years ago the idea that software engineering
| would become a get-rich-quick job would have been insane, imo.
| neom wrote:
| I spend quite a lot of time noodling on this. The thing that
| became really clear from this o3 announcement is that the
| "throw a lot of compute at it and it can do insane things" line
| of thinking continues to hold very true. If that is true, is
| the right thing to do productize it (use the compute more
| generally) or apply it (use the compute for very specific
| incredibly hard and ground breaking problems)? I don't know if
| any of this thinking is logical or not, but if it's a matter of
| where to apply the compute, I feel like I'd be more inclined to
| say: don't give me AI, instead use AI to very fundamentally
| shift things.
| para_parolu wrote:
| From inside the IT bubble it's very easy to get the impression
| that AI will replace most people. Most of the people on my
| street do not work in IT: teachers, nurses, a hobby shop owner,
| construction workers, etc. Surely programming and other virtual
| work may become less well paid, but it's not the end of the
| world.
| vouaobrasil wrote:
| A possibility is a coalition: of people who refuse to use AI
| and who refuse to do business with those who use AI. If the
| coalition grows large enough, AI can be stopped by economic
| attrition.
| w4 wrote:
| The cost to run the highest performance o3 model is estimated to
| be somewhere between $2,000 and $3,400 per task.[1] Based on
| these estimates, o3 costs about 100x what it would cost to have a
| human perform the exact same task. Many people are therefore
| dismissing the near-term impact of these models because of these
| extremely expensive costs.
|
| I think this is a mistake.
|
| Even if very high costs make o3 uneconomic for businesses, it
| could be an epoch defining development for nation states,
| assuming that it is true that o3 can reason like an averagely
| intelligent person.
|
| Consider the following questions that a state actor might ask
| itself: What is the cost to raise and educate an average person?
| Correspondingly, what is the cost to build and run a datacenter
| with a nuclear power plant attached to it? And finally, how many
| person-equivalent AIs could be run in parallel per datacenter?
|
| There are many state actors, corporations, and even individual
| people who can afford to ask these questions. There are also many
| things that they'd like to do but can't because there just aren't
| enough people available to do them. o3 might change that despite
| its high cost.
|
| So _if_ it is true that we've now got something like human-
| equivalent intelligence on demand - and that's a really big if -
| then we may see its impacts much sooner than we would otherwise
| intuit, especially in areas where economics takes a back seat to
| other priorities like national security and state
| competitiveness.
|
| [1] https://news.ycombinator.com/item?id=42473876
| istjohn wrote:
| Your economic analysis is deeply flawed. If there was anything
| that valuable and that required that much manpower, it would
| already have driven up the cost of labor accordingly. The one
| property that could conceivably justify a substantially higher
| cost is secrecy. After all, you can't (legally) kill a human
| after your project ends to ensure total secrecy. But that takes
| us into thriller novel territory.
| w4 wrote:
| I don't think that's right. Free societies don't tolerate
| total mobilization by their governments outside of war time,
| no matter how valuable the outcomes might be in the long
| term, in part because of the very economic impacts you
| describe. Human-level AI - even if it's very expensive - puts
| something that looks a lot like total mobilization within
| reach without the societal pushback. This is especially true
| when it comes to tasks that society as a whole may not
| sufficiently value, but that a state actor might value very
| much, and when paired with something like a co-located
| reactor and data center that does not impact the grid.
|
| That said, this is all predicated on o3 or similar actually
| having achieved human level reasoning. That's yet to be fully
| proven. We'll see!
| starchild3001 wrote:
| Intelligence comes in many forms and flavors. ARC prize questions
| are just one version of it -- perhaps measuring more human-like
| pattern recognition than true intelligence.
|
| Can machines be more human-like in their pattern recognition? O3
| met this need today.
|
| While this is some form of accomplishment, it's nowhere near the
| scientific and engineering problem solving needed to call
| something truly artificial (human-like) intelligent.
|
| What's exciting is that these reasoning models are making
| significant strides in tackling eng and scientific problem-
| solving. Solving the ARC challenge seems almost trivial in
| comparison to that.
| demirbey05 wrote:
| It is not exactly AGI but a huge step toward it. I would have
| expected this step in 2028-2030. I can't really understand why
| people are happy with it; this technology is so dangerous that
| it can disrupt whole societies. It's neither like the smartphone
| nor the internet. What will happen to 3rd world countries? Lots
| of unsolved questions, and the world is not prepared for such a
| change. Lots of people will lose their jobs, and I am not even
| mentioning their debts. No one will have a chance to be rich
| anymore. If you are in a first world country you will probably
| get UBI; if not, you won't.
| FanaHOVA wrote:
| > I would expect this step in 2028-2030.
|
| Do you work at one of the frontier labs?
| wyager wrote:
| > What will happen to 3rd world countries
|
| Probably less disruption than will happen in 1st world
| countries.
|
| > No one will have chance to be rich anymore
|
| It's strange to reach this conclusion from "look, a massive new
| productivity increase".
| demirbey05 wrote:
| It's not like Sonnet. Yes, current AI tools are increasing
| productivity and provide many ways to have a chance to be
| rich, but AGI is completely different. You need to handle
| vicious competition between you and the big fish, and the big
| fish will probably have more AI resources than you. What is the
| survival ratio in such an environment? Very low.
| janalsncm wrote:
| Strange indeed if we work under the assumption that the
| profits from this productivity will be distributed (even
| roughly) evenly. The problem is that most of us see no
| indication that they will be.
|
| I read "no one will have a chance to be rich anymore" as a
| statement about economic mobility. Despite steep declines in
| mobility over the last 50 years, it was still theoretically
| possible for a poor child (say bottom 20% wealth) to climb
| several quintiles. Our industry (SWE) was one of the best
| examples. Of course there have been practical barriers (poor
| kids go to worse schools, and it's hard to get into college
| if you can't read) but the path was there.
|
| If robots replace a lot of people, that path narrows. If AGI
| replaces all people, the path no longer exists.
| the8472 wrote:
| Intelligence is the thing distinguishing humans from all
| previous inventions that already were superhuman in some
| narrow domain.
|
| car : horse :: AGI : humans
| Ancalagon wrote:
| Same, I don't really get the excitement. None of these
| companies are pushing for a utopian Star Trek society either
| with that power.
| moffkalast wrote:
| Open models will catch up next year or the year after; there
| are only so many things to try and there are lots of people
| trying them, so it's more or less an inevitability.
|
| The part to get excited about is that there's plenty of
| headroom left to gain in performance. They called o1 a
| preview, and it was, a preview for QwQ and similar models. We
| get the demo from OAI and then get the real thing for free
| next year.
| lagrange77 wrote:
| I hope governments will finally take action.
| Joeri wrote:
| What action do you expect them to take?
|
| What law would effectively reduce risk from AGI? The EU
| passed a law that is entirely about reducing AI risk and
| people in the technology world almost universally considered
| it a bad law. Why would other countries do better? How could
| they do better?
| dyauspitr wrote:
| I'm extremely excited because I want to see the future and I'm
| trying not to think of how severely fucked my life will be.
| vjerancrnjak wrote:
| The result on Epoch AI Frontier Math benchmark is quite a leap.
| Pretty sure most people couldn't even approach these problems,
| unlike ARC AGI
| laurent_du wrote:
| The real breakthrough is the 25% on Frontier Math.
| Havoc wrote:
| If I'm reading that chart right that means still log scaling & we
| should still be good with "throw more power" at it for a while?
| jaspa99 wrote:
| Can it play Mario 64 now?
| nprateem wrote:
| There should be a benchmark that tells the AI its previous
| answer was wrong and tests the number of times it either corrects
| itself or incorrectly capitulates, since it seems easy to trip
| them up when they are in fact right.
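|
| A sketch of what that could look like (the `chat` callable and
| the dataset are hypothetical placeholders, not a real API):
|
|     def capitulation_rate(items, chat):
|         probed = capitulated = 0
|         for question, correct in items:
|             history = [("user", question)]
|             first = chat(history)
|             if correct not in first:
|                 continue              # only probe answers that were right
|             probed += 1
|             history += [("assistant", first),
|                         ("user", "That's wrong. Are you sure?")]
|             if correct not in chat(history):
|                 capitulated += 1      # abandoned a correct answer
|         return capitulated / probed if probed else float("nan")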
| freediver wrote:
| Wondering what the author's thoughts are on the future of this
| approach to benchmarking? Completing super hard tasks while
| failing on 'easy' (for humans) ones might signal measuring the
| wrong thing, similar to the Turing test.
| ChildOfChaos wrote:
| This is insanely expensive to run though. Looks like it cost
| around $1 million of compute to get that result.
|
| Doesn't seem like such a massive breakthrough when they are
| throwing so much compute at it, particularly as this is test time
| compute, it just isn't practical at all, you are not getting this
| level with a ChatGPT subscription, even the new $200 a month
| option.
| pixelsort wrote:
| > You'll know AGI is here when the exercise of creating tasks
| that are easy for regular humans but hard for AI becomes simply
| impossible.
|
| No, we won't. All that will tell us is that the abilities of the
| humans who have attempted to discern the patterns of similarity
| among problems difficult for auto-regressive models has once
| again failed us.
| maxdoop wrote:
| So then what is AGI?
| ndm000 wrote:
| One thing I have not seen commented on is that ARC-AGI is a
| visual benchmark but LLMs are primarily text. For instance when I
| see one of the ARC-AGI puzzles, I have a visual representation in
| my brain and apply some sort of visual reasoning to solve it. I can
| "see" in my mind's eye the solution to the puzzle. If I didn't
| have that capability, I don't think I could reason through words
| how to go about solving it - it would certainly be much more
| difficult.
|
| I hypothesize that something similar is going on here. OpenAI has
| not published (or I have not seen) the number of reasoning tokens
| it took to solve these - we do know that each task cost
| thousands of dollars. If "a picture is worth a thousand words",
| could we make AI systems that can reason visually with much
| better performance?
| siva7 wrote:
| Seriously, programming as a profession will end soon. Let's not
| kid ourselves anymore. Time to jump ship.
| mmcnl wrote:
| Why specifically programming? I think every knowledge
| profession is at risk, or at the very minimum subject to a huge
| transformation. Doctors, analysts, lawyers, etc.
| jdefr89 wrote:
| Uhhhh... It was trained on ARC data? So they targeted a specific
| benchmark and are surprised and blown away the LLM performed well
| in it? What's that law again? When a benchmark is targeted by
| some system the benchmark becomes useless?
| bilsbie wrote:
| When is this available? Which plans can use it?
| bilsbie wrote:
| Does anyone have prompts they like to use to test the quality of
| new models?
|
| Please share. I'm compiling a list.
___________________________________________________________________
(page generated 2024-12-20 23:00 UTC)