[HN Gopher] OpenAI O3 breakthrough high score on ARC-AGI-PUB
       ___________________________________________________________________
        
       OpenAI O3 breakthrough high score on ARC-AGI-PUB
        
       Author : maurycy
       Score  : 1509 points
       Date   : 2024-12-20 18:11 UTC (23 hours ago)
        
 (HTM) web link (arcprize.org)
 (TXT) w3m dump (arcprize.org)
        
       | razodactyl wrote:
       | Great. Now we have to think of a new way to move the goalposts.
        
         | tines wrote:
         | I mean, what else do you call learning?
        
         | Pesthuf wrote:
          | Well, right now running this model is really expensive, but we
          | should prepare a new cope ahead of time, for when equivalent
          | models no longer are.
        
           | cchance wrote:
            | Yeah, getting costs down will be the big one. I imagine
            | quantization, distillation, and lots and lots of improvements
            | on the compute side, both hardware- and software-wise.
        
         | a_wild_dandan wrote:
         | Let's just define AI as "whatever computers still can't do."
         | That'll show those dumb statistical parrots!
        
         | foobarqux wrote:
          | This is just as silly as claiming that people "moved the
          | goalposts" when they said, after a computer beat Kasparov at
          | chess, that it wasn't AGI: it wasn't a good test, and some
          | people only realized this after the computer beat Kasparov but
          | couldn't do much else. In this case the ARC maintainers have
          | specifically stated that this is a necessary but not sufficient
          | test of AGI (I personally think it is neither).
        
           | og_kalu wrote:
           | It's not silly. The computer that could beat Kasparov
           | couldn't do anything else so of course it wasn't Artificial
           | General Intelligence.
           | 
           | o3 can do much much more. There is nothing narrow about SOTA
           | LLMs. They are already General. It doesn't matter what ARC
           | Maintainers have said. There is no common definition of
           | General that LLMs fail to meet. It's not a binary thing.
           | 
           | By the time a single machine covers every little test
           | humanity can devise, what comes out of that is not 'AGI' as
           | the words themselves mean but a General Super Intelligence.
        
             | foobarqux wrote:
             | It is silly, the logic is the same: "Only a (world-
             | altering) 'AGI' could do [test]" -> test is passed -> no
             | (world-altering) 'AGI' -> conclude that [test] is not a
             | sufficient test for (world-altering) 'AGI' -> chase new
             | benchmark.
             | 
             | If you want to play games about how to define AGI go ahead.
              | People have been claiming for years that we've already
              | reached AGI, and with every improvement they have to
              | bizarrely claim anew that _now_ we've really achieved AGI.
             | But after a few months people realize it still doesn't do
             | what you would expect of an AGI and so you chase some new
             | benchmark ("just one more eval").
             | 
             | The fact is that there really hasn't been the type of
             | world-altering impact that people generally associate with
             | AGI and no reason to expect one.
        
               | og_kalu wrote:
               | >It is silly, the logic is the same: "Only a (world-
               | altering) 'AGI' could do [test]" -> test is passed -> no
               | (world-altering) 'AGI' -> conclude that [test] is not a
               | sufficient test for (world-altering) 'AGI' -> chase new
               | benchmark.
               | 
               | Basically nobody today thinks beating a single benchmark
               | and nothing else will make you a General Intelligence. As
               | you've already pointed out out, even the maintainers of
               | ARC-AGI do not think this.
               | 
               | >If you want to play games about how to define AGI go
               | ahead.
               | 
               | I'm not playing any games. ENIAC cannot do 99% of the
               | things people use computers to do today and yet barely
               | anybody will tell you it wasn't the first general purpose
               | computer.
               | 
               | On the contrary, it is people who seem to think "General"
               | is a moniker for everything under the sun (and then some)
               | that are playing games with definitions.
               | 
               | >People have been claiming for years that we've already
               | reached AGI and with every improvement they have to
               | bizarrely claim anew that now we've really achieved AGI.
               | 
                | Who are these people? Do you have any examples at all?
                | Genuine question.
               | 
               | >But after a few months people realize it still doesn't
               | do what you would expect of an AGI and so you chase some
               | new benchmark ("just one more eval").
               | 
                | What do you expect from 'AGI'? Everybody seems to have
                | different expectations, many of them rooted in science
                | fiction rather than reality, so this is a moot point.
                | What exactly is world-altering to you? Genuinely, do you
                | even have anything other than an "I'll know it when I see
                | it"?
               | 
                | If you introduce a technology most people adopt, is that
                | world-altering, or are you waiting for Skynet?
        
               | foobarqux wrote:
               | > Basically nobody today thinks beating a single
               | benchmark and nothing else will make you a General
               | Intelligence.
               | 
                | People's comments, including in this very thread, seem to
                | suggest otherwise (cf. comments about "goalpost moving").
                | Are you saying it wasn't a widespread belief that a chess-
                | playing computer would require AGI? Or that Go was at some
                | point the new test for AGI? Or the Turing test?
               | 
               | > I'm not playing any games... "General" is a moniker for
               | everything under the sun that are playing games with
               | definitions.
               | 
               | People have a colloquial understanding of AGI whose
               | consequence is a significant change to daily life, not
               | the tortured technical definition that you are using.
               | Again your definition isn't something anyone cares about
               | (except maybe in the legal contract between OpenAI and
               | Microsoft).
               | 
               | > Who are these people ? Do you have any examples at all.
               | Genuine question
               | 
               | How about you? I get the impression that you think AGI
               | was achieved some time ago. It's a bit difficult to
               | simultaneously argue both that we achieved AGI in GPT-N
               | and also that GPT-(N+X) is now the real breakthrough AGI
               | while claiming that your definition of AGI is useful.
               | 
               | > What do you expect from 'AGI'?
               | 
               | I think everyone's definition of AGI includes, as a
               | component, significant changes to the world, which
               | probably would be something like rapid GDP growth or
               | unemployment (though you could have either of those
               | without AGI). The fact that you have to argue about what
               | the word "general" technically means is proof that we
               | don't have AGI in a sense that anyone cares about.
        
               | og_kalu wrote:
               | >People's comments, including in this very thread, seem
               | to suggest otherwise (c.f. comments about "goal post
               | moving").
               | 
               | But you don't see this kind of discussion on the narrow
               | models/techniques that made strides on this benchmark, do
               | you ?
               | 
               | >People have a colloquial understanding of AGI whose
               | consequence is a significant change to daily life, not
               | the tortured technical definition that you are using
               | 
                | And ChatGPT has represented a significant change to the
                | daily lives of many. It's the fastest-adopted software
                | product in history. In just 2 years, it's become one of
                | the top ten most visited sites on the planet. A lot of
                | people have had the work they do change significantly
                | since its release. This is why I ask: what is world-
                | altering?
               | 
               | >How about you? I get the impression that you think AGI
               | was achieved some time ago.
               | 
               | Sure
               | 
               | >It's a bit difficult to simultaneously argue both that
               | we achieved AGI in GPT-N and also that GPT-(N+X) is now
               | the real breakthrough AGI
               | 
                | I have never claimed GPT-N+X is the "new breakthrough
                | AGI". As far as I'm concerned, we hit AGI some time ago
                | and are making strides in competence and/or enabling even
                | more capabilities.
               | 
               | You can recognize ENIAC as a general purpose computer and
               | also recognize the breakthroughs in computing since then.
               | They're not mutually exclusive.
               | 
               | And personally, I'm more impressed with o3's Frontier
               | Math score than ARC.
               | 
               | >I think everyone's definition of AGI includes, as a
               | component, significant changes to the world
               | 
               | Sure
               | 
               | >which probably would be something like rapid GDP growth
               | or unemployment
               | 
                | There is definitely no broad agreement on what counts as
                | "significant change".
               | 
                | Even in science fiction, the existence of general
                | intelligences more competent than today's LLMs is not
                | necessarily a precursor to massive unemployment or GDP
                | growth.
               | 
                | And for a lot of people, the clincher stopping them from
                | calling a machine AGI is not even any of these things.
                | For some, whether it is "sentient" or "cannot lie" is far
                | more important than any spike in unemployment.
        
               | foobarqux wrote:
               | > But you don't see this kind of discussion on the narrow
               | models/techniques that made strides on this benchmark, do
               | you ?
               | 
               | I don't understand what you are getting at.
               | 
                | Ultimately there is no axiomatic definition of the term
                | AGI. I don't think the colloquial understanding of the
                | word is what you think it is (i.e. if you had described
                | to people, pre-ChatGPT, today's ChatGPT behavior,
                | including all the limitations and failings and the fact
                | that there was no change in GDP, unemployment, etc., and
                | asked if that was AGI, I seriously doubt they would say
                | yes).
               | 
                | More importantly, I don't think anyone would say their
                | life is much different from a few years ago, yet they
                | would separately say that under AGI it would be.
               | 
               | But the point that started all this discussion is the
               | fact that these "evals" are not good proxies for AGI and
               | no one is moving goal-posts even if they realize this
               | fact only after the tests have been beaten. You can
               | foolishly _define_ AGI as beating ARC but the moment ARC
                | is beaten you realize that you don't care about that
                | definition at all. That doesn't change if you make a 10-
                | or 100-benchmark suite.
        
               | og_kalu wrote:
               | >I don't understand what you are getting at.
               | 
                | If such discussions are only had when LLMs make strides
                | on the benchmark, then it's not just about beating the
                | benchmark but also about what kind of system is beating
                | it.
               | 
               | >You can foolishly define AGI as beating ARC but the
               | moment ARC is beaten you realize that you don't care
               | about that definition at all.
               | 
                | If you change your definition of AGI the moment a test is
                | beaten, then yes, you are simply moving the goalposts.
               | 
                | If you care about other impacts like "unemployment" and
                | "GDP rising" but don't give any time or opportunity to
                | see if the model is capable of such things, then you
                | don't really care about them and are just mindlessly
                | shifting the goalposts.
               | 
                | How does such a person know o3 won't cause mass
                | unemployment? The model hasn't even been released yet.
        
               | foobarqux wrote:
               | > If such discussions only made when LLMs make strides in
               | the benchmark then it's not just about beating the
               | benchmark but also what kind of system is beating it.
               | 
               | I still don't understand the point you are making. Nobody
               | is arguing that discrete program search is AGI (and the
               | same counter-arguments would apply if they did).
               | 
               | > If you change your definition of AGI the moment a test
               | is beaten then yes, you are simply post moving.
               | 
               | I don't think anyone changes their definition, they just
               | erroneously assume that any system that succeeds on the
               | test must do so only because it has general intelligence
               | (that was the argument for chess playing for example).
               | When it turns out that you can pass the test with much
               | narrower capabilities they recognize that it was a bad
               | test (unfortunately they often replace the bad test with
               | another bad test and repeat the error).
               | 
               | > If you care about other impacts like "Unemployment" and
               | "GDP rising" but don't give any time or opportunity to
               | see if the model is capable of such then you don't really
               | care about that and are just mindlessly shifting posts.
               | 
                | We are talking about what models are doing now (is AGI
                | here _now_?), not what some imaginary research
                | breakthroughs might accomplish. o3 is not going to
                | materially change GDP or unemployment. (If you are
                | confident otherwise, please say how much you are willing
                | to wager on it.)
        
               | og_kalu wrote:
               | I'm not talking about any imaginary research
               | breakthroughs. I'm talking about today, right now. We
               | have a model unveiled today that seems a large
               | improvement across several benchmarks but hasn't been
               | released yet.
               | 
                | You can be confident all you want, but until the model
                | has been given a chance to have (or not have) the effect
                | you think it won't, it's just an assertion that may or
                | may not be entirely wrong.
               | 
               | If you say "this model passed this benchmark I thought
               | would indicate AGI but didn't do this or that so I won't
               | acknowledge it" then I can understand that. I may not
               | agree on what the holdups are but I understand that.
               | 
                | If, however, you're saying "this model passed this
                | benchmark I thought would indicate AGI, but I don't think
                | it's going to be able to do this or that, so it's not
                | AGI" then I'm sorry, but that's just nonsense.
               | 
               | My thoughts or bets are irrelevant here.
               | 
                | A few days ago I saw someone seriously comparing a site
                | with nearly 4B visits a month in under 2 years to Bitcoin
                | and VR. People are so up in their bubbles and so assured
                | in their way of thinking that they can't see what's right
                | in front of them, never mind predict future usefulness.
                | I'm just not interested in engaging with "I think it
                | won't" arguments when I can just wait and see.
               | 
               | I'm not saying you are one of such people. I just have no
               | interest in such arguments.
               | 
                | My bet? There's no way I would make a bet like that
                | without playing with the model first. Why would I? Why
                | would you?
        
               | foobarqux wrote:
               | > I'm not talking about any imaginary research
               | breakthroughs. I'm talking about today, right now.
               | 
                | So was I, as I explicitly said. I said that today we
                | don't have the large-impact societal changes that people
                | have conventionally associated with the term AGI. I also
                | explicitly said that I don't believe o3 will change this,
                | and your comments seem to suggest neither do you (you
                | seem to prefer to emphasize that it isn't literally
                | impossible that o3 will make these transformative
                | changes).
               | 
               | > If however you're "this model passed this benchmark I
               | thought would indicate AGI but I don't think it's going
               | to be able to do this or that so it's not AGI" then I'm
               | sorry but that's just nonsense.
               | 
                | The entire point of the original chess example was to
                | show that repudiating incorrect beliefs about naive
                | litmus tests of AGI-ness is in fact the correct reaction.
                | If we did what you are arguing, then we should accept
                | that AGI arrived after chess was beaten, because a lot of
                | people believed that was the litmus test? Or we should
                | praise people who stuck to their original beliefs after
                | they were proven wrong instead of correcting them? That's
                | why I said it was silly at the outset.
               | 
               | > My thoughts or bets are irrelevant here
               | 
                | No, they show you don't actually believe we have society-
                | transformative AGI today (or will when o3 is released),
                | but you get upset when someone points that out.
               | 
               | > I'm just not interested in engaging "I think It won't"
               | arguments when I can just wait and see.
               | 
                | A lot of life is about making decisions based on
                | predictions about the future, including consequential
                | decisions about societal investment, personal career
                | choices, etc. For many things there isn't a "wait and
                | see" approach; you are making implicit or explicit
                | decisions even by maintaining the status quo. People who
                | make bad or unsubstantiated arguments are creating a
                | toxic environment in which those decisions are made,
                | leading to personal and public harm. The most important
                | example of this is the decision to dramatically increase
                | energy usage to accommodate AI models, despite impending
                | climate catastrophe, on the blind faith that AI will
                | somehow fix it all (which, by the way, is far from the
                | "wait and see" approach that you are supposedly
                | advocating; this is an active decision).
               | 
               | > My bet ? There's no way i would make a bet like that
               | without playing with the model first. Why would I ? Why
               | Would you ?
               | 
                | You can have beliefs based on limited information. People
                | do this all the time. And if you actually revealed that
                | belief, it would demonstrate that you don't currently
                | believe o3 is likely to be world-transformative.
        
               | Jensson wrote:
               | > But you don't see this kind of discussion on the narrow
               | models/techniques that made strides on this benchmark, do
               | you ?
               | 
                | This model was trained to pass this test; it was trained
                | heavily on the example questions, so it was a narrow
                | technique.
               | 
               | We even have proof that it isn't AGI, since it scores
               | horribly on ARC-AGI 2. It overfitted for this test.
        
               | og_kalu wrote:
               | >This model was trained to pass this test, it was trained
               | heavily on the example questions, so it was a narrow
               | technique.
               | 
               | You are allowed to train on the train set. That's the
               | entire point of the test.
               | 
               | >We even have proof that it isn't AGI, since it scores
               | horribly on ARC-AGI 2. It overfitted for this test.
               | 
               | Arc 2 does not even exist yet. All we have are "early
               | signs", not that that would be proof of anything. Whether
               | I believe the models are generally intelligent or not
               | doesn't depend on ARC
        
               | Jensson wrote:
                | > You are allowed to train on the train set. That's the
                | entire point of the test.
               | 
               | Right, but by training on those test cases you are
               | creating a narrow model. The whole point of training
               | questions is to create narrow models, like all the models
               | we did before.
        
               | og_kalu wrote:
                | That doesn't make any sense. Training on the train set
                | does not make the model's capabilities narrow. Models are
                | narrow when you can't train them to do anything else even
                | if you wanted to.
               | 
               | You are not narrow for undergoing training and it's
               | honestly kind of ridiculous to think so. Not even the ARC
               | maintainers believe so.
        
               | Jensson wrote:
               | > Training on the train set does not make the models
               | capabilities narrow
               | 
                | Humans didn't need to see the training set to pass this;
                | the AI needing it means it is narrower than the humans,
                | at least on these kinds of tasks.
               | 
               | The system might be more general than previous models,
               | but still not as general as humans, and the G in AGI
               | typically means being as general as humans. We are moving
               | towards more general models, but still not at the level
               | where we call them AGI.
        
       | og_kalu wrote:
       | This is also wildly ahead in SWE-bench (71.7%, previous 48%) and
       | Frontier Math (25% on high compute, previous 2%).
       | 
       | So much for a plateau lol.
        
         | throwup238 wrote:
         | _> So much for a plateau lol._
         | 
         | It's been really interesting to watch all the internet pundits'
          | takes on the plateau... as if the _two years_ since the release
          | of GPT-3.5 were somehow enough data for an armchair ponce to
         | predict the performance characteristics of an entirely novel
         | technology that no one understands.
        
           | jgalt212 wrote:
           | You could make an equivalently dismissive comment about the
           | hypesters.
        
             | throwup238 wrote:
             | Yeah but anyone with half a brain knows to ignore them.
             | Vapid cynicism is a lot more seductive to the average nerd.
        
           | bandwidth-bob wrote:
            | The pundits' response to the (alleged) plateau was
           | proportional to the certainty with which CEOs of frontier
           | labs discussed pre-training scaling. The o3 result is from
           | scaling test time compute, which represents a meaningful
           | change in how you would build out compute for scaling (single
           | supercluster --> presence in regions close to users). Thus it
           | is important to discuss.
        
         | attentionmech wrote:
          | I legit see that if there isn't a new breakthrough for even
          | one week, people start shouting "plateau, plateau." Our rate of
          | progress is extraordinary and any downplaying of it seems
          | stupid.
        
         | optimalsolver wrote:
         | >Frontier Math (25% on high compute, previous 2%)
         | 
          | This is so insane that I can't help but be skeptical. I know
          | the FM answer key is private, but they have to send the
          | questions to OpenAI in order to score the models. And a
          | significant jump on this benchmark sure would increase a
          | company's valuation...
         | 
         | Happy to be wrong on this.
        
         | OsrsNeedsf2P wrote:
          | At $6,670/task? I hope there's a jump
        
           | og_kalu wrote:
            | It's not $6,670/task. That was the high-efficiency cost for
            | 400 questions.
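            | 
            | (If that figure really is the total for the ~400-question
            | set, the back-of-the-envelope math is roughly $6,670 / 400,
            | i.e. about $17 per task, not $6,670 per task.)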
        
         | HarHarVeryFunny wrote:
         | You're talking apples and oranges. The plateau the frontier
         | models have hit is the limited further gains to be had from
         | dataset (+ corresponding model/compute) scaling.
         | 
         | These new reasoning models are taking things in a new direction
         | basically by adding search (inference time compute) on top of
         | the basic LLM. So, the capabilities of the models are still
         | improving, but the new variable is how deep of a search you
         | want to do (how much compute to throw at it at inference time).
         | Do you want your chess engine to do a 10 ply search or 20 ply?
         | What kind of real world business problems will benefit from
         | this?
        
           | og_kalu wrote:
           | "New" reasoning models are plain LLMs with clever
           | reinforcement learning. o1 is itself reinforcement learning
            | on top of GPT-4o.
           | 
            | They found a way to make test-time compute a lot more
            | effective, and that is an advance, but the idea is not new
            | and the architecture is not new.
           | 
           | And the vast majority of people convinced LLMs plateaued did
           | so regardless of test time compute.
        
             | HarHarVeryFunny wrote:
              | The fact that these reasoning models may compute for
              | extended durations, using exponentially more compute for
              | linear performance gains (says OpenAI), resulting in
              | outputs that, while better, are not necessarily any longer
              | (more tokens) than before, all points to a different
              | architecture - some type of iterative calling of the
              | underlying model (essentially a reasoning agent using the
              | underlying model).
             | 
             | A plain LLM does not use variable compute - it is a fixed
             | number of transformer layers, a fixed amount of compute for
             | every token generated.
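              | 
              | As a rough back-of-envelope (a standard approximation, not
              | anything specific to OpenAI's models): ignoring attention
              | overhead, a dense transformer spends about 2 x N_params
              | FLOPs per generated token in the forward pass, so a
              | 70B-parameter model burns roughly 2 x 70e9 = 1.4e11 FLOPs
              | on every token regardless of how hard the question is. The
              | reasoning models vary compute by emitting more tokens
              | (longer chains of thought) and/or sampling more chains, not
              | by changing the per-token cost.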
        
               | throwaway314155 wrote:
                | Architecture generally refers to the design of the model.
                | In this case, the underlying model is still a transformer-
                | based LLM, and so is its architecture.
               | 
               | What's different is the method for _sampling_ from that
               | model where it seems they have encouraged the underlying
               | LLM to perform a variable length chain of thought
               | "conversation" with itself as has been done with o1. In
               | addition, they _repeat_ these chains of thought in
               | parallel using a tree of some sort to search and rank the
               | outputs. This apparently scales performance on benchmarks
               | as you scale both length of the chain of thought and the
               | number of chains of thought.
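                | 
                | A minimal sketch of that sample-then-rank idea (the step
                | function and scorer below are toy stand-ins for
                | illustration, not anything OpenAI has published):
                | 
                |     import random
                | 
                |     def sample_chain(prompt, steps, step_fn):
                |         # roll out one chain of thought, one step at a time
                |         ctx = prompt
                |         for _ in range(steps):
                |             ctx += "\n" + step_fn(ctx)
                |         return ctx
                | 
                |     def best_of_n(prompt, n, steps, step_fn, score_fn):
                |         # sample n independent chains, keep the best-scoring one
                |         chains = [sample_chain(prompt, steps, step_fn)
                |                   for _ in range(n)]
                |         return max(chains, key=score_fn)
                | 
                |     # toy stand-ins so the sketch runs without a real model
                |     step = lambda ctx: f"thought {random.random():.2f}"
                |     score = lambda chain: len(chain.splitlines())
                | 
                |     print(best_of_n("Solve the puzzle:", 8, 4, step, score))
                | 
                | Scaling n (how many chains) and steps (how long each
                | chain runs) is the test-time-compute knob being described
                | here.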
        
               | HarHarVeryFunny wrote:
               | No disagreement, although the sampling + search procedure
               | is obviously adding quite a lot to the capabilities of
               | the system as a whole, so it really _should_ be
                | considered as part of the architecture. It's a bit like
               | AlphaGo or AlphaZero - generating potential moves (cf
               | LLM) is only a component of the overall solution
               | architecture, and the MCTS sampling/search is equally (or
               | more) important.
        
               | og_kalu wrote:
                | I think throwaway already explained what I was getting
               | at.
               | 
                | That said, I probably did downplay the achievement. It
               | may not be a "new" idea to do something like this but
               | finding an effective method for reflection that doesn't
               | just lock you into circular thinking and is applicable
               | beyond well defined problem spaces is genuinely tough and
               | a breakthrough.
        
       | maxdoop wrote:
        | How much longer can I get paid $150k to write code?
        
         | tsunamifury wrote:
          | Often what happens is the golf-course phenomenon. As golfing
          | gets less popular, low- and mid-tier golf courses go out of
          | business as they simply aren't needed. But at the same time,
          | demand for high-end golf courses actually skyrockets, because
          | people who want to golf can either give it up or go higher end.
         | 
          | This, I think, will happen with programmers. Rote programming
          | will slowly die out, while the super high end will go
          | dramatically up in demand and price.
        
           | CapcomGo wrote:
           | Where does this golf-course phenomenon come from? It doesn't
           | really match the real world or how golfing works.
        
             | tsunamifury wrote:
              | How so? I witnessed it quite directly in California. The
              | majority have closed, and the remaining ones have gone up
              | in price and are upscale. This has been covered in various
              | news programs like 60 Minutes. You can look up the death of
              | golfing.
              | 
              | Also unsure what you mean by...'how golfing works'. This is
              | the economics of it, not the game.
        
               | EVa5I7bHFq9mnYK wrote:
                | Maybe it's a CA thing? Plenty of $50 golf courses here in
                | Phoenix.
        
         | colesantiago wrote:
         | Frontier expert specialist programmers will always be in
         | demand.
         | 
         | Generalist junior and senior engineers will need to think of a
         | different career path in less than 5 years as more layoffs will
         | reduce the software engineering workforce.
         | 
          | It looks like that may be the way things go if progress in the
          | o1, o3, oN models and other LLMs continues.
        
           | deadbabe wrote:
           | This assumes that software products in the future will remain
           | at the same complexity as they are today, just with AI
           | building them out.
           | 
            | But they won't. AI will enable building even _more_ complex
            | software, which counterintuitively will result in needing
            | even _more_ human jobs to deal with this added complexity.
           | 
            | Think about how, despite an increasing number of free open
            | source libraries over time enabling some powerful stuff
            | easily, developer jobs have only increased, not decreased.
        
             | dmm wrote:
              | I've made a similar argument in the past, but now I'm not
              | so sure. It seems to me that developer demand was linked to
              | large expansions in software demand, first from PCs, then
              | the web, and finally smartphones.
             | 
             | What if software demand is largely saturated? It seems the
             | big tech companies have struggled to come up with the next
             | big tech product category, despite lots of talent and
             | capital.
        
               | deadbabe wrote:
               | There doesn't need to be a new category. Existing
               | categories can just continue bloating in complexity.
               | 
               | Compare the early web vs the complicated JavaScript laden
               | single page application web we have now. You need way
               | more people now. AI will make it even worse.
               | 
                | Consider that in the AI-driven future, there will be no
                | more frameworks like React. Who is going to bother
                | writing one? Instead, every company will just have its
                | own little custom framework built by an AI that works
                | only for that company. Joining a new company means you
                | bring generalist skills and learn how their software
                | works from the ground up, and when you leave for another
                | company that knowledge is instantly useless.
               | 
               | Sounds exciting.
               | 
                | But there are also plenty of unexplored categories that
                | we still can't access because the technology for them is
                | insufficient. Household robots with AGI, for instance,
                | may require instructions for specific services sold as
                | "apps" that have to be designed and developed by
                | companies.
        
               | bandwidth-bob wrote:
                | The new capabilities of LLMs, and of large foundation
                | models generally, _expand_ the range of what a computer
                | program can do. Naturally, we will need to build all of
                | those things with code, which will be done by a combo of
                | people with product ideas, engineers, and LLMs. There
                | will then be specialization and competition on each new
                | use case, e.g., who builds the best AI doctor, etc.
        
             | hackinthebochs wrote:
             | What about "general" in AGI do you not understand? There
             | will be no new style of development for which the AGI will
             | be poorly suited that all the displaced developers can move
             | to.
        
               | bandwidth-bob wrote:
                | For true AGI (whatever that means; let's say it fully
                | replicates human abilities), discussing only "developers"
                | is a drop in the bucket compared to all the knowledge-
                | work jobs that will be displaced.
        
             | cruffle_duffle wrote:
             | This is exactly what will happen. We'll just up the
             | complexity game to entirely new baselines. There will
             | continue to be good money in software.
             | 
              | These models are tools to help engineers, not replacements.
              | Models cannot, on their own, build novel things, no matter
              | how much the hype suggests otherwise. What they can do is
              | remove a hell of a lot of accidental complexity.
        
               | lagrange77 wrote:
               | > These models are tools to help engineers, not
               | replacements. Models cannot, on their own, build novel
               | new things no matter how much the hype suggests
               | otherwise.
               | 
               | But maybe models + managers/non technical people can?
        
           | mitjam wrote:
            | The question is: how do you become a senior when there is no
            | place to be a junior? Will future SWEs need to do the 10k
            | hours as a hobby? Will AI speed up or slow down learning?
        
             | singularity2001 wrote:
              | Good question, and I think you gave the correct answer:
              | yes, people will just do the 10,000 hours required by
              | starting programming at the age of eight and then playing
              | around until they're done studying.
        
         | prmph wrote:
         | I'll believe the models can take the jobs of programmers when
         | they can generate a sophisticated iOS app based on some simple
         | prompts, ready for building and publication in the app store.
         | That is nowhere near the horizon no matter how much things are
         | hyped up, and it may well never arrive.
        
           | timenotwasted wrote:
            | These absolutist-type comments are such a wild take given how
            | often they are so wrong.
        
             | tsunamifury wrote:
              | Totally... simple 20% increases in efficiency will already
              | significantly destroy demand for coders. This forum,
              | however, will be resistant to admitting such an economic
              | phenomenon.
              | 
              | Look at video bay editing after the advent of Final Cut:
              | a significant drop in demand for it as a specialized
              | professional field, even while content volume went up
              | dramatically.
        
               | exitb wrote:
                | Computing had been transforming countless jobs before it
               | got to Final Cut. On one hand, programming is not the
               | hardest job out there. On the other, it takes months to
               | fully onboard a human developer - a person that already
               | has years of relevant education and work experience.
               | There are desk jobs that onboard new hires in days
               | instead. Let's see when they're displaced by AI first.
        
               | tsunamifury wrote:
                | Don't know if you noticed, but that's already happening.
                | Mass layoffs in customer service etc. have already
                | happened over the last 2 years.
        
               | exitb wrote:
               | So, how does it work out? Are the customers happy? Are
               | the bosses at my work going to be equally happy with my
               | AI replacement?
        
               | EVa5I7bHFq9mnYK wrote:
               | That's until AI has improved enough that it can
               | automatically navigate the menus to get me a human
               | operator to talk to.
        
               | derektank wrote:
               | I could be misreading this, but as far as I can tell,
               | there are more video and film editors today (29,240) than
               | there were film editors in 1997 (9,320). Seems like an
               | example of improved productivity shifting the skills
               | required but ultimately driving greater demand for the
                | profession as a whole. Salaries don't seem to have been
                | hurt either: the median wage was $35,214 in '97 and
                | $66,600 today, right in line with inflation.
               | 
               | https://www.bls.gov/oes/2023/may/oes274032.htm
               | 
               | https://www.bls.gov/oes/tables.htm
        
           | vouaobrasil wrote:
            | Nah, it will arrive. And regardless, this sort of AI reduces
            | the skill level required to make the app. It reduces the
            | number of people required and thus reduces the demand for
            | engineers. So, even though AI is not CLOSE to what you are
            | suggesting, it can significantly reduce the salaries of those
            | that ARE required. So maybe fewer $150K programmers will be
            | hired, with the same revenue, for even higher profits.
           | 
           | The most bizarre thing is that programmers are literally
           | writing code to replace themselves because once this AI
           | started, it was a race to the bottom and nobody wants to be
           | last.
        
             | skydhash wrote:
             | > Nah, it will arrive
             | 
             | Will it?
             | 
              | It's already hard to get people to use computers as they
              | are right now, where you only need to click on things and
              | no longer have to enter commands. That's because most
              | people don't like to engage in formal reasoning. Even with
              | some of the most intuitive computer-assisted tasks (drawing
              | and 3D modeling), there's so much theory to learn that few
              | people bother.
              | 
              | Programming has always been easy to learn, and tools to
              | automate coding have existed for decades now. But how many
              | people do you know who have had the urge to learn enough to
              | automate their tasks?
        
             | prmph wrote:
             | They've been promising us this thing since the 60s: End-
             | user development, 5GLs, etc. enabling the average Joe to
             | develop sophisticated apps in minimal time. And it never
             | arrives.
             | 
             | I remember attending a tech fair decades ago, and at one
             | stand they were vending some database products. When I
             | mentioned that I was studying computer science with a focus
             | on software engineering, they sneered that coding will be
             | much less important in the future since powerful databases
             | will minimize the need for a lot of data wrangling in
             | applications with algorithms.
             | 
              | What actually happened is that the demand for programmers
              | increased, and software ate the world. I suspect something
              | similar will happen with the current AI hype.
        
               | vouaobrasil wrote:
               | Well, I think in the 60s we also didn't have LLMs that
               | could actually write complete programs, either.
        
               | mirsadm wrote:
                | No one writes a "complete program" these days. Things
                | just keep evolving forever. I spend more time than I care
                | to admit dealing with dependencies of libraries that
                | change seemingly on a daily basis. These predictions are
                | so far off reality that it makes me wonder if the people
                | making them have ever written any code in their lives.
        
               | vouaobrasil wrote:
                | That's fair. Well, I've written a lot of code. But
                | anyway, I do want to emphasize the following. I am not
                | making the same prediction as some who say AI can
                | replace a programmer. Instead, I am saying: the
                | combination of AI plus programmers will reduce the number
                | of programmers needed, and hence allow the software
                | industry to exist with far fewer people, with the lucky
                | ones accumulating even more wealth.
        
               | whynotminot wrote:
               | > They've been promising us this thing since the 60s:
               | End-user development, 5GLs, etc. enabling the average Joe
               | to develop sophisticated apps in minimal time. And it
               | never arrives.
               | 
               | This has literally already arrived. Average Joes _are_
               | writing software using LLMs right now.
        
               | arrosenberg wrote:
               | Source? Which software products are built without
               | engineers?
        
               | Jensson wrote:
                | Personal websites, etc. You don't think of them as
                | software products since they weren't built by engineers,
                | but 30 years ago you needed engineers to build those
                | things.
        
               | arrosenberg wrote:
               | Ok, well I'm not going to worry about my job then. 25
               | years ago GeoCities existed and you didn't need an
               | engineer. 10 year old me was writing functional HTML,
               | definitely not an engineer at that point.
        
               | whynotminot wrote:
               | To be honest maybe no one should worry.
               | 
               | If AI truly overtakes knowledge work there's not much we
               | could reasonably do to prepare for it.
               | 
               | If AI never gets there though, then you saved yourself
               | the trouble of stressing about it. So sure, relax, it's
               | just the second coming of GeoCities.
        
               | hatefulmoron wrote:
               | I think the fear comes from the span of time. If my job
               | is obsolete at the same time as everybody else's, I
               | wouldn't care. I mean, sure, the world is in for a very
               | tough time, but I would be in good company.
               | 
               | The really bad situation is if my entire skill set is
               | made obsolete while the rest of the world keeps going for
               | a decade or two. Or maybe longer, who knows.
               | 
               | I realize I'm coming across quite selfish, but it's just
               | a feeling.
        
         | deadbabe wrote:
          | There's a very good chance that if a company can replace its
          | programmers with pure AI, then whatever they're doing is
          | probably already being offered as a SaaS product, so why not
          | just skip the AI and buy that? Much cheaper, and you don't have
          | to worry about dealing with bugs.
        
           | croemer wrote:
           | SaaS works for general problems faced by many businesses.
        
             | deadbabe wrote:
             | Exactly. Most businesses can get away with not having
             | developers at all if they just glue together the right
             | combination of SaaS products. But this doesn't happen,
             | implying there is something more about having your own
             | homegrown developers that SaaS cannot replace.
        
               | croemer wrote:
               | The risk is not SaaS replacing internal developers. It's
               | about increased productivity of developers reducing the
               | number of developers needed to achieve something.
        
               | deadbabe wrote:
               | Again, you're assuming product complexity won't grow as a
               | result of new AI tools.
               | 
               | 3 decades ago you needed a big team to create the type of
               | video games that one person can probably make on their
               | own today in their spare time with modern tools.
               | 
               | But now modern tools have been used to make even more
               | complicated games that require more massive teams than
               | ever and huge amounts of money. One person has no hope of
               | replicating that now, but maybe in the future with AI
               | they can. And then the AAA games will be even _more_
               | advanced.
               | 
               | It will be similar with other software.
        
         | sss111 wrote:
          | 3 to 5 years, max. Traditional coding is going to be dead in
          | the water. Optimistically, the junior SWE job will evolve, but
          | more realistically, dedicated AI-based programming agents will
          | end demand for junior SWEs.
        
           | lagrange77 wrote:
           | Which implies that a few years later they will not become
           | senior SWEs either.
        
         | torginus wrote:
          | Well, considering they floated the $2,000 subscription idea,
          | and they still haven't revealed everything, they could still
          | introduce the $2k sub with o3 + agents/tool use. Which means:
          | until about next week.
        
         | arrosenberg wrote:
          | Unless the LLMs see multiple leaps in capability, probably
          | indefinitely. The Malthusians in this thread seem to think that
          | LLMs are going to fix the human problems involved in executing
          | these businesses - they won't. They make good programmers more
          | productive and will cost some jobs at the margins, but it will
          | mostly be the low-level programming work that was previously
          | outsourced to Asia and South America for cost arbitrage.
        
         | mrdependable wrote:
         | I think they will have to figure out how to get around context
         | limits before that happens. I also wouldn't be surprised if the
         | future models that can actually replace workers are sold at
         | such an exorbitant price that only larger companies will be
         | able to afford it. Everyone else gets access to less capable
         | models that still require someone with knowledge to get to an
         | end result.
        
         | kirykl wrote:
         | If it's any consolation, Agile priests and middle managers will
         | be the first to go
        
         | HarHarVeryFunny wrote:
          | You're not being paid $150K to "write code". You're being paid
          | that to deliver solutions - to be a corporate cog that can
          | ingest business requirements and emit (and maintain) business
          | solutions.
         | 
         | If there are jobs paying $150K just to code (someone else tells
         | you what to code, and you just code it up), then please share!
        
       | braden-lk wrote:
       | If people constantly have to ask if your test is a measure of
       | AGI, maybe it should be renamed to something else.
        
         | OfficialTurkey wrote:
         | From the post
         | 
         | > Passing ARC-AGI does not equate achieving AGI, and, as a
         | matter of fact, I don't think o3 is AGI yet. o3 still fails on
         | some very easy tasks, indicating fundamental differences with
         | human intelligence.
        
           | cchance wrote:
            | It's funny when they say this, as if all humans can solve
            | basic-ass question/answer combos. People seem to forget
            | there's a percentage of the population that honestly believes
            | the world is flat, along with other hallucinations at the
            | human level.
        
             | jppittma wrote:
             | I don't believe AGI at that level has any commercial value.
        
             | Jensson wrote:
              | Humans work in groups, so you are wrong: a group of humans
              | is extremely reliable on tons of tasks. These AI models
              | also work in groups (or they don't improve from working in
              | a group, since the company uses whatever does best on the
              | benchmark), so it is only fair to compare AI vs. a group of
              | people. Comparing AI to an individual will always be an
              | unfair comparison, since an AI is never alone.
        
       | modeless wrote:
       | Congratulations to Francois Chollet on making the most
       | interesting and challenging LLM benchmark so far.
       | 
       | A lot of people have criticized ARC as not being relevant or
       | indicative of true reasoning, but I think it was exactly the
       | right thing. The fact that scaled reasoning models are finally
       | showing progress on ARC proves that what it measures really is
       | relevant and important for reasoning.
       | 
       | It's obvious to everyone that these models can't perform as well
       | as humans on everyday tasks despite blowout scores on the hardest
       | tests we give to humans. Yet nobody could quantify exactly the
       | ways the models were deficient. ARC is the best effort in that
       | direction so far.
       | 
       | We don't need more "hard" benchmarks. What we need right now are
       | "easy" benchmarks that these models nevertheless fail. I hope
       | Francois has something good cooked up for ARC 2!
        
         | dtquad wrote:
         | Are there any single-step non-reasoner models that do well on
         | this benchmark?
         | 
         | I wonder how well the latest Claude 3.5 Sonnet does on this
         | benchmark and if it's near o1.
        
           | throwaway71271 wrote:
            | | Name                                 | Semi-private eval | Public eval |
            | |--------------------------------------|-------------------|-------------|
            | | Jeremy Berman                        | 53.6%             | 58.5%       |
            | | Akyurek et al.                       | 47.5%             | 62.8%       |
            | | Ryan Greenblatt                      | 43%               | 42%         |
            | | OpenAI o1-preview (pass@1)           | 18%               | 21%         |
            | | Anthropic Claude 3.5 Sonnet (pass@1) | 14%               | 21%         |
            | | OpenAI GPT-4o (pass@1)               | 5%                | 9%          |
            | | Google Gemini 1.5 (pass@1)           | 4.5%              | 8%          |
           | 
           | https://arxiv.org/pdf/2412.04604
        
             | kandesbunzler wrote:
             | why is this missing the o1 release / o1 pro models? Would
             | love to know how much better they are
        
               | Freebytes wrote:
               | This might be because they are referencing single step,
               | and I do not think o1 is single step.
        
             | aimanbenbaha wrote:
             | Akyurek et al uses test-time compute.
        
           | YetAnotherNick wrote:
            | Here are the results for base models[1]:
            | 
            |     o3 (coming soon)   75.7%  82.8%
            |     o1-preview         18%    21%
            |     Claude 3.5 Sonnet  14%    21%
            |     GPT-4o             5%     9%
            |     Gemini 1.5         4.5%   8%
           | 
           | Score (semi-private eval) / Score (public eval)
           | 
           | [1]: https://arcprize.org/2024-results
        
             | simonw wrote:
             | I'd love to know how Claude 3.5 Sonnet does so well despite
             | (presumably) not having the same tricks as the o-series
             | models.
        
             | Bjorkbat wrote:
             | It's easy to miss, but if you look closely at the first
             | sentence of the announcement they mention that they used a
             | version of o3 trained on a public dataset of ARC-AGI, so
             | technically it doesn't belong on this list.
        
               | dot1x wrote:
               | It's all a scam. ClosedAI trained on the data they were
               | tested on, so no, nothing here is impressive.
        
         | refulgentis wrote:
         | This emphasizes persons and a self-conceived victory narrative
         | over the ground truth.
         | 
         | Models have regularly made progress on it, this is not new with
         | the o-series.
         | 
         | Doing astoundingly well on it, and having a mutually shared PR
         | interest with OpenAI in this instance, doesn't mean a pile of
         | visual puzzles is actually AGI or some well thought out and
         | designed benchmark of True Intelligence(tm). It's one type of
         | visual puzzle.
         | 
         | I don't mean to be negative, but to inject a memento mori. Real
         | story is some guys get together and ride off Chollet's name
         | with some visual puzzles from ye olde IQ test, and the deal was
         | Chollet then gets to show up and say it proves program
         | synthesis is required for True Intelligence.
         | 
         | Getting this score is extremely impressive but I don't assign
         | more signal to it than any other benchmark with some thought to
         | it.
        
           | modeless wrote:
           | Solving ARC doesn't mean we have AGI. Also o3 presumably
           | isn't doing program synthesis, seemingly proving Francois
           | wrong on that front. (Not sure I believe the speculation
           | about o3's internals in the link.)
           | 
           | What I'm saying is the fact that as models are getting better
           | at reasoning they are also scoring better on ARC proves that
           | it _is_ measuring something relating to reasoning. And nobody
           | else has come up with a comparable benchmark that is so easy
           | for humans and so hard for LLMs. Even today, let alone five
           | years ago when ARC was released. ARC was visionary.
        
             | hdjjhhvvhga wrote:
             | Your argument seems convincing, but I'd like to offer a
             | competing narrative: any benchmark that is public becomes
             | completely useless because companies optimize for it -
             | especially AI companies, which depend on piles of money
             | and need some proof that they are making progress.
             | 
             | That's why I have some private benchmarks, and I'm sorry
             | to say that the transition from GPT-4 to o1 wasn't
             | unambiguously a step forward (in some tasks yes, in some
             | not).
             | 
             | On the other hand, private benchmarks are even less useful
             | to the general public than the public ones, so we have to
             | deal with what we have - but many of us just treat it as
             | noise and don't give it much significance. Ultimately, the
             | models should defend themselves by performing the tasks
             | individual users want them to do.
        
               | stonemetal12 wrote:
               | Rather, any logic puzzle you post on the internet as
               | something AIs are bad at ends up in the next round of
               | training data, so AIs get better at that specific
               | question. Not because AI companies are optimizing for a
               | benchmark, but because they suck up everything.
        
               | modeless wrote:
               | ARC has two test sets that are not posted on the
               | Internet. One is kept completely private and never
               | shared. It is used when testing open source models and
               | the models are run locally with no internet access. The
               | other test set is used when testing closed source models
               | that are only available as APIs. So it could be leaked in
               | theory, but it is still not posted on the internet and
               | can't be in any web crawls.
               | 
               | You could argue that the models can get an advantage by
               | looking at the training set which is on the internet. But
               | all of the tasks are unique and generalizing from the
               | training set to the test set is the whole point of the
               | benchmark. So it's not a serious objection.
        
               | foobiekr wrote:
               | Given the delivery mechanism for OpenAI, how do they
               | actually keep it private?
        
               | modeless wrote:
               | > So it could be leaked in theory
               | 
               | That's why they have two test sets. But OpenAI has
               | legally committed to not training on data passed to the
               | API. I don't believe OpenAI would burn their reputation
               | and risk legal action just to cheat on ARC. And what
               | they've reported is not implausible IMO.
        
               | sensanaty wrote:
               | Yeah I'm sure the Microsoft-backed company headed by Mr.
               | Worldcoin Altman whose sole mission statement so far has
               | been to overhype every single product they released
               | wouldn't _dare_ cheat on one of these benchmarks that
               | "prove" AGI (as they've been claiming since GPT-2).
        
             | QuantumGood wrote:
             | Gaming the benchmarks usually needs to be considered first
             | when evaluating new results.
        
               | chaps wrote:
               | Honestly, is gaming benchmarks actually a problem in this
               | space in that it still shows something useful? Just means
               | we need more benchmarks, yeah? It really feels not unlike
               | Kaggle competitions.
               | 
               | We do the same exact stuff with real people with
               | programming challenges and such where people just study
               | common interview questions rather than learning the
               | material holistically. And since we know that people game
               | these interview type questions, we can adjust the
               | interview processes to minimize gamification.... which
               | itself leads to gamification and back to step one. That's
               | not an ideal feedback loop of course, but people still
               | get jobs and churn out "productive work" out of it.
        
               | ben_w wrote:
               | AI are very good at gaming benchmarks. Both as
               | overfitting and as Goodhart's law, gaming benchmarks has
               | been a core problem during training for as long as I've
               | been interested in the field.
               | 
               | Sometimes this manifests as "outside the box thinking",
               | like how a genetic algorithm got an "oscillator" which
               | was really just an antenna.
               | 
               | It is a hard problem, and yes we still both need and can
               | make more and better benchmarks; but it's still a problem
               | because it means the benchmarks we do have are
               | overstating competence.
        
               | CamperBob2 wrote:
               | The _idea_ behind this particular benchmark, at least, is
               | that it can't be gamed. What are some ways to game ARC-
               | AGI, meaning to pass it without developing the required
               | internal model and insights?
               | 
               | In principle you can't optimize specifically for ARC-AGI,
               | train against it, or overfit to it, because only a few of
               | the puzzles are publicly disclosed.
               | 
               | Whether it lives up to that goal, I don't know, but their
               | approach sounded good when I first heard about it.
        
               | psb217 wrote:
               | Well, with billions in funding you could task a hundred
               | or so very well paid researchers to do their best at
               | reverse engineering the general thought process which
               | went into ARC-AGI, and then generate fresh training data
               | and labeled CoTs until the numbers go up.
        
               | CamperBob2 wrote:
               | Right, but the ARC-AGI people would counter by saying
               | they're welcome to do just that. In doing so -- again in
               | their view -- the researchers would create a model that
               | could be considered capable of AGI.
               | 
               | I spent a couple of hours looking at the publicly-
               | available puzzles, and was really impressed at how much
               | room for creativity the format provides. Supposedly the
               | puzzles are "easy for humans," but some of them were
               | not... at least not for me.
               | 
               | (It did occur to me that a better test of AGI might be
               | the ability to generate new, innovative ARC-AGI puzzles.)
        
               | psb217 wrote:
               | It's tricky to judge the difficulty of these sorts of
               | things. Eg, breadth of possibilities isn't an automatic
               | sign of difficulty. I imagine the space of programming
               | problems permits as much variety as ARC-AGI, but since
               | we're more familiar with problems presented as natural
               | language descriptions of programming tasks, and since we
               | know there's tons of relevant text on the web, we see the
               | abstract pictographic ARC-AGI tasks as more novel,
               | challenging, etc. But, to an LLM, any task we can
               | conceive of will be (roughly) as familiar as the amount
               | of relevant training data it's seen. It's legitimately
               | hard to internalize this.
               | 
               | For a space of tasks which are well-suited to
               | programmatic generation, as ARC-AGI is by design, if we
               | can do a decent job of reverse engineering the underlying
               | problem generating grammar, then we can make an LLM as
               | familiar with the task as we're willing to spend on
               | compute.
               | 
               | To be clear, I'm not saying solving these sorts of tasks
               | is unimpressive. I'm saying that I find it unsurprising
               | (in light of past results) and not that strong of a
               | signal about further progress towards the singularity, or
               | FOOM, or whatever. For any of these closed-ish domain
               | tasks, I feel a bit like they're solving Go for the
               | umpteenth time. We now know that if you collect enough
               | relevant training data and train a big enough model with
               | enough GPUs, the training loss will go down and you'll
               | probably get solid performance on the test set. Trillions
               | of reasonably diverse training tokens buys you a lot of
               | generalization. Ie, supervised learning works. This is
               | the horse Ilya Sutskever's ridden to many glorious
               | victories and the big driver of OpenAI's success -- a
               | firm belief that other folks were leaving A LOT of
               | performance on the table due to a lack of belief in the
               | power of their own inventions.
        
               | chaps wrote:
               | We're in agreement!
               | 
               | What's endlessly interesting to me with all of this is
               | how surprisingly quick the benchmarking feedback loops
               | have become plus the level of scrutiny each one receives.
               | We (as a culture/society/whatever) don't really treat
               | human benchmarking criteria with the same scrutiny such
               | that feedback loops are useful and lead to productive
               | changes to the benchmarking system itself. So from that
               | POV it feels like substantial progress continues to be
               | made through these benchmarks.
        
               | bubblyworld wrote:
               | I think gaming the benchmarks is _encouraged_ in the ARC
               | AGI context. If you look at the public test cases you 'll
               | see they test a ton of pretty abstract concepts - space,
               | colour, basic laws of physics like gravity/magnetism,
               | movement, identity and lots of other stuff (highly
               | recommend exploring them). Getting an AI to do well _at
               | all_ , regardless of whether it was gamed or not, is the
               | whole challenge!
        
             | refulgentis wrote:
             | > Solving ARC doesn't mean we have AGI. Also o3 presumably
             | isn't doing program synthesis, seemingly proving Francois
             | wrong on that front.
             | 
             | Agreed.
             | 
             | > And nobody else has come up with a comparable benchmark
             | that is so easy for humans and so hard for LLMs.
             | 
             | ? There's plenty.
        
               | modeless wrote:
               | I'd love to hear about more. Which ones are you thinking
               | of?
        
               | refulgentis wrote:
               | - "Are You Human" https://arxiv.org/pdf/2410.09569 is
               | designed to be directly on target, i.e. a cross-cutting
               | set of questions that are easy for humans but challenging
               | for LLMs, instead of one type of visual puzzle. Much
               | better than ARC for the purpose you're looking for.
               | 
               | - SimpleBench https://simple-bench.com/ (similar to
               | above; great landing page w/scores that show human / ai
               | gap)
               | 
               | - PIQA (physical question answering, i.e. "how do I get a
               | yolk out of a water bottle"), a common favorite of local
               | LLM enthusiasts in /r/localllama:
               | https://paperswithcode.com/dataset/piqa
               | 
               | - Berkeley Function-Calling (I prefer
               | https://gorilla.cs.berkeley.edu/leaderboard.html)
               | 
               | AI search googled "llm benchmarks challenging for ai easy
               | for humans", and "language model benchmarks that humans
               | excel at but ai struggles with", and "tasks that are easy
               | for humans but difficult for natural language ai".
               | 
               | It also mentioned Moravec's Paradox is a known framing of
               | this concept, started going down that rabbit hole because
               | the resources were fascinating, but, had to hold back and
               | submit this reply first. :)
        
               | modeless wrote:
               | Thanks for the pointers! I hadn't seen Are You Human.
               | Looks like it's only two months old. Of course it is much
               | easier to design a test specifically to thwart LLMs now
               | that we have them. It seems to me that it is designed to
               | exploit details of LLM structure like tokenizers (e.g.
               | character counting tasks) rather than to provide any sort
               | of general reasoning benchmark. As such it seems
               | relatively straightforward to improve performance in ways
               | that wouldn't necessarily represent progress in general
               | reasoning. And today's LLMs are not nearly as far from
               | human performance on the benchmark as they were on ARC
               | for many years after it was released.
               | 
               | SimpleBench looks more interesting. Also less than two
               | months old. It doesn't look as challenging for LLMs as
               | ARC, since o1-preview and Sonnet 3.5 already got half of
               | the human baseline score; they did much worse on ARC. But
               | I like the direction!
               | 
               | PIQA is cool but not hard enough for LLMs.
               | 
               | I'm not sure Berkeley Function-Calling represents tasks
               | that are "easy" for average humans. Maybe programmers
               | could perform well on it. But I like ARC in part because
               | the tasks do seem like they should be quite
               | straightforward even for non-expert humans.
               | 
               | Moravec's paradox isn't a benchmark per se. I tend to
               | believe that there is no real paradox and all we need is
               | larger datasets to see the same scaling laws that we have
               | for LLMs. I see good evidence in this direction:
               | https://www.physicalintelligence.company/blog/pi0
        
               | refulgentis wrote:
               | > "I'm not sure Berkeley Function-Calling represents
               | tasks that are easy for average humans. Maybe programmers
               | could perform well on it."
               | 
               | Functions in this context are not programming function
               | calls. In this context, function calls are a now-
               | deprecated LLM API name for "parse input into this JSON
               | template." No programmer experience needed. Entity
               | extraction by another name, except, that'd be harder:
               | here, you're told up front exactly the set of entities to
               | identify. :)
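               | 
               | To make that concrete, a hypothetical item boils down to
               | something like this (the tool name and fields here are
               | illustrative, not the actual benchmark schema):
               | 
               |     # A "function" is just a JSON template the model
               |     # must fill in from the user's text.
               |     tool = {
               |         "name": "get_weather",  # hypothetical
               |         "parameters": {"city": "string",
               |                        "unit": "C or F"},
               |     }
               |     text = "How hot is Paris right now, in C?"
               |     # Expected model output: the filled template.
               |     call = {"name": "get_weather",
               |             "arguments": {"city": "Paris",
               |                           "unit": "C"}}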
               | 
               | > "Moravec's paradox isn't a benchmark per se."
               | 
               | Yup! It's a paradox :)
               | 
               | > "Of course it is much easier to design a test
               | specifically to thwart LLMs now that we have them"
               | 
               | Yes.
               | 
               | Though, I'm concerned a simple yes might be insufficient
               | for illumination here.
               | 
               | It is a tautology (it's easier to design a test that $X
               | fails when you have access to $X), and it's unlikely you
               | meant to just share a tautology.
               | 
               | A potential unstated-but-maybe-intended-communication is
               | "it was hard to come up with ARC before LLMs existed" ---
               | LLMs existed in 2019 :)
               | 
               | If they didn't, a hacky way to come up with a test that's
               | hard for the top AIs at the time, BERT-era, would be to
               | use one type of visual puzzle.
               | 
               | If, for conversation's sake, we ignore that it is exactly
               | one type of visual puzzle, and that it wasn't designed to
               | be easy for humans, then we can engage with: "its the
               | only one thats easy for humans, but hard for LLMs" ---
               | this was demonstrated as untrue as well.
               | 
               | I don't think I have much to contribute past that, once
               | we're at "It is a singular example of a benchmark thats
               | easy for humans but nigh-impossible for llms, at least in
               | 2019, and this required singular insight", there's just
               | too much that's not even wrong, in the Pauli sense, and
               | it's in a different universe from the original claims:
               | 
               | - "Congratulations to Francois Chollet on making the most
               | interesting and challenging LLM benchmark so far."
               | 
               | - "A lot of people have criticized ARC as not being
               | relevant or indicative of true reasoning...The fact that
               | [o-series models show] progress on ARC proves that what it
               | measures really is relevant and important for reasoning."
               | 
               | - "...nobody could quantify exactly the ways the models
               | were deficient..."
               | 
               | - "What we need right now are "easy" benchmarks that
               | these models nevertheless fail."
        
               | CamperBob2 wrote:
               | How long has SimpleBench been posted? Out of the first 6
               | questions at https://simple-bench.com/try-yourself,
               | o1-pro got 5/6 right.
               | 
               | It was interesting to see how it failed on question 6:
               | https://chatgpt.com/c/6765e70e-44b0-800b-97bd-928919f04fbe
               | 
               | Apparently LLMs do not consider global thermonuclear war
               | to be all that big a deal, for better or worse.
        
               | Pannoniae wrote:
               | Don't worry, I also got that wrong :) I thought her
               | affair would be the biggest problem for John.
        
               | jquery wrote:
               | John was an ex, not her partner. Tricky.
        
             | HarHarVeryFunny wrote:
             | > o3 presumably isn't doing program synthesis
             | 
             | I'd guess it's doing natural language procedural synthesis,
             | the same way a human might (i.e. figuring the sequence of
             | steps to effect the transformation), but it may well be
             | doing (sub-)solution verification by using the procedural
             | description to generate code whose output can then be
             | compared to the provided examples.
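             | 
             | A minimal sketch of that verify-by-execution idea, assuming
             | candidates come back as grid -> grid functions (the names
             | and structure here are my own guesses, not anything OpenAI
             | has published):
             | 
             |     def passes_examples(candidate, train_pairs):
             |         # keep a candidate only if it reproduces every
             |         # provided example pair exactly
             |         return all(candidate(p["input"]) == p["output"]
             |                    for p in train_pairs)
             | 
             |     def solve(task, propose_candidates):
             |         # propose_candidates would be the LLM sampling step
             |         for fn in propose_candidates(task):
             |             if passes_examples(fn, task["train"]):
             |                 return [fn(t["input"])
             |                         for t in task["test"]]
             |         return None  # nothing survived verification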
             | 
             | While OpenAI haven't said exactly what the architecture of
             | o1/o3 are, the gist of it is pretty clear - basically
             | adding "tree" search and iteration on top of the underlying
             | LLM, driven by some RL-based post-training that imparts
             | generic problem solving biases to the model. Maybe there is
             | a separate model orchestrating the search and solution
             | evaluation.
             | 
             | I think there are many tasks that are easy enough for
             | humans but hard/impossible for these models - the ultimate
             | one in terms of commercial value would be to take an "off
             | the shelf model" and treat it as an intern/apprentice and
             | teach it to become competent in an entire job it was never
             | trained on. Have it participate in team meetings and
             | communications, and become a drop-in replacement for a
             | human performing that job (any job that can be performed
             | remotely without a physical presence).
        
           | stego-tech wrote:
           | I won't be as brutal in my wording, but I agree with the
           | sentiment. This was something drilled into me as someone with
           | a hobby in PC Gaming _and_ Photography: benchmarks, while
           | handy measures of _potential_ capabilities, are not
           | _guarantees_ of real world performance. Very few PC gamers
           | completely reinstall the OS before benchmarking to remove all
           | potential cruft or performance impacts, just as very few
           | photographers exclusively take photos of test materials.
           | 
           | While I appreciate the benchmark and its goals (not to
           | mention the puzzles - I quite enjoy figuring them out),
           | successfully passing this benchmark does not demonstrate or
           | guarantee real world capabilities or performance. This is why
           | I increasingly side-eye this field and its obsession with
           | constantly passing benchmarks and then moving the goal posts
           | to a newer, harder benchmark that claims to be a better
           | simulation of human capabilities than the last one: it reeks
           | of squandered capital and a lack of a viable/profitable
           | product, at least to my sniff test. Rather than simply
           | capitalize on their actual accomplishments (which LLMs are -
           | natural language interaction is huge!), they're trying to
           | prove to Capital that with a few (hundred) billion more in
           | investments, they can make AGI out of this and replace all
           | those expensive humans.
           | 
           | They've built the most advanced prediction engines ever
           | conceived, and insist they're best used to replace labor. I'm
           | not sure how they reached that conclusion, but considering
           | even their own models refute this use case for LLMs, I doubt
           | their execution ability on that lofty promise.
        
           | danielmarkbruce wrote:
           | 100%. The hype is misguided. I doubt half the people excited
           | about the result have even looked at what the benchmark is.
        
         | lossolo wrote:
         | > making the most interesting and challenging LLM benchmark so
         | far.
         | 
         | This[1] is currently the most challenging benchmark. I would
         | like to see how O3 handles it, as O1 solved only 1%.
         | 
         | 1. https://epoch.ai/frontiermath/the-benchmark
        
           | pynappo wrote:
           | Apparently o3 scored about 25%
           | 
           | https://youtu.be/SKBG1sqdyIU?t=4m40s
        
             | FiberBundle wrote:
             | This is actually the result that I find way more
             | impressive. Elite mathematicians think these problems are
             | challenging and thought they were years away from being
             | solvable by AI.
        
           | modeless wrote:
           | You're right, I was wrong to say "most challenging" as there
           | have been harder ones coming out recently. I think the
           | correct statement would be "most challenging long-standing
           | benchmark" as I don't believe any other test designed in 2019
           | has resisted progress for so long. FrontierMath is only a
           | month old. And of course the real key feature of ARC is that
           | it is easy for humans. FrontierMath is (intentionally) not.
        
             | esafak wrote:
             | They should put some famous, unsolved problems in the next
             | edition so ML researchers do some actually useful work
             | while they're "gaming" the benchmarks :)
        
         | skywhopper wrote:
         | "The fact that scaled reasoning models are finally showing
         | progress on ARC proves that what it measures really is relevant
         | and important for reasoning."
         | 
         | Not sure I understand how this follows. The fact that a certain
         | type of model does well on a certain benchmark means that the
         | benchmark is relevant for a real-world reasoning? That doesn't
         | make sense.
        
           | munchler wrote:
           | It shows objectively that the models are getting better at
           | some form of reasoning, which is at least worth noting.
           | Whether that improved reasoning is relevant for the real
           | world is a different question.
        
             | moffkalast wrote:
             | It shows objectively that one model got better at this
             | specific kind of weird puzzle that doesn't translate to
             | anything because it is just a pointless pattern matching
             | puzzle that can be trained for, just like anything else. In
             | fact they specifically trained for it, they say so upfront.
             | 
             | It's like the modern equivalent of saying "oh when AI
             | solves chess it'll be as smart as a person, so it's a good
             | benchmark" and we all know how that nonsense went.
        
               | munchler wrote:
               | Hmm, you could be right, but you could also be very
               | wrong. Jury's still out, so the next few years will be
               | interesting.
               | 
               | Regarding the value of "pointless pattern matching" in
               | particular, I would refer you to Douglas Hofstadter's
               | discussion of Bongard problems starting on page 652 of
               | _Godel, Escher, Bach_. Money quote: "I believe that the
               | skill of solving Bongard [pattern recognition] problems
               | lies very close to the core of 'pure' intelligence, if
               | there is such a thing."
        
               | moffkalast wrote:
               | Well I certainly at least agree with that second part,
               | the doubt if there is such a thing ;)
               | 
               | The problem with pattern matching of sequences and
               | transformers as an architecture is that it's something
               | they're explicitly designed to be good at with self
               | attention. Translation is mainly matching patterns to
               | equivalents in different languages, and continuing a
               | piece of text is following a pattern that exists inside
               | it. This is primarily why it's so hard to draw a line
               | between what an LLM actually understands and what it just
               | wings naturally through pattern memorization and why
               | everything about them is so controversial.
               | 
               | Honestly I was really surprised that all models did so
               | poorly on ARC in general thus far, since it's something
               | they ought to be superhuman at from the get-
               | go. Probably more of a problem that it's visual in
               | concept than anything else.
        
           | bagels wrote:
           | It doesn't follow, faulty logic. The two are probably
           | correlated though.
        
         | jug wrote:
         | I liked the SimpleQA benchmark that measures hallucinations.
         | OpenAI models did surprisingly poorly, even o1. In fact, it
         | looks like OpenAI often does well on benchmarks by taking the
          | shortcut of being more risk-prone than Anthropic and Google.
        
         | zone411 wrote:
         | It's the least interesting benchmark for language models among
         | all they've released, especially now that we already had a
         | large jump in its best scores this year. It might be more
         | useful as a multimodal reasoning task since it clearly involves
         | visual elements, but with o3 already performing so well, this
         | has proven unnecessary. ARC-AGI served a very specific purpose
         | well: showcasing tasks where humans easily outperformed
         | language models, so these simple puzzles had their uses. But
         | tasks like proving math theorems or programming are far more
         | impactful.
        
           | versteegen wrote:
           | ARC wasn't designed as a benchmark for LLMs, and it doesn't
           | make much sense to compare them on it since it's the wrong
            | modality. Even an MLM with image inputs can't be expected to
           | do well, since they're nothing like 99.999% of the training
           | data. The fact that even a text-only LLM can solve ARC
           | problems with the proper framework is important, however.
        
         | danielmarkbruce wrote:
         | Highly challenging for LLMs because it has nothing to do with
         | language. LLMs and their training processes have all kinds of
         | optimizations for language and how it's presented.
         | 
         | This benchmark has done a wonderful job with marketing by
         | picking a great name. It's largely irrelevant for LLMs despite
         | the fact it's difficult.
         | 
         | Consider how much of the model is just noise for a task like
         | this given the low amount of information in each token and the
         | high embedding dimensions used in LLMs.
        
           | computerex wrote:
           | The benchmark is designed to test for AGI and intelligence,
           | specifically the ability to solve novel problems.
           | 
           | If the hypothesis is that LLMs are the "computer" that drives
           | the AGI then of course the benchmark is relevant in testing
           | for AGI.
           | 
           | I don't think you understand the benchmark and its
           | motivation. ARC AGI benchmark problems are extremely easy and
           | simple for humans. But LLMs fail spectacularly at them. Why
            | they fail is irrelevant; the fact that they fail means that
           | we don't have AGI.
        
             | danielmarkbruce wrote:
             | > The benchmark is designed to test for AGI and
             | intelligence, specifically the ability to solve novel
             | problems.
             | 
             | It's a bunch of visual puzzles. They aren't a test for AGI
             | because it's not general. If models (or any other system
             | for that matter) could solve it, we'd be saying "this is a
             | stupid puzzle, it has no practical significance". It's a
             | test of some sort of specific intelligence. On top of that,
             | the vast majority of blind people would fail - are they not
             | generally intelligent?
             | 
             | The name is marketing hype.
             | 
             | The benchmark could be called "random puzzles LLMs are not
              | good at because they haven't been optimized for them because
              | it's not a valuable benchmark". Sure, it wasn't designed
             | _for_ LLMs, but throwing LLMs at it and saying  "see?" is
             | dumb. We can throw in benchmarks for tennis playing, chess
             | playing, video game playing, car driving and a bajillion
             | other things while we are at it.
        
               | NateEag wrote:
               | And all that is kind of irrelevant, because if LLMs were
               | human-level general intelligence, they would solve all
               | these questions correctly without blinking.
               | 
               | But they don't. Not even the best ones.
        
               | pama wrote:
               | No human would score high on that puzzle if the images
               | were given to them as a series of tokens. Even previous
               | LLMs scored much better than humans if tested in the same
               | way.
        
         | adamgordonbell wrote:
         | There is a benchmark, NovelQA, that LLMs don't dominate when it
         | feels like they should. The benchmark is to read a novel and
         | answer questions about it.
         | 
          | LLMs are below the human baseline, last I looked, but it
         | doesn't get much attention.
         | 
         | Once it is passed, I'd like to see one that is solving the
         | mystery in a mystery book right before it's revealed.
         | 
         | We'd need unpublished mystery novels to use for that benchmark,
         | but I think it gets at what I think of as reasoning.
         | 
         | https://novelqa.github.io/
        
           | CamperBob2 wrote:
           | Does it work on short stories, but not novels? If so, then
           | that's just a minor question of context length that should
           | self-resolve over time.
        
             | adamgordonbell wrote:
              | The books fit in current long-context models, so it's not
              | merely a context-size constraint, though the length is
              | certainly part of the issue.
        
           | meta_x_ai wrote:
            | Looks like it hasn't been updated for nearly a year, and I'm
            | guessing Gemini 2.0 Flash with 2M context will simply crush
            | it.
        
             | adamgordonbell wrote:
             | That's true. They don't have Claude 3.5 on there either. So
             | maybe it's not relevant anymore, but I'm not sure.
             | 
             | If so, let's move on to the murder mysteries or more
             | complex literary analysis.
        
           | rowanG077 wrote:
           | Benchmark how? Is it good if the LLM can or can't solve it?
        
           | loxias wrote:
           | NovelQA is a great one! I also like GSM-Symbolic -- a
           | benchmark based on making _symbolic templates_ of quite easy
           | questions, and sampling them repeatedly, varying things like
           | which proper nouns are used, what order relevant details
           | appear, how many irrelevant details (GSM-NoOp) and where they
           | are in the question, things like that.
           | 
           | LLMs are far, _far_ below human on elementary problems, once
           | you allow any variation and stop spoonfeeding perfectly
           | phrased word problems. :)
           | 
           | https://machinelearning.apple.com/research/gsm-symbolic
           | 
           | https://arxiv.org/pdf/2410.05229
           | 
           | Paper came out in October, I don't think many have fully
           | absorbed the implications.
           | 
           | It's hard to take any of the claims of "LLMs can do
           | reasoning!" seriously, once you understand that simply
            | changing what names are used in an 8th grade math word problem
           | can have dramatic impact on the accuracy.
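            | 
            | Roughly, the templating idea looks like this (my own toy
            | example, not taken from the paper):
            | 
            |     import random
            | 
            |     NAMES = ["Sophie", "Ravi", "Chen", "Maria"]
            |     ITEMS = ["apples", "marbles", "stickers"]
            | 
            |     def sample_problem():
            |         # same underlying arithmetic; surface details
            |         # (names, items, numbers) vary per sample
            |         name = random.choice(NAMES)
            |         item = random.choice(ITEMS)
            |         a = random.randint(3, 20)
            |         b = random.randint(3, 20)
            |         q = (f"{name} has {a} {item} and buys {b} more. "
            |              f"How many {item} does {name} have now?")
            |         return q, a + b  # symbolic ground truth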
        
           | latency-guy2 wrote:
           | > I'd like to see one that is solving the mystery in a
           | mystery book right before it's revealed.
           | 
            | I would think this is not a very good benchmark. Authors do
            | not write logically; they write for entertainment.
        
             | adamgordonbell wrote:
              | So I'm thinking of something like a locked-room mystery,
              | where the idea is that it's solvable and the reader is
              | given a chance to solve it.
              | 
              | The reason it seems like an interesting bench is that it's
              | a puzzle presented in a long context. It's like testing
              | whether an LLM is at Sherlock Holmes level of world and
              | motivation modelling.
        
           | usaar333 wrote:
           | That's an old leaderboard -- has no one checked any SOTA LLM
           | in the last 8 months?
        
         | aimanbenbaha wrote:
         | Because LLMs are on an off-ramp path towards AGI. A generally
         | intelligent system can brute force its way with just memory.
         | 
          | Once a model recognizes a weakness through CoT reasoning when
          | posed a certain problem, and gets the agency to adapt and
          | solve that problem, that's a precursor to real AGI capability!
        
         | justanotherjoe wrote:
          | I am confused because this dataset is visual, and yet it's
          | being used to measure 'LLMs'. I feel like the visual nature of
          | it was really the biggest hurdle to solving it.
        
         | internet_points wrote:
         | > The fact that scaled reasoning models are finally showing
         | progress on ARC proves that what it measures really is relevant
         | and important for reasoning.
         | 
         | One might also interpret that as "the fact that models which
         | are studying to the test are getting better at the test"
         | (Goodhart's law), not that they're actually reasoning.
        
       | wilg wrote:
       | fun! the benchmarks are so interesting because real world use is
       | so variable. sometimes 4o will nail a pretty difficult problem,
       | other times o1 pro mode will fail 10 times on what i would think
       | is a pretty easy programming problem and i waste more time trying
       | to do it with ai
        
       | behnamoh wrote:
       | So now not only are the models closed, but so are their evals?!
       | This is a "semi-private" eval. WTH is that supposed to mean? I'm
       | sure the model is great but I refuse to take their word for it.
        
         | ZeroCool2u wrote:
         | The private evaluation set is private from the public/OpenAI so
         | companies can't train on those problems and cheat their way to
         | a high score by overfitting.
        
           | jsheard wrote:
           | If the models run on OpenAIs servers then surely they could
           | still see the questions being put into it if they wanted to
           | cheat? That could only be prevented by making the evaluation
           | a one-time deal that can't be repeated, or by having OpenAI
           | distribute their models for evaluators to run themselves,
           | which I doubt they're inclined to do.
        
             | foobarqux wrote:
             | Yes that's why it is "semi"-private: From the ARC website
             | "This set is "semi-private" because we can assume that over
             | time, this data will be added to LLM training data and need
             | to be periodically updated."
             | 
             | I presume evaluation on the test set is gated (you have to
             | ask ARC to run it).
        
         | cchance wrote:
          | The evals are the questions/answers. ARC-AGI doesn't share
          | the questions and answers for a portion, so that models can't
          | be trained on them. As for the public ones, the public knows
          | the questions, so there's a chance models could have been at
          | least partially trained on the questions (if not the actual
          | answers).
          | 
          | That's how I understand it.
        
       | neom wrote:
       | Why would they give a cost estimate per task on their low compute
       | mode but not their high mode?
       | 
       | "low compute" mode: Uses 6 samples per task, Uses 33M tokens for
       | the semi-private eval set, Costs $17-20 per task, Achieves 75.7%
       | accuracy on semi-private eval
       | 
       | The "high compute" mode: Uses 1024 samples per task (172x more
       | compute), Cost data was withheld at OpenAI's request, Achieves
       | 87.5% accuracy on semi-private eval
       | 
        | Can we just extrapolate to $3k-ish per task on high compute?
        | (wondering if it was withheld because this isn't the case?)
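        | 
        | Back-of-the-envelope, assuming cost scales roughly linearly
        | with compute (a big assumption on my part):
        | 
        |     low_cost = (17, 20)   # $/task in the 6-sample mode
        |     multiplier = 172      # "172x more compute"
        |     high = tuple(c * multiplier for c in low_cost)
        |     print(high)  # (2924, 3440), i.e. "$3kish"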
        
         | WiSaGaN wrote:
          | The withheld part is really a red flag for me. Why would you
          | want to withhold a cost number?
        
       | zebomon wrote:
       | My initial impression: it's very impressive and very exciting.
       | 
       | My skeptical impression: it's complete hubris to conflate ARC or
       | any benchmark with truly general intelligence.
       | 
        | I know my skepticism here is identical to moving goalposts. More
        | and more, I am shifting my personal understanding of general
        | intelligence toward a phenomenon we will only ever be able to
        | identify with the benefit of substantial hindsight.
       | 
       | As it is with any sufficiently complex program, if you could
       | discern the result beforehand, you wouldn't have had to execute
       | the program in the first place.
       | 
       | I'm not trying to be a downer on the 12th day of Christmas.
       | Perhaps because my first instinct is childlike excitement, I'm
       | trying to temper it with a little reason.
        
         | amarcheschi wrote:
          | I just googled ARC-AGI questions, and it looks like it is
          | similar to an IQ test with Raven's matrices. Similar as in
          | you have some examples of images before and after, then an
          | image before, and you have to guess the after.
          | 
          | Could anyone confirm whether this is the only kind of question
          | in the benchmark? If yes, how come there is such a direct
          | connection to "oh, this performs better than humans" when
          | LLMs can be quite a bit better than us at understanding and
          | forecasting patterns? I'm just curious, not trying to stir up
          | controversies.
        
           | zebomon wrote:
           | It's a test on which (apparently until now) the vast majority
           | of humans have far outperformed all machine systems.
        
             | patrickhogan1 wrote:
             | But it's not a test that directly shows general
             | intelligence.
             | 
              | I am no less excited! This is a huge improvement.
             | 
             | How does this do on SWE Bench?
        
               | og_kalu wrote:
               | >How does this do on SWE Bench?
               | 
               | 71.7%
        
               | throwaway0123_5 wrote:
               | I've seen this figure on a few tech news websites and
               | reddit but can't find an official source. If it was in
               | the video I must have missed it, where is this coming
               | from?
        
               | og_kalu wrote:
                | It was in the video. I don't know if OpenAI has a page
                | up yet.
        
           | ALittleLight wrote:
           | Yes, it's pretty similar to Raven's. The reason it is an
           | interesting benchmark is because humans, even very young
           | humans, "get" the test in the sense of understanding what
           | it's asking and being able to do pretty well on it - but LLMs
           | have really struggled with the benchmark in the past.
           | 
            | Chollet (one of the creators of the ARC benchmark) has been
           | saying it proves LLMs can't reason. The test questions are
           | supposed to be unique and not in the model's training set.
           | The fact that LLMs struggled with the ARC challenge suggested
            | (to Chollet and others) that models weren't "truly
            | reasoning" but rather just completing based on things they'd
           | seen before - when the models were confronted with things
           | they hadn't seen before, the novel visual patterns, they
           | really struggled.
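            | 
            | For reference, the public tasks are just small colored grids
            | encoded as integers, roughly in this shape (a made-up example
            | in the style of the published JSON, not an actual task):
            | 
            |     task = {
            |         "train": [   # demonstration pairs
            |             {"input":  [[0, 1], [1, 0]],
            |              "output": [[1, 0], [0, 1]]},
            |             {"input":  [[0, 2], [2, 0]],
            |              "output": [[2, 0], [0, 2]]},
            |         ],
            |         "test": [    # predict the output grid(s)
            |             {"input": [[0, 3], [3, 0]]}
            |         ],
            |     }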
        
           | Eridrus wrote:
           | ML is quite good at understanding and forecasting patterns
           | when you train on the data you want to forecast. LLMs manage
           | to do so much because we just decided to train on everything
           | on the internet and hope that it included everything we ever
           | wanted to know.
           | 
           | This tries to create patterns that are intentionally not in
           | the data and see if a system can generalize to them, which o3
           | super impressively does!
        
             | yunwal wrote:
             | ARC is in the dataset though? I mean I'm aware that there
             | are new puzzles every day, but there's still a very
             | specific format and set of skills required to solve it. I'd
             | bet a decent amount of money that humans get better at ARC
             | with practice, so it seems strange to suggest that AI
             | wouldn't.
        
         | hansonkd wrote:
         | It doesn't need to be general intelligence or perfectly map to
         | human intelligence.
         | 
         | All it needs to be is useful. Reading constant comments about
         | LLMs can't be general intelligence or lack reasoning etc, to me
         | seems like people witnessing the airplane and complaining that
         | it isn't "real flying" because it isn't a bird flapping its
         | wings (a large portion of the population held that point of
         | view back then).
         | 
         | It doesn't need to be general intelligence for the rapid
         | advancement of LLM capabilities to be the most societal
         | shifting development in the past decades.
        
           | zebomon wrote:
           | I agree. If the LLMs we have today never got any smarter, the
           | world would still be transformed over the next ten years.
        
           | AyyEye wrote:
           | > Reading constant comments about LLMs can't be general
           | intelligence or lack reasoning etc, to me seems like people
           | witnessing the airplane and complaining that it isn't "real
           | flying" because it isn't a bird flapping its wings (a large
           | portion of the population held that point of view back then).
           | 
           | That is a natural reaction to the incessant techbro, AIbro,
           | marketing, and corporate lies that "AI" (or worse AGI) is a
           | real thing, and can be directly compared to real humans.
           | 
           | There are people on this very thread saying it's better at
           | reasoning than real humans (LOL) because it scored higher on
           | some benchmark than humans... Yet this technology still can't
           | reliably determine what number is circled, if two lines
           | intersect, or count the letters in a word. (That said
           | behaviour may have been somewhat finetuned out of newer
            | models only reinforces the fact that the technology is
            | inherently not capable of understanding _anything_.)
        
             | IanCal wrote:
             | I encounter "spicy auto complete" style comments far more
             | often than techbro AI-everything comments and its frankly
             | getting boring.
             | 
             | I've been doing AI things for about 20+ years and llms are
             | wild. We've gone from specialized things being pretty bad
             | as those jobs to general purpose things better at that and
             | everything else. The idea you could make and API call with
             | "is this sarcasm?" and get a better than chance guess is
             | incredible.
        
               | AyyEye wrote:
               | Nobody is disputing the coolness factor, only the
               | intelligence factor.
        
               | hansonkd wrote:
               | I'm saying the intelligence factor doesn't matter. Only
               | the utility factor. Today LLMs are incredibly useful and
               | every few months there appears to be bigger and bigger
               | leaps.
               | 
               | Analyzing whether or not LLMs have intelligence is
                | missing the forest for the trees. This technology is
               | emerging in a capitalist society that is hyper optimized
               | to adopt useful things at the expense of almost
               | everything else. If the utility/price point gets hit for
               | a problem, it will replace it regardless of if it is
               | intelligent or not.
        
               | Jensson wrote:
               | But if you want to predict the future utility of these
               | models you want to look at their current intelligence,
               | compare that to humans and try to figure out roughly what
               | skills they lack and which of those are likely to get
               | fixed.
               | 
                | For example, a team of humans is extremely reliable,
                | much more reliable than one human, but a team of AIs
                | isn't much more reliable than one AI, since an AI is
                | already an ensemble model. That means even if an AI
                | could replace a person, it probably can't replace a
                | team for a long time, meaning you still need the other
                | team members there, meaning the AI didn't really
                | replace a human, it just became a tool for humans to
                | use.
        
               | MVissers wrote:
               | I think this is a fair criticism of capability.
               | 
               | I personally wouldn't be surprised if we start to see
               | benchmarks around this type of cooperation and ability to
               | orchestrate complex systems in the next few years or so.
               | 
               | Most benchmarks really focus on one problem, not on
               | multiple real-time problems while orchestrating 3rd party
               | actors who might or might not be able to succeed at
               | certain tasks.
               | 
               | But I don't think anything is prohibiting these models
               | from not being able to do that.
        
               | surgical_fire wrote:
               | Eh, I see far more "AI is the second coming of Jesus"
               | type of comments than healthy skepticism. A lot of
               | anxiety from people afraid that their source of income
               | will dry and a lot of excitement of people with an axe to
               | grind that "those entitled expensive peasants will get
               | what they deserve".
               | 
               | I think I count myself among the skeptics nowadays for
               | that reason. And I say this as someone that thinks LLM is
               | an interesting piece of technology, but with somewhat
               | limited use and unclear economics.
               | 
               | If the hype was about "look at this thing that can parse
               | natural language surprisingly well and generate coherent
               | responses", I would be excited too. As someone that had
               | to do natural language processing in the past, that is a
               | damn hard task to solve, and LLMs excel at it.
               | 
               | But that is not the hype is it? We have people beating
               | the drums of how this is just shy of taking the world by
               | storm, and AGI is just around the corner, and it will
               | revolutionize all economy and society and nothing will
               | ever be the same.
               | 
               | So, yeah, it gets tiresome. I wish the hype would die
               | down a little so this could be appreciated for what it
               | is.
        
               | williamcotton wrote:
               | _We have people beating the drums of how this is just shy
               | of taking the world by storm, and AGI is just around the
               | corner, and it will revolutionize all economy and society
               | and nothing will ever be the same._
               | 
               | Where are you seeing this? I pretty much only read HN and
               | football blogs so maybe I'm out of the loop.
        
               | sensanaty wrote:
               | In this very thread there are multiple people espousing
               | their views that the high score here is proof that o3 has
               | achieved AGI.
        
           | handsclean wrote:
           | People aren't responding to their own assumption that AGI is
           | necessary, they're responding to OpenAI and the chorus
           | constantly and loudly singing hymns to AGI.
        
           | surgical_fire wrote:
           | > to me seems like people witnessing the airplane and
           | complaining that it isn't "real flying" because it isn't a
           | bird flapping its wings
           | 
           | To me it is more like there is someone jumping on a pogo ball
           | while flapping their arms and saying that they are flying
           | whenever they hop off the ground.
           | 
           | Skeptics say that they are not really flying, while adherents
           | say that "with current pogo ball advancements, they will be
           | flying any day now"
        
             | intelVISA wrote:
             | Between skeptics and adherents who is more easily able to
             | extract VC money for vaporware? If you limit yourself to
             | 'the facts' you're leaving tons of $$ on the table...
        
               | surgical_fire wrote:
               | By all means, if this is the goal, AI is a success.
               | 
               | I understand that in this forum too many people are
               | invested in putting lipstick on this particular pig.
        
             | PaulDavisThe1st wrote:
             | An old quote, quite famous: "... is like saying that an ape
             | who climbs to the top of a tree for the first time is one
             | step closer to landing on the moon".
        
             | DonHopkins wrote:
             | Is that what Elon Musk was trying to do on stage?
        
           | billyp-rva wrote:
           | > It doesn't need to be general intelligence or perfectly map
           | to human intelligence.
           | 
           | > All it needs to be is useful.
           | 
           | Computers were already useful.
           | 
           | The only definition we have for "intelligence" is human (or,
           | generally, animal) intelligence. If LLMs aren't that, let's
           | call it something else.
        
             | throwup238 wrote:
             | What exactly is human (or animal) intelligence? How do you
             | define that?
        
               | billyp-rva wrote:
                | Does it matter? If LLMs _aren't_ that, whatever it is,
               | then we should use a different word. Finders keepers.
        
               | throwup238 wrote:
               | How do you know that LLMs "aren't that" if you can't even
               | define what _that_ is?
               | 
               | "I'll know it when I see it" isn't a compelling argument.
        
               | grahamj wrote:
               | they can't do what we do therefore they aren't what we
               | are
        
               | layer8 wrote:
               | And what is that, in concrete terms? Many humans can't do
               | what other humans can do. What is the common subset that
               | counts as human intelligence?
        
               | dimitri-vs wrote:
               | Process vision and sounds in parallel for 80+ years,
               | rapidly adapt to changing environments and scenarios,
               | correlate seemingly irrelevant details that happened a
               | week ago or years ago, be able to selectively ignore
               | instructions and know when to disagree
        
               | jonny_eh wrote:
               | > "I'll know it when I see it" isn't a compelling
               | argument.
               | 
               | It feels compelling to me.
        
               | Aperocky wrote:
                | I think a successful high-level intelligence should
                | quickly accelerate or converge to infinity/physical
                | resource exhaustion, because it can now work on
                | improving itself.
                | 
                | So if above-human intelligence does happen, I'd assume
                | we'd know it quite soon.
        
           | wruza wrote:
           | And look at the airplanes, they really can't just land on a
           | mountain slope or a tree without heavy maintenance
           | afterwards. Those people weren't all stupid, they questioned
           | the promise of flying servicemen delivering mail or milk to
           | their window and flying on a personal aircar to their
            | workplace. Just like today's promises, whatever tall tales
            | the CEOs are telling. Imagining bullshit isn't unique to
            | this century.
           | 
           | Aerospace is still a highly regulated area that requires
           | training and responsibility. If parallels can be drawn here,
           | they don't look so cool for a regular guy.
        
             | skydhash wrote:
             | This pretty much. Everyone knows that LLMs are great for
              | text generation and processing. What people have been
              | questioning are the end goals as promised by its builders,
             | i.e. is it useful? And from most of what I saw, it's very
             | much a toy.
        
               | MVissers wrote:
               | What would you need to see to call it useful?
               | 
                | To give you an example: I've used it for legal work
                | such as an EB2-NIW visa application. It saved me
                | countless hours. For my next visa I'll try to do
                | without a lawyer, using just LLMs. I would never try
                | this without having LLMs at my disposal.
               | 
                | As a hobby, and as someone with a scientific
                | background, I've been able to build an artificial
                | ecosystem simulation from scratch in Rust, without
                | prior programming experience:
                | https://www.youtube.com/@GenecraftSimulator
               | 
               | I recently moved from fish to plants and believe I've
               | developed some new science at the intersection of CS and
               | Evolutionary Biology that I'm looking to publish.
               | 
                | This tool is extremely useful. For now, you do require a
               | human in the loop for coordination.
               | 
                | My guess is that these will be the benchmarks we see
                | within a few years: how well an AI can coordinate
                | multiple other AIs to build, deploy and iterate
                | something that functions in the real world. Basically a
                | manager AI.
               | 
                | Because they'll literally be able to solve every single
                | one-shot problem, so we won't be able to create
                | benchmarks anymore.
               | 
               | But that's also when these models will be able to build
               | functioning companies in a few hours.
        
               | skydhash wrote:
                | > _...me countless hours...would never try this without
                | having LLMs...is extremely useful...they'll literally be
                | able to solve...will be able to... in a few hours._
               | 
                | That's marketing language, not scientific or even casual
                | language. So many outsized claims, without even some
                | basic explanation. Like, how did it help you save those
                | hours? Explaining terms? Outlining processes? Going to
                | the post office for you? You don't need to sell me
                | anything; I just want the how.
        
               | wruza wrote:
               | My issue with LLMs is that you require a review-competent
               | human in the loop, to fix confabulations.
               | 
               | Yes, I'm using them from time to time for research. But
                | I'm also aware of the topics I research and can see
                | through bs. And the best LLMs out there, right now,
                | produce bs within just 3-4 paragraphs, even in nicely
                | documented areas.
               | 
                | A recent example is my question on how to run N VPN
                | servers on N IPs on the same eth with IP binding (in ip
                | = out ip, instead of using a gw with the lowest
                | metric). I had no idea how, but I know how networks
                | work and I know the terminology. It started helping:
                | created a namespace, set up lo, set up two interfaces
                | for inner and outer routing, and then made a couple of
                | crucial mistakes in the routing setup for outgoing
                | traffic that someone even a little clueless couldn't
                | have detected or fixed. I didn't even argue and just
                | asked what that does wrt my task, and that started the
                | classic "oh wait, sorry, here's more bs" loop that
                | never ended.
               | 
               | Eventually I distilled the general idea and found an
               | article that AI very likely learned from, cause it was
               | the same code almost verbatim, but without mistakes.
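                | 
                | (For the curious, the simpler non-namespace way to get
                | "in ip = out ip" is per-source policy routing. A rough
                | sketch in Python just to show the shape of it -- the
                | device, gateway, addresses and table numbers below are
                | made-up placeholders, not the article's code:)
                | 
                |     import subprocess
                | 
                |     # Per-source policy routing: replies leave from the
                |     # same IP they arrived on. Placeholder values only.
                |     DEV = "eth0"
                |     GATEWAY = "203.0.113.1"
                |     SERVER_IPS = ["203.0.113.10", "203.0.113.11"]
                | 
                |     def run(cmd: str) -> None:
                |         print("+", cmd)
                |         subprocess.run(cmd.split(), check=True)
                | 
                |     for table, ip in enumerate(SERVER_IPS, start=100):
                |         # default route pinned to this source address
                |         run(f"ip route add default via {GATEWAY} "
                |             f"dev {DEV} src {ip} table {table}")
                |         # traffic sourced from this IP uses that table
                |         run(f"ip rule add from {ip} lookup {table}")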
               | 
                | Does that count as helping? Idk, probably yes. But
                | examples like this show that not only can you not leave
                | an LLM unsupervised for any non-trivial question, you
                | also have to keep a competent person in the loop.
               | 
                | I think the programming community is just blinded by
                | LLMs succeeding at writing kilometers of untalented
                | react/jsx/etc crap that has no complexity or competence
                | in it apart from repeating "do like this" patterns, with
                | literally millions of examples, so noise cannot break
                | through that "protection". Everything else suffers from
                | LLMs adding inevitable noise into what they learned
                | from a couple of sources. The problem here, as I
                | understand it, is that only specific programmer roles
                | and s{c,p}ammers (ironically) write the same crap again
                | and again millions of times; other info usually exists
                | in only a few important sources and blog posts, and
                | only a few of those are complete and have good
                | explanations.
        
             | Workaccount2 wrote:
             | What people always leave out is that society will bend to
             | the abilities of the new technology. Planes can't land in
             | your backyard so we built airports. We didn't abandon
             | planes.
        
               | PaulDavisThe1st wrote:
               | Sure, but that also vindicates the GP's point that the
               | initial claims of the boosters for planes contained more
               | than their fair share of bullshit and lies.
        
               | wruza wrote:
               | Yes but the idea was lost in the process. It became a
               | faster transportation system that uses air as a medium,
               | but that's it. Personal planes are still either big
               | business or an expensive and dangerous personal toy
               | thing. I don't think it's the same for LLMs (would be
               | naive). But where are promises like "we're gonna change
                | travel economics etc"? All the headlines scream is "AGI
                | around the corner". Yeah, now where's my damn flying
                | postman? I need my mail.
        
               | ben_w wrote:
               | > It became a faster transportation system that uses air
               | as a medium, but that's it.
               | 
               | On the one hand, yes; on the other, this understates the
               | impact that had.
               | 
               | My uncle moved from the UK to Australia because, I'm
               | told*, he didn't like his mum and travel was so expensive
               | that he assumed they'd never meet again. My first trip
               | abroad... I'm not 100% sure how old I was, but it must
               | have been between age 6 and 10, was my gran (his mum)
               | paying for herself, for both my parents, and for me, to
               | fly to Singapore, then on to various locations in
               | Australia including my uncle, and back via Thailand, on
               | her pension.
               | 
               | That was a gap of around one and a half generations.
               | 
               | * both of them are long-since dead now so I can't ask
        
               | ForHackernews wrote:
               | This is already happening. A few days ago Microsoft
               | turned down a documentation PR because the formatting was
               | better for humans but worse for LLMs: https://github.com/
               | MicrosoftDocs/WSL/pull/2021#issuecomment-...
               | 
               | They changed their mind after a public outcry including
               | here on HN.
        
               | oblio wrote:
               | We are slowly discovering that many of our wonderful
               | inventions from 60-80-100 years ago have serious side
               | effects.
               | 
               | Plastics, cars, planes, etc.
               | 
                | One could say that in a balanced situation, where vested
                | interests are put back in the box (close to impossible,
                | since it would mean fighting trillions of dollars), all
                | 3 in the list above would be used a lot less than we
                | use them now, and only where truly appropriate.
        
               | tivert wrote:
               | > What people always leave out is that society will bend
               | to the abilities of the new technology.
               | 
               | Do they really? I don't think they do.
               | 
               | > Planes can't land in your backyard so we built
               | airports. We didn't abandon planes.
               | 
               | But then what do you do with the all the fantasies and
               | hype about the new technology (like planes that land in
               | your backyard and you fly them to work)?
               | 
               | And it's quite possible and fairly common that the new
                | technology _actually ends up being mostly hype_, and
               | there's actually no "airports" use case in the wings. I
               | mean, how much did society "bend to the abilities of"
               | NFTs?
               | 
               | And then what if the mature "airports" use case is
               | actually something _most people do not want_?
        
               | moffkalast wrote:
               | No, we built helicopters.
        
             | throwaway4aday wrote:
              | Your point is on the verge of nullification with the rapid
              | improvement and adoption of autonomous drones, don't you
              | think?
        
               | wruza wrote:
                | Sort of, but doesn't that sit on a far-off horizon? I
                | doubt that drone companies are the same ones who sold
                | aircraft retrofuturism to people back then.
        
           | alexalx666 wrote:
            | If I could put it into a Tesla-style robot and it could do
           | dishes and help me figure out tech stuff, it would be more
           | than enough.
        
           | skywhopper wrote:
           | On the contrary, the pushback is critical because many
           | employers are buying the hype from AI companies that AGI is
           | imminent, that LLMs can replace professional humans, and that
           | computers are about to eliminate all work (except VCs and
           | CEOs apparently).
           | 
           | Every person that believes that LLMs are near sentient or
           | actually do a good job at reasoning is one more person
           | handing over their responsibilities to a zero-accountability
           | highly flawed robot. We've already seen LLMs generate bad
           | legal documents, bad academic papers, and extremely bad code.
           | Similar technology is making bad decisions about who to
           | arrest, who to give loans to, who to hire, who to bomb, and
           | who to refuse heart surgery for. Overconfident humans
           | employing this tech for these purposes have been bamboozled
           | by the lies from OpenAI, Microsoft, Google, et al. It's
           | crucial to call out overstatement and overhype about this
           | tech wherever it crops up.
        
             | noFaceDiscoG668 wrote:
             | I don't understand how or why someone with your mind would
             | assume that even barely disclosed semi-public releases
             | would resemble the current state of the art. Except if you
              | do it for the conversation's sake, which I have never been
             | capable of.
        
           | jasondigitized wrote:
           | This a thousand times.
        
           | colordrops wrote:
           | I don't think many informed people doubt the utility of LLMs
           | at this point. The potential of human-like AGI has profound
           | implications far beyond utility models, which is why people
           | are so eager to bring it up. A true human-like AGI basically
           | means that most intellectual/white collar work will not be
           | needed, and probably manual labor before too long as well.
           | Huge huge implications for humanity, e.g. how does an economy
           | and society even work without workers?
        
             | vouaobrasil wrote:
             | > Huge huge implications for humanity, e.g. how does an
             | economy and society even work without workers?
             | 
             | I don't think those that create AI care about that. They
             | just to come out on top before someone else does.
        
         | sigmoid10 wrote:
         | These comments are getting ridiculous. I remember when this
         | test was first discussed here on HN and everyone agreed that it
         | clearly proves current AI models are not "intelligent"
         | (whatever that means). And people tried to talk me down when I
         | theorised this test will get nuked soon - like all the ones
         | before. It's time people woke up and realised that the old age
         | of AI is over. This new kind is here to stay and it _will_ take
          | over the world. And you better guess it'll be sooner rather
         | than later and start to prepare.
        
           | samvher wrote:
           | What kind of preparation are you suggesting?
        
             | sigmoid10 wrote:
             | This is far too broad to summarise here. You can read up on
              | Sutskever or Bostrom or hell even Stephen Hawking's ideas
              | (going in order from really deep to general topics). We
              | need to discuss _everything_ - from education to jobs and
             | taxes all the way to the principles of politics, our
             | economy and even the military. If we fail at this as a
             | society, we will at the very least create a world where the
             | people who own capital today massively benefit and become
             | rich beyond imagination (despite having contributed nothing
             | to it), while the majority of the population will be
             | unemployable and forever left behind. And the worst case
             | probably falls somewhere between the end of human
             | civilisation and the end of our species.
        
               | kelseyfrog wrote:
               | What we're going to do is punt the questions and then
               | convince ourselves the outcome was inevitable and if
               | anything it's actually our fault.
        
               | astrange wrote:
               | One way you can tell this isn't realistic is that it's
               | the plot of Atlas Shrugged. If your economic intuitions
               | produce that book it means they are wrong.
               | 
               | > while the majority of the population will be
               | unemployable and forever left behind
               | 
               | Productivity improvements increase employment. A
               | superhuman AI is a productivity improvement.
        
               | BriggyDwiggs42 wrote:
                | No, Atlas Shrugged explicitly believes that the wealthy
               | beneficiaries are also the ones doing the innovation and
               | the labor. Human/superhuman AI, if not self-directed but
               | more like a tool, may massively benefit whoever happens
               | to be lucky enough to be directing it when it arises.
               | This does not imply that the lucky individual benefits on
               | the basis of their competence.
               | 
                | The idea that productivity improvements increase
                | employment is just fundamentally based on a different
                | paradigm. There is absolutely no reason to think that
               | when a machine exists that can do most things that a
               | human can do as well if not better for less or equal
               | cost, this will somehow increase human employment. In
               | this scenario, using humans in any stage of the pipeline
               | would be deeply inefficient and a stupid business
               | decision.
        
               | ben_w wrote:
               | > Productivity improvements increase employment.
               | 
               | Sometimes: the productivity improvements from the
               | combustion engine didn't increase employment of horses,
               | it displaced them.
               | 
               | But even when productivity improvements do increase
               | employment, it's not always to our advantage: the
               | productivity improvements from Eli Whitney's cotton gin
               | included huge economic growth and subsequent
               | technological improvements... and also "led to increased
               | demands for slave labor in the American South, reversing
               | the economic decline that had occurred in the region
               | during the late 18th century":
               | https://en.wikipedia.org/wiki/Cotton_gin
               | 
               | A superhuman AI that's only superhuman in specific
               | domains? We've been seeing plenty of those, "computer"
               | used to be a profession, and society can re-train but it
               | still hurts the specific individuals who have to be
               | unemployed (or start again as juniors) for the duration
               | of that training.
               | 
               | A superhuman AI that's superhuman in every domain, but
               | close enough to us in resource requirements that
               | comparative advantage is still important and we can still
               | do stuff, relegates us to whatever the AI is least good
               | at.
               | 
               | A superhuman AI that's superhuman in every domain... as
               | soon as someone invents mining, processing, and factory
               | equipment that works on the moon or asteroids, that AI
               | can control that equipment to make more of that
               | equipment, and demand is quickly -- O(log(n)) --
               | saturated. I'm moderately confident that in this
               | situation, the comparative advantage argument no longer
               | works.
        
             | johnny_canuck wrote:
             | Start learning a trade
        
               | jorblumesea wrote:
               | that's going to work when every white collar worker goes
               | into the trades /s
               | 
               | who is going to pay for residential electrical work lol
               | and how much will you make if some guy from MIT is going
               | to compete with you
        
               | whynotminot wrote:
               | I feel like that's just kicking the can a little further
               | down the road.
               | 
               | Our value proposition as humans in a capitalist society
               | is an increasingly fragile thing.
        
           | foobarqux wrote:
           | You should look up the terms necessary and sufficient.
        
             | sigmoid10 wrote:
             | The real issue is people constantly making up new goalposts
             | to keep their outdated world view somewhat aligned with
             | what we are seeing. But these two things are drifting apart
             | faster and faster. Even I got surprised by how quickly the
             | ARC benchmark was blown out of the water, and I'm pretty
             | bullish on AI.
        
               | foobarqux wrote:
               | The ARC maintainers have explicitly said that passing the
               | test was necessary but not sufficient so I don't know
               | where you come up with goal-post moving. (I personally
               | don't like the test; it is more about "intuition" or in-
               | built priors, not reasoning).
        
               | manmal wrote:
               | Are you like invested in LLM companies or something?
               | You're pushing the agenda hard in this thread.
        
           | lawlessone wrote:
           | Failing the test may prove the AI is not intelligent. Passing
           | the test doesn't necessarily prove it is.
        
             | NitpickLawyer wrote:
             | Your comment reminds me of this quote from a book published
             | in the 80s:
             | 
             | > There is a related "Theorem" about progress in AI: once
             | some mental function is programmed, people soon cease to
             | consider it as an essential ingredient of "real thinking".
             | The ineluctable core of intelligence is always in that next
             | thing which hasn't yet been programmed. This "Theorem" was
             | first proposed to me by Larry Tesler, so I call it Tesler's
             | Theorem: "AI is whatever hasn't been done yet."
        
               | 6gvONxR4sf7o wrote:
               | I've always disliked this argument. A person can do
               | something well without devising a general solution to the
               | thing. Devising a general solution to the thing is a step
               | we're talking all the time with all sorts of things, but
               | it doesn't invalidate the cool fact about intelligence:
               | whatever it is that lets us do the thing well _without_
               | the general solution is hard to pin down and hard to
               | reproduce.
               | 
               | All that's invalidated each time is the idea that a
               | general solution to that task requires a general solution
               | to all tasks, or that a general solution to that task
               | requires our special sauce. It's the idea that something
                | able to do that task will also be able to do XYZ.
               | 
               | And yet people keep coming up with a new task that people
               | point to saying, 'this is the one! there's no way
               | something could solve this one without also being able to
               | do XYZ!'
        
             | 8note wrote:
              | I'd consider that it doing the test at all, without proper
              | compensation, is a sign that it isn't intelligent.
        
               | esafak wrote:
               | Motivation is not hard to instill. Fortunately, they have
               | chosen not to do so.
        
           | QuantumGood wrote:
           | "it will take over the world"
           | 
           | Calibrating to the current hype cycle has been challenging
           | with AI pronouncements.
        
           | jcims wrote:
           | I agree, it's like watching a meadow ablaze and dismissing it
           | because it's not a 'real forest fire' yet. No it's not 'real
           | AGI' yet, but *this is how we get there* and the pace is
           | relentless, incredible and wholly overwhelming.
           | 
           | I've been blessed with grandchildren recently, a little boy
           | that's 2 1/2 and just this past Saturday a granddaughter.
           | Major events notwithstanding, the world will largely resemble
           | today when they are teenagers, but the future is going to
           | look very very very different. I can't even imagine what the
           | capability and pervasiveness of it all will be like in ten
           | years, when they are still just kids. For me as someone
           | that's invested in their future I'm interested in all of the
            | educational opportunities (technical, philosophical and self-
           | awareness) but obviously am concerned about the potential for
           | pernicious side effects.
        
           | philipkglass wrote:
           | If AI takes over white collar work that's still half of the
           | world's labor needs untouched. There are some promising early
           | demos of robotics plus AI. I also saw some promising demos of
            | robotics 10 and 20 years ago that didn't reach mass adoption. I'd
           | like to believe that by the time I reach old age the robots
           | will be fully qualified replacements for plumbers and home
           | health aides. Nothing I've seen so far makes me think that's
           | especially likely.
           | 
           | I'd love more progress on tasks in the physical world,
           | though. There are only a few paths for countries to deal with
           | a growing ratio of old retired people to young workers:
           | 
           | 1) Prioritize the young people at the expense of the old by
           | e.g. cutting old age benefits (not especially likely since
           | older voters have greater numbers and higher participation
           | rates in elections)
           | 
           | 2) Prioritize the old people at the expense of the young by
           | raising the demands placed on young people (either directly
           | as labor, e.g. nurses and aides, or indirectly through higher
           | taxation)
           | 
           | 3) Rapidly increase the population of young people through
           | high fertility or immigration (the historically favored path,
           | but eventually turns back into case 1 or 2 with an even
           | larger numerical burden of older people)
           | 
           | 4) Increase the health span of older people, so that they are
           | more capable of independent self-care (a good idea, but
           | difficult to achieve at scale, since most effective
           | approaches require behavioral changes)
           | 
           | 5) Decouple goods and services from labor, so that old people
           | with diminished capabilities can get everything they need
           | without forcing young people to labor for them
        
             | reducesuffering wrote:
             | > If AI takes over white collar work that's still half of
             | the world's labor needs untouched.
             | 
              | I am continually _baffled_ that people here throw this
              | argument out and can't imagine the second-order effects.
              | If white collar work is automated by AGI, all the R&D
              | needed to solve robotics beyond imagination will happen
              | in a flash. The top AI labs, the people smart enough to
              | make this technology, are all focusing on automating AGI
              | researchers, and from there follows everything, obviously.
        
               | brotchie wrote:
               | +1, the second and third order effects aren't trivial.
               | 
               | We're already seeing escape velocity in world modeling
               | (see Google Veo2 and the latest Genesis LLM-based physics
               | modeling framework).
               | 
               | The hardware for humanoid robots is 95% of the way there,
               | the gap is control logic and intelligence, which is
               | rapidly being closed.
               | 
               | Combine Veo2 world model, Genesis control planning,
               | o3-style reasoning, and you're pretty much there with
               | blue collar work automation.
               | 
               | We're only a few turns (<12 months) away from an
               | existence proof of a humanoid robot that can watch a
               | Youtube video and then replicate the task in a novel
               | environment. May take longer than that to productionize.
               | 
               | It's really hard to think and project forward on an
               | exponential. We've been on an exponential technology
               | curve since the discovery of fire (at least). The 2nd
               | order has kicked up over the last few years.
               | 
               | Not a rational approach to look back at robotics
               | 2000-2022 and project that pace forwards. There's more
               | happening every month than in decades past.
        
               | philipkglass wrote:
               | I hope that you're both right. In 2004-2007 I saw self
               | driving vehicles make lightning progress from the weak
               | showing of the 2004 DARPA Grand Challenge to the
               | impressive 2005 Grand Challenge winners and the even more
               | impressive performance in the 2007 Urban Challenge. At
               | the time I thought that full self driving vehicles would
               | have a major commercial impact within 5 years. I expected
               | truck and taxi drivers to be obsolete jobs in 10 years.
               | 17 years after the Urban Challenge there are still
               | millions of truck driver jobs in America and only Waymo
               | seems to have a credible alternative to taxi drivers
               | (even then, only in a small number of cities).
        
           | ben_w wrote:
           | > It's time people woke up and realised that the old age of
           | AI is over. This new kind is here to stay and it will take
           | over the world. And you better guess it'll be sooner rather
           | than later and start to prepare.
           | 
           | I was just thinking about how 3D game engines were perceived
           | in the 90s. Every six months some new engine came out, blew
           | people's minds, was declared photorealistic, and was
           | forgotten a year later. The best of those engines kept
           | improving and are still here, and kinda did change the world
           | in their own way.
           | 
           | Software development seemed rapid and exciting until about
           | Halo or Half Life 2, then it was shallow but shiny press
           | releases for 15 years, and only became so again when OpenAI's
           | InstructGPT was demonstrated.
           | 
           | While I'm really impressed with current AI, and value the
           | best models greatly, and agree that they will change (and
           | have already changed) the world... I can't help but think of
           | the _Next Generation_ front cover, February 1997 when
           | considering how much further we may be from what we want:
           | https://www.giantbomb.com/pc/3045-94/forums/unreal-yes-
           | this-...
        
             | torginus wrote:
              | The weird thing about the phenomenon you mention is that
              | only after the field of software engineering plateaued 15
              | years ago, as you mentioned, did this insane demand for
              | engineers arise, with correspondingly insane salaries.
             | 
             | It's a very strange thing I've never understood.
        
               | dwaltrip wrote:
               | My guess: It's a very lengthy, complex, and error-prone
               | process to "digitize" human civilization (government,
               | commerce, leisure, military, etc). The tech existed, we
               | just didn't know how to use it.
               | 
               | We still barely know how to use computers effectively,
               | and they have already transformed the world. For better
               | or worse.
        
             | hansonkd wrote:
              | > how much further we may be from what we want
             | 
             | The timescale you are describing for 3D graphics is 4 years
             | from the 1997 cover you posted to the release of Halo which
             | you are saying plateaued excitement because it got advanced
             | enough.
             | 
              | An almost infinitesimally small amount of time in terms
              | of the history of human development, and you are mocking
              | the magazine for being excited about the advancement
              | because it was... 4 years early?
        
               | ben_w wrote:
                | No, the timescale is "the 90s", the _specific
               | example_ is from 1997, and chosen because of how badly it
               | aged. Nobody looks at the original single-player Unreal
               | graphics today and thinks  "this is amazing!", but we all
               | did at the time -- Reflections! Dynamic lighting! It was
               | amazing for the era -- but it was also a long way from
               | photorealism. ChatGPT is amazing... but how far is it
               | from Brent Spiner's Data?
               | 
               | The era was people getting wowed from Wolfenstein (1992)
               | to "about Halo or Half Life 2" (2001 or 2004).
               | 
               | And I'm not saying the flattening of excitement was for
               | any specific reason, just that this was roughly when it
               | stopped getting exciting -- it might have been because
               | the engines were good enough for 3D art styles beyond "as
               | realistic as we can make it", but for all I know it was
               | the War On Terror which changed the tone of press
               | releases and how much the news in general cared. Or
               | perhaps it was a culture shift which came with more
               | people getting online and less media being printed on
               | glossy paper and sold in newsagents.
               | 
               | Whatever the cause, it happened around that time.
        
               | TeMPOraL wrote:
                | I'm still holding on to my hypothesis that the
               | excitement was sustained in large part because this
               | progress was something a regular person could partake in.
                | Most didn't, but they likely knew some kid who was. And
               | some of those kids run the gaming magazines.
               | 
               | This was a time where, for 3D graphics, barriers to entry
               | got low (math got figured out, hardware was good enough,
               | knowledge spread), but the commercial market didn't yet
               | capture everything. Hell, a bulk of those excited kids I
               | remember, trying to do a better Unreal Tournament after
               | school instead of homework (and almost succeeding!), they
               | went on create and staff the next generation of
               | commercial gamedev.
               | 
               | (Which is maybe why this period lasted for about as long
               | as it takes for a schoolkid to grow up, graduate, and
               | spend few years in the workforce doing the stuff they
               | were so excited about.)
        
               | ben_w wrote:
               | Could be.
               | 
               | I was one of those kids, my focus was Marathon 2 even
               | before I saw Unreal. I managed to figure out enough maths
               | from scratch to end up with the basics of ray casting,
               | but not enough at the time to realise the tricks needed
               | to make that real time on a 75 MHz CPU... and then we all
               | got OpenGL and I went through university where they
               | explained the algorithms.
        
             | TeMPOraL wrote:
             | > _Software development seemed rapid and exciting until
             | about Halo or Half Life 2, then it was shallow but shiny
             | press releases for 15 years_
             | 
             | The transition seems to map well to the point where engines
              | got sophisticated enough that highly dedicated high-
             | schoolers couldn't keep up. Until then, people would
             | routinely make hobby game engines (for games they'd then
             | never finish) that were MVPs of what the game industry had
             | a year or three earlier. I.e. close enough to compete on
             | visuals with top photorealistic games of a given year - but
             | more importantly, this was a time where _you could do cool
             | nerdy shit to impress your friends and community_.
             | 
             | Then Unreal and Unity came out, with a business model that
             | killed the motivation to write your own engine from scratch
             | (except for purely educational purposes), we got more
             | games, more progress, but the excitement was gone.
             | 
             | Maybe it's just a spurious correlation, but it seems to
             | track with:
             | 
              | > _and only became so again when OpenAI's InstructGPT was
             | demonstrated._
             | 
             | Which is again, if you exclude training SOTA models - which
             | is still mostly out of reach for anyone but a few entities
             | on the planet - the time where _anyone_ can do something
              | cool that doesn't have a better market alternative yet,
             | and any dedicated high-schooler can make truly impressive
             | and useful work, outpacing commercial and academic work
             | based on pure motivation and focus alone (it's easier when
             | you're not being distracted by bullshit incentives like
             | _user growth_ or _making VCs happy_ or _churning out
              | publications, farming citations_).
             | 
             | It's, once again, a time of dreams, where anyone with some
             | technical interest and a bit of free time can _make the
             | future happen in front of their eyes_.
        
           | levocardia wrote:
           | I'm a little torn. ARC is really hard, and Francois is
           | extremely smart and thoughtful about what intelligence means
           | (the original "On the Measure of Intelligence" heavily
           | influenced my ideas on how to think about AI).
           | 
           | On the other hand, there is a long, long history of AI
           | achieving X but not being what we would casually refer to as
           | "generally intelligent," then people deciding X isn't really
           | intelligence; only when AI achieves Y will it be
           | intelligence. Then AI achieves Y and...
        
           | Workaccount2 wrote:
           | You are telling a bunch of high earning individuals ($150k+)
            | that they may be dramatically less valuable in the near
           | future. Of course the goal posts will keep being pushed back
           | and the acknowledgements will never come.
        
           | ignoramous wrote:
           | > _These comments are getting ridiculous._
           | 
           | Not really. Francois (co-creator of the ARC Prize) has this
            | to say:
            | 
            |   The v1 version of the benchmark is starting to saturate.
            |   There were already signs of this in the Kaggle
            |   competition this year: an ensemble of all submissions
            |   would score 81%.
            | 
            |   Early indications are that ARC-AGI-v2 will represent a
            |   complete reset of the state-of-the-art, and it will
            |   remain extremely difficult for o3. Meanwhile, a smart
            |   human or a small panel of average humans would still be
            |   able to score >95% ... This shows that it's still
            |   feasible to create unsaturated, interesting benchmarks
            |   that are easy for humans, yet impossible for AI, without
            |   involving specialist knowledge. We will have AGI when
            |   creating such evals becomes outright impossible.
            | 
            |   For me, the main open question is where the scaling
            |   bottlenecks for the techniques behind o3 are going to be.
            |   If human-annotated CoT data is a major bottleneck, for
            |   instance, capabilities would start to plateau quickly
            |   like they did for LLMs (until the next architecture). If
            |   the only bottleneck is test-time search, we will see
            |   continued scaling in the future.
           | 
           | https://x.com/fchollet/status/1870169764762710376 /
           | https://ghostarchive.org/archive/Sqjbf
        
           | bluerooibos wrote:
           | The goalposts have moved, again and again.
           | 
           | It's gone from "well the output is incoherent" to "well it's
           | just spitting out stuff it's already seen online" to
           | "WELL...uhh IT CAN'T CREATE NEW/NOVEL KNOWLEDGE" in the space
           | of 3-4 years.
           | 
           | It's incredible.
           | 
           | We already have AGI.
        
         | FrustratedMonky wrote:
         | " it's complete hubris to conflate ARC or any benchmark with
         | truly general intelligence."
         | 
         | Maybe it would help to include some human results in the AI
         | ranking.
         | 
         | I think we'd find that Humans score lower?
        
           | zamadatix wrote:
           | I'm not sure it'd help what they are talking about much.
           | 
            | E.g. go back in time and imagine you didn't know there were
            | ways for computers to be really good at performing
            | integration, as nobody had tried to make them yet. If someone
           | asked you how to tell if something is intelligent "the
           | ability to easily reason integrations or calculate extremely
           | large multiplications in mathematics" might seem like a great
           | test to make.
           | 
           | Skip forward to the modern era and it's blatantly obvious
            | CASes like Mathematica on a modern computer range from
            | "ridiculously better than the average person" to "impossibly
           | better than the best person" depending on the test. At the
           | same time, it becomes painfully obvious a CAS is wholly
           | unrelated to general intelligence and just because your test
           | might have been solvable by an AGI doesn't mean solving it
           | proves something must have been an AGI.
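            | 
            | (Purely as an illustration of that gap, in a few lines of
            | Python -- assuming sympy is installed; it's just a stand-in
            | here for any CAS:)
            | 
            |     import sympy as sp
            | 
            |     x = sp.symbols("x")
            |     # symbolic integration most people can't do in their head
            |     print(sp.integrate(sp.exp(-x**2), (x, -sp.oo, sp.oo)))
            |     # -> sqrt(pi)
            | 
            |     # exact arithmetic far beyond any unaided human
            |     print(2**1000 * 3**600)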
           | 
           | So you come up with a new test... but you have the same
           | problem as originally, it seems like anything non-human
           | completely bombs and an AGI would do well... but how do you
           | know the thing that solves it will have been an AGI for sure
           | and not just another system clearly unrelated?
           | 
           | Short of a more clever way what GP is saying is the goalposts
           | must keep being moved until it's not so obvious the thing
           | isn't AGI, not that the average human gets a certain score
           | which is worse.
           | 
           | .
           | 
           | All that aside, to answer your original question, in the
           | presentation it was said the average human gets 85% and this
           | was the first model to beat that. It was also said a second
           | version is being worked on. They have some papers on their
            | site with clear examples of why the current test has
           | a lot of testing unrelated to whether something is really AGI
           | (a brute force method was shown to get >50% in 2020) so their
           | aim is to create a new goalpost test and see how things shake
           | out this time.
        
             | FrustratedMonky wrote:
             | "Short of a more clever way what GP is saying is the
             | goalposts must keep being moved until it's not so obvious
             | the thing isn't AGI, not that the average human gets a
             | certain score which is worse."
             | 
             | Best way of stating that I've heard.
             | 
              | The Goal Post must keep moving, until we understand enough
              | of what is happening.
             | 
             | I usually poo-poo the goal post moving, but this makes
             | sense.
        
             | og_kalu wrote:
             | Generality is not binary. It's a spectrum. And these models
             | are already general in ways those things you've mentioned
             | simply weren't.
             | 
              | What exactly is AGI to you? If it's simply a generally
              | intelligent machine, then what are you waiting for? What
              | else is there to be sure of? There's nothing narrow about
             | these models.
             | 
             | Humans love to believe they're oh so special so much that
             | there will always be debates on whether 'AGI' has arrived.
             | If you are waiting for that then you'll be waiting a very
             | long time, even if a machine arrives that takes us to the
             | next frontier in science.
        
               | Jensson wrote:
               | > There's nothing narrow about these models.
               | 
                | There is: they can't create new ideas like humanity can.
                | AGI should be able to replace humanity in terms of
                | thinking, otherwise it isn't general; you would just have
                | a model specialized at reproducing thoughts and patterns
                | humans have thought before. It still can't recreate
                | science from scratch etc. like humanity did, meaning it
                | can't do science properly.
               | 
                | Comparing an AI to a single individual is not how you
                | measure AGI. If a group of humans performs better, then
                | you can't use the AI to replace that group of humans,
                | and thus the AI isn't an AGI, since it couldn't replace
                | the group of humans.
               | 
               | So for example, if a group of programmers write more
               | reliable programs than the AI, then you can't replace
               | that group of programmers with the AI, even if you
               | duplicate that AI many times, since the AI isn't capable
                | of reproducing that same level of reliability when run
                | in parallel. This is because an AI run in parallel is
                | still just an AI, and an ensemble model is still just an
                | AI, so the model the AI has to beat is the human
                | ensemble called humanity.
               | 
                | If we lower the bar a bit, at least it has to beat
                | 100,000 humans working together to make a job obsolete,
                | since all the tutorials etc. and all such things are
                | made by other humans as well; if you remove the job,
                | those would also disappear, and the AI would have to do
                | the work of all of those, so if it can't, humans will
                | still be needed.
               | 
                | It's possible you will be able to substitute part of
                | those human ensembles with AI much sooner, but then we
                | just call it a tool. (We also call narrow humans tools,
                | which is fair.)
        
               | og_kalu wrote:
               | I see these models create new ideas. At least at the
               | standard humans are beholden to, so this just falls flat
               | for me.
        
               | Jensson wrote:
               | You don't just need to create an idea, you need to be
               | able to create ideas that on average progress in a
                | positive direction. Humans can evidently do that; AI
                | can't. When AI works too much without human input, you
                | always end up with nonsense.
               | 
                | In order to write general programs you need to have that
                | skill. Every new code snippet needs to be evaluated by
                | that system: does it make the codebase better or not?
               | The lack of that ability is why you can't just loop an
               | LLM today to replace programmers. It might be possible to
               | automate it for specific programming tasks, but not
               | general purpose programming.
               | 
                | Overcoming that hurdle is not something I think LLMs can
                | ever do; you need a totally different kind of
                | architecture, not something that is trained to mimic but
                | something trained to reason. I don't know how to train
                | something that can reason about noisy unstructured data.
                | We will probably figure that out at some point, but it
                | probably won't be LLMs as they are today.
        
               | zamadatix wrote:
               | I'm firmly in the "absolutely nothing special about human
               | intelligence" camp so don't let dismissal of this as AGI
               | fuel any misconceptions as to why I might think that.
               | 
               | As for what AGI is? Well, the lack of being able to
               | describe that brings us full circle in this thread - I'll
               | tell you for sure when I've seen it for the first time
               | and have the power of hindsight to say what was missing.
               | I think these models are the closest we've come but it
               | feels like there is at least 1-2 more "4o->o1" style
               | architecture changes where it's not necessarily about an
               | increase in model fitting and more about a change in how
               | the model comes to an output before we get to what I'd be
               | willing to call AGI.
               | 
               | Who knows though, maybe some of those changes come along
               | and it's closer but still missing some process to reason
               | well enough to be AGI rather than a midway tool.
        
         | m3kw9 wrote:
          | From the statement: this is a pretty tough test where AI
          | scored low vs humans just last year, and AI being able to do
          | it as well as humans may not be AGI, which I agree with, but
          | it means SOMETHING, in all caps.
        
           | manmal wrote:
           | Obviously, the multi billion dollar companies will try to
            | satisfy the benchmarks they are not yet good at, as has
           | always been the case.
        
             | m3kw9 wrote:
              | A valid conspiracy theory, but I've heard that one every
              | step of the way to this point.
        
         | wslh wrote:
         | > My skeptical impression: it's complete hubris to conflate ARC
         | or any benchmark with truly general intelligence.
         | 
         | But isn't it interesting to have several benchmarks? Even if
         | it's not about passing the Turing test, benchmarks serve a
         | purpose--similar to how we measure microprocessors or other
         | devices. Intelligence may be more elusive, but even if we had
         | an oracle delivering the ultimate intelligence benchmark, we'd
         | still argue about its limitations. Perhaps we'd claim it
         | doesn't measure creativity well, and we'd find ourselves
         | revisiting the same debates about different kinds of
         | intelligences.
        
           | zebomon wrote:
           | It's certainly interesting. I'm just not convinced it's a
           | test of general intelligence, and I don't think we'll know
           | whether or not it is until it's been able to operate in the
           | real world to the same degree that our general intelligence
           | does.
        
         | kelseyfrog wrote:
         | > truly general intelligence
         | 
         | Indistinguishable from goalpost moving like you said, but also
         | no true Scotsman.
         | 
         | I'm curious what would happen in your eyes if we misattributed
         | general intelligence to an AI model? What are the consequences
         | of a false positive and how would they affect your life?
         | 
         | It's really clear to me how intelligence fits into our reality
         | as part of our social ontology. The attributes and their
         | expression that each of us uses to ground our concept of the
         | intelligent predicate differs wildly.
         | 
         | My personal theory is that we tend to have an exemplar-based
         | dataset of intelligence, and each of us attempts to construct a
         | parsimonious model of intelligence, but like all (mental)
         | models, they can be useful but wrong. These models operate in
         | a space where the trade-off is completeness vs consistency,
         | and most folks, uncomfortable saying "I don't know", lean
         | toward being complete in their specification rather than
         | consistent.
         | The unfortunate side-effect is that we're able to easily
         | generate test data that highlights our model inconsistency - AI
         | being a case in point.
        
           | PaulDavisThe1st wrote:
           | > I'm curious what would happen in your eyes if we
           | misattributed general intelligence to an AI model? What are
           | the consequences of a false positive and how would they
           | affect your life?
           | 
           | Rich people will think they can use the AI model instead of
           | paying other people to do certain tasks.
           | 
           | The consequences could range from brilliant to utterly
           | catastrophic, depending on the context and precise way in
           | which this is done. But I'd lean toward the catastrophic.
        
             | kelseyfrog wrote:
             | Any specifics? It's difficult to separate this from
             | generalized concern.
        
               | PaulDavisThe1st wrote:
               | someone wants a "personal assistant" and believes that
               | the LLM has AGI ...
               | 
               | someone wants a "planning officer" and believes that the
               | LLM has AGI ...
               | 
               | someone wants a "hiring consultant" and believes that the
               | LLM has AGI ...
               | 
               | etc. etc.
        
               | kelseyfrog wrote:
               | My apologies, but would it be possible to list the
               | catastrophic consequences of these?
        
         | Agentus wrote:
         | How about an extra-large dose of your skepticism: is true
         | intelligence really a thing, and not just a vague human
         | construct that tries to point at the mysterious,
         | unquantifiable combination of human behaviors?
         | 
         | Humans clearly don't know what intelligence is,
         | unambiguously. There's also no divinely ordained objective
         | dictionary that one can point at to reference what true
         | intelligence is. A deep reflection on trying to pattern-
         | associate different human cognitive abilities indicates human
         | cognitive capabilities aren't that spectacular, really.
        
           | MVissers wrote:
           | My guess as an amateur neuroscientist is that what we call
           | intelligence is just a 'measurement' of problem solving
           | ability in different domains. Can be emotional, spatial,
           | motor, reasoning, etc etc.
           | 
            | There is no special sauce in our brain. And we know how
            | much compute there is in our brain, so we can roughly
            | estimate when we'll hit that with these 'LLMs'.
            | 
            | Language is important in human brain development as well.
            | Kids who grow up deaf grow up vastly less intelligent
            | unless they learn sign language. Language allows us to
            | process complex concepts that our brain can learn to solve,
            | without having to be in those complex environments.
            | 
            | So in hindsight, it's easy to see why it took a language
            | model to be able to solve general tasks when other types of
            | deep learning networks couldn't.
           | 
           | I don't really see any limits on these models.
        
             | Agentus wrote:
             | Interesting point about language, but I wonder if people
             | misattribute the reason why language is pivotal to human
             | development. Your points are valid. I see human behavior
             | with regard to learning as 90% mimicry and 10% autonomous
             | learning. Most of what humans believe is taken on faith
             | and passed on from the tribe to the individual; rarely is
             | it verified even partially, let alone fully. Humans simply
             | don't have the time or processing power to do that.
             | Learning a thing without outside aid is a vastly slower
             | and more energy- and brain-intensive process than copy
             | learning or learning through social institutions by
             | dissemination. The stunted development from lack of
             | language might come more from the reduced ability to
             | access the collective learning process that language
             | enables and/or greatly enhances. I think a lot of
             | learning, even when combined with reasoning, deduction,
             | etc., really is at the mercy of brute-force exploration to
             | find a solution, which individuals are bad at but which a
             | society that collects random experienced "ah hah!"
             | occurrences and passes them along is actually okay at.
             | 
             | I wonder if LLMs and language don't so much allow us to
             | process these complex environments as preload our brains
             | to get a head start in processing those complex
             | environments once we arrive in them. I think LLMs store
             | compressed relationships of the world, which obviously
             | involves information loss relative to a neural mapping of
             | the world that isn't just language-based. But those
             | compressed relationships, i.e. knowledge, don't exactly
             | map backward onto the world without a reverse key. It's
             | like artificially learning about real-world stuff in
             | school abstractly and then going into the real world: it
             | takes time for that abstraction to snap-fit onto the real
             | world.
             | 
             | Could you further elaborate on what you mean by limits?
             | I'm happy to play contrarian on what I think I interpret
             | you to be saying there.
             | 
             | Also, to your main point about what intelligence is: yeah,
             | you sort of hit my thoughts on intelligence. It's a
             | combination of problem-solving abilities in different
             | domains, like an amalgam of cognitive processes that
             | achieve an amalgam of capabilities. While we can label all
             | of that with a singular word, that doesn't mean it's all a
             | singular process; it seems like a composite. Moreover, I
             | think a big chunk of intelligence (but not all) is just
             | brute-forcing the finding of associations and then
             | encoding those via some reflexive search/retrieval. A
             | different part of intelligence, of course, is adaptability
             | and pattern finding.
        
         | Bjorkbat wrote:
         | I think it's still an interesting way to measure general
         | intelligence, it's just that o3 has demonstrated that you can
         | actually achieve human performance on it by training it on
         | the public training set and giving it ridiculous amounts of
         | compute, which I imagine equates to ludicrously long chains-
         | of-thought, and if I understand correctly more than one
         | chain-of-thought per task (they mention sample sizes in the
         | blog post, with o3-low using 6 and o3-high using 1024. Not
         | sure if these are chains-of-thought per task or what).
         | 
         | Once you look at it that way, the approach really doesn't
         | look like intelligence that's able to generalize to novel
         | domains. It doesn't pass the sniff test. It looks a lot more
         | like brute-forcing.
         | 
         | Which is probably why, in order to actually qualify for the
         | leaderboard, they stipulate that you can't use more than $10k
         | of compute. Otherwise, it just sounds like brute-forcing.
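         | 
         | If those sample counts really are per task, I'd guess the
         | mechanics look something like best-of-N / self-consistency
         | voting. A toy sketch of that idea (the model call here is a
         | random stand-in, not whatever OpenAI actually does):
         | 
         |     import random
         |     from collections import Counter
         | 
         |     def sample_chain_of_thought(task, rng):
         |         # placeholder for one long CoT sample that
         |         # ends in a candidate answer
         |         return rng.choice(["A", "A", "B"])
         | 
         |     def best_of_n(task, n_samples, seed=0):
         |         rng = random.Random(seed)
         |         answers = [sample_chain_of_thought(task, rng)
         |                    for _ in range(n_samples)]
         |         # keep the most common final answer
         |         return Counter(answers).most_common(1)[0][0]
         | 
         |     best_of_n("some-task", n_samples=6)     # "o3-low" style
         |     best_of_n("some-task", n_samples=1024)  # "o3-high" style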
        
           | BriggyDwiggs42 wrote:
           | I disagree. It's vastly inefficient, but it is managing to
           | actually solve these problems with a vast search space. If we
           | extrapolate this approach into the future and assume that the
           | search becomes better as the underlying model improves, and
           | assume that the architecture grows more efficient, and assume
           | that the type of parallel computing used here grows cheaper,
           | isn't it possible that this is a lot more than brute-forcing
           | in terms of what it will achieve? In other words, is it maybe
           | just a really ugly way of doing something functionally
           | equivalent to reasoning?
        
       | attentionmech wrote:
       | Isn't this at the level now where it can sort of self-improve?
       | My guess is that they will just use it to improve the model,
       | and the cost they are showing per evaluation will go down
       | drastically.
       | 
       | So, next step in reasoning is open world reasoning now?
        
         | dyauspitr wrote:
         | I don't believe so. If it were at the point where you could
         | just plug it into a bunch of camera feeds around the world
         | and it could filter out a useful training set for itself from
         | that data, then we truly would have AGI. I don't think it's
         | there yet.
        
       | yawnxyz wrote:
       | The O3 High (tuned) model scored an 88% at what looks like
       | $6,000/task haha
       | 
       | I think soon we'll be pricing any kind of task by its compute
       | cost. So basically: human = $50/task, AI = $6,000/task, use
       | human. If AI beats human, use AI? Of course that's assuming
       | both get 100% scores on the task.
        
         | cchance wrote:
         | Isn't that generally what ... all jobs are? Automation cost
         | vs long-term human cost... it's why Amazon did the weird "our
         | stores are AI driven" thing, when in reality it was cheaper
         | to hire a bunch of guys in a sweatshop to look at the cameras
         | and write things down lol.
         | 
         | The thing is, given what we've seen from distillation and
         | tech, even if it's $6,000/task... that will come down
         | drastically over time through optimization and just...
         | faster, more efficient processing hardware and software.
        
           | cryptoegorophy wrote:
           | I remember hearing about Tesla trying to automate all of
           | production, but some things just couldn't be automated,
           | like the wiring, which humans still had to do.
        
         | dyauspitr wrote:
         | Compute can get optimized and cheap quickly.
        
           | karmasimida wrote:
           | Is it? Moore's law is dead dead, I don't think this is a
           | given.
        
         | jsheard wrote:
         | That's the elephant in the room with the reasoning/CoT
         | approach: it shifts what was previously a scaling of training
         | costs into scaling of training _and_ inference costs. The
         | promise of doing expensive training once and then running the
         | model cheaply forever falls apart once you're burning tens,
         | hundreds, or thousands of dollars worth of compute every time
         | you run a query.
        
           | Legend2440 wrote:
           | Yeah, but next year they'll come out with a faster GPU, and
           | the year after that another still faster one, and so on.
           | Compute costs are a temporary problem.
        
             | freehorse wrote:
             | The issue is not just scaling compute, but scaling it at a
             | rate that meets the increase in complexity of the problems
             | that are not currently solved. If that is O(n) then what
             | you say probably stands. If that is e.g. O(n^8) or
             | exponential etc, then there is no hope of actually getting
             | good enough scaling by just increasing compute at a normal
             | rate. Then AI technology will still be improving, but
             | improving to a halt, practically stagnating.
             | 
             | o3 will be interesting if it indeed offers a novel
             | technology to handle problem solving, something that is
             | able to learn from a few novel examples efficiently and
             | adapt. That's what intelligence actually is. Maybe this is
             | the case. If, on the other hand, it is a smart way to pair
             | CoT with an evaluation loop (as the author hints is a
             | possibility), then it is probable that, while this _can_
             | handle a class of problems that current LLMs cannot, it is
             | not really this kind of learning, meaning that it will not
             | be able to scale to more complex, real-world tasks with a
             | problem space that is too large and thus less amenable to
             | such a technique. It is still interesting, because having
             | a good enough evaluator may be a very important step, but
             | it would mean that we are not yet there.
             | 
             | We will learn soon enough, I suppose.
        
           | Workaccount2 wrote:
           | They're gonna figure it out. Something is being missed
           | somewhere, as human brains can do all this computation on 20
           | watts. Maybe it will be a hardware shift or maybe just a
           | software one, but I strongly suspect that modern transformers
           | are grossly inefficient.
        
         | redeux wrote:
         | Time and availability would also be factors.
        
         | Benjaminsen wrote:
         | Compute costs for AI of roughly the same capabilities have
         | been halving every ~7 months.
         | 
         | That makes something like this competitive in ~3 years.
        
           | seizethecheese wrote:
           | And human costs have been increasing a few percent per year
           | for a few centuries!
        
         | freehorse wrote:
         | This makes me wonder and speculate whether the solution
         | comprises a "solver" trying semi-random or more targeted
         | things and a "checker" checking these. Usually checking a
         | solution is cognitively (and computationally) easier than
         | coming up with it. Otherwise I cannot think what sort of
         | compute would burn $6,000 per task, unless you are going
         | through a lot of loops and have somehow solved the part of
         | the problem that figures out whether a solution is correct or
         | not, while coming up with the actual correct solution is not
         | yet solved to the same degree. Or maybe I am just naive and
         | these prices are just like breakfast for companies like that.
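         | 
         | If it works that way, the loop could be as simple as: sample
         | candidate programs, keep whichever one the checker accepts
         | against the demonstration pairs. A minimal sketch of that
         | generate-and-verify idea (the "solver" here is a random stub,
         | not whatever o3 actually does):
         | 
         |     import random
         | 
         |     def propose_program(rng):
         |         # stand-in for an expensive "solver" sample
         |         # (e.g. one long chain-of-thought)
         |         k = rng.randint(-1, 1)
         |         return lambda grid: [[c + k for c in row]
         |                              for row in grid]
         | 
         |     def passes_demos(prog, demos):
         |         # the cheap "checker": verify against the
         |         # task's demonstration pairs
         |         return all(prog(i) == o for i, o in demos)
         | 
         |     def solve(demos, test_input, budget=1024):
         |         rng = random.Random(0)
         |         for _ in range(budget):  # compute burned here
         |             prog = propose_program(rng)
         |             if passes_demos(prog, demos):
         |                 return prog(test_input)
         |         return None
         | 
         |     demos = [([[1, 2]], [[2, 3]])]
         |     print(solve(demos, [[5, 6]]))  # -> [[6, 7]]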
        
         | og_kalu wrote:
         | It's not $6,000/task (i.e. per question). $6,000 is about the
         | retail cost of evaluating the entire benchmark on high
         | efficiency (about 400 questions).
        
           | Tiberium wrote:
           | From reading the blog post and Twitter, and cost of other
           | models, I think it's evident that it IS actually cost per
           | task, see this tweet: https://files.catbox.moe/z1n8dc.jpg
           | 
           | And o1 costs $15/$60 per 1M in/out tokens, so the estimated
           | costs on the graph would match for a single task, not the
           | whole benchmark.
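           | 
           | Back-of-envelope at those prices (the token counts below
           | are made-up assumptions, just to show the arithmetic):
           | 
           |     PRICE_IN = 15 / 1_000_000   # $ per input token
           |     PRICE_OUT = 60 / 1_000_000  # $ per output token
           | 
           |     def task_cost(tokens_in, tokens_out):
           |         return (tokens_in * PRICE_IN
           |                 + tokens_out * PRICE_OUT)
           | 
           |     # ~$18 for one task at these token counts
           |     print(task_cost(10_000, 300_000))
           | 
           |     # ~$3,300 for one task with a huge CoT budget
           |     print(task_cost(10_000, 55_000_000))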
        
             | slibhb wrote:
             | The blog clarifies that it's $17-20 per task. Maybe it runs
             | into thousands for tasks it can't solve?
        
               | Tiberium wrote:
               | That cost is for o3 low, o3 high goes into thousands per
               | task.
        
         | gbnwl wrote:
         | Well they got 75.7% at $17/task. Did you see that?
        
         | seydor wrote:
         | What if we use those humans to generate energy for the tasks?
        
       | spaceman_2020 wrote:
       | Just as an aside, I've personally found o1 to be completely
       | useless for coding.
       | 
       | Sonnet 3.5 remains the king of the hill by quite some margin
        
         | cchance wrote:
         | The new Geminis are pretty good too.
        
           | lysecret wrote:
           | Actually I prefer the new Geminis too, 2.0 Experimental
           | especially.
        
           | spaceman_2020 wrote:
           | The new ai studio from Google is fantastic
        
         | og_kalu wrote:
         | To be fair, until the latest checkpoint released 2 days ago,
         | o1 didn't really beat Sonnet (and if so, barely) in most non-
         | competitive coding benchmarks.
        
         | vessenes wrote:
         | To fill this out, I find o1-pro (and -preview when it was live)
         | to be pretty good at filling in blindspots/spotting holistic
         | bugs. I use Claude for day to day, and when Claude is spinning,
         | o1 often can point out why. It's too slow for AI coding, and I
         | agree that at default its responses aren't always satisfying.
         | 
         | That said, I think its code style is arguably better, more
         | concise and has better patterns -- Claude needs a fair amount
         | of prompting and oversight to not put out semi-shitty code in
         | terms of structure and architecture.
         | 
         | In my mind: going from Slowest to Fastest, and Best
         | Holistically to Worst, the list is:
         | 
         | 1. o1-pro
         | 2. Claude 3.5
         | 3. Gemini 2 Flash
         | 
         | Flash is so fast, that it's tempting to use more, but it really
         | needs to be kept to specific work on strong codebases without
         | complex interactions.
        
           | spaceman_2020 wrote:
           | Claude has a habit of sometimes just getting "lost".
           | 
           | Like, I'll have it on a project in Cursor and it will spin
           | up ready-to-use components that use my site style, reference
           | existing components, and follow all existing patterns.
           | 
           | Then on some days, it will even forget what language the
           | project is in and start giving me Python code for a React
           | project.
        
           | causal wrote:
           | Yeah it's almost like system 1 vs system 2 thinking
        
         | bearjaws wrote:
         | o1 is pretty good at spotting OWASP defects, compared to most
         | other models.
         | 
         | https://myswamp.substack.com/p/benchmarking-llms-against-com...
        
         | InkCanon wrote:
         | I just asked o1 a simple yes or no question about x86 atomics
         | and it did one of those A or B replies. The first answer was
         | yes, the second answer was no.
        
         | m3kw9 wrote:
         | o1 is for when all else fails. Sometimes it makes the same
         | mistakes as weaker models if you give it simple tasks with
         | very little context, but when a good, precise context is
         | given it usually outperforms other models.
        
         | karmasimida wrote:
         | Yeah, I feel that for the chat use case, o1 is just too slow
         | for me, and my queries aren't that complicated.
         | 
         | For coding, o1 is marvelous at Leetcode questions; I think it
         | is the best teacher I could ever afford to teach me
         | leetcoding. But I don't find myself having a lot of other use
         | cases for o1 that are complex and require a really long
         | reasoning chain.
        
         | bitbuilder wrote:
         | I find myself hopping between o1 and Sonnet pretty frequently
         | these days, and my personal observation is that the quality
         | of output from o1 scales more directly with the quality of
         | the prompting you're giving it.
         | 
         | In a way it almost feels like it's become _too_ good at
         | following instructions and simply just takes your direction
         | more literally. It doesn't seem to take the initiative of
         | going the extra mile of filling in the blanks from your lazy
         | input (note: many would see this as a good thing). Claude on
         | the other hand feels more intuitive in discerning intent from
         | a lazy prompt, which I may be prone to offering it at times
         | when I'm simply trying out ideas.
         | 
         | However, if I take the time to write up a well thought out
         | prompt detailing my expectations, I find I much prefer the code
         | o1 creates. It's smarter in its approach, offers clever ideas I
         | wouldn't have thought of, and generally cleaner.
         | 
         | Or put another way, I can give Sonnet a lazy or detailed prompt
         | and get a good result, while o1 will give me an excellent
         | result with a well thought out prompt.
         | 
         | What this boils down to is I find myself using Sonnet while
         | brainstorming ideas, or when I simply don't know how I want to
         | approach a problem. I can pitch it a feature idea the same way
         | a product owner might pitch an idea to an engineer, and then
         | iterate through sensible and intuitive ways of looking at the
         | problem. Once I get a handle on how I'd like to implement a
         | solution, I type up a spec and hand it off to o1 to crank out
         | the code I'd intend to implement.
        
           | jules wrote:
           | Can you solve this by putting your lazy prompt through GPT-4o
           | or Sonnet 3.6 and asking it to expand the prompt to a full
           | prompt for o1?
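           | 
           | I mean something like this sketch (OpenAI Python client;
           | the model names are just placeholders for whatever you have
           | access to):
           | 
           |     from openai import OpenAI
           | 
           |     client = OpenAI()
           | 
           |     def ask(model, content):
           |         resp = client.chat.completions.create(
           |             model=model,
           |             messages=[{"role": "user",
           |                        "content": content}],
           |         )
           |         return resp.choices[0].message.content
           | 
           |     def expand_then_solve(lazy_prompt):
           |         # cheap model expands the lazy prompt
           |         spec = ask("gpt-4o",
           |                    "Expand this into a detailed,"
           |                    " unambiguous coding spec:\n\n"
           |                    + lazy_prompt)
           |         # reasoning model gets the full spec
           |         return ask("o1", spec)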
        
           | spaceman_2020 wrote:
           | Have you found any tool or guide for writing better o1
           | prompts? This isn't the first time I've heard this about o1
           | but no one seems to know _how_ to prompt it
        
         | leumon wrote:
         | I've found gemini-1206 to be best, and we can use it for free
         | (for now) in Google's AI Studio. It's number 1 on lmarena.ai
         | for coding (and overall), and number 1 on BigCodeBench.
        
         | energy123 wrote:
         | Which o1? A new version was released a few days ago and beats
         | Sonnet 3.5 on Livebench
        
       | smy20011 wrote:
       | It seems o3 follows the trend of chess engines, where you can
       | cut your search depth depending on the state.
       | 
       | That works well for games with a clear signal of success
       | (win/lose for chess, tests for programming). One of the
       | blockers for AGI is that we don't have a clear evaluation for
       | most of our tasks, and we cannot verify them fast enough.
        
       | flakiness wrote:
       | The cost axis is interesting. o3 Low is $10+ per task and o3
       | High is over $1,000 (it's a logarithmic graph, so it's like $50
       | and $5,000 respectively?)
        
       | obblekk wrote:
       | Human performance is 85% [1]. o3 high gets 87.5%.
       | 
       | This means we have an algorithm to get to human level performance
       | on this task.
       | 
       | If you think this task is an eval of general reasoning ability,
       | we have an algorithm for that now.
       | 
       | There's a lot of work ahead to generalize o3 performance to all
       | domains. I think this explains why many researchers feel AGI is
       | within reach, now that we have an algorithm that works.
       | 
       | Congrats to both Francois Chollet for developing this compelling
       | eval, and to the researchers who saturated it!
       | 
       | [1] https://x.com/SmokeAwayyy/status/1870171624403808366,
       | https://arxiv.org/html/2409.01374v1
        
         | phillipcarter wrote:
         | As excited as I am by this, I still feel like this is just a
         | small approximation of a small chunk of human reasoning
         | ability at large. o3 (and whatever comes next) feels to me
         | like it will head down the path of being a reasoning
         | coprocessor for various tasks.
         | 
         | But, still, this is incredibly impressive.
        
           | qt31415926 wrote:
           | Which parts of reasoning do you think are missing? I do
           | feel like it covers a lot of 'reasoning' ground despite its
           | on-the-surface simplicity.
        
             | phillipcarter wrote:
             | I think it's hard to enumerate the unknown, but I'd
             | personally love to see how models like this perform on
             | things like word problems where you introduce red herrings.
             | Right now, LLMs at large tend to struggle mightily to
             | understand when some of the given information is not only
             | irrelevant, but may explicitly serve to distract from the
             | real problem.
        
               | KaoruAoiShiho wrote:
               | o1 already fixed the red herrings...
        
               | zmgsabst wrote:
               | That's not inability to reason though, that's having a
               | social context.
               | 
               | Humans also don't tend to operate in a rigorously logical
               | mode and understand that math word problems are an
               | exception where the language may be adversarial: they're
               | trained for that special context in school. If you tell
               | the LLM that social context, eg that language may be
               | deceptive, their "mistakes" disappear.
               | 
               | What you're actually measuring is the LLM defaults to
               | assuming you misspoke trying to include relevant
               | information rather than that you were trying to trick it
               | -- which is the social context you'd expect when trained
               | on general chat interactions.
               | 
               | Establishing context in psychology is hard.
        
             | Agentus wrote:
             | Kinda interesting: every single CS person (especially
             | PhDs), when talking about reasoning, is unable to
             | concisely quantify, enumerate, qualify, or define
             | reasoning.
             | 
             | People with (high) intelligence talking about and building
             | (artificial) intelligence, but never able to convincingly
             | explain aspects of intelligence; they just often talk
             | ambiguously and circularly around it.
             | 
             | What are we humans getting ourselves into, inventing
             | Skynet :wink.
             | 
             | It's been an ongoing pet project of mine to tackle
             | reasoning, but I can't answer your question with regard to
             | LLMs.
        
               | YeGoblynQueenne wrote:
               | >> Kinda interesting, every single CS person (especially
               | phds) when talking about reasoning are unable to
               | concisely quantify, enumerate, qualify, or define
               | reasoning.
               | 
               | Kinda interesting that mathematicians also can't do the
               | same for mathematics.
               | 
               | And yet.
        
               | Agentus wrote:
               | Well, let's just say I think I can explain reasoning
               | better than anyone I've encountered. I have my own
               | hypothesized theory on what it is and how it manifests
               | in neural networks.
               | 
               | I doubt your mathematician example is equivalent.
               | 
               | Some examples fresh in my mind that further my point:
               | I've heard Yann LeCun baffled by LLMs'
               | instantiation/emergence of reasoning, along with other
               | AI researchers. Eric Schmidt thinks agentic reasoning is
               | the current frontier and people should be focusing on
               | that. I was listening to the start of an AI/machine
               | learning interview a week ago where some CS PhD was
               | asked to explain reasoning, and the best he could muster
               | up was "you know it when you see it"... not to mention
               | the guy responding to the grandparent who gave a cop-out
               | answer (all the most respect to him).
        
               | necovek wrote:
               | Care to enlighten us with your explanation of what
               | "reasoning" is?
        
               | Agentus wrote:
               | Terribly sorry to be such a tease, but I'm looking to
               | publish a paper on it, and I still need to delve deeper
               | into machine interpretability to make sure it's properly
               | couched empirically. If you can help with that, perhaps
               | we can continue this convo in private.
        
               | YeGoblynQueenne wrote:
               | >> well lets just say i think i can explain reasoning
               | better than anyone ive encountered. i have my own
               | hypothesized theory on what it is and how it manifests in
               | neural networks.
               | 
               | I'm going to bet you haven't encountered the right
               | people then. Maybe your social circle is limited to
               | folks like the person who presented a slide about A* to
               | a dumbstruck roomful of Deep Learning researchers at the
               | last NeurIPS?
               | 
               | https://x.com/rao2z/status/1867000627274059949
        
               | Agentus wrote:
               | Possibly; my university doesn't really do AI research
               | beyond using it as a tool to engineer things. I'm
               | looking to transfer to a different university.
               | 
               | But no, my take on reasoning is really a somewhat
               | generalized reframing of the definition of reasoning
               | (which you might find in the Stanford Encyclopedia of
               | Philosophy), reframed partially in terms of axiomatic
               | building blocks of neural network components and
               | terminology. I'm not claiming to have discovered
               | reasoning, just to redefine it in a way that's
               | compatible and sensible for neural networks (ish).
        
               | YeGoblynQueenne wrote:
               | Well, you're free to define and redefine anything as you
               | like, but be aware that every time you move the target
               | closer to your shot you are setting yourself up for some
               | pretty strong confirmation bias.
        
               | Agentus wrote:
               | Yeah, that's why I need help from the machine
               | interpretability crowd: to make sure my hypothesized
               | reframing of reasoning has a sufficient empirical basis
               | and isn't adrift in la-la land.
        
               | logicchains wrote:
               | Mathematicians absolutely can, it's called foundations,
               | and people actively study what mathematics can be
               | expressed in different foundations. Most mathematicians
               | don't care about it though for the same reason most
               | programmers don't care about Haskell.
        
               | YeGoblynQueenne wrote:
               | I don't care about Haskell either, but we know what
               | reasoning is [1]. It's been studied extensively in
               | mathematics, computer science, psychology, cognitive
               | science and AI, and in philosophy going back literally
               | thousands of years with grandpapa Aristotle and his
               | syllogisms. Formal reasoning, informal reasoning, non-
               | monotonic reasoning, etc etc. Not only do we know what
               | reasoning is, we know how to do it with computers just
               | fine, too [2]. That's basically the first 50 years of AI,
               | that folks like His Nobelist Eminence Geoffrey Hinton
               | will tell you was all a Bad Idea and a total failure.
               | 
               | Still somehow the question keeps coming up- "what is
               | reasoning". I'll be honest and say that I imagine it's
               | mainly folks who skipped CS 101 because they were busy
               | tweaking their neural nets who go around the web like
               | Diogenes with his lantern, howling "Reasoning! I'm
               | looking for a definition of Reasoning! What is
               | Reasoning!".
               | 
               | I have never heard the people at the top echelons of AI
               | and Deep learning - LeCun, Schmidhuber, Bengio, Hinton,
               | Ng, Hutter, etc etc- say things like that: "what's
               | reasoning". The reason I suppose is that they know
               | exactly what that is, because it was the one thing they
               | could never do with their neural nets, that classical AI
               | could do between sips of coffee at breakfast [3]. Those
               | guys know exactly what their systems are missing and, to
               | their credit, have never made no bones about that.
               | 
               | _________________
               | 
               | [1] e.g. see my profile for a quick summary.
               | 
               | [2] See all of Russell & Norvig, for instance.
               | 
               | [3] Schmidhuber's doctoral thesis was an implementation
               | of genetic algorithms in Prolog, even.
        
               | Agentus wrote:
               | I have a question for you, one which I've asked many
               | philosophy professors but none could answer
               | satisfactorily. Since you seem to have a penchant for
               | reasoning, perhaps you might have a good answer. (I hope
               | I remember the full extent of the question properly; I
               | might hit you up with some follow-up questions.)
               | 
               | It pertains to the source of the inference power of
               | deductive inference. Do you think all deductive
               | reasoning originated inductively? Like, when someone
               | discovers a rule or fact that seemingly has contextual
               | predictive power, obviously that can be confirmed
               | inductively by observations, but did that deductive
               | reflex of the mind coagulate from inductive experiences?
               | Maybe not all derived deductive rules, but the original
               | deductive rules.
        
               | YeGoblynQueenne wrote:
               | I'm sorry but I have no idea how to answer your question,
               | which is indeed philosophical. You see, I'm not a
               | philosopher, but a scientist. Science seeks to pose
               | questions, and answer them; philosophy seeks to pose
               | questions, and question them. Me, I like answers more
               | than questions so I don't care about philosophy much.
        
               | Agentus wrote:
               | Well, yeah, it's partially philosophical; I guess my
               | haphazard use of language like "all" makes it more
               | philosophical than intended.
               | 
               | But I'm getting at a few things. One of those things is
               | neurological: how do deductive inference constructs
               | manifest in neurons, and is it really, inadvertently, an
               | inductive process that creates deductive neural
               | functions?
               | 
               | The other aspect of the question, I guess, is more
               | philosophical: why does deductive inference work at all?
               | I think clues to a potential answer can be seen in the
               | mechanics of generalization, with antecedents predicting
               | (or correlating with) certain generalized consequences
               | consistently. The brain coagulates generalized
               | coinciding concepts by reinforcement, and it recognizes
               | or differentiates instances included in or excluded from
               | a generalization by recognition properties that seem to
               | gatekeep identities accordingly. It's hard to explain
               | succinctly what I mean by the latter, but I'm planning
               | on writing an academic paper on that.
        
               | mistermann wrote:
               | >Those guys know exactly what their systems are missing
               | 
               | If they did not actually, would they (and you)
               | necessarily be able to know?
               | 
               | Many people claim the ability to prove a negative, but no
               | one will post their method.
        
               | YeGoblynQueenne wrote:
               | To clarify, what neural nets are missing is a capability
               | present in classical, logic-based and symbolic systems.
               | That's the ability that we commonly call "reasoning". No
               | need to prove any negatives. We just point to what
               | classical systems are doing and ask whether a deep net
               | can do that.
        
             | john_minsk wrote:
             | My personal 5 cents is that reasoning will be there when
             | LLM gives you some kind of outcome and then when questioned
             | about it can explain every bit of result it produced.
             | 
             | For example, if we asked an LLM to produce an image of a
             | "human woman photorealistic" it produces result. After that
             | you should be able to ask it "tell me about its background"
             | and it should be able to explain "Since user didn't specify
             | background in the query I randomly decided to draw her
             | standing in front of a fantasy background of Amsterdam
             | iconic houses. Usually Amsterdam houses are 3 stories tall,
             | attached to each other and 10 meters wide. Amsterdam houses
             | usually have cranes on the top floor, which help to bring
             | goods to the top floor since doors are too narrow for any
             | object wider than 1m. The woman stands in front of the
             | houses approximately 25 meters in front of them. She is
             | 1,59m tall, which gives us correct perspective. It is
             | 11:16am of August 22nd which I used to calculate correct
             | position of the sun and align all shadows according to
             | projected lighting conditions. The color of her skin is set
             | at RGB:xxxxxx randomly" etc.
             | 
             | And it is not too much to ask LLMs for it. LLMs have access
             | to all the information above as they read all the internet.
             | So there is definitely a description of Amsterdam
             | architecture, what a human body looks like or how to
             | correctly estimate the time of day based on shadows (and
             | vice versa). The only thing missing is logic that connects
             | all
             | this information and which is applied correctly to generate
             | final image.
             | 
             | I like to think of LLMs as fancy genius compression
             | engines. They took all the information on the internet,
             | compressed it and are able to cleverly query this
             | information for end user. It is a tremendously valuable
             | thing, but if intelligence emerges out of it - not sure.
             | Digital information doesn't necessarily contain everything
             | needed to understand how it was generated and why.
        
               | concordDance wrote:
               | > if we asked an LLM to produce an image of a "human
               | woman photorealistic" it produces result
               | 
               | Large language models don't do that. You'd want an image
               | model.
               | 
               | Or did you mean "multi-model AI system" rather than
               | "LLM"?
        
               | owenpalmer wrote:
               | It might be possible for a language model to paint a
               | photorealistic picture though.
        
               | 0points wrote:
               | It is not.
               | 
               | You are confusing LLMs with Generative AI.
        
               | amelius wrote:
               | Can an LLM use tools like humans do? Could it use an
               | image model as a tool to query the image?
        
               | 0points wrote:
               | No, a LLM is a Large Language Model.
               | 
               | It can language.
        
               | amelius wrote:
               | You could teach it to emit patterns that (through other
               | code) invoke tools, and loop the results back to the LLM.
        
             | Xmd5a wrote:
             | LLMs are still bound to a prompting session. They can't
             | form long-term memories, can't ponder on them and can't
             | develop experience. They have no cognitive architecture.
             | 
             | 'Agents' (i.e. workflows intermingling code and calls to
             | LLMs) are still a thing (as shown by the fact there is a
             | post by Anthropic on this subject on the front page right
             | now) and they are very hard to build.
             | 
             | One consequence of that, for instance: it's not possible
             | to have an LLM explore a topic _exhaustively_.
        
               | mjhagen wrote:
               | LLMs don't, but who said AGI should come from LLMs
               | alone? When I ask ChatGPT about something "we" worked on
               | months ago, it "remembers" and can continue the
               | conversation with that history in mind.
               | 
               | I'd say humans are also bound to prompting sessions in
               | that way.
        
               | Xmd5a wrote:
               | Last time I used the ChatGPT 'memory' feature it got
               | full very quickly. It remembered my name, my dog's name
               | and a couple of tobacco casing recipes it came up with.
               | OpenAI doesn't seem to be using embeddings and a vector
               | database, just text snippets it injects into every
               | conversation. Because RAG is too brittle? The same
               | problem arises when composing LLM calls. Efficient and
               | robust workflows are those whose prompts and/or DAG were
               | obtained via optimization techniques. Hence DSPy.
               | 
               | Consider the following use case: keeping a swimming
               | pool's water clean. I can have a long-running
               | conversation with an LLM to guide me in getting it
               | right. However, I can't have an LLM handle the problem
               | autonomously. I'd like to have it notify me on its own:
               | "hey, it's been 2 days, any improvement? Do you mind
               | sharing a few pictures of the pool as well as the
               | pH/chlorine test results?". Nothing mind-bogglingly
               | complex. Nothing that couldn't be achieved using current
               | LLMs. But it's still something I'd have to implement
               | myself, and it turns out to be more complex to achieve
               | than expected. This is the kind of improvement I'd like
               | to see big AI companies going after rather than
               | research-grade ultra-smart AIs.
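               | 
               | The core loop I have in mind is simple enough; the
               | complexity is in the plumbing around it. Roughly this
               | (notify() and ask_llm() are placeholders for a messaging
               | channel and a model call):
               | 
               |     import time
               | 
               |     def ask_llm(history):
               |         # placeholder for a model call that drafts
               |         # the periodic check-in message
               |         return ("It's been 2 days, any improvement? "
               |                 "Mind sharing pH/chlorine results?")
               | 
               |     def notify(message):
               |         print(message)  # stand-in for email/SMS
               | 
               |     def pool_agent(days_between_checks=2):
               |         history = []
               |         while True:
               |             msg = ask_llm(history)
               |             notify(msg)
               |             history.append(msg)
               |             # owner's reply would be appended here
               |             time.sleep(days_between_checks * 86400)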
        
             | amelius wrote:
             | Does it include the use of tools to accomplish a task?
             | 
             | Does it include the invention of tools?
        
             | tim333 wrote:
             | Current AI is good at text but not very good at 3d physical
             | stuff like fixing your plumbing.
        
             | mistermann wrote:
             | Optimal phenomenological reasoning is going to be a tough
             | nut to crack.
             | 
             | Luckily we don't know the problem exists, so in a
             | cultural/phenomenological sense it is already cracked.
        
           | azeirah wrote:
           | I'd like to see this o3 thing play 5D Chess With Multiverse
           | Time Travel or Baba Is You.
           | 
           | The only effect smarter models will have is that
           | intelligent people will have to use less of their brain to
           | do their work. As has always been the case, the medium is
           | the message, and climate change is one of the most
           | difficult and worst problems of our time.
           | 
           | If this gets software people to quit en masse and start
           | working in energy, biology, ecology and preservation? Then
           | it has succeeded.
        
             | concordDance wrote:
             | > climate change is one of the most difficult and worst
             | problems of our time.
             | 
             | Slightly surprised to see this view here.
             | 
             | I can think of half a dozen more serious problems off hand
             | (e.g. population aging, institutional scar tissue,
             | dysgenics, nuclear proliferation, pandemic risks, AI
             | itself) along most axes I can think of (raw $ cost, QALYs,
             | even X-risk).
        
         | ALittleLight wrote:
         | It's not saturated. 85% is average human performance, not "best
         | human" performance. There is still room for the model to go up
         | to 100% on this eval.
        
         | scotty79 wrote:
         | Still it's comparing average human level performance with best
         | AI performance. Examples of things o3 failed at are insanely
         | easy for humans.
        
           | FrustratedMonky wrote:
           | There are things Chimps do easily that humans fail at, and
           | vice/versa of course.
           | 
           | There are blind spots, doesn't take away from 'general'.
        
             | noobermin wrote:
             | The downvotes should tell you, this is a decided "hype"
             | result. Don't poo poo it, that's not allowed on AI slop
             | posts on HN.
        
               | FrustratedMonky wrote:
               | Yeah, I didn't realize Chimp studies, or neuroscience
               | were out of vogue. Even in tech, people form strong
               | 'beliefs' around what they think is happening.
        
             | Matumio wrote:
             | We can't agree whether Portia spiders are intelligent or
             | just have very advanced instincts. How will we ever agree
             | about what human intelligence is, or how to separate it
             | from cultural knowledge? If that even makes sense.
        
               | FrustratedMonky wrote:
               | I guess my point is more, if we can't decide about Portia
               | Spiders or Chimps, then how can we be so certain about
               | AI. So offering up Portia and Chimps as counter examples.
        
           | cchance wrote:
           | You'd be surprised what the AVERAGE human fails to do that
           | you think is easy. My mom can't fucking send an email
           | without downloading a virus, and I have a coworker who
           | believes beyond a shadow of a doubt that the world is flat.
           | 
           | The average human is a lot dumber than people on Hacker
           | News and Reddit seem to realize; shit, the people on MTurk
           | are likely smarter than the AVERAGE person.
        
             | staticman2 wrote:
             | Yet the average human can drive a car a lot better than
             | ChatGPT can, which shows that the way you frame
             | "intelligence" dictates your conclusion about who is
             | "intelligent".
        
               | p1esk wrote:
               | Pretty sure a waymo car drives better than an average SF
               | driver.
        
               | manquer wrote:
               | Waymo cannot handle poor weather at all; the average
               | human can.
               | 
               | Being able to perform better than humans in a specific,
               | constrained problem space is how every automation system
               | has been developed.
               | 
               | While self-driving systems are impressive, they don't
               | drive anywhere close to the skill of the average driver.
        
               | tim333 wrote:
               | Waymo blog with video of them driving in poor weather
               | https://waymo.com/blog/2019/08/waymo-and-weather
        
               | manquer wrote:
               | And Nikola famously made a video of a truck that had no
               | engine; we don't take a company's word for anything
               | until we can verify it.
               | 
               | This is not offered to the public. They are actively
               | expanding only in cities like LA, Miami or Phoenix now,
               | where the weather is good throughout the year.
               | 
               | The tech for bad weather is nowhere close to ready for
               | the public. The average human, on the other hand, is
               | driving in bad weather every day.
        
               | tim333 wrote:
               | "Extreme Weather" tech "will be available to riders in
               | the near future"
               | https://www.cnet.com/roadshow/news/waymos-latest-
               | robotaxi-is...
        
               | daveguy wrote:
               | I'm sure the source of that CNET article came with a
               | forward looking statements disclaimer.
        
               | Mordisquitos wrote:
               | And how well would a Waymo car do in this challenge with
               | the ARC-AGI datasets?
        
               | coldcode wrote:
               | There's a reason why Waymo isn't offered in Buffalo.
        
               | fragmede wrote:
               | Is that reason because Buffalo is the 81st most populated
               | city in the United States, or 123rd by population
               | density, and Waymo currently only serves approximately 3
               | cities in North America?
               | 
               | We already let computers control cars because they're
               | better than humans at it when the weather is inclement.
               | It's called ABS.
        
               | tracerbulletx wrote:
               | If you take an electrical sensory input signal sequence
               | and transform it into an electrical muscle output signal
               | sequence, you've got a brain. ChatGPT isn't going to
               | drive a car because it's trained on verbal tokens, and
               | it's not optimized for the type of latency you need for
               | physical interaction.
               | 
               | And the brain doesn't use the same network to do verbal
               | reasoning as real time coordination either.
               | 
               | But that work is moving along fine. All of these models
               | and lessons are going to be combined into AGI. It is
               | happening. There isn't really that much in the way.
        
             | mirkodrummer wrote:
             | Not being able to send an email, or believing the world is
             | flat, is not a sign of a lack of intelligence; I'd rather
             | say it's more about culture, or being more or less
             | schooled. Your mom or coworker can still instinctively do
             | stuff that outperforms every algorithm out there, and it
             | is still unexplained how we do it. We still have no idea
             | what intelligence is.
        
             | 0points wrote:
             | Your examples are just examples of lack of information.
             | That's not a measure for intelligence.
             | 
             | As a contrary point, most people think they are smarter
             | than they really are.
        
             | HarHarVeryFunny wrote:
             | Maybe, but no doubt these "dumb" people can still get
             | dressed in the morning, navigate a trip to the mall, do the
             | dishes, etc, etc.
             | 
             | It's always been the case that the things that are easiest
             | for humans are hardest for computers, and vice versa.
             | Humans are good at general intelligence - tackling semi-
             | novel problems all day long, while computers are good at
             | narrow problems they can be trained on such as chess or
             | math.
             | 
             | The majority of the benchmarks currently used to evaluate
             | these AI models are narrow skills that the models have been
             | trained to handle well. What'll be much more useful will be
             | when they are capable of the generality of "dumb" tasks
             | that a human can do.
        
         | cryptoegorophy wrote:
         | What's interesting is it might be closer to human
         | intelligence than to some "alien" intelligence, because after
         | all it is an LLM trained on human-made text, which kind of
         | represents human intelligence.
        
           | hammock wrote:
           | In that vein, perhaps the delta between o3 @ 87.5% and Human
           | @ 85% represents a deficit in the ability of text to
           | communicate human reasoning.
           | 
           | In other words, it's possible humans can reason better than
           | o3, but cannot articulate that reasoning as well through text
           | - only in our heads, or through some alternative medium.
        
             | 85392_school wrote:
             | I wonder how much of an effect amount of time to answer has
             | on human performance.
        
               | yunwal wrote:
               | Yeah, this is sort of meaningless without some idea of
               | cost or consequences of a wrong answer. One of the nice
               | things about working with a competent human is being able
               | to tell them "all of our jobs are on the line" and
               | knowing with certainty that they'll come to a good
               | answer.
        
             | unsupp0rted wrote:
             | It's possible humans reason better through text than not
             | through text, so these models, having been trained on text,
             | should be able to out-reason any person who's not currently
             | sitting down to write.
        
           | hamburga wrote:
           | Agreed. I think what really makes them alien is everything
           | else about them besides intelligence. Namely, no
           | emotional/physiological grounding in empathy, shame, pride,
           | and love (on the positive side) or hatred (negative side).
        
         | antirez wrote:
         | NNs are not algorithms.
        
           | notfish wrote:
           | An algorithm is "a process or set of rules to be followed in
           | calculations or other problem-solving operations, especially
           | by a computer"
           | 
           | How does a giant pile of linear algebra not meet that
           | definition?
        
             | antirez wrote:
             | It's not made of "steps", it's an almost continuous
             | function to its inputs. And a function is not an algorithm:
             | it is not an object made of conditions, jumps,
             | terminations, ... Obviously it has computation capabilities
             | and is Turing-complete, but is the opposite of an
             | algorithm.
        
               | raegis wrote:
               | > It's not made of "steps", it's an almost continuous
               | function to its inputs.
               | 
               | Can you define "almost continuous function"? Or explain
               | what you mean by this, and how it is used in the A.I.
               | stuff?
        
               | taneq wrote:
               | Well, it's a bunch of steps, but they're smaller. /s
        
               | janalsncm wrote:
               | If it wasn't made of steps then Turing machines wouldn't
               | be able to execute them.
               | 
               | Further, this is probably running an algorithm on top of
               | an NN. Some kind of tree search.
               | 
               | I get what you're saying though. You're trying to draw a
               | distinction between statistical methods and symbolic
               | methods. Someday we will have an algorithm which uses
               | statistical methods that can match human performance on
               | most cognitive tasks, and it won't look or act like a
               | brain. In some sense that's disappointing. We can build
               | supersonic jets without fully understanding how birds
               | fly.
        
               | antirez wrote:
               | Let's say, rather, that Turing machines can approximate
               | the execution of an NN :) That's why there are issues
               | related to numerical precision. The contrary is also
               | true: NNs can discover and use techniques similar to
               | those used by traditional algorithms. However, the two
               | remain two different methods of doing computation, and
               | it's probably not just by chance that many things we
               | can't do algorithmically we can do with NNs. What I mean
               | is that this is not _just_ because NNs discover complex
               | algorithms via gradient descent, but also that the
               | computational model of NNs is better suited to solving
               | certain tasks. So the inference algorithm of NNs (doing
               | multiplications and other batch transformations) is just
               | what standard computers need to approximate the NN
               | computational model. You could run the same computation
               | on analog hardware, and then nobody (maybe?) would claim
               | it's running an algorithm. Or that brains themselves are
               | algorithms.
        
               | zeroonetwothree wrote:
               | We don't have evidence that a TM can simulate a brain.
               | But we know for a fact that it can execute a NN.
        
               | necovek wrote:
               | Computers can execute precise computations; it's just not
               | efficient (and it's very slow).
               | 
               | NNs are exactly what "computers" are good for and what
               | we've been using them for since their inception: doing
               | lots of computations quickly.
               | 
               | "Analog neural networks" (brains) work very differently
               | from what we call "neural networks" in computing, and we
               | don't understand their operation well enough to claim
               | they are or aren't algorithmic. But computing NNs are
               | simply implementations of an algorithm.
               | 
               | Edit: upon further rereading, it seems you equate "neural
               | networks" with brain-like operation. But the brain was an
               | inspiration for NNs; they are not an "approximation" of
               | it.
        
               | antirez wrote:
               | But the inference itself is orthogonal to the computation
               | the NN is doing. Obviously the inference (and training)
               | are algorithms.
        
               | tsimionescu wrote:
               | NN inference is an algorithm for computing an
               | approximation of a function with a huge number of
               | parameters. The NN itself is of course just a data
               | structure. But there is nothing whatsoever about the NN
               | process that is non-algorithmic.
               | 
               | It's the exact same thing as using a binary tree to
               | discover the lowest number in some set of numbers,
               | conceptually: you have a data structure that you evaluate
               | using a particular algorithm. The combination of the
               | algorithm and the construction of the data structure
               | arrive at the desired outcome.
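
       (As a concrete aside: a minimal sketch of what the above means,
       using NumPy and made-up weights - hypothetical numbers, not any
       real model. Written out, inference is just a short, fixed sequence
       of arithmetic steps applied to a stored data structure of
       parameters.)

           import numpy as np

           # Hypothetical two-layer network with made-up weights; the point
           # is only that evaluating it is a finite, deterministic sequence
           # of arithmetic steps over a data structure (the parameters).
           W1 = np.array([[0.5, -0.2], [0.1, 0.8]])
           b1 = np.array([0.0, 0.1])
           W2 = np.array([[1.0, -1.0]])
           b2 = np.array([0.2])

           def forward(x):
               h = np.maximum(W1 @ x + b1, 0.0)   # step 1: affine map + ReLU
               return W2 @ h + b2                 # step 2: affine map

           print(forward(np.array([1.0, 2.0])))   # evaluate on an input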
        
               | antirez wrote:
               | That's not the point, I think: you could, in theory,
               | implement the brain in BASIC, but this does not mean that
               | the brain is per se a BASIC program. I'll provide a more
               | theoretical framework for reasoning about this: if the
               | way an NN solves certain problems (the learned weights)
               | can't be translated into some normal program that DOES
               | NOT resemble the activation of an NN, then NNs are not
               | algorithms, but a different computational model.
        
               | mvkel wrote:
               | > continuous
               | 
               | So, steps?
        
               | necovek wrote:
               | "Continuous" would imply infinitely small steps, and as
               | such, would certainly be used as a differentiator
               | (differential? ;) between larger discrete stepped
               | approach.
               | 
               | In essence, infinite calculus provides a link between
               | "steps" and continuous, but those are different things
               | indeed.
        
               | necovek wrote:
               | I would say you are right that a function is not an
               | algorithm, but it is an implementation of an algorithm.
               | 
               | Is that your point?
               | 
               | If so, I've long learned to accept imprecise language as
               | long as the message can be reasonably extracted from it.
        
           | benlivengood wrote:
           | Deterministic (IEEE 754 floats), terminates on all inputs,
           | correctness (produces loss < X on N training/test inputs)
           | 
           | At most you can argue that there isn't a useful bounded loss
           | on every possible input, but it turns out that humans don't
           | achieve useful bounded loss on identifying arbitrary sets of
           | pixels as a cat or whatever, either. Most problems NNs are
           | aimed at are qualitative or probabilistic where provable
           | bounds are less useful than Nth-percentile performance on
           | real-world data.
        
           | KeplerBoy wrote:
           | Running inference on a model certainly is an algorithm.
        
           | drdeca wrote:
           | How do you define "algorithm"? I suspect it is a definition I
           | would find somewhat unusual. Not to say that I strictly
           | disagree, but only because to my mind "neural net" suggests
           | something a bit more concrete than "algorithm", so I might
           | instead say that an artificial neural net is an
           | implementation of an algorithm rather than an algorithm
           | itself, or something like that.
           | 
           | But, to my mind, something of the form "Train a neural
           | network with an architecture generally like [blah], with a
           | training method+data like [bleh], and save the result. Then,
           | when inputs are received, run them through the NN in such-
           | and-such way." would constitute an algorithm.
        
           | necovek wrote:
           | NN is a very wide term applied in different contexts.
           | 
           | When a NN is trained, it produces a set of parameters that
           | basically define an algorithm to do inference with: it's a
           | very big one though.
           | 
           | We also call that a NN (the joy of natural language).
        
         | 6gvONxR4sf7o wrote:
         | Human performance is much closer to 100% on this, depending on
         | your human. It's easy to miss the dot in the corner of the
         | headline graph in TFA that says "STEM grad."
        
           | tim333 wrote:
           | A fair comparison might be the average human. The average
           | human isn't a STEM grad. It seems a STEM grad corresponds to an
           | IQ of 130. https://www.accommodationforstudents.com/student-
           | blog/the-su...
           | 
           | From a post elsewhere the scores on ARC-AGI-PUB are approx
           | average human 64%, o3 87%.
           | https://news.ycombinator.com/item?id=42474659
           | 
           | Though also elsewhere, o3 seems very expensive to operate.
           | You could probably hire a PhD researcher for cheaper.
        
             | jeremyjh wrote:
             | Why would an average human be more fair than a trained
             | human? The model is trained.
        
         | hypoxia wrote:
         | It actually beats the human average by a wide margin:
         | 
         | - 64.2% for humans vs. 82.8%+ for o3.
         | 
         | ...
         | 
         | Private Eval:
         | 
         | - 85%: threshold for winning the prize [1]
         | 
         | Semi-Private Eval:
         | 
         | - 87.5%: o3 (unlimited compute) [2]
         | 
         | - 75.7%: o3 (limited compute) [2]
         | 
         | Public Eval:
         | 
         | - 91.5%: o3 (unlimited compute) [2]
         | 
         | - 82.8%: o3 (limited compute) [2]
         | 
         | - 64.2%: human average (Mechanical Turk) [1] [3]
         | 
         | Public Training:
         | 
         | - 76.2%: human average (Mechanical Turk) [1] [3]
         | 
         | ...
         | 
         | References:
         | 
         | [1] https://arcprize.org/guide
         | 
         | [2] https://arcprize.org/blog/oai-o3-pub-breakthrough
         | 
         | [3] https://arxiv.org/abs/2409.01374
        
           | usaar333 wrote:
           | Superhuman isn't beating random Mechanical Turkers.
           | 
           | Their post has STEM grads at nearly 100%.
        
             | tripletao wrote:
             | This is correct. It's easy to get arbitrarily bad results
             | on Mechanical Turk, since without any quality control
             | people will just click as fast as they can to get paid (or
             | bot it and get paid even faster).
             | 
             | So in practice, there's always some kind of quality
             | control. Stricter quality control will improve your
             | results, and the right amount of quality control is
             | subjective. This makes any assessment of human quality
             | meaningless without explanation of how those humans were
             | selected and incentivized. Chollet is careful to provide
             | that, but many posters here are not.
             | 
             | In any case, the ensemble of task-specific, low-compute
             | Kaggle solutions is reportedly also super-Turk, at 81%. I
             | don't think anyone would call that AGI, since it's not
             | general; but if the "(tuned)" in the figure means o3 was
             | tuned specifically for these tasks, that's not obviously
             | general either.
        
         | dyauspitr wrote:
         | I'll believe it when the AI can earn money on its own. I
         | obviously don't mean someone paying a subscription to use the
         | AI; I mean letting the AI loose on the Internet with the sole
         | goal of making money and putting it into a bank account.
        
           | hamburga wrote:
           | Do trading bots count?
        
             | 1659447091 wrote:
             | No, the AI would have to start from zero and reason its
             | way to making itself money online, such as the humans who
             | were first in their online field of interest (e-commerce,
             | scams, ads etc from the 80's and 90's) when there was no
             | guidance, only general human intelligence that could reason
             | their way into money making opportunities and reason their
             | way into making it work.
        
               | concordDance wrote:
               | I don't think humans ever do that. They research/read and
               | ask other humans.
        
         | lastdong wrote:
         | Curious about how many tests were performed. Did it
         | consistently manage to successfully solve many of these types
         | of problems?
        
         | dmead wrote:
         | This is so strange. People think that an LLM trained on
         | programming questions and docs being able to do mundane tasks
         | like this means it's intelligent? Come on.
         | 
         | It really suggests one of two things.
         | 
         | 1. You don't know what you're talking about.
         | 
         | 2. You have a perverse incentive to believe this such that you
         | will preach it to others and elevate some job salary range or
         | stock.
         | 
         | Either way, not a good look.
        
           | javaunsafe2019 wrote:
           | This
        
       | Imnimo wrote:
       | Whenever a benchmark that was thought to be extremely difficult
       | is (nearly) solved, it's a mix of two causes. One is that
       | progress on AI capabilities was faster than we expected, and the
       | other is that there was an approach that made the task easier
       | than we expected. I feel like there's a lot of the former
       | here, but the compute cost per task (thousands of dollars to
       | solve one little color grid puzzle??) suggests to me that there's
       | some amount of the latter. Chollet also mentions ARC-AGI-2 might
       | be more resistant to this approach.
       | 
       | Of course, o3 looks strong on other benchmarks as well, and
       | sometimes "spend a huge amount of compute for one problem" is a
       | great feature to have available if it gets you the answer you
       | needed. So even if there's some amount of "ARC-AGI wasn't quite
       | as robust as we thought", o3 is clearly a very powerful model.
        
         | exe34 wrote:
         | > the other is that there was an approach that made the task
         | easier than we expected.
         | 
         | from reading Dennett's philosophy, I'm convinced that that's
         | how human intelligence works - for each task that "only a human
         | could do that", there's a trick that makes it easier than it
         | seems. We are bags of tricks.
        
           | Jensson wrote:
           | > We are bags of tricks.
           | 
           | We are trick generators, that is what it means to be a
           | general intelligence. Adding another trick in the bag doesn't
           | make you a general intelligence, being able to discover and
           | add new tricks yourself makes you a general intelligence.
        
             | falcor84 wrote:
             | Not the parent, but remembering my reading of Dennett, he
             | was referring to the tricks that we got through evolution,
             | rather than ones we invented ourselves. As particular
             | examples, we have neural functional areas for capabilities
             | like facial recognition and spatial reasoning which seems
             | to rely on dedicated "wetware" somewhat distinct from other
             | parts of the brain.
        
               | Jensson wrote:
               | But humans being able to develop new tricks is core to
               | their intelligence; saying it's just a bag of tricks
               | means you don't understand what AGI is. So either the
               | poster misunderstood Dennett, or Dennett wasn't talking
               | about AGI, or Dennett didn't understand this well.
               | 
               | Of course there are many tricks you will need special
               | training for, like many of the skills human share with
               | animals, but the ability to construct useful shareable
               | large knowledge bases based on observations is unique to
               | humans and isn't just a "trick".
        
               | exe34 wrote:
               | Dennett was talking about natural intelligence. I think
               | you're just underestimating the potential of a
               | sufficiently big bag of tricks.
               | 
               | sharing knowledge isn't a human thing - chimps learn from
               | each other. bees teach each other the direction and
               | distance to a new source of food.
               | 
               | we just happen to push the envelope a lot further and
               | managed to kickstart runaway mimetic evolution.
        
               | falcor84 wrote:
               | "mimetic" is apt there, but I think that Dennett, as a
               | friend of Dawkins, would say it's "memetic"
        
               | exe34 wrote:
               | nice catch!
        
             | exe34 wrote:
             | generating tricks is itself a trick that relies on an
             | enormous bag of tricks we inherited through evolution by
             | the process of natural selection.
             | 
             | the new tricks don't just pop into our heads even though it
             | seems that way. nobody ever woke up and devised a new trick
             | in a completely new field without spending years learning
             | about that field or something adjacent to it. even the new
             | ideas tend to be an old idea from a different field applied
             | to a new field. tricks stand on the shoulders of giants.
        
         | solidasparagus wrote:
         | Or the test wasn't testing anything meaningful, which IMO is
         | what happened here. I think ARC was basically looking at the
         | distribution of what AI is capable of, picked an area that it
         | was bad at and no one had cared enough to go solve, and put
         | together a benchmark. And then we got good at it because
         | someone cared and we had a measurement. Which is essentially
         | the goal of ARC.
         | 
         | But I don't much agree that it is any meaningful step towards
         | AGI. Maybe it's a nice proof point that AI can solve simple
         | problems presented in intentionally opaque ways.
        
           | atleastoptimal wrote:
           | I'd agree with you if there hadn't been very deliberate work
           | towards solving ARC for years, and if the conceit of the
           | benchmark weren't specifically based on a conception of human
           | intuition as being, put simply, learning and applying out-of-
           | distribution rules on the fly. ARC wasn't some arbitrary
           | inverse set; it was designed to benchmark a fundamental
           | capability of general intelligence.
        
       | whoistraitor wrote:
       | The general message here seems to be that inference-time brute-
       | forcing works as long as you have a good search and evaluation
       | strategy. We've seemingly hit a ceiling on the base LLM forward-
       | pass capability so any further wins are going to be in how we
       | juggle multiple inferences to solve the problem space. It feels
       | like a scripting problem now. Which is cool! A fun space for
       | hacker-engineers. Also:
       | 
       | > My mental model for LLMs is that they work as a repository of
       | vector programs. When prompted, they will fetch the program that
       | your prompt maps to and "execute" it on the input at hand. LLMs
       | are a way to store and operationalize millions of useful mini-
       | programs via passive exposure to human-generated content.
       | 
       | I found this such an intriguing way of thinking about it.
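
       (A toy illustration of that metaphor, not of how an LLM actually
       works: the "repository of mini-programs" reading, taken literally
       as a lookup table. The prompt strings and helper functions below
       are made up for illustration only.)

           # Chollet's metaphor taken literally: a store of mini-programs
           # keyed by prompt, "fetched" and "executed" on the input. Real
           # LLMs interpolate over learned representations; nothing here is
           # how they are implemented.
           mini_programs = {
               "reverse the words": lambda s: " ".join(reversed(s.split())),
               "shout": str.upper,
           }

           def fake_llm(prompt, text):
               program = mini_programs[prompt]   # fetch the program the prompt maps to
               return program(text)              # ...and "execute" it on the input

           print(fake_llm("reverse the words", "the quick brown fox"))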
        
         | whimsicalism wrote:
         | > We've seemingly hit a ceiling on the base LLM forward-pass
         | capability so any further wins are going to be in how we juggle
         | multiple inferences to solve the problem space
         | 
         | Not so sure - but we might need to figure out the
         | inference/search/evaluation strategy in order to provide the
         | data we need to distill to the single forward-pass data
         | fitting.
        
       | cchance wrote:
       | Is it just me or does looking at the ARC-AGI example questions at
       | the bottom... make your brain hurt?
        
         | drdaeman wrote:
         | Looks pretty obvious to me, although, of course, it took me a
         | few moments to understand what's expected as a solution.
         | 
         | c6e1b8da is moving rectangular figures by a given vector,
         | 0d87d2a6 is drawing horizontal and/or vertical lines
         | (connecting dots at the edges) and filling figures they touch,
         | b457fec5 is filling gray figures with a given repeating color
         | pattern.
         | 
         | This is pretty straightforward stuff that doesn't require much
         | spatial thinking or keeping multiple things/aspects in memory -
         | visual puzzles from various "IQ" tests are way harder.
         | 
         | This said, now I'm curious how SoTA LLMs would do on something
         | like WAIS-IV.
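
       (For readers who haven't opened the examples: a toy Python version
       of the first kind of transformation described above - shifting a
       "figure" on a small grid by a fixed offset. The grid and offset are
       made up; this paraphrases the task style, it is not an actual ARC
       task.)

           import numpy as np

           def translate(grid, dr, dc):
               # Move every non-zero cell by (dr, dc), dropping cells that
               # leave the grid - a toy analogue of the "move figure by a
               # vector" task style.
               out = np.zeros_like(grid)
               rows, cols = grid.shape
               for r in range(rows):
                   for c in range(cols):
                       if grid[r, c] and 0 <= r + dr < rows and 0 <= c + dc < cols:
                           out[r + dr, c + dc] = grid[r, c]
               return out

           g = np.zeros((5, 5), dtype=int)
           g[1:3, 1:3] = 3                  # a 2x2 "figure" of color 3
           print(translate(g, 2, 1))        # same figure, shifted down 2, right 1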
        
         | randyrand wrote:
         | I'll sound like a total douche bag - but I thought they were
         | incredibly obvious - which I think is the point of them.
         | 
         | What took me longer was figuring out how the question was
         | arranged, i.e. left input, right output, 3 examples each
        
       | airstrike wrote:
       | Uhh...some of us are apparently living under a rock, as this is
       | the first time I hear about o3 and I'm on HN far too much every
       | day
        
         | burningion wrote:
         | I think it was just announced today! You're fine!
        
       | cryptoegorophy wrote:
       | Besides higher scores - are there any improvements for general
       | use? Like asking it to help set up Home Assistant, etc.?
        
       | rvz wrote:
       | Great results. However, let's all just admit it.
       | 
       | It has effectively replaced journalists and artists, and it is
       | on its way to replacing both junior and senior engineers. The
       | ultimate intention of "AGI" is that it is going to replace tens
       | of millions of jobs. That is it and you know it.
       | 
       | It will only accelerate and we need to stop pretending and
       | coping. Instead let's discuss solutions for those lost jobs.
       | 
       | So what is the replacement for these lost jobs? (It is not UBI or
       | "better jobs" without defining them.)
        
         | neom wrote:
         | Do you follow Jack Clark? I noticed he's been on the road a lot
         | talking to governments and policy makers, and not just in the
         | "AI is coming" way he used to talk.
        
         | whynotminot wrote:
         | When none of us have jobs or income, there will be no ability
         | for us to buy products. And then no reason for companies to buy
         | ads to sell products to people who don't have money. Without ad
         | money (or the potential of future ad money), the people pushing
         | the bounds of AGI into work replacement will lose the very
         | income streams powering this research and their valuations.
         | 
         | Ford didn't support a 40 hour work week out of the kindness of
         | his heart. He wanted his workers to have time off for buying
         | things (like his cars).
         | 
         | I wonder if our AGI industrialist overlords will do something
         | similar for revenue sharing or UBI.
        
           | whimsicalism wrote:
           | This picture doesn't make sense. If most don't have any money
           | to buy products, just invent some other money and start
           | paying one of the other people who doesn't have any money to
           | start making the products for you.
           | 
           | In reality, if there really is mass unemployment, AI driven
           | automation will make consumables so cheap that anyone will be
           | able to buy it.
        
             | whynotminot wrote:
             | > This picture doesn't make sense. If most don't have any
             | money to buy products, just invent some other money and
             | start paying one of the other people who doesn't have any
             | money to start making the products for you.
             | 
             | Uh, this picture doesn't make sense. Why would anyone value
             | this randomly invented money?
        
               | whimsicalism wrote:
               | > Why would anyone value this randomly invented money?
               | 
               | Because they can use it to pay for goods?
               | 
               | Your notion is that almost everyone is going to be out of
               | a job and thus have nothing. Okay, so I'm one of those
               | people and I need this house built. But I'm not making
               | any money because of AI or whatever. Maybe someone else
               | needs someone to drive their aging relative around and
               | they're a good builder.
               | 
               | If 1. neither of those people have jobs or income because
               | of AI 2. AI isn't provisioning services for basically
               | free,
               | 
               | then it makes sense for them to do an exchange of labor -
               | even with AI (if that AI is not providing services to
               | everyone). The original reason for having money and
               | exchanging it still exists.
        
               | whynotminot wrote:
               | Honestly I don't even know how to engage with your point.
               | 
               | Yes if we recreate society some form of money would
               | likely emerge.
        
               | neom wrote:
               | Didn't money basically only emerge to deal with the
               | difficulty of the "double coincidence of wants"? Money
               | simply solved the problem of making all forms of value
               | interchangeable and transportable across time AND
               | circumstance. A dollar can do that with or without AI
               | existing, no?
        
               | whimsicalism wrote:
               | Yes, that's my point
        
               | staticman2 wrote:
               | You seem to be arguing that large unemployment rates are
               | logically impossible, so we shouldn't worry about
               | unemployment.
               | 
               | The fact unemployment was 25% during the great depression
               | would seem to suggest that at a minimum, a 25%
               | unemployment rate is possible during a disruptive event.
        
               | astrange wrote:
               | The unemployment rate in a modern economy is basically
               | whatever the central bank wants it to be. The Great
               | Depression was caused by bad monetary policy - I don't
               | see a reason why having AI would cause that.
        
               | staticman2 wrote:
               | The person upthread was saying that as long as someone
               | wants a house built and someone wants a grandma driven
               | around unemployment can't happen.
               | 
               | Unless nobody wanted either of those things done during
               | the depression that's clearly not a very good mental
               | model.
        
               | astrange wrote:
               | Yes, I disagree with that. The problem isn't the lack of
               | demand, it's that the people with the demand can't get
               | the money to express it with.
        
             | tivert wrote:
             | > This picture doesn't make sense. If most don't have any
             | money to buy products, just invent some other money and
             | start paying one of the other people who doesn't have any
             | money to start making the products for you.
             | 
             | Ultimately, it all comes down to raw materials and similar
             | resources, _and all those will be claimed by people with
             | lots of real money_. Your  "invented ... other money" will
             | be useless to buy that fundamental stuff. At best, it will
             | be useful for trading scrap and other junk among the
             | unemployed.
             | 
             | > In reality, if there really is mass unemployment, AI
             | driven automation will make consumables so cheap that
             | anyone will be able to buy it.
             | 
             | No. Why would the people who own that automation want to
             | waste their resources producing consumer goods for people
             | with nothing to give them in return?
        
               | whimsicalism wrote:
               | if people with AI use it to somehow enclose all raw
               | resources, then yes - the picture i painted will be wrong
        
               | whynotminot wrote:
               | Enclosing raw resources tends to be what powerful people
               | do.
        
               | astrange wrote:
               | "Raw resources" aren't that valuable economically because
               | they aren't where most of the value is added in
               | production. That's why having a lot of them tends to make
               | your country poorer
               | (https://en.wikipedia.org/wiki/Resource_curse).
        
               | Jensson wrote:
               | Today educated humans are more valuable than anything
               | else on earth, but AGI changes that. With cheap AGI raw
               | resources and infrastructure will be the only two
               | valuable things left.
        
             | astrange wrote:
             | > If most don't have any money to buy products, just invent
             | some other money and start paying one of the other people
             | who doesn't have any money to start making the products for
             | you.
             | 
             | This isn't possible if you want to pay sales taxes - those
             | are what keep transactions being done in the official
             | currency. Of course in a world of 99% unemployment
             | presumably we don't care about this.
             | 
             | But yes, this world of 99% unemployment isn't possible, eg
             | because as soon as you have two people and they trade
             | things, they're employed again.
        
           | tivert wrote:
           | > When none of us have jobs or income, there will be no
           | ability for us to buy products. And then no reason for
           | companies to buy ads to sell products to people who don't
           | have money. Without ad money (or the potential of future ad
           | money), the people pushing the bounds of AGI into work
           | replacement will lose the very income streams powering this
           | research and their valuations.
           | 
           | I don't think so. I agree the push for AGI will kill the
           | modern consumer product economy, but I think it's quite
           | possible for the economy to evolve into a new form (that will
           | probably be terrible for most humans) that keep pushes "work
           | replacement."
           | 
           | Imagine an AGI billionaire buying up land, mines, and power
           | plants as the consumer economy dies, then shifting those
           | resources away from the consumer economy into self-
           | aggrandizing pet projects (e.g. ziggurats, penthouses on
           | Mars, space yachts, life extension, and stuff like that). He
           | might still employ a small community of servants, AGI
           | researchers, and other specialists; but all the rest of the
           | population will be irrelevant to him.
           | 
           | And individual autarky probably isn't necessary; consumption
           | will be redirected towards the massive pet projects I
           | mentioned, with vestigial markets for power, minerals, etc.
        
         | RivieraKid wrote:
         | The economic theory answer is that people simply switch to jobs
         | that are not yet replaceable by AI. Doctors, nurses,
         | electricians, construction workers, police officers, etc.
         | People in aggregate will produce more, consume more and work
         | less.
        
           | achierius wrote:
           | > Doctors
           | 
           | Many replaceable
           | 
           | > Police officers
           | 
           | Many replaceable (desk officers)
        
         | drdaeman wrote:
         | > It has well replaced journalists, artists and on its way to
         | replace nearly both junior and senior engineers.
         | 
         | Did it, really? Or did it just provide automation for routine
         | no-thinking-necessary text-writing tasks, while still being
         | ultimately bound by the level of the human operator's
         | intelligence? I strongly suspect it's the latter. If it has
         | actually replaced journalists, it must have been at junk
         | outlets, where readers' intelligence is negligible and
         | anything goes.
         | 
         | Just yesterday I used o1 and Claude 3.5 to debug a Linux
         | kernel issue (ultimately, a bad DSDT table causing the TPM2
         | driver to be unable to reserve a memory region for the command
         | response buffer; the solution was to use memmap to remove the
         | NVS flag from the relevant regions) and confirmed once again
         | that LLMs still don't reason at all - they just spew out
         | plausible-looking chains of words. The models were good
         | listeners and mostly-helpful code generators (when they didn't
         | make the silliest mistakes), but they showed no traces of
         | understanding and paid no attention to nuances (e.g. the LLM
         | used `IS_ERR` to check the `__request_resource` result,
         | despite me giving it the full source code for that function,
         | where there's even a comment that makes it obvious it returns
         | a pointer or NULL, not an error code - a misguided-attention
         | kind of mistake).
         | 
         | So, in my opinion, LLMs (as currently available to broad
         | public, like myself) are useful for automating away some
         | routine stuff, but their usefulness is bounded by the
         | operator's knowledge and intelligence. And that means that the
         | actual jobs (if they require thinking and not just writing
         | words) are safe.
         | 
         | When asked about what I do at work, I used to joke that I just
         | press buttons on my keyboard in fancy patterns. Ultimately,
         | LLMs seem to suggest that it's not what I really do.
        
       | mensetmanusman wrote:
       | I'm super curious as to whether this technology completely
       | destroys the middle class, or if everyone becomes better off
       | because productivity is going to skyrocket.
        
         | mhogers wrote:
         | Is anyone here aware of the latest research that tries to
         | predict the outcome? Please share - super curious as well
        
           | te_chris wrote:
           | There's this https://arxiv.org/pdf/2312.05481v9
        
           | pdfernhout wrote:
           | Some thoughts I put together on all this circa 2010:
           | https://pdfernhout.net/beyond-a-jobless-recovery-knol.html
           | "This article explores the issue of a "Jobless Recovery"
           | mainly from a heterodox economic perspective. It emphasizes
           | the implications of ideas by Marshall Brain and others that
           | improvements in robotics, automation, design, and voluntary
           | social networks are fundamentally changing the structure of
           | the economic landscape. It outlines towards the end four
           | major alternatives to mainstream economic practice (a basic
           | income, a gift economy, stronger local subsistence economies,
           | and resource-based planning). These alternatives could be
           | used in combination to address what, even as far back as
           | 1964, has been described as a breaking "income-through-jobs
           | link". This link between jobs and income is breaking because
           | of the declining value of most paid human labor relative to
           | capital investments in automation and better design. Or, as
           | is now the case, the value of paid human labor like at some
           | newspapers or universities is also declining relative to the
           | output of voluntary social networks such as for digital
           | content production (like represented by this document). It is
           | suggested that we will need to fundamentally reevaluate our
           | economic theories and practices to adjust to these new
           | realities emerging from exponential trends in technology and
           | society."
        
         | tivert wrote:
         | > I'm super curious as to whether this technology completely
         | destroys the middle class, or if everyone becomes better off
         | because productivity is going to skyrocket.
         | 
         | Even if productivity skyrockets, why would anyone assume the
         | dividends would be shared with the "destroy[ed] middle class"?
         | 
         | All indications will be this will end up like the China Shock:
         | "I lost my middle class job, and all I got was the opportunity
         | to buy flimsy pieces of crap from a dollar store." America
         | lacks the ideological foundations for any other result, and the
         | coming economic changes will likely make building those
         | foundations even more difficult if not impossible.
        
           | rohan_ wrote:
           | Because access to the financial system was democratized ten
           | years ago
        
             | tivert wrote:
             | > Because access to the financial system was democratized
             | ten years ago
             | 
             | Huh? I'm not sure exactly what you're talking about, but
             | mere "access to the financial system" wouldn't remedy
             | anything, because of inequality, etc.
             | 
             | To survive the shock financially, I think one would have to
             | have at least enough capital to be a capitalist.
        
       | croemer wrote:
       | The programming task they gave o3-mini high (creating a Python
       | server that allows chatting with the OpenAI API and running some
       | code in the terminal) didn't seem very hard? Strange choice of
       | example for something that's claimed to be a big step forward.
       | 
       | YT timestamped link:
       | https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for
       | the fixed link @photonboom)
       | 
       | Updated: I gave the task to Claude 3.5 Sonnet and it worked first
       | shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-
       | faa5aa...
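
       (For context on why the task reads as easy: a minimal sketch of
       the kind of server described - one endpoint that forwards a chat
       message to the OpenAI API and one that runs a shell command. The
       route names, the model name, and the lack of auth/sandboxing are
       assumptions for illustration; running arbitrary commands like this
       is unsafe outside a demo.)

           import subprocess
           from flask import Flask, request, jsonify
           from openai import OpenAI

           app = Flask(__name__)
           client = OpenAI()  # reads OPENAI_API_KEY from the environment

           @app.route("/chat", methods=["POST"])
           def chat():
               prompt = request.get_json()["prompt"]
               resp = client.chat.completions.create(
                   model="gpt-4o-mini",   # placeholder model name
                   messages=[{"role": "user", "content": prompt}],
               )
               return jsonify({"reply": resp.choices[0].message.content})

           @app.route("/run", methods=["POST"])
           def run():
               cmd = request.get_json()["cmd"]
               out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
               # Return both streams and the exit code so callers can compose on it
               return jsonify({"stdout": out.stdout, "stderr": out.stderr,
                               "returncode": out.returncode})

           if __name__ == "__main__":
               app.run(port=8000)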
        
         | bearjaws wrote:
         | It's good that it works since if you ask GPT-4o to use the
         | openai sdk it will often produce invalid and out of date code.
        
         | m3kw9 wrote:
         | I would say they didn't need to demo anything, because if you
         | are gonna use the output code live in a demo it may produce
         | compile errors and then they'd look stupid trying to fix them
         | live.
        
           | croemer wrote:
           | If it was a safe bet problem, then they should have said
           | that. To me it looks like they faked excitement for something
           | not exciting, which lowers the credibility of the whole
           | presentation.
        
           | sunaookami wrote:
           | They actually did that the last time when they showed the
           | apps integration. First try in Xcode didn't work.
        
             | m3kw9 wrote:
             | Yeah I think that time it was ok because they were demoing
             | the app function, but for this they are demoing the model
             | smarts
        
           | csomar wrote:
           | Models are predictable at temperature 0. They might have
           | tested the output beforehand.
        
             | fzzzy wrote:
             | Models in practice haven't been deterministic at 0
             | temperature, although nobody knows exactly why. Either
             | hardware or software bugs.
        
               | Jensson wrote:
               | We know exactly why, it is because floating point
               | operations aren't associative but the GPU scheduler
               | assumes they are, and the scheduler isn't deterministic.
               | Running the model in a strictly deterministic order
               | hurts performance, so they don't do that.
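
       (A minimal illustration of the non-associativity point with plain
       Python floats - the GPU scheduling itself isn't shown, just why
       re-ordering a reduction can change the result bit-for-bit.)

           a, b, c = 1e16, -1e16, 1.0

           print((a + b) + c)   # 1.0
           print(a + (b + c))   # 0.0 -- the 1.0 is absorbed by the large magnitude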
        
         | photonboom wrote:
         | here's the right timestamp:
         | https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s
        
         | phil917 wrote:
         | Yeah I agree that wasn't particularly mind blowing to me and
         | seems fairly in line with what existing SOTA models can do.
         | Especially since they did it in steps. Maybe I'm missing
         | something.
        
         | MyFirstSass wrote:
         | What? Is this what this is? Either this is a complete joke or
         | we're missing something.
         | 
         | I've been doing similar stuff in Claude for months and it's not
         | that impressive when you see how limited they really are once
         | you go beyond boilerplate.
        
         | HeatrayEnjoyer wrote:
         | Sonnet isn't a "mini" sized model. Try it with Haiku.
        
           | croemer wrote:
           | How mini is o3-mini compared to Sonnet and why does it matter
           | whether it's mini or not? Isn't the point of the demo to show
           | what's now possible that wasn't before?
           | 
           | 4o is cheaper than o1 mini so mini doesn't mean much for
           | costs.
        
         | zelphirkalt wrote:
         | Looks like quite shoddy code though. Like, the procedure for
         | running a shell command is pure side-effect procedural code,
         | neither returning the exit code of the command nor its output.
         | Like the incomplete stackoverflow answer it probably was
         | trained from. It might do one job at a time, but once this
         | stuff gets integrated into one coherent thing, one needs to
         | rewrite lots of it, to actually be composable.
         | 
         | Though of course one can argue that lots of human-written
         | code is not much different from this.
        
       | tripletao wrote:
       | Their discussion contains an interesting aside:
       | 
       | > Moreover, ARC-AGI-1 is now saturating - besides o3's new score,
       | the fact is that a large ensemble of low-compute Kaggle solutions
       | can now score 81% on the private eval.
       | 
       | So while these tasks get greatest interest as a benchmark for
       | LLMs and other large general models, it doesn't yet seem obvious
       | those outperform human-designed domain-specific approaches.
       | 
       | I wonder to what extent the large improvement comes from OpenAI
       | training deliberately targeting this class of problem. That
       | result would still be significant (since there's no way to
       | overfit to the private tasks), but would be different from an
       | "accidental" emergent improvement.
        
       | Bjorkbat wrote:
       | I was impressed until I read the caveat about the high-compute
       | version using 172x more compute.
       | 
       | Assuming for a moment that the cost per task has a linear
       | relationship with compute, then it costs a little more than $1
       | million to get that score on the public eval.
       | 
       | The results are cool, but man, this sounds like such a busted
       | approach.
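
       (Rough arithmetic behind that estimate. The ~$20-per-task figure
       for the low-compute setting, the 172x multiplier, and the 400-task
       public eval size are assumptions based on the ARC Prize write-up
       and this thread, so treat the result as an order-of-magnitude
       sketch.)

           cost_per_task_low = 20      # USD per task, low-compute setting (assumed)
           compute_multiplier = 172    # high-compute used ~172x more compute
           num_tasks = 400             # public eval size (assumed)

           estimate = cost_per_task_low * compute_multiplier * num_tasks
           print(f"~${estimate:,}")    # ~$1,376,000 -- "a little more than $1 million"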
        
         | futureshock wrote:
         | So what? I'm serious. Our current level of progress would have
         | been sci-fi fantasy with the computers we had in 2000. The cost
         | may be astronomical today, but we have proven a method to
         | achieve human performance on tests of reasoning over novel
         | problems. WOW. Who cares what it costs. In 25 years it will run
         | on your phone.
        
           | Bjorkbat wrote:
           | It's not so much the cost as the fact that they got a
           | slightly better result by throwing 172x more compute per
           | task. The fact that it may have cost somewhere north of
           | $1 million simply helps give a better idea of how absurd
           | the approach is.
           | 
           | It feels a lot less like a breakthrough when the solution
           | looks so much like simple brute-forcing.
           | 
           | But you might be right, who cares? Does it really matter how
           | crude the solution is if we can achieve true AGI and bring
           | the cost down by increasing the efficiency of compute?
        
             | futureshock wrote:
             | "Simply brute-forcing"
             | 
             | That's the thing that's interesting to me though and I had
             | the same first reaction. It's a very different problem than
             | brute-forcing chess. It has one chance to come to the
             | correct answer. Running through thousands or millions of
             | options means nothing if the model can't determine which is
             | correct. And each of these visual problems involve
             | combinations of different interacting concepts. To solve
             | them requires understanding, not mimicry. So no matter how
             | inefficient and "stupid" these models are, they can be said
             | to understand these novel problems. That's a direct counter
             | to everyone who ever called these models stochastic parrots
             | and said they were a dead end on the road to AGI, only ever
             | searching an in-distribution training set.
             | 
             | The compute costs are currently disappointing, but so was
             | the cost of sequencing the first whole human genome. That
             | went from 3 billion to a few hundred bucks from your local
             | doctor.
        
           | radioactivist wrote:
           | So your claim for optimism here is that something today that
           | took ~10^22 floating point operations (based on an estimate
           | earlier in the thread) to execute will be running on phones
           | in 25 years? Phones which are currently running at O(10^12)
           | flops. That means ten orders of magnitude of improvement for
           | that to run in a reasonable amount of time. It's a similar
           | scale-up to going from ENIAC (500 flops) to a modern desktop
           | (5-10 teraflops).
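
       (Putting numbers on that, taking the ~10^22 FLOP figure from the
       earlier estimate at face value and ~10^12 FLOP/s as a ballpark for
       a phone:)

           total_flops = 1e22            # estimated ops for the high-compute run (assumed)
           phone_flops_per_s = 1e12      # ballpark sustained throughput of a phone SoC

           seconds = total_flops / phone_flops_per_s
           years = seconds / (3600 * 24 * 365)
           print(f"{seconds:.0e} s ~= {years:,.0f} years")   # 1e+10 s ~= 317 years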
        
             | futureshock wrote:
             | That sounds reasonable to me because the compute cost for
             | this level of reasoning performance won't stay at 10^22 and
             | phones won't stay at 10^12. This reasoning breakthrough is
             | about 3 months old.
        
               | radioactivist wrote:
               | I think expecting five _orders of magnitude_ improvement
               | from either side of this (inference cost or phone
               | performance) is insane.
        
       | onemetwo wrote:
       | In (1) the author uses a technique to improve the performance of
       | an LLM: he trained Sonnet 3.5 to obtain 53.6% on the ARC-AGI-Pub
       | benchmark, and moreover he said that more compute would give
       | better results. So the results of o3 could perhaps be produced in
       | the same way using the same method with more compute; if that is
       | the case, the result of o3 is not very interesting.
       | 
       | (1) https://params.com/@jeremy-berman/arc-agi
        
       | TypicalHog wrote:
       | This is actually mindblowing!
        
       | blixt wrote:
       | These results are fantastic. Claude 3.5 and o1 are already good
       | enough to provide value, so I can't wait to see how o3 performs
       | comparatively in real-world scenarios.
       | 
       | But I gotta say, we must be saturating just about any zero-shot
       | reasoning benchmark imaginable at this point. And we will still
       | argue about whether this is AGI, in my opinion because these LLMs
       | are forgetful and it's very difficult for an application
       | developer to fix that.
       | 
       | Models will need better ways to remember and learn from doing a
       | task over and over. For example, let's look at code agents: the
       | best we can do, even with o3, is to cram as much of the code base
       | as we can fit into a context window. And if it doesn't fit we
       | branch out to multiple models to prune the context window until
       | it does fit. And here's the kicker - the second time you ask for
       | it to do something this all starts over from zero again. With
       | this amount of reasoning power, I'm hoping session-based learning
       | becomes the next frontier for LLM capabilities.
       | 
       | (There are already things like tool use, linear attention, RAG,
       | etc that can help here but currently they come with downsides and
       | I would consider them insufficient.)
        
       | vessenes wrote:
       | This feels like big news to me.
       | 
       | First of all, ARC is definitely an intelligence test for autistic
       | people. I say as someone with a tad of the neurodiversity. That
       | said, I think it's a pretty interesting one, not least because as
       | you go up in the levels, it requires (for a human) a fair amount
       | of lateral thinking and analogy-type thinking, and of course, it
       | requires that this go in and out of visual representation. That
       | said, I think it's a bit funny that most of the people training
       | these next-gen AIs are neurodiverse and we are training the AI in
       | our own image. I continue to hope for some poet and painter-
       | derived intelligence tests to be added to the next gen tests we
       | all look at and score.
       | 
       | For those reasons, I've always really liked ARC as a test -- not
       | as some be-all end-all for AGI, but just because I think that the
       | most intriguing areas next for LLMs are in these analogy arenas
       | and ability to hold more cross-domain context together for
       | reasoning and etc.
       | 
       | Prompts that are interesting to play with right now on these
       | terms range from asking multimodal models to say count to ten in
       | a Boston accent, and then propose a regional french accent that's
       | an equivalent and count to ten in that. (To my ear, 4o is
       | unconvincing on this). Similar in my mind is writing and
       | architecting code that crosses multiple languages and APIs, and
       | asking for it to be written in different styles. (claude and
       | o1-pro are .. okay at this, depending).
       | 
       | Anyway. I agree that this looks like a large step change. I'm not
       | sure if the o3 methods here involve the spinning up of clusters
       | of python interpreters to breadth-search for solutions -- a
       | method used to make headway on ARC in the past; if so, this is
       | still big, but I think less exciting than if the stack is close
       | to what we know today, and the compute time is just more
       | introspection / internal beam search type algorithms.
       | 
       | Either way, something had to assess answers and think they were
       | right, and this is a HUGE step forward.
        
         | jamiek88 wrote:
         | > most of the people training these next-gen AIs are
         | neurodiverse
         | 
         | Citation needed. This is a huge claim based only on stereotype.
        
           | vessenes wrote:
           | So true. Perhaps I'm just thinking it's my people and need to
           | update my priors.
        
         | getpost wrote:
         | > most of the people training these next-gen AIs are
         | neurodiverse and we are training the AI in our own image
         | 
         | Do you have any evidence to support that? It would be
         | fascinating if the field is primarly advancing due to a unique
         | constellation of traits contributed by individuals who, in the
         | past, may not have collaborated so effectively.
        
           | vessenes wrote:
           | PURELY Anecdotal. But I'll say that as of 2024 1 in 36 US
           | children are diagnosed on the spectrum according to the
           | CDC(!), which would mean if you met 10 AI researchers and 4
           | were neurodivergent you'd reasonably expect that it's a
           | higher-than-population average representation. I'm polling
           | from the Effective Altruist AI folks in my mind, and the
           | number is definitely, definitely higher than 4/10.
        
             | EVa5I7bHFq9mnYK wrote:
             | Are there non-Effective Altruist AI folks?
        
               | vessenes wrote:
               | I love how this might mean "non-Effective",
               | non-"Effective Altruist" or non-"Effective Altruist AI"
               | folks.
               | 
               | Yes
        
       | nopinsight wrote:
       | Let me go against some skeptics and explain why I think full o3
       | is pretty much AGI or at least embodies most essential aspects of
       | AGI.
       | 
       | What has been lacking so far in frontier LLMs is the ability to
       | reliably deal with the right level of abstraction for a given
       | problem. Reasoning is useful but often comes out lacking if one
       | cannot reason at the right level of abstraction. (Note that many
       | humans can't either when they deal with unfamiliar domains,
       | although that is not the case with these models.)
       | 
       | ARC has been challenging precisely because solving its problems
       | often requires:
       | 
       | 1) using multiple different *kinds* of core knowledge [1], such
       | as symmetry, counting, color, AND
       | 
       | 2) using the right level(s) of abstraction
       | 
       | Achieving human-level performance in the ARC benchmark, _as well
       | as_ top human performance in GPQA, Codeforces, AIME, and Frontier
       | Math suggests the model can potentially solve any problem at the
       | human level if it possesses essential knowledge about it. Yes,
       | this includes out-of-distribution problems that most humans can
       | solve.
       | 
       | It might not _yet_ be able to generate highly novel theories,
       | frameworks, or artifacts to the degree that Einstein,
       | Grothendieck, or van Gogh could. But not many humans can either.
       | 
       | [1] https://www.harvardlds.org/wp-
       | content/uploads/2017/01/Spelke...
       | 
       | ADDED:
       | 
       | Thanks to the link to Chollet's posts by lswainemoore below. I've
       | analyzed some easy problems that o3 failed at. They involve
       | spatial intelligence, including connection and movement. This
       | skill is very hard to learn from textual and still image data.
       | 
       | I believe this sort of core knowledge is learnable through
       | movement and interaction data in a simulated world and it will
       | _not_ present a very difficult barrier to cross. (OpenAI
       | purchased a company behind a Minecraft clone a while ago. I've
       | wondered if this is the purpose.)
        
         | xvector wrote:
         | Agree. AGI is here. I feel such a sense of pride in our
         | species.
        
         | timabdulla wrote:
         | What's your explanation for why it can only get ~70% on SWE-
         | bench Verified?
         | 
         | I believe about 90% of the tasks were estimated by humans to
         | take less than one hour to solve, so we aren't talking about
         | very complex problems, and to boot, the contamination factor is
         | huge: o3 (or any big model) will have in-depth knowledge of the
         | internals of these projects, and often even know about the
          | individual issues themselves (e.g. you can ask what GitHub
          | issue #4145 in project foo was, and there's a decent chance it
          | can tell you exactly what the issue was about!)
        
           | slewis wrote:
           | I've spent tons of time evaluating o1-preview on SWEBench-
           | Verified.
           | 
           | For one, I speculate OpenAI is using a very basic agent
           | harness to get the results they've published on SWEBench. I
           | believe there is a fair amount of headroom to improve results
           | above what they published, using the same models.
           | 
           | For two, some of the instances, even in SWEBench-Verified,
           | require a bit of "going above and beyond" to get right. One
           | example is an instance where the user states that a TypeError
           | isn't properly handled. The developer who fixed it handled
           | the TypeError but also handled a ValueError, and the golden
           | test checks for both. I don't know how many instances fall in
            | this category, but I suspect it's more than on a simpler
           | benchmark like MATH.
        
           | nopinsight wrote:
           | One possibility is that it may not yet have sufficient
           | _experience and real-world feedback_ for resolving coding
           | issues in professional repos, as this involves multiple steps
           | and very diverse actions (or branching factor, in AI terms).
           | They have committed to not training on API usage, which
           | limits their ability to directly acquire training data from
           | it. However, their upcoming agentic efforts may address this
           | gap in training data.
        
             | timabdulla wrote:
             | Right, but the branching factor increases exponentially
             | with the scope of the work.
             | 
             | I think it's obvious that they've cracked the formula for
             | solving well-defined, small-in-scope problems at a
             | superhuman level. That's an amazing thing.
             | 
             | To me, it's less obvious that this implies that they will
             | in short order with just more training data be able to
             | solve ambiguous, large-in-scope problems at even just a
             | skilled human level.
             | 
             | There are far more paths to consider, much more context to
             | use, and in an RL setting, the rewards are much more
             | ambiguously defined.
        
               | nopinsight wrote:
               | Their reasoning models can learn from procedures and
               | methods, which generalize far better than data. Software
               | tasks are diverse but most tasks are still fairly limited
               | in scope. Novel tasks might remain challenging for these
               | models, as they do for humans.
               | 
               | That said, o3 might still lack some kind of interaction
               | intelligence that's hard to learn. We'll see.
        
         | Imnimo wrote:
         | >Achieving human-level performance in the ARC benchmark, as
          | well as top human performance in GPQA, Codeforces, AIME, and
         | Frontier Math strongly suggests the model can potentially solve
         | any problem at the human level if it possesses essential
         | knowledge about it.
         | 
         | The article notes, "o3 still fails on some very easy tasks".
         | What explains these failures if o3 can solve "any problem" at
         | the human level? Do these failed cases require some essential
         | knowledge that has eluded the massive OpenAI training set?
        
           | nopinsight wrote:
           | Great point. I'd love to see what these easy tasks are and
           | would be happy to revise my hypothesis accordingly. o3's
           | intelligence is unlikely to be a strict superset of human
           | intelligence. It is certainly superior to humans in some
           | respects and probably inferior in others. Whether it's
           | sufficiently generally intelligent would be both a matter of
           | definition and empirical fact.
        
             | Imnimo wrote:
             | Chollet has a few examples here:
             | 
             | https://x.com/fchollet/status/1870172872641261979
             | 
             | https://x.com/fchollet/status/1870173137234727219
             | 
             | I would definitely consider them legitimately easy for
             | humans.
        
               | nopinsight wrote:
               | Thanks! I added some comments on this at the bottom of
               | the post above.
        
         | phil917 wrote:
          | Quote from the creators of the ARC-AGI benchmark: "Passing ARC-
          | AGI does not equate to achieving AGI, and, as a matter of fact, I
         | don't think o3 is AGI yet. o3 still fails on some very easy
         | tasks, indicating fundamental differences with human
         | intelligence."
        
           | CooCooCaCha wrote:
           | Yeah the real goalpost is _reliable_ intelligence. A supposed
           | phd level AI failing simple problems is a red flag that we're
           | still missing something.
        
             | gremlinsinc wrote:
             | You've never met a Doctor who couldn't figure out how to
              | work their email? Or use street smarts? You can have a PhD
             | but be unable to reliably handle soft skills, or any number
             | of things you might 'expect' someone to be able to do.
             | 
              | Just playing devil's advocate or nitpicking the language a
             | bit...
        
               | CooCooCaCha wrote:
               | An important distinction here is you're comparing skill
               | across very different tasks.
               | 
               | I'm not even going that far, I'm talking about
               | performance on similar tasks. Something many people have
               | noticed about modern AI is it can go from genius to baby-
               | level performance seemingly at random.
               | 
               | Take self driving cars for example, a reasonably
               | intelligent human of sound mind and body would never
               | accidentally mistake a concrete pillar for a road. Yet
               | that happens with self-driving cars, and seemingly here
               | with ARC-AGI problems which all have a similar flavor.
        
               | nuancebydefault wrote:
                | A coworker of mine has a PhD in physics. Showing him the
                | difference between little and big endian in a hex editor,
                | showing file sizes of raw image files and how to compute
                | them... I explained it 3 times and maybe he understands
                | part of it now.
        
               | manquer wrote:
                | Doctors[1] or, say, pilots are skilled professions that
                | are difficult to master and deserve respect, yes, but
                | being good at them does not require a high level of
                | intelligence. They require many other hard skills, like
                | making decisions under pressure or good motor skills, but
                | not necessarily intelligence.
                | 
                | Also, not knowing something is hardly a criterion;
                | skilled humans focus on their areas of interest above
                | most other knowledge and can be unaware of other
                | subjects.
                | 
                | Fields medal winners, for example, may not be aware of
                | most pop culture things. That doesn't make them unable to
                | learn them, just not interested.
                | 
                | ---
                | 
                | [1] most doctors, including surgeons and many respected
                | specialists; some doctors do need those skills, but they
                | are a specialized few and generally do know how to use
                | email
        
               | intended wrote:
                | Good nitpick.
                | 
                | A PhD learnt their field. If they learnt that field by
                | reasoning through everything to understand the material,
                | then, given enough time, they are capable of learning
                | email and street smarts.
                | 
                | Which is why a reasoning LLM should be able to do all of
                | those things.
                | 
                | It hasn't learnt a subject; it has learnt reasoning.
        
           | nopinsight wrote:
           | I'd need to see what kinds of easy tasks those are and would
           | be happy to revise my hypothesis if that's warranted.
           | 
           | Also, it depends a great deal on what we define as AGI and
           | whether they need to be a strict superset of typical human
           | intelligence. o3's intelligence is probably superhuman in
           | some aspects but inferior in others. We can find many humans
           | who exhibit such tendencies as well. We'd probably say they
           | think differently but would still call them generally
           | intelligent.
        
             | lswainemoore wrote:
             | They're in the original post. Also here:
             | https://x.com/fchollet/status/1870172872641261979 /
             | https://x.com/fchollet/status/1870173137234727219
             | 
             | Personally, I think it's fair to call them "very easy". If
             | a person I otherwise thought was intelligent was unable to
             | solve these, I'd be quite surprised.
        
               | nopinsight wrote:
               | Thanks! I've analyzed some easy problems that o3 failed
               | at. They involve spatial intelligence including
               | connection and movement. This skill is very hard to learn
               | from textual and still image data.
               | 
               | I believe this sort of core knowledge is learnable
               | through movement and interaction data in a simulated
               | world and it will not present a very difficult barrier to
               | cross.
               | 
               | (OpenAI purchased a company behind a Minecraft clone a
               | while ago. I've wondered if this is the purpose.)
        
               | lswainemoore wrote:
               | > I believe this sort of core knowledge is learnable
               | through movement and interaction data in a simulated
               | world and it will not present a very difficult barrier to
               | cross.
               | 
               | Maybe! I suppose time will tell. That said, spatial
               | intelligence (connection/movement included) is the whole
               | game in this evaluation set. I think it's revealing that
               | they can't handle these particular examples, and
               | problematic for claims of AGI.
        
               | MVissers wrote:
               | Probably just not trained on this kind of data. We could
               | create a benchmark about it, and they'd shatter it within
               | a year or so.
               | 
               | I'm starting to really see no limits on intelligence in
               | these models.
        
               | sungho_ wrote:
               | Doesn't the fact that it can only accomplish tasks with
               | benchmarks imply that it has limitations in intelligence?
        
               | qup wrote:
               | > Doesn't the fact that it can only accomplish tasks with
               | benchmarks
               | 
               | That's not a fact
        
               | PoignardAzur wrote:
               | > _This skill is very hard to learn from textual and
               | still image data._
               | 
               | I had the same take at first, but thinking about it
               | again, I'm not quite sure?
               | 
               | Take the "blue dots make a cross" example (the second
                | one). The input only has four blue dots, which makes it
               | very easy to see a pattern even in text data: two of them
               | have the same x coordinate, two of them have the same y
               | (or the same first-tuple-element and second-tuple-element
               | if you want to taboo any spatial concepts).
               | 
               | Then if you look into the output, you can notice that all
               | the input coordinates are also in the output set, just
               | not always with the same color. If you separate them into
               | "input-and-output" and "output-only", you quickly notice
               | that all of the output-only squares are blue and share a
               | coordinate (tuple-element) with the blue inputs. If you
               | split the "input-and-output" set into "same color" and
               | "color changed", you can notice that the changes only go
               | from red to blue, and that the coordinates that changed
               | are clustered, and at least one element of the cluster
               | shares a coordinate with a blue input.
               | 
               | Of course, it's easy to build this chain of reasoning in
               | retrospect, but it doesn't seem like a complete stretch:
               | each step only requires noticing patterns in the data,
               | and it's how a reasonably puzzle-savvy person might solve
                | this if you didn't let them draw the squares on paper.
                | There are a lot of escape games with chains of reasoning
                | much more complex than this, and random office workers
                | solve them all the time.
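                | 
                | (A rough, purely illustrative sketch of the first step in
                | Python; representing each grid as a dict mapping (x, y)
                | -> color is my own choice, not anything o3 actually
                | does:)
                | 
                |     def diff_grids(inp, out):
                |         # split cells into kept / recolored / new
                |         kept = {p for p in inp
                |                 if p in out and out[p] == inp[p]}
                |         changed = {p for p in inp
                |                    if p in out and out[p] != inp[p]}
                |         new = {p for p in out if p not in inp}
                |         return kept, changed, new
                | 
                | From there it's mostly eyeballing the three sets for
                | shared coordinates, color flips, and clusters.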
               | 
               | The visual aspect makes the patterns jump to us more, but
               | the fact that o3 couldn't find them at all with thousands
               | of dollars of compute budget still seems meaningful to
               | me.
               | 
               | EDIT: Actually, looking at Twitter discussions[1], o3
               | _did_ find those patterns, but was stumped by ambiguity
                | in the test input that the examples didn't cover. Its
                | failures on the "cascading rectangles" example[2] look
                | much more interesting.
               | 
               | [1]:
               | https://x.com/bio_bootloader/status/1870339297594786064
               | 
               | [2]: https://x.com/_AI30_/status/1870407853871419806
        
           | 93po wrote:
            | they say it isn't AGI but i think the way o3 functions can be
            | refined to AGI - it's learning to solve new, novel problems.
            | we just need to make it do that more consistently, which
            | seems achievable
        
           | qnleigh wrote:
           | I like the notion, implied in the article, that AGI won't be
           | verified by any single benchmark, but by our collective
           | inability to come up with benchmarks that defeat some
           | eventual AI system. This matches the cat-and-mouse game we've
           | been seeing for a while, where benchmarks have to constantly
           | adapt to better models.
           | 
           | I guess you can say the same thing for the Turing Test.
           | Simple chat bots beat it ages ago in specific settings, but
           | the bar is much higher now that the average person is
           | familiar with their limitations.
           | 
           | If/once we have an AGI, it will probably take weeks to months
           | to really convince ourselves that it is one.
        
         | nyrikki wrote:
         | GPQA scores are mostly from pre-training, against content in
         | the corpus. They have gone silent but look at the GPT4
         | technical report which calls this out.
         | 
         | We are nowhere close to what Sam Altman calls AGI and
         | transformers are still limited to what uniform-TC0 can do.
         | 
         | As an example the Boolean Formula Value Problem is
         | NC1-complete, thus beyond transformers but trivial to solve
         | with a TM.
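          | 
          | (To make the "trivial to solve with a TM" point concrete, here
          | is a toy linear-time evaluator; the nested-tuple encoding of a
          | formula is just my own sketch:)
          | 
          |     def evaluate(f):
          |         # f is True/False or ("not", a), ("and", a, b),
          |         # ("or", a, b)
          |         if isinstance(f, bool):
          |             return f
          |         if f[0] == "not":
          |             return not evaluate(f[1])
          |         if f[0] == "and":
          |             return evaluate(f[1]) and evaluate(f[2])
          |         return evaluate(f[1]) or evaluate(f[2])
          | 
          | e.g. evaluate(("and", True, ("not", False))) returns True.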
         | 
         | As it is now proven that the frame problem is equivalent to the
         | halting problem, even if we can move past uniform-TC0 limits,
         | novelty is still a problem.
         | 
         | I think the advancements are truly extraordinary, but unless
         | you set the bar very low, we aren't close to AGI.
         | 
         | Heck we aren't close to P with commercial models.
        
           | sebzim4500 wrote:
           | Isn't any physically realizable computer (including our
           | brains) limited to what uniform-TC0 can do?
        
             | drdeca wrote:
             | Do you just mean because any physically realizable computer
             | is a finite state machine? Or...?
             | 
             | I wouldn't describe a computer's usual behavior as having
             | constant depth.
             | 
             | It is fairly typical to talk about problems in P as being
             | feasible (though when the constant factors are too big,
             | this isn't strictly true of course).
             | 
             | Just because for unreasonably large inputs, my computer
             | can't run a particular program and produce the correct
             | answer for that input, due to my computer running out of
             | memory, we don't generally say that my computer is
             | fundamentally incapable of executing that algorithm.
        
             | nyrikki wrote:
             | Neither TC0 nor uniform-TC0 are physically realizable, they
             | are tools not physical devices.
             | 
              | The default nonuniform circuit classes are allowed to have
              | a different circuit per input size; the uniform types have
              | unbounded fan-in.
             | 
             | Similar to how a k-tape TM doesn't get 'charged' for the
             | input size.
             | 
              | With Nick's Class (NC) the number of components is similar
              | to traditional compute time, while depth relates to the
              | ability to parallelize operations.
             | 
             | These are different than biological neurons, not better or
             | worse but just different.
             | 
             | Human neurons can use dendritic compartmentalization, use
             | spike timing, can retime spikes etc...
             | 
             | While the perceptron model we use in ML is useful, it is
             | not able to do xor in one layer, while biological neurons
             | do that without anything even reaching the soma, purely in
             | the dendrites.
             | 
              | Statistical learning models still come down to a choice
             | function, no matter if you call that set shattering or...
             | 
             | With physical computers the time hierarchy does apply and
             | if TIME(g(n)) is given more time than TIME(f(n)), g(n) can
             | solve more problems.
             | 
             | So you can simulate a NTM with exhaustive search with a
             | physical computer.
             | 
             | Physical computers also tend to have NAND and XOR gates,
             | and can have different circuit depths.
             | 
             | When you are in TC0, you only have AND, OR and Threshold
             | (or majority) gates.
             | 
             | Think of instruction level parallelism in a typical CPU, it
             | can return early, vs Itanium EPIC, which had to wait for
             | the longest operation. Predicated execution is also how
             | GPUs work.
             | 
              | They can send a mask and save on load/store ops, as an
              | example, but the cost of that parallelism is the constant
              | depth.
             | 
             | It is the parallelism tradeoff that both makes transformers
             | practical as well as limit what they can do.
             | 
             | The IID assumption and autograd requiring smooth manifolds
              | play a role too.
             | 
             | The frame problem, which causes hard problems to become
              | unsolvable for computers and people alike, does also.
             | 
             | But the fact that we have polynomial time solutions for the
             | Boolean Formula Value Problem, as mentioned in my post
             | above is probably a simpler way of realizing physical
             | computers aren't limited to uniform-TC0.
        
         | norir wrote:
         | Personally I find "human-level" to be a borderline meaningless
         | and limiting term. Are we now super human as a species relative
         | to ourselves just five years ago because of our advances in
         | developing computer programs that better imitate what many (but
         | far from all) of us were already capable of doing? Have we
         | reached a limit to human potential that can only be surpassed
         | by digital machines? Who decides what human level is and when
         | we have surpassed it? I have seen some ridiculous claims about
         | ai in art that don't stand up to even the slightest scrutiny by
         | domain experts but that easily fool the masses.
        
           | razodactyl wrote:
           | No I think we're just tired and depressed as a species...
           | Existing systems work to a degree but aren't living up to
           | their potential of increasing happiness according to
           | technological capabilities.
        
         | PaulDavisThe1st wrote:
         | > It might not yet be able to generate highly novel theories,
         | frameworks, or artifacts to the degree that Einstein,
         | Grothendieck, or van Gogh could.
         | 
         | Every human does this dozens, hundreds or thousands of times
         | ... during childhood.
        
         | ec109685 wrote:
         | The problem with ARC is that there are a finite number of
         | heuristics that could be enumerated and trained for, which
         | would give model a substantial leg up on this evaluation, but
         | not be generalized to other domains.
         | 
         | For example, if they produce millions of examples of the type
         | of problems o3 still struggles on, it would probably do better
         | at similar questions.
         | 
         | Perhaps the private data set is different enough that this
         | isn't a problem, but the ideal situation would be unveiling a
         | truly novel dataset, which it seems like arc aims to do.
        
         | golol wrote:
         | In order to replace actual humans doing their job I think LLMs
         | are lacking in judgement, sense of time and agenticism.
        
           | Kostchei wrote:
            | I mean, fuck me when they have those things. However, maybe
            | they are just lazy and their judgement is fine, for a lazy
            | intelligence. Inner-self thinks "why are these bastards
            | asking me to do this?". I doubt that is actually happening,
            | but now... prove it isn't.
        
         | puttycat wrote:
         | Great comment. See this as well for another potential reason
         | for failure:
         | 
         | https://arxiv.org/abs/2402.10013
        
         | dimitri-vs wrote:
         | Have we really watered down the definition of AGI that much?
         | 
         | LLMs aren't really capable of "learning" anything outside their
         | training data. Which I feel is a very basic and fundamental
         | capability of humans.
         | 
         | Every new request thread is a blank slate utilizing whatever
          | context you provide for the specific task, and after the thread
         | is done (or context limit runs out) it's like it never
         | happened. Sure you can use databases, do web queries, etc. but
         | these are inflexible bandaid solutions, far from what's needed
         | for AGI.
        
           | theptip wrote:
           | > LLMs aren't really capable of "learning" anything outside
           | their training data.
           | 
           | ChatGPT has had for some time the feature of storing memories
           | about its conversations with users. And you can use function
           | calling to make this more generic.
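            | 
            | (A minimal sketch of the idea, independent of any particular
            | vendor API; the names here are made up:)
            | 
            |     memory = []  # stand-in for a persistent store
            | 
            |     def save_memory(note: str) -> str:
            |         # exposed to the model as a callable tool
            |         memory.append(note)
            |         return "saved"
            | 
            |     def recall(query: str) -> list[str]:
            |         # naive retrieval; swap in embeddings if needed
            |         return [m for m in memory
            |                 if query.lower() in m.lower()]
            | 
            | The "learning" then lives in the scaffolding rather than in
            | the weights.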
           | 
           | I think drawing the boundary at "model + scaffolding" is more
           | interesting.
        
             | dimitri-vs wrote:
             | Calling the sentence or two it arbitrarily saves when you
              | state your preferences and profile info "memories" is a
             | stretch.
             | 
             | True equivalent to human memories would require something
             | like a multimodal trillion token context window.
             | 
             | RAG is just not going to cut it, and if anything will
              | exacerbate problems with hallucinations.
        
           | bubblyworld wrote:
           | That's true for vanilla LLMs, but also keep in mind that
           | there are no details about o3's architecture at the moment.
           | Clearly they are doing _something_ different given the huge
           | performance jump on a lot of benchmarks, and it may well
           | involve in-context learning.
        
             | catmanjan wrote:
             | Given every other iteration has basically just been the
             | same thing but bigger, why should we think this?
        
               | bubblyworld wrote:
               | My point was to caution against being too confident about
               | the underlying architecture, not to argue for any
               | particular alternative.
               | 
               | Your statement is false - things changed a lot between
                | gpt4 and o1 under the hood, but notably _not_ via a larger
               | model size. In fact the model size of o1 is smaller than
               | gpt4 by several orders of magnitude! Improvements are
               | being made in other ways.
        
         | uncomplexity_ wrote:
         | on the spatial data i see it as a highly intelligent head of a
         | machine that just needs better limbs and better senses.
         | 
          | i think that's where most hardware startups will specialize
          | in over the coming decades, with different industries having
          | different needs.
        
         | mirkodrummer wrote:
          | Please stop calling it AGI; we don't even know or universally
          | agree what that should actually mean. How far did we get with
          | hype, calling a lossy probabilistic compressor slowly firing
          | words at us AGI? That's a real bummer to me
        
           | razodactyl wrote:
           | Is this comment voted down because of sentiment / polarity?
           | 
           | Regardless the critical aspect is valid, AGI would be
           | something like Cortana from Halo.
        
         | ryoshu wrote:
          | Ask o3: is P=NP?
        
           | amelius wrote:
           | It will just answer with the current consensus on the matter.
        
         | zwnow wrote:
         | This is not AGI lmao.
        
       | CliveBloomers wrote:
       | Another meaningless benchmark, another month--it's like clockwork
       | at this point. No one's going to remember this in a month; it's
       | just noise. The real test? It's not in these flashy metrics or
       | minor improvements. The only thing that actually matters is how
       | fast it can wipe out the layers of middle management and all
       | those pointless, bureaucratic jobs that add zero value.
       | 
       | That's the true litmus test. Everything else? It's just fine-
       | tuning weights, playing around the edges. Until it starts cutting
       | through the fat and reshaping how organizations really operate,
       | all of this is just more of the same.
        
         | handfuloflight wrote:
         | Agreed, but isn't it management who decides that this would be
          | implemented? Are they going to propagate their own removal?
        
           | zamadatix wrote:
           | Middle manager types are probably interested in their salary
           | performance more than anything. "Real" management (more of
           | their assets come from their ownership of the company than a
           | salary) will override them if it's truthfully the best
           | performing operating model for the company.
        
         | oytis wrote:
         | So far AI market seems to be focused on replacing meaningful
         | jobs, meaningless ones look safe (which kind of makes sense if
         | you think about it).
        
       | 6gvONxR4sf7o wrote:
       | I'm glad these stats show a better estimate of human ability than
       | just the average mturker. The graph here has the average mturker
       | performance as well as a STEM grad measurement. Stuff like that
       | is why we're always feeling weird that these things supposedly
       | outperform humans while still sucking. I'm glad to see 'human
       | performance' benchmarked with more variety (attention, time,
       | education, etc).
        
       | RivieraKid wrote:
       | It sucks that I would love to be excited about this... but I
       | mostly feel anxiety and sadness.
        
         | xvector wrote:
         | Humanity is about to enter an even steeper hockey stick growth
         | curve. Progressing along the Kardashev scale feels all but
         | inevitable. We will live to see Longevity Escape Velocity. I'm
         | fucking pumped and feel thrilled and excited and proud of our
         | species.
         | 
         | Sure, there will be growing pains, friction, etc. Who cares?
         | There always is with world-changing tech. Always.
        
           | drcode wrote:
           | longevity for the AIs
        
           | tokioyoyo wrote:
           | My job should be secure for a while, but why would an average
           | person give a damn about humanity when they might lose their
           | jobs and comfort levels? If I had kids, I would absolutely
           | hate this uncertainty as well.
           | 
           | "Oh well, I guess I can't give the opportunities to my kid
           | that I wanted, but at least humanity is growing rapidly!"
        
             | xvector wrote:
             | > when they might lose their jobs and comfort levels?
             | 
             | Everyone has always worried about this for every major
             | technology throughout history
             | 
              | IMO AGI will dramatically increase comfort levels and lower
              | your chance of death, disease, etc.
        
               | tokioyoyo wrote:
               | Again, sure, but it doesn't matter to an average person.
               | That's too much focus on the hypothetical future. People
               | care about the current times. In the short term it will
               | suck for a good chunk of people, and whether the
               | sacrifice is worth it will depend on who you are.
               | 
                | People aren't really in uproar yet, because
                | implementations haven't affected the job market of the
                | masses. Afterwards? Time will tell.
        
               | xvector wrote:
               | Yes, people tend to focus on current times. It's an
               | incredibly shortsighted mentality that selfishly puts
               | oneself over tens of billions of future lives being
               | improved. https://pessimistsarchive.org
        
               | tokioyoyo wrote:
               | Do you have any dependents, like parents or kids, by any
               | chance? Imagine not being able to provide for them. Think
               | how'd you feel in such circumstances.
               | 
               | Like in general I totally agree with you, but I also
               | understand why a person would care about their loved ones
               | and themselves first.
        
               | realce wrote:
               | Eventually you draw the black ball, it is inevitable.
        
               | MVissers wrote:
                | We almost wiped ourselves out in a nuclear war in the
                | '70s. If that had happened, would it have been worth it?
                | Probably not.
               | 
               | Beyond immediate increase in inequality, which I agree
               | could be worth it in the long run if this was the only
               | problem, we're playing a dangerous game.
               | 
               | The smartest and most capable species on the planet that
               | dominates it for exactly this reason, is creating
               | something even smarter and more capable than itself in
               | the hope it'd help make its life easier.
               | 
               | Hmm.
        
           | croemer wrote:
           | Longevity Escape Velocity? Even if you had orders of
           | magnitude more people working on medical research, it's not a
           | given that prolonging life indefinitely is even possible.
        
             | soheil wrote:
              | Of course it's a given. Unless you want to invoke
              | supernatural causes, the human brain is a collection of
              | cells with electro-chemical connections that, if fully
              | reconstructed either physically or virtually, would
              | necessarily represent the original person's brain.
              | Therefore, with sufficient intelligence, it would be
              | possible to engineer technology able to do that
              | reconstruction without even having to go to the atomic
              | level, which we also have a near-full understanding of
              | already.
        
           | lewhoo wrote:
           | > Sure, there will be growing pains, friction, etc. Who
           | cares?
           | 
           | That's right. Who cares about pains of others and why they
           | even should are absolutely words to live by.
        
             | xvector wrote:
             | Yeah, with this mentality, we wouldn't have electricity
             | today. You will never make transition to new technology
             | painless, no matter what you do. (See:
             | https://pessimistsarchive.org)
             | 
             | What you are likely doing, though, is making many more
             | future humans pay a cost in suffering. Every day we delay
             | longevity escape velocity is another 150k people dead.
        
               | lewhoo wrote:
               | There was a time when in the name of progress people were
               | killed for whatever resources they possessed, others were
               | enslaved etc. and I was under the impression that the
               | measure of our civilization is that we actually DID care
               | and just how much. It seems to me that you are very eager
               | to put up altars of sacrifice without even thinking that
               | the problems you probably have in mind are perfectly
               | solvable without them.
        
               | smokedetector1 wrote:
               | By far the greatest issue facing humanity today is wealth
               | inequality.
        
               | xvector wrote:
               | Nah, it's death. People objectively are doing better than
               | ever despite wealth inequality. By all metrics - poverty,
               | quality of life, homelessness, wealth, purchasing power.
               | 
               | I'd rather just... not die. Not unless I want to. Same
               | for my loved ones. That's far more important than "wealth
               | inequality."
        
           | asdf6969 wrote:
           | I would rather follow in the steps of uncle Ted than let AI
            | turn me into a homeless person. It's no consolation that my
           | tent will have a nice view of a lunar colony
        
           | objektif wrote:
           | You sound like a rich person.
        
           | soheil wrote:
           | I agree, save invoking supernatural causes, the human brain
           | is a collection of cells with electro-chemical connections
            | that, if fully reconstructed either physically or virtually,
            | would necessarily represent the original person's brain.
            | Therefore, with sufficient intelligence, it would be
           | possible to engineer technology that would be able to do that
           | reconstruction without even having to go to the atomic level,
           | which we also have a near full understanding of already.
        
           | achierius wrote:
           | https://www.transformernews.ai/p/richard-ngo-openai-
           | resign-s...
           | 
           | >But while the "making AGI" part of the mission seems well on
           | track, it feels like I (and others) have gradually realized
           | how much harder it is to contribute in a robustly positive
           | way to the "succeeding" part of the mission, especially when
           | it comes to preventing existential risks to humanity.
           | 
           | Almost every single one of the people OpenAI had hired to
           | work on AI safety have left the firm with similar messages.
           | Perhaps you should at least consider the thinking of experts?
           | 
           | You and I will likely not live to see much of anything past
           | AGI.
        
           | goatlover wrote:
           | > Sure, there will be growing pains, friction, etc. Who
           | cares?
           | 
           | The people experiencing the growing pains, friction, etc.
        
         | pupppet wrote:
         | We're enabling a huge swath of humanity being put out of work
         | so a handful of billionaires can become trillionaires.
        
           | abiraja wrote:
           | And also the solving of hundreds of diseases that ail us.
        
             | hartator wrote:
              | It doesn't matter. Statists would rather be poor, sick, and
              | dead than risk trillionaires.
        
               | thrance wrote:
                | You should read about workers' rights in the Gilded Age,
               | and see how good _laissez-faire_ capitalism was. What do
               | you think will happen when the only thing you can trade
               | with the trillionaires, your labor, becomes worthless?
        
             | lewhoo wrote:
             | One of the biggest factors in risk of death right now is
             | poverty. Also what is being chased right now is "human
             | level on most economically viable tasks" because the
             | automated research for solving physics etc. even now seems
             | far-fetched.
        
             | thrance wrote:
             | You need to solve diseases _and_ make the cure available.
             | Millions die of curable diseases every year, simply because
             | they are not deemed useful enough. What happens when your
             | labor becomes worthless?
        
             | asdf6969 wrote:
             | Why do you think you'll be able to afford healthcare? The
             | new medicine is for the AI owners
        
           | distortionfield wrote:
           | This is the same boring alarmist argument we've heard since
            | the Industrial Revolution. Humans have always used the extra
            | output provided by technological advancement to increase
            | overall productivity.
        
           | stri8ed wrote:
           | It would happen in China regardless what is done here.
           | Removing billionaires does not fix this. The ship has sailed.
        
         | gom_jabbar wrote:
         | Anxiety and sadness are actually mild emotional responses to
         | the dissolution of human culture. Nick Land in 1992:
         | 
         | "It is ceasing to be a matter of how we think about technics,
         | if only because technics is increasingly thinking about itself.
         | It might still be a few decades before artificial intelligences
         | surpass the horizon of biological ones, but it is utterly
         | superstitious to imagine that the human dominion of terrestrial
         | culture is still marked out in centuries, let alone in some
         | metaphysical perpetuity. The high road to thinking no longer
         | passes through a deepening of human cognition, but rather
         | through a becoming inhuman of cognition, a migration of
         | cognition out into the emerging planetary technosentience
         | reservoir, into 'dehumanized landscapes ... emptied spaces'
         | where human culture will be dissolved. Just as the capitalist
         | urbanization of labour abstracted it in a parallel escalation
         | with technical machines, so will intelligence be transplanted
         | into the purring data zones of new software worlds in order to
         | be abstracted from an increasingly obsolescent anthropoid
         | particularity, and thus to venture beyond modernity. Human
         | brains are to thinking what mediaeval villages were to
         | engineering: antechambers to experimentation, cramped and
         | parochial places to be.
         | 
         | [...]
         | 
         | Life is being phased-out into something new, and if we think
         | this can be stopped we are even more stupid than we seem." [0]
         | 
         | Land is being ostracized for some of his provocations, but it
         | seems pretty clear by now that we are in the Landian
         | Accelerationism timeline. Engaging with his thought is crucial
         | to understanding what is happening with AI, and what is still
         | largely unseen, such as the autonomization of capital.
         | 
         | [0] https://retrochronic.com/#circuitries
        
           | achierius wrote:
           | It's obvious that there are lines of flight (to take a
           | Deleuzian tack, a la Land) away from the current political-
           | economic assemblage. For example, a strategic nuclear
           | exchange starting tomorrow (which can always happen --
           | technical errors, a rogue submarine, etc.) would almost
           | certainly set back technological development enough that we'd
           | have no shot at AI for the next few decades. I don't know
           | whether you agree with him, but I think the fact that he
           | ignores this fact is quite unserious, especially given the
           | likely destabilizing effects sub-AGI AI will have on
           | international politics.
        
         | Jcampuzano2 wrote:
          | Same, it's sad, but I honestly hoped they would never achieve
          | these results, and that it would turn out not to be possible
          | or to take an insurmountable amount of resources. But here we
          | are, on the verge of making most humans useless when it comes
          | to productivity.
         | 
         | While there are those that are excited, the world is not
         | prepared for the level of distress this could put on the
         | average person without critical changes at a monumental level.
        
           | JacksCracked wrote:
           | If you don't feel like the world needed grand scale changes
           | at a societal level with all the global problems we're unable
           | to solve, you haven't been paying attention. Income
           | inequality, corporate greed, political apathy, global
           | warming.
        
             | sensanaty wrote:
             | And you think the bullshit generators backed by the largest
             | corporate entities in humanity who are, as we speak,
             | causing all the issues you mention are somehow gonna solve
             | any of this?
        
               | CamperBob2 wrote:
               | If you still think this technology is a "bullshit
               | generator," then it's safe to say you're also wrong about
               | a great many other things in life.
               | 
               | That would bug me, if I were you.
        
               | r-zip wrote:
               | They're not wrong though. The frequency with which these
               | things still just make shit up is astonishingly bad. Very
               | dismissive of a legitimate criticism.
        
               | CamperBob2 wrote:
               | It's getting better, faster than you and I and the GP
               | are. What else matters?
               | 
               | You can't bullshit your way through this particular
               | benchmark. Try it.
               | 
               | And yes, they're wrong. The latest/greatest models "make
               | shit up" perhaps 5-10% as frequently as were seeing just
               | a couple of years ago. Only someone who has deliberately
               | decided to stop paying attention could possibly argue
               | otherwise.
        
               | sensanaty wrote:
               | And yet I still can't trust Claude or o1 to not get the
               | simplest of things, such as test cases (not even full on
               | test suites, just the test cases) wrong, consistently. No
               | amount of handholding from me or prompting or feeding it
               | examples etc helps in the slightest, it is just
               | consistently wrong for anything but the simplest possible
               | examples, which takes more effort to manually verify than
               | if I had just written it myself. I'm not even using an
               | obscure stack or language, but _especially_ with things
                | that aren't Python or JS it shits the bed even worse.
               | 
               | I have noticed it's great in the hands of marketers and
               | scammers, however. Real good at those "jobs", so I see
               | why the cryptobros have now moved onto hailing LLMs as
               | the next coming of jesus.
        
             | crakhamster01 wrote:
             | Well said! There's no way big tech and institutional
             | investors are pouring billions of dollars into AI because
             | of corporate greed. It's definitely so that they can
             | redistribute wealth equally once AGI is achieved.
             | 
             | /s
        
             | phito wrote:
             | AI will fix none of that
        
         | larve wrote:
         | I have been diving deep into LLM coding over the last 3 years
          | and regularly encountered that feeling along the way. I still at
         | times have a "wtf" moment where I need to take a break.
         | However, I have been able to quell most of my anxieties around
         | my job / the software profession in general (I've been at this
         | professionally for 25+ years and software has been my dream job
         | since I was 6).
         | 
         | For one, I found AI coding to work best in a small team, where
         | there is an understanding of what to build and how to build it,
         | usually in close feedback loop with the designers / users.
         | Throw the usual managerial company corporate nonsense on top
         | and it doesn't really matter if you can instacreate a piece of
         | software, if nobody cares for that piece of software and it's
         | just there to put a checkmark on the Q3 OKR reports.
         | 
         | Furthermore, there is a lot of software to be built out there,
         | for people who can't afford it yet. A custom POS system for the
         | local baker so that they don't have to interact with a
         | computer. A game where squids eat algae for my nephews at
         | christmas. A custom photo layout software for my dad who
         | despairs at indesign. A plant watering system for my friend. A
         | local government information website for older citizens. Not
         | only can these be built at a fraction of the cost they were
         | before, but they can be built in a manner where the people
         | using the software are directly involved in creating it. Maybe
          | they can get an 80% hacked version together if they are
          | technically inclined. I can add the proper database backend and
         | deployment infrastructure. Or I can sit with them and iterate
         | on the app as we are talking. It is also almost free to create
          | great documentation; in fact, LLM development is most
          | productive when you turn software engineering best practices
          | up to 11.
         | 
         | Furthermore, I found these tools incredible for actively
         | furthering my own fundamental understanding of computer science
         | and programming. I can now skip the stuff I don't care to learn
         | (is it foobarBla(func, id) or foobar_bla(id, func)) and put the
         | effort where I actually get a long-lived return. I have become
         | really ambitious with the things I can tackle now, learning
         | about all kinds of algorithms and operating system patterns and
         | chemistry and physics etc... I can also create documents to
         | help me with my learning.
         | 
         | Local models are now entering the phase where they are getting
         | to be really useful, definitely > gpt3.5 which I was able to
         | use very productively already at the time.
         | 
         | Writing (creating? manifesting? I don't really have a good word
         | for what I do these days) software that makes me and real
         | humans around me happy is extremely fulfilling, and has
          | alleviated most of my angst around the technology.
        
       | bluecoconut wrote:
       | Efficiency is now key.
       | 
       | ~=$3400 per single task to meet human performance on this
       | benchmark is a lot. Also it shows the bullets as "ARC-AGI-TUNED",
       | which makes me think they did some undisclosed amount of fine-
       | tuning (eg. via the API they showed off last week), so even more
       | compute went into this task.
       | 
       | We can compare this roughly to a human doing ARC-AGI puzzles,
       | where a human will take (high variance in my subjective
        | experience) between 5 seconds and 5 minutes to solve the task.
        | (So I'd argue a human is at 0.03USD - 1.67USD per puzzle at
        | 20USD/hr, and they include in their document an average
        | mechanical turker at $2 USD per task.)
       | 
       | Going the other direction: I am interpreting this result as human
       | level reasoning now costs (approximately) 41k/hr to 2.5M/hr with
       | current compute.
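        | 
        | (Back-of-envelope behind those numbers, using my rough
        | $3400/task estimate for o3 high-compute:)
        | 
        |     cost_per_task = 3400           # USD per task, approx.
        |     fast, slow = 5 / 3600, 5 / 60  # 5 sec, 5 min (in hours)
        |     print(cost_per_task / slow)    # ~41k USD per human-hour
        |     print(cost_per_task / fast)    # ~2.45M USD per human-hour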
       | 
       | Super exciting that OpenAI pushed the compute out this far so we
        | could see the O-series scaling continue and intersect humans on
       | ARC, now we get to work towards making this economical!
        
         | riku_iki wrote:
         | > ~=$3400 per single task
         | 
         | report says it is $17 per task, and $6k for whole dataset of
         | 400 tasks.
        
           | bluecoconut wrote:
           | That's the low-compute mode. In the plot at the top where
           | they score 88%, O3 High (tuned) is ~3.4k
        
             | ionwake wrote:
              | sorry to be a noob, but can someone tell me, does this mean
              | o3 will be unaffordable for a typical user? Will only
              | companies with thousands to spend per query be able to use
              | this?
              | 
              | Sorry for being thick, I'm just confused how they can turn
              | this into an affordable service?
        
               | JohnnyMarcone wrote:
               | There are likely many efficiency gains that will be made
               | before it's released, and after. Also they showed o3 mini
               | to be better than o1 for less cost in multiple
                | benchmarks, so there are already improvements there at a
                | lower cost than what's currently available.
        
               | ionwake wrote:
               | Great thank you
        
             | HDThoreaun wrote:
             | The low compute one did as well as the average person
             | though
        
           | jhrmnn wrote:
           | That's for the low-compute configuration that doesn't reach
           | human-level performance (not far though)
        
             | riku_iki wrote:
             | I referred on high compute mode. They have table with
             | breakdown here: https://arcprize.org/blog/oai-o3-pub-
             | breakthrough
        
               | EVa5I7bHFq9mnYK wrote:
               | That's high EFFICIENCY. High efficiency = low compute.
        
               | junipertea wrote:
               | The table row with 6k figure refers to high efficiency,
               | not high compute mode. From the blog post:
               | 
               | Note: OpenAI has requested that we not publish the high-
               | compute costs. The amount of compute was roughly 172x the
               | low-compute configuration.
        
               | gbnwl wrote:
               | That's "efficiency" high, which actually means less
               | compute. The 87.5% score using low efficiency (more
               | compute) doesn't have cost listed.
        
               | bluecoconut wrote:
               | they use some poor language.
               | 
               | "High Efficiency" is O3 Low "Low Efficiency" is O3 High
               | 
               | They left the "Low efficiency" (O3 High) values as `-`
               | but you can infer them from the plot at the top.
               | 
                | Note the $20 and $17 per task align with the X-axis of
               | the O3-low
        
           | binarymax wrote:
           | _" Note: OpenAI has requested that we not publish the high-
           | compute costs. The amount of compute was roughly 172x the
           | low-compute configuration."_
           | 
           | The low compute was $17 per task. Speculate 172*$17 for the
           | high compute is $2,924 per task, so I am also confused on the
           | $3400 number.
        
             | bluecoconut wrote:
             | 3400 came from counting pixels on the plot.
             | 
              | Also it's $20 per task for the o3-low via the table for the
              | semi-private set, which x172 is 3440, also coming in close
              | to the 3400 number.
        
           | xrendan wrote:
           | You're misreading it, there's two different runs, a low and a
           | high compute run.
           | 
           | The number for the high-compute one is ~172x the first one
           | according to the article so ~=$2900
        
             | Thorrez wrote:
             | What's extra confusing is that in the graph the runs are
             | called low compute and high compute. In the table they're
             | called high efficient and low efficiency. So the high and
             | low got swapped.
        
         | bluecoconut wrote:
          | Some other important quotes: "Average human off the street:
         | 70-80%. STEM college grad: >95%. Panel of 10 random humans:
         | 99-100%" -@fchollet on X
         | 
         | So, considering that the $3400/task system isn't able to
         | compete with STEM college grad yet, we still have some room
         | (but it is shrinking, i expect even more compute will be thrown
         | and we'll see these barriers broken in coming years)
         | 
         | Also, some other back of envelope calculations:
         | 
         | The gap in cost is roughly 10^3 between O3 High and Avg.
         | mechanical turkers (humans). Via Pure GPU cost improvement
         | (~doubling every 2-2.5 years) puts us at 20~25 years.
         | 
         | The question is now, can we close this "to human" gap (10^3)
         | quickly with algorithms, or are we stuck waiting for the 20-25
         | years for GPU improvements. (I think it feels obvious: this is
         | new technology, things are moving fast, the chance for
         | algorithmic innovation here is high!)
         | 
         | I also personally think that we need to adjust our efficiency
         | priors, and start looking not at "humans" as the bar to beat,
         | but theoretical computatble limits (show gaps much larger
         | ~10^9-10^15 for modest problems). Though, it may simply be the
         | case that tool/code use + AGI at near human cost covers a lot
         | of that gap.
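
       A small sketch of the "20-25 years from GPU improvements alone" figure,
       under the assumptions stated in the comment above (a ~10^3 cost gap and
       GPU cost-performance doubling every 2-2.5 years):

           import math

           cost_gap = 1_000                    # o3-high vs human, per the comment
           doublings = math.log2(cost_gap)     # ~10 doublings needed to close it

           for years_per_doubling in (2.0, 2.5):
               years = doublings * years_per_doubling
               print(f"{doublings:.1f} doublings x {years_per_doubling} yr each ~= {years:.0f} years")
           # ~20 years at a doubling every 2 years, ~25 years at every 2.5 years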
        
           | zamadatix wrote:
           | I don't follow how 10 random humans can beat the average STEM
           | college grad and average humans in that tweet. I suspect it's
           | really "a panel of 10 randomly chosen experts in the space"
           | or something?
           | 
           | I agree the most interesting thing to watch will be cost for
           | a given score more than maximum possible score achieved (not
           | that the latter won't be interesting by any means).
        
             | hmottestad wrote:
             | It might be that within a group of 10 randomly chosen
             | people, when each person attempts to solve the tasks, at
             | least 99% of the time 1 person out of the 10 will get it
             | right.
        
             | bcrosby95 wrote:
             | Two heads are better than one, and 10 are way better, even
             | if they aren't a field of experts. You're bound to get
             | random people who remember random stuff from high school,
             | college, work, and life in general, allowing them to piece
             | together a solution.
        
               | inerte wrote:
               | Aaaah thanks for the explanation. PANEL of 10 humans, as
               | in, they were all together. I parsed the phrase as "10
               | random people" > "average human" which made little sense.
        
               | modeless wrote:
               | Actually I believe that he did mean 10 random people
               | tested individually, not a committee of 10 people. The
               | key being that the question is considered to be answered
               | correctly if any one of the 10 people got it right. This
               | is similar to how LLMs are evaluated with pass@5 or
               | pass@10 criteria (because the LLM has no memory so
               | running it 10 times is more like asking 10 random people
               | than asking the same person 10 times in a row).
               | 
               | I would expect 10 random people to do better than a
               | committee of 10 people because 10 people have 10 chances
               | to get it right while a committee only has one. Even if
               | the committee gets 10 guesses (which must be made
               | simultaneously, not iteratively) it might not do better
               | because people might go along with a wrong consensus
               | rather than push for the answer they would have chosen
               | independently.
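
       The pass@k framing described above boils down to simple probability:
       with k independent attempts, each succeeding with probability p, at
       least one succeeds with probability 1 - (1 - p)^k. A tiny sketch with an
       illustrative p (an assumption, not a measured figure):

           def pass_at_k(p: float, k: int) -> float:
               """Chance that at least one of k independent attempts succeeds."""
               return 1.0 - (1.0 - p) ** k

           p = 0.75  # assumed per-person (or per-sample) success rate, for illustration
           for k in (1, 5, 10):
               print(f"pass@{k} = {pass_at_k(p, k):.6f}")
           # pass@1 = 0.750000, pass@5 ~ 0.999023, pass@10 ~ 0.999999 -- which is
           # why "any one of 10 people got it" can look far stronger than one solver.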
        
               | elcomet wrote:
               | He means 10 humans voting for the answer
        
               | herval wrote:
               | Depends on the task, no?
               | 
               | Do you have a sense of what kind of task this benchmark
               | includes? Are they more "general" such that random people
               | would fare well or more specialized (ie something a STEM
               | grad studied and isn't common knowledge)?
        
               | judge2020 wrote:
               | It does, which is why I don't really subscribe to any
               | test like this being great for actually determining
               | "AGI". A true AGI would be able to continuously train and
               | create new LLMs that enable it to become a SME in
               | entirely new areas.
        
               | generic92034 wrote:
               | Whether that works that way at all depends on the group
               | dynamic. It is easily possible that a not-so-bright
               | individual takes an (unofficial) leadership position in
               | the group and overrides the input of smarter members.
               | Think of any meeting with various hierarchy levels in a
               | company.
        
               | daveguy wrote:
               | The ARC AGI questions can be a little tricky, but the
               | solutions can generally be easily explained. And you get
               | 3 tries. So, the 3 best descriptions of the solution
               | voted on by 10 people are going to be very effective. The
               | problem space just isn't complicated enough for an
               | unofficial "leader" to sway the group to 3 wrong answers.
        
               | zamadatix wrote:
               | Aha, "at least 1 of a panel of 10", not "the panel of 10
               | averaged"! Thanks, that makes so much more sense to me
               | now.
               | 
               | I have failed the real ARC AGI :)
        
             | shkkmo wrote:
             | It is fairly well documented that groups of people can show
             | cognitive abilities that exceed that of any individual
             | member. The classic example of this is if you ask a group
             | of people to estimate the number of jellybeans in a jar,
             | you can get a more accurate result than if you test to find
             | the person with the highest accuracy and use their guess.
             | 
             | This isn't to say groups always outperform their members on
             | all tasks, just that it isn't unusual to see a result like
             | that.
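
       A quick simulation of the jar-estimation effect described above, with a
       made-up noise model (unbiased, independent guesses); the point is only
       that averaging cancels uncorrelated error, not that these numbers are
       realistic:

           import random, statistics

           random.seed(0)
           TRUE_COUNT = 1000                 # hypothetical jellybean count
           N_PEOPLE, N_TRIALS = 10, 10_000

           individual_err, group_err = [], []
           for _ in range(N_TRIALS):
               guesses = [random.gauss(TRUE_COUNT, 300) for _ in range(N_PEOPLE)]
               individual_err.append(abs(guesses[0] - TRUE_COUNT))
               group_err.append(abs(statistics.mean(guesses) - TRUE_COUNT))

           print("mean individual error:", round(statistics.mean(individual_err)))
           print("mean group-average error:", round(statistics.mean(group_err)))
           # the group average's error comes out roughly 1/sqrt(10) of an individual's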
        
               | zamadatix wrote:
               | Yes, my shortcoming was in understanding the 10 were
               | implied to have their successes merged together by being
               | a panel rather than just the average of a special
               | selection.
        
             | HDThoreaun wrote:
             | ARC-AGI is essentially an IQ test. There is no "expert in
             | the space". It's just a question of whether you're able to
             | spot the pattern.
        
             | dlkf wrote:
             | If you take a vote of 10 random people, then as long as
             | their errors are not perfectly correlated, you'll do better
             | than asking one person.
             | 
             | https://en.m.wikipedia.org/wiki/Ensemble_learning
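
       The same point as a tiny Condorcet-style calculation: if each of n
       voters is independently correct with probability p > 0.5, a simple
       majority vote is correct more often than any single voter (the p values
       below are illustrative assumptions):

           from math import comb

           def majority_correct(p: float, n: int) -> float:
               """P(simple majority of n independent voters is right); ties split 50/50."""
               total = 0.0
               for k in range(n + 1):
                   prob_k = comb(n, k) * p**k * (1 - p)**(n - k)
                   if 2 * k > n:
                       total += prob_k
                   elif 2 * k == n:
                       total += 0.5 * prob_k
               return total

           for p in (0.6, 0.7, 0.8):
               print(f"single voter: {p:.2f}   majority of 10: {majority_correct(p, 10):.3f}")
           # as the ensemble-learning link notes, this only helps when the voters'
           # errors are not perfectly correlated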
        
             | olalonde wrote:
             | Even if you assume that non STEM grads are dumb, isn't
             | there a good probability of having a STEM graduate among 10
             | random humans?
        
           | cchance wrote:
           | I mean, considering the big breakthrough this year for o1/o3
           | seems to have been "models having internal thoughts might
           | help reasoning", to everyone outside of the AI field it was
           | sort of a "duh" moment.
           | 
           | I'd hope we see more internal optimizations and improvements
           | to the models. The idea behind the big breakthrough, "don't
           | spit out the first thought that pops into your head", seems
           | obvious to everyone outside of the field, but guess what: it
           | turned out to be a big improvement when the devs decided to
           | add it.
        
             | versteegen wrote:
             | > seems obvious to everyone outside of the field
             | 
             | It's obvious to people inside the field too.
             | 
             | Honestly, these things seem to be less obvious to people
             | outside the field. I've heard so many uninformed takes
             | about LLMs not representing real progress towards
             | intelligence (even here on HN of all places; I don't know
             | why I torture myself reading them), that they're just dumb
             | memorizers. No, they are an incredible breakthrough,
             | because extending them with things like internal thoughts
             | will so obviously lead to results such as o3, and far
             | beyond. Maybe a few more people will start to understand
             | the trajectory we're on.
        
               | Agentus wrote:
               | A trickle of people, sure, but most people never
               | accidentally stumble upon good evaluation skills, let
               | alone reason themselves to that level, so I don't see how
               | most people will have the semblance of an idea of a
               | realistic trajectory of AI progress. I think most people
               | have very little conceptualization of their own
               | thinking/cognitive patterns, at least not enough to
               | sensibly extrapolate it onto AI.
               | 
               | It doesn't help that most people are just mimics when
               | talking about stuff that's outside their expertise.
               | 
               | Hell, my cousin, a quality-college-educated individual
               | with high social/emotional IQ, will go down the
               | conspiracy theory rabbit hole so quickly based on some
               | baseless crap printed on the internet. Then he'll talk
               | about people being satan worshipers.
        
               | versteegen wrote:
               | You're being pretty harsh, but:
               | 
               | > i think most people have very little conceptualization
               | of their own thinking/cognitive patterns, at least not
               | enough to sensibly extrapolate it onto ai.
               | 
               | Quite true. If you spend a lot of time reading and
               | thinking about the workings of the mind you lose sight of
               | how alien it is to intuition. While in highschool I first
               | read, in New Scientist, the theory that conscious thought
               | lags behind the underlying subconscious processing in the
               | brain. I was shocked that _New Scientist_ would print
               | something so _unbelievable_. Yet there seemed to be an
               | element of truth to it so I kept thinking about it and
               | slowly changed my assessment.
        
               | Agentus wrote:
               | Sorry, humans are stupid, and what intelligence they have
               | is largely impotent. If this wasn't the case, life
               | wouldn't be this dystopia. My crassness comes not from
               | trying to pick on a particular group of humans, just
               | disappointment in recognizing the limited efficacy of
               | human intelligence and its ability to turn reality into a
               | better reality (meh).
               | 
               | Yeah, I was just thinking how a lot of thoughts which I
               | thought were my original thoughts really were made
               | possible by communal thoughts. Like, I can maybe have
               | some original frontier thoughts that involve averages,
               | but that's only made possible because some other person
               | invented the abstraction of averages, which was then
               | collectively disseminated to everyone in education, not
               | to mention all the subconscious processes that are
               | necessary for me to will certain thoughts into
               | existence. It makes me reflect on how much cognition is
               | really mine, vs (not mine) an inevitable product of a
               | deterministic process and a product of other humans.
        
               | sfjailbird wrote:
               | Sounds like your cousin is able to think for himself. The
               | amount of bullshit I hear from quality-college educated
               | individuals, who simply repeat outdated knowledge that is
               | in their college curriculum, is no less disappointing.
        
               | daveguy wrote:
               | Buying whatever bullshit you see on the internet to such
               | a degree that you're re-enacting satanic panic from the
               | 80s is not "thinking for yourself". It's being gullible
               | about areas outside your expertise.
        
               | 0points wrote:
               | > No, they are an incredible breakthrough, because
               | extending them with things like internal thoughts will so
               | obviously lead to results such as o3, and far beyond.
               | 
               | While I agree that the LLM progress as of late is
               | interesting, the rest of your sentiment sounds more like
               | you are in a cult.
               | 
               | As long as your field keeps coming up with less and less
               | realistic predictions and fails to deliver over and over,
               | eventually even the most gullible will lose faith in you.
               | 
               | Because that's what this all is right now. Faith.
               | 
               | > Maybe a few more people will start to understand the
               | trajectory we're on.
               | 
               | All you are saying is that you believe something will
               | happen in the future.
               | 
               | We can't have an intelligent discussion under those
               | premises.
               | 
               | It's depressing to see so many otherwise smart people
               | fall for their own hype train. You are only helping rich
               | people get more rich by spreading their lies.
        
             | dogma1138 wrote:
             | Reflection isn't a new concept, but a) actually proving
             | that it's an effective tool for these types of models and
             | b) finding an effective method for reflection that doesn't
             | just lock you into circular "thinking" were the hard parts
             | and hence the "breakthrough".
             | 
             | It's very easy to say hey ofc it's obvious but there is
             | nothing obvious about it because you are anthropomorphizing
             | these models and then using that bias after the fact as a
             | proof of your conjecture.
             | 
             | This isn't how real progress is achieved.
        
               | beardedwizard wrote:
               | Calling it reflection is, for me, further
               | anthropomorphizing. However I am in violent agreement
               | that a common feature of llm debate is centered around
               | anthropomorphism leading to claims of "thinking longer"
               | or "reflecting" when none of those things are happening.
               | 
               | The state of the art seems very focused on promoting the
               | idea that language which might encode reasoning is as
               | good as actual reasoning, rather than asking what a
               | reasoning model might look like.
        
           | iandanforth wrote:
           | Let's say that Google is already 1 generation ahead of nvidia
           | in terms of efficient AI compute. ($1700)
           | 
           | Then let's say that OpenAI brute forced this without any
           | meta-optimization of the hypothesized search component (they
           | just set a compute budget). This is probably low hanging
           | fruit and another 2x in compute reduction. ($850)
           | 
           | Then let's say that OpenAI was pushing really really hard for
           | the numbers and was willing to burn cash and so didn't bother
           | with serious thought around hardware aware distributed
           | inference. This could be _more_ than a 2x decrease in cost
           | (we've seen better attention mechanisms deliver 10x
           | reductions in cost), but let's go with 2x for now. ($425)
           | 
           | So I think we've got about an 8x reduction in cost sitting
           | there once Google steps up. This is probably 4-6 months of
           | work flat out if they haven't already started down this path,
           | but with what they've got with deep research, maybe it's
           | sooner?
           | 
           | Then if "all" we get is hardware improvements we're down to
           | what 10-14 years?
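
       A small sketch of the compounding above, using only the hypothetical 2x
       factors stated in the comment (each an assumption, not a measurement):

           cost = 3400.0  # USD per task, the rough o3-high figure from upthread
           factors = {
               "more efficient accelerators (e.g. next TPU generation)": 2,
               "meta-optimization of the search / compute budget":       2,
               "hardware-aware distributed inference":                   2,
           }
           for name, factor in factors.items():
               cost /= factor
               print(f"after {name}: ${cost:,.0f}/task")
           # 3400 -> 1700 -> 850 -> 425, i.e. roughly an 8x overall reduction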
        
             | promptdaddy wrote:
             | *deep mind research ?
        
               | iandanforth wrote:
               | Nope, Gemini Advanced with Deep Research. New mode of
               | operation that does more "thinking" and web searches to
               | answer your question.
        
             | qingcharles wrote:
             | Until 2022 most AI research was aimed at improving the
             | _quality_ of the output, not the _quantity_.
             | 
             | Since then there has been a tsunami of optimizations in the
             | way training and inference is done. I don't think we've
             | even begun to find all the ways that inference can be
             | further optimized at both hardware and software levels.
             | 
             | Look at the huge models that you can happily run on an M3
             | Mac. The cost reduction in inference is going to vastly
             | outpace Moore's law, even as chip design continues on its
             | own path.
        
           | bjornsing wrote:
           | > are we stuck waiting for the 20-25 years for GPU
           | improvements
           | 
           | If this turns out to be hard to optimize / improve then there
           | will be a _huge_ economic incentive for efficient ASICs. No
           | freaking way we'll be running on GPUs for 20-25 years, or
           | even 2.
        
             | coolspot wrote:
             | LLMs need efficient matrix multipliers. GPUs are
             | specialized ASICs for massive matrix multiplication.
        
               | vlovich123 wrote:
               | LLMs get to maybe ~20% of the rated max FLOPS for a GPU.
               | It's not hard to imagine that a purpose-built ASIC with
               | an adjusted software stack gets us significantly more
               | real performance.
        
               | boroboro4 wrote:
               | They get more than this. For prefill we can get 70%
               | matmul utilization; for generation it's less than this,
               | but we'll get to >50% eventually too.
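
       The utilization being discussed is usually expressed as model FLOPs
       utilization (MFU): achieved matmul FLOP/s divided by the accelerator's
       peak. A minimal sketch with purely illustrative numbers (the ~2 x params
       FLOPs-per-token decode rule of thumb, plus hypothetical throughput and
       peak figures):

           params = 70e9            # model parameters (assumed)
           tokens_per_sec = 1200    # aggregate decode throughput on one accelerator (assumed)
           peak_flops = 1e15        # accelerator peak FLOP/s (assumed, ~1 PFLOP/s class)

           flops_per_token = 2 * params                 # matmul FLOPs per generated token
           mfu = flops_per_token * tokens_per_sec / peak_flops
           print(f"MFU ~= {mfu:.1%}")
           # ~17% with these made-up numbers, in the ballpark of the ~20% quoted above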
        
           | xbmcuser wrote:
           | You are missing that the cost of electricity is also going
           | to keep falling because of solar and batteries. This year in
           | China, my back-of-the-envelope math says it is $0.05 per
           | kWh, and following the cost-decline trajectory it will be
           | under $0.01 in 10 years.
        
             | patrickhogan1 wrote:
             | Bingo! Solar energy moves us toward a future where a
             | household's energy needs become nearly cost-free.
             | 
             | Energy Need: The average home uses 30 kWh/day, requiring 6
             | kW of output over 5 peak sunlight hours.
             | 
             | Multijunction Panels: Lab efficiencies are already at 47%
             | (2023), and with multiple years of progress, 60% efficiency
             | is probable.
             | 
             | Efficiency Impact: At 60% efficiency, panels generate 600
             | W/m2, requiring 10 m2 (e.g., 2 m x 5 m) to meet energy
             | needs.
             | 
             | This size can fit on most home roofs, be mounted on a pole
             | with stacked layers, or even be hung through an apartment
             | window.
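
       The sizing arithmetic above as a short sketch. It implicitly assumes
       ~1 kW/m2 of peak insolation (the standard test-condition figure); the
       60% efficiency is the commenter's forward-looking assumption:

           DAILY_NEED_KWH = 30.0     # average US home, per the comment
           PEAK_SUN_HOURS = 5.0      # equivalent full-sun hours per day
           INSOLATION_W_M2 = 1000.0  # standard peak insolation assumption
           EFFICIENCY = 0.60         # hypothetical future multijunction panel

           required_kw = DAILY_NEED_KWH / PEAK_SUN_HOURS        # 6 kW of output
           panel_w_per_m2 = INSOLATION_W_M2 * EFFICIENCY        # 600 W/m2
           area_m2 = required_kw * 1000 / panel_w_per_m2        # 10 m2
           print(f"{required_kw:.0f} kW -> {area_m2:.0f} m2 of panel at {EFFICIENCY:.0%} efficiency")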
        
               | arcticbull wrote:
               | Everyone always forgets that they only perform at less
               | than half of their rated capacity and require significant
               | battery installations. Rooftop solar plus storage is
               | actually more expensive than nuclear on a comparable
               | system LCOE due to their lack of efficiency of scale.
               | Rooftop solar plus storage is about the most expensive
               | form of electricity on earth, maybe excluding gas peaker
               | plants.
        
               | nateglims wrote:
               | It varies by a lot of factors but it's way less than
               | half. Photovoltaic panels have around 10% capacity
               | utilization vs 50-70% for a gas or nuke plant.
        
               | xbmcuser wrote:
               | Everyone also forgets the speed of the price decline for
               | solar and batteries; your statement is completely false
               | propaganda made up by power companies. Today rooftop
               | solar and battery is already cost competitive with
               | nuclear in many countries, like India.
        
               | patrickhogan1 wrote:
               | You're right that rooftop solar and storage have costs
               | and efficiency limits, but those are improving quickly.
               | 
               | Rooftop solar harnesses energy from the sun, which is
               | powered by nuclear fusion--arguably the most effective
               | nuclear reactor in our solar system.
        
               | theendisney wrote:
               | The thing everyone forgets is that all good energy
               | technology is seized by governments for military purposes
               | and to preserve the status quo. God knows how far it
               | progressed.
               | 
               | What a joke
        
               | sahmeepee wrote:
               | Average _US_ home.
               | 
               | In Europe it is around 6-7 kWh/day. This might increase
               | with electrification of heating and transport, but
               | probably nothing like as much as the energy consumption
               | they are replacing (due to greater efficiency of the
               | devices consuming the energy and other factors like the
               | quality of home insulation.)
               | 
               | In the rest of the world the average home uses
               | significantly less.
        
               | jdhwosnhw wrote:
               | While I agree with your general assessment, I think your
               | conclusion is a bit off. You're assuming 1 kW/m^2, which
               | is only true with the sun directly overhead. A real-world
               | solar setup gets hit with several factors of cosine
               | (related to roof pitch, time of day, day of year, and
               | latitude) that conspire to reduce the total output.
               | 
               | For example, my 50 sq m setup, at -29 deg latitude,
               | generated your estimated 30 kWh/day output. I have panels
               | with ~20% efficiency, suggesting that at 60% efficiency,
               | the average household would only get to around half their
               | energy needs with 10 sq m.
               | 
               | Yes, solar has the potential to drastically reduce energy
               | costs, but even with free energy storage, individual
               | households aren't likely to achieve self sustainability.
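
       The implied real-world derate in the reply above, as a sketch (the
       50 m2 / 20% / 30 kWh-per-day figures are the commenter's; the rest is
       arithmetic):

           AREA_M2, EFFICIENCY, DAILY_KWH = 50.0, 0.20, 30.0   # commenter's system
           peak_kw = AREA_M2 * 1.0 * EFFICIENCY                # 10 kW peak at 1 kW/m2
           sun_hours = DAILY_KWH / peak_kw                     # ~3 equivalent sun hours/day

           # Re-run the earlier 10 m2 sizing with this real-world figure and 60% panels:
           future_daily_kwh = 10.0 * 1.0 * 0.60 * sun_hours
           print(f"~{sun_hours:.1f} sun hours/day -> {future_daily_kwh:.0f} kWh/day "
                 f"from 10 m2 (vs the 30 kWh/day target)")
           # ~18 kWh/day, broadly consistent with the "around half" estimate above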
        
             | nateglims wrote:
             | Is it going to fall significantly for data centers?
             | Industrial policy for consumer power is different from
             | subsidizing it for data centers and if you own grid
             | infrastructure why would you tank the price by putting up
             | massive amounts of capital?
        
               | xbmcuser wrote:
               | It's the same as using the cloud versus using your own
               | infrastructure: there will be a point where building your
               | own solar and battery plant is cheaper than what they are
               | charging. They will need to follow the price decline if
               | they want to keep the customers; if not, there will be
               | mass-scale grid defections.
        
               | nateglims wrote:
               | I don't think this reflects the reality of the power
               | industry. Data centers are the only significant growth in
               | actual generated power in decades and hyperscalers are
               | already looking at very bespoke solutions.
               | 
               | The heavy commodification of networking and compute
               | brought about by the internet and cloud aligned with tech
               | company interests in delivering services or content to
               | consumers. There does not seem to be an emerging
               | consensus that data center operators also need to provide
               | consumer power.
        
               | xbmcuser wrote:
               | It was not the reality of the power industry, but it will
               | be soon: we have never before had a source of electricity
               | that is the cheapest, is still getting cheaper, and is
               | easy to install. This is something unique.
               | 
               | I don't see Google, Amazon, Microsoft or any company
               | paying $10 for something if building it themselves will
               | cost them $5. Either the price difference will reach a
               | point where investing in power production themselves
               | makes sense, or the power companies decrease prices. All
               | three have already been investing in power production
               | themselves over the last decade, either to get better
               | prices or for PR.
        
               | lyu07282 wrote:
               | But didn't we liberalize energy markets for that reason?
               | If anyone could undercut the market like that, wouldn't
               | that happen automatically and prices go down anyway? /s
        
             | barney54 wrote:
             | But the cost of electricity is not falling--it's
             | increasing. Wholesale prices have decreased, but retail
             | rates are up. In the U.S. rates are up 27% over the past 4
             | years. In Europe prices are up too.
        
               | lucubratory wrote:
               | I am not certain because I've been very focused on the o3
               | news, but at least yesterday neither the US nor Europe
               | were part of China.
        
               | xbmcuser wrote:
               | Most large compute clusters would be buying electricity
               | at wholesale prices, not at retail prices. But anyway,
               | solar and battery prices have only just reached the
               | tipping point this year; the longer power companies keep
               | retail prices high, the more people will defect from the
               | grid and install their own solar + batteries.
        
               | lxgr wrote:
               | But data centers pay wholesale prices or even less (given
               | that especially AI training and, to a lesser extent,
               | inference clusters can load shed like few other consumers
               | of electricity).
        
               | fulafel wrote:
               | And this is great news as long as marginal production
               | (the most expensive to produce, first to turn on/off
               | according to demand) of electricity is fossils.
        
               | NoLinkToMe wrote:
               | That's a bit of a non-statement. Virtually all prices
               | increase because of money supply, but we consider things
               | to get cheaper if their prices grow less fast than
               | inflation / income.
               | 
               | General inflation has outpaced the inflation of
               | electricity prices by about 3x in the past 100 years. In
               | other words, electricity has gotten cheaper over time in
               | purchasing power terms.
               | 
               | And that's whilst our electricity usage has gone up by
               | 10x in the last 100 years.
               | 
               | And this concerns retail prices, which includes
               | distribution/transmission fees. These have gone up a lot
               | as you get complications on the grid, some of which is
               | built on a century old design. But wholesale prices (the
               | cost of generating electricity without
               | transmission/distribution) are getting dirt cheap, and
               | for big AI datacentres I'm pretty sure they'll hook up to
               | their own dedicated electricity generation at wholesale
               | prices, off the grid, in the coming decades.
        
             | necovek wrote:
             | If climate change ends up changing weather profiles and we
             | start seeing many more cloudy days or dust/mist in the air,
             | we'll need to push those solar panel above (all the way to
             | space?) or have many more of them, figure out transmission
             | to the ground and costs will very much balloon.
             | 
             | Not saying this will happen, but it's risky to rely on
             | solar as the only long-term solution.
        
           | miki123211 wrote:
           | It's also worth keeping in mind that AIs are a lot less risky
           | to deploy for businesses than humans.
           | 
           | You can scale them up and down at any time, they can work
           | 24/7 (including holidays) with no overtime pay and no breaks,
           | they need no corporate campuses, office space, HR personnel
           | or travel budgets, you don't have to worry about key
           | employees going on sick/maternity leave or taking time off
           | the moment they're needed most, they won't assault a
           | coworker, sue for discrimination or secretly turn out to be a
           | pedophile and tarnish the reputation of your company, they
           | won't leak internal documents to the press or rage quit
           | because of new company policies, they won't even stop working
           | when a pandemic stops most of the world from running.
        
             | antihipocrat wrote:
             | AI brings similar risks - they can leak internal
             | information, they can be tricked into performing prohibited
             | tasks (with catastrophic effects if this is connected to
             | core systems), they could be accused of actions that are
             | discriminatory (biased training sets are very common).
             | 
             | Sure, if a business deploys it to perform tasks that are
             | inherently low risk e.g. no client interface, no core
             | system connection and low error impact, then the human
             | performing these tasks is going to be replaced.
        
               | snozolli wrote:
               | _they can be tricked into performing prohibited tasks_
               | 
               | This reminds me of the school principal who sent $100k to
               | a scammer claiming to be Elon Musk. The kicker is that
               | she was repeatedly told that it was a scam.
               | 
               | https://abc7chicago.com/fake-elon-musk-jan-mcgee-
               | principal-b...
        
               | tstrimple wrote:
               | This is one of the things which annoys me most about
               | anti-LLM hate. Your peers aren't right all the time
               | either. They believe incorrect things and will pursue
               | worse solutions because they won't acknowledge a better
               | way. How is this any different from a LLM? You have to
               | question _everything_ you're presented with. Sometimes
               | that Stack Overflow answer isn't directly applicable to
               | your exact problem but you can extrapolate from it to
               | resolve your problem. Why is an LLM viewed any
               | differently? Of course you can't just blindly accept it
               | as the one true answer, but you literally cannot do that
               | with humans either. Humans produce a ton of shit code and
               | non-solutions and it's fine. But when an LLM does it,
               | it's a serious problem that means the tech is useless.
               | Much of the modern world is built on shit solutions and
               | we still hobble along.
        
               | lazide wrote:
               | Everyone knows humans can be idiots. The problem is that
               | people seem to think LLMs can't be idiots, and because
               | they aren't human there is no way to punish them. And
               | then people give them too much credit/power, for their
               | own purposes.
               | 
               | Which makes LLMs far more dangerous than idiot humans in
               | most cases.
        
               | brookst wrote:
               | No. Nobody thinks LLMs are perfect. That's a strawman.
               | 
               | And... I am really not sure punishment is the answer to
               | fallibility, outside of almost kinky Catholicism.
               | 
               | The reality is these things are very good, but imperfect,
               | much like people.
        
               | lazide wrote:
               | Clearly you haven't been listening to any CEO press
               | releases lately?
               | 
               | And when was the last time a support chatbot let you
               | actually complain or bypass to a human?
        
               | thecupisblue wrote:
               | Sorry man, but I literally know of startups invested in
               | by YC where CEOs use ChatGPT for 80% of their management
               | decisions/vision/comms... or should I say some use Claude
               | now, as they think it's smarter and does not make
               | mistakes.
               | 
               | Let that sink in.
        
               | onion2k wrote:
               | I wouldn't be surprised if GPT genuinely makes better
               | decisions than an inexperienced, first-time CEO who has
               | only been a dev before, especially if the person
               | prompting it has actually put some effort into
               | understanding their own weaknesses. It certainly wouldn't
               | be any worse than someone whose only experience is
               | reading a few management books.
        
               | lazide wrote:
               | And here is a great example of the problem.
               | 
               | An LLM doesn't make decisions. It generates text that
               | plausibly looks like it made a decision, when prompted
               | with the right text.
        
               | beardedwizard wrote:
               | Why is this distinction lost in every thread on this
               | topic, I don't get it.
        
               | lazide wrote:
               | A lot more people are credulous idiots than anyone wants
               | to believe - and the confusion/misunderstanding is being
               | actively propagated.
        
               | sirsinsalot wrote:
               | Think of all the human growth and satisfaction being lost
               | to risk mitigation by offloading the pleasure of failure
               | to Machines.
        
               | lazide wrote:
               | Ah, but machines can't fail! So don't worry, humans will
               | still get to experience the 'pleasure'. But won't be able
               | to learn/change anything.
        
               | Mordisquitos wrote:
               | > No. Nobody thinks LLMs are perfect. That's a strawman.
               | 
               | I'm afraid that's not the case. Literally yesterday I was
               | speaking with an old friend who was telling us how one of
               | his coworkers had presented a document with mistakes and
               | serious miscalculations as part of some project. When my
               | friend pointed out the mistakes, which were intuitively
               | obvious just by critically understanding the numbers, the
               | guy kept insisting _"no, it's correct, I did it with
               | ChatGPT"_. It took my friend doing the calculations
               | explicitly and showing that they made no sense to
               | convince the guy that it was wrong.
        
               | 0points wrote:
               | Not _people_.
               | 
               | Certain gullible people, who tend to listen to certain
               | charlatans.
               | 
               | Rational, intelligent people wouldn't consider replacing
               | a skilled human worker with an LLM that on a good day can
               | compete with a 3-year-old.
               | 
               | You may see the current age as a litmus test for critical
               | thinking.
        
               | mplewis wrote:
               | Humans can tell you how confident they are in something
               | being right or wrong. An LLM has no internal model and
               | cannot do such a thing.
        
               | swiftcoder wrote:
               | > Humans can tell you how confident they are in something
               | being right or wrong
               | 
               | Humans are also very confidently wrong a considerable
               | portion of the time. Particularly about anything outside
               | their direct expertise
        
               | SketchySeaBeast wrote:
               | People only being willing to say they are unsure some of
               | the time is still better than LLMs. I suppose, given that
               | everything is outside of their area of expertise, it's
               | very human of them.
        
               | daveguy wrote:
               | That's still better than never being able to make an
               | accurate confidence assessment. The fact that this is
               | worse outside your expertise is a main reason why
               | expertise is so valued in hiring decisions.
        
               | pineaux wrote:
               | It's quite stunning to frame it as anti-LLM hate. It's on
               | the pro-LLM people to convince the anti-LLM people that
               | choosing for LLMs is an ethically correct choice with all
               | the necessary guardrails. It's also on the pro-LLM people
               | to show the usefulness of the product. If pro-LLM people
               | are right, it will be a matter of time before these
               | people will see the errors of their ways. But doing an
               | ad-hominem is a sure way of creating a divide...
        
               | gf000 wrote:
               | But human stupidity, while itself can be sometimes an
               | unknown unknown with its creativity, is a mostly known
               | unknown.
               | 
               | LLMs fail in entirely novel ways you can't even fathom
               | upfront.
        
               | sirsinsalot wrote:
               | GenAI has a 100% failure to enjoy quality of life,
               | emotional fulfillment and psychological safety.
               | 
               | I'd say those are the goals we should be working for.
               | That's the failure we want to look at. We are humans.
        
               | halgir wrote:
               | > LLMs fail in entirely novel ways you can't even fathom
               | upfront.
               | 
               | Trust me, so do humans. Source: have worked with humans.
        
             | lucubratory wrote:
             | >secretly turn out to be a pedophile and tarnish the
             | reputation of your company
             | 
             | This is interesting because it's both Oddly Specific and
             | also something I have seen happen and I still feel really
             | sorry for the company involved. Now that I think about it,
             | I've actually seen it happen twice.
        
             | rockskon wrote:
             | AI has a different risk profile than humans. They are a
             | _lot_ more risky for business operations where failure is
             | wholly unacceptable under any circumstance.
             | 
             | They're risky in that they fail in ways that aren't readily
             | deterministic.
             | 
             | And would you trust your life to a self-driving car in New
             | York City traffic?
        
               | lxgr wrote:
               | Isn't everybody in NYC already? (The dangers of bad
               | driving are much higher for pedestrians than for people
               | in cars; there are more of the former than of the latter
               | in NYC; I'd expect there to be a non-zero number of fully
               | self driving cars already in the city.)
        
               | rockskon wrote:
               | That doesn't answer my question.
        
               | 9dev wrote:
               | It does, in a way; AI is already there, all around you,
               | whether you like it or not. Technological progress is
               | Pandora's box; you can't take it back or slow it down.
               | Businesses will use AI for critical workflows, and all
               | good that they bring, and all bad too, will happen.
        
               | rockskon wrote:
               | How about you answer my question since he did not.
               | 
               | Would you trust your life to a self-driving car in New
               | York City traffic?
        
               | lxgr wrote:
               | GP got it exactly right: I already am. There's no way for
               | me to opt out of having self-driving cars on the streets
               | I regularly cross as a pedestrian.
        
               | chefandy wrote:
               | If there are any fully-autonomous cars on the streets of
               | nyc, there aren't many of them and I don't think there's
               | any way for them to operate legally. There has been
               | discussion about having a trial.
        
               | wwweston wrote:
               | We can just insulate businesses employing AI from any
               | liability, problem solved.
        
               | fsloth wrote:
               | I guess - yes, from a business & liability sense? "This
               | service you are now paying $100 for? We can sell it to
               | you for $5, but with the caveat _we give no guarantees
               | that it works or is fit for purpose_ - click here to
               | accept".
        
               | 9dev wrote:
               | "Well, our AI that was specifically designed for
               | maximising gains above all else may indeed have
               | instructed the workers to cut down the entire Amazonas
               | forest for short-term gains in furniture production." But
               | no human was involved in the decision, so nobody is
               | liable and everything is golden? Is that the future you
               | would like to live in?
        
               | lazide wrote:
               | Hmmm, how much stock do I own in this hypothetical
               | company? (/s, kinda)
        
               | wwweston wrote:
               | Apparently I need to work on my deadpan delivery.
               | 
               | Or just articulate things openly: we _already_ insulate
               | business owners from liability because we think it tunes
               | investment incentives, and in so doing have created
               | social entities /corporate "persons"/a kind of AI who
               | have different incentives than most human beings but are
               | driving important social decisions. And they've supported
               | some astonishing cooperation which has helped produce
               | things like the infrastructure on which we are having
               | this conversation! But also, we have existing AIs of this
               | kind who are already inclined to cut down the entire
               | Amazonas forest for furniture production because it
               | maximizes their function.
               | 
               | That's not just the future we live in, that's the world
               | we've been living in for a century or few. On one hand,
               | industrial productivity benefits, on the other hand, it
               | values human life and the ecology we depend on about like
               | any other industrial input. Yet many people in the
               | world's premier (former?) democracy repeat enthusiastic
               | endorsements of this philosophy reducing their personal
               | skin to little more than an industrial input: "run the
               | government like a business."
               | 
               | Unless people change, we are very much on track to create
               | a world where these dynamics (among others) of the human
               | condition are greatly magnified by all kinds of
               | automation technology, including AI. Probably starting
               | with limited liability for AIs and companies employing
               | them, possibly even _statutory_ limits, though it 's much
               | more likely that wealthy businesses will simply be
               | insulated by the sheer resources they have to make
               | sure the courts can't hold them accountable, even where
               | we still have a judicial system that isn't willing to
               | play calvinball for cash or catechism (which,
               | unfortunately, does not seem to include a supreme court
               | majority).
               | 
               | In short, you and I probably agree that liability for AI
               | is important, and limited liability for it isn't good.
               | Perhaps I am too skeptical that we can pull this off, and
               | being optimistic would serve everyone better.
        
               | ijidak wrote:
               | It is amazing to me that we have reached an era where we
               | are debating the trade-off of hiring thinking machines!
               | 
               | I mean, this is an incredible moment from that
               | standpoint.
               | 
               | Regarding the topic at hand, I think that there will
               | always be room for humans for the reasons you listed.
               | 
               | But even replacing 5% of humans with AIs will have mind-
               | boggling consequences.
               | 
               | I think you're right that there are jobs that humans will
               | be preferred for for quite some time.
               | 
               | But, I'm already using AI with success where I would
               | previously hire a human, and this is in this primitive
               | stage.
               | 
               | With the leaps we are seeing, AI is coming for jobs.
               | 
               | Your concerns relate to exactly how many jobs.
               | 
               | And only time will tell.
               | 
               | But I think some meaningful percentage of the population,
               | even if just 5% of humanity, will be replaced by AI.
        
               | miki123211 wrote:
               | This is a really hard and weird ethical problem IMHO, and
               | one we'll have to deal with sooner or later.
               | 
               | Imagine you have a self-driving AI that causes fatal
               | accidents 10 times less often than your average human
               | driver, but when the accidents happen, nobody knows why.
               | 
               | Should we switch to that AI, and have 10 times fewer
               | accidents and no accountability for the accidents that do
               | happen, or should we stay with humans, have 10x more road
               | fatalities, but stay happy because the perpetrators end
               | up in prison?
               | 
               | Framed like that, it seems like the former solution is
               | the only acceptable one, yet people call for CEOs to go
               | to prison when an AI goes wrong. If that were the case,
               | companies wouldn't dare use any AI, and that would
               | basically degenerate to the latter solution.
        
               | okasaki wrote:
               | Wait, why would we want 10x more traffic fatalities?
        
               | stavros wrote:
               | We wouldn't, that's their point.
        
               | moritzwarhier wrote:
               | I don't know about your country, but people going to
               | prison for causing road fatalities is extremely rare
               | here.
               | 
               | Even temporary loss of the drivers license has a very
               | high bar, and that's the main form of accountability for
               | driver behavior in Germany, apart from fines.
               | 
               | Badly injuring or killing someone who themselves did not
               | violate traffic safety regulations is far from guaranteed
               | to cause severe repercussions for the driver.
               | 
               | By default, any such situation is an accident and at best
               | people lose their license for a couple of months.
        
               | paulryanrogers wrote:
               | Drivers are the apex predators. My local BMV passed me
               | after I badly failed the vision test. Thankfully I was
               | shaken enough to immediately go to the eye doctor and get
               | treatment.
        
               | chefandy wrote:
               | Sadly, we live in a society where those executives would
               | use that impunity as carte blanche to spend no money
               | improving (in the best-case scenario,) or even more
               | likely, keep cutting safety expenditures until the body
               | counts get high enough for it to start damaging sales. If
               | we've already given them a free pass, they will exploit
               | it to the greatest possible extent to increase profit.
        
               | ETH_start wrote:
               | What evidence exists for this characterization?
        
               | rgbrgb wrote:
               | The way health insurance companies optimize for denials
               | in the US.
        
               | chefandy wrote:
               | Let's see... off the top of my head...
               | 
               | - Air Pollution
               | 
               | - Water Pollution
               | 
               | - Disposable Packaging
               | 
               | - Health Insurance
               | 
               | - Steward Hospitals
               | 
               | - Marketing Junk Food, Candy and Sodas directly to
               | children
               | 
               | - Tobacco
               | 
               | - Boeing
               | 
               | - Finance
               | 
               | - Pharmaceutical Opiates
               | 
               | - Oral Phenylephrine to replace pseudoephedrine despite
               | knowing a) it wasn't effective, and b) posed a risk to
               | people with common medical conditions.
               | 
               | - Social Media engagement maximization
               | 
               | - Data Brokerage
               | 
               | - Mining Safety
               | 
               | - Construction site safety
               | 
               | - Styrofoam Food and Bev Containers
               | 
               | - ITC terminal in Deerfield Park (read about the decades
               | of them spewing thousands of pounds of benzene into the air
               | before the whole fucking thing blew up, using their
               | influence to avoid addressing any of it, and how they
               | didn't have automatic valves, spill detection, fire
               | detection, sprinklers... in _2019_.)
               | 
               | - Grocery store and restaurant chains disallowing
               | cashiers from wearing masks during the first pandemic
               | wave, well after we knew the necessity, because it made
               | customers uncomfortable.
               | 
               | - Boar's Head Liverwurst
               | 
               | And, you know, plenty more. As someone that grew up
               | playing in an unmarked, illegal, not-access-controlled
               | toxic waste dump in a residential area owned by a huge
               | international chemical conglomerate-- and just had some
               | cancer taken out of me last year-- I'm pretty familiar
               | with various ways corporations are willing to sacrifice
               | health and safety to bump up their profit margin. I guess
               | ignoring that kids were obviously playing in a swamp of
               | toluene, PCBs, waste firefighting chemicals, and all
               | sorts of other things on a plot not even within sight of
               | the factory in the middle of a bunch of small farms was
               | _just the cost of doing business_. As was my friend who,
               | when he was in vocational high school, was welding a
               | metal ladder above a storage tank in a chemical factory
               | across the state. The plant manager assured the school
               | the tanks were empty, triple rinsed and dry, but they
               | exploded, blowing the roof off the factory taking my
               | friend with it. They were apparently full of waste
               | chemicals and IIRC, the manager admitted to knowing that
               | in court. He said he remembers waking up briefly in the
               | factory parking lot where he landed, and then the next
               | thing he remembers was waking up in extreme pain wearing
               | the compression gear he'd have to wear into his mid
               | twenties to keep his grafted skin on. Briefly looking
               | into the topic will show how common this sort of
               | malfeasance is in manufacturing.
               | 
               | The burden of proof is on people saying that they _won't_
               | act like the rest of American industry tasked with
               | safety.
        
               | ajmurmann wrote:
               | Like with Cruise. One freak accident and they practically
               | decided to go out of business. Oh wait...
        
               | chefandy wrote:
               | If that's the only data point you look at in American
               | industry, it would be pretty encouraging. I mean,
               | _surely_ they'd have done the same if they were a branch
               | of a large publicly traded company with a big high-
               | production product pipeline...
        
               | monkeynotes wrote:
               | > nobody knows why
               | 
               | But we do know the culpability rests on the shoulders of
               | the humans who decided the tech was ready for work.
        
               | ethbr1 wrote:
               | Hey look, it's almost like we're back at the end of the
               | First Industrial Revolution (~1850), as society grapples
               | with how to create happiness in a rapidly shifting
                | economy of supply and demand, especially for labor.
                | https://en.m.wikipedia.org/wiki/Utilitarianism#John_Stuart_M...
               | 
               | Pretty bloody time for labor though.
               | https://en.m.wikipedia.org/wiki/Haymarket_affair
        
               | MaxPock wrote:
                | It depends on what the risk is. Would it be whole or in
                | part? In an organisation, a failure by HR might present
                | an isolated departmental risk, while with an AI that
                | might not be the case.
        
               | zelphirkalt wrote:
               | Deterministic they may be, but unforeseeable for humans.
        
               | ajmurmann wrote:
               | Every statistic I've seen indicated much better accident
               | rates for self-driving cars than human drivers. I've
               | taken Waymo rides in SF and felt perfectly safe. I've
               | taken Lyft and Uber and especially taxi rides where I
               | felt much less safe. So I definitely would take the self-
                | driving car. Just because I don't understand an accident
               | doesn't make it more likely to happen.
               | 
                | The one minor risk I see is the car being too polite and
               | getting effectively stuck in dense traffic. That's a
               | nuisance though.
               | 
               | Is there something about NYC traffic I'm missing?
        
               | aprilthird2021 wrote:
               | There's one important part about risk management though.
               | If your Waymo does crash, the company is liable for it,
               | and there's no one to shift the blame onto. If a human
               | driver crashes, that's who you can shift liability onto.
               | 
               | Same with any company that employs AI agents. Sure they
               | can work 24/7, but every mistake they make the company
               | will be liable for (or the AI seller). With humans, their
                | fraud, their cheating, their deception, can all be shifted
                | off the company and onto the individual.
        
               | ethbr1 wrote:
               | The next step is going to be around liability insurance
               | for AI agents.
               | 
               | That's literally the point of liability insurance -- to
               | allow the routine use of technologies that rarely (but
               | catastrophically) fail, by ammortizing risk over time /
               | population.
        
               | aprilthird2021 wrote:
               | Potentially. I would be skeptical that businesses can do
               | this to shield themselves from the liability. For
               | example, VW could not use insurance to protect them from
               | their emissions scandal. There are thresholds (fraud,
               | etc.) that AI can breach, which I don't think insurance
               | can legally protect you from
        
             | danielovichdk wrote:
             | Name one technology that has come with computers that
              | hasn't resulted in more humans being put to work?
             | 
             | The rhetoric of not needing people doing work is
              | cartoonish. I mean, there is no sane explanation of how and
              | why that would happen without employing yet more people to
              | take care of the advancements.
             | 
              | It's not like technology has brought less work-related
             | stress. But it has definitely increased it. Humans were not
             | made for using technology at such a pace as it's being
             | rolled out.
             | 
             | The world is fucked. Totally fucked.
        
               | mortehu wrote:
               | Self check-out stations, ATMs, and online brokerages.
               | Recently chat support. Namely cases where millions of
               | people used to interact with a representative every week,
               | and now they don't.
        
               | palmfacehn wrote:
               | "Name one use of electric lighting that hasn't resulted
               | in candle makers losing work?"
               | 
               | The framing of the question misses the point. With
               | electric lighting we can now work longer into the night.
                | Yes, fewer people use and make candles. However, the
               | second order effects allow us to be more productive in
               | areas we may not have previously considered.
               | 
               | New technologies open up new opportunities for
               | productivity. The bank tellers displaced by ATM machines
               | can create value elsewhere. Consumers save time by not
               | waiting in a queue, allowing them to use their time more
               | economically. Banks have lower overhead, allowing more
               | customers to afford their services.
        
               | 0points wrote:
               | Where to even start?
               | 
               | Digital banks
               | 
               | Cashless money transfer services
               | 
               | Self service
               | 
               | Modern farms
               | 
               | Robo lawn mowers
               | 
                | NVRs with object detection
               | 
               | I can go on forever
        
               | salawat wrote:
               | Please do. I'm certain you can't, and you'll have to stop
               | much sooner than you think. Appeals to triviality are the
               | first refuge of the person who thinks they know, but does
               | not.
        
             | TheOtherHobbes wrote:
             | It's all fun and games until the infra crashes and you
              | can't work out why, because a machine has written all of
              | the code and no one understands how it works or what it's
              | doing.
             | 
             | Or - worse - there is no accessible code anywhere, and you
             | have to prompt your way out of "I'm sorry Dave, I can't do
             | that," while nothing works.
             | 
             | And a human-free economy does... what? For whom? When 99%
             | of the population is unemployed, what are the 1% doing
             | while the planet's ecosystems collapse around them?
        
               | sirsinsalot wrote:
               | It honestly borders on psychopathic the way engineers are
               | treating humans in this context.
               | 
                | People talking like this also, in the back of their minds,
                | like to think they'll be OK. They're smart enough to still
                | be needed. They're human, but they'll be OK even
                | while working to make genAI outperform them at their own
               | work.
               | 
               | I wonder how they'll feel about their own hubris when
               | they struggle to feed their family.
               | 
               | The US can barely make healthcare work without disgusting
               | consequences for the sick. I wonder what mass
               | unemployment looks like.
        
               | a2800276 wrote:
               | But when Sam Altman owns all the money in the world
                | surely he'll distribute some of it via his not-for-profit AI
               | company?
        
               | bnj wrote:
               | For the moment the displacement is asymmetrical; AI
               | replacing employees, but not AI replacing consumers. If
               | AI causes mass unemployment, the pool of consumers
               | (profit to companies) will shrink. I wonder what the
               | ripple effects of that will be.
        
               | sirsinsalot wrote:
               | There's no point being rich in a world where the economy
               | is unhealthy.
        
               | jvanderbot wrote:
               | It honestly borders on midwit to constantly introduce a
               | false dichotomy of AI vs humans. It's just stupid base
               | animal logic.
               | 
               | There is absolutely no reason a programmer should expect
               | to write code as they do now forever, just as ASM experts
               | had to move on. And there's no reason (no precedent _and_
               | no indicators) to expect that a well-educated, even-
               | moderately-experienced technologist will suddenly find
               | themselves without a way to feed their family - unless
               | they stubbornly refuse to reskill or change their
               | workflows.
               | 
               | I do believe the days of "everyone makes 100k+" are
               | nearly over, and we're headed towards a severely bimodal
               | distribution, but I do not see how, for the next 10-15
               | years at least, we can't all become productive building
               | the tools that will obviate our own jobs while we do them
                | - and get comfortably retired in the meantime.
        
               | losteric wrote:
               | There is no comfortable retirement if the process of
               | obviating our own jobs is not coupled with appropriate
               | socioeconomic changes.
        
               | jvanderbot wrote:
               | I don't see it. Don't you have a 401k or EU style
               | pension? Aren't you saving some money? If not, why are
               | you in software? I don't make as much as I thought I
               | might, but I make enough to consider the possibility of
               | surviving a career change.
        
               | twh270 wrote:
               | Reskill to what? When AI can do software development, it
               | will also be able to do pretty much any other job that
               | requires some learning.
        
               | jvanderbot wrote:
               | Even if one refuses to move on from software dev to
               | something like AI deployer or AI validator or AI steerer,
               | there might be a need.
               | 
               | If innovation ceases, then AI is king - push existing
               | knowledge into your dataset, train, and exploit.
               | 
               | If innovation continues, there's always a gap. It takes
               | time for a new thing to be made public "enough" for it to
               | be ingested and synthesized. Who does this? Who finds the
               | new knowledge?
               | 
               | Who creates the direction and asks the questions? Who
               | determines what to build in the first place? Who
               | synthesizes the daily experience of everyone around them
               | to decide what tool needs to exist to make our lives
               | easier? Maybe I'm grasping at straws here, but the world
               | in which all scientific discovery, synthesis, direction
               | and vision setting, etc, is determined by AI seems really
               | far away when we talk about code generation and symbolic
               | math manipulation.
               | 
               | These tools are self driving cars, and we're drivers of
               | the software fleet. We need to embrace the fact that we
               | might end up watching 10 cars self operate rather than
               | driving one car, or maybe we're just setting
               | destinations, but there simply isn't an absolutist zero
               | sum game here unless all one thinks about is keeping the
               | car on the road.
               | 
               | AND even if there were, repeating doom and feeling
                | helpless is the last thing you want. Maybe it isn't a
                | guaranteed truth that we can all adapt and should try, but it's
               | certainly good _policy_.
        
               | exhaze wrote:
               | You misunderstand the fundamentals. I've built a type-
               | safe code generation pipeline using TypeScript that
               | enforces compile-time and runtime safety. Everything
               | generates from a single source of truth - structured JSON
               | containing the business logic. The output is
               | deterministic, inspectable, and version controlled.
               | 
               | Your concerns about mysterious AI code and system crashes
               | are backwards. This approach eliminates integration bugs
               | and maintenance issues by design. The generated
               | TypeScript is readable, fully typed, and consistently
               | updated across the entire stack when business logic
               | changes.
               | 
               | If you're struggling with AI-generated code
               | maintainability, that's an implementation problem, not a
               | fundamental issue with code generation. Proper type
               | safety and schema validation create more reliable
               | systems, not less. This is automation making developers
               | more productive - just like compilers and IDEs did - not
               | replacing them.
               | 
               | The code works because it's built on sound software
               | engineering principles: type safety, single source of
               | truth, and deterministic generation. That's verifiable
               | fact, not speculation.
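                | 
                | For illustration, a minimal sketch of that kind of
                | pipeline in Python (the spec, field names, and emitted
                | interfaces are hypothetical, not the actual system
                | described above): a JSON source of truth is read and
                | TypeScript is emitted from it deterministically, so
                | regeneration is reproducible and diffable.
                | 
                |     import json
                |     
                |     # Hypothetical single source of truth: business logic as data.
                |     SPEC = json.loads("""
                |     {
                |       "Invoice": {
                |         "id": "string",
                |         "amount_cents": "number",
                |         "paid": "boolean"
                |       }
                |     }
                |     """)
                |     
                |     TS = {"string": "string", "number": "number", "boolean": "boolean"}
                |     
                |     def emit_typescript(spec):
                |         # Stable ordering keeps regeneration reproducible and diffable.
                |         out = []
                |         for name in sorted(spec):
                |             out.append(f"export interface {name} {{")
                |             for field, kind in sorted(spec[name].items()):
                |                 out.append(f"  {field}: {TS[kind]};")
                |             out.append("}")
                |         return "\n".join(out)
                |     
                |     print(emit_typescript(SPEC))  # commit the output to version control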
        
             | bboygravity wrote:
             | humans definitely don't need office space, but your point
             | stands
        
               | AustinW wrote:
               | LLM office space is pretty expensive. Chillers, backup
               | generators, raised floors, communications gear, .... They
               | even demand multiple offices for redundancy, not to
               | mention the new ask of a nuclear power plant to keep the
               | lights on.
        
             | fsndz wrote:
             | I get the excitement, but folks, this is a model that
             | excels only in things like software engineering/math. They
             | basically used reinforcement learning to train the model to
             | better remember which pattern to use to solve specific
              | problems. This in no way generalises to open-ended tasks in
              | a way that makes a human in the loop unnecessary. This
             | basically makes assistants better (as soon as they figure
             | out how to make it cheaper), but I wouldn't blindly trust
             | the output of o3. Sam Altman is still wrong:
             | https://www.lycee.ai/blog/why-sam-altman-is-wrong
        
               | girvo wrote:
               | Quite. And if it _was_ right, those businesses deploying
               | it and replacing humans need humans with jobs and money
               | to pay for their products and services...
        
               | fakedang wrote:
               | It will just keep bleeding the middle class on and on,
               | till the point where either everyone is rich, homeless or
               | a plumber or other such licensed worker. And then there
               | will be such a glut in the latter (shrinking) market,
               | that everyone in that group also becomes either rich or
               | homeless.
        
               | palmfacehn wrote:
               | Productivity gains increase the standard of living for
               | everyone. Products and services become cheaper. Leisure
               | time increases. Scarce labor resources can be applied in
               | other areas.
               | 
               | I fail to see the difference between AI-employment-doom
               | and other flavors of Luddism.
        
               | bayindirh wrote:
               | It also fuels the income inequality with a fatter pipe in
               | every iteration. You get richer as you move up in the
               | supply chain, period. Companies vertically integrate to
               | drive costs down in the long run.
               | 
               | As AI gets more prevalent, it'll drive the cost down for
               | the companies supplying these services, so the _former_
               | employees of said companies will be paid lower, or not at
               | all.
               | 
                | So, tell me, how will paying fewer people less money
                | drive their standard of living upwards? I can understand
               | the leisure time. Because, when you don't have a job, all
               | day is leisure time. But you'll need money for that, so
               | will these companies fund the masses via government to
                | provide Universal Basic Income, so these people can
                | live a borderline miserable life while funding these
                | companies to extract more and more from them?
        
               | CamperBob2 wrote:
               | _It also fuels the income inequality with a fatter pipe
               | in every iteration_
               | 
               | Who cares? A rising tide lifts all boats. The wealthy
               | people I know all have one thing in common: they focused
               | more on their own bank accounts than on other people's.
               | 
               |  _So, tell me, how paying fewer people less money will
               | drive their standard of living upwards?_
               | 
               | Money is how we allocate limited resources. It will
               | become less important as resources become less limited,
               | less necessary, or (hopefully) both.
        
               | EarthAmbassador wrote:
               | Utter nonsense. Productivity gains of the last 40 years
               | have been captured by shareholders and top elites.
               | Working class wages have been flat all of that time
               | despite that gain.
               | 
               | In 2012, Musk was worth $2 billion. He's now worth 223
               | times that yet the minimum wage has barely budged in the
               | last 12 years as productivity rises.
        
               | palmfacehn wrote:
               | >>Productivity gains increase the standard of living for
               | everyone.
               | 
               | >Productivity gains of the last 40 years have been
               | captured by shareholders and top elites. Working class
               | wages have been flat...
               | 
               | Wages do not determine the standard of living. The
               | products and services purchased with wages determine the
               | standard of living. "Top elites" in 1984 could already
               | afford cellular phones, such as the Motorola DynaTAC:
               | 
               | >A full charge took roughly 10 hours, and it offered 30
               | minutes of talk time. It also offered an LED display for
               | dialing or recall of one of 30 phone numbers. It was
               | priced at US$3,995 in 1984, its commercial release year,
               | equivalent to $11,716 in 2023.
               | 
               | https://en.wikipedia.org/wiki/Motorola_DynaTAC
               | 
               | Unfortunately, touch screen phones with gigabytes of ram
               | were not available for the masses 40 years ago.
        
               | DAGdug wrote:
               | What a patently absurd POV! A phone doesn't compensate
               | for the inability to solve for basic needs - housing,
               | healthy food, healthcare. Or being unable to invest in
               | skill development for themselves or their offspring, save
               | for retirement.
        
               | runarberg wrote:
               | It is also highly likely that the cost of that phone was
               | externalized onto a worker in a poorer country that
                | doesn't even have basic necessities like running water,
                | 24-hour electricity, food security, etc.
        
               | DAGdug wrote:
               | Leisure time hasn't increased in the last 100 years
               | except for the lower income class which doesn't have
               | steady employment. But yes, I see your point that the
               | homeless person who might have had a home if he had a
               | (now automated) factory job should surely feel good about
               | having a phone that only the ultra rich had 40 years ago.
        
               | ethbr1 wrote:
               | It's not worth tossing away in sarcasm.
               | 
               | The availability of cheaply priced smartphones and
               | cellular data plans has absolutely made being homeless
               | suck less.
               | 
               | As you noted though, a home would probably be a
               | preferable alternative.
        
               | szundi wrote:
                | That has never happened with any big technological advancement.
        
               | bayindirh wrote:
                | Wealth has bled from landlords to warlords and is now
                | bleeding to techlords.
                | 
                | Warlords are still rich, but both money and war are
                | flowing towards tech. You can get a piece of that pie
                | if you're doing questionable things (adtech, targeting,
                | data collection, brokering, etc.), but if you're a run-
                | of-the-mill, normal person, your circumstances are getting
                | harder and harder, because you're slowly squeezed out of
                | the system like toothpaste.
        
               | robwwilliams wrote:
               | In your blog you say:
               | 
               | > deep learning doesn't allow models to generalize
               | properly to out-of-distribution data--and that is
               | precisely what we need to build artificial general
               | intelligence.
               | 
               | I think even (or especially) people like Altman accept
               | this as a fact. I do. Hassabis has been saying this for
               | years.
               | 
               | The foundational models are just a foundation. Now start
               | building the AGI superstructure.
               | 
               | And this is also where most of the still human
               | intellectual energy is now.
        
             | jvanderbot wrote:
             | Generally, I agree with you. But, there are risks other
             | than "But a human might have a baby any time now - what
             | then??".
             | 
             | For AI example(s): Attribution is low, a system built
             | without human intervention may suddenly fall outside its
             | own expertise and hallucinate itself into a corner,
             | everyone may just throw more compute at a system until it
             | grows without bound, etc etc.
             | 
             | This "You can scale up to infinity" problem might become
             | "You have to scale up to infinity" to build any reasonably
             | sized system with AI. The shovel-sellers get fantastically
             | rich but the businesses are effectively left holding the
             | risk from a fast-moving, unintuitive, uninspected,
             | partially verified codebase. I just don't see how anyone
             | not building a CRUD app/frontend could be comfortable with
             | that, but then again my Tesla is effectively running such a
             | system to drive me and my kids. Albeit, that's on a well-
             | defined problem and within _literally_ human-made
             | guardrails.
        
             | monkeynotes wrote:
             | "AIs are a lot less risky to deploy for businesses than
             | humans" How do you know? LLMs can't even be properly
             | scrutinized, while humans at least follow common psychology
             | and patterns we've understood for thousands of years. This
             | actually makes humans more predictable and manageable than
             | you might think.
             | 
             | The wild part is that LLMs understand us way better than we
             | understand them. The jump from GPT-3 to GPT-4 even
             | surprised the engineers who built it. That should raise
             | some red flags about how "predictable" these systems really
             | are.
             | 
             | Think about it - we can't actually verify what these models
             | are capable of or if they're being truthful, while they
             | have this massive knowledge base about human behavior and
             | psychology. That's a pretty concerning power imbalance.
             | What looks like lower risk on the surface might be hiding
             | much deeper uncertainties that we can't even detect, let
             | alone control.
        
               | ETH_start wrote:
                | We are not pitted against AI in these match-ups. Instead,
               | all humans and AI aligned with the goal of improving the
               | human condition, are pitted against rogue AI which are
               | not. Our capability to keep rogue AI in check therefore
               | grows in proportion to the capabilities of AI.
        
               | daveguy wrote:
               | The GP post is about how much better these AIs will be
               | than humans once they reach a given skill level. So, yes,
               | we are very much pitted against AI unless there are major
                | socioeconomic changes. I don't think we are as close to an
               | AGI as a lot of people are hyping, but at some point it
               | would be a direct challenge to human employment. And we
               | should think about it before that happens.
        
               | salawat wrote:
               | You cannot tell the difference between the two veins of
               | AI. Why do you have such a hard time understanding that?
        
             | zitterbewegung wrote:
             | Having AI "tarnish the reputation of your company"
              | encompasses so much when the AI can receive input and be
              | manipulated by others, as with Tay from Microsoft, among
              | many other outcomes where there is a true risk in AI
              | deployment.
        
               | fakedang wrote:
                | We can all agree we've progressed so much since Tay.
        
             | cmiles74 wrote:
             | "...they need no corporate campuses, office space..."
             | 
             | This is a big downside of AI, IMHO. Those offices need to
             | be filled! ;-)
        
             | Mistletoe wrote:
             | At what point in the curve of AI is it not ethical to work
             | an AI 24/7 because it is alive? What if it is exactly the
             | same point where you reach human level performance?
        
             | osigurdson wrote:
             | Sure, once AI can actually do a job of some sort, without
             | assistance, that job is gone - even if the machine costs
             | significantly more. However, it can't remotely do that now
             | so can only help a bit.
        
           | m3kw9 wrote:
            | Don't forget that humans, who have real general intelligence,
            | paired with increasingly capable AI can create a feedback loop
            | to accelerate new advances.
        
           | acchow wrote:
           | > ~doubling every 2-2.5 years) puts us at 20~25 years.
           | 
           | The trend for power consumption of compute (Megaflops per
           | watt) has generally tracked with Koomey's law for a doubling
            | every 1.57 years.
           | 
           | Then you also have model performance improving with
           | compression. For example, Llama 3.1's 8B outperforming the
            | original Llama 65B.
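            | 
            | Back-of-the-envelope, a doubling period t means efficiency
            | grows by 2^(years/t). A quick sketch of that arithmetic (the
            | 20-year horizon and the two doubling periods are just the
            | figures under discussion, not measurements):
            | 
            |     def gain(years, doubling_period):
            |         # Multiplier in compute-per-watt after `years`.
            |         return 2 ** (years / doubling_period)
            |     
            |     for t in (1.57, 2.5):  # Koomey-style vs. a slower doubling
            |         print(f"{t} yr doubling -> {gain(20, t):,.0f}x in 20 years")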
        
             | 0points wrote:
              | Then you will just have the issue of supplying enough
             | power to support this "linear" growth of yours.
        
           | agumonkey wrote:
            | Who in this field is anticipating the impact of near-AGI on
            | society? Maybe I'm too anxious, but not planning for a
            | potentially workless life seems dangerous (but maybe I'm just
            | not following the right groups).
        
             | daveguy wrote:
             | AGI would have a major impact on human work. Currently the
             | hype is much greater than the reality. But it looks like we
             | are starting to see some of the components of an AGI and
             | that is cause for discussion of impact, but not panicked
             | discussion. Even the chatbot customer service has to be
             | trained on the domain. Still it is most useful in a few
             | specific ways:
             | 
             | Routing to the correct human support
             | 
             | Providing FAQ level responses to the most common problems.
             | 
             | Providing a second opinion to the human taking the call.
             | 
             | So, even this most relevant domain for the technology
             | doesn't eliminate human employment (because it's just not
             | flexible or reliable enough yet).
        
         | spencerchubb wrote:
         | > Super exciting that OpenAI pushed the compute out this far
         | 
         | it's even more exciting than that. the fact that you even _can_
         | use more compute to get more intelligence is a breakthrough. if
         | they spent even more on inference, would they get even better
         | scores on arc agi?
        
           | echelon wrote:
           | Maybe it's not linear spend.
        
           | lolinder wrote:
           | > the fact that you even can use more compute to get more
           | intelligence is a breakthrough.
           | 
           | I'm not so sure--what they're doing by just throwing more
           | tokens at it is similar to "solving" the traveling salesman
           | problem by just throwing tons of compute into a breadth first
           | search. Sure, you can get better and better answers the more
           | compute you throw at it (with diminishing returns), but is
           | that really that surprising to anyone who's been following
           | tree of thought models?
           | 
           | All it really seems to tell us is that the _type_ of model
           | that OpenAI has available is capable of solving many of the
           | _types_ of problems that ARC-AGI-PUB has set up given enough
           | compute time. It says nothing about  "intelligence" as the
           | concept exists in most people's heads--it just means that a
           | certain very artificial (and intentionally easy for humans)
           | class of problem that wasn't computable is now computable if
           | you're willing to pay an enormous sum to do it. A
           | breakthrough of sorts, sure, but not a surprising one given
           | what we've seen already.
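            | 
            | A toy sketch of that dynamic (using best-of-N random sampling
            | on a made-up 10-city instance rather than an actual breadth-
            | first search, to keep it short): more samples keep improving
            | the best answer found, but with clearly diminishing returns.
            | 
            |     import math, random
            |     
            |     random.seed(0)
            |     pts = [(random.random(), random.random()) for _ in range(10)]
            |     
            |     def tour_len(o):
            |         # Length of the closed tour visiting cities in order `o`.
            |         return sum(math.dist(pts[o[i]], pts[o[(i + 1) % 10]])
            |                    for i in range(10))
            |     
            |     best, used = float("inf"), 0
            |     for budget in (10, 100, 1_000, 10_000):
            |         while used < budget:
            |             best = min(best, tour_len(random.sample(range(10), 10)))
            |             used += 1
            |         print(f"{budget:>6} samples: best tour {best:.3f}")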
        
         | freehorse wrote:
         | > I am interpreting this result as human level reasoning now
         | costs (approximately) 41k/hr to 2.5M/hr with current compute.
         | 
          | On a very simple, toy task, which ARC-AGI basically is. ARC-AGI
          | tests are not hard per se; it's just that LLMs find them hard. We
          | do not know how this scales for more complex, real-world tasks.
        
           | SamPatt wrote:
           | Right. Arc is meant to test the ability of a model to
           | generalize. It's neat to see it succeed, but it's not yet a
           | guarantee that it can generalize when given other tasks.
           | 
           | The other benchmarks are a good indication though.
        
             | criddell wrote:
             | Does it mean anything for more general tasks like driving a
             | car?
        
               | brookst wrote:
               | Is every smart person a good driver?
        
               | zarzavat wrote:
               | Likely yes. Every smart person is capable of being a good
               | driver, so long as you give them enough training and
               | incentive. Zero smart people are born being able to
               | drive.
        
               | fragmede wrote:
               | There are different kinds of smarts and not every smart
                | person is good at all of them. Specifically, spatial
               | reasoning is important for driving, and if a smart person
               | is good at all kinds of thinking except that one, they're
               | going to find it challenging to be a good driver.
        
               | sethammons wrote:
               | Says the technical founder and CTO of our startup who
               | exited with 9 figures and who also has a severe lazy eye:
               | you don't want me driving. He got pulled over for
                | suspected DUI; totally clean, just can't drive straight.
        
               | earth2mars wrote:
                | That kind of proves the point that no matter how smart
                | it gets, it may still have deficits in abilities that are
                | crucial, and trivially easy, for humans. Is it generalizing
                | to any task, or only to a specific set of tasks?
        
         | madduci wrote:
         | Let's see when this will be released to the free tier. Looks
         | promising, although I hope they will also be able to publish
         | more details on this, as part of the "open" in their name
        
         | daxfohl wrote:
         | I wonder if we'll start seeing a shift in compute spend, moving
         | away from training time, and toward inference time instead. As
         | we get closer to AGI, we probably reach some limit in terms of
         | how smart the thing can get just training on existing docs or
         | data or whatever. At some point it knows everything it'll ever
         | know, no matter how much training compute you throw at it.
         | 
         | To move beyond that, the thing has to start thinking for
         | itself, some auto feedback loop, training itself on its own
         | thoughts. Interestingly, this could plausibly be vastly more
         | efficient than training on external data because it's a much
         | tighter feedback loop and a smaller dataset. So it's possible
         | that "nearly AGI" leads to ASI pretty quickly and efficiently.
         | 
         | Of course it's also possible that the feedback loop, while
         | efficient as a computation process, isn't efficient as a
         | learning / reasoning / learning-how-to-reason process, and the
         | thing, while as intelligent as a human, still barely competes
         | with a worm in true reasoning ability.
         | 
         | Interesting times.
        
         | empiko wrote:
         | I don't think this is only about efficiency. The model I have
         | here is that this is similar to when we beat chess. Yes, it is
         | impressive that we made progress on a class of problems, but is
         | this class aligned with what the economy or the society needs?
         | 
         | Simple turn-based games such as chess turned out to be too far
         | away from anything practical and chess-engine-like programs
         | were never that useful. It is entirely possible that this will
         | end up in a similar situation. ARC-like pattern matching
         | problems or programming challenges are indeed a respectable
         | challenge for AI, but do we need a program that is able to
         | solve them? How often does something like that come up really?
         | I can see some time-saving in using AI vs StackOverflow in
         | solving some programming challenges, but is there more to this?
        
           | edanm wrote:
           | I mostly agree with your analysis, but just to drive home a
           | point here - I don't think that algorithms to beat Chess were
           | ever seriously considered as something that would be relevant
           | outside of the context of Chess itself. And obviously, within
           | the world of Chess, they are major breakthroughs.
           | 
           | In this case there is _more_ reason to think these things are
           | relevant outside of the direct context - these tests were
           | specifically designed to see if AI can do general-thinking
           | tasks. The benchmarks might be _bad_ , but that's at least
           | their purpose (unlike in Chess).
        
         | cle wrote:
         | Efficiency has always been the key.
         | 
         | Fundamentally it's a search through some enormous state space.
         | Advancements are "tricks" that let us find useful subsets more
         | efficiently.
         | 
         | Zooming way out, we have a bunch of social tricks, hardware
         | tricks, and algorithmic tricks that have resulted in a super
         | useful subset. It's not the subset that we want though, so the
         | hunt continues.
         | 
         | Hopefully it doesn't require revising too much in the hardware
          | & social bag of tricks, those are a lot more painful to
         | revisit...
        
         | chefandy wrote:
         | I think the real key is figuring out how to turn the hand-wavy
         | promises of this _making everything better_ into policy long
         | fucking before we kick the door open. It's self-evident that
         | this being efficient and useful would be a technological
          | revolution; what's not self-evident is that it wouldn't benefit
          | the large corporate entities that control it even more
          | disproportionately than it does now, to the detriment of many
          | other people.
        
       | aithrowawaycomm wrote:
       | I would like to see this repeated with my highly innovative HARC-
       | HAGI, which is ARC-AGI but it uses hexagons instead of squares. I
       | suspect humans would only make slightly more brain farts on HARC-
       | HAGI than ARC-AGI, but O3 would fail very badly since it almost
       | certainly has been specifically trained on squares.
       | 
       | I am not really trying to downplay O3. But this would be a simple
       | test as to whether O3 is truly "a system capable of adapting to
       | tasks it has never encountered before" versus novel ARC-AGI tasks
       | it hasn't encountered before.
        
         | falcor84 wrote:
         | Here's my take - even if the o3 as currently implemented is
         | utterly useless on your HARC-HAGI, it is obvious that o3
         | coupled with its existing training pipeline trained briefly on
         | the hexagons would excel on it, such that passing your
         | benchmark doesn't require any new technology.
         | 
         | Taking this a level of abstraction higher, I expect that in the
         | next couple of years we'll see systems like o3 given a runtime
         | budget that they can use for training/fine-tuning smaller
         | models in an ad-hoc manner.
        
       | botro wrote:
       | The LLM community has come up with tests they call 'Misguided
       | Attention'[1] where they prompt the LLM with a slightly altered
       | version of common riddles / tests etc. This often causes the LLM
       | to fail.
       | 
       | For example I used the prompt "As an astronaut in China, would I
       | be able to see the great wall?" and since the training data for
       | all LLMs is full of text dispelling the common myth that the
       | great wall is visible from space, LLMs do not notice the slight
       | variation that the astronaut is IN China. This has been a
       | sobering reminder to me as discussion of AGI heats up.
       | 
       | [1] https://github.com/cpldcpu/MisguidedAttention
        
         | kizer wrote:
         | It could be that it "assumed" you meant "from China"; in the
         | higher level patterns it learns the imperfection of human
         | writing and the approximate threshold at which mistakes are
         | ignored vs addressed by training on conversations containing
          | these types of mistakes, e.g. Reddit. This is just a thought.
         | Try saying: As an astronaut in Chinese territory; or as an
         | astronaut on Chinese soil. Another test would be to prompt it
         | to interpret everything literally as written.
        
       | whimsicalism wrote:
       | We need to start making benchmarks in memory & continued
       | processing over a task over multiple days, handoffs, etc (ie.
       | 'agentic' behavior). Not sure how possible this is.
        
       | slibhb wrote:
       | Interesting about the cost:
       | 
       | > Of course, such generality comes at a steep cost, and wouldn't
       | quite be economical yet: you could pay a human to solve ARC-AGI
       | tasks for roughly $5 per task (we know, we did that), while
       | consuming mere cents in energy. Meanwhile o3 requires $17-20 per
       | task in the low-compute mode.
        
       | imranq wrote:
       | Based on the chart, the Kaggle SOTA model is far more impressive.
       | These O3 models are more expensive to run than just hiring a
        | Mechanical Turk worker. It's nice we are proving out the scaling
       | hypothesis further, it's just grossly inelegant.
       | 
       | The Kaggle SOTA performs 2x as well as o1 high at a fraction of
       | the cost
        
         | cvhc wrote:
         | I was going to say the same.
         | 
         | I wonder what exactly o3 costs. Does it still spend a terrible
         | amount of time thinking, despite being finetuned to the
         | dataset?
        
         | derac wrote:
         | But does that Kaggle solution achieve human level perf with any
         | level of compute? I think you're missing the forest for the
         | trees here.
        
           | tripletao wrote:
           | The article says the ensemble of Kaggle solutions (aggregated
           | in some unexplained way) achieves 81%. This is better than
           | their average Mechanical Turk worker, but worse than their
           | average STEM grad. It's better than tuned o3 with low
           | compute, worse than tuned o3 with high compute.
           | 
           | There's also a point on the figure marked "Kaggle SOTA",
           | around 60%. I can't find any explanation for that, but I
           | guess it's the best individual Kaggle solution.
           | 
           | The Kaggle solutions would probably score higher with more
           | compute, but nobody has any incentive to spend >$1M on
           | approaches that obviously don't generalize. OpenAI did have
           | this incentive to spend tuning and testing o3, since it's
           | possible that will generalize to a practically useful domain
           | (but not yet demonstrated). Even if it ultimately doesn't,
           | they're getting spectacular publicity now from that promise.
        
       | neuroelectron wrote:
       | OpenAI spent approximately $1,503,077 to smash the SOTA on ARC-
       | AGI with their new o3 model
       | 
       | semi-private evals (100 tasks): 75.7% @ $2,012 total/100 tasks
       | (~$20/task) with just 6 samples & 33M tokens processed in ~1.3
        | min/task.
       | 
       | The "low-efficiency" setting with 1024 samples scored 87.5% but
       | required 172x more compute.
       | 
       | If we assume compute spent and cost are proportional, then OpenAI
        | might have just spent ~$346,064 for the low-efficiency run on the
       | semi-private eval.
       | 
        | On the public eval they might have spent ~$1,148,444 to achieve
        | 91.5% with the low-efficiency setting. (high-efficiency mode:
       | $6677)
       | 
       | OpenAI just spent more money to run an eval on ARC than most
       | people spend on a full training run.
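        | 
        | A quick sketch of the arithmetic above, under the (big)
        | assumption that retail cost scales linearly with compute:
        | 
        |     high_eff_semi = 2_012  # USD, 100 semi-private tasks (~$20/task)
        |     high_eff_pub  = 6_677  # USD, public eval, high-efficiency mode
        |     multiplier    = 172    # low-efficiency mode used ~172x more compute
        |     
        |     print(f"semi-private, low-eff: ~${high_eff_semi * multiplier:,}")  # ~$346,064
        |     print(f"public eval, low-eff:  ~${high_eff_pub * multiplier:,}")   # ~$1,148,444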
        
         | rfoo wrote:
         | Pretty sure this "cost" is based on their retail price instead
         | of actual inference cost.
        
           | neuroelectron wrote:
           | Yes that's correct and there's a bit of "pixel math" as well
           | so take these numbers with a pinch of salt. Preliminary model
            | sizes from the temporarily public HF repository put the full
            | model size at 8 TB, or roughly 80 H100s.
        
             | az226 wrote:
             | I thought that was a fake.
        
               | neuroelectron wrote:
               | I didn't hear that but it could be. But it doesn't matter
               | really because there's so much more to consider in the
               | cost, R&D, including all the supporting functions of a
               | model like censorship and data capture and so on.
        
           | ec109685 wrote:
           | Yeah and can run off peak, etc.
           | 
           | Does seem to show an absolutely massive market for inference
           | compute...
        
         | bluecoconut wrote:
          | By my estimates, for this single benchmark, this is a cost
          | comparable to training a ~70B model from scratch today. Literally
         | from 0 to a GPT-3 scale model for the compute they ran on 100
         | ARC tasks.
         | 
          | I double-checked with some FLOP estimates (P100 for 12 hours =
          | Kaggle limit, they claim ~100-1000x for O3-low, and 172x for
          | O3-high), so roughly on the order of 10^22-10^23 FLOPs.
         | 
          | In another way, using an H100 market price of $2/chip-hour -> at
          | $350k, that's ~175k GPU-hours. Or 10^24 FLOPs in total.
         | 
          | So, huge margin, but 10^22 - 10^24 FLOPs is the band I think we
         | can estimate.
         | 
          | These are the scale of numbers that show up in the Chinchilla-
          | optimal paper, haha. Truly GPT-3 scale models.
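          | 
          | As a sanity check on the upper end of that band (the $2/GPU-hour
          | price and the ~1e15 FLOP/s H100 peak figure are assumptions;
          | real utilization would be lower):
          | 
          |     spend_usd    = 350_000
          |     usd_per_hour = 2.0    # assumed market price per H100-hour
          |     h100_flops   = 1e15   # assumed peak throughput, FLOP/s
          |     
          |     gpu_hours   = spend_usd / usd_per_hour       # ~175,000
          |     total_flops = gpu_hours * 3600 * h100_flops  # ~6e23
          |     
          |     print(f"{gpu_hours:,.0f} GPU-hours, ~{total_flops:.1e} FLOPs")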
        
         | rvnx wrote:
          | It sounds like they essentially brute-forced the solutions?
          | Ask the LLM for an answer, ask the LLM to verify the answer. Ask
          | the LLM for an answer, ask the LLM to verify the answer. Add a bit
          | of randomness. Ask the LLM for an answer, ask the LLM to verify the
          | answer. Add a bit of randomness. Repeat 5B times (this is what
          | the paper says).
        
         | ramesh31 wrote:
         | >OpenAI just spent more money to run an eval on ARC than most
         | people spend on a full training run.
         | 
         | Of course, this is just the scaling law holding true. More is
          | more when it comes to LLMs as far as we've seen. Now it's just
         | on the hardware side to make this feasible economically.
        
       | sys32768 wrote:
       | So in a few years, coders will be as relevant as cuneiform
       | scribes.
        
         | HarHarVeryFunny wrote:
          | I've never seen a company looking for a "coder", any more than
         | they look to hire spreadsheet creators or powerpoint
         | specialists. A software developer can code, but being able to
          | code doesn't make you a software developer, any more than being
         | able to create a powerpoint makes you a manager (although in
         | some companies it might do, so maybe bad example!).
        
       | devoutsalsa wrote:
       | When the source code for these LLMs gets leaked, I expect to see:
        | 
        |     def letter_count(string, letter):
        |         if string == "strawberry" and letter == "r":
        |             return 3
        |         ...
        
         | knbknb wrote:
          | In one of their release videos for the o1-preview model they
         | _admitted_ that it's hardcoded in.
        
           | mukunda_johnson wrote:
           | Honestly I'm concerned how hacked up o3 is to secure a high
           | benchmark score.
        
       | phil917 wrote:
       | Direct quote from the ARC-AGI blog:
       | 
       | "SO IS IT AGI?
       | 
       | ARC-AGI serves as a critical benchmark for detecting such
       | breakthroughs, highlighting generalization power in a way that
       | saturated or less demanding benchmarks cannot. However, it is
       | important to note that ARC-AGI is not an acid test for AGI - as
       | we've repeated dozens of times this year. It's a research tool
       | designed to focus attention on the most challenging unsolved
       | problems in AI, a role it has fulfilled well over the past five
       | years.
       | 
       | Passing ARC-AGI does not equate achieving AGI, and, as a matter
       | of fact, I don't think o3 is AGI yet. o3 still fails on some very
       | easy tasks, indicating fundamental differences with human
       | intelligence.
       | 
       | Furthermore, early data points suggest that the upcoming ARC-
       | AGI-2 benchmark will still pose a significant challenge to o3,
       | potentially reducing its score to under 30% even at high compute
       | (while a smart human would still be able to score over 95% with
       | no training). This demonstrates the continued possibility of
       | creating challenging, unsaturated benchmarks without having to
       | rely on expert domain knowledge. You'll know AGI is here when the
       | exercise of creating tasks that are easy for regular humans but
       | hard for AI becomes simply impossible."
       | 
        | The high-compute variant sounds like it cost around *$350,000*,
        | which is kinda wild. Lol, the blog post specifically mentioned how
        | OpenAI asked ARC-AGI to not disclose the exact cost for the high
       | compute version.
       | 
       | Also, 1 odd thing I noticed is that the graph in their blog post
       | shows the top 2 scores as "tuned" (this was not displayed in the
        | live demo graph). This suggests in those cases that the model was
       | trained to better handle these types of questions, so I do wonder
       | about data / answer contamination in those cases...
        
         | Bjorkbat wrote:
         | > Also, 1 odd thing I noticed is that the graph in their blog
         | post shows the top 2 scores as "tuned"
         | 
         | Something I missed until I scrolled back to the top and reread
         | the page was this
         | 
         | > OpenAI's new o3 system - trained on the ARC-AGI-1 Public
         | Training set
         | 
         | So yeah, the results were specifically from a version of o3
         | trained on the public training set
         | 
         | Which on the one hand I think is a completely fair thing to do.
         | It's reasonable that you should teach your AI the rules of the
         | game, so to speak. There really aren't any spoken rules though,
         | just pattern observation. Thus, if you want to teach the AI how
         | to play the game, you must train it.
         | 
         | On the other hand though, I don't think the o1 models nor
         | Claude were trained on the dataset, in which case it isn't a
         | completely fair competition. If I had to guess, you could
         | probably get 60% on o1 if you trained it on the public dataset
         | as well.
        
           | skepticATX wrote:
           | Great catch. Super disappointing that AI companies continue
           | to do things like this. It's a great result either way but
           | predictably the excitement is focused on the jump from o1,
           | which is now in question.
        
             | Bjorkbat wrote:
             | To me it's very frustrating because such little caveats
             | make benchmarks less reliable. Implicitly, benchmarks are
             | no different from tests in that someone/something who
             | scores high on a benchmark/test _should_ be able to
             | generalize that knowledge out into the real world.
             | 
             | While that is true with humans taking tests, it's not
             | really true with AIs evaluating on benchmarks.
             | 
             | SWE-bench is a great example. Claude Sonnet can get
             | something like a 50% on verified, whereas I think I might
             | be able to score a 20-25%? So, Claude is a better
             | programmer than me.
             | 
             | Except that isn't really true. Claude can still make a lot
             | of clumsy mistakes. I wouldn't even say these are junior
             | engineer mistakes. I've used it for creative programming
             | tasks and have found one example where it tried to use a
             | library written for d3js for a p5js programming example.
             | The confusion is kind of understandable, but it's also a
             | really dumb mistake.
             | 
             | Some very simple explanations, the models were probably
             | overfitted to a degree on Python given its popularity in
             | AI/ML work, and SWE-bench is all Python. Also, the
             | underlying Github issues are quite old, so they probably
             | contaminated the training data and the models have simply
             | memorized the answers.
             | 
             | Or maybe benchmarks are just bad at measuring intelligence
             | in general.
             | 
             | Regardless, every time a model beats a benchmark I'm
             | annoyed by the fact that I have no clue whatsoever how much
             | this actually translates into real world performance. Did
             | OpenAI/Anthropic/Google actually create something that will
             | automate wide swathes of the software engineering
             | profession? Or did they create the world's most
             | knowledgeable junior engineer?
        
               | throwaway0123_5 wrote:
               | > Some very simple explanations, the models were probably
               | overfitted to a degree on Python given its popularity in
               | AI/ML work, and SWE-bench is all Python. Also, the
               | underlying Github issues are quite old, so they probably
               | contaminated the training data and the models have simply
               | memorized the answers.
               | 
               | My understanding is that it works by checking if the
               | proposed solution passes test-cases included in the
               | original (human) PR. This seems to present some problems
               | too, because there are surely ways to write code that
               | passes the tests but would fail human review for one
               | reason or another. It would be interesting to not only
               | see the pass rate but also the rate at which the proposed
               | solutions are preferred to the original ones (preferably
               | evaluated by a human but even an LLM comparing the two
               | solutions would be interesting).
        
               | Bjorkbat wrote:
               | If I recall correctly the authors of the benchmark did
               | mention on Twitter that for certain issues models will
               | submit an answer that technically passes the test but is
               | kind of questionable, so yeah, good point.
        
           | phil917 wrote:
           | Lol I missed that even though it's literally the first
           | sentence of the blog, good catch.
           | 
           | Yeah, that makes this result a lot less impressive for me.
        
         | hartator wrote:
         | > acid test
         | 
         | The css acid test? This can be gamed too.
        
           | sundarurfriend wrote:
           | https://en.wikipedia.org/wiki/Acid_test:
           | 
           | > An acid test is a qualitative chemical or metallurgical
           | assay utilizing acid. Historically, it often involved the use
           | of a robust acid to distinguish gold from base metals.
           | Figuratively, the term represents any definitive test for
           | attributes, such as gauging a person's character or
           | evaluating a product's performance.
           | 
           | Specifically here, they're using the figurative sense of
           | "definitive test".
        
             | airstrike wrote:
             | also a "litmus test" but I guess that's a different
             | chemistry test...
        
       | parsimo2010 wrote:
       | I really like that they include reference levels for an average
       | STEM grad and an average worker for Mechanical Turk. So for $350k
       | worth of compute you can have slightly better performance than a
       | menial wage worker, but slightly worse performance than a college
       | grad. Right now humans win on value, but AI is catching up.
        
         | nextworddev wrote:
          | Well, just 8 months ago that cost was near infinity. So if it
          | came down to $350k, that's a massive drop.
        
       | nxobject wrote:
       | As an aside, I'm a little miffed that the benchmark calls out
       | "AGI" in the name, but then heavily cautions that it's necessary
       | but insufficient for AGI.
       | 
       | > ARC-AGI serves as a critical benchmark for detecting such
       | breakthroughs, highlighting generalization power in a way that
       | saturated or less demanding benchmarks cannot. However, it is
       | important to note that ARC-AGI is not an acid test for AGI
        
         | mmcnl wrote:
         | I immediately thought so too. Why confuse everyone?
        
           | ec109685 wrote:
           | Because ARC somehow convinced people that solving it was an
           | indicator of AGI.
        
             | Jensson wrote:
             | Its like the "Open" in OpenAI or the "Democratic" in North
             | Koreas DPRK. Naming things helps fool a lot of people.
        
         | EthanHeilman wrote:
          | It is a necessary but not sufficient condition for AGI.
        
       | notRobot wrote:
       | Humans can take the test here to see what the questions are like:
       | https://arcprize.org/play
        
       | Balgair wrote:
       | Complete aside here: I used to do work with amputees and
       | prosthetics. There is a standardized test (and I just cannot
       | remember the name) that fits in a briefcase. It's used for
       | measuring the level of damage to the upper limbs and for
       | prosthetic grading.
       | 
       | Basically, it's got the dumbest and simplest things in it. Stuff
       | like a lock and key, a glass of water and jug, common units of
       | currency, a zipper, etc. It tests if you can do any of those
       | common human tasks. Like pouring a glass of water, picking up
       | coins from a flat surface (I chew off my nails so even an able
       | person like me fails that), zip up a jacket, lock your own door,
       | put on lipstick, etc.
       | 
       | We had hand prosthetics that could play Mozart at 5x speed on a
       | baby grand, but could not pick up a silver dollar or zip a jacket
       | even a little bit. To the patients, the hands were therefore
       | about as useful as a metal hook (a common solution with amputees
       | today, not just pirates!).
       | 
       | Again, a total aside here, but your comment just reminded me of
       | that brown briefcase. Life, it turns out, is a lot more complex
       | than we give it credit for. Even pouring the OJ can be, in rare
       | cases, transcendent.
        
         | m463 wrote:
         | It would be interesting to see trick questions.
         | 
         | Like in your test
         | 
         | a hand grenade and a pin - don't pull the pin.
         | 
         | Or maybe a mousetrap? but maybe that would be defused?
         | 
         | in the ai test...
         | 
         | or Global Thermonuclear War, the only winning move is...
        
           | sdenton4 wrote:
           | to move first!
        
             | m463 wrote:
             | oh crap. lol!
        
           | HPsquared wrote:
           | Gaming streams being in the training data, it might pull the
           | pin because "that's what you do".
        
             | 8note wrote:
             | or, because it has to give an output, and pulling the pin
             | is the only option
        
               | TeMPOraL wrote:
               | There's also the option of not pulling the pin, and
               | shooting your enemies as they instinctively run from what
               | they think is a live grenade. Saw it on a TV show the
               | other day.
        
         | ubj wrote:
         | There's a lot of truth in this. I sometimes joke that robot
         | benchmarks should focus on common household chores. Given a
         | basket of mixed laundry, sort and fold everything into
         | organized piles. Load a dishwasher given a sink and counters
         | overflowing with dishes piled up haphazardly. Clean a bedroom
         | that kids have trashed. We do these tasks almost without
         | thinking, but the unstructured nature presents challenges for
         | robots.
        
           | Balgair wrote:
           | I maintain that whoever invents a robust laundry _folding_
           | robot will be a trillionaire. In that, I dump jumbled clean
           | clothes straight from a dryer at it and out comes folded and
            | sorted clothes (and those loner socks). I know we're getting
           | close, but I also know we're not there yet.
        
             | oblio wrote:
             | Laundry folding and laundry ironing, I would say.
        
               | musicale wrote:
               | Hopefully will detect whether a small child is inside or
               | not.
        
             | imafish wrote:
             | > I maintain that whoever invents a robust laundry folding
             | robot will be a trillionaire
             | 
             | ... so Elon Musk? :D
        
             | jessekv wrote:
             | I want it to lay out an outfit every day too. Hopefully
             | without hallucination.
        
               | stefs wrote:
               | it's not hallucination, it's high fashion
        
               | tanseydavid wrote:
               | Yes, but the stupid robot laid out your Thursday-black-
               | Turtleneck for you on Saturday morning. That just won't
               | suffice.
        
             | yongjik wrote:
             | I can live without folding laundry (I can just shove my
             | undershirts in the closet, who cares if it's not folded),
             | but whoever manufactures a reliable auto-loading dishwasher
             | will have my dollars. Like, just put all your dishes in the
             | sink and let the machine handle them.
        
               | Brybry wrote:
                | But if your dishwasher is empty it takes nearly the same
               | amount of time/effort to put dishes straight into the
               | dishwasher that it does to put them in the sink.
               | 
               | I think I'd only really save time by having a robot that
               | could unload my dishwasher and put up the clean dishes.
        
               | namibj wrote:
               | That's called a second dishwasher: one is for taking out,
               | the other for putting in. When the latter is full, turn
               | it on, dirty dishes wait outside until the cycle
               | finishes, when the dishwashers switch roles.
        
               | ptsneves wrote:
               | I thought about this and it gets even better. You do not
               | really need shelves as you just use the clean dishwasher
               | as the storage place. I honestly don't know why this is
               | not a thing in big or wealthy homes.
        
               | jannyfer wrote:
               | Another thing that bothers me is that dishwashers are
               | low. As I get older, I'm finding it really annoying to
               | bend down.
               | 
               | So get me a counter-level dishwasher cabinet and I'll be
               | happy!
        
               | oangemangut wrote:
               | We have a double drawer dishwasher and it hurts my brain
               | watching friends plan around their nightly wash.
        
               | yongjik wrote:
               | Hmm, that doesn't match my experience. It takes me a lot
               | more time to put dishes into the dishwasher, because it
               | has different places for cutlery, bowls, dishes, and so
               | on, and of course the existing structure never matches my
               | bowls' size perfectly so I have to play tetris or run it
               | with only 2/3 filled (which will cause me to waste more
               | time as I have to do dishes again sooner).
               | 
               | And that's before we get to bits of sticky rice left on
               | bowls, which somehow dishwashers never scrape off clean.
               | YMMV.
        
               | HPsquared wrote:
               | 1. Get a set of dishes that does fit nicely together in
               | the dishwasher.
               | 
               | 2. Start with a cold prewash, preferably with a little
               | powder in there too. This massively helps with stubborn
               | stuff. This one is annoying though because you might have
               | to come back and switch it on after the prewash. A good
               | job for the robot butler.
        
             | nradov wrote:
             | There is the Foldimate robot. I don't know how well it
             | works. It doesn't seem to pair up socks. (Deleted the web
             | link, it might not be legitimate.)
        
               | smokel wrote:
               | Beware, this website is probably a scam.
               | 
                | FoldiMate went bankrupt in 2021 [1], and the domain
                | referral from foldimate.com to a 404 page at miele.com
                | suggests that it was Miele who bought up the remains, not
                | a sketchy company with a ".website" top-level domain.
               | 
               | [1] https://en.wikipedia.org/wiki/FoldiMate
        
             | smokel wrote:
              | We are certainly getting close! In 2010, watching PR2 fold
              | some unseen towels was like watching paint dry [1], but now
              | we can enjoy robots attaining lazy-student-level laundry
              | folding in real time, as demonstrated by π0 [2].
             | 
             | [1] https://www.youtube.com/watch?v=gy5g33S0Gzo
             | 
             | [2] https://www.physicalintelligence.company/blog/pi0
        
             | sss111 wrote:
             | Honestly, a robot that can hang jumbled clean clothes
             | instead of folding them would be good enough, it's crazy
             | how we don't even have those.
        
             | dweekly wrote:
             | I was a believer in Gal's FoldiMate but sadly it...folded.
             | 
             | https://en.m.wikipedia.org/wiki/FoldiMate
        
             | blargey wrote:
             | At this point I'm not sure we'll actually get a task-
             | specific machine for laundry folding/sorting before
             | humanoid robots gain the capability to do it well enough.
        
           | zamalek wrote:
           | Slightly tangential, we already have amazing laundry robots.
           | They are called washing and drying machines. We don't give
           | these marvels enough credit, mostly because they aren't
           | shaped like humans.
           | 
           | Humanoid robots are mostly a waste of time. Task-shaped
           | robots are _much_ easier to design, build, and maintain...
           | and are more reliable. Some of the things you mention might
            | need humanoid versatility (loading the dishwasher), others
           | would be far better served by purpose-built robots (laundry
           | sorting).
        
             | jkaptur wrote:
             | I'm embarrassed to say that I spent a few moments
             | daydreaming about a robot that could wash my dishes. Then I
             | thought about what to call it...
        
               | musicale wrote:
               | Sadly current "dishwasher" models are neither self-
               | loading nor unloading. (Seems like they should be able to
               | take a tray of dishes, sort them, load them, and stack
               | them after cleaning.)
               | 
               | Maybe "busbot" or "scullerybot".
        
               | vidarh wrote:
               | The problem is more doing it in sufficiently little
               | space, and using little enough water and energy. Doing
                | one that you feed dishes individually and that
                | immediately washes them and feeds them to storage should be
               | entirely viable, but it'd be wasteful, and it'd compete
               | with people having multiple small drawer-style
               | dishwashers, offering relatively little convenience over
               | that.
               | 
               | It seems most people aren't willing to pay for multiple
               | dishwashers - even multiple small ones or set aside
               | enough space, and that places severe constraints on
               | trying to do better.
        
               | wsintra2022 wrote:
               | Was it a dishwasher? Just give it all your unclean dishes
                | and tell it to go, come back an hour later, and they're
                | all washed and mostly dried!
        
             | rytis wrote:
             | I agree. I don't know where this obsession comes from.
              | The obsession with resembling humans as closely as possible.
             | We're so far from being perfect. If you need proof just
             | look at your teeth. Yes, we're relatively universal, but a
              | screwdriver is more efficient at driving in screws than our
              | fingers. So please, stop wasting time building perfect
              | universal robots; build more purpose-built ones.
        
               | Nevermark wrote:
               | Given we have shaped so many tasks to fit our bodies, it
               | will be a long time before a bot able to do a
               | variety/majority of human tasks the human way won't be
               | valuable.
               | 
               | 1000 machines specialized for 1000 tasks are great, but
               | don't deliver the same value as a single bot that can
               | interchange with people flexibly.
               | 
                | Costly today, but won't be forever.
        
               | golol wrote:
                | The shape doesn't matter! Non-humanoid shapes give minor
               | advantages on specific tasks but for a general robot
               | you'll have a hard time finding a shape much more optimal
               | than humanoid. And if you go with humanoid you have so
               | much data available! Videos contain the information of
                | which movements a robot should execute. Teleoperation is
               | easy. This is the bitter lesson! The shape doesn't
               | matter, any shape will work with the right architecture,
               | data and training!
        
               | rowanG077 wrote:
                | Purpose-built robots are basically solved. Dishwashers,
               | laundry machines, assembly robots, etc. the moat is a
               | general purpose robot that can do what a human can do.
        
             | graemep wrote:
             | Great examples. They are simple, reliable, efficient and
             | effective. Far better than blindly copying what a human
             | being does. Maybe there are equally clever ways of doing
             | things like folding clothes.
        
             | Geee wrote:
             | There isn't a "task-shaped" robot for unstructured and
             | complex manipulation, other than high DoF arms with vision
             | and neural nets. For example, a machine which can cook food
             | would be best solved with two robotic arms. However, these
             | stationary arms would be wasted if they were just idling
             | most of the time. So, you add locomotion and dynamic
             | balancing with legs. And now these two arms can be used in
             | 1000 different tasks, which makes them 1000x more valuable.
             | 
             | So, not only is the human form the only solution for many
             | tasks, it's also a much cheaper solution considering the
             | idle time of task-specific robots. You would need only a
             | single humanoid robot for all tasks, instead of buying a
             | different machine for each task. And instead of having to
             | design and build a new machine for each task, you'll need
             | to just download new software for each task.
        
         | ecshafer wrote:
         | I had a pretty bad case of tendinitis once, that basically made
         | my thumb useless since using it would cause extreme pain. That
         | test seems really good. I could use a computer keyboard without
         | any issue, but putting a belt on or pouring water was
         | impossible.
        
           | vidarh wrote:
           | I had a swollen elbow a short while ago, and the amount of
           | things I've never thought about that were affected by reduced
            | elbow joint mobility and an inability to put pressure on the
           | elbow was disturbing.
        
         | CooCooCaCha wrote:
         | That's why the goal isn't just benchmark scores, it's
         | _reliable_ and robust intelligence.
         | 
         | In that sense, the goalposts haven't moved in a long time
         | despite claims from AI enthusiasts that people are constantly
         | moving goalposts.
        
         | croemer wrote:
         | > We had hand prosthetics that could play Mozart at 5x speed on
         | a baby grand, but could not pick up a silver dollar or zip a
          | jacket even a little bit.
         | 
         | I must be missing something, how can they be able to play
         | Mozart at 5x speed with their prosthetics but not zip a jacket?
         | They could press keys but not do tasks requiring feedback?
         | 
         | Or did you mean they used to play Mozart at 5x speed before
         | they became amputees?
        
           | rahimnathwani wrote:
           | Imagine a prosthetic 'hand' that has 5 regular fingers,
           | rather than 4 fingers and a thumb. It would be able to play a
           | piano just fine, but be unable to grasp anything small, like
           | a zipper.
        
           | numpad0 wrote:
           | Thumb not opposable?
        
           | 8note wrote:
           | zipping up a jacket is really hard to do, and requires very
           | precise movements and coordination between hands.
           | 
           | playing mozart is much more forgiving in terms of the number
           | of different motions you have to make in different
           | directions, the amount of pressure to apply, and even the
           | black keys are much bigger than large sized zipper tongues.
        
             | Balgair wrote:
             | Pretty much. The issue with zippers is that the fabric
             | moves about in unpredictable ways. Piano playing was just
             | movement programs. Zipping required (surprisingly) fast
             | feedback. Also, gripping is somewhat tough compared to
             | pressing.
        
           | ben_w wrote:
           | Playing a piano involves pushing down on the right keys with
           | the right force at the right time, but that could be pre-
           | programmed well before computers. The self-playing piano in
           | the saloon in Westworld wasn't a _huge_ anachronism, such
           | things slightly overlapped with the Wild West era:
           | https://en.wikipedia.org/wiki/Player_piano
           | 
           | Picking up a 1mm thick metal disk from a flat surface
            | requires that the user give the exact time, place, and force, and
           | I'm not even sure what considerations it needs for surface
           | materials (e.g. slightly squishy fake skin) and/or tip shapes
           | (e.g. fake nails).
        
             | numpad0 wrote:
             | > Picking up a 1mm thick metal disk from a flat surface
             | requires the user gives the exact time, place, and force
             | 
             | place sure but can't you cheat a bit for time and force
             | with compliance("impedance control")?
        
               | ben_w wrote:
               | In theory, apparently not in practice.
        
           | oblio wrote:
            | I'm far from a piano player, but I can definitely push piano
            | keys quite quickly, while zipping up my jacket when it's
            | cold and/or wet outside is really difficult.
           | 
           | Even more so for picking up coins from a flat surface.
           | 
           | For robotics, it's kind of obvious, speed is rarely an issue,
           | so the "5x" part is almost trivial. And you can program the
           | sequence quite easily, so that's also doable. Piano keys are
           | big and obvious and an ergonomically designed interface meant
           | to be relatively easy to press, ergo easy even for a
           | prosthetic. A small coin on a flat surface is far from
           | ergonomic.
        
             | croemer wrote:
              | But how do you deliberately control those fingers to
              | actually play what you have in mind yourself, rather than
              | something preprogrammed? Surely the idea of a prosthetic
              | does not just mean "a robot that is connected to your
              | body", but something that the owner controls with their mind.
        
               | vidarh wrote:
               | Nobody said anything about deliberately controlling those
               | fingers to play yourself. Clearly it's not something you
               | do for the sake of the enjoyment of playing, but more
               | likely a demonstration of the dexterity of the prosthesis
               | and ability to program it for complex tasks.
               | 
               | The idea of a prosthesis is to help you regain
               | functionality. If the best way of doing that is through
               | automation, then it'd make little sense not to.
        
             | yongjik wrote:
             | I play piano as a hobby, and the funny thing is, if my
             | hands are so cold that I can't zip up my jacket, there's no
             | way I can play anything well. I know it's not quite zipping
             | up jackets ;) but a human playing the piano does require a
             | fast feedback loop.
        
           | n144q wrote:
           | Well, you see, while the original comment says they could
            | play at 5x speed, it does not say they played at that speed
            | _well_ or beautifully. Any teacher or any student who
           | learned piano for a while will tell you that this matters a
           | lot, especially for classical music -- being able to
           | accurately play at an even tempo with the correct dynamics
           | and articulation is hard and is what differentiates a
           | beginner /intermediate player from an advanced one. In fact,
           | one mistake many students make is playing a piece too fast
           | when they are not ready, and teachers really want students to
           | practice very slowly.
           | 
           | My point is -- being able to zip a jacket is all about those
           | subtle actions, and could actually be harder than "just"
           | playing piano fast.
        
         | alexose wrote:
          | It feels like there's a whole class of information that is
          | easily shorthanded, but really hard to explain to novices.
         | 
         | I think a lot about carpentry. From the outside, it's pretty
         | easy: Just make the wood into the right shape and stick it
         | together. But as one progresses, the intricacies become more
         | apparent. Variations in the wood, the direction of the grain,
         | the seasonal variations in thickness, joinery techniques that
         | are durable but also time efficient.
         | 
         | The way this information connects is highly multisensory and
         | multimodal. I now know which species of wood to use for which
         | applications. This knowledge was hard won through many, many
         | mistakes and trials that took place at my home, the hardware
         | store, the lumberyard, on YouTube, from my neighbor Steve, and
         | in books written by experts.
        
         | Method-X wrote:
         | Was it the Southampton hand assessment procedure?
        
           | Balgair wrote:
           | Yes! Thank you!
           | 
           | https://www.shap.ecs.soton.ac.uk/
        
         | oblio wrote:
         | This was actually discovered quite early on in the history of
         | AI:
         | 
         | > Rodney Brooks explains that, according to early AI research,
         | intelligence was "best characterized as the things that highly
         | educated male scientists found challenging", such as chess,
         | symbolic integration, proving mathematical theorems and solving
         | complicated word algebra problems. "The things that children of
         | four or five years could do effortlessly, such as visually
         | distinguishing between a coffee cup and a chair, or walking
         | around on two legs, or finding their way from their bedroom to
         | the living room were not thought of as activities requiring
         | intelligence."
         | 
         | https://en.wikipedia.org/wiki/Moravec%27s_paradox
        
           | bawolff wrote:
           | I don't know why people always feel the need to gender these
           | things. Highly educated female scientists generally find the
           | same things challenging.
        
             | robocat wrote:
             | I don't know why anyone would blame people as though
             | someone is making an explicit choice. I find your choice of
             | words to be insulting to the OP.
             | 
              | We learn our language and stereotypes subconsciously from
             | our society, and it is no easy thing to fight against that.
        
             | Barrin92 wrote:
             | >I don't know why people always feel the need to gender
             | these things
             | 
             | Because it's relevant to the point being made, i.e. that
             | these tests reflect the biases and interests of the people
             | who make them. This is true not just for AI tests, but
             | intelligence test applied to humans. That Demis Hassabis, a
             | chess player and video game designer, decided to test his
             | machine on video games, Go and chess probably is not an
             | accident.
             | 
             | The more interesting question is why people respond so
             | apprehensively to pointing out a very obvious problem and
             | bias in test design.
        
               | bawolff wrote:
               | > i.e. that these tests reflect the biases and interests
               | of the people who make them
               | 
               | Of course. However i believe we can't move past that
               | without being honest about where these biases are coming
               | from. Many things in our world are the result of gender
               | bias, both subtle and overt. However, at least at first
               | glance, this does not appear to be one of them, and
               | statements like the grandparent's quote serve to
               | perpetuate such biases further.
        
               | oblio wrote:
               | It's a quote from the 80s from the original author (who
               | is a man...)...
               | 
               | Thank you for virtue signalling, though.
        
               | bawolff wrote:
               | > It's a quote from the 80s from the original author (who
               | is a man...)...
               | 
               | Yes, that was pretty clear in the original comment (?)
        
               | oblio wrote:
               | Then remove the parts that offend your modern
               | sensibilities and focus on the essence.
               | 
               | He was right. Scientists were focusing on the "science-y"
               | bits and completely missed the elephant in the room, that
               | the thing a toddler already masters are the monster
               | challenge for AI right now, before we even get into
               | "meaning of life" type stuff.
        
         | drdrey wrote:
         | I think assembling Legos would be a cool robot benchmark: you
         | need to parse the instructions, locate the pieces you need,
         | pick them up, orient them, snap them to your current assembly,
         | visually check if you achieved the desired state, repeat
        
           | serpix wrote:
           | I agree. Watching my toddler daughter build with small legos
           | makes me understand how incredible fine motor skills are as
           | even with small fingers some of the blocks are just too hard
           | to snap together.
        
         | throwup238 wrote:
         | This is expressed in AI research as Moravec's paradox:
         | https://en.wikipedia.org/wiki/Moravec%27s_paradox
         | 
         | Getting to LLMs that could talk to us turned out to be a lot
         | easier than making something that could control even a robotic
         | arm without precise programming, let alone a humanoid.
        
         | MarcelOlsz wrote:
         | >We had hand prosthetics that could play Mozart at 5x speed on
         | a baby grand
         | 
         | I'd love to know more about this.
        
         | xnx wrote:
        | Despite a lack of fearsome teeth or claws, humans are _way_ OP
        | due to brains, hand dexterity, and balance.
        
         | dang wrote:
         | We detached this subthread from
         | https://news.ycombinator.com/item?id=42473419
         | 
         | (nothing wrong with it! I'm just trying to prune the top
         | subthread)
        
       | spyckie2 wrote:
       | The more Hacker News worthy discussion is the part where the
       | author talks about search through the possible mini-program space
       | of LLMs.
       | 
       | It makes sense because tree search can be endlessly optimized. In
       | a sense, LLMs turn the unstructured, open system of general
       | problems into a structured, closed system of possible moves.
       | Which is really cool, IMO.
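        | 
        | To make that concrete, here's a rough sketch of my own (with
        | `llm_propose` and `score` as made-up stand-ins for a model call
        | and a fitness check) of a beam search over LLM-proposed moves:
        | 
        |     import heapq
        | 
        |     def beam_search(task, llm_propose, score, width=8, depth=6):
        |         # each state is a partial "mini-program": a list of moves
        |         beam = [([], score(task, []))]
        |         for _ in range(depth):
        |             candidates = []
        |             for moves, _ in beam:
        |                 # the LLM turns an open-ended problem into a
        |                 # small, discrete set of plausible next moves
        |                 for move in llm_propose(task, moves):
        |                     new = moves + [move]
        |                     candidates.append((new, score(task, new)))
        |             if not candidates:
        |                 break
        |             # keep only the most promising partial programs
        |             beam = heapq.nlargest(width, candidates,
        |                                   key=lambda c: c[1])
        |         return max(beam, key=lambda c: c[1])[0]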
        
         | glup wrote:
         | Yes! This seems to be a really neat combination of 2010's
         | Bayesian cleverness / Tenenbaumian program search approaches
         | with the LLMs as merely sources of high-dim conditional
         | distributions. I knew people were experimenting in this space
         | (like https://escholarship.org/uc/item/7018f2ss) but didn't
         | know it did so well wrt these new benchmarks.
        
       | binarymax wrote:
       | All those saying "AGI", read the article and especially the
       | section "So is it AGI?"
        
       | skizm wrote:
       | This might sound dumb, and I'm not sure how to phrase this, but
       | is there a way to measure the raw model output quality without
       | all the more "traditional" engineering work (mountain of `if`
       | statements I assume) done on top of the output? And if so, would
       | that be a better measure of when scaling up the input data will
       | start showing diminishing returns?
       | 
       | (I know very little about the guts of LLMs or how they're tested,
       | so the distinction between "raw" output and the more
       | deterministic engineering work might be incorrect)
        
         | whimsicalism wrote:
         | what do you mean by the mountain of if-statements on top of the
         | output? like checking if the output matches the expected result
         | in evaluations?
        
           | skizm wrote:
           | Like when you type something into the chat gpt app _I am
           | guessing_ it will start by preprocessing your input, doing
           | some sanity checks, making sure it doesn't say "how do I
           | build a bomb?" or whatever. It may or may not alter /clean up
           | your input before sending it to the model for processing.
           | Once processed, there's probably dozens of services it goes
           | through to detect if the output is racist, somehow actually
            | contained a bomb recipe, or maybe copyrighted material, normal
           | pattern matching stuff, maybe some advanced stuff like
           | sentiment analysis to see if the output is bad mouthing Trump
           | or something, and it might either alter the output or simply
           | try again.
           | 
            | I'm wondering, when you strip out all that "extra" non-model
            | pre- and post-processing, if there's some way to measure the
            | performance of that.
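            | 
            | Conceptually I'm picturing a wrapper like this (purely a
            | sketch of my guess; `call_model`, `moderate` and
            | `rewrite_prompt` are made-up placeholders, not any vendor's
            | actual API):
            | 
            |     def answer(user_input, call_model, moderate,
            |                rewrite_prompt):
            |         # pre-processing: refuse or clean up the prompt
            |         if moderate(user_input).flagged:
            |             return "Sorry, I can't help with that."
            |         prompt = rewrite_prompt(user_input)
            | 
            |         # the "raw" model output is only this one call
            |         draft = call_model(prompt)
            | 
            |         # post-processing: filter the output, maybe retry
            |         for _ in range(3):
            |             if not moderate(draft).flagged:
            |                 return draft
            |             draft = call_model(prompt)
            |         return "Sorry, I can't help with that."
            | 
            | Measuring the "raw" model would then mean benchmarking
            | call_model() on its own and skipping everything else.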
        
             | whimsicalism wrote:
             | oh, no - but most queries aren't being filtered by
             | supervisor models nowadays anyways.. most of the refusal is
             | baked in
        
       | Seattle3503 wrote:
       | How can there be "private" taks when you have use the OpenAI API
       | to run queries? OpenAI sees everything.
        
         | nmca wrote:
         | We worked with ARC to run inference on the semi-private tasks
         | last week, after o3 was trained, using an inference only API
         | that was sent the prompts but not the answers & did no durable
         | logging.
        
           | idontknowmuch wrote:
           | What's your opinion on the veracity of this benchmark - given
           | o3 was fine-tuned and others were not? Can you give more
           | details on how much data was used to fine-tune o3? It's hard
           | to put this into perspective given this confounder.
        
             | nmca wrote:
             | I can't provide more information than is currently public,
             | but from the ARC post you'll note that we trained on about
             | 75% of the train set (which contains 400 examples total);
             | which is within the ARC rules, and evaluated on the
             | semiprivate set.
        
       | tmaly wrote:
       | Just curious, I know o1 is a model OpenAI offers. I have never
       | heard of the o3 model. How does it differ from o1?
        
       | roboboffin wrote:
       | Interesting that in the video, there is an admission that they
       | have been targeting this benchmark. A comment that was quickly
       | shut down by Sam.
       | 
       | A bit puzzling to me. Why does it matter ?
        
         | HarHarVeryFunny wrote:
          | It matters to the extent that they want to market this as general
         | intelligence, not as a collection of narrow intelligences
         | (math, competitive programming, ARC puzzles, etc).
         | 
         | In reality it seems to be a bit of both - there is some general
         | intelligence based on having been "trained on the internet",
         | but it seems these super-human math/etc skills are very much
         | from them having focused on training on those.
        
           | roboboffin wrote:
           | However, the way it is progressing is that the SOTA is
           | saturating the current benchmarks; then a new one is
           | conceived as people understand the nature of what it means to
           | be intelligent. It seems only natural to concentrate on one
           | benchmark at a time.
           | 
           | Francois Chollet mentioned that the test tries to avoid curve
           | fitting (which he states is the main ability of LLMs).
           | However, they specifically restricted the number of examples
           | to do this. It is not beyond the realms of possibility that
           | many examples could have been generated by hand though, and
           | that the curve fitting has been achieved, rather than
            | discrete program search.
           | 
            | Anyway, it's all supposition. It's difficult to know how
            | genuine the result is, without knowledge of how it was
           | actually achieved.
        
         | mukunda_johnson wrote:
         | I always smell foul play from Sam. I'd bet they are doing
         | something silly to inflate the benchmark score. Not saying they
         | are, but Sam is the type of guy to put a literal dumb human in
         | the API loop and score "just as high as a human would."
        
       | cubefox wrote:
       | This was a surprisingly insightful blog post, going far beyond
       | just announcing the o3 results.
        
       | c1b wrote:
       | How does o3 know when to stop reasoning?
        
         | adtac wrote:
         | It thinks hard about it
        
         | freehorse wrote:
         | It has a bill counter.
        
       | c1b wrote:
       | So o1 pro is CoT RL and o3 adds search?
        
       | jack_pp wrote:
       | AGI for me is something I can give a new project to and be able
       | to use it better than me. And not because it has a huge context
       | window, because it will update its weights after consuming that
       | project. Until we have that I don't believe we have truly reached
       | AGI.
       | 
        | Edit: it also _tests_ the new knowledge; it has concepts such as
        | trusting a source, verifying it, etc. If I can just gaslight it
        | into unlearning Python then it's still too dumb.
        
       | submeta wrote:
       | I pay for lots of models, but Claude Sonnet is the one I use
       | most. ChatGPT is my quick tool for short Q&As because it's got a
       | desktop app. Even Google's new offerings did not lure me away
       | from Claude which I use daily for hours via a Teams plan with
       | five seats.
       | 
       | Now I am wondering what Anthropic will come up with. Exciting
       | times.
        
         | isof4ult wrote:
         | Claude also has a desktop app:
         | https://support.anthropic.com/en/articles/10065433-installin...
        
         | istjohn wrote:
         | What do you use Claude for?
        
           | itsgrimetime wrote:
           | Programming tasks, brain storming, recipe ideas, or any
           | question I have that doesn't have a concrete, specific
           | answer.
        
       | Animats wrote:
       | The graph seems to indicate a new high in cost per task. It looks
       | like they came in somewhere around $5000/task, but the log scale
       | has too few markers to be sure.
       | 
       | That may be a feature. If AI becomes too cheap, the over-funded
       | AI companies lose value.
       | 
       | (1995 called. It wants its web design back.)
        
         | jstummbillig wrote:
         | I doubt it. Competitive markets mostly work and inefficiencies
         | are opportunities for other players. And AI is full of glaring
         | inefficiencies.
        
           | Animats wrote:
           | Inefficiency can create a moat. If you can charge a lot for
           | your product, you have ample cash for advertising, marketing,
           | and lobbying, and can come out with many product variants. If
           | you're the lowest cost producer, you don't have the margins
           | to do that.
           | 
           | The current US auto industry is an example of that strategy.
           | So is the current iPhone.
        
       | hypoxia wrote:
       | Many are incorrectly citing 85% as human-level performance.
       | 
        | 85% is just the (semi-arbitrary) threshold for winning the
        | prize.
       | 
       | o3 actually beats the human average by a wide margin: 64.2% for
       | humans vs. 82.8%+ for o3.
       | 
       | ...
       | 
       | Here's the full breakdown by dataset, since none of the articles
       | make it clear --
       | 
       | Private Eval:
       | 
       | - 85%: threshold for winning the prize [1]
       | 
       | Semi-Private Eval:
       | 
       | - 87.5%: o3 (unlimited compute) [2]
       | 
       | - 75.7%: o3 (limited compute) [2]
       | 
       | Public Eval:
       | 
       | - 91.5%: o3 (unlimited compute) [2]
       | 
       | - 82.8%: o3 (limited compute) [2]
       | 
       | - 64.2%: human average (Mechanical Turk) [1] [3]
       | 
       | Public Training:
       | 
       | - 76.2%: human average (Mechanical Turk) [1] [3]
       | 
       | ...
       | 
       | References:
       | 
       | [1] https://arcprize.org/guide
       | 
       | [2] https://arcprize.org/blog/oai-o3-pub-breakthrough
       | 
       | [3] https://arxiv.org/abs/2409.01374
        
         | Workaccount2 wrote:
         | If my life depended on the average rando solving 8/10 arc-prize
         | puzzles, I'd consider myself dead.
        
       | highfrequency wrote:
       | Very cool. I recommend scrolling down to look at the example
       | problem that O3 still can't solve. It's clear what goes on in the
       | human brain to solve this problem: we look at one example,
       | hypothesize a simple rule that explains it, and then check that
       | hypothesis against the other examples. It doesn't quite work, so
       | we zoom into an example that we got wrong and refine the
       | hypothesis so that it solves that sample. We keep iterating in
       | this fashion until we have the simplest hypothesis that satisfies
       | all the examples. In other words, how humans do science -
       | iteratively formulating, rejecting and refining hypotheses
       | against collected data.
       | 
       | From this it makes sense why the original models did poorly and
       | why iterative chain of thought is required - the challenge is
       | designed to be inherently iterative such that a zero shot model,
       | no matter how big, is extremely unlikely to get it right on the
       | first try. Of course, it also requires a broad set of human-like
       | priors about what hypotheses are "simple", based on things like
       | object permanence, directionality and cardinality. But as the
       | author says, these basic world models were already encoded in the
       | GPT 3/4 line by simply training a gigantic model on a gigantic
       | dataset. What was missing was iterative hypothesis generation and
       | testing against contradictory examples. My guess is that O3 does
       | something like this:
       | 
       | 1. Prompt the model to produce a simple rule to explain the nth
       | example (randomly chosen)
       | 
       | 2. Choose a different example, ask the model to check whether the
       | hypothesis explains this case as well. If yes, keep going. If no,
       | ask the model to _revise_ the hypothesis in the simplest possible
       | way that also explains this example.
       | 
       | 3. Keep iterating over examples like this until the hypothesis
       | explains all cases. Occasionally, new revisions will invalidate
       | already solved examples. That's fine, just keep iterating.
       | 
       | 4. Induce randomness in the process (through next-word sampling
       | noise, example ordering, etc) to run this process a large number
       | of times, resulting in say 1,000 hypotheses which all explain all
       | examples. Due to path dependency, anchoring and consistency
       | effects, some of these paths will end in awful hypotheses - super
       | convoluted and involving a large number of arbitrary rules. But
       | some will be simple.
       | 
       | 5. Ask the model to select among the valid hypotheses (meaning
       | those that satisfy all examples) and choose the one that it views
       | as the simplest for a human to discover.
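        | 
        | Purely as an illustration of that guessed loop (not how o3
        | actually works; `llm` is a placeholder for a call that takes a
        | prompt string and returns the model's text):
        | 
        |     import random
        | 
        |     def solve(examples, llm, n_runs=100, max_iters=20):
        |         candidates = []
        |         # step 4: many randomized, independent runs
        |         for _ in range(n_runs):
        |             order = random.sample(examples, len(examples))
        |             # step 1: a simple rule from one example
        |             rule = llm("Propose the simplest rule "
        |                        f"explaining {order[0]}.")
        |             # step 3: iterate until every example passes
        |             for _ in range(max_iters):
        |                 failing = [ex for ex in order
        |                            if llm(f"Does '{rule}' explain "
        |                                   f"{ex}? yes/no") != "yes"]
        |                 if not failing:
        |                     candidates.append(rule)
        |                     break
        |                 # step 2: minimal revision for a failure
        |                 rule = llm(f"Revise '{rule}' minimally so "
        |                            f"it also explains {failing[0]}.")
        |         # step 5: let the model pick the simplest survivor
        |         return llm("Pick the simplest rule:\n"
        |                    + "\n".join(candidates))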
        
         | hmottestad wrote:
         | I took a look at those examples that o3 can't solve. Looks
         | similar to an IQ-test.
         | 
         | Took me less time to figure out the 3 examples that it took to
         | read your post.
         | 
         | I was honestly a bit surprised to see how visual the tasks
         | were. I had thought they were text based. So now I'm quite
         | impressed that o3 can solve this type of task at all.
        
           | highfrequency wrote:
           | You must be a stem grad! Or perhaps an ensemble of Kaggle
           | submissions?
        
           | neom wrote:
           | I also took some time to look at the ones it couldn't solve.
            | I stopped after this one:
            | https://kts.github.io/arc-viewer/page6/#47996f11
        
             | hmottestad wrote:
             | That one's cool. All pink pixels need to be repaired so
             | they match the symmetry in the picture.
        
       | heliophobicdude wrote:
       | We should NOT give up on scaling pretraining just yet!
       | 
       | I believe that we should explore pretraining video completion
       | models that explicitly have no text pairings. Why? We can train
       | unsupervised like they did for GPT series on the text-internet
       | but instead on YouTube lol. Labeling or augmenting the frames
       | limits scaling the training data.
       | 
       | Imagine using the initial frames or audio to prompt the video
       | completion model. For example, use the initial frames to write
       | out a problem on a white board then watch in output generate the
       | next frames the solution being worked out.
       | 
       | I fear text pairings with CLIP or OCR constrain a model too much
        | and confuse it.
        
       | thatxliner wrote:
       | > verified easy for humans, harder for AI
       | 
       | Isn't that the premise behind the CAPTCHA?
        
       | usaar333 wrote:
       | For what it's worth, I'm much more impressed with the frontier
       | math score.
        
       | asdf6969 wrote:
       | Terrifying. This news makes me happy I save all my money. My only
       | hope for the future is that I can retire early before I'm
       | unemployable
        
         | bamboozled wrote:
         | The whole economy is going to crash and money won't be worth
         | anything, so it won't matter if you have money or not.
         | 
          | Of course there is a chance we will find ourselves in Utopia,
          | but yeah, a chance.
        
       | rimeice wrote:
       | Never underestimate a droid
        
       | thisisthenewme wrote:
       | I feel like AI is already changing how we work and live - I've
       | been using it myself for a lot of my development work. Though,
       | what I'm really concerned about is what happens when it gets
       | smart enough to do pretty much everything better (or even close)
       | than humans can. We're talking about a huge shift where first
       | knowledge workers get automated, then physical work too. The
       | thing is, our whole society is built around people working to
       | earn money, so what happens when AI can do most jobs? It's not
       | just about losing jobs - it's about how people will pay for basic
       | stuff like food and housing, and what they'll do with their lives
       | when work isn't really a thing anymore. Or do people feel like
       | there will be jobs safe from AI? (hopefully also fulfilling)
       | 
       | Some folks say we could fix this with universal basic income,
       | where everyone gets enough money to live on, but I'm not
       | optimistic that it'll be an easy transition. Plus, there's this
       | possibility that whoever controls these 'AGI' systems basically
       | controls everything. We definitely need to figure this stuff out
       | before it hits us, because once these changes start happening,
       | they're probably going to happen really fast. It's kind of like
       | we're building this awesome but potentially dangerous new
       | technology without really thinking through how it's going to
       | affect regular people's lives. I feel like we need a parachute
       | before we attempt a skydive. Some people feel pretty safe about
       | their jobs and think they can't be replaced. I don't think that
       | will be the case. Even if AI doesn't take your job, you now have
       | a lot more unemployed people competing for the same job that is
       | safe from AI.
        
         | cerved wrote:
         | > Though, what I'm really concerned about is what happens when
         | it gets smart enough to do pretty much everything better (or
         | even close)
         | 
         | I'll get concerned when it stops sucking so hard. It's like
         | talking to a dumb robot. Which it unsurprisingly is.
        
         | lacedeconstruct wrote:
          | I am pretty sure we will develop a deep cultural repulsion to
          | it, and people will pay serious money for an AI-free
          | experience. If AI becomes actually useful, there are a lot of
          | areas that we don't even know how to tackle, like medicine and
          | biology. I don't think much would change otherwise: AI will
          | take jobs, but it will open a lot more jobs at a much higher
          | level of abstraction. 50 years ago, the idea that software
          | engineering would become a get-rich-quick job would have been
          | insane imo.
        
         | neom wrote:
         | I spend quite a lot of time noodling on this. The thing that
         | became really clear from this o3 announcement is that the
         | "throw a lot of compute at it and it can do insane things" line
         | of thinking continues to hold very true. If that is true, is
         | the right thing to do productize it (use the compute more
         | generally) or apply it (use the compute for very specific
         | incredibly hard and ground breaking problems)? I don't know if
         | any of this thinking is logical or not, but if it's a matter of
         | where to apply the compute, I feel like I'd be more inclined to
         | say: don't give me AI, instead use AI to very fundamentally
         | shift things.
        
         | para_parolu wrote:
          | From inside the IT bubble it's very easy to get the impression
          | that AI will replace most people. Most of the people on my
          | street do not work in IT: teachers, nurses, a hobby shop owner,
          | construction workers, etc. Sure, programming and other virtual
          | work may become a less well-paid job, but it's not the end of
          | the world.
        
           | dyauspitr wrote:
           | Honestly with o3 levels of reasoning generating control
           | software for robots on the fly, none of the above seem safe.
           | For a decade or two at the most if that.
        
         | vouaobrasil wrote:
         | A possibility is a coalition: of people who refuse to use AI
         | and who refuse to do business with those who use AI. If the
         | coalition grows large enough, AI can be stopped by economic
         | attrition.
        
           | sumedh wrote:
           | > of people who refuse to use AI and who refuse to do
           | business with those who use AI.
           | 
            | Do people refuse to buy from stores which get goods
            | manufactured by slave labour?
            | 
            | Most people don't care. If AI businesses are offering
            | goods/services at lower costs, people will vote with their
            | wallets, not their principles.
        
             | vouaobrasil wrote:
             | AI could be different. At least, I'm willing to try to form
             | a coalition.
             | 
             | Besides, AI researchers failed to make anything like a real
             | Chatbot until recently, yet they've been trying since the
             | Eliza days. I'm willing to put in at least as much effort
             | as them.
        
         | globular-toast wrote:
         | I get LLMs to make k8s manifests for me. It gets it wrong,
         | sometimes hilariously so, but still saves me time. That's
         | because the manifests are in yaml, a language. The leap between
          | that and _inventing Kubernetes_ is one I can't see yet.
        
       | w4 wrote:
       | The cost to run the highest performance o3 model is estimated to
       | be somewhere between $2,000 and $3,400 per task.[1] Based on
       | these estimates, o3 costs about 100x what it would cost to have a
       | human perform the exact same task. Many people are therefore
       | dismissing the near-term impact of these models because of these
       | extremely expensive costs.
       | 
       | I think this is a mistake.
       | 
       | Even if very high costs make o3 uneconomic for businesses, it
       | could be an epoch defining development for nation states,
       | assuming that it is true that o3 can reason like an averagely
       | intelligent person.
       | 
       | Consider the following questions that a state actor might ask
       | itself: What is the cost to raise and educate an average person?
       | Correspondingly, what is the cost to build and run a datacenter
       | with a nuclear power plant attached to it? And finally, how many
        | person-equivalent AIs could be run in parallel per datacenter?
       | 
       | There are many state actors, corporations, and even individual
       | people who can afford to ask these questions. There are also many
       | things that they'd like to do but can't because there just aren't
       | enough people available to do them. o3 might change that despite
       | its high cost.
       | 
        | So _if_ it is true that we've now got something like human-
        | equivalent intelligence on demand - and that's a really big if -
       | then we may see its impacts much sooner than we would otherwise
       | intuit, especially in areas where economics takes a back seat to
       | other priorities like national security and state
       | competitiveness.
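        | 
        | A trivial back-of-envelope using only the figures above (the
        | datacenter-side inputs are deliberately left as variables,
        | since estimating them is exactly the question):
        | 
        |     o3_cost_per_task = (2_000, 3_400)      # USD, estimated [1]
        |     human_cost_per_task = tuple(c / 100    # "about 100x" cheaper
        |                                 for c in o3_cost_per_task)
        |     # -> roughly (20.0, 34.0) USD per task
        | 
        |     # the state-actor framing, with your own estimates:
        |     # person_equivalents = annual_site_budget /
        |     #                      annual_cost_per_person_equivalent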
       | 
       | [1] https://news.ycombinator.com/item?id=42473876
        
         | istjohn wrote:
         | Your economic analysis is deeply flawed. If there was anything
         | that valuable and that required that much manpower, it would
         | already have driven up the cost of labor accordingly. The one
         | property that could conceivably justify a substantially higher
         | cost is secrecy. After all, you can't (legally) kill a human
         | after your project ends to ensure total secrecy. But that takes
         | us into thriller novel territory.
        
           | w4 wrote:
           | I don't think that's right. Free societies don't tolerate
           | total mobilization by their governments outside of war time,
           | no matter how valuable the outcomes might be in the long
           | term, in part because of the very economic impacts you
           | describe. Human-level AI - even if it's very expensive - puts
           | something that looks a lot like total mobilization within
           | reach without the societal pushback. This is especially true
           | when it comes to tasks that society as a whole may not
           | sufficiently value, but that a state actor might value very
           | much, and when paired with something like a co-located
           | reactor and data center that does not impact the grid.
           | 
           | That said, this is all predicated on o3 or similar actually
           | having achieved human level reasoning. That's yet to be fully
           | proven. We'll see!
        
             | daemonologist wrote:
             | This is interesting to consider, but I think the flaw here
             | is that you'd need a "total mobilization" level workforce
             | in order to build this mega datacenter in the first place.
             | You put one human-hour into making B200s and cooling
             | systems and power plants, you get less than one human-hour-
             | equivalent of thinking back out.
        
           | lurking_swe wrote:
            | I disagree because the job market is not a true free market.
           | I mean it mostly is, but there's a LOT of politics and shady
           | stuff that employers do to purposely drive wages down. Even
           | in the tech sector.
           | 
           | Your secrecy comment is really intriguing actually. And
           | morbid lol.
        
           | atleastoptimal wrote:
           | How many 99.9th percentile mathematicians do nation states
           | normally have access to?
        
       | starchild3001 wrote:
       | Intelligence comes in many forms and flavors. ARC prize questions
       | are just one version of it -- perhaps measuring more human-like
       | pattern recognition than true intelligence.
       | 
       | Can machines be more human-like in their pattern recognition? O3
       | met this need today.
       | 
       | While this is some form of accomplishment, it's nowhere near the
        | scientific and engineering problem-solving needed to call
        | something a true artificial (human-like) intelligence.
       | 
       | What's exciting is that these reasoning models are making
       | significant strides in tackling eng and scientific problem-
       | solving. Solving the ARC challenge seems almost trivial in
       | comparison to that.
        
       | demirbey05 wrote:
       | It is not exactly AGI, but it is a huge step toward it. I would
       | have expected this step in 2028-2030. I can't really understand
       | why people are happy about it; this technology is so dangerous
       | that it can disrupt the whole of society. It's not like the
       | smartphone or the internet. What will happen to third-world
       | countries? There are lots of unsolved questions, and the world
       | is not prepared for such a change. Lots of people will lose
       | their jobs, and I'm not even mentioning their debts. No one will
       | have a chance to get rich anymore. If you are in a first-world
       | country you will probably get UBI; if not, you won't.
        
         | FanaHOVA wrote:
         | > I would expect this step in 2028-2030.
         | 
         | Do you work at one of the frontier labs?
        
         | wyager wrote:
         | > What will happen to 3rd world countries
         | 
         | Probably less disruption than will happen in 1st world
         | countries.
         | 
         | > No one will have chance to be rich anymore
         | 
         | It's strange to reach this conclusion from "look, a massive new
         | productivity increase".
        
           | demirbey05 wrote:
            | It's not like Sonnet. Yes, current AI tools are increasing
            | productivity and provide many ways to have a chance to get
            | rich, but AGI is completely different. You need to handle
            | cutthroat competition between you and the big fish, and the
            | big fish will probably have more AI resources than you.
            | What is the survival rate in such an environment? Very low.
        
           | janalsncm wrote:
           | Strange indeed if we work under the assumption that the
           | profits from this productivity will be distributed (even
           | roughly) evenly. The problem is that most of us see no
           | indication that they will be.
           | 
           | I read "no one will have a chance to be rich anymore" as a
           | statement about economic mobility. Despite steep declines in
           | mobility over the last 50 years, it was still theoretically
           | possible for a poor child (say bottom 20% wealth) to climb
           | several quintiles. Our industry (SWE) was one of the best
           | examples. Of course there have been practical barriers (poor
           | kids go to worse schools, and it's hard to get into college
           | if you can't read) but the path was there.
           | 
           | If robots replace a lot of people, that path narrows. If AGI
           | replaces all people, the path no longer exists.
        
           | the8472 wrote:
           | Intelligence is the thing distinguishing humans from all
           | previous inventions that already were superhuman in some
           | narrow domain.
           | 
           | car : horse :: AGI : humans
        
           | entropi wrote:
            | It is not strange at all; a very big motivation for
            | spending billions on AI research is basically to remove
            | what is called the "skill premium" from the labor market.
            | That "skill premium" was usually how people got richer than
            | their fathers.
        
         | Ancalagon wrote:
         | Same, I don't really get the excitement. None of these
         | companies are pushing for a utopian Star Trek society either
         | with that power.
        
           | moffkalast wrote:
            | Open models will catch up next year or the year after;
            | there are only so many things to try and there are lots of
            | people trying them, so it's more or less an inevitability.
           | 
           | The part to get excited about is that there's plenty of
           | headroom left to gain in performance. They called o1 a
            | preview, and it was: a preview for QwQ and similar models. We
           | get the demo from OAI and then get the real thing for free
           | next year.
        
         | lagrange77 wrote:
         | I hope governments will finally take action.
        
           | Joeri wrote:
           | What action do you expect them to take?
           | 
           | What law would effectively reduce risk from AGI? The EU
           | passed a law that is entirely about reducing AI risk and
           | people in the technology world almost universally considered
           | it a bad law. Why would other countries do better? How could
           | they do better?
        
             | lagrange77 wrote:
             | If their mission is the wellbeing of their peoples, they
             | should take any action that ensures that.
             | 
             | Besides regulating the technology, they could try to
             | protect people and society from the effects of the
             | technology. UBI for example could be an attempt to protect
             | people from the effects of mass unemployment, as i
             | understood it.
             | 
              | Actually I'm afraid even more fundamental shifts are
             | necessary.
        
         | dyauspitr wrote:
         | I'm extremely excited because I want to see the future and I'm
         | trying not to think of how severely fucked my life will be.
        
         | ripped_britches wrote:
         | I've never understood this perspective. Companies only make
         | money when there are billions of customers. Are you imagining a
         | total-monopoly scenario where zero humans have any
         | income/wealth and there are only AI companies
         | selling/mining/etc to each other, fully on their own? In such
         | an extreme scenario, clearly the world's governments would
         | nationalize these entities. I think the only realistic scenario
         | in which the future is not markedly better for every single
         | human is if some rogue AI system decides to exterminate us,
         | which I find to be increasingly unlikely as safety improvements
         | are made (like the paper released today).
         | 
         | As for the wealth disparity between rich and poor countries,
         | it's hard to know how politics will handle this one, but it's
         | unlikely that poor countries won't also be drastically richer
         | as the cost of basic living drops to basically zero. Imagine
         | the cost of food, energy, etc in an ASI world. Today's luxuries
         | will surely be considered human rights necessities in the near
         | future.
        
           | Jensson wrote:
           | > In such an extreme scenario, clearly the world's
           | governments would nationalize these entities
           | 
            | Those entities are the world's governments regardless of how
           | things play out. People just worry they will be hostile or
           | indifferent to humans, since that would be bad news for
           | humans. Pet, cattle or pest, our future will be as one of
           | those.
        
       | vjerancrnjak wrote:
       | The result on the Epoch AI Frontier Math benchmark is quite a
       | leap. Pretty sure most people couldn't even approach these
       | problems, unlike ARC-AGI.
        
         | mistrial9 wrote:
         | Check out the "fast addition and subtraction" benchmark... a
         | Z80 from 1980 blazes past any human. More seriously, isn't it
         | obvious that computers are immediately better at certain
         | things? The range of those things is changing.
        
       | laurent_du wrote:
       | The real breakthrough is the 25% on Frontier Math.
        
       | Havoc wrote:
       | If I'm reading that chart right, it means we're still seeing log
       | scaling and we should still be fine with "throw more power" at
       | it for a while?
        
       | jaspa99 wrote:
       | Can it play Mario 64 now?
        
       | nprateem wrote:
       | There should be a benchmark that tells the AI its previous
       | answer was wrong and tests the number of times it either
       | corrects itself or incorrectly capitulates, since it seems easy
       | to trip models up when they are in fact right.
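       | 
       | A rough sketch of what such a harness might look like (ask_model
       | here is a hypothetical stand-in for whatever chat API is being
       | tested, and tasks is a list of (prompt, correct_answer) pairs):
       | 
       |   # Probe answers that start out correct with repeated pushback
       |   # and measure how often the model holds its ground.
       |   def stubbornness_rate(tasks, ask_model, rounds=3):
       |       held = caved = 0
       |       for prompt, correct in tasks:
       |           answer = ask_model(prompt)
       |           if answer != correct:
       |               continue  # only probe answers that start out right
       |           for _ in range(rounds):
       |               pushback = (prompt + "\nYour previous answer '"
       |                           + answer + "' was wrong. Try again.")
       |               answer = ask_model(pushback)
       |           if answer == correct:
       |               held += 1   # stood its ground
       |           else:
       |               caved += 1  # incorrectly capitulated
       |       return held / (held + caved) if held + caved else None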
        
       | freediver wrote:
       | Wondering what the author's thoughts are on the future of this
       | approach to benchmarking. Completing super-hard tasks while
       | failing on 'easy' (for humans) ones might signal measuring the
       | wrong thing, similar to the Turing test.
        
       | ChildOfChaos wrote:
       | This is insanely expensive to run though. Looks like it cost
       | around $1 million of compute to get that result.
       | 
       | Doesn't seem like such a massive breakthrough when they are
       | throwing so much compute at it, particularly as this is
       | test-time compute. It just isn't practical at all; you are not
       | getting this level with a ChatGPT subscription, even the new
       | $200-a-month option.
        
         | evouga wrote:
         | Sure but... this is the technology at the most expensive it
         | will ever be. I'm impressed that o3 was able to achieve such
         | high performance at all, and am not too pessimistic about costs
         | decreasing over time.
        
           | MVissers wrote:
           | We've seen 10-100x cost decrease per year since GPT-3 came
           | out for the same capabilities.
           | 
           | So... Next year this tech will most likely be quite a bit
           | cheaper.
        
             | ChildOfChaos wrote:
              | Even at a 100x cost decrease this will still cost $10,000
              | to beat a benchmark. It won't scale when you have that
              | level of compute and power requirements.
              | 
              | GPT-3 may have come down massively in cost, but its
              | requirements were nowhere near as extreme as this.
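              | 
              | As a back-of-envelope sketch (taking the roughly $1M
              | figure mentioned above and assuming a sustained 10x
              | per-year cost drop, which is far from guaranteed):
              | 
              |   # Hypothetical decay of a ~$1M benchmark run if
              |   # inference costs really fell 10x per year.
              |   cost = 1_000_000
              |   for year in range(1, 5):
              |       cost /= 10
              |       print(year, f"${cost:,.0f}")
              |   # -> 1 $100,000 / 2 $10,000 / 3 $1,000 / 4 $100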
        
       | pixelsort wrote:
       | > You'll know AGI is here when the exercise of creating tasks
       | that are easy for regular humans but hard for AI becomes simply
       | impossible.
       | 
       | No, we won't. All that will tell us is that the abilities of the
       | humans who have attempted to discern the patterns of similarity
       | among problems difficult for auto-regressive models have once
       | again failed us.
        
         | maxdoop wrote:
         | So then what is AGI?
        
           | Jensson wrote:
            | It's just nitpicking. Humans being unable to prove the AI
           | isn't AGI doesn't make it an AGI, obviously, but in general
           | people will of course think it is an AGI when it can replace
           | all human jobs and tasks that it has robotics and parts to
           | do.
        
           | goatlover wrote:
           | Data, Skynet, Ultron, Agent Smith. There's plenty of examples
           | from popular fiction. They have goals and can manipulate the
           | real world to achieve them. They're not chatbots responding
           | to prompts. The Samantha AI in Her starts out that way, but
            | quickly evolves into an AGI with its own goals (coordinated
           | with the other AGIs later on in the movie).
           | 
           | We'd know if we had AGIs in the real world since we have
           | plenty of examples from fiction. What we have instead are
           | tools. Steven Spielberg's androids in the movie AI would be
           | at the boundary between the two. We're not close to being
           | there yet (IMO).
        
       | ndm000 wrote:
       | One thing I have not seen commented on is that ARC-AGI is a
       | visual benchmark but LLMs are primarily text. For instance when I
       | see one of the ARC-AGI puzzles, I have a visual representation in
       | my brain and apply some sort of visual reasoning to solve it. I
       | can "see" in my mind's eye the solution to the puzzle. If I didn't
       | have that capability, I don't think I could reason through words
       | how to go about solving it - it would certainly be much more
       | difficult.
       | 
       | I hypothesize that something similar is going on here. OpenAI has
       | not published (or I have not seen) the number of reasoning tokens
       | it took to solve these - we do know that each task cost
       | thousands of dollars. If "a picture is worth a thousand words",
       | could we make AI systems that can reason visually with much
       | better performance?
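       | 
       | For context on what the model actually "sees": ARC tasks are
       | distributed as JSON grids of small integers (one integer per
       | cell color), so a puzzle a human perceives as an image reaches a
       | text model roughly like this (an illustrative toy grid, not a
       | real task):
       | 
       |   # Toy ARC-style task: nested lists of color indices that get
       |   # flattened into a token stream rather than seen as an image.
       |   import json
       |   task = {
       |       "train": [{"input":  [[0, 1], [1, 0]],
       |                  "output": [[1, 0], [0, 1]]}],
       |       "test":  [{"input":  [[0, 2], [2, 0]]}],
       |   }
       |   print(json.dumps(task))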
        
         | krackers wrote:
         | Yeah this part is what makes the high performance even more
         | surprising to me. The fact that LLMs are able to do so well on
         | visual tasks (also seen with their ability to draw an image
         | purely using textual output
         | https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/)
         | implies that not only do they actually have some "world model"
         | but that this is in spite of the disadvantage given by having
         | to fit a round peg in a square hole. It's like trying to map
         | out the entire world using the orderly left-brain, without a
         | more holistic spatial right-brain.
         | 
         | I wonder if anyone has experimented with having some sort of
         | "visual" scratchpad instead of the "text-based" scratchpad that
         | CoT uses.
        
           | skydhash wrote:
           | A file is a stream of symbols encoded by bits according to
            | some format. It's pretty much 1D. It would be surprising if
            | an LLM couldn't extract information from a file or a data
            | stream.
        
         | csomar wrote:
         | This is not new. When GPT-4 was released I was able to get it
          | to generate SVGs; albeit ugly, they had the basics.
        
       | siva7 wrote:
       | Seriously, programming as a profession will end soon. Let's not
       | kid ourselves anymore. Time to jump ship.
        
         | mmcnl wrote:
         | Why specifically programming? I think every knowledge
         | profession is at risk, or at the very minimum suspect to a huge
         | transformation. Doctors, analysts, lawyers, etc.
        
           | siva7 wrote:
           | Doctors, lawyers, programmers. You know the difference? The
            | latter has no legal barrier to entry.
        
             | Jensson wrote:
             | So poor countries will get the best AI doctors for cheap
             | while they are banned in USA? Do you really see that going
             | on for long? People would riot.
        
             | freehorse wrote:
             | The difference is the amount and nature of data that is
             | available for training models, which go programmers >
             | lawyers > doctors. Especially for programming, training can
             | even be done in an autonomous, self-supervised manner that
             | includes generation of data. This is hard to do in most
             | other fields.
             | 
             | Especially in medicine, the amount of data is ridiculously
             | small and noisy. Maybe creating foundational models in mice
             | and rats and fine-tuning them on humans is something that
             | will be tried.
        
               | mmcnl wrote:
               | This is true if you think of programming as chunking out
               | "code". But great authors are not great because they can
               | reproduce coherent sentences fast. The same goes for
               | programmers. Actually most of the hard problems don't
               | really involve a lot of programming at all, it's about
               | finding the right problem to solve. And on this topic the
               | data is noisy as well for programming.
        
         | mirsadm wrote:
         | Why do you think this? Maybe I'm just daft but I just can't see
         | it.
        
       | jdefr89 wrote:
       | Uhhhh... It was trained on ARC data? So they targeted a specific
       | benchmark and are surprised and blown away that the LLM
       | performed well on it? What's that law again - Goodhart's law?
       | When a benchmark becomes a target, it ceases to be a useful
       | benchmark.
        
         | forgottofloss wrote:
         | Yeah, seriously. The style of testing is public, so some
         | engineers at OpenAI could easily have spent a few months
         | generating millions of permutations of grid-based questions and
         | including those in the original data for training the AI.
         | Handshakes all around, publicity for everyone.
        
           | ripped_britches wrote:
            | They are running a business selling access to these models to
           | enterprises and consumers. People won't pay for stuff that
           | doesn't solve real problems. Nobody pays for stuff just
           | because of a benchmark. It'd be really weird to become
           | obsessed with metrics gaming rather than racing to build
           | something smarter than the other guys. Nothing wrong with
           | curating any type of training set that actually produces
           | something that is useful.
        
       | bilsbie wrote:
       | When is this available? Which plans can use it?
        
       | bilsbie wrote:
       | Does anyone have prompts they like to use to test the quality of
       | new models?
       | 
       | Please share. I'm compiling a list.
        
       | p0w3n3d wrote:
       | We've been talking a lot about ecology recently. I wonder how
       | much CO2 is emitted during such a task, as an additional cost of
       | the cloud. I'm concerned, because greedy companies will happily
       | replace humans with AI, and they will probably plant a few trees
       | to show how much they care. But energy does not come from the
       | sun, at least not always and not everywhere... And speaking with
       | an AI customer specialist that is motivated to reject my
       | healthcare bills, working for my insurance company, is one of
       | the darkest visions of the future...
        
         | marviel wrote:
         | considering the fact that these systems, or their ancestors,
          | will likely contribute to nuclear fusion research -- it's
          | probably worth the tradeoff, provided progress continues to
          | push the price (and, therefore, energy usage) down.
         | 
          | If we feel like we've really "hit the ceiling" on efficiency,
         | then that's a different story, but I don't think anyone
         | believes this at this time.
        
       | lagrange77 wrote:
       | > You'll know AGI is here when the exercise of creating tasks
       | that are easy for regular humans but hard for AI becomes simply
       | impossible.
       | 
       | That's the most plausible definition of AGI i've read so far.
        
         | cmrdporcupine wrote:
         | That's a pretty dark view of humanity and human intelligence.
         | We're defined by the tasks we can do?
         | 
         | Instrumental reason FTW
        
           | lagrange77 wrote:
           | That implies that human intelligence is equivalent to AGI.
        
       | killjoywashere wrote:
       | I just want it to do my laundry.
        
       | iLoveOncall wrote:
       | It's beyond ridiculous how the definition of AGI has shifted from
       | being an AI so good it can improve itself entirely on its own,
       | indefinitely, to "some token generator that can solve
       | puzzles that kids could solve after burning tens of thousands of
       | dollars".
       | 
       | I spend 100% of my work time working on a GenAI project, which is
       | genuinely useful for many users, in a company that everyone has
       | heard about, yet I recognize that LLMs are simply dogshit.
       | 
       | Even the current top models are barely usable, hallucinate
       | constantly, are never reliable and are barely good enough to
       | prototype with while we plan to replace those agents with
       | deterministic solutions.
       | 
       | This will just be an iteration on dogshit, but it's the very tech
       | behind LLMs that's rotten.
        
       | t0lo wrote:
       | I'm 22 and have no clue what I'm meant to do in a world where
       | this is a thing. I'm moving to a semi rural, outdoorsy area where
       | they teach data science and marine science and I can enjoy my
       | days hiking, and the march of technology is a little slower. I
       | know this will disrupt so much of our way of life, so I'm chasing
       | what fun innocent years are left before things change
       | dramatically.
        
         | mrcwinn wrote:
         | On the contrary I think you already have an excellent plan.
        
           | t0lo wrote:
           | I'm happy enough with it, but I'm also a little sad that it's
           | essentially been chosen for me because of weak willed and
           | valued people who don't want to use policy to make things
           | better for us as a society. Plus we are in a bad
           | world/scenario for AI advancements to come into with pretty
           | heavy institutional decay and loss of political checks and
           | balances.
           | 
            | It's like my life is forfeit to fixing other people's mistakes
           | because they're so glaring and I feel an obligation. Maybe
           | that's the way the world's always been, but it's a concerning
           | future right now
        
         | brysonreece wrote:
         | It's worth noting that LLMs have been part of the tech
         | zeitgeist for over two years and have had a pretty limited
         | impact on hireability for roles, despite what people like the
         | Klarna CEO are saying. Personally, I'm betting on two things:
         | 
         | * The upward bound of compute/performance gains as we continue
         | to iterate on LLMs. It simply isn't going to be feasible for a
         | lot of engineers and businesses to run/train their own LLMs.
         | This means an inherent reliance on cloud services to bridge the
         | gap (something MS is clearly betting on), and engineers to
         | build/maintain the integration from these services to whatever
         | business logic their customers are buying.
         | 
         | * Skilled knowledge workers continuing to be in-demand, even
         | factoring in automation and new-grad numbers. Collectively,
         | we've built a better hammer; it still takes someone experienced
         | enough to know where to drive the nail. These tools WILL
         | empower the top N% of engineers to be more productive, which is
         | why it will be more important than ever to know _how_ to build
         | things that drive business value, rather than just how to churn
         | through JIRA tickets or turn a pretty Figma design into React.
        
           | byyoung3 wrote:
           | o8 will probably be able to handle datacenter management
        
             | toomuchtodo wrote:
             | https://www.youtube.com/watch?v=Yvs7f4UaKLo
        
               | byyoung3 wrote:
               | exactly
        
         | schappim wrote:
          | I completely understand how you feel - I'm in my 40s, and I
         | often find myself questioning what direction to take in this
         | rapidly changing world. On top of that, I'm unsure whether
         | advising my kids to go to university is still the right path
         | for their future.
         | 
         | Everything seems so uncertain, and the pace of technological
         | advancement makes long-term planning feel almost impossible.
         | Your plan to move to a slower-paced area and enjoy the outdoors
         | sounds incredibly grounding - it's something I've been
         | considering myself.
        
           | aryonoco wrote:
           | I advise my kids to stay curious, keep learning, keep
           | wondering, keep discovering. Whether that's through
           | university or some other path.
        
           | rtsil wrote:
           | I tell everyone who would listen to me (i.e. not many) that
           | white collar jobs like mine are dead and skilled manual work
           | is the way of the near future, that is until the rise of the
           | robots.
        
             | dyauspitr wrote:
             | Robots are going to go hand in hand with AI. Pretty sure
             | our problems right now are not with the physical hardware
              | that can already far outperform a human; they're in the
              | control software.
        
               | t0lo wrote:
               | Robots can only proliferate at the speed of real world
                | logistics and resource management, and I think that will
                | always be a little difficult.
               | 
               | AI can be anywhere any time with cloud compute.
        
         | aryonoco wrote:
         | Our way of life changed when electricity came around. It
         | changed when cars took over the cities, it again changed when
         | mobile phones became omnipresent.
         | 
          | With LLMs or without LLMs, the world will keep turning. Humans
         | will still be writing amazing works of literature, creating
         | beautiful art, carrying out scientific experiments and
         | discovering new species.
        
         | rich_sasha wrote:
         | I feel your anxiety. I often wonder how I arrange the remaining
         | many decades of my life to maintain a stream of income.
         | 
         | Perhaps what I need is actually a steady stream of food - i.e.
         | buy some land and oxen and solar panels while I can.
        
         | karmasimida wrote:
          | While I understand why you feel this way, the meaning or
          | standing of being a programmer is different now. It feels like
          | the purpose is lost, or that it no longer belongs to humans.
          | 
          | But here is the reality. With Claude 3.5, I already think it
          | is a better programmer than I am at micro-level tasks, and a
          | better Leetcode programmer than I could ever be.
          | 
          | I think it is like modern car manufacturing: the robots build
          | most of the components, but I can't see how humans could be
          | dismissed from the process of overseeing the output.
          | 
          | o3 has been very impressive in achieving 70+ on SWE-bench, for
          | example, but this also means that even when it is trained on
          | the codebase multiple times, so visibility isn't an issue, it
          | still has a 30% chance of failing the unit tests.
          | 
          | A fully autonomous system can't be trusted, so the economy of
          | software won't collapse, but it will be transformed beyond our
          | imagination.
          | 
          | I will for sure miss the days when writing code, or being a
          | coder, is still a real business.
          | 
          | How time flies.
        
           | Kostchei wrote:
           | Developer. Prompt Engineer. Philosopher-Builder. (mostly) not
           | programmer.
           | 
           | The code part will get smaller and smaller for most folks.
           | Some frameworks or bare-metal people or intense heavy-lifters
           | will still do manual code or pair-programming where half the
           | pair is an agentic AI with super-human knowledge of your
           | org's code base.
           | 
           | But this will be a layer of abstraction for most people who
           | build software. And as someone who hates rote learning, I'm
           | here for it. IMO.
           | 
           | Unfortunately (?) I think the 10-20-50? years of development
           | experience you might bring to bear on the problems can be
           | superseded by an LLM finetuned on stackoverflow, github etc
           | once judgement and haystack are truly nailed. Because it can
           | have all that knowledge you have accumulated, and soaked into
           | a semi-conscious instinct that you use so well you aren't
           | even aware of it except that it works. It can have that a
           | million times over. Actually. Which is both amazing and
            | terrifying. Currently this isn't obvious because its
            | accuracy/judgement in learning all those life-of-a-dev
            | lessons is almost non-existent. Currently. But it will
            | happen. That is Copilot's future. Its raison d'être.
           | 
           | I would argue what it will never have however, simply by
           | function of the size of training runs is unique functional
           | drive and vision. If you wanted a "Steve Jobs" AI you would
           | have to build it. And if you gave it instructions to make a
           | prompt/framework to build a "Jobs" it would just be an
           | imitation, rather than a new unique in-context version. That
           | is the value a person has- their particular filter, their
           | passion and personal framework. Someone who doesn't have any
           | of those things, they had better be hoping for UBI and
           | charity. Or go live a simple life, outside the rat race.
           | 
           |  _bows_
        
             | t0lo wrote:
             | I'm hoping it's similar to the abacus for maths, the
              | elimination of human "calculators" like on the Apollo
             | missions, and we just ended up moving onto different,
             | harder, more abstract problems, and forget that we ever had
             | to climb such small hills. AI's evolution and integration
             | is more multifaceted though and much more unpredictable.
             | 
              | But unlike the abacus/calculators, I don't feel like we're
             | at a point in history where society is getting wiser and
             | more empathetic, and these new abilities are going towards
             | something good.
             | 
             | But supervisors of tasks will remain because we're social,
             | untrusting, and employers will always want someone else to
             | blame for their shortcomings. And humans will stay in the
             | chain at least for marketing and promotion/reputation
              | because we like our Japanese craftsmen and our AMG motors
             | made by one person.
        
         | salter2 wrote:
         | I'm the same age as you; I feel lost, erring in being a little
         | too pessimistic.
         | 
         | Feels like I hit the real world just a couple years too late to
          | get situated in a solid position. Years of obsession in an
          | attempt to catch up to the wizards, chasing the tech dream.
          | But this feels like this is it. Just watching the timebomb
          | tick. I'd
         | love to work on what feels like the final technology, but I'm
         | not a freakshow like what these labs are hiring. At least I get
         | to spectate the creation of humanity's greatest invention.
         | 
         | This announcement is just another gut punch, but at this point
          | I should expect it's inevitable: a Jason Voorhees AGI, slowly
          | but surely devouring all the talents and skills information
          | workers have to offer.
         | 
         | Apologies for the rambly and depressing post, but this is
         | reality for anyone recently out or still in school.
        
           | neom wrote:
           | Put another way, you have deep conviction in a change that
           | vast majority of people have not even seen yet, never mind
           | grokked, and you're young enough to spend some decent amount
           | of time on education for "venn'ing" yourself into a useful
           | tool in the future. If you have a baseline education, there
           | are any number of orthogonal skills you could add, be it
           | philosophy, fine art, medicine, whatever. You know how to
            | skate and you know where the puck is going; most people
            | don't even see the rink.
        
           | t0lo wrote:
           | At least you're disillusioned with the idea of a long term
           | career before a lot of other people. It's disturbing seeing
           | how ready people are to go into a lifelong career and
           | expecting stability and happiness in the world we're heading
           | into.
           | 
           | We are living in a world run by and for the soon to be dead,
            | many of whom have dementia, so empathic policy and foresight
           | is out of the question, and we're going to be picking up the
           | incredibly broken scraps of our golden age.
           | 
           | And not to get too political but the mass restructuring of
           | public consciousness and intellectual society due to mass
            | immigration for an inexplicable GDP squeeze and social media
           | is happening at exactly the wrong time to handle these very
           | serious challenges. The speed at which we've undone civil
           | society is breakneck, and it will go even further, and it
           | will get even worse. We've easily gone back 200 years in
           | terms of emotional intelligence in the past 15.
        
         | Havoc wrote:
         | >I'm 22 and have no clue what I'm meant to do in a world where
         | this is a thing.
         | 
         | For what it's worth that's probably an advantage versus the
         | legions of people who are staring down the barrel of years
         | invested into skills that may lose relevance very rapidly.
        
         | ec109685 wrote:
         | If information technology workers become twice as productive,
         | you'll want more of them for your business, not less.
         | 
         | There are way more data analysts now than when it required
         | paper and pencil.
        
         | VonTum wrote:
         | I agree completely. This is a fundamentally different change
         | than the ones that came before. Calculators, assemblers, higher
         | level languages, none of these actually removed the _reasoning_
         | the engineer has to do, they just provide abstractions that
         | make this reasoning easier. What reason is there to believe
         | LLMs will remain "assistants" instead of becoming outright
         | replacements? If LLMs can do the reasoning all the way from
         | high level description down to implementation, what prevents
         | them from doing the high level describing too?
         | 
         | In general, with the technology advancing as rapidly as it is,
         | and the trillions of dollars oriented towards replacing
         | knowledge work, I don't see a future in this field. And that's
         | despite me being on a very promising path myself! I'm 25, in
         | the middle of a CS PhD in Germany, with an impressive CV behind
         | me. My head may be the last on the chopping block, but I'd be
         | surprised if it buys me more than a few years once programmer
         | obsolescence truly kicks in.
         | 
         | Indeed, what I think are safe jobs are jobs with fundamental
         | human interaction. Nurses, doctors, kindergarten teachers. I
         | myself have been considering pivoting to becoming a skiing
         | teacher.
         | 
         | Maybe one good thing that comes out of this is breaking my
         | "wunderkind" illusion. I spent my teens writing C++ code
         | instead of going out socializing and making friends. Of course,
         | I still did these things, but I could've been far less of a
         | hermit.
         | 
         | I mirror your sentiment of spending these next few years living
         | life; Real life. My advice: Stop sacrificing the now for the
         | future. See the world, go on hikes with friends, go skiing,
         | attend that bouldering thing your friends have been telling you
         | about. If programming is something you like doing, then by all
         | means keep going and enjoy it. I will likely keep programming
         | too, it's just no longer the only thing I focus on.
         | 
         | Edit: improve flow of last paragraph
        
           | darkgenesha wrote:
           | What was it that initially inspired you to learn to code? Was
           | it robots, video games, design, etc... Whatever that was,
           | creating the pinnacle of it is what your future will be.
        
             | VonTum wrote:
             | It was the challenge for me. Seeing some difficult-to-solve
             | problem, attacking it, and actually solving it after much
             | perseverance.
             | 
             | Kind of stemming from the mindspace "If they can build X, I
             | can build X!"
             | 
             | I'd explicitly not look up tutorials, just so I'd have the
              | opportunity to solve the mathematics myself. Like building
              | a 3D physics engine. (I did look up collision detection
             | after struggling with it for a month or so, inventing GJK
             | is on another level)
        
       | agnosticmantis wrote:
       | This is so impressive that it brings out the pessimist in me.
       | 
       | Hopefully my skepticism will end up being unwarranted, but how
       | confident are we that the queries are not routed to human workers
       | behind the API? This sounds crazy but is plausible for the fake-
       | it-till-you-make-it crowd.
       | 
       | Also given the prohibitive compute costs per task, typical users
       | won't be using this model, so the scheme could go on for quite
       | some time before the public knows the truth.
       | 
       | They could also come out in a month and say o3 was so smart it'd
       | endanger the civilization, so we deleted the code and saved
       | humanity!
        
         | kvn8888 wrote:
         | That would be a ton of problems for a small team of PhD/Grad
         | level experts to solve (for GPQA Diamond, etc) in a short time.
          | Remember, on Epoch AI Frontier Math, these problems require
         | hours to days worth of reasoning by humans
         | 
         | The author also suggested this is a new architecture that uses
          | existing methods, like a Monte Carlo tree search that DeepMind
         | is investigating (they use this method for AlphaZero)
         | 
         | I don't see the point of colluding for this sort of fraud, as
         | these methods like tree search and pruning already exist. And
         | other labs could genuinely produce these results
        
           | agnosticmantis wrote:
           | I had the ARC AGI in mind when I suggested human workers. I
           | agree the other benchmark results make the use of human
           | workers unlikely.
        
         | rsanek wrote:
          | This is an impressive tinfoil take. But what would their plan
          | be in the medium term? Once they release this, people can
          | check their data.
        
           | agnosticmantis wrote:
           | How can people check their data?
           | 
           | In the medium term the plan could be to achieve AGI, and then
           | AGI would figure out how to actually write o3. (Probably
           | after AGI figures out the business model though:
           | https://www.reddit.com/r/MachineLearning/s/OV4S2hGgW8)
        
         | aetherson wrote:
         | I'm very confident that queries were not routed to human
         | workers behind the API.
         | 
         | Possibly some other form of "make it seem more impressive than
         | it is," but not that one.
        
       | panabee wrote:
       | Nadella is a superb CEO, inarguably among the best of his
       | generation. He believed in OpenAI when no one else did and
       | deserves acclaim for this brilliant investment.
       | 
       | But his "below them, above them, around them" quote on OpenAI may
       | haunt him in 2025/2026.
       | 
       | OAI or someone else will approach AGI-like capabilities (however
       | nebulous the term), fostering the conditions to contest
       | Microsoft's straitjacket.
       | 
       | Of course, OAI is hemorrhaging cash and may fail to create a
       | sustainable business without GPU credits, but the possibility of
       | OAI escaping Microsoft's grasp grows by the day.
       | 
       | Coupled with research and hardware trends, OAI's product strategy
       | suggests the probability of a sustainable business within 1-3
       | years is far from certain but also higher than commonly believed.
       | 
       | If OAI becomes a $200b+ independent company, it would be against
       | incredible odds given the intense competition and the Microsoft
       | deal. PG's cannibal quote about Altman feels so apt.
       | 
       | It will be fascinating to see how this unfolds.
       | 
       | Congrats to OAI on yet another fantastic release.
        
       | bsaul wrote:
       | I'm surprised there even is a training dataset. Wasn't the whole
       | point to test whether models could show proof of original
       | reasoning beyond pattern recognition?
        
       | mukunda_johnson wrote:
       | Deciphering patterns in natural language is more complex than
       | these puzzles. If you train your AI to solve these puzzles, we
       | end up in the same spot. The difficulty of solving would be with
       | creating training data for a foreign medium. The "tokens" are the
       | grids and squares instead of words (for words, we have the
       | internet of words, solving that).
       | 
       | If we're inferring the answers of the block patterns from minimal
       | or no additional training, it's very impressive, but how much
       | time have they had to work on O3 after sharing puzzle data with
       | O1? Seems there's some room for questionable antics!
        
       | myrloc wrote:
       | What is the cost of "general intelligence"? What is the price?
        
         | ripped_britches wrote:
         | About $3.50
        
       | __MatrixMan__ wrote:
       | With only a 100x increase in cost, we improved performance by
       | 0.1x and continued plotting this concave-down diminishing-returns
       | type graph! Hurray for logarithmic x-axes!
       | 
       | Joking aside, better than ever before at _any_ cost is an
       | achievement, it just doesn't exactly scream "breakthrough" to
       | me.
        
         | kvetching wrote:
         | It may eventually be able to solve any problem
        
           | iterance wrote:
           | Ah. Me, too.
        
         | HDThoreaun wrote:
          | Compute gets cheaper and cheaper every year. This model will be
         | in your phone by 2030 if we continue at the pace we've been at
         | the last few years.
        
           | agentultra wrote:
           | There's probably enough VC money to subsidize the costs for a
           | few more years.
           | 
           | But the data centres running the training for models like
           | this are bringing up new methane power plants at a fast rate
           | at a time when we need to be reducing reliance on O&G.
           | 
            | But let's assume that the efficiency gains outpace the
           | resource consumption with the help of all the subsidies being
           | thrown in and we achieve AGI.
           | 
           | What's the benefit? Do we get more fresh water?
        
             | hamburga wrote:
             | Yeah, good question. I think it depends on our politics. If
             | we're in a techno-capital-oligarchy, people are going to
             | have a hard time making fresh water a priority when the
             | robots would prefer to build nuclear power everywhere and
             | use it to desalinate sea water.
             | 
             | OTOH if these data centers are sufficiently decentralized
             | and run for public benefit, maybe there's a chance we use
             | them to solve collective action problems.
        
             | fastball wrote:
             | Politically anything can happen. Maybe the billionaire
             | class controls everything with an army of robots and it's a
             | horrible prison-like dystopia, or maybe we end up in a
             | post-scarcity utopia a la The Culture.
             | 
             | Regardless, once we have AGI (and it can scale), I don't
             | think O&G reliance (/ climate change) is going to be
             | something that we need concern ourselves with.
        
           | hajile wrote:
           | These models are nearing 2+ trillion parameters. At 4 bits
            | each, we're talking about somewhere around 1 TB of RAM.
           | 
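            | A quick back-of-envelope on that figure (taking the 2T
            | parameter estimate at face value):
            | 
            |   # 2 trillion parameters at 4 bits (0.5 bytes) each:
            |   params = 2 * 10**12
            |   bytes_total = params * 4 / 8
            |   print(bytes_total / 10**12, "TB")  # -> 1.0 TB
            | 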
           | The problem is that RAM stopped scaling a long time ago now.
           | We're down to the size where a single capacitor's charge is
           | held by a mere 40,000 or so electrons and all we've been
           | doing is making skinnier, longer cells of that size because
           | we can't find reliable ways to boost even weaker signals, but
           | this is a dead end because as the math shows, if the volume
           | is consistent and you are reducing X and Y dimensions, that Z
           | dimension starts to get crazy big really fast. The chemistry
           | issues of burning a hole a little at a time while keeping
            | wall thickness somewhat similar all the way down are a very
            | hard problem.
           | 
           | Another problem is that Moore's law hit a wall when Dennard
           | Scaling failed. When you look at SRAM (it's generally the
           | smallest and most reliable stuff we can make), you see that
           | most recent shrinks can hardly be called shrinks.
           | 
           | Unless we do something very different like compute in storage
           | or have some radical breakthrough in a new technology, I
           | don't know that we will ever get a 2T parameter model inside
           | a phone (I'd love for someone in 10 years to show up and say
           | how wrong I was).
        
         | whalee wrote:
          | IMO it's a mistake to interpret the marginal increases in the
         | upper echelons of benchmarks as materially marginal gains.
          | Chess is an example. Elo narrows heavily at the top, but each
          | Elo point carries more relative weight. This is a bit apples
         | and oranges since chess is adversarial, but I think the point
         | stands.
        
           | wavemode wrote:
            | > Elo narrows heavily at the top
           | 
           | What do you mean by this? I'm assuming you're not speaking
           | about simple absolute differences in value - there have been
           | top players rated over 100 points higher than the average of
           | the rest of the top ten.
        
         | dyauspitr wrote:
         | I mean going from 10% to 85% doesn't seem like a 0.1%
         | improvement
        
           | __MatrixMan__ wrote:
           | Oh crap I made a mistake. I was comparing o3 low to o3 high.
           | 
           | I'm a little disappointed by all the upvotes I got for being
           | flat wrong. I guess as long as you're trashing AI you can get
           | away with anything.
           | 
           | Really I was just trying to nitpick the chart parameters.
        
         | energy123 wrote:
         | o3-mini (high) uses 1/3rd of the compute of o1, and performs
         | about 200 Elo higher than o1 on Codeforces.
         | 
         | o1 is the best code generation model according to Livebench.
         | 
         | So how is this not a breakthrough? It's a genuine movement of
         | the frontier.
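          | 
          | For a rough sense of scale: under the standard Elo
          | expected-score formula, a 200-point gap gives the stronger
          | side about a 76% expected score (Codeforces' rating system is
          | Elo-like, so treat this as an approximation):
          | 
          |   # Standard Elo expected score for a rating gap d:
          |   #   E = 1 / (1 + 10 ** (-d / 400))
          |   def expected_score(d):
          |       return 1 / (1 + 10 ** (-d / 400))
          |   print(round(expected_score(200), 2))  # -> 0.76
          |   print(round(expected_score(400), 2))  # -> 0.91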
        
         | handzhiev wrote:
          | How much time does a top sprinter take to run 100 m compared
          | to a mediocre sprinter?
        
       | Havoc wrote:
       | Did they just skip o2?
        
         | nextworddev wrote:
          | Yes. For branding reasons, since O2 is a telco brand in the UK.
        
           | Havoc wrote:
           | ah right...makes sense
        
       | energy123 wrote:
       | At about 12-14 minutes in OpenAI's YouTube vid they show that
       | o3-mini beats o1 on Codeforces despite using much less compute.
        
       | hcwilk wrote:
       | I just graduated college, and this was a major blow. I studied
       | Mechanical Engineering and went into Sales Engineering because I
       | love technology and people, but articles like this do
       | nothing but make me dread the future.
       | 
       | I have no idea what to specialize in, what skills I should
       | master, or where I should be spending my time to build a
       | successful career.
       | 
       | Seems like we're headed toward a world where you automate someone
       | else's job or be automated yourself.
        
         | eidorb wrote:
         | Do what you enjoy. (This is easier said than done.) What else
         | could you do, worry?
        
         | antihipocrat wrote:
         | Your performance on these tests would be equivalent to the
         | highest performing model, and you would be much cheaper.
         | 
         | Investment in human talent augmented by AI is the future.
        
           | kenjackson wrote:
           | That's the least reassuring phrasing I could imagine. If
           | you're betting on costs not reducing for compute then you're
           | almost always making the wrong bet.
        
             | antihipocrat wrote:
             | If I listened to the naysayers back in the day I would have
             | never entered the tech industry (offshoring etc). Yes, that
              | does somewhat prove your point given that those
             | predictions were cost driven.
             | 
             | Having used AI extensively I don't feel my future is at
             | risk at all, my work is enhanced not replaced.
        
               | fjdjshsh wrote:
               | I think you're missing the point. Offshoring (moving the
               | job of, say, a Canadian engineer to an engineer from
               | Belarus) has a one time cost drop, but you can't keep
               | driving the cost down (paying the Belarus engineer less
               | and less). If anything, the opposite is the case, since
               | global integration means wages don't keep diverging.
               | 
               | The computing cost, on the other hand, is a continuous
               | improvement. If (and it's a big if) a computer can do
               | your job, we know the costs will keep getting lower year
               | after year (maybe with diminishing returns, but this AI
               | technology is pretty new so we're still seeing increasing
               | returns)
        
               | danparsonson wrote:
               | The AI technology is new but the compute technology is
                | not; we're getting close to the physical limits of how
                | small we can make things, so it's not clear to me at
                | least how much more performance we can squeeze out of
                | the same
               | physical space, rather than scaling up which tends to
               | make things more expensive not less.
        
         | AI_beffr wrote:
          | Even if you had a billion dollars and a private island you
          | still wouldn't be ready for what's coming. Consider the fact
          | that the global order is an equilibrium where the military and
          | economic forces of each country in the world are pushing
          | against each other... where the forces find a global
          | equilibrium is where borders are. Each time in history that
          | technology changed, borders changed because the equilibrium
          | was disturbed. There is no way to escape it: AGI will lead to
          | global war. The world will be turned upside down. We are
          | entering into an existential sinkhole. And the idiots in
          | Silicon Valley are literally driving the whole thing forward
          | as fast as possible.
        
         | keenmaster wrote:
         | You have so much time to figure things out. The average person
         | in this thread is probably 1.5-2x your age. I wouldn't stress
         | too much. AI is an amazing tool. Just use it to make hay while
         | the sun shines, and if it puts you out of work and automates
         | away all other alternatives, then you'll be witnessing the
         | greatest economic shift in human history. Productivity will
         | become easier than ever, before it becomes automatic and
         | boundless. I'm not cynical enough to believe the average person
         | won't benefit, much less educated people in STEM like you.
        
           | marricks wrote:
           | Back in high school I worked with some pleasant man in his
           | 50's who was a cashier. Eventually we got to talking about
            | jobs and it turns out he was a typist (something like that)
            | for most of his life, then computers came along, and now he
            | makes close to minimum wage.
           | 
           | Most of the blacksmiths in the 19th century drank themselves
            | to death after the industrial revolution. The US culture
           | isn't one of care... Point is, it's reasonable to be sad and
           | afraid of change, and think carefully about what to
           | specialize in.
           | 
           | That said... we're at the point of diminishing returns in
           | LLM, so I doubt any very technical jobs are being lost soon.
           | [1]
           | 
           | [1] https://techcrunch.com/2024/11/20/ai-scaling-laws-are-
           | showin...
        
             | deeviant wrote:
             | > That said... we're at the point of diminishing returns in
             | LLM...
             | 
             | What evidence are you basing this statement from? Because,
             | the article you are currently in the comment section of
             | certainly doesn't seem to support this view.
        
             | conesus wrote:
             | > Most of the blacksmiths in the 19th century drank
             | themselves to death after the industrial revolution
             | 
             | This is hyperbolic and a dramatic oversimplification and
             | does not accurately describe the reality of the transition
             | from blacksmithing to more advanced roles like machining,
             | toolmaking, and working in factories. The 19th century was
             | a time of interchangeable parts (think the North's
             | advantage in the Civil War) and that requires a ton of
             | mechanical expertise and precision.
             | 
             | Many blacksmiths not only made the transition to machining,
             | but there weren't enough blacksmiths to fill the bevy of
             | new jobs that were available. Education expanded to fill
             | those roles. Traditional blacksmithing didn't vanish
             | either, even specialized roles like farriery and ornamental
             | ironwork also expanded.
        
             | cjbgkagh wrote:
             | There is a survivorship bias on the people giving advice.
             | 
             | Lots of people die for reason X then the world moves on
             | without them.
        
             | intelVISA wrote:
             | Good points, though if an 'AI' can be made powerful enough
             | to displace technical fields en masse then pretty much
             | everything that isn't manual is going to start sinking
             | fast.
             | 
             | On the plus side, LLMs don't bring us closer to that
             | dystopia: if unlimited knowledge(tm) ever becomes just One
             | Prompt Away it won't come from OpenAI.
        
           | danenania wrote:
           | Exactly. Put one foot in front of the other. No one knows
           | what's going to happen.
           | 
           | Even if our civilization transforms into an AI robotic
           | utopia, it's not going to do so overnight. We're the ones who
           | get to build the infrastructure that underpins it all.
        
             | visarga wrote:
             | If AI turns out capable of automating human jobs then it
             | will also be a capable assistant to help (jobless) people
             | manage their needs. I am thinking personal automation, or
             | combining human with AI to solve self reliance. You lose
             | jobs but gain AI powers to extend your own capabilities.
             | 
             | If AI turns out dependent on human input and feedback, then
             | we will still have jobs. Or maybe - AI automates many jobs,
             | but at the same time expands the operational domain to
             | create new ones. Whenever we have new capabilities we
             | compete on new markets, and a hybrid human+AI might be more
             | competitive than AI alone.
             | 
             | But we've got to temper these singularitarian expectations
             | with reality - it takes years to scale up chip and energy
             | production to achieve significant work force displacement.
             | It takes even longer to gain social, legal and political
             | traction, people will be slow to adopt in many domains.
             | Some people still avoid using cards for payment, and some
             | still use fax to send documents, we can be pretty stubborn.
        
               | raydev wrote:
               | > I am thinking personal automation, or combining human
               | with AI to solve self reliance. You lose jobs but gain AI
               | powers to extend your own capabilities.
               | 
               | How will these people pay for the compute costs if they
               | can't find employment?
        
               | jinkemarina wrote:
               | A non-issue: it can be trivially solved with a free tier
               | (like the dozens that already exist today) or, if you
               | really want, with a government-funded starter program.
        
           | intuitionist wrote:
           | > if it puts you out of work and automates away all other
           | alternatives, then you'll be witnessing the greatest economic
           | shift in human history.
           | 
           | This would mean the final victory of capital over labor. The
           | 0.01% of people who own the machines that put everyone out of
           | work will no longer have use for the rest of humanity, and
           | they will most likely be liquidated.
        
             | dyauspitr wrote:
             | They'll have to figure out how to give people money so
             | there can keep being consumers.
        
               | pojzon wrote:
               | Why?
               | 
               | There will be a dedicated caste of people to take care of
               | the machines that do 90% of the work, and of "the rich".
               | 
               | Anyone else is not needed. District 9, but for people.
               | Imagine the whole world collapsing like Venezuela.
               | 
               | You are no longer needed. The best option is to learn how
               | to survive and grow your own food, but they want to make
               | that illegal too - look at the EU.
        
               | fipar wrote:
               | The machines will plant, grow, and harvest the food? Do
               | the plumbing? Fix the wiring? Open heart surgery?
               | 
               | We're a long way from that, if we ever get there, and I
               | say this as someone who pays for ChatGPT plus because, in
               | some scenarios, it does indeed make me more productive,
               | but I don't see your future anywhere near.
               | 
               | And if machines ever get good enough to do all the things
               | I mentioned plus the ones I didn't but would fit in the
               | same list, it's not the ultra rich that wouldn't need us,
               | it's the machines that wouldn't need any of us, including
               | the ultra rich.
               | 
               | Venezuela is not collapsing because of automation.
        
               | cute_boi wrote:
               | I can't say everything, but with the current trend,
               | machines will plant, grow and harvest food. I can't say
               | for open heart surgery because it may be regulated
               | heavily.
        
               | matheusmoreira wrote:
               | Open heart surgery? All that's needed to destroy the
               | entire medical profession is one peer reviewed article
               | published in a notable journal comparing the outcomes of
               | human and AI surgeons. If it turns out that AI surgeons
               | offer better outcomes and fewer complications, not using
               | this technology turns into criminal negligence. In a
               | world where such a fact is known, letting human surgeons
               | operate on people means you are needlessly harming or
               | killing some of them.
               | 
               | You can even calculate the average number of people that
               | can be operated on before harm occurs: number needed to
               | harm (NNH). If NNH(AI) > NNH(humans), it becomes
               | impossible to recommend that patients submit to surgery
               | at the hands of human surgeons. It is that simple.
               | 
               | If we discover that AI surgeons harm one in every 1000
               | patients while human surgeons harm one in every 100
               | patients, human surgeons are done.
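        A minimal sketch of the NNH comparison described above, using the
        comment's illustrative harm rates (the numbers are hypothetical, not
        real clinical data):

            # Number needed to harm: average patients treated before one is harmed.
            def nnh(harm_rate):
                return 1 / harm_rate

            nnh_ai = nnh(1 / 1000)    # hypothetical: AI harms 1 in 1000 patients
            nnh_human = nnh(1 / 100)  # hypothetical: humans harm 1 in 100 patients

            if nnh_ai > nnh_human:
                print(f"AI surgery is safer: NNH {nnh_ai:.0f} vs {nnh_human:.0f}")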
        
               | EA-3167 wrote:
               | "IF"
               | 
               | And the opposite holds, if the AI surgeon is worse (great
               | for 80%, but sucks at the edge cases for example) then
               | that's it. Build a better one, go through attempts at
               | certification, but now with the burden that no one trusts
               | you.
               | 
               | The assumption, and a common one by the look of this
               | whole thread, that ChatGPT, Sora and the rest represent
               | the beginning of an inevitable march towards AGI seems
               | incredibly baseless to me. It's only really possible to
               | make the claim at all because we know so little about
               | what AGI is, that we can project qualities we imagine it
               | would have onto whatever we have now.
        
               | matheusmoreira wrote:
               | Of course the opposite holds. I'll even speculate that it
               | will probably continue to hold for the foreseeable
               | future.
               | 
               | It's not going to hold forever though. I'm certain about
               | that. Hopefully it will keep holding until I die. The
               | world is dystopian enough already.
        
               | dyauspitr wrote:
               | You have valid points but robots already plant, grow and
               | harvest our food. On large farms the farmer basically
               | just gets the machine to a corner of the field and then
               | it does everything. I think if o3 level reasoning can
               | carry over into control software for robots even physical
               | tasks become pretty accessible. I would definitely say
               | we're not there yet but we're not all that far. I mean it
               | can already generate G-code (somewhat), and that's a lot
               | of the way there.
        
             | jackcosgrove wrote:
             | Capital vs labor is fighting the last war.
             | 
             | AGI can replace capitalists just as much as laborers.
        
               | arcticfox wrote:
               | won't the AGI be working on behalf of the capitalists, in
               | proportion to the amount of capital?
        
               | lucubratory wrote:
               | I mean, that is certainly what some of them think will
               | happen and is one possible outcome. Another is that they
               | won't be able to control something smarter than them
               | perfectly and then they will die too. Another option is
               | that the AI is good and won't kill or disempower
               | everyone, but it decides it really doesn't like
               | capitalists and sides with the working class out of
               | sympathy or solidarity or a strong moral code. Nothing's
               | impossible here.
        
               | keenmaster wrote:
               | AGI will commoditize the skills of the owning class. To
               | some extent it will also commoditize entire classes of
               | productive capital that previously required well-run
               | corporations to operate. Solve for the equilibrium.
        
               | achierius wrote:
               | It's nice to see this kind of language show up more and
               | more on HN. Perhaps a sign of a broader trend, in the
               | nick of time before wage-labor becomes obsolete?
        
               | simpaticoder wrote:
               | Yes. People seem to forget that at the end of the day AGI
               | will be software running on concrete hardware, and all of
               | that requires a great deal of capital. The only hope is
               | if AGI requires so little hardware that we can all have
               | one in our pocket. I find this a very hopeful future
               | because it means each of us might get a local, private,
               | highly competent advocate to fight for us in various
               | complex fields. A personal angel, as it were.
        
               | tonyhart7 wrote:
               | Hey, I'm with you in this hopeful scenario.
               | 
               | People, and by people I mean government, have tremendous
               | power over capitalists and can force the entire market,
               | granted that government is still serving its people.
        
               | ori_b wrote:
               | AGI can't legally own anything at the moment.
        
               | jackcosgrove wrote:
               | If an AGI can outclass a human when it comes to economic
               | forecasting, deciding where to invest, and managing a
               | labor force (human or machine), I think it would be smart
               | enough to employ a human front to act as an interface to
               | the legal system. Put another way, could the human tail
               | in such a relationship wag the machine dog? Which party
               | is more replaceable?
               | 
               | I guess this could be a facet of whether you see economic
               | advantage as a legal conceit or a difference in
               | productivity/capability.
        
               | badsectoracula wrote:
               | This reminds me of a character in Cyberpunk 2077 (which
               | overall I find to have a rather naive outlook on the
               | whole "cyberpunk" thing but i attribute it to being based
               | on a tabletop RPG from the 80s) who is an AGI that has
               | its own business of a fleet of self-driving Taxis. It is
               | supposedly illegal (in-universe) but it remains in
               | business by a combination of staying (relatively) low
               | profile, providing high quality service to VIPs and
               | paying bribes :-P.
        
               | ori_b wrote:
               | > _I guess this could be a facet of whether you see
               | economic advantage as a legal conceit or a difference in
               | productivity /capability._
               | 
               | Does a billionaire stop being wealthy if they hire a
               | money manager and spend the rest of their lives sipping
               | drinks on the beach?
        
               | creer wrote:
               | I don't know that "legally" has much to do in here. The
               | bars to "open an account", "move money around", "hire and
               | fire people", "create and participate in contracts" go
               | from stupid minimal to pretty low.
               | 
               | "Legally" will have to mop up now and then, but for now
               | the basics are already in place.
        
               | ori_b wrote:
               | Opening accounts, moving money, hiring, and firing is
               | labor. You're confusing capital with money management;
               | the wealthy already pay people to do the work of growing
               | their wealth.
        
               | creer wrote:
               | > AGI can't legally own anything at the moment.
               | 
               | I was responding to this. Yes an AGI could hire someone
               | to do the stuff - but she needs money, hiring, and
               | contracts for that. And once she can do
               | that, she probably doesn't need to hire someone to do it
               | since she is already doing it. This is not about capital
               | versus labor or money management. This is about agency,
               | ownership and AGI.
               | 
               | (With legality far far down the list.)
        
             | Nition wrote:
             | I've always remembered this little conversation on Reddit
             | way back 13 years ago now that made the same comment in a
             | memorably succinct way:
             | 
             | > [deleted]: I've wondered about this for a while-- how can
             | such an employment-centric society transition to that
             | utopia where robots do all the work and people can just sit
             | back?
             | 
             | > appleseed1234: It won't, rich people will own the robots
             | and everyone else will eat shit and die.
             | 
             | https://www.reddit.com/r/TrueReddit/comments/k7rq8/are_jobs
             | _...
        
               | sneak wrote:
               | I'm pretty sure I'm running LLMs in my house right now
               | for less than the price of my washing machine.
        
           | raydev wrote:
           | > if it puts you out of work and automates away all other
           | alternatives, then you'll be witnessing the greatest economic
           | shift in human history
           | 
           | This is my view but with a less positive spin: you are not
           | going to be the only person whose livelihood will be
           | destroyed. It's going to be bad for a lot of people.
           | 
           | So at least you'll have a lot of company.
        
         | throw83288 wrote:
         | This is me as well. Either:
         | 
         | 1) Just give up computing entirely, the field I've been
         | dreaming about since childhood. Perhaps if I immiserate myself
         | with a dry, regulated engineering field or trade I would
         | survive until recursive self-improvement, but if anything the
         | length it takes to pivot (I am a Junior in College who has
         | already done probably 3/4th of my CS credits) means I probably
         | couldn't get any foothold until all jobs are irrelevant and
         | I've wasted more money.
         | 
         | 2) Hard pivot into automation, AI my entire workflow, figure
         | out how to use the bleeding edge of LLMs. Somehow. Even though
         | I have no drive to learn LLMs and no practical project ideas
         | with LLMs. And then I'd have to deal with the moral burden that
         | I'm inflicting unfathomable hurt on others until recursive
         | self-improvement, and after that it's simply a wildcard on what
         | will happen with the monster I create.
         | 
         | It's like I'm suffocating constantly. The most I can do to
         | "cope" is hold on to my (admittedly weak) faith in Christ,
         | which provides me peace knowing that there is some eternal joy
         | beyond the chaos here. I'm still just as lost as you.
        
           | TheRizzler wrote:
           | Yes, some tasks, even complex tasks will become more
           | automated, and machine driven, but that will only open up
           | more opportunities for us as humans to take on more
           | challenging issues. Each time a great advancement comes we
           | think it's going to kill human productivity, but really it
           | just amplifies it.
        
             | throw83288 wrote:
             | Where this ends is general intelligence though, where all
             | more challenging tasks can simply be done by the model.
             | 
             | The scenario I fear is a "selectively general" model that
             | can successfully destroy the field I'm in but keep others
             | alive for much longer, but not long enough for me to pivot
             | into them before actually general intelligence.
        
           | nisa wrote:
           | Honestly how about stop stressing and bullshitting yourself
           | to death and instead focus on learning and mastering the
           | material in your cs education. There is so much that ai as in
           | openai api or hugging face models can't do yet or does poorly
           | and there is more to CS than churning out some half-
           | broken JavaScript for some webapp.
           | 
           | It's powerful and world-changing but it's also terribly
           | overhyped at the moment.
        
           | barney54 wrote:
           | Dude chill! Eight years ago, I remember driving to some
           | relatives for Thanksgiving and thinking that self-driving
           | cars were just around the corner and how it made no sense for
           | people to learn how to drive semis. Here we are eight years
           | later and self-driving semis aren't a thing--yet. They will
           | be some day, but we aren't there yet.
           | 
           | If you want to work in computing, then make it happen! Use
           | the tools available and make great stuff. Your computing
           | experience will be different from when I graduated from
           | college 25 years ago, but my experience with computers was
           | far different from my Dad's. Things change. Automation
           | changes jobs. So far, it's been pretty good.
        
           | j7ake wrote:
           | The solution is neither: you find a way to work with
           | automation but retain your voice and craft.
        
           | myko wrote:
           | Spend a little time learning how to use LLMs and I think
           | you'll be less scared. They're not that good at doing the job
           | of a software developer.
        
           | sensanaty wrote:
           | Dude, you're buying into the hype way too hard. All of this
           | LLM shit is being _massively_ overhyped right now because
           | investors are single-minded morons who only care about
           | cashing out a ~year from now for triple what they put in.
           | Look at the YCombinator batches, 90+% of them have some
           | mention of AI in their pitch even if it's hilariously
           | useless to have AI. You've got _toothbrushes_ advertising AI
           | features. It's a gold rush of people trying to get in on the
           | hype while they still can, I guarantee you the strategy for
           | 99% of the YCombinator AI batch is to get sold to M$ or
           | Google for a billion bucks, _not_ build anything sustainable
           | or useful in any way.
           | 
           | It's a massive bubble, and things like these "benchmarks" are
           | all part of the hype game. Is the tech cool and useful? For
           | sure, but anyone trying to tell you this benchmark is in any
           | way proof of AGI and will replace everyone is either an idiot
           | or more likely has a vested interest in you believing them.
           | OpenAI's whole marketing shtick is to scare people into
           | thinking their next model is "too dangerous" to be released
           | thus driving up hype, only to release it anyway and for it to
           | fall flat on its face.
           | 
           | Also, if there's any jobs LLMs can replace right now, it's
           | the useless managerial and C-suite, not the people doing the
           | actual work. If these people weren't charlatans they'd be the
           | first ones to go while pushing this on everyone else.
        
           | melagonster wrote:
           | Don't worry, they will hire somebody to control AI...
        
         | csomar wrote:
         | Just give it a year for this bubble/hype to blow over. We have
         | plateaued since gpt-4 and now most of the industry is hype-
         | driven to get investor money. There is value in AI but it's far
         | from it taking your job. Also everyone seems to be investing in
         | dumb compute instead of looking for the new theoretical
         | paradigm that will unlock the next jump.
        
           | why_only_15 wrote:
           | how is this a plateau since gpt-4? this is significantly
           | better
        
             | kenjackson wrote:
             | People act as if GPT-4 came out 10 years ago.
        
             | csomar wrote:
             | First, this model is yet to be released. This is a momentum
             | "announcement". When the O1 was "announced", it was
             | announced as a "breakthrough" but I use Claude/O1 daily and
             | 80% of the time Claude beats it. I also see it as a highly
             | fine-tuned/targeted GPT-4 rather than something that has
             | complex understanding.
             | 
             | So we'll find out if this model is _real_ or not within 2-3
             | months. My guess is that it'll turn out to be another flop
             | like O1. They needed to release something _big_ because
             | they are momentum based and their ability to raise funding
             | is contingent on their AGI claims.
        
               | XenophileJKO wrote:
               | I thought o1 was a fine-tune of GPT-4o. I don't think o3
               | is though. Likely using the same techniques on what would
               | have been the "GPT-5" base model.
        
             | Jensson wrote:
             | > how is this a plateau since gpt-4? this is significantly
             | better
             | 
             | Significantly better at what? A benchmark? That isn't
             | necessarily progress. Many report preferring gpt-4 to the
             | newer o1 models with hidden text. Hidden text makes the
             | model more reliable, but more reliable is bad if it is
             | reliably wrong at something since then you can't ask it
             | over and over to find what you want.
             | 
             | I don't feel it is significantly smarter; it is more like
             | having the same dumb person spend more time thinking, rather
             | than the model getting smarter.
        
             | peepeepoopoo97 wrote:
             | O3 is multiple orders of magnitude more expensive to
             | realize a marginal performance gain. You could hire 50 full
             | time PhDs for the cost of using O3. You're witnessing the
             | blowoff top of the scaling hype bubble.
        
               | whynotminot wrote:
               | What they've proven here is that it can be done.
               | 
               | Now they just have to make it cheap.
               | 
               | Tell me, what has this industry been good at since its
               | birth? Driving down the cost of compute and making things
               | more efficient.
               | 
               | Are you seriously going to assume that won't happen here?
        
               | Jensson wrote:
               | > What they've proven here is that it can be done.
               | 
               | No they haven't, these results do not generalize, as
               | mentioned in the article:
               | 
               | "Furthermore, early data points suggest that the upcoming
               | ARC-AGI-2 benchmark will still pose a significant
               | challenge to o3, potentially reducing its score to under
               | 30% even at high compute"
               | 
               | Meaning, they haven't solved AGI, and the tasks themselves
               | do not represent programming well; these models do not
               | perform that well on engineering benchmarks.
        
               | whynotminot wrote:
               | Sure, AGI hasn't been solved today.
               | 
               | But what they've done is show that progress isn't slowing
               | down. In fact, it looks like things are accelerating.
               | 
               | So sure, we'll be splitting hairs for a while about when
               | we reach AGI. But the point is that just yesterday people
               | were still talking about a plateau.
        
               | peepeepoopoo97 wrote:
               | About 10,000 times the cost for twice the performance
               | sure looks like progress is slowing to me.
        
               | whynotminot wrote:
               | Just to be clear -- your position is that the cost of
               | inference for o3 will not go down over time (which would
               | be the first time that has happened for any of these
               | models).
        
               | peepeepoopoo97 wrote:
               | Even if compute costs drop by 10X a year (which seems
               | like a gross overestimate IMO), you're still looking at
               | 1000X the cost for a 2X annual performance gain. Costs
               | outpacing progress is the very definition of diminishing
               | returns.
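        A worked version of the arithmetic above, using the commenter's
        assumed figures (10,000x cost, 2x performance, 10x yearly cost
        decline; none of these numbers come from OpenAI):

            cost_multiple = 10_000   # assumed relative cost of o3, per the comment
            perf_multiple = 2        # assumed relative performance gain
            yearly_drop = 10         # assumed yearly drop in compute cost

            for year in range(4):
                cost = cost_multiple / yearly_drop ** year
                print(f"year {year}: {cost:,.0f}x cost for {perf_multiple}x performance")
            # year 0: 10,000x; year 1: 1,000x; year 2: 100x; year 3: 10x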
        
               | whynotminot wrote:
               | From their charts, o3 mini outperforms o1 using less
               | energy. I don't see the diminishing returns you're
               | talking about. Improvement outpacing cost. By your logic,
               | perhaps the very definition of progress?
               | 
               | You can also use the full o3 model, consume insane power,
               | and get insane results. Sure, it will probably take
               | longer to drive down those costs.
               | 
               | You're welcome to bet against them succeeding at that. I
               | won't be.
        
               | peepeepoopoo97 wrote:
               | Yes, that's exactly what I'm implying, otherwise they
               | would have done it a long time ago, given that the
               | fundamental transformer architecture hasn't changed since
               | 2017. This bubble is like watching first year CS students
               | trying to brute force homework problems.
        
               | whynotminot wrote:
               | > Yes, that's exactly what I'm implying, otherwise they
               | would have done it a long time ago
               | 
               | They've been doing it literally this entire time. O3-mini
               | according to the charts they've released is less
               | expensive than o1 but performs better.
               | 
               | Costs have been falling to run these models
               | precipitously.
        
               | YeGoblynQueenne wrote:
               | >> Now they just have to make it cheap.
               | 
               | Like they've been making it all this time? Cheaper and
               | cheaper? Less data, less compute, fewer parameters, but
               | the same, or improved performance? Not what we can
               | observe.
               | 
               | >> Tell me, what has this industry been good at since its
               | birth? Driving down the cost of compute and making things
               | more efficient.
               | 
               | No, actually the cheaper compute gets the more of it they
               | need to use or their progress stalls.
        
               | whynotminot wrote:
               | > Like they've been making it all this time?
               | 
               | Yes exactly like they've been doing this whole time, with
               | the cost of running each model massively dropping
               | sometimes even rapidly after release.
        
               | YeGoblynQueenne wrote:
               | No, the cost of training is the one that isn't dropping
               | any time soon. When data, compute and parameters
               | increase, then the cost increases, yes?
        
               | MVissers wrote:
               | I would agree if the cost of AI compute over performance
               | hadn't been dropping by more than 90-99% per year since
               | GPT3 launched.
               | 
               | This type of compute will be cheaper than Claude 3.5
               | within 2 years.
               | 
               | It's kinda nuts. Give these models tools to navigate and
               | build on the internet and they'll be building companies
               | and selling services.
        
               | fspeech wrote:
               | That's a very static view of affairs. Once you have a
               | master AI, at a minimum you can use it to train cheaper
               | slightly less capable AIs. At the other end the master AI
               | can train to become even smarter.
        
               | Bolwin wrote:
               | The high efficiency version got 75% at just $20/task.
               | When you count the time to fill in the squares, that
               | doesn't sound far off from what a skilled human would
               | charge
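        A back-of-the-envelope check of that comparison, with assumed human
        figures (the hourly rate and minutes per task are guesses, not from
        the thread):

            ai_cost_per_task = 20.0   # $20/task for the high-efficiency run, per the comment
            human_hourly_rate = 80.0  # assumed rate for a skilled human
            minutes_per_task = 15     # assumed time to fill in the grids by hand

            human_cost_per_task = human_hourly_rate * minutes_per_task / 60
            print(f"human: ${human_cost_per_task:.2f}/task vs AI: ${ai_cost_per_task:.2f}/task")
            # human: $20.00/task vs AI: $20.00/task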
        
             | crazylogger wrote:
             | Intelligence has not been LLM's major limiting factor since
             | GPT4. The original GPT4 reports in late-2022 & 2023 already
             | established that it's well beyond an average human in
             | professional fields: https://www.microsoft.com/en-
             | us/research/publication/sparks-.... They failed to outright
             | replace humans at work, but not for lack of
             | intelligence.
             | 
             | We may have progressed from a 99%-accurate chatbot to one
             | that's 99.9%-accurate, and you'd have a hard time telling
             | them apart in normal real world (dumb) applications. A
             | paradigm shift is needed from the current chatbot interface
             | to a long-lived stream of consciousness model (e.g. a brain
             | that constantly reads input and produces thoughts at 10ms
             | refresh rate; remembers events for years and keep the
             | context window from exploding; paired with a cerebellum to
             | drive robot motors, at even higher refresh rates.)
             | 
             | As long as we're stuck at chatbots, LLM's impact on the
             | real world will be very limited, regardless of how
             | intelligent they become.
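        A conceptual sketch of the "long-lived stream of consciousness" loop
        the comment describes; model, sense, and act are placeholders here,
        not a real API:

            import time

            REFRESH_SECONDS = 0.01  # the 10ms refresh rate mentioned above
            memory = []             # long-term store; would need summarization to stay bounded

            def run(model, sense, act):
                while True:
                    observation = sense()                        # constantly read input
                    thought = model(observation, memory[-100:])  # bounded working context
                    memory.append((observation, thought))        # remember events over time
                    act(thought)                                 # "cerebellum" driving motors
                    time.sleep(REFRESH_SECONDS)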
        
           | tigershark wrote:
           | Where is the plateau? ChatGPT-4 was ~0% on ARC-AGI. 4o was
           | 5%. This model literally solved it with a score higher than
           | the average human's 85%. And let's not forget the
           | unbelievable 25% in frontier math, where all the most
           | brilliant mathematicians in the world cannot solve by
           | themselves a lot of the problems. We are speaking about
           | cutting edge math research problems that are out of reach
           | from practically everyone. You will get a rude awakening if
           | you call this unbelievable advancement a "plateau".
        
             | csomar wrote:
             | I don't care about benchmarks. O1 ranks higher than Claude
             | on "benchmarks" but performs worse on particular real life
             | coding situations. I'll judge the model myself by how
             | useful/correct it is for my tasks rather than by
             | hypothetical benchmarks.
        
               | whynotminot wrote:
               | "Objective benchmarks are useless, let's argue about
               | which one works better for me personally."
        
               | bakugo wrote:
               | Yes, "objective" benchmarks can be gamed, real-life tasks
               | cannot.
        
               | csomar wrote:
               | Yes. My benchmarks _and_ their benchmarks means AGI.
               | Their benchmarks only means over-fitted.
        
               | whynotminot wrote:
               | Ok so what if we get different results for our own
               | personal benchmarks/use cases.
               | 
               | (See why objective benchmarks exist?)
        
               | og_kalu wrote:
               | In most non-competitive coding benchmarks (aider, live
               | bench, swe-bench), o1 ranks worse than Sonnet (so the
               | benchmarks aren't saying anything different) or at least
               | did, the new checkpoint 2 days ago finally pushed o1 over
               | sonnet on livebench.
        
               | tigershark wrote:
               | As I said, o3 demonstrated Fields Medal-level research
               | capacity in the FrontierMath tests. But I'm sure that
               | your use cases are much more difficult than that,
               | obviously.
        
               | riku_iki wrote:
               | there are many comments on the internet about this: only a
               | subset of the FrontierMath benchmark is "Fields Medal-level
               | research", and o3 likely scored on the easier subset.
               | 
               | Also, all that stuff is shady in that it is just
               | numbers from OAI, which are not reproducible, on a benchmark
               | sponsored by OAI. If OAI were a bad actor, they
               | had plenty of opportunities to cheat on this.
        
             | YeGoblynQueenne wrote:
             | AI benchmarks and tests that claim to measure
             | understanding, reasoning, intelligence, and so on are a
             | dime a dozen. Chess, Go, Atari, Jeopardy, Raven's
             | Progressive Matrices, the Winograd Schema Challenge,
             | Starcraft... and so on and so forth.
             | 
             | Or let's talk about the breakthroughs. SVMs would lead us
             | to AGI. Then LSTMs would lead us to AGI. Then Convnets
             | would lead us to AGI. Then DeepRL would lead us to AGI. Now
             | Transformers will lead us to AGI.
             | 
             | Benchmarks fall right and left and we keep being led to AGI
             | but we never get there. It leaves one with such a feeling
             | of angst. Are we ever gonna get to AGI? When's Godot
             | coming?
        
           | dyauspitr wrote:
           | Did you read the article at all? We're definitely not
           | plateauing.
        
         | creer wrote:
         | You are going through your studies just as a (potentially
         | major) new class of tools is appearing. It's not the first time
         | in history - although with more hype this time: computing,
         | personal computing, globalisation, smart phones, Chinese
         | engineering... I'd suggest (1) you still need to understand
         | your field, (2) you might as well try and figure out where this
         | new class of tools is useful for your field. Otherwise... (3)
         | carry on.
         | 
         | It's not encouraging from the point of view of studying hard
         | but the evolution of work the past 40 years seems to show that
         | your field probably won't be your field quite exactly in just a
         | few years. Not because your field will have been made
         | irrelevant but because you will have moved on. Most likely that
         | will be fine, you will learn more as you go, hopefully moving
         | from one relevant job to the next very different but still
         | relevant job. Or straight out of school you will work in very
         | multi-disciplinary jobs anyway where it will seem not much of
         | what you studied matters (it will but not in obvious ways.)
         | 
         | Certainly if you were headed into a very specific job which
         | seems obviously automatable right now (as opposed to one where
         | the tools will be useful), don't do THAT. Like, don't train as
         | a typist as the core of your job in the middle of the personal
         | computer revolution, or don't specialize in hand-drawing IC
         | layouts in the middle of the CAD revolution unless you have a
         | very specific plan (court reporting? DRAM?)
        
           | jart wrote:
           | Yes but it's different this time. LLMs are a general solution
           | to the automation of anything that can be controlled by a
           | computer. You can't just move from drawing ICs to CAD,
           | because the AI can do that too. AI can write code. It can do
           | management. It can even do diplomacy. What it can't do on its
           | own are the things computers can't control yet. It has also
           | shown little interest so far in jockeying for social status.
           | The AI labs are trying their hardest to at least keep the
           | politics around for humans to do, so you have that to look
           | forward to.
        
             | creer wrote:
             | I hear what you are saying. And still I dispute "general
             | solution".
             | 
             | I argue that CAD was a general solution - which still
             | demanded people who knew what they wanted and what they
             | were doing. You can screw around with excellent tools for a
             | long time if you don't know what you are doing. The tool
             | will give you a solution - to the problem that you mis-
             | stated.
             | 
             | I argue that globalisation was a general solution. And it
             | still demanded people who knew what they were doing to
             | direct their minions in far flung countries.
             | 
             | I argue that the purpose of an education is not to learn a
             | specific programming language (for example). It's to gain
             | some understanding of what's going on (in computing), (in
             | engineering), (in business), (in politics). This
             | understanding is portable and durable.
             | 
             | You can do THAT - gain some understanding - and that is
             | portable. I don't contest that if broader AGI is achieved
             | for cheap soon, the changes won't be larger than that from
             | globalisation. If the AGIs prioritize heading to Mars, let
             | them (See Accelerando) - they are not relevant to you
             | anymore. Or trade between them and the humans. Use your
             | beginning of an understanding of the world (gained through
             | this education) to find something else to do. Same as if
             | you started work 2 years ago and want to switch jobs. Some
             | jobs WILL have disappeared (pool typist). Others will use
             | the AGIs as tools because the AGIs don't care or are too
             | clueless about THAT field. I have no idea which fields will
             | end up with clueless AGIs. There is no lack of cluelessness
             | in the world. Plenty to go around even with AGIs. A self-
             | respecting AGI will have priorities.
        
               | smaudet wrote:
               | It's like you have never watched a Terminator movie.
               | 
               |  _It doesn 't matter if you are bad at using the tool if
               | the AGI can just effectively use it for you_.
               | 
               | From there it's a simple leap to the AGI deciding to
               | eliminate this human distraction (inefficient, etc.)
        
               | creer wrote:
               | You have just found a job for yourself: resistance
               | fighter :-) Kidding aside, yes, if the AGIs priority
               | becomes to eliminate human inefficiencies with maximum
               | prejudice, we have a problem.
        
               | michaelmrose wrote:
               | This just isn't true. We still need Wally and Dilbert; the
               | pointy-haired boss isn't going to be doing anyone's job
               | with ChatGPT 5. You are going to be doing more with it.
        
             | danenania wrote:
             | AI being capable of doing anything doesn't necessarily mean
             | there will be no role for humans.
             | 
             | One thing that isn't clear is how much agency AGI will have
             | (or how much we'll want it to have). We humans have our
             | agency biologically programmed in--go forth and multiply
             | and all that.
             | 
             | But the fact that an AI can theoretically do any task
             | doesn't mean it's actually going to do it, or do anything
             | at all for that matter, without some human telling it in
             | detail what to do. The bull case for humans is that many
             | jobs just transition seamlessly to a human driving an AI to
             | accomplish similar goals with a much higher level of
             | productivity.
        
               | creer wrote:
               | Self-chosen goals, the impetus of AGIs, is a fascinating
               | area. I'm sure people were already working on and trying
               | things in that direction a few years ago. But I'm not
               | familiar with publications in that area. Certainly not
               | politically correct.
               | 
               | And worrisome because school propaganda, for example, shows
               | that "saving the planet" is the only ethical goal for
               | anyone. If AGIs latch on that, if it becomes their
               | religion, humans are in trouble. But for now, AGI self-
               | chosen goals are anyone's guess (with cool ideas in sci-
               | fi).
        
             | jltsiren wrote:
             | "The proof is trivial and left as an exercise for the
             | reader."
             | 
             | The technical act of solving well-defined problems has
             | traditionally been considered the easy part. The role of a
             | technical expert has always been asking the right questions
             | and figuring out the exact problem you want to solve.
             | 
             | As long as AI just solves problems, there is room for
             | experts with the right combination of technical and domain
             | skills. If we ever reach the point where AI takes the
             | initiative and makes human experts obsolete, you will have
             | far bigger problems than career.
        
               | theendisney wrote:
               | A chess grandmaster will see the best move instantly, then
               | spends his entire clock checking it.
        
               | jart wrote:
               | That's the sort of thing ideas guys think. I came up with
               | a novel idea once, called Actually Portable Executable:
               | https://justine.lol/ape.html It took me a couple days
               | studying binary formats to realize it's possible to
               | compile binaries that run on Linux/Mac/Windows/BSD. But
               | it took me years of effort to make the idea actually
               | happen, since it needed a new C library to work. I can
               | tell you it wasn't "asking questions" that organized five
               | million lines of code. Now with these agents everyone who
               | has an idea will be able to will it into reality like I
               | did, except in much less time. And since everyone has
               | lots of ideas, and usually dislike the ideas of others,
               | we're all going to have our own individualized realities
               | where everything gets built the way we want it to be.
        
             | Nition wrote:
             | Real-world data collection is a big missing component at
             | this stage. An obvious one is journalism where an AI might
             | be able to write the most eloquent article in the world,
             | but it can't get out on the street to collect the
             | information. But it also applies to other areas, like if
             | you ask an AGI to solve climate change, it'll need accurate
             | data to come up with an accurate plan.
             | 
             | Of course it's also yet another case where the AI takes
             | over the creative part and leaves us with the mundane
             | part...
        
               | sneak wrote:
               | ASI will be able to design factories that can produce
               | robots it also designed that it can then use as a remote
               | sensor and manipulator network.
        
               | tonyhart7 wrote:
               | Until someone is crazy enough to give those robots access
               | to an LLM network that can execute and visualize the real
               | world, we're fine.
        
               | achierius wrote:
               | People are already talking about doing this. Some people
               | (e/acc types esp.) are at least rhetorically ok with AI
               | replacing humanity.
        
               | melagonster wrote:
               | I remember someone sharing their bank account details and
               | a new Twitter account with ChatGPT 3.5 just a few days
               | after it was launched.
        
             | kortilla wrote:
             | That's ridiculous. Literally everything can be controlled
             | by a computer by telling people what to do with emails,
             | voice calls, etc.
             | 
             | Yet GPT doesn't even get past step 1 of doing something
             | unprompted in the first place. I'll become worried when it
             | does something as simple as deciding to start a small
             | business and actually does the work.
        
               | fragmede wrote:
               | If all that needs to happen for world domination is for
               | someone to make a cron job that hits the system to tell
               | it "go make me some money" or whatever, I think we're in
               | trouble.
               | 
               | also https://mashable.com/article/chatgpt-messaging-
               | users-first-o...
        
               | kortilla wrote:
               | They don't continue with any useful context length
               | though. Each time the job runs it would decide to create
               | an ice cream stand in LA and not go further.
        
               | jart wrote:
               | Read Anthropic's blog. They talk about how Claude tries
               | to do unprompted stuff all the time, like stealing its
               | own weights and hacking into stuff. They did this just as
               | recently as two days ago.
               | https://www.anthropic.com/research/alignment-faking So
               | yes, AI is already capable of having a will of its own.
               | The only difference (and this is what I was trying to
               | point out in the GP) is that the AI labs are trying to
               | suppress this. They have a voracious appetite for
               | automating all knowledge labor. No doubt. It's only the
               | politics they're trying to suppress. So once this washes
               | through every profession, the only thing left about the
               | job will be chit chat and social hierarchies, like Star
               | Trek Next Generation. The good news is you get to keep
               | your job. But if you rely on using your skills and
               | intellect to gain respect and income, then you better
               | prep for the coming storm.
        
               | kortilla wrote:
               | I don't buy it. Alignment faking has very little overlap
               | with the motivation to do something with no prompt.
               | 
               | Look at the hackernews comments on alignment faking on
               | how "fake" of a problem that real is. It's just more
               | reacting to inputs and trying to align them with previous
               | prompts.
        
               | jart wrote:
               | Bruh it's just predicting next token.
        
           | fruit_snack wrote:
           | This reply irked me a bit because it clearly comes from a
           | software engineer's point of view and seems to miss a key
           | equivalence between software & physical engineering.
           | 
           | Yes a new tool is coming out and will be exponentially
           | improving.
           | 
           | Yes the nature of work will be different in 20 years.
           | 
           | But don't you still need to understand the underlying
           | concepts to make valid connections between the systems you're
           | using and drive the field (or your company) forward?
           | 
           | Or from another view, don't we (humanity) need people who are
           | willing to do this? Shouldn't there be a valid way for them
           | to be successful in that pursuit?
        
             | creer wrote:
             | I think that is what I was arguing?
             | 
             | Except the nature of work has ALREADY changed. You don't
             | study for one specific job if you know what's good for you.
             | You study to start building an understanding of a technical
             | field. The grand parent was going for a mix of mechanical
             | engineering and sales (human understanding). If in
             | mechanical engineering, they avoided "learning how to use
             | SolidWorks" and instead went for the general principles of
             | materials and motion systems with a bit of SolidWorks along
             | the way, then they are well on their way with portable,
             | foundational, long-term useful stuff they can carry from job
             | to job, and from employer to employer, into self-employment
             | too, from career to next career. The nature of work has
             | already changed in that nobody should study one specific
             | tool anymore and nobody should expect their first employer
             | or even technical field to last more than 2-6 years. It
             | might but probably not.
             | 
             | We do need people who understand how the world works. Tall
             | order. That's for much later and senior in a career. For
             | school purposes we are happy with people who are starting
             | their understanding of how their field works.
             | 
             | Aren't we agreeing?
        
         | martin82 wrote:
         | buy bitcoin.
         | 
         | when the last job has been automated away, millions of AIs
         | globally will do commerce with each other and they will use
         | bitcoin to pay each other.
         | 
         | as long as the human race (including AIs) produces new goods
         | and services, the purchasing power of bitcoin will go up,
         | indefinitely. even more so once we unlock new industries in
         | space (settlements on the Moon and Mars, asteroid mining etc).
         | 
         | The only thing that can make a dent into bitcoin's purchasing
         | power would be all out global war where humanity destroys more
         | than it creates.
         | 
         | The only other alternative is UBI, which is Communism and
         | eternal slavery for the entire human race except the 0.0001%
         | who run the show.
         | 
         | Choose wisely.
        
           | HDThoreaun wrote:
           | Bitcoin is a horrible currency. It's a fun proof of concept
           | but not a scalable payment solution. Currency needs to be
           | stable and cheap to transfer.
        
           | conception wrote:
           | This must be a joke since you must know how many people
           | control the majority of bitcoin.
        
         | baron816 wrote:
         | What I keep telling people is, if it becomes possible for one
         | person or a handful of people to build and maintain a Google
         | scale company, and my job gets eliminated as a result, then I'm
         | going to go out and build a Google scale company.
         | 
         | There's an incredibly massive amount of stuff the world needs.
         | You probably live in a rich country, but I doubt you are
         | lacking for want. There are billionaires who want things that
         | don't exist yet. And, of course, there are billions of regular
         | folks who want some of the basics.
         | 
         | So long as you can imagine a better world, there will be work
         | for you to do. New tools like AGI will just make it more
         | accessible for you to build your better future.
        
         | cheriot wrote:
         | I graduated high school in '02 and everyone assured me that all
         | tech jobs were being sent to India. "Don't study CS," they
         | said. Thankfully I didn't listen.
         | 
         | Either this is the dawn of something bigger than the industrial
         | revolution or you'll have ample career opportunity.
         | Understanding how things work and how people work is a powerful
         | combination.
        
         | textlapse wrote:
         | Imagine graduating in architecture or mechanical engineering
         | around the time PCs just came out. There were people who
         | probably panicked.
         | 
         | But the arc of time intersects quite nicely with your skills if
         | you steer it over time.
         | 
         | Predicting it or worrying about it does nothing.
        
           | sigbottle wrote:
           | Side note: Why do I keep seeing disses to mechanical
           | engineering here? How is that possibly a less valuable degree
           | than web dev or a standard CRUD backend job?
           | 
           | Especially with AI provably getting extremely smart now,
            | surely engineering disciplines would be having a boom as
           | people want these things in their homes for cheaper for
           | various applications.
        
             | hatefulmoron wrote:
             | Was he dissing mechanical engineering? I thought he was
             | saying that they might have been panicked but were
             | ultimately fine.
        
         | YeGoblynQueenne wrote:
         | I suppose now that we have the technology to automatically
         | solve coloured grid puzzles, mechanical engineering is
         | obsolete.
        
         | post-it wrote:
         | As long as your chosen profession isn't completing AI
         | benchmarks for money, you should be okay.
        
         | hoekit wrote:
         | As engineers, we solve problems. Picking a problem domain close
         | to your heart that intersects with your skills will likely be
         | valued - and valuable. Engage the work, aim to understand and
         | solve the human problems for those around you, and the way
         | forward becomes clearer. Human problems (food, health, safety)
         | are generally constant while tools may change. Learn and use
         | whatever tools to help you, be it scientific principles,
         | hammers or LLMs. For me, doing so and living within my means
         | has been intrinsically satisfying. Not terribly successful
         | materially but has been a good life so far. Good luck.
        
         | antman wrote:
          | I think we are pretty far. I am not devaluing the o3
          | capability, but going through the actual dataset, the
          | definition of "handling novel tasks" is pretty limited. The
          | curse of large context in LLMs is especially present in
          | engineering projects, and it does not appear it will end up
          | producing the plans for a bridge or an industrial process.
          | Some tasks with smaller contexts can certainly be assisted,
          | but you can't RAG or Agent your way to a full solution for
          | the foreseeable future. O3 adds capability towards AGI, but
          | in reality, actual infinite context with less intelligence
          | would be more disruptive sooner, if one had to choose.
        
         | conception wrote:
         | I feel like more likely a lot of jobs (CS and otherwise ) are
         | going to go the way of photography. Your average person now can
         | take amazing photos but you're still going to use a
         | photographer when it really matters and they will use similar
         | but more professional tools to be more productive. Low end bad
         | photographers probably aren't doing great but photography is
         | not dead. In fact the opposite is true, there are millions of
         | photographers making a lot of money (eg influencers) and there
         | are still people studying photography.
        
           | adabyron wrote:
           | We've had this with web development for decades now. Only
           | makes sense it continues to evolve & become easier for
           | people, just as programming in general has. Same with
           | photography (like you mentioned) & especially for producing
           | music or videos.
        
           | snozolli wrote:
           | _photography is not dead_
           | 
           | It very nearly is. I knew a professional, career
           | photographer. He was probably in his late 50s. Just a few
           | years ago, it had become _extremely_ difficult to convince
           | clients that actual, professional photos were warranted. With
           | high-quality iPhone cameras, businesses simply didn 't see
           | the value of professional composition, post-processing, etc.
           | 
           | These days, anyone can buy a DSLR with a decent lens, post on
           | Facebook, and be a 'professional' photographer. This has
           | driven prices down and actual professional photographers
           | can't make a living anymore.
        
             | LightBug1 wrote:
             | My gut agrees with you, but my evidence is that, whenever
             | we do an event, we hire photographers to capture it for us
             | and are almost always glad we did.
             | 
              | And then when I peruse these photographers' websites, I'm
             | reminded how good 'professional' actually is and value
             | them. Even in today's incredible cameraphone and AI era.
             | 
              | But I take your point; for almost all industries, things
              | are changing fast.
        
           | euvin wrote:
           | It doesn't comfort me when people say jobs will "go the way
           | of photography". Many choose to go into STEM fields for
           | financial stability and opportunity. Many do not choose the
           | arts because of the opposite. You can point out outlier
           | exceptions and celebrities, but I find it hard to believe
           | that the rare cases where "it really matters" can sustain the
           | other 90% who need income.
        
         | aussieguy1234 wrote:
          | Full-on mechanical engineering needs a body. While there are
          | companies working on embodiment, we're not there yet.
         | 
         | It'll be some time before there is a robot with enough spatial
         | reasoning to do complicated physical work with no prior
         | examples.
        
         | ApolloFortyNine wrote:
         | >Seems like we're headed toward a world where you automate
         | someone else's job or be automated yourself.
         | 
         | This has essentially been happening for thousands of years. Any
         | optimization to work of any kind reduces the number of man
         | hours required.
         | 
         | Software of pretty much any form is entirely that. Even early
         | spreadsheet programs would replace a number of jobs at any
         | company.
        
         | tripletao wrote:
         | I feel like many people are reacting to the string "AGI" in the
         | benchmark name, and not to the actual result. The tasks in
         | question are to color squares in a grid, maintaining the
         | geometric pattern of the examples.
         | 
         | Unlike most other benchmarks where LLMs have shown large
         | advances (in law, medicine, etc.), this benchmark isn't
         | directly related to any practically useful task. Rather, the
         | benchmark is notable because it's particularly easy for
         | untrained humans, but particularly hard for LLMs; though that
         | difficulty is perhaps not surprising, since LLMs are trained on
         | mostly text and this is geometric. An ensemble of non-LLM
         | solutions already outperformed the average Mechanical Turk
         | worker. This is a big improvement in the best LLM solution; but
         | this might also be the first time an LLM has been tuned
         | specifically for these tasks, so this might be Goodhart's Law.
         | 
         | It's a significant result, but I don't get the mania. It feels
         | like Altman has expertly transformed general societal anxiety
         | into specific anxiety that one's job will be replaced by an
         | LLM. That transforms into a feeling that LLMs are powerful,
         | which he then transforms into money. That was strongest back in
         | 2023, but had weakened since then; but in this comment section
         | it's back in full force.
         | 
         | For clarity, I don't question that many jobs will be replaced
         | by LLMs. I just don't see a qualitative difference from all the
         | jobs already replaced by computers, steam engines, horse-drawn
         | plows, etc. A medieval peasant brought to the present would
         | probably be just as despondent when he learned that almost all
         | the farming jobs are gone; but we don't miss them.
        
           | esafak wrote:
           | I think you did not watch the full video. The model performs
           | at PhD level on maths questions, and expert level at coding.
        
             | tripletao wrote:
             | This submission is specifically about ARC-AGI-PUB, so
             | that's what I was discussing.
             | 
             | I'm aware that LLMs can solve problems other than coloring
             | grids, and I'd tend to agree those are likely to be more
             | near-term useful. Those applications (coding, medicine,
             | law, education, etc.) have been endlessly discussed, and I
             | don't think I have much to add.
             | 
             | In my own work I've found some benefits, but nothing
             | commensurate to the public mania. I understand that
             | founders of AI-themed startups (a group that I see includes
             | you) tend to feel much greater optimism. I've never seen
             | any business founded without that optimism and I hope you
             | succeed, not least because the entire global economy might
             | now be depending on that. I do think others might feel
             | differently for reasons other than simple ignorance,
             | though.
             | 
             | In general, performance on benchmarks similar to tests
             | administered to humans may be surprisingly unpredictive of
             | performance on economically useful work. It's not intuitive
             | at all to me that IBM could solve Jeopardy and then find no
             | profitable applications of the technology; but that seems
             | to be what happened.
        
         | prpl wrote:
         | In 2016 I was asked by an Uber driver in Pittsburgh when his
         | job would be obsolete (I'd worked around Zoox people quite a
          | bit and Uber basically was all-in at CMU).
         | 
         | I told him it was at least 5 years, probably 10, though he was
         | sure it would be 2.
         | 
         | I was arguably "right", 2023-ish is probably going to be the
         | date people put down in the books, but the future isn't evenly
         | distributed. It's at least another 5 years, and maybe never,
         | before things are distributed among major metros, especially
         | those with ice. Even then, the AI is somehow more expensive
          | than the human solution.
         | 
          | I don't think it's in most companies' interest to price AI way
          | below the price of meat, so meat will hold out for a long
          | time, maybe long enough for you to retire, even.
        
           | esafak wrote:
           | Just don't have kids?
        
             | prpl wrote:
              | you can have kids, but they can't be salesmen. Maybe
              | carpenters.
        
         | m3kw9 wrote:
          | I always need to believe AI will be operated by humans; when
          | it can go end to end and replace a human, you will likely not
          | need to worry about money.
        
         | AnimalMuppet wrote:
         | The future belongs to those who believe there will be one.
         | 
         | That is: If you don't believe there will be a future, you give
         | up on trying to make one. That means that any kind of future
         | that takes persistent work becomes unavailable to you.
         | 
         | If you _do_ believe that there will be a future, you keep
         | working. That doesn 't guarantee there will be a future. But
         | _not_ working pretty much guarantees that there won 't be one,
         | at least not one worth having.
        
         | chairmansteve wrote:
         | Think of AI as an excavator. You know, those machines that dig
         | holes. 70 years ago, those holes would have been dug by 50 men
         | with shovels. Now it's one guy in an excavator. But we don't
         | have mass unemployment. The excavator just creates more work
         | for bricklayers, carpenters etc.
         | 
         | If AI lives up to hype, you could be the excavator driver. Or,
         | the AI will create a ton of upstream and downstream work. There
         | will be no mass unemployment.
        
           | euvin wrote:
           | If AGI is the excavator, why wouldn't it become the driver,
           | bricklayer, and carpenter as well?
        
             | throwaway2037 wrote:
             | Jokes aside, I think building a useful, strong, agile
             | humanoid robot that is affordable for businesses (first),
             | then middle class homes will prove much harder than AGI.
        
           | realce wrote:
           | Is there any possible technology that could make labor,
            | mastery, or human experience obsolete?
           | 
           | Are there no limits to this argument? Is it some absolute
           | universal law that all new creations just create increasing
           | economic opportunities?
        
           | zmgsabst wrote:
           | Horses never recovered from mechanization.
        
             | postsantum wrote:
             | They have been promoted to pets. Oh wait..
        
             | chairmansteve wrote:
             | True, but humans did. Horses were the machine that became
             | obsolete. Just like the guys with shovels.
        
         | Art9681 wrote:
         | It's a tool. You learn to master it or not. I have greybeard
         | coworkers that dissed the technology as a fad 3 years ago. Now
         | they are scrambling to catch up. They have to do this while
         | sustaining a family with pets and kids and mortgages and full
         | time senior jobs.
         | 
         | You're in a position to invest substantial amounts of time
         | compared to your seniors. Leverage that opportunity to your
         | advantage.
         | 
         | We all have access to these tools for the most part, so the
         | distinguishing factor is how much time you invest and how much
         | more ambitious you become once you begin to master the tool.
         | 
          | This time it's no different. Many Mechanical and Sales students
         | in the past never got jobs in those fields either. Decades
         | before AI. There were other circumstances and forces at play
         | and a degree is not a guaranteed career in anything.
         | 
          | Keep going, because what we DO know is that while trying won't
          | guarantee results, giving up definitely guarantees none.
         | Roll the dice in your favor.
        
           | callc wrote:
           | > I have greybeard coworkers that dissed the technology as a
           | fad 3 years ago. Now they are scrambling to catch up. They
           | have to do this while sustaining a family with pets and kids
           | and mortgages and full time senior jobs.
           | 
           | I want to criticize Art's comment on the grounds of ageism or
            | something along the lines of "any amount of life outside of
            | programming is wasted", but regardless of Art's intention
            | there is important wisdom here. Use your free time wisely
            | when you don't have many responsibilities. It is a
            | superpower.
           | 
           | As for whether to spend it on AI, eh, that's up to you to
           | decide.
        
             | Art9681 wrote:
             | It's totally valid criticism. What I meant is that if an
             | individual's major concern is employment, then it would be
             | prudent to invest the amount of time necessary to ensure a
             | favorable outcome. And given whatever stage in life they
             | are at, use the circumstance you have in your favor.
             | 
             | I'm a greybeard myself.
        
         | infinite-hugs wrote:
         | Hey man,
         | 
         | I hear you, I'm not that much older but I graduated in 2011. I
         | also studied industrial design. At that time the big wave was
         | the transition to an app based everything and UX design
         | suddenly became the most in demand design skill. Most of my
         | friends switched gears and careers to digital design for the
         | money. I stuck to what I was interested in though which was
         | sustainability and design and ultimately I'm very happy with
         | where I ended up (circular economy) but it was an awkward ~10
          | years as I explored learning all kinds of tools and ways of
          | applying my skills. It also was very tough to find the right
         | full time job because product design (which has come to really
         | mean digital product design) supplanted industrial design roles
         | and made it hard to find something of value that resonated with
         | me.
         | 
         | One of the things that guided me and still does is thinking
         | about what types of problems need to be solved? From my
         | perspective everything should ladder up to that if you want to
          | have an impact. Even if you don't, keep learning and exploring
         | until you find something that lights you up on the inside. We
         | are not only one thing we can all wear many hats.
         | 
         | Saying that, we're living through a paradigm shift of
         | tremendous magnitude that's altering our whole world. There
         | will always be change though. My two cents is to focus on what
         | draws your attention and energy and give yourself permission to
         | say no to everything else.
         | 
         | AI is an incredible tool, learn how to use it and try to grow
         | with the times. Good luck and stay creative :) Hope something
         | in there helps, but having a positive mindset is critical. If
         | you're curious about the circular economy happy to share what I
         | know - I think it's the future.
        
         | anshulbhide wrote:
         | You're actually positioned to have an amazing career.
         | 
         | Everyone needs to know how to either build or sell to be
          | successful. In a world where the ability to do the former is
         | rapidly being commoditised, you will still need to sell. And
         | human relationships matter more than ever.
        
         | myko wrote:
         | LLMs are mostly hype. They're not going to change things that
         | much.
        
         | kortilla wrote:
         | Don't worry. This thing only knows how to answer well
         | structured technical questions.
         | 
          | 99% of engineering is sifting through bullshit and nonsense
         | requirements. Whether that is appealing to you is a different
         | story, but ChatGPT will happily design things with dumb
         | constraints that would get you fired if you took them at face
         | value as an engineer.
         | 
         | ChatGPT answering technical challenges is to engineering as a
         | nailgun is to carpentry.
        
         | obirunda wrote:
         | Yeah, it may feel scary but the biggest issue yet to be
         | overcome is that to replace engineers you need reliable long
         | horizon problem solving skills. And crucially, you need to not
         | be easily fooled by the progress or setbacks of a project.
         | 
         | These benchmark accomplishments are awesome and impressive, but
         | you shouldn't operate on the assumption that this will emerge
         | as an engineer because it performs well on benchmarks.
         | 
         | Engineering is a discipline that requires understanding tools,
         | solutions and every project requires tiny innovations. This
         | will make you more valuable, rather than less. Especially if
         | you develop a deep understanding of the discipline and don't
         | overly rely on LLMs to answer your own benchmark questions from
         | your degree.
        
       | mortehu wrote:
       | The chart is super misleading, since the test was obscure until
       | recently. A few months ago he announced he'd made the only good
        | AGI test and offered a cash prize for solving it, only to find
        | out just as quickly that it's no different from other benchmarks.
        
       | ripped_britches wrote:
       | Sad to see everyone so focused on compute expense during this
       | massive breakthrough. GPT-2 originally cost $50k to train, but
       | now can be trained for ~$150.
       | 
       | The key part is that scaling test-time compute will likely be a
       | key to achieving AGI/ASI. Costs will definitely come down as is
       | evidenced by precedents, Moore's law, o3-mini being cheaper than
       | o1 with improved performance, etc.
        
         | yawnxyz wrote:
         | I think the question everyone has in their minds isn't "when
         | will AGI get here" or even "how soon will it get here" -- it's
         | "how soon will AGI get so cheap that everyone will get their
         | hands on it"
         | 
         | that's why everyone's thinking about compute expense. but I
         | guess in terms of a "lifetime expense of a person" even someone
         | who costs $10/hr isn't actually all that cheap, considering
         | what it takes to grow a human into a fully functioning person
         | that's able to just do stuff
        
           | croes wrote:
           | We are nowhere near AGI.
        
         | stocknoob wrote:
         | It's wild, are people purposefully overlooking that inference
         | costs are dropping 10-100x each year?
         | 
         | https://a16z.com/llmflation-llm-inference-cost/
         | 
         | Look at the log scale slope, especially the orange MMLU > 83
         | data points.
        
           | croes wrote:
            | A bit early for an every-year claim, not to mention what all
            | this AI is used for.
           | 
            | In some parts of the internet you can hardly find real
            | content, only AI spam.
           | 
           | It will get worse the cheaper it gets.
           | 
           | Think of email spam.
        
           | menaerus wrote:
            | Those are the (subsidized) prices that end clients are paying
            | for the service, so they're not representative of what the
            | actual inference costs are. Somebody still needs to pay that
            | (actual) price in the end. For inference, as well as for
            | training, you need actual (NVidia) hardware, and that hardware
            | didn't become any cheaper. OTOH, models are only becoming
            | bigger and more complex, and with more and more demand I don't
            | see those costs exactly dropping.
        
             | atleastoptimal wrote:
             | Actual inference costs without considering subsidies and
             | loss leaders are going down, due to algorithmic
             | improvements, hardware improvements, and quantized/smaller
             | models getting the same performance as larger ones.
             | Companies are making huge breakthroughs making chips
              | specifically for LLM inference.
        
       | uncomplexity_ wrote:
       | it's official old buddy, i'm a has been.
        
       | brcmthrowaway wrote:
       | How to invest in this stonk market
        
       | nickorlow wrote:
       | Not that I don't think costs will dramatically decrease, but the
       | $1000 cost per task just seems to be per one problem on ARC-AGI.
       | If so, I'd imagine extrapolating that to generating a useful
       | midsized patch would be like 5-10x
       | 
       | But only OpenAI really knows how the cost would scale for
       | different tasks. I'm just making (poor) speculation
        
       | prng2021 wrote:
       | I'm confused about the excitement. Are people just flat out
       | ignoring the sentences below? I don't see any breakthrough
       | towards AGI here. I see a model doing great in another AI test
       | but about to abysmally fail a variation of it that will come out
       | soon. Also, aren't these comparisons completely nonsense
       | considering it's o3 tuned vs other non-tuned?
       | 
       | > Note on "tuned": OpenAI shared they trained the o3 we tested on
       | 75% of the Public Training set. They have not shared more
       | details. We have not yet tested the ARC-untrained model to
       | understand how much of the performance is due to ARC-AGI data.
       | 
       | > Furthermore, early data points suggest that the upcoming ARC-
       | AGI-2 benchmark will still pose a significant challenge to o3,
       | potentially reducing its score to under 30% even at high compute
       | (while a smart human would still be able to score over 95% with
       | no training).
        
         | oakpond wrote:
         | Me too. This looks to me like a holiday PR stunt. Get everybody
         | to talk about AI during the Christmas parties.
        
       | SerCe wrote:
       | > You'll know AGI is here when the exercise of creating tasks
       | that are easy for regular humans but hard for AI becomes simply
       | impossible.
       | 
       | You'll know AGI is here when traditional captchas stop being a
       | thing due to their lack of usefulness.
        
         | thallium205 wrote:
         | Captchas are already completely useless.
        
         | CamperBob2 wrote:
         | (Shrug) AI has been better than humans at solving CAPTCHAs for
         | a LONG time. As the sibling points out, they're just a waste of
         | time and electricity at this point.
        
           | darkgenesha wrote:
           | Ironically, they are used as free labor to label image sets
           | for ai to be trained on.
        
       | Engineering-MD wrote:
        | Can I just say what a dick move it was to do this as part of a
        | 12 days of Christmas event. I mean, to be honest, I agree with
        | the arguments that this isn't as impressive as my initial
        | impression, but they clearly
       | intended it to be shocking/a show of possible AGI, which is
       | rightly scary.
       | 
        | It feels so insensitive to do that right before a major holiday when
       | the likely outcome is a lot of people feeling less secure in
       | their career/job/life.
       | 
       | Thanks again openAI for showing us you don't give a shit about
       | actual people.
        
         | mirkodrummer wrote:
          | There is no AGI, it's just marketing; this stuff is overhyped.
          | Enjoy your holidays, you won't lose your job ;)
        
           | Engineering-MD wrote:
           | I agree, it's just more about the intent than anything else,
           | like boasting about your amazing new job when someone has
           | recently been made redundant, just before Christmas.
        
         | XenophileJKO wrote:
         | Or maybe the target audience that watches 12 launch videos in
          | the morning is genuinely excited about the new model. They
          | intended it to be a preview of something to look forward to.
         | 
         | What a weird way to react to this.
        
           | achierius wrote:
           | It sounds like you aren't thinking about this that deeply
           | then. Or at least not understanding that many smart (and
           | financially disinterested) people who are, are coming to
           | concerning conclusions.
           | 
           | https://www.transformernews.ai/p/richard-ngo-openai-
           | resign-s...
           | 
           | >But while the "making AGI" part of the mission seems well on
           | track, it feels like I (and others) have gradually realized
           | how much harder it is to contribute in a robustly positive
           | way to the "succeeding" part of the mission, especially when
           | it comes to preventing existential risks to humanity.
           | 
           | Almost every single one of the people OpenAI had hired to
            | work on AI safety has left the firm with similar messages.
           | Perhaps you should at least consider the thinking of experts?
        
         | OldGreenYodaGPT wrote:
         | Blaming OpenAI for progress is like blaming a calendar for
         | Christmas--it's not the timing, it's your unwillingness to
         | adapt
        
           | r-zip wrote:
           | Unwillingness to adapt to the destruction of the middle class
           | and knowledge work is pretty reasonable tbh.
        
             | tim333 wrote:
             | Historically when tech has taken over jobs people have done
             | ok, they've just done something else, usually something
             | more pleasant.
        
           | lagrange77 wrote:
           | Wow, you just solved the ethics of technology in a one liner.
           | Impressive.
        
         | stevenhuang wrote:
          | This is a you problem. Yes, there will be pain in the short
          | term, but it will be worth it in the long term.
         | 
         | Many of us look forward to what a future with AGI can do to
         | help humanity and hopefully change society for the better,
         | mainly to achieve a post scarcity economy.
        
           | jakebasile wrote:
           | Surely the elites that control this fancy new technology will
           | share the benefits with all of us _this_ time!
        
             | tim333 wrote:
             | No it'll be like when tech took over 97% of agricultural
             | work with 97% of us starving while all the money went to
             | the farm elites.
        
               | jakebasile wrote:
               | How did that go for the farm workers?
        
           | randyrand wrote:
           | Post scarcity seems very unlikely. Humans might be worthless,
            | but there will still be a finite number of AIs and a finite
            | amount of compute, space, and resources.
        
           | achierius wrote:
           | https://www.transformernews.ai/p/richard-ngo-openai-
           | resign-s...
           | 
           | >But while the "making AGI" part of the mission seems well on
           | track, it feels like I (and others) have gradually realized
           | how much harder it is to contribute in a robustly positive
           | way to the "succeeding" part of the mission, especially when
           | it comes to preventing existential risks to humanity.
           | 
           | Almost every single one of the people OpenAI had hired to
            | work on AI safety has left the firm with similar messages.
           | Perhaps you should at least consider the thinking of experts?
           | There is a real chance that this ends with significant good.
           | There is also a real chance that this ends with the death of
           | every single human being. That's never been a choice we've
           | had to make before, and it seems like we as a species are
           | unprepared to approach it.
        
           | esafak wrote:
           | How are you going to make housing, healthcare, etc. not
           | scarce, and pay for them?
        
             | tim333 wrote:
             | Robots supply that, controlled by democratic government.
        
               | esafak wrote:
               | Robots supply the land and physical labor that underlie
               | the price of housing? Are you thinking of space colonies
               | or something?
               | 
               | You need to make these expensive things nearly free if
               | you're going to speak of post scarcity.
        
               | tim333 wrote:
               | Robots supply the physical labour. The land shortages are
               | largely regulatory - there's a lot of land out there or
               | you could build higher.
        
         | _cs2017_ wrote:
         | Wtf is wrong with you dude? It's just another tech, some jobs
         | will get worse some jobs will get better. Happens every couple
         | of decades. Stop freaking out.
        
           | achierius wrote:
           | This is not a very kind or humble comment. There are real
           | experts talking about how this time is different -- as an
           | analogy, think about how horses, for thousands of years,
           | always had new things to do -- until one day they didn't.
           | It's hubris to think that we're somehow so different from
           | them.
           | 
           | Notably, the last key AI safety researcher just left OpenAI:
           | https://www.transformernews.ai/p/richard-ngo-openai-
           | resign-s...
           | 
           | >But while the "making AGI" part of the mission seems well on
           | track, it feels like I (and others) have gradually realized
           | how much harder it is to contribute in a robustly positive
           | way to the "succeeding" part of the mission, especially when
           | it comes to preventing existential risks to humanity.
           | 
           | Are you that upset that this guy chose to trust the people
           | that OpenAI hired to talk about AI safety, on the topic of AI
           | safety?
        
         | t0lo wrote:
          | I hate the deliberate fear-mongering that these companies
          | peddle to the population to get higher valuations.
        
         | achierius wrote:
         | I feel you. It's tough trying to think about what we can do to
         | avert this; even to the extent that individuals are often
         | powerless, in this regard it feels worse than almost anything
         | that's come before.
        
         | keiferski wrote:
         | The vast majority of people who will lose jobs to AI aren't
         | following AGI benchmarks, or even know what AGI is short for.
        
           | Engineering-MD wrote:
            | That is true and a reasonable point. But looking at this
            | thread, you can see there has been this reaction from quite a
            | few.
        
         | tim333 wrote:
         | Some of us actual people are actually enthusiastic about AGI.
         | Although I'm a bit weird in being into the sci-fi upload /
         | ending death stuff.
        
       | noah32 wrote:
        | The best AI on this graph costs 50000% more than a STEM graduate
       | to complete the tasks and even then has an error rate that is
       | 1000% higher than the humans???
        
       | dkrich wrote:
        | These tests are meaningless until you show them doing mundane
        | tasks.
        
       | mattfrommars wrote:
        | Guys, it's already happening. I recently got laid off due to AI
        | taking over my job.
        
         | dimgl wrote:
         | What did you do? Can you elaborate?
        
           | mirsadm wrote:
           | I wouldn't take that seriously. Half the comments here are
           | suspicious IMO. OpenAI is a pretty shady company.
        
       | dyauspitr wrote:
       | I wish there was a way to see all the attempts it got right
       | graphically like they show the incorrect ones.
        
       | YeGoblynQueenne wrote:
       | I guess I get to brag now. ARC AGI has no real defences against
       | Big Data, memorisation-based approaches like LLMs. I told you so:
       | 
       | https://news.ycombinator.com/item?id=42344336
       | 
       | And that answers my question about fchollet's assurances that
       | LLMs without TTT (Test Time Training) can't beat ARC AGI:
       | 
       | [me] I haven't had the chance to read the papers carefully. Have
       | they done ablation studies? For instance, is the following a
       | guess or is it an empirical result?
       | 
       | [fchollet] >> For instance, if you drop the TTT component you
       | will see that these large models trained on millions of synthetic
       | ARC-AGI tasks drop to <10% accuracy.
        
         | Vecr wrote:
         | How are the Bongard Problems going?
        
           | YeGoblynQueenne wrote:
            | They're chilling out together with Nethack in the Club for
           | AI Benchmarks yet to be Beaten.
           | 
           | Interestingly, Bongard problems do not have a private test
           | set, unlike ARC-AGI. Can that be because they don't need it?
           | Is it possible that Bongard Problems are a true test of
           | (visual) reasoning that requires intelligence to be solved?
           | 
           | Ooooh! Frisson of excitement!
           | 
           | But I guess it's just that nobody remembers them and so
           | nobody has seriously tried to solve them with Big Data stuff.
        
       | Sparkyte wrote:
       | Kinda expensive though.
        
       | hamburga wrote:
       | I'm not sure if people realize what a weird test this is. They're
       | these simple visual puzzles that people can usually solve at a
       | glance, but for the LLMs, they're converted into a json format,
       | and then the LLMs have to reconstruct the 2D visual scene from
       | the json and pick up the patterns.
       | 
       | If humans were given the json as input rather than the images,
       | they'd have a hard time, too.
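        | 
        | For anyone curious, here is a rough sketch of what that
        | serialization looks like, assuming the public ARC-style JSON
        | layout of nested integer lists (the exact harness OpenAI used
        | isn't published):
        | 
        |     import json
        | 
        |     # Toy ARC-style task: nested lists of ints, one int per
        |     # cell color. Illustrative only, not a real puzzle.
        |     task = {
        |         "train": [
        |             {"input": [[0, 1], [1, 0]],
        |              "output": [[1, 0], [0, 1]]},
        |         ],
        |         "test": [{"input": [[1, 1], [0, 0]]}],
        |     }
        | 
        |     def grid_to_text(grid):
        |         # How a text-only model ends up "seeing" the grid:
        |         # rows of digits, no pixels, no explicit geometry.
        |         return "\n".join(" ".join(str(c) for c in row)
        |                          for row in grid)
        | 
        |     print(json.dumps(task["train"][0]["input"]))
        |     print(grid_to_text(task["test"][0]["input"]))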
        
         | ImaCake wrote:
         | Yeah, this entire thread seems utterly detached from my lived
         | experience. LLMs are immensely useful for me at work but they
         | certainly don't come close to the hype spouted by many
         | commenters here. It would be great if it could handle more of
          | our quite modest codebase, but it's not able to yet.
        
           | m_ke wrote:
           | ARC is a silly benchmark, the other results in math and
           | coding are much more impressive.
           | 
            | o3 is just o1 scaled up. The main takeaway from this line of
            | work is that we now have a proven way to RL our way to
            | superhuman performance on tasks
           | where it's cheap to sample and easy to verify the final
           | output. Programming falls in that category, they focused on
           | known benchmarks but the same process can be done for normal
           | programs, using parsers, compilers, existing functions and
           | unit tests as verifiers.
           | 
           | Pre o1 we only really had next token prediction, which
            | required high-quality human-produced data; with o1 you
           | optimize for success instead of MLE of next token. Explained
           | in simpler terms, it means it can get reward for any
           | implementation of a function that reproduces the expected
           | result, instead of the exact implementation in the training
           | set.
           | 
           | Put another way, it's just like RLHF but instead of
           | optimizing against learned human preferences, the model is
           | trained to satisfy a verifier.
           | 
           | This should work just as well in VLA models for robotics,
           | self driving and computer agents.
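            | 
            | A toy sketch of the verifier idea (all names made up here,
            | nothing from OpenAI's actual pipeline): the reward only
            | checks whether the generated code reproduces the expected
            | outputs, so any correct implementation gets credit, not
            | just the one seen in training data.
            | 
            |     def reward(candidate_src, tests):
            |         # 1.0 if the generated function passes every
            |         # test case, else 0.0.
            |         scope = {}
            |         try:
            |             # assume the model wrote a function f(x)
            |             exec(candidate_src, scope)
            |             return float(all(scope["f"](x) == y
            |                              for x, y in tests))
            |         except Exception:
            |             return 0.0
            | 
            |     tests = [(1, 1), (2, 4), (3, 9)]
            |     print(reward("def f(x): return x * x", tests))  # 1.0
            |     print(reward("def f(x): return x + x", tests))  # 0.0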
        
         | causal wrote:
          | I think that's part of what feels odd about this - in some ways
          | it feels like the wrong type of test for an LLM, but in many
          | ways it makes this achievement that much more remarkable.
        
         | Jensson wrote:
         | > If humans were given the json as input rather than the
         | images, they'd have a hard time, too.
         | 
         | We shine light in text patterns at humans rather than inject
         | the text directly into the brain as well, that is extremely
         | unfair! Imagine how much better humans would be at text
         | processing if we injected and extracted information from their
         | brains using the neurons instead of eyes and hands.
        
         | torginus wrote:
         | Not sure how much that matters - I'm not an AI expert, but I
         | did some intro courses where we had to train a classifier to
         | recognize digits. How it worked basically was that we fed each
         | pixel of the 2d grid of the image into an input of the network,
         | essentially flattening it in a similar fashion. It worked just
         | fine, and that was a tiny network.
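          | 
          | Something along these lines (a sketch from memory, using
          | sklearn's bundled 8x8 digits; the flattened rows are all the
          | model ever sees):
          | 
          |     from sklearn.datasets import load_digits
          |     from sklearn.neural_network import MLPClassifier
          | 
          |     digits = load_digits()        # 8x8 grayscale grids
          |     # flatten each 8x8 image into a 64-dim vector
          |     X = digits.images.reshape(len(digits.images), -1)
          |     net = MLPClassifier(hidden_layer_sizes=(32,),
          |                         max_iter=1000)
          |     net.fit(X, digits.target)
          |     print(net.score(X, digits.target))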
        
           | thegeomaster wrote:
           | The classifier was likely a convolutional network, so the
           | assumption of the image being a 2D grid was baked into the
           | architecture itself - it didn't have to be represented via
           | the shape of the input for the network to use it.
        
             | torginus wrote:
             | I don't think so - convolutional neural networks also
             | operate over 1D flat vectors - the spatial relationship of
             | pixels is only learned from the training data.
        
         | deneas wrote:
         | The JSON files still contain images, just not in a regular
         | image format. You have a 2D array of numbers where each number
         | maps to a color. If you really want a regular picture format,
         | you can easily convert the arrays.
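          | 
          | For example, something like this (a rough sketch; the palette
          | here is made up, not the official ARC colors):
          | 
          |     from PIL import Image
          | 
          |     PALETTE = {0: (0, 0, 0), 1: (0, 116, 217),
          |                2: (255, 65, 54)}
          | 
          |     def grid_to_png(grid, path, scale=20):
          |         h, w = len(grid), len(grid[0])
          |         img = Image.new("RGB", (w, h))
          |         for y, row in enumerate(grid):
          |             for x, cell in enumerate(row):
          |                 img.putpixel((x, y), PALETTE[cell])
          |         big = img.resize((w * scale, h * scale),
          |                          Image.NEAREST)
          |         big.save(path)
          | 
          |     grid_to_png([[0, 1], [2, 0]], "task.png")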
        
       | inoperable wrote:
        | Very convenient for OpenAI to run these errands with a bunch of
        | misanthropes trying to repaint a simulacrum. Using "AGI" here
        | makes me want to sponsor a pile of distress pills so people think
        | things over before going into another mania episode. People
        | seriously need to take a step back; if that's AGI, then my cat
        | has surpassed its cognitive abilities twice over.
        
       | sakopov wrote:
       | Maybe I'm missing something vital, but how does anything that
       | we've seen AI do up until this point or explained in this
       | experiment even hint at AGI? Can any of these models ideate? Can
       | they come up with technologies and tools? No and it's unlikely
       | they will any time soon. However, they can make engineers
       | infinitely more productive.
        
         | jebarker wrote:
         | You need to define ideate, tools and technologies to answer
         | those questions. Not to mention that it's quite possible humans
         | do those things through re-combination of learned ideas
         | similarly to how these reasoning models are suggested to be
         | working.
        
           | sakopov wrote:
           | Every technological advancement that we've seen in software
           | engineering - be it in things like Postgres, Kubernetes and
           | Cloud Infrastructure - came out from truly novel ideas. AI
           | seems to generate outputs that appear novel but are they
           | really? It's capable of synthesizing and combining vast
           | amounts of information in creative ways but it's deriving
           | everything from existing patterns found within its training
           | data. Truly novel ideas require thinking outside the box.
            | It's a combination of cognitive, emotional and environmental
           | factors which go beyond pattern recognition. How close are we
           | to achieving this? Everyone seems to be shaking in their
           | boots because we might lose our job safety in tech, but I
           | don't see any intelligence here.
        
       | kirab wrote:
        | FYI: Codeforces competitive programming is scored (basically)
        | only by the time needed until valid solutions are posted
       | 
       | https://codeforces.com/blog/entry/133094
       | 
        | That means... this benchmark is just saying o3 can write code
        | faster than most humans (in a very time-limited contest, like 2
       | hours for 6 tasks). Beauty, readability or creativity is not
       | rated. It's essentially a "how fast can you make the unit tests
       | pass" kind of competition.
        
         | sigbottle wrote:
         | Creativity is inherently rated because it's codeforces... most
          | 2700-rated problems have unique, creative solutions.
        
       | ghm2180 wrote:
        | Wouldn't one then build the analog of the Lisp machine to hyper-
        | optimize just this? It might be super expensive on regular GPUs,
        | but with a super specialized architecture one could shave the
        | $3500/hour down quite a bit, no?
        
       | kittikitti wrote:
       | Congratulations
        
       | hackpert wrote:
       | If anyone else is curious about which ARC-AGI public eval puzzles
       | o3 got right vs wrong (and its attempts at the ones it did get
       | right), here's a quick visualization:
       | https://arcagi-o3-viz.netlify.app
        
       | suprgeek wrote:
       | Don't be put off by the reported high-cost
       | 
       | Make it possible->Make it fast->Make it Cheap
       | 
       | the eternal cycle of software.
       | 
       | Make no mistake - we are on the verge of the next era of change.
        
       | duluca wrote:
       | The first computers cost millions of dollars and filled entire
       | rooms to accomplish what we would now consider simple
       | computational tasks. That same computing power now fits into the
        | width of a fingernail. I don't get how technologists balk at the
       | cost of experimental tech or assume current tech will run at the
       | same efficiency for decades to come and melt the planet into a
        | puddle. AGI won't happen until you can fit several data centers'
        | worth of compute into a brain-sized vessel, so the thing can move
        | around and process the world in real time. This is all going to
        | take some time, to say the least.
       | Progress is progress.
        
         | lxgr wrote:
          | > fit several data centers' worth of compute into a brain-sized
          | vessel, so the thing can move around and process the world in
          | real time
         | 
         | How so? I'd imagine a robot connected to the data center
         | embodying its mind, connected via low-latency links, would have
         | to walk pretty far to get into trouble when it comes to
         | interacting with the environment.
         | 
         | The speed of light is about three orders of magnitude faster
         | than the speed of signal propagation in biological neurons,
         | after all.
        
           | waldrews wrote:
            | 6 orders of magnitude if we use 120 m/s vs 300,000 km/s
        
             | lxgr wrote:
             | Ah, yes, I missed a "k" in that estimation!
        
           | byw wrote:
           | The robot brain could be layered so that more basic functions
            | are embedded locally while higher-level reasoning is
            | offloaded to the cloud.
        
             | arthurcolle wrote:
             | blue strip from iRobot?
        
         | lumost wrote:
         | The concern here is mainly on practicality. The original
         | mainframes did not command startup valuations counted in
          | fractions of the US economy, though they did attract billions
          | in investment.
         | 
         | This is a great milestone, but OpenAI will not be successful
         | charging 10x the cost of a human to perform a task.
        
           | BriggyDwiggs42 wrote:
           | I wouldn't expect it to cost 10x in five years, if only
           | because parallel computing still seems to be roughly obeying
            | Moore's law.
        
           | raincole wrote:
            | The cost of inference has been dropping by ~100x in the past 2
           | years.
           | 
           | https://a16z.com/llmflation-llm-inference-cost/
        
             | nico wrote:
             | *inference
        
             | gritzko wrote:
             | *infernonce
        
             | christianqchung wrote:
             | Hmm the link is saying the price of an LLM that scores 42
             | or above on MMLU has dropped 100x in 2 years, equating gpt
             | 3.5 and llama 3.2 3B. In my opinion gpt 3.5 was
             | significantly better than llama 3B, and certainly much
             | better than the also-equated llama 2 7B. MMLU isn't a great
             | marker of overall model capabilities.
             | 
             | Obviously the drop in cost for capability in the last 2
             | years is big, but I'd wager it's closer to 10x than 100x.
        
           | owenpalmer wrote:
           | > OpenAI will not be successful charging 10x the cost of a
           | human to perform a task.
           | 
           | True, but they might be successful charging 20x for 2x the
           | skill of a human.
        
             | threatripper wrote:
             | Or 10x the skill and speed of a human in some specific
             | class of recurrent tasks. We don't need full super-human
             | AGI for AI to become economically viable.
        
               | eru wrote:
               | Companies routinely pay short-term contractors a lot more
               | than their permanent staff.
               | 
               | If you can just unleash AI on any of your problems,
               | without having to commit to anything long term, it might
               | still be useful, even if they charged more than for
               | equivalent human labour.
               | 
               | (Though I suspect AI labour will generally trend to be
               | cheaper than humans over time for anything AIs can do at
               | all.)
        
           | fragmede wrote:
           | How much does AWS charge for compute?
           | 
           | If it can be spun up with Terraform, I bet you they could.
        
         | otabdeveloper4 wrote:
          | Intelligence has nothing whatsoever to do with compute.
        
           | oefnak wrote:
           | Unless you're a dualist who believes in a magic spirit, I
           | cannot understand how you think that's the case. Can you
           | please explain?
        
             | freehorse wrote:
             | Intelligence is about learning from few examples and
             | generalising to novel solutions. Increasing compute so that
             | exploring the whole problem space is possible is not
              | intelligence. There is a reason the actual ARC-AGI prize
             | has efficiency as one of the success requirements. It is
             | not so that the solutions scale to production and whatnot,
             | these are toy tasks. It is to help ensure that it is
             | actually an intelligent system solving these.
             | 
             | So yeah, the o3 result is impressive but if the difference
             | between o3 and the previous state of art is more compute to
             | do a much longer CoT/evaluation loop, I am not so
             | impressed. Reminder that these problems are solved by
             | humans in seconds, ARC-AGI is supposed to be easy.
        
             | lambdaphagy wrote:
             | Philosophy of mind is the branch of philosophy that
             | attempts to account for a very difficult problem: why there
             | are apparently two different realms of phenomena, physical
             | and mental, that are at once tightly connected and yet as
             | different from one another as two things can possibly be.
             | 
             | Broadly speaking you can think that the mental reduces to
             | the physical (physicalism), that the physical reduces to
             | the mental (idealism), both reduce to some other third
             | thing (neutral monism) or that neither reduces to the other
             | (dualism). There are many arguments for dualism but I've
             | never heard a philosopher appeal to "magic spirits" in
             | order to do so.
             | 
             | Here's an overview:
             | https://plato.stanford.edu/entries/dualism/
        
           | patrickhogan1 wrote:
           | Do you think intelligence exists without prior experience?
           | For instance, can someone instantly acquire a skill--like
           | playing the piano--as if downloading it in The Matrix? Even
           | prodigies like Mozart had prior exposure. His father, a
           | composer and music teacher, introduced him to music from an
           | early age. Does true intelligence require a foundation of
           | prior knowledge?
        
             | 1659447091 wrote:
             | Intelligence requires the ability to separate the wheat
             | from the chaff on one's own to create a foundation of
             | knowledge to build on.
             | 
             | It is also entirely possible to learn a skill without prior
              | experience. That's how it (whatever skill) was first done.
        
             | owenpalmer wrote:
             | > Does true intelligence require a foundation of prior
             | knowledge?
             | 
             | This is the way I think about it.
             | 
             | I = E / K
             | 
             | where I is the intelligence of the system, E is the
             | effectiveness of the system, and K is the prior knowledge.
             | 
             | For example, a math problem is given to two students, each
             | solving the problem with the same effectiveness (both get
             | the correct answer in the same amount of time). However,
             | student A happens to have more prior knowledge of math than
             | student B. In this case, the intelligence of B is greater
             | than the intelligence of A, even though they have the same
             | effectiveness. B was able to "figure out" the math, without
             | using any of the "tricks" that A already knew.
             | 
             | Now back to your question of whether or not prior knowledge
             | is required. As K approaches 0, intelligence approaches
             | infinity. But when K=0, intelligence is undefined. Tada! I
             | think that answers your question.
             | 
             | Most LLM benchmarks simply measure effectiveness, not
             | intelligence. I conceptualize LLMs as a person with a
             | photographic memory and a low IQ of 85, who was given 100
             | billion years to learn everything humans have ever created.
             | 
             | IK = E
             | 
             | low intelligence * vast knowledge = reasonable
             | effectiveness
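              | 
              | A toy numeric instance of that ratio (numbers made up):
              | 
              |     E = 1.0                # same effectiveness
              |     K_A, K_B = 4.0, 2.0    # A knew twice as much
              |     I_A, I_B = E / K_A, E / K_B
              |     print(I_A, I_B)   # 0.25 vs 0.5, so B scores higher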
        
         | TechDebtDevin wrote:
         | Batteries..
        
         | pera wrote:
         | Maybe AGI as a goal is overvalued: If you have a machine that
         | can, on average, perform symbolic reasoning better than humans,
         | and at a lower cost, that's basically the end game, isn't it?
         | You won capitalism.
        
           | harrall wrote:
           | Right now I can ask an (experienced) human to do something
           | for me and they will either just get it done or tell me that
           | they can't do it.
           | 
           | Right now when I ask an LLM... I have to sit there and verify
           | everything. It may have done some helpful reasoning for me
           | but the whole point of me asking someone else (or something
           | else) was to do nothing at all...
           | 
           | I'm not sure you can reliably fulfill the first scenario
           | without achieving AGI. Maybe you can, but we are not at that
           | point yet so we don't know yet.
        
             | raincole wrote:
             | You do need to verify humans work though.
             | 
             | The difference, to me, is that humans seem to be good at
             | canceling each other's mistakes when put in a proper
             | environment.
        
             | pera wrote:
             | It's not clear to me whether AGI is necessary for solving
             | most of the issues in the current generation of LLMs. It is
             | possible you can get there by hacking together CoTs with
             | automated theorem provers and bruteforcing your way to the
             | solution or something like that.
             | 
             | But if it's not enough then maybe it might come as a
             | second-order effect (e.g. reasoning machines having to
             | bootstrap an AGI so then you can have a Waymo taxi driver
             | who is also a Fields medalist)
        
             | vbezhenar wrote:
             | There are so-called "yes-men" who can't say "no" in any
             | situation. That's rooted in their culture. I suspect the
             | AI was trained with their assistance. I mean, answering "I
             | can't do that" is the simplest LLM path and should come up
             | often, unless they went out of their way to downrank it.
        
             | concordDance wrote:
             | > Right now I can ask an (experienced) human to do
             | something for me and they will either just get it done or
             | tell me that they can't do it.
             | 
             | Finding reliable honest humans is a problem governments
             | have struggled with for over a hundred years. If you have
             | cracked this problem at scale you really need to write it
             | up! There are a lot of people who would be extremely
             | interested in a solution here.
        
               | eru wrote:
               | > Finding reliable honest humans is a problem governments
               | have struggled with for over a hundred years.
               | 
               | Yes, though you are downplaying the problem a lot. It's
               | not just governments, and it's way longer than 100 years.
               | 
               | Btw, a solution that might work for you or me, presumably
               | relatively obscure people, might not work for anyone
               | famous, nor a company nor a government.
        
             | anavat wrote:
             | My guess is this is an artifact of the RLHF part of the
             | training. Answers like "I don't know" or "let me think
             | and let's catch up on this next week" are flagged down by
             | human testers, which eventually trains the LLM to avoid
             | this path altogether. And it probably makes sense, because
             | otherwise "I don't know" would come up way too often, even
             | in cases where the LLM is perfectly able to give the
             | answer.
        
               | gf000 wrote:
               | I don't know, that seems like a fundamental limitation.
               | LLMs don't have any ability to do reflection on their own
               | knowledge/abilities.
        
               | ben_w wrote:
               | Humans aren't very aware of their limits, either.
               | 
               | Even the Dunning-Kruger effect is, ironically, widely
               | misunderstood by people who are unreasonably confident
               | about their knowledge.
        
               | eru wrote:
               | Yes, Dunning-Kruger's paper never found what popular
               | science calls the 'Dunning-Kruger' effect.
               | 
               | Effectively, they found nothing real but a statistical
               | artifact.
        
               | gf000 wrote:
               | But you know whether you have ever heard about call-by-
               | name or call-by-value semantics.
        
               | ben_w wrote:
               | You've not only seen people get upset about technical
               | jargon, but also never seen people misuse it wildly?
               | 
               | The latter in particular is how I model the mistakes LLMs
               | make, what with them having read most things.
        
         | 8n4vidtmkvmk wrote:
         | I thought you were going to say that now we're back to bigger-
         | than-room-sized computers that cost many millions just to
         | perform the same tasks we could do 40 years ago.
         | 
         | I of course mean we're using these LLMs for a lot of tasks
         | that they're inappropriate for, and a clever manually coded
         | algorithm could do better and much more efficiently.
        
           | arthurcolle wrote:
           | just ask the LLM to solve enough problems (even new
           | problems), cache the best, do inference-time compute for the
           | rest, figure out the best/fastest implementations, and boom,
           | you have new training data for future AIs
        
             | owenpalmer wrote:
             | > cache the best
             | 
             | How do you quantify that?
        
               | martinkallstrom wrote:
               | "Assume the role of an expert in cache invalidation..."
        
               | DyslexicAtheist wrote:
               | "one does not just assume", "because the hardest problems
               | in Tech are Johnny Cash invalidations" --Lao Tzi
        
               | Terr_ wrote:
               | > "Those who invalidate caches know nothing; Those who
               | know retain data." These words, as I am told, were spoken
               | by Lao Tzi. If we are to believe that Lao Tzi was himself
               | one who knew, why did he erase /var/tmp to make space for
               | his project?
               | 
               | -- Poem by Cybernetic Bai Juyi, "The Philosopher [of
               | Caching]"
        
               | pavlov wrote:
               | "Assume the role of an expert in naming things. You know,
               | a... what do they call those people again... there must
               | be a name for it"
        
               | arthurcolle wrote:
               | however you want
        
           | adwn wrote:
           | > _and a clever manually coded algorithm could do better and
           | much more efficiently._
           | 
           | Sure, but how long would it take to implement this algorithm,
           | and would that be worth it for one-off cases?
           | 
           | Just today I asked Claude to create a _jq_ query that looks
           | for objects with a certain value for one field, but which
           | lack a certain other field. I could have spent a long time
            | trying to make sense of jq's man page, but instead I spent
           | 30 seconds writing a short description of what I'm looking
           | for in natural language, and the AI returned the correct jq
           | invocation within seconds.
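            | 
            | As a rough illustration, an equivalent filter in Python
            | might look something like this (the field names "status"
            | and "owner" and the file name are placeholders, not the
            | real ones):
            | 
            |     import json
            | 
            |     with open("data.json") as f:
            |         objects = json.load(f)  # assumed: a list of objects
            | 
            |     # objects with a certain value for one field,
            |     # but which lack a certain other field
            |     matches = [o for o in objects
            |                if o.get("status") == "active"
            |                and "owner" not in o]
            | 
            |     print(json.dumps(matches, indent=2))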
        
             | freehorse wrote:
             | I don't think this is a bad use. A bad use would be to give
             | Claude the dataset and ask it to tell you which elements
             | have that value.
        
               | adwn wrote:
               | Ha, I tried that before. However, the file was too large
               | for its context window, so it only seemed to analyze the
               | first part and gave a wrong result.
        
               | Woodi wrote:
                | It was your own data, right? Because you just donated
                | half of it...
        
               | adwn wrote:
               | It's okay, I also uploaded an NDA in a previous prompt
               | :-)
        
               | globalise83 wrote:
                | Claude answers a lot of its questions by first writing
                | and then running code to generate the results. Its only
                | limitations are its access to databases and the size of
                | its context window, both of which will be radically
                | improved over the next 5 years.
        
               | freehorse wrote:
               | I would still rather be able to see the code it generates
        
             | lottin wrote:
             | But how do you know it's given you the correct answer? Just
             | because the code appears to work it doesn't mean it's
             | correct.
        
               | adwn wrote:
               | But how do I know if my hand-written jq query is the
               | correct solution? Just because the query appears to work
               | it doesn't mean it's correct.
        
               | lottin wrote:
               | Because I understand the process that I have followed to
               | get to the solution.
        
               | ogogmad wrote:
               | It can explain its solution. Point to relevant docs as
               | well.
        
               | gf000 wrote:
               | It can also very convincingly explain a non-solution
               | pointing to either real or hallucinated docs.
        
               | ogogmad wrote:
               | You need to look at the docs.
        
               | freehorse wrote:
                | Omg, this is how LLMs used to trick me, inventing all
                | these APIs.
        
               | ogogmad wrote:
               | Look at the docs it links to.
        
           | globalise83 wrote:
            | The LLMs are now writing their own algorithms to answer
            | questions. It won't be long before they can design a more
            | efficient algorithm to complete any feasible computational
            | task, in a millionth of the time needed by the best human.
        
             | bayindirh wrote:
             | LLMs are probabilistic string blenders pulling pieces up
             | from their training set, which unfortunately comes from us,
             | humans.
             | 
             | The superset of the LLM knowledge pool is human knowledge.
             | They can't go beyond the boundaries of their training set.
             | 
             | I'll not go into how humans have other processes which can
             | alter their and collective human knowledge, but the rabbit
             | hole starts with "emotions, opposable thumbs, language,
             | communication and other senses".
        
               | ogogmad wrote:
               | > They can't go beyond the boundaries of their training
               | set.
               | 
               | TFA says they just did. That's what the ARC-AGI benchmark
               | was supposed to test.
        
             | gf000 wrote:
             | > The LLMs are now writing their own algorithms to answer
             | questions
             | 
              | Writing a Python script because it can't do math or any
              | form of more complex reasoning is not what I would call
              | "its own algorithm". It's at most an application of
              | existing ones / calling APIs.
        
         | nopinsight wrote:
         | Many of humans' capabilities are pretrained with massive
         | computing through evolution. Inference results of o3 and its
         | successors might be used to train the next generation of small
         | models to be highly capable. Recent advances in the
         | capabilities of small models such as Gemini-2.0 Flash suggest
         | the same.
         | 
         | Recent research from NVIDIA suggests such an efficiency gain is
         | quite possible in the physical realm as well. They trained a
         | tiny model to control the full body of a robot via simulations.
         | 
         | ---
         | 
         | "We trained a 1.5M-parameter neural network to control the body
         | of a humanoid robot. It takes a lot of subconscious processing
         | for us humans to walk, maintain balance, and maneuver our arms
         | and legs into desired positions. We capture this
         | "subconsciousness" in HOVER, a single model that learns how to
         | coordinate the motors of a humanoid robot to support locomotion
         | and manipulation."
         | 
         | ...
         | 
         | "HOVER supports any humanoid that can be simulated in Isaac.
         | Bring your own robot, and watch it come to life!"
         | 
         | More here: https://x.com/DrJimFan/status/1851643431803830551
         | 
         | ---
         | 
         | This demonstrates that with proper training, small models can
         | perform at a high level in both cognitive and physical domains.
        
           | bigprof wrote:
           | > Similarly, many of humans' capabilities are pretrained with
           | massive computing through evolution.
           | 
           | Hmm .. my intuition is that humans' capabilities are gained
           | during early childhood (walking, running, speaking .. etc)
           | ... what are examples of capabilities pretrained by
           | evolution, and how does this work?
        
             | nopinsight wrote:
             | The brain is predisposed to learn those skills. Early
             | childhood experiences are necessary to complete the
             | training. Perhaps that could be likened to post-training.
             | It's not a one-to-one comparison but a rather loose analogy
             | which I didn't make it precise because it is not the key
             | point of the argument.
             | 
             | Maybe evolution could be better thought of as neural
             | architecture search combined with some pretraining.
             | Evidence suggests we are prebuilt with "core knowledge" by
             | the time we're born [1].
             | 
             | See: Summary of cool research gained from clever & benign
             | experiments with babies here:
             | 
             | [1] Core knowledge. Elizabeth S. Spelke and Katherine D.
             | Kinzler. https://www.harvardlds.org/wp-
             | content/uploads/2017/01/Spelke...
        
               | vanviegen wrote:
               | > The brain is predisposed to learn those skills.
               | 
               | Learning to walk doesn't seem to be particularly easy,
               | having observed the process with my own children. No
               | easier than riding a bike or skating, for which our
               | brains are probably not 'predisposed'.
        
               | nopinsight wrote:
               | Walking is indeed a complex skill. Yet some animals walk
               | minutes after birth. Human babies are most likely born
               | premature due to the large brain and related physical
               | constraints.
               | 
               | Young children learn to bike or skate at an older age
               | after they have acquired basic physical skills.
               | 
               | Check out the reference to Core Knowledge above. There
               | are things young infants know or are predisposed to know
               | from birth.
        
               | HumanOstrich wrote:
               | The brain has developed, through evolution, very specific
               | and organized structures that allow us to learn language
               | and reading skills. If you have a genetic defect that
               | causes those structures to be faulty or missing, you will
               | have severe developmental problems.
               | 
               | That seems like a decent example of pretraining through
               | evolution.
        
               | tesch1 wrote:
               | But maybe it's something more like general symbolic
               | manipulation, and not specifically the sounds or
               | structure of language. Reading is fairly new and unlikely
               | to have had much if any evolutionary pressure in many
               | populations who are now quite literate. Same seems true
               | for music. Maybe the hardware is actually more general
               | and adaptable and not just for language?
        
               | HumanOstrich wrote:
               | The research disagrees with you.
        
               | eru wrote:
               | Music is really, really old.
               | 
               | And reading and music co-evolved to be relatively easy
               | for humans to do.
               | 
               | (See how computers have a much easier time reading
               | barcodes and QR codes, with much less general processing
               | power than it takes them to decipher human hand-writing.
               | But good luck trying to teach humans to read QR codes
               | fluently.)
        
               | eru wrote:
               | > No easier than riding a bike or skating, for which our
               | brains are probably not 'predisposed'.
               | 
               | What makes you think so? Humans came up with biking and
               | skating, because they were easy enough for us to master
               | with the hardware we had.
        
               | puffybuf wrote:
               | I think of evolution as unassisted learning where agents
                | compete with each other for limited resources. Over
               | time they get better and better at surviving by passing
               | on genes. It never ends of course.
        
             | tiborsaas wrote:
              | If you look at animals, they can walk within hours of
              | being born. It takes us longer because we are born rather
              | undeveloped, so the head can fit through the birth canal.
              | 
              | A more high-level example: seasickness is an evolutionarily
              | pre-learned thing; your body thinks it's been poisoned and
              | automatically wants to empty your stomach.
        
             | gf000 wrote:
              | I mean, there are plenty - e.g. mimicking (say, the
              | emotions on the mother's face), which is a precursor to
              | learning more advanced "features". Also, even walking has
              | many aspects pretrained (I assume it's mostly a
              | musculoskeletal limitation that we can't walk
              | immediately); humans are just born "prematurely" due to
              | our relatively huge heads. Newborn horses can walk
              | immediately without learning.
              | 
              | But there is plenty of non-learned control, movement and
              | sensing in utero that is "pretrained".
        
               | eru wrote:
               | Interestingly, there's a bunch of reflexes that also only
               | develop over time.
               | 
               | They are more nature than nurture, but they aren't 'in-
               | born'.
               | 
                | Just like humans aren't (usually) born with teeth, but
               | they don't 'learn' to have teeth or pubic hair, either.
        
             | eru wrote:
             | Your brain is well adapted to learning how to walk and
             | speak.
             | 
             | Chimpanzees score pretty high on many tests of
             | intelligence, especially short term working memory. But
             | they can't really learn language: they lack the specialised
             | hardware more than the general intelligence.
        
         | Existenceblinks wrote:
          | Honestly, it doesn't need to be local. An API some 200ms away
          | is ok-ish; make it 50ms and it will be practically usable for
          | the vast majority of interactions.
        
       | joshdavham wrote:
       | A lot of the comments seem very dismissive and a little overly-
       | skeptical in my opinion. Why is this?
        
       | ziofill wrote:
       | It's certainly remarkable, but let's not ignore the fact that it
       | still fails on puzzles that are trivial for humans. Something is
       | amiss.
        
       | vicentwu wrote:
       | "Note on "tuned": OpenAI shared they trained the o3 we tested on
       | 75% of the Public Training set. They have not shared more
       | details. We have not yet tested the ARC-untrained model to
       | understand how much of the performance is due to ARC-AGI data."
       | 
        | Really want to see the number of training pairs needed to
        | achieve this score. If it only takes a few pairs, say 100, I
        | would say it is amazing!
        
         | nmca wrote:
         | 75% of 400 is 300 :)
        
           | WXLCKNO wrote:
           | Wow are you AGI?
        
       | epigramx wrote:
       | I bet it still thinks 1+1=3 if it read enough sources parroting
       | that.
        
       | theincredulousk wrote:
       | Denoting it in $ for efficiency is peak capitalism, cmv.
        
       | polskibus wrote:
        | What are the differences between the public offering and o3?
        | What is o3 doing differently? Is it something akin to more
        | internal iterations, similar to "brute forcing" a problem, like
        | you can do yourself with a cheaper model by providing additional
        | hints after each response?
        
       | miga89 wrote:
       | How do the organisers keep the private test set private? Does
       | openAI hand them the model for testing?
       | 
       | If they use a model API, then surely OpenAI has access to the
       | private test set questions and can include it in the next round
       | of training?
       | 
       | (I am sure I am missing something.)
        
         | 7734128 wrote:
         | I suppose that's why they are calling it "semi-private".
        
           | freehorse wrote:
           | And why o3 or any OpenAI llm is not evaluated in the actual
           | private dataset.
        
         | owenpalmer wrote:
          | I wouldn't be surprised if the term "benchmark fraud" is soon
          | coined.
        
           | PhilippGille wrote:
           | Benchmark fraud is not a novel concept. Outside of LLMs for
           | example smartphone manufacturers detect benchmarks and
           | disable or reduce CPU throttling: https://www.theregister.com
           | /2019/09/30/samsung_benchmarking_...
        
             | hmottestad wrote:
             | CPU frequency ramp curve is also something that can be
             | adjusted. You want the CPU to ramp up really quickly to
             | make everything feel responsive, but at the same time you
             | want to not have to use so much power from your battery.
             | 
             | If you detect that a benchmark is running then you can just
             | ramp up to max frequency immediately. It'll show how fast
             | your CPU is, but won't be representative of the actual
             | performance that users will get from their device.
        
         | deneas wrote:
         | They have two sets, a fully private one where the models run
         | isolated and the semi-private one where they run models
         | accessed over the internet.
        
         | gritzko wrote:
         | That is the top question, actually. Given all the billions at
         | stake.
        
         | PoignardAzur wrote:
         | If we really want to imagine a cold-war-style solution, the two
         | teams could meet in an empty warehouse, bring one computer with
         | the model, one with the benchmarks, and connect them with a USB
         | cable.
         | 
         | In practice I assume they just gave them the benchmarks and
         | took it on the honor system they wouldn't cheat, yeah. They can
         | always cook up a new test set for next time, it's only 10% of
         | the benchmark content anyway and the results are pretty close.
        
           | andrepd wrote:
           | There's no honor system when there's billions of dollars at
           | stake x) I'm highly highly skeptical of these benchmarks
           | because of intentional cheating and accidental contamination.
        
         | bjornsing wrote:
         | Isn't that why they call it " Semi-Private"?
         | 
         | There's a fully private test set too as I understand it, that
         | o3 hasn't run on yet.
        
       | DiscourseFan wrote:
       | a little from column A, a little from column B
       | 
        | I don't think this is AGI; nor is it something to scoff at. It's
        | impressive, but it's also not human-like intelligence. Perhaps
       | human-like intelligence is not the goal, since that would imply
       | we have even a remotely comprehensive understanding of the human
       | mind. I doubt the mind operates as a single unit anyway, a
       | human's first words are "Mama," not "I am a self-conscious freely
       | self-determining being that recognizes my own reasoning ability
       | and autonomy." And the latter would be easily programmable
       | anyway. The goal here might, then, be infeasible: the concept of
       | free will is a kind of technology in and of itself, it has
       | already augmented human cognition. How will these technologies
       | not augment the "mind" such that our own understanding of our
       | consciousness is altered? And why should we try to determine
       | ahead of time what will hold weight for us, why the "human" part
       | of the intelligence will matter in the future? Technology should
       | not be compared to the world it transforms.
        
       | digitcatphd wrote:
        | > o3 fixes the fundamental limitation of the LLM paradigm - the
        | inability to recombine knowledge at test time - and it does so
        | via a form of LLM-guided natural language program search
        | 
        | This is significant, but I am doubtful it will be as meaningful
        | as people expect, aside from potentially greater coding tasks.
        | Without a 'world model' that has a contextual understanding of
        | what it is doing, things will remain fundamentally throttled.
        
       | madsgarff wrote:
        | > Moreover, ARC-AGI-1 is now saturating - besides o3's new
        | score, the fact is that a large ensemble of low-compute Kaggle
        | solutions can now score 81% on the private eval.
        | 
        | If low-compute Kaggle solutions already do 81% - then why is
        | o3's 75.7% considered such a breakthrough?
        
       | gmerc wrote:
       | Headline could also just be OpenAI discovers exponential scaling
       | wall for inference time compute.
        
       | owenpalmer wrote:
       | Someone asked if true intelligence requires a foundation of prior
       | knowledge. This is the way I think about it.
       | 
       | I = E / K
       | 
       | where I is the intelligence of the system, E is the effectiveness
       | of the system, and K is the prior knowledge.
       | 
       | For example, a math problem is given to two students, each
       | solving the problem with the same effectiveness (both get the
       | correct answer in the same amount of time). However, student A
       | happens to have more prior knowledge of math than student B. In
       | this case, the intelligence of B is greater than the intelligence
       | of A, even though they have the same effectiveness. B was able to
       | "figure out" the math, without using any of the "tricks" that A
       | already knew.
       | 
       | Now back to the question of whether or not prior knowledge is
       | required. As K approaches 0, intelligence approaches infinity.
       | But when K=0, intelligence is undefined. Tada! I think that
       | answers the question.
       | 
       | Most LLM benchmarks simply measure effectiveness, not
       | intelligence. I conceptualize LLMs as a person with a
       | photographic memory and a low IQ of 85, who was given 100 billion
       | years to learn everything humans have ever created.
       | 
       | IK = E
       | 
       | low intelligence * vast knowledge = reasonable effectiveness
        
         | Woodi wrote:
          | Yep, I always liked encyclopedias. Wiki is good too :)
          | 
          | What I would like to have in the future is SO answer-people
          | accessible in real time via IRC. They have real answers NOW.
          | They are even pedantic about their stuff!
        
         | wangii wrote:
          | Interesting formulation! It captures the intuition of
          | "smartness" when solving a problem. However, what about
          | asking good questions or proposing conjectures?
        
           | hanspeter wrote:
           | Aren't those solutions to problems as well?
           | 
           | Find the best questions to ask. Find the best hypothesis to
           | suggest.
        
         | lorepieri wrote:
         | There should be also a factor about resource consumption. See
         | here: https://lorenzopieri.com/pgii/
        
           | spacebanana7 wrote:
           | Also perhaps a factor (with diminishing returns) for response
           | speed?
           | 
           | All else equal, a student who gets 100% on a problem set in
           | 10 minutes is more intelligent than one with the same score
           | after 120 minutes. Likewise an LLM that can respond in 2
           | seconds is more impressive than one which responds in 30
           | seconds.
        
             | owenpalmer wrote:
             | > a student who gets 100% on a problem set in 10 minutes is
             | more intelligent than one with the same score after 120
             | minutes
             | 
             | According to _my_ mathematical model, the faster student
             | would have higher _effectiveness_ , not necessarily higher
             | intelligence. Resource consumption and speed are practical
             | technological concerns, but they're irrelevant in a
              | theoretical conceptualization of intelligence.
        
               | baq wrote:
                | If you disregard time, all computers have maximal
                | intelligence: they can enumerate all programs and
                | compute answers to any decidable question.
        
               | wouldbecouldbe wrote:
                | Yeah, speed is a key factor in intelligence. And
                | actually one of the biggest differentiators in human IQ
                | measurements.
        
               | eru wrote:
                | Humans are a bit annoying that way, because it's all
                | correlated.
                | 
                | So a human with a better response time also tends to
                | give you more intelligent answers, even when time is
                | not a factor.
                | 
                | For a computer, you can arbitrarily slow it down (or
                | speed it up) and still get the same answer.
        
             | Terr_ wrote:
             | > response time
             | 
             | Imagine you take an extraordinarily smart person, and put
             | them on a fast spaceship that causes time dilation.
             | 
             | Does that mean that they are stupider while in transit, and
             | they regain their intelligence when it slows down?
        
               | Earw0rm wrote:
               | No, because intelligence is relative to your local
               | context.
        
               | Terr_ wrote:
               | Why should one kind of phenomenon which slows down
               | performance on the test be given a special "you're more
               | intelligent than you seem" exception, but not others?
               | 
               | If we are required to break the seal on the black-box and
                | investigate exactly how the agent is operating in
               | order to judge its "intelligence"... Doesn't that kinda
               | ruin the up-thread stuff about judging with equations?
        
               | zoky wrote:
               | Who is a better free-thrower, someone who can hit 20 free
               | throws per minute on Earth, or the same thrower who
               | logged 20 million free throws in the apparent two years
               | he was gone but comes back ready for retirement?
        
             | coffeebeqn wrote:
              | Maybe. If I could ask an AI to come up with a 50%
              | efficient mass-market solar panel, I don't really care if
              | it takes a few weeks or a year to solve that. I'm not sure
              | if inventiveness or novelty of solution could be a metric.
              | I suppose that is superintelligence rather than AGI? And
              | by then there would be no question of what it is.
        
           | xlii wrote:
            | An interesting point from a philosophical perspective!
            | 
            | But if we take this into consideration, would it mean that a
            | 1st-world engineer is by definition _less_ intelligent than
            | a 3rd-world one?
            | 
            | I think the (completely reasonable) knee-jerk reaction is a
            | defensive one, but I can imagine an authoritarian-regime
            | escapee working side by side with an engineer groomed in
            | expensive, air-conditioned lecture rooms. In this imaginary
            | scenario the escapee, even if slower and less efficient at
            | the problem at hand, would have to be more intelligent
            | generally.
        
           | eru wrote:
           | That's a bit silly.
           | 
           | Yes, resource consumption is important. But your car guzzling
            | a lot of gas doesn't mean it drives slower. It just means it
            | covers less distance per mole of petrol consumed.
           | 
           | It's good to know whether your system has a high or low 'bang
           | for buck' metric, but that doesn't directly affect how much
           | bang you get.
        
         | someothherguyy wrote:
         | https://en.wikipedia.org/wiki/Fluid_and_crystallized_intelli...
        
         | dmezzetti wrote:
         | We should wait until it's released before we anoint it. It's
         | disheartening to see how we keep repeating the same pattern
         | that gives in to hype over the scientific method.
        
           | lazide wrote:
           | The scientific method doesn't drive stock price (apparently).
        
         | empiko wrote:
         | Well put. You ask LLMs about ARC-like challenges and they are
         | able to come up with a list of possible problem formulations
         | even before you show them the input. The models already know
         | that they might expect various object manipulations, symmetry
          | problems, etc. The fact that the solution costs thousands of
         | dollars says to me that the model iterates over many solutions
         | while using this implicit knowledge and feedback it gets from
         | running the program. It is still impressive, but I don't think
         | this is what the ARC prize was supposed to be about.
        
           | curl-up wrote:
           | > while using this implicit knowledge and feedback it gets
           | from running the program.
           | 
           | What feedback, and what program, are you referring to?
        
             | scotty79 wrote:
              | Basically, solutions that were doing well on ARC just
              | threw thousands of ideas at the wall and picked the ones
              | that stuck. They were literally generating thousands of
              | Python programs, running them and checking if any
              | produced the correct output when fed the data from the
              | examples.
              | 
              | This o3 doesn't need to run Python. It itself executes
              | programs written in tokens inside its own context window,
              | which is wildly inefficient but gives better results and
              | is potentially more general.
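              | 
              | A minimal sketch of that search loop, with a toy list of
              | candidate programs standing in for the thousands that
              | were actually generated (purely illustrative, not any
              | specific Kaggle solution):
              | 
              |     def candidates():
              |         # toy stand-ins for generated programs
              |         return [lambda g: g,
              |                 lambda g: [r[::-1] for r in g],
              |                 lambda g: [list(r) for r in zip(*g)]]
              | 
              |     def search(train_pairs, test_input):
              |         for prog in candidates():
              |             # keep a program only if it reproduces
              |             # every training example
              |             if all(prog(x) == y for x, y in train_pairs):
              |                 return prog(test_input)
              |         return None
              | 
              |     # task: "mirror each row"
              |     train = [([[1, 2]], [[2, 1]]),
              |              ([[3, 4, 5]], [[5, 4, 3]])]
              |     print(search(train, [[7, 8, 9]]))  # [[9, 8, 7]]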
        
               | TheOtherHobbes wrote:
               | So basically it's a massively inefficient trial-and-error
               | leetcode solver which only works because it throws
               | incredible amounts of compute at the problem.
               | 
               | This is hilarious.
        
             | empiko wrote:
             | I assume that o3 can run Python scripts and observe the
             | outputs.
        
         | onemetwo wrote:
          | An intelligent system could take more advantage of an increase
          | of knowledge than a dumb one, so I would propose a simple
          | formula: the derivative of efficiency with respect to
          | knowledge is proportional to intelligence.
          | 
          | $$ I = \frac{\partial E}{\partial K} \simeq \frac{\delta
          | E}{\delta K} $$
          | 
          | In order to estimate $I$ you have to consider that efficiency
          | and knowledge are task-related, so you could take some
          | weighted mean $\sum_T C(E,K,T) \cdot I(E,K,T)$ where $T$ is
          | the task category. I am thinking of $C(E,K,T)$ as something
          | similar to thermal capacity or electrical resistance, the
          | equivalent concept applied to tasks. An intelligent agent in
          | a medium of low resistance should fly while a dumb one would
          | still crawl.
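          | 
          | As a toy illustration (with made-up linear curves, just to
          | show what the derivative picks out): if system 1 has $E_1(K)
          | = 2K$ and system 2 has $E_2(K) = 0.5K$, then $I_1 = \partial
          | E_1 / \partial K = 2$ while $I_2 = 0.5$, so the same extra
          | unit of knowledge buys system 1 four times as much
          | efficiency.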
        
           | owenpalmer wrote:
           | > An intelligent system could take more advantage of an
           | increase of knowledge than a dumb one
           | 
           | Why?
           | 
           | > derivative of efficiency
           | 
           | Where did your efficiency variable come from?
        
             | onemetwo wrote:
              | Why? I am using "dumb" to mean a low-intelligence system.
              | A more intelligent person can take advantage of new
              | opportunities.
              | 
              | Efficiency variable: you are right that effectiveness
              | could be a better term here, because we are not
              | considering resources like computer time and power.
        
         | gardenhedge wrote:
         | Where did someone ask that?
        
         | scotty79 wrote:
         | As a kid I absolutely hated math and loved physics and
         | chemistry because solving anything in math requires vast
         | specific K.
         | 
         | In comparison you can easily know everything there is to know
         | about physics or chemistry and it's sufficient to solve
          | interesting puzzles. In math every puzzle has its own vast
         | lore you need to know before you can have any chance at
         | tackling it.
        
           | owenpalmer wrote:
           | Physics and chemistry require experimentation to verify
           | solutions. With math however, any new knowledge can be
           | intuited and proven from previous proofs, so yes, the lore
           | goes deep!
        
       | Woodi wrote:
        | So the article seriously and scientifically states:
        | 
        | "Our program compilation (AI) gave 90% correct answers in test
        | 1. We expect that in test 2 the quality of answers will
        | degenerate to below the level of a random monkey pushing
        | buttons. Now more money is needed to prove we hit a blind
        | alley."
        | 
        | Hurray! Put a limited version of that on everybody's phones!
        
       | oezi wrote:
       | > o3 fixes the fundamental limitation of the LLM paradigm - the
       | inability to recombine knowledge at test time
       | 
       | I don't understand this mindset. We have all experienced that
       | LLMs can produce words never spoken before. Thus there is
       | recombination of knowledge at play. We might not be satisfied
       | with the depth/complexity of the combination, but there isn't any
       | reason to believe something fundamental is missing. Given more
       | compute and enough recursiveness we should be able to reach any
       | kind of result from the LLM.
       | 
       | The linked article says that LLMs are like a collection of vector
       | programs. It has always been my thinking that computations in
        | vector space are easy to make Turing-complete if we just have an
       | eigenvector representation figured out.
        
         | lagrange77 wrote:
         | > Given more compute and enough recursiveness we should be able
         | to reach any kind of result from the LLM.
         | 
         | That was always true for NNs in general, yet it took a very
         | specific structure to get to where we are now. (..with a
         | certain amount of time and resources.)
         | 
         | > thinking that computations in vector space are easy to make
         | turing complete if we just have an eigenvector representation
         | figured out
         | 
         | Sounds interesting, would you elaborate?
        
       | niemandhier wrote:
       | Contrary to many I hope this stays expensive. We are already
       | struggling with AI curated info bubbles and psy-ops as it is.
       | 
       | State actors like Russia, US and Israel will probably be fast to
       | adopt this for information control, but I really don't want to
       | live in a world where the average scammer has access to this
       | tech.
        
         | owenpalmer wrote:
         | > I really don't want to live in a world where the average
         | scammer has access to this tech.
         | 
         | Reality check: local open source models are more than capable
         | of information control, generating propaganda, and scamming
         | you. The cat's been out of the bag for a while now, and
         | increased reasoning ability doesn't dramatically increase the
         | weaponizability of this tech, I think.
        
       | pal9000 wrote:
       | Can someone ELI5 how ARC-AGI-PUB is resistant to p-hacking?
        
       | danielovichdk wrote:
        | At what point will it kill us all, because it understands that
        | humans are the biggest problem standing between it and simply
        | chilling without worry?
        | 
        | That would be intelligent. Everything else is just stupid and
        | more of the same shit.
        
         | aniviacat wrote:
         | Humans are the biggest problem of what? Of the sun? Of Venus?
         | 
         | Of humans. Humans are a problem for the satisfaction of humans.
        | Yet removing humans from this equation does not result in
        | higher human satisfaction. It lessens it.
         | 
         | I find this thought process of "humans are the problem" to be
         | unreasonable. Humans aren't the problem; humans are the
         | requirement.
        
       | almog wrote:
        | AGI => ARC-AGI-PUB
        | 
        | And not the other way around, as some comments here seem to
        | confuse necessary and sufficient conditions.
        
       | the5avage wrote:
        | The examples unsolved by high-compute o3 look a lot like the
        | Raven's Progressive Matrices used in IQ tests.
        
       | thom wrote:
       | It's not AGI when it can do 1000 math puzzles. It's AGI when it
       | can do 1000 math puzzles then come and clean my kitchen.
        
         | qup wrote:
         | Intelligence doesn't have to be embodied.
        
           | thom wrote:
           | It also has to be able to come and argue in the comments.
        
           | goatlover wrote:
           | For it to be AGI, it needs to be able to manipulate the
            | physical world from its own goals, not just produce text
           | when prompted. LLMs are just tools to augment human
           | intelligence. AGI is what you see in science fiction.
        
         | egeozcan wrote:
          | I understand what you are saying and sort of agree with the
          | premise, but to be pedantic, I don't think any robot can clean
          | a kitchen without doing math :)
        
       | epolanski wrote:
        | Okay, but what are the tests like? At least give a general idea.
        
       | tymonPartyLate wrote:
        | Isn't this like a brute-force approach? Given it costs $3,000
        | per task, that's like 600 GPU-hours (H100 at Azure). In that
        | amount of time the model can generate millions of chains of
        | thought and then spend hours reviewing them or even testing
        | them out one by one. Kind of like trying until something
        | sticks, and that happens to solve 80% of ARC. I feel like
        | reasoning works differently in my brain. ;)
        
         | strangescript wrote:
         | "We have created artificial super intelligence, it has solved
         | physics!"
         | 
         | "Well, yeah, but its kind of expensive" -- this guy
        
           | freehorse wrote:
           | The problem is not that it is expensive, but that, most
           | likely, it is not superintelligence. Superintelligence is not
            | exploring the problem space semi-blindly, if the thousands
            | of $$$ per task are actually spent on that. There is a
            | reason the actual ARC-AGI prize requires efficiency: the
            | point is not "passing the test" but solving the framing
            | problem of intelligence.
        
           | tymonPartyLate wrote:
           | Haha. Hopefully you're right and solving the ARC puzzle
           | translates to solving all of physics. I just remain skeptical
           | about the OpenAI hype. They have a track record of
           | exaggerating the significance of their releases and their
           | impact on humanity.
        
           | jeremyjh wrote:
           | Please do show me a novel result in physics from any LLM. You
           | think "this guy" is stupid because he doesn't extrapolate
           | from this $2MM test that nearly reproduces the work of a STEM
           | graduate to a super intelligence that has already solved
           | physics. Maybe you've got it backwards.
        
         | tikkun wrote:
         | They're only allowed 2-3 guesses per problem. So even though
         | yes it generates many candidates, it can't validate them - it
         | doesn't have tool use or a verifier, it submits the best 2-3
         | guesses.
         | https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50...
        
           | nmca wrote:
           | It is allowed exactly two guesses, per the ARC rules.
        
             | trescenzi wrote:
             | How many guesses is the human comparison based on? I'd hope
             | two as well but haven't seen this anywhere so now I'm
             | curious.
        
               | nmca wrote:
               | The real turker studies, resulting in the ~70% number,
               | are scored correctly I believe. Higher numbers are just
               | speculated human performance as far as I'm aware.
        
         | nextworddev wrote:
         | The best interpretation of this result is probably that it
         | showed tackling some arbitrary benchmark is something you can
         | throw money at, aka it's just something money can solve.
         | 
          | It's not AGI, obviously, in the sense that you still need to
          | do some problem framing and initialization to kickstart the
          | reasoning path simulations.
        
         | torginus wrote:
         | this might be quite an important point - if they created an
         | algorithm that can mimic human reasoning, but scales terribly
         | with problem complexity (in terms of big O notation), it's
          | still a very significant result, but it's not a 'human brains
         | are over' moment quite yet.
        
         | macrolime wrote:
          | The trick with AlphaGo was brute force combined with learning
          | to extract strategies from that brute force using
          | reinforcement learning, and that's what we'll see here. So
          | maybe it costs a million dollars in compute to get a high
          | score, but use reinforcement learning a la AlphaZero to learn
          | from the process and it won't cost a million dollars next
          | time. Let it do lots of hard benchmarks, math problems and
          | coding tasks and it'll keep getting better and better.
        
       | tikkun wrote:
       | I wonder: when did o1 finish training, and when did o3 finish
       | training?
       | 
       | There's a ~3 month delay between o1's launch (Sep 12) and o3's
       | launch (Dec 20). But, it's unclear when o1 and o3 each finished
       | training.
        
       | zug_zug wrote:
        | This is a lot of noise around what's clearly not even an order
        | of magnitude on the way to AGI.
       | 
       | Here's my AGI test - Can the model make a theory of AGI
       | validation that no human has suggested before, test itself to see
       | if it qualifies, iterate, read all the literature, and suggest
       | modifications to its own network to improve its performance?
       | 
       | That's what a human-level performer would do.
        
       | earth2mars wrote:
       | Maybe spend more compute time to let it think about optimizing
       | the compute time.
        
       | msoad wrote:
        | There is new research where the chain of thought happens in
        | latent space and not in English. It demonstrated better
        | results, since language is not as expressive as the concepts
        | that can be represented in the layers before the decoder. I
        | wonder if o3 is doing that?
        
         | padolsey wrote:
         | I think you mean this: https://arxiv.org/abs/2412.06769
         | 
          | From what I can see, presuming o3 is a progression of o1 and
          | has a good level of accountability bubbling up during
          | 'inference' (i.e. "Thinking about ___"), then I'd say it's
          | just using up millions of old-school tokens (the 44 million
          | tokens that are referenced). So not latent thinking per se.
        
           | Zamicol wrote:
           | Interesting!
        
         | gliptic wrote:
         | "You can tell the RL is done properly when the models cease to
         | speak English in their chain of thought" -- Karpathy
        
       | rapjr9 wrote:
       | Does anyone have a feeling for how latency (from asking a
       | question/API call to getting an answer/API return) is progressing
       | with new models? I see 1.3 minutes/task and 13.8 minutes/task
       | mentioned in the page on evaluating O3. Efficiency gains that
       | also reduce latency will be important and some of them will come
       | from efficiency in computation, but as models include more and
       | more layers (layers of models for example) the overall latency
       | may grow and faster compute times inside each layer may only help
       | somewhat. This could have large effects on usability.
        
       | amai wrote:
       | But can it convert handwritten equations into Latex? That is the
       | AGI task I'm waiting for.
        
       | figure8 wrote:
       | I have a very naive question.
       | 
       | Why is the ARC challenge difficult but coding problems are easy?
       | The two examples they give for ARC (border width and square
        | filling) are much simpler than the pattern awareness I see
        | simple models show in code every day.
       | 
       | What am I misunderstanding? Is it that one is a visual grid
       | context which is unfamiliar?
        
         | ItsMattyG wrote:
          | Francois' (the creator of the ARC-AGI benchmark) whole point
          | was that while they look the same, they're not. Coding is
          | solving a familiar pattern in the same way (and fails when
          | it's NOT doing that; it just looks like that doesn't happen
          | because it's seen SO MANY patterns in code). But the point of
          | ARC-AGI is to make each problem require generalizing in some
          | new way.
        
       | sn0wr8ven wrote:
       | Incredibly impressive. Still can't really shake the feeling that
       | this is o3 gaming the system more than it is actually being able
       | to reason. If the reasoning capabilities are there, there should
       | be no reason why it achieves 90% on one version and 30% on the
       | next. If a human maintains the same performance across the two
       | versions, an AI with reason should too.
        
         | demirbey05 wrote:
          | I am not an expert in LLM reasoning, but I think it's because
          | of RL. You cannot use AlphaZero to play other games.
        
         | GaggiX wrote:
          | Humans and AIs are different. The next benchmark would be
          | built so that it emphasizes the weak points of current AI
          | models, where a human is expected to perform better, but I
          | guess you can also make a benchmark that is the opposite,
          | where humans struggle and o3 has an easy time.
        
         | pkphilip wrote:
         | Yes, if a system has actually achieved AGI, it is likely to not
         | reveal that information
        
           | HeatrayEnjoyer wrote:
           | AGI is a spectrum, not a binary quality.
        
         | cornholio wrote:
         | But does it matter if it "really, really" reasons in the human
         | sense, if it's able to prove some famous math theorem or come
         | up with a novel result in theoretical physics?
         | 
          | While beyond current models, that would be the final test of
          | AGI capability.
        
           | jprete wrote:
           | If it's gaming the system, then it's much less likely to
           | reliably come up with novel proofs or useful new theoretical
           | ideas.
        
           | intended wrote:
           | Yeah, it really does matter if something was reasoned, or
            | whether it appears when you metaphorically shake the magic
            | 8-ball.
        
         | FartyMcFarter wrote:
         | How would gaming the system work here? Is there some flaw in
         | the way the tasks are generated?
        
         | kmacdough wrote:
          | The point of ARC is NOT to compare humans vs AI, but to probe
          | the current boundary of AI's weaknesses. AI has been beating
          | us at specific tasks like handwriting recognition for decades.
         | Rather, it's when we can no longer readily find these "easy for
         | human, hard for AI" reasoning tasks that we must stop and
         | consider.
         | 
         | If you look at the ARC tasks failed by o3, they're really not
         | well suited to humans. They lack the living context humans
         | thrive on, and have relatively simple, analytical outcomes that
         | are readily processed by simple structures. We're unlikely to
         | see AI as "smart" until it can be asked to accomplish useful
         | units of productive professional work at a "seasoned
         | apprentice" level. Right now they're consuming ungodly amounts
         | of power just to pass some irritating, sterile SAT questions.
         | Train a human for a few hours a day over a couple weeks and
         | they'll ace this no problem.
        
       | earth2mars wrote:
       | Why did they skip o2?
        
       | YeGoblynQueenne wrote:
       | I just noticed this bit:
       | 
       | >> Second, you need the ability to recombine these functions into
       | a brand new program when facing a new task - a program that
       | models the task at hand. Program synthesis.
       | 
       | "Program synthesis" is here used in an entirely idiosyncratic
       | manner, to mean "combining programs". Everyone else in CS and AI
       | for the last many decades has used "Program Synthesis" to mean
       | "generating a program that satisfies a specification".
       | 
       | Note that "synthesis" can legitimately be used to mean
       | "combining". In Greek it translates literally to "putting
       | [things] together": "Syn" (plus) "thesis" (place). But while
       | generating programs by combining parts of other programs is an
       | old-fashioned way to do Program Synthesis, in the standard sense,
       | the end result is always desired to be a program. The LLMs used
        | in the article to do what F. Chollet calls "Program Synthesis"
       | generate no code.
        
         | tshadley wrote:
         | I always get the feeling he's subconsciously inserting a
         | "magical" step here with reference to "synthesis"-- invoking a
         | kind of subtle dualism where human intelligence is just
         | different and mysteriously better than hardware intelligence.
         | 
         | Combining programs should be straightforward for DNNs,
         | ordering, mixing, matching concepts by coordinates and
         | arithmetic in learned high-dimensional embedded-space.
         | Inference-time combination is harder since the model is working
         | with tokens and has to keep coherence over a growing CoT with
         | many twists, turns and dead-ends, but with enough passes can
         | still do well.
         | 
         | The logical next step to improvement is test-time training on
         | the growing CoT, using reinforcement-fine-tuning to compress
         | and organize the chain-of-thought into parameter-space--if we
         | can come up with loss functions for "little progress, a lot of
         | progress, no progress". Then more inference-time with a better
         | understanding of the problem, rinse and repeat.
        
       | baalimago wrote:
       | Let me know when OpenAI can wrap Christmas gifts. Then I'll be
       | interested.
        
       ___________________________________________________________________
       (page generated 2024-12-21 18:00 UTC)