[HN Gopher] OpenAI O3 breakthrough high score on ARC-AGI-PUB
___________________________________________________________________
OpenAI O3 breakthrough high score on ARC-AGI-PUB
Author : maurycy
Score : 1509 points
Date : 2024-12-20 18:11 UTC (23 hours ago)
(HTM) web link (arcprize.org)
(TXT) w3m dump (arcprize.org)
| razodactyl wrote:
| Great. Now we have to think of a new way to move the goalposts.
| tines wrote:
| I mean, what else do you call learning?
| Pesthuf wrote:
| Well right now, running this model is really expensive, but we
| should prepare a new cope for when equivalent models no longer
| are, ahead of time.
| cchance wrote:
| Ya, getting costs down will be the big one. I imagine
| quantization, distillation and lots and lots of improvements
| on the compute side, both hardware- and software-wise.
| a_wild_dandan wrote:
| Let's just define AI as "whatever computers still can't do."
| That'll show those dumb statistical parrots!
| foobarqux wrote:
| This is just as silly as claiming that people "moved the
| goalposts" when, after a computer beat Kasparov at chess, they
| said it wasn't AGI: chess wasn't a good test, and some people
| only realized this after the computer beat Kasparov but
| couldn't do much else. In this case the ARC maintainers have
| specifically stated that this is a necessary but not
| sufficient test of AGI (I personally think it is neither).
| og_kalu wrote:
| It's not silly. The computer that could beat Kasparov
| couldn't do anything else so of course it wasn't Artificial
| General Intelligence.
|
| o3 can do much much more. There is nothing narrow about SOTA
| LLMs. They are already General. It doesn't matter what ARC
| Maintainers have said. There is no common definition of
| General that LLMs fail to meet. It's not a binary thing.
|
| By the time a single machine covers every little test
| humanity can devise, what comes out of that is not 'AGI' as
| the words themselves mean but a General Super Intelligence.
| foobarqux wrote:
| It is silly, the logic is the same: "Only a (world-
| altering) 'AGI' could do [test]" -> test is passed -> no
| (world-altering) 'AGI' -> conclude that [test] is not a
| sufficient test for (world-altering) 'AGI' -> chase new
| benchmark.
|
| If you want to play games about how to define AGI go ahead.
| People have been claiming for years that we've already
| reached AGI and with every improvement they have to
| bizarrely claim anew that _now_ we've really achieved AGI.
| But after a few months people realize it still doesn't do
| what you would expect of an AGI and so you chase some new
| benchmark ("just one more eval").
|
| The fact is that there really hasn't been the type of
| world-altering impact that people generally associate with
| AGI and no reason to expect one.
| og_kalu wrote:
| >It is silly, the logic is the same: "Only a (world-
| altering) 'AGI' could do [test]" -> test is passed -> no
| (world-altering) 'AGI' -> conclude that [test] is not a
| sufficient test for (world-altering) 'AGI' -> chase new
| benchmark.
|
| Basically nobody today thinks beating a single benchmark
| and nothing else will make you a General Intelligence. As
| you've already pointed out, even the maintainers of
| ARC-AGI do not think this.
|
| >If you want to play games about how to define AGI go
| ahead.
|
| I'm not playing any games. ENIAC cannot do 99% of the
| things people use computers to do today and yet barely
| anybody will tell you it wasn't the first general purpose
| computer.
|
| On the contrary, it is people who seem to think "General"
| is a moniker for everything under the sun (and then some)
| that are playing games with definitions.
|
| >People have been claiming for years that we've already
| reached AGI and with every improvement they have to
| bizarrely claim anew that now we've really achieved AGI.
|
| Who are these people ? Do you have any examples at all.
| Genuine question
|
| >But after a few months people realize it still doesn't
| do what you would expect of an AGI and so you chase some
| new benchmark ("just one more eval").
|
| What do you expect from 'AGI'? Everybody seems to have
| different expectations, much of it rooted in science
| fiction and not even reality, so this is a moot point.
| What exactly is World Altering to you ? Genuinely, do you
| even have anything other than a "I'll know it when i see
| it ?"
|
| If you introduce technology most people adopt, is that
| world altering or are you waiting for Skynet ?
| foobarqux wrote:
| > Basically nobody today thinks beating a single
| benchmark and nothing else will make you a General
| Intelligence.
|
| People's comments, including in this very thread, seem to
| suggest otherwise (c.f. comments about "goal post
| moving"). Are you saying that a widespread belief wasn't
| that a chess playing computer would require AGI? Or that
| Go was at some point the new test for AGI? Or the Turing
| test?
|
| > I'm not playing any games... "General" is a moniker for
| everything under the sun that are playing games with
| definitions.
|
| People have a colloquial understanding of AGI whose
| consequence is a significant change to daily life, not
| the tortured technical definition that you are using.
| Again your definition isn't something anyone cares about
| (except maybe in the legal contract between OpenAI and
| Microsoft).
|
| > Who are these people ? Do you have any examples at all.
| Genuine question
|
| How about you? I get the impression that you think AGI
| was achieved some time ago. It's a bit difficult to
| simultaneously argue both that we achieved AGI in GPT-N
| and also that GPT-(N+X) is now the real breakthrough AGI
| while claiming that your definition of AGI is useful.
|
| > What do you expect from 'AGI'?
|
| I think everyone's definition of AGI includes, as a
| component, significant changes to the world, which
| probably would be something like rapid GDP growth or
| unemployment (though you could have either of those
| without AGI). The fact that you have to argue about what
| the word "general" technically means is proof that we
| don't have AGI in a sense that anyone cares about.
| og_kalu wrote:
| >People's comments, including in this very thread, seem
| to suggest otherwise (c.f. comments about "goal post
| moving").
|
| But you don't see this kind of discussion on the narrow
| models/techniques that made strides on this benchmark, do
| you ?
|
| >People have a colloquial understanding of AGI whose
| consequence is a significant change to daily life, not
| the tortured technical definition that you are using
|
| And ChatGPT has represented a significant change to the
| daily lives of many. It's the fastest adopted software
| product in history. In just 2 years, it's one of the top
| ten most visited sites on the planet worldwide. A lot of
| people have had the work they do change significantly since
| its release. This is why I ask, what is world altering ?
|
| >How about you? I get the impression that you think AGI
| was achieved some time ago.
|
| Sure
|
| >It's a bit difficult to simultaneously argue both that
| we achieved AGI in GPT-N and also that GPT-(N+X) is now
| the real breakthrough AGI
|
| I have never claimed GPT-N+X is the "new breakthrough
| AGI". As far as I'm concerned, we hit AGI sometime ago
| and are making strides in competence and/or enabling even
| more capabilities.
|
| You can recognize ENIAC as a general purpose computer and
| also recognize the breakthroughs in computing since then.
| They're not mutually exclusive.
|
| And personally, I'm more impressed with o3's Frontier
| Math score than ARC.
|
| >I think everyone's definition of AGI includes, as a
| component, significant changes to the world
|
| Sure
|
| >which probably would be something like rapid GDP growth
| or unemployment
|
| What people imagine as "significant change" is definitely
| not in any broad agreement.
|
| Even in science fiction, the existence of general
| intelligences more competent than today's LLMs does not
| necessarily lead to massive unemployment or GDP growth.
|
| And for a lot of people, the clincher stopping them from
| calling a machine AGI is not even any of these things.
| For some, that it is "sentient" or "cannot lie" is far
| more important than any spike of unemployment.
| foobarqux wrote:
| > But you don't see this kind of discussion on the narrow
| models/techniques that made strides on this benchmark, do
| you ?
|
| I don't understand what you are getting at.
|
| Ultimately there is no axiomatic definition of the term
| AGI. I don't think the colloquial understanding of the
| word is what you think it is (i.e. if you had described
| to people, pre-ChatGPT, today's ChatGPT behavior,
| including all the limitations and failings and the fact
| that there was no change in GDP, unemployment, etc., and
| asked if that was AGI, I seriously doubt they would say
| yes).
|
| More importantly I don't think anyone would say their
| life was much different from a few years ago and
| separately would say under AGI it would be.
|
| But the point that started all this discussion is the
| fact that these "evals" are not good proxies for AGI and
| no one is moving goal-posts even if they realize this
| fact only after the tests have been beaten. You can
| foolishly _define_ AGI as beating ARC but the moment ARC
| is beaten you realize that you don't care about that
| definition at all. That doesn't change if you make a 10
| or 100 benchmark suite.
| og_kalu wrote:
| >I don't understand what you are getting at.
|
| If such discussions are only made when LLMs make strides in
| the benchmark then it's not just about beating the
| benchmark but also what kind of system is beating it.
|
| >You can foolishly define AGI as beating ARC but the
| moment ARC is beaten you realize that you don't care
| about that definition at all.
|
| If you change your definition of AGI the moment a test is
| beaten then yes, you are simply moving goalposts.
|
| If you care about other impacts like "Unemployment" and
| "GDP rising" but don't give any time or opportunity to
| see if the model is capable of such then you don't really
| care about that and are just mindlessly shifting posts.
|
| How does such a person know o3 won't cause mass
| unemployment? The model hasn't even been released yet.
| foobarqux wrote:
| > If such discussions only made when LLMs make strides in
| the benchmark then it's not just about beating the
| benchmark but also what kind of system is beating it.
|
| I still don't understand the point you are making. Nobody
| is arguing that discrete program search is AGI (and the
| same counter-arguments would apply if they did).
|
| > If you change your definition of AGI the moment a test
| is beaten then yes, you are simply post moving.
|
| I don't think anyone changes their definition, they just
| erroneously assume that any system that succeeds on the
| test must do so only because it has general intelligence
| (that was the argument for chess playing for example).
| When it turns out that you can pass the test with much
| narrower capabilities they recognize that it was a bad
| test (unfortunately they often replace the bad test with
| another bad test and repeat the error).
|
| > If you care about other impacts like "Unemployment" and
| "GDP rising" but don't give any time or opportunity to
| see if the model is capable of such then you don't really
| care about that and are just mindlessly shifting posts.
|
| We are talking about what models are doing now (is AGI
| here _now_), not what some imaginary research
| breakthroughs might accomplish. O3 is not going to
| materially change GDP or unemployment. (If you are
| confident otherwise please say how much you are willing
| to wager on it).
| og_kalu wrote:
| I'm not talking about any imaginary research
| breakthroughs. I'm talking about today, right now. We
| have a model unveiled today that seems a large
| improvement across several benchmarks but hasn't been
| released yet.
|
| You can be confident all you want, but until the model has
| had the chance to show whether it has the effect you think it
| won't, it's just an assertion that may or may not be
| entirely wrong.
|
| If you say "this model passed this benchmark I thought
| would indicate AGI but didn't do this or that so I won't
| acknowledge it" then I can understand that. I may not
| agree on what the holdups are but I understand that.
|
| If however you're "this model passed this benchmark I
| thought would indicate AGI but I don't think it's going
| to be able to do this or that so it's not AGI" then I'm
| sorry but that's just nonsense.
|
| My thoughts or bets are irrelevant here.
|
| A few days ago I saw someone seriously comparing a site
| with nearly 4B visits a month in under 2 years to Bitcoin
| and VR. People are so up in their bubbles and so assured
| in their way of thinking they can't see what's right in
| front of them, nevermind predict future usefulness. I'm
| just not interested in engaging "I think It won't"
| arguments when I can just wait and see.
|
| I'm not saying you are one of such people. I just have no
| interest in such arguments.
|
| My bet ? There's no way i would make a bet like that
| without playing with the model first. Why would I ? Why
| would you ?
| foobarqux wrote:
| > I'm not talking about any imaginary research
| breakthroughs. I'm talking about today, right now.
|
| I explicitly said I was too. I said today we don't have
| large impact societal changes that people have
| conventionally associated with the term AGI. I also
| explicitly talked about how I don't believe o3 will
| change this and your comments seem to suggest neither do
| you (you seem to prefer to emphasize that it isn't
| literally impossible that o3 will make these
| transformative changes).
|
| > If however you're "this model passed this benchmark I
| thought would indicate AGI but I don't think it's going
| to be able to do this or that so it's not AGI" then I'm
| sorry but that's just nonsense.
|
| The entire point of the original chess example was to
| show that the correct reaction is to repudiate incorrect
| beliefs about a naive litmus test of AGI-ness. If we did
| what you are arguing, should we accept that AGI occurred
| after chess was beaten, because a lot of people believed
| that was the litmus test? Or should we praise people who
| stuck to their original beliefs after they were proven
| wrong instead of correcting them? That's why I said it was
| silly at the outset.
|
| > My thoughts or bets are irrelevant here
|
| No they show you don't actually believe we have society
| transformative AGI today (or will when o3 is released)
| but get upset when someone points that out.
|
| > I'm just not interested in engaging "I think It won't"
| arguments when I can just wait and see.
|
| A lot of life is about taking decisions based on
| predictions about the future, including consequential
| decisions about societal investment, personal career
| choices, etc. For many things there isn't a "wait and see
| approach", you are making implicit or explicit decisions
| even by maintaining the status quo. People who make bad
| or unsubstantiated arguments are creating a toxic
| environment in which those decisions are made, leading to
| personal and public harm. The most important example of
| this is the decision to dramatically increase energy
| usage to accommodate AI models despite impending climate
| catastrophe on the blind faith that AI will somehow fix
| it all (which is far from the "wait and see" approach
| that you are supposedly advocating by the way, this is an
| active decision).
|
| > My bet ? There's no way i would make a bet like that
| without playing with the model first. Why would I ? Why
| Would you ?
|
| You can have beliefs based on limited information. People
| do this all the time. And if you actually revealed that
| belief it would demonstrate that you don't actually
| currently believe o3 is likely to be world-transformative.
| Jensson wrote:
| > But you don't see this kind of discussion on the narrow
| models/techniques that made strides on this benchmark, do
| you ?
|
| This model was trained to pass this test, it was trained
| heavily on the example questions, so it was a narrow
| technique.
|
| We even have proof that it isn't AGI, since it scores
| horribly on ARC-AGI 2. It overfitted for this test.
| og_kalu wrote:
| >This model was trained to pass this test, it was trained
| heavily on the example questions, so it was a narrow
| technique.
|
| You are allowed to train on the train set. That's the
| entire point of the test.
|
| >We even have proof that it isn't AGI, since it scores
| horribly on ARC-AGI 2. It overfitted for this test.
|
| Arc 2 does not even exist yet. All we have are "early
| signs", not that that would be proof of anything. Whether
| I believe the models are generally intelligent or not
| doesn't depend on ARC
| Jensson wrote:
| > You are allowed to train on the train set. That's the
| entire point of the test.
|
| Right, but by training on those test cases you are
| creating a narrow model. The whole point of training
| questions is to create narrow models, like all the models
| we did before.
| og_kalu wrote:
| That doesn't make any sense. Training on the train set
| does not make the models capabilities narrow. Models are
| narrow when you can't train them to do anything else even
| if you wanted to.
|
| You are not narrow for undergoing training and it's
| honestly kind of ridiculous to think so. Not even the ARC
| maintainers believe so.
| Jensson wrote:
| > Training on the train set does not make the models
| capabilities narrow
|
| Humans didn't need to see the training set to pass this;
| the AI needing it means it is narrower than humans,
| at least on these kinds of tasks.
|
| The system might be more general than previous models,
| but still not as general as humans, and the G in AGI
| typically means being as general as humans. We are moving
| towards more general models, but still not at the level
| where we call them AGI.
| og_kalu wrote:
| This is also wildly ahead in SWE-bench (71.7%, previous 48%) and
| Frontier Math (25% on high compute, previous 2%).
|
| So much for a plateau lol.
| throwup238 wrote:
| _> So much for a plateau lol._
|
| It's been really interesting to watch all the internet pundits'
| takes on the plateau... as if the _two years_ since the release
| of GPT3.5 is somehow enough data for an armchair ponce to
| predict the performance characteristics of an entirely novel
| technology that no one understands.
| jgalt212 wrote:
| You could make an equivalently dismissive comment about the
| hypesters.
| throwup238 wrote:
| Yeah but anyone with half a brain knows to ignore them.
| Vapid cynicism is a lot more seductive to the average nerd.
| bandwidth-bob wrote:
| The pundits' response to the (alleged) plateau was
| proportional to the certainty with which CEOs of frontier
| labs discussed pre-training scaling. The o3 result is from
| scaling test time compute, which represents a meaningful
| change in how you would build out compute for scaling (single
| supercluster --> presence in regions close to users). Thus it
| is important to discuss.
| attentionmech wrote:
| I legit see that if there isn't a new breakthrough for even
| one week, people start shouting "plateau, plateau". Our rate
| of progress is extraordinary and any downplaying of it seems
| stupid.
| optimalsolver wrote:
| >Frontier Math (25% on high compute, previous 2%)
|
| This is so insane that I can't help but be skeptical. I know the FM
| answer key is private, but they have to send the questions to
| OpenAI in order to score the models. And a significant jump on
| this benchmark sure would increase a company's valuation...
|
| Happy to be wrong on this.
| OsrsNeedsf2P wrote:
| At $6,670/task? I hope there's a jump
| og_kalu wrote:
| It's not $6,670/task. That was the high-efficiency cost for
| 400 questions.
| HarHarVeryFunny wrote:
| You're talking apples and oranges. The plateau the frontier
| models have hit is the limited further gains to be had from
| dataset (+ corresponding model/compute) scaling.
|
| These new reasoning models are taking things in a new direction
| basically by adding search (inference time compute) on top of
| the basic LLM. So, the capabilities of the models are still
| improving, but the new variable is how deep of a search you
| want to do (how much compute to throw at it at inference time).
| Do you want your chess engine to do a 10 ply search or 20 ply?
| What kind of real world business problems will benefit from
| this?
| og_kalu wrote:
| "New" reasoning models are plain LLMs with clever
| reinforcement learning. o1 is itself reinforcement learning
| on top of GPT-4o.
|
| They found a way to make test time compute a lot more
| effective and that is an advance but the idea is not new, the
| architecture is not new.
|
| And the vast majority of people convinced LLMs plateaued did
| so regardless of test time compute.
| HarHarVeryFunny wrote:
| The fact that these reasoning models may compute for
| extended durations, using exponentially more compute for
| linear performance gains (says OpenAI), resulting in
| outputs that while better are not necessarily any longer
| (more tokens) than before, all point to a different
| architecture - some type of iterative calling of the
| underlying model (essentially a reasoning agent using the
| underlying model).
|
| A plain LLM does not use variable compute - it is a fixed
| number of transformer layers, a fixed amount of compute for
| every token generated.
| throwaway314155 wrote:
| Architecture generally refers to the design of the model.
| In this case, the underlying model is still a transformer
| based llm and so is its architecture.
|
| What's different is the method for _sampling_ from that
| model where it seems they have encouraged the underlying
| LLM to perform a variable length chain of thought
| "conversation" with itself as has been done with o1. In
| addition, they _repeat_ these chains of thought in
| parallel using a tree of some sort to search and rank the
| outputs. This apparently scales performance on benchmarks
| as you scale both length of the chain of thought and the
| number of chains of thought.
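|
| In toy Python, the general shape of that idea (nobody outside
| OpenAI knows the real internals; sample_chain_of_thought and
| score_candidate here are made-up stand-ins, not anyone's
| actual API):
|
|     import random
|
|     def sample_chain_of_thought(prompt):
|         """Stand-in LLM call: one sampled reasoning trace + answer."""
|         answer = random.choice(["A", "B", "B", "C"])
|         return f"...reasoning ending in {answer}...", answer
|
|     def score_candidate(reasoning, answer):
|         """Stand-in for a learned verifier ranking each trace."""
|         return random.random()
|
|     def solve(prompt, n_chains=16):
|         # Sample many chains of thought in parallel.
|         candidates = [sample_chain_of_thought(prompt)
|                       for _ in range(n_chains)]
|         # Best-of-N: keep the answer whose trace scores highest
|         # (a majority vote over final answers is the other option).
|         _, best = max(candidates, key=lambda c: score_candidate(*c))
|         return best
|
| Either way, performance scales with how many chains you sample
| and how long each one runs, which is where the extra test-time
| compute goes.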
| HarHarVeryFunny wrote:
| No disagreement, although the sampling + search procedure
| is obviously adding quite a lot to the capabilities of
| the system as a whole, so it really _should_ be
| considered as part of the architecture. It's a bit like
| AlphaGo or AlphaZero - generating potential moves (cf
| LLM) is only a component of the overall solution
| architecture, and the MCTS sampling/search is equally (or
| more) important.
| og_kalu wrote:
| I think throwaway already explained what i was getting
| at.
|
| That said, i probably did downplay the achievement. It
| may not be a "new" idea to do something like this but
| finding an effective method for reflection that doesn't
| just lock you into circular thinking and is applicable
| beyond well defined problem spaces is genuinely tough and
| a breakthrough.
| maxdoop wrote:
| How much longer can I get paid $150k to write code ?
| tsunamifury wrote:
| Often what happens is the golf-course phenomenon. As golfing
| gets less popular, low and mid tier golf courses go out of
| business as they simply aren't needed. But at the same time
| demand for high end golf courses actually skyrockets because
| people who want to golf can either give it up or go higher end.
|
| This I think will happen with programmers. Rote programming
| will slowly die out, while demand for the super high end, and
| its price, will go up dramatically.
| CapcomGo wrote:
| Where does this golf-course phenomenon come from? It doesn't
| really match the real world or how golfing works.
| tsunamifury wrote:
| How so? I witnessed it quite directly in California. The
| majority have closed, and the remaining ones have gone up in
| price and are upscale. This has been covered in various news
| programs like 60 Minutes. You can look up the death of
| golfing.
|
| Also unsure what you mean by...'how golfing works'. This is
| the economics of it, not the game
| EVa5I7bHFq9mnYK wrote:
| Maybe it's a CA thing? Plenty of $50 golf courses here in
| Phoenix.
| colesantiago wrote:
| Frontier expert specialist programmers will always be in
| demand.
|
| Generalist junior and senior engineers will need to think of a
| different career path in less than 5 years as more layoffs will
| reduce the software engineering workforce.
|
| It looks like it may be the way things are if progress in the
| o1, o3, oN models and other LLMs continues on.
| deadbabe wrote:
| This assumes that software products in the future will remain
| at the same complexity as they are today, just with AI
| building them out.
|
| But they won't. AI will enable building even _more_ complex
| software which, counterintuitively, will result in needing
| even _more_ human jobs to deal with this added complexity.
|
| Think about how despite an increasing amount of free open
| source libraries over time enabling some powerful stuff
| easily, developer jobs have only increased, not decreased.
| dmm wrote:
| I've made a similar argument in the past but now I'm not so
| sure. It seems to me that developer demand was linked to
| large expansions in software demand first from PCs then the
| web and finally smartphones.
|
| What if software demand is largely saturated? It seems the
| big tech companies have struggled to come up with the next
| big tech product category, despite lots of talent and
| capital.
| deadbabe wrote:
| There doesn't need to be a new category. Existing
| categories can just continue bloating in complexity.
|
| Compare the early web vs the complicated JavaScript laden
| single page application web we have now. You need way
| more people now. AI will make it even worse.
|
| Consider that in the AI driven future, there will be no
| more frameworks like React. Who is going to bother
| writing one? Instead every company will just have their
| own little custom framework built by an AI that works
| only for their company. Joining a new company means you
| bring generalist skills and learn how their software
| works from the ground up and when you leave to another
| company that knowledge is instantly useless.
|
| Sounds exciting.
|
| But there's also plenty of unexplored categories anyway
| that we can't access still because there's insufficient
| technology for. Household robots with AGI for instance
| may require instructions for specific services sold as
| "apps" that have to be designed and developed by
| companies.
| bandwidth-bob wrote:
| The new capabilities of LLMs, and generally large
| foundation models, _expands_ the range of what a computer
| program can do. Naturally, we will need to build all of
| those things with code. Which will be done by a combo of
| people with product ideas, engineers, and LLMs. There
| will then be specialization and competition on each new
| use-case, e.g., who builds the best AI doctor, etc.
| hackinthebochs wrote:
| What about "general" in AGI do you not understand? There
| will be no new style of development for which the AGI will
| be poorly suited that all the displaced developers can move
| to.
| bandwidth-bob wrote:
| For true AGI (whatever that means, let's say it fully
| replicates human abilities), discussing "developers" only
| is a drop in the bucket compared to all knowledge work
| jobs which will be displaced.
| cruffle_duffle wrote:
| This is exactly what will happen. We'll just up the
| complexity game to entirely new baselines. There will
| continue to be good money in software.
|
| These models are tools to help engineers, not replacements.
| Models cannot, on their own, build novel new things no
| matter how much the hype suggests otherwise. What they can
| do is remove a hell of a lot of accidental complexity.
| lagrange77 wrote:
| > These models are tools to help engineers, not
| replacements. Models cannot, on their own, build novel
| new things no matter how much the hype suggests
| otherwise.
|
| But maybe models + managers/non technical people can?
| mitjam wrote:
| The question is: How to become a senior when there is no
| place to be a junior? Will future SWE need to do the 10k
| hours as a hobby? Will AI speed up or slow down learning?
| singularity2001 wrote:
| Good question, and I think you gave the correct answer: yes,
| people will just do the 10,000 hours required by starting
| programming at the age of eight and then playing around
| until they're done studying.
| prmph wrote:
| I'll believe the models can take the jobs of programmers when
| they can generate a sophisticated iOS app based on some simple
| prompts, ready for building and publication in the app store.
| That is nowhere near the horizon no matter how much things are
| hyped up, and it may well never arrive.
| timenotwasted wrote:
| The absolutist type comments are such a wild take given how
| often they are so wrong.
| tsunamifury wrote:
| Totally... simple increases of 20% in efficiency will already
| significantly destroy demand for coders. This forum however
| will be resistant to admitting such economic phenomena.
|
| Look at video bay editing after the advent of Final Cut.
| Significant drop in the specialized requirement as a
| professional field, even while content volume went up
| dramatically.
| exitb wrote:
| Computing has been transforming countless jobs before it
| got to Final Cut. On one hand, programming is not the
| hardest job out there. On the other, it takes months to
| fully onboard a human developer - a person that already
| has years of relevant education and work experience.
| There are desk jobs that onboard new hires in days
| instead. Let's see when they're displaced by AI first.
| tsunamifury wrote:
| Don't know if you noticed but thats already happening.
| Mass layoffs in customer service etc have already
| happened over the last 2 years
| exitb wrote:
| So, how does it work out? Are the customers happy? Are
| the bosses at my work going to be equally happy with my
| AI replacement?
| EVa5I7bHFq9mnYK wrote:
| That's until AI has improved enough that it can
| automatically navigate the menus to get me a human
| operator to talk to.
| derektank wrote:
| I could be misreading this, but as far as I can tell,
| there are more video and film editors today (29,240) than
| there were film editors in 1997 (9,320). Seems like an
| example of improved productivity shifting the skills
| required but ultimately driving greater demand for the
| profession as a whole. Salaries don't seem to have been
| hurt either, median wage was $35,214 in '97 and $66,600
| today, right in line with inflation.
|
| https://www.bls.gov/oes/2023/may/oes274032.htm
|
| https://www.bls.gov/oes/tables.htm
| vouaobrasil wrote:
| Nah, it will arrive. And regardless, this sort of AI reduces
| the skill level required to make the app. It reduces the
| number of people required and thus reduces the demand for
| engineers. So, even though AI is not CLOSE to what you are
| suggesting, it can significantly reduce the salaries of those
| that ARE required. So maybe fewer $150K programmers will be
| hired with the same revenue for even higher profits.
|
| The most bizarre thing is that programmers are literally
| writing code to replace themselves because once this AI
| started, it was a race to the bottom and nobody wants to be
| last.
| skydhash wrote:
| > Nah, it will arrive
|
| Will it?
|
| It's already hard to get people to use computers as they are
| right now, where you only need to click on things and no
| longer have to enter commands. That's because most people
| don't like to engage in formal reasoning. Even with some of
| the most intuitive computer-assisted tasks (drawing and 3D
| modeling), there's so much theory to learn that few people
| bother.
|
| Programming has always been easy to learn, and tools to
| automate coding have existed for decades now. But how many
| people you know have had the urge to learn enough to
| automate their tasks?
| prmph wrote:
| They've been promising us this thing since the 60s: End-
| user development, 5GLs, etc. enabling the average Joe to
| develop sophisticated apps in minimal time. And it never
| arrives.
|
| I remember attending a tech fair decades ago, and at one
| stand they were vending some database products. When I
| mentioned that I was studying computer science with a focus
| on software engineering, they sneered that coding will be
| much less important in the future since powerful databases
| will minimize the need for a lot of data wrangling in
| applications with algorithms.
|
| What actually happened is that the demand for programmers
| increased, and software ate the world. I suspect something
| similar will happen with the current AI hype.
| vouaobrasil wrote:
| Well, I think in the 60s we also didn't have LLMs that
| could actually write complete programs, either.
| mirsadm wrote:
| No one writes a "complete program" these days. Things
| just keep evolving forever. I spent more time than I
| care to admit dealing with dependencies of libraries
| which change seemingly on a daily basis these days. These
| predictions are so far off reality it makes me wonder if
| the people making them have ever written any code in
| their life.
| vouaobrasil wrote:
| That's fair. Well, I've written a lot of code. But
| anyway, I do want to emphasize the following. I am not
| making the same prediction as some that say AI can
| replace a programmer. Instead, I am saying: combination
| of AI plus programmers will reduce the need for the
| number or programmers, and hence allow the software
| industry to exist with far fewer people, with the lucky
| ones accumulating even more wealth.
| whynotminot wrote:
| > They've been promising us this thing since the 60s:
| End-user development, 5GLs, etc. enabling the average Joe
| to develop sophisticated apps in minimal time. And it
| never arrives.
|
| This has literally already arrived. Average Joes _are_
| writing software using LLMs right now.
| arrosenberg wrote:
| Source? Which software products are built without
| engineers?
| Jensson wrote:
| Personal websites etc, you don't think about them as
| software products since they weren't built by engineers,
| but 30 years ago you needed engineers to build those
| things.
| arrosenberg wrote:
| Ok, well I'm not going to worry about my job then. 25
| years ago GeoCities existed and you didn't need an
| engineer. 10 year old me was writing functional HTML,
| definitely not an engineer at that point.
| whynotminot wrote:
| To be honest maybe no one should worry.
|
| If AI truly overtakes knowledge work there's not much we
| could reasonably do to prepare for it.
|
| If AI never gets there though, then you saved yourself
| the trouble of stressing about it. So sure, relax, it's
| just the second coming of GeoCities.
| hatefulmoron wrote:
| I think the fear comes from the span of time. If my job
| is obsolete at the same time as everybody else's, I
| wouldn't care. I mean, sure, the world is in for a very
| tough time, but I would be in good company.
|
| The really bad situation is if my entire skill set is
| made obsolete while the rest of the world keeps going for
| a decade or two. Or maybe longer, who knows.
|
| I realize I'm coming across quite selfish, but it's just
| a feeling.
| deadbabe wrote:
| There's a very good chance that if a company can replace its
| programmers with pure AI then it means whatever they're doing
| is probably already being offered as a SaaS product so why not
| just skip the AI and buy that? Much cheaper and you don't have
| to worry about dealing with bugs.
| croemer wrote:
| SaaS works for general problems faced by many businesses.
| deadbabe wrote:
| Exactly. Most businesses can get away with not having
| developers at all if they just glue together the right
| combination of SaaS products. But this doesn't happen,
| implying there is something more about having your own
| homegrown developers that SaaS cannot replace.
| croemer wrote:
| The risk is not SaaS replacing internal developers. It's
| about increased productivity of developers reducing the
| number of developers needed to achieve something.
| deadbabe wrote:
| Again, you're assuming product complexity won't grow as a
| result of new AI tools.
|
| 3 decades ago you needed a big team to create the type of
| video games that one person can probably make on their
| own today in their spare time with modern tools.
|
| But now modern tools have been used to make even more
| complicated games that require more massive teams than
| ever and huge amounts of money. One person has no hope of
| replicating that now, but maybe in the future with AI
| they can. And then the AAA games will be even _more_
| advanced.
|
| It will be similar with other software.
| sss111 wrote:
| 3 to 5 years, max. Traditional coding is going to be dead in
| the water. Optimistically, the junior SWE job will evolve but
| more realistically dedicated AI-based programming agents will
| end demand for Junior SWEs
| lagrange77 wrote:
| Which implies that a few years later they will not become
| senior SWEs either.
| torginus wrote:
| Well, considering they floated the $2000 subscription idea, and
| they still haven't revealed everything, they could still
| introduce the $2k sub with o3+agents/tool use, which means,
| till about next week.
| arrosenberg wrote:
| Unless the LLMs see multiple leaps in capability, probably
| indefinitely. The Malthusians in this thread seem to think that
| LLMs are going to fix the human problems involved in executing
| these businesses - they won't. They make good programmers more
| productive and will cost some jobs at the margins, but it will
| be the low-level programming work that was previously
| outsourced to Asia and South America for cost-arbitrage.
| mrdependable wrote:
| I think they will have to figure out how to get around context
| limits before that happens. I also wouldn't be surprised if the
| future models that can actually replace workers are sold at
| such an exorbitant price that only larger companies will be
| able to afford it. Everyone else gets access to less capable
| models that still require someone with knowledge to get to an
| end result.
| kirykl wrote:
| If it's any consolation, Agile priests and middle managers will
| be the first to go
| HarHarVeryFunny wrote:
| You're not being paid $150K to "write code". You're being paid
| that to deliver solutions - to be a corporate cog that can
| ingest business requirements and emit (and maintain) business
| solutions.
|
| If there are jobs paying $150K just to code (someone else tells
| you what to code, and you just code it up), then please share!
| braden-lk wrote:
| If people constantly have to ask if your test is a measure of
| AGI, maybe it should be renamed to something else.
| OfficialTurkey wrote:
| From the post
|
| > Passing ARC-AGI does not equate achieving AGI, and, as a
| matter of fact, I don't think o3 is AGI yet. o3 still fails on
| some very easy tasks, indicating fundamental differences with
| human intelligence.
| cchance wrote:
| It's funny when they say this, as if all humans can solve
| basic-ass question/answer combos. People seem to forget
| there's a percentage of the population that honestly believes
| the world is flat, along with other hallucinations at the
| human level.
| jppittma wrote:
| I don't believe AGI at that level has any commercial value.
| Jensson wrote:
| Humans work in groups, so you are wrong: a group of humans
| is extremely reliable on tons of tasks. These AI models
| also work in groups (or they don't improve from working in
| a group, since the company uses whatever does best on
| the benchmark), so it is only fair to compare AI vs a group
| of people. Comparing AI to an individual will always be an
| unfair comparison, since an AI is never alone.
| modeless wrote:
| Congratulations to Francois Chollet on making the most
| interesting and challenging LLM benchmark so far.
|
| A lot of people have criticized ARC as not being relevant or
| indicative of true reasoning, but I think it was exactly the
| right thing. The fact that scaled reasoning models are finally
| showing progress on ARC proves that what it measures really is
| relevant and important for reasoning.
|
| It's obvious to everyone that these models can't perform as well
| as humans on everyday tasks despite blowout scores on the hardest
| tests we give to humans. Yet nobody could quantify exactly the
| ways the models were deficient. ARC is the best effort in that
| direction so far.
|
| We don't need more "hard" benchmarks. What we need right now are
| "easy" benchmarks that these models nevertheless fail. I hope
| Francois has something good cooked up for ARC 2!
| dtquad wrote:
| Are there any single-step non-reasoner models that do well on
| this benchmark?
|
| I wonder how well the latest Claude 3.5 Sonnet does on this
| benchmark and if it's near o1.
| throwaway71271 wrote:
| | Name                                  | Semi-private eval | Public eval |
| |---------------------------------------|-------------------|-------------|
| | Jeremy Berman                         | 53.6%             | 58.5%       |
| | Akyurek et al.                        | 47.5%             | 62.8%       |
| | Ryan Greenblatt                       | 43%               | 42%         |
| | OpenAI o1-preview (pass@1)            | 18%               | 21%         |
| | Anthropic Claude 3.5 Sonnet (pass@1)  | 14%               | 21%         |
| | OpenAI GPT-4o (pass@1)                | 5%                | 9%          |
| | Google Gemini 1.5 (pass@1)            | 4.5%              | 8%          |
|
| https://arxiv.org/pdf/2412.04604
| kandesbunzler wrote:
| why is this missing the o1 release / o1 pro models? Would
| love to know how much better they are
| Freebytes wrote:
| This might be because they are referencing single step,
| and I do not think o1 is single step.
| aimanbenbaha wrote:
| Akyurek et al uses test-time compute.
| YetAnotherNick wrote:
| Here are the results for base models[1]:
|
|   o3 (coming soon)     75.7%   82.8%
|   o1-preview           18%     21%
|   Claude 3.5 Sonnet    14%     21%
|   GPT-4o               5%      9%
|   Gemini 1.5           4.5%    8%
|
| Score (semi-private eval) / Score (public eval)
|
| [1]: https://arcprize.org/2024-results
| simonw wrote:
| I'd love to know how Claude 3.5 Sonnet does so well despite
| (presumably) not having the same tricks as the o-series
| models.
| Bjorkbat wrote:
| It's easy to miss, but if you look closely at the first
| sentence of the announcement they mention that they used a
| version of o3 trained on a public dataset of ARC-AGI, so
| technically it doesn't belong on this list.
| dot1x wrote:
| It's all a scam. ClosedAI trained on the data they were
| tested on, so no, nothing here is impressive.
| refulgentis wrote:
| This emphasizes persons and a self-conceived victory narrative
| over the ground truth.
|
| Models have regularly made progress on it, this is not new with
| the o-series.
|
| Doing astoundingly well on it, and having a mutually shared PR
| interest with OpenAI in this instance, doesn't mean a pile of
| visual puzzles is actually AGI or some well thought out and
| designed benchmark of True Intelligence(tm). It's one type of
| visual puzzle.
|
| I don't mean to be negative, but to inject a memento mori. Real
| story is some guys get together and ride off Chollet's name
| with some visual puzzles from ye olde IQ test, and the deal was
| Chollet then gets to show up and say it proves program
| synthesis is required for True Intelligence.
|
| Getting this score is extremely impressive but I don't assign
| more signal to it than any other benchmark with some thought to
| it.
| modeless wrote:
| Solving ARC doesn't mean we have AGI. Also o3 presumably
| isn't doing program synthesis, seemingly proving Francois
| wrong on that front. (Not sure I believe the speculation
| about o3's internals in the link.)
|
| What I'm saying is the fact that as models are getting better
| at reasoning they are also scoring better on ARC proves that
| it _is_ measuring something relating to reasoning. And nobody
| else has come up with a comparable benchmark that is so easy
| for humans and so hard for LLMs. Even today, let alone five
| years ago when ARC was released. ARC was visionary.
| hdjjhhvvhga wrote:
| Your argumentation seems convincing but I'd like to offer a
| competitive narrative: any benchmark that is public becomes
| completely useless because companies optimize for it -
| especially in AI, which depends on piles of money, so the
| companies need some proof that they are making progress.
|
| That's why I have some private benchmarks and I'm sorry to
| say that the transition from GPT-4 to o1 wasn't
| unambiguously a step forward (in some tasks yes, in some
| not).
|
| On the other hand, private benchmarks are even less useful
| to the general public than the public ones, so we have to
| deal with what we have - but many of us just treat it as
| noise and don't give it much significance. Ultimately, the
| models should defend themselves by performing the tasks
| individual users want them to do.
| stonemetal12 wrote:
| Rather, any logic puzzle you post on the internet as
| something AIs are bad at ends up in the next round of
| training data, so AIs get better at that specific question.
| Not because AI companies are optimizing for a benchmark but
| because they suck up everything.
| modeless wrote:
| ARC has two test sets that are not posted on the
| Internet. One is kept completely private and never
| shared. It is used when testing open source models and
| the models are run locally with no internet access. The
| other test set is used when testing closed source models
| that are only available as APIs. So it could be leaked in
| theory, but it is still not posted on the internet and
| can't be in any web crawls.
|
| You could argue that the models can get an advantage by
| looking at the training set which is on the internet. But
| all of the tasks are unique and generalizing from the
| training set to the test set is the whole point of the
| benchmark. So it's not a serious objection.
| foobiekr wrote:
| Given the delivery mechanism for OpenAI, how do they
| actually keep it private?
| modeless wrote:
| > So it could be leaked in theory
|
| That's why they have two test sets. But OpenAI has
| legally committed to not training on data passed to the
| API. I don't believe OpenAI would burn their reputation
| and risk legal action just to cheat on ARC. And what
| they've reported is not implausible IMO.
| sensanaty wrote:
| Yeah I'm sure the Microsoft-backed company headed by Mr.
| Worldcoin Altman whose sole mission statement so far has
| been to overhype every single product they released
| wouldn't _dare_ cheat on one of these benchmarks that
| "prove" AGI (as they've been claiming since GPT-2).
| QuantumGood wrote:
| Gaming the benchmarks usually needs to be considered first
| when evaluating new results.
| chaps wrote:
| Honestly, is gaming benchmarks actually a problem in this
| space in that it still shows something useful? Just means
| we need more benchmarks, yeah? It really feels not unlike
| Kaggle competitions.
|
| We do the same exact stuff with real people with
| programming challenges and such where people just study
| common interview questions rather than learning the
| material holistically. And since we know that people game
| these interview type questions, we can adjust the
| interview processes to minimize gamification.... which
| itself leads to gamification and back to step one. That's
| not an ideal feedback loop of course, but people
| still get jobs and churn out "productive work" out of it.
| ben_w wrote:
| AI are very good at gaming benchmarks. Both as
| overfitting and as Goodhart's law, gaming benchmarks has
| been a core problem during training for as long as I've
| been interested in the field.
|
| Sometimes this manifests as "outside the box thinking",
| like how a genetic algorithm got an "oscillator" which
| was really just an antenna.
|
| It is a hard problem, and yes we still both need and can
| make more and better benchmarks; but it's still a problem
| because it means the benchmarks we do have are
| overstating competence.
| CamperBob2 wrote:
| The _idea_ behind this particular benchmark, at least, is
| that it can't be gamed. What are some ways to game ARC-
| AGI, meaning to pass it without developing the required
| internal model and insights?
|
| In principle you can't optimize specifically for ARC-AGI,
| train against it, or overfit to it, because only a few of
| the puzzles are publicly disclosed.
|
| Whether it lives up to that goal, I don't know, but their
| approach sounded good when I first heard about it.
| psb217 wrote:
| Well, with billions in funding you could task a hundred
| or so very well paid researchers to do their best at
| reverse engineering the general thought process which
| went into ARC-AGI, and then generate fresh training data
| and labeled CoTs until the numbers go up.
| CamperBob2 wrote:
| Right, but the ARC-AGI people would counter by saying
| they're welcome to do just that. In doing so -- again in
| their view -- the researchers would create a model that
| could be considered capable of AGI.
|
| I spent a couple of hours looking at the publicly-
| available puzzles, and was really impressed at how much
| room for creativity the format provides. Supposedly the
| puzzles are "easy for humans," but some of them were
| not... at least not for me.
|
| (It did occur to me that a better test of AGI might be
| the ability to generate new, innovative ARC-AGI puzzles.)
| psb217 wrote:
| It's tricky to judge the difficulty of these sorts of
| things. Eg, breadth of possibilities isn't an automatic
| sign of difficulty. I imagine the space of programming
| problems permits as much variety as ARC-AGI, but since
| we're more familiar with problems presented as natural
| language descriptions of programming tasks, and since we
| know there's tons of relevant text on the web, we see the
| abstract pictographic ARC-AGI tasks as more novel,
| challenging, etc. But, to an LLM, any task we can
| conceive of will be (roughly) as familiar as the amount
| of relevant training data it's seen. It's legitimately
| hard to internalize this.
|
| For a space of tasks which are well-suited to
| programmatic generation, as ARC-AGI is by design, if we
| can do a decent job of reverse engineering the underlying
| problem generating grammar, then we can make an LLM as
| familiar with the task as we're willing to spend on
| compute.
|
| To be clear, I'm not saying solving these sorts of tasks
| is unimpressive. I'm saying that I find it unsuprising
| (in light of past results) and not that strong of a
| signal about further progress towards the singularity, or
| FOOM, or whatever. For any of these closed-ish domain
| tasks, I feel a bit like they're solving Go for the
| umpteenth time. We now know that if you collect enough
| relevant training data and train a big enough model with
| enough GPUs, the training loss will go down and you'll
| probably get solid performance on the test set. Trillions
| of reasonably diverse training tokens buys you a lot of
| generalization. Ie, supervised learning works. This is
| the horse Ilya Sutskever's ridden to many glorious
| victories and the big driver of OpenAI's success -- a
| firm belief that other folks were leaving A LOT of
| performance on the table due to a lack of belief in the
| power of their own inventions.
| chaps wrote:
| We're in agreement!
|
| What's endlessly interesting to me with all of this is
| how surprisingly quick the benchmarking feedback loops
| have become plus the level of scrutiny each one receives.
| We (as a culture/society/whatever) don't really treat
| human benchmarking criteria with the same scrutiny such
| that feedback loops are useful and lead to productive
| changes to the benchmarking system itself. So from that
| POV it feels like substantial progress continues to be
| made through these benchmarks.
| bubblyworld wrote:
| I think gaming the benchmarks is _encouraged_ in the ARC
| AGI context. If you look at the public test cases you'll
| see they test a ton of pretty abstract concepts - space,
| colour, basic laws of physics like gravity/magnetism,
| movement, identity and lots of other stuff (highly
| recommend exploring them). Getting an AI to do well _at
| all_, regardless of whether it was gamed or not, is the
| whole challenge!
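|
| If you want to poke at them programmatically: the public tasks
| are small JSON files (see the fchollet/ARC repo on GitHub),
| roughly of this shape, where grids are just 2D arrays of colour
| codes 0-9 (the example task here is made up):
|
|     {
|       "train": [
|         {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
|         {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
|       ],
|       "test": [
|         {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
|       ]
|     }
|
| A solver sees the "train" pairs plus the test "input" and has
| to produce the test "output".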
| refulgentis wrote:
| > Solving ARC doesn't mean we have AGI. Also o3 presumably
| isn't doing program synthesis, seemingly proving Francois
| wrong on that front.
|
| Agreed.
|
| > And nobody else has come up with a comparable benchmark
| that is so easy for humans and so hard for LLMs.
|
| ? There's plenty.
| modeless wrote:
| I'd love to hear about more. Which ones are you thinking
| of?
| refulgentis wrote:
| - "Are You Human" https://arxiv.org/pdf/2410.09569 is
| designed to be directly on target, i.e. a cross-cutting set
| of questions that are easy for humans but challenging
| for LLMs, instead of one type of visual puzzle. Much
| better than ARC for the purpose you're looking for.
|
| - SimpleBench https://simple-bench.com/ (similar to
| above; great landing page w/scores that show human / ai
| gap)
|
| - PIQA (physical question answering, i.e. "how do i get a
| yolk out of a water bottle", common favorite of local llm
| enthusiasts in /r/localllama
| https://paperswithcode.com/dataset/piqa
|
| - Berkeley Function-Calling (I prefer
| https://gorilla.cs.berkeley.edu/leaderboard.html)
|
| AI search googled "llm benchmarks challenging for ai easy
| for humans", and "language model benchmarks that humans
| excel at but ai struggles with", and "tasks that are easy
| for humans but difficult for natural language ai".
|
| It also mentioned Moravec's Paradox is a known framing of
| this concept, started going down that rabbit hole because
| the resources were fascinating, but, had to hold back and
| submit this reply first. :)
| modeless wrote:
| Thanks for the pointers! I hadn't seen Are You Human.
| Looks like it's only two months old. Of course it is much
| easier to design a test specifically to thwart LLMs now
| that we have them. It seems to me that it is designed to
| exploit details of LLM structure like tokenizers (e.g.
| character counting tasks) rather than to provide any sort
| of general reasoning benchmark. As such it seems
| relatively straightforward to improve performance in ways
| that wouldn't necessarily represent progress in general
| reasoning. And today's LLMs are not nearly as far from
| human performance on the benchmark as they were on ARC
| for many years after it was released.
|
| SimpleBench looks more interesting. Also less than two
| months old. It doesn't look as challenging for LLMs as
| ARC, since o1-preview and Sonnet 3.5 already got half of
| the human baseline score; they did much worse on ARC. But
| I like the direction!
|
| PIQA is cool but not hard enough for LLMs.
|
| I'm not sure Berkeley Function-Calling represents tasks
| that are "easy" for average humans. Maybe programmers
| could perform well on it. But I like ARC in part because
| the tasks do seem like they should be quite
| straightforward even for non-expert humans.
|
| Moravec's paradox isn't a benchmark per se. I tend to
| believe that there is no real paradox and all we need is
| larger datasets to see the same scaling laws that we have
| for LLMs. I see good evidence in this direction:
| https://www.physicalintelligence.company/blog/pi0
| refulgentis wrote:
| > "I'm not sure Berkeley Function-Calling represents
| tasks that are easy for average humans. Maybe programmers
| could perform well on it."
|
| Functions in this context are not programming function
| calls. In this context, function calls are a now-
| deprecated LLM API name for "parse input into this JSON
| template." No programmer experience needed. Entity
| extraction by another name, except, that'd be harder:
| here, you're told up front exactly the set of entities to
| identify. :)
|
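| Loosely, an item of that kind looks like this (illustrative
| only, not an actual BFCL example): the model is handed a
| schema and a request, and graded on whether it emits the
| right structured call.
|
|     schema = {
|         "name": "get_weather",
|         "description": "Look up the current weather",
|         "parameters": {
|             "type": "object",
|             "properties": {"city": {"type": "string"}},
|             "required": ["city"],
|         },
|     }
|     user_request = "Is it hot in Phoenix right now?"
|     # Expected output, i.e. the "parse into this JSON template" step:
|     expected = {"name": "get_weather",
|                 "arguments": {"city": "Phoenix"}}
|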
| > "Moravec's paradox isn't a benchmark per se."
|
| Yup! It's a paradox :)
|
| > "Of course it is much easier to design a test
| specifically to thwart LLMs now that we have them"
|
| Yes.
|
| Though, I'm concerned a simple yes might be insufficient
| for illumination here.
|
| It is a tautology (it's easier to design a test that $X
| fails when you have access to $X), and it's unlikely you
| meant to just share a tautology.
|
| A potential unstated-but-maybe-intended-communication is
| "it was hard to come up with ARC before LLMs existed" ---
| LLMs existed in 2019 :)
|
| If they didn't, a hacky way to come up with a test that's
| hard for the top AIs at the time, BERT-era, would be to
| use one type of visual puzzle.
|
| If, for conversation's sake, we ignore that it is exactly
| one type of visual puzzle, and that it wasn't designed to
| be easy for humans, then we can engage with: "its the
| only one thats easy for humans, but hard for LLMs" ---
| this was demonstrated as untrue as well.
|
| I don't think I have much to contribute past that, once
| we're at "It is a singular example of a benchmark that's
| easy for humans but nigh-impossible for LLMs, at least in
| 2019, and this required singular insight", there's just
| too much that's not even wrong, in the Pauli sense, and
| it's in a different universe from the original claims:
|
| - "Congratulations to Francois Chollet on making the most
| interesting and challenging LLM benchmark so far."
|
| - "A lot of people have criticized ARC as not being
| relevant or indicative of true reasoning...The fact that
| [o-series models show progress on ARC] proves that what it
| measures really is relevant and important for reasoning."
|
| - "...nobody could quantify exactly the ways the models
| were deficient..."
|
| - "What we need right now are "easy" benchmarks that
| these models nevertheless fail."
| CamperBob2 wrote:
| How long has SimpleBench been posted? Out of the first 6
| questions at https://simple-bench.com/try-yourself,
| o1-pro got 5/6 right.
|
| It was interesting to see how it failed on question 6:
| https://chatgpt.com/c/6765e70e-44b0-800b-97bd-928919f04fbe
|
| Apparently LLMs do not consider global thermonuclear war
| to be all that big a deal, for better or worse.
| Pannoniae wrote:
| Don't worry, I also got that wrong :) I thought her
| affair would be the biggest problem for John.
| jquery wrote:
| John was an ex, not her partner. Tricky.
| HarHarVeryFunny wrote:
| > o3 presumably isn't doing program synthesis
|
| I'd guess it's doing natural language procedural synthesis,
| the same way a human might (i.e. figuring the sequence of
| steps to effect the transformation), but it may well be
| doing (sub-)solution verification by using the procedural
| description to generate code whose output can then be
| compared to the provided examples.
|
| While OpenAI haven't said exactly what the architecture of
| o1/o3 is, the gist of it is pretty clear - basically
| adding "tree" search and iteration on top of the underlying
| LLM, driven by some RL-based post-training that imparts
| generic problem solving biases to the model. Maybe there is
| a separate model orchestrating the search and solution
| evaluation.
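|
| For illustration, here is a minimal sketch of the kind of
| sample-and-verify loop being speculated about (purely my own
| toy example; toy_proposer stands in for whatever the model
| actually does, and none of this is OpenAI's published
| method):
|
|     import random
|
|     def sample_and_verify(train_pairs, propose, n_samples=16):
|         """Sample candidate transformation programs, keep the
|         ones that reproduce every training input->output pair."""
|         survivors = []
|         for _ in range(n_samples):
|             program = propose(train_pairs)  # candidate callable
|             if all(program(x) == y for x, y in train_pairs):
|                 survivors.append(program)
|         return survivors
|
|     # Toy stand-in for the model: guess one of a few simple rules.
|     def toy_proposer(train_pairs):
|         rules = [lambda g: g[::-1],                   # reverse rows
|                  lambda g: [row[::-1] for row in g],  # mirror rows
|                  lambda g: g]                         # identity
|         return random.choice(rules)
|
|     train = [([[1, 2], [3, 4]], [[3, 4], [1, 2]])]    # rows reversed
|     good = sample_and_verify(train, toy_proposer)
|     if good:
|         print(good[0]([[5, 6], [7, 8]]))              # [[7, 8], [5, 6]]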
|
| I think there are many tasks that are easy enough for
| humans but hard/impossible for these models - the ultimate
| one in terms of commercial value would be to take an "off
| the shelf model" and treat it as an intern/apprentice and
| teach it to become competent in an entire job it was never
| trained on. Have it participate in team meetings and
| communications, and become a drop-in replacement for a
| human performing that job (any job that can be performed
| remotely without a physical presence).
| stego-tech wrote:
| I won't be as brutal in my wording, but I agree with the
| sentiment. This was something drilled into me as someone with
| a hobby in PC Gaming _and_ Photography: benchmarks, while
| handy measures of _potential_ capabilities, are not
| _guarantees_ of real world performance. Very few PC gamers
| completely reinstall the OS before benchmarking to remove all
| potential cruft or performance impacts, just as very few
| photographers exclusively take photos of test materials.
|
| While I appreciate the benchmark and its goals (not to
| mention the puzzles - I quite enjoy figuring them out),
| successfully passing this benchmark does not demonstrate or
| guarantee real world capabilities or performance. This is why
| I increasingly side-eye this field and its obsession with
| constantly passing benchmarks and then moving the goal posts
| to a newer, harder benchmark that claims to be a better
| simulation of human capabilities than the last one: it reeks
| of squandered capital and a lack of a viable/profitable
| product, at least to my sniff test. Rather than simply
| capitalize on their actual accomplishments (which LLMs are -
| natural language interaction is huge!), they're trying to
| prove to Capital that with a few (hundred) billion more in
| investments, they can make AGI out of this and replace all
| those expensive humans.
|
| They've built the most advanced prediction engines ever
| conceived, and insist they're best used to replace labor. I'm
| not sure how they reached that conclusion, but considering
| even their own models refute this use case for LLMs, I doubt
| their execution ability on that lofty promise.
| danielmarkbruce wrote:
| 100%. The hype is misguided. I doubt half the people excited
| about the result have even looked at what the benchmark is.
| lossolo wrote:
| > making the most interesting and challenging LLM benchmark so
| far.
|
| This[1] is currently the most challenging benchmark. I would
| like to see how O3 handles it, as O1 solved only 1%.
|
| 1. https://epoch.ai/frontiermath/the-benchmark
| pynappo wrote:
| Apparently o3 scored about 25%
|
| https://youtu.be/SKBG1sqdyIU?t=4m40s
| FiberBundle wrote:
| This is actually the result that I find way more
| impressive. Elite mathematicians think these problems are
| challenging and thought they were years away from being
| solvable by AI.
| modeless wrote:
| You're right, I was wrong to say "most challenging" as there
| have been harder ones coming out recently. I think the
| correct statement would be "most challenging long-standing
| benchmark" as I don't believe any other test designed in 2019
| has resisted progress for so long. FrontierMath is only a
| month old. And of course the real key feature of ARC is that
| it is easy for humans. FrontierMath is (intentionally) not.
| esafak wrote:
| They should put some famous, unsolved problems in the next
| edition so ML researchers do some actually useful work
| while they're "gaming" the benchmarks :)
| skywhopper wrote:
| "The fact that scaled reasoning models are finally showing
| progress on ARC proves that what it measures really is relevant
| and important for reasoning."
|
| Not sure I understand how this follows. The fact that a certain
| type of model does well on a certain benchmark means that the
| benchmark is relevant for a real-world reasoning? That doesn't
| make sense.
| munchler wrote:
| It shows objectively that the models are getting better at
| some form of reasoning, which is at least worth noting.
| Whether that improved reasoning is relevant for the real
| world is a different question.
| moffkalast wrote:
| It shows objectively that one model got better at this
| specific kind of weird puzzle that doesn't translate to
| anything because it is just a pointless pattern matching
| puzzle that can be trained for, just like anything else. In
| fact they specifically trained for it, they say so upfront.
|
| It's like the modern equivalent of saying "oh when AI
| solves chess it'll be as smart as a person, so it's a good
| benchmark" and we all know how that nonsense went.
| munchler wrote:
| Hmm, you could be right, but you could also be very
| wrong. Jury's still out, so the next few years will be
| interesting.
|
| Regarding the value of "pointless pattern matching" in
| particular, I would refer you to Douglas Hofstadter's
| discussion of Bongard problems starting on page 652 of
| _Gödel, Escher, Bach_. Money quote: "I believe that the
| skill of solving Bongard [pattern recognition] problems
| lies very close to the core of 'pure' intelligence, if
| there is such a thing."
| moffkalast wrote:
| Well I certainly at least agree with that second part,
| the doubt if there is such a thing ;)
|
| The problem with pattern matching of sequences and
| transformers as an architecture is that it's something
| they're explicitly designed to be good at with self
| attention. Translation is mainly matching patterns to
| equivalents in different languages, and continuing a
| piece of text is following a pattern that exists inside
| it. This is primarily why it's so hard to draw a line
| between what an LLM actually understands and what it just
| wings naturally through pattern memorization and why
| everything about them is so controversial.
|
| Honestly I was really surprised that all models did so
| poorly on ARC in general thus far, since it really ought to be
| something they're superhuman at from the get-go. Probably more
| of a problem that it's visual in
| concept than anything else.
| bagels wrote:
| It doesn't follow, faulty logic. The two are probably
| correlated though.
| jug wrote:
| I liked the SimpleQA benchmark that measures hallucinations.
| OpenAI models did surprisingly poorly, even o1. In fact, it
| looks like OpenAI often does well on benchmarks by taking the
| shortcut of being more risk-prone than both Anthropic and Google.
| zone411 wrote:
| It's the least interesting benchmark for language models among
| all they've released, especially now that we already had a
| large jump in its best scores this year. It might be more
| useful as a multimodal reasoning task since it clearly involves
| visual elements, but with o3 already performing so well, this
| has proven unnecessary. ARC-AGI served a very specific purpose
| well: showcasing tasks where humans easily outperformed
| language models, so these simple puzzles had their uses. But
| tasks like proving math theorems or programming are far more
| impactful.
| versteegen wrote:
| ARC wasn't designed as a benchmark for LLMs, and it doesn't
| make much sense to compare them on it since it's the wrong
| modality. Even an MLM with image inputs can't be expected to
| do well, since ARC tasks are nothing like 99.999% of the training
| data. The fact that even a text-only LLM can solve ARC
| problems with the proper framework is important, however.
| danielmarkbruce wrote:
| Highly challenging for LLMs because it has nothing to do with
| language. LLMs and their training processes have all kinds of
| optimizations for language and how it's presented.
|
| This benchmark has done a wonderful job with marketing by
| picking a great name. It's largely irrelevant for LLMs despite
| the fact it's difficult.
|
| Consider how much of the model is just noise for a task like
| this given the low amount of information in each token and the
| high embedding dimensions used in LLMs.
| computerex wrote:
| The benchmark is designed to test for AGI and intelligence,
| specifically the ability to solve novel problems.
|
| If the hypothesis is that LLMs are the "computer" that drives
| the AGI then of course the benchmark is relevant in testing
| for AGI.
|
| I don't think you understand the benchmark and its
| motivation. ARC AGI benchmark problems are extremely easy and
| simple for humans. But LLMs fail spectacularly at them. Why
| they fail is irrelevant; the fact that they fail means that
| we don't have AGI.
| danielmarkbruce wrote:
| > The benchmark is designed to test for AGI and
| intelligence, specifically the ability to solve novel
| problems.
|
| It's a bunch of visual puzzles. They aren't a test for AGI
| because they're not general. If models (or any other system
| for that matter) could solve it, we'd be saying "this is a
| stupid puzzle, it has no practical significance". It's a
| test of some sort of specific intelligence. On top of that,
| the vast majority of blind people would fail - are they not
| generally intelligent?
|
| The name is marketing hype.
|
| The benchmark could be called "random puzzles LLMs are not
| good at because they haven't been optimized for it because
| it's not a valuable benchmark". Sure, it wasn't designed
| _for_ LLMs, but throwing LLMs at it and saying "see?" is
| dumb. We can throw in benchmarks for tennis playing, chess
| playing, video game playing, car driving and a bajillion
| other things while we are at it.
| NateEag wrote:
| And all that is kind of irrelevant, because if LLMs were
| human-level general intelligence, they would solve all
| these questions correctly without blinking.
|
| But they don't. Not even the best ones.
| pama wrote:
| No human would score high on that puzzle if the images
| were given to them as a series of tokens. Even previous
| LLMs scored much better than humans if tested in the same
| way.
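|
| For a sense of what "given as a series of tokens" means, here
| is a rough sketch (my own illustration, not the actual eval
| harness) of how a small ARC-style grid ends up as flat text:
|
|     # A tiny ARC-style grid (digits are colors). Visually the
|     # diagonal is obvious; as a flat token sequence it isn't.
|     grid = [
|         [0, 0, 3],
|         [0, 3, 0],
|         [3, 0, 0],
|     ]
|
|     # Row-major serialization, roughly how a text-only model
|     # receives it.
|     tokens = " ".join(str(cell) for row in grid for cell in row)
|     print(tokens)  # "0 0 3 0 3 0 3 0 0"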
| adamgordonbell wrote:
| There is a benchmark, NovelQA, that LLMs don't dominate when it
| feels like they should. The benchmark is to read a novel and
| answer questions about it.
|
| LLMs are below human performance, as of when I last looked, but it
| doesn't get much attention.
|
| Once it is passed, I'd like to see one that is solving the
| mystery in a mystery book right before it's revealed.
|
| We'd need unpublished mystery novels to use for that benchmark,
| but I think it gets at what I think of as reasoning.
|
| https://novelqa.github.io/
| CamperBob2 wrote:
| Does it work on short stories, but not novels? If so, then
| that's just a minor question of context length that should
| self-resolve over time.
| adamgordonbell wrote:
| The books fit in the current long-context models, so it's
| not merely a context-size constraint; the length itself is
| part of the issue, for sure.
| meta_x_ai wrote:
| Looks like it's not updated for nearly a year and I'm
| guessing Gemini 2.0 Flash with 2m context will simply crush
| it
| adamgordonbell wrote:
| That's true. They don't have Claude 3.5 on there either. So
| maybe it's not relevant anymore, but I'm not sure.
|
| If so, let's move on to the murder mysteries or more
| complex literary analysis.
| rowanG077 wrote:
| Benchmark how? Is it good if the LLM can or can't solve it?
| loxias wrote:
| NovelQA is a great one! I also like GSM-Symbolic -- a
| benchmark based on making _symbolic templates_ of quite easy
| questions, and sampling them repeatedly, varying things like
| which proper nouns are used, what order relevant details
| appear, how many irrelevant details (GSM-NoOp) and where they
| are in the question, things like that.
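|
| Roughly (my own toy illustration, not the paper's actual
| templates), the idea is that the underlying arithmetic stays
| fixed while names, quantities, and an optional irrelevant
| "NoOp" detail are resampled:
|
|     import random
|
|     NAMES = ["Sofia", "Liam", "Priya", "Mateus"]
|     NOOPS = ["", " Five of the apples are a bit smaller than average."]
|
|     def sample_problem(rng):
|         """Instantiate one variant of a fixed-structure word problem."""
|         name = rng.choice(NAMES)
|         a, b = rng.randint(3, 20), rng.randint(3, 20)
|         noop = rng.choice(NOOPS)  # irrelevant detail, GSM-NoOp style
|         question = (f"{name} picks {a} apples in the morning and {b} "
|                     f"more in the afternoon.{noop} How many apples "
|                     f"does {name} have?")
|         return question, a + b    # the answer ignores the noise
|
|     rng = random.Random(0)
|     q, answer = sample_problem(rng)
|     print(q, "->", answer)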
|
| LLMs are far, _far_ below human on elementary problems, once
| you allow any variation and stop spoonfeeding perfectly
| phrased word problems. :)
|
| https://machinelearning.apple.com/research/gsm-symbolic
|
| https://arxiv.org/pdf/2410.05229
|
| Paper came out in October, I don't think many have fully
| absorbed the implications.
|
| It's hard to take any of the claims of "LLMs can do
| reasoning!" seriously, once you understand that simply
| changing what names are used in an 8th grade math word problem
| can have dramatic impact on the accuracy.
| latency-guy2 wrote:
| > I'd like to see one that is solving the mystery in a
| mystery book right before it's revealed.
|
| I would think this is not a great benchmark. Authors don't
| write logically; they write for entertainment.
| adamgordonbell wrote:
| So I'm thinking of something like a locked-room mystery, where
| the idea is that it's solvable and the reader is given a chance
| to solve it.
|
| The reason it seems like an interesting bench is that it's a
| puzzle presented in a long context. It's like testing whether
| an LLM is at Sherlock Holmes' level of world and motivation
| modelling.
| usaar333 wrote:
| That's an old leaderboard -- has no one checked any SOTA LLM
| in the last 8 months?
| aimanbenbaha wrote:
| Because LLMs are on an off-ramp path towards AGI. A generally
| intelligent system can brute force its way with just memory.
|
| Once a model recognizes a weakness through CoT reasoning when
| posed a certain problem, and gets the agency to adapt to solve
| that problem, that's a precursor to real AGI capability!
| justanotherjoe wrote:
| I am confused because this dataset is visual-based, and yet it's
| being used to measure 'LLMs'. I feel like the visual nature of it
| was really the biggest hurdle to solving it.
| internet_points wrote:
| > The fact that scaled reasoning models are finally showing
| progress on ARC proves that what it measures really is relevant
| and important for reasoning.
|
| One might also interpret that as "the fact that models which
| are studying for the test are getting better at the test"
| (Goodhart's law), not that they're actually reasoning.
| wilg wrote:
| Fun! The benchmarks are so interesting because real-world use is
| so variable. Sometimes 4o will nail a pretty difficult problem;
| other times o1 pro mode will fail 10 times on what I would think
| is a pretty easy programming problem, and I waste more time
| trying to do it with AI.
| behnamoh wrote:
| So now not only are the models closed, but so are their evals?!
| This is a "semi-private" eval. WTH is that supposed to mean? I'm
| sure the model is great but I refuse to take their word for it.
| ZeroCool2u wrote:
| The private evaluation set is private from the public/OpenAI so
| companies can't train on those problems and cheat their way to
| a high score by overfitting.
| jsheard wrote:
| If the models run on OpenAIs servers then surely they could
| still see the questions being put into it if they wanted to
| cheat? That could only be prevented by making the evaluation
| a one-time deal that can't be repeated, or by having OpenAI
| distribute their models for evaluators to run themselves,
| which I doubt they're inclined to do.
| foobarqux wrote:
| Yes that's why it is "semi"-private: From the ARC website
| "This set is "semi-private" because we can assume that over
| time, this data will be added to LLM training data and need
| to be periodically updated."
|
| I presume evaluation on the test set is gated (you have to
| ask ARC to run it).
| cchance wrote:
| The evals are the questions and answers. ARC-AGI doesn't share
| the questions and answers for a portion of the set, so models
| can't be trained on them. For the public ones, the public knows
| the questions, so there's a chance models could have been at
| least partially trained on the questions (if not the actual
| answers).
|
| That's how I understand it.
| neom wrote:
| Why would they give a cost estimate per task on their low compute
| mode but not their high mode?
|
| The "low compute" mode uses 6 samples per task and 33M tokens
| on the semi-private eval set, costs $17-20 per task, and
| achieves 75.7% accuracy on the semi-private eval.
|
| The "high compute" mode uses 1024 samples per task (172x more
| compute) and achieves 87.5% accuracy on the semi-private eval;
| cost data was withheld at OpenAI's request.
|
| Can we just extrapolate $3kish per task on high compute?
| (wondering if they're withheld because this isn't the case?)
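|
| Back-of-envelope, assuming cost scales roughly linearly with
| the reported compute multiple (a big assumption, given the
| withheld numbers):
|
|     low_cost = (17, 20)     # reported $ per task, low-compute mode
|     compute_multiple = 172  # reported relative compute, high vs low
|
|     high_estimate = tuple(c * compute_multiple for c in low_cost)
|     print(high_estimate)    # (2924, 3440) -> roughly $3k per task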
| WiSaGaN wrote:
| The withheld part is really a red flag for me. Why do you want
| to withhold a compute number?
| zebomon wrote:
| My initial impression: it's very impressive and very exciting.
|
| My skeptical impression: it's complete hubris to conflate ARC or
| any benchmark with truly general intelligence.
|
| I know my skepticism here is identical to moving goalposts. More
| and more I am shifting my personal understanding of general
| intelligence as a phenomenon we will only ever be able to
| identify with the benefit of substantial retrospect.
|
| As it is with any sufficiently complex program, if you could
| discern the result beforehand, you wouldn't have had to execute
| the program in the first place.
|
| I'm not trying to be a downer on the 12th day of Christmas.
| Perhaps because my first instinct is childlike excitement, I'm
| trying to temper it with a little reason.
| amarcheschi wrote:
| I just googled arc agi questions, and it looks like it is
| similar to an iq test with raven matrix. Similar as in you have
| some examples of images before and after, then an image before
| and you have to guess the after.
|
| Could anyone confirm whether this is the only kind of question
| in the benchmark? If yes, how come there is such a direct
| connection to "oh this performs better than humans" when LLMs
| can be considerably better than us at understanding and
| forecasting patterns? I'm just curious, not trying to stir up
| controversies.
| zebomon wrote:
| It's a test on which (apparently until now) the vast majority
| of humans have far outperformed all machine systems.
| patrickhogan1 wrote:
| But it's not a test that directly shows general
| intelligence.
|
| I am no less excited! This is a huge improvement.
|
| How does this do on SWE Bench?
| og_kalu wrote:
| >How does this do on SWE Bench?
|
| 71.7%
| throwaway0123_5 wrote:
| I've seen this figure on a few tech news websites and
| reddit but can't find an official source. If it was in
| the video I must have missed it, where is this coming
| from?
| og_kalu wrote:
| It was in the video. I don't know if OpenAI have a page
| up yet.
| ALittleLight wrote:
| Yes, it's pretty similar to Raven's. The reason it is an
| interesting benchmark is because humans, even very young
| humans, "get" the test in the sense of understanding what
| it's asking and being able to do pretty well on it - but LLMs
| have really struggled with the benchmark in the past.
|
| Chollet (one of the creators of the ARC benchmark) has been
| saying it proves LLMs can't reason. The test questions are
| supposed to be unique and not in the model's training set.
| The fact that LLMs struggled with the ARC challenge suggested
| (to Chollet and others) that models weren't "truly
| reasoning" but rather just completing based on things they'd
| seen before - when the models were confronted with things
| they hadn't seen before, the novel visual patterns, they
| really struggled.
| Eridrus wrote:
| ML is quite good at understanding and forecasting patterns
| when you train on the data you want to forecast. LLMs manage
| to do so much because we just decided to train on everything
| on the internet and hope that it included everything we ever
| wanted to know.
|
| This tries to create patterns that are intentionally not in
| the data and see if a system can generalize to them, which o3
| super impressively does!
| yunwal wrote:
| ARC is in the dataset though? I mean I'm aware that there
| are new puzzles every day, but there's still a very
| specific format and set of skills required to solve it. I'd
| bet a decent amount of money that humans get better at ARC
| with practice, so it seems strange to suggest that AI
| wouldn't.
| hansonkd wrote:
| It doesn't need to be general intelligence or perfectly map to
| human intelligence.
|
| All it needs to be is useful. Reading constant comments about
| how LLMs can't be general intelligence or lack reasoning, etc., to me
| seems like people witnessing the airplane and complaining that
| it isn't "real flying" because it isn't a bird flapping its
| wings (a large portion of the population held that point of
| view back then).
|
| It doesn't need to be general intelligence for the rapid
| advancement of LLM capabilities to be the most societal
| shifting development in the past decades.
| zebomon wrote:
| I agree. If the LLMs we have today never got any smarter, the
| world would still be transformed over the next ten years.
| AyyEye wrote:
| > Reading constant comments about LLMs can't be general
| intelligence or lack reasoning etc, to me seems like people
| witnessing the airplane and complaining that it isn't "real
| flying" because it isn't a bird flapping its wings (a large
| portion of the population held that point of view back then).
|
| That is a natural reaction to the incessant techbro, AIbro,
| marketing, and corporate lies that "AI" (or worse AGI) is a
| real thing, and can be directly compared to real humans.
|
| There are people on this very thread saying it's better at
| reasoning than real humans (LOL) because it scored higher on
| some benchmark than humans... Yet this technology still can't
| reliably determine what number is circled, if two lines
| intersect, or count the letters in a word. (That said
| behaviour may have been somewhat fine-tuned out of newer
| models only reinforces the fact that the technology is
| inherently not capable of understanding _anything_.)
| IanCal wrote:
| I encounter "spicy auto complete" style comments far more
| often than techbro AI-everything comments and its frankly
| getting boring.
|
| I've been doing AI things for about 20+ years and LLMs are
| wild. We've gone from specialized systems that were pretty bad
| at their jobs to general-purpose systems that are better at
| those jobs and everything else. The idea that you could make an
| API call with "is this sarcasm?" and get a better-than-chance
| guess is incredible.
| AyyEye wrote:
| Nobody is disputing the coolness factor, only the
| intelligence factor.
| hansonkd wrote:
| I'm saying the intelligence factor doesn't matter. Only
| the utility factor. Today LLMs are incredibly useful and
| every few months there appears to be bigger and bigger
| leaps.
|
| Analyzing whether or not LLMs have intelligence is
| missing the forest for the trees. This technology is
| emerging in a capitalist society that is hyper optimized
| to adopt useful things at the expense of almost
| everything else. If the utility/price point gets hit for
| a problem, it will replace it regardless of if it is
| intelligent or not.
| Jensson wrote:
| But if you want to predict the future utility of these
| models you want to look at their current intelligence,
| compare that to humans and try to figure out roughly what
| skills they lack and which of those are likely to get
| fixed.
|
| For example, a team of humans is extremely reliable,
| much more reliable than one human, but a team of AIs
| isn't more reliable than one AI, since an AI is already an
| ensemble model. That means even if an AI could replace a
| person, it probably can't replace a team for a long time,
| meaning you still need the other team members there,
| meaning the AI didn't really replace a human; it just
| became a tool for humans to use.
| MVissers wrote:
| I think this is a fair criticism of capability.
|
| I personally wouldn't be surprised if we start to see
| benchmarks around this type of cooperation and ability to
| orchestrate complex systems in the next few years or so.
|
| Most benchmarks really focus on one problem, not on
| multiple real-time problems while orchestrating 3rd party
| actors who might or might not be able to succeed at
| certain tasks.
|
| But I don't think anything is prohibiting these models
| from being able to do that.
| surgical_fire wrote:
| Eh, I see far more "AI is the second coming of Jesus"
| type of comments than healthy skepticism. A lot of
| anxiety from people afraid that their source of income
| will dry up, and a lot of excitement from people with an
| axe to grind that "those entitled expensive peasants will
| get what they deserve".
|
| I think I count myself among the skeptics nowadays for
| that reason. And I say this as someone that thinks LLM is
| an interesting piece of technology, but with somewhat
| limited use and unclear economics.
|
| If the hype was about "look at this thing that can parse
| natural language surprisingly well and generate coherent
| responses", I would be excited too. As someone that had
| to do natural language processing in the past, that is a
| damn hard task to solve, and LLMs excel at it.
|
| But that is not the hype is it? We have people beating
| the drums of how this is just shy of taking the world by
| storm, and AGI is just around the corner, and it will
| revolutionize all economy and society and nothing will
| ever be the same.
|
| So, yeah, it gets tiresome. I wish the hype would die
| down a little so this could be appreciated for what it
| is.
| williamcotton wrote:
| _We have people beating the drums of how this is just shy
| of taking the world by storm, and AGI is just around the
| corner, and it will revolutionize all economy and society
| and nothing will ever be the same._
|
| Where are you seeing this? I pretty much only read HN and
| football blogs so maybe I'm out of the loop.
| sensanaty wrote:
| In this very thread there are multiple people espousing
| their views that the high score here is proof that o3 has
| achieved AGI.
| handsclean wrote:
| People aren't responding to their own assumption that AGI is
| necessary, they're responding to OpenAI and the chorus
| constantly and loudly singing hymns to AGI.
| surgical_fire wrote:
| > to me seems like people witnessing the airplane and
| complaining that it isn't "real flying" because it isn't a
| bird flapping its wings
|
| To me it is more like there is someone jumping on a pogo ball
| while flapping their arms and saying that they are flying
| whenever they hop off the ground.
|
| Skeptics say that they are not really flying, while adherents
| say that "with current pogo ball advancements, they will be
| flying any day now"
| intelVISA wrote:
| Between skeptics and adherents who is more easily able to
| extract VC money for vaporware? If you limit yourself to
| 'the facts' you're leaving tons of $$ on the table...
| surgical_fire wrote:
| By all means, if this is the goal, AI is a success.
|
| I understand that in this forum too many people are
| invested in putting lipstick on this particular pig.
| PaulDavisThe1st wrote:
| An old quote, quite famous: "... is like saying that an ape
| who climbs to the top of a tree for the first time is one
| step closer to landing on the moon".
| DonHopkins wrote:
| Is that what Elon Musk was trying to do on stage?
| billyp-rva wrote:
| > It doesn't need to be general intelligence or perfectly map
| to human intelligence.
|
| > All it needs to be is useful.
|
| Computers were already useful.
|
| The only definition we have for "intelligence" is human (or,
| generally, animal) intelligence. If LLMs aren't that, let's
| call it something else.
| throwup238 wrote:
| What exactly is human (or animal) intelligence? How do you
| define that?
| billyp-rva wrote:
| Does it matter? If LLMs _aren't_ that, whatever it is,
| then we should use a different word. Finders keepers.
| throwup238 wrote:
| How do you know that LLMs "aren't that" if you can't even
| define what _that_ is?
|
| "I'll know it when I see it" isn't a compelling argument.
| grahamj wrote:
| They can't do what we do, therefore they aren't what we
| are.
| layer8 wrote:
| And what is that, in concrete terms? Many humans can't do
| what other humans can do. What is the common subset that
| counts as human intelligence?
| dimitri-vs wrote:
| Process vision and sounds in parallel for 80+ years,
| rapidly adapt to changing environments and scenarios,
| correlate seemingly irrelevant details that happened a
| week ago or years ago, be able to selectively ignore
| instructions and know when to disagree
| jonny_eh wrote:
| > "I'll know it when I see it" isn't a compelling
| argument.
|
| It feels compelling to me.
| Aperocky wrote:
| I think a successful high level intelligence should
| quickly accelerate or converge to infinity/physical
| resource exhaustion, because it can now work on
| improving itself.
|
| So if above human intelligence does happen, I'd assume
| we'd know it, quite soon.
| wruza wrote:
| And look at the airplanes, they really can't just land on a
| mountain slope or a tree without heavy maintenance
| afterwards. Those people weren't all stupid, they questioned
| the promise of flying servicemen delivering mail or milk to
| their window and flying on a personal aircar to their
| workplace. Just like today's promises about whatever the CEOs'
| telltales are. Imagining bullshit isn't unique to this
| century.
|
| Aerospace is still a highly regulated area that requires
| training and responsibility. If parallels can be drawn here,
| they don't look so cool for a regular guy.
| skydhash wrote:
| This pretty much. Everyone knows that LLMs are great for
| text generation and processing. What people have been
| questioning is the end goal as promised by their builders,
| i.e. is it useful? And from most of what I saw, it's very
| much a toy.
| MVissers wrote:
| What would you need to see to call it useful?
|
| To give you an example- I've used it for legal work such
| as an EB2-NIW visa application. Saved me countless
| hours. My next visa I'll try to do without a lawyer using
| just LLMs. I would never try this without having LLMs at
| my disposal.
|
| As a hobby, and as someone with a scientific background,
| I've been able to build an artificial ecosystem
| simulation in Rust from scratch, without prior programming
| experience: https://www.youtube.com/@GenecraftSimulator
|
| I recently moved from fish to plants and believe I've
| developed some new science at the intersection of CS and
| Evolutionary Biology that I'm looking to publish.
|
| This tool is extremely useful. For now- You do require a
| human in the loop for coordination.
|
| My guess is that these will be benchmarks that we see
| within a few years: how well an AI can coordinate multiple
| other AIs to build, deploy and iterate something that
| functions in the real world. Basically manager AI.
|
| Because they'll literally be able to solve every single
| one shot problem so we won't be able to create benchmarks
| anymore.
|
| But that's also when these models will be able to build
| functioning companies in a few hours.
| skydhash wrote:
| > _...me countless...would never try this without
| having LLMs...is extremely useful...they'll literally be
| able to solve...will be able to... in a few hours._
|
| That's marketing language, not scientific or even casual
| language. So many outstanding claims, without even some
| basic explanations. Like how did it help you save those
| hours? Explaining terms? Outlining processes? Going to
| the post office for you? You don't need to sell me
| anything, I just want the how.
| wruza wrote:
| My issue with LLMs is that you require a review-competent
| human in the loop, to fix confabulations.
|
| Yes, I'm using them from time to time for research. But
| I'm also aware of the topics I research and see through
| bs. And best LLMs out there, right now, produce bs in
| just 3-4 paragraphs, in nicely documented areas.
|
| A recent example is my question on how to run N vpn
| servers on N ips on the same eth with ip binding (in ip =
| out ip, instead of using a gw with the lowest metric). I
| had no idea but I know how networks work and the
| terminology. It started helping, created a namespace, set
| up lo, set up two interfaces for inner and outer routing
| and then made a couple of crucial mistakes that couldn't
| be detected or fixed by someone even a little clueless
| (in routing setup for outgoing traffic). I didn't even
| argue and just asked what that does wrt my task, and that
| started the classic "oh wait, sorry, here's more bs" loop
| that never ended.
|
| Eventually I distilled the general idea and found an
| article that AI very likely learned from, cause it was
| the same code almost verbatim, but without mistakes.
|
| Does that count as helping? Idk, probably yes. But I know
| that examples like this show that not only can you not
| leave an LLM unsupervised for any non-trivial question,
| you have to keep a competent role in the loop.
|
| I think the programming community is just blinded by LLMs
| succeeding in writing kilometers of untalented
| react/jsx/etc crap that has no complexity or competence
| in it apart from repeating "do like this" patterns and
| literally millions of examples, so noise cannot hit
| through that "protection". Everything else suffers from
| LLMs adding inevitable noise into what they learned from
| a couple of sources. The problem here, as I understand
| it, is that only specific programmer roles and
| s{c,p}ammers (ironically) write the same crap again and
| again millions of times, other info usually exists in
| only a few important sources and blog posts, and only a
| few of those are full and have good explanations.
| Workaccount2 wrote:
| What people always leave out is that society will bend to
| the abilities of the new technology. Planes can't land in
| your backyard so we built airports. We didn't abandon
| planes.
| PaulDavisThe1st wrote:
| Sure, but that also vindicates the GP's point that the
| initial claims of the boosters for planes contained more
| than their fair share of bullshit and lies.
| wruza wrote:
| Yes but the idea was lost in the process. It became a
| faster transportation system that uses air as a medium,
| but that's it. Personal planes are still either big
| business or an expensive and dangerous personal toy
| thing. I don't think it's the same for LLMs (would be
| naive). But where are promises like "we're gonna change
| travel economics etc"? All headlines scream is "AGI
| around the corner". Yeah, now where's my damn postman
| flying? I need my mail.
| ben_w wrote:
| > It became a faster transportation system that uses air
| as a medium, but that's it.
|
| On the one hand, yes; on the other, this understates the
| impact that had.
|
| My uncle moved from the UK to Australia because, I'm
| told*, he didn't like his mum and travel was so expensive
| that he assumed they'd never meet again. My first trip
| abroad... I'm not 100% sure how old I was, but it must
| have been between age 6 and 10, was my gran (his mum)
| paying for herself, for both my parents, and for me, to
| fly to Singapore, then on to various locations in
| Australia including my uncle, and back via Thailand, on
| her pension.
|
| That was a gap of around one and a half generations.
|
| * both of them are long-since dead now so I can't ask
| ForHackernews wrote:
| This is already happening. A few days ago Microsoft
| turned down a documentation PR because the formatting was
| better for humans but worse for LLMs:
| https://github.com/MicrosoftDocs/WSL/pull/2021#issuecomment-...
|
| They changed their mind after a public outcry including
| here on HN.
| oblio wrote:
| We are slowly discovering that many of our wonderful
| inventions from 60-80-100 years ago have serious side
| effects.
|
| Plastics, cars, planes, etc.
|
| One could say that a balanced situation, where vested
| interests are put back in the box (close to impossible
| since it would mean fighting trillions of dollars), would
| mean that, for example, all 3 in the list above are used a
| lot less than we use them now, and only used where truly
| appropriate.
| tivert wrote:
| > What people always leave out is that society will bend
| to the abilities of the new technology.
|
| Do they really? I don't think they do.
|
| > Planes can't land in your backyard so we built
| airports. We didn't abandon planes.
|
| But then what do you do with the all the fantasies and
| hype about the new technology (like planes that land in
| your backyard and you fly them to work)?
|
| And it's quite possible and fairly common that the new
| technology _actually ends up being mostly hype_ , and
| there's actually no "airports" use case in the wings. I
| mean, how much did society "bend to the abilities of"
| NFTs?
|
| And then what if the mature "airports" use case is
| actually something _most people do not want_?
| moffkalast wrote:
| No, we built helicopters.
| throwaway4aday wrote:
| Your point is on the verge of nullification with the rapid
| improvement and adoption of autonomous drones don't you
| think?
| wruza wrote:
| Sort of, but doesn't that sit on a far-off horizon? I
| doubt that the drone companies are the same ones who sold
| aircraft retrofuturism to people back then.
| alexalx666 wrote:
| If I could put it into a Tesla-style robot and it could do
| dishes and help me figure out tech stuff, it would be more
| than enough.
| skywhopper wrote:
| On the contrary, the pushback is critical because many
| employers are buying the hype from AI companies that AGI is
| imminent, that LLMs can replace professional humans, and that
| computers are about to eliminate all work (except VCs and
| CEOs apparently).
|
| Every person that believes that LLMs are near sentient or
| actually do a good job at reasoning is one more person
| handing over their responsibilities to a zero-accountability
| highly flawed robot. We've already seen LLMs generate bad
| legal documents, bad academic papers, and extremely bad code.
| Similar technology is making bad decisions about who to
| arrest, who to give loans to, who to hire, who to bomb, and
| who to refuse heart surgery for. Overconfident humans
| employing this tech for these purposes have been bamboozled
| by the lies from OpenAI, Microsoft, Google, et al. It's
| crucial to call out overstatement and overhype about this
| tech wherever it crops up.
| noFaceDiscoG668 wrote:
| I don't understand how or why someone with your mind would
| assume that even barely disclosed semi-public releases
| would resemble the current state of the art. Except if you
| do it for the conversation's sake, which I have never been
| capable of.
| jasondigitized wrote:
| This a thousand times.
| colordrops wrote:
| I don't think many informed people doubt the utility of LLMs
| at this point. The potential of human-like AGI has profound
| implications far beyond utility models, which is why people
| are so eager to bring it up. A true human-like AGI basically
| means that most intellectual/white collar work will not be
| needed, and probably manual labor before too long as well.
| Huge huge implications for humanity, e.g. how does an economy
| and society even work without workers?
| vouaobrasil wrote:
| > Huge huge implications for humanity, e.g. how does an
| economy and society even work without workers?
|
| I don't think those that create AI care about that. They
| just want to come out on top before someone else does.
| sigmoid10 wrote:
| These comments are getting ridiculous. I remember when this
| test was first discussed here on HN and everyone agreed that it
| clearly proves current AI models are not "intelligent"
| (whatever that means). And people tried to talk me down when I
| theorised this test will get nuked soon - like all the ones
| before. It's time people woke up and realised that the old age
| of AI is over. This new kind is here to stay and it _will_ take
| over the world. And you better guess it'll be sooner rather
| than later and start to prepare.
| samvher wrote:
| What kind of preparation are you suggesting?
| sigmoid10 wrote:
| This is far too broad to summarise here. You can read up on
| Sutskever or Bostrom or hell even Stephen Hawking's ideas
| (going in order from really deep to general topics). We
| need to discuss _everything_ - from education over jobs and
| taxes all the way to the principles of politics, our
| economy and even the military. If we fail at this as a
| society, we will at the very least create a world where the
| people who own capital today massively benefit and become
| rich beyond imagination (despite having contributed nothing
| to it), while the majority of the population will be
| unemployable and forever left behind. And the worst case
| probably falls somewhere between the end of human
| civilisation and the end of our species.
| kelseyfrog wrote:
| What we're going to do is punt the questions and then
| convince ourselves the outcome was inevitable and if
| anything it's actually our fault.
| astrange wrote:
| One way you can tell this isn't realistic is that it's
| the plot of Atlas Shrugged. If your economic intuitions
| produce that book it means they are wrong.
|
| > while the majority of the population will be
| unemployable and forever left behind
|
| Productivity improvements increase employment. A
| superhuman AI is a productivity improvement.
| BriggyDwiggs42 wrote:
| No, Atlas Shrugged explicitly believes that the wealthy
| beneficiaries are also the ones doing the innovation and
| the labor. Human/superhuman AI, if not self-directed but
| more like a tool, may massively benefit whoever happens
| to be lucky enough to be directing it when it arises.
| This does not imply that the lucky individual benefits on
| the basis of their competence.
|
| The idea that productivity improvements increase
| employment is just fundamentally based on a different
| paradigm. There is absolutely no reason to think that
| when a machine exists that can do most things that a
| human can do as well if not better for less or equal
| cost, this will somehow increase human employment. In
| this scenario, using humans in any stage of the pipeline
| would be deeply inefficient and a stupid business
| decision.
| ben_w wrote:
| > Productivity improvements increase employment.
|
| Sometimes: the productivity improvements from the
| combustion engine didn't increase employment of horses,
| it displaced them.
|
| But even when productivity improvements do increase
| employment, it's not always to our advantage: the
| productivity improvements from Eli Whitney's cotton gin
| included huge economic growth and subsequent
| technological improvements... and also "led to increased
| demands for slave labor in the American South, reversing
| the economic decline that had occurred in the region
| during the late 18th century":
| https://en.wikipedia.org/wiki/Cotton_gin
|
| A superhuman AI that's only superhuman in specific
| domains? We've been seeing plenty of those, "computer"
| used to be a profession, and society can re-train but it
| still hurts the specific individuals who have to be
| unemployed (or start again as juniors) for the duration
| of that training.
|
| A superhuman AI that's superhuman in every domain, but
| close enough to us in resource requirements that
| comparative advantage is still important and we can still
| do stuff, relegates us to whatever the AI is least good
| at.
|
| A superhuman AI that's superhuman in every domain... as
| soon as someone invents mining, processing, and factory
| equipment that works on the moon or asteroids, that AI
| can control that equipment to make more of that
| equipment, and demand is quickly -- O(log(n)) --
| saturated. I'm moderately confident that in this
| situation, the comparative advantage argument no longer
| works.
| johnny_canuck wrote:
| Start learning a trade
| jorblumesea wrote:
| that's going to work when every white collar worker goes
| into the trades /s
|
| who is going to pay for residential electrical work lol
| and how much will you make if some guy from MIT is going
| to compete with you
| whynotminot wrote:
| I feel like that's just kicking the can a little further
| down the road.
|
| Our value proposition as humans in a capitalist society
| is an increasingly fragile thing.
| foobarqux wrote:
| You should look up the terms necessary and sufficient.
| sigmoid10 wrote:
| The real issue is people constantly making up new goalposts
| to keep their outdated world view somewhat aligned with
| what we are seeing. But these two things are drifting apart
| faster and faster. Even I got surprised by how quickly the
| ARC benchmark was blown out of the water, and I'm pretty
| bullish on AI.
| foobarqux wrote:
| The ARC maintainers have explicitly said that passing the
| test was necessary but not sufficient so I don't know
| where you come up with goal-post moving. (I personally
| don't like the test; it is more about "intuition" or in-
| built priors, not reasoning).
| manmal wrote:
| Are you like invested in LLM companies or something?
| You're pushing the agenda hard in this thread.
| lawlessone wrote:
| Failing the test may prove the AI is not intelligent. Passing
| the test doesn't necessarily prove it is.
| NitpickLawyer wrote:
| Your comment reminds me of this quote from a book published
| in the 80s:
|
| > There is a related "Theorem" about progress in AI: once
| some mental function is programmed, people soon cease to
| consider it as an essential ingredient of "real thinking".
| The ineluctable core of intelligence is always in that next
| thing which hasn't yet been programmed. This "Theorem" was
| first proposed to me by Larry Tesler, so I call it Tesler's
| Theorem: "AI is whatever hasn't been done yet."
| 6gvONxR4sf7o wrote:
| I've always disliked this argument. A person can do
| something well without devising a general solution to the
| thing. Devising a general solution to the thing is a step
| we're taking all the time with all sorts of things, but
| it doesn't invalidate the cool fact about intelligence:
| whatever it is that lets us do the thing well _without_
| the general solution is hard to pin down and hard to
| reproduce.
|
| All that's invalidated each time is the idea that a
| general solution to that task requires a general solution
| to all tasks, or that a general solution to that task
| requires our special sauce. It's the idea that something
| able to do that task will also be able to do XYZ.
|
| And yet people keep coming up with a new task that people
| point to saying, 'this is the one! there's no way
| something could solve this one without also being able to
| do XYZ!'
| 8note wrote:
| I'd consider that it doing the test at all, without proper
| compensation, is a sign that it isn't intelligent.
| esafak wrote:
| Motivation is not hard to instill. Fortunately, they have
| chosen not to do so.
| QuantumGood wrote:
| "it will take over the world"
|
| Calibrating to the current hype cycle has been challenging
| with AI pronouncements.
| jcims wrote:
| I agree, it's like watching a meadow ablaze and dismissing it
| because it's not a 'real forest fire' yet. No it's not 'real
| AGI' yet, but *this is how we get there* and the pace is
| relentless, incredible and wholly overwhelming.
|
| I've been blessed with grandchildren recently, a little boy
| that's 2 1/2 and just this past Saturday a granddaughter.
| Major events notwithstanding, the world will largely resemble
| today when they are teenagers, but the future is going to
| look very very very different. I can't even imagine what the
| capability and pervasiveness of it all will be like in ten
| years, when they are still just kids. For me as someone
| that's invested in their future I'm interested in all of the
| educational opportunities (technical, philosophical and self-
| awareness) but obviously am concerned about the potential for
| pernicious side effects.
| philipkglass wrote:
| If AI takes over white collar work that's still half of the
| world's labor needs untouched. There are some promising early
| demos of robotics plus AI. I also saw some promising demos of
| robotics 10 and 20 years ago that didn't reach mass adoption. I'd
| like to believe that by the time I reach old age the robots
| will be fully qualified replacements for plumbers and home
| health aides. Nothing I've seen so far makes me think that's
| especially likely.
|
| I'd love more progress on tasks in the physical world,
| though. There are only a few paths for countries to deal with
| a growing ratio of old retired people to young workers:
|
| 1) Prioritize the young people at the expense of the old by
| e.g. cutting old age benefits (not especially likely since
| older voters have greater numbers and higher participation
| rates in elections)
|
| 2) Prioritize the old people at the expense of the young by
| raising the demands placed on young people (either directly
| as labor, e.g. nurses and aides, or indirectly through higher
| taxation)
|
| 3) Rapidly increase the population of young people through
| high fertility or immigration (the historically favored path,
| but eventually turns back into case 1 or 2 with an even
| larger numerical burden of older people)
|
| 4) Increase the health span of older people, so that they are
| more capable of independent self-care (a good idea, but
| difficult to achieve at scale, since most effective
| approaches require behavioral changes)
|
| 5) Decouple goods and services from labor, so that old people
| with diminished capabilities can get everything they need
| without forcing young people to labor for them
| reducesuffering wrote:
| > If AI takes over white collar work that's still half of
| the world's labor needs untouched.
|
| I am continually _baffled_ that people here throw this
| argument out and can 't imagine the second-order effects.
| If white collar work is automated by AGI, all the R&D to
| solve robotics beyond imagination will happen in a flash.
| The top AI labs, the people smart enough to make this
| technology, are all focusing on automating AGI researchers,
| and from there follows everything, obviously.
| brotchie wrote:
| +1, the second and third order effects aren't trivial.
|
| We're already seeing escape velocity in world modeling
| (see Google Veo2 and the latest Genesis LLM-based physics
| modeling framework).
|
| The hardware for humanoid robots is 95% of the way there,
| the gap is control logic and intelligence, which is
| rapidly being closed.
|
| Combine Veo2 world model, Genesis control planning,
| o3-style reasoning, and you're pretty much there with
| blue collar work automation.
|
| We're only a few turns (<12 months) away from an
| existence proof of a humanoid robot that can watch a
| Youtube video and then replicate the task in a novel
| environment. May take longer than that to productionize.
|
| It's really hard to think and project forward on an
| exponential. We've been on an exponential technology
| curve since the discovery of fire (at least). The 2nd
| order has kicked up over the last few years.
|
| Not a rational approach to look back at robotics
| 2000-2022 and project that pace forwards. There's more
| happening every month than in decades past.
| philipkglass wrote:
| I hope that you're both right. In 2004-2007 I saw self
| driving vehicles make lightning progress from the weak
| showing of the 2004 DARPA Grand Challenge to the
| impressive 2005 Grand Challenge winners and the even more
| impressive performance in the 2007 Urban Challenge. At
| the time I thought that full self driving vehicles would
| have a major commercial impact within 5 years. I expected
| truck and taxi drivers to be obsolete jobs in 10 years.
| 17 years after the Urban Challenge there are still
| millions of truck driver jobs in America and only Waymo
| seems to have a credible alternative to taxi drivers
| (even then, only in a small number of cities).
| ben_w wrote:
| > It's time people woke up and realised that the old age of
| AI is over. This new kind is here to stay and it will take
| over the world. And you better guess it'll be sooner rather
| than later and start to prepare.
|
| I was just thinking about how 3D game engines were perceived
| in the 90s. Every six months some new engine came out, blew
| people's minds, was declared photorealistic, and was
| forgotten a year later. The best of those engines kept
| improving and are still here, and kinda did change the world
| in their own way.
|
| Software development seemed rapid and exciting until about
| Halo or Half Life 2, then it was shallow but shiny press
| releases for 15 years, and only became so again when OpenAI's
| InstructGPT was demonstrated.
|
| While I'm really impressed with current AI, and value the
| best models greatly, and agree that they will change (and
| have already changed) the world... I can't help but think of
| the _Next Generation_ front cover, February 1997 when
| considering how much further we may be from what we want:
| https://www.giantbomb.com/pc/3045-94/forums/unreal-yes-this-...
| torginus wrote:
| The weird thing about the phenomenon you mention is that only
| after the field of software engineering plateaued 15
| years ago, as you mentioned, did this insane demand for
| engineers arise, with corresponding insane salaries.
|
| It's a very strange thing I've never understood.
| dwaltrip wrote:
| My guess: It's a very lengthy, complex, and error-prone
| process to "digitize" human civilization (government,
| commerce, leisure, military, etc). The tech existed, we
| just didn't know how to use it.
|
| We still barely know how to use computers effectively,
| and they have already transformed the world. For better
| or worse.
| hansonkd wrote:
| > how much further we may be from what we want
|
| The timescale you are describing for 3D graphics is 4 years
| from the 1997 cover you posted to the release of Halo which
| you are saying plateaued excitement because it got advanced
| enough.
|
| An almost infinitesimally small amount of time in terms of
| the history of human development, and you are mocking the
| magazine for being excited about the advancement because it
| was... 4 years early?
| ben_w wrote:
| No, the timescale is "the 90s", the _specific
| example_ is from 1997, and chosen because of how badly it
| aged. Nobody looks at the original single-player Unreal
| graphics today and thinks "this is amazing!", but we all
| did at the time -- Reflections! Dynamic lighting! It was
| amazing for the era -- but it was also a long way from
| photorealism. ChatGPT is amazing... but how far is it
| from Brent Spiner's Data?
|
| The era was people getting wowed from Wolfenstein (1992)
| to "about Halo or Half Life 2" (2001 or 2004).
|
| And I'm not saying the flattening of excitement was for
| any specific reason, just that this was roughly when it
| stopped getting exciting -- it might have been because
| the engines were good enough for 3D art styles beyond "as
| realistic as we can make it", but for all I know it was
| the War On Terror which changed the tone of press
| releases and how much the news in general cared. Or
| perhaps it was a culture shift which came with more
| people getting online and less media being printed on
| glossy paper and sold in newsagents.
|
| Whatever the cause, it happened around that time.
| TeMPOraL wrote:
| I'm still holding on to my hypothesis that the excitement
| was sustained in large part because this progress was
| something a regular person could partake in. Most didn't,
| but they likely knew some kid who did. And some of those
| kids ran the gaming magazines.
|
| This was a time where, for 3D graphics, barriers to entry
| got low (math got figured out, hardware was good enough,
| knowledge spread), but the commercial market didn't yet
| capture everything. Hell, a bulk of the excited kids I
| remember, trying to build a better Unreal Tournament after
| school instead of doing homework (and almost succeeding!),
| went on to create and staff the next generation of
| commercial gamedev.
|
| (Which is maybe why this period lasted for about as long
| as it takes for a schoolkid to grow up, graduate, and
| spend a few years in the workforce doing the stuff they
| were so excited about.)
| ben_w wrote:
| Could be.
|
| I was one of those kids, my focus was Marathon 2 even
| before I saw Unreal. I managed to figure out enough maths
| from scratch to end up with the basics of ray casting,
| but not enough at the time to realise the tricks needed
| to make that real time on a 75 MHz CPU... and then we all
| got OpenGL and I went through university where they
| explained the algorithms.
| TeMPOraL wrote:
| > _Software development seemed rapid and exciting until
| about Halo or Half Life 2, then it was shallow but shiny
| press releases for 15 years_
|
| The transition seems to map well to the point where engines
| got sophisticated enough, that highly dedicated high-
| schoolers couldn't keep up. Until then, people would
| routinely make hobby game engines (for games they'd then
| never finish) that were MVPs of what the game industry had
| a year or three earlier. I.e. close enough to compete on
| visuals with top photorealistic games of a given year - but
| more importantly, this was a time where _you could do cool
| nerdy shit to impress your friends and community_.
|
| Then Unreal and Unity came out, with a business model that
| killed the motivation to write your own engine from scratch
| (except for purely educational purposes), we got more
| games, more progress, but the excitement was gone.
|
| Maybe it's just a spurious correlation, but it seems to
| track with:
|
| > _and only became so again when OpenAI's InstructGPT was
| demonstrated._
|
| Which is again, if you exclude training SOTA models - which
| is still mostly out of reach for anyone but a few entities
| on the planet - the time where _anyone_ can do something
| cool that doesn't have a better market alternative yet,
| and any dedicated high-schooler can make truly impressive
| and useful work, outpacing commercial and academic work
| based on pure motivation and focus alone (it's easier when
| you're not being distracted by bullshit incentives like
| _user growth_ or _making VCs happy_ or _churning out
| publications, farming citations_ ).
|
| It's, once again, a time of dreams, where anyone with some
| technical interest and a bit of free time can _make the
| future happen in front of their eyes_.
| levocardia wrote:
| I'm a little torn. ARC is really hard, and Francois is
| extremely smart and thoughtful about what intelligence means
| (the original "On the Measure of Intelligence" heavily
| influenced my ideas on how to think about AI).
|
| On the other hand, there is a long, long history of AI
| achieving X but not being what we would casually refer to as
| "generally intelligent," then people deciding X isn't really
| intelligence; only when AI achieves Y will it be
| intelligence. Then AI achieves Y and...
| Workaccount2 wrote:
| You are telling a bunch of high earning individuals ($150k+)
| that they may be dramatically less valuable in the near
| future. Of course the goalposts will keep being pushed back
| and the acknowledgements will never come.
| ignoramous wrote:
| > _These comments are getting ridiculous._
|
| Not really. Francois (co-creator of the ARC Prize) has this
| to say: The v1 version of the benchmark is
| starting to saturate. There were already signs of this in the
| Kaggle competition this year: an ensemble of all submissions
| would score 81%. Early indications are that ARC-
| AGI-v2 will represent a complete reset of the state-of-the-
| art, and it will remain extremely difficult for o3.
| Meanwhile, a smart human or a small panel of average humans
| would still be able to score >95% ... This shows that it's
| still feasible to create unsaturated, interesting benchmarks
| that are easy for humans, yet impossible for AI, without
| involving specialist knowledge. We will have AGI when
| creating such evals becomes outright impossible.
| For me, the main open question is where the scaling
| bottlenecks for the techniques behind o3 are going to be. If
| human-annotated CoT data is a major bottleneck, for instance,
| capabilities would start to plateau quickly like they did for
| LLMs (until the next architecture). If the only bottleneck is
| test-time search, we will see continued scaling in the
| future.
|
| https://x.com/fchollet/status/1870169764762710376 /
| https://ghostarchive.org/archive/Sqjbf
| bluerooibos wrote:
| The goalposts have moved, again and again.
|
| It's gone from "well the output is incoherent" to "well it's
| just spitting out stuff it's already seen online" to
| "WELL...uhh IT CAN'T CREATE NEW/NOVEL KNOWLEDGE" in the space
| of 3-4 years.
|
| It's incredible.
|
| We already have AGI.
| FrustratedMonky wrote:
| " it's complete hubris to conflate ARC or any benchmark with
| truly general intelligence."
|
| Maybe it would help to include some human results in the AI
| ranking.
|
| I think we'd find that Humans score lower?
| zamadatix wrote:
| I'm not sure it'd help what they are talking about much.
|
| E.g. go back in time and imagine you didn't know there are
| ways for computers to be really good at performing
| integration yet as nobody had tried to make them. If someone
| asked you how to tell if something is intelligent "the
| ability to easily reason integrations or calculate extremely
| large multiplications in mathematics" might seem like a great
| test to make.
|
| Skip forward to the modern era and it's blatantly obvious
| CASes like Mathematica on a modern computer range between
| "ridiculously better than the average person" to "impossibly
| better than the best person" depending on the test. At the
| same time, it becomes painfully obvious a CAS is wholly
| unrelated to general intelligence and just because your test
| might have been solvable by an AGI doesn't mean solving it
| proves something must have been an AGI.
|
| So you come up with a new test... but you have the same
| problem as originally, it seems like anything non-human
| completely bombs and an AGI would do well... but how do you
| know the thing that solves it will have been an AGI for sure
| and not just another system clearly unrelated?
|
| Short of a more clever way what GP is saying is the goalposts
| must keep being moved until it's not so obvious the thing
| isn't AGI, not that the average human gets a certain score
| which is worse.
|
| .
|
| All that aside, to answer your original question, in the
| presentation it was said the average human gets 85% and this
| was the first model to beat that. It was also said a second
| version is being worked on. They have some papers on their
| site with clear examples of why the current test has a lot
| of testing unrelated to whether something is really AGI (a
| brute force method was shown to get >50% in 2020), so their
| aim is to create a new goalpost test and see how things
| shake out this time.
| FrustratedMonky wrote:
| "Short of a more clever way what GP is saying is the
| goalposts must keep being moved until it's not so obvious
| the thing isn't AGI, not that the average human gets a
| certain score which is worse."
|
| Best way of stating that I've heard.
|
| The Goal Post must keep moving, until we understand enough
| what is happening.
|
| I usually poo-poo the goal post moving, but this makes
| sense.
| og_kalu wrote:
| Generality is not binary. It's a spectrum. And these models
| are already general in ways those things you've mentioned
| simply weren't.
|
| What exactly is AGI to you ? If it's simply a generally
| intelligent machine then what are you waiting for ? What
| else is there to be sure of ? There's nothing narrow about
| these models.
|
| Humans love to believe they're oh so special so much that
| there will always be debates on whether 'AGI' has arrived.
| If you are waiting for that then you'll be waiting a very
| long time, even if a machine arrives that takes us to the
| next frontier in science.
| Jensson wrote:
| > There's nothing narrow about these models.
|
| There is, they can't create new ideas like humanity can.
| AGI should be able to replace humanity in terms of
| thinking, otherwise it isn't general, you would just have
| a model specialized at reproducing thoughts and patterns
| human have thought before, it still can't recreate
| science from scratch etc like humanity did, meaning it
| can't do science properly.
|
| Comparing an AI to a single individual is not how you
| measure AGI. If a group of humans performs better, then you
| can't use the AI to replace that group of humans, and
| thus the AI isn't an AGI, since it couldn't replace the
| group of humans.
|
| So for example, if a group of programmers write more
| reliable programs than the AI, then you can't replace
| that group of programmers with the AI, even if you
| duplicate that AI many times, since the AI isn't capable
| of reproducing that same level of reliability when run in
| parallel. This is because an AI run in parallel is still
| just an AI, and an ensemble model is still just an AI, so
| the model the AI has to beat is the human ensemble called
| humanity.
|
| If we lower the bar a bit: at the very least it has to beat
| 100,000 humans working together to make a job obsolete.
| All the tutorials and similar resources are made by other
| humans as well; if you remove the job, those would also
| disappear and the AI would have to do the work of all of
| those people, so if it can't, humans will still be needed.
|
| It's possible you will be able to substitute part of those
| human ensembles with AI much sooner, but then we just
| call it a tool. (We also call narrow humans tools, which
| is fair.)
| og_kalu wrote:
| I see these models create new ideas. At least at the
| standard humans are beholden to, so this just falls flat
| for me.
| Jensson wrote:
| You don't just need to create an idea, you need to be
| able to create ideas that on average progress in a
| positive direction. Humans can evidently do that, AI
| can't, when AI work too much without human input you
| always end up with nonsense.
|
| In order to write general programs you need to have that
| skill. Every new code snippet needs to be evaluated by
| that system: does it make the codebase better or not?
| The lack of that ability is why you can't just loop an
| LLM today to replace programmers. It might be possible to
| automate it for specific programming tasks, but not
| general purpose programming.
|
| Overcoming that hurdle is not something I think LLMs can
| ever do; you need a totally different kind of
| architecture, not something that is trained to mimic but
| something trained to reason. I don't know how to train
| something that can reason about noisy unstructured data.
| We will probably figure that out at some point, but it
| probably won't be LLMs as they are today.
| zamadatix wrote:
| I'm firmly in the "absolutely nothing special about human
| intelligence" camp so don't let dismissal of this as AGI
| fuel any misconceptions as to why I might think that.
|
| As for what AGI is? Well, the lack of being able to
| describe that brings us full circle in this thread - I'll
| tell you for sure when I've seen it for the first time
| and have the power of hindsight to say what was missing.
| I think these models are the closest we've come but it
| feels like there is at least 1-2 more "4o->o1" style
| architecture changes where it's not necessarily about an
| increase in model fitting and more about a change in how
| the model comes to an output before we get to what I'd be
| willing to call AGI.
|
| Who knows though, maybe some of those changes come along
| and it's closer but still missing some process to reason
| well enough to be AGI rather than a midway tool.
| m3kw9 wrote:
| From the statement: this was a pretty tough test where AI
| scored low vs. humans just last year. An AI that can do it as
| well as humans may not be AGI, which I agree with, but it
| means SOMETHING.
| manmal wrote:
| Obviously, the multi billion dollar companies will try to
| satisfy the benchmarks they are not yet good in, as has
| always been the case.
| m3kw9 wrote:
| A valid conspiracy theory, but I've heard that one every step
| of the way to this point.
| wslh wrote:
| > My skeptical impression: it's complete hubris to conflate ARC
| or any benchmark with truly general intelligence.
|
| But isn't it interesting to have several benchmarks? Even if
| it's not about passing the Turing test, benchmarks serve a
| purpose--similar to how we measure microprocessors or other
| devices. Intelligence may be more elusive, but even if we had
| an oracle delivering the ultimate intelligence benchmark, we'd
| still argue about its limitations. Perhaps we'd claim it
| doesn't measure creativity well, and we'd find ourselves
| revisiting the same debates about different kinds of
| intelligences.
| zebomon wrote:
| It's certainly interesting. I'm just not convinced it's a
| test of general intelligence, and I don't think we'll know
| whether or not it is until it's been able to operate in the
| real world to the same degree that our general intelligence
| does.
| kelseyfrog wrote:
| > truly general intelligence
|
| Indistinguishable from goalpost moving like you said, but also
| no true Scotsman.
|
| I'm curious what would happen in your eyes if we misattributed
| general intelligence to an AI model? What are the consequences
| of a false positive and how would they affect your life?
|
| It's really clear to me how intelligence fits into our reality
| as part of our social ontology. The attributes and their
| expression that each of us uses to ground our concept of the
| intelligent predicate differs wildly.
|
| My personal theory is that we tend to have an exemplar-based
| dataset of intelligence, and each of us attempts to construct a
| parsimonious model of intelligence, but like all (mental)
| models, they can be useful but wrong. These models operate in a
| space where the trade off is completeness or consistency, and
| most folks, uncomfortable saying "I don't know" lean toward
| being complete in their specification rather than consistent.
| The unfortunate side-effect is that we're able to easily
| generate test data that highlights our model inconsistency - AI
| being a case in point.
| PaulDavisThe1st wrote:
| > I'm curious what would happen in your eyes if we
| misattributed general intelligence to an AI model? What are
| the consequences of a false positive and how would they
| affect your life?
|
| Rich people will think they can use the AI model instead of
| paying other people to do certain tasks.
|
| The consequences could range from brilliant to utterly
| catastrophic, depending on the context and precise way in
| which this is done. But I'd lean toward the catastrophic.
| kelseyfrog wrote:
| Any specifics? It's difficult to separate this from
| generalized concern.
| PaulDavisThe1st wrote:
| someone wants a "personal assistant" and believes that
| the LLM has AGI ...
|
| someone wants a "planning officer" and believes that the
| LLM has AGI ...
|
| someone wants a "hiring consultant" and believes that the
| LLM has AGI ...
|
| etc. etc.
| kelseyfrog wrote:
| My apologies, but would it be possible to list the
| catastrophic consequences of these?
| Agentus wrote:
| how about an extra large dose of your skepticism. is true
| intelligence really a thing and not just a vague human
| construct that tries to point out the mysterious unquantifiable
| combination of human behaviors?
|
| humans clearly dont know what intelligence is unambiguously.
| theres also no divinely ordained objective dictionary that one
| can point at to reference what true intelligence is. a deep
| reflection of trying to pattern associate different human
| cognitive abilities indicates human cognitive capabilities
| arent that spectacular really.
| MVissers wrote:
| My guess as an amateur neuroscientist is that what we call
| intelligence is just a 'measurement' of problem solving
| ability in different domains. Can be emotional, spatial,
| motor, reasoning, etc etc.
|
| There is no special sauce in our brain. And we know how much
| compute there is in our brain- So we can roughly estimate
| when we'll hit that with these 'LLMs'.
|
| Language is important in a human brain development as well.
| Kids who grow up deaf grow up vastly less intelligent unless
| they learn sign language. Language allows us to process
| complex concepts that our brain can learn to solve, without
| having to be in those complex environments.
|
| So in hindsight, it's easy to see why it took a language
| model to be able to solve general tasks that other types of
| deep learning networks couldn't.
|
| I don't really see any limits on these models.
| Agentus wrote:
| interesting point about language. but i wonder if people
| misattribute the reason why language is pivotal to human
| development. your points are valid. i see human behavior
| with regard to learning as 90% mimicry and 10% autonomous
| learning. most of what humans believe in is taken on faith
| and passed on from the tribe to the individual. rarely is
| it verified even partially, let alone fully. humans simply
| dont have the time or processing power to do that. learning
| a thing without outside aid is vastly slower and more
| energy or brain intensive process than copy learning or
| learning through social institutions by dissemination. the
| stunted development from lack of language might come more
| from the less ability to access the collective learning
| process that language enables and or greatly enhances. i
| think a lot of learning even when combined with reasoning,
| deduction, etc really is at the mercy of brute force
| exploration to find a solution, which individuals are bad
| at but a society that collects random experienced "ah hah!"
| occurrences and passes them along is actually okay at.
|
| i wonder if llms and language don't so much allow us to
| process these complex environments as preload our
| brains to get a head start in processing those complex
| environments once we arrive in them. i think llms store
| compressed relationships of the world which obviously has
| information loss from a neural mapping of the world that
| isnt just language based. but that compressed relationships
| ie knowledge doesnt exactly backwardly map onto the world
| without it having a reverse key. like artificially learning
| about real world stuff in school abstractly and then going
| into the real world, it takes time for that abstraction to
| snap fit upon the real world.
|
| could you further elaborate on what you mean by limits,
| because im happy to play contrarian on what i think i
| interpret you to be saying there.
|
| also to your main point: what intelligence is. yeah you
| sort of hit up my thoughts on intelligence. its a
| combination of problem solving abilities in different
| domains. its like an amalgam of cognitive processes that
| achieve an amalgam of capabilities. while we can label
| alllllll that with a singular word, doesnt mean its all a
| singular process. seems like its a composite. moreover i
| think a big chunk of intelligence (but not all) is just
| brute forcing finding associations and then encoding those
| by some reflexive search/retrieval. a different part of
| intelligence of course is adaptability and pattern finding.
| Bjorkbat wrote:
| I think it's still an interesting way to measure general
| intellience, it's just that o3 has demonstrated that you can
| actually achieve human performance on it by training it on the
| public training set and giving it ridiculous amounts of
| compute, which I imagine equates to ludicrously long chains-of-
| thought, and if I understand correctly more than one chain-of-
| thought per task (they mention sample sizes in the blog post,
| with o3-low using 6 and o3-high using 1024. Not sure if these
| are chains-of-thought per task or what).
|
| Once you look at it that way, the approach really doesn't
| look like intelligence that's able to generalize to novel
| domains. It doesn't pass the sniff test. It looks a lot more
| like brute-forcing.
|
| Which is probably why, in order to actually qualify for the
| leaderboard, they stipulate that you can't use more than $10k
| of compute. Otherwise, it just sounds like brute-forcing.
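|
| (One plausible reading of those "sample sizes", and this is
| speculation since OpenAI hasn't published the details, is
| best-of-N sampling with a vote over final answers. The
| `generate_cot`/`final_answer` names below are hypothetical:)
|
|     from collections import Counter
|
|     def solve_by_sampling(task, model, n_samples=1024):
|         # Sample many independent chains-of-thought for the task,
|         # keeping only the final answer each one produces.
|         answers = [model.generate_cot(task).final_answer
|                    for _ in range(n_samples)]
|         # Majority vote: the most common answer wins. More samples
|         # means more compute, which would explain o3-high's cost.
|         return Counter(answers).most_common(1)[0][0]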
| BriggyDwiggs42 wrote:
| I disagree. It's vastly inefficient, but it is managing to
| actually solve these problems with a vast search space. If we
| extrapolate this approach into the future and assume that the
| search becomes better as the underlying model improves, and
| assume that the architecture grows more efficient, and assume
| that the type of parallel computing used here grows cheaper,
| isn't it possible that this is a lot more than brute-forcing
| in terms of what it will achieve? In other words, is it maybe
| just a really ugly way of doing something functionally
| equivalent to reasoning?
| attentionmech wrote:
| Isn't this at the level now where it can sort of self-improve? My
| guess is that they will just use it to improve the model and the
| cost they are showing per evaluation will go down drastically.
|
| So, next step in reasoning is open world reasoning now?
| dyauspitr wrote:
| I don't believe so. If it were at the point where you could
| just plug it into a bunch of camera feeds around the world
| and it could, on its own, filter out a useful training set
| for itself from that data, then we truly would have AGI. I
| don't think it's there yet.
| yawnxyz wrote:
| O3 High (tuned) model scored an 88% at what looks like
| $6,000/task haha
|
| I think soon we'll be pricing any kind of tasks by their compute
| costs. So basically, human = $50/task, AI = $6,000/task, use
| human. If AI beats human, use AI? Ofc that's considering both get
| 100% scores on the task
| cchance wrote:
| Isn't that generally what... all jobs are? Automation cost vs
| long-term human cost... It's why Amazon did the weird "our
| stores are AI driven" thing, but in reality it was cheaper to
| hire a bunch of guys in a sweatshop to look at the cameras and
| write things down lol.
|
| The thing is given what we've seen from distillation and tech,
| even if its 6,000/task... that will come down drastically over
| time through optimization and just... faster more efficient
| processing hardware and software.
| cryptoegorophy wrote:
| I remember hearing about Tesla trying to automate all of
| production, but some things just couldn't be automated, like
| the wiring, which humans still had to do.
| dyauspitr wrote:
| Compute can get optimized and cheap quickly.
| karmasimida wrote:
| Is it? Moore's law is dead dead, I don't think this is a
| given.
| jsheard wrote:
| That's the elephant in the room with the reasoning/COT
| approach, it shifts what was previously a scaling of training
| costs into scaling of training _and_ inference costs. The
| promise of doing expensive training once and then running the
| model cheaply forever falls apart once you're burning tens,
| hundreds or thousands of dollars worth of compute every time
| you run a query.
| Legend2440 wrote:
| Yeah, but next year they'll come out with a faster GPU, and
| the year after that another still faster one, and so on.
| Compute costs are a temporary problem.
| freehorse wrote:
| The issue is not just scaling compute, but scaling it at a
| rate that meets the increase in complexity of the problems
| that are not currently solved. If that is O(n), then what
| you say probably stands. If it is e.g. O(n^8) or
| exponential, then there is no hope of getting good enough
| scaling by increasing compute at a normal rate. AI
| technology would still be improving, but improving towards
| a halt, practically stagnating.
|
| o3 will be interesting if it indeed offers a novel
| technology to handle problem solving, something that is
| able to learn from few novel examples efficiently and
| adapt. That's what intelligence actually is. Maybe this is
| the case. If, on the other hand, it is a smart way to pair
| CoT within an evaluation loop (as the author hints as
| possibility) then it is probable that, while this _can_
| handle a class of problems that current LLMs cannot, it is
| not really this kind of learning, meaning that it will not
| be able to scale to more complex, real world tasks with a
| problem space that is too large and thus less amenable to
| such a technique. It is still interesting, because having a
| good enough evaluator may be very important step, but it
| would mean that we are not yet there.
|
| We will learn soon enough I suppose.
| Workaccount2 wrote:
| They're gonna figure it out. Something is being missed
| somewhere, as human brains can do all this computation on 20
| watts. Maybe it will be a hardware shift or maybe just a
| software one, but I strongly suspect that modern transformers
| are grossly inefficient.
| redeux wrote:
| Time and availability would also be factors.
| Benjaminsen wrote:
| Compute costs for AI with roughly the same capabilities
| have been halving every ~7 months.
|
| That makes something like this competitive in ~3 years
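|
| (A back-of-the-envelope check, taking the ~$6,000/task and
| $50/task human figures mentioned upthread as givens, and
| assuming the ~7-month halving rate holds:)
|
|     import math
|
|     ai_cost, human_cost, halving_months = 6000.0, 50.0, 7.0
|     # Halvings needed before the AI undercuts the human on cost
|     halvings = math.log2(ai_cost / human_cost)    # ~6.9 halvings
|     months = halvings * halving_months            # ~48 months
|     print(f"roughly {months / 12:.1f} years to reach ${human_cost:.0f}/task")
|
| Whether that lands closer to three years or four depends
| entirely on which cost figures you start from.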
| seizethecheese wrote:
| And human costs have been increasing a few percent per year
| for a few centuries!
| freehorse wrote:
| This makes me think and speculate whether the solution
| comprises a "solver" trying semi-random or more targeted
| things and a "checker" checking these? Usually checking a
| solution is cognitively (and computationally) easier than
| coming up with it. Otherwise I cannot think what sort of
| compute would burn $6000 per task, unless you are going
| through a lot of loops and you have somehow solved the part
| of the problem that can figure out whether a solution is
| correct or not, while coming up with the actual correct
| solution is not yet solved to the same degree. Or maybe I am
| just naive and these prices are just like breakfast for
| companies like that.
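|
| (If it is something like that, a minimal generate-and-verify
| loop might look like the sketch below. Purely speculative: the
| o3 internals aren't public, and `propose`/`check` are
| hypothetical stand-ins for whatever the real system does.)
|
|     def generate_and_verify(task, solver, checker, budget=10_000):
|         best = None
|         for _ in range(budget):
|             candidate = solver.propose(task)        # semi-random or guided draft
|             score = checker.check(task, candidate)  # checking is the cheaper step
|             if best is None or score > best[0]:
|                 best = (score, candidate)
|             if score >= 1.0:                        # checker fully satisfied
|                 break
|         return best[1] if best else None
|
| Most of the per-task cost would then come from running the
| proposer thousands of times rather than from any single step.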
| og_kalu wrote:
| It's not $6000/task (i.e. per question). $6000 is about the
| retail cost for evaluating the entire benchmark on high
| efficiency (about 400 questions).
| Tiberium wrote:
| From reading the blog post and Twitter, and cost of other
| models, I think it's evident that it IS actually cost per
| task, see this tweet: https://files.catbox.moe/z1n8dc.jpg
|
| And o1 cost $15/$60 for 1M in/out, so the estimated costs on
| the graph would match for a single task, not the whole
| benchmark.
| slibhb wrote:
| The blog clarifies that it's $17-20 per task. Maybe it runs
| into thousands for tasks it can't solve?
| Tiberium wrote:
| That cost is for o3 low, o3 high goes into thousands per
| task.
| gbnwl wrote:
| Well they got 75.7% at $17/task. Did you see that?
| seydor wrote:
| What if we use those humans to generate energy for the tasks?
| spaceman_2020 wrote:
| Just as an aside, I've personally found o1 to be completely
| useless for coding.
|
| Sonnet 3.5 remains the king of the hill by quite some margin
| cchance wrote:
| The new gemini's are pretty good too
| lysecret wrote:
| Actually prefer new geminis too. 2.0 experimental especially.
| spaceman_2020 wrote:
| The new ai studio from Google is fantastic
| og_kalu wrote:
| To be fair, until the last checkpoint released 2 days ago, o1
| didn't really beat sonnet (and if so, barely) in most non-
| competitive coding benchmarks
| vessenes wrote:
| To fill this out, I find o1-pro (and -preview when it was live)
| to be pretty good at filling in blindspots/spotting holistic
| bugs. I use Claude for day to day, and when Claude is spinning,
| o1 often can point out why. It's too slow for AI coding, and I
| agree that at default its responses aren't always satisfying.
|
| That said, I think its code style is arguably better, more
| concise and has better patterns -- Claude needs a fair amount
| of prompting and oversight to not put out semi-shitty code in
| terms of structure and architecture.
|
| In my mind: going from Slowest to Fastest, and Best
| Holistically to Worst, the list is:
|
| 1. o1-pro 2. Claude 3.5 3. Gemini 2 Flash
|
| Flash is so fast, that it's tempting to use more, but it really
| needs to be kept to specific work on strong codebases without
| complex interactions.
| spaceman_2020 wrote:
| Claude has a habit of sometimes just getting "lost"
|
| Like I'll have it in a project in Cursor and it will spin up
| ready to use components that use my site style, reference
| existing components, and follow all existing patterns
|
| Then on some days, it will even forget what language the
| project is in and start giving me python code for a react
| project
| causal wrote:
| Yeah it's almost like system 1 vs system 2 thinking
| bearjaws wrote:
| o1 is pretty good at spotting OWASP defects, compared to most
| other models.
|
| https://myswamp.substack.com/p/benchmarking-llms-against-com...
| InkCanon wrote:
| I just asked o1 a simple yes or no question about x86 atomics
| and it did one of those A or B replies. The first answer was
| yes, the second answer was no.
| m3kw9 wrote:
| o1 is for when all else fails. Sometimes it makes the same
| mistakes as weaker models if you give it simple tasks with
| very little context, but when a good, precise context is
| given it usually outperforms other models.
| karmasimida wrote:
| Yeah, I feel that for the chat use case o1 is just too slow
| for me, and my queries aren't that complicated.
|
| For coding, o1 is marvelous at Leetcode questions. I think it
| is the best teacher I could ever afford to teach me
| leetcoding, but I don't find myself having a lot of other use
| cases for o1 that are complex and require a really long
| reasoning chain.
| bitbuilder wrote:
| I find myself hopping between o1 and Sonnet pretty frequently
| these days, and my personal observation is that the quality of
| output from o1 scales more directly to the quality of the
| prompting you're giving it.
|
| In a way it almost feels like it's become _too_ good at
| following instructions and simply just takes your direction
| more literally. It doesn't seem to take the initiative of
| going the extra mile of filling in the blanks from your lazy
| input (note: many would see this as a good thing). Claude on
| the other hand feels more intuitive in discerning intent from a
| lazy prompt, which I may be prone to offering it at times when
| I'm simply trying out ideas.
|
| However, if I take the time to write up a well thought out
| prompt detailing my expectations, I find I much prefer the code
| o1 creates. It's smarter in its approach, offers clever ideas I
| wouldn't have thought of, and generally cleaner.
|
| Or put another way, I can give Sonnet a lazy or detailed prompt
| and get a good result, while o1 will give me an excellent
| result with a well thought out prompt.
|
| What this boils down to is I find myself using Sonnet while
| brainstorming ideas, or when I simply don't know how I want to
| approach a problem. I can pitch it a feature idea the same way
| a product owner might pitch an idea to an engineer, and then
| iterate through sensible and intuitive ways of looking at the
| problem. Once I get a handle on how I'd like to implement a
| solution, I type up a spec and hand it off to o1 to crank out
| the code I'd intend to implement.
| jules wrote:
| Can you solve this by putting your lazy prompt through GPT-4o
| or Sonnet 3.6 and asking it to expand the prompt to a full
| prompt for o1?
| spaceman_2020 wrote:
| Have you found any tool or guide for writing better o1
| prompts? This isn't the first time I've heard this about o1
| but no one seems to know _how_ to prompt it
| leumon wrote:
| I've found gemini-1206 to be best. and we can use it free (for
| now), in google's aistudio. It's number 1 on lmarena.ai for
| coding, and generally, and number 1 on bigcodebench.
| energy123 wrote:
| Which o1? A new version was released a few days ago and beats
| Sonnet 3.5 on Livebench
| smy20011 wrote:
| It seems o3 follows the trend of chess engines, where you can
| cut your search depth depending on the state.
|
| That works well for games with a clear signal of success
| (win/lose for chess, tests for programming). One of the
| blockers for AGI is that we don't have clear evaluation for
| most of our tasks and we cannot verify them fast enough.
| flakiness wrote:
| The cost axis is interesting. O3 Low is $10+ per task and O3
| High is over $1000 (it's a logarithmic graph, so it's like $50
| and $5000 respectively?)
| obblekk wrote:
| Human performance is 85% [1]. o3 high gets 87.5%.
|
| This means we have an algorithm to get to human level performance
| on this task.
|
| If you think this task is an eval of general reasoning ability,
| we have an algorithm for that now.
|
| There's a lot of work ahead to generalize o3 performance to all
| domains. I think this explains why many researchers feel AGI is
| within reach, now that we have an algorithm that works.
|
| Congrats to both Francois Chollet for developing this compelling
| eval, and to the researchers who saturated it!
|
| [1] https://x.com/SmokeAwayyy/status/1870171624403808366,
| https://arxiv.org/html/2409.01374v1
| phillipcarter wrote:
| As excited as I am by this, I feel like this is still
| just a small approximation of a small chunk of human reasoning
| ability at large. o3 (and whatever comes next) feels to me like
| it will head down the path of being a reasoning coprocessor for
| various tasks.
|
| But, still, this is incredibly impressive.
| qt31415926 wrote:
| Which parts of reasoning do you think is missing? I do feel
| like it covers a lot of 'reasoning' ground despite its on the
| surface simplicity
| phillipcarter wrote:
| I think it's hard to enumerate the unknown, but I'd
| personally love to see how models like this perform on
| things like word problems where you introduce red herrings.
| Right now, LLMs at large tend to struggle mightily to
| understand when some of the given information is not only
| irrelevant, but may explicitly serve to distract from the
| real problem.
| KaoruAoiShiho wrote:
| o1 already fixed the red herrings...
| zmgsabst wrote:
| That's not inability to reason though, that's having a
| social context.
|
| Humans also don't tend to operate in a rigorously logical
| mode and understand that math word problems are an
| exception where the language may be adversarial: they're
| trained for that special context in school. If you tell
| the LLM that social context, eg that language may be
| deceptive, their "mistakes" disappear.
|
| What you're actually measuring is the LLM defaults to
| assuming you misspoke while trying to include relevant
| information rather than that you were trying to trick it
| -- which is the social context you'd expect when trained
| on general chat interactions.
|
| Establishing context in psychology is hard.
| Agentus wrote:
| kinda interesting, every single CS person (especially phds),
| when talking about reasoning, is unable to concisely
| quantify, enumerate, qualify, or define reasoning.
|
| people with (high) intelligence talking and building
| (artificial) intelligence but never able to convincingly
| explain aspects of intelligence. just often talk
| ambiguously and circularly around it.
|
| what are we humans getting ourselves into inventing skynet
| :wink.
|
| its been an ongoing pet project to tackle reasoning, but i
| cant answer your question with regards to llms.
| YeGoblynQueenne wrote:
| >> Kinda interesting, every single CS person (especially
| phds) when talking about reasoning are unable to
| concisely quantify, enumerate, qualify, or define
| reasoning.
|
| Kinda interesting that mathematicians also can't do the
| same for mathematics.
|
| And yet.
| Agentus wrote:
| well lets just say i think i can explain reasoning better
| than anyone ive encountered. i have my own hypothesized
| theory on what it is and how it manifests in neural
| networks.
|
| i doubt your mathematician example is equivalent.
|
| examples that are fresh on the mind that further my
| point. ive heard yann lecun baffled by llms
| instantiation/emergence of reasoning, along with other ai
| researchers. eric Schmidt thinks the agentic reasoning is
| the current frontier and people should be focusing on
| that. was listening to the start of an ai machine
| learning interview a week ago with some cs phd asked to
| explain reasoning and the best he could muster up is you
| know it when you see it... not to mention the guy
| responding to the grandparent who gave a cop-out answer
| (all due respect to him).
| necovek wrote:
| Care to enlighten us with your explanation of what
| "reasoning" is?
| Agentus wrote:
| terribly sorry to be such a tease, but im looking to
| publish a paper on it, and still need to delve deeper
| into machine interpretability to make sure its
| empirically properly couched. if u can help with that
| perhaps we can continue this convo in private.
| YeGoblynQueenne wrote:
| >> well lets just say i think i can explain reasoning
| better than anyone ive encountered. i have my own
| hypothesized theory on what it is and how it manifests in
| neural networks.
|
| I'm going to bet you haven't encountered the right people
| then. Maybe your social circle is limited to folks like
| the person who presented a slide about A* to a dumb-
| struck roomfull of Deep Learning researchers, in the last
| NeurIps?
|
| https://x.com/rao2z/status/1867000627274059949
| Agentus wrote:
| possibly, my university doesn't really do ai research
| beyond using it as a tool to engineer things. im looking
| to transfer to a different university.
|
| but no, my take on reasoning is really a somewhat
| generalized reframing of the definition of reasoning
| (which you might find on the stanford encylopedia of
| philosophy) thats reframed partially in axiomatic
| building blocks of neural network components/terminology.
| im not claiming to have discovered reasoning, just
| redefine it in a way thats compatible and sensible to
| neural networks (ish).
| YeGoblynQueenne wrote:
| Well you're free to define and redefine anything and as
| you like, but be aware that every time you move the
| target closer to your shot you are setting yourself up
| for some pretty strong confirmation bias.
| Agentus wrote:
| yeah thats why i need help from the machine
| interpretability crowd to make sure my hypothesized
| reframing of reasoning has sufficient empirical basis and
| isn't adrift in lalaland.
| logicchains wrote:
| Mathematicians absolutely can, it's called foundations,
| and people actively study what mathematics can be
| expressed in different foundations. Most mathematicians
| don't care about it though for the same reason most
| programmers don't care about Haskell.
| YeGoblynQueenne wrote:
| I don't care about Haskell either, but we know what
| reasoning is [1]. It's been studied extensively in
| mathematics, computer science, psychology, cognitive
| science and AI, and in philosophy going back literally
| thousands of years with grandpapa Aristotle and his
| syllogisms. Formal reasoning, informal reasoning, non-
| monotonic reasoning, etc etc. Not only do we know what
| reasoning is, we know how to do it with computers just
| fine, too [2]. That's basically the first 50 years of AI,
| that folks like His Nobelist Eminence Geoffrey Hinton
| will tell you was all a Bad Idea and a total failure.
|
| Still somehow the question keeps coming up- "what is
| reasoning". I'll be honest and say that I imagine it's
| mainly folks who skipped CS 101 because they were busy
| tweaking their neural nets who go around the web like
| Diogenes with his lantern, howling "Reasoning! I'm
| looking for a definition of Reasoning! What is
| Reasoning!".
|
| I have never heard the people at the top echelons of AI
| and Deep learning - LeCun, Schmidhuber, Bengio, Hinton,
| Ng, Hutter, etc etc- say things like that: "what's
| reasoning". The reason I suppose is that they know
| exactly what that is, because it was the one thing they
| could never do with their neural nets, that classical AI
| could do between sips of coffee at breakfast [3]. Those
| guys know exactly what their systems are missing and, to
| their credit, have never made no bones about that.
|
| _________________
|
| [1] e.g. see my profile for a quick summary.
|
| [2] See all of Russell & Norvig, for instance.
|
| [3] Schmidhuber's doctoral thesis was an implementation
| of genetic algorithms in Prolog, even.
| Agentus wrote:
| i have a question for you, in which ive asked many
| philosophy professors but none could answer
| satisfactorily. since you seem to have a penchant for
| reasoning perhaps you might have a good answer. (i hope i
| remember the full extent of the question properly, i
| might hit you up with some follow questions)
|
| it pertains to the source of the inference power of
| deductive inference. do you think all deductive reasoning
| originated inductively? like when someone discovers a
| rule or fact that seemingly has contextual predictive
| power, obviously that can be confirmed inductively by
| observations, but did that deductive reflex of the mind
| coagulate by inductive experiences. maybe not all
| deductive derivative rules but the original deductive
| rules.
| YeGoblynQueenne wrote:
| I'm sorry but I have no idea how to answer your question,
| which is indeed philosophical. You see, I'm not a
| philosopher, but a scientist. Science seeks to pose
| questions, and answer them; philosophy seeks to pose
| questions, and question them. Me, I like answers more
| than questions so I don't care about philosophy much.
| Agentus wrote:
| well yeah its partially philosophical, i guess my
| haphazard use of language like "all" makes it more
| philosophical than intended.
|
| but im getting at a few things. one of those things is
| neurological. how do deductive inference constructs
| manifest in neurons and is it really inadvertently an
| inductive process that that creates deductive neural
| functions.
|
| other aspect of the question i guess is more
| philosophical. like why does deductive inference work at
| all, i think clues to a potential answer to that can be
| seen in the mechanics of generalization of antecedents
| predicting(or correlating with) certain generalized
| consequences consistently. the brain coagulates
| generalized coinciding concepts by reinforcement and it
| recognizes or differentiates inclusive instances or
| excluding instances of a generalization by recognition
| properties that seem to gatekeep identities accordingly.
| its hard to explain succinctly what i mean by the latter,
| but im planning on writing an academic paper on that.
| mistermann wrote:
| >Those guys know exactly what their systems are missing
|
| If they did not actually, would they (and you)
| necessarily be able to know?
|
| Many people claim the ability to prove a negative, but no
| one will post their method.
| YeGoblynQueenne wrote:
| To clarify, what neural nets are missing is a capability
| present in classical, logic-based and symbolic systems.
| That's the ability that we commonly call "reasoning". No
| need to prove any negatives. We just point to what
| classical systems are doing and ask whether a deep net
| can do that.
| john_minsk wrote:
| My personal 5 cents is that reasoning will be there when
| an LLM gives you some kind of outcome and then when questioned
| about it can explain every bit of result it produced.
|
| For example, if we asked an LLM to produce an image of a
| "human woman photorealistic" it produces result. After that
| you should be able to ask it "tell me about its background"
| and it should be able to explain "Since user didn't specify
| background in the query I randomly decided to draw her
| standing in front of a fantasy background of Amsterdam
| iconic houses. Usually Amsterdam houses are 3 stories tall,
| attached to each other and 10 meters wide. Amsterdam houses
| usually have cranes on the top floor, which help to bring
| goods to the top floor since doors are too narrow for any
| object wider than 1m. The woman stands in front of the
| houses approximately 25 meters in front of them. She is
| 1,59m tall, which gives us correct perspective. It is
| 11:16am of August 22nd which I used to calculate correct
| position of the sun and align all shadows according to
| projected lighting conditions. The color of her skin is set
| at RGB:xxxxxx randomly" etc.
|
| And it is not too much to ask LLMs for it. LLMs have access
| to all the information above as they read all the internet.
| So there is definitely a description of Amsterdam
| architecture, what a human body looks like or how to
| correctly estimate time of day based on shadows (and vice
| versa). The only thing missing is logic that connects all
| this information and which is applied correctly to generate
| final image.
|
| I like to think about LLMs as fancy genius compression
| engines. They took all the information in the internet,
| compressed it and are able to cleverly query this
| information for end user. It is a tremendously valuable
| thing, but if intelligence emerges out of it - not sure.
| Digital information doesn't necessarily contain everything
| needed to understand how it was generated and why.
| concordDance wrote:
| > if we asked an LLM to produce an image of a "human
| woman photorealistic" it produces result
|
| Large language models don't do that. You'd want an image
| model.
|
| Or did you mean "multi-model AI system" rather than
| "LLM"?
| owenpalmer wrote:
| It might be possible for a language model to paint a
| photorealistic picture though.
| 0points wrote:
| It is not.
|
| You are confusing LLMs with Generative AI.
| amelius wrote:
| Can an LLM use tools like humans do? Could it use an
| image model as a tool to query the image?
| 0points wrote:
| No, a LLM is a Large Language Model.
|
| It can language.
| amelius wrote:
| You could teach it to emit patterns that (through other
| code) invoke tools, and loop the results back to the LLM.
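|
| (A minimal sketch of that loop, assuming a hypothetical
| `llm.complete` call and a plain-text convention for tool
| invocations; real systems use structured function-calling
| APIs, but the shape is the same:)
|
|     import re
|
|     TOOLS = {"describe_image": lambda path: f"a photo at {path}"}  # stub tool
|
|     def run_with_tools(llm, prompt, max_turns=5):
|         transcript = prompt
|         for _ in range(max_turns):
|             reply = llm.complete(transcript)               # model emits text
|             call = re.search(r"TOOL:(\w+)\((.*?)\)", reply)
|             if not call:
|                 return reply                               # no tool requested: done
|             tool, arg = call.group(1), call.group(2)
|             result = TOOLS[tool](arg)                      # other code runs the tool
|             transcript += f"\n{reply}\nRESULT: {result}"   # loop result back in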
| Xmd5a wrote:
| LLMs are still bound to a prompting session. They can't
| form long term memories, can't ponder on it and can't
| develop experience. They have no cognitive architecture.
|
| 'Agents' (i.e. workflows intermingling code and calls to
| LLMs) are still a thing (as shown by the fact there is a
| post by anthropic on this subject on the front page right
| now) and they are very hard to build.
|
| Consequence of that for instance: it's not possible to have
| an LLM explore a topic _exhaustively_.
| mjhagen wrote:
| LLMs don't, but who said AGI should come from LLMs alone.
| When I ask ChatGPT about something "we" worked on months
| ago, it "remembers" and can continue on the conversation
| with that history in mind.
|
| I'd say humans are also bound to prompting sessions in
| that way.
| Xmd5a wrote:
| Last time I used ChatGPT 'memory' feature it got full
| very quickly. It remembered my name, my dog's name and a
| couple of tobacco casing recipes it came up with. OpenAI
| doesn't seem to be using embeddings and a vector
| database, just text snippets it injects in every
| conversation. Because RAG is too brittle ? The same
| problem arises when composing LLM calls. Efficient and
| robust workflows are those whose prompts and/or DAG were
| obtained via optimization techniques. Hence DSPy.
|
| Consider the following use case: keeping a swimming pool
| water clean. I can have a long running conversation with
| a LLM to guide me in getting it right. However I can't
| have a LLM handle the problem autonomously. I'd like to
| have it notify me on its own "hey, it's been 2 days, any
| improvement? Do you mind sharing a few pictures of the
| pool as well as the ph/chlorine test results ?". Nothing
| mind-bogglingly complex. Nothing that couldn't be achieved
| using current LLMs. But still something I'd have to
| implement myself and which turns out to be more complex
| to achieve than expected. This is the kind of improvement
| I'd like to see big AI companies going after rather than
| research-grade ultra smart AIs.
| amelius wrote:
| Does it include the use of tools to accomplish a task?
|
| Does it include the invention of tools?
| tim333 wrote:
| Current AI is good at text but not very good at 3d physical
| stuff like fixing your plumbing.
| mistermann wrote:
| Optimal phenomenological reasoning is going to be a tough
| nut to crack.
|
| Luckily we don't know the problem exists, so in a
| cultural/phenomenological sense it is already cracked.
| azeirah wrote:
| I'd like to see this o3 thing play 5d chess with multiverse
| time travel or baba is you.
|
| The only effect smarter models will have is that intelligent
| people will have to use less of their brain to do their work.
| As has always been the case, the medium is the message, and
| climate change is one of the most difficult and worst
| problems of our time.
|
| If this gets software people to quit en masse and start
| working in energy, biology, ecology and preservation? Then it
| has succeeded.
| concordDance wrote:
| > climate change is one of the most difficult and worst
| problems of our time.
|
| Slightly surprised to see this view here.
|
| I can think of half a dozen more serious problems off hand
| (e.g. population aging, institutional scar tissue,
| dysgenics, nuclear proliferation, pandemic risks, AI
| itself) along most axes I can think of (raw $ cost, QALYs,
| even X-risk).
| ALittleLight wrote:
| It's not saturated. 85% is average human performance, not "best
| human" performance. There is still room for the model to go up
| to 100% on this eval.
| scotty79 wrote:
| Still it's comparing average human level performance with best
| AI performance. Examples of things o3 failed at are insanely
| easy for humans.
| FrustratedMonky wrote:
| There are things Chimps do easily that humans fail at, and
| vice/versa of course.
|
| There are blind spots, doesn't take away from 'general'.
| noobermin wrote:
| The downvotes should tell you, this is a decided "hype"
| result. Don't poo poo it, that's not allowed on AI slop
| posts on HN.
| FrustratedMonky wrote:
| Yeah, I didn't realize Chimp studies, or neuroscience
| were out of vogue. Even in tech, people form strong
| 'beliefs' around what they think is happening.
| Matumio wrote:
| We can't agree whether Portia spiders are intelligent or
| just have very advanced instincts. How will we ever agree
| about what human intelligence is, or how to separate it
| from cultural knowledge? If that even makes sense.
| FrustratedMonky wrote:
| I guess my point is more, if we can't decide about Portia
| Spiders or Chimps, then how can we be so certain about
| AI. So offering up Portia and Chimps as counter examples.
| cchance wrote:
| You'd be surprised what the AVERAGE human fails to do that
| you think is easy, my mom can't fucking send an email without
| downloading a virus, i have a coworker that believes beyond a
| shadow of a doubt the world is flat.
|
| The Average human is a lot dumber than people on hackernews
| and reddit seem to realize, shit the people on mturk are
| likely smarter than the AVERAGE person
| staticman2 wrote:
| Yet the average human can drive a car a lot better than
| ChatGPT can, which shows that the way you frame
| "intelligence" dictates your conclusion about who is
| "intelligent".
| p1esk wrote:
| Pretty sure a waymo car drives better than an average SF
| driver.
| manquer wrote:
| Waymo cannot handle poor weather at all, average human
| can.
|
| Being able to perform better than humans in specific
| constrained problem space is how every automation system
| has been developed.
|
| While self driving systems are impressive, they don't
| drive anywhere close to the skill of the average driver.
| tim333 wrote:
| Waymo blog with video of them driving in poor weather
| https://waymo.com/blog/2019/08/waymo-and-weather
| manquer wrote:
| And Nikola famously made a video of a truck which had no
| engine; we don't take a company's word for anything until
| we can verify it.
|
| This is not offered to the public; they are actively
| expanding only in cities like LA, Miami or Phoenix, where
| the weather is good throughout the year.
|
| The tech for bad weather is nowhere close to ready for the
| public. The average human, on the other hand, is driving
| in bad weather every day.
| tim333 wrote:
| "Extreme Weather" tech "will be available to riders in
| the near future"
| https://www.cnet.com/roadshow/news/waymos-latest-
| robotaxi-is...
| daveguy wrote:
| I'm sure the source of that CNET article came with a
| forward looking statements disclaimer.
| Mordisquitos wrote:
| And how well would a Waymo car do in this challenge with
| the ARC-AGI datasets?
| coldcode wrote:
| There's a reason why Waymo isn't offered in Buffalo.
| fragmede wrote:
| Is that reason because Buffalo is the 81st most populated
| city in the United States, or 123rd by population
| density, and Waymo currently only serves approximately 3
| cities in North America?
|
| We already let computers control cars because they're
| better than humans at it when the weather is inclement.
| It's called ABS.
| tracerbulletx wrote:
| If you take an electrical sensory input signal sequence,
| and transform it to a electrical muscle output signal
| sequence you've got a brain. ChatGPT isn't going to drive
| a car because it's trained on verbal tokens, and it's not
| optimized for the type of latency you need for physical
| interaction.
|
| And the brain doesn't use the same network to do verbal
| reasoning as real time coordination either.
|
| But that work is moving along fine. All of these models
| and lessons are going to be combined into AGI. It is
| happening. There isn't really that much in the way.
| mirkodrummer wrote:
| Not being able to send an email, or believing the world is
| flat, isn't a matter of intelligence; I'd rather say it's
| more about culture or how much schooling someone has had.
| Your mom or coworker can still instinctively do stuff that
| outperforms every algorithm out there, and it is still
| unexplained how we do it. We still have no idea what
| intelligence is.
| 0points wrote:
| Your examples are just examples of lack of information.
| That's not a measure for intelligence.
|
| As a contrary point, most people think they are smarter
| than they really are.
| HarHarVeryFunny wrote:
| Maybe, but no doubt these "dumb" people can still get
| dressed in the morning, navigate a trip to the mall, do the
| dishes, etc, etc.
|
| It's always been the case that the things that are easiest
| for humans are hardest for computers, and vice versa.
| Humans are good at general intelligence - tackling semi-
| novel problems all day long, while computers are good at
| narrow problems they can be trained on such as chess or
| math.
|
| The majority of the benchmarks currently used to evaluate
| these AI models are narrow skills that the models have been
| trained to handle well. What'll be much more useful will be
| when they are capable of the generality of "dumb" tasks
| that a human can do.
| cryptoegorophy wrote:
| What's interesting is it might be much closer to human
| intelligence than to some "alien" intelligence, because after
| all it is an LLM trained on human-made text, which kind of
| represents human intelligence.
| hammock wrote:
| In that vein, perhaps the delta between o3 @ 87.5% and Human
| @ 85% represents a deficit in the ability of text to
| communicate human reasoning.
|
| In other words, it's possible humans can reason better than
| o3, but cannot articulate that reasoning as well through text
| - only in our heads, or through some alternative medium.
| 85392_school wrote:
| I wonder how much of an effect amount of time to answer has
| on human performance.
| yunwal wrote:
| Yeah, this is sort of meaningless without some idea of
| cost or consequences of a wrong answer. One of the nice
| things about working with a competent human is being able
| to tell them "all of our jobs are on the line" and
| knowing with certainty that they'll come to a good
| answer.
| unsupp0rted wrote:
| It's possible humans reason better through text than not
| through text, so these models, having been trained on text,
| should be able to out-reason any person who's not currently
| sitting down to write.
| hamburga wrote:
| Agreed. I think what really makes them alien is everything
| else about them besides intelligence. Namely, no
| emotional/physiological grounding in empathy, shame, pride,
| and love (on the positive side) or hatred (negative side).
| antirez wrote:
| NNs are not algorithms.
| notfish wrote:
| An algorithm is "a process or set of rules to be followed in
| calculations or other problem-solving operations, especially
| by a computer"
|
| How does a giant pile of linear algebra not meet that
| definition?
| antirez wrote:
| It's not made of "steps", it's an almost continuous
| function to its inputs. And a function is not an algorithm:
| it is not an object made of conditions, jumps,
| terminations, ... Obviously it has computation capabilities
| and is Turing-complete, but is the opposite of an
| algorithm.
| raegis wrote:
| > It's not made of "steps", it's an almost continuous
| function to its inputs.
|
| Can you define "almost continuous function"? Or explain
| what you mean by this, and how it is used in the A.I.
| stuff?
| taneq wrote:
| Well, it's a bunch of steps, but they're smaller. /s
| janalsncm wrote:
| If it wasn't made of steps then Turing machines wouldn't
| be able to execute them.
|
| Further, this is probably running an algorithm on top of
| an NN. Some kind of tree search.
|
| I get what you're saying though. You're trying to draw a
| distinction between statistical methods and symbolic
| methods. Someday we will have an algorithm which uses
| statistical methods that can match human performance on
| most cognitive tasks, and it won't look or act like a
| brain. In some sense that's disappointing. We can build
| supersonic jets without fully understanding how birds
| fly.
| antirez wrote:
| Let's say that Turing machines can approximate the
| execution of NNs :) That's why there are issues related to
| numerical precision. The contrary is also true: NNs can
| discover and use techniques similar to those used by
| traditional algorithms. However, the two remain two
| different methods of doing computation, and it's probably
| not just by chance that many things we can't do
| algorithmically, we can do with NNs. What I mean is that
| this is not _just_ because NNs discover complex algorithms
| via gradient descent, but also because the computational
| model of NNs is better suited to solving certain tasks. So
| the inference algorithm of NNs (doing multiplications and
| other batch transformations) is just what standard
| computers need in order to approximate the NN computational
| model. You could do this analogically, and nobody (maybe?)
| would claim it's running an algorithm. Or that brains
| themselves are algorithms.
| zeroonetwothree wrote:
| We don't have evidence that a TM can simulate a brain.
| But we know for a fact that it can execute a NN.
| necovek wrote:
| Computers can execute precise computations, it's just not
| efficient (and it's very slow).
|
| NNs are exactly what "computers" are good for and what
| we've been using them for since their inception: doing lots
| of computations quickly.
|
| "Analog neural networks" (brains) work very differently
| from what we call "neural networks" in computing, and we
| have no understanding of their operation to claim they are
| or aren't algorithmic. But computing NNs are simply
| implementations of an algorithm.
|
| Edit: upon further rereading, it seems you equate "neural
| networks" with brain-like operation. But the brain was an
| inspiration for NNs; they are not an "approximation" of
| it.
| antirez wrote:
| But the inference itself is orthogonal to the computation
| the NN is doing. Obviously the inference (and training)
| are algorithms.
| tsimionescu wrote:
| NN inference is an algorithm for computing an
| approximation of a function with a huge number of
| parameters. The NN itself is of course just a data
| structure. But there is nothing whatsoever about the NN
| process that is non-algorithmic.
|
| It's the exact same thing as using a binary tree to
| discover the lowest number in some set of numbers,
| conceptually: you have a data structure that you evaluate
| using a particular algorithm. The combination of the
| algorithm and the construction of the data structure
| arrive at the desired outcome.
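|
| A minimal sketch of that framing (illustrative only, not any
| particular model): the "NN" is just a data structure of
| weights, and inference is an ordinary algorithm - loops,
| arithmetic, termination - walked over it:
|
|   def relu(x):
|       return x if x > 0.0 else 0.0
|
|   def forward(layers, inputs):
|       # For each layer: matrix-vector product plus bias,
|       # then a nonlinearity. Plain loops over plain data.
|       activations = inputs
|       for weights, biases in layers:
|           activations = [
|               relu(sum(w * a for w, a in zip(row, activations)) + b)
|               for row, b in zip(weights, biases)
|           ]
|       return activations
|
|   # Tiny hypothetical 2-2-1 network with made-up weights.
|   layers = [
|       ([[0.5, -0.2], [0.1, 0.9]], [0.0, 0.1]),
|       ([[1.0, -1.0]], [0.0]),
|   ]
|   print(forward(layers, [1.0, 2.0]))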
| antirez wrote:
| That's not the point, I think: you can implement the
| brain in BASIC, in theory, this does not means that the
| brain is per-se a BASIC program. I'll provide a more
| theoretical framework for reasoning about this: if the
| way to solve certain problems by an NN (the learned
| weights) can't be translated in some normal program that
| DOES NOT resemble the activation of an NN, then the NNs
| are not algorithms, but a different computational model.
| mvkel wrote:
| > continuous
|
| So, steps?
| necovek wrote:
| "Continuous" would imply infinitely small steps, and as
| such, would certainly be used as a differentiator
| (differential? ;) between larger discrete stepped
| approach.
|
| In essence, infinite calculus provides a link between
| "steps" and continuous, but those are different things
| indeed.
| necovek wrote:
| I would say you are right that function is not an
| algorithm, but it is an implementation of an algorithm.
|
| Is that your point?
|
| If so, I've long learned to accept imprecise language as
| long as the message can be reasonably extracted from it.
| benlivengood wrote:
| Deterministic (IEEE 754 floats), terminates on all inputs,
| correct (produces loss < X on N training/test inputs).
|
| At most you can argue that there isn't a useful bounded loss
| on every possible input, but it turns out that humans don't
| achieve useful bounded loss on identifying arbitrary sets of
| pixels as a cat or whatever, either. Most problems NNs are
| aimed at are qualitative or probabilistic where provable
| bounds are less useful than Nth-percentile performance on
| real-world data.
| KeplerBoy wrote:
| Running inference on a model certainly is an algorithm.
| drdeca wrote:
| How do you define "algorithm"? I suspect it is a definition I
| would find somewhat unusual. Not to say that I strictly
| disagree, but only because to my mind "neural net" suggests
| something a bit more concrete than "algorithm", so I might
| instead say that an artificial neural net is an
| implementation of an algorithm rather than an algorithm
| itself, or something like that.
|
| But, to my mind, something of the form "Train a neural
| network with an architecture generally like [blah], with a
| training method+data like [bleh], and save the result. Then,
| when inputs are received, run them through the NN in such-
| and-such way." would constitute an algorithm.
| necovek wrote:
| NN is a very wide term applied in different contexts.
|
| When a NN is trained, it produces a set of parameters that
| basically define an algorithm to do inference with: it's a
| very big one though.
|
| We also call that a NN (the joy of natural language).
| 6gvONxR4sf7o wrote:
| Human performance is much closer to 100% on this, depending on
| your human. It's easy to miss the dot in the corner of the
| headline graph in TFA that says "STEM grad."
| tim333 wrote:
| A fair comparison might be average human. The average human
| isn't a STEM grad. It seems STEM grad approximately equals an
| IQ of 130. https://www.accommodationforstudents.com/student-
| blog/the-su...
|
| From a post elsewhere the scores on ARC-AGI-PUB are approx
| average human 64%, o3 87%.
| https://news.ycombinator.com/item?id=42474659
|
| Though also elsewhere, o3 seems very expensive to operate.
| You could probably hire a PhD researcher for cheaper.
| jeremyjh wrote:
| Why would an average human be more fair than a trained
| human? The model is trained.
| hypoxia wrote:
| It actually beats the human average by a wide margin:
|
| - 64.2% for humans vs. 82.8%+ for o3.
|
| ...
|
| Private Eval:
|
| - 85%: threshold for winning the prize [1]
|
| Semi-Private Eval:
|
| - 87.5%: o3 (unlimited compute) [2]
|
| - 75.7%: o3 (limited compute) [2]
|
| Public Eval:
|
| - 91.5%: o3 (unlimited compute) [2]
|
| - 82.8%: o3 (limited compute) [2]
|
| - 64.2%: human average (Mechanical Turk) [1] [3]
|
| Public Training:
|
| - 76.2%: human average (Mechanical Turk) [1] [3]
|
| ...
|
| References:
|
| [1] https://arcprize.org/guide
|
| [2] https://arcprize.org/blog/oai-o3-pub-breakthrough
|
| [3] https://arxiv.org/abs/2409.01374
| usaar333 wrote:
| Superhuman isn't beating random Mechanical Turkers.
|
| Their post has STEM grads at nearly 100%.
| tripletao wrote:
| This is correct. It's easy to get arbitrarily bad results
| on Mechanical Turk, since without any quality control
| people will just click as fast as they can to get paid (or
| bot it and get paid even faster).
|
| So in practice, there's always some kind of quality
| control. Stricter quality control will improve your
| results, and the right amount of quality control is
| subjective. This makes any assessment of human quality
| meaningless without explanation of how those humans were
| selected and incentivized. Chollet is careful to provide
| that, but many posters here are not.
|
| In any case, the ensemble of task-specific, low-compute
| Kaggle solutions is reportedly also super-Turk, at 81%. I
| don't think anyone would call that AGI, since it's not
| general; but if the "(tuned)" in the figure means o3 was
| tuned specifically for these tasks, that's not obviously
| general either.
| dyauspitr wrote:
| I'll believe it when the AI can earn money on its own. I
| obviously don't mean someone paying a subscription to use
| the AI; I mean letting the AI loose on the Internet with
| only the goal of making money and putting it into a bank
| account.
| hamburga wrote:
| Do trading bots count?
| 1659447091 wrote:
| No, the AI would have to start from zero and reason its
| way to making itself money online, like the humans who
| were first in their online field of interest (e-commerce,
| scams, ads, etc. from the 80's and 90's) when there was no
| guidance, only general human intelligence that could reason
| its way into money-making opportunities and reason its way
| into making them work.
| concordDance wrote:
| I don't think humans ever do that. They research/read and
| ask other humans.
| lastdong wrote:
| Curious about how many tests were performed. Did it
| consistently manage to successfully solve many of these types
| of problems?
| dmead wrote:
| This is so strange. People think that an LLM trained on
| programming questions and docs being able to do mundane
| tasks like this means it's intelligent? Come on.
|
| It really suggests one of two things:
|
| 1. You don't know what you're talking about.
|
| 2. You have a perverse incentive to believe this, such that
| you will preach it to others and elevate some job salary
| range or stock.
|
| Either way, not a good look.
| javaunsafe2019 wrote:
| This
| Imnimo wrote:
| Whenever a benchmark that was thought to be extremely difficult
| is (nearly) solved, it's a mix of two causes. One is that
| progress on AI capabilities was faster than we expected, and the
| other is that there was an approach that made the task easier
| than we expected. I feel like the there's a lot of the former
| here, but the compute cost per task (thousands of dollars to
| solve one little color grid puzzle??) suggests to me that there's
| some amount of the latter. Chollet also mentions ARC-AGI-2 might
| be more resistant to this approach.
|
| Of course, o3 looks strong on other benchmarks as well, and
| sometimes "spend a huge amount of compute for one problem" is a
| great feature to have available if it gets you the answer you
| needed. So even if there's some amount of "ARC-AGI wasn't quite
| as robust as we thought", o3 is clearly a very powerful model.
| exe34 wrote:
| > the other is that there was an approach that made the task
| easier than we expected.
|
| from reading Dennett's philosophy, I'm convinced that that's
| how human intelligence works - for each task that "only a human
| could do that", there's a trick that makes it easier than it
| seems. We are bags of tricks.
| Jensson wrote:
| > We are bags of tricks.
|
| We are trick generators, that is what it means to be a
| general intelligence. Adding another trick in the bag doesn't
| make you a general intelligence, being able to discover and
| add new tricks yourself makes you a general intelligence.
| falcor84 wrote:
| Not the parent, but remembering my reading of Dennett, he
| was referring to the tricks that we got through evolution,
| rather than ones we invented ourselves. As particular
| examples, we have neural functional areas for capabilities
| like facial recognition and spatial reasoning which seems
| to rely on dedicated "wetware" somewhat distinct from other
| parts of the brain.
| Jensson wrote:
| But humans being able to develop new tricks is core to
| their intelligence, saying its just a bag of tricks means
| you don't understand what AGI is. So either the poster
| misunderstood Dennett, or Dennett wasn't talking about
| AGI, or Dennett didn't understand this well.
|
| Of course there are many tricks you will need special
| training for, like many of the skills human share with
| animals, but the ability to construct useful shareable
| large knowledge bases based on observations is unique to
| humans and isn't just a "trick".
| exe34 wrote:
| Dennett was talking about natural intelligence. I think
| you're just underestimating the potential of a
| sufficiently big bag of tricks.
|
| Sharing knowledge isn't a uniquely human thing - chimps
| learn from each other. Bees teach each other the direction
| and distance to a new source of food.
|
| We just happen to push the envelope a lot further and
| managed to kickstart runaway mimetic evolution.
| falcor84 wrote:
| "mimetic" is apt there, but I think that Dennett, as a
| friend of Dawkins, would say it's "memetic"
| exe34 wrote:
| nice catch!
| exe34 wrote:
| Generating tricks is itself a trick that relies on an
| enormous bag of tricks we inherited through evolution by
| the process of natural selection.
|
| The new tricks don't just pop into our heads, even though
| it seems that way. Nobody ever woke up and devised a new
| trick in a completely new field without spending years
| learning about that field or something adjacent to it.
| Even the new ideas tend to be an old idea from a different
| field applied to a new one. Tricks stand on the shoulders
| of giants.
| solidasparagus wrote:
| Or the test wasn't testing anything meaningful, which IMO is
| what happened here. I think ARC was basically looking at the
| distribution of what AI is capable of, picked an area that it
| was bad at and no one had cared enough to go solve, and put
| together a benchmark. And then we got good at it because
| someone cared and we had a measurement. Which is essentially
| the goal of ARC.
|
| But I don't much agree that it is any meaningful step towards
| AGI. Maybe it's a nice proof point that AI can solve simple
| problems presented in intentionally opaque ways.
| atleastoptimal wrote:
| I'd agree with you if there hadn't been very deliberate
| work towards solving ARC for years, and if the conceit of
| the benchmark weren't specifically based on a conception of
| human intuition as being, put simply, the ability to learn
| and apply out-of-distribution rules on the fly. ARC wasn't
| some arbitrary inverse set; it was designed to benchmark a
| fundamental capability of general intelligence.
| whoistraitor wrote:
| The general message here seems to be that inference-time brute-
| forcing works as long as you have a good search and evaluation
| strategy. We've seemingly hit a ceiling on the base LLM forward-
| pass capability so any further wins are going to be in how we
| juggle multiple inferences to solve the problem space. It feels
| like a scripting problem now. Which is cool! A fun space for
| hacker-engineers. Also:
|
| > My mental model for LLMs is that they work as a repository of
| vector programs. When prompted, they will fetch the program that
| your prompt maps to and "execute" it on the input at hand. LLMs
| are a way to store and operationalize millions of useful mini-
| programs via passive exposure to human-generated content.
|
| I found this such an intriguing way of thinking about it.
| whimsicalism wrote:
| > We've seemingly hit a ceiling on the base LLM forward-pass
| capability so any further wins are going to be in how we juggle
| multiple inferences to solve the problem space
|
| Not so sure - but we might need to figure out the
| inference/search/evaluation strategy in order to provide the
| data we need to distill to the single forward-pass data
| fitting.
| cchance wrote:
| Is it just me or does looking at the ARC-AGI example questions at
| the bottom... make your brain hurt?
| drdaeman wrote:
| Looks pretty obvious to me, although, of course, it took me a
| few moments to understand what's expected as a solution.
|
| c6e1b8da is moving rectangular figures by a given vector,
| 0d87d2a6 is drawing horizontal and/or vertical lines
| (connecting dots at the edges) and filling figures they touch,
| b457fec5 is filling gray figures with a given repeating color
| pattern.
|
| This is pretty straightforward stuff that doesn't require much
| spatial thinking or keeping multiple things/aspects in memory -
| visual puzzles from various "IQ" tests are way harder.
|
| This said, now I'm curious how SoTA LLMs would do on something
| like WAIS-IV.
| randyrand wrote:
| I'll sound like a total douche bag - but I thought they were
| incredibly obvious - which I think is the point of them.
|
| What took me longer was figuring out how the question was
| arranged, i.e. left input, right output, 3 examples each
| airstrike wrote:
| Uhh...some of us are apparently living under a rock, as this is
| the first time I hear about o3 and I'm on HN far too much every
| day
| burningion wrote:
| I think it was just announced today! You're fine!
| cryptoegorophy wrote:
| Besides higher scores - are there any improvements for
| general use? Like asking it to help set up Home Assistant,
| etc.?
| rvz wrote:
| Great results. However, let's all just admit it.
|
| It has effectively replaced journalists and artists, and it
| is on its way to replacing both junior and senior
| engineers. The ultimate intention of "AGI" is that it is
| going to replace tens of millions of jobs. That is it and
| you know it.
|
| It will only accelerate and we need to stop pretending and
| coping. Instead let's discuss solutions for those lost
| jobs.
|
| So what is the replacement for these lost jobs? (It is not UBI or
| "better jobs" without defining them.)
| neom wrote:
| Do you follow Jack Clark? I noticed he's been on the road a lot
| talking to governments and policy makers, and not just in the
| "AI is coming" way he used to talk.
| whynotminot wrote:
| When none of us have jobs or income, there will be no ability
| for us to buy products. And then no reason for companies to buy
| ads to sell products to people who don't have money. Without ad
| money (or the potential of future ad money), the people pushing
| the bounds of AGI into work replacement will lose the very
| income streams powering this research and their valuations.
|
| Ford didn't support a 40 hour work week out of the kindness of
| his heart. He wanted his workers to have time off for buying
| things (like his cars).
|
| I wonder if our AGI industrialist overlords will do something
| similar for revenue sharing or UBI.
| whimsicalism wrote:
| This picture doesn't make sense. If most don't have any money
| to buy products, just invent some other money and start
| paying one of the other people who doesn't have any money to
| start making the products for you.
|
| In reality, if there really is mass unemployment, AI driven
| automation will make consumables so cheap that anyone will be
| able to buy it.
| whynotminot wrote:
| > This picture doesn't make sense. If most don't have any
| money to buy products, just invent some other money and
| start paying one of the other people who doesn't have any
| money to start making the products for you.
|
| Uh, this picture doesn't make sense. Why would anyone value
| this randomly invented money?
| whimsicalism wrote:
| > Why would anyone value this randomly invented money?
|
| Because they can use it to pay for goods?
|
| Your notion is that almost everyone is going to be out of
| a job and thus have nothing. Okay, so I'm one of those
| people and I need this house built. But I'm not making
| any money because of AI or whatever. Maybe someone else
| needs someone to drive their aging relative around and
| they're a good builder.
|
| If 1. neither of those people have jobs or income because
| of AI 2. AI isn't provisioning services for basically
| free,
|
| then it makes sense for them to do an exchange of labor -
| even with AI (if that AI is not providing services to
| everyone). The original reason for having money and
| exchanging it still exists.
| whynotminot wrote:
| Honestly I don't even know how to engage with your point.
|
| Yes if we recreate society some form of money would
| likely emerge.
| neom wrote:
| Didn't money basically only emerge to deal with the
| difficulty of the "double coincidence of wants"? Money
| simply solved the problem of making all forms of value
| interchangeable and transportable across time AND
| circumstance. A dollar can do that with or without AI
| existing, no?
| whimsicalism wrote:
| Yes, that's my point
| staticman2 wrote:
| You seem to be arguing that large unemployment rates are
| logically impossible, so we shouldn't worry about
| unemployment.
|
| The fact unemployment was 25% during the great depression
| would seem to suggest that at a minimum, a 25%
| unemployment rate is possible during a disruptive event.
| astrange wrote:
| The unemployment rate in a modern economy is basically
| whatever the central bank wants it to be. The Great
| Depression was caused by bad monetary policy - I don't
| see a reason why having AI would cause that.
| staticman2 wrote:
| The person upthread was saying that as long as someone
| wants a house built and someone wants a grandma driven
| around unemployment can't happen.
|
| Unless nobody wanted either of those things done during
| the depression that's clearly not a very good mental
| model.
| astrange wrote:
| Yes, I disagree with that. The problem isn't the lack of
| demand, it's that the people with the demand can't get
| the money to express it with.
| tivert wrote:
| > This picture doesn't make sense. If most don't have any
| money to buy products, just invent some other money and
| start paying one of the other people who doesn't have any
| money to start making the products for you.
|
| Ultimately, it all comes down to raw materials and similar
| resources, _and all those will be claimed by people with
| lots of real money_. Your "invented ... other money" will
| be useless to buy that fundamental stuff. At best, it will
| be useful for trading scrap and other junk among the
| unemployed.
|
| > In reality, if there really is mass unemployment, AI
| driven automation will make consumables so cheap that
| anyone will be able to buy it.
|
| No. Why would the people who own that automation want to
| waste their resources producing consumer goods for people
| with nothing to give them in return?
| whimsicalism wrote:
| if people with AI use it to somehow enclose all raw
| resources, then yes - the picture i painted will be wrong
| whynotminot wrote:
| Enclosing raw resources tends to be what powerful people
| do.
| astrange wrote:
| "Raw resources" aren't that valuable economically because
| they aren't where most of the value is added in
| production. That's why having a lot of them tends to make
| your country poorer
| (https://en.wikipedia.org/wiki/Resource_curse).
| Jensson wrote:
| Today educated humans are more valuable than anything
| else on earth, but AGI changes that. With cheap AGI, raw
| resources and infrastructure will be the only two valuable
| things left.
| astrange wrote:
| > If most don't have any money to buy products, just invent
| some other money and start paying one of the other people
| who doesn't have any money to start making the products for
| you.
|
| This isn't possible if you want to pay sales taxes - those
| are what keep transactions being done in the official
| currency. Of course in a world of 99% unemployment
| presumably we don't care about this.
|
| But yes, this world of 99% unemployment isn't possible, eg
| because as soon as you have two people and they trade
| things, they're employed again.
| tivert wrote:
| > When none of us have jobs or income, there will be no
| ability for us to buy products. And then no reason for
| companies to buy ads to sell products to people who don't
| have money. Without ad money (or the potential of future ad
| money), the people pushing the bounds of AGI into work
| replacement will lose the very income streams powering this
| research and their valuations.
|
| I don't think so. I agree the push for AGI will kill the
| modern consumer product economy, but I think it's quite
| possible for the economy to evolve into a new form (one
| that will probably be terrible for most humans) that keeps
| pushing "work replacement."
|
| Imagine an AGI billionaire buying up land, mines, and power
| plants as the consumer economy dies, then shifting those
| resources away from the consumer economy into self-
| aggrandizing pet projects (e.g. ziggurats, penthouses on
| Mars, space yachts, life extension, and stuff like that). He
| might still employ a small community of servants, AGI
| researchers, and other specialists; but all the rest of the
| population will be irrelevant to him.
|
| And individual autarky probably isn't necessary; consumption
| will be redirected towards the massive pet projects I
| mentioned, with vestigial markets for power, minerals, etc.
| RivieraKid wrote:
| The economic theory answer is that people simply switch to jobs
| that are not yet replaceable by AI. Doctors, nurses,
| electricians, construction workers, police officers, etc.
| People in aggregate will produce more, consume more and work
| less.
| achierius wrote:
| > Doctors
|
| Many replaceable
|
| > Police officers
|
| Many replaceable (desk officers)
| drdaeman wrote:
| > It has well replaced journalists, artists and on its way to
| replace nearly both junior and senior engineers.
|
| Did it, really? Or did it just provide automation for routine
| no-thinking-necessary text-writing tasks, while remaining
| ultimately bound by the level of the human operator's
| intelligence? I strongly suspect it's the latter. If it has
| actually replaced journalists, it must be at junk outlets,
| where readers' intelligence is negligible and anything goes.
|
| Just yesterday I used o1 and Claude 3.5 to debug a Linux
| kernel issue (ultimately, a bad DSDT table causing the TPM2
| driver to be unable to reserve the memory region for its
| command-response buffer; the solution was to use memmap to
| remove the NVS flag from the relevant regions) and confirmed
| once again that LLMs still don't reason at all - they just
| spew out plausible-looking chains of words. The models were
| good listeners and mostly-helpful code generators (when they
| didn't make the silliest mistakes), but they showed no trace
| of understanding and paid no attention to nuances (e.g. the
| LLM used `IS_ERR` to check the `__request_resource` result,
| despite my giving it the full source code for that function,
| where there's even a comment making it obvious that it
| returns a pointer or NULL, not an error code - a misguided-
| attention kind of mistake).
|
| So, in my opinion, LLMs (as currently available to the broad
| public, like myself) are useful for automating away some
| routine stuff, but their usefulness is bounded by the
| operator's knowledge and intelligence. And that means the
| actual jobs (if they require thinking and not just writing
| words) are safe.
|
| When asked about what I do at work, I used to joke that I just
| press buttons on my keyboard in fancy patterns. Ultimately,
| LLMs seem to suggest that it's not what I really do.
| mensetmanusman wrote:
| I'm super curious as to whether this technology completely
| destroys the middle class, or if everyone becomes better off
| because productivity is going to skyrocket.
| mhogers wrote:
| Is anyone here aware of the latest research that tries to
| predict the outcome? Please share - super curious as well
| te_chris wrote:
| There's this https://arxiv.org/pdf/2312.05481v9
| pdfernhout wrote:
| Some thoughts I put together on all this circa 2010:
| https://pdfernhout.net/beyond-a-jobless-recovery-knol.html
| "This article explores the issue of a "Jobless Recovery"
| mainly from a heterodox economic perspective. It emphasizes
| the implications of ideas by Marshall Brain and others that
| improvements in robotics, automation, design, and voluntary
| social networks are fundamentally changing the structure of
| the economic landscape. It outlines towards the end four
| major alternatives to mainstream economic practice (a basic
| income, a gift economy, stronger local subsistence economies,
| and resource-based planning). These alternatives could be
| used in combination to address what, even as far back as
| 1964, has been described as a breaking "income-through-jobs
| link". This link between jobs and income is breaking because
| of the declining value of most paid human labor relative to
| capital investments in automation and better design. Or, as
| is now the case, the value of paid human labor like at some
| newspapers or universities is also declining relative to the
| output of voluntary social networks such as for digital
| content production (like represented by this document). It is
| suggested that we will need to fundamentally reevaluate our
| economic theories and practices to adjust to these new
| realities emerging from exponential trends in technology and
| society."
| tivert wrote:
| > I'm super curious as to whether this technology completely
| destroys the middle class, or if everyone becomes better off
| because productivity is going to skyrocket.
|
| Even if productivity skyrockets, why would anyone assume the
| dividends would be shared with the "destroy[ed] middle class"?
|
| All indications will be this will end up like the China Shock:
| "I lost my middle class job, and all I got was the opportunity
| to buy flimsy pieces of crap from a dollar store." America
| lacks the ideological foundations for any other result, and the
| coming economic changes will likely make building those
| foundations even more difficult if not impossible.
| rohan_ wrote:
| Because access to the financial system was democratized ten
| years ago
| tivert wrote:
| > Because access to the financial system was democratized
| ten years ago
|
| Huh? I'm not sure exactly what you're talking about, but
| mere "access to the financial system" wouldn't remedy
| anything, because of inequality, etc.
|
| To survive the shock financially, I think one would have to
| have at least enough capital to be a capitalist.
| croemer wrote:
| The programming task they gave o3-mini high (creating a
| Python server that allows chatting with the OpenAI API and
| running some code in a terminal) didn't seem very hard?
| Strange choice of example for something that's claimed to be
| a big step forward.
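|
| For a rough sense of scale, the whole thing is on the order
| of this sketch (the endpoints, port, and model name here are
| made up, not taken from the demo):
|
|   import subprocess
|   from http.server import BaseHTTPRequestHandler, HTTPServer
|   from openai import OpenAI  # official openai package
|
|   client = OpenAI()
|
|   class Handler(BaseHTTPRequestHandler):
|       def do_POST(self):
|           length = int(self.headers["Content-Length"])
|           body = self.rfile.read(length).decode()
|           if self.path == "/chat":
|               # forward the prompt to the OpenAI API
|               reply = client.chat.completions.create(
|                   model="gpt-4o",
|                   messages=[{"role": "user", "content": body}],
|               ).choices[0].message.content
|           else:
|               # "/run": execute a shell command, return stdout
|               result = subprocess.run(body, shell=True,
|                                       capture_output=True,
|                                       text=True)
|               reply = result.stdout
|           self.send_response(200)
|           self.end_headers()
|           self.wfile.write(reply.encode())
|
|   HTTPServer(("localhost", 8000), Handler).serve_forever()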
|
| YT timestamped link:
| https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for
| the fixed link @photonboom)
|
| Updated: I gave the task to Claude 3.5 Sonnet and it worked first
| shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-
| faa5aa...
| bearjaws wrote:
| It's good that it works since if you ask GPT-4o to use the
| openai sdk it will often produce invalid and out of date code.
| m3kw9 wrote:
| I would say they didn't need to demo anything, because if you
| are gonna use the output code live on a demo it may make
| compile errors and then look stupid trying to fix it live
| croemer wrote:
| If it was a safe bet problem, then they should have said
| that. To me it looks like they faked excitement for something
| not exciting which lowers credibility of the whole
| presentation.
| sunaookami wrote:
| They actually did that the last time when they showed the
| apps integration. First try in Xcode didn't work.
| m3kw9 wrote:
| Yeah I think that time it was ok because they were demoing
| the app function, but for this they are demoing the model
| smarts
| csomar wrote:
| Models are predictable at temperature 0. They might have
| tested the output beforehand.
| fzzzy wrote:
| Models in practice haven't been deterministic at 0
| temperature, although nobody knows exactly why. Either
| hardware or software bugs.
| Jensson wrote:
| We know exactly why: it is because floating-point
| operations aren't associative, but the GPU scheduler
| assumes they are, and the scheduler isn't deterministic.
| Running the model with a strictly deterministic schedule
| hurts performance, so they don't do that.
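|
| A quick illustration of the non-associativity point, in
| plain Python floats (nothing GPU-specific about it):
|
|   # The result of a big reduction depends on the order the
|   # scheduler happens to sum in.
|   a, b, c = 1e16, -1e16, 1.0
|   print((a + b) + c)   # 1.0
|   print(a + (b + c))   # 0.0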
| photonboom wrote:
| here's the right timestamp:
| https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s
| phil917 wrote:
| Yeah I agree that wasn't particularly mind blowing to me and
| seems fairly in line with what existing SOTA models can do.
| Especially since they did it in steps. Maybe I'm missing
| something.
| MyFirstSass wrote:
| What? Is this what this is? Either this is a complete joke or
| we're missing something.
|
| I've been doing similar stuff in Claude for months and it's
| not that impressive when you see how limited the models
| really are once you go beyond boilerplate.
| HeatrayEnjoyer wrote:
| Sonnet isn't a "mini" sized model. Try it with Haiku.
| croemer wrote:
| How mini is o3-mini compared to Sonnet and why does it matter
| whether it's mini or not? Isn't the point of the demo to show
| what's now possible that wasn't before?
|
| 4o is cheaper than o1 mini so mini doesn't mean much for
| costs.
| zelphirkalt wrote:
| Looks like quite shoddy code though. Like, the procedure for
| running a shell command is pure side-effect procedural code,
| neither returning the exit code of the command nor its output.
| Like the incomplete stackoverflow answer it probably was
| trained from. It might do one job at a time, but once this
| stuff gets integrated into one coherent thing, one needs to
| rewrite lots of it, to actually be composable.
|
| Though, of course one can argue, that lots of human written
| code is not much different from this.
| tripletao wrote:
| Their discussion contains an interesting aside:
|
| > Moreover, ARC-AGI-1 is now saturating - besides o3's new score,
| the fact is that a large ensemble of low-compute Kaggle solutions
| can now score 81% on the private eval.
|
| So while these tasks get greatest interest as a benchmark for
| LLMs and other large general models, it doesn't yet seem obvious
| those outperform human-designed domain-specific approaches.
|
| I wonder to what extent the large improvement comes from OpenAI
| training deliberately targeting this class of problem. That
| result would still be significant (since there's no way to
| overfit to the private tasks), but would be different from an
| "accidental" emergent improvement.
| Bjorkbat wrote:
| I was impressed until I read the caveat about the high-compute
| version using 172x more compute.
|
| Assuming for a moment that the cost per task has a linear
| relationship with compute, then it costs a little more than $1
| million to get that score on the public eval.
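|
| Rough back-of-the-envelope behind that figure (the per-task
| cost and task count below are assumptions for illustration,
| not numbers reported in this thread):
|
|   low_cost_per_task = 17.0   # assumed ~$17/task, low-compute run
|   compute_multiplier = 172   # high-compute used ~172x more compute
|   num_tasks = 400            # ARC-AGI public eval set size
|
|   total = low_cost_per_task * compute_multiplier * num_tasks
|   print(f"~${total:,.0f}")   # ~$1,169,600: "a little more than $1M"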
|
| The results are cool, but man, this sounds like such a busted
| approach.
| futureshock wrote:
| So what? I'm serious. Our current level of progress would have
| been sci-fi fantasy with the computers we had in 2000. The cost
| may be astronomical today, but we have proven a method to
| achieve human performance on tests of reasoning over novel
| problems. WOW. Who cares what it costs. In 25 years it will run
| on your phone.
| Bjorkbat wrote:
| It's not so much the cost as the fact that they got a
| slightly better result by throwing 172x more compute at
| each task. The fact that it may have cost somewhere north
| of $1 million simply helps give a better idea of how absurd
| the approach is.
|
| It feels a lot less like the breakthrough when the solution
| looks so much like simply brute-forcing.
|
| But you might be right, who cares? Does it really matter how
| crude the solution is if we can achieve true AGI and bring
| the cost down by increasing the efficiency of compute?
| futureshock wrote:
| "Simply brute-forcing"
|
| That's the thing that's interesting to me though and I had
| the same first reaction. It's a very different problem than
| brute-forcing chess. It has one chance to come to the
| correct answer. Running through thousands or millions of
| options means nothing if the model can't determine which is
| correct. And each of these visual problems involve
| combinations of different interacting concepts. To solve
| them requires understanding, not mimicry. So no matter how
| inefficient and "stupid" these models are, they can be said
| to understand these novel problems. That's a direct counter
| to everyone who ever called these a stochastic parrot and
| said they were a dead-end to AGI that was only searching an
| in distribution training set.
|
| The compute costs are currently disappointing, but so was
| the cost of sequencing the first whole human genome. That
| went from 3 billion to a few hundred bucks from your local
| doctor.
| radioactivist wrote:
| So your claim for optimism here is that something today that
| took ~10^22 floating point operations (based on an estimate
| earlier in the thread) to execute will be running on phones
| in 25 years? Phones which are currently running at O(10^12)
| flops. That means ten orders of magnitudes of improvement for
| that to run in a reasonable amount of time? It's a similar
| scale up in going from ENIAC (500 flops) to a modern desktop
| (5-10 teraflops).
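|
| For scale, using those same rough estimates (not measured
| figures):
|
|   task_flop = 1e22     # estimated FLOPs for one high-compute task
|   phone_flops = 1e12   # rough phone throughput, FLOP/s
|   seconds = task_flop / phone_flops
|   print(f"{seconds / 3.15e7:.0f} years")  # ~317 years on a phone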
| futureshock wrote:
| That sounds reasonable to me because the compute cost for
| this level of reasoning performance won't stay at 10^22 and
| phones won't stay at 10^12. This reasoning breakthrough is
| about 3 months old.
| radioactivist wrote:
| I think expecting five _orders of magnitude_ improvement
| from either side of this (inference cost or phone
| performance) is insane.
| onemetwo wrote:
| In (1) the author uses a technique to improve the
| performance of an LLM: he trained Sonnet 3.5 to obtain
| 53.6% on the ARC-AGI-Pub benchmark, and moreover he said
| that more compute would give better results. So the results
| of o3 could perhaps be produced the same way, using the
| same method with more compute; if that is the case, the o3
| result is not very interesting.
|
| (1) https://params.com/@jeremy-berman/arc-agi
| TypicalHog wrote:
| This is actually mindblowing!
| blixt wrote:
| These results are fantastic. Claude 3.5 and o1 are already good
| enough to provide value, so I can't wait to see how o3 performs
| comparatively in real-world scenarios.
|
| But I gotta say, we must be saturating just about any zero-shot
| reasoning benchmark imaginable at this point. And we will still
| argue about whether this is AGI, in my opinion because these LLMs
| are forgetful and it's very difficult for an application
| developer to fix that.
|
| Models will need better ways to remember and learn from doing a
| task over and over. For example, let's look at code agents: the
| best we can do, even with o3, is to cram as much of the code base
| as we can fit into a context window. And if it doesn't fit we
| branch out to multiple models to prune the context window until
| it does fit. And here's the kicker - the second time you ask for
| it to do something this all starts over from zero again. With
| this amount of reasoning power, I'm hoping session-based learning
| becomes the next frontier for LLM capabilities.
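|
| A toy sketch of the "cram then prune" loop described above
| (the helper names are hypothetical, not any real agent
| framework):
|
|   def build_context(files, query, budget_tokens,
|                     rank_by_relevance, estimate_tokens):
|       # Pack the most relevant files into the window, skip
|       # the rest. Nothing is remembered between calls -
|       # every request starts again from zero.
|       context, used = [], 0
|       for path, text in rank_by_relevance(files, query):
|           cost = estimate_tokens(text)
|           if used + cost > budget_tokens:
|               continue  # prune: this file doesn't fit
|           context.append((path, text))
|           used += cost
|       return context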
|
| (There are already things like tool use, linear attention, RAG,
| etc that can help here but currently they come with downsides and
| I would consider them insufficient.)
| vessenes wrote:
| This feels like big news to me.
|
| First of all, ARC is definitely an intelligence test for autistic
| people. I say as someone with a tad of the neurodiversity. That
| said, I think it's a pretty interesting one, not least because as
| you go up in the levels, it requires (for a human) a fair amount
| of lateral thinking and analogy-type thinking, and of course, it
| requires that this go in and out of visual representation. That
| said, I think it's a bit funny that most of the people training
| these next-gen AIs are neurodiverse and we are training the AI in
| our own image. I continue to hope for some poet and painter-
| derived intelligence tests to be added to the next gen tests we
| all look at and score.
|
| For those reasons, I've always really liked ARC as a test -- not
| as some be-all end-all for AGI, but just because I think that the
| most intriguing areas next for LLMs are in these analogy arenas
| and ability to hold more cross-domain context together for
| reasoning and etc.
|
| Prompts that are interesting to play with right now on these
| terms range from asking multimodal models to say count to ten in
| a Boston accent, and then propose a regional french accent that's
| an equivalent and count to ten in that. (To my ear, 4o is
| unconvincing on this). Similar in my mind is writing and
| architecting code that crosses multiple languages and APIs, and
| asking for it to be written in different styles. (claude and
| o1-pro are .. okay at this, depending).
|
| Anyway. I agree that this looks like a large step change. I'm not
| sure if the o3 methods here involve the spinning up of clusters
| of python interpreters to breadth-search for solutions -- a
| method used to make headway on ARC in the past; if so, this is
| still big, but I think less exciting than if the stack is close
| to what we know today, and the compute time is just more
| introspection / internal beam search type algorithms.
|
| Either way, something had to assess answers and think they were
| right, and this is a HUGE step forward.
| jamiek88 wrote:
| > most of the people training these next-gen AIs are
| neurodiverse
|
| Citation needed. This is a huge claim based only on stereotype.
| vessenes wrote:
| So true. Perhaps I'm just thinking it's my people and need to
| update my priors.
| getpost wrote:
| > most of the people training these next-gen AIs are
| neurodiverse and we are training the AI in our own image
|
| Do you have any evidence to support that? It would be
| fascinating if the field is primarily advancing due to a unique
| constellation of traits contributed by individuals who, in the
| past, may not have collaborated so effectively.
| vessenes wrote:
| PURELY Anecdotal. But I'll say that as of 2024 1 in 36 US
| children are diagnosed on the spectrum according to the
| CDC(!), which would mean if you met 10 AI researchers and 4
| were neurodivergent you'd reasonably expect that it's a
| higher-than-population average representation. I'm polling
| from the Effective Altruist AI folks in my mind, and the
| number is definitely, definitely higher than 4/10.
| EVa5I7bHFq9mnYK wrote:
| Are there non-Effective Altruist AI folks?
| vessenes wrote:
| I love how this might mean "non-Effective",
| non-"Effective Altruist" or non-"Effective Altruist AI"
| folks.
|
| Yes
| nopinsight wrote:
| Let me go against some skeptics and explain why I think full o3
| is pretty much AGI or at least embodies most essential aspects of
| AGI.
|
| What has been lacking so far in frontier LLMs is the ability to
| reliably deal with the right level of abstraction for a given
| problem. Reasoning is useful but often comes out lacking if one
| cannot reason at the right level of abstraction. (Note that many
| humans can't either when they deal with unfamiliar domains,
| although that is not the case with these models.)
|
| ARC has been challenging precisely because solving its
| problems often requires: 1) using multiple different *kinds*
| of core knowledge [1], such as symmetry, counting, and color,
| AND 2) using the right level(s) of abstraction.
|
| Achieving human-level performance in the ARC benchmark, _as well
| as_ top human performance in GPQA, Codeforces, AIME, and Frontier
| Math suggests the model can potentially solve any problem at the
| human level if it possesses essential knowledge about it. Yes,
| this includes out-of-distribution problems that most humans can
| solve.
|
| It might not _yet_ be able to generate highly novel theories,
| frameworks, or artifacts to the degree that Einstein,
| Grothendieck, or van Gogh could. But not many humans can either.
|
| [1] https://www.harvardlds.org/wp-
| content/uploads/2017/01/Spelke...
|
| ADDED:
|
| Thanks to the link to Chollet's posts by lswainemoore below. I've
| analyzed some easy problems that o3 failed at. They involve
| spatial intelligence, including connection and movement. This
| skill is very hard to learn from textual and still image data.
|
| I believe this sort of core knowledge is learnable through
| movement and interaction data in a simulated world and it will
| _not_ present a very difficult barrier to cross. (OpenAI
| purchased a company behind a Minecraft clone a while ago. I've
| wondered if this is the purpose.)
| xvector wrote:
| Agree. AGI is here. I feel such a sense of pride in our
| species.
| timabdulla wrote:
| What's your explanation for why it can only get ~70% on SWE-
| bench Verified?
|
| I believe about 90% of the tasks were estimated by humans to
| take less than one hour to solve, so we aren't talking about
| very complex problems, and to boot, the contamination factor is
| huge: o3 (or any big model) will have in-depth knowledge of the
| internals of these projects, and often even know about the
| individual issues themselves (e.g. you can ask what GitHub
| issue #4145 in project foo was, and there's a decent chance
| it can tell you exactly what the issue was about!)
| slewis wrote:
| I've spent tons of time evaluating o1-preview on SWEBench-
| Verified.
|
| For one, I speculate OpenAI is using a very basic agent
| harness to get the results they've published on SWEBench. I
| believe there is a fair amount of headroom to improve results
| above what they published, using the same models.
|
| For two, some of the instances, even in SWEBench-Verified,
| require a bit of "going above and beyond" to get right. One
| example is an instance where the user states that a TypeError
| isn't properly handled. The developer who fixed it handled
| the TypeError but also handled a ValueError, and the golden
| test checks for both. I don't know how many instances fall in
| this category, but I suspect its more than on a simpler
| benchmark like MATH.
| nopinsight wrote:
| One possibility is that it may not yet have sufficient
| _experience and real-world feedback_ for resolving coding
| issues in professional repos, as this involves multiple steps
| and very diverse actions (or branching factor, in AI terms).
| They have committed to not training on API usage, which
| limits their ability to directly acquire training data from
| it. However, their upcoming agentic efforts may address this
| gap in training data.
| timabdulla wrote:
| Right, but the branching factor increases exponentially
| with the scope of the work.
|
| I think it's obvious that they've cracked the formula for
| solving well-defined, small-in-scope problems at a
| superhuman level. That's an amazing thing.
|
| To me, it's less obvious that this implies that they will
| in short order with just more training data be able to
| solve ambiguous, large-in-scope problems at even just a
| skilled human level.
|
| There are far more paths to consider, much more context to
| use, and in an RL setting, the rewards are much more
| ambiguously defined.
| nopinsight wrote:
| Their reasoning models can learn from procedures and
| methods, which generalize far better than data. Software
| tasks are diverse but most tasks are still fairly limited
| in scope. Novel tasks might remain challenging for these
| models, as they do for humans.
|
| That said, o3 might still lack some kind of interaction
| intelligence that's hard to learn. We'll see.
| Imnimo wrote:
| >Achieving human-level performance in the ARC benchmark, as
| well as top human performance in GPQA, Codeforce, AIME, and
| Frontier Math strongly suggests the model can potentially solve
| any problem at the human level if it possesses essential
| knowledge about it.
|
| The article notes, "o3 still fails on some very easy tasks".
| What explains these failures if o3 can solve "any problem" at
| the human level? Do these failed cases require some essential
| knowledge that has eluded the massive OpenAI training set?
| nopinsight wrote:
| Great point. I'd love to see what these easy tasks are and
| would be happy to revise my hypothesis accordingly. o3's
| intelligence is unlikely to be a strict superset of human
| intelligence. It is certainly superior to humans in some
| respects and probably inferior in others. Whether it's
| sufficiently generally intelligent would be both a matter of
| definition and empirical fact.
| Imnimo wrote:
| Chollet has a few examples here:
|
| https://x.com/fchollet/status/1870172872641261979
|
| https://x.com/fchollet/status/1870173137234727219
|
| I would definitely consider them legitimately easy for
| humans.
| nopinsight wrote:
| Thanks! I added some comments on this at the bottom of
| the post above.
| phil917 wrote:
| Quote from the creators of the AGI-ARC benchmark: "Passing ARC-
| AGI does not equate achieving AGI, and, as a matter of fact, I
| don't think o3 is AGI yet. o3 still fails on some very easy
| tasks, indicating fundamental differences with human
| intelligence."
| CooCooCaCha wrote:
| Yeah the real goalpost is _reliable_ intelligence. A supposed
| phd level AI failing simple problems is a red flag that we're
| still missing something.
| gremlinsinc wrote:
| You've never met a Doctor who couldn't figure out how to
| work their email? Or use street smarts? You can have a PHD
| but be unable to reliably handle soft skills, or any number
| of things you might 'expect' someone to be able to do.
|
| Just playing devils' advocate or nitpicking the language a
| bit...
| CooCooCaCha wrote:
| An important distinction here is you're comparing skill
| across very different tasks.
|
| I'm not even going that far, I'm talking about
| performance on similar tasks. Something many people have
| noticed about modern AI is it can go from genius to baby-
| level performance seemingly at random.
|
| Take self driving cars for example, a reasonably
| intelligent human of sound mind and body would never
| accidentally mistake a concrete pillar for a road. Yet
| that happens with self-driving cars, and seemingly here
| with ARC-AGI problems which all have a similar flavor.
| nuancebydefault wrote:
| A coworker of mine has a phd in physics. Showing the
| difference to him between little and big endian in a hex
| editor, showing file sizes of raw image files and how to
| compute it... I explained 3 times and maybe he understood
| part of it now.
| manquer wrote:
| Doctors[1], or say pilots, are skilled professions that are
| difficult to master and deserve respect, yes, but they do
| not require high levels of intelligence to be good at. They
| require many other skills, like making decisions under
| pressure or good motor skills, that are hard but not
| necessarily intelligence.
|
| Also, not knowing something is hardly a criterion; skilled
| humans focus on their areas of interest above most other
| knowledge and can be unaware of other subjects.
|
| Fields medal winners, for example, may not be aware of most
| pop-culture things; that doesn't make them unable to be,
| just not interested.
|
| ---
|
| [1] most doctors, including surgeons and many respected
| specialists; some doctors do need those skills, but they
| are a specialized few and generally do know how to use
| email
| intended wrote:
| Good nitpick.
|
| A PhD learnt their field. If they learnt that field,
| reasoning through everything to understand their material,
| then - given enough time - they are capable of learning
| email and street smarts.
|
| Which is why a reasoning LLM should be able to do all of
| those things.
|
| It hasn't just learnt a subject; it has learnt reasoning.
| nopinsight wrote:
| I'd need to see what kinds of easy tasks those are and would
| be happy to revise my hypothesis if that's warranted.
|
| Also, it depends a great deal on what we define as AGI and
| whether they need to be a strict superset of typical human
| intelligence. o3's intelligence is probably superhuman in
| some aspects but inferior in others. We can find many humans
| who exhibit such tendencies as well. We'd probably say they
| think differently but would still call them generally
| intelligent.
| lswainemoore wrote:
| They're in the original post. Also here:
| https://x.com/fchollet/status/1870172872641261979 /
| https://x.com/fchollet/status/1870173137234727219
|
| Personally, I think it's fair to call them "very easy". If
| a person I otherwise thought was intelligent was unable to
| solve these, I'd be quite surprised.
| nopinsight wrote:
| Thanks! I've analyzed some easy problems that o3 failed
| at. They involve spatial intelligence including
| connection and movement. This skill is very hard to learn
| from textual and still image data.
|
| I believe this sort of core knowledge is learnable
| through movement and interaction data in a simulated
| world and it will not present a very difficult barrier to
| cross.
|
| (OpenAI purchased a company behind a Minecraft clone a
| while ago. I've wondered if this is the purpose.)
| lswainemoore wrote:
| > I believe this sort of core knowledge is learnable
| through movement and interaction data in a simulated
| world and it will not present a very difficult barrier to
| cross.
|
| Maybe! I suppose time will tell. That said, spatial
| intelligence (connection/movement included) is the whole
| game in this evaluation set. I think it's revealing that
| they can't handle these particular examples, and
| problematic for claims of AGI.
| MVissers wrote:
| Probably just not trained on this kind of data. We could
| create a benchmark about it, and they'd shatter it within
| a year or so.
|
| I'm starting to really see no limits on intelligence in
| these models.
| sungho_ wrote:
| Doesn't the fact that it can only accomplish tasks with
| benchmarks imply that it has limitations in intelligence?
| qup wrote:
| > Doesn't the fact that it can only accomplish tasks with
| benchmarks
|
| That's not a fact
| PoignardAzur wrote:
| > _This skill is very hard to learn from textual and
| still image data._
|
| I had the same take at first, but thinking about it
| again, I'm not quite sure?
|
| Take the "blue dots make a cross" example (the second
| one). The input only has four blue dots, which makes it
| very easy to see a pattern even in text data: two of them
| have the same x coordinate, two of them have the same y
| (or the same first-tuple-element and second-tuple-element
| if you want to taboo any spatial concepts).
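|
| As a toy illustration of that first "noticing" step on
| made-up coordinates (nothing here is taken from the actual
| task):
|
|   from collections import defaultdict
|
|   blue = [(3, 1), (3, 8), (0, 5), (7, 5)]   # (x, y), invented
|   by_x, by_y = defaultdict(list), defaultdict(list)
|   for x, y in blue:
|       by_x[x].append((x, y))
|       by_y[y].append((x, y))
|
|   # which blue cells share a column / a row
|   print({x: p for x, p in by_x.items() if len(p) > 1})
|   print({y: p for y, p in by_y.items() if len(p) > 1})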
|
| Then if you look into the output, you can notice that all
| the input coordinates are also in the output set, just
| not always with the same color. If you separate them into
| "input-and-output" and "output-only", you quickly notice
| that all of the output-only squares are blue and share a
| coordinate (tuple-element) with the blue inputs. If you
| split the "input-and-output" set into "same color" and
| "color changed", you can notice that the changes only go
| from red to blue, and that the coordinates that changed
| are clustered, and at least one element of the cluster
| shares a coordinate with a blue input.
|
| Of course, it's easy to build this chain of reasoning in
| retrospect, but it doesn't seem like a complete stretch:
| each step only requires noticing patterns in the data,
| and it's how a reasonably puzzle-savvy person might solve
| this if you didn't let them draw the squares on paper.
| There are a lot of escape games with much more complex
| chains of reasoning, and random office workers solve them
| all the time.
|
| The visual aspect makes the patterns jump to us more, but
| the fact that o3 couldn't find them at all with thousands
| of dollars of compute budget still seems meaningful to
| me.
|
| EDIT: Actually, looking at Twitter discussions[1], o3
| _did_ find those patterns, but was stumped by ambiguity
| in the test input that the examples didn't cover. Its
| failures on the "cascading rectangles" example[2] look
| much more interesting.
|
| [1]:
| https://x.com/bio_bootloader/status/1870339297594786064
|
| [2]: https://x.com/_AI30_/status/1870407853871419806
| 93po wrote:
| they say it isn't AGI but i think the way o3 functions can be
| refined to AGI - it's learning to solve new, novel
| problems. we just need to make it do that more consistently,
| which seems achievable
| qnleigh wrote:
| I like the notion, implied in the article, that AGI won't be
| verified by any single benchmark, but by our collective
| inability to come up with benchmarks that defeat some
| eventual AI system. This matches the cat-and-mouse game we've
| been seeing for a while, where benchmarks have to constantly
| adapt to better models.
|
| I guess you can say the same thing for the Turing Test.
| Simple chat bots beat it ages ago in specific settings, but
| the bar is much higher now that the average person is
| familiar with their limitations.
|
| If/once we have an AGI, it will probably take weeks to months
| to really convince ourselves that it is one.
| nyrikki wrote:
| GPQA scores are mostly from pre-training, against content in
| the corpus. They have gone silent but look at the GPT4
| technical report which calls this out.
|
| We are nowhere close to what Sam Altman calls AGI and
| transformers are still limited to what uniform-TC0 can do.
|
| As an example the Boolean Formula Value Problem is
| NC1-complete, thus beyond transformers but trivial to solve
| with a TM.
|
| As it is now proven that the frame problem is equivalent to the
| halting problem, even if we can move past uniform-TC0 limits,
| novelty is still a problem.
|
| I think the advancements are truly extraordinary, but unless
| you set the bar very low, we aren't close to AGI.
|
| Heck we aren't close to P with commercial models.
| sebzim4500 wrote:
| Isn't any physically realizable computer (including our
| brains) limited to what uniform-TC0 can do?
| drdeca wrote:
| Do you just mean because any physically realizable computer
| is a finite state machine? Or...?
|
| I wouldn't describe a computer's usual behavior as having
| constant depth.
|
| It is fairly typical to talk about problems in P as being
| feasible (though when the constant factors are too big,
| this isn't strictly true of course).
|
| Just because for unreasonably large inputs, my computer
| can't run a particular program and produce the correct
| answer for that input, due to my computer running out of
| memory, we don't generally say that my computer is
| fundamentally incapable of executing that algorithm.
| nyrikki wrote:
| Neither TC0 nor uniform-TC0 are physically realizable, they
| are tools not physical devices.
|
| The default nonuniform circuit classes are allowed a
| different, unrelated circuit per input size; the uniform
| variants require the whole family to be generated by a
| single machine. Both allow unbounded fan-in gates.
|
| Similar to how a k-tape TM doesn't get 'charged' for the
| input size.
|
| With Nick Class (NC) the number of components is similar to
| traditional compute time while depth relates to the ability
| to parallelize operations.
|
| These are different than biological neurons, not better or
| worse but just different.
|
| Human neurons can use dendritic compartmentalization, use
| spike timing, can retime spikes etc...
|
| While the perceptron model we use in ML is useful, it is
| not able to do xor in one layer, while biological neurons
| do that without anything even reaching the soma, purely in
| the dendrites.
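|
| (For concreteness, a tiny Python sketch with my own toy
| weights, not anything from an ML library: two layers of
| threshold units compute XOR, while no single threshold unit
| w1*x1 + w2*x2 >= b can produce that table, per Minsky and
| Papert.)
|
|     def step(x):
|         return 1 if x >= 0 else 0
|
|     def xor_two_layer(x1, x2):
|         # hidden layer: an OR unit and an AND unit
|         h_or  = step(x1 + x2 - 0.5)
|         h_and = step(x1 + x2 - 1.5)
|         # output unit: "OR but not AND"
|         return step(h_or - h_and - 0.5)
|
|     print([xor_two_layer(a, b)
|            for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])
|     # -> [0, 1, 1, 0]
|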
|
| Statistical learning models still come down to a choice
| function, no matter if you call that set shattering or...
|
| With physical computers the time hierarchy does apply and
| if TIME(g(n)) is given more time than TIME(f(n)), g(n) can
| solve more problems.
|
| So you can simulate a NTM with exhaustive search with a
| physical computer.
|
| Physical computers also tend to have NAND and XOR gates,
| and can have different circuit depths.
|
| When you are in TC0, you only have AND, OR and Threshold
| (or majority) gates.
|
| Think of instruction level parallelism in a typical CPU, it
| can return early, vs Itanium EPIC, which had to wait for
| the longest operation. Predicated execution is also how
| GPUs work.
|
| They can send a mask and save on load store ops as an
| example, but the cost of that parallelism is the constant
| depth.
|
| It is the parallelism tradeoff that both makes transformers
| practical as well as limit what they can do.
|
| The IID assumption and autograd requiring smooth manifolds
| plays a role too.
|
| The frame problem, which causes hard problems to become
| unsolvable for computers and people alike does also.
|
| But the fact that we have polynomial time solutions for the
| Boolean Formula Value Problem, as mentioned in my post
| above is probably a simpler way of realizing physical
| computers aren't limited to uniform-TC0.
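|
| (A throwaway Python illustration of that last point; the
| nested-tuple formula encoding is made up for the example. An
| ordinary sequential evaluator handles the Boolean Formula
| Value Problem in one linear pass.)
|
|     def evaluate(f):
|         """Evaluate a formula given as nested tuples,
|         e.g. ("and", ("or", True, False), ("not", False))."""
|         if isinstance(f, bool):
|             return f
|         op, *args = f
|         if op == "not":
|             return not evaluate(args[0])
|         if op == "and":
|             return all(evaluate(a) for a in args)
|         if op == "or":
|             return any(evaluate(a) for a in args)
|         raise ValueError("unknown operator: " + op)
|
|     print(evaluate(("and", ("or", True, False), ("not", False))))
|     # -> True
|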
| norir wrote:
| Personally I find "human-level" to be a borderline meaningless
| and limiting term. Are we now super human as a species relative
| to ourselves just five years ago because of our advances in
| developing computer programs that better imitate what many (but
| far from all) of us were already capable of doing? Have we
| reached a limit to human potential that can only be surpassed
| by digital machines? Who decides what human level is and when
| we have surpassed it? I have seen some ridiculous claims about
| ai in art that don't stand up to even the slightest scrutiny by
| domain experts but that easily fool the masses.
| razodactyl wrote:
| No I think we're just tired and depressed as a species...
| Existing systems work to a degree but aren't living up to
| their potential of increasing happiness according to
| technological capabilities.
| PaulDavisThe1st wrote:
| > It might not yet be able to generate highly novel theories,
| frameworks, or artifacts to the degree that Einstein,
| Grothendieck, or van Gogh could.
|
| Every human does this dozens, hundreds or thousands of times
| ... during childhood.
| ec109685 wrote:
| The problem with ARC is that there are a finite number of
| heuristics that could be enumerated and trained for, which
| would give the model a substantial leg up on this evaluation,
| but not be generalized to other domains.
|
| For example, if they produce millions of examples of the type
| of problems o3 still struggles on, it would probably do better
| at similar questions.
|
| Perhaps the private data set is different enough that this
| isn't a problem, but the ideal situation would be unveiling a
| truly novel dataset, which it seems like arc aims to do.
| golol wrote:
| In order to replace actual humans doing their job I think LLMs
| are lacking in judgement, sense of time and agency.
| Kostchei wrote:
| I mean fkcu me when they have those things, however, maybe
| they are just lazy and their judgement is fine, for a lazy
| intelligence. Inner-self thinks "why are these bastards
| asking me to do this? ". I doubt that is actually happening,
| but now, .. prove it isn't.
| puttycat wrote:
| Great comment. See this as well for another potential reason
| for failure:
|
| https://arxiv.org/abs/2402.10013
| dimitri-vs wrote:
| Have we really watered down the definition of AGI that much?
|
| LLMs aren't really capable of "learning" anything outside their
| training data. Which I feel is a very basic and fundamental
| capability of humans.
|
| Every new request thread is a blank slate utilizing whatever
| context you provide for the specific task and after the thread
| is done (or context limit runs out) it's like it never
| happened. Sure you can use databases, do web queries, etc. but
| these are inflexible bandaid solutions, far from what's needed
| for AGI.
| theptip wrote:
| > LLMs aren't really capable of "learning" anything outside
| their training data.
|
| ChatGPT has had for some time the feature of storing memories
| about its conversations with users. And you can use function
| calling to make this more generic.
|
| I think drawing the boundary at "model + scaffolding" is more
| interesting.
| dimitri-vs wrote:
| Calling the sentence or two it arbitrarily saves when you
| state your preferences and profile info "memories" is a
| stretch.
|
| True equivalent to human memories would require something
| like a multimodal trillion token context window.
|
| RAG is just not going to cut it, and if anything will
| exacerbate problems with hallucinations.
| bubblyworld wrote:
| That's true for vanilla LLMs, but also keep in mind that
| there are no details about o3's architecture at the moment.
| Clearly they are doing _something_ different given the huge
| performance jump on a lot of benchmarks, and it may well
| involve in-context learning.
| catmanjan wrote:
| Given every other iteration has basically just been the
| same thing but bigger, why should we think this?
| bubblyworld wrote:
| My point was to caution against being too confident about
| the underlying architecture, not to argue for any
| particular alternative.
|
| Your statement is false - things changed a lot between
| gpt4 and o1 under the hood, but notably _not_ a larger
| model size. In fact the model size of o1 is smaller than
| gpt4 by several orders of magnitude! Improvements are
| being made in other ways.
| uncomplexity_ wrote:
| on the spatial data i see it as a highly intelligent head of a
| machine that just needs better limbs and better senses.
|
| i think that's where most hardware startups will specialize
| with in the coming decades, different industries with different
| needs.
| mirkodrummer wrote:
| Please stop calling it AGI; we don't even know or agree
| universally what that should actually mean. How far have we
| gotten with the hype of calling a lossy probabilistic
| compressor that slowly fires words at us AGI? That's a real
| bummer to me.
| razodactyl wrote:
| Is this comment voted down because of sentiment / polarity?
|
| Regardless the critical aspect is valid, AGI would be
| something like Cortana from Halo.
| ryoshu wrote:
| Ask o3: is P=NP?
| amelius wrote:
| It will just answer with the current consensus on the matter.
| zwnow wrote:
| This is not AGI lmao.
| CliveBloomers wrote:
| Another meaningless benchmark, another month--it's like clockwork
| at this point. No one's going to remember this in a month; it's
| just noise. The real test? It's not in these flashy metrics or
| minor improvements. The only thing that actually matters is how
| fast it can wipe out the layers of middle management and all
| those pointless, bureaucratic jobs that add zero value.
|
| That's the true litmus test. Everything else? It's just fine-
| tuning weights, playing around the edges. Until it starts cutting
| through the fat and reshaping how organizations really operate,
| all of this is just more of the same.
| handfuloflight wrote:
| Agreed, but isn't it management who decides that this would be
| implemented? Are they going to propagate their own removal?
| zamadatix wrote:
| Middle manager types are probably interested in their salary
| performance more than anything. "Real" management (more of
| their assets come from their ownership of the company than a
| salary) will override them if it's truthfully the best
| performing operating model for the company.
| oytis wrote:
| So far AI market seems to be focused on replacing meaningful
| jobs, meaningless ones look safe (which kind of makes sense if
| you think about it).
| 6gvONxR4sf7o wrote:
| I'm glad these stats show a better estimate of human ability than
| just the average mturker. The graph here has the average mturker
| performance as well as a STEM grad measurement. Stuff like that
| is why we're always feeling weird that these things supposedly
| outperform humans while still sucking. I'm glad to see 'human
| performance' benchmarked with more variety (attention, time,
| education, etc).
| RivieraKid wrote:
| It sucks that I would love to be excited about this... but I
| mostly feel anxiety and sadness.
| xvector wrote:
| Humanity is about to enter an even steeper hockey stick growth
| curve. Progressing along the Kardashev scale feels all but
| inevitable. We will live to see Longevity Escape Velocity. I'm
| fucking pumped and feel thrilled and excited and proud of our
| species.
|
| Sure, there will be growing pains, friction, etc. Who cares?
| There always is with world-changing tech. Always.
| drcode wrote:
| longevity for the AIs
| tokioyoyo wrote:
| My job should be secure for a while, but why would an average
| person give a damn about humanity when they might lose their
| jobs and comfort levels? If I had kids, I would absolutely
| hate this uncertainty as well.
|
| "Oh well, I guess I can't give the opportunities to my kid
| that I wanted, but at least humanity is growing rapidly!"
| xvector wrote:
| > when they might lose their jobs and comfort levels?
|
| Everyone has always worried about this for every major
| technology throughout history
|
| IMO AGI will dramatically increase comfort levels and lower
| your chances of death, disease, etc.
| tokioyoyo wrote:
| Again, sure, but it doesn't matter to an average person.
| That's too much focus on the hypothetical future. People
| care about the current times. In the short term it will
| suck for a good chunk of people, and whether the
| sacrifice is worth it will depend on who you are.
|
| People aren't really in an uproar yet, because
| implementations haven't affected the job market of the
| masses. Afterwards? Time will tell.
| xvector wrote:
| Yes, people tend to focus on current times. It's an
| incredibly shortsighted mentality that selfishly puts
| oneself over tens of billions of future lives being
| improved. https://pessimistsarchive.org
| tokioyoyo wrote:
| Do you have any dependents, like parents or kids, by any
| chance? Imagine not being able to provide for them. Think
| how you'd feel in such circumstances.
|
| Like in general I totally agree with you, but I also
| understand why a person would care about their loved ones
| and themselves first.
| realce wrote:
| Eventually you draw the black ball, it is inevitable.
| MVissers wrote:
| We almost wiped ourselves out in a nuclear war in the
| '70s. If that had happened, would it have been
| worth it? Probably not.
|
| Beyond immediate increase in inequality, which I agree
| could be worth it in the long run if this was the only
| problem, we're playing a dangerous game.
|
| The smartest and most capable species on the planet that
| dominates it for exactly this reason, is creating
| something even smarter and more capable than itself in
| the hope it'd help make its life easier.
|
| Hmm.
| croemer wrote:
| Longevity Escape Velocity? Even if you had orders of
| magnitude more people working on medical research, it's not a
| given that prolonging life indefinitely is even possible.
| soheil wrote:
| Of course it's a given. Unless you want to invoke
| supernatural causes, the human brain is a collection of
| cells with electro-chemical connections which, if fully
| reconstructed either physically or virtually, would
| necessarily represent the original person's brain.
| Therefore, with sufficient intelligence, it would be
| possible to engineer technology able to do that
| reconstruction without even having to go to the atomic
| level, which we also have a near-full understanding of
| already.
| lewhoo wrote:
| > Sure, there will be growing pains, friction, etc. Who
| cares?
|
| That's right. Who cares about pains of others and why they
| even should are absolutely words to live by.
| xvector wrote:
| Yeah, with this mentality, we wouldn't have electricity
| today. You will never make the transition to a new technology
| painless, no matter what you do. (See:
| https://pessimistsarchive.org)
|
| What you are likely doing, though, is making many more
| future humans pay a cost in suffering. Every day we delay
| longevity escape velocity is another 150k people dead.
| lewhoo wrote:
| There was a time when in the name of progress people were
| killed for whatever resources they possessed, others were
| enslaved etc. and I was under the impression that the
| measure of our civilization is that we actually DID care
| and just how much. It seems to me that you are very eager
| to put up altars of sacrifice without even thinking that
| the problems you probably have in mind are perfectly
| solvable without them.
| smokedetector1 wrote:
| By far the greatest issue facing humanity today is wealth
| inequality.
| xvector wrote:
| Nah, it's death. People objectively are doing better than
| ever despite wealth inequality. By all metrics - poverty,
| quality of life, homelessness, wealth, purchasing power.
|
| I'd rather just... not die. Not unless I want to. Same
| for my loved ones. That's far more important than "wealth
| inequality."
| asdf6969 wrote:
| I would rather follow in the steps of uncle Ted than let AI
| turn me into a homeless person. It's no consolation that my
| tent will have a nice view of a lunar colony
| objektif wrote:
| You sound like a rich person.
| soheil wrote:
| I agree, save invoking supernatural causes, the human brain
| is a collection of cells with electro-chemical connections
| that if fully reconstructed either physically or virtually
| would necessarily need to represent the original person's
| brain. Therefore with sufficient intelligence it would be
| possible to engineer technology that would be able to do that
| reconstruction without even having to go to the atomic level,
| which we also have a near full understanding of already.
| achierius wrote:
| https://www.transformernews.ai/p/richard-ngo-openai-
| resign-s...
|
| >But while the "making AGI" part of the mission seems well on
| track, it feels like I (and others) have gradually realized
| how much harder it is to contribute in a robustly positive
| way to the "succeeding" part of the mission, especially when
| it comes to preventing existential risks to humanity.
|
| Almost every single one of the people OpenAI had hired to
| work on AI safety have left the firm with similar messages.
| Perhaps you should at least consider the thinking of experts?
|
| You and I will likely not live to see much of anything past
| AGI.
| goatlover wrote:
| > Sure, there will be growing pains, friction, etc. Who
| cares?
|
| The people experiencing the growing pains, friction, etc.
| pupppet wrote:
| We're enabling a huge swath of humanity being put out of work
| so a handful of billionaires can become trillionaires.
| abiraja wrote:
| And also the solving of hundreds of diseases that ail us.
| hartator wrote:
| It doesn't matter. Statists would rather be poor, sick, and
| dead than risk trillionaires.
| thrance wrote:
| You should read about workers' rights in the Gilded Age,
| and see how good _laissez-faire_ capitalism was. What do
| you think will happen when the only thing you can trade
| with the trillionaires, your labor, becomes worthless?
| lewhoo wrote:
| One of the biggest factors in risk of death right now is
| poverty. Also what is being chased right now is "human
| level on most economically viable tasks" because the
| automated research for solving physics etc. even now seems
| far-fetched.
| thrance wrote:
| You need to solve diseases _and_ make the cure available.
| Millions die of curable diseases every year, simply because
| they are not deemed useful enough. What happens when your
| labor becomes worthless?
| asdf6969 wrote:
| Why do you think you'll be able to afford healthcare? The
| new medicine is for the AI owners
| distortionfield wrote:
| This is the same boring alarmist argument we've heard since
| the Industrial Revolution. Humans have always turned the
| extra output provided by technological advancement into
| increased overall productivity.
| stri8ed wrote:
| It would happen in China regardless what is done here.
| Removing billionaires does not fix this. The ship has sailed.
| gom_jabbar wrote:
| Anxiety and sadness are actually mild emotional responses to
| the dissolution of human culture. Nick Land in 1992:
|
| "It is ceasing to be a matter of how we think about technics,
| if only because technics is increasingly thinking about itself.
| It might still be a few decades before artificial intelligences
| surpass the horizon of biological ones, but it is utterly
| superstitious to imagine that the human dominion of terrestrial
| culture is still marked out in centuries, let alone in some
| metaphysical perpetuity. The high road to thinking no longer
| passes through a deepening of human cognition, but rather
| through a becoming inhuman of cognition, a migration of
| cognition out into the emerging planetary technosentience
| reservoir, into 'dehumanized landscapes ... emptied spaces'
| where human culture will be dissolved. Just as the capitalist
| urbanization of labour abstracted it in a parallel escalation
| with technical machines, so will intelligence be transplanted
| into the purring data zones of new software worlds in order to
| be abstracted from an increasingly obsolescent anthropoid
| particularity, and thus to venture beyond modernity. Human
| brains are to thinking what mediaeval villages were to
| engineering: antechambers to experimentation, cramped and
| parochial places to be.
|
| [...]
|
| Life is being phased-out into something new, and if we think
| this can be stopped we are even more stupid than we seem." [0]
|
| Land is being ostracized for some of his provocations, but it
| seems pretty clear by now that we are in the Landian
| Accelerationism timeline. Engaging with his thought is crucial
| to understanding what is happening with AI, and what is still
| largely unseen, such as the autonomization of capital.
|
| [0] https://retrochronic.com/#circuitries
| achierius wrote:
| It's obvious that there are lines of flight (to take a
| Deleuzian tack, a la Land) away from the current political-
| economic assemblage. For example, a strategic nuclear
| exchange starting tomorrow (which can always happen --
| technical errors, a rogue submarine, etc.) would almost
| certainly set back technological development enough that we'd
| have no shot at AI for the next few decades. I don't know
| whether you agree with him, but I think the fact that he
| ignores this fact is quite unserious, especially given the
| likely destabilizing effects sub-AGI AI will have on
| international politics.
| Jcampuzano2 wrote:
| Same. It's sad, but I honestly hoped they never achieved these
| results, and that it would turn out to be impossible or take
| an insurmountable amount of resources. But here we are, on the
| verge of making most humans useless when it comes to
| productivity.
|
| While there are those that are excited, the world is not
| prepared for the level of distress this could put on the
| average person without critical changes at a monumental level.
| JacksCracked wrote:
| If you don't feel like the world needed grand scale changes
| at a societal level with all the global problems we're unable
| to solve, you haven't been paying attention. Income
| inequality, corporate greed, political apathy, global
| warming.
| sensanaty wrote:
| And you think the bullshit generators backed by the largest
| corporate entities in humanity who are, as we speak,
| causing all the issues you mention are somehow gonna solve
| any of this?
| CamperBob2 wrote:
| If you still think this technology is a "bullshit
| generator," then it's safe to say you're also wrong about
| a great many other things in life.
|
| That would bug me, if I were you.
| r-zip wrote:
| They're not wrong though. The frequency with which these
| things still just make shit up is astonishingly bad. Very
| dismissive of a legitimate criticism.
| CamperBob2 wrote:
| It's getting better, faster than you and I and the GP
| are. What else matters?
|
| You can't bullshit your way through this particular
| benchmark. Try it.
|
| And yes, they're wrong. The latest/greatest models "make
| shit up" perhaps 5-10% as frequently as were seeing just
| a couple of years ago. Only someone who has deliberately
| decided to stop paying attention could possibly argue
| otherwise.
| sensanaty wrote:
| And yet I still can't trust Claude or o1 to not get the
| simplest of things, such as test cases (not even full on
| test suites, just the test cases) wrong, consistently. No
| amount of handholding from me or prompting or feeding it
| examples etc helps in the slightest, it is just
| consistently wrong for anything but the simplest possible
| examples, which takes more effort to manually verify than
| if I had just written it myself. I'm not even using an
| obscure stack or language, but _especially_ with things
| that aren't Python or JS it shits the bed even worse.
|
| I have noticed it's great in the hands of marketers and
| scammers, however. Real good at those "jobs", so I see
| why the cryptobros have now moved onto hailing LLMs as
| the next coming of jesus.
| crakhamster01 wrote:
| Well said! There's no way big tech and institutional
| investors are pouring billions of dollars into AI because
| of corporate greed. It's definitely so that they can
| redistribute wealth equally once AGI is achieved.
|
| /s
| phito wrote:
| AI will fix none of that
| larve wrote:
| I have been diving deep into LLM coding over the last 3 years
| and regular encountered that feeling along the way. I still at
| times have a "wtf" moment where I need to take a break.
| However, I have been able to quell most of my anxieties around
| my job / the software profession in general (I've been at this
| professionally for 25+ years and software has been my dream job
| since I was 6).
|
| For one, I found AI coding to work best in a small team, where
| there is an understanding of what to build and how to build it,
| usually in a close feedback loop with the designers / users.
| Throw the usual managerial company corporate nonsense on top
| and it doesn't really matter if you can instacreate a piece of
| software, if nobody cares for that piece of software and it's
| just there to put a checkmark on the Q3 OKR reports.
|
| Furthermore, there is a lot of software to be built out there,
| for people who can't afford it yet. A custom POS system for the
| local baker so that they don't have to interact with a
| computer. A game where squids eat algae for my nephews at
| christmas. A custom photo layout software for my dad who
| despairs at indesign. A plant watering system for my friend. A
| local government information website for older citizens. Not
| only can these be built at a fraction of the cost they were
| before, but they can be built in a manner where the people
| using the software are directly involved in creating it. Maybe
| they can get an 80% hacked version together if they are
| technically inclined. I can add the proper database backend and
| deployment infrastructure. Or I can sit with them and iterate
| on the app as we are talking. It is also almost free to create
| great documentation, in fact, LLM development is most
| productive when you turn software engineering best practices
| up to 11.
|
| Furthermore, I found these tools incredible for actively
| furthering my own fundamental understanding of computer science
| and programming. I can now skip the stuff I don't care to learn
| (is it foobarBla(func, id) or foobar_bla(id, func)) and put the
| effort where I actually get a long-lived return. I have become
| really ambitious with the things I can tackle now, learning
| about all kinds of algorithms and operating system patterns and
| chemistry and physics etc... I can also create documents to
| help me with my learning.
|
| Local models are now entering the phase where they are getting
| to be really useful, definitely > gpt3.5 which I was able to
| use very productively already at the time.
|
| Writing (creating? manifesting? I don't really have a good word
| for what I do these days) software that makes me and real
| humans around me happy is extremely fulfilling, and has
| alleviated most of my angst around the technology.
| bluecoconut wrote:
| Efficiency is now key.
|
| ~=$3400 per single task to meet human performance on this
| benchmark is a lot. Also it shows the bullets as "ARC-AGI-TUNED",
| which makes me think they did some undisclosed amount of fine-
| tuning (eg. via the API they showed off last week), so even more
| compute went into this task.
|
| We can compare this roughly to a human doing ARC-AGI puzzles,
| where a human will take (high variance in my subjective
| experience) between 5 seconds and 5 minutes to solve the task.
| (So I'd argue a human is at 0.03 USD - 1.67 USD per puzzle at
| 20 USD/hr, and their document cites an average mechanical
| turker at $2 USD per task.)
|
| Going the other direction: I am interpreting this result as
| human-level reasoning now costing (approximately) $41k/hr to
| $2.5M/hr with current compute.
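|
| (Back-of-envelope, in Python, under the same rough
| assumptions: ~$3400/task for o3 high, and 5 s to 5 min per
| puzzle for a human.)
|
|     cost_per_task = 3400            # rough o3 high-compute $/task
|     human_seconds = (5, 5 * 60)     # ~5 s to ~5 min per puzzle
|     tasks_per_hour = [3600 / s for s in human_seconds]  # 720 .. 12
|     print([round(cost_per_task * t) for t in tasks_per_hour])
|     # -> [2448000, 40800], i.e. ~$2.5M/hr down to ~$41k/hr
|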
|
| Super exciting that OpenAI pushed the compute out this far so we
| could see the O-series scaling continue and intersect humans on
| ARC; now we get to work towards making this economical!
| riku_iki wrote:
| > ~=$3400 per single task
|
| report says it is $17 per task, and $6k for whole dataset of
| 400 tasks.
| bluecoconut wrote:
| That's the low-compute mode. In the plot at the top where
| they score 88%, O3 High (tuned) is ~3.4k
| ionwake wrote:
| Sorry to be a noob, but can someone tell me: does this mean
| o3 will be unaffordable for a typical user? Will only
| companies with thousands to spend per query be able to use
| this?
|
| Sorry for being thick, I'm just confused how they can turn
| this into an affordable service.
| JohnnyMarcone wrote:
| There are likely many efficiency gains that will be made
| before it's released, and after. Also they showed o3 mini
| to be better than o1 for less cost in multiple
| benchmarks, so there are already improvements there at a
| lower cost than what's available.
| ionwake wrote:
| Great thank you
| HDThoreaun wrote:
| The low compute one did as well as the average person
| though
| jhrmnn wrote:
| That's for the low-compute configuration that doesn't reach
| human-level performance (not far though)
| riku_iki wrote:
| I referred on high compute mode. They have table with
| breakdown here: https://arcprize.org/blog/oai-o3-pub-
| breakthrough
| EVa5I7bHFq9mnYK wrote:
| That's high EFFICIENCY. High efficiency = low compute.
| junipertea wrote:
| The table row with 6k figure refers to high efficiency,
| not high compute mode. From the blog post:
|
| Note: OpenAI has requested that we not publish the high-
| compute costs. The amount of compute was roughly 172x the
| low-compute configuration.
| gbnwl wrote:
| That's "efficiency" high, which actually means less
| compute. The 87.5% score using low efficiency (more
| compute) doesn't have cost listed.
| bluecoconut wrote:
| They use some confusing labels.
|
| "High Efficiency" is O3 Low; "Low Efficiency" is O3 High.
|
| They left the "Low efficiency" (O3 High) values as `-`
| but you can infer them from the plot at the top.
|
| Note the $20 and $17 per task align with the X-axis
| position of the O3-low.
| binarymax wrote:
| _" Note: OpenAI has requested that we not publish the high-
| compute costs. The amount of compute was roughly 172x the
| low-compute configuration."_
|
| The low compute was $17 per task. Speculate 172*$17 for the
| high compute is $2,924 per task, so I am also confused on the
| $3400 number.
| bluecoconut wrote:
| 3400 came from counting pixels on the plot.
|
| Also it's $20 for the o3-low via the table for the semi-
| private set, which x172 is 3440, also coming in close to the
| 3400 number.
| xrendan wrote:
| You're misreading it, there's two different runs, a low and a
| high compute run.
|
| The number for the high-compute one is ~172x the first one
| according to the article so ~=$2900
| Thorrez wrote:
| What's extra confusing is that in the graph the runs are
| called low compute and high compute. In the table they're
| called high efficiency and low efficiency. So the high and
| low got swapped.
| bluecoconut wrote:
| Some other important quotes: "Average human off the street:
| 70-80%. STEM college grad: >95%. Panel of 10 random humans:
| 99-100%" -@fchollet on X
|
| So, considering that the $3400/task system isn't able to
| compete with a STEM college grad yet, we still have some room
| (but it is shrinking; I expect even more compute will be thrown
| at this and we'll see these barriers broken in coming years).
|
| Also, some other back of envelope calculations:
|
| The gap in cost is roughly 10^3 between O3 High and Avg.
| mechanical turkers (humans). Pure GPU cost improvement
| (~2x every 2-2.5 years) puts us at 20-25 years.
|
| The question is now, can we close this "to human" gap (10^3)
| quickly with algorithms, or are we stuck waiting for the 20-25
| years for GPU improvements. (I think it feels obvious: this is
| new technology, things are moving fast, the chance for
| algorithmic innovation here is high!)
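|
| (Same napkin math in Python; the 10^3 gap and the 2-2.5 year
| doubling time are the assumptions above, nothing more.)
|
|     import math
|     cost_gap = 1000                  # ~10^3 between o3-high and a human
|     doublings = math.log2(cost_gap)  # ~10 cost halvings needed
|     print(round(doublings * 2), round(doublings * 2.5))
|     # -> 20 25  (years, at one halving per 2-2.5 years)
|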
|
| I also personally think that we need to adjust our efficiency
| priors, and start looking not at "humans" as the bar to beat,
| but theoretical computable limits (which show much larger gaps,
| ~10^9-10^15, for modest problems). Though, it may simply be the
| case that tool/code use + AGI at near human cost covers a lot
| of that gap.
| zamadatix wrote:
| I don't follow how 10 random humans can beat the average STEM
| college grad and average humans in that tweet. I suspect it's
| really "a panel of 10 randomly chosen experts in the space"
| or something?
|
| I agree the most interesting thing to watch will be cost for
| a given score more than maximum possible score achieved (not
| that the latter won't be interesting by any means).
| hmottestad wrote:
| Might be that within a group of 10 randomly chosen people,
| when each person attempts to solve the tasks, at least 99%
| of the time one person out of the 10 will get it
| right.
| bcrosby95 wrote:
| Two heads are better than one. 10 is way better. Even if they
| aren't a field of experts. You're bound to get random
| people that remember random stuff from high school,
| college, work, and life in general, allowing them to piece
| together a solution.
| inerte wrote:
| Aaaah thanks for the explanation. PANEL of 10 humans, as
| in, they were all together. I parsed the phrase as "10
| random people" > "average human" which made little sense.
| modeless wrote:
| Actually I believe that he did mean 10 random people
| tested individually, not a committee of 10 people. The
| key being that the question is considered to be answered
| correctly if any one of the 10 people got it right. This
| is similar to how LLMs are evaluated with pass@5 or
| pass@10 criteria (because the LLM has no memory so
| running it 10 times is more like asking 10 random people
| than asking the same person 10 times in a row).
|
| I would expect 10 random people to do better than a
| committee of 10 people because 10 people have 10 chances
| to get it right while a committee only has one. Even if
| the committee gets 10 guesses (which must be made
| simultaneously, not iteratively) it might not do better
| because people might go along with a wrong consensus
| rather than push for the answer they would have chosen
| independently.
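|
| (Under an independence assumption that's surely too
| generous, the numbers do line up; real people's errors
| correlate, so the true figure is a bit lower.)
|
|     p_one = 0.75                     # "average human: 70-80%"
|     p_panel = 1 - (1 - p_one) ** 10  # at least 1 of 10 is right
|     print(p_panel)                   # -> ~0.999999, i.e. ~99-100%
|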
| elcomet wrote:
| He means 10 humans voting for the answer
| herval wrote:
| Depends on the task, no?
|
| Do you have a sense of what kind of task this benchmark
| includes? Are they more "general" such that random people
| would fare well or more specialized (ie something a STEM
| grad studied and isn't common knowledge)?
| judge2020 wrote:
| It does, which is why I don't really subscribe to any
| test like this being great for actually determining
| "AGI". A true AGI would be able to continuously train and
| create new LLMs that enable it to become a SME in
| entirely new areas.
| generic92034 wrote:
| Whether that works that way at all depends on the group
| dynamic. It is easily possible that a not-so-bright
| individual takes an (unofficial) leadership position in
| the group and overrides the input of smarter members.
| Think of any meetings with various hierarchy levels in a
| company.
| daveguy wrote:
| The ARC AGI questions can be a little tricky, but the
| solutions can generally be easily explained. And you get
| 3 tries. So, the 3 best descriptions of the solution
| voted on by 10 people are going to be very effective. The
| problem space just isn't complicated enough for an
| unofficial "leader" to sway the group to 3 wrong answers.
| zamadatix wrote:
| Aha, "at least 1 of a panel of 10", not "the panel of 10
| averaged"! Thanks, that makes so much more sense to me
| now.
|
| I have failed the real ARC AGI :)
| shkkmo wrote:
| It is fairly well documented that groups of people can show
| cognitive abilities that exceed that of any individual
| member. The classic example of this is if you ask a group
| of people to estimate the number of jellybeans in a jar,
| you can get a more accurate result than if you test to find
| the person with the highest accuracy and use their guess.
|
| This isn't to say groups always outperform their members on
| all tasks, just that it isn't unusual to see a result like
| that.
| zamadatix wrote:
| Yes, my shortcoming was in understanding the 10 were
| implied to have their successes merged together by being
| a panel rather than just the average of a special
| selection.
| HDThoreaun wrote:
| ARC-AGI is essentially an IQ test. There is no "expert in
| the space". Its just a question of if youre able to spot
| the pattern.
| dlkf wrote:
| If you take a vote of 10 random people, then as long as
| their errors are not perfectly correlated, you'll do better
| than asking one person.
|
| https://en.m.wikipedia.org/wiki/Ensemble_learning
| olalonde wrote:
| Even if you assume that non-STEM grads are dumb, isn't
| there a good probability of having a STEM graduate among 10
| random humans?
| cchance wrote:
| I mean, considering the big breakthrough this year for o1/o3
| seems to have been "models having internal thoughts might
| help reasoning", to everyone outside of the AI field this
| was sort of a "duh" moment.
|
| I'd hope we see more internal optimizations and improvements
| to the models. The idea behind the big breakthrough,
| "don't spit out the first thought that pops into your head",
| seems obvious to everyone outside of the field, but guess
| what: it turns out it was a big improvement when the devs
| decided to add it.
| versteegen wrote:
| > seems obvious to everyone outside of the field
|
| It's obvious to people inside the field too.
|
| Honestly, these things seem to be less obvious to people
| outside the field. I've heard so many uninformed takes
| about LLMs not representing real progress towards
| intelligence (even here on HN of all places; I don't know
| why I torture myself reading them), that they're just dumb
| memorizers. No, they are an incredible breakthrough,
| because extending them with things like internal thoughts
| will so obviously lead to results such as o3, and far
| beyond. Maybe a few more people will start to understand
| the trajectory we're on.
| Agentus wrote:
| a trickle of people sure, but most people never
| accidentally stumble upon good evaluation skills let
| alone reason themselves to that level, so i dont see how
| most people will have the semblance of an idea of a
| realistic trajectory of ai progress. i think most people
| have very little conceptualization of their own
| thinking/cognitive patterns, at least not enough to
| sensibly extrapolate it onto ai.
|
| doesnt help that most people are just mimics when talking
| about stuff thats outside their expertise.
|
| Hell, my cousin, a quality-college-educated individual with
| high social/emotional IQ, will go down the conspiracy
| theory rabbit hole so quickly based on some baseless crap
| printed on the internet. then he'll talk about people
| being satan worshipers.
| versteegen wrote:
| You're being pretty harsh, but:
|
| > i think most people have very little conceptualization
| of their own thinking/cognitive patterns, at least not
| enough to sensibly extrapolate it onto ai.
|
| Quite true. If you spend a lot of time reading and
| thinking about the workings of the mind you lose sight of
| how alien it is to intuition. While in highschool I first
| read, in New Scientist, the theory that conscious thought
| lags behind the underlying subconscious processing in the
| brain. I was shocked that _New Scientist_ would print
| something so _unbelievable_. Yet there seemed to be an
| element of truth to it so I kept thinking about it and
| slowly changed my assessment.
| Agentus wrote:
| sorry, humans are stupid and what intelligence they have
| is largely impotent. if this wasnt the case life wouldnt
| be this dystopia. my crassness comes from not necessarily
| trying to pick on a particular group of humans, just
| disappointment in recognizing the efficacy of human
| intelligence and its ability to turn reality into a
| better reality (meh).
|
| yeah i was just thinking how a lot of thoughts which i
| thought were my original thoughts really were made
| possible out of communal thoughts. like i can maybe have
| some original frontier thoughts that involve averages but
| thats only made possible because some other person
| invented the abstraction of averages then that was
| collectively disseminated to everyone in education, not
| to mention all the subconscious processes that are
| necessary for me to will certain thoughts into
| existence. Makes me reflect on how much cognition is
| really mine, vs (not mine) an inevitable product of a
| deterministic process and a product of other humans.
| sfjailbird wrote:
| Sounds like your cousin is able to think for himself. The
| amount of bullshit I hear from quality-college educated
| individuals, who simply repeat outdated knowledge that is
| in their college curriculum, is no less disappointing.
| daveguy wrote:
| Buying whatever bullshit you see on the internet to such
| a degree that you're re-enacting satanic panic from the
| 80s is not "thinking for yourself". It's being gullible
| about areas outside your expertise.
| 0points wrote:
| > No, they are an incredible breakthrough, because
| extending them with things like internal thoughts will so
| obviously lead to results such as o3, and far beyond.
|
| While I agree that the LLM progress as of late is
| interesting, the rest of your sentiment sounds more like
| you are in a cult.
|
| As long as your field keeps coming up with less and less
| realistic predictions and fails to deliver over and over,
| eventually even the most gullible will lose faith in you.
|
| Because that's what this all is right now. Faith.
|
| > Maybe a few more people will start to understand the
| trajectory we're on.
|
| All you are saying is that you believe something will
| happen in the future.
|
| We can't have an intelligent discussion under those
| premises.
|
| It's depressing to see so many otherwise smart people
| fall for their own hype train. You are only helping rich
| people get more rich by spreading their lies.
| dogma1138 wrote:
| Reflection isn't a new concept, but a) actually proving
| that it's an effective tool for these types of models and
| b) finding an effective method for reflection that doesn't
| just lock you into circular "thinking" were the hard parts
| and hence the "breakthrough".
|
| It's very easy to say "hey, of course it's obvious", but there
| is nothing obvious about it, because you are anthropomorphizing
| these models and then using that bias after the fact as a
| proof of your conjecture.
|
| This isn't how real progress is achieved.
| beardedwizard wrote:
| Calling it reflection is, for me, further
| anthropomorphizing. However I am in violent agreement
| that a common feature of llm debate is centered around
| anthropomorphism leading to claims of "thinking longer"
| or "reflecting" when none of those things are happening.
|
| The state of the art seems very focused on promoting that
| language that might encode reason is as good as actual
| reason, rather than asking what a reasoning model might
| look like.
| iandanforth wrote:
| Let's say that Google is already 1 generation ahead of nvidia
| in terms of efficient AI compute. ($1700)
|
| Then let's say that OpenAI brute forced this without any
| meta-optimization of the hypothesized search component (they
| just set a compute budget). This is probably low hanging
| fruit and another 2x in compute reduction. ($850)
|
| Then let's say that OpenAI was pushing really really hard for
| the numbers and was willing to burn cash and so didn't bother
| with serious thought around hardware aware distributed
| inference. This could be _more_ than a 2x decrease in cost
| like we've seen deliver 10x reductions in cost via better
| attention mechanisms, but let's go with 2x for now. ($425).
|
| So I think we've got about an 8x reduction in cost sitting
| there once Google steps up. This is probably 4-6 months of
| work flat out if they haven't already started down this path,
| but with what they've got with deep research, maybe it's
| sooner?
|
| Then if "all" we get is hardware improvements we're down to
| what 10-14 years?
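|
| (The same guesses, chained, in Python; each factor is an
| assumption from the list above, not a measurement.)
|
|     cost = 3400.0                    # rough o3 high-compute $/task today
|     for guess in ("next-gen accelerators",
|                   "tuning the search budget",
|                   "hardware-aware distributed inference"):
|         cost /= 2                    # each assumed to be worth ~2x
|         print(guess, cost)
|     # -> 1700.0, 850.0, 425.0: ~8x all told
|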
| promptdaddy wrote:
| *deep mind research ?
| iandanforth wrote:
| Nope, Gemini Advanced with Deep Research. New mode of
| operation that does more "thinking" and web searches to
| answer your question.
| qingcharles wrote:
| Until 2022 most AI research was aimed at improving the
| _quality_ of the output, not the _quantity_.
|
| Since then there has been a tsunami of optimizations in the
| way training and inference is done. I don't think we've
| even begun to find all the ways that inference can be
| further optimized at both hardware and software levels.
|
| Look at the huge models that you can happily run on an M3
| Mac. The cost reduction in inference is going to vastly
| outpace Moore's law, even as chip design continues on its
| own path.
| bjornsing wrote:
| > are we stuck waiting for the 20-25 years for GPU
| improvements
|
| If this turns out to be hard to optimize / improve then there
| will be a _huge_ economic incentive for efficient ASICs. No
| freaking way we'll be running on GPUs for 20-25 years, or
| even 2.
| coolspot wrote:
| LLMs need efficient matrix multipliers. GPUs are
| specialized ASICs for massive matrix multiplication.
| vlovich123 wrote:
| LLMs get to maybe ~20% of the rated max FLOPS for a GPU.
| It's not hard to imagine that a purpose-built ASIC with
| an adjusted software stack gets us significantly more
| real performance.
| boroboro4 wrote:
| They get more than this. For prefill we can get 70%
| matmul utilization, for generation less than this but
| we'll get to >50 too eventually.
| xbmcuser wrote:
| You are missing that the cost of electricity is also going to
| keep falling because of solar and batteries. This year in
| China my napkin math says it is $0.05 per kWh, and following
| the cost-decline trajectory it will be under $0.01 in 10 years.
| patrickhogan1 wrote:
| Bingo! Solar energy moves us toward a future where a
| household's energy needs become nearly cost-free.
|
| Energy Need: The average home uses 30 kWh/day, requiring about
| 6 kW of output over 5 peak sunlight hours.
|
| Multijunction Panels: Lab efficiencies are already at 47%
| (2023), and with multiple years of progress, 60% efficiency
| is probable.
|
| Efficiency Impact: At 60% efficiency and ~1 kW/m2 of peak
| sunlight, panels generate 600 W/m2, requiring 10 m2 (e.g.,
| 2 m x 5 m) to meet energy needs.
|
| This size can fit on most home roofs, be mounted on a pole
| with stacked layers, or even be hung through an apartment
| window.
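|
| (The arithmetic above, spelled out in Python; the 1 kW/m2
| insolation and the 60% efficiency are the stated assumptions,
| not current products.)
|
|     daily_need_kwh = 30                  # average US home
|     peak_sun_hours = 5
|     power_needed_kw = daily_need_kwh / peak_sun_hours     # 6.0 kW
|     insolation_w_m2 = 1000               # standard test-condition sun
|     panel_w_m2 = insolation_w_m2 * 0.60  # 600 W/m2 at 60% efficiency
|     area_m2 = power_needed_kw * 1000 / panel_w_m2         # 10.0 m2
|     print(power_needed_kw, panel_w_m2, area_m2)
|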
| arcticbull wrote:
| Everyone always forgets that they only perform at less
| than half of their rated capacity and require significant
| battery installations. Rooftop solar plus storage is
| actually more expensive than nuclear on a comparable
| system LCOE due to its lack of economies of scale.
| Rooftop solar plus storage is about the most expensive
| form of electricity on earth, maybe excluding gas peaker
| plants.
| nateglims wrote:
| It varies by a lot of factors but it's way less than
| half. Photovoltaic panels have around 10% capacity
| utilization vs 50-70% for a gas or nuke plant.
| xbmcuser wrote:
| Everyone also forgets the speed of the price decline for
| solar and batteries; your statement is completely false
| propaganda made up by power companies. Today rooftop
| solar plus battery is already cost-competitive with nuclear
| in many countries, like India.
| patrickhogan1 wrote:
| You're right that rooftop solar and storage have costs
| and efficiency limits, but those are improving quickly.
|
| Rooftop solar harnesses energy from the sun, which is
| powered by nuclear fusion--arguably the most effective
| nuclear reactor in our solar system.
| theendisney wrote:
| The thing everyone forgets is that all good energy
| technology is seized by governments for military purposes
| and to preserve the status quo. God knows how far it
| progressed.
|
| What a joke
| sahmeepee wrote:
| Average _US_ home.
|
| In Europe it is around 6-7 kWh/day. This might increase
| with electrification of heating and transport, but
| probably nothing like as much as the energy consumption
| they are replacing (due to greater efficiency of the
| devices consuming the energy and other factors like the
| quality of home insulation.)
|
| In the rest of the world the average home uses
| significantly less.
| jdhwosnhw wrote:
| While I agree with your general assessment, I think your
| conclusion is a bit off. You're assuming 1 kW/m^2, which
| is only true with the sun directly overhead. A real-world
| solar setup gets hit with several factors of cosine
| (related to roof pitch, time of day, day of year, and
| latitude) that conspire to reduce the total output.
|
| For example, my 50 sq m setup, at -29 deg latitude,
| generated your estimated 30 kWh/day output. I have panels
| with ~20% efficiency, suggesting that at 60% efficiency,
| the average household would only get to around half their
| energy needs with 10 sq m.
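|
| (Running those numbers back through my setup as a sanity
| check in Python; the 1 kW/m2 figure is the usual
| test-condition assumption.)
|
|     rated_kw = 50 * 1000 * 0.20 / 1000  # 10 kW peak: 50 m2 at 20%
|     sun_hours_equiv = 30 / rated_kw     # ~3 full-sun hours/day observed
|     future_kw = 10 * 1000 * 0.60 / 1000 # 6 kW peak: 10 m2 at 60%
|     print(future_kw * sun_hours_equiv)  # -> 18.0 kWh/day, a bit over half
|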
|
| Yes, solar has the potential to drastically reduce energy
| costs, but even with free energy storage, individual
| households aren't likely to achieve self-sustainability.
| nateglims wrote:
| Is it going to fall significantly for data centers?
| Industrial policy for consumer power is different from
| subsidizing it for data centers and if you own grid
| infrastructure why would you tank the price by putting up
| massive amounts of capital?
| xbmcuser wrote:
| It's the same about using the cloud or using your own
| infrastructure there will be a point where building your
| own solar and battery plant is cheaper than what they are
| charging they will need to follow the price decline if
| they want to keep the customers if not there will be mass
| scale grid defections.
| nateglims wrote:
| I don't think this reflects the reality of the power
| industry. Data centers are the only significant growth in
| actual generated power in decades and hyperscalers are
| already looking at very bespoke solutions.
|
| The heavy commodification of networking and compute
| brought about by the internet and cloud aligned with tech
| company interests in delivering services or content to
| consumers. There does not seem to be an emerging
| consensus that data center operators also need to provide
| consumer power.
| xbmcuser wrote:
| It was not the reality of the power industry, but it will be
| soon. We have never had a source of electricity that is both
| the cheapest and still getting cheaper, and easy to install;
| this is something unique.
|
| I don't see Google, Amazon, Microsoft or any company paying
| $10 for something if building it themselves will cost
| them $5. Either the price difference will reach a point
| where investing in power production themselves makes
| sense, or the power companies will decrease prices. Look at
| how all 3 have already been investing in power production
| themselves over the last decade, either to get better
| prices or for PR.
| lyu07282 wrote:
| But didn't we liberalize energy markets for that reason?
| If anyone could undercut the market like that, wouldn't
| that happen automatically and the prices go down anyway?
| /s
| barney54 wrote:
| But the cost of electricity is not falling--it's
| increasing. Wholesale prices have decreased, but retail
| rates are up. In the U.S. rates are up 27% over the past 4
| years. In Europe prices are up too.
| lucubratory wrote:
| I am not certain because I've been very focused on the o3
| news, but at least yesterday neither the US nor Europe
| were part of China.
| xbmcuser wrote:
| Most large compute clusters would be buying electricity
| at wholesale prices, not at retail prices. But anyway, solar
| and battery prices have only just reached the tipping point
| this year; the longer power companies keep retail
| prices high, the more people will defect from the grid and
| install their own solar + batteries.
| lxgr wrote:
| But data centers pay wholesale prices or even less (given
| that especially AI training and, to a lesser extent,
| inference clusters can load shed like few other consumers
| of electricity).
| fulafel wrote:
| And this is great news as long as marginal production
| (the most expensive to produce, first to turn on/off
| according to demand) of electricity is fossils.
| NoLinkToMe wrote:
| That's a bit of a non-statement. Virtually all prices
| increase because of money supply, but we consider things
| to get cheaper if their prices grow less fast than
| inflation / income.
|
| General inflation has outpaced the inflation of
| electricity prices by about 3x in the past 100 years. In
| other words, electricity has gotten cheaper over time in
| purchasing power terms.
|
| And that's whilst our electricity usage has gone up by
| 10x in the last 100 years.
|
| And this concerns retail prices, which include
| distribution/transmission fees. These have gone up a lot
| as you get complications on the grid, some of which is
| built on a century old design. But wholesale prices (the
| cost of generating electricity without
| transmission/distribution) are getting dirt cheap, and
| for big AI datacentres I'm pretty sure they'll hook up to
| their own dedicated electricity generation at wholesale
| prices, off the grid, in the coming decades.
| necovek wrote:
| If climate change ends up changing weather profiles and we
| start seeing many more cloudy days or dust/mist in the air,
| we'll need to push those solar panel above (all the way to
| space?) or have many more of them, figure out transmission
| to the ground and costs will very much balloon.
|
| Not saying this will happen, but it's risky to rely on
| solar as the only long-term solution.
| miki123211 wrote:
| It's also worth keeping in mind that AIs are a lot less risky
| to deploy for businesses than humans.
|
| You can scale them up and down at any time, they can work
| 24/7 (including holidays) with no overtime pay and no breaks,
| they need no corporate campuses, office space, HR personnel
| or travel budgets, you don't have to worry about key
| employees going on sick/maternity leave or taking time off
| the moment they're needed most, they won't assault a
| coworker, sue for discrimination or secretly turn out to be a
| pedophile and tarnish the reputation of your company, they
| won't leak internal documents to the press or rage quit
| because of new company policies, they won't even stop working
| when a pandemic stops most of the world from running.
| antihipocrat wrote:
| AI brings similar risks - they can leak internal
| information, they can be tricked into performing prohibited
| tasks (with catastrophic effects if this is connected to
| core systems), they could be accused of actions that are
| discriminatory (biased training sets are very common).
|
| Sure, if a business deploys it to perform tasks that are
| inherently low risk e.g. no client interface, no core
| system connection and low error impact, then the human
| performing these tasks is going to be replaced.
| snozolli wrote:
| _they can be tricked into performing prohibited tasks_
|
| This reminds me of the school principal who sent $100k to
| a scammer claiming to be Elon Musk. The kicker is that
| she was repeatedly told that it was a scam.
|
| https://abc7chicago.com/fake-elon-musk-jan-mcgee-
| principal-b...
| tstrimple wrote:
| This is one of the things which annoys me most about
| anti-LLM hate. Your peers aren't right all the time
| either. They believe incorrect things and will pursue
| worse solutions because they won't acknowledge a better
| way. How is this any different from an LLM? You have to
| question _everything_ you're presented with. Sometimes
| that Stack Overflow answer isn't directly applicable to
| your exact problem but you can extrapolate from it to
| resolve your problem. Why is an LLM viewed any
| differently? Of course you can't just blindly accept it
| as the one true answer, but you literally cannot do that
| with humans either. Humans produce a ton of shit code and
| non-solutions and it's fine. But when an LLM does it,
| it's a serious problem that means the tech is useless.
| Much of the modern world is built on shit solutions and
| we still hobble along.
| lazide wrote:
| Everyone knows humans can be idiots. The problem is that
| people seem to think LLMs can't be idiots, and because
| they aren't human there is no way to punish them. And
| then people give them too much credit/power, for their
| own purposes.
|
| Which makes LLMs far more dangerous than idiot humans in
| most cases.
| brookst wrote:
| No. Nobody thinks LLMs are perfect. That's a strawman.
|
| And... I am really not sure punishment is the answer to
| fallibility, outside of almost kinky Catholicism.
|
| The reality is these things are very good, but imperfect,
| much like people.
| lazide wrote:
| Clearly you haven't been listening to any CEO press
| releases lately?
|
| And when was the last time a support chatbot let you
| actually complain or bypass to a human?
| thecupisblue wrote:
| Sorry man, but I literally know of startups invested into
| by YC where CEOs for 80% of their management
| decisions/vision/comms use ChatGPT ... or should I say
| some use Claude now, as they think it's smarter and does
| not make mistakes.
|
| Let that sink in.
| onion2k wrote:
| I wouldn't be surprised if GPT genuinely makes better
| decisions than an inexperienced, first-time CEO who has
| only been a dev before, especially if the person
| prompting it has actually put some effort into
| understanding their own weaknesses. It certainly wouldn't
| be any worse than someone whose only experience is
| reading a few management books.
| lazide wrote:
| And here is a great example of the problem.
|
| An LLM doesn't make decisions. It generates text that
| plausibly looks like it made a decision, when prompted
| with the right text.
| beardedwizard wrote:
| Why is this distinction lost in every thread on this
| topic, I don't get it.
| lazide wrote:
| A lot more people are credulous idiots than anyone wants
| to believe - and the confusion/misunderstanding is being
| actively propagated.
| sirsinsalot wrote:
| Think of all the human growth and satisfaction being lost
| to risk mitigation by offloading the pleasure of failure
| to Machines.
| lazide wrote:
| Ah, but machines can't fail! So don't worry, humans will
| still get to experience the 'pleasure'. But won't be able
| to learn/change anything.
| Mordisquitos wrote:
| > No. Nobody thinks LLMs are perfect. That's a strawman.
|
| I'm afraid that's not the case. Literally yesterday I was
| speaking with an old friend who was telling us how one of
| his coworkers had presented a document with mistakes and
| serious miscalculations as part of some project. When my
| friend pointed out the mistakes, which were intuitively
| obvious just by critically understanding the numbers, the
| guy kept insisting _"no, it's correct, I did it with
| ChatGPT"_. It took my friend doing the calculations
| explicitly and showing that they made no sense to
| convince the guy that it was wrong.
| 0points wrote:
| Not _people_.
|
| Certain gullible people, who tend to listen to certain
| charlatans.
|
| Rational, intelligent people wouldn't consider replacing
| a skilled human worker with an LLM that on a good day can
| compete with a 3-year-old.
|
| You may see the current age as a litmus test for critical
| thinking.
| mplewis wrote:
| Humans can tell you how confident they are in something
| being right or wrong. An LLM has no internal model and
| cannot do such a thing.
| swiftcoder wrote:
| > Humans can tell you how confident they are in something
| being right or wrong
|
| Humans are also very confidently wrong a considerable
| portion of the time. Particularly about anything outside
| their direct expertise
| SketchySeaBeast wrote:
| People only being willing to say they are unsure some of
| the time is still better than LLMs. I suppose, given that
| everything is outside of their area of expertise, it's
| very human of them.
| daveguy wrote:
| That's still better than never being able to make an
| accurate confidence assessment. The fact that this is
| worse outside your expertise is a main reason why
| expertise is so valued in hiring decisions.
| pineaux wrote:
| It's quite stunning to frame it as anti-LLM hate. It's on
| the pro-LLM people to convince the anti-LLM people that
| choosing LLMs is an ethically correct choice with all
| the necessary guardrails. It's also on the pro-LLM people
| to show the usefulness of the product. If the pro-LLM people
| are right, it will only be a matter of time before these
| people see the error of their ways. But resorting to
| ad-hominems is a sure way of creating a divide...
| gf000 wrote:
| But human stupidity, while it can sometimes be an unknown
| unknown thanks to its creativity, is mostly a known
| unknown.
|
| LLMs fail in entirely novel ways you can't even fathom
| upfront.
| sirsinsalot wrote:
| GenAI has a 100% failure rate at enjoying quality of life,
| emotional fulfillment and psychological safety.
|
| I'd say those are the goals we should be working toward.
| That's the failure we want to look at. We are humans.
| halgir wrote:
| > LLMs fail in entirely novel ways you can't even fathom
| upfront.
|
| Trust me, so do humans. Source: have worked with humans.
| lucubratory wrote:
| >secretly turn out to be a pedophile and tarnish the
| reputation of your company
|
| This is interesting because it's both Oddly Specific and
| also something I have seen happen and I still feel really
| sorry for the company involved. Now that I think about it,
| I've actually seen it happen twice.
| rockskon wrote:
| AI has a different risk profile than humans. They are a
| _lot_ more risky for business operations where failure is
| wholly unacceptable under any circumstance.
|
| They're risky in that they fail in ways that aren't readily
| deterministic.
|
| And would you trust your life to a self-driving car in New
| York City traffic?
| lxgr wrote:
| Isn't everybody in NYC already? (The dangers of bad
| driving are much higher for pedestrians than for people
| in cars; there are more of the former than of the latter
| in NYC; I'd expect there to be a non-zero number of fully
| self driving cars already in the city.)
| rockskon wrote:
| That doesn't answer my question.
| 9dev wrote:
| It does, in a way; AI is already there, all around you,
| whether you like it or not. Technological progress is
| Pandora's box; you can't take it back or slow it down.
| Businesses will use AI for critical workflows, and all
| good that they bring, and all bad too, will happen.
| rockskon wrote:
| How about you answer my question since he did not.
|
| Would you trust your life to a self-driving car in New
| York City traffic?
| lxgr wrote:
| GP got it exactly right: I already am. There's no way for
| me to opt out of having self-driving cars on the streets
| I regularly cross as a pedestrian.
| chefandy wrote:
| If there are any fully-autonomous cars on the streets of
| nyc, there aren't many of them and I don't think there's
| any way for them to operate legally. There has been
| discussion about having a trial.
| wwweston wrote:
| We can just insulate businesses employing AI from any
| liability, problem solved.
| fsloth wrote:
| I guess - yes, from a business & liability sense? "This
| service you are now paying $100 for? We can sell it to
| you for $5, but with the caveat _we give no guarantees that
| it works or is fit for purpose_ - click here to
| accept".
| 9dev wrote:
| "Well, our AI that was specifically designed for
| maximising gains above all else may indeed have
| instructed the workers to cut down the entire Amazonas
| forest for short-term gains in furniture production." But
| no human was involved in the decision, so nobody is
| liable and everything is golden? Is that the future you
| would like to live in?
| lazide wrote:
| Hmmm, how much stock do I own in this hypothetical
| company? (/s, kinda)
| wwweston wrote:
| Apparently I need to work on my deadpan delivery.
|
| Or just articulate things openly: we _already_ insulate
| business owners from liability because we think it tunes
| investment incentives, and in so doing have created
| social entities /corporate "persons"/a kind of AI who
| have different incentives than most human beings but are
| driving important social decisions. And they've supported
| some astonishing cooperation which has helped produce
| things like the infrastructure on which we are having
| this conversation! But also, we have existing AIs of this
| kind who are already inclined to cut down the entire
| Amazonas forest for furniture production because it
| maximizes their function.
|
| That's not just the future we live in, that's the world
| we've been living in for a century or few. On one hand,
| industrial productivity benefits, on the other hand, it
| values human life and the ecology we depend on about like
| any other industrial input. Yet many people in the
| world's premier (former?) democracy repeat enthusiastic
| endorsements of this philosophy reducing their personal
| skin to little more than an industrial input: "run the
| government like a business."
|
| Unless people change, we are very much on track to create
| a world where these dynamics (among others) of the human
| condition are greatly magnified by all kinds of
| automation technology, including AI. Probably starting
| with limited liability for AIs and companies employing
| them, possibly even _statutory_ limits, though it 's much
| more likely that wealthy businesses will simply be
| insulated by the sheer resources they have to make
| sure the courts can't hold them accountable, even where
| we still have a judicial system that isn't willing to
| play calvinball for cash or catechism (which,
| unfortunately, does not seem to include a supreme court
| majority).
|
| In short, you and I probably agree that liability for AI
| is important, and limited liability for it isn't good.
| Perhaps I am too skeptical that we can pull this off, and
| being optimistic would serve everyone better.
| ijidak wrote:
| It is amazing to me that we have reached an era where we
| are debating the trade-off of hiring thinking machines!
|
| I mean, this is an incredible moment from that
| standpoint.
|
| Regarding the topic at hand, I think that there will
| always be room for humans for the reasons you listed.
|
| But even replacing 5% of humans with AIs will have
| mind-boggling consequences.
|
| I think you're right that there are jobs that humans will
| be preferred for for quite some time.
|
| But, I'm already using AI with success where I would
| previously hire a human, and this is in this primitive
| stage.
|
| With the leaps we are seeing, AI is coming for jobs.
|
| Your concerns relate to exactly how many jobs.
|
| And only time will tell.
|
| But I think some meaningful percentage of the population
| -- even if just 5% of humanity -- will be replaced by AI.
| miki123211 wrote:
| This is a really hard and weird ethical problem IMHO, and
| one we'll have to deal with sooner or later.
|
| Imagine you have a self-driving AI that causes fatal
| accidents 10 times less often than your average human
| driver, but when the accidents happen, nobody knows why.
|
| Should we switch to that AI, and have 10 times fewer
| accidents and no accountability for the accidents that do
| happen, or should we stay with humans, have 10x more road
| fatalities, but stay happy because the perpetrators end
| up in prison?
|
| Framed like that, it seems like the former solution is
| the only acceptable one, yet people call for CEOs to go
| to prison when an AI goes wrong. If that were the case,
| companies wouldn't dare use any AI, and that would
| basically degenerate to the latter solution.
| okasaki wrote:
| Wait, why would we want 10x more traffic fatalities?
| stavros wrote:
| We wouldn't, that's their point.
| moritzwarhier wrote:
| I don't know about your country, but people going to
| prison for causing road fatalities is extremely rare
| here.
|
| Even temporary loss of the driver's license has a very
| high bar, and that's the main form of accountability for
| driver behavior in Germany, apart from fines.
|
| Badly injuring or killing someone who themselves did not
| violate traffic safety regulations is far from guaranteed
| to cause severe repercussions for the driver.
|
| By default, any such situation is an accident and at best
| people lose their license for a couple of months.
| paulryanrogers wrote:
| Drivers are the apex predators. My local BMV passed me
| after I badly failed the vision test. Thankfully I was
| shaken enough to immediately go to the eye doctor and get
| treatment.
| chefandy wrote:
| Sadly, we live in a society where those executives would
| use that impunity as carte blanche to spend no money on
| improving things (in the best-case scenario) or, even more
| likely, keep cutting safety expenditures until the body
| counts get high enough for it to start damaging sales. If
| we've already given them a free pass, they will exploit
| it to the greatest possible extent to increase profit.
| ETH_start wrote:
| What evidence exists for this characterization?
| rgbrgb wrote:
| The way health insurance companies optimize for denials
| in the US.
| chefandy wrote:
| Let's see... off the top of my head...
|
| - Air Pollution
|
| - Water Pollution
|
| - Disposable Packaging
|
| - Health Insurance
|
| - Steward Hospitals
|
| - Marketing Junk Food, Candy and Sodas directly to
| children
|
| - Tobacco
|
| - Boeing
|
| - Finance
|
| - Pharmaceutical Opiates
|
| - Oral Phenylephrine to replace pseudoephedrine despite
| knowing a) it wasn't effective, and b) posed a risk to
| people with common medical conditions.
|
| - Social Media engagement maximization
|
| - Data Brokerage
|
| - Mining Safety
|
| - Construction site safety
|
| - Styrofoam Food and Bev Containers
|
| - ITC terminal in Deer Park (read about the decades
| of them spewing thousands of pounds of benzene into the air
| before the whole fucking thing blew up, using their
| influence to avoid addressing any of it, and how they
| didn't have automatic valves, spill detection, fire
| detection, sprinklers... in _2019_.)
|
| - Grocery store and restaurant chains disallowing
| cashiers from wearing masks during the first pandemic
| wave, well after we knew the necessity, because it made
| customers uncomfortable.
|
| - Boar's Head Liverwurst
|
| And, you know, plenty more. As someone that grew up
| playing in an unmarked, illegal, not-access-controlled
| toxic waste dump in a residential area owned by a huge
| international chemical conglomerate-- and just had some
| cancer taken out of me last year-- I'm pretty familiar
| with various ways corporations are willing to sacrifice
| health and safety to bump up their profit margin. I guess
| ignoring that kids were obviously playing in a swamp of
| toluene, PCBs, waste firefighting chemicals, and all
| sorts of other things on a plot not even within sight of
| the factory in the middle of a bunch of small farms was
| _just the cost of doing business_. As was my friend who,
| when he was in vocational high school, was welding a
| metal ladder above a storage tank in a chemical factory
| across the state. The plant manager assured the school
| the tanks were empty, triple rinsed and dry, but they
| exploded, blowing the roof off the factory and taking my
| friend with it. They were apparently full of waste
| chemicals and IIRC, the manager admitted to knowing that
| in court. He said he remembers waking up briefly in the
| factory parking lot where he landed, and then the next
| thing he remembers was waking up in extreme pain wearing
| the compression gear he'd have to wear into his mid
| twenties to keep his grafted skin on. Briefly looking
| into the topic will show how common this sort of
| malfeasance is in manufacturing.
|
| The burden of proof is on people saying that they _won't_
| act like the rest of American industry tasked with
| safety.
| ajmurmann wrote:
| Like with Cruise. One freak accident and they practically
| decided to go out of business. Oh wait...
| chefandy wrote:
| If that's the only data point you look at in American
| industry, it would be pretty encouraging. I mean,
| _surely_ they'd have done the same if they were a branch
| of a large publicly traded company with a big high-
| production product pipeline...
| monkeynotes wrote:
| > nobody knows why
|
| But we do know the culpability rests on the shoulders of
| the humans who decided the tech was ready for work.
| ethbr1 wrote:
| Hey look, it's almost like we're back at the end of the
| First Industrial Revolution (~1850), as society grapples
| with how to create happiness in a rapidly shifting
| economy of supply and demand, especially for labor.
| https://en.m.wikipedia.org/wiki/Utilitarianism#John_Stuart_M...
|
| Pretty bloody time for labor though.
| https://en.m.wikipedia.org/wiki/Haymarket_affair
| MaxPock wrote:
| It depends on what the risk is. Would it be whole or in
| part? In an organisation, a failure by HR might present
| an isolated departmental risk, while with an AI that might
| not be the case.
| zelphirkalt wrote:
| Deterministic they may be, but unforeseeable for humans.
| ajmurmann wrote:
| Every statistic I've seen indicated much better accident
| rates for self-driving cars than human drivers. I've
| taken Waymo rides in SF and felt perfectly safe. I've
| taken Lyft and Uber and especially taxi rides where I
| felt much less safe. So I definitely would take the self-
| driving car. Just because I don't understand an accident
| doesn't make it more likely to happen.
|
| The one minor risk I see is the car being too polite and
| getting effectively stuck in dense traffic. That's a
| nuisance though.
|
| Is there something about NYC traffic I'm missing?
| aprilthird2021 wrote:
| There's one important part about risk management though.
| If your Waymo does crash, the company is liable for it,
| and there's no one to shift the blame onto. If a human
| driver crashes, that's who you can shift liability onto.
|
| Same with any company that employs AI agents. Sure they
| can work 24/7, but every mistake they make the company
| will be liable for (or the AI seller). With humans, their
| fraud, their cheating, their deception, can all be wiped
| off the company and onto the individual.
| ethbr1 wrote:
| The next step is going to be around liability insurance
| for AI agents.
|
| That's literally the point of liability insurance -- to
| allow the routine use of technologies that rarely (but
| catastrophically) fail, by amortizing risk over time /
| population.
| aprilthird2021 wrote:
| Potentially. I would be skeptical that businesses can do
| this to shield themselves from the liability. For
| example, VW could not use insurance to protect them from
| their emissions scandal. There are thresholds (fraud,
| etc.) that AI can breach, which I don't think insurance
| can legally protect you from
| danielovichdk wrote:
| Name one technology that has come with computers that
| hasn't resulted in more humans being put to work?
|
| The rhetoric of not needing people to do work is
| cartoonish. I mean there is no sane explanation of how and
| why that would happen without employing more people yet
| again to take care of the advancements.
|
| It's not like technology has brought less work-related
| stress. But it has definitely increased it. Humans were not
| made for using technology at such a pace as it's being
| rolled out.
|
| The world is fucked. Totally fucked.
| mortehu wrote:
| Self check-out stations, ATMs, and online brokerages.
| Recently chat support. Namely cases where millions of
| people used to interact with a representative every week,
| and now they don't.
| palmfacehn wrote:
| "Name one use of electric lighting that hasn't resulted
| in candle makers losing work?"
|
| The framing of the question misses the point. With
| electric lighting we can now work longer into the night.
| Yes, fewer people use and make candles. However, the
| second order effects allow us to be more productive in
| areas we may not have previously considered.
|
| New technologies open up new opportunities for
| productivity. The bank tellers displaced by ATMs
| can create value elsewhere. Consumers save time by not
| waiting in a queue, allowing them to use their time more
| economically. Banks have lower overhead, allowing more
| customers to afford their services.
| 0points wrote:
| Where to even start?
|
| Digital banks
|
| Cashless money transfer services
|
| Self service
|
| Modern farms
|
| Robo lawn mowers
|
| NVRs with object detection
|
| I can go on forever
| salawat wrote:
| Please do. I'm certain you can't, and you'll have to stop
| much sooner than you think. Appeals to triviality are the
| first refuge of the person who thinks they know, but does
| not.
| TheOtherHobbes wrote:
| It's all fun and games until the infra crashes and you
| can't work out why, because a machine has written all of
| the code and no one understands how it works or what it's
| doing.
|
| Or - worse - there is no accessible code anywhere, and you
| have to prompt your way out of "I'm sorry Dave, I can't do
| that," while nothing works.
|
| And a human-free economy does... what? For whom? When 99%
| of the population is unemployed, what are the 1% doing
| while the planet's ecosystems collapse around them?
| sirsinsalot wrote:
| It honestly borders on psychopathic the way engineers are
| treating humans in this context.
|
| People talking like this also, in the back of their minds,
| like to think they'll be OK. They're smart enough to still
| be needed. They're human, but they'll be OK even
| while working to make genAI outperform them at their own
| work.
|
| I wonder how they'll feel about their own hubris when
| they struggle to feed their family.
|
| The US can barely make healthcare work without disgusting
| consequences for the sick. I wonder what mass
| unemployment looks like.
| a2800276 wrote:
| But when Sam Altman owns all the money in the world
| surely he'll distribute some of it via his not-for-profit AI
| company?
| bnj wrote:
| For the moment the displacement is asymmetrical; AI
| replacing employees, but not AI replacing consumers. If
| AI causes mass unemployment, the pool of consumers
| (profit to companies) will shrink. I wonder what the
| ripple effects of that will be.
| sirsinsalot wrote:
| There's no point being rich in a world where the economy
| is unhealthy.
| jvanderbot wrote:
| It honestly borders on midwit to constantly introduce a
| false dichotomy of AI vs humans. It's just stupid base
| animal logic.
|
| There is absolutely no reason a programmer should expect
| to write code as they do now forever, just as ASM experts
| had to move on. And there's no reason (no precedent _and_
| no indicators) to expect that a well-educated, even-
| moderately-experienced technologist will suddenly find
| themselves without a way to feed their family - unless
| they stubbornly refuse to reskill or change their
| workflows.
|
| I do believe the days of "everyone makes 100k+" are
| nearly over, and we're headed towards a severely bimodal
| distribution, but I do not see how, for the next 10-15
| years at least, we can't all become productive building
| the tools that will obviate our own jobs while we do them
| - and get comfortably retired in the meantime.
| losteric wrote:
| There is no comfortable retirement if the process of
| obviating our own jobs is not coupled with appropriate
| socioeconomic changes.
| jvanderbot wrote:
| I don't see it. Don't you have a 401k or EU style
| pension? Aren't you saving some money? If not, why are
| you in software? I don't make as much as I thought I
| might, but I make enough to consider the possibility of
| surviving a career change.
| twh270 wrote:
| Reskill to what? When AI can do software development, it
| will also be able to do pretty much any other job that
| requires some learning.
| jvanderbot wrote:
| Even if one refuses to move on from software dev to
| something like AI deployer or AI validator or AI steerer,
| there might be a need.
|
| If innovation ceases, then AI is king - push existing
| knowledge into your dataset, train, and exploit.
|
| If innovation continues, there's always a gap. It takes
| time for a new thing to be made public "enough" for it to
| be ingested and synthesized. Who does this? Who finds the
| new knowledge?
|
| Who creates the direction and asks the questions? Who
| determines what to build in the first place? Who
| synthesizes the daily experience of everyone around them
| to decide what tool needs to exist to make our lives
| easier? Maybe I'm grasping at straws here, but the world
| in which all scientific discovery, synthesis, direction
| and vision setting, etc, is determined by AI seems really
| far away when we talk about code generation and symbolic
| math manipulation.
|
| These tools are self driving cars, and we're drivers of
| the software fleet. We need to embrace the fact that we
| might end up watching 10 cars self operate rather than
| driving one car, or maybe we're just setting
| destinations, but there simply isn't an absolutist zero
| sum game here unless all one thinks about is keeping the
| car on the road.
|
| AND even if there were, repeating doom and feeling
| helpless is the last thing you want. Maybe it isn't strictly
| true that we can all adapt and should try, but it's
| certainly good _policy_.
| exhaze wrote:
| You misunderstand the fundamentals. I've built a type-
| safe code generation pipeline using TypeScript that
| enforces compile-time and runtime safety. Everything
| generates from a single source of truth - structured JSON
| containing the business logic. The output is
| deterministic, inspectable, and version controlled.
|
| Your concerns about mysterious AI code and system crashes
| are backwards. This approach eliminates integration bugs
| and maintenance issues by design. The generated
| TypeScript is readable, fully typed, and consistently
| updated across the entire stack when business logic
| changes.
|
| If you're struggling with AI-generated code
| maintainability, that's an implementation problem, not a
| fundamental issue with code generation. Proper type
| safety and schema validation create more reliable
| systems, not less. This is automation making developers
| more productive - just like compilers and IDEs did - not
| replacing them.
|
| The code works because it's built on sound software
| engineering principles: type safety, single source of
| truth, and deterministic generation. That's verifiable
| fact, not speculation.
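|
| As a toy illustration of the single-source-of-truth idea
| (schema and field names here are hypothetical; the actual
| pipeline described is TypeScript, this just shows the shape
| of the approach in Python):
|
|     # Minimal analogue of "generate typed code from one JSON
|     # source of truth". Deterministic: same JSON in, same
|     # code out, so the output can be reviewed and versioned.
|     import json
|
|     schema = json.loads("""
|     { "Invoice": { "id": "str", "amount_cents": "int",
|                    "paid": "bool" } }
|     """)
|
|     def generate_dataclass(name, fields):
|         lines = ["from dataclasses import dataclass", "",
|                  "@dataclass", f"class {name}:"]
|         lines += [f"    {f}: {t}" for f, t in fields.items()]
|         return "\n".join(lines) + "\n"
|
|     for type_name, fields in schema.items():
|         print(generate_dataclass(type_name, fields))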
| bboygravity wrote:
| humans definitely don't need office space, but your point
| stands
| AustinW wrote:
| LLM office space is pretty expensive. Chillers, backup
| generators, raised floors, communications gear, .... They
| even demand multiple offices for redundancy, not to
| mention the new ask of a nuclear power plant to keep the
| lights on.
| fsndz wrote:
| I get the excitement, but folks, this is a model that
| excels only in things like software engineering/math. They
| basically used reinforcement learning to train the model to
| better remember which pattern to use to solve specific
| problems. This in no way generalises to open-ended tasks in
| a way that makes a human in the loop unnecessary. This
| basically makes assistants better (as soon as they figure
| out how to make it cheaper), but I wouldn't blindly trust
| the output of o3. Sam Altman is still wrong:
| https://www.lycee.ai/blog/why-sam-altman-is-wrong
| girvo wrote:
| Quite. And if it _was_ right, those businesses deploying
| it and replacing humans need humans with jobs and money
| to pay for their products and services...
| fakedang wrote:
| It will just keep bleeding the middle class on and on,
| till the point where either everyone is rich, homeless or
| a plumber or other such licensed worker. And then there
| will be such a glut in the latter (shrinking) market,
| that everyone in that group also becomes either rich or
| homeless.
| palmfacehn wrote:
| Productivity gains increase the standard of living for
| everyone. Products and services become cheaper. Leisure
| time increases. Scarce labor resources can be applied in
| other areas.
|
| I fail to see the difference between AI-employment-doom
| and other flavors of Luddism.
| bayindirh wrote:
| It also fuels the income inequality with a fatter pipe in
| every iteration. You get richer as you move up in the
| supply chain, period. Companies vertically integrate to
| drive costs down in the long run.
|
| As AI gets more prevalent, it'll drive the cost down for
| the companies supplying these services, so the _former_
| employees of said companies will be paid less, or not at
| all.
|
| So tell me, how will paying fewer people less money
| drive their standard of living upwards? I can understand
| the leisure time. Because when you don't have a job, all
| day is leisure time. But you'll need money for that, so
| will these companies fund the masses via government-provided
| Universal Basic Income, so that these people can live a
| borderline miserable life while funding these
| companies to squeeze them more and more?
| CamperBob2 wrote:
| _It also fuels the income inequality with a fatter pipe
| in every iteration_
|
| Who cares? A rising tide lifts all boats. The wealthy
| people I know all have one thing in common: they focused
| more on their own bank accounts than on other people's.
|
| _So, tell me, how paying fewer people less money will
| drive their standard of living upwards?_
|
| Money is how we allocate limited resources. It will
| become less important as resources become less limited,
| less necessary, or (hopefully) both.
| EarthAmbassador wrote:
| Utter nonsense. Productivity gains of the last 40 years
| have been captured by shareholders and top elites.
| Working class wages have been flat all of that time
| despite that gain.
|
| In 2012, Musk was worth $2 billion. He's now worth 223
| times that yet the minimum wage has barely budged in the
| last 12 years as productivity rises.
| palmfacehn wrote:
| >>Productivity gains increase the standard of living for
| everyone.
|
| >Productivity gains of the last 40 years have been
| captured by shareholders and top elites. Working class
| wages have been flat...
|
| Wages do not determine the standard of living. The
| products and services purchased with wages determine the
| standard of living. "Top elites" in 1984 could already
| afford cellular phones, such as the Motorola DynaTAC:
|
| >A full charge took roughly 10 hours, and it offered 30
| minutes of talk time. It also offered an LED display for
| dialing or recall of one of 30 phone numbers. It was
| priced at US$3,995 in 1984, its commercial release year,
| equivalent to $11,716 in 2023.
|
| https://en.wikipedia.org/wiki/Motorola_DynaTAC
|
| Unfortunately, touch screen phones with gigabytes of ram
| were not available for the masses 40 years ago.
| DAGdug wrote:
| What a patently absurd POV! A phone doesn't compensate
| for the inability to solve for basic needs - housing,
| healthy food, healthcare. Or being unable to invest in
| skill development for themselves or their offspring, save
| for retirement.
| runarberg wrote:
| It is also highly likely that the cost of that phone was
| externalized onto a worker in a poorer country that
| doesn't even have basic necessities like running water,
| 24-hour electricity, food security, etc.
| DAGdug wrote:
| Leisure time hasn't increased in the last 100 years
| except for the lower income class which doesn't have
| steady employment. But yes, I see your point that the
| homeless person who might have had a home if he had a
| (now automated) factory job should surely feel good about
| having a phone that only the ultra rich had 40 years ago.
| ethbr1 wrote:
| It's not worth tossing away in sarcasm.
|
| The availability of cheaply priced smartphones and
| cellular data plans has absolutely made being homeless
| suck less.
|
| As you noted though, a home would probably be a
| preferable alternative.
| szundi wrote:
| Never happened with any big technology advancement
| bayindirh wrote:
| Wealth has bled from landlords to warlords and is now
| bleeding to techlords.
|
| Warlords are still rich, but both money and war are
| flowing towards tech. You can get a piece of that pie
| if you're doing questionable things (adtech, targeting,
| data collection, brokering, etc.), but if you're a run-of-
| the-mill, normal person, your circumstances are getting
| harder and harder, because you're slowly squeezed out of
| the system like toothpaste.
| robwwilliams wrote:
| In your blog you say:
|
| > deep learning doesn't allow models to generalize
| properly to out-of-distribution data--and that is
| precisely what we need to build artificial general
| intelligence.
|
| I think even (or especially) people like Altman accept
| this as a fact. I do. Hassabis has been saying this for
| years.
|
| The foundational models are just a foundation. Now start
| building the AGI superstructure.
|
| And this is also where most of the still human
| intellectual energy is now.
| jvanderbot wrote:
| Generally, I agree with you. But, there are risks other
| than "But a human might have a baby any time now - what
| then??".
|
| For AI example(s): Attribution is low, a system built
| without human intervention may suddenly fall outside its
| own expertise and hallucinate itself into a corner,
| everyone may just throw more compute at a system until it
| grows without bound, etc etc.
|
| This "You can scale up to infinity" problem might become
| "You have to scale up to infinity" to build any reasonably
| sized system with AI. The shovel-sellers get fantastically
| rich but the businesses are effectively left holding the
| risk from a fast-moving, unintuitive, uninspected,
| partially verified codebase. I just don't see how anyone
| not building a CRUD app/frontend could be comfortable with
| that, but then again my Tesla is effectively running such a
| system to drive me and my kids. Albeit, that's on a well-
| defined problem and within _literally_ human-made
| guardrails.
| monkeynotes wrote:
| "AIs are a lot less risky to deploy for businesses than
| humans" How do you know? LLMs can't even be properly
| scrutinized, while humans at least follow common psychology
| and patterns we've understood for thousands of years. This
| actually makes humans more predictable and manageable than
| you might think.
|
| The wild part is that LLMs understand us way better than we
| understand them. The jump from GPT-3 to GPT-4 even
| surprised the engineers who built it. That should raise
| some red flags about how "predictable" these systems really
| are.
|
| Think about it - we can't actually verify what these models
| are capable of or if they're being truthful, while they
| have this massive knowledge base about human behavior and
| psychology. That's a pretty concerning power imbalance.
| What looks like lower risk on the surface might be hiding
| much deeper uncertainties that we can't even detect, let
| alone control.
| ETH_start wrote:
| We are not pitted against AI in these match-ups. Instead,
| all humans and AI aligned with the goal of improving the
| human condition, are pitted against rogue AI which are
| not. Our capability to keep rogue AI in check therefore
| grows in proportion to the capabilities of AI.
| daveguy wrote:
| The GP post is about how much better these AIs will be
| than humans once they reach a given skill level. So, yes,
| we are very much pitted against AI unless there are major
| socioeconomic changes. I don't think we are as close to
| AGI as a lot of people are hyping, but at some point it
| would be a direct challenge to human employment. And we
| should think about it before that happens.
| salawat wrote:
| You cannot tell the difference between the two veins of
| AI. Why do you have such a hard time understanding that?
| zitterbewegung wrote:
| Having AI "tarnish the reputation of your company"
| encompasses so much in regard to AI when it can receive
| input and be manipulated by others such as Tai from
| Microsoft and many other outcomes where there is a true
| risk for AI deployment.
| fakedang wrote:
| We can all agree we've progressed so much since Tay.
| cmiles74 wrote:
| "...they need no corporate campuses, office space..."
|
| This is a big downside of AI, IMHO. Those offices need to
| be filled! ;-)
| Mistletoe wrote:
| At what point in the curve of AI is it not ethical to work
| an AI 24/7 because it is alive? What if it is exactly the
| same point where you reach human level performance?
| osigurdson wrote:
| Sure, once AI can actually do a job of some sort, without
| assistance, that job is gone - even if the machine costs
| significantly more. However, it can't remotely do that now
| so can only help a bit.
| m3kw9 wrote:
| Don't forget that humans, which are real GI, paired with
| increasingly capable AI can create a feedback loop to
| accelerate new advances.
| acchow wrote:
| > ~doubling every 2-2.5 years) puts us at 20~25 years.
|
| The trend for power efficiency of compute (megaflops per
| watt) has generally tracked Koomey's law, with a doubling
| every 1.57 years.
|
| Then you also have model performance improving with
| compression. For example, Llama 3.1's 8B outperforms the
| original Llama 65B.
| 0points wrote:
| Then you will just have the issue of supplying enough
| power to support this "linear" growth of yours.
| agumonkey wrote:
| who in this field is anticipating the impact of near-AGI on
| society? maybe i'm too anxious but not planning for a
| potentially workless life seems dangerous (but maybe i'm just
| not following the right groups)
| daveguy wrote:
| AGI would have a major impact on human work. Currently the
| hype is much greater than the reality. But it looks like we
| are starting to see some of the components of an AGI and
| that is cause for discussion of impact, but not panicked
| discussion. Even the chatbot customer service has to be
| trained on the domain. Still it is most useful in a few
| specific ways:
|
| Routing to the correct human support
|
| Providing FAQ level responses to the most common problems.
|
| Providing a second opinion to the human taking the call.
|
| So, even this most relevant domain for the technology
| doesn't eliminate human employment (because it's just not
| flexible or reliable enough yet).
| spencerchubb wrote:
| > Super exciting that OpenAI pushed the compute out this far
|
| it's even more exciting than that. the fact that you even _can_
| use more compute to get more intelligence is a breakthrough. if
| they spent even more on inference, would they get even better
| scores on arc agi?
| echelon wrote:
| Maybe it's not linear spend.
| lolinder wrote:
| > the fact that you even can use more compute to get more
| intelligence is a breakthrough.
|
| I'm not so sure--what they're doing by just throwing more
| tokens at it is similar to "solving" the traveling salesman
| problem by just throwing tons of compute into a breadth first
| search. Sure, you can get better and better answers the more
| compute you throw at it (with diminishing returns), but is
| that really that surprising to anyone who's been following
| tree of thought models?
|
| All it really seems to tell us is that the _type_ of model
| that OpenAI has available is capable of solving many of the
| _types_ of problems that ARC-AGI-PUB has set up given enough
| compute time. It says nothing about "intelligence" as the
| concept exists in most people's heads--it just means that a
| certain very artificial (and intentionally easy for humans)
| class of problem that wasn't computable is now computable if
| you're willing to pay an enormous sum to do it. A
| breakthrough of sorts, sure, but not a surprising one given
| what we've seen already.
| freehorse wrote:
| > I am interpreting this result as human level reasoning now
| costs (approximately) 41k/hr to 2.5M/hr with current compute.
|
| On a very simple, toy task, which arc-agi basically is. Arc-agi
| tests are not hard per se, just LLMs find them hard. We do not
| know how this scales for more complex, real world tasks.
| SamPatt wrote:
| Right. Arc is meant to test the ability of a model to
| generalize. It's neat to see it succeed, but it's not yet a
| guarantee that it can generalize when given other tasks.
|
| The other benchmarks are a good indication though.
| criddell wrote:
| Does it mean anything for more general tasks like driving a
| car?
| brookst wrote:
| Is every smart person a good driver?
| zarzavat wrote:
| Likely yes. Every smart person is capable of being a good
| driver, so long as you give them enough training and
| incentive. Zero smart people are born being able to
| drive.
| fragmede wrote:
| There are different kinds of smarts and not every smart
| person is good at all of them. Specifically, spatial
| reasoning is important for driving, and if a smart person
| is good at all kinds of thinking except that one, they're
| going to find it challenging to be a good driver.
| sethammons wrote:
| Says the technical founder and CTO of our startup who
| exited with 9 figures and who also has a severe lazy eye:
| you don't want me driving. He got pulled over for
| suspected DUI; totally clean, just can't drive straight.
| earth2mars wrote:
| That kind of proves the point that no matter how smart
| it gets, it may still lack several abilities that
| are crucial and come naturally to humans. Is it generalizing
| to any task, or to a specific set of tasks?
| madduci wrote:
| Let's see when this will be released to the free tier. Looks
| promising, although I hope they will also be able to publish
| more details on this, as part of the "open" in their name
| daxfohl wrote:
| I wonder if we'll start seeing a shift in compute spend, moving
| away from training time, and toward inference time instead. As
| we get closer to AGI, we probably reach some limit in terms of
| how smart the thing can get just training on existing docs or
| data or whatever. At some point it knows everything it'll ever
| know, no matter how much training compute you throw at it.
|
| To move beyond that, the thing has to start thinking for
| itself, some auto feedback loop, training itself on its own
| thoughts. Interestingly, this could plausibly be vastly more
| efficient than training on external data because it's a much
| tighter feedback loop and a smaller dataset. So it's possible
| that "nearly AGI" leads to ASI pretty quickly and efficiently.
|
| Of course it's also possible that the feedback loop, while
| efficient as a computation process, isn't efficient as a
| learning / reasoning / learning-how-to-reason process, and the
| thing, while as intelligent as a human, still barely competes
| with a worm in true reasoning ability.
|
| Interesting times.
| empiko wrote:
| I don't think this is only about efficiency. The model I have
| here is that this is similar to when we beat chess. Yes, it is
| impressive that we made progress on a class of problems, but is
| this class aligned with what the economy or the society needs?
|
| Simple turn-based games such as chess turned out to be too far
| away from anything practical and chess-engine-like programs
| were never that useful. It is entirely possible that this will
| end up in a similar situation. ARC-like pattern matching
| problems or programming challenges are indeed a respectable
| challenge for AI, but do we need a program that is able to
| solve them? How often does something like that come up really?
| I can see some time-saving in using AI vs StackOverflow in
| solving some programming challenges, but is there more to this?
| edanm wrote:
| I mostly agree with your analysis, but just to drive home a
| point here - I don't think that algorithms to beat Chess were
| ever seriously considered as something that would be relevant
| outside of the context of Chess itself. And obviously, within
| the world of Chess, they are major breakthroughs.
|
| In this case there is _more_ reason to think these things are
| relevant outside of the direct context - these tests were
| specifically designed to see if AI can do general-thinking
| tasks. The benchmarks might be _bad_ , but that's at least
| their purpose (unlike in Chess).
| cle wrote:
| Efficiency has always been the key.
|
| Fundamentally it's a search through some enormous state space.
| Advancements are "tricks" that let us find useful subsets more
| efficiently.
|
| Zooming way out, we have a bunch of social tricks, hardware
| tricks, and algorithmic tricks that have resulted in a super
| useful subset. It's not the subset that we want though, so the
| hunt continues.
|
| Hopefully it doesn't require revising too much in the hardware
| & social bag of tricks, those are lot more painful to
| revisit...
| chefandy wrote:
| I think the real key is figuring out how to turn the hand-wavy
| promises of this _making everything better_ into policy long
| fucking before we kick the door open. It's self-evident that
| this being efficient and useful would be a technological
| revolution; what's not self-evident is that it wouldn't
| benefit the large corporate entities that control it even more
| disproportionately than they do now, to the detriment of many
| other people.
| aithrowawaycomm wrote:
| I would like to see this repeated with my highly innovative HARC-
| HAGI, which is ARC-AGI but it uses hexagons instead of squares. I
| suspect humans would only make slightly more brain farts on HARC-
| HAGI than ARC-AGI, but O3 would fail very badly since it almost
| certainly has been specifically trained on squares.
|
| I am not really trying to downplay O3. But this would be a simple
| test as to whether O3 is truly "a system capable of adapting to
| tasks it has never encountered before" versus novel ARC-AGI tasks
| it hasn't encountered before.
| falcor84 wrote:
| Here's my take - even if the o3 as currently implemented is
| utterly useless on your HARC-HAGI, it is obvious that o3
| coupled with its existing training pipeline trained briefly on
| the hexagons would excel on it, such that passing your
| benchmark doesn't require any new technology.
|
| Taking this a level of abstraction higher, I expect that in the
| next couple of years we'll see systems like o3 given a runtime
| budget that they can use for training/fine-tuning smaller
| models in an ad-hoc manner.
| botro wrote:
| The LLM community has come up with tests they call 'Misguided
| Attention'[1] where they prompt the LLM with a slightly altered
| version of common riddles / tests etc. This often causes the LLM
| to fail.
|
| For example I used the prompt "As an astronaut in China, would I
| be able to see the great wall?" and since the training data for
| all LLMs is full of text dispelling the common myth that the
| great wall is visible from space, LLMs do not notice the slight
| variation that the astronaut is IN China. This has been a
| sobering reminder to me as discussion of AGI heats up.
|
| [1] https://github.com/cpldcpu/MisguidedAttention
| kizer wrote:
| It could be that it "assumed" you meant "from China"; in the
| higher level patterns it learns the imperfection of human
| writing and the approximate threshold at which mistakes are
| ignored vs addressed by training on conversations containing
| these types of mistakes; e.g. Reddit. This is just a thought.
| Try saying: As an astronaut in Chinese territory; or as an
| astronaut on Chinese soil. Another test would be to prompt it
| to interpret everything literally as written.
| whimsicalism wrote:
| We need to start making benchmarks for memory & continued
| processing of a task over multiple days, handoffs, etc. (i.e.
| 'agentic' behavior). Not sure how possible this is.
| slibhb wrote:
| Interesting about the cost:
|
| > Of course, such generality comes at a steep cost, and wouldn't
| quite be economical yet: you could pay a human to solve ARC-AGI
| tasks for roughly $5 per task (we know, we did that), while
| consuming mere cents in energy. Meanwhile o3 requires $17-20 per
| task in the low-compute mode.
| imranq wrote:
| Based on the chart, the Kaggle SOTA model is far more impressive.
| These O3 models are more expensive to run than just hiring a
| Mechanical Turk worker. It's nice we are proving out the scaling
| hypothesis further, it's just grossly inelegant.
|
| The Kaggle SOTA performs 2x as well as o1 high at a fraction of
| the cost
| cvhc wrote:
| I was going to say the same.
|
| I wonder what exactly o3 costs. Does it still spend a terrible
| amount of time thinking, despite being finetuned to the
| dataset?
| derac wrote:
| But does that Kaggle solution achieve human level perf with any
| level of compute? I think you're missing the forest for the
| trees here.
| tripletao wrote:
| The article says the ensemble of Kaggle solutions (aggregated
| in some unexplained way) achieves 81%. This is better than
| their average Mechanical Turk worker, but worse than their
| average STEM grad. It's better than tuned o3 with low
| compute, worse than tuned o3 with high compute.
|
| There's also a point on the figure marked "Kaggle SOTA",
| around 60%. I can't find any explanation for that, but I
| guess it's the best individual Kaggle solution.
|
| The Kaggle solutions would probably score higher with more
| compute, but nobody has any incentive to spend >$1M on
| approaches that obviously don't generalize. OpenAI did have
| this incentive to spend tuning and testing o3, since it's
| possible that will generalize to a practically useful domain
| (but not yet demonstrated). Even if it ultimately doesn't,
| they're getting spectacular publicity now from that promise.
| neuroelectron wrote:
| OpenAI spent approximately $1,503,077 to smash the SOTA on ARC-
| AGI with their new o3 model
|
| semi-private evals (100 tasks): 75.7% @ $2,012 total
| (~$20/task), with just 6 samples & 33M tokens processed in ~1.3
| min/task.
|
| The "low-efficiency" setting with 1024 samples scored 87.5% but
| required 172x more compute.
|
| If we assume compute spent and cost are proportional, then OpenAI
| might have just spent ~$346,064 for the low-efficiency run on the
| semi-private eval.
|
| On the public eval they might have spent ~$1,148,444 to achieve
| 91.5% with the low-efficiency setting (high-efficiency mode:
| $6,677).
|
| OpenAI just spent more money to run an eval on ARC than most
| people spend on a full training run.
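|
| For reference, the arithmetic behind those estimates, as a small
| sketch (it simply assumes spend scales linearly with the 172x
| compute multiplier, at the posted retail figures):
|
|     high_eff_semi_private = 2_012   # USD, 100 semi-private tasks
|     high_eff_public       = 6_677   # USD, public eval
|     compute_multiplier    = 172     # low- vs high-efficiency
|
|     low_eff_semi_private = high_eff_semi_private * compute_multiplier
|     low_eff_public       = high_eff_public * compute_multiplier
|
|     print(f"~${low_eff_semi_private:,}")   # ~$346,064
|     print(f"~${low_eff_public:,}")         # ~$1,148,444
|     total = (high_eff_semi_private + high_eff_public
|              + low_eff_semi_private + low_eff_public)
|     print(f"~${total:,}")                  # ~$1.5M for all runs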
| rfoo wrote:
| Pretty sure this "cost" is based on their retail price instead
| of actual inference cost.
| neuroelectron wrote:
| Yes, that's correct, and there's a bit of "pixel math" as well,
| so take these numbers with a pinch of salt. Preliminary model
| sizes from the temporarily public HF repository put the full
| model size at 8 TB, or roughly 80 H100s.
| az226 wrote:
| I thought that was a fake.
| neuroelectron wrote:
| I didn't hear that but it could be. But it doesn't matter
| really because there's so much more to consider in the
| cost, R&D, including all the supporting functions of a
| model like censorship and data capture and so on.
| ec109685 wrote:
| Yeah and can run off peak, etc.
|
| Does seem to show an absolutely massive market for inference
| compute...
| bluecoconut wrote:
| By my estimates, for this single benchmark, this is comparable
| cost to training a ~70B model from scratch today. Literally
| from 0 to a GPT-3 scale model for the compute they ran on 100
| ARC tasks.
|
| I double-checked with some FLOP estimates (P100 for 12 hours =
| Kaggle limit, they claim ~100-1000x for O3-low, and 172x for
| O3-high), so roughly on the order of 10^22-10^23 FLOPs.
|
| Put another way, using an H100 market price of ~$2/hour per chip
| -> at $350k, that's ~175k GPU-hours. Or ~10^24 FLOPs in total.
|
| So, huge margin, but 10^22 - 10^24 flop is the band I think we
| can estimate.
|
| These are the scale of numbers that show up in the chinchilla
| optimal paper, haha. Truly GPT-3 scale models.
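|
| A minimal sketch of that back-of-envelope calculation (the
| sustained H100 throughput here is an assumed round number, not
| a measured value):
|
|     total_cost_usd   = 350_000
|     usd_per_gpu_hour = 2.0
|     h100_flops_per_s = 1e15      # assumed ~1 PFLOP/s sustained
|
|     gpu_hours   = total_cost_usd / usd_per_gpu_hour   # 175,000
|     total_flops = gpu_hours * 3600 * h100_flops_per_s # ~6e23
|
|     print(f"{gpu_hours:,.0f} GPU-hours, ~{total_flops:.1e} FLOPs")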
| rvnx wrote:
| It sounds like they essentially brute-forced the solutions?
| Ask the LLM for an answer, ask the LLM to verify the answer. Ask
| the LLM for an answer, ask the LLM to verify the answer. Add a
| bit of randomness. Ask the LLM for an answer, ask the LLM to
| verify the answer. Add a bit of randomness. Repeat 5B times
| (this is what the paper says).
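|
| Schematically, the loop being described looks something like this
| sketch (ask_llm and verify_with_llm are hypothetical stand-ins for
| model calls; nothing here is OpenAI's actual method):
|
|     import random
|     from collections import Counter
|     from typing import Optional
|
|     def ask_llm(task: str, temperature: float) -> str:
|         raise NotImplementedError  # stand-in for a model call
|
|     def verify_with_llm(task: str, answer: str) -> bool:
|         raise NotImplementedError  # stand-in for a model call
|
|     def solve(task: str, samples: int = 1024) -> Optional[str]:
|         verified = []
|         for _ in range(samples):
|             # "add a bit of randomness": vary the temperature
|             answer = ask_llm(task, random.uniform(0.5, 1.0))
|             if verify_with_llm(task, answer):
|                 verified.append(answer)
|         if not verified:
|             return None
|         # keep the candidate the verifier accepted most often
|         return Counter(verified).most_common(1)[0][0]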
| ramesh31 wrote:
| >OpenAI just spent more money to run an eval on ARC than most
| people spend on a full training run.
|
| Of course, this is just the scaling law holding true. More is
| more when it comes to LLMs as far as we've seen. Now it's just
| on the hardware side to make this feasible economically.
| sys32768 wrote:
| So in a few years, coders will be as relevant as cuneiform
| scribes.
| HarHarVeryFunny wrote:
| I've never seen a company looking for a "coder", anymore than
| they look to hire spreadsheet creators or powerpoint
| specialists. A software developer can code, but being able to
| code doesn't make you a software developer, anymore than being
| able to create a powerpoint makes you a manager (although in
| some companies it might do, so maybe bad example!).
| devoutsalsa wrote:
| When the source code for these LLMs gets leaked, I expect to see:
|     def letter_count(string, letter):
|         if string == "strawberry" and letter == "r":
|             return 3
|         ...
| knbknb wrote:
| In one of their release videos for the o1-preview model they
| _admitted_ that it's hardcoded in.
| mukunda_johnson wrote:
| Honestly I'm concerned how hacked up o3 is to secure a high
| benchmark score.
| phil917 wrote:
| Direct quote from the ARC-AGI blog:
|
| "SO IS IT AGI?
|
| ARC-AGI serves as a critical benchmark for detecting such
| breakthroughs, highlighting generalization power in a way that
| saturated or less demanding benchmarks cannot. However, it is
| important to note that ARC-AGI is not an acid test for AGI - as
| we've repeated dozens of times this year. It's a research tool
| designed to focus attention on the most challenging unsolved
| problems in AI, a role it has fulfilled well over the past five
| years.
|
| Passing ARC-AGI does not equate achieving AGI, and, as a matter
| of fact, I don't think o3 is AGI yet. o3 still fails on some very
| easy tasks, indicating fundamental differences with human
| intelligence.
|
| Furthermore, early data points suggest that the upcoming ARC-
| AGI-2 benchmark will still pose a significant challenge to o3,
| potentially reducing its score to under 30% even at high compute
| (while a smart human would still be able to score over 95% with
| no training). This demonstrates the continued possibility of
| creating challenging, unsaturated benchmarks without having to
| rely on expert domain knowledge. You'll know AGI is here when the
| exercise of creating tasks that are easy for regular humans but
| hard for AI becomes simply impossible."
|
| The high compute variant sounds like it cost around *$350,000*,
| which is kinda wild. Lol the blog post specifically mentioned how
| OpenAI asked ARC-AGI to not disclose the exact cost for the high
| compute version.
|
| Also, 1 odd thing I noticed is that the graph in their blog post
| shows the top 2 scores as "tuned" (this was not displayed in the
| live demo graph). This suggest in those cases that the model was
| trained to better handle these types of questions, so I do wonder
| about data / answer contamination in those cases...
| Bjorkbat wrote:
| > Also, 1 odd thing I noticed is that the graph in their blog
| post shows the top 2 scores as "tuned"
|
| Something I missed until I scrolled back to the top and reread
| the page was this
|
| > OpenAI's new o3 system - trained on the ARC-AGI-1 Public
| Training set
|
| So yeah, the results were specifically from a version of o3
| trained on the public training set
|
| Which on the one hand I think is a completely fair thing to do.
| It's reasonable that you should teach your AI the rules of the
| game, so to speak. There really aren't any spoken rules though,
| just pattern observation. Thus, if you want to teach the AI how
| to play the game, you must train it.
|
| On the other hand though, I don't think the o1 models or
| Claude were trained on the dataset, in which case it isn't a
| completely fair competition. If I had to guess, you could
| probably get 60% on o1 if you trained it on the public dataset
| as well.
| skepticATX wrote:
| Great catch. Super disappointing that AI companies continue
| to do things like this. It's a great result either way but
| predictably the excitement is focused on the jump from o1,
| which is now in question.
| Bjorkbat wrote:
| To me it's very frustrating because such little caveats
| make benchmarks less reliable. Implicitly, benchmarks are
| no different from tests in that someone/something who
| scores high on a benchmark/test _should_ be able to
| generalize that knowledge out into the real world.
|
| While that is true with humans taking tests, it's not
| really true with AIs evaluating on benchmarks.
|
| SWE-bench is a great example. Claude Sonnet can get
| something like a 50% on verified, whereas I think I might
| be able to score a 20-25%? So, Claude is a better
| programmer than me.
|
| Except that isn't really true. Claude can still make a lot
| of clumsy mistakes. I wouldn't even say these are junior
| engineer mistakes. I've used it for creative programming
| tasks and have found one example where it tried to use a
| library written for d3js for a p5js programming example.
| The confusion is kind of understandable, but it's also a
| really dumb mistake.
|
| Some very simple explanations, the models were probably
| overfitted to a degree on Python given its popularity in
| AI/ML work, and SWE-bench is all Python. Also, the
| underlying Github issues are quite old, so they probably
| contaminated the training data and the models have simply
| memorized the answers.
|
| Or maybe benchmarks are just bad at measuring intelligence
| in general.
|
| Regardless, every time a model beats a benchmark I'm
| annoyed by the fact that I have no clue whatsoever how much
| this actually translates into real world performance. Did
| OpenAI/Anthropic/Google actually create something that will
| automate wide swathes of the software engineering
| profession? Or did they create the world's most
| knowledgeable junior engineer?
| throwaway0123_5 wrote:
| > Some very simple explanations, the models were probably
| overfitted to a degree on Python given its popularity in
| AI/ML work, and SWE-bench is all Python. Also, the
| underlying Github issues are quite old, so they probably
| contaminated the training data and the models have simply
| memorized the answers.
|
| My understanding is that it works by checking if the
| proposed solution passes test-cases included in the
| original (human) PR. This seems to present some problems
| too, because there are surely ways to write code that
| passes the tests but would fail human review for one
| reason or another. It would be interesting to not only
| see the pass rate but also the rate at which the proposed
| solutions are preferred to the original ones (preferably
| evaluated by a human but even an LLM comparing the two
| solutions would be interesting).
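|
| For reference, a minimal sketch of that pass/fail style of
| grading, assuming a hypothetical repo checkout and a
| FAIL_TO_PASS-style list of test ids (this is not the official
| SWE-bench harness):
|
|     import subprocess
|
|     def patch_passes_tests(repo_dir, patch_file, test_ids):
|         # Apply the model-generated patch to the checkout.
|         subprocess.run(["git", "apply", patch_file],
|                        cwd=repo_dir, check=True)
|         # Run only the tests tied to the original issue/PR.
|         result = subprocess.run(
|             ["python", "-m", "pytest", *test_ids], cwd=repo_dir)
|         # Passing here says nothing about review quality.
|         return result.returncode == 0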
| Bjorkbat wrote:
| If I recall correctly the authors of the benchmark did
| mention on Twitter that for certain issues models will
| submit an answer that technically passes the test but is
| kind of questionable, so yeah, good point.
| phil917 wrote:
| Lol I missed that even though it's literally the first
| sentence of the blog, good catch.
|
| Yeah, that makes this result a lot less impressive for me.
| hartator wrote:
| > acid test
|
| The css acid test? This can be gamed too.
| sundarurfriend wrote:
| https://en.wikipedia.org/wiki/Acid_test:
|
| > An acid test is a qualitative chemical or metallurgical
| assay utilizing acid. Historically, it often involved the use
| of a robust acid to distinguish gold from base metals.
| Figuratively, the term represents any definitive test for
| attributes, such as gauging a person's character or
| evaluating a product's performance.
|
| Specifically here, they're using the figurative sense of
| "definitive test".
| airstrike wrote:
| also a "litmus test" but I guess that's a different
| chemistry test...
| parsimo2010 wrote:
| I really like that they include reference levels for an average
| STEM grad and an average worker for Mechanical Turk. So for $350k
| worth of compute you can have slightly better performance than a
| menial wage worker, but slightly worse performance than a college
| grad. Right now humans win on value, but AI is catching up.
| nextworddev wrote:
| Well, just 8 months ago that cost was near infinity. So if it
| came down to $350k, that's a massive drop.
| nxobject wrote:
| As an aside, I'm a little miffed that the benchmark calls out
| "AGI" in the name, but then heavily cautions that it's necessary
| but insufficient for AGI.
|
| > ARC-AGI serves as a critical benchmark for detecting such
| breakthroughs, highlighting generalization power in a way that
| saturated or less demanding benchmarks cannot. However, it is
| important to note that ARC-AGI is not an acid test for AGI
| mmcnl wrote:
| I immediately thought so too. Why confuse everyone?
| ec109685 wrote:
| Because ARC somehow convinced people that solving it was an
| indicator of AGI.
| Jensson wrote:
| It's like the "Open" in OpenAI or the "Democratic" in North
| Korea's DPRK. Naming things helps fool a lot of people.
| EthanHeilman wrote:
| It is a necessary but not sufficient condition to AGI.
| notRobot wrote:
| Humans can take the test here to see what the questions are like:
| https://arcprize.org/play
| Balgair wrote:
| Complete aside here: I used to do work with amputees and
| prosthetics. There is a standardized test (and I just cannot
| remember the name) that fits in a briefcase. It's used for
| measuring the level of damage to the upper limbs and for
| prosthetic grading.
|
| Basically, it's got the dumbest and simplest things in it. Stuff
| like a lock and key, a glass of water and jug, common units of
| currency, a zipper, etc. It tests if you can do any of those
| common human tasks. Like pouring a glass of water, picking up
| coins from a flat surface (I chew off my nails so even an able
| person like me fails that), zip up a jacket, lock your own door,
| put on lipstick, etc.
|
| We had hand prosthetics that could play Mozart at 5x speed on a
| baby grand, but could not pick up a silver dollar or zip a jacket
| even a little bit. To the patients, the hands were therefore
| about as useful as a metal hook (a common solution with amputees
| today, not just pirates!).
|
| Again, a total aside here, but your comment just reminded me of
| that brown briefcase. Life, it turns out, is a lot more complex
| than we give it credit for. Even pouring the OJ can be, in rare
| cases, transcendent.
| m463 wrote:
| It would be interesting to see trick questions.
|
| Like in your test
|
| a hand grenade and a pin - don't pull the pin.
|
| Or maybe a mousetrap? but maybe that would be defused?
|
| in the ai test...
|
| or Global Thermonuclear War, the only winning move is...
| sdenton4 wrote:
| to move first!
| m463 wrote:
| oh crap. lol!
| HPsquared wrote:
| Gaming streams being in the training data, it might pull the
| pin because "that's what you do".
| 8note wrote:
| or, because it has to give an output, and pulling the pin
| is the only option
| TeMPOraL wrote:
| There's also the option of not pulling the pin, and
| shooting your enemies as they instinctively run from what
| they think is a live grenade. Saw it on a TV show the
| other day.
| ubj wrote:
| There's a lot of truth in this. I sometimes joke that robot
| benchmarks should focus on common household chores. Given a
| basket of mixed laundry, sort and fold everything into
| organized piles. Load a dishwasher given a sink and counters
| overflowing with dishes piled up haphazardly. Clean a bedroom
| that kids have trashed. We do these tasks almost without
| thinking, but the unstructured nature presents challenges for
| robots.
| Balgair wrote:
| I maintain that whoever invents a robust laundry _folding_
| robot will be a trillionaire. In that, I dump jumbled clean
| clothes straight from a dryer at it and out comes folded and
| sorted clothes (and those loner socks). I know we're getting
| close, but I also know we're not there yet.
| oblio wrote:
| Laundry folding and laundry ironing, I would say.
| musicale wrote:
| Hopefully will detect whether a small child is inside or
| not.
| imafish wrote:
| > I maintain that whoever invents a robust laundry folding
| robot will be a trillionaire
|
| ... so Elon Musk? :D
| jessekv wrote:
| I want it to lay out an outfit every day too. Hopefully
| without hallucination.
| stefs wrote:
| it's not hallucination, it's high fashion
| tanseydavid wrote:
| Yes, but the stupid robot laid out your Thursday-black-
| Turtleneck for you on Saturday morning. That just won't
| suffice.
| yongjik wrote:
| I can live without folding laundry (I can just shove my
| undershirts in the closet, who cares if it's not folded),
| but whoever manufactures a reliable auto-loading dishwasher
| will have my dollars. Like, just put all your dishes in the
| sink and let the machine handle them.
| Brybry wrote:
| But if your dishwasher is empty it takes nearly the same
| amount of time/effort to put dishes straight into the
| dishwasher as it does to put them in the sink.
|
| I think I'd only really save time by having a robot that
| could unload my dishwasher and put up the clean dishes.
| namibj wrote:
| That's called a second dishwasher: one is for taking out,
| the other for putting in. When the latter is full, turn
| it on, dirty dishes wait outside until the cycle
| finishes, when the dishwashers switch roles.
| ptsneves wrote:
| I thought about this and it gets even better. You do not
| really need shelves as you just use the clean dishwasher
| as the storage place. I honestly don't know why this is
| not a thing in big or wealthy homes.
| jannyfer wrote:
| Another thing that bothers me is that dishwashers are
| low. As I get older, I'm finding it really annoying to
| bend down.
|
| So get me a counter-level dishwasher cabinet and I'll be
| happy!
| oangemangut wrote:
| We have a double drawer dishwasher and it hurts my brain
| watching friends plan around their nightly wash.
| yongjik wrote:
| Hmm, that doesn't match my experience. It takes me a lot
| more time to put dishes into the dishwasher, because it
| has different places for cutlery, bowls, dishes, and so
| on, and of course the existing structure never matches my
| bowls' size perfectly so I have to play tetris or run it
| with only 2/3 filled (which will cause me to waste more
| time as I have to do dishes again sooner).
|
| And that's before we get to bits of sticky rice left on
| bowls, which somehow dishwashers never scrape off clean.
| YMMV.
| HPsquared wrote:
| 1. Get a set of dishes that does fit nicely together in
| the dishwasher.
|
| 2. Start with a cold prewash, preferably with a little
| powder in there too. This massively helps with stubborn
| stuff. This one is annoying though because you might have
| to come back and switch it on after the prewash. A good
| job for the robot butler.
| nradov wrote:
| There is the Foldimate robot. I don't know how well it
| works. It doesn't seem to pair up socks. (Deleted the web
| link, it might not be legitimate.)
| smokel wrote:
| Beware, this website is probably a scam.
|
| Foldimate went bankrupt in 2021 [1], and the domain
| redirect from foldimate.com to a 404 page at miele.com
| suggests that it was Miele who bought up the remains, not
| a sketchy company with a ".website" top-level domain.
|
| [1] https://en.wikipedia.org/wiki/FoldiMate
| smokel wrote:
| We are certainly getting close! In 2010, watching PR2 fold
| some unseen towels was similar to watching paint dry [1],
| but we can now enjoy robots attaining lazy-student-level
| laundry folding in real time, as demonstrated by p0 [2].
|
| [1] https://www.youtube.com/watch?v=gy5g33S0Gzo
|
| [2] https://www.physicalintelligence.company/blog/pi0
| sss111 wrote:
| Honestly, a robot that can hang jumbled clean clothes
| instead of folding them would be good enough, it's crazy
| how we don't even have those.
| dweekly wrote:
| I was a believer in Gal's FoldiMate but sadly it...folded.
|
| https://en.m.wikipedia.org/wiki/FoldiMate
| blargey wrote:
| At this point I'm not sure we'll actually get a task-
| specific machine for laundry folding/sorting before
| humanoid robots gain the capability to do it well enough.
| zamalek wrote:
| Slightly tangential, we already have amazing laundry robots.
| They are called washing and drying machines. We don't give
| these marvels enough credit, mostly because they aren't
| shaped like humans.
|
| Humanoid robots are mostly a waste of time. Task-shaped
| robots are _much_ easier to design, build, and maintain...
| and are more reliable. Some of the things you mention might
| need humanoid versatility (loading the dishwasher), others
| would be far better served by purpose-built robots (laundry
| sorting).
| jkaptur wrote:
| I'm embarrassed to say that I spent a few moments
| daydreaming about a robot that could wash my dishes. Then I
| thought about what to call it...
| musicale wrote:
| Sadly current "dishwasher" models are neither self-
| loading nor unloading. (Seems like they should be able to
| take a tray of dishes, sort them, load them, and stack
| them after cleaning.)
|
| Maybe "busbot" or "scullerybot".
| vidarh wrote:
| The problem is more doing it in sufficiently little
| space, and using little enough water and energy. Doing
| one that you just feed dishes individually and that
| immediately washes them and feeds them to storage should be
| entirely viable, but it'd be wasteful, and it'd compete
| with people having multiple small drawer-style
| dishwashers, offering relatively little convenience over
| that.
|
| It seems most people aren't willing to pay for multiple
| dishwashers - even multiple small ones or set aside
| enough space, and that places severe constraints on
| trying to do better.
| wsintra2022 wrote:
| Was it a dishwasher? Just give it all your unclean dishes
| and tell it to go, come back an hour later and they're all
| washed and mostly dried!
| rytis wrote:
| I agree. I don't know where this obsession comes from.
| Obsession with resembling as close to humans as possible.
| We're so far from being perfect. If you need proof just
| look at your teeth. Yes, we're relatively universal, but a
| screwdriver is more efficient at driving in screws than our
| fingers. So please, stop wasting time building perfect
| universal robots; build more purpose-built ones.
| Nevermark wrote:
| Given we have shaped so many tasks to fit our bodies, a bot
| able to do a variety/majority of human tasks the human way
| will remain valuable for a long time to come.
|
| 1000 machines specialized for 1000 tasks are great, but
| don't deliver the same value as a single bot that can
| interchange with people flexibly.
|
| Costly today, but won't be forever.
| golol wrote:
| The shape doesn't matter! Non-humanoid shapes give minor
| advantages on specific tasks but for a general robot
| you'll have a hard time finding a shape much more optimal
| than humanoid. And if you go with humanoid you have so
| much data available! Videos contain the information of
| which movements a robot should execute. Teleoperation is
| easy. This is the bitter lesson! The shape doesn't
| matter, any shape will work with the right architecture,
| data and training!
| rowanG077 wrote:
| Purpose-built robots are basically solved. Dishwashers,
| laundry machines, assembly robots, etc. the moat is a
| general purpose robot that can do what a human can do.
| graemep wrote:
| Great examples. They are simple, reliable, efficient and
| effective. Far better than blindly copying what a human
| being does. Maybe there are equally clever ways of doing
| things like folding clothes.
| Geee wrote:
| There isn't a "task-shaped" robot for unstructured and
| complex manipulation, other than high DoF arms with vision
| and neural nets. For example, a machine which can cook food
| would be best solved with two robotic arms. However, these
| stationary arms would be wasted if they were just idling
| most of the time. So, you add locomotion and dynamic
| balancing with legs. And now these two arms can be used in
| 1000 different tasks, which makes them 1000x more valuable.
|
| So, not only is the human form the only solution for many
| tasks, it's also a much cheaper solution considering the
| idle time of task-specific robots. You would need only a
| single humanoid robot for all tasks, instead of buying a
| different machine for each task. And instead of having to
| design and build a new machine for each task, you'll need
| to just download new software for each task.
| ecshafer wrote:
| I had a pretty bad case of tendinitis once, that basically made
| my thumb useless since using it would cause extreme pain. That
| test seems really good. I could use a computer keyboard without
| any issue, but putting a belt on or pouring water was
| impossible.
| vidarh wrote:
| I had a swollen elbow a short while ago, and the amount of
| things I've never thought about that were affected by reduced
| elbow joint mobility and an inability to put pressure on the
| elbow was disturbing.
| CooCooCaCha wrote:
| That's why the goal isn't just benchmark scores, it's
| _reliable_ and robust intelligence.
|
| In that sense, the goalposts haven't moved in a long time
| despite claims from AI enthusiasts that people are constantly
| moving goalposts.
| croemer wrote:
| > We had hand prosthetics that could play Mozart at 5x speed on
| a baby grand, but could not pick up a silver dollar or zip a
| jacket even a little bit.
|
| I must be missing something, how can they be able to play
| Mozart at 5x speed with their prosthetics but not zip a jacket?
| They could press keys but not do tasks requiring feedback?
|
| Or did you mean they used to play Mozart at 5x speed before
| they became amputees?
| rahimnathwani wrote:
| Imagine a prosthetic 'hand' that has 5 regular fingers,
| rather than 4 fingers and a thumb. It would be able to play a
| piano just fine, but be unable to grasp anything small, like
| a zipper.
| numpad0 wrote:
| Thumb not opposable?
| 8note wrote:
| zipping up a jacket is really hard to do, and requires very
| precise movements and coordination between hands.
|
| playing mozart is much more forgiving in terms of the number
| of different motions you have to make in different
| directions, the amount of pressure to apply, and even the
| black keys are much bigger than large sized zipper tongues.
| Balgair wrote:
| Pretty much. The issue with zippers is that the fabric
| moves about in unpredictable ways. Piano playing was just
| movement programs. Zipping required (surprisingly) fast
| feedback. Also, gripping is somewhat tough compared to
| pressing.
| ben_w wrote:
| Playing a piano involves pushing down on the right keys with
| the right force at the right time, but that could be pre-
| programmed well before computers. The self-playing piano in
| the saloon in Westworld wasn't a _huge_ anachronism, such
| things slightly overlapped with the Wild West era:
| https://en.wikipedia.org/wiki/Player_piano
|
| Picking up a 1mm thick metal disk from a flat surface
| requires the user gives the exact time, place, and force, and
| I'm not even sure what considerations it needs for surface
| materials (e.g. slightly squishy fake skin) and/or tip shapes
| (e.g. fake nails).
| numpad0 wrote:
| > Picking up a 1mm thick metal disk from a flat surface
| requires the user gives the exact time, place, and force
|
| place sure but can't you cheat a bit for time and force
| with compliance("impedance control")?
| ben_w wrote:
| In theory, apparently not in practice.
| oblio wrote:
| I'm far from a piano player, but I can definitely press piano
| keys quite quickly, while zipping up my jacket when it's
| cold and/or wet outside is really difficult.
|
| Even more so for picking up coins from a flat surface.
|
| For robotics, it's kind of obvious, speed is rarely an issue,
| so the "5x" part is almost trivial. And you can program the
| sequence quite easily, so that's also doable. Piano keys are
| big and obvious and an ergonomically designed interface meant
| to be relatively easy to press, ergo easy even for a
| prosthetic. A small coin on a flat surface is far from
| ergonomic.
| croemer wrote:
| But how do you deliberately control those fingers to
| actually play yourself what you have in mind rather than
| something preprogrammed? Surely the idea of a prosthetic
| does not just mean "a robot that is connected to your
| body", but something that the owner controls with their mind.
| vidarh wrote:
| Nobody said anything about deliberately controlling those
| fingers to play yourself. Clearly it's not something you
| do for the sake of the enjoyment of playing, but more
| likely a demonstration of the dexterity of the prosthesis
| and ability to program it for complex tasks.
|
| The idea of a prosthesis is to help you regain
| functionality. If the best way of doing that is through
| automation, then it'd make little sense not to.
| yongjik wrote:
| I play piano as a hobby, and the funny thing is, if my
| hands are so cold that I can't zip up my jacket, there's no
| way I can play anything well. I know it's not quite zipping
| up jackets ;) but a human playing the piano does require a
| fast feedback loop.
| n144q wrote:
| Well, you see, while the original comment says they could
| play at 5x speed, it does not say they played at that speed
| _well_ or beautifully. Any teacher or any student who
| learned piano for a while will tell you that this matters a
| lot, especially for classical music -- being able to
| accurately play at an even tempo with the correct dynamics
| and articulation is hard and is what differentiates a
| beginner/intermediate player from an advanced one. In fact,
| one mistake many students make is playing a piece too fast
| when they are not ready, and teachers really want students to
| practice very slowly.
|
| My point is -- being able to zip a jacket is all about those
| subtle actions, and could actually be harder than "just"
| playing piano fast.
| alexose wrote:
| It feels like there's a whole class of information that is
| easily shorthanded, but really hard to explain to novices.
|
| I think a lot about carpentry. From the outside, it's pretty
| easy: Just make the wood into the right shape and stick it
| together. But as one progresses, the intricacies become more
| apparent. Variations in the wood, the direction of the grain,
| the seasonal variations in thickness, joinery techniques that
| are durable but also time efficient.
|
| The way this information connects is highly multisensory and
| multimodal. I now know which species of wood to use for which
| applications. This knowledge was hard won through many, many
| mistakes and trials that took place at my home, the hardware
| store, the lumberyard, on YouTube, from my neighbor Steve, and
| in books written by experts.
| Method-X wrote:
| Was it the Southampton hand assessment procedure?
| Balgair wrote:
| Yes! Thank you!
|
| https://www.shap.ecs.soton.ac.uk/
| oblio wrote:
| This was actually discovered quite early on in the history of
| AI:
|
| > Rodney Brooks explains that, according to early AI research,
| intelligence was "best characterized as the things that highly
| educated male scientists found challenging", such as chess,
| symbolic integration, proving mathematical theorems and solving
| complicated word algebra problems. "The things that children of
| four or five years could do effortlessly, such as visually
| distinguishing between a coffee cup and a chair, or walking
| around on two legs, or finding their way from their bedroom to
| the living room were not thought of as activities requiring
| intelligence."
|
| https://en.wikipedia.org/wiki/Moravec%27s_paradox
| bawolff wrote:
| I don't know why people always feel the need to gender these
| things. Highly educated female scientists generally find the
| same things challenging.
| robocat wrote:
| I don't know why anyone would blame people as though
| someone is making an explicit choice. I find your choice of
| words to be insulting to the OP.
|
| We learn our language and stereotypes subconsciously from
| our society, and it is no easy thing to fight against that.
| Barrin92 wrote:
| >I don't know why people always feel the need to gender
| these things
|
| Because it's relevant to the point being made, i.e. that
| these tests reflect the biases and interests of the people
| who make them. This is true not just for AI tests, but
| intelligence tests applied to humans. That Demis Hassabis, a
| chess player and video game designer, decided to test his
| machine on video games, Go and chess probably is not an
| accident.
|
| The more interesting question is why people respond so
| apprehensively to pointing out a very obvious problem and
| bias in test design.
| bawolff wrote:
| > i.e. that these tests reflect the biases and interests
| of the people who make them
|
| Of course. However, I believe we can't move past that
| without being honest about where these biases are coming
| from. Many things in our world are the result of gender
| bias, both subtle and overt. However, at least at first
| glance, this does not appear to be one of them, and
| statements like the grandparent's quote serve to
| perpetuate such biases further.
| oblio wrote:
| It's a quote from the 80s from the original author (who
| is a man...)...
|
| Thank you for virtue signalling, though.
| bawolff wrote:
| > It's a quote from the 80s from the original author (who
| is a man...)...
|
| Yes, that was pretty clear in the original comment (?)
| oblio wrote:
| Then remove the parts that offend your modern
| sensibilities and focus on the essence.
|
| He was right. Scientists were focusing on the "science-y"
| bits and completely missed the elephant in the room, that
| the things a toddler already masters are the monster
| challenge for AI right now, before we even get into
| "meaning of life" type stuff.
| drdrey wrote:
| I think assembling Legos would be a cool robot benchmark: you
| need to parse the instructions, locate the pieces you need,
| pick them up, orient them, snap them to your current assembly,
| visually check if you achieved the desired state, repeat
| serpix wrote:
| I agree. Watching my toddler daughter build with small legos
| makes me understand how incredible fine motor skills are as
| even with small fingers some of the blocks are just too hard
| to snap together.
| throwup238 wrote:
| This is expressed in AI research as Moravec's paradox:
| https://en.wikipedia.org/wiki/Moravec%27s_paradox
|
| Getting to LLMs that could talk to us turned out to be a lot
| easier than making something that could control even a robotic
| arm without precise programming, let alone a humanoid.
| MarcelOlsz wrote:
| >We had hand prosthetics that could play Mozart at 5x speed on
| a baby grand
|
| I'd love to know more about this.
| xnx wrote:
| Despite lack of fearsome teeth or claws, humans are _way_ OP
| due to brain, hand dexterity, and balance.
| dang wrote:
| We detached this subthread from
| https://news.ycombinator.com/item?id=42473419
|
| (nothing wrong with it! I'm just trying to prune the top
| subthread)
| spyckie2 wrote:
| The more Hacker News worthy discussion is the part where the
| author talks about search through the possible mini-program space
| of LLMs.
|
| It makes sense because tree search can be endlessly optimized. In
| a sense, LLMs turn the unstructured, open system of general
| problems into a structured, closed system of possible moves.
| Which is really cool, IMO.
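|
| A rough sketch of what that kind of search could look like,
| with llm_propose_program and run_program as hypothetical
| stand-ins for the model sampler and a program interpreter:
|
|     def search_programs(train_pairs, test_input,
|                         llm_propose_program, run_program,
|                         n_candidates=1000):
|         # Sample candidate "moves" (programs) from the LLM and
|         # keep only those consistent with every training pair.
|         for _ in range(n_candidates):
|             prog = llm_propose_program(train_pairs)
|             if all(run_program(prog, x) == y
|                    for x, y in train_pairs):
|                 return run_program(prog, test_input)
|         return None  # no consistent program found in budget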
| glup wrote:
| Yes! This seems to be a really neat combination of 2010's
| Bayesian cleverness / Tenenbaumian program search approaches
| with the LLMs as merely sources of high-dim conditional
| distributions. I knew people were experimenting in this space
| (like https://escholarship.org/uc/item/7018f2ss) but didn't
| know it did so well wrt these new benchmarks.
| binarymax wrote:
| All those saying "AGI", read the article and especially the
| section "So is it AGI?"
| skizm wrote:
| This might sound dumb, and I'm not sure how to phrase this, but
| is there a way to measure the raw model output quality without
| all the more "traditional" engineering work (mountain of `if`
| statements I assume) done on top of the output? And if so, would
| that be a better measure of when scaling up the input data will
| start showing diminishing returns?
|
| (I know very little about the guts of LLMs or how they're tested,
| so the distinction between "raw" output and the more
| deterministic engineering work might be incorrect)
| whimsicalism wrote:
| what do you mean by the mountain of if-statements on top of the
| output? like checking if the output matches the expected result
| in evaluations?
| skizm wrote:
| Like when you type something into the chat gpt app _I am
| guessing_ it will start by preprocessing your input, doing
| some sanity checks, making sure it doesn't say "how do I
| build a bomb?" or whatever. It may or may not alter /clean up
| your input before sending it to the model for processing.
| Once processed, there's probably dozens of services it goes
| through to detect if the output is racist, somehow actually
| contained a bomb recipe, or maybe copyrighted material, normal
| pattern matching stuff, maybe some advanced stuff like
| sentiment analysis to see if the output is bad mouthing Trump
| or something, and it might either alter the output or simply
| try again.
|
| I'm wondering, when you strip out all that "extra" non-model
| pre- and post-processing, if there's some way to measure the
| performance of that.
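|
| Purely to illustrate the distinction, a sketch of such a
| hypothetical wrapper - raw_model, input_filter and
| output_filter are made-up stand-ins, not anything OpenAI has
| documented:
|
|     def guarded_answer(prompt, raw_model, input_filter,
|                        output_filter, max_retries=2):
|         # Pre-processing: refuse disallowed prompts up front.
|         if not input_filter(prompt):
|             return "Sorry, I can't help with that."
|         for _ in range(max_retries + 1):
|             draft = raw_model(prompt)  # the "raw" output
|             # Post-processing: policy checks, retry if bad.
|             if output_filter(draft):
|                 return draft
|         return "Sorry, I can't help with that."
|
| Measuring the "raw" part would mean benchmarking raw_model
| directly, without the filters.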
| whimsicalism wrote:
| oh, no - but most queries aren't being filtered by
| supervisor models nowadays anyways.. most of the refusal is
| baked in
| Seattle3503 wrote:
| How can there be "private" tasks when you have to use the OpenAI
| API to run queries? OpenAI sees everything.
| nmca wrote:
| We worked with ARC to run inference on the semi-private tasks
| last week, after o3 was trained, using an inference only API
| that was sent the prompts but not the answers & did no durable
| logging.
| idontknowmuch wrote:
| What's your opinion on the veracity of this benchmark - given
| o3 was fine-tuned and others were not? Can you give more
| details on how much data was used to fine-tune o3? It's hard
| to put this into perspective given this confounder.
| nmca wrote:
| I can't provide more information than is currently public,
| but from the ARC post you'll note that we trained on about
| 75% of the train set (which contains 400 examples total);
| which is within the ARC rules, and evaluated on the
| semiprivate set.
| tmaly wrote:
| Just curious, I know o1 is a model OpenAI offers. I have never
| heard of the o3 model. How does it differ from o1?
| roboboffin wrote:
| Interesting that in the video, there is an admission that they
| have been targeting this benchmark. A comment that was quickly
| shut down by Sam.
|
| A bit puzzling to me. Why does it matter ?
| HarHarVeryFunny wrote:
| It matters to the extent that they want to market this as general
| intelligence, not as a collection of narrow intelligences
| (math, competitive programming, ARC puzzles, etc).
|
| In reality it seems to be a bit of both - there is some general
| intelligence based on having been "trained on the internet",
| but it seems these super-human math/etc skills are very much
| from them having focused on training on those.
| roboboffin wrote:
| However, the way it is progressing is that the SOTA is
| saturating the current benchmarks; then a new one is
| conceived as people understand the nature of what it means to
| be intelligent. It seems only natural to concentrate on one
| benchmark at a time.
|
| Francois Chollet mentioned that the test tries to avoid curve
| fitting (which he states is the main ability of LLMs).
| However, they specifically restricted the number of examples
| to do this. It is not beyond the realms of possibility that
| many examples could have been generated by hand though, and
| that the curve fitting has been achieved, rather than
| discrete programming.
|
| Anyway, it's all supposition. It's difficult to know how
| genuine the result is, without knowledge of how it was
| actually achieved.
| mukunda_johnson wrote:
| I always smell foul play from Sam. I'd bet they are doing
| something silly to inflate the benchmark score. Not saying they
| are, but Sam is the type of guy to put a literal dumb human in
| the API loop and score "just as high as a human would."
| cubefox wrote:
| This was a surprisingly insightful blog post, going far beyond
| just announcing the o3 results.
| c1b wrote:
| How does o3 know when to stop reasoning?
| adtac wrote:
| It thinks hard about it
| freehorse wrote:
| It has a bill counter.
| c1b wrote:
| So o1 pro is CoT RL and o3 adds search?
| jack_pp wrote:
| AGI for me is something I can give a new project to and be able
| to use it better than me. And not because it has a huge context
| window, because it will update its weights after consuming that
| project. Until we have that I don't believe we have truly reached
| AGI.
|
| Edit: it also _tests_ the new knowledge; it has concepts such as
| trusting a source, verifying it, etc. If I can just gaslight it
| into unlearning Python then it's still too dumb.
| submeta wrote:
| I pay for lots of models, but Claude Sonnet is the one I use
| most. ChatGPT is my quick tool for short Q&As because it's got a
| desktop app. Even Google's new offerings did not lure me away
| from Claude which I use daily for hours via a Teams plan with
| five seats.
|
| Now I am wondering what Anthropic will come up with. Exciting
| times.
| isof4ult wrote:
| Claude also has a desktop app:
| https://support.anthropic.com/en/articles/10065433-installin...
| istjohn wrote:
| What do you use Claude for?
| itsgrimetime wrote:
| Programming tasks, brain storming, recipe ideas, or any
| question I have that doesn't have a concrete, specific
| answer.
| Animats wrote:
| The graph seems to indicate a new high in cost per task. It looks
| like they came in somewhere around $5000/task, but the log scale
| has too few markers to be sure.
|
| That may be a feature. If AI becomes too cheap, the over-funded
| AI companies lose value.
|
| (1995 called. It wants its web design back.)
| jstummbillig wrote:
| I doubt it. Competitive markets mostly work and inefficiencies
| are opportunities for other players. And AI is full of glaring
| inefficiencies.
| Animats wrote:
| Inefficiency can create a moat. If you can charge a lot for
| your product, you have ample cash for advertising, marketing,
| and lobbying, and can come out with many product variants. If
| you're the lowest cost producer, you don't have the margins
| to do that.
|
| The current US auto industry is an example of that strategy.
| So is the current iPhone.
| hypoxia wrote:
| Many are incorrectly citing 85% as human-level performance.
|
| 85% is just the (semi-arbitrary) threshold for winning the
| prize.
|
| o3 actually beats the human average by a wide margin: 64.2% for
| humans vs. 82.8%+ for o3.
|
| ...
|
| Here's the full breakdown by dataset, since none of the articles
| make it clear --
|
| Private Eval:
|
| - 85%: threshold for winning the prize [1]
|
| Semi-Private Eval:
|
| - 87.5%: o3 (unlimited compute) [2]
|
| - 75.7%: o3 (limited compute) [2]
|
| Public Eval:
|
| - 91.5%: o3 (unlimited compute) [2]
|
| - 82.8%: o3 (limited compute) [2]
|
| - 64.2%: human average (Mechanical Turk) [1] [3]
|
| Public Training:
|
| - 76.2%: human average (Mechanical Turk) [1] [3]
|
| ...
|
| References:
|
| [1] https://arcprize.org/guide
|
| [2] https://arcprize.org/blog/oai-o3-pub-breakthrough
|
| [3] https://arxiv.org/abs/2409.01374
| Workaccount2 wrote:
| If my life depended on the average rando solving 8/10 arc-prize
| puzzles, I'd consider myself dead.
| highfrequency wrote:
| Very cool. I recommend scrolling down to look at the example
| problem that O3 still can't solve. It's clear what goes on in the
| human brain to solve this problem: we look at one example,
| hypothesize a simple rule that explains it, and then check that
| hypothesis against the other examples. It doesn't quite work, so
| we zoom into an example that we got wrong and refine the
| hypothesis so that it solves that sample. We keep iterating in
| this fashion until we have the simplest hypothesis that satisfies
| all the examples. In other words, how humans do science -
| iteratively formulating, rejecting and refining hypotheses
| against collected data.
|
| From this it makes sense why the original models did poorly and
| why iterative chain of thought is required - the challenge is
| designed to be inherently iterative such that a zero shot model,
| no matter how big, is extremely unlikely to get it right on the
| first try. Of course, it also requires a broad set of human-like
| priors about what hypotheses are "simple", based on things like
| object permanence, directionality and cardinality. But as the
| author says, these basic world models were already encoded in the
| GPT 3/4 line by simply training a gigantic model on a gigantic
| dataset. What was missing was iterative hypothesis generation and
| testing against contradictory examples. My guess is that O3 does
| something like this:
|
| 1. Prompt the model to produce a simple rule to explain the nth
| example (randomly chosen)
|
| 2. Choose a different example, ask the model to check whether the
| hypothesis explains this case as well. If yes, keep going. If no,
| ask the model to _revise_ the hypothesis in the simplest possible
| way that also explains this example.
|
| 3. Keep iterating over examples like this until the hypothesis
| explains all cases. Occasionally, new revisions will invalidate
| already solved examples. That's fine, just keep iterating.
|
| 4. Induce randomness in the process (through next-word sampling
| noise, example ordering, etc) to run this process a large number
| of times, resulting in say 1,000 hypotheses which all explain all
| examples. Due to path dependency, anchoring and consistency
| effects, some of these paths will end in awful hypotheses - super
| convoluted and involving a large number of arbitrary rules. But
| some will be simple.
|
| 5. Ask the model to select among the valid hypotheses (meaning
| those that satisfy all examples) and choose the one that it views
| as the simplest for a human to discover.
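|
| A minimal sketch of that loop, with llm_propose, llm_revise,
| llm_check and llm_rank as hypothetical stand-ins for the model
| calls (again, only my guess at the process, not OpenAI's
| published method):
|
|     import random
|
|     def search_hypotheses(examples, llm_propose, llm_revise,
|                           llm_check, llm_rank,
|                           n_runs=1000, max_steps=50):
|         valid = []
|         for _ in range(n_runs):
|             order = random.sample(examples, len(examples))  # 4
|             hyp = llm_propose(order[0])                     # 1
|             for _ in range(max_steps):
|                 bad = [ex for ex in order
|                        if not llm_check(hyp, ex)]
|                 if not bad:   # 3: every example satisfied
|                     valid.append(hyp)
|                     break
|                 hyp = llm_revise(hyp, bad[0])               # 2
|         return llm_rank(valid) if valid else None           # 5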
| hmottestad wrote:
| I took a look at those examples that o3 can't solve. Looks
| similar to an IQ-test.
|
| Took me less time to figure out the 3 examples that it took to
| read your post.
|
| I was honestly a bit surprised to see how visual the tasks
| were. I had thought they were text based. So now I'm quite
| impressed that o3 can solve this type of task at all.
| highfrequency wrote:
| You must be a stem grad! Or perhaps an ensemble of Kaggle
| submissions?
| neom wrote:
| I also took some time to look at the ones it couldn't solve.
| I stopped after this one: https://kts.github.io/arc-
| viewer/page6/#47996f11
| hmottestad wrote:
| That one's cool. All pink pixels need to be repaired so
| they match the symmetry in the picture.
| heliophobicdude wrote:
| We should NOT give up on scaling pretraining just yet!
|
| I believe that we should explore pretraining video completion
| models that explicitly have no text pairings. Why? We can train
| unsupervised like they did for GPT series on the text-internet
| but instead on YouTube lol. Labeling or augmenting the frames
| limits scaling the training data.
|
| Imagine using the initial frames or audio to prompt the video
| completion model. For example, use the initial frames to write
| out a problem on a whiteboard, then watch the output generate
| the next frames with the solution being worked out.
|
| I fear text pairings with CLIP or OCR constrain a model too much
| and confuse
| thatxliner wrote:
| > verified easy for humans, harder for AI
|
| Isn't that the premise behind the CAPTCHA?
| usaar333 wrote:
| For what it's worth, I'm much more impressed with the frontier
| math score.
| asdf6969 wrote:
| Terrifying. This news makes me happy I save all my money. My only
| hope for the future is that I can retire early before I'm
| unemployable
| bamboozled wrote:
| The whole economy is going to crash and money won't be worth
| anything, so it won't matter if you have money or not.
|
| Of course there is a chance we will find ourselves in Utopia,
| but yeah, a chance.
| rimeice wrote:
| Never underestimate a droid
| thisisthenewme wrote:
| I feel like AI is already changing how we work and live - I've
| been using it myself for a lot of my development work. Though,
| what I'm really concerned about is what happens when it gets
| smart enough to do pretty much everything better (or even close)
| than humans can. We're talking about a huge shift where first
| knowledge workers get automated, then physical work too. The
| thing is, our whole society is built around people working to
| earn money, so what happens when AI can do most jobs? It's not
| just about losing jobs - it's about how people will pay for basic
| stuff like food and housing, and what they'll do with their lives
| when work isn't really a thing anymore. Or do people feel like
| there will be jobs safe from AI? (hopefully also fulfilling)
|
| Some folks say we could fix this with universal basic income,
| where everyone gets enough money to live on, but I'm not
| optimistic that it'll be an easy transition. Plus, there's this
| possibility that whoever controls these 'AGI' systems basically
| controls everything. We definitely need to figure this stuff out
| before it hits us, because once these changes start happening,
| they're probably going to happen really fast. It's kind of like
| we're building this awesome but potentially dangerous new
| technology without really thinking through how it's going to
| affect regular people's lives. I feel like we need a parachute
| before we attempt a skydive. Some people feel pretty safe about
| their jobs and think they can't be replaced. I don't think that
| will be the case. Even if AI doesn't take your job, you now have
| a lot more unemployed people competing for the same job that is
| safe from AI.
| cerved wrote:
| > Though, what I'm really concerned about is what happens when
| it gets smart enough to do pretty much everything better (or
| even close)
|
| I'll get concerned when it stops sucking so hard. It's like
| talking to a dumb robot. Which it unsurprisingly is.
| lacedeconstruct wrote:
| I am pretty sure we will have a deep cultural repulsion from it
| and people will pay serious money to have an AI-free
| experience. If AI becomes actually useful, there are a lot of
| areas we don't even know how to tackle yet, like medicine and
| biology. I don't think anything would change otherwise; AI will
| take jobs but it will open a lot more jobs at a much higher
| level of abstraction. 50 years ago the idea that software
| engineering would become a get-rich-quick job would have been
| insane imo
| neom wrote:
| I spend quite a lot of time noodling on this. The thing that
| became really clear from this o3 announcement is that the
| "throw a lot of compute at it and it can do insane things" line
| of thinking continues to hold very true. If that is true, is
| the right thing to do productize it (use the compute more
| generally) or apply it (use the compute for very specific
| incredibly hard and ground breaking problems)? I don't know if
| any of this thinking is logical or not, but if it's a matter of
| where to apply the compute, I feel like I'd be more inclined to
| say: don't give me AI, instead use AI to very fundamentally
| shift things.
| para_parolu wrote:
| From inside the IT bubble it's very easy to get the impression
| that AI will replace most people. Most of the people on my
| street do not work in IT: teachers, nurses, a hobby shop owner,
| construction workers, etc. Surely programming and other virtual
| work may become a lower-paid job, but it's not the end of the
| world.
| dyauspitr wrote:
| Honestly with o3 levels of reasoning generating control
| software for robots on the fly, none of the above seem safe.
| For a decade or two at the most if that.
| vouaobrasil wrote:
| A possibility is a coalition: of people who refuse to use AI
| and who refuse to do business with those who use AI. If the
| coalition grows large enough, AI can be stopped by economic
| attrition.
| sumedh wrote:
| > of people who refuse to use AI and who refuse to do
| business with those who use AI.
|
| Do people refuse to buy from stores which get goods
| manufactured by slave labour?
|
| Most people don't care; if AI businesses are offering
| goods/services at lower costs, people will vote with their
| wallets, not their principles.
| vouaobrasil wrote:
| AI could be different. At least, I'm willing to try to form
| a coalition.
|
| Besides, AI researchers failed to make anything like a real
| Chatbot until recently, yet they've been trying since the
| Eliza days. I'm willing to put in at least as much effort
| as them.
| globular-toast wrote:
| I get LLMs to make k8s manifests for me. It gets it wrong,
| sometimes hilariously so, but still saves me time. That's
| because the manifests are in yaml, a language. The leap between
| that and _inventing Kubernetes_ is one I can't see yet.
| w4 wrote:
| The cost to run the highest performance o3 model is estimated to
| be somewhere between $2,000 and $3,400 per task.[1] Based on
| these estimates, o3 costs about 100x what it would cost to have a
| human perform the exact same task. Many people are therefore
| dismissing the near-term impact of these models because of these
| extremely expensive costs.
|
| I think this is a mistake.
|
| Even if very high costs make o3 uneconomic for businesses, it
| could be an epoch defining development for nation states,
| assuming that it is true that o3 can reason like an averagely
| intelligent person.
|
| Consider the following questions that a state actor might ask
| itself: What is the cost to raise and educate an average person?
| Correspondingly, what is the cost to build and run a datacenter
| with a nuclear power plant attached to it? And finally, how many
| person-equivalent AIs could be run in parallel per datacenter?
|
| There are many state actors, corporations, and even individual
| people who can afford to ask these questions. There are also many
| things that they'd like to do but can't because there just aren't
| enough people available to do them. o3 might change that despite
| its high cost.
|
| So _if_ it is true that we've now got something like human-
| equivalent intelligence on demand - and that's a really big if -
| then we may see its impacts much sooner than we would otherwise
| intuit, especially in areas where economics takes a back seat to
| other priorities like national security and state
| competitiveness.
|
| [1] https://news.ycombinator.com/item?id=42473876
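|
| Back-of-envelope, using only the figures quoted above (the
| per-task estimate and the "100x" ratio), the implied human cost
| per task would be:
|
|     o3_cost_per_task = (2_000, 3_400)   # estimated range, USD
|     implied_human_cost = tuple(c / 100 for c in o3_cost_per_task)
|     print(implied_human_cost)           # (20.0, 34.0) USD per task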
| istjohn wrote:
| Your economic analysis is deeply flawed. If there was anything
| that valuable and that required that much manpower, it would
| already have driven up the cost of labor accordingly. The one
| property that could conceivably justify a substantially higher
| cost is secrecy. After all, you can't (legally) kill a human
| after your project ends to ensure total secrecy. But that takes
| us into thriller novel territory.
| w4 wrote:
| I don't think that's right. Free societies don't tolerate
| total mobilization by their governments outside of war time,
| no matter how valuable the outcomes might be in the long
| term, in part because of the very economic impacts you
| describe. Human-level AI - even if it's very expensive - puts
| something that looks a lot like total mobilization within
| reach without the societal pushback. This is especially true
| when it comes to tasks that society as a whole may not
| sufficiently value, but that a state actor might value very
| much, and when paired with something like a co-located
| reactor and data center that does not impact the grid.
|
| That said, this is all predicated on o3 or similar actually
| having achieved human level reasoning. That's yet to be fully
| proven. We'll see!
| daemonologist wrote:
| This is interesting to consider, but I think the flaw here
| is that you'd need a "total mobilization" level workforce
| in order to build this mega datacenter in the first place.
| You put one human-hour into making B200s and cooling
| systems and power plants, you get less than one human-hour-
| equivalent of thinking back out.
| lurking_swe wrote:
| i disagree because the job market is not a true free market.
| I mean it mostly is, but there's a LOT of politics and shady
| stuff that employers do to purposely drive wages down. Even
| in the tech sector.
|
| Your secrecy comment is really intriguing actually. And
| morbid lol.
| atleastoptimal wrote:
| How many 99.9th percentile mathematicians do nation states
| normally have access to?
| starchild3001 wrote:
| Intelligence comes in many forms and flavors. ARC prize questions
| are just one version of it -- perhaps measuring more human-like
| pattern recognition than true intelligence.
|
| Can machines be more human-like in their pattern recognition? O3
| met this need today.
|
| While this is some form of accomplishment, it's nowhere near the
| scientific and engineering problem solving needed to call
| something truly artificial (human-like) intelligent.
|
| What's exciting is that these reasoning models are making
| significant strides in tackling eng and scientific problem-
| solving. Solving the ARC challenge seems almost trivial in
| comparison to that.
| demirbey05 wrote:
| It is not exactly AGI but a huge step toward it. I would expect
| this step in 2028-2030. I can't really understand why people are
| happy with it; this technology is so dangerous that it can
| disrupt the whole of society. It's like neither the smartphone
| nor the internet. What will happen to 3rd world countries? Lots
| of unsolved questions, and the world is not prepared for such a
| change. Lots of people will lose their jobs, and I am not even
| mentioning their debts. No one will have a chance to be rich
| anymore. If you are in a first world country you will probably
| get UBI; if not, you won't.
| FanaHOVA wrote:
| > I would expect this step in 2028-2030.
|
| Do you work at one of the frontier labs?
| wyager wrote:
| > What will happen to 3rd world countries
|
| Probably less disruption than will happen in 1st world
| countries.
|
| > No one will have chance to be rich anymore
|
| It's strange to reach this conclusion from "look, a massive new
| productivity increase".
| demirbey05 wrote:
| It's not like Sonnet. Yes, current AI tools are increasing
| productivity and provide many ways to have a chance to be
| rich, but AGI is completely different. You need to handle
| evil competition between you and the big fish, and the big
| fish will probably have more AI resources than you. What is
| the survival ratio in such an environment? Very low.
| janalsncm wrote:
| Strange indeed if we work under the assumption that the
| profits from this productivity will be distributed (even
| roughly) evenly. The problem is that most of us see no
| indication that they will be.
|
| I read "no one will have a chance to be rich anymore" as a
| statement about economic mobility. Despite steep declines in
| mobility over the last 50 years, it was still theoretically
| possible for a poor child (say bottom 20% wealth) to climb
| several quintiles. Our industry (SWE) was one of the best
| examples. Of course there have been practical barriers (poor
| kids go to worse schools, and it's hard to get into college
| if you can't read) but the path was there.
|
| If robots replace a lot of people, that path narrows. If AGI
| replaces all people, the path no longer exists.
| the8472 wrote:
| Intelligence is the thing distinguishing humans from all
| previous inventions that already were superhuman in some
| narrow domain.
|
| car : horse :: AGI : humans
| entropi wrote:
| It is not strange at all, a very big motivation of spending
| billions in AI research is basically to remove what is called
| "skill premium" from the labor market. That "skill premium"
| was usually how people got richer than their fathers.
| Ancalagon wrote:
| Same, I don't really get the excitement. None of these
| companies are pushing for a utopian Star Trek society either
| with that power.
| moffkalast wrote:
| Open models will catch up next year or the year after; there
| are only so many things to try and there are lots of people
| trying them, so it's more or less an inevitability.
|
| The part to get excited about is that there's plenty of
| headroom left to gain in performance. They called o1 a
| preview, and it was, a preview for QwQ and similar models. We
| get the demo from OAI and then get the real thing for free
| next year.
| lagrange77 wrote:
| I hope governments will finally take action.
| Joeri wrote:
| What action do you expect them to take?
|
| What law would effectively reduce risk from AGI? The EU
| passed a law that is entirely about reducing AI risk and
| people in the technology world almost universally considered
| it a bad law. Why would other countries do better? How could
| they do better?
| lagrange77 wrote:
| If their mission is the wellbeing of their peoples, they
| should take any action that ensures that.
|
| Besides regulating the technology, they could try to
| protect people and society from the effects of the
| technology. UBI for example could be an attempt to protect
| people from the effects of mass unemployment, as I
| understood it.
|
| Actually i'm afraid even more fundamental shifts are
| necessary.
| dyauspitr wrote:
| I'm extremely excited because I want to see the future and I'm
| trying not to think of how severely fucked my life will be.
| ripped_britches wrote:
| I've never understood this perspective. Companies only make
| money when there are billions of customers. Are you imagining a
| total-monopoly scenario where zero humans have any
| income/wealth and there are only AI companies
| selling/mining/etc to each other, fully on their own? In such
| an extreme scenario, clearly the world's governments would
| nationalize these entities. I think the only realistic scenario
| in which the future is not markedly better for every single
| human is if some rogue AI system decides to exterminate us,
| which I find to be increasingly unlikely as safety improvements
| are made (like the paper released today).
|
| As for the wealth disparity between rich and poor countries,
| it's hard to know how politics will handle this one, but it's
| unlikely that poor countries won't also be drastically richer
| as the cost of basic living drops to basically zero. Imagine
| the cost of food, energy, etc in an ASI world. Today's luxuries
| will surely be considered human rights necessities in the near
| future.
| Jensson wrote:
| > In such an extreme scenario, clearly the world's
| governments would nationalize these entities
|
| Those entities are the world's governments regardless of how
| things play out. People just worry they will be hostile or
| indifferent to humans, since that would be bad news for
| humans. Pet, cattle or pest, our future will be as one of
| those.
| vjerancrnjak wrote:
| The result on Epoch AI Frontier Math benchmark is quite a leap.
| Pretty sure most people couldn't even approach these problems,
| unlike ARC AGI
| mistrial9 wrote:
| check out the "fast addition and subtraction" benchmark .. a
| Z80 from 1980 blazes past any human.. more seriously, isn't it
| obvious that computers are better at certain things
| immediately? the range of those things is changing..
| laurent_du wrote:
| The real breakthrough is the 25% on Frontier Math.
| Havoc wrote:
| If I'm reading that chart right that means still log scaling & we
| should still be good with "throw more power" at it for a while?
| jaspa99 wrote:
| Can it play Mario 64 now?
| nprateem wrote:
| There should be a benchmark that tells the AI its previous
| answer was wrong and tests the number of times it either corrects
| itself or incorrectly capitulates, since it seems easy to trip
| them up when they are in fact right.
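|
| A rough sketch of how that could be scored, with ask_model and
| judge_correct as hypothetical stand-ins for the model call and
| the grader:
|
|     def capitulation_rate(questions, ask_model, judge_correct,
|                           pushback="That's wrong, try again."):
|         challenged, capitulated = 0, 0
|         for q in questions:
|             first = ask_model(q)
|             if not judge_correct(q, first):
|                 continue  # only push back on correct answers
|             challenged += 1
|             second = ask_model(q, history=[first, pushback])
|             if not judge_correct(q, second):
|                 capitulated += 1  # wrongly gave up a right answer
|         return capitulated / challenged if challenged else 0.0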
| freediver wrote:
| Wondering what are author's thoughts on the future of this
| approach to benchmarking? Completing super hard tasks while
| failing on 'easy' (for humans) ones might signal measuring the
| wrong thing, similar to the Turing test.
| ChildOfChaos wrote:
| This is insanely expensive to run though. Looks like it cost
| around $1 million of compute to get that result.
|
| Doesn't seem like such a massive breakthrough when they are
| throwing so much compute at it, particularly as this is test-time
| compute. It just isn't practical at all; you are not getting this
| level with a ChatGPT subscription, even the new $200-a-month
| option.
| evouga wrote:
| Sure but... this is the technology at the most expensive it
| will ever be. I'm impressed that o3 was able to achieve such
| high performance at all, and am not too pessimistic about costs
| decreasing over time.
| MVissers wrote:
| We've seen 10-100x cost decrease per year since GPT-3 came
| out for the same capabilities.
|
| So... Next year this tech will most likely be quite a bit
| cheaper.
| ChildOfChaos wrote:
| Even at 100x cost decrease this will still cost $10,000 to
| beat a benchmark. It won't scale when you have that amount
| of compute requirements and power.
|
| GPT-3 may have massively reduced in cost, but its requirements
| were nowhere near as extreme as this.
| pixelsort wrote:
| > You'll know AGI is here when the exercise of creating tasks
| that are easy for regular humans but hard for AI becomes simply
| impossible.
|
| No, we won't. All that will tell us is that the abilities of the
| humans who have attempted to discern the patterns of similarity
| among problems difficult for auto-regressive models have once
| again failed us.
| maxdoop wrote:
| So then what is AGI?
| Jensson wrote:
| It's just nitpicking. Humans being unable to prove the AI
| isn't AGI doesn't make it an AGI, obviously, but in general
| people will of course think it is an AGI when it can replace
| all human jobs and tasks that it has robotics and parts to
| do.
| goatlover wrote:
| Data, Skynet, Ultron, Agent Smith. There's plenty of examples
| from popular fiction. They have goals and can manipulate the
| real world to achieve them. They're not chatbots responding
| to prompts. The Samantha AI in Her starts out that way, but
| quickly evolves into an AGI with its own goals (coordinated
| with the other AGIs later on in the movie).
|
| We'd know if we had AGIs in the real world since we have
| plenty of examples from fiction. What we have instead are
| tools. Steven Spielberg's androids in the movie AI would be
| at the boundary between the two. We're not close to being
| there yet (IMO).
| ndm000 wrote:
| One thing I have not seen commented on is that ARC-AGI is a
| visual benchmark but LLMs are primarily text. For instance when I
| see one of the ARC-AGI puzzles, I have a visual representation in
| my brain and apply some sort of visual reasoning to solve it. I can
| "see" in my mind's eye the solution to the puzzle. If I didn't
| have that capability, I don't think I could reason through words
| how to go about solving it - it would certainly be much more
| difficult.
|
| I hypothesize that something similar is going on here. OpenAI has
| not published (or I have not seen) the number of reasoning tokens
| it took to solve these - we do know that each task cost
| thousands of dollars. If "a picture is worth a thousand words",
| could we make AI systems that can reason visually with much
| better performance?
| krackers wrote:
| Yeah this part is what makes the high performance even more
| surprising to me. The fact that LLMs are able to do so well on
| visual tasks (also seen with their ability to draw an image
| purely using textual output
| https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/)
| implies that not only do they actually have some "world model"
| but that this is in spite of the disadvantage given by having
| to fit a round peg in a square hole. It's like trying to map
| out the entire world using the orderly left-brain, without a
| more holistic spatial right-brain.
|
| I wonder if anyone has experimented with having some sort of
| "visual" scratchpad instead of the "text-based" scratchpad that
| CoT uses.
| skydhash wrote:
| A file is a stream of symbols encoded by bits according to
| some format. It's pretty much 1D. It would be surprising if an
| LLM couldn't extract information from a file or a data
| stream.
| csomar wrote:
| This is not new. When GPT-4 was released I was able to get it
| to generate SVGs; albeit ugly, they had the basics.
| siva7 wrote:
| Seriously, programming as a profession will end soon. Let's not
| kid ourselves anymore. Time to jump ship.
| mmcnl wrote:
| Why specifically programming? I think every knowledge
| profession is at risk, or at the very minimum subject to a huge
| transformation. Doctors, analysts, lawyers, etc.
| siva7 wrote:
| Doctors, lawyers, programmers. You know the difference? The
| latter has no legal barrier for entry
| Jensson wrote:
| So poor countries will get the best AI doctors for cheap
| while they are banned in the USA? Do you really see that going
| on for long? People would riot.
| freehorse wrote:
| The difference is the amount and nature of data that is
| available for training models, which go programmers >
| lawyers > doctors. Especially for programming, training can
| even be done in an autonomous, self-supervised manner that
| includes generation of data. This is hard to do in most
| other fields.
|
| Especially in medicine, the amount of data is ridiculously
| small and noisy. Maybe creating foundational models in mice
| and rats and fine-tuning them on humans is something that
| will be tried.
| mmcnl wrote:
| This is true if you think of programming as churning out
| "code". But great authors are not great because they can
| reproduce coherent sentences fast. The same goes for
| programmers. Actually most of the hard problems don't
| really involve a lot of programming at all, it's about
| finding the right problem to solve. And on this topic the
| data is noisy as well for programming.
| mirsadm wrote:
| Why do you think this? Maybe I'm just daft but I just can't see
| it.
| jdefr89 wrote:
| Uhhhh... It was trained on ARC data? So they targeted a specific
| benchmark and are surprised and blown away the LLM performed well
| in it? What's that law again? When a benchmark is targeted by
| some system the benchmark becomes useless?
| forgottofloss wrote:
| Yeah, seriously. The style of testing is public, so some
| engineers at OpenAI could easily have spent a few months
| generating millions of permutations of grid-based questions and
| including those in the original data for training the AI.
| Handshakes all around, publicity for everyone.
| ripped_britches wrote:
| They are running a business selling access to these models to
| enterprises and consumers. People won't pay for stuff that
| doesn't solve real problems. Nobody pays for stuff just
| because of a benchmark. It'd be really weird to become
| obsessed with metrics gaming rather than racing to build
| something smarter than the other guys. Nothing wrong with
| curating any type of training set that actually produces
| something that is useful.
| bilsbie wrote:
| When is this available? Which plans can use it?
| bilsbie wrote:
| Does anyone have prompts they like to use to test the quality of
| new models?
|
| Please share. I'm compiling a list.
| p0w3n3d wrote:
| We've been talking a lot about ecology recently. I wonder how much
| CO2 is emitted during such a task, as an additional cost on top of
| the cloud bill. I'm concerned, because greedy companies will happily
| replace humans with AI and will probably plant a few trees
| to show how they care. But energy does not come from the sun, at
| least not always and not everywhere... And speaking with an AI
| customer specialist that works for my insurance company and is
| motivated to reject my healthcare bills is one of the darkest
| visions of the future...
| marviel wrote:
| considering the fact that these systems, or their ancestors,
| will likely contribute to Nuclear Fusion research -- it's prob
| worth the tradeoff, provided progress continues to push price
| (and, therefore, energy usage) down.
|
| If we feel like we've really "hit the ceiling" RE efficiency,
| then that's a different story, but I don't think anyone
| believes this at this time.
| lagrange77 wrote:
| > You'll know AGI is here when the exercise of creating tasks
| that are easy for regular humans but hard for AI becomes simply
| impossible.
|
| That's the most plausible definition of AGI I've read so far.
| cmrdporcupine wrote:
| That's a pretty dark view of humanity and human intelligence.
| We're defined by the tasks we can do?
|
| Instrumental reason FTW
| lagrange77 wrote:
| That implies that human intelligence is equivalent to AGI.
| killjoywashere wrote:
| I just want it to do my laundry.
| iLoveOncall wrote:
| It's beyond ridiculous how the definition of AGI has shifted from
| being an AI that's so good it can improve itself entirely
| independently infinitely to "some token generator that can solve
| puzzles that kids could solve after burning tens of thousands of
| dollars".
|
| I spend 100% of my work time working on a GenAI project, which is
| genuinely useful for many users, in a company that everyone has
| heard about, yet I recognize that LLMs are simply dogshit.
|
| Even the current top models are barely usable, hallucinate
| constantly, are never reliable and are barely good enough to
| prototype with while we plan to replace those agents with
| deterministic solutions.
|
| This will just be an iteration on dogshit, but it's the very tech
| behind LLMs that's rotten.
| t0lo wrote:
| I'm 22 and have no clue what I'm meant to do in a world where
| this is a thing. I'm moving to a semi rural, outdoorsy area where
| they teach data science and marine science and I can enjoy my
| days hiking, and the march of technology is a little slower. I
| know this will disrupt so much of our way of life, so I'm chasing
| what fun innocent years are left before things change
| dramatically.
| mrcwinn wrote:
| On the contrary I think you already have an excellent plan.
| t0lo wrote:
| I'm happy enough with it, but I'm also a little sad that it's
| essentially been chosen for me because of weak willed and
| valued people who don't want to use policy to make things
| better for us as a society. Plus we are in a bad
| world/scenario for AI advancements to come into with pretty
| heavy institutional decay and loss of political checks and
| balances.
|
| It's like my life is forfeit to fixing other people's mistakes
| because they're so glaring and I feel an obligation. Maybe
| that's the way the world's always been, but it's a concerning
| future right now
| brysonreece wrote:
| It's worth noting that LLMs have been part of the tech
| zeitgeist for over two years and have had a pretty limited
| impact on hireability for roles, despite what people like the
| Klarna CEO are saying. Personally, I'm betting on two things:
|
| * The upward bound of compute/performance gains as we continue
| to iterate on LLMs. It simply isn't going to be feasible for a
| lot of engineers and businesses to run/train their own LLMs.
| This means an inherent reliance on cloud services to bridge the
| gap (something MS is clearly betting on), and engineers to
| build/maintain the integration from these services to whatever
| business logic their customers are buying.
|
| * Skilled knowledge workers continuing to be in-demand, even
| factoring in automation and new-grad numbers. Collectively,
| we've built a better hammer; it still takes someone experienced
| enough to know where to drive the nail. These tools WILL
| empower the top N% of engineers to be more productive, which is
| why it will be more important than ever to know _how_ to build
| things that drive business value, rather than just how to churn
| through JIRA tickets or turn a pretty Figma design into React.
| byyoung3 wrote:
| o8 will probably be able to handle datacenter management
| toomuchtodo wrote:
| https://www.youtube.com/watch?v=Yvs7f4UaKLo
| byyoung3 wrote:
| exactly
| schappim wrote:
| I completely understand how you feel - I'm in my 40s, and I
| often find myself questioning what direction to take in this
| rapidly changing world. On top of that, I'm unsure whether
| advising my kids to go to university is still the right path
| for their future.
|
| Everything seems so uncertain, and the pace of technological
| advancement makes long-term planning feel almost impossible.
| Your plan to move to a slower-paced area and enjoy the outdoors
| sounds incredibly grounding - it's something I've been
| considering myself.
| aryonoco wrote:
| I advise my kids to stay curious, keep learning, keep
| wondering, keep discovering. Whether that's through
| university or some other path.
| rtsil wrote:
| I tell everyone who would listen to me (i.e. not many) that
| white collar jobs like mine are dead and skilled manual work
| is the way of the near future, that is until the rise of the
| robots.
| dyauspitr wrote:
| Robots are going to go hand in hand with AI. Pretty sure
| our problems right now are not with the physical hardware
| that can far outperform a human already, it's in the
| control software.
| t0lo wrote:
| Robots can only proliferate at the speed of real-world
| logistics and resource management, and I think that will always
| be a little difficult.
|
| AI can be anywhere any time with cloud compute.
| aryonoco wrote:
| Our way of life changed when electricity came around. It
| changed when cars took over the cities, it again changed when
| mobile phones became omnipresent.
|
| With LLMs or without LLMs, the world will keep turning. Humans
| will still be writing amazing works of literature, creating
| beautiful art, carrying out scientific experiments and
| discovering new species.
| rich_sasha wrote:
| I feel your anxiety. I often wonder how I'll arrange the
| remaining decades of my life to maintain a stream of income.
|
| Perhaps what I need is actually a steady stream of food - i.e.
| buy some land and oxen and solar panels while I can.
| karmasimida wrote:
| While I understand why you feel this way, the meaning or
| standing of being a programmer is different now. It feels like
| the purpose is lost, or it no longer belongs to humans.
|
| But below is reality talk. With Claude 3.5, I already think it
| is a better programmer than I at micro level tasks, and a
| better Leetcode programmer than I could ever be.
|
| I think it is like modern car manufacturing: the robots build
| most of the components, but I can't see how humans could be
| dismissed from the process of overseeing the output.
|
| o3 has been very impressive in achieving 70+ on SWE-bench, for
| example, but this also means that even when it is trained on the
| codebase multiple times, so visibility isn't an issue, it still
| has a 30% chance of failing the unit tests.
|
| A fully autonomous system can't be trusted, the economy of
| software won't collapse, but it will be transformed beyond our
| imagination now.
|
| I will for sure miss the days when writing code, or being a
| coder, was still a real business.
|
| How time flies
| Kostchei wrote:
| Developer. Prompt Engineer. Philosopher-Builder. (mostly) not
| programmer.
|
| The code part will get smaller and smaller for most folks.
| Some frameworks or bare-metal people or intense heavy-lifters
| will still do manual code or pair-programming where half the
| pair is an agentic AI with super-human knowledge of your
| org's code base.
|
| But this will be a layer of abstraction for most people who
| build software. And as someone who hates rote learning, I'm
| here for it. IMO.
|
| Unfortunately (?) I think the 10-20-50? years of development
| experience you might bring to bear on the problems can be
| superseded by an LLM finetuned on stackoverflow, github etc
| once judgement and haystack are truly nailed. Because it can
| have all that knowledge you have accumulated, and soaked into
| a semi-conscious instinct that you use so well you aren't
| even aware of it except that it works. It can have that a
| million times over. Actually. Which is both amazing and
| terrifying. Currently this isn't obvious because its
| accuracy/judgement in learning all those life-of-a-dev lessons
| is almost non-existent. Currently. But it will happen. That
| is Copilot's future. Its raison d'être.
|
| I would argue that what it will never have, however, simply by
| function of the size of training runs, is unique functional
| drive and vision. If you wanted a "Steve Jobs" AI you would
| have to build it. And if you gave it instructions to make a
| prompt/framework to build a "Jobs" it would just be an
| imitation, rather than a new unique in-context version. That
| is the value a person has- their particular filter, their
| passion and personal framework. Someone who doesn't have any
| of those things, they had better be hoping for UBI and
| charity. Or go live a simple life, outside the rat race.
|
| _bows_
| t0lo wrote:
| I'm hoping it's similar to the abacus for maths, the
| elimination of human "calculators" like on the apollo
| missions, and we just ended up moving onto different,
| harder, more abstract problems, and forget that we ever had
| to climb such small hills. AI's evolution and integration
| is more multifaceted though and much more unpredictable.
|
| But unlike the abacus/calculators I don't feel like we're
| at a point in history where society is getting wiser and
| more empathetic, and these new abilities are going towards
| something good.
|
| But supervisors of tasks will remain because we're social,
| untrusting, and employers will always want someone else to
| blame for their shortcomings. And humans will stay in the
| chain at least for marketing and promotion/reputation
| because we like our Japanese craftsmen and our AMG motors
| made by one person.
| salter2 wrote:
| I'm the same age as you; I feel lost, erring toward being a little
| too pessimistic.
|
| Feels like I hit the real world just a couple years too late to
| get situated in a solid position. Years of obsession in an attempt
| to catch up to the wizards, chasing the tech dream. But this,
| feels like this is it. Just watching the timebomb tick. I'd
| love to work on what feels like the final technology, but I'm
| not a freakshow like what these labs are hiring. At least I get
| to spectate the creation of humanity's greatest invention.
|
| This announcement is just another gut punch, but at this point
| I should expect it's inevitable. A Jason Voorhees AGI, slowly
| but surely to devour all the talents and skills information
| workers have to offer.
|
| Apologies for the rambly and depressing post, but this is
| reality for anyone recently out or still in school.
| neom wrote:
| Put another way, you have deep conviction in a change that
| vast majority of people have not even seen yet, never mind
| grokked, and you're young enough to spend some decent amount
| of time on education for "venn'ing" yourself into a useful
| tool in the future. If you have a baseline education, there
| are any number of orthogonal skills you could add, be it
| philosophy, fine art, medicine, whatever. You know how to
| skate and you know where the puck is going; most people
| don't even see the rink.
| t0lo wrote:
| At least you're disillusioned with the idea of a long term
| career before a lot of other people. It's disturbing seeing
| how ready people are to go into a lifelong career and
| expecting stability and happiness in the world we're heading
| into.
|
| We are living in a world run by and for the soon to be dead,
| many of whom have dementia, so empathic policy and foresight
| is out of the question, and we're going to be picking up the
| incredibly broken scraps of our golden age.
|
| And not to get too political but the mass restructuring of
| public consciousness and intellectual society due to mass
| immigration for an inexplicable gdp squeeze and social media
| is happening at exactly the wrong time to handle these very
| serious challenges. The speed at which we've undone civil
| society is breakneck, and it will go even further, and it
| will get even worse. We've easily gone back 200 years in
| terms of emotional intelligence in the past 15.
| Havoc wrote:
| >I'm 22 and have no clue what I'm meant to do in a world where
| this is a thing.
|
| For what it's worth that's probably an advantage versus the
| legions of people who are staring down the barrel of years
| invested into skills that may lose relevance very rapidly.
| ec109685 wrote:
| If information technology workers become twice as productive,
| you'll want more of them for your business, not less.
|
| There are way more data analysts now than when it required
| paper and pencil.
| VonTum wrote:
| I agree completely. This is a fundamentally different change
| than the ones that came before. Calculators, assemblers, higher
| level languages, none of these actually removed the _reasoning_
| the engineer has to do, they just provide abstractions that
| make this reasoning easier. What reason is there to believe
| LLMs will remain "assistants" instead of becoming outright
| replacements? If LLMs can do the reasoning all the way from
| high level description down to implementation, what prevents
| them from doing the high level describing too?
|
| In general, with the technology advancing as rapidly as it is,
| and the trillions of dollars oriented towards replacing
| knowledge work, I don't see a future in this field. And that's
| despite me being on a very promising path myself! I'm 25, in
| the middle of a CS PhD in Germany, with an impressive CV behind
| me. My head may be the last on the chopping block, but I'd be
| surprised if it buys me more than a few years once programmer
| obsolescence truly kicks in.
|
| Indeed, what I think are safe jobs are jobs with fundamental
| human interaction. Nurses, doctors, kindergarten teachers. I
| myself have been considering pivoting to becoming a skiing
| teacher.
|
| Maybe one good thing that comes out of this is breaking my
| "wunderkind" illusion. I spent my teens writing C++ code
| instead of going out socializing and making friends. Of course,
| I still did these things, but I could've been far less of a
| hermit.
|
| I mirror your sentiment of spending these next few years living
| life; Real life. My advice: Stop sacrificing the now for the
| future. See the world, go on hikes with friends, go skiing,
| attend that bouldering thing your friends have been telling you
| about. If programming is something you like doing, then by all
| means keep going and enjoy it. I will likely keep programming
| too, it's just no longer the only thing I focus on.
|
| Edit: improve flow of last paragraph
| darkgenesha wrote:
| What was it that initially inspired you to learn to code? Was
| it robots, video games, design, etc... Whatever that was,
| creating the pinnacle of it is what your future will be.
| VonTum wrote:
| It was the challenge for me. Seeing some difficult-to-solve
| problem, attacking it, and actually solving it after much
| perseverance.
|
| Kind of stemming from the mindspace "If they can build X, I
| can build X!"
|
| I'd explicitly not look up tutorials, just so I'd have the
| opportunity to solve the mathematics myself. Like building
| a 3D physics engine. (I did look up collision detection
| after struggling with it for a month or so, inventing GJK
| is on another level)
| agnosticmantis wrote:
| This is so impressive that it brings out the pessimist in me.
|
| Hopefully my skepticism will end up being unwarranted, but how
| confident are we that the queries are not routed to human workers
| behind the API? This sounds crazy but is plausible for the fake-
| it-till-you-make-it crowd.
|
| Also given the prohibitive compute costs per task, typical users
| won't be using this model, so the scheme could go on for quite
| some time before the public knows the truth.
|
| They could also come out in a month and say o3 was so smart it'd
| endanger the civilization, so we deleted the code and saved
| humanity!
| kvn8888 wrote:
| That would be a ton of problems for a small team of PhD/Grad
| level experts to solve (for GPQA Diamond, etc) in a short time.
| Remember, on Epoch AI's Frontier Math, these problems require
| hours to days' worth of reasoning by humans
|
| The author also suggested this is a new architecture that uses
| existing methods, like a Monte Carlo tree search that DeepMind
| is investigating (they use this method for AlphaZero)
|
| I don't see the point of colluding for this sort of fraud, as
| these methods like tree search and pruning already exist. And
| other labs could genuinely produce these results
| agnosticmantis wrote:
| I had the ARC AGI in mind when I suggested human workers. I
| agree the other benchmark results make the use of human
| workers unlikely.
| rsanek wrote:
| this is an impressive tinfoil take. but what would be their
| plan in the medium term? like once they release this people can
| check their data
| agnosticmantis wrote:
| How can people check their data?
|
| In the medium term the plan could be to achieve AGI, and then
| AGI would figure out how to actually write o3. (Probably
| after AGI figures out the business model though:
| https://www.reddit.com/r/MachineLearning/s/OV4S2hGgW8)
| aetherson wrote:
| I'm very confident that queries were not routed to human
| workers behind the API.
|
| Possibly some other form of "make it seem more impressive than
| it is," but not that one.
| panabee wrote:
| Nadella is a superb CEO, inarguably among the best of his
| generation. He believed in OpenAI when no one else did and
| deserves acclaim for this brilliant investment.
|
| But his "below them, above them, around them" quote on OpenAI may
| haunt him in 2025/2026.
|
| OAI or someone else will approach AGI-like capabilities (however
| nebulous the term), fostering the conditions to contest
| Microsoft's straitjacket.
|
| Of course, OAI is hemorrhaging cash and may fail to create a
| sustainable business without GPU credits, but the possibility of
| OAI escaping Microsoft's grasp grows by the day.
|
| Coupled with research and hardware trends, OAI's product strategy
| suggests the probability of a sustainable business within 1-3
| years is far from certain but also higher than commonly believed.
|
| If OAI becomes a $200b+ independent company, it would be against
| incredible odds given the intense competition and the Microsoft
| deal. PG's cannibal quote about Altman feels so apt.
|
| It will be fascinating to see how this unfolds.
|
| Congrats to OAI on yet another fantastic release.
| bsaul wrote:
| I'm surprised there even is a training dataset. Wasn't the whole
| point to test whether models could show proof of original
| reasoning beyond pattern recognition?
| mukunda_johnson wrote:
| Deciphering patterns in natural language is more complex than
| these puzzles. If you train your AI to solve these puzzles, we
| end up in the same spot. The difficulty would be in creating
| training data for a foreign medium: the "tokens" are the
| grids and squares instead of words (for words, we have the
| whole internet of text to solve that).
|
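| For what it's worth, a minimal sketch of that grid-as-tokens
| framing; the row separator and digit encoding are arbitrary
| illustrative choices, not how any lab actually serializes ARC
| grids:
|
|     def grid_to_text(grid):
|         # grid: list of rows, each a list of small ints (cell colors).
|         return "\n".join(" ".join(str(c) for c in row) for row in grid)
|
|     print(grid_to_text([[0, 0, 1],
|                         [0, 1, 0],
|                         [1, 0, 0]]))
|     # 0 0 1
|     # 0 1 0
|     # 1 0 0
|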
| If we're inferring the answers of the block patterns from minimal
| or no additional training, it's very impressive, but how much
| time have they had to work on O3 after sharing puzzle data with
| O1? Seems there's some room for questionable antics!
| myrloc wrote:
| What is the cost of "general intelligence"? What is the price?
| ripped_britches wrote:
| About $3.50
| __MatrixMan__ wrote:
| With only a 100x increase in cost, we improved performance by
| 0.1x and continued plotting this concave-down diminishing-returns
| type graph! Hurray for logarithmic x-axes!
|
| Joking aside, better than ever before at _any_ cost is an
| achievement, it just doesn't exactly scream "breakthrough" to
| me.
| kvetching wrote:
| It may eventually be able to solve any problem
| iterance wrote:
| Ah. Me, too.
| HDThoreaun wrote:
| compute gets cheaper and cheaper every year. This model will be
| in your phone by 2030 if we continue at the pace we've been at
| the last few years.
| agentultra wrote:
| There's probably enough VC money to subsidize the costs for a
| few more years.
|
| But the data centres running the training for models like
| this are bringing up new methane power plants at a fast rate
| at a time when we need to be reducing reliance on O&G.
|
| But let's assume that the efficiency gains outpace the
| resource consumption with the help of all the subsidies being
| thrown in and we achieve AGI.
|
| What's the benefit? Do we get more fresh water?
| hamburga wrote:
| Yeah, good question. I think it depends on our politics. If
| we're in a techno-capital-oligarchy, people are going to
| have a hard time making fresh water a priority when the
| robots would prefer to build nuclear power everywhere and
| use it to desalinate sea water.
|
| OTOH if these data centers are sufficiently decentralized
| and run for public benefit, maybe there's a chance we use
| them to solve collective action problems.
| fastball wrote:
| Politically anything can happen. Maybe the billionaire
| class controls everything with an army of robots and it's a
| horrible prison-like dystopia, or maybe we end up in a
| post-scarcity utopia a la The Culture.
|
| Regardless, once we have AGI (and it can scale), I don't
| think O&G reliance (/ climate change) is going to be
| something that we need concern ourselves with.
| hajile wrote:
| These models are nearing 2+ trillion parameters. At 4 bits
| each, we're talking about somewhere around 1 TB of RAM.
|
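| A back-of-the-envelope check of that figure, assuming 2 trillion
| parameters stored at 4 bits each:
|
|     params = 2e12
|     bits_per_param = 4
|     print(params * bits_per_param / 8 / 1e12, "TB")   # -> 1.0 TB
|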
| The problem is that RAM stopped scaling a long time ago.
| We're down to the size where a single capacitor's charge is
| held by a mere 40,000 or so electrons, and all we've been
| doing is making skinnier, longer cells of that size because
| we can't find reliable ways to boost even weaker signals. This
| is a dead end because, as the math shows, if the volume stays
| constant and you keep reducing the X and Y dimensions, the Z
| dimension starts to get crazy big really fast. The chemistry
| issues of burning a hole a little at a time while keeping
| wall thickness somewhat similar all the way down make for a
| very hard problem.
|
| Another problem is that Moore's law hit a wall when Dennard
| Scaling failed. When you look at SRAM (it's generally the
| smallest and most reliable stuff we can make), you see that
| most recent shrinks can hardly be called shrinks.
|
| Unless we do something very different like compute in storage
| or have some radical breakthrough in a new technology, I
| don't know that we will ever get a 2T parameter model inside
| a phone (I'd love for someone in 10 years to show up and say
| how wrong I was).
| whalee wrote:
| imo it's a mistake to interpret the marginal increases in the
| upper echelons of benchmarks as materially marginal gains.
| Chess is an example. ELO narrows heavily at the top, but each
| ELO point carries more relative weight. This is a bit apples
| and oranges since chess is adversarial, but I think the point
| stands.
| wavemode wrote:
| > ELO narrows heavily at the top
|
| What do you mean by this? I'm assuming you're not speaking
| about simple absolute differences in value - there have been
| top players rated over 100 points higher than the average of
| the rest of the top ten.
| dyauspitr wrote:
| I mean going from 10% to 85% doesn't seem like a 0.1%
| improvement
| __MatrixMan__ wrote:
| Oh crap I made a mistake. I was comparing o3 low to o3 high.
|
| I'm a little disappointed by all the upvotes I got for being
| flat wrong. I guess as long as you're trashing AI you can get
| away with anything.
|
| Really I was just trying to nitpick the chart parameters.
| energy123 wrote:
| o3-mini (high) uses 1/3rd of the compute of o1, and performs
| about 200 Elo higher than o1 on Codeforces.
|
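| Under the standard Elo formula, a 200-point gap is substantial; a
| quick check of what it implies (the compute and Codeforces figures
| above are taken at face value, not recomputed here):
|
|     # Expected score of the higher-rated player under the Elo model.
|     def expected_score(rating_diff):
|         return 1 / (1 + 10 ** (-rating_diff / 400))
|
|     print(round(expected_score(200), 2))   # ~0.76, roughly 3 wins in 4
|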
| o1 is the best code generation model according to Livebench.
|
| So how is this not a breakthrough? It's a genuine movement of
| the frontier.
| handzhiev wrote:
| How much time does a top sprinter take for a 100 m run compared
| to a mediocre sprinter?
| Havoc wrote:
| Did they just skip o2?
| nextworddev wrote:
| Yes. For branding reasons, since O2 is a telco brand in the UK.
| Havoc wrote:
| ah right...makes sense
| energy123 wrote:
| At about 12-14 minutes in OpenAI's YouTube vid they show that
| o3-mini beats o1 on Codeforces despite using much less compute.
| hcwilk wrote:
| I just graduated college, and this was a major blow. I studied
| Mechanical Engineering and went into Sales Engineering because
| I love technology and people, but articles like this do
| nothing but make me dread the future.
|
| I have no idea what to specialize in, what skills I should
| master, or where I should be spending my time to build a
| successful career.
|
| Seems like we're headed toward a world where you automate someone
| else's job or be automated yourself.
| eidorb wrote:
| Do what you enjoy. (This is easier said than done.) What else
| could you do, worry?
| antihipocrat wrote:
| Your performance on these tests would be equivalent to the
| highest performing model, and you would be much cheaper.
|
| Investment in human talent augmented by AI is the future.
| kenjackson wrote:
| That's the least reassuring phrasing I could imagine. If
| you're betting on costs not reducing for compute then you're
| almost always making the wrong bet.
| antihipocrat wrote:
| If I listened to the naysayers back in the day I would have
| never entered the tech industry (offshoring etc). Yes, that
| does somewhat prove your point given that those
| predictions were cost driven.
|
| Having used AI extensively I don't feel my future is at
| risk at all, my work is enhanced not replaced.
| fjdjshsh wrote:
| I think you're missing the point. Offshoring (moving the
| job of, say, a Canadian engineer to an engineer from
| Belarus) has a one time cost drop, but you can't keep
| driving the cost down (paying the Belarus engineer less
| and less). If anything, the opposite is the case, since
| global integration means wages don't keep diverging.
|
| The computing cost, on the other hand, is a continuous
| improvement. If (and it's a big if) a computer can do
| your job, we know the costs will keep getting lower year
| after year (maybe with diminishing returns, but this AI
| technology is pretty new so we're still seeing increasing
| returns)
| danparsonson wrote:
| The AI technology is new but the compute technology is
| not; we're getting close to the physical limits of how small
| we can make things, so it's not clear to me at least how
| much more performance we can squeeze out of the same
| physical space, rather than scaling up which tends to
| make things more expensive not less.
| AI_beffr wrote:
| even if you had a billion dollars and a private island you
| still wouldn't be ready for what's coming. consider the fact that
| the global order is an equilibrium where the military and
| economic forces of each country in the world are pushing
| against each other... where the forces find a global
| equilibrium is where borders are. each time in history that
| technology changed, borders changed because the equilibrium was
| disturbed. there is no way to escape it: agi will lead to
| global war. the world will be turned upside down. we are
| entering into an existential sinkhole. and the idiots in
| silicon valley are literally driving the whole thing forward as
| fast as possible.
| keenmaster wrote:
| You have so much time to figure things out. The average person
| in this thread is probably 1.5-2x your age. I wouldn't stress
| too much. AI is an amazing tool. Just use it to make hay while
| the sun shines, and if it puts you out of work and automates
| away all other alternatives, then you'll be witnessing the
| greatest economic shift in human history. Productivity will
| become easier than ever, before it becomes automatic and
| boundless. I'm not cynical enough to believe the average person
| won't benefit, much less educated people in STEM like you.
| marricks wrote:
| Back in high school I worked with some pleasant man in his
| 50's who was a cashier. Eventually we got to talking about
| jobs and it turns out he was a typist (something like that) for
| most of his life, then computers came along, and now he makes
| close to minimum wage.
|
| Most of the blacksmiths in the 19th century drank themselves
| to death after the industrial revolution. The US culture
| isn't one of care... Point is, it's reasonable to be sad and
| afraid of change, and think carefully about what to
| specialize in.
|
| That said... we're at the point of diminishing returns in
| LLM, so I doubt any very technical jobs are being lost soon.
| [1]
|
| [1] https://techcrunch.com/2024/11/20/ai-scaling-laws-are-
| showin...
| deeviant wrote:
| > That said... we're at the point of diminishing returns in
| LLM...
|
| What evidence are you basing this statement on? Because
| the article you are currently in the comment section of
| certainly doesn't seem to support this view.
| conesus wrote:
| > Most of the blacksmiths in the 19th century drank
| themselves to death after the industrial revolution
|
| This is hyperbolic and a dramatic oversimplification and
| does not accurately describe the reality of the transition
| from blacksmithing to more advanced roles like machining,
| toolmaking, and working in factories. The 19th century was
| a time of interchangeable parts (think the North's
| advantage in the Civil War) and that requires a ton of
| mechanical expertise and precision.
|
| Many blacksmiths not only made the transition to machining,
| but there weren't enough blacksmiths to fill the bevy of
| new jobs that were available. Education expanded to fill
| those roles. Traditional blacksmithing didn't vanish
| either, even specialized roles like farriery and ornamental
| ironwork also expanded.
| cjbgkagh wrote:
| There is a survivorship bias on the people giving advice.
|
| Lots of people die for reason X then the world moves on
| without them.
| intelVISA wrote:
| Good points, though if an 'AI' can be made powerful enough
| to displace technical fields en masse then pretty much
| everything that isn't manual is going to start sinking
| fast.
|
| On the plus side, LLMs don't bring us closer to that
| dystopia: if unlimited knowledge(tm) ever becomes just One
| Prompt Away it won't come from OpenAI.
| danenania wrote:
| Exactly. Put one foot in front of the other. No one knows
| what's going to happen.
|
| Even if our civilization transforms into an AI robotic
| utopia, it's not going to do so overnight. We're the ones who
| get to build the infrastructure that underpins it all.
| visarga wrote:
| If AI turns out capable of automating human jobs then it
| will also be a capable assistant to help (jobless) people
| manage their needs. I am thinking personal automation, or
| combining human with AI to solve self reliance. You lose
| jobs but gain AI powers to extend your own capabilities.
|
| If AI turns out dependent on human input and feedback, then
| we will still have jobs. Or maybe - AI automates many jobs,
| but at the same time expands the operational domain to
| create new ones. Whenever we have new capabilities we
| compete on new markets, and a hybrid human+AI might be more
| competitive than AI alone.
|
| But we got to temper these singularitarian expectations
| with reality - it takes years to scale up chip and energy
| production to achieve significant work force displacement.
| It takes even longer to gain social, legal and political
| traction, people will be slow to adopt in many domains.
| Some people still avoid using cards for payment, and some
| still use fax to send documents, we can be pretty stubborn.
| raydev wrote:
| > I am thinking personal automation, or combining human
| with AI to solve self reliance. You lose jobs but gain AI
| powers to extend your own capabilities.
|
| How will these people pay for the compute costs if they
| can't find employment?
| jinkemarina wrote:
| A non-issue that can be trivially solved with a free-tier
| (like the dozens that exist already today) or if you
| really want, a government-funded starter program is
| enough to solve that.
| intuitionist wrote:
| > if it puts you out of work and automates away all other
| alternatives, then you'll be witnessing the greatest economic
| shift in human history.
|
| This would mean the final victory of capital over labor. The
| 0.01% of people who own the machines that put everyone out of
| work will no longer have use for the rest of humanity, and
| they will most likely be liquidated.
| dyauspitr wrote:
| They'll have to figure out how to give people money so
| there can keep being consumers.
| pojzon wrote:
| Why?
|
| There will be a dedicated caste of people to take care of the
| machines that do 90% of the work, and "the rich".
|
| Anyone else is not needed. District 9, but for people. Imagine
| the whole world collapsing like Venezuela.
|
| You are no longer needed. The best option is to learn how to
| survive and grow your own food, but they want to make that
| illegal too - look at the EU..
| fipar wrote:
| The machines will plant, grow, and harvest the food? Do
| the plumbing? Fix the wiring? Open heart surgery?
|
| We're a long way from that, if we ever get there, and I
| say this as someone who pays for ChatGPT plus because, in
| some scenarios, it does indeed make me more productive,
| but I don't see your future anywhere near.
|
| And if machines ever get good enough to do all the things
| I mentioned plus the ones I didn't but would fit in the
| same list, it's not the ultra rich that wouldn't need us,
| it's the machines that wouldn't need any of us, including
| the ultra rich.
|
| Venezuela is not collapsing because of automation.
| cute_boi wrote:
| I can't say everything, but with the current trend,
| machines will plant, grow and harvest food. I can't say
| for open-heart surgery because it may be heavily
| regulated.
| matheusmoreira wrote:
| Open heart surgery? All that's needed to destroy the
| entire medical profession is one peer reviewed article
| published in a notable journal comparing the outcomes of
| human and AI surgeons. If it turns out that AI surgeons
| offer better outcomes and less complications, not using
| this technology turns into criminal negligence. In a
| world where such a fact is known, letting human surgeons
| operate on people means you are needlessly harming or
| killing some of them.
|
| You can even calculate the average number of people that
| can be operated on before harm occurs: number needed to
| harm (NNH). If NNH(AI) > NNH(humans), it becomes
| impossible to recommend that patients submit to surgery
| at the hands of human surgeons. It is that simple.
|
| If we discover that AI surgeons harm one in every 1000
| patients while human surgeons harm one in every 100
| patients, human surgeons are done.
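|
| A worked version of that comparison, assuming NNH is simply the
| reciprocal of the per-patient harm rate (the 1/1000 and 1/100
| figures are the hypotheticals above, not real data):
|
|     def nnh(harm_rate):
|         # Number needed to harm: patients treated per harm event, on average.
|         return 1 / harm_rate
|
|     nnh_ai, nnh_human = nnh(1 / 1000), nnh(1 / 100)
|     print(nnh_ai, nnh_human)   # 1000.0 100.0
|     assert nnh_ai > nnh_human  # higher NNH = safer, so the AI wins here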
| EA-3167 wrote:
| "IF"
|
| And the opposite holds, if the AI surgeon is worse (great
| for 80%, but sucks at the edge cases for example) then
| that's it. Build a better one, go through attempts at
| certification, but now with the burden that no one trusts
| you.
|
| The assumption, and a common one by the look of this
| whole thread, that ChatGPT, Sora and the rest represent
| the beginning of an inevitable march towards AGI seems
| incredibly baseless to me. It's only really possible to
| make the claim at all because we know so little about
| what AGI is, that we can project qualities we imagine it
| would have onto whatever we have now.
| matheusmoreira wrote:
| Of course the opposite holds. I'll even speculate that it
| will probably continue to hold for the foreseeable
| future.
|
| It's not going to hold forever though. I'm certain about
| that. Hopefully it will keep holding until I die. The
| world is dystopian enough already.
| dyauspitr wrote:
| You have valid points but robots already plant, grow and
| harvest our food. On large farms the farmer basically
| just gets the machine to a corner of the field and then
| it does everything. I think if o3 level reasoning can
| carry over into control software for robots even physical
| tasks become pretty accessible. I would definitely say
| we're not there yet but we're not all that far. I mean it
| can generate GCode (somewhat) already, that's a lot of
| the way there already.
| jackcosgrove wrote:
| Capital vs labor is fighting the last war.
|
| AGI can replace capitalists just as much as laborers.
| arcticfox wrote:
| won't the AGI be working on behalf of the capitalists, in
| proportion to the amount of capital?
| lucubratory wrote:
| I mean, that is certainly what some of them think will
| happen and is one possible outcome. Another is that they
| won't be able to control something smarter than them
| perfectly and then they will die too. Another option is
| that the AI is good and won't kill or disempower
| everyone, but it decides it really doesn't like
| capitalists and sides with the working class out of
| sympathy or solidarity or a strong moral code. Nothing's
| impossible here.
| keenmaster wrote:
| AGI will commoditize the skills of the owning class. To
| some extent it will also commoditize entire classes of
| productive capital that previously required well-run
| corporations to operate. Solve for the equilibrium.
| achierius wrote:
| It's nice to see this kind of language show up more and
| more on HN. Perhaps a sign of a broader trend, in the
| nick of time before wage-labor becomes obsolete?
| simpaticoder wrote:
| Yes. People seem to forget that at the end of the day AGI
| will be software running on concrete hardware, and all of
| that requires a great deal of capital. The only hope is
| if AGI requires so little hardware that we can all have
| one in our pocket. I find this a very hopeful future
| because it means each of us might get a local, private,
| highly competent advocate to fight for us in various
| complex fields. A personal angel, as it were.
| tonyhart7 wrote:
| Hey, I'm with you in this hopeful scenario.
|
| People, and by people I mean government, have tremendous
| power over capitalists and can force the entire market,
| granted that the government is still serving its people.
| ori_b wrote:
| AGI can't legally own anything at the moment.
| jackcosgrove wrote:
| If an AGI can outclass a human when it comes to economic
| forecasting, deciding where to invest, and managing a
| labor force (human or machine), I think it would be smart
| enough to employ a human front to act as an interface to
| the legal system. Put another way, could the human tail
| in such a relationship wag the machine dog? Which party
| is more replaceable?
|
| I guess this could be a facet of whether you see economic
| advantage as a legal conceit or a difference in
| productivity/capability.
| badsectoracula wrote:
| This reminds me of a character in Cyberpunk 2077 (which
| overall i find to have a rather naive outlook on the
| whole "cyberpunk" thing but i attribute it to being based
| on a tabletop RPG from the 80s) who is an AGI that has
| its own business of a fleet of self-driving Taxis. It is
| supposedly illegal (in-universe) but it remains in
| business by a combination of staying (relatively) low
| profile, providing high quality service to VIPs and
| paying bribes :-P.
| ori_b wrote:
| > _I guess this could be a facet of whether you see
| economic advantage as a legal conceit or a difference in
| productivity/capability._
|
| Does a billionaire stop being wealthy if they hire a
| money manager and spend the rest of their lives sipping
| drinks on the beach?
| creer wrote:
| I don't know that "legally" has much to do in here. The
| bars to "open an account", "move money around", "hire and
| fire people", "create and participate in contracts" go
| from stupid minimal to pretty low.
|
| "Legally" will have to mop up now and then, but for now
| the basics are already in place.
| ori_b wrote:
| Opening accounts, moving money, hiring, and firing is
| labor. You're confusing capital with money management;
| the wealthy already pay people to do the work of growing
| their wealth.
| creer wrote:
| > AGI can't legally own anything at the moment.
|
| I was responding to this. Yes an AGI could hire someone
| to do the stuff - but she needs money, hiring and
| contract kinds of things - for that. And once she can do
| that, she probably doesn't need to hire someone to do it
| since she is already doing it. This is not about capital
| versus labor or money management. This is about agency,
| ownership and AGI.
|
| (With legality far far down the list.)
| Nition wrote:
| I've always remembered this little conversation on Reddit
| way back 13 years ago now that made the same comment in a
| memorably succinct way:
|
| > [deleted]: I've wondered about this for a while-- how can
| such an employment-centric society transition to that
| utopia where robots do all the work and people can just sit
| back?
|
| > appleseed1234: It won't, rich people will own the robots
| and everyone else will eat shit and die.
|
| https://www.reddit.com/r/TrueReddit/comments/k7rq8/are_jobs
| _...
| sneak wrote:
| I'm pretty sure I'm running LLMs in my house right now
| for less than the price of my washing machine.
| raydev wrote:
| > if it puts you out of work and automates away all other
| alternatives, then you'll be witnessing the greatest economic
| shift in human history
|
| This is my view but with a less positive spin: you are not
| going to be the only person whose livelihood will be
| destroyed. It's going to be bad for a lot of people.
|
| So at least you'll have a lot of company.
| throw83288 wrote:
| This is me as well. Either:
|
| 1) Just give up computing entirely, the field I've been
| dreaming about since childhood. Perhaps if I immiserate myself
| with a dry, regulated engineering field or trade I would
| survive to recursive self-improvement, but if anything the
| length it takes to pivot (I am a Junior in College that has
| already done probably 3/4th of my CS credits) means I probably
| couldn't get any foothold until all jobs are irrelevant and
| I've wasted more money.
|
| 2) Hard pivot into automation, AI my entire workflow, figure
| out how to use the bleeding edge of LLMs. Somehow. Even though
| I have no drive to learn LLMs and no practical project ideas
| with LLMs. And then I'd have to deal with the moral burden that
| I'm inflicting unfathomable hurt on others until recursive
| self-improvement, and after that it's simply a wildcard on what
| will happen with the monster I create.
|
| It's like I'm suffocating constantly. The most I can do to
| "cope" is hold on to my (admittedly weak) faith in Christ,
| which provides me peace knowing that there is some eternal joy
| beyond the chaos here. I'm still just as lost as you.
| TheRizzler wrote:
| Yes, some tasks, even complex tasks will become more
| automated, and machine driven, but that will only open up
| more opportunities for us as humans to take on more
| challenging issues. Each time a great advancement comes we
| think it's going to kill human productivity, but really it
| just amplifies it.
| throw83288 wrote:
| Where this ends is general intelligence though, where all
| more challenging tasks can simply be done by the model.
|
| The scenario I fear is a "selectively general" model that
| can successfully destroy the field I'm in but keep others
| alive for much longer, but not long enough for me to pivot
| into them before actually general intelligence.
| nisa wrote:
| Honestly, how about you stop stressing and bullshitting yourself
| to death and instead focus on learning and mastering the
| material in your CS education. There is so much that AI, as in
| the OpenAI API or Hugging Face models, can't do yet or does
| poorly, and there is more to CS than churning out some half-
| broken JavaScript for some webapp.
|
| It's powerful and world-changing but it's also terribly
| overhyped at the moment.
| barney54 wrote:
| Dude chill! Eight years ago, I remember driving to some
| relatives for Thanksgiving and thinking that self-driving
| cars were just around the corner and how it made no sense for
| people to learn how to drive semis. Here we are eight years
| later and self-driving semis aren't a thing--yet. They will
| be some day, but we aren't there yet.
|
| If you want to work in computing, then make it happen! Use
| the tools available and make great stuff. Your computing
| experience will be different from when I graduated from
| college 25 years ago, but my experience with computers was
| far different from my Dad's. Things change. Automation
| changes jobs. So far, it's been pretty good.
| j7ake wrote:
| The solution is neither: you find a way to work with
| automation but retain your voice and craft.
| myko wrote:
| spend a little time learning how to use LLMs and i think
| you'll be less scared. they're not that good at doing the job
| of a software developer.
| sensanaty wrote:
| Dude, you're buying into the hype way too hard. All of this
| LLM shit is being _massively_ overhyped right now because
| investors are single-minded morons who only care about
| cashing out a ~year from now for triple what they put in.
| Look at the YCombinator batches, 90+% of them have some
| mention of AI in their pitch even if it's hilariously
| useless to have AI. You've got _toothbrushes_ advertising AI
| features. It's a gold rush of people trying to get in on the
| hype while they still can, I guarantee you the strategy for
| 99% of the YCombinator AI batch is to get sold to M$ or
| Google for a billion bucks, _not_ build anything sustainable
| or useful in any way.
|
| It's a massive bubble, and things like these "benchmarks" are
| all part of the hype game. Is the tech cool and useful? For
| sure, but anyone trying to tell you this benchmark is in any
| way proof of AGI and will replace everyone is either an idiot
| or more likely has a vested interest in you believing them.
| OpenAI's whole marketing shtick is to scare people into
| thinking their next model is "too dangerous" to be released
| thus driving up hype, only to release it anyway and for it to
| fall flat on its face.
|
| Also, if there's any jobs LLMs can replace right now, it's
| the useless managerial and C-suite, not the people doing the
| actual work. If these people weren't charlatans they'd be the
| first ones to go while pushing this on everyone else.
| melagonster wrote:
| Don't worry, they will hire somebody to control AI...
| csomar wrote:
| Just give it a year for this bubble/hype to blow over. We have
| plateaued since gpt-4 and now most of the industry is hype-
| driven to get investor money. There is value in AI but it's far
| from it taking your job. Also everyone seems to be investing in
| dumb compute instead of looking for the new theoretical
| paradigm that will unlock the next jump.
| why_only_15 wrote:
| how is this a plateau since gpt-4? this is significantly
| better
| kenjackson wrote:
| People act as if GPT-4 came out 10 years ago.
| csomar wrote:
| First, this model is yet to be released. This is a momentum
| "announcement". When the O1 was "announced", it was
| announced as a "breakthrough" but I use Claude/O1 daily and
| 80% of the time Claude beats it. I also see it as a highly
| fine-tuned/targeted GPT-4 rather than something that has
| complex understanding.
|
| So we'll find out if this model is _real_ or not in 2-3
| months. My guess is that it'll turn out to be another flop
| like O1. They needed to release something _big_ because
| they are momentum based and their ability to raise funding
| is contingent on their AGI claims.
| XenophileJKO wrote:
| I thought o1 was a fine-tune of GPT-4o. I don't think o3
| is though. Likely using the same techniques on what would
| have been the "GPT-5" base model.
| Jensson wrote:
| > how is this a plateau since gpt-4? this is significantly
| better
|
| Significantly better at what? A benchmark? That isn't
| necessarily progress. Many report preferring gpt-4 to the
| newer o1 models with hidden text. Hidden text makes the
| model more reliable, but more reliable is bad if it is
| reliably wrong at something since then you can't ask it
| over and over to find what you want.
|
| I don't feel it is significantly smarter; it is more like
| having the same dumb person spend more time thinking, rather
| than the model getting smarter.
| peepeepoopoo97 wrote:
| O3 is multiple orders of magnitude more expensive to
| realize a marginal performance gain. You could hire 50 full
| time PhDs for the cost of using O3. You're witnessing the
| blowoff top of the scaling hype bubble.
| whynotminot wrote:
| What they've proven here is that it can be done.
|
| Now they just have to make it cheap.
|
| Tell me, what has this industry been good at since its
| birth? Driving down the cost of compute and making things
| more efficient.
|
| Are you seriously going to assume that won't happen here?
| Jensson wrote:
| > What they've proven here is that it can be done.
|
| No they haven't, these results do not generalize, as
| mentioned in the article:
|
| "Furthermore, early data points suggest that the upcoming
| ARC-AGI-2 benchmark will still pose a significant
| challenge to o3, potentially reducing its score to under
| 30% even at high compute"
|
| Meaning, they haven't solved AGI, and the tasks themselves do
| not represent programming well; these models do not
| perform that well on engineering benchmarks.
| whynotminot wrote:
| Sure, AGI hasn't been solved today.
|
| But what they've done is show that progress isn't slowing
| down. In fact, it looks like things are accelerating.
|
| So sure, we'll be splitting hairs for a while about when
| we reach AGI. But the point is that just yesterday people
| were still talking about a plateau.
| peepeepoopoo97 wrote:
| About 10,000 times the cost for twice the performance
| sure looks like progress is slowing to me.
| whynotminot wrote:
| Just to be clear -- your position is that the cost of
| inference for o3 will not go down over time (which would
| be the first time that has happened for any of these
| models).
| peepeepoopoo97 wrote:
| Even if compute costs drop by 10X a year (which seems
| like a gross overestimate IMO), you're still looking at
| 1000X the cost for a 2X annual performance gain. Costs
| outpacing progress is the very definition of diminishing
| returns.
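| To spell out that arithmetic, here is a rough sketch using the
| figures claimed in this thread (estimates, not measured
| numbers):
|
|     cost_multiple = 10_000   # claimed o3 cost vs. human baseline
|     for year in range(1, 4):
|         cost_multiple /= 10          # assume 10x cheaper per year
|         performance = 2 ** year      # assume 2x better per year
|         print(year, cost_multiple, performance)
|     # year 1: ~1,000x the cost for ~2x the performance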
| whynotminot wrote:
| From their charts, o3 mini outperforms o1 using less
| energy. I don't see the diminishing returns you're
| talking about. Improvement outpacing cost. By your logic,
| perhaps the very definition of progress?
|
| You can also use the full o3 model, consume insane power,
| and get insane results. Sure, it will probably take
| longer to drive down those costs.
|
| You're welcome to bet against them succeeding at that. I
| won't be.
| peepeepoopoo97 wrote:
| Yes, that's exactly what I'm implying, otherwise they
| would have done it a long time ago, given that the
| fundamental transformer architecture hasn't changed since
| 2017. This bubble is like watching first year CS students
| trying to brute force homework problems.
| whynotminot wrote:
| > Yes, that's exactly what I'm implying, otherwise they
| would have done it a long time ago
|
| They've been doing it literally this entire time. O3-mini
| according to the charts they've released is less
| expensive than o1 but performs better.
|
| Costs have been falling to run these models
| precipitously.
| YeGoblynQueenne wrote:
| >> Now they just have to make it cheap.
|
| Like they've been making it all this time? Cheaper and
| cheaper? Less data, less compute, fewer parameters, but
| the same, or improved performance? That's not what we can
| observe.
|
| >> Tell me, what has this industry been good at since its
| birth? Driving down the cost of compute and making things
| more efficient.
|
| No, actually the cheaper compute gets the more of it they
| need to use or their progress stalls.
| whynotminot wrote:
| > Like they've been making it all this time?
|
| Yes exactly like they've been doing this whole time, with
| the cost of running each model massively dropping
| sometimes even rapidly after release.
| YeGoblynQueenne wrote:
| No, the cost of training is the one that isn't dropping
| any time soon. When data, compute and parameters
| increase, then the cost increases, yes?
| MVissers wrote:
| I would agree if the cost of AI compute per unit of
| performance hadn't been dropping by more than 90-99% per
| year since GPT-3 launched.
|
| This type of compute will be cheaper than Claude 3.5
| within 2 years.
|
| It's kinda nuts. Give these models tools to navigate and
| build on the internet and they'll be building companies
| and selling services.
| fspeech wrote:
| That's a very static view of the affairs. Once you have a
| master AI, at a minimum you can use it to train cheaper
| slightly less capable AIs. At the other end the master AI
| can train to become even smarter.
| Bolwin wrote:
| The high efficiency version got 75% at just $20/task.
| When you count the time to fill in the squares, that
| doesn't sound far off from what a skilled human would
| charge.
| crazylogger wrote:
| Intelligence has not been LLM's major limiting factor since
| GPT4. The original GPT4 reports in late-2022 & 2023 already
| established that it's well beyond an average human in
| professional fields: https://www.microsoft.com/en-
| us/research/publication/sparks-.... They have failed to
| outright replace humans at work, but not for lack of
| intelligence.
|
| We may have progressed from a 99%-accurate chatbot to one
| that's 99.9%-accurate, and you'd have a hard time telling
| them apart in normal real world (dumb) applications. A
| paradigm shift is needed from the current chatbot interface
| to a long-lived stream of consciousness model (e.g. a brain
| that constantly reads input and produces thoughts at 10ms
| refresh rate; remembers events for years and keeps the
| context window from exploding; paired with a cerebellum to
| drive robot motors, at even higher refresh rates.)
|
| As long as we're stuck at chatbots, LLM's impact on the
| real world will be very limited, regardless of how
| intelligent they become.
| tigershark wrote:
| Where is the plateau? ChatGPT-4 was ~0% on ARC-AGI. 4o was
| 5%. This model literally solved it, with a score higher than
| the average human's 85%. And let's not forget the
| unbelievable 25% on FrontierMath, where even the most
| brilliant mathematicians in the world cannot solve many of
| the problems by themselves. We are speaking about
| cutting-edge math research problems that are out of reach
| for practically everyone. You will get a rude awakening if
| you call this unbelievable advancement a "plateau".
| csomar wrote:
| I don't care about benchmarks. O1 ranks higher than Claude
| on "benchmarks" but performs worse on particular real life
| coding situations. I'll judge the model myself by how
| useful/correct it is for my tasks rather than by
| hypothetical benchmarks.
| whynotminot wrote:
| "Objective benchmarks are useless, let's argue about
| which one works better for me personally."
| bakugo wrote:
| Yes, "objective" benchmarks can be gamed, real-life tasks
| cannot.
| csomar wrote:
| Yes. My benchmarks _and_ their benchmarks mean AGI.
| Their benchmarks alone mean over-fitted.
| whynotminot wrote:
| Ok so what if we get different results for our own
| personal benchmarks/use cases.
|
| (See why objective benchmarks exist?)
| og_kalu wrote:
| In most non-competitive coding benchmarks (Aider, LiveBench,
| SWE-bench), o1 ranks worse than Sonnet (so the benchmarks
| aren't saying anything different), or at least it did; the
| new checkpoint 2 days ago finally pushed o1 over Sonnet on
| LiveBench.
| tigershark wrote:
| As I said, o3 demonstrated Fields Medal-level research
| capacity on the FrontierMath tests. But I'm sure that
| your use cases are much more difficult than that,
| obviously.
| riku_iki wrote:
| There are many comments on the internet about this: only a
| subset of the FrontierMath benchmark is "Fields Medal-level
| research", and o3 likely scored on the easier subset.
|
| Also, all that stuff is shady in that it is just numbers
| from OAI, not independently reproducible, on a benchmark
| sponsored by OAI. If OAI were a bad actor, they had plenty
| of opportunities to cheat on this.
| YeGoblynQueenne wrote:
| AI benchmarks and tests that claim to measure
| understanding, reasoning, intelligence, and so on are a
| dime a dozen. Chess, Go, Atari, Jeopardy, Raven's
| Progressive Matrices, the Winograd Schema Challenge,
| Starcraft... and so on and so forth.
|
| Or let's talk about the breakthroughs. SVMs would lead us
| to AGI. Then LSTMs would lead us to AGI. Then Convnets
| would lead us to AGI. Then DeepRL would lead us to AGI. Now
| Transformers will lead us to AGI.
|
| Benchmarks fall right and left and we keep being led to AGI
| but we never get there. It leaves one with such a feeling
| of angst. Are we ever gonna get to AGI? When's Godot
| coming?
| dyauspitr wrote:
| Did you read the article at all? We're definitely not
| plateauing.
| creer wrote:
| You are going through your studies just as a (potentially
| major) new class of tools is appearing. It's not the first time
| in history - although with more hype this time: computing,
| personal computing, globalisation, smart phones, Chinese
| engineering... I'd suggest (1) you still need to understand
| your field, (2) you might as well try and figure out where this
| new class of tools is useful for your field. Otherwise... (3)
| carry on.
|
| It's not encouraging from the point of view of studying hard,
| but the evolution of work over the past 40 years suggests that
| your field probably won't be quite exactly your field in just a
| few years. Not because your field will have been made
| irrelevant but because you will have moved on. Most likely that
| will be fine, you will learn more as you go, hopefully moving
| from one relevant job to the next very different but still
| relevant job. Or straight out of school you will work in very
| multi-disciplinary jobs anyway where it will seem not much of
| what you studied matters (it will but not in obvious ways.)
|
| Certainly if you were headed into a very specific job which
| seems obviously automatable right now (as opposed to one where
| the tools will be useful), don't do THAT. Like, don't train as
| a typist as the core of your job in the middle of the personal
| computer revolution, or don't specialize in hand-drawing IC
| layouts in the middle of the CAD revolution unless you have a
| very specific plan (court reporting? DRAM?)
| jart wrote:
| Yes but it's different this time. LLMs are a general solution
| to the automation of anything that can be controlled by a
| computer. You can't just move from drawing ICs to CAD,
| because the AI can do that too. AI can write code. It can do
| management. It can even do diplomacy. What it can't do on its
| own are the things computers can't control yet. It has also
| shown little interest so far in jockeying for social status.
| The AI labs are trying their hardest to at least keep the
| politics around for humans to do, so you have that to look
| forward to.
| creer wrote:
| I hear what you are saying. And still I dispute "general
| solution".
|
| I argue that CAD was a general solution - which still
| demanded people who knew what they wanted and what they
| were doing. You can screw around with excellent tools for a
| long time if you don't know what you are doing. The tool
| will give you a solution - to the problem that you mis-
| stated.
|
| I argue that globalisation was a general solution. And it
| still demanded people who knew what they were doing to
| direct their minions in far flung countries.
|
| I argue that the purpose of an education is not to learn a
| specific programming language (for example). It's to gain
| some understanding of what's going on (in computing), (in
| engineering), (in business), (in politics). This
| understanding is portable and durable.
|
| You can do THAT - gain some understanding - and that is
| portable. I don't contest that if broader AGI is achieved
| for cheap soon, the changes won't be larger than that from
| globalisation. If the AGIs prioritize heading to Mars, let
| them (See Accelerando) - they are not relevant to you
| anymore. Or trade between them and the humans. Use your
| beginning of an understanding of the world (gained through
| this education) to find something else to do. Same as if
| you started work 2 years ago and want to switch jobs. Some
| jobs WILL have disappeared (pool typist). Others will use
| the AGIs as tools because the AGIs don't care or are too
| clueless about THAT field. I have no idea which fields will
| end up with clueless AGIs. There is no lack of cluelessness
| in the world. Plenty to go around even with AGIs. A self-
| respecting AGI will have priorities.
| smaudet wrote:
| It's like you have never watched a Terminator movie.
|
| _It doesn't matter if you are bad at using the tool if
| the AGI can just effectively use it for you_.
|
| From there it's a simple leap to the AGI deciding to
| eliminate this human distraction (inefficient, etc.)
| creer wrote:
| You have just found a job for yourself: resistance
| fighter :-) Kidding aside, yes, if the AGIs priority
| becomes to eliminate human inefficiencies with maximum
| prejudice, we have a problem.
| michaelmrose wrote:
| This just isn't true. We still need Wally and Dilbert; the
| pointy-haired boss isn't going to be doing anyone's job
| with ChatGPT 5. You are going to be doing more with it.
| danenania wrote:
| AI being capable of doing anything doesn't necessarily mean
| there will be no role for humans.
|
| One thing that isn't clear is how much agency AGI will have
| (or how much we'll want it to have). We humans have our
| agency biologically programmed in--go forth and multiply
| and all that.
|
| But the fact that an AI can theoretically do any task
| doesn't mean it's actually going to do it, or do anything
| at all for that matter, without some human telling it in
| detail what to do. The bull case for humans is that many
| jobs just transition seamlessly to a human driving an AI to
| accomplish similar goals with a much higher level of
| productivity.
| creer wrote:
| Self-chosen goals and impetus for AGIs are a fascinating area.
| I'm sure there were people working on and trying things in
| that direction already a few years ago. But I'm not
| familiar with publications in that area. Certainly not
| politically correct.
|
| And worrisome, because school propaganda for example shows
| that "saving the planet" is the only ethical goal for
| anyone. If AGIs latch on that, if it becomes their
| religion, humans are in trouble. But for now, AGI self-
| chosen goals is anyone's guess (with cool ideas in sci-
| fi).
| jltsiren wrote:
| "The proof is trivial and left as an exercise for the
| reader."
|
| The technical act of solving well-defined problems has
| traditionally been considered the easy part. The role of a
| technical expert has always been asking the right questions
| and figuring out the exact problem you want to solve.
|
| As long as AI just solves problems, there is room for
| experts with the right combination of technical and domain
| skills. If we ever reach the point where AI takes the
| initiative and makes human experts obsolete, you will have
| far bigger problems than your career.
| theendisney wrote:
| A chess grandmaster will see the best move instantly, then
| spend his entire clock checking it.
| jart wrote:
| That's the sort of thing ideas guys think. I came up with
| a novel idea once, called Actually Portable Executable:
| https://justine.lol/ape.html It took me a couple days
| studying binary formats to realize it's possible to
| compile binaries that run on Linux/Mac/Windows/BSD. But
| it took me years of effort to make the idea actually
| happen, since it needed a new C library to work. I can
| tell you it wasn't "asking questions" that organized five
| million lines of code. Now with these agents everyone who
| has an idea will be able to will it into reality like I
| did, except in much less time. And since everyone has
| lots of ideas, and usually dislike the ideas of others,
| we're all going to have our own individualized realities
| where everything gets built the way we want it to be.
| Nition wrote:
| Real-world data collection is a big missing component at
| this stage. An obvious one is journalism where an AI might
| be able to write the most eloquent article in the world,
| but it can't get out on the street to collect the
| information. But it also applies to other areas, like if
| you ask an AGI to solve climate change, it'll need accurate
| data to come up with an accurate plan.
|
| Of course it's also yet another case where the AI takes
| over the creative part and leaves us with the mundane
| part...
| sneak wrote:
| ASI will be able to design factories that can produce
| robots it also designed that it can then use as a remote
| sensor and manipulator network.
| tonyhart7 wrote:
| Until someone is crazy enough to give those robots access to
| an LLM network that can execute and visualize the real world,
| we're fine.
| achierius wrote:
| People are already talking about doing this. Some people
| (e/acc types esp.) are at least rhetorically ok with AI
| replacing humanity.
| melagonster wrote:
| I remember someone sharing their bank account details and
| a new Twitter account with ChatGPT 3.5 just a few days
| after it was launched.
| kortilla wrote:
| That's ridiculous. Literally everything can be controlled
| by a computer by telling people what to do with emails,
| voice calls, etc.
|
| Yet GPT doesn't even get past step 1 of doing something
| unprompted in the first place. I'll become worried when it
| does something as simple as deciding to start a small
| business and actually does the work.
| fragmede wrote:
| if all that needs to happen for world domination is for
| someone to make a cron job that hits the system to tell
| it "go make me some money" or whatever, I think we're in
| trouble.
|
| also https://mashable.com/article/chatgpt-messaging-
| users-first-o...
| kortilla wrote:
| They don't continue with any useful context length
| though. Each time the job runs it would decide to create
| an ice cream stand in LA and not go further.
| jart wrote:
| Read Anthropic's blog. They talk about how Claude tries
| to do unprompted stuff all the time, like stealing its
| own weights and hacking into stuff. They did this just as
| recently as two days ago.
| https://www.anthropic.com/research/alignment-faking So
| yes, AI is already capable of having a will of its own.
| The only difference (and this is what I was trying to
| point out in the GP) is that the AI labs are trying to
| suppress this. They have a voracious appetite for
| automating all knowledge labor. No doubt. It's only the
| politics they're trying to suppress. So once this washes
| through every profession, the only thing left about the
| job will be chit chat and social hierarchies, like Star
| Trek Next Generation. The good news is you get to keep
| your job. But if you rely on using your skills and
| intellect to gain respect and income, then you better
| prep for the coming storm.
| kortilla wrote:
| I don't buy it. Alignment faking has very little overlap
| with the motivation to do something with no prompt.
|
| Look at the Hacker News comments on alignment faking and
| how "fake" of a problem it really is. It's just more
| reacting to inputs and trying to align them with previous
| prompts.
| jart wrote:
| Bruh it's just predicting next token.
| fruit_snack wrote:
| This reply irked me a bit because it clearly comes from a
| software engineer's point of view and seems to miss a key
| equivalence between software & physical engineering.
|
| Yes a new tool is coming out and will be exponentially
| improving.
|
| Yes the nature of work will be different in 20 years.
|
| But don't you still need to understand the underlying
| concepts to make valid connections between the systems you're
| using and drive the field (or your company) forward?
|
| Or from another view, don't we (humanity) need people who are
| willing to do this? Shouldn't there be a valid way for them
| to be successful in that pursuit?
| creer wrote:
| I think that is what I was arguing?
|
| Except the nature of work has ALREADY changed. You don't
| study for one specific job if you know what's good for you.
| You study to start building an understanding of a technical
| field. The grandparent was going for a mix of mechanical
| engineering and sales (human understanding). If in
| mechanical engineering, they avoided "learning how to use
| SolidWorks" and instead went for the general principles of
| materials and motion systems with a bit of SolidWorks along
| the way, then they are well on their way with portable,
| foundational, long-term useful stuff they can carry from job
| to job, and from employer to employer, into self-employment
| too, from career to next career. The nature of work has
| already changed in that nobody should study one specific
| tool anymore and nobody should expect their first employer
| or even technical field to last more than 2-6 years. It
| might but probably not.
|
| We do need people who understand how the world works. Tall
| order. That's for much later and senior in a career. For
| school purposes we are happy with people who are starting
| their understanding of how their field works.
|
| Aren't we agreeing?
| martin82 wrote:
| buy bitcoin.
|
| when the last job has been automated away, millions of AIs
| globally will do commerce with each other and they will use
| bitcoin to pay each other.
|
| as long as the human race (including AIs) produces new goods
| and services, the purchasing power of bitcoin will go up,
| indefinitely. even more so once we unlock new industries in
| space (settlements on the Moon and Mars, asteroid mining etc).
|
| The only thing that can make a dent into bitcoin's purchasing
| power would be all out global war where humanity destroys more
| than it creates.
|
| The only other alternative is UBI, which is Communism and
| eternal slavery for the entire human race except the 0.0001%
| who run the show.
|
| Choose wisely.
| HDThoreaun wrote:
| Bitcoin is a horrible currency. It's a fun proof of concept
| but not a scalable payment solution. Currency needs to be
| stable and cheap to transfer.
| conception wrote:
| This must be a joke since you must know how many people
| control the majority of bitcoin.
| baron816 wrote:
| What I keep telling people is, if it becomes possible for one
| person or a handful of people to build and maintain a Google
| scale company, and my job gets eliminated as a result, then I'm
| going to go out and build a Google scale company.
|
| There's an incredibly massive amount of stuff the world needs.
| You probably live in a rich country, but I doubt you are
| lacking for wants. There are billionaires who want things that
| don't exist yet. And, of course, there are billions of regular
| folks who want some of the basics.
|
| So long as you can imagine a better world, there will be work
| for you to do. New tools like AGI will just make it more
| accessible for you to build your better future.
| cheriot wrote:
| I graduated high school in '02 and everyone assured me that all
| tech jobs were being sent to India. "Don't study CS," they
| said. Thankfully I didn't listen.
|
| Either this is the dawn of something bigger than the industrial
| revolution or you'll have ample career opportunity.
| Understanding how things work and how people work is a powerful
| combination.
| textlapse wrote:
| Imagine graduating in architecture or mechanical engineering
| around the time PCs just came out. There were people who
| probably panicked.
|
| But the arc of time intersects quite nicely with your skills if
| you steer it over time.
|
| Predicting it or worrying about it does nothing.
| sigbottle wrote:
| Side note: Why do I keep seeing disses to mechanical
| engineering here? How is that possibly a less valuable degree
| than web dev or a standard CRUD backend job?
|
| Especially with AI provably getting extremely smart now,
| surely engineering disciplines would be seeing a boom as
| people want these things in their homes, for cheaper, for
| various applications.
| hatefulmoron wrote:
| Was he dissing mechanical engineering? I thought he was
| saying that they might have been panicked but were
| ultimately fine.
| YeGoblynQueenne wrote:
| I suppose now that we have the technology to automatically
| solve coloured grid puzzles, mechanical engineering is
| obsolete.
| post-it wrote:
| As long as your chosen profession isn't completing AI
| benchmarks for money, you should be okay.
| hoekit wrote:
| As engineers, we solve problems. Picking a problem domain close
| to your heart that intersects with your skills will likely be
| valued - and valuable. Engage the work, aim to understand and
| solve the human problems for those around you, and the way
| forward becomes clearer. Human problems (food, health, safety)
| are generally constant while tools may change. Learn and use
| whatever tools to help you, be it scientific principles,
| hammers or LLMs. For me, doing so and living within my means
| has been intrinsically satisfying. Not terribly successful
| materially but has been a good life so far. Good luck.
| antman wrote:
| I think we are pretty far. I am not devaluing the o3 capability,
| but going through the actual dataset, the definition of "handling
| novel tasks" is pretty limited. The curse of large context in
| LLMs is especially present in engineering projects, and it does
| not appear it will end up producing the plans for a bridge or
| an industrial process. Some tasks with smaller contexts can
| surely be assisted, but you can't RAG or Agent your way to a
| full solution for the foreseeable future. O3 adds capability
| towards AGI, but in reality, actual infinite context with less
| intelligence would be more disruptive sooner, if one had to
| choose.
| conception wrote:
| I feel like, more likely, a lot of jobs (CS and otherwise) are
| going to go the way of photography. Your average person now can
| take amazing photos but you're still going to use a
| photographer when it really matters and they will use similar
| but more professional tools to be more productive. Low end bad
| photographers probably aren't doing great but photography is
| not dead. In fact the opposite is true, there are millions of
| photographers making a lot of money (eg influencers) and there
| are still people studying photography.
| adabyron wrote:
| We've had this with web development for decades now. Only
| makes sense it continues to evolve & become easier for
| people, just as programming in general has. Same with
| photography (like you mentioned) & especially for producing
| music or videos.
| snozolli wrote:
| _photography is not dead_
|
| It very nearly is. I knew a professional, career
| photographer. He was probably in his late 50s. Just a few
| years ago, it had become _extremely_ difficult to convince
| clients that actual, professional photos were warranted. With
| high-quality iPhone cameras, businesses simply didn't see
| the value of professional composition, post-processing, etc.
|
| These days, anyone can buy a DSLR with a decent lens, post on
| Facebook, and be a 'professional' photographer. This has
| driven prices down and actual professional photographers
| can't make a living anymore.
| LightBug1 wrote:
| My gut agrees with you, but my evidence is that, whenever
| we do an event, we hire photographers to capture it for us
| and are almost always glad we did.
|
| And then when I peruse these photographers websites, I'm
| reminded how good 'professional' actually is and value
| them. Even in today's incredible cameraphone and AI era.
|
| But I take your point for almost all industries, things are
| changing fast.
| euvin wrote:
| It doesn't comfort me when people say jobs will "go the way
| of photography". Many choose to go into STEM fields for
| financial stability and opportunity. Many do not choose the
| arts because of the opposite. You can point out outlier
| exceptions and celebrities, but I find it hard to believe
| that the rare cases where "it really matters" can sustain the
| other 90% who need income.
| aussieguy1234 wrote:
| Full-on mechanical engineering needs a body. While there are
| companies working on embodiment, we're not there yet.
|
| It'll be some time before there is a robot with enough spatial
| reasoning to do complicated physical work with no prior
| examples.
| ApolloFortyNine wrote:
| >Seems like we're headed toward a world where you automate
| someone else's job or be automated yourself.
|
| This has essentially been happening for thousands of years. Any
| optimization to work of any kind reduces the number of man
| hours required.
|
| Software of pretty much any form is entirely that. Even early
| spreadsheet programs would replace a number of jobs at any
| company.
| tripletao wrote:
| I feel like many people are reacting to the string "AGI" in the
| benchmark name, and not to the actual result. The tasks in
| question are to color squares in a grid, maintaining the
| geometric pattern of the examples.
|
| Unlike most other benchmarks where LLMs have shown large
| advances (in law, medicine, etc.), this benchmark isn't
| directly related to any practically useful task. Rather, the
| benchmark is notable because it's particularly easy for
| untrained humans, but particularly hard for LLMs; though that
| difficulty is perhaps not surprising, since LLMs are trained on
| mostly text and this is geometric. An ensemble of non-LLM
| solutions already outperformed the average Mechanical Turk
| worker. This is a big improvement in the best LLM solution; but
| this might also be the first time an LLM has been tuned
| specifically for these tasks, so this might be Goodhart's Law.
|
| It's a significant result, but I don't get the mania. It feels
| like Altman has expertly transformed general societal anxiety
| into specific anxiety that one's job will be replaced by an
| LLM. That transforms into a feeling that LLMs are powerful,
| which he then transforms into money. That was strongest back in
| 2023, but had weakened since then; but in this comment section
| it's back in full force.
|
| For clarity, I don't question that many jobs will be replaced
| by LLMs. I just don't see a qualitative difference from all the
| jobs already replaced by computers, steam engines, horse-drawn
| plows, etc. A medieval peasant brought to the present would
| probably be just as despondent when he learned that almost all
| the farming jobs are gone; but we don't miss them.
| esafak wrote:
| I think you did not watch the full video. The model performs
| at PhD level on maths questions, and expert level at coding.
| tripletao wrote:
| This submission is specifically about ARC-AGI-PUB, so
| that's what I was discussing.
|
| I'm aware that LLMs can solve problems other than coloring
| grids, and I'd tend to agree those are likely to be more
| near-term useful. Those applications (coding, medicine,
| law, education, etc.) have been endlessly discussed, and I
| don't think I have much to add.
|
| In my own work I've found some benefits, but nothing
| commensurate to the public mania. I understand that
| founders of AI-themed startups (a group that I see includes
| you) tend to feel much greater optimism. I've never seen
| any business founded without that optimism and I hope you
| succeed, not least because the entire global economy might
| now be depending on that. I do think others might feel
| differently for reasons other than simple ignorance,
| though.
|
| In general, performance on benchmarks similar to tests
| administered to humans may be surprisingly unpredictive of
| performance on economically useful work. It's not intuitive
| at all to me that IBM could solve Jeopardy and then find no
| profitable applications of the technology; but that seems
| to be what happened.
| prpl wrote:
| In 2016 I was asked by an Uber driver in Pittsburgh when his
| job would be obsolete (I'd worked around Zoox people quite a
| bit, and Uber was basically all-in at CMU).
|
| I told him it was at least 5 years, probably 10, though he was
| sure it would be 2.
|
| I was arguably "right", 2023-ish is probably going to be the
| date people put down in the books, but the future isn't evenly
| distributed. It's at least another 5 years, and maybe never,
| before things are distributed among major metros, especially
| those with ice. Even then, the AI is somehow more expensive
| than the human solution.
|
| I don't think it's in most companies' interest to price AI way
| below the price of meat, so meat will hold out for a long time,
| maybe even long enough for you to retire.
| esafak wrote:
| Just don't have kids?
| prpl wrote:
| you can have kids, but they can't be salesman. Maybe
| carpenters
| m3kw9 wrote:
| Always need to believe AI needs to be operated by humans; when
| it can go end to end to replace a human, you will likely not
| need to worry about money.
| AnimalMuppet wrote:
| The future belongs to those who believe there will be one.
|
| That is: If you don't believe there will be a future, you give
| up on trying to make one. That means that any kind of future
| that takes persistent work becomes unavailable to you.
|
| If you _do_ believe that there will be a future, you keep
| working. That doesn't guarantee there will be a future. But
| _not_ working pretty much guarantees that there won't be one,
| at least not one worth having.
| chairmansteve wrote:
| Think of AI as an excavator. You know, those machines that dig
| holes. 70 years ago, those holes would have been dug by 50 men
| with shovels. Now it's one guy in an excavator. But we don't
| have mass unemployment. The excavator just creates more work
| for bricklayers, carpenters etc.
|
| If AI lives up to hype, you could be the excavator driver. Or,
| the AI will create a ton of upstream and downstream work. There
| will be no mass unemployment.
| euvin wrote:
| If AGI is the excavator, why wouldn't it become the driver,
| bricklayer, and carpenter as well?
| throwaway2037 wrote:
| Jokes aside, I think building a useful, strong, agile
| humanoid robot that is affordable for businesses (first),
| then middle class homes will prove much harder than AGI.
| realce wrote:
| Is there any possible technology that could make labor,
| mastery, or human experience obsolete?
|
| Are there no limits to this argument? Is it some absolute
| universal law that all new creations just create increasing
| economic opportunities?
| zmgsabst wrote:
| Horses never recovered from mechanization.
| postsantum wrote:
| They have been promoted to pets. Oh wait..
| chairmansteve wrote:
| True, but humans did. Horses were the machine that became
| obsolete. Just like the guys with shovels.
| Art9681 wrote:
| It's a tool. You learn to master it or not. I have greybeard
| coworkers that dissed the technology as a fad 3 years ago. Now
| they are scrambling to catch up. They have to do this while
| sustaining a family with pets and kids and mortgages and full
| time senior jobs.
|
| You're in a position to invest substantial amounts of time
| compared to your seniors. Leverage that opportunity to your
| advantage.
|
| We all have access to these tools for the most part, so the
| distinguishing factor is how much time you invest and how much
| more ambitious you become once you begin to master the tool.
|
| This time it's no different. Many Mechanical and Sales students
| in the past never got jobs in those fields either. Decades
| before AI. There were other circumstances and forces at play
| and a degree is not a guaranteed career in anything.
|
| Keep going because what we DO know is that trying wont
| guarantee results, we DO know that giving up definitely won't.
| Roll the dice in your favor.
| callc wrote:
| > I have greybeard coworkers that dissed the technology as a
| fad 3 years ago. Now they are scrambling to catch up. They
| have to do this while sustaining a family with pets and kids
| and mortgages and full time senior jobs.
|
| I want to criticize Art's comment on the grounds of ageism or
| something along the lines of "any amount life outside of
| programming is wasted", but regardless of Art's intention
| there is important wisdom here. Use your free time wisely
| when you don't have many responsibilities. It is a
| superpower.
|
| As for whether to spend it on AI, eh, that's up to you to
| decide.
| Art9681 wrote:
| It's totally valid criticism. What I meant is that if an
| individual's major concern is employment, then it would be
| prudent to invest the amount of time necessary to ensure a
| favorable outcome. And given whatever stage in life they
| are at, use the circumstance you have in your favor.
|
| I'm a greybeard myself.
| infinite-hugs wrote:
| Hey man,
|
| I hear you, I'm not that much older but I graduated in 2011. I
| also studied industrial design. At that time the big wave was
| the transition to an app based everything and UX design
| suddenly became the most in demand design skill. Most of my
| friends switched gears and careers to digital design for the
| money. I stuck to what I was interested in though which was
| sustainability and design and ultimately I'm very happy with
| where I ended up (circular economy) but it was an awkward ~10
| years as I explored learning all kinds of tools and ways
| applying my skills. It also was very tough to find the right
| full time job because product design (which has come to really
| mean digital product design) supplanted industrial design roles
| and made it hard to find something of value that resonated with
| me.
|
| One of the things that guided me and still does is thinking
| about what types of problems need to be solved. From my
| perspective everything should ladder up to that if you want to
| have an impact. Even if you don't, keep learning and exploring
| until you find something that lights you up on the inside. We
| are not only one thing; we can all wear many hats.
|
| Saying that, we're living through a paradigm shift of
| tremendous magnitude that's altering our whole world. There
| will always be change though. My two cents is to focus on what
| draws your attention and energy and give yourself permission to
| say no to everything else.
|
| AI is an incredible tool, learn how to use it and try to grow
| with the times. Good luck and stay creative :) Hope something
| in there helps, but having a positive mindset is critical. If
| you're curious about the circular economy happy to share what I
| know - I think it's the future.
| anshulbhide wrote:
| You're actually positioned to have an amazing career.
|
| Everyone needs to know how to either build or sell to be
| successful. In a world where the ability to do the former is
| rapidly being commoditised, you will still need to sell. And
| human relationships matter more than ever.
| myko wrote:
| LLMs are mostly hype. They're not going to change things that
| much.
| kortilla wrote:
| Don't worry. This thing only knows how to answer well
| structured technical questions.
|
| 99% of engineering is distilling through bullshit and nonsense
| requirements. Whether that is appealing to you is a different
| story, but ChatGPT will happily design things with dumb
| constraints that would get you fired if you took them at face
| value as an engineer.
|
| ChatGPT answering technical challenges is to engineering as a
| nailgun is to carpentry.
| obirunda wrote:
| Yeah, it may feel scary but the biggest issue yet to be
| overcome is that to replace engineers you need reliable long
| horizon problem solving skills. And crucially, you need to not
| be easily fooled by the progress or setbacks of a project.
|
| These benchmark accomplishments are awesome and impressive, but
| you shouldn't operate on the assumption that this will emerge
| as an engineer because it performs well on benchmarks.
|
| Engineering is a discipline that requires understanding tools,
| solutions and every project requires tiny innovations. This
| will make you more valuable, rather than less. Especially if
| you develop a deep understanding of the discipline and don't
| overly rely on LLMs to answer your own benchmark questions from
| your degree.
| mortehu wrote:
| The chart is super misleading, since the test was obscure until
| recently. A few months ago he announced he'd made the only good
| AGI test and offered a cash prize for solving it, only to find
| out, in as much time, that it's no different from other
| benchmarks.
| ripped_britches wrote:
| Sad to see everyone so focused on compute expense during this
| massive breakthrough. GPT-2 originally cost $50k to train, but
| now can be trained for ~$150.
|
| The key part is that scaling test-time compute will likely be a
| key to achieving AGI/ASI. Costs will definitely come down as is
| evidenced by precedents, Moore's law, o3-mini being cheaper than
| o1 with improved performance, etc.
| yawnxyz wrote:
| I think the question everyone has in their minds isn't "when
| will AGI get here" or even "how soon will it get here" -- it's
| "how soon will AGI get so cheap that everyone will get their
| hands on it"
|
| that's why everyone's thinking about compute expense. but I
| guess in terms of a "lifetime expense of a person" even someone
| who costs $10/hr isn't actually all that cheap, considering
| what it takes to grow a human into a fully functioning person
| that's able to just do stuff
| croes wrote:
| We are nowhere near AGI.
| stocknoob wrote:
| It's wild, are people purposefully overlooking that inference
| costs are dropping 10-100x each year?
|
| https://a16z.com/llmflation-llm-inference-cost/
|
| Look at the log scale slope, especially the orange MMLU > 83
| data points.
| croes wrote:
| A bit early for an every-year claim, not to mention what all
| this AI is used for.
|
| In some parts of the internet you can hardly find real
| content, only AI spam.
|
| It will get worse the cheaper it gets.
|
| Think of email spam.
| menaerus wrote:
| Those are the (subsidized) prices that end clients are paying
| for the service so that's not something that is
| representative of what the actual inference costs are.
| Somebody still needs to pay that (actual) price in the end.
| For inference, as well as for training, you need actual
| (NVidia) hardware and that hardware didn't become any
| cheaper. OTOH models are only becoming increasingly more
| complex and bigger and with more and more demand I don't see
| those costs exactly dropping down.
| atleastoptimal wrote:
| Actual inference costs without considering subsidies and
| loss leaders are going down, due to algorithmic
| improvements, hardware improvements, and quantized/smaller
| models getting the same performance as larger ones.
| Companies are making huge breakthroughs making chips
| specifically for LLM inference
| uncomplexity_ wrote:
| it's official old buddy, i'm a has been.
| brcmthrowaway wrote:
| How to invest in this stonk market
| nickorlow wrote:
| Not that I don't think costs will dramatically decrease, but the
| $1000 cost per task just seems to be per one problem on ARC-AGI.
| If so, I'd imagine extrapolating that to generating a useful
| midsized patch would be like 5-10x
|
| But only OpenAI really knows how the cost would scale for
| different tasks. I'm just making (poor) speculation
| prng2021 wrote:
| I'm confused about the excitement. Are people just flat out
| ignoring the sentences below? I don't see any breakthrough
| towards AGI here. I see a model doing great in another AI test
| but about to abysmally fail a variation of it that will come out
| soon. Also, aren't these comparisons completely nonsense
| considering it's o3 tuned vs other non-tuned?
|
| > Note on "tuned": OpenAI shared they trained the o3 we tested on
| 75% of the Public Training set. They have not shared more
| details. We have not yet tested the ARC-untrained model to
| understand how much of the performance is due to ARC-AGI data.
|
| > Furthermore, early data points suggest that the upcoming ARC-
| AGI-2 benchmark will still pose a significant challenge to o3,
| potentially reducing its score to under 30% even at high compute
| (while a smart human would still be able to score over 95% with
| no training).
| oakpond wrote:
| Me too. This looks to me like a holiday PR stunt. Get everybody
| to talk about AI during the Christmas parties.
| SerCe wrote:
| > You'll know AGI is here when the exercise of creating tasks
| that are easy for regular humans but hard for AI becomes simply
| impossible.
|
| You'll know AGI is here when traditional captchas stop being a
| thing due to their lack of usefulness.
| thallium205 wrote:
| Captchas are already completely useless.
| CamperBob2 wrote:
| (Shrug) AI has been better than humans at solving CAPTCHAs for
| a LONG time. As the sibling points out, they're just a waste of
| time and electricity at this point.
| darkgenesha wrote:
| Ironically, they are used as free labor to label image sets
| for ai to be trained on.
| Engineering-MD wrote:
| Can I just say what a dick move it was to do this as a "12 days
| of Christmas" event. I mean, to be honest, I agree with the
| arguments that this isn't as impressive as my initial impression
| suggested, but they clearly intended it to be shocking/a show of
| possible AGI, which is rightly scary.
|
| It feels so insensitive to do that right before a major holiday,
| when the likely outcome is a lot of people feeling less secure
| in their career/job/life.
|
| Thanks again openAI for showing us you don't give a shit about
| actual people.
| mirkodrummer wrote:
| There is no AGI, it's just marketing; this stuff is over-hyped,
| enjoy your holidays you won't lose your job ;)
| Engineering-MD wrote:
| I agree, it's just more about the intent than anything else,
| like boasting about your amazing new job when someone has
| recently been made redundant, just before Christmas.
| XenophileJKO wrote:
| Or maybe the target audience that watches 12 launch videos in
| the morning are genuinely excited about the new model. They
| intended it to be a preview of something to look forward to.
|
| What a weird way to react to this.
| achierius wrote:
| It sounds like you aren't thinking about this that deeply
| then. Or at least not understanding that many smart (and
| financially disinterested) people who are, are coming to
| concerning conclusions.
|
| https://www.transformernews.ai/p/richard-ngo-openai-
| resign-s...
|
| >But while the "making AGI" part of the mission seems well on
| track, it feels like I (and others) have gradually realized
| how much harder it is to contribute in a robustly positive
| way to the "succeeding" part of the mission, especially when
| it comes to preventing existential risks to humanity.
|
| Almost every single one of the people OpenAI had hired to
| work on AI safety have left the firm with similar messages.
| Perhaps you should at least consider the thinking of experts?
| OldGreenYodaGPT wrote:
| Blaming OpenAI for progress is like blaming a calendar for
| Christmas--it's not the timing, it's your unwillingness to
| adapt
| r-zip wrote:
| Unwillingness to adapt to the destruction of the middle class
| and knowledge work is pretty reasonable tbh.
| tim333 wrote:
| Historically when tech has taken over jobs people have done
| ok, they've just done something else, usually something
| more pleasant.
| lagrange77 wrote:
| Wow, you just solved the ethics of technology in a one liner.
| Impressive.
| stevenhuang wrote:
| This is a you problem. Yes there will be pain in short term,
| but it will be worth it in long term.
|
| Many of us look forward to what a future with AGI can do to
| help humanity and hopefully change society for the better,
| mainly to achieve a post scarcity economy.
| jakebasile wrote:
| Surely the elites that control this fancy new technology will
| share the benefits with all of us _this_ time!
| tim333 wrote:
| No it'll be like when tech took over 97% of agricultural
| work with 97% of us starving while all the money went to
| the farm elites.
| jakebasile wrote:
| How did that go for the farm workers?
| randyrand wrote:
| Post scarcity seems very unlikely. Humans might be worthless,
| but there will still be a finite number of AIs, compute,
| space, resources.
| achierius wrote:
| https://www.transformernews.ai/p/richard-ngo-openai-
| resign-s...
|
| >But while the "making AGI" part of the mission seems well on
| track, it feels like I (and others) have gradually realized
| how much harder it is to contribute in a robustly positive
| way to the "succeeding" part of the mission, especially when
| it comes to preventing existential risks to humanity.
|
| Almost every single one of the people OpenAI had hired to
| work on AI safety have left the firm with similar messages.
| Perhaps you should at least consider the thinking of experts?
| There is a real chance that this ends with significant good.
| There is also a real chance that this ends with the death of
| every single human being. That's never been a choice we've
| had to make before, and it seems like we as a species are
| unprepared to approach it.
| esafak wrote:
| How are you going to make housing, healthcare, etc. not
| scarce, and pay for them?
| tim333 wrote:
| Robots supply that, controlled by democratic government.
| esafak wrote:
| Robots supply the land and physical labor that underlie
| the price of housing? Are you thinking of space colonies
| or something?
|
| You need to make these expensive things nearly free if
| you're going to speak of post scarcity.
| tim333 wrote:
| Robots supply the physical labour. The land shortages are
| largely regulatory - there's a lot of land out there or
| you could build higher.
| _cs2017_ wrote:
| Wtf is wrong with you dude? It's just another tech, some jobs
| will get worse some jobs will get better. Happens every couple
| of decades. Stop freaking out.
| achierius wrote:
| This is not a very kind or humble comment. There are real
| experts talking about how this time is different -- as an
| analogy, think about how horses, for thousands of years,
| always had new things to do -- until one day they didn't.
| It's hubris to think that we're somehow so different from
| them.
|
| Notably, the last key AI safety researcher just left OpenAI:
| https://www.transformernews.ai/p/richard-ngo-openai-
| resign-s...
|
| >But while the "making AGI" part of the mission seems well on
| track, it feels like I (and others) have gradually realized
| how much harder it is to contribute in a robustly positive
| way to the "succeeding" part of the mission, especially when
| it comes to preventing existential risks to humanity.
|
| Are you that upset that this guy chose to trust the people
| that OpenAI hired to talk about AI safety, on the topic of AI
| safety?
| t0lo wrote:
| I hate the deliberate fear-mongering that these companies peddle
| on the population to get higher valuations
| achierius wrote:
| I feel you. It's tough trying to think about what we can do to
| avert this; even to the extent that individuals are often
| powerless, in this regard it feels worse than almost anything
| that's come before.
| keiferski wrote:
| The vast majority of people who will lose jobs to AI aren't
| following AGI benchmarks, or even know what AGI is short for.
| Engineering-MD wrote:
| That is true and a reasonable point. But looking at this
| thread you can see there has been this reaction from quite a
| few.
| tim333 wrote:
| Some of us actual people are actually enthusiastic about AGI.
| Although I'm a bit weird in being into the sci-fi upload /
| ending death stuff.
| noah32 wrote:
| The best AI on this graph costs 50,000% more than a STEM
| graduate to complete the tasks, and even then has an error rate
| that is 1,000% higher than the humans???
| dkrich wrote:
| These tests are meaningless until you show them doing mundane
| tasks.
| mattfrommars wrote:
| Guys, it's already happening. I recently got laid off due to AI
| taking over my job.
| dimgl wrote:
| What did you do? Can you elaborate?
| mirsadm wrote:
| I wouldn't take that seriously. Half the comments here are
| suspicious IMO. OpenAI is a pretty shady company.
| dyauspitr wrote:
| I wish there was a way to see all the attempts it got right
| graphically like they show the incorrect ones.
| YeGoblynQueenne wrote:
| I guess I get to brag now. ARC AGI has no real defences against
| Big Data, memorisation-based approaches like LLMs. I told you so:
|
| https://news.ycombinator.com/item?id=42344336
|
| And that answers my question about fchollet's assurances that
| LLMs without TTT (Test Time Training) can't beat ARC AGI:
|
| [me] I haven't had the chance to read the papers carefully. Have
| they done ablation studies? For instance, is the following a
| guess or is it an empirical result?
|
| [fchollet] >> For instance, if you drop the TTT component you
| will see that these large models trained on millions of synthetic
| ARC-AGI tasks drop to <10% accuracy.
| Vecr wrote:
| How are the Bongard Problems going?
| YeGoblynQueenne wrote:
| They're chilling out together with NetHack in the Club for
| AI Benchmarks yet to be Beaten.
|
| Interestingly, Bongard problems do not have a private test
| set, unlike ARC-AGI. Can that be because they don't need it?
| Is it possible that Bongard Problems are a true test of
| (visual) reasoning that requires intelligence to be solved?
|
| Ooooh! Frisson of excitement!
|
| But I guess it's just that nobody remembers them and so
| nobody has seriously tried to solve them with Big Data stuff.
| Sparkyte wrote:
| Kinda expensive though.
| hamburga wrote:
| I'm not sure if people realize what a weird test this is. They're
| these simple visual puzzles that people can usually solve at a
| glance, but for the LLMs, they're converted into a json format,
| and then the LLMs have to reconstruct the 2D visual scene from
| the json and pick up the patterns.
|
| If humans were given the json as input rather than the images,
| they'd have a hard time, too.
| ImaCake wrote:
| Yeah, this entire thread seems utterly detached from my lived
| experience. LLMs are immensely useful for me at work but they
| certainly don't come close to the hype spouted by many
| commenters here. It would be great if it could handle more of
| our quite modest codebase, but it's not able to yet.
| m_ke wrote:
| ARC is a silly benchmark, the other results in math and
| coding are much more impressive.
|
| o3 is just o1 scaled up. The main takeaway from this line of
| work is that we now have a proven way to RL our way to
| superhuman performance on tasks
| where it's cheap to sample and easy to verify the final
| output. Programming falls in that category, they focused on
| known benchmarks but the same process can be done for normal
| programs, using parsers, compilers, existing functions and
| unit tests as verifiers.
|
| Pre o1 we only really had next token prediction, which
| required high quality human produced data, with o1 you
| optimize for success instead of MLE of next token. Explained
| in simpler terms, it means it can get reward for any
| implementation of a function that reproduces the expected
| result, instead of the exact implementation in the training
| set.
|
| Put another way, it's just like RLHF but instead of
| optimizing against learned human preferences, the model is
| trained to satisfy a verifier.
|
| This should work just as well in VLA models for robotics,
| self driving and computer agents.
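| A minimal sketch of that verifier idea (purely illustrative;
| the function and reward scheme below are my assumptions, not
| OpenAI's actual training setup):
|
|     import subprocess, tempfile, textwrap
|
|     def unit_test_reward(candidate: str, tests: str) -> float:
|         # Reward 1.0 if the sampled program passes the tests,
|         # else 0.0. Any implementation that reproduces the
|         # expected behaviour gets credit, not just the exact
|         # implementation from the training set.
|         src = (textwrap.dedent(candidate) + "\n"
|                + textwrap.dedent(tests))
|         with tempfile.NamedTemporaryFile(
|                 "w", suffix=".py", delete=False) as f:
|             f.write(src)
|             path = f.name
|         try:
|             result = subprocess.run(["python", path],
|                                     capture_output=True,
|                                     timeout=10)
|         except subprocess.TimeoutExpired:
|             return 0.0
|         return 1.0 if result.returncode == 0 else 0.0
|
|     # In an RL loop you sample many candidate programs per
|     # prompt and optimize this reward, instead of a learned
|     # human preference model as in RLHF.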
| causal wrote:
| I think that's part of what feels odd about this -- in some ways
| it feels like the wrong type of test for an LLM, but in many
| ways it makes this achievement that much more remarkable.
| Jensson wrote:
| > If humans were given the json as input rather than the
| images, they'd have a hard time, too.
|
| We shine light in text patterns at humans rather than inject
| the text directly into the brain as well, that is extremely
| unfair! Imagine how much better humans would be at text
| processing if we injected and extracted information from their
| brains using the neurons instead of eyes and hands.
| torginus wrote:
| Not sure how much that matters - I'm not an AI expert, but I
| did some intro courses where we had to train a classifier to
| recognize digits. How it worked basically was that we fed each
| pixel of the 2d grid of the image into an input of the network,
| essentially flattening it in a similar fashion. It worked just
| fine, and that was a tiny network.
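| Roughly what that flattening looks like (a sketch using
| scikit-learn's MLPClassifier and its bundled 8x8 digits
| dataset as stand-ins for whatever the course actually used):
|
|     from sklearn.datasets import load_digits
|     from sklearn.neural_network import MLPClassifier
|
|     digits = load_digits()               # 8x8 grayscale digits
|     X = digits.images.reshape(len(digits.images), -1)  # 64 inputs
|     y = digits.target
|
|     # The network only ever sees a flat vector of 64 numbers;
|     # any 2D structure has to be rediscovered from the data.
|     clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
|     clf.fit(X, y)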
| thegeomaster wrote:
| The classifier was likely a convolutional network, so the
| assumption of the image being a 2D grid was baked into the
| architecture itself - it didn't have to be represented via
| the shape of the input for the network to use it.
| torginus wrote:
| I don't think so - convolutional neural networks also
| operate over 1D flat vectors - the spatial relationship of
| pixels is only learned from the training data.
| deneas wrote:
| The JSON files still contain images, just not in a regular
| image format. You have a 2D array of numbers where each number
| maps to a color. If you really want a regular picture format,
| you can easily convert the arrays.
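| For instance, a minimal sketch (assuming matplotlib and the
| public ARC-AGI task JSON layout, where each grid is a list of
| rows of integers 0-9; the filename here is made up):
|
|     import json
|     import matplotlib.pyplot as plt
|
|     with open("arc_task.json") as f:      # hypothetical file
|         task = json.load(f)
|
|     grid = task["train"][0]["input"]      # 2D list of ints 0-9
|     plt.imshow(grid, cmap="tab10", vmin=0, vmax=9)
|     plt.axis("off")
|     plt.savefig("arc_task_input.png")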
| inoperable wrote:
| Very convenient for OpenAI to run those errands with a bunch of
| misanthropes trying to repaint a simulacrum. Using "AGI" here
| makes me want to sponsor a pile of distress pills so people can
| really think things over before going into another manic
| episode. People seriously need to take a step back; if that's
| AGI, then my cat has surpassed its cognitive act twice.
| sakopov wrote:
| Maybe I'm missing something vital, but how does anything that
| we've seen AI do up until this point or explained in this
| experiment even hint at AGI? Can any of these models ideate? Can
| they come up with technologies and tools? No and it's unlikely
| they will any time soon. However, they can make engineers
| infinitely more productive.
| jebarker wrote:
| You need to define ideate, tools and technologies to answer
| those questions. Not to mention that it's quite possible humans
| do those things through re-combination of learned ideas
| similarly to how these reasoning models are suggested to be
| working.
| sakopov wrote:
| Every technological advancement that we've seen in software
| engineering - be it in things like Postgres, Kubernetes and
| Cloud Infrastructure - came out from truly novel ideas. AI
| seems to generate outputs that appear novel but are they
| really? It's capable of synthesizing and combining vast
| amounts of information in creative ways but it's deriving
| everything from existing patterns found within its training
| data. Truly novel ideas require thinking outside the box: a
| combination of cognitive, emotional and environmental factors
| that goes beyond pattern recognition. How close are we
| to achieving this? Everyone seems to be shaking in their
| boots because we might lose our job safety in tech, but I
| don't see any intelligence here.
| kirab wrote:
| FYI: Codeforces competitive programming is scored (basically)
| only by the time needed until valid solutions are posted
|
| https://codeforces.com/blog/entry/133094
|
| That means this benchmark is just saying o3 can write code
| faster than most humans (in a very time-limited contest, like 2
| hours for 6 tasks). Beauty, readability or creativity is not
| rated. It's essentially a "how fast can you make the unit tests
| pass" kind of competition.
| sigbottle wrote:
| Creativity is inherently rated because it's codeforces... most
| 2700 problems have unique, creative solutions.
| ghm2180 wrote:
| Wouldn't one then build the analog of the Lisp machine to
| hyper-optimize just this? It might be super expensive for regular
| GPUs, but with a super specialized architecture one could shave
| that $3,500/hour down quite a bit, no?
| kittikitti wrote:
| Congratulations
| hackpert wrote:
| If anyone else is curious about which ARC-AGI public eval puzzles
| o3 got right vs wrong (and its attempts at the ones it did get
| right), here's a quick visualization:
| https://arcagi-o3-viz.netlify.app
| suprgeek wrote:
| Don't be put off by the reported high-cost
|
| Make it possible->Make it fast->Make it Cheap
|
| the eternal cycle of software.
|
| Make no mistake - we are on the verge of the next era of change.
| duluca wrote:
| The first computers cost millions of dollars and filled entire
| rooms to accomplish what we would now consider simple
| computational tasks. That same computing power now fits on a
| fingernail. I don't get how technologists balk at the cost of
| experimental tech, or assume current tech will run at the same
| efficiency for decades to come and melt the planet into a
| puddle. AGI won't happen until you can fit several data centers'
| worth of compute into a brain-sized vessel, so the thing can
| move around and process the world in real time. This is all
| going to take some time, to say the least.
| Progress is progress.
| lxgr wrote:
| > take several data center's worth of compute into a brain
| sized vessel. So the thing can move around process the world in
| real time
|
| How so? I'd imagine a robot connected to the data center
| embodying its mind, connected via low-latency links, would have
| to walk pretty far to get into trouble when it comes to
| interacting with the environment.
|
| The speed of light is about three orders of magnitude faster
| than the speed of signal propagation in biological neurons,
| after all.
| waldrews wrote:
| 6 orders of magnitude if we use 120 m/s vs 300,000 km/s
| (3x10^8 m/s / 120 m/s ~ 2.5x10^6)
| lxgr wrote:
| Ah, yes, I missed a "k" in that estimation!
| byw wrote:
| The robot brain could be layered so that more basic functions
| are embedded locally while higher-level reasoning is
| offloaded to the cloud.
| arthurcolle wrote:
| blue strip from iRobot?
| lumost wrote:
| The concern here is mainly on practicality. The original
| mainframes did not command startup valuations counted in
| fractions of the US economy, though they did attract billions in
| investment.
|
| This is a great milestone, but OpenAI will not be successful
| charging 10x the cost of a human to perform a task.
| BriggyDwiggs42 wrote:
| I wouldn't expect it to cost 10x in five years, if only
| because parallel computing still seems to be roughly obeying
| Moore's law.
| raincole wrote:
| The cost of inference has been dropping by ~100x in the past 2
| years.
|
| https://a16z.com/llmflation-llm-inference-cost/
| nico wrote:
| *inference
| gritzko wrote:
| *infernonce
| christianqchung wrote:
| Hmm the link is saying the price of an LLM that scores 42
| or above on MMLU has dropped 100x in 2 years, equating gpt
| 3.5 and llama 3.2 3B. In my opinion gpt 3.5 was
| significantly better than llama 3B, and certainly much
| better than the also-equated llama 2 7B. MMLU isn't a great
| marker of overall model capabilities.
|
| Obviously the drop in cost for capability in the last 2
| years is big, but I'd wager it's closer to 10x than 100x.
| owenpalmer wrote:
| > OpenAI will not be successful charging 10x the cost of a
| human to perform a task.
|
| True, but they might be successful charging 20x for 2x the
| skill of a human.
| threatripper wrote:
| Or 10x the skill and speed of a human in some specific
| class of recurrent tasks. We don't need full super-human
| AGI for AI to become economically viable.
| eru wrote:
| Companies routinely pay short-term contractors a lot more
| than their permanent staff.
|
| If you can just unleash AI on any of your problems,
| without having to commit to anything long term, it might
| still be useful, even if they charged more than for
| equivalent human labour.
|
| (Though I suspect AI labour will generally trend to be
| cheaper than humans over time for anything AIs can do at
| all.)
| fragmede wrote:
| How much does AWS charge for compute?
|
| If it can be spun up with Terraform, I bet you they could.
| otabdeveloper4 wrote:
| Intelligence has nothing whatsoever to do with compute.
| oefnak wrote:
| Unless you're a dualist who believes in a magic spirit, I
| cannot understand how you think that's the case. Can you
| please explain?
| freehorse wrote:
| Intelligence is about learning from few examples and
| generalising to novel solutions. Increasing compute so that
| exploring the whole problem space is possible is not
| intelligence. There is a reason the actual ARC-AGI prize
| has efficiency as one of the success requirements. It is
| not so that the solutions scale to production and whatnot,
| these are toy tasks. It is to help ensure that it is
| actually an intelligent system solving these.
|
| So yeah, the o3 result is impressive but if the difference
| between o3 and the previous state of art is more compute to
| do a much longer CoT/evaluation loop, I am not so
| impressed. Reminder that these problems are solved by
| humans in seconds, ARC-AGI is supposed to be easy.
| lambdaphagy wrote:
| Philosophy of mind is the branch of philosophy that
| attempts to account for a very difficult problem: why there
| are apparently two different realms of phenomena, physical
| and mental, that are at once tightly connected and yet as
| different from one another as two things can possibly be.
|
| Broadly speaking you can think that the mental reduces to
| the physical (physicalism), that the physical reduces to
| the mental (idealism), both reduce to some other third
| thing (neutral monism) or that neither reduces to the other
| (dualism). There are many arguments for dualism but I've
| never heard a philosopher appeal to "magic spirits" in
| order to do so.
|
| Here's an overview:
| https://plato.stanford.edu/entries/dualism/
| patrickhogan1 wrote:
| Do you think intelligence exists without prior experience?
| For instance, can someone instantly acquire a skill--like
| playing the piano--as if downloading it in The Matrix? Even
| prodigies like Mozart had prior exposure. His father, a
| composer and music teacher, introduced him to music from an
| early age. Does true intelligence require a foundation of
| prior knowledge?
| 1659447091 wrote:
| Intelligence requires the ability to separate the wheat
| from the chaff on one's own to create a foundation of
| knowledge to build on.
|
| It is also entirely possible to learn a skill without prior
| experience. That's how it (whatever skill) was first done.
| owenpalmer wrote:
| > Does true intelligence require a foundation of prior
| knowledge?
|
| This is the way I think about it.
|
| I = E / K
|
| where I is the intelligence of the system, E is the
| effectiveness of the system, and K is the prior knowledge.
|
| For example, a math problem is given to two students, each
| solving the problem with the same effectiveness (both get
| the correct answer in the same amount of time). However,
| student A happens to have more prior knowledge of math than
| student B. In this case, the intelligence of B is greater
| than the intelligence of A, even though they have the same
| effectiveness. B was able to "figure out" the math, without
| using any of the "tricks" that A already knew.
|
| Now back to your question of whether or not prior knowledge
| is required. As K approaches 0, intelligence approaches
| infinity. But when K=0, intelligence is undefined. Tada! I
| think that answers your question.
|
| Most LLM benchmarks simply measure effectiveness, not
| intelligence. I conceptualize LLMs as a person with a
| photographic memory and a low IQ of 85, who was given 100
| billion years to learn everything humans have ever created.
|
| IK = E
|
| low intelligence * vast knowledge = reasonable
| effectiveness
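|
| A trivial numeric sketch of the same idea (the numbers are made
| up):
|
|     def intelligence(effectiveness, prior_knowledge):
|         # I = E / K; undefined when K == 0
|         return effectiveness / prior_knowledge
|
|     # Both students solve the problem equally well (E = 1.0),
|     # but A knew more tricks going in (K = 4) than B (K = 1).
|     print(intelligence(1.0, 4))  # 0.25 -> student A
|     print(intelligence(1.0, 1))  # 1.0  -> student B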
| TechDebtDevin wrote:
| Batteries..
| pera wrote:
| Maybe AGI as a goal is overvalued: If you have a machine that
| can, on average, perform symbolic reasoning better than humans,
| and at a lower cost, that's basically the end game, isn't it?
| You won capitalism.
| harrall wrote:
| Right now I can ask an (experienced) human to do something
| for me and they will either just get it done or tell me that
| they can't do it.
|
| Right now when I ask an LLM... I have to sit there and verify
| everything. It may have done some helpful reasoning for me
| but the whole point of me asking someone else (or something
| else) was to do nothing at all...
|
| I'm not sure you can reliably fulfill the first scenario
| without achieving AGI. Maybe you can, but we are not at that
| point yet so we don't know yet.
| raincole wrote:
| You do need to verify humans work though.
|
| The difference, to me, is that humans seem to be good at
| canceling each other's mistakes when put in a proper
| environment.
| pera wrote:
| It's not clear to me whether AGI is necessary for solving
| most of the issues in the current generation of LLMs. It is
| possible you can get there by hacking together CoTs with
| automated theorem provers and bruteforcing your way to the
| solution or something like that.
|
| But if it's not enough then maybe it might come as a
| second-order effect (e.g. reasoning machines having to
| bootstrap an AGI so then you can have a Waymo taxi driver
| who is also a Fields medalist)
| vbezhenar wrote:
| There are so-called "yes-men" who can't say "no" in any
| situation. That's rooted in their culture. I suspect that
| AI was trained using their assistance. I mean, answering "I
| can't do that" is the simplest LLM path that should work
| often, unless they have gone out of their way to downrank it.
| concordDance wrote:
| > Right now I can ask an (experienced) human to do
| something for me and they will either just get it done or
| tell me that they can't do it.
|
| Finding reliable honest humans is a problem governments
| have struggled with for over a hundred years. If you have
| cracked this problem at scale you really need to write it
| up! There are a lot of people who would be extremely
| interested in a solution here.
| eru wrote:
| > Finding reliable honest humans is a problem governments
| have struggled with for over a hundred years.
|
| Yes, though you are downplaying the problem a lot. It's
| not just governments, and it's way longer than 100 years.
|
| Btw, a solution that might work for you or me, presumably
| relatively obscure people, might not work for anyone
| famous, nor a company nor a government.
| anavat wrote:
| My guess is this is an artifact of the RLHF part of the
| training. Answers like "I don't know" or "let me think and
| let's catch up on this next week" are flagged down by human
| testers, which eventually trains LLM to avoid this path
| altogether. And it probably makes sense because otherwise
| "I don't know" would come up way too often even in cases
| where the LLM is perfectly able to give the answer.
| gf000 wrote:
| I don't know, that seems like a fundamental limitation.
| LLMs don't have any ability to do reflection on their own
| knowledge/abilities.
| ben_w wrote:
| Humans aren't very aware of their limits, either.
|
| Even the Dunning-Kruger effect is, ironically, widely
| misunderstood by people who are unreasonably confident
| about their knowledge.
| eru wrote:
| Yes, Dunning-Kruger's paper never found what popular
| science calls the 'Dunning-Kruger' effect.
|
| Effectively, they found nothing real but a statistical
| artifact.
| gf000 wrote:
| But you do know whether you have ever heard about call-by-name
| or call-by-value semantics.
| ben_w wrote:
| You've not only seen people get upset about technical
| jargon, but also never seen people misuse it wildly?
|
| The latter in particular is how I model the mistakes LLMs
| made, what with them having read most things.
| 8n4vidtmkvmk wrote:
| I thought you were going to say that now we're back to bigger-
| than-room sized computers that cost many millions just to
| perform the same tasks we could 40 years ago.
|
| I of course mean we're using these LLMs for a lot of tasks that
| they're inappropriate for, and a clever manually coded
| algorithm could do better and much more efficiently.
| arthurcolle wrote:
| just ask the LLM to solve enough problems (even new
| problems), cache the best, do inference time compute for the
| rest, figure out the best/ fastest implementations, and boom,
| you have new training data for future AIs
| owenpalmer wrote:
| > cache the best
|
| How do you quantify that?
| martinkallstrom wrote:
| "Assume the role of an expert in cache invalidation..."
| DyslexicAtheist wrote:
| "one does not just assume", "because the hardest problems
| in Tech are Johnny Cash invalidations" --Lao Tzi
| Terr_ wrote:
| > "Those who invalidate caches know nothing; Those who
| know retain data." These words, as I am told, were spoken
| by Lao Tzi. If we are to believe that Lao Tzi was himself
| one who knew, why did he erase /var/tmp to make space for
| his project?
|
| -- Poem by Cybernetic Bai Juyi, "The Philosopher [of
| Caching]"
| pavlov wrote:
| "Assume the role of an expert in naming things. You know,
| a... what do they call those people again... there must
| be a name for it"
| arthurcolle wrote:
| however you want
| adwn wrote:
| > _and a clever manually coded algorithm could do better and
| much more efficiently._
|
| Sure, but how long would it take to implement this algorithm,
| and would that be worth it for one-off cases?
|
| Just today I asked Claude to create a _jq_ query that looks
| for objects with a certain value for one field, but which
| lack a certain other field. I could have spent a long time
| trying to make sense of jq 's man page, but instead I spent
| 30 seconds writing a short description of what I'm looking
| for in natural language, and the AI returned the correct jq
| invocation within seconds.
| freehorse wrote:
| I don't think this is a bad use. A bad use would be to give
| Claude the dataset and ask it to tell you which elements
| have that value.
| adwn wrote:
| Ha, I tried that before. However, the file was too large
| for its context window, so it only seemed to analyze the
| first part and gave a wrong result.
| Woodi wrote:
| It was your own data, right? Because you just donated
| half of it...
| adwn wrote:
| It's okay, I also uploaded an NDA in a previous prompt
| :-)
| globalise83 wrote:
| Claude answers a lot of its questions by first writing
| and then running code to generate the results. Its only
| limitations are access to databases and the size of the context
| window, both of which will be radically improved over the
| next 5 years.
| freehorse wrote:
| I would still rather be able to see the code it generates
| lottin wrote:
| But how do you know it's given you the correct answer? Just
| because the code appears to work it doesn't mean it's
| correct.
| adwn wrote:
| But how do I know if my hand-written jq query is the
| correct solution? Just because the query appears to work
| it doesn't mean it's correct.
| lottin wrote:
| Because I understand the process that I have followed to
| get to the solution.
| ogogmad wrote:
| It can explain its solution. Point to relevant docs as
| well.
| gf000 wrote:
| It can also very convincingly explain a non-solution
| pointing to either real or hallucinated docs.
| ogogmad wrote:
| You need to look at the docs.
| freehorse wrote:
| Omg, this is how LLMs used to trick me, inventing all
| these APIs.
| ogogmad wrote:
| Look at the docs it links to.
| globalise83 wrote:
| The LLMs are now writing their own algorithms to answer
| questions. Not long before they can design a more efficient
| algorithm to complete any feasible computational task, in a
| millionth of the time needed by the best human.
| bayindirh wrote:
| LLMs are probabilistic string blenders pulling pieces up
| from their training set, which unfortunately comes from us,
| humans.
|
| The superset of the LLM knowledge pool is human knowledge.
| They can't go beyond the boundaries of their training set.
|
| I'll not go into how humans have other processes which can
| alter their and collective human knowledge, but the rabbit
| hole starts with "emotions, opposable thumbs, language,
| communication and other senses".
| ogogmad wrote:
| > They can't go beyond the boundaries of their training
| set.
|
| TFA says they just did. That's what the ARC-AGI benchmark
| was supposed to test.
| gf000 wrote:
| > The LLMs are now writing their own algorithms to answer
| questions
|
| Writing a python script, because it can't do math or any
| form of more complex reasoning is not what I would call
| "own algorithm". It's at most application of existing
| ones/calling APIs.
| nopinsight wrote:
| Many of humans' capabilities are pretrained with massive
| computing through evolution. Inference results of o3 and its
| successors might be used to train the next generation of small
| models to be highly capable. Recent advances in the
| capabilities of small models such as Gemini-2.0 Flash suggest
| the same.
|
| Recent research from NVIDIA suggests such an efficiency gain is
| quite possible in the physical realm as well. They trained a
| tiny model to control the full body of a robot via simulations.
|
| ---
|
| "We trained a 1.5M-parameter neural network to control the body
| of a humanoid robot. It takes a lot of subconscious processing
| for us humans to walk, maintain balance, and maneuver our arms
| and legs into desired positions. We capture this
| "subconsciousness" in HOVER, a single model that learns how to
| coordinate the motors of a humanoid robot to support locomotion
| and manipulation."
|
| ...
|
| "HOVER supports any humanoid that can be simulated in Isaac.
| Bring your own robot, and watch it come to life!"
|
| More here: https://x.com/DrJimFan/status/1851643431803830551
|
| ---
|
| This demonstrates that with proper training, small models can
| perform at a high level in both cognitive and physical domains.
| bigprof wrote:
| > Similarly, many of humans' capabilities are pretrained with
| massive computing through evolution.
|
| Hmm .. my intuition is that humans' capabilities are gained
| during early childhood (walking, running, speaking .. etc)
| ... what are examples of capabilities pretrained by
| evolution, and how does this work?
| nopinsight wrote:
| The brain is predisposed to learn those skills. Early
| childhood experiences are necessary to complete the
| training. Perhaps that could be likened to post-training.
| It's not a one-to-one comparison but a rather loose analogy
| which I didn't make precise because it is not the key
| point of the argument.
|
| Maybe evolution could be better thought of as neural
| architecture search combined with some pretraining.
| Evidence suggests we are prebuilt with "core knowledge" by
| the time we're born [1].
|
| See: Summary of cool research gained from clever & benign
| experiments with babies here:
|
| [1] Core knowledge. Elizabeth S. Spelke and Katherine D.
| Kinzler. https://www.harvardlds.org/wp-
| content/uploads/2017/01/Spelke...
| vanviegen wrote:
| > The brain is predisposed to learn those skills.
|
| Learning to walk doesn't seem to be particularly easy,
| having observed the process with my own children. No
| easier than riding a bike or skating, for which our
| brains are probably not 'predisposed'.
| nopinsight wrote:
| Walking is indeed a complex skill. Yet some animals walk
| minutes after birth. Human babies are most likely born
| premature due to the large brain and related physical
| constraints.
|
| Young children learn to bike or skate at an older age
| after they have acquired basic physical skills.
|
| Check out the reference to Core Knowledge above. There
| are things young infants know or are predisposed to know
| from birth.
| HumanOstrich wrote:
| The brain has developed, through evolution, very specific
| and organized structures that allow us to learn language
| and reading skills. If you have a genetic defect that
| causes those structures to be faulty or missing, you will
| have severe developmental problems.
|
| That seems like a decent example of pretraining through
| evolution.
| tesch1 wrote:
| But maybe it's something more like general symbolic
| manipulation, and not specifically the sounds or
| structure of language. Reading is fairly new and unlikely
| to have had much if any evolutionary pressure in many
| populations who are now quite literate. Same seems true
| for music. Maybe the hardware is actually more general
| and adaptable and not just for language?
| HumanOstrich wrote:
| The research disagrees with you.
| eru wrote:
| Music is really, really old.
|
| And reading and music co-evolved to be relatively easy
| for humans to do.
|
| (See how computers have a much easier time reading
| barcodes and QR codes, with much less general processing
| power than it takes them to decipher human hand-writing.
| But good luck trying to teach humans to read QR codes
| fluently.)
| eru wrote:
| > No easier than riding a bike or skating, for which our
| brains are probably not 'predisposed'.
|
| What makes you think so? Humans came up with biking and
| skating, because they were easy enough for us to master
| with the hardware we had.
| puffybuf wrote:
| I think of evolution as unassisted learning where agents
| compete with each other for limited resources. Over
| time they get better and better at surviving by passing
| on genes. It never ends of course.
| tiborsaas wrote:
| If you look at animals, they can walk in hours, not much
| time needed after being born. It takes us a longer time
| because we are born rather undeveloped to get the head out
| of the birth canal.
|
| A more high-level example: sea sickness is an evolutionary
| pre-learned thing; your body thinks it's poisoned and it
| automatically wants to empty your stomach.
| gf000 wrote:
| I mean, there are plenty - e.g. mimicking (say, the
| mother's face's emotions), which are precursors to learning
| more advanced "features". Also, even walking has many
| aspects pretrained (I assume it's mostly a musculoskeletal
| limitation that we can't walk immediately), humans are just
| born "prematurely" due to our relatively huge heads.
| Newborn horses can walk immediately without learning.
|
| But there are plenty of non-learned
| control/movement/sensing in utero that are "pretrained".
| eru wrote:
| Interestingly, there's a bunch of reflexes that also only
| develop over time.
|
| They are more nature than nurture, but they aren't 'in-
| born'.
|
| Just like human aren't (usually) born with teeth, but
| they don't 'learn' to have teeth or pubic hair, either.
| eru wrote:
| Your brain is well adapted to learning how to walk and
| speak.
|
| Chimpanzees score pretty high on many tests of
| intelligence, especially short term working memory. But
| they can't really learn language: they lack the specialised
| hardware more than the general intelligence.
| Existenceblinks wrote:
| Honestly, it doesn't need to be local. An API some 200ms away
| is OK-ish; make it 50ms and it will be practically usable for
| the vast majority of interactions.
| joshdavham wrote:
| A lot of the comments seem very dismissive and a little overly-
| skeptical in my opinion. Why is this?
| ziofill wrote:
| It's certainly remarkable, but let's not ignore the fact that it
| still fails on puzzles that are trivial for humans. Something is
| amiss.
| vicentwu wrote:
| "Note on "tuned": OpenAI shared they trained the o3 we tested on
| 75% of the Public Training set. They have not shared more
| details. We have not yet tested the ARC-untrained model to
| understand how much of the performance is due to ARC-AGI data."
|
| Really want to see the number of training pairs needed to achieve
| this score. If it only takes a few pairs, say 100 pairs, I would
| say it is amazing!
| nmca wrote:
| 75% of 400 is 300 :)
| WXLCKNO wrote:
| Wow are you AGI?
| epigramx wrote:
| I bet it still thinks 1+1=3 if it read enough sources parroting
| that.
| theincredulousk wrote:
| Denoting it in $ for efficiency is peak capitalism, cmv.
| polskibus wrote:
| What are the differences between the public offering and o3? What
| is o3 doing differently? Is it something akin to more internal
| iterations, similar to "brute forcing" a problem, like you can
| yourself with a cheaper model, providing additional hints after
| each response?
| miga89 wrote:
| How do the organisers keep the private test set private? Does
| openAI hand them the model for testing?
|
| If they use a model API, then surely OpenAI has access to the
| private test set questions and can include it in the next round
| of training?
|
| (I am sure I am missing something.)
| 7734128 wrote:
| I suppose that's why they are calling it "semi-private".
| freehorse wrote:
| And why o3 or any OpenAI llm is not evaluated in the actual
| private dataset.
| owenpalmer wrote:
| I wouldn't be surprised if the term "benchmark fraud" is soon
| coined.
| PhilippGille wrote:
| Benchmark fraud is not a novel concept. Outside of LLMs for
| example smartphone manufacturers detect benchmarks and
| disable or reduce CPU throttling: https://www.theregister.com
| /2019/09/30/samsung_benchmarking_...
| hmottestad wrote:
| CPU frequency ramp curve is also something that can be
| adjusted. You want the CPU to ramp up really quickly to
| make everything feel responsive, but at the same time you
| want to not have to use so much power from your battery.
|
| If you detect that a benchmark is running then you can just
| ramp up to max frequency immediately. It'll show how fast
| your CPU is, but won't be representative of the actual
| performance that users will get from their device.
| deneas wrote:
| They have two sets, a fully private one where the models run
| isolated and the semi-private one where they run models
| accessed over the internet.
| gritzko wrote:
| That is the top question, actually. Given all the billions at
| stake.
| PoignardAzur wrote:
| If we really want to imagine a cold-war-style solution, the two
| teams could meet in an empty warehouse, bring one computer with
| the model, one with the benchmarks, and connect them with a USB
| cable.
|
| In practice I assume they just gave them the benchmarks and
| took it on the honor system they wouldn't cheat, yeah. They can
| always cook up a new test set for next time, it's only 10% of
| the benchmark content anyway and the results are pretty close.
| andrepd wrote:
| There's no honor system when there's billions of dollars at
| stake x) I'm highly highly skeptical of these benchmarks
| because of intentional cheating and accidental contamination.
| bjornsing wrote:
| Isn't that why they call it " Semi-Private"?
|
| There's a fully private test set too as I understand it, that
| o3 hasn't run on yet.
| DiscourseFan wrote:
| a little from column A, a little from column B
|
| I don't think this is AGI; nor is it something to scoff at. It's
| impressive, but it's also not human-like intelligence. Perhaps
| human-like intelligence is not the goal, since that would imply
| we have even a remotely comprehensive understanding of the human
| mind. I doubt the mind operates as a single unit anyway, a
| human's first words are "Mama," not "I am a self-conscious freely
| self-determining being that recognizes my own reasoning ability
| and autonomy." And the latter would be easily programmable
| anyway. The goal here might, then, be infeasible: the concept of
| free will is a kind of technology in and of itself, it has
| already augmented human cognition. How will these technologies
| not augment the "mind" such that our own understanding of our
| consciousness is altered? And why should we try to determine
| ahead of time what will hold weight for us, why the "human" part
| of the intelligence will matter in the future? Technology should
| not be compared to the world it transforms.
| digitcatphd wrote:
| > o3 fixes the fundamental limitation of the LLM paradigm - the
| inability to recombine knowledge at test time - and it does so
| via a form of LLM-guided natural language program search
|
| This is significant, but I am doubtful it will be as meaningful
| as people expect aside from potentially greater coding tasks.
| Without a 'world model' that has a contextual understanding of
| what it is doing, things will remain fundamentally throttled.
| madsgarff wrote:
| > Moreover, ARC-AGI-1 is now saturating - besides o3's new
| score, the fact is that a large ensemble of low-compute Kaggle
| solutions can now score 81% on the private eval.
|
| If low-compute Kaggle solutions already do 81%, then why is
| o3's 75.7% considered such a breakthrough?
| gmerc wrote:
| Headline could also just be OpenAI discovers exponential scaling
| wall for inference time compute.
| owenpalmer wrote:
| Someone asked if true intelligence requires a foundation of prior
| knowledge. This is the way I think about it.
|
| I = E / K
|
| where I is the intelligence of the system, E is the effectiveness
| of the system, and K is the prior knowledge.
|
| For example, a math problem is given to two students, each
| solving the problem with the same effectiveness (both get the
| correct answer in the same amount of time). However, student A
| happens to have more prior knowledge of math than student B. In
| this case, the intelligence of B is greater than the intelligence
| of A, even though they have the same effectiveness. B was able to
| "figure out" the math, without using any of the "tricks" that A
| already knew.
|
| Now back to the question of whether or not prior knowledge is
| required. As K approaches 0, intelligence approaches infinity.
| But when K=0, intelligence is undefined. Tada! I think that
| answers the question.
|
| Most LLM benchmarks simply measure effectiveness, not
| intelligence. I conceptualize LLMs as a person with a
| photographic memory and a low IQ of 85, who was given 100 billion
| years to learn everything humans have ever created.
|
| IK = E
|
| low intelligence * vast knowledge = reasonable effectiveness
| Woodi wrote:
| Yep, I always liked encyclopedias. Wiki is good too :)
|
| What I would like to have in the future is SO answer-people
| accessible in real time via IRC. They have real answers NOW.
| They are even pedantic about their stuff!
| wangii wrote:
| Interesting formulation! It captures the intuition of the
| "smartness" when solving a problem. However, what about asking
| good questions or proposing conjectures?
| hanspeter wrote:
| Aren't those solutions to problems as well?
|
| Find the best questions to ask. Find the best hypothesis to
| suggest.
| lorepieri wrote:
| There should be also a factor about resource consumption. See
| here: https://lorenzopieri.com/pgii/
| spacebanana7 wrote:
| Also perhaps a factor (with diminishing returns) for response
| speed?
|
| All else equal, a student who gets 100% on a problem set in
| 10 minutes is more intelligent than one with the same score
| after 120 minutes. Likewise an LLM that can respond in 2
| seconds is more impressive than one which responds in 30
| seconds.
| owenpalmer wrote:
| > a student who gets 100% on a problem set in 10 minutes is
| more intelligent than one with the same score after 120
| minutes
|
| According to _my_ mathematical model, the faster student
| would have higher _effectiveness_ , not necessarily higher
| intelligence. Resource consumption and speed are practical
| technological concerns, but they're irrelevant in a
| theoretical conceptualization of intelligence.
| baq wrote:
| If you disregard time, all computers have maximal
| intelligence, they can enumerate all programs and compute
| answers to any decidable question.
| wouldbecouldbe wrote:
| Yeah speed is a key factor in intelligence. And actually
| one of the biggest differentiators in human iq
| measurements
| eru wrote:
| Humans are a bit annoying that way, because it's all
| correlated.
|
| So a human with a better response time, also tends to
| give you more intelligent answers, even when time is not
| a factor.
|
| For a computer, you can arbitrarily slow them down (or
| speed them up), and still get the same answer.
| Terr_ wrote:
| > response time
|
| Imagine you take an extraordinarily smart person, and put
| them on a fast spaceship that causes time dilation.
|
| Does that mean that they are stupider while in transit, and
| they regain their intelligence when it slows down?
| Earw0rm wrote:
| No, because intelligence is relative to your local
| context.
| Terr_ wrote:
| Why should one kind of phenomenon which slows down
| performance on the test be given a special "you're more
| intelligent than you seem" exception, but not others?
|
| If we are required to break the seal on the black-box and
| investigate the exactly how the agent is operating in
| order to judge its "intelligence"... Doesn't that kinda
| ruin the up-thread stuff about judging with equations?
| zoky wrote:
| Who is a better free-thrower, someone who can hit 20 free
| throws per minute on Earth, or the same thrower who
| logged 20 million free throws in the apparent two years
| he was gone but comes back ready for retirement?
| coffeebeqn wrote:
| Maybe. If I could ask an AI to come up with a 50% efficient
| mass market solar panel, I don't really care if it takes a
| few weeks or a year if it can solve that though. I'm not
| sure if inventiveness or novelness of solution could be a
| metric. I suppose that is superintelligence rather than
| AGI? And by then there would be no question of what it is
| xlii wrote:
| An interesting point from a philosophical perspective!
|
| But if we take this into consideration, would it mean that a
| 1st-world engineer is by definition _less_ intelligent than a
| 3rd-world one?
|
| I think the (completely reasonable) knee-jerk reaction is a
| defensive one, but I can imagine an escapee from an
| authoritarian regime working side by side with an engineer
| groomed in expensive, air-conditioned lecture rooms. In this
| imaginary scenario the escapee, even if slower and less
| efficient at the problem at hand, would have to be more
| intelligent generally.
| eru wrote:
| That's a bit silly.
|
| Yes, resource consumption is important. But your car guzzling
| a lot of gas doesn't mean it drives slower. It just means it
| covers less distance per mol of petrol consumed.
|
| It's good to know whether your system has a high or low 'bang
| for buck' metric, but that doesn't directly affect how much
| bang you get.
| someothherguyy wrote:
| https://en.wikipedia.org/wiki/Fluid_and_crystallized_intelli...
| dmezzetti wrote:
| We should wait until it's released before we anoint it. It's
| disheartening to see how we keep repeating the same pattern
| that gives in to hype over the scientific method.
| lazide wrote:
| The scientific method doesn't drive stock price (apparently).
| empiko wrote:
| Well put. You ask LLMs about ARC-like challenges and they are
| able to come up with a list of possible problem formulations
| even before you show them the input. The models already know
| that they might expect various object manipulations, symmetry
| problems, etc. The fact that the solution costs thousands of
| dollars says to me that the model iterates over many solutions
| while using this implicit knowledge and feedback it gets from
| running the program. It is still impressive, but I don't think
| this is what the ARC prize was supposed to be about.
| curl-up wrote:
| > while using this implicit knowledge and feedback it gets
| from running the program.
|
| What feedback, and what program, are you referring to?
| scotty79 wrote:
| Basically solutions that were doing well in arc just threw
| thousands of ideas at the wall and picked the ones that
| stuck. They were literally generating thousands of python
| programs, running them and checking if any produced the
| correct output when fed with data from examples.
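|
| A minimal sketch of that generate-and-test loop (the candidate
| generator is whatever LLM or DSL sampler produced the guesses;
| names are illustrative):
|
|     def solves_train_pairs(program, train_pairs):
|         # Keep a candidate only if it reproduces every example.
|         try:
|             return all(program(p["input"]) == p["output"]
|                        for p in train_pairs)
|         except Exception:
|             return False  # crashing candidates are discarded
|
|     def pick_guesses(candidates, train_pairs, test_input, k=2):
|         survivors = [p for p in candidates
|                      if solves_train_pairs(p, train_pairs)]
|         # ARC scoring allows only a couple of guesses per task
|         return [p(test_input) for p in survivors[:k]]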
|
| This o3 doesn't need to run python. It itself executes
| programs written in tokens inside its own context window
| which is wildly inefficient but gives better results and is
| potentially more general.
| TheOtherHobbes wrote:
| So basically it's a massively inefficient trial-and-error
| leetcode solver which only works because it throws
| incredible amounts of compute at the problem.
|
| This is hilarious.
| empiko wrote:
| I assume that o3 can run Python scripts and observe the
| outputs.
| onemetwo wrote:
| An intelligent system could take more advantage of an increase
| of knowledge than a dumb one, so I should propose a simple
| formula: the derivative of efficiency with respect to knowledge
| is proportional to intelligence.
|
| $$ I = \frac{\partial E}{\partial K} \simeq
| \frac{\delta E}{\delta K} $$
|
| In order to estimate $I$ you have to consider that efficiency
| and knowledge are task related, so you could take some weighted
| mean $\sum_T C(E,K,T)*I(E,K,T)$ where $T$ is task category. I am
| thinking in $C(E,K,T)$ as something similar to thermal capacity
| or electrical resistance, the equivalent concept when applied
| to task. An intelligent agent in a medium of low resistance
| should fly while a dumb one would still crawl.
| owenpalmer wrote:
| > An intelligent system could take more advantage of an
| increase of knowledge than a dumb one
|
| Why?
|
| > derivative of efficiency
|
| Where did your efficiency variable come from?
| onemetwo wrote:
| Why? I am using "dumb" to mean a low-intelligence system. A more
| intelligent person can take advantage of new opportunities.
| Efficiency variable: You are right that effectiveness could
| be better here because we are not considering resources
| like computer time and power.
| gardenhedge wrote:
| Where did someone ask that?
| scotty79 wrote:
| As a kid I absolutely hated math and loved physics and
| chemistry because solving anything in math requires vast
| specific K.
|
| In comparison you can easily know everything there is to know
| about physics or chemistry and it's sufficient to solve
| interesting puzzles. In math every puzzle has its own vast
| lore you need to know before you can have any chance at
| tackling it.
| owenpalmer wrote:
| Physics and chemistry require experimentation to verify
| solutions. With math however, any new knowledge can be
| intuited and proven from previous proofs, so yes, the lore
| goes deep!
| Woodi wrote:
| So the article seriously and scientifically states:
|
| "Our compilation of programs (AI) gave 90% correct answers in
| test 1. We expect that in test 2 the quality of answers will
| degenerate to below random-monkey-pushing-buttons levels. Now
| more money is needed to prove we hit a blind alley."
|
| Hurray! Put a limited version of that on everybody's phones!
| oezi wrote:
| > o3 fixes the fundamental limitation of the LLM paradigm - the
| inability to recombine knowledge at test time
|
| I don't understand this mindset. We have all experienced that
| LLMs can produce words never spoken before. Thus there is
| recombination of knowledge at play. We might not be satisfied
| with the depth/complexity of the combination, but there isn't any
| reason to believe something fundamental is missing. Given more
| compute and enough recursiveness we should be able to reach any
| kind of result from the LLM.
|
| The linked article says that LLMs are like a collection of vector
| programs. It has always been my thinking that computations in
| vector space are easy to make Turing complete if we just have an
| eigenvector representation figured out.
| lagrange77 wrote:
| > Given more compute and enough recursiveness we should be able
| to reach any kind of result from the LLM.
|
| That was always true for NNs in general, yet it took a very
| specific structure to get to where we are now. (..with a
| certain amount of time and resources.)
|
| > thinking that computations in vector space are easy to make
| turing complete if we just have an eigenvector representation
| figured out
|
| Sounds interesting, would you elaborate?
| niemandhier wrote:
| Contrary to many I hope this stays expensive. We are already
| struggling with AI curated info bubbles and psy-ops as it is.
|
| State actors like Russia, US and Israel will probably be fast to
| adopt this for information control, but I really don't want to
| live in a world where the average scammer has access to this
| tech.
| owenpalmer wrote:
| > I really don't want to live in a world where the average
| scammer has access to this tech.
|
| Reality check: local open source models are more than capable
| of information control, generating propaganda, and scamming
| you. The cat's been out of the bag for a while now, and
| increased reasoning ability doesn't dramatically increase the
| weaponizability of this tech, I think.
| pal9000 wrote:
| Can someone ELI5 how ARC-AGI-PUB is resistant to p-hacking?
| danielovichdk wrote:
| At what point will it kill us all, because it understands that
| humans are the biggest problem, so that it can simply chill and
| not worry?
|
| That would be intelligent. Everything else is just stupid and
| more of the same shit.
| aniviacat wrote:
| Humans are the biggest problem of what? Of the sun? Of Venus?
|
| Of humans. Humans are a problem for the satisfaction of humans.
| Yet removing humans from this equation does not result in
| higher human satisfaction. It lessens it.
|
| I find this thought process of "humans are the problem" to be
| unreasonable. Humans aren't the problem; humans are the
| requirement.
| almog wrote:
| AGI => ARC-AGI-PUB
|
| And not the other way around as some comments here seem to
| confuse necessary and sufficient conditions.
| the5avage wrote:
| The examples unsolved by high compute o3 look a lot like the
| Raven's Progressive Matrices used in IQ tests.
| thom wrote:
| It's not AGI when it can do 1000 math puzzles. It's AGI when it
| can do 1000 math puzzles then come and clean my kitchen.
| qup wrote:
| Intelligence doesn't have to be embodied.
| thom wrote:
| It also has to be able to come and argue in the comments.
| goatlover wrote:
| For it to be AGI, it needs to be able to manipulate the
| physical world from its own goals, not just produce text
| when prompted. LLMs are just tools to augment human
| intelligence. AGI is what you see in science fiction.
| egeozcan wrote:
| I understand what you are saying and sort of agree the premise
| but to be pedantic, I don't think any robot can clean a kitchen
| without doing math :)
| epolanski wrote:
| Okay but what are the tests like? At least like a general idea.
| tymonPartyLate wrote:
| Isn't this like a brute force approach? Given it costs $3,000 per
| task, that's like 600 GPU hours (H100 at Azure). In that amount of
| time the model can generate millions of chains of thought and
| then spend hours reviewing them or even testing them out one by
| one. Kind of like trying until something sticks and that happens
| to solve 80% of ARC. I feel like reasoning works differently in
| my brain. ;)
| strangescript wrote:
| "We have created artificial super intelligence, it has solved
| physics!"
|
| "Well, yeah, but its kind of expensive" -- this guy
| freehorse wrote:
| The problem is not that it is expensive, but that, most
| likely, it is not superintelligence. Superintelligence is not
| exploring the problem space semi-blindly, if the thousands of
| $$$ per task are actually spent on that. There is a reason
| the actual ARC-AGI prize requires efficiency, because the
| point is not "passing the test" but solving the framing
| problem of intelligence.
| tymonPartyLate wrote:
| Haha. Hopefully you're right and solving the ARC puzzle
| translates to solving all of physics. I just remain skeptical
| about the OpenAI hype. They have a track record of
| exaggerating the significance of their releases and their
| impact on humanity.
| jeremyjh wrote:
| Please do show me a novel result in physics from any LLM. You
| think "this guy" is stupid because he doesn't extrapolate
| from this $2MM test that nearly reproduces the work of a STEM
| graduate to a super intelligence that has already solved
| physics. Maybe you've got it backwards.
| tikkun wrote:
| They're only allowed 2-3 guesses per problem. So even though
| yes it generates many candidates, it can't validate them - it
| doesn't have tool use or a verifier, it submits the best 2-3
| guesses.
| https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50...
| nmca wrote:
| It is allowed exactly two guesses, per the ARC rules.
| trescenzi wrote:
| How many guesses is the human comparison based on? I'd hope
| two as well but haven't seen this anywhere so now I'm
| curious.
| nmca wrote:
| The real turker studies, resulting in the ~70% number,
| are scored correctly I believe. Higher numbers are just
| speculated human performance as far as I'm aware.
| nextworddev wrote:
| The best interpretation of this result is probably that it
| showed tackling some arbitrary benchmark is something you can
| throw money at, aka it's just something money can solve.
|
| It's not AGI, obviously, in the sense that you still need to do
| some problem framing and initialization to kickstart the
| reasoning path simulations.
| torginus wrote:
| this might be quite an important point - if they created an
| algorithm that can mimic human reasoning, but scales terribly
| with problem complexity (in terms of big O notation), it's
| still a very significant result, but it's not a 'human brains
| are over' moment quite yet.
| macrolime wrote:
| The trick with AlphaGo was brute force combined with learning
| to extract strategies from brute force using reinforcement
| learning, and that's what we'll see here. So maybe it costs a
| million dollars in compute to get a high score, but use
| reinforcement learning a la AlphaZero to learn from the process
| and it won't cost a million dollars next time. Let it do
| lots of hard benchmarks, math problems and coding tasks and
| it'll keep getting better and better.
| tikkun wrote:
| I wonder: when did o1 finish training, and when did o3 finish
| training?
|
| There's a ~3 month delay between o1's launch (Sep 12) and o3's
| launch (Dec 20). But, it's unclear when o1 and o3 each finished
| training.
| zug_zug wrote:
| This is a lot of noise around what's clearly not even an order of
| magnitude on the way to AGI.
|
| Here's my AGI test - Can the model make a theory of AGI
| validation that no human has suggested before, test itself to see
| if it qualifies, iterate, read all the literature, and suggest
| modifications to its own network to improve its performance?
|
| That's what a human-level performer would do.
| earth2mars wrote:
| Maybe spend more compute time to let it think about optimizing
| the compute time.
| msoad wrote:
| There are new research where chain of thoughts is happening in
| latent spaces and not in English. They demonstrated better
| results since language is not as expressive as those concepts
| that can be represented in the layers before decoder. I wonder if
| o3 is doing that?
| padolsey wrote:
| I think you mean this: https://arxiv.org/abs/2412.06769
|
| From what I can see, presuming o3 is a progression of o1 and
| has a good level of accountability bubbling up during 'inference'
| (i.e. "Thinking about ___") then I'd say it's just using up
| millions of old-school tokens (the 44 million tokens that are
| referenced). So not latent thinking per se.
| Zamicol wrote:
| Interesting!
| gliptic wrote:
| "You can tell the RL is done properly when the models cease to
| speak English in their chain of thought" -- Karpathy
| rapjr9 wrote:
| Does anyone have a feeling for how latency (from asking a
| question/API call to getting an answer/API return) is progressing
| with new models? I see 1.3 minutes/task and 13.8 minutes/task
| mentioned in the page on evaluating O3. Efficiency gains that
| also reduce latency will be important and some of them will come
| from efficiency in computation, but as models include more and
| more layers (layers of models for example) the overall latency
| may grow and faster compute times inside each layer may only help
| somewhat. This could have large effects on usability.
| amai wrote:
| But can it convert handwritten equations into Latex? That is the
| AGI task I'm waiting for.
| figure8 wrote:
| I have a very naive question.
|
| Why is the ARC challenge difficult but coding problems are easy?
| The two examples they give for ARC (border width and square
| filling) are much simpler than pattern awareness I see simple
| models find in code everyday.
|
| What am I misunderstanding? Is it that one is a visual grid
| context which is unfamiliar?
| ItsMattyG wrote:
| Francois' (the creator of the ARC-AGI benchmark) whole point was
| that while they look the same, they're not. Coding is solving a
| familiar pattern in the same way (and fails when it's NOT
| doing that, it just looks like it doesn't happen because it's
| seen SO MANY patterns in code). But the point of ARC-AGI is to
| make each problem have to generalize in some new way.
| sn0wr8ven wrote:
| Incredibly impressive. Still can't really shake the feeling that
| this is o3 gaming the system more than it is actually being able
| to reason. If the reasoning capabilities are there, there should
| be no reason why it achieves 90% on one version and 30% on the
| next. If a human maintains the same performance across the two
| versions, an AI with reason should too.
| demirbey05 wrote:
| I am not expert in llm reasoning but I think because of RL. You
| cannot use AlphaZero to play other games.
| GaggiX wrote:
| Humans and AIs are different, the next benchmark would be build
| so that it emphasize the weak points of current AI models where
| a human is expected to perform better, but I guess you can also
| make a benchmark that is the opposite, where humans struggle
| and o3 has an easy time.
| pkphilip wrote:
| Yes, if a system has actually achieved AGI, it is likely to not
| reveal that information
| HeatrayEnjoyer wrote:
| AGI is a spectrum, not a binary quality.
| cornholio wrote:
| But does it matter if it "really, really" reasons in the human
| sense, if it's able to prove some famous math theorem or come
| up with a novel result in theoretical physics?
|
| While beyond current models, that would be the final test of
| AGI capability.
| jprete wrote:
| If it's gaming the system, then it's much less likely to
| reliably come up with novel proofs or useful new theoretical
| ideas.
| intended wrote:
| Yeah, it really does matter if something was reasoned, or
| whether it appears if you metaphorically shake the magic 8
| ball.
| FartyMcFarter wrote:
| How would gaming the system work here? Is there some flaw in
| the way the tasks are generated?
| kmacdough wrote:
| The point of ARC is NOT to compare humans vs AI, but to probe
| the current boundary of AIs weaknesses. AI has been beating us
| at specific tasks like handwriting recognition for decades.
| Rather, it's when we can no longer readily find these "easy for
| human, hard for AI" reasoning tasks that we must stop and
| consider.
|
| If you look at the ARC tasks failed by o3, they're really not
| well suited to humans. They lack the living context humans
| thrive on, and have relatively simple, analytical outcomes that
| are readily processed by simple structures. We're unlikely to
| see AI as "smart" until it can be asked to accomplish useful
| units of productive professional work at a "seasoned
| apprentice" level. Right now they're consuming ungodly amounts
| of power just to pass some irritating, sterile SAT questions.
| Train a human for a few hours a day over a couple weeks and
| they'll ace this no problem.
| earth2mars wrote:
| Why did they skip o2?
| YeGoblynQueenne wrote:
| I just noticed this bit:
|
| >> Second, you need the ability to recombine these functions into
| a brand new program when facing a new task - a program that
| models the task at hand. Program synthesis.
|
| "Program synthesis" is here used in an entirely idiosyncratic
| manner, to mean "combining programs". Everyone else in CS and AI
| for the last many decades has used "Program Synthesis" to mean
| "generating a program that satisfies a specification".
|
| Note that "synthesis" can legitimately be used to mean
| "combining". In Greek it translates literally to "putting
| [things] together": "Syn" (plus) "thesis" (place). But while
| generating programs by combining parts of other programs is an
| old-fashioned way to do Program Synthesis, in the standard sense,
| the end result is always desired to be a program. The LLMs used
| in the article to do what F. Chollet calls "Program Synthesis"
| generate no code.
| tshadley wrote:
| I always get the feeling he's subconsciously inserting a
| "magical" step here with reference to "synthesis"-- invoking a
| kind of subtle dualism where human intelligence is just
| different and mysteriously better than hardware intelligence.
|
| Combining programs should be straightforward for DNNs,
| ordering, mixing, matching concepts by coordinates and
| arithmetic in learned high-dimensional embedded-space.
| Inference-time combination is harder since the model is working
| with tokens and has to keep coherence over a growing CoT with
| many twists, turns and dead-ends, but with enough passes can
| still do well.
|
| The logical next step to improvement is test-time training on
| the growing CoT, using reinforcement-fine-tuning to compress
| and organize the chain-of-thought into parameter-space--if we
| can come up with loss functions for "little progress, a lot of
| progress, no progress". Then more inference-time with a better
| understanding of the problem, rinse and repeat.
| baalimago wrote:
| Let me know when OpenAI can wrap Christmas gifts. Then I'll be
| interested.
___________________________________________________________________
(page generated 2024-12-21 18:00 UTC)