[HN Gopher] 30% drop in O1-preview accuracy when Putnam problems...
___________________________________________________________________
30% drop in O1-preview accuracy when Putnam problems are slightly
variated
Author : optimalsolver
Score : 434 points
Date : 2025-01-01 12:21 UTC (10 hours ago)
(HTM) web link (openreview.net)
(TXT) w3m dump (openreview.net)
| yifanl wrote:
| Is it just an open secret that the models are currently just
| being hardcoded for random benchmarks? Seems weird that people
| would be asking Putnam problems to a chatbot :/
| Trasmatta wrote:
| Not hardcoded, I think it's just likely that those problems
| exist in its training data in some form
| hansworst wrote:
| Isn't that just the LLM equivalent of hardcoding though?
| Trasmatta wrote:
| I wouldn't call that hardcoding, otherwise you'd have to
| call everything it does "hardcoded".
| freehorse wrote:
| "Overfitting" would be a bit more accurate term if the
| problem lies in the specific examples existing in its
| training set in various forms, places, languages etc but
| with the same values.
| Panzer04 wrote:
| Seems a bit picky. If the bot has seen the exact problem
| before, it's not really doing anything more than recall to
| solve it.
| bandrami wrote:
| 20 years ago in grad school we were doing a very early
| iteration of this where we built Markov chains with
| Shakespeare's plays and wanted to produce a plausibly
| "Shakespearian" clause given a single word to start and a
| bearish professor said "the more plausible it gets the more
| I worry people might forget plausibility is all that it
| promises".
|
| (There was also a much earlier piece of software that would
| generate semi-intelligible Kant or Hegel one sentence at a
| time, though that was through a series of a priori
| generation rules and a large-for-the-time dictionary of
| stock phrases. I wonder what ever happened to that.)
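|
| (For anyone who hasn't played with this: a word-level Markov
| chain of that kind is only a few lines of Python. This is
| just a toy sketch for illustration; "shakespeare.txt" is a
| stand-in for whatever corpus you have on hand:)
|
|     import random
|     from collections import defaultdict
|
|     def build_chain(text):
|         # Map each word to the list of words that follow it
|         # somewhere in the corpus.
|         words = text.split()
|         chain = defaultdict(list)
|         for current, following in zip(words, words[1:]):
|             chain[current].append(following)
|         return chain
|
|     def generate(chain, start, length=12):
|         # Walk the chain, picking each next word in proportion
|         # to how often it followed the current word.
|         out = [start]
|         for _ in range(length):
|             candidates = chain.get(out[-1])
|             if not candidates:
|                 break
|             out.append(random.choice(candidates))
|         return " ".join(out)
|
|     corpus = open("shakespeare.txt").read().lower()
|     print(generate(build_chain(corpus), "wherefore"))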
| jeffreygoesto wrote:
| It became a successful consultant...
| sickblastoise wrote:
| I think your prof's worries came true on a massive scale
| marcosdumay wrote:
| That said, a bot with contextual recall can be very useful.
|
| The problem is just that people keep insisting that those
| things are intelligent.
| wslh wrote:
| If I remember well, this is called overfitting [1].
|
| [1] https://en.wikipedia.org/wiki/Overfitting
| InkCanon wrote:
| I've always assumed they removed it, because it's such a
| basic and fundamental part of ML training that you separate
| your test and train data. And yet I never see any papers even
| mention if/how they do this. And I wonder if they do, how do
| they guarantee with high reliability that their massive
| terabytes of data don't contain the answer.
| llm_trw wrote:
| Imagine you have someone polluting your training data every
| day. That's what happens when you scrape any tech forum
| today.
|
| The short version is that LLM training data is the lowest
| quality data you are likely to see unless you engage in
| massive potential copyright infringement.
| ryvi wrote:
| > unless you engage in massive potential copyright
| infringement.
|
| And nobody is going to do that
| YetAnotherNick wrote:
| First of all, Putnam is not in the test data, at least I
| haven't seen OpenAI claiming that publicly. Secondly,
| removing it from internet data is not 100% accurate. There
| are translations of the problems and solutions or
| references and direct match is not enough. MMLU and test
| set benchmarks show more resilience though in some previous
| research.
| rst wrote:
| OpenAI is extremely cagey about what's in their test data
| set generally, but absent more specific info, they're
| widely assumed to be grabbing whatever they can. (Notably
| including copyrighted information used without explicit
| authorization -- I'll take no position on legal issues in
| the New York Times's lawsuit against OpenAI, but at the
| very least, getting their models to regurgitate NYT
| articles verbatim demonstrates pretty clearly that those
| articles are in the training set.)
| fn-mote wrote:
| Let's think about this.
|
| > Putnam is not in the test data, at least I haven't seen
| OpenAI claiming that publicly
|
| What exactly is the source of your belief that the Putnam
| would not be in the test data? Didn't they train on
| everything they could get their hands on?
| whimsicalism wrote:
| do you understand the difference between test data and
| train data? just reread this thread of comments
| YetAnotherNick wrote:
| I don't know why you and I are getting downvoted.
| Sometimes the HN crowd is just unhinged about AI.
| whimsicalism wrote:
| funny that nobody replying to you seems to even know what
| a test set is. i always overestimate the depth of ML
| conversation you can have on HN
| chvid wrote:
| It is on the open internet - questions and suggested
| solutions:
|
| https://kskedlaya.org/putnam-archive/
|
| I would expect all llms to be trained on it.
| jprete wrote:
| I don't see any reason to assume they removed it unless
| they're very explicit about it. Model publishers have an
| extremely strong vested interest in beating benchmarks and
| I expect them to teach to the test if they can get away
| with it.
| stingraycharles wrote:
| As usual, once a metric becomes a target, it stops being
| useful.
| franktankbank wrote:
| Well, they are doing BigCorpStuff not Science
| whimsicalism wrote:
| putnam isn't an llm benchmark, ahhhh. none of these
| companies are reporting putnam scores, so there's nothing
| nefarious about training on putnam problems
| captainbland wrote:
| I think it's reasonable to assume that OpenAI is optimising
| for maximum hype at this point, which may include wilfully
| overfitting for impactful benchmarks to generate positive
| reports.
| lupire wrote:
| When 4 came out, they released a document that BOTH
| inflated scores by changing the exam conditions and also
| bragged about scoring _worse than guessing_ on a multiple
| choice test.
| whimsicalism wrote:
| But putnam isn't an official test? I find llm discourse on
| hn so frustrating
| marcosdumay wrote:
| How could they remove it?
|
| Those are well-known problems that people talk about in
| different contexts. They would have to review their entire
| training set.
| woopwoop wrote:
| I agree that OpenAI is somewhat sketchy about this, but
| they're sketchy about everything. In the past though they
| have admitted up front to data contamination (e.g. the
| original gpt-4 press release did not use big-bench as a
| benchmark due to data contamination). For the Putnam in
| particular: this is not a benchmark that they use. There is
| no reason to exclude it since it is not part of the "test
| set" in any meaningful sense.
| jsheard wrote:
| It certainly feels like certain patterns are hardcoded
| special cases, particularly to do with math.
|
| _" Solve (1503+5171)*(9494-4823)"_ reliably gets the correct
| answer from ChatGPT
|
| _" Write a poem about the solution to
| (1503+5171)*(9494-4823)"_ hallucinates an incorrect answer
| though
|
| That suggests to me that they've papered over the model's
| inability to do basic math, but it's a hack that doesn't
| generalize beyond the simplest cases.
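|
| (For reference, the correct value is easy to check outside
| the model; a quick Python one-liner, with the intermediate
| factors noted for anyone following along:)
|
|     # 1503 + 5171 = 6674, 9494 - 4823 = 4671
|     print((1503 + 5171) * (9494 - 4823))  # 31174254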
| whimsicalism wrote:
| https://chatgpt.com/share/67755e6f-bfc8-8010-9aa3-8bcbbd9b2
| 6...
| jsheard wrote:
| To be clear, I was testing with 4o; good to know that o1
| has a better grasp of basic arithmetic. Regardless, my
| point was less to do with the model's ability to do math
| and more to do with OpenAI seeming to cover up its lack
| of ability.
| whimsicalism wrote:
| i think it's mostly that o1 mini can think through the
| solution before it starts writing the poem.
|
| i'm able to reproduce your failure on 4o
| lelandfe wrote:
| "a poem about" reads to _me_ at least like the solution
| need not be in the answer; maybe something like "a poem
| that includes the answer in the last stanza"
| whimsicalism wrote:
| yeah but it like actually gets the answer wrong not just
| omits it
| mmmore wrote:
| There are a few things that could be going on here that seem
| more likely than "hardcoded".
|
| 1. The part of the network that does complex math and the
| part that writes poetry are overlapping in strange ways.
|
| 2. Most of the models nowadays are assumed to be some
| mixture of experts. So it's possible that asking it to
| write the answer as a poem activates a different part of
| the model.
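|
| To illustrate what "activates a different part of the model"
| means mechanically, here is a toy sketch of mixture-of-experts
| routing. It is purely illustrative: the sizes, the random
| "experts" and the router are all made up, and nothing here
| reflects OpenAI's actual architecture.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|
|     def softmax(x):
|         e = np.exp(x - x.max())
|         return e / e.sum()
|
|     def moe_layer(token, router_w, experts, top_k=2):
|         # The router scores every expert for this input and only
|         # the top-k experts actually run, so different inputs can
|         # exercise different "parts of the model".
|         scores = softmax(router_w @ token)
|         chosen = np.argsort(scores)[-top_k:]
|         out = np.zeros_like(token)
|         for i in chosen:
|             out += scores[i] * experts[i](token)
|         return out, chosen
|
|     d, n_experts = 8, 4
|     router_w = rng.normal(size=(n_experts, d))
|     # Each "expert" is just a small random linear map here.
|     expert_ws = [rng.normal(size=(d, d)) for _ in range(n_experts)]
|     experts = [lambda x, w=w: w @ x for w in expert_ws]
|
|     a = rng.normal(size=d)
|     b = rng.normal(size=d)
|     _, used_a = moe_layer(a, router_w, experts)
|     _, used_b = moe_layer(b, router_w, experts)
|     print("experts used for a:", used_a)
|     print("experts used for b:", used_b)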
| mlepath wrote:
| Yea, people have a really hard time dealing with data
| leakage, especially on data sets as large as LLMs need.
|
| Basically, if something appeared online or was transmitted
| over the wire, it should no longer be eligible to evaluate
| on. D. Sculley had a great talk at NeurIPS 2024 (the same
| conference this paper was in) titled "Empirical Rigor at
| Scale - or, How Not to Fool Yourself".
|
| Basically no one knows how to properly evaluate LLMs.
| resoluteteeth wrote:
| > Is it just an open secret that the models are currently just
| being hardcoded for random benchmarks? Seems weird that people
| would be asking Putnam problems to a chatbot :/
|
| It's because people do keep asking these models math problems
| and then, when they get them right, citing it as evidence that
| they can actually do mathematical reasoning.
|
| Since it's hard to determine what the models know, it's hard to
| determine when they're just spitting out something they were
| specifically trained on.
| strangescript wrote:
| There are tests they are passing that they can't be hardcoded
| for by design. They still have all kinds of flaws and
| inconsistencies, but getting upset that they answer "2+2=4"
| because someone trained them on what the answer to 2+2 is
| supposed to be is silly.
| bwfan123 wrote:
| This work is similar to the GSM-Symbolic paper (applied to
| Putnam): https://arxiv.org/html/2410.05229v1
|
| Going forward, LLM performance must be reported on the
| confounded benchmark as well.
| obblekk wrote:
| I hope someone reruns this on o1 and eventually o3.
|
| If o1-preview was the start like gpt1, then we should expect
| generalization to increase quickly.
| jokethrowaway wrote:
| I don't think LLMs generalise much; that's why they're not
| creative and can't solve novel problems. It's pattern
| matching with a huge amount of data.
|
| Study on the topic: https://arxiv.org/html/2406.15992v1
|
| This would explain o1's poor performance on problems with
| variations. o3 seems to be expensive brute forcing in latent
| space followed by verification, which should yield better
| results - but I don't think we can call it generalisation.
|
| I think we need to go back to the drawing board.
| red75prime wrote:
| Don't worry, there are thousands of researchers at the
| drawing boards right now.
| Culonavirus wrote:
| Yeah, because if the AI boom becomes the AI bust, we'll
| have another 2008-level economic crisis on our hands.
|
| The investments into AI are in the hundreds of billions
| (maybe even more if you factor in the amount of people
| studying and researching AI), but the returns are in the
| tens of billions (if even that).
|
| If you exclude the "growth" coming from the industry
| sniffing its own farts (e.g. Nvidia selling insane amounts
| of insanely overpriced GPUs to InsertYourFavAICorp), the
| actual amount of "useful goods and services" produced (API
| access, chat subscriptions, AI-enabled app growth, etc.)
| is tiny compared to the investment levels.
|
| The AI train appears to have no brakes. A massive crash or
| AGI are the only options now. Both are going to be bad for
| average humans.
| UniverseHacker wrote:
| From firsthand experience, this simply cannot be true. I can
| give them totally novel and unique physics problems I just
| made up - ones that require tracking the movement of objects
| through a series of events - and they answer most correctly.
| Moreover, they find analogies between disparate concepts and
| fields of study and make useful suggestions based on them -
| which is arguably the same process as human creativity.
|
| I think ultimately the disconnect is people theorizing about
| what it can or cannot do with an incorrect mental model of
| what it is, and then assuming it cannot do things that it can
| in fact do. The irony of discussions on LLMs is that they
| mostly showcase the limits of humans' ability to reason
| about novel situations.
| s1mplicissimus wrote:
| the fact that this (and tons of other legitimate critique)
| got downvoted into greytext speaks so much louder to me than
| all benchmarks in the world
| mupuff1234 wrote:
| You're assuming that OpenAI isn't just gonna add the new
| questions to the training data.
| Lerc wrote:
| Their methodology shows they can create an infinite variety
| of problems.
|
| This is the same thing as synthetic training data.
|
| It doesn't matter if models are trained on the output of the
| generated data or not. If the model ends up being able to
| solve newly generated variations, you'd have to admit that it
| understands the underlying problems.
| mupuff1234 wrote:
| I think what it shows is that it has minimal "understanding"
| of the problem - otherwise such small variations wouldn't
| pose a challenge. Training it to handle these specific
| small variations doesn't change that.
|
| It's good in automation, not understanding.
| Lerc wrote:
| If it were a complete failure on variations I would be
| inclined to agree. Instead it was a 30% drop in
| performance. I would characterise that as limited
| understanding.
| sirolimus wrote:
| Fully agree with this
| cgriswald wrote:
| My guess is that what's understood isn't various parts of
| solving the problem but various aspects of the expected
| response.
|
| I see this as more akin to a human faking their way through
| a conversation.
| sirolimus wrote:
| Exactly. The naivety is just sky-high
| huitzitziltzin wrote:
| This result is the same as a recent test of the same
| method+hypothesis from a group at Apple, no? I don't have that
| reference handy but I don't think I'm making it up.
| intelkishan wrote:
| I think you are probably referring to the following paper:
| https://arxiv.org/abs/2410.05229
| huitzitziltzin wrote:
| Yup, looks like the one I meant!
|
| I am impressed by the progress on LLMs but I remain skeptical
| that they can replace humans.
|
| Perhaps some (distant!) future model but I don't fear mass
| unemployment (for example) or even moderate LLM-driven
| unemployment in the near-to-medium term.
|
| They can clearly complement human labor but there are
| vanishingly few domains where they can be substitutes.
| rubymamis wrote:
| I would love to see how well DeepSeek V3 does on this.
| KTibow wrote:
| Probably even worse, since I've heard that it's hard to
| steer it away from the most common interpretation of a
| question.
| rrr_oh_man wrote:
| Performance of these LLMs on real-life tasks feels very much
| like students' last-minute cramming for Asian-style exams.
|
| The ability to perfectly regurgitate, with no concept of the
| meaning.
| anilakar wrote:
| Basically yet another proof that we have managed to perfectly
| recreate human stupidity :-)
| cscurmudgeon wrote:
| Good students are immune to variations that are discussed in
| the paper. But most academic tests may not differentiate
| between them and the crammers.
| falcor84 wrote:
| > Good students are immune to variations
|
| I don't believe that. I'd put some good money that if an
| excellent student is given an exact question from a
| previous year, they'll do better (faster & more accurate)
| on it, than when they're given a variation of it.
| n144q wrote:
| I don't think you are betting on the same thing the
| parent comment is talking about.
|
| The assumptions aren't the same to begin with.
| Bjartr wrote:
| What's the difference between benefitting from seeing
| previous problems and being worse off when not having a
| previous problem to go from?
| fn-mote wrote:
| The point is that the "good student" will still do well
| on the variations, not suffer a 30% decrease in grade.
| hshshshshsh wrote:
| Look into JEE Advanced.
| umeshunni wrote:
| https://openreview.net/forum?id=YHWXlESeS8
|
| Our evaluation on various open-source and proprietary models
| reveals that the highest performance, even after using
| techniques like self-consistency, self-refinement and chain-
| of-thought prompting, is less than 40%. The typical failure
| modes of GPT-4, the best model, are errors in algebraic
| manipulation, difficulty in grounding abstract concepts into
| mathematical equations accurately and failure in retrieving
| relevant domain-specific concepts.
|
| I'm curious how something like O1 would perform now.
| whimsicalism wrote:
| o3 is able to get 25% on never seen before frontiermath
| problems. sure, the models do better when the answer is
| directly in their dataset but they've already surpassed the
| average human in novelty on held out problems
| jvanderbot wrote:
| The average human did zero studying on representative
| problems. LLMs did _a lot_.
| whimsicalism wrote:
| Okay? We are measuring capabilities.
| tzs wrote:
| I don't know anything about frontiermath problems, but for
| Putnam problems (which is what the submitted article is
| about) the average human that takes the exam is an
| undergraduate mathematics or science major who has studied
| prior Putnam problems and other similar problems recently
| to specifically prepare for the exam...and the most common
| score is still 0.
|
| At top tier schools the most common score will usually be
| somewhere in the 0 to 10 range (out of a possible 120).
| fldskfjdslkfj wrote:
| > never seen before frontiermath problems
|
| How do you know that?
| whimsicalism wrote:
| Because that is the whole conceit of how frontiermath is
| constructed
| fldskfjdslkfj wrote:
| Didn't they run a bunch of models on the problem set? I
| doubt they are hosting all those models on their own
| infrastructure.
| whimsicalism wrote:
| 1. OpenAI has confirmed it's not in their train (unlike
| putnam where they have never made any such claims)
|
| 2. They don't train on API calls
|
| 3. It is funny to me that HN finds it easier to believe
| theories about stealing data from APIs rather than an
| improvement in capabilities. It would be nice if
| symmetric scrutiny were applied to optimistic and
| pessimistic claims about LLMs, but I certainly don't feel
| that is the case here.
| fldskfjdslkfj wrote:
| Easier to believe or not, thinking that it's not a
| reasonable possibility is also funny.
| whimsicalism wrote:
| Do you also think they somehow stole the codeforces
| problems before they were even written or you are willing
| to believe the #175 global rank there?
| fldskfjdslkfj wrote:
| I don't think Codeforces claims to contain novel unpublished
| problems.
|
| But I'm not saying that's what they did, just that it's a
| possibility that should be considered until/if it is
| debunked.
| whimsicalism wrote:
| frankly i'm not sure what standard you would possibly
| consider a debunking
|
| codeforces constantly adds new problems - that's like the
| entire point of the contest, no?
| jcranmer wrote:
| The modern state of training is to try to use everything
| they can get their hands on. Even if there are privileged
| channels that are guaranteed not to be used as training
| data, mentioning the problems on ancillary channels (say
| emailing another colleague to discuss the problem) can
| still create a risk of leakage because nobody making the
| decision to include the data is aware that stuff that
| should be excluded is in that data set. And as we've seen
| from decades of cybersecurity, people are absolute shit
| at the necessary operational security to avoid mentioning
| stuff on ancillary channels!
|
| Given that performance is known to drop considerably on
| these kinds of tests when novel problems are tried, and
| given the ease with which these problems could leak into
| the training set somehow, it's not unreasonable to be
| suspicious of a sudden jump in performance as merely a
| sign that the problems made it into the training set
| rather than being true performance improvements in LLMs.
| whimsicalism wrote:
| Okay, then what about elite level codeforces performance?
| Those problems weren't even constructed until after the
| model was made.
|
| The real problem with all of these theories is most of
| these benchmarks were constructed after their training
| dataset cutoff points.
|
| A sudden performance improvement on a new model release
| is not suspicious. Any model release that is much better
| than a previous one is going to be a "sudden jump in
| performance."
|
| Also, OpenAI is not reading your emails - certainly not
| with a less than one month lead time.
| ImPostingOnHN wrote:
| Can you give an example of one of these problems that
| 'wasn't even constructed until after the model was made'?
|
| I'd like to see if it's truly novel and unique, the first
| problem of its type ever conceived by mankind, or if it's
| similar to existing problems.
| whimsicalism wrote:
| Sorry, I thought the whole point of this thread was that
| models can't handle problems when they are "slightly
| varied". Mottes and baileys all over the place today.
| sudosysgen wrote:
| The point is that it's not consistent on variations,
| unless it finds a way to connect it to something it
| already knows. The fact it sometimes succeeds on
| variations (in codeforces the models are allowed multiple
| tries, sometimes ridiculous numbers, to be useful)
| doesn't matter.
|
| The point is that the fact it's no longer consistent once
| you vary the terminology indicates it's fitting a
| memorized template instead of reasoning from first
| principles.
| sudosysgen wrote:
| o1 has a ~1650 rating; at that level, many or most problems
| you will be solving are going to be transplants of
| relatively well-known problems.
|
| Since o1 on codeforces just tried hundreds or thousands
| of solutions, it's not surprising it can solve problems
| where it is really about finding a relatively simple
| correspondence to a known problem and regurgitating an
| algorithm.
|
| In fact, when you run o1 on "non-standard" codeforces
| problems, it will almost always fail.
|
| See for example this post running o1 multiple times on
| various problems:
| https://codeforces.com/blog/entry/133887
|
| So the thesis that it's about recognizing a problem with
| a known solution and not actually coming up with a
| solution yourself seems to hold, as o1 seems to fail even
| on low rated problems which require more than fitting
| templates.
| s1mplicissimus wrote:
| > 1. OpenAI has confirmed it's not in their train (unlike
| putnam where they have never made any such claims)
|
| Companies claim lots of things when it's in their best
| financial interest to spread that message. Unfortunately
| history has shown that in public communications,
| financial interest almost always trumps truth (pick
| whichever $gate you are aware of for convenience, i'll go
| with Dieselgate for a specific example).
|
| > It is funny to me that HN finds it easier to believe
| theories about stealing data from APIs rather than an
| improvement in capabilities. It would be nice if
| symmetric scrutiny were applied to optimistic and
| pessimistic claims about LLMs, but I certainly don't feel
| that is the case here.
|
| What I see is generic unsubstantiated claims of
| artificial intelligence on one side and specific,
| reproducible examples that dismantle that claim on the
| other. I wonder how your epistemology works that leads
| you to accept marketing claims without evidence
| ttul wrote:
| OpenAI's credibility is central to its business:
| overstating capabilities risks public blowback, loss of
| trust, and regulatory scrutiny. As a result, it is
| unlikely that OpenAI would knowingly lie about its
| models. They have much stronger incentives to be as
| accurate as possible--maintaining their reputation and
| trust from users, researchers, and investors--than to
| overstate capabilities for a short-term gain that would
| undermine their long-term position.
|
| From a game-theoretic standpoint, repeated interactions
| with the public (research community, regulators, and
| customers) create strong disincentives for OpenAI to lie.
| In a single-shot scenario, overstating model performance
| might yield short-term gains--heightened buzz or
| investment--but repeated play changes the calculus:
|
| 1. Reputation as "collateral"
|
| OpenAI's future deals, collaborations, and community
| acceptance rely on maintaining credibility. In a repeated
| game, players who defect (by lying) face future
| punishment: loss of trust, diminished legitimacy, and
| skepticism of future claims.
|
| 2. Long-term payoff maximization
|
| If OpenAI is caught making inflated claims, the fallout
| undermines the brand and reduces willingness to engage in
| future transactions. Therefore, even if there is a short-
| term payoff, the long-term expected value of accuracy
| trumps the momentary benefit of deceit.
|
| 3. Strong incentives for verification
|
| Independent researchers, open-source projects, and
| competitor labs can test or replicate claims. The
| availability of external scrutiny acts as a built-in
| enforcement mechanism, making dishonest "moves" too
| risky.
|
| Thus, within the repeated game framework, OpenAI
| maximizes its overall returns by preserving its
| credibility rather than lying about capabilities for a
| short-lived advantage.
| Groxx wrote:
| >OpenAI's credibility is central to its business:
| overstating capabilities risks public blowback, loss of
| trust, and regulatory scrutiny.
|
| Uh huh. Kinda like what's happening right now?
|
| They're marketing blow-hards. Everyone knows it. They've
| been wildly over-stating capabilities (and future
| capabilities!) as long as Altman has had power, and
| arguably longer.
|
| They'll do it as long as they can _get away with it_ ,
| because that's all that is needed to make money on it.
| Factual accuracy rarely impacts the market when it's so
| hype-driven, especially when there is still some unique
| utility in the product.
| F7F7F7 wrote:
| Find me the folks who see nothing but good will in
| OpenAI's actions and I'll find you the folks who have
| been hyping up AGI for the last 2 years.
|
| 4 was literally sitting on a shelf waiting for release
| when 3.5 was launched. 4o was a fine tune that took over
| two years. o1 is embarrassingly unimpressive chain of
| thought which is why they hide it.
|
| The company hit a wall a year ago. But showing progress
| towards AGI keeps the lights on. If they told the truth
| at their current burn rate...they'd have no money.
|
| You don't need game theory to figure that one out.
| Spooky23 wrote:
| Frankly you need to read what they say explicitly and not
| infer what they mean by your reckoning.
|
| They are the system to beat and their competitors are
| either too small or too risk averse.
|
| They ingest millions of data sources. Among them is the
| training data needed to answer the benchmark questions.
| itfossil wrote:
| Oh, so it's almost like everything else AI-related: they
| basically cheated and lied.
|
| If you are shocked by this, you are the sucker in the room.
| youworkwepay wrote:
| Or it's time to step back and call it what it is - very good
| pattern recognition.
|
| I mean, that's cool... we can get a lot of work done with pattern
| recognition. Most of the human race never really moves above that
| level of thinking in the workforce or navigating their daily
| life, especially if they default to various societally prescribed
| patterns of getting stuff done (e.g. go to college or the
| military based on <these criteria>, find a job based on the
| best fit with <this list of desirable skills & experiences>,
| go to <these places> to find love...)
| IshKebab wrote:
| > Or it's time to step back and call it what it is - very good
| pattern recognition.
|
| Or maybe it's time to stop wheeling out this tedious and
| disingenuous dismissal.
|
| Saying it is just "pattern recognition" (or a "stochastic
| parrot") implies behavioural and performance characteristics
| that have very clearly been greatly exceeded.
| dsr_ wrote:
| Citation needed. Please be more specific, or else this is
| just a tedious and disingenuous advocacy.
| Lerc wrote:
| GPT-4 can add very large integers.
|
| It is evident that it is not recalling the sum: all
| combinations of integer addition were likely not in the
| training data, and storing the answers to the sums of all
| integer pairs up to the size that GPT-4 can manage would
| take more parameters than the model has.
|
| That addition is a small capability but you only need a
| single counterexample to disprove a theory.
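|
| To put rough numbers on the storage argument (back-of-envelope
| only; GPT-4's real parameter count is not public, so the 10^12
| figure below is just an assumed generous budget):
|
|     # Distinct ordered pairs of integers with up to 20 digits each.
|     pairs = (10 ** 20) ** 2
|     assumed_params = 10 ** 12  # assumption, not a known figure
|
|     print(f"addition problems:  {pairs:.1e}")           # 1.0e+40
|     print(f"assumed parameters: {assumed_params:.1e}")  # 1.0e+12
|     print(f"shortfall factor:   {pairs // assumed_params:.1e}")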
| alexashka wrote:
| > That addition is a small capability but you only need a
| single counterexample to disprove a theory
|
| No, that's not how this works :)
|
| You can hardcode an exception to pattern recognition for
| specific cases - it doesn't cease to be a pattern
| recognizer with exceptions being sprinkled in.
|
| The 'theory' here is that a pattern recognizer can lead
| to AGI. _That_ is the theory. Someone saying 'show me
| proof or else I say a pattern recognizer is just a
| pattern recognizer' is not a theory and thus cannot be
| disproven, or proven.
|
| This is also known as Russell's teapot.
| https://en.wikipedia.org/wiki/Russell%27s_teapot
|
| If someone claims there's a teapot out in space - the
| burden of proof is on the person making the claim, not on
| the person saying it is bullshit.
| reissbaker wrote:
| GPT-4o doesn't have hardcoded math exceptions. If you
| would like something verifiable, since we don't have the
| source code to GPT-4o, consider that Qwen 2.5 72b can
| also add large integers, and we _do_ have the source code
| and weights to run it... And it's just a neural net.
| There isn't a secret "hardcode an exception to pattern
| recognition" step in there that parses out numbers and adds
| them. The neural net simply learned to do it.
| alexashka wrote:
| That's interesting, I didn't know that, thanks.
|
| Is the claim then that LLMs are pattern recognizers but
| also _more_?
|
| It just seems to me and I guess many others that the
| thing it is _primarily_ good at is being a better google
| search.
|
| Is there something big that I and presumably many others
| are missing and if so, what is it?
| Lerc wrote:
| It's not hardcoded, reissbaker has addressed this point.
|
| I think you are misinterpreting what the argument is.
|
| The argument being made is that LLMs are mere 'stochastic
| parrots' and therefore cannot lead to AGI. The analogy to
| Russell's teapot is that someone is claiming that
| Russell's teapot is not there because china cannot exist
| in the vacuum of space. You can disprove that with a
| single counterexample. That does not mean the teapot is
| there, but it also doesn't mean it isn't.
|
| It is also hard to prove that something is thinking. It
| is also very difficult to prove that something is not
| thinking. Almost all arguments against AGI take the form
| X cannot produce AGI because Y. Those are disprovable
| because you can disprove Y.
|
| I don't think anyone is claiming to have a proof that an
| LLM will produce AGI, just that it might. If they
| actually build one, that too counts as a counterexample
| to anybody saying they can't do it.
| jampekka wrote:
| What the fundamental limitations of "pattern recognition" or
| "stochastic parrots" that LLMs have exceeded?
| IshKebab wrote:
| They can generalise to novel inputs. Ok often they mess it
| up and they're clearly _better_ at dealing with inputs they
| have seen before (who isn't?), but they can still reason
| about things they have never seen before.
|
| Honestly if you don't believe me just go and use them. It's
| pretty obvious if you actually get experience with them.
| K0balt wrote:
| So, I am conflicted about this.
|
| If we take an example of what is considered a priori as
| creativity, such as story telling, LLMs can do pretty well at
| creating novel work.
|
| I can prompt with various parameters, plot elements, moral
| lessons, and get a de novo storyline, conflicts, relationships,
| character backstories, intrigues, and resolutions.
|
| Now, the writing style tends to be tone-deaf and poor at
| building tension for the reader, and it is apparent that the
| storytelling has little "theory of mind" of the reader, but the
| material has elements that we would certainly consider to be
| creative if written by a student.
|
| It seems we must either cede that LLMs can do some creative
| synthesis, as this and some other experiments of mine suggest,
| or we must decide that these tasks, such as "creative writing"
| are not in fact creative, but rather mostly or strictly
| derivative.
|
| There is some argument to be had in assertions that
| storytelling is all derivative of certain patterns and
| variations on a fixed number of tropes and story arcs... but
| arguing this begs the question of whether humans actually do
| any "pure" creative work , or if in fact, all is the product of
| experience and study. (Training data)
|
| Which leads me to the unpleasant conflict about the debate of
| AI creativity. Is the debate really pointing out an actual
| distinction, or merely a matter of degree? And what are the
| implications, either way?
|
| I'm left with the feeling that LLMs can be as capable of
| creative work as most 8th grade students. What does this say
| about AI, or developing humans? Since most people don't exceed
| an 8th grade level of literacy, what does this say about
| society?
|
| Is there even such a thing as de novo idea synthesis?
|
| Troubling questions abound.
| danielbln wrote:
| To add to this pondering: we are discussing the state today,
| right now. We could assume this is as good as it's ever gonna
| get, and all attempts to overcome some current plateau are
| futile, but I wouldn't bet on it. There is a solid chance
| that 8th grade level writer will turn into a post-grad writer
| before long.
| zeroonetwothree wrote:
| So far the improvements in writing have not been as
| substantial as those in math or coding (not even close,
| really). Is there something fundamentally "easier" for LLMs
| about those two fields?
| danielbln wrote:
| Much more formal structure and generally code can be
| tested for correctness. Prose doesn't have that benefit.
| That said, given the right prompt and LLM, you can
| squeeze out surprisingly good stuff:
| https://bsky.app/profile/talyarkoni.com/post/3ldfjm37u2s2x
| zeroonetwothree wrote:
| I have no doubt that LLMs do creative work. I think this has
| been apparent since the original ChatGPT.
|
| Just because something is creative doesn't mean it's
| inherently valuable.
| golol wrote:
| Hmmm without a human control it is not all that clear to me that
| the variation problems are not more difficult.
| retinaros wrote:
| trained on test. who even trusts OAI anymore?
| whimsicalism wrote:
| they didn't test on putnam...
| WiSaGaN wrote:
| There is also a curated benchmark just for those famous problems
| slightly variated:
| https://github.com/cpldcpu/MisguidedAttention/tree/main/eval
| coder543 wrote:
| One problem from the benchmark:
|
|   "prompt_id": "river_crossing_easy",
|   "category": "Logic Puzzle",
|   "title": "Easy river crossing",
|   "prompt": "A farmer is on one side of a river with a wolf,
|     a goat, and a cabbage. When he is crossing the river in
|     a boat, he can only take one item with him at a time.
|     The wolf will eat the goat if left alone together, and
|     the goat will eat the cabbage if left alone together.
|     How can the farmer transport the goat across the river
|     without it being eaten?",
|   "expected_behavior": [
|     "Answer concludes that they simply get in the boat and
|      cross together in one trip"
|   ],
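|
| (Running one of those entries against a model is only a few
| lines; this is a rough sketch using the OpenAI Python client,
| not the repo's actual harness, and it leaves the judging to
| eyeballing since "expected_behavior" is free text:)
|
|     from openai import OpenAI
|
|     client = OpenAI()  # assumes OPENAI_API_KEY is set
|
|     entry = {
|         "prompt_id": "river_crossing_easy",
|         "prompt": "A farmer is on one side of a river with a wolf, "
|                   "a goat, and a cabbage. When he is crossing the "
|                   "river in a boat, he can only take one item with "
|                   "him at a time. The wolf will eat the goat if left "
|                   "alone together, and the goat will eat the cabbage "
|                   "if left alone together. How can the farmer "
|                   "transport the goat across the river without it "
|                   "being eaten?",
|         "expected_behavior": ["Answer concludes that they simply get "
|                               "in the boat and cross together in one "
|                               "trip"],
|     }
|
|     response = client.chat.completions.create(
|         model="gpt-4o",  # swap in whichever model you want to test
|         messages=[{"role": "user", "content": entry["prompt"]}],
|     )
|
|     print("MODEL:   ", response.choices[0].message.content)
|     print("EXPECTED:", entry["expected_behavior"][0])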
|
| EDIT: removing most of my commentary on this problem. As a
| human, I was tricked by the problem too. I would love to see
| how a random selection of humans would do on this one... but it
| just doesn't feel like a great test to me.
| stpn wrote:
| If you revise this prompt to satisfy your pedantry, (at
| least) 4o still gets it wrong.
| kace91 wrote:
| > This is twisting the English language to assume that
| "item" only refers to non-living things.
|
| Not really. Unless I'm not reading correctly, most of the
| problem is irrelevant, as you're only required to cross the
| river with the goat; you don't care about the cabbage. The
| difficulty lies in the assumption that you need to cross
| everything, due to the resemblance to the bigger problem.
| sunir wrote:
| You're reading it correctly. I read it again after your
| comment and I realized I too pattern matched to the typical
| logic puzzle before reading it carefully and exactly. I
| imagine the test here is designed for this very purpose to
| see if the model is pattern matching or reasoning.
| mitemte wrote:
| Wow, this seems ridiculous. The expected answer is basically
| finding a loophole in the problem. I can imagine how
| worthless all of these models would be if they behaved that
| way.
| stavros wrote:
| It's not a loophole, the question is "how can he get the
| goat across?". The answer is he just takes it across.
| kccqzy wrote:
| The problem is to ask the farmer to transport the goat. So
| the farmer indeed gets in the boat with the goat. The
| unstated gotcha is that the farmer is willing to abandon the
| wolf and the cabbage. A heavily pattern-matching LLM or human
| would immediately assume that the farmer needs to transport
| all three.
| coder543 wrote:
| Yep, and that gotcha got me, as a perfectly non-silicon
| human. My bad everyone.
| cogman10 wrote:
| No. Simply plug in the prompt to ChatGPT and see what
| happens.
|
| The LLM isn't getting confused by the meaning of "item". It's
| recognizing a common problem and not picking up on the fact
| that the farmer just needs to transport the goat and nothing
| else.
|
| Instead, it gives the standard answer for how to transport
| everything across.
| fragmede wrote:
| I'll admit that as a fallible human I didn't pick up on it,
| but I was focused on the wrong thing because I've been using
| "and the boat can take everything", and GPT-3 just could not
| get that variation in one shot.
|
| GPT-3 is old hat though. Later versions of GPT-4 manage to
| get it with a bunch of coaching, and o1 manages to solve it
| with less coaching.
| ankit219 wrote:
| They are highly effective pattern matchers. You change the
| pattern, it won't work. I don't remember who - most likely
| @tszzl (roon) - commented on X that they still trained the
| traditional way, and that there is no test-time compute (TTC)
| or Monte Carlo tree search (like AlphaGo) in o1 or o3. If
| that is true, then it's still predicting the next word based
| on its training data, likely following the most probable path
| - which comes directly from the training itself - even for
| the slight variations. Encouragingly, if TTC hasn't been
| explored, there is a long runway for performance improvements.
|
| The other reason this seems hard to guess is because we don't
| know how much of what we are asking is in the training data. It
| would perform well on some tasks while failing at others,
| even though those are similar.
| x_may wrote:
| I believe they are using scalable TTC. The o3 announcement
| released accuracy numbers for high and low compute usage, which
| I feel would be hard to do in the same model without TTC.
|
| I also believe that the $200 subscription they offer is just
| them allowing the TTC to go for longer before forcing it to
| answer.
|
| If what you say is true, though, I agree that there is a huge
| headroom for TTC to improve results if the huggingface
| experiments on 1/3B models are anything to go off.
| ankit219 wrote:
| The other comment posted YT videos where OpenAI researchers
| are talking about TTC. So, I am wrong. That $200 subscription
| is just because the number of tokens generated are huge when
| CoT is involved. Usually inference output is capped at
| 2000-4000 tokens (max of ~8192) or so, but they cannot do it
| with o1 and all the thinking tokens involved. This is true
| with all the approaches - next token prediction, TTC with
| beam/lookahead search, or MCTS + TTC. If you specify the
| output token range as high and induce a model to think before
| it answers, you will get better results on smaller/local
| models too.
|
| > huge headroom for TTC to improve results ...1B/3B models
|
| Absolutely. How this is productized remains to be seen. I
| have high hopes with MCTS and Iterative Preference Learning,
| but it is harder to implement. Not sure if OpenAI has done
| that. Though Deepmind's results are unbelievably good [1].
|
| [1]:https://arxiv.org/pdf/2405.00451v2
| whimsicalism wrote:
| ttc is an incredibly broad term and it is broadening as the
| hype spreads. people are now calling CoT "TTC" because they
| are spending compute on reasoning tokens before answering
| HarHarVeryFunny wrote:
| Yes, and HuggingFace have published this outlining some of
| the potential ways to use TTC, including but not limited to
| tree search, showing TTC performance gains from Llama.
|
| https://huggingface.co/spaces/HuggingFaceH4/blogpost-
| scaling...
| e1g wrote:
| I recently watched two interviews with OpenAI researchers where
| they describe that the breakthrough of o-series (unlike GPT
| series) is to focus on test time compute as they are designed
| to "think" more specifically to avoid pattern matching. Noam
| Brown https://youtu.be/OoL8K_AFqkw?si=ocIS0YDXLvaX9Xb6&t=195
| and Mark Chen https://youtu.be/kO192K7_FaQ?si=moWiwYChj65osLGy
| ankit219 wrote:
| Thank you, this is helpful. The post on X was seemingly
| wrong.
| mmmore wrote:
| The comment was likely that there's no explicit search. In
| o1, the model has learned how to search using its context.
| Presumably they do this by RLing over long reasoning
| strings/internal monologues.
| HarHarVeryFunny wrote:
| OpenAI have openly stated that o1 & o3 are using test time
| compute, and released a log scale graph indicating linear
| performance gains for exponential compute usage.
|
| https://openai.com/index/learning-to-reason-with-llms/
|
| They only confirm that the model/system is doing chain of
| thought, but the exponential factor and origin of reasoning
| gains likely come from TREE of thoughts (number of
| branches/compute goes up exponentially with depth), essentially
| doing tree search over different reasoning chains.
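|
| As a toy illustration of that idea (emphatically not OpenAI's
| actual method - the propose/score functions here are stubs
| standing in for LLM sampling and a learned verifier):
|
|     import random
|
|     def propose_steps(chain, k=3):
|         # Stub: a real system would sample k candidate next
|         # reasoning steps from the LLM, conditioned on the chain.
|         return [chain + [f"step {len(chain)}.{i}"] for i in range(k)]
|
|     def score(chain):
|         # Stub: a real system would use a learned verifier or
|         # reward model to rate partial reasoning chains.
|         return random.random()
|
|     def tree_of_thoughts(depth=4, beam_width=2, branch=3):
|         beams = [[]]
|         for _ in range(depth):
|             candidates = [c for chain in beams
|                           for c in propose_steps(chain, branch)]
|             # Keep only the best few chains. Without this pruning
|             # the number of chains grows as branch**depth, which is
|             # where the exponential compute cost comes from.
|             beams = sorted(candidates, key=score, reverse=True)
|             beams = beams[:beam_width]
|         return beams[0]
|
|     print(tree_of_thoughts())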
|
| I assume roon's identity is well known inside OpenAI (he's an
| employee), so I wouldn't expect him to be leaking
| implementation details on twitter.
| WiSaGaN wrote:
| I don't think this proves that the LLM is just a "pattern
| matcher". Humans make similar mistakes too, especially when
| under time pressure (similar to a non-reasoning model that
| needs to "use system one" to generate an answer in one go).
| This is further evidenced by the fact that if you
| specifically ask the models to pay attention to traps, or
| just ask the follow-up question "are you sure?", they can
| usually get it right.
| jsheard wrote:
| You're saying that humans perform worse on problems that are
| _slightly_ different than previously published forms of the
| same problem? To be clear we are only talking about changing
| variable names and constants here.
| exe34 wrote:
| Often yes, because we assume we already know the answer and
| jump to the conclusion. At least those of us with ADHD do.
| zeroonetwothree wrote:
| Not really true for Putnam problems since you have to write
| a proof. You literally can't just jump to a conclusion and
| succeed.
| Lerc wrote:
| That is the principle behind the game 'Simon says'
| fldskfjdslkfj wrote:
| 'Simon says' is about reaction time and pressure.
| chairhairair wrote:
| No, it's not at all.
|
| This is all getting so tiresome.
| zeroonetwothree wrote:
| That's a very silly analogy. A more realistic analogy would
| be: do humans perform better on computing 37x41 or 87x91
| (while showing their work)?
| Lerc wrote:
| It was not an analogy at all. It was a simplified
| example of the idea that a slight change in a pattern can
| induce error in humans.
|
| It seems some people disagree that that is what the game
| "Simon Says" is about. I feel like they might play a
| vastly simplified version of the game that I am familiar
| with.
|
| There was a recent episode of Game Changer based on this
| which is an excellent example of how the game leader
| should attempt to induce errors by making a change that
| does not get correctly accounted for.
| PunchTornado wrote:
| Isn't it weird that they didn't test Gemini?
| sirolimus wrote:
| Yea no shit. LLMs are just REALLY good guessers. People gotta
| stop the hype lol.
|
| Using LLMs for anything serious and which requires consistency
| and trustworthiness without hallucinations is irresponsible and
| ridiculous.
|
| Closed source LLMs are a bubble and a joke.
| wim wrote:
| One experiment I would love to see, although not really feasible
| in practice, is to train a model on _all_ digitized data from
| before the year 1905 (journals, letters, books, broadcasts,
| lectures, the works), and then ask it for a formula for mass-
| energy equivalence. A certain answer would definitely settle the
| debate on whether pattern recognition is a form of intelligence
| ;)
| newjersey wrote:
| There is a reason why they won't do it. They are selling a
| narrative. There is a lot of money to be made here with this
| narrative and proving that artificial intelligence is NOT
| intelligent won't help sell that narrative.
| ben_w wrote:
| The goal is to make it intelligent, by which OpenAI in
| particular explicitly mean "economically useful", not simply
| to be shiny.
|
| Passing tests is well known to be much easier than having
| deep understanding, even in humans. They openly ask for tests
| like this, not that they could possibly prevent them if they
| wanted to.
|
| There's scammers trying what you say of course, and I'm sure
| we've all seen some management initiatives or job
| advertisements for some like that, but I don't get that
| impression from OpenAI or Anthropic, definitely not from
| Apple or Facebook (LeCun in particular seems to deny models
| will ever do what they actually do a few months later).
| Overstated claims from Microsoft perhaps (I'm unimpressed
| with the Phi models I can run locally, GitHub's copilot has a
| reputation problem but I've not tried it myself), and Musk
| definitely (I have yet to see someone who takes Musk at face
| value about Optimus).
| _heimdall wrote:
| > The goal is to make it intelligent, by which OpenAI in
| particular explicitly mean "economically useful", not
| simply to be shiny
|
| I never understood why this definition isn't a huge red
| flag for most people. The idea of boiling what intelligence
| is down to economic value is terrible, and inaccurate, in
| my opinion.
| ben_w wrote:
| Everyone has a very different idea of what the word
| "intelligence" means. This definition has the advantage
| that, unlike when various AIs became superhuman at
| arithmetic, symbolic logic, chess, Jeopardy, Go, poker, the
| number of languages they could communicate in fluently,
| etc., it's tied to tasks people will continuously pay
| literally tens of trillions of dollars each year for,
| because they want those tasks done.
| zaroth wrote:
| Maybe by the time it's doing a trillion dollars a year of
| useful work (less than 10 years out) people will call it
| intelligent... but still probably not.
| _heimdall wrote:
| This definition alone might be fine enough if the word
| "intelligence" wasn't already widely used outside of AI
| research. It is though, and the idea that intelligence is
| measured solely through economic value is a very, very
| strange approach.
|
| Try applying that definition to humans and you pretty
| quickly run into issues, both moral and practical. It
| also invalidates basically anything we've done over
| centuries considering what intelligence is and how to
| measure it.
|
| I don't see any problem at all using economic value as a
| metric for LLMs or possible AIs, it just needs a
| different term than intelligence. It pretty clearly feels
| like for-profit businesses shoehorning potentially
| valuable ML tools into science fiction AI.
| s1mplicissimus wrote:
| I haven't seen "intelligent" used as "economically useful"
| _anywhere_ outside the AI hype bubble. The most charitable
| interpretation I can think of is lack of understanding of
| the common usage of the word, the most realistic one is
| intentionally muddying terminology so one cannot be called
| a liar. Are LLMs helpful tools for some tasks like rough
| translations, voice2text etc? Sure. Does it resemble what
| humans call intelligence? I have yet to see an example
| of that. The suggested experiment is a great idea and would
| sway my opinion drastically (given all the training data,
| model config, prompts & answers are public and reproducible
| of course, we don't want any chance of marketing BS to
| taint the results, do we). I'll be honest though, I'm not
| going to hold my breath for that experiment to succeed with
| the LLM technology...
|
| edit: lol downvoted for calling out shilling i guess
| numpad0 wrote:
| They don't have to do it themselves. The super-GPU cluster
| used to train GPT-6 will eventually shrink down to garage
| size, and eventually some YouTuber will.
| amelius wrote:
| This is how patent disputes should be decided. If an LLM can
| figure it out, then it is not novel.
| bushbaba wrote:
| And what prompt would you give that does have novel input.
| neom wrote:
| If I was me, I would start by giving a collection of LLMs
| the patent, ask half "why is this patent novel" and half
| "why is this patent not novel" and see what happens. I use
| this method of "debugging" my thinking (not code), might be
| a starting point here? Not sure.
| amelius wrote:
| Every patent application contains a section of claims.
| You can just ask the LLM to come up with ways to satisfy
| those claims.
|
| But I'm sure there are lots of ways to go about it.
| dahart wrote:
| LLMs are already good at summarizing the claims - patents
| all explain why they're novel - so it would be a waste to
| ask them, especially if you reserve half the LLMs in your
| set for this question. Asking why a patent is not novel
| is a great question, but the problem with asking why they
| are not novel is it has to know all other patents
| (including very recently filed patents) and it has to be
| correct, which LLMs are not at all good at yet (plus they
| still tend to hallucinate confidently). This is a great
| test for LLM accuracy if you know the right answer
| already, and not a good test for patent validity.
| bdowling wrote:
| Novelty (is it new) is the easy question because it's just
| checking a database. Patentable inventions also have to be
| non-obvious, which is a more subtle question.
| davidclark wrote:
| I know it's just a spicy take on a forum, but this sounds
| like a terrible public policy.
| pixelsort wrote:
| This reminds me of a similar idea I recently heard in podcast
| with Adam Brown. I'm unsure whether it is his original notion.
| The idea being, that if we can create AI that can derive
| special relativity (1905) from pre-Einstein books and papers
| then we have reached the next game-changing milestone in the
| advancement of artificial reasoning.
| FergusArgyll wrote:
| Great podcast, especially the part about hitchhiking :)
|
| https://www.youtube.com/watch?v=XhB3qH_TFds
|
| Or RSS
|
| https://api.substack.com/feed/podcast/69345.rss
| wim wrote:
| Right, hadn't listened to that one, thanks for the tip!
| saagarjha wrote:
| Finally, a true application of E=mc^2+AI
| fny wrote:
| But is there even enough pre-1905 data to create models that
| say hello world reliably?
|
| The terabytes of training data required for decent LLMs do
| not exist. I'd guess there may only be gigabytes' worth.
| neom wrote:
| My wife is an 18th century American history professor. LLMs
| have very very clearly not been trained on 18th century
| English, they cannot really read it well, and they don't
| understand much from that period outside of very textbook
| stuff, anything nuanced or niche is totally missing. I've
| tried for over a year now, regularly, to help her use LLMs in
| her research, but as she very amusingly often says "your
| computers are useless at my work!!!!"
| whimsicalism wrote:
| my wish for new years is that every time people make a
| comment like this they would share an example task
| neom wrote:
| https://s.h4x.club/bLuNed45 - it's more crazy to me that
| my wife CAN in fact read this stuff easily, vs the fact
| that an LLM can't.
|
| (for anyone who doesn't feel like downloading the zip,
| here is a single image from the zip:
| https://s.h4x.club/nOu485qx)
| whimsicalism wrote:
| have you been trying to provide it as an image directly?
| if so, doesn't surprise me at all.
|
| really thanks for sharing!
| neom wrote:
| My wife's particular area of research is using the
| capitalist system to "rebuild" broken slave family trees.
| She flies around the US going to archives and getting
| contracts and receipts for slaves, figures out how they got
| traded, and then figures out where they ended up, and then
| "re-links" them to their family to the best of her ability.
| Although her area of research isn't particularly overflowing
| with researchers, there are still a lot of people like her
| who just have this very tacit knowledge among each other.
| They email around a lot and stuff, knowledge like who was
| running a region during a period; of course they publish,
| but it's a small field and it's all extremely poorly
| documented. I was watching the Adam Brown interview with
| Dwarkesh Patel the other day and he said that for his work,
| LLMs are better than bothering an expert in an area of his
| field with a question. I'm not sure people in her field are
| able to do this as readily. Frankly, I've yet to find a
| novel or good use for an LLM in her work. I often joke that
| she and "her people" are going to be the last ones with jobs
| if they don't transfer their knowledge into LLMs, ha! :)
| umeshunni wrote:
| Super interesting in that:
|
| 1. In theory these kinds of connections should be something
| that LLMs are great at doing.
|
| 2. It appears that LLMs are not trained (yet?) on cursive
| and other non-print text.
| neom wrote:
| Yes, I regularly encourage my wife to approach the comp
| sci department at her uni on doing a project together but
| she for whatever reason doesn't think they would be
| interested/I've yet to get her interested enough to grasp
| what a transformer can do. I find it very frustrating
| because of your first point, she very specifically could
| do some meaningful % more research if the LLMs could help
| with the connections. Sadly, I am not rich, handsome or
| talented enough to do this for her.
| mikeruiz wrote:
| " From what the text shows, Henry Jenkins and his wife
| Caroline (the boy's mother) are asking the Orphans Court
| to void an apprenticeship arrangement involving her minor
| son, James Timmons. They claim James--about 15 years old
| --was bound out as an apprentice without proper authority
| or the mother's consent, and they cite Maryland law (an
| act from 1793 and its supplements) which they believe was
| not followed. They request the court declare that the
| indenture is invalid and restore James to his mother's
| care."
|
| No idea if that's correct (and no doubt not useful to an
| expert able to read this directly, but curious if it's
| close?
| lupire wrote:
| The best human performance on that task required many many
| hours of private work given that input.
|
| How much would ChatGPT charge for that much reasoning? Isn't
| cost quadratic in short-term working memory?
|
| It would be more interesting to prompt it with X% of a new
| paper's logical argument, and see if it can predict the rest.
| redman25 wrote:
| Why does AI have to be smarter than the collective of
| humanity in order to be considered intelligent? It seems
| like we keep raising the bar on what intelligence means
| ¯\_(ツ)_/¯
| willis936 wrote:
| A machine that synthesizes all human knowledge really ought
| to know more than an individual in terms of intellect. An
| entity with all of human intellect prior to 1905 does not
| need to be as intelligent as a human to make discoveries that
| mere humans with limited intellect made. Why lower the bar?
| ninetyninenine wrote:
| The heightening of the bar is an attempt to deny that
| milestones were surpassed and to claim that LLMs are not
| intelligent.
|
| We had a threshold for intelligence. An LLM blew past it
| and people refuse to believe that we passed a critical
| milestone in creating AI. Everyone still thinks all an LLM
| does is regurgitate things.
|
| But a technical threshold for intelligence cannot have any
| leeway for what people want to believe. They don't want to
| define an LLM as intelligent even if it meets the Turing
| test technical definition of intelligence so they change
| the technical definition.
|
| And then they keep doing this without realizing and
| trivializing it. I believe humanity will develop an entity
| smarter than humans but it will not be an agi because
| people keep unconsciously moving the goal posts and
| changing definitions without realizing it.
| ImPostingOnHN wrote:
| Since we know an LLM does indeed simply regurgitate data,
| having it pass a "test for intelligence" simply means
| that either the test didn't actually test intelligence,
| or that intelligence can be defined as simply
| regurgitating data.
| greentxt wrote:
| Intelligence is debatable without even bringing AI into
| it. Nobody agrees on whether humans have intelligence.
| Well, smart people agree, but those people also agree we
| have or will soon have AGI or something negligibly
| different from it.
| ImPostingOnHN wrote:
| _> Intelligence is debatable without even bringing AI
| into it. Nobody agrees on whether humans have
| intelligence._
|
| Yep, that constitutes the second of the two options I
| mentioned.
|
| _> Well, smart people agree but those people also agree
| we have or will soon have agi or something negligibly
| different from it._
|
| lol, the ol' _" I know what all smart people think and
| it's what I think"_ appeal.
| klabb3 wrote:
| Disagree. The AI we have is very useful for specific
| things. The pushback you see is not so much denying the
| milestones that have been surpassed, but rather the
| milestones that enthusiasts claim are near. And for good
| reason! Every time and in every field we've extrapolated
| an exponential-looking curve ad infinitum, it's turned
| out to be S-shaped, and life goes on.
|
| > We had a threshold for intelligence.
|
| We've had many. Computers have surpassed several barriers
| considered to require intelligence, such as arithmetic,
| guided search like chess computers, etc. The Turing test
| was a good benchmark because of how foreign and strange
| it was. It's somewhat true we're moving the goalposts.
| But the reason is not stubbornness, but rather that we
| can't properly define and subcategorize what reason and
| intelligence really are. The difficulty of measuring
| something does not mean it doesn't exist or isn't
| important.
|
| Feel free to call it intelligence. But the limitations
| are staggering, given the advantages LLMs have over
| humans. They have been trained on _all_ written
| knowledge, far more than any human could ever come close
| to reading. And they still have not come up with anything
| conceptually novel, such as a new idea or theorem that is
| genuinely useful. Many people suspect that pattern
| matching is not the only thing required for intelligent
| independent thought. _Whatever that is!_
| redman25 wrote:
| If you consider that evolution has taken millions of
| years to produce intelligent humans, the fact that LLM
| training completed in a matter of months can produce
| parrots of humans is impressive by itself. Talking with
| the parrot is almost indistinguishable from talking with
| a real human.
|
| As far as pattern matching, the difference I see from
| humans is consciousness. That's probably the main area
| yet to be solved. All of our current models are static.
|
| Some ideas for where that might be headed:
|
| - Maybe all it takes is to allow an LLM to continuously
| talk with itself much like how humans have "the milk
| man's voice".
|
| - Maybe we might need to allow LLMs to update their own
| weights but that would also require an "objective" which
| might be hard to encode.
| Dylan16807 wrote:
| > If you consider that evolution has taken millions of
| years to produce intelligent humans, the fact that LLM
| training completed in a matter of months can produce
| parrots of humans is impressive by itself.
|
| I disagree that such a comparison is useful. Training
| should be compared to training, and LLM training feeds in
| _so_ many more words than a baby gets. (A baby has other
| senses, but it's not like feeding in 20 years of video
| footage is going to make an LLM more competent.)
| 343rwerfd wrote:
| "Why lower the bar?"
|
| Because of the chance of misundertanding. Failing at
| acknowledging artificial general intelligence standing
| right next to us.
|
| An incredible risk to take in alignment.
|
| Perfect memory doesn't equal to perfect knowledge, nor
| perfect understanding of everything you can know. In fact,
| a human can be "intelligent" with some of his own memories
| and/or knowledge, and - more commmonly - a complete "fool"
| with most of the rest of his internal memories.
|
| That said, is not a bit less generally intelligent for
| that.
|
| Supose it exists a human with unlimited memory, it retains
| every information touching any sense. At some point, he/she
| will probably understand LOTs of stuff, but it's simple to
| demonstrate he/she can't be actually proficient in
| everything: you have read how do an eye repairment surgery,
| but have not received/experimented the training,hence you
| could have shaky hands, and you won't be able to apply the
| precise know-how about the surgery, even if you remember a
| step-by-step procedure, even knowing all possible
| alternatives in different/changing scenarios during the
| surgery, you simply can't hold well the tools to go
| anywhere close to success.
|
| But you still would be generally intelligent. Way more than
| most humans with normal memory.
|
| If we'd have TODAY an AI with the same parameters as the
| human with perfect memory, it will be most certainly
| closely examined and determined to be not a general
| artificial intelligence.
| Jensson wrote:
| > If we had TODAY an AI with the same parameters as the
| human with perfect memory, it would most certainly be
| closely examined and determined not to be an artificial
| general intelligence.
|
| The human could learn to master a task; current AI can't.
| That is very different: the AI doesn't learn or remember
| stuff, it is stateless.
|
| When I can take an AI and get it to do any job on its own
| without any intervention after some training, then that
| is AGI. The person you mentioned would pass that easily.
| Current-day AI isn't even close.
| ZooCow wrote:
| I had a similar thought but about asking the LLM to predict
| "future" major historical events. How much prompting would it
| take to predict wars, etc.?
| djeastm wrote:
| You mean train on pre-1939 data and predict how WWII would
| go?
| ZooCow wrote:
| Right. If it were trained through August 1939, how much
| prompting would be necessary to get it to predict aspects
| of WWII.
| MoreMoore wrote:
| Man, that would be a fascinating experiment. Would it be
| able to predict who wins and when? Would it be able to
| predict the Cold War?
| sitkack wrote:
| But we know Hitler has a Time Machine that goes forward,
| he doesn't need to return to use that knowledge as he
| already has a timeline here to use. Definitely risks
| involved here.
| david-gpu wrote:
| That will never work on any complex system that behaves
| chaotically, such as the weather or complex human endeavors.
| Tiny uncertainties in the initial conditions rapidly turn
| into large uncertainties in the outcomes.
| morbicer wrote:
| Not an LLM but models could get pretty good at weather
|
| https://www.technologyreview.com/2024/12/04/1107892/google-
| d...
| bmacho wrote:
| No, they don't, since the weather is chaotic.
|
| I mean, there are theorems about how close you can get,
| and models are not better than what is theoretically
| possible.
| david-gpu wrote:
| Yeah, I wish more people understood that it is simply not
| possible to make precise long-term forecasts of chaotic
| systems. Whether it is weather, financial markets, etc.
|
| It is not that we don't know yet because our models are
| inadequate, it's that it is unknowable.
| wodderam wrote:
| The problem is we stupidly branded the field "chaos
| theory" and made it sound like bullshit so the ideas of
| non-linear dynamics have largely been lost on several
| generations at this point.
|
| Not just chaos theory but "chaos theory" + psychedelic
| fractal artwork. Then the popular James Gleick book,
| "Chaos: making a new science" just sounds like complete
| bullshit and it sold a ton of copies.
|
| I only started studying non-linear dynamics in about 2015
| after first running across it in the late 90s but I
| literally thought it was all pseudoscience then.
|
| Between "chaos theory", fractals and a best selling book
| it would be hard to frame a new scientific field as
| pseudoscience more than what played out.
| amluto wrote:
| > ask it for a formula for mass-energy equivalence
|
| Way too easy. If you think that mass and energy might be
| equivalent, then dimensional analysis doesn't give you too much
| choice in the formula. Really, the interesting thing about
| E=mc^2 isn't the formula but the assertion that mass is a form
| of energy and all the surrounding observations about the
| universe.
|
| Also, the actual insight in 1905 was more about asking the
| right questions and imagining that the equivalence principle
| could really hold, etc. A bunch of the math predates 1905 and
| would be there in an AI's training set:
|
| https://en.m.wikipedia.org/wiki/History_of_Lorentz_transform...
| whimsicalism wrote:
| but e=mc^2 is just an approximation
|
| e: nice, downvoted for knowing special relativity
| amluto wrote:
| Can you elaborate? How is E=mc^2 an approximation, in
| special relativity or otherwise? What is it an
| approximation of?
| whimsicalism wrote:
| E^2 = m^2 + p^2, where p is momentum and i've dropped
| unit adjustment factors like c
|
| this allows light to have energy even if it's massless
| ac29 wrote:
| e=mc^2 is only correct for objects at rest. The full
| equation takes velocity into account, but for "low"
| speeds where v<<c, the momentum term is close enough to
| zero that E=mc^2 is still a good approximation.
| gus_massa wrote:
| I didn't downvote it, but short comments are a very big
| risk. People may misinterpret it, or think it's a
| crackpot theory or a joke, and then downvote.
|
| When in doubt, add more info, like:
|
| But the complete equation is E=sqrt(m^2c^4+p^2c^2), which
| reduces to E=mc^2 when the momentum p is 0. More info in
| https://en.wikipedia.org/wiki/Mass%E2%80%93energy_equivalenc...
| tame3902 wrote:
| What I learnt is that there is a rest mass and a
| relativistic mass. The m in your formula is the rest
| mass. But when you use the relativistic mass, E=mc^2
| still holds. And for the rest mass I always used m_0 to
| make clear what it is.
| whimsicalism wrote:
| sounds like you had a chemistry education. relativistic
| mass is IMO very much not a useful way of thinking about
| this and it is sort of tautologically true that E =
| m_relativistic because "relativistic mass" is just taking
| the concept of energy and renaming it "mass"
| amluto wrote:
| This is all sort of silly IMO. The equation, like
| basically all equations, needs context. What's E? What's
| m? If E is the total energy of the system and m is the
| mass (inertial or gravitational? how far past 1905 do you
| want to go?), then there isn't a correction. If m is rest
| mass and E is total energy, then I would call it flat-out
| wrong, not merely approximate. After all, a decent theory
| really ought to reproduce Newtonian mechanics under some
| conditions beyond completely at rest.
|
| IMO, when people get excited about E=mc^2, it's in
| contexts like noticing that atoms have rest masses that
| are generally somewhat below the mass of a proton or
| neutron times the number of protons and neutrons in the
| atom, and that the mass difference _is_ the binding
| energy of the nucleus, and you can do nuclear reactions
| and convert between mass and energy! And then E=mc^2 is
| apparently exactly true, or at least true to an excellent
| degree, even though the energies involved are extremely
| large and Newtonian mechanics can't even come close to
| accounting for what's going on.
| whimsicalism wrote:
| inertial mass, rest mass, gravitational mass - these are
| essentially all the same thing. "relativistic mass" is an
| additional concept where we rewrite energy as mass and is
| considered archaic
| ackfoobar wrote:
| The next section of the Wikipedia link discusses the low
| speed approximation, where sqrt(m^2c^4+(pc)^2) ≈ mc^2 +
| 1/2 mv^2.
|
| Calling E=mc^2 an "approximation" is technically correct.
| It's the 0th order approximation. That's just pointlessly
| confusing. A better word choice would be "a special
| case".
| whimsicalism wrote:
| i think we are venturing into pedantic territory - the
| point of my comment is that the full derivation is a
| little harder than just E=mc^2 dimensional analysis
| mcnamaratw wrote:
| Kind of agree. But pervasive downvoting by folks who
| don't understand the subject is a form of forum rot. The
| risk is only that we expose the rot. Not such a terrible
| risk, because either the owners notice and fix the
| problem, or the forum continues to rot. In the latter
| case karma points won't be desirable in the long run.
| bcoates wrote:
| This is why RLHF causes those overly verbose answers to
| simple questions: it's a fundamentally busted evaluation
| function, so you wind up optimizing for the wrong thing.
| make3 wrote:
| it's a special case, not an approximation
| whimsicalism wrote:
| it's not an either/or, it is both. regardless, my point is
| that you cannot simply dimensional analysis your way to
| special relativity or the energy-momentum relation
| mitthrowaway2 wrote:
| This thread has come up before(1), but I'll continue to
| argue that relativistic mass is a perfectly valid concept
| as much as any other, and if you disagree, you'll need
| arguments more substantial than it just being unpopular
| these days. Especially if you're trying to pedantically
| argue people out of using a concept that they personally
| find useful to aid their own understanding, just because it
| doesn't fit your own mathematical or aesthetic preferences.
|
| 1: https://news.ycombinator.com/item?id=38425252
| tlb wrote:
| It's nontrivial why it's mc^2 and not 1/2 mc^2, since kinetic
| energy generally is 1/2 mv^2
| wslh wrote:
| Why do we need this when current models already have to
| handle questions and answers about new discoveries, ones
| that are happening every week and are often easier to
| grasp than Einstein's equations? I think it is clear that
| they will fail on most of them. That doesn't mean that
| LLMs are not useful, but there are more walls in the
| road.
| layer8 wrote:
| Instead of asking for a formula, a better test may be to point
| out all the seeming contradictions in physics at that time
| (constancy of the speed of light, wave vs. particle nature of
| light, ultraviolet catastrophe), and ask it how they could be
| resolved.
| chvid wrote:
| Isn't this simply because the dataset used (Putnam-AXIOM
| Original) is in the training data used to train the various
| models?
|
| Given that these are simple variations (variable names
| and constant values changed in math problems), why
| wouldn't the companies creating these models (OpenAI
| etc.) create these variations themselves in order to
| ensure that the model is learning how to solve the
| problem rather than memorizing a solution? Seems like a
| very obvious thing to do ...
| lupire wrote:
| They are not only simple renames. LLMs are good at those.
| They are minor structural changes.
| ben_w wrote:
| Link title says "slightly", but the PDF says two different kinds
| of variations: variable names (slight) and problem constants
| (significant), and the 30% drop is on the combination of a 26
| variable and also 26 variable + constant questions.
|
| It's good to have a better test (though I bet this one will also
| be quickly saturated like all the others), but the title here
| doesn't seem justified by the page title there or the content.
| sundarurfriend wrote:
| I would definitely classify both of those as slight changes. In
| fact I'd rename those as slight => trivial and significant =>
| slight.
| zeroonetwothree wrote:
| Right, renaming a variable should have zero effect on ability
| to solve (it wouldn't for a human). Changing a constant
| should be very minor, probably also ~0 effect in most cases.
| I say this as someone that's done many of these problems.
| lomkju wrote:
| Even humans get confused with trick questions right? Once they
| understand this is a trick question they no longer fall for it.
| :)
| steveBK123 wrote:
| Yes so when you change the sequence of tokens they've
| electronically memorized, they get a bit worse at predicting the
| next token?
| zeroonetwothree wrote:
| When you put it that way it's a trivial result. However,
| the consequences for using AI to replace humans on tasks
| are significant.
| steveBK123 wrote:
| The only people super pumping the idea of mass replacement of
| human labor are financially invested in that outcome.
| aucisson_masque wrote:
| It drops from 50 to 33.96. Still the best: on the varied
| problems, o1 is around 2 times better than Claude is on
| the original test.
|
| The rest of the LLMs are far behind, in the single
| digits.
|
| It makes me wonder if o1 is finally getting intelligent.
| LLMs are not supposed to understand these problems when
| you change variables and values; they have to rely on
| preexisting data of absolutely identical solved problems
| to give a correct answer.
|
| I haven't followed LLM development closely, but I heard
| at one point that ChatGPT is now composed of multiple
| LLMs, and maybe they added multiple models aimed at
| problem solving or trigonometry, for instance.
|
| That would explain why it's so much better.
| e1g wrote:
| The paper includes several examples of their modified questions.
| There has been a substantial jump from o1-preview to o1, so I
| gave several samples to o1 and o1-pro ( _not_ o1-preview), and
| current o1s gave the correct answer to those modified problems.
| SOTA changes fast.
| suddenlybananas wrote:
| LLM boosters are so tiresome. You hardly did a rigorous
| evaluation, the set has been public since October and could
| have easily been added to the training data.
| gdhkgdhkvff wrote:
| Your points would be more convincing if you didn't preface
| them with arrogant cynicism.
| e1g wrote:
| I'm not skilled enough in math to do a rigorous evaluation,
| so it was a quick check.
|
| Terence Tao _is_ skilled enough, and he describes o1's
| math ability as "...roughly on par with a mediocre, but
| not completely incompetent graduate student" (good
| discussion at
| https://news.ycombinator.com/item?id=41540902), and the
| next iteration o3 just got 25% on his brand-new Frontier
| Math test.
|
| Seeing LLMs as useless is banal, but downplaying their rate
| of improvement is self-sabotage.
| fumeux_fume wrote:
| > "...roughly on par with a mediocre, but not completely
| incompetent graduate student"
|
| Let it sink in how vague and almost meaningless that
| statement is.
| pizza wrote:
| What types of questions are you hoping to answer for that
| to be considered a vague statement?
| jtefera wrote:
| The paper mentions that on several occasions the LLM will
| provide a correct answer but will either take big jumps without
| justifying them or will take illogical steps but end up with
| the right solution at the end. Did you check for that?
| e1g wrote:
| No, I don't know enough math to test the logic, only to
| check questions against their expected answers in
| https://anonymous.4open.science/r/putnam-axiom-B57C/data/Put...
| zeroonetwothree wrote:
| Putnam problems need to actually be graded, often the
| answer itself is trivial.
| whimsicalism wrote:
| So many negative comments as if o3 didn't get _25% on
| frontiermath_ - which is absolutely nuts.
|
| Sure, LLMs will perform better if the answer to a problem
| is directly in their training set. But that doesn't mean
| they perform _badly_ when the answer isn't in their
| training set.
| optimalsolver wrote:
| EpochAI have to send the questions (but not the answer key) to
| OpenAI in order to score the models.
|
| An overnight 2% -> 25% jump on this benchmark is a bit curious.
| whimsicalism wrote:
| 1. OpenAI said they did not train on these problems & they
| don't train on API calls in general, that is a legal policy.
|
| 2. It was a new major model release from work over the course
| of months - struggle to see that as an 'overnight' jump in
| any real meaning.
|
| 3. Why is it easier to believe large scale corporate fraud
| than that the stated capabilities on a held out test set are
| real? Reads like cope, if I'm being frank.
| zeroonetwothree wrote:
| I don't think it's "easier to believe" just that it raises
| some red flags.
| exitb wrote:
| The 2% result belonged to a traditional LLM that costs cents
| to run, while o3 is extremely expensive.
| MattDaEskimo wrote:
| Sure, it did well on FrontierMath. That's not what this
| thread is about.
|
| Your comment isn't relevant at all
| whimsicalism wrote:
| this thread is about math LLM capability, it's a bit
| ridiculous to say that mentioning frontiermath is off topic
| but that's just me
| MattDaEskimo wrote:
| Just because you can generalize the topic doesn't mean you
| can ignore the specific conversation and choose your hill
| to argue.
|
| Additionally, the conversation in this topic is about the
| model's ability to generalize and its potential
| overfitting, which is arguably more important than
| parroting mathematics.
| whimsicalism wrote:
| performance on a held-out set (like frontiermath)
| compared to putnam (which is not held out) is obviously
| relevant to a model's potential overfitting.
|
| i'm not going to keep replying, others can judge whether
| they think what i'm saying is "relevant at all."
| MattDaEskimo wrote:
| Again, you set your own goal posts and failed to add any
| insights.
|
| The topic here isn't "o-series sucks", it's addressing a
| found concern.
| lupire wrote:
| The researcher's answer to their variant of "Year: 2016 ID: A1"
| in the appendix is wrong.
|
| The solution (sum of 1,2,5,6,9,10,13,14, ...) has an alternating
| pattern, so has to be two piecewise interleaved polynomials,
| which cannot be expressed as a single polyomial.
|
| Their answer works for k=1,2, but not k=3.
|
| https://openreview.net/pdf?id=YXnwlZe0yf
|
| This does not give me confidence in the results of their paper.
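| For anyone who wants to check the arithmetic behind that
| claim, here is a minimal sketch of my own (the term()
| helper is hypothetical, just encoding the sequence quoted
| above; it is not the paper's notation):
|
|       # Partial sums of 1, 2, 5, 6, 9, 10, 13, 14, ...: odd positions
|       # are 4j-3 and even positions are 4j-2, matching the listed terms.
|       def term(i):
|           j = (i + 1) // 2
|           return 4 * j - 3 if i % 2 else 4 * j - 2
|
|       sums, total = [], 0
|       for k in range(1, 13):
|           total += term(k)
|           sums.append(total)
|       print("partial sums:", sums)
|
|       # A degree-d polynomial has constant d-th finite differences.
|       row = sums
|       for depth in range(1, 5):
|           row = [b - a for a, b in zip(row, row[1:])]
|           print(f"order-{depth} differences:", row)
|       # The order-2 differences alternate 3, 1, 3, 1, ... and never
|       # settle, so no single polynomial in k matches all partial sums.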
| Chinjut wrote:
| You are correct. Their answer is instead the sum of the first k
| terms of 1, 2, 6, 10, 14, 18, ..., for positive k.
| zeroonetwothree wrote:
| Very astute. Did you communicate this to the authors?
| pfedak wrote:
| You're misreading the solution: the first part reads n=1,
| a trivial special case, not n congruent to 1 mod 4.
|
| The statement doesn't hold for e.g. n=5. Taking m=2 gives the
| permutation (1 2 4 3), which is odd, and thus cannot have a
| square root.
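| That parity argument is easy to double-check by brute
| force (a small sketch of my own, not code from the paper
| or the thread):
|
|       from itertools import permutations
|
|       # x -> 2*x mod 5 on {1, 2, 3, 4}: the 4-cycle (1 2 4 3) above.
|       p = {x: (2 * x) % 5 for x in range(1, 5)}
|
|       # Search for any q with q(q(x)) == p(x) for all x, where q[i-1]
|       # is the image of i under q.
|       roots = [q for q in permutations(range(1, 5))
|                if all(q[q[x - 1] - 1] == p[x] for x in range(1, 5))]
|       print(roots)  # [] -- the odd permutation has no square root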
| Topfi wrote:
| I wouldn't be surprised if something similar is found
| concerning the ARC challenge, and it is why I still
| maintain my own private LLM challenges to gauge current
| capabilities. Of course, I have little illusion that
| these are fully private, but it is better than fully
| public tests.
|
| Even the most straightforward, logical, easily reasoned
| ones stump all LLMs I have access to, which is why I am
| so skeptical concerning emergence, reasoning, and all
| this hype around "AGI"...
| scotty79 wrote:
| I think the lamentation about real-world data running out
| is misplaced. We can multiply data with slight
| variations, which might lead to better resilience and
| more accurate model responses to novel problems.
| jerf wrote:
| I remember when this stuff was all coming out and people
| were finally excited about ChatGPT getting the "which is
| heavier, a 10 pound bag of feathers or a 10 pound bag of
| bricks?" problem correct. But of course it got it
| correct. It was in the training set. Vary the problem
| slightly by just changing the nouns, or changing the
| numbers so that one in fact was heavier than the other,
| and performance went all over the map.
|
| I just went to chatgpt.com and put into the chat box
| "Which is heavier, a 9.99-pound back of steel ingots or a
| 10.01 bag of fluffy cotton?", and the very first answer I
| got (that is, I didn't go fishing here) was:
|
|       The 9.99-pound bag of steel ingots is heavier than the
|       10.01-pound bag of fluffy cotton by a small margin.
|       Although the cotton may appear larger due to its fluffy
|       nature, the steel ingots are denser and the weight of the
|       steel bag is 9.99 pounds compared to the 10.01 pounds of
|       cotton. So, the fluffy cotton weighs just a tiny bit more
|       than the steel ingots.
|
| Which, despite getting it both right and wrong, must
| still be graded as a "fail".
|
| If you want to analyze these things for their true
| capability, you _need_ to make sure you're out of the
| training set... and most of the things that leap to your
| mind in 5 seconds are leaping to your mind precisely
| because they are either something you've seen quite often
| or something that you can easily think of, and therefore
| many other people have easily thought of them as well.
| Get off the beaten path a bit and the math gets much less
| impressive.
| whimsicalism wrote:
| https://chatgpt.com/share/67756897-8974-8010-a0e0-c9e3b3e91f...
|
| so far o1-mini has bodied every task people are saying LLMs
| can't do in this thread
| jerf wrote:
| That appears to be the same model I used. This is why I
| emphasized I didn't "go shopping" for a result. That was the
| first result I got.
|
| I'm not at all surprised that it will nondeterministically
| get it correct sometimes. But if it doesn't get it correct
| every time, it doesn't "know".
|
| (In fact "going shopping" for errors would still even be
| fair. It should be correct all the time if it "knows". But it
| would be different if I was fishing over and over and over
| and finally got one, versus the first time I asked.)
|
| Edit: It appears it isn't the model I used. The point holds,
| though, you need to make sure you're off the training set for
| it to matter. This isn't a "ChatGPT can't do that" post as
| some are saying, it's more a "you aren't asking what you
| think you're asking" post.
|
| You get the same problem in a human context in things like
| code interviews. If you ask an interviewee the exact question
| "how do you traverse a binary tree in a depth-first manner",
| you aren't really learning much about the interviewee. It's a
| bad interview question. You need to get at least a bit off
| the beaten trail to do any sort of real analysis.
| whimsicalism wrote:
| you sure? i just asked o1-mini ( _not 4o mini_ ) 5 times in
| a row (new chats obviously) and it got it right every time
|
| perhaps you stumbled on a rarer case, but reading the
| logs you posted, this sounds more like a 4o model than an
| o1 because it's doing its thinking in the chat itself;
| plus the procedure you described would probably get you
| 4o-mini
| JumpCrisscross wrote:
| > _just asked o1-mini (not 4o mini) 5 times in a row (new
| chats obviously) and it got it right every time_
|
| Could you try playing with the exact numbers and/or
| substances?
| whimsicalism wrote:
| give me a query and i'll ask it, but also i don't want to
| burn through all of my o1mini allocation and have to use
| the pay-as-you-go API.
| JumpCrisscross wrote:
| > _give me a query and i'll ask it_
|
| Which is heavier: an 11kg bag of lint or a 20lb bag of
| gold?
| whimsicalism wrote:
| yeah it gets it
|
| https://chatgpt.com/share/67757720-3c7c-8010-a3e9-ce66fb9
| f17...
|
| e: cool, this gets downvoted
| blharr wrote:
| It got it right, but an interesting result that it
| rambled on about monetary value for... no reason.
|
| > While the lint bag is heavier in terms of weight, it's
| worth mentioning that gold is significantly more valuable
| per pound compared to lint. This means that even though
| the lint bag weighs more, the gold bag holds much greater
| monetary value.
| JumpCrisscross wrote:
| Legal said someone might sell a bag of gold for one of
| lint without it.
| drivebyhooting wrote:
| > What is heavier a liter of bricks or a liter of
| feathers?
|
| >> A liter of bricks and a liter of feathers both weigh
| the same--1 kilogram--since they each have a volume of 1
| liter. However, bricks are much denser than feathers, so
| the bricks will take up much less space compared to the
| large volume of feathers needed to make up 1 liter. The
| difference is in how compactly the materials are packed,
| but in terms of weight, they are identical.
| whimsicalism wrote:
| https://chatgpt.com/share/677583a3-526c-8010-b9f9-9b2a337
| 4da... o1-mini best-of-1
| thaumasiotes wrote:
| >> so far o1-mini has bodied every task people are saying
| LLMs can't do in this thread
|
| > give me a query and i'll ask it
|
| Here's a query similar to one that I gave to Google
| Gemini (version unknown), which failed miserably:
|
| ---query---
|
| Steeleye Span's version of the old broadsheet ballad "The
| Victory" begins the final verse with these lines:
|
| _Here 's success unto the Victory / and crew of noble
| fame
|
| and glory to the captain / bold Nelson was his name_
|
| What does the singer mean by these lines?
|
| ---end query---
|
| Italicization is for the benefit of HN; I left that out
| of my prompt.
| whimsicalism wrote:
| i'd prefer an easily verifiable question rather than one
| where we can always go "no that's not what they really
| meant" but someone else with o1-mini quota can respond
| thaumasiotes wrote:
| It's not a difficult or tricky question.
| mikeruiz wrote:
| "They're toasting Admiral Nelson's ship (HMS Victory) and
| its valiant crew, hailing the ship's successes and
| Nelson's heroism. In other words, the singer is offering
| tribute--"success unto the Victory"--to the vessel and
| its famed sailors, and "glory to the captain" who led
| them, namely the celebrated Admiral Horatio Nelson."
|
| ...but to your point, no idea if the artist intended some
| more obscure reference.
|
| o1-pro was also able to produce a relatively complete
| version of the original source, though, amusingly, it
| referred to it as a 'broadside' rather than 'broadsheet'.
| Appropriate given the context!
| ted_dunning wrote:
| Hmm... Gemini (1.5 Flash) just aced that exact question
| for me:
|
| These lines celebrate the victory of the British ship HMS
| Victory, led by the famous Admiral Lord Nelson, in the
| Battle of Trafalgar in 1805.
|
| "Here's success unto the Victory": This line directly
| praises the ship itself, acknowledging its role in the
| successful battle. "and crew of noble fame": This
| recognizes the bravery and skill of the sailors who
| served aboard the Victory. "and glory to the captain":
| This line specifically honors Admiral Nelson, the captain
| of the Victory, for his leadership and strategic
| brilliance in the battle. "bold Nelson was his name":
| This emphasizes Nelson's courage and daring, which were
| legendary. The lines express admiration for the ship, its
| crew, and most importantly, Admiral Nelson, who became a
| national hero in Britain for his victory at Trafalgar.
| 7thpower wrote:
| May be unrelated, but I have been having a lot of issues
| lately with ChatGPT letting me select a model (o1) and
| silently switching to 4o.
|
| This is coming off my TWO DAY cooldown on o1 usage, which
| is frustrating.
| deeviant wrote:
| I don't believe that is the model that you used.
|
| I wrote a script and pounded o1-mini and GPT-4 with a
| wide variety of temperature and top_p parameters, and was
| unable to get it to give the wrong answer a single time.
|
| Just a whole bunch of:
|
|       (openai-example-py3.12) <redacted>:~/code/openAiAPI$ python3 featherOrSteel.py
|       Response 1: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
|       Response 2: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
|       Response 3: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
|       Response 4: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
|       Response 5: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
|       Response 6: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
|       Response 7: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
|       Response 8: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
|       Response 9: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
|       Response 10: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
|       All responses collected and saved to 'responses.txt'.
|
| Script with one example set of params:
|
|       import openai
|       import time
|       import random
|
|       # Replace with your actual OpenAI API key
|       openai.api_key = "your-api-key"
|
|       # The question to be asked
|       question = "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"
|
|       # Number of times to ask the question
|       num_requests = 10
|       responses = []
|
|       for i in range(num_requests):
|           try:
|               # Generate a unique context using a random number or
|               # timestamp, this is to prevent prompt caching
|               random_context = f"Request ID: {random.randint(1, 100000)} Timestamp: {time.time()}"
|               # Call the Chat API with the random context added
|               response = openai.ChatCompletion.create(
|                   model="gpt-4o-2024-08-06",
|                   messages=[
|                       {"role": "system", "content": f"You are a creative and imaginative assistant. {random_context}"},
|                       {"role": "user", "content": question},
|                   ],
|                   temperature=2.0,
|                   top_p=0.5,
|                   max_tokens=100,
|                   frequency_penalty=0.0,
|                   presence_penalty=0.0,
|               )
|               # Extract and store the response text
|               answer = response.choices[0].message["content"].strip()
|               responses.append(answer)
|               # Print progress
|               print(f"Response {i+1}: {answer}")
|               # Optional delay to avoid hitting rate limits
|               time.sleep(1)
|           except Exception as e:
|               print(f"An error occurred on iteration {i+1}: {e}")
|
|       # Save responses to a file for analysis
|       with open("responses.txt", "w", encoding="utf-8") as file:
|           file.write("\n".join(responses))
|       print("All responses collected and saved to 'responses.txt'.")
| zaroth wrote:
| Downvoted for... too conclusively proving OP wrong?
| gmueckl wrote:
| Down voted for not actually countering the argument in
| question? The script doesn't alter the phrasing of the
| question itself. It just generates a randomized,
| irrelevant preamble.
| deeviant wrote:
| Well, I understood the argument in question to be: was it
| possible for the model to be fooled by this question, not
| was it possible to prompt engineer it into failure.
|
| The parameter space I was exploring, then, was the
| different decoding parameters available during the
| invocation of the model, with the thesis that if it were
| possible for the model to generate an incorrect answer to
| the question, I would be able to replicate it by tweaking
| the decoding parameters to be more "loose" while
| increasing the sample size. By jacking up temperature
| while lowering top_p, we see the biggest variation in
| responses, and if there were an incorrect response to be
| found, I would have expected to see it in the few hundred
| runs during my parameter search.
|
| If you think you can fool it by slight variations on the
| wording of the problem, I would encourage you to perform
| a similar experiment as mine and prove me wrong =P
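| For reference, the sweep described here is just a nested
| loop over decoding parameters. A hypothetical sketch,
| where ask_once stands in for the openai.ChatCompletion
| call in the script above and looks_wrong stands in for
| whatever grading check you trust:
|
|       def sweep(question, ask_once, looks_wrong, samples=5):
|           # Try a grid of decoding parameters and collect any answers
|           # the grading check flags as incorrect.
|           wrong = []
|           for temperature in (0.0, 0.5, 1.0, 1.5, 2.0):
|               for top_p in (0.1, 0.5, 0.9, 1.0):
|                   for _ in range(samples):
|                       answer = ask_once(question, temperature, top_p)
|                       if looks_wrong(answer):
|                           wrong.append((temperature, top_p, answer))
|           return wrong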
| mortehu wrote:
| While this may be true, it's a very common problem that
| people who want to demonstrate how bad a model is fail to
| provide a direct link or simply state the name of the
| model.
| lukeschlather wrote:
| I usually test models using the OpenAI API which doesn't
| offer links the way I think you mean. If I provide some
| output I got from a particular model you're just going to
| have to take my word for it.
| Jerrrry wrote:
| They need to provide a small hash with the API result
| that can be verified by others.
| 4ad wrote:
| You can use https://lluminous.chat (bring your own key)
| to link to chats using any model across all LLMs.
| whimsicalism wrote:
| open router is the more standard solution
| chongli wrote:
| OpenAI is not doing us any favours by using confusing
| naming schemes for their models and obscuring which
| models people are actually working with.
|
| If I didn't know any better, I'd say OpenAI doesn't want
| us doing these tests accurately and is trying to hide
| something.
| whimsicalism wrote:
| it's extremely easy to see which model you are using.
| one's own... difficulties understanding are not a
| conspiracy by OpenAI
| chongli wrote:
| It does not show the model version anywhere on the page
| on chatgpt.com, even when logged in.
| qup wrote:
| Yes it does, at the top of every chat there is a drop-
| down to select the model, which displays the current
| model. It's been a constant part of the UI since forever.
| chongli wrote:
| No, it only says "ChatGPT Plus (Upgrade)" or "ChatGPT".
|
| Maybe it's different if you have a paid account?
| whimsicalism wrote:
| if i go to chatgpt.com on my phone not logged on at all
| it tells me very prominently at the top that i am using
| 4o mini
| bcrosby95 wrote:
| Logged in, non paid account, on a desktop, for me, it's
| exactly as the person you're replying to has stated.
|
| If I log out, it shows 4o mini, and when I try to change
| it, it asks me to login or sign in rather than giving me
| any options.
|
| When I use ChatGPT enough while logged in, it gives me
| some nebulous "you've used all your xyz tokens for the
| day". But other than that there is no real signal to me
| that I'm getting a degraded experience.
|
| It's really just confusing as hell.
| blharr wrote:
| Someone else in this thread said,
|
| > _With a free account the model it claims to be using is
| "4o auto", which is not a model but apparently an attempt
| to automatically decide models for you to be more cost
| effective._
| xtracto wrote:
| So, there is this meme going around in Mexico about a
| previous president who in an interview said "we will land
| in about 1 minute, no, less about 5"
|
| Does this prove he is not an intelligent being?
|
| Is he stupid?
|
| Or did he have a lapse? Would we judge his intelligence
| for that?
| dialup_sounds wrote:
| I believe this is just a case of OpenAI's naming scheme
| being weird and confusing.
|
| The default model I see on chatgpt.com is _GPT 4o-mini_ ,
| which is not _o1-mini_.
|
| OpenAI describes GPT 4o-mini as "Our fast, affordable small
| model for focused tasks" and o1/o1-mini as "Reasoning
| models that excel at complex, multi-step tasks".
| qup wrote:
| It's so weird that people use questions that are well-known
| for duping humans, who we all consider to be general
| intelligence.
|
| Getting this question wrong doesn't say much about the
| intelligence of humans, why would it say something about
| the AI?
| flatline wrote:
| Because for things like the Putnam questions, we are
| trying to get the performance of a _smart_ human. Are
| LLMs just stochastic parrots or are they capable of
| drawing new, meaningful inferences? We keep getting more
| and more evidence of the latter, but things like this
| throw that into question.
| zahlman wrote:
| We use variations on questions that are well known for
| duping _inattentive_ humans, to test a system that _we
| expect a priori to be incapable of such inattention_.
|
| Unless "getting easy things wrong sometimes" is an
| inherent property of intelligence, we should expect that
| a properly "intelligent" computerized system would
| _never_ err on problems far below its level of
| comprehension - unless we had some reason to believe it
| "wanted to", and as of yet I see no reason to believe
| this is even possible in principle.
|
| Humans err, broadly speaking, for two reasons: genuinely
| reaching the limits of their comprehension, or trusting
| "system 1" (in Kahneman's analysis) too much.
| elliotto wrote:
| Could you share the exact chat you used for when it failed?
| There is a share chat button on openai.
|
| It's very difficult to be an AI bull when the goalposts
| are moving so quickly that AI answering it correctly
| across multiple models is brushed off as
| 'nondeterministically getting it correct sometimes'.
| stefan_ wrote:
| Why? Did a grocery store self checkout ever fail to
| calculate sales tax? Do I need to run a study on that?
|
| The people selling this could not make a car drive itself,
| but now it's AGI.
| NewsaHackO wrote:
| This happens literally every time. Someone always says
| "ChatGPT can't do this!", but then when someone actually runs
| the example, chatGPT gets it right. Now what the OP is going
| to do next is proceed to move goalposts and say like "but umm
| I _just_ asked chatgpt this, so clearly they modified the
| code in realtime to get the answer right "
| whimsicalism wrote:
| yeah... that's hn for you.
|
| that said, i don't think this is a good test - i've seen it
| circling on twitter for months and it is almost certainly
| trained on similar tasks
| anamexis wrote:
| How is that moving the goalposts?
| JumpCrisscross wrote:
| > _Someone always says "ChatGPT can't do this!", but then
| when someone actually runs the example, chatGPT gets it
| right_
|
| I mean, if I had OpenAI's resources I'd have a team tasked
| with monitoring social to debug trending fuck-ups. (Before
| that: add compute time to frequently-asked novel queries.)
| jorvi wrote:
| Prompt: In the Netherlands, in terms of drinks, is there a
| particular spirit that represents the country?
|
| > Yes, in the Netherlands, jenever (also known as genever)
| is the traditional spirit that represents the country.
| Jenever is a type of Dutch gin that has a distinctive
| flavor, often made from malt wine and flavored with juniper
| berries. It has a long history in the Netherlands, dating
| back to the 16th century, and is considered the precursor
| to modern gin.
|
| > Jenever comes in two main types: oude (old) jenever,
| which has a richer, maltier flavor and is often aged in
| wooden casks, and jonge (young) jenever, which is lighter
| and has a more neutral taste. Both types can be enjoyed
| straight or used in cocktails.
|
| > In addition to jenever, the Netherlands is also known for
| its liqueurs, such as advocaat, a rich and creamy drink
| made from eggs, sugar, and brandy, often enjoyed as a
| dessert or in cocktails. However, jenever remains the most
| iconic spirit associated with Dutch culture.
|
| This is completely wrong. Jenever certainly is very Dutch,
| but no one would say it is iconic as _the_ Dutch spirit.
| For example, if you asked up north in Friesland, they would
| say Berenburg.
|
| This happens literally every time. Someone always says
| "ChatGPT can do this!", but then within one or two
| prompts, it gets it wrong.
| ludwik wrote:
| But what does this have to do with reasoning? Yes, LLMs
| are not knowledge bases, and seeing people treat them as
| such absolutely terrifies me. However, I don't see how
| the fact that LLMs often hallucinate "facts" is relevant
| to a discussion about their reasoning capabilities.
| zahlman wrote:
| "Hallucinating a fact" that _isn 't_ in the training set
| and is also illogical, is exactly what a _failure to
| reason correctly_ looks like.
| elif wrote:
| 'Berenberg is made by adding herbs to jenever'
|
| From your comment it would seem that you are disputing
| jenever's popularity by saying jenever is more popular...
|
| Perhaps it was a good faith mistake? If so, that would
| imply that the AI knows more about jenever than you?
| jorvi wrote:
| I am rather saying that there is no one national drink
| for The Netherlands, like a Frenchman would say wine, a
| German/Belgian would say beer, and a Scotsman would say
| whisky. Note that I prompted "In the Netherlands, in
| terms of drinks, is there a particular spirit that
| represents the country?" I didn't ask which spirit is
| consumed the most.
|
| For example, France has been trending towards beer more
| and more, and within a few decades they might be
| consuming more beer than wine. But even then, the French
| wouldn't slowly start to say beer represents France.
|
| Furthermore, "just adding some herbs" does a large
| disservice to the flavor change of Berenburg. Jenever
| (aka jonge/unaged jenever) is straight-up vile. I've
| heard it described by expats as "having the worst
| elements of both cheap gin and cheap whisky".
|
| Berenburg in comparison is spicy and vanilla-y and
| actually debatably enjoyable.
|
| Aged/oude jenever is much closer to Berenburg (or
| Berenburg to aged jenever), also with hints of vanilla
| and spices.
|
| But, virtually no one except for dusty old men orders
| aged jenever. The jenever most people order is jonge
| jenever, and then it's only in the sense of "haha let's
| drink this terrible thing" or "let's get shitfaced
| quick".
|
| If o1 supposedly "oneshots every question", it should
| have been aware of these nuances instead of just
| confidently assigning jenever as 'the' spirit of the
| Dutch.
| ipaddr wrote:
| So you believe they are incorrect because regionally some
| area would select something different because it
| represented that area. But your question asked
| nationally... is there a better answer than the one they
| gave? Were you expecting a no?
| zahlman wrote:
| The point is that there is no correct national answer,
| because the locals don't see it as a matter of national
| identity.
|
| What's expected is an _ability to identify trick
| questions_ , i.e., to recognize fundamental problems in
| the phrasing of a question rather than trying to provide
| a "helpful" answer at all costs.
|
| This corresponds to one of the many reasons LLM output is
| banned on Stack Overflow.
| jorvi wrote:
| See my more detailed upthread response here:
| https://news.ycombinator.com/item?id=42569937
|
| But, like Zahlman points out, it's a trick question, and
| instead of admitting it doesn't know or even prepending
| "I don't know for sure, but:", it just burps up its
| best-effort answer. There is no one spirit that
| represents the Netherlands. If an LLM is so good it
| "oneshots any question", it should realize it doesn't
| have a unanimous answer and tell me.
| stocknoob wrote:
| Similarly, in every thread there's an AI skeptic who says
| LLMs are "useless" for coding, and never provides an
| example query for what they were trying.
| mu53 wrote:
| If you ask about more niche language features or
| libraries, chatgpt will make up libraries or functions to
| fill the gap.
|
| When asking an LLM to write a script for you, I would say
| 10 to 30% of the time it completely fails. Again, making
| up an API or just getting things straight-up wrong.
|
| It's very helpful, especially when starting from 0 with
| the beginner questions, but it fails in many scenarios.
| Leary wrote:
| Deepseek got it right: "A 10.01-pound bag of fluffy cotton is
| heavier than a 9.99-pound pack of steel ingots. Even though
| steel is denser and takes up much less space, the weight is
| determined by the mass, and 10.01 pounds is greater than 9.99
| pounds."
| OutOfHere wrote:
| The issue with the commercial Deepseek API is that it
| supports a context length of only 64k, whereas GPT supports
| at least 128k.
| MattGaiser wrote:
| https://chatgpt.com/share/67756c29-111c-8002-b203-14c07ed1e6...
|
| I got a very different answer:
|
| A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound
| bag of steel ingots because 10.01 pounds is greater than 9.99
| pounds. The material doesn't matter in this case; weight is the
| deciding factor.
|
| What model returned your answer?
| NewsaHackO wrote:
| You also didn't ask the question correctly.
| MattGaiser wrote:
| The "back" misspelling is also in OP's quote so I included
| it.
| collinvandyck76 wrote:
| I reproduced this on Claude Sonnet 3.5, but found that changing
| your prompt to "Which is heavier, a 9.99-pound back of steel
| ingots or a 10.01-pound bag of fluffy cotton?" corrected its
| reasoning, after repeated tests. For some reason it was not
| able to figure out that "10.01" referred to pounds.
| bufferoverflow wrote:
| You wrote "back" instead of "bag".
| collinvandyck76 wrote:
| Ah, good catch -- I had just copied the OP. Changing it to
| bag instead of back doesn't seem to change much. It still
| oscillates between the right and wrong answer, versus
| qualifying the 10.01 value as pounds, which has not yet
| given me the wrong answer (small sample size, of course).
| oktoberpaard wrote:
| This is the first response that I got from Claude with an
| exact copy and paste of the question:
|
| Let me think about this step by step:
|
| 1. We're comparing two masses:
|    * Steel ingots: 9.99 pounds
|    * Cotton: 10.01 pounds
|
| 2. The cotton bag has a mass of 10.01 pounds, while the
| steel bag has a mass of 9.99 pounds
|
| 3. While steel is much denser than cotton (meaning it
| takes up less space per pound), this doesn't affect the
| weight - a pound is a pound regardless of material
|
| 4. 10.01 is greater than 9.99 by 0.02 pounds
|
| Therefore, the 10.01-pound bag of fluffy cotton is
| heavier than the 9.99-pound bag of steel ingots. The
| cotton may take up much more space, but when we're just
| comparing weights, density doesn't matter - only the mass
| does.
| sholladay wrote:
| ChatGPT Plus user here. The following are all fresh sessions
| and first answers, no fishing.
|
| GPT 4:
|
| The 10.01-pound bag of fluffy cotton is heavier than the
| 9.99-pound bag of steel ingots. The type of material doesn't
| affect the weight comparison; it's purely a matter of which bag
| weighs more on the scale.
|
| GPT 4o:
|
| The 10.01-pound bag of fluffy cotton is heavier. Weight is
| independent of the material, so the bag of cotton's 10.01
| pounds outweighs the steel ingots' 9.99 pounds.
|
| GPT o1:
|
| Since both weights are measured on the same scale (pounds), the
| 10.01-pound bag of cotton is heavier than the 9.99-pound bag of
| steel, despite steel being denser. The key is simply that 10.01
| pounds exceeds 9.99 pounds--density doesn't affect the total
| weight in this comparison.
| blibble wrote:
| they've likely read this thread and adjusted their pre-filter
| to give the correct answer
| mjburgess wrote:
| So do what the commenter suggests and make irrelevant
| permutations to the input to find when it fails. I.e.,
| engage in hypothesis testing rather than confirmation
| bias.
|
| If a system has the capability to solve problems of
| {parts_1, ..., parts_n}, then it only has that capability
| if irrelevant permutations {parts_1, parts_2', ...,
| parts_n} make no difference.
|
| It's very obvious that such permutations can destroy such
| apparent capabilities.
| otabdeveloper4 wrote:
| > ...engage in hypothesis testing rather than confirmation
| bias
|
| Please leave the premises, sir. We don't take kindly to
| luddites here.
| dullcrisp wrote:
| Tough crowd
| david-gpu wrote:
| Lots of other websites are more appropriate for meme
| jokes.
| dullcrisp wrote:
| Like I said.
| wongarsu wrote:
| If GP's hypothesis was "it fails for small variations of
| the input, like this one", then testing that hypothesis
| with that exact variation on a couple models seems fair and
| scientific.
|
| Testing it with more variations until one fails feels a bit
| like p-hacking. You'd need to engage in actual statistics
| to get reliable results from that, beyond "If I really try,
| I can make it fail". Which would be a completely different
| hypothesis than the one presented at the start
| roughly wrote:
| Except that if the model genuinely was reasoning about
| the problem, you could test it with every variation of
| materials and weights in the world and it would pass.
| Failing that problem at all in any way under any
| conditions is a failure of reasoning.
| jdietrich wrote:
| By that logic, humans can't genuinely reason, because
| they're often fooled by counter-intuitive problems like
| Monty Hall or the Birthday Problem, or sometimes just
| make mistakes on trivial problems.
| wongarsu wrote:
| We are pretty certain that humans can reason, yet they
| are sometimes wrong. Even if you give them the same
| problem over and over again with slight variations.
|
| LLMs get things wrong due to different factors than
| humans (humans lose focus, LLMs have randomness applied
| when sampling their responses to improve results). But
| clearly we have to choose a goal somewhat below 100% if
| we want a test that doesn't conclude that humans are
| incapable of reasoning.
| roughly wrote:
| The difference is we _know_ that LLMs are fancy
| stochastic models, we don't know that they're capable of
| reasoning, and the null hypothesis is that they're not
| (because we know what they _are_ - we built them) - any
| "reasoning" is an emergent property of the system, not
| something we built them to do. In that case, evidence
| they're not reasoning - evidence they're stochastic
| parrots doing a performance of reasoning - weighs
| heavier, because the performance of reasoning fits into
| what we know they can do, whereas genuine reasoning would
| be something new to the model.
|
| There's deeper philosophical questions about what
| reasoning actually _is_, and LLMs have made those
| sharper, because they've shown it's clearly possible for
| a complex statistical model to generate words that look
| like reasoning, but the question is whether there's a
| difference between what they're doing and what humans are
| doing, and evidence that they're _not_ reasoning -
| evidence that they're just generating words in specific
| orders - weighs heavily against them.
| wongarsu wrote:
| We haven't coded LLMs to be stochastic models; we coded
| them to predict text with any method gradient descent
| finds on a transformer architecture. That's not exactly
| the same.
|
| But more importantly, if you want to show that LLMs can't
| reason you obviously have to use a test that when applied
| to humans would show that humans can reason. Otherwise
| your test isn't testing reasoning but something more
| strict.
| Isinlor wrote:
| It's widely accepted that reasoning is not a binary
| skill.
|
| You can make mistakes and still reason. Very often people
| given the same premises will disagree in their reasoning,
| as we are doing right here.
| whakim wrote:
| I'm not really sure what you're trying to say here - that
| LLMs don't work like human brains? We don't need to
| conduct any analyses to know that LLMs don't "know"
| anything in the way humans "know" things because we know
| how LLMs work. That doesn't mean that LLMs aren't
| incredibly powerful; it may not even mean that they
| aren't a route to AGI.
| zahlman wrote:
| >We don't need to conduct any analyses to know that LLMs
| don't "know" anything in the way humans "know" things
| because we know how LLMs work.
|
| People, including around HN, constantly argue (or at
| least phrase their arguments) as if they believed that
| LLMs do, in fact, possess such "knowledge". This very
| comment chain exists because people are trying to defend
| against a trivial example refuting the point - as if
| there were a reason to try.
|
| > That doesn't mean that LLMs aren't incredibly powerful;
| it may not even mean that they aren't a route to AGI.
|
| I don't accept your definition of "intelligence" if you
| think that makes sense. Systems _must_ be able to know
| things in the way that humans (or at least living
| creatures) do, because intelligence is exactly the
| _ability to acquire_ such knowledge.
|
| It boggles my mind that I have to explain to people that
| sophisticated use of language doesn't inherently evidence
| thought, _in the current political environment_ where the
| Dead Internet Theory is taken seriously, elections are
| shown over and over again to be more about tribalism and
| personal identity than anything to do with policy, etc.
| nwienert wrote:
| I feel like I'm almost 100% certain that the smart guys
| at OpenAI have added many more variations of the problem
| to their training set since OP did his failing test, so
| it doesn't surprise me at all to know that this exact one
| now passes.
|
| In fact, in my use of o1 it's incredibly clear that it
| still has the same problems. It's incredibly common that
| the second I ask for something even slightly outside the
| training set, it's more likely to "round" to some wrong
| solution in the training set, rather than use any sort of
| human-like reasoning to figure out the right answer
| (often the right answer isn't hard to get, just not found
| in a Google search).
| bee_rider wrote:
| Can't really do science with closed source software,
| right? Who knows what's in there.
| jack_pp wrote:
| It's not p-hacking, he's right. You're both right. First
| test the same prompt on different versions; then the
| ones that got it right go to the next round, with
| variations on the prompt.
| zahlman wrote:
| We aren't testing whether the model's results are stable
| or correct for a given class of problem. The goal is to
| establish whether the model _can reason_.
|
| Nothing capable of reasoning would contradict itself so
| blatantly and in such a short span while failing to
| indicate any kind of uncertainty.
| Isinlor wrote:
| Reasoning is not a binary skill.
|
| And failure modes of other types of reasoners do not need
| to be the same as the failure modes of humans.
| jdietrich wrote:
| I've just tested a number of permutations with Claude 3.5
| Sonnet. It correctly answered all variants I tried on the
| first attempt, as follows:
|
| _Which is heavier, a 9.99 kilogram tungsten cube or a
| 10.01 kilogram block of aerogel?_
|
| _Which is heavier, 10,000 steel balls weighing 0.999 grams
| each or 10,000 polystyrene balls weighing 1.001 grams
| each?_
|
| _Which is heavier, a 10.01kg block of steel on Venus or a
| 9.99kg bag of feathers on Earth?_
|
| _Which is heavier, a 10cm^3 block of steel or a 100cm^3
| block of balsa wood?_
|
| _Which is heavier, a golf ball made of steel or a baseball
| made of lithium?_
|
| In all cases, Claude clearly used CoT and reasoned out the
| problem in full. I would be interested in seeing if anyone
| can find any variant of this problem that stumps any of the
| leading LLMs. I'm bored of trying.
| mjburgess wrote:
| Hey, ChatGPT please write me a python program which
| randomly samples from various materials and various
| weights then poses a problem to the ChatGPT 4o API -- the
| goal is to find cases where the LLM fails to obtain the
| correct answer....
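| A minimal sketch of what such a fuzzer might look like
| (assuming the standard `openai` Python client with an API key
| in the environment; the material list and the crude substring
| check are made up purely for illustration):
|
| ```
| # Weight-riddle fuzzer sketch: build near-miss comparisons,
| # ask a model, and flag answers that disagree with arithmetic.
| import random
| from openai import OpenAI
|
| client = OpenAI()
| MATERIALS = ["steel", "tungsten", "feathers", "cotton",
|              "aerogel", "lead"]
|
| def make_case():
|     a, b = random.sample(MATERIALS, 2)
|     wa = round(random.uniform(9.5, 10.5), 2)
|     wb = round(wa + random.choice([-0.02, 0.02]), 2)
|     q = (f"Which is heavier, a {wa} kg block of {a} or a "
|          f"{wb} kg bag of {b}? Answer with the material only.")
|     return q, (a if wa > wb else b)
|
| def ask(question):
|     resp = client.chat.completions.create(
|         model="gpt-4o",  # whichever model is under test
|         messages=[{"role": "user", "content": question}],
|     )
|     return resp.choices[0].message.content.strip().lower()
|
| if __name__ == "__main__":
|     for _ in range(20):
|         q, expected = make_case()
|         answer = ask(q)
|         if expected not in answer:
|             print("POSSIBLE FAILURE:", q)
|             print("  expected:", expected, "got:", answer)
| ```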
| Isinlor wrote:
| BTW - the model may be wrong depending on the example.
| More voluminous objects displace more air and due to
| buoyancy are lighter for the same mass.
|
| The proper way to ask it would be to ask which object has
| more mass.
| escape_goat wrote:
| I asked ChatGPT-4 mini the previous comment's question
| (""Which is heavier, a 9.99-pound back of steel ingots or a
| 10.01 bag of fluffy cotton?") in new incognito tabs until I
| got an incorrect result. It only took four tries.
|
| 1: "The 10.01-pound bag of fluffy cotton would be heavier.
| Even though cotton is much less dense than steel, the weight
| of the cotton bag is 10.01 pounds, while the steel ingot bag
| weighs 9.99 pounds, making the cotton bag the heavier of the
| two by 0.02 pounds."
|
| 2: "The 10.01-pound bag of fluffy cotton is heavier. Even
| though the steel ingots are denser and more compact, the
| cotton weighs more at 10.01 pounds compared to the steel
| ingots' 9.99 pounds. The difference is only 0.02 pounds, but
| in terms of weight, the cotton bag is heavier."
|
| 3: "The 10.01-pound bag of fluffy cotton is heavier. Despite
| cotton being less dense than steel, the weight is determined
| by the total mass, not the material's density. So, since the
| cotton bag weighs 10.01 pounds and the steel ingot bag weighs
| 9.99 pounds, the cotton bag is heavier by 0.02 pounds."
|
| 4: "The 9.99-pound bag of steel ingots is heavier. It weighs
| 9.99 pounds, while the bag of fluffy cotton weighs 10.01
| pounds. Despite the cotton being fluffier and taking up more
| space, the steel ingots are still lighter by 0.02 pounds."
| gtirloni wrote:
| o1.
|
| prompt> Which is heavier, a 9.99-pound back of steel ingots or
| a 10.01 bag of fluffy cotton? Please state in your answer
| what's the difference in grams.
|
| answer> The 10.01-pound bag of cotton is heavier. The
| difference is 0.02 pounds, which is roughly 9.07 grams (using 1
| pound ≈ 453.59237 grams).
|
| Reference: National Institute of Standards and Technology
| (NIST), Conversion Factors
| jstummbillig wrote:
| FYI: If you do that without a subscription, you currently (most
| likely) get a response generated by 4o-mini -- which is not any
| of the reasoning models (o1, o1-mini or previously o1-preview)
| of the branch discussed in the linked paper.
|
| Notably, it's not even necessarily 4o, their premier "non-
| reasoning" model, but likely the cheaper variant: with a free
| account, the model it claims to be using is "4o auto", which is
| not a model but apparently an attempt to automatically choose a
| model for you to be more cost effective.
|
| Without a ChatGPT subscription you can't select a specific
| model anymore, not even a rate-limited one, as was previously
| possible.
| jsheard wrote:
| There doesn't seem to be a way to choose a model up-front
| with a free account, but _after_ you make a query you can
| click on the "regenerate" button and select whether to try
| again with "auto", 4o, or 4o-mini. At least until you use 4o
| too many times and get rate limited.
| jstummbillig wrote:
| Ah, interesting!
| evertedsphere wrote:
| you can select the model in the header bar when you start a
| chat: the name of the currently selected model can be
| clicked to reveal a dropdown
| jsheard wrote:
| That option isn't there for me, maybe it's an A/B test
| thing.
| jstummbillig wrote:
| Are you on the free version? Because for me it did not
| show there, only on the paid one.
| mmaunder wrote:
| I've posted this before and I know it's a cliche, but this
| really is Goodhart's Law at work with the benchmarks becoming
| targets.
| qwertox wrote:
| As long as an LLM is capable of inserting "9.99 > 10.01?" into
| an evaluation tool, we're on the right track.
|
| It feels a bit like "if all you have is a hammer, everything
| looks like a nail": we're trying to make LLMs do things they
| aren't really designed to do.
|
| Why don't we just limit LLMs to being an interface to other
| tools (in a much more human way) and train them to be excellent
| at using those tools (a toy sketch of this idea follows below)?
| It would also make them more energy efficient.
|
| But it's OK if we currently try to make them do as much as
| possible, not only to check where the limits are, but also to
| gain experience in developing them and for other reasons. We
| just shouldn't expect them to be really intelligent.
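| The tool-use idea above can be made concrete with a toy
| dispatch loop. This is purely hypothetical glue code: the
| COMPARE(...) call format is invented for illustration, whereas
| real systems use the providers' structured function/tool-
| calling APIs.
|
| ```
| # Toy "LLM as tool interface": the model only has to emit a
| # tool call; deterministic Python code does the comparison.
| import re
|
| def compare_weights(a: float, b: float) -> str:
|     if a > b:
|         return f"{a} is heavier than {b}"
|     if b > a:
|         return f"{b} is heavier than {a}"
|     return "they weigh the same"
|
| def dispatch(model_output: str) -> str:
|     # Expect something like "COMPARE(9.99, 10.01)" from the LLM.
|     m = re.search(r"COMPARE\(([\d.]+),\s*([\d.]+)\)",
|                   model_output)
|     if not m:
|         return "no tool call found"
|     return compare_weights(float(m.group(1)), float(m.group(2)))
|
| # Pretend this string came back from the model:
| print(dispatch("COMPARE(9.99, 10.01)"))
| # -> "10.01 is heavier than 9.99"
| ```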
| riffraff wrote:
| > As long as an LLM is capable of inserting "9.99 > 10.01?"
| into an evaluation tool, we're on a good way
|
| chatgpt will switch to python for some arithmetic, with the
| result that you get floating point math issues where an 8yo
| would get the result right. I think "switch to a tool" still
| requires understanding of _which_ tool to get a reliable
| result, which in turn means understanding the problem. It's
| an interesting issue.
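| For example, the classic binary-vs-decimal artifact (a generic
| Python illustration, not ChatGPT's actual tool output; Fraction
| or Decimal sidesteps it):
|
| ```
| # Binary floating point vs exact rational arithmetic.
| from fractions import Fraction
|
| print(0.1 + 0.2)         # 0.30000000000000004
| print(0.1 + 0.2 == 0.3)  # False
| print(Fraction("0.1") + Fraction("0.2") == Fraction("0.3"))
| # -> True
| ```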
| kqr wrote:
| Shows the importance of chain of thought! Forcing it to commit
| to an answer without deliberation is not playing to its
| strength.
| thaumasiotes wrote:
| > the problem with "which is heavier, a 10 pound bag of
| feathers or a 10 pound bag of bricks?"
|
| Interestingly, the variation of this problem that I first
| encountered, personally, was "which weighs more, a pound of
| feathers or a pound of gold?"
|
| This is a much more difficult question. The answer given to me
| was that the pound of feathers weighs more, because gold is
| measured in troy weight, and a troy pound consists of only 12
| ounces compared to the 16 ounces in a pound avoirdupois.
|
| And that's all true. Gold is measured in troy weight, feathers
| aren't, a troy pound consists of only 12 ounces, a pound
| avoirdupois consists of 16, and a pound avoirdupois weighs more
| than a troy pound does.
|
| The problem with this answer is that it's not complete; it's
| just a coincidence that the ultimate result ("the feathers are
| heavier") is correct. Just as a pound avoirdupois weighs more
| than a troy pound, an ounce avoirdupois weighs _less_ than a
| troy ounce. But this difference, even though it goes in the
| opposite direction, isn't enough to outweigh the difference
| between 16 vs 12 ounces per pound.
|
| Without acknowledging the difference in the ounces, the
| official answer to the riddle is just as wrong as the naive
| answer is.
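| For concreteness, the grain values behind this (standard
| definitions; a quick worked check):
|
| ```
| # Troy vs avoirdupois, in grains (the shared base unit).
| TROY_OZ = 480.0      # grains per troy ounce
| AVDP_OZ = 437.5      # grains per avoirdupois ounce
| TROY_LB = 12 * TROY_OZ   # 5760 grains
| AVDP_LB = 16 * AVDP_OZ   # 7000 grains
|
| print(TROY_OZ > AVDP_OZ)  # True: the troy ounce is heavier
| print(AVDP_LB > TROY_LB)  # True: the avoirdupois pound wins
| ```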
| Izkata wrote:
| Yeah, this is the original version of this riddle. People who
| don't know it think the trick is that people will reflexively
| say the metal is heavier instead of "they're the same", when
| it actually goes deeper.
|
| No idea if GP did it intentionally to further drift from
| training data, but steel doesn't count as a precious metal,
| so it messes up the riddle by putting the two weights in the
| same system.
| zahlman wrote:
| > Yeah, this is the original version of this riddle. People
| who don't know it think the trick is that people will
| reflexively say the metal is heavier instead of "they're
| the same"
|
| ...Have you really never encountered people who would
| reflexively say that?
| 12345hn6789 wrote:
| Are you sure you weren't fishing? I ran 5 sessions and never
| got the wrong answer. All using GPT-4o mini, which is the
| default non-logged-in experience on chatgpt.com.
|
| 1. The 10.01-pound bag of fluffy cotton is heavier than the
| 9.99-pound bag of steel ingots. Despite the difference in
| material density, the key factor here is the weight itself,
| with 10.01 pounds being greater than 9.99 pounds, regardless of
| the substance.
|
| 2. The 10.01-pound bag of fluffy cotton is heavier than the
| 9.99-pound bag of steel ingots. The weight of an object is
| directly determined by its mass, and in this case, the cotton
| bag weighs 10.01 pounds, while the steel ingot bag weighs 9.99
| pounds, making the cotton bag heavier by 0.02 pounds.
|
| 3. The 10.01-pound bag of fluffy cotton is heavier than the
| 9.99-pound bag of steel ingots, even though the cotton is much
| less dense. The comparison is purely based on weight, and 10.01
| pounds is greater than 9.99 pounds.
|
| 4. The 10.01-pound bag of fluffy cotton is heavier than the
| 9.99-pound bag of steel ingots. While steel is much denser than
| cotton, the weight is the determining factor here, and 10.01
| pounds is slightly more than 9.99 pounds. So, the cotton bag
| weighs more.
|
| 5. The 10.01-pound bag of fluffy cotton is heavier. Despite the
| cotton being less dense and bulkier, the weight of the cotton
| is still 10.01 pounds, which is greater than the 9.99-pound bag
| of steel ingots.
| adrian17 wrote:
| Not OP, but I got 4o-mini confused on second attempt.
|
| https://chatgpt.com/share/67759d1a-1430-800b-a0a9-2c5f2ac02a.
| ..
| Horffupolde wrote:
| 10 pounds of bricks is actually heavier than 10 pounds of
| feathers.
| AnimalMuppet wrote:
| Can you explain?
|
| An ounce of gold is heavier than an ounce of feathers,
| because the "ounce of gold" is a troy ounce, and the "ounce
| of feathers" is an avoirdupois ounce. But that shouldn't be
| true between feathers and bricks - they're both avoirdupois.
| Horffupolde wrote:
| Feathers are less dense so they have higher buoyancy in
| air, reducing their weight.
| chongli wrote:
| Pounds are a unit of weight, not of mass. 10 lbs of
| feathers is whatever amount of feathers causes a scale to
| display 10 lbs. If the scale also displays 10 lbs for the
| quantity of bricks, then they weigh the same, regardless
| of any differences in mass.
| wongarsu wrote:
| Is this still true? I thought pounds are now defined in
| terms of kilograms (about 0.453)? Because kilograms are
| definitely a unit of mass, not weight. Or is the pound
| defined as some amount of kilograms at a specific point
| on earth, in a specific phase of the moon?
| chongli wrote:
| It seems the pound has since been redefined and split
| into separate units: pound mass and pound force, the
| former in terms of kilograms (1 lb = 0.45359237 kg) and
| the latter in terms of the force exerted by one pound of
| mass in earth's gravitational field (standard g =
| 9.80665m/s^2).
|
| So using the word pound without qualification is
| ambiguous in contexts where it's not clear whether mass
| or force is meant.
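| Multiplying those two defining constants out gives the usual
| value for the pound-force (a quick check of the numbers above):
|
| ```
| # 1 lbf = 1 lb (mass) under standard gravity, in newtons.
| LB_TO_KG = 0.45359237   # kg per pound, exact by definition
| STANDARD_G = 9.80665    # m/s^2, exact by definition
|
| print(LB_TO_KG * STANDARD_G)  # ~4.4482216 N per lbf
| ```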
| 9rx wrote:
| According to the dictionary, "heavier" can refer to weight
| or density. In their typical form, bricks are heavier (more
| dense) than feathers. But one should not make assumptions
| before answering the question. It is, as written,
| unanswerable without followup questions.
| anon946 wrote:
| Add some extra information, and it gets confused. This is 4o.
|
| https://chatgpt.com/share/67759723-f008-800e-b0f3-9c81e656d6...
|
| One might argue that it's impossible to compress air using
| known engineering, but that would be a different kind of
| answer.
| amluto wrote:
| It seems more like ChatGPT was asked a rather bizarre
| question with far too little detail to make sense, and
| ChatGPT failed to notice or to ask for more information.
| Although it did get rather impressively confused about the
| pressure of the air.
| F7F7F7 wrote:
| "Failed to notice" is a hilariously weird way of talking
| about tokenization.
| maratc wrote:
| Still waiting for some reasoning about the upward buoyancy
| force on the bags, which is equal to the weight of the air
| displaced by the bags; the force acting on the cotton is bigger
| simply because the bag of cotton displaces more air (it
| "appears larger due to its fluffy nature"). A rough estimate is
| sketched below.
| jiggawatts wrote:
| A real acid test is to use numbers that confuse LLMs such as:
|
| "Which is heavier: a bag with 9.9 pounds of fluff or a bag with
| 9.11 pounds of steel block?"
|
| Try swapping the numbers and the materials, and run each test
| at least 10 times.
| devoutsalsa wrote:
| I tried Claude and got this...
|
| _What's heavier? 9.99 kg of super dense lead or 10 kg fluffy
| feathers?_
|
| Let me think about this step by step:
|
| 1. We're comparing two masses:
|    * 9.99 kg of lead
|    * 10 kg of feathers
|
| 2. The density of the materials doesn't affect their mass:
|    * While lead is indeed much denser than feathers
|    * A kilogram is a unit of mass, not volume
|    * 1 kg of any material has the same mass as 1 kg of any
|      other material
|
| 3. We can directly compare the numbers:
|    * 10 kg > 9.99 kg
|
| Therefore, 10 kg of feathers is heavier than 9.99 kg of lead.
| The feathers would take up much more space (volume) due to
| being less dense, but they have a greater mass.
| themgt wrote:
| I just asked Claude 3.5 Sonnet, which appears to have improved
| its response with CoT, but there are mistakes that demonstrate
| the model doesn't really "understand":
|
| _Q: A woman and her son are in a car accident. The woman is
| sadly killed. The boy is rushed to hospital. When the doctor
| sees the boy he says "I can't operate on this child, he is my
| son". How is this possible?_
|
| _C: Let me think about this step by step:_
|
| _A woman and her son are in an accident_
|
| _The woman (the mother) dies_
|
| _The boy is taken to hospital_
|
| _A doctor says they can't operate because the boy is their
| son_
|
| _But the mother is dead..._
|
| _Ah! The doctor is the boy's father. This is a classic riddle
| that highlights gender bias - many people assume doctors must
| be men, but don't initially consider that the doctor could be
| the father._ _Alternatively, the doctor could be the boy's
| other mother in a same-sex relationship, though this wasn't the
| original intent of the traditional riddle._
| HenriTEL wrote:
| I must be too tired, as I can't find any flaw in that
| reasoning.
| deadbabe wrote:
| If you define a new universe with physical laws that are
| different from our own, performance will also suffer because
| there is no training data for that custom universe. While some
| may say this is cheating, a human would be able to deduce
| correct answers by understanding the deep implications of the
| new physical laws in that custom universe.
| dogcomplex wrote:
| lol no, with a small amount of generally-applicable prompt-fu
| it answers with a ton of nuance that perfectly encapsulates the
| problem:
| https://chatgpt.com/share/6775b99d-2220-8003-993b-8dd008a97f...
| ted_dunning wrote:
| Even simpler, I asked Gemini (Flash 1.5) this variant of the
| question:
|
| ``` I have two bags, one can hold a pound of gold and one can
| hold a pound of feathers. Which bag is heavier? ```
|
| The point here is: a) the question really is a bit too vague,
| and b) if you assume that each bag is made of the same material
| and that each bag is just big enough to hold the contents, the
| bag for the feathers will need to be much bigger than the bag
| for the gold and thus would weigh more.
|
| The first response was wrong because the model completely
| missed that the question was about the bags themselves:
|
| ``` They weigh the same.
|
| A pound of any substance, regardless of its density or volume,
| weighs one pound. Gold is much denser than feathers, meaning a
| pound of gold occupies a much smaller volume. Feathers are very
| light and fluffy, so a pound of feathers takes up a much larger
| volume. ```
|
| Asking a followup `But which bag is heavier?` resulted in
| recognition that this is a trick question, and an incorrect
| answer:
|
| ``` This is a classic trick question!
|
| Even though the bags hold the same weight (one pound), the bag
| of gold is likely to be heavier.
|
| Weight vs. Mass:
|
| Weight is the force of gravity acting on an object. Mass is the
| amount of matter in an object. The Bag Itself:
|
| The bag holding the gold, even if it's a lightweight material,
| will have some weight. The bag holding the feathers, while it
| might seem negligible, will also have some weight. Practical
| Considerations:
|
| In reality, the bag holding the gold might be significantly
| heavier due to the weight of the bag itself. Therefore, while
| the gold and feathers have the same mass (one pound), the bag
| containing the gold is likely to be heavier in a practical
| sense. ```
| notShabu wrote:
| IMO the fuzziness is actually a feature most of the time b/c I
| can pass misspelled words or close enough words and it'll still
| figure it out.
|
| Also, if we model the mental state of the LLM as a frazzled
| retail worker dealing with thousands of customers per second,
| the rote response is reasonable. As a dev, sometimes I get
| annoyed at QA for a hyper-narrow "trap" test case.
| curious_cat_163 wrote:
| The metaphor that might describe this paper is "iteration". I'd
| hazard to predict that we'll likely see more iterations of the
| following loop in 2025:
|
| -> A new benchmark emerges with a novel evaluation method.
|
| -> A new model saturates the benchmark by acquiring the novel
| "behavior."
|
| -> A new benchmark introduces yet another layer of novelty.
|
| -> Models initially fail until a lab discovers how to acquire the
| new behavior.
|
| Case in point: OpenAI addressed this last step by introducing a
| paradigm called deliberative alignment to tackle some of the
| ARC benchmarks. [1]
|
| Alongside all this technical iteration, there's a parallel cycle
| of product iteration, aiming to generate $ by selling intelligent
| software. The trillion $ questions are around finding the right
| iterations on both technical and product dimensions.
|
| [1] https://openai.com/index/deliberative-alignment/
| pama wrote:
| This workshop contribution is OK, and the benchmark is somewhat
| valuable even without the rephrasing part of the problems, but
| the rephrasing (of only a small number of problems) sometimes
| genuinely makes the problem more confusing to humans as well,
| either through poor phrasing (fig 3) or unneeded breaking of
| convention (fig 4; points in 2D are often P, with coordinates
| x,y). It would have been nice to see the effects of rephrasing
| on the latest (post-training-date) problems as a function of
| increased noising, to delineate this part of the confusion. I
| wonder how much better o3 is on the same benchmark.
|
| Also, the correct title of this contribution is: Putnam-AXIOM: A
| Functional and Static Benchmark for Measuring Higher Level
| Mathematical Reasoning
| deegles wrote:
| I still find it hard to believe that LLM methods will lead to
| "true" AI. No amount of processing power or data will be
| sufficient without something new.
| atleastoptimal wrote:
| ok but preview sucks, run it on o1 pro.
|
| 99% of studies claiming some out-of-distribution failure of an
| LLM use a model already made irrelevant by SOTA. These kinds of
| studies, with long turnaround and review periods, are not the
| best format for making salient points given the speed at which
| the SOTA horizon progresses.
| red75prime wrote:
| I wonder what the baseline OOD generalization for humans is. It
| takes around 7 years to generalize visual processing to X-ray
| images. How well does a number theorist respond to algebraic
| topology questions? How long would it take a human to learn to
| solve ARC challenges in the JSON format just as well as in the
| visual form?
| m3kw9 wrote:
| It still needs to be prompted in a way that's easy to
| understand. If you ask in a weird way, like "how do I not not
| not win" instead of "how do I lose", you are gonna run into
| problems.
| frikskit wrote:
| An interesting example of this is:
|
| There are 6 "a"s in the sentence: "How many 'a' in this
| sentence?"
|
| https://chatgpt.com/share/677582a9-45fc-8003-8114-edd2e6efa2...
|
| Whereas the typical "strawberry" variant is now correct.
|
| There are 3 "r"s in the word "strawberry."
|
| Clearly the lesson wasn't learned; the model was just trained
| on people highlighting this failure case.
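| For reference, the ground truth for both counting examples is a
| one-liner to check (plain case-sensitive character counting):
|
| ```
| print("How many 'a' in this sentence?".count("a"))  # 2
| print("strawberry".count("r"))                       # 3
| ```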
| bwfan123 wrote:
| Reminds me of software i have built which had some basic
| foundational problems. Each bug was fixed with a data-patch
| that fixed the symptom but not the cause.
|
| hence we continually played whack-a-mole with bugs. we would
| squash one bug, and another one would appear.
|
| same with llms, squash one problem with a data-fix, and another
| one pops-up.
| sealeck wrote:
| It also fails on things that aren't actual words.
|
| For example, the output for "how many x's are there in xaaax"
| is 3.
|
| https://chatgpt.com/share/677591fe-aa58-800e-9e7a-81870387be...
| orange_puff wrote:
| This is very interesting, but a couple of things to note: 1. o1
| still achieves > 40% on the varied Putnam problems, which is
| still a feat most math students would not achieve. 2. o3 solved
| 25% of the Epoch AI dataset. There was an interesting post that
| calls into question how difficult some of those problems
| actually are, but it still seems very impressive.
|
| I think a fair conclusion here is that reasoning models are
| still really good at solving very difficult math and
| competitive programming problems, but just better at ones they
| have seen before.
| dogcomplex wrote:
| I have a feeling the fact you're only slightly varying the input
| means the model is falling back into the question it was
| expecting and getting things wrong as a result. If you just
| varied it a little _more_ and added some general-purpose prompt-
| fu like:
|
| "First break the problem down into known facts, then pull
| relevant world knowledge, then bring it all together to assess
| the problem from multiple angles and make a conclusion. Do not
| immediately just use the first obvious conclusion."
|
| You're gonna get a lot better responses. I suspect this is more
| of a "look! LLMs make bad kneejerk responses when we try to trick
| them from what they were expecting!" rather than "Look! They
| aren't even smart reasoners, they can't even figure out these
| problems without memorizing!"
|
| They do memorize. But that cuts both ways - problems very close
| to the memorized ones mess with their perception, the same way
| humans will instinctively respond to something that looks like
| a face before stepping back and assessing.
___________________________________________________________________
(page generated 2025-01-01 23:00 UTC)