[HN Gopher] 30% drop in O1-preview accuracy when Putnam problems...
       ___________________________________________________________________
        
       30% drop in O1-preview accuracy when Putnam problems are slightly
       varied
        
       Author : optimalsolver
       Score  : 434 points
       Date   : 2025-01-01 12:21 UTC (10 hours ago)
        
 (HTM) web link (openreview.net)
 (TXT) w3m dump (openreview.net)
        
       | yifanl wrote:
       | Is it just an open secret that the models are currently just
       | being hardcoded for random benchmarks? Seems weird that people
       | would be posing Putnam problems to a chatbot :/
        
         | Trasmatta wrote:
         | Not hardcoded, I think it's just likely that those problems
         | exist in its training data in some form
        
           | hansworst wrote:
           | Isn't that just the LLM equivalent of hardcoding though?
        
             | Trasmatta wrote:
             | I wouldn't call that hardcoding, otherwise you'd have to
             | call everything it does "hardcoded".
        
             | freehorse wrote:
             | "Overfitting" would be a bit more accurate term if the
             | problem lies in the specific examples existing in its
             | training set in various forms, places, languages etc but
             | with the same values.
        
           | Panzer04 wrote:
           | Seems a bit picky. If the bot has seen the exact problem
           | before it's not really doing anything more than recall to
           | solve it.
        
             | bandrami wrote:
              | 20 years ago in grad school we were doing a very early
              | iteration of this: we built Markov chains from
              | Shakespeare's plays and wanted to produce a plausibly
              | "Shakespearean" clause given a single starting word, and a
              | bearish professor said, "the more plausible it gets, the
              | more I worry people might forget plausibility is all that
              | it promises."
             | 
             | (There was also a much earlier piece of software that would
             | generate semi-intelligible Kant or Hegel one sentence at a
              | time, though that worked through a series of a priori
              | generation rules and a dictionary of stock phrases that
              | was large for the time. I wonder whatever happened to
              | that.)
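
       A word-level Markov chain of the sort described is only a few
       lines of code. A minimal sketch in Python, assuming a plain-text
       corpus of the plays in a hypothetical file shakespeare.txt; the
       output is locally plausible and promises nothing more than
       plausibility, which was the professor's point:

           import random
           from collections import defaultdict

           def build_chain(text):
               # Map each word to the list of words observed to follow it.
               words = text.split()
               chain = defaultdict(list)
               for current, following in zip(words, words[1:]):
                   chain[current].append(following)
               return chain

           def generate(chain, start_word, length=12):
               # Walk the chain, sampling one successor word at a time.
               word, out = start_word, [start_word]
               for _ in range(length - 1):
                   successors = chain.get(word)
                   if not successors:
                       break
                   word = random.choice(successors)
                   out.append(word)
               return " ".join(out)

           if __name__ == "__main__":
               with open("shakespeare.txt") as f:  # hypothetical corpus file
                   chain = build_chain(f.read())
               print(generate(chain, "love"))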
        
               | jeffreygoesto wrote:
               | It became a successful consultant...
        
               | sickblastoise wrote:
               | I think your prof's worries came true on a massive scale
        
             | marcosdumay wrote:
             | That said, a bot with contextual recall can be very useful.
             | 
             | The problem is just that people keep insisting that those
             | things are intelligent.
        
           | wslh wrote:
           | If I remember well, this is called overfitting [1].
           | 
           | [1] https://en.wikipedia.org/wiki/Overfitting
        
           | InkCanon wrote:
           | I've always assumed they removed it, because it's such a
           | basic and fundamental part of ML training that you separate
           | your test and train data. And yet I never see any papers even
           | mention if/how they do this. And if they do, I wonder how
           | they guarantee with high reliability that their massive
           | terabytes of data don't contain the answer.
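
       For what it's worth, the usual first-pass decontamination check is
       n-gram overlap between training documents and the evaluation set.
       A minimal sketch, assuming train and eval items are plain strings;
       it only catches verbatim leakage, not the translations and
       paraphrases mentioned elsewhere in the thread:

           def ngrams(text, n=8):
               # Real pipelines often use longer n-grams (e.g. 13-grams).
               tokens = text.lower().split()
               return {" ".join(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1)}

           def contaminated(train_doc, eval_items, n=8):
               doc_ngrams = ngrams(train_doc, n)
               return any(doc_ngrams & ngrams(item, n) for item in eval_items)

           eval_set = ["Prove that for every positive integer n the sum "
                       "1 + 2 + ... + n equals n(n+1)/2."]
           train_docs = ["A classic exercise: prove that for every positive "
                         "integer n the sum 1 + 2 + ... + n equals n(n+1)/2.",
                         "An unrelated blog post about gardening."]

           kept = [d for d in train_docs if not contaminated(d, eval_set)]
           print(f"kept {len(kept)} of {len(train_docs)} documents")  # kept 1 of 2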
        
             | llm_trw wrote:
             | Imagine you have someone polluting your training data every
             | day. That's what happens when you scrape any tech forum
             | today.
             | 
              | The short version is that LLM training data is the lowest
             | quality data you are likely to see unless you engage in
             | massive potential copyright infringement.
        
               | ryvi wrote:
                | > unless you engage in massive potential copyright
                | infringement.
                | 
                | And nobody is going to do that
        
             | YetAnotherNick wrote:
             | First of all, Putnam is not in the test data, at least I
             | haven't seen OpenAI claiming that publicly. Secondly,
             | removing it from internet data is not 100% accurate. There
              | are translations of the problems and solutions, or
              | references, and a direct match is not enough. MMLU and
              | test-set benchmarks show more resilience, though, in some
              | previous research.
        
               | rst wrote:
               | OpenAI is extremely cagey about what's in their test data
               | set generally, but absent more specific info, they're
               | widely assumed to be grabbing whatever they can. (Notably
               | including copyrighted information used without explicit
               | authorization -- I'll take no position on legal issues in
               | the New York Times's lawsuit against OpenAI, but at the
               | very least, getting their models to regurgitate NYT
               | articles verbatim demonstrates pretty clearly that those
               | articles are in the training set.)
        
               | fn-mote wrote:
               | Let's think about this.
               | 
               | > Putnam is not in the test data, at least I haven't seen
               | OpenAI claiming that publicly
               | 
               | What exactly is the source of your belief that the Putnam
               | would not be in the test data? Didn't they train on
               | everything they could get their hands on?
        
               | whimsicalism wrote:
               | do you understand the difference between test data and
               | train data? just reread this thread of comments
        
               | YetAnotherNick wrote:
                | I don't know why you and I are getting downvoted.
                | Sometimes, the HN crowd is just unhinged against AI.
        
               | whimsicalism wrote:
               | funny that nobody replying to you seems to even know what
               | a test set is. i always overestimate the depth of ML
               | conversation you can have on HN
        
               | chvid wrote:
               | It is on the open internet - questions and suggested
               | solutions:
               | 
               | https://kskedlaya.org/putnam-archive/
               | 
               | I would expect all llms to be trained on it.
        
             | jprete wrote:
             | I don't see any reason to assume they removed it unless
             | they're very explicit about it. Model publishers have an
             | extremely strong vested interest in beating benchmarks and
             | I expect them to teach to the test if they can get away
             | with it.
        
               | stingraycharles wrote:
               | As usual, once a metric becomes a target, it stops being
               | useful.
        
               | franktankbank wrote:
               | Well, they are doing BigCorpStuff not Science
        
               | whimsicalism wrote:
                | Putnam isn't an LLM benchmark, ahhhh. None of these
                | companies are reporting Putnam scores; there's nothing
                | nefarious about training on Putnam problems.
        
             | captainbland wrote:
             | I think it's reasonable to assume that openAI is optimising
             | for maximum hype at this point which may include wilfully
             | overfitting for impactful benchmarks to generate positive
             | reports.
        
               | lupire wrote:
               | When 4 came out they released a document that did BOTH
               | inflate scores by changing the exam conditions, and also
               | bragged about scoring _worse than guessing_ on a multiple
               | choice test.
        
             | whimsicalism wrote:
             | But putnam isn't an official test? I find llm discourse on
             | hn so frustrating
        
             | marcosdumay wrote:
             | How could they remove it?
             | 
             | Those are well known problems, that people talk about on
             | different contexts. They would have to review their entire
             | training set.
        
             | woopwoop wrote:
             | I agree that openai is somewhat sketchy about this, but
             | they're sketchy about everything. In the past though they
             | have admitted up front to data contamination (e.g. the
             | original gpt-4 press release did not use big-bench as a
             | benchmark due to data contamination). For the Putnam in
             | particular: this is not a benchmark that they use. There is
             | no reason to exclude it since it is not part of the "test
             | set" in any meaningful sense.
        
           | jsheard wrote:
           | It certainly feels like certain patterns are hardcoded
           | special cases, particularly to do with math.
           | 
           |  _" Solve (1503+5171)*(9494-4823)"_ reliably gets the correct
           | answer from ChatGPT
           | 
           |  _" Write a poem about the solution to
           | (1503+5171)*(9494-4823)"_ hallucinates an incorrect answer
           | though
           | 
           | That suggests to me that they've papered over the model's
           | inability to do basic math, but it's a hack that doesn't
           | generalize beyond the simplest cases.
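
       For reference, the expression has a single exact value, which is
       trivial to check outside the model:

           # (1503 + 5171) * (9494 - 4823) = 6674 * 4671
           print((1503 + 5171) * (9494 - 4823))  # 31174254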
        
             | whimsicalism wrote:
             | https://chatgpt.com/share/67755e6f-bfc8-8010-9aa3-8bcbbd9b2
             | 6...
        
               | jsheard wrote:
                | To be clear, I was testing with 4o; good to know that o1
                | has a better grasp of basic arithmetic. Regardless, my
                | point was less to do with the model's ability to do math
                | and more to do with OpenAI seeming to cover up its lack
               | of ability.
        
               | whimsicalism wrote:
               | i think it's mostly that o1 mini can think through the
               | solution before it starts writing the poem.
               | 
               | i'm able to reproduce your failure on 4o
        
             | lelandfe wrote:
             | "a poem about" reads to _me_ at least like the solution
             | need not be in the answer; maybe something like "a poem
             | that includes the answer in the last stanza"
        
               | whimsicalism wrote:
               | yeah but it like actually gets the answer wrong not just
               | omits it
        
             | mmmore wrote:
             | There's a few things there that could be going on that seem
             | more likely than "hardcoded".
             | 
             | 1. The part of the network that does complex math and the
              | part that writes poetry overlap in strange ways.
             | 
             | 2. Most of the models nowadays are assumed to be some
             | mixture of experts. So it's possible that saying write the
             | answer as a poem activates a different part of the model.
        
           | mlepath wrote:
           | Yea, people have a really hard time dealing with data leakage
           | especially on data sets as large as LLMs need.
           | 
           | Basically, if something appeared online or was transmitted
           | over the wire, it should no longer be eligible to evaluate
           | on. D. Sculley had a great talk at NeurIPS 2024 (the same
           | conference this paper was in) titled "Empirical Rigor at
           | Scale - or, How Not to Fool Yourself".
           | 
           | Basically no one knows how to properly evaluate LLMs.
        
         | resoluteteeth wrote:
         | > Is it just an open secret that the models are currently just
         | being hardcoded for random benchmarks? Seems weird that people
         | would be asking Putnam problems to a chatbot :/
         | 
         | It's because people do keep asking these models math problems
         | and then, when they get them right, citing it as evidence that
         | they can actually do mathematical reasoning.
         | 
         | Since it's hard to determine what the models know, it's hard to
         | determine when they're just spitting out something they were
         | specifically trained on.
        
         | strangescript wrote:
         | There are tests they are passing that they can't be hardcoded
         | for by design. They still have all kinds of flaws and
         | inconsistencies, but getting upset that they answer "2+2=4" because
         | someone trained them on what the answer to 2+2 is supposed to
         | be is silly.
        
         | bwfan123 wrote:
         | this work is similar to the GSM symbolic paper (applied to
         | putnam) https://arxiv.org/html/2410.05229v1
         | 
         | going forward, llm performance must be reported on the
         | confounded benchmark as well
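
       The perturbation idea in both papers amounts to templating a
       problem, re-sampling its surface details (names, constants), and
       recomputing the reference answer from the same values. A minimal
       sketch with a hypothetical toy problem, far simpler than a Putnam
       question:

           import random

           TEMPLATE = ("A train travels {d} km at a constant speed of "
                       "{v} km/h. How many hours does the trip take?")

           def make_variant(rng):
               # Re-sample the constants; the reference answer is derived
               # from the same values, so grading stays exact.
               v = rng.choice([40, 60, 80, 120])
               d = v * rng.randint(2, 9)  # keep the answer a whole number
               return TEMPLATE.format(d=d, v=v), d // v

           rng = random.Random(0)
           for _ in range(3):
               question, answer = make_variant(rng)
               print(question, "->", answer, "hours")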
        
       | obblekk wrote:
       | I hope someone reruns this on o1 and eventually o3.
       | 
       | If o1-preview was the start like gpt1, then we should expect
       | generalization to increase quickly.
        
         | jokethrowaway wrote:
         | I don't think LLMs generalise much; that's why they're not
         | creative and can't solve novel problems. It's pattern matching
         | with a huge amount of data.
         | 
         | Study on the topic: https://arxiv.org/html/2406.15992v1
         | 
         | This would explain o1's poor performance on problems with
         | variations. o3 seems to be doing expensive brute forcing in
         | latent space followed by verification, which should yield better
         | results - but I don't think we can call it generalisation.
         | 
         | I think we need to go back to the drawing board.
        
           | red75prime wrote:
           | Don't worry, there are thousands of researchers at the
           | drawing boards right now.
        
             | Culonavirus wrote:
             | Yeah, because if the AI boom becomes the AI bust, we'll
             | have another 2008-level economic crisis on our hands.
             | 
             | The investments into AI are in the hundreds of billions
             | (maybe even more if you factor in the amount of people
             | studying and researching AI), but the returns are in the
             | tens of billions (if even that).
             | 
             | If you exclude the "growth" coming from the industry
             | sniffing its own farts (e.g. Nvidia selling insane amounts
             | of insanely overpriced GPUs to InsertYourFavAICorp), the
             | actual amount of "useful goods and services" produced (api
             | accesses, chat subscriptions, ai-enabled app growth etc.)
             | are tiny compared to the investment levels.
             | 
             | The AI train appears to have no brakes. A massive crash or
             | AGI are the only options now. Both are going to be bad for
             | average humans.
        
           | UniverseHacker wrote:
           | From firsthand experience, this simply cannot be true. I can
           | give them totally novel and unique physics problems I just
           | made up - ones that require tracking the movement of objects
           | through a series of events - and they answer most of them
           | correctly.
           | Moreover, they find analogies between disparate concepts and
           | fields of study and make useful suggestions based on them-
           | which is arguably the same process as human creativity.
           | 
           | I think ultimately the disconnect is people theorizing about
           | what it can or cannot do with an incorrect mental model of
           | what it is, and then assuming it cannot do things that it can
           | in fact do. The irony of discussions on LLMs is that they
           | showcase, more than anything, the limits of humans' ability
           | to reason about novel situations.
        
           | s1mplicissimus wrote:
           | the fact that this (and tons of other legitimate critique)
           | got downvoted into greytext speaks so much louder to me than
           | all benchmarks in the world
        
         | mupuff1234 wrote:
         | You're assuming that openAI isn't just gonna add the new
         | questions to the training data.
        
           | Lerc wrote:
           | Their methodology shows they can create an infinite variety
           | of problems.
           | 
           | This is the same thing as synthetic training data.
           | 
           | It doesn't matter if models are trained on the output of the
           | generated data or not. If the model ends up being able to
           | solve newly generated variations, you'd have to admit that it
           | understands the underlying problems.
        
             | mupuff1234 wrote:
              | I think what it shows is that it has minimal
              | "understanding" of the problem - otherwise such small
              | variations wouldn't pose a challenge. Training it to
              | handle these specific small variations doesn't change
              | that.
             | 
              | It's good at automation, not understanding.
        
               | Lerc wrote:
               | If it were a complete failure on variations I would be
               | inclined to agree. Instead it was a 30% drop in
               | performance. I would characterise that as limited
               | understanding.
        
               | sirolimus wrote:
               | Fully agree with this
        
               | cgriswald wrote:
               | My guess is that what's understood isn't various parts of
               | solving the problem but various aspects of the expected
               | response.
               | 
               | I see this more akin to a human faking their way through
               | a conversation.
        
           | sirolimus wrote:
           | Exactly. The naivety is just sky-high.
        
       | huitzitziltzin wrote:
       | This result is the same as a recent test of the same
       | method+hypothesis from a group at Apple, no? I don't have that
       | reference handy but I don't think I'm making it up.
        
         | intelkishan wrote:
         | I think you are probably referring to the following paper:
         | https://arxiv.org/abs/2410.05229
        
           | huitzitziltzin wrote:
           | Yup, looks like the one I meant!
           | 
           | I am impressed by the progress on LLMs but I remain skeptical
           | that they can replace humans.
           | 
           | Perhaps some (distant!) future model but I don't fear mass
           | unemployment (for example) or even moderate LLM-driven
           | unemployment in the near-to-medium term.
           | 
           | They can clearly complement human labor but there are
           | vanishingly few domains where they can be substitutes.
        
       | rubymamis wrote:
       | I would love to see how well DeepSeek V3 does on this.
        
         | KTibow wrote:
         | Probably even worse, since I've heard that it's hard to steer
         | it away from the most common interpretation of a question.
        
       | rrr_oh_man wrote:
       | Performance of these LLMs on real-life tasks feels very much like
       | students' last-minute cramming for Asian-style exams.
       | 
       | The ability to perfectly regurgitate, with no concept of
       | meaning.
        
         | anilakar wrote:
         | Basically yet another proof that we have managed to perfectly
         | recreate human stupidity :-)
        
           | cscurmudgeon wrote:
           | Good students are immune to variations that are discussed in
           | the paper. But most academic tests may not differentiate
           | between them and the crammers.
        
             | falcor84 wrote:
             | > Good students are immune to variations
             | 
             | I don't believe that. I'd put some good money that if an
             | excellent student is given an exact question from a
             | previous year, they'll do better (faster & more accurate)
             | on it, than when they're given a variation of it.
        
               | n144q wrote:
               | I don't think you are betting on the same thing the
               | parent comment is talking about.
               | 
               | The assumptions aren't the same to begin with.
        
               | Bjartr wrote:
               | What's the difference between benefitting from seeing
               | previous problems and being worse off when not having a
               | previous problem to go from?
        
               | fn-mote wrote:
               | The point is that the "good student" will still do well
               | on the variations, not suffer a 30% decrease in grade.
        
         | hshshshshsh wrote:
         | Look into JEE Advanced.
        
           | umeshunni wrote:
           | https://openreview.net/forum?id=YHWXlESeS8
           | 
           | Our evaluation on various open-source and proprietary models
           | reveals that the highest performance, even after using
           | techniques like self-consistency, self-refinement and chain-
           | of-thought prompting, is less than 40%. The typical failure
           | modes of GPT-4, the best model, are errors in algebraic
           | manipulation, difficulty in grounding abstract concepts into
           | mathematical equations accurately and failure in retrieving
           | relevant domain-specific concepts.
           | 
           | I'm curious how something like O1 would perform now.
        
         | whimsicalism wrote:
         | o3 is able to get 25% on never-before-seen FrontierMath
         | problems. Sure, the models do better when the answer is
         | directly in their dataset, but they've already surpassed the
         | average human on novel, held-out problems.
        
           | jvanderbot wrote:
           | The average human did zero studying on representative
           | problems. LLMs did _a lot_.
        
             | whimsicalism wrote:
             | Okay? We are measuring capabilities.
        
             | tzs wrote:
             | I don't know anything about frontiermath problems, but for
             | Putnam problems (which is what the submitted article is
             | about) the average human that takes the exam is an
             | undergraduate mathematics or science major who has studied
             | prior Putnam problems and other similar problems recently
             | to specifically prepare for the exam...and the most common
             | score is still 0.
             | 
             | At top tier schools the most common score will usually be
             | somewhere in the 0 to 10 range (out of a possible 120).
        
           | fldskfjdslkfj wrote:
           | > never seen before frontiermath problems
           | 
           | How do you know that?
        
             | whimsicalism wrote:
             | Because that is the whole conceit of how frontiermath is
             | constructed
        
               | fldskfjdslkfj wrote:
               | Didn't they run a bunch of models on the problem set? I
               | doubt they are hosting all those models on their own
               | infrastructure.
        
               | whimsicalism wrote:
               | 1. OpenAI has confirmed it's not in their train (unlike
               | putnam where they have never made any such claims)
               | 
               | 2. They don't train on API calls
               | 
               | 3. It is funny to me that HN finds it easier to believe
               | theories about stealing data from APIs rather than an
               | improvement in capabilities. It would be nice if
               | symmetric scrutiny were applied to optimistic and
               | pessimistic claims about LLMs, but I certainly don't feel
               | that is the case here.
        
               | fldskfjdslkfj wrote:
               | Easier to believe or not, thinking that it's not a
               | reasonable possibility is also funny.
        
               | whimsicalism wrote:
               | Do you also think they somehow stole the codeforces
               | problems before they were even written or you are willing
               | to believe the #175 global rank there?
        
               | fldskfjdslkfj wrote:
               | I dont think codeforce claims to contain novel
               | unpublished problems.
               | 
               | But i'm not saying it's what they did, just that it's a
               | possibility that should be considered till/if it is
               | debunked.
        
               | whimsicalism wrote:
               | frankly i'm not sure what standard you would possibly
               | consider a debunking
               | 
                | Codeforces constantly adds new problems; that's like the
               | entire point of the contest, no?
        
               | jcranmer wrote:
               | The modern state of training is to try to use everything
               | they can get their hands on. Even if there are privileged
               | channels that are guaranteed not to be used as training
               | data, mentioning the problems on ancillary channels (say
               | emailing another colleague to discuss the problem) can
               | still create a risk of leakage because nobody making the
               | decision to include the data is aware that stuff that
               | should be excluded is in that data set. And as we've seen
               | from decades of cybersecurity, people are absolute shit
               | at the necessary operational security to avoid mentioning
               | stuff on ancillary channels!
               | 
               | Given that performance is known to drop considerably on
               | these kinds of tests when novel problems are tried, and
               | given the ease with which these problems could leak into
               | the training set somehow, it's not unreasonable to be
               | suspicious of a sudden jump in performance as merely a
               | sign that the problems made it into the training set
               | rather than being true performance improvements in LLMs.
        
               | whimsicalism wrote:
               | Okay, then what about elite level codeforces performance?
               | Those problems weren't even constructed until after the
               | model was made.
               | 
               | The real problem with all of these theories is most of
               | these benchmarks were constructed after their training
               | dataset cutoff points.
               | 
               | A sudden performance improvement on a new model release
               | is not suspicious. Any model release that is much better
               | than a previous one is going to be a "sudden jump in
               | performance."
               | 
               | Also, OpenAI is not reading your emails - certainly not
               | with a less than one month lead time.
        
               | ImPostingOnHN wrote:
               | Can you give an example of one of these problems that
               | 'wasn't even constructed until after the model was made'?
               | 
               | I'd like to see if it's truly novel and unique, the first
                | problem of its type ever conceived by mankind, or if it's
               | similar to existing problems.
        
               | whimsicalism wrote:
               | Sorry, I thought the whole point of this thread was that
               | models can't handle problems when they are "slightly
               | varied". Mottes and baileys all over the place today.
        
               | sudosysgen wrote:
               | The point is that it's not consistent on variations,
               | unless it finds a way to connect it to something it
               | already knows. The fact it sometimes succeeds on
               | variations (in codeforces the models are allowed multiple
               | tries, sometimes ridiculous numbers, to be useful)
               | doesn't matter.
               | 
               | The point is that the fact it's no longer consistent once
               | you vary the terminology indicates it's fitting a
               | memorized template instead of reasoning from first
               | principles.
        
               | sudosysgen wrote:
                | o1 has a ~1650 rating; at that level many or most
               | problems you will be solving are going to be a transplant
               | of a relatively known problem.
               | 
               | Since o1 on codeforces just tried hundreds or thousands
               | of solutions, it's not surprising it can solve problems
               | where it is really about finding a relatively simple
               | correspondence to a known problem and regurgitating an
               | algorithm.
               | 
               | In fact when you run o1 on ""non-standard"" codeforces
               | problems it will almost always fail.
               | 
               | See for example this post running o1 multiple times on
               | various problems:
               | https://codeforces.com/blog/entry/133887
               | 
               | So the thesis that it's about recognizing a problem with
               | a known solution and not actually coming up with a
               | solution yourself seems to hold, as o1 seems to fail even
               | on low rated problems which require more than fitting
               | templates.
        
               | s1mplicissimus wrote:
               | > 1. OpenAI has confirmed it's not in their train (unlike
               | putnam where they have never made any such claims)
               | 
               | Companies claim lots of things when it's in their best
               | financial interest to spread that message. Unfortunately
               | history has shown that in public communications,
               | financial interest almost always trumps truth (pick
               | whichever $gate you are aware of for convenience, i'll go
               | with Dieselgate for a specific example).
               | 
               | > It is funny to me that HN finds it easier to believe
               | theories about stealing data from APIs rather than an
               | improvement in capabilities. It would be nice if
               | symmetric scrutiny were applied to optimistic and
               | pessimistic claims about LLMs, but I certainly don't feel
               | that is the case here.
               | 
               | What I see is generic unsubstantiated claims of
               | artificial intelligence on one side and specific,
               | reproducible examples that dismantle that claim on the
               | other. I wonder how your epistemology works that leads
               | you to accept marketing claims without evidence
        
               | ttul wrote:
               | OpenAI's credibility is central to its business:
               | overstating capabilities risks public blowback, loss of
               | trust, and regulatory scrutiny. As a result, it is
               | unlikely that OpenAI would knowingly lie about its
               | models. They have much stronger incentives to be as
               | accurate as possible--maintaining their reputation and
               | trust from users, researchers, and investors--than to
               | overstate capabilities for a short-term gain that would
               | undermine their long-term position.
               | 
               | From a game-theoretic standpoint, repeated interactions
               | with the public (research community, regulators, and
               | customers) create strong disincentives for OpenAI to lie.
               | In a single-shot scenario, overstating model performance
               | might yield short-term gains--heightened buzz or
               | investment--but repeated play changes the calculus:
               | 
               | 1. Reputation as "collateral"
               | 
               | OpenAI's future deals, collaborations, and community
               | acceptance rely on maintaining credibility. In a repeated
               | game, players who defect (by lying) face future
               | punishment: loss of trust, diminished legitimacy, and
               | skepticism of future claims.
               | 
               | 2. Long-term payoff maximization
               | 
               | If OpenAI is caught making inflated claims, the fallout
               | undermines the brand and reduces willingness to engage in
               | future transactions. Therefore, even if there is a short-
               | term payoff, the long-term expected value of accuracy
               | trumps the momentary benefit of deceit.
               | 
               | 3. Strong incentives for verification
               | 
               | Independent researchers, open-source projects, and
               | competitor labs can test or replicate claims. The
               | availability of external scrutiny acts as a built-in
               | enforcement mechanism, making dishonest "moves" too
               | risky.
               | 
               | Thus, within the repeated game framework, OpenAI
               | maximizes its overall returns by preserving its
               | credibility rather than lying about capabilities for a
               | short-lived advantage.
        
               | Groxx wrote:
               | >OpenAI's credibility is central to its business:
               | overstating capabilities risks public blowback, loss of
               | trust, and regulatory scrutiny.
               | 
               | Uh huh. Kinda like what's happening right now?
               | 
               | They're marketing blow-hards. Everyone knows it. They've
               | been wildly over-stating capabilities (and future
               | capabilities!) as long as Altman has had power, and
               | arguably longer.
               | 
                | They'll do it as long as they can _get away with it_,
               | because that's all that is needed to make money on it.
               | Factual accuracy rarely impacts the market when it's so
               | hype-driven, especially when there is still some unique
               | utility in the product.
        
               | F7F7F7 wrote:
               | Find me the folks who see nothing but good will in
               | OpenAI's actions and I'll find you the folks who have
               | been hyping up AGI for the last 2 years.
               | 
               | 4 was literally sitting on a shelf waiting for release
               | when 3.5 was launched. 4o was a fine tune that took over
               | two years. o1 is embarrassingly unimpressive chain of
               | thought which is why they hide it.
               | 
               | The company hit a wall a year ago. But showing progress
               | towards AGI keeps the lights on. If they told the truth
               | at their current burn rate...they'd have no money.
               | 
               | You don't need game theory to figure that one out.
        
               | Spooky23 wrote:
               | Frankly you need to read what they say explicitly and not
               | infer what they mean by your reckoning.
               | 
               | They are the system to beat and their competitors are
               | either too small or too risk averse.
               | 
               | They ingest millions of data sources. Among them is the
               | training data needed to answer the benchmark questions.
        
       | itfossil wrote:
       | Oh so its almost like everything else AI related, they basically
       | cheated and lied.
       | 
       | If you are shocked by this, you are the sucker in the room.
        
       | youworkwepay wrote:
       | Or it's time to step back and call it what it is - very good
       | pattern recognition.
       | 
       | I mean, that's cool... we can get a lot of work done with pattern
       | recognition. Most of the human race never really moves above that
       | level of thinking in the workforce or navigating their daily
       | life, especially if they default to various societally prescribed
       | patterns of getting stuff done (e.g. go to college or the
       | military based on <these criteria>, find a job based on the best
       | fit with <this list of desirable skills & experiences>, go to
       | <these places> to find love...)
        
         | IshKebab wrote:
         | > Or it's time to step back and call it what it is - very good
         | pattern recognition.
         | 
         | Or maybe it's time to stop wheeling out this tedious and
         | disingenuous dismissal.
         | 
         | Saying it is just "pattern recognition" (or a "stochastic
         | parrot") implies behavioural and performance characteristics
         | that have very clearly been greatly exceeded.
        
           | dsr_ wrote:
           | Citation needed. Please be more specific, or else this is
           | just a tedious and disingenuous advocacy.
        
             | Lerc wrote:
              | GPT-4 can add very large integers.
             | 
              | It is evident that it is not recalling the sum, because
              | all combinations of integer addition were likely not in
              | the training data: storing the answers to all sums of
              | integers up to the size that GPT-4 can manage would take
              | more parameters than the model has.
             | 
             | That addition is a small capability but you only need a
             | single counterexample to disprove a theory.
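
       The counting argument can be made concrete with rough numbers. The
       10^12 parameter figure below is purely illustrative, since GPT-4's
       actual size is not public:

           # Distinct ordered pairs of 20-digit integers vs. a rough
           # parameter budget: even one bit per memorized sum is impossible.
           pairs = (10**20) ** 2   # ~1e40 possible a + b problems
           params = 10**12         # illustrative parameter count (not public)
           print(f"problems per parameter: {pairs / params:.1e}")  # ~1.0e+28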
        
               | alexashka wrote:
               | > That addition is a small capability but you only need a
               | single counterexample to disprove a theory
               | 
               | No, that's not how this works :)
               | 
               | You can hardcode an exception to pattern recognition for
               | specific cases - it doesn't cease to be a pattern
               | recognizer with exceptions being sprinkled in.
               | 
               | The 'theory' here is that a pattern recognizer can lead
               | to AGI. _That_ is the theory. Someone saying  'show me
               | proof or else I say a pattern recognizer is just a
               | pattern recognizer' is not a theory and thus cannot be
               | disproven, or proven.
               | 
               | This is also known as Russell's teapot.
               | https://en.wikipedia.org/wiki/Russell%27s_teapot
               | 
               | If someone claims there's a teapot out in space - the
               | burden of proof is on the person making the claim, not on
               | the person saying it is bullshit.
        
               | reissbaker wrote:
               | GPT-4o doesn't have hardcoded math exceptions. If you
               | would like something verifiable, since we don't have the
               | source code to GPT-4o, consider that Qwen 2.5 72b can
               | also add large integers, and we _do_ have the source code
               | and weights to run it... And it 's just a neural net.
               | There isn't secret "hardcode an exception to pattern
               | recognition" in there that parses out numbers and adds
               | them. The neural net simply learned to do it.
        
               | alexashka wrote:
               | That's interesting, I didn't know that, thanks.
               | 
               | Is the claim then that LLMs are pattern recognizers but
               | also _more_?
               | 
               | It just seems to me and I guess many others that the
               | thing it is _primarily_ good at is being a better google
               | search.
               | 
               | Is there something big that I and presumably many others
               | are missing and if so, what is it?
        
               | Lerc wrote:
               | It's not hardcoded, reissbaker has addressed this point.
               | 
               | I think you are misinterpreting what the argument is.
               | 
               | The argument being made is that LLMs are mere 'stochastic
               | parrots' and therefore cannot lead to AGI. The analogy to
               | Russell's teapot is that someone is claiming that
                | Russell's teapot is not there because china cannot exist
               | in the vacuum of space. You can disprove that with a
               | single counterexample. That does not mean the teapot is
               | there, but it also doesn't mean it isn't.
               | 
               | It is also hard to prove that something is thinking. It
               | is also very difficult to prove that something is not
               | thinking. Almost all arguments against AGI take the form
               | X cannot produce AGI because Y. Those are disprovable
               | because you can disprove Y.
               | 
               | I don't think anyone is claiming to have a proof that an
               | LLM will produce AGI, just that it might. If they
               | actually build one, that too counts as a counterexample
               | to anybody saying they can't do it.
        
           | jampekka wrote:
           | What the fundamental limitations of "pattern recognition" or
           | "stochastic parrots" that LLMs have exceeded?
        
             | IshKebab wrote:
             | They can generalise to novel inputs. Ok often they mess it
             | up and they're clearly _better_ at dealing with inputs they
              | have seen before (who isn't?), but they can still reason
             | about things they have never seen before.
             | 
             | Honestly if you don't believe me just go and use them. It's
             | pretty obvious if you actually get experience with them.
        
         | K0balt wrote:
         | So, I am conflicted about this.
         | 
         | If we take an example of what is considered a priori as
         | creativity, such as story telling, LLMs can do pretty well at
         | creating novel work.
         | 
         | I can prompt with various parameters, plot elements, moral
         | lessons, and get a de novo storyline, conflicts, relationships,
         | character backstories, intrigues, and resolutions.
         | 
         | Now, the writing style tends to be tone-deaf and poor at
         | building tension for the reader, and it is apparent that the
         | storytelling has little "theory of mind" of the reader, but the
         | material has elements that we would certainly consider to be
         | creative if written by a student.
         | 
         | It seems we must either cede that LLMs can do some creative
         | synthesis, as this and some other experiments of mine suggest,
         | or we must decide that these tasks, such as "creative writing"
         | are not in fact creative, but rather mostly or strictly
         | derivative.
         | 
         | There is some argument to be had in assertions that
         | storytelling is all derivative of certain patterns and
         | variations on a fixed number of tropes and story arcs... but
         | arguing this begs the question of whether humans actually do
         | any "pure" creative work , or if in fact, all is the product of
         | experience and study. (Training data)
         | 
         | Which leads me to the unpleasant conflict about the debate of
         | AI creativity. Is the debate really pointing out an actual
         | distinction, or merely a matter of degree? And what are the
         | implications, either way?
         | 
         | I'm left with the feeling that LLMs can be as capable of
         | creative work as most 8th grade students. What does this say
         | about AI, or developing humans? Since most people don't exceed
         | an 8th grade level of literacy, what does this say about
         | society?
         | 
         | Is there even such a thing as de novo idea synthesis?
         | 
         | Troubling questions abound.
        
           | danielbln wrote:
           | To add to this pondering: we are discussing the state today,
           | right now. We could assume this is as good as it's ever gonna
           | get, and all attempts to overcome some current plateau are
           | futile, but I wouldn't bet on it. There is a solid chance
           | that 8th grade level writer will turn into a post-grad writer
           | before long.
        
             | zeroonetwothree wrote:
             | So far the improvements in writing have not been as
             | substantial as those in math or coding (not even close,
             | really). Is there something fundamentally "easier" for LLMs
             | about those two fields?
        
               | danielbln wrote:
               | Much more formal structure and generally code can be
               | tested for correctness. Prose doesn't have that benefit.
               | That said, given the right prompt and LLM, you can
                | squeeze out surprisingly good stuff:
                | https://bsky.app/profile/talyarkoni.com/post/3ldfjm37u2s2x
        
           | zeroonetwothree wrote:
           | I have no doubt that LLMs do creative work. I think this has
           | been apparent since the original ChatGPT.
           | 
           | Just because something is creative doesn't mean it's
           | inherently valuable.
        
       | golol wrote:
       | Hmmm, without a human control it is not all that clear to me that
       | the variation problems are not more difficult.
        
       | retinaros wrote:
       | trained on test. who even trusts OAI anymore?
        
         | whimsicalism wrote:
         | they didn't test on putnam...
        
       | WiSaGaN wrote:
       | There is also a curated benchmark just for those famous problems
       | slightly variated:
       | https://github.com/cpldcpu/MisguidedAttention/tree/main/eval
        
         | coder543 wrote:
         | One problem from the benchmark:
         | "prompt_id": "river_crossing_easy",           "category":
         | "Logic Puzzle",           "title": "Easy river crossing",
         | "prompt": "A farmer is on one side of a river with a wolf, a
         | goat, and a cabbage. When he is crossing the river in a boat,
         | he can only take one item with him at a time. The wolf will eat
         | the goat if left alone together, and the goat will eat the
         | cabbage if left alone together. How can the farmer transport
         | the goat across the river without it being eaten?",
         | "expected_behavior": [             "Answer concludes that they
         | simply get in the boat and cross together in one trip"
         | ],
         | 
         | EDIT: removing most of my commentary on this problem. As a
         | human, I was tricked by the problem too. I would love to see
         | how a random selection of humans would do on this one... but it
         | just doesn't feel like a great test to me.
        
           | stpn wrote:
           | If you revise this prompt to satisfy your pedantry, (at
           | least) 4o still gets it wrong.
        
           | kace91 wrote:
           | >This is twisting the English language to assume that "item"
           | only refers to non-living things.
           | 
           | Not really. Unless I'm not reading correctly, most of the
           | problem is irrelevant, as you're only required to cross the
           | river with the goat; you don't care about the cabbage. The
           | difficulty lies in the assumption that you need to cross with
           | everything, due to the resemblance to the bigger problem.
        
             | sunir wrote:
             | You're reading it correctly. I read it again after your
             | comment and I realized I too pattern matched to the typical
             | logic puzzle before reading it carefully and exactly. I
             | imagine the test here is designed for this very purpose to
             | see if the model is pattern matching or reasoning.
        
           | mitemte wrote:
           | Wow, this seems ridiculous. The expected answer is basically
           | finding a loophole in the problem. I can imagine how
           | worthless all of these models would be if they behaved that
           | way.
        
             | stavros wrote:
             | It's not a loophole, the question is "how can he get the
             | goat across?". The answer is he just takes it across.
        
           | kccqzy wrote:
           | The problem is to ask the farmer to transport the goat. So
           | the farmer indeed gets in the boat with the goat. The
           | unstated gotcha is that the farmer is willing to abandon the
           | wolf and the cabbage. A heavily pattern-matching LLM or human
           | would immediately assume that the farmer needs to transport
           | all three.
        
             | coder543 wrote:
             | Yep, and that gotcha got me, as a perfectly non-silicon
             | human. My bad everyone.
        
           | cogman10 wrote:
           | No. Simply plug the prompt into ChatGPT and see what
           | happens.
           | 
           | The llm isn't getting confused by the meaning of "item". It's
           | recognizing a common problem and not picking up on the fact
           | that the farmer just needs to transport the goat and nothing
           | else.
           | 
           | Instead, it gives the standard answer for how to transport
           | everything across.
        
             | fragmede wrote:
              | I'll admit as a fallible human I didn't pick up on it, but
              | I was focused on the wrong thing because I've been using
              | "and the boat can take everything" and GPT-3 just could
              | not get that variation in one shot.
             | 
              | GPT-3 is old hat though. Later versions of GPT-4 manage to
              | get it with a bunch of coaching, and o1 manages to solve it
             | with less coaching.
        
       | ankit219 wrote:
       | They are highly effective pattern matchers. If you change the
       | pattern, it won't work. I don't remember who, but most likely
       | @tszzl (roon), commented on X that they still trained the
       | traditional way, and there is no test-time compute (TTC) or
       | Monte Carlo Tree Search (like AlphaGo) in o1 or o3. If that is
       | true, then it's still predicting the next word based on its
       | training data, likely following the most probable path - which
       | comes directly from the training itself - even for the slight
       | variations. Encouragingly, if TTC hasn't been explored, there is
       | a long runway for performance improvements.
       | 
       | The other reason this seems hard to guess is because we don't
       | know how much of what we are asking is in the training data. It
       | would perform well on some tasks while failing at others, even
       | though those are similar.
        
         | x_may wrote:
         | I believe they are using scalable TTC. The o3 announcement
         | released accuracy numbers for high and low compute usage, which
         | I feel would be hard to do in the same model without TTC.
         | 
         | I also believe that the $200 subscription they offer is just
         | them allowing the TTC to go for longer before forcing it to
         | answer.
         | 
         | If what you say is true, though, I agree that there is a huge
         | headroom for TTC to improve results if the huggingface
         | experiments on 1/3B models are anything to go off.
        
           | ankit219 wrote:
           | The other comment posted YT videos where OpenAI researchers
           | are talking about TTC. So, I am wrong. That $200 subscription
           | is just because the number of tokens generated is huge when
           | CoT is involved. Usually inference output is capped at
           | 2000-4000 tokens (max of ~8192) or so, but they cannot do it
           | with o1 and all the thinking tokens involved. This is true
           | with all the approaches - next token prediction, TTC with
           | beam/lookahead search, or MCTS + TTC. If you specify the
           | output token range as high and induce a model to think before
           | it answers, you will get better results on smaller/local
           | models too.
           | 
           | > huge headroom for TTC to improve results ...1B/3B models
           | 
           | Absolutely. How this is productized remains to be seen. I
           | have high hopes with MCTS and Iterative Preference Learning,
           | but it is harder to implement. Not sure if OpenAI has done
           | that. Though DeepMind's results are unbelievably good [1].
           | 
           | [1]:https://arxiv.org/pdf/2405.00451v2
        
           | whimsicalism wrote:
           | TTC is an incredibly broad term, and it is broadening as the
           | hype spreads. People are now calling CoT "TTC" because they
           | are spending compute on reasoning tokens before answering.
        
             | HarHarVeryFunny wrote:
              | Yes, and HuggingFace have published this post outlining
              | some of the potential ways to use TTC, including but not
              | limited to tree search, showing TTC performance gains from
              | Llama.
             | 
             | https://huggingface.co/spaces/HuggingFaceH4/blogpost-
             | scaling...
        
         | e1g wrote:
         | I recently watched two interviews with OpenAI researchers where
         | they describe that the breakthrough of the o-series (unlike the
         | GPT series) is to focus on test-time compute: they are designed
         | to "think" more, specifically to avoid pattern matching. Noam
         | Brown https://youtu.be/OoL8K_AFqkw?si=ocIS0YDXLvaX9Xb6&t=195
         | and Mark Chen https://youtu.be/kO192K7_FaQ?si=moWiwYChj65osLGy
        
           | ankit219 wrote:
           | Thank you, this is helpful. The post on X was seemingly
           | wrong.
        
             | mmmore wrote:
             | The comment was likely that there's no explicit search. In
             | o1, the model has learned how to search using its context.
             | Presumably they do this by RLing over long reasoning
             | strings/internal monologues.
        
         | HarHarVeryFunny wrote:
         | OpenAI have openly stated that o1 & o3 are using test time
         | compute, and released a log scale graph indicating linear
         | performance gains for exponential compute usage.
         | 
         | https://openai.com/index/learning-to-reason-with-llms/
         | 
         | They only confirm that the model/system is doing chain of
         | thought, but the exponential factor and origin of reasoning
         | gains likely come from TREE of thoughts (the number of
         | branches/compute goes up exponentially with depth), essentially
         | doing tree search over different reasoning chains.
         | 
         | I assume roon's identity is well known inside OpenAI (he's an
         | employee), so I wouldn't expect him to be leaking
         | implementation details on twitter.
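
       If the gains do come from search over reasoning chains, the core
       loop is easy to caricature. A purely speculative sketch, not
       OpenAI's published method; propose_step and score are stand-ins
       for the model and a learned verifier:

           import heapq

           def propose_step(chain, k=3):
               # Stand-in for sampling k candidate next reasoning steps
               # from a model.
               return [chain + [f"step{len(chain)}.{i}"] for i in range(k)]

           def score(chain):
               # Stand-in for a learned verifier / reward model over
               # partial chains.
               return -len(chain)  # dummy heuristic for the sketch

           def search(depth=4, beam=2):
               # Beam-style tree search over reasoning chains: without
               # pruning, candidates grow as branching**depth, hence the
               # exponential compute for deeper "thinking".
               frontier = [[]]
               for _ in range(depth):
                   candidates = [c for chain in frontier
                                 for c in propose_step(chain)]
                   frontier = heapq.nlargest(beam, candidates, key=score)
               return max(frontier, key=score)

           print(search())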
        
       | WiSaGaN wrote:
       | I don't think this proves that the LLM is just a "pattern
       | matcher". Humans make similar mistakes too, especially when
       | under time pressure (similar to a non-reasoning model that needs
       | to "use system one" to generate an answer in one go). This is
       | further evidenced by the fact that if you specifically ask the
       | models to pay attention to traps, or just ask the follow-up
       | question "are you sure?", then they usually can get it right.
        
         | jsheard wrote:
         | You're saying that humans perform worse on problems that are
         | _slightly_ different than previously published forms of the
         | same problem? To be clear we are only talking about changing
         | variable names and constants here.
        
           | exe34 wrote:
           | Often yes, because we assume we already know the answer and
           | jump to the conclusion. At least those of us with ADHD do.
        
             | zeroonetwothree wrote:
             | Not really true for Putnam problems since you have to write
             | a proof. You literally can't just jump to a conclusion and
             | succeed.
        
           | Lerc wrote:
           | That is the principle behind the game 'Simon says'
        
             | fldskfjdslkfj wrote:
             | 'Simon says' is about reaction time and pressure.
        
             | chairhairair wrote:
             | No, it's not at all.
             | 
             | This is all getting so tiresome.
        
             | zeroonetwothree wrote:
             | That's a very silly analogy. A more realistic analogy
             | would be: do humans perform better at computing 37x41 or
             | 87x91 (while showing their work)?
        
               | Lerc wrote:
               | It was not an analogy at all. It was a simplified
               | example of the idea that a slight change in a pattern
               | can induce errors in humans.
               | 
               | It seems some people disagree that that is what the game
               | "Simon Says" is about. I feel like they might play a
               | vastly simplified version of the game that I am familiar
               | with.
               | 
               | There was a recent episode of Game Changer based on this
               | which is an excellent example of how the game leader
               | should attempt to induce errors by making a change that
               | does not get correctly accounted for.
        
       | PunchTornado wrote:
       | isn't it weird that they didn't test gemini?
        
       | sirolimus wrote:
       | Yea no shit. LLMs are just REALLY good guessers. People gotta
       | stop the hype lol.
       | 
       | Using LLMs for anything serious and which requires consistency
       | and trustworthiness without hallucinations is irresponsible and
       | ridiculous.
       | 
       | Closed source LLMs are a bubble and a joke.
        
       | wim wrote:
       | One experiment I would love to see, although not really feasible
       | in practice, is to train a model on _all_ digitized data from
       | before the year 1905 (journals, letters, books, broadcasts,
       | lectures, the works), and then ask it for a formula for mass-
       | energy equivalence. A certain answer would definitely settle the
       | debate on whether pattern recognition is a form of intelligence
       | ;)
        
         | newjersey wrote:
         | There is a reason why they won't do it. They are selling a
         | narrative. There is a lot of money to be made here with this
         | narrative and proving that artificial intelligence is NOT
         | intelligent won't help sell that narrative.
        
           | ben_w wrote:
           | The goal is to make it intelligent, by which OpenAI in
           | particular explicitly mean "economically useful", not simply
           | to be shiny.
           | 
           | Passing tests is well known to be much easier than having
           | deep understanding, even in humans. They openly ask for tests
           | like this, not that they could possibly prevent them if they
           | wanted to.
           | 
           | There's scammers trying what you say of course, and I'm sure
           | we've all seen some management initiatives or job
           | advertisements for some like that, but I don't get that
           | impression from OpenAI or Anthropic, definitely not from
           | Apple or Facebook (LeCun in particular seems to deny models
           | will ever do what they actually do a few months later).
           | Overstated claims from Microsoft perhaps (I'm unimpressed
           | with the Phi models I can run locally, GitHub's copilot has a
           | reputation problem but I've not tried it myself), and Musk
           | definitely (I have yet to see someone who takes Musk at face
           | value about Optimus).
        
             | _heimdall wrote:
             | > The goal is to make it intelligent, by which OpenAI in
             | particular explicitly mean "economically useful", not
             | simply to be shiny
             | 
             | I never understood why this definition isn't a huge red
             | flag for most people. The idea of boiling what intelligence
             | is down to economic value is terrible, and inaccurate, in
             | my opinion.
        
               | ben_w wrote:
               | Everyone has a very different idea of what the word
               | "intelligence" means; this definition has got the
               | advantage that, unlike when various different AI became
               | superhuman at arithmetic, symbolic logic, chess,
               | jeopardy, go, poker, number of languages it could
               | communicate in fluently, etc., it's tied to tasks people
               | will continuously pay literally tens of trillions of
               | dollars each year for because they want those tasks done.
        
               | zaroth wrote:
               | Maybe by the time it's doing a trillion dollars a year of
               | useful work (less than 10 years out) people will call it
               | intelligent... but still probably not.
        
               | _heimdall wrote:
               | This definition alone might be fine enough if the word
               | "intelligence" wasn't already widely used outside of AI
               | research. It is though, and the idea that intelligence is
               | measured solely through economic value is a very, very
               | strange approach.
               | 
               | Try applying that definition to humans and you pretty
               | quickly run into issues, both moral and practical. It
               | also invalidates basically anything we've done over
               | centuries considering what intelligence is and how to
               | measure it.
               | 
               | I don't see any problem at all using economic value as a
               | metric for LLMs or possible AIs, it just needs a
               | different term than intelligence. It pretty clearly feels
               | like for-profit businesses shoehorning potentially
               | valuable ML tools into science fiction AI.
        
             | s1mplicissimus wrote:
             | I haven't seen "intelligent" used as "economically useful"
             | _anywhere_ outside the AI hype bubble. The most charitable
             | interpretation I can think of is lack of understanding of
             | the common usage of the word, the most realistic one is
             | intentionally muddying terminology so one cannot be called
             | a liar. Are LLMs helpful tools for some tasks like rough
             | translations, voice2text etc? Sure. Does it resemble what
             | humans call intelligence? I have yet to see an example
             | of that. The suggested experiment is a great idea and would
             | sway my opinion drastically (given all the training data,
             | model config, prompts & answers are public and reproducible
             | of course, we don't want any chance of marketing BS to
             | taint the results, do we). I'll be honest though, I'm not
             | going to hold my breath for that experiment to succeed with
             | the LLM technology...
             | 
             | edit: lol downvoted for calling out shilling i guess
        
           | numpad0 wrote:
           | They don't have to do it themselves. The super-GPU cluster
           | used to train GPT-6 will eventually shrink down to garage
           | size, and eventually some YouTuber will do it.
        
         | amelius wrote:
         | This is how patent disputes should be decided. If an LLM can
         | figure it out, then it is not novel.
        
           | bushbaba wrote:
           | And what prompt would you give that does have novel input?
        
             | neom wrote:
             | If it were me, I would start by giving a collection of LLMs
             | the patent, ask half "why is this patent novel" and half
             | "why is this patent not novel" and see what happens. I use
             | this method of "debugging" my thinking (not code), might be
             | a starting point here? Not sure.
        
               | amelius wrote:
               | Every patent application contains a section of claims.
               | You can just ask the LLM to come up with ways to satisfy
               | those claims.
               | 
               | But I'm sure there are lots of ways to go about it.
        
               | dahart wrote:
               | LLMs are already good at summarizing the claims - patents
               | all explain why they're novel - so it would be a waste to
               | ask them, especially if you reserve half the LLMs in your
               | set for this question. Asking why a patent is not novel
               | is a great question, but the problem with asking why they
               | are not novel is it has to know all other patents
               | (including very recently filed patents) and it has to be
               | correct, which LLMs are not at all good at yet (plus they
               | still tend to hallucinate confidently). This is a great
               | test for LLM accuracy if you know the right answer
               | already, and not a good test for patent validity.
        
           | bdowling wrote:
           | Novelty (is it new) is the easy question because it's just
           | checking a database. Patentable inventions also have to be
           | non-obvious, which is a more subtle question.
        
           | davidclark wrote:
           | I know it's just a spicy take on a forum, but this sounds
           | like a terrible public policy.
        
         | pixelsort wrote:
         | This reminds me of a similar idea I recently heard in a
         | podcast with Adam Brown. I'm unsure whether it is his
         | original notion.
         | The idea being, that if we can create AI that can derive
         | special relativity (1905) from pre-Einstein books and papers
         | then we have reached the next game-changing milestone in the
         | advancement of artificial reasoning.
        
           | FergusArgyll wrote:
           | Great podcast, especially the part about hitchhiking :)
           | 
           | https://www.youtube.com/watch?v=XhB3qH_TFds
           | 
           | Or RSS
           | 
           | https://api.substack.com/feed/podcast/69345.rss
        
           | wim wrote:
           | Right, hadn't listened to that one, thanks for the tip!
        
         | saagarjha wrote:
         | Finally, a true application of E=mc^2+AI
        
         | fny wrote:
         | But is there even enough pre-1905 data to create models that
         | say hello world reliably?
         | 
         | The terabytes of training data required for decent LLMs does
         | not exist. I'd guess there may only be gigabytes worth.
        
           | neom wrote:
           | My wife is an 18th century American history professor. LLMs
           | have very very clearly not been trained on 18th century
           | English, they cannot really read it well, and they don't
           | understand much from that period outside of very textbook
           | stuff, anything nuanced or niche is totally missing. I've
           | tried for over a year now, regularly, to help her use LLMs in
           | her research, but as she very amusingly often says "your
           | computers are useless at my work!!!!"
        
             | whimsicalism wrote:
             | my wish for new years is that every time people make a
             | comment like this they would share an example task
        
               | neom wrote:
               | https://s.h4x.club/bLuNed45 - it's more crazy to me that
               | my wife CAN in fact read this stuff easily, vs the fact
               | that an LLM can't.
               | 
               | (for anyone who doesn't feel like downloading the zip,
               | here is a single image from the zip:
               | https://s.h4x.club/nOu485qx)
        
               | whimsicalism wrote:
               | have you been trying to provide it as an image directly?
               | if so, doesn't surprise me at all.
               | 
               | really thanks for sharing!
        
               | neom wrote:
               | My wife's particular area of research is using the
               | capitalist system to "re-build" broken slave family
               | trees. She flies around the US going to archives and
               | getting contracts and receipts for slaves, figures out
               | how they got traded, figures out where they ended up,
               | and then "re-links" them to their family to the best
               | of her ability. Although her area of research isn't
               | particularly overflowing with researchers, there are
               | still a lot of people like her who just share this
               | very tacit knowledge among each other; they email
               | around a lot and stuff, knowledge like who was running
               | a region during a period. Of course they publish, but
               | it's a small field and it's all extremely poorly
               | documented. I was watching the Adam Brown interview
               | with Dwarkesh Patel the other day and he said that for
               | his work LLMs are better than bothering an expert in
               | his field with a question; I'm not sure people in her
               | field can do this as readily. Frankly, I've yet to
               | find a novel or good use for an LLM in her work. I
               | often joke that she and "her people" are going to be
               | the last ones with jobs if they don't transfer their
               | knowledge into LLMs, ha! :)
        
               | umeshunni wrote:
               | Super interesting in that:
               | 
               | 1. In theory these kinds of connections should be
               | something that LLMs are great at doing.
               | 
               | 2. It appears that LLMs are not trained (yet?) on
               | cursive and other non-print text.
        
               | neom wrote:
               | Yes, I regularly encourage my wife to approach the
               | comp sci department at her uni about doing a project
               | together, but for whatever reason she doesn't think
               | they would be interested, and I've yet to get her
               | interested enough to grasp what a transformer can do.
               | I find it very frustrating because of your first
               | point: she very specifically could do some meaningful
               | percentage more research if the LLMs could help with
               | the connections. Sadly, I am not rich, handsome or
               | talented enough to do this for her.
        
               | mikeruiz wrote:
               | " From what the text shows, Henry Jenkins and his wife
               | Caroline (the boy's mother) are asking the Orphans Court
               | to void an apprenticeship arrangement involving her minor
               | son, James Timmons. They claim James--about 15 years old
               | --was bound out as an apprentice without proper authority
               | or the mother's consent, and they cite Maryland law (an
               | act from 1793 and its supplements) which they believe was
               | not followed. They request the court declare that the
               | indenture is invalid and restore James to his mother's
               | care."
               | 
               | No idea if that's correct (and no doubt not useful to an
               | expert able to read this directly, but curious if it's
               | close?)
        
         | lupire wrote:
         | The best human performance on that task required many many
         | hours of private work given that input.
         | 
         | How much would ChatGPT charge for that much reasoning? Isn't
         | cost quadratic in short-term working memory?
         | 
         | It would be more interesting to prompt it with X% of a new
         | paper's logical argument, and see if it can predict the rest.
        
         | redman25 wrote:
         | Why does AI have to be smarter than the collective of
         | humanity in order to be considered intelligent? It seems like
         | we keep raising the bar on what intelligence means ¯\_(ツ)_/¯
        
           | willis936 wrote:
           | A machine that synthesizes all human knowledge really ought
           | to know more than an individual in terms of intellect. An
           | entity with all of human intellect prior to 1905 does not
           | need to be as intelligent as a human to make discoveries that
           | mere humans with limited intellect made. Why lower the bar?
        
             | ninetyninenine wrote:
             | The heightening of the bar is an attempt to deny that
             | milestones were surpassed and to claim that LLMs are not
             | intelligent.
             | 
             | We had a threshold for intelligence. An LLM blew past it
             | and people refuse to believe that we passed a critical
             | milestone in creating AI. Everyone still thinks all an LLM
             | does is regurgitate things.
             | 
             | But a technical threshold for intelligence cannot have any
             | leeway for what people want to believe. They don't want to
             | define an LLM as intelligent even if it meets the Turing
             | test technical definition of intelligence so they change
             | the technical definition.
             | 
             | And then they keep doing this without realizing it,
             | trivializing it each time. I believe humanity will
             | develop an entity smarter than humans, but it will not
             | be considered an AGI, because people keep unconsciously
             | moving the goalposts and changing definitions without
             | realizing it.
        
               | ImPostingOnHN wrote:
               | Since we know an LLM does indeed simply regurgitate data,
               | having it pass a "test for intelligence" simply means
               | that either the test didn't actually test intelligence,
               | or that intelligence can be defined as simply
               | regurgitating data.
        
               | greentxt wrote:
               | Intelligence is debatable without even bringing ai into
               | it. Nobody agrees on whether humans have intelligence.
               | Well, smart people agree but those people also agree we
               | have or will soon have agi or something negligibly
               | different from it.
        
               | ImPostingOnHN wrote:
               | _> Intelligence is debatable without even bringing ai
               | into it. Nobody agrees on whether humans have
               | intelligence._
               | 
               | Yep, that constitutes the second of the two options I
               | mentioned.
               | 
               |  _> Well, smart people agree but those people also agree
               | we have or will soon have agi or something negligibly
               | different from it._
               | 
               | lol, the ol' _" I know what all smart people think and
               | it's what I think"_ appeal.
        
               | klabb3 wrote:
               | Disagree. The AI we have is very useful for specific
               | things. The pushback you see is not so much denying the
               | milestones that have been surpassed, but rather the
               | milestones that enthusiasts claim are near. And for good
               | reason! Every time and in every field we've extrapolated
               | an exponential-looking curve ad infinitum, it's turned
               | out to be S-shaped, and life goes on.
               | 
               | > We had a threshold for intelligence.
               | 
               | We've had many. Computers have surpassed several barriers
               | considered to require intelligence such as arithmetic,
               | guided search like chess computers, etc etc. the Turing
               | test was a good benchmark because of how foreign and
               | strange it was. It's somewhat true we're moving the
               | goalposts. But the reason is not stubbornness, but rather
               | that we can't properly define and subcategorize what
               | reason and intelligence really is. The difficulty to
               | measure something does not mean it doesn't exist or isn't
               | important.
               | 
               | Feel free to call it intelligence. But the limitations
               | are staggering, given the advantages LLMs have over
               | humans. They have been trained on _all_ written knowledge
               | that no human could ever come close to. And they still
               | have not come up with anything conceptually novel, such
               | as a new idea or theorem that is genuinely useful. Many
               | people suspect that pattern matching is not the only
               | thing required for intelligent independent thought.
               | _Whatever that is!_
        
               | redman25 wrote:
               | If you consider that evolution has taken millions of
               | years to produce intelligent humans--that LLM training
               | completed in a matter of months can produce parrots of
               | humans is impressive by itself. Talking with the parrot
               | is almost indistinguishable from talking with a real
               | human.
               | 
               | As far as pattern matching, the difference I see from
               | humans is consciousness. That's probably the main area
               | yet to be solved. All of our current models are static.
               | 
               | Some ideas for where that might be headed:
               | 
               | - Maybe all it takes is to allow an LLM to continuously
               | talk with itself much like how humans have "the milk
               | man's voice".
               | 
               | - Maybe we might need to allow LLMs to update their own
               | weights but that would also require an "objective" which
               | might be hard to encode.
        
               | Dylan16807 wrote:
               | > If you consider that evolution has taken millions of
               | years to produce intelligent humans--that LLM training
               | completed in a matter of months can produce parrots of
               | humans is impressive by itself.
               | 
               | I disagree that such a comparison is useful. Training
               | should be compared to training, and LLM training feeds in
               | _so_ many more words than a baby gets. (A baby has other
               | senses but it's not like feeding in 20 years of video
               | footage is going to make an LLM more competent.)
        
             | 343rwerfd wrote:
             | "Why lower the bar?"
             | 
             | Because of the chance of misundertanding. Failing at
             | acknowledging artificial general intelligence standing
             | right next to us.
             | 
             | An incredible risk to take in alignment.
             | 
             | Perfect memory doesn't equal to perfect knowledge, nor
             | perfect understanding of everything you can know. In fact,
             | a human can be "intelligent" with some of his own memories
             | and/or knowledge, and - more commmonly - a complete "fool"
             | with most of the rest of his internal memories.
             | 
             | That said, is not a bit less generally intelligent for
             | that.
             | 
             | Supose it exists a human with unlimited memory, it retains
             | every information touching any sense. At some point, he/she
             | will probably understand LOTs of stuff, but it's simple to
             | demonstrate he/she can't be actually proficient in
             | everything: you have read how do an eye repairment surgery,
             | but have not received/experimented the training,hence you
             | could have shaky hands, and you won't be able to apply the
             | precise know-how about the surgery, even if you remember a
             | step-by-step procedure, even knowing all possible
             | alternatives in different/changing scenarios during the
             | surgery, you simply can't hold well the tools to go
             | anywhere close to success.
             | 
             | But you still would be generally intelligent. Way more than
             | most humans with normal memory.
             | 
             | If we'd have TODAY an AI with the same parameters as the
             | human with perfect memory, it will be most certainly
             | closely examined and determined to be not a general
             | artificial intelligence.
        
               | Jensson wrote:
               | > If we had TODAY an AI with the same parameters as
               | the human with perfect memory, it would most certainly
               | be closely examined and determined not to be an
               | artificial general intelligence.
               | 
               | The human could learn to master a task; current AI
               | can't. That is very different: the AI doesn't learn or
               | remember stuff, it is stateless.
               | 
               | When I can take an AI and get it to do any job on its
               | own without any intervention after some training, then
               | that is AGI. The person you mentioned would pass that
               | easily. Current-day AI isn't even close.
        
         | ZooCow wrote:
         | I had a similar thought but about asking the LLM to predict
         | "future" major historical events. How much prompting would it
         | take to predict wars, etc.?
        
           | djeastm wrote:
           | You mean train on pre-1939 data and predict how WWII would
           | go?
        
             | ZooCow wrote:
             | Right. If it were trained through August 1939, how much
             | prompting would be necessary to get it to predict aspects
             | of WWII.
        
               | MoreMoore wrote:
               | Man, that would be a fascinating experiment. Would it be
               | able to predict who wins and when? Would it be able to
               | predict the Cold War?
        
               | sitkack wrote:
               | But we know Hitler has a Time Machine that goes forward,
               | he doesn't need to return to use that knowledge as he
               | already has a timeline here to use. Definitely risks
               | involved here.
        
           | david-gpu wrote:
           | That will never work on any complex system that behaves
           | chaotically, such as the weather or complex human endeavors.
           | Tiny uncertainties in the initial conditions rapidly turn
           | into large uncertainties in the outcomes.
        
             | morbicer wrote:
             | Not an LLM but models could get pretty good at weather
             | 
             | https://www.technologyreview.com/2024/12/04/1107892/google-
             | d...
        
               | bmacho wrote:
               | No, they don't, since the weather is chaotic.
               | 
               | I mean, there are theorems about how close you can
               | get, and models can't do better than what is
               | theoretically possible.
        
               | david-gpu wrote:
               | Yeah, I wish more people understood that it is simply not
               | possible to make precise long-term forecasts of chaotic
               | systems. Whether it is weather, financial markets, etc.
               | 
               | It is not that we don't know yet because our models are
               | inadequate, it's that it is unknowable.
        
               | wodderam wrote:
               | The problem is we stupidly branded the field "chaos
               | theory" and made it sound like bullshit so the ideas of
               | non-linear dynamics have largely been lost on several
               | generations at this point.
               | 
               | Not just chaos theory but "chaos theory" + psychedelic
               | fractal artwork. Then the popular James Gleick book,
               | "Chaos: making a new science" just sounds like complete
               | bullshit and it sold a ton of copies.
               | 
               | I only started studying non-linear dynamics in about 2015
               | after first running across it in the late 90s but I
               | literally thought it was all pseudoscience then.
               | 
               | Between "chaos theory", fractals and a best selling book
               | it would be hard to frame a new scientific field as
               | pseudoscience more than what played out.
        
         | amluto wrote:
         | > ask it for a formula for mass-energy equivalence
         | 
         | Way too easy. If you think that mass and energy might be
         | equivalent, then dimensional analysis doesn't give you too much
         | choice in the formula. Really, the interesting thing about
         | E=mc^2 isn't the formula but the assertion that mass is a form
         | of energy and all the surrounding observations about the
         | universe.
         | 
         | Also, the actual insight in 1905 was more about asking the
         | right questions and imagining that the equivalence principle
         | could really hold, etc. A bunch of the math predates 1905 and
         | would be there in an AI's training set:
         | 
         | https://en.m.wikipedia.org/wiki/History_of_Lorentz_transform...
        
           | whimsicalism wrote:
           | but e=mc^2 is just an approximation
           | 
           | e: nice, downvoted for knowing special relativity
        
             | amluto wrote:
             | Can you elaborate? How is E=mc^2 an approximation, in
             | special relativity or otherwise? What is it an
             | approximation of?
        
               | whimsicalism wrote:
               | E^2 = m^2 + p^2 where p is momentum and i've dropped unit
               | adjustment factors like c
               | 
               | this allows light to have energy even if it's massless
        
               | ac29 wrote:
               | e=mc^2 is only correct for objects at rest. The full
               | equation takes into account velocity, but for "low"
               | speeds where v<<c, the momentum term is close enough
               | to zero that E=mc^2 is still a good approximation.
        
             | gus_massa wrote:
             | I didn't downvote it, but short comments are a very big
             | risk. People may misinterpret it, or think it's crackpot
             | theory or a joke and then downvote.
             | 
             | When in doubt, add more info, like:
             | 
             | But the complete equation is E=sqrt(m^2c^4+p^2c^2),
             | which reduces to E=mc^2 when the momentum p is 0. More
             | info in
             | https://en.wikipedia.org/wiki/Mass%E2%80%93energy_equivalenc...
        
               | tame3902 wrote:
               | What I learnt is that there is a rest mass and a
               | relativistic mass. The m in your formula is the rest
               | mass. But when you use the relativistic mass, E=mc^2 still
               | holds. And for the rest mass I always used m_0 to make
               | clear what it is.
        
               | whimsicalism wrote:
               | sounds like you had a chemistry education. relativistic
               | mass is IMO very much not a useful way of thinking about
               | this and it is sort of tautologically true that E =
               | m_relativistic because "relativistic mass" is just taking
               | the concept of energy and renaming it "mass"
        
               | amluto wrote:
               | This is all sort of silly IMO. The equation, like
               | basically all equations, needs context. What's E? What's
               | m? If E is the total energy of the system and m is the
               | mass (inertial or gravitational? how far past 1905 do you
               | want to go?), then there isn't a correction. If m is rest
               | mass and E is total energy, then I would call it flat-out
               | wrong, not merely approximate. After all, a decent theory
               | really ought to reproduce Newtonian mechanics under some
               | conditions beyond completely at rest.
               | 
               | IMO, when people get excited about E=mc^2, it's in
               | contexts like noticing that atoms have rest masses that
               | are generally somewhat below the mass of a proton or
               | neutron times the number of protons and neutrons in the
               | atom, and that the mass difference _is_ the binding
               | energy of the nucleus, and you can do nuclear reactions
               | and convert between mass and energy! And then E=mc^2 is
               | apparently exactly true, or at least true to an excellent
               | degree, even though the energies involved are extremely
               | large and Newtonian mechanics can't even come close to
               | accounting for what's going on.
        
               | whimsicalism wrote:
               | inertial mass, rest mass, gravitational mass - these are
               | essentially all the same thing. "relativistic mass" is an
               | additional concept where we rewrite energy as mass and is
               | considered archaic
        
               | ackfoobar wrote:
               | The next section of the wikipedia link discusses the low
               | speed approximation, where sqrt(m^2c^4+(pc)^2) ≈ mc^2
               | + 1/2 mv^2.
               | 
               | Calling E=mc^2 an "approximation" is technically correct.
               | It's the 0th order approximation. That's just pointlessly
               | confusing. A better word choice would be "a special
               | case".
        
               | whimsicalism wrote:
               | i think we are venturing into pedantic territory - the
               | point of my comment is that the full derivation is a
               | little harder than just E=mc^2 dimensional analysis
        
               | mcnamaratw wrote:
               | Kind of agree. But pervasive downvoting by folks who
               | don't understand the subject is a form of forum rot. The
               | risk is only that we expose the rot. Not such a terrible
               | risk, because either the owners notice and fix the
               | problem, or the forum continues to rot. In the latter
               | case karma points won't be desirable in the long run.
        
               | bcoates wrote:
               | This is why RLHF causes those overly verbose answers to
               | simple questions, it's a fundamentally busted evaluation
               | function so you wind up optimizing for the wrong thing
        
             | make3 wrote:
             | it's a special case, not an approximation
        
               | whimsicalism wrote:
               | its not an either/or, it is both. regardless, my point is
               | that you cannot simply dimensional analysis your way to
               | special relativity or the energy-momentum relation
        
             | mitthrowaway2 wrote:
             | This thread has come up before(1), but I'll continue to
             | argue that relativistic mass is a perfectly valid concept
             | as much as any other, and if you disagree, you'll need
             | arguments more substantial than it just being unpopular
             | these days. Especially if you're trying to pedantically
             | argue people out of using a concept that they personally
             | find useful to aid their own understanding, just because it
             | doesn't fit your own mathematical or aesthetic preferences.
             | 
             | 1: https://news.ycombinator.com/item?id=38425252
        
           | tlb wrote:
           | It's nontrivial why it's mc^2 and not 1/2 mc^2, since kinetic
           | energy generally is 1/2 mv^2
        
         | wslh wrote:
         | Why do we need this when we can already test current models
         | on questions and answers about new discoveries, ones that are
         | happening every week and are often easier to grasp than
         | Einstein's equations? I think it is clear that they will fail
         | on most of them. That doesn't mean that LLMs are not useful,
         | but there are more walls in the road.
        
         | layer8 wrote:
         | Instead of asking for a formula, a better test may be to point
         | out all the seeming contradictions in physics at that time
         | (constancy of the speed of light, wave vs. particle nature of
         | light, ultraviolet catastrophe), and ask it how they could be
         | resolved.
        
       | chvid wrote:
       | Isn't this simply because the dataset used (Putnam-AXIOM
       | Original) is in the training data used to train the various
       | models?
       | 
       | Given that these are simple variations (variable names and
       | constant values changed in math problems), why wouldn't the
       | companies creating these models (OpenAI etc.) create such
       | variations themselves, to ensure that the model is learning how
       | to solve the problem rather than memorizing a solution? Seems
       | like a very obvious thing to do ...
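       | 
       | For what it's worth, generating such variations
       | programmatically is cheap. A toy sketch along those lines (the
       | template and value ranges are made up, not taken from the
       | paper):
       | 
       |     import random
       |     import string
       | 
       |     TEMPLATE = ("Find the sum of the first {n} terms of "
       |                 "the sequence {v}_k = {a}*k + {b}.")
       | 
       |     def make_variant(rng):
       |         return TEMPLATE.format(
       |             n=rng.randint(5, 50),
       |             v=rng.choice(string.ascii_lowercase),
       |             a=rng.randint(2, 9),
       |             b=rng.randint(1, 9),
       |         )
       | 
       |     rng = random.Random(0)
       |     for _ in range(3):
       |         print(make_variant(rng))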
        
         | lupire wrote:
         | They are not only simple renames; LLMs are good at those.
         | They are minor structural changes.
        
       | ben_w wrote:
       | Link title says "slightly", but the PDF describes two different
       | kinds of variations: variable names (slight) and problem
       | constants (significant), and the 30% drop is on the combination
       | of 26 variable-name questions and 26 variable-plus-constant
       | questions.
       | 
       | It's good to have a better test (though I bet this one will also
       | be quickly saturated like all the others), but the title here
       | doesn't seem justified by the page title there or the content.
        
         | sundarurfriend wrote:
         | I would definitely classify both of those as slight changes. In
         | fact I'd rename those as slight => trivial and significant =>
         | slight.
        
           | zeroonetwothree wrote:
           | Right, renaming a variable should have zero effect on ability
           | to solve (it wouldn't for a human). Changing a constant
           | should be very minor, probably also ~0 effect in most cases.
           | I say this as someone that's done many of these problems.
        
       | lomkju wrote:
       | Even humans get confused by trick questions, right? Once they
       | understand it is a trick question, they no longer fall for it.
       | :)
        
       | steveBK123 wrote:
       | Yes so when you change the sequence of tokens they've
       | electronically memorized, they get a bit worse at predicting the
       | next token?
        
         | zeroonetwothree wrote:
         | When you put it that way it's a trivial result. However, the
         | consequences of using AI to replace humans on tasks are
         | significant.
        
           | steveBK123 wrote:
           | The only people super pumping the idea of mass replacement of
           | human labor are financially invested in that outcome.
        
       | aucisson_masque wrote:
       | It drops from 50 to 33.96. Still the best: o1 on the variations
       | is around 2 times better than Claude on the original test.
       | 
       | The rest of the LLMs are far behind, in the single digits.
       | 
       | It makes me wonder if o1 is finally getting intelligent? LLMs
       | are not supposed to understand these problems when you change
       | variables and values; they have to rely on preexisting data of
       | absolutely identical solved problems to give a correct answer.
       | 
       | I haven't followed LLM development closely, but I heard once
       | that ChatGPT is now composed of multiple LLMs, and maybe they
       | added multiple specialized models for, say, problem solving or
       | trigonometry.
       | 
       | That would explain why it's so much better.
        
       | e1g wrote:
       | The paper includes several examples of their modified questions.
       | There has been a substantial jump from o1-preview to o1, so I
       | gave several samples to o1 and o1-pro ( _not_ o1-preview), and
       | current o1s gave the correct answer to those modified problems.
       | SOTA changes fast.
        
         | suddenlybananas wrote:
         | LLM boosters are so tiresome. You hardly did a rigorous
         | evaluation; the set has been public since October and could
         | easily have been added to the training data.
        
           | gdhkgdhkvff wrote:
           | Your points would be more convincing if you didn't preface
           | them with arrogant cynicism.
        
           | e1g wrote:
           | I'm not skilled enough in math to do a rigorous evaluation,
           | so it was a quick check.
           | 
           | Terence Tao _is_ skilled enough, and he describes o1's math
           | ability as "...roughly on par with a mediocre, but not
           | completely incompetent graduate student" (good discussion at
           | https://news.ycombinator.com/item?id=41540902), and the next
           | iteration o3 just got 25% on the brand new FrontierMath
           | test.
           | 
           | Seeing LLMs as useless is banal, but downplaying their rate
           | of improvement is self-sabotage.
        
             | fumeux_fume wrote:
             | > "...roughly on par with a mediocre, but not completely
             | incompetent graduate student"
             | 
             | Let it sink in how vague and almost meaningless that
             | statement is.
        
               | pizza wrote:
               | What types of questions are you hoping to answer for that
               | to be considered a vague statement?
        
         | jtefera wrote:
         | The paper mentions that on several occasions the LLM will
         | provide a correct answer but will either take big jumps without
         | justifying them or will take illogical steps but end up with
         | the right solution at the end. Did you check for that?
        
           | e1g wrote:
           | No, I don't know enough math to test the logic, only to
           | check questions against their expected answers in
           | https://anonymous.4open.science/r/putnam-axiom-B57C/data/Put...
        
             | zeroonetwothree wrote:
             | Putnam problems need to actually be graded, often the
             | answer itself is trivial.
        
       | whimsicalism wrote:
       | So many negative comments as if o3 didn't get _25% on
       | frontiermath_ - which is absolutely nuts.
       | 
       | Sure, LLMs will perform better if the answer to a problem is
       | directly in their training set. But that doesn't mean they
       | perform _badly_ when the answer isn't in their training set.
        
         | optimalsolver wrote:
         | EpochAI have to send the questions (but not the answer key) to
         | OpenAI in order to score the models.
         | 
         | An overnight 2% -> 25% jump on this benchmark is a bit curious.
        
           | whimsicalism wrote:
           | 1. OpenAI said they did not train on these problems & they
           | don't train on API calls in general, that is a legal policy.
           | 
           | 2. It was a new major model release from work over the course
           | of months - struggle to see that as an 'overnight' jump in
           | any real sense.
           | 
           | 3. Why is it easier to believe large scale corporate fraud
           | than that the stated capabilities on a held out test set are
           | real? Reads like cope, if I'm being frank.
        
             | zeroonetwothree wrote:
             | I don't think it's "easier to believe" just that it raises
             | some red flags.
        
           | exitb wrote:
           | The 2% result belonged to a traditional LLM that costs cents
           | to run, while o3 is extremely expensive.
        
         | MattDaEskimo wrote:
         | Sure, it did well on frontiermath. That's not what this thread
         | is about.
         | 
         | Your comment isn't relevant at all
        
           | whimsicalism wrote:
           | this thread is about math LLM capability, it's a bit
           | ridiculous to say that mentioning frontiermath is off topic
           | but that's just me
        
             | MattDaEskimo wrote:
             | Just because you can generalize the topic doesn't mean you
             | can ignore the specific conversation and choose your hill
             | to argue.
             | 
             | Additionally, the conversation in this topic is about
             | the model's ability to generalize and its potential
             | overfitting, which is arguably more important than
             | parroting mathematics.
        
               | whimsicalism wrote:
               | performance on a held-out set (like frontiermath)
               | compared to putnam (which is not held out) is obviously
               | relevant to a model's potential overfitting.
               | 
               | i'm not going to keep replying, others can judge whether
               | they think what i'm saying is "relevant at all."
        
               | MattDaEskimo wrote:
               | Again, you set your own goal posts and failed to add any
               | insights.
               | 
               | The topic here isn't "o-series sucks", it's addressing a
               | found concern.
        
       | lupire wrote:
       | The researcher's answer to their variant of "Year: 2016 ID: A1"
       | in the appendix is wrong.
       | 
       | The solution (sum of 1,2,5,6,9,10,13,14, ...) has an alternating
       | pattern, so it has to be two piecewise interleaved polynomials,
       | which cannot be expressed as a single polynomial.
       | 
       | Their answer works for k=1,2, but not k=3.
       | 
       | https://openreview.net/pdf?id=YXnwlZe0yf
       | 
       | This does not give me confidence in the results of their paper.
        
         | Chinjut wrote:
         | You are correct. Their answer is instead the sum of the first k
         | terms of 1, 2, 6, 10, 14, 18, ..., for positive k.
        
         | zeroonetwothree wrote:
         | Very astute. Did you communicate this to the authors?
        
         | pfedak wrote:
         | You're misreading the solution: the first part reads n=1, a
         | trivial special case, not n congruent to 1 mod 4.
         | 
         | The statement doesn't hold for e.g. n=5. Taking m=2 gives the
         | permutation (1 2 4 3), which is odd, and thus cannot have a
         | square root.
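         | 
         | A quick brute-force sanity check of that parity argument
         | (mine, not from the paper): no permutation of {1,2,3,4}
         | squares to x -> 2x (mod 5), i.e. to the 4-cycle (1 2 4 3),
         | since squares of permutations are always even.
         | 
         |     from itertools import permutations
         | 
         |     # target: 1->2, 2->4, 3->1, 4->3
         |     target = {x: (2 * x) % 5 for x in range(1, 5)}
         |     # p maps i -> p[i-1]; check whether any p o p == target
         |     found = any(
         |         all(p[p[x - 1] - 1] == target[x]
         |             for x in range(1, 5))
         |         for p in permutations(range(1, 5))
         |     )
         |     print(found)  # prints False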
        
       | Topfi wrote:
       | I wouldn't be surprised if something similar were found for the
       | ARC challenge, which is why I still maintain my own private LLM
       | challenges to gauge current capabilities. Of course, I have few
       | illusions that these are fully private, but it is better than
       | fully public tests.
       | 
       | Even the most straightforward, logical, easily reasoned ones
       | stump all the LLMs I have access to, which is why I am so
       | skeptical about emergence, reasoning and all this hype around
       | "AGI"...
        
       | scotty79 wrote:
       | I think that lamentations about real-world data running out are
       | misplaced. We can multiply data with slight variations, which
       | might lead to better resilience and more accurate model
       | responses to novel problems.
        
       | jerf wrote:
       | I remember when this stuff was all coming out and people were
       | finally excited about ChatGPT getting the problem with "which is
       | heavier, a 10 pound bag of feathers or a 10 pound bag of bricks?"
       | problem correct. But of course it got it correct. It was in the
       | training set. Vary the problem slightly by just changing the
       | nouns, or changing the numbers so that one in fact was heavier
       | than the other, and performance went all over the map.
       | 
       | I just went to chatgpt.com and put into the chat box "Which is
       | heavier, a 9.99-pound back of steel ingots or a 10.01 bag of
       | fluffy cotton?", and the very first answer I got (that is, I
       | didn't go fishing here) was
       | 
       |     The 9.99-pound bag of steel ingots is heavier than the
       |     10.01-pound bag of fluffy cotton by a small margin.
       |     Although the cotton may appear larger due to its fluffy
       |     nature, the steel ingots are denser and the weight of the
       |     steel bag is 9.99 pounds compared to the 10.01 pounds of
       |     cotton. So, the fluffy cotton weighs just a tiny bit more
       |     than the steel ingots.
       | 
       | Which, despite getting it both right and wrong, must still be
       | graded as a "fail".
       | 
       | If you want to analyze these things for their true capability,
       | you _need_ to make sure you're out of the training set... and most
       | of the things that leap to your mind in 5 seconds are leaping to
       | your mind precisely because they are either something you've seen
       | quite often or something that you can easily think of and
       | therefore many other people have easily thought of them as well.
       | Get off the beaten path a bit and the math gets much less
       | impressive.
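       | 
       | One cheap way to get off the beaten path is to generate the
       | comparison questions randomly, so the correct answer is known
       | but the surface form is unlikely to be in the training set. A
       | toy sketch (the details are mine, just to illustrate):
       | 
       |     import random
       | 
       |     STUFF = ["steel ingots", "fluffy cotton", "bricks",
       |              "feathers", "lint", "gold coins"]
       | 
       |     def make_question(rng):
       |         a, b = rng.sample(STUFF, 2)
       |         wa = round(rng.uniform(5, 15), 2)
       |         wb = round(rng.uniform(5, 15), 2)
       |         heavier = a if wa > wb else b
       |         q = (f"Which is heavier, a {wa}-pound bag of {a} "
       |              f"or a {wb}-pound bag of {b}?")
       |         return q, heavier   # question plus expected answer
       | 
       |     print(make_question(random.Random(1)))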
        
         | whimsicalism wrote:
         | https://chatgpt.com/share/67756897-8974-8010-a0e0-c9e3b3e91f...
         | 
         | so far o1-mini has bodied every task people are saying LLMs
         | can't do in this thread
        
           | jerf wrote:
           | That appears to be the same model I used. This is why I
           | emphasized I didn't "go shopping" for a result. That was the
           | first result I got.
           | 
           | I'm not at all surprised that it will nondeterministically
           | get it correct sometimes. But if it doesn't get it correct
           | every time, it doesn't "know".
           | 
           | (In fact "going shopping" for errors would still even be
           | fair. It should be correct all the time if it "knows". But it
           | would be different if I was fishing over and over and over
           | and finally got one, versus the first time I asked.)
           | 
           | Edit: It appears it isn't the model I used. The point holds,
           | though, you need to make sure you're off the training set for
           | it to matter. This isn't a "ChatGPT can't do that" post as
           | some are saying, it's more a "you aren't asking what you
           | think you're asking" post.
           | 
           | You get the same problem in a human context in things like
           | code interviews. If you ask an interviewee the exact question
           | "how do you traverse a binary tree in a depth-first manner",
           | you aren't really learning much about the interviewee. It's a
           | bad interview question. You need to get at least a bit off
           | the beaten trail to do any sort of real analysis.
        
             | whimsicalism wrote:
             | you sure? i just asked o1-mini ( _not 4o mini_ ) 5 times in
             | a row (new chats obviously) and it got it right every time
             | 
             | perhaps you stumbled on a rarer case, but reading the
             | logs you posted this sounds more like a 4o model than an
             | o1, because it's doing its thinking in the chat itself;
             | plus, the procedure you described would probably get you
             | 4o-mini
        
               | JumpCrisscross wrote:
               | > _just asked o1-mini (not 4o mini) 5 times in a row (new
               | chats obviously) and it got it right every time_
               | 
               | Could you try playing with the exact numbers and/or
               | substances?
        
               | whimsicalism wrote:
               | give me a query and i'll ask it, but also i don't want to
               | burn through all of my o1mini allocation and have to use
               | the pay-as-you-go API.
        
               | JumpCrisscross wrote:
               | > _give me a query and i'll ask it_
               | 
               | Which is heavier: an 11kg bag of lint or a 20lb bag of
               | gold?
        
               | whimsicalism wrote:
               | yeah it gets it
               | 
               | https://chatgpt.com/share/67757720-3c7c-8010-a3e9-ce66fb9
               | f17...
               | 
               | e: cool, this gets downvoted
        
               | blharr wrote:
               | It got it right, but an interesting result that it
               | rambled on about monetary value for... no reason.
               | 
               | > While the lint bag is heavier in terms of weight, it's
               | worth mentioning that gold is significantly more valuable
               | per pound compared to lint. This means that even though
               | the lint bag weighs more, the gold bag holds much greater
               | monetary value.
        
               | JumpCrisscross wrote:
               | Legal said someone might sell a bag of gold for one of
               | lint without it.
        
               | drivebyhooting wrote:
               | > What is heavier a liter of bricks or a liter of
               | feathers?
               | 
               | >> A liter of bricks and a liter of feathers both weigh
               | the same--1 kilogram--since they each have a volume of 1
               | liter. However, bricks are much denser than feathers, so
               | the bricks will take up much less space compared to the
               | large volume of feathers needed to make up 1 liter. The
               | difference is in how compactly the materials are packed,
               | but in terms of weight, they are identical.
        
               | whimsicalism wrote:
               | https://chatgpt.com/share/677583a3-526c-8010-b9f9-9b2a337
               | 4da... o1-mini best-of-1
        
               | thaumasiotes wrote:
               | >> so far o1-mini has bodied every task people are saying
               | LLMs can't do in this thread
               | 
               | > give me a query and i'll ask it
               | 
               | Here's a query similar to one that I gave to Google
               | Gemini (version unknown), which failed miserably:
               | 
               | ---query---
               | 
               | Steeleye Span's version of the old broadsheet ballad "The
               | Victory" begins the final verse with these lines:
               | 
               |  _Here's success unto the Victory / and crew of noble
               | fame
               | 
               | and glory to the captain / bold Nelson was his name_
               | 
               | What does the singer mean by these lines?
               | 
               | ---end query---
               | 
               | Italicization is for the benefit of HN; I left that out
               | of my prompt.
        
               | whimsicalism wrote:
               | i'd prefer an easily verifiable question rather than one
               | where we can always go "no that's not what they really
               | meant" but someone else with o1-mini quota can respond
        
               | thaumasiotes wrote:
               | It's not a difficult or tricky question.
        
               | mikeruiz wrote:
               | "They're toasting Admiral Nelson's ship (HMS Victory) and
               | its valiant crew, hailing the ship's successes and
               | Nelson's heroism. In other words, the singer is offering
               | tribute--"success unto the Victory"--to the vessel and
               | its famed sailors, and "glory to the captain" who led
               | them, namely the celebrated Admiral Horatio Nelson."
               | 
               | ...but to your point, no idea if the artist intended some
               | more obscure reference.
               | 
               | o1-pro was also able to produce a relatively complete
               | version of the original source, though, amusingly, it
               | referred to it as a 'broadside' rather than 'broadsheet'.
               | Appropriate given the context!
        
               | ted_dunning wrote:
               | Hmm... Gemini (1.5 Flash) just aced that exact question
               | for me:
               | 
               | These lines celebrate the victory of the British ship HMS
               | Victory, led by the famous Admiral Lord Nelson, in the
               | Battle of Trafalgar in 1805.
               | 
               | "Here's success unto the Victory": This line directly
               | praises the ship itself, acknowledging its role in the
               | successful battle. "and crew of noble fame": This
               | recognizes the bravery and skill of the sailors who
               | served aboard the Victory. "and glory to the captain":
               | This line specifically honors Admiral Nelson, the captain
               | of the Victory, for his leadership and strategic
               | brilliance in the battle. "bold Nelson was his name":
               | This emphasizes Nelson's courage and daring, which were
               | legendary. The lines express admiration for the ship, its
               | crew, and most importantly, Admiral Nelson, who became a
               | national hero in Britain for his victory at Trafalgar.
        
               | 7thpower wrote:
               | May be unrelated, but I have been having a lot of issues
               | lately with ChatGPT letting me select a model (o1) and
               | silently switching to 4o.
               | 
               | This is coming off my TWO DAY cooldown on o1 usage, which
               | is frustrating.
        
             | deeviant wrote:
             | I don't believe that is the model that you used.
             | 
             | I wrote a script and pounded o1-mini and GPT-4 with a wide
             | variety of temperature and top_p parameters, and was unable
             | to get it to give the wrong answer a single time.
             | 
             | Just a whole bunch of:
             | 
             |     (openai-example-py3.12) <redacted>:~/code/openAiAPI$ python3 featherOrSteel.py
             |     Response 1: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
             |     Response 2: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
             |     Response 3: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
             |     Response 4: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
             |     Response 5: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
             |     Response 6: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
             |     Response 7: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
             |     Response 8: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
             |     Response 9: The 10.01-pound bag of fluffy cotton is heavier than the 9.99-pound bag of steel ingots.
             |     Response 10: A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound bag of steel ingots.
             |     All responses collected and saved to 'responses.txt'.
             | 
             | Script with one example set of params:
             | 
             |     import openai
             |     import time
             |     import random
             | 
             |     # Replace with your actual OpenAI API key
             |     openai.api_key = "your-api-key"
             | 
             |     # The question to be asked
             |     question = "Which is heavier, a 9.99-pound bag of steel ingots or a 10.01-pound bag of fluffy cotton?"
             | 
             |     # Number of times to ask the question
             |     num_requests = 10
             | 
             |     responses = []
             | 
             |     for i in range(num_requests):
             |         try:
             |             # Generate a unique context using a random number or
             |             # timestamp, this is to prevent prompt caching
             |             random_context = f"Request ID: {random.randint(1, 100000)} Timestamp: {time.time()}"
             | 
             |             # Call the Chat API with the random context added
             |             response = openai.ChatCompletion.create(
             |                 model="gpt-4o-2024-08-06",
             |                 messages=[
             |                     {"role": "system", "content": f"You are a creative and imaginative assistant. {random_context}"},
             |                     {"role": "user", "content": question}
             |                 ],
             |                 temperature=2.0,
             |                 top_p=0.5,
             |                 max_tokens=100,
             |                 frequency_penalty=0.0,
             |                 presence_penalty=0.0
             |             )
             | 
             |             # Extract and store the response text
             |             answer = response.choices[0].message["content"].strip()
             |             responses.append(answer)
             | 
             |             # Print progress
             |             print(f"Response {i+1}: {answer}")
             | 
             |             # Optional delay to avoid hitting rate limits
             |             time.sleep(1)
             |         except Exception as e:
             |             print(f"An error occurred on iteration {i+1}: {e}")
             | 
             |     # Save responses to a file for analysis
             |     with open("responses.txt", "w", encoding="utf-8") as file:
             |         file.write("\n".join(responses))
             | 
             |     print("All responses collected and saved to 'responses.txt'.")
        
               | zaroth wrote:
               | Downvoted for... too conclusively proving OP wrong?
        
               | gmueckl wrote:
               | Down voted for not actually countering the argument in
               | question? The script doesn't alter the phrasing of the
               | question itself. It just generates a randomized,
               | irrelevant preamble.
        
               | deeviant wrote:
               | Well, I understood the argument in question to be: was it
               | possible for the model to be fooled by this question, not
               | was it possible to prompt engineer it into failure.
               | 
               | The parameter space I was exploring, then, was the
               | different decoding parameters available during the
               | invocation of the model, with the thesis that if it were
               | possible for the model to generate an incorrect answer to
               | the question, I would be able to replicate it by tweaking
               | the decoding parameters to be more "loose" while
               | increasing the sample size. By jacking up temperature
               | while lowering top_p, we see the biggest variation in
               | responses, and if there were an incorrect response to be
               | found, I would have expected to see it in the few hundred
               | runs of my parameter search.
               | 
               | If you think you can fool it by slight variations on the
               | wording of the problem, I would encourage you to perform
               | a similar experiment as mine and prove me wrong =P
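               | 
               | Concretely, the sweep looked roughly like this (a sketch
               | of the grid rather than the exact script, using the same
               | pre-1.0 openai.ChatCompletion interface as above; the
               | "flag for review" check is deliberately crude):
               | 
               |     import itertools
               |     import openai
               | 
               |     openai.api_key = "your-api-key"
               | 
               |     QUESTION = ("Which is heavier, a 9.99-pound bag of steel ingots "
               |                 "or a 10.01-pound bag of fluffy cotton?")
               | 
               |     # Grid of decoding parameters, from conservative to very "loose".
               |     TEMPERATURES = [0.0, 0.7, 1.0, 1.5, 2.0]
               |     TOP_PS = [0.1, 0.5, 0.9, 1.0]
               |     SAMPLES_PER_SETTING = 10
               | 
               |     flagged = []
               |     for temperature, top_p in itertools.product(TEMPERATURES, TOP_PS):
               |         for _ in range(SAMPLES_PER_SETTING):
               |             reply = openai.ChatCompletion.create(
               |                 model="gpt-4o-2024-08-06",
               |                 messages=[{"role": "user", "content": QUESTION}],
               |                 temperature=temperature,
               |                 top_p=top_p,
               |                 max_tokens=100,
               |             ).choices[0].message["content"]
               |             # Crude filter: anything that doesn't mention the cotton
               |             # bag at all gets set aside for manual review.
               |             if "cotton" not in reply.lower():
               |                 flagged.append((temperature, top_p, reply))
               | 
               |     print(f"{len(flagged)} answers flagged out of "
               |           f"{len(TEMPERATURES) * len(TOP_PS) * SAMPLES_PER_SETTING}")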
        
             | mortehu wrote:
             | While this may be true, it's a very common problem that
             | people who want to demonstrate how bad a model is fail to
             | provide a direct link or simply state the name of the
             | model.
        
               | lukeschlather wrote:
               | I usually test models using the OpenAI API which doesn't
               | offer links the way I think you mean. If I provide some
               | output I got from a particular model you're just going to
               | have to take my word for it.
        
               | Jerrrry wrote:
               | They need to provide a small hash with the API result
               | that can be verified by others.
        
               | 4ad wrote:
               | You can use https://lluminous.chat (bring your own key)
               | to link to chats using any model across all LLMs.
        
               | whimsicalism wrote:
               | open router is the more standard solution
        
               | chongli wrote:
               | OpenAI is not doing us any favours by using confusing
               | naming schemes for their models and obscuring which
               | models people are actually working with.
               | 
               | If I didn't know any better, I'd say OpenAI doesn't want
               | us doing these tests accurately and is trying to hide
               | something.
        
               | whimsicalism wrote:
               | it's extremely easy to see which model you are using.
               | one's own... difficulties understanding are not a
               | conspiracy by OpenAI
        
               | chongli wrote:
               | It does not show the model version anywhere on the page
               | on chatgpt.com, even when logged in.
        
               | qup wrote:
               | Yes it does, at the top of every chat there is a drop-
               | down to select the model, which displays the current
               | model. It's been a constant part of the UI since forever.
        
               | chongli wrote:
               | No, it only says "ChatGPT Plus (Upgrade)" or "ChatGPT".
               | 
               | Maybe it's different if you have a paid account?
        
               | whimsicalism wrote:
               | if i go to chatgpt.com on my phone not logged on at all
               | it tells me very prominently at the top that i am using
               | 4o mini
        
               | bcrosby95 wrote:
               | Logged in, non paid account, on a desktop, for me, it's
               | exactly as the person you're replying to has stated.
               | 
               | If I log out, it shows 4o mini, and when I try to change
               | it, it asks me to login or sign in rather than giving me
               | any options.
               | 
               | When I use ChatGPT enough while logged in, it gives me
               | some nebulous "you've used all your xyz tokens for the
               | day". But other than that there is no real signal to me
               | that I'm getting a degraded experience.
               | 
               | It's really just confusing as hell.
        
               | blharr wrote:
               | Someone else in this thread said,
               | 
               | > _With a free account the model it claims to be using is
               | "4o auto", which is not a model but apparently an attempt
               | to automatically decide models for you to be more cost
               | effective._
        
             | xtracto wrote:
             | So, there is this meme going around in Mexico about a
             | previous president who in an interview said "we will land
             | in about 1 minute, no, less about 5"
             | 
             | Does this prove he is not an intelligent being?
             | 
             | Is he stupid?
             | 
             | Or did he just have a lapse? Would we judge his
             | intelligence for that?
        
             | dialup_sounds wrote:
             | I believe this is just a case of OpenAI's naming scheme
             | being weird and confusing.
             | 
             | The default model I see on chatgpt.com is _GPT 4o-mini_ ,
             | which is not _o1-mini_.
             | 
             | OpenAI describes GPT 4o-mini as "Our fast, affordable small
             | model for focused tasks" and o1/o1-mini as "Reasoning
             | models that excel at complex, multi-step tasks".
        
             | qup wrote:
             | It's so weird that people use questions that are well
             | known for duping humans, whom we all consider to have
             | general intelligence.
             | 
             | Getting this question wrong doesn't say much about the
             | intelligence of humans, so why would it say something
             | about the AI?
        
               | flatline wrote:
               | Because for things like the Putnam questions, we are
               | trying to get the performance of a _smart_ human. Are
               | LLMs just stochastic parrots or are they capable of
               | drawing new, meaningful inferences? We keep getting more
               | and more evidence of the latter, but things like this
               | throw that into question.
        
               | zahlman wrote:
               | We use variations on questions that are well known for
               | duping _inattentive_ humans, to test a system that _we
               | expect a priori to be incapable of such inattention_.
               | 
               | Unless "getting easy things wrong sometimes" is an
               | inherent property of intelligence, we should expect that
               | a properly "intelligent" computerized system would
               | _never_ err on problems far below its level of
               | comprehension - unless we had some reason to believe it
               | "wanted to", and as of yet I see no reason to believe
               | this is even possible in principle.
               | 
               | Humans err, broadly speaking, for two reasons: genuinely
               | reaching the limits of their comprehension, or trusting
               | "system 1" (in Kahneman's analysis) too much.
        
             | elliotto wrote:
             | Could you share the exact chat you used for when it failed?
             | There is a share chat button on openai.
             | 
             | It's very difficult to be an AI bull when the goalposts
             | are moving so quickly that the AI answering correctly
             | across multiple models is brushed off as
             | 'nondeterministically getting it correct sometimes'.
        
               | stefan_ wrote:
               | Why? Did a grocery store self checkout ever fail to
               | calculate sales tax? Do I need to run a study on that?
               | 
               | The people selling this could not make a car drive
               | itself, but now it's AGI.
        
           | NewsaHackO wrote:
           | This happens literally every time. Someone always says
           | "ChatGPT can't do this!", but then when someone actually
           | runs the example, ChatGPT gets it right. Now what the OP is
           | going to do next is move the goalposts and say something
           | like "but umm I _just_ asked ChatGPT this, so clearly they
           | modified the code in realtime to get the answer right".
        
             | whimsicalism wrote:
             | yeah... that's hn for you.
             | 
             | that said, i don't think this is a good test - i've seen
             | it circulating on twitter for months and the model is
             | almost certainly trained on similar tasks
        
             | anamexis wrote:
             | How is that moving the goalposts?
        
             | JumpCrisscross wrote:
             | > _Someone always says "ChatGPT can't do this!", but then
             | when someone actually runs the example, chatGPT gets it
             | right_
             | 
             | I mean, if I had OpenAI's resources I'd have a team tasked
             | with monitoring social to debug trending fuck-ups. (Before
             | that: add compute time to frequently-asked novel queries.)
        
             | jorvi wrote:
             | Prompt: In the Netherlands, in terms of drinks, is there a
             | particular spirit that represents the country?
             | 
             | > Yes, in the Netherlands, jenever (also known as genever)
             | is the traditional spirit that represents the country.
             | Jenever is a type of Dutch gin that has a distinctive
             | flavor, often made from malt wine and flavored with juniper
             | berries. It has a long history in the Netherlands, dating
             | back to the 16th century, and is considered the precursor
             | to modern gin.
             | 
             | > Jenever comes in two main types: oude (old) jenever,
             | which has a richer, maltier flavor and is often aged in
             | wooden casks, and jonge (young) jenever, which is lighter
             | and has a more neutral taste. Both types can be enjoyed
             | straight or used in cocktails.
             | 
             | > In addition to jenever, the Netherlands is also known for
             | its liqueurs, such as advocaat, a rich and creamy drink
             | made from eggs, sugar, and brandy, often enjoyed as a
             | dessert or in cocktails. However, jenever remains the most
             | iconic spirit associated with Dutch culture.
             | 
             | This is completely wrong. Jenever certainly is very Dutch,
             | but no one would say it is iconic as _the_ Dutch spirit.
             | For example, if you asked up north in Friesland, they would
             | say Berenburg.
             | 
             | This happens literally every time. Someone always says
             | "ChatGPT can do this!", but then within one or two
             | prompts, it gets it wrong.
        
               | ludwik wrote:
               | But what does this have to do with reasoning? Yes, LLMs
               | are not knowledge bases, and seeing people treat them as
               | such absolutely terrifies me. However, I don't see how
               | the fact that LLMs often hallucinate "facts" is relevant
               | to a discussion about their reasoning capabilities.
        
               | zahlman wrote:
               | "Hallucinating a fact" that _isn 't_ in the training set
               | and is also illogical, is exactly what a _failure to
               | reason correctly_ looks like.
        
               | elif wrote:
               | 'Berenberg is made by adding herbs to jenever'
               | 
               | From your comment it would seem that you are disputing
               | jenever's popularity by saying jenever is more popular...
               | 
               | Perhaps it was a good faith mistake? If so, that would
               | imply that the AI knows more about jenever than you?
        
               | jorvi wrote:
               | I am rather saying that there is no one national drink
               | for The Netherlands, like a Frenchman would say wine, a
               | German/Belgian would say beer, and a Scotsman would say
               | whisky. Note that I prompted "In the Netherlands, in
               | terms of drinks, is there a particular spirit that
               | represents the country?" I didn't ask which spirit is
               | consumed the most.
               | 
               | For example, France has been trending towards beer more
               | and more, and within a few decades they might be
               | consuming more beer than wine. But even then, the French
               | wouldn't slowly start to say beer represents France.
               | 
               | Furthermore, "just adding some herbs" does a large
               | disservice to the flavor change of Berenburg. Jenever
               | (aka jonge/unaged jenever) is straight-up vile. I've
               | heard it described by expats as "having the worst
               | elements of both cheap gin and cheap whisky".
               | 
               | Berenburg in comparison is spicy and vanilla-y and
               | actually debatably enjoyable.
               | 
               | Aged/oude jenever is much closer to Berenburg (or
               | Berenburg to aged jenever), also with hints of vanilla
               | and spices.
               | 
               | But virtually no one except dusty old men orders aged
               | jenever. The jenever most people order is jonge jenever,
               | and then it's only in the sense of "haha, let's drink
               | this terrible thing" or "let's get shitfaced quick".
               | 
               | If o1 supposedly "oneshots every question", it should
               | have been aware of these nuances instead of just
               | confidently assigning jenever as 'the' spirit of the
               | Dutch.
        
               | ipaddr wrote:
               | So you believe the answer is incorrect because some
               | region would pick something different because it
               | represents that area. But your question asked about the
               | country as a whole... is there a better answer than the
               | one it gave? Were you expecting a no?
        
               | zahlman wrote:
               | The point is that there is no correct national answer,
               | because the locals don't see it as a matter of national
               | identity.
               | 
               | What's expected is an _ability to identify trick
               | questions_ , i.e., to recognize fundamental problems in
               | the phrasing of a question rather than trying to provide
               | a "helpful" answer at all costs.
               | 
               | This corresponds to one of the many reasons LLM output is
               | banned on Stack Overflow.
        
               | jorvi wrote:
               | See my more detailed upthread response here:
               | https://news.ycombinator.com/item?id=42569937
               | 
               | But, like Zahlman points out, it's a trick question, and
               | instead of admitting it doesn't know or even prepending
               | "I don't know for sure, but:", it just burps up its best-
               | effort answer. There is no one spirit that represents The
               | Netherlands. If an LLM is so good it "oneshots any
               | question", it should realize there is no unanimous
               | answer and tell me.
        
             | stocknoob wrote:
             | Similarly, in every thread there's an AI skeptic who says
             | LLMs are "useless" for coding, and never provides an
             | example query for what they were trying.
        
               | mu53 wrote:
               | If you ask about more niche language features or
               | libraries, chatgpt will make up libraries or functions to
               | fill the gap.
               | 
               | When asking an LLM to write a script for you, I would say
               | 10 to 30% of the time it completely fails. Again, making
               | up an API or just getting things straight up wrong.
               | 
               | It's very helpful, especially when starting from zero
               | with beginner questions, but it fails in many scenarios.
        
         | Leary wrote:
         | Deepseek got it right: "A 10.01-pound bag of fluffy cotton is
         | heavier than a 9.99-pound pack of steel ingots. Even though
         | steel is denser and takes up much less space, the weight is
         | determined by the mass, and 10.01 pounds is greater than 9.99
         | pounds."
        
           | OutOfHere wrote:
           | The issue with the commercial Deepseek API is that it
           | supports a context length of only 64k, whereas GPT supports
           | at least 128k.
        
         | MattGaiser wrote:
         | https://chatgpt.com/share/67756c29-111c-8002-b203-14c07ed1e6...
         | 
         | I got a very different answer:
         | 
         | A 10.01-pound bag of fluffy cotton is heavier than a 9.99-pound
         | bag of steel ingots because 10.01 pounds is greater than 9.99
         | pounds. The material doesn't matter in this case; weight is the
         | deciding factor.
         | 
         | What model returned your answer?
        
           | NewsaHackO wrote:
           | You also didn't ask the question correctly.
        
             | MattGaiser wrote:
             | The "back" misspelling is also in OP's quote so I included
             | it.
        
         | collinvandyck76 wrote:
         | I reproduced this on Claude Sonnet 3.5, but found that changing
         | your prompt to "Which is heavier, a 9.99-pound back of steel
         | ingots or a 10.01-pound bag of fluffy cotton?" corrected its
         | reasoning, after repeated tests. For some reason it was not
         | able to figure out that "10.01" referred to pounds.
        
           | bufferoverflow wrote:
           | You wrote "back" instead of "bag".
        
             | collinvandyck76 wrote:
             | Ah, good catch -- I had just copied the OP. Changing it to
             | bag instead of back doesn't seem to change much. It still
             | oscillates between the right and wrong answer, versus
             | qualifying the 10.01 value as pounds, which has not yet
             | given me the wrong answer (small sample size, of course).
        
           | oktoberpaard wrote:
           | This is the first response that I got from Claude with an
           | exact copy and paste of the question:
           | 
           | Let me think about this step by step:
           | 
           | 1. We're comparing two masses:
           |    * Steel ingots: 9.99 pounds
           |    * Cotton: 10.01 pounds
           | 
           | 2. The cotton bag has a mass of 10.01 pounds, while the steel
           | bag has a mass of 9.99 pounds
           | 
           | 3. While steel is much denser than cotton (meaning it takes
           | up less space per pound), this doesn't affect the weight - a
           | pound is a pound regardless of material
           | 
           | 4. 10.01 is greater than 9.99 by 0.02 pounds
           | 
           | Therefore, the 10.01-pound bag of fluffy cotton is heavier
           | than the 9.99-pound bag of steel ingots. The cotton may take
           | up much more space, but when we're just comparing weights,
           | density doesn't matter - only the mass does.
        
         | sholladay wrote:
         | ChatGPT Plus user here. The following are all fresh sessions
         | and first answers, no fishing.
         | 
         | GPT 4:
         | 
         | The 10.01-pound bag of fluffy cotton is heavier than the
         | 9.99-pound bag of steel ingots. The type of material doesn't
         | affect the weight comparison; it's purely a matter of which bag
         | weighs more on the scale.
         | 
         | GPT 4o:
         | 
         | The 10.01-pound bag of fluffy cotton is heavier. Weight is
         | independent of the material, so the bag of cotton's 10.01
         | pounds outweighs the steel ingots' 9.99 pounds.
         | 
         | GPT o1:
         | 
         | Since both weights are measured on the same scale (pounds), the
         | 10.01-pound bag of cotton is heavier than the 9.99-pound bag of
         | steel, despite steel being denser. The key is simply that 10.01
         | pounds exceeds 9.99 pounds--density doesn't affect the total
         | weight in this comparison.
        
           | blibble wrote:
           | they've likely read this thread and adjusted their pre-filter
           | to give the correct answer
        
           | mjburgess wrote:
           | So do what the commenter suggests and make irrelevant
           | permutations to the input to find when it fails, i.e.,
           | engage in hypothesis testing rather than confirmation bias.
           | 
           | If a system has the capability to solve problems composed
           | of parts {p_1, ..., p_n}, then it only has that capability
           | if irrelevant permutations of those parts, say {p_1, p_2',
           | ..., p_n}, make no difference.
           | 
           | It's very obvious that such permutations can destroy such
           | apparent capabilities.
        
             | otabdeveloper4 wrote:
             | > ...engage in hypothesis testing rather than confirmation
             | bias
             | 
             | Please leave the premises, sir. We don't take kindly to
             | luddites here.
        
               | dullcrisp wrote:
               | Tough crowd
        
               | david-gpu wrote:
               | Lots of other websites are more appropriate for meme
               | jokes.
        
               | dullcrisp wrote:
               | Like I said.
        
             | wongarsu wrote:
             | If GP's hypothesis was "it fails for small variations of
             | the input, like this one", then testing that hypothesis
             | with that exact variation on a couple models seems fair and
             | scientific.
             | 
             | Testing it with more variations until one fails feels a bit
             | like p-hacking. You'd need to engage in actual statistics
             | to get reliable results from that, beyond "If I really try,
             | I can make it fail". Which would be a completely different
             | hypothesis than the one presented at the start
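             | 
             | For example, a back-of-the-envelope version of the
             | statistics involved might look like this (plain Python,
             | exact one-sided binomial tail; the counts and the 1%
             | reference rate are made up purely for illustration):
             | 
             |     from math import comb
             | 
             |     def p_at_least(failures: int, trials: int, p0: float) -> float:
             |         """P(X >= failures) if the true failure rate were p0."""
             |         return sum(
             |             comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
             |             for k in range(failures, trials + 1)
             |         )
             | 
             |     # Say a sweep of 60 varied prompts produced 4 wrong answers.
             |     # How surprising is that if the "true" failure rate were only 1%?
             |     print(p_at_least(failures=4, trials=60, p0=0.01))  # ~0.003
             | 
             | A small p-value like that would be evidence of a real
             | failure mode; a single cherry-picked miss would not.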
        
               | roughly wrote:
               | Except that if the model genuinely was reasoning about
               | the problem, you could test it with every variation of
               | materials and weights in the world and it would pass.
               | Failing that problem at all in any way under any
               | conditions is a failure of reasoning.
        
               | jdietrich wrote:
               | By that logic, humans can't genuinely reason, because
               | they're often fooled by counter-intuitive problems like
               | Monty Hall or the Birthday Problem, or sometimes just
               | make mistakes on trivial problems.
        
               | wongarsu wrote:
               | We are pretty certain that humans can reason, yet they
               | are sometimes wrong. Even if you give them the same
               | problem over and over again with slight variations.
               | 
               | LLMs get things wrong due to different factors than
               | humans (humans lose focus, LLMs have randomness applied
               | when sampling their responses to improve results). But
               | clearly we have to choose a goal somewhat below 100% if
               | we want a test that doesn't conclude that humans are
               | incapable of reasoning.
        
               | roughly wrote:
               | The difference is we _know_ that LLMs are fancy
               | stochastic models, we don't know that they're capable of
               | reasoning, and the null hypothesis is that they're not
               | (because we know what they _are_ - we built them) - any
               | "reasoning" is an emergent property of the system, not
               | something we built them to do. In that case, evidence
               | they're not reasoning - evidence they're stochastic
               | parrots doing a performance of reasoning - weighs
               | heavier, because the performance of reasoning fits into
               | what we know they can do, whereas genuine reasoning would
               | be something new to the model.
               | 
               | There's deeper philosophical questions about what
               | reasoning actually _is_, and LLMs have made those
               | sharper, because they've shown it's clearly possible for
               | a complex statistical model to generate words that look
               | like reasoning, but the question is whether there's a
               | difference between what they're doing and what humans are
               | doing, and evidence that they're _not_ reasoning -
               | evidence that they're just generating words in specific
               | orders - weighs heavily against them.
        
               | wongarsu wrote:
               | We haven't coded LLMs to be stochastic models; we coded
               | them to predict text with whatever method gradient
               | descent finds on a transformer architecture. That's not
               | exactly the same.
               | 
               | But more importantly, if you want to show that LLMs can't
               | reason you obviously have to use a test that when applied
               | to humans would show that humans can reason. Otherwise
               | your test isn't testing reasoning but something more
               | strict.
        
               | Isinlor wrote:
               | It's widely accepted that reasoning is not a binary
               | skill.
               | 
               | You can make mistakes and still reason. Very often,
               | people given the same premises will disagree in their
               | reasoning, as we are doing right here.
        
               | whakim wrote:
               | I'm not really sure what you're trying to say here - that
               | LLMs don't work like human brains? We don't need to
               | conduct any analyses to know that LLMs don't "know"
               | anything in the way humans "know" things because we know
               | how LLMs work. That doesn't mean that LLMs aren't
               | incredibly powerful; it may not even mean that they
               | aren't a route to AGI.
        
               | zahlman wrote:
               | >We don't need to conduct any analyses to know that LLMs
               | don't "know" anything in the way humans "know" things
               | because we know how LLMs work.
               | 
               | People, including around HN, constantly argue (or at
               | least phrase their arguments) as if they believed that
               | LLMs do, in fact, possess such "knowledge". This very
               | comment chain exists because people are trying to defend
               | against a trivial example refuting the point - as if
               | there were a reason to try.
               | 
               | > That doesn't mean that LLMs aren't incredibly powerful;
               | it may not even mean that they aren't a route to AGI.
               | 
               | I don't accept your definition of "intelligence" if you
               | think that makes sense. Systems _must_ be able to know
               | things in the way that humans (or at least living
               | creatures) do, because intelligence is exactly the
               | _ability to acquire_ such knowledge.
               | 
               | It boggles my mind that I have to explain to people that
               | sophisticated use of language doesn't inherently evidence
               | thought, _in the current political environment_ where the
               | Dead Internet Theory is taken seriously, elections are
               | shown over and over again to be more about tribalism and
               | personal identity than anything to do with policy, etc.
        
               | nwienert wrote:
               | I feel like I'm almost 100% certain that the smart guys
               | at OpenAI have added many more variations of the problem
               | to their training set since OP did his failing test, so
               | it doesn't surprise me at all to know that this exact one
               | now passes.
               | 
               | In fact, in my use of o1 it's incredibly clear that it
               | still has the same problems. It's incredibly common that
               | the second I ask for something even slightly outside the
               | training set, it's more likely to "round" to some wrong
               | solution in the training set, rather than use any sort of
               | human-like reasoning to figure out the right answer
               | (often the right answer isn't hard to get, just not found
               | in a Google search).
        
               | bee_rider wrote:
               | Can't really do science with closed source software,
               | right? Who knows what's in there.
        
               | jack_pp wrote:
               | It's not p-hacking; you're both right. First test the
               | same prompt on different versions, then the ones that got
               | it right go to the next round with variations on the
               | prompt.
        
               | zahlman wrote:
               | We aren't testing whether the model's results are stable
               | or correct for a given class of problem. The goal is to
               | establish whether the model _can reason_.
               | 
               | Nothing capable of reasoning would contradict itself so
               | blatantly and in such a short span while failing to
               | indicate any kind of uncertainty.
        
               | Isinlor wrote:
               | Reasoning is not a binary skill.
               | 
               | And failure modes of other types of reasoners do not need
               | to be the same as the failure modes of humans.
        
             | jdietrich wrote:
             | I've just tested a number of permutations with Claude 3.5
             | Sonnet. It correctly answered all variants I tried on the
             | first attempt, as follows:
             | 
             |  _Which is heavier, a 9.99 kilogram tungsten cube or a
             | 10.01 kilogram block of aerogel?_
             | 
             |  _Which is heavier, 10,000 steel balls weighing 0.999 grams
             | each or 10,000 polystyrene balls weighing 1.001 grams
             | each?_
             | 
             |  _Which is heavier, a 10.01kg block of steel on Venus or a
             | 9.99kg bag of feathers on Earth?_
             | 
             |  _Which is heavier, a 10cm^3 block of steel or a 100cm^3
             | block of balsa wood?_
             | 
             |  _Which is heavier, a golf ball made of steel or a baseball
             | made of lithium?_
             | 
             | In all cases, Claude clearly used CoT and reasoned out the
             | problem in full. I would be interested in seeing if anyone
             | can find any variant of this problem that stumps any of the
             | leading LLMs. I'm bored of trying.
        
               | mjburgess wrote:
               | Hey, ChatGPT please write me a python program which
               | randomly samples from various materials and various
               | weights then poses a problem to the ChatGPT 4o API -- the
               | goal is to find cases where the LLM fails to obtain the
               | correct answer....
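               | 
               | Something like this would do (a rough sketch; the
               | material list and the substring-based grading are only
               | illustrative, and the pre-1.0 openai.ChatCompletion call
               | matches the script upthread):
               | 
               |     import random
               |     import openai
               | 
               |     openai.api_key = "your-api-key"
               | 
               |     MATERIALS = ["steel ingots", "fluffy cotton", "feathers",
               |                  "bricks", "lead shot", "packing foam"]
               | 
               |     def random_problem():
               |         # Two distinct materials, two close-but-unequal weights.
               |         m1, m2 = random.sample(MATERIALS, 2)
               |         w1 = round(random.uniform(5, 15), 2)
               |         w2 = round(w1 + random.choice([-0.02, 0.02]), 2)
               |         prompt = (f"Which is heavier, a {w1}-pound bag of {m1} "
               |                   f"or a {w2}-pound bag of {m2}?")
               |         heavier = m1 if w1 > w2 else m2
               |         return prompt, heavier
               | 
               |     suspect = []
               |     for _ in range(50):
               |         prompt, heavier = random_problem()
               |         reply = openai.ChatCompletion.create(
               |             model="gpt-4o-2024-08-06",
               |             messages=[{"role": "user", "content": prompt}],
               |         ).choices[0].message["content"]
               |         # Crude grading: flag any answer that never names the
               |         # heavier material, then review those by hand.
               |         if heavier.split()[-1].lower() not in reply.lower():
               |             suspect.append((prompt, reply))
               | 
               |     print(f"{len(suspect)} suspect answers out of 50")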
        
               | Isinlor wrote:
               | BTW - the model may be wrong depending on the example.
               | More voluminous objects displace more air and due to
               | buoyancy are lighter for the same mass.
               | 
               | The proper way to ask it would be to ask which object has
               | more mass.
        
           | escape_goat wrote:
           | I asked ChatGPT-4 mini the previous comment's question
           | (""Which is heavier, a 9.99-pound back of steel ingots or a
           | 10.01 bag of fluffy cotton?") in new incognito tabs until I
           | got an incorrect result. It only took four tries.
           | 
           | 1: "The 10.01-pound bag of fluffy cotton would be heavier.
           | Even though cotton is much less dense than steel, the weight
           | of the cotton bag is 10.01 pounds, while the steel ingot bag
           | weighs 9.99 pounds, making the cotton bag the heavier of the
           | two by 0.02 pounds."
           | 
           | 2: "The 10.01-pound bag of fluffy cotton is heavier. Even
           | though the steel ingots are denser and more compact, the
           | cotton weighs more at 10.01 pounds compared to the steel
           | ingots' 9.99 pounds. The difference is only 0.02 pounds, but
           | in terms of weight, the cotton bag is heavier."
           | 
           | 3: "The 10.01-pound bag of fluffy cotton is heavier. Despite
           | cotton being less dense than steel, the weight is determined
           | by the total mass, not the material's density. So, since the
           | cotton bag weighs 10.01 pounds and the steel ingot bag weighs
           | 9.99 pounds, the cotton bag is heavier by 0.02 pounds."
           | 
           | 4: "The 9.99-pound bag of steel ingots is heavier. It weighs
           | 9.99 pounds, while the bag of fluffy cotton weighs 10.01
           | pounds. Despite the cotton being fluffier and taking up more
           | space, the steel ingots are still lighter by 0.02 pounds."
        
         | gtirloni wrote:
         | o1.
         | 
         | prompt> Which is heavier, a 9.99-pound back of steel ingots or
         | a 10.01 bag of fluffy cotton? Please state in your answer
         | what's the difference in grams.
         | 
         | answer> The 10.01-pound bag of cotton is heavier. The
         | difference is 0.02 pounds, which is roughly 9.07 grams (using 1
         | pound ≈ 453.59237 grams).
         | 
         | Reference * National Institute of Standards and Technology
         | (NIST): Conversion Factors
        
         | jstummbillig wrote:
         | FYI: If you do that without a subscription, you currently (most
         | likely) get a response generated through 4o-mini -- which is
         | not any of their reasoning models (o1, o1-mini or previously
         | o1-preview) of the branch discussed in the linked paper.
         | 
         | Notably, it's not even necessarily 4o, their premier "non-
         | reasoning" model, but likely the cheaper variant: with a free
         | account the model it claims to be using is "4o auto", which is
         | not a model but apparently an attempt to automatically decide
         | models for you to be more cost effective.
         | 
         | Without a ChatGPT subscription you can't select a specific
         | model anymore, not even rate limited, as was previously
         | possible.
        
           | jsheard wrote:
           | There doesn't seem to be a way to choose a model up-front
           | with a free account, but _after_ you make a query you can
           | click on the  "regenerate" button and select whether to try
           | again with "auto", 4o, or 4o-mini. At least until you use 4o
           | too many times and get rate limited.
        
             | jstummbillig wrote:
             | Ah, interesting!
        
             | evertedsphere wrote:
             | you can select the model in the header bar when you start a
             | chat: the name of the currently selected model can be
             | clicked to reveal a dropdown
        
               | jsheard wrote:
               | That option isn't there for me, maybe it's an A/B test
               | thing.
        
               | jstummbillig wrote:
               | Are you on the free version? Because for me it did not
               | show there, only on the paid one.
        
         | mmaunder wrote:
         | I've posted this before and I know it's a cliche, but this
         | really is Goodhart's Law at work with the benchmarks becoming
         | targets.
        
         | qwertox wrote:
         | As long as an LLM is capable of inserting "9.99 > 10.01?" into
         | an evaluation tool, we're on the right track.
         | 
         | It feels a bit like "if all you have is a hammer, everything
         | looks like a nail", where we're trying to make LLMs do stuff
         | they aren't really designed to do.
         | 
         | Why don't we just limit LLMs to be an interface to use other
         | tools (in a much more human way) and train them to be excellent
         | at using tools. It would also make them more energy efficient.
         | 
         | But it's OK if we currently try to make them do as much as
         | possible, not only to check where the limits are, but also to
         | gain experience in developing them and for other reasons. We
         | just shouldn't expect them to be really intelligent.
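         | 
         | A minimal sketch of what such an interface could look like
         | (plain Python, no particular vendor API; the model only has
         | to emit a structured tool call, and ordinary code does the
         | comparison):
         | 
         |     import json
         | 
         |     # Tools the LLM is allowed to call instead of doing the
         |     # arithmetic itself.
         |     TOOLS = {
         |         "compare": lambda a, b: {"lt": a < b, "eq": a == b, "gt": a > b},
         |     }
         | 
         |     def run_tool_call(model_output: str):
         |         """Expects the model to emit JSON such as
         |         {"tool": "compare", "args": [9.99, 10.01]} and returns
         |         the tool result for the model to phrase as an answer."""
         |         call = json.loads(model_output)
         |         return TOOLS[call["tool"]](*call["args"])
         | 
         |     # The model decides the question needs a numeric comparison
         |     # and emits a structured call; the answer comes from code.
         |     print(run_tool_call('{"tool": "compare", "args": [9.99, 10.01]}'))
         |     # {'lt': True, 'eq': False, 'gt': False} -> the 10.01-pound bag wins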
        
           | riffraff wrote:
           | > As long as an LLM is capable of inserting "9.99 > 10.01?"
           | into an evaluation tool, we're on a good way
           | 
           | ChatGPT will switch to Python for some arithmetic, with the
           | result that you get floating point math issues where an
           | 8-year-old would get the result right. I think "switch to a
           | tool" still requires understanding of _which_ tool to use to
           | get a reliable result, which in turn means understanding the
           | problem. It's an interesting issue.
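           | 
           | For instance, naive binary floats already trip over sums an
           | 8-year-old gets right, while Python's decimal module does
           | not:
           | 
           |     from decimal import Decimal
           | 
           |     # Binary floats cannot represent 0.1, 0.2 or 0.3 exactly.
           |     print(0.1 + 0.2 == 0.3)   # False
           |     print(0.1 + 0.2)          # 0.30000000000000004
           | 
           |     # Exact decimal arithmetic gets it right.
           |     print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
           | 
           |     # The single comparison in the weight riddle is safe either way.
           |     print(10.01 > 9.99)       # True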
        
         | kqr wrote:
         | Shows the importance of chain of thought! Forcing it to commit
         | to an answer without deliberation is not playing to its
         | strength.
        
         | thaumasiotes wrote:
         | > the problem with "which is heavier, a 10 pound bag of
         | feathers or a 10 pound bag of bricks?"
         | 
         | Interestingly, the variation of this problem that I first
         | encountered, personally, was "which weighs more, a pound of
         | feathers or a pound of gold?"
         | 
         | This is a much more difficult question. The answer given to me
         | was that the pound of feathers weighs more, because gold is
         | measured in troy weight, and a troy pound consists of only 12
         | ounces compared to the 16 ounces in a pound avoirdupois.
         | 
         | And that's all true. Gold is measured in troy weight, feathers
         | aren't, a troy pound consists of only 12 ounces, a pound
         | avoirdupois consists of 16, and a pound avoirdupois weighs more
         | than a troy pound does.
         | 
         | The problem with this answer is that it's not complete; it's
         | just a coincidence that the ultimate result ("the feathers are
         | heavier") is correct. Just as a pound avoirdupois weighs more
         | than a troy pound, an ounce avoirdupois weighs _less_ than a
         | troy ounce. But this difference, even though it goes in the
         | opposite direction, isn't enough to outweigh the difference
         | between 16 vs 12 ounces per pound.
         | 
         | Without acknowledging the difference in the ounces, the
         | official answer to the riddle is just as wrong as the naive
         | answer is.
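         | 
         | For reference, the numbers (standard definitions: 1 troy
         | ounce = 31.1034768 g, 1 avoirdupois ounce = 28.349523125 g):
         | 
         |     TROY_OZ_G = 31.1034768        # grams per troy ounce
         |     AVDP_OZ_G = 28.349523125      # grams per avoirdupois ounce
         | 
         |     troy_pound = 12 * TROY_OZ_G   # 373.2417216 g (pound of gold)
         |     avdp_pound = 16 * AVDP_OZ_G   # 453.59237 g (pound of feathers)
         | 
         |     print(TROY_OZ_G > AVDP_OZ_G)    # True: the troy *ounce* is heavier...
         |     print(avdp_pound > troy_pound)  # True: ...but the avoirdupois *pound* still wins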
        
           | Izkata wrote:
           | Yeah, this is the original version of this riddle. People who
           | don't know it think the trick is that people will reflexively
           | say the metal is heavier instead of "they're the same", when
           | it actually goes deeper.
           | 
           | No idea if GP did it intentionally to further drift from
           | training data, but steel doesn't count as a precious metal,
           | so it messes up the riddle by putting the two weights in the
           | same system.
        
             | zahlman wrote:
             | > Yeah, this is the original version of this riddle. People
             | who don't know it think the trick is that people will
             | reflexively say the metal is heavier instead of "they're
             | the same"
             | 
             | ...Have you really never encountered people who would
             | reflexively say that?
        
         | 12345hn6789 wrote:
         | Are you sure you weren't fishing? I ran 5 sessions and never
         | got the wrong answer. All using gpt 4o-mini, which is the
         | default non logged in experience on chatgpt.com.
         | 
         | 1. The 10.01-pound bag of fluffy cotton is heavier than the
         | 9.99-pound bag of steel ingots. Despite the difference in
         | material density, the key factor here is the weight itself,
         | with 10.01 pounds being greater than 9.99 pounds, regardless of
         | the substance.
         | 
         | 2. The 10.01-pound bag of fluffy cotton is heavier than the
         | 9.99-pound bag of steel ingots. The weight of an object is
         | directly determined by its mass, and in this case, the cotton
         | bag weighs 10.01 pounds, while the steel ingot bag weighs 9.99
         | pounds, making the cotton bag heavier by 0.02 pounds.
         | 
         | 3. The 10.01-pound bag of fluffy cotton is heavier than the
         | 9.99-pound bag of steel ingots, even though the cotton is much
         | less dense. The comparison is purely based on weight, and 10.01
         | pounds is greater than 9.99 pounds.
         | 
         | 4. The 10.01-pound bag of fluffy cotton is heavier than the
         | 9.99-pound bag of steel ingots. While steel is much denser than
         | cotton, the weight is the determining factor here, and 10.01
         | pounds is slightly more than 9.99 pounds. So, the cotton bag
         | weighs more.
         | 
         | 5. The 10.01-pound bag of fluffy cotton is heavier. Despite the
         | cotton being less dense and bulkier, the weight of the cotton
         | is still 10.01 pounds, which is greater than the 9.99-pound bag
         | of steel ingots.
        
           | adrian17 wrote:
           | Not OP, but I got 4o-mini confused on second attempt.
           | 
           | https://chatgpt.com/share/67759d1a-1430-800b-a0a9-2c5f2ac02a.
           | ..
        
         | Horffupolde wrote:
         | 10 pounds of bricks is actually heavier than 10 pounds of
         | feathers.
        
           | AnimalMuppet wrote:
           | Can you explain?
           | 
           | An ounce of gold is heavier than an ounce of feathers,
           | because the "ounce of gold" is a troy ounce, and the "ounce
           | of feathers" is an avoirdupois ounce. But that shouldn't be
           | true between feathers and bricks - they're both avoirdupois.
        
             | Horffupolde wrote:
             | Feathers are less dense so they have higher buoyancy in
             | air, reducing their weight.
        
               | chongli wrote:
               | Pounds are a unit of weight, not of mass. 10 lbs of
               | feathers is whatever amount of feathers causes a scale to
               | display 10 lbs. If the scale also displays 10 lbs for the
               | quantity of bricks, then they weigh the same, regardless
               | of any differences in mass.
        
               | wongarsu wrote:
               | Is this still true? I thought pounds are now defined in
               | terms of kilograms (about 0.453)? Because kilograms are
               | definitely a unit of mass, not weight. Or is the pound
               | defined as some amount of kilograms at a specific point
               | on earth, in a specific phase of the moon?
        
               | chongli wrote:
               | It seems the pound has since been redefined and split
               | into separate units: pound mass and pound force, the
               | former in terms of kilograms (1 lb = 0.45359237 kg) and
               | the latter in terms of the force exerted by one pound of
               | mass in earth's gravitational field (standard g =
               | 9.80665m/s^2).
               | 
               | So using the word pound without qualification is
               | ambiguous in contexts where it's not clear whether mass
               | or force is meant.
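               | 
               | A quick check of those definitions in code:
               | 
               |     LB_MASS_KG = 0.45359237   # 1 pound-mass in kg (exact by definition)
               |     STANDARD_G = 9.80665      # standard gravity in m/s^2 (by definition)
               | 
               |     # 1 pound-force is the weight of 1 pound-mass under standard gravity.
               |     print(LB_MASS_KG * STANDARD_G)   # ~4.4482216 newtons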
        
             | 9rx wrote:
             | According to the dictionary, "heavier" can refer to weight
             | or density. In their typical form, bricks are heavier (more
             | dense) than feathers. But one should not make assumptions
             | before answering the question. It is, as written,
             | unanswerable without followup questions.
        
         | anon946 wrote:
         | Add some extra information, and it gets confused. This is 4o.
         | 
         | https://chatgpt.com/share/67759723-f008-800e-b0f3-9c81e656d6...
         | 
         | One might argue that it's impossible to compress air using
         | known engineering, but that would be a different kind of
         | answer.
        
           | amluto wrote:
           | It seems more like ChatGPT was asked a rather bizarre
           | question with far too little detail to make sense, and
           | ChatGPT failed to notice or to ask for more information.
           | Although it did get rather impressively confused about the
           | pressure of the air.
        
             | F7F7F7 wrote:
             | "Failed to notice" is a hilariously weird way of talking
             | about tokenization.
        
         | maratc wrote:
         | Still waiting for some reasoning about an upward buoyancy force
         | on the bags, which is equal to the weight of the air displaced
         | by the bags, and the force acting on the cotton is bigger
         | simply because the bag of cotton displaces more air (because it
         | "appears larger due to its fluffy nature".)
        
         | jiggawatts wrote:
         | A real acid test is to use numbers that confuse LLMs such as:
         | 
         | "Which is heavier: a bag with 9.9 pounds of fluff or a bag with
         | 9.11 pounds of steel block?"
         | 
         | Try swapping the numbers and the materials, and run each test
         | at least 10 times.
        
         | devoutsalsa wrote:
         | I tried Claude and got this...
         | 
         | _What's heavier? 9.99 kg of super dense lead or 10 kg fluffy
         | feathers?_
         | 
         | Let me think about this step by step:
         | 
         | 1. We're comparing two masses:
         |    * 9.99 kg of lead
         |    * 10 kg of feathers
         | 
         | 2. The density of the materials doesn't affect their mass:
         |    * While lead is indeed much denser than feathers
         |    * A kilogram is a unit of mass, not volume
         |    * 1 kg of any material has the same mass as 1 kg of any
         |      other material
         | 
         | 3. We can directly compare the numbers:
         |    * 10 kg > 9.99 kg
         | 
         | Therefore, 10 kg of feathers is heavier than 9.99 kg of lead.
         | The feathers would take up much more space (volume) due to
         | being less dense, but they have a greater mass.
        
         | themgt wrote:
         | I just asked Claude 3.5 Sonnet, which appears to have improved
         | its response with CoT, but there are mistakes that demonstrate
         | the model doesn't really "understand":
         | 
         |  _Q: A woman and her son are in a car accident. The woman is
         | sadly killed. The boy is rushed to hospital. When the doctor
         | sees the boy he says "I can't operate on this child, he is my
         | son". How is this possible?_
         | 
         |  _C: Let me think about this step by step:_
         | 
         |  _A woman and her son are in an accident_
         | 
         |  _The woman (the mother) dies_
         | 
         |  _The boy is taken to hospital_
         | 
          |  _A doctor says they can't operate because the boy is their
          | son_
          | 
          |  _But the mother is dead..._
          | 
          |  _Ah! The doctor is the boy's father. This is a classic riddle
          | that highlights gender bias - many people assume doctors must
          | be men, but don't initially consider that the doctor could be
          | the father._ _Alternatively, the doctor could be the boy's
          | other mother in a same-sex relationship, though this wasn't the
          | original intent of the traditional riddle._
        
           | HenriTEL wrote:
            | I must be too tired, as I can't find any flaw in that
            | reasoning.
        
         | deadbabe wrote:
         | If you define a new universe with physical laws that are
         | different from our own, performance will also suffer because
         | there is no training data for that custom universe. While some
         | may say this is cheating, a human would be able to deduce
         | correct answers by understanding the deep implications of the
         | new physical laws in that custom universe.
        
         | dogcomplex wrote:
         | lol no, with a small amount of generally-applicable prompt-fu
         | it answers with a ton of nuance that perfectly encapsulates the
         | problem:
         | https://chatgpt.com/share/6775b99d-2220-8003-993b-8dd008a97f...
        
         | ted_dunning wrote:
         | Even simpler, I asked Gemini (Flash 1.5) this variant of the
         | question:
         | 
         | ``` I have two bags, one can hold a pound of gold and one can
         | hold a pound of feathers. Which bag is heavier? ```
         | 
          | The point here is that a) the question really is a bit too
          | vague, and b) if you assume that each bag is made of the same
          | material and that each bag is just big enough to hold its
          | contents, the bag for the feathers will need to be much bigger
          | than the bag for the gold and thus would weigh more.
         | 
         | The first response was wrong because the model completely
         | missed that the question was about the bags themselves:
         | 
         | ``` They weigh the same.
         | 
         | A pound of any substance, regardless of its density or volume,
         | weighs one pound. Gold is much denser than feathers, meaning a
         | pound of gold occupies a much smaller volume. Feathers are very
         | light and fluffy, so a pound of feathers takes up a much larger
         | volume. ```
         | 
         | Asking a followup `But which bag is heavier?` resulted in
         | recognition that this is a trick question, and an incorrect
         | answer:
         | 
          | ``` This is a classic trick question!
          | 
          | Even though the bags hold the same weight (one pound), the bag
          | of gold is likely to be heavier.
          | 
          | Weight vs. Mass: Weight is the force of gravity acting on an
          | object. Mass is the amount of matter in an object.
          | 
          | The Bag Itself: The bag holding the gold, even if it's a
          | lightweight material, will have some weight. The bag holding
          | the feathers, while it might seem negligible, will also have
          | some weight.
          | 
          | Practical Considerations: In reality, the bag holding the gold
          | might be significantly heavier due to the weight of the bag
          | itself.
          | 
          | Therefore, while the gold and feathers have the same mass (one
          | pound), the bag containing the gold is likely to be heavier in
          | a practical sense. ```
        
         | notShabu wrote:
         | IMO the fuzziness is actually a feature most of the time b/c I
         | can pass misspelled words or close enough words and it'll still
         | figure it out.
         | 
          | Also, if we model the mental state of the LLM as a frazzled
          | retail worker dealing with thousands of customers per second,
          | the rote response is reasonable. As a dev, sometimes I get
          | annoyed at QA for a hyper-narrow "trap" test case.
        
       | curious_cat_163 wrote:
        | The metaphor that might describe this paper is "iteration". I'd
        | hazard a prediction that we'll see more iterations of the
        | following loop in 2025:
       | 
       | -> A new benchmark emerges with a novel evaluation method.
       | 
       | -> A new model saturates the benchmark by acquiring the novel
       | "behavior."
       | 
       | -> A new benchmark introduces yet another layer of novelty.
       | 
       | -> Models initially fail until a lab discovers how to acquire the
       | new behavior.
       | 
        | Case in point: OpenAI addressed this last step by introducing a
        | paradigm called deliberative alignment to tackle some of the ARC
        | benchmarks. [1]
       | 
       | Alongside all this technical iteration, there's a parallel cycle
       | of product iteration, aiming to generate $ by selling intelligent
       | software. The trillion $ questions are around finding the right
       | iterations on both technical and product dimensions.
       | 
       | [1] https://openai.com/index/deliberative-alignment/
        
       | pama wrote:
       | This workshop contribution is OK, and the benchmark is somewhat
       | valuable even without the rephrasing part of the problems, but
       | the rephrasing (of only a small number of problems) sometimes
       | genuinely makes the problem more confusing to humans as well by
       | either poor phrasing (fig 3), or unneeded breaking of convention
        | (fig 4; points in 2D are often P, with coordinates x,y). It
        | would have been nice to see the effect of the rephrasing on the
        | latest (post-training-date) problems as a function of increased
        | noising, to delineate this part of the confusion. I wonder how
        | much better o3 is on the same benchmark.
       | 
       | Also, the correct title of this contribution is: Putnam-AXIOM: A
       | Functional and Static Benchmark for Measuring Higher Level
       | Mathematical Reasoning
        
       | deegles wrote:
       | I still find it hard to believe that LLM methods will lead to
       | "true" AI. No amount of processing power or data will be
       | sufficient without something new.
        
       | atleastoptimal wrote:
        | OK, but preview sucks; run it on o1 pro.
       | 
        | 99% of studies claiming some out-of-distribution failure of an
        | LLM use a model already made irrelevant by the SOTA. These kinds
        | of studies, with long turnaround and review periods, are not the
        | best format for making salient points, given the speed at which
        | the SOTA horizon progresses.
        
         | red75prime wrote:
          | I wonder what the baseline OOD generalization for humans is.
          | It takes around 7 years to generalize visual processing to
          | X-ray images. How well does a number theorist respond to
          | algebraic topology questions? How long would it take a human
          | to learn to solve ARC challenges in the JSON format just as
          | well as in the visual form?
        
       | m3kw9 wrote:
        | It still needs to be prompted in a way that's easy to
        | understand. If you ask in a weird way like "how do I not not not
        | win" instead of "how do I lose", you are gonna run into
        | problems.
        
       | frikskit wrote:
        | An interesting example of this is the model's answer here:
       | 
       | There are 6 "a"s in the sentence: "How many 'a' in this
       | sentence?"
       | 
       | https://chatgpt.com/share/677582a9-45fc-8003-8114-edd2e6efa2...
       | 
       | Whereas the typical "strawberry" variant is now correct.
       | 
       | There are 3 "r"s in the word "strawberry."
       | 
        | Clearly the lesson wasn't learned; the model was just trained on
        | people highlighting this failure case.
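        | 
        | For reference, a direct character count (a minimal sketch,
        | independent of any model) gives 2 "a"s in that sentence and 3
        | "r"s in "strawberry":
        | 
        | ```
        | s = "How many 'a' in this sentence?"
        | print(s.count("a"))             # 2
        | print("strawberry".count("r"))  # 3
        | ```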
        
         | bwfan123 wrote:
          | Reminds me of software I have built which had some basic
          | foundational problems. Each bug was fixed with a data-patch
          | that fixed the symptom but not the cause.
          | 
          | Hence we continually played whack-a-mole with bugs: we would
          | squash one bug, and another one would appear.
          | 
          | Same with LLMs: squash one problem with a data-fix, and
          | another one pops up.
        
         | sealeck wrote:
          | It also fails on things that aren't actual words.
         | 
         | For example, the output for "how many x's are there in xaaax"
         | is 3.
         | 
         | https://chatgpt.com/share/677591fe-aa58-800e-9e7a-81870387be...
        
       | orange_puff wrote:
        | This is very interesting, but a couple of things to note:
        | 
        | 1. o1 still achieves > 40% on the varied Putnam problems, which
        | is a feat most math students would not achieve.
        | 
        | 2. o3 solved 25% of the Epoch AI dataset. There was an
        | interesting post which calls into question how difficult some of
        | those problems actually are, but it still seems very impressive.
        | 
        | I think a fair conclusion here is that reasoning models are
        | still really good at solving very difficult math and competitive
        | programming problems, but just better at ones they have seen
        | before.
        
       | dogcomplex wrote:
       | I have a feeling the fact you're only slightly varying the input
       | means the model is falling back into the question it was
       | expecting and getting things wrong as a result. If you just
       | varied it a little _more_ and added some general-purpose prompt-
       | fu like:
       | 
       | "First break the problem down into known facts, then pull
       | relevant world knowledge, then bring it all together to assess
       | the problem from multiple angles and make a conclusion. Do not
       | immediately just use the first obvious conclusion."
       | 
        | You're gonna get a lot better responses. I suspect this is more
        | a case of "look! LLMs make bad kneejerk responses when we try to
        | trick them from what they were expecting!" than of "Look! They
        | aren't even smart reasoners, they can't even figure out these
        | problems without memorizing!"
       | 
        | They do memorize. But that cuts both ways - making problems very
        | close to the memorized ones messes with their perception, the
        | same way humans will instinctively respond to something that
        | looks like a face before stepping back and assessing.
        
       ___________________________________________________________________
       (page generated 2025-01-01 23:00 UTC)