[HN Gopher] Recent AI model progress feels mostly like bullshit
___________________________________________________________________
Recent AI model progress feels mostly like bullshit
Author : paulpauper
Score : 161 points
Date : 2025-04-06 18:01 UTC (4 hours ago)
(HTM) web link (www.lesswrong.com)
(TXT) w3m dump (www.lesswrong.com)
| fxtentacle wrote:
| I'd say most of the recent AI model progress has been on price.
|
| A 4-bit quant of QwQ-32B is surprisingly close to Claude 3.5 in
| coding performance. But it's small enough to run on a consumer
| GPU, which means deployment price is now down to $0.10 per hour.
| (from $12+ for models requiring 8x H100)
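| 
| For anyone curious, a minimal sketch of what running a 4-bit GGUF
| quant on a single consumer GPU can look like, via llama-cpp-python
| (the model filename and settings here are assumptions, not a
| recommendation):
| 
|     from llama_cpp import Llama
| 
|     # Hypothetical local 4-bit GGUF file; ~20 GB, fits a 24 GB card.
|     llm = Llama(
|         model_path="qwq-32b-q4_k_m.gguf",
|         n_gpu_layers=-1,  # offload every layer to the GPU
|         n_ctx=8192,       # context window
|     )
| 
|     out = llm.create_chat_completion(
|         messages=[{"role": "user",
|                    "content": "Write a binary search in Python."}],
|         max_tokens=512,
|         temperature=0.2,
|     )
|     print(out["choices"][0]["message"]["content"])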
| shostack wrote:
| Yeah, I'm thinking of this from a Wardley map standpoint.
|
| What innovation opens up when AI gets sufficiently
| commoditized?
| mentalgear wrote:
| Brute force. Brute-force everything, at least for the domains
| where you can have automatic verification.
| bredren wrote:
| One thing I've seen is large enterprises extracting money
| from consumers by putting administrative burden on them.
|
| For example, you can see this in health insurance
| reimbursements and wireless carriers plan changes. (ie,
| Verizon's shift from Do More, etc to what they have now)
|
| Companies basically set up circumstances where consumers lose
| small amounts of money on a recurring basis, or sporadically
| enough that people will just pay up rather than endure a maze
| of calls, website navigation and time suck to recover funds
| that are due to them or that shouldn't have been taken in the
| first place.
|
| I'm hopeful that well-commoditized AI will give consumers a
| fighting chance against this and other types of
| disenfranchisement that seem to be increasingly normalized by
| companies whose consultants do nothing but optimize for their
| own financial position.
| xiphias2 wrote:
| Have you compared it with 8-bit QwQ-17B?
|
| In my evals 8 bit quantized smaller Qwen models were better,
| but again evaluating is hard.
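| 
| For what it's worth, the kind of quick-and-dirty A/B harness I
| mean is roughly the sketch below, assuming llama-cpp-python and
| local GGUF files (the model filenames, prompts, and the crude
| "expected substring" check are all placeholders, which is part
| of why evaluating is hard):
| 
|     from llama_cpp import Llama
| 
|     prompts = [
|         ("Reverse a linked list in Python.", "def "),
|         ("What is 17 * 23? Reply with only the number.", "391"),
|     ]
| 
|     for path in ["qwq-32b-q4_k_m.gguf", "qwen2.5-14b-q8_0.gguf"]:
|         llm = Llama(model_path=path, n_gpu_layers=-1,
|                     n_ctx=4096, verbose=False)
|         passed = 0
|         for prompt, must_contain in prompts:
|             out = llm.create_chat_completion(
|                 messages=[{"role": "user", "content": prompt}],
|                 max_tokens=256, temperature=0.0,
|             )
|             text = out["choices"][0]["message"]["content"]
|             passed += must_contain in text
|         print(path, f"{passed}/{len(prompts)}")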
| softwaredoug wrote:
| I think the real meaningful progress is getting ChatGPT 3.5 level
| quality running anywhere you want rather than AIs getting smarter
| at high level tasks. This capability being ubiquitous and not
| tied to one vendor is really what's revolutionary.
| gundmc wrote:
| This was published the day before Gemini 2.5 was released. I'd be
| interested if they see any difference with that model.
| Anecdotally, that is the first model that really made me go wow
| and made a big difference for my productivity.
| jonahx wrote:
| I doubt it. It still flails miserably like the other models on
| anything remotely hard, even with plenty of human coaxing. For
| example, try to get it to solve:
| https://www.janestreet.com/puzzles/hall-of-mirrors-3-index/
| Xenoamorphous wrote:
| I'd say the average person wouldn't understand that problem,
| let alone solve it.
| flutas wrote:
| FWIW, 2.5-exp was the only model that managed to get a problem
| I asked right, compared to Claude 3.7 and o1 (or any of the
| other free models in Cursor).
|
| It was reverse engineering ~550MB of Hermes bytecode from a
| react native app, with each function split into a separate
| file for grep-ability and LLM compatibility.
|
| The others would all start off right, then quickly default to
| just grepping randomly for what they expected to find, which
| failed quickly. 2.5 traced the function all the way back to
| the networking call and provided the expected response
| payload.
| 
| All the others hallucinated the networking response I was
| trying to figure out. 2.5 provided it exactly, which was
| enough for me to intercept the request and use the response
| it provided to get what I wanted to show up.
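| 
| The per-function split is nothing fancy - roughly the sketch
| below, which assumes the disassembly is plain text where each
| function starts with a line like "Function <name>" (the exact
| header format depends on the disassembler, so treat the regex
| as a placeholder):
| 
|     import re
|     from pathlib import Path
| 
|     # Split one huge text disassembly into one file per function
|     # so an agent can grep/open only what it needs.
|     src = Path("bundle.hasm").read_text(errors="ignore")
|     out_dir = Path("functions")
|     out_dir.mkdir(exist_ok=True)
| 
|     starts = [m.start() for m in
|               re.finditer(r"^Function <[^>]*>", src, re.M)]
|     starts.append(len(src))
| 
|     for i, (a, b) in enumerate(zip(starts, starts[1:])):
|         chunk = src[a:b]
|         name = re.match(r"Function <([^>]*)>", chunk).group(1)
|         safe = re.sub(r"[^A-Za-z0-9_.-]", "_", name or "anon")[:80]
|         (out_dir / f"{i:05d}_{safe}.hasm").write_text(chunk)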
| arkmm wrote:
| How did you fit 550MB of bytecode into the context window?
| Was this using 2.5 in an agentic framework? (i.e. repeated
| model calls and tool usage)
| georgemcbay wrote:
| As someone who was wildly disappointed with the hype around
| Claude 3.7, Gemini 2.5 is easily the best programmer-assistant
| LLM available, IMO.
|
| But it still feels more like a small incremental improvement
| rather than a radical change, and I still feel its limitations
| constantly.
|
| Like... it gives me the sort of decent but uninspired solution
| I would expect, without predictably walking me through a bunch
| of obvious wrong turns that I have to repeatedly correct, as I
| would have had to do with earlier models.
|
| And that's certainly not nothing and makes the experience of
| using it much nicer, but I'm still going to roll my eyes
| anytime someone suggests that LLMs are the clear path to
| imminently available AGI.
| usaar333 wrote:
| Ya, I find this hard to imagine aging well. Gemini 2.5 solved
| (or at least did much better on) multiple real-world systems
| questions I've had in the past that other models could not. Its
| visual reasoning on charts also jumped significantly (e.g.
| planning around train schedules).
| 
| Even Sonnet 3.7 was able to do refactoring work on my codebase
| that Sonnet 3.6 could not.
| 
| Really not seeing the "LLMs not improving" story.
| boxed wrote:
| > So maybe there's no mystery: The AI lab companies are lying,
| and when they improve benchmark results it's because they have
| seen the answers before and are writing them down. [...then says
| maybe not...]
|
| Well... they've been caught red-handed doing exactly this, again
| and again. Fool me once, shame on you; fool me 100 times, shame
| on me.
| smnplk wrote:
| Fool me once, shame on you... If fooled, you can't get fooled
| again.
|
| https://www.youtube.com/shorts/LmFN8iENTPc
| drdaeman wrote:
| Hate to say this, but the incentive is growth, not progress.
| Progress is what enabled the growth, but it is also extremely
| hard to plan and deliver. Hype, on the other hand, is a somewhat
| easier and well-tested approach, so it's no surprise a lot of
| the effort goes into marketing. Markets have repeatedly
| confirmed that there aren't any significant immediate
| repercussions for cranking up the BS level in marketing
| materials, while there are some rewards when it works.
| djha-skin wrote:
| > Since 3.5-sonnet, we have been monitoring AI model
| announcements, and trying pretty much every major new release
| that claims some sort of improvement. Unexpectedly by me, aside
| from a minor bump with 3.6 and an even smaller bump with 3.7,
| literally none of the new models we've tried have made a
| significant difference on either our internal benchmarks or in
| our developers' ability to find new bugs. This includes the new
| test-time OpenAI models.
|
| This is likely a manifestation of the bitter lesson[1],
| specifically this part:
|
| > The ultimate reason for this is Moore's law, or rather its
| generalization of continued exponentially falling cost per unit
| of computation. Most AI research has been conducted as if the
| computation available to the agent were constant (in which case
| leveraging human knowledge would be one of the only ways to
| improve performance) but, _over a slightly longer time than a
| typical research project_ [like an incremental model update],
| massively more computation inevitably becomes available.
|
| (Emphasis mine.)
|
| Since the ultimate success strategy of the scruffies[2], or
| proponents of search and learning strategies in AI, is Moore's
| Law, short-term gains using these strategies will be minuscule.
| It is over at least a five-year period that their gains will be
| felt the most. The neats win the day in the short term, but the
| hare in this race will ultimately give way to the steady plod of
| the tortoise.
|
| 1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
|
| 2:
| https://en.m.wikipedia.org/wiki/Neats_and_scruffies#CITEREFM...
| ohgr wrote:
| It's not even approaching the asymptotic line of promises made at
| any achievable rate for the amount of cash being thrown at it.
|
| Where's the business model? Suck investors dry at the start of a
| financial collapse? Yeah that's going to end well...
| maccard wrote:
| > where's the business model?
|
| For who? Nvidia sell GPUs, OpenAI and co sell proprietary
| models and API access, and the startups resell GPT and Claude
| with custom prompts. Each one is hoping that the layer above
| has a breakthrough that makes their current spend viable.
|
| If they do, then you don't want to be left behind, because
| _everything_ changes. It probably won't, but it might.
|
| That's the business model
| grey-area wrote:
| That's not a business model, it's a pipe dream.
|
| This bubble will be burst by the Trump tariffs and the end of
| the ZIRP era. When inflation and a recession hit together,
| hope-and-dream business models and valuations no longer work.
| maccard wrote:
| Which one? Nvidia are doing pretty ok selling GPUs, and
| OpenAI and Anthropic are doing ok selling their models.
| They're not _viable_ business models, but they could be.
| grey-area wrote:
| NVDA will crash when the AI bubble implodes, and none of
| those generative AI companies are actually making money,
| nor will they. They have already hit diminishing returns in
| LLM improvements after staggering investments, and it is
| clear they are nowhere near general intelligence.
| maccard wrote:
| All of this can be true, and has nothing to do with them
| having a business model.
|
| > NVDA will crash when the AI bubble implodes
| > making money, nor will they
| > They have already hit limiting returns in LLM improvements
| > after staggering investments
| > and it is clear are nowhere near general intelligence.
|
| These are all assumptions and opinions, and have nothing
| to do with whether or not they have a business model. You
| mightn't like their business model, but they do have one.
| grey-area wrote:
| I consider it a business model if they have plans to make
| money at some point that are not based on hopium (no sign
| of that at OpenAI) and are not engaged in fraud like
| bundling and selling to their own subsidiaries (NVDA).
|
| These are of course just opinions, I'm not sure we can
| know facts about such companies except in retrospect.
| ohgr wrote:
| You missed the end of the supply chain: paying users. Who
| magically disappear below market-sustaining levels of sales
| when asked to pay.
| maccard wrote:
| I never said it was sustainable, and even if it was, OP
| asked for a business model. Customers don't need a business
| model, they're customers.
|
| The same is true for any non essential good or service.
| AstroBen wrote:
| > Going from $1M ARR to $100M ARR in 12 months, Cursor is
| the fastest growing SaaS company of all time
|
| Just because it's not reaching the insane hype being pushed
| doesn't mean it's totally useless
| paulsutter wrote:
| I'm able to get substantially more coding done than three months
| ago. This could be largely due to the tooling (coding agents,
| deep research). But the models are better too, for both coding
| and brainstorming. And tooling counts, to me, as progress.
|
| Learning to harness current tools helps to harness future tools.
| Work on projects that will benefit from advancements, but can
| succeed without them.
| dghlsakjg wrote:
| I'm not sure if I'm able to do more of the hard stuff, but a
| lot of the easy but time consuming stuff is now easily done by
| LLMs.
|
| Example: I frequently get requests for data from Customer
| Support that used to require 15 minutes of my time noodling
| around writing SQL queries. I can cut that down to less than a
| minute now.
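| 
| Concretely, the flow is roughly the sketch below, using the
| OpenAI Python client; the schema snippet, model name, and
| question are stand-ins for whatever you actually have, and the
| generated SQL still gets a human read before it runs:
| 
|     from openai import OpenAI
| 
|     client = OpenAI()
| 
|     # Placeholder schema; paste whatever tables are relevant.
|     schema = """
|     orders(id, customer_id, status, total_cents, created_at)
|     customers(id, email, plan, created_at)
|     """
| 
|     request = ("How many customers on the 'pro' plan placed "
|                "an order last month?")
| 
|     resp = client.chat.completions.create(
|         model="gpt-4o",
|         messages=[
|             {"role": "system",
|              "content": "Write a single read-only PostgreSQL "
|                         "query. Return only SQL.\nSchema:\n"
|                         + schema},
|             {"role": "user", "content": request},
|         ],
|         temperature=0,
|     )
|     # Review the draft, then run it against a read replica.
|     print(resp.choices[0].message.content)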
| mountainriver wrote:
| Yes I am a better engineer with every release. I think this is
| mostly empirically validated
| photochemsyn wrote:
| Will LLMs end up like compilers? Compilers are also fundamentally
| important to modern industrial civilization - but they're not
| profit centers, they're mostly free and open-source outside a few
| niche areas. Knowing how to use a compiler effectively to write
| secure and performant software is still a valuable skill - and
| LLMs are a valuable tool that can help with that process,
| especially if the programmer is on the steep end of the learning
| curve - but it doesn't look like anything short of real AGI can
| do novel software creation without a human constantly in the
| loop. The same argument applies to new fundamental research, even
| to reviewing and analyzing new discoveries that aren't in the
| training corpus.
|
| Wasn't it back in the 1980s that you had to pay $1000s for a good
| compiler? The entire LLM industry might just be following in the
| compiler's footsteps.
| lukev wrote:
| This seems like a probable end state, but we're going to have
| to stop calling LLMs "artificial intelligence" in order to get
| there.
| bcoates wrote:
| Yep. I'm looking forward to LLMs/deepnets being considered a
| standard GOFAI technique with uses and limitations and not
| "we asked the God we're building to draw us a picture of a
| gun and then it did and we got scared"
| mmcnl wrote:
| Why not? Objectively speaking, LLMs are artificially
| intelligent. Just because it's not human-level intelligence
| doesn't mean it's not intelligent.
| lukev wrote:
| Objectively speaking a chess engine is artificially
| intelligent. Just because it's not human level doesn't mean
| it's not intelligent. Repeat for any N of 100s of different
| technologies we've built. We've been calling this stuff
| "thinking machines" since Turing and it's honestly just not
| useful at this point.
|
| The fact is, the phrase "artificial intelligence" is a
| memetic hazard: it immediately positions the subject of
| conversation as "default capable", and then forces the
| conversation into trying to describe what it can't do,
| which is rarely a useful way to approach it.
|
| Whereas with LLMs (and chess engines and every other tech
| advancement) it would be more useful to start with what the
| tech _can_ do and go from there.
| jonahx wrote:
| My personal experience is right in line with the author's.
|
| Also:
|
| > I think what's going on is that large language models are
| trained to "sound smart" in a live conversation with users, and
| so they prefer to highlight possible problems instead of
| confirming that the code looks fine, just like human beings do
| when they want to sound smart.
|
| I immediately thought: That's because in _most_ situations this
| is the purpose of language, at least partially, and LLMs are
| trained on language.
| billyp-rva wrote:
| > [T]here are ~basically~ no public benchmarks for security
| research... nothing that gets at the hard parts of application
| pentesting for LLMs, which are 1. Navigating a real repository of
| code too large to put in context, 2. Inferring a target
| application's security model, and 3. Understanding its
| implementation deeply enough to learn where that security model
| is broken.
|
| A few months ago I looked at essentially this problem from a
| different angle (generating system diagrams from a codebase). My
| conclusion[0] was the same as here: LLMs really struggle to
| understand codebases in a holistic way, especially when it comes
| to the codebase's strategy and purpose. They therefore struggle
| to produce something meaningful from it like a security
| assessment or a system diagram.
|
| [0] https://www.ilograph.com/blog/posts/diagrams-ai-can-and-
| cann...
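| 
| For reference, the usual workaround is some form of retrieval
| before the prompt. A crude sketch of the shape of it is below
| (keyword overlap instead of embeddings, and the paths, question
| and limits are all placeholders); the retrieval step is exactly
| the part that struggles to capture strategy and purpose:
| 
|     import re
|     from pathlib import Path
| 
|     question = "Where is the session token validated?"
|     terms = set(re.findall(r"\w+", question.lower()))
| 
|     def score(path: Path) -> int:
|         text = path.read_text(errors="ignore").lower()
|         return len(terms & set(re.findall(r"\w+", text)))
| 
|     files = [p for p in Path("src").rglob("*.py")
|              if p.stat().st_size < 200_000]
|     top = sorted(files, key=score, reverse=True)[:5]
| 
|     context = "\n\n".join(
|         f"# {p}\n{p.read_text(errors='ignore')}" for p in top)
|     prompt = f"{context}\n\nQuestion: {question}"
|     # `prompt` then goes to the model; it only ever sees these
|     # five files, which is why holistic understanding is hard.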
| sema4hacker wrote:
| > ...whatever gains these companies are reporting to the public,
| they are not reflective of economic usefulness or generality.
|
| I'm not surprised, because I don't expect pattern-matching
| systems to grow into something more general and useful. I think
| LLMs are essentially running into the same limitations that the
| "expert systems" of the 1980s ran into.
| maccard wrote:
| My experience as someone who uses LLMs and a coding assist plugin
| (sometimes), but is somewhat bearish on AI is that GPT/Claude and
| friends have gotten worse in the last 12 months or so, and local
| LLMs have gone from useless to borderline functional but still
| not really usable for day to day.
|
| Personally, I think the models are "good enough" that we need to
| start seeing the improvements in tooling and applications that
| come with them now. I think MCP is a good step in the right
| direction, but I'm sceptical on the whole thing (and have been
| since the beginning, despite being a user of the tech).
| joelthelion wrote:
| I've used gemini 2.5 this weekend with aider and it was
| frighteningly good.
|
| It probably depends a lot on what you are using them for, and in
| general, I think it's still too early to say exactly where LLMs
| will lead us.
| jchw wrote:
| I think overall quality with Gemini 2.5 is not much better than
| Gemini 2 in my experience. Gemini 2 was already really good,
| but just like Claude 3.7, Gemini 2.5 goes some steps forward
| and some steps backwards. It sometimes generates some _really_
| verbose code even when you tell it to be succinct. I am pretty
| confident that if you evaluate 2.5 for a bit longer you'll
| come to the same conclusion eventually.
| mountainriver wrote:
| Yep, and what they are doing in Cursor with the agentic stuff
| is really game changing.
|
| People who can't recognize this intentionally have their heads
| in the sand
| InkCanon wrote:
| People are really asking two fundamentally different
| questions when they talk about AI "importance": AI's utility
| and AI's "intelligence". There is an important distinction
| between the two.
|
| 1) AI undoubtedly has utility. In many agentic uses, it has
| very significant utility. There's absolute utility and
| perceived utility, which is more about user experience. In
| absolute utility, git is likely the single most game-changing
| piece of software there is: it has probably saved a ten-,
| maybe eleven-digit sum in engineer hours times salary by
| enabling massive teams to work together in very seamless
| ways. In user experience, AI is amazing because it can
| generate so much so quickly. But it is very far from an
| engineer. For example, recently I tried to use Cursor to
| bootstrap a website in NextJS for me. It produced errors it
| could not fix, and each rewrite seemed to dig it deeper into
| its own hole. The reasons were quite obvious: a lot of it had
| to do with NextJS 15 and the breaking changes it introduces
| in cookies and auth. If you have masses of NextJS code,
| disproportionately from older versions and none of it labeled
| well with versions, it messes up the LLM. Eventually I
| scrapped what it wrote and did it myself. I don't mean to use
| this anecdote to say LLMs are useless, but they have pretty
| clear limitations. They work well on problems with massive
| data (like front end) that don't require much principled
| understanding (like understanding how NextJS 15 would break
| so-and-so's auth). Another example: when I tried to use it to
| generate flags for a V8 build, it failed horribly and would
| simply hallucinate flags all the time. Very likely (despite
| the existence of a list of V8 flags online) this was because
| many flags have very close representations in vector
| embeddings, and there is close to zero data or detailed
| examples on their use.
|
| 2) On the more theoretical side, the performance of LLMs on
| benchmarks (claims of being elite IMO solvers, competitive
| programming solvers) has become incredibly suspicious. When
| the new USAMO 2025 was released, the highest score was 5%,
| despite claims a year ago that SOTA was at least IMO silver
| level. This is against the backdrop of exponentially more
| compute and data being fed in. Combined with apparently
| diminishing returns, this suggests the gains from that are
| running really thin.
| throw310822 wrote:
| I hope it's true. Even if LLM development stopped now, we would
| still keep finding new uses for them for at least the next ten
| years. The technology is evolving way faster than we can
| meaningfully absorb it, and I am genuinely frightened by the
| consequences. So I hope we're hitting some point of diminishing
| returns, although I don't believe it one bit.
| a3w wrote:
| For three years now, my experience with LLMs has been "mostly
| useless, prefer ELIZA".
| 
| ELIZA is software written in 1966, though the web version is a
| little newer. It does occasional psychotherapy
| assistance/brainstorming just as well, and I can more easily
| tell when I have stepped out of its known range into the
| extrapolated.
| 
| That said, an LLM can vibe code in a framework unknown to me in
| half the time I would need to school myself and add the feature.
| 
| Or vibe coding takes twice as long, if I mostly know how to
| achieve what I want and read no framework documentation, only
| our own project's source code, to add a new feature. But on a
| day with a headache, I can still call the LLM a dumb twat and
| ask it to follow my instructions instead of doing bullshit.
| 
| But vibe coding always makes my pulse go from 65 to 105 and
| makes me question my life choices, since instructions are
| rarely followed and loops are never left once entered. Except
| for getting 80% of the structure kinda right on the first try,
| and then getting stuck for the whole workday.
| aerhardt wrote:
| My mom told me yesterday that Paul Newman had massive problems
| with alcohol. I was somewhat skeptical, so this morning I asked
| ChatGPT a very simple question:
|
| "Is Paul Newman known for having had problems with alcohol?"
|
| All of the models up to o3-mini-high told me he had no known
| problems. Here's o3-mini-high's response:
|
| "Paul Newman is not widely known for having had problems with
| alcohol. While he portrayed characters who sometimes dealt with
| personal struggles on screen, his personal life and public image
| were more focused on his celebrated acting career, philanthropic
| work, and passion for auto racing rather than any issues with
| alcohol. There is no substantial or widely reported evidence in
| reputable biographies or interviews that indicates he struggled
| with alcohol abuse."
|
| There is plenty of evidence online that he struggled a lot with
| alcohol, including testimony from his long-time wife Joanne
| Woodward.
|
| I sent my mom the ChatGPT reply and in five minutes she found an
| authoritative source to back her argument [1].
|
| I use ChatGPT for many tasks every day, but I couldn't fathom
| that it would get something so simple so wrong.
|
| Lesson(s) learned... Including not doubting my mother's movie
| trivia knowledge.
|
| [1] https://www.newyorker.com/magazine/2022/10/24/who-paul-
| newma...
| drooby wrote:
| I asked GPT-4.5 and it searched the web and immediately gave me
| a "yes" with paragraphs of sources cited.
| john2x wrote:
| Truth is a probability game. Just keep trying until you
| arrive.
| Avicebron wrote:
| "man puts prompt into LLM" > "LLM ships bad code" >
| "machine fails" > "person dies" > "fire man" > "man puts
| prompt into LLM"
| lfkdev wrote:
| That's not really 'simple' for an LLM. This is niche
| information about a specific person. LLMs train on massive
| amounts of data; the more a topic is present in the data,
| the better the answers will be.
|
| Also, you can/should use the "research" mode for questions like
| this.
| aerhardt wrote:
| The question is simple and verifiable - it is impressive to
| me that it's not contained in the LLM's body of knowledge -
| or rather that it can't reach the answer.
|
| This is niche in the grand scheme of knowledge but Paul
| Newman is easily one of the biggest actors in history, and
| the LLM has been trained on a massive corpus that includes
| references to this.
|
| Where is the threshold for topics with enough presence in the
| data?
| Max_aaa wrote:
| The question might be simple and verifiable, but it is not
| simple for an LLM to mark a particular question as such.
| This is the tricky part.
|
| An LLM does not care about your question, it is a bunch of
| math that will spit out a result based on what you typed
| in.
| permo-w wrote:
| this seems less like an issue with accuracy and more like an
| issue with model providers making sure they don't get sued for
| libel
| aerhardt wrote:
| I thought about that too.
| ilrwbwrkhv wrote:
| LLMs will never be good at specific knowledge unless
| specifically trained for it with narrow "if else" statements.
| 
| It's good for a broad general overview, such as the most
| popular categories of books in the world.
| Rebuff5007 wrote:
| Really? OpenAI says PhD-level intelligence is just around the
| corner!
| dadadad100 wrote:
| If we were to survey 100 PhDs, how many would correctly know
| that Paul Newman had an alcohol problem?
| AnimalMuppet wrote:
| I would hope that 100% of them would be able to figure
| out _how to find out_.
| dadadad100 wrote:
| Ah, but isn't that the problem here - asking an LLM for
| facts without requesting a search is like asking a PhD to
| answer a question "off the top of your head". For pop
| culture questions the PhD likely brings little value.
| ludwik wrote:
| I don't think they mean "knowledge" when they talk about
| "intelligence." LLMs are definitely not knowledge bases.
| They can transform information given to them in impressive
| ways, but asking a raw (non-RAG-enabled) LLM to provide its
| own information will probably always be a mistake.
| AnimalMuppet wrote:
| They kind of are knowledge bases, just not in the usual
| way. The knowledge is encoded in the words they were
| trained on. They weren't trained on words chosen at
| random; they were trained on words written by humans to
| encode some information. In fact, that's the only thing
| that makes LLMs somewhat useful.
| blitzar wrote:
| Does the as-yet-unwritten prequel of Idiocracy tell the tale of
| when we started asking AI chatbots for facts, and this was the
| point of no return for humanity?
| aerhardt wrote:
| Can you blame the users for asking it, when everyone is
| selling that as a key defining feature?
|
| I use it for asking - often very niche - questions on
| advanced probability and simulation modeling, and it often
| gets those right - why those and not a simple verifiable fact
| about one of the most popular actors in history?
|
| I don't know about Idiocracy, but something that I have read
| specific warnings about is that people will often blame the
| user for any of the tool's shortcomings.
| pclmulqdq wrote:
| It turns out there's huge demand for un-monetized web search.
| leereeves wrote:
| I like that it's unmonetized, of course, but that's not why
| I use AI. I use AI because it's better at search. When I
| can't remember the right keywords to find something, or
| when the keywords aren't unique, I frequently find that web
| search doesn't return what I need and AI does.
|
| It's impressive how often AI returns the right answer to
| vague questions. (not always though)
| pclmulqdq wrote:
| Google used to return the right answer to vague questions
| until it decided to return the most lucrative answer to
| vague questions instead.
| spudlyo wrote:
| Sadly, soon there will be a huge demand for un-monetized
| LLMs. Enshittification is coming.
| hn_throwaway_99 wrote:
| So, in other words, are you saying that AI model progress _is_
| the real deal and is not bullshit?
|
| That is, as you point out, "all of the models up to o3-mini-
| high" give an incorrect answer, while other comments say that
| OpenAI's later models give correct answers, with web citations.
| So it would seem to follow that "recent AI model progress"
| actually made a verifiable improvement in this case.
| saurik wrote:
| I am pretty sure that they must have meant "up through", not
| "up to", as the answer from o3-mini-high is also wrong in a
| way which seems to fit the same description, no?
| hn_throwaway_99 wrote:
| I tried with 4o and it gave me what I thought was a correct
| answer:
|
| > Paul Newman was not publicly known for having major
| problems with alcohol in the way some other celebrities
| have been. However, he was open about enjoying drinking,
| particularly beer. He even co-founded a line of food
| products (Newman's Own) where profits go to charity, and he
| once joked that he consumed a lot of the product himself --
| including beer when it was briefly offered.
|
| > In his later years, Newman did reflect on how he had
| changed from being more of a heavy drinker in his youth,
| particularly during his time in the Navy and early acting
| career, to moderating his habits. But there's no strong
| public record of alcohol abuse or addiction problems that
| significantly affected his career or personal life.
|
| > So while he liked to drink and sometimes joked about it,
| Paul Newman isn't generally considered someone who had
| problems with alcohol in the serious sense.
|
| As others have noted, LLMs are much more likely to be
| cautious in providing information that could be construed
| as libel. While Paul Newman may have been an alcoholic, I
| couldn't find any articles about it being "public" in the
| same way as for others, e.g. with admitted rehab stays.
| fnordpiglet wrote:
| This is less an LLM thing than an information retrieval
| question. If you choose a model and tell it to "Search," you
| find citation-based analysis showing that he indeed had
| problems with alcohol. I do find it interesting that it
| quibbles over whether he was an alcoholic or not - it seems
| pretty clear from the rest that he was - but regardless. This
| is indicative of something crucial when placing LLMs into a
| toolkit. They are not omniscient, nor are they deductive
| reasoning tools. Information retrieval systems are excellent
| at information retrieval and should be used for information
| retrieval. Solvers are excellent at solving deductive
| problems. Use them. That LLMs are getting better at these
| tasks on their own is cool, but it is IMO a parlor trick,
| since we have nearly optimal or actually optimal techniques
| that don't need an LLM. The LLM should use those tools. So,
| click Search next time you have an information retrieval
| question.
| https://chatgpt.com/share/67f2dac0-3478-8000-9055-2ae5347037...
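| 
| For the "give the model a search tool" pattern, a minimal sketch
| with OpenAI-style function calling looks roughly like this (the
| web_search body is a stub, the model name is an assumption, and
| a real loop would handle the case where no tool gets called):
| 
|     import json
|     from openai import OpenAI
| 
|     client = OpenAI()
| 
|     def web_search(query: str) -> str:
|         # Stub: plug in whatever retrieval backend you have.
|         return "stubbed results for: " + query
| 
|     tools = [{
|         "type": "function",
|         "function": {
|             "name": "web_search",
|             "description": "Search the web, return snippets.",
|             "parameters": {
|                 "type": "object",
|                 "properties": {"query": {"type": "string"}},
|                 "required": ["query"],
|             },
|         },
|     }]
| 
|     msgs = [{"role": "user", "content":
|              "Did Paul Newman have problems with alcohol? "
|              "Cite sources."}]
|     r1 = client.chat.completions.create(
|         model="gpt-4o", messages=msgs, tools=tools)
|     call = r1.choices[0].message.tool_calls[0]
|     msgs.append(r1.choices[0].message)
|     msgs.append({"role": "tool", "tool_call_id": call.id,
|                  "content": web_search(
|                      **json.loads(call.function.arguments))})
|     r2 = client.chat.completions.create(
|         model="gpt-4o", messages=msgs, tools=tools)
|     print(r2.choices[0].message.content)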
| mvdtnz wrote:
| Any information found in a web search about Newman will be
| available in the training set (more or less). It's almost
| certainly a problem of alignment / "safety" causing this
| issue.
| fnordpiglet wrote:
| There's a simpler explanation than that: the model
| weights aren't an information retrieval system, and other
| sequences of tokens are more likely given the totality of
| the training data. This is why for an information retrieval
| task you use an information retrieval tool, just as for
| driving nails you use a hammer rather than a screwdriver.
| It may very well be that you could drive the nail with
| the screwdriver, but why?
| mvdtnz wrote:
| You think that's a simpler explanation? Ok. I think given
| the amount of effort that goes into "safety" on these
| systems that my explanation is vastly more likely than
| somehow this information got lost in the vector soup
| despite being attached to his name at the top of every
| search result[0].
|
| 0 https://www.google.com/search?q=did+paul+newman+have+a+
| drink...
| fnordpiglet wrote:
| Except if safety blocked this, it would have also blocked
| the linked conversation. Alignment definitely distorts
| behaviors of models, but treating them as information
| retrieval systems is using a screwdriver to drive nails.
| Your example didn't refute this.
| simonw wrote:
| "Any information found in a web search about Newman will be
| available in the training set"
|
| I don't think that is a safe assumption these days.
| Training a modern LLM isn't about dumping in everything on
| the Internet. To get a really _good_ model you have to be
| selective about your sources of training data.
|
| They still rip off vast amounts of copyrighted data, but I
| get the impression they are increasingly picky about what
| they dump into their training runs.
| Vanit wrote:
| I realise your answer wasn't assertive, but if I heard this
| from someone actively defending AI it would be a copout. If
| the selling point is that you can ask these AIs anything then
| one can't retroactively go "oh but not that" when a
| particular query doesn't pan out.
| stavros wrote:
| LLMs aren't good at being search engines, they're good at
| understanding things. Put an LLM on top of a search engine, and
| that's the appropriate tool for this use case.
|
| I guess the problem with LLMs is that they're too usable for
| their own good, so people don't realize that they can't
| perfectly know all the trivia in the world, exactly like
| any human.
| MegaButts wrote:
| > LLMs aren't good at being search engines, they're good at
| understanding things.
|
| LLMs are literally fundamentally incapable of understanding
| things. They are stochastic parrots and you've been fooled.
| more_corn wrote:
| For them to work at all they need to have some
| representation of concepts. Recent research at Anthropic
| has shown a surprising complexity in their reasoning
| behavior. Perhaps the parrot here is you.
| bobsmooth wrote:
| What do you call someone that mentions "stochastic parrots"
| every time LLMs are mentioned?
| MegaButts wrote:
| It's the first time I've ever used that phrase on HN.
| Anyway, what phrase do you think works better than
| 'stochastic parrot' to describe how LLMs function?
| karn97 wrote:
| Try to come up with a way to prove humans aren't
| stochastic parrots, then maybe people will start taking
| you seriously. Just childish reddit angst rn, nothing
| else.
| bluefirebrand wrote:
| > Try to come up with a way to prove humans aren't
| stochastic parrots
|
| Look around you
|
| Look at Skyscrapers. Rocket ships. Agriculture.
|
| If you want to make a claim that humans are nothing more
| than stochastic parrots then you need to explain where
| all of this came from. What were we parroting?
|
| Meanwhile all that LLMs do is parrot things that _humans_
| created
| brookst wrote:
| It's good rhetoric but bad analogy. LLMs can be very
| creative (to the point of failure, in hallucinations).
|
| I don't know if there is a pithy shirt phrase to
| accurately describe how LLMs function. Can you give me a
| similar one for how humans think? That might spur my own
| creativity here.
| fancyfredbot wrote:
| That makes me think, has anyone ever heard of an actual
| parrot which wasn't stochastic?
|
| I'm fairly sure I've never seen a deterministic parrot
| which makes me think the term is tautological.
| fancyfredbot wrote:
| We're talking about a stochastic parrot which in many
| circumstances responds in a way which is indistinguishable
| from actual understanding.
| MegaButts wrote:
| I've always been amazed by this. I have never not been
| frustrated with the profound stupidity of LLMs. Obviously
| I must be using them differently, because I've never been
| able to trust them with anything, and more than half the
| time I fact-check one, even for information retrieval, it's
| objectively incorrect.
| karn97 wrote:
| It's ok to be paranoid.
| MegaButts wrote:
| Fact checking is paranoia?
| fancyfredbot wrote:
| If you got as far as checking the output it must have
| appeared to understand your question.
|
| I wouldn't claim LLMs are good at being factual, or good
| at arithmetic, or at drawing wine glasses, or that they
| are "clever". What they are very good at is responding to
| questions in a way which gives you the very strong
| impression they've understood you.
| MegaButts wrote:
| I vehemently disagree. If I ask a question with an
| objective answer, and it simply makes something up and is
| very confident the answer is correct, what the fuck has
| it understood other than how to piss me off?
|
| It clearly doesn't understand that the question has a
| correct answer, or that it does not know the answer. It
| also clearly does not understand that I hate bullshit, no
| matter how many dozens of times I prompt it to not make
| something up and would prefer an admittance of ignorance.
| fancyfredbot wrote:
| It didn't understand you but the response was plausible
| enough to require fact checking.
|
| Although that isn't literally indistinguishable from
| 'understanding' (because your fact checking easily
| discerned that) it suggests that at a surface level it
| did appear to understand your question and knew what a
| plausible answer might look like. This is not necessarily
| useful but it's quite impressive.
| MegaButts wrote:
| There are times it just generates complete nonsense that
| has nothing to do with what I said, but it's certainly
| not most of the time. I do not know how often, but I'd
| say it's definitely under 10% and almost certainly under
| 5% that the above happens.
|
| Sure, LLMs are incredibly impressive from a technical
| standpoint. But they're so fucking stupid I hate using
| them.
|
| > This is not necessarily useful but it's quite
| impressive.
|
| I think we mostly agree on this. Cheers.
| mitthrowaway2 wrote:
| What does the word "understand" mean to you?
| MegaButts wrote:
| An ability to answer questions with a train of thought
| showing how the answer was derived, or the self-awareness
| to recognize you do not have the ability to answer the
| question and declare as much. More than half the time
| I've used LLMs they will simply make answers up, and when
| I point out the answer is wrong it simply regurgitates
| another incorrect answer ad nauseam (regularly cycling
| through answers I've already pointed out are incorrect).
|
| Rather than give you a technical answer - if I ever feel
| like an LLM can recognize its limitations rather than
| make something up, I would say it understands. In my
| experience LLMs are just algorithmic bullshitters. I
| would consider a function that just returns "I do not
| understand" to be an improvement, since most of the time
| I get confidently incorrect answers instead.
|
| Yes, I read Anthropic's paper from a few days ago. I
| remain unimpressed until talking to an LLM isn't a
| profoundly frustrating experience.
| mitthrowaway2 wrote:
| I just want to say that's a much better answer than I
| anticipated!
| the8472 wrote:
| A stochastic parrot with a sufficiently tiny residual error
| rate needs a stochastic model that compresses the world so
| precisely, and decompression algorithms so sophisticated,
| that it could be called reasoning.
| 
| Take two 4K frames of a falling vase and ask a model to
| predict the next token... I mean the following images. Your
| model now needs to include some approximation of physics -
| and the ability to apply it correctly - to produce a
| _realistic_ outcome. I'm not aware of any model capable of
| doing that, but that's what it would mean to predict the
| unseen with high enough fidelity.
| more_corn wrote:
| Ironically, though, an LLM-powered search engine (some word
| about being perplexed) is becoming way better than the
| undisputed king of traditional search engines (something
| oogle).
| stavros wrote:
| That's because they put an LLM over a traditional search
| engine.
| aspenmayer wrote:
| Google Labs has AI Mode now, apparently.
|
| https://labs.google.com/search/experiment/22
| stavros wrote:
| Hm, that's not available to me; what is it? If it's an LLM
| over Google, didn't they release that a few months ago
| already?
| aspenmayer wrote:
| US only for now may be the issue?
|
| It expands what they had before with AI Overviews, but
| I'm not sure how new either of those are. It showed up
| for me organically as an AI Mode tab on a native Google
| search in Firefox ironically.
|
| https://support.google.com/websearch/answer/16011537
| stavros wrote:
| Very interesting, thank you!
| aspenmayer wrote:
| No worries.
|
| What happens if you go directly to
| https://google.com/aimode ?
| stavros wrote:
| It asks me to change some permissions, but that help page
| says this is only available in the US, so I suppose I'll
| get blocked right after I change them.
| r_klancer wrote:
| Gemini (2.5 Pro):
|
| "Yes, Paul Newman was widely known for being a heavy drinker,
| particularly of beer. He himself acknowledged his significant
| alcohol consumption."
|
| The answer I got (https://gemini.google.com/share/9e327dc4be03)
| includes references such as
| https://apnews.com/article/entertainment-reviews-movies-paul...
| and https://www.psychologytoday.com/us/blog/the-playing-
| field/20... although they are redacted from the public-sharing
| link.
| r_klancer wrote:
| Though a local model I'm running (gemma-3-27b-it;
| https://huggingface.co/lmstudio-community/gemma-3-27b-it-
| GGU...) just told me various correct sounding bits about his
| history with alcohol (correctly citing his alma mater and
| first wife), but threw in:
|
| "Sobriety & AA: Newman got sober in 1964 and remained so for
| the rest of his life."
|
| Which doesn't check out. And it includes plausible but
| completely hallucinated URLs (as well as a valid
| biography.com URL that completely omits information about
| alcohol.)
| smahs wrote:
| Gemma 3 4B (QAT quant): Yes, Paul Newman was indeed known
| to have struggled with alcohol throughout his life. While
| he maintained a public image of a charming, clean-cut star,
| he privately battled alcoholism for many years. He sought
| treatment in the late 1980s and early 1990s and was
| reportedly very open about his struggles and the importance
| of seeking help.
| tim333 wrote:
| Perplexity:
|
| >Paul Newman is indeed known for having struggled with
| alcohol during his life. Accounts from various sources,
| including his own memoir and the documentary ...
| (https://www.perplexity.ai/search/is-paul-newman-known-for-
| ha...)
|
| I guess there's something about ChatGPT's set up that makes
| it different? Maybe they wanted it to avoid libeling people?
| aldanor wrote:
| Grok:
|
| > Yes, Paul Newman was known to have struggled with alcohol
| at certain points in his life. In his early years,
| particularly during his time in the Navy and into his
| acting career, Newman admitted to heavy drinking. He was
| open about his fondness for beer and once jokingly referred
| to himself as a "functional alcoholic." In a 1988 interview
| with The New York Times, he acknowledged that he had a
| period where he drank too much, stating, "I was a very good
| drinker. I could put it away." ...
|
| https://grok.com/share/bGVnYWN5_86843e8a-39ee-415d-8785-4f8
| e...
| kayo_20211030 wrote:
| This may have hit the nail on the head about the weaknesses of
| LLMs.
| 
| They're going to regurgitate something based not so much on
| facts as on things that are accessible as _perceived_ facts.
| Those might be right, but they might also be wrong, and no one
| can tell without doing the hard work of checking original
| sources. Much of what is considered accepted fact, and is also
| accessible to LLM harvesting, is at best derived fact, often
| mediated by motivated individuals and published to accessible
| sources by "people with an interest".
| 
| The weightings used by any AI should be based on the _facts_,
| and not on the compounded volume of derived, "mediated", or
| "directed" _facts_ - simply because those aren't really facts;
| they're reports.
| 
| It all seems like dumber, lazier search engine stuff. Honestly,
| what do I know about Paul Newman? But Joanne Woodward and
| others who knew and worked with him should be weighted as
| being at least slightly more credible than others, no matter
| how many text patterns "catch the match".
| Alive-in-2025 wrote:
| These models are not reliable sources of information. They are
| either out of date, subject to hallucination, or just plain
| wrong for a variety of reasons. They are untrustworthy for
| factual questions like this.
| 
| I appreciate how you considered the question, explained it, and
| understood these nuances. But please - do not trust ChatGPT
| etc. I continue to be frustrated by the endless people claiming
| something is true because ChatGPT said it. I support the
| conclusions of this author.
| jonomacd wrote:
| Looks like you are using the wrong models
|
| https://g.co/gemini/share/ffa5a7cd6f46
| iambateman wrote:
| The core point in this article is that the LLM wants to report
| _something_, and so it tends to exaggerate. It's not very good at
| saying "no", or at least not as good as a programmer would hope.
|
| When you ask it a question, it tends to say yes.
|
| So while the LLM arms race is incrementally increasing benchmark
| scores, those improvements are illusory.
|
| The real challenge is that LLMs fundamentally want to seem
| agreeable, and that's not improving. So even if the model gets an
| extra 5/100 math problems right, it feels about the same in a
| series of prompts which are more complicated than just a ChatGPT
| scenario.
|
| I would say the industry knows it's missing a tool but doesn't
| know what that tool is yet. Truly agentic performance is getting
| better (Cursor is amazing!) but it's still evolving.
|
| I totally agree that the core benchmarks that matter should be
| ones which evaluate a model in agentic scenarios, not just on the
| basis of individual responses.
| fnordpiglet wrote:
| ... deleted ... (Sorry the delete isn't working, meant for
| another subthread)
| bluefirebrand wrote:
| > The real challenge is that the LLM's fundamentally want to
| seem agreeable, and that's not improving
|
| LLMs fundamentally do not want to seem anything
|
| But the companies that are training them and making models
| available for professional use sure want them to seem agreeable
| lukev wrote:
| This is a bit of a meta-comment, but reading through the
| responses to a post like this is really interesting because it
| demonstrates how our collective response to this stuff is (a)
| wildly divergent and (b) entirely anecdote-driven.
|
| I have my own opinions, but I can't really say that they're not
| also based on anecdotes and personal decision-making heuristics.
|
| But some of us are going to end up right and some of us are going
| to end up wrong and I'm really curious what features signal an
| ability to make "better choices" w/r/t AI, even if we don't know
| (or can't prove) what "better" is yet.
| lherron wrote:
| Agreed! And with all the gaming of the evals going on, I think
| we're going to be stuck with anecdotal for some time to come.
|
| I do feel (anecdotally) that models are getting better on every
| major release, but the gains certainly don't seem evenly
| distributed.
|
| I am hopeful the coming waves of vertical
| integration/guardrails/grounding applications will move us away
| from having to hop between models every few weeks.
| InkCanon wrote:
| Frankly the overarching story about evals (which receives
| very little coverage) is how much gaming is going on. On the
| recent USAMO 2025, SOTA models scored 5%, despite claiming
| silver/gold in IMOs. And ARC-AGI: one very easy way to
| "solve" it is to generate masses of synthetic examples by
| extrapolating the basic rules of ARC-AGI questions and
| training on those, as sketched below.
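| 
| A toy illustration of that synthetic-example approach (one
| hand-written rule family, a fixed color permutation; everything
| here is illustrative, and real efforts would use many more rule
| families):
| 
|     import random
| 
|     def random_grid(h, w):
|         return [[random.randrange(10) for _ in range(w)]
|                 for _ in range(h)]
| 
|     def apply_color_map(grid, mapping):
|         return [[mapping[c] for c in row] for row in grid]
| 
|     def make_task():
|         mapping = list(range(10))
|         random.shuffle(mapping)          # the hidden "rule"
|         pairs = []
|         for _ in range(4):               # demo pairs + 1 test
|             g = random_grid(random.randint(3, 8),
|                             random.randint(3, 8))
|             pairs.append({"input": g,
|                           "output": apply_color_map(g, mapping)})
|         return {"train": pairs[:-1], "test": [pairs[-1]]}
| 
|     dataset = [make_task() for _ in range(100_000)]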
| FiniteIntegral wrote:
| It's not surprising that responses are anecdotal. An easy way
| to communicate a generic sentiment often requires being brief.
|
| A majority of what makes a "better AI" can be condensed to how
| effective the gradient-descent algorithms are at reaching the
| local maxima we want them to reach. Until a generative model
| shows actual progress at "making decisions", it will forever be
| seen as a glorified linear algebra solver. Generative machine
| learning is all about giving a pleasing answer to the end user,
| not about creating something that is on the level of human
| decision making.
| nialv7 wrote:
| Good observation but also somewhat trivial. We are not
| omniscient gods, ultimately all our opinions and decisions will
| have to be based on our own limited experiences.
| freehorse wrote:
| There is nothing wrong with sharing anecdotal experiences.
| Reading through anecdotal experiences here can help one
| understand whether one's own experiences are relatable or not.
| Moreover, if I have X experience, it could help to know whether
| it is because I am doing something wrong that others have
| figured out.
| 
| Furthermore, as we are talking about the actual impact of LLMs,
| as is the point of the article, a bunch of anecdotal experiences
| may be more valuable than a bunch of benchmarks for figuring it
| out. Also, apart from the right/wrong dichotomy, people use
| LLMs with different goals and contexts. It may not mean that
| some people are doing something wrong if they do not see the
| same impact as others. Every time a web developer says that they
| do not understand how others can be so skeptical of LLMs,
| concludes with certainty that they must be doing something
| wrong, and moves on to explain how to actually use LLMs
| properly, I chuckle.
| dimal wrote:
| It seems like the models are getting more reliable at the things
| they always could do, but they're not showing any ability to move
| past that goalpost. Whereas in the past they could occasionally
| write some very solid code but often returned nonsense, the
| nonsense is now getting adequately filtered by so-called
| "reasoning". But I see no indication that they can do software
| design.
|
| > how the hell is it going to develop metrics for assessing the
| impact of AIs when they're doing things like managing companies
| or developing public policy?
|
| Why on earth do people _want_ AI to do either of these things? As
| if our society isn't fucked enough, having an untouchable
| oligarchy already managing companies and developing public
| policies, we want to have the oligarchy's AI do this, so policy
| can get even more out of touch with the needs of common people?
| This should _never_ come to pass. It's like people read a pile of
| 90s cyberpunk dystopian novels and decided, "Yeah, let's do
| that." I think it'll fail, but I don't understand how anyone with
| less than 10 billion in assets would want this.
| HarHarVeryFunny wrote:
| The disconnect between improved benchmark results and lack of
| improvement on real world tasks doesn't have to imply cheating -
| it's just a reflection of the nature of LLMs, which at the end of
| the day are just prediction systems - these are language models,
| not cognitive architectures built for generality.
|
| Of course, if you train an LLM heavily on narrow benchmark
| domains then its prediction performance will improve on those
| domains, but why would you expect that to improve performance in
| unrelated areas?
|
| If you trained yourself extensively on advanced math, would you
| expect that to improve your programming ability? If not, then why
| would you expect it to improve the programming ability of a far
| less sophisticated "intelligence" (a prediction engine) such as a
| language model?! If you trained yourself on LeetCode programming,
| would you expect that to help with hardening corporate production
| systems?!
| InkCanon wrote:
| That's fair. But look up the recent experiment running SOTA
| models on the then-just-released USAMO 2025 questions. The
| highest score was 5%, while supposedly SOTA last year was IMO
| silver level. There could be some methodological differences -
| i.e. the USAMO paper required correct proofs and not just
| numerical answers. But it really strongly suggests that even
| within limited domains, there's cheating. I'd wager a
| significant amount that if you tested SOTA models on a new ICPC
| set of questions, actual performance would be far, far worse
| than their supposed benchmarks.
| usaar333 wrote:
| > Highest score was 5%, supposedly SOTA last year was IMO
| silver level.
|
| No LLM last year got silver. DeepMind had a highly
| specialized AI system that earned it.
| dkersten wrote:
| I honestly can't notice any difference in output quality between
| GPT-4o and GPT-4.5. I also can't notice any difference in
| programming quality in Cursor when using Claude 3.7 vs 3.5. I'm
| told there is a clear difference, but I don't notice it.
| mentalgear wrote:
| Who would assume that LLM companies would hyper-optimise on
| public benchmarks to make their share prices go up and keep the
| bubble afloat... What an unserious thought to maintain...
| einrealist wrote:
| LeCun criticized LLM technology recently in a presentation:
| https://www.youtube.com/watch?v=ETZfkkv6V7Y
|
| The accuracy problem won't just go away. Increasing accuracy is
| only getting more expensive. This sets the limits for useful
| applications. And casual users might not even care and use LLMs
| anyway, without reasonable result verification. I fear a future
| where overall quality is reduced. Not sure how many people /
| companies would accept that. And AI companies are getting too big
| to fail. Apparently, the US administration does not seem to care
| when they use LLMs to define tariff policy....
| pclmulqdq wrote:
| I don't know why anyone is surprised that a statistical model
| isn't getting 100% accuracy. The fact that statistical models
| of text are good enough to do _anything_ should be shocking.
| whilenot-dev wrote:
| I think the surprising aspect is rather how people are
| praising 80-90% accuracy as the next leap in technological
| advancement. Quality is already in decline, despite LLMs, and
| programming was always a discipline where correctness and
| predictability mattered. It's an advancement for efficiency,
| sure, but at a yet-unknown cost to stability. I'm thinking
| about all simulations based on applied mathematical concepts
| and all the accumulated hours fixing bugs - there's now this
| certain aftertaste, sweet for some living their lives
| efficiently, but very bitter for the ones relying on
| stability.
| einrealist wrote:
| That "good enough" is the problem. It requires context. And
| AI companies are selling us that "good enough" with
| questionable proof. And they are selling grandiose visions to
| investors, but move the goalposts again and again.
|
| A lot of companies made Copilot available to their workforce.
| I doubt that the majority of users understand what a
| statistical model means. The casual, technically
| inexperienced user just assumes that a computer answer is
| always right.
| delusional wrote:
| > Sometimes the founder will apply a cope to the narrative ("We
| just don't have any PhD level questions to ask")
|
| Please tell me this is not what tech-bros are going around
| telling each other! Are we implying that the problems in the
| world, the things that humans collectively work on to maintain
| the society that took us thousands of years to build up, just
| aren't hard enough to reach the limits of the AI?
|
| Jesus Christ.
| bcoates wrote:
| I mean... most businesses, particularly small businesses and
| startups, aren't exactly doing brain surgery on a rocketship.
|
| It's pretty likely that they have extremely dull problems like
| "running an inbound call center is a lot of work" or "people
| keep having their mail stolen and/or lying that they did" that
| "more smarter gpus" won't solve
| timewizard wrote:
| Government announces critical need to invest in AI and sets aside
| a bunch of money for this purpose.
|
| Suddenly the benchmarks become detached from reality and vendors
| can claim whatever they want about their "new" products.
|
| Just as a possible explanation, as I feel like I've seen this
| story before.
| ants_everywhere wrote:
| There are real and obvious improvements in the past few model
| updates and I'm not sure what the disconnect there is.
|
| Maybe it's that I _do_ have PhD-level questions to ask them, and
| they've gotten much better at those.
|
| But I suspect that these anecdotes are driven by something else.
| Perhaps people found a workable prompt strategy by trial and
| error on an earlier model and it works less well with later
| models.
|
| Or perhaps they have a time-sensitive task and are not able to
| take advantage of the thinking of modern LLMs, which have a slow
| thinking-based feedback loop. Or maybe their code base is getting
| more complicated, so it's harder to reason about.
|
| Or perhaps they're giving the LLMs a poorly defined task that
| older models made assumptions about, but whose ambiguity newer
| models understand, and so they find the space of solutions
| harder to navigate.
|
| Since this is ultimately from a company doing AI scanning for
| security, I would think the latter plays a role to some extent.
| Security is insanely hard and the more you know about it the
| harder it is. Also adversaries are bound to be using AI and are
| increasing in sophistication, which would cause lower efficacy
| (although you could tease this effect out by trying older models
| with the newer threats).
| pclmulqdq wrote:
| In the last year, things like "you are an expert on..." have
| gotten much less effective in my private tests, while actually
| describing the problem precisely has gotten better at producing
| results.
|
| In other words, the lazy prompt-engineering hacks are becoming
| less effective. Domain expertise is becoming more effective.
| ants_everywhere wrote:
| Yes, that would explain the effect, I think. I'll try that out
| this week.
| InkCanon wrote:
| The biggest story in AI broke a few weeks ago but got little
| attention: on the recent USAMO, SOTA models scored around 5% on
| average (IIRC; it was some abysmal number). This is despite them
| supposedly having gotten 50%, 60%, etc. on IMO questions. This
| strongly suggests the models simply memorized past results
| instead of actually solving these questions. I'm incredibly
| surprised no one mentions this, and it's ridiculous that these
| companies never tell us what (if any) efforts have been made to
| remove test data (IMO, ICPC, etc.) from training data.
| AIPedant wrote:
| Yes, here's the link: https://arxiv.org/abs/2503.21934v1
|
| Anecdotally, I've been playing around with o3-mini on
| undergraduate math questions: it is much better at
| "plug-and-chug" proofs than GPT-4, but those problems aren't
| independently interesting; they are explicitly pedagogical. For
| anything requiring insight, it's either:
|
| 1) A very good answer that reveals the LLM has seen the problem
| before (e.g. naming the theorem, presenting a "standard" proof,
| using a much more powerful result)
|
| 2) A bad answer that looks correct and takes an enormous amount
| of effort to falsify. (This is the secret sauce of LLM hype.)
|
| I dread undergraduate STEM majors using this thing - I asked it
| a problem about rotations and spherical geometry, but got back
| a pile of advanced geometric algebra, when I was looking for
| "draw a spherical triangle." If I didn't know the answer, I
| would have been badly confused. See also this real-world
| example of an LLM leading a recreational mathematician astray:
| https://xcancel.com/colin_fraser/status/1900655006996390172#...
|
| I will add that in 10 years the field will be intensely
| criticized for its reliance on multiple-choice benchmarks; it
| is not surprising or interesting that next-token prediction can
| game multiple-choice questions!
| simonw wrote:
| I had to look up these acronyms:
|
| - USAMO - United States of America Mathematical Olympiad
|
| - IMO - International Mathematical Olympiad
|
| - ICPC - International Collegiate Programming Contest
|
| Relevant paper: https://arxiv.org/abs/2503.21934 - "Proof or
| Bluff? Evaluating LLMs on 2025 USA Math Olympiad" submitted
| 27th March 2025.
| usaar333 wrote:
| And then within a week, Gemini 2.5 was tested and got 25%. The
| point is that AI is getting stronger.
|
| And this only suggests that LLMs aren't trained well to write
| formal math proofs, which is true.
| AstroBen wrote:
| This seems fairly obvious at this point. If they were actually
| reasoning _at all_, they'd be capable of playing complex games
| like chess (even if not well).
|
| Instead they're barely able to eke out wins against a bot that
| plays completely random moves:
| https://maxim-saplin.github.io/llm_chess/
| bglazer wrote:
| Yeah I'm a computational biology researcher. I'm working on a
| novel machine learning approach to inferring cellular behavior.
| I'm currently stumped why my algorithm won't converge.
|
| So I described the mathematics to ChatGPT o3-mini-high to try to
| help reason about what's going on. It was almost completely
| useless: blog-slop "intro to ML" solutions and ideas. It ignored
| all the mathematical context, zeroed in on "doesn't converge",
| and suggested that I lower the learning rate. Like, no
| shit I tried that three weeks ago. No amount of cajoling can
| get it to meaningfully "reason" about the problem, because it
| hasn't seen the problem before. The closest point in latent
| space is apparently a thousand identical Medium articles about
| Adam, so I get the statistical average of those.
|
| I can't stress how frustrating this is, especially with people
| like Terence Tao saying that these models are like a mediocre
| grad student. I would really love to have a mediocre (in
| Terry's eyes) grad student looking at this, but I can't seem to
| elicit that. Instead I get low tier ML blogspam author.
|
| **PS** if anyone read this far (doubtful) and knows about
| density estimation and wants to help, my email is
| bglazer1@gmail.com
|
| I promise it's a fun mathematical puzzle, and the biology is
| pretty wild too.
| mmcnl wrote:
| I feel we are already in the era of diminishing returns on LLM
| improvements. Newer models seem to be more sophisticated
| implementations of LLM technology + throwing more resources at
| it, but to me they do not seem fundamentally more intelligent.
|
| I don't think this is a problem though. I think there's a lot of
| low-hanging fruit in building sophisticated systems around
| relatively dumb LLMs. But that sentiment doesn't generate a lot
| of clicks.
| DisjointedHunt wrote:
| Two things can be true at the same time:
|
| 1. Model "performance", judged by proxy metrics of intelligence,
| has improved significantly over the past two years.
|
| 2. These capabilities are yet to be stitched together in the most
| appropriate manner for the cybersecurity scenarios the author is
| talking about.
|
| In my experience, the best use of Transformer models has come
| from deep integration into an appropriate workflow. They do not
| (yet) replace the novel-exploration part of a workflow, but they
| are scarily good at following mid-level reasoning assertions in
| a massively parallelized manner.
|
| The question you should be asking yourself is whether you can
| break your task down into many small chunks, each constrained to
| be feasible to process in time, then group them into appropriate
| buckets or, even better, order them as though you were doing
| those steps with your own expertise - an extension of self.
| Here's how the two approaches differ:
|
| "Find vulnerabilities in this code" -> This will saturate across
| all models because the intent behind this mission is vast and
| loosely defined, while the outcome is expected to be narrow.
|
| "(a) This piece of code should be doing x; what areas is it
| affecting? Let's draw up a perimeter. (b) Here is the dependency
| graph of things upstream and downstream of x; let's spawn a
| collection of thinking chains to evaluate each one for risk
| based on the most recent change . . . (b[n]) Where is this
| likely to fail? (c) (Next step that a pentester/cybersecurity
| researcher would take)"
|
| This has been trial and error in my experience, but it has
| worked great in domains such as financial trading and decision
| support, where experts in the field help sketch out the general
| framework of the process where reasoning support is needed, and
| then constantly iterate on it as though it were an extension of
| themselves.
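| Roughly what I mean, as a minimal Python sketch (the `ask()`
| stub and the function names here are made up, standing in for
| whatever model client and orchestration you actually use):
|
|       def ask(prompt: str) -> str:
|           # Stand-in for your model client of choice.
|           raise NotImplementedError("wire up a real LLM call here")
|
|       def review_change(code: str, intent: str,
|                         dependents: list[str]) -> list[str]:
|           # (a) Pin down what the code should be doing and draw
|           # the perimeter of what it affects.
|           perimeter = ask(
|               f"This code should be doing: {intent}\n"
|               f"What areas is it affecting?\n\n{code}"
|           )
|           # (b) Evaluate each upstream/downstream dependency in
|           # its own narrow, parallelizable chunk.
|           findings = [
|               ask(
|                   f"Affected areas:\n{perimeter}\n"
|                   f"Assess the risk the recent change poses to "
|                   f"`{dep}` and where it is most likely to fail."
|               )
|               for dep in dependents
|           ]
|           # (c) Only then ask for the next step a security
|           # researcher would actually take.
|           findings.append(ask(
|               "Given these per-dependency risk notes, what "
|               "would a pentester check first?\n\n"
|               + "\n---\n".join(findings)
|           ))
|           return findings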
| StickyRibbs wrote:
| There's the politics of the corporations, and then there's the
| business of the science behind LLMs; this article feels like the
| former.
|
| Maybe someone active in the research can comment? I feel like
| all of these comments are just conjecture and anecdote, and
| don't really get to the meat of this question of "progress" and
| the future of LLMs.
| OtherShrezzing wrote:
| Assuming that models getting better at SWE benchmarks and math
| tests will translate into positive outcomes in all other domains
| could be an act of spectacular hubris by the big frontier labs,
| which are themselves chock-full of mathematicians and software
| engineers.
| nialv7 wrote:
| Sounds like someone drank their own Kool-Aid (believing current
| AI can be a security researcher) and then got frustrated when
| they realized they had overhyped themselves.
|
| Current AI just cannot do the kind of symbolic reasoning
| required to find security vulnerabilities in software. It might
| have learned to recognize "bad code" via pattern matching, but
| that's basically it.
| burny_tech wrote:
| In practice, Sonnet 3.7 and Gemini 2.5 are often just much
| better than their competitors.
| jaredcwhite wrote:
| There's some interesting information and analysis to start off
| this essay, then it ends with:
|
| "These machines will soon become the beating hearts of the
| society in which we live. The social and political structures
| they create as they compose and interact with each other will
| define everything we see around us."
|
| This sounds like an article of faith to me. One could just as
| easily say they won't become the beating hearts of anything, and
| instead we'll choose to continue to build a better future for
| humans, as humans, without relying on an overly-hyped technology
| rife with error and unethical implications.
___________________________________________________________________
(page generated 2025-04-06 23:00 UTC)