[HN Gopher] Recent AI model progress feels mostly like bullshit
       ___________________________________________________________________
        
       Recent AI model progress feels mostly like bullshit
        
       Author : paulpauper
       Score  : 161 points
       Date   : 2025-04-06 18:01 UTC (4 hours ago)
        
 (HTM) web link (www.lesswrong.com)
 (TXT) w3m dump (www.lesswrong.com)
        
       | fxtentacle wrote:
       | I'd say most of the recent AI model progress has been on price.
       | 
       | A 4-bit quant of QwQ-32B is surprisingly close to Claude 3.5 in
       | coding performance. But it's small enough to run on a consumer
        | GPU, which means deployment price is now down to $0.10 per
        | hour, from $12+ per hour for models requiring 8x H100s.
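        | 
        | For example, a minimal sketch of serving such a 4-bit GGUF
        | quant locally with llama-cpp-python (the model path, context
        | size, and prompt are assumptions, not a tested recipe):
        | 
        |     # pip install llama-cpp-python (built with GPU support)
        |     from llama_cpp import Llama
        | 
        |     llm = Llama(
        |         # hypothetical local path to a 4-bit quant
        |         model_path="models/qwq-32b-q4_k_m.gguf",
        |         n_gpu_layers=-1,  # offload all layers to the GPU
        |         n_ctx=8192,
        |     )
        | 
        |     out = llm.create_chat_completion(
        |         messages=[{
        |             "role": "user",
        |             "content": "Write a function that reverses "
        |                        "a singly linked list in Python.",
        |         }],
        |         max_tokens=512,
        |         temperature=0.2,
        |     )
        |     print(out["choices"][0]["message"]["content"])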
        
         | shostack wrote:
         | Yeah, I'm thinking of this from a Wardley map standpoint.
         | 
         | What innovation opens up when AI gets sufficiently
         | commoditized?
        
           | mentalgear wrote:
            | Brute force. Brute-force everything, at least in the
            | domains where you can have automatic verification.
        
           | bredren wrote:
           | One thing I've seen is large enterprises extracting money
           | from consumers by putting administrative burden on them.
           | 
           | For example, you can see this in health insurance
           | reimbursements and wireless carriers plan changes. (ie,
           | Verizon's shift from Do More, etc to what they have now)
           | 
            | Companies basically set up circumstances where consumers
            | lose small amounts of money on a recurring basis, or
            | sporadically enough, that people will just pay rather than
            | endure a maze of calls, website navigation, and time suck
            | to recover funds due to them or that shouldn't have been
            | taken in the first place.
            | 
            | I'm hopeful that well-commoditized AI will give consumers a
            | fighting chance against this and other types of
            | disenfranchisement that seem to be increasingly normalized
            | by companies whose consultants do nothing but optimize
            | their own financial position.
        
         | xiphias2 wrote:
         | Have you compared it with 8-bit QwQ-17B?
         | 
          | In my evals, 8-bit quantized smaller Qwen models were better,
          | but then again, evaluating is hard.
        
       | softwaredoug wrote:
       | I think the real meaningful progress is getting ChatGPT 3.5 level
       | quality running anywhere you want rather than AIs getting smarter
       | at high level tasks. This capability being ubiquitous and not
       | tied to one vendor is really what's revolutionary.
        
       | gundmc wrote:
       | This was published the day before Gemini 2.5 was released. I'd be
       | interested if they see any difference with that model.
       | Anecdotally, that is the first model that really made me go wow
       | and made a big difference for my productivity.
        
         | jonahx wrote:
         | I doubt it. It still flails miserably like the other models on
         | anything remotely hard, even with plenty of human coaxing. For
         | example, try to get it to solve:
         | https://www.janestreet.com/puzzles/hall-of-mirrors-3-index/
        
           | Xenoamorphous wrote:
           | I'd say the average person wouldn't understand that problem,
           | let alone solve it.
        
           | flutas wrote:
            | FWIW, 2.5-exp was the only one that managed to get a
            | problem I asked right, compared to Claude 3.7 and o1 (or
            | any of the other free models in Cursor).
           | 
           | It was reverse engineering ~550MB of Hermes bytecode from a
           | react native app, with each function split into a separate
           | file for grep-ability and LLM compatibility.
           | 
            | The others would all start off right, then quickly default
            | to just grepping randomly for what they expected it to be,
            | which failed quickly. 2.5 traced the function all the way
            | back to the networking call and provided the expected
            | response payload.
            | 
            | All the others hallucinated the networking response I was
            | trying to figure out. 2.5 provided it exactly, which was
            | enough for me to intercept the request and use the response
            | it gave to get what I wanted to show up.
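            | 
            | For illustration, a rough sketch of that per-function
            | split in Python (the input filename and the
            | "Function<...>" header regex are assumptions about the
            | disassembler's output format):
            | 
            |     import re
            |     from pathlib import Path
            | 
            |     # Split a Hermes disassembly dump into one file per
            |     # function so plain grep (and an LLM with file-level
            |     # tools) can navigate it.
            |     dump = Path("bundle.hasm").read_text(errors="ignore")
            |     out_dir = Path("functions")
            |     out_dir.mkdir(exist_ok=True)
            | 
            |     header = re.compile(r"^Function<(?P<name>[^>]*)>.*$",
            |                         re.MULTILINE)
            |     matches = list(header.finditer(dump))
            | 
            |     for i, m in enumerate(matches):
            |         end = (matches[i + 1].start()
            |                if i + 1 < len(matches) else len(dump))
            |         name = m.group("name") or f"anon_{i}"
            |         safe = re.sub(r"[^A-Za-z0-9_.-]", "_", name)[:80]
            |         path = out_dir / f"{i:06d}_{safe}.hasm"
            |         path.write_text(dump[m.start():end])
            | 
            |     print(f"wrote {len(matches)} files to {out_dir}/")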
        
             | arkmm wrote:
             | How did you fit 550MB of bytecode into the context window?
             | Was this using 2.5 in an agentic framework? (i.e. repeated
             | model calls and tool usage)
        
         | georgemcbay wrote:
         | As someone who was wildly disappointed with the hype around
         | Claude 3.7, Gemini 2.5 is easily the best programmer-assistant
         | LLM available, IMO.
         | 
         | But it still feels more like a small incremental improvement
         | rather than a radical change, and I still feel its limitations
         | constantly.
         | 
          | Like... it gives me the sort of decent but uninspired
          | solution I would expect it to generate, without predictably
          | walking me through a bunch of obvious wrong turns that I have
          | to repeatedly correct, as I would have had to do with earlier
          | models.
         | 
         | And that's certainly not nothing and makes the experience of
         | using it much nicer, but I'm still going to roll my eyes
         | anytime someone suggests that LLMs are the clear path to
         | imminently available AGI.
        
         | usaar333 wrote:
          | Yeah, I find it hard to imagine this aging well. Gemini 2.5
          | solved (or at least did much better on) multiple real-world
          | systems questions I've had in the past that other models
          | could not. Its visual reasoning also jumped significantly on
          | charts (e.g. planning around train schedules).
          | 
          | Even Sonnet 3.7 was able to do refactoring work on my
          | codebase that Sonnet 3.6 could not.
          | 
          | Really not seeing the "LLMs not improving" story.
        
       | boxed wrote:
       | > So maybe there's no mystery: The AI lab companies are lying,
       | and when they improve benchmark results it's because they have
       | seen the answers before and are writing them down. [...then says
       | maybe not...]
       | 
        | Well... they've been caught red-handed doing exactly this,
        | again and again. Fool me once, shame on you; fool me 100 times,
        | shame on me.
        
         | smnplk wrote:
          | Fool me once, shame on you... If fooled, you can't get fooled
          | again.
         | 
         | https://www.youtube.com/shorts/LmFN8iENTPc
        
         | drdaeman wrote:
          | Hate to say this, but the incentive is growth, not progress.
          | Progress is what enabled the growth, but it is also extremely
          | hard to plan and deliver. Hype, on the other hand, is a
          | somewhat easier and well-tested approach, so it's no surprise
          | that a lot of the effort goes into marketing. Markets have
          | repeatedly confirmed that there aren't any significant
          | immediate repercussions for cranking up the BS level in
          | marketing materials, while there are some rewards when it
          | works.
        
       | djha-skin wrote:
       | > Since 3.5-sonnet, we have been monitoring AI model
       | announcements, and trying pretty much every major new release
       | that claims some sort of improvement. Unexpectedly by me, aside
       | from a minor bump with 3.6 and an even smaller bump with 3.7,
       | literally none of the new models we've tried have made a
       | significant difference on either our internal benchmarks or in
       | our developers' ability to find new bugs. This includes the new
       | test-time OpenAI models.
       | 
       | This is likely a manifestation of the bitter lesson[1],
       | specifically this part:
       | 
       | > The ultimate reason for this is Moore's law, or rather its
       | generalization of continued exponentially falling cost per unit
       | of computation. Most AI research has been conducted as if the
       | computation available to the agent were constant (in which case
       | leveraging human knowledge would be one of the only ways to
       | improve performance) but, _over a slightly longer time than a
       | typical research project_ [like an incremental model update],
       | massively more computation inevitably becomes available.
       | 
       | (Emphasis mine.)
       | 
        | Since the ultimate success strategy of the scruffies[2], the
        | proponents of search and learning strategies in AI, is Moore's
        | Law, short-term gains using these strategies will be minuscule.
        | It is over at least a five-year period that their gains will be
        | felt the most. The neats win the day in the short term, but the
        | hare in this race will ultimately give way to the steady plod
        | of the tortoise.
       | 
       | 1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
       | 
       | 2:
       | https://en.m.wikipedia.org/wiki/Neats_and_scruffies#CITEREFM...
        
       | ohgr wrote:
        | It's not even approaching the asymptotic line of the promises
        | being made, at any rate achievable for the amount of cash being
        | thrown at it.
       | 
       | Where's the business model? Suck investors dry at the start of a
       | financial collapse? Yeah that's going to end well...
        
         | maccard wrote:
         | > where's the business model?
         | 
         | For who? Nvidia sell GPUs, OpenAI and co sell proprietary
         | models and API access, and the startups resell GPT and Claude
         | with custom prompts. Each one is hoping that the layer above
         | has a breakthrough that makes their current spend viable.
         | 
         | If they do, then you don't want to be left behind, because
         | _everything_ changes. It probably won't, but it might.
         | 
         | That's the business model
        
           | grey-area wrote:
           | That's not a business model, it's a pipe dream.
           | 
            | This bubble will be burst by the Trump tariffs and the end
            | of the ZIRP era. When inflation and a recession hit
            | together, hope-and-dream business models and valuations no
            | longer work.
        
             | maccard wrote:
              | Which one? Nvidia is doing pretty OK selling GPUs, and
              | OpenAI and Anthropic are doing OK selling their models.
              | They're not _viable_ business models yet, but they could
              | be.
        
               | grey-area wrote:
                | NVDA will crash when the AI bubble implodes, and none
                | of those generative AI companies are actually making
                | money, nor will they. They have already hit limiting
                | returns in LLM improvements after staggering
                | investments, and it is clear they are nowhere near
                | general intelligence.
        
               | maccard wrote:
               | All of this can be true, and has nothing to do with them
               | having a business model.
               | 
                | > NVDA will crash when the AI bubble implodes
                | > making money, nor will they
                | > They have already hit limiting returns in LLM
                | > improvements after staggering investments
                | > and it is clear they are nowhere near general
                | > intelligence
               | 
               | These are all assumptions and opinions, and have nothing
               | to do with whether or not they have a business model. You
               | mightn't like their business model, but they do have one.
        
               | grey-area wrote:
                | I consider it a business model if they have plans to
                | make money at some point that are not based on hopium
                | (no sign of that at OpenAI) and are not engaged in
                | fraud like bundling and selling to their own
                | subsidiaries (NVDA).
                | 
                | These are of course just opinions; I'm not sure we can
                | know facts about such companies except in retrospect.
        
           | ohgr wrote:
            | You missed the end of the supply chain: paying users, who
            | magically disappear below market-sustaining levels of sales
            | when asked to pay.
        
             | maccard wrote:
             | I never said it was sustainable, and even if it was, OP
             | asked for a business model. Customers don't need a business
             | model, they're customers.
             | 
             | The same is true for any non essential good or service.
        
             | AstroBen wrote:
             | > Going from $1M ARR to $100M ARR in 12 months, Cursor is
             | the fastest growing SaaS company of all time
             | 
             | Just because it's not reaching the insane hype being pushed
             | doesn't mean it's totally useless
        
       | paulsutter wrote:
        | I'm able to get substantially more coding done than three months
       | ago. This could be largely in the tooling (coding agents, deep
       | research). But the models are better too, for both coding and
       | brainstorming. And tooling counts, to me, as progress.
       | 
       | Learning to harness current tools helps to harness future tools.
       | Work on projects that will benefit from advancements, but can
       | succeed without them.
        
         | dghlsakjg wrote:
         | I'm not sure if I'm able to do more of the hard stuff, but a
         | lot of the easy but time consuming stuff is now easily done by
         | LLMs.
         | 
         | Example: I frequently get requests for data from Customer
         | Support that used to require 15 minutes of my time noodling
         | around writing SQL queries. I can cut that down to less than a
         | minute now.
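          | 
          | Roughly, the workflow is just handing the model the schema
          | and the request and asking for SQL back. A minimal sketch
          | with the OpenAI Python client (the model name, schema, and
          | prompt here are illustrative, not my actual setup):
          | 
          |     from openai import OpenAI
          | 
          |     client = OpenAI()  # assumes OPENAI_API_KEY is set
          | 
          |     # Illustrative schema snippet; the real one comes
          |     # straight from the database.
          |     schema = """
          |     orders(id, customer_id, status, total_cents, created_at)
          |     customers(id, email, plan, created_at)
          |     """
          | 
          |     request = ("Emails of customers on the 'pro' plan with "
          |                "3+ completed orders in the last 30 days.")
          | 
          |     resp = client.chat.completions.create(
          |         model="gpt-4o",
          |         messages=[
          |             {"role": "system",
          |              "content": "Write one read-only PostgreSQL "
          |                         "query. Return only SQL."},
          |             {"role": "user",
          |              "content": f"Schema:\n{schema}\nRequest: {request}"},
          |         ],
          |     )
          |     print(resp.choices[0].message.content)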
        
         | mountainriver wrote:
         | Yes I am a better engineer with every release. I think this is
         | mostly empirically validated
        
       | photochemsyn wrote:
       | Will LLMs end up like compilers? Compilers are also fundamentally
       | important to modern industrial civilization - but they're not
       | profit centers, they're mostly free and open-source outside a few
       | niche areas. Knowing how to use a compiler effectively to write
        | secure and performant software is still a valuable skill - and
       | LLMs are a valuable tool that can help with that process,
       | especially if the programmer is on the steep end of the learning
       | curve - but it doesn't look like anything short of real AGI can
       | do novel software creation without a human constantly in the
       | loop. The same argument applies to new fundamental research, even
       | to reviewing and analyzing new discoveries that aren't in the
       | training corpus.
       | 
       | Wasn't it back in the 1980s that you had to pay $1000s for a good
       | compiler? The entire LLM industry might just be following in the
       | compiler's footsteps.
        
         | lukev wrote:
         | This seems like a probable end state, but we're going to have
         | to stop calling LLMs "artificial intelligence" in order to get
         | there.
        
           | bcoates wrote:
           | Yep. I'm looking forward to LLMs/deepnets being considered a
           | standard GOFAI technique with uses and limitations and not
           | "we asked the God we're building to draw us a picture of a
           | gun and then it did and we got scared"
        
           | mmcnl wrote:
            | Why not? Objectively speaking, LLMs are artificially
            | intelligent. Just because it's not human-level intelligence
            | doesn't mean it's not intelligent.
        
             | lukev wrote:
             | Objectively speaking a chess engine is artificially
             | intelligent. Just because it's not human level doesn't mean
             | it's not intelligent. Repeat for any N of 100s of different
             | technologies we've built. We've been calling this stuff
             | "thinking machines" since Turing and it's honestly just not
             | useful at this point.
             | 
             | The fact is, the phrase "artificial intelligence" is a
             | memetic hazard: it immediately positions the subject of
             | conversation as "default capable", and then forces the
             | conversation into trying to describe what it can't do,
             | which is rarely a useful way to approach it.
             | 
             | Whereas with LLMs (and chess engines and every other tech
             | advancement) it would be more useful to start with what the
             | tech _can_ do and go from there.
        
       | jonahx wrote:
       | My personal experience is right in line with the author's.
       | 
       | Also:
       | 
       | > I think what's going on is that large language models are
       | trained to "sound smart" in a live conversation with users, and
       | so they prefer to highlight possible problems instead of
       | confirming that the code looks fine, just like human beings do
       | when they want to sound smart.
       | 
       | I immediately thought: That's because in _most_ situations this
       | is the purpose of language, at least partially, and LLMs are
       | trained on language.
        
       | billyp-rva wrote:
       | > [T]here are ~basically~ no public benchmarks for security
       | research... nothing that gets at the hard parts of application
       | pentesting for LLMs, which are 1. Navigating a real repository of
       | code too large to put in context, 2. Inferring a target
       | application's security model, and 3. Understanding its
       | implementation deeply enough to learn where that security model
       | is broken.
       | 
       | A few months ago I looked at essentially this problem from a
       | different angle (generating system diagrams from a codebase). My
       | conclusion[0] was the same as here: LLMs really struggle to
       | understand codebases in a holistic way, especially when it comes
       | to the codebase's strategy and purpose. They therefore struggle
       | to produce something meaningful from it like a security
       | assessment or a system diagram.
       | 
       | [0] https://www.ilograph.com/blog/posts/diagrams-ai-can-and-
       | cann...
        
       | sema4hacker wrote:
       | > ...whatever gains these companies are reporting to the public,
       | they are not reflective of economic usefulness or generality.
       | 
       | I'm not surprised, because I don't expect pattern matching
       | systems to grow into something more general and useful. I think
       | LLM's are essentially running into the same limitations that the
       | "expert systems" of the 1980's ran into.
        
       | maccard wrote:
       | My experience as someone who uses LLMs and a coding assist plugin
       | (sometimes), but is somewhat bearish on AI is that GPT/Claude and
       | friends have gotten worse in the last 12 months or so, and local
       | LLMs have gone from useless to borderline functional but still
       | not really usable for day to day.
       | 
       | Personally, I think the models are "good enough" that we need to
       | start seeing the improvements in tooling and applications that
       | come with them now. I think MCP is a good step in the right
       | direction, but I'm sceptical on the whole thing (and have been
       | since the beginning, despite being a user of the tech).
        
       | joelthelion wrote:
       | I've used gemini 2.5 this weekend with aider and it was
       | frighteningly good.
       | 
       | It probably depends a lot on what you are using them for, and in
       | general, I think it's still too early to say exactly where LLMs
       | will lead us.
        
         | jchw wrote:
         | I think overall quality with Gemini 2.5 is not much better than
         | Gemini 2 in my experience. Gemini 2 was already really good,
         | but just like Claude 3.7, Gemini 2.5 goes some steps forward
         | and some steps backwards. It sometimes generates some _really_
         | verbose code even when you tell it to be succinct. I am pretty
          | confident that if you evaluate 2.5 for a bit longer you'll
         | come to the same conclusion eventually.
        
         | mountainriver wrote:
            | Yep, and with what they are doing in Cursor, the agentic
            | stuff is really game changing.
         | 
         | People who can't recognize this intentionally have their heads
         | in the sand
        
           | InkCanon wrote:
           | People are really fundamentally asking two different
           | questions when they talk about AI "importance": AI's utility
           | and AI's "intelligence". There's a careful line between both.
           | 
            | 1) AI undoubtedly has utility. In many agentic uses, it has
            | very significant utility. There's absolute utility and
            | perceived utility, which is more about user experience. In
            | absolute utility, git is likely the single most game-
            | changing piece of software there is. It has probably saved
            | some ten-, maybe eleven-digit sum in engineer hours times
            | salary through how it enables massive teams to work
            | together seamlessly. In user experience, AI is amazing
            | because it can generate so much so quickly. But it is very
            | far from an engineer.
            | 
            | For example, recently I tried to use Cursor to bootstrap a
            | website in NextJS for me. It produced errors it could not
            | fix, and each rewrite seemed to dig it deeper into its own
            | hole. The reasons were quite obvious: a lot of it had to do
            | with NextJS 15 and the breaking changes it introduces in
            | cookies and auth. If you have masses of NextJS code, which
            | is disproportionately older versions and not labeled well
            | with version numbers, it messes up the LLM. Eventually I
            | scrapped what it wrote and did it myself. I don't mean to
            | use this anecdote to say LLMs are useless, but they have
            | pretty clear limitations. They work well on problems with
            | massive data (like front end) that don't require much
            | principled understanding (like understanding how NextJS 15
            | would break so-and-so's auth). Another example: when I
            | tried to use it to generate flags for a V8 build, it failed
            | horribly and would simply hallucinate flags all the time.
            | Very likely this was because (despite the existence of a
            | list of V8 flags online) many flags had very close
            | representations in the vector embeddings, and there was
            | close to zero data or detailed examples on their use.
           | 
            | 2) On the more theoretical side, the performance of LLMs on
            | benchmarks (claiming to be elite IMO solvers, competitive
            | programming solvers) has become incredibly suspicious. When
            | the new USAMO 2025 was released, the highest score was 5%,
            | despite claims a year ago that SOTA was already at IMO
            | silver level. This is against a backdrop of exponentially
            | more compute and data being fed in. Combined with
            | apparently diminishing returns, it suggests that the gains
            | from all of that are running really thin.
        
       | throw310822 wrote:
       | I hope it's true. Even if LLMs development stopped now, we would
       | still keep finding new uses for them at least for the next ten
       | years. The technology is evolving way faster than we can
       | meaningfully absorb it and I am genuinely frightened by the
       | consequences. So I hope we're hitting some point of diminishing
        | returns, although I don't believe it one bit.
        
       | a3w wrote:
       | For three years now, my experience with LLMs has been "mostly
       | useless, prefer ELIZA".
       | 
        | ELIZA is software written in 1966, though the web version is a
        | little newer. It does occasional psychotherapy assistance or
        | brainstorming just as well, and I more easily know when I have
        | stepped out of its known range into extrapolation.
       | 
       | That said, it can vibe code in a framework unknown to me in half
       | the time that I would need to school myself and add the feature.
       | 
       | Or vibe coding takes twice as long, if I mostly know how to
       | achieve what I want and read no framework documentation but only
       | our own project's source code to add a new feature. But on a day
       | with a headache, I can still call the LLM a dumb twat and ask it
       | to follow my instructions instead of doing bullshit.
       | 
        | But vibe coding always makes my pulse go from 65 to 105 and
        | makes me question my life choices, since instructions are
        | rarely ever followed and loops are never left once entered.
        | Except for the first try, which gets 80% of the structure kinda
        | right before getting stuck for the whole workday.
        
       | aerhardt wrote:
       | My mom told me yesterday that Paul Newman had massive problems
       | with alcohol. I was somewhat skeptical, so this morning I asked
       | ChatGPT a very simple question:
       | 
       | "Is Paul Newman known for having had problems with alcohol?"
       | 
       | All of the models up to o3-mini-high told me he had no known
       | problems. Here's o3-mini-high's response:
       | 
       | "Paul Newman is not widely known for having had problems with
       | alcohol. While he portrayed characters who sometimes dealt with
       | personal struggles on screen, his personal life and public image
       | were more focused on his celebrated acting career, philanthropic
       | work, and passion for auto racing rather than any issues with
       | alcohol. There is no substantial or widely reported evidence in
       | reputable biographies or interviews that indicates he struggled
       | with alcohol abuse."
       | 
       | There is plenty of evidence online that he struggled a lot with
       | alcohol, including testimony from his long-time wife Joanne
       | Woodward.
       | 
       | I sent my mom the ChatGPT reply and in five minutes she found an
       | authoritative source to back her argument [1].
       | 
       | I use ChatGPT for many tasks every day, but I couldn't fathom
        | that it would get something so simple so wrong.
       | 
       | Lesson(s) learned... Including not doubting my mother's movie
       | trivia knowledge.
       | 
       | [1] https://www.newyorker.com/magazine/2022/10/24/who-paul-
       | newma...
        
         | drooby wrote:
         | I asked GPT-4.5 and it searched the web and immediately gave me
         | a "yes" with paragraphs of sources cited.
        
           | john2x wrote:
           | Truth is a probability game. Just keep trying until you
           | arrive.
        
             | Avicebron wrote:
             | "man puts prompt into LLM" > "LLM ships bad code" >
             | "machine fails" > "person dies" > "fire man" > "man puts
             | prompt into LLM"
        
         | lfkdev wrote:
          | That's not really "simple" for an LLM. This is niche
          | information about a specific person. LLMs train on massive
          | amounts of data, and the more present a topic is in that
          | data, the better the answers will be.
          | 
          | Also, you can/should use the "research" mode for questions
          | like this.
        
           | aerhardt wrote:
           | The question is simple and verifiable - it is impressive to
           | me that it's not contained in the LLM's body of knowledge -
           | or rather that it can't reach the answer.
           | 
           | This is niche in the grand scheme of knowledge but Paul
           | Newman is easily one of the biggest actors in history, and
           | the LLM has been trained on a massive corpus that includes
           | references to this.
           | 
           | Where is the threshold for topics with enough presence in the
           | data?
        
             | Max_aaa wrote:
              | The question might be simple and verifiable, but it is
              | not simple for an LLM to mark a particular question as
              | such. This is the tricky part.
              | 
              | An LLM does not care about your question; it is a bunch
              | of math that will spit out a result based on what you
              | typed in.
        
         | permo-w wrote:
         | this seems less like an issue with accuracy and more like an
         | issue with model providers making sure they don't get sued for
         | libel
        
           | aerhardt wrote:
           | I thought about that too.
        
         | ilrwbwrkhv wrote:
          | LLMs will never be good at specific knowledge unless
          | specifically trained for it with narrow "if else" statements.
          | 
          | They're good for broad general overviews, such as the most
          | popular categories of books in the world.
        
           | Rebuff5007 wrote:
            | Really? OpenAI says PhD-level intelligence is just around
            | the corner!
        
             | dadadad100 wrote:
              | If we were to survey 100 PhDs, how many would correctly
              | know that Paul Newman had an alcohol problem?
        
               | AnimalMuppet wrote:
               | I would hope that 100% of them would be able to figure
               | out _how to find out_.
        
               | dadadad100 wrote:
               | Ah, but isn't that the problem here - asking an LLM for
               | facts without requesting a search is like asking a PhD to
               | answer a question "off the top of your head". For pop
               | culture questions the PhD likely brings little value.
        
             | ludwik wrote:
             | I don't think they mean "knowledge" when they talk about
             | "intelligence." LLMs are definitely not knowledge bases.
             | They can transform information given to them in impressive
             | ways, but asking a raw (non-RAG-enabled) LLM to provide its
             | own information will probably always be a mistake.
        
               | AnimalMuppet wrote:
               | They kind of are knowledge bases, just not in the usual
               | way. The knowledge is encoded in the words they were
               | trained on. They weren't trained on words chosen at
               | random; they were trained on words written by humans to
               | encode some information. In fact, that's the only thing
               | that makes LLMs somewhat useful.
        
         | blitzar wrote:
         | Does the as yet unwritten prequel of Idiocracy tell the tale of
         | when we started asking Ai chat bots for facts and this was the
         | point of no return for humanity?
        
           | aerhardt wrote:
           | Can you blame the users for asking it, when everyone is
           | selling that as a key defining feature?
           | 
           | I use it for asking - often very niche - questions on
           | advanced probability and simulation modeling, and it often
           | gets those right - why those and not a simple verifiable fact
           | about one of the most popular actors in history?
           | 
           | I don't know about Idiocracy, but something that I have read
           | specific warnings about is that people will often blame the
            | user for any of the tool's shortcomings.
        
           | pclmulqdq wrote:
           | It turns out there's huge demand for un-monetized web search.
        
             | leereeves wrote:
             | I like that it's unmonetized, of course, but that's not why
             | I use AI. I use AI because it's better at search. When I
             | can't remember the right keywords to find something, or
             | when the keywords aren't unique, I frequently find that web
             | search doesn't return what I need and AI does.
             | 
             | It's impressive how often AI returns the right answer to
             | vague questions. (not always though)
        
               | pclmulqdq wrote:
               | Google used to return the right answer to vague questions
               | until it decided to return the most lucrative answer to
               | vague questions instead.
        
             | spudlyo wrote:
              | Sadly, soon there will be a huge demand for un-monetized
              | LLMs. Enshittification is coming.
        
         | hn_throwaway_99 wrote:
         | So, in other words, are you saying that AI model progress _is_
         | the real deal and is not bullshit?
         | 
         | That is, as you point out, "all of the models up to o3-mini-
         | high" give an incorrect answer, while other comments say that
          | OpenAI's later models give correct answers, with web citations.
         | So it would seem to follow that "recent AI model progress"
         | actually made a verifiable improvement in this case.
        
           | saurik wrote:
           | I am pretty sure that they must have meant "up through", not
           | "up to", as the answer from o3-mini-high is also wrong in a
           | way which seems to fit the same description, no?
        
             | hn_throwaway_99 wrote:
             | I tried with 4o and it gave me what I thought was a correct
             | answer:
             | 
             | > Paul Newman was not publicly known for having major
             | problems with alcohol in the way some other celebrities
             | have been. However, he was open about enjoying drinking,
             | particularly beer. He even co-founded a line of food
             | products (Newman's Own) where profits go to charity, and he
             | once joked that he consumed a lot of the product himself --
             | including beer when it was briefly offered.
             | 
             | > In his later years, Newman did reflect on how he had
             | changed from being more of a heavy drinker in his youth,
             | particularly during his time in the Navy and early acting
             | career, to moderating his habits. But there's no strong
             | public record of alcohol abuse or addiction problems that
             | significantly affected his career or personal life.
             | 
             | > So while he liked to drink and sometimes joked about it,
             | Paul Newman isn't generally considered someone who had
             | problems with alcohol in the serious sense.
             | 
              | As others have noted, LLMs are much more likely to be
              | cautious about providing information that could be
              | construed as libel. While Paul Newman may have been an
              | alcoholic, I couldn't find any articles about it being
              | "public" in the same way as others, e.g. with admitted
              | rehab stays.
        
         | fnordpiglet wrote:
         | This is less an LLM thing than an information retrieval
         | question. If you choose a model and tell it to "Search," you
         | find citation based analysis that discusses that he indeed had
         | problems with alcohol. I do find it interesting it quibbles
         | whether he was an alcoholic or not - it seems pretty clear from
         | the rest that he was - but regardless. This is indicative of
         | something crucial when placing LLMs into a toolkit. They are
         | not omniscient nor are they deductive reasoning tools.
         | Information retrieval systems are excellent at information
         | retrieval and should be used for information retrieval. Solvers
          | are excellent at solving deductive problems. Use them. That
          | LLMs keep getting better at these tasks on their own is cool,
          | but IMO it's a parlor trick, since we have nearly optimal or
          | actually optimal techniques that don't need an LLM. The LLM
          | should use those tools. So, click "Search" next time you have
          | an information retrieval question.
         | https://chatgpt.com/share/67f2dac0-3478-8000-9055-2ae5347037...
        
           | mvdtnz wrote:
           | Any information found in a web search about Newman will be
           | available in the training set (more or less). It's almost
           | certainly a problem of alignment / "safety" causing this
           | issue.
        
             | fnordpiglet wrote:
              | There's a simpler explanation than that: the model
              | weights aren't an information retrieval system, and other
              | sequences of tokens are more likely given the totality of
              | the training data. This is why, for an information
              | retrieval task, you use an information retrieval tool,
              | just as you use a hammer rather than a screwdriver for
              | driving nails. It may very well be that you could drive
              | the nail with the screwdriver, but why?
        
               | mvdtnz wrote:
               | You think that's a simpler explanation? Ok. I think given
               | the amount of effort that goes into "safety" on these
               | systems that my explanation is vastly more likely than
               | somehow this information got lost in the vector soup
               | despite being attached to his name at the top of every
               | search result[0].
               | 
               | 0 https://www.google.com/search?q=did+paul+newman+have+a+
               | drink...
        
               | fnordpiglet wrote:
               | Except if safety blocked this, it would have also blocked
               | the linked conversation. Alignment definitely distorts
              | behaviors of models, but treating them as information
              | retrieval systems is using a screwdriver to drive nails.
              | Your example didn't refute this.
        
             | simonw wrote:
             | "Any information found in a web search about Newman will be
             | available in the training set"
             | 
             | I don't think that is a safe assumption these days.
             | Training modern LLM isn't about dumping in everything on
             | the Internet. To get a really _good_ model you have to be
             | selective about your sources of training data.
             | 
             | They still rip off vast amounts of copyrighted data, but I
             | get the impression they are increasingly picky about what
             | they dump into their training runs.
        
           | Vanit wrote:
           | I realise your answer wasn't assertive, but if I heard this
           | from someone actively defending AI it would be a copout. If
           | the selling point is that you can ask these AIs anything then
           | one can't retroactively go "oh but not that" when a
           | particular query doesn't pan out.
        
         | stavros wrote:
         | LLMs aren't good at being search engines, they're good at
         | understanding things. Put an LLM on top of a search engine, and
         | that's the appropriate tool for this use case.
         | 
          | I guess the problem with LLMs is that they're too usable for
          | their own good, so people don't realize that they can't
          | perfectly know all the trivia in the world, exactly the same
          | as any human.
        
           | MegaButts wrote:
           | > LLMs aren't good at being search engines, they're good at
           | understanding things.
           | 
           | LLMs are literally fundamentally incapable of understanding
           | things. They are stochastic parrots and you've been fooled.
        
             | more_corn wrote:
              | For them to work at all, they need to have some
              | representation of concepts. Recent research at Anthropic
              | has shown surprising complexity in their reasoning
              | behavior. Perhaps the parrot here is you.
        
             | bobsmooth wrote:
             | What do you call someone that mentions "stochastic parrots"
             | every time LLMs are mentioned?
        
               | MegaButts wrote:
               | It's the first time I've ever used that phrase on HN.
               | Anyway, what phrase do you think works better than
               | 'stochastic parrot' to describe how LLMs function?
        
               | karn97 wrote:
               | Try to come up with a way to prove humans aren't
                | stochastic parrots, then maybe people will start taking
                | you seriously. Just childish reddit angst rn, nothing
               | else.
        
               | bluefirebrand wrote:
               | > Try to come up with a way to prove humans aren't
               | stochastic parrots
               | 
               | Look around you
               | 
               | Look at Skyscrapers. Rocket ships. Agriculture.
               | 
               | If you want to make a claim that humans are nothing more
               | than stochastic parrots then you need to explain where
               | all of this came from. What were we parroting?
               | 
               | Meanwhile all that LLMs do is parrot things that _humans_
               | created
        
               | brookst wrote:
               | It's good rhetoric but bad analogy. LLMs can be very
               | creative (to the point of failure, in hallucinations).
               | 
                | I don't know if there is a pithy short phrase that
                | accurately describes how LLMs function. Can you give me
                | a similar one for how humans think? That might spur my
                | own creativity here.
        
               | fancyfredbot wrote:
               | That makes me think, has anyone ever heard of an actual
               | parrot which wasn't stochastic?
               | 
               | I'm fairly sure I've never seen a deterministic parrot
               | which makes me think the term is tautological.
        
             | fancyfredbot wrote:
             | We're talking about a stochastic parrot which in many
             | circumstances responds in a way which is indistinguishable
             | from actual understanding.
        
               | MegaButts wrote:
                | I've always been amazed by this. I have never not been
                | frustrated with the profound stupidity of LLMs.
                | Obviously I must be using them differently, because
                | I've never been able to trust them with anything, and
                | more than half the time I fact-check them, even for
                | information retrieval, they're objectively incorrect.
        
               | karn97 wrote:
               | Its ok to be paranoid
        
               | MegaButts wrote:
               | Fact checking is paranoia?
        
               | fancyfredbot wrote:
               | If you got as far as checking the output it must have
               | appeared to understand your question.
               | 
               | I wouldn't claim LLMs are good at being factual, or good
               | at arithmetic, or at drawing wine glasses, or that they
               | are "clever". What they are very good at is responding to
               | questions in a way which gives you the very strong
               | impression they've understood you.
        
               | MegaButts wrote:
               | I vehemently disagree. If I ask a question with an
               | objective answer, and it simply makes something up and is
               | very confident the answer is correct, what the fuck has
               | it understood other than how to piss me off?
               | 
               | It clearly doesn't understand that the question has a
               | correct answer, or that it does not know the answer. It
               | also clearly does not understand that I hate bullshit, no
               | matter how many dozens of times I prompt it to not make
                | something up and would prefer an admission of ignorance.
        
               | fancyfredbot wrote:
               | It didn't understand you but the response was plausible
               | enough to require fact checking.
               | 
               | Although that isn't literally indistinguishable from
               | 'understanding' (because your fact checking easily
               | discerned that) it suggests that at a surface level it
               | did appear to understand your question and knew what a
               | plausible answer might look like. This is not necessarily
               | useful but it's quite impressive.
        
               | MegaButts wrote:
               | There are times it just generates complete nonsense that
               | has nothing to do with what I said, but it's certainly
               | not most of the time. I do not know how often, but I'd
               | say it's definitely under 10% and almost certainly under
               | 5% that the above happens.
               | 
               | Sure, LLMs are incredibly impressive from a technical
               | standpoint. But they're so fucking stupid I hate using
               | them.
               | 
               | > This is not necessarily useful but it's quite
               | impressive.
               | 
               | I think we mostly agree on this. Cheers.
        
             | mitthrowaway2 wrote:
             | What does the word "understand" mean to you?
        
               | MegaButts wrote:
               | An ability to answer questions with a train of thought
               | showing how the answer was derived, or the self-awareness
               | to recognize you do not have the ability to answer the
               | question and declare as much. More than half the time
               | I've used LLMs they will simply make answers up, and when
               | I point out the answer is wrong it simply regurgitates
               | another incorrect answer ad nauseum (regularly cycling
               | through answers I've already pointed out are incorrect).
               | 
               | Rather than give you a technical answer - if I ever feel
               | like an LLM can recognize its limitations rather than
               | make something up, I would say it understands. In my
               | experience LLMs are just algorithmic bullshitters. I
               | would consider a function that just returns "I do not
               | understand" to be an improvement, since most of the time
               | I get confidently incorrect answers instead.
               | 
               | Yes, I read Anthropic's paper from a few days ago. I
               | remain unimpressed until talking to an LLM isn't a
               | profoundly frustrating experience.
        
               | mitthrowaway2 wrote:
               | I just want to say that's a much better answer than I
               | anticipated!
        
             | the8472 wrote:
              | A stochastic parrot with a sufficiently tiny residual
              | error rate needs a stochastic model that compresses the
              | world so precisely, and decompression algorithms so
              | sophisticated, that it could be called reasoning.
              | 
              | Take two 4K frames of a falling vase and ask a model to
              | predict the next token... I mean the following images.
              | Your model now needs to include some approximation of
              | physics - and the ability to apply it correctly - to
              | produce a _realistic_ outcome. I'm not aware of any model
              | capable of doing that, but that's what it would mean to
              | predict the unseen with high enough fidelity.
        
           | more_corn wrote:
            | Ironically, though, an LLM-powered search engine (some word
            | about being perplexed) is becoming way better than the
            | undisputed king of traditional search engines (something
            | oogle).
        
             | stavros wrote:
             | That's because they put an LLM over a traditional search
             | engine.
        
               | aspenmayer wrote:
               | Google Labs has AI Mode now, apparently.
               | 
               | https://labs.google.com/search/experiment/22
        
               | stavros wrote:
                | Hm, that's not available to me. What is it? If it's an
                | LLM over Google, didn't they release that a few months
                | ago already?
        
               | aspenmayer wrote:
               | US only for now may be the issue?
               | 
               | It expands what they had before with AI Overviews, but
               | I'm not sure how new either of those are. It showed up
               | for me organically as an AI Mode tab on a native Google
               | search in Firefox ironically.
               | 
               | https://support.google.com/websearch/answer/16011537
        
               | stavros wrote:
               | Very interesting, thank you!
        
               | aspenmayer wrote:
               | No worries.
               | 
               | What happens if you go directly to
               | https://google.com/aimode ?
        
               | stavros wrote:
               | It asks me to change some permissions, but that help page
               | says this is only available in the US, so I suppose I'll
               | get blocked right after I change them.
        
         | r_klancer wrote:
         | Gemini (2.5 Pro):
         | 
         | "Yes, Paul Newman was widely known for being a heavy drinker,
         | particularly of beer. He himself acknowledged his significant
         | alcohol consumption."
         | 
         | The answer I got (https://gemini.google.com/share/9e327dc4be03)
         | includes references such as
         | https://apnews.com/article/entertainment-reviews-movies-paul...
         | and https://www.psychologytoday.com/us/blog/the-playing-
         | field/20... although they are redacted from the public-sharing
         | link.
        
           | r_klancer wrote:
           | Though a local model I'm running (gemma-3-27b-it;
           | https://huggingface.co/lmstudio-community/gemma-3-27b-it-
           | GGU...) just told me various correct sounding bits about his
           | history with alcohol (correctly citing his alma mater and
           | first wife), but threw in:
           | 
           | "Sobriety & AA: Newman got sober in 1964 and remained so for
           | the rest of his life."
           | 
           | Which doesn't check out. And it includes plausible but
           | completely hallucinated URLs (as well as a valid
           | biography.com URL that completely omits information about
           | alcohol.)
        
             | smahs wrote:
             | Gemma 3 4B (QAT quant): Yes, Paul Newman was indeed known
             | to have struggled with alcohol throughout his life. While
             | he maintained a public image of a charming, clean-cut star,
             | he privately battled alcoholism for many years. He sought
             | treatment in the late 1980s and early 1990s and was
             | reportedly very open about his struggles and the importance
             | of seeking help.
        
           | tim333 wrote:
           | Perplexity:
           | 
           | >Paul Newman is indeed known for having struggled with
           | alcohol during his life. Accounts from various sources,
           | including his own memoir and the documentary ...
           | (https://www.perplexity.ai/search/is-paul-newman-known-for-
           | ha...)
           | 
           | I guess there's something about ChatGPT's set up that makes
           | it different? Maybe they wanted it to avoid libeling people?
        
             | aldanor wrote:
             | Grok:
             | 
             | > Yes, Paul Newman was known to have struggled with alcohol
             | at certain points in his life. In his early years,
             | particularly during his time in the Navy and into his
             | acting career, Newman admitted to heavy drinking. He was
             | open about his fondness for beer and once jokingly referred
             | to himself as a "functional alcoholic." In a 1988 interview
             | with The New York Times, he acknowledged that he had a
             | period where he drank too much, stating, "I was a very good
             | drinker. I could put it away." ...
             | 
             | https://grok.com/share/bGVnYWN5_86843e8a-39ee-415d-8785-4f8
             | e...
        
         | kayo_20211030 wrote:
          | This may have hit the nail on the head about the weaknesses
          | of LLMs.
         | 
         | They're going to regurgitate something not so much based on
         | facts, but based on things that are accessible as _perceived_
         | facts. Those might be right, but they might be wrong also; and
         | no one can tell without doing the hard work of checking
         | original sources. Many of what are considered accepted facts,
         | and also accessible to LLM harvesting, are at best derived
         | facts, often mediated by motivated individuals, and published
         | to accessible sources by  "people with an interest".
         | 
          | The weightings used by any AI should be based on the _facts_,
          | and not the compounded volume of derived, "mediated", or
          | "directed" _facts_ - simply because they're not really facts;
          | they're reports.
         | 
          | It all seems like dumber, lazier search engine stuff.
          | Honestly, what do I know about Paul Newman? But Joanne
          | Woodward and others who knew and worked with him should be
          | weighted as being at least slightly more credible than
          | others, no matter how many text patterns "catch the match"
          | flow.
        
         | Alive-in-2025 wrote:
          | These models are not reliable sources of information. They
          | are either out of date, subject to hallucination, or just
          | plain wrong for a variety of reasons. They are untrustworthy
          | for facts like this.
          | 
          | I appreciate your consideration of a subjective question, how
          | you explained it, and that you understand these nuances. But
          | please - do not trust ChatGPT etc. I continue to be
          | frustrated by the endless people claiming something is true
          | because ChatGPT said it. I support the conclusions of this
          | author.
        
         | jonomacd wrote:
         | Looks like you are using the wrong models
         | 
         | https://g.co/gemini/share/ffa5a7cd6f46
        
       | iambateman wrote:
        | The core point in this article is that the LLM wants to report
        | _something_, and so it tends to exaggerate. It's not very good
        | at saying "no" - or at least not as good as a programmer would
        | hope.
       | 
       | When you ask it a question, it tends to say yes.
       | 
       | So while the LLM arms race is incrementally increasing benchmark
       | scores, those improvements are illusory.
       | 
        | The real challenge is that LLMs fundamentally want to seem
        | agreeable, and that's not improving. So even if the model gets
        | an extra 5/100 math problems right, it feels about the same in
        | a series of prompts that are more complicated than just a
        | ChatGPT scenario.
       | 
       | I would say the industry knows it's missing a tool but doesn't
       | know what that tool is yet. Truly agentic performance is getting
       | better (Cursor is amazing!) but it's still evolving.
       | 
       | I totally agree that the core benchmarks that matter should be
       | ones which evaluate a model in an agentic scenario, not just on
       | the basis of individual responses.
        
         | fnordpiglet wrote:
         | ... deleted ... (Sorry the delete isn't working, meant for
         | another subthread)
        
         | bluefirebrand wrote:
         | > The real challenge is that LLMs fundamentally want to seem
         | agreeable, and that's not improving
         | 
         | LLMs fundamentally do not want to seem anything
         | 
         | But the companies that are training them and making models
         | available for professional use sure want them to seem agreeable
        
       | lukev wrote:
       | This is a bit of a meta-comment, but reading through the
       | responses to a post like this is really interesting because it
       | demonstrates how our collective response to this stuff is (a)
       | wildly divergent and (b) entirely anecdote-driven.
       | 
       | I have my own opinions, but I can't really say that they're not
       | also based on anecdotes and personal decision-making heuristics.
       | 
       | But some of us are going to end up right and some of us are going
       | to end up wrong and I'm really curious what features signal an
       | ability to make "better choices" w/r/t AI, even if we don't know
       | (or can't prove) what "better" is yet.
        
         | lherron wrote:
         | Agreed! And with all the gaming of the evals going on, I think
         | we're going to be stuck with anecdotes for some time to come.
         | 
         | I do feel (anecdotally) that models are getting better on every
         | major release, but the gains certainly don't seem evenly
         | distributed.
         | 
         | I am hopeful the coming waves of vertical
         | integration/guardrails/grounding applications will move us away
         | from having to hop between models every few weeks.
        
           | InkCanon wrote:
           | Frankly the overarching story about evals (which receives
           | very little coverage) is how much gaming is going on. On the
           | recent USAMO 2025, SOTA models scored 5%, despite claiming
           | silver/gold-level performance on the IMO. And with ARC-AGI,
           | one very easy way to "solve" it is to generate masses of
           | synthetic examples by extrapolating the basic rules of ARC-
           | AGI questions and train on them.
        
         | FiniteIntegral wrote:
         | It's not surprising that responses are anecdotal.
         | Communicating a general sentiment quickly all but requires
         | being brief.
         | 
         | A majority of what makes a "better AI" can be condensed to how
         | effective the gradient-descent algorithms are at reaching the
         | local optima we want them to reach. Until a generative model
         | shows actual progress at "making decisions", it will forever
         | be seen as a glorified linear algebra solver. Generative
         | machine learning is all about giving a pleasing answer to the
         | end user, not about creating something that is on the level of
         | human decision making.
        
         | nialv7 wrote:
         | Good observation but also somewhat trivial. We are not
         | omniscient gods, ultimately all our opinions and decisions will
         | have to be based on our own limited experiences.
        
         | freehorse wrote:
         | There is nothing wrong with sharing anecdotal experiences.
         | Reading through anecdotal experiences here can help one
         | understand how relatable one's own experience is or is not.
         | Moreover, if I have experience X, it could help to know
         | whether it is because I am doing something wrong that others
         | have figured out.
         | 
         | Furthermore, as we are talking about the actual impact of
         | LLMs, which is the point of the article, a bunch of anecdotal
         | experiences may be more valuable than a bunch of benchmarks
         | for figuring that out. Also, beyond the right/wrong dichotomy,
         | people use LLMs with different goals and in different
         | contexts. It does not necessarily mean that some people are
         | doing something wrong if they do not see the same impact as
         | others. Every time a web developer says that they do not
         | understand how others can be so skeptical of LLMs, concludes
         | with certainty that the skeptics must be doing something
         | wrong, and moves on to explain how to actually use LLMs
         | properly, I chuckle.
        
       | dimal wrote:
       | It seems like the models are getting more reliable at the things
       | they could always do, but they're not showing any ability to
       | move past that goalpost. Whereas in the past they could
       | occasionally write some very solid code but often returned
       | nonsense, the nonsense is now getting adequately filtered by so-
       | called "reasoning". But I see no indication that they could do
       | software design.
       | 
       | > how the hell is it going to develop metrics for assessing the
       | impact of AIs when they're doing things like managing companies
       | or developing public policy?
       | 
       | Why on earth do people _want_ AI to do either of these things? As
       | if our society isn't fucked enough, having an untouchable
       | oligarchy already managing companies and developing public
       | policies, we want to have the oligarchy's AI do this, so policy
       | can get even more out of touch with the needs of common people?
       | This should _never_ come to pass. It's like people read a pile of
       | 90s cyberpunk dystopian novels and decided, "Yeah, let's do
       | that." I think it'll fail, but I don't understand how anyone with
       | less than 10 billion in assets would want this.
        
       | HarHarVeryFunny wrote:
       | The disconnect between improved benchmark results and lack of
       | improvement on real world tasks doesn't have to imply cheating -
       | it's just a reflection of the nature of LLMs, which at the end of
       | the day are just prediction systems - these are language models,
       | not cognitive architectures built for generality.
       | 
       | Of course, if you train an LLM heavily on narrow benchmark
       | domains then its prediction performance will improve on those
       | domains, but why would you expect that to improve performance in
       | unrelated areas?
       | 
       | If you trained yourself extensively on advanced math, would you
       | expect that to improve your programming ability? If not, then
       | why would you expect it to improve the programming ability of a
       | far less sophisticated "intelligence" (prediction engine) such
       | as a language model?! If you trained yourself on LeetCode
       | programming, would you expect that to help with hardening
       | corporate production systems?!
        
         | InkCanon wrote:
         | That's fair. But look up the recent experiment testing SOTA
         | models on the then-just-released USAMO 2025 questions. Highest
         | score was 5%, supposedly SOTA last year was IMO silver level.
         | There could be some methodological differences - i.e. the
         | USAMO paper required correct proofs and not just numerical
         | answers. But it strongly suggests that, even within limited
         | domains, there is cheating going on. I'd wager a significant
         | amount that if you tested SOTA models on a new ICPC set of
         | questions, actual performance would be far, far worse than
         | their supposed benchmarks.
        
           | usaar333 wrote:
           | > Highest score was 5%, supposedly SOTA last year was IMO
           | silver level.
           | 
           | No LLM got silver last year. DeepMind earned that with a
           | highly specialized AI system.
        
       | dkersten wrote:
       | I honestly can't notice any difference in output quality between
       | GPT-4o and GPT-4.5. I also can't notice any difference in
       | programming quality in Cursor when using Claude 3.7 vs 3.5. I'm
       | told there is a clear difference, but I don't notice it.
        
       | mentalgear wrote:
       | Who would ever assume that LLM companies would hyper-optimise
       | for public benchmarks to make their share prices go up and keep
       | the bubble afloat... What an unserious thought to entertain...
        
       | einrealist wrote:
       | LeCun criticized LLM technology recently in a presentation:
       | https://www.youtube.com/watch?v=ETZfkkv6V7Y
       | 
       | The accuracy problem won't just go away. Increasing accuracy is
       | only getting more expensive. This sets the limits for useful
       | applications. And casual users might not even care, using LLMs
       | anyway without reasonable verification of results. I fear a
       | future where overall quality is reduced. I'm not sure how many
       | people or companies would accept that. And AI companies are
       | getting too big to fail. Apparently the US administration does
       | not care either, using LLMs to define tariff policy....
        
         | pclmulqdq wrote:
         | I don't know why anyone is surprised that a statistical model
         | isn't getting 100% accuracy. The fact that statistical models
         | of text are good enough to do _anything_ should be shocking.
        
           | whilenot-dev wrote:
           | I think the surprising aspect is rather how people are
           | praising 80-90% accuracy as the next leap in technological
           | advancement. Quality is already in decline, despite LLMs, and
           | programming was always a discipline where correctness and
           | predictability mattered. It's an advancement for efficiency,
           | sure, but at a yet unknown cost to stability. I'm thinking
           | about all the simulations based on applied mathematical
           | concepts and all the accumulated hours spent fixing bugs -
           | there's now this certain aftertaste, sweet for some living
           | their lives efficiently, but very bitter for the ones relying
           | on stability.
        
           | einrealist wrote:
           | That "good enough" is the problem. It requires context. And
           | AI companies are selling us that "good enough" with
           | questionable proof. And they are selling grandiose visions to
           | investors, but move the goal post again and again.
           | 
           | A lot of companies made Copilot available to their workforce.
           | I doubt that the majority of users understand what a
           | statistical model means. The casual, technically
           | inexperienced user just assumes that a computer answer is
           | always right.
        
       | delusional wrote:
       | > Sometimes the founder will apply a cope to the narrative ("We
       | just don't have any PhD level questions to ask")
       | 
       | Please tell me this is not what tech-bros are going around
       | telling each other! Are we implying that the problems in the
       | world, the things that humans collectively work on to maintain
       | the society that took us thousands of years to build up, just
       | aren't hard enough to reach the limits of the AI?
       | 
       | Jesus Christ.
        
         | bcoates wrote:
         | I mean... most businesses, particularly small businesses and
         | startups, aren't exactly doing brain surgery on a rocketship.
         | 
         | It's pretty likely that they have extremely dull problems like
         | "running an inbound call center is a lot of work" or "people
         | keep having their mail stolen and/or lying that they did" that
         | "more smarter gpus" won't solve
        
       | timewizard wrote:
       | Government announces critical need to invest in AI and sets aside
       | a bunch of money for this purpose.
       | 
       | Suddenly the benchmarks become detached from reality and vendors
       | can claim whatever they want about their "new" products.
       | 
       | Just as a possible explanation, as I feel like I've seen this
       | story before.
        
       | ants_everywhere wrote:
       | There are real and obvious improvements in the past few model
       | updates and I'm not sure what the disconnect there is.
       | 
       | Maybe it's that I _do_ have PhD level questions to ask them, and
       | they've gotten much better at it.
       | 
       | But I suspect that these anecdotes are driven by something else.
       | Perhaps people found a workable prompt strategy by trial and
       | error on an earlier model and it works less well with later
       | models.
       | 
       | Or perhaps they have a time-sensitive task and are not able to
       | take advantage of the thinking of modern LLMs, which have a slow
       | thinking-based feedback loop. Or maybe their code base is getting
       | more complicated, so it's harder to reason about.
       | 
       | Or perhaps they're giving the LLMs a poorly defined task that
       | older models simply made assumptions about, whereas newer models
       | recognize its ambiguity and so find the space of solutions
       | harder to navigate.
       | 
       | Since this is ultimately from a company doing AI scanning for
       | security, I would think the latter plays a role to some extent.
       | Security is insanely hard and the more you know about it the
       | harder it is. Also adversaries are bound to be using AI and are
       | increasing in sophistication, which would cause lower efficacy
       | (although you could tease this effect out by trying older models
       | with the newer threats).
        
         | pclmulqdq wrote:
         | In the last year, things like "you are an expert on..." have
         | gotten much less effective in my private tests, while actually
         | describing the problem precisely has gotten better in terms of
         | producing results.
         | 
         | In other words, all the lazy prompt-engineering hacks are
         | becoming less effective. Domain expertise is becoming more
         | effective.
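         | 
         | A minimal sketch of the kind of A/B comparison I mean; the
         | client call is the standard OpenAI Python SDK, but the model
         | name, both prompts, and the scenario they describe are made-up
         | examples:
         | 
         |   from openai import OpenAI
         | 
         |   client = OpenAI()  # assumes OPENAI_API_KEY is set
         | 
         |   lazy = ("You are a world-class Python expert. "
         |           "My log parser is slow, please fix it.")
         | 
         |   precise = (
         |       "A Python 3.11 service parses ~2 GB of JSON logs "
         |       "per hour, one json.loads call per line; profiling "
         |       "shows ~70% of time in json.loads. Constraints: "
         |       "stdlib only, output must stay dicts. What are "
         |       "the realistic options?"
         |   )
         | 
         |   for prompt in (lazy, precise):
         |       resp = client.chat.completions.create(
         |           model="gpt-4o",  # illustrative model name
         |           messages=[{"role": "user", "content": prompt}],
         |       )
         |       print(resp.choices[0].message.content[:300])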
        
           | ants_everywhere wrote:
           | Yes, that would explain the effect, I think. I'll try that
           | out this week.
        
       | InkCanon wrote:
       | The biggest story in AI was released a few weeks ago but was
       | given little attention: on the recent USAMO, SOTA models scored
       | on average 5% (IIRC, it was some abysmal number). This is
       | despite them supposedly having gotten 50%, 60% etc. performance
       | on IMO questions. This strongly suggests the models simply
       | remember past results instead of actually solving these
       | questions. I'm incredibly surprised no one mentions this, and
       | it's ridiculous that these companies never tell us what (if
       | any) efforts have been made to remove test data (IMO, ICPC,
       | etc.) from training data.
        
         | AIPedant wrote:
         | Yes, here's the link: https://arxiv.org/abs/2503.21934v1
         | 
         | Anecdotally, I've been playing around with o3-mini on
         | undergraduate math questions: it is much better at "plug-and-
         | chug" proofs than GPT-4, but those problems aren't
         | independently interesting, they are explicitly pedagogical. For
         | anything requiring insight, it's either:
         | 
         | 1) A very good answer that reveals the LLM has seen the problem
         | before (e.g. naming the theorem, presenting a "standard" proof,
         | using a much more powerful result)
         | 
         | 2) A bad answer that looks correct and takes an enormous amount
         | of effort to falsify. (This is the secret sauce of LLM hype.)
         | 
         | I dread undergraduate STEM majors using this thing - I asked it
         | a problem about rotations and spherical geometry, but got back
         | a pile of advanced geometric algebra, when I was looking for
         | "draw a spherical triangle." If I didn't know the answer, I
         | would have been badly confused. See also this real-world
         | example of an LLM leading a recreational mathematician astray:
         | https://xcancel.com/colin_fraser/status/1900655006996390172#...
         | 
         | I will add that in 10 years the field will be intensely
         | criticized for its reliance on multiple-choice benchmarks; it
         | is not surprising or interesting that next-token prediction can
         | game multiple-choice questions!
        
         | simonw wrote:
         | I had to look up these acronyms:
         | 
         | - USAMO - United States of America Mathematical Olympiad
         | 
         | - IMO - International Mathematical Olympiad
         | 
         | - ICPC - International Collegiate Programming Contest
         | 
         | Relevant paper: https://arxiv.org/abs/2503.21934 - "Proof or
         | Bluff? Evaluating LLMs on 2025 USA Math Olympiad" submitted
         | 27th March 2025.
        
         | usaar333 wrote:
         | And then within a week, Gemini 2.5 was tested and got 25%. The
         | point is that AI is getting stronger.
         | 
         | And this only suggests LLMs aren't trained well to write
         | formal math proofs, which is true.
        
         | AstroBen wrote:
         | This seems fairly obvious at this point. If they were actually
         | reasoning _at all_, they'd be capable (even if not good) of
         | playing complex games like chess.
         | 
         | Instead they're barely able to eke out wins against a bot that
         | plays completely random moves:
         | https://maxim-saplin.github.io/llm_chess/
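         | 
         | For reference, the random-move baseline in that benchmark is
         | about as weak as a chess opponent gets. A sketch with the
         | python-chess library, where get_llm_move is a stand-in for
         | whatever prompting and move-parsing you wire up:
         | 
         |   import random
         |   import chess
         | 
         |   def random_move(board: chess.Board) -> chess.Move:
         |       # the entire "strategy" of the baseline opponent
         |       return random.choice(list(board.legal_moves))
         | 
         |   def play_game(get_llm_move) -> str:
         |       # get_llm_move(board) stands in for prompting a
         |       # model with board.fen() and parsing a legal move
         |       # out of its reply
         |       board = chess.Board()
         |       while not board.is_game_over():
         |           if board.turn == chess.WHITE:
         |               move = get_llm_move(board)
         |           else:
         |               move = random_move(board)
         |           board.push(move)
         |       return board.result()  # "1-0", "0-1", "1/2-1/2"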
        
         | bglazer wrote:
         | Yeah I'm a computational biology researcher. I'm working on a
         | novel machine learning approach to inferring cellular behavior.
         | I'm currently stumped why my algorithm won't converge.
         | 
         | So, I describe the mathematics to ChatGPT-o3-mini-high to try
         | to help reason about what's going on. It was almost completely
         | useless. Like blog-slop "intro to ML" solutions and ideas. It
         | ignores all the mathematical context, and zeros in on "doesn't
         | converge" and suggests that I lower the learning rate. Like, no
         | shit I tried that three weeks ago. No amount of cajoling can
         | get it to meaningfully "reason" about the problem, because it
         | hasn't seen the problem before. The closest point in latent
         | space is apparently a thousand identical Medium articles about
         | Adam, so I get the statistical average of those.
         | 
         | I can't stress how frustrating this is, especially with people
         | like Terence Tao saying that these models are like a mediocre
         | grad student. I would really love to have a mediocre (in
         | Terry's eyes) grad student looking at this, but I can't seem to
         | elicit that. Instead I get low tier ML blogspam author.
         | 
         | **PS** if anyone read this far (doubtful) and knows about
         | density estimation and wants to help my email is
         | bglazer1@gmail.com
         | 
         | I promise it's a fun mathematical puzzle, and the biology is
         | pretty wild too.
        
       | mmcnl wrote:
       | I feel we are already in the era of diminishing returns on LLM
       | improvements. Newer models seem to be more sophisticated
       | implementations of LLM technology + throwing more resources at
       | it, but to me they do not seem fundamentally more intelligent.
       | 
       | I don't think this is a problem though. I think there's a lot of
       | low-hanging fruit in building sophisticated implementations on
       | top of relatively dumb LLMs. But that sentiment doesn't generate
       | a lot of clicks.
        
       | DisjointedHunt wrote:
       | Two things can be true at the same time:
       | 
       | 1. Model "performance", judged by proxy metrics of intelligence,
       | has improved significantly over the past two years.
       | 
       | 2. These capabilities are yet to be stitched together in the most
       | appropriate manner for the cybersecurity scenarios the author is
       | talking about.
       | 
       | In my experience, the best usage of Transformer models has come
       | from deep integration into an appropriate workflow. They do not
       | (yet) replace the novel-exploration part of a workflow, but they
       | are scarily performant at following mid-level reasoning
       | assertions in a massively parallelized manner.
       | 
       | The question you should be asking yourself is whether you can
       | break your task down into however many small chunks are feasible
       | to process in time, group those into appropriate buckets or,
       | even better, place them in order as though you were doing those
       | steps with your own expertise - as an extension of self. Here's
       | how the two approaches differ:
       | 
       | "Find vulnerabilities in this code" -> This will saturate across
       | all models because the intent behind this mission is vast and
       | loosely defined, while the outcome is expected to be narrow.
       | 
       | " (a)This piece of code should be doing x, what areas is it
       | affecting, lets draw up a perimeter (b) Here is the dependency
       | graph of things upstream and downstream of x, lets spawn a
       | collection of thinking chains to evaluate each one for risk based
       | on the most recent change . . . (b[n]) Where is this likely to
       | fail (c) (Next step that a pentester/cybersecurity researcher
       | would take) "
       | 
       | This has been trial and error in my experience, but it has
       | worked great in domains such as financial trading and decision
       | support, where experts in the field help sketch out the general
       | framework of the process where reasoning support is needed and
       | then constantly iterate on it as though it were an extension of
       | themselves.
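       | 
       | A rough sketch of the shape that second approach can take; the
       | ask_model helper, the prompt wording, and the risk framing are
       | placeholders for whatever model, API, and domain you actually
       | use:
       | 
       |   from concurrent.futures import ThreadPoolExecutor
       | 
       |   def ask_model(prompt: str) -> str:
       |       # placeholder for a real model call (API or local)
       |       raise NotImplementedError
       | 
       |   def assess_dependency(change: str, dep: str) -> str:
       |       # one narrow, well-scoped question per dependency,
       |       # instead of "find vulnerabilities in this code"
       |       prompt = (
       |           f"The following change was made: {change}\n"
       |           f"Evaluate only the risk it introduces for the "
       |           f"dependent component '{dep}'. List concrete "
       |           f"failure modes, or answer 'no impact'."
       |       )
       |       return ask_model(prompt)
       | 
       |   def fan_out(change, deps):
       |       # one constrained reasoning chain per node of the
       |       # dependency graph, evaluated in parallel
       |       with ThreadPoolExecutor(max_workers=8) as pool:
       |           results = pool.map(
       |               lambda d: assess_dependency(change, d), deps)
       |           return dict(zip(deps, results))
       | 
       | The value is in the narrow constraint and ordering of each call,
       | not in the plumbing.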
        
       | StickyRibbs wrote:
       | There's the politics of the corporations, and then there's the
       | business of the science behind LLMs; this article feels like the
       | former.
       | 
       | Maybe someone active in the research can comment? I feel like all
       | of these comments are just conjecture/anecdotes and don't really
       | get to the meat of this question of "progress" and the future of
       | LLMs.
        
       | OtherShrezzing wrote:
       | Assuming that models getting better at SWE benchmarks and math
       | tests will translate into positive outcomes in all other domains
       | could be an act of spectacular hubris by the big frontier labs,
       | which are themselves chock-full of mathematicians and software
       | engineers.
        
       | nialv7 wrote:
       | Sounds like someone drank their own Kool-Aid (believing current
       | AI can be a security researcher) and then got frustrated when
       | they realized they had overhyped themselves.
       | 
       | Current AI just cannot do the kind of symbolic reasoning required
       | for finding security vulnerabilities in software. It might have
       | learned to recognize "bad code" via pattern matching, but that's
       | basically it.
        
       | burny_tech wrote:
       | In practice, Sonnet 3.7 and Gemini 2.5 are just often too good
       | compared to competitors.
        
       | jaredcwhite wrote:
       | There's some interesting information and analysis to start off
       | this essay, then it ends with:
       | 
       | "These machines will soon become the beating hearts of the
       | society in which we live. The social and political structures
       | they create as they compose and interact with each other will
       | define everything we see around us."
       | 
       | This sounds like an article of faith to me. One could just as
       | easily say they won't become the beating hearts of anything, and
       | instead we'll choose to continue to build a better future for
       | humans, as humans, without relying on an overly-hyped technology
       | rife with error and unethical implications.
        
       ___________________________________________________________________
       (page generated 2025-04-06 23:00 UTC)