[HN Gopher] GPT-5: Key characteristics, pricing and system card
       ___________________________________________________________________
        
       GPT-5: Key characteristics, pricing and system card
        
       System card: https://cdn.openai.com/pdf/8124a3ce-
       ab78-4f06-96eb-49ea29ffb...
        
       Author : Philpax
       Score  : 374 points
       Date   : 2025-08-07 17:46 UTC (5 hours ago)
        
 (HTM) web link (simonwillison.net)
 (TXT) w3m dump (simonwillison.net)
        
       | ks2048 wrote:
       | So, "system card" now means what used to be a "paper", but
       | without lots of the details?
        
         | kaoD wrote:
         | Nope. System card is a sales thing. I think we generally call
         | that "product sheet" in other markets.
        
         | simonw wrote:
         | AI labs tend to use "system cards" to describe their evaluation
         | and safety research processes.
         | 
         | They used to be more about the training process itself, but
         | that's increasingly secretive these days.
        
       | empiko wrote:
        | Despite the fact that their models are used in hiring, business,
        | education, etc., this multibillion-dollar company uses a single
        | benchmark with very artificial questions (BBQ) to evaluate how
        | fair their model is. I am a little bit disappointed.
        
       | Leary wrote:
        | METR 50% time horizon of only 2 hours and 15 minutes. Fast
        | takeoff less likely.
        
         | qsort wrote:
         | Isn't that pretty much in line with what people were expecting?
         | Is it surprising?
        
           | dingnuts wrote:
            | It's not surprising to AI critics, but go back to 2022, open
            | r/singularity, and then answer: what were "people" expecting?
            | Which people?
           | 
           | SamA has been promising AGI next year for three years like
           | Musk has been promising FSD next year for the last ten years.
           | 
           | IDK what "people" are expecting but with the amount of hype
           | I'd have to guess they were expecting more than we've gotten
           | so far.
           | 
           | The fact that "fast takeoff" is a term I recognize indicates
           | that some people believed OpenAI when they said this
           | technology (transformers) would lead to sci fi style AI and
           | that is most certainly not happening
        
             | falcor84 wrote:
             | I would say that there are quite a lot of roles where you
             | need to do a lot of planning to effectively manage an ~8
             | hour shift, but then there are good protocols for handing
             | over to the next person. So once AIs get to that level (in
             | 2027?), we'll be much closer to AIs taking on "economically
             | valuable work".
        
             | ToValueFunfetti wrote:
             | >SamA has been promising AGI next year for three years like
             | Musk has been promising FSD next year for the last ten
             | years.
             | 
             | Has he said anything about it since last September:
             | 
             | >It is possible that we will have superintelligence in a
             | few thousand days (!); it may take longer, but I'm
             | confident we'll get there.
             | 
              | This is, at an absolute minimum, 2,000 days, or roughly 5.5
              | years. And he says it may take longer.
             | 
             | Did he even say AGI next year any time before this? It
             | looks like his predictions were all pointing at the late
             | 2020s, and now he's thinking early 2030s. Which you could
             | still make fun of, but it just doesn't match up with your
             | characterization at all.
        
           | usaar333 wrote:
           | No, this is below expectations on both Manifold and lesswrong
           | (https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_green
           | ...). Median was ~2.75 hours on both (which already
           | represented a bearish slowdown).
           | 
           | Not massively off -- manifold yesterday implied odds this low
           | were ~35%. 30% before Claude Opus 4.1 came out which updated
           | expected agentic coding abilities downward.
        
             | qsort wrote:
             | Thanks for sharing, that was a good thread!
        
         | umanwizard wrote:
         | What is METR?
        
           | Leary wrote:
           | https://metr.github.io/autonomy-evals-guide/gpt-5-report/
        
           | tunesmith wrote:
           | The 2h 15m is the length of tasks the model can complete with
           | 50% probability. So longer is better in that sense. Or at
           | least, "more advanced" and potentially "more dangerous".
        
           | ravendug wrote:
           | https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/metr-
           | measu...
        
         | kqr wrote:
          | Seems like it's on the trend line that's scaring people, like
          | the AI 2027 forecast, isn't it?
          | https://aisafety.no/img/articles/length-of-tasks-log.png
        
         | FergusArgyll wrote:
          | It's above the exponential line and right around the
          | super-exponential line.
        
       | nickthegreek wrote:
        | This new naming convention, while not perfect, is a lot clearer
        | and I am sure will help my coworkers.
        
       | anyg wrote:
        | Good to know -> Knowledge cut-off is September 30th 2024 for
        | GPT-5 and May 30th 2024 for GPT-5 mini and nano.
        
         | falcor84 wrote:
         | Oh wow, so essentially a full year of post-training and
         | testing. Or was it ready and there was a sufficiently good
         | business strategy decision to postpone the release?
        
           | thorum wrote:
           | The Information's report from earlier this month claimed that
           | GPT-5 was only developed in the last 1-2 months, after some
           | sort of breakthrough in training methodology.
           | 
           | > As recently as June, the technical problems meant none of
           | OpenAI's models under development seemed good enough to be
           | labeled GPT-5, according to a person who has worked on it.
           | 
           | But it could be that this refers to post-training and the
           | base model was developed earlier.
           | 
           | https://www.theinformation.com/articles/inside-openais-
           | rocky...
           | 
           | https://archive.ph/d72B4
        
             | simonw wrote:
              | My understanding is that training data cut-offs and the dates
              | at which the models were trained are independent things.
             | 
             | AI labs gather training data and then do a ton of work to
             | process it, filter it etc.
             | 
             | Model training teams run different parameters and
             | techniques against that processed training data.
             | 
             | It wouldn't surprise me to hear that OpenAI had collected
             | data up to September 2024, dumped that data in a data
             | warehouse of some sort, then spent months experimenting
             | with ways to filter and process it and different training
             | parameters to run against it.
        
           | NullCascade wrote:
           | OpenAI is much more aggressively targeted by NYTimes and
           | similar organizations for "copyright violations".
        
         | bhouston wrote:
          | Weird to have such an early knowledge cutoff. Claude 4.1 has
          | March 2025 - six months more recent, with comparable results.
        
         | bn-l wrote:
         | Is that late enough for it to have heard of svelte 5?
        
         | dortlick wrote:
         | Yeah I thought that was strange. Wouldn't it be important to
         | have more recent data?
        
       | cco wrote:
       | Only a third cheaper than Sonnet 4? Incrementally better I
       | suppose.
       | 
       | > and minimizing sycophancy
       | 
       | Now we're talking about a good feature! Actually one of my
       | biggest annoyances with Cursor (that mostly uses Sonnet).
       | 
       | "You're absolutely right!"
       | 
       | I mean not really Cursor, but ok. I'll be super excited if we can
       | get rid of these sycophancy tokens.
        
         | logicchains wrote:
         | >Only a third cheaper than Sonnet 4?
         | 
         | The price should be compared to Opus, not Sonnet.
        
           | cco wrote:
           | Wow, if so, 7x cheaper. Crazy if true.
        
         | nosefurhairdo wrote:
         | In my early testing gpt5 is significantly less annoying in this
         | regard. Gives a strong vibe of just doing what it's told
         | without any fluff.
        
       | bdcdo wrote:
       | "GPT-5 in the API is simpler: it's available as three models--
       | regular, mini and nano--which can each be run at one of four
       | reasoning levels: minimal (a new level not previously available
       | for other OpenAI reasoning models), low, medium or high."
       | 
       | Is it actually simpler? For those who are currently using GPT
       | 4.1, we're going from 3 options (4.1, 4.1 mini and 4.1 nano) to
       | at least 8, if we don't consider gpt 5 regular - we now will have
       | to choose between gpt 5 mini minimal, gpt 5 mini low, gpt 5 mini
       | medium, gpt 5 mini high, gpt 5 nano minimal, gpt 5 nano low, gpt
       | 5 nano medium and gpt 5 nano high.
       | 
       | And, while choosing between all these options, we'll always have
       | to wonder: should I try adjusting the prompt that I'm using, or
       | simply change the gpt 5 version or its reasoning level?
        
         | impossiblefork wrote:
         | Yes, I think so. It's n=1,2,3 m=0,1,2,3. There's structure and
         | you know that each parameter goes up and in which direction.
        
           | makeramen wrote:
           | But given the option, do you choose bigger models or more
           | reasoning? Or medium of both?
        
             | namibj wrote:
             | Depends on what you're doing.
        
               | addaon wrote:
               | > Depends on what you're doing.
               | 
               | Trying to get an accurate answer (best correlated with
               | objective truth) on a topic I don't already know the
               | answer to (or why would I ask?). This is, to me, the
               | challenge with the "it depends, tune it" answers that
               | always come up in how to use these tools -- it requires
               | the tools to not be useful for you (because there's
               | already a solution) to be able to do the tuning.
        
               | wongarsu wrote:
               | If cost is no concern (as in infrequent one-off tasks)
               | then you can always go with the biggest model with the
               | most reasoning. Maybe compare it with the biggest model
               | with no/less reasoning, since sometimes reasoning can
               | hurt (just as with humans overthinking something).
               | 
               | If you have a task you do frequently you need some kind
                | of benchmark. Which might just be comparing how well the
                | output of the smaller models holds up against the output of
                | the bigger model, if you don't know the ground truth.
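                | 
                | As a rough sketch of that kind of comparison (assuming the
                | OpenAI Python SDK; the model names, efforts and the "judge"
                | step are placeholders to adapt to your task):
                | 
                |   from openai import OpenAI
                | 
                |   client = OpenAI()
                | 
                |   def run(model, effort, prompt):
                |       return client.responses.create(
                |           model=model, reasoning={"effort": effort}, input=prompt
                |       ).output_text
                | 
                |   prompts = ["<frequent task, example 1>", "<example 2>"]
                |   reference = [run("gpt-5", "high", p) for p in prompts]
                |   candidate = [run("gpt-5-mini", "low", p) for p in prompts]
                | 
                |   for p, ref, cand in zip(prompts, reference, candidate):
                |       # Judge however fits: exact match, a rubric, or a grader model.
                |       print(p[:30], "match" if cand.strip() == ref.strip() else "differs")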
        
             | impossiblefork wrote:
             | I would have to get experience with them. I mostly use
             | Mistral, so I have only the choice of thinking or not
             | thinking.
        
               | gunalx wrote:
                | Mistral also has small, medium and large, with both small
                | and medium having a thinking variant, plus Devstral,
                | Codestral, etc.
                | 
                | Not really that much simpler.
        
               | impossiblefork wrote:
               | Ah, but I never route to these manually. I only use LLMs
               | a little bit, mostly to try to see what they can't do.
        
             | paladin314159 wrote:
             | If you need world knowledge, then bigger models. If you
             | need problem-solving, then more reasoning.
             | 
             | But the specific nuance of picking nano/mini/main and
             | minimal/low/medium/high comes down to experimentation and
             | what your cost/latency constraints are.
        
         | mwigdahl wrote:
         | If reasoning is on the table, then you already had to add
         | o3-mini-high, o3-mini-medium, o3-mini-low, o4-mini-high,
         | o4-mini-medium, and o4-mini-low to the 4.1 variants. The GPT-5
         | way seems simpler to me.
        
         | hirako2000 wrote:
         | Ultimately they are selling tokens, so try many times.
        
         | vineyardmike wrote:
         | When I read "simpler" I interpreted that to mean they don't use
         | their Chat-optimized harness to guess which reasoning level and
         | model to use. The subscription chat service (ChatGPT) and the
         | chat-optimized model on their API seem to have a special
         | harness that changes reasoning based on some heuristics, and
         | will switch between the model sizes without user input.
         | 
          | With the API, you pick a model size and reasoning effort. Yes
         | more choices, but also a clear mental model and a simple choice
         | that you control.
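          | 
          | A minimal sketch of what that looks like (assuming the OpenAI
          | Python SDK's Responses API; treat the exact parameter shape
          | here as my assumption, not documentation):
          | 
          |   from openai import OpenAI
          | 
          |   client = OpenAI()  # reads OPENAI_API_KEY from the environment
          | 
          |   # You pick the model size and the reasoning effort explicitly:
          |   response = client.responses.create(
          |       model="gpt-5-mini",               # or "gpt-5" / "gpt-5-nano"
          |       reasoning={"effort": "minimal"},  # or "low" / "medium" / "high"
          |       input="Summarize this changelog in three bullet points: ...",
          |   )
          |   print(response.output_text)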
        
       | diggan wrote:
       | > but for the moment here's the pelican I got from GPT-5 running
       | at its default "medium" reasoning effort:
       | 
        | Would've been interesting to see a comparison between low,
        | medium and high reasoning_effort pelicans :)
       | 
        | When I've played around with GPT-OSS-120b recently, it seems the
        | difference in the final answer is huge: "low" is essentially "no
        | reasoning", while with "high" it can spend a seemingly endless
        | amount of tokens. I'm guessing the difference with GPT-5 will be
        | similar?
        
         | simonw wrote:
          | > Would've been interesting to see a comparison between low,
          | medium and high reasoning_effort pelicans
         | 
          | Yeah, I'm working on that - expect dozens more pelicans in a
          | later post.
        
       | zaronymous1 wrote:
       | Can anyone explain to me why they've removed parameter controls
       | for temperature and top-p in reasoning models, including gpt-5?
       | It strikes me that it makes it harder to build with these to do
        | small tasks requiring high levels of consistency, and in the API,
       | I really value the ability to set certain tasks to a low temp.
        
         | Der_Einzige wrote:
          | It's because all forms of sampler settings destroy
          | safety/alignment. That's why top_p/top_k are still used and not
          | TFS, min_p, top-n sigma, etc., and why temperature is locked to
          | an arbitrary 0-2 range.
          | 
          | Open source is years ahead of these guys on samplers. It's why
          | their models being so good is that much more impressive.
        
           | oblio wrote:
           | Temperature is the response variation control?
        
             | AH4oFVbPT4f8 wrote:
              | Yes, it controls how much variability there is in which token
              | gets selected next - lower temperature concentrates
              | probability on the most likely tokens.
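              | 
              | A toy sketch of what temperature and top-p do to the
              | next-token distribution (illustrative only; no provider
              | necessarily implements it exactly this way):
              | 
              |   import math, random
              | 
              |   def sample_next_token(logits, temperature=1.0, top_p=1.0):
              |       # Temperature rescales logits: <1 sharpens, >1 flattens.
              |       toks = list(logits)
              |       scaled = [logits[t] / temperature for t in toks]
              |       m = max(scaled)
              |       exps = [math.exp(s - m) for s in scaled]
              |       probs = [e / sum(exps) for e in exps]
              |       # Top-p (nucleus): keep the smallest set of tokens whose
              |       # cumulative probability reaches top_p, then sample.
              |       ranked = sorted(zip(toks, probs), key=lambda kv: -kv[1])
              |       kept, total = [], 0.0
              |       for tok, p in ranked:
              |           kept.append((tok, p))
              |           total += p
              |           if total >= top_p:
              |               break
              |       tokens, weights = zip(*kept)
              |       return random.choices(tokens, weights=weights)[0]
              | 
              |   print(sample_next_token({"the": 2.0, "a": 1.5, "x": 0.1},
              |                           temperature=0.7, top_p=0.9))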
        
       | hodgehog11 wrote:
       | The aggressive pricing here seems unusual for OpenAI. If they had
       | a large moat, they wouldn't need to do this. Competition is
       | fierce indeed.
        
         | 0x00cl wrote:
          | Maybe they need/want data.
        
           | dr_dshiv wrote:
           | And it's a massive distillation of the mother model, so the
           | costs of inference are likely low.
        
           | impure wrote:
           | OpenAI and most AI companies do not train on data submitted
           | to a paid API.
        
             | WhereIsTheTruth wrote:
             | They also do not train using copyrighted material /s
        
               | daveguy wrote:
               | Oh, they never even made that promise. They're trying to
                | say it's fine to launder copyrighted material through a
               | model.
        
               | simonw wrote:
               | That's different. They train on scrapes of the web. They
               | don't train on data submitted to their API by their
               | paying customers.
        
               | johnnyanmac wrote:
               | If they're bold enough to say they train on data they do
               | not own, I am not optimistic when they say they don't
               | train on data people willingly submit to them.
        
               | simonw wrote:
               | I don't understand your logic there.
               | 
               | They have confessed to doing a bad thing - training on
               | copyrighted data without permission. Why does that
               | indicate they would lie about a worse thing?
        
               | johnnyanmac wrote:
               | >Why does that indicate they would lie about a worse
               | thing?
               | 
                | Because they know their audience. It's an audience that
                | also doesn't care for copyright and would love for them
                | to win their court cases. They are fine making such an
                | argument to those kinds of people.
                | 
                | Meanwhile, when lawyers ran a very typical subpoena process
                | on said data - data they chose to submit to an online
                | server of their own volition - the same audience completely
                | freaked out. Suddenly, they felt like their privacy was
                | invaded.
                | 
                | It doesn't make any logical sense in my mind, but a lot
                | of the discourse over this topic isn't based on logic.
        
             | dortlick wrote:
             | Why don't they?
        
               | echoangle wrote:
               | They probably fear that people wouldn't use the API
               | otherwise, I guess. They could have different tiers
               | though where you pay extra so your data isn't used for
               | training.
        
             | anhner wrote:
             | If you believe that, I have a bridge I can sell you...
        
               | Uehreka wrote:
               | If it ever leaked that OpenAI was training on the vast
               | amounts of confidential data being sent to them, they'd
               | be immediately crushed under a mountain of litigation and
               | probably have to shut down. Lots of people at big
               | companies have accounts, and the bigcos are only letting
               | them use them because of that "Don't train on my data"
               | checkbox. Not all of those accounts are necessarily tied
               | to company emails either, so it's not like OpenAI can
               | discriminate.
        
         | impure wrote:
         | The 5 cents for Nano is interesting. Maybe it will force Google
         | to start dropping their prices again which have been slowly
         | creeping up recently.
        
         | ilaksh wrote:
         | It's like 5% better. I think they obviously had no choice but
         | to be price competitive with Gemini 2.5 Pro. Especially for
         | Cursor to change their default.
        
         | FergusArgyll wrote:
          | They are winning by massive margins in the app, but losing (!)
          | in the API to Anthropic.
         | 
         | https://finance.yahoo.com/news/enterprise-llm-spend-reaches-...
        
         | canada_dry wrote:
         | Perhaps they're feeling the effect of losing PRO clients (like
         | me) lately.
         | 
         | Their PRO models were not (IMHO) worth 10X that of PLUS!
         | 
         | Not even close.
         | 
         | Especially when new competitors (eg. z.ai) are offering very
         | compelling competition.
        
       | onehair wrote:
       | > Definitely recognizable as a pelican
       | 
       | right :-D
        
       | pancakemouse wrote:
       | Practically the first thing I do after a new model release is try
       | to upgrade `llm`. Thank you, @simonw !
        
         | efavdb wrote:
          | Same, looks like he hasn't added 5.0 to the package yet, but I
          | assume it's imminent.
         | 
         | https://llm.datasette.io/en/stable/openai-models.html
        
         | simonw wrote:
         | Working on that now! https://github.com/simonw/llm/issues/1229
        
       | isoprophlex wrote:
       | Whoa this looks good. And cheap! How do you hack a proxy together
       | so you can run Claude Code on gpt-5?!
        
         | dalberto wrote:
         | Consider: https://github.com/musistudio/claude-code-router
         | 
         | or even: https://github.com/sst/opencode
         | 
         | Not affiliated with either one of these, but they look
         | promising.
        
       | morleytj wrote:
       | It's cool and I'm glad it sounds like it's getting more reliable,
       | but given the types of things people have been saying GPT-5 would
       | be for the last two years you'd expect GPT-5 to be a world-
        | shattering release rather than an incremental and stable
        | improvement.
       | 
       | It does sort of give me the vibe that the pure scaling maximalism
        | really is dying off though. If the approach is now about writing
        | better routers, tooling, and combining specialized submodels on
        | tasks, then it feels like there's a search for new ways to improve
        | performance (and lower cost), suggesting the other established
       | approaches weren't working. I could totally be wrong, but I feel
       | like if just throwing more compute at the problem was working
       | OpenAI probably wouldn't be spending much time on optimizing the
       | user routing on currently existing strategies to get marginal
       | improvements on average user interactions.
       | 
       | I've been pretty negative on the thesis of only needing more
       | data/compute to achieve AGI with current techniques though, so
       | perhaps I'm overly biased against it. If there's one thing that
       | bothers me in general about the situation though, it's that it
       | feels like we really have no clue what the actual status of these
       | models is because of how closed off all the industry labs have
       | become + the feeling of not being able to expect anything other
       | than marketing language from the presentations. I suppose that's
       | inevitable with the massive investments though. Maybe they've got
       | some massive earthshattering model release coming out next, who
       | knows.
        
         | jstummbillig wrote:
          | Things have moved differently than what we thought would happen
          | 2 years ago, but let's not forget what has happened in the
          | meantime (4o, o1 + the thinking paradigm, o3).
         | 
         | So yeah, maybe we are getting more incremental improvements.
         | But that to me seems like a good thing, because more good
         | things earlier. I will take that over world-shattering any day
         | - but if we were to consider everything that has happened since
         | the first release of gpt-4, I would argue the total amount is
         | actually very much world-shattering.
        
         | GaggiX wrote:
          | Compared to GPT-4, it is on a completely different level given
          | that it is a reasoning model, so in that regard it does deliver
          | and it's not just scaling. But for that, I guess the revolution
          | was o1, and GPT-5 is just a much more mature version of the
          | technology.
        
         | hnuser123456 wrote:
         | I agree, we have now proven that GPUs can ingest information
         | and be trained to generate content for various tasks. But to
         | put it to work, make it useful, requires far more thought about
         | a specific problem and how to apply the tech. If you could just
         | ask GPT to create a startup that'll be guaranteed to be worth
         | $1B on a $1k investment within one year, someone else would've
         | already done it. Elbow grease still required for the
         | foreseeable future.
         | 
         | In the meantime, figuring out how to train them to make less of
         | their most common mistakes is a worthwhile effort.
        
           | morleytj wrote:
           | Certainly, yes, plenty of elbow grease required in all things
           | that matter.
           | 
           | The interesting point as well to me though, is that if it
           | could create a startup that was worth $1B, that startup
           | wouldn't be worth $1B.
           | 
           | Why would anyone pay that much to invest in the startup if
           | they could recreate the entire thing with the same tool that
           | everyone would have access to?
        
         | BoiledCabbage wrote:
         | Performance is doubling roughly every 4-7 months. That trend is
         | continuing. That's insane.
         | 
          | If your expectations were any higher than that, then it
          | seems like you were caught up in hype. Doubling 2-3 times per
          | year isn't leveling off by any means.
         | 
         | https://metr.github.io/autonomy-evals-guide/gpt-5-report/
        
           | oblio wrote:
           | By "performance" I guess you mean "the length of task that
           | can be done adequately"?
           | 
           | It is a benchmark but I'm not very convinced it's the be-all,
           | end-all.
        
             | nomel wrote:
             | > It is a benchmark but I'm not very convinced it's the be-
             | all, end-all.
             | 
             | Who's suggesting it is?
        
           | morleytj wrote:
           | I wouldn't say model development and performance is "leveling
           | off", and in fact didn't write that. I'd say that tons more
           | funding is going into the development of many models, so one
           | would expect performance increases unless the paradigm was
            | completely flawed at its core, a belief I wouldn't
           | personally profess to. My point was moreso the following: A
           | couple years ago it was easy to find people saying that all
           | we needed was to add in video data, or genetic data, or some
           | other data modality, in the exact same format that the models
           | trained on existing language data were, and we'd see a fast
           | takeoff scenario with no other algorithmic changes. Given
           | that the top labs seem to be increasingly investigating
           | alternate approaches to setting up the models beyond just
           | adding more data sources, and have been for the last couple
           | years(Which, I should clarify, is a good idea in my opinion),
           | then the probability of those statements of just adding more
           | data or more compute taking us straight to AGI being correct
           | seems at the very least slightly lower, right?
           | 
           | Rather than my personal opinion, I was commenting on commonly
           | viewed opinions of people I would believe to have been caught
           | up in hype in the past. But I do feel that although that's a
           | benchmark, it's not necessarily the end-all of benchmarks.
           | I'll reserve my final opinions until I test personally, of
           | course. I will say that increasing the context window
           | probably translates pretty well to longer context task
           | performance, but I'm not entirely convinced it directly
           | translates to individual end-step improvement on every class
           | of task.
        
           | andrepd wrote:
           | We can barely measure "performance" in any objective sense,
           | let alone claim that it's doubling every 4 months.....
        
         | simonw wrote:
         | I for one am pretty glad about this. I like LLMs that augment
         | human abilities - tools that help people get more done and be
         | more ambitious.
         | 
         | The common concept for AGI seems to be much more about human
         | replacement - the ability to complete "economically valuable
         | tasks" better than humans can. I still don't understand what
         | our human lives or economies would look like there.
         | 
         | What I personally wanted from GPT-5 is exactly what I got:
         | models that do the same stuff that existing models do, but more
         | reliably and "better".
        
           | morleytj wrote:
           | I'd agree on that.
           | 
           | That's pretty much the key component these approaches have
           | been lacking on, the reliability and consistency on the tasks
           | they already work well on to some extent.
           | 
           | I think there's a lot of visions of what our human lives
           | would look like in that world that I can imagine, but your
           | comment did make me think of one particularly interesting
           | tautological scenario in that commonly defined version of
           | AGI.
           | 
            | If artificial general intelligence is defined as completing
            | "economically valuable tasks" better than humans can, it
           | requires one to define "economically valuable." As it
           | currently stands, something holds value in an economy
           | relative to human beings wanting it. Houses get expensive
           | because many people, each of whom have economic utility which
           | they use to purchase things, want to have houses, of which
           | there is a limited supply for a variety of reasons. If human
           | beings are not the most effective producers of value in the
           | system, they lose capability to trade for things, which
           | negates that existing definition of economic value. Doesn't
            | matter how many people would pay $5 for your widget
           | if people have no economic utility relative to AGI, meaning
           | they cannot trade that utility for goods.
           | 
           | In general that sort of definition of AGI being held reveals
           | a bit of a deeper belief, which is that there is some version
           | of economic value detached from the humans consuming it. Some
           | sort of nebulous concept of progress, rather than the
           | acknowledgement that for all of human history, progress and
           | value have both been relative to the people themselves
           | getting some form of value or progress. I suppose it
           | generally points to the idea of an economy without consumers,
           | which is always a pretty bizarre thing to consider, but in
           | that case, wouldn't it just be a definition saying that "AGI
           | is achieved when it can do things that the people who control
           | the AI system think are useful." Since in that case, the
           | economy would eventually largely consist of the people
           | controlling the most economically valuable agents.
           | 
           | I suppose that's the whole point of the various alignment
           | studies, but I do find it kind of interesting to think about
           | the fact that even the concept of something being
           | "economically valuable", which sounds very rigorous and
           | measurable to many people, is so nebulous as to be dependent
           | on our preferences and wants as a society.
        
         | thorum wrote:
         | The quiet revolution is happening in tool use and multimodal
         | capabilities. Moderate incremental improvements on general
         | intelligence, but dramatic improvements on multi-step tool use
         | and ability to interact with the world (vs 1 year ago), will
         | eventually feed back into general intelligence.
        
           | darkhorse222 wrote:
           | Completely agree. General intelligence is a building block.
           | By chaining things together you can achieve meta programming.
           | The trick isn't to create one perfect block but to build a
           | variety of blocks and make one of those blocks a block-
           | builder.
        
           | coolKid721 wrote:
           | [flagged]
        
             | dang wrote:
             | Can you please make your substantive points thoughtfully?
             | Thoughtful criticism is welcome but snarky putdowns and
              | one-liners, etc., degrade the discussion for everyone.
             | 
             | You've posted substantive comments in other threads, so
             | this should be easy to fix.
             | 
             | If you wouldn't mind reviewing
             | https://news.ycombinator.com/newsguidelines.html and taking
             | the intended spirit of the site more to heart, we'd be
             | grateful.
        
         | belter wrote:
         | > Maybe they've got some massive earthshattering model release
         | coming out next, who knows.
         | 
         | Nothing in the current technology offers a path to AGI. These
         | models are fixed after training completes.
        
           | echoangle wrote:
           | Why do you think that AGI necessitates modification of the
           | model during use? Couldn't all the insights the model gains
           | be contained in the context given to it?
        
             | belter wrote:
             | Because: https://en.wikipedia.org/wiki/Anterograde_amnesia
        
               | echoangle wrote:
               | Like I already said, the model can remember stuff as long
               | as it's in the context. LLMs can obviously remember stuff
               | they were told or output themselves, even a few messages
               | later.
        
               | godelski wrote:
               | > the model can remember stuff as long as it's in the
               | context.
               | 
               | You would need an infinite context or compression
               | 
               | Also you might be interested in this theorem
               | 
               | https://en.wikipedia.org/wiki/Data_processing_inequality
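                | 
                | Roughly (my paraphrase): if the context Z is produced from
                | the world X only via the data-gathering/compression step Y,
                | so X -> Y -> Z forms a Markov chain, then
                | 
                |   I(X; Z) <= I(X; Y)
                | 
                | i.e. post-processing can preserve or lose information about
                | X, but never add any that Y didn't already carry.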
        
               | echoangle wrote:
               | > You would need an infinite context or compression
               | 
               | Only if AGI would require infinite knowledge, which it
               | doesn't.
        
               | belter wrote:
               | AGI needs to genuinely learn and build new knowledge from
               | experience, not just generate creative outputs based on
               | what it has already seen.
               | 
               | LLMs might look "creative" but they are just remixing
               | patterns from their training data and what is in the
                | prompt. They can't actually update themselves or remember
               | new things after training as there is no ongoing feedback
               | loop.
               | 
               | This is why you can't send an LLM to medical school and
               | expect it to truly "graduate". It cannot acquire or
               | integrate new knowledge from real-world experience the
               | way a human can.
               | 
               | Without a learning feedback loop, these models are unable
               | to interact meaningfully with a changing reality or
               | fulfill the expectation from an AGI: Contribute to new
               | science and technology.
        
               | echoangle wrote:
               | I agree that this is kind of true with a plain chat
               | interface, but I don't think that's an inherent limit of
               | an LLM. I think OpenAI actually has a memory feature
               | where the LLM can specify data it wants to save and can
               | then access later. I don't see why this in principle
               | wouldn't be enough for the LLM to learn new data as time
               | goes on. All possible counter arguments seem related to
               | scale (of memory and context size), not the principle
               | itself.
               | 
               | Basically, I wouldn't say that an LLM can never become
               | AGI due to its architecture. I also am not saying that
               | LLM will become AGI (I have no clue), but I don't think
               | the architecture itself makes it impossible.
        
               | belter wrote:
               | LLMs lack mechanisms for persistent memory, causal world
               | modeling, and self-referential planning. Their
               | transformer architecture is static and fundamentally
               | constrains dynamic reasoning and adaptive learning. All
               | core requirements for AGI.
               | 
                | So yeah, AGI is impossible with today's LLMs. But at least
                | we got to watch Sam Altman and Mira Murati drop their
                | voices an octave onstage and announce "a new dawn of
                | intelligence" every quarter. Remember Sam Altman's $7
                | trillion?
                | 
                | Now that the AGI party is over, it's time to sell those
               | NVDA shares and prepare for the crash. What a ride it
               | was. I am grabbing the popcorn.
        
             | godelski wrote:
             | Because time marches on and with it things change.
             | 
             | You _could_ maybe accomplish this if you could fit all new
             | information into context or with cycles of compression but
              | that is kind of a crazy ask. There's too much new
              | information, even considering compression. It certainly
              | wouldn't allow for exponential growth (I'd expect
              | sublinear).
             | 
             | I think a lot of people greatly underestimate how much new
             | information is created every day. It's hard if you're not
             | working on any research and seeing how incremental but
             | constant improvement compounds. But try just looking at
             | whatever company you work for. Do you know everything that
             | people did that day? It takes more time to generate
              | information than to process it, so that's on your side,
             | but do you really think you could keep up? Maybe at a very
             | high level but in that case you're missing a lot of
             | information.
             | 
              | Think about it this way: if that could be done then LLMs
              | wouldn't need training or tuning because you could do
             | everything through prompting.
        
               | echoangle wrote:
               | The specific instance doesn't need to know everything
               | happening in the world at once to be AGI though. You
               | could feed the trained model different contexts based on
               | the task (and even let the model tell you what kind of
               | raw data it wants) and it could still hypothetically be
               | smarter than a human.
               | 
               | I'm not saying this is a realistic or efficient method to
                | create AGI, but I think the argument "model is static
                | once trained -> model can't be AGI" is fallacious.
        
         | cchance wrote:
         | SAM is a HYPE CEO, he literally hypes his company nonstop, then
         | the announcements come and ... they're... ok, so people aren't
         | really upset, but they end up feeling lackluster at the hype...
         | Until the next cycle comes around...
         | 
         | If you want actual big moves, watch google, anthropic, qwen,
         | deepseek.
         | 
          | Qwen and Deepseek teams honestly seem so much better at
          | under-promising and over-delivering.
         | 
          | Can't wait to see what Gemini 3 looks like too.
        
         | brandall10 wrote:
         | To be fair, this is one of the pathways GPT-5 was speculated to
          | take as far back as 6 or so months ago - simply being an
         | incremental upgrade from a performance perspective, but a leap
         | from a product simplification approach.
         | 
          | At this point it's pretty much a given that it's a game of
          | inches moving forward.
        
           | ac29 wrote:
           | > a leap from a product simplification approach.
           | 
            | According to the article, GPT-5 is actually three models and
            | they can be run at 4 levels of thinking. That's a dozen ways
            | you can run any given input on "GPT-5", so it's hardly a
            | simple product line-up (but maybe better than before).
        
         | AbstractH24 wrote:
         | > It's cool and I'm glad it sounds like it's getting more
         | reliable, but given the types of things people have been saying
         | GPT-5 would be for the last two years you'd expect GPT-5 to be
         | a world-shattering release rather than incremental and stable
         | improvement.
         | 
         | Are you trying to say the curve is flattening? That advances
         | are coming slower and slower?
         | 
         | As long as it doesn't suggest a dot com level recession I'm
         | good.
        
           | morleytj wrote:
           | I suppose what I'm getting at is that if there are
           | performance increases on a steady pace, but the investment
           | needed to get those performance increases is on a much faster
           | growth rate, it's not really a fair comparison in terms of a
           | rate of progress, and could suggest diminishing returns from
           | a particular approach. I don't really have the actual data to
            | make a claim either way though; I think anyone would need more
           | data to do so than is publicly accessible.
           | 
           | But I do think the fact that we can publicly observe this
           | reallocation of resources and emphasized aspects of the
           | models gives us a bit of insight into what could be happening
           | behind the scenes if we think about the reasons why those
           | shifts could have happened, I guess.
        
         | godelski wrote:
         | > It does sort of give me the vibe that the pure scaling
         | maximalism really is dying off though
         | 
         | I think the big question is if/when investors will start giving
         | money to those who have been predicting this (with evidence)
         | and trying other avenues.
         | 
          | Really though, why put all your eggs in one basket? That's what
          | I've been confused about for a while. Why fund yet another
          | LLMs-to-AGI startup? The space is saturated with big players and
          | has been for years. Even if LLMs could get there, that doesn't
          | mean something else won't get there faster and for less. It also
          | seems you'd want a backup in order to avoid popping the bubble.
          | Technology S-curves and all that still apply to AI.
         | 
          | Though I'm similarly biased, so is everyone I know with a
          | strong math and/or science background (I even mentioned it in
          | my thesis more than a few times lol). "Scaling is all you need"
          | just doesn't check out.
        
           | morleytj wrote:
           | I'm pretty curious about the same thing.
           | 
           | I think a somewhat comparable situation is in various online
           | game platforms now that I think about it. Investors would
           | love to make a game like Fortnite, and get the profits that
           | Fortnite makes. So a ton of companies try to make Fortnite.
           | Almost all fail, and make no return whatsoever, just lose a
           | ton of money and toss the game in the bin, shut down the
           | servers.
           | 
           | On the other hand, it may have been more logical for many of
           | them to go for a less ambitious (not always online, not a
           | game that requires a high player count and social buy-in to
           | stay relevant) but still profitable investment (Maybe a
           | smaller scale single player game that doesn't offer recurring
           | revenue), yet we still see a very crowded space for trying to
           | emulate the same business model as something like Fortnite.
           | Another more historical example was the constant question of
           | whether a given MMO would be the next "WoW-killer" all
           | through the 2000's/2010's.
           | 
           | I think part of why this arises is that there's definitely a
           | bit of a psychological hack for humans in particular where if
           | there's a low-probability but extremely high reward outcome,
           | we're deeply entranced by it, and investors are the same.
           | Even if the chances are smaller in their minds than they were
           | before, if they can just follow the same path that seems to
           | be working to some extent and then get lucky, they're
           | completely set. They're not really thinking about any broader
           | bubble that could exist, that's on the level of the society,
           | they're thinking about the individual, who could be very very
           | rich, famous, and powerful if their investment works. And in
           | the mind of someone debating what path to go down, I imagine
           | a more nebulous answer of "we probably need to come up with
           | some fundamentally different tools for learning and research
           | a lot of different approaches to do so" is a bit less
           | satisfying and exciting than a pitch that says "If you just
           | give me enough money, the curve will eventually hit the point
           | where you get to be king of the universe and we go colonize
           | the solar system and carve your face into the moon."
           | 
           | I also have to acknowledge the possibility that they just
           | have access to different information than I do! They might be
           | getting shown much better demos than I do, I suppose.
        
       | ilaksh wrote:
       | This is key info from the article for me:
       | 
       | > -------------------------------
       | 
       | "reasoning": {"summary": "auto"} }'
       | 
       | Here's the response from that API call.
       | 
       | https://gist.github.com/simonw/1d1013ba059af76461153722005a0...
       | 
       | Without that option the API will often provide a lengthy delay
       | while the model burns through thinking tokens until you start
       | getting back visible tokens for the final response.
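        | 
        | For anyone who wants to try it, here's a rough Python equivalent
        | of that call (my sketch, not the article's exact curl command; the
        | prompt is just an example):
        | 
        |   from openai import OpenAI
        | 
        |   client = OpenAI()
        |   response = client.responses.create(
        |       model="gpt-5",
        |       reasoning={"summary": "auto"},  # return reasoning summaries
        |       input="Generate an SVG of a pelican riding a bicycle",
        |   )
        |   # Reasoning summary items arrive alongside the final message:
        |   for item in response.output:
        |       print(item.type)
        |   print(response.output_text)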
        
       | justusthane wrote:
       | > a real-time router that quickly decides which model to use
       | based on conversation type, complexity, tool needs, and explicit
       | intent
       | 
       | This is sort of interesting to me. It strikes me that so far
       | we've had more or less direct access to the underlying model
       | (apart from the system prompt and guardrails), but I wonder if
       | going forward there's going to be more and more infrastructure
       | between us and the model.
        
         | hirako2000 wrote:
          | Consider it low-level routing. Keep in mind it allows the
          | other, non-active parts to not be in memory. Mistral AFAIK came
          | up with this concept quite a while back.
        
       | techpression wrote:
       | "They claim impressive reductions in hallucinations. In my own
       | usage I've not spotted a single hallucination yet, but that's
       | been true for me for Claude 4 and o3 recently as well--
       | hallucination is so much less of a problem with this year's
       | models."
       | 
       | This has me so confused, Claude 4 (Sonnet and Opus) hallucinates
       | daily for me, on both simple and hard things. And this is for
       | small isolated questions at that.
        
         | simonw wrote:
         | What kind of hallucinations are you seeing?
        
           | OtherShrezzing wrote:
           | I rewrote a 4 page document from first to third person a
           | couple of weeks back. I gave Claude Sonnet 4 the document
           | after editing, so it was entirely written in the third
           | person. I asked it to review & highlight places where it was
           | still in the first person.
           | 
           | >Looking through the document, I can identify several
           | instances where it's written in the first person:
           | 
           | And it went on to show a series of "they/them" statements. I
           | asked it to clarify if "they" is "first person" and it
           | responded
           | 
           | >No, "they" is not first person - it's third person. I made
           | an error in my analysis. First person would be: I, we, me,
           | us, our, my. Second person would be: you, your. Third person
           | would be: he, she, it, they, them, their. Looking back at the
           | document more carefully, it appears to be written entirely in
           | third person.
           | 
           | Even the good models are still failing at real-world use
           | cases which should be right in their wheelhouse.
        
             | simonw wrote:
             | That doesn't quite fit the definition I use for
             | "hallucination" - it's clearly a dumb error, but the model
             | didn't confidently state something that's not true (like
             | naming the wrong team who won the Super Bowl).
        
               | OtherShrezzing wrote:
               | >"They claim impressive reductions in hallucinations. In
               | my own usage I've not spotted a single hallucination yet,
               | but that's been true for me for Claude 4 and o3 recently
               | as well--hallucination is so much less of a problem with
               | this year's models."
               | 
               | Could you give an estimate of how many "dumb errors"
               | you've encountered, as opposed to hallucinations? I think
               | many of your readers might read "hallucination" and
               | assume you mean "hallucinations and dumb errors".
        
               | jmull wrote:
               | That's a good way to put it.
               | 
               | As a user, when the model tells me things that are flat
               | out wrong, it doesn't really matter whether it would be
               | categorized as a hallucination or a dumb error. From my
               | perspective, those mean the same thing.
        
               | godelski wrote:
               | I think it qualifies as a hallucination. What's your
               | definition? I'm a researcher too and as far as I'm aware
               | the definition has always been pretty broad and applied
               | to many forms of mistakes. (It was always muddy but
               | definitely got more muddy when adopted by NLP)
               | 
               | It's hard to know why it made the error but isn't it
               | caused by inaccurate "world" modeling? ("World" being
               | English language) Is it not making some hallucination
               | about the English language while interpreting the prompt
               | or document?
               | 
               | I'm having a hard time trying to think of a context where
               | "they" would even be first person. I can't find any
               | search results though Google's AI says it can. It
               | provided two links, the first being a Quora result saying
               | people don't do this but framed it as it's not
               | impossible, just unheard of. Second result just talks
               | about singular you. Both of these I'd consider
               | hallucinations too as the answer isn't supported by the
               | links.
        
           | techpression wrote:
            | Since I mostly use it for code, made-up function names are
            | the most common. And of course just broken code altogether,
            | which might not count as a hallucination.
        
         | laacz wrote:
         | I suppose that Simon, being all in with LLMs for quite a while
         | now, has developed a good intuition/feeling for framing
          | questions so that they produce fewer hallucinations.
        
           | simonw wrote:
            | Yeah I think that's exactly right. I don't ask questions that
            | are likely to produce hallucinations (like asking an LLM
            | without search access for citations from papers on a topic),
            | so I rarely see them.
        
             | godelski wrote:
             | But how would you verify? Are you constantly asking
             | questions you already know the answers to? In depth
             | answers?
             | 
             | Often the hallucinations I see are subtle, though usually
             | critical. I see it when generating code, doing my testing,
             | or even just writing. There are hallucinations in today's
             | announcements, such as the airfoil example[0]. An example
              | of a more obvious hallucination: I was asking for help
              | improving the abstract of a paper. I gave it my
             | draft and it inserted new numbers and metrics that weren't
             | there. I tried again providing my whole paper. I tried
             | again making explicit to not add new numbers. I tried the
             | whole process again in new sessions and in private
             | sessions. Claude did better than GPT 4 and o3 but none
             | would do it without follow-ups and a few iterations.
             | 
             | Honestly I'm curious what you use them for where you don't
             | see hallucinations
             | 
             | [0] which is a subtle but famous misconception. One that
             | you'll even see in textbooks. Hallucination probably caused
             | by Bernoulli being in the prompt
        
               | simonw wrote:
               | When I'm using them for code these days it is usually in
               | a tool that can execute code in a loop - so I don't tend
                | to even spot the hallucinations because the model corrects
                | itself.
               | 
               | For factual information I only ever use search-enabled
               | models like o3 or GPT-4.
               | 
               | Most of my other use cases involve pasting large volumes
               | of text into the model and having it extract information
                | or manipulate that text in some way.
        
         | bluetidepro wrote:
         | Agreed. All it takes is a simple reply of "you're wrong." to
         | Claude/ChatGPT/etc. and it will start to crumble on itself and
         | get into a loop that hallucinates over and over. It won't fight
         | back, even if it happened to be right to begin with. It has no
         | backbone to be confident it is right.
        
           | cameldrv wrote:
            | Yeah it may be that in previous training, the model was
           | given a strong negative signal when the human trainer told it
           | it was wrong. In more subjective domains this might lead to
           | sycophancy. If the human is always right and the data is
           | always right, but the data can be interpreted multiple ways,
           | like say human psychology, the model just adjusts to the
           | opinion of the human.
           | 
           | If the question is about harder facts which the human
           | disagrees with, this may put it into an essentially self-
           | contradictory state, where the locus of possibilities gets
           | squished from each direction, and so the model is forced to
           | respond with crazy outliers which agree with both the human
           | and the data. The probability of an invented reference being
           | true may be very low, but from the model's perspective, it
           | may still be one of the highest probability outputs among a
           | set of bad choices.
           | 
           | What it sounds like they may have done is just have the
           | humans tell it it's wrong when it isn't, and then award it
           | credit for sticking to its guns.
        
             | ashdksnndck wrote:
             | I put in the ChatGPT system prompt to be not sycophantic,
             | be honest, and tell me if I am wrong. When I try to correct
             | it, it hallucinates more complicated epicycles to explain
             | how it was right the first time.
        
           | diggan wrote:
           | > All it takes is a simple reply of "you're wrong." to
           | Claude/ChatGPT/etc. and it will start to crumble on itself
           | and get into a loop that hallucinates over and over.
           | 
           | Yeah, it seems to be a terrible approach to try to
           | "correct" the context by adding clarifications or telling it
           | what's wrong.
           | 
           | Instead, start from 0 with the same initial prompt you used,
           | but improve it so the LLM gets it right in the first
           | response. If it still gets it wrong, begin from 0 again. The
           | context seems to be "poisoned" really quickly, if you're
           | looking for accuracy in the responses. So it's better to start
           | over from scratch as soon as it veers off course.
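           | 
           | A minimal sketch of that start-from-zero pattern (assuming the
           | standard openai Python client; the model name, prompts and the
           | check are just illustrative):
           | 
           |   from openai import OpenAI
           |   client = OpenAI()
           | 
           |   def fresh_attempt(prompt):
           |       # every attempt starts a brand-new context
           |       r = client.chat.completions.create(
           |           model="gpt-5",
           |           messages=[{"role": "user", "content": prompt}])
           |       return r.choices[0].message.content
           | 
           |   def looks_right(answer):
           |       # stand-in for whatever manual/automatic check you do
           |       return "FIXME" not in answer
           | 
           |   prompt = "Summarise this abstract, keeping only numbers "
           |   prompt += "that appear in the text: <abstract here>"
           |   answer = fresh_attempt(prompt)
           |   if not looks_right(answer):
           |       # don't "correct" in place - refine the prompt, retry
           |       prompt += "\nDo not introduce any new numbers."
           |       answer = fresh_attempt(prompt)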
        
         | squeegmeister wrote:
         | Yeah, hallucinations are very context-dependent. I'm guessing OP
         | is working in very well-documented domains.
        
         | Oras wrote:
         | Here you go
         | https://pbs.twimg.com/media/Gxxtiz7WEAAGCQ1?format=jpg&name=...
        
           | simonw wrote:
           | How is that a hallucination?
        
         | madduci wrote:
         | I believe it depends on the inputs. For me, Claude 4 has
         | consistently generated hallucinations; it was especially
         | confident in generating invalid JSON, for instance Grafana
         | dashboards full of syntactic errors.
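         | 
         | A cheap guard for that failure mode is to refuse to use any
         | output that doesn't parse (a minimal sketch; raw here is
         | whatever string the model returned):
         | 
         |   import json
         | 
         |   def parse_or_reject(raw):
         |       # don't ship model output that isn't even valid JSON;
         |       # retry or fall back instead
         |       try:
         |           return json.loads(raw)
         |       except json.JSONDecodeError as e:
         |           raise ValueError(f"model returned invalid JSON: {e}")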
        
         | godelski wrote:
         | There were also several hallucinations during the announcement.
         | (I also see hallucinations every time I use Claude and GPT,
         | which is several times a week. Paid and free tiers)
         | 
         | So not seeing them means either lying or incompetence. I always
         | try to attribute to stupidity rather than malice (Hanlon's
         | razor).
         | 
         | The big problem with LLMs is that they optimize for human
         | preference. This means they optimize for hidden errors.
         | 
         | Personally I'm really cautious about using tools that have
         | stealthy failure modes. They just lead to many problems and
         | lots of wasted hours debugging, even when failure rates are
         | low. It just causes everything to slow down for me as I'm
         | double checking everything and need to be much more meticulous
         | if I know it's hard to see. It's like having a line of Python
         | indented with an inconsistent whitespace character: impossible
         | to see. But what if you didn't have the interpreter telling you
         | which line failed, or the ability to search for and highlight
         | those characters? At least in that case you'd know there's an
         | error. It's hard enough dealing with human-generated invisible
         | errors, but this just seems to perpetuate the LGTM crowd.
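         | 
         | To make the analogy concrete, a quick check for that particular
         | invisible error (plain Python, nothing LLM-specific):
         | 
         |   import sys
         | 
         |   # flag lines whose indentation mixes tabs and spaces -
         |   # invisible in most editors, a classic source of
         |   # hard-to-spot TabError / IndentationError
         |   for n, line in enumerate(open(sys.argv[1]), start=1):
         |       indent = line[:len(line) - len(line.lstrip())]
         |       if "\t" in indent and " " in indent:
         |           print(f"line {n}: mixed tabs and spaces")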
        
         | simonw wrote:
         | I updated that section of my post with a clarification about
         | what I meant. Thanks for calling this out, it definitely needed
         | extra context from me.
        
       | drumhead wrote:
       | "Are you GPT5" - No I'm 4o, 5 hasnt been released yet. "It was
       | released today". Oh you're right, Im GPT5. _You have reached the
       | limit of the free usage of 4o_
        
       | cchance wrote:
       | It's basically Opus 4.1 ... but cheaper?
        
         | gwd wrote:
         | Cheaper is an understatement... it's less than 1/10 the price
         | for input and nearly 1/8 for output. Part of me wonders if
         | they're using their massive new investment to sell API access
         | below cost and drive out the competition. If they're really
         | getting Opus 4.1 performance for half of Sonnet's compute
         | cost, they've done really well.
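         | 
         | (Working the numbers, if I have the listed prices right: GPT-5
         | at $1.25/M input and $10/M output vs Opus 4.1 at $15/M and
         | $75/M is roughly 1/12 of the input price and about 2/15 of the
         | output price.)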
        
           | diggan wrote:
           | I'm not sure I'd be surprised, I've been playing around with
           | GPT-OSS last few days, and the architecture seems really fast
           | for the accuracy/quality of responses, way better than most
           | local weights I've tried for the last two years or so. And
           | since they released that architecture publicly, I'd imagine
           | they're sitting on something even better privately.
        
       | aliljet wrote:
       | I'm curious what platform people are using to test GPT-5? I'm so
       | deep into the claude code world that I'm actually unsure what the
       | best option is outside of claude code...
        
         | te_chris wrote:
         | Cursor
        
         | simonw wrote:
         | I've been using codex CLI, OpenAI's Claude Code equivalent. You
         | can run it like this:
         | OPENAI_DEFAULT_MODEL=gpt-5 codex
        
       | cainxinth wrote:
       | It's fascinating and hilarious that a pelican on a bicycle in SVG
       | is still such a challenge.
        
         | muglug wrote:
         | How easy is it for you to create an SVG of a pelican riding a
         | bicycle in a text editor by hand?
        
           | jopsen wrote:
           | Without looking at the rendered output :)
        
       | joshmlewis wrote:
       | It seems to be trained to use tools effectively to gather
       | context. In this example against 4.1 and o3 it made 6 tool calls
       | in the first turn in a pretty cool way (fetching different
       | categories that could be relevant). Token use increases with that
       | kind of tool calling, but the aggressive pricing should make that
       | moot. You could probably get it to not be so tool-happy with
       | prompting as well.
       | 
       | https://promptslice.com/share/b-2ap_rfjeJgIQsG
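       | 
       | For anyone who hasn't wired this up before, a minimal sketch of
       | passing tools to the API (standard openai Python client; the
       | tool and model names are just illustrative):
       | 
       |   from openai import OpenAI
       |   client = OpenAI()
       | 
       |   tools = [{
       |       "type": "function",
       |       "function": {
       |           "name": "get_categories",
       |           "description": "List product categories",
       |           "parameters": {"type": "object", "properties": {}},
       |       },
       |   }]
       | 
       |   resp = client.chat.completions.create(
       |       model="gpt-5",
       |       messages=[{"role": "user",
       |                  "content": "Which categories fit this item?"}],
       |       tools=tools)
       | 
       |   # a tool-happy model may return several calls in one turn
       |   for call in resp.choices[0].message.tool_calls or []:
       |       print(call.function.name, call.function.arguments)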
        
       | tomrod wrote:
       | Simon, as always, I appreciate your succinct and dedicated
       | writeup. This really helps to land the results.
        
       ___________________________________________________________________
       (page generated 2025-08-07 23:00 UTC)