[HN Gopher] GPT-5: Key characteristics, pricing and system card
___________________________________________________________________
GPT-5: Key characteristics, pricing and system card
System card: https://cdn.openai.com/pdf/8124a3ce-
ab78-4f06-96eb-49ea29ffb...
Author : Philpax
Score : 374 points
Date : 2025-08-07 17:46 UTC (5 hours ago)
(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)
| ks2048 wrote:
| So, "system card" now means what used to be a "paper", but
| without lots of the details?
| kaoD wrote:
| Nope. System card is a sales thing. I think we generally call
| that "product sheet" in other markets.
| simonw wrote:
| AI labs tend to use "system cards" to describe their evaluation
| and safety research processes.
|
| They used to be more about the training process itself, but
| that's increasingly secretive these days.
| empiko wrote:
| Despite the fact that their models are used in hiring, business,
| education, etc., this multibillion-dollar company uses a single
| benchmark with very artificial questions (BBQ) to evaluate how
| fair their model is. I am a little bit disappointed.
| Leary wrote:
| METR time horizon of only 2 hours and 15 minutes. Fast takeoff
| less likely.
| qsort wrote:
| Isn't that pretty much in line with what people were expecting?
| Is it surprising?
| dingnuts wrote:
| It's not surprising to AI critics but go back to 2022 and
| open r/singularity and then answer: what "people" were
| expecting? Which people?
|
| SamA has been promising AGI next year for three years like
| Musk has been promising FSD next year for the last ten years.
|
| IDK what "people" are expecting but with the amount of hype
| I'd have to guess they were expecting more than we've gotten
| so far.
|
| The fact that "fast takeoff" is a term I recognize indicates
| that some people believed OpenAI when they said this
| technology (transformers) would lead to sci fi style AI and
| that is most certainly not happening
| falcor84 wrote:
| I would say that there are quite a lot of roles where you
| need to do a lot of planning to effectively manage an ~8
| hour shift, but then there are good protocols for handing
| over to the next person. So once AIs get to that level (in
| 2027?), we'll be much closer to AIs taking on "economically
| valuable work".
| ToValueFunfetti wrote:
| >SamA has been promising AGI next year for three years like
| Musk has been promising FSD next year for the last ten
| years.
|
| Has he said anything about it since last September, when he
| wrote:
|
| >It is possible that we will have superintelligence in a
| few thousand days (!); it may take longer, but I'm
| confident we'll get there.
|
| This is, at an absolute minimum, 2000 days, or roughly five and
| a half years. And he says it may take longer.
|
| Did he even say AGI next year any time before this? It
| looks like his predictions were all pointing at the late
| 2020s, and now he's thinking early 2030s. Which you could
| still make fun of, but it just doesn't match up with your
| characterization at all.
| usaar333 wrote:
| No, this is below expectations on both Manifold and lesswrong
| (https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_green
| ...). Median was ~2.75 hours on both (which already
| represented a bearish slowdown).
|
| Not massively off -- Manifold yesterday implied the odds of a
| result this low were ~35%, and 30% before Claude Opus 4.1 came
| out, which updated expected agentic coding abilities downward.
| qsort wrote:
| Thanks for sharing, that was a good thread!
| umanwizard wrote:
| What is METR?
| Leary wrote:
| METR (Model Evaluation & Threat Research) is the group that
| measures the length of tasks a model can complete autonomously;
| here's their GPT-5 report:
| https://metr.github.io/autonomy-evals-guide/gpt-5-report/
| tunesmith wrote:
| The 2h 15m is the length of tasks the model can complete with
| 50% probability. So longer is better in that sense. Or at
| least, "more advanced" and potentially "more dangerous".
| ravendug wrote:
| https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/metr-
| measu...
| kqr wrote:
| Seems like it's on the trend line that's scaring people, like
| the AI 2027 forecast, isn't it?
| https://aisafety.no/img/articles/length-of-tasks-log.png
| FergusArgyll wrote:
| It's above the exponential line and right around the
| superexponential line.
| nickthegreek wrote:
| This new naming convention, while not perfect, is a lot clearer,
| and I'm sure it will help my coworkers.
| anyg wrote:
| Good to know: > Knowledge cut-off is September 30th 2024 for
| GPT-5 and May 30th 2024 for GPT-5 mini and nano
| falcor84 wrote:
| Oh wow, so essentially a full year of post-training and
| testing. Or was it ready earlier, and postponing the release was
| simply a good business-strategy decision?
| thorum wrote:
| The Information's report from earlier this month claimed that
| GPT-5 was only developed in the last 1-2 months, after some
| sort of breakthrough in training methodology.
|
| > As recently as June, the technical problems meant none of
| OpenAI's models under development seemed good enough to be
| labeled GPT-5, according to a person who has worked on it.
|
| But it could be that this refers to post-training and the
| base model was developed earlier.
|
| https://www.theinformation.com/articles/inside-openais-
| rocky...
|
| https://archive.ph/d72B4
| simonw wrote:
| My understanding is that training data cut-offs and the dates
| at which the model was trained are independent things.
|
| AI labs gather training data and then do a ton of work to
| process it, filter it etc.
|
| Model training teams run different parameters and
| techniques against that processed training data.
|
| It wouldn't surprise me to hear that OpenAI had collected
| data up to September 2024, dumped that data in a data
| warehouse of some sort, then spent months experimenting
| with ways to filter and process it and different training
| parameters to run against it.
| NullCascade wrote:
| OpenAI is much more aggressively targeted by NYTimes and
| similar organizations for "copyright violations".
| bhouston wrote:
| Weird to have such an early knowledge cutoff. Claude 4.1 has
| March 2025, 6 months more recent, with comparable results.
| bn-l wrote:
| Is that late enough for it to have heard of svelte 5?
| dortlick wrote:
| Yeah I thought that was strange. Wouldn't it be important to
| have more recent data?
| cco wrote:
| Only a third cheaper than Sonnet 4? Incrementally better I
| suppose.
|
| > and minimizing sycophancy
|
| Now we're talking about a good feature! Actually one of my
| biggest annoyances with Cursor (that mostly uses Sonnet).
|
| "You're absolutely right!"
|
| I mean not really Cursor, but ok. I'll be super excited if we can
| get rid of these sycophancy tokens.
| logicchains wrote:
| >Only a third cheaper than Sonnet 4?
|
| The price should be compared to Opus, not Sonnet.
| cco wrote:
| Wow, if so, 7x cheaper. Crazy if true.
| nosefurhairdo wrote:
| In my early testing gpt5 is significantly less annoying in this
| regard. Gives a strong vibe of just doing what it's told
| without any fluff.
| bdcdo wrote:
| "GPT-5 in the API is simpler: it's available as three models--
| regular, mini and nano--which can each be run at one of four
| reasoning levels: minimal (a new level not previously available
| for other OpenAI reasoning models), low, medium or high."
|
| Is it actually simpler? For those who are currently using GPT
| 4.1, we're going from 3 options (4.1, 4.1 mini and 4.1 nano) to
| at least 8, if we don't consider gpt 5 regular - we now will have
| to choose between gpt 5 mini minimal, gpt 5 mini low, gpt 5 mini
| medium, gpt 5 mini high, gpt 5 nano minimal, gpt 5 nano low, gpt
| 5 nano medium and gpt 5 nano high.
|
| And, while choosing between all these options, we'll always have
| to wonder: should I try adjusting the prompt that I'm using, or
| simply change the gpt 5 version or its reasoning level?
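|
| For concreteness, here's a rough sketch of what that choice looks
| like in code. The model names and reasoning levels are the ones
| from the article; the exact call shape (the openai Python client's
| Responses API with a "reasoning" object) is my assumption of how
| you'd wire it up:
|
|     # Sketch only: pick one of three model sizes and one of four
|     # reasoning levels per request.
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|
|     response = client.responses.create(
|         model="gpt-5-mini",               # or "gpt-5", "gpt-5-nano"
|         reasoning={"effort": "minimal"},  # or "low", "medium", "high"
|         input="Summarize this changelog in three bullet points: ...",
|     )
|     print(response.output_text)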
| impossiblefork wrote:
| Yes, I think so. It's n=1,2,3 m=0,1,2,3. There's structure and
| you know that each parameter goes up and in which direction.
| makeramen wrote:
| But given the option, do you choose bigger models or more
| reasoning? Or medium of both?
| namibj wrote:
| Depends on what you're doing.
| addaon wrote:
| > Depends on what you're doing.
|
| Trying to get an accurate answer (best correlated with
| objective truth) on a topic I don't already know the
| answer to (or why would I ask?). This is, to me, the
| challenge with the "it depends, tune it" answers that
| always come up in how to use these tools -- it requires
| the tools to not be useful for you (because there's
| already a solution) to be able to do the tuning.
| wongarsu wrote:
| If cost is no concern (as in infrequent one-off tasks)
| then you can always go with the biggest model with the
| most reasoning. Maybe compare it with the biggest model
| with no/less reasoning, since sometimes reasoning can
| hurt (just as with humans overthinking something).
|
| If you have a task you do frequently, you need some kind
| of benchmark - which might just be comparing how well the
| output of the smaller models holds up against the output of
| the bigger model, if you don't know the ground truth.
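|
| A rough sketch of that kind of comparison, assuming the openai
| Python client and the GPT-5 model names from the article (the
| judging step is left open - eyeball the pairs, or have the big
| model grade them):
|
|     # Collect side-by-side outputs: big model as the reference,
|     # small/cheap model as the candidate you want to validate.
|     import json
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def run(model, effort, prompt):
|         r = client.responses.create(
|             model=model, reasoning={"effort": effort}, input=prompt
|         )
|         return r.output_text
|
|     prompts = ["Explain what a Bloom filter is in two sentences."]
|
|     with open("model_comparison.jsonl", "w") as f:
|         for p in prompts:
|             row = {
|                 "prompt": p,
|                 "reference": run("gpt-5", "high", p),
|                 "candidate": run("gpt-5-mini", "minimal", p),
|             }
|             f.write(json.dumps(row) + "\n")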
| impossiblefork wrote:
| I would have to get experience with them. I mostly use
| Mistral, so I have only the choice of thinking or not
| thinking.
| gunalx wrote:
| Mistral also has small medium and large. With both small
| and medium having a thinking one, devstral codestral ++
|
| Not really that mich simpler.
| impossiblefork wrote:
| Ah, but I never route to these manually. I only use LLMs
| a little bit, mostly to try to see what they can't do.
| paladin314159 wrote:
| If you need world knowledge, then bigger models. If you
| need problem-solving, then more reasoning.
|
| But the specific nuance of picking nano/mini/main and
| minimal/low/medium/high comes down to experimentation and
| what your cost/latency constraints are.
| mwigdahl wrote:
| If reasoning is on the table, then you already had to add
| o3-mini-high, o3-mini-medium, o3-mini-low, o4-mini-high,
| o4-mini-medium, and o4-mini-low to the 4.1 variants. The GPT-5
| way seems simpler to me.
| hirako2000 wrote:
| Ultimately they are selling tokens, so try many times.
| vineyardmike wrote:
| When I read "simpler" I interpreted that to mean they don't use
| their Chat-optimized harness to guess which reasoning level and
| model to use. The subscription chat service (ChatGPT) and the
| chat-optimized model on their API seem to have a special
| harness that changes reasoning based on some heuristics, and
| will switch between the model sizes without user input.
|
| With the API, you pick a model size and a reasoning effort. Yes,
| more choices, but also a clear mental model and a simple choice
| that you control.
| diggan wrote:
| > but for the moment here's the pelican I got from GPT-5 running
| at its default "medium" reasoning effort:
|
| Would been interesting to see a comparison between low, medium
| and high reasoning_effort pelicans :)
|
| When I've played around with GPT-OSS-120b recently, seems the
| difference in the final answer is huge, where "low" is
| essentially "no reasoning" and with "high" it can spend seemingly
| endless amount of tokens. I'm guessing the difference with GPT-5
| will be similar?
| simonw wrote:
| > Would been interesting to see a comparison between low,
| medium and high reasoning_effort pelicans
|
| Yeah, I'm working on that - expect dozens of more pelicans in a
| later post.
| zaronymous1 wrote:
| Can anyone explain to me why they've removed parameter controls
| for temperature and top-p in reasoning models, including gpt-5?
| It strikes me that it makes it harder to build with these to do
| small tasks requiring high-levels of consistency, and in the API,
| I really value the ability to set certain tasks to a low temp.
| Der_Einzige wrote:
| It's because all forms of sampler settings destroy
| safety/alignment. That's why top_p/top_k are still used and not
| TFS, min_p, top-n sigma, etc., and why temperature is locked to
| the arbitrary 0-2 range, and so on.
|
| Open source is years ahead of these guys on samplers, which
| makes it that much more impressive that their models are so
| good.
| oblio wrote:
| Temperature is the response variation control?
| AH4oFVbPT4f8 wrote:
| Yes, it controls variability or probability of the next
| token or text to be selected.
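|
| A small sketch of what that looks like for a non-reasoning model,
| where the knobs are still exposed (model name and values are just
| examples; per the comment above, reasoning models no longer
| accept them):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     resp = client.chat.completions.create(
|         model="gpt-4.1-mini",
|         messages=[{"role": "user",
|                    "content": "Name a CLI tool that syncs dotfiles."}],
|         temperature=0.2,  # lower = more deterministic token choices
|         top_p=0.9,        # nucleus sampling: top 90% probability mass
|     )
|     print(resp.choices[0].message.content)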
| hodgehog11 wrote:
| The aggressive pricing here seems unusual for OpenAI. If they had
| a large moat, they wouldn't need to do this. Competition is
| fierce indeed.
| 0x00cl wrote:
| Maybe the need/want data.
| dr_dshiv wrote:
| And it's a massive distillation of the mother model, so the
| costs of inference are likely low.
| impure wrote:
| OpenAI and most AI companies do not train on data submitted
| to a paid API.
| WhereIsTheTruth wrote:
| They also do not train using copyrighted material /s
| daveguy wrote:
| Oh, they never even made that promise. They're trying to
| say it's fine to launder copyrighted material through a
| model.
| simonw wrote:
| That's different. They train on scrapes of the web. They
| don't train on data submitted to their API by their
| paying customers.
| johnnyanmac wrote:
| If they're bold enough to say they train on data they do
| not own, I am not optimistic when they say they don't
| train on data people willingly submit to them.
| simonw wrote:
| I don't understand your logic there.
|
| They have confessed to doing a bad thing - training on
| copyrighted data without permission. Why does that
| indicate they would lie about a worse thing?
| johnnyanmac wrote:
| >Why does that indicate they would lie about a worse
| thing?
|
| Because they know their audience. It's an audience that
| also doesn't care for copyright and would love for them
| to win their court cases. They are fine making such an
| argument to those kinds of people.
|
| Meanwhile, when lawyers ran a very typical subpoena
| process on said data - data they chose to submit to an
| online server of their own volition - the same audience
| completely freaked out. Suddenly, they felt like their
| privacy was invaded.
|
| It doesn't make any logical sense in my mind, but a lot
| of the discourse over this topic isn't based on logic.
| dortlick wrote:
| Why don't they?
| echoangle wrote:
| They probably fear that people wouldn't use the API
| otherwise, I guess. They could have different tiers
| though where you pay extra so your data isn't used for
| training.
| anhner wrote:
| If you believe that, I have a bridge I can sell you...
| Uehreka wrote:
| If it ever leaked that OpenAI was training on the vast
| amounts of confidential data being sent to them, they'd
| be immediately crushed under a mountain of litigation and
| probably have to shut down. Lots of people at big
| companies have accounts, and the bigcos are only letting
| them use them because of that "Don't train on my data"
| checkbox. Not all of those accounts are necessarily tied
| to company emails either, so it's not like OpenAI can
| discriminate.
| impure wrote:
| The 5 cents for Nano is interesting. Maybe it will force Google
| to start dropping their prices again which have been slowly
| creeping up recently.
| ilaksh wrote:
| It's like 5% better. I think they obviously had no choice but
| to be price competitive with Gemini 2.5 Pro. Especially for
| Cursor to change their default.
| FergusArgyll wrote:
| They are winning by massive margins in the app, but losing (!)
| in the API to Anthropic.
|
| https://finance.yahoo.com/news/enterprise-llm-spend-reaches-...
| canada_dry wrote:
| Perhaps they're feeling the effect of losing PRO clients (like
| me) lately.
|
| Their PRO models were not (IMHO) worth 10X that of PLUS!
|
| Not even close.
|
| Especially when new competitors (eg. z.ai) are offering very
| compelling competition.
| onehair wrote:
| > Definitely recognizable as a pelican
|
| right :-D
| pancakemouse wrote:
| Practically the first thing I do after a new model release is try
| to upgrade `llm`. Thank you, @simonw !
| efavdb wrote:
| Same - looks like he hasn't added 5.0 to the package yet, but I
| assume that's imminent.
|
| https://llm.datasette.io/en/stable/openai-models.html
| simonw wrote:
| Working on that now! https://github.com/simonw/llm/issues/1229
| isoprophlex wrote:
| Whoa this looks good. And cheap! How do you hack a proxy together
| so you can run Claude Code on gpt-5?!
| dalberto wrote:
| Consider: https://github.com/musistudio/claude-code-router
|
| or even: https://github.com/sst/opencode
|
| Not affiliated with either one of these, but they look
| promising.
| morleytj wrote:
| It's cool and I'm glad it sounds like it's getting more reliable,
| but given the types of things people have been saying GPT-5 would
| be for the last two years, you'd expect GPT-5 to be a world-
| shattering release rather than an incremental and stable
| improvement.
|
| It does sort of give me the vibe that the pure scaling maximalism
| really is dying off, though. If the approach is now writing better
| routers, tooling, and combining specialized submodels on tasks,
| then it feels like there's a search for new ways to improve
| performance (and lower cost), suggesting the other established
| approaches weren't working. I could totally be wrong, but I feel
| like if just throwing more compute at the problem were working,
| OpenAI probably wouldn't be spending much time optimizing the
| user routing on currently existing strategies to get marginal
| improvements on average user interactions.
|
| I've been pretty negative on the thesis of only needing more
| data/compute to achieve AGI with current techniques though, so
| perhaps I'm overly biased against it. If there's one thing that
| bothers me in general about the situation though, it's that it
| feels like we really have no clue what the actual status of these
| models is because of how closed off all the industry labs have
| become + the feeling of not being able to expect anything other
| than marketing language from the presentations. I suppose that's
| inevitable with the massive investments though. Maybe they've got
| some massive earthshattering model release coming out next, who
| knows.
| jstummbillig wrote:
| Things have moved differently than we thought they would 2 years
| ago, but let's not forget what has happened in the meantime (4o,
| o1 + the thinking paradigm, o3).
|
| So yeah, maybe we are getting more incremental improvements.
| But that to me seems like a good thing, because more good
| things earlier. I will take that over world-shattering any day
| - but if we were to consider everything that has happened since
| the first release of gpt-4, I would argue the total amount is
| actually very much world-shattering.
| GaggiX wrote:
| Compared to GPT-4, it is on a completely different level, given
| that it is a reasoning model, so in that regard it does deliver,
| and it's not just scaling - but there I guess the revolution was
| o1, and GPT-5 is just a much more mature version of the
| technology.
| hnuser123456 wrote:
| I agree, we have now proven that GPUs can ingest information
| and be trained to generate content for various tasks. But to
| put it to work, make it useful, requires far more thought about
| a specific problem and how to apply the tech. If you could just
| ask GPT to create a startup that'll be guaranteed to be worth
| $1B on a $1k investment within one year, someone else would've
| already done it. Elbow grease still required for the
| foreseeable future.
|
| In the meantime, figuring out how to train them to make less of
| their most common mistakes is a worthwhile effort.
| morleytj wrote:
| Certainly, yes, plenty of elbow grease required in all things
| that matter.
|
| The interesting point as well to me though, is that if it
| could create a startup that was worth $1B, that startup
| wouldn't be worth $1B.
|
| Why would anyone pay that much to invest in the startup if
| they could recreate the entire thing with the same tool that
| everyone would have access to?
| BoiledCabbage wrote:
| Performance is doubling roughly every 4-7 months. That trend is
| continuing. That's insane.
|
| If your expectations were any higher than that, then it
| seems like you were caught up in hype. Doubling 2-3 times per
| year isn't leveling off by any means.
|
| https://metr.github.io/autonomy-evals-guide/gpt-5-report/
| oblio wrote:
| By "performance" I guess you mean "the length of task that
| can be done adequately"?
|
| It is a benchmark but I'm not very convinced it's the be-all,
| end-all.
| nomel wrote:
| > It is a benchmark but I'm not very convinced it's the be-
| all, end-all.
|
| Who's suggesting it is?
| morleytj wrote:
| I wouldn't say model development and performance is "leveling
| off", and in fact didn't write that. I'd say that tons more
| funding is going into the development of many models, so one
| would expect performance increases unless the paradigm was
| completely flawed at its core, a belief I wouldn't
| personally profess to. My point was more so the following: A
| couple years ago it was easy to find people saying that all
| we needed was to add in video data, or genetic data, or some
| other data modality, in the exact same format that the models
| trained on existing language data were, and we'd see a fast
| takeoff scenario with no other algorithmic changes. Given
| that the top labs seem to be increasingly investigating
| alternate approaches to setting up the models beyond just
| adding more data sources, and have been for the last couple
| years (which, I should clarify, is a good idea in my opinion),
| then the probability of those statements of just adding more
| data or more compute taking us straight to AGI being correct
| seems at the very least slightly lower, right?
|
| Rather than my personal opinion, I was commenting on commonly
| viewed opinions of people I would believe to have been caught
| up in hype in the past. But I do feel that although that's a
| benchmark, it's not necessarily the end-all of benchmarks.
| I'll reserve my final opinions until I test personally, of
| course. I will say that increasing the context window
| probably translates pretty well to longer context task
| performance, but I'm not entirely convinced it directly
| translates to individual end-step improvement on every class
| of task.
| andrepd wrote:
| We can barely measure "performance" in any objective sense,
| let alone claim that it's doubling every 4 months.....
| simonw wrote:
| I for one am pretty glad about this. I like LLMs that augment
| human abilities - tools that help people get more done and be
| more ambitious.
|
| The common concept for AGI seems to be much more about human
| replacement - the ability to complete "economically valuable
| tasks" better than humans can. I still don't understand what
| our human lives or economies would look like there.
|
| What I personally wanted from GPT-5 is exactly what I got:
| models that do the same stuff that existing models do, but more
| reliably and "better".
| morleytj wrote:
| I'd agree on that.
|
| That's pretty much the key component these approaches have
| been lacking on, the reliability and consistency on the tasks
| they already work well on to some extent.
|
| I think there's a lot of visions of what our human lives
| would look like in that world that I can imagine, but your
| comment did make me think of one particularly interesting
| tautological scenario in that commonly defined version of
| AGI.
|
| If artificial general intelligence is defined as completing
| "economically valuable tasks" better than humans can, it
| requires one to define "economically valuable." As it
| currently stands, something holds value in an economy
| relative to human beings wanting it. Houses get expensive
| because many people, each of whom have economic utility which
| they use to purchase things, want to have houses, of which
| there is a limited supply for a variety of reasons. If human
| beings are not the most effective producers of value in the
| system, they lose capability to trade for things, which
| negates that existing definition of economic value. Doesn't
| matter how many people would pay $5 for your widget
| if people have no economic utility relative to AGI, meaning
| they cannot trade that utility for goods.
|
| In general that sort of definition of AGI being held reveals
| a bit of a deeper belief, which is that there is some version
| of economic value detached from the humans consuming it. Some
| sort of nebulous concept of progress, rather than the
| acknowledgement that for all of human history, progress and
| value have both been relative to the people themselves
| getting some form of value or progress. I suppose it
| generally points to the idea of an economy without consumers,
| which is always a pretty bizarre thing to consider, but in
| that case, wouldn't it just be a definition saying that "AGI
| is achieved when it can do things that the people who control
| the AI system think are useful"? Since in that case, the
| economy would eventually largely consist of the people
| controlling the most economically valuable agents.
|
| I suppose that's the whole point of the various alignment
| studies, but I do find it kind of interesting to think about
| the fact that even the concept of something being
| "economically valuable", which sounds very rigorous and
| measurable to many people, is so nebulous as to be dependent
| on our preferences and wants as a society.
| thorum wrote:
| The quiet revolution is happening in tool use and multimodal
| capabilities. Moderate incremental improvements on general
| intelligence, but dramatic improvements on multi-step tool use
| and ability to interact with the world (vs 1 year ago), will
| eventually feed back into general intelligence.
| darkhorse222 wrote:
| Completely agree. General intelligence is a building block.
| By chaining things together you can achieve meta programming.
| The trick isn't to create one perfect block but to build a
| variety of blocks and make one of those blocks a block-
| builder.
| coolKid721 wrote:
| [flagged]
| dang wrote:
| Can you please make your substantive points thoughtfully?
| Thoughtful criticism is welcome but snarky putdowns and
| one-liners, etc., degrade the discussion for everyone.
|
| You've posted substantive comments in other threads, so
| this should be easy to fix.
|
| If you wouldn't mind reviewing
| https://news.ycombinator.com/newsguidelines.html and taking
| the intended spirit of the site more to heart, we'd be
| grateful.
| belter wrote:
| > Maybe they've got some massive earthshattering model release
| coming out next, who knows.
|
| Nothing in the current technology offers a path to AGI. These
| models are fixed after training completes.
| echoangle wrote:
| Why do you think that AGI necessitates modification of the
| model during use? Couldn't all the insights the model gains
| be contained in the context given to it?
| belter wrote:
| Because: https://en.wikipedia.org/wiki/Anterograde_amnesia
| echoangle wrote:
| Like I already said, the model can remember stuff as long
| as it's in the context. LLMs can obviously remember stuff
| they were told or output themselves, even a few messages
| later.
| godelski wrote:
| > the model can remember stuff as long as it's in the
| context.
|
| You would need an infinite context or compression.
|
| Also, you might be interested in this theorem (roughly: if X ->
| Y -> Z is a Markov chain, then I(X;Z) <= I(X;Y), i.e.
| post-processing can't add information):
|
| https://en.wikipedia.org/wiki/Data_processing_inequality
| echoangle wrote:
| > You would need an infinite context or compression
|
| Only if AGI would require infinite knowledge, which it
| doesn't.
| belter wrote:
| AGI needs to genuinely learn and build new knowledge from
| experience, not just generate creative outputs based on
| what it has already seen.
|
| LLMs might look "creative" but they are just remixing
| patterns from their training data and what is in the
| prompt. They can't actually update themselves or remember
| new things after training as there is no ongoing feedback
| loop.
|
| This is why you can't send an LLM to medical school and
| expect it to truly "graduate". It cannot acquire or
| integrate new knowledge from real-world experience the
| way a human can.
|
| Without a learning feedback loop, these models are unable
| to interact meaningfully with a changing reality or
| fulfill the expectation from an AGI: Contribute to new
| science and technology.
| echoangle wrote:
| I agree that this is kind of true with a plain chat
| interface, but I don't think that's an inherent limit of
| an LLM. I think OpenAI actually has a memory feature
| where the LLM can specify data it wants to save and can
| then access later. I don't see why this in principle
| wouldn't be enough for the LLM to learn new data as time
| goes on. All possible counter arguments seem related to
| scale (of memory and context size), not the principle
| itself.
|
| Basically, I wouldn't say that an LLM can never become
| AGI due to its architecture. I also am not saying that
| LLM will become AGI (I have no clue), but I don't think
| the architecture itself makes it impossible.
| belter wrote:
| LLMs lack mechanisms for persistent memory, causal world
| modeling, and self-referential planning. Their
| transformer architecture is static and fundamentally
| constrains dynamic reasoning and adaptive learning. All
| core requirements for AGI.
|
| So yeah, AGI is impossible with today's LLMs. But at least
| we got to watch Sam Altman and Mira Murati drop their
| voices an octave onstage and announce "a new dawn of
| intelligence" every quarter. Remember Sam Altman's $7
| trillion?
|
| Now that the AGI party is over, its time to sell those
| NVDA shares and prepare for the crash. What a ride it
| was. I am grabbing the popcorn.
| godelski wrote:
| Because time marches on and with it things change.
|
| You _could_ maybe accomplish this if you could fit all new
| information into context or with cycles of compression, but
| that is kind of a crazy ask. There's too much new
| information, even considering compression. It certainly
| wouldn't allow for exponential growth (I'd expect
| sublinear).
|
| I think a lot of people greatly underestimate how much new
| information is created every day. It's hard to see if you're
| not doing research yourself and watching how incremental but
| constant improvement compounds. But try just looking at
| whatever company you work for. Do you know everything that
| people did that day? It takes more time to generate
| information than to process it, so that's in your favor, but
| do you really think you could keep up? Maybe at a very high
| level, but in that case you're missing a lot of
| information.
|
| Think about it this way: if that could be done, then LLMs
| wouldn't need training or tuning, because you could do
| everything through prompting.
| echoangle wrote:
| The specific instance doesn't need to know everything
| happening in the world at once to be AGI though. You
| could feed the trained model different contexts based on
| the task (and even let the model tell you what kind of
| raw data it wants) and it could still hypothetically be
| smarter than a human.
|
| I'm not saying this is a realistic or efficient method to
| create AGI, but I think the argument "Model is static
| once trained -> model can't be AGI" is fallacious.
| cchance wrote:
| Sam is a hype CEO; he literally hypes his company nonstop, then
| the announcements come and... they're... OK, so people aren't
| really upset, but the announcements end up feeling lackluster
| next to the hype... until the next cycle comes around...
|
| If you want actual big moves, watch google, anthropic, qwen,
| deepseek.
|
| The Qwen and DeepSeek teams honestly seem so much better at
| under-promising and over-delivering.
|
| Can't wait to see what Gemini 3 looks like too.
| brandall10 wrote:
| To be fair, this is one of the pathways GPT-5 was speculated to
| take as far back as 6 or so months ago - simply being an
| incremental upgrade from a performance perspective, but a leap
| from a product simplification approach.
|
| At this point it's pretty much a given that it's a game of
| inches moving forward.
| ac29 wrote:
| > a leap from a product simplification approach.
|
| According to the article, GPT-5 is actually three models and
| they can be run at 4 levels of thinking. That's a dozen ways
| you can run any given input on "GPT-5", so it's hardly a
| simple product lineup (but maybe better than before).
| AbstractH24 wrote:
| > It's cool and I'm glad it sounds like it's getting more
| reliable, but given the types of things people have been saying
| GPT-5 would be for the last two years you'd expect GPT-5 to be
| a world-shattering release rather than incremental and stable
| improvement.
|
| Are you trying to say the curve is flattening? That advances
| are coming slower and slower?
|
| As long as it doesn't suggest a dot com level recession I'm
| good.
| morleytj wrote:
| I suppose what I'm getting at is that if there are
| performance increases on a steady pace, but the investment
| needed to get those performance increases is on a much faster
| growth rate, it's not really a fair comparison in terms of a
| rate of progress, and could suggest diminishing returns from
| a particular approach. I don't really have the actual data to
| make a claim either way though,I think anyone would need more
| data to do so than is publicly accessible.
|
| But I do think the fact that we can publicly observe this
| reallocation of resources and emphasized aspects of the
| models gives us a bit of insight into what could be happening
| behind the scenes if we think about the reasons why those
| shifts could have happened, I guess.
| godelski wrote:
| > It does sort of give me the vibe that the pure scaling
| maximalism really is dying off though
|
| I think the big question is if/when investors will start giving
| money to those who have been predicting this (with evidence)
| and trying other avenues.
|
| Really though, why put all your eggs in one basket? That's what
| I've been confused about for a while. Why fund yet another
| LLMs-to-AGI startup? The space is saturated with big players and
| has been for years. Even if LLMs could get there, that doesn't
| mean something else won't get there faster and for less. It also
| seems you'd want a backup in order to avoid popping the bubble.
| Technology S-curves and all that still apply to AI.
|
| I'm similarly biased, but so is everyone I know with a strong
| math and/or science background (I even mentioned it in my thesis
| more than a few times lol). "Scaling is all you need" just
| doesn't check out.
| morleytj wrote:
| I'm pretty curious about the same thing.
|
| I think a somewhat comparable situation is in various online
| game platforms now that I think about it. Investors would
| love to make a game like Fortnite, and get the profits that
| Fortnite makes. So a ton of companies try to make Fortnite.
| Almost all fail, and make no return whatsoever, just lose a
| ton of money and toss the game in the bin, shut down the
| servers.
|
| On the other hand, it may have been more logical for many of
| them to go for a less ambitious (not always online, not a
| game that requires a high player count and social buy-in to
| stay relevant) but still profitable investment (Maybe a
| smaller scale single player game that doesn't offer recurring
| revenue), yet we still see a very crowded space for trying to
| emulate the same business model as something like Fortnite.
| Another more historical example was the constant question of
| whether a given MMO would be the next "WoW-killer" all
| through the 2000's/2010's.
|
| I think part of why this arises is that there's definitely a
| bit of a psychological hack for humans in particular where if
| there's a low-probability but extremely high reward outcome,
| we're deeply entranced by it, and investors are the same.
| Even if the chances are smaller in their minds than they were
| before, if they can just follow the same path that seems to
| be working to some extent and then get lucky, they're
| completely set. They're not really thinking about any broader
| bubble that could exist, that's on the level of the society,
| they're thinking about the individual, who could be very very
| rich, famous, and powerful if their investment works. And in
| the mind of someone debating what path to go down, I imagine
| a more nebulous answer of "we probably need to come up with
| some fundamentally different tools for learning and research
| a lot of different approaches to do so" is a bit less
| satisfying and exciting than a pitch that says "If you just
| give me enough money, the curve will eventually hit the point
| where you get to be king of the universe and we go colonize
| the solar system and carve your face into the moon."
|
| I also have to acknowledge the possibility that they just
| have access to different information than I do! They might be
| getting shown much better demos than I do, I suppose.
| ilaksh wrote:
| This is key info from the article for me:
|
| > "reasoning": {"summary": "auto"}
|
| Here's the response from that API call.
|
| https://gist.github.com/simonw/1d1013ba059af76461153722005a0...
|
| Without that option the API will often provide a lengthy delay
| while the model burns through thinking tokens until you start
| getting back visible tokens for the final response.
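|
| For context, here's a sketch of the same call via the openai
| Python client (field names follow the quoted fragment; the prompt
| and the handling of the output items are my own placeholders):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     response = client.responses.create(
|         model="gpt-5",
|         input="Generate an SVG of a pelican riding a bicycle",
|         reasoning={"summary": "auto"},  # request reasoning summaries
|     )
|     # The output list includes reasoning items (with their
|     # summaries) alongside the final message item.
|     for item in response.output:
|         print(item.type)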
| justusthane wrote:
| > a real-time router that quickly decides which model to use
| based on conversation type, complexity, tool needs, and explicit
| intent
|
| This is sort of interesting to me. It strikes me that so far
| we've had more or less direct access to the underlying model
| (apart from the system prompt and guardrails), but I wonder if
| going forward there's going to be more and more infrastructure
| between us and the model.
| hirako2000 wrote:
| Consider it low-level routing. Keep in mind it allows the
| other, non-active parts to not be in memory. Mistral, afaik,
| came up with this concept quite a while back.
| techpression wrote:
| "They claim impressive reductions in hallucinations. In my own
| usage I've not spotted a single hallucination yet, but that's
| been true for me for Claude 4 and o3 recently as well--
| hallucination is so much less of a problem with this year's
| models."
|
| This has me so confused, Claude 4 (Sonnet and Opus) hallucinates
| daily for me, on both simple and hard things. And this is for
| small isolated questions at that.
| simonw wrote:
| What kind of hallucinations are you seeing?
| OtherShrezzing wrote:
| I rewrote a 4 page document from first to third person a
| couple of weeks back. I gave Claude Sonnet 4 the document
| after editing, so it was entirely written in the third
| person. I asked it to review & highlight places where it was
| still in the first person.
|
| >Looking through the document, I can identify several
| instances where it's written in the first person:
|
| And it went on to show a series of "they/them" statements. I
| asked it to clarify if "they" is "first person" and it
| responded
|
| >No, "they" is not first person - it's third person. I made
| an error in my analysis. First person would be: I, we, me,
| us, our, my. Second person would be: you, your. Third person
| would be: he, she, it, they, them, their. Looking back at the
| document more carefully, it appears to be written entirely in
| third person.
|
| Even the good models are still failing at real-world use
| cases which should be right in their wheelhouse.
| simonw wrote:
| That doesn't quite fit the definition I use for
| "hallucination" - it's clearly a dumb error, but the model
| didn't confidently state something that's not true (like
| naming the wrong team who won the Super Bowl).
| OtherShrezzing wrote:
| >"They claim impressive reductions in hallucinations. In
| my own usage I've not spotted a single hallucination yet,
| but that's been true for me for Claude 4 and o3 recently
| as well--hallucination is so much less of a problem with
| this year's models."
|
| Could you give an estimate of how many "dumb errors"
| you've encountered, as opposed to hallucinations? I think
| many of your readers might read "hallucination" and
| assume you mean "hallucinations and dumb errors".
| jmull wrote:
| That's a good way to put it.
|
| As a user, when the model tells me things that are flat
| out wrong, it doesn't really matter whether it would be
| categorized as a hallucination or a dumb error. From my
| perspective, those mean the same thing.
| godelski wrote:
| I think it qualifies as a hallucination. What's your
| definition? I'm a researcher too and as far as I'm aware
| the definition has always been pretty broad and applied
| to many forms of mistakes. (It was always muddy but
| definitely got more muddy when adopted by NLP)
|
| It's hard to know why it made the error but isn't it
| caused by inaccurate "world" modeling? ("World" being
| English language) Is it not making some hallucination
| about the English language while interpreting the prompt
| or document?
|
| I'm having a hard time trying to think of a context where
| "they" would even be first person. I can't find any
| search results though Google's AI says it can. It
| provided two links, the first being a Quora result saying
| people don't do this but framed it as it's not
| impossible, just unheard of. Second result just talks
| about singular you. Both of these I'd consider
| hallucinations too as the answer isn't supported by the
| links.
| techpression wrote:
| Since I mostly use it for code, made up function names are
| the most common. And of course just broken code altogether,
| which might not count as a hallucination.
| laacz wrote:
| I suppose that Simon, being all in with LLMs for quite a while
| now, has developed a good intuition/feeling for framing
| questions so that they produce fewer hallucinations.
| simonw wrote:
| Yeah I think that's exactly right. I don't ask questions that
| are likely to produce hallucinations (like asking an LLM
| without search access for citations from papers on a topic), so I
| rarely see them.
| godelski wrote:
| But how would you verify? Are you constantly asking
| questions you already know the answers to? In depth
| answers?
|
| Often the hallucinations I see are subtle, though usually
| critical. I see it when generating code, doing my testing,
| or even just writing. There are hallucinations in today's
| announcements, such as the airfoil example[0]. An example
| of a more obvious hallucination: I was asking for help
| improving the abstract of a paper. I gave it my
| draft and it inserted new numbers and metrics that weren't
| there. I tried again providing my whole paper. I tried
| again, making it explicit not to add new numbers. I tried the
| whole process again in new sessions and in private
| sessions. Claude did better than GPT 4 and o3 but none
| would do it without follow-ups and a few iterations.
|
| Honestly I'm curious what you use them for where you don't
| see hallucinations
|
| [0] Which is a subtle but famous misconception, one that
| you'll even see in textbooks. The hallucination was probably
| caused by Bernoulli being in the prompt.
| simonw wrote:
| When I'm using them for code these days it is usually in
| a tool that can execute code in a loop - so I don't tend
| to even spot the hallucinations because the model
| self-corrects.
|
| For factual information I only ever use search-enabled
| models like o3 or GPT-4.
|
| Most of my other use cases involve pasting large volumes
| of text into the model and having it extract information
| or manipulate that text in some way.
| bluetidepro wrote:
| Agreed. All it takes is a simple reply of "you're wrong." to
| Claude/ChatGPT/etc. and it will start to crumble on itself and
| get into a loop that hallucinates over and over. It won't fight
| back, even if it happened to be right to begin with. It has no
| backbone to be confident it is right.
| cameldrv wrote:
| Yeah, it may be that in previous training, the model was
| given a strong negative signal when the human trainer told it
| it was wrong.
| sycophancy. If the human is always right and the data is
| always right, but the data can be interpreted multiple ways,
| like say human psychology, the model just adjusts to the
| opinion of the human.
|
| If the question is about harder facts which the human
| disagrees with, this may put it into an essentially self-
| contradictory state, where the locus of possibilities gets
| squished from each direction, and so the model is forced to
| respond with crazy outliers which agree with both the human
| and the data. The probability of an invented reference being
| true may be very low, but from the model's perspective, it
| may still be one of the highest probability outputs among a
| set of bad choices.
|
| What it sounds like they may have done is just have the
| humans tell it it's wrong when it isn't, and then award it
| credit for sticking to its guns.
| ashdksnndck wrote:
| I put in the ChatGPT system prompt to be not sycophantic,
| be honest, and tell me if I am wrong. When I try to correct
| it, it hallucinates more complicated epicycles to explain
| how it was right the first time.
| diggan wrote:
| > All it takes is a simple reply of "you're wrong." to
| Claude/ChatGPT/etc. and it will start to crumble on itself
| and get into a loop that hallucinates over and over.
|
| Yeah, it seems to be a terrible approach to try to
| "correct" the context by adding clarifications or telling it
| what's wrong.
|
| Instead, start from 0 with the same initial prompt you used,
| but improve it so the LLM gets it right in the first
| response. If it still gets it wrong, begin from 0 again. The
| context seems to be "poisoned" really quickly, if you're
| looking for accuracy in the responses. So better to begin
| from the beginning as soon as it veers off course.
| squeegmeister wrote:
| Yeah, hallucinations are very context-dependent. I'm guessing OP
| is working in very well-documented domains.
| Oras wrote:
| Here you go
| https://pbs.twimg.com/media/Gxxtiz7WEAAGCQ1?format=jpg&name=...
| simonw wrote:
| How is that a hallucination?
| madduci wrote:
| I believe it depends on the inputs. For me, Claude 4 has
| consistently generated hallucinations; it was especially
| confident in generating invalid JSON, for instance Grafana
| dashboards full of syntactic errors.
| godelski wrote:
| There were also several hallucinations during the announcement.
| (I also see hallucinations every time I use Claude and GPT,
| which is several times a week. Paid and free tiers)
|
| So not seeing them means either lying or incompetence. I always
| try to attribute to stupidity rather than malice (Hanlon's
| razor).
|
| The big problem of LLMs is that they optimize human preference.
| This means they optimize for hidden errors.
|
| Personally I'm really cautious about using tools that have
| stealthy failure modes. They just lead to many problems and
| lots of wasted hours debugging, even when failure rates are
| low. It just causes everything to slow down for me as I'm
| double checking everything and need to be much more meticulous
| if I know it's hard to see. It's like having a line of Python
| indented with an inconsistent white space character. Impossible
| to see. But what if you didn't have the interpreter telling you
| which line you failed on or being able to search or highlight
| these different characters. At least in this case you'd know
| there's an error. It's hard enough dealing with human generated
| invisible errors, but this just seems to perpetuate the LGTM
| crowd
| simonw wrote:
| I updated that section of my post with a clarification about
| what I meant. Thanks for calling this out, it definitely needed
| extra context from me.
| drumhead wrote:
| "Are you GPT5" - No I'm 4o, 5 hasnt been released yet. "It was
| released today". Oh you're right, Im GPT5. _You have reached the
| limit of the free usage of 4o_
| cchance wrote:
| It's basically Opus 4.1... but cheaper?
| gwd wrote:
| Cheaper is an understatement... it's less than 1/10 for input
| and nearly 1/8 for output. Part of me wonders if they're using
| their massive new investment to sell API access below cost and
| drive out the competition. If they're really getting Opus 4.1
| performance for half of Sonnet compute cost, they've done
| really well.
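|
| Quick arithmetic behind those ratios, using the per-million-token
| prices as I understand them (GPT-5 at $1.25 in / $10 out, Opus 4.1
| at $15 in / $75 out - worth double-checking against the current
| pricing pages):
|
|     gpt5_in, gpt5_out = 1.25, 10.0
|     opus_in, opus_out = 15.0, 75.0
|     print(opus_in / gpt5_in)    # 12.0 -> input ~1/12 the price
|     print(opus_out / gpt5_out)  # 7.5  -> output ~1/7.5 the price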
| diggan wrote:
| I'm not sure I'd be surprised. I've been playing around with
| GPT-OSS the last few days, and the architecture seems really fast
| for the accuracy/quality of responses, way better than most
| local weights I've tried for the last two years or so. And
| since they released that architecture publicly, I'd imagine
| they're sitting on something even better privately.
| aliljet wrote:
| I'm curious what platform people are using to test GPT-5? I'm so
| deep into the claude code world that I'm actually unsure what the
| best option is outside of claude code...
| te_chris wrote:
| Cursor
| simonw wrote:
| I've been using codex CLI, OpenAI's Claude Code equivalent. You
| can run it like this:
| OPENAI_DEFAULT_MODEL=gpt-5 codex
| cainxinth wrote:
| It's fascinating and hilarious that pelican on a bicycle in SVG
| is still such a challenge.
| muglug wrote:
| How easy is it for you to create an SVG of a pelican riding a
| bicycle in a text editor by hand?
| jopsen wrote:
| Without looking at the rendered output :)
| joshmlewis wrote:
| It seems to be trained to use tools effectively to gather
| context. In this example against 4.1 and o3 it used 6 tool calls
| in the first turn in a pretty cool way (fetching different
| categories that could be relevant). Token use increases with that
| kind of tool calling, but the aggressive pricing should make that
| moot. You could probably get it to not be so tool-happy with
| prompting as well.
|
| https://promptslice.com/share/b-2ap_rfjeJgIQsG
| tomrod wrote:
| Simon, as always, I appreciate your succinct and dedicated
| writeup. This really helps to land the results.
___________________________________________________________________
(page generated 2025-08-07 23:00 UTC)