[HN Gopher] GPT-5 for Developers
       ___________________________________________________________________
        
       GPT-5 for Developers
        
       Author : 6thbit
       Score  : 314 points
       Date   : 2025-08-07 17:06 UTC (5 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | andrewmcwatters wrote:
       | I wonder how good it is compared to Claude Sonnet 4, and when
       | it's coming to GitHub Copilot.
       | 
       | I almost exclusively wrote and released
       | https://github.com/andrewmcwattersandco/git-fetch-file yesterday
       | with GPT 4o and Claude Sonnet 4, and the latter's agentic
       | behavior was quite nice. I barely had to guide it, and was able
       | to quickly verify its output.
        
         | fleebee wrote:
         | There is an option in GitHub Copilot settings to enable GPT-5
         | already.
        
       | croemer wrote:
       | > GPT-5 also excels at long-running agentic tasks--achieving SOTA
        | results on tau2-bench telecom (96.7%), a tool-calling benchmark
       | released just 2 months ago.
       | 
       | Yes, but it does worse than o3 on the airline version of that
        | benchmark. The prose is totally cherry picked.
        
         | Fogest wrote:
         | How does the cost compare though? From my understanding o3 is
         | pretty expensive to run. Is GPT-5 less costly? If so if the
         | performance is close to o3 but cheaper, then it may still be a
         | good improvement.
        
           | low_tech_punk wrote:
            | I find it strange that GPT-5 is cheaper than GPT-4.1 on input
            | tokens and only slightly more expensive on output tokens. Is
           | it marketing or actually reflecting the underlying compute
           | resources?
        
             | AS04 wrote:
             | Very likely to be an actual reflection. That's probably
             | their real achievement here and the key reason why they are
             | actually publishing it as GPT-5. More or less the best or
             | near to it on everything while being one model,
             | substantially cheaper than the competition.
        
               | ComputerGuru wrote:
               | But it can't do audio in/out or image out. Feels like an
               | architectural step back.
        
               | conradkay wrote:
                | My understanding is that image output is pretty separate,
                | and if it doesn't seem that way, they're just abstracting
                | several models into one name.
        
             | bn-l wrote:
             | Maybe with the router mechanism (to mini or standard) they
             | estimate the average cost will be a lot lower for chatgpt
              | because the capable model won't be answering dumb questions,
              | and then they pass those savings on to devs?
        
               | low_tech_punk wrote:
               | I think the router applies to chatgpt app. The developer
               | APIs expose manual control to select the specific model
               | and level of reasoning.
        
         | jstummbillig wrote:
         | I mean... they themselves included that information in the
         | post. It's not exactly a gotcha.
        
         | tedsanders wrote:
         | I wrote that section and made the graphs, so you can blame me.
         | We no doubt highlight the evals that make us look good, but in
         | this particular case I think the emphasis on telecom isn't
         | unprincipled cherry picking.
         | 
         | Telecom was made after retail & airline, and fixes some of
         | their problems. In retail and airline, the model is graded
         | against a ground truth reference solution. But in reality,
         | there can be multiple solutions that solve the problem, and
         | perfectly good answers can receive scores of 0 by the automatic
         | grading. This, along with some user model issues, is partly why
         | airline and retail scores haven't climbed with the latest
         | generations of models and are stuck around 60% / 80%. Even a
         | literal superintelligence would probably plateau here.
         | 
         | In telecom, the authors (Barres et al.) made the grading less
         | brittle by grading against outcome states, which may be
         | achieved via multiple solutions, rather than by matching
         | against a single specific solution. They also improved the user
         | modeling and some other things too. So telecom is the much
         | better eval, with a much cleaner signal, which is partly why
         | models can score as high as 97% instead of getting mired at
         | 60%/80% due to brittle grading and other issues.
         | 
         | Even if I had never seen GPT-5's numbers, I like to think I
         | would have said ahead of time that telecom is much better than
         | airline/retail for measuring tool use.
         | 
         | Incidentally, another thing to keep in mind when critically
         | looking at OpenAI and others reporting their scores on these
         | evals is that the evals give no partial credit - so sometimes
         | you can have very good models that do all but one thing
         | perfectly, which results in very poor scores. If you tried
         | generalizing to tasks that don't trigger that quirk, you might
         | get much better performance than the eval scores suggest (or
         | vice versa, if they trigger a quirk not present in the eval).
         | 
         | Here's the tau2-bench paper if anyone wants to read more:
         | https://arxiv.org/abs/2506.07982
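          | 
          | A toy sketch of the grading difference (my illustration, not
          | tau2-bench's actual code): reference-solution grading fails
          | any trajectory that deviates from one scripted sequence, while
          | outcome-state grading accepts any trajectory that reaches the
          | goal state.
          | 
          |     # Toy illustration, not the actual tau2-bench grader.
          |     REFERENCE = ["lookup_plan", "disable_roaming", "confirm"]
          | 
          |     def grade_by_reference(actions: list[str]) -> float:
          |         # Brittle: a valid alternative ordering scores 0.
          |         return 1.0 if actions == REFERENCE else 0.0
          | 
          |     def grade_by_outcome(state: dict) -> float:
          |         # Robust: any solution reaching the goal scores 1.
          |         goal = {"roaming": False, "notified": True}
          |         return 1.0 if all(
          |             state.get(k) == v for k, v in goal.items()
          |         ) else 0.0
          | 
          |     # An agent that confirmed before touching the plan:
          |     actions = ["lookup_plan", "confirm", "disable_roaming"]
          |     state = {"roaming": False, "notified": True}
          |     print(grade_by_reference(actions))  # 0.0, no partial credit
          |     print(grade_by_outcome(state))      # 1.0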
        
       | aliljet wrote:
        | Between Opus and GPT-5, it's not clear there's a substantial
        | difference in software development expertise. The metric I
        | can't seem to get past in my attempts to use these systems is
        | context awareness over long-running tasks. Producing a very
        | complex, context-exceeding objective is a daily (maybe hourly)
        | occurrence for me. All I care about is how these systems manage
        | context and stay on track over extended periods of time.
        | 
        | What eval is tracking that? It seems like it's potentially the
        | most important metric for real-world software engineering and
        | not one-shot vibe prayers.
        
         | realusername wrote:
         | Personally I think I'll wait for another 10x improvement for
         | coding because with the current way it's going, they clearly
         | need that.
        
           | fsloth wrote:
            | From my experience, when used through an IDE such as Cursor,
            | the current-gen Claude model enables impressive speedruns over
            | commodity tasks. My context is a CAD application I've been
            | writing as a hobby. I used to work in that field for a decade,
            | so I have a pretty good feel for how long I would expect tasks
            | to take. I'm using mostly the same software stack as at my
            | previous job and am definitely getting stuff done much faster
            | on holiday at home than at that previous work. Of course the
           | codebase is also a lot smaller, intrinsic motivation, etc,
           | but still.
        
             | 42lux wrote:
             | How often do you have to build the simple scaffolding
             | though?
        
             | realusername wrote:
             | I've done pretty much the same as you (Cursor/Claude) for
             | our large Rails/React codebase at work and the experience
             | has been horrific so far, I reverted back to vscode.
        
         | bdangubic wrote:
         | _context awareness over long-running tasks_
         | 
          | Don't have long-running tasks, LLMs or not. Break the problem
          | down into small manageable chunks and then assemble it. Neither
          | humans nor LLMs are good at long-running tasks.
        
           | beoberha wrote:
           | A series of small manageable chunks becomes a long running
           | task :)
           | 
           | If LLMs are going to act as agents, they need to maintain
           | context across these chunks.
        
           | bastawhiz wrote:
           | > neither humans nor llms are good at long-running tasks.
           | 
           | That's a wild comparison to make. I can easily work for an
           | hour. Cursor can hardly work for a continuous pomodoro.
           | "Long-running" is not a fixed size.
        
             | echelon wrote:
             | Humans can error correct.
             | 
             | LLMs multiply errors over time.
        
             | bdangubic wrote:
             | I just finished my workday, 8hrs with Claude Code. No
             | single task took more than 20 minutes total. Cleared
             | context after each task and asked it to summarize for
             | itself the previous task before I cleared context. If I ran
             | this as a continuous 8hr task it would have died after
             | 35-ish minutes. Just know the limitations (like with any
             | other tool) and you'll be good :)
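              | 
              | Roughly this loop, with run_task() as a stand-in for one
              | fresh-context agent invocation (hypothetical helper, not a
              | real API):
              | 
              |     def run_task(prompt: str) -> str:
              |         # Stand-in for one fresh-context Claude Code run.
              |         return f"<output for: {prompt[:30]}...>"
              | 
              |     tasks = ["add login endpoint", "write tests for it"]
              |     carryover = ""  # summary that survives the clear
              |     for task in tasks:
              |         result = run_task(
              |             f"Previous work:\n{carryover}\n\nTask:\n{task}"
              |         )
              |         # Summarize *before* clearing, then start fresh.
              |         carryover = run_task(
              |             f"Summarize what you just did:\n{result}"
              |         )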
        
               | 0x457 wrote:
               | I always find it wild that none of these tools use VCS -
               | completed logical unit of work, make a commit, drop
               | entire context related to that commit, while referencing
               | said commit, continue onto the next stage, rinse and
               | repeat.
               | 
                | Claude always misunderstands how the API exported by my
                | service works, and after every compaction it forgets it
                | all over again and commits "oh, the API has changed since
                | the last time I used it, let me use different query
                | parameters". My brother in Christ, nothing has changed,
                | and you are the one who made this API.
        
               | bastawhiz wrote:
               | You can use cursor rules to tell cursor to update the
               | project cursor rules with details about the API.
        
               | bahmboo wrote:
               | Roo Code does this
        
             | novok wrote:
             | I think that is because you do implicit plan tracking,
             | creation and modification of the plan in your head in light
             | of new information and then follow that plan. I'm not sure
             | these tools do that very well.
             | 
              | The long-running task, at its core, is composed of many
             | smaller tasks and you mostly focus on one task at a time
             | per brain part. It's why you cannot read two streams of
             | text simultaneously even if both are in your visual focus
             | field.
        
           | vaenaes wrote:
           | You're holding it wrong
        
         | swader999 wrote:
         | If GPT 5 truly has 400k context, that might be all it needs to
         | meaningfully surpass Opus.
        
           | AS04 wrote:
           | 400k context with 100% on the fiction livebench would make
            | GPT-5 the indisputably best model IMHO. Don't think it will
           | achieve that though, sadly.
        
           | simonw wrote:
           | It's 272,000 input tokens and 128,000 output tokens.
        
             | zurfer wrote:
             | Woah that's really kind of hidden. But I think you can
             | specify max output tokens. Need to test that!
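              | 
              | Something like this should work, if the docs hold
              | (untested; max_output_tokens per the Responses API docs):
              | 
              |     from openai import OpenAI
              | 
              |     client = OpenAI()  # needs OPENAI_API_KEY set
              |     resp = client.responses.create(
              |         model="gpt-5",
              |         input="Summarize the GPT-5 launch post.",
              |         max_output_tokens=1000,  # cap completion length
              |     )
              |     print(resp.output_text)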
        
             | 6thbit wrote:
             | Oh, I had not grasped that the "context window" size
             | advertised had to include both input and output.
             | 
             | But is it really 272k even if the output was say 10k? Cause
             | it does say "max output" in the docs, so I wonder
        
               | simonw wrote:
               | This is the only model where the input limit and the
               | context limit are different values. OpenAI docs team are
               | working on updating that page.
        
           | dimal wrote:
           | Even with large contexts there's diminishing returns. Just
           | having the ability to stuff more tokens in context doesn't
           | mean the model can effectively use it. As far as I can tell,
            | they always reach a point at which more information makes
           | things worse.
        
           | Byamarro wrote:
            | The question is more its tendency toward context rot than the
            | size of its context :) LLMs are supposed to be able to load 3
            | bibles into their context, but they forget what they were
            | about to do after loading 600 LoC of locales.
        
           | andrewmutz wrote:
           | Having a large context window is very different from being
           | able to effectively use a lot of context.
           | 
           | To get great results, it's still very important to manage
           | context well. It doesn't matter if the model allows a very
           | large context window, you can't just throw in the kitchen
           | sink and expect good results
        
           | tekacs wrote:
            | Coupled with the humongous price difference...
        
         | logicchains wrote:
         | >Between Opus aand GPT-5, it's not clear there's a substantial
         | difference in software development expertise.
         | 
         | If there's no substantial difference in software development
         | expertise then GPT-5 absolutely blows Opus out of the water due
         | to being almost 10x cheaper.
        
           | spiderice wrote:
           | Does OpenAI provide a $200/month option that lets me use as
           | much GPT-5 I want inside of Codex?
           | 
           | Because if not, I'd still go with Opus + Claude Code. I'd
           | rather be able to tell my employer, "this will cost you
           | $200/month" than "this might cost you less than $200/month,
           | but we really don't know because it's based on usage"
        
         | nadis wrote:
         | It's pretty vague, but the OP had this callout:
         | 
         | >"GPT-5 is the strongest coding model we've ever released. It
         | outperforms o3 across coding benchmarks and real-world use
         | cases, and has been fine-tuned to shine in agentic coding
         | products like Cursor, Windsurf, GitHub Copilot, and Codex CLI.
         | GPT-5 impressed our alpha testers, setting records on many of
         | their private internal evals."
        
         | RobinL wrote:
         | Totally agree. At the moment I find that frontier LLMs are able
         | to solve most of the problems I throw at them given enough
         | context. Most of my time is spent working out what context
          | they're missing when they fail. So the thing that would help me
          | most is a much more focused ability to gather context.
          | 
          | For my use cases, this mostly means really honing in
          | on relevant code files, issues, discussions, PRs. I'm hopeful
         | that GPT5 will be a step forward in this regard that isn't
         | fully captured in the benchmark results. It's certainly
         | promising that it can achieve similar results more cheaply than
         | e.g. Opus.
        
         | abossy wrote:
         | At my company (Charlie Labs), we've had a tremendous amount of
         | success with context awareness over long-running tasks with
         | GPT-5 since getting access a few weeks ago. We ran an eval to
         | solve 10 real Github issues so that we could measure this
         | against Claude Code and the differences were surprisingly
         | large. You can see our write-up here:
         | 
         | https://charlielabs.ai/research/gpt-5
         | 
         | Often, our tasks take 30-45 minutes and can handle massive
         | context threads in Linear or Github without getting tripped up
         | by things like changes in direction part of the way through the
         | thread.
         | 
         | While 10 issues isn't crazy comprehensive, we found it to be
         | directionally very impressive and we'll likely build upon it to
         | better understand performance going forward.
        
           | bartman wrote:
            | I am not (usually) photosensitive, but the animated static
            | noise on your website causes noticeable flickering on various
            | screens I use and made it impossible for me to read your
           | article.
           | 
           | For better accessibility and a safer experience[1] I would
           | recommend not animating the background, or at least making it
           | easily togglable.
           | 
           | [1] https://developer.mozilla.org/en-
           | US/docs/Web/Accessibility/G...
        
             | MPSFounder wrote:
             | I concur. Awful UI
        
             | neom wrote:
             | Removed- sorry, and thank you for the feedback.
        
               | jeanlucas wrote:
               | Nice,
        
               | pxc wrote:
               | Love your responsiveness here!
               | 
               | Edited to add: I am, in fact, photosensitive (due to a
               | genetic retinal condition), and for my eyes, your site as
                | it stands is very easy to read, and the visualizations look
               | great.
        
               | bartman wrote:
               | Thank you!
               | 
               | Love that you included the judge prompts in your article.
        
               | neom wrote:
               | Please let me know what you would like to see more of.
                | Evals are something we take seriously. I think this post
               | was ok enough given our constraints, but I'd like to
               | produce content people find useful and I think we can do
               | a lot better.
        
         | joshmlewis wrote:
         | I've been testing it against Opus 4.1 the last few hours and it
         | has done better and solved problems Claude kept failing at. I
         | would say it's definitely better, at least so far.
        
         | cyanydeez wrote:
         | real context is a graph of objectives and results.
         | 
          | The power of these models has peaked, and they simply aren't
          | going to manage the type of awareness being promised.
        
         | 1659447091 wrote:
         | > Producing a very complex, context-exceeding objective is a
         | daily (maybe hourly) ocurrence for me. All I care about is how
         | these systems manage context and stay on track over extended
         | periods of time.
         | 
         | For whatever reason Github's Copilot is treated like the
          | redheaded stepchild of coding assistants, even though there
         | are Anthropic, OpenAI, and Google models to choose from. And
         | there is a "spaces"[0] website feature that may be close to
         | what you are looking for.
         | 
         | I got better results for testing some larger task using that
         | than I did through the IDE version. But have not used it much.
         | Maybe others have more experience with it. Trying to gather all
         | the context and then review the results was taking longer than
         | doing it myself; having the context gathered already or
         | building it up over time is probably where its value is.
         | 
         | [0] https://docs.github.com/en/copilot/concepts/spaces
        
       | risho wrote:
        | Over the last week or so I have put probably close to 70 hours
        | into playing around with Cursor and Claude Code and a few other
        | tools (it's become my new obsession). I've been blown away by how
        | good and reliable it is now. That said, the reality in my
        | experience is that the only models that actually work in any sort
        | of reliable way are Claude models. I don't care what any
        | benchmark says, because the only thing that actually matters is
        | actual use. I'm really hoping that this new GPT model actually
        | works for this use case, because competition is great and the
        | price is also great.
        
         | ralfd wrote:
         | Just replying to ask you next week what your assessment on GPT5
         | is.
        
         | throwaway_2898 wrote:
         | How much of the product were you able to build to say it was
          | good/reliable? IME, 70 hours can get you to a PoC that "works".
          | Building beyond the initial set of features -- like, say, a
          | first draft of all the APIs -- does it do well once you start
          | layering features?
        
           | petralithic wrote:
           | This has been my experience. The greenfield approach works up
           | to a point, then it just breaks.
        
         | Centigonal wrote:
         | Ditto here, except I'm using Roo and it's Claude and Gemini pro
         | 2.5 that work for me.
        
         | neuronexmachina wrote:
         | > That said the reality is in my experience the only models
         | that actually work in any sort of reliable way are claude
         | models.
         | 
         | Anecdotally, the tool updates in the latest Cursor (1.4) seem
         | to have made tool usage in models like Gemini much more
         | reliable. Previously it would struggle to make simple file
         | edits, but now the edits work pretty much every time.
        
         | zarzavat wrote:
         | The magic is the prompting/tool use/finetuning.
         | 
         | I find that OpenAI's reasoning models write better code and are
         | better at raw problem solving, but Claude code is a much more
         | useful product, even if the model itself is weaker.
        
         | rcarr wrote:
         | I think some of this might come down to stack as well. I
         | watched a t3.gg video[1] recently about Convex[2] and how the
         | nature of it leads to the AI getting it right first time more
         | often. I've been playing around with it the last few days and I
         | think I agree with him.
         | 
          | I think the dev workflow is going to fundamentally change. To
          | maximise productivity you need to get multiple AIs working in
          | parallel, so rather than jumping straight into coding we're
          | going to end up writing a bunch of tickets out in a PM tool
          | (Linear[3] looks like it's winning the race atm), working out
          | (or using the AI to work out) which ones can be run in parallel
          | without causing merge conflicts, then pulling multiple tickets
          | into your IDE/terminal, cycling through the tabs, and jumping
          | in as needed.
         | 
         | Atm I'm still not really doing this but I know I need to make
         | the switch and I'm thinking that Warp[4] might be best suited
         | for this kind of workflow, with the occasional switch over to
         | an IDE when you need to jump in and make some edits.
         | 
          | Oh also, to achieve this you need to use git worktrees[5,6,7];
          | a minimal sketch follows after the links below.
         | 
         | [1]: https://www.youtube.com/watch?v=gZ4Tdwz1L7k
         | 
         | [2]: https://www.convex.dev/
         | 
         | [3]: https://linear.app/
         | 
         | [4]: https://www.warp.dev/
         | 
         | [5]: https://docs.anthropic.com/en/docs/claude-code/common-
         | workfl...
         | 
         | [6]:https://git-scm.com/docs/git-worktree
         | 
         | [7]:https://www.tomups.com/posts/git-worktrees/
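          | 
          | A minimal sketch of the worktree-per-ticket setup (the ticket
          | names are made up; each agent then works in its own copy):
          | 
          |     # One git worktree (and branch) per ticket so parallel
          |     # agents don't trample each other's working copies.
          |     import subprocess
          | 
          |     tickets = ["LIN-101-auth", "LIN-102-billing"]
          |     for t in tickets:
          |         subprocess.run(
          |             ["git", "worktree", "add", f"../wt-{t}", "-b", t],
          |             check=True,
          |         )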
        
           | isoprophlex wrote:
           | Sure sounds interesting but... Where on earth do you actually
           | find the time to sit through a 1.5 hour yt video?!
        
             | rcarr wrote:
             | Jump in and start coding entire backend with stack not best
             | suited for job and modern AI tools: most likely future
             | hours lost.
             | 
             | Spend 1.5 hours now to learn from an experienced dev on a
             | stack that is better suited for job: most likely future
             | hours gained.
        
             | v5v3 wrote:
              | People find time for things that seem important to them.
        
             | burnished wrote:
             | 1.5x and 2x speed help a lot, slow down or repeat segments
             | as needed, don't be afraid to fast forward past irrelevant
             | looking bits (just be eager to backtrack).
        
             | mafro wrote:
             | Ask an LLM to transcribe and give the overview and key
             | points
        
             | mceachen wrote:
             | On a desktop browser, tap YouTube's "show transcript" and
             | "hide timecodes", then copy-paste the whole transcript into
              | Claude or ChatGPT and tell it to summarize at whatever
              | resolution you want: a couple sentences, 400 lines,
              | whatever. You can also tell it to focus on certain subject
             | material.
             | 
             | This is a complete game changer for staying on top of
             | what's being covered by local government meetings. Our
              | local bureaucrats are astoundingly competent at talking
              | about absolutely nothing 95% of the time, but hidden in
              | there is three minutes of "oh btw we're planning on paving
              | over the local open space preserve to provide parking for
              | the local business".
        
           | rcarr wrote:
           | Seems like VSCode just added a lot of stuff for this in the
           | latest update today, such as worktree support[1] and an agent
           | session mode[2].
           | 
           | [1]: https://code.visualstudio.com/updates/v1_103#_git-
           | worktree-s...
           | 
           | [2]: https://code.visualstudio.com/updates/v1_103#_chat-
           | sessions-...
        
       | timhigins wrote:
       | I opened up the developer playground and the model selection
       | dropdown showed GPT-5 and then it disappeared. Also I don't see
       | it in ChatGPT Pro. What's up?
        
         | Fogest wrote:
         | It's probably being throttled due to high usage.
        
         | IAmGraydon wrote:
         | Not showing in my Pro account either. As someone else
         | mentioned, I'm sure it's throttling due to high use right now.
        
         | brookst wrote:
         | Shipping something at the moment of announcement is always
         | hell.
        
       | sebdufbeau wrote:
       | Has the API rollout started? It's not available in our org, even
       | if we've been verified for a few months
       | 
       | EDIT: It's out now
        
         | spullara wrote:
          | It isn't out yet. I poll the API for the models and update this
          | GitHub repo hourly.
         | 
         | https://github.com/spullara/models
        
       | low_tech_punk wrote:
        | The ability to specify a context-free grammar as an output
        | constraint? This blows my mind. How do you control the
        | autoregressive sampling to guarantee the correct syntax?
        
         | qsort wrote:
         | You sample only from tokens that could possibly result in a
         | valid production for the grammar. It's an inference-only thing.
        
           | low_tech_punk wrote:
           | ah, thanks!
        
         | evnc wrote:
         | I assume they're doing "Structured Generation" or "Guided
         | generation", which has been possible for a while if you control
         | the LLM itself e.g. running an OSS model, e.g. [0][1]. It's
         | cool to see a major API provider offer it, though.
         | 
         | The basic idea is: at each auto-regressive step (each token
         | generation), instead of letting the model generate a
         | probability distribution over "all tokens in the entire vocab
         | it's ever seen" (the default), only allow the model to generate
         | a probability distribution over "this specific set of tokens I
         | provide". And that set can change from one sampling set to the
         | next, according to a given grammar. E.g. if you're using a JSON
         | grammar, and you've just generated a `{`, you can provide the
         | model a choice of only which tokens are valid JSON immediately
         | after a `{`, etc.
         | 
         | [0] https://github.com/dottxt-ai/outlines [1]
         | https://github.com/guidance-ai/guidance
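          | 
          | A toy version of that masking loop (my sketch, not OpenAI's
          | implementation; legal_next() stands in for a real grammar
          | engine like llguidance):
          | 
          |     def legal_next(prefix):
          |         # Hard-coded for the single production: { "k" : "v" }
          |         last = prefix[-1] if prefix else None
          |         table = {None: {"{"}, "{": {'"k"'}, '"k"': {":"},
          |                  ":": {'"v"'}, '"v"': {"}"}}
          |         return table[last]
          | 
          |     def constrained_greedy(fake_logits):
          |         out = []
          |         while not out or out[-1] != "}":
          |             allowed = legal_next(out)
          |             # Highest-scoring *allowed* token wins; "banana"
          |             # can never be emitted, whatever its logit.
          |             out.append(max(
          |                 allowed, key=lambda t: fake_logits.get(t, 0.0)
          |             ))
          |         return " ".join(out)
          | 
          |     print(constrained_greedy({"banana": 9.9}))  # { "k" : "v" }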
        
       | low_tech_punk wrote:
        | Tried using the gpt-5 family with the Responses API and got the
        | error "gpt-5 does not exist or you don't have access to it". I
        | guess they are not rolling out in lockstep with the livestream
        | and blog article?
        
         | diggan wrote:
         | Seems they're doing rollout over time, I'm not seeing it
         | anywhere yet.
        
         | low_tech_punk wrote:
         | Can confirm that they are rolling out. It's working for me.
        
       | catigula wrote:
       | I thought we were going to have AGI by now.
        
         | RS-232 wrote:
         | No shot. LLMs are simple text predictors and they are too
         | stupid to get us to real AGI.
         | 
         | To achieve AGI, we will need to be capable of high fidelity
         | whole brain simulations that model the brain's entire physical,
         | chemical, and biological behavior. We won't have that kind of
         | computational power until quantum computers are mature.
        
           | evantbyrne wrote:
           | It will be interesting to see if humans can manage to
           | bioengineer human-level general intelligence into another
           | species before computers.
        
           | machiaweliczny wrote:
           | [flagged]
        
             | bopbopbop7 wrote:
             | "some twist" is doing a lot of heavy lifting in that
             | statement.
        
               | AppleBananaPie wrote:
               | CS will define, design and implement human level
               | intelligence before neuroscience has done even the first.
               | 
               | That's what I hear when people say stuff like this
               | anyway.
               | 
               | Similar to CS folks throwing around physics 'theories'
        
           | nawgz wrote:
           | I don't really see any relationship between being able to
           | model/simulate the brain and being able to exceed the brain
           | in intelligence, can you explain more about that? Simulations
           | sound like more of a computational and analytic problem with
           | regards to having an accurate model.
           | 
           | Maybe your point is that until we understand our own
           | intelligence, which would be reflected in such a simulation,
           | it would be difficult to improve upon it.
        
           | brookst wrote:
           | Are you saying that only (human?) biological brains can be
           | GI, _and_ that whatever intelligence is, it would emerge from
           | a pure physics-based simulation?
           | 
           | Both of those seem questionable, multiplying them together
           | seems highly unlikely.
        
             | jplusequalt wrote:
             | Are you arguing that intelligence is not physical? Could
             | you name a single thing in existence that fundamentally
             | cannot be linked to physics?
        
           | 93po wrote:
            | In what way are human brains also not just predictors? Our
            | neural pathways are built and reinforced as we have repeated
            | exposure to inputs through any of our senses. Our brains are
            | expert pattern-followers, to the point that it happens even
            | when we strongly don't want it to (in the case of PTSD, for
            | example, or people who struggle with impulse control and
            | executive functioning).
           | 
            | What's the next sentence I'm going to type? Is it not just
            | based on the millions of sentences I've typed and read
            | before? Even the premise of me playing devil's advocate here,
            | that's a pattern I've learned over my entire life too.
           | 
            | Your argument also falls apart a bit when we see emergent
            | behavior, which has definitely happened.
        
           | JamesBarney wrote:
           | When we're being hunted down by nano-bots some of the last
           | few survivors will still be surprised that a simple text
           | predictor could do so much.
        
         | IAmGraydon wrote:
         | Not going to happen any time soon, if ever. LLMs are extremely
         | useful, but the intelligence part is an illusion that nearly
         | everyone appears to have fallen for.
        
           | jonplackett wrote:
           | This POV is just the opposite extremity - and it's equally
            | nuts. If you haven't seen any intelligence at all in an LLM,
            | you just aren't looking.
        
       | skepticATX wrote:
       | This was really a bad release for OpenAI, if benchmarks are even
       | somewhat indicative of how the model will perform in practice.
        
         | robterrell wrote:
         | In what ways?
        
         | mediaman wrote:
         | I actually don't agree. Tool use is the key to successful
         | enterprise product integration and they have done some very
         | good work here. This is much more important to
         | commercialization than, for example, creative writing quality
         | (which it reportedly is not good at).
        
       | jumploops wrote:
       | If the model is as good as the benchmarks say, the pricing is
       | fantastic:
       | 
        | Input: $1.25 / 1M tokens (cached: $0.125 / 1M tokens)
        | Output: $10 / 1M tokens
       | 
       | For context, Claude Opus 4.1 is $15 / 1M for input tokens and
       | $75/1M for output tokens.
       | 
       | The big question remains: how well does it handle tools? (i.e.
       | compared to Claude Code)
       | 
       | Initial demos look good, but it performs worse than o3 on
       | Tau2-bench airline, so the jury is still out.
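        | 
        | Back-of-envelope at those list prices, for an illustrative
        | request of 20k input + 2k output tokens:
        | 
        |     def cost(in_tok, out_tok, in_price, out_price):
        |         return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price
        | 
        |     gpt5 = cost(20_000, 2_000, 1.25, 10)  # $0.045
        |     opus = cost(20_000, 2_000, 15, 75)    # $0.45
        |     print(f"{opus / gpt5:.0f}x")          # 10x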
        
         | addaon wrote:
         | > Output: $10 / 1M tokens
         | 
         | It's interesting that they're using flat token pricing for a
         | "model" that is explicitly made of (at least) two underlying
         | models, one with much lower compute costs than the other; and
          | with user ability to at least influence (via prompt) if not
         | choose which model is being used. I have to assume this pricing
         | model is based on a predicted split between how often the
         | underlying models get used; I wonder if that will hold up, if
         | users will instead try to rouse the better model into action
         | more than expected, or if the pricing is so padded that it
         | doesn't matter.
        
           | simianwords wrote:
           | > that is explicitly made of (at least) two underlying models
           | 
           | what do you mean?
        
             | addaon wrote:
             | > a smart and fast model that answers most questions, a
             | deeper reasoning model for harder problems, and a real-time
             | router that quickly decides which model to use based on
             | conversation type, complexity, tool needs, and explicit
             | intent (for example, if you say "think hard about this" in
             | the prompt).
             | 
             | From https://openai.com/index/gpt-5-system-card/
        
               | tedsanders wrote:
               | In the API, there's no router. Developers just pick
               | whether they use the reasoning model or non-thinking
               | ChatGPT model.
        
           | mkozlows wrote:
           | That's how the browser-based ChatGPT works, but not the API.
        
         | joshmlewis wrote:
         | It does seem to be doing well compared to Opus 4.1 in my
         | testing the last few hours. I've been on the Claude Code 200
          | plan for a few months and I've been really frustrated with its
         | output as of late. GPT-5 seems to be a step forward so far.
        
       | 6thbit wrote:
       | Seems they have quietly increased the context window up to
       | 400,000
       | 
       | https://platform.openai.com/docs/models/gpt-5
        
         | ralfd wrote:
         | How does that compare to Claude/GPT4?
        
           | 6thbit wrote:
            | 4o - 128k
            | o3 - 200k
            | Opus 4.1 - 200k
            | Sonnet 4 - 200k
            | 
            | So, at least twice the context of those
        
           | hrpnk wrote:
           | gpt4.1 has 1M input and 32k output, Sonnet 4 200k/64k
        
         | simianwords wrote:
         | but is it for the model in chatgpt.com as well?
        
       | mehmetoguzderin wrote:
       | Context-free grammar and regex support are exciting. I wonder
        | whether, and how, this differs from the Lark-like CFG of
       | llguidance, which powers the JSON schema of the OpenAI API [^1].
       | 
       | [^1]: https://github.com/guidance-
       | ai/llguidance/blob/f4592cc0c783a...
        
         | msp26 wrote:
         | Yeah that was the only exciting part of the announcement for me
         | haha. Can't wait to play around with it.
         | 
         | I'm already running into a bunch of issues with the structured
         | output APIs from other companies like Google and OpenAI have
         | been doing a great job on this front.
        
           | chrisweekly wrote:
           | > "I'm already running into a bunch of issues with the
           | structured output APIs from other companies like Google and
           | OpenAI have been doing a great job on this front."
           | 
           | This run-on sentence swerved at the end; I really can't tell
           | what your point is. Could you reword it for clarity?
        
             | petercooper wrote:
             | I read it as "... from other companies, like Google, and
             | OpenAI have been doing a great job on this front"
        
       | belter wrote:
       | We were promised AGI and all we got was code generators...
        
         | bmau5 wrote:
         | It's a logical starting point, given there are pretty defined
         | success/failure criteria
        
           | ehutch79 wrote:
           | The hype is real. We were told that we'd have AGI and be out
           | of jobs 2 years ago, let alone today.
        
             | brookst wrote:
             | We were also told that AGI would never happen, that it was
             | 6 months away, that it is 20 years away.
             | 
             | I'm not sure of the utility of being so outraged that some
             | people made wrong predictions.
        
             | rowanG077 wrote:
              | By whom? I don't think anyone seriously said in 2023 we'd
              | have AGI in two years. Even now, no one reputable is
             | claiming AGI in two years.
        
         | esafak wrote:
         | LLMs are saturating every benchmark. AGI may not be all that. I
         | am already impressed. Perhaps you need robots to be awed.
        
       | pamelafox wrote:
       | I am testing out gpt-5-mini for a RAG scenario, and I'm impressed
       | so far.
       | 
       | I used gpt-5-mini with reasoning_effort="minimal", and that model
       | finally resisted a hallucination that every other model
       | generated.
       | 
       | Screenshot in post here:
       | https://bsky.app/profile/pamelafox.bsky.social/post/3lvtdyvb...
       | 
       | I'll run formal evaluations next.
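        | 
        | Roughly the call shape I'm using (a sketch; reasoning_effort
        | per OpenAI's Chat Completions docs for gpt-5 models):
        | 
        |     from openai import OpenAI
        | 
        |     client = OpenAI()
        |     resp = client.chat.completions.create(
        |         model="gpt-5-mini",
        |         reasoning_effort="minimal",
        |         messages=[
        |             {"role": "system",
        |              "content": "Answer ONLY from the provided sources; "
        |                         "otherwise say you don't know."},
        |             {"role": "user",
        |              "content": "<question + retrieved chunks here>"},
        |         ],
        |     )
        |     print(resp.choices[0].message.content)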
        
         | potatolicious wrote:
         | This feels like honestly the biggest gain/difference. I work on
         | things that do a lot of tool calling, and the model
         | hallucinating fake tools is a huge problem. Worse, sometimes
         | the model will hallucinate a response directly without ever
         | generating the tool call.
         | 
         | The new training rewards that suppress hallucinations and tool-
         | skipping hopefully push us in the right direction.
        
         | ralfd wrote:
         | Q: What does a product manager do?
         | 
         | GPT4: Collaborating with engineering, sales, marketing,
         | finance, external partners, suppliers and customers to ensure
         | ...... etc
         | 
         | GPT5: I don't know.
         | 
         | Upon speaking these words, AI was enlightened.
        
           | ComputerGuru wrote:
           | That is genuinely nice to see. What are you using for the
           | embeddings?
        
             | pamelafox wrote:
             | We use text-embedding-3-large, with both quantization and
             | MRL reduction, plus oversampling on the search to
             | compensate for the compression.
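              | 
              | The MRL reduction can be done via the embeddings API's
              | dimensions parameter (a sketch; 3072 is that model's
              | native size):
              | 
              |     from openai import OpenAI
              | 
              |     client = OpenAI()
              |     emb = client.embeddings.create(
              |         model="text-embedding-3-large",
              |         input="chunk of a document",
              |         dimensions=256,  # MRL-truncated from native 3072
              |     )
              |     print(len(emb.data[0].embedding))  # 256
              |     # Quantization + oversampling happen in the search
              |     # index, not shown here.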
        
         | 0x457 wrote:
         | I get the "good" result with phi-4 and gemma-3n in RAG scenario
         | - i.e. it only used context provided to answer and couldn't
         | answer questions if context lacked the answer without
         | hallucination.
        
       | fatty_patty89 wrote:
       | What the fuck? Nobody else saw the cursor ceo looking through the
       | gpt5 generated code, mindlessly scrolling saying "this looks
       | roughly correct, i would love to merge that" LOL
       | 
       | You can't make this up
        
         | isoprophlex wrote:
         | This is the ideal software engineer. You may not like it, but
         | this is what peak software engineering looks like.
         | 
         | /s
        
         | siva7 wrote:
          | amazing time to be alive, if only for this clown show
        
           | throwawaybob420 wrote:
           | if you're not using an LLM to vibe code garbage then are you
           | really a software developer?
        
         | bn-l wrote:
         | That explains a lot.
        
       | hrpnk wrote:
        | The GitHub issue shown in the livestream is getting lots of
       | traction: https://github.com/openai/openai-python/issues/2472
       | 
        | A human had attempted to solve it before, yet it was not
        | merged... With all the great coding models OpenAI has access to,
        | their SDK team still feels too small for its needs.
        
       | te_chris wrote:
       | https://platform.openai.com/docs/guides/latest-model
       | 
       | Looks like they're trying to lock us into using the Responses API
       | for all the good stuff.
        
       | henriquegodoy wrote:
        | I don't think there's much difference between Opus 4.1 and GPT-5,
        | probably just the context size. Waiting for Gemini 3.0.
        
         | backscratches wrote:
         | gpt5 much cheaper
        
         | macawfish wrote:
         | Claude 5 is the one I'm most excited about.
        
       | sberens wrote:
       | Interesting there doesn't seem to be benchmarking on codeforces
        
       | jaflo wrote:
       | I just wish their realtime audio pricing would go down but it
       | looks like GPT-5 does not have support for that so we're stuck
       | with the old models.
        
       | zaronymous1 wrote:
       | Can anyone explain to me why they've removed parameter controls
       | for temperature and top-p in reasoning models, including gpt-5?
        | It strikes me that this makes it harder to build with these for
        | small tasks requiring high levels of consistency, and in the API
        | I really value the ability to run certain tasks at a low temp.
        
       | jngiam1 wrote:
       | I was a little bummed that there wasn't more about better MCP
       | support in ChatGPT, hopefully soon.
        
         | cheema33 wrote:
         | MCP is overhyped and most MCP servers are useless. What
         | specific MCP server do you find critical in your regular use?
         | And what functionality is missing that you wish to see in
         | ChatGPT?
        
       | ivape wrote:
       | Musk after GPT5 launch: "OpenAI is going to eat Microsoft alive"
       | 
       | https://x.com/elonmusk/status/1953509998233104649
       | 
       | Anyone know why he said that?
        
         | brookst wrote:
         | He was high AF?
        
       | nadis wrote:
       | "When producing frontend code for web apps, GPT-5 is more
       | aesthetically-minded, ambitious, and accurate. In side-by-side
       | comparisons with o3, GPT-5 was preferred by our testers 70% of
       | the time."
       | 
       | That's really interesting to me. Looking forward to trying GPT-5!
        
       | attentive wrote:
       | > scoring 74.9% on SWE-bench Verified and 88% on Aider polyglot
       | 
       | why isn't it on https://aider.chat/docs/leaderboards/?
       | 
       | "last updated August 07, 2025"
        
       | worik wrote:
       | Diminishing returns?
        
       | attentive wrote:
       | "Notably, GPT-5 with minimal reasoning is a different model than
       | the non-reasoning model in ChatGPT, and is better tuned for
       | developers. The non-reasoning model used in ChatGPT is available
       | as gpt-5-chat-latest."
       | 
       | hmm, they should call it gpt-5-chat-nonreasoning or something.
        
       | jodosha wrote:
       | Still no CLI like Claude Code?
        
         | Game_Ender wrote:
         | You are looking for Codex CLI [0].
         | 
         | 0 - https://github.com/openai/codex
        
           | jodosha wrote:
           | Thank you!
        
         | mediaman wrote:
         | It works on Codex CLI, install it with npm.
         | 
         | That's been out for a while and used their 'codex' model, but
         | they updated it today to default to gpt-5 instead.
        
           | jodosha wrote:
           | Oh nice, thanks!
        
       | guybedo wrote:
        | Here's a summary of this discussion:
       | 
       | https://extraakt.com/extraakts/openai-s-gpt-5-performance-co...
        
       | mwigdahl wrote:
       | Has anyone tried connecting up GPT-5 to Claude Code using the
       | model environment variables?
        
       | planet_1649c wrote:
        | Can we use this model on a fixed plan like Claude Code, where we
        | pay $100/month?
        | 
        | Doesn't look like it. Unless they add fixed pricing, Claude imo
        | would still be better from a developer POV.
        
         | spiderice wrote:
         | I just said something similar in another comment on this
          | thread. I'm not interested in the mental overhead of getting
          | charged per query. I feel like when I use pay-per-token tools,
         | it's always in the back of my mind. Even if it's a bit more
         | expensive to pay a flat rate, it's so worth it for the peace of
         | mind.
        
       | wewewedxfgdf wrote:
       | Tried it on a tough problem.
       | 
        | GPT-5 diagnosed the problem - which Gemini failed to do - then
        | failed 6 times in a row to write the code to fix it.
       | 
       | I then gave ChatGPT-5's problem analysis to Google Gemini and it
       | immediately implemented the correct fix.
       | 
       | The lesson - ChatGPT is good at analysis and code reviews, not so
       | good at coding.
        
         | cperkins wrote:
          | I have something that both Gemini (via GCA) and Copilot
          | (Claude) analyzed and came up with the same diagnosis. Each of
          | them proposed the exact same wrong solution, and when I pointed
          | that out, went further wrong.
         | 
         | I haven't tried Chat GPT on it yet, hoping to do so soon.
        
       | 6thbit wrote:
       | Can anyone share their experience with codex CLI? I feel like
       | that's not mentioned enough and gpt5 is already the default model
       | there.
        
         | macawfish wrote:
          | Not good, sadly. Claude Code seems much better in terms of
          | overall polish but also in how it handles context. I don't
          | really want to throw the LLM into the deep end without proper
          | tools and context, and I get the sense that this is what was
          | happening in Codex.
        
       | joshmlewis wrote:
       | It does really well at using tool calls to gain as much context
        | as it can to provide thoughtful answers. In this example it did
        | 6 (!) tool calls in the first response, while 4.1 did 3 and o3
        | did one at a time.
       | 
       | https://promptslice.com/share/b-2ap_rfjeJgIQsG
        
       | joshmlewis wrote:
       | It's free in Cursor for the next few days, you should go try it
        | out if you haven't. I've been an agentic coding power user since
        | the day it came out, across several IDEs/CLI tools, and Cursor +
        | GPT-5 seems to be a great combo.
        
       | austinmw wrote:
       | Okay so say GPT-5 is better than Claude Opus 4.1. Then is
       | GPT-5+Cursor better than Opus 4.1 + Claude Code? And if not,
       | what's the best way to utilize GPT-5?
        
         | felipemesquita wrote:
          | I'm not sure yet if it's better than Claude, but the best way
          | to use GPT-5 is https://github.com/charmbracelet/crush
        
         | kristo wrote:
         | Apparently there is a cursor cli now... but I love the flat
         | pricing of Claude's Max plan and dislike having to worry about
         | pricing and when to use "Max" mode in cursor.
        
       ___________________________________________________________________
       (page generated 2025-08-07 23:00 UTC)