[HN Gopher] GPT-5 for Developers
___________________________________________________________________
GPT-5 for Developers
Author : 6thbit
Score : 314 points
Date : 2025-08-07 17:06 UTC (5 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| andrewmcwatters wrote:
| I wonder how good it is compared to Claude Sonnet 4, and when
| it's coming to GitHub Copilot.
|
| I almost exclusively wrote and released
| https://github.com/andrewmcwattersandco/git-fetch-file yesterday
| with GPT-4o and Claude Sonnet 4, and the latter's agentic
| behavior was quite nice. I barely had to guide it, and was able
| to quickly verify its output.
| fleebee wrote:
| There is an option in GitHub Copilot settings to enable GPT-5
| already.
| croemer wrote:
| > GPT-5 also excels at long-running agentic tasks--achieving SOTA
| results on tau2-bench telecom (96.7%), a tool-calling benchmark
| released just 2 months ago.
|
| Yes, but it does worse than o3 on the airline version of that
| benchmark. The prose is totally cherry-picked.
| Fogest wrote:
| How does the cost compare though? From my understanding o3 is
| pretty expensive to run. Is GPT-5 less costly? If so if the
| performance is close to o3 but cheaper, then it may still be a
| good improvement.
| low_tech_punk wrote:
| I find it strange that GPT-5 is cheaper than GPT-4.1 on input
| tokens and only slightly more expensive on output tokens. Is
| it marketing or actually reflecting the underlying compute
| resources?
| AS04 wrote:
| Very likely to be an actual reflection. That's probably
| their real achievement here and the key reason why they are
| actually publishing it as GPT-5. More or less the best or
| near to it on everything while being one model,
| substantially cheaper than the competition.
| ComputerGuru wrote:
| But it can't do audio in/out or image out. Feels like an
| architectural step back.
| conradkay wrote:
| My understanding is that image output is pretty separate
| and if it doesn't seem that way, they're just abstracting
| several models into one name
| bn-l wrote:
| Maybe with the router mechanism (to mini or standard) they
| estimate the average cost will be a lot lower for chatgpt
| because the capable model won't be answering dumb questions
| and then they pass that on to devs?
| low_tech_punk wrote:
| I think the router applies to chatgpt app. The developer
| APIs expose manual control to select the specific model
| and level of reasoning.
| jstummbillig wrote:
| I mean... they themselves included that information in the
| post. It's not exactly a gotcha.
| tedsanders wrote:
| I wrote that section and made the graphs, so you can blame me.
| We no doubt highlight the evals that make us look good, but in
| this particular case I think the emphasis on telecom isn't
| unprincipled cherry picking.
|
| Telecom was made after retail & airline, and fixes some of
| their problems. In retail and airline, the model is graded
| against a ground truth reference solution. But in reality,
| there can be multiple solutions that solve the problem, and
| perfectly good answers can receive scores of 0 by the automatic
| grading. This, along with some user model issues, is partly why
| airline and retail scores haven't climbed with the latest
| generations of models and are stuck around 60% / 80%. Even a
| literal superintelligence would probably plateau here.
|
| In telecom, the authors (Barres et al.) made the grading less
| brittle by grading against outcome states, which may be
| achieved via multiple solutions, rather than by matching
| against a single specific solution. They also improved the user
| modeling and some other things too. So telecom is the much
| better eval, with a much cleaner signal, which is partly why
| models can score as high as 97% instead of getting mired at
| 60%/80% due to brittle grading and other issues.
|
| Even if I had never seen GPT-5's numbers, I like to think I
| would have said ahead of time that telecom is much better than
| airline/retail for measuring tool use.
|
| Incidentally, another thing to keep in mind when critically
| looking at OpenAI and others reporting their scores on these
| evals is that the evals give no partial credit - so sometimes
| you can have very good models that do all but one thing
| perfectly, which results in very poor scores. If you tried
| generalizing to tasks that don't trigger that quirk, you might
| get much better performance than the eval scores suggest (or
| vice versa, if they trigger a quirk not present in the eval).
|
| Here's the tau2-bench paper if anyone wants to read more:
| https://arxiv.org/abs/2506.07982
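
The no-partial-credit point is easy to see with a toy scoring sketch (an illustration of the grading style described above, not tau2-bench's actual harness):

```python
# Toy illustration of all-or-nothing grading (not tau2-bench's actual
# harness): one failed step zeroes an otherwise perfect episode.

def score_episode(steps_correct):
    # All-or-nothing: any failed step fails the whole episode.
    return 1.0 if all(steps_correct) else 0.0

def partial_credit(steps_correct):
    # Hypothetical alternative grader: fraction of steps done right.
    return sum(steps_correct) / len(steps_correct)

# An agent that gets 9 of 10 steps right in every episode:
episodes = [[True] * 9 + [False] for _ in range(20)]

strict = sum(score_episode(e) for e in episodes) / len(episodes)
partial = sum(partial_credit(e) for e in episodes) / len(episodes)

print(strict)   # 0.0 under all-or-nothing grading
print(partial)  # ~0.9: most of the work was actually correct
```

An agent that is nearly perfect on every episode still scores zero under the strict grader, which is the quirk the comment above warns about when generalizing from eval scores.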
| aliljet wrote:
| Between Opus and GPT-5, it's not clear there's a substantial
| difference in software development expertise. The metric that I
| can't seem to get past in my attempts to use the systems is
| context awareness over long-running tasks. Producing a very
| complex, context-exceeding objective is a daily (maybe hourly)
| occurrence for me. All I care about is how these systems manage
| context and stay on track over extended periods of time.
|
| What eval is tracking that? It seems like it's potentially the
| most important metric for real-world software engineering and not
| one-shot vibe prayers.
| realusername wrote:
| Personally I think I'll wait for another 10x improvement for
| coding because with the current way it's going, they clearly
| need that.
| fsloth wrote:
| From my experience when used through IDE such as Cursor the
| current gen Claude model enables impressive speedruns over
| commodity tasks. My context is a CAD application I've been
| writing as a hobby. I used to work in that field for a decade
| so I have a pretty good feel for how long I would expect tasks
| to take. I'm using mostly the same software stack as at my
| previous job and am definitely getting stuff done much faster
| on holiday at home than at that previous work. Of course the
| codebase is also a lot smaller, intrinsic motivation, etc,
| but still.
| 42lux wrote:
| How often do you have to build the simple scaffolding
| though?
| realusername wrote:
| I've done pretty much the same as you (Cursor/Claude) for
| our large Rails/React codebase at work and the experience
| has been horrific so far, I reverted back to vscode.
| bdangubic wrote:
| _context awareness over long-running tasks_
|
| don't have long-running tasks, llms or not. break the problem
| down into small manageable chunks and then assemble it. neither
| humans nor llms are good at long-running tasks.
| beoberha wrote:
| A series of small manageable chunks becomes a long running
| task :)
|
| If LLMs are going to act as agents, they need to maintain
| context across these chunks.
| bastawhiz wrote:
| > neither humans nor llms are good at long-running tasks.
|
| That's a wild comparison to make. I can easily work for an
| hour. Cursor can hardly work for a continuous pomodoro.
| "Long-running" is not a fixed size.
| echelon wrote:
| Humans can error correct.
|
| LLMs multiply errors over time.
| bdangubic wrote:
| I just finished my workday, 8hrs with Claude Code. No
| single task took more than 20 minutes total. Cleared
| context after each task and asked it to summarize for
| itself the previous task before I cleared context. If I ran
| this as a continuous 8hr task it would have died after
| 35-ish minutes. Just know the limitations (like with any
| other tool) and you'll be good :)
| 0x457 wrote:
| I always find it wild that none of these tools use VCS -
| completed logical unit of work, make a commit, drop
| entire context related to that commit, while referencing
| said commit, continue onto the next stage, rinse and
| repeat.
|
| Claude always misunderstands how the API exported by my
| service works, and after every compaction it forgets all over
| again and commits "oh, the API has changed since last time I
| used it, let me use different query parameters". My brother in
| Christ, nothing has changed, and you are the one who made
| this API.
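
The commit-summarize-continue loop described above can be sketched abstractly. Everything here is made up for illustration (a real version would call git and an LLM; `summarize` is a stub):

```python
# Toy sketch of the "finish a unit of work, summarize it, drop the
# working context" loop. Names (run_session, summarize) are
# placeholders, not a real agent API.

def summarize(transcript: str) -> str:
    # Placeholder: a real implementation would ask the model to
    # compress the finished task into a few durable facts.
    return transcript.splitlines()[-1]

def run_session(tasks: list) -> list:
    summaries = []  # survives across tasks, like commit messages
    for task in tasks:
        context = summaries[:]  # fresh context + carried summaries
        context.append(f"working on: {task}")
        context.append(f"done: {task}")
        summaries.append(summarize("\n".join(context)))
        # the full working context is dropped here; only the
        # compact summary is carried into the next task
    return summaries

print(run_session(["add login", "fix API client"]))
```

The point is that each task starts from a short, durable record of prior work instead of an ever-growing transcript that eventually gets compacted lossily.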
| bastawhiz wrote:
| You can use cursor rules to tell cursor to update the
| project cursor rules with details about the API.
| bahmboo wrote:
| Roo Code does this
| novok wrote:
| I think that is because you do implicit plan tracking,
| creation and modification of the plan in your head in light
| of new information and then follow that plan. I'm not sure
| these tools do that very well.
|
| The long running task, at its core, is composed of many
| smaller tasks and you mostly focus on one task at a time
| per brain part. It's why you cannot read two streams of
| text simultaneously even if both are in your visual focus
| field.
| vaenaes wrote:
| You're holding it wrong
| swader999 wrote:
| If GPT-5 truly has 400k context, that might be all it needs to
| meaningfully surpass Opus.
| AS04 wrote:
| 400k context with 100% on the fiction livebench would make
| GPT-5 the indisputably best model IMHO. Don't think it will
| achieve that though, sadly.
| simonw wrote:
| It's 272,000 input tokens and 128,000 output tokens.
| zurfer wrote:
| Woah that's really kind of hidden. But I think you can
| specify max output tokens. Need to test that!
| 6thbit wrote:
| Oh, I had not grasped that the "context window" size
| advertised had to include both input and output.
|
| But is it really 272k even if the output was say 10k? Cause
| it does say "max output" in the docs, so I wonder
| simonw wrote:
| This is the only model where the input limit and the
| context limit are different values. OpenAI docs team are
| working on updating that page.
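
In other words, the advertised 400k window is the sum of those two limits, per the figures quoted above:

```python
# GPT-5's advertised 400k "context window" is the input limit plus
# the output limit, not 400k of input (figures from this thread).
MAX_INPUT_TOKENS = 272_000
MAX_OUTPUT_TOKENS = 128_000
CONTEXT_WINDOW = MAX_INPUT_TOKENS + MAX_OUTPUT_TOKENS

print(CONTEXT_WINDOW)  # 400000
```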
| dimal wrote:
| Even with large contexts there's diminishing returns. Just
| having the ability to stuff more tokens in context doesn't
| mean the model can effectively use it. As far as I can tell,
| they always reach a point in which more information makes
| things worse.
| Byamarro wrote:
| The bigger question is its tendency toward context rot, not the
| size of its context :) LLMs are supposedly able to load 3 bibles
| into their context, but they forget what they were about to do
| after loading 600 LoC of locales.
| andrewmutz wrote:
| Having a large context window is very different from being
| able to effectively use a lot of context.
|
| To get great results, it's still very important to manage
| context well. It doesn't matter if the model allows a very
| large context window, you can't just throw in the kitchen
| sink and expect good results
| tekacs wrote:
| Coupled with the humongous price difference...
| logicchains wrote:
| >Between Opus aand GPT-5, it's not clear there's a substantial
| difference in software development expertise.
|
| If there's no substantial difference in software development
| expertise then GPT-5 absolutely blows Opus out of the water due
| to being almost 10x cheaper.
| spiderice wrote:
| Does OpenAI provide a $200/month option that lets me use as
| much GPT-5 I want inside of Codex?
|
| Because if not, I'd still go with Opus + Claude Code. I'd
| rather be able to tell my employer, "this will cost you
| $200/month" than "this might cost you less than $200/month,
| but we really don't know because it's based on usage"
| nadis wrote:
| It's pretty vague, but the OP had this callout:
|
| >"GPT-5 is the strongest coding model we've ever released. It
| outperforms o3 across coding benchmarks and real-world use
| cases, and has been fine-tuned to shine in agentic coding
| products like Cursor, Windsurf, GitHub Copilot, and Codex CLI.
| GPT-5 impressed our alpha testers, setting records on many of
| their private internal evals."
| RobinL wrote:
| Totally agree. At the moment I find that frontier LLMs are able
| to solve most of the problems I throw at them given enough
| context. Most of my time is spent working out what context
| they're missing when they fail. So the thing that would help me
| most is a much more focused ability to gather context.
|
| For my use cases, this mostly means being able to really home
| in on relevant code files, issues, discussions, and PRs. I'm hopeful
| that GPT5 will be a step forward in this regard that isn't
| fully captured in the benchmark results. It's certainly
| promising that it can achieve similar results more cheaply than
| e.g. Opus.
| abossy wrote:
| At my company (Charlie Labs), we've had a tremendous amount of
| success with context awareness over long-running tasks with
| GPT-5 since getting access a few weeks ago. We ran an eval to
| solve 10 real Github issues so that we could measure this
| against Claude Code and the differences were surprisingly
| large. You can see our write-up here:
|
| https://charlielabs.ai/research/gpt-5
|
| Often, our tasks take 30-45 minutes and can handle massive
| context threads in Linear or Github without getting tripped up
| by things like changes in direction part of the way through the
| thread.
|
| While 10 issues isn't crazy comprehensive, we found it to be
| directionally very impressive and we'll likely build upon it to
| better understand performance going forward.
| bartman wrote:
| I am not (usually) photosensitive, but the animated static
| noise on your website causes noticeable flickering on various
| screens I use and made it impossible for me to read your
| article.
|
| For better accessibility and a safer experience[1] I would
| recommend not animating the background, or at least making it
| easily togglable.
|
| [1] https://developer.mozilla.org/en-
| US/docs/Web/Accessibility/G...
| MPSFounder wrote:
| I concur. Awful UI
| neom wrote:
| Removed- sorry, and thank you for the feedback.
| jeanlucas wrote:
| Nice,
| pxc wrote:
| Love your responsiveness here!
|
| Edited to add: I am, in fact, photosensitive (due to a
| genetic retinal condition), and for my eyes, your site as it
| is now is very easy to read, and the visualizations look
| great.
| bartman wrote:
| Thank you!
|
| Love that you included the judge prompts in your article.
| neom wrote:
| Please let me know what you would like to see more of.
| Evals are something we take seriously. I think this post
| was ok enough given our constraints, but I'd like to
| produce content people find useful and I think we can do
| a lot better.
| joshmlewis wrote:
| I've been testing it against Opus 4.1 the last few hours and it
| has done better and solved problems Claude kept failing at. I
| would say it's definitely better, at least so far.
| cyanydeez wrote:
| Real context is a graph of objectives and results.
|
| The power of these models has peaked; they simply aren't going
| to manage the type of awareness being promised.
| 1659447091 wrote:
| > Producing a very complex, context-exceeding objective is a
| daily (maybe hourly) ocurrence for me. All I care about is how
| these systems manage context and stay on track over extended
| periods of time.
|
| For whatever reason Github's Copilot is treated like the
| redheaded stepchild of coding assistants, even though there
| are Anthropic, OpenAI, and Google models to choose from. And
| there is a "spaces"[0] website feature that may be close to
| what you are looking for.
|
| I got better results testing some larger tasks using that
| than I did through the IDE version. But have not used it much.
| Maybe others have more experience with it. Trying to gather all
| the context and then review the results was taking longer than
| doing it myself; having the context gathered already or
| building it up over time is probably where its value is.
|
| [0] https://docs.github.com/en/copilot/concepts/spaces
| risho wrote:
| over the last week or so I have put probably close to 70 hours
| into playing around with cursor and claude code and a few other
| tools (it's become my new obsession). I've been blown away by how
| good and reliable it is now. That said the reality is in my
| experience, the only models that actually work in any sort of
| reliable way are Claude models. I don't care what any benchmark
| says, because the only thing that actually matters is actual use.
| I'm really hoping that this new gpt model actually works for this
| usecase because competition is great and the price is also great.
| ralfd wrote:
| Just replying to ask you next week what your assessment on GPT5
| is.
| throwaway_2898 wrote:
| How much of the product were you able to build to say it was
| good/reliable? IME, 70 hours can get you to a PoC that "works".
| Building beyond the initial set of features -- say, a first
| draft of all the APIs -- does it do well once you start
| layering features?
| petralithic wrote:
| This has been my experience. The greenfield approach works up
| to a point, then it just breaks.
| Centigonal wrote:
| Ditto here, except I'm using Roo and it's Claude and Gemini pro
| 2.5 that work for me.
| neuronexmachina wrote:
| > That said the reality is in my experience the only models
| that actually work in any sort of reliable way are claude
| models.
|
| Anecdotally, the tool updates in the latest Cursor (1.4) seem
| to have made tool usage in models like Gemini much more
| reliable. Previously it would struggle to make simple file
| edits, but now the edits work pretty much every time.
| zarzavat wrote:
| The magic is the prompting/tool use/finetuning.
|
| I find that OpenAI's reasoning models write better code and are
| better at raw problem solving, but Claude code is a much more
| useful product, even if the model itself is weaker.
| rcarr wrote:
| I think some of this might come down to stack as well. I
| watched a t3.gg video[1] recently about Convex[2] and how the
| nature of it leads to the AI getting it right first time more
| often. I've been playing around with it the last few days and I
| think I agree with him.
|
| I think the dev workflow is going to fundamentally change,
| because to maximise productivity you need multiple AIs working
| in parallel. Rather than jumping straight into coding, we're
| going to end up writing a bunch of tickets in a PM tool
| (Linear[3] looks like it's winning the race atm), working out
| (or using the AI to work out) which ones can run in parallel
| without causing merge conflicts, then pulling multiple tickets
| into your IDE/terminal, cycling through the tabs, and jumping
| in as needed.
|
| Atm I'm still not really doing this but I know I need to make
| the switch and I'm thinking that Warp[4] might be best suited
| for this kind of workflow, with the occasional switch over to
| an IDE when you need to jump in and make some edits.
|
| Oh also, to achieve this you need to use git worktrees[5,6,7].
|
| [1]: https://www.youtube.com/watch?v=gZ4Tdwz1L7k
|
| [2]: https://www.convex.dev/
|
| [3]: https://linear.app/
|
| [4]: https://www.warp.dev/
|
| [5]: https://docs.anthropic.com/en/docs/claude-code/common-
| workfl...
|
| [6]:https://git-scm.com/docs/git-worktree
|
| [7]:https://www.tomups.com/posts/git-worktrees/
| isoprophlex wrote:
| Sure sounds interesting but... Where on earth do you actually
| find the time to sit through a 1.5 hour yt video?!
| rcarr wrote:
| Jump in and start coding entire backend with stack not best
| suited for job and modern AI tools: most likely future
| hours lost.
|
| Spend 1.5 hours now to learn from an experienced dev on a
| stack that is better suited for job: most likely future
| hours gained.
| v5v3 wrote:
| People find time for things that seem important to them.
| burnished wrote:
| 1.5x and 2x speed help a lot, slow down or repeat segments
| as needed, don't be afraid to fast forward past irrelevant
| looking bits (just be eager to backtrack).
| mafro wrote:
| Ask an LLM to transcribe and give the overview and key
| points
| mceachen wrote:
| On a desktop browser, tap YouTube's "show transcript" and
| "hide timecodes", then copy-paste the whole transcript into
| Claude or ChatGPT and tell it to summarize at whatever
| resolution you want: a couple of sentences, 400 lines,
| whatever. You can also tell it to focus on certain subject
| material.
|
| This is a complete game changer for staying on top of
| what's being covered by local government meetings. Our
| local bureaucrats are astoundingly competent at talking about
| absolutely nothing for 95% of the time, but hidden are three
| minutes of "oh btw we're planning on paving over the local
| open space preserve to provide parking for the local
| business".
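
The copy-paste step above is easy to semi-automate by splitting a long transcript into pasteable chunks. This sketch is an assumption, not part of any tool mentioned here; the ~4 characters per token figure is a rough heuristic:

```python
# Split a long transcript into chunks small enough to paste into
# Claude or ChatGPT, using a rough ~4 chars/token heuristic.
# chunk_transcript is a made-up helper name for illustration.

def chunk_transcript(text: str, max_tokens: int = 100_000) -> list:
    max_chars = max_tokens * 4          # rough chars-per-token heuristic
    lines = text.splitlines(keepends=True)
    chunks, current = [], ""
    for line in lines:
        # flush before this line would push the chunk over budget
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

transcript = "speaker: blah blah\n" * 50_000   # ~950k characters
chunks = chunk_transcript(transcript)
print(len(chunks))
```

Each chunk can then be summarized separately, with the per-chunk summaries pasted back for a final pass.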
| rcarr wrote:
| Seems like VSCode just added a lot of stuff for this in the
| latest update today, such as worktree support[1] and an agent
| session mode[2].
|
| [1]: https://code.visualstudio.com/updates/v1_103#_git-
| worktree-s...
|
| [2]: https://code.visualstudio.com/updates/v1_103#_chat-
| sessions-...
| timhigins wrote:
| I opened up the developer playground and the model selection
| dropdown showed GPT-5 and then it disappeared. Also I don't see
| it in ChatGPT Pro. What's up?
| Fogest wrote:
| It's probably being throttled due to high usage.
| IAmGraydon wrote:
| Not showing in my Pro account either. As someone else
| mentioned, I'm sure it's throttling due to high use right now.
| brookst wrote:
| Shipping something at the moment of announcement is always
| hell.
| sebdufbeau wrote:
| Has the API rollout started? It's not available in our org, even
| if we've been verified for a few months
|
| EDIT: It's out now
| spullara wrote:
| It is not out yet. I poll the API for the models and update this
| GitHub repo hourly.
|
| https://github.com/spullara/models
| low_tech_punk wrote:
| The ability to specify a context-free grammar as output
| constraint? This blows my mind. How do you control the
| autoregressive sampling to guarantee the correct syntax?
| qsort wrote:
| You sample only from tokens that could possibly result in a
| valid production for the grammar. It's an inference-only thing.
| low_tech_punk wrote:
| ah, thanks!
| evnc wrote:
| I assume they're doing "Structured Generation" or "Guided
| generation", which has been possible for a while if you control
| the LLM itself e.g. running an OSS model, e.g. [0][1]. It's
| cool to see a major API provider offer it, though.
|
| The basic idea is: at each auto-regressive step (each token
| generation), instead of letting the model generate a
| probability distribution over "all tokens in the entire vocab
| it's ever seen" (the default), only allow the model to generate
| a probability distribution over "this specific set of tokens I
| provide". And that set can change from one sampling set to the
| next, according to a given grammar. E.g. if you're using a JSON
| grammar, and you've just generated a `{`, you can provide the
| model a choice of only which tokens are valid JSON immediately
| after a `{`, etc.
|
| [0] https://github.com/dottxt-ai/outlines [1]
| https://github.com/guidance-ai/guidance
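
A minimal, self-contained illustration of that masking idea (a toy target "grammar", not how outlines, guidance, or the OpenAI API implement it internally):

```python
# Toy illustration of grammar-constrained decoding. At each step the
# "model" scores a tiny vocab, and we mask out every token the grammar
# forbids, then pick the best surviving token.

VOCAB = ["{", "}", '"k"', ":", "1", ","]

def allowed_tokens(prefix: str) -> set:
    # Deliberately tiny stand-in for a real CFG: the only valid
    # production is the exact string {"k":1}.
    target = '{"k":1}'
    if len(prefix) >= len(target):
        return set()
    # only tokens that keep us on a prefix of the target are legal
    return {t for t in VOCAB if target.startswith(prefix + t)}

def fake_model_scores(prefix: str) -> dict:
    # Stand-in for the LLM: it *prefers* an illegal token ("," always)
    return {t: (1.0 if t == "," else 0.5) for t in VOCAB}

def constrained_decode() -> str:
    out = ""
    while True:
        legal = allowed_tokens(out)
        if not legal:
            break
        scores = fake_model_scores(out)
        # the mask: choose only among grammar-legal tokens
        out += max(legal, key=lambda t: scores[t])
    return out

print(constrained_decode())  # {"k":1}
```

Even though the model prefers an illegal token at every step, the mask guarantees the output is grammatical; real implementations compute the legal-token set from a compiled CFG or regex automaton rather than a fixed target string.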
| low_tech_punk wrote:
| Tried using gpt-5 family with response API and got error "gpt-5
| does not exist or you don't have access to it". I guess they are
| not rolling out in lock step with the live stream and blog
| article?
| diggan wrote:
| Seems they're doing rollout over time, I'm not seeing it
| anywhere yet.
| low_tech_punk wrote:
| Can confirm that they are rolling out. It's working for me.
| catigula wrote:
| I thought we were going to have AGI by now.
| RS-232 wrote:
| No shot. LLMs are simple text predictors and they are too
| stupid to get us to real AGI.
|
| To achieve AGI, we will need to be capable of high fidelity
| whole brain simulations that model the brain's entire physical,
| chemical, and biological behavior. We won't have that kind of
| computational power until quantum computers are mature.
| evantbyrne wrote:
| It will be interesting to see if humans can manage to
| bioengineer human-level general intelligence into another
| species before computers.
| machiaweliczny wrote:
| [flagged]
| bopbopbop7 wrote:
| "some twist" is doing a lot of heavy lifting in that
| statement.
| AppleBananaPie wrote:
| CS will define, design and implement human level
| intelligence before neuroscience has done even the first.
|
| That's what I hear when people say stuff like this
| anyway.
|
| Similar to CS folks throwing around physics 'theories'
| nawgz wrote:
| I don't really see any relationship between being able to
| model/simulate the brain and being able to exceed the brain
| in intelligence, can you explain more about that? Simulations
| sound like more of a computational and analytic problem with
| regards to having an accurate model.
|
| Maybe your point is that until we understand our own
| intelligence, which would be reflected in such a simulation,
| it would be difficult to improve upon it.
| brookst wrote:
| Are you saying that only (human?) biological brains can be
| GI, _and_ that whatever intelligence is, it would emerge from
| a pure physics-based simulation?
|
| Both of those seem questionable, multiplying them together
| seems highly unlikely.
| jplusequalt wrote:
| Are you arguing that intelligence is not physical? Could
| you name a single thing in existence that fundamentally
| cannot be linked to physics?
| 93po wrote:
| in what way are human brains also not just predictors? our
| neural pathways are built and reinforced as we have repeated
| exposure to inputs through any of our senses. our brains are
| expert pattern-followers, to the point that it happens even
| when we strongly don't want to (in the case of PTSD, for
| example, or people who struggle with impulse control and
| executive functioning).
|
| What's the next sentence I'm going to type? Is it not just
| based on the millions of sentences I've typed and read
| before? Even the premise of me playing devil's advocate here:
| that's a pattern I've learned over my entire life too.
|
| your argument also falls apart a bit when we see emergent
| behavior, which has definitely happened
| JamesBarney wrote:
| When we're being hunted down by nano-bots some of the last
| few survivors will still be surprised that a simple text
| predictor could do so much.
| IAmGraydon wrote:
| Not going to happen any time soon, if ever. LLMs are extremely
| useful, but the intelligence part is an illusion that nearly
| everyone appears to have fallen for.
| jonplackett wrote:
| This POV is just the opposite extremity - and it's equally
| nuts. If you haven't seen any intelligence at all in an LLM
| you just aren't looking.
| skepticATX wrote:
| This was really a bad release for OpenAI, if benchmarks are even
| somewhat indicative of how the model will perform in practice.
| robterrell wrote:
| In what ways?
| mediaman wrote:
| I actually don't agree. Tool use is the key to successful
| enterprise product integration and they have done some very
| good work here. This is much more important to
| commercialization than, for example, creative writing quality
| (which it reportedly is not good at).
| jumploops wrote:
| If the model is as good as the benchmarks say, the pricing is
| fantastic:
|
| Input: $1.25 / 1M tokens (cached: $0.125/1Mtok) Output: $10 / 1M
| tokens
|
| For context, Claude Opus 4.1 is $15 / 1M for input tokens and
| $75/1M for output tokens.
|
| The big question remains: how well does it handle tools? (i.e.
| compared to Claude Code)
|
| Initial demos look good, but it performs worse than o3 on
| Tau2-bench airline, so the jury is still out.
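
Using the rates quoted above, the gap is easy to quantify for a hypothetical call with 100k input and 10k output tokens:

```python
# Cost comparison using the per-1M-token rates quoted in this thread;
# the 100k-in / 10k-out workload is a made-up example.

def cost(in_tokens, out_tokens, in_price, out_price):
    # prices are dollars per 1M tokens
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

gpt5 = cost(100_000, 10_000, 1.25, 10.0)   # GPT-5 rates
opus = cost(100_000, 10_000, 15.0, 75.0)   # Claude Opus 4.1 rates

print(f"GPT-5: ${gpt5:.3f}  Opus 4.1: ${opus:.2f}  ratio: {opus / gpt5:.0f}x")
```

On this workload the ratio comes out to roughly 12x in favor of GPT-5, consistent with the "almost 10x cheaper" figure cited elsewhere in the thread (the exact multiple depends on the input/output mix).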
| addaon wrote:
| > Output: $10 / 1M tokens
|
| It's interesting that they're using flat token pricing for a
| "model" that is explicitly made of (at least) two underlying
| models, one with much lower compute costs than the other; and
| with user ability to at least influence (via prompt) if not
| choose which model is being used. I have to assume this pricing
| model is based on a predicted split between how often the
| underlying models get used; I wonder if that will hold up, if
| users will instead try to rouse the better model into action
| more than expected, or if the pricing is so padded that it
| doesn't matter.
| simianwords wrote:
| > that is explicitly made of (at least) two underlying models
|
| what do you mean?
| addaon wrote:
| > a smart and fast model that answers most questions, a
| deeper reasoning model for harder problems, and a real-time
| router that quickly decides which model to use based on
| conversation type, complexity, tool needs, and explicit
| intent (for example, if you say "think hard about this" in
| the prompt).
|
| From https://openai.com/index/gpt-5-system-card/
| tedsanders wrote:
| In the API, there's no router. Developers just pick
| whether they use the reasoning model or non-thinking
| ChatGPT model.
| mkozlows wrote:
| That's how the browser-based ChatGPT works, but not the API.
| joshmlewis wrote:
| It does seem to be doing well compared to Opus 4.1 in my
| testing the last few hours. I've been on the Claude Code 200
| plan for a few months and I've been really frustrated with its
| output as of late. GPT-5 seems to be a step forward so far.
| 6thbit wrote:
| Seems they have quietly increased the context window up to
| 400,000
|
| https://platform.openai.com/docs/models/gpt-5
| ralfd wrote:
| How does that compare to Claude/GPT4?
| 6thbit wrote:
| 4o - 128k
| o3 - 200k
| Opus 4.1 - 200k
| Sonnet 4 - 200k
|
| So, at least twice the context of those
| hrpnk wrote:
| gpt4.1 has 1M input and 32k output, Sonnet 4 200k/64k
| simianwords wrote:
| but is it for the model in chatgpt.com as well?
| mehmetoguzderin wrote:
| Context-free grammar and regex support are exciting. I wonder
| what differences there are, if any, from the Lark-like CFG of
| llguidance, which powers the JSON schema of the OpenAI API [^1].
|
| [^1]: https://github.com/guidance-
| ai/llguidance/blob/f4592cc0c783a...
| msp26 wrote:
| Yeah that was the only exciting part of the announcement for me
| haha. Can't wait to play around with it.
|
| I'm already running into a bunch of issues with the structured
| output APIs from other companies like Google and OpenAI have
| been doing a great job on this front.
| chrisweekly wrote:
| > "I'm already running into a bunch of issues with the
| structured output APIs from other companies like Google and
| OpenAI have been doing a great job on this front."
|
| This run-on sentence swerved at the end; I really can't tell
| what your point is. Could you reword it for clarity?
| petercooper wrote:
| I read it as "... from other companies, like Google, and
| OpenAI have been doing a great job on this front"
| belter wrote:
| We were promised AGI and all we got was code generators...
| bmau5 wrote:
| It's a logical starting point, given there are pretty defined
| success/failure criteria
| ehutch79 wrote:
| The hype is real. We were told that we'd have AGI and be out
| of jobs 2 years ago, let alone today.
| brookst wrote:
| We were also told that AGI would never happen, that it was
| 6 months away, that it is 20 years away.
|
| I'm not sure of the utility of being so outraged that some
| people made wrong predictions.
| rowanG077 wrote:
| By whom? I don't think anyone seriously said in 2023 we
| have AGI in two years. Even now, no one reputable is
| claiming AGI in two years.
| esafak wrote:
| LLMs are saturating every benchmark. AGI may not be all that. I
| am already impressed. Perhaps you need robots to be awed.
| pamelafox wrote:
| I am testing out gpt-5-mini for a RAG scenario, and I'm impressed
| so far.
|
| I used gpt-5-mini with reasoning_effort="minimal", and that model
| finally resisted a hallucination that every other model
| generated.
|
| Screenshot in post here:
| https://bsky.app/profile/pamelafox.bsky.social/post/3lvtdyvb...
|
| I'll run formal evaluations next.
| potatolicious wrote:
| This feels like honestly the biggest gain/difference. I work on
| things that do a lot of tool calling, and the model
| hallucinating fake tools is a huge problem. Worse, sometimes
| the model will hallucinate a response directly without ever
| generating the tool call.
|
| The new training rewards that suppress hallucinations and tool-
| skipping hopefully push us in the right direction.
| ralfd wrote:
| Q: What does a product manager do?
|
| GPT4: Collaborating with engineering, sales, marketing,
| finance, external partners, suppliers and customers to ensure
| ...... etc
|
| GPT5: I don't know.
|
| Upon speaking these words, AI was enlightened.
| ComputerGuru wrote:
| That is genuinely nice to see. What are you using for the
| embeddings?
| pamelafox wrote:
| We use text-embedding-3-large, with both quantization and
| MRL reduction, plus oversampling on the search to
| compensate for the compression.
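As a sketch of that pipeline: the `dimensions` request field is how the embeddings API exposes MRL truncation, while the int8 scheme and oversampling helper below are generic illustrations I'm assuming, not necessarily what this particular stack does:

```python
# MRL reduction: ask the API for a truncated embedding up front
# (text-embedding-3-large is natively 3072-dimensional).
EMBED_REQUEST = {
    "model": "text-embedding-3-large",
    "dimensions": 256,                # MRL-truncated width
    "input": "chunk of document text",
}

def quantize_int8(vec):
    """Symmetric int8 quantization: one byte per dim plus one float scale."""
    scale = (max(abs(x) for x in vec) / 127.0) or 1.0  # avoid div-by-zero
    return [round(x / scale) for x in vec], scale

def dequantize_int8(q, scale):
    """Approximate reconstruction of the original floats."""
    return [x * scale for x in q]

def oversampled_k(k, factor=4):
    """Oversampling: fetch k * factor candidates from the compressed index,
    then rescore that shortlist with full-precision vectors."""
    return k * factor
```

Quantization and truncation both lose a little recall, which is exactly what the oversample-then-rescore step is compensating for.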
| 0x457 wrote:
| I get the "good" result with phi-4 and gemma-3n in a RAG
| scenario - i.e. they answered only from the provided context,
| and declined to answer (rather than hallucinate) when the
| context lacked the answer.
| fatty_patty89 wrote:
| What the fuck? Nobody else saw the Cursor CEO looking through
| the GPT-5-generated code, mindlessly scrolling and saying "this
| looks roughly correct, I would love to merge that" LOL
|
| You can't make this up
| isoprophlex wrote:
| This is the ideal software engineer. You may not like it, but
| this is what peak software engineering looks like.
|
| /s
| siva7 wrote:
| amazing time to be alive, if only for this clown show
| throwawaybob420 wrote:
| if you're not using an LLM to vibe code garbage then are you
| really a software developer?
| bn-l wrote:
| That explains a lot.
| hrpnk wrote:
| The github issue showed in the livestream is getting lots of
| traction: https://github.com/openai/openai-python/issues/2472
|
| A human attempted a fix for it before, but it was never
| merged... With all the great coding models OpenAI has access
| to, their SDK team still feels too small for the needs.
| te_chris wrote:
| https://platform.openai.com/docs/guides/latest-model
|
| Looks like they're trying to lock us into using the Responses API
| for all the good stuff.
| henriquegodoy wrote:
| I don't think there's much difference between Opus 4.1 and
| GPT-5, probably just the context size. Waiting for Gemini 3.0
| backscratches wrote:
| gpt5 much cheaper
| macawfish wrote:
| Claude 5 is the one I'm most excited about.
| sberens wrote:
| Interesting there doesn't seem to be benchmarking on codeforces
| jaflo wrote:
| I just wish their realtime audio pricing would go down but it
| looks like GPT-5 does not have support for that so we're stuck
| with the old models.
| zaronymous1 wrote:
| Can anyone explain to me why they've removed parameter controls
| for temperature and top-p in reasoning models, including gpt-5?
| It strikes me that it makes it harder to build with these to do
| small tasks requiring high-levels of consistency, and in the API,
| I really value the ability to set certain tasks to a low temp.
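There's no official answer I know of; a common guess is that the RL-tuned reasoning traces are calibrated for a fixed sampling setup. In the meantime, defensive client code can simply strip those parameters when the target is a reasoning model. This is a sketch, and the model-prefix list is my assumption, not an official taxonomy:

```python
# Assumed prefixes for models that reject temperature/top_p;
# check the model docs before relying on this list.
REASONING_PREFIXES = ("o1", "o3", "o4", "gpt-5")

def sampling_kwargs(model: str, temperature: float = 0.0,
                    top_p: float = 1.0) -> dict:
    """Return sampling params only for models that accept them."""
    if model.startswith(REASONING_PREFIXES):
        return {}  # reasoning models choose their own sampling
    return {"temperature": temperature, "top_p": top_p}
```

A wrapper like this lets the same call site serve both model families without the API returning a parameter error.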
| jngiam1 wrote:
| I was a little bummed that there wasn't more about better MCP
| support in ChatGPT, hopefully soon.
| cheema33 wrote:
| MCP is overhyped and most MCP servers are useless. What
| specific MCP server do you find critical in your regular use?
| And what functionality is missing that you wish to see in
| ChatGPT?
| ivape wrote:
| Musk after GPT5 launch: "OpenAI is going to eat Microsoft alive"
|
| https://x.com/elonmusk/status/1953509998233104649
|
| Anyone know why he said that?
| brookst wrote:
| He was high AF?
| nadis wrote:
| "When producing frontend code for web apps, GPT-5 is more
| aesthetically-minded, ambitious, and accurate. In side-by-side
| comparisons with o3, GPT-5 was preferred by our testers 70% of
| the time."
|
| That's really interesting to me. Looking forward to trying GPT-5!
| attentive wrote:
| > scoring 74.9% on SWE-bench Verified and 88% on Aider polyglot
|
| why isn't it on https://aider.chat/docs/leaderboards/?
|
| "last updated August 07, 2025"
| worik wrote:
| Diminishing returns?
| attentive wrote:
| "Notably, GPT-5 with minimal reasoning is a different model than
| the non-reasoning model in ChatGPT, and is better tuned for
| developers. The non-reasoning model used in ChatGPT is available
| as gpt-5-chat-latest."
|
| hmm, they should call it gpt-5-chat-nonreasoning or something.
| jodosha wrote:
| Still no CLI like Claude Code?
| Game_Ender wrote:
| You are looking for Codex CLI [0].
|
| 0 - https://github.com/openai/codex
| jodosha wrote:
| Thank you!
| mediaman wrote:
| It works in Codex CLI; install it with npm.
|
| Codex CLI has been out for a while and used their "codex"
| model, but they updated it today to default to gpt-5 instead.
| jodosha wrote:
| Oh nice, thanks!
| guybedo wrote:
| here's a summary of this discussion:
|
| https://extraakt.com/extraakts/openai-s-gpt-5-performance-co...
| mwigdahl wrote:
| Has anyone tried connecting up GPT-5 to Claude Code using the
| model environment variables?
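Untested sketch of one way this is commonly wired up: Claude Code honors `ANTHROPIC_BASE_URL` / `ANTHROPIC_MODEL` environment variables, so you can point it at an OpenAI-compatible proxy (e.g. a LiteLLM gateway) that speaks the Anthropic wire format and routes to GPT-5. All values here are placeholders:

```shell
# Point Claude Code at a local translating proxy instead of Anthropic.
export ANTHROPIC_BASE_URL="http://localhost:4000"  # your proxy endpoint
export ANTHROPIC_AUTH_TOKEN="sk-anything"          # proxy auth, not a real key
export ANTHROPIC_MODEL="gpt-5"                     # model the proxy routes to
claude
```

Whether the agentic behavior survives the swap is the open question, since Claude Code's prompting is presumably tuned for Claude.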
| planet_1649c wrote:
| Can we use this model in a fixed plan like claude code for which
| we can pay 100$ / month?
|
| Doesn't look like it. Unless they add fixed pricing, Claude
| IMO is still better from a developer POV
| spiderice wrote:
| I just said something similar in another comment on this
| thread. I don't like the mental overhead of being charged per
| query. When I use pay-per-token tools, it's always in the back
| of my mind. Even if a flat rate is a bit more expensive, it's
| well worth it for the peace of mind.
| wewewedxfgdf wrote:
| Tried it on a tough problem.
|
| GPT-5 solved the problem - which Gemini failed to solve - then
| failed 6 times in a row to write the code to fix it.
|
| I then gave ChatGPT-5's problem analysis to Google Gemini and it
| immediately implemented the correct fix.
|
| The lesson - ChatGPT is good at analysis and code reviews, not so
| good at coding.
| cperkins wrote:
| I have something that both Gemini (via GCA) and Copilot
| (Claude) analyzed, and they came up with the same diagnosis.
| Each of them proposed the exact same wrong fix, and when I
| pointed that out, got it even further wrong.
|
| I haven't tried ChatGPT on it yet; hoping to do so soon.
| 6thbit wrote:
| Can anyone share their experience with codex CLI? I feel like
| that's not mentioned enough and gpt5 is already the default model
| there.
| macawfish wrote:
| Not good, sadly. Claude Code seems much better in terms of
| overall polish, but also in how it handles context. I don't
| really want to throw the LLM into the deep end without proper
| tools and context, and I get the sense that's what was
| happening in Codex.
| joshmlewis wrote:
| It does really well at using tool calls to gather as much
| context as it can before giving thoughtful answers. In this
| example it made six tool calls in its first response, while 4.1
| made three and o3 made them one at a time.
|
| https://promptslice.com/share/b-2ap_rfjeJgIQsG
| joshmlewis wrote:
| It's free in Cursor for the next few days, you should go try it
| out if you haven't. I've been an agentic coding power user since
| the day it came out across several IDE's/CLI tools and Cursor +
| GPT-5 seems to be a great combo.
| austinmw wrote:
| Okay so say GPT-5 is better than Claude Opus 4.1. Then is
| GPT-5+Cursor better than Opus 4.1 + Claude Code? And if not,
| what's the best way to utilize GPT-5?
| felipemesquita wrote:
| I'm not sure yet if it's better than Claude, but the best way
| to use GPT-5 is https://github.com/charmbracelet/crush
| kristo wrote:
| Apparently there is a Cursor CLI now... but I love the flat
| pricing of Claude's Max plan and dislike having to worry about
| pricing and when to use "Max" mode in Cursor.
___________________________________________________________________
(page generated 2025-08-07 23:00 UTC)