[HN Gopher] Building more with GPT-5.1-Codex-Max
       ___________________________________________________________________
        
       Building more with GPT-5.1-Codex-Max
        
       Author : hansonw
       Score  : 280 points
       Date   : 2025-11-19 18:01 UTC (4 hours ago)
        
 (HTM) web link (openai.com)
 (TXT) w3m dump (openai.com)
        
       | iamronaldo wrote:
       | That was quick
        
         | bigyabai wrote:
         | My first thought was "they must not be seeing as many Claude
         | Code conversions as they hoped"
        
         | giancarlostoro wrote:
          | Whenever one of them publishes a milestone release, the rest
          | start publishing big milestones too. I'm waiting for Opus 5
          | next.
        
       | LZ_Khan wrote:
        | All I care about is performance on the METR benchmark.
        
       | Reubend wrote:
       | OpenAI likes to time their announcements alongside major
       | competitor announcements to suck up some of the hype. (See for
       | instance the announcement of GPT-4o a single day before Google's
       | IO conference)
       | 
       | They were probably sitting on this for a while. That makes me
       | think this is a fairly incremental update for Codex.
        
         | Palmik wrote:
         | GPT 5.1 / Codex already beats Gemini 3 on SWE Bench Verified
         | and Terminal Bench and this pushes the gap further. Seems like
         | a decent improvement.
        
         | bugglebeetle wrote:
         | That's how the game is played. We should be grateful for all
         | the competition that is driving these improvements, not
         | whinging about the realities of what companies have to do to
         | contest each other's position.
        
           | johnecheck wrote:
           | It's funny, this release comes right after the Gemini 3
           | release that coincided with day 1 of Microsoft's Ignite
           | conference.
        
         | peab wrote:
         | it's really getting old
        
         | johnwheeler wrote:
         | Gemini is eating their lunch, and OpenAI knows it.
        
         | criemen wrote:
         | Anthropic released the Opus 4.1 (basically, a new Opus 4
         | checkpoint) right around the big GPT-5 release date too, if I
         | remember correctly. At this point, anything goes to stay
         | relevant.
        
       | spmartin823 wrote:
       | I still want something no one has, which is the ability to launch
       | agents in different git worktrees simultaneously and check the
       | results out on my main branch for testing when they are finished.
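        | 
        | Roughly the shape of what I mean, as a sketch (the "agent"
        | command here is hypothetical, standing in for codex/claude/
        | whatever):
        | 
        |   import subprocess
        | 
        |   tasks = {"fix-auth": "Fix the auth bug",
        |            "add-cache": "Add a cache layer"}
        | 
        |   for branch, prompt in tasks.items():
        |       path = f"../wt-{branch}"
        |       # one worktree + branch per agent so they can't trample
        |       # each other's working copies
        |       subprocess.run(["git", "worktree", "add", "-b", branch,
        |                       path], check=True)
        |       subprocess.Popen(["agent", "run", "--cwd", path, prompt])
        | 
        |   # once an agent reports done, pull its branch into main for
        |   # testing:  git checkout main && git merge fix-auth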
        
         | bradly wrote:
         | Would this be similar to how Charlie and Jules work?
        
         | cube2222 wrote:
          | I described how I achieve roughly your desired workflow in a
          | comment yesterday [0].
         | 
         | [0]: https://news.ycombinator.com/item?id=45970668
        
           | agentifysh wrote:
            | Ha! Very interesting how slept-on jj is.
            | 
            | It's been essential to my workflow as well.
            | 
            | I use both jj and git, and jj is great for just creating a
            | snapshot that I can revert to in case it fails.
            | 
            | I'm still exploring it to see what else I can do with it
            | for agentic use.
        
         | agentifysh wrote:
          | Lots of tools do this; I went down this rabbit hole and ended
          | up building something that can just plug in to Codex instead
          | of requiring a fork:
          | 
          | http://github.com/agentify-sh/10x
          | 
          | It adds minimal overhead for agent orchestration (it's just
          | bash/TypeScript). Its main focus is enhancements to Codex:
          | double-redundant checkpoints via git and jj (lessons learned
          | from Codex being "git reset --hard"-happy), something like
          | Claude skills (just a bunch of .md files that steer it toward
          | a specific activity like think, plan, execute), timeout
          | wrappers (to get you unstuck if Codex stalls for a long
          | time), and blacklisted commands during yolo mode (rm -rf and
          | git reset are banned even if it runs them by some small
          | chance). MIT licensed.
          | 
          | You can work sequentially (subagents launch one after the
          | other) or in parallel (worktrees), but tbh sequential is
          | better because you understand what is going on; parallel
          | might be best for dealing with tests and UI.
        
           | poly2it wrote:
           | Your link is a 404.
        
         | lysecret wrote:
         | Cursor has this too
        
         | rane wrote:
         | tmux users might find this useful:
         | https://github.com/raine/workmux
        
       | agentifysh wrote:
        | So this was "arctic fox", it seems. A lot of us ended up
        | downgrading to Codex 5.0 because the token burn was too much. I
        | see Codex-Max is a step up, which is welcome, but I'm still
        | unsure whether they solved that GitHub issue around tool use
        | that impacts tokens.
        | 
        | Going to wait and see before I upgrade back to 0.58, after
        | being burned by 5.1.
        | 
        | Gemini 3 has been a letdown, tbh; seeing that agentic coding
        | wasn't a top priority, I'm sticking with Codex for now and
        | using Gemini 3 for frontend.
        
         | GenerWork wrote:
         | Have you found that Gemini is better than Codex for front end
         | generation? I'm trying to bring some Figma screens into a small
         | React project I have, and Codex will occasionally screw up the
         | implementation despite the fact that I'm using the MCP server.
        
       | jasonthorsness wrote:
       | "Starting today, GPT-5.1-Codex-Max will replace GPT-5.1-Codex as
       | the default model in Codex surfaces."
       | 
       | Wow, I spent last weekend using a tag-team of Claude and Codex
       | and found Codex to more often get better results (TypeScript
       | physics/graphics application). I probably only wrote a few
       | hundred lines of code out of many thousands; it did a really good
       | job.
       | 
       | Now I guess I'll ask the new Codex to review the work of the old!
        
       | taurath wrote:
       | These 2 sentences right next to each other stood out to me:
       | 
       | > a new step towards becoming a reliable coding partner
       | 
       | > GPT-5.1-Codex-Max is built for long-running, detailed work
       | 
       | Does this not sound contradictory? It's been the shorter form
       | work that has built what little confidence I have in these as a
       | coding partner - a model that goes off and does work without
       | supervision is not a partner to me.
        
         | causal wrote:
         | Absolutely contradictory. The long-running tendency for Codex
         | is why I cannot understand the hype around it: if you bother to
         | watch what it does and read its code the approaches it takes
         | are absolutely horrifying. It would rather rewrite a TLS
         | library from scratch than bother to ask you if the network is
         | available.
        
           | keeganpoppen wrote:
            | These things are actually fixable with prompting. Is it
            | easy? No. Is it PEBKAC if you don't do anything to change
            | course as it builds a TLS library? Yes, but paperclip
            | maximized! xD
        
             | causal wrote:
              | Or you can have a model with some semblance of common sense
              | that will stop and say "Hey, can I have access to the
              | network to do X?"
             | 
             | Codex feels like a tool designed to run after all the
             | humans are gone.
        
           | meowface wrote:
           | >It would rather rewrite a TLS library from scratch than
           | bother to ask you if the network is available.
           | 
           | This is definitely one of the biggest issues with coding
           | agents at the moment.
           | 
           | That said, from my experience, Codex so often does things
           | that are so useful and save me so much time that the
           | occasional "oh god what the hell did it just go off and do"
           | are an acceptable cost for me.
           | 
           | I regularly get great results with open-ended prompts and
           | agents that spend 15+ minutes working on the task. I'm sure
           | they'll eventually get better at common sense understanding
           | of what kind of work is wasteful/absurd.
        
         | ntonozzi wrote:
         | If you haven't, give Cursor's Composer model a shot. It might
         | not be quite as good as the top models, but in my experience
         | it's almost as good, and the lightning fast feedback is more
         | than worth the tradeoff. You can give it a task, wait ten
         | seconds, and evaluate the results. It's quite common for it to
         | not be good enough, but no worse than Sonnet, and if it doesn't
         | work you just wasted 30 seconds instead of 10 minutes.
        
         | embirico wrote:
          | (Disclaimer: I am on the Codex team.) We're basically trying
          | to build a teammate that can do short, iterative work with
          | you, and then, as you build trust (and configuration), let
          | you delegate longer tasks to it.
          | 
          | The "# of model-generated tokens per response" chart in the
          | blog introducing gpt-5-codex
          | (https://openai.com/index/introducing-upgrades-to-codex/)
          | shows an example of how we're making the model good at both.
        
       | simianwords wrote:
       | > Compaction enables GPT-5.1-Codex-Max to complete tasks that
       | would have previously failed due to context-window limits, such
       | as complex refactors and long-running agent loops by pruning its
       | history while preserving the most important context over long
       | horizons. In Codex applications, GPT-5.1-Codex-Max automatically
       | compacts its session when it approaches its context window limit,
       | giving it a fresh context window. It repeats this process until
       | the task is completed.
       | 
       | Wouldn't the model automatically do that using attention
       | techniques? Why do you need to do it at the token layer and not
       | leave it to the model to automatically decide which tokens are
       | worth paying attention to?
        
         | qsort wrote:
         | > due to context-window limits
        
           | simianwords wrote:
           | context window is not some physical barrier but rather the
           | attention just getting saturated. what did i get wrong here?
        
             | qsort wrote:
             | > what did i get wrong here?
             | 
             | You don't know how an LLM works and you are operating on
             | flawed anthropomorphic metaphors.
             | 
                | Ask a frontier LLM what a context window is; it will
                | tell you.
        
               | Palmik wrote:
               | It's a fair question, even if it might be coming from a
               | place of misunderstanding.
               | 
               | For example, DeepSeek 3.2, which employs sparse attention
               | [1], is not only faster with long context than normal
               | 3.1, but also seems to be better (perhaps thanks to
               | reducing the noise?).
               | 
               | [1] It uses still quadratic router, but it's small, so it
               | scales well in practice. https://api-
               | docs.deepseek.com/news/news250929
        
               | ed wrote:
               | Parent is likely thinking of sparse attention which
               | allows a significantly longer context to fit in memory
        
               | qsort wrote:
               | My comment was harsher than it needed to be and I'm
               | sorry, I think I should have gotten my point across in a
               | better way.
               | 
               | With that out of the way, parent was wondering why
               | compaction is necessary arguing that "context window is
               | not some physical barrier but rather the attention just
               | getting saturated". We're trying to explain that 3+2=2+3
               | and you people are sitting in the back going "well,
               | actually, not all groups are abelian".
        
             | paradite wrote:
              | In theory, auto-regressive models should not have a limit
              | on context: they generate the next token conditioned on
              | all previous tokens.
              | 
              | In practice, when training a model, people select a
              | context window so that during inference you know how much
              | GPU memory to allocate for a prompt, and can reject any
              | prompt that exceeds the memory limit.
              | 
              | Of course there's also degrading performance as context
              | gets longer, but I suspect the memory limit is the
              | primary reason we have context window limits.
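              | 
              | To put rough numbers on the memory point (a sketch; the
              | figures are illustrative, not any specific model's):
              | 
              |   layers, kv_heads, head_dim = 80, 8, 128
              |   ctx = 400_000     # tokens in the context window
              |   kv = 2            # one K and one V per token per layer
              |   bytes_fp16 = 2
              | 
              |   total = layers * kv_heads * head_dim * ctx * kv * bytes_fp16
              |   print(f"{total / 2**30:.0f} GiB")  # ~122 GiB
              | 
              | And that's the KV cache alone, for a single sequence,
              | which is why serving stacks pick a hard cutoff up front.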
        
             | kenjackson wrote:
             | I think attention literally doesn't see anything beyond the
             | context window. Even within the context window you may
             | start to see attentional issues, but that's a different
             | problem.
        
         | adastra22 wrote:
         | Attention is quadratic, so you have to pick a cutoff for
         | context window size. In addition, the error/noise in state
         | space increases with longer contexts, resulting in poorer
         | performance. So even if you're willing to take the O(n^2)
         | slowdown of a larger context window, it still won't work.
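          | 
          | A toy version of where the n^2 comes from:
          | 
          |   import numpy as np
          | 
          |   def attention(Q, K, V):
          |       # scores is (n, n): every token attends to every token,
          |       # so compute and memory grow quadratically in n
          |       scores = Q @ K.T / np.sqrt(Q.shape[-1])
          |       w = np.exp(scores - scores.max(axis=-1, keepdims=True))
          |       return (w / w.sum(axis=-1, keepdims=True)) @ V
          | 
          |   n, d = 4096, 64
          |   Q = K = V = np.random.randn(n, d).astype(np.float32)
          |   out = attention(Q, K, V)  # doubling n quadruples the
          |                             # score matrix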
        
           | fancy_pantser wrote:
           | > Attention is quadratic
           | 
            | Exactly. Standard multi-head attention materializes a score
            | matrix that grows to ~4B entries for a 64K sequence as a
            | starting place. FlashAttention v2 helps slightly, but as
            | you grow to 128K context length, you still need over 1TB/s
            | of memory bandwidth to stay compute-bound in practice, even
            | with this optimization.
           | 
           | So there has been a lot of research in this area and model
           | architectures released this year are showing some promising
           | improvements. Sliding windows lose context fidelity and if
           | you go fully linear, you sacrifice math, logic, and long
           | multi-turn (agentic) capabilities, so everyone is searching
           | for a good alternative compromise.
           | 
           | MiniMax-M1 had lightning attention to scale up to 1M context
           | lengths. It's "I/O aware" via tiling and calculates attention
           | two ways block-wise (intra-block traditional attention and
           | inter-block linear attention), thereby avoiding the speed-
           | inhibiting cumulative summation.
           | 
           | DeepSeek V3.2 uses DeepSeek Sparse Attention (DSA), which is
           | sub-linear by only computing "interesting" pairs. For
           | example, in 128K context lengths this requires only 10-20% of
           | attention pairs to be materialized.
           | 
           | Both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which
           | is borrowed from Mamba2. In Qwen3-Next it alternates three
           | Gated DeltaNet (linear attention) layers for every one gated
           | [full] attention. The speedup is from a delta rule, which
           | basically amounts to caching in a hand-wavy way.
           | 
           | There's no universally-adopted solution yet, as these are all
           | pretty heavy-duty compromises, but the search is going strong
           | right now for linear or better attention mechanisms that
           | still perform well.
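            | 
            | To make the simplest of those compromises concrete, here is
            | a minimal sliding-window mask sketch (toy code, not any of
            | the exact architectures above):
            | 
            |   import numpy as np
            | 
            |   # token i attends only to the last w tokens, so cost
            |   # falls from O(n^2) to O(n*w), at the price of
            |   # long-range fidelity
            |   n, w = 8, 3
            |   mask = np.zeros((n, n), dtype=bool)
            |   for i in range(n):
            |       mask[i, max(0, i - w + 1):i + 1] = True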
        
       | hansonw wrote:
       | Rest assured that we are better at training models than naming
       | them ;D
       | 
       | - New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on
       | SWE-Lancer, and 58.1% on TerminalBench 2.0
       | 
        | - Natively trained to work for many hours across multiple
        | context windows via compaction
       | 
       | - 30% more token-efficient at the same reasoning level across
       | many tasks
       | 
       | Let us know what you think!
        
         | agentifysh wrote:
          | Did you address this?
          | https://github.com/openai/codex/issues/6426
          | 
          | How much more token-efficient is this compared to 5.0?
          | 
          | I had to use 5.0 because 5.1 was eating tokens like crazy and
          | seemed like a slight incremental improvement, barely
          | noticeable.
        
         | EnPissant wrote:
         | Compaction is just what Claude Code has done forever, right?
        
           | enraged_camel wrote:
           | I am also trying to understand the difference between
           | compaction, and what IDEs like Cursor do when they
           | "summarize" context over long-running conversations.
           | 
           | Is this saying that said summarization now happens at the
           | model level? Or are there other differences?
        
           | GardenLetter27 wrote:
           | I think the point here is not that it does compaction (which
           | Codex also already does) - but that the model was trained
           | with examples of the Codex compaction, so it should perform
            | better when compaction has taken place (a common source of
            | performance drops for earlier models).
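            | 
            | Conceptually the loop is something like this (hand-wavy
            | sketch; the llm object and its methods are stand-ins for
            | whatever the model/harness actually does):
            | 
            |   LIMIT = 400_000
            | 
            |   def run(task, llm):
            |       history = [task]
            |       while not llm.done(history):
            |           if llm.count_tokens(history) > 0.9 * LIMIT:
            |               # fresh window: a summary plus the most
            |               # recent turns replaces the full transcript
            |               summary = llm.summarize(history[:-5])
            |               history = [summary] + history[-5:]
            |           history.append(llm.step(history))
            |       return history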
        
             | EnPissant wrote:
             | Codex previously did only manual compaction, but yeah,
             | maybe some extra training for compaction, too?
        
         | iyn wrote:
         | Looks like a great change! I'll take it for a spin in a moment.
         | 
         | I really like the "subagent" feature in Claude Code -- it's
         | super useful to manage context in complex codebases. Here are
         | some examples of agents that can be useful:
         | https://github.com/humanlayer/humanlayer/tree/main/.claude/a...
         | 
         | Would it make sense to have a similar feature in Codex CLI? I
         | often do "spec-driven development", which is basically a loop
         | of:                   research -> implementation plan -> actual
         | implementation (based on research + plan) -> validation
         | 
         | I have multiple subagents that I use for each phase that (based
         | on subjective judgement) improve the output quality (vs keeping
         | everything, every tool use etc. in the "main" context window).
         | 
         | Codex CLI is great and I use it often but I'd like to have more
         | of these convenient features for managing context from CC. I'm
         | super happy that compaction is now available, hopefully we'll
         | get more features for managing context.
        
         | NitpickLawyer wrote:
         | Will -minis come for the codex family of models? About two
         | months ago I used 5-mini as a daily driver for a few weeks and
         | quite liked it, it seemed capable enough on small tasks with
         | some hand holding and the speed/price were great as well.
        
           | coder543 wrote:
           | codex-mini was released a couple of weeks ago:
           | https://platform.openai.com/docs/models/gpt-5.1-codex-mini
        
             | NitpickLawyer wrote:
             | Thanks! I somehow missed that. Will check it out.
        
         | qsort wrote:
         | Codex is an outstanding product and incremental upgrades are
         | always welcome. I'll make sure to give it a try in the coming
         | days. Great work! :)
        
         | robotswantdata wrote:
          | Sorry, I don't like the Max model; it feels like it needs a
          | lot more guiding. The plans it writes, however, are better,
          | so I tried feeding them back in (meta-prompt style) and it's
          | working okay so far. Very large repository.
        
         | andai wrote:
          | So the context window is still 400k, but the model got good
          | at removing irrelevant context?
        
         | carbocation wrote:
         | It would be great to have access to this model via the chat
         | interface, even if it was gated behind the "other models"
         | dropdown or something.
        
         | sinatra wrote:
         | I currently use GPT-5.1-Codex High and have a workflow that
         | works well with the 5-hour/weekly limits, credits, et al. If I
         | use GPT-5.1-Codex-Max Medium or GPT-5.1-Codex-Max High, how
         | will that compare cost / credits / limits wise to GPT-5.1-Codex
         | High? I don't think that's clear. "Reduced tokens" makes me
         | think it'll be priced similarly / lower. But, "Max" makes me
         | think it'll be priced higher.
        
         | blks wrote:
         | I think your company will fail soon.
        
           | meowface wrote:
           | I would bet a lot of money it will not.
        
         | SoKamil wrote:
         | > Natively trained
         | 
         | What does it even mean?
        
       | causal wrote:
       | Sigh. Time to try it again I guess. I give OpenAI way more
       | chances than it deserves.
        
       | EcommerceFlow wrote:
        | Gemini 3 had a great 24-hour SOTA run for coding.
        
         | CuriouslyC wrote:
         | Gemini is still the best oracle/planner by a mile. It's just a
         | bad agent. Give it a bundle of your repo and get it to plan
         | your changes, then hand it off to codex to implement.
        
       | croes wrote:
       | The new detergent now washes even whiter
        
         | bgwalter wrote:
         | Come on folks, this is funny. They also have industrial
         | strength laundromats to go with the detergent.
        
         | pton_xd wrote:
         | I love how programming discussions du jour have basically
         | devolved into "really? my socks definitely smell better after
         | using 2 scoops of last month's soap. what spin cycle are you
         | using?"
        
       | SunshineTheCat wrote:
        | My observation has been that Codex tends to hit logical/data-
        | driven/back-end tasks out of the park while doing weird, random
        | nonsense with even simple UI tasks. This could be me needing to
        | improve how I phrase my prompts, but it will be interesting to
        | see if it's improved in that arena at all.
        
       | cube2222 wrote:
       | Somewhat related, after seeing the praise for codex in the Sonnet
       | 4.5 release thread I gave it a go, and I must say, that CLI is
       | much worse than Claude Code (even if the model is great, I'm not
       | sure where the issue really lies between the two).
       | 
       | It was extremely slow (like, multiple times slower than Sonnet
       | with Claude Code, though that's partially on me for using
       | thinking-high I guess) to finish the task, with the back-and-
       | forths being on the order of tens of minutes.
       | 
        | Moreover, the context management seems to be really weird. I'm
        | not sure how exactly it works, but: 1. It uses very few tokens
        | / fills up the context slowly (good, I guess). 2. It doesn't
        | seem to actually internalize the contents of files you mention
        | to it, or that it edits.
       | 
       | #2 here being the main one - I usually context-dump reference
       | code for Claude Code, and it does a perfect job of adhering to
       | codebase patterns and its architecture, while codex was
       | completely ignorant of the existing code style.
       | 
       | Moreover, it wrote extremely defensive code, even for code where
       | it wrote both ends itself.
       | 
       | All in all, I was really let down after seeing all the praise.
        
         | agentifysh wrote:
          | Sure, Claude Code has better UX, but honestly it's hard to
          | get any good amount of usage out of the subscriptions vs.
          | what Codex offers at the same price.
          | 
          | With Claude I'm constantly hitting rate limits, while Codex
          | gives me substantially more, and "slow" isn't really a
          | problem for me as long as it keeps working.
          | 
          | The only complaint I have is that Codex itself is more
          | usage-limited now (either due to outstanding GitHub issues
          | around tools or throttling on their end) compared to a few
          | months ago.
          | 
          | The true magical moment was Codex Pro letting me run swarms
          | of agents day in, day out without any worries about rate
          | limits; it truly felt unlimited.
          | 
          | If Claude manages to release a smaller model or some way to
          | deal with the rapidly depleting usage limits (this is the top
          | complaint on Reddit, and they eventually just stopped
          | allowing threads about it), it would definitely be used more.
          | 
          | But for now Codex is clearly the workhorse, with Claude used
          | side by side.
        
           | cube2222 wrote:
           | Well as I said, codex didn't adhere to codebase standards for
           | me and the code quality was worse (very defensive), so even
           | after waiting longer, results weren't there for me.
           | 
           | But the subscription thing is a non-issue for me as I use the
           | API, and mostly use Claude Code synchronously, with the
           | occasional rare background agent.
        
           | sumedh wrote:
           | > if claude manages to release a smaller model
           | 
           | have you tried Haiku?
        
       | tosh wrote:
       | Codex CLI 0.59 got released (but has no changelog text)
       | 
       | https://github.com/openai/codex/releases/tag/rust-v0.59.0
        
       | bgwalter wrote:
       | So they all release before the Nvidia numbers tonight. The real
       | question is: How well can Nvidia hide the circular deals in the
       | books?
        
       | amluto wrote:
       | I would love to see all the big players put 1% of the effort they
       | put into model training into making the basic process of paying
       | and signing in suck less.
       | 
       | Claude: they barely have a signin system at all. Multiple account
       | support doesn't exist. The minimum seat count for business is
       | nonsense. The data retention policies are weak.
       | 
       | OpenAI: Make ZDR a thing you can use or buy without talking to
       | sales, already. And for those using containers or a remote system
       | or really anything other than local development with the codex
       | CLI, you really really need to fix this bug. I bet Codex could do
       | at least the client part for you!
       | 
       | https://github.com/openai/codex/issues/2798
       | 
       | (Hint: Claude Code gets this right by default, despite the fact
       | that everything else about Claude sign-in is a joke.)
       | 
       | Google: get all your B2B AI product managers in one room and tell
       | them that they need to make one single product menu on one single
       | webpage with all the pricing on _that page_ and that the Google
       | Cloud people are not permitted to make anything that isn't
       | actually logically Google Cloud depend on Google Cloud Billing.
       | Your product cannot compete with OpenAI or Anthropic if people
       | need to ask an LLM to figure out what your product is and if your
       | own fancy LLMs can't give a straight answer. My company pays for
       | a non-Google product primarily because it's too complicated to
       | pay for the Google product! Right now, trying to use Google's AI
       | is like trying to ride Bay Area public transit before the Clipper
       | Card.
        
         | atonse wrote:
         | Agree 1,000%.
         | 
          | I just won't even waste my time with the Google stuff cuz I
          | can't figure out how to pay for it.
         | 
         | And that's a problem everywhere at google. Our google play
         | account is suspended cuz I can't verify the company. It won't
         | let me cuz it says I'm not the owner. I've always been the
         | owner of my company. For 18 years. There is no one else.
         | 
         | Once some error said make sure the owner email matches your
         | profile in google payments and I was like, what is google
         | payments and where do I even begin with that? I've never paid
         | for google play so what does payments have to do with anything?
         | 
         | It's totally random stuff. Get your shit together, google. Make
         | your products and payment systems coherent, rather than it
         | obviously looking like it was designed by a fiefdom full of
         | territorial managers.
        
           | nico wrote:
           | Can relate. My inactive google ads account all of a sudden
           | got banned. No explanation except some generic link to their
           | terms of service. Appealed, got automatic denial, no reason
           | given. Have retried multiple times, same result
        
             | AuryGlenz wrote:
             | Same thing happened to me. Guess who didn't start spending
             | $100 a month with them again?
             | 
             | Utterly ridiculous.
        
           | joshstrange wrote:
           | The "Owner" accounts in Google Play and Apple's App Store are
           | so freaking annoying. The only time they make sense is for
           | solo-founders and even then I've had issues. Now expand it to
           | working at a larger company and it's a joke, a bad one. Oh
           | sure, I'll just get the CEO (or other higher-up) to login and
           | accept new agreements, that will be easy. Even more fun when
           | you tell a client (who logged in exactly 1 time to set up the
            | account) that they need to use a generic email (not a
            | personal one or an employee-specific one), they ignore your
            | suggestion, and then they can't get back in because the
            | person who set up the account left the company. It's a mess.
           | 
           | Also, re "Google Payments", I tried to transfer an app from
           | my personal/solo Google Play account to a new business one I
           | set up for my LLC and it was like pulling teeth. They wanted
           | me to find some payment id from the original $20 purchase I
           | made to get access to Google Play, something I did right
           | around when they first launched and while I still have/use
           | the same email, Google came out with approximately 1 googol
           | different "payment solutions" in the interim and their
           | engineers don't care about data migrations. Finally, after
           | many support emails, they just transferred it without me
           | giving that code which just shows how silly the whole thing
           | was from the start.
        
             | tarsinge wrote:
             | I don't have experience in big tech but in the few SaaS
             | companies I've seen the issue is UX designers and Product
             | managers overwhelmingly have a B2C culture.
        
           | swivelmaster wrote:
           | > designed by a fiefdom full of territorial managers
           | 
           | What's harder than herding cats? Herding cats with MBAs and
           | OKRs.
        
           | redler wrote:
           | Conway's Law strikes again.
        
         | computerex wrote:
          | Couldn't agree more about the Google product offerings.
          | Vertex AI? AI Studio? Maker Studio? Gemini? The documentation
          | is fragmented, with redundant offerings making it confusing
          | to determine what is what. GCP billing is complicated to
          | figure out vs. OpenAI or Anthropic billing.
          | 
          | The sad part is that Google does offer a ChatML/OpenAI-
          | compliant endpoint for LLM calls, and I believe in an
          | experiment they also reduced the friction of getting an API
          | key to start making calls right away, but discoverability
          | remains a challenge with Google services.
        
           | byefruit wrote:
            | I've just found myself using OpenRouter when we need Google
            | models for a project; it's worth the extra 5% just not to
            | have to deal with the utter disaster that is their product
            | offering.
        
             | IanCal wrote:
              | FWIW I had to bail on the same thing because my results
              | were drastically different; there was something happening
              | with images through OpenRouter. Outside of that, though,
              | I'd absolutely do the same thing: their APIs are awful
              | and the billing worse. Maybe it makes sense for huge
              | orgs, but it's a nightmare on the smaller scale.
        
           | int_19h wrote:
           | > I believe they in an experiment also reduced friction in
           | getting an API key to start making calls right away
           | 
           | This part is very easy now: you sign into
           | https://aistudio.google.com/ and then click "Get API key" in
           | the lower left corner.
           | 
            | The problem is that features and docs are still scattered
            | all over. Some things can only be done via Vertex, for
            | example.
        
         | hassleblad23 wrote:
          | Adding to this: Google's models can only be used with GCP,
          | while OpenAI's models can be used with Azure and Anthropic's
          | models with AWS Bedrock, in addition to their own platforms.
          | 
          | I'd love to see the Gemini models become available from other
          | providers :) or for Google to just build a simple prepaid
          | wallet like OpenAI and Anthropic.
        
           | temp0826 wrote:
            | Didn't realize these models had such stipulations. Looking
            | at devops-y job descriptions the last few months, I noticed
            | nearly everyone has some kind of Azure requirement now
            | (which I've mostly avoided because I don't want to end up
            | managing someone's AD), but is OpenAI the actual reason for
            | it?
        
             | sethhochberg wrote:
              | We're just using GitHub Copilot as our primary entrypoint
              | for all of the model families. It's the only way we can
              | easily offer our devs some level of Claude, Gemini, and
              | Codex all in one place.
        
         | skerit wrote:
         | Last night, just after Gemini 3 was released and became
         | available for Gemini-CLI, I saw Gemini-CLI's team post that you
         | could access Gemini 3 with either an API key OR with _Gemini AI
         | Ultra_, so I thought: great, I'll get that!
         | 
         | Now you CAN NOT get the Google One stuff if your account is
         | part of a workspace. I thought: how awful. I want to pay, but I
         | simply can't?
         | 
         | Oh, but then I noticed: You CAN add a _Gemini AI Ultra_ license
         | via the Google Workspace Admin area, great!
         | 
         | Turns out: you fucking can't. That's _Google AI Ultra FOR
         | BUSINESS_ and that IS NOT supported.
         | 
         | So I had to get the Google One subscription on my personal
         | account after all.
         | 
          | Combine that with the _pathetic_ usage limits: somehow not
          | token-based, but a number of requests per 24-hour window
          | (which is 500 for Gemini 3), and with Gemini 3's incredible
          | chattiness (it uses A LOT more requests to get something done
          | compared to Claude), you hit the usage limits in just 2 hours.
        
           | timtimmy wrote:
           | Careful, their ToS makes it clear they train on your
           | Antigravity prompts (even on AI Ultra) and there is no opt-
           | out that I can find.
        
           | victor106 wrote:
           | the microsoftication of Google. Fighting evil with evil...
        
         | halifaxbeard wrote:
         | At this point I'm not convinced that Gemini 3 Pro was post-
         | trained on data Google had permission to use, going by the
         | myriad of issues on the Gemini CLI tracker around Google
         | AI/Google One/Google Cloud/Google Workspaces.
         | 
         | https://github.com/google-gemini/gemini-cli/issues/12121
         | 
          | It is far too easy to accidentally end up under the wrong
          | privacy agreement, to the point where some workplaces are
          | banning use of the Gemini CLI!
        
         | timtimmy wrote:
         | Google keeps changing their privacy and "don't train on my
         | data/code" options. When gemini-cli launched, there was a clear
         | toggle for "don't train on my code." That's now gone; it just
         | links to a generic privacy page for me. Maybe something with my
         | account changed, I can't figure it out. Deep in the Cloud
         | Gemini console, there's another setting that might control
         | training, but it's not clear what products it actually covers.
         | 
         | Trying to pay for Gemini-3 is confusing. Maybe an AI Ultra
         | personal subscription? I already pay for OpenAI and Anthropic's
         | pro/max plans and would happily pay Google too. But the only
         | obvious option is a $250/month tier, and its documentation
         | indicates Google can train on your code unless you find and
         | enable the correct opt-out. If that opt-out exists in all the
         | products, it's not obvious where it lives or what products it
         | applies to.
         | 
         | Workspace complicates it further. Google advertises that with
         | business workspace accounts your data isn't used for training.
         | So, I was going to try Antigravity on our codebase. At this
         | point I know I can't trust Google, so I read the ToS carefully.
         | They train on your prompts and source code, and there doesn't
         | appear to be a way to pay them and opt out right now. Be
         | careful, paying for Google Workspace does not protect you,
         | always read the ToS.
         | 
         | Be careful with AI-studio and your Google Workspace accounts.
         | They train on your prompts unless you switch it to API mode.
         | 
         | The result is a lot of uncertainty. I genuinely have no idea
         | how to pay Google for Gemini without risking my code being used
         | for training. And if I do pay, I can't tell whether they'll
         | train on my prompts anyway.
         | 
         | The marketing for their coding products does not clearly state
         | when they do or do not train on your prompts and code.
         | 
         | I had to run deep research to understand the risks with using
         | Gemini 3 for agentic work, and I still don't feel confident
         | that I understand the risks. I might have said some incorrect
         | things above, but I am just so confused. I feel like I have a
         | <75% grasp on the situation.
         | 
          | I don't have a lot of trust. And honestly, this feels confusing
          | and deceptive. One could easily read it as a deliberate
          | strategy to gather training data through ambiguity and dark
          | patterns; it certainly looks like this could be Google's
          | strategy to win the AI race. I assume this is just how it
          | looks, and that they aren't being evil on purpose.
         | 
         | OpenAI in particular has my trust. They get it. They are
         | carefully building the customer experience, they are product
         | and customer driven from the top.
        
           | bossyTeacher wrote:
           | >OpenAI in particular has my trust.
           | 
           | I wouldn't trust Sam Altman. Or any of the big players
           | really.
        
             | fishmicrowaver wrote:
             | > trust
             | 
             | Hahaha...HAHAhaha. HAHAHHAHAHAHAHAHAHA!!!
        
         | unreal6 wrote:
         | > Claude: they barely have a signin system at all. Multiple
         | account support doesn't exist. The minimum seat count for
         | business is nonsense. The data retention policies are weak.
         | 
         | Please give me an option for a password (or passkey) or
         | literally anything else that doesn't require either linking
         | with google or going through an email flow for every login
        
         | leetrout wrote:
         | And stop asking for phone numbers for "fraud prevention" when
         | I've already given you my name, address and credit card.
        
           | lucasban wrote:
            | The fun one for me is that I moved countries, and last I
            | checked there's still no way to change your phone number on
            | ChatGPT short of making a new account, so now my account is
            | associated with a phone number that I no longer have access
            | to and that will eventually be reassigned to someone else.
        
           | oblio wrote:
           | Can't people spoof the first two and use a stolen credit card
           | number?
        
         | brobdingnagians wrote:
         | Such great case studies of how LLM coding will make all of your
         | employees 1000x more productive at coding, design, and UX. They
         | really are leading the way showing us into the brighter future
         | of AI software /s
        
           | jiggawatts wrote:
           | Nobody claimed AIs will make office politics go away.
           | 
           | Peering into my crystal ball: once all "workers" have been
           | replaced, all humans will spend all of their working hours on
           | nothing but office politics.
        
         | gigatree wrote:
         | It seems pretty clear the moat is built at the application
         | layer, how enjoyable/easy the actual application is to use, but
         | these applications seem to be getting worse over time even as
         | the models get better. Is it really that hard to do both? Isn't
         | the point of agentic coding to do more better (not just more)?
        
         | sophiebits wrote:
         | ZDR is a risk thing for them. They want to make sure you're a
         | legitimate company and have monitoring in place on your side to
         | reduce the chance you're using them for illegal things.
        
         | fHr wrote:
         | Google listen to this man and fire 90% of your useless product
         | managers!
        
         | sumedh wrote:
        | It's the same with Cursor. As a Cursor admin I want the ability
        | to enable only specific models and disable the rest to save
        | costs, but I can't do that. It should be pretty simple, but for
        | some reason Cursor won't add that functionality to their admin
        | tools.
        
       | kytazo wrote:
       | 500 Internal Server Error.
        
         | morog wrote:
         | ditto. Also OpenAI vector stores are down right now across the
         | board
        
       | nakamoto_damacy wrote:
       | It's good but Gemini 3 beats it.
        
       | syntaxing wrote:
        | I rarely used Codex compared to Claude because it was extremely
        | slow in GitHub Copilot, like maybe 2-5x slower than Claude
        | Sonnet. I really wish they made their models faster rather than
        | "better".
        
         | nartho wrote:
          | Have you tried Mistral? Definitely one of the fastest models.
        
           | syntaxing wrote:
           | My employer doesn't offer/allow anything besides the
           | "traditional" offerings on GitHub copilot.
        
         | levocardia wrote:
          | Very interesting to see the range of people's preferences. I
          | would almost always prefer smart over fast; I set all my LLMs
          | to be all-thinking-all-the-time.
        
           | syntaxing wrote:
            | It's a balance. I haven't felt like Codex provided anything
            | that Sonnet 4.5 didn't, so why wait longer to get the same
            | results?
           | 
           | Though that does bring up an interesting point. Anecdotally,
           | Sonnet does a lot more grep-ing while Codex reads files
           | straight up. Might be the difference in speed and maybe
           | smarter models will do better. Once this model is on copilot,
           | I can test it out.
        
           | mrguyorama wrote:
           | GPT-5 was recently updated to make it more "thinking" and
           | "warmer" or whatever and now a task (semantically compare
           | these two short files) that used to take 5 seconds and
           | reliably produce useful and _consistent_ output now takes _90
            | seconds_ to "think" (while its thinking output makes it
            | pretty clear there is zero thinking happening) and produces a
            | _completely differently structured output_ every single time,
           | making the tool not only slower and more expensive to use,
           | but worse at a simple task that LLMs should be very good at.
           | 
           | There's an option to "get a quick answer" and I hoped
           | clicking that would revert to previous performance and
           | instead what it does is _ignore that I uploaded two files and
           | asks me to upload the files_
           | 
            | Literally the only really good task I've found for these
            | dumb things, and they _still_ found a way to fuck it up,
            | because they need to keep the weirdos and whales addicted.
            | It's now almost easier to go back to comparing these files
            | by eye, or to just bite the bullet and finally write a few
            | lines of Python to actually do it right and reliably.
        
         | jasonsb wrote:
         | OpenAI doesn't want you to use their models outside of their
         | own products, which is why the API and integrations like Github
         | Copilot are super slow.
        
           | sumedh wrote:
            | That doesn't make business sense, though. If people want to
            | use OpenAI models in Copilot and other tools and the models
            | don't perform, they'll just switch to another model and not
            | come back; they're not going to use Codex.
        
       | andai wrote:
       | Sizeable if veracious!
        
       | the__alchemist wrote:
       | This is a tangent: Has anyone noticed that GPT-5.0 at some point
       | started producing much faster, crappier answers, then 5.1 made it
       | slower + better again? (Both in _Thinking_ mode)
        
         | wincy wrote:
         | I did notice that, I thought maybe I'd exceeded my thinking
         | requests
        
       | Narciss wrote:
       | Here we go again....
        
       | johnfn wrote:
       | I've been using a lot of Claude and Codex recently.
       | 
       | One huge difference I notice between Codex and Claude code is
       | that, while Claude basically disregards your instructions
       | (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly
        | persistent in following every last character of them - to the
        | point that I've seen it work for 30 minutes to convolute some
        | solution that was only convoluted because of some sentence I
        | threw in the instructions and had completely forgotten about.
       | 
       | I imagine Codex as the "literal genie" - it'll give you exactly
       | what you asked for. EXACTLY. If you ask Claude to fix a test that
       | accidentally says assert(1 + 1 === 3), it'll say "this is clearly
       | a typo" and just rewrite the test. Codex will rewrite the entire
       | V8 engine to break arithmetic.
       | 
       | Both these tools have their uses, and I don't think one approach
       | is universally better. Because Claude just hacks its way to a
        | solution, it is really fast, so I like using it for iterative
        | web work, where I need to tweak some styles and I need a fast
        | iterative loop. Codex is much worse at that because it takes
        | like 5 minutes to validate everything is correct. Codex is much
        | better for longer, harder tasks that have to be correct -- I
        | can just write some script to verify that what it did works,
        | and let it spin for 30-40 minutes.
        
         | nico wrote:
         | > Claude basically disregards your instructions (CLAUDE.md)
         | entirely
         | 
         | A friend of mine tells Claude to always address him as "Mr
         | Tinkleberry", he says he can tell when Claude is not paying
         | attention to the instructions on CLAUDE.md when Claude stops
         | calling him "Mr Tinkleberry" consistently
        
           | benzible wrote:
           | Yep, it's David Lee Roth's brown M&M trick
           | https://www.smithsonianmag.com/arts-culture/why-did-van-
           | hale...
        
           | awad wrote:
           | Highly recommend adding some kind of canary like this in all
           | LLM project instructions. I prefer my instructions to say
            | 'always start output with a (uniquely decided by you) emoji'
           | as it's easier to visually scan for one when reading a wall
           | of LLM output, and use a different emoji per project because
           | what's life without a little whim?
        
           | leobg wrote:
            | We used to do that on Upwork, back in the days when one
            | still hired human coders. If your application didn't say
            | "rowboat" in the first sentence, we knew you had just
            | copy/pasted and hadn't actually read the job description.
            | Feels like a lifetime ago.
        
         | hadlock wrote:
         | I've been really impressed with codex so far. I have been
         | working on a flight simulator hobby project for the last 6
         | months and finally came to the conclusion that I need to switch
         | from floating origin, which my physics engine assumes with the
         | coordinate system it uses, to a true ECEF coordinate system
         | (what underpins GPS). This involved a major rewrite of the
         | coordinate system, the physics engine, even the graphics system
          | and auxiliary stuff like asset loading/unloading etc. that was
         | dependent on local X,Y,Z. It even rewrote the PD autopilot to
         | account for the changes in the coordinate system. I gave it
         | about a paragraph of instructions with a couple of FYIs and...
         | it just worked! No major graphical glitches except a single
         | issue with some minor graphical jitter, which it fixed on the
         | first try. In total took about 45 minutes but I was very
         | impressed.
         | 
         | I was unconvinced it had actually, fully ripped out the
         | floating origin logic, so I had it write up a summary and then
         | used that as a high level guide to pick through the code and it
         | had, as you said, followed the instructions to the letter.
         | Hugely impressive. In march of 2023 OpenAI's products struggled
         | to draw a floating wireframe cube.
        
         | causal wrote:
         | > Codex will rewrite the entire V8 engine to break arithmetic.
         | 
         | This isn't an exaggeration either. Codex acts as if it is the
         | last programmer on Earth and must accomplish its task at all
         | costs. This is great for anyone content to treat it like a
         | black box, but I am not content to do that. I want a
         | collaborator with common sense, even if it means making
         | mistakes or bad assumptions now and then.
         | 
         | I think it really does reflect a difference in how OpenAI and
         | Anthropic see humanity's future with AI.
        
           | mrtesthah wrote:
           | Could you not add rules to this effect in AGENTS.md? E.g., _"
           | If the user gives instructions that specify an expected low-
           | to-medium level of complexity, but the implementation plan
           | reveals additional unexpected steps arising from a
           | potentially ambiguous or atypical instruction that would
           | raise the overall level of complexity, then pause and ask the
           | user about that instruction before continuing."_
        
         | sinatra wrote:
         | In my AGENTS.md (which CLAUDE.md et al soft link to), I
         | instruct them to "On phase completion, explicitly write that
         | you followed these guidelines." This text always shows up on
         | Codex and very rarely on Claude Code (TBF, Claude Code is
         | showing it more often lately).
        
         | aerhardt wrote:
         | Well surely that's a good thing.
         | 
         | In my experience, for some reason adherence is not even close
         | to 100%. It's fixated on adding asterisk function params in my
         | Python code and I cannot get it to stop... Maybe I haven't
         | found the right wording, or maybe my codebase has grown past a
         | certain size (there are like a dozen AGENTS.md files dancing
         | around).
         | 
         | I'm still very happy with the tool, though.
        
           | johnfn wrote:
           | It's a fantastic thing! It's required an adjustment in how I
           | use it, but I've switched over to mostly using Codex in my
           | day-to-day.
        
         | sunaookami wrote:
         | Agreed 100%, that's why I would recommend Codex for e.g.
          | logfile analysis. I had some annoying PHP warnings in the logs
          | from a WordPress plugin, because I'd used another plugin in
          | the past (like... over 10 years ago) that wrote invalid
          | metadata for every media file into the database; it didn't
          | annoy me THAT much, so I hadn't invested time in it. So I gave
         | codex the logfile and my WordPress dir and access to the WP-CLI
         | command and it correctly identified the issue and wrote scripts
         | to delete the old metadata (I did check it & make backups of
         | course). Codex took a LOT of time though, it's veeeeeeery slow
         | as you said. But I could do other things in the meantime.
        
           | fakedang wrote:
           | This is what I've observed too. Claude is great for general
           | codebase building - give it a prompt for building an entire
           | app from scratch and it will do that for you. Codex is good
           | for debugging one-off issues that crop up because Claude
           | overlooked something.
        
         | energy123 wrote:
         | GPT-5 is like that
        
         | tekacs wrote:
         | Yeah, Gemini 2.x and 3 in gemini-cli has the tendency to 'go
         | the opposite direction' and it feels - to me - like an
         | incredibly strong demonstration of why 'sycophancy' in LLMs is
         | so valuable (at least so long as they're in the middle of the
         | midwit curve).
         | 
         | I'll give Gemini direction, it'll research... start trying to
         | solve it as I've told it to... and then exclaim, "Oh! It turns
         | out that <X> isn't what <user> thought!" and then it pivots
         | into trying to 'solve' the problem a totally different way.
         | 
         | The issue however... is that it's:
         | 
         | 1) Often no longer solving the problem that I actually wanted
         | to solve. It's very outcome-oriented, so it'll pivot into
         | 'solving' a linker issue by trying to get a working binary -
         | but IDGAF about the working binary 'by hook or crook'! I'm
         | trying to fix the damn linker issue!
         | 
         | 2) Just... wrong. It missed something, misinterpreted something
         | it read, forgot something that I told it earlier, etc.
         | 
         | So... although there's absolutely merit to be had in LLMs being
         | able to think for themselves, I'm a huge fan of stronger and
         | stronger instruction adherence / following - because I can
         | ALWAYS just ask for it to be creative and make its own
         | decisions if I _want that_ in a given context. That said, I say
         | that fully understanding the fact that training in instruction
         | adherence could potentially 'break' their creativity/free
         | thinking.
         | 
         | Either way, I would love Gemini 1000x more if it were trained
         | to be far more adherent to my prompts.
        
           | tekacs wrote:
           | Immediately rebutting myself: a major caveat to this that I'm
           | discovering with Gemini is that... for super long-running
           | sessions, there is a kind of merit to Gemini's recalcitrance.
           | 
           | When it's running for a while, Gemini's willingness to go
           | totally off-piste and its outcome-orientedness _do_ result
           | in sessions where I left it to do its thing and... came
           | back to a working solution, in situations where Codex or
           | others wouldn't have gotten there.
           | 
           | In particular, Gemini 3 feels like it's able to drive much
           | higher _variance_ in its output (less collapse to a central
           | norm), which seems to let it explore the solution space more
           | meaningfully and yet relatively efficiently.
        
           | buu700 wrote:
           | I haven't had that particular experience with Gemini 2.5, but
           | did run into it during one of my first few uses of Gemini 3
           | yesterday.
           | 
           | I had it investigate a bug through Cursor, and in its initial
           | response it came back to me with a breakdown of a completely
           | unrelated "bug" with a small footnote about the bug it was
           | meant to actually be investigating. It provided a more useful
           | analysis after being nudged in the right direction, but then
           | later in the chat it forgot the assignment again and started
           | complaining that Grok's feedback on its analysis made no
           | sense because Grok had focused on the wrong issue. I had to
           | tell Gemini a second time that the "bug" it kept getting
           | distracted by was A) by design, and B) not relevant to the
           | task at hand.
           | 
           | Ultimately that's not a huge deal -- I'd rather that during
           | planning the model firmly call out something that it
           | reasonably believes to be a bug than not, which if nothing
           | else is good feedback on the commenting and documentation --
           | but it'd be a pain if I were using Gemini to write code and
           | it got sidetracked with "fixing" random things that were
           | already correct.
        
         | bugglebeetle wrote:
         | The solution to this if you want less specification in advance
         | is to simply ask Codex a series of leading questions about a
         | feature or fix. I typically start with something like "it seems
         | like X could be improved with the addition of Y? Can you review
         | the relevant parts of the codebase in a, b, and c to assess?"
         | It will then do so and come back with a set of suggestions that
         | follow this guidance, which you can revise and selectively tell
         | it to implement. In my experience, this fills the context with
         | the appropriate details to then let it make more of its own
         | decisions in a generally correct way without as much
         | handholding.
        
       | wilg wrote:
       | I have been using GPT 5 High Fast in Cursor primarily over Codex,
       | because Codex seems to take way longer and generally annoy me by
       | doing strange CLI stuff, but hopefully I can switch to this new
       | one. I also tried it against Gemini 3 Pro in Cursor and it's hard
       | to tell but at least in some cases I felt like GPT5 was giving
       | better results.
        
       | LZ_Khan wrote:
       | Woah, metr results look impressive. Still looking exponential
        
       | tunesmith wrote:
       | I've been dealing with Codex CLI for a while and I love it, but
       | I'm wondering if my thinking is just limited. While I'm starting
       | discussions and creating plan docs, I've never been able to ask
       | it to do anything that takes it longer than 25 minutes or so.
       | Usually far less. I'm having trouble imagining what I can ask it
       | to do that would make it take hours - like, wouldn't that require
       | putting together an absolutely massive planning doc that would
       | take hours to put together anyway? I'd rather just move
       | incrementally.
        
         | GenerWork wrote:
         | Perhaps they're combining an incredibly complex product that
         | has a lot of interactive features, a big codebase, test
         | creation, and maybe throwing some MCP stuff in there such as
         | creating a ticket in Jira if a test fails?
        
         | CuriouslyC wrote:
         | Easy way to get an agent to run a long time is just to get it
         | to babysit CI/CD, tell it to iterate on it until it passes. I
         | got Sonnet 4 to run for >6 hours that way.
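         | 
         | The prompt doesn't need to be fancy; something along the
         | lines of "push the branch, watch the run with `gh run
         | watch`, read the logs of any failing job, fix the cause,
         | push again, and repeat until the pipeline is green" will
         | keep it busy for as long as CI stays red.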
        
         | aerhardt wrote:
         | The idea of giving it a task that may take six hours and
         | reviewing it also gives me shivers.
         | 
         | I'm a very happy Codex customer, but everything turns to
         | disgusting slop if I don't provide:
         | 
         | (1) Up-to-date AGENTS.md and an excellent prompt
         | 
         | (2) A full file-level API with function signatures, return
         | types, and function-level guidance if it's a complex one (see
         | the sketch below)
         | 
         | (3) Multiple rounds of feedback until the result is finely
         | sculpted
         | 
         | Overall it's very small units of work - one file or two, tops.
         | 
         | I've been letting the above standards go for the last couple of
         | weeks due to crunch and looking at some of the hotspots of slop
         | now lying around has me going all Homelander-face [1] at the
         | sight of them.
         | 
         | Those hotspots are a few hundred lines in the worst cases; I'm
         | definitely not ready to deal with the fallout of any unit of
         | work that takes even more than 20min.
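         | 
         | To illustrate (2), the "file-level API" I hand it is a stub
         | file like this (hypothetical names; Python here, but the idea
         | is language-agnostic):
         | 
         |     # invoice_totals.py - implement exactly this surface.
         |     from decimal import Decimal
         | 
         |     def line_total(qty: int, unit_price: Decimal) -> Decimal:
         |         """qty * unit_price, rounded to 2 decimal places."""
         |         ...
         | 
         |     def invoice_total(lines: list[tuple[int, Decimal]]) -> Decimal:
         |         """Sum of line_total over lines; an empty list gives 0."""
         |         ...
         | 
         | Codex fills in the bodies; anything outside the stub is out
         | of scope by default.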
         | 
         | [1] https://i.kym-
         | cdn.com/entries/icons/original/000/050/702/ab7...
        
       | 999900000999 wrote:
       | I really would prefer them to start creating customized models.
       | 
       | I've vibe coded Godot games extensively.
       | 
       | Just about every model I've tried likes to invent imaginary
       | functions.
       | 
       | I would really prefer for there to be a way for me to pick a
       | model trained on whatever framework I need.
       | 
       | Reviewing AI generated code feels like editing a long book, and
       | every now and then you notice some words are just completely made
       | up. You then ask the AI to fix its book, and it will just add
       | more AI generated words.
       | 
       | On one hand I want this to be a reality check to everyone who's
       | trying to lay off real software engineers to replace us with AI.
       | 
       | On the other hand half of the stock market is held up by
       | overhyped AI valuations. If the tide goes out too fast, and there
       | is a mass realization that this stuff just isn't as good as it's
       | hyped to be, it's not going to be fun for anyone.
        
         | andai wrote:
         | I had this problem 2 years ago. All the models were telling
         | me to use libraries that hadn't been invented yet.
         | 
         | That was annoying back then, but these days that's not so much
         | of a problem.
         | 
         | You can write your program and then simply have it invent the
         | library as well, while it's at it! ;)
        
           | razodactyl wrote:
           | These days not so much of a problem because the libraries now
           | exist? Haha
        
         | Atotalnoob wrote:
         | I've found writing an MCP server with access to the docs
         | cloned locally does wonders.
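         | 
         | A minimal version using the official MCP Python SDK is only
         | a few lines (a sketch; DOCS_DIR and the tool name are my
         | own):
         | 
         |     from pathlib import Path
         |     from mcp.server.fastmcp import FastMCP
         | 
         |     DOCS_DIR = Path("./docs")  # local clone of the library docs
         |     mcp = FastMCP("local-docs")
         | 
         |     @mcp.tool()
         |     def search_docs(query: str) -> str:
         |         """Return doc files (plus a snippet) that mention query."""
         |         hits = []
         |         for f in DOCS_DIR.rglob("*.md"):
         |             text = f.read_text(errors="ignore")
         |             i = text.lower().find(query.lower())
         |             if i != -1:
         |                 hits.append(f"{f}: ...{text[i:i + 200]}...")
         |         return "\n\n".join(hits[:10]) or "no matches"
         | 
         |     if __name__ == "__main__":
         |         mcp.run()  # stdio transport by default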
        
           | epolanski wrote:
           | I don't know, context is still an issue if you have lots
           | of docs, in my experience.
        
         | GaggiX wrote:
         | Add the documentation to the context window in that case, a bit
         | of context engineering.
        
         | Narciss wrote:
         | Context7 might be good for you
        
       | spectraldrift wrote:
       | Weird how they only share three hand-picked evals, ignoring the
       | evals where they were left in the dust, like ARC-AGI-2. This
       | post is so misleading, I don't even know whether to trust the
       | numbers they _did_ share. One is just a fraction of a
       | percentage point away from Gemini 3 Pro, which is awfully
       | convenient for marketing and easy to hide. Very open, OpenAI.
        
         | XenophileJKO wrote:
         | Not really that weird. This isn't intended to be a "general"
         | model. This is a coding model so they showed the coding evals.
         | The assumption would be that, relative to GPT-5.1, non-coding
         | evals would likely regress or stay similar.
         | 
         | Like when advertising the new airliner, most people don't care
         | about how fast it taxis.
        
       | simonw wrote:
       | Thinking level medium: https://tools.simonwillison.net/svg-
       | render#%3Csvg%20xmlns%3D...
       | 
       | Thinking level xhigh: https://tools.simonwillison.net/svg-
       | render#%20%20%3Csvg%20xm...
        
         | ineedasername wrote:
         | Medium has things dialed in. When both high and low are
         | coherent but medium goes to cubism? That's intent. Or it had a
         | miscue on proportions vs shape placement. Either way, it's
         | great, sandwiched the way it is, between the other two. Did it
         | put a comment in all of them or just the one w/ the hat?
         | 
         | Also, thanks for the posts -- it's hugely helpful to have a
         | continuity of insightful perspective throughout.
        
       | andai wrote:
       | The graph showing higher performance for fewer thinking tokens is
       | really interesting!
       | 
       | It would be even more interesting to see how Sonnet and Haiku
       | compare with that curve.
        
       | tptacek wrote:
       | Is "compaction" a trained-in feature of the model, or just
       | tooling around the model calls? Agents already do compaction.
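       | 
       | By tooling-side compaction I mean roughly this (a sketch with
       | made-up helper names):
       | 
       |     def summarize(old: list[dict]) -> str:
       |         # Stand-in for a model call along the lines of
       |         # "summarize these turns; keep decisions, open
       |         # tasks, and file paths".
       |         return "; ".join(m["content"][:60] for m in old)
       | 
       |     def compact(history: list[dict], keep_recent: int = 8) -> list[dict]:
       |         """Swap old turns for a summary once the log gets long."""
       |         if len(history) <= keep_recent:
       |             return history
       |         old, recent = history[:-keep_recent], history[-keep_recent:]
       |         note = "Summary of earlier turns: " + summarize(old)
       |         return [{"role": "system", "content": note}] + recent
       | 
       | The question is whether -max was additionally trained with
       | summaries like this in its context, so that it behaves well
       | across compaction boundaries.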
        
       | kachapopopow wrote:
       | not sure if I am actually using 5.1-codex-max or just normal
       | 5.1-codex (is there even 5.1-codex?) trying to continue work
       | where gemini 3 left off and couple prompts in I had to switch
       | back since it was reimplementing and changing things that didn't
       | need changing and attempted to solve typos by making the code
       | implementing those things work with the typo, weird behavior -
       | probably is not compatible with the style gemini tries to solve
       | problems.
        
         | sumedh wrote:
         | Just run the /model command in codex and select the model
         | you want.
        
       | rolisz wrote:
       | I got prompted to try it out on the web. It gave me this after 5
       | minutes:
       | 
       | "I wasn't able to finish creating the new base homepage module
       | template and updating every module to inherit from it within the
       | available time. I did not make any changes or commits."
       | 
       | Told it to get back to work. Let's see how that goes.
        
       | epolanski wrote:
       | Small OT question about the GPT CLI tool.
       | 
       | I gave it a shot last month but did not enjoy it, due to the
       | lack of a proper planning mode and of a way to accept each
       | edit independently. Has it improved?
        
       | hereme888 wrote:
       | It's getting so cut-throat over who has the current SOTA
       | model. Seems to be the big income driver.
        
       | kilroy123 wrote:
       | All the frontier models seem fairly neck and neck. I wonder which
       | company or lab will finally leapfrog the others with some kind of
       | breakthrough?
       | 
       | It sounded like Gemini 3 would be that, but in my limited
       | testing it didn't appear to be.
        
       | boole1854 wrote:
       | Today I did some comparisons of GPT-5.1-Codex-Max (on high) in
       | the Codex CLI versus Gemini 3 Pro in the Gemini CLI.
       | 
       | - As a general observation, Gemini is less easy to work with as a
       | collaborator. If I ask the same question to both models, Codex
       | will answer the question. Gemini will read some intention behind
       | the question, write code to implement the intention, and only
       | then answer the question. In one case, it took me five rounds of
       | repeatedly rewriting my prompt in various ways before I could get
       | it to _not code_ but just answer the question.
       | 
       | - Subjectively, it seemed to me that the code that Gemini wrote
       | was more similar to code that I, as a senior-level developer,
       | would have written than what I have been used to from recent
       | iterations of GPT-5.1. The code seemed more readable-by-default
       | and not merely technically correct. I was happy to see this.
       | 
       | - Gemini seems to have a tendency to put its "internal dialogue"
       | into comments. For example, "// Here we will do X because of
       | reason Y. Wait, the plan calls for Z instead. Ok, we'll do Z.".
       | Very annoying.
       | 
       | I did two concrete head-to-head comparisons where both models had
       | the same code and the same prompt.
       | 
       | First, both models were told to take a high-level overview of
       | some new functionality that we needed and were told to create a
       | detailed plan for implementing it. Both models' plans were then
       | reviewed by me and also by both models (in fresh conversations).
       | All three of us agreed that Codex's plan was better. In
       | particular, Codex was better at being more comprehensive and at
       | understanding how to integrate the new functionality more
       | naturally into the existing code.
       | 
       | Then (in fresh conversations), both models were told to implement
       | that plan. Afterwards, again, all three of us compared the
       | resulting solutions. And, again, all three of us agreed that
       | Codex's implementation was better.
       | 
       | Notably, Gemini (1) hallucinated database column names, (2)
       | ignored parts of the functionality that the plan called for, and
       | (3) did not produce code that was integrated as well with the
       | existing codebase. In its favor, it did produce a better version
       | of a particular finance-related calculation function than Codex
       | did.
       | 
       | Overall, Codex was the clear winner today. Hallucinations and
       | ignored requirements are _big_ problems that are very annoying to
       | deal with when they happen. Additionally, Gemini's tendencies
       | include odd comments and to jump past the discussion phase of
       | projects both make it more frustrating to work with, at this
       | stage.
        
       | atonse wrote:
       | I just tried this out, and was VERY impressed with the speed of
       | the plan mode. I was also totally fine with the code it wrote.
       | 
       | Then I made the mistake of saying "run npm run build and fix all
       | issues" (something I've run probably 50 times across codex and cc
       | in the past 2 months). CC does it pretty much 100% of the time. I
       | walked away from Codex, and when I came back, it had installed 2
       | new node packages, and gone down some crazy rabbit hole with
       | eslint and something else. (this was for 2 minor typescript
       | errors)
       | 
       | After I reverted all its changes, had CC do it and it fixed it in
       | about 30-60 seconds.
       | 
       | I'll try a few more times. Let's see.
        
       ___________________________________________________________________
       (page generated 2025-11-19 23:00 UTC)