[HN Gopher] Building more with GPT-5.1-Codex-Max
___________________________________________________________________
Building more with GPT-5.1-Codex-Max
Author : hansonw
Score : 280 points
Date : 2025-11-19 18:01 UTC (4 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| iamronaldo wrote:
| That was quick
| bigyabai wrote:
| My first thought was "they must not be seeing as many Claude
| Code conversions as they hoped"
| giancarlostoro wrote:
| Whenever one of them releases a milestone release the rest
| start publishing big milestones too. I'm waiting for Opus 5
| next.
| LZ_Khan wrote:
| all i care about is performance on the METR benchmark
| Reubend wrote:
| OpenAI likes to time their announcements alongside major
| competitor announcements to suck up some of the hype. (See for
| instance the announcement of GPT-4o a single day before Google's
| IO conference)
|
| They were probably sitting on this for a while. That makes me
| think this is a fairly incremental update for Codex.
| Palmik wrote:
| GPT 5.1 / Codex already beats Gemini 3 on SWE Bench Verified
| and Terminal Bench and this pushes the gap further. Seems like
| a decent improvement.
| bugglebeetle wrote:
| That's how the game is played. We should be grateful for all
| the competition that is driving these improvements, not
| whinging about the realities of what companies have to do to
| contest each other's position.
| johnecheck wrote:
| It's funny, this release comes right after the Gemini 3
| release that coincided with day 1 of Microsoft's Ignite
| conference.
| peab wrote:
| it's really getting old
| johnwheeler wrote:
| Gemini is eating their lunch, and OpenAI knows it.
| criemen wrote:
| Anthropic released the Opus 4.1 (basically, a new Opus 4
| checkpoint) right around the big GPT-5 release date too, if I
| remember correctly. At this point, anything goes to stay
| relevant.
| spmartin823 wrote:
| I still want something no one has, which is the ability to launch
| agents in different git worktrees simultaneously and check the
| results out on my main branch for testing when they are finished.
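|
| The plumbing could be as simple as this sketch (Python; the
| `codex exec` call and the merge step are illustrative, not a
| finished tool):
|
|     import subprocess
|     from concurrent.futures import ThreadPoolExecutor
|
|     # One branch + worktree per agent, run in parallel.
|     def run_agent(idx: int, task: str) -> str:
|         branch, path = f"agent-{idx}", f"../wt-agent-{idx}"
|         subprocess.run(["git", "worktree", "add", "-b", branch, path],
|                        check=True)
|         subprocess.run(["codex", "exec", task], cwd=path)
|         subprocess.run(["git", "-C", path, "add", "-A"])
|         subprocess.run(["git", "-C", path, "commit", "-m", task])
|         return branch
|
|     tasks = ["fix flaky auth test", "add pagination to /users"]
|     with ThreadPoolExecutor() as pool:
|         branches = list(pool.map(run_agent, range(len(tasks)), tasks))
|     # then merge or cherry-pick each branch onto main to test it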
| bradly wrote:
| Would this be similar to how Charlie and Jules work?
| cube2222 wrote:
| I think I've described how I achieve kinda your desired
| workflow in a comment yesterday [0].
|
| [0]: https://news.ycombinator.com/item?id=45970668
| agentifysh wrote:
| ha! very interesting how slept on jj is
|
| its been essential to my workflow as well
|
| i use both jj and git and jj is great for just creating a
| snapshot that i can revert to in case it fails
|
| im still exploring it to see what else i can do with it for
| agentic use
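|
| e.g. a checkpoint/revert wrapper can be tiny (a sketch from
| memory; double-check the jj op log flags against --help):
|
|     import subprocess
|
|     # Snapshot the jj operation log before an agent run and
|     # rewind the whole repo state if the run goes sideways.
|     def checkpointed_run(agent_cmd: list[str]) -> bool:
|         before = subprocess.run(
|             ["jj", "op", "log", "--no-graph", "-n", "1", "-T", "id"],
|             capture_output=True, text=True).stdout.strip()
|         ok = subprocess.run(agent_cmd).returncode == 0
|         if not ok:
|             # restore working copy and history to the snapshot
|             subprocess.run(["jj", "op", "restore", before])
|         return ok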
| agentifysh wrote:
| lots of tools do this, and I ended up going down this rabbit
| hole wanting something that could just plug in to codex
| instead of requiring a fork
|
| http://github.com/agentify-sh/10x
|
| it adds minimal overhead for agent orchestration (it's just
| bash/typescript). its main focus was adding enhancements to
| codex: double-redundant checkpoints via git and jj (lessons
| learned from codex being git reset --hard happy), something
| like claude skills (just a bunch of mds that steer it toward a
| specific activity like think, plan, execute), timeout wrappers
| (to get you unstuck if codex waits a long time), and
| blacklisted commands during yolo (rm -rf and git reset are
| banned even if by some small chance it tries to run them). MIT
| licensed.
|
| you can work sequentially (subagents launch one after the
| other) or in parallel (worktrees), but tbh sequential is better
| because you understand what is going on; parallel might be best
| for dealing with tests and UI.
| poly2it wrote:
| Your link is a 404.
| lysecret wrote:
| Cursor has this too
| rane wrote:
| tmux users might find this useful:
| https://github.com/raine/workmux
| agentifysh wrote:
| so this was arctic fox it seems. a lot of us ended up
| downgrading to codex 5.0 because the token burn was too much. i
| see codex max is a step up, which is welcome, but still unsure
| if they solved that github issue around tool use that impacts
| tokens
|
| going to wait and see after being burned by 5.1 before i
| upgrade back to 0.58
|
| gemini 3 has been a letdown tbh, agentic coding wasn't a top
| priority, so im sticking with codex for now and using gemini 3
| for frontend
| GenerWork wrote:
| Have you found that Gemini is better than Codex for front end
| generation? I'm trying to bring some Figma screens into a small
| React project I have, and Codex will occasionally screw up the
| implementation despite the fact that I'm using the MCP server.
| jasonthorsness wrote:
| "Starting today, GPT-5.1-Codex-Max will replace GPT-5.1-Codex as
| the default model in Codex surfaces."
|
| Wow, I spent last weekend using a tag-team of Claude and Codex
| and found Codex to more often get better results (TypeScript
| physics/graphics application). I probably only wrote a few
| hundred lines of code out of many thousands; it did a really good
| job.
|
| Now I guess I'll ask the new Codex to review the work of the old!
| taurath wrote:
| These 2 sentences right next to each other stood out to me:
|
| > a new step towards becoming a reliable coding partner
|
| > GPT-5.1-Codex-Max is built for long-running, detailed work
|
| Does this not sound contradictory? It's been the shorter form
| work that has built what little confidence I have in these as a
| coding partner - a model that goes off and does work without
| supervision is not a partner to me.
| causal wrote:
| Absolutely contradictory. The long-running tendency for Codex
| is why I cannot understand the hype around it: if you bother to
| watch what it does and read its code the approaches it takes
| are absolutely horrifying. It would rather rewrite a TLS
| library from scratch than bother to ask you if the network is
| available.
| keeganpoppen wrote:
| these things are actually fixable with prompting. is it easy?
| no. is it PEBKAC if you don't do anything to change course as
| it builds a TLS library? yes, but paperclip maximized! xD
| causal wrote:
| Or you can have a model with some semblance of common sense
| that will stop and say "Hey, can I have access to the
| network to do X?"
|
| Codex feels like a tool designed to run after all the
| humans are gone.
| meowface wrote:
| >It would rather rewrite a TLS library from scratch than
| bother to ask you if the network is available.
|
| This is definitely one of the biggest issues with coding
| agents at the moment.
|
| That said, from my experience, Codex so often does things
| that are so useful and save me so much time that the
| occasional "oh god what the hell did it just go off and do"
| is an acceptable cost for me.
|
| I regularly get great results with open-ended prompts and
| agents that spend 15+ minutes working on the task. I'm sure
| they'll eventually get better at common sense understanding
| of what kind of work is wasteful/absurd.
| ntonozzi wrote:
| If you haven't, give Cursor's Composer model a shot. It might
| not be quite as good as the top models, but in my experience
| it's almost as good, and the lightning fast feedback is more
| than worth the tradeoff. You can give it a task, wait ten
| seconds, and evaluate the results. It's quite common for it to
| not be good enough, but no worse than Sonnet, and if it doesn't
| work you just wasted 30 seconds instead of 10 minutes.
| embirico wrote:
| (Disclaimer: Am on the Codex team.) We're basically trying to
| build a teammate that can do both short, iterative work with
| you, then as you build trust (and configuration), you can
| delegate longer tasks to it.
|
| The "# of model-generated tokens per response" chart in [the
| blog introducing
| gpt-5-codex](https://openai.com/index/introducing-upgrades-to-
| codex/) shows an example of how we're improving the model good
| at both.
| simianwords wrote:
| > Compaction enables GPT-5.1-Codex-Max to complete tasks that
| would have previously failed due to context-window limits, such
| as complex refactors and long-running agent loops by pruning its
| history while preserving the most important context over long
| horizons. In Codex applications, GPT-5.1-Codex-Max automatically
| compacts its session when it approaches its context window limit,
| giving it a fresh context window. It repeats this process until
| the task is completed.
|
| Wouldn't the model automatically do that using attention
| techniques? Why do you need to do it at the token layer and not
| leave it to the model to automatically decide which tokens are
| worth paying attention to?
| qsort wrote:
| > due to context-window limits
| simianwords wrote:
| context window is not some physical barrier but rather the
| attention just getting saturated. what did i get wrong here?
| qsort wrote:
| > what did i get wrong here?
|
| You don't know how an LLM works and you are operating on
| flawed anthropomorphic metaphors.
|
| Ask a frontier LLM what a context window is, it will tell
| you.
| Palmik wrote:
| It's a fair question, even if it might be coming from a
| place of misunderstanding.
|
| For example, DeepSeek 3.2, which employs sparse attention
| [1], is not only faster with long context than normal
| 3.1, but also seems to be better (perhaps thanks to
| reducing the noise?).
|
| [1] It still uses a quadratic router, but it's small, so it
| scales well in practice.
| https://api-docs.deepseek.com/news/news250929
| ed wrote:
| Parent is likely thinking of sparse attention which
| allows a significantly longer context to fit in memory
| qsort wrote:
| My comment was harsher than it needed to be and I'm
| sorry, I think I should have gotten my point across in a
| better way.
|
| With that out of the way, parent was wondering why
| compaction is necessary arguing that "context window is
| not some physical barrier but rather the attention just
| getting saturated". We're trying to explain that 3+2=2+3
| and you people are sitting in the back going "well,
| actually, not all groups are abelian".
| paradite wrote:
| In theory, auto-regressive models should not have a limit on
| context. They can generate the next token conditioned on all
| previous tokens.
|
| In practice, when training a model, people select a context
| window so that during inference, you know how much GPU
| memory to allocate for a prompt and reject the prompt if it
| exceeds the memory limit.
|
| Of course there's also degrading performance as context
| gets longer, but I suspect memory limit is the primary
| factor of why we have context window limits.
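|
| A back-of-the-envelope KV-cache calculation shows the scale
| (the config numbers here are illustrative, not any specific
| model's):
|
|     # KV cache per sequence: 2 tensors (K and V) per layer,
|     # each [kv_heads, seq_len, head_dim], in fp16/bf16.
|     layers, kv_heads, head_dim, bytes_per = 80, 8, 128, 2
|
|     def kv_cache_gb(seq_len: int) -> float:
|         return 2 * layers * kv_heads * head_dim * bytes_per * seq_len / 1e9
|
|     for n in (8_000, 128_000, 400_000):
|         print(f"{n:>7} tokens -> {kv_cache_gb(n):6.1f} GB")
|     # ~2.6 GB at 8k, ~42 GB at 128k, ~131 GB at 400k: you cap it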
| kenjackson wrote:
| I think attention literally doesn't see anything beyond the
| context window. Even within the context window you may
| start to see attentional issues, but that's a different
| problem.
| adastra22 wrote:
| Attention is quadratic, so you have to pick a cutoff for
| context window size. In addition, the error/noise in state
| space increases with longer contexts, resulting in poorer
| performance. So even if you're willing to take the O(n^2)
| slowdown of a larger context window, it still won't work.
| fancy_pantser wrote:
| > Attention is quadratic
|
| Exactly. Standard Multi-Head Attention materializes an
| attention matrix that grows to ~4B entries (64K^2) for a 64K
| sequence as a starting place. FlashAttention v2 helps
| slightly, but as you grow to
| 128K context length, you still need over 1TB/s memory
| bandwidth to stay compute-bound in practice even with this
| optimization.
|
| So there has been a lot of research in this area and model
| architectures released this year are showing some promising
| improvements. Sliding windows lose context fidelity and if
| you go fully linear, you sacrifice math, logic, and long
| multi-turn (agentic) capabilities, so everyone is searching
| for a good alternative compromise.
|
| MiniMax-M1 had lightning attention to scale up to 1M context
| lengths. It's "I/O aware" via tiling and calculates attention
| two ways block-wise (intra-block traditional attention and
| inter-block linear attention), thereby avoiding the speed-
| inhibiting cumulative summation.
|
| DeepSeek V3.2 uses DeepSeek Sparse Attention (DSA), which is
| sub-quadratic, computing only the "interesting" pairs. For
| example, at 128K context length this requires only 10-20% of
| attention pairs to be materialized.
|
| Both Qwen3-Next and Kimi Linear adopt a Gated DeltaNet, which
| is borrowed from Mamba2. In Qwen3-Next it alternates three
| Gated DeltaNet (linear attention) layers for every one gated
| [full] attention. The speedup is from a delta rule, which
| basically amounts to caching in a hand-wavy way.
|
| There's no universally-adopted solution yet, as these are all
| pretty heavy-duty compromises, but the search is going strong
| right now for linear or better attention mechanisms that
| still perform well.
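|
| For reference, the naive quadratic baseline all of these are
| trying to beat (a toy numpy sketch of one head, not a
| production kernel):
|
|     import numpy as np
|
|     def naive_attention(q, k, v):
|         # q, k, v: (n, d) for a single head
|         n, d = q.shape
|         s = q @ k.T / np.sqrt(d)            # (n, n): the n^2 term
|         s -= s.max(axis=-1, keepdims=True)  # numerical stability
|         w = np.exp(s)
|         w /= w.sum(axis=-1, keepdims=True)  # softmax over keys
|         return w @ v                        # (n, d)
|
|     n, d = 4096, 128
|     q, k, v = (np.random.randn(n, d).astype(np.float32)
|                for _ in range(3))
|     out = naive_attention(q, k, v)
|     # the score matrix alone is n*n*4 bytes: ~67 MB here,
|     # ~17 GB at n = 64K -- hence all the alternatives above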
| hansonw wrote:
| Rest assured that we are better at training models than naming
| them ;D
|
| - New benchmark SOTAs with 77.9% on SWE-Bench-Verified, 79.9% on
| SWE-Lancer, and 58.1% on TerminalBench 2.0
|
| - Natively trained to work across many hours across multiple
| context windows via compaction
|
| - 30% more token-efficient at the same reasoning level across
| many tasks
|
| Let us know what you think!
| agentifysh wrote:
| did you address this
| https://github.com/openai/codex/issues/6426 ?
|
| how much more token-efficient is this compared to 5.0?
|
| i had to use 5.0 because 5.1 was eating tokens like crazy and
| seemed like a slight incremental improvement, barely noticeable
| EnPissant wrote:
| Compaction is just what Claude Code has done forever, right?
| enraged_camel wrote:
| I am also trying to understand the difference between
| compaction and what IDEs like Cursor do when they
| "summarize" context over long-running conversations.
|
| Is this saying that said summarization now happens at the
| model level? Or are there other differences?
| GardenLetter27 wrote:
| I think the point here is not that it does compaction (which
| Codex also already does) - but that the model was trained
| with examples of the Codex compaction, so it should perform
| better when compaction has taken place (a common source for
| drops in performance for earlier models).
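|
| In loop form the pattern is something like this (a sketch of
| generic compaction, not OpenAI's actual implementation; `step`
| and `count_tokens` are stand-ins for the harness internals):
|
|     CONTEXT_LIMIT = 400_000  # tokens; hypothetical budget
|     COMPACT_AT = 0.9         # compact when ~90% full
|
|     def run_with_compaction(task, step, count_tokens):
|         history = [task]
|         while True:
|             reply = step(history)      # one turn: tool call, edit...
|             history.append(reply)
|             if reply.done:
|                 return reply
|             if count_tokens(history) > COMPACT_AT * CONTEXT_LIMIT:
|                 # replace the transcript with a model-written summary
|                 # keeping goals, key decisions, and open TODOs
|                 summary = step(history + ["summarize for a fresh start"])
|                 history = [task, summary]  # fresh context window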
| EnPissant wrote:
| Codex previously did only manual compaction, but yeah,
| maybe some extra training for compaction, too?
| iyn wrote:
| Looks like a great change! I'll take it for a spin in a moment.
|
| I really like the "subagent" feature in Claude Code -- it's
| super useful to manage context in complex codebases. Here are
| some examples of agents that can be useful:
| https://github.com/humanlayer/humanlayer/tree/main/.claude/a...
|
| Would it make sense to have a similar feature in Codex CLI? I
| often do "spec-driven development", which is basically a loop
| of: research -> implementation plan -> actual
| implementation (based on research + plan) -> validation
|
| I have multiple subagents that I use for each phase that (based
| on subjective judgement) improve the output quality (vs keeping
| everything, every tool use etc. in the "main" context window).
|
| Codex CLI is great and I use it often but I'd like to have more
| of these convenient features for managing context from CC. I'm
| super happy that compaction is now available, hopefully we'll
| get more features for managing context.
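|
| To illustrate, the loop I mean runs each phase in a fresh
| context, carrying only the written artifact forward
| (run_agent is a stand-in for a codex/CC invocation):
|
|     def run_agent(prompt: str) -> str:
|         ...  # stand-in: invoke the agent, return its final output
|
|     def pipeline(task: str) -> str:
|         research = run_agent(f"Research the codebase for: {task}")
|         plan = run_agent(f"Plan: {task}\n\nResearch:\n{research}")
|         impl = run_agent(f"Implement this plan:\n{plan}")
|         return run_agent(f"Validate against the plan:\n{plan}\n\n{impl}")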
| NitpickLawyer wrote:
| Will -minis come for the codex family of models? About two
| months ago I used 5-mini as a daily driver for a few weeks and
| quite liked it, it seemed capable enough on small tasks with
| some hand holding and the speed/price were great as well.
| coder543 wrote:
| codex-mini was released a couple of weeks ago:
| https://platform.openai.com/docs/models/gpt-5.1-codex-mini
| NitpickLawyer wrote:
| Thanks! I somehow missed that. Will check it out.
| qsort wrote:
| Codex is an outstanding product and incremental upgrades are
| always welcome. I'll make sure to give it a try in the coming
| days. Great work! :)
| robotswantdata wrote:
| Sorry, I don't like the max model; it feels like it needs a lot
| more guiding. The plans it writes, however, are better, so I
| tried feeding them back in (meta-prompt style) and it's working
| okay so far. Very large repository.
| andai wrote:
| So context window is still 400k but the model got good at
| removing irrelevant context?
| carbocation wrote:
| It would be great to have access to this model via the chat
| interface, even if it was gated behind the "other models"
| dropdown or something.
| sinatra wrote:
| I currently use GPT-5.1-Codex High and have a workflow that
| works well with the 5-hour/weekly limits, credits, et al. If I
| use GPT-5.1-Codex-Max Medium or GPT-5.1-Codex-Max High, how
| will that compare cost / credits / limits wise to GPT-5.1-Codex
| High? I don't think that's clear. "Reduced tokens" makes me
| think it'll be priced similarly / lower. But, "Max" makes me
| think it'll be priced higher.
| blks wrote:
| I think your company will fail soon.
| meowface wrote:
| I would bet a lot of money it will not.
| SoKamil wrote:
| > Natively trained
|
| What does it even mean?
| causal wrote:
| Sigh. Time to try it again I guess. I give OpenAI way more
| chances than it deserves.
| EcommerceFlow wrote:
| Gemini 3 had a great 24 hour SOTA run for coding
| CuriouslyC wrote:
| Gemini is still the best oracle/planner by a mile. It's just a
| bad agent. Give it a bundle of your repo and get it to plan
| your changes, then hand it off to codex to implement.
| croes wrote:
| The new detergent now washes even whiter
| bgwalter wrote:
| Come on folks, this is funny. They also have industrial
| strength laundromats to go with the detergent.
| pton_xd wrote:
| I love how programming discussions du jour have basically
| devolved into "really? my socks definitely smell better after
| using 2 scoops of last month's soap. what spin cycle are you
| using?"
| SunshineTheCat wrote:
| My observation has been that Codex tends to hit logical/data-
| driven/back-end tasks out of the park while doing weird, random
| nonsense with even simple UI tasks. This could be me needing to
| improve how I phrase my prompts, but it will be interesting to
| see if it's improved in that arena at all.
| cube2222 wrote:
| Somewhat related, after seeing the praise for codex in the Sonnet
| 4.5 release thread I gave it a go, and I must say, that CLI is
| much worse than Claude Code (even if the model is great, I'm not
| sure where the issue really lies between the two).
|
| It was extremely slow (like, multiple times slower than Sonnet
| with Claude Code, though that's partially on me for using
| thinking-high I guess) to finish the task, with the back-and-
| forths being on the order of tens of minutes.
|
| Moreover, the context management seems to be really weird. I'm
| not sure how exactly it works, but - 1. It uses very few
| tokens / fills up the context slowly (good I guess). 2. It
| doesn't seem to actually internalize the contents of files you
| mention to it, or that it edits.
|
| #2 here being the main one - I usually context-dump reference
| code for Claude Code, and it does a perfect job of adhering to
| codebase patterns and its architecture, while codex was
| completely ignorant of the existing code style.
|
| Moreover, it wrote extremely defensive code, even for code where
| it wrote both ends itself.
|
| All in all, I was really let down after seeing all the praise.
| agentifysh wrote:
| sure claude code has better ux but honestly its hard to get any
| good amount of usage out of the subscriptions vs what codex
| offers at the same price
|
| with claude im constantly hitting rate limits with codex
| getting substantially more and "slow" isn't really a problem
| for me as long as it keep working
|
| the only complaint i have is that codex itself has usage
| limits now (either due to outstanding github issues around tool
| use or throttling on their end) compared to a few months ago
|
| the true magical moment was codex pro letting me run swarms of
| agents day in day out without any worries about rate limits it
| truly felt unlimited
|
| if claude manages to release a smaller model or some way to
| deal with the rapidly depleting usage limits (this is the top
| complaint on reddit and they eventually just stopped allowing
| threads about it) it would definitely be used more
|
| but for now codex is clearly the workhorse, with claude used
| side by side.
| cube2222 wrote:
| Well as I said, codex didn't adhere to codebase standards for
| me and the code quality was worse (very defensive), so even
| after waiting longer, results weren't there for me.
|
| But the subscription thing is a non-issue for me as I use the
| API, and mostly use Claude Code synchronously, with the
| occasional rare background agent.
| sumedh wrote:
| > if claude manages to release a smaller model
|
| have you tried Haiku?
| tosh wrote:
| Codex CLI 0.59 got released (but has no changelog text)
|
| https://github.com/openai/codex/releases/tag/rust-v0.59.0
| bgwalter wrote:
| So they all release before the Nvidia numbers tonight. The real
| question is: How well can Nvidia hide the circular deals in the
| books?
| amluto wrote:
| I would love to see all the big players put 1% of the effort they
| put into model training into making the basic process of paying
| and signing in suck less.
|
| Claude: they barely have a signin system at all. Multiple account
| support doesn't exist. The minimum seat count for business is
| nonsense. The data retention policies are weak.
|
| OpenAI: Make ZDR a thing you can use or buy without talking to
| sales, already. And for those using containers or a remote system
| or really anything other than local development with the codex
| CLI, you really really need to fix this bug. I bet Codex could do
| at least the client part for you!
|
| https://github.com/openai/codex/issues/2798
|
| (Hint: Claude Code gets this right by default, despite the fact
| that everything else about Claude sign-in is a joke.)
|
| Google: get all your B2B AI product managers in one room and tell
| them that they need to make one single product menu on one single
| webpage with all the pricing on _that page_ and that the Google
| Cloud people are not permitted to make anything that isn't
| actually logically Google Cloud depend on Google Cloud Billing.
| Your product cannot compete with OpenAI or Anthropic if people
| need to ask an LLM to figure out what your product is and if your
| own fancy LLMs can't give a straight answer. My company pays for
| a non-Google product primarily because it's too complicated to
| pay for the Google product! Right now, trying to use Google's AI
| is like trying to ride Bay Area public transit before the Clipper
| Card.
| atonse wrote:
| Agree 1,000%.
|
| I just won't even waste my time with the google stuff cuz I
| can't figure out how to pay with it.
|
| And that's a problem everywhere at google. Our google play
| account is suspended cuz I can't verify the company. It won't
| let me cuz it says I'm not the owner. I've always been the
| owner of my company. For 18 years. There is no one else.
|
| Once some error said make sure the owner email matches your
| profile in google payments and I was like, what is google
| payments and where do I even begin with that? I've never paid
| for google play so what does payments have to do with anything?
|
| It's totally random stuff. Get your shit together, google. Make
| your products and payment systems coherent, rather than them
| obviously looking like they were designed by a fiefdom full of
| territorial managers.
| nico wrote:
| Can relate. My inactive google ads account all of a sudden
| got banned. No explanation except some generic link to their
| terms of service. Appealed, got automatic denial, no reason
| given. Have retried multiple times, same result
| AuryGlenz wrote:
| Same thing happened to me. Guess who didn't start spending
| $100 a month with them again?
|
| Utterly ridiculous.
| joshstrange wrote:
| The "Owner" accounts in Google Play and Apple's App Store are
| so freaking annoying. The only time they make sense is for
| solo-founders and even then I've had issues. Now expand it to
| working at a larger company and it's a joke, a bad one. Oh
| sure, I'll just get the CEO (or other higher-up) to login and
| accept new agreements, that will be easy. Even more fun when
| you tell a client (who logged in exactly 1 time to set up the
| account) that they need to use a generic email (not a
| personal one or an employee-specific one), they ignore your
| suggestion, and then they can't get back in because the
| person who set up the account left the company. It's a mess.
|
| Also, re "Google Payments", I tried to transfer an app from
| my personal/solo Google Play account to a new business one I
| set up for my LLC and it was like pulling teeth. They wanted
| me to find some payment id from the original $20 purchase I
| made to get access to Google Play, something I did right
| around when they first launched and while I still have/use
| the same email, Google came out with approximately 1 googol
| different "payment solutions" in the interim and their
| engineers don't care about data migrations. Finally, after
| many support emails, they just transferred it without me
| giving that code which just shows how silly the whole thing
| was from the start.
| tarsinge wrote:
| I don't have experience in big tech but in the few SaaS
| companies I've seen the issue is UX designers and Product
| managers overwhelmingly have a B2C culture.
| swivelmaster wrote:
| > designed by a fiefdom full of territorial managers
|
| What's harder than herding cats? Herding cats with MBAs and
| OKRs.
| redler wrote:
| Conway's Law strikes again.
| computerex wrote:
| Couldn't agree more about the google product offerings. Vertex
| AI? AI Studio? Maker studio? Gemini? The documentation is
| fragmented with redundant offerings making it confusing to
| determine what is what. GCP billing is complicated to figure
| out vs OpenAI or Anthropic billing.
|
| The sad part is Google does offer a ChatML/OpenAI-compliant
| endpoint for LLM calls, and I believe in an experiment they
| also reduced the friction of getting an API key to start
| making calls right away, but discoverability remains a
| challenge with google services.
| byefruit wrote:
| I've just found myself using OpenRouter if we need Google
| models for a project, it's worth the extra 5% just not to
| have to deal with the utter disaster that is their product
| offering.
| IanCal wrote:
| FWIW I had to bail on the same thing because my results
| were drastically different. There was something happening
| with images through open router. Although outside of that
| I'd absolutely do the same thing, their apis are awful and
| billing worse. Maybe it makes sense for huge orgs but it's
| a nightmare on the smaller scale.
| int_19h wrote:
| > I believe they in an experiment also reduced friction in
| getting an API key to start making calls right away
|
| This part is very easy now: you sign into
| https://aistudio.google.com/ and then click "Get API key" in
| the lower left corner.
|
| The problem is that features and docs are still scattered all
| over. Some thing can only be done via Vertex, for example.
| hassleblad23 wrote:
| Adding to this, Google's models can only be used with GCP while
| OpenAI's models can be used with Azure, Anthropic's models can
| be used with AWS Bedrock, in addition to their own platforms.
|
| I'd love to see the Gemini models being available by other
| providers :) or if they just build a simple prepaid wallet like
| OpenAI and Anthropic.
| temp0826 wrote:
| Didn't realize these stipulations for the models. Looking at
| devops-y job descriptions the last few months I noticed
| nearly everyone has some kind of Azure requirement now (which
| I've mostly avoided because I don't want to end up managing
| someone's AD), but is openai the actual reason for it?
| sethhochberg wrote:
| We're just using Github Copilot as our primary entrypoint
| for all of the model families. It's the only way we can
| easily offer our devs some level of Claude, Gemini, and
| Codex all in one place.
| skerit wrote:
| Last night, just after Gemini 3 was released and became
| available for Gemini-CLI, I saw Gemini-CLI's team post that you
| could access Gemini 3 with either an API key OR with _Gemini AI
| Ultra_, so I thought: great, I'll get that!
|
| Now you CAN NOT get the Google One stuff if your account is
| part of a workspace. I thought: how awful. I want to pay, but I
| simply can't?
|
| Oh, but then I noticed: You CAN add a _Gemini AI Ultra_ license
| via the Google Workspace Admin area, great!
|
| Turns out: you fucking can't. That's _Google AI Ultra FOR
| BUSINESS_ and that IS NOT supported.
|
| So I had to get the Google One subscription on my personal
| account after all.
|
| Combine that with the _pathetic_ usage limits: somehow not
| token-based, but a number of requests per 24-hour window (which
| is 500 for Gemini 3) and Gemini 3's incredible chattiness (it
| uses A LOT more requests to get something done compared to
| Claude), and you hit the usage limits in just 2 hours.
| timtimmy wrote:
| Careful, their ToS makes it clear they train on your
| Antigravity prompts (even on AI Ultra) and there is no opt-
| out that I can find.
| victor106 wrote:
| the microsoftication of Google. Fighting evil with evil...
| halifaxbeard wrote:
| At this point I'm not convinced that Gemini 3 Pro was post-
| trained on data Google had permission to use, going by the
| myriad of issues on the Gemini CLI tracker around Google
| AI/Google One/Google Cloud/Google Workspaces.
|
| https://github.com/google-gemini/gemini-cli/issues/12121
|
| It is far too easy to accidentally end up under the wrong
| privacy agreement, to the point of where some workplaces are
| banning use of the Gemini CLI!
| timtimmy wrote:
| Google keeps changing their privacy and "don't train on my
| data/code" options. When gemini-cli launched, there was a clear
| toggle for "don't train on my code." That's now gone; it just
| links to a generic privacy page for me. Maybe something with my
| account changed, I can't figure it out. Deep in the Cloud
| Gemini console, there's another setting that might control
| training, but it's not clear what products it actually covers.
|
| Trying to pay for Gemini-3 is confusing. Maybe an AI Ultra
| personal subscription? I already pay for OpenAI and Anthropic's
| pro/max plans and would happily pay Google too. But the only
| obvious option is a $250/month tier, and its documentation
| indicates Google can train on your code unless you find and
| enable the correct opt-out. If that opt-out exists in all the
| products, it's not obvious where it lives or what products it
| applies to.
|
| Workspace complicates it further. Google advertises that with
| business workspace accounts your data isn't used for training.
| So, I was going to try Antigravity on our codebase. At this
| point I know I can't trust Google, so I read the ToS carefully.
| They train on your prompts and source code, and there doesn't
| appear to be a way to pay them and opt out right now. Be
| careful, paying for Google Workspace does not protect you,
| always read the ToS.
|
| Be careful with AI-studio and your Google Workspace accounts.
| They train on your prompts unless you switch it to API mode.
|
| The result is a lot of uncertainty. I genuinely have no idea
| how to pay Google for Gemini without risking my code being used
| for training. And if I do pay, I can't tell whether they'll
| train on my prompts anyway.
|
| The marketing for their coding products does not clearly state
| when they do or do not train on your prompts and code.
|
| I had to run deep research to understand the risks with using
| Gemini 3 for agentic work, and I still don't feel confident
| that I understand the risks. I might have said some incorrect
| things above, but I am just so confused. I feel like I have a
| <75% grasp on the situation.
|
| I don't have a lot of trust. And honestly, this feels confusing
| and deceptive. One could easily read it as a deliberate
| strategy to gather training data through ambiguity and dark
| patterns; it certainly looks like this could be Google's
| strategy to win the AI race. I assume this is just how it
| looks, and that they aren't being evil on purpose.
|
| OpenAI in particular has my trust. They get it. They are
| carefully building the customer experience, they are product
| and customer driven from the top.
| bossyTeacher wrote:
| >OpenAI in particular has my trust.
|
| I wouldn't trust Sam Altman. Or any of the big players
| really.
| fishmicrowaver wrote:
| > trust
|
| Hahaha...HAHAhaha. HAHAHHAHAHAHAHAHAHA!!!
| unreal6 wrote:
| > Claude: they barely have a signin system at all. Multiple
| account support doesn't exist. The minimum seat count for
| business is nonsense. The data retention policies are weak.
|
| Please give me an option for a password (or passkey) or
| literally anything else that doesn't require either linking
| with google or going through an email flow for every login
| leetrout wrote:
| And stop asking for phone numbers for "fraud prevention" when
| I've already given you my name, address and credit card.
| lucasban wrote:
| The fun one for me is that I moved countries and last I
| checked there's still no way to change your phone number on
| ChatGPT short of making a new account, so now my account is
| associated with a phone number that I no longer have access
| to and will eventually be reassigned to someone else.
| oblio wrote:
| Can't people spoof the first two and use a stolen credit card
| number?
| brobdingnagians wrote:
| Such great case studies of how LLM coding will make all of your
| employees 1000x more productive at coding, design, and UX. They
| really are leading the way showing us into the brighter future
| of AI software /s
| jiggawatts wrote:
| Nobody claimed AIs will make office politics go away.
|
| Peering into my crystal ball: once all "workers" have been
| replaced, all humans will spend all of their working hours on
| nothing but office politics.
| gigatree wrote:
| It seems pretty clear the moat is built at the application
| layer, how enjoyable/easy the actual application is to use, but
| these applications seem to be getting worse over time even as
| the models get better. Is it really that hard to do both? Isn't
| the point of agentic coding to do more better (not just more)?
| sophiebits wrote:
| ZDR is a risk thing for them. They want to make sure you're a
| legitimate company and have monitoring in place on your side to
| reduce the chance you're using them for illegal things.
| fHr wrote:
| Google listen to this man and fire 90% of your useless product
| managers!
| sumedh wrote:
| It's the same with Cursor. As a Cursor Admin I want the ability
| to enable only specific models and disable the rest to save
| costs, but I cannot do that. It should be pretty simple to do,
| but for some reason Cursor won't add that functionality to their
| Admin tools.
| kytazo wrote:
| 500 Internal Server Error.
| morog wrote:
| ditto. Also OpenAI vector stores are down right now across the
| board
| nakamoto_damacy wrote:
| It's good but Gemini 3 beats it.
| syntaxing wrote:
| I rarely used Codex compared to Claude because it was extremely
| slow in GitHub Copilot. Like maybe 2-5X slower than Claude
| Sonnet. I really wish they just made their models faster rather
| than "better".
| nartho wrote:
| Have you tried Mistral ? Definitely one of the fastest models
| syntaxing wrote:
| My employer doesn't offer/allow anything besides the
| "traditional" offerings on GitHub copilot.
| levocardia wrote:
| Very interesting to see the range of peoples' preferences. I
| would almost always prefer smart over fast; I set all my LLMs
| to be all-thinking-all-the-time.
| syntaxing wrote:
| It's a balance. I haven't felt like codex provided anything
| that Sonnet 4.5 didn't. Why wait longer to get the same
| results?
|
| Though that does bring up an interesting point. Anecdotally,
| Sonnet does a lot more grep-ing while Codex reads files
| straight up. Might be the difference in speed and maybe
| smarter models will do better. Once this model is on copilot,
| I can test it out.
| mrguyorama wrote:
| GPT-5 was recently updated to make it more "thinking" and
| "warmer" or whatever and now a task (semantically compare
| these two short files) that used to take 5 seconds and
| reliably produce useful and _consistent_ output now takes _90
| seconds_ to "think" (while it's thinking output makes it
| pretty clear there is zero thinking happening) and produces a
| _completely differently structured output_ every single time,
| making the tool not only slower and more expensive to use,
| but worse at a simple task that LLMs should be very good at.
|
| There's an option to "get a quick answer" and I hoped
| clicking that would revert to previous performance and
| instead what it does is _ignore that I uploaded two files and
| asks me to upload the files_
|
| Literally the only real good task I've found for these dumb
| things and they _still_ found a way to fuck it up because
| they need to keep the weirdos and whales addicted. It's now
| almost easier to go back to comparing these files by eye, or
| just bite the bullet and finally write a few lines of python
| to actually do it right and reliably.
| jasonsb wrote:
| OpenAI doesn't want you to use their models outside of their
| own products, which is why the API and integrations like Github
| Copilot are super slow.
| sumedh wrote:
| That does not make business sense though. If people want to
| use OpenAI models in Copilot and other tools and they don't
| perform, they will just switch to another model and not come
| back; they are not going to use Codex.
| andai wrote:
| Sizeable if veracious!
| the__alchemist wrote:
| This is a tangent: Has anyone noticed that GPT-5.0 at some point
| started producing much faster, crappier answers, then 5.1 made it
| slower + better again? (Both in _Thinking_ mode)
| wincy wrote:
| I did notice that, I thought maybe I'd exceeded my thinking
| requests
| Narciss wrote:
| Here we go again....
| johnfn wrote:
| I've been using a lot of Claude and Codex recently.
|
| One huge difference I notice between Codex and Claude code is
| that, while Claude basically disregards your instructions
| (CLAUDE.md) entirely, Codex is extremely, painfully, doggedly
| persistent in following every last character of them - to the
| point that i've seen it work for 30 minutes to convolute some
| solution that was only convoluted because of some sentence I
| threw in the instructions I had completely forgotten about.
|
| I imagine Codex as the "literal genie" - it'll give you exactly
| what you asked for. EXACTLY. If you ask Claude to fix a test that
| accidentally says assert(1 + 1 === 3), it'll say "this is clearly
| a typo" and just rewrite the test. Codex will rewrite the entire
| V8 engine to break arithmetic.
|
| Both these tools have their uses, and I don't think one approach
| is universally better. Because Claude just hacks its way to a
| solution, it is really fast, so I like using it for iterate web
| work, where I need to tweak some styles and I need a fast
| iterative loop. Codex is much worse at that because it takes like
| 5 minutes to validate everything is correct. Codex is much better
| for longer, harder tasks that have to be correct -- I can just
| write some script to verify that what it did work, and let it
| spin for 30-40 minutes.
| nico wrote:
| > Claude basically disregards your instructions (CLAUDE.md)
| entirely
|
| A friend of mine tells Claude to always address him as "Mr
| Tinkleberry", he says he can tell when Claude is not paying
| attention to the instructions on CLAUDE.md when Claude stops
| calling him "Mr Tinkleberry" consistently
| benzible wrote:
| Yep, it's David Lee Roth's brown M&M trick
| https://www.smithsonianmag.com/arts-culture/why-did-van-
| hale...
| awad wrote:
| Highly recommend adding some kind of canary like this in all
| LLM project instructions. I prefer my instructions to say
| 'always start output with an (uniquely decided by you) emoji'
| as it's easier to visually scan for one when reading a wall
| of LLM output, and use a different emoji per project because
| what's life without a little whim?
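|
| The emoji canary is also trivially machine-checkable (a
| throwaway sketch; the emoji itself is whatever your
| instructions pin down):
|
|     CANARY = "🦜"  # the emoji your project instructions demand
|
|     def canary_present(response: str) -> bool:
|         # if the canary is gone, assume the rest of the
|         # instructions fell out of context too
|         return response.lstrip().startswith(CANARY)
|
|     print(canary_present("🦜 refactored the parser"))  # True
|     print(canary_present("refactored the parser"))     # False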
| leobg wrote:
| We used to do that on Upwork. Back in the days when one
| still hired human coders. If your application didn't say
| "rowboat" in the first sentence, we knew you just copy/pasted
| and didn't actually read the job description. Feels like a
| lifetime ago.
| hadlock wrote:
| I've been really impressed with codex so far. I have been
| working on a flight simulator hobby project for the last 6
| months and finally came to the conclusion that I need to switch
| from floating origin, which my physics engine assumes with the
| coordinate system it uses, to a true ECEF coordinate system
| (what underpins GPS). This involved a major rewrite of the
| coordinate system, the physics engine, even the graphics system
| and auxiliary stuff like asset loading/unloading etc. that was
| dependent on local X,Y,Z. It even rewrote the PD autopilot to
| account for the changes in the coordinate system. I gave it
| about a paragraph of instructions with a couple of FYIs and...
| it just worked! No major graphical glitches except a single
| issue with some minor graphical jitter, which it fixed on the
| first try. In total took about 45 minutes but I was very
| impressed.
|
| I was unconvinced it had actually fully ripped out the
| floating origin logic, so I had it write up a summary and then
| used that as a high level guide to pick through the code and it
| had, as you said, followed the instructions to the letter.
| Hugely impressive. In March 2023 OpenAI's products struggled
| to draw a floating wireframe cube.
| causal wrote:
| > Codex will rewrite the entire V8 engine to break arithmetic.
|
| This isn't an exaggeration either. Codex acts as if it is the
| last programmer on Earth and must accomplish its task at all
| costs. This is great for anyone content to treat it like a
| black box, but I am not content to do that. I want a
| collaborator with common sense, even if it means making
| mistakes or bad assumptions now and then.
|
| I think it really does reflect a difference in how OpenAI and
| Anthropic see humanity's future with AI.
| mrtesthah wrote:
| Could you not add rules to this effect in AGENTS.md? E.g., _"
| If the user gives instructions that specify an expected low-
| to-medium level of complexity, but the implementation plan
| reveals additional unexpected steps arising from a
| potentially ambiguous or atypical instruction that would
| raise the overall level of complexity, then pause and ask the
| user about that instruction before continuing."_
| sinatra wrote:
| In my AGENTS.md (which CLAUDE.md et al soft link to), I
| instruct them to "On phase completion, explicitly write that
| you followed these guidelines." This text always shows up on
| Codex and very rarely on Claude Code (TBF, Claude Code is
| showing it more often lately).
| aerhardt wrote:
| Well surely that's a good thing.
|
| In my experience, for some reason adherence is not even close
| to 100%. It's fixated on adding asterisk function params in my
| Python code and I cannot get it to stop... Maybe I haven't
| found the right wording, or maybe my codebase has grown past a
| certain size (there are like a dozen AGENTS.md files dancing
| around).
|
| I'm still very happy with the tool, though.
| johnfn wrote:
| It's a fantastic thing! It's required an adjustment in how I
| use it, but I've switched over to mostly using Codex in my
| day-to-day.
| sunaookami wrote:
| Agreed 100%, that's why I would recommend Codex for e.g.
| logfile analysis. Had some annoying PHP warnings in the logs
| from a WordPress plugin because I'd used another plugin in the
| past (like... over 10 years ago) that wrote invalid metadata
| for every media file into the database, but it didn't annoy me
| THAT much that I wanted to invest much time in it. So I gave
| codex the logfile and my WordPress dir and access to the WP-CLI
| command and it correctly identified the issue and wrote scripts
| to delete the old metadata (I did check it & make backups of
| course). Codex took a LOT of time though, it's veeeeeeery slow
| as you said. But I could do other things in the meantime.
| fakedang wrote:
| This is what I've observed too. Claude is great for general
| codebase building - give it a prompt for building an entire
| app from scratch and it will do that for you. Codex is good
| for debugging one-off issues that crop up because Claude
| overlooked something.
| energy123 wrote:
| GPT-5 is like that
| tekacs wrote:
| Yeah, Gemini 2.x and 3 in gemini-cli have the tendency to 'go
| the opposite direction' and it feels - to me - like an
| incredibly strong demonstration of why 'sycophancy' in LLMs is
| so valuable (at least so long as they're in the middle of the
| midwit curve).
|
| I'll give Gemini direction, it'll research... start trying to
| solve it as I've told it to... and then exclaim, "Oh! It turns
| out that <X> isn't what <user> thought!" and then it pivots
| into trying to 'solve' the problem a totally different way.
|
| The issue however... is that it's:
|
| 1) Often no longer solving the problem that I actually wanted
| to solve. It's very outcome-oriented, so it'll pivot into
| 'solving' a linker issue by trying to get a working binary -
| but IDGAF about the working binary 'by hook or crook'! I'm
| trying to fix the damn linker issue!
|
| 2) Just... wrong. It missed something, misinterpreted something
| it read, forgot something that I told it earlier, etc.
|
| So... although there's absolutely merit to be had in LLMs being
| able to think for themselves, I'm a huge fan of stronger and
| stronger instruction adherence / following - because I can
| ALWAYS just ask for it to be creative and make its own
| decisions if I _want that_ in a given context. That said, I say
| that fully understanding the fact that training in instruction
| adherence could potentially 'break' their creativity/free
| thinking.
|
| Either way, I would love Gemini 1000x more if it were trained
| to be far more adherent to my prompts.
| tekacs wrote:
| Immediately rebutting myself: a major caveat to this that I'm
| discovering with Gemini is that... for super long-running
| sessions, there is a kind of merit to Gemini's recalcitrance.
|
| When it's running for a while, Gemini's willingness to go
| totally off-piste and its outcome-orientedness _do_ result in
| sessions where I left it to do its thing and... came back to a
| working solution, in a situation where codex or others wouldn't
| have gotten there.
|
| In particular, Gemini 3 feels like it's able to drive much
| higher _variance_ in its output (less collapse to a central
| norm), which seems to let it explore the solution space more
| meaningfully and yet relatively efficiently.
| buu700 wrote:
| I haven't had that particular experience with Gemini 2.5, but
| did run into it during one of my first few uses of Gemini 3
| yesterday.
|
| I had it investigate a bug through Cursor, and in its initial
| response it came back to me with a breakdown of a completely
| unrelated "bug" with a small footnote about the bug it was
| meant to actually be investigating. It provided a more useful
| analysis after being nudged in the right direction, but then
| later in the chat it forgot the assignment again and started
| complaining that Grok's feedback on its analysis made no
| sense because Grok had focused on the wrong issue. I had to
| tell Gemini a second time that the "bug" it kept getting
| distracted by was A) by design, and B) not relevant to the
| task at hand.
|
| Ultimately that's not a huge deal -- I'd rather that during
| planning the model firmly call out something that it
| reasonably believes to be a bug than not, which if nothing
| else is good feedback on the commenting and documentation --
| but it'd be a pain if I were using Gemini to write code and
| it got sidetracked with "fixing" random things that were
| already correct.
| bugglebeetle wrote:
| The solution to this if you want less specification in advance
| is to simply ask Codex a series of leading questions about a
| feature of fix. I typically start with something like "it seems
| like X could be improved with the addition of Y? Can you review
| the relevant parts of the codebase in a, b, and c to assess?"
| It will then do so and come back with a set of suggestions that
| follow this guidance, which you can revise and selectively tell
| it to implement. In my experience, this fills the context with
| the appropriate details to then let it make more of its own
| decisions in a generally correct way without as much
| handholding.
| wilg wrote:
| I have been using GPT 5 High Fast in Cursor primarily over Codex,
| because Codex seems to take way longer and generally annoy me by
| doing strange CLI stuff, but hopefully I can switch to this new
| one. I also tried it against Gemini 3 Pro in Cursor and it's hard
| to tell but at least in some cases I felt like GPT5 was giving
| better results.
| LZ_Khan wrote:
| Woah, METR results look impressive. Still looking exponential.
| tunesmith wrote:
| I've been dealing with Codex CLI for a while and I love it, but
| I'm wondering if my thinking is just limited. While I'm starting
| discussions and creating plan docs, I've never been able to ask
| it to do anything that takes it longer than 25 minutes or so.
| Usually far less. I'm having trouble imagining what I can ask it
| to do that would make it take hours - like, wouldn't that require
| putting together an absolutely massive planning doc that would
| take hours to put together anyway? I'd rather just move
| incrementally.
| GenerWork wrote:
| Perhaps they're combining an incredibly complex product that
| has a lot of interactive features, a big codebase, test
| creation, and maybe throwing some MCP stuff in there such as
| creating a ticket in Jira if a test fails?
| CuriouslyC wrote:
| Easy way to get an agent to run a long time is just to get it
| to babysit CI/CD, tell it to iterate on it until it passes. I
| got Sonnet 4 to run for >6 hours that way.
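|
| The loop is basically just this (a sketch; the agent command
| and the `make test` entry point are placeholders for your own
| setup):
|
|     import subprocess
|
|     def babysit_ci(agent_cmd: list[str], max_rounds: int = 50) -> bool:
|         for _ in range(max_rounds):
|             ci = subprocess.run(["make", "test"],
|                                 capture_output=True, text=True)
|             if ci.returncode == 0:
|                 return True  # green: done
|             # hand the failure log back to the agent to patch things
|             log = (ci.stdout + ci.stderr)[-8000:]
|             subprocess.run(agent_cmd + [f"CI is failing, fix it:\n{log}"])
|         return False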
| aerhardt wrote:
| The idea of giving it a task that may take six hours and
| reviewing it also gives me shivers.
|
| I'm a very happy Codex customer, but everything turns to
| disgusting slop if I don't provide:
|
| (1) Up-to-date AGENTS.md and an excellent prompt
|
| (2) A full file-level API with function signatures, return
| types and function-level guidance if it's a complex one
|
| (3) Multiple rounds of feedback until the result is finely
| sculpted
|
| Overall it's very small units of work - one file or two, tops.
|
| I've been letting the above standards go for the last couple of
| weeks due to crunch, and looking at some of the hotspots of slop
| now lying around has me going all Homelander-face [1] at the
| sight of them.
|
| Those hotspots are a few hundred lines in the worst cases; I'm
| definitely not ready to deal with the fallout of any unit of
| work that takes even more than 20min.
|
| [1] https://i.kym-
| cdn.com/entries/icons/original/000/050/702/ab7...
| 999900000999 wrote:
| I really would prefer them to start creating customized models.
|
| I've vibe coded Godot games extensively.
|
| Just about every model I've tried likes to invent imaginary
| functions.
|
| I would really prefer for there to be a way for me to pick a
| model trained on whatever framework I need.
|
| Reviewing AI generated code feels like editing a long book, and
| every now and then you notice some words are just completely made
| up. You then ask the AI to fix its book, and it will just add
| more AI generated words.
|
| On one hand I want this to be a reality check to everyone who's
| trying to lay off real software engineers to replace us with AI.
|
| On the other hand half of the stock market is held up by
| overhyped AI valuations. If the tide goes out too fast, and there
| is a mass realization that this stuff just isn't as good as it's
| hyped to be, it's not going to be fun for anyone.
| andai wrote:
| I had this problem 2 years ago. All the models were telling me
| to use libraries that hadn't been invented yet.
|
| That was annoying back then, but these days that's not so much
| of a problem.
|
| You can write your program and then simply have it invent the
| library as well, while it's at it! ;)
| razodactyl wrote:
| These days not so much of a problem because the libraries now
| exist? Haha
| Atotalnoob wrote:
| I've found writing an MCP server with access to the docs cloned
| locally does wonders.
| epolanski wrote:
| I don't know, context is still an issue if you have lots of
| docs, in my experience.
| GaggiX wrote:
| Add the documentation to the context window in that case, a bit
| of context engineering.
| Narciss wrote:
| Context7 might be good for you
| spectraldrift wrote:
| Weird how they only share three hand-picked evals, ignoring the
| evals where they were left in the dust like ARC-AGI2. This post
| is so misleading, I don't even know whether to trust the numbers
| they _did_ share. One is just fraction of a percentage point away
| from Gemini 3 pro, which is awfully convenient for marketing and
| easy to hide. Very open, OpenAI.
| XenophileJKO wrote:
| Not really that weird. This isn't intended to be a "general"
| model. This is a coding model so they showed the coding evals.
| The assumption would be that, relative to GPT-5.1, non-coding
| evals would likely regress or be similar.
|
| Like when advertising the new airliner, most people don't care
| about how fast it taxis.
| simonw wrote:
| Thinking level medium: https://tools.simonwillison.net/svg-
| render#%3Csvg%20xmlns%3D...
|
| Thinking level xhigh: https://tools.simonwillison.net/svg-
| render#%20%20%3Csvg%20xm...
| ineedasername wrote:
| Medium has things dialed in. When both high and low are
| coherent but medium goes to cubism? That's intent. Or it had a
| miscue on proportions vs shape placement. Either way, it's
| great, sandwiched the way it is, between the other two. Did it
| put a comment in all of them or just the one w/ the hat?
|
| Also, thanks for the posts-- it's hugely helpful to have a
| continuity of insightful perspective throughout.
| andai wrote:
| The graph showing higher performance for fewer thinking tokens is
| really interesting!
|
| It would be even more interesting to see how Sonnet and Haiku
| compare with that curve.
| tptacek wrote:
| Is "compaction" a trained-in feature of the model, or just
| tooling around the model calls? Agents already do compaction.
| kachapopopow wrote:
| not sure if I am actually using 5.1-codex-max or just normal
| 5.1-codex (is there even a 5.1-codex?). trying to continue work
| where gemini 3 left off, and a couple prompts in I had to switch
| back, since it was reimplementing and changing things that didn't
| need changing, and attempted to solve typos by making the code
| implementing those things work with the typo. weird behavior -
| it's probably just not compatible with the style in which gemini
| tries to solve problems.
| sumedh wrote:
| Just run the /model command in codex and select the model you
| want.
| rolisz wrote:
| I got prompted to try it out on the web. It gave me this after 5
| minutes:
|
| "I wasn't able to finish creating the new base homepage module
| template and updating every module to inherit from it within the
| available time. I did not make any changes or commits."
|
| Told it to get back to work. Let's see how that goes.
| epolanski wrote:
| Small OT question on the GPT CLI tool.
|
| I gave it a shot last month but I did not enjoy it due to the
| lack of a proper planning mode and of being able to accept each
| edit independently; has it improved?
| hereme888 wrote:
| It's getting so cut-throat over who has the current SOTA model.
| It seems to be the big income driver.
| kilroy123 wrote:
| All the frontier models seem fairly neck and neck. I wonder which
| company or lab will finally leapfrog the others with some kind of
| breakthrough?
|
| It sounded like Gemini 3 would be that, but in my limited testing
| it didn't appear to be.
| boole1854 wrote:
| Today I did some comparisons of GPT-5.1-Codex-Max (on high) in
| the Codex CLI versus Gemini 3 Pro in the Gemini CLI.
|
| - As a general observation, Gemini is less easy to work with as a
| collaborator. If I ask the same question to both models, Codex
| will answer the question. Gemini will read some intention behind
| the question, write code to implement the intention, and only
| then answer the question. In one case, it took me five rounds of
| repeatedly rewriting my prompt in various ways before I could get
| it to _not code_ but just answer the question.
|
| - Subjectively, it seemed to me that the code that Gemini wrote
| was more similar to code that I, as a senior-level developer,
| would have written than what I have been used to from recent
| iterations of GPT-5.1. The code seemed more readable-by-default
| and not merely technically correct. I was happy to see this.
|
| - Gemini seems to have a tendency to put its "internal dialogue"
| into comments. For example, "// Here we will do X because of
| reason Y. Wait, the plan calls for Z instead. Ok, we'll do Z.".
| Very annoying.
|
| I did two concrete head-to-head comparisons where both models had
| the same code and the same prompt.
|
| First, both models were told to take a high-level overview of
| some new functionality that we needed and were told to create a
| detailed plan for implementing it. Both models' plans were then
| reviewed by me and also by both models (in fresh conversations).
| All three of us agreed that Codex's plan was better. In
| particular, Codex was better at being more comprehensive and at
| understanding how to integrate the new functionality more
| naturally into the existing code.
|
| Then (in fresh conversations), both models were told to implement
| that plan. Afterwards, again, all three of us compared the
| resulting solutions. And, again, all three of us agreed that
| Codex's implementation was better.
|
| Notably, Gemini (1) hallucinated database column names, (2)
| ignored parts of the functionality that the plan called for, and
| (3) did not produce code that was integrated as well with the
| existing codebase. In its favor, it did produce a better version
| of a particular finance-related calculation function than Codex
| did.
|
| Overall, Codex was the clear winner today. Hallucinations and
| ignored requirements are _big_ problems that are very annoying to
| deal with when they happen. Additionally, Gemini 's tendencies to
| include odd comments and to jump past the discussion phase of
| projects both make it more frustrating to work with, at this
| stage.
| atonse wrote:
| I just tried this out, and was VERY impressed with the speed of
| the plan mode. I was also totally fine with the code it wrote.
|
| Then I made the mistake of saying "run npm run build and fix all
| issues" (something I've run probably 50 times across codex and cc
| in the past 2 months). CC does it pretty much 100% of the time. I
| walked away from Codex, and when I came back, it had installed 2
| new node packages, and gone down some crazy rabbit hole with
| eslint and something else. (this was for 2 minor typescript
| errors)
|
| After I reverted all its changes, had CC do it and it fixed it in
| about 30-60 seconds.
|
| I'll try a few more times. Let's see.
___________________________________________________________________
(page generated 2025-11-19 23:00 UTC)