[HN Gopher] Gemini-2.5-pro-preview-06-05
___________________________________________________________________
Gemini-2.5-pro-preview-06-05
Author : jcuenod
Score : 288 points
Date : 2025-06-05 16:44 UTC (6 hours ago)
(HTM) web link (deepmind.google)
(TXT) w3m dump (deepmind.google)
| jbellis wrote:
| Did it get upgraded in-place again or do you need to opt in to
| the new model?
| unpwn wrote:
| I feel like instead of constantly releasing these preview
| versions with different dates attached they should just add a
| patch version and bump that.
| impulser_ wrote:
| They can't, because if someone has built something around that
| version, they don't want that model replaced with a new one
| that could produce different results.
| nsriv wrote:
| Looking at you, Anthropic. 4.0 is markedly different from 3.7
| in my experience.
| Aeolun wrote:
| The model name is completely different? How do you
| accidentally switch from 3.7 to 4.0?
| jfoster wrote:
| In what way are dates better than integers at preventing that
| kind of mistake?
| dist-epoch wrote:
| Except Google did exactly that with the previous release,
| where they silently redirected 03-25 requests to 05-06.
| op00to wrote:
| I found Gemini 2.5 Pro highly useful for text summaries, and even
| reasoning in long conversations... UP TO the last 2 weeks or
| month. Recently, it seems to totally forget what I'm talking
| about after 4-5 messages of a paragraph of text each. We're not
| talking huge amounts of context, but conversational
| braindeadness. Between ChatGPT's sycophancy, Gemini's
| forgetfulness and poor attention, I'm just sticking with whatever
| local model du jour fits my needs and whatever crap my company is
| paying for today. It's super annoying, hopefully Gemini gets its
| memory back!
| energy123 wrote:
| I believe it's intentionally nerfed if you use it through the
| app. Once you use Gemini for a long time you realize they have
| a number of dark patterns to deter heavy users but maintain the
| experience for light users. These dark patterns are:
|
| - "Something went wrong error" after too many prompts in a day.
| This was an undocumented rate limit because it never occurs
| earlier in the day and will immediately disappear if you
| subscribe for and use a new paid account, but it won't
| disappear if you make a new free account, and the error going
| away is strictly tied to how long you wait. Users complained
| about this for over a year. Of course they lied about the real
| reasons for this error, and it was never fixed until a few days
| ago when they rug pulled paying users by introducing actual
| documented tight rate limits.
|
| - "You've been signed out" error if the model has exceeded its
| output token budget (or runtime duration) for a single
| inference, so you can't do things like what Anthropic
| recommends where you coax the model to think longer.
|
| - I have less definitive evidence for this but I would not be
| surprised if they programmatically nerf the reasoning effort
| parameter for multiturn conversations. I have no other
| explanation for why the chain of thought fails to generate for
| small context multiturn chats but will consistently generate
| for ultra long context singleturn chats.
| op00to wrote:
| Right! I feel like it will sail through MBs of text data, but
| remembering what I said two turns ago is just too much.
| harrisoned wrote:
| I noticed that same behavior across older Gemini models. I
| built a chatbot at work around 1.5 Flash, and one day it
| suddenly started behaving like that. It was perfect before,
| but afterwards it always greeted the user as if it were their
| first chat, despite me sending the history. And I didn't find
| any changelog regarding that at the time.
|
| After that I moved to OpenAI; Gemini models just seem
| unreliable in that regard.
| 85392_school wrote:
| This might be because Gemini silently updates checkpoints
| (1.5 001 -> 1.5 002, 2.5 0325 -> 2.5 0506 -> 2.5 0605) while
| OpenAI doesn't update them without ensuring that they're
| uniformly better and typically emails customers when they are
| updated.
| jcuenod wrote:
| 82.2 on Aider
|
| Still actually falling behind the official scores for o3 high.
| https://aider.chat/docs/leaderboards/
| sottol wrote:
| Does 82.2 correspond to the "Percent correct" of the other
| models?
|
| Not sure if OpenAI has updated O3, but it looks like "pure" o3
| (high) has a score of 79.6% in the linked table, "o3 (high) +
| gpt-4.1" combo has a the highest score of 82.7%.
|
| The previous Gemini 2.5 Pro Preview 05-06 (yea, not current
| 06-05!) was at 76.9%.
|
| That looks like a pretty nice bump!
|
| But either way, these Aider benchmarks seem to be the most
| useful/trustworthy benchmarks currently and really the only
| ones I'm paying attention to.
| hobofan wrote:
| That's the older 05-06 preview, not the new one from today.
| energy123 wrote:
| They knew that. The 82.2 comes from the new benchmarks in the
| OP not from the aider url. The aider url was supplied for
| comparison.
| hobofan wrote:
| Ah, thanks for clearing that up!
| vessenes wrote:
| But so.much.cheaper.and.faster. Pretty amazing.
| vthallam wrote:
| As if 3 different preview versions of the same model is not
| confusing enough, the last two dates are 05-06 and 06-05. They
| could have held off for a day:)
| tomComb wrote:
| Since those days are ambiguous anyway, they would have had to
| hold off until the 13th.
|
| In Canada, a third of the dates we see are British, and another
| third are American, so it's really confusing. Thankfully y-m-d
| is now a legal format and seems to be gaining ground.
| layer8 wrote:
| > they would have had to hold off until the 13th.
|
| 06-06 is unambiguously after 05-06 regardless of date format.
| declan_roberts wrote:
| Engineers are surprisingly bad at naming things!
| jacob019 wrote:
| I rather like date codes as versions.
| dist-epoch wrote:
| > the last two dates are 05-06 and 06-05
|
| they are clearly trolling OpenAI's 4o and o4 models.
| fragmede wrote:
| ChatGPT itself suggests better names than that!
| oezi wrote:
| Don't repeat the same mistake if you want to troll somebody.
|
| It makes you look even more stupid.
| UncleOxidant wrote:
| At what point will they move from Gemini 2.5 pro to Gemini 2.6
| pro? I'd guess Gemini 3 will be a larger model.
| unsupp0rted wrote:
| Curious to see how this compares to Claude 4 Sonnet in code.
|
| This table seems to indicate it's markedly worse?
|
| https://blog.google/products/gemini/gemini-2-5-pro-latest-pr...
| gundmc wrote:
| Almost all of those benchmarks are coding related. It looks
| like SWE-Bench is the only one where Claude is higher. Hard to
| say which benchmark is most representative of actual work. The
| community seems to like Aider Polyglot from what I've seen
| energy123 wrote:
| So there's both a 05-06 model and a 06-05 model, and the launch
| page for 06-05 has some graphs with benchmarks for the 05-06
| model but without the 06-05 model?
| sergiotapia wrote:
| In Cursor this is called "gemini-2.5-pro-preview-06-05"; you
| have to enable it manually.
| johnfn wrote:
| Impressive seeing Google notch up another ~25 ELO on lmarena, on
| top of the previous #1, which was also Gemini!
|
| That being said, I'm starting to doubt the leaderboards as an
| accurate representation of model ability. While I do think Gemini
| is a good model, having used both Gemini and Claude Opus 4
| extensively in the last couple of weeks I think Opus is in
| another league entirely. I've been dealing with a number of
| gnarly TypeScript issues, and after a bit Gemini would spin in
| circles or actually (I've never seen this before!) give up and
| say it can't do it. Opus solved the same problems with no sweat.
| I know that that's a fairly isolated anecdote and not necessarily
| fully indicative of overall performance, but my experience with
| Gemini is that it would really want to kludge on code in order to
| make things work, where I found Opus would tend to find cleaner
| approaches to the problem. Additionally, Opus just seemed to have
| a greater imagination? Or perhaps it has been tailored to work
| better in agentic scenarios? I saw it do things like dump the DOM
| and inspect it for issues after a particular interaction by
| writing a one-off playwright script, which I found particularly
| remarkable. My experience with Gemini is that it tries to solve
| bugs by reading the code really really hard, which is naturally
| more limited.
|
| Again, I think Gemini is a great model, I'm very impressed with
| what Google has put out, and until 4.0 came out I would have said
| it was the best.
| tempusalaria wrote:
| I agree. I find Claude easily the best model, at least for
| programming, which is the only thing I use LLMs for.
| varunneal wrote:
| Have you tried o3 on those problems? I've found o3 to be much
| more impressive than Opus 4 for all of my use cases.
| johnfn wrote:
| To be honest, I haven't, because the "This model is extremely
| expensive" popup on Cursor makes me a bit anxious - but given
| the accolades here I'll have to give it a shot.
| joshmlewis wrote:
| o3 is still my favorite over even Opus 4 in most cases. I've
| spent hundreds of dollars on AI code gen tools in the last
| month alone and my ranking is:
|
| 1. o3 - it's just really damn good at nuance, getting to the
| core of the goal, and writing the closest thing to quality
| production-level code. The only negatives are its cutoff window
| and cost, especially with its love of tools. That's not
| usually a big deal for the Rails projects I work on, but
| sometimes it is.
|
| 2. Opus 4 via Claude Code - also really good and is my daily
| driver because o3 is so expensive. I will often have Opus 4
| come up with the plan and first pass and then let o3 critique
| and make a list of feedback to make it _really_ good.
|
| 3. Gemini 2.5 Pro - haven't tested this latest release but this
| was my prior #2 before last week. Now I'd say it's tied or
| slightly better than Sonnet 4. Depends on the situation.
|
| 4. Sonnet 4 via Claude Code - it's not bad but needs a lot of
| coaching and oversight to produce really good code. It will
| definitely produce a lot of code if you just let it go do its
| thing, but it won't be the quality, concise, and thoughtful
| code you want without more specific prompting and revisions.
|
| I'm also extremely picky and a bit OCD with code quality and
| organization in projects down to little details with naming,
| reusability, etc. I accept only 33% of suggested code based on
| my Cursor stats from last month. I will often revert and go
| back to refine the prompt before accepting and going down a
| less than optimal path.
| throwaway314155 wrote:
| It's interesting you say that because o3, while being a
| considerable improvement over OpenAI's other models, still
| doesn't match the performance of Opus 4 and Gemini 2.5 Pro by
| a long shot for me.
|
| However, o3 resides in the ChatGPT app, which is still
| superior to the other chat apps in many ways, particularly
| the internet search implementation works very well.
| joshmlewis wrote:
| What languages and IDE do you use it with? I use it in
| Cursor mainly with Max reasoning on. I spent around $300 on
| token-based usage for o3 alone in May, while still only
| accepting around 33% of suggestions. I made a post on X about
| this the other day, but I expect that the number of rejections
| will go down significantly by the end of this year at the
| rate things are going.
| drawnwren wrote:
| Very strange. I find reasoning has very narrow usefulness
| for me. It's great to get a project in context or to get
| oriented in the conversation, but in long conversations I
| find reasoning starts to add way too much extraneous
| stuff and gets distracted from the task at hand.
|
| I think my coding model ranking is something like Claude
| Code > Claude 4 raw > Gemini > big gap > o4-mini > o3
| joshmlewis wrote:
| Claude Code isn't a model in itself. By default it routes
| requests to Opus 4 or Sonnet 4 (but mostly Sonnet 4) unless
| you explicitly set it.
| drawnwren wrote:
| I am aware
| throwaway314155 wrote:
| I'm using it with Python, VS Code (not integrated with
| Claude, just basic Copilot) and Claude Code. For Gemini
| I'm using AI Studio with repomix to package my code into
| a single file. I copy files over manually in that
| workflow.
|
| All subscription based, not per-token pricing. I'm
| currently using Claude Max. Can't see myself exhausting
| its usage at this rate, but who knows.
| svachalek wrote:
| If you're coding through chat apps you're really behind the
| times. Try an agent IDE or plugin.
| joshmlewis wrote:
| Yeah, exactly. For everyone who might not know, the chat
| apps add lots of complex system prompting to handle and
| shape personality, tone, general usability, etc. IDEs
| also do this (with Claude Code being one of the closest
| to a "bare" model that you can get), but they are at
| least guiding its behavior to be really good at coding
| tasks. Another reason is the Agent feature that IDEs have
| had for a few months now, which gives the model the
| ability to search/read/edit files across your codebase.
| You may not like the idea of this and it feels like
| losing control, but it's the future. After months of
| using it I've learned how to get it to do what I want,
| but I think a lot of people who try it once and stop get
| frustrated that it does something dumb and just assume
| it's not good. That's a practice and skill problem, not a
| model problem.
| Workaccount2 wrote:
| IDEs are intimidating to non-tech people.
|
| I'm surprised there isn't a VibeIDE yet that is purpose-
| built to make it possible for your grandmother to execute
| code output by an LLM.
| dragonwriter wrote:
| > I'm surprised there isn't a VibeIDE yet that is purpose-
| built to make it possible for your grandmother to execute
| code output by an LLM.
|
| The major LLM chat interfaces often have code execution
| built in, so there kind of is, it just doesn't look like
| what an SWE thinks of as an IDE.
| joshmlewis wrote:
| I have not used them, but I feel like there are tools
| like Replit, Lovable, etc. that are for that audience. I
| totally agree IDEs are intimidating for non-technical
| people though. Claude Code is pretty cool in that way,
| where it's one command to install and pretty easy to get
| started with.
| baw-bag wrote:
| I am really struggling with this. I tried Cline with both
| OpenAI and Claude, with very weird results, often burning
| through credits to get nowhere or just running out of
| context. I just got Cursor to give it a try, so I can't
| say anything about it yet.
| joshmlewis wrote:
| It's a skill that takes some persistence and trial and
| error. Happy to chat with you about it if you want to
| send me an email.
| baw-bag wrote:
| I really appreciate that. I will see how I get on and may
| well give you a shout. Thank you!
| Vetch wrote:
| There is skill to it but that's certainly not the only
| relevant variable involved. Other important factors are:
|
| Language: Syntax errors rise, and a common form is the
| syntax of a more common language bleeding through.
|
| Domain: Less so than what humans deem complex, quality is
| more strongly controlled by how much code and
| documentation there is for a domain. Interestingly, in a
| less common subdomain a model will often revert to a more
| common approach (for example, working on shaders for a
| game that takes place in a cylinder geometry requires a
| lot more hand-holding than on a plane). It's usually not
| that they can't do it, but that they require much more
| involved prompting to get the context appropriately set
| up, plus managing the drift back to default, more common
| patterns. Related are decisions with long-term
| consequences; LLMs are pretty weak at these. In humans
| this comes with experience, so it's rare and an instance
| of low coverage.
|
| Dates: Related is reverting to obsolete API patterns.
|
| Complexity: While not as dominant as domain coverage,
| complexity does play a role, with the likelihood of error
| rising with complexity.
|
| This means if you're at the intersection of multiple of
| these (such as a low coverage problem in a functional
| language), agent mode will likely be too much of a waste
| for you. But interactive mode can still be highly
| productive.
| throwaway314155 wrote:
| I think this is debatable. But I've used Cursor and
| various extensions for VS Code. They're all fine (but
| cursor can fuck all the way off for stealing the `code`
| shell integration from VS Code) but you don't _need_ an
| IDE as Claude Code has shown us (currently my primary
| method of vibe coding).
|
| It's mostly about the cost though. Things are far more
| affordable in the various apps/subscriptions. Token-
| priced APIs can get very expensive very quickly.
| joshvm wrote:
| An important caveat here is yes, for coding. Apps are
| fine for coming up with one-liners, or doing other
| research. I haven't found the quality of IDE based code
| to be significantly better than what ChatGPT would
| suggest, but it's very useful to ask questions when the
| model has access to both the code and can prompt you to
| run tests which rely on local data (or even attached
| hardware). I really don't trust YOLO mode so I manually
| approve terminal calls.
|
| My impression (with Cursor) is that you need to practice
| some sort of LLM-first design to get the best out of it.
| Either vibe code your way from the start, or be brutal
| about limiting what changes the agent can make without
| your approval. It _does_ force you to be very atomic
| about your requests, which isn't a bad thing, but
| writing a robust spec for the prompt is often slower than
| writing the code by hand and asking for a refactor. As
| soon as kipple, for lack of a better word, sneaks into
| the code, it's a reinforcing signal to the agent that it
| can add more.
|
| It's definitely worth paying the $20 and playing with a
| few different clients. The rabbit hole is pretty deep and
| there's still a ton of prompt engineering suggestions
| from the community. It encourages a lot of creative
| guardrails, like using pre-commit to provide negative
| feedback when the model does something silly like try to
| write a 200 word commit message. I haven't tried
| JetBrains' agent yet (Junie), but that seems like it
| would be a good one to explore as well since it
| presumably integrates directly with the tooling.
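|
| Regarding the pre-commit guardrail mentioned above, here
| is a minimal sketch (the 50-word limit and the plain
| commit-msg hook wiring are assumptions for illustration,
| not a recommendation) of pushing back on overly long
| commit messages:
|
|     #!/usr/bin/env python3
|     # Minimal commit-msg hook sketch: git passes the path to the
|     # commit message file as the first argument; exiting non-zero
|     # rejects the commit and forces the agent to try again.
|     import sys
|
|     def main() -> int:
|         with open(sys.argv[1]) as f:
|             words = f.read().split()
|         if len(words) > 50:  # arbitrary limit
|             print(f"Commit message is {len(words)} words; keep it under 50.")
|             return 1
|         return 0
|
|     if __name__ == "__main__":
|         sys.exit(main())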
| jorvi wrote:
| What's most annoying about Gemini 2.5 is that it is
| obnoxiously verbose compared to Opus 4, both in explaining
| the code it wrote and in the number of lines and comments
| it adds, to the point where the output is often 2-3x more
| than Opus 4's.
|
| You can obviously alleviate this by asking it to be more
| concise but even then it bleeds through sometimes.
| joshmlewis wrote:
| Yes, this is what I mean by conciseness with o3. If
| prompted well it can produce extremely high-quality code
| that blows me away at times. I've also had several
| instances now where I gave it slightly wrong context and
| other models just butchered the solution with dozens of
| lines for the proposed fix, which I could tell wasn't
| right; after reverting and asking o3, it immediately went
| searching for another file I hadn't included and fixed it
| in one line. That kind of, dare I say, independent
| thinking is worth a lot when dealing with complex
| codebases.
| monkpit wrote:
| Have you used Cline with opus+sonnet? Do you have opinions
| about Claude code vs cline+api? Curious to hear your
| thoughts!
| spaceman_2020 wrote:
| I use o3 a lot for basic research and analysis. I also find
| the deep research tool really useful for even basic shopping
| research
|
| Like just today, it made a list of toys for my toddler that
| fit her developmental stage and play style. Would have taken
| me 1-2 hrs of browsing multiple websites otherwise
| pqdbr wrote:
| How do you choose which model to use with Claude Code?
| joshmlewis wrote:
| I have the Max $200 plan so I set it to Opus until it
| limits me to Sonnet 4 which has only happened in two out of
| a few dozen sessions so far. My rule of thumb in Cursor is
| it's worth paying for the Max reasoning models for pretty
| much every request unless it's stupid simple because it
| produces the best code each time without any funny business
| you get with cheaper models.
| sunshinerag wrote:
| You can use the Max plan in Cursor? I thought it didn't
| support calls via API and only worked in Claude Code?
| VeejayRampay wrote:
| we need to stop it with the anecdotal evidence presented by
| one random dude
| vendiddy wrote:
| I find o3 to be the clearest thinker as well.
|
| If I'm working on a complex problem and want to go back and
| forth on software architecture, I like having o3 research
| prior art and have a back and forth on trade-offs.
|
| If o3 was faster and cheaper I'd use it a lot more.
|
| I'm curious what your workflows are!
| Szpadel wrote:
| In my experience this highly depends on the case. For some
| cases Gemini crushed my problem, but in the next one it got
| stuck and couldn't figure out a simple bug.
|
| The same goes for o3 and Sonnet (I haven't tested 4.0 enough
| yet to have an opinion).
|
| I feel that we need better parallel evaluation support, where
| you could evaluate all the top models and decide which one
| provided the best solution.
| lispisok wrote:
| >That being said, I'm starting to doubt the leaderboards as an
| accurate representation of model ability
|
| Goodhart's law applies here just like everywhere else. Much
| more so given how much money these companies are dumping into
| making these models.
| AmazingTurtle wrote:
| For bulk data extraction on personal real-life data, I found
| that even gpt-4o-mini outperforms the latest Gemini models in
| both quality and cost. I would use reasoning models, but their
| JSON schema response is different from the non-reasoning
| models, as in: they cannot deal with union types for optional
| fields when using strict schemas... anyway.
|
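| For illustration, a minimal sketch of the union-type pattern
| being referred to (the field names are made up): under strict
| schemas an "optional" field still has to be listed as required
| and is expressed as a union with null.
|
|     # Hypothetical strict schema: "middle_name" is effectively
|     # optional, expressed as a union type (string or null).
|     schema = {
|         "type": "object",
|         "properties": {
|             "name": {"type": "string"},
|             "middle_name": {"type": ["string", "null"]},
|         },
|         "required": ["name", "middle_name"],
|         "additionalProperties": False,
|     }
|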
| idk what the hype about Gemini is, it's really not that good imho
| baq wrote:
| I've been giving the same tasks to Claude 4 and Gemini 2.5 this
| week, and Gemini provided correct solutions where Claude didn't.
| These weren't hard tasks either; they were e.g. comparing SQL
| queries before/after a rewrite - Gemini found legitimate issues
| where Claude said everything was OK.
| zamadatix wrote:
| I think the only way to be particularly impressed with new
| leading models lately is to hold the opinion all of the
| benchmarks are inaccurate and/or irrelevant and it's
| vibes/anecdotes where the model is really light years ahead.
| Otherwise you look at the numbers on e.g. lmarena and see it's
| claiming a ~16% preference win rate for gpt-3.5-turbo from
| November of 2023 over this new world-leading model from Google.
| johnfn wrote:
| Not sure I follow - Gemini has ELO 1470, GPT3.5-turbo is
| 1206, which is an 86% win rate. https://chatgpt.com/share/684
| 1f69d-b2ec-800c-9f8c-3e802ebbc0...
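|
| For reference, a minimal sketch of the standard Elo expected
| score formula (LMArena actually fits a Bradley-Terry model and
| the ratings move over time, so the exact percentage will differ
| somewhat):
|
|     # Expected score (win probability) of A over B under plain Elo.
|     def elo_expected(rating_a: float, rating_b: float) -> float:
|         return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
|
|     print(elo_expected(1470, 1206))  # ~0.82 for a 264-point gap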
| Workaccount2 wrote:
| People can ask whatever they want on LMArena, so a question
| like "List some good snacks to bring to work" might elicit a
| win for an old/tiny/deprecated model simply because it lists
| the snack the user liked more.
| AstroBen wrote:
| are you saying that's a bad way to judge a model? Not sure
| why we'd want ones that choose bad snacks
| Alifatisk wrote:
| > after a bit Gemini would spin in circles or actually (I've
| never seen this before!) give up and say it can't do it
|
| No way, is there any way to see the dialog or recreate this
| scenario!?
| johnfn wrote:
| The chat was in Cursor, so I don't know a way to provide a
| public link, but here is the last paragraph that it output
| before I (and it) gave up. I honestly could have re-prompted
| it from scratch and maybe it would have gotten it, but at
| this point I was pretty sure that even if it did, it was
| going to make a total mess of things. Note that it was
| iterating on a test failure and had spun through multiple
| attempts at this point:
|
| > Given the persistence of the error despite multiple
| attempts to refine the type definitions, I'm unable to fix
| this specific TypeScript error without a more profound change
| to the type structure or potentially a workaround that might
| compromise type safety or accuracy elsewhere. The current
| type definitions are already quite complex.
|
| The two prior paragraphs, in case you're curious:
|
| > I suspect the issue might be a fundamental limitation or
| bug in how TypeScript is resolving these highly recursive and
| conditional types when they are deeply nested. The type
| system might be "giving up" or defaulting to a less specific
| type ({ __raw: T }) prematurely.
|
| > Since the runtime logic seems to be correctly hydrating the
| nested objects (as the builder.build method recursively calls
| hydrateHelper), the problem is confined to the type system's
| ability to represent this.
|
| I found, as you can see in the first of the prior two
| paragraphs, that Gemini often wanted to claim that the issue
| was on TypeScript's side for some of these more complex
| issues. As proven by Opus, this simply wasn't the case.
| tymonPartyLate wrote:
| I just realized that Opus 4 is the first model that produced
| "beautiful" code for me. Code that is simple, easy to read, not
| polluted with comments, no unnecessary crap, just pretty, clean
| and functional. I had my first "wow" moment with it in a while.
| That being said it occasionally does something absolutely
| stupid. Like completely dumb. And when I ask it "why did you do
| this stupid thing", it replies "oh yeah, you're right, this is
| super wrong, here is an actual working, smart solution"
| (proceeds to create brilliant code)
|
| I do not understand how those machines work.
| Tostino wrote:
| My issue is that every time i've attempted to use Opus 4 to
| solve any problem, I would burn through my usage cap within a
| few min and not have solved the problem yet because it
| misunderstood things about the context and I didn't get the
| prompt quite right yet.
|
| With Sonnet, at least I don't run out of usage before I
| actually get it to understand my problem scope.
| simon1ltd wrote:
| I've also experienced the same, except it produced the same
| stupid code all over again. I usually use one model (doesn't
| matter which) until it starts chasing its tail, then I feed
| it to a different model to have it fix the mistakes by the
| first model.
| diggan wrote:
| > Code that is simple, easy to read, not polluted with
| comments, no unnecessary crap, just pretty, clean and
| functional
|
| I get that with most of the better models I've tried,
| although I'd probably personally favor OpenAI's models
| overall. I think a good system prompt is probably the best
| way there, rather than relying on some "innate" "clean code"
| behavior of specific models. This is a snippet of what I use
| today for coding guidelines: https://gist.github.com/victorb/
| 1fe62fe7b80a64fc5b446f82d313...
|
| > That being said it occasionally does something absolutely
| stupid. Like completely dumb
|
| That's a bit tougher, but you have to carefully read through
| exactly what you said, and try to figure out what might have
| led it down the wrong path, or what you could have said in
| the first place for it to avoid that. Try to work it into your
| system prompt, then slowly build up your system prompt so
| every one-shot gets closer and closer to being perfect on
| every first try.
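|
| For anyone wondering what that looks like mechanically, a
| minimal, purely illustrative sketch (the guidelines, model
| name, and prompt below are made up; this is not the content
| of the linked gist) of shipping coding guidelines as a
| system prompt:
|
|     # Illustrative only: a few made-up coding guidelines sent as
|     # the system prompt via the OpenAI client.
|     from openai import OpenAI
|
|     GUIDELINES = (
|         "Prefer small, pure functions. "
|         "Do not add explanatory inline comments for obvious code. "
|         "Never rename existing identifiers unless asked."
|     )
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|     resp = client.chat.completions.create(
|         model="gpt-4.1",  # assumed model name
|         messages=[
|             {"role": "system", "content": GUIDELINES},
|             {"role": "user", "content": "Refactor this function: ..."},
|         ],
|     )
|     print(resp.choices[0].message.content)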
| batrat wrote:
| What I like about Gemini is the search function, which is very,
| very good compared to others. I was blown away when I asked it
| to compose an email for a company that was sending spam to our
| domain. It literally searched and found not only the abuse
| email of the hosting company but all the info about the domain
| and the host (MX servers, IP owners, datacenters, etc.). Also,
| if you want to convert a research paper into a podcast, it does
| it instantly and it's fun to listen to.
| cwbriscoe wrote:
| I haven't tried all of the favorites, just what is available
| with Jetbrains AI, but I can say that Gemini 2.5 is very good
| with Go. I guess that makes sense in a way.
| tomr75 wrote:
| How does it have access to the DOM? Are you using it with
| Cursor/browser MCP?
| emehex wrote:
| Is this "kingfall"?
| paisanashapyaar wrote:
| No, Kingfall is a separate model which is supposed to deliver
| slightly better performance, around 2.5% to 5% improvement over
| this.
| Workaccount2 wrote:
| Sundar tweeted a lion so it's probably goldmane. Kingfall is
| probably their deep think model, and they might wait for O3 pro
| to drop so they can swing back.
| pelorat wrote:
| Why not call it Gemini 2.6?
| MallocVoidstar wrote:
| Beta, beta, release candidate (this version)
| laweijfmvo wrote:
| because the plethora of models and versions is getting
| ridiculous, and for anyone who's not following LLM news daily,
| you have no clue what to use. There was never a "Google Search
| 2.6.4 04-13". You just went to google.com and searched.
| johnfn wrote:
| Well, Google Search never released an API that millions of
| people depended on.
| AISnakeOil wrote:
| These api models are for developers. Gemini is for consumers.
| Szpadel wrote:
| Next year maybe? They do not have the year in the version, so
| they will need to bump the number to make sure you can just
| sort by name.
| Workaccount2 wrote:
| Apparently 06-05 bridges the gap that people were feeling between
| the 03-25 and 05-06 release[1]
|
| [1]https://nitter.net/OfficialLoganK/status/1930657743251349854..
| .
| hu3 wrote:
| I pay for both ChatGPT Plus and Gemini Pro.
|
| I'm thinking of cancelling my ChatGPT subscription because I keep
| hitting rate limits.
|
| Meanwhile I have yet to hit any rate limit with Gemini/AI Studio.
| oofbaroomf wrote:
| I think AI Studio uses the API, so rate limits are extremely
| high and almost impossible for a normal human to reach if using
| the paid preview model.
| staticman2 wrote:
| As far as I know AI Studio is always free, even on paid
| accounts, and you can definitely hit the rate limit.
| Squarex wrote:
| I much prefer Gemini over ChatGPT, but they recently introduced
| a limit of 100 messages a day on the Pro plan :( AI Studio is
| probably still fine
| MisterPea wrote:
| I've heard it's only on mobile? I was using Gemini for work
| on desktop for at least 6 hours yesterday (definitely over
| 100 back-and-forths) and did not get hit with any rate
| limits.
|
| Either way, Google's transparency with this is very poor - I
| saw the limits from a VP's tweet.
| fermentation wrote:
| Is there a reason not to just use the API through openrouter or
| something?
| HenriNext wrote:
| AI Studio uses your API account behind the scenes, and it is
| subject to normal API limits. When you sign up for AI Studio,
| it creates a Google Cloud free-tier project with a "gen-lang-
| client-" prefix behind the scenes. You can link a billing
| account at the bottom of the "get an API key" page.
|
| Also note that AI studio via default free tier API access
| doesn't seem to fall within "commercial use" in Google's terms
| of service, which would mean that your prompts can be reviewed
| by humans and used for training. All info AFAIK.
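|
| As a concrete illustration of "AI Studio uses your API
| account", here is a minimal sketch of hitting the same
| generateContent endpoint directly with that key (not an
| official snippet; the model ID below is the one from this
| thread and may change):
|
|     import requests
|
|     API_KEY = "YOUR_API_KEY"  # key from the "get an API key" page
|     MODEL = "gemini-2.5-pro-preview-06-05"  # assumed preview ID
|     url = (
|         "https://generativelanguage.googleapis.com/v1beta/models/"
|         f"{MODEL}:generateContent?key={API_KEY}"
|     )
|
|     resp = requests.post(
|         url, json={"contents": [{"parts": [{"text": "Hello"}]}]}
|     )
|     resp.raise_for_status()
|     print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])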
| tibbar wrote:
| Interesting, I just learned about matharena.ai. Google cherry-
| picks one result where they're the best here, but in the
| overall results it's still o3 and o4-mini-high that are in the
| lead.
| pu_pe wrote:
| I just checked, and it looks like the limit for Jules has been
| bumped from 5 free daily tasks to 60. Not sure it uses the
| latest model, but I would assume it does.
| kristianp wrote:
| https://jules.google/
| ChrisArchitect wrote:
| Blog post: https://blog.google/products/gemini/gemini-2-5-pro-
| latest-pr...
|
| (https://news.ycombinator.com/item?id=44192954)
| xnx wrote:
| That's a much better link
| abraxas wrote:
| I found all the previous Gemini models somewhat inferior even
| compared to Claude 3.7 Sonnet (and much worse than 4) as my
| coding assistants. I'm keeping an open mind but also not rushing
| to try this one until some evaluations roll in. I'm actually
| baffled that the internet at large seems to be very pumped about
| Gemini but it's not reflective of my personal experience. Not to
| be that tinfoil hat guy but I smell at least a bit of astroturf
| activity around Gemini.
| bachmeier wrote:
| > I'm actually baffled that the internet at large seems to be
| very pumped about Gemini but it's not reflective of my personal
| experience. Not to be that tinfoil hat guy but I smell at least
| a bit of astroturf activity around Gemini.
|
| I haven't used Claude, but Gemini has always returned better
| answers to general questions relative to ChatGPT or Copilot. My
| impression, which could be wrong, is that Gemini is better in
| situations that are a substitute for search. How do I do this
| on the command line, tell me about this product, etc. all give
| better results, sometimes much better, on Gemini.
| dist-epoch wrote:
| You should try Grok then. It's by far the best when searching
| is required, especially if you enable DeepSearch.
| Take8435 wrote:
| I don't really want to use the X platform. What's the best
| alternative? Claude?
| praveer13 wrote:
| I've honestly had consistently the opposite experience for
| general questions. Also, for images, Gemini just hallucinates
| like crazy. ChatGPT, even on the free tier, gives perfectly
| correct answers, and I'm on Gemini Pro. I canceled it
| yesterday because of this.
| strobe wrote:
| I'm switching a lot between Sonnet and Gemini in Aider - for
| some reason only one of the models is able to solve some of my
| coding problems, and I don't see any pattern that could tell me
| upfront which one I should use for a specific need.
| throwaway314155 wrote:
| My experience has been that Gemini's code (and even
| conversation) is a little bit uglier in general - but that the
| code tends to solve the issue you asked with fewer
| hallucinations.
|
| I can't speak to it now - have mostly been using Claude Code w/
| Opus 4 recently.
| 3abiton wrote:
| > I found all the previous Gemini models somewhat inferior even
| compared to Claude 3.7 Sonnet (and much worse than 4) as my
| coding assistants.
|
| What are your use cases? That's really not my experience;
| Claude disappoints in data science and complex ETL requests in
| Python. o3, on the other hand, really is phenomenal.
| abraxas wrote:
| Backend Python code, Postgres database. Front end:
| React/Next.js. A very common stack in 2025. Using LLMs in
| assist mode (not as agents) for enhancing an existing code
| base that weighs in under 1MM LoC. So not a greenfield
| project anymore, but not a huge amount of legacy cruft either.
| tiahura wrote:
| As a lawyer, Claude 4 is the best writer, and usually, but not
| always, the leader in legal reasoning. That said, o3 often
| grinds out the best response, and Gemini seems to be the most
| exhaustive researcher.
| Fergusonb wrote:
| I think they are fairly interchangeable. In Roo Code, Claude
| uses the tools better, but I prefer Gemini's coding style and
| brevity (except for comments, it loves to write comments).
| Sometimes I mix and match if one fails or pursues a path I
| don't like.
| verall wrote:
| I think it's just very dependent on what you're doing. Claude
| 3.5/3.7 Sonnet (thinking or not) were just absolutely terrible
| at almost anything I asked of it (C/C++/Make/CMake). Like
| constantly giving wrong facts, generating code that could never
| work, hallucinating syntax and APIs, thinking about something
| then concluding the opposite, etc. Gemini 2.5-pro and o3 (even
| old o1-preview, o1-mini) were miles better. I haven't used
| Claude 4 yet.
|
| But everyone is using them for different things and it doesn't
| always generalize. Maybe Claude was great at typescript or ruby
| or something else I don't do. But for some of us, it definitely
| was not astroturf for Gemini. My whole team was talking about
| how much better it was.
| wiradikusuma wrote:
| I have two issues with Gemini that I don't experience with
| Claude: 1. It RENAMES VARIABLES even in places where I don't
| tell it to make changes (I pass them just as context), and 2.
| sometimes it's missing closing square brackets.
|
| Sure, I'm a lazy bum, I call the variable "json" instead of
| "jsonStringForX", but it's contextual (within a closure or
| function), and I appreciate the feedback, but it makes
| reviewing the changes difficult (too much noise).
| 93po wrote:
| I've noticed that ChatGPT will 100% ignore certain
| instructions, and I wonder if it's just an LLM thing. For
| example, I can scream and yell in caps at ChatGPT to not use em
| or en dashes, and if anything that makes it use them even _more_.
| I've literally never once made it successfully not use them,
| even when it ignored it the first time and my follow-up is
| "output the same thing again but NO EM or EN DASHES!"
|
| I've not tested this thoroughly; it's just my anecdotal
| experience over like a dozen attempts.
| creesch wrote:
| There are some things so ubiquitous in the training data that
| it is really difficult to tell models not to do them, simply
| because it is so ingrained in their core training. Em dashes
| are apparently one of those things.
|
| It's something I read a little while ago in a larger article,
| but I can't remember which article it was.
| tacotime wrote:
| I wonder if using the character itself in the directions,
| instead of the name for the character, might help with this.
|
| Something like, "Forbidden character list: [--, -]" or "Do
| NOT use the characters '--' or '-' in any of your output"
| danielbln wrote:
| Gemini loves to add idiotic non-functional inline comments.
|
| "# Added this function" "# Changed this to fix the issue"
|
| No, I know, I was there! This is what commit messages are for,
| not comments that are only relevant in one PR.
| oezi wrote:
| And it sure loves removing your carefully inserted comments
| for human readers.
| macNchz wrote:
| I love when I ask it to remove things and it doesn't want to
| truly let go, so it leaves a comment instead:
| # Removed iterMod variable here because it is no longer
| needed.
|
| It's like it spent too much time hanging out with an engineer
| who doesn't trust version control and prefers to just comment
| everything out.
|
| Still enjoying Gemini 2.5 Pro more than Claude Sonnet these
| days, though, purely on vibes.
| Workaccount2 wrote:
| I think it is likely that the comments are more for the model
| than for the user. I would not be even slightly surprised if
| verbose coding versions outperformed light commenting
| versions.
| xmprt wrote:
| On the other hand, I'm skeptical if that has any impact
| because these models have thinking tokens where they can
| put all those comments and attention shouldn't care about
| how close the tokens are as long as they're within the
| context window.
| xtracto wrote:
| I have a very clear example of Gemini getting it wrong:
|
| For code like this, it keeps changing
| processing_class=tokenizer to "tokenizer=tokenizer", even
| though the parameter was renamed and even after adding the
| all-caps comment:
|
|     # Set up the SFTTrainer
|     print("Setting up SFTTrainer...")
|     trainer = SFTTrainer(
|         model=model,
|         train_dataset=train_dataset,
|         args=sft_config,
|         processing_class=tokenizer,  # DO NOT CHANGE. THIS IS NOW THE CORRECT PROPERTY NAME
|     )
|     print("SFTTrainer ready.")
|
| I haven't tried with this latest version, but the 05-06 pro
| still did it wrong.
| diggan wrote:
| Do you have instructions in the system prompt to actually not
| edit lines that have comments about not editing them? Had that
| happen to me too, where code comments were ignored, and adding
| instructions about actually following code comments helped.
| But different models, so YMMV.
| AaronAPU wrote:
| I find o1-pro, which nobody ever mentions, is in the top spot
| along with Gemini. But Gemini is an absolute mess to work with
| because it constantly adds tons of comments and changes
| unrelated code.
|
| It is worth it sometimes, but usually I use it to explore ideas
| and then have o1-pro spit out a perfect solution ready to diff,
| test, and merge.
| carbocation wrote:
| Is it possible to know which model version their chat app (
| https://gemini.google.com/app ) is using?
| chollida1 wrote:
| I'd start to worry about OpenAI, from a valuation standpoint. The
| company has some serious competition now and is arguably no
| longer the leader.
|
| It's going to be interesting to see how easily they can raise
| more money. Their valuation is already in the $300B range. How
| much larger can it get, given their relatively paltry revenue
| at the moment and ever-rising costs for hardware and
| electricity?
|
| If the next generation of LLMs needs new data sources, then
| Facebook and Google seem well positioned there; OpenAI, on the
| other hand, seems like it's going to lose the race for
| proprietary data sets, since unlike those other two, they
| don't have another business that generates such data.
|
| When they were the leader in both research and in user facing
| applications they certainly deserved their lofty valuation.
|
| What is new money coming into OpenAI getting now?
|
| At even a $300B valuation a typical wall street analysts would
| want to value them at 2x sales which would mean they'd expect
| OpenAI to have $600B in annual sales to account for this
| valuation when they go public.
|
| Or at an extremely lofty P/E ratio of, say, 100, that would be
| $3B in annual earnings, which analysts would have to expect to
| double each year for the next 10ish years, a la AMZN in the
| 2000s, to justify this valuation.
|
| They seem to have boxed themselves into a corner where it will be
| painful to go public, assuming they can ever figure out the
| nonprofit/profit issue their company has.
|
| Congrats to Google here, they have done great work and look like
| they'll be one of the biggest winners of the AI race.
| ketzo wrote:
| OpenAI has already forecast _$12B_ in revenue by the end of
| _this_ year.
|
| I agree that Google is well-positioned, but the
| mindshare/product advantage OpenAI has gives them a stupendous
| amount of leeway
| chollida1 wrote:
| Agreed, it's the doubling of that each year for the next 4-5
| years that I see as being difficult.
| Workaccount2 wrote:
| The hurdle for OpenAI is going to be on the profit side.
| Google has their own hardware acceleration and their own data
| centers. OpenAI has to pay a monopolist for hardware
| acceleration and is beholden to another tech giant for data
| centers. Never mind that Google can customize its hardware
| specifically for its models.
|
| The only way for OpenAI to really get ahead on solid ground
| is to discover some sort of absolute game changer (new
| architecture, new algorithm) and manage to keep it bottled
| away.
| geodel wrote:
| OpenAI has now partnered with Jony Ive, and they are
| going to have the thinnest data centers with the thinnest
| servers mounted on the thinnest racks. And since everything
| is so thin, servers can just whisper to each other instead
| of communicating via fat cables.
|
| I think that will be the game changer OpenAI will show us
| soon.
| falloon wrote:
| All servers will have a single thunderbolt port.
| diggan wrote:
| > OpenAI has to pay a monopolist for hardware acceleration
| and beholden to another tech giant for data centers.
|
| Don't they have a data center in progress as we speak?
| Seems by now they're planning on building not just one huge
| data center in Texas, but more in other countries too.
| VeejayRampay wrote:
| the leeway comes from the grotesque fanboyism the company
| benefits from
|
| they haven't been number one for quite some time and still
| people can't stop presenting them as the leaders
| ketzo wrote:
| People said much the same thing about Apple for decades,
| and they're a $3T company; not a bad thing to have fans.
|
| Plus, it's a consumer product; it doesn't matter if people
| are "presenting them as leaders", it matters if hundreds of
| millions of totally average people will open their
| computers and use the product. OpenAI has that.
| Rudybega wrote:
| I think OpenAI has projected 12.7B in revenue this year and
| 29.4B in 2026.
|
| Edit: I am dumb, ignore the second half of my post.
| eamag wrote:
| isn't P/E about earnings, not revenue?
| Rudybega wrote:
| You are correct. I need some coffee.
| jadbox wrote:
| Currently I only find OpenAI to be clearly better for image
| generation: like illustrations, comics, or photo editing for
| home project ideation.
| energy123 wrote:
| Even if they're winning the AI race, their search business is
| still going to be cannibalized, and it's unclear if they'll be
| able to extract any economic rents from AI thanks to market
| competition. Of course they have no choice but to compete, but
| they probably would have preferred the pre-AI status quo of
| unquestioned monopoly and eyeballs on ads.
| xmprt wrote:
| Historically, every company has failed by not adapting to new
| technologies and trying to protect their core business (eg.
| Kodak, Blockbuster, Blackberry, Intel, etc). I applaud Google
| for going against their instincts and actively trying to
| disrupt their cash cow in order to gain an advantage in the
| AI race.
| qeternity wrote:
| > At even a $300B valuation a typical wall street analysts
| would want to value them at 2x sales which would mean they'd
| expect OpenAI to have $600B in annual sales to account for this
| valuation when they go public.
|
| Lmfao where did you get this from? Microsoft has less than half
| of that revenue, and is valued at more than 10x OpenAI.
|
| Revenue is not the metric by which these companies are
| valued...
| orionsbelt wrote:
| I think it's too early to say they are not the leader given
| they have o3 pro and GPT 5 coming out within the next month or
| two. Only if those are not impressive would I start to consider
| that they have lost their edge.
|
| Although it does feel likely that at minimum, they are neck and
| neck with Google and others.
| ed_mercer wrote:
| Source for gpt 5 coming out soon?
| sebzim4500 wrote:
| >At even a $300B valuation a typical wall street analysts would
| want to value them at 2x sales which would mean they'd expect
| OpenAI to have $600B in annual sales to account for this
| valuation when they go public.
|
| What? Apple has a revenue of 400B and a market cap of 3T
| raincole wrote:
| > At even a $300B valuation a typical wall street analysts
| would want to value them at 2x sales which would mean they'd
| expect OpenAI to have $600B in annual sales to account for this
| valuation when they go public.
|
| Even Google doesn't have $600B revenue. Sorry, it sounds like
| numbers pulled from someone's rear.
| jstummbillig wrote:
| There is some serious confusion about the strength of OpenAIs
| position.
|
| "chatgpt" is a verb. People have no idea what claude or gemini
| are, and they will not be interested in it, unless something
| absolutely _fantastic_ happens. Being a little better will do
| absolutely nothing to convince normal people to change product
| (the little moat that ChatGPT has simply by virtue of chat
| history is probably enough from a convenience standpoint, add
| memories and no super obvious path to export/import either and
| you are done here).
|
| All that OpenAI would have to do, to easily be worth their
| valuation eventually, is to optimize and not become
| _offensively bad_ to their, what, 500 million active users.
| And, if we assume the current paradigm that everyone is working
| with is here to stay, why would they? Instead of leading (as
| they have done so far, for the most part) they can at any point
| simply do what others have resorted to successfully and copy
| with a slight delay. People won't care.
| aeyes wrote:
| Google has a text input box on google.com, as soon as this
| gives similar responses there is no need for the average user
| to use ChatGPT anymore.
|
| I already see lots of normal people share screenshots of the
| AI Overview responses.
| jstummbillig wrote:
| You are skipping over the part where you need to bring
| normal people, especially young normal people, back to
| google.com for them to see anything at all on google.com.
| Hundreds of millions of them don't go there anymore.
| askafriend wrote:
| As the other poster mentioned, young people are not going
| there. What happens when they grow up?
| candiddevmike wrote:
| ChatGPT is going to be Kleenex'd. They wasted their first
| mover advantage. Replace ChatGPT's interface with any other
| LLM and most users won't be able to tell the difference.
| potatolicious wrote:
| I think this pretty substantially overstates ChatGPT's
| stickiness. Just because something is widely (if not
| universally) known doesn't mean it's universally _used_ , or
| that such usage is sticky.
|
| For example, I had occasion to chat with a relative who's
| still in high school recently, and was curious what the
| situation was in their classrooms re: AI.
|
| tl;dr: LLM use is basically universal, _but ChatGPT is not
| the favored tool_. The favored tools are LLMs/apps
| specifically marketed as study/homework aids.
|
| It seems like the market is fine with seeking specific LLMs
| for specific kinds of tasks, as opposed to some omni-LLM one-
| stop-shop that does everything. The market has _already_ and
| rapidly moved beyond from ChatGPT.
|
| Not to mention I am willing to bet that Gemini has
| _radically_ more usage than OpenAI 's models simply by virtue
| of being plugged into Google Search. There are distribution
| effects, I just don't think OpenAI has the strongest
| position!
|
| I think OpenAI has _some_ first-mover advantage, I just
| don't think it's anywhere near as durable (nor as large) as
| you're making it out to be.
| Oleksa_dr wrote:
| I was tempted by the ratings and immediately paid for a
| subscription to Gemini 2.5. Half an hour later, I canceled the
| subscription and got a refund. This is the laziest and
| stupidest LLM. What it was supposed to do itself, it told me to
| do on my own. And when analyzing simple short documents, it
| pulled up some completely strange documents from the Internet
| unrelated to the topic. Even local LLMs (3B) were not this
| stupid and lazy.
| PantaloonFlames wrote:
| > At even a $300B valuation a typical wall street analysts
| would want to value them at 2x sales which would mean they'd
| expect OpenAI to have $600B in annual sales to account for this
| valuation when they go public.
|
| Oops I think you may have flipped the numerator and the
| denominator there, if I'm understanding you. Valuation of 300B
| , if 2x sales, would imply 150B sales.
|
| Probably your point still stands.
| lxe wrote:
| Gemini is a good and fast model, but I think the style of code it
| writes is... amateur / inexperienced. It doesn't make a lot of
| mistakes typical of an LLM, but rather chooses approaches that
| are typical of someone who just learned programming. I have to
| always nudge it to avoid verbosity, keep structure less
| repetitive, optimize async code, etc. With claude, I rarely have
| this problem -- it feels more like working with a more
| experienced developer.
| jdmoreira wrote:
| Is there a no brainer alternative to Claude Code where I can try
| other models?
| ketzo wrote:
| People quite like aider! I'm not as much of a fan of the CLI
| workflow but it's quite comparable, I think.
| jdmoreira wrote:
| I've heard about it but is the outcome as good as claude
| code?
| kristianp wrote:
| I enjoy using Aider, but it's not agentic: it can't run your
| tests for you, for example.
| simianwords wrote:
| I feel stupid for asking but how do I enable deepthink?
| koakuma-chan wrote:
| They added a thinking section in AI studio
| simianwords wrote:
| True but it's greyed out. Not sure if this is "deep think"
| johnnyApplePRNG wrote:
| General first impressions are that it's not as capable as 05-06,
| although it's technically testing better on the leaderboards...
| interesting.
| Alifatisk wrote:
| Finally Google is advertising their AI Studio; it's a shame
| they didn't push that beautiful app before.
| consumer451 wrote:
| Man, if the benchmarks are to be believed, this is a lifeline for
| Windsurf as Anthropic becomes less and less friendly.
|
| However, in my personal experience Sonnet 3.x has still been king
| so far. Will be interesting to watch this unfold. At this point,
| it's still looking grim for Windsurf.
| _pdp_ wrote:
| Is it still rate limited though?
| zone411 wrote:
| Improves on the Extended NYT Connections benchmark compared to
| both Gemini 2.5 Pro Exp (03-25) and Gemini 2.5 Pro Preview
| (05-06), scoring 58.7. The decline observed between 03-25 and
| 05-06 has been reversed -
| https://github.com/lechmazur/nyt-connections/.
| bli940505 wrote:
| I'm confused by the naming. It advertises itself as "Thinking" so
| is this the release of the new "Deep Think" model or not?
___________________________________________________________________
(page generated 2025-06-05 23:01 UTC)