[HN Gopher] Gemini-2.5-pro-preview-06-05
___________________________________________________________________
Gemini-2.5-pro-preview-06-05
Author : jcuenod
Score : 288 points
Date : 2025-06-05 16:44 UTC (6 hours ago)
(HTM) web link (deepmind.google)
(TXT) w3m dump (deepmind.google)
| jbellis wrote:
| Did it get upgraded in-place again or do you need to opt in to
| the new model?
| unpwn wrote:
| I feel like instead of constantly releasing these preview
| versions with different dates attached they should just add a
| patch version and bump that.
| impulser_ wrote:
| They can't, because if someone has built something around that
| version, they don't want that model replaced with a new one
| that could produce different results.
| nsriv wrote:
| Looking at you, Anthropic. 4.0 is markedly different from 3.7
| in my experience.
| Aeolun wrote:
| The model name is completely different? How do you
| accidentally switch from 3.7 to 4.0?
| jfoster wrote:
| In what way are dates better than integers at preventing that
| kind of mistake?
| dist-epoch wrote:
| Except Google did exactly that with the previous release,
| where they silently redirected 03-25 requests to 05-06.
| op00to wrote:
| I found Gemini 2.5 Pro highly useful for text summaries, and even
| reasoning in long conversations... UP TO the last 2 weeks or
| month. Recently, it seems to totally forget what I'm talking
| about after 4-5 messages of a paragraph of text each. We're not
| talking huge amounts of context, but conversational
| braindeadness. Between ChatGPT's sycophancy, Gemini's
| forgetfulness and poor attention, I'm just sticking with whatever
| local model du jour fits my needs and whatever crap my company is
| paying for today. It's super annoying, hopefully Gemini gets its
| memory back!
| energy123 wrote:
| I believe it's intentionally nerfed if you use it through the
| app. Once you use Gemini for a long time you realize they have
| a number of dark patterns to deter heavy users but maintain the
| experience for light users. These dark patterns are:
|
| - "Something went wrong error" after too many prompts in a day.
| This was an undocumented rate limit because it never occurs
| earlier in the day and will immediately disappear if you
| subscribe for and use a new paid account, but it won't
| disappear if you make a new free account, and the error going
| away is strictly tied to how long you wait. Users complained
| about this for over a year. Of course they lied about the real
| reasons for this error, and it was never fixed until a few days
| ago when they rug pulled paying users by introducing actual
| documented tight rate limits.
|
| - "You've been signed out" error if the model has exceeded its
| output token budget (or runtime duration) for a single
| inference, so you can't do things like what Anthropic
| recommends where you coax the model to think longer.
|
| - I have less definitive evidence for this but I would not be
| surprised if they programmatically nerf the reasoning effort
| parameter for multiturn conversations. I have no other
| explanation for why the chain of thought fails to generate for
| small context multiturn chats but will consistently generate
| for ultra long context singleturn chats.
| op00to wrote:
| Right! I feel like it will sail through MBs of text data, but
| remembering what I said two turns ago is just too much.
| harrisoned wrote:
| I noticed that same behavior across older Gemini models. I
| built a chatbot at work around 1.5 Flash, and one day it
| suddenly started behaving like that. It was perfect before,
| but afterwards it always greeted the user as if it were their
| first chat, despite me sending the history. And I didn't find
| any changelog regarding that at the time.
|
| After that I moved to OpenAI; Gemini models just seem
| unreliable in that regard.
| 85392_school wrote:
| This might be because Gemini silently updates checkpoints
| (1.5 001 -> 1.5 002, 2.5 0325 -> 2.5 0506 -> 2.5 0605) while
| OpenAI doesn't update them without ensuring that they're
| uniformly better and typically emails customers when they are
| updated.
| jcuenod wrote:
| 82.2 on Aider
|
| Still actually falling behind the official scores for o3 high.
| https://aider.chat/docs/leaderboards/
| sottol wrote:
| Does 82.2 correspond to the "Percent correct" of the other
| models?
|
| Not sure if OpenAI has updated O3, but it looks like "pure" o3
| (high) has a score of 79.6% in the linked table, "o3 (high) +
| gpt-4.1" combo has a the highest score of 82.7%.
|
| The previous Gemini 2.5 Pro Preview 05-06 (yea, not current
| 06-05!) was at 76.9%.
|
| That looks like a pretty nice bump!
|
| But either way, these Aider benchmarks seem to be the most
| useful/trustworthy benchmarks currently and really the only
| ones I'm paying attention to.
| hobofan wrote:
| That's the older 05-06 preview, not the new one from today.
| energy123 wrote:
| They knew that. The 82.2 comes from the new benchmarks in the
| OP not from the aider url. The aider url was supplied for
| comparison.
| hobofan wrote:
| Ah, thanks for clearing that up!
| vessenes wrote:
| But so.much.cheaper.and.faster. Pretty amazing.
| vthallam wrote:
| As if 3 different preview versions of the same model is not
| confusing enough, the last two dates are 05-06 and 06-05. They
| could have held off for a day:)
| tomComb wrote:
| Since those days are ambiguous anyway, they would have had to
| hold off until the 13th.
|
| In Canada, a third of the dates we see are British, and another
| third are American, so it's really confusing. Thankfully y-m-d
| is now a legal format and seems to be gaining ground.
| layer8 wrote:
| > they would have had to hold off until the 13th.
|
| 06-06 is unambiguously after 05-06 regardless of date format.
| declan_roberts wrote:
| Engineers are surprisingly bad at naming things!
| jacob019 wrote:
| I rather like date codes as versions.
| dist-epoch wrote:
| > the last two dates are 05-06 and 06-05
|
| they are clearly trolling OpenAI's 4o and o4 models.
| fragmede wrote:
| ChatGPT itself suggests better names than that!
| oezi wrote:
| Don't repeat the same mistake if you want to troll somebody.
|
| It makes you look even more stupid.
| UncleOxidant wrote:
| At what point will they move from Gemini 2.5 pro to Gemini 2.6
| pro? I'd guess Gemini 3 will be a larger model.
| unsupp0rted wrote:
| Curious to see how this compares to Claude 4 Sonnet in code.
|
| This table seems to indicate it's markedly worse?
|
| https://blog.google/products/gemini/gemini-2-5-pro-latest-pr...
| gundmc wrote:
| Almost all of those benchmarks are coding related. It looks
| like SWE-Bench is the only one where Claude is higher. Hard to
| say which benchmark is most representative of actual work. The
| community seems to like Aider Polyglot from what I've seen
| energy123 wrote:
| So there's both a 05-06 model and a 06-05 model, and the launch
| page for 06-05 has some graphs with benchmarks for the 05-06
| model but without the 06-05 model?
| sergiotapia wrote:
| In Cursor this is called "gemini-2.5-pro-preview-06-05"; you
| have to enable it manually.
| johnfn wrote:
| Impressive seeing Google notch up another ~25 ELO on lmarena, on
| top of the previous #1, which was also Gemini!
|
| That being said, I'm starting to doubt the leaderboards as an
| accurate representation of model ability. While I do think Gemini
| is a good model, having used both Gemini and Claude Opus 4
| extensively in the last couple of weeks I think Opus is in
| another league entirely. I've been dealing with a number of
| gnarly TypeScript issues, and after a bit Gemini would spin in
| circles or actually (I've never seen this before!) give up and
| say it can't do it. Opus solved the same problems with no sweat.
| I know that that's a fairly isolated anecdote and not necessarily
| fully indicative of overall performance, but my experience with
| Gemini is that it would really want to kludge on code in order to
| make things work, where I found Opus would tend to find cleaner
| approaches to the problem. Additionally, Opus just seemed to have
| a greater imagination? Or perhaps it has been tailored to work
| better in agentic scenarios? I saw it do things like dump the DOM
| and inspect it for issues after a particular interaction by
| writing a one-off playwright script, which I found particularly
| remarkable. My experience with Gemini is that it tries to solve
| bugs by reading the code really really hard, which is naturally
| more limited.
|
| Again, I think Gemini is a great model, I'm very impressed with
| what Google has put out, and until 4.0 came out I would have said
| it was the best.
| tempusalaria wrote:
| I agree. I find Claude easily the best model, at least for
| programming, which is the only thing I use LLMs for.
| varunneal wrote:
| Have you tried o3 on those problems? I've found o3 to be much
| more impressive than Opus 4 for all of my use cases.
| johnfn wrote:
| To be honest, I haven't, because the "This model is extremely
| expensive" popup on Cursor makes me a bit anxious - but given
| the accolades here I'll have to give it a shot.
| joshmlewis wrote:
| o3 is still my favorite over even Opus 4 in most cases. I've
| spent hundreds of dollars on AI code gen tools in the last
| month alone and my ranking is:
|
| 1. o3 - it's just really damn good at nuance, getting to the
| core of the goal, and writing the closest thing to quality
| production-level code. The only negatives are its cutoff window
| and cost, especially with its love of tools. That's not
| usually a big deal for the Rails projects I work on, but
| sometimes it is.
|
| 2. Opus 4 via Claude Code - also really good and is my daily
| driver because o3 is so expensive. I will often have Opus 4
| come up with the plan and first pass and then let o3 critique
| and make a list of feedback to make it _really_ good.
|
| 3. Gemini 2.5 Pro - haven't tested this latest release but this
| was my prior #2 before last week. Now I'd say it's tied or
| slightly better than Sonnet 4. Depends on the situation.
|
| 4. Sonnet 4 via Claude Code - it's not bad but needs a lot of
| coaching and oversight to produce really good code. It will
| definitely produce a lot of code if you just let it go do its
| thing, but it won't be the quality, concise, and thoughtful
| code you want without more specific prompting and revisions.
|
| I'm also extremely picky and a bit OCD with code quality and
| organization in projects down to little details with naming,
| reusability, etc. I accept only 33% of suggested code based on
| my Cursor stats from last month. I will often revert and go
| back to refine the prompt before accepting and going down a
| less than optimal path.
| throwaway314155 wrote:
| It's interesting you say that because o3, while being a
| considerable improvement over OpenAI's other models, still
| doesn't match the performance of Opus 4 and Gemini 2.5 Pro by
| a long shot for me.
|
| However, o3 resides in the ChatGPT app, which is still
| superior to the other chat apps in many ways, particularly
| the internet search implementation works very well.
| joshmlewis wrote:
| What languages and IDE do you use it with? I use it in
| Cursor mainly with Max reasoning on. I spent around $300 on
| token-based usage for o3 alone in May, while still only
| accepting around 33% of suggestions. I made a post on X about
| this the other day, but I expect that the number of rejections
| will go down significantly by the end of this year at the
| rate things are going.
| drawnwren wrote:
| Very strange. I find reasoning has very narrow usefulness
| for me. It's great to get a project in context or to get
| oriented in the conversation, but in long conversations I
| find reasoning starts to add way too much extraneous
| stuff and gets distracted from the task at hand.
|
| I think my coding model ranking is something like Claude
| Code > Claude 4 raw > Gemini > big gap > o4-mini > o3
| joshmlewis wrote:
| Claude Code isn't a model in itself. By default it routes
| requests to Opus 4 or Sonnet 4 (but mostly Sonnet 4) unless
| you explicitly set it.
| drawnwren wrote:
| I am aware
| throwaway314155 wrote:
| I'm using it with Python, VS Code (not integrated with
| Claude, just basic Copilot) and Claude Code. For Gemini
| I'm using AI Studio with repomix to package my code into
| a single file. I copy files over manually in that
| workflow.
|
| All subscription based, not per-token pricing. I'm
| currently using Claude Max. Can't see myself exhausting
| its usage at this rate, but who knows.
| svachalek wrote:
| If you're coding through chat apps you're really behind the
| times. Try an agent IDE or plugin.
| joshmlewis wrote:
| Yeah, exactly. For everyone who might not know, the chat
| apps add lots of complex system prompting to handle and
| shape personality, tone, general usability, etc. IDEs
| also do this (with Claude Code being one of the closest
| to a "bare" model that you can get), but they are at
| least guiding its behavior to be really good at coding
| tasks. Another reason is the Agent feature that IDEs have
| had for a few months now, which gives the model the
| ability to search/read/edit files across your codebase.
| You may not like the idea of this and it feels like
| losing control, but it's the future. After months of
| using it I've learned how to get it to do what I want,
| but I think a lot of people who try it once and stop get
| frustrated that it does something dumb and just assume
| it's not good. That's a practice and skill problem, not a
| model problem.
| Workaccount2 wrote:
| IDEs are intimidating to non-tech people.
|
| I'm surprised there isn't a VibeIDE yet that is purpose-
| built to make it possible for your grandmother to execute
| code output by an LLM.
| dragonwriter wrote:
| > I'm surprised there isn't a VibeIDE yet that is purpose-
| built to make it possible for your grandmother to execute
| code output by an LLM.
|
| The major LLM chat interfaces often have code execution
| built in, so there kind of is, it just doesn't look like
| what an SWE thinks of as an IDE.
| joshmlewis wrote:
| I have not used them, but I feel like there are tools
| like Replit, Lovable, etc. that are for that audience. I
| totally agree IDEs are intimidating for non-technical
| people though. Claude Code is pretty cool in that way,
| where it's one command to install and pretty easy to get
| started with.
| baw-bag wrote:
| I am really struggling with this. I tried Cline with both
| OpenAI and Claude, with very weird results, often burning
| through credits to get nowhere or just running out of
| context. I just got Cursor to give it a try, so I can't
| say anything about it yet.
| joshmlewis wrote:
| It's a skill that takes some persistence and trial and
| error. Happy to chat with you about it if you want to
| send me an email.
| baw-bag wrote:
| I really appreciate that. I will see how I get on and may
| well give you a shout. Thank you!
| Vetch wrote:
| There is skill to it but that's certainly not the only
| relevant variable involved. Other important factors are:
|
| Language: Syntax errors rise, and a common form is the
| syntax of a more common language bleeding through.
|
| Domain: Less so than what humans deem complex, quality is
| more strongly controlled by how much code and
| documentation there is for a domain. Interestingly, in a
| less common subdomain a model will often revert to a more
| common approach (for example, working on shaders for a
| game that takes place in a cylinder geometry requires a
| lot more hand-holding than on a plane). It's usually not
| that they can't do it, but that they require much more
| involved prompting to get the context appropriately set
| up, plus managing the drift back to default, more common
| patterns. Related are decisions with long-term
| consequences; LLMs are pretty weak at these. In humans
| this comes with experience, so it's rare and an instance
| of low coverage.
|
| Dates: Related is reverting to obsolete API patterns.
|
| Complexity: While not as dominant as domain coverage,
| complexity does play a role, with the likelihood of error
| rising with complexity.
|
| This means if you're at the intersection of multiple of
| these (such as a low coverage problem in a functional
| language), agent mode will likely be too much of a waste
| for you. But interactive mode can still be highly
| productive.
| throwaway314155 wrote:
| I think this is debatable. But I've used Cursor and
| various extensions for VS Code. They're all fine (but
| cursor can fuck all the way off for stealing the `code`
| shell integration from VS Code) but you don't _need_ an
| IDE as Claude Code has shown us (currently my primary
| method of vibe coding).
|
| It's mostly about the cost though. Things are far more
| affordable in the various apps/subscriptions. Token-
| priced APIs can get very expensive very quickly.
| joshvm wrote:
| An important caveat here is yes, for coding. Apps are
| fine for coming up with one-liners, or doing other
| research. I haven't found the quality of IDE based code
| to be significantly better than what ChatGPT would
| suggest, but it's very useful to ask questions when the
| model has access to both the code and can prompt you to
| run tests which rely on local data (or even attached
| hardware). I really don't trust YOLO mode so I manually
| approve terminal calls.
|
| My impression (with Cursor) is that you need to practice
| some sort of LLM-first design to get the best out of it.
| Either vibe code your way from the start, or be brutal
| about limiting what changes the agent can make without
| your approval. It _does_ force you to be very atomic
| about your requests, which isn't a bad thing, but
| writing a robust spec for the prompt is often slower than
| writing the code by hand and asking for a refactor. As
| soon as kipple, for lack of a better word, sneaks into
| the code, it's a reinforcing signal to the agent that it
| can add more.
|
| It's definitely worth paying the $20 and playing with a
| few different clients. The rabbit hole is pretty deep and
| there's still a ton of prompt engineering suggestions
| from the community. It encourages a lot of creative
| guardrails, like using pre-commit to provide negative
| feedback when the model does something silly like try to
| write a 200 word commit message. I haven't tried
| JetBrains' agent yet (Junie), but that seems like it
| would be a good one to explore as well since it
| presumably integrates directly with the tooling.
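|
| Regarding the pre-commit guardrail mentioned above, here
| is a minimal sketch (the 50-word limit and the plain
| commit-msg hook wiring are assumptions for illustration,
| not a recommendation) of pushing back on overly long
| commit messages:
|
|     #!/usr/bin/env python3
|     # Minimal commit-msg hook sketch: git passes the path to the
|     # commit message file as the first argument; exiting non-zero
|     # rejects the commit and forces the agent to try again.
|     import sys
|
|     def main() -> int:
|         with open(sys.argv[1]) as f:
|             words = f.read().split()
|         if len(words) > 50:  # arbitrary limit
|             print(f"Commit message is {len(words)} words; keep it under 50.")
|             return 1
|         return 0
|
|     if __name__ == "__main__":
|         sys.exit(main())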
| jorvi wrote:
| What's most annoying about Gemini 2.5 is that it is
| obnoxiously verbose compared to Opus 4, both in explaining
| the code it wrote and in the number of lines and comments
| it adds, to the point where the output is often 2-3x more
| than Opus 4's.
|
| You can obviously alleviate this by asking it to be more
| concise but even then it bleeds through sometimes.
| joshmlewis wrote:
| Yes, this is what I mean by conciseness with o3. If
| prompted well it can produce extremely high-quality code
| that blows me away at times. I've also had several
| instances now where I gave it slightly wrong context and
| other models just butchered the solution with dozens of
| lines for the proposed fix, which I could tell wasn't
| right; after reverting and asking o3, it immediately went
| searching for another file I hadn't included and fixed it
| in one line. That kind of, dare I say, independent
| thinking is worth a lot when dealing with complex
| codebases.
| monkpit wrote:
| Have you used Cline with opus+sonnet? Do you have opinions
| about Claude code vs cline+api? Curious to hear your
| thoughts!
| spaceman_2020 wrote:
| I use o3 a lot for basic research and analysis. I also find
| the deep research tool really useful for even basic shopping
| research
|
| Like just today, it made a list of toys for my toddler that
| fit her developmental stage and play style. Would have taken
| me 1-2 hrs of browsing multiple websites otherwise
| pqdbr wrote:
| How do you choose which model to use with Claude Code?
| joshmlewis wrote:
| I have the Max $200 plan so I set it to Opus until it
| limits me to Sonnet 4 which has only happened in two out of
| a few dozen sessions so far. My rule of thumb in Cursor is
| it's worth paying for the Max reasoning models for pretty
| much every request unless it's stupid simple because it
| produces the best code each time without any funny business
| you get with cheaper models.
| sunshinerag wrote:
| You can use the Max plan in Cursor? I thought it didn't
| support calls via API and only worked in Claude Code?
| VeejayRampay wrote:
| we need to stop it with the anecdotal evidence presented by
| one random dude
| vendiddy wrote:
| I find o3 to be the clearest thinker as well.
|
| If I'm working on a complex problem and want to go back and
| forth on software architecture, I like having o3 research
| prior art and have a back and forth on trade-offs.
|
| If o3 was faster and cheaper I'd use it a lot more.
|
| I'm curious what your workflows are!
| Szpadel wrote:
| In my experience this highly depends on the case. For some
| cases Gemini crushed my problem, but in the next one it got
| stuck and couldn't figure out a simple bug.
|
| The same goes for o3 and Sonnet (I haven't tested 4.0 enough
| yet to have an opinion).
|
| I feel that we need better parallel evaluation support, where
| you could evaluate all the top models and decide which one
| provided the best solution.
| lispisok wrote:
| >That being said, I'm starting to doubt the leaderboards as an
| accurate representation of model ability
|
| Goodhart's law applies here just like everywhere else. Much
| more so given how much money these companies are dumping into
| making these models.
| AmazingTurtle wrote:
| For bulk data extraction on personal real-life data, I found
| that even gpt-4o-mini outperforms the latest Gemini models in
| both quality and cost. I would use reasoning models, but their
| JSON schema response is different from the non-reasoning
| models, as in: they cannot deal with union types for optional
| fields when using strict schemas... anyway.
|
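| For illustration, a minimal sketch of the union-type pattern
| being referred to (the field names are made up): under strict
| schemas an "optional" field still has to be listed as required
| and is expressed as a union with null.
|
|     # Hypothetical strict schema: "middle_name" is effectively
|     # optional, expressed as a union type (string or null).
|     schema = {
|         "type": "object",
|         "properties": {
|             "name": {"type": "string"},
|             "middle_name": {"type": ["string", "null"]},
|         },
|         "required": ["name", "middle_name"],
|         "additionalProperties": False,
|     }
|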
| idk what the hype about Gemini is, it's really not that good imho
| baq wrote:
| I've been giving the same tasks to Claude 4 and Gemini 2.5 this
| week, and Gemini provided correct solutions where Claude didn't.
| These weren't hard tasks either; they were e.g. comparing SQL
| queries before/after a rewrite - Gemini found legitimate issues
| where Claude said everything was OK.
| zamadatix wrote:
| I think the only way to be particularly impressed with new
| leading models lately is to hold the opinion all of the
| benchmarks are inaccurate and/or irrelevant and it's
| vibes/anecdotes where the model is really light years ahead.
| Otherwise you look at the numbers on e.g. lmarena and see it's
| claiming a ~16% preference win rate for gpt-3.5-turbo from
| November of 2023 over this new world-leading model from Google.
| johnfn wrote:
| Not sure I follow - Gemini has ELO 1470, GPT3.5-turbo is
| 1206, which is an 86% win rate. https://chatgpt.com/share/684
| 1f69d-b2ec-800c-9f8c-3e802ebbc0...
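|
| For reference, a minimal sketch of the standard Elo expected
| score formula (LMArena actually fits a Bradley-Terry model and
| the ratings move over time, so the exact percentage will differ
| somewhat):
|
|     # Expected score (win probability) of A over B under plain Elo.
|     def elo_expected(rating_a: float, rating_b: float) -> float:
|         return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
|
|     print(elo_expected(1470, 1206))  # ~0.82 for a 264-point gap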
| Workaccount2 wrote:
| People can ask whatever they want on LMArena, so a question
| like "List some good snacks to bring to work" might elicit a
| win for an old/tiny/deprecated model simply because it lists
| the snack the user liked more.
| AstroBen wrote:
| are you saying that's a bad way to judge a model? Not sure
| why we'd want ones that choose bad snacks
| Alifatisk wrote:
| > after a bit Gemini would spin in circles or actually (I've
| never seen this before!) give up and say it can't do it
|
| No way, is there any way to see the dialog or recreate this
| scenario!?
| johnfn wrote:
| The chat was in Cursor, so I don't know a way to provide a
| public link, but here is the last paragraph that it output
| before I (and it) gave up. I honestly could have re-prompted
| it from scratch and maybe it would have gotten it, but at
| this point I was pretty sure that even if it did, it was
| going to make a total mess of things. Note that it was
| iterating on a test failure and had spun through multiple
| attempts at this point:
|
| > Given the persistence of the error despite multiple
| attempts to refine the type definitions, I'm unable to fix
| this specific TypeScript error without a more profound change
| to the type structure or potentially a workaround that might
| compromise type safety or accuracy elsewhere. The current
| type definitions are already quite complex.
|
| The two prior paragraphs, in case you're curious:
|
| > I suspect the issue might be a fundamental limitation or
| bug in how TypeScript is resolving these highly recursive and
| conditional types when they are deeply nested. The type
| system might be "giving up" or defaulting to a less specific
| type ({ __raw: T }) prematurely.
|
| > Since the runtime logic seems to be correctly hydrating the
| nested objects (as the builder.build method recursively calls
| hydrateHelper), the problem is confined to the type system's
| ability to represent this.
|
| I found, as you can see in the first of the prior two
| paragraphs, that Gemini often wanted to claim that the issue
| was on TypeScript's side for some of these more complex
| issues. As proven by Opus, this simply wasn't the case.
| tymonPartyLate wrote:
| I just realized that Opus 4 is the first model that produced
| "beautiful" code for me. Code that is simple, easy to read, not
| polluted with comments, no unnecessary crap, just pretty, clean
| and functional. I had my first "wow" moment with it in a while.
| That being said it occasionally does something absolutely
| stupid. Like completely dumb. And when I ask it "why did you do
| this stupid thing", it replies "oh yeah, you're right, this is
| super wrong, here is an actual working, smart solution"
| (proceeds to create brilliant code)
|
| I do not understand how those machines work.
| Tostino wrote:
| My issue is that every time i've attempted to use Opus 4 to
| solve any problem, I would burn through my usage cap within a
| few min and not have solved the problem yet because it
| misunderstood things about the context and I didn't get the
| prompt quite right yet.
|
| With Sonnet, at least I don't run out of usage before I
| actually get it to understand my problem scope.
| simon1ltd wrote:
| I've also experienced the same, except it produced the same
| stupid code all over again. I usually use one model (doesn't
| matter which) until it starts chasing its tail, then I feed
| it to a different model to have it fix the mistakes by the
| first model.
| diggan wrote:
| > Code that is simple, easy to read, not polluted with
| comments, no unnecessary crap, just pretty, clean and
| functional
|
| I get that with most of the better models I've tried,
| although I'd probably personally favor OpenAI's models
| overall. I think a good system prompt is probably the best
| way there, rather than relying on some "innate" "clean code"
| behavior of specific models. This is a snippet of what I use
| today for coding guidelines: https://gist.github.com/victorb/
| 1fe62fe7b80a64fc5b446f82d313...
|
| > That being said it occasionally does something absolutely
| stupid. Like completely dumb
|
| That's a bit tougher, but you have to carefully read through
| exactly what you said, and try to figure out what might have
| led it down the wrong path, or what you could have said in
| the first place for it to avoid that. Try to work it into your
| system prompt, then slowly build up your system prompt so
| every one-shot gets closer and closer to being perfect on
| every first try.
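|
| For anyone wondering what that looks like mechanically, a
| minimal, purely illustrative sketch (the guidelines, model
| name, and prompt below are made up; this is not the content
| of the linked gist) of shipping coding guidelines as a
| system prompt:
|
|     # Illustrative only: a few made-up coding guidelines sent as
|     # the system prompt via the OpenAI client.
|     from openai import OpenAI
|
|     GUIDELINES = (
|         "Prefer small, pure functions. "
|         "Do not add explanatory inline comments for obvious code. "
|         "Never rename existing identifiers unless asked."
|     )
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the environment
|     resp = client.chat.completions.create(
|         model="gpt-4.1",  # assumed model name
|         messages=[
|             {"role": "system", "content": GUIDELINES},
|             {"role": "user", "content": "Refactor this function: ..."},
|         ],
|     )
|     print(resp.choices[0].message.content)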
| batrat wrote:
| What I like about Gemini is the search function, which is very,
| very good compared to others. I was blown away when I asked it
| to compose an email for a company that was sending spam to our
| domain. It literally searched and found not only the abuse
| email of the hosting company but all the info about the domain
| and the host (MX servers, IP owners, datacenters, etc.). Also,
| if you want to convert a research paper into a podcast, it does
| it instantly and it's fun to listen to.
| cwbriscoe wrote:
| I haven't tried all of the favorites, just what is available
| with Jetbrains AI, but I can say that Gemini 2.5 is very good
| with Go. I guess that makes sense in a way.
| tomr75 wrote:
| How does it have access to the DOM? Are you using it with
| Cursor/browser MCP?
| emehex wrote:
| Is this "kingfall"?
| paisanashapyaar wrote:
| No, Kingfall is a separate model which is supposed to deliver
| slightly better performance, around 2.5% to 5% improvement over
| this.
| Workaccount2 wrote:
| Sundar tweeted a lion so it's probably goldmane. Kingfall is
| probably their deep think model, and they might wait for O3 pro
| to drop so they can swing back.
| pelorat wrote:
| Why not call it Gemini 2.6?
| MallocVoidstar wrote:
| Beta, beta, release candidate (this version)
| laweijfmvo wrote:
| because the plethora of models and versions is getting
| ridiculous, and for anyone who's not following LLM news daily,
| you have no clue what to use. There was never a "Google Search
| 2.6.4 04-13". You just went to google.com and searched.
| johnfn wrote:
| Well, Google Search never released an API that millions of
| people depended on.
| AISnakeOil wrote:
| These api models are for developers. Gemini is for consumers.
| Szpadel wrote:
| Next year maybe? They do not have the year in the version, so
| they will need to bump the number to make sure you can just
| sort by name.
| Workaccount2 wrote:
| Apparently 06-05 bridges the gap that people were feeling between
| the 03-25 and 05-06 release[1]
|
| [1]https://nitter.net/OfficialLoganK/status/1930657743251349854..
| .
| hu3 wrote:
| I pay for both ChatGPT Plus and Gemini Pro.
|
| I'm thinking of cancelling my ChatGPT subscription because I keep
| hitting rate limits.
|
| Meanwhile I have yet to hit any rate limit with Gemini/AI Studio.
| oofbaroomf wrote:
| I think AI Studio uses the API, so rate limits are extremely
| high and almost impossible for a normal human to reach if using
| the paid preview model.
| staticman2 wrote:
| As far as I know AI Studio is always free, even on paid
| accounts, and you can definitely hit the rate limit.
| Squarex wrote:
| I much prefer Gemini over ChatGPT, but they recently introduced
| a limit of 100 messages a day on the Pro plan :( AI Studio is
| probably still fine
| MisterPea wrote:
| I've heard it's only on mobile? I was using Gemini for work
| on desktop for at least 6 hours yesterday (definitely over
| 100 back-and-forths) and did not get hit with any rate
| limits.
|
| Either way, Google's transparency with this is very poor - I
| saw the limits from a VP's tweet.
| fermentation wrote:
| Is there a reason not to just use the API through openrouter or
| something?
| HenriNext wrote:
| AI Studio uses your API account behind the scenes, and it is
| subject to normal API limits. When you sign up for AI Studio,
| it creates a Google Cloud free-tier project with a "gen-lang-
| client-" prefix behind the scenes. You can link a billing
| account at the bottom of the "get an API key" page.
|
| Also note that AI studio via default free tier API access
| doesn't seem to fall within "commercial use" in Google's terms
| of service, which would mean that your prompts can be reviewed
| by humans and used for training. All info AFAIK.
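|
| As a concrete illustration of "AI Studio uses your API
| account", here is a minimal sketch of hitting the same
| generateContent endpoint directly with that key (not an
| official snippet; the model ID below is the one from this
| thread and may change):
|
|     import requests
|
|     API_KEY = "YOUR_API_KEY"  # key from the "get an API key" page
|     MODEL = "gemini-2.5-pro-preview-06-05"  # assumed preview ID
|     url = (
|         "https://generativelanguage.googleapis.com/v1beta/models/"
|         f"{MODEL}:generateContent?key={API_KEY}"
|     )
|
|     resp = requests.post(
|         url, json={"contents": [{"parts": [{"text": "Hello"}]}]}
|     )
|     resp.raise_for_status()
|     print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])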
| tibbar wrote:
| Interesting, I just learned about matharena.ai. Google cherry-
| picks one result where they're the best here, but in the
| overall results it's still o3 and o4-mini-high that are in the
| lead.
| pu_pe wrote:
| I just checked, and it looks like the limit for Jules has been
| bumped from 5 free daily tasks to 60. Not sure it uses the
| latest model, but I would assume it does.
| kristianp wrote:
| https://jules.google/
| ChrisArchitect wrote:
| Blog post: https://blog.google/products/gemini/gemini-2-5-pro-
| latest-pr...
|
| (https://news.ycombinator.com/item?id=44192954)
| xnx wrote:
| That's a much better link
| abraxas wrote:
| I found all the previous Gemini models somewhat inferior even
| compared to Claude 3.7 Sonnet (and much worse than 4) as my
| coding assistants. I'm keeping an open mind but also not rushing
| to try this one until some evaluations roll in. I'm actually
| baffled that the internet at large seems to be very pumped about
| Gemini but it's not reflective of my personal experience. Not to
| be that tinfoil hat guy but I smell at least a bit of astroturf
| activity around Gemini.
| bachmeier wrote:
| > I'm actually baffled that the internet at large seems to be
| very pumped about Gemini but it's not reflective of my personal
| experience. Not to be that tinfoil hat guy but I smell at least
| a bit of astroturf activity around Gemini.
|
| I haven't used Claude, but Gemini has always returned better
| answers to general questions relative to ChatGPT or Copilot. My
| impression, which could be wrong, is that Gemini is better in
| situations that are a substitute for search. How do I do this
| on the command line, tell me about this product, etc. all give
| better results, sometimes much better, on Gemini.
| dist-epoch wrote:
| You should try Grok then. It's by far the best when searching
| is required, especially if you enable DeepSearch.
| Take8435 wrote:
| I don't really want to use the X platform. What's the best
| alternative? Claude?
| praveer13 wrote:
| I've honestly had consistently the opposite experience for
| general questions. Also, for images, Gemini just hallucinates
| like crazy. ChatGPT, even on the free tier, gives perfectly
| correct answers, and I'm on Gemini Pro. I canceled it
| yesterday because of this.
| strobe wrote:
| I'm switching a lot between Sonnet and Gemini in Aider - for
| some reason only one of the models is able to solve some of my
| coding problems, and I don't see any pattern that could tell me
| upfront which one I should use for a specific need.
| throwaway314155 wrote:
| My experience has been that Gemini's code (and even
| conversation) is a little bit uglier in general - but that the
| code tends to solve the issue you asked with fewer
| hallucinations.
|
| I can't speak to it now - have mostly been using Claude Code w/
| Opus 4 recently.
| 3abiton wrote:
| > I found all the previous Gemini models somewhat inferior even
| compared to Claude 3.7 Sonnet (and much worse than 4) as my
| coding assistants.
|
| What are your use cases? That's really not my experience;
| Claude disappoints in data science and complex ETL requests in
| Python. o3, on the other hand, really is phenomenal.
| abraxas wrote:
| Backend Python code, Postgres database. Front end:
| React/Next.js. A very common stack in 2025. Using LLMs in
| assist mode (not as agents) for enhancing an existing code
| base that weighs in under 1MM LoC. So not a greenfield
| project anymore, but not a huge amount of legacy cruft either.
| tiahura wrote:
| As a lawyer, Claude 4 is the best writer, and usually, but not
| always, the leader in legal reasoning. That said, o3 often
| grinds out the best response, and Gemini seems to be the most
| exhaustive researcher.
| Fergusonb wrote:
| I think they are fairly interchangeable. In Roo Code, Claude
| uses the tools better, but I prefer Gemini's coding style and
| brevity (except for comments, it loves to write comments).
| Sometimes I mix and match if one fails or pursues a path I
| don't like.
| verall wrote:
| I think it's just very dependent on what you're doing. Claude
| 3.5/3.7 Sonnet (thinking or not) were just absolutely terrible
| at almost anything I asked of it (C/C++/Make/CMake). Like
| constantly giving wrong facts, generating code that could never
| work, hallucinating syntax and APIs, thinking about something
| then concluding the opposite, etc. Gemini 2.5-pro and o3 (even
| old o1-preview, o1-mini) were miles better. I haven't used
| Claude 4 yet.
|
| But everyone is using them for different things and it doesn't
| always generalize. Maybe Claude was great at typescript or ruby
| or something else I don't do. But for some of us, it definitely
| was not astroturf for Gemini. My whole team was talking about
| how much better it was.
| wiradikusuma wrote:
| I have two issues with Gemini that I don't experience with
| Claude: 1. It RENAMES VARIABLES even in places where I don't
| tell it to make changes (I pass them just as context), and 2.
| sometimes it's missing closing square brackets.
|
| Sure, I'm a lazy bum, I call the variable "json" instead of
| "jsonStringForX", but it's contextual (within a closure or
| function), and I appreciate the feedback, but it makes
| reviewing the changes difficult (too much noise).
| 93po wrote:
| I've noticed that ChatGPT will 100% ignore certain
| instructions, and I wonder if it's just an LLM thing. For
| example, I can scream and yell in caps at ChatGPT to not use em
| or en dashes, and if anything that makes it use them even _more_.
| I've literally never once made it successfully not use them,
| even when it ignored it the first time and my follow-up is
| "output the same thing again but NO EM or EN DASHES!"
|
| I've not tested this thoroughly; it's just my anecdotal
| experience over like a dozen attempts.
| creesch wrote:
| There are some things so ubiquitous in the training data that
| it is really difficult to tell models not to do them, simply
| because it is so ingrained in their core training. Em dashes
| are apparently one of those things.
|
| It's something I read a little while ago in a larger article,
| but I can't remember which article it was.
| tacotime wrote:
| I wonder if using the character itself in the directions,
| instead of the name for the character, might help with this.
|
| Something like, "Forbidden character list: [--, -]" or "Do
| NOT use the characters '--' or '-' in any of your output"
| danielbln wrote:
| Gemini loves to add idiotic non-functional inline comments.
|
| "# Added this function" "# Changed this to fix the issue"
|
| No, I know, I was there! This is what commit messages are for,
| not comments that are only relevant in one PR.
| oezi wrote:
| And it sure loves removing your carefully inserted comments
| for human readers.
| macNchz wrote:
| I love when I ask it to remove things and it doesn't want to
| truly let go, so it leaves a comment instead:
| # Removed iterMod variable here because it is no longer
| needed.
|
| It's like it spent too much time hanging out with an engineer
| who doesn't trust version control and prefers to just comment
| everything out.
|
| Still enjoying Gemini 2.5 Pro more than Claude Sonnet these
| days, though, purely on vibes.
| Workaccount2 wrote:
| I think it is likely that the comments are more for the model
| than for the user. I would not be even slightly surprised if
| verbose coding versions outperformed light commenting
| versions.
| xmprt wrote:
| On the other hand, I'm skeptical if that has any impact
| because these models have thinking tokens where they can
| put all those comments and attention shouldn't care about
| how close the tokens are as long as they're within the
| context window.
| xtracto wrote:
| I have a very clear example of Gemini getting it wrong:
|
| For code like this, it keeps changing
| processing_class=tokenizer to "tokenizer=tokenizer", even
| though the parameter was renamed and even after adding the
| all-caps comment:
|
|     # Set up the SFTTrainer
|     print("Setting up SFTTrainer...")
|     trainer = SFTTrainer(
|         model=model,
|         train_dataset=train_dataset,
|         args=sft_config,
|         processing_class=tokenizer,  # DO NOT CHANGE. THIS IS NOW THE CORRECT PROPERTY NAME
|     )
|     print("SFTTrainer ready.")
|
| I haven't tried with this latest version, but the 05-06 pro
| still did it wrong.
| diggan wrote:
| Do you have instructions in the system prompt to actually not
| edit lines that have comments about not editing them? Had that
| happen to me too, where code comments were ignored, and adding
| instructions about actually following code comments helped.
| But different models, so YMMV.
| AaronAPU wrote:
| I find o1-pro, which nobody ever mentions, is in the top spot
| along with Gemini. But Gemini is an absolute mess to work with
| because it constantly adds tons of comments and changes
| unrelated code.
|
| It is worth it sometimes, but usually I use it to explore ideas
| and then have o1-pro spit out a perfect solution ready to diff,
| test, and merge.
| carbocation wrote:
| Is it possible to know which model version their chat app (
| https://gemini.google.com/app ) is using?
| chollida1 wrote:
| I'd start to worry about OpenAI, from a valuation standpoint. The
| company has some serious competition now and is arguably no
| longer the leader.
|
| It's going to be interesting to see how easily they can raise
| more money. Their valuation is already in the $300B range. How
| much larger can it get, given their relatively paltry revenue
| at the moment and ever-rising costs for hardware and
| electricity?
|
| If the next generation of LLMs needs new data sources, then
| Facebook and Google seem well positioned there; OpenAI, on the
| other hand, seems like it's going to lose the race for
| proprietary data sets, since unlike those other two, they
| don't have another business that generates such data.
|
| When they were the leader in both research and in user facing
| applications they certainly deserved their lofty valuation.
|
| What is new money coming into OpenAI getting now?
|
| At even a $300B valuation a typical wall street analysts would
| want to value them at 2x sales which would mean they'd expect
| OpenAI to have $600B in annual sales to account for this
| valuation when they go public.
|
| Or at an extremely lofty P/E ratio of, say, 100, that would be
| $3B in annual earnings, which analysts would have to expect to
| double each year for the next 10ish years, a la AMZN in the
| 2000s, to justify this valuation.
|
| They seem to have boxed themselves into a corner where it will be
| painful to go public, assuming they can ever figure out the
| nonprofit/profit issue their company has.
|
| Congrats to Google here, they have done great work and look like
| they'll be one of the biggest winners of the AI race.
| ketzo wrote:
| OpenAI has already forecast _$12B_ in revenue by the end of
| _this_ year.
|
| I agree that Google is well-positioned, but the
| mindshare/product advantage OpenAI has gives them a stupendous
| amount of leeway
| chollida1 wrote:
| Agreed, it's the doubling of that each year for the next 4-5
| years that I see as being difficult.
| Workaccount2 wrote:
| The hurdle for OpenAI is going to be on the profit side.
| Google has their own hardware acceleration and their own data
| centers. OpenAI has to pay a monopolist for hardware
| acceleration and is beholden to another tech giant for data
| centers. Never mind that Google can customize its hardware
| specifically for its models.
|
| The only way for OpenAI to really get ahead on solid ground
| is to discover some sort of absolute game changer (new
| architecture, new algorithm) and manage to keep it bottled
| away.
| geodel wrote:
| OpenAI has now partnered with Jony Ive, and they are
| going to have the thinnest data centers with the thinnest
| servers mounted on the thinnest racks. And since everything
| is so thin, servers can just whisper to each other instead
| of communicating via fat cables.
|
| I think that will be the game changer OpenAI will show us
| soon.
| falloon wrote:
| All servers will have a single thunderbolt port.
| diggan wrote:
| > OpenAI has to pay a monopolist for hardware acceleration
| and beholden to another tech giant for data centers.
|
| Don't they have a data center in progress as we speak?
| Seems by now they're planning on building not just one huge
| data center in Texas, but more in other countries too.
| VeejayRampay wrote:
| the leeway comes from the grotesque fanboyism the company
| benefits from
|
| they haven't been number one for quite some time and still
| people can't stop presenting them as the leaders
| ketzo wrote:
| People said much the same thing about Apple for decades,
| and they're a $3T company; not a bad thing to have fans.
|
| Plus, it's a consumer product; it doesn't matter if people
| are "presenting them as leaders", it matters if hundreds of
| millions of totally average people will open their
| computers and use the product. OpenAI has that.
| Rudybega wrote:
| I think OpenAI has projected 12.7B in revenue this year and
| 29.4B in 2026.
|
| Edit: I am dumb, ignore the second half of my post.
| eamag wrote:
| isn't P/E about earnings, not revenue?
| Rudybega wrote:
| You are correct. I need some coffee.
| jadbox wrote:
| Currently I only find OpenAI to be clearly better for image
| generation: like illustrations, comics, or photo editing for
| home project ideation.
| energy123 wrote:
| Even if they're winning the AI race, their search business is
| still going to be cannibalized, and it's unclear if they'll be
| able to extract any economic rents from AI thanks to market
| competition. Of course they have no choice but to compete, but
| they probably would have preferred the pre-AI status quo of
| unquestioned monopoly and eyeballs on ads.
| xmprt wrote:
| Historically, every company has failed by not adapting to new
| technologies and trying to protect their core business (eg.
| Kodak, Blockbuster, Blackberry, Intel, etc). I applaud Google
| for going against their instincts and actively trying to
| disrupt their cash cow in order to gain an advantage in the
| AI race.
| qeternity wrote:
| > At even a $300B valuation a typical wall street analysts
| would want to value them at 2x sales which would mean they'd
| expect OpenAI to have $600B in annual sales to account for this
| valuation when they go public.
|
| Lmfao where did you get this from? Microsoft has less than half
| of that revenue, and is valued at more than 10x OpenAI.
|
| Revenue is not the metric by which these companies are
| valued...
| orionsbelt wrote:
| I think it's too early to say they are not the leader given
| they have o3 pro and GPT 5 coming out within the next month or
| two. Only if those are not impressive would I start to consider
| that they have lost their edge.
|
| Although it does feel likely that at minimum, they are neck and
| neck with Google and others.
| ed_mercer wrote:
| Source for gpt 5 coming out soon?
| sebzim4500 wrote:
| >At even a $300B valuation a typical wall street analysts would
| want to value them at 2x sales which would mean they'd expect
| OpenAI to have $600B in annual sales to account for this
| valuation when they go public.
|
| What? Apple has a revenue of 400B and a market cap of 3T
| raincole wrote:
| > At even a $300B valuation a typical wall street analysts
| would want to value them at 2x sales which would mean they'd
| expect OpenAI to have $600B in annual sales to account for this
| valuation when they go public.
|
| Even Google doesn't have $600B revenue. Sorry, it sounds like
| numbers pulled from someone's rear.
| jstummbillig wrote:
| There is some serious confusion about the strength of OpenAIs
| position.
|
| "chatgpt" is a verb. People have no idea what claude or gemini
| are, and they will not be interested in it, unless something
| absolutely _fantastic_ happens. Being a little better will do
| absolutely nothing to convince normal people to change product
| (the little moat that ChatGPT has simply by virtue of chat
| history is probably enough from a convenience standpoint, add
| memories and no super obvious path to export/import either and
| you are done here).
|
| All that OpenAI would have to do, to easily be worth their
| valuation eventually, is to optimize and not become
| _offensively bad_ to their, what, 500 million active users.
| And, if we assume the current paradigm that everyone is working
| with is here to stay, why would they? Instead of leading (as
| they have done so far, for the most part) they can at any point
| simply do what others have resorted to successfully and copy
| with a slight delay. People won't care.
| aeyes wrote:
| Google has a text input box on google.com, as soon as this
| gives similar responses there is no need for the average user
| to use ChatGPT anymore.
|
| I already see lots of normal people share screenshots of the
| AI Overview responses.
| jstummbillig wrote:
| You are skipping over the part where you need to bring
| normal people, especially young normal people, back to
| google.com for them to see anything at all on google.com.
| Hundreds of millions of them don't go there anymore.
| askafriend wrote:
| As the other poster mentioned, young people are not going
| there. What happens when they grow up?
| candiddevmike wrote:
| ChatGPT is going to be Kleenex'd. They wasted their first
| mover advantage. Replace ChatGPT's interface with any other
| LLM and most users won't be able to tell the difference.
| potatolicious wrote:
| I think this pretty substantially overstates ChatGPT's
| stickiness. Just because something is widely (if not
| universally) known doesn't mean it's universally _used_ , or
| that such usage is sticky.
|
| For example, I had occasion to chat with a relative who's
| still in high school recently, and was curious what the
| situation was in their classrooms re: AI.
|
| tl;dr: LLM use is basically universal, _but ChatGPT is not
| the favored tool_. The favored tools are LLMs/apps
| specifically marketed as study/homework aids.
|
| It seems like the market is fine with seeking specific LLMs
| for specific kinds of tasks, as opposed to some omni-LLM one-
| stop-shop that does everything. The market has _already_ and
| rapidly moved beyond from ChatGPT.
|
| Not to mention I am willing to bet that Gemini has
| _radically_ more usage than OpenAI 's models simply by virtue
| of being plugged into Google Search. There are distribution
| effects, I just don't think OpenAI has the strongest
| position!
|
| I think OpenAI has _some_ first-mover advantage, I just
| don't think it's anywhere near as durable (nor as large) as
| you're making it out to be.
| Oleksa_dr wrote:
| I was tempted by the ratings and immediately paid for a
| subscription to Gemini 2.5. Half an hour later, I canceled the
| subscription and got a refund. This is the laziest and
| stupidest LLM. What it was supposed to do itself, it told me to
| do on my own. And when analyzing simple short documents, it
| pulled up some completely strange documents from the Internet
| unrelated to the topic. Even local LLMs (3B) were not this
| stupid and lazy.
| PantaloonFlames wrote:
| > At even a $300B valuation a typical wall street analysts
| would want to value them at 2x sales which would mean they'd
| expect OpenAI to have $600B in annual sales to account for this
| valuation when they go public.
|
| Oops I think you may have flipped the numerator and the
| denominator there, if I'm understanding you. Valuation of 300B
| , if 2x sales, would imply 150B sales.
|
| Probably your point still stands.
| lxe wrote:
| Gemini is a good and fast model, but I think the style of code it
| writes is... amateur / inexperienced. It doesn't make a lot of
| mistakes typical of an LLM, but rather chooses approaches that
| are typical of someone who just learned programming. I have to
| always nudge it to avoid verbosity, keep structure less
| repetitive, optimize async code, etc. With claude, I rarely have
| this problem -- it feels more like working with a more
| experienced developer.
| jdmoreira wrote:
| Is there a no brainer alternative to Claude Code where I can try
| other models?
| ketzo wrote:
| People quite like aider! I'm not as much of a fan of the CLI
| workflow but it's quite comparable, I think.
| jdmoreira wrote:
| I've heard about it but is the outcome as good as claude
| code?
| kristianp wrote:
| I enjoy using Aider, but it's not agentic: it can't run your
| tests for you, for example.
| simianwords wrote:
| I feel stupid for asking but how do I enable deepthink?
| koakuma-chan wrote:
| They added a thinking section in AI studio
| simianwords wrote:
| True but it's greyed out. Not sure if this is "deep think"
| johnnyApplePRNG wrote:
| General first impressions are that it's not as capable as 05-06,
| although it's technically testing better on the leaderboards...
| interesting.
| Alifatisk wrote:
| Finally Google is advertising their AI Studio; it's a shame
| they didn't push that beautiful app before.
| consumer451 wrote:
| Man, if the benchmarks are to be believed, this is a lifeline for
| Windsurf as Anthropic becomes less and less friendly.
|
| However, in my personal experience Sonnet 3.x has still been king
| so far. Will be interesting to watch this unfold. At this point,
| it's still looking grim for Windsurf.
| _pdp_ wrote:
| Is it still rate limited though?
| zone411 wrote:
| Improves on the Extended NYT Connections benchmark compared to
| both Gemini 2.5 Pro Exp (03-25) and Gemini 2.5 Pro Preview
| (05-06), scoring 58.7. The decline observed between 03-25 and
| 05-06 has been reversed -
| https://github.com/lechmazur/nyt-connections/.
| bli940505 wrote:
| I'm confused by the naming. It advertises itself as "Thinking" so
| is this the release of the new "Deep Think" model or not?
___________________________________________________________________
(page generated 2025-06-05 23:01 UTC)