[HN Gopher] Gemini 2.5 Deep Think
       ___________________________________________________________________
        
       Gemini 2.5 Deep Think
        
       Author : meetpateltech
       Score  : 371 points
       Date   : 2025-08-01 11:10 UTC (11 hours ago)
        
 (HTM) web link (blog.google)
 (TXT) w3m dump (blog.google)
        
       | amatic wrote:
       | At the moment, Deep Think is only available with the ULTRA
       | subscription ($250 per month).
        
         | stingraycharles wrote:
         | It's not available through the API?
        
         | siva7 wrote:
          | Is it available in the EU? Can someone confirm?
        
       | highfrequency wrote:
       | Approach is analogous to Grok 4 Heavy: use multiple "reasoning"
       | agents in parallel and then compare answers before coming back
       | with a single response, taking ~30 minutes. Great results, though
       | it would be more fair for the benchmark comparisons to be against
       | Grok 4 Heavy rather than Grok 4 (the fast, single-agent model).
        
         | stingraycharles wrote:
          | Yeah, the general "discovery" is that using the same reasoning
          | compute effort, but spreading it over multiple different
          | agents, generally leads to better results.
         | 
          | It solves the "longer thinking leads to worse results" problem
          | by exploring multiple paths of thinking in parallel, with each
          | path not thinking as long.
        
           | Xenoamorphous wrote:
            | > Yeah, the general "discovery" is that using the same
            | reasoning compute effort, but spreading it over multiple
            | different agents, generally leads to better results.
           | 
           | Isn't the compute effort N times as expensive, where N is the
           | number of agents? Unless you meant in terms of time (and even
           | then, I guess it'd be the slowest of the N agents).
        
             | NitpickLawyer wrote:
              | Not exactly N times, no. In a traditional transformer
              | arch, token 1 is cheaper to generate than token 1000,
              | which is cheaper than token 10k, and so on. So running 10x
              | 1,000 tokens concurrently would be cheaper than 10,000 in
              | one session.
             | 
             | You also run into context issues and quality degradation
             | the longer you go.
             | 
             | (this is assuming gemini uses a traditional arch, and not
             | something special regarding attention)
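              | 
              | A back-of-the-envelope sketch of that effect (ignoring the
              | MLP cost, which scales linearly either way): per-token
              | attention cost grows with the number of tokens already in
              | context, so total attention work is roughly quadratic in
              | session length.
              | 
              |     def attention_cost(n_tokens):
              |         # token t attends to the t tokens before it
              |         return sum(range(n_tokens))
              | 
              |     one_long = attention_cost(10_000)       # one 10k session
              |     ten_short = 10 * attention_cost(1_000)  # ten 1k sessions
              |     print(one_long / ten_short)             # ~10x more work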
        
         | lynx97 wrote:
          | I am surprised such a simple approach has taken so long to be
          | actually used. My first image-description CLI attempt did
          | basically that: use n > 1 to get several answers, then another
          | pass to summarize.
        
           | cinntaile wrote:
           | It's very resource intensive so maybe they had to wait until
           | processes got more efficient? I can also imagine they would
           | want to try and solve it in a... better way before doing
           | this.
        
           | simianwords wrote:
            | I agree, but I think it's hard to get a sufficient increase
            | in performance to justify a 3-4x increase in cost.
        
           | datadrivenangel wrote:
           | It's an expensive approach, and depends on assessment being
           | easy, which is often not the case.
        
           | golol wrote:
            | People have played with (multi-)agentic frameworks for LLMs
            | from the very beginning, but it seems like only now, with
            | powerful reasoning models, is it really making a difference.
        
           | NitpickLawyer wrote:
            | I built a similar thing around a year ago w/ autogen. The
            | difference now is models can really be steered towards
            | "part" of the overall goal, and they actually follow that.
           | 
            | Before this, even the best "math" models were RL'd to death
            | to only solve problems. If you wanted one to explore
            | "method_a" of solving a problem you'd be SoL. The model
            | would start like "ok, the user wants me to explore method_a,
            | so here's the solution: blablabla", doing whatever it
            | wanted, unrelated to method_a.
           | 
           | Similar things for gathering multiple sources. Only recently
           | can models actually pick the best thing out of many
           | instances, and work _effectively_ at large context lengths.
           | The previous tries with 1M context lengths were at best
           | gimmicks, IMO. Gemini 2.5 seems the first model that can
           | actually do useful stuff after 100-200k tokens.
        
         | mvieira38 wrote:
         | Is o3-pro the same as these?
        
           | satvikpendem wrote:
           | No, it doesn't take 30 minutes
        
         | Workaccount2 wrote:
         | Grok-4 heavy benchmarks used tools, which trivializes a lot of
         | problems.
        
         | littlestymaar wrote:
          | That this kind of approach works is good news for local LLM
          | enthusiasts: it makes cloud LLMs using it more expensive,
          | while a local LLM can do it essentially for free, up to a
          | point. Because LLM inference is limited by memory bandwidth
          | rather than compute, you can run multiple queries in parallel
          | on your graphics card at the same speed as a single one (until
          | you become compute-bound, of course).
        
           | mettamage wrote:
           | Wait, how does this work? If you load in one LLM of 40 GB,
           | then to load in four more LLMs of 40 GB still takes up an
           | extra 160 GB of memory right?
        
             | masterjack wrote:
              | It will typically be the same 40 GB model loaded once, but
              | called with many different inputs simultaneously.
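              | 
              | A minimal sketch of that batching with Hugging Face
              | transformers (the model choice and prompts are
              | illustrative): one copy of the weights, several samples
              | decoded as a single batch.
              | 
              |     from transformers import (AutoModelForCausalLM,
              |                               AutoTokenizer)
              | 
              |     tok = AutoTokenizer.from_pretrained("gpt2")
              |     tok.pad_token = tok.eos_token  # gpt2 has no pad token
              |     model = AutoModelForCausalLM.from_pretrained("gpt2")
              | 
              |     prompts = ["The capital of France is"] * 4  # 4 samples
              |     batch = tok(prompts, return_tensors="pt", padding=True)
              |     out = model.generate(**batch, do_sample=True,
              |                          max_new_tokens=20,
              |                          pad_token_id=tok.eos_token_id)
              |     print(tok.batch_decode(out, skip_special_tokens=True))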
        
           | zozbot234 wrote:
           | > because LLM inference is limited by memory bandwidth not
           | compute, you can run multiple queries in parallel on your
           | graphic card at the same speed as the single one
           | 
            | I don't think this is correct, especially given MoE. You can
            | save _some_ memory bandwidth by reusing model parameters,
            | but that's about it. It's not giving you the same speed as a
            | single query.
        
         | dweekly wrote:
          | Dumb (?) question, but how is Google's approach here different
          | from Mixture of Experts, where instead of training different
          | experts to have different model weights you just count on
          | temperature to provide diversity of thought? How much benefit
          | is there in getting the diversity of thought from different
          | runs of the same model versus running a consortium of
          | different model weights and architectures? Is there a paper
          | contrasting results, given fixed computation, between spending
          | that compute on multiple runs of the same model vs. different
          | models?
        
           | joaogui1 wrote:
            | Mixture of Experts isn't using multiple models with
            | different specialties; it's more like a sparsity technique,
            | where you massively increase the number of parameters and
            | use only a subset of the weights in each forward pass.
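            | 
            | A minimal sketch of that routing in PyTorch (sizes and layer
            | shapes are illustrative): each token picks its top-k
            | experts, and only those experts' weights are used for it.
            | 
            |     import torch
            |     import torch.nn as nn
            |     import torch.nn.functional as F
            | 
            |     class MoELayer(nn.Module):
            |         def __init__(self, d_model=64, n_experts=8, k=2):
            |             super().__init__()
            |             self.k = k
            |             self.router = nn.Linear(d_model, n_experts)
            |             def ffn():
            |                 return nn.Sequential(
            |                     nn.Linear(d_model, 4 * d_model),
            |                     nn.GELU(),
            |                     nn.Linear(4 * d_model, d_model))
            |             self.experts = nn.ModuleList(
            |                 [ffn() for _ in range(n_experts)])
            | 
            |         def forward(self, x):  # x: (n_tokens, d_model)
            |             w, idx = self.router(x).topk(self.k, dim=-1)
            |             w = F.softmax(w, dim=-1)  # mixing weights
            |             out = torch.zeros_like(x)
            |             for e, expert in enumerate(self.experts):
            |                 for slot in range(self.k):
            |                     hit = idx[:, slot] == e
            |                     if hit.any():
            |                         out[hit] += (w[hit, slot, None]
            |                                      * expert(x[hit]))
            |             return out
            | 
            |     print(MoELayer()(torch.randn(10, 64)).shape)  # (10, 64)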
        
           | HarHarVeryFunny wrote:
            | MoE is just a way to add more parameters/capacity to a model
            | without making it less efficient to run, since it's done in
            | a way that not all parameters are used for each token
            | passing through the model. The name MoE is a bit misleading,
            | since the "experts" are just alternate paths through part of
            | the model, not having any distinct expertise in the way the
            | name might suggest.
           | 
           | Just running the model multiple times on the same input and
           | selecting the best response (according to some judgement)
           | seems a bit of a haphazard way of getting much diversity of
           | response, if that is really all it is doing.
           | 
           | There are multiple alternate approaches to sampling different
           | responses from the model that come to mind, such as:
           | 
           | 1) "Tree of thoughts" - generate a partial response (e.g. one
           | token, or one reasoning step), then generate branching
           | continuations of each of those, etc, etc. Compute would go up
           | exponentially according to number of chained steps, unless
           | heavy pruning is done similar to how it is done for MCTS.
           | 
            | 2) Separate response planning/brainstorming from response
            | generation by first using a "tree of thoughts"-like process
            | just to generate some shallow (e.g. depth < 3) alternate
            | approaches, then use each of those approaches as additional
            | context to generate one or more actual responses (to then
            | evaluate and choose from). Hopefully this would result in
            | some high-level variety of response without the cost of just
            | generating a bunch of responses and hoping that they are
            | usefully diverse.
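            | 
            | For reference, the "haphazard" baseline is easy to state. A
            | minimal sketch of sample-N-then-judge, where llm() and
            | judge() are hypothetical stand-ins for API calls:
            | 
            |     import concurrent.futures
            | 
            |     def parallel_think(question, llm, judge, n=8):
            |         # sample n candidate responses concurrently
            |         with concurrent.futures.ThreadPoolExecutor(n) as pool:
            |             drafts = list(pool.map(lambda _: llm(question),
            |                                    range(n)))
            |         # score each draft and return the best one
            |         scores = [judge(question, d) for d in drafts]
            |         return drafts[scores.index(max(scores))]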
        
         | bitshiftfaced wrote:
         | What makes you sure of that? From the article,
         | 
         | > Deep Think pushes the frontier of thinking capabilities by
         | using parallel thinking techniques. This approach lets Gemini
         | generate many ideas at once and consider them simultaneously,
         | even revising or combining different ideas over time, before
         | arriving at the best answer.
         | 
         | This doesn't exclude the possibility of using multiple agents
         | in parallel, but to me it doesn't necessarily mean that this is
         | what's happening, either.
        
           | highfrequency wrote:
           | What could "parallel thinking techniques" entail if not
           | "using multiple agents in parallel"?
        
           | JyB wrote:
           | How can it not be exactly what's happening?
        
         | Melatonic wrote:
         | Surprised no one has released an app yet that pits all the
         | major models against each other for a final answer.
        
       | 7373737373 wrote:
        | EUR 139.99/month for something you can't even test first, lol
        
       | lynx97 wrote:
       | Wait...
       | 
        | So if someone is cool enough, they could actually give us a
        | DeepThought model?
       | 
       | Please, let that happen.
       | 
       | Vendor-DeepThought-42B maybe?
        
         | xnx wrote:
         | > could actually give us a DeepThought model
         | 
         | Yes, but the response time is terrible. 7.5 million years
        
         | stocksinsmocks wrote:
         | Qwen-DeepInThot-69b-Instruct is what I'm looking forward to.
        
       | simianwords wrote:
        | Grok 4 Heavy, o3-pro and Gemini Deep Think are all equivalent
        | offerings. I wonder how they compare?
        
       | baggachipz wrote:
       | You can't go anywhere without having Gemini shoved in your face.
       | I had an immediate visceral reaction to this.
        
         | serialNumber wrote:
          | I'm usually not one to defend AI, but what do you mean? Is it
          | the "AI overview" that pops up on Google? Other than that, I
          | would say Gemini is definitely less in-your-face than ChatGPT,
          | for example.
        
           | baggachipz wrote:
           | My company uses google workspace and every google doc,
           | spreadsheet, calendar, online meeting and search puts nonstop
           | callouts and messages about using Gemini. It's gotten so bad
           | that I'm about to try building a browser extension to block
           | that bullshit. It clutters the UI and nags. If I wanted that
           | crap, I'd turn it on.
        
             | serialNumber wrote:
             | I see - I feel the same way about Copilot (as my company
             | uses the Microsoft ecosystem).
        
             | breppp wrote:
              | Yes, it smells like Google+, but at least Gemini is
              | actually a good product.
        
             | int_19h wrote:
             | I also find those "please please please try me!" popups
             | annoying, but at least Google Workspace is one product
             | where deep AI integration actually makes sense. I like the
             | ability to quickly get summaries or edit documents by
             | telling it what needs to be done in generic terms.
        
       | resoluteteeth wrote:
       | > If you're a Google AI Ultra subscriber, you can use Deep Think
       | in the Gemini app today with a fixed set of prompts a day by
       | toggling "Deep Think" in the prompt bar when selecting 2.5 Pro in
       | the model drop down.
       | 
        | If "fixed set" means a fixed number, it would be nice to know
        | how many.
        | 
        | Otherwise, I would like to know what "fixed set" means here.
        
         | Workaccount2 wrote:
         | You get 10 requests per day it seems.
         | 
         | Apparently the model will think for 30+ minutes on a given
         | prompt. So it seems it's more for research or dense multi-
         | faceted problems than for general coding or writing fan fic.
        
       | barapa wrote:
       | Available via API?
        
         | simonw wrote:
         | Not yet.
         | https://x.com/OfficialLoganK/status/1951260803459338394 said
         | "Should we put it in the Gemini API next?"
        
           | Oras wrote:
           | Going to twitter to find info is exactly what's wrong with
           | Google
        
       | asaddhamani wrote:
        | I find it interesting how OpenAI came out with a $200 plan,
        | Anthropic did $100 and $200, then Gemini upped it to $250, and
        | now Grok is at $300.
       | 
       | OpenAI is the only one that says "practically unlimited" and I
       | have never hit any limit on my ChatGPT Pro plan. I hit limits on
       | Claude Max (both plans) several times.
       | 
       | Why are these companies not upfront about what the limits are?
        
         | thimabi wrote:
         | > Why are these companies not upfront about what the limits
         | are?
         | 
         | Most likely because they reserve the right to dynamically alter
         | the limits in response to market demands or infrastructure
         | changes.
         | 
         | See, for instance, the Ghibli craze that dominated ChatGPT a
         | few months ago. At the time OpenAI had no choice but to
         | severely limit image generation quotas, yet today there are
         | fewer constraints.
        
         | jstummbillig wrote:
         | Because if you are transparent about the limits, more people
         | will start to game the limits, which leads to lower limits for
         | everyone - which is a worse outcome for almost everyone.
         | 
         | tldr: We can't have nice things, because we are assholes.
        
         | holowoodman wrote:
         | Because they want to have their cake and eat it too.
         | 
         | A fair pricing model would be token-based, so that a user can
         | see for each query how much they cost, and only pay for what
         | they actually used. But AI companies want a steady stream of
         | income, and they want users to pay as much as possible, while
         | using as little as possible. Therefore they ask for a monthly
         | or even yearly price with an unknown number of tokens included,
          | such that you will always pay more than with token-based
          | payments.
        
           | hnav wrote:
            | I don't think it's that; I think they just want people to
            | onboard onto these things before understanding what the
            | actual cost might be once they're not subsidized by
            | megacorps anymore. It's similar to loss-leading endeavors
            | like Uber and Lyft in the 2010s: I suspect that showing the
            | actual cost of inference would raise questions about the
            | cost-effectiveness of these things for a lot of
            | applications. Internally, Google's data query surfaces tell
            | you the cost in terms of SWE-time (e.g. this query cost 1
            | SWE hour) since the incentives are different.
        
           | teitoklien wrote:
            | You're right about their intentions for the future. But
            | right now, they are literally losing money every single time
            | someone uses their product...
            | 
            | In most cases at least; Claude does for sure. So yeah, for
            | now, they're losing money anyway.
        
             | Invictus0 wrote:
             | You just completely made that up
        
       | cpursley wrote:
        | I can't even convince Gemini CLI, while planning things, not to
        | go off and make a bunch of random changes on its own. Even after
        | being very clear not to do so, and interrupting to tell it to
        | stop doing that, it just continues on, fucking everything up.
        
         | WhitneyLand wrote:
         | Agents muddy the waters.
         | 
         | Claude Code gets the most out of Anthropic's models, that's why
         | people love it.
         | 
          | Conversely, Gemini CLI makes Gemini 2.5 Pro less capable than
          | the model itself actually is.
          | 
          | It's such a stark difference that I've given up on Gemini CLI
          | even with it being free, but I still use the model on a
          | regular basis for situations amenable to a prompt interface.
          | It's a very strong model.
        
         | panarky wrote:
         | That's my experience too, when I give Gemini CLI a big, general
         | task and just let it run.
         | 
         | But if I give it structure so it can write its own context, it
         | is truly astonishing.
         | 
         | I'll describe my big, general task and tell it to first read
         | the codebase and then write a detailed requirements document,
         | and not to change any code.
         | 
         | Then I'll tell it to read the codebase and the detailed
         | requirements document it just wrote, and then write a detailed
         | technical spec with API endpoints, params, pseudocode for
         | tricky logic, etc.
         | 
          | Then I'll tell it to read the codebase, the requirements
          | document it just wrote, and the tech spec it just wrote, and
          | decompose the whole development effort into weekly, daily and
          | hourly tasks to assign to developers, saving that in a dev
          | plan document.
         | 
         | Only then is it ready to write code.
         | 
         | And I tell it to read the code base, requirements, tech spec
         | and dev plan, all of which it authored, and implement Phase 1
         | of the dev plan.
         | 
         | It's not all mechanical and deterministic, or I could just
         | script the whole process. Just like with a team of junior devs,
         | I still need to review each document it writes, tweak things I
         | don't like, or give it a better prompt to reflect my priorities
         | that I forgot to tell it the first time, and have it redo a
         | document from scratch.
         | 
         | But it produces 90% or more of its own context. It ingests all
         | that context that it mostly authored, and then just chugs along
         | for a long time, rarely going off the rails anymore.
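          | 
          | A sketch of that staged workflow, assuming gemini-cli's
          | non-interactive -p flag; the file names and prompt wording are
          | illustrative, not prescriptive:
          | 
          |     import subprocess
          | 
          |     def gemini(prompt):
          |         # one-shot, non-interactive invocation of the CLI
          |         return subprocess.run(["gemini", "-p", prompt],
          |                               capture_output=True,
          |                               text=True).stdout
          | 
          |     gemini("Read the codebase and write requirements.md. "
          |            "Do not change any code.")
          |     gemini("Read the codebase and requirements.md; write "
          |            "techspec.md with endpoints, params, pseudocode.")
          |     gemini("Read the codebase, requirements.md, techspec.md; "
          |            "decompose the work into tasks in devplan.md.")
          |     # review and tweak each document by hand, then:
          |     gemini("Read requirements.md, techspec.md, devplan.md; "
          |            "implement Phase 1 of the dev plan.")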
        
       | simonw wrote:
       | This isn't the exact same model that achieved gold in the IMO a
       | few weeks ago but is a close relative:
       | https://x.com/OfficialLoganK/status/1951262261512659430
       | 
       | It's not yet available via an API.
        
       | foundry27 wrote:
       | I started doing some experimentation with this new Deep Think
        | agent, and after five prompts I reached my daily usage limit.
        | For $250 USD/mo, that's what you'll be getting, folks.
       | 
       | It's just bizarrely uncompetitive with o3-pro and Grok 4 Heavy.
       | Anecdotally (from my experience) this was the one feature that
       | enthusiasts in the AI community were interested in to justify the
       | exorbitant price of Google's Ultra subscription. I find it
       | astonishing that the same company providing _free_ usage of their
       | top models to everybody via AI Studio is nickel-and-diming their
       | actual customers like that.
       | 
        | Performance-wise? So far, I can't even tell. I provided it with
        | a challenging organizational problem my business was facing,
        | along with the relevant context, and it proposed a lucid and
        | well-thought-out solution that was consistent with our internal
        | discussions on the matter. But o3 came to an equally effective
        | conclusion for a fraction of the cost, even if it was a less
        | "cohesive" report. I guess I'll have to wait until tomorrow to
        | learn more.
        
         | andsoitis wrote:
          | It turns out that AI at this level is very expensive to run
          | (capex, energy). My bet is that AI itself won't figure out how
          | to overcome these constraints and reach escape velocity.
        
           | crowcroft wrote:
           | Mainframes are the only viable way to build computers. Micro
           | processors will never figure out how to get small and fast
           | enough for personal computers to reach escape velocity.
        
             | andsoitis wrote:
              | Why do you think the analogy holds?
        
               | suddenlybananas wrote:
               | Well it's an analogy, that's a water-tight argument.
        
               | crowcroft wrote:
                | Hardware typically gets faster and cheaper over time.
                | Unless we hit a hard wall because of physics, I don't
                | see any reason that won't continue to be true.
        
           | twobitshifter wrote:
            | Our minds are incredibly energy-efficient; that leads me to
            | believe it is possible to figure out. But it might be a
            | human rather than an AI that gives us something more akin to
            | a biological solution.
        
             | klabb3 wrote:
             | This could fix my main gripe with The Matrix. "Humans are
             | used as batteries" always felt off, but it totally would
             | make sense if the human brains have uniquely energy
             | efficient pattern matching abilities that an emerging AI
             | organism would harvest. That would also strengthen the
             | spiritual humanist subtext.
        
               | Kon5ole wrote:
               | That bugged me too! I decided that they actually meant
               | "source of creativity" which made more sense.
        
               | Melatonic wrote:
                | That's because the original script did actually have the
                | human farms being used for brainpower for the machines.
                | They changed it to "batteries" because they thought
                | audiences at the time wouldn't understand it!
        
           | int_19h wrote:
           | Perhaps this will be the incentive to finally get fusion
           | working. Big tech megacorps are flush with cash and could
           | fund this research many times over at current rates. E.g. NIF
           | is several billion dollars; Google alone has almost $100B in
           | the bank.
        
         | raincole wrote:
         | Interestingly Gemini CLI has a very generous free quota. Is
         | Google's strategy just overpricing some stuff and subsidizing
         | the underpriced stuff?
        
           | sunaookami wrote:
            | It doesn't. It's not "1000 Gemini Pro requests" for free;
            | Google misled everyone. It's 1000 Gemini requests, Flash
            | included. You get like 5-7 Gemini Pro requests before you
            | get limited.
        
             | panarky wrote:
             | I'm getting 100 Gemini Pro requests per day with an AI
             | Studio API key that doesn't have billing enabled.
             | 
              | After that it's bumped down to Flash, which is
              | surprisingly effective in Gemini CLI.
              | 
              | If I need Pro, I just swap in an API key from an account
              | with billing enabled, but usually 100 requests is enough
              | for a day of work.
        
               | indigodaddy wrote:
                | I created an AI Studio key (unbilled) probably over a
                | year ago or so. Is it still good for the current models,
                | or should I be creating a new key?
        
               | yunohn wrote:
                | Whoa. I'm definitely getting just a handful of requests
                | via free, just like the parent commenter...
        
           | ebiester wrote:
            | I've found the free version swaps away from Pro incredibly
            | fast. But our company has Gemini and can't even get that; we
            | were being asked to do everything by API key.
        
           | pembrook wrote:
           | This is the fundamental pricing strategy of all modern
           | software in fact.
           | 
           | Underpriced for consumers, overpriced for businesses.
        
         | thimabi wrote:
         | > I find it astonishing that the same company providing free
         | usage of their top models to everybody via AI Studio is nickel-
         | and-diming their actual customers like that.
         | 
         | I agree that's not a good posture, but it is entirely
         | unsurprising.
         | 
         | Google is probably not profiting from AI Ultra customers
         | either, and grabbing all that sweet usage data from the free
         | tier of AI Studio is what matters most to improve their models.
         | 
         | Giving free access to the best models allows Google to capture
         | market share among the most demanding users, which are
         | precisely the ones that will be charged more in the future. In
         | a certain sense, it's a great way for Google to use its huge
         | idle server capacity nowadays.
        
           | hirako2000 wrote:
            | I'm burning well over 10 million tokens a day on the free
            | tier. 99% of the input is freely available data; the rest is
            | useless. I never provided any feedback. Sure, there is some
            | telemetry; they can have it.
            | 
            | I doubt I'm an isolated case. This Gemini gig will cost
            | Google a lot; they pushed it onto all Android phones around
            | the globe. I can't wait to see what happens when they have
            | to admit that not many people will pay over 20 bucks for
            | "AI", and I would pay well over 20 bucks just to see the
            | faces of the C-suite next year when one of them dares to
            | explain in simple terms that there is absolutely no way to
            | recoup the DC investment, and that powering the whole thing
            | will cost the company 10 times that.
        
         | svantana wrote:
         | I suspect that the main goal here was to grab the top spot in a
         | bunch of benchmarks, and being counted as an "available" model.
        
           | llm_nerd wrote:
           | They're using it as a major inducement to upgrade to AI
           | Ultra. I mean, the image and video stuff is neat, but adds no
           | value for the vast majority of AI subscribers, so right now
           | this is the most notable benefit of paying 12x more.
           | 
            | FWIW, Google seems to be having some severe issues with
            | oddball, perhaps malfunctioning quota systems. I regularly
            | find that extraordinarily little use of gemini-cli hits the
            | purported 1000-request limit, when in reality I've made
            | fewer than 10 requests.
        
             | hirako2000 wrote:
              | I faced the exact same problem with the API. It seems that
              | it doesn't throttle early enough, then may accumulate the
              | cool-off period, making it impossible to determine when to
              | fire requests again.
              | 
              | Also, I noticed Gemini (even Flash) has Google search
              | support, but only via the web UI or the native mobile app.
              | Via the API, that would require SERP via an MCP of sorts,
              | even with Gemini Pro.
             | 
              | Oh, and some models regularly face outages. 503s are not
              | uncommon. No SLA page, no alerts, nothing.
              | 
              | The reasoning feature is buggy; even if disabled, it
              | sometimes triggers anyway.
              | 
              | It occurred to me the other day that Google probably has
              | the best engineers, given how well Gemini performs, where
              | it's coming from, and the context window that is uniquely
              | large compared to any other model. But it is likely
              | operated by managers coming from AWS, where shipping
              | half-baked, barely tested software was all it took to get
              | a bonus.
        
         | mnmatin wrote:
          | They might not have been ready/optimized for production, but
          | still wanted to release it before the Aug 2 EU AI Act
          | deadline; this way they have 2 years for compliance. So the
          | strategy of aggressively rate-limiting a few users makes
          | sense.
        
           | novok wrote:
           | wheee, great way to lock in incumbents even more or lock out
           | the EU from startups
        
         | amelius wrote:
         | It could be that your problem was too simple to justify the use
         | of Deep Think.
         | 
         | But yes, Google should have figured that out and used a less
         | expensive mode of reasoning.
        
           | dweekly wrote:
           | "I'm sorry but that wasn't a very interesting question you
           | just asked. I'll spare you the credit and have a cheaper
           | model answer that for you for free. Come back when you have
           | something actually challenging."
        
             | ccozan wrote:
              | Actually, why not? Recognizing problem complexity as a
              | first step is really crucial for such expensive "experts".
              | Humans do the same.
              | 
              | And a question for the knowledgeable: does a simple/stupid
              | question cost more in terms of resources than a complex
              | problem, in terms of power consumption?
        
               | joshuanapoli wrote:
                | Just put in the prompt customization to model responses
                | after Marvin from Hitchhiker's Guide.
                | 
                | "Here I am, brain the size of a planet, and they ask me
                | to ..."
        
               | Fade_Dance wrote:
               | Do they not do this in some general way behind the
               | scenes? Surprising to me if not.
        
               | nyrikki wrote:
               | IIRC that isn't possible under current models at least in
               | general, for multiple reasons, including attention cannot
               | attend to future tokens, the fact that they are
               | existential logic, that they are really NLP and not NLU,
               | etc...
               | 
               | Even proof mining and the Harrop formula have to exclude
               | disjunction and existential quantification to stay away
               | from intuitionist math.
               | 
               | IID in PAC/ML implies PEM which is also intentionally
               | existential quantification.
               | 
               | This is the most gentle introduction I know of, but
               | remember LLMs are fundamentally set shattering, and
               | produce disjoint sets also.
               | 
               | We are just at reactive model based systems now, much
               | work is needed to even approach this if it ever is even
               | possible.
               | 
               | [0] https://www.cmu.edu/dietrich/philosophy/docs/tech-
               | reports/99...
        
               | robwwilliams wrote:
               | Hmm, I needed Claude 4's help to parse your response. The
               | critique was not too kind to your abbreviated arguments
               | that current systems are not able to gauge the complexity
               | of a prompt and the resources needed to address a
               | question.
        
               | jiggawatts wrote:
               | It feels like the rant of someone upset that their
               | decades of formal logic approach to AI become a dead end.
               | 
               | I see this semi-regularly: futile attempts at handwaving
               | away the obvious intelligence by some formal argument
               | that is either irrelevant or inapplicable. Everything
               | from thermodynamics -- which applies to human brains too
               | -- to information theory.
               | 
               | Grey-bearded academics clinging to anything that might
               | float to rescue their investment into ineffective
               | approaches.
               | 
               | PS: This argument seems to be that LLMs "can't think
               | ahead" when all evidence is that they clearly can! I
               | don't know exactly what words I'll be typing into this
               | comment textbox seconds or minutes from now but I can --
               | hopefully obviously -- think intelligent thoughts and
               | plan ahead.
               | 
               | PPS: The em-dashes were inserted automatically by my
               | iPhone, not a chat bot. I assure you that I am a mostly
               | human person.
        
             | riwsky wrote:
             | "This meeting could've been an email"
        
             | nojito wrote:
              | I know this is a joke, but I have been able to lower my
              | costs by routing my prompts to a smaller model to
              | determine whether I need to send them to a larger model or
              | not.
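              | 
              | A minimal sketch of that routing, where cheap_llm and
              | strong_llm are hypothetical callables wrapping whatever
              | APIs you use:
              | 
              |     def route(prompt, cheap_llm, strong_llm):
              |         verdict = cheap_llm(
              |             "Reply with exactly EASY or HARD: how hard "
              |             "is this request?\n\n" + prompt)
              |         # escalate only when the cheap model flags it
              |         if "HARD" in verdict.upper():
              |             return strong_llm(prompt)
              |         return cheap_llm(prompt)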
        
           | danenania wrote:
           | Model routing is deceptively hard though. It has halting
           | problem characteristics: often only the smartest model is
           | smart enough to accurately determine a task's difficulty. And
           | if you need the smartest model to reliably classify the
           | prompt, it's cheaper to just let it handle the prompt
           | directly.
           | 
           | This is why model pickers persist despite no one liking them.
        
             | martinald wrote:
              | Yes, but prompt evaluation (prefill) is far faster than
              | token generation, as it can be done (mostly) in parallel,
              | so I don't think that's true.
        
               | danenania wrote:
               | The problem is that input token cost dominates output
               | token cost for the majority of tasks.
               | 
               | Once you've given the model your prompt and are reading
               | the first output token for classification, you've already
               | paid most of the cost of just prompting it directly.
               | 
               | That said, there could definitely be exceptions for short
               | prompts where output costs dominate input costs. But
               | these aren't usually the interesting use cases.
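                | 
                | Illustrative arithmetic (prices are placeholders, not
                | any vendor's rates): with a long prompt, the routing
                | pass alone costs most of a direct answer.
                | 
                |     in_price, out_price = 2.0, 10.0  # $/1M tokens
                |     prompt_toks, answer_toks = 50_000, 2_000
                |     direct = (prompt_toks * in_price
                |               + answer_toks * out_price) / 1e6
                |     routing = (prompt_toks * in_price
                |                + 1 * out_price) / 1e6  # 1 token out
                |     print(direct, routing)  # 0.12 vs ~0.10: ~83% spent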
        
               | redox99 wrote:
               | That's usually not the case for thinking models. And
               | usually hard problems have a very short prompt.
        
               | danenania wrote:
               | For me personally (using mostly for coding and project
               | planning) it's nearly always the case, including with
               | thinking models. I'm usually pasting in a bunch of files,
               | screenshots, etc., and having long conversations. Input
               | nearly always heavily dominates output.
               | 
                | I don't disagree that there _are_ hard problems which
                | use short prompts, like math homework problems etc., but
                | they mostly aren't what I would categorize as "real
                | work". But of course I can only speak to my own
                | experience /shrug.
        
               | redox99 wrote:
               | Yeah coding is definitely a situation where context is
               | usually very very large. But at the same time in those
               | situations something like Sonnet is fine.
        
               | energy123 wrote:
               | No, you're talking about costs to user, which are
               | oversimplifications of the costs that providers bear. One
               | output token with a million input tokens is incredibly
               | cheap for providers
        
               | danenania wrote:
               | > One output token with a million input tokens is
               | incredibly cheap for providers
               | 
               | Source? Afaik this is incorrect.
        
               | danielbln wrote:
                | Check out any LLM API provider's pricing. Output tokens
                | are always significantly more expensive than input
                | tokens (which can also be cached).
        
               | danenania wrote:
               | Input tokens usually dominate output tokens by a _lot_
               | more than 2x though. It's often 10x or more input. It can
               | even easily be 100x or more. Again in realistic
               | workflows.
               | 
               | Caching does help the situation, but you always at least
               | pay the initial cache write. And prompts need to be
               | structured carefully to be cacheable. It's not a free
               | lunch.
        
         | golfer wrote:
         | What was the experimentation? Can you share with us so we can
         | see how "bizarrely uncompetitive" it is?
        
           | iamronaldo wrote:
            | "Bizarrely uncompetitive" references the 5 uses per day, not
            | the performance itself.
        
         | dataviz1000 wrote:
          | Several years ago I decided that a good litmus test for
          | mastery of coding is not being able to find a solution using
          | internet search, nor getting well-written questions about
          | esoteric coding problems answered on StackOverflow. For a
          | while, I would post a question and answer my own question
          | after I solved the problem, for posterity (or AI bots). I
          | always loved getting the "I've been working on this for 3 days
          | and you saved my life" comments.
          | 
          | I've been working on a challenging problem all this week and
          | all the AI copilot models are worthless at helping me. Mastery
          | in coding is being alone, when nobody else nor AI copilots can
          | help you and you have to dig deep into generalization,
          | synthesis, and creativity.
         | 
         | (I thought to myself, at least it will be a little while longer
         | before I'm replaced with AI coding agents.)
        
           | benreesman wrote:
           | They're remarkably useless on stuff they've seen but not had
           | up-weighted in the training set. Even the best ones (Opus 4
           | running hot, Qwen and K2 will surprise you fairly often) are
           | a net liability in some obscure thing.
           | 
           | Probably the starkest example of this is build system stuff:
           | it's really obvious which ones have seen a bunch of
           | `nixpkgs`, and even the best ones seem to really struggle
           | with Bazel and sometimes CMake!
           | 
            | The absolute prestige high-end ones, running flat out and
            | burning 100+ dollars a day, are a lift on pre-SEO Google/SO
            | I think... but it's not like a blowout vs. a working search
            | index. Back when all the source, all the docs, and all the
            | troubleshooting for any topic on the whole Internet were all
            | above the fold on Google? It was kinda like this: type a
            | question in the magic box and working-ish code pops out.
            | Same at a glory-days FAANG with the internal mega-grep.
           | 
           | I think there's a whole cohort or two who think that "type in
           | the magic box and code comes out" is new. It's not new, we
           | just didn't have it for 5-10 years.
        
           | burnte wrote:
            | I have similar issues with support from companies that
            | heavily push AI and self-serve models and make human support
            | hard. I'm very accomplished and highly capable, so if I feel
            | the need to turn to support, the chances the solution is in
            | a KB are very slim; same with AI. It'll be a very specific
            | situation with a very specific need.
        
             | Melatonic wrote:
              | There are a lot of internal KBs companies keep to
              | themselves in their ticketing systems. It would be
              | interesting to estimate how much good data is in there
              | that could be used in the future to train more advanced
              | (or maybe more niche or specific) AI models.
        
           | Melatonic wrote:
            | This has been my thought for a long time: unless there is
            | some breakthrough in AI algorithms, I feel like we are going
            | to hit a "creativity wall" for coding (and some other
            | tasks).
        
             | red75prime wrote:
             | Any reason to think that the wall will be under the human
             | level?
        
               | hirako2000 wrote:
                | Of the thousands of responses I have read from the top
                | LLMs in the last couple of years, I've never seen one
                | that was creative, throwing writing, coding,
                | problem-solving, mathematical questions and whatnot at
                | them.
                | 
                | It's somewhat easier to perceive the lack of creativity
                | with stable diffusion. I'm not talking about the
                | missing-limb or extra-finger glitches. With a bit of
                | experience looking through generated images, our brain
                | eventually perceives the absolute lack of creativity; an
                | artist would probably spot it without prior experience
                | with generative AI pieces. With LLMs it takes a bit
                | longer.
               | 
                | Anecdotal, baseless I guess. Papers were published; some
                | researchers in the sciences couldn't get the best LLMs
                | to solve any unsolved problem. I recently came across a
                | paper stating bluntly that all LLMs tested were unable
                | to conceptualize, or to derive laws that generalize
                | whatsoever, e.g. formulas.
                | 
                | We are being duped. It doesn't help sell $200 monthly
                | subscriptions - soon even more - if marketers admit
                | there is absolutely zero reasoning going on with these
                | stochastic machines on steroids.
               | 
                | I deeply wish the circus ends soon, so that we can start
                | focusing on what LLMs are excellent and well fitted to
                | do better and faster than humans.
                | 
                | Creative they are not.
        
           | epolanski wrote:
            | Your post misses the fact that 99% of programming is
            | repetitive plumbing, and that the overwhelming majority of
            | developers, even Ivy League graduates, suck at coding and
            | problem solving.
            | 
            | Thus, AI is a great productivity tool for the overwhelming
            | majority of problems out there, if you know how to use it.
            | 
            | This whole narrative of "okay, but it can't replace me in
            | this or that situation" is honestly somewhere between an
            | obvious touché (why would you think AI would replace, rather
            | than empower, those who know their craft?) and stale
            | Luddism.
        
         | LeafItAlone wrote:
         | > It's just bizarrely uncompetitive with o3-pro and Grok 4
         | Heavy.
         | 
         | In my experience Grok 4 and 4 Heavy have been crap. Who cares
         | how many requests you get with it when the response is
         | terrible. Worst LLM money I've spent this year and I've spent a
         | lot.
        
           | danenania wrote:
           | It's interesting how multi-dimensional LLM capabilities have
           | proven to be.
           | 
           | OpenAI reasoning models (o1-pro, o3, o3-pro) have been the
           | strongest, in my experience, at harder problems, like finding
           | race conditions in intricate concurrency code, yet they still
           | lag behind even the initial sonnet 3.5 release for writing
           | basic usable code.
           | 
           | The OpenAI models are kind of like CS grads who can solve
           | complex math problems but can't write a decent React
           | component without yadda-yadda-ing half of it, while the
           | Anthropic models will crank out many files of decent,
           | reasonably usable code while frequently missing subtleties
           | and forgetting the bigger picture.
        
             | nxobject wrote:
             | Those may have been the exact people creating training
             | material for OpenAI...
        
           | qingcharles wrote:
            | It's just wildly inconsistent for me. Sometimes it'll
            | produce a work of genius. Other times, total garbage.
        
             | epa wrote:
              | Unfortunately we are still in the prompt optimization
              | stage: garbage in, garbage out.
        
               | MaxikCZ wrote:
                | I hear this repeated so many times I feel like it's a
                | narrative pushed by the sellers. A year ago you could
                | ask for a glass of wine filled to the brim and you just
                | wouldn't get it. It wasn't garbage in, garbage out; it
                | was sensibility in, garbage out.
                | 
                | The line where chatbots stop being sensible and start
                | outputting garbage is moving, but more slowly than the
                | average Joe would guess. You only notice it when you
                | have an intuition about the answer before you see it,
                | which requires a lot of experience across a range of
                | complexity. Persistent newbies are the best spotters,
                | because they ask obvious basic questions while also
                | asking for stuff beyond what geniuses could solve, and
                | only by getting a garbage answer and enduring the
                | process of realizing it's actually garbage do they build
                | a wider picture of AI than even most power users, who
                | tend to have more balanced queries.
        
               | LeafItAlone wrote:
               | Maybe. That could be true.
               | 
                | But the same doesn't happen with other tools. I'll give
                | the _same exact_ prompt to all of the LLMs I have access
                | to and look at the responses for the best one. Grok is
                | consistently the worst. So if it's garbage in, garbage
                | out, why are the other ones so much better at dealing
                | with my garbage?
        
               | hirako2000 wrote:
                | I think they meant in the training stage, not inference.
        
         | 827a wrote:
          | Similar complaints are happening all over Reddit with the
          | Claude Code $200/mo plan and Cursor. The companies with deep
          | VC funding have been subsidizing usage for a year now, but
          | we're starting to see that bleed off.
         | 
         | I think the primary concern of this industry right now is how,
         | relative to the current latest generation models, we
         | simultaneously need intelligence to increase, cost to decrease,
         | effective context windows to increase, and token bandwidths to
         | increase. All four of these things are real bottlenecks to
         | unlocking the "next level" of these tools for software
         | engineering usage.
         | 
         | Google isn't going to make billions on solving advanced math
         | exams.
        
           | Fade_Dance wrote:
           | Agreed, and big context windows are key to mass adoption in
           | wider use cases beyond chatbots (random ex: in knowledge
           | management apps, being able to parse the entire note
           | library/section and hook it into global AI search), but those
           | use cases are decidedly _not_ areas where $200 per month
           | subscriptions can work.
           | 
            | I'd hazard to say that cost and context windows are the two
            | key metrics to bridge that chasm with acceptable results...
            | As for software engineering, though, that cohort will be
            | demanding on all fronts for the foreseeable future,
            | especially because there's a bit of a competitive element.
            | Nobody wants to be the vibecoder using sub-par tools
            | compared to everyone else showing off their GitHub results
            | and making sexy blog posts about it on HN.
        
             | com2kid wrote:
              | Outside of code, the current RAG strategy is to throw shit
              | tons of unstructured text at the model, found using vector
              | search. Some companies are doing better, but the default
              | RAG pipelines are... kind of garbage.
             | 
              | For example, a chatbot doing recipe work should have a RAG
              | DB that, by default, returns entire recipes. A vector DB
              | is actually not the solution here; any number of
              | traditional DBs (relational or even a document store)
              | would work fine. Sure, do a vector search across the
              | recipe texts, but then fetch the entire recipe from
              | someplace else. Current RAG solutions can do this, but the
              | majority of RAG deployments I have seen don't bother; they
              | just abuse large context windows.
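              | 
              | A sketch of that "search with vectors, fetch whole
              | documents" pattern; vector_index and doc_store are
              | hypothetical stand-ins for whatever stores you use:
              | 
              |     def retrieve_recipes(query, vector_index, doc_store,
              |                          k=3):
              |         # vector search only to *find* recipes...
              |         recipe_ids = vector_index.search(query, top_k=k)
              |         # ...then fetch each full recipe from a doc store,
              |         # so the context holds whole recipes, not fragments
              |         return [doc_store.get(rid) for rid in recipe_ids]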
             | 
              | Which looks like it works, except what you actually have
              | in your context window is 15 different recipes all
              | stitched together. Or if you put an entire recipe book
              | into the context (which is perfectly doable nowadays!),
              | you'll end up with the chatbot mixing up ingredients and
              | proportions between recipes, because you just voluntarily
              | polluted its context with irrelevant info.
             | 
             | Large context windows allow for sloppy practices that end
             | up making for worse results. Kind of like when we decided
             | web servers needed 16 cores and gigs of RAM to run IBM
             | Websphere back in the early 2000s, to serve up mostly
             | static pages. The availability of massive servers taught
             | bad habits (huge complicated XML deployment and
             | configuration files, oodles of processes communicating with
             | each other to serve a single page, etc).
             | 
              | Meanwhile, in the modern world, I've run mission-critical,
              | high-throughput services for giant companies on a K8s
              | cluster consisting of 3 machines, each with 0.25 CPU and a
              | couple hundred megs of RAM allocated.
              | 
              | Sometimes more is worse.
        
               | 827a wrote:
               | IMO: Context engineering is a fascinating topic because
               | it starts approaching the metaphysical abstract idea of
               | what LLMs even are.
               | 
               | If you believe that an LLM is a digital brain, then it
               | follows that their limitation in capabilities today are a
               | result of their limited characteristics (namely: coherent
               | context windows). If we increase context windows (and
               | intelligence), we can simply pack more data into the
               | context, ask specific questions, and let the LLM figure
               | it out.
               | 
               | However, if you have a more grounded belief that, at
               | best, LLMs are just one part of a more heterogeneous
                | digital brain, it follows that maybe their limitations
                | are actually a result of how we're feeding them data:
                | that we need to be smarter about context engineering,
                | that we need to do roundtrips with the LLM to narrow
                | down what the context should be, and that it needs
                | targeted context to maximize the quality of its output.
               | 
               | The second situation feels so much harder, but more
               | likely. IMO: This fundamental schism is the single reason
               | why ASI won't be achieved on any timeframe worth making a
               | prediction about. LLMs are just one part of the puzzle.
        
               | Fade_Dance wrote:
               | It's also a question of general vs specialized tools. If
               | LLMs are being used in a limited capacity, such as
               | retrieving recipes, then a limited environment where it
               | only has the ability to retrieve complete recipes via RAG
               | may be ideal in the literal sense of the word. There
               | really is nothing better than the perfect specialized
               | tool for a specialized job.
        
               | com2kid wrote:
                | I did embedded work for years. A 100 MHz CPU with
                | 1-cycle SRAM latency and a bare-metal OS can do as much
                | as a 600 MHz CPU hitting DRAM running a preemptive OS.
                | 
                | Specialized tools _rock_.
        
               | com2kid wrote:
               | Information in an LLM exists in two places:
               | 
               | 1. Embedded in the parameters
               | 
               | 2. Within the context window
               | 
               | We all talk a lot about #2, but until we get a really
               | good grip on #1, I think we as a field are going to hit a
               | progress wall.
               | 
               | The problem is we have not been able to separate out
               | knowledge embedded in parameters with model capability,
               | famously even if you don't want a model to write code,
               | throwing a bunch of code at a model makes it a better
               | model. (Also famously, even if someone never grows up to
               | work with math day to day, learning math makes them
               | better at all sorts of related logical thinking tasks.)
               | 
               | Also there is plenty of research showing performance
               | degrades as we stuff more and more into context. This is
               | why even the best models have limits on tool call
               | performance when naively throwing 15+ JSON schemas at it.
               | (The technique to use RAG to determine which tool call
               | schema to feed into the context window is super cool!)
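                | 
                | A sketch of that trick, with embed() and tool_schemas as
                | hypothetical stand-ins: embed the user request, rank the
                | tool schemas by similarity, and put only the top few in
                | the context.
                | 
                |     def dot(a, b):
                |         return sum(x * y for x, y in zip(a, b))
                | 
                |     def select_tools(request, embed, tool_schemas, k=3):
                |         q = embed(request)
                |         ranked = sorted(
                |             tool_schemas, reverse=True,
                |             key=lambda t: dot(q, embed(t["description"])))
                |         return ranked[:k]  # only these go in the context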
        
             | 827a wrote:
              | Big, coherent context windows are key to almost all use
              | cases. The house-of-cards RAG implementations most
              | platforms are using right now are pretty bad. You start
              | asking around about how to implement RAG and you realize:
              | no one knows, the architecture and outcomes at every
              | company are pretty bad, and the most common words you hear
              | are "yeah it pretty much works ok i guess".
        
         | Closi wrote:
         | It's not particularly interesting if Deep Think comes to
         | the same (correct) conclusion on a single problem as o3 but
         | costs more. You could ask GPT-3.5 and GPT-4 what 1+1 equals
         | and would get the same response with GPT-4 costing more,
         | but this doesn't tell us much about model capability or
         | value.
         | 
         | It would be more interesting to know if it can handle problems
         | that o3 can't do, or if it is 'correct' more often than o3 pro
         | on these sort of problems.
         | 
         | i.e. if o3 is correct 90% of the time, but Deep Think is
         | correct 91% of the time on challenging organisational
         | problems, it will be worth paying $250 for an extra 1% of
         | certainty (assuming the problem is high-value / high-risk
         | enough).
        
           | lucianbr wrote:
           | > It would be more interesting to know if it can handle
           | problems that o3 can't do
           | 
           | Suppose it can't. How will you know? All the datapoints will
           | be "not particularly interesting".
        
             | Closi wrote:
             | > Suppose it can't. How will you know?
             | 
             | By finding and testing problems that o3 can't do on Deep
             | Think, and also testing the reverse? Or by large benchmarks
             | comparing a whole suite of questions with known answers.
             | 
             | Problems that both get correct will be easy to find and
             | don't say much about comparative performance. That's why
             | some of the benchmarks listed in the article (e.g.
             | Humanity's Last Exam / AIME 2025) are potentially more
             | insightful than one person's report on testing one question
             | (which they don't provide) where both models replied with
             | the same answer.
        
         | profsummergig wrote:
         | The part I can't understand is why, for many AI offerings,
         | I can't make out at a quick glance what each pricing tier
         | gets you.
         | 
         | What happened to the simplicity of Steve Jobs' 2x2
         | (consumer vs. pro, laptop vs. desktop)?
        
         | starfallg wrote:
         | The rate limits are not because of compute capacity, or the
         | lack thereof. They're there to stop people from training
         | their own models on the very cutting edge.
        
       | TheAceOfHearts wrote:
       | I would be interested in reading about how people who are paying
       | for access to Google's top AI plan are intending to use this. Do
       | you have any examples of immediate use-cases that might benefit?
       | 
       | Is Google using this tool internally? One would expect them to
       | give some examples of how it's helping internal teams accelerate
       | or solve more challenging problems, if they were eating their own
       | dogfood.
        
         | meta_ai_x wrote:
         | I'm guessing most Ultra users are there for Veo 3, where it
         | has monetary benefits if your 3s videos go viral on
         | TikTok/Reels/Shorts.
        
       | irthomasthomas wrote:
       | You can spin up a version of this at home using simonw's LLM cli
       | with the llm-consortium plugin.
       | 
       | Bonus 1: Use any combination of models. Mix n match models from
       | any lab.
       | 
       | Bonus 2: Serve your custom consortium on a local API from a
       | single command using the llm-model-gateway plugin and use it in
       | your apps and coding assistants.
       | 
       | https://x.com/karpathy/status/1870692546969735361
       |     uv tool install llm
       |     llm install llm-consortium
       |     llm consortium save gthink-n5 -m gemini-pro -n 5 \
       |       --arbiter gemini-flash --confidence-threshold 99 \
       |       --max-iterations 4
       |     llm serve --host 0.0.0.0
       | 
       |     curl http://0.0.0.0:8000/v1/chat/completions \
       |       -X POST \
       |       -H "Content-Type: application/json" \
       |       -d '{
       |         "model": "gthink-n5",
       |         "messages": [{"role": "user", "content":
       |           "find a polynomial algorithm for graph-isomorphism"}]
       |       }'
       | 
       | You can also build a consortium of consortiums like so:
       | 
       |     llm consortium save gem-squared -m gthink-n5 -n 2 \
       |       --arbiter gem-flash
       | 
       | Or even make the arbiter a consortium:
       | 
       |     llm consortium save gem-cubed -m gthink-n5 -n 2 \
       |       --arbiter gthink-n5 --max-iterations 2
       | 
       | Or go open-weights only:
       | 
       |     llm consortium save open-council -m qwen3:2 -m kimi-k2:2 \
       |       -m glm-4.5:2 -m mistral:2 --arbiter minimax-m1 \
       |       --min-iterations 2 --confidence-threshold 95
       | 
       | https://GitHub.com/irthomasthomas/llm-consortium
        
         | barapa wrote:
         | I am not seeing this llm serve command
        
           | irthomasthomas wrote:
           | it's a separate plugin rn. llm install llm-model-gateway
        
         | mt_ wrote:
         | Is the European Union a consortium of consortiums?
        
         | SubiculumCode wrote:
         | 1. Why do you say this is a version of Gemini Deep Think?
         | It seems like there could be multiple ways to build a
         | multiagent model to explore a space.
         | 
         | 2. The covariance between models leads to correlated
         | errors, lowering the individual effectiveness of each
         | contributing model. It would seem to me that you'd want to
         | find a set of model architectures/prompt_configs that
         | minimizes covariance while maintaining individual accuracy,
         | on a benchmark set of problems that have multiple provable
         | solutions (i.e. not one path to a solution that is
         | objectively correct).
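         | 
         | A minimal sketch of point 2, assuming you already have a
         | 0/1 correctness matrix (numpy array) from a benchmark run;
         | all names here are made up:
         | 
         |     # Pick the k-model subset with the lowest mean
         |     # pairwise error covariance, subject to a floor
         |     # on each model's individual accuracy.
         |     import numpy as np
         |     from itertools import combinations
         | 
         |     def pick_ensemble(correct, names, k=3, min_acc=0.6):
         |         errors = 1 - correct  # (models x problems)
         |         best, best_cov = None, np.inf
         |         for idx in combinations(range(len(names)), k):
         |             sub = list(idx)
         |             if correct[sub].mean(axis=1).min() < min_acc:
         |                 continue
         |             cov = np.cov(errors[sub])
         |             off_diag = cov[~np.eye(k, dtype=bool)].mean()
         |             if off_diag < best_cov:
         |                 best, best_cov = idx, off_diag
         |         return [names[i] for i in best] if best else None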
        
           | irthomasthomas wrote:
           | I didn't mean to suggest it's a clone of Deep Think, which is
           | proprietary. I meant that it's a version of _parallel
           | reasoning_. Got the idea from Karpathy's tweet in December
           | and built it. Then DeepMind published the "Evolving Deeper
           | LLM Thinking" paper in January with similar concepts. Great
           | minds, I guess? https://arxiv.org/html/2501.09891v1
           | 
           | 2. The correlated errors thing is real, though I'd argue it's
           | not always a dealbreaker. Sometimes you want similar models
           | for consistency, sometimes you want diversity for coverage.
           | The plugin lets you do either - mix Claude with kimi and Qwen
           | if you want, or run 5 instances of the same model. The
           | "right" approach probably depends on your use case.
        
         | dumbmrblah wrote:
         | Thanks! Do you happen to know if there any OpenWebUI plugins
         | similar to this?
        
           | irthomasthomas wrote:
           | You can use this with openwebui already. Just llm install
           | llm-model-gateway. Then, after you save a consortium, run
           | llm serve --host 0.0.0.0. This will give you an OpenAI-
           | compatible endpoint which you add to your chat client.
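           | 
           | For example, from Python (a sketch; assumes the gateway's
           | default port and the OpenAI-compatible /v1 route shown in
           | the curl example upthread):
           | 
           |     # Any OpenAI-compatible client can talk to the
           |     # local gateway once `llm serve` is running.
           |     from openai import OpenAI
           | 
           |     client = OpenAI(base_url="http://0.0.0.0:8000/v1",
           |                     api_key="unused")  # local server
           |     resp = client.chat.completions.create(
           |         model="gthink-n5",
           |         messages=[{"role": "user", "content": "hi"}],
           |     )
           |     print(resp.choices[0].message.content)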
        
       | nickandbro wrote:
       | Ladies and Gentlemen,
       | 
       | Here's Gemini Deep Think when prompted with:
       | 
       | "Create a svg of a pelican riding on a bicycle"
       | 
       | https://www.svgviewer.dev/s/5R5iTexQ
       | 
       | Beat Simon Willison to it :)
        
         | baggachipz wrote:
         | Truly worth the price. We live in the future.
        
         | jjcm wrote:
         | Honestly the first one where I would have guessed "this is a
         | pelican riding a bicycle" if presented with just the image and
         | 0 other context. This and the voxel tower are fairly impressive
         | - we're seeing some semblance of visual / spatial understanding
         | with this model.
        
         | criemen wrote:
         | How much would it cost at list-price API pricing?
        
         | arnaudsm wrote:
         | Meme benchmarks like this and Strawberry are funny but very
         | easy to game; I bet they're all over training sets
         | nowadays.
        
           | simonw wrote:
           | If you train a model on the SVGs of pelicans on a bicycle
           | that are out there already you're going to get a VERY weird
           | looking pelican on a bicycle:
           | https://simonwillison.net/tags/pelican-riding-a-bicycle/
        
             | nickandbro wrote:
             | Thanks for the mention on your blog. You're the original
             | GPT pelican artist
        
         | amelius wrote:
         | Can it do circuit diagrams? Because that's one practical area
         | where I think the AI models are lacking.
        
           | HaZeust wrote:
           | Not yet, nor schematics. It can do netlists, though! But
           | it's much harder to go from "netlist -> diagram/schematic"
           | than the other way around :(
        
         | kingstnap wrote:
         | It was an expensive SVG, but it did a good job.
         | 
         | The bike is an actual bike with a diamond frame.
        
         | mNovak wrote:
         | Interestingly it seems to draw the bike's seat too (around line
         | 34) which then gets covered by the pelican.
        
         | simonw wrote:
         | OK that is recognizably a pelican, pretty great!
        
           | qingcharles wrote:
           | This feels like the best pelicanbike yet. The singularity
           | might be closer than we imagine.
           | 
           | Time for a leaderboard?
        
             | lostmsu wrote:
             | Ask and you'll receive: https://pelicans.borg.games/
        
               | qingcharles wrote:
               | Nice! Is there a way I can click on the leaderboard items
               | so I can view them?
        
               | espadrine wrote:
               | It would be interesting to have two generations per
               | model without cherry-picking, so that the Elo estimate
               | can come with an easy-to-compute standard deviation.
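               | 
               | Something like this, as a sketch (two generations per
               | model rated as separate players; the match data and
               | names are hypothetical):
               | 
               |     # The spread between a model's two generations
               |     # gives a cheap standard-deviation estimate.
               |     import numpy as np
               | 
               |     def elo(matches, players, k=32, passes=20):
               |         """matches: list of (winner, loser) names."""
               |         r = {p: 1000.0 for p in players}
               |         for _ in range(passes):  # rough fit
               |             for w, l in matches:
               |                 e = 1 / (1 + 10 ** ((r[l] - r[w])
               |                                     / 400))
               |                 r[w] += k * (1 - e)
               |                 r[l] -= k * (1 - e)
               |         return r
               | 
               |     # ratings = elo(matches, players)
               |     # sd = np.std([ratings["gemini-a"],
               |     #              ratings["gemini-b"]])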
        
         | jug wrote:
         | Easily the best one yet!
        
           | tim333 wrote:
           | First I've seen of human quality. Maybe we are reaching API.
           | (Artificial Pelican Intelligence)
        
           | NitpickLawyer wrote:
           | Saw one today from gpt5 (via some api trick someone found)
           | that was better than this, let me see if I can find it.
           | 
           | Pelican:
           | 
           | https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd..
           | ..
           | 
           | Longer thread re gpt5:
           | 
           | https://old.reddit.com/r/OpenAI/comments/1mettre/gpt5_is_alr.
           | ..
        
             | blueblisters wrote:
              | Uh, that doesn't look better. It has more texture but
              | the composition is bad/incomplete.
        
         | rdtsc wrote:
         | If it's on HN and is a meme at this point, it will end up in
         | the training set.
         | 
         | It's kind of fun to imagine that there is an intern in every AI
         | company furiously trying to get nice looking svg pelicans on
         | bicycles.
        
         | akomtu wrote:
         | Now add irrelevant facts about cats and see if it can still
         | draw it.
        
         | drozycki wrote:
         | I don't have access but I wonder whether a dog on a jetski
         | would be nearly as good
        
       | judge123 wrote:
       | I'm wondering if 'slow AI' like this is a temporary bridge, or a
       | whole new category we need to get used to. Is the future really
       | about having these specialized 'deep thinkers' alongside our
       | fast, everyday models? Or is this just a clunky V1 until the main
       | models get this powerful on their own in seconds?
        
         | onlyrealcuzzo wrote:
         | It's not unreasonable to think that, with improvements on
         | the software side, a Saturn-like model based on diffusion
         | could be this powerful within a decade - with 1s responses.
         | 
         | I'd highly doubt in 10 years people are waiting 30m for
         | answers of _this_ quality - whether due to the software
         | side, the hardware side, and/or scaling.
         | 
         | It's possible in 10 years, the cost you pay is still
         | comparable, but I doubt the time will be 30m.
         | 
         | It's also possible that there are still top-tier models
         | like this that use absurd amounts of resources (by today's
         | standards) and take 30m - but they'd likely be at a _much_
         | higher quality than today's.
        
           | regularfry wrote:
           | The pressure in the other direction is tool use. The more
           | a model wants to call out to a series of tools, the
           | longer the delay will be, just because the serial process
           | isn't part of the model.
        
         | twobitshifter wrote:
         | We're optimizing for quality over performance right now; at
         | some point the pendulum will swing the other way. But there
         | might always be problems that require deep thinking, just
         | like we still need supercomputers to run jobs for days
         | today.
        
       | nusl wrote:
       | Been using Gemini for a few months; somehow it's gotten much,
       | much worse in that time. Hallucinations are very common, and
       | it will argue with you when you point them out. So, I don't
       | have much confidence in it.
        
         | panarky wrote:
         | In my experience with chat, Flash has gotten much, much better.
         | It's my go-to model even though I'm paying for Pro.
         | 
         | Pro is frustrating because it too often won't search to find
         | current information, and just gives stale results from before
         | its training cutoff. Flash doesn't do this much anymore.
         | 
         | For coding I use Pro in Gemini CLI. It is amazing at
         | coding, but I'm actually using it more to write design
         | docs, decompose multi-week assignments into daily and
         | hourly tasks, and then feed those docs back to Gemini CLI
         | to have it work through each task sequentially.
         | 
         | With a little structure like this, it can basically write its
         | own context.
        
           | declan_roberts wrote:
           | I like Flash because when it's wrong, it's wrong very
           | quickly. You can either change the prompt or just solve
           | the problem yourself. It works well for people who can
           | spot an answer as being "wrong".
        
         | arnaudsm wrote:
         | I feel the same, but cannot measure the effect in any long-
         | context benchmark like fiction.livebench.
         | 
         | Are they aggressively quantizing, or are our expectations
         | silently increasing?
        
         | quadrature wrote:
         | Is the problem mainly with tool use? And are you using it
         | through AI Studio or through the API?
         | 
         | I've found that it hallucinates tool use for tools that aren't
         | available and then gets very confident about the results.
        
         | alecco wrote:
         | Same here. I stopped using Gemini Pro because, on top of
         | its hard-to-follow verbosity, it was giving contradictory
         | answers - things that Claude Sonnet 4 could answer.
         | 
         | Speaking of Sonnet, I feel like it's closing the gap to Opus.
         | After the new quotas I started to try it before Opus and now it
         | gets complex things right more often than not. This wasn't my
         | experience just a couple of months ago.
        
       | AtlanticThird wrote:
       | Upgraded and quickly hit my limit. It's fine that they have
       | limits, I just wish they were more transparent about them -
       | even if it's just a vague statement about limited usage. I
       | assumed it would be similar to regular Gemini 2.5 on the Pro
       | plan, but it's not.
        
       | unsupp0rted wrote:
       | Interesting that they compared Gemini 2.5 Deep Think on code
       | to all the top models EXCEPT Claude, the strongest of them at
       | code.
        
       | childintime wrote:
       | This comes at a time when my experience with Gemini is
       | lacking; it seems to get worse. It's not picking up on my
       | intentions, sometimes replies in the wrong language, etc.
       | Either that or I am just transparent that it's a tool, and
       | its feelings are hurt. I've had to call it a moron several
       | times, and it was funny when it started reprimanding me for
       | my foul language once. But it was wrong. This behavior seems
       | new. I could never trust it not to do random edits everywhere
       | in a document, so nowadays I use it to check Claude, which
       | can be trusted with a document.
        
         | declan_roberts wrote:
         | I had a good experience with gemini-cli (which I think uses pro
         | initially?). It's not very good, but it's very fast. So when
         | it's wrong it's wrong very quickly and you can either solve it
         | yourself or pivot your prompt. For a professional software
         | engineer this actually works out OK.
        
       | jawerty wrote:
       | If anyone wants a free to use deep research agent check out what
       | i'm working on: https://projectrex.onrender.com/
        
       | 44za12 wrote:
       | Is it something like tree of thought reasoning?
        
       | kelvinjps10 wrote:
       | These AI names are getting ridiculous
        
       | tdhz77 wrote:
       | I think at this point it's fair to say that I've switched
       | models more than Leonardo DiCaprio.
        
       | mclau157 wrote:
       | Asking AI to create 3D scenes like the example on the page
       | seems like asking someone to hammer something with a
       | screwdriver. We would need AI-compatible 3D software that
       | either has easier-to-use voxels built in, so it can create
       | something similar to pixel art, or easier math-defined curves
       | that can be meshed. Either way, AI just does not currently
       | have the right tools to generate 3D scenes.
        
         | mhh__ wrote:
         | One of the most alluring things about LLMs, though, is that
         | it's like having a screwdriver that can just about work as
         | a hammer, and also draft an email to your landlord, and so
         | on.
        
       | depluber wrote:
       | Is this a joke? I am paying $250/month because I want to use Deep
       | Think and I literally just burned through my daily quota in 5
       | PROMPTS. That is insane!
        
       | srameshc wrote:
       | I've been using Gemini 2.5 Pro for a few months now and have
       | found my experience to be very positive. I primarily use it for
       | coding and through the API, and I feel it's consistently
       | improving. I haven't yet tried Deep Think, though.
        
       | km3k wrote:
       | They missed an opportunity to name it Deep Thought.
        
       | motbus3 wrote:
       | I wonder how long it will take for someone to start
       | investigating the relation between SEO and Google's AI
       | crawler.
        
       | Invictus0 wrote:
       | I used to be enthusiastic about Gemini-2.5-Pro but now I can't
       | even get it to do decent file-diff summaries for a PR commit.
        
       ___________________________________________________________________
       (page generated 2025-08-01 23:00 UTC)