[HN Gopher] Gemini 2.5 Deep Think
___________________________________________________________________
Gemini 2.5 Deep Think
Author : meetpateltech
Score : 371 points
Date : 2025-08-01 11:10 UTC (11 hours ago)
(HTM) web link (blog.google)
(TXT) w3m dump (blog.google)
| amatic wrote:
| At the moment, Deep Think is only available with the ULTRA
| subscription ($250 per month).
| stingraycharles wrote:
| It's not available through the API?
| siva7 wrote:
| Is it available in the EU? Can someone confirm?
| highfrequency wrote:
| Approach is analogous to Grok 4 Heavy: use multiple "reasoning"
| agents in parallel and then compare answers before coming back
| with a single response, taking ~30 minutes. Great results, though
| it would be more fair for the benchmark comparisons to be against
| Grok 4 Heavy rather than Grok 4 (the fast, single-agent model).
| stingraycharles wrote:
| Yeah, the general "discovery" is that using the same reasoning
| compute effort, but spreading it over multiple different
| agents, generally leads to better results.
|
| It solves the "longer thinking leads to worse results" problem
| by pursuing multiple paths of thinking in parallel, with each
| path not thinking as long.
| Xenoamorphous wrote:
| > Yeah, the general "discovery" is that using the same
| reasoning compute effort, but spreading it over multiple
| different agents, generally leads to better results.
|
| Isn't the compute effort N times as expensive, where N is the
| number of agents? Unless you meant in terms of time (and even
| then, I guess it'd be the slowest of the N agents).
| NitpickLawyer wrote:
| Not exactly N times, no. In a traditional transformer arch,
| token 1 is cheaper to generate than token 1,000, which is
| cheaper than token 10,000, and so on. So ten concurrent
| 1,000-token sessions are cheaper to run than 10,000 tokens
| in one session.
|
| You also run into context issues and quality degradation
| the longer you go.
|
| (this is assuming gemini uses a traditional arch, and not
| something special regarding attention)
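|
| A rough back-of-the-envelope sketch of that scaling (assuming
| plain quadratic attention and ignoring KV-cache optimizations):
|
|     # Attention work for generating token t scales with t (one
|     # dot-product per previous token), so an n-token generation
|     # costs roughly n^2/2 attention ops in total.
|     def attn_ops(n_tokens: int) -> int:
|         return sum(range(n_tokens))  # 0 + 1 + ... + (n-1)
|
|     one_long = attn_ops(10_000)        # one 10k-token session
|     ten_short = 10 * attn_ops(1_000)   # ten 1k-token sessions
|     print(one_long / ten_short)        # ~10x more attention work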
| lynx97 wrote:
| I am surprised such a simple approach has taken so long to be
| actually used. My first image-description CLI attempt did
| basically that: use the sampling parameter n to get several
| answers and another pass to summarize.
| cinntaile wrote:
| It's very resource intensive so maybe they had to wait until
| processes got more efficient? I can also imagine they would
| want to try and solve it in a... better way before doing
| this.
| simianwords wrote:
| I agree, but I think it's hard to get a sufficient increase in
| performance to justify a 3-4x increase in cost.
| datadrivenangel wrote:
| It's an expensive approach, and depends on assessment being
| easy, which is often not the case.
| golol wrote:
| People have played with (multi-)agentic frameworks for LLMs
| from the very beginning, but it seems like only now, with
| powerful reasoning models, is it really making a difference.
| NitpickLawyer wrote:
| I built a similar thing around a year ago w/ autogen. The
| difference now is models can really be steered towards
| "part" of the overall goal, and they actually follow that.
|
| Before this, even the best "math" models were RL'd to death to
| only solve problems. If you wanted one to explore "method_a"
| of solving a problem you'd be SoL. The model would start like
| "ok, the user wants me to explore method_a, so here's the
| solution: blablabla", doing whatever it wanted, unrelated to
| method_a.
|
| Similar things for gathering multiple sources. Only recently
| can models actually pick the best thing out of many
| instances, and work _effectively_ at large context lengths.
| The previous tries with 1M context lengths were at best
| gimmicks, IMO. Gemini 2.5 seems the first model that can
| actually do useful stuff after 100-200k tokens.
| mvieira38 wrote:
| Is o3-pro the same as these?
| satvikpendem wrote:
| No, it doesn't take 30 minutes
| Workaccount2 wrote:
| Grok-4 heavy benchmarks used tools, which trivializes a lot of
| problems.
| littlestymaar wrote:
| That this kind of approach works is good news for local LLM
| enthusiasts, as it makes Cloud LLM using this more expensive
| while local LLM can do so for free up to a point (because LLM
| inference is limited by memory bandwidth not compute, you can
| run multiple queries in parallel on your graphic card at the
| same speed as the single one. Until you become compute-bound of
| course).
| mettamage wrote:
| Wait, how does this work? If you load in one LLM of 40 GB,
| then to load in four more LLMs of 40 GB still takes up an
| extra 160 GB of memory right?
| masterjack wrote:
| It will typically be the same 40 GB model loaded in, but
| called with many different inputs simultaneously
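|
| A minimal numpy sketch of why that batching is nearly free
| while you're memory-bound (hypothetical shapes; the point is
| that the weights are streamed from memory once per step, no
| matter how many queries share them):
|
|     import numpy as np
|
|     d = 4096
|     W = np.random.randn(d, d).astype(np.float32)  # one copy of weights
|
|     def decode_step(batch_hidden):                # (batch, d)
|         # the same W serves every row, so the dominant cost of
|         # reading W from memory is paid once per step
|         return batch_hidden @ W
|
|     x1 = np.random.randn(1, d).astype(np.float32)  # single query
|     x8 = np.random.randn(8, d).astype(np.float32)  # eight batched
|     _ = decode_step(x1)
|     _ = decode_step(x8)   # ~same wall time, until compute-bound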
| zozbot234 wrote:
| > because LLM inference is limited by memory bandwidth not
| compute, you can run multiple queries in parallel on your
| graphic card at the same speed as the single one
|
| I don't think this is correct, especially given MoE. You can
| save _some_ memory bandwidth by reusing model parameters, but
| that's about it. It's not giving you the same speed as a
| single query.
| dweekly wrote:
| Dumb (?) question, but how is Google's approach here different
| from Mixture of Experts, where instead of training different
| experts to have different model weights you just count on
| temperature to provide diversity of thought? How much benefit
| is there in getting the diversity of thought in different runs
| of the same model versus running a consortium of different
| model weights and architectures? Is there a paper contrasting
| results given fixed computation between spending that compute
| on multiple runs of the same model vs different models?
| joaogui1 wrote:
| Mixture of Experts isn't using multiple models with different
| specialties, it's more like a sparsity technique, where you
| massively increase the number of parameters and use only a
| subset of the weights in each forward pass.
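|
| A toy sketch of the routing idea (illustrative only, not
| Gemini's actual architecture): a learned gate scores the
| experts per token and only the top-k feed-forward blocks run.
|
|     import numpy as np
|
|     n_experts, d, k = 8, 16, 2
|     experts = [np.random.randn(d, d) for _ in range(n_experts)]
|     gate = np.random.randn(d, n_experts)  # router weights
|
|     def moe_forward(x):                   # x: (d,) one token's state
|         scores = x @ gate                 # one logit per expert
|         top_k = np.argsort(scores)[-k:]   # only k of 8 experts run
|         w = np.exp(scores[top_k])
|         w /= w.sum()                      # softmax over chosen experts
|         return sum(wi * (x @ experts[i]) for wi, i in zip(w, top_k))
|
|     out = moe_forward(np.random.randn(d))  # 8x params, ~2/8 the FLOPs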
| HarHarVeryFunny wrote:
| MOE is just a way to add more parameters/capacity to a model
| without making it less efficient to run, since it's done in a
| way that not all parameters are used for each token passing
| through the model. The name MOE is a bit misleading since the
| "experts" are just alternate paths though part of the model,
| not having any distinct expertise in the way the name might
| suggest.
|
| Just running the model multiple times on the same input and
| selecting the best response (according to some judgement)
| seems a bit of a haphazard way of getting much diversity of
| response, if that is really all it is doing.
|
| There are multiple alternate approaches to sampling different
| responses from the model that come to mind, such as:
|
| 1) "Tree of thoughts" - generate a partial response (e.g. one
| token, or one reasoning step), then generate branching
| continuations of each of those, etc, etc. Compute would go up
| exponentially according to number of chained steps, unless
| heavy pruning is done similar to how it is done for MCTS.
|
| 2) Separate response planning/brainstorming from response
| generation by first using a "tree of thoughts"-like process
| just to generate some shallow (e.g. depth < 3) alternate
| approaches, then use each of those approaches as additional
| context to generate one or more actual responses (to then
| evaluate and choose from). Hopefully this would result in
| some high-level variety of response without the cost of
| just generating a bunch of responses and hoping that they are
| usefully diverse.
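|
| A hypothetical sketch of approach (1) with the heavy pruning
| applied, which makes it essentially a beam search (generate()
| and score() stand in for model calls):
|
|     # Branch, score, prune, repeat. Pruning to a fixed beam caps
|     # the otherwise-exponential blowup in compute.
|     def tree_of_thoughts(prompt, generate, score,
|                          branch=3, beam=2, depth=3):
|         partials = [prompt]
|         for _ in range(depth):
|             candidates = [p + step for p in partials
|                           for step in generate(p, n=branch)]
|             candidates.sort(key=score, reverse=True)
|             partials = candidates[:beam]  # keep only the best few
|         return partials[0]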
| bitshiftfaced wrote:
| What makes you sure of that? From the article,
|
| > Deep Think pushes the frontier of thinking capabilities by
| using parallel thinking techniques. This approach lets Gemini
| generate many ideas at once and consider them simultaneously,
| even revising or combining different ideas over time, before
| arriving at the best answer.
|
| This doesn't exclude the possibility of using multiple agents
| in parallel, but to me it doesn't necessarily mean that this is
| what's happening, either.
| highfrequency wrote:
| What could "parallel thinking techniques" entail if not
| "using multiple agents in parallel"?
| JyB wrote:
| How can it not be exactly what's happening?
| Melatonic wrote:
| Surprised no one has released an app yet that pits all the
| major models against each other for a final answer.
| 7373737373 wrote:
| 139.99EUR/month for something you can't even test first, lol
| lynx97 wrote:
| Wait...
|
| So if someone is cool enough, they could actually give us a
| DeepThought model?
|
| Please, let that happen.
|
| Vendor-DeepThought-42B maybe?
| xnx wrote:
| > could actually give us a DeepThought model
|
| Yes, but the response time is terrible. 7.5 million years
| stocksinsmocks wrote:
| Qwen-DeepInThot-69b-Instruct is what I'm looking forward to.
| simianwords wrote:
| Grok 4 Heavy, o3-pro and Gemini Deep Think are all equivalent
| offerings. I wonder how they compare?
| baggachipz wrote:
| You can't go anywhere without having Gemini shoved in your face.
| I had an immediate visceral reaction to this.
| serialNumber wrote:
| I'm never the one to defend AI, but what do you mean? Is it the
| "AI overview" that pops up on Google? Other than that, I would
| say Gemini is definitely less in your face than ChatGPT for
| example
| baggachipz wrote:
| My company uses google workspace and every google doc,
| spreadsheet, calendar, online meeting and search puts nonstop
| callouts and messages about using Gemini. It's gotten so bad
| that I'm about to try building a browser extension to block
| that bullshit. It clutters the UI and nags. If I wanted that
| crap, I'd turn it on.
| serialNumber wrote:
| I see - I feel the same way about Copilot (as my company
| uses the Microsoft ecosystem).
| breppp wrote:
| Yes it smells like google+, but at least gemini is actually
| a good product
| int_19h wrote:
| I also find those "please please please try me!" popups
| annoying, but at least Google Workspace is one product
| where deep AI integration actually makes sense. I like the
| ability to quickly get summaries or edit documents by
| telling it what needs to be done in generic terms.
| resoluteteeth wrote:
| > If you're a Google AI Ultra subscriber, you can use Deep Think
| in the Gemini app today with a fixed set of prompts a day by
| toggling "Deep Think" in the prompt bar when selecting 2.5 Pro in
| the model drop down.
|
| If "fixed set" means a fixed number, it would be nice to know
| how many.
|
| Otherwise I would like to know what "fixed set" means here.
| Workaccount2 wrote:
| You get 10 requests per day it seems.
|
| Apparently the model will think for 30+ minutes on a given
| prompt. So it seems it's more for research or dense multi-
| faceted problems than for general coding or writing fan fic.
| barapa wrote:
| Available via API?
| simonw wrote:
| Not yet.
| https://x.com/OfficialLoganK/status/1951260803459338394 said
| "Should we put it in the Gemini API next?"
| Oras wrote:
| Going to twitter to find info is exactly what's wrong with
| Google
| asaddhamani wrote:
| I find it interesting, how OpenAI came out with a $200 plan,
| Anthropic did $100 and $200, then Gemini ups it to $250, and now
| Grok is at $300.
|
| OpenAI is the only one that says "practically unlimited" and I
| have never hit any limit on my ChatGPT Pro plan. I hit limits on
| Claude Max (both plans) several times.
|
| Why are these companies not upfront about what the limits are?
| thimabi wrote:
| > Why are these companies not upfront about what the limits
| are?
|
| Most likely because they reserve the right to dynamically alter
| the limits in response to market demands or infrastructure
| changes.
|
| See, for instance, the Ghibli craze that dominated ChatGPT a
| few months ago. At the time OpenAI had no choice but to
| severely limit image generation quotas, yet today there are
| fewer constraints.
| jstummbillig wrote:
| Because if you are transparent about the limits, more people
| will start to game the limits, which leads to lower limits for
| everyone - which is a worse outcome for almost everyone.
|
| tldr: We can't have nice things, because we are assholes.
| holowoodman wrote:
| Because they want to have their cake and eat it too.
|
| A fair pricing model would be token-based, so that a user can
| see how much each query costs, and only pay for what
| they actually used. But AI companies want a steady stream of
| income, and they want users to pay as much as possible while
| using as little as possible. Therefore they ask for a monthly
| or even yearly price with an unknown number of tokens included,
| such that you will always pay more than with token-based
| payments.
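|
| To illustrate with made-up numbers (hypothetical per-token
| rates; real list prices vary by model):
|
|     # Compare a flat subscription against metered token pricing.
|     price_in, price_out = 1.25, 10.00       # $/1M tokens, assumed
|     queries = 300                           # per month
|     tok_in, tok_out = 20_000, 2_000         # per query, assumed
|
|     metered = queries * (tok_in / 1e6 * price_in
|                          + tok_out / 1e6 * price_out)
|     print(f"metered: ${metered:.2f}/mo vs. $250/mo flat")  # $13.50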
| hnav wrote:
| I don't think it's that. I think they just want people to
| onboard onto these things before understanding what the
| actual cost might be once they're not subsidized by megacorps
| anymore, something similar to loss-leading endeavors like
| Uber and Lyft in the 2010s. I suspect that showing the
| actual cost of inference would raise questions about the cost-
| effectiveness of these things for a lot of applications.
| Internally, Google's data query surfaces tell you cost in
| terms of SWE-time (e.g. this query cost 1 SWE hour) since the
| incentives are different.
| teitoklien wrote:
| You're right about their intentions in the future. But right
| now, they are literally losing money every single time
| someone uses their product...
|
| In most cases, at least Claude does for sure. So yeah, for
| now, they're losing money anyways.
| Invictus0 wrote:
| You just completely made that up
| cpursley wrote:
| I can't even convince Gemini CLI, while planning things, not
| to go off and make a bunch of random changes on its own, even
| after being very clear not to do so. I intercept to tell it to
| stop doing that, and then it just continues on fucking
| everything up.
| WhitneyLand wrote:
| Agents muddy the waters.
|
| Claude Code gets the most out of Anthropic's models, that's why
| people love it.
|
| Conversely, Gemini CLI makes Gemini 2.5 Pro less capable than
| the model itself actually is.
|
| It's such a stark difference that I've given up using Gemini
| CLI even with it being free, but I still use the model for
| situations amenable to a prompt interface on a regular basis.
| It's a very strong model.
| panarky wrote:
| That's my experience too, when I give Gemini CLI a big, general
| task and just let it run.
|
| But if I give it structure so it can write its own context, it
| is truly astonishing.
|
| I'll describe my big, general task and tell it to first read
| the codebase and then write a detailed requirements document,
| and not to change any code.
|
| Then I'll tell it to read the codebase and the detailed
| requirements document it just wrote, and then write a detailed
| technical spec with API endpoints, params, pseudocode for
| tricky logic, etc.
|
| Then I'll tell it to read the codebase, and the requirements
| document it just wrote, and the tech spec it just wrote, and
| decomp the whole development effort into weekly, daily and
| hourly tasks to assign to developers and save that in a dev
| plan document.
|
| Only then is it ready to write code.
|
| And I tell it to read the code base, requirements, tech spec
| and dev plan, all of which it authored, and implement Phase 1
| of the dev plan.
|
| It's not all mechanical and deterministic, or I could just
| script the whole process. Just like with a team of junior devs,
| I still need to review each document it writes, tweak things I
| don't like, or give it a better prompt to reflect my priorities
| that I forgot to tell it the first time, and have it redo a
| document from scratch.
|
| But it produces 90% or more of its own context. It ingests all
| that context that it mostly authored, and then just chugs along
| for a long time, rarely going off the rails anymore.
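|
| Paraphrased, the staged prompts look something like this
| (run() is a hypothetical wrapper around whatever agent CLI
| you use; the document names are made up):
|
|     stages = [
|         "Read the codebase, then write requirements.md: a detailed "
|         "requirements document. Do not change any code.",
|         "Read the codebase and requirements.md, then write spec.md "
|         "with API endpoints, params, and pseudocode for tricky logic.",
|         "Read the codebase, requirements.md and spec.md, then decomp "
|         "the effort into weekly/daily/hourly tasks in dev-plan.md.",
|         "Read all three documents you authored and implement Phase 1 "
|         "of dev-plan.md.",
|     ]
|     for prompt in stages:
|         run(prompt)  # review and tweak each doc before the next stage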
| simonw wrote:
| This isn't the exact same model that achieved gold in the IMO a
| few weeks ago but is a close relative:
| https://x.com/OfficialLoganK/status/1951262261512659430
|
| It's not yet available via an API.
| foundry27 wrote:
| I started doing some experimentation with this new Deep Think
| agent, and after five prompts I reached my daily usage limit. For
| $250 USD/mo that's what you'll be getting folks.
|
| It's just bizarrely uncompetitive with o3-pro and Grok 4 Heavy.
| Anecdotally (from my experience) this was the one feature that
| enthusiasts in the AI community were interested in to justify the
| exorbitant price of Google's Ultra subscription. I find it
| astonishing that the same company providing _free_ usage of their
| top models to everybody via AI Studio is nickel-and-diming their
| actual customers like that.
|
| Performance-wise? So far, I couldn't even tell. I provided it
| with a challenging organizational problem that my business was
| facing, with the relevant context, and it proposed a lucid and
| well-thought-out solution that was consistent with our internal
| discussions on the matter. But o3 came to an equally effective
| conclusion for a fraction of the cost, even if it was less
| "cohesive" of a report. I guess I'll have to wait until tomorrow
| to learn more.
| andsoitis wrote:
| it turns out that AI at this level is very expensive to run
| (capex, energy). my bet is that AI itself won't figure out how
| to overcome these constraints and reach escape velocity.
| crowcroft wrote:
| Mainframes are the only viable way to build computers. Micro
| processors will never figure out how to get small and fast
| enough for personal computers to reach escape velocity.
| andsoitis wrote:
| Why do you think the analogy holds?
| suddenlybananas wrote:
| Well it's an analogy, that's a water-tight argument.
| crowcroft wrote:
| Hardware typically gets faster and cheaper over time.
| Unless we hit a hard wall because of physics, I don't
| see any reason that won't continue to be true.
| twobitshifter wrote:
| Our minds are incredibly energy efficient, which leads me to
| believe it is possible to figure out, but it might be a human
| rather than an AI that gives us something more akin to a
| biological solution.
| klabb3 wrote:
| This could fix my main gripe with The Matrix. "Humans are
| used as batteries" always felt off, but it totally would
| make sense if the human brains have uniquely energy
| efficient pattern matching abilities that an emerging AI
| organism would harvest. That would also strengthen the
| spiritual humanist subtext.
| Kon5ole wrote:
| That bugged me too! I decided that they actually meant
| "source of creativity" which made more sense.
| Melatonic wrote:
| That's because the original script did actually have the
| human farms being used for brainpower for the machines.
| They changed it to "batteries" because they thought
| audiences at the time wouldn't understand it!
| int_19h wrote:
| Perhaps this will be the incentive to finally get fusion
| working. Big tech megacorps are flush with cash and could
| fund this research many times over at current rates. E.g. NIF
| is several billion dollars; Google alone has almost $100B in
| the bank.
| raincole wrote:
| Interestingly Gemini CLI has a very generous free quota. Is
| Google's strategy just overpricing some stuff and subsidizing
| the underpriced stuff?
| sunaookami wrote:
| It doesn't. It's not "1000 Gemini Pro requests" for free;
| Google misled everyone. It's 1000 Gemini requests, Flash
| included. You get like 5-7 Gemini Pro requests before you
| get limited.
| panarky wrote:
| I'm getting 100 Gemini Pro requests per day with an AI
| Studio API key that doesn't have billing enabled.
|
| After that it's bumped down to Flash, which is surprisingly
| effective in Gemini CLI.
|
| If I need Pro, I just swap in an API key from an account
| with billing enabled, but usually 100 requests is enough
| for a day of work.
| indigodaddy wrote:
| I created an AI Studio key (unbilled) probably over a
| year ago or so. Is it still good for the current models
| or should I be creating a new key?
| yunohn wrote:
| Whoa. I'm definitely getting just a handful of requests via
| free, just like the parent commenter...
| ebiester wrote:
| I've found the free version swaps away from Pro incredibly
| fast. But our company has Gemini and can't even get that; we
| were being asked to do everything by API key.
| pembrook wrote:
| This is the fundamental pricing strategy of all modern
| software in fact.
|
| Underpriced for consumers, overpriced for businesses.
| thimabi wrote:
| > I find it astonishing that the same company providing free
| usage of their top models to everybody via AI Studio is nickel-
| and-diming their actual customers like that.
|
| I agree that's not a good posture, but it is entirely
| unsurprising.
|
| Google is probably not profiting from AI Ultra customers
| either, and grabbing all that sweet usage data from the free
| tier of AI Studio is what matters most to improve their models.
|
| Giving free access to the best models allows Google to capture
| market share among the most demanding users, which are
| precisely the ones that will be charged more in the future. In
| a certain sense, it's a great way for Google to use its huge
| idle server capacity nowadays.
| hirako2000 wrote:
| I'm burning well over 10 million tokens a day on the free
| tier. 99% of the input is freely available data, the rest is
| useless. I never provided any feedback. Sure, there is some
| telemetry; they can have it.
|
| I doubt I'm an isolated case. This Gemini gig will cost
| Google a lot; they pushed it on all Android phones around the
| globe. I can't wait to see what happens when they have to
| admit that not many people will pay over 20 bucks for "AI".
| And I would pay well over 20 bucks just to see the face of
| the C-suite next year when one of them dares to explain in
| simple terms that there is absolutely no way to recoup the
| DC investment, and that powering the whole thing will cost
| the company 10 times that.
| svantana wrote:
| I suspect that the main goal here was to grab the top spot in a
| bunch of benchmarks, and being counted as an "available" model.
| llm_nerd wrote:
| They're using it as a major inducement to upgrade to AI
| Ultra. I mean, the image and video stuff is neat, but adds no
| value for the vast majority of AI subscribers, so right now
| this is the most notable benefit of paying 12x more.
|
| FWIW, Google seems to be having some severe issues with
| oddball, perhaps malfunctioning quota systems. I regularly
| find that extraordinarily little use of gemini-cli hits
| the purported 1000-request limit, when in reality I've made
| fewer than 10 requests.
| hirako2000 wrote:
| I faced the exact same problem with the API. It seems that
| it doesn't throttle early enough, then may accumulate the
| cool-off period, making it impossible to determine when to
| fire requests again.
|
| Also, I noticed Gemini (even Flash) has Google Search
| support, but only via the web UI or the native mobile app.
| Via the API that would require SERP via an MCP of sorts,
| even with Gemini Pro.
|
| Oh, and some models regularly face outages. 503s are not
| uncommon. No SLA page, no alerts, nothing.
|
| The reasoning feature is buggy; even if disabled, it
| sometimes triggers anyway.
|
| It occurred to me the other day that Google probably has
| the best engineers, given how well Gemini performs and where
| it's coming from, and the context window that is uniquely
| large compared to any other model. But it is likely
| operated by managers coming from AWS, where shipping half-
| baked, barely tested software was all it took to get a
| bonus.
| mnmatin wrote:
| They might not have been ready/optimized for production, but
| still wanted to release it before the EU AI Act's Aug 2
| deadline; this way they have 2 years for compliance. So the
| strategy of aggressively rate-limiting a few users makes
| sense.
| novok wrote:
| wheee, great way to lock in incumbents even more or lock out
| the EU from startups
| amelius wrote:
| It could be that your problem was too simple to justify the use
| of Deep Think.
|
| But yes, Google should have figured that out and used a less
| expensive mode of reasoning.
| dweekly wrote:
| "I'm sorry but that wasn't a very interesting question you
| just asked. I'll spare you the credit and have a cheaper
| model answer that for you for free. Come back when you have
| something actually challenging."
| ccozan wrote:
| Actually, why not? Recognizing problem complexity as a first
| step is really crucial for such expensive "experts". Humans
| do the same.
|
| And a question to the knowledgeable: does a simple/stupid
| question cost more in terms of resources than a complex
| problem? In terms of power consumption.
| joshuanapoli wrote:
| Just put in the prompt customization to model responses
| on Marvin from Hitchhiker's Guide.
|
| "Here I am, brain the size of a planet, and they ask me
| to ..."
| Fade_Dance wrote:
| Do they not do this in some general way behind the
| scenes? Surprising to me if not.
| nyrikki wrote:
| IIRC that isn't possible under current models at least in
| general, for multiple reasons, including attention cannot
| attend to future tokens, the fact that they are
| existential logic, that they are really NLP and not NLU,
| etc...
|
| Even proof mining and the Harrop formula have to exclude
| disjunction and existential quantification to stay away
| from intuitionist math.
|
| IID in PAC/ML implies PEM which is also intentionally
| existential quantification.
|
| This is the most gentle introduction I know of, but
| remember LLMs are fundamentally set shattering, and
| produce disjoint sets also.
|
| We are just at reactive model based systems now, much
| work is needed to even approach this if it ever is even
| possible.
|
| [0] https://www.cmu.edu/dietrich/philosophy/docs/tech-
| reports/99...
| robwwilliams wrote:
| Hmm, I needed Claude 4's help to parse your response. The
| critique was not too kind to your abbreviated arguments
| that current systems are not able to gauge the complexity
| of a prompt and the resources needed to address a
| question.
| jiggawatts wrote:
| It feels like the rant of someone upset that their
| decades of formal-logic approaches to AI became a dead end.
|
| I see this semi-regularly: futile attempts at handwaving
| away the obvious intelligence by some formal argument
| that is either irrelevant or inapplicable. Everything
| from thermodynamics -- which applies to human brains too
| -- to information theory.
|
| Grey-bearded academics clinging to anything that might
| float to rescue their investment into ineffective
| approaches.
|
| PS: This argument seems to be that LLMs "can't think
| ahead" when all evidence is that they clearly can! I
| don't know exactly what words I'll be typing into this
| comment textbox seconds or minutes from now but I can --
| hopefully obviously -- think intelligent thoughts and
| plan ahead.
|
| PPS: The em-dashes were inserted automatically by my
| iPhone, not a chat bot. I assure you that I am a mostly
| human person.
| riwsky wrote:
| "This meeting could've been an email"
| nojito wrote:
| I know this is a joke but I have been able to lower my
| costs by routing my prompts to a smaller model to determine
| if I need to send it to a larger model or not.
| danenania wrote:
| Model routing is deceptively hard though. It has halting
| problem characteristics: often only the smartest model is
| smart enough to accurately determine a task's difficulty. And
| if you need the smartest model to reliably classify the
| prompt, it's cheaper to just let it handle the prompt
| directly.
|
| This is why model pickers persist despite no one liking them.
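|
| (A sketch of the routing pattern under discussion, with
| hypothetical model callables; the catch is exactly the one
| above: the cheap model is worst at judging what it can't
| handle.)
|
|     def route(prompt, cheap_model, smart_model):
|         # a cheap classifier pass decides where the prompt goes
|         verdict = cheap_model(
|             f"Answer EASY or HARD only. Task: {prompt}")
|         model = smart_model if "HARD" in verdict.upper() else cheap_model
|         return model(prompt)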
| martinald wrote:
| Yes but prompt evaluation is far faster than inference as
| it can be done (mostly) in parallel, so I don't think
| that's true.
| danenania wrote:
| The problem is that input token cost dominates output
| token cost for the majority of tasks.
|
| Once you've given the model your prompt and are reading
| the first output token for classification, you've already
| paid most of the cost of just prompting it directly.
|
| That said, there could definitely be exceptions for short
| prompts where output costs dominate input costs. But
| these aren't usually the interesting use cases.
| redox99 wrote:
| That's usually not the case for thinking models. And
| usually hard problems have a very short prompt.
| danenania wrote:
| For me personally (using mostly for coding and project
| planning) it's nearly always the case, including with
| thinking models. I'm usually pasting in a bunch of files,
| screenshots, etc., and having long conversations. Input
| nearly always heavily dominates output.
|
| I don't disagree that there _are_ hard problems which use
| short prompts, like math homework problems etc., but they
| mostly aren't what I would categorize as "real work".
| But of course I can only speak to my own experience
| /shrug.
| redox99 wrote:
| Yeah coding is definitely a situation where context is
| usually very very large. But at the same time in those
| situations something like Sonnet is fine.
| energy123 wrote:
| No, you're talking about costs to user, which are
| oversimplifications of the costs that providers bear. One
| output token with a million input tokens is incredibly
| cheap for providers
| danenania wrote:
| > One output token with a million input tokens is
| incredibly cheap for providers
|
| Source? Afaik this is incorrect.
| danielbln wrote:
| Check out any LLM API provider's pricing. Output tokens
| are always significantly more expensive than input (which
| can also be cached).
| danenania wrote:
| Input tokens usually dominate output tokens by a _lot_
| more than 2x though. It's often 10x or more input. It can
| even easily be 100x or more. Again in realistic
| workflows.
|
| Caching does help the situation, but you always at least
| pay the initial cache write. And prompts need to be
| structured carefully to be cacheable. It's not a free
| lunch.
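|
| Concretely, with made-up but roughly-shaped prices (output
| commonly runs ~4x the input rate):
|
|     price_in, price_out = 2.50, 10.00   # $/1M tokens, assumed
|
|     def cost(tokens_in, tokens_out):
|         return (tokens_in / 1e6 * price_in
|                 + tokens_out / 1e6 * price_out)
|
|     print(cost(200_000, 4_000))  # coding session: $0.54, ~93% input
|     print(cost(300, 8_000))      # short prompt, long answer: ~$0.08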
| golfer wrote:
| What was the experimentation? Can you share with us so we can
| see how "bizarrely uncompetitive" it is?
| iamronaldo wrote:
| "Bizarrely uncompetitive" is referencing the 5 uses per day,
| not the performance itself.
| dataviz1000 wrote:
| Several years ago I thought a good litmus test for mastery of
| coding was not finding a solution using internet search, nor
| getting well-written questions about esoteric coding problems
| answered on StackOverflow. For a while, I would post a
| question and answer my own question after I solved the
| problem, for posterity (or AI bots). I always loved getting
| the "I've been working on this for 3 days and you saved my
| life" comments.
|
| I've been working on a challenging problem all this week and
| all the AI copilot models are worthless at helping me. Mastery
| in coding is being alone when nobody else nor AI copilots can
| help you and you have to dig deep into generalization,
| synthesis, and creativity.
|
| (I thought to myself, at least it will be a little while longer
| before I'm replaced with AI coding agents.)
| benreesman wrote:
| They're remarkably useless on stuff they've seen but not had
| up-weighted in the training set. Even the best ones (Opus 4
| running hot, Qwen and K2 will surprise you fairly often) are
| a net liability in some obscure thing.
|
| Probably the starkest example of this is build system stuff:
| it's really obvious which ones have seen a bunch of
| `nixpkgs`, and even the best ones seem to really struggle
| with Bazel and sometimes CMake!
|
| The absolute prestige high-end ones, running flat out and
| burning 100+ dollars a day, are a lift on pre-SEO Google/SO I
| think... but it's not like a blowout vs. a working search
| index. Back when all the source, all the docs, and all the
| troubleshooting for any topic on the whole Internet were all
| above the fold on Google? It was kinda like this: type a
| question in the magic box and working-ish code pops out. Same
| at a glory-days FAANG with the internal mega-grep.
|
| I think there's a whole cohort or two who think that "type in
| the magic box and code comes out" is new. It's not new, we
| just didn't have it for 5-10 years.
| burnte wrote:
| I have similar issues with support from companies that
| heavily push AI and self-serve models and make human support
| hard. I'm very accomplished and highly capable. If I feel the
| need to turn to support, the chances the solution is in a KB
| are very slim; same with AI. It'll be a very specific
| situation with a very specific need.
| Melatonic wrote:
| There are a lot of internal KB's companies keep to
| themselves in their ticketing systems - would be
| interesting to estimate how much good data there is in
| there that could in the future be used to train more
| advanced (or maybe more niche or specific) AI models.
| Melatonic wrote:
| This has been my thought for a long time - unless there is
| some breakthrough in AI algo I feel like we are going to hit
| a "creativity wall" for coding (and some other tasks).
| red75prime wrote:
| Any reason to think that the wall will be under the human
| level?
| hirako2000 wrote:
| Of the thousands of responses I have read from the top
| LLMs in the last couple of years, I've never seen one that
| was creative, across writing, coding, problem solving,
| mathematical questions and whatnot.
|
| It's somewhat easier to perceive the lack of creativity
| with stable diffusion. I'm not talking about the missing-
| limb or extra-finger glitches. With a bit of experience
| looking through generated images, our brain eventually
| perceives the absolute lack of creativity; an artist
| probably spots it without prior experience with generative
| AI pieces. With LLMs it takes a bit longer.
|
| Anecdotal, baseless, I guess. Papers were published: some
| researchers in the fields of science couldn't get the
| best LLMs to solve any unsolved problem. I recently came
| across a paper stating bluntly that all the LLMs tested were
| unable to conceptualize or derive laws that generalize
| whatsoever, e.g. formulas.
|
| We are being duped. It doesn't help in selling $200 monthly
| subscriptions (soon for even more) if marketers
| admitted there is absolutely zero reasoning going on with
| these stochastic machines on steroids.
|
| I deeply wish the circus ends soon, so that we can start
| focusing on what LLMs are excellent and well fitted to do
| better and faster than humans.
|
| Creative it is not.
| epolanski wrote:
| Your post misses the fact that 99% of programming is
| repetitive plumbing, and that the overwhelming majority of
| developers, even Ivy League graduates, suck at coding and
| problem solving.
|
| Thus, AI is a great productivity tool for the overwhelming
| majority of problems out there, if you know how to use it.
|
| This whole narrative of "okay, but it can't replace me in
| this or that situation" is honestly somewhere between an
| obvious touché (why would you think AI would replace rather
| than empower those who know their craft?) and stale Luddism.
| LeafItAlone wrote:
| > It's just bizarrely uncompetitive with o3-pro and Grok 4
| Heavy.
|
| In my experience Grok 4 and 4 Heavy have been crap. Who cares
| how many requests you get with it when the response is
| terrible. Worst LLM money I've spent this year and I've spent a
| lot.
| danenania wrote:
| It's interesting how multi-dimensional LLM capabilities have
| proven to be.
|
| OpenAI reasoning models (o1-pro, o3, o3-pro) have been the
| strongest, in my experience, at harder problems, like finding
| race conditions in intricate concurrency code, yet they still
| lag behind even the initial sonnet 3.5 release for writing
| basic usable code.
|
| The OpenAI models are kind of like CS grads who can solve
| complex math problems but can't write a decent React
| component without yadda-yadda-ing half of it, while the
| Anthropic models will crank out many files of decent,
| reasonably usable code while frequently missing subtleties
| and forgetting the bigger picture.
| nxobject wrote:
| Those may have been the exact people creating training
| material for OpenAI...
| qingcharles wrote:
| It's just wildly inconsistent to me. Some times it'll produce
| a work of genius. Other times, total garbage.
| epa wrote:
| Unfortunately we are still in the prompt optimization
| stage, garbage in garbage out
| MaxikCZ wrote:
| I hear this repeated so many times I feel like it's a
| narrative pushed by the sellers. A year ago you could ask
| for a glass of wine filled to the brim and you just wouldn't
| get it. It wasn't garbage in, garbage out; it was
| sensibility in, garbage out.
|
| The line where chatbots stop being sensible and start
| outputting garbage is moving, but more slowly than the
| average joe would guess. You only notice it when you have
| an intuition about the answer before you see it, which
| requires a lot of experience across a range of complexity.
| Persistent newbies are the best spotters, because they ask
| obvious basic questions while also asking for stuff beyond
| what geniuses could solve, and only by getting garbage
| answers and enduring the process of realizing it's actually
| garbage do they build a wider picture of AI than even most
| power users, who tend to have more balanced queries.
| LeafItAlone wrote:
| Maybe. That could be true.
|
| But it doesn't happen the same with other tools. I'll give
| the _same exact_ prompt to all of the LLMs I have access to
| and look at the responses for the best one. Grok is
| consistently the worst. So if it's garbage in, garbage
| out, why are the other ones so much better at dealing
| with my garbage?
| hirako2000 wrote:
| I think they meant in the training stage, not inference.
| 827a wrote:
| Similar complaints are happening all over reddit with the
| Claude Code $200/mo plan and Cursor. The companies with deep VC
| funding have been subsidizing usage for a year now, but we're
| starting to see that bleed off.
|
| I think the primary concern of this industry right now is how,
| relative to the current latest generation models, we
| simultaneously need intelligence to increase, cost to decrease,
| effective context windows to increase, and token bandwidths to
| increase. All four of these things are real bottlenecks to
| unlocking the "next level" of these tools for software
| engineering usage.
|
| Google isn't going to make billions on solving advanced math
| exams.
| Fade_Dance wrote:
| Agreed, and big context windows are key to mass adoption in
| wider use cases beyond chatbots (random ex: in knowledge
| management apps, being able to parse the entire note
| library/section and hook it into global AI search), but those
| use cases are decidedly _not_ areas where $200 per month
| subscriptions can work.
|
| I'll hazard to say that cost and context windows are the two
| key metrics to bridge that chasm with acceptable results...
| As for software engineering, though, that cohort will be
| demanding on all fronts for the foreseeable future, especially
| because there's a bit of a competitive element. Nobody wants
| to be the vibecoder using sub-par tools compared to everyone
| else showing off their GitHub results and making sexy blog
| posts about it on HN.
| com2kid wrote:
| Outside of code, the current RAG strategy is throw shit
| tons of unstructured text at it that has been found using
| vector search. Some companies are doing better, but the
| default rag pipelines are... kind of garbage.
|
| For example, a chat bot doing recipe work should have a RAG
| DB that, by default, returns entire recipes. A vector DB is
| actually not the solution here, any number of traditional
| DBs (relational or even a document store) would work fine.
| Sure do a vector search across the recipe texts, but then
| fetch the entire recipe from someplace else. Current RAG
| solutions can do this, but the majority of RAG deployments
| I have seen don't bother, they just abuse large context
| windows.
|
| Which looks like it works, except what you actually have in
| your context window is 15 different recipes all stitched
| together. Or if you put an entire recipe book into the
| context (which is perfectly doable nowadays!), you'll end
| up with the chatbot mixing up ingredients and proportions
| between recipes, because you just voluntarily polluted its
| context with irrelevant info.
|
| Large context windows allow for sloppy practices that end
| up making for worse results. Kind of like when we decided
| web servers needed 16 cores and gigs of RAM to run IBM
| Websphere back in the early 2000s, to serve up mostly
| static pages. The availability of massive servers taught
| bad habits (huge complicated XML deployment and
| configuration files, oodles of processes communicating with
| each other to serve a single page, etc).
|
| Meanwhile in the modern world I've run mission-critical,
| high-throughput services for giant companies on a k8s
| cluster consisting of 3 machines, each with .25 CPU and a
| couple hundred megs of RAM allocated.
|
| Sometimes more is worse.
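|
| The fix is mechanically simple; a hypothetical sketch
| (vector_search and recipes_db are stand-ins for whatever
| index and document store you actually run):
|
|     def retrieve_recipes(query, vector_search, recipes_db, k=3):
|         # the vector index only *finds* recipes...
|         hits = vector_search(query, top_k=k)        # recipe IDs
|         # ...the document store returns them whole
|         return [recipes_db.get(hit.id) for hit in hits]
|
|     def build_context(query, vector_search, recipes_db):
|         recipes = retrieve_recipes(query, vector_search, recipes_db)
|         # one clearly delimited recipe per block, nothing stitched
|         return "\n\n---\n\n".join(recipes)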
| 827a wrote:
| IMO: Context engineering is a fascinating topic because
| it starts approaching the metaphysical abstract idea of
| what LLMs even are.
|
| If you believe that an LLM is a digital brain, then it
| follows that their limitation in capabilities today are a
| result of their limited characteristics (namely: coherent
| context windows). If we increase context windows (and
| intelligence), we can simply pack more data into the
| context, ask specific questions, and let the LLM figure
| it out.
|
| However, if you have a more grounded belief that, at
| best, LLMs are just one part of a more heterogeneous
| digital brain: It follows that maybe actually their
| limitations are a result of how we're feeding it data.
| That we need to be smarter about context engineering, we
| need to do roundtrips with the LLM to narrow down what
| thbe context should be, it needs targeted context to
| maximize the quality of its output.
|
| The second situation feels so much harder, but more
| likely. IMO: This fundamental schism is the single reason
| why ASI won't be achieved on any timeframe worth making a
| prediction about. LLMs are just one part of the puzzle.
| Fade_Dance wrote:
| It's also a question of general vs specialized tools. If
| LLMs are being used in a limited capacity, such as
| retrieving recipes, then a limited environment where it
| only has the ability to retrieve complete recipes via RAG
| may be ideal in the literal sense of the word. There
| really is nothing better than the perfect specialized
| tool for a specialized job.
| com2kid wrote:
| I did embedded work for years. A 100 MHz CPU with 1-cycle
| SRAM latency and a bare-metal OS can do as much as a
| 600 MHz CPU hitting DRAM running a preemptive OS.
|
| Specialized tools _rock_.
| com2kid wrote:
| Information in an LLM exists in two places:
|
| 1. Embedded in the parameters
|
| 2. Within the context window
|
| We all talk a lot about #2, but until we get a really
| good grip on #1, I think we as a field are going to hit a
| progress wall.
|
| The problem is we have not been able to separate out
| knowledge embedded in parameters with model capability,
| famously even if you don't want a model to write code,
| throwing a bunch of code at a model makes it a better
| model. (Also famously, even if someone never grows up to
| work with math day to day, learning math makes them
| better at all sorts of related logical thinking tasks.)
|
| Also there is plenty of research showing performance
| degrades as we stuff more and more into context. This is
| why even the best models have limits on tool call
| performance when naively throwing 15+ JSON schemas at it.
| (The technique to use RAG to determine which tool call
| schema to feed into the context window is super cool!)
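|
| A hypothetical sketch of that tool-schema trick (embed() and
| the schema list are stand-ins): embed each tool description
| once, then per request ship only the few schemas that match.
|
|     import numpy as np
|
|     def cosine(a, b):
|         return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
|
|     def select_tools(request, embed, tool_schemas, k=3):
|         q = embed(request)
|         scored = [(cosine(q, embed(t["description"])), t)
|                   for t in tool_schemas]
|         scored.sort(key=lambda s: s[0], reverse=True)
|         return [t for _, t in scored[:k]]  # only these hit the context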
| 827a wrote:
| Big, coherent context windows are key to almost all use-
| cases. The whole house of cards RAG implementations most
| platforms are using right now are pretty bad. You start
| asking around about how to implement RAG and you realize:
| No one knows, the architecture and outcomes at every
| company are pretty bad, the most common words you hear are
| "yeah it pretty much works ok i guess".
| Closi wrote:
| It's not particularly interesting if Deep Think comes to the
| same (correct) conclusion on a single problem as o3 but costs
| more. You could ask GPT-3.5 and GPT-4 what 1+1 equals and
| would get the same response, with GPT-4 costing more, but
| this doesn't tell us much about model capability or value.
|
| It would be more interesting to know if it can handle problems
| that o3 can't do, or if it is "correct" more often than o3-pro
| on these sorts of problems.
|
| i.e. if o3 is correct 90% of the time, but Deep Think is
| correct 91% of the time on challenging organisational
| problems, it will be worth paying $250 for an extra 1%
| certainty (assuming the problem is high-value / high-risk
| enough).
| lucianbr wrote:
| > It would be more interesting to know if it can handle
| problems that o3 can't do
|
| Suppose it can't. How will you know? All the datapoints will
| be "not particularly interesting".
| Closi wrote:
| > Suppose it can't. How will you know?
|
| By finding and testing problems that o3 can't do on Deep
| Think, and also testing the reverse? Or by large benchmarks
| comparing a whole suite of questions with known answers.
|
| Problems that both get correct will be easy to find and
| don't say much about comparative performance. That's why
| some of the benchmarks listed in the article (e.g.
| Humanity's Last Exam / AIME 2025) are potentially more
| insightful than one person's report on testing one question
| (which they don't provide) where both models replied with
| the same answer.
| profsummergig wrote:
| The part I cannot understand is why, for many AI offerings, I
| cannot make out what each pricing tier does with a quick
| glance.
|
| What happened to the simplicity of Steve Jobs' 2x2 (consumer
| vs. pro, laptop vs. desktop)?
| starfallg wrote:
| The rate limits are not because of compute performance or the
| lack thereof. They're there to stop people from training their
| own models on the very cutting edge.
| TheAceOfHearts wrote:
| I would be interested in reading about how people who are paying
| for access to Google's top AI plan are intending to use this. Do
| you have any examples of immediate use-cases that might benefit?
|
| Is Google using this tool internally? One would expect them to
| give some examples of how it's helping internal teams accelerate
| or solve more challenging problems, if they were eating their own
| dogfood.
| meta_ai_x wrote:
| I'm guessing most Ultra users are there for Veo 3, where there
| are monetary benefits if your 3s videos go viral on
| TikTok/Reels/Shorts.
| irthomasthomas wrote:
| You can spin up a version of this at home using simonw's LLM cli
| with the llm-consortium plugin.
|
| Bonus 1: Use any combination of models. Mix n match models from
| any lab.
|
| Bonus 2: Serve your custom consortium on a local API from a
| single command using the llm-model-gateway plugin and use it in
| your apps and coding assistants.
|
| https://x.com/karpathy/status/1870692546969735361
|     uv tool install llm
|     llm install llm-consortium
|     llm consortium save gthink-n5 -m gemini-pro -n 5 \
|       --arbiter gemini-flash --confidence-threshold 99 \
|       --max-iterations 4
|     llm serve --host 0.0.0.0
|     curl http://0.0.0.0:8000/v1/chat/completions \
|       -X POST \
|       -H "Content-Type: application/json" \
|       -d '{"model": "gthink-n5", "messages": [{"role": "user",
|            "content": "find a polynomial algorithm for graph-isomorphism"}]}'
|
| You can also build a consortium of consortiums like so:
|
|     llm consortium save gem-squared -m gthink-n5 -n 2 \
|       --arbiter gem-flash
|
| Or even make the arbiter a consortium:
|
|     llm consortium save gem-cubed -m gthink-n5 -n 2 \
|       --arbiter gthink-n5 --max-iterations 2
|
| Or go open-weights only:
|
|     llm consortium save open-council -m qwen3:2 -m kimi-k2:2 \
|       -m glm-4.5:2 -m mistral:2 --arbiter minimax-m1 \
|       --min-iterations 2 --confidence-threshold 95
| https://GitHub.com/irthomasthomas/llm-consortium
| barapa wrote:
| I am not seeing this llm serve command
| irthomasthomas wrote:
| It's a separate plugin rn: llm install llm-model-gateway
| mt_ wrote:
| Is the European Union a consortium of consortiums?
| SubiculumCode wrote:
| 1. Why do you say this is a version of Gemini Deep Think? It
| seems like there could be multiple ways to build a multi-agent
| model to explore a space.
|
| 2. The covariance between models leads to correlated errors,
| lowering the individual effectiveness of each contributing
| model. It would seem to me that you'd want to find a set of
| model architectures/prompt_configs that minimizes covariance
| while maintaining individual accuracy, on a benchmark set of
| problems that have multiple provable solutions (i.e. not one
| path to a solution that is objectively correct).
| irthomasthomas wrote:
| I didn't mean to suggest it's a clone of Deep Think, which is
| proprietary. I meant that it's a version of _parallel
| reasoning_. I got the idea from Karpathy's tweet in December
| and built it. Then DeepMind published the "Evolving Deeper
| LLM Thinking" paper in January with similar concepts. Great
| minds, I guess? https://arxiv.org/html/2501.09891v1
|
| 2. The correlated errors thing is real, though I'd argue it's
| not always a dealbreaker. Sometimes you want similar models
| for consistency, sometimes you want diversity for coverage.
| The plugin lets you do either - mix Claude with kimi and Qwen
| if you want, or run 5 instances of the same model. The
| "right" approach probably depends on your use case.
| dumbmrblah wrote:
| Thanks! Do you happen to know if there any OpenWebUI plugins
| similar to this?
| irthomasthomas wrote:
| You can use this with openwebui already. Just run llm install
| llm-model-gateway. Then, after you save a consortium, run
| llm serve --host 0.0.0.0. This will give you an OpenAI-
| compatible endpoint which you add to your chat client.
| nickandbro wrote:
| Ladies and Gentlemen,
|
| Here's Gemini Deep Think when prompted with:
|
| "Create a svg of a pelican riding on a bicycle"
|
| https://www.svgviewer.dev/s/5R5iTexQ
|
| Beat Simon Willison to it :)
| baggachipz wrote:
| Truly worth the price. We live in the future.
| jjcm wrote:
| Honestly the first one where I would have guessed "this is a
| pelican riding a bicycle" if presented with just the image and
| 0 other context. This and the voxel tower are fairly impressive
| - we're seeing some semblance of visual / spatial understanding
| with this model.
| criemen wrote:
| How much would it cost at list-price API pricing?
| arnaudsm wrote:
| Meme benchmarks like this and Strawberry are funny but very
| easy to game, I bet they're all over training sets nowadays.
| simonw wrote:
| If you train a model on the SVGs of pelicans on a bicycle
| that are out there already you're going to get a VERY weird
| looking pelican on a bicycle:
| https://simonwillison.net/tags/pelican-riding-a-bicycle/
| nickandbro wrote:
| Thanks for the mention on your blog. You're the original
| GPT pelican artist
| amelius wrote:
| Can it do circuit diagrams? Because that's one practical area
| where I think the AI models are lacking.
| HaZeust wrote:
| Not yet, or schemas. It can do netlists, though! But it's
| much harder to go from "Netlist -> Diagram/Schema" than the
| other way around :(
| kingstnap wrote:
| It was an expensive SVG, but it did a good job.
|
| The bike is an actual bike with a diamond frame.
| mNovak wrote:
| Interestingly it seems to draw the bike's seat too (around line
| 34) which then gets covered by the pelican.
| simonw wrote:
| OK that is recognizably a pelican, pretty great!
| qingcharles wrote:
| This feels like the best pelicanbike yet. The singularity
| might be closer than we imagine.
|
| Time for a leaderboard?
| lostmsu wrote:
| Ask and you'll receive: https://pelicans.borg.games/
| qingcharles wrote:
| Nice! Is there a way I can click on the leaderboard items
| so I can view them?
| espadrine wrote:
| It would be interesting to have two generations per model
| without cherry picking, so that the Elo estimation can
| include an easy-to-compute standard deviation estimation.
| jug wrote:
| Easily the best one yet!
| tim333 wrote:
| First I've seen of human quality. Maybe we are reaching API.
| (Artificial Pelican Intelligence)
| NitpickLawyer wrote:
| Saw one today from gpt5 (via some api trick someone found)
| that was better than this, let me see if I can find it.
|
| Pelican:
|
| https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd..
| ..
|
| Longer thread re gpt5:
|
| https://old.reddit.com/r/OpenAI/comments/1mettre/gpt5_is_alr.
| ..
| blueblisters wrote:
| Uh, that doesn't look better. It has more texture but the
| composition is bad/incomplete.
| rdtsc wrote:
| If it's on HN and is a meme at this point, it will end up in
| the training set.
|
| It's kind of fun to imagine that there is an intern in every AI
| company furiously trying to get nice looking svg pelicans on
| bicycles.
| akomtu wrote:
| Now add irrelevant facts about cats and see if it can still
| draw it.
| drozycki wrote:
| I don't have access but I wonder whether a dog on a jetski
| would be nearly as good
| judge123 wrote:
| I'm wondering if 'slow AI' like this is a temporary bridge, or a
| whole new category we need to get used to. Is the future really
| about having these specialized 'deep thinkers' alongside our
| fast, everyday models? Or is this just a clunky V1 until the main
| models get this powerful on their own in seconds?
| onlyrealcuzzo wrote:
| It's not unreasonable to think that with improvements on the
| software side - a Saturn-like model based on diffusion could be
| this powerful within a decade - with 1s responses.
|
| I'd highly doubt in 10 years, people are waiting 30m for
| answers of _this_ quality - whether due to the software side,
| the hardware side, and/or scaling.
|
| It's possible in 10 years, the cost you pay is still
| comparable, but I doubt the time will be 30m.
|
| It's also possible that there's still top-tier models like this
| that use absurd amounts of resources (by today's standards) and
| take 30m - but they'd likely be at a _much_ higher quality than
| today 's.
| regularfry wrote:
| The pressure in the other direction is tool use. The more a
| model wants to call out to a series of tools, the more the
| delay will be, just because the serial process isn't part
| of the model.
| twobitshifter wrote:
| We're optimizing for quality over performance right now; at
| some point the pendulum will swing the other way. But there
| might be problems that require deep thinking, just like we
| have a need for supercomputers to run jobs for days today.
| nusl wrote:
| I've been using Gemini for a few months; somehow it's gotten
| much, much worse in that time. Hallucinations are very common,
| and it will argue with you when you point them out. So I don't
| have much confidence.
| panarky wrote:
| In my experience with chat, Flash has gotten much, much better.
| It's my go-to model even though I'm paying for Pro.
|
| Pro is frustrating because it too often won't search to find
| current information, and just gives stale results from before
| its training cutoff. Flash doesn't do this much anymore.
|
| For coding I use Pro in Gemini CLI. It is amazing at coding,
| but I'm actually using it more to write design docs, decomp
| multi-week assignments down to daily and hourly tasks, and then
| feed those docs back to Gemini CLI to have it work through each
| task sequentially.
|
| With a little structure like this, it can basically write its
| own context.
| declan_roberts wrote:
| I like flash because when it's wrong it's wrong very quickly.
| You can either change the prompt or just solve the problem
| yourself. It works well for people who can spot the answer as
| being "wrong"
| arnaudsm wrote:
| I feel the same, but cannot measure the effect in any long-
| context benchmark like fiction.livebench.
|
| Are they aggressively quantizing, or are our expectations
| silently increasing?
| quadrature wrote:
| Is the problem mainly with tool use? And are you using it
| through AI Studio or through the API?
|
| I've found that it hallucinates tool use for tools that aren't
| available and then gets very confident about the results.
| alecco wrote:
| Same here. I stopped using Gemini Pro because, on top of its
| hard-to-follow verbosity, it was giving contradicting answers
| to things that Claude Sonnet 4 could answer.
|
| Speaking of Sonnet, I feel like it's closing the gap to Opus.
| After the new quotas I started to try it before Opus and now it
| gets complex things right more often than not. This wasn't my
| experience just a couple of months ago.
| AtlanticThird wrote:
| I upgraded and quickly hit my limit. I'm fine with them having
| limits, I just wish that they were more transparent, even if
| it's just a vague statement about limited usage. I assumed it
| would be similar to regular Gemini 2.5 on the Pro plan, but
| it's not.
| unsupp0rted wrote:
| Interesting that they compared Gemini 2.5 Deep Think on code
| to all the top models EXCEPT Claude, the best top model at
| code.
| childintime wrote:
| This comes at a time when my experience with Gemini is
| lacking; it seems to get worse. It's not picking up on my
| intention, sometimes replies in the wrong language, etc.
| Either that, or I am just transparent that it's a tool and
| its feelings are hurt. I've had to call it a moron several
| times, and it was funny when it started reprimanding me for
| my foul language once. But it was wrong. This behavior seems
| new. I could never trust it to not do random edits everywhere
| in a document, so nowadays I use it to check Claude, which
| can be trusted with a document.
| declan_roberts wrote:
| I had a good experience with gemini-cli (which I think uses
| Pro initially?). It's not very good, but it's very fast. So
| when it's wrong, it's wrong very quickly, and you can either
| solve it yourself or pivot your prompt. For a professional
| software engineer this actually works out OK.
| jawerty wrote:
| If anyone wants a free-to-use deep research agent, check out
| what I'm working on: https://projectrex.onrender.com/
| 44za12 wrote:
| Is it something like tree of thought reasoning?
| jawerty wrote:
| If anyone wants a free-to-use research agent, try Rex. I've
| been building it for a couple months:
| https://projectrex.onrender.com/
| kelvinjps10 wrote:
| These AI names are getting ridiculous
| tdhz77 wrote:
| I think at this point it's fair to say that I've switched
| models more than Leonardo DiCaprio.
| mclau157 wrote:
| Asking AI to create 3D scenes like the example on the page
| seems like asking someone to hammer something with a
| screwdriver. We would need AI-compatible 3D software that
| either has easier-to-use voxels built in, so it can create
| something similar to pixel art, or easier math-defined curves
| that can be meshed. Either way, AI just does not currently
| have the right tools to generate 3D scenes.
| mhh__ wrote:
| One of the most alluring things about LLMs though is that it is
| like having a screwdriver that can just about work like a
| hammer, and draft an email to your landlord and so on
| depluber wrote:
| Is this a joke? I am paying $250/month because I want to use Deep
| Think and I literally just burned through my daily quota in 5
| PROMPTS. That is insane!
| srameshc wrote:
| I've been using Gemini 2.5 Pro for a few months now and have
| found my experience to be very positive. I primarily use it for
| coding and through the API, and I feel it's consistently
| improving. I haven't yet tried Deep Think, though.
| km3k wrote:
| They missed an opportunity to name it Deep Thought.
| motbus3 wrote:
| I wonder how long it will take before someone starts
| investigating the relation between SEO and Google's AI
| crawler.
| Invictus0 wrote:
| I used to be enthusiastic about Gemini-2.5-Pro but now I can't
| even get it to do decent file-diff summaries for a PR commit.
___________________________________________________________________
(page generated 2025-08-01 23:00 UTC)