[HN Gopher] The End of Moore's Law for AI? Gemini Flash Offers a...
___________________________________________________________________
The End of Moore's Law for AI? Gemini Flash Offers a Warning
Author : sethkim
Score : 92 points
Date : 2025-07-03 17:34 UTC (5 hours ago)
(HTM) web link (sutro.sh)
(TXT) w3m dump (sutro.sh)
| cmogni1 wrote:
| The article does a great job of highlighting the core disconnect
| in the LLM API economy: linear pricing for a service with non-
| linear, quadratic compute costs. The traffic analogy is an
| excellent framing.
|
| One addition: the O(n^2) compute cost is most acute during the
| one-time prefill of the input prompt. I think the real
| bottleneck, however, is the KV cache during the decode phase.
|
| For each new token generated, the model must access the
| intermediate state of all previous tokens. This state is held in
| the KV Cache, which grows linearly with sequence length and
| consumes an enormous amount of expensive GPU VRAM. The speed of
| generating a response is therefore limited more by memory
| bandwidth than by raw compute.
|
| Viewed this way, Google's 2x price hike on input tokens is
| probably related to the KV Cache, which supports the article's
| "workload shape" hypothesis. A long input prompt creates a huge
| memory footprint that must be held for the entire generation,
| even if the output is short.
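|
| As a rough back-of-the-envelope sketch of that linear growth (the
| dimensions below are illustrative assumptions, roughly the shape
| of a small open-weights model, not Gemini's actual config):
|
|     # KV cache sizing sketch; model dims are assumed, not Gemini's
|     def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32,
|                        head_dim=128, bytes_per_elem=2):  # fp16
|         # per token, per layer: one K vector and one V vector
|         per_token = (n_layers * 2 * n_kv_heads * head_dim
|                      * bytes_per_elem)
|         return seq_len * per_token
|
|     for n in (1_000, 100_000, 1_000_000):
|         print(f"{n:>9} tokens -> {kv_cache_bytes(n) / 2**30:.1f} GiB")
|
| Every generated token has to read back that whole cache, which is
| why a long prompt hurts even when the output is short.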
| trhway wrote:
| That obviously should and will be fixed architecturally.
|
| >For each new token generated, the model must access the
| intermediate state of all previous tokens.
|
| Not all the previous tokens are equal; not all deserve the same
| attention, so to speak. The farther away the tokens, the more
| opportunity for many of them to be pruned and/or collapsed with
| other similarly distant, less meaningful tokens in a given
| context. So instead of O(n^2) it would be more like O(n log n).
|
| I mean, you'd expect that, for example, "knowledge worker" models
| (vs. say "poetry" models) would possess some perturbative
| stability w.r.t. changes to/pruning of the remote previous
| tokens, at least those tokens which are less meaningful in
| the current context.
|
| Personally, I feel the situation is good - performance
| engineering work again becomes somewhat valuable as we're
| reaching N where O(n^2) forces management to throw some money
| at engineers instead of at the hardware :)
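|
| As a toy illustration of the savings (the window size and the
| log-style merging of the distant past are made up here, not any
| particular architecture):
|
|     import math
|
|     # full causal attention: token i attends to the i-1 earlier ones
|     def full_pairs(n):
|         return n * (n - 1) // 2
|
|     # hypothetical pruning: a recent window of `window` tokens plus
|     # ~log2 of the distant past collapsed into summary slots
|     def pruned_pairs(n, window=512):
|         total = 0
|         for i in range(n):
|             distant = max(0, i - window)
|             total += min(i, window)
|             if distant > 1:
|                 total += int(math.log2(distant))
|         return total
|
|     for n in (10_000, 100_000):
|         print(n, full_pairs(n), pruned_pairs(n))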
| simonw wrote:
| "In a move that at first went unnoticed, Google significantly
| increased the price of its popular Gemini 2.5 Flash model"
|
| It's not quite that simple. Gemini 2.5 Flash previously had two
| prices, depending on whether you enabled "thinking" mode. The
| new 2.5 Flash has just a single price, which is a lot more if you
| were using the non-thinking mode and may be slightly less for
| thinking mode.
|
| Another way to think about this is that they retired their Gemini
| 2.5 Flash non-thinking model entirely, and changed the price of
| their Gemini 2.5 Flash thinking model from $0.15/m input, $3.50/m
| output to $0.30/m input (more expensive) and $2.50/m output (less
| expensive).
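|
| Whether that nets out as a hike or a cut depends on your
| input:output ratio. A quick check with the numbers above (in
| Python, purely for illustration):
|
|     # old thinking prices vs. new unified prices, $ per 1M tokens
|     old_in, old_out = 0.15, 3.50
|     new_in, new_out = 0.30, 2.50
|
|     # new is cheaper when 0.30*i + 2.50*o < 0.15*i + 3.50*o,
|     # i.e. when the input:output ratio is below this breakeven:
|     breakeven = (old_out - new_out) / (new_in - old_in)
|     print(breakeven)  # ~6.7 input tokens per output token
|
| So input-heavy workloads (big prompts, short answers) got more
| expensive, while output-heavy thinking workloads got slightly
| cheaper.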
|
| Another minor nit-pick:
|
| > For LLM providers, API calls cost them quadratically in
| throughput as sequence length increases. However, API providers
| price their services linearly, meaning that there is a fixed cost
| to the end consumer for every unit of input or output token they
| use.
|
| That's mostly true, but not entirely: Gemini 2.5 Pro (but oddly
| not Gemini 2.5 Flash) charges a higher rate for inputs over
| 200,000 tokens. Gemini 1.5 also had a higher rate for >128,000
| tokens. As a result I treat those as separate models on my
| pricing table on https://www.llm-prices.com
|
| One last one:
|
| > o3 is a completely different class of model. It is at the
| frontier of intelligence, whereas Flash is meant to be a
| workhorse. Consequently, there is more room for optimization that
| isn't available in Flash's case, such as more room for pruning,
| distillation, etc.
|
| OpenAI are on the record that the o3 optimizations were _not_
| through model changes such as pruning or distillation. This is
| backed up by independent benchmarks that find the performance of
| the new o3 matches the previous one:
| https://twitter.com/arcprize/status/1932836756791177316
| sethkim wrote:
| Both great points, but more or less speak to the same root
| cause - customer usage patterns are becoming more of a driver
| for pricing than underlying technology improvements. If so, we
| likely have hit a "soft" floor for now on pricing. Do you not
| see it this way?
| simonw wrote:
| Even given how much prices have decreased over the past 3
| years I think there's still room for them to keep going down.
| I expect there remain a whole lot of optimizations that have
| not yet been discovered, in both software and hardware.
|
| That 80% drop in o3 was only a few weeks ago!
| sethkim wrote:
| No doubt prices will continue to drop! We just don't think
| it will be anything like the orders-of-magnitude YoY
| improvements we're used to seeing. Consequently, developers
| shouldn't expect the cost of building and scaling AI
| applications to be anything close to "free" in the near
| future as many suspect.
| vfvthunter wrote:
| I do not see it this way. Google is a publicly traded company
| responsible for creating value for their shareholders. When
| they became dicks about ad blockers on youtube last year or
| so, was it because they hit a bandwidth Moore's law? No. It
| was a money grab.
|
| ChatGPT is simply what Google should've been 5-7 years ago,
| but Google was more interested in presenting me with ads to
| click on instead of helping me find what I was looking for.
| ChatGPT is at least 50% of my searches now. And they're
| losing revenue because of that.
| mathiaspoint wrote:
| I really hate the thinking. I do my best to disable it but
| don't always remember. So often it just gets into a loop second
| guessing itself until it hits the token limit. It's rare it
| figures anything out while it's thinking too but maybe that's
| because I'm better at writing prompts.
| thomashop wrote:
| I have the impression that the thinking helps even if the
| actual content of the thinking output is nonsense. It awards
| more cycles to the model to think about the problem.
| wat10000 wrote:
| That would be strange. There's no hidden memory or data
| channel, the "thinking" output is all the model receives
| afterwards. If it's all nonsense, then nonsense is all it
| gets. I wouldn't be completely surprised if a context with
| a bunch of apparent nonsense still helps somehow, LLMs are
| weird, but it would be odd.
| mathiaspoint wrote:
| Eh. The embeddings themselves could act like hidden layer
| activations and encode some useful information.
| yorwba wrote:
| Attention operates entirely on hidden memory, in the
| sense that it usually isn't exposed to the end user. An
| attention head on one thinking token can attend to one
| thing and the same attention head on the next thinking
| token can attend to something entirely different, and the
| next layer can combine the two values, maybe on the
| second thinking token, maybe much later. So even nonsense
| filler can create space for intermediate computation to
| happen.
| barrkel wrote:
| This isn't quite right. Even when an LLM generates
| meaningless tokens, its internal state continues to
| evolve. Each new token triggers a fresh pass through the
| network, with attention over the KV cache, allowing the
| model to refine its contextual representation. The
| specific tokens may be gibberish, but the underlying
| computation can still reflect ongoing "thinking".
| Wowfunhappy wrote:
| Wasn't there some study that just telling the LLM to
| write a bunch of periods first improves responses?
| sharkjacobs wrote:
| > This is the first time a major provider has backtracked on the
| price of an established model
|
| Arguably that was Haiku 3.5 in October 2024.
|
| I think the same hypothesis could apply though, that you price
| your model expecting a certain average input size, and then
| adjust price up to accommodate the reality that people use that
| cheapest model when they want to throw as much as they can into
| the context.
| simonw wrote:
| Haiku 3.5 was a completely different model from Haiku 3, and
| part of a new model generation.
|
| Gemini Flash 2.5 and Gemini 2.5 Flash Preview were presumably a
| whole lot more similar to each other.
| mossTechnician wrote:
| Is there a consumer expectation that Haiku 3.5 is
| _completely_ different? Even leaving semantic versioning
| aside, if the .5 symbolizes a "halfway point" between
| major releases, it still suggests a non-major release to me.
| simonw wrote:
| Consumers have no idea what Haiku is.
|
| Engineers who work with LLM APIs are hopefully paying
| enough attention that they understand the difference
| between Claude 3, Claude 3.5 and Claude 4.
| mossTechnician wrote:
| I appreciate the clarification for people who aren't
| engineers who work in-depth with LLM APIs, but I have
| enough familiarity with both semantic versioning[0] and
| .NET versioning[1], and usually a ".5" in either of them
| implies a large but not _complete_ difference.
|
| [0]: https://en.wikipedia.org/wiki/Software_versioning#Se
| mantic_v... [1]: https://en.wikipedia.org/wiki/.NET_Frame
| work_version_history...
| ryao wrote:
| I had the same thought about haiku 3.5. They claimed it was due
| to the model being more capable, which basically means that
| they raised the price because they could.
|
| Then there is Poe with its pricing games. Prices at Poe have
| been going up over time: they were extremely aggressive early on
| to gain market share, presumably under the assumption that LLM
| pricing would keep falling, and that reduction never
| materialized.
| guluarte wrote:
| They are doing the WeWork approach: gain customers at all costs,
| even if that means losing money.
| FirmwareBurner wrote:
| Aren't all LLMs losing money at this point?
| simonw wrote:
| I don't believe that's true on inference - I think most if
| not all of the major providers are selling inference at a
| (likely very small) margin over what it costs to serve them
| (hardware + energy).
|
| They likely lose money when you take into account the capital
| cost of training the model itself, but that cost is at least
| fixed: once it's trained you can serve traffic from it for as
| long as you choose to keep the model running in production.
| bungalowmunch wrote:
| Yes, I would generally agree; although I don't have a
| source for this, I've heard whispers of Anthropic running
| at a much higher margin compared to the other labs
| throwawayoldie wrote:
| Yes, and the obvious endgame is wait until most software
| development is effectively outsourced to them, then jack the
| prices to whatever they want. The Uber model.
| FirmwareBurner wrote:
| Good thing AI can't replace my drinking during work time
| skills
| incomingpain wrote:
| The big thing that really surprised me:
|
| Llama 4 Maverick is 16x 17B, so 67 GB in size. The equivalency is
| 400 billion.
|
| Llama 4 Behemoth is 128x 17B, 245 GB in size. The equivalency is
| 2 trillion.
|
| I don't have the resources to be able to test these,
| unfortunately, but they are claiming Behemoth is superior to the
| best SaaS options via internal benchmarking.
|
| Comparatively, Deepseek R1 671B is 404 GB in size, with pretty
| similar benchmarks.
|
| But compare Deepseek R1 32B to any model from 2021 and it's
| going to be significantly superior.
|
| So we have quality of models increasing, resources needed
| decreasing. In 5-10 years, do we have an LLM that loads up on a
| 16-32GB video card that is simply capable of doing it all?
| sethkim wrote:
| My two cents here is the classic answer - it depends. If you
| need general "reasoning" capabilities, I see this being a
| strong possibility. If you need specific, factual information
| baked into the weights themselves, you'll need something large
| enough to store that data.
|
| I think the best of both worlds is a sufficiently capable
| reasoning model with access to external tools and data that can
| perform CPU-based lookups for information that it doesn't
| possess.
| sharkjacobs wrote:
| > If you're building batch tasks with LLMs and are looking to
| navigate this new cost landscape, feel free to reach out to see
| how Sutro can help.
|
| I don't have any reason to doubt the reasoning this article is
| doing or the conclusions it reaches, but it's important to
| recognize that this article is part of a sales pitch.
| sethkim wrote:
| Yes, we're a startup! And LLM inference is a major component of
| what we do - more importantly, we're working on making these
| models accessible as analytical processing tools, so we have a
| strong focus on making them cost-effective at scale.
| sharkjacobs wrote:
| I see your prices page lists the _average_ cost per million
| tokens. Is that because you are using the formula you
| describe, which depends on hardware time and throughput?
|
| > API Price ≈ (Hourly Hardware Cost / Throughput in Tokens
| per Hour) + Margin
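|
| (For concreteness, that formula with purely made-up numbers would
| look something like this:)
|
|     # illustrative numbers only, not Sutro's actual costs
|     hourly_hardware_cost = 2.00      # $/hr for a hypothetical GPU
|     throughput = 10_000_000          # tokens generated per hour
|     margin = 0.05                    # $ added per 1M tokens
|
|     cost_per_million = hourly_hardware_cost / throughput * 1_000_000
|     price_per_million = cost_per_million + margin
|     print(price_per_million)         # 0.25 $/1M tokens here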
| samtheprogram wrote:
| There's absolutely nothing wrong with putting a small plug at
| the end of an article.
| sharkjacobs wrote:
| Of course not.
|
| But the thrust of the article is that, contrary to
| conventional wisdom, we shouldn't expect LLMs to
| continue getting more efficient, and so it's worthwhile to
| explore other options for cost savings in inference, such as
| batch processing.
|
| The conclusion they reach is one which directly serves what
| they're selling.
|
| I'll repeat: I'm not disputing anything in this article. I'm
| really not; I'm not even trying to be coy and make allusions
| without directly saying anything. If I thought this was
| bullshit I'm not afraid to semi-anonymously post a comment
| saying so.
|
| But this is advertising, just like Backblaze's hard drive
| reliability blog posts are advertising.
| jasonthorsness wrote:
| Unfounded extrapolation from a minor pricing update. I am sure
| every generation of chips also came with "end of Moore's law"
| articles for the actual Moore's law.
|
| FWIW Gemini 2.5 Flash Lite is still very good; I used it in my
| latest side project to generate entire web sites and it outputs
| great content and markup every single time.
| ramesh31 wrote:
| >By embracing batch processing and leveraging the power of cost-
| effective open-source models, you can sidestep the price floor
| and continue to scale your AI initiatives in ways that are no
| longer feasible with traditional APIs.
|
| Context size is the real killer when you look at running open
| source alternatives on your own hardware. Has anything even come
| close to the 100k+ range yet?
| sethkim wrote:
| Yes! Both Llama 3 and Gemma 3 have 128k context windows.
| ryao wrote:
| Llama 3 had a 8192 token context window. Llama 3.1 increased
| it to 131072.
| ryao wrote:
| Mistral Small 3.2 has a 131072 token context window.
| georgeburdell wrote:
| Is there math backing up the "quadratic" statement about LLM
| input size? At least in the traffic analogy, I imagine it's
| exponential, but for small amounts exceeding some critical
| threshold, a quadratic term is sufficient.
| gpm wrote:
| Every token has to calculate attention over every previous
| token, i.e. token i does i-1 units of attention work, so the
| total is sum_{i=1}^{n} (i-1) = n(n-1)/2, which is equivalent
| to O(n^2).
|
| I'm not sure where you're getting an exponential from.
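|
| A quick numeric check of that closed form:
|
|     def attended_pairs(n):
|         # token i attends to the i-1 tokens before it
|         return sum(i - 1 for i in range(1, n + 1))
|
|     for n in (1_000, 2_000, 4_000):
|         # doubling n roughly quadruples the work
|         print(n, attended_pairs(n), n * (n - 1) // 2)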
| timewizard wrote:
| I've only ever seen linear increases. When did Moore's law even
| _start_?
| jjani wrote:
| > In a move that at first went unnoticed
|
| Stopped reading here. If you're positioning yourself as having
| some kind of unique insight when there is none, in order to
| boost your credentials and sell your product, there's little
| chance you have anything actually insightful to offer. Might
| sound like an overreaction/nitpicking, but it's entirely needless
| LinkedIn-style "thought leader" nonsense.
|
| In reality it was immediately noticed by anyone using these
| models; have a look at the HN threads at the time, or even on
| Reddit, let alone the actual spaces dedicated to AI builders.
| fusionadvocate wrote:
| What is holding back AI is this business necessity that models
| must perform everything. Nobody can push for a smaller model that
| learns a few simple tasks and then build upon that, similar to
| the best known intelligent machine: the human.
|
| If these corporations had to build a car they would make the
| largest possible engine, because "MORE ENGINE MORE SPEED", just
| like they think that bigger models mean bigger intelligence, but
| forget to add steering, or even a chassis.
| dehugger wrote:
| I agree. I want to be able to get smaller models which are
| complete, contained, products which we can run on-prem for our
| organization.
|
| I'll take a model specialized in web scraping. Give me one
| trained on generating report and documentation templates (I'd
| commit felonies for one which could spit out a near-complete
| report for SSRS).
|
| Models trained for specific helpdesk tasks ("install a
| printer", "grant this user access to these services with this
| permission level").
|
| A model for analyzing network traffic and identifying specific
| patterns.
|
| None of these things should require titanic models nearing
| trillions of parameters.
| cruffle_duffle wrote:
| That's just machine learning though!
| furyofantares wrote:
| This is extremely theorycrafted but I see this as an excellent
| thing driving AI forward, not holding it back.
|
| I suspect a large part of the reason we've had many decades of
| exponential improvements in compute is the general purpose
| nature of computers. It's a narrow set of technologies that are
| universally applicable and each time they get better/cheaper
| they find more demand, so we've put an exponentially increasing
| amount of economical force behind it to match. There needed to
| be "plenty of room at the bottom" in terms of physics and
| plenty of room at the top in terms of software eating the
| world, but if we'd built special purpose hardware for each
| application I don't think we'd have seen such incredible
| sustained growth.
|
| I see neural networks and even LLMs as being potentially
| similar. They're general purpose, a small set of technologies
| that are broadly applicable and, as long as we can keep making
| them better/faster/cheaper, they will find more demand, and so
| benefit from concentrated economic investment.
| fnord123 wrote:
| They aren't arguing against LLMs. They are arguing against
| their toaster's LLM, which only needs to make perfect toast,
| being trained on the tax policies of the Chang Dynasty.
| furyofantares wrote:
| I'm aware! And I'm personally excited about small models
| but my intuition is that maybe pouring more and more money
| into giant general purpose models will have payoff as long
| as it keeps working at producing better general purpose
| results (which maybe it won't).
| flakiness wrote:
| It may just be Google trying to capitalize on Gemini's increasing
| popularity. Until 2.5, Gemini was a total underdog. Less so since
| 2.5.
| apstroll wrote:
| Extremely doubtful that it boils down to quadratic scaling of
| attention. That whole issue is a leftover from the days of small
| BERT models with very few parameters.
|
| For large models, compute is very rarely dominated by attention.
| Take, for example, this FLOPs calculation from
| https://www.adamcasson.com/posts/transformer-flops
|
| Compute per token = 2(P + L x W x D)
|
| where P = total parameters, L = number of layers, W = context
| size (window), and D = embedding dimension.
|
| For Llama 8b, the window size starts dominating compute cost per
| token only at 61k tokens.
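|
| A quick sanity check of that crossover, plugging in the published
| Llama 3 8B shape (assumed here: P ~ 8.03e9, L = 32, D = 4096):
|
|     # the attention term L*W*D starts to dominate once it exceeds P
|     P, L, D = 8.03e9, 32, 4096
|     crossover_W = P / (L * D)
|     print(f"{crossover_W:,.0f} tokens")  # ~61,000, matching the claim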
| fathermarz wrote:
| Google is raising prices for most of their services. I do not
| agree that this is due to the cost of compute or that this is the
| end of Moore's Law. I don't think we have scratched the surface.
| checker659 wrote:
| > cost of compute
|
| DRAM scaling + interconnect bandwidth stagnation
| llm_nerd wrote:
| Basing anything on Google's pricing is folly. Quite recently
| Google offered several of their preview models at a price of
| $0.00.
|
| Because they were the underdog. Everyone was talking about
| ChatGPT, or maybe Anthropic. Then Deepseek. Google were the
| afterthought that was renowned for that ridiculous image
| generator that envisioned 17th century European scientists as
| full-headdress North American natives.
|
| There has been an absolute 180 since then, and Google now has the
| ability to set their pricing similar to the others. Indeed,
| Google's pricing still has a pretty large discount over similarly
| capable model levels, even after they raised prices.
|
| The warning is that there is no free lunch, and when someone is
| basically subsidizing usage to get noticed, they don't have to do
| that once their offering is good.
| mpalmer wrote:
| Is this overthinking it? Google had a huge incentive to outprice
| Anthropic and OAI to join the "conversation". I was certainly
| attracted to the low price initially, but I'm staying because
| it's still affordable and I still think the Gemini 2.5 options
| are the best simple mix of models available.
| refulgentis wrote:
| This is a marketing blog[^1], written with AI[^2], heavily
| sensationalized, & doesn't understand much in the first place.
|
| We don't have accurate price signals externally because Google,
| in particular, had been very aggressive, treating pricing more
| as a _competition_ exercise than anything that seemed tethered
| to costs.
|
| For quite some time, their pricing updates would be across-the-
| board exactly 2/3 of the cost of OpenAI's equivalent mode.
|
| [^1] "If you're building batch tasks with LLMs and are looking to
| navigate this new cost landscape, feel free to reach out to see
| how Sutro can help."
|
| [^2] "Google's decision to raise the price of Gemini 2.5 Flash
| wasn't just a business decision; it was a signal to the entire
| market." is by far the biggest giveaway; the other tells are
| repeated fanciful descriptions of things that _could_ be real,
| which, when stacked up, indicate a surreal, artificial
| understanding of what they're being asked to write about, e.g.
| "In a move that at first went unnoticed,"
| YetAnotherNick wrote:
| Pricing != Cost.
|
| One of the clearest examples is Deepseek v3. Deepseek has
| mentioned its price of 0.27/1.10 has an 80% profit margin, so it
| costs them roughly 90% less than the price of Gemini Flash. And
| Gemini Flash is very likely a smaller model than Deepseek v3.
| impure wrote:
| > In a move that at first went unnoticed
|
| Oh, I noticed. I've also complained how Gemini 2.0 Flash is 50%
| more expensive than Gemini 1.5 Flash for small requests.
|
| Also I'm sure if Google wanted to price Gemini 2.5 Flash cheaper
| they could. The reason they won't is because there is almost zero
| competition at the <10 cents per million input token area.
| Google's answer to the 10 cents per million input token area is
| 2.5 Flash Lite which they say is equivalent to 2.0 Flash at the
| same cost. Might be a bit cheaper if you factor in automatic
| context caching.
|
| Also the quadratic increase is valid but it's not as simple as
| the article states due to caching. And if it was a big issue
| Google would impose tiered pricing like they do for Gemini 2.5
| Pro.
|
| And for what it's worth I've been playing around with Gemma E4B
| on together.ai. It takes 10x as long as Gemini 2.5 Flash Lite and
| it sucks at multilingual. But other than that it seems to produce
| acceptable results and is way cheaper.
| antirez wrote:
| I think providers are making a mistake in simplifying prices at
| all costs, hiding the quadratic nature of attention. People can
| understand the pricing anyway, even if more complex, by having a
| tool that lets them select a prompt and a reply length and see the
| cost, or fancy 3D graphs that capture the cost surface of
| different cases. People would start sending smaller prompts and
| less context when less is enough, and what they pay would be more
| related to the amount of GPU/TPU/... power they use.
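|
| A minimal sketch of the kind of calculator I mean, with an
| explicit quadratic attention term (the coefficients are invented,
| not any real provider's costs):
|
|     def quote(prompt_tokens, reply_tokens,
|               per_token=0.3e-6, quadratic=1e-12):
|         n = prompt_tokens + reply_tokens
|         return n * per_token + quadratic * n * n
|
|     print(quote(2_000, 500))    # tiny prompt: linear term dominates
|     print(quote(500_000, 500))  # huge prompt: quadratic term dominates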
___________________________________________________________________
(page generated 2025-07-03 23:00 UTC)