[HN Gopher] The End of Moore's Law for AI? Gemini Flash Offers a...
       ___________________________________________________________________
        
       The End of Moore's Law for AI? Gemini Flash Offers a Warning
        
       Author : sethkim
       Score  : 92 points
       Date   : 2025-07-03 17:34 UTC (5 hours ago)
        
 (HTM) web link (sutro.sh)
 (TXT) w3m dump (sutro.sh)
        
       | cmogni1 wrote:
       | The article does a great job of highlighting the core disconnect
       | in the LLM API economy: linear pricing for a service with non-
       | linear, quadratic compute costs. The traffic analogy is an
       | excellent framing.
       | 
       | One addition: the O(n^2) compute cost is most acute during the
       | one-time prefill of the input prompt. I think the real
       | bottleneck, however, is the KV cache during the decode phase.
       | 
       | For each new token generated, the model must access the
       | intermediate state of all previous tokens. This state is held in
       | the KV Cache, which grows linearly with sequence length and
       | consumes an enormous amount of expensive GPU VRAM. The speed of
        | generating a response is therefore limited more by memory
        | bandwidth than by raw compute.
       | 
       | Viewed this way, Google's 2x price hike on input tokens is
       | probably related to the KV Cache, which supports the article's
       | "workload shape" hypothesis. A long input prompt creates a huge
       | memory footprint that must be held for the entire generation,
       | even if the output is short.
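        | 
        | As a rough sketch of that arithmetic (assuming Llama-3-8B-like
        | dimensions - 32 layers, 8 KV heads, head dim 128, fp16 - purely
        | illustrative, not any provider's actual config):
        | 
        |     def kv_cache_bytes(seq_len, layers=32, kv_heads=8,
        |                        head_dim=128, bytes_per_value=2):
        |         # 2x for keys and values, kept for every layer and KV head
        |         return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value
        | 
        |     # a 128k-token prompt pins ~17 GB of VRAM for the whole generation
        |     print(kv_cache_bytes(128_000) / 1e9)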
        
         | trhway wrote:
         | That obviously should and will be fixed architecturally.
         | 
         | >For each new token generated, the model must access the
         | intermediate state of all previous tokens.
         | 
          | Not all previous tokens are equal; not all deserve the same
          | attention, so to speak. The farther away the tokens, the more
          | opportunity for many of them to be pruned and/or collapsed with
          | other similarly distant and less meaningful tokens in a given
          | context. So instead of O(n^2) it would be more like O(n log n).
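          | 
          | Back-of-the-envelope, the difference is huge at long context (a
          | sketch that just counts token pairs touched, nothing more):
          | 
          |     import math
          | 
          |     n = 1_000_000                   # context length in tokens
          |     full = n * (n - 1) // 2         # every token attends to all prior tokens
          |     pruned = int(n * math.log2(n))  # hypothetical O(n log n) scheme
          |     print(full // pruned)           # roughly 25,000x fewer pairs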
         | 
          | I mean, you'd expect that, for example, "knowledge worker"
          | models (vs. say "poetry" models) would possess some perturbative
          | stability wrt. changes to/pruning of the remote previous
          | tokens, at least to those tokens which are less meaningful in
          | the current context.
         | 
          | Personally, I feel the situation is good - performance
          | engineering work becomes somewhat valuable again as we're
         | reaching N where O(n^2) forces management to throw some money
         | at engineers instead of at the hardware :)
        
       | simonw wrote:
       | "In a move that at first went unnoticed, Google significantly
       | increased the price of its popular Gemini 2.5 Flash model"
       | 
       | It's not quite that simple. Gemini 2.5 Flash previously had two
       | prices, depending on if you enabled "thinking" mode or not. The
       | new 2.5 Flash has just a single price, which is a lot more if you
       | were using the non-thinking mode and may be slightly less for
       | thinking mode.
       | 
       | Another way to think about this is that they retired their Gemini
       | 2.5 Flash non-thinking model entirely, and changed the price of
       | their Gemini 2.5 Flash thinking model from $0.15/m input, $3.50/m
       | output to $0.30/m input (more expensive) and $2.50/m output (less
       | expensive).
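        | 
        | To make that concrete, a quick sketch of how the change nets out
        | for a hypothetical workload (prices per million tokens, from the
        | figures above):
        | 
        |     def cost_usd(input_m, output_m, in_price, out_price):
        |         # prices are per million tokens
        |         return input_m * in_price + output_m * out_price
        | 
        |     # hypothetical job: 10M input tokens, 2M output tokens
        |     old = cost_usd(10, 2, 0.15, 3.50)  # previous thinking pricing -> $8.50
        |     new = cost_usd(10, 2, 0.30, 2.50)  # current pricing -> $8.00
        |     # new pricing wins unless inputs outnumber outputs by more than ~6.7:1
        |     # (+$0.15 per extra input million vs -$1.00 per output million)
        |     print(old, new)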
       | 
       | Another minor nit-pick:
       | 
       | > For LLM providers, API calls cost them quadratically in
       | throughput as sequence length increases. However, API providers
       | price their services linearly, meaning that there is a fixed cost
       | to the end consumer for every unit of input or output token they
       | use.
       | 
       | That's mostly true, but not entirely: Gemini 2.5 Pro (but oddly
       | not Gemini 2.5 Flash) charges a higher rate for inputs over
       | 200,000 tokens. Gemini 1.5 also had a higher rate for >128,000
       | tokens. As a result I treat those as separate models on my
       | pricing table on https://www.llm-prices.com
       | 
       | One last one:
       | 
       | > o3 is a completely different class of model. It is at the
       | frontier of intelligence, whereas Flash is meant to be a
       | workhorse. Consequently, there is more room for optimization that
       | isn't available in Flash's case, such as more room for pruning,
       | distillation, etc.
       | 
       | OpenAI are on the record that the o3 optimizations were _not_
       | through model changes such as pruning or distillation. This is
       | backed up by independent benchmarks that find the performance of
       | the new o3 matches the previous one:
       | https://twitter.com/arcprize/status/1932836756791177316
        
         | sethkim wrote:
          | Both great points, but they more or less speak to the same root
         | cause - customer usage patterns are becoming more of a driver
         | for pricing than underlying technology improvements. If so, we
         | likely have hit a "soft" floor for now on pricing. Do you not
         | see it this way?
        
           | simonw wrote:
           | Even given how much prices have decreased over the past 3
           | years I think there's still room for them to keep going down.
           | I expect there remain a whole lot of optimizations that have
           | not yet been discovered, in both software and hardware.
           | 
           | That 80% drop in o3 was only a few weeks ago!
        
             | sethkim wrote:
             | No doubt prices will continue to drop! We just don't think
             | it will be anything like the orders-of-magnitude YoY
             | improvements we're used to seeing. Consequently, developers
             | shouldn't expect the cost of building and scaling AI
             | applications to be anything close to "free" in the near
             | future as many suspect.
        
           | vfvthunter wrote:
           | I do not see it this way. Google is a publicly traded company
           | responsible for creating value for their shareholders. When
           | they became dicks about ad blockers on youtube last year or
           | so, was it because they hit a bandwidth Moore's law? No. It
           | was a money grab.
           | 
           | ChatGPT is simply what Google should've been 5-7 years ago,
           | but Google was more interested in presenting me with ads to
           | click on instead of helping me find what I was looking for.
           | ChatGPT is at least 50% of my searches now. And they're
           | losing revenue because of that.
        
         | mathiaspoint wrote:
         | I really hate the thinking. I do my best to disable it but
            | don't always remember. So often it just gets into a loop
            | second-guessing itself until it hits the token limit. It's
            | rare that it figures anything out while thinking, too, but
            | maybe that's because I'm better at writing prompts.
        
           | thomashop wrote:
           | I have the impression that the thinking helps even if the
           | actual content of the thinking output is nonsense. It awards
           | more cycles to the model to think about the problem.
        
             | wat10000 wrote:
             | That would be strange. There's no hidden memory or data
             | channel, the "thinking" output is all the model receives
             | afterwards. If it's all nonsense, then nonsense is all it
             | gets. I wouldn't be completely surprised if a context with
             | a bunch of apparent nonsense still helps somehow, LLMs are
             | weird, but it would be odd.
        
               | mathiaspoint wrote:
               | Eh. The embeddings themselves could act like hidden layer
               | activations and encode some useful information.
        
               | yorwba wrote:
               | Attention operates entirely on hidden memory, in the
               | sense that it usually isn't exposed to the end user. An
               | attention head on one thinking token can attend to one
               | thing and the same attention head on the next thinking
               | token can attend to something entirely different, and the
               | next layer can combine the two values, maybe on the
               | second thinking token, maybe much later. So even nonsense
               | filler can create space for intermediate computation to
               | happen.
        
               | barrkel wrote:
               | This isn't quite right. Even when an LLM generates
               | meaningless tokens, its internal state continues to
               | evolve. Each new token triggers a fresh pass through the
               | network, with attention over the KV cache, allowing the
               | model to refine its contextual representation. The
               | specific tokens may be gibberish, but the underlying
               | computation can still reflect ongoing "thinking".
        
               | Wowfunhappy wrote:
               | Wasn't there some study that just telling the LLM to
               | write a bunch of periods first improves responses?
        
       | sharkjacobs wrote:
       | > This is the first time a major provider has backtracked on the
       | price of an established model
       | 
       | Arguably that was Haiku 3.5 in October 2024.
       | 
       | I think the same hypothesis could apply though, that you price
       | your model expecting a certain average input size, and then
       | adjust price up to accommodate the reality that people use that
       | cheapest model when they want to throw as much as they can into
       | the context.
        
         | simonw wrote:
         | Haiku 3.5 was a completely different model from Haiku 3, and
         | part of a new model generation.
         | 
         | Gemini Flash 2.5 and Gemini 2.5 Flash Preview were presumably a
         | whole lot more similar to each other.
        
           | mossTechnician wrote:
           | Is there a consumer expectation that Haiku 3.5 is
           | _completely_ different? Even leaving semantic versioning
           | aside, even if the .5 symbolizes a  "halfway point" between
           | major releases, it still suggests a non-major release to me.
        
             | simonw wrote:
             | Consumers have no idea what Haiku is.
             | 
             | Engineers who work with LLM APIs are hopefully paying
             | enough attention that they understand the difference
             | between Claude 3, Claude 3.5 and Claude 4.
        
               | mossTechnician wrote:
               | I appreciate the clarification for people who aren't
               | engineers who work in-depth with LLM APIs, but I have
               | enough familiarity with both semantic versioning[0] and
               | .NET versioning[1], and usually a ".5" in either of them
               | implies a large but not _complete_ difference.
               | 
                | [0]: https://en.wikipedia.org/wiki/Software_versioning#Semantic_v...
                | [1]: https://en.wikipedia.org/wiki/.NET_Framework_version_history...
        
         | ryao wrote:
         | I had the same thought about haiku 3.5. They claimed it was due
         | to the model being more capable, which basically means that
         | they raised the price because they could.
         | 
          | Then there is Poe with its pricing games. Prices at Poe have
          | been going up over time: they were extremely aggressive at
          | first to gain market share, presumably under the assumption
          | that LLM prices would keep falling, and that reduced pricing
          | did not materialize.
        
       | guluarte wrote:
        | They are doing the WeWork approach: gain customers at all costs,
        | even if that means losing money.
        
         | FirmwareBurner wrote:
          | Aren't all LLMs losing money at this point?
        
           | simonw wrote:
           | I don't believe that's true on inference - I think most if
           | not all of the major providers are selling inference at a
           | (likely very small) margin over what it costs to serve them
           | (hardware + energy).
           | 
           | They likely lose money when you take into account the capital
           | cost of training the model itself, but that cost is at least
           | fixed: once it's trained you can serve traffic from it for as
              | long as you choose to keep the model running in production.
        
             | bungalowmunch wrote:
              | yes, I would generally agree; although I don't have a
             | source for this, I've heard whispers of Anthropic running
             | at a much higher margin compared to the other labs
        
           | throwawayoldie wrote:
           | Yes, and the obvious endgame is wait until most software
           | development is effectively outsourced to them, then jack the
           | prices to whatever they want. The Uber model.
        
             | FirmwareBurner wrote:
             | Good thing AI can't replace my drinking during work time
             | skills
        
       | incomingpain wrote:
        | Here's the big thing that really surprised me.
        | 
        | Llama 4 Maverick is 16x 17B, so 67GB in size. The equivalency is
        | 400 billion.
        | 
        | Llama 4 Behemoth is 128x 17B, 245GB in size. The equivalency is 2
        | trillion.
        | 
        | I don't have the resources to be able to test these,
        | unfortunately, but they are claiming Behemoth is superior to the
        | best SaaS options via internal benchmarking.
        | 
        | Comparatively, DeepSeek R1 671B is 404GB in size, with pretty
        | similar benchmarks.
        | 
        | But compare DeepSeek R1 32B to any model from 2021 and it's
        | going to be significantly superior.
       | 
       | So we have quality of models increasing, resources needed
       | decreasing. In 5-10 years, do we have an LLM that loads up on a
       | 16-32GB video card that is simply capable of doing it all?
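        | 
        | The sizing arithmetic behind that question is roughly parameters
        | times bits per weight (a sketch that ignores KV cache and
        | activation overhead):
        | 
        |     def model_size_gb(params_billion, bits_per_weight=4):
        |         # 4-bit quantization is a common local-inference setting
        |         return params_billion * bits_per_weight / 8
        | 
        |     print(model_size_gb(32))   # ~16 GB: a 32B model just fits a 16-32GB card
        |     print(model_size_gb(400))  # ~200 GB: 400B total params stays out of reach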
        
         | sethkim wrote:
         | My two cents here is the classic answer - it depends. If you
         | need general "reasoning" capabilities, I see this being a
         | strong possibility. If you need specific, factual information
         | baked into the weights themselves, you'll need something large
         | enough to store that data.
         | 
         | I think the best of both worlds is a sufficiently capable
         | reasoning model with access to external tools and data that can
         | perform CPU-based lookups for information that it doesn't
         | possess.
        
       | sharkjacobs wrote:
       | > If you're building batch tasks with LLMs and are looking to
       | navigate this new cost landscape, feel free to reach out to see
       | how Sutro can help.
       | 
       | I don't have any reason to doubt the reasoning this article is
       | doing or the conclusions it reaches, but it's important to
       | recognize that this article is part of a sales pitch.
        
         | sethkim wrote:
         | Yes, we're a startup! And LLM inference is a major component of
         | what we do - more importantly, we're working on making these
         | models accessible as analytical processing tools, so we have a
         | strong focus on making them cost-effective at scale.
        
           | sharkjacobs wrote:
           | I see your prices page lists the _average_ cost per million
           | tokens. Is that because you are using the formula you
           | describe, which depends on hardware time and throughput?
           | 
            | > API Price ≈ (Hourly Hardware Cost / Throughput in Tokens
           | per Hour) + Margin
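            | 
            | If that formula is what drives it, the per-token price is
            | essentially hardware time divided by tokens served, e.g.
            | (numbers entirely hypothetical):
            | 
            |     def price_per_million(hourly_hw_cost, tokens_per_hour, margin):
            |         return hourly_hw_cost / tokens_per_hour * 1_000_000 + margin
            | 
            |     # a $10/hr node pushing 5M tokens/hr with a $0.50/M margin
            |     print(price_per_million(10.0, 5_000_000, 0.50))  # -> $2.50 per million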
        
         | samtheprogram wrote:
         | There's absolutely nothing wrong with putting a small plug at
         | the end of an article.
        
           | sharkjacobs wrote:
           | Of course not.
           | 
            | But the thrust of the article is that, contrary to
            | conventional wisdom, we shouldn't expect LLMs to
            | continue getting more efficient, and so it's worthwhile to
            | explore other options for cost savings in inference, such as
            | batch processing.
           | 
           | The conclusion they reach is one which directly serves what
           | they're selling.
           | 
           | I'll repeat; I'm not disputing anything in this article. I'm
           | really not, I'm not even trying to be coy and make allusions
           | without directly saying anything. If I thought this was
           | bullshit I'm not afraid to semi-anonymously post a comment
           | saying so.
           | 
           | But this is advertising, just like Backblaze's hard drive
           | reliability blog posts are advertising.
        
       | jasonthorsness wrote:
       | Unfounded extrapolation from a minor pricing update. I am sure
       | every generation of chips also came with "end of Moore's law"
       | articles for the actual Moore's law.
       | 
       | FWIW Gemini 2.5 Flash Lite is still very good; I used it in my
       | latest side project to generate entire web sites and it outputs
       | great content and markup every single time.
        
       | ramesh31 wrote:
       | >By embracing batch processing and leveraging the power of cost-
       | effective open-source models, you can sidestep the price floor
       | and continue to scale your AI initiatives in ways that are no
       | longer feasible with traditional APIs.
       | 
       | Context size is the real killer when you look at running open
       | source alternatives on your own hardware. Has anything even come
       | close to the 100k+ range yet?
        
         | sethkim wrote:
         | Yes! Both Llama 3 and Gemma 3 have 128k context windows.
        
           | ryao wrote:
            | Llama 3 had an 8192-token context window. Llama 3.1 increased
           | it to 131072.
        
         | ryao wrote:
         | Mistral Small 3.2 has a 131072 token context window.
        
       | georgeburdell wrote:
       | Is there math backing up the "quadratic" statement with LLM input
       | size? At least in the traffic analogy, I imagine it's
       | exponential, but for small amounts exceeding some critical
       | threshold, a quadratic term is sufficient
        
         | gpm wrote:
          | Every token has to calculate attention over every previous
          | token, i.e. attention takes O(sum_{i=0}^{n-1} i) work, and
          | sum_{i=0}^{n-1} i = n(n-1)/2, so that first expression is
          | equivalent to O(n^2).
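          | 
          | A quick sanity check of the quadratic growth (counting pairs
          | only, as a sketch):
          | 
          |     pairs = lambda n: n * (n - 1) // 2
          |     for n in (1_000, 2_000, 4_000):
          |         print(n, pairs(n))  # each doubling of n roughly quadruples the work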
         | 
         | I'm not sure where you're getting an exponential from.
        
       | timewizard wrote:
       | I've only ever seen linear increases. When did Moore's law even
       | _start_?
        
       | jjani wrote:
       | > In a move that at first went unnoticed
       | 
       | Stopped reading here, if you're positioning yourself as if you
       | have some kind of unique insight when there is none in order to
        | boost your credentials and sell your product, there's little
       | chance you have anything actually insightful to offer. Might
       | sound like an overreaction/nitpicking but it's entirely needless
       | LinkedIn style "thought leader" nonsense.
       | 
       | In reality it was immediately noticed by anyone using these
       | models, have a look at the HN threads at the time, or even on
       | Reddit, let alone the actual spaces dedicated to AI builders.
        
       | fusionadvocate wrote:
       | What is holding back AI is this business necessity that models
       | must perform everything. Nobody can push for a smaller model that
        | learns a few simple tasks and then builds upon that, similar to
        | the best-known intelligent machine: the human.
       | 
       | If these corporations had to build a car they would make the
       | largest possible engine, because "MORE ENGINE MORE SPEED", just
        | like they think that bigger models mean bigger intelligence, but
        | forget to add steering, or even a chassis.
        
         | dehugger wrote:
         | I agree. I want to be able to get smaller models which are
         | complete, contained, products which we can run on-prem for our
         | organization.
         | 
         | I'll take a model specialized in web scraping. Give me one
         | trained on generating report and documentation templates (I'd
          | commit felonies for one which could spit out a near-complete
         | report for SSRS).
         | 
         | Models trained for specific helpdesk tasks ("install a
         | printer", "grant this user access to these services with this
         | permission level").
         | 
         | A model for analyzing network traffic and identifying specific
         | patterns.
         | 
         | None of these things should require titanic models nearing
         | trillions of parameters.
        
         | cruffle_duffle wrote:
         | That's just machine learning though!
        
         | furyofantares wrote:
         | This is extremely theorycrafted but I see this as an excellent
         | thing driving AI forward, not holding it back.
         | 
         | I suspect a large part of the reason we've had many decades of
         | exponential improvements in compute is the general purpose
         | nature of computers. It's a narrow set of technologies that are
         | universally applicable and each time they get better/cheaper
         | they find more demand, so we've put an exponentially increasing
          | amount of economic force behind it to match. There needed to
         | be "plenty of room at the bottom" in terms of physics and
         | plenty of room at the top in terms of software eating the
         | world, but if we'd built special purpose hardware for each
         | application I don't think we'd have seen such incredible
         | sustained growth.
         | 
         | I see neural networks and even LLMs as being potentially
         | similar. They're general purpose, a small set of technologies
         | that are broadly applicable and, as long as we can keep making
         | them better/faster/cheaper, they will find more demand, and so
         | benefit from concentrated economic investment.
        
           | fnord123 wrote:
            | They aren't arguing against LLMs. They are arguing against
            | their toaster's perfect-toast LLM being trained on the tax
            | policies of the Chang Dynasty.
        
             | furyofantares wrote:
             | I'm aware! And I'm personally excited about small models
             | but my intuition is that maybe pouring more and more money
             | into giant general purpose models will have payoff as long
             | as it keeps working at producing better general purpose
             | results (which maybe it won't).
        
       | flakiness wrote:
        | It could just be Google trying to capitalize on Gemini's
        | increasing popularity. Until 2.5, Gemini was a total underdog.
        | Less so since 2.5.
        
       | apstroll wrote:
       | Extremely doubtful that it boils down to quadratic scaling of
       | attention. That whole issue is a leftover from the days of small
        | BERT models with very few parameters.
       | 
       | For large models, compute is very rarely dominated by attention.
       | Take, for example, this FLOPs calculation from
       | https://www.adamcasson.com/posts/transformer-flops
       | 
       | Compute per token = 2(P + L x W x D)
       | 
        | P: total parameters
        | L: number of layers
        | W: context size
        | D: embedding dimension
       | 
       | For Llama 8b, the window size starts dominating compute cost per
       | token only at 61k tokens.
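        | 
        | Plugging in rough Llama-8B-ish numbers shows where that crossover
        | comes from (a sketch; exact configs vary):
        | 
        |     P = 8e9   # total parameters
        |     L = 32    # layers
        |     D = 4096  # embedding dimension
        | 
        |     # the attention term L*W*D matches the parameter term P at W = P / (L*D)
        |     print(P / (L * D))  # ~61,000 tokens, matching the ~61k figure above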
        
       | fathermarz wrote:
       | Google is raising prices for most of their services. I do not
       | agree that this is due to the cost of compute or that this is the
       | end of Moore's Law. I don't think we have scratched the surface.
        
         | checker659 wrote:
         | > cost of compute
         | 
         | DRAM scaling + interconnect bandwidth stagnation
        
       | llm_nerd wrote:
       | Basing anything on Google's pricing is folly. Quite recently
       | Google offered several of their preview models at a price of
       | $0.00.
       | 
       | Because they were the underdog. Everyone was talking about
       | ChatGPT, or maybe Anthropic. Then Deepseek. Google were the
       | afterthought that was renowned for that ridiculous image
       | generator that envisioned 17th century European scientists as
       | full-headdress North American natives.
       | 
        | There has been an absolute 180 since then, and Google now has the
       | ability to set their pricing similar to the others. Indeed,
       | Google's pricing still has a pretty large discount over similarly
       | capable model levels, even after they raised prices.
       | 
       | The warning is that there is no free lunch, and when someone is
       | basically subsidizing usage to get noticed, they don't have to do
       | that once their offering is good.
        
       | mpalmer wrote:
       | Is this overthinking it? Google had a huge incentive to outprice
       | Anthropic and OAI to join the "conversation". I was certainly
       | attracted to the low price initially, but I'm staying because
       | it's still affordable and I still think the Gemini 2.5 options
       | are the best simple mix of models available.
        
       | refulgentis wrote:
       | This is a marketing blog[^1], written with AI[^2], heavily
       | sensationalized, & doesn't understand much in the first place.
       | 
       | We don't have accurate price signals externally because Google,
        | in particular, had been very aggressive at treating pricing as a
        | _competition_ exercise rather than anything that seemed tethered
        | to costs.
       | 
       | For quite some time, their pricing updates would be across-the-
        | board exactly 2/3 of the cost of OpenAI's equivalent model.
       | 
       | [^1] "If you're building batch tasks with LLMs and are looking to
       | navigate this new cost landscape, feel free to reach out to see
       | how Sutro can help."
       | 
       | [^2] "Google's decision to raise the price of Gemini 2.5 Flash
       | wasn't just a business decision; it was a signal to the entire
       | market." by far the biggest giveaway, the other tells are
       | repeated fanciful descriptions of things that _could_ be real,
        | that, when stacked up, indicate a surreal, artificial
        | understanding of what they're being asked to write about, i.e.
       | "In a move that at first went unnoticed,"
        
       | YetAnotherNick wrote:
       | Pricing != Cost.
       | 
        | One of the clearest examples is DeepSeek V3. DeepSeek has
        | mentioned that its price of 0.27/1.10 has an 80% profit margin,
        | so it costs them roughly 90% less than the price of Gemini
        | Flash. And Gemini Flash is very likely a smaller model than
        | DeepSeek V3.
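        | 
        | A sketch of that arithmetic (using the new 2.5 Flash prices
        | quoted elsewhere in the thread; all figures USD per million
        | tokens, and only approximate):
        | 
        |     ds_cost_in, ds_cost_out = 0.27 * 0.2, 1.10 * 0.2  # implied cost at 80% margin
        |     flash_in, flash_out = 0.30, 2.50                  # 2.5 Flash list prices
        |     print(ds_cost_in / flash_in, ds_cost_out / flash_out)  # ~0.18 and ~0.09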
        
       | impure wrote:
       | > In a move that at first went unnoticed
       | 
       | Oh, I noticed. I've also complained how Gemini 2.0 Flash is 50%
       | more expensive than Gemini 1.5 Flash for small requests.
       | 
       | Also I'm sure if Google wanted to price Gemini 2.5 Flash cheaper
       | they could. The reason they won't is because there is almost zero
       | competition at the <10 cents per million input token area.
       | Google's answer to the 10 cents per million input token area is
       | 2.5 Flash Lite which they say is equivalent to 2.0 Flash at the
       | same cost. Might be a bit cheaper if you factor in automatic
       | context caching.
       | 
       | Also the quadratic increase is valid but it's not as simple as
        | the article states due to caching. And if it were a big issue,
       | Google would impose tiered pricing like they do for Gemini 2.5
       | Pro.
       | 
       | And for what it's worth I've been playing around with Gemma E4B
       | on together.ai. It takes 10x as long as Gemini 2.5 Flash Lite and
       | it sucks at multilingual. But other than that it seems to produce
       | acceptable results and is way cheaper.
        
       | antirez wrote:
       | I think providers are making a mistake in simplifying prices at
       | all costs, hiding the quadratic nature of attention. People can
       | understand the pricing anyway, even if more complex, by having a
        | tool that lets them select a prompt and a reply length and see the
       | cost, or fancy 3D graphs that capture the cost surface of
       | different cases. People would start sending smaller prompts and
       | less context when less is enough, and what they pay would be more
       | related to the amount of GPU/TPU/... power they use.
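        | 
        | Even a toy calculator in that spirit would make the shape of the
        | cost visible (coefficients entirely hypothetical, just to
        | illustrate a quadratic-aware price):
        | 
        |     def estimated_cost(prompt_tokens, reply_tokens,
        |                        per_token=1e-7, quadratic=1e-13):
        |         total = prompt_tokens + reply_tokens
        |         linear = total * per_token                  # every token costs something
        |         return linear + quadratic * total * total   # attention over the whole context
        | 
        |     for prompt in (1_000, 100_000, 1_000_000):
        |         print(prompt, round(estimated_cost(prompt, 2_000), 4))
        |     # the quadratic term overtakes the linear one around the 1M-token mark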
        
       ___________________________________________________________________
       (page generated 2025-07-03 23:00 UTC)