[HN Gopher] Ironwood: The first Google TPU for the age of inference
       ___________________________________________________________________
        
       Ironwood: The first Google TPU for the age of inference
        
       Author : meetpateltech
       Score  : 324 points
       Date   : 2025-04-09 12:24 UTC (10 hours ago)
        
 (HTM) web link (blog.google)
 (TXT) w3m dump (blog.google)
        
       | fancyfredbot wrote:
       | It looks amazing but I wish we could stop playing silly games
       | with benchmarks. Why compare fp8 performance in ironwood to
       | architectures which don't support fp8 in hardware? Why leave out
       | TPUv6 in the comparison?
       | 
       | Why compare fp64 flops in the El Capitan supercomputer to fp8
       | flops in the TPU pod when you know full well these are not
       | comparable?
       | 
        | [Edit: it turns out that El Capitan is actually faster when
        | compared like for like, and the statement below underestimated
        | how much slower fp64 is; my original comment, in italics below,
        | is not accurate.] ( _The TPU would still be faster even allowing
        | for the fact that fp64 is ~8x harder than fp8. Is it worthwhile
        | to misleadingly claim it's 24x faster instead of honestly saying
        | it's 3x faster? Really?_)
       | 
       | It comes across as a bit cheap. Using misleading statements is a
       | tactic for snake oil salesmen. This isn't snake oil so why lower
       | yourself?
        
         | shihab wrote:
         | I went through the article and it seems you're right about the
         | comparison with El Capitan. These performance figures are so
         | bafflingly misleading.
         | 
          | And so unnecessary, too: nobody shopping for an AI inference
          | server cares at all about its relative performance vs. an fp64
          | machine. This language seems designed solely to wow
          | tech-illiterate C-suites.
        
         | cheptsov wrote:
         | I think it's not misleading, but rather very clear that there
         | are problems. v7 is compared to v5e. Also, notice that it's not
         | compared to competitors, and the price isn't mentioned.
         | Finally, I think the much bigger issue with TPU is the software
         | and developer experience. Without improvements there, there's
         | close to zero chance that anyone besides a few companies will
         | use TPU. It's barely viable if the trend continues.
        
           | latchkey wrote:
            | El Capitan, which the article references, is a competitor.
        
             | cheptsov wrote:
             | Are you suggesting NVIDIA is not a competitor?
        
               | latchkey wrote:
               | You said: "notice that it's not compared to competitors"
               | 
               | The article says: "When scaled to 9,216 chips per pod for
               | a total of 42.5 Exaflops, Ironwood supports more than 24x
               | the compute power of the world's largest supercomputer -
               | El Capitan - which offers just 1.7 Exaflops per pod."
               | 
               | It is literally compared to a competitor.
        
               | cheptsov wrote:
               | I believe my original sentence was accurate. I was
               | expecting the article to provide an objective comparison
               | between TPUs and their main competitors. If you're
               | suggesting that El Capitan is the primary competitor, I'm
               | not sure I agree, but I appreciate the perspective.
               | Perhaps I was looking for other competitors, which is why
               | I didn't really pay attention to El Capitan.
        
               | latchkey wrote:
               | Andrey, this is what I'm referring to:
               | https://news.ycombinator.com/item?id=43632709
        
               | cheptsov wrote:
               | Yea, makes sense
        
           | sebzim4500 wrote:
           | >Without improvements there, there's close to zero chance
           | that anyone besides a few companies will use TPU. It's barely
           | viable if the trend continues.
           | 
           | I wonder whether Google sees this as a problem. In a way it
           | just means more AI compute capacity for Google.
        
           | mupuff1234 wrote:
           | > besides a few companies will use TPU. It's barely viable if
           | the trend continues
           | 
            | That doesn't matter much if those few companies are the
            | biggest companies. Even with Nvidia, the majority of the
            | revenue is generated by a handful of hyperscalers.
        
         | imtringued wrote:
          | Also, there is no such thing as an "El Capitan pod". The quoted
          | number is for the entire supercomputer.
         | 
         | My impression from this is that they are too scared to say that
         | their TPU pod is equivalent to 60 GB200 NVL72 racks in terms of
         | fp8 flops.
         | 
         | I can only assume that they need way more than 60 racks and
         | they want to hide this fact.
        
           | jeffbee wrote:
           | A max-spec v5p deployment, at least the biggest one they'll
           | let you rent, occupies 140 racks, for reference.
        
             | aaronax wrote:
              | 8960 chips in those 140 racks. $4.20/hour/chip, or roughly
              | $3,066/month/chip.
              | 
              | So about $37.6k per hour, or ~$27 million per month.
             | 
             | Get 55% off with 3 year commitment.
        
         | charcircuit wrote:
         | >Why compare fp8 performance in ironwood to architectures which
         | don't support fp8 in hardware?
         | 
         | Because end users want to use fp8. Why should architectural
         | differences matter when the speed is what matters at the end of
         | the day?
        
           | bobim wrote:
           | GP bikes are faster than dirt bikes, but not on dirt. The
           | context has some influence here.
        
         | zipy124 wrote:
          | Because it is a public company that aims to maximise
          | shareholder value and thus the value of its stock. Since value
          | is largely evaluated by perception, if you can convince people
          | your product is better than it is, your stock valuation, at
          | least in the short term, will be higher.
         | 
          | Hence Tesla saying FSD and robo-taxis are one year away, the
          | fusion companies saying fusion is closer than it is, etc.
         | 
          | Nvidia, AMD, Apple, and Intel have all been publishing
          | misleading graphs for decades, and even under constant
          | criticism they continue to.
        
           | fancyfredbot wrote:
           | I understand the value of perception.
           | 
           | A big part of my issue here is that they've really messed up
           | the misleading benchmarks.
           | 
           | They've failed to compare to the most obvious alternative,
           | which is Nvidia GPUs. They look like they've got something to
           | hide, not like they're ahead.
           | 
            | They've needlessly made their own current products look bad
            | in comparison to this one, understating the long-standing
            | advantage TPUs have given Google.
           | 
           | Then they've gone and produced a misleading comparison to the
           | wrong product (who cares about El Capitan? I can't rent
           | that!). This is a waste of credibility. If you are going to
           | go with misleading benchmarks then at least compare to
           | something people care about.
        
         | segmondy wrote:
          | Why not? If we line up to race, you can't say "why compare a V8
          | to a V6 turbo or an electric engine?" It's a race; the
          | drivetrain doesn't matter. Who gets to the finish line first?
          | 
          | No one is shopping for GPUs by fp8, fp16, fp32, or fp64. It's
          | all about the cost/performance factor. 8 bits is as good as 32
          | bits, and great performance is even being pulled out of 4
          | bits...
        
           | fancyfredbot wrote:
           | This is like saying I'm faster because I ran (a mile) in 8
           | minutes whereas it took you 15 minutes (to run two miles).
        
         | fancyfredbot wrote:
          | It's even worse than I thought. El Capitan has 43,808 MI300A
          | APUs. According to AMD, each MI300A can do 3922TF of sparse
          | FP8, for a total of ~171EF of sparse FP8 performance, or ~85EF
          | non-sparse.
         | 
         | In other words El Capitan is between 2 and 4 times as fast as
         | one of these pods, yet they claim the pod is 24x faster than El
         | Capitan.
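          | 
          | Back-of-the-envelope in Python, using the numbers above (a
          | rough sketch, not official figures):
          | 
          |     chips = 43_808              # MI300A APUs in El Capitan
          |     per_chip = 3.922e15         # sparse FP8 FLOP/s each
          |     sparse = chips * per_chip   # ~1.72e20 = ~172 EF
          |     dense = sparse / 2          # ~86 EF
          |     pod = 42.5e18               # Ironwood pod, per the post
          |     print(dense / pod, sparse / pod)   # ~2x and ~4x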
        
         | adrian_b wrote:
         | FP64 is more like 64 times harder than FP8.
         | 
          | Actually, the cost is much higher, because the cost ratio is
          | not much less than the square of the ratio between the
          | significand sizes, which in this case is 52 bits / 4 bits = 13,
          | and 13 squared is 169.
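          | 
          | As a rough sketch of where that estimate comes from (a naive
          | array multiplier needs ~n^2 partial-product bits for an n-bit
          | significand):
          | 
          |     def mult_area(n_bits):
          |         return n_bits ** 2   # ~quadratic in significand width
          | 
          |     print(mult_area(52) / mult_area(4))   # 169, as above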
        
           | christkv wrote:
            | Memory size and bandwidth go up a lot too, right?
        
         | dekhn wrote:
          | Google shouldn't do that comparison. When I worked there, I
          | strongly urged the TPU leadership not to compare their systems
          | to supercomputers: not only were the comparisons misleading,
          | but Google absolutely does not want supercomputer users to
          | switch to TPUs. SC users are demanding and require huge
          | support.
        
       | lawlessone wrote:
       | Can these be repurposed for other things? Encoding/decoding
       | video? Graphics processing etc?
       | 
       | edit: >It's a move from responsive AI models that provide real-
       | time information for people to interpret, to models that provide
       | the proactive generation of insights and interpretation. This is
       | what we call the "age of inference" where AI agents will
       | proactively retrieve and generate data to collaboratively deliver
       | insights and answers, not just data.
       | 
        | Maybe I will sound like a Luddite, but I'm not sure I want this.
        | 
        | I'd rather AI/ML only do what I ask it to.
        
         | vinkelhake wrote:
         | Google already has custom ASICs for video transcoding. YouTube
         | has been running those for many years now.
         | 
         | https://streaminglearningcenter.com/encoding/asics-vs-softwa...
        
           | lawlessone wrote:
           | Thank you :)
        
         | cavisne wrote:
         | The JAX docs have a good explanation for how a TPU works
         | 
         | https://docs.jax.dev/en/latest/pallas/tpu/details.html#what-...
         | 
          | It's not really useful for other workloads (unless your
          | workload looks like a bunch of matrix multiplications).
        
       | no_wizard wrote:
        | Some honest competition in the chip space in the machine learning
        | race! Genuinely interested to see how this ends up playing out.
        | Nvidia seemed 'untouchable' for so long in this space that it's
        | nice to see things get shaken up.
        | 
        | I know they aren't selling the TPU as boxed units, but still,
        | even as hardware that backs GCP services and whatnot, it's
        | interesting to see how it'll shake out!
        
         | epolanski wrote:
         | > Nvidia seemed 'untouchable' for so long in this space that
         | its nice to see things get shaken up.
         | 
         | Did it?
         | 
          | Both Mistral's Le Chat (running on Cerebras) and Google's
          | Gemini (running on TPUs) clearly showed ages ago that Nvidia
          | has no advantage at all in inference.
          | 
          | The hundreds of billions spent on hardware so far have focused
          | on training, but inference is in the long run gonna get the
          | lion's share of the work.
        
           | wyager wrote:
            | > but inference is in the long run gonna get the lion's
            | share of the work.
           | 
           | I'm not sure - might not the equilibrium state be that we are
           | constantly fine-tuning models with the latest data (e.g.
           | social media firehose)?
        
       | nharada wrote:
        | The first TPU specifically designed for inference? Wasn't the
        | original TPU inference-only?
        
         | jeffbee wrote:
         | Yeah that made me chuckle, too. The original was indeed
         | inference-only.
        
         | dgacmu wrote:
         | Yup. (Source: was at brain at the time.)
         | 
         | Also holy cow that was 10 years ago already? Dang.
         | 
         | Amusing bit: The first TPU design was based on fully connected
         | networks; the advent of CNNs forced some design rethinking, and
         | then the advent of RNNs (and then transformers) did it yet
         | again.
         | 
         | So maybe it's reasonable to say that this is the first TPU
         | designed for inference in the world where you have both a
         | matrix multiply unit and an embedding processor.
         | 
         | (Also, the first gen was purely a co-processor, whereas the
         | later generations included their own network fabric, a trait
         | shared by this most recent one. So it's not totally crazy to
         | think of the first one as a very different beast.)
        
           | kleiba wrote:
           | _> the advent of CNNs forced some design rethinking, and then
           | the advent of RNNs (and then transformers) did it yet again._
           | 
           | Certainly, RNNs are much older than TPUs?!
        
             | woodson wrote:
             | So are CNNs, but I guess their popularity heavily increased
             | at that time, to the point where it made sense to optimize
             | the hardware for them.
        
             | hyhjtgh wrote:
              | RNNs were of course well known at the time, but they
              | weren't putting out state-of-the-art numbers back then.
        
           | miki123211 wrote:
           | Wow, you guys needed a custom ASIC for inference _before CNNs
           | were even invented_?
           | 
           | What were the use cases like back then?
        
             | refulgentis wrote:
             | https://research.google/blog/the-google-brain-team-
             | looking-b... is a good overview
             | 
              | I wasn't on Brain, but got obsessed with the Kremlinology
              | of ML internally at Google because I wanted to know why
              | leadership was so gung ho on it.
             | 
             | The general sense in the early days was these things can
             | learn anything, and they'll replace fundamental units of
             | computing. This thought process is best exhibited
             | externally by ex. https://research.google/pubs/the-case-
             | for-learned-index-stru...
             | 
             | It was also a different Google, the "3 different teams
             | working on 3 different chips" bit reminds me of lore re:
             | how many teams were working on Android wearables until
             | upper management settled it.
             | 
              | FWIW it's a very, very different company now. Back then it
              | was more entrepreneurial; a better version of the Wave era,
              | where things launch themselves. An MBA would find this
              | top-down company in 2025 even _better_; I find it less so.
              | It's perfectly tuned to do what Apple or OpenAI did 6-12
              | months ago, but not to lead: almost certainly a better
              | investment, but a worse version of an average workplace,
              | because it hasn't developed antibodies against BSing.
              | (disclaimer: worked on Android)
        
             | huijzer wrote:
              | According to a Google blog post from 2016 [1], use cases
              | were RankBrain, to improve the relevancy of search results,
              | and Street View. They also used it for AlphaGo. And from
              | what I remember from my MSc thesis, they were also probably
              | starting to use it for Translate. I can't find any TPU
              | reference in "Attention Is All You Need" or "BERT: Pre-
              | training of Deep Bidirectional Transformers for Language
              | Understanding", but I was fine-tuning BERT on a TPU at the
              | time, in October 2018 [2]. If I remember correctly, the
              | BERT example repository showed how to fit a model with a
              | TPU inside a Colab. So I would guess that the natural
              | language research was mostly not on TPUs around 2016-2018,
              | but then moved over to TPUs in production. I could be wrong
              | though, and dgacmu probably knows more.
             | 
             | [1]: https://cloud.google.com/blog/products/ai-machine-
             | learning/g...
             | 
             | [2]: https://github.com/rikhuijzer/improv/blob/master/runs/
             | 2018-1...
        
               | mmx1 wrote:
               | Yes, IIRC (please correct me if I'm wrong), translate did
               | utilize Seastar (TPU v1) which was integer only, so not
               | easily useful for training.
        
             | dekhn wrote:
             | As an aside, Google used CPU-based machine learning (using
             | enormous numbers of CPUs) for a long time before custom
             | ASICS or tensorflow even existed.
             | 
             | The big ones were SmartASS (ads serving) and Sibyl
             | (everything else serving). There was an internal debate
             | over the value of GPUs with a prominent engineer writing an
              | influential doc that caused Google to continue with fat CPU
             | nodes when it was clear that accelerators were a good
             | alternative. This was around the time ImageNet blew up, and
             | some eng were stuffing multiple GPUs in their dev boxes to
             | demonstrate training speeds on tasks like voice
             | recognition.
             | 
             | Sibyl was a heavy user of embeddings before there was any
             | real custom ASIC support for that and there was an add-on
             | for TPUs called barnacore to give limited embedding support
             | (embeddings are very useful for maximizing profit through
             | ranking).
        
         | theptip wrote:
         | The phrasing is very precise here, it's the first TPU for _the
         | age of inference_, which is a novel marketing term they have
         | defined to refer to CoT and Deep Research.
        
       | nehalem wrote:
       | Not knowing much about special-purpose chips, I would like to
       | understand whether chips like this would give Google a
       | significant cost advantage over the likes of Anthropic or OpenAI
       | when offering LLM services. Is similar technology available to
       | Google's competitors?
        
         | baby_souffle wrote:
         | There are other ai/llm 'specific' chips out there, yes. But the
          | thing about ASICs is that you need one for each *specific*
         | task. Eventually we'll hit an equilibrium but for now, the
         | stuff that Cerebras is best at is not what TPUs are best at is
         | not what GPUs are best at...
        
           | monocasa wrote:
           | I don't even know if eventually we'll hit an equilibrium.
           | 
           | The end of Moore's law pretty much dictates specialization,
           | it's just more apparent in fields without as much
           | ossification first.
        
         | avrionov wrote:
          | NVIDIA operates at a 70% profit margin right now. Not paying
          | that premium and having an alternative to NVIDIA is beneficial.
          | We just don't know by how much.
        
           | kccqzy wrote:
           | I might be misremembering here, but Google's own AI models
           | (Gemini) don't use NVIDIA hardware in any way, training or
           | inference. Google bought a large number of NVIDIA hardware
           | only for Google Cloud customers, not themselves.
        
         | heymijo wrote:
         | GPUs, very good for pretraining. Inefficient for inference.
         | 
         | Why?
         | 
          | For each new word a transformer generates, it has to move the
          | entire set of model weights from memory to the compute units.
          | For a 70 billion parameter model with 16-bit weights, that
          | requires moving approximately 140 gigabytes of data to generate
          | just a single word.
          | 
          | GPUs have off-chip memory. That means a GPU has to push data
          | across a chip-memory bridge for every single word it creates.
          | This architectural choice is an advantage for graphics
          | processing, where large amounts of data need to be stored but
          | not necessarily accessed as rapidly for every single
          | computation. It's a liability in inference, where quick and
          | frequent data access is critical.
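          | 
          | A rough sketch of that bound (the bandwidth figure is an
          | assumption for illustration, not a spec from the article):
          | 
          |     params = 70e9                 # 70B-parameter model
          |     bytes_per_token = params * 2  # 16-bit weights -> ~140 GB
          |     hbm = 8e12                    # assume ~8 TB/s of HBM
          |     # batch 1, no speculative decoding: every new token
          |     # re-reads the full set of weights
          |     t = bytes_per_token / hbm
          |     print(t, 1 / t)               # ~0.018 s -> ~57 tok/s cap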
         | 
         | Listening to Andrew Feldman of Cerebras [0] is what helped me
         | grok the differences. Caveat, he is a founder/CEO of a company
         | that sells hardware for AI inference, so the guy is talking his
         | book.
         | 
         | [0]
         | https://www.youtube.com/watch?v=MW9vwF7TUI8&list=PLnJFlI3aIN...
        
           | hanska wrote:
           | The Groq interview was good too. Seems that the thought
           | process is that companies like Groq/Cerebras can run the
           | inference, and companies like Nvidia can keep/focus on their
           | highly lucrative pretraining business.
           | 
           | https://www.youtube.com/watch?v=xBMRL_7msjY
        
           | latchkey wrote:
            | Cerebras (and Groq) have the problem of using too much die
            | for compute and not enough for memory. Their method of
            | scaling is to fan out the compute across more physical space.
            | This takes more DC space, power, and cooling, which is a huge
            | issue. Funny enough, when I talked to Cerebras at SC24, they
            | told me their largest customers are for training, not
            | inference. They just market it as an inference product, which
            | is even more confusing to me.
           | 
           | I wish I could say more about what AMD is doing in this
           | space, but keep an eye on their MI4xx line.
        
             | heymijo wrote:
             | > _they told me their largest customers are for training,
             | not inference_
             | 
             | That is curious. Things are moving so quickly right now. I
             | typed out a few speculative sentences then went ahead and
             | asked an LLM.
             | 
             | Looks like Cerebras is responding to the market and
             | pivoting towards a perceived strength of their product
             | combined with the growth in inference, especially with the
             | advent of reasoning models.
        
               | latchkey wrote:
               | I wouldn't call it "pivoting" as much as "marketing".
        
           | ein0p wrote:
            | Several incorrect assumptions in this take. For one thing,
            | 16-bit is not necessary. For another, 140GB/token holds only
            | if your batch size is 1 and your sequence length is 1 (no
            | speculative decoding). Nobody runs LLMs like that on those
            | GPUs; if you do it like that, compute utilization becomes
            | ridiculously low. With a batch size greater than 1 and
            | speculative decoding, the arithmetic intensity of the kernels
            | is much higher, and having weights "off chip" is not that
            | much of a concern.
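            | 
            | A sketch of the batching point, reusing the rough numbers
            | from above (it ignores KV-cache traffic and the compute
            | ceiling):
            | 
            |     weight_bytes = 140e9   # one pass over the weights
            |     hbm = 8e12             # assumed ~8 TB/s
            |     for batch in (1, 8, 64, 256):
            |         # one weight pass now serves `batch` sequences
            |         toks = batch * hbm / weight_bytes
            |         print(batch, round(toks))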
        
         | xnx wrote:
         | Google has a significant advantage over other hyperscalers
         | because Google's AI data centers are much more compute cost
         | efficient (capex and opex).
        
           | claytonjy wrote:
           | Because of the TPUs, or due to other factors?
           | 
            | What even is an AI data center? Are the GPU/TPU boxes in a
            | different building than the others?
        
             | xnx wrote:
             | > Because of the TPUs, or due to other factors?
             | 
             | Google does many pieces of the data center better. Google
             | TPUs use 3D torus networking and are liquid cooled.
             | 
             | > What even is an AI data center?
             | 
             | Being newer, AI installations have more
             | variations/innovation than traditional data centers.
             | Google's competitors have not yet adopted all of Google's
             | advances.
             | 
             | > are the GPU/TPU boxes in a different building than the
             | others?
             | 
             | Not that I've read. They are definitely bringing on new
             | data centers, but I don't know if they are initially
             | designed for pure-AI workloads.
        
               | nsteel wrote:
               | Wouldn't a 3d torus network have horrible performance
               | with 9,216 nodes? And really horrible latency? I'd have
               | assumed traditional spine-leaf would do better. But I
               | must be wrong as they're claiming their latency is great
               | here. Of course, they provide zero actual evidence of
               | that.
               | 
               | And I'll echo, what even is an AI data center, because
               | we're still none the wiser.
        
               | xnx wrote:
               | > what even is an AI data center
               | 
               | A data center that runs significant AI training or
               | inference loads. Non AI data centers are fairly
               | commodity. Google's non-AI efficiency is not much better
               | than Amazon or anyone else. Google is much more efficient
               | at running AI workloads than anyone else.
        
               | xadhominemx wrote:
                | It's a data center with much higher power density. We're
                | talking about 100 going to 1,000 kW/rack, vs 20 kW/rack
                | for a traditional data center, requiring much different
                | cooling and power delivery.
        
               | dekhn wrote:
               | A 3d torus is a tradeoff in terms of wiring
               | complexity/cost and performance. When node counts get
               | high you can't really have a pair of wires between all
               | pairs of nodes, so if you don't use a torus you usually
               | need a stack of switches/routers aggregating traffic.
               | Those mid-level and top-level switch/routers get very
               | expensive (high bandwidth cross-section) and the routing
               | can get a bit painful. 3d torus has far fewer cables, and
               | the routing can be really simple ("hop vertically until
                | you reach your row, then hop horizontally to reach your
               | node"), and the wrap-around connections are nice.
               | 
               | That said, the torus approach was a gamble that most
               | workloads would be nearest-neighbor, and allreduce needs
               | extra work to optimize.
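                | 
                | A toy version of that dimension-ordered routing;
                | the dimensions here are made up, not the real
                | topology:
                | 
                |     def hops(src, dst, dims):
                |         # shortest path per axis on a ring
                |         h = 0
                |         for s, t, d in zip(src, dst, dims):
                |             delta = abs(s - t)
                |             h += min(delta, d - delta)
                |         return h
                | 
                |     # hypothetical 16x16x36 torus = 9,216 nodes
                |     print(hops((0,0,0), (8,8,18), (16,16,36)))  # 34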
               | 
               | An AI data center tends to have enormous power
               | consumption and cooling capabilities, with less disk, and
               | slightly different networking setups. But really it just
               | means "this part of the warehouse has more ML chips than
               | disks"
        
             | summerlight wrote:
              | Lots of other factors. I suspect this is one of the reasons
              | why Google cannot offer TPU hardware outside of their cloud
              | service. A significant chunk of TPU efficiency can be
              | attributed to external factors which customers cannot
              | easily replicate.
        
         | pkaye wrote:
         | Anthropic is using Google TPUs. Also jointly working with
         | Amazon on a data center using Amazon's custom AI chips. Also
         | Google and Amazon are both investors in Anthropic.
         | 
         | https://www.datacenterknowledge.com/data-center-chips/ai-sta...
         | 
         | https://www.semafor.com/article/12/03/2024/amazon-announces-...
        
         | cavisne wrote:
          | Nvidia has ~60% margins on their datacenter chips, so TPUs have
          | quite a bit of headroom to save Google money without being as
          | good as Nvidia GPUs.
         | 
         | No one else has access to anything similar, Amazon is just
         | starting to scale their Trainium chip.
        
           | buildbot wrote:
           | Microsoft has the MAIA 100 as well. No comment on their
           | scale/plans though.
        
       | behnamoh wrote:
       | The naming of these chips (GPUs, CPUs) is kinda badass: Ironwood,
       | Blackwell, ThreadRipper, Epyc, etc.
        
         | mikrl wrote:
         | Scroll through wikichip sometime and try to figure out the
         | Intel march names.
         | 
         | I always confuse Blackwell with Bakewell (tart) and my CPU is
         | on Coffee Lake and great... now I want coffee and cake
        
       | qoez wrote:
       | Post just to tease us since they barely sell TPUs
        
       | throwaway48476 wrote:
        | It's hard to be excited about hardware that will only exist in
        | the cloud before being shredded.
        
         | p_j_w wrote:
         | I think this article is for Wall Street, not Silicon Valley.
        
           | noitpmeder wrote:
           | What's their use case?
        
             | fennokin wrote:
             | As in for investor sentiment, not literally finance
             | companies.
        
             | amelius wrote:
             | Gambling^H^H^H^H Making markets more "efficient".
        
           | mycall wrote:
           | Bad timing as I think Wall Street is preoccupied at the
           | moment.
        
             | asdfman123 wrote:
             | Oh, believe me, they are very much paying attention to tech
             | stocks right now.
        
         | jeffbee wrote:
         | Ogg no care multi-axis computer-numerical machine center. Ogg
         | no space Ogg cave for nonsense. Ogg bang rock scrape hide.
        
           | CursedSilicon wrote:
           | Please don't make low-effort bait comments. This isn't Reddit
        
         | crazygringo wrote:
         | You can't get excited about lower prices for your cloud GPU
         | workloads thanks to the competition it brings to Nvidia?
         | 
         | This benefits everyone, even if you don't use Google Cloud,
         | because of the competition it introduces.
        
           | 01HNNWZ0MV43FF wrote:
           | I like owning things
        
             | sodality2 wrote:
              | Cloud providers will buy fewer NVDA chips, and since
              | they're related goods, prices will drop.
        
             | xadhominemx wrote:
             | You own any GB200s?
        
             | baobabKoodaa wrote:
             | You will own nothing and you will be happy.
        
           | throwaway48476 wrote:
           | It's only competitive with nvidia if you believe Google won't
           | kill this product like everything else.
        
             | maxrmk wrote:
              | I love to hate on Google, but I suspect this is strategic
              | enough that they won't kill it.
              | 
              | Like Graviton at AWS, it's as much of a negotiation tool as
              | it is a technical solution, letting them push harder on
              | pricing with NVIDIA because they have a backup option.
        
               | mmx1 wrote:
               | Google has done stuff primarily for negotiation purposes
               | (e.g. POWER9 chips) but TPU ain't one. It's not a backup
               | option or presumed "inferior solution" to NVIDIA. Their
               | entire ecosystem is TPU-first.
        
             | joshuamorton wrote:
             | Google's been doing custom ML accelerators for 10 years
             | now, and (depending on how much you're willing to stretch
             | the definition) has been doing them in consumer hardware
             | for soon to be five years (the Google Tensor chips in pixel
             | phones).
        
         | justanotheratom wrote:
         | exactly. I wish Groq would start selling their cards that they
         | use internally.
        
           | xadhominemx wrote:
           | They would lose money on every sale
        
         | foota wrote:
         | Personally, I have a (non-functional) TPU sitting on my desk at
         | home :-)
        
       | fluidcruft wrote:
       | This isn't anything anyone can purchase, is it? Who's the
       | audience for this announcement?
        
         | badlucklottery wrote:
         | > Who's the audience for this announcement?
         | 
         | Probably whales who can afford to rent one from Google Cloud.
        
           | jeffbee wrote:
           | People with $3 are whales now? TPU prices are similar to
           | other cloud resources.
        
             | dylan604 wrote:
             | Does anyone do anything useful with a $3 spend, or is it $3
             | X $manyManyHours?
        
               | scarmig wrote:
               | No one does anything useful with a $3 spend. That's not
               | anything particular to TPUs, though.
        
               | dylan604 wrote:
               | That's my point. The touting of $3 is beyond misleading.
        
               | fancyfredbot wrote:
               | You can do real work for a few hundred dollars which is
               | hardly the exclusive domain of "whales"?
               | 
               | The programmer who writes code to run on these likely
               | costs at least 15x this amount an hour.
        
           | MasterScrat wrote:
           | An on-demand v5e-1 is $1.2/h, it's pretty accessible.
           | 
           | The challenge is getting them to run efficiently, which
           | typically involves learning JAX.
        
         | llm_nerd wrote:
          | The overwhelming majority of AI compute is used either by the
          | few bigs in their own products, or by third parties that rent
          | access to compute resources from those same bigs. Extremely few
          | AI companies are buying their own GPU/TPU buildouts.
         | 
         | Google says Ironwood will be available in the Google Cloud late
         | this year, so it's relevant to just about anyone that rents AI
         | compute, which is just about everyone in tech. Even if you have
         | zero interest in this product, it will likely lead to downward
         | pressure on pricing, mostly courtesy of the large memory
         | allocations.
        
           | fluidcruft wrote:
            | It just seems like John Deere putting out a press release
            | about a new spark plug that is only useful to John Deere and
            | can maybe be used on rented John Deere harvesters when
            | sharecropping on John Deere-owned fields using John Deere GMO
            | crops. I just don't see what's appealing about any of it. Not
            | only is it a walled garden, you can't even own anything and
            | are completely dependent on the whims of John Deere not to
            | bulldoze the entire field.
           | 
           | It just seems like if you build on Tensor then sure, you can
           | go home, but Google will keep your ball.
        
             | aseipp wrote:
             | The reality is that for large scale AI deployment there's
             | only one criterion that matters: what is the total cost of
             | ownership? If TPUs are 1/30th the total perf but 1/50th the
             | total price, then they will be bought by customers.
             | Basically that simple.
             | 
             | Most places using AI hardware don't actually want to expend
             | massive amounts of capital to procure it and then shove it
             | into racks somewhere and then manage it over its total
             | lifetime. Hyperscalers like Google are also far, far ahead
             | in things like DC energy efficiency, and at really large
             | scale those energy costs are huge and have to be factored
             | into the TCO. The long dominant cost of this stuff is all
             | operational expenditures. Anyone running a physical AI
             | cluster is going to have to consider this.
             | 
             | The walled garden stuff doesn't matter, because places
             | demanding large-scale AI deployments (and actually willing
             | to spend money on it) do not really have the same
             | priorities as HN homelabbers who want to install
             | inefficient 5090s so they can run Ollama.
        
               | fluidcruft wrote:
                | At large scale, why shouldn't it matter whether you're
                | beholden to Google's cloud only vs having options to use
                | AWS or Oracle or Azure etc.? There's maybe an argument to
                | be made about the price and efficiency of Google's data
                | centers, but Google's cloud is far from notably cheaper
                | than alternatives (to put it mildly), so that's a moot
                | point: if there are any efficiencies to be had, Google's
                | pocketing them itself. I just don't see why anyone should
                | care about this chip except Google themselves. It would
                | be a different story if we were talking about a chip that
                | had the option of being available in non-Google data
                | centers.
        
         | xhkkffbf wrote:
         | People who buy their stock.
        
         | avrionov wrote:
         | The audience is Google cloud customers + investors
        
       | _hark wrote:
       | Can anyone comment on where efficiency gains come from these days
       | at the arch level? I.e. not process-node improvements.
       | 
       | Are there a few big things, many small things...? I'm curious
       | what fruit are left hanging for fast SIMD matrix multiplication.
        
         | yeahwhatever10 wrote:
          | Specialization. I.e., specialized for inference.
        
         | vessenes wrote:
         | One big area the last two years has been algorithmic
         | improvements feeding hardware improvements. Supercomputer folks
         | use f64 for everything, or did. Most training was done at f32
         | four years ago. As algo teams have shown fp8 can be used for
         | training and inference, hardware has updated to accommodate,
         | yielding big gains.
         | 
         | NB: Hobbyist, take all with a grain of salt
        
           | jmalicki wrote:
            | Unlike a lot of supercomputer algorithms, where fp error
            | accumulates as you go, gradient-descent-based algorithms
            | don't need as much precision, since any fp error in an update
            | still shows up at the next loss calculation and gets
            | corrected there, which lets you make do with much lower
            | precision.
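            | 
            | A toy illustration, with coarse rounding standing in for a
            | low-precision format:
            | 
            |     def q(x, step=0.25):    # crude low-precision rounding
            |         return round(x / step) * step
            | 
            |     w = 5.0                 # minimize (w - 1)^2
            |     for _ in range(50):
            |         grad = 2 * (w - 1)  # exact loss/grad each step
            |         w -= 0.1 * q(grad)  # noisy low-precision update
            |     print(w)                # still lands near 1.0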
        
         | muxamilian wrote:
         | In-memory computing (analog or digital). Still doing SIMD
         | matrix multiplication but using more efficient hardware:
         | https://arxiv.org/html/2401.14428v1
         | https://www.nature.com/articles/s41565-020-0655-z
        
           | gautamcgoel wrote:
            | This is very interesting, but not what the Ironwood TPU is
            | doing. The blog post says that the TPU uses conventional HBM
            | RAM.
        
             | nsteel wrote:
             | There's been some talk/rumour of next-gen HBMs having some
             | compute capability on the base die. But again, not what
             | they're doing here, this is regular HBM3/HBM3e.
             | 
             | https://semiengineering.com/speeding-down-memory-lane-
             | with-c...
        
       | vessenes wrote:
        | 7.2 TB/s of HBM bandwidth raised my eyebrows. But then I googled,
        | and it looks like GB200 is 16 TB/s. In plebe land, 2 TB/s is
        | pretty awesome.
        | 
        | These continue to be mostly for bragging rights and strategic
        | safety, I think. I bet they are not on premium process nodes; if
        | I worked at GOOG I'd probably think about these as competitive
        | insurance vis-a-vis NVIDIA: total costs of the chip team,
        | software, tape-outs, and increased data center energy use
        | probably wipe out any savings from not buying NV, but you are
        | 100% not beholden to Jensen.
        
       | gigel82 wrote:
        | I was hoping they'd launch a Coral-style device that can run
        | locally and cheaply, with updated specs.
        | 
        | It would be awesome for things like homelabs (to run Frigate NVR,
        | Immich ML tasks, or the Home Assistant LLM).
        
       | GrumpyNl wrote:
        | Why doesn't Google offer the most advanced voice technology when
        | they offer a playback version? It still sounds like the most
        | basic text-to-speech.
        
       | tuna74 wrote:
        | How is the API story for these devices? Are the drivers mainlined
        | in Linux? Is there a specific API you use to code for them? What
        | does the instance you rent on Google Cloud look like, and what
        | software does it come with?
        
         | cbarrick wrote:
         | XLA (Accelerated Linear Algebra) [1] is likely the library that
         | you'll want to use to code for these machines.
         | 
          | TensorFlow, PyTorch, and JAX all support XLA on the backend.
         | 
         | [1]: https://openxla.org/
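          | 
          | A minimal JAX sketch of that path: jit traces the function and
          | hands it to XLA to compile, so the same code targets CPU, GPU,
          | or TPU:
          | 
          |     import jax
          |     import jax.numpy as jnp
          | 
          |     @jax.jit                   # compiled by XLA
          |     def matmul(a, b):
          |         return jnp.dot(a, b)
          | 
          |     a = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
          |     print(jax.devices())       # TpuDevice entries on a TPU VM
          |     print(matmul(a, a).shape)  # (1024, 1024)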
        
       | g42gregory wrote:
        | And ... where could we get one? If they won't sell it to anyone,
        | then is this a self-congratulation story? Why do we even need to
        | know about this? If it propagates to lower Gemini prices,
        | fantastic. If not, then isn't it kind of irrelevant to the actual
        | user experience?
        
         | lordofgibbons wrote:
         | You can rent it on GCP in a few months
        
           | g42gregory wrote:
           | Good point. At what prices per GB/TOPS? Better be lower than
           | the existing TPUs ... That's what I care about.
        
         | jstummbillig wrote:
         | Well, with stocks and all, there is more that matters in the
         | world than "actual user experience"
        
       | DeathArrow wrote:
       | Cool. But does it support CUDA?
        
       | wg0 wrote:
       | Can anyone buy them?
        
       | ein0p wrote:
       | God damn it, Google. Make a desktop version of these things.
        
       | DisjointedHunt wrote:
       | Cloud resources are trending towards consumer technology adoption
       | numbers rather than being reserved mostly for Enterprise. This is
       | the most exciting thing in decades!
       | 
        | There is going to be a GPU/accelerator shortage for the
        | foreseeable future for running the most advanced models; Gemini
        | 2.5 Pro is a good example. It is probably the first model on
        | which many developers I've considered skeptics of extended agent
        | use have started to saturate free token thresholds.
       | 
       | Grok is honestly the same, but the lack of an API is suggestive
       | of the massive demand wall they face.
        
       | attentive wrote:
        | Anyone know how this compares to AWS Inferentia chips?
        
       | aranw wrote:
       | I wonder if these chips might contribute towards advancements for
       | the Coral TPU chips?
        
       ___________________________________________________________________
       (page generated 2025-04-09 23:00 UTC)