[HN Gopher] Trillium TPU Is GA
___________________________________________________________________
Trillium TPU Is GA
Author : gok
Score : 97 points
Date : 2024-12-11 15:44 UTC (7 hours ago)
(HTM) web link (cloud.google.com)
(TXT) w3m dump (cloud.google.com)
| xnx wrote:
| > We used Trillium TPUs to train the new Gemini 2.0,
|
| Wow. I knew custom Google silicon was used for inference, but I
| didn't realize it was used for training too. Does this mean
| Google is free of dependence on Nvidia GPUs? That would be a huge
| advantage over AI competitors.
| m3kw9 wrote:
| Maybe only for their own models
| walterbell wrote:
| Now any Google customer can use Trillium for training any
| model?
| richards wrote:
| [Google employee] Yes, you can use TPUs in Compute Engine
| and GKE, among other places, for whatever you'd like. I
| just checked and the v6 are available.
| KaoruAoiShiho wrote:
| Is there not going to be a v6p?
| richards wrote:
| Can't speculate on futures, but here's the current
| version log ... https://cloud.google.com/tpu/docs/system-
| architecture-tpu-vm...
| xnx wrote:
| Google trained Llama-2-70B on Trillium chips
| monocasa wrote:
| I thought llama was trained by meta.
| DrBenCarson wrote:
| > Google trained Llama
|
| Source? This would make quite the splash in the market
| xnx wrote:
| It's in the article: "When training the Llama-2-70B
| model, our tests demonstrate that Trillium achieves near-
| linear scaling from a 4-slice Trillium-256 chip pod to a
| 36-slice Trillium-256 chip pod at a 99% scaling
| efficiency."
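|
| For what it's worth, "scaling efficiency" there is just measured
| speedup divided by the ideal linear speedup. A rough sketch of the
| arithmetic in Python (only the 4-slice/36-slice ratio comes from
| the article; the measured speedup below is made up):
|
|     small_slices, large_slices = 4, 36
|     ideal_speedup = large_slices / small_slices   # 9.0x if perfectly linear
|     measured_speedup = 8.91                       # hypothetical measurement
|     efficiency = measured_speedup / ideal_speedup
|     print(f"{efficiency:.0%}")                    # -> 99%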
| llm_nerd wrote:
| I'm pretty sure they're doing fine-tune training, using
| Llama because it is a widely known and available sample.
| They used SDXL elsewhere for the same reason.
|
| Llama 2 was released well over a year ago and was a
| collaboration between Meta and Microsoft.
| hhh wrote:
| They can just train another one.
| llm_nerd wrote:
| Llama 2 end weights are public. The data used to train
| it, or even the process used to train it, are not. Google
| can't just train another Llama 2 from scratch.
|
| They could train something similar, but it'd be super
| weird if they called it Llama 2. They could call it
| something like "Gemini", or if it's open weights,
| "Gemma".
| Permit wrote:
| My understanding is that the Trillium TPU was primarily
| targeted at inference (so it's surprising to see it was used to
| train Gemini 2.0) but other generations of TPUs have targeted
| training. For example the chip prior to this one is called TPU
| v5p and was targeted toward training.
| dekhn wrote:
| Google's TPU silicon has been used for training for at least 5
| years, probably more (I think it's closer to 10). They do not
| depend on Nvidia GPUs for the majority of their projects. It
| took TPUs a while to catch up on some details, like sparsity.
| summerlight wrote:
| This aligns with my knowledge. I don't know much about LLMs,
| but TPUs have been used for training deep prediction models in
| ads at least since 2018, though there were some gaps filled by
| CPU/GPU for a while. Nowadays, TPU capacity is probably larger
| than CPU and GPU combined.
| felarof wrote:
| +1, almost all (if not all) Google training runs on TPU.
| They don't use NVIDIA GPUs at all.
| dekhn wrote:
| at some point some researchers were begging for GPUs...
| mainly for sparse work. I think that's why sparsecore was
| added to TPU (https://cloud.google.com/tpu/docs/system-
| architecture-tpu-vm...) in v4. I think at this point with
| their tech turnaround time they can catch up as
| competitors add new features and researchers want to use
| them.
| felarof wrote:
| dumb question: wdym by sparse work? Is it embedding
| lookups?
|
| (TPUs have had BarnaCore for efficient embedding lookups
| since TPU v3)
| dekhn wrote:
| Mostly embedding, but IIRC DeepMind RL made use of sparsity:
| basically, huge matrices with only a few non-zero elements.
|
| BarnaCore existed and was used, but was tailored mostly
| for embeddings. BTW, IIRC they were called that because
| they were added "like a barnacle hanging off the side".
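|
| To make the embedding case concrete: an embedding lookup is
| logically a matmul against a one-hot matrix that is almost
| entirely zeros, so hardware just does a gather instead. A toy
| Python sketch (illustrative only, nothing TPU-specific):
|
|     import numpy as np
|     vocab, dim = 50_000, 128
|     table = np.random.randn(vocab, dim).astype(np.float32)
|     ids = np.array([17, 42, 1337])   # the few "non-zero" positions
|     # dense view: one_hot(ids) @ table would be a huge, mostly-zero matmul
|     embeddings = table[ids]          # sparse view: gather 3 rows, shape (3, 128)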
|
| The evolution of the TPU has been interesting to watch. I
| came from the HPC and supercomputing space, and seeing
| Google stay mostly-CPU for the longest time, then finally
| learn how to build "supercomputers" over a decade+
| (gradually adding many features that classical
| supercomputers have long had), was quite a process. Some
| very expensive mistakes along the way. But now they've paid
| down almost all the expensive up-front costs and can ride
| on the margins, adding new bits and pieces while increasing
| the clocks and capacities on a cadence.
| amelius wrote:
| Do they have the equivalent of CUDA, and what is it
| called?
| lern_too_spel wrote:
| Since TPUv2, announced in 2017:
| https://arstechnica.com/information-technology/2017/05/googl...
|
| The hyperscalers are all working on this.
| https://aws.amazon.com/ai/machine-learning/trainium/
| drusepth wrote:
| Why is that a huge advantage over AI competitors? Just not
| having to fight for limited Nvidia supply?
| aseipp wrote:
| That is one factor, but another is total cost of ownership.
| At large scales something that's 1/2 the overall speed but
| 1/3rd the total cost is still a net win by a large margin.
| This is one of the reasons why every major hyperscaler is, to
| some extent, developing their own hardware, e.g. Meta, who
| famously have an insane number of Nvidia GPUs.
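|
| Back-of-the-envelope version of that claim in Python (the
| 1/2-speed, 1/3-cost numbers are just the hypothetical from
| above; what matters is cost per unit of work, i.e. cost
| divided by speed):
|
|     baseline_speed, baseline_cost = 1.0, 1.0
|     custom_speed, custom_cost = 0.5, 1.0 / 3.0
|     baseline_cost_per_work = baseline_cost / baseline_speed  # 1.0
|     custom_cost_per_work = custom_cost / custom_speed        # ~0.67
|     # i.e. ~33% cheaper per unit of training work despite being slower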
|
| Of course this does not mean their models will necessarily be
| proportionally better, nor does it mean Google won't buy GPUs
| for other reasons (like providing them to customers on Google
| Cloud).
| xnx wrote:
| Yes. And cheaper operating cost per TFLOP.
| badlucklottery wrote:
| Vertical integration.
|
| Nvidia is making big bucks "selling shovels in a gold rush".
| Google has made their own shovel factory and they can avoid
| paying Nvidia's margins.
| bufferoverflow wrote:
| TPUs are cheaper and faster than GPUs. But it's custom
| silicon. Which means barrier to entry is very very high.
| felarof wrote:
| > Which means barrier to entry is very very high.
|
| +1 on this. The tooling to use TPUs still needs more work.
| But we are betting on building this tooling and unlocking
| these ASIC chips (https://github.com/felafax/felafax).
| felarof wrote:
| TPUs have been used for training for a long time.
|
| (PS: we are a startup trying to make TPUs more accessible; if you
| wanna fine-tune Llama 3 on TPU, check out
| https://github.com/felafax/felafax)
| randomcatuser wrote:
| How good is Trillium/TPU compared to Nvidia? It seems the stats
| are: TPU v6e achieves ~900 TFLOPS per chip (fp16), while an Nvidia
| H100 achieves ~1800 TFLOPS per GPU (fp16)?
|
| Would be neat if anyone has benchmarks!!
| chessgecko wrote:
| The 1800 on the H100s is with 2:4 sparsity; it's half of that
| without. Not sure if the TPU number includes that too, but I
| don't think 2:4 sparsity is used that heavily, so I would
| probably compare without it.
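|
| Roughly, taking the numbers in this thread at face value (peak
| figures, not measured benchmarks):
|
|     h100_fp16_sparse = 1800                    # TFLOPS, with 2:4 structured sparsity
|     h100_fp16_dense = h100_fp16_sparse / 2     # ~900 TFLOPS dense
|     tpu_v6e_peak = 900                         # TFLOPS per chip, per the parent comment
|     # -> dense peak numbers are in the same ballpark on paper; real-world
|     #    results depend on memory bandwidth, interconnect and utilization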
| amelius wrote:
| Are the Gemini models open?
| stefan_ wrote:
| Not even to Google customers most days, it seems.
| jnwatson wrote:
| Just the Gemma models are open.
| blackeyeblitzar wrote:
| So Google has Trillium, Amazon has Trainium, Apple is working on
| a custom chip with Broadcom, etc. Nvidia's moat doesn't seem that
| big.
|
| Plus big tech companies have the data and customers and will
| probably be the only surviving big AI training companies. I doubt
| startups can survive this game - they can't afford the chips,
| can't build their own, don't have existing products to leech data
| off of, and don't have control over distribution channels like
| OSes or app stores.
| talldayo wrote:
| > Nvidia's moat doesn't seem that big.
|
| Well, look at it this way. Nvidia played their cards so well
| that their competitors had to invent entirely new product
| categories to supplant their demand for Nvidia hardware. This
| new hardware isn't even reprising the role of CUDA, just the
| subset of tensor operations that are used for training and AI
| inference. If demand for training and inference wanes, these
| hardware investments will be almost entirely wasted.
|
| Nvidia's core competencies - scaling hardware up and down,
| providing good software interfaces, and selling direct to
| consumer - are not really assailed at all. The big lesson Nvidia
| is giving the industry is that you _should_ invest in complex
| GPU architectures and write the software to support them.
| Currently the industry is trying its hardest to reject that
| philosophy, and only time will tell if they're correct.
| felarof wrote:
| > CUDA, just the subset of tensor operations that are used
| for training and AI inference. If demand for training and
| inference wanes
|
| Interesting take, but why would demand for training and
| inference wane? This seems like a very contrarian take.
| talldayo wrote:
| Maybe it won't - I say "time will tell" because we really
| do not know how much LLMs will be demanded in 10 years.
| Nvidia's stock skyrocketed because they were incidentally
| prepared for an enormous increase in demand _the moment_ it
| happened. Now that expectations are cooling down and Sam
| Altman is signalling that AGI is a long way off, the math
| that justified designing NPU/TPU hardware in-house might
| not add up anymore. Even if you believe in the tech itself,
| the hype is cooling and the do-or-die moment is rapidly
| approaching.
|
| My overall point is that I think Nvidia played smartly from
| the start. They could derive profit from any sufficiently
| large niche their competitors were too afraid to exploit,
| and general purpose GPU compute was the perfect investment.
| With AMD, Apple and the rest of the industry focusing on
| simpler GPUs, Nvidia was given an empty soapbox to market
| CUDA with. The big question is whether demand for CUDA can
| be supplanted with application-specific accelerators.
| felarof wrote:
| > The big question is whether demand for CUDA can be
| supplanted with application-specific accelerators.
|
| At least for AI workloads, Google's XLA compiler and the
| JAX ML framework have reduced the need for something like
| CUDA.
|
| There are two main ways to train ML models today:
|
| 1) Kernel-heavy approach: This is where frameworks like
| PyTorch are used, and developers write custom kernels
| (using Triton or CUDA) to speed up certain ops.
|
| 2) Compiler-heavy approach: This uses tools like XLA,
| which apply techniques like op fusion and compiler
| optimizations to automatically generate fast, low-level
| code.
|
| NVIDIA's CUDA is a major strength in the first approach.
| But if the second approach gains more traction, NVIDIA's
| advantage might not be as important.
|
| And I think the second approach has a strong chance of
| succeeding, given that two massive companies--Google
| (TPUs) and Amazon (Trainium)--are heavily investing in
| it.
|
| (PS: I'm also a bit biased towards approach 2; we build
| Llama 3 fine-tuning on TPU:
| https://github.com/felafax/felafax)
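|
| As a concrete illustration of the compiler-heavy path, here is
| a minimal JAX sketch (not the felafax code, just the general
| shape): jit hands the whole function to XLA, which fuses the
| matmul, bias add, and activation into device kernels with no
| hand-written CUDA or Triton.
|
|     import jax
|     import jax.numpy as jnp
|
|     @jax.jit                            # XLA compiles and fuses the whole function
|     def layer(params, x):
|         w, b = params
|         return jax.nn.gelu(x @ w + b)   # matmul + bias + activation, fused by XLA
|
|     key = jax.random.PRNGKey(0)
|     w = jax.random.normal(key, (512, 512))
|     b = jnp.zeros((512,))
|     x = jax.random.normal(key, (8, 512))
|     y = layer((w, b), x)                # same code targets CPU, GPU, or TPU backends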
| 01100011 wrote:
| It's weird to me that folks think NVDA is just sitting
| there, waiting for everyone to take their lunch. Yes, I'm
| totally sure NVDA is completely blind to competition and
| has chosen to sit on their cash rather than develop
| alternatives...</s>
| tada131 wrote:
| > If demand for training and inference wanes, these hardware
| investments will be almost entirely wasted
|
| Nvidia also needs to invent something then, as pumping mining
| (or giving goods to gamers) again is not sexy. What's next? Will
| we finally compute for drug development and achieve results just
| as great as with chatbots?
| talldayo wrote:
| > Nvidia also needs to invent something then, as pumping mining
| (or giving goods to gamers) again is not sexy.
|
| They do! Their research page is well worth checking out,
| they wrote a lot of the fundamental papers that people cite
| for machine learning today:
| https://research.nvidia.com/publications
|
| > Will we finally compute for drug development and achieve
| just as great results as with chatbots?
|
| Maybe - but they're not really analogous problem spaces.
| Fooling humans with text is easy - Markov chains have been
| doing it for decades. Automating the discovery of drugs and
| research of proteins is not quite so easy; rudimentary
| attempts like Folding@Home went on for years without any
| breakthrough discoveries. It's going to take a lot more
| research before we get to ChatGPT levels of success. But
| tools like CUDA certainly help with this by providing
| flexible compute that's easy to scale.
| dekhn wrote:
| There was nothing rudimentary about Folding@Home (either
| in the MD engine or the MSM clustering method), and my
| paper on GPCRs that used Folding@Home regularly gets
| cites from pharma (we helped establish the idea that
| treating proteins as being a single structure at the
| global energy minimum was too simplistic to design
| drugs). But F@H was never really a serious attempt at
| drug discovery- it was intended to probe the underlying
| physics of protein folding, which is tangentially
| related.
|
| In drug discovery, we'd love to be able to show that
| virtual screening really worked: if you could do docking
| against a protein to find good leads affordably, and also
| ensure that the resulting leads were likely to pass FDA
| review (i.e., effective and non-toxic), that could
| potentially greatly increase the rate of discovery.
| llm_nerd wrote:
| It seems this way, but we've been saying this for years and
| years. And somehow nvidia keeps making more and more.
|
| Isn't it telling when Google's release of an "AI" chip doesn't
| include a single reference to nvidia or its products? They're
| releasing it for general availability, for people to build
| solutions on, so it's pretty weird that there aren't comparisons
| to H100s et al. All of their comparisons are to their own prior
| generations, which you do if you're the leader (e.g. Apple does
| it with their chips), but it's a notable gap when you're a
| contender.
| jeffbee wrote:
| Google posted TPUv6 results for a few things on MLCommons in
| August. You can compare them to H100 over there, at least for
| inference on stable diffusion xl.
|
| Suspiciously there is a filter for "TPU-trillium" in the
| training results table, but no result using such an
| accelerator. Maybe results were there and later redacted, or
| have been embargoed.
| mlboss wrote:
| The biggest barrier for any Nvidia competitor is that hackers
| can run the models on their desktop. You don't need a cloud
| provider specific model to do stuff locally.
| r3trohack3r wrote:
| This. I suspect consumer brands focusing on consumer hardware
| are going to make a bigger dent in this space than cloud
| vendors.
|
| The future of AI is local, not remote.
| Hilift wrote:
| "we constantly strive to enhance the performance and efficiency
| of our Mamba and Jamba language models."
|
| ... "The growing importance of multi-step reasoning at inference
| time necessitates accelerators that can efficiently handle the
| increased computational demands."
|
| Unlike others, my main concern with AI is that any savings we got
| from converting petroleum generating plants to wind/solar were
| blasted away by AI power consumption months or even years ago.
| Maybe Microsoft is on to something with the TMI (Three Mile
| Island) revival.
| beepbooptheory wrote:
| This has been a constant thought for me as well. Like, the plan
| from what I can tell is that we are going to start spinning
| all this stuff up every single time someone searches something
| on google, or perhaps, when someone would _otherwise_ search
| something on there.
|
| Just that alone feels like an absolutely massive load to bear!
| But it's only a drop in the bucket compared to everything else
| around this stuff.
|
| But while I may be thirsty and hungry in the future, at least I
| will (maybe) be able to know how many rs are in "strawberry".
| r3trohack3r wrote:
| Energy availability at this point appears (to me at least) to
| be practically infinite, in the sense that it is technically
| finite but not by any definition of the word that is
| meaningful for Earth or humans at this stage.
|
| I don't see our current energy production scaling to meet the
| demands of AI. I see a lot of signals that most AI players feel
| the same. From where I'm sitting, AI is already accelerating
| energy generation to meet demand.
|
| If your goal is to convert the planet to clean energy, AI seems
| like one of the most effective engines for doing that. It's
| going to drive the development of new technologies (like small
| modular nuclear reactors), pushing down the cost of construction
| and ownership. I strongly suspect that, in 50 years, the new
| energy tech that AI drives development of will have rendered
| most of our current energy infrastructure worthless.
|
| We will abandon current forms of energy production not because
| they were "bad" but because they were rendered inefficient.
| peepeepoopoo95 wrote:
| Can we please pop this insane Nvidia valuation bubble now?
___________________________________________________________________
(page generated 2024-12-11 23:00 UTC)