[HN Gopher] TPU v4 provides exaFLOPS-scale ML with efficiency gains
___________________________________________________________________
TPU v4 provides exaFLOPS-scale ML with efficiency gains
Author : zekrioca
Score : 53 points
Date : 2023-04-05 20:52 UTC (2 hours ago)
(HTM) web link (cloud.google.com)
(TXT) w3m dump (cloud.google.com)
| reaperman wrote:
| I am very impressed with what Google has done for the state of
| machine learning infrastructure. I'm looking forward to future
| models based on OpenXLA, which can run across Nvidia, Apple
| Silicon, and Google's TPUs. The main thing limiting my use of
| TPUs is model compatibility. The TPU hardware is clearly the
| very best, just not always cost-effective for those of us who
| are starved of engineering hours. OpenXLA may fix this if it
| lives up to its promise.
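A minimal sketch of that portability story (assuming only stock JAX, which already lowers through OpenXLA's compiler): the same jitted function compiles to whichever backend JAX finds at runtime, with no device-specific code.

```python
import jax
import jax.numpy as jnp

# jax.jit traces the function once and hands it to XLA, which
# emits code for whatever accelerator is present (CPU, GPU, TPU).
@jax.jit
def predict(w, x):
    return jnp.tanh(x @ w)

w = jnp.ones((8, 2))
x = jnp.ones((4, 8))

print(predict(w, x).shape)  # (4, 2)
print(jax.devices())        # backend chosen at runtime
```

The same script runs unchanged on a CUDA box or a Cloud TPU VM; only the installed jaxlib differs.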
|
| That said, it's also incredible how fast things move in this
| space:
|
| > Midjourney, one of the leading text-to-image AI startups, have
| been using Cloud TPU v4 to train their state-of-the-art model,
| coincidentally also called "version four".
|
| Midjourney is already on v5 as of the date of publication of this
| press release.
| xiphias2 wrote:
| "Midjourney, one of the leading text-to-image AI startups, have
| been using Cloud TPU v4 to train their state-of-the-art model,
| coincidentally also called 'version four'."
|
| This sounds quite bad in a press release when Midjourney is at
| v5. Why did they move away?
| sebzim4500 wrote:
| Sounds like they are just out of date; it is possible MJv5 was
| also trained on TPUs.
| obblekk wrote:
| This is very impressive technology and engineering.
|
| However, I remain a bit skeptical of the business case for TPUs
| for 3 core reasons:
|
| 1) 100,000x lower unit production volume than GPUs means higher
| unit costs
|
| 2) Slow iteration cycle - these TPU v4 chips launched in 2020.
| Maybe Google publishes one generation behind, but that would
| still be a 2-3 year iteration cycle from v3 to v4.
|
| 3) Constant multiple advantage over GPUs - maybe a 5-10x compute
| advantage over off-the-shelf GPUs, and that number isn't
| increasing with each generation.
|
| It's cool to get that 5-10x performance over GPUs, but that's
| 4.5yrs of Moore's Law, and might already be offset today due to
| unit cost advantages.
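The Moore's-law comparison checks out as rough arithmetic (assuming the classic ~18-month doubling period, which a sibling comment disputes):

```python
import math

# Years of transistor scaling needed to match a given speedup,
# assuming density doubles every `doubling_years`.
def moores_law_years(speedup, doubling_years=1.5):
    return math.log2(speedup) * doubling_years

print(round(moores_law_years(5), 1))   # 3.5
print(round(moores_law_years(10), 1))  # 5.0
```

So a 5-10x advantage is worth roughly 3.5-5 years of scaling, consistent with the ~4.5 years cited.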
|
| If the TPU architecture did something to allow fundamentally
| faster transistor density scaling, its advantage over GPUs
| would increase each year and become unbeatable. But based on
| the TPU v3 to v4 perf improvement over 3 years, it doesn't seem
| so.
|
| Apple's competing approach seems a bit more promising from a
| business perspective. The M1's unified memory reduces the time
| cost of moving data and switching between CPU and GPU
| processing. This lets GPU advances continue scaling
| independently while lowering the user-experience cost of using
| GPUs.
|
| Apple's version also seems to scale from 8 GB to 128 GB of RAM,
| meaning the same fundamental process can be used at high
| volume, achieving a low unit cost.
|
| Are there other interesting hardware approaches for ML out
| there?
| sliken wrote:
| > 100000x lower unit production volume than GPUs means higher
| unit costs
|
| Two points. Nvidia's RTX 3000 series (3060 Ti, 3080, and many
| others) ships in six or more variants per generation. The
| underlying silicon has names like GA102, GA103, GA104, GA106,
| and GA107, so only about 1/6th of the consumer market for
| Nvidia silicon can be amortized over any single design.
|
| I wouldn't be at all surprised to see Google making the TPUs by
| the million. I found a vague reference to 9 exaflops and single
| facilities (one of many) costing $4 billion to $8 billion.
|
| So I wouldn't assume that the consumer GPU market/number of
| silicon designs is 100,000 times larger than the TPUv4 market.
|
| > Slow iteration cycle
|
| True. Then again, generations make much less difference than
| they used to. Gone are the days when average performance
| doubled, even across multiple generations. Sure, Nvidia's 4000
| series claims 2x ... on raytracing, but normal game performance
| seems to be more like 15%. Various trickery like DLSS helps,
| but similar tricks are increasing the performance of older
| cards as well. Similarly, Apple's A14 -> A15 -> A16 (or M1 ->
| M2 if you prefer) chips have had modest performance increases,
| mostly in perf/watt.
|
| > 4.5yrs of Moore's Law
|
| It's dead, Jim.
| pclmulqdq wrote:
| I believe that Nvidia uses the same chip from the A16 up to
| the A100, and maybe for some of the Quadro chips. That easily
| puts the unit count into the several millions.
|
| The picture in this article shows 8 racks, each with (according
| to the paper on arXiv) 16 TPU sleds of 4 TPUs per sled. That's
| only 512 chips. According to the paper, it is one of eight such
| groups in a 4096-chip supercomputer. If you give them 10-100 of
| those around the world, you get 40,000-400,000 chips. That's
| enough for reasonable scale, but Nvidia should still have 100x
| (or more) their scale.
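The chip-count arithmetic above can be written out explicitly (the rack/sled figures are the comment's reading of the paper, not independently verified):

```python
# One pictured group: 8 racks x 16 sleds x 4 TPUs per sled.
racks, sleds_per_rack, tpus_per_sled = 8, 16, 4
chips_pictured = racks * sleds_per_rack * tpus_per_sled
print(chips_pictured)  # 512

pod_size = 4096  # one TPU v4 supercomputer, per the paper
print(pod_size // chips_pictured)  # 8 such groups per pod

# If Google runs 10-100 pods worldwide:
print(10 * pod_size, 100 * pod_size)  # 40960 409600 chips
```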
| kccqzy wrote:
| > Are there other interesting hardware approaches for ML out
| > there?
|
| Google also has Coral, which is a non-cloud mobile-focused TPU
| that you can buy and plug in (USB or PCIe).
|
| https://coral.ai/products/
| 1MachineElf wrote:
| Just received my mSATA Coral TPU in the mail. I'd been
| waiting 11 months for it after backordering on Digikey.
| Perhaps this speaks to the parent comment's concerns over
| unit volume and iteration cycle? Hopefully that will improve
| in the future and modules like these will become widespread.
| cubefox wrote:
| > If the TPU architecture did something to allow fundamentally
| > faster transistor density scaling, its advantage over GPUs
| > would increase each year and become unbeatable.
|
| It is completely unreasonable to expect something like that.
| sebzim4500 wrote:
| >100000x lower unit production volume than GPUs
|
| This is obviously an exaggeration; I wonder what the actual
| ratio is between TPU production and e.g. A100 production.
| summerlight wrote:
| Probably closer to 1000x? I see a fairly large number of TPU
| pods these days, and I don't think the A100 is as prevalent as
| high-end consumer GPUs, which are typically measured in
| millions, not billions.
| tinco wrote:
| They're so non-confrontational. Their performance comparisons
| are against "CPU". Just come out and say it, even if it's not
| apples to apples. If the 3D-torus interconnect is so much
| better, just say how it compares to Nvidia's latest and
| greatest. It's cool that Midjourney committed to building on
| TPU, but I have a hard time betting my company on a technology
| that's so guarded that they won't even post a benchmark against
| their main competitor.
| jeffbee wrote:
| The paper compares it to the A100.
|
| https://arxiv.org/pdf/2304.01433.pdf
| KeplerBoy wrote:
| I refuse to care about them until they sell them on PCIe cards.
|
| The lock-in is bad enough when dealing with niche hardware
| on-prem; I certainly won't deal with niche hardware in the
| cloud.
| TradingPlaces wrote:
| If Google can't become the king of AI cloud training, they should
| all just quit.
___________________________________________________________________
(page generated 2023-04-05 23:00 UTC)