[HN Gopher] Cerebras' new monster AI chip adds 1.4T transistors
___________________________________________________________________
Cerebras' new monster AI chip adds 1.4T transistors
Author : Anon84
Score : 109 points
Date : 2021-04-22 19:37 UTC (3 hours ago)
(HTM) web link (spectrum.ieee.org)
(TXT) w3m dump (spectrum.ieee.org)
| AlexCoventry wrote:
| Imagine a beowulf cluster of these.
| faichai wrote:
| This made me chuckle!
| xhrpost wrote:
| How can the chip itself consume that kind of power? Or is the
| 15 kW value for the entire unit? That's like 10 residential space
| heaters all turned to max. I'm surprised that much heat could be
| dissipated over such a small surface area. Does it use
| refrigerant for cooling? If my math is correct, if you had a
| 6500BTU window air conditioner, you'd need 8 of them to move the
| heat from this chip.
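| The conversion, as a quick Python sanity check (assuming the full
| 15 kW figure is all dissipated as heat):
|
|   heat_btu_per_hr = 15_000 * 3.412     # 1 W ~= 3.412 BTU/hr -> ~51,180
|   ac_units = heat_btu_per_hr / 6500    # one 6500 BTU window unit
|   print(round(ac_units, 1))            # ~7.9, so 8 air conditioners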
| philjohn wrote:
| Well, a single 7nm CPU (AMD Ryzen) can pull down 95-odd watts
| at peak. Granted, the IO die is a fair bit of that since it's on
| an older process, but if you extrapolate that out across a
| giant wafer, 15 kW is "only" about 157 Ryzen 7s.
| YetAnotherNick wrote:
| It would melt if you just used air cooling. I did some math for
| liquid nitrogen, which has a heat of vaporization of 200 kJ/kg.
| It would boil 3600*15/200 = 270 kg of nitrogen every hour. Just
| insane.
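| Same arithmetic spelled out (assuming 15 kW and ~200 kJ/kg heat of
| vaporization for liquid nitrogen):
|
|   power_kj_per_s = 15                  # 15 kW = 15 kJ/s
|   hvap_kj_per_kg = 200                 # liquid nitrogen
|   kg_per_hour = power_kj_per_s * 3600 / hvap_kj_per_kg
|   print(kg_per_hour)                   # 270.0 kg of N2 boiled per hour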
| terafo wrote:
| Is there any information on how this thing performs on actual AI
| workloads? It is priced similarly to a dozen DGX A100s, but is
| it faster on something like training big transformer models (such
| as CLIP or GPT-3)?
| 01100011 wrote:
| This thing is awesome and, as someone working for a competitor,
| kind of scary. I applaud their approach though. I think we're a
| couple years off from it, but we'll probably see wider adoption
| of larger silicon, with more specialized functional units, which
| are used with a lower duty cycle to manage heat. If nothing else,
| they're probably developing some good IP and techniques to handle
| other sorts of ultra-mega-insane-scale-integration.
|
| I wonder what their software stack looks like. Can they support
| the sort of virtualization and sharing you'd want to keep this
| expensive beast fully utilized 24/7?
| ohazi wrote:
| LSI, VLSI, UMISI
|
| I like it!
|
| In previous articles they've gone into some detail about how
| they deal with reticle limits, jumping over the scribe line
| area, and other such tricks. Between that, chiplets, HBM-style die
| stacks, etc... the developments here have been more interesting
| than I expected.
| KETpXDDzR wrote:
| I think I once saw one of the founders with a wafer at an In-N-
| Out with a potential investor. Looking at what Apple achieved
| with their M1 and the demand for "AI" - or training neural
| networks, which is what it really is - they have a lot of
| potential. At least as long as the AI bubble doesn't burst.
| arisAlexis wrote:
| How can the future burst? It's like saying medicine will burst
| or physics
| foobiekr wrote:
| Personalized medicine, as an example, absolutely burst.
| mattkrause wrote:
| Same way as the last time: rampant over-promising and under-
| delivering, maybe catalyzed by some high profile mishaps.
|
| Bubbles popping aren't always verdicts on the objective
| quality of something; they can just be about its assumed
| value _relative to other plausible options._ Homes mostly
| don't become uninhabitable when a real estate bubble pops.
| lainga wrote:
| I don't know, expert systems haven't been so hot lately...
| arisAlexis wrote:
| It's very surprising what people with downvote power do, but
| it's *** to downvote anyone that disagrees. Anyway, expert
| systems were a precursor to the technology, and they got
| replaced much like new physics replaces old physics.
| ASalazarMX wrote:
| Disagreeing doesn't mean you're right. AI is not
| equivalent to the future, and it will burst if it stalls
| and another AI winter cools the current hype cycle.
| jefft255 wrote:
| It's because you made a false analogy. AI isn't literally
| "the future". Billions of $ are being invested in deep-
| learning focused AI right now (which you call "the
| future"), and yes it could be a bubble and it could
| burst. You can disagree, but it's still a sensible thing
| to predict.
| arisAlexis wrote:
| By bursting you mean humanity will never create
| artificial intelligence? Or you mean that there will be a
| cool off period as for example what happened with quantum
| physics at some point? Because it sure looks to me that
| there is no future without AI regardless of cool off
| periods. That makes my statement true. If you think
| humanity will never progress from where we are now then
| we pretty much are on very opposite schools of thought.
| etaioinshrdlu wrote:
| How much memory is on the chip, and what kind is it?
|
| Under what circumstances does the chip need to access external
| memory?
|
| What type of communication interfaces does this chip have?
|
| Also, if the chip is the size of a wafer, is it appropriate to
| call it a Chip?
| verdverm wrote:
| Tom's Hardware has some nice tables comparing the specs:
| https://www.tomshardware.com/news/cerebras-wafer-scale-engin...
|
| (more than the IEEE)
|
| This thing pulls 15-20kW of juice!
| meepmorp wrote:
| > This thing pulls 15-20kW of juice!
|
| If you look at the wafer she's holding at the top, it's
| seemingly segmented into a 12x7 grid of roughly chip-sized
| rectangles. That's 84 "CPUs" at 200-240 watts each, which is
| pretty well in line with discrete server CPUs.
|
| The amount of heat coming off this thing must be amazing,
| though.
| teruakohatu wrote:
| I did a double take when I realised that was kilowatts, not
| watts. This chip uses more energy in an hour than the average
| household (in my country at least) does in a day.
|
| It may be a very large wafer but dissipating that heat is
| still very impressive.
| kllrnohj wrote:
| It sounds like a lot but it almost isn't? Like this is ~50x
| bigger than an Nvidia A100, and the A100 pulls up to 400w.
| 50 * 400 ~= 20kW. So in terms of thermal density it's in-
| line with existing GPUs.
|
| That said, I'd be fascinated to see the cooling solution.
| Is it just a _massive_ copper heatsink & a boatload of
| airflow? Typical approaches of using heatpipes to expand
| the heatsink won't really work with something this big
| after all. Or is it a massive waterblock with multiple
| inlets/outlets so it can hit up a stack of radiators? How
| do they get even mounting pressure across that large of an
| area?
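| A rough power-density check in Python, using the commonly quoted
| die areas (~826 mm^2 for an A100, ~46,225 mm^2 for the WSE-2):
|
|   a100_w_per_mm2 = 400 / 826           # ~0.48 W/mm^2
|   wse2_w_per_mm2 = 20_000 / 46_225     # ~0.43 W/mm^2
|   print(a100_w_per_mm2, wse2_w_per_mm2)
|
| So per square millimetre it really is GPU-like; the novelty is
| pulling all of that heat out of one contiguous slab of silicon.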
| mvanaltvorst wrote:
| There's an image on their website[1], pretty huge water
| pumps.
|
| "To solve the 70-year-old problem of wafer-scale, we
| needed not only to yield a big chip, but to invent new
| mechanisms for powering, packaging, and cooling it.
|
| The traditional method of powering a chip from its edges
| creates too much dissipation at a large chip's center. To
| prevent this, CS-2's innovative design delivers power
| perpendicularly to each core.
|
| To uniformly cool the entire wafer, pumps inside CS-2
| move water across the back of the WSE-2, then into a heat
| exchanger where the internal water is cooled by either
| cold datacenter water or air."
|
| [1]: https://cerebras.net/product/
| typon wrote:
| 40 GB of SRAM. Not quite big enough to fit big models like
| GPT-3.
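| Back-of-envelope, counting weights only in fp16 and ignoring
| optimizer state and activations:
|
|   gpt3_params = 175e9
|   bytes_per_param = 2                  # fp16
|   weights_gb = gpt3_params * bytes_per_param / 1e9
|   print(weights_gb)                    # ~350 GB vs 40 GB of on-wafer SRAM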
| etaioinshrdlu wrote:
| It seems like you typically want a balance of memory
| size/bandwidth to compute ratio for typical deep learning
| applications.
|
| The 40GB of SRAM probably has tremendous bandwidth (it could
| all be updated every few cycles!), but the memory size is
| very small compared to the amount of compute available.
|
| However, maybe a different way of looking at it is that this
| chip will allow the training steps of deep learning models to
| take a fraction of the time they take on a GPU. Perhaps what
| takes 1s on a GPU could take 10ms on this chip.
|
| So, this product may be effective at making training happen
| very fast, but without substantial model size or efficiency
| gains.
|
| That's still groundbreaking -- you can't achieve this result
| on GPUs. You can't achieve this result by any parallelization
| or distributed training, either. The large batch sizes in
| distributed training do not result in the same model or one
| that generalizes as well.
| claytonius wrote:
| I don't think it's straightforward to do a head to head
| comparison.
|
| from: https://www.youtube.com/watch?v=yso2S2Svdlg
|
| @ 25:14
|
| James Wang: "If a model doesn't fit into a GPU's HBM, is it
| smaller when it's laid out in the Cerebras way relative to
| your 18 gigabytes?"
|
| Andrew Feldman: "It is -- it's smaller in that we hold
| different things in memory than they do. One can imagine a
| model that has more parameters than we can hold -- one can
| posit one, but remember our memory is doing different things.
| Our memory is basically holding parameters. That's not what
| their memory is doing. Their memory is holding the shape of
| the model, their model is holding the results of the batches.
| We use memory rather differently. We haven't found models
| that we can't place and train on a chip. We expect them to
| emerge, that's why we support clustering of chips and
| systems, that's why we do that in what's called a "model
| parallel" way, where if you put two chips together you get
| twice the memory capacity. That's not what you get when you
| put multiple GPUs together. When you put multiple GPUs
| together you get two versions of the same amount of memory,
| you actually don't get twice the memory. I see you smiling
| here because you know that's a problem... ...With us if we
| support 4 billion parameters and you add a second wafer scale
| engine, now you support 8 billion parameters, and if you add
| a third you can support 12 billion. That's not the way it
| works with GPUs. With GPUs you just support two chips, each
| with a few million - tens of millions of parameters."
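| The scaling claim in plain Python, under the simplified assumption
| that model-parallel devices pool their parameter memory while
| data-parallel replicas each hold a full copy:
|
|   def model_parallel_capacity(params_per_device, n_devices):
|       # parameters are sharded across devices, so capacity adds up
|       return params_per_device * n_devices
|
|   def data_parallel_capacity(params_per_device, n_devices):
|       # every device holds the same full copy, so capacity is flat
|       return params_per_device
|
|   print(model_parallel_capacity(4e9, 3))   # 12e9, the "12 billion" above
|   print(data_parallel_capacity(4e9, 3))    # still 4e9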
| frogblast wrote:
| Are there any good resources out there describing in
| practice how existing training workloads are distributed
| among GPUs? (using tensorflow, pytorch, or whatever else?).
|
| I'm curious how the problem effectively gets sliced.
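| The most common scheme is plain data parallelism: every GPU holds a
| full copy of the model, the batch is split, and gradients are
| all-reduced. In PyTorch that's roughly (a minimal sketch, assuming
| a multi-GPU box launched with "torchrun --nproc_per_node=N train.py"):
|
|   import torch
|   import torch.distributed as dist
|   from torch.nn.parallel import DistributedDataParallel as DDP
|
|   dist.init_process_group(backend="nccl")   # one process per GPU
|   rank = dist.get_rank()
|   model = torch.nn.Linear(1024, 1024).to(rank)
|   model = DDP(model, device_ids=[rank])
|   # each rank feeds its own slice of the batch; backward() triggers an
|   # all-reduce so every replica ends up with identical gradients
|
| For models that don't fit on a single GPU, the Megatron-LM and
| GPipe papers (and DeepSpeed's ZeRO) describe how the layers and
| parameters themselves get sliced.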
| MuffinFlavored wrote:
| SRAM (static RAM) vs DRAM (dynamic RAM) for anybody else
| curious: https://computer.howstuffworks.com/question452.htm
| IshKebab wrote:
| It's 40 GB of SRAM. I doubt it supports external memory.
|
| > Also, if the chip is the size of a wafer, is it appropriate
| to call it a Chip?
|
| Good question. I think it is. I mean the word "chip" isn't
| really that well defined (is HBM one chip?), but given that
| they sell it as a single unit and you can't really cut it in
| half I think it's one chip.
| NaturalPhallacy wrote:
| Looking at the picture, the thing is a platter. Not a chip.
|
| It's a really cool picture too.
| fredfoobar wrote:
| Time for an AI winter I guess.
| Logon90 wrote:
| Just from the headline you can see the irrelevance of this
| chip. Who talks about transistor count as a proxy for
| performance?
| streetcat1 wrote:
| You do realize that AI has crossed human expert performance on
| NLP / vision tasks?
| semi-extrinsic wrote:
| Exactly how does one outperform a human expert in natural
| language processing?
| zetazzed wrote:
| See, if you were an AI, you would understand EXACTLY what
| the poster means by this.
| twic wrote:
| Maybe the gibberish GPT-3 spits out is actually true, and
| our puny monke brains are just too weak to understand it.
| pulse7 wrote:
| Parent probably meant that it outperformed a human expert
| in some specific task in the area of natural language
| processing - for example the task of converting a spoken
| language into a written language...
| PeterisP wrote:
| They don't really outperform human experts on real tasks
| yet (no matter what some superGLUE or other benchmark
| shows); but in general, once a system can solve a
| particular task well, it would be plausible to outperform
| human experts simply by not making random errors.
|
| If we have multiple human experts annotate a NLP task and
| measure inter-annotator agreement, it will be far from
| 100%; part of that will be genuine disagreements or fuzzy
| gray area, but part of the identified differences will be
| simply obviously wrong answers given by the experts -
| everyone makes mistakes. The same applies for many other
| domains - business process automation, data entry, etc; no
| employee will produce error-free output in a manual
| process, no matter how simple and unambiguous the task is.
|
| And for simpler tasks the computer can easily make fewer
| mistakes than a human - especially if you measure the human
| reliability not for a few minutes of focus, but for a whole
| tedious working day.
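| Inter-annotator agreement is usually reported with something like
| Cohen's kappa, which corrects raw agreement for chance; a minimal
| version for two annotators with categorical labels:
|
|   from collections import Counter
|
|   def cohens_kappa(a, b):
|       n = len(a)
|       p_o = sum(x == y for x, y in zip(a, b)) / n        # observed
|       fa, fb = Counter(a), Counter(b)
|       p_e = sum(fa[l] * fb[l] for l in fa) / (n * n)     # chance
|       return (p_o - p_e) / (1 - p_e)
|
|   # two "experts" labelling ten items, one slip on the last item
|   print(cohens_kappa(list("AABBABABAB"), list("AABBABABAA")))   # 0.8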
| firebaze wrote:
| I'm not sure if you're assuming best intentions. If you do,
| it'd be nice to provide sources - to my knowledge, vision
| under non-optimal conditions (rain, snow, sunlight ahead) is
| only partially solved by resorting to sensors resistant to
| the disturbance.
|
| I'd be glad to learn I'm wrong.
| king_magic wrote:
| No, it hasn't. Not even close.
|
| AI has "crossed human expert performance" on _extremely
| narrow_ NLP /CV tasks.
|
| AI is still light years away from human-level performance.
| anthk wrote:
| GPT-3 is not even close to a minimal, coherent text
| adventure made with Inform 6 by a novice, even one who is
| a non-native English speaker.
|
| These networks can't even match "Detective", a crappy story
| written by a 12-year-old.
| trhway wrote:
| >Time for an AI winter
|
| with AI chips burning 15 kW? No chance for the winter in sight.
| Some chances for AI hell though.
| zitterbewegung wrote:
| Much better article from anandtech at
| https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...
| XCSme wrote:
| I laughed at the "arm+leg" official cost of the CPU.
|
| EDIT: Or GPU, or whatever it is.
| wyxuan wrote:
| Here's a video made by the author, Ian Cutress, which goes
| into more detail as well.
|
| https://www.youtube.com/watch?v=FNd94_XaVlY
| smithza wrote:
| This is very cool. The explanation of how they handle
| yield/defects was interesting: they can route around defective
| cores and account for the statistically expected defects,
| allowing them to claim a 100% yield.
| sillysaurusx wrote:
| I'm bearish on new hardware for AI training. The most important
| thing is the software stack, and thus far everyone has failed to
| support pytorch in a drop-in way.
|
| The philosophy here seems to be "if we build it, they'll buy it."
| But suppose you wanted to train a gpt model with this specialized
| hardware. That means you're looking at two months of R&D minimum
| to get everything rewritten, running, tested, trained, and with
| an inferencing pipeline to generate samples.
|
| And that's _just_ for gpt -- you lose all the other libraries
| people have written. This matters more in GAN training, since for
| example you can find someone else's FID implementation and drop
| it in without too much hassle. But with this specialized chip,
| you'd have to write it from scratch.
|
| We had a similar situation in gamedev circa 2003-2009.
| Practically every year there was a new GPU, which boasted similar
| architectural improvements. But, for all its flaws, GL made these
| improvements "drop-in" --- just opt in to the new extension, and
| keep writing your gl code as you have been.
|
| Ditto for direct3d, except they took the attitude of "limit to a
| specific API, not arbitrary extensions." (Pixel shader 2.0 was an
| awesome upgrade from 1.1.)
|
| AI has no such standards, and it hurts. The M1 GPU in my new Air
| is supposedly ready to do AI training. Imagine my surprise when I
| loaded up tensorflow and saw that it doesn't support any GPU
| devices whatsoever. They seem to transparently rewrite the cpu
| ops to run on the gpu automatically, which isn't the expected
| behavior.
|
| So I dig into Apple's actual api for doing training, and holy
| cow, that looks miserable to write in swift. I like how much
| control it gives you over allocation patterns, but I can't
| imagine trying to do serious work in it on a daily basis.
|
| What we need is a unified API that can easily support multiple
| backends -- something like "pytorch, but just enough pytorch to
| trick everybody" since supporting the full api seems to be beyond
| hardware vendors' capabilities at the moment. (Lookin' at you,
| google. Love ya though.)
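| The closest thing today is writing model code against torch.device
| and keeping the backend choice to a single line, roughly like this
| sketch (vendor backends would have to slot in at this level to be
| truly drop-in; the "mps" branch only exists in newer PyTorch
| builds):
|
|   import torch
|
|   def pick_device():
|       if torch.cuda.is_available():
|           return torch.device("cuda")
|       mps = getattr(torch.backends, "mps", None)
|       if mps is not None and mps.is_available():
|           return torch.device("mps")
|       return torch.device("cpu")
|
|   device = pick_device()
|   model = torch.nn.Linear(512, 512).to(device)
|   x = torch.randn(8, 512, device=device)
|   y = model(x)   # identical model code regardless of which backend won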
| whimsicalism wrote:
| I'm on board with you that there should be a "drop-in" cross
| support of these chips, but pytorch is at a way higher
| abstraction level than what should be commonly supported.
| socialdemocrat wrote:
| Maybe people will come to their senses and switch to Julia
| instead of having to waste all this time on Python bindings.
| habibur wrote:
| Python here basically works as a binder for C, which is what
| everything is written in.
| zucker42 wrote:
| > The philosophy here seems to be "if we build it, they'll buy
| it."
|
| Supposedly Cerebras is already profitable, so it's hardly a
| situation where they are building something and hoping people
| buy it eventually.
|
| > That means you're looking at two months of R&D minimum to get
| everything rewritten, running, tested, trained, and with an
| inferencing pipeline to generate samples.
|
| Again, based on the company's representations, Cerebras
| transparently supports PyTorch and TensorFlow, only requiring a
| few lines of changed code.
|
| Source: https://www.anandtech.com/show/16626/cerebras-unveils-
| wafer-... (Dr. Cutress's video on TechTechPotato is also good).
| michelpp wrote:
| This thing needs a GraphBLAS[1] implementation yesterday. Graphs
| with 100 billion edges and up are the new norm. This monster could
| smoke the competition if the implementation were tuned right!
|
| [1] http://graphblas.org
| blueyes wrote:
| Creating the ecosystem of both software and adjacent hardware
| for wafers this size is the real challenge for a company like
| Cerebras (which is doing amazing work). At first, they thought
| they just needed to make a chip 56x the size of its
| predecessor, and somehow get around the issue of defects and
| yield. After they solved those problems (which blocked Gene
| Amdahl, among others), they found they needed to bring an
| entire ecosystem into being to work with their hardware.
| michelpp wrote:
| Agreed, that's why I think the GraphBLAS would be such a
| great fit for this hardware. The ecosystem is growing pretty
| fast. There are, for example, Python bindings: you could do a
| sparse 'A @ B' over millions of elements in parallel on this
| wafer-chip. MATLAB 2021a now has GraphBLAS built in, so you
| could drive this thing directly from your notebooks.
|
| I'm sure there's a compiler and low-level primitives to
| really get the maximum performance out of it, but the trade-
| off may be worth it in many cases to work through an
| abstraction like the linear-algebra approach.
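| The GraphBLAS idea in miniature -- graph traversal as sparse linear
| algebra -- illustrated here with scipy.sparse rather than a real
| GraphBLAS binding:
|
|   import numpy as np
|   from scipy.sparse import csr_matrix
|
|   # tiny directed graph: 0->1, 0->2, 1->3, 2->3
|   rows, cols = np.array([0, 0, 1, 2]), np.array([1, 2, 3, 3])
|   A = csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))
|
|   frontier = np.zeros(4); frontier[0] = 1       # start BFS at vertex 0
|   level1 = (A.T @ frontier) > 0                 # out-neighbours -> {1, 2}
|   level2 = (A.T @ level1.astype(float)) > 0     # next level     -> {3}
|   print(np.flatnonzero(level1), np.flatnonzero(level2))
|
| A real GraphBLAS runtime does the same thing over arbitrary
| semirings, which is exactly the kind of sparse, message-passing
| workload a wafer full of cores with local SRAM should eat up.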
| systemvoltage wrote:
| The most interesting aspect of wafer-scale manufacturing is yield.
| Even if we have 95% _chip_ yield, as the chip size approaches
| wafer-level dimensions, I don't know off the top of my head what
| the math would be, but yield is going to plummet drastically. My
| guess is that they're handling this in the chip logic: building
| resiliency by turning off cells in the wafer that didn't yield.
| That begs the question, how are they probing them? A probe card
| the size of the wafer is unheard of. How are they running
| validation? Pretty mind-blowing to say the least!
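| The usual back-of-envelope is a Poisson defect model; with an
| assumed defect density it shows both why a monolithic wafer-sized
| die is hopeless and why per-core redundancy fixes it:
|
|   import math
|
|   D0 = 0.1                        # assumed defect density, defects/cm^2
|   wafer_area_cm2 = 46_225 / 100   # WSE-2 is ~46,225 mm^2
|
|   monolithic_yield = math.exp(-D0 * wafer_area_cm2)
|   expected_defects = D0 * wafer_area_cm2
|   print(monolithic_yield)         # effectively zero (~1e-20)
|   print(expected_defects)         # ~46 defects expected per wafer
|
|   # but with ~850,000 cores, mapping each defect to a dead core means
|   # only a tiny fraction of spares is needed
|   print(expected_defects / 850_000)   # ~0.005% of cores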
| Zardoz84 wrote:
| It's the old Wafer Scale Integration idea again, but now with a
| successful product:
| https://en.wikipedia.org/wiki/Wafer-scale_integration
| pradn wrote:
| The first Cerebras Wafer Scale Engine used "... breakthrough
| techniques in cross-reticle patterning, but with the level of
| redundancy built into the design, ensured a yield of 100%,
| every time." I'm unsure what to think of this.
|
| "When we spoke to Cerebras last year, the company stated that
| they already had orders in the 'strong double digits'."
|
| And they cost $2-2.5 million each!
|
| https://www.anandtech.com/show/15838/cerebras-wafer-scale-en...
| robocat wrote:
| "the price has risen from ~$2-3 million to 'several'
| million".
|
| "The CEO Andrew Feldman tells me that as a company they are
| already profitable, with dozens of customers already with
| CS-1 deployed and a number more already trialling CS-2
| remotely as they bring up the commercial systems".
|
| Quotes from https://www.anandtech.com/show/16626/cerebras-
| unveils-wafer-...
| belval wrote:
| Not an expert in chip manufacturing but my guess is that they
| just disable the parts that don't work and their big numbers
| represent ~80% of the actual number of transistors in the wafer
| because they account for that manufacturing loss.
| KETpXDDzR wrote:
| Correct. You already see the same with modern CPUs and GPUs:
| you just disable the defective parts. Obviously, that sounds way
| easier than it is.
| vkazanov wrote:
| Disabling parts of the chip? The secret sauce then is to make
| it defect-proof
| gsnedders wrote:
| Yup. This. From
| https://www.anandtech.com/show/16626/cerebras-unveils-
| wafer-...:
|
| > Cerebras achieves 100% yield by designing a system in which
| any manufacturing defect can be bypassed - initially Cerebras
| had 1.5% extra cores to allow for defects, but we've since
| been told this was way too much as TSMC's process is so
| mature.
| baybal2 wrote:
| Yep, not all semiconductor defects can be repaired, and at
| such scale repair circuits themselves will need to be
| redundant.
| [deleted]
| bob1029 wrote:
| They probably aren't bothering to. The extreme economics for
| producing this type of chip are likely acceptable to the
| stakeholders.
|
| Also, there is no reason they can't have some redundancy
| throughout the design so you can fuse off bad parts. It all
| really depends on the nature of the anticipated vs actual
| defects, which is an extraordinarily deep rabbit hole to climb
| into.
| Pet_Ant wrote:
| How does this fusing work? I assume there are a bunch of
| wires that are either hot or ground, and that determines
| whether part of the chip gets run?
| mechagodzilla wrote:
| It might actually use a traditional fuse block, where at
| some point in the packaging/testing process, you literally
| apply a sufficiently high voltage that you can permanently
| 'set' some part of it (whether that's actually melting a
| tiny wire, I'm not sure). But that's basically just
| programming a ROM that gets read in at boot time, and sets
| a bunch of logic on the chip to route around the bad parts.
| You could just use an external EEPROM to track that info
| too, and it would basically work the same.
| PeterisP wrote:
| I believe you design special fuses on the chip that can be
| "blown" with a laser after testing and before putting the
| silicon in the protective packaging.
| p_j_w wrote:
| Fuses aren't blown with a laser, it's purely electrical.
| You apply a sufficiently high voltage to some port on the
| part from a source that can drive a high enough current
| and then tell the digital block of the chip which address
| to fuse and if you want it high or low. Repeat for the
| entire fuse bank and you're done.
|
| https://en.wikipedia.org/wiki/Efuse
| systemvoltage wrote:
| I wonder how much this "chip" costs!
| aryonoco wrote:
| According to Anandtech, an arm + leg. Also known as several
| million.
| meepmorp wrote:
| The previous generation was the $2-3 million range, and
| these are now in the neighborhood of "several million". Or,
| arm+leg.
| LASR wrote:
| Products like these are often sold as part of a contract to
| deliver complete solutions + support + maintenance over
| some number of years.
|
| It's hard to estimate a per-unit cost, but suffice it to say
| it would cost about the same as other datacenter compute
| solutions on a performance-per-dollar basis.
| baybal2 wrote:
| I bet they just probe the chip:
|
| 1. piece by piece
|
| 2. with on-die test circuits
| seniorivn wrote:
| They exclude/disable cores with issues, so as long as the
| infrastructure parts of the chip are not affected, the chip is
| functional.
| zetazzed wrote:
| I wonder what the dev story looks like for these? I know they say
| "just use TF/Pytorch" but surely developers need to actually test
| stuff out on silicon and run CI on code... do they offer a baby
| version for testing?
| noobydoobydoo wrote:
| What does the programming model look like for one of these (like
| at the assembly level)? I'm not even sure what to google.
| cpr wrote:
| Interesting--the placement of code-to-processor so that things
| will be done roughly at the same time sounds a lot like the VLIW
| compiler problem of scheduling execution units so things are
| available at exactly the right time, without hardware interlocks.
| bhewes wrote:
| This thing reminds me of Daniel Hillis's "The Connection
| Machine". Just 35 years later.
| syntaxing wrote:
| Super curious, how much do one of these cost? Like 10K,100K, or
| 1M range?
| lastrajput wrote:
| Who said Moore's law is dead?
| miohtama wrote:
| It's not dead, but melting.
| LASR wrote:
| This is an interesting approach. Are there any
| benchmark/performance indicators for these wafer-scale chips?
| tandr wrote:
| How do they do heat management and dissipation on such a big
| wafer? I can imagine different parts heating up differently,
| putting mechanical strain on the wafer and leading to cracks.
| punnerud wrote:
| A large portion of the article answers most of this.
| tandr wrote:
| No, it does not. There is a paragraph _talking_ about it, but
| still not much actual info.
| NortySpock wrote:
| "Carefully."
|
| The article mentions a year of engineering went into dealing
| with the entire wafer thermally expanding under load.
| addaon wrote:
| If you look at the die shots (wafer shots?), you'll notice
| small holes a few millimeters across spaced roughly at reticle
| spacing. Those are drilled holes to allow through-wafer liquid
| cooling. With liquid cooling not just around but through the
| wafer, the temperature differential is minimized.
| durst wrote:
| Does anyone here have firsthand experience using the compiler?
| Can you give a rough approximation of performance tuning with the
| compiler compared to performance tuning with compilers targeting
| the tensor cores on an A100 or a TPU?
___________________________________________________________________
(page generated 2021-04-22 23:00 UTC)