[HN Gopher] 4T transistors, one giant chip (Cerebras WSE-3) [video]
___________________________________________________________________
4T transistors, one giant chip (Cerebras WSE-3) [video]
Author : asdfasdf1
Score : 96 points
Date : 2024-03-13 16:13 UTC (6 hours ago)
(HTM) web link (www.youtube.com)
(TXT) w3m dump (www.youtube.com)
| asdfasdf1 wrote:
| https://www.cerebras.net/press-release/cerebras-announces-th...
|
| https://www.cerebras.net/product-chip/
| ortusdux wrote:
| Not trying to sound critical, but is there a reason to use 4B,000
| vs 4T?
| geph2021 wrote:
| original title is:
|
| "4,000,000,000,000 Transistors, One Giant Chip (Cerebras
| WSE-3)"
|
| So I guess they're trying to stay true to it.
| ortusdux wrote:
| There are times where deviating from normal conventions makes
| sense. The average consumer might not know that 1Gb/s is
| faster than 750Mb/s. That being said, I don't think I've ever
| seen anything along the lines of 1G,000b/s.
| leptons wrote:
| Made you click!
| bee_rider wrote:
| It looks like it has now been switched to just have the number.
| I wonder if there was just some auto-formatting error.
| wincy wrote:
| My guess is the titles get auto-adjusted by Hacker News, but
| the script that does it doesn't have logic for a trillion and
| only goes up to a billion, hence the weirdness of the string
| match and replace.
| gosub100 wrote:
| at least a trillion means a trillion. Unlike the "tebi vs
| tera"-byte marketing-speak in storage and ram.
| yalok wrote:
| because Billion is ambiguous -
|
| Quote:
|
| Billion is a word for a large number, and it has two distinct
| definitions:
|
| 1,000,000,000, i.e. one thousand million, or 10^9 (ten to the
| ninth power), as defined on the short scale. This is now the
| most common sense of the word in all varieties of English; it
| has long been established in American English and has since
| become common in Britain and other English-speaking countries
| as well.
|
| 1,000,000,000,000, i.e. one million million, or 10^12 (ten to
| the twelfth power), as defined on the long scale. This number
| is the historical sense of the word and remains the established
| sense of the word in other European languages. Though displaced
| by the short scale definition relatively early in US English,
| it remained the most common sense of the word in Britain until
| the 1950s and still remains in occasional use there.
|
| https://en.wikipedia.org/wiki/Billion
| AdamH12113 wrote:
| Title should be either "4,000,000,000,000 Transistors" (as in the
| actual video title) or "4 Trillion Transistors" or maybe "4T
| Transistors". "4B,000" ("four billion thousand"?) looks like
| 48,000 (forty-eight thousand).
| brucethemoose2 wrote:
| Reposting the CS-2 teardown in case anyone missed it. The thermal
| and electrical engineering is absolutely nuts:
|
| https://vimeo.com/853557623
|
| https://web.archive.org/web/20230812020202/https://www.youtu...
|
| (Vimeo/Archive because the original video was taken down from
| YouTube)
| bitwrangler wrote:
| 20,000 amps
|
| 200,000 electrical contacts
|
| 850,000 cores
|
| and that's the "old" one. wow.
| Keyframe wrote:
| This is something I'm clueless about and can't really
| understand. They say this is 24kW hungry. How does CPU power
| consumption really work at the electrical level? What warrants
| that much power, even for regular CPUs? Like, at a basic
| level... is it the resistance of the material combined with the
| frequency of switching, or what is really going on there?
| Where does the power go on such a relatively small surface?
|
| edit: thanks people, makes sense now!
| danbruc wrote:
| Yes, the power consumption comes from the resistance of the
| circuit. In case of CMOS circuits there would ideally be no
| current flow when no signal changes but transistors are not
| perfect and have leakage currents. When signals change,
| primarily triggered by the clock rising or falling, there
| is a short time in which the supply rail and ground rail
| are essentially shorted out.
|
| Each gate has logically two transistors of which exactly
| one is always conducting, either connecting the output to
| the supply rail making the output a one, or connecting the
| output to the ground rail making the output a zero. When
| the output of the gate changes, both transistors have to
| switch in order to connect the output to the other rail
| than before. While this happens both transistors are
| conducting at the same time allowing current to flow from
| the supply rail to the ground rail.
|
| In addition to that, the input capacitances of subsequent
| gates get charged from the supply rail when the output goes
| high and discharged into the ground rail when the output goes
| low. So every signal change pumps some charge from the
| supply rail through the input capacitances to the ground
| rail.
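|
| As a rough sketch of the dynamic part only (every number below is
| an illustrative assumption, not a Cerebras figure), the usual
| estimate is P ~ alpha * C * V^2 * f summed over switching nodes:
|
|   # Back-of-envelope CMOS dynamic power (Python).
|   # All values are assumed for illustration, not Cerebras data.
|   NODES = 4e12      # switching nodes, "one per transistor" ballpark
|   C_NODE = 0.1e-15  # effective capacitance per node, ~0.1 fF (assumed)
|   V_DD = 0.8        # supply voltage in volts (assumed)
|   FREQ = 1.0e9      # clock frequency in Hz (assumed)
|   ALPHA = 0.05      # fraction of nodes toggling each cycle (assumed)
|
|   p_dyn = ALPHA * NODES * C_NODE * V_DD**2 * FREQ
|   print(f"dynamic power ~ {p_dyn / 1e3:.1f} kW")  # ~12.8 kW here
|
| Leakage and short-circuit current come on top of that, but even
| this crude model lands in the same ballpark as the quoted 24 kW.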
| Tuna-Fish wrote:
| > When the output of the gate changes, both transistors
| have to switch in order to connect the output to the
| other rail than before. While this happens both
| transistors are conducting at the same time allowing
| current to flow from the supply rail to the ground rail.
|
| This is generally not true for modern logic. The
| transistors are biased so that one transistor switches
| off before the other switches on. What you described used
| to be true for a long time, because doing it that way
| gets you faster transistors, but it also increases power
| consumption so much that it is no longer worth it.
|
| Leakage also used to be a much larger problem before hkmg
| and finfets. These days, most of the power consumption of
| a chip really comes just from the draining of gate
| capacitance.
| danbruc wrote:
| Does modern logic mean sub-threshold logic?
| Workaccount2 wrote:
| It simply takes a non-zero amount of energy to turn a
| transistor on and off.
|
| Add up trillions of transistors, flicking on and off
| billions of times a second, and you get enormous power
| draws.
|
| What is actually drawing power is the gate capacitance of
| the transistors. If the transistor were a physical switch,
| the gate capacitance would be the "weight" that must be put on
| the switch to flip it. Of course this weight gets smaller
| as the switches shrink and as the tech improves, but it
| will always be non-zero.
|
| None of this accounts for resistive losses either, which is
| just the cost of doing business for a CPU.
| magicalhippo wrote:
| Modern CPUs are built using CMOS MOSFET transistors[1]. The
| gate, which controls if the transistor conducts or not, is
| effectively a small capacitor. The gate capacitor has to be
| charged up for the transistor to conduct[2], ie you have to
| stuff some electrons into it to turn the transistor on.
|
| Once you've done that, the transistor is on until the gate
| capacitor is discharged. This requires getting rid of the
| electrons you stuffed into it. The easiest is to just
| connect the gate to ground, essentially throwing the
| electrons away.
|
| So for each time the transistor goes through an on-off
| cycle, you need to "spend" some electrons, which in turn
| need to be supplied from the power supply. Thus higher
| frequency means more current just from more on-off cycles
| per second.
|
| There's also resistive losses and leakage currents and
| such.
|
| Now in theory I suppose you could recycle some of these
| electrons (using a charge pump arrangement[3]), reducing
| the overall demand. But that would require relatively large
| capacitors, and on-chip capacitors take a lot of chip area
| which could have been used for many transistors instead.
|
| [1]: https://en.wikipedia.org/wiki/CMOS
|
| [2]:
| https://en.wikipedia.org/wiki/MOSFET#Modes_of_operation
|
| [3]: https://en.wikipedia.org/wiki/Charge_pump
| magicalhippo wrote:
| Just to get some sense of perspective, the Zen 4-based Ryzen
| 7950X3D is built on TSMC's 5nm node and is listed[1] as being
| two 71mm^2 dies. The 5nm node uses a 300mm wafer[2], which
| means roughly 900 dies or 450 7950X3D's on one wafer, for a
| total of 8 trillion transistors.
|
| The peak power average of the 7950X3D is roughly 150W[3],
| which means if you could somehow run all 450 CPUs (900 dies)
| at peak, they'd consume around 68kW.
|
| edit: I forgot about the IO die which contains the memory
| controller, so that will suck some power as well. So if we
| say 50W for that and 50W for the CPU dies, that's 45kW.
|
| That's assuming you get a "clean wafer" with all dies
| working, not "just" the 80% yield or so.
|
| [1]: https://www.techpowerup.com/cpu-
| specs/ryzen-9-7950x3d.c3024
|
| [2]: https://www.anandtech.com/show/15219/early-tsmc-5nm-
| test-chi...
|
| [3]: https://www.tomshardware.com/reviews/amd-
| ryzen-9-7950x3d-cpu...
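|
| The same arithmetic as a quick sketch; the die count ignores exact
| edge losses, the per-CPU transistor count is back-solved from the
| ~8T total, and the power split is the 50W + 50W assumption above:
|
|   # Rough wafer math (Python); all constants are assumptions.
|   import math
|
|   WAFER_DIAMETER_MM = 300
|   DIE_AREA_MM2 = 71              # one Zen 4 CCD
|   TRANSISTORS_PER_CPU = 17.8e9   # implied by the ~8T total (assumed)
|   POWER_PER_CPU_W = 100          # ~50 W CCDs + ~50 W IO die (assumed)
|
|   wafer_area = math.pi * (WAFER_DIAMETER_MM / 2) ** 2  # ~70,686 mm^2
|   dies = int(wafer_area / DIE_AREA_MM2 * 0.9)          # ~10% edge loss
|   cpus = dies // 2                                     # two CCDs each
|
|   print(f"{dies} dies, {cpus} CPUs")
|   print(f"~{cpus * TRANSISTORS_PER_CPU / 1e12:.1f}T transistors")
|   print(f"~{cpus * POWER_PER_CPU_W / 1e3:.0f} kW at full load")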
| brucethemoose2 wrote:
| And even mundane details are so difficult... thermal
| expansion, the sheer number of pins that have to line up.
| This thing is a marvel.
| dougmwne wrote:
| I want this woman running my postAI-apocalypse hardware
| research lab.
| nachexnachex wrote:
| She's got strong merit to be spared by Roko's basilisk AI.
| bsder wrote:
| As always, IBM did it first:
| https://www.righto.com/2021/03/logic-chip-teardown-from-vint...
| Rexxar wrote:
| It seems the vimeo video has been removed now too.
| imbusy111 wrote:
| I wish they dug into how this monstrosity is powered. Assuming 1V
| and 24kW, that's 24kAmps.
| geph2021 wrote:
| and cooling!
|
| Imagine the heat sink on that thing. Would look like a cast-
| iron Dutch oven :)
| tivert wrote:
| One of the videos posted here gets into that:
| https://news.ycombinator.com/item?id=39693930:
| https://vimeo.com/853557623
| geph2021 wrote:
| Thanks for sharing! Very interesting.
|
| "We call it the engine block because it somewhat resembles
| a three cylinder motorcycle engine"
|
| (referring to the power distribution and cooling)
| jandrese wrote:
| 24kW translates to like 32hp, so you could imagine this thing
| with a liquid cooling loop hooked up to something that looks
| like a car radiator.
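|
| A quick sketch of what such a loop has to move (water coolant and
| a 10 degree C rise are illustrative assumptions):
|
|   # Coolant flow needed to carry 24 kW at a given temperature rise.
|   POWER_W = 24_000
|   CP_WATER = 4186   # specific heat of water, J/(kg*K)
|   DELTA_T = 10      # temperature rise across the chip in K (assumed)
|
|   flow_kg_s = POWER_W / (CP_WATER * DELTA_T)
|   print(f"~{flow_kg_s:.2f} kg/s ~= {flow_kg_s * 60:.0f} L/min of water")
|   print(f"~{POWER_W / 745.7:.0f} hp of heat")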
| dist-epoch wrote:
| 24kW is on the lower end of a home heating gas boiler.
| anticensor wrote:
| Then just connect the water loop to the house heaters...
| MenhirMike wrote:
| Finally, a chip that outmatches the Noctua NH-D15!
| tivert wrote:
| Interesting. I know there are a lot of attempts to hobble China by
| limiting their access to cutting edge chips and semiconductor
| manufacturing technology, but could something like this be a
| workaround for them, at least for datacenter-type jobs?
|
| Maybe it wouldn't be as powerful as one of these, due to their
| less capable fabs, but something that's good enough to get the
| job done in spite of the embargoes.
| eternauta3k wrote:
| What do you mean by "this", and how does it work around the
| restrictions? Do you mean just making bigger chips instead of
| shrinking the transistors?
| tivert wrote:
| > Do you mean just making bigger chips instead of shrinking
| the transistors?
|
| Yes.
| asdfasdf1 wrote:
| - Interconnect between WSE-2's chips in the cluster was 150GB/s,
| much lower than NVIDIA's 900GB/s.
|
| - non-sparse fp16 in WSE-2 was 7.5 petaflops (about 8 H100s, 10x
| worse performance per dollar)
|
| Does anyone know the WSE-3 numbers? The datasheet seems to be
| lacking loads of details.
|
| Also, 2.5 million USD for 1 x WSE-3, why just 44GB tho???
| Tuna-Fish wrote:
| 44GB is the SRAM on a single device, comparable to the 50MB of
| L2 on the H100. There is also a lot of directly attached DRAM.
| terafo wrote:
| No, it's comparable to the 230MB of SRAM on the Groq chip, since both
| of them are SRAM-only chips that can't really use external
| memory.
| bee_rider wrote:
| Is that 150GB/s between elements that expect to run tightly
| coupled processes together? Maybe the bandwidth between chips
| is less important.
|
| I mean, in a cluster you might have a bunch of nodes with 8x
| GPUs hanging off each. If this thing replaces a whole node
| rather than a single GPU, which I assume is the case, it's not
| really a useful comparison, right?
| xcv123 wrote:
| >> why just 44GB tho???
|
| You can order one with 1.2 Petabytes of external memory. Is
| that enough?
|
| "External memory: 1.5TB, 12TB, or 1.2PB"
|
| https://www.cerebras.net/press-release/cerebras-announces-th...
|
| "214Pb/s Interconnect Bandwidth"
|
| https://www.cerebras.net/product-system/
| acchow wrote:
| I can't find the memory bandwidth to that external memory.
| Did they publish this?
| terafo wrote:
| Because SRAM stopped getting smaller with recent nodes.
| asdfasdf1 wrote:
| WHITE PAPER Training Giant Neural Networks Using Weight Streaming
| on Cerebras Wafer-Scale Clusters
|
| https://f.hubspotusercontent30.net/hubfs/8968533/Virtual%20B...
| fxj wrote:
| It has its own programming language, CSL:
|
| https://www.cerebras.net/blog/whats-new-in-r0.6-of-the-cereb...
|
| "CSL allows for compile time execution of code blocks that take
| compile-time constant objects as input, a powerful feature it
| inherits from Zig, on which CSL is based. CSL will be largely
| familiar to anyone who is comfortable with C/C++, but there are
| some new capabilities on top of the C-derived basics."
|
| https://github.com/Cerebras/csl-examples
| rbanffy wrote:
| It is oddly reminiscent of the Thinking Machines CM-1/2 series,
| but with CSL as the official language instead of Lisp.
|
| And far fewer blinking lights.
| wizardforhire wrote:
| But can it run doom?
| mlhpdx wrote:
| Does it come in a mobile/laptop version?
| pmontra wrote:
| It's 215 x 215 mm so it fits in a large laptop, some 15" and
| definitely 17" ones. The keyboard could get a little warm and
| battery life doesn't look good.
| whyenot wrote:
| Imagine setting up a Beowulf cluster of these /s
| terafo wrote:
| Not right now.
| cs702 wrote:
| According to the company, the new chip will enable training of AI
| models with up to 24 trillion parameters. Let me repeat that, in
| case you're as excited as I am: _24. Trillion. Parameters._ For
| comparison, the largest AI models currently in use have around
| 0.5 trillion parameters, roughly 48x smaller.
|
| Each parameter is a _connection between artificial neurons_. For
| example, inside an AI model, a linear layer that transforms an
| input vector with 1024 elements to an output vector with 2048
| elements has 1024x2048 = ~2M parameters in a weight matrix. Each
| parameter specifies by how much each element in the input vector
| contributes to or subtracts from each element in the output
| vector. Each output vector element is a weighted sum (AKA a
| linear combination) of the input vector elements.
|
| A human brain has an estimated 100-500 trillion synapses
| connecting biological neurons. Each synapse is quite a
| complicated biological structure[a], but if we oversimplify
| things and assume that every synapse can be modeled as a single
| parameter in a weight matrix, then the largest AI models in use
| today have approximately (100T to 500T) / 0.5T = 200x to 1000x
| fewer connections between neurons than the human brain. If the
| company's claims prove true, this new chip will enable training
| of AI models that have only 4x to 20x fewer connections than the
| human brain.
|
| We sure live in interesting times!
|
| ---
|
| [a] https://en.wikipedia.org/wiki/Synapse
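|
| A minimal sketch of the arithmetic above (the synapse counts are
| the rough estimates quoted, not measurements):
|
|   # Parameter-count arithmetic for a linear layer, plus the
|   # brain-comparison ratios from the comment above (Python).
|   in_features, out_features = 1024, 2048
|   layer_params = in_features * out_features   # ~2.1M weights
|
|   largest_model = 0.5e12                 # ~0.5T parameters (rough)
|   wse3_claim = 24e12                     # 24T parameters (claimed)
|   syn_low, syn_high = 100e12, 500e12     # human synapse estimates
|
|   print(f"linear layer: {layer_params / 1e6:.1f}M parameters")
|   print(f"today vs brain: {syn_low / largest_model:.0f}x to "
|         f"{syn_high / largest_model:.0f}x fewer connections")
|   print(f"24T vs brain: {syn_low / wse3_claim:.1f}x to "
|         f"{syn_high / wse3_claim:.1f}x fewer connections")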
| mlyle wrote:
| > but if we oversimplify things and assume that every synapse
| can be modeled as a single parameter in a weight matrix
|
| Which, it probably can't... but offsetting those
| simplifications and the 4-20x difference is the massive difference
| in how quickly those synapses can be activated.
| topspin wrote:
| > that have only 4x to 20x fewer connections than the human
| brain
|
| So only 4-20 of these systems are necessary to match the human
| brain. No?
| ipsum2 wrote:
| Fun fact, I can also train a 24 trillion parameter model on my
| laptop! Just need to offload weights to the cloud every layer.
|
| ...
|
| It's meaningless to say something can train a model that has 24
| trillion parameters without specifying the dataset size and
| time it takes to train.
| Rexxar wrote:
| Is there a reason it's not roughly a disc if they use the whole
| wafer? They could have 50% more surface area.
| londons_explore wrote:
| Packaging method can't handle non-square dies?
| terafo wrote:
| To quote their official response "If the WSE weren't
| rectangular, the complexity of power delivery, I/O, mechanical
| integrity and cooling become much more difficult, to the point
| of impracticality.".
| RetroTechie wrote:
| If you were to add up all transistors fabricated worldwide, up
| until <year>, such that the total roughly matches the # on this
| beast, what year would you arrive at? Hell, throw in discrete
| transistors if you want.
|
| How many early supercomputers / workstations etc would that
| include? How much progress did humanity make using all those
| early machines (or _any_ transistorized device!) combined?
| itishappy wrote:
| Rough guess: mid 1980s
|
| 4004 from the 1970s used 2300 transistors, so it would have
| needed to sell billions.
|
| The Pentium from the 1990s had 3M transistors, so it could hit
| our target by selling a bit over a million units.
|
| I'm betting (without much research) that the Pentium line alone
| sold millions, and the industry as a whole could hit those
| numbers about 5 years earlier.
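|
| The two data points behind that guess, as a quick sketch (the
| unit counts are what's needed to reach 4T transistors; the "mid
| 1980s" call is still just a guess):
|
|   # Units needed to accumulate 4e12 transistors (Python).
|   TARGET = 4e12
|   chips = {
|       "Intel 4004 (1971)": 2_300,
|       "Pentium (1993)": 3_000_000,
|   }
|   for name, transistors in chips.items():
|       units = TARGET / transistors
|       print(f"{name}: ~{units:,.0f} units")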
| holoduke wrote:
| Better sell all nvidia stocks. Once these chips are common there
| is no need anymore for GPUs in training super large AI models.
| incrudible wrote:
| This chip does not outperform NVIDIA on key metrics. Economies
| of scale are unfavorable. Software is exotic.
|
| I trust that gamers will outlast every hype, be it crypto or
| AI.
| nickpsecurity wrote:
| Two of you have a take on this that sounds similar to prior
| projects, like the Cell processor. They lost in the long run.
| Not a good sign.
| anon291 wrote:
| Ex Cerebras engineer. In my opinion, this is not going to be
| the case. The WSE-2 was a b** to program and debug. Their
| compilation strategy is a dead end, and they invest very little
| into developer ease. My two cents.
| imtringued wrote:
| I would be more worried about the fact that next year every CPU
| is going to ship with some kind of AI accelerator already
| integrated into the die, which means the only competitive
| differentiation boils down to how much SRAM and memory
| bandwidth your AI accelerator is going to have. TOPS or FLOPS
| will become an irrelevant differentiator.
| terafo wrote:
| This thing targets training, which isn't affected by tiny
| accelerators inside CPUs.
| imtringued wrote:
| One thing I don't understand about their architecture is that
| they have spent so much effort building this monster of a chip,
| but if you are going to do something crazy, why not work on
| processing in memory instead? At least for transformers you will
| primarily be bottlenecked on matrix multiplication and almost
| nothing else, so you only need to add a simple matrix vector unit
| behind your address decoder and then almost every AI accelerator
| will become obsolete overnight. I wouldn't suggest this to a
| random startup though.
| TheDudeMan wrote:
| FWIW, this chip has 44 GB of on-chip memory.
| hashtag-til wrote:
| Any idea what the yield is on these chips?
| wtallis wrote:
| Previous versions have had basically 100% yield, because when
| you're working with the whole wafer it's pretty easy to squeeze
| in enough redundancy to route around defects unless you get a
| really unlucky cluster of defects.
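|
| A textbook Poisson-defect sketch of why that works (the defect
| density and spare fraction are illustrative assumptions, not
| Cerebras data):
|
|   # Expected defects vs. spare cores on a wafer-scale part (Python).
|   import math
|
|   WAFER_AREA_CM2 = 21.5 * 21.5   # ~462 cm^2 of active silicon
|   DEFECT_DENSITY = 0.1           # defects per cm^2 (assumed)
|   CORES = 900_000                # approximate core count
|   SPARE_FRACTION = 0.01          # ~1% spare cores (assumed)
|
|   expected_defects = DEFECT_DENSITY * WAFER_AREA_CM2   # ~46
|   spares = CORES * SPARE_FRACTION                      # ~9,000
|   # A monolithic die this size would almost never be defect-free:
|   yield_no_redundancy = math.exp(-expected_defects)
|
|   print(f"expected defects: {expected_defects:.0f}")
|   print(f"spare cores:      {spares:.0f}")
|   print(f"yield w/o spares: {yield_no_redundancy:.1e}")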
| pgraf wrote:
| related discussion (2021):
| https://news.ycombinator.com/item?id=27459466
| beautifulfreak wrote:
| So it's increased from 2.6 to 4 trillion transistors over the
| previous version.
| tedivm wrote:
| The missing numbers that I really want to see:
|
| * Power Usage
|
| * Rack Size (last one I played with was 17u)
|
| * Cooling requirements
| api wrote:
| I'm surprised we haven't seen wafer scale many-core CPUs for
| cloud data centers yet.
| marmaduke wrote:
| Hm, let's wait and see what the gemm/W perf is, and how many
| programmer hours it takes to implement, say, an MLP. Wafer-scale
| data flow may not be a solved problem?
| tibbydudeza wrote:
| Wow - it's bigger than my kitchen tiles - who uses them??? The
| NSA???
| TradingPlaces wrote:
| Near-100% yield is some dark magic.
| modeless wrote:
| As I understand it, WSE-2 was kind of handicapped because its
| performance could only really be harnessed if the neural net fit
| in the on-chip SRAM. Bandwidth to off-chip memory (normalized to
| FLOPS) was not as high as Nvidia's. Is that improved with WSE-3?
| Seems like the SRAM is only 10% bigger, so that's not helping.
|
| In the days before LLMs, 44 GB of SRAM sounded like a lot, but
| these days it's practically nothing. It's possible that novel
| architectures could be built for Cerebras that leverage the
| unique capabilities, but the inaccessibility of the hardware is a
| problem. So few people will ever get to play with one that it's
| unlikely new architectures will be developed for it.
___________________________________________________________________
(page generated 2024-03-13 23:01 UTC)