[HN Gopher] 4T transistors, one giant chip (Cerebras WSE-3) [video]
       ___________________________________________________________________
        
       4T transistors, one giant chip (Cerebras WSE-3) [video]
        
       Author : asdfasdf1
       Score  : 96 points
       Date   : 2024-03-13 16:13 UTC (6 hours ago)
        
 (HTM) web link (www.youtube.com)
 (TXT) w3m dump (www.youtube.com)
        
       | asdfasdf1 wrote:
       | https://www.cerebras.net/press-release/cerebras-announces-th...
       | 
       | https://www.cerebras.net/product-chip/
        
       | ortusdux wrote:
       | Not trying to sound critical, but is there a reason to use 4B,000
       | vs 4T?
        
         | geph2021 wrote:
         | original title is:
         | 
         | "4,000,000,000,000 Transistors, One Giant Chip (Cerebras
         | WSE-3)"
         | 
         | So I guess they're trying to stay true to it.
        
           | ortusdux wrote:
            | There are times when deviating from normal conventions makes
            | sense. The average consumer might not know that 1Gb/s is
           | faster than 750Mb/s. That being said, I don't think I've ever
           | seen anything along the lines of 1G,000b/s.
        
         | leptons wrote:
         | Made you click!
        
         | bee_rider wrote:
         | It looks like it has now been switched to just have the number.
         | I wonder if there was just some auto-formatting error.
        
         | wincy wrote:
          | My guess is the titles get auto-adjusted by Hacker News, but
          | the script that does it doesn't have logic for a trillion and
          | only goes up to a billion, hence the weirdness from the string
          | match-and-replace.
        
         | gosub100 wrote:
          | At least a trillion means a trillion, unlike the "tebi vs
          | tera"-byte marketing-speak in storage and RAM.
        
         | yalok wrote:
         | because Billion is ambiguous -
         | 
         | Quote:
         | 
         | Billion is a word for a large number, and it has two distinct
         | definitions:
         | 
         | 1,000,000,000, i.e. one thousand million, or 10^9 (ten to the
         | ninth power), as defined on the short scale. This is now the
         | most common sense of the word in all varieties of English; it
         | has long been established in American English and has since
         | become common in Britain and other English-speaking countries
         | as well.
         | 
         | 1,000,000,000,000, i.e. one million million, or 10^12 (ten to
         | the twelfth power), as defined on the long scale. This number
         | is the historical sense of the word and remains the established
         | sense of the word in other European languages. Though displaced
         | by the short scale definition relatively early in US English,
         | it remained the most common sense of the word in Britain until
         | the 1950s and still remains in occasional use there.
         | 
         | https://en.wikipedia.org/wiki/Billion
        
       | AdamH12113 wrote:
       | Title should be either "4,000,000,000,000 Transistors" (as in the
       | actual video title) or "4 Trillion Transistors" or maybe "4T
       | Transistors". "4B,000" ("four billion thousand"?) looks like
       | 48,000 (forty-eight thousand).
        
       | brucethemoose2 wrote:
       | Reposting the CS-2 teardown in case anyone missed it. The thermal
       | and electrical engineering is absolutely nuts:
       | 
       | https://vimeo.com/853557623
       | 
       | https://web.archive.org/web/20230812020202/https://www.youtu...
       | 
       | (Vimeo/Archive because the original video was taken down from
       | YouTube)
        
         | bitwrangler wrote:
         | 20,000 amps
         | 
         | 200,000 electrical contacts
         | 
         | 850,000 cores
         | 
         | and that's the "old" one. wow.
        
           | Keyframe wrote:
            | This is something I'm clueless about and can't really
            | understand. They say this thing is 24kW hungry. How does CPU
            | power consumption really work at the electrical level, and
            | what warrants that much power, even for regular CPUs? At a
            | basic level, is it the resistance of the material combined
            | with the switching frequency, or what is really going on
            | there? Where does the power go on such a relatively small
            | surface?
           | 
           | edit: thanks people, makes sense now!
        
             | danbruc wrote:
             | Yes, the power consumption comes from the resistance of the
             | circuit. In case of CMOS circuits there would ideally be no
             | current flow when no signal changes but transistors are not
             | perfect and have leakage currents. When signals change,
             | primarily triggered by the clock rising or falling, there
             | is a short time in which the supply rail and ground rail
             | are essentially shorted out.
             | 
             | Each gate has logically two transistors of which exactly
             | one is always conducting, either connecting the output to
             | the supply rail making the output a one, or connecting the
             | output to the ground rail making the output a zero. When
             | the output of the gate changes, both transistors have to
             | switch in order to connect the output to the other rail
             | than before. While this happens both transistors are
             | conducting at the same time allowing current to flow from
             | the supply rail to the ground rail.
             | 
             | In addition to that the input capacitances of subsequent
             | gates get charged from the supply rail when the output goes
              | high and discharged into the ground rail when the output
              | goes low. So every signal change pumps some charge from the
             | supply rail through the input capacitances to the ground
             | rail.
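              | 
              | A minimal sketch of the usual dynamic-power estimate,
              | P ~ alpha * C * V^2 * f, in Python. Every number below is
              | an illustrative assumption, not a measured figure for this
              | or any real chip:
              | 
              |     # Toy CMOS dynamic power estimate: P ~ alpha*C*V^2*f
              |     # All inputs are illustrative assumptions.
              |     num_nodes = 4e12   # switching nodes, very roughly
              |     c_node = 0.1e-15   # switched capacitance per node, F
              |     v_dd = 0.8         # supply voltage, V
              |     freq = 1.0e9       # clock frequency, Hz
              |     alpha = 0.05       # fraction of nodes toggling/cycle
              | 
              |     p_dyn = alpha * num_nodes * c_node * v_dd**2 * freq
              |     print(f"dynamic power ~ {p_dyn/1e3:.1f} kW")  # ~12.8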
        
               | Tuna-Fish wrote:
               | > When the output of the gate changes, both transistors
               | have to switch in order to connect the output to the
               | other rail than before. While this happens both
               | transistors are conducting at the same time allowing
               | current to flow from the supply rail to the ground rail.
               | 
               | This is generally not true for modern logic. The
               | transistors are biased so that one transistor switches
               | off before the other switches on. What you described used
               | to be true for a long time, because doing it that way
               | gets you faster transistors, but it also increases power
               | consumption so much that it is no longer worth it.
               | 
                | Leakage also used to be a much larger problem before HKMG
                | and FinFETs. These days, most of the power consumption of
               | a chip really comes just from the draining of gate
               | capacitance.
        
               | danbruc wrote:
               | Does modern logic mean sub-threshold logic?
        
             | Workaccount2 wrote:
             | It simply takes a non-zero amount of energy to turn a
             | transistor on and off.
             | 
             | Add up trillions of transistors, flicking on and off
             | billions of times a second, and you get enormous power
             | draws.
             | 
             | What is actually drawing power is the gate capacitance of
             | the transistors. If the transistor were a physical switch,
             | the gate capacitance is the "weight" that must be put on
             | the switch to flip it. Of course this weight gets smaller
             | as the switches shrink and as the tech improves, but it
             | will always be non-zero.
             | 
             | None of this accounts for resistive losses either, which is
             | just the cost of doing business for a CPU.
        
             | magicalhippo wrote:
             | Modern CPUs are built using CMOS MOSFET transistors[1]. The
             | gate, which controls if the transistor conducts or not, is
             | effectively a small capacitor. The gate capacitor has to be
             | charged up for the transistor to conduct[2], ie you have to
             | stuff some electrons into it to turn the transistor on.
             | 
             | Once you've done that, the transistor is on until the gate
             | capacitor is discharged. This requires getting rid of the
             | electrons you stuffed into it. The easiest is to just
             | connect the gate to ground, essentially throwing the
             | electrons away.
             | 
             | So for each time the transistor goes through an on-off
             | cycle, you need to "spend" some electrons, which in turn
             | need to be supplied from the power supply. Thus higher
             | frequency means more current just from more on-off cycles
             | per second.
             | 
             | There's also resistive losses and leakage currents and
             | such.
             | 
             | Now in theory I suppose you could recycle some of these
             | electrons (using a charge pump arrangement[3]), reducing
             | the overall demand. But that would require relatively large
             | capacitors, and on-chip capacitors take a lot of chip area
             | which could have been used for many transistors instead.
             | 
             | [1]: https://en.wikipedia.org/wiki/CMOS
             | 
             | [2]:
             | https://en.wikipedia.org/wiki/MOSFET#Modes_of_operation
             | 
             | [3]: https://en.wikipedia.org/wiki/Charge_pump
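              | 
              | For a rough feel of the numbers (the gate capacitance and
              | activity below are pure assumptions, not datasheet values):
              | the charge moved per on-off cycle is Q = C_gate * V, and
              | the average current is that charge times toggles per second
              | times the number of gates toggling:
              | 
              |     c_gate = 0.1e-15     # assumed gate capacitance, F
              |     v_dd = 0.8           # volts
              |     freq = 1.0e9         # Hz
              |     active_gates = 2e11  # assumed gates toggling per cycle
              | 
              |     q = c_gate * v_dd    # coulombs per gate per cycle
              |     i_avg = q * freq * active_gates
              |     print(f"average current ~ {i_avg/1e3:.0f} kA")  # ~16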
        
           | magicalhippo wrote:
           | Just to get some sense of perspective, the Zen 4-based Ryzen
            | 7950X3D is built on TSMC's 5nm node and is listed[1] as being
           | two 71mm^2 dies. The 5nm node uses a 300mm wafer[2], which
           | means roughly 900 dies or 450 7950X3D's on one wafer, for a
           | total of 8 trillion transistors.
           | 
           | The peak power average of the 7950X3D is roughly 150W[3],
           | which means if you could somehow run all 450 CPUs (900 dies)
           | at peak, they'd consume around 68kW.
           | 
            | edit: I forgot about the IO die which contains the memory
            | controller, so that will suck some power as well. So if we
            | say 50W for that and 50W per compute die, that's 100W of
            | compute-die power per CPU, or roughly 45kW for the wafer.
           | 
           | That's assuming you get a "clean wafer" with all dies
           | working, not "just" the 80% yield or so.
           | 
           | [1]: https://www.techpowerup.com/cpu-
           | specs/ryzen-9-7950x3d.c3024
           | 
           | [2]: https://www.anandtech.com/show/15219/early-tsmc-5nm-
           | test-chi...
           | 
           | [3]: https://www.tomshardware.com/reviews/amd-
           | ryzen-9-7950x3d-cpu...
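            | 
            | For anyone who wants to poke at the assumptions, here's the
            | back-of-the-envelope in Python (the ~10% edge-loss knock-down
            | is my own rough guess; the 50W-per-compute-die figure is from
            | the edit above):
            | 
            |     import math
            | 
            |     wafer_d_mm = 300
            |     die_area_mm2 = 71            # one Zen 4 compute die
            |     wafer_area = math.pi * (wafer_d_mm / 2) ** 2
            |     gross = int(wafer_area / die_area_mm2)  # ~995 dies
            |     usable = int(gross * 0.9)    # rough edge/scribe loss
            |     cpus = usable // 2           # two compute dies per CPU
            |     kw = cpus * 100 / 1e3        # 2 dies x 50W per CPU
            |     print(usable, cpus, kw)      # ~895 dies, ~447, ~45 kW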
        
           | brucethemoose2 wrote:
           | And even mundane details are so difficult... thermal
           | expansion, the sheer number of pins that have to line up.
           | This thing is a marvel.
        
         | dougmwne wrote:
         | I want this woman running my postAI-apocalypse hardware
         | research lab.
        
           | nachexnachex wrote:
            | She's got a strong case to be spared by the Roko's basilisk
            | AI.
        
         | bsder wrote:
         | As always, IBM did it first:
         | https://www.righto.com/2021/03/logic-chip-teardown-from-vint...
        
         | Rexxar wrote:
          | It seems the Vimeo video has been removed now too.
        
       | imbusy111 wrote:
       | I wish they dug into how this monstrosity is powered. Assuming 1V
       | and 24kW, that's 24kAmps.
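        | 
        | Rough math, using the contact count from the CS-2 teardown
        | numbers upthread and assuming a ~1V core voltage:
        | 
        |     power_w = 24_000      # quoted system power, W
        |     v_core = 1.0          # assumed core voltage, V
        |     contacts = 200_000    # per the CS-2 teardown upthread
        | 
        |     i_total = power_w / v_core                   # 24,000 A
        |     print(i_total / 1e3, "kA total")             # 24.0
        |     print(i_total / contacts * 1e3, "mA/contact")  # 120.0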
        
         | geph2021 wrote:
         | and cooling!
         | 
         | Imagine the heat sink on that thing. Would look like a cast-
         | iron Dutch oven :)
        
           | tivert wrote:
            | One of the videos posted here gets into that:
            | https://news.ycombinator.com/item?id=39693930
           | https://vimeo.com/853557623
        
             | geph2021 wrote:
             | Thanks for sharing! Very interesting.
             | 
             | "We call it the engine block because it somewhat resembles
             | a three cylinder motorcycle engine"
             | 
             | (referring to the power distribution and cooling)
        
           | jandrese wrote:
           | 24kW translates to like 32hp, so you could imagine this thing
           | with a liquid cooling loop hooked up to something that looks
           | like a car radiator.
        
             | dist-epoch wrote:
             | 24kW is on the lower end of a home heating gas boiler.
        
               | anticensor wrote:
               | Then just connect the water loop to the house heaters...
        
           | MenhirMike wrote:
           | Finally, a chip that outmatches the Noctua NH-D15!
        
       | tivert wrote:
        | Interesting. I know there are a lot of attempts to hobble China
        | by limiting their access to cutting edge chips and semiconductor
       | manufacturing technology, but could something like this be a
       | workaround for them, at least for datacenter-type jobs?
       | 
       | Maybe it wouldn't be as powerful as one of these, due to their
       | less capable fabs, but something that's good enough to get the
       | job done in spite of the embargoes.
        
         | eternauta3k wrote:
         | What do you mean by "this", and how does it work around the
         | restrictions? Do you mean just making bigger chips instead of
         | shrinking the transistors?
        
           | tivert wrote:
           | > Do you mean just making bigger chips instead of shrinking
           | the transistors?
           | 
           | Yes.
        
       | asdfasdf1 wrote:
       | - Interconnect between WSE-2's chips in the cluster was 150GB/s,
       | much lower than NVIDIA's 900GB/s.
       | 
        | - non-sparse fp16 in WSE-2 was 7.5 petaflops (about 8 H100s, 10x
        | worse performance per dollar)
        | 
        | Does anyone know the WSE-3 numbers? The datasheet seems to be
        | lacking loads of details.
       | 
       | Also, 2.5 million USD for 1 x WSE-3, why just 44GB tho???
        
         | Tuna-Fish wrote:
         | 44GB is the SRAM on a single device, comparable to the 50MB of
         | L2 on the H100. There is also a lot of directly attached DRAM.
        
           | terafo wrote:
            | No, it's comparable to the 230MB of SRAM on the Groq chip,
            | since both of them are SRAM-only chips that can't really use
            | external memory.
        
         | bee_rider wrote:
         | Is that 150GB/s between elements that expect to run tightly
         | coupled processes together? Maybe the bandwidth between chips
         | is less important.
         | 
          | I mean, in a cluster you might have a bunch of nodes with 8x
          | GPUs hanging off each. If this thing replaces a whole node
          | rather than a single GPU, which I assume is the case, it's not
          | really a useful comparison, right?
        
         | xcv123 wrote:
         | >> why just 44GB tho???
         | 
         | You can order one with 1.2 Petabytes of external memory. Is
         | that enough?
         | 
         | "External memory: 1.5TB, 12TB, or 1.2PB"
         | 
         | https://www.cerebras.net/press-release/cerebras-announces-th...
         | 
         | "214Pb/s Interconnect Bandwidth"
         | 
         | https://www.cerebras.net/product-system/
        
           | acchow wrote:
           | I can't find the memory bandwidth to that external memory.
           | Did they publish this?
        
         | terafo wrote:
         | Because SRAM stopped getting smaller with recent nodes.
        
       | asdfasdf1 wrote:
       | WHITE PAPER Training Giant Neural Networks Using Weight Streaming
       | on Cerebras Wafer-Scale Clusters
       | 
       | https://f.hubspotusercontent30.net/hubfs/8968533/Virtual%20B...
        
       | fxj wrote:
       | It has its own programming language CSL
       | 
       | https://www.cerebras.net/blog/whats-new-in-r0.6-of-the-cereb...
       | 
       | "CSL allows for compile time execution of code blocks that take
       | compile-time constant objects as input, a powerful feature it
       | inherits from Zig, on which CSL is based. CSL will be largely
       | familiar to anyone who is comfortable with C/C++, but there are
       | some new capabilities on top of the C-derived basics."
       | 
       | https://github.com/Cerebras/csl-examples
        
         | rbanffy wrote:
         | It is oddly reminiscent of the Thinking Machines CM-1/2 series,
         | but with CSL as the official language instead of Lisp.
         | 
         | And far fewer blinking lights.
        
       | wizardforhire wrote:
       | But can it run doom?
        
         | mlhpdx wrote:
         | Does it come in a mobile/laptop version?
        
           | pmontra wrote:
           | It's 215 x 215 mm so it fits in a large laptop, some 15" and
           | definitely 17" ones. The keyboard could get a little warm and
           | battery life doesn't look good.
        
         | whyenot wrote:
         | Imagine setting up a Beowulf cluster of these /s
        
         | terafo wrote:
         | Not right now.
        
       | cs702 wrote:
       | According to the company, the new chip will enable training of AI
       | models with up to 24 trillion parameters. Let me repeat that, in
       | case you're as excited as I am: _24. Trillion. Parameters._ For
       | comparison, the largest AI models currently in use have around
        | 0.5 trillion parameters, around 48 times smaller.
       | 
       | Each parameter is a _connection between artificial neurons_. For
       | example, inside an AI model, a linear layer that transforms an
       | input vector with 1024 elements to an output vector with 2048
       | elements has 1024x2048 = ~2M parameters in a weight matrix. Each
       | parameter specifies by how much each element in the input vector
       | contributes to or subtracts from each element in the output
        | vector. Each output vector element is a weighted sum (AKA a
        | linear combination) of the input vector elements.
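        | 
        | To make the parameter counting concrete, a tiny sketch in plain
        | NumPy (nothing Cerebras-specific):
        | 
        |     import numpy as np
        | 
        |     d_in, d_out = 1024, 2048
        |     W = np.random.randn(d_out, d_in) * 0.02  # weight matrix
        |     x = np.random.randn(d_in)                # input vector
        | 
        |     y = W @ x    # each output element is a weighted sum
        |                  # (linear combination) of the input elements
        |     print(W.size)      # 2,097,152 parameters, i.e. ~2M
        |     print(y.shape[0])  # 2048 output elements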
       | 
       | A human brain has an estimated 100-500 trillion synapses
       | connecting biological neurons. Each synapse is quite a
       | complicated biological structure[a], but if we oversimplify
       | things and assume that every synapse can be modeled as a single
       | parameter in a weight matrix, then the largest AI models in use
        | today have approximately 100T to 500T / 0.5T = 200x to 1000x
        | fewer connections between neurons than the human brain. If the
        | company's claims prove true, this new chip will enable training
        | of AI models that have only 4x to 20x fewer connections than the
        | human brain.
       | 
       | We sure live in interesting times!
       | 
       | ---
       | 
       | [a] https://en.wikipedia.org/wiki/Synapse
        
         | mlyle wrote:
         | > but if we oversimplify things and assume that every synapse
         | can be modeled as a single parameter in a weight matrix
         | 
         | Which, it probably can't... but offsetting those
         | simplifications and 4-20x difference is the massive difference
         | in how quickly those synapses can be activated.
        
         | topspin wrote:
         | > that have only 4x to 20x fewer connections that the human
         | brain
         | 
         | So only 4-20 of these systems are necessary to match the human
         | brain. No?
        
         | ipsum2 wrote:
         | Fun fact, I can also train a 24 trillion parameter model on my
         | laptop! Just need to offload weights to the cloud every layer.
         | 
         | ...
         | 
         | It's meaningless to say something can train a model that has 24
         | trillion parameters without specifying the dataset size and
         | time it takes to train.
        
       | Rexxar wrote:
        | Is there a reason it's not roughly a disc if they use the whole
        | wafer? They could have roughly 50% more surface area.
        
         | londons_explore wrote:
         | Packaging method can't handle non-square dies?
        
         | terafo wrote:
         | To quote their official response "If the WSE weren't
         | rectangular, the complexity of power delivery, I/O, mechanical
         | integrity and cooling become much more difficult, to the point
          | of impracticality."
        
       | RetroTechie wrote:
       | If you were to add up all transistors fabricated worldwide, up
        | until <year>, such that the total roughly matches the count on
        | this beast, at what year would you arrive? Hell, throw in
        | discrete transistors if you want.
       | 
       | How many early supercomputers / workstations etc would that
       | include? How much progress did humanity make using all those
       | early machines (or _any_ transistorized device!) combined?
        
         | itishappy wrote:
         | Rough guess: mid 1980s
         | 
          | The 4004 from the 1970s used 2,300 transistors, so it would
          | have needed to sell billions of units.
          | 
          | The Pentium from the 1990s had 3M transistors, so it could hit
          | our target by selling a bit over a million units.
         | 
         | I'm betting (without much research) that the Pentium line alone
         | sold millions, and the industry as a whole could hit those
         | numbers about 5 years earlier.
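          | 
          | The rough math behind that guess (the unit counts are the part
          | being guessed at; transistor counts are the ones above):
          | 
          |     target = 4e12   # transistors on the WSE-3
          | 
          |     chips = {
          |         "Intel 4004 (~2,300 transistors)": 2_300,
          |         "Pentium (~3M transistors)": 3_000_000,
          |     }
          |     for name, t in chips.items():
          |         print(f"{name}: ~{target / t:,.0f} units needed")
          |     # 4004: ~1.7 billion units; Pentium: ~1.3 million units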
        
       | holoduke wrote:
       | Better sell all nvidia stocks. Once these chips are common there
       | is no need anymore for GPUs in training super large AI models.
        
         | incrudible wrote:
          | This chip does not outperform NVIDIA on key metrics. Economies
          | of scale are unfavorable. Software is exotic.
         | 
         | I trust that gamers will outlast every hype, be it crypto or
         | AI.
        
           | nickpsecurity wrote:
           | Two of you have a take on this that sounds similar to prior
           | projects, like the Cell processor. They lost in the long run.
           | Not a good sign.
        
         | anon291 wrote:
         | Ex Cerebras engineer. In my opinion, this is not going to be
         | the case. The WSE-2 was a b** to program and debug. Their
         | compilation strategy is a dead end, and they invest very little
         | into developer ease. My two cents.
        
         | imtringued wrote:
         | I would be more worried about the fact that next year every CPU
         | is going to ship with some kind of AI accelerator already
         | integrated to the die, which means the only competitive
         | differentiation boils down to how much SRAM and memory
         | bandwidth your AI accelerator is going to have. TOPS or FLOPS
         | will become an irrelevant differentiator.
        
           | terafo wrote:
           | This thing targets training, which isn't affected by tiny
           | accelerators inside CPUs.
        
       | imtringued wrote:
       | One thing I don't understand about their architecture is that
       | they have spent so much effort building this monster of a chip,
       | but if you are going to do something crazy, why not work on
       | processing in memory instead? At least for transformers you will
       | primarily be bottlenecked on matrix multiplication and almost
       | nothing else, so you only need to add a simple matrix vector unit
       | behind your address decoder and then almost every AI accelerator
        | will become obsolete overnight. I wouldn't suggest this to a
       | random startup though.
        
         | TheDudeMan wrote:
         | FWIW, this chip has 44 GB of on-chip memory.
        
       | hashtag-til wrote:
       | Any idea on what's the yield on these chips?
        
         | wtallis wrote:
         | Previous versions have had basically 100% yield, because when
         | you're working with the whole wafer it's pretty easy to squeeze
         | in enough redundancy to route around defects unless you get a
         | really unlucky cluster of defects.
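          | 
          | A toy model of why the redundancy math works out (Poisson
          | defects plus a spare-core budget; every number below is made
          | up purely for illustration):
          | 
          |     import math
          | 
          |     defect_density = 0.1     # defects per cm^2 (assumed)
          |     core_area_cm2 = 0.0005   # one tiny core (assumed)
          |     n_cores = 900_000        # order of magnitude of a WSE
          |     spares = 9_000           # assume ~1% spare cores
          | 
          |     p_bad = 1 - math.exp(-defect_density * core_area_cm2)
          |     mu = n_cores * p_bad     # expected defective cores
          |     sigma = math.sqrt(n_cores * p_bad * (1 - p_bad))
          |     # wafer is "good" if bad cores fit in the spare budget
          |     # (normal approximation to the binomial)
          |     z = (spares - mu) / sigma
          |     y = 0.5 * (1 + math.erf(z / math.sqrt(2)))
          |     print(round(mu), round(y, 3))   # ~45 bad cores, yield ~1.0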
        
       | pgraf wrote:
       | related discussion (2021):
       | https://news.ycombinator.com/item?id=27459466
        
       | beautifulfreak wrote:
       | So it's increased from 2.6 to 4 trillion transistors over the
       | previous version.
        
       | tedivm wrote:
        | The missing numbers that I really want to see:
       | 
       | * Power Usage
       | 
       | * Rack Size (last one I played with was 17u)
       | 
       | * Cooling requirements
        
       | api wrote:
       | I'm surprised we haven't seen wafer scale many-core CPUs for
       | cloud data centers yet.
        
       | marmaduke wrote:
        | Hm, let's wait and see what the GEMM/W perf is, and how many
        | programmer hours it takes to implement, say, an MLP. Wafer-scale
        | data flow may not be a solved problem?
        
       | tibbydudeza wrote:
        | Wow - it is bigger than my kitchen tiles - who uses them??? The
        | NSA???
        
       | TradingPlaces wrote:
       | Near-100% yield is some dark magic.
        
       | modeless wrote:
       | As I understand it, WSE-2 was kind of handicapped because its
       | performance could only really be harnessed if the neural net fit
       | in the on-chip SRAM. Bandwidth to off-chip memory (normalized to
       | FLOPS) was not as high as Nvidia. Is that improved with WSE-3?
       | Seems like the SRAM is only 10% bigger, so that's not helping.
       | 
       | In the days before LLMs 44 GB of SRAM sounded like a lot, but
       | these days it's practically nothing. It's possible that novel
       | architectures could be built for Cerebras that leverage the
       | unique capabilities, but the inaccessibility of the hardware is a
       | problem. So few people will ever get to play with one that it's
       | unlikely new architectures will be developed for it.
        
       ___________________________________________________________________
       (page generated 2024-03-13 23:01 UTC)