[HN Gopher] The Cerebras CS-2 wafer-scale engine (850k cores, 40...
       ___________________________________________________________________
        
       The Cerebras CS-2 wafer-scale engine (850k cores, 40 GB SRAM)
        
       Author : unwind
       Score  : 114 points
       Date   : 2021-06-10 11:34 UTC (11 hours ago)
        
 (HTM) web link (cerebras.net)
 (TXT) w3m dump (cerebras.net)
        
       | shrubble wrote:
       | A friend visited the Pittsburgh Supercomputing Center, where he
       | saw one of these. Said that they throw off a LOT of heat.
       | Stainless steel braided piping is used to move the liquid cooling
       | around.
        
         | rokobobo wrote:
         | At 1:51 of this video, you can see them fitting it to what
         | looks like a giant heatsink, to be plugged into some very large
         | pipes: https://youtu.be/qSqAxEXtZY0?t=111
        
         | tyingq wrote:
          | Their home page says 23 kW in 15 rack units.
          | 
          | Compare that to somewhere in the range of 5 to 25 kW for 42
          | rack units for a more traditional mix of equipment in a rack.
          | 
          | That works out to roughly 2.5 to 13 times the power density
          | of a normal rack in the same space: about 1.5 kW per RU as
          | compared to 0.12-0.6 kW per RU for "normal" equipment.
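Rough arithmetic behind that density comparison (a sketch using only the figures quoted above; note the upper end comes out closer to 13x):

```python
# Back-of-the-envelope power density, using the numbers quoted above.
cs2_kw, cs2_ru = 23.0, 15                        # CS-2: 23 kW in 15 rack units
rack_kw_lo, rack_kw_hi, rack_ru = 5.0, 25.0, 42  # traditional mixed rack

cs2_density = cs2_kw / cs2_ru    # ~1.53 kW per RU
rack_lo = rack_kw_lo / rack_ru   # ~0.12 kW per RU
rack_hi = rack_kw_hi / rack_ru   # ~0.60 kW per RU

print(f"CS-2: {cs2_density:.2f} kW/RU, "
      f"ratio: {cs2_density / rack_hi:.1f}x-{cs2_density / rack_lo:.1f}x")
```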
        
         | meepmorp wrote:
         | The 1st generation units draw 23kW peak, so yeah, basically a
         | computational space heater.
         | 
         | https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...
        
       | EvanAnderson wrote:
       | What a charmingly impractical product! I wonder what their yields
       | are like.
       | 
       | I am reminded of a recording of a Seymour Cray talk where he
       | discussed some of the challenges his company faced with actually
       | manufacturing their computers. The challenges associated with
       | physical packaging, cooling, and power were as much or more
       | interesting to me than the computing capabilities of the
       | resulting product. I bet there are similar interesting rabbit-
       | holes w/ this product.
       | 
       | I have some nostalgia for the past world of "supercomputers" that
       | were actually purpose-built devices, rather than clusters of off-
       | the-shelf hardware. Economics don't favor the purpose-built, for
       | sure, but things like the old Connection Machines and Cray
       | machines seem very romantic to me. This product seems a bit like
       | an artifact from that time.
       | 
       | Edit: I think the Cray talk I'm thinking about is
       | https://www.youtube.com/watch?v=8Z9VStbhplQ
        
         | fabiospampinato wrote:
          | I think yields are effectively ~100%, actually; I've read
          | that the chip is architected in a way that if any of those
          | cores doesn't work, the rest of the chip still works.
        
           | EvanAnderson wrote:
           | I figured there would be spare silicon on the die. I guess I
           | was thinking more about the raw yields and how much over-
           | capacity was engineered into the product to handle real-world
           | physics. There has to be some pretty impressive engineering
           | behind this thing.
        
             | kasperni wrote:
             | 1-2% over-capacity.
             | 
              | "Both Cerebras' first and second generation chips are
              | created by removing the largest possible square from a 300
              | mm wafer to create 46,000 square millimeter chips roughly
              | the size of a dinner plate. An array of repeated identical
              | tiles (84 of them) is built into the wafer, enabling
              | redundancy."
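The quoted figures roughly cross-check with simple geometry (a sketch; the per-tile numbers below are plain divisions, not published specs):

```python
import math

# Largest square that fits inside a 300 mm wafer (inscribed in the circle).
wafer_diameter_mm = 300.0
side_mm = wafer_diameter_mm / math.sqrt(2)   # ~212 mm on a side
area_mm2 = side_mm ** 2                      # 45,000 mm^2, near the quoted ~46,000

tiles, cores = 84, 850_000
print(f"side ~{side_mm:.0f} mm, area ~{area_mm2:,.0f} mm^2, "
      f"~{area_mm2 / tiles:.0f} mm^2 and ~{cores // tiles:,} cores per tile")
```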
        
               | EvanAnderson wrote:
                | They're fabbing it on a 7nm process, too! This is really
               | exciting stuff.
               | 
               | This rabbit hole got me to this Anandtech article:
               | https://www.anandtech.com/show/16626/cerebras-unveils-
               | wafer-...
        
         | qayxc wrote:
         | > I have some nostalgia for the past world of "supercomputers"
         | that were actually purpose-built devices, rather than clusters
         | of off-the-shelf hardware.
         | 
          | From mainframes to supercomputers, these parts are still far
          | from off-the-shelf hardware. They might use components that
          | share a common heritage with their COTS cousins (in that
          | they're not being developed specifically for a single
          | supercomputer), but from CPUs to interconnects you won't be
          | able to source the components as-is, new, on the open
          | market. Everything is still customised to varying degrees to
          | meet customer specs and pricing.
         | 
         | Packaging and cooling has been solved, btw. and the company
         | indeed sells a "plug-and-play" system complete with software
         | support for relevant software packages (e.g. TensorFlow and
         | friends).
        
           | EvanAnderson wrote:
           | > They might use components that share a common heritage with
           | their COTS cousins (in that they're not being developed
           | specifically for a single supercomputer)
           | 
            | That's what I'm talking about. This system isn't just a new
            | arrangement of off-the-shelf x86/x64-based CPUs or commodity
            | GPU cores with exotic interconnects.
           | 
           | > Packaging and cooling has been solved, btw. and the company
           | indeed sells a "plug-and-play" system
           | 
           | I'm aware that they have a product. My interest would be in
           | hearing about how they solved their packaging, power, and
           | cooling challenges. That's actually more interesting to me
           | than the capabilities of the system to deliver neural network
           | processing. There had to be a ton of fun engineering in
           | figuring out how to make a system out of the raw technology.
        
             | richk449 wrote:
             | It has been covered in the popular tech press:
             | 
             | https://spectrum.ieee.org/semiconductors/processors/cerebra
             | s...
             | 
             | https://spectrum.ieee.org/tech-
             | talk/semiconductors/processor...
        
       | rokobobo wrote:
       | Out of curiosity, does anyone know of an actual company that's
       | bought and uses one of these chips (say, the CS-1)? What's the
       | use case?
       | 
       | They mention "low latency datacenter inference." Surely, the
       | Facebooks and Amazons in the world could do better by using
       | localized, smaller-scale machines for inference, since their use
        | cases are distributed geographically. The one use case that I
        | can think of is high-frequency trading, but I can tell you
        | that 20 PB/s of memory bandwidth is overkill there, and also,
        | you're going to have a bad time trying to cool this thing down
        | in a colocation center that doesn't belong to you.
        
         | thomasjudge wrote:
         | Under the "Industries" part of their webpage they have
         | testimonials from GlaxoSmithKline, Lawrence Livermore National
         | Lab, Argonne National Lab
        
           | jamessb wrote:
           | This slide (https://images.anandtech.com/doci/16626/Cerebras%
           | 20WSE2%20La...) lists announced Deployments at:
           | 
           | Argonne National Laboratory
           | 
           | Lawrence Livermore National Laboratory
           | 
           | Pittsburgh Supercomputer Center
           | 
           | Edinburgh Parallel Computing Centre
           | 
            | GlaxoSmithKline
           | 
           | "Other wins in heavy manufacturing, Pharma, Biotech, military
           | and intelligence"
        
         | throw737858 wrote:
         | There is something on YouTube. They are very profitable,
         | machine learning, filtering and stuff. No trading.
        
         | wvaske wrote:
         | There are AI models that are too big to effectively run on
         | smaller hardware. The performance benefit of a monolithic
         | system like this vs distributed GPU based systems can be orders
         | of magnitude.
        
         | [deleted]
        
       | zitterbewegung wrote:
       | I remember when I was a college student at UIC and I took a class
       | that taught me how to use Maple (now they use Sage). The
       | professor also taught supercomputing and he bought some
       | specialized NUMA computers and a few dense multiprocessor
        | systems. If you don't have a large company supporting what you
        | buy, there is a huge risk that the vendor folds and your
        | hardware becomes useless, to the point that you will let an
        | undergraduate try to figure out how to fix it (that
        | undergraduate was me). Unfortunately, I was unsuccessful.
        
         | madengr wrote:
         | Wouldn't these things, like quantum computers, be leased? The
         | complexity is getting so high that you would need full time
         | staff dedicated just to that hardware. At least if it goes
         | belly up, you are not stuck with unsupported hardware.
        
         | stainforth wrote:
         | Was the hardware broken or was it a matter of figuring out how
         | to run something on it?
        
           | zitterbewegung wrote:
            | IIRC the hardware was broken and unserviceable (at least
            | the units I was given).
        
         | choppaface wrote:
          | The CS1 had a TensorFlow / PyTorch API; the CS2 probably has
          | the same capability. There's still a lot of risk for sure,
          | but at least it's not Sawzall.
        
       | gautamcgoel wrote:
       | Dumb question: how are they able to sell each of these for ~
       | $2-3M? I doubt they have sold more than 20-30 of these systems.
       | Given the upfront cost of design, verification, wafer
       | construction, etc, shouldn't the unit price be much higher?
        
         | alpaca128 wrote:
         | Those processors apparently have a 100% yield rate in
         | production because a small percentage of cores are only for
         | redundancy and can balance out small defects. The actual
         | distribution of the software and the flow of data is handled by
         | the compiler later.
         | 
         | Don't know much about the development costs though.
         | 
          | Edit: I got this info from a YT video[0], which talks a bit
          | about it and about the second generation of the chip.
         | 
         | [0] https://www.youtube.com/watch?v=FNd94_XaVlY
        
           | eb0la wrote:
            | If I remember correctly, Intel did something similar with
            | the i486SX - most were just i486DX chips with a faulty
            | on-die FPU disabled.
            | 
            | Not sure how they did it. Probably burning the SX microcode
            | after testing?
        
       | ramshanker wrote:
        | Ever since the Cerebras CS-1 launched, I have been dreaming of
        | either Intel or AMD going full-on into wafer-scale research.
        | All the individual university departments are going to love
        | their own million-dollar personal supercomputer. Even
        | businesses that can't afford today's supercomputers would want
        | one if wafer scale makes it affordable.
        | 
        | I guess only inertia is stopping them from going full scale on
        | wafer scale.
        
       | xattt wrote:
       | Direct link to whitepaper:
       | 
       | https://f.hubspotusercontent30.net/hubfs/8968533/Cerebras-CS...
        
         | mikewarot wrote:
         | No actual specifications about precision of computations,
         | operations supported, etc., just impressive I/O numbers.
        
           | tyingq wrote:
           | There's better info here: https://blog.inten.to/hardware-for-
           | deep-learning-part-4-asic...
           | 
           |  _" The instruction set supports operations on INT16, FP16,
           | and FP32 types. Floating-point adds, multiplies, and fused
           | multiply-accumulate (or FMAC, with no rounding of the product
           | prior to the add) can occur in a 4-way SIMD manner for 16-bit
           | operands. The instruction set supports SIMD operations across
           | subtensors of four-dimensional tensors, making use of tensor
           | address generation hardware to efficiently access tensor data
           | in memory."_
           | 
           | The source appears to be this PDF:
           | https://arxiv.org/pdf/2010.03660.pdf
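The "no rounding of the product prior to the add" detail is the interesting part. It can be simulated in NumPy by comparing a pure-fp16 multiply-accumulate against one that feeds the add at wider precision (a sketch with made-up data, not Cerebras code):

```python
import numpy as np

a = np.full(2048, 0.1, dtype=np.float16)
b = np.full(2048, 0.1, dtype=np.float16)

# Naive: round every product to fp16, accumulate in fp16.
naive = np.float16(0)
for x, y in zip(a, b):
    naive = np.float16(naive + x * y)

# FMAC-style: product reaches the add unrounded (wide accumulator).
fused = np.float32(0)
for x, y in zip(a, b):
    fused = fused + np.float32(x) * np.float32(y)

ref = float(a.astype(np.float64) @ b.astype(np.float64))
print(f"fp16 accumulate: {float(naive):.4f}, "
      f"fused-style: {float(fused):.4f}, reference: {ref:.4f}")
```

The fused-style result lands much closer to the float64 reference, which is why hardware FMAC matters for long fp16 dot products.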
        
       | polymorph1sm wrote:
        | Unless they are able to introduce a smaller module, like a
        | "GPU"-sized version, it will only be a nice demo of what we
        | can do with state-of-the-art semiconductor manufacturing.
        
         | f6v wrote:
          | Where do you think all those next-gen deep learning models
          | for autonomous weapons are going to be trained? For more
          | civilian applications, imagine GPT-3.
        
         | mmazing wrote:
         | I'm sure people had the same things to say about computers that
         | required the space of entire rooms in the 40s.
         | 
         | "That's nice but we'll never be able to scale this down any
         | more."
        
           | JohnJamesRambo wrote:
           | Isn't scaling it down just the normal CPUs we have now?
        
       | chrisMyzel wrote:
       | So is this an available product? Everyone hates 'contact sales'
       | 'get a demo' websites - what's the price, what's the spec, where
       | can I order?
        
         | eb0la wrote:
          | It is essentially a B2B product for a deep-pocketed kind of
          | customer. Think energy, defense, or a big financial company
          | (or fund).
          | 
          | If you can quickly calculate and iterate on something you
          | want to build / identify where it is feasible to drill /
          | etc., then the price of this is peanuts compared with doing
          | the wrong thing (tm).
        
         | qayxc wrote:
         | Just like IBM mainframes, you won't be able to just "click here
         | to buy" these.
         | 
          | They do seem to sell actual systems, though. For several
          | million apiece, so not exactly a product for the masses.
        
         | aseipp wrote:
          | It's an eight-figure product, so regardless of how you or
          | anyone else on a forum feels about it (I mean, everyone likes
          | to window shop, including me), I suspect their sales pipeline
          | isn't going to be hamstrung by it.
        
           | eb0la wrote:
            | Even at eight figures it still makes sense for accurate
            | value-at-risk calculation.
            | 
            | A friend who works in risk management told me that as a
            | financial institution you must have enough cash 'at rest'
            | to cover your share of your whole VaR.
            | 
            | The problem here is that VaR is hard to calculate.
            | 
            | For a small bank it might be simple, but for a big one
            | exposed to thousands of financial instruments it is not
            | simple...
            | 
            | ... and every million that you can put to work counts.
        
         | mhh__ wrote:
         | These types of monster-chips are probably sold as part of a
         | solution for a specific problem so they probably don't actually
         | have a fixed price in the sense that no one buys just a chip.
        
         | utf_8x wrote:
         | It's one of those products where if you have to ask, you can't
         | afford it.
        
         | chmod775 wrote:
          | Since this chip is a whole wafer, we can make some back-of-
          | the-envelope calculations.
         | 
         | Let's pretend this was an Intel product and Intel's typical
         | yield was like 150 CPUs / wafer. Using high-end Intel CPUs as a
         | price-point, we would already be in the $300,000 price range
         | for this chip.
         | 
          | But this isn't an Intel product and it isn't made at scale -
          | more likely made to order. Further, they don't just sell you
          | the wafer/chip, but a complete solution. They have to recoup
          | development costs with far fewer units sold.
         | 
         | I wouldn't be surprised if this goes for $2m - $5m.
         | 
         | Edit: The previous generation of this product was sold "for a
         | couple million" according to some voices on the internet.
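Spelled out (the per-CPU price below is a hypothetical figure of my own, just to make the arithmetic concrete):

```python
cpus_per_wafer = 150     # assumed yield of high-end CPUs per 300 mm wafer
price_per_cpu = 2_000    # hypothetical high-end CPU price, USD
wafer_value = cpus_per_wafer * price_per_cpu
print(f"${wafer_value:,}")  # prints $300,000
```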
        
       | wejick wrote:
        | Just recently I read the interview transcript [0] with the CEO
        | and CTO (Jim Keller) of Tenstorrent, which I believe is in the
        | same field. It's an interestingly deep conversation that
        | opened my eyes a bit about AI processors and how it's not only
        | about a chip that accelerates "AI tasks".
       | 
       | [0] https://www.anandtech.com/show/16709/an-interview-with-
       | tenst...
        
         | dragontamer wrote:
         | These "AI processors" are just matrix-multiplication engines,
         | often 8-bit or 16-bits. Maybe 4-bit in some cases.
         | 
          | 16-bit (and smaller) just isn't enough for most compute
          | problems - but emphasis on "most". Neural nets clearly are
          | fine with 16-bit, and some video game lighting effects can
          | be calculated in 16-bit and look "good enough".
          | 
          | Maybe a certified jewelry appraiser can tell the difference
          | between a 1.3 refractive index and 1.35, but the typical
          | video game player (and dare I say, human) won't be able to
          | tell the difference. If all the light bounces are just
          | slightly off, as long as it's "close enough", you probably
          | have good-enough looking ray-tracing or whatever.
         | 
         | --------
         | 
         | I've also heard of "iterative solvers" being accelerated by
         | 16-bit and 32-bit passes before doing a 64-bit pass. That's not
         | applicable to all math problems of course, but that's still a
         | methodology where your 16-bit performance accelerates your
         | 64-bit ultimate answer.
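A sketch of the iterative-refinement idea mentioned above: solve in a cheap low precision, then use high-precision residuals to polish the answer. Here float32 stands in for the "fast" precision; names and sizes are illustrative:

```python
import numpy as np

def refine_solve(A, b, iters=5):
    """Solve Ax = b: low-precision solves, high-precision residuals."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                    # residual in float64
        dx = np.linalg.solve(A32, r.astype(np.float32))  # cheap correction
        x += dx.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
b = rng.standard_normal(50)
x = refine_solve(A, b)
print("residual norm:", np.linalg.norm(b - A @ x))
```

In a real implementation the low-precision matrix would be factored once (e.g. LU) and the factors reused for each correction; the repeated `solve` here just keeps the sketch short.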
        
       | notum wrote:
       | Skynet is prettier than I thought it would be.
        
         | mhh__ wrote:
         | Don't be silly, Skynet will be an IBM z
        
       | wiz21c wrote:
        | For someone who started coding when 128 kilobytes was
        | considered a luxury, this is just, well, incommensurate :-)
        
         | tyingq wrote:
          | In a way, it's a throwback to that. Each "core" has access
          | to 48 kB of SRAM; there is no shared memory across cores.
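Those per-core numbers roughly cross-check against the headline figure (a sketch; the published core count is approximate):

```python
cores = 850_000
sram_per_core_kb = 48
total_gb = cores * sram_per_core_kb * 1024 / 1e9
print(f"{total_gb:.1f} GB")  # ~41.8 GB, in line with the headline "40 GB"
```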
        
           | wiz21c wrote:
            | 48 kB, my comfort zone!
        
       | extr wrote:
       | I would be curious to understand what the equivalent cluster
       | solution would look like. A big selling point here is apparently
       | that your researchers "don't have to understand the do's and
       | don'ts of cluster scale computing", but if your budget is already
       | in the millions, how big of an issue is that really? Is the
       | barrier to entry for large scale NN training really that there
       | aren't enough engineers with experience navigating the issues on
       | commodity/cloud hardware?
       | 
       | Overall I guess I'm just confused at what scale of problems this
       | is meant for. GPT-3 cost $5M to train, GPT-2 cost ~$50k. GPT-3 is
       | way too big to fit into 40GB of SRAM, GPT-2 fits but isn't
       | exactly bleeding edge. If this product is $2-5M, you could train
       | GPT-2 40-100 times to make it cost effective, but now you're
       | locked into the scale provided by this platform, and not all
       | problems are GPT-2 sized. I'm not an expert so happy for someone
       | else to chime in and correct me if I've gone wrong somewhere.
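The fits-in-SRAM claim is simple arithmetic on published parameter counts (assuming 2-byte fp16 weights and ignoring activations, gradients, and optimizer state, which make the real training footprint several times larger):

```python
sram_gb = 40         # CS-2 on-wafer SRAM
bytes_per_param = 2  # fp16, weights only

for name, params in [("GPT-2", 1.5e9), ("GPT-3", 175e9)]:
    gb = params * bytes_per_param / 1e9
    verdict = "fits" if gb <= sram_gb else "does not fit"
    print(f"{name}: {gb:,.0f} GB of weights -> {verdict} in {sram_gb} GB")
```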
        
         | blihp wrote:
         | GPT-2/3 is at the high end in terms of size. There's a whole
         | universe of problems, and most of the market, below it. A great
         | many business/research needs for at least the near future would
         | fit into a CS-2 quite nicely.
        
           | extr wrote:
           | Yes I suppose you're right, GPT-2/3 aren't particularly
           | representative of the "average" industry problem. Honestly
           | what I'm jonesing for is an insider's take on the
           | cost/benefit of a solution like this vs cloud, in terms of
           | problem size, future flexibility, raw price, talent required,
           | etc. Even just a ballpark "this is probably an order of
           | magnitude more efficient vs cloud GPU compute, game changer"
           | vs "we'd have to carefully consider and test performance,
           | etc"
           | 
           | But hey, that's the hard part right. Probably aren't many
           | people total with experience with CS-1, let alone browsing HN
           | :D
        
       ___________________________________________________________________
       (page generated 2021-06-10 23:02 UTC)