[HN Gopher] The Cerebras CS-2 wafer-scale engine (850k cores, 40 GB SRAM)
___________________________________________________________________
The Cerebras CS-2 wafer-scale engine (850k cores, 40 GB SRAM)
Author : unwind
Score : 114 points
Date : 2021-06-10 11:34 UTC (11 hours ago)
(HTM) web link (cerebras.net)
(TXT) w3m dump (cerebras.net)
| shrubble wrote:
| A friend visited the Pittsburgh Supercomputing Center, where he
| saw one of these. He said that they throw off a LOT of heat.
| Stainless steel braided piping is used to move the liquid cooling
| around.
| rokobobo wrote:
| At 1:51 of this video, you can see them fitting it to what
| looks like a giant heatsink, to be plugged into some very large
| pipes: https://youtu.be/qSqAxEXtZY0?t=111
| tyingq wrote:
| Their home page says 23 kW in 15 rack units.
|
| Compare that to somewhere in the range of 5 to 25 kW for 42
| rack units for a more traditional mix of equipment in a rack.
|
| So it sounds like it consumes somewhere in the range of 2.5 to
| 13 times more than normal servers that would fit in the same
| space: 1.5 kW per RU as compared to 0.12-0.6 kW per RU for
| "normal" equipment.
| meepmorp wrote:
| The 1st-generation units draw 23 kW peak, so yeah, basically a
| computational space heater.
|
| https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...
| EvanAnderson wrote:
| What a charmingly impractical product! I wonder what their yields
| are like.
|
| I am reminded of a recording of a Seymour Cray talk where he
| discussed some of the challenges his company faced in actually
| manufacturing their computers. The challenges associated with
| physical packaging, cooling, and power were as interesting to
| me as the computing capabilities of the resulting product, if
| not more so. I bet there are similar interesting rabbit holes
| with this product.
|
| I have some nostalgia for the past world of "supercomputers" that
| were actually purpose-built devices, rather than clusters of off-
| the-shelf hardware. Economics don't favor the purpose-built, for
| sure, but things like the old Connection Machines and Cray
| machines seem very romantic to me. This product seems a bit like
| an artifact from that time.
|
| Edit: I think the Cray talk I'm thinking about is
| https://www.youtube.com/watch?v=8Z9VStbhplQ
| fabiospampinato wrote:
| I think yields are ~100%, actually; I've read that the chip is
| architected in a way that if any of its cores doesn't work, the
| rest of the chip still works.
| EvanAnderson wrote:
| I figured there would be spare silicon on the die. I guess I
| was thinking more about the raw yields and how much over-
| capacity was engineered into the product to handle real-world
| physics. There has to be some pretty impressive engineering
| behind this thing.
| kasperni wrote:
| 1-2% over-capacity.
|
| ""Both Cerebras' first and second generation chips are
| created by removing the largest possible square from a 300
| mm wafer to create 46,000 square millimeter chips roughly
| the size of a dinner plate. An array of repeated identical
| tiles (84 of them) is built into the wafer, enabling
| redundancy.""
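|
| The geometry roughly checks out; a quick Python sanity check
| (the inscribed-square model is my assumption, the actual die is
| a clipped array of rectangular tiles):
|
|     import math
|
|     wafer_diameter_mm = 300.0
|     # Largest square inscribed in a circle: side = diameter / sqrt(2)
|     side_mm = wafer_diameter_mm / math.sqrt(2)   # ~212 mm
|     area_mm2 = side_mm ** 2                      # 45,000 mm^2
|     tiles = 84
|     print(f"~{area_mm2:,.0f} mm^2, ~{area_mm2 / tiles:,.0f} mm^2 per tile")
|     # -> ~45,000 mm^2, in the ballpark of the quoted 46,000 mm^2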
| EvanAnderson wrote:
| They're fabbing it on a 7nm process, too! This is really
| exciting stuff.
|
| This rabbit hole got me to this Anandtech article:
| https://www.anandtech.com/show/16626/cerebras-unveils-
| wafer-...
| qayxc wrote:
| > I have some nostalgia for the past world of "supercomputers"
| that were actually purpose-built devices, rather than clusters
| of off-the-shelf hardware.
|
| From mainframes to supercomputers, these parts are still far
| from off-the-shelf hardware. They might use components that
| share a common heritage with their COTS cousins (in that
| they're not being developed specifically for a single
| supercomputer), but from CPUs to interconnects you won't be
| able to source the components as-is, new, on the free market.
| Everything is still customised to varying degrees to meet
| customer specs and pricing.
|
| Packaging and cooling have been solved, btw., and the company
| indeed sells a "plug-and-play" system complete with software
| support for relevant software packages (e.g. TensorFlow and
| friends).
| EvanAnderson wrote:
| > They might use components that share a common heritage with
| their COTS cousins (in that they're not being developed
| specifically for a single supercomputer)
|
| That's what I'm talking about. This system isn't just a new
| arrangement of off-the-shelf x86/x64-based CPUs or commodity
| GPU cores with exotic interconnects.
|
| > Packaging and cooling has been solved, btw. and the company
| indeed sells a "plug-and-play" system
|
| I'm aware that they have a product. My interest would be in
| hearing about how they solved their packaging, power, and
| cooling challenges. That's actually more interesting to me
| than the capabilities of the system to deliver neural network
| processing. There had to be a ton of fun engineering in
| figuring out how to make a system out of the raw technology.
| richk449 wrote:
| It has been covered in the popular tech press:
|
| https://spectrum.ieee.org/semiconductors/processors/cerebra
| s...
|
| https://spectrum.ieee.org/tech-
| talk/semiconductors/processor...
| rokobobo wrote:
| Out of curiosity, does anyone know of an actual company that's
| bought and uses one of these chips (say, the CS-1)? What's the
| use case?
|
| They mention "low latency datacenter inference." Surely, the
| Facebooks and Amazons in the world could do better by using
| localized, smaller-scale machines for inference, since their use
| cases are distributed geographically. The one use case that I can
| think of, is high-frequency trading. But I can tell you that
| 20PB/s memory bandwidth is an overkill, and also, you're going to
| have a bad time trying to cool this thing down in a colocation
| center that doesn't belong to you.
| thomasjudge wrote:
| Under the "Industries" part of their webpage they have
| testimonials from GlaxoSmithKline, Lawrence Livermore National
| Lab, Argonne National Lab
| jamessb wrote:
| This slide (https://images.anandtech.com/doci/16626/Cerebras%
| 20WSE2%20La...) lists announced Deployments at:
|
| Argonne National Laboratory
|
| Lawrence Livermore National Laboratory
|
| Pittsburgh Supercomputing Center
|
| Edinburgh Parallel Computing Centre
|
| GlaxoSmithKline
|
| "Other wins in heavy manufacturing, Pharma, Biotech, military
| and intelligence"
| throw737858 wrote:
| There is something on YouTube. They are very profitable,
| machine learning, filtering and stuff. No trading.
| wvaske wrote:
| There are AI models that are too big to run effectively on
| smaller hardware. The performance benefit of a monolithic
| system like this vs. distributed GPU-based systems can be
| orders of magnitude.
| [deleted]
| zitterbewegung wrote:
| I remember when I was a college student at UIC, I took a class
| that taught me how to use Maple (now they use Sage). The
| professor also taught supercomputing, and he bought some
| specialized NUMA computers and a few dense multiprocessor
| systems. If you don't have a large company that will support
| what you buy, there is a huge risk: if the company folds, your
| hardware becomes useless, to the point that you will let an
| undergraduate try to figure out how to fix it (that
| undergraduate was me). Unfortunately, I was unsuccessful.
| madengr wrote:
| Wouldn't these things, like quantum computers, be leased? The
| complexity is getting so high that you would need full-time
| staff dedicated just to that hardware. At least if it goes
| belly up, you are not stuck with unsupported hardware.
| stainforth wrote:
| Was the hardware broken or was it a matter of figuring out how
| to run something on it?
| zitterbewegung wrote:
| IIRC Hardware was broken and unserviceable (at least the ones
| I was given).
| choppaface wrote:
| The CS-1 had a TensorFlow / PyTorch API; the CS-2 probably has
| the same capability. There's still a lot of risk for sure, but
| at least it's not Sawzall.
| gautamcgoel wrote:
| Dumb question: how are they able to sell each of these for ~
| $2-3M? I doubt they have sold more than 20-30 of these systems.
| Given the upfront cost of design, verification, wafer
| construction, etc, shouldn't the unit price be much higher?
| alpaca128 wrote:
| Those processors apparently have a 100% yield rate in
| production because a small percentage of cores exist only for
| redundancy and can balance out small defects. The actual
| distribution of the software and the flow of data is handled by
| the compiler later.
|
| I don't know much about the development costs, though.
|
| Edit: I got this info from a YT video [0], which talks a bit
| about it and about the second generation of the chip.
| [0] https://www.youtube.com/watch?v=FNd94_XaVlY
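|
| A toy Poisson defect model shows why per-core redundancy makes
| such a difference (the defect density below is an assumed
| illustrative number, not Cerebras' actual figure):
|
|     import math
|
|     defect_density = 0.001                  # assumed defects per mm^2
|     wafer_area_mm2 = 46_000
|     expected_defects = defect_density * wafer_area_mm2   # ~46
|
|     # A monolithic die this size that had to be defect-free would
|     # essentially never yield:
|     monolithic_yield = math.exp(-expected_defects)       # ~1e-20
|
|     # If each defect only kills one of 850k tiny cores and the
|     # design tolerates ~1% dead cores, every wafer is usable:
|     cores, spare_fraction = 850_000, 0.01
|     print(f"dead cores ~{expected_defects:.0f}, "
|           f"budget {cores * spare_fraction:.0f}, "
|           f"monolithic yield {monolithic_yield:.0e}")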
| eb0la wrote:
| If I remember correctly, Intel did something similar with the
| i486SX - most were just i486DX chips with a faulty
| floating-point unit disabled.
|
| Not sure how they did it. Probably burning the SX microcode
| after testing?
| ramshanker wrote:
| Ever since the Cerebras CS-1 launched, I have been dreaming of
| either Intel or AMD going full-on into wafer-scale research.
| Individual university departments would love their own
| million-dollar personal supercomputers. Even businesses that
| can't afford today's supercomputers would want one if wafer
| scale made it affordable.
|
| I guess only inertia is stopping them from going all-in on
| wafer scale.
| xattt wrote:
| Direct link to whitepaper:
|
| https://f.hubspotusercontent30.net/hubfs/8968533/Cerebras-CS...
| mikewarot wrote:
| No actual specifications about precision of computations,
| operations supported, etc., just impressive I/O numbers.
| tyingq wrote:
| There's better info here: https://blog.inten.to/hardware-for-
| deep-learning-part-4-asic...
|
| _" The instruction set supports operations on INT16, FP16,
| and FP32 types. Floating-point adds, multiplies, and fused
| multiply-accumulate (or FMAC, with no rounding of the product
| prior to the add) can occur in a 4-way SIMD manner for 16-bit
| operands. The instruction set supports SIMD operations across
| subtensors of four-dimensional tensors, making use of tensor
| address generation hardware to efficiently access tensor data
| in memory."_
|
| The source appears to be this PDF:
| https://arxiv.org/pdf/2010.03660.pdf
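|
| The "no rounding of the product prior to the add" detail is
| standard fused-multiply-accumulate behaviour; a rough FP16
| emulation in Python/NumPy (my own toy numbers) of why it
| matters:
|
|     import numpy as np
|
|     a, b = np.float16(1 / 3), np.float16(3.0)
|     acc = np.float16(-1.0)
|
|     # Unfused: the product is rounded to FP16 (to 1.0 here) before
|     # the add, so the small residual cancels away entirely.
|     unfused = np.float16(a * b) + acc                 # -> 0.0
|
|     # Fused: keep the product exact (emulated in float64) and
|     # round only once, after the add.
|     fused = np.float16(np.float64(a) * np.float64(b) + np.float64(acc))
|
|     print(unfused, fused)                # 0.0 vs ~-0.000244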
| polymorph1sm wrote:
| Unless they are able to introduce a smaller, "GPU"-sized
| version, this will only be a nice demo of what we can do with
| state-of-the-art semiconductor manufacturing.
| f6v wrote:
| Where do you think all those next-gen deep learning models for
| autonomous weapons are going to be trained? For more civilian
| applications, imagine GPT-3.
| mmazing wrote:
| I'm sure people had the same things to say about computers that
| required the space of entire rooms in the 40s.
|
| "That's nice but we'll never be able to scale this down any
| more."
| JohnJamesRambo wrote:
| Isn't scaling it down just the normal CPUs we have now?
| chrisMyzel wrote:
| So is this an available product? Everyone hates "contact
| sales" / "get a demo" websites - what's the price, what's the
| spec, where can I order?
| eb0la wrote:
| It is essentially a B2B product for deep-pocketed customers.
| Think energy, defense, or a big financial company (or fund).
|
| If you can quickly calculate and iterate on something you want
| to build / identify where it is feasible to drill / etc., then
| the price of this is peanuts compared with doing the wrong
| thing (tm).
| qayxc wrote:
| Just like IBM mainframes, you won't be able to just "click here
| to buy" these.
|
| They do seem to sell actual systems, though - for several
| million apiece, so not exactly a product for the masses.
| aseipp wrote:
| It's an eight-figure product, so regardless of how you or
| anyone else on a forum feels about it (I mean, everyone likes
| to window shop, including me), I suspect their sales pipelines
| aren't going to be hamstrung by it.
| eb0la wrote:
| Even at eight figures, it still makes sense for accurate
| value-at-risk calculation.
|
| A friend who works in risk management told me that in a
| financial institution you must have enough cash "at rest" to
| cover your part of the whole VaR.
|
| The problem here is that VaR is hard to calculate.
|
| For a small bank it might be simple, but for a big one exposed
| to thousands of financial instruments it is not simple...
|
| ... and every million that you can have at work counts.
| mhh__ wrote:
| These kinds of monster chips are probably sold as part of a
| solution to a specific problem, so they probably don't have a
| fixed price, in the sense that no one buys just the chip.
| utf_8x wrote:
| It's one of those products where if you have to ask, you can't
| afford it.
| chmod775 wrote:
| Since this chip is a whole wafer, we can make some
| back-of-the-envelope calculations.
|
| Let's pretend this was an Intel product and Intel's typical
| yield was like 150 CPUs / wafer. Using high-end Intel CPUs as a
| price-point, we would already be in the $300,000 price range
| for this chip.
|
| But this isn't an Intel product, and it isn't made at scale;
| it's more likely built to order. Further, they don't just sell
| you the wafer/chip, but a complete solution. They have to
| recoup development costs with far fewer units sold.
|
| I wouldn't be surprised if this goes for $2m - $5m.
|
| Edit: The previous generation of this product was sold "for a
| couple million" according to some voices on the internet.
| wejick wrote:
| I just recently read the interview transcript [0] with the CEO
| and the CTO (Jim Keller) of Tenstorrent, which I believe is in
| the same field. It's an interestingly deep conversation that
| opened my eyes a bit about AI processors, and how it's not only
| about a chip that accelerates "AI tasks".
|
| [0] https://www.anandtech.com/show/16709/an-interview-with-
| tenst...
| dragontamer wrote:
| These "AI processors" are just matrix-multiplication engines,
| often 8-bit or 16-bits. Maybe 4-bit in some cases.
|
| 16-bit (and smaller) just isn't enough for most compute
| problems. But emphasis on "most". Neural Nets clearly are fine
| with 16-bit, but some video game lighting effects can be
| calculated on 16-bit and look "good enough".
|
| Maybe a certified jewelry appraiser can tell the difference
| between a 1.3 refractive index and 1.35, but the typical video
| game player (and, dare I say, human) won't be able to tell the
| difference. If all the light bounces are just slightly off, as
| long as it's "close enough", you probably have good-enough-
| looking ray tracing or whatever.
|
| --------
|
| I've also heard of "iterative solvers" being accelerated by
| 16-bit and 32-bit passes before doing a 64-bit pass. That's not
| applicable to all math problems, of course, but it's still a
| methodology where your 16-bit performance accelerates your
| ultimate 64-bit answer.
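|
| A minimal NumPy sketch of that pattern (mixed-precision
| iterative refinement for Ax = b; float32 stands in for the
| low-precision pass, and the matrix is made artificially
| well-conditioned so the cheap solve converges):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     n = 200
|     A = rng.standard_normal((n, n)) + n * np.eye(n)
|     b = rng.standard_normal(n)
|
|     # Cheap low-precision solve (ideally you'd factor A once and
|     # reuse the factorization for every solve below).
|     x = np.linalg.solve(A.astype(np.float32),
|                         b.astype(np.float32)).astype(np.float64)
|
|     for _ in range(3):
|         r = b - A @ x                       # residual in float64
|         dx = np.linalg.solve(A.astype(np.float32),
|                              r.astype(np.float32))
|         x += dx.astype(np.float64)          # correction step
|
|     print(np.linalg.norm(b - A @ x))        # near float64 accuracy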
| notum wrote:
| Skynet is prettier than I thought it would be.
| mhh__ wrote:
| Don't be silly, Skynet will be an IBM z
| wiz21c wrote:
| For someone who started coding when 128 kilobytes were
| considered a luxury, that is just, well, incommensurate :-)
| tyingq wrote:
| In a way, it's a throwback to that. Each "core" has access to
| 48 kB of SRAM; there is no shared memory across cores.
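|
| Which is where the headline figure comes from (simple
| arithmetic on the numbers above):
|
|     cores = 850_000
|     sram_per_core_kb = 48
|     print(f"{cores * sram_per_core_kb / 1e6:.1f} GB")  # ~40.8 GB total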
| wiz21c wrote:
| 48 kB, my comfort zone!
| extr wrote:
| I would be curious to understand what the equivalent cluster
| solution would look like. A big selling point here is apparently
| that your researchers "don't have to understand the do's and
| don'ts of cluster scale computing", but if your budget is already
| in the millions, how big of an issue is that really? Is the
| barrier to entry for large scale NN training really that there
| aren't enough engineers with experience navigating the issues on
| commodity/cloud hardware?
|
| Overall I guess I'm just confused about what scale of problems
| this is meant for. GPT-3 cost ~$5M to train, GPT-2 ~$50k. GPT-3
| is way too big to fit into 40 GB of SRAM; GPT-2 fits but isn't
| exactly bleeding edge. If this product is $2-5M, you could
| train GPT-2 40-100 times over to make it cost-effective, but
| now you're locked into the scale provided by this platform, and
| not all problems are GPT-2 sized. I'm not an expert, so I'm
| happy for someone else to chime in and correct me if I've gone
| wrong somewhere.
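|
| Rough numbers behind the "too big to fit" claim (parameter
| counts are the commonly cited ones; 2 bytes/param assumes FP16
| weights and ignores activations and optimizer state):
|
|     models = {"GPT-2": 1.5e9, "GPT-3": 175e9}    # parameters
|     sram_gb = 40
|     for name, params in models.items():
|         gb = params * 2 / 1e9                    # FP16 weights
|         verdict = "fits" if gb <= sram_gb else "does not fit"
|         print(f"{name}: ~{gb:,.0f} GB of weights -> {verdict}")
|     # GPT-2: ~3 GB (fits); GPT-3: ~350 GB (does not fit)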
| blihp wrote:
| GPT-2/3 are at the high end in terms of size. There's a whole
| universe of problems, and most of the market, below that. A
| great many business/research needs, for at least the near
| future, would fit into a CS-2 quite nicely.
| extr wrote:
| Yes, I suppose you're right; GPT-2/3 aren't particularly
| representative of the "average" industry problem. Honestly,
| what I'm jonesing for is an insider's take on the cost/benefit
| of a solution like this vs. the cloud, in terms of problem
| size, future flexibility, raw price, talent required, etc. Even
| just a ballpark "this is probably an order of magnitude more
| efficient than cloud GPU compute, game changer" vs. "we'd have
| to carefully consider and test performance, etc."
|
| But hey, that's the hard part, right? There probably aren't
| many people with CS-1 experience at all, let alone browsing HN
| :D
___________________________________________________________________
(page generated 2021-06-10 23:02 UTC)