[HN Gopher] Chinese researchers planning 1,600-core chips that u...
___________________________________________________________________
Chinese researchers planning 1,600-core chips that use an entire
wafer
Author : _____k
Score : 107 points
Date : 2024-01-22 07:20 UTC (1 days ago)
(HTM) web link (www.tomshardware.com)
(TXT) w3m dump (www.tomshardware.com)
| a-french-anon wrote:
| 1.7%
| Alifatisk wrote:
| What do you mean?
| a-french-anon wrote:
| https://www.semiaccurate.com/2009/09/15/nvidia-
| gt300-yeilds-...
|
| Old memes die hard
| ethbr1 wrote:
| Does anyone know how yields are billed, e.g. by TSMC?
|
| If I'm Nvidia, and I contract for X volume on Y process,
| and TSMC delivers it with Z yield... how does the +/- to Z
| work?
|
| I'd assume it isn't completely Nvidia's to eat? More like
| there's an expected yield and then bonuses / penalties to
| TSMC for above/below that?
| andrewstuart wrote:
| I would have said 1.3.
| WhereIsTheTruth wrote:
| Those who try may get 1.7%
|
| Those who don't may forever get 0%
| adhesive_wombat wrote:
| Wayne Gretzky's post-retirement career in semiconductor
| engineering may not have been expected, but seems to be going
| well.
| blihp wrote:
| That's because he doesn't design for where the state of the
| art chips are, he designs for where they are going to be.
| berserk1010 wrote:
| China has neither the money nor the economy to support such a
| low-yield effort over the long term.
|
| China has ordered its local governments to halt public-
| private partnership projects identified as "problematic" and
| replaced a 10% budget spending allowance for these ventures
| with a vetting mechanism by Beijing as it tries to curb
| municipal debt risks.
| https://www.reuters.com/markets/asia/china-orders-local-
| gove...
|
| China's Economy Has Picked Up Traits Reminiscent Of The Great
| Depression. https://www.forbes.com/sites/miltonezrati/2024/01
| /22/chinas-...
|
| Also, from an unconfirmed source: the central government has
| ordered high-debt places like Tianjin, Inner Mongolia,
| Heilongjiang, Chungking, Guizhou, and a few others to stop
| any new construction in 2024, allowing only construction
| that provides water, electricity, or heat.
| WhereIsTheTruth wrote:
| Every year, economists are saying China is collapsing.. yet
| they keep delivering; they seem immune to the organized FUD
| adventured wrote:
| Collapse hasn't been the suggestion in fact. That's a
| straw premise to distract from the negative claims (which
| are supported by persistently bad economic data).
|
| The US didn't collapse due to the great depression
| either.
|
| What is being suggested, is that China is facing a very
| bad economic stretch as a result of their extraordinarily
| poor command economy choices.
|
| Beijing's fabled (supposed) long-term thinking has shown
| itself to be a complete fraud. The Emperor, Xi, is
| wearing no clothes, as is the case with all dictators. An
| economy the scale and complexity of China's can't be run
| effectively with an authoritarian command approach. We
| have been seeing the proof of that increasingly since the
| great recession hit, wherein China switched to forever
| stimulus & debt fakery to prop up their flagging economy.
| In the span of a decade China became the most indebted
| nation in world history, while their growth sank below
| that of the US (which is a very mature, slower growth
| economy). And now the imploding demographics are setting
| in hard and fast, while the affluent world shuns China
| (leaving them with key partners like Russia, Iran, North
| Korea).
|
| It's quite obvious that while China kicked the can down
| the road for a long time, the cost of doing so just keeps
| increasing in the form of negatives on their economy.
| WhereIsTheTruth wrote:
| economy is weakening globally, it affects everyone,
| including both the US and the EU, so your argument falls
| flat
|
| any positive comment about China gets downvoted to
| oblivion, further emphasizing the theory of an organized
| FUD against China
| emptysongglass wrote:
| > any positive comment about China gets downvoted to
| oblivion, further emphasizing the theory of an organized
| FUD against China
|
| Just based on this very comment thread you're wrong, so
| how are we to trust any of your other rebuttals?
|
| Even well-sourced critiques of China are greytext so one
| may as well assume the opposite if they were taking your
| position.
| Dalewyn wrote:
| >Beijing's fabled (supposed) long-term thinking has shown
| itself to be a complete fraud. ...<snip>
|
| Been hearing all that and more for _at least_ the past 25
| years or so, and still China's doing fine, blew Japan
| out of the #2 GDP spot, and far as I can tell is winning
| the Cold War of our present era.
|
| So, thanks for reaffirming what you tried to refute.
| mensetmanusman wrote:
| Are they doing fine if there will be 500,000,000 fewer
| Chinese by 2100?
| berserk1010 wrote:
| Not sure if you call this "delivering"
|
| 1.) The heavy market losses in 2024 come hot on the heels
| of a bruising run last year, when the CSI 300 index,
| comprising 300 major stocks listed in Shanghai and
| Shenzhen, fell more than 11%. By contrast, the United
| States' benchmark S&P 500 index climbed 24% in 2023,
| while Europe's grew almost 13%. Japan's Nikkei 225 soared
| 28% last year and is still going strong, notching gains
| of nearly 10% so far this month.
| https://www.cnn.com/2024/01/22/business/china-stock-
| market-f...
|
| 2.) China suffers from deflation, while the rest of the
| world combats inflation. Not only does deflation signal a
| stagnating economy, it can lead to high unemployment,
| unaffordable debt repayment, and dismal outcomes for
| businesses. In the worst cases, deflation can lead an
| economy into a recession, or even a depression.
| https://www.wsj.com/world/china/deflation-worries-deepen-
| in-...
|
| 3.) Crushing debt. Going back further, China accounts for
| over half of the entire world's total debt-to-GDP
| increases since 2008.
| https://www.geopoliticalmonitor.com/backgrounder-china-
| econo... https://www.bloomberg.com/news/newsletters/2024-
| 01-06/bloomb...
|
| 4.) China's youth unemployment rate hit consecutive
| record highs in recent months. From April to June, the
| jobless rate for 16- to 24-year-olds reached 20.4%, 20.8%
| and 21.3% respectively.
| https://www.cnn.com/2023/08/14/economy/china-economy-
| july-sl.... For reference, G7 countries are at 10% and the US is at
| 8%: https://data.oecd.org/unemp/youth-unemployment-rate.htm
| corethree wrote:
| These are just problems. Just like how the US has some
| really big glaring problems:
|
| 1. Incapable of building infrastructure, can't build the
| CA hsr while China built an entire network across the
| country.
|
| 2. Can't raise the population out of poverty. You see
| homeless people and drug addicts everywhere.
|
| 3. Unemployment is going up, inflation is going up.
|
| 4. Wealth inequality is rising. Housing is becoming
| more and more unaffordable for most Americans.
|
| I mean this is "delivering" too. As if 4 random economic
| problems a country is facing is indicative of total
| collapse.
| selectodude wrote:
| Everybody said the USSR was collapsing and it kept
| growing until the country collapsed. China's GDP numbers
| are notoriously opaque and difficult to vet. It's no
| different than saying Los Angeles is going to get
| flattened by an earthquake. It's inevitable but good luck
| predicting it.
| pphysch wrote:
| The USSR had obviously deep political dysfunction; it was
| a hollow gerontocracy in its latter decades. It had run
| out of ideas and was desperately looking to the West for
| inspiration.
|
| There is a superpower that bears stark resemblance to the
| dying USSR, and it's not the PRC.
| sudosysgen wrote:
| The fall of the USSR was more of a political collapse
| caused by elites thinking another system would be better,
| not economic collapse.
| bigbillheck wrote:
| I've seen a few hundred too many "the Chinese economy is
| about to collapse" stories to believe any of them.
| rightbyte wrote:
| The concept is interesting.
|
| I guess it will have to be able to route around broken cores?
| dragontamer wrote:
| I've seen other projects that market themselves as "wafer
| scale". https://www.cerebras.net/product-chip/
|
| > I guess it will have to be able to route around broken cores?
|
| Yeah, but you'll also have to route around broken routes, and
| that starts to get a bit too much chicken-and-egg problem for
| me.
|
| I guess you could design something akin to error correction
| codes, meaning you're resilient to X failures. Ex: a 64-bit bus
| could have physically Single Error Correction, Double Error
| Detection, which IIRC would be 72 physical wires.
|
| That means any wire can completely fail, but you still have a
| 64-bit bus (indeed, the 8x error-correction wires could all
| fail and you'd still have a 64-bit bus).
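|
| As a rough sketch of where the 72 comes from: assuming a standard
| extended Hamming code (my assumption, not necessarily what a
| wafer-scale link would actually use), the check-wire count follows
| from the Hamming bound plus one overall parity bit. The helper name
| secded_check_bits is made up for illustration.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by an extended Hamming (SECDED) code."""
    r = 0
    # Single-error correction needs 2^r >= data_bits + r + 1.
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One extra overall-parity bit adds double-error detection.
    return r + 1


for width in (8, 16, 32, 64):
    check = secded_check_bits(width)
    print(f"{width}-bit bus: {check} check wires, {width + check} total")
```

| For a 64-bit bus this gives 8 check wires, so 72 in total. It only
| counts wires; it says nothing about how a real link would find out
| which wire is dead.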
|
| ------------
|
| At some point, it makes more sense to cut the chips out, test
| them for reliability. Then cut the router out, and test those
| for reliability, and then finally glue them together.
|
| On the other hand, doing it all on one wafer has cost savings /
| manufacturing simplicity. The math is likely difficult for
| optimizing over costs, production speeds, and so forth.
| fspeech wrote:
| You can design supervisory nodes and routing elements with
| relaxed design rules that guarantee high yield for these
| critical elements.
| shreezus wrote:
| There are techniques to disable nodes that fail testing, so it
| shouldn't be a problem (within reason).
| ramshanker wrote:
| Now that Cerebras has proven it works, I would love to have an
| x86 / ARM / Nvidia vendor do this. And for best results, bring one
| of the memory makers on board as well. Cerebras seems to have
| underestimated the memory requirements of LLMs. So imagine, 16 H200
| GPU along with single digit TB HBM memory stitched together on a
| single substrate wafer. It seems doable with the right technology.
|
| Go for it, China. You are on a good track here.
| verall wrote:
| > 16 H200 GPU along with single digit TB HBM memory stitched
| together on a single substrate wafer
|
| How on earth would you cool this?
| MacsHeadroom wrote:
| Submersion cooling.
| aleph_minus_one wrote:
| > How on earth would you cool this?
|
| With liquid nitrogen. :-)
| snakeyjake wrote:
| Seymour Cray solved this problem 40 years ago.
|
| https://en.wikipedia.org/wiki/Cray-2#/media/File:Cray2.jpg
|
| Bring back toxic waterfalls.
| shrx wrote:
| The issue with Fluorinert is not toxicity but a very high
| global warming potential.
| mensetmanusman wrote:
| Fluorinerts with very low global warming potentials have
| been invented since then.
| monetus wrote:
| > _Although Fluorinert was intended to be inert, the
| Lawrence Livermore National Laboratory discovered that
| the liquid cooling system of their Cray-2 supercomputers
| decomposed during extended service, producing some highly
| toxic perfluoroisobutene.[5] Catalytic scrubbers were
| installed to remove this contaminant._
|
| With that much constant heat, what chemical might be
| best/most practical I wonder.
|
| https://en.m.wikipedia.org/wiki/Novec_649
|
| https://en.m.wikipedia.org/wiki/Hydrofluoroether
| jdietrich wrote:
| 15kW dissipated over that area isn't particularly challenging
| for an industrial water cooler - within reason, you can just
| keep cranking up the flow rate. What'd scare me is power
| delivery, because they've probably got >10,000 amps going to
| the die.
|
| https://www.eetimes.com/powering-and-cooling-a-wafer-
| scale-d...
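|
| A back-of-the-envelope sketch of both numbers. The 15 kW figure is
| from the comment; the 10 K coolant temperature rise, the 0.8 V core
| voltage, and the function names are my own assumptions, picked only
| to show the arithmetic.

```python
WATER_CP_J_PER_KG_K = 4186.0   # specific heat of water, approximate
WATER_KG_PER_L = 1.0           # density of water, approximate


def water_flow_l_per_min(power_w: float, delta_t_k: float) -> float:
    """Water flow needed to carry power_w with a delta_t_k temperature rise."""
    kg_per_s = power_w / (WATER_CP_J_PER_KG_K * delta_t_k)
    return kg_per_s / WATER_KG_PER_L * 60.0


def supply_current_a(power_w: float, core_voltage_v: float) -> float:
    """Current needed to deliver power_w at core_voltage_v (I = P / V)."""
    return power_w / core_voltage_v


flow = water_flow_l_per_min(15_000, delta_t_k=10.0)       # assumed 10 K rise
amps = supply_current_a(15_000, core_voltage_v=0.8)       # assumed 0.8 V
print(f"{flow:.1f} L/min of water, {amps:.0f} A at the die")
```

| Under those assumptions the water flow is modest, while the current
| lands well above 10,000 A, which is the asymmetry the comment is
| pointing at.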
| buildbot wrote:
| 20K amps (!!!): https://vimeo.com/853557623
|
| They use probably the fanciest piece of rubber (and metal)
| ever made to pass the power in frontside to the die.
| Dylan16807 wrote:
| Doing it with all that water needing to get everywhere is
| definitely a feat.
|
| But the power itself isn't too bad. A square millimeter
| of wire can comfortably carry 20 amps, and you can scale
| that up pretty straightforwardly.
|
| It looks like each of those 84 rectangles has to deal
| with 240 amps and has multiple square centimeters of
| contact with the voltage regulator card above it.
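|
| The arithmetic behind that, as a quick sketch. All three inputs are
| the figures quoted in this subthread (20K amps, 84 rectangles, 20
| amps per square millimeter); the variable names are just mine.

```python
TOTAL_AMPS = 20_000    # total current quoted upthread
REGIONS = 84           # rectangles fed by the regulator cards
AMPS_PER_MM2 = 20      # rule-of-thumb current per mm^2 of wire

amps_per_region = TOTAL_AMPS / REGIONS
copper_mm2 = amps_per_region / AMPS_PER_MM2

print(f"{amps_per_region:.0f} A per region")
print(f"{copper_mm2:.1f} mm^2 of conductor per region")
```

| That works out to roughly 12 square millimeters of conductor per
| rectangle, far less than the multiple square centimeters of contact
| area available.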
| SAI_Peregrinus wrote:
| Direct-die vapor phase change cooler? That tends to be the
| fastest way to get heat out of a chip and to a big radiator.
| For industrial applications the need to run a big fan & a
| compressor is less problematic.
| ethbr1 wrote:
| That sounds great... until you get a defect on your networking
| block.
|
| Then what? You've got a heterogenous network with tons of "this
| core to this core is not like the others" exceptions (latency,
| bandwidth, etc).
|
| I know chip-to-chip/memory interconnects burn a ton of power, but
| fabbing discrete "biggest chip we can get with decent yield"
| still seems a solid tradeoff in the reality of < 100% yields.
|
| Does anyone have a link or search phrases on how this is
| currently handled for high-chiplet counts? E.g. interconnection
| routing architectures that are still reasonable with random
| manufacturing-time failing links
| arcticbull wrote:
| I assume it's probably an 1800-core wafer and they just fuse
| off the 200 defective cores. Some redundancy is likely
| just built-in based on the expectations of process reliability.
|
| Probably multiple networking blocks, too, and you'd use less
| demanding process features on the things that can't be
| duplicated. In fact you could probably even have FPGA-style
| soft programmable fabric interconnects to work around process
| failures.
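|
| A minimal sketch of how that kind of spare budget could be sized,
| assuming each core fails independently with the same probability
| (real defects tend to cluster, so treat this as a toy model). The
| per-core yields below are made-up illustrative numbers, and
| prob_enough_good is a hypothetical helper.

```python
from math import exp, lgamma, log


def prob_enough_good(total: int, needed: int, core_yield: float) -> float:
    """P(at least `needed` of `total` cores work), independent defects.

    Requires 0 < core_yield < 1. Terms are summed in log space so the
    large binomial coefficients do not overflow a float.
    """
    log_p, log_q = log(core_yield), log(1.0 - core_yield)
    result = 0.0
    for k in range(needed, total + 1):
        log_choose = lgamma(total + 1) - lgamma(k + 1) - lgamma(total - k + 1)
        result += exp(log_choose + k * log_p + (total - k) * log_q)
    return result


for y in (0.95, 0.90, 0.85):   # hypothetical per-core yields
    print(f"core yield {y:.2f}: P(>=1600 of 1800 good) = "
          f"{prob_enough_good(1800, 1600, y):.6f}")
```

| The point of the exercise: with 200 spares the wafer is almost
| always usable at a high per-core yield, and almost never usable once
| the per-core yield drops a bit below the 1600/1800 ratio.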
| ethbr1 wrote:
| That's great if it's just hitting compute cores.
|
| But when your wafer networking flows _through_ cores (because
| it's cores-all-the-way-down), a defective core starts to
| impact network performance. Which cascades into cache,
| memory, locality, etc. Which starts to make a very
| unpredictable hardware system for software to reason about.
|
| See also dragontamer's comment down below.
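|
| To make the "not like the others" effect concrete, here is a toy
| sketch: a plain 2D mesh where traffic hops core to core, with a
| breadth-first search standing in for whatever routing the real
| hardware would use. The mesh size, the dead-core pattern, and the
| function name hops are all hypothetical.

```python
from collections import deque


def hops(width, height, dead, src, dst):
    """Shortest hop count on a 2D mesh that avoids dead cores.

    Returns None when dst cannot be reached from src.
    """
    if src in dead or dst in dead:
        return None
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        (x, y), dist = queue.popleft()
        if (x, y) == dst:
            return dist
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and nxt not in dead and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


healthy = hops(5, 5, set(), (0, 2), (4, 2))
damaged = hops(5, 5, {(2, 1), (2, 2), (2, 3)}, (0, 2), (4, 2))
print(healthy, damaged)
```

| Three dead cores in a column double the hop count between the same
| two endpoints (4 hops versus 8), which is the kind of per-wafer
| latency exception software would have to reason about.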
| fspeech wrote:
| Not all transistors have to yield the same. You can use
| more forgiving design rules to ensure that critical network
| elements don't fail. There is also post fabrication repair.
| It all comes down to economics. There is no fundamental
| problem engineering wise.
| adgjlsfhk1 wrote:
| also the networking is a small fraction of the area so
| you naturally expect it to have fewer defects
| nerpderp82 wrote:
| Cerebras already solved this problem. So we have that
| existence proof. The redundancy overhead in v1 was 1-1.5%,
| and reportedly half that for v2 WSE.
|
| https://www.youtube.com/watch?v=8i1_Ru5siXc
| gregmac wrote:
| This is on a 22nm process, which is what Intel Haswell [1] (4th
| gen core) was using 10 years ago. The latest gen chips are now
| 7 and 5nm [2], and it seems a lot of the innovation in chip
| manufacturing is about shrinking this size.
|
| How much is being done to improve yields of these older process
| sizes, maybe using the improvements done for smaller sizes?
| Logically it must be possible to have 100% yield on wafers at a
| certain process size -- but what size is that?
|
| [1] https://en.wikipedia.org/wiki/Haswell_(microarchitecture)
|
| [2]
| https://en.wikipedia.org/wiki/Microprocessor_chronology#2020...
| Dylan16807 wrote:
| Core to core latency is already very heterogeneous on existing
| CPUs. Fusing off a few links isn't the end of the world.
| hulitu wrote:
| > China planning 1600-core chips that use an entire wafer -
| 'wafer-scale' designs
|
| ... and they will cool it by pouring water on it. /s
| throwup238 wrote:
| Pretty much. This is the way Cerebras does it:
| https://web.archive.org/web/20230812020202/https://www.youtu...
| lebean wrote:
| They can use these for their high speed trains!
| kylehotchkiss wrote:
| What about trains requires this level of processing power???
| RajT88 wrote:
| Well of course the LLM that is going to drive it.
|
| /me ducks
| scythe wrote:
| Our next stop is "Let's talk about something else."
|
| https://en.wikipedia.org/wiki/Tian%27anmendong_station
| cm2187 wrote:
| The article talks about chiplets. I presume the wafer will still
| be cut into distinct chips? I thought there were thermal (and
| yield) reasons for not making chips that are too large.
| blihp wrote:
| My take was that they currently have a working chiplet design
| and are looking to move to a wafer-scale (i.e. not cut up)
| design. Thermal/power/yield are all issues and the design has
| to take all of that into account. Cerebras has done it for
| their NN processors, so it has been done before.
| dboreham wrote:
| The article doesn't note this, but wafer scale integration is a
| very old idea. We discussed it at Inmos back in the day, since
| often the systems we built essentially consisted of many CPU die
| sliced out of the wafer, bonded into a package, then tiled onto a
| PCB[1]. But there are...issues: cooling for one. Iann Barron
| joked that you could make a toaster from two WSI wafers running
| full-tilt.
|
| [1] https://twitter.com/tnmoc/status/429638751904878592
| nickpsecurity wrote:
| There's been quite a few. I can't remember the first one I
| found which was also heavily analog. Here's another:
|
| https://www.kip.uni-heidelberg.de/vision/previous-projects/f...
|
| https://iopscience.iop.org/article/10.1088/2634-4386/acf7e4
| buildbot wrote:
| The HICANN chip is really interesting, never seen or heard of
| it before!
| nerpderp82 wrote:
| That is super cool. I wrote to SGS-Thomson when I was in
| highschool and they sent me what felt like a refrigerator box
| of manuals for the 400 through the 9000. Huge transputer fan.
| The simplicity was striking, and that you could construct a
| system that could handle huge numbers of threads communicating
| across the network with almost no "system software" in the way
| was mind blowing.
|
| Most famously, Gene Amdahl started
| https://en.wikipedia.org/wiki/Trilogy_Systems in the 80s to
| explore this idea.
|
| The calculus changes when you don't have to dice the wafer,
| packaging, etc. I'd say that we clock things now at the highest
| speed at which we can safely remove heat, so with these wafer-scale
| chips we trade frequency for area and need/should clock them
| much slower.
|
| At large production runs, the wafer in a Cerebras is 20k each
| for a system costing millions and the primary engineering feat
| is still cooling. I'd love to see a WSI system utilizing NTV
| (near threshold voltage) logic.
|
| https://semiengineering.com/near-threshold-computing-2/
|
| Another interesting design pattern that has arisen is that
| Cerebras, Esperanto, Tenstorrent, and InspireSemi are all mesh
| networks using message passing.
|
| What kinds of things did you work on at Inmos?
| Pet_Ant wrote:
| >> But there are...issues: cooling for one
|
| > That is super cool.
|
| ...so apparently not... ;)
| __MatrixMan__ wrote:
| > you could make a toaster from two WSI wafers running full-
| tilt.
|
| I'd totally eat compute toast, where can I get such a toaster?
| dylan604 wrote:
| There was a recent Ask HN about making a company to produce
| Made In USA toasters. Maybe you can pitch him this idea.
| mrtksn wrote:
| Not exactly a wafer but once I tried to create a very powerful
| LED bulb by combining LEDs of multiple 10W LED bulbs close
| together and yep, it got very hard to cool it down. I wasn't
| expecting it to get that hot, but later once I gave it some
| thought, I said silly me of course it would be hard to cool it
| down. When you put together multiple elements that heat-up, the
| radiator-to-heater ratio quickly deteriorates.
|
| And now when I read the title, the heat was the first thing
| that came to my mind.
| depereo wrote:
| If there's serious work that can't be effectively done any
| other way then complex and esoteric cooling setups to
| overcome those challenges are fine to design and use.
|
| Not sure what needs 1600 cores in one 'chip' but it's
| probably fairly impressive.
| mrtksn wrote:
| Sure, at some point solving the cooling becomes the path
| of least resistance. It's just surprisingly hard if you
| have to factor in reliability and dimensions - so it
| depends on the application.
| lazide wrote:
| I haven't done the math - why wouldn't something as
| 'simple' as a die diameter solid copper slug work?
|
| Easy to drill for water cooling, and at these scales
| pretty cheap.
|
| Or are we talking just getting the heat out to whatever
| heat management device is attached without burning
| something in the chip itself?
| mrtksn wrote:
| In my understanding, normally you have a tiny chip that
| is inside a larger package and then you can attach that
| packaged chip to something that takes away the heat. So
| the ratio of cooling element (package + heat remover) to
| heating element (the IC) is pretty large. Also, the speed
| of removing heat from the silicon IC is limited by the
| heat conductivity of the silicon itself and the materials
| used to make the package. The heating happens not in the
| silicon itself but in the "etched/printed" features
| on top of the silicon.
|
| Therefore, the amount of heat produced by the heating elements
| (the tiny wires over the wafer) grows faster than the heat
| dissipation capacity, which raises the temperature until
| the unit breaks.
|
| Notice that when the heating elements are close together
| you lose the horizontal heat gradient advantage since
| that grows with the perimeter, while the heat-generating
| elements grow with the area.
|
| Which means you have to get creative, add moving parts or
| use more exotic materials, which makes the thing
| significantly more expensive and less reliable as more
| things to break are added.
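|
| The perimeter-versus-area point as a tiny sketch, assuming a square
| heat source (my simplification; a wafer is round, but the scaling is
| the same). The side lengths are arbitrary illustrative values.

```python
def edge_to_area_ratio(side_mm: float) -> float:
    """Perimeter divided by area for a square of the given side length."""
    perimeter = 4.0 * side_mm
    area = side_mm ** 2
    return perimeter / area


for side in (10.0, 50.0, 200.0):   # hypothetical die / wafer-ish sizes
    print(f"{side:5.0f} mm side: {edge_to_area_ratio(side):.3f} mm of edge per mm^2")
```

| The ratio falls as 4 / side, so a source twenty times wider has one
| twentieth of the edge per unit of heat-producing area to spread heat
| sideways.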
| cornholio wrote:
| The traditional computer chip will have a power draw
| proportional to frequency and the square of voltage, which
| itself must be increased to control delay and raise frequency:
| P ∝ C × V² × f
|
| So if you are ready to accept a lower speed per core, the power
| draw can be controlled and you won't get a toaster.
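|
| A quick sketch of that proportionality (dynamic power scaling with
| capacitance, frequency, and the square of voltage). The scale
| factors below are hypothetical, and the function name is mine; this
| deliberately ignores leakage.

```python
def relative_dynamic_power(v_scale: float, f_scale: float) -> float:
    """Dynamic power relative to baseline when V and f are scaled.

    Capacitance is assumed unchanged, so it cancels out of the ratio.
    """
    return v_scale ** 2 * f_scale


# Hypothetical operating point: half the clock at 80% of the voltage.
ratio = relative_dynamic_power(v_scale=0.8, f_scale=0.5)
print(f"dynamic power falls to {ratio:.0%} of baseline")
```

| Halving the clock alone would only halve the power; the extra
| saving comes from the lower voltage that the slower clock permits.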
|
| One difficulty would be that modern technological nodes have
| high leakage and dissipate power even when they are not
| switching, making it more advantageous to clock aggressively,
| finish the computation as fast as possible and cut the power on
| that entire circuit for the remainder of the timer slot ("race
| to idle"), as opposed to reducing the frequency and prolonging
| the "on" phase.
|
| But that's a deliberate design choice, knowing the chip will be
| cut out and fitted with a substantial thermal solution. Wafer
| level power draw is definitely something you can control at
| the design stage.
| BizarroLand wrote:
| I wonder when we'll convert to plasma computing, as in the
| entire instruction set and operators running as a waveform in
| a condensed cloud of plasma where the voltage outs at
| specific points equal the computational result of the inputs?
|
| If we could do that then we could run terahertz frequencies.
| im3w1l wrote:
| I'm not well versed in this subject but could you solve the
| heat issue by running the cores at low voltage and clock
| speeds?
| nmstoker wrote:
| Yes, I was going to say I thought this had been looked at with
| the transputer.
|
| I also recall Clive Sinclair suggesting this approach in the
| late 80s (can't recall if that was somehow related to the
| transputer or was completely separate). I believe his idea was
| that the faulty CPUs that naturally exist due to wafer defects
| would be cut off from the main group (I could have
| misremembered but I think it may have been via some kind of
| self test process).
| TriangleEdge wrote:
| What kind of wattage is expected for this? And how is heat
| management done?
| jeroen79 wrote:
| How the hell are they gonna cool a cpu like that?
| sandworm101 wrote:
| Big block of copper. Water pumped through channels. The total
| heat output isn't all that big compared to other water-cooled
| processes. A car engine cylinder pumps out more heat across a
| similar area. You will just need pumps and fans bigger than the
| toy parts used in normal pc cooling.
| hyperthesis wrote:
| You just need a car radiator and engine.
| sandworm101 wrote:
| Not really. A small automotive radiator, like on a
| motorcycle, can easily handle a few thousand watts. You
| just need a 200ish watt fan and a 100+ watt pump, not the
| sort of thing for home use but nothing extreme in the world
| of cooling products.
| Torkel wrote:
| "China" is "planning" to do this...?
|
| A better title might be: "Researchers in China studying 1600-core
| chip".
| dang wrote:
| Ok, we've replaced China with some Chinese researchers in the
| title above.
| brindlejim wrote:
| Software determines whether good hardware succeeds or fails.
| China has yet to build successful software ecosystems on top of
| its hardware innovations.
| Barrin92 wrote:
| Wait what, China has by significant margin the second largest
| software ecosystem in the world, and in VC terms is comparable
| to the US.
| foofie wrote:
| > China has yet to build successful software ecosystems on top
| of its hardware innovations.
|
| I'm not sure your belief is grounded in reality. I'd go as far
| as to assert that if China was able to research and develop
| these chips, both their design and production processes, they
| certainly are not leaving software as an afterthought.
|
| Nevertheless, even entertaining your fantasy, once these chips
| are out and people like you and me are able to take these toys
| out to play with them, you'll soon get software that does
| something interesting and useful. Software is hardly the hard
| part, or even the costlier one.
| leashless wrote:
| Been hearing it for 30 years.
|
| Finally?
| femto wrote:
| There were experiments with wafer scale FPGAs in the 1990s. The
| idea was that being programmable, the final chip could be
| programmed to route around defects. Lasers were also used to
| eliminate defective cells.
| FpUser wrote:
| >"The latter has only been managed by Cerebras so far, but it
| looks like Chinese developers are looking towards them as well."
|
| Cerebras's wafer has 850,000 cores, which totally dwarfs the 1600
| cores on the Chinese wafer. I did read, though, that Cerebras cores
| are optimized for tensor ops. Does the Chinese version have more
| universal cores, or is it just a much smaller clone of Cerebras?
___________________________________________________________________
(page generated 2024-01-23 23:01 UTC)