[HN Gopher] Single-chip processors have reached their limits
___________________________________________________________________
Single-chip processors have reached their limits
Author : blopeur
Score : 129 points
Date : 2022-04-04 16:53 UTC (6 hours ago)
(HTM) web link (spectrum.ieee.org)
(TXT) w3m dump (spectrum.ieee.org)
| Veliladon wrote:
| The M1 Ultra is fabricated as a single chip. The 12900K is
| fabricated as a single chip and is still a quarter the size of
| the M1 Ultra. Zen 3 puts 8 cores on a CCX instead of four
| because DDR memory controllers don't have infinite memory
| bandwidth (contrary to AMD's wishful nomenclature) and make
| shitty interconnects between banks of L3.
|
| Chiplets are a valid strategy that is going to be used in the
| future, but there are still more tricks that CPU makers have up
| their sleeves that they will need to use out of necessity. They're
| nowhere near their limits.
| paulmd wrote:
| "chip" is ambiguous the way you're using it.
|
| The M1 Ultra is two _dies_ in one _package_. The package is
| what goes on the motherboard.
|
| You can also count the memory modules as dies/packages as well.
| It's not incorrect to say that the M1 Ultra has a bunch of LPDDR5
| _packages_ on it, and each LPDDR5 _package_ may have multiple
| _dies_ in it as well.
|
| But depending on context it also wouldn't be incorrect to refer to
| the M1 Ultra _package_ as a chip, even if it's got more packages
| on it. From the context of the motherboard maker, the CPU BGA unit
| is the "package".
|
| Anyway no, Ultra isn't a monolithic die in the sense you're
| meaning, it's two dies that are joined, Apple just uses a
| ridiculously fat pipe to do it (far beyond what AMD is using
| for Ryzen) such that it basically appears to be a single die.
| The same is true for AMD, Rome/Milan are notionally NUMA -
| running in NPS4 mode can squeeze some extra performance in
| extreme situations if applications are aware of it, and there are
| some weird oddities caused by memory locality in "unbalanced"
| configurations where each quadrant doesn't have the same number
| of channels. It just doesn't feel like it because AMD has done
| a very good job hiding it.
|
| However you're also right that we haven't reached the end of
| monolithic chips either. Splitting a chip into modules imposes
| a power penalty for data movement, it's much more expensive to
| move data off-chiplet than on-chiplet, and that imposes a limit
| on how finely you can split your chiplets (doing let's say 64
| tiny chiplets on a package would use a huge amount of power
| moving data around, since everything is off-chip). There are
| various technologies like copper-copper bonding and EMIB that
| will hopefully lower that power cost in the future, but it's
| there.
|
| And even AMD uses monolithic chips for their laptop parts,
| because of that. If _any_ cores are running, the IO die has to
| be powered up, running its memory and infinity fabric links,
| and at least one CCD has to be powered up, even if it's just
| to run "hello world". This seems to be around 15-20W, which is
| significant in the context of a home or office PC.
|
| It's worth noting that Ryzen is not really a desktop-first
| architecture. It's server-first, and AMD has found a clever way
| to pump their volumes by using it for enthusiast hardware.
| Servers don't generally run 100% idle, they are loaded or they
| are turned off entirely and rebooted when needed. If you can't
| stand the extra 20W at idle, AMD would probably tell you to buy
| an APU instead.
| 2OEH8eoCRo0 wrote:
| > The M1 Ultra is fabricated as a single chip.
|
| I'm curious how much the M1 Ultra costs. It's such a massive
| single piece of glass I'd guess it's $1,200+. If that's the
| case it doesn't make sense to compare the M1 Ultra to $500 CPUs
| from Intel and AMD.
| mrtksn wrote:
| Wouldn't the price be primarily based on capital investment
| and not so much on the unit itself? After all, it's
| essentially a print out on a crystal using reeeeeally
| expensive printers. AFAIK Apple's relationship with TSMC is
| more than a customer relationship.
| 2OEH8eoCRo0 wrote:
| In a parallel universe where Intel builds and sells this
| CPU- what's the price? Single chip, die size of 860 square
| mm, 114 billion transistors, on package memory.
|
| It just got me thinking the other day since all of these
| benchmarks pit it against $500-$1000 CPUs and it doesn't
| seem to fall in that price range at all. Look at this
| thing:
|
| https://cdn.wccftech.com/wp-
| content/uploads/2022/03/2022-03-...
| headass wrote:
| also all the other shit that's on the chip, RAM, etc.
| gameswithgo wrote:
| It is commonly said that on the new M1 Macs the RAM is on the
| chip; it is not. It is on the same substrate, but it's just
| normal (fast) DRAM chips soldered on nearby.
| grishka wrote:
| If there's a defective M1 Ultra, they can cut it in half and
| say those are two low-end M1 Max.
| 2OEH8eoCRo0 wrote:
| Wouldn't they only get, at most, one low-end M1 Max if
| there is a defect?
| grishka wrote:
| They sell cheaper models with some cores disabled, that's
| what I meant by low-end. Ever wondered what's the deal
| with the cheapest "7-core GPU" M1?
| monocasa wrote:
| If the defect is in the right place, Apple apparently
| sells M1 Max chips with some GPU cores disabled.
| wmf wrote:
| I estimate that Apple's internal "price" for the M1 Ultra is
| around $2,000. Since most of the chip is GPU, it should
| really be compared to a combo like 5950X + 6800 XT or 12900K
| + 3080.
| 2OEH8eoCRo0 wrote:
| It wouldn't surprise me. M1 Ultra has 114 billion
| transistors and a total area of ~860 square mm. For
| comparison, an RTX 3090 has 28 billion transistors and a
| total area of 628 square mm.
| sliken wrote:
| Dunno, M1 Ultra includes a decent GPU, which the $500 CPUs
| from Intel and AMD do not. Seems relatively comparable to a
| $700 GPU (like an RTX 3070 if you can find one) depending on
| what you are using. Sadly Metal-native games are rare; many
| use some Metal wrapper and/or Rosetta emulation.
|
| Seems pretty fair to compare an Intel Alder Lake or higher-end
| AMD Ryzen AND a GPU (RTX 3070 or Radeon 6800) to the M1 Ultra,
| assuming you don't care about power, heat, or space.
| touisteur wrote:
| Has anyone managed to reach the actual advertised 21 FP32
| TFLOPS? I'm curious. Even BLAS or pure custom matmul stuff?
| How much of that is actually available? I can almost
| saturate and sustain an NVIDIA A40 or A4000 at their peak
| perf, so I'm wondering whether anyone has written something there?
| cma wrote:
| M1 ultra is two chips with an interconnect between them I
| thought? Or is the interconnect already on die with them?
|
| (Edit: sounds like it is two: "Apple said fusing the two M1
| processors together required a custom-built package that uses a
| silicon interposer to make the connection between chips. "
| https://www.protocol.com/bulletins/apple-m1-ultra-chip )
| monocasa wrote:
| > M1 ultra is two chips with an interconnect between them I
| thought? Or is the interconnect already on die with them?
|
| It's either depending on how you look at it. The active
| components of the interconnect are on the two M1 dies, but
| the interconnect itself goes through the interposer as well.
| fulafel wrote:
| Some older stuff for reference: IBM POWER5 and POWER5+
| (2004 & 2005) are MCM designs that had 2-4 CPU chips plus cache
| chips in the same package.
|
| Link: https://en.wikipedia.org/wiki/POWER5
| sliken wrote:
| Pentium Pro from 1995 had two pieces of silicon in the package:
| https://en.wikipedia.org/wiki/Pentium_Pro
| h2odragon wrote:
| PPros are quite hard to find now because the "gold
| scavengers" loved them. As i recall, at the peak in 2008,
| they were $100ea and more for the ceramic packages. All that
| interconnect was tiny gold wires, apparently.
| sliken wrote:
| Heh, had no idea. They seemed to have a pretty limited run,
| ran at up to 200 MHz, but were pretty quickly replaced by the
| Pentium II at 233 MHz on a single die.
| marcodiego wrote:
| Makes me remember the processor in the film terminator 2:
| https://gndn.files.wordpress.com/2016/04/shot00332.jpg
| bob1029 wrote:
| Despite the limitations apparently present in single chip/CPU
| systems, they can still provide an insane amount of performance
| if used properly.
|
| There are also many problems that are literally impossible to
| make faster or more correct than by simply running them on a
| single thread/processor/core/etc. There always will be forever
| and ever. This is not a "we lack the innovation" problem. It's an
| information-theoretic / causality problem you can demonstrate
| with actual math & physics. Does a future event's processing
| circumstances maybe depend on all events received up until now?
| If yes, congratulations. You now have a total ordering problem
| just like pretty much everyone else. Yes, you can cheat and say
| "well these pieces here and here dont have a hard dependency on
| each other", but its incredibly hard to get this shit right if
| you decide to go down that path.
|
| The most fundamental demon present in any distributed system is
| latency. The difference between L1 and a network hop in the same
| datacenter can add up very quickly.
|
| Again, for many classes of problems, there is simply no
| handwaving this away. You either wait the requisite # of
| microseconds for the synchronous ack to come back, or you hope
| your business doesn't care if John Doe gets duplicated a few times
| in the database on a totally random basis.
| AnthonyMouse wrote:
| The alternative is speculative execution. If you can guess what
| the result is going to be, you can proceed to the next
| calculation and you get there faster if it turns out you were
| right.
|
| If you have parallel processors, you can stop guessing and just
| proceed under both assumptions concurrently and throw out the
| result that was wrong when you find out which one it was. This
| is going to be less efficient, but if your only concern is
| "make latency go down," it can beat waiting for the result or
| guessing wrong.
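|
| A minimal sketch of that "run both assumptions at once" idea in
| Python (the branch functions and the 1-second stand-in for the
| slow result are made up purely for illustration):
|
|       from concurrent.futures import ThreadPoolExecutor
|       import time
|
|       def slow_condition():
|           time.sleep(1)   # stand-in for the result we'd otherwise wait on
|           return True
|
|       def branch_if_true():
|           return "result computed assuming True"
|
|       def branch_if_false():
|           return "result computed assuming False"
|
|       with ThreadPoolExecutor(max_workers=3) as pool:
|           cond = pool.submit(slow_condition)
|           t = pool.submit(branch_if_true)    # speculate: condition is True
|           f = pool.submit(branch_if_false)   # speculate: condition is False
|           # keep the speculative result that matches, discard the other
|           result = t.result() if cond.result() else f.result()
|       print(result)
|
| Wasteful, since one of the two branch computations is always thrown
| away, but the answer is ready as soon as the condition resolves.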
| tsimionescu wrote:
| Not necessarily. There are problems you can't speed up even
| if you are given a literal infinity of processors - the
| problems in EXP for example (well, EXP - NP). Even for NP
| problems, the number of processors you need for a meaningful
| speed up grows proportionally to the size of the problem
| (assuming P!=NP).
| AnthonyMouse wrote:
| Computational complexity and parallelism are orthogonal.
| Many EXP algorithms are embarrassingly parallel. You still
| have to do 2^n calculations, but if you have 1000
| processors then it will take 1000 times less wall clock
| time because you're doing 1000 calculations at once.
|
| The reason parallelism doesn't "solve" EXP problems is that
| parallelism grows linearly against something whose time
| complexity grows exponentially. It's not that it doesn't
| work at all, it's that if you want to solve the problem for
| n+1 in the same time as for n, you need to double the number
| of processors. So the number of processors you need to
| solve the problem in whatever you define as a reasonable
| amount of time grows exponentially with n, but having e.g.
| 2^n processors when n is 1000 is Not Gonna Happen.
|
| Having 1000 processors will still solve the problem twice
| as fast as 500, but that's not much help in practice when
| it's the difference between 50 billion billion years and
| 100.
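|
| To put rough numbers on that (assuming, purely for illustration, a
| brute force of 2^n operations at 10^9 operations per second per
| processor):
|
|       def wallclock_years(n, processors, ops_per_sec=1e9):
|           """Wall-clock time for 2^n operations split across processors."""
|           seconds = 2**n / (processors * ops_per_sec)
|           return seconds / (3600 * 24 * 365)
|
|       # Doubling the processor count always halves the time...
|       print(wallclock_years(100, 500))     # ~8e10 years
|       print(wallclock_years(100, 1000))    # ~4e10 years
|       # ...but a slightly bigger problem eats the gain right back.
|       print(wallclock_years(110, 1000))    # ~4e13 years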
| kzrdude wrote:
| It's surprising it took that many cores before the limit was
| reached!
| ksec wrote:
| Is spectrum.ieee.org becoming another mainstream (so to speak)
| journalism outlet where everything is dumbed down to basically
| Newspeak? The article is poorly written, the content is shallow
| and the headline is clickbait.
| retrac wrote:
| The best chiplet interconnect may turn out to be no interconnect
| at all. Wafer scale integration [1] has come up periodically over
| the years. In short, just make a physically larger integrated
| circuit, potentially as large as the entire wafer -- like a foot
| across. As I understand it, there's no particular technical
| hurdle, and indeed the progress with self-healing and self-
| testing designs with redundancy to improve yield for small
| processors also makes really large designs more feasible than in
| the past. The economics never worked out in favour of this
| approach before, but now that we're at the scaling limit maybe
| that will change.
|
| At least one company is pursuing this at the very high end. The
| Cerebras WSE-2 [2] ("wafer scale engine") has 2.6 trillion
| transistors with 850,000 cores and 40 gigabytes of on-chip RAM,
| on a single, giant integrated circuit (shown in the linked
| article).
| I'm just an interested follower of the field, no expert, so what
| do I know. But I think that we may see a shift in that direction
| eventually. Everything on-die with a really big die. System on a
| chip, but for the high end, not just tiny microcontrollers.
|
| [1] https://en.wikipedia.org/wiki/Wafer-scale_integration
|
| [2] https://www.zdnet.com/article/cerebras-continues-absolute-
| do...
| AceJohnny2 wrote:
| To clarify and contextualize a bit what you're saying:
|
| The one big obstacle in creating larger chips is defects.
| There's just a statistical chance of there being a defect on
| any given surface area of the wafer, a defect which generally
| breaks the chip that occupies that area of the wafer.
|
| So historically, the approach was to make more, smaller chips
| and trash the chips on the wafer affected by defects. Then
| came the "chiplet" approach where they can assemble those
| functional chips into a larger meta-chip (like the Apple M1
| Ultra).
|
| But as you're saying, changes in the way chips are designed can
| make them resilient to defects, so you no longer need to trash
| that chip on your wafer that's affected by such a defect, and
| can thus design a larger chip without fear of defects.
|
| (Of course such an approach requires a level of redundancy in
| the design, so there's a tradeoff)
| zitterbewegung wrote:
| The Cerebras WSE design is able, on each wafer, to disable /
| efuse a portion of itself to account for defects. This is
| what you can do if you control the wafer.
| candiddevmike wrote:
| Wikipedia link on microlithography if you want a rabbit hole
| about wafer making:
|
| https://wikipedia.org/wiki/Microlithography
|
| Being able to print something in nanometers is an overlooked
| technical achievement for human manufacturing.
| adhesive_wombat wrote:
| If that rabbit hole appeals, the ITRS[1] reports (now called
| IRDS[2]) are a very good mid-level, year-by-year summary of
| the state of the art in chipmaking, including upcoming
| challenges and future directions.
|
| > Being able to print something in nanometers is an
| overlooked technical achievement for human manufacturing.
|
| IMO, a semiconductor fab probably is _the_ highest human
| achievement in terms of process engineering. Not only do
| you "print" nanometric devices, you do it continuously, in
| a multi-month pipelined system and sell the results for as
| little as under a penny (micros, and even the biggest
| baddest CPUs are "only" a thousand pounds, far less than
| any other item with literally a billion functional designed
| features on it).
|
| [1]: https://en.wikipedia.org/wiki/International_Technology
| _Roadm...
|
| [2]: https://en.wikipedia.org/wiki/International_Roadmap_fo
| r_Devi...
| AceJohnny2 wrote:
| I'll add that many DRAM chips already do something like this,
| but ironically enough the re-routing mechanism adds
| complexity _which is itself a source of problems_ , (be it
| manufacturing or design, such as broken timing promises)
|
| Also, NAND Flash storage (SSD) is designed around the very
| concept of re-routing around bad blocks, because the very
| technology means they have a wear-life.
| Dylan16807 wrote:
| > I'll add that many DRAM chips already do something like
| this, but ironically enough the re-routing mechanism adds
| complexity which is itself a source of problems (be it
| manufacturing or design, such as broken timing promises)
|
| The best-performing solution there is probably software.
| Tell the OS about bad blocks and keep the hardware simple.
| nine_k wrote:
| I think this is already implemented both in Linux and in
| Windows; you can tell the OS which RAM ranges are
| defective.
|
| Doing this from the chip side is not there yet,
| apparently. I wonder when this will be included in the
| DRAM feature list, if ever. I suspect that detecting
| defects from the RAM side is not trivial.
| Dylan16807 wrote:
| > I suspect that detecting defects from the RAM side is
| not trivial.
|
| Factory testing or a basic self-test mode could easily
| find any parts that are flat-out broken. And as internal
| ECC rolls out as a standard feature, that could help find
| weaker rows over time.
| dylan42 wrote:
| > change in the way chips are designed can make them
| resilient to defects
|
| This is already happening for almost all modern chips
| manufactured in the last 10+ years. DRAM chips have extra
| rows/cols. Even Intel CPUs have redundant cache lines,
| internal bus lines and other redundant critical parts, which
| are burned-in during initial chip testing.
| ip26 wrote:
| The other big obstacle is that chips are square while wafers
| are round.
| paulmd wrote:
| it depends on the exact shape of your mask of course, but
| typically losses around the edges are in the 2-3% range.
|
| It's not really possible to fix this either since wafers
| need to be round for various manufacturing processes
| (spinning the wafer for coating or washing stages) and
| round obviously isn't a dense packing of the mask itself.
| It just kinda is how it is, square mask and round wafer
| means you lose a bit off the edges, fact of life.
| paulmd wrote:
| > changes in the way chips are designed can make them
| resilient to defects, so you no longer need to trash that
| chip on your wafer that's affected by such a defect,
|
| no, it's basically "chiplets but you don't cut the chiplets
| apart". You design the chiplets to be nodes in a mesh
| interconnect, and failed chiplets can simply be disabled
| entirely and then routed around. But they're still "chiplets"
| that have their own functionality and provide a coarser
| conceptual block than a core itself and thus simplify some of
| the rest of the chip design (communications/interconnect,
| etc).
|
| note that technically (if you don't mind the complexity)
| there's nothing wrong with harvesting at multiple levels like
| this! You could have "this chiplet has 8 cores, that one has
| 6, that one failed entirely and is disabled" and as long as
| it doesn't adversely affect program characteristics too much
| (data load piling up or whatever) that can be fine too.
|
| however, there's nothing about "changes in the way the chips
| are designed that makes them more resilient to defects", you
| still get the same failure rates per chiplet, and will still
| get the same number of failed (or partially failed) chiplets
| per wafer, but instead of cutting out the good ones and then
| repackaging, you just leave them all together and "route
| around the bad ones".
|
| The advantage is that MCM-style chiplet/interposer packaging
| actually makes data movement much more expensive, because you
| have to run a more powerful interconnect, whereas this isn't
| moving anything "off-chip", so you avoid a lot of that power
| cost. There are other technologies like EMIB and copper-
| copper bonding that potentially can lessen those costs for
| chiplets of course.
|
| What Intel is looking at doing with "tiles" in their future
| architectures with chiplets connected by EMIB at the edges
| (especially if they use copper-copper bonding) is sort of a
| half-step in engineering terms here but I think there are
| still engineering benefits (and downsides of course) to doing
| it as a single wafer rather than hopping through the bridge
| even with a really good copper-copper bond. Actual full-on
| MCM/interposer packaging is a step worse than cu-cu bonding
| and requires more energy but even cu-cu bonding is not
| perfect and thus not as good as just "on-chip" routing. So
| WSI is designed to get everything "on-chip" but without the
| yield problems of just a single giant chip.
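|
| A toy illustration of the "disable the failed node and route
| around it" idea on a 2D chiplet mesh (the grid size and the
| dead-node position are arbitrary, and real on-chip routers are
| nothing this simple):
|
|       from collections import deque
|
|       def route(width, height, disabled, src, dst):
|           """BFS a shortest path over a chiplet mesh, skipping dead nodes."""
|           frontier = deque([(src, [src])])
|           seen = {src} | set(disabled)
|           while frontier:
|               (x, y), path = frontier.popleft()
|               if (x, y) == dst:
|                   return path
|               for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
|                   if (0 <= nxt[0] < width and 0 <= nxt[1] < height
|                           and nxt not in seen):
|                       seen.add(nxt)
|                       frontier.append((nxt, path + [nxt]))
|           return None  # destination unreachable
|
|       # 4x4 mesh with one dead chiplet at (1, 1); traffic simply detours.
|       print(route(4, 4, disabled={(1, 1)}, src=(0, 0), dst=(3, 3)))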
| wmf wrote:
| Calling wafer-scale "no interconnect" is kind of misleading
| since it's still very difficult to stitch reticles and it has
| yield challenges.
| galaxyLogic wrote:
| Sounds like a great development if it works out.
|
| But consider also that you can stick chiplets on top of each
| other vertically. That means you can put chiplets much closer
| together than if they were constrained to exist on the same
| single plane of the wafer.
|
| Now how about stacking wafers on top of wafers? That could be
| super, but there might be technical difficulties, which maybe
| sooner or later can be overcome.
| AceJohnny2 wrote:
| > _But consider also that you can stick chiplets on top of
| each other vertically._
|
| The problem there is heat dissipation. Already the
| performance constraint on consumer chips like the Apple M1 is
| how well it can dissipate heat in the product it's placed in
| (see Macbook Air vs Mac Mini). Stacking the chips just makes
| it worse.
| GeekyBear wrote:
| The fact that the M1 Macbook Air operates without needing a
| fan is very unusual for that level of performance.
| paulmd wrote:
| AMD's 5800X3D and the upcoming generation of AMD/NVIDIA
| GPUs (both of which are rumored to feature stacked cache
| dies) are going to be real interesting. So far we haven't
| ever seen a stacked _enthusiast_ die (MCM doesn't feature
| any active transistors on the interposer) and it will be
| interesting to see how the thermals work out.
|
| This isn't even stacking _compute_ dies either; stacking
| memory/cache is the low-hanging fruit but in the long term
| what everyone really wants is stacking multiple compute
| dies on top of each other, and _that's_ going to get spicy
| real quick.
|
| M1 is the other example but again, Apple's architecture is
| sort of unique in that they've designed it to run from the
| ground up at 3 GHz exactly, there's no overclocking/etc
| like enthusiasts generally expect. AMD is having to disable
| voltage control/overclocking on the 5800X3D as well
| (although that may be more related to voltage control
| rather than thermals - sounds like the cache die may run
| off one of the voltage rails from the CPU, potentially a
| FIVR could be used to drive that rail independently, or add
| an additional V_mem rail...)
|
| And maybe that's the long-term future of things, that
| overclocking goes away and you just design for a tighter
| design envelope, one where you _know_ the thermals work for
| the dies in the middle of the sandwich. Plus the Apple
| design of "crazy high IPC and moderately low ~3 GHz
| clocks" appears well-adapted for that reality.
| Iwan-Zotow wrote:
| the problem is signal propagation
|
| for light to cross 1 foot should take ca. 1 ns
| paulmd wrote:
| 3D circuits would be denser (shorter propagation
| distances) than a planar circuit. In fact "computronium"
| is sort of an idea about how dense you can conceptually
| make computation.
|
| You just can't really cool it that well with current
| technologies. Microfluidics are the current magic wand
| that everyone wishes existed but it's a ways away yet.
| nynx wrote:
| There are some new chip manufacturing techniques coming down the
| pipeline which will lead to prices dropping, and "wafer-scale"
| will likely make it to the mainstream.
| tragictrash wrote:
| Could you elaborate? Would love to know more.
| nynx wrote:
| Unfortunately, I cannot.
| AtlasBarfed wrote:
| ... that's the exact opposite of every economic and yield
| advantage that chiplet design addresses, isn't it?
|
| Want to fine tune your chip offering to some multiple of 8
| cores (arbitrary example of the # cores on the chiplet)? Just a
| packaging issue.
|
| Want to up-bin very large core counts that generally overclock
| quite well? For the massive unichip described, maybe there are
| sections of the chip that are clocking well and sections that
| aren't: you're stuck. With chiplets, you have better binning
| granularity and packaging.
|
| Want to fine-tune various cache levels? I believe from what
| I've read that AMD is doing L3 on a separate chiplet (and
| vertically stacking it!). So you can custom-tune the cache size
| for chip designs.
|
| You can custom-process different parts of the "CPU" with
| different processes and fabrication, possibly even different
| fab vendors.
|
| You can upgrade various things like memory support and other
| evolving things in an isolated package, which should help
| design and testing.
|
| The interconnects are the main problem. But then again, I can't
| imagine what a foot-wide CPU introduces for intra-chip
| communication; it probably would have its own pseudo-
| interconnect highway anyway.
|
| Maybe you don't even need to reengineer some chiplets between
| processor generations. If the BFD of your new release is some
| improvement to either the high or the low CPUs in a high-low
| design, but the other is the same, then that should mean more
| organizational efficiency.
|
| Intel and others effectively moved away from gigantic
| integrated circuits decades ago: motherboard chipsets were
| always done on a separate, cheaper fab process that was a gen
| or two behind the CPU.
|
| Maybe when process tech has finally stabilized for a generation,
| now that process technology seems to be stagnating more, massive
| wafer designs will start to edge out chiplet designs, but right
| now it appears that the opposite has happened and will continue
| for the foreseeable future.
| kwhitefoot wrote:
| Nothing new under the sun. Ivor Catt was proposing wafer scale
| computing in the '70s. Large numbers of processors with the
| ability to route around defective units.
|
| https://www.ivorcatt.org/icrns86jun_0004.htm
| Iwan-Zotow wrote:
| speed of light ~ 1*10^9 feet/sec
|
| To cross one foot - no less than 10^-9 sec = 1ns
| FredPret wrote:
| Instead of microcircuits, megacircuits. I like it
| truth_seeker wrote:
| A chip with a semi-FPGA as well as semi-ASIC strategy could work.
| The FPGA dev toolchain needs to improve.
| tempnow987 wrote:
| "Reached their limits" - I feel like I've heard this many many
| times before.
|
| Not that I doubt it, but just I've also been impressed with the
| ingenuity that folks come up with in this space.
| mjreacher wrote:
| Agreed. I would be wary of reaching fundamental limits set by
| physics although I don't think we're there yet.
|
| "It would appear that we have reached the limits of what is
| possible to achieve with computer technology, although one
| should be careful with such statements, as they tend to sound
| pretty silly in five years."
|
| - attributed to von Neumann, 1949.
| tawaypol wrote:
| "There's plenty of room at the bottom."
| marcosdumay wrote:
| Actually, we are running out of room there.
|
| That speech is over 60 years old nowadays. There was plenty
| of room at that time.
|
| Of course, it also speculated that we would move into quantum
| computers at some point, which is still a possibility, but now
| we know that quantum computers won't solve every issue.
| syntheweave wrote:
| We only have to solve one limitation per year to keep making
| progress year over year, and as it is, the semiconductor
| industry still seems to be solving large numbers of significant
| issues yearly. So while we don't necessarily get smooth,
| predictable improvement, a safe bet is that there will
| continue to be useful new developments 10-20 years out, even if
| they don't translate to the same kinds of gains as in years
| past.
| macrolocal wrote:
| For example, there's lots to explore in the VLIW space.
| JonChesterfield wrote:
| Compilers, largely.
| macrolocal wrote:
| Yep, and also architectures whose state is simpler to
| model.
| gameswithgo wrote:
| aeturnum wrote:
| I read articles like this as saying "reached their limits [as
| we currently understand them]." Sometimes we learn we were
| mistaken and more is possible but it's not reliable and,
| crucially, when it happens it happens in unexpected ways. The
| process of talking about when (and why) techniques have hit
| their useful limits is often key to unearthing the next step.
| anonymousDan wrote:
| So is UCIe a competitor/potential successor for something like
| Intel's QPI (or whatever they are using now)?
| RcouF1uZ4gsC wrote:
| > UCIe is a start, but the standard's future remains to be seen.
| "The founding members of initial UCIe promoters represent an
| impressive list of contributors across a broad range of
| technology design and manufacturing areas, including the HPC
| ecosystem," said Nossokoff, "but a number of major organizations
| have not as yet joined, including Apple, AWS, Broadcom, IBM,
| NVIDIA, other silicon foundries, and memory vendors."
|
| The fact that the standard doesn't include anyone who is actually
| building chips makes me very pessimistic about it.
| ranger207 wrote:
| Looks like a lot of people who actually build chips are in the
| organization
|
| https://www.uciexpress.org/membership
| AnimalMuppet wrote:
| "More multi-chip processor designs" != "single-chip processors
| have reached their limits".
| refulgentis wrote:
| I'm embarrassed to admit I still don't quite understand what a
| chiplet is, would be very grateful for your input here.
|
| If a thread can run on multiple chiplets then this is awesome and
| seems like a solution.
|
| If one thread == one chiplet, then*:
|
| - a chiplet is equivalent to a core, except with speedier
| connections to other cores?
|
| - this isn't a solution, we're 15 years into cores and single-
| threaded performance is still king. If separating work into
| separate threads was a solution, cores would work more or less
| just fine.**
|
| * put "in my totally uneducated opinion, it seems like..." before
| each of these, internet doesn't communicate tone well and I'm
| definitely not trying to pass judgement here, I don't know what
| I'm talking about!
|
| ** generally, for consumer hardware and use cases, i.e. "I am
| buying a new laptop and I want it to go brrrr", all sorts of
| caveats there of course
| [deleted]
| sliken wrote:
| AMD Epyc is (AFAIK) what popularized the term. Their current
| design has a memory controller die (PCIe controller, 8 x 64-bit
| channels of RAM, etc.) and 8 chiplets which are pretty much just
| 8 cores and an Infinity Fabric link for a cache-coherent
| connection to other CPUs (in the same or other sockets) and
| DRAM.
|
| So generally Epyc comes with some multiple of 8 cores enabled
| (the same number per chiplet), and the latency between cores on
| the same chiplet is lower than the latency to other chiplets.
|
| This allows AMD to target high end servers (up to 64 cores),
| low end (down to 16), workstations with threadripper (4
| chiplets instead of 8), and high end desktops (2 chiplets
| instead of 8) with the same silicon. This allows them to spend
| less on fabs, R&D, etc because they can amortize the silicon
| over more products/volume. It also lets them bin them so
| chiplets with bad cores can still be sold. It's one of the
| things that lets AMD compete with the much larger volume Intel
| has, and do pretty well against Intel's numerous silicon
| designs.
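|
| As a toy model of why that latency split matters when scheduling
| threads (the core counts follow the layout above; the nanosecond
| figures are invented round numbers, not measurements):
|
|       CORES_PER_CHIPLET = 8   # 8 chiplets x 8 cores on a full Epyc
|
|       def core_to_core_latency_ns(a, b):
|           """Very rough illustrative latency between two core IDs."""
|           if a == b:
|               return 0
|           if a // CORES_PER_CHIPLET == b // CORES_PER_CHIPLET:
|               return 25    # same chiplet: stays on that die
|           return 100       # crosses Infinity Fabric via the IO die
|
|       print(core_to_core_latency_ns(0, 7))   # same chiplet
|       print(core_to_core_latency_ns(0, 8))   # different chiplets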
| hesdeadjim wrote:
| A chiplet is a full-fledged CPU with many cores on it. The term
| is used when multiple of these chips are stitched together with
| a high speed interconnect and plugged into the single socket on
| your motherboard.
|
| If you ripped the lid off a Ryzen "chip", you would see
| multiple CPU dies underneath for the high end models.
| tenebrisalietum wrote:
| Additionally - MCM - multi-chip module - instead of putting
| separate chips for various functions on a board, they're
| fused together in what from the outside looks like a single
| chip, but internally is 3 or 4 unrelated chips.
|
| Examples at the Wikipedia article:
| https://en.wikipedia.org/wiki/Multi-chip_module
| WalterBright wrote:
| I remember back in the 80's the limit was considered to be 64K
| RAM chips, because otherwise the defect rate would kill the
| yield.
|
| Of course, there's always the "make a 4 core chip. If one core
| doesn't work, sell it as a 3 core chip. And so on."
| dboreham wrote:
| Hmm. I worked for a memory manufacturer in the 80s and I do not
| remember any limit.
| ksec wrote:
| That is mainstream news reporting for you since the 80s.
| throwaway4good wrote:
| "Single-Chip Processors Have Reached Their Limits
|
| Announcements from XYZ and ABC prove that chiplets are the
| future, but interconnects remain a battleground"
|
| This could easily have been written 10 years ago, and I bet
| someone will write it in 10 years again.
|
| We need these really big chips with their big powerful cores
| because the nature of the computing we do only changes very slowly
| towards being distributed and parallelizable and thus able to use
| a massive number of smaller but far more efficient cores.
| Dylan16807 wrote:
| You're implying you can't put big powerful cores on chiplets
| but that's not true at all.
| lazide wrote:
| Hardly - performance/core hasn't flatlined, but has not
| maintained the same growth over time (decades) in performance
| we've traditionally had. That's the problem.
|
| So if you want better aggregate performance, more cores has
| been the plan for a decade+ now.
|
| FLOP/s per core or whatever other metric you choose to use.
|
| Previously it was possible to get 20-50% or more performance
| improvements even year to year for a core.
| Dylan16807 wrote:
| I wasn't talking about improvement at all. This was about
| big strong cores versus efficient cores, which is a
| tradeoff that always exists.
|
| You could choose between 20 strong cores or 48 efficient
| cores on the same die space across four chiplets, for
| example.
| alain94040 wrote:
| Correct. Also known as Rent's rule. According to Wikipedia, it
| was first mentioned in the 1960s:
| https://en.wikipedia.org/wiki/Rent%27s_rule
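|
| For reference, the usual statement of the rule (as on the linked
| Wikipedia page) relates a block's external terminals T to its
| internal gate count g:
|
|       T = t * g^p
|
| where t is the average number of terminals per gate and the Rent
| exponent p is typically somewhere around 0.5-0.8 depending on the
| architecture.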
| narag wrote:
| I hope somebody with relevant knowledge can answer this question,
| please: what % of the costs is "physical cost per unit" and what
| % is maintaining the R&D, factories, channels...?
|
| In other words, if a chip with 100x size (100x gates, etc.) made
| sense, would it cost 100x to produce or just 10x or just 2x?
|
| Edit: provided there wouldn't be additional design costs, just
| stacking current tech.
| ksec wrote:
| >would it cost 100x to produce or just 10x or just 2x?
|
| Why would 100x something only cost 2x to produce?
|
| >what % of the costs is "physical cost per unit" and what % is
| maintaining the R&D, factories, channels...?
|
| Without unit volume and a definition of the first "cost" in the
| sentence, no one could answer that question. But if you want to
| know the BOM cost of a chip, it is simply wafer price divided by
| total usable chips, which depends on yield, where yield is a
| factor of both the current maturity of the node and whether your
| design allows correction of defects. Then add about ~10% for
| testing and packaging.
| mlyle wrote:
| There are many limiting factors... one is the reticle limit.
|
| But most fundamental is the defect density on wafers. If you
| have, say, 10 defects per wafer, and you have 1000 chips on it:
| odds are you get 990 good chips.
|
| If you have 10 chips on the wafer, you get 2-3 good chips per
| wafer.
|
| Of course, there's yield maximization strategies, like being
| able to turn off portions of the die if it's defective (for
| certain kinds of defects).
|
| For the upper limit, look at what Cerebras is doing with wafer
| scale. Then you get into related, crazy problems, like getting
| thousands of amperes into the circuit and cooling it.
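|
| A back-of-the-envelope version of that effect (a simple Poisson
| yield model assuming uniformly random, always-fatal defects, using
| the same numbers as above):
|
|       import math
|
|       def expected_good_dies(dies_per_wafer, defects_per_wafer):
|           """P(a die has zero defects) = exp(-mean defects per die)."""
|           defects_per_die = defects_per_wafer / dies_per_wafer
|           return dies_per_wafer * math.exp(-defects_per_die)
|
|       print(expected_good_dies(1000, 10))   # ~990 good dies
|       print(expected_good_dies(10, 10))     # ~3.7 good dies
|
| Real defect distributions and which defects are actually fatal
| shift these numbers around, but the asymmetry between small and
| big dies is the point.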
| nightfly wrote:
| I'm not an expert, or even an amateur, here but defects are
| inevitable. So if you _need_ 100x the size without defects and
| one defect ruins the chip, the cost might be 10000x to produce.
| tonyarkles wrote:
| It's been a while since I was in that industry, but
| back around the 45nm days, one of the biggest concerns was
| yield. If you've got 100x the surface area, the probability of
| there being a manufacturing defect that wrecks the chip goes
| up. Now, you could probably get away with selectively disabling
| defective cores, but the chiplet idea seems, to me, like it
| would give you a lot more flexibility. As an example, let's say
| a chiplet i9 requires 8x flawless chips, and a chiplet Celeron
| requires 4 chips, but they're allowed to have defects in the
| cache because the Celeron is sold with a smaller cache anyway.
|
| In the "huge chip" case, you need the whole 8x area to be
| flawless, otherwise the chip gets binned as a Celeron. If the
| chiplet case, any single chip with a flaw can go into the
| Celeron bin, and 8 flawless ones can be assembled into a
| flawless CPU, and any defect ones go into the re-use bin. And
| if you end up with a flawed chip that can't be used at the
| smallest bin size, you're only tossing 1/4 or 1/8 of a CPU in
| the trash.
| wmf wrote:
| The way TSMC amortizes those fixed costs is to charge by the
| wafer, so if your chip is 100x larger it costs at least 100x
| more. (You will have losses due to defects and around the edges
| of the wafer.) You can play with a calculator like
| https://caly-technologies.com/die-yield-calculator/ to get a
| feel for the numbers.
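|
| A rough sketch of how charge-by-the-wafer plus yield turns into
| cost per good chip (the wafer price, defect density, and die areas
| below are placeholder numbers, not TSMC quotes):
|
|       import math
|
|       def cost_per_good_die(wafer_price, wafer_area_mm2, die_area_mm2,
|                             defect_density_per_mm2):
|           """Spread the fixed wafer price over the dies that yield."""
|           dies = wafer_area_mm2 / die_area_mm2     # ignores edge losses
|           good = dies * math.exp(-defect_density_per_mm2 * die_area_mm2)
|           return wafer_price / good
|
|       WAFER = math.pi * 150**2   # 300 mm wafer, ~70,700 mm^2
|       print(cost_per_good_die(10_000, WAFER, 100, 0.001))  # ~$16
|       print(cost_per_good_die(10_000, WAFER, 800, 0.001))  # ~$252
|
| The die got 8x bigger but the cost per good die went up ~16x, which
| is the "at least" in "at least 100x more".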
| hinkley wrote:
| I hope we are going to get back to a more asymmetric multi-
| processing arrangement in the near term where we abandon the
| fiction of a processor or two running the whole show with
| peripheral systems that have as little smarts as possible and
| promote them to at least second class citizens.
|
| These systems are much more powerful than when these abstractions
| were laid down, and at this point it feels like the difference
| between redundant storage on the box versus three feet away is
| more academic than anything else.
| wmf wrote:
| That kind of exists since most I/O devices have CPU cores in
| them, although usually hidden behind register-based interfaces.
| Apple has taken it a little further by using the same core
| everywhere and creating a standard IPC mechanism.
| gotaquestion wrote:
| The problem is AMP is very hard to program and debug. In
| embedded, one core is a scheduler and another is doing some
| real-time task (like Arm big.LITTLE). In larger automotive
| heterogeneous compute platforms, typically they are all treated
| as accelerators, or with bespoke Tier-1 integration (or like
| NVIDIA Xavier). And on top of that, OEMs always want to
| "reclaim" those spare cycles when the other AMP cores are
| underutilized, which is nigh impossible to do, so they fall
| back to symmetric MP. I think embedded is the only place for
| this to work right now.
|
| EDIT: I'm not an expert in this field but I have been asked to
| do work in this domain, and this narrow sampling is what I
| encountered, but I'd like to learn more about tooling and
| strategies for more generic AMP deployments.
| gnarbarian wrote:
| Are we moving this way because bigger chips with many cores have
| worse yields? So the answer is to make lots of little chips and
| fuse them together?
| [deleted]
| sliken wrote:
| Well, fusing is one possibility. AMD Epyc generally has an
| IO+memory controller die (called the IOD) plus 8 chiplets of 8
| cores each for most of the Epyc chips; however, not all cores
| are enabled, depending on the SKU.
|
| However, Apple's approach does allow impressive bandwidth,
| 2.5 TB/sec, which is much higher than any of the chiplet
| approaches I'm aware of.
| monocasa wrote:
| Yeah, in the very general case, chip errors are a function of
| die area. Cutting a die into four pieces so that when an error
| occurs in manufacturing, you only throw out a quarter of the
| die area is becoming the right model for a lot of designs.
|
| Like all things chips, it's way more complicated than that,
| fractally, as you start digging in. Like, AMD started down this
| road initially because of their contractual agreements with
| GloFo to keep shipping GloFo dies, but wanted the bulk
| of the logic on a smaller node than GloFo could provide, hence
| the IO die and compute chiplets model that still exists in Zen.
| It's still a good idea for other reasons but they lucked out a
| bit by being forced in that direction before other major
| fabless companies.
|
| This is also not a new idea, but sort of ebbs and flows with
| the economics of the chip market. See the VAX 9000 multi chip
| modules for an 80s take on the same ideas and economic
| pressures.
| WithinReason wrote:
| Their GPUs are likely to be multichip for the first time too
| with NAVI 31 (while Nvidia's next gen will still be single
| chip and likely fall behind AMD). It also seems like the
| cache will be 6nm while the logic will be 5nm, bonded
| together with some new TSMC technology. At least that can be
| inferred from some leaks:
|
| https://www.tweaktown.com/news/84418/amd-rdna-3-gpu-
| engineer...
| thissiteb1lows wrote:
| ceeplusplus wrote:
| I've yet to see any sort of research out of AMD on MCM
| mitigations for things like cache coherency and NUMA.
| Nvidia on the other hand has published papers as far back
| as 2017 on the subject. On top of that even the M1 Ultra
| has some rough scaling spots in certain workloads and Apple
| is by far ahead of everyone else on the chiplet curve (if
| you don't believe me, try testing lock-free atomic
| load/store latency across CCX's in Zen3).
|
| Also AMD claimed the MI250X is "multichip" but it presents
| itself as 2 GPUs to the OS and the interconnect is worse
| than NVLink.
| monocasa wrote:
| There's a few ways to interpret that. Another
| interpretation could be that they are simply taping out
| Navi32 on two nodes, perhaps for AMD to better utilize the
| 5nm slots they have access to. Perhaps when Nvidia is on
| Samsung 10nm+++, then the large consumer AMD GPUs get a
| node advantage already being at TSMC 7nm+++, and so they're
| only using 5nm slots for places like integrated GPUs and
| data center parts that care about perf/watt.
|
| But your interpretation is equally valid with the
| information we have AFAICT.
| IshKebab wrote:
| This is what Tesla's Dojo does (it's really a TSMC technology
| that they are the first to utilize). You can cut your wafer up
| into chips, ditch the bad ones, then reassemble them into a
| bigger wafery chip thing using some kind of glue. Then you can
| do more layers to wire them up.
|
| I think they do it using identical chips but I guess there's no
| real reason you couldn't have different chips connected in one
| wafer. Expensive though!
___________________________________________________________________
(page generated 2022-04-04 23:00 UTC)