[HN Gopher] Single-chip processors have reached their limits
___________________________________________________________________
Single-chip processors have reached their limits
Author : blopeur
Score : 129 points
Date : 2022-04-04 16:53 UTC (6 hours ago)
(HTM) web link (spectrum.ieee.org)
(TXT) w3m dump (spectrum.ieee.org)
| Veliladon wrote:
| The M1 Ultra is fabricated as a single chip. The 12900K is
| fabricated as a single chip and is still a quarter the size of
| the M1 Ultra. Zen 3 puts 8 cores on a CCX instead of four
| because DDR memory controllers don't have infinite memory
| bandwidth (contrary to AMD's wishful nomenclature) and make
| shitty interconnects between banks of L3.
|
| Chiplets are a valid strategy that is going to be used in the
| future, but there are still more tricks that CPU makers have up
| their sleeves that they will need to use out of necessity. They're
| nowhere near their limits.
| paulmd wrote:
| "chip" is ambiguous the way you're using it.
|
| The M1 Ultra is two _dies_ in one _package_. The package is
| what goes on the motherboard.
|
| You can also count the memory modules as dies/packages as well.
| It's not incorrect to say that the M1 Ultra has a bunch of LPDDR5
| _packages_ on it, and each LPDDR5 _package_ may have multiple
| _dies_ in it as well.
|
| But depending on context it also wouldn't be incorrect to refer to
| the M1 Ultra _package_ as a chip, even if it's got more packages
| on it. From the context of the motherboard maker, the CPU BGA unit
| is the "package".
|
| Anyway no, Ultra isn't a monolithic die in the sense you're
| meaning, it's two dies that are joined, Apple just uses a
| ridiculously fat pipe to do it (far beyond what AMD is using
| for Ryzen) such that it basically appears to be a single die.
| The same is true for AMD, Rome/Milan are notionally NUMA -
| running in NPS4 mode can squeeze some extra performance in
| extreme situations if applications are aware of it, and there are
| some weird oddities caused by memory locality in "unbalanced"
| configurations where each quadrant doesn't have the same number
| of channels. It just doesn't feel like it because AMD has done
| a very good job hiding it.
|
| However you're also right that we haven't reached the end of
| monolithic chips either. Splitting a chip into modules imposes
| a power penalty for data movement, it's much more expensive to
| move data off-chiplet than on-chiplet, and that imposes a limit
| on how finely you can split your chiplets (doing let's say 64
| tiny chiplets on a package would use a huge amount of power
| moving data around, since everything is off-chip). There are
| various technologies like copper-copper bonding and EMIB that
| will hopefully lower that power cost in the future, but it's
| there.
|
| And even AMD uses monolithic chips for their laptop parts,
| because of that. If _any_ cores are running, the IO die has to
| be powered up, running its memory and infinity fabric links,
| and at least one CCD has to be powered up, even if it's just
| to run "hello world". This seems to be around 15-20W, which is
| significant in the context of a home or office PC.
|
| It's worth noting that Ryzen is not really a desktop-first
| architecture. It's server-first, and AMD has found a clever way
| to pump their volumes by using it for enthusiast hardware.
| Servers don't generally run 100% idle, they are loaded or they
| are turned off entirely and rebooted when needed. If you can't
| stand the extra 20W at idle, AMD would probably tell you to buy
| an APU instead.
| 2OEH8eoCRo0 wrote:
| > The M1 Ultra is fabricated as a single chip.
|
| I'm curious how much the M1 Ultra costs. It's such a massive
| single piece of glass I'd guess it's $1,200+. If that's the
| case it doesn't make sense to compare the M1 Ultra to $500 CPUs
| from Intel and AMD.
| mrtksn wrote:
| Wouldn't the price be primarily based on capital investment
| and not so much on the unit itself? After all, it's
| essentially a print out on a crystal using reeeeeally
| expensive printers. AFAIK Apple's relationship with TSMC is
| more than a customer relationship.
| 2OEH8eoCRo0 wrote:
| In a parallel universe where Intel builds and sells this
| CPU- what's the price? Single chip, die size of 860 square
| mm, 114 billion transistors, on package memory.
|
| It just got me thinking the other day since all of these
| benchmarks pit it against $500-$1000 CPUs and it doesn't
| seem to fall in that price range at all. Look at this
| thing:
|
| https://cdn.wccftech.com/wp-
| content/uploads/2022/03/2022-03-...
| headass wrote:
| also all the other shit that's on the chip, RAM, etc.
| gameswithgo wrote:
| It is commonly said that on the new M1 Macs the RAM is on the
| chip; it is not. It is on the same substrate, but it's just
| normal (fast) DRAM chips soldered on nearby.
| grishka wrote:
| If there's a defective M1 Ultra, they can cut it in half and
| say those are two low-end M1 Max.
| 2OEH8eoCRo0 wrote:
| Wouldn't they only get, at most, one low-end M1 Max if
| there is a defect?
| grishka wrote:
| They sell cheaper models with some cores disabled, that's
| what I meant by low-end. Ever wondered what's the deal
| with the cheapest "7-core GPU" M1?
| monocasa wrote:
| If the defect is in the right place, Apple apparently
| sells M1 Max chips with some GPU cores disabled.
| wmf wrote:
| I estimate that Apple's internal "price" for the M1 Ultra is
| around $2,000. Since most of the chip is GPU, it should
| really be compared to a combo like 5950X + 6800 XT or 12900K
| + 3080.
| 2OEH8eoCRo0 wrote:
| It wouldn't surprise me. M1 Ultra has 114 billion
| transistors and a total area of ~860 square mm. For
| comparison, an RTX 3090 has 28 billion transistors and a
| total area of 628 square mm.
| sliken wrote:
| Dunno, M1 Ultra includes a decent GPU, which the $500 CPUs
| from Intel and AMD do not. Seems relatively comparable to a
| $700 GPU (like an RTX 3070 if you can find one) depending on
| what you are using. Sadly Metal-native games are rare; many
| use some Metal wrapper and/or Rosetta emulation.
|
| Seems pretty fair to compare an Intel Alder Lake or higher-end
| AMD Ryzen AND a GPU (RTX 3070 or Radeon 6800) to the M1 Ultra,
| assuming you don't care about power, heat, or space.
| touisteur wrote:
| Has anyone managed to reach the actual advertised 21 FP32
| TFLOPS? I'm curious. Even BLAS or pure custom matmul stuff?
| How much of that is actually available? I can almost
| saturate and sustain an NVIDIA A40 or A4000 at their peak
| perf, so I'm wondering whether anyone has written something there?
| cma wrote:
| M1 ultra is two chips with an interconnect between them I
| thought? Or is the interconnect already on die with them?
|
| (Edit: sounds like it is two: "Apple said fusing the two M1
| processors together required a custom-built package that uses a
| silicon interposer to make the connection between chips. "
| https://www.protocol.com/bulletins/apple-m1-ultra-chip )
| monocasa wrote:
| > M1 ultra is two chips with an interconnect between them I
| thought? Or is the interconnect already on die with them?
|
| It's either depending on how you look at it. The active
| components of the interconnect are on the two M1 dies, but
| the interconnect itself goes through the interposer as well.
| fulafel wrote:
| Some older stuff for reference: IBM POWER5 and POWER5+
| (2004 & 2005) are MCM designs that had 2-4 CPU chips plus cache
| chips in the same package.
|
| Link: https://en.wikipedia.org/wiki/POWER5
| sliken wrote:
| Pentium Pro from 1995 had two pieces of silicon in the package:
| https://en.wikipedia.org/wiki/Pentium_Pro
| h2odragon wrote:
| PPros are quite hard to find now because the "gold
| scavengers" loved them. As i recall, at the peak in 2008,
| they were $100ea and more for the ceramic packages. All that
| interconnect was tiny gold wires, apparently.
| sliken wrote:
| Heh, had no idea. They seemed to have a pretty limited run,
| ran at up to 200 MHz, but were pretty quickly replaced by the
| Pentium II at 233 MHz on a single die.
| marcodiego wrote:
| Makes me remember the processor in the film terminator 2:
| https://gndn.files.wordpress.com/2016/04/shot00332.jpg
| bob1029 wrote:
| Despite the limitations apparently present in single chip/CPU
| systems, they can still provide an insane amount of performance
| if used properly.
|
| There are also many problems that are literally impossible to
| make faster or more correct than by simply running them on a
| single thread/processor/core/etc. There always will be forever
| and ever. This is not a "we lack the innovation" problem. It's an
| information-theoretic / causality problem you can demonstrate
| with actual math & physics. Does a future event's processing
| circumstances maybe depend on all events received up until now?
| If yes, congratulations. You now have a total ordering problem
| just like pretty much everyone else. Yes, you can cheat and say
| "well these pieces here and here dont have a hard dependency on
| each other", but its incredibly hard to get this shit right if
| you decide to go down that path.
|
| The most fundamental demon present in any distributed system is
| latency. The difference between L1 and a network hop in the same
| datacenter can add up very quickly.
|
| Again, for many classes of problems, there is simply no
| handwaving this away. You either wait the requisite # of
| microseconds for the synchronous ack to come back, or you hope
| your business doesn't care if John Doe gets duplicated a few times
| in the database on a totally random basis.
| AnthonyMouse wrote:
| The alternative is speculative execution. If you can guess what
| the result is going to be, you can proceed to the next
| calculation and you get there faster if it turns out you were
| right.
|
| If you have parallel processors, you can stop guessing and just
| proceed under both assumptions concurrently and throw out the
| result that was wrong when you find out which one it was. This
| is going to be less efficient, but if your only concern is
| "make latency go down," it can beat waiting for the result or
| guessing wrong.
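|
| A minimal sketch of that "run both assumptions at once" idea in
| Python (the branch functions and the 1-second stand-in for the
| slow result are made up purely for illustration):
|
|       from concurrent.futures import ThreadPoolExecutor
|       import time
|
|       def slow_condition():
|           time.sleep(1)   # stand-in for the result we'd otherwise wait on
|           return True
|
|       def branch_if_true():
|           return "result computed assuming True"
|
|       def branch_if_false():
|           return "result computed assuming False"
|
|       with ThreadPoolExecutor(max_workers=3) as pool:
|           cond = pool.submit(slow_condition)
|           t = pool.submit(branch_if_true)    # speculate: condition is True
|           f = pool.submit(branch_if_false)   # speculate: condition is False
|           # keep the speculative result that matches, discard the other
|           result = t.result() if cond.result() else f.result()
|       print(result)
|
| Wasteful, since one of the two branch computations is always thrown
| away, but the answer is ready as soon as the condition resolves.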
| tsimionescu wrote:
| Not necessarily. There are problems you can't speed up even
| if you are given a literal infinity of processors - the
| problems in EXP for example (well, EXP - NP). Even for NP
| problems, the number of processors you need for a meaningful
| speed up grows proportionally to the size of the problem
| (assuming P!=NP).
| AnthonyMouse wrote:
| Computational complexity and parallelism are orthogonal.
| Many EXP algorithms are embarrassingly parallel. You still
| have to do 2^n calculations, but if you have 1000
| processors then it will take 1000 times less wall clock
| time because you're doing 1000 calculations at once.
|
| The reason parallelism doesn't "solve" EXP problems is that
| parallelism grows linearly against something whose time
| complexity grows exponentially. It's not that it doesn't
| work at all, it's that if you want to solve the problem for
| n+1 in the same time as for n, you need to double the number
| of processors. So the number of processors you need to
| solve the problem in whatever you define as a reasonable
| amount of time grows exponentially with n, but having e.g.
| 2^n processors when n is 1000 is Not Gonna Happen.
|
| Having 1000 processors will still solve the problem twice
| as fast as 500, but that's not much help in practice when
| it's the difference between 50 billion billion years and
| 100.
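|
| To put rough numbers on that (assuming, purely for illustration, a
| brute force of 2^n operations at 10^9 operations per second per
| processor):
|
|       def wallclock_years(n, processors, ops_per_sec=1e9):
|           """Wall-clock time for 2^n operations split across processors."""
|           seconds = 2**n / (processors * ops_per_sec)
|           return seconds / (3600 * 24 * 365)
|
|       # Doubling the processor count always halves the time...
|       print(wallclock_years(100, 500))     # ~8e10 years
|       print(wallclock_years(100, 1000))    # ~4e10 years
|       # ...but a slightly bigger problem eats the gain right back.
|       print(wallclock_years(110, 1000))    # ~4e13 years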
| kzrdude wrote:
| It's surprising it took that many cores before the limit was
| reached!
| ksec wrote:
| Is spectrum.ieee.org becoming another mainstream (so to speak)
| journalism outlet where everything is dumbed down to basically
| Newspeak? The article is poorly written, the content is shallow
| and the headline is clickbait.
| retrac wrote:
| The best chiplet interconnect may turn out to be no interconnect
| at all. Wafer scale integration [1] has come up periodically over
| the years. In short, just make a physically larger integrated
| circuit, potentially as large as the entire wafer -- like a foot
| across. As I understand it, there's no particular technical
| hurdle, and indeed the progress with self-healing and self-
| testing designs with redundancy to improve yield for small
| processors also makes really large designs more feasible than in
| the past. The economics never worked out in favour of this
| approach before, but now that we're at the scaling limit maybe
| that will change.
|
| At least one company is pursuing this at the very high end. The
| Cerebras WSE-2 [2] ("wafer scale engine") has 2.6 trillion
| transistors with 850,000 cores and 40 gigabytes of on-chip RAM,
| on a single, giant integrated circuit (shown in the linked
| article).
| I'm just an interested follower of the field, no expert, so what
| do I know. But I think that we may see a shift in that direction
| eventually. Everything on-die with a really big die. System on a
| chip, but for the high end, not just tiny microcontrollers.
|
| [1] https://en.wikipedia.org/wiki/Wafer-scale_integration
|
| [2] https://www.zdnet.com/article/cerebras-continues-absolute-
| do...
| AceJohnny2 wrote:
| To clarify and contextualize a bit what you're saying:
|
| The one big obstacle in creating larger chips is defects.
| There's just a statistical chance of there being a defect on
| any given surface area of the wafer, a defect which generally
| breaks the chip that occupies that area of the wafer.
|
| So historically, the approach was to make more, smaller chips
| and trash the chips on the wafer affected by defects. Then
| came the "chiplet" approach where they can assemble those
| functional chips into a larger meta-chip (like the Apple M1
| Ultra).
|
| But as you're saying, changes in the way chips are designed can
| make them resilient to defects, so you no longer need to trash
| that chip on your wafer that's affected by such a defect, and
| can thus design a larger chip without fear of defects.
|
| (Of course such an approach requires a level of redundancy in
| the design, so there's a tradeoff)
| zitterbewegung wrote:
| The Cerebras WSE design is able, on each wafer, to disable /
| efuse a portion of itself to account for defects. This is
| what you can do if you control the wafer.
| candiddevmike wrote:
| Wikipedia link on microlithography if you want a rabbit hole
| about wafer making:
|
| https://wikipedia.org/wiki/Microlithography
|
| Being able to print something in nanometers is an overlooked
| technical achievement for human manufacturing.
| adhesive_wombat wrote:
| If that rabbit hole appeals, the ITRS[1] reports (now called
| IRDS[2]) are a very good mid-level, year-by-year summary of
| the state of the art in chipmaking, including upcoming
| challenges and future directions.
|
| > Being able to print something in nanometers is an
| overlooked technical achievement for human manufacturing.
|
| IMO, a semiconductor fab probably is _the_ highest human
| achievement in terms of process engineering. Not only do
| you "print" nanometric devices, you do it continuously, in
| a multi-month pipelined system and sell the results for as
| little as under a penny (micros, and even the biggest
| baddest CPUs are "only" a thousand pounds, far less than
| any other item with literally a billion functional designed
| features on it).
|
| [1]: https://en.wikipedia.org/wiki/International_Technology
| _Roadm...
|
| [2]: https://en.wikipedia.org/wiki/International_Roadmap_fo
| r_Devi...
| AceJohnny2 wrote:
| I'll add that many DRAM chips already do something like this,
| but ironically enough the re-routing mechanism adds
| complexity _which is itself a source of problems_ , (be it
| manufacturing or design, such as broken timing promises)
|
| Also, NAND Flash storage (SSD) is designed around the very
| concept of re-routing around bad blocks, because the very
| technology means they have a wear-life.
| Dylan16807 wrote:
| > I'll add that many DRAM chips already do something like
| this, but ironically enough the re-routing mechanism adds
| complexity which is itself a source of problems (be it
| manufacturing or design, such as broken timing promises)
|
| The best-performing solution there is probably software.
| Tell the OS about bad blocks and keep the hardware simple.
| nine_k wrote:
| I think this is already implemented both in Linux and in
| Windows; you can tell the OS which RAM ranges are
| defective.
|
| Doing this from the chip side is not there yet,
| apparently. I wonder when this will be included in the
| DRAM feature list, if ever. I suspect that detecting
| defects from the RAM side is not trivial.
| Dylan16807 wrote:
| > I suspect that detecting defects from the RAM side is
| not trivial.
|
| Factory testing or a basic self-test mode could easily
| find any parts that are flat-out broken. And as internal
| ECC rolls out as a standard feature, that could help find
| weaker rows over time.
| dylan42 wrote:
| > change in the way chips are designed can make them
| resilient to defects
|
| This is already happening for almost all modern chips
| manufactured in the last 10+ years. DRAM chips have extra
| rows/cols. Even Intel CPUs have redundant cache lines,
| internal bus lines and other redundant critical parts, which
| are burned-in during initial chip testing.
| ip26 wrote:
| The other big obstacle is that chips are square while wafers
| are round.
| paulmd wrote:
| it depends on the exact shape of your mask of course, but
| typically losses around the edges are in the 2-3% range.
|
| It's not really possible to fix this either since wafers
| need to be round for various manufacturing processes
| (spinning the wafer for coating or washing stages) and
| round obviously isn't a dense packing of the mask itself.
| It just kinda is how it is, square mask and round wafer
| means you lose a bit off the edges, fact of life.
| paulmd wrote:
| > changes in the way chips are designed can make them
| resilient to defects, so you no longer need to trash that
| chip on your wafer that's affected by such a defect,
|
| no, it's basically "chiplets but you don't cut the chiplets
| apart". You design the chiplets to be nodes in a mesh
| interconnect, and failed chiplets can simply be disabled
| entirely and then routed around. But they're still "chiplets"
| that have their own functionality and provide a coarser
| conceptual block than a core itself and thus simplify some of
| the rest of the chip design (communications/interconnect,
| etc).
|
| note that technically (if you don't mind the complexity)
| there's nothing wrong with harvesting at multiple levels like
| this! You could have "this chiplet has 8 cores, that one has
| 6, that one failed entirely and is disabled" and as long as
| it doesn't adversely affect program characteristics too much
| (data load piling up or whatever) that can be fine too.
|
| however, there's nothing about "changes in the way the chips
| are designed that makes them more resilient to defects", you
| still get the same failure rates per chiplet, and will still
| get the same number of failed (or partially failed) chiplets
| per wafer, but instead of cutting out the good ones and then
| repackaging, you just leave them all together and "route
| around the bad ones".
|
| The advantage is that MCM-style chiplet/interposer packaging
| actually makes data movement much more expensive, because you
| have to run a more powerful interconnect, whereas this isn't
| moving anything "off-chip", so you avoid a lot of that power
| cost. There are other technologies like EMIB and copper-
| copper bonding that potentially can lessen those costs for
| chiplets of course.
|
| What Intel is looking at doing with "tiles" in their future
| architectures with chiplets connected by EMIB at the edges
| (especially if they use copper-copper bonding) is sort of a
| half-step in engineering terms here but I think there are
| still engineering benefits (and downsides of course) to doing
| it as a single wafer rather than hopping through the bridge
| even with a really good copper-copper bond. Actual full-on
| MCM/interposer packaging is a step worse than cu-cu bonding
| and requires more energy but even cu-cu bonding is not
| perfect and thus not as good as just "on-chip" routing. So
| WSI is designed to get everything "on-chip" but without the
| yield problems of just a single giant chip.
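|
| A toy illustration of the "disable the failed node and route
| around it" idea on a 2D chiplet mesh (the grid size and the
| dead-node position are arbitrary, and real on-chip routers are
| nothing this simple):
|
|       from collections import deque
|
|       def route(width, height, disabled, src, dst):
|           """BFS a shortest path over a chiplet mesh, skipping dead nodes."""
|           frontier = deque([(src, [src])])
|           seen = {src} | set(disabled)
|           while frontier:
|               (x, y), path = frontier.popleft()
|               if (x, y) == dst:
|                   return path
|               for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
|                   if (0 <= nxt[0] < width and 0 <= nxt[1] < height
|                           and nxt not in seen):
|                       seen.add(nxt)
|                       frontier.append((nxt, path + [nxt]))
|           return None  # destination unreachable
|
|       # 4x4 mesh with one dead chiplet at (1, 1); traffic simply detours.
|       print(route(4, 4, disabled={(1, 1)}, src=(0, 0), dst=(3, 3)))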
| wmf wrote:
| Calling wafer-scale "no interconnect" is kind of misleading
| since it's still very difficult to stitch reticles and it has
| yield challenges.
| galaxyLogic wrote:
| Sounds like a great development if it works out.
|
| But consider also that you can stick chiplets on top of each
| other vertically. That means you can put chiplets much closer
| together than if they were constrained to exist on the same
| single plane of the wafer.
|
| Now how about stacking wafers on top of wafers? That could be
| super, but there might be technical difficulties, which maybe
| sooner or later can be overcome.
| AceJohnny2 wrote:
| > _But consider also that you can stick chiplets on top of
| each other vertically._
|
| The problem there is heat dissipation. Already the
| performance constraint on consumer chips like the Apple M1 is
| how well it can dissipate heat in the product it's placed in
| (see Macbook Air vs Mac Mini). Stacking the chips just makes
| it worse.
| GeekyBear wrote:
| The fact that the M1 Macbook Air operates without needing a
| fan is very unusual for that level of performance.
| paulmd wrote:
| AMD's 5800X3D and the upcoming generation of AMD/NVIDIA
| GPUs (both of which are rumored to feature stacked cache
| dies) are going to be real interesting. So far we haven't
| ever seen a stacked _enthusiast_ die (MCM doesn't feature
| any active transistors on the interposer) and it will be
| interesting to see how the thermals work out.
|
| This isn't even stacking _compute_ dies either; stacking
| memory/cache is the low-hanging fruit but in the long term
| what everyone really wants is stacking multiple compute
| dies on top of each other, and _that's_ going to get spicy
| real quick.
|
| M1 is the other example but again, Apple's architecture is
| sort of unique in that they've designed it to run from the
| ground up at 3 GHz exactly, there's no overclocking/etc
| like enthusiasts generally expect. AMD is having to disable
| voltage control/overclocking on the 5800X3D as well
| (although that may be more related to voltage control
| rather than thermals - sounds like the cache die may run
| off one of the voltage rails from the CPU, potentially a
| FIVR could be used to drive that rail independently, or add
| an additional V_mem rail...)
|
| And maybe that's the long-term future of things, that
| overclocking goes away and you just design for a tighter
| design envelope, one where you _know_ the thermals work for
| the dies in the middle of the sandwich. Plus the Apple
| design of "crazy high IPC and moderately low ~3 GHz
| clocks" appears well-adapted for that reality.
| Iwan-Zotow wrote:
| the problem is signal propagation
|
| for light to cross 1 foot should take ca. 1 ns
| paulmd wrote:
| 3D circuits would be denser (shorter propagation
| distances) than a planar circuit. In fact "computronium"
| is sort of an idea about how dense you can conceptually
| make computation.
|
| You just can't really cool it that well with current
| technologies. Microfluidics are the current magic wand
| that everyone wishes existed but it's a ways away yet.
| nynx wrote:
| There are some new chip manufacturing techniques coming down the
| pipeline which will lead to prices dropping, and "wafer-scale"
| will likely make it to the mainstream.
| tragictrash wrote:
| Could you elaborate? Would love to know more.
| nynx wrote:
| Unfortunately, I cannot.
| AtlasBarfed wrote:
| ... that's the exact opposite of every economic and yield
| advantage that chiplet design addresses, isn't it?
|
| Want to fine tune your chip offering to some multiple of 8
| cores (arbitrary example of the # cores on the chiplet)? Just a
| packaging issue.
|
| Want to up-bin very large core counts that generally overclock
| quite well? For the massive unichip described, maybe there are
| sections of the chip that are clocking well and sections that
| aren't: you're stuck. With chiplets, you have better binning
| granularity and packaging.
|
| Want to fine-tune various cache levels? I believe from what
| I've read that AMD is doing L3 on a separate chiplet (and
| vertically stacking it!). So you can custom-tune the cache size
| for chip designs.
|
| You can custom-process different parts of the "CPU" with
| different processes and fabrication, possibly even different
| fab vendors.
|
| You can upgrade various things like memory support and other
| evolving things in an isolated package, which should help
| design and testing.
|
| The interconnects are the main problem. But then again, I can't
| imagine what a foot-wide CPU introduces for intra-chip
| communication; it probably would have its own pseudo-
| interconnect highway anyway.
|
| Maybe you don't even need to reengineer some chiplets between
| processor generations. If the BFD of your new release is some
| improvement to either the high or the low CPUs in a high-low
| design, but the other is the same, then that should mean more
| organizational efficiency.
|
| Intel and others effectively moved away from gigantic
| integrated circuits decades ago: motherboard chipsets were
| always done on a separate, cheaper fab process that was a gen
| or two behind the CPU.
|
| Maybe when process tech has finally stabilized for a generation,
| now that process technology seems to be stagnating more, massive
| wafer designs will start to edge out chiplet designs, but right
| now it appears that the opposite has happened and will continue
| for the foreseeable future.
| kwhitefoot wrote:
| Nothing new under the sun. Ivor Catt was proposing wafer scale
| computing in the '70s. Large numbers of processors with the
| ability to route around defective units.
|
| https://www.ivorcatt.org/icrns86jun_0004.htm
| Iwan-Zotow wrote:
| speed of light ~ 1*10^9 feet/sec
|
| To cross one foot - no less than 10^-9 sec = 1ns
| FredPret wrote:
| Instead of microcircuits, megacircuits. I like it
| truth_seeker wrote:
| A chip with a semi-FPGA as well as semi-ASIC strategy could work.
| The FPGA dev toolchain needs to improve.
| tempnow987 wrote:
| "Reached their limits" - I feel like I've heard this many many
| times before.
|
| Not that I doubt it, but just I've also been impressed with the
| ingenuity that folks come up with in this space.
| mjreacher wrote:
| Agreed. I would be wary of reaching fundamental limits set by
| physics although I don't think we're there yet.
|
| "It would appear that we have reached the limits of what is
| possible to achieve with computer technology, although one
| should be careful with such statements, as they tend to sound
| pretty silly in five years."
|
| - attributed to von Neumann, 1949.
| tawaypol wrote:
| "There's plenty of room at the bottom."
| marcosdumay wrote:
| Actually, we are running out of room there.
|
| That speech is over 60 years old nowadays. There was plenty
| of room at that time.
|
| Of course, it also speculated that we would move into quantum
| computers at some point, which is still a possibility, but now
| we know that quantum computers won't solve every issue.
| syntheweave wrote:
| We only have to solve one limitation per year to keep making
| progress year over year, and as it is, the semiconductor
| industry still seems to be solving large numbers of significant
| issues yearly. So while we don't necessarily get smooth,
| predictable improvement, a safe bet is that there will
| continue to be useful new developments 10-20 years out, even if
| they don't translate to the same kinds of gains as in years
| past.
| macrolocal wrote:
| For example, there's lots to explore in the VLIW space.
| JonChesterfield wrote:
| Compilers, largely.
| macrolocal wrote:
| Yep, and also architectures whose state is simpler to
| model.
| gameswithgo wrote:
| aeturnum wrote:
| I read articles like this as saying "reached their limits [as
| we currently understand them]." Sometimes we learn we were
| mistaken and more is possible but it's not reliable and,
| crucially, when it happens it happens in unexpected ways. The
| process of talking about when (and why) techniques have hit
| their useful limits is often key to unearthing the next step.
| anonymousDan wrote:
| So is UCIe a competitor/potential successor for something like
| Intel's QPI (or whatever they are using now)?
| RcouF1uZ4gsC wrote:
| > UCIe is a start, but the standard's future remains to be seen.
| "The founding members of initial UCIe promoters represent an
| impressive list of contributors across a broad range of
| technology design and manufacturing areas, including the HPC
| ecosystem," said Nossokoff, "but a number of major organizations
| have not as yet joined, including Apple, AWS, Broadcom, IBM,
| NVIDIA, other silicon foundries, and memory vendors."
|
| The fact that the standard doesn't include anyone who is actually
| building chips makes me very pessimistic about it.
| ranger207 wrote:
| Looks like a lot of people who actually build chips are in the
| organization
|
| https://www.uciexpress.org/membership
| AnimalMuppet wrote:
| "More multi-chip processor designs" != "single-chip processors
| have reached their limits".
| refulgentis wrote:
| I'm embarrassed to admit I still don't quite understand what a
| chiplet is, would be very grateful for your input here.
|
| If a thread can run on multiple chiplets then this is awesome and
| seems like a solution.
|
| If one thread == one chiplet, then*:
|
| - a chiplet is equivalent to a core, except with speedier
| connections to other cores?
|
| - this isn't a solution, we're 15 years into cores and single-
| threaded performance is still king. If separating work into
| separate threads was a solution, cores would work more or less
| just fine.**
|
| * put "in my totally uneducated opinion, it seems like..." before
| each of these, internet doesn't communicate tone well and I'm
| definitely not trying to pass judgement here, I don't know what
| I'm talking about!
|
| ** generally, for consumer hardware and use cases, i.e. "I am
| buying a new laptop and I want it to go brrrr", all sorts of
| caveats there of course
| [deleted]
| sliken wrote:
| AMD Epyc is (AFAIK) what popularized the term. Their current
| design has a memory controller die (PCIe controller, 8 x 64-bit
| channels of RAM, etc.) and 8 chiplets which are pretty much just
| 8 cores and an Infinity Fabric link for a cache-coherent
| connection to other CPUs (in the same or other sockets) and
| DRAM.
|
| So generally Epyc comes with some multiple of 8 cores enabled
| (the same number per chiplet), and the latency between cores on
| the same chiplet is lower than the latency to other chiplets.
|
| This allows AMD to target high end servers (up to 64 cores),
| low end (down to 16), workstations with threadripper (4
| chiplets instead of 8), and high end desktops (2 chiplets
| instead of 8) with the same silicon. This allows them to spend
| less on fabs, R&D, etc because they can amortize the silicon
| over more products/volume. It also lets them bin them so
| chiplets with bad cores can still be sold. It's one of the
| things that lets AMD compete with the much larger volume Intel
| has, and do pretty well against Intel's numerous silicon
| designs.
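|
| As a toy model of why that latency split matters when scheduling
| threads (the core counts follow the layout above; the nanosecond
| figures are invented round numbers, not measurements):
|
|       CORES_PER_CHIPLET = 8   # 8 chiplets x 8 cores on a full Epyc
|
|       def core_to_core_latency_ns(a, b):
|           """Very rough illustrative latency between two core IDs."""
|           if a == b:
|               return 0
|           if a // CORES_PER_CHIPLET == b // CORES_PER_CHIPLET:
|               return 25    # same chiplet: stays on that die
|           return 100       # crosses Infinity Fabric via the IO die
|
|       print(core_to_core_latency_ns(0, 7))   # same chiplet
|       print(core_to_core_latency_ns(0, 8))   # different chiplets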
| hesdeadjim wrote:
| A chiplet is a full-fledged CPU with many cores on it. The term
| is used when multiple of these chips are stitched together with
| a high speed interconnect and plugged into the single socket on
| your motherboard.
|
| If you ripped the lid off a Ryzen "chip", you would see
| multiple CPU dies underneath for the high end models.
| tenebrisalietum wrote:
| Additionally - MCM - multi-chip module - instead of putting
| separate chips for various functions on a board, they're
| fused together in what from the outside looks like a single
| chip, but internally is 3 or 4 unrelated chips.
|
| Examples at the Wikipedia article:
| https://en.wikipedia.org/wiki/Multi-chip_module
| WalterBright wrote:
| I remember back in the 80's the limit was considered to be 64K
| RAM chips, because otherwise the defect rate would kill the
| yield.
|
| Of course, there's always the "make a 4 core chip. If one core
| doesn't work, sell it as a 3 core chip. And so on."
| dboreham wrote:
| Hmm. I worked for a memory manufacturer in the 80s and I do not
| remember any limit.
| ksec wrote:
| That is mainstream news reporting for you since the 80s.
| throwaway4good wrote:
| "Single-Chip Processors Have Reached Their Limits
|
| Announcements from XYZ and ABC prove that chiplets are the
| future, but interconnects remain a battleground"
|
| This could easily have been written 10 years ago, and I bet
| someone will write it in 10 years again.
|
| We need these really big chips with their big powerful cores
| because the nature of the computing we do only changes very slowly
| towards being distributed and parallelizable and thus able to use
| a massive number of smaller but far more efficient cores.
| Dylan16807 wrote:
| You're implying you can't put big powerful cores on chiplets
| but that's not true at all.
| lazide wrote:
| Hardly - performance/core hasn't flatlined, but has not
| maintained the same growth over time (decades) in performance
| we've traditionally had. That's the problem.
|
| So if you want better aggregate performance, more cores has
| been the plan for a decade+ now.
|
| FLOP/s per core or whatever other metric you choose to use.
|
| Previously it was possible to get 20-50% or more performance
| improvements even year to year for a core.
| Dylan16807 wrote:
| I wasn't talking about improvement at all. This was about
| big strong cores versus efficient cores, which is a
| tradeoff that always exists.
|
| You could choose between 20 strong cores or 48 efficient
| cores on the same die space across four chiplets, for
| example.
| alain94040 wrote:
| Correct. Also known as Rent's rule. According to Wikipedia, it
| was first mentioned in the 1960s:
| https://en.wikipedia.org/wiki/Rent%27s_rule
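|
| For reference, the usual statement of the rule (as on the linked
| Wikipedia page) relates a block's external terminals T to its
| internal gate count g:
|
|       T = t * g^p
|
| where t is the average number of terminals per gate and the Rent
| exponent p is typically somewhere around 0.5-0.8 depending on the
| architecture.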
| narag wrote:
| I hope somebody with relevant knowledge can answer this question,
| please: what % of the costs is "physical cost per unit" and what
| % is maintaining the R&D, factories, channels...?
|
| In other words, if a chip with 100x size (100x gates, etc.) made
| sense, would it cost 100x to produce or just 10x or just 2x?
|
| Edit: provided there wouldn't be additional design costs, just
| stacking current tech.
| ksec wrote:
| >would it cost 100x to produce or just 10x or just 2x?
|
| Why would 100x something only cost 2x to produce?
|
| >what % of the costs is "physical cost per unit" and what % is
| maintaining the R&D, factories, channels...?
|
| Without unit volume and a definition of the first "cost" in the
| sentence, no one could answer that question. But if you want to
| know the BOM cost of a chip, it is simply wafer price divided by
| total usable chips, which depends on yield, where yield is a
| factor of both the current maturity of the node and whether your
| design allows correction of defects. Then add about ~10% for
| testing and packaging.
| mlyle wrote:
| There are many limiting factors... one is the reticle limit.
|
| But most fundamental is the defect density on wafers. If you
| have, say, 10 defects per wafer, and you have 1000 chips on it:
| odds are you get 990 good chips.
|
| If you have 10 chips on the wafer, you get 2-3 good chips per
| wafer.
|
| Of course, there's yield maximization strategies, like being
| able to turn off portions of the die if it's defective (for
| certain kinds of defects).
|
| For the upper limit, look at what Cerebras is doing with wafer
| scale. Then you get into related, crazy problems, like getting
| thousands of amperes into the circuit and cooling it.
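|
| A back-of-the-envelope version of that effect (a simple Poisson
| yield model assuming uniformly random, always-fatal defects, using
| the same numbers as above):
|
|       import math
|
|       def expected_good_dies(dies_per_wafer, defects_per_wafer):
|           """P(a die has zero defects) = exp(-mean defects per die)."""
|           defects_per_die = defects_per_wafer / dies_per_wafer
|           return dies_per_wafer * math.exp(-defects_per_die)
|
|       print(expected_good_dies(1000, 10))   # ~990 good dies
|       print(expected_good_dies(10, 10))     # ~3.7 good dies
|
| Real defect distributions and which defects are actually fatal
| shift these numbers around, but the asymmetry between small and
| big dies is the point.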
| nightfly wrote:
| I'm not an expert, or even an amateur, here but defects are
| inevitable. So if you _need_ 100x the size without defects and
| one defect ruins the chip, the cost might be 10000x to produce.
| tonyarkles wrote:
| It's been a while since I was in that industry, but
| back around the 45nm days, one of the biggest concerns was
| yield. If you've got 100x the surface area, the probability of
| there being a manufacturing defect that wrecks the chip goes
| up. Now, you could probably get away with selectively disabling
| defective cores, but the chiplet idea seems, to me, like it
| would give you a lot more flexibility. As an example, let's say
| a chiplet i9 requires 8x flawless chips, and a chiplet Celeron
| requires 4 chips, but they're allowed to have defects in the
| cache because the Celeron is sold with a smaller cache anyway.
|
| In the "huge chip" case, you need the whole 8x area to be
| flawless, otherwise the chip gets binned as a Celeron. If the
| chiplet case, any single chip with a flaw can go into the
| Celeron bin, and 8 flawless ones can be assembled into a
| flawless CPU, and any defect ones go into the re-use bin. And
| if you end up with a flawed chip that can't be used at the
| smallest bin size, you're only tossing 1/4 or 1/8 of a CPU in
| the trash.
| wmf wrote:
| The way TSMC amortizes those fixed costs is to charge by the
| wafer, so if your chip is 100x larger it costs at least 100x
| more. (You will have losses due to defects and around the edges
| of the wafer.) You can play with a calculator like
| https://caly-technologies.com/die-yield-calculator/ to get a
| feel for the numbers.
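|
| A rough sketch of how charge-by-the-wafer plus yield turns into
| cost per good chip (the wafer price, defect density, and die areas
| below are placeholder numbers, not TSMC quotes):
|
|       import math
|
|       def cost_per_good_die(wafer_price, wafer_area_mm2, die_area_mm2,
|                             defect_density_per_mm2):
|           """Spread the fixed wafer price over the dies that yield."""
|           dies = wafer_area_mm2 / die_area_mm2     # ignores edge losses
|           good = dies * math.exp(-defect_density_per_mm2 * die_area_mm2)
|           return wafer_price / good
|
|       WAFER = math.pi * 150**2   # 300 mm wafer, ~70,700 mm^2
|       print(cost_per_good_die(10_000, WAFER, 100, 0.001))  # ~$16
|       print(cost_per_good_die(10_000, WAFER, 800, 0.001))  # ~$252
|
| The die got 8x bigger but the cost per good die went up ~16x, which
| is the "at least" in "at least 100x more".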
| hinkley wrote:
| I hope we are going to get back to a more asymmetric multi-
| processing arrangement in the near term where we abandon the
| fiction of a processor or two running the whole show with
| peripheral systems that have as little smarts as possible and
| promote them to at least second class citizens.
|
| These systems are much more powerful than when these abstractions
| were laid down, and at this point it feels like the difference
| between redundant storage on the box versus three feet away is
| more academic than anything else.
| wmf wrote:
| That kind of exists since most I/O devices have CPU cores in
| them, although usually hidden behind register-based interfaces.
| Apple has taken it a little further by using the same core
| everywhere and creating a standard IPC mechanism.
| gotaquestion wrote:
| The problem is AMP is very hard to program and debug. In
| embedded, one core is a scheduler and another is doing some
| real-time task (like Arm big.LITTLE). In larger automotive
| heterogeneous compute platforms, typically they are all treated
| as accelerators, or with bespoke Tier-1 integration (or like
| NVIDIA Xavier). And on top of that, OEMs always want to
| "reclaim" those spare cycles when the other AMP cores are
| underutilized, which is nigh impossible to do, so they fall
| back to symmetric MP. I think embedded is the only place for
| this to work right now.
|
| EDIT: I'm not an expert in this field but I have been asked to
| do work in this domain, and this narrow sampling is what I
| encountered, but I'd like to learn more about tooling and
| strategies for more generic AMP deployments.
| gnarbarian wrote:
| Are we moving this way because bigger chips with many cores have
| worse yields? So the answer is to make lots of little chips and
| fuse them together?
| [deleted]
| sliken wrote:
| Well, fusing is one possibility. AMD Epyc generally has an
| IO+memory controller die (called the IOD) plus 8 chiplets of 8
| cores each for most of the Epyc chips; however, not all cores
| are enabled, depending on the SKU.
|
| However, Apple's approach does allow impressive bandwidth,
| 2.5 TB/sec, which is much higher than any of the chiplet
| approaches I'm aware of.
| monocasa wrote:
| Yeah, in the very general case, chip errors are a function of
| die area. Cutting a die into four pieces so that when an error
| occurs in manufacturing, you only throw out a quarter of the
| die area is becoming the right model for a lot of designs.
|
| Like all things chips, it's way more complicated than that,
| fractally, as you start digging in. Like, AMD started down this
| road initially because of their contractual agreements with
| GloFo to keep shipping GloFo dies, but wanted the bulk
| of the logic on a smaller node than GloFo could provide, hence
| the IO die and compute chiplets model that still exists in Zen.
| It's still a good idea for other reasons but they lucked out a
| bit by being forced in that direction before other major
| fabless companies.
|
| This is also not a new idea, but sort of ebbs and flows with
| the economics of the chip market. See the VAX 9000 multi chip
| modules for an 80s take on the same ideas and economic
| pressures.
| WithinReason wrote:
| Their GPUs are likely to be multichip for the first time too
| with NAVI 31 (while Nvidia's next gen will still be single
| chip and likely fall behind AMD). It also seems like the
| cache will be 6nm while the logic will be 5nm, bonded
| together with some new TSMC technology. At least that can be
| inferred from some leaks:
|
| https://www.tweaktown.com/news/84418/amd-rdna-3-gpu-
| engineer...
| thissiteb1lows wrote:
| ceeplusplus wrote:
| I've yet to see any sort of research out of AMD on MCM
| mitigations for things like cache coherency and NUMA.
| Nvidia on the other hand has published papers as far back
| as 2017 on the subject. On top of that even the M1 Ultra
| has some rough scaling spots in certain workloads and Apple
| is by far ahead of everyone else on the chiplet curve (if
| you don't believe me, try testing lock-free atomic
| load/store latency across CCX's in Zen3).
|
| Also AMD claimed the MI250X is "multichip" but it presents
| itself as 2 GPUs to the OS and the interconnect is worse
| than NVLink.
| monocasa wrote:
| There's a few ways to interpret that. Another
| interpretation could be that they are simply taping out
| Navi32 on two nodes, perhaps for AMD to better utilize the
| 5nm slots they have access to. Perhaps when Nvidia is on
| Samsung 10nm+++, then the large consumer AMD GPUs get a
| node advantage already being at TSMC 7nm+++, and so they're
| only using 5nm slots for places like integrated GPUs and
| data center parts that care about perf/watt.
|
| But your interpretation is equally valid with the
| information we have AFAICT.
| IshKebab wrote:
| This is what Tesla's Dojo does (it's really a TSMC technology
| that they are the first to utilize). You can cut your wafer up
| into chips, ditch the bad ones, then reassemble them into a
| bigger wafery chip thing using some kind of glue. Then you can
| do more layers to wire them up.
|
| I think they do it using identical chips but I guess there's no
| real reason you couldn't have different chips connected in one
| wafer. Expensive though!
___________________________________________________________________
(page generated 2022-04-04 23:00 UTC)