[HN Gopher] Potential and Limitation of High-Frequency Cores and...
       ___________________________________________________________________
        
       Potential and Limitation of High-Frequency Cores and Caches (2024)
        
       Author : matt_d
       Score  : 18 points
       Date   : 2025-06-05 23:19 UTC (3 days ago)
        
 (HTM) web link (arch.cs.ucdavis.edu)
 (TXT) w3m dump (arch.cs.ucdavis.edu)
        
       | bob1029 wrote:
       | > We also did not model the SERDES (serializer-deserializer)
       | circuits that would be required to interface the superconductor
       | components with the room-temperature components, which would have
       | an impact on the performance of the workloads. Instead, we
       | assumed that the interconnect is unchanged from CMOS.
       | 
        | I had a little chuckle when I got to this. I/O is the hard
        | part: getting the information from A to B.
       | 
        | IBM is probably pushing the practical limits with a 5.5 GHz base
       | clock on every core. When you can chew through 10+ gigabytes of
       | data per second per core, it becomes a lot less about what the
       | CPU can do and more about what everything around it can do.
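        | 
        | As a rough sanity check, here is a minimal Python sketch of
        | that arithmetic (the 5.5 GHz clock and the ~10 GB/s per core
        | are the figures above; the 32-core count is just an assumed,
        | illustrative value):
        | 
        |     core_clock_hz = 5.5e9  # base clock mentioned above
        |     per_core_bw = 10e9     # ~10 GB/s consumed per core
        |     cores = 32             # assumed core count (example only)
        | 
        |     # Per-core demand in bytes per core clock cycle.
        |     print(per_core_bw / core_clock_hz)        # ~1.8 B/cycle
        |     # Aggregate demand the memory system must feed.
        |     print(cores * per_core_bw / 1e9, "GB/s")  # 320 GB/s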
       | 
       | The software is usually the weak link in all of this. Disrespect
       | the NUMA and nothing will matter. The layers of abstraction can
       | make it really easy to screw this up.
        
         | PaulHoule wrote:
          | In a phase when I was doing a lot of networking, I hooked up
          | with a chip designer who familiarized me with the "memory
          | wall": ASICs and FPGAs aren't quite the panacea they seem to
          | be, because if you have a large working set you are limited
          | by memory bandwidth and latency.
         | 
          | Note that faster-than-silicon electronics have been around
          | for a while: the DoD put out an SBIR for a microprocessor
          | based on indium phosphide in the 1990s, which I suspect is a
          | real product today, but secret. [1] Looking at what fabs
          | provide, it seems one could make something a bit better than
          | a 6502 that clocks out at 60 GHz, and maybe you could couple
          | it to 64 KB of static RAM, maybe more with 2.5-D packaging.
          | You might imagine something like that would be good for
          | electronic warfare, and for the simplest algorithms and
          | waveforms it could buy a few ns of reduced latency, but for
          | more complex algorithms modern chips get a lot of
          | parallelism and are hard to beat on throughput.
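          | 
          | A minimal Python sketch of why the memory wall bites a part
          | like that (the 60 GHz clock is the figure above; the ~10 ns
          | off-package access latency is an assumed, illustrative
          | number):
          | 
          |     clock_hz = 60e9     # hypothetical InP core clock
          |     latency_s = 10e-9   # assumed off-package access time
          | 
          |     print(1e12 / clock_hz, "ps period")    # ~16.7 ps
          |     print(latency_s * clock_hz, "cycles")  # ~600 stalled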
         | 
         | [1] Tried talking with people who might know, nobody wanted to
         | talk.
        
           | foota wrote:
            | I've read confidential proposals for chips with very high
            | available memory bandwidth, but otherwise reduced
            | performance compared to a standard general-purpose CPU.
           | 
            | Something somewhere between a CPU and a GPU: one that could
            | handle many parallel streams, each at lower throughput than
            | a CPU core, with very high memory bandwidth for tasks that
            | need to be done against main memory. The niche here is for
            | things like serialization and compression that need lots of
            | bandwidth, can't be done efficiently on the GPU (not
            | parallel), and waste precious time on the CPU.
        
             | PaulHoule wrote:
             | Like
             | 
             | https://en.wikipedia.org/wiki/UltraSPARC_T1
             | 
             | ?
        
               | foota wrote:
                | Similar in concept; I think the idea is that it would
                | be used as an application coprocessor, though, as
                | opposed to the main processor, and obviously with a
                | lot more threads.
               | 
               | I don't remember all the details, but picture a bunch of
               | those attached to different parts of the processor
               | hierarchy remotely, e.g., one per core or one per NUMA
                | node, etc. The connection between the coprocessor and
               | the processor can be thin, because the processor would
               | just be sending commands to the coprocessor, so they
               | wouldn't consume much of the constrained processor
               | bandwidth, and each coprocessor would have a high
               | bandwidth connection to memory.
        
               | saltcured wrote:
               | There was also the Tera MTA and various "processor-in-
               | memory" research projects in academia.
               | 
                | Eventually, it all comes full circle to supercomputer
                | versus "Hadoop cluster" again. Can you farm out work
                | locally, near bits of the data, or does your algorithm
                | effectively need global scope to "transpose" data and
                | hit the bisection-bandwidth limits of your
                | interconnect topology?
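                | 
                | A minimal Python sketch of that bisection
                | limit (all numbers are assumed, illustrative
                | values, not measurements):
                | 
                |     data_tb = 10       # data to shuffle, TB
                |     bisect_tb_s = 0.5  # bisection BW, TB/s
                |     # Half the data must cross the bisection
                |     # in an all-to-all "transpose", so:
                |     print(data_tb / 2 / bisect_tb_s, "s")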
        
           | Veserv wrote:
           | I am not sure that is the case anymore. High Bandwidth Memory
           | (HBM) [1] as used on modern ML training GPUs has immensely
           | more memory bandwidth than traditional CPU systems.
           | 
            | DDR5 [2] tops out around 60-80 GB/s. HBM3, used on the H100
            | GPUs, tops out at 819 GB/s per stack. That is 10-15x more
            | bandwidth. At a 4 GHz clock, you need to crunch about 200
            | bytes/clock to become memory bandwidth limited.
           | 
           | [1] https://en.wikipedia.org/wiki/High_Bandwidth_Memory
           | 
           | [2] https://en.wikipedia.org/wiki/DDR5_SDRAM
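            | 
            | The arithmetic above as a minimal Python sketch (the
            | bandwidth and clock figures are the ones quoted; nothing
            | else is assumed):
            | 
            |     ddr5_gbs = 70    # midpoint of the 60-80 GB/s above
            |     hbm3_gbs = 819   # per-stack HBM3 figure above
            |     clock_hz = 4e9   # 4 GHz clock from above
            | 
            |     print(hbm3_gbs / ddr5_gbs)        # ~11.7x
            |     print(hbm3_gbs * 1e9 / clock_hz)  # ~205 bytes/clock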
        
             | ryao wrote:
             | The memory wall (also known as the Von Neumann bottleneck)
             | is still true. Token generation on Nvidia GPUs is memory
             | bound, unless you do very large batch sizes to become
             | compute bound.
             | 
              | That said, the more exotic architectures from Cerebras
              | and Groq get far lower tokens-per-second performance
              | than their memory bandwidth suggests they could, so they
              | have a bottleneck elsewhere.
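              | 
              | A minimal Python sketch of the usual roofline estimate
              | for single-stream decoding (the model size and bandwidth
              | here are assumed, illustrative values, not
              | measurements):
              | 
              |     weights_gb = 140   # e.g. 70B params at FP16
              |     mem_bw_gbs = 3350  # assumed accelerator HBM BW
              | 
              |     # Each token streams roughly all weights from
              |     # memory once, so bandwidth caps tokens/second.
              |     print(mem_bw_gbs / weights_gb)   # ~24 tokens/s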
        
               | Veserv wrote:
                | GPUs end up memory bound because they have so much
                | more compute per unit of memory bandwidth. The H100
                | has 144 SMs driving 4x32 threads per clock. That is
                | 18,432 threads demanding memory.
                | 
                | Now, to be fair, that is separated into 8 clusters
                | which I assume are connected to their own memory, so
                | you actually only have 2,304 threads sharing each
                | slice of memory bandwidth. But that is still way more
                | compute than any single processing element could ever
                | hope to have. You can drown any individual processor
                | in memory bandwidth these days unless you somehow
                | produce a processor clocked at multiple THz.
               | 
               | The problem does not seem to be memory bandwidth, but
               | cost, latency, and finding the cost-efficient compute-
               | bandwidth tradeoff for a given task.
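                | 
                | The thread arithmetic as a minimal Python
                | sketch (the SM and thread counts are from
                | above; the bandwidth and clock are assumed
                | figures):
                | 
                |     thr = 144 * 4 * 32   # 18,432 threads
                |     print(thr // 8)      # 2,304 per cluster
                |     bw = 3350e9          # assumed HBM B/s
                |     clk = 1.8e9          # assumed GPU clock
                |     print(bw / clk / thr)  # ~0.1 B/thr/clk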
        
             | PaulHoule wrote:
             | Certainly an ASIC or FPGA on a package with HBM could do
             | more.
             | 
              | As far as exotic 10x-clocked systems based on III-V
              | semiconductors, SQUIDs, or something similar go, I think
              | memory does have to be packaged with the rest of it,
              | because of speed-of-light issues.
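              | 
              | A minimal Python sketch of the speed-of-light argument
              | (the 60 GHz clock is the figure from upthread; the 0.5c
              | propagation factor is an assumed, typical value for
              | on-package traces):
              | 
              |     c = 3e8           # speed of light, m/s
              |     clock_hz = 60e9   # fast clock from upthread
              |     prop = 0.5        # assumed trace velocity factor
              | 
              |     # Distance a signal can travel in one clock.
              |     print(prop * c / clock_hz * 1e3, "mm")  # ~2.5 mm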
        
             | markhahn wrote:
              | They're both DRAM, so they have roughly the same
              | performance per interface bit width and clock. You can
              | see this very naturally by looking at higher-end CPUs,
              | which have wider DDR interfaces (currently up to 12x64b
              | per socket - not as wide as in-package HBM, but duh).
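              | 
              | A minimal Python sketch of that per-bit comparison (the
              | per-pin data rates and the 1024-bit stack width are
              | assumed, representative values):
              | 
              |     ddr5_pin = 4.8   # Gb/s per pin, e.g. DDR5-4800
              |     hbm3_pin = 6.4   # Gb/s per pin, HBM3
              | 
              |     print(12 * 64 * ddr5_pin / 8)  # ~461 GB/s socket
              |     print(1024 * hbm3_pin / 8)     # ~819 GB/s stack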
        
       ___________________________________________________________________
       (page generated 2025-06-09 23:01 UTC)