[HN Gopher] Potential and Limitation of High-Frequency Cores and...
___________________________________________________________________
Potential and Limitation of High-Frequency Cores and Caches (2024)
Author : matt_d
Score : 18 points
Date : 2025-06-05 23:19 UTC (3 days ago)
(HTM) web link (arch.cs.ucdavis.edu)
(TXT) w3m dump (arch.cs.ucdavis.edu)
| bob1029 wrote:
| > We also did not model the SERDES (serializer-deserializer)
| circuits that would be required to interface the superconductor
| components with the room-temperature components, which would have
| an impact on the performance of the workloads. Instead, we
| assumed that the interconnect is unchanged from CMOS.
|
| I had a little chuckle when I got to this. I/O is the hard part.
| Getting the information from A to B.
|
| IBM is probably pushing the practical limits with a 5.5 GHz base
| clock on every core. When you can chew through 10+ gigabytes of
| data per second per core, it becomes a lot less about what the
| CPU can do and more about what everything around it can do.
|
| The software is usually the weak link in all of this. Disrespect
| the NUMA and nothing will matter. The layers of abstraction can
| make it really easy to screw this up.
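A minimal sketch of the NUMA point above, keeping a worker thread and its buffer on the same NUMA node with libnuma on Linux (not from the thread; the node number and buffer size are arbitrary illustration values):

    /* Minimal libnuma sketch: run on one node and allocate from that
     * same node, so the working set stays local. Build with -lnuma. */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma not available on this system\n");
            return 1;
        }

        int node = 0;                 /* illustrative node choice */
        size_t len = 1UL << 30;       /* 1 GiB working buffer */

        numa_run_on_node(node);       /* keep this thread on that node */
        char *buf = numa_alloc_onnode(len, node);   /* node-local memory */
        if (!buf) return 1;

        memset(buf, 0, len);          /* touch pages so they get placed */
        printf("CPU %d working on node-%d local memory\n",
               sched_getcpu(), node);

        numa_free(buf, len);
        return 0;
    }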
| PaulHoule wrote:
| In a phase when I was doing a lot of networking I hooked up
| with a chip designer who familiarized me with the "memory
| wall", ASIC and FPGA aren't quite the panacea they seem to be
| because if you have a large working set you are limited by
| memory bandwidth and latency.
|
| Note that faster-than-silicon electronics have been around for
| a while; the DOD put out an SBIR for a microprocessor based on
| indium phosphide in the 1990s, which I suspect is a real
| product today but secret. [1] Looking at what fabs provide, it
| seems one could make something a bit better than a 6502 that
| clocks out at 60 GHz, and maybe you could couple it to 64 KB of
| static RAM, maybe more with 2.5D packaging. You might imagine
| something like that would be good for electronic warfare, and
| for the simplest algorithms and waveforms it could buy a few ns
| of reduced latency, but for more complex algorithms modern
| chips get a lot of parallelism and are hard to beat on
| throughput.
|
| [1] Tried talking with people who might know, nobody wanted to
| talk.
| foota wrote:
| I've read confidential proposals for chips with very high
| available memory bandwidth, but otherwise reduced performance
| compared to a standard general purpose CPU.
|
| Something somewhere between a CPU and a GPU, that could
| handle many parallel streams, but at lower per-stream
| throughput than a
| CPU, and with very high memory bandwidth for tasks that need
| to be done against main memory. The niche here is for things
| like serialization and compression that need lots of
| bandwidth, can't be done efficiently on the GPU (not
| parallel), and waste precious time on the CPU.
| PaulHoule wrote:
| Like
|
| https://en.wikipedia.org/wiki/UltraSPARC_T1
|
| ?
| foota wrote:
| Similar in concept, I think the idea is that it would be
| used as an application coprocessor though, as opposed to
| the main processor, and obviously a lot more threads.
|
| I don't remember all the details, but picture a bunch of
| those attached to different parts of the processor
| hierarchy remotely, e.g., one per core or one per NUMA
| node, etc. The connection between the coprocessor and
| the processor can be thin, because the processor would
| just be sending commands to the coprocessor, so they
| wouldn't consume much of the constrained processor
| bandwidth, and each coprocessor would have a high
| bandwidth connection to memory.
| saltcured wrote:
| There was also the Tera MTA and various "processor-in-
| memory" research projects in academia.
|
| Eventually, it all comes full circle to supercomputer
| versus "Hadoop cluster" again. Can you farm out work
| locally near bits of data, or does your algorithm
| effectively need global scope to "transpose" data and hit
| the bisection bandwidth limits of your interconnect
| topology?
| Veserv wrote:
| I am not sure that is the case anymore. High Bandwidth Memory
| (HBM) [1] as used on modern ML training GPUs has immensely
| more memory bandwidth than traditional CPU systems.
|
| DDR5 [2] tops out around 60-80 GB/s for a typical dual-channel
| setup. HBM3, as used on the H100 GPUs, tops out at 819 GB/s per
| stack: 10-15x more bandwidth. At a 4 GHz clock, you would need
| to crunch about 200 bytes per clock to become memory-bandwidth
| limited.
|
| [1] https://en.wikipedia.org/wiki/High_Bandwidth_Memory
|
| [2] https://en.wikipedia.org/wiki/DDR5_SDRAM
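A quick back-of-envelope check of the figures in the comment above, as a small C sketch; the 70 GB/s value is just the midpoint of the DDR5 range cited there, 819 GB/s is the HBM3 per-stack figure, and 4 GHz is the hypothetical core clock from the comment.

    /* Rough check of the bandwidth comparison above (illustrative numbers). */
    #include <stdio.h>

    int main(void) {
        double ddr5_gbs  = 70.0;    /* midpoint of the 60-80 GB/s DDR5 range above */
        double hbm3_gbs  = 819.0;   /* one HBM3 stack, as cited above */
        double clock_ghz = 4.0;     /* hypothetical core clock */

        printf("HBM3 vs DDR5 bandwidth: %.1fx\n", hbm3_gbs / ddr5_gbs);
        printf("Bytes per clock to saturate HBM3 at %.0f GHz: ~%.0f\n",
               clock_ghz, hbm3_gbs / clock_ghz);   /* ~205 bytes/clock */
        return 0;
    }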
| ryao wrote:
| The memory wall (also known as the von Neumann bottleneck)
| is still true. Token generation on Nvidia GPUs is memory
| bound, unless you do very large batch sizes to become
| compute bound.
|
| That said, more exotic architectures from Cerebras and Groq
| get far lower tokens-per-second performance than their memory
| bandwidth suggests they could, so they have a bottleneck
| elsewhere.
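To make the memory-bound point concrete, here is a rough roofline-style estimate as a C sketch; the 70B-parameter FP16 model and the ~3.35 TB/s aggregate H100 HBM3 bandwidth are illustrative assumptions, not figures from the thread.

    /* At batch size 1, each generated token streams the full set of
     * weights from HBM once, so tokens/s is bounded by bandwidth / bytes. */
    #include <stdio.h>

    int main(void) {
        double params_billion  = 70.0;   /* hypothetical 70B-parameter model */
        double bytes_per_param = 2.0;    /* FP16 weights */
        double weights_gb = params_billion * bytes_per_param;  /* ~140 GB */
        double hbm_gbs    = 3350.0;      /* ~H100 aggregate HBM3 bandwidth */

        printf("Memory-bound ceiling: ~%.0f tokens/s at batch 1\n",
               hbm_gbs / weights_gb);    /* ~24 tokens/s before compute limits */
        return 0;
    }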
| Veserv wrote:
| You get a memory bound on GPUs because they have so much
| more compute per memory. The H100 has 144 SMs driving
| 4x32 threads per clock. That is 18,432 threads demanding
| memory.
|
| Now to be fair, that is separated into 8 clusters, which I
| assume are connected to their own memory, so you actually
| only have 2,304 threads sharing memory bandwidth. But that
| is still way more compute than any single processing
| element could ever hope to have. You can drown any
| individual processor in memory bandwidth these days
| unless you somehow produce a processor clocked at
| multiple THz.
|
| The problem does not seem to be memory bandwidth, but
| cost, latency, and finding the cost-efficient compute-
| bandwidth tradeoff for a given task.
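The compute-versus-bandwidth imbalance described above can be put in per-thread terms with another small sketch; the ~3.35 TB/s bandwidth and ~1.8 GHz clock are rough H100-class assumptions rather than numbers from the comment.

    /* Per-thread share of HBM bandwidth on an H100-class part (rough numbers). */
    #include <stdio.h>

    int main(void) {
        int sms            = 144;        /* SMs on the full GH100 die */
        int threads_per_sm = 4 * 32;     /* 4 schedulers x 32-wide warps per clock */
        int clusters       = 8;          /* GPCs, as in the comment above */
        double hbm_gbs     = 3350.0;     /* aggregate HBM3 bandwidth */
        double clock_ghz   = 1.8;        /* rough boost clock */

        int threads = sms * threads_per_sm;              /* 18,432 */
        printf("Threads per clock: %d (%d per cluster)\n",
               threads, threads / clusters);
        printf("HBM bytes per thread per clock: ~%.2f\n",
               hbm_gbs / (threads * clock_ghz));         /* ~0.10 bytes */
        return 0;
    }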
| PaulHoule wrote:
| Certainly an ASIC or FPGA on a package with HBM could do
| more.
|
| As for exotic 10x-clocked systems based on III-V
| semiconductors, SQUIDs, or something else, I think memory
| does have to be packaged with the rest of it, because of
| speed-of-light issues.
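The speed-of-light remark can be checked with a little arithmetic; the sketch below assumes the ~60 GHz clock mentioned upthread and a typical on-package signal velocity of about half the speed of light.

    /* How far a signal can travel in one cycle at tens of GHz (approximate). */
    #include <stdio.h>

    int main(void) {
        double clock_ghz   = 60.0;                /* ~60 GHz figure from upthread */
        double cycle_ps    = 1000.0 / clock_ghz;  /* ~16.7 ps per cycle */
        double c_mm_per_ps = 0.2998;              /* speed of light in vacuum */
        double velocity    = 0.5;                 /* on-package fraction of c */

        printf("Cycle time: %.1f ps\n", cycle_ps);
        printf("Signal reach per cycle: ~%.1f mm\n",
               cycle_ps * c_mm_per_ps * velocity); /* ~2.5 mm: memory must sit close */
        return 0;
    }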
| markhahn wrote:
| They're both DRAM, so they have roughly the same performance
| per interface bit width and clock. You can see this very
| naturally by looking at higher-end CPUs, which have wider
| DDR interfaces (currently up to 12x64b per socket - not as
| wide as in-package HBM, but duh).
___________________________________________________________________
(page generated 2025-06-09 23:01 UTC)