[HN Gopher] SK Hynix Reveals DDR5 MCR DIMM, Up to DDR5-8000 Spee...
       ___________________________________________________________________
        
       SK Hynix Reveals DDR5 MCR DIMM, Up to DDR5-8000 Speeds for HPC
        
       Author : rbanffy
       Score  : 58 points
       Date   : 2022-12-12 13:20 UTC (9 hours ago)
        
 (HTM) web link (www.anandtech.com)
 (TXT) w3m dump (www.anandtech.com)
        
       | dinvlad wrote:
        | They really need to step up their game with making higher-
        | capacity modules available to the consumer HEDT market. I
        | think the current roadmap is to introduce 48GB and possibly
        | 64GB modules at the end of 2023. Without that, many folks
        | who've used 4x32GB DDR4 configurations for professional work
        | are left out, since the current memory controllers (MCs) and
        | modules can't support that for DDR5 at reasonable speeds.
        | 
        | In other words, the HEDT market doesn't need faster RAM as
        | much as it needs more of it.
        
         | moneycantbuy wrote:
          | How noticeably slower is running 128GB DDR5 on AM5? Is it
          | perceptible? I'm building a 7950X rig and am perplexed by
          | the RAM. The specs from AMD are:
          | 
          | Max Memory Speed:
          |   2x1R DDR5-5200
          |   2x2R DDR5-5200
          |   4x1R DDR5-3600
          |   4x2R DDR5-3600
          | 
          | Does anyone know if CAS latency (CL) also drops if using 4
          | sticks? For example, if I use 4x32GB DDR5-5600 CL36-36-36-89,
          | will it drop to 3600 while maintaining CL36, or will CL also
          | slow to something like CL40?
        
           | dinvlad wrote:
            | Basically, on AM5 4x32GB sticks nominally run at only
            | 3,600 (i.e. no better than DDR4, and probably at slightly
            | higher latency). On the Level1Techs forums, there are some
            | folks who got it working as high as 4,800, which seems to
            | be the barrier, but even that is far from guaranteed.
            | 
            | On Intel the situation isn't really any better. On paper,
            | the stock speed is 4,000 (i.e. +400 MT/s compared to AMD),
            | but I've yet to hear of anyone running 5,200 fully stable
            | in this config.
            | 
            | Also, primary timings like CL36 don't really matter in
            | DDR5. There's a whole Actually Hardcore Overclocking
            | YouTube video on it.
        
           | the_pwner224 wrote:
           | Those are the bare minimum speeds that AMD guarantees. You
           | can find real world info about this on reddit. My memory of
           | this is a bit old, but iirc the Zen 4 memory controller can
           | pretty much always do 4800 MT/s on 4 sticks. And you might be
           | able to do 5000/5200 or maybe even a bit more depending on
           | your motherboard & RAM. The situation has been getting better
           | with newer BIOS versions; speeds were much lower around the
           | initial Zen 4 release.
        
             | the_pwner224 wrote:
             | Found some links from an old notepad. These are close to a
             | month old, so stability at higher speeds should have
             | improved since then.
             | 
              | ASRock Steel Legend runs at 4800(?):
              | https://old.reddit.com/r/ASRock/comments/xtt3ol/x670e_steel_...
              | His initial 5200 was actually unstable; a comment said
              | he went to 4800, not sure whether 4800 is stable.
              | 
              | Gigabyte board could run 128GB "stable" at 5200 (need to
              | expand the comment thread near the bottom for that):
              | https://old.reddit.com/r/gigabyte/comments/y1ns7j/aorus_x670...
              | Expanded comment about 5200 being stable, with a link to
              | a benchmark run at that speed:
              | https://old.reddit.com/r/gigabyte/comments/y1ns7j/aorus_x670...
              | 
              | TUF X670E can only run 128GB at 4400, not 4800:
              | https://old.reddit.com/r/Amd/comments/ya3gxx/psa_dont_buy_an...
              | 
              | On a ROG Strix X670, a commenter could run 4x16GB at
              | 6000 (6400 was unstable). The commenter previously could
              | only do 5200; a BIOS update made 6000 work. The OP with
              | 4x32GB was stable at 4400, not 5000, and didn't say
              | about 4800:
              | https://old.reddit.com/r/overclocking/comments/xzau6z/x670_m...
              | 
              | Many comments here:
              | https://old.reddit.com/r/Amd/comments/xzj69v/is_anyone_runni...
              | 
              | He's running 128GB at 5200 on a Gigabyte Aorus Master
              | X670E:
              | https://old.reddit.com/r/Amd/comments/yggpr5/7950x6900xt128g...
        
               | dinvlad wrote:
                | Yeah - I think the key here is 4x32GB specifically,
                | not just 4 sticks. 128GB is much, much harder on the
                | memory controller, since those are dual-rank memory
                | modules (or pseudo-4R, due to how DDR5 is wired
                | compared to DDR4), and we've got 4 of them. So even
                | 4,800 is far from guaranteed in this situation, on any
                | platform atm, and it's unclear at this point if and
                | when the situation will improve, barring larger memory
                | modules in 2x48GB and 2x64GB configurations (which
                | would be much easier to run) or the next gen of chips
                | with stronger memory controllers.
        
         | hardware2win wrote:
          | Seems like you wanted Optane memory before it was cancelled.
         | 
         | >Intel(r) Optane(tm) persistent memory is available in
         | capacities of 128 GiB, 256 GiB, and 512 GiB and is a much
         | larger alternative to DRAM which currently caps at 128 GiB.
        
         | MuffinFlavored wrote:
         | > consumer HEDT market
         | 
         | high-end desktop for anybody curious
        
           | dinvlad wrote:
            | Thanks for clarifying. HEDT is somewhat of a stretch here,
            | since technically there are no current consumer-centric
            | HEDT platforms like Threadripper or X599, but in their
            | place I'm talking about the next-best choices like the
            | 7950X and 13900K.
        
             | rbanffy wrote:
             | Also, the line between HEDT and low-end tower server is
             | quite blurry.
        
         | rbanffy wrote:
         | > making higher capacity modules available to consumer HEDT
         | market
         | 
         | One way I'm able to function is by getting tower servers.
         | Selection isn't great, and they cost a lot of money, but you
         | can get a machine with 8 or 12 memory slots that still fits
         | under your desk (and that is surprisingly silent while you
         | browse or read e-mail - it only comes to life when you start
         | getting cores saturated by running things like `make -j 64`).
        
       | coolspot wrote:
       | Can we get modular GDDR6X VRAM for machine learning please?
        
         | rbanffy wrote:
         | The Xeon Max has HBM built-in. Kind of the same idea that came
         | with Xeon Phis, but updated.
        
       | qwertox wrote:
        | Yesterday I was shown an ad disguised as an educational
        | YouTube video, but the content was really good and is worth
        | watching if you have some time on your hands.
        | 
        | "How does Computer Memory Work?" [0]
        | 
        | Uploaded 1 month ago, 35 minutes long.
        | 
        | It even explains what the timings mean while showing an
        | animation for them.
       | 
       | [0] https://www.youtube.com/watch?v=7J7X7aZvMXQ
        
       | iamchp wrote:
       | This would benefit sequential access, but it'd either be disabled
       | for random-access or pollute the caches with unused lines.
       | 
       | But in cases where sequential memory bandwidth is required, this
       | is pretty cool! (But I assume Intel only, which would also be a
       | bummer)
        
         | fhars wrote:
         | RAM is the new tape...
        
         | rektide wrote:
          | Somewhat ironic that we _just_ reduced RAM channels from
          | 64 bits down to 32 bits wide in DDR5 (but each DIMM has two
          | channels). (Newsflash if you missed it: desktop DDR5 is quad
          | channel, but 32-bit, yes.)
          | 
          | SK Hynix: what if we increase the RAM width to 64?
        
       | api wrote:
        | We seem to be getting to the point where CPUs are no longer
        | I/O bound. This is sort of the end of an era, since, starting
        | as early as the 2000s, we entered a time when I/O (RAM, disk,
        | network) was the main bottleneck for most forms of compute.
        
         | PragmaticPulp wrote:
         | > We seem to be getting to the point that CPUs are no longer
         | I/O bound.
         | 
         | These memory modules are for server and HPC use, where memory
         | bandwidth is still a major limitation for many workloads.
         | 
         | Your desktop CPU may not be memory I/O bound for your average
         | single-task desktop use case, but a 128-core server running
         | intense workloads or doing HPC can definitely be I/O bound.
         | 
         | Even desktop gaming applications show benefits from memory
         | overclocking in most cases, so it's not something that can be
         | dismissed.
        
           | rbanffy wrote:
           | > Your desktop CPU may not be memory I/O bound for your
           | average single-task desktop use case, but a 128-core server
           | running intense workloads or doing HPC can definitely be I/O
           | bound.
           | 
            | Indeed. From my (very unscientific) measurements,
            | processes stall first for I/O, then for CPU resources,
            | then for memory.
        
         | gpderetta wrote:
         | RAM latency is still an issue though and often the actual
         | bottleneck.
        
         | drewg123 wrote:
         | I disagree. RAM is still a huge bottleneck. Even when there is
         | sufficient bandwidth, latency is a performance killer.
         | 
         | I just spent most of the weekend trying to optimize a hash
         | table lookup which is one of our biggest sources of cache
         | misses (and CPU stalls). The CPI (cycles per instruction) in
          | that function started at 13.9 and I have it down to 7.5 by
          | reordering a few fields and cacheline-aligning the struct (so
          | as to have 1 cache miss per iteration, rather than 2). Now I need
         | to figure out what's wrong with the hash function, as the table
         | should be big enough to hold everything without much pointer
         | chasing on average, but I'm seeing we do at least one pointer
         | deref on average before we find the entry we're looking for.
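          | 
          | Roughly the shape of the change, with made-up type and field
          | names (not our actual struct): pack the fields touched on
          | every lookup into the first cache line and align each entry
          | to 64 bytes, so a probe touches one line instead of two.
          | 
          |     #include <stdint.h>
          |     
          |     struct flow_entry {
          |         /* hot fields: read on every lookup */
          |         uint64_t           key;       /* compared each probe  */
          |         uint32_t           hash;      /* avoids recomputing   */
          |         uint32_t           refcount;
          |         struct flow_entry *next;      /* chain pointer        */
          |         /* cold fields: only touched after a hit */
          |         uint64_t           packets;
          |         uint64_t           bytes;
          |         uint64_t           last_seen;
          |     } __attribute__((aligned(64)));   /* GCC/Clang extension  */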
        
           | montecarl wrote:
           | Small tangent. Would you mind sharing how you calculate CPI?
           | Do you just do timing and use some base clock rate? With the
           | CPU frequency being so variable (with boost clocks) I imagine
           | the correct way is with some fancy instrumentation.
        
           | adgjlsfhk1 wrote:
           | Any chance you can switch the hash table to use probing
           | rather than chaining? Probing is way better for locality
           | because on collisions you will just look at the next element
            | in RAM.
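            | 
            | A minimal sketch of the lookup with linear probing
            | (hypothetical types, and assuming the table keeps its load
            | factor below 1 so an empty slot always ends the probe):
            | 
            |     #include <stdbool.h>
            |     #include <stddef.h>
            |     #include <stdint.h>
            |     
            |     struct slot  { uint64_t key, value; bool used; };
            |     struct table { struct slot *slots; size_t mask; /* cap-1 */ };
            |     
            |     static bool lookup(const struct table *t, uint64_t key,
            |                        uint64_t hash, uint64_t *out) {
            |         /* collisions walk to the adjacent slot, which is
            |          * usually in the same (already loaded) cache line */
            |         for (size_t i = hash & t->mask; t->slots[i].used;
            |              i = (i + 1) & t->mask) {
            |             if (t->slots[i].key == key) {
            |                 *out = t->slots[i].value;
            |                 return true;
            |             }
            |         }
            |         return false;  /* hit an empty slot: not present */
            |     }
            |     
            |     int main(void) {
            |         struct slot s[8] = {0};
            |         struct table t = { s, 7 };
            |         s[5].key = 42; s[5].value = 7; s[5].used = true;
            |         uint64_t v;
            |         return lookup(&t, 42, 5, &v) && v == 7 ? 0 : 1;
            |     }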
        
             | drewg123 wrote:
             | This table is not well suited for that, but it did get me
             | interested in re-structuring it. Thanks for the suggestion!
        
           | NovemberWhiskey wrote:
           | SDRAM latency is definitely not keeping up with improvements
           | in throughput: if you go back to the time of the PC100 SDRAM
           | standard (late 90s?) it was not uncommon to see RAM with CAS
           | latency of 2 clocks and overall timings compatible with a
           | latency of around 20-25ns; now with DDR5-4800, CL 34 is
           | standard and latencies have only come down to about 15ns.
           | 
           | This despite the number of MT/s increasing 48-fold.
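            | 
            | (Worked out: CL2 at PC100's 100 MHz clock is 2/100 MHz =
            | 20 ns; CL34 at DDR5-4800's 2,400 MHz command clock is
            | 34/2,400 MHz ~ 14.2 ns. Meanwhile, 4800 MT/s vs 100 MT/s
            | is the 48x.)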
        
             | iamchp wrote:
              | Computer architects use the term 'memory wall' for this
              | latency problem. CPU microarchitectures keep improving
              | IPC and process technology keeps pushing up CPU
              | frequency, but memory access latency is not keeping up
              | with the CPU improvements.
        
               | NovemberWhiskey wrote:
                | Right - I think that, at this point, L1 cache latency
                | measured in CPU clocks is worse than main-memory
                | latency was in the late 1990s!
        
               | rbanffy wrote:
               | Makes sense for a larger cache to take longer to decode
               | an address.
        
             | [deleted]
        
             | kstrauser wrote:
             | I've had a devil of a time looking this up. Perhaps you'd
             | know:
             | 
             | Back in the Amiga days, 60ns SIMMs were common. Some CPU
             | accelerator boards could run faster if you upgraded to 50ns
             | SIMMs instead. What did those numbers refer to? Latency?
             | Time to fetch a byte? Something else?
             | 
             | I'd be interested in an apples-to-apples comparison of the
             | old hardware with the new.
        
               | [deleted]
        
               | [deleted]
        
               | colejohnson66 wrote:
               | Those times referred to the speed of the actual ICs on
               | the PCB. A "60 ns" SRAM IC takes _up to_ 60 nanoseconds
               | from the (input) address lines going stable to the
               | (output) data lines going stable. If you sampled the data
               | bus during those 60 nanoseconds, you might get incomplete
               | data. Swapping for 50 ns modules means the ICs were
               | verified to take less time for the data bus to be valid.
               | 
               | It's a bit like overclocking your memory nowadays.
               | Basically, the 50 and 60 nanosecond parts might be the
               | same silicon, but the 50 ns ones were validated to
               | perform at that speed. Today, a 3200 MT/s DDR4 module
               | doesn't mean it won't run at 3600 MT/s; just that it was
               | only verified to run at the former.
               | 
               | ---
               | 
               | The big difference between the two memory formats,
               | however, is that DDR is pipelined. In the old days, you
               | would present an address on the bus, hold it, and then
               | wait for the data to come back. Only after sampling the
               | data could you request a new address. DDR, being
               | pipelined, allows you to request an address, and, before
               | the data comes back, request a new address. After a
               | while, the data from the first address would come back,
               | followed by the data from the second.
               | 
               | That alone makes apples-to-apples comparisons hard.
        
               | NovemberWhiskey wrote:
               | With original DRAM you had the row strobe, the column
               | strobe and then you could read out your data. You then
               | had to precharge the row again before doing another
               | access. Page mode DRAM improved on this by allowing you
               | to keep the row open and read multiple columns, but you
               | had to wait until the read completed before presenting
                | the next column address. Extended-data-out (EDO) DRAM
                | extended this by allowing you to pipeline accesses
                | within the open row.
                | 
                | So the ability to pipeline is much older than DDR. DDR
                | SDRAM is an evolution of SDRAM, which is the first
                | variety of DRAM that is actually clocked (and came
                | after EDO); the main innovation is transfer on both
                | edges of the clock (falling as well as rising) - hence
                | double-data-rate.
        
               | kstrauser wrote:
               | Ah, thank you for that! So you could fetch 1/(60ns) bytes
               | per second synchronously, then (assuming the CPU could
               | run in a loop that tightly)?
        
               | NovemberWhiskey wrote:
               | No; because after the read, you need to precharge the row
               | again (due to the design of dynamic RAM).
               | 
                | ref. https://en.wikipedia.org/wiki/Dynamic_random-access_memory#M...
               | 
               | The quoted "50 ns" DRAM has a read-cycle time of 84 ns.
        
               | kstrauser wrote:
               | Oh, interesting. I'll read up more on that.
        
           | [deleted]
        
         | pca006132 wrote:
          | I/O is often still the bottleneck. Although bandwidth is
          | larger, main-memory access latency is still high; that's why
          | AMD's 3D V-Cache can improve performance by _just_ providing
          | more cache.
        
         | [deleted]
        
         | Eleison23 wrote:
        
       ___________________________________________________________________
       (page generated 2022-12-12 23:01 UTC)