[HN Gopher] SK Hynix Reveals DDR5 MCR DIMM, Up to DDR5-8000 Spee...
___________________________________________________________________
SK Hynix Reveals DDR5 MCR DIMM, Up to DDR5-8000 Speeds for HPC
Author : rbanffy
Score : 58 points
Date : 2022-12-12 13:20 UTC (9 hours ago)
(HTM) web link (www.anandtech.com)
(TXT) w3m dump (www.anandtech.com)
| dinvlad wrote:
| They really need to step up their game in making higher
| capacity modules available to the consumer HEDT market. I think
| the current roadmap is to introduce 48GB and possibly 64GB
| modules at the end of 2023. Without that, many folks who've used
| a 4x32GB DDR4 configuration for professional work are left out,
| since the current memory controllers and modules can't support
| that for DDR5 at reasonable speeds.
|
| In other words, the HEDT market doesn't need faster RAM as much
| as it needs more of it.
| moneycantbuy wrote:
| How noticeably slower is running 128GB DDR5 on AM5? Is it
| perceptible? I'm building a 7950x rig and am perplexed by the
  | RAM. The specs from AMD are:
|
  | Max Memory Speed:
  |     2x1R DDR5-5200
  |     2x2R DDR5-5200
  |     4x1R DDR5-3600
  |     4x2R DDR5-3600
|
  | Does anyone know if CAS latency (CL) also drops if using 4
| sticks? For example, if I use 4x32GB DDR5 5600 CL36-36-36-89
| will it drop to 3600 while maintaining CL36 or will CL also
| slow to like CL 40?
| dinvlad wrote:
| Basically, on AM5 4x32GB sticks nominally run at only 3,600
| (i.e. no better than DDR4, and at probably slightly higher
| latency). On Level1Techs forums, there're some folks that got
| it working as high as 4,800, which seems to be the barrier,
| but even that is far from guaranteed.
|
    | On Intel the situation isn't really any better. On paper,
    | the stock speed is 4,000 (i.e. +400 MT/s compared to AMD),
    | but I have yet to hear of anyone running 5,200 fully stable
    | in this config.
|
    | Also, primary timings like CL36 don't really matter in
    | DDR5. There's a whole Actually Hardcore Overclocking
    | YouTube video on it.
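    |
    | For the CL question above: the absolute latency is the CL
    | count divided by the memory clock, and the memory clock is
    | half the MT/s figure, so keeping CL36 while dropping from
    | 5600 to 3600 MT/s still makes the latency worse in
    | wall-clock terms. A minimal C sketch of the arithmetic,
    | using just the module numbers quoted above:
    |
    |     #include <stdio.h>
    |
    |     /* CAS latency in ns: CL clocks divided by the
    |        memory clock, which is half the MT/s rate. */
    |     static double cas_ns(double cl, double mts) {
    |         return cl * 1000.0 / (mts / 2.0);
    |     }
    |
    |     int main(void) {
    |         printf("%.1f\n", cas_ns(36, 5600)); /* ~12.9 */
    |         printf("%.1f\n", cas_ns(36, 3600)); /* ~20.0 */
    |         printf("%.1f\n", cas_ns(40, 3600)); /* ~22.2 */
    |         return 0;
    |     }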
| the_pwner224 wrote:
| Those are the bare minimum speeds that AMD guarantees. You
| can find real world info about this on reddit. My memory of
| this is a bit old, but iirc the Zen 4 memory controller can
| pretty much always do 4800 MT/s on 4 sticks. And you might be
| able to do 5000/5200 or maybe even a bit more depending on
| your motherboard & RAM. The situation has been getting better
| with newer BIOS versions; speeds were much lower around the
| initial Zen 4 release.
| the_pwner224 wrote:
| Found some links from an old notepad. These are close to a
| month old, so stability at higher speeds should have
| improved since then.
|
| ASRock Steel Legend runs at 4800(?) https://old.reddit.com/
| r/ASRock/comments/xtt3ol/x670e_steel_... His initial 5200
| was actually unstable, a comment said he went to 4800, not
| sure whether 4800 is stable
|
| Gigabyte board could run 128gb "stable" at 5200 (need to
| expand comment thread near bottom for that) https://old.red
| dit.com/r/gigabyte/comments/y1ns7j/aorus_x670... Expanded
| comment about 5200 stable, and link to benchmark with it: h
| ttps://old.reddit.com/r/gigabyte/comments/y1ns7j/aorus_x670
| ...
|
| TUF X670E can only run 128gb at 4400, not 4800 https://old.
| reddit.com/r/Amd/comments/ya3gxx/psa_dont_buy_an...
|
| On ROG Strix X670, commenter could run 4x16gb at 6000 (6400
      | was unstable). The commenter previously could only do 5200;
| BIOS update made 6000 work. OP with 4x32gb was stable at
| 4400, not 5000, didn't say about 4800. https://old.reddit.c
| om/r/overclocking/comments/xzau6z/x670_m...
|
| Many comments here: https://old.reddit.com/r/Amd/comments/x
| zj69v/is_anyone_runni...
|
| He's running 128gb at 5200 on a Gigabyte Aorus Master 670E
| https://old.reddit.com/r/Amd/comments/yggpr5/7950x6900xt128
| g...
| dinvlad wrote:
| Yeah - I think the key here is 4x32GB specifically, not
| just 4 sticks. 128GB is much much harder on the memory
| controller, since those are dual-rank memory modules (or
| pseudo-4R due to how DDR5 is wired compared to DDR4), and
| we got 4 of them. So even 4,800 is far from guaranteed in
| this situation, on any platform atm, and it's unclear at
| this point if and when the situation will improve,
        | barring larger memory modules in 2x48GB and 2x64GB
| configurations (which would be much easier to run) or the
| next gen of chips with stronger MCs.
| hardware2win wrote:
  | Seems like you wanted Optane memory before it was cancelled
|
| >Intel(r) Optane(tm) persistent memory is available in
| capacities of 128 GiB, 256 GiB, and 512 GiB and is a much
| larger alternative to DRAM which currently caps at 128 GiB.
| MuffinFlavored wrote:
| > consumer HEDT market
|
| high-end desktop for anybody curious
| dinvlad wrote:
| Thanks for clarifying. HEDT is somewhat of a stretch name
| here, since technically there're no current consumer-centric
| HEDT platforms like Threadripper or X599, but in their place
| I'm talking about the next best choices like 7950x and
| 13900k.
| rbanffy wrote:
| Also, the line between HEDT and low-end tower server is
| quite blurry.
| rbanffy wrote:
| > making higher capacity modules available to consumer HEDT
| market
|
| One way I'm able to function is by getting tower servers.
| Selection isn't great, and they cost a lot of money, but you
| can get a machine with 8 or 12 memory slots that still fits
| under your desk (and that is surprisingly silent while you
| browse or read e-mail - it only comes to life when you start
| getting cores saturated by running things like `make -j 64`).
| coolspot wrote:
| Can we get modular GDDR6X VRAM for machine learning please?
| rbanffy wrote:
| The Xeon Max has HBM built-in. Kind of the same idea that came
| with Xeon Phis, but updated.
| qwertox wrote:
| Yesterday I was presented with an ad disguised as an
| educational YouTube video, but the content was really good
| and is worth watching if you have some time on your hands.
|
| "How does Computer Memory Work?" [0]
|
| Uploaded 1 month ago, 35 minutes playing time.
|
| It even goes into explaining what the timings mean while showing
| an animation for it.
|
| [0] https://www.youtube.com/watch?v=7J7X7aZvMXQ
| iamchp wrote:
| This would benefit sequential access, but it'd either be disabled
| for random-access or pollute the caches with unused lines.
|
| But in cases where sequential memory bandwidth is required, this
| is pretty cool! (But I assume Intel only, which would also be a
| bummer)
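|
| (Not MCR-specific, but the sequential-vs-random distinction is
| easy to see on any machine: a streaming pass over a big array
| is limited by bandwidth, while dependent "pointer chasing"
| loads are limited by latency. A rough C sketch, with the array
| size and the crude shuffle picked purely as assumptions:
|
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <time.h>
|
|     enum { N = 1 << 25 };   /* ~256 MB of size_t */
|
|     static double secs(void) {
|         struct timespec ts;
|         clock_gettime(CLOCK_MONOTONIC, &ts);
|         return ts.tv_sec + ts.tv_nsec * 1e-9;
|     }
|
|     int main(void) {
|         size_t *a = malloc((size_t)N * sizeof *a);
|         size_t i, j, t, sum = 0, p = 0;
|         if (!a) return 1;
|         /* Sattolo shuffle: one big cycle to chase;
|            rand() % i is crude but fine for a demo. */
|         for (i = 0; i < N; i++) a[i] = i;
|         for (i = N - 1; i > 0; i--) {
|             j = (size_t)rand() % i;
|             t = a[i]; a[i] = a[j]; a[j] = t;
|         }
|         double t0 = secs();
|         for (i = 0; i < N; i++)   /* streaming   */
|             sum += a[i];
|         double t1 = secs();
|         for (i = 0; i < N; i++)   /* dependent   */
|             p = a[p];
|         double t2 = secs();
|         printf("seq %.2fs rand %.2fs (%zu %zu)\n",
|                t1 - t0, t2 - t1, sum, p);
|         return 0;
|     }
|
| The dependent loop is typically well over 10x slower, and that
| latency-bound case is the one a wider or double-pumped fetch
| can't do much about on its own.)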
| fhars wrote:
| RAM is the new tape...
| rektide wrote:
| Somewhat ironic that we _just_ reduced RAM channels from 64b
| down to 32b wide in DDR5 (but each DIMM has two channels).
| (newsflash if you missed it: desktop DDR5 is quad channel, but
| 32b, yes.)
|
  | SK Hynix: what if we increase the RAM width to 64?
| api wrote:
| We seem to be getting to the point that CPUs are no longer I/O
| bound. This is sort of the end of an era: starting as early as
| the 2000s, we entered a time when I/O (RAM, disk, network) was
| the main bottleneck for most forms of compute.
| PragmaticPulp wrote:
| > We seem to be getting to the point that CPUs are no longer
| I/O bound.
|
| These memory modules are for server and HPC use, where memory
| bandwidth is still a major limitation for many workloads.
|
| Your desktop CPU may not be memory I/O bound for your average
| single-task desktop use case, but a 128-core server running
| intense workloads or doing HPC can definitely be I/O bound.
|
| Even desktop gaming applications show benefits from memory
| overclocking in most cases, so it's not something that can be
| dismissed.
| rbanffy wrote:
| > Your desktop CPU may not be memory I/O bound for your
| average single-task desktop use case, but a 128-core server
| running intense workloads or doing HPC can definitely be I/O
| bound.
|
| Indeed. From my measurements (very unscientific), processes
    | stall first for IO, then for CPU resources, then for memory.
| gpderetta wrote:
| RAM latency is still an issue though and often the actual
| bottleneck.
| drewg123 wrote:
| I disagree. RAM is still a huge bottleneck. Even when there is
| sufficient bandwidth, latency is a performance killer.
|
| I just spent most of the weekend trying to optimize a hash
| table lookup which is one of our biggest sources of cache
| misses (and CPU stalls). The CPI (cycles per instruction) in
  | that function started at 13.9 and I have it down to 7.5 by re-
| ordering a few fields and cacheline-aligning the struct (so as
| to have 1 cache miss per iteration, rather than 2). Now I need
| to figure out what's wrong with the hash function, as the table
| should be big enough to hold everything without much pointer
| chasing on average, but I'm seeing we do at least one pointer
| deref on average before we find the entry we're looking for.
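  |
  | (For anyone curious what "re-ordering a few fields and
  | cacheline-aligning the struct" looks like in practice, a
  | minimal sketch with made-up field names, not the actual
  | code: keep the fields touched on every probe together at
  | the front, and align the struct so an entry never straddles
  | two 64-byte lines.
  |
  |     #include <stdalign.h>
  |     #include <stdint.h>
  |
  |     struct entry {
  |         /* hot: read on every probe */
  |         alignas(64) uint64_t key;
  |         uint32_t hash;          /* quick reject  */
  |         uint32_t flags;
  |         struct entry *next;     /* chain pointer */
  |         /* cold: only touched after a hit */
  |         uint64_t stats[5];
  |     };
  |
  |     _Static_assert(sizeof(struct entry) == 64,
  |                    "one entry per cache line");
  |
  | With the hot fields packed into the first 24 bytes, a probe
  | that misses in cache costs one line fill instead of two.)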
| montecarl wrote:
| Small tangent. Would you mind sharing how you calculate CPI?
| Do you just do timing and use some base clock rate? With the
| CPU frequency being so variable (with boost clocks) I imagine
| the correct way is with some fancy instrumentation.
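    |
    | (One way that sidesteps the variable clock is to read the
    | hardware cycle and instruction counters instead of wall
    | time: `perf stat -e cycles,instructions ./prog` prints
    | both, and cycles/instructions is the CPI. A rough
    | Linux-only sketch of doing it in-process with
    | perf_event_open; it counts everything between open and
    | read, so keep just the region of interest in between:
    |
    |     #include <linux/perf_event.h>
    |     #include <sys/syscall.h>
    |     #include <unistd.h>
    |     #include <string.h>
    |     #include <stdint.h>
    |     #include <stdio.h>
    |
    |     /* One hardware counter for this thread. */
    |     static int counter(uint64_t what) {
    |         struct perf_event_attr a;
    |         memset(&a, 0, sizeof a);
    |         a.type = PERF_TYPE_HARDWARE;
    |         a.size = sizeof a;
    |         a.config = what;
    |         a.exclude_kernel = 1;
    |         return (int)syscall(SYS_perf_event_open,
    |                             &a, 0, -1, -1, 0);
    |     }
    |
    |     int main(void) {
    |         int cyc = counter(PERF_COUNT_HW_CPU_CYCLES);
    |         int ins = counter(PERF_COUNT_HW_INSTRUCTIONS);
    |
    |         /* ... region of interest goes here ... */
    |
    |         uint64_t c = 0, n = 0;
    |         read(cyc, &c, sizeof c);
    |         read(ins, &n, sizeof n);
    |         printf("CPI = %.2f\n", (double)c / n);
    |         return 0;
    |     }
    |
    | That gives a whole-region number; per-function breakdowns
    | need a profiler such as perf record or VTune on top.)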
| adgjlsfhk1 wrote:
| Any chance you can switch the hash table to use probing
| rather than chaining? Probing is way better for locality
| because on collisions you will just look at the next element
    | in RAM.
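    |
    | Roughly, the lookup side of an open-addressed table with
    | linear probing looks like the sketch below (made-up slot
    | layout, and it assumes the table always keeps at least
    | one empty slot so a miss terminates):
    |
    |     #include <stdint.h>
    |     #include <stddef.h>
    |
    |     #define EMPTY 0  /* key 0 is never stored */
    |
    |     struct slot { uint64_t key; uint64_t val; };
    |
    |     /* t has (mask + 1) slots, a power of two. On a
    |        collision, step to the next slot, which is in
    |        the same or the next cache line, instead of
    |        chasing a chain pointer. */
    |     static int find(const struct slot *t, size_t mask,
    |                     uint64_t hash, uint64_t key,
    |                     uint64_t *val)
    |     {
    |         for (size_t i = hash & mask; ;
    |              i = (i + 1) & mask) {
    |             if (t[i].key == key) {
    |                 *val = t[i].val;
    |                 return 1;
    |             }
    |             if (t[i].key == EMPTY)
    |                 return 0;
    |         }
    |     }
    |
    | Deletion then needs tombstones or backward shifting,
    | which is the usual price for dropping the chains.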
| drewg123 wrote:
| This table is not well suited for that, but it did get me
| interested in re-structuring it. Thanks for the suggestion!
| NovemberWhiskey wrote:
| SDRAM latency is definitely not keeping up with improvements
| in throughput: if you go back to the time of the PC100 SDRAM
| standard (late 90s?) it was not uncommon to see RAM with CAS
| latency of 2 clocks and overall timings compatible with a
| latency of around 20-25ns; now with DDR5-4800, CL 34 is
| standard and latencies have only come down to about 15ns.
|
| This despite the number of MT/s increasing 48-fold.
| iamchp wrote:
      | Computer architects use the term 'memory wall' to refer to
      | this latency problem. CPU microarchitectures are constantly
      | tuned to raise IPC, and process technology helps increase
      | CPU frequency, but memory access latency is not keeping up
      | with those CPU improvements.
| NovemberWhiskey wrote:
| Right - I think, at this point, that L1 cache latency is
| worse than main-memory latency on a 'per CPU clock' basis
| than it was in the late 1990s!
| rbanffy wrote:
| Makes sense for a larger cache to take longer to decode
| an address.
| [deleted]
| kstrauser wrote:
| I've had a devil of a time looking this up. Perhaps you'd
| know:
|
| Back in the Amiga days, 60ns SIMMs were common. Some CPU
| accelerator boards could run faster if you upgraded to 50ns
| SIMMs instead. What did those numbers refer to? Latency?
| Time to fetch a byte? Something else?
|
| I'd be interested in an apples-to-apples comparison of the
| old hardware with the new.
| [deleted]
| [deleted]
| colejohnson66 wrote:
| Those times referred to the speed of the actual ICs on
        | the PCB. A "60 ns" DRAM IC takes _up to_ 60 nanoseconds
| from the (input) address lines going stable to the
| (output) data lines going stable. If you sampled the data
| bus during those 60 nanoseconds, you might get incomplete
| data. Swapping for 50 ns modules means the ICs were
| verified to take less time for the data bus to be valid.
|
| It's a bit like overclocking your memory nowadays.
| Basically, the 50 and 60 nanosecond parts might be the
| same silicon, but the 50 ns ones were validated to
| perform at that speed. Today, a 3200 MT/s DDR4 module
| doesn't mean it won't run at 3600 MT/s; just that it was
| only verified to run at the former.
|
| ---
|
| The big difference between the two memory formats,
| however, is that DDR is pipelined. In the old days, you
| would present an address on the bus, hold it, and then
| wait for the data to come back. Only after sampling the
| data could you request a new address. DDR, being
| pipelined, allows you to request an address, and, before
| the data comes back, request a new address. After a
| while, the data from the first address would come back,
| followed by the data from the second.
|
| That alone makes apples-to-apples comparisons hard.
| NovemberWhiskey wrote:
| With original DRAM you had the row strobe, the column
| strobe and then you could read out your data. You then
| had to precharge the row again before doing another
| access. Page mode DRAM improved on this by allowing you
| to keep the row open and read multiple columns, but you
| had to wait until the read completed before presenting
        | the next column address. Enhanced-data-out (EDO) DRAM
        | extended this by letting you pipeline column accesses
        | within the open row.
|
| So the ability to pipeline is much older than DDR. DDR
        | SDRAM is an evolution of SDRAM, which is the first
| variety of DRAM that is actually clocked (and came after
| EDO); and the main innovation is transfer on both edges
| of the clock (falling as well as rising) - hence double-
| data-rate.
| kstrauser wrote:
| Ah, thank you for that! So you could fetch 1/(60ns) bytes
| per second synchronously, then (assuming the CPU could
| run in a loop that tightly)?
| NovemberWhiskey wrote:
| No; because after the read, you need to precharge the row
| again (due to the design of dynamic RAM).
|
| ref. https://en.wikipedia.org/wiki/Dynamic_random-
| access_memory#M...
|
| The quoted "50 ns" DRAM has a read-cycle time of 84 ns.
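        |
        | (So, as a rough ceiling, 1 s / 84 ns is about 12
        | million full read cycles per second from one bank,
        | before page-mode or interleaving tricks.)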
| kstrauser wrote:
| Oh, interesting. I'll read up more on that.
| [deleted]
| pca006132 wrote:
  | IO is often still the bottleneck. Although the bandwidth is
  | larger, main memory access latency is still high; that's why
  | AMD's 3D cache can improve performance by _just_ providing
  | more cache.
| [deleted]
| Eleison23 wrote:
___________________________________________________________________
(page generated 2022-12-12 23:01 UTC)