[HN Gopher] RAM is the new disk - and how to measure its perform...
___________________________________________________________________
RAM is the new disk - and how to measure its performance (2015)
Author : tanelpoder
Score : 52 points
Date : 2021-01-21 19:33 UTC (3 hours ago)
(HTM) web link (tanelpoder.com)
(TXT) w3m dump (tanelpoder.com)
| vlovich123 wrote:
| I think how DMA operates needs another serious look. Right now we
| have to fetch everything into the CPU before we can make
| decisions. What if we had asynchronous HW embedded within the
| memory that could be given a small (safe) executable program to
| process the memory in-place rather than evaluating it on the CPU?
| For example, a linked list would be much faster and simpler to
| traverse.
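|
| A minimal sketch of the status quo this would replace: the
| traversal is a chain of dependent loads, each a full memory
| round-trip that an in-memory engine could keep local.
|
|     // Latency-bound: each load address is only known after the
|     // previous load completes, so the misses cannot overlap.
|     struct Node { long value; Node* next; };
|
|     long sum_list(const Node* head) {
|         long sum = 0;
|         for (const Node* n = head; n != nullptr; n = n->next)
|             sum += n->value;   // roughly one DRAM miss per node
|         return sum;
|     }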
|
| A lot of the software architecture theory we learn is based on
| existing HW paradigms without much thought being given to how we
| can change HW paradigms. By nature HW is massively parallel, but
| physical distance from compute = latency (vs the ultimately
| serial execution of traditional CPUs, which can process data at
| blistering speed but only one item at a time, with some SIMD
| exceptions). There are real-world benefits to this kind of design
| - memory is cheap and simple to manufacture and abundantly
| available. The downside though is that the CPU is sitting doing
| nothing but waiting for memory most of the time, especially when
| processing large data sets.
|
| Imagine how efficient a GC algorithm would be if it could compute
| a result in the background just doing a concurrent mark and
| sweep - perhaps as part of a DRAM refresh cycle, so that you could
| even choose to stop refreshing a row of RAM that your application
| no longer needs.
|
| The power and performance savings are pretty enticing.
| cogman10 wrote:
| Interestingly enough, this is why Simultaneous multithreading
| [1] exists!
|
| The revelation that "The CPU could be doing useful work while
| stalled waiting on a load" has led to CPU designers "faking"
| the number of cores available to the OS to allow the CPU to do
| more useful work while different "threads" are paused waiting
| on memory to come back with the data they need.
|
| [1] https://en.wikipedia.org/wiki/Simultaneous_multithreading
| tanelpoder wrote:
| And there are all kinds of manual prefetch instructions (for
| compilers to use) and hardware prefetchers & predictors built
| into the CPUs too!
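|
| A minimal sketch of the manual variant, using the GCC/Clang
| __builtin_prefetch builtin on a hypothetical strided loop:
|
|     #include <cstddef>
|
|     // Ask for data a few iterations ahead so it is (hopefully)
|     // already in cache by the time the loop reaches it.
|     void scale(float* a, size_t n, float k) {
|         for (size_t i = 0; i < n; ++i) {
|             if (i + 16 < n)
|                 __builtin_prefetch(&a[i + 16]);
|             a[i] *= k;
|         }
|     }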
| nomel wrote:
| I think the point is that this still eats into CPU <-> memory
| bandwidth. Offloading could let the CPU use that bandwidth for
| better purposes, especially since the stall is from memory
| access anyways.
| tanelpoder wrote:
| When doing pointer chasing, it's gonna be more of a latency
| problem: there can be plenty of memory bandwidth available,
| but we don't know which memory line we want to go to next
| before the previous line (where the pointer resides) has
| been loaded. So, the CPUs spend a lot of time
| in "stalled backend" mode due to the big difference in CPU
| cycle latency vs RAM access latency.
|
| Offloading some operations closer to the RAM would be a
| hardware solution (like the Oracle/Sun SPARC DAX I
| mentioned in a separate comment).
|
| Or you could design your software to rely more on the
| memory throughput (scanning through columnar structures) vs
| memory latency (pointer chasing).
|
| Btw, even with pointer-chasing, you could optimize the
| application to work mostly from the CPU cache (assuming
| that the CPUs don't do concurrent writes into these cache
| lines all the time), but this would require not only
| different application code, but different underlying data
| structures too. That's pretty much what my article series
| is about - fancy CPU throughput features like SIMD would
| not be very helpful, if the underlying data (and memory)
| structures don't support their way of thinking.
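|
| A minimal sketch of the throughput-oriented alternative: the
| same values laid out contiguously, which hardware prefetchers
| and SIMD/auto-vectorization handle well.
|
|     #include <numeric>
|     #include <vector>
|
|     // Sequential scan over a columnar layout: the access
|     // pattern is predictable, so bandwidth (not latency)
|     // becomes the limit.
|     long sum_column(const std::vector<long>& column) {
|         return std::accumulate(column.begin(), column.end(), 0L);
|     }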
| geocar wrote:
| Linked-list chasing (as in vlovich123's example) isn't
| bandwidth-limited as much as it is latency: SMT helps here
| because you're able to enqueue multiple wait-states
| "simultaneously".
| jeffbee wrote:
| Anything you can do with SMT you can do without SMT using
| ILP instead. SMT does not grant a CPU extra cache-filling
| resources.
| geocar wrote:
| SMT is easier to program.
|
| Walking a single linked-list is nothing; a garbage
| collector walks lots of them that tend to fan out for a
| bit.
| gpsar wrote:
| Software SMT is even better for concurrent traversals as
| you are not limited by the number of hw contexts
| https://www.linkedin.com/pulse/dont-stall-multitask-
| georgios...
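|
| A minimal sketch of that idea, assuming several independent
| chains: advancing them in one loop keeps several cache misses
| in flight at once instead of waiting on each in turn.
|
|     struct Node { long value; Node* next; };
|
|     // The loads for different chains are independent, so the
|     // out-of-order core can overlap their misses.
|     long sum_lists_interleaved(Node* heads[], int k) {
|         long sum = 0;
|         int live = k;
|         while (live > 0) {
|             live = 0;
|             for (int i = 0; i < k; ++i) {
|                 if (heads[i]) {
|                     sum += heads[i]->value;
|                     heads[i] = heads[i]->next;
|                     ++live;
|                 }
|             }
|         }
|         return sum;
|     }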
| jeffbee wrote:
| SMT gives you some probability that thread X and thread Y
| are under-using the resources of the CPU but if you lose
| that bet you get antagonism. On Intel CPUs in particular
| there are only 2 slots for filling cache lines, and
| filling them randomly from main memory takes many cycles,
| so two threads can easily get starved.
|
| If thread X is chasing a linked list and thread Y is
| compressing an MPEG then it's brilliant.
| fctorial wrote:
| They aren't faking the number. A hyperthreaded core is two
| separate cores that share ALUs, FPUs, etc.
| tanelpoder wrote:
| I think it's more correct to say that a single core has two
| "sets of registers" (and I guess instruction
| decoders/dispatchers perhaps)... so it's a single core with
| 2 "execution entry points"?
| jzwinck wrote:
| This is not how Intel sees it. They describe hyperthreading
| as a single core having two (or more, but usually two) sets
| of architectural state which basically means registers.
| It's not two cores, it is one core that can switch between
| two different instruction pointers. They share almost
| everything else apart from the APIC.
| lalaithion wrote:
| Even just "go to memory location X+Y, load the value there, set
| X = that value, repeat N times" would allow fast linked list
| traversal, as well as speed up various high-level pointer-
| chasing patterns.
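|
| A minimal sketch of what such a primitive would replace and
| what it might carry (the descriptor is purely hypothetical):
|
|     #include <cstddef>
|     #include <cstdint>
|
|     // Today the CPU runs the dependent-load loop itself,
|     // paying one memory round-trip per step.
|     uintptr_t chase(uintptr_t x, ptrdiff_t y, int n) {
|         for (int i = 0; i < n; ++i)
|             x = *reinterpret_cast<uintptr_t*>(x + y); // X = mem[X+Y]
|         return x;
|     }
|
|     // A hypothetical offload command: a memory-side engine runs
|     // the same loop and returns only the final pointer.
|     struct ChaseDescriptor {
|         uintptr_t start;   // initial X
|         ptrdiff_t offset;  // Y, e.g. offsetof(Node, next)
|         int       steps;   // N
|     };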
| 55873445216111 wrote:
| There are some memories supporting basic in-memory operations.
| For example: https://mosys.com/products/blazar-family/be3rmw-
| bandwidth-en.... This supports operations like read-modify-
| write within the memory device itself. (I have no affiliation
| with this company.)
|
| The barrier to adoption of this is not technical, it's
| economic. The memory industry has focused on making the highest
| capacity and lowest cost/bit products. This drives high
| manufacturing volume which drives economies of scale. Memory
| products with integrated functions are inherently niche, and
| therefore do not have anywhere near the market size and economy
| of scale. Designers have decided (historically) that it is
| cheaper at the system level to keep the logic operations within
| the CPU and use a "dumb" commodity memory, even though this
| necessitates more bandwidth usage. (It's a complex engineering
| trade-off.)
|
| With logic performance continuing to scale faster than memory
| bandwidth, at some point an architecture that reduces the
| required memory bandwidth (such as computing in-memory) might
| start to make sense economically.
| d_tr wrote:
| A few months ago I wanted to take a look at the Gen-Z fabric
| specifications, but unfortunately they still have a lame
| members-only download request form in place.
| tanelpoder wrote:
| Oracle's (Sun's) latest CPUs support something called DAX (data
| analytics extension, not the same thing as the Intel DAX -
| direct access extensions). It's a coprocessor that allows
| offloading simpler operations closer to RAM, apparently:
|
| "DAX is an integrated co-processor which provides a specialized
| set of instructions that can run very selective functionality -
| Scan, Extract, Select, Filter, and Translate - at fast speeds.
| The multiple DAX share the same memory interface with the
| processors cores so the DAX can take full advantage of the
| 140-160 GB/sec memory bandwidth of the SPARC M7 processor."
|
| https://blogs.oracle.com/bestperf/accelerating-spark-sql-usi...
| dragontamer wrote:
| > Right now we have to fetch everything into the CPU before we
| can make decisions.
|
| Think about virtual memory. The data the program thinks is at
| memory location #6000 is NOT actually at physical location #6000.
|
| In fact, the program might be reading / writing to memory
| location 0x08004000 (or close to that, whatever the magic start
| address was on various OS like Linux / Windows). And then your
| CPU translates that virtual address to, say, memory stick #0,
| column #4000 or whatever.
|
| Because of virtual memory: all memory operations must be
| translated by the CPU before you actually go to RAM (and that
| translation may require a page-table walk in the worst case).
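|
| A toy model of that translation (a simplified 4-level
| x86-64-style walk; real CPUs cache the results in TLBs, and
| present/permission bits are ignored here):
|
|     #include <cstdint>
|
|     // Each level of the walk is itself a memory read, which is
|     // why a TLB miss can cost several extra RAM accesses.
|     uint64_t translate(uint64_t root, uint64_t vaddr,
|                        uint64_t (*read_phys)(uint64_t)) {
|         uint64_t table = root;
|         for (int level = 3; level >= 0; --level) {
|             uint64_t index = (vaddr >> (12 + 9 * level)) & 0x1FF;
|             uint64_t entry = read_phys(table + index * 8);
|             table = entry & ~0xFFFULL; // next table / final frame
|         }
|         return table | (vaddr & 0xFFF); // frame + page offset
|     }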
| ben509 wrote:
| Well, all memory operations must be translated by an MMU. So
| can the MMU be a distributed beast that lives in the CPU, the
| memory and the IO?
| dragontamer wrote:
| > So can the MMU be a distributed beast that lives in the
| CPU, the memory and the IO?
|
| That "blah = blah->next" memory operation could be:
|
| * Going to swap thanks to swap / overcommit behavior on
| Linux/Windows.
|
| * Or going to a file thanks to mmap
|
| * Going to Ethernet and to another computer thanks to RDMA
|
| So... no. I'm pretty sure the CPU is the proper place for
| that kind of routing.
| freeone3000 wrote:
| why is that? we need the CPU to handle page fault
| interrupts, in order to populate the RAM. But assuming
| the page is already in RAM, there's no reason any of the
| memory accesses actually need to go through the CPU.
| (hardware can already raise interrupts; if the MMU can
| raise a page fault indicator, then you might be able to
| bypass CPU entirely until a new page needs to be loaded)
|
| moreover, if we have support for mmap at the MMU level,
| we can cut the CPU bottleneck for disk access entirely.
| the disk controller can already handle DMA, but there's
| simply no way for things that aren't the CPU to trigger
| it. DirectStorage is an effort for GPUs to trigger it,
| but what if we could also trigger it by other means?
| dragontamer wrote:
| Okay, lets say page-faults are off the table for some
| reason. Lets think about what can happen even if
| everything is in RAM still.
|
| * MMAP is still on the table: different processes can
| share RAM at different addresses. (Process#1 thinks the
| data is at memory location 0x90000000, Process#2 thinks
| the data is at 0x70000000, but in both cases, the data is
| at physical location 0x42).
|
| * Physical location 0x42 is a far-read on a far-away NUMA
| node. Which means the CPU#0 now needs to send a message
| to a CPU#1 very far away to get a copy of that RAM. This
| message traverses Intel UPI (UltraPath Interconnect) or AMD
| Infinity Fabric (the details are proprietary), but it's a remote
| message that happens nonetheless.
|
| * Turns out CPU#1 has modified location 0x42. Now CPU#1
| must push the most recent copy out of L1 cache, into L2
| cache... then into L3 cache, and then send it back to
| CPU#0. CPU#0 has to wait until this process is done.
|
| Modern computers work very hard to hold the illusion of a
| singular memory space.
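|
| The first point is easy to demonstrate: the same physical page
| can be visible at two different virtual addresses. A minimal
| Linux sketch (error handling omitted, shared-memory name
| hypothetical):
|
|     #include <cstdio>
|     #include <fcntl.h>
|     #include <sys/mman.h>
|     #include <unistd.h>
|
|     int main() {
|         int fd = shm_open("/demo", O_CREAT | O_RDWR, 0600);
|         ftruncate(fd, 4096);
|         // Two mappings of the same physical page, two addresses.
|         char* a = (char*)mmap(nullptr, 4096,
|                               PROT_READ | PROT_WRITE,
|                               MAP_SHARED, fd, 0);
|         char* b = (char*)mmap(nullptr, 4096,
|                               PROT_READ | PROT_WRITE,
|                               MAP_SHARED, fd, 0);
|         a[0] = 'x';
|         printf("%p and %p both see '%c'\n",
|                (void*)a, (void*)b, b[0]);
|         shm_unlink("/demo");
|         return 0;
|     }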
| jeffbee wrote:
| Pointers are virtual addresses but memory is accessed
| physically. All of the means of translating virtual to
| physical are in the CPU. If you are proposing throwing
| out virtual addressing, I imagine you won't get a lot of
| support for that idea.
| charlesdaniels wrote:
| This is indeed an idea that has been coming up from time to
| time for over 25 years at least. I think the earliest
| publication in this space was Gokhale's Terasys (DOI
| 10.1109/2.375174).
|
| It is a good idea in principle, but it's never really taken off
| to my knowledge. Parallel programming is hard. Deeply embedded
| programming is hard. The languages we have are mostly bad at
| both.
|
| If you want to search for more, the keyword is "processor in
| memory" or "processing in memory" (PIM). "In-memory computing" is
| commonly used as well. A number of groups are working on this
| right now. The new buzz is "memristors"
| (memristive computing). Whether or not any of it actually ends
| up working or being commercially viable outside of a lab
| remains to be seen.
| wcerfgba wrote:
| This reminds me of the Connection Machine architecture [1]
|
| > Each CM-1 microprocessor has its own 4 kilobits of random-
| access memory (RAM), and the hypercube-based array of them was
| designed to perform the same operation on multiple data points
| simultaneously, i.e., to execute tasks in single instruction,
| multiple data (SIMD) fashion. The CM-1, depending on the
| configuration, has as many as 65,536 individual processors,
| each extremely simple, processing one bit at a time.
|
| [1] https://en.wikipedia.org/wiki/Connection_Machine
| BenoitP wrote:
| > HW embedded within the memory that could be given a small
| (safe) executable program to process the memory in-place rather
| than evaluating it on the CPU
|
| Well, that seems to be the exact definition of what UPMEM is
| doing:
|
| https://www.upmem.com/upmem-announces-silicon-based-processi...
|
| Between the M1, GPUs, TPUs, and RISC-V, interesting times are
| coming in hardware. I blame the physical limits, which are
| putting the Duke Nukem development method to an end (you promise
| the client 2x the performance in 18 months, then play Duke Nukem the
| whole time). The only way to better performance now is through
| hardware specialization. That and avoiding Electron.
| WJW wrote:
| You'd hope that the only way to better performance was HW
| specialisation, but there are SO MANY algorithmic
| improvements still to be made. Just the other day I found a
| case someone had rolled their own priority queue with a
| sorted array instead of a heap. In a fairly popular open
| source library too.
|
| There's still loads of performance to be gained just by being
| better programmers.
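|
| For concreteness, a minimal sketch of the difference: a binary
| heap inserts in O(log n), while keeping an array sorted costs
| O(n) shifting per insert.
|
|     #include <algorithm>
|     #include <functional>
|     #include <queue>
|     #include <vector>
|
|     // O(log n) push / O(log n) pop-min with a heap:
|     std::priority_queue<int, std::vector<int>,
|                         std::greater<int>> pq;
|
|     // vs. keeping a sorted array: every insert shifts elements.
|     void sorted_insert(std::vector<int>& v, int x) {
|         v.insert(std::upper_bound(v.begin(), v.end(), x), x);
|     }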
| spockz wrote:
| How do you propose to fix this? Should languages include
| high(er?) performance data structures in their standard
| libraries? Or possibly even include some segmentation for
| small/medium/huge data sets?
| Arelius wrote:
| > Just the other day I found a case someone had rolled
| their own priority queue with a sorted array instead of a
| heap.
|
| I feel like this story needs an ending? Did somebody re-
| implement it using a heap and find significant performance
| wins? Or was the sorted array used on purpose to take
| advantage of some specific cache constraints and actually
| end up being a huge win?
| rektide wrote:
| meanwhile, disk is potentially getting to be as fast as ram,
| throughput wise.
|
| 128 lanes of pcie 4.0 is 256GBps iirc. epyc's 8 channel ddr4-3200
| otoh is good for 208GBps.
|
| Let's Encrypt stopped a little short, using 24x nvme disks (it
| fits in a 2U though so that's nice)[1]. that could be up to 96 of
| 128 pcie links in use. with the right ssds, working on large
| data-objects, that'd be somewhere a bit under 192GBps versus the
| max 208GBps of their ram.
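|
| (for reference, the nominal per-direction peaks: pcie 4.0 is
| ~1.97 GB/s per lane after 128b/130b encoding, so 128 lanes ~= 252
| GB/s; ddr4-3200 is 25.6 GB/s per channel, so 8 channels ~= 204.8
| GB/s; 24 x4 nvme drives use 96 lanes ~= 189 GB/s of lane
| bandwidth, if the drives can keep up.)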
|
| in truth, ram's random access capabilities are far better,
| there's much less overhead (although nvme is pretty good). and
| i'm not sure i've ever seen anyone try to confirm that those 128
| lanes of pcie on epyc aren't oversubscribed, that devices really
| can push that much data around. note that this doesn't
| necessarily even have to mean using the cpu; pci p2p is where
| it's at for in-the-know folks doing nvme, network, and gpu data-
| pushing; epyc's io-die is acting like a data-packet switch in
| these conditions, rather than having the cpu process/crunch these
| peripherals' data.
|
| [1] https://news.ycombinator.com/item?id=25861422
| tanelpoder wrote:
| Author (of the RAM article) here:
|
| Indeed, you can go further, but you've got to plan for bandwidth
| needed by other peripherals, data movement and inter-CPU
| bandwidth (NUMA), and intra-CPU-core bandwidth limitations too
| (AMD's Infinity Fabric is point-to-point between chiplets, but
| Intel has a ring-bus architecture for moving bits between CPU
| cores).
|
| I got my Lenovo ThinkStation P620 workstation (with AMD Zen-2
| ThreadRipper Pro WX, 8-memory channels like EPYC) to scan 10 x
| PCIe 4.0 SSDs at 66 GB/s (I had to move SSD cards around so
| they'd use separate PCIe root complexes to avoid a PCIe <-> CPU
| data transfer bottleneck). And even when doing I/O through 3
| PCIe root complexes (out of 4 connected to that CPU), I seem to
| be hitting some inter-CPU-core bandwidth limitation. The
| throughput differs depending on which specific CPU cores happen
| to run the processes doing I/Os against different SSDs.
|
| Planning to publish some blog entries about these I/O tests but
| a teaser tweet is here (11M IOPS with a single-socket
| ThreadRipper workstation - it's not even a NUMA server! :-)
|
| https://twitter.com/TanelPoder/status/1352329243070504964
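|
| A minimal per-device sketch of the kind of sequential scan being
| measured (device path, sizes and single-threadedness are just
| for illustration; the real tests run many such scans in
| parallel):
|
|     #include <chrono>
|     #include <cstdio>
|     #include <cstdlib>
|     #include <fcntl.h>
|     #include <unistd.h>
|
|     int main() {
|         const size_t block = 1 << 20;              // 1 MiB reads
|         int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
|         void* buf = nullptr;
|         posix_memalign(&buf, 4096, block);  // O_DIRECT needs an
|         size_t total = 0;                   // aligned buffer
|         auto t0 = std::chrono::steady_clock::now();
|         for (size_t i = 0; i < 16384; ++i) {       // scan 16 GiB
|             ssize_t n = pread(fd, buf, block, (off_t)(i * block));
|             if (n <= 0) break;
|             total += (size_t)n;
|         }
|         double s = std::chrono::duration<double>(
|                        std::chrono::steady_clock::now() - t0)
|                        .count();
|         printf("%.1f GB/s\n", total / s / 1e9);
|         free(buf);
|         close(fd);
|         return 0;
|     }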
| 1996 wrote:
| > I had to move SSD cards around so they'd use separate PCIe
| root complexes to avoid a PCIe <-> CPU data transfer
| bottleneck
|
| I am doing similar things. Have you considered looking at how
| to control the PCIe lane assignment in software?
|
| Intel HSIO seems to be software-configurable - except that
| usually, it's all done just by the BIOS.
|
| But as the PCIe specs allow for both device-side and host-side
| negotiation, it should be doable without "moving SSDs
| around".
|
| > The throughput differs depending on which specific CPU
| cores happen to run the processes doing I/Os against
| different SSDs.
|
| That strikes me as odd. I would check the details of the PCIe
| lanes and their routing. You could have something funky going
| on. My first guess would be that it's slow on one core
| because it's also handling something else, by design or by
| accident.
|
| There are some bad hardware designs out there. But thanks to
| stuff like HSIO, it should now be possible to fix the worst
| ones in software (how else would the BIOS do it!) just like in
| the old days of isapnptools!
| tanelpoder wrote:
| As this is an AMD machine - and a workstation, not a server -
| perhaps that's why they've restricted it in the BIOS.
|
| I'm not too much of an expert in PCI express - but if this
| workstation has 4 PCIe root complexes/host bridges, each
| capable of x32 PCIe 4.0 lanes - and there are no multi-root
| PCIe switches, wouldn't a lane physically have to
| communicate with just one PCIe root complex/CPU "port"?
| MayeulC wrote:
| That bandwidth limitation could be due to Infinity Fabric,
| which seems to be rated at 42GBps (x2 as full duplex,
| though)?
|
| https://en.wikichip.org/wiki/amd/infinity_fabric
| tanelpoder wrote:
| Yes, that's what I'm suspecting too, although with higher
| clocked RAM, I should have somewhat more bandwidth. My
| DIMMs are 3200 MT, so should be running at 1600 MHz. But I
| saw a note (not sure where) that Infinity Fabric can run up
| to 2933 MT on my machine and it would run in sync with
| memory with DIMMs only up to 2933 MT. Unfortunately my BIOS
| doesn't allow downgrading the RAM "clock" from 3200 MT to
| 2933, thus Infinity Fabric is running "out of sync" with my
| RAM.
|
| This should mean non-ideal memory access latency at least,
| not sure how it affects throughput of large sequential
| transfers.
|
| I'm planning to come up with some additional tests and
| hopefully write up a "part 2" too.
| 1996 wrote:
| > Unfortunately my BIOS doesn't allow to downgrade the
| RAM "clock"
|
| How deep are you willing to go?
|
| The RAM clock is controlled by the memory training
| algorithms. They use data from the XMP, which can be
| edited.
|
| The simplest is to reflash your memory sticks to alter
| their XMP, so the training algorithm will reach the
| conclusions you want. There's some Windows software to do
| that.
|
| You could also implement your own MRC, something done by
| coreboot and the like.
| tanelpoder wrote:
| Ha, thanks for the idea! I was briefly thinking of buying
| 2933 "MHz" RAM for the test (as I later would put it into
| my other workstation that can go up to 2600 "MHz" only),
| but then I realized I don't have time for this right now
| (will do my throughput, performance stability tests first
| and maybe look into getting the most out of the latency
| later).
___________________________________________________________________
(page generated 2021-01-21 23:01 UTC)