[HN Gopher] AMD's 7950X3D: Zen 4 Gets VCache
       ___________________________________________________________________
        
       AMD's 7950X3D: Zen 4 Gets VCache
        
       Author : zdw
       Score  : 68 points
       Date   : 2023-04-23 17:01 UTC (5 hours ago)
        
 (HTM) web link (chipsandcheese.com)
 (TXT) w3m dump (chipsandcheese.com)
        
       | naiv wrote:
       | I have absolutely no idea regarding cpus but would this also
       | speed up database caching?
        
         | loeg wrote:
         | If your working set is between 32 and 96 MB, it can make a big
         | difference. This is probably not what most people mean when
         | they talk about database caching, though.
        
         | jackmott wrote:
         | [dead]
        
         | stingraycharles wrote:
         | Generally memory has different access speeds. In order for a
         | CPU to actually perform operations on the data, it needs to be
         | loaded into the memory closest to / within the CPU.
         | 
          | L1 cache is the fastest (and closest to the CPU), but there's
          | very little of it. There's also L2, L3 and sometimes L4, each
          | slightly slower but larger than the level before it.
         | 
         | VCache is a new kind of mechanism where the caching layer is
         | placed vertically "on top" of the CPU. This unlocks new
         | possibilities, and possibly more cache available at lower
         | access times, but there are some technical challenges to make
         | it work.
         | 
         | It will enable your database to perform computations faster, as
         | more data can be stored closer to the CPU.
        
           | loeg wrote:
           | VCache is more or less just a big L3 cache, with some quirks
           | due to the way it is implemented. It's slightly slower than
           | on-die L3 (~10%); on the bigger chips, it can only be
           | accessed from a single CCD, etc.
        
         | justsomehnguy wrote:
          | Not if your modus operandi is to
          | 
          |     SELECT * FROM t1 WHERE (SELECT * FROM t1 WHERE (SELECT * FROM t1....
          | 
          | EDIT: for those who downvoted:
          | https://news.ycombinator.com/item?id=32720856
        
           | trollied wrote:
           | ... then stream the entire resultset to the server and do the
           | filtering in software. Drives me mad!
        
       | shmerl wrote:
       | Somewhat moot benefit if scheduler doesn't decide what core to
       | use depending on what the thread is doing.
        
         | loeg wrote:
          | On the smaller V-Cache parts (8 cores and fewer), every core
          | has access to the large L3 (unlike the 16-core 7950X3D
          | reviewed in the article).
        
           | shmerl wrote:
           | Yeah, the point is about 16 core one.
        
         | 7e wrote:
         | The scheduler might be able to do it, but apps like games can
         | always pin their threads by hand.
        
           | shmerl wrote:
           | Pinning anything by hand defeats the purpose. You'd need to
           | benchmark things first to figure out if it helps or not. Too
           | manual.
        
             | tpxl wrote:
             | Some games do run benchmarks to figure out recommended
             | settings for your setup.
        
           | syntheticcdo wrote:
           | No need - the chipset drivers automatically park the non-
           | VCache CCD cores when a game is running (effectively turning
           | the 7950X3D into a 7800X3D).
        
             | shmerl wrote:
             | How would they know it's a game or not a game? I'm playing
             | my games on Linux anyway. Haven't heard of schedulers using
             | such logic.
             | 
             | I'd expect some kind of predictive AI that analyzes thread
             | behavior to be able to help. But not sure if anyone tried
             | making a scheduler like that.
        
               | wmf wrote:
               | Under Windows the driver has a whitelist of process names
               | that it recognizes and pins to V-cache. Of course you
               | don't have these problems if you buy the 7800X3D...
        
               | fredoralive wrote:
                | AIUI they currently use a rather basic system on Windows
                | that asks "is the Xbox game bar active?" and uses that
                | to switch off the low-cache cores. I suspect if these
                | sorts of chips become common we might get something a
                | bit more nuanced.
        
               | sharpneli wrote:
               | It's the same mechanism that triggers the game mode in
               | Windows. You can tag a program in the Xbox game bar as a
               | game if it hasn't recognized it by default.
        
               | sudosysgen wrote:
               | Windows already has a system for detecting what is and
               | isn't a game for purposes of switchable graphics laptops,
               | so I imagine they reuse that.
               | 
               | You can get pretty good heuristics by looking at graphics
               | API usage.
               | 
               | On Linux you could just put the appropriate taskset or
               | numactl command in your game shortcut, it's pretty easy.
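The same pinning can also be done from inside the process, without taskset, via Linux's affinity syscalls. A minimal sketch (Linux-only; the choice of CPUs 0-7 as the V-Cache CCD is an assumption here, not a guarantee — check lscpu or sysfs for the actual mapping on a given machine):

```python
import os

# Linux-only: restrict this process to a chosen set of logical CPUs.
# Which CPU numbers map to the V-cache CCD varies by system, so the
# range below is illustrative, not authoritative.
original = os.sched_getaffinity(0)
target = set(range(8)) & original       # hypothetical V-cache CCD
os.sched_setaffinity(0, target)         # pin to the chosen cores
print(sorted(os.sched_getaffinity(0)))  # confirm the new mask
os.sched_setaffinity(0, original)       # restore, for the demo
```

taskset/numactl do the same thing externally, which is handy when you can't modify the program.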
        
       | cronin101 wrote:
       | Seems like an obvious question but is there conventional wisdom
       | on whether compilation/transpilation heavy workloads are more
       | suited to cache size or to higher clocks? Is this a "it depends"
       | situation? Wondering what to pick for my next workhorse.
        
         | mastax wrote:
         | I thought I remembered benchmarks from when the 5800X3D came
         | out showing it was good at code compilation but that is, at the
         | very least, not always true.
         | 
         | https://www.phoronix.com/review/amd-ryzen9-7950x3d-linux/13
        
           | the8472 wrote:
           | Assuming the power is reported correctly it looks like an
           | efficiency win:
           | 
           |  _> While the build times were similar between the 7950X and
           | 7950X3D, with the latter the CPU power consumption was
           | significantly lower._
        
             | mastax wrote:
             | I'm assuming the vast majority of that benefit is just from
             | the X3D chips having a lower stock TDP. You could achieve
             | the same efficiency by just reducing the 7950X TDP to 120W
             | in BIOS or Ryzen Master.
        
         | sosodev wrote:
          | It's kinda random. Every workload is going to have different
          | requirements depending on how it's implemented and what the
          | program is actually doing.
        
       | [deleted]
        
       | ChuckMcM wrote:
       | From the article -- _" Unfortunately, there's been a trend of
       | uneven performance across cores as manufacturers try to make
       | their chips best at everything, and use different core setups to
       | cover more bases."_
       | 
       | I don't find this unfortunate. Engineering is compromise and
       | being able to make things that do a particular thing well can get
       | you more performance per $ and per watt than you might otherwise
       | see. The whole GPU thing that kneecapped Intel[1] is an example I
       | use of how a compute element optimized for one thing can boost
       | overall system performance.
       | 
       | I have worked on a number of bits of code for "big/little" ARM
       | chips and while it does make scheduling more complex, overall
       | we've been able to deliver more capability per $ and per watt.
       | That is perhaps more important in portable systems but it works
       | in data centers too.
       | 
       | I had an interesting discussion inside Google about storage and
       | whether or not using a full up motherboard for GFS nodes was
       | ideal. The prevailing argument at the time was uniformity of
       | "nodes" meant everything was software, nodes could be swapped at
       | will and you could write software to do what ever you needed. But
        | when it comes to efficiency, which is to say what percentage of
        | available resources is used to deliver the service, you find a
        | lot of wasted watts/CPU cycles. It all depends on what part of the
       | engineering design space you are trying to optimize.
       | 
        | [1] The "kneecapped" situation is that from the PC/AT (80286)
        | on, Intel generally made the most margin of all the chips that
        | went into a complete system. Now it is often NVIDIA.
        
         | usrusr wrote:
         | Unfortunate in how it demonstrates that there don't seem to be
         | any lower-hanging fruit left. Actually it doesn't so much feel
         | like picking higher hanging fruit, it feels like stretching
         | hard for the last withered leaves, long after the last fruit
         | have disappeared.
        
           | ChuckMcM wrote:
           | Yeah pretty much. There was never any real expectation that
           | Moore's observation would result in infinitely improving
           | compute performance, and Intel famously demonstrated that
           | reality was much closer to Jim Gray's "smoking hairy golf
           | ball" future.
           | 
           | [1] _" Jim: Here it is. The concept is that the speed of
           | light is finite and a nanosecond is a foot. So if you buy a
           | gigahertz processor, it's doing something every nanosecond.
           | That's the event horizon. But that's in a vacuum, the
           | processor is not a vacuum, and signals don't go in a straight
           | line and the processor is running at 3 gigahertz. So you
           | don't have a foot. You've got four inches. And the speed of
           | light in a solid is less than that. So this is the event
            | horizon. If something happens on one side of this thing, the
            | clock is going to tick before the signal gets to the other
            | side. That's why processors of the future have to be small
            | and in fact golf-ball size. Why are they smoking? Well,
           | because they have to run on a lot of electricity. The way you
           | get things to go fast is you put a lot of power into them. So
           | heat dissipation is a big problem. Now it's astonishing to me
           | that Intel has decided that this is a big problem only
           | recently, because people knew that we were headed towards
           | this heat cliff a long time ago. And why is it hairy? Because
           | you've got to get signals in and out of it, so this thing is
           | going to be wrapped in pins."_ --
           | https://amturing.acm.org/pdf/GrayTuringTranscript.pdf
        
       | metadat wrote:
        | I wish they'd also test xz and RAR compression on the vanilla
        | vs. V-Cache cores.
       | 
       | It'd be interesting to learn if / how the results differ
       | depending on compression implementation.
        
       | andrewstuart wrote:
       | It sounds like this would be great for single user contexts, but
       | really unpredictable for servers running multiple duplicate
       | tasks.
       | 
       | Can someone who knows better than me please comment on the Linux
       | server scheduling issues with a CPU like this.
       | 
       | At this stage I'm assuming I'd be better with a CPU with all
       | cores the same.
        
         | brnt wrote:
         | It sounds great when paired with an APU. No such part was
         | announced though.
        
           | AnthonyMouse wrote:
           | All of the Ryzen 7000 series are APUs.
        
             | Dalewyn wrote:
             | With AMD finally taking a page from Intel with regards to
             | providing an iGPU to everyone, Intel expected to integrate
             | ARC into their CPUs with Meteor Lake (14th gen), and
              | discrete GPUs from Nvidia and AMD being as _bloody_
              | expensive as they are, I wonder if we're on the verge of a
             | turning point in desktop computing as discrete GPUs go down
             | the path once trodden by sound cards?
        
               | sampa wrote:
               | look at the size of 4090 and think again
        
               | noizejoy wrote:
               | The soundcard analogy may not be that faulty. General
               | purpose soundcards have pretty much been replaced by
               | motherboard components, while high end soundcards (audio
               | interfaces) for DAW (digital audio workstation) use cases
               | are an entirely separate market.
               | 
               | Side note: Most of those high end audio interfaces are
               | now external devices connected via USB (formerly also via
               | Firewire).
               | 
               | And since a DAW typically doesn't need high end video, I
               | didn't bother with a separate GPU on my latest DAW build.
               | I will only add a GPU card, if I ever want to do higher
               | end gaming, video production or machine learning on that
               | computer.
        
           | wmf wrote:
           | The GPU cannot access the L3 though so V-cache would not help
           | APUs. Maybe in the future AMD will have V-Infinity Cache.
        
           | andrewstuart wrote:
           | APUs are very slow compared to discrete GPUs.
        
         | wmf wrote:
         | Recent discussion on that topic:
         | https://news.ycombinator.com/item?id=35656374
        
         | sudosysgen wrote:
          | This platform exposes each CCX (a group of cores which share
          | an L3 between them) as a NUMA domain if you want. This means
          | that if your workload really takes a huge performance penalty
          | from the ~10% lower (still very high) clock speed, or only
          | part of it really benefits from the cache, you can manually
          | tell the OS to keep it where you want.
         | 
          | Scheduling for this kind of chip is not great yet, but it
          | might improve. Meanwhile, you can enjoy almost all of the
          | performance for the specific workloads where it matters by
          | doing this.
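Which cores actually share an L3 (and therefore which ones sit behind the V-Cache) can be read out of sysfs on Linux before deciding where to pin anything. A sketch (the sysfs paths are standard Linux, but some VMs and containers don't expose them, in which case this returns an empty mapping):

```python
from pathlib import Path

def l3_groups():
    """Map each L3 cache instance to the logical CPUs sharing it.

    Reads the Linux sysfs cache topology; returns {} when the
    information isn't exposed (e.g. in some VMs or containers).
    """
    groups = {}
    for cpu_dir in Path("/sys/devices/system/cpu").glob("cpu[0-9]*"):
        shared = cpu_dir / "cache" / "index3" / "shared_cpu_list"
        if shared.exists():
            key = shared.read_text().strip()   # e.g. "0-7"
            groups.setdefault(key, set()).add(cpu_dir.name)
    return groups

# On a dual-CCD part each CCD's cores should show up under a
# separate shared_cpu_list entry.
print(l3_groups())
```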
        
         | MichaelZuo wrote:
         | Linux multi-threading code isn't the most elegant or robust out
         | there, but I doubt it would catastrophically fail, in the sense
         | of performing worse on the newer AMD design, outside of a few
         | exceptional cases.
        
           | andrewstuart wrote:
           | >> Linux multi-threading code isn't the most elegant or
           | robust out there
           | 
           | Evidence?
        
             | MichaelZuo wrote:
             | You want evidence of my personal estimation of Linux?
        
         | rektide wrote:
         | I'd rather have more cache than less. There are some corner
         | cases where things will go wrong that we can dig up I'm sure,
         | but generally, most tasks will execute at least as fast as they
         | would have without the lopsided extra cache, and some will
         | operate faster.
         | 
          | There's definitely a lot of iteration & growing room here
          | that we could use to get better. Yet... it feels like searching for
         | problems to worry about some tasks running faster than they
         | would have before (but not all tasks). In most deployment
         | models, it should be fine. I think, for now, this slightly
         | favors two use cases: one big monolith running on all cores,
         | which will let work get consumed as it's ready. Or divide your
         | workers onto the different chiplets, and use a good load
         | balancing strategy that is somehow utilization aware.
         | 
         | Personally I'm lacking in Fear, Uncertainty or Doubt over this
         | making things worse.
        
           | AnonCoward42 wrote:
            | Just a small reminder that the CCD with V-Cache has
            | noticeably lower boost clocks and therefore performs worse
            | in almost everything besides gaming. For casual use this is
            | irrelevant I think, but it is still significant in many
            | workloads; gaming is the only workload I know of where
            | V-Cache helps more than the lower boost clocks hurt.
           | 
           | The advantage of the dual CCD design is that this only/mostly
           | hurts in highly threaded workloads at least.
        
           | jackmott42 wrote:
           | The benchmarks show it making plenty of things worse in the
            | non-gaming domain on this particular CPU. On a server CPU
            | that runs at lower clocks anyway, it's probably near
            | impossible for it to hurt.
        
         | toast0 wrote:
         | > Can someone who knows better than me please comment on the
         | Linux server scheduling issues with a CPU like this.
         | 
          | The scheduler was more or less designed around symmetric
          | multiprocessing. big.LITTLE asymmetric systems still have
          | obviously preferred cores: if you're optimizing for
          | throughput, add one task to each fast core first, then to
          | each slow core, then maybe move tasks around to fit
          | policies, etc.
         | 
         | With the 7900X3D and the 7950X3D, it's trickier, because one
          | chiplet has a lower clock speed but more cache. Tasks that
          | fit into the smaller cache will do better on the smaller-
          | cache, higher-clocked chiplet; tasks that fit into the larger
          | cache but not the smaller one will do better on the larger-
          | cache chiplet; and tasks that don't fit into either cache
          | will probably do better on the faster chiplet, but it kind of
          | depends. In order to make
         | good decisions, the scheduler would need more information about
         | the task's memory access patterns, and I don't think that's
         | something schedulers tend to keep track of; but if this type of
         | chip is common in the future, it will need to happen.
         | 
         | For Epyc, AMD's server processor line, I believe the plan for
         | X3D is to add cache to all the chiplets, keeping them roughly
         | symmetric; there's still the modern situation that some cores
         | will boost higher than others, and moving tasks to different
         | cores can be very expensive if the process has memory still
         | resident in cache on the old core, and the new core doesn't
         | share that cache, etc.
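The throughput-first policy described above (one task per fast core first, then the slow cores, then rebalance) can be sketched as a toy placement function; real schedulers of course weigh far more state than this (load, cache residency, power, policies):

```python
def place_tasks(tasks, fast_cores, slow_cores):
    """Toy round-robin that prefers fast cores: the first
    len(fast_cores) tasks land one per fast core, the next batch
    fills the slow cores, and anything beyond that wraps around."""
    cores = list(fast_cores) + list(slow_cores)
    assignment = {c: [] for c in cores}
    for i, task in enumerate(tasks):
        assignment[cores[i % len(cores)]].append(task)
    return assignment

# With three tasks, two fast cores, and one slow core, each fast
# core gets a task before the slow core gets any.
print(place_tasks(["t1", "t2", "t3"], ["F0", "F1"], ["S0"]))
```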
        
         | mastax wrote:
         | It's a pretty similar shape of problem to NUMA, which servers
          | have managed for quite some time. (Perhaps more similar to
          | big.LITTLE, which is not so common in servers but which Linux
          | should have decent support for, thanks to ARM SoCs.)
         | 
          | It's a bit of an unnecessary hassle and annoyance, though.
          | I'm pretty sure all the V-Cache EPYCs will have cache on all
          | their dies, so there won't be this issue.
        
       ___________________________________________________________________
       (page generated 2023-04-23 23:00 UTC)