[HN Gopher] AMD's 7950X3D: Zen 4 Gets VCache
___________________________________________________________________
AMD's 7950X3D: Zen 4 Gets VCache
Author : zdw
Score : 68 points
Date : 2023-04-23 17:01 UTC (5 hours ago)
(HTM) web link (chipsandcheese.com)
(TXT) w3m dump (chipsandcheese.com)
| naiv wrote:
| I have absolutely no idea regarding CPUs, but would this also
| speed up database caching?
| loeg wrote:
| If your working set is between 32 and 96 MB, it can make a big
| difference. This is probably not what most people mean when
| they talk about database caching, though.
| jackmott wrote:
| [dead]
| stingraycharles wrote:
| Generally memory has different access speeds. In order for a
| CPU to actually perform operations on the data, it needs to be
| loaded into the memory closest to / within the CPU.
|
| L1 cache is the fastest (and closest to the CPU), but there's
| very little of it. There are also L2, L3 and L4 caches, each
| larger but slower than the previous one.
|
| V-Cache is a new mechanism where an extra cache die is stacked
| vertically "on top" of the CPU die. This unlocks new
| possibilities, notably much more cache close to the cores, but
| there are some technical challenges to making it work.
|
| It will enable your database to perform computations faster, as
| more data can be stored closer to the CPU.
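|
| A very rough C sketch of the kind of microbenchmark that makes
| this visible; the buffer sizes are my own picks to land under,
| inside, and beyond a 96 MB L3, nothing here is from the article:
|
|   /* Pointer-chase a buffer and report ns per dependent load.
|      Small buffers stay in cache, big ones spill to DRAM. */
|   #include <stdio.h>
|   #include <stdlib.h>
|   #include <time.h>
|
|   static double chase(size_t bytes) {
|     size_t n = bytes / sizeof(size_t);
|     size_t *next = malloc(n * sizeof(size_t));
|     if (!next) return 0;
|     for (size_t i = 0; i < n; i++) next[i] = i;
|     /* Sattolo shuffle: one random cycle, so the hardware
|        prefetcher cannot guess the next address. */
|     for (size_t i = n - 1; i > 0; i--) {
|       size_t j = rand() % i;
|       size_t t = next[i]; next[i] = next[j]; next[j] = t;
|     }
|     volatile size_t idx = 0;
|     struct timespec t0, t1;
|     clock_gettime(CLOCK_MONOTONIC, &t0);
|     for (size_t i = 0; i < n; i++)
|       idx = next[idx];        /* each load waits on the last */
|     clock_gettime(CLOCK_MONOTONIC, &t1);
|     free(next);
|     double s = (t1.tv_sec - t0.tv_sec)
|              + (t1.tv_nsec - t0.tv_nsec) / 1e9;
|     return s / n * 1e9;
|   }
|
|   int main(void) {
|     size_t mb[] = {16, 64, 256};
|     for (int i = 0; i < 3; i++)
|       printf("%3zu MB: %.1f ns/load\n",
|              mb[i], chase(mb[i] << 20));
|     return 0;
|   }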
| loeg wrote:
| VCache is more or less just a big L3 cache, with some quirks
| due to the way it is implemented. It's slightly slower than
| on-die L3 (~10%); on the bigger chips, it can only be
| accessed from a single CCD, etc.
| justsomehnguy wrote:
| Not if your modus operandi is to SELECT * FROM
| t1 WHERE (SELECT * FROM t1 WHERE (SELECT * FROM t1....
|
| EDIT: for those who downvoted:
| https://news.ycombinator.com/item?id=32720856
| trollied wrote:
| ... then stream the entire resultset to the server and do the
| filtering in software. Drives me mad!
| shmerl wrote:
| Somewhat moot benefit if the scheduler doesn't decide which
| core to use depending on what the thread is doing.
| loeg wrote:
| The smaller V-Cache parts (8 cores and fewer) have uniform
| access to the large L3 (unlike the 16-core 7950X3D review unit
| in the article).
| shmerl wrote:
| Yeah, the point is about 16 core one.
| 7e wrote:
| The scheduler might be able to do it, but apps like games can
| always pin their threads by hand.
| shmerl wrote:
| Pinning anything by hand defeats the purpose. You'd need to
| benchmark things first to figure out if it helps or not. Too
| manual.
| tpxl wrote:
| Some games do run benchmarks to figure out recommended
| settings for your setup.
| syntheticcdo wrote:
| No need - the chipset drivers automatically park the non-
| VCache CCD cores when a game is running (effectively turning
| the 7950X3D into a 7800X3D).
| shmerl wrote:
| How would they know it's a game or not a game? I'm playing
| my games on Linux anyway. Haven't heard of schedulers using
| such logic.
|
| I'd expect some kind of predictive AI that analyzes thread
| behavior to be able to help. But not sure if anyone tried
| making a scheduler like that.
| wmf wrote:
| Under Windows the driver has a whitelist of process names
| that it recognizes and pins to V-cache. Of course you
| don't have these problems if you buy the 7800X3D...
| fredoralive wrote:
| AIUI they currently use a rather basic system on Windows
| that asks "is the Xbox Game Bar active?", and uses that
| to switch off the low-cache cores. I suspect if these
| sorts of chips become common we might get something a bit
| more nuanced.
| sharpneli wrote:
| It's the same mechanism that triggers the game mode in
| Windows. You can tag a program as a game in the Xbox Game
| Bar if it hasn't been recognized by default.
| sudosysgen wrote:
| Windows already has a system for detecting what is and
| isn't a game for purposes of switchable graphics laptops,
| so I imagine they reuse that.
|
| You can get pretty good heuristics by looking at graphics
| API usage.
|
| On Linux you could just put the appropriate taskset or
| numactl command in your game shortcut, it's pretty easy.
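|
| What taskset does under the hood is essentially a
| sched_setaffinity() call. A minimal C sketch of the same
| idea; the assumption that the V-Cache CCD shows up as CPUs
| 0-15 is mine, so check your own topology (lscpu) before
| hard-coding anything:
|
|   /* Pin ourselves to an assumed set of CPUs, then exec the
|      game; affinity is inherited across exec(). */
|   #define _GNU_SOURCE
|   #include <sched.h>
|   #include <stdio.h>
|   #include <unistd.h>
|
|   int main(int argc, char **argv) {
|     if (argc < 2) {
|       fprintf(stderr, "usage: %s <program> [args]\n", argv[0]);
|       return 1;
|     }
|     cpu_set_t set;
|     CPU_ZERO(&set);
|     for (int cpu = 0; cpu < 16; cpu++)  /* assumed cache CCD */
|       CPU_SET(cpu, &set);
|     if (sched_setaffinity(0, sizeof(set), &set) != 0) {
|       perror("sched_setaffinity");
|       return 1;
|     }
|     /* Affinity survives the exec, so launch the game: */
|     execvp(argv[1], argv + 1);
|     perror("execvp");
|     return 1;
|   }
|
| The shell one-liner equivalent is taskset -c 0-15 <game>.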
| cronin101 wrote:
| Seems like an obvious question but is there conventional wisdom
| on whether compilation/transpilation heavy workloads are more
| suited to cache size or to higher clocks? Is this a "it depends"
| situation? Wondering what to pick for my next workhorse.
| mastax wrote:
| I thought I remembered benchmarks from when the 5800X3D came
| out showing it was good at code compilation but that is, at the
| very least, not always true.
|
| https://www.phoronix.com/review/amd-ryzen9-7950x3d-linux/13
| the8472 wrote:
| Assuming the power is reported correctly it looks like an
| efficiency win:
|
| _> While the build times were similar between the 7950X and
| 7950X3D, with the latter the CPU power consumption was
| significantly lower._
| mastax wrote:
| I'm assuming the vast majority of that benefit is just from
| the X3D chips having a lower stock TDP. You could achieve
| the same efficiency by just reducing the 7950X TDP to 120W
| in BIOS or Ryzen Master.
| sosodev wrote:
| It's kinda random. Every workload is going to have different
| requirements depending on how it's implemented and the
| context of what the program is actually doing.
| [deleted]
| ChuckMcM wrote:
| From the article -- _" Unfortunately, there's been a trend of
| uneven performance across cores as manufacturers try to make
| their chips best at everything, and use different core setups to
| cover more bases."_
|
| I don't find this unfortunate. Engineering is compromise and
| being able to make things that do a particular thing well can get
| you more performance per $ and per watt than you might otherwise
| see. The whole GPU thing that kneecapped Intel[1] is an example I
| use of how a compute element optimized for one thing can boost
| overall system performance.
|
| I have worked on a number of bits of code for "big/little" ARM
| chips and while it does make scheduling more complex, overall
| we've been able to deliver more capability per $ and per watt.
| That is perhaps more important in portable systems but it works
| in data centers too.
|
| I had an interesting discussion inside Google about storage and
| whether or not using a full up motherboard for GFS nodes was
| ideal. The prevailing argument at the time was uniformity of
| "nodes" meant everything was software, nodes could be swapped at
| will and you could write software to do whatever you needed. But
| when it comes to efficiency, which is to say what percentage of
| available resource is used to deliver the services, you found a
| lot of wasted watts/CPU cycles. It all depends on what part of the
| engineering design space you are trying to optimize.
|
| [1] The "kneecapped" situation is that, from the PC/AT (80286)
| on, Intel generally made the most margin of all the chips that
| went into a complete system. Now it is often NVIDIA.
| usrusr wrote:
| Unfortunate in how it demonstrates that there don't seem to be
| any lower-hanging fruit left. Actually it doesn't so much feel
| like picking higher hanging fruit, it feels like stretching
| hard for the last withered leaves, long after the last fruit
| have disappeared.
| ChuckMcM wrote:
| Yeah pretty much. There was never any real expectation that
| Moore's observation would result in infinitely improving
| compute performance, and Intel famously demonstrated that
| reality was much closer to Jim Gray's "smoking hairy golf
| ball" future.
|
| [1] _" Jim: Here it is. The concept is that the speed of
| light is finite and a nanosecond is a foot. So if you buy a
| gigahertz processor, it's doing something every nanosecond.
| That's the event horizon. But that's in a vacuum, the
| processor is not a vacuum, and signals don't go in a straight
| line and the processor is running at 3 gigahertz. So you
| don't have a foot. You've got four inches. And the speed of
| light in a solid is less than that. So this is the event
| horizon. If something happens on one side of this thing, the
| clock is going to tick before the signal gets to the other
| side. That's why processors of the future have to be small
| and in fact golf ball-size. Why are they smoking? Well,
| because they have to run on a lot of electricity. The way you
| get things to go fast is you put a lot of power into them. So
| heat dissipation is a big problem. Now it's astonishing to me
| that Intel has decided that this is a big problem only
| recently, because people knew that we were headed towards
| this heat cliff a long time ago. And why is it hairy? Because
| you've got to get signals in and out of it, so this thing is
| going to be wrapped in pins."_ --
| https://amturing.acm.org/pdf/GrayTuringTranscript.pdf
| metadat wrote:
| I wish they'd also test xz and RAR compression for
| vanilla vs V-Cache cores.
|
| It'd be interesting to learn if / how the results differ
| depending on compression implementation.
| andrewstuart wrote:
| It sounds like this would be great for single user contexts, but
| really unpredictable for servers running multiple duplicate
| tasks.
|
| Can someone who knows better than me please comment on the Linux
| server scheduling issues with a CPU like this.
|
| At this stage I'm assuming I'd be better off with a CPU with
| all cores the same.
| brnt wrote:
| It sounds great when paired with an APU. No such part was
| announced though.
| AnthonyMouse wrote:
| All of the Ryzen 7000 series are APUs.
| Dalewyn wrote:
| With AMD finally taking a page from Intel with regards to
| providing an iGPU to everyone, Intel expected to integrate
| ARC into their CPUs with Meteor Lake (14th gen), and
| discrete GPUs from Nvidia and AMD being as _bloody_
| expensive as they are, I wonder if we're on the verge of a
| turning point in desktop computing as discrete GPUs go down
| the path once trodden by sound cards?
| sampa wrote:
| look at the size of the 4090 and think again
| noizejoy wrote:
| The soundcard analogy may not be that faulty. General
| purpose soundcards have pretty much been replaced by
| motherboard components, while high end soundcards (audio
| interfaces) for DAW (digital audio workstation) use cases
| are an entirely separate market.
|
| Side note: Most of those high end audio interfaces are
| now external devices connected via USB (formerly also via
| Firewire).
|
| And since a DAW typically doesn't need high end video, I
| didn't bother with a separate GPU on my latest DAW build.
| I will only add a GPU card, if I ever want to do higher
| end gaming, video production or machine learning on that
| computer.
| wmf wrote:
| The GPU cannot access the L3 though so V-cache would not help
| APUs. Maybe in the future AMD will have V-Infinity Cache.
| andrewstuart wrote:
| APUs are very slow compared to discrete GPUs.
| wmf wrote:
| Recent discussion on that topic:
| https://news.ycombinator.com/item?id=35656374
| sudosysgen wrote:
| This platform can expose each CCX (the group of cores which
| share an L3) as a NUMA domain if you want. This means that if
| your workload really takes a huge performance penalty from the
| ~10% lower (still very high) clock speed, or only part of it
| really benefits from the cache, you can manually tell the OS
| where things should stay.
|
| Scheduling for this kind of chip is not super, but it might
| improve. Meanwhile, you can enjoy almost all of the performance
| for the specific workloads where it matters by doing this.
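|
| To see which CPUs actually share an L3 before pinning or
| binding anything, the kernel exposes it in sysfs. A quick C
| sketch (index3 is usually the L3, but that's worth confirming
| via the neighboring "level" file, and probing cpu16 is just a
| guess at where the second CCD starts):
|
|   /* Print the L3 sharing list for a couple of CPUs. */
|   #include <stdio.h>
|
|   static void l3_siblings(int cpu) {
|     char path[128], list[256];
|     snprintf(path, sizeof(path),
|              "/sys/devices/system/cpu/cpu%d/cache/index3/"
|              "shared_cpu_list", cpu);
|     FILE *f = fopen(path, "r");
|     if (f && fgets(list, sizeof(list), f))
|       printf("cpu%d shares its L3 with: %s", cpu, list);
|     if (f) fclose(f);
|   }
|
|   int main(void) {
|     l3_siblings(0);
|     l3_siblings(16);
|     return 0;
|   }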
| MichaelZuo wrote:
| Linux multi-threading code isn't the most elegant or robust out
| there, but I doubt it would catastrophically fail, in the sense
| of performing worse on the newer AMD design, outside of a few
| exceptional cases.
| andrewstuart wrote:
| >> Linux multi-threading code isn't the most elegant or
| robust out there
|
| Evidence?
| MichaelZuo wrote:
| You want evidence of my personal estimation of Linux?
| rektide wrote:
| I'd rather have more cache than less. There are some corner
| cases where things will go wrong that we can dig up I'm sure,
| but generally, most tasks will execute at least as fast as they
| would have without the lopsided extra cache, and some will
| operate faster.
|
| There's definitely a lot of iterating & growing we could do to
| get better here. Yet... it feels like searching for problems:
| worrying that some tasks run faster than they would have
| before (but not all tasks). In most deployment models, it
| should be fine. I think, for now, this slightly favors two use
| cases: one big monolith running on all cores, which will let
| work get consumed as it's ready, or dividing your workers
| across the different chiplets and using a good load-balancing
| strategy that is somehow utilization-aware.
|
| Personally I'm lacking in Fear, Uncertainty or Doubt over this
| making things worse.
| AnonCoward42 wrote:
| Just a small reminder that the CCD with V-Cache has much lower
| boost clocks and therefore performs worse in almost everything
| besides gaming. For casual use this is irrelevant I think, but
| it is still significant in many workloads; gaming is the only
| workload I know of where V-Cache helps more than the lower
| boost clocks hurt.
|
| The advantage of the dual-CCD design is that, at least, this
| mostly only hurts in highly threaded workloads.
| jackmott42 wrote:
| The benchmarks show it making plenty of things worse in the
| non-gaming domain on this particular CPU. On a server CPU
| that is running lower clocks anyway, it's probably nearly
| impossible for it to hurt.
| toast0 wrote:
| > Can someone who knows better than me please comment on the
| Linux server scheduling issues with a CPU like this.
|
| The scheduler was more or less designed around symmetric multi
| processing. big.LITTLE asymmetric systems will still have
| obviously preferred cores; if you're optimizing for throughput,
| add one task to each fast core first, then to each slow core,
| then maybe move tasks around to fit policies, etc.
|
| With the 7900X3D and the 7950X3D, it's trickier, because one
| chiplet has a lower clock speed but more cache. Tasks that fit
| into the smaller cache will do better on the less cache
| chiplet, and tasks that fit into the larger cache but not the
| smaller cache will do better on the larger cache chiplet, and
| tasks that don't fit into either cache will probably do better
| on the faster chiplet, but it kind of depends. In order to make
| good decisions, the scheduler would need more information about
| the task's memory access patterns, and I don't think that's
| something schedulers tend to keep track of; but if this type of
| chip is common in the future, it will need to happen.
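|
| As a toy C sketch of that decision table (purely illustrative:
| a real scheduler isn't handed a working-set size, and 32/96 MB
| are just the nominal L3 sizes of the two chiplets):
|
|   #include <stdio.h>
|   #include <stddef.h>
|
|   enum ccd { FREQ_CCD, CACHE_CCD };
|
|   /* Which chiplet a task would rather run on, given a
|      (hypothetically known) working-set size in MB. */
|   static enum ccd prefer(size_t ws_mb) {
|     if (ws_mb <= 32)   /* fits either L3: clocks win        */
|       return FREQ_CCD;
|     if (ws_mb <= 96)   /* only fits the stacked L3          */
|       return CACHE_CCD;
|     return FREQ_CCD;   /* fits neither: it depends, but the */
|   }                    /* faster clocks are the safer bet   */
|
|   int main(void) {
|     size_t sizes[] = {24, 64, 160};
|     for (int i = 0; i < 3; i++)
|       printf("%3zu MB -> %s\n", sizes[i],
|              prefer(sizes[i]) == CACHE_CCD ? "cache CCD"
|                                            : "freq CCD");
|     return 0;
|   }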
|
| For Epyc, AMD's server processor line, I believe the plan for
| X3D is to add cache to all the chiplets, keeping them roughly
| symmetric; there's still the modern situation that some cores
| will boost higher than others, and moving tasks to different
| cores can be very expensive if the process has memory still
| resident in cache on the old core, and the new core doesn't
| share that cache, etc.
| mastax wrote:
| It's a pretty similar shape of problem to NUMA, which servers
| have managed for quite some time. (Perhaps more similar to
| big.LITTLE, which is not so common in servers but which Linux
| should have decent support for thanks to ARM SoCs.)
|
| It's a bit of an unnecessary hassle and annoyance, though. I'm
| pretty sure all the V-Cache EPYCs will have cache on all their
| dies so there won't be this issue.
___________________________________________________________________
(page generated 2023-04-23 23:00 UTC)