[HN Gopher] AMD 3rd Gen EPYC Milan Review
___________________________________________________________________
AMD 3rd Gen EPYC Milan Review
Author : pella
Score : 165 points
Date : 2021-03-15 15:11 UTC (7 hours ago)
(HTM) web link (www.anandtech.com)
(TXT) w3m dump (www.anandtech.com)
| zhdc1 wrote:
| It looks like Zen 2 processors are about to become even more of
| a bargain than they already are.
|
| I'll take a 7702P at $2-3K over a 7713P at $5K ten times out of
| ten.
| dragontamer wrote:
| Considering both are made on the same 7nm TSMC process, AMD
| probably isn't going to make any more Zen2 processors at this
| point.
|
| I think you're right that buying a generation or so old can
| offer gross cost savings. But that's only true for the time
| period when those chips are available.
| IanCutress wrote:
| AMD is going to keep Zen 2 EPYC sales going for a good while
| yet. Both families will co-exist in the market.
| blagie wrote:
| I suspect so. A lot of the commercial market wants
| stability. Once I've validated a server config for a
| particular use, I want to be able to continue building
| those servers for a long time (often long past
| obsolescence).
|
| That may seem odd, but a lot of safety-critical
| applications (e.g. medical, military, aerospace, etc.)
| require spending tens of thousands, hundreds of thousands,
| or even millions of dollars (not to mention months of time)
| re-validating a system after any substantive change.
|
| Even for less critical applications, spending $2000 extra
| on each CPU is a bargain compared to re-validating a
| system.
|
| If AMD wants to be a credible presence in those markets,
| and I'm pretty sure it does, it needs to offer chips with
| many-year lifespans before EOL.
|
| Some companies manage this by having a subset of devices or
| of software which is LTS.
| derefr wrote:
| Rather than buying new-old-stock CPUs, why not just buy
| all the CPUs the long-term program will ever need when
| they're still cheap, and stockpile them? It's not like
| they go bad.
| bryanlarsen wrote:
| Hopefully they can fix their idle power consumption with a
| firmware tweak or a new stepping; that's a massive regression. It
| looks like it causes a significant performance degradation too --
| more of the power budget going to IO means less for the compute
| cores.
| blinkingled wrote:
| INVLPGB: New instruction to use instead of inter-core
| unterrupts to broadcast page invalidates, requires
| OS/hypervisor support
|
| VAES / VPCLMULQDQ: AVX2 Instructions for
| encryption/decryption acceleration
|
| SEV-ES: Limits the interruptions a malicious hypervisor may
| inject into a VM/instance
|
| Memory Protection Keys: Application control for
| access-disable and write-disable settings without TLB
| management
|
| Process Context ID (PCID): Process tags in TLB to reduce
| flush requirements
|
| Interruptions (Instructions) and Unterrupts (Interrupts) aside
| (the article obviously was pushed out as fast as AT could lol) -
| these additions seem like they would help with performance when
| it comes to mitigating all the speculation vulnerabilities in a
| hypervisor env?
| stillbourne wrote:
| I'm waiting for news on genesis peak, I'd love to get 4th gen
| threadripper on my next box.
| fvv wrote:
| Benchmarks
| https://www.phoronix.com/scan.php?page=article&item=epyc-700...
| IanCutress wrote:
| The article linked at the top has pages of benchmarks. Did....
| you miss them?
| modzu wrote:
| How is a 300W CPU cooled in a server environment? Just high RPMs
| and good environmentals? I've stayed on Intel with my workstation
| so I can keep a virtually passive and quiet heatsink without
| having to go water.
| [deleted]
| numpad0 wrote:
| By an array of fans specced like 8cm 10W 10krpm, six fans
| wide and two fans deep, blowing into a passive heatsink with
| the help of air ducts inside the chassis.
|
| Intel or AMD or SPARC or ARM, the setup is all the same for
| rackmount hardware: pure copper passive heatsinks and
| high-power axial fans.
| rodgerd wrote:
| Servers are horrifyingly loud, and datacentres will destroy
| your hearing in no time flat. Airflow management in datacentres
| is quite the art form, as well: proper packing of the racks for
| airflow, height of raised floors to accommodate blown air, and
| so on; some people are starting to go back to the future with
| things like liquid-cooled racks as well.
| formerly_proven wrote:
| Typical servers are 1U or 2U high boxes. 1U=44.5 mm and each
| box is "full depth", so around 700 mm long (sometimes more),
| and ~450 mm wide. The 1U boxes typically just have a block of
| copper with a dense bunch of fins on them as a CPU cooler,
| while 2U designs usually incorporate heatpipes.
|
| Lots of airflow.
|
| High-density systems are usually 2U, but with four dual-socket
| systems per chassis.
| ben-schaaf wrote:
| To put it in perspective, the Intel Xeon 6258R is rated at
| 205W. Servers have oodles of (likely air-conditioned) airflow,
| so this isn't a problem.
| folago wrote:
| When I worked in HPC in Tromsoe, in arctic Norway, one of the
| local supercomputers was liquid cooled and the waste heat warmed
| up some buildings on the university campus.
| Latty wrote:
| Yeah, noise is relatively unimportant so they tend to use a ton
| of extremely high RPM fans and big hunks of copper, from what
| I've seen.
| wffurr wrote:
| And the DC staff wears hearing protection when they're
| working among the racks.
| myself248 wrote:
| That still blows my mind. Coming from telecom where
| everything prior to the #5 ESS was convection cooled, a
| happy office is a quiet office.
|
| Data got weird.
| rbanffy wrote:
| I'm assuming telecom had very different volume-power
| requirements. Where I grew up there were many mid-city
| phone switches that were large, concrete exterior, almost
| windowless, buildings.
| singlow wrote:
| I wish that hearing protection had been required or at
| least offered when I used to visit data centers frequently.
| They made a big deal of the fire suppression training, but
| never even suggested ear plugs. 20-year-old me had no idea
| how bad that noise was for my ears. I hope the staff there
| were wearing plugs, but it was never apparent.
| mhh__ wrote:
| You only get one set of ears.
|
| I suspect I still have the record for most expletives
| used in front of the headmaster at my old school because
| someone turned a few kW speaker on while I was wiring
| something under the stage - i.e. I wasn't pleased.
| willis936 wrote:
| Servers are in 4U rack mounted enclosures with (relatively) low
| height heatsinks and huge amounts of airflow. Intake in front,
| exhaust in rear. Most clients will have beefy air conditioning
| to keep the ambient intake temp and humidity low.
| wtallis wrote:
| I think 4U servers are quite rare these days, except for
| models designed to accommodate large numbers of either GPUs
| or 3.5" hard drives. Most 2-socket servers with up to a few
| dozen SSDs are 2U designs.
| touisteur wrote:
| Y'all can try to pry my quad-CPU 3-UPI 4U HPE DL580 from
| my cold dead hands.
| [deleted]
| Out_of_Characte wrote:
| A cooling unit provides a delta between surface temperature
| and ambient temperature. Epyc chips are significantly larger
| and more spread out with the chiplets, so the heat density of
| these chips is relatively similar. So the cooler doesn't have
| to provide a larger temperature delta and only has to be
| slightly larger, if at all.
|
| Overall heat density on chips has increased due to lithography
| changes, so the chiplet architecture is in a way only a stopgap.
| epmaybe wrote:
| Performance per watt is better on amd right now, at least last
| I heard.
| unicornfinder wrote:
| Performance per watt _is_ better but unfortunately the idle
| power consumption is fairly high.
| epmaybe wrote:
| Oh, I see that in the article now. It also seems like the
| performance gains you saw on other Zen3 chips don't carry
| over to EPYC, due to the sheer number of cores and other
| components on the chip.
| kllrnohj wrote:
| Epyc's idle power consumption is fairly high but Ryzen's
| isn't. The more workstation-focused Threadripper &
| Threadripper Pro is also still significantly better than
| Epyc here.
| whatshisface wrote:
| That logarithmic curve fit to the samples from a step function...
| rbanffy wrote:
| Why so little L3 cache on the competition?
| xirbeosbwo1234 wrote:
| First off, it's not a direct comparison. The Epyc has one L3
| cache per chiplet. This means that latency is not uniform
| across the entire L3 cache. This was a serious concern on the
| first generation of Epyc, where accessing L3 could take
| anywhere from zero to three hops across an internal network.
| AMD has greatly reduced the problem on the more recent
| generations by switching to a star topology with more
| predictable latency.
|
| That said, there are two major reasons:
|
| 1. Epyc is on a chiplet architecture. Large chips are harder to
| make than small ones. Building two 200mm^2 chips is cheaper
| than building one 400mm^2 chip, so AMD can put more silicon
| into a package for the same price (rough numbers sketched at
| the end of this comment).
| _bigger_ than the competition. This comes with some complexity
| and inefficiency but has, so far, paid off in spades for them.
|
| 2. Epyc is on a newer process. This means AMD can fit more
| transistors into the same area. Intel has had serious
| problems with their newer processes, so this is not an
| advantage AMD expected to have when designing the part. The use
| of a cutting-edge process was, in part, enabled by the chiplet
| architecture. It is possible to fabricate several small chips
| on a 7nm process even though one large chip would be
| prohibitively expensive, and AMD has been able to use a 14nm
| process in parts of the CPU that wouldn't benefit from a 7nm
| process to cut costs.
|
| The first point is serious cleverness on the part of AMD. The
| second point is mostly that Intel dropped the ball.
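|
| To make point 1 concrete, here's a back-of-the-envelope
| sketch using the textbook Poisson yield model (yield =
| exp(-area * defect density)). The defect density here is a
| made-up illustrative number, not a real TSMC figure:
|
|     /* Two 200mm^2 chiplets vs one 400mm^2 monolithic die,
|      * same total silicon area. */
|     #include <math.h>
|     #include <stdio.h>
|
|     int main(void) {
|         double d0 = 0.001; /* defects per mm^2 (assumed) */
|         double y_small = exp(-200.0 * d0); /* 200mm^2 yield */
|         double y_big   = exp(-400.0 * d0); /* 400mm^2 yield */
|
|         printf("200mm^2 die yield: %.1f%%\n", y_small * 100);
|         printf("400mm^2 die yield: %.1f%%\n", y_big * 100);
|         /* fraction of good silicon per wafer, chiplets vs
|          * monolithic: */
|         printf("advantage: %.2fx\n", y_small / y_big);
|         return 0;
|     }
|
| With those assumed numbers the small die yields ~82% and the
| large one ~67%, and the gap widens as dies grow or defect
| density rises.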
| totalZero wrote:
| What is the likelihood that mixed-process chiplets become the
| state of the art?
| uluyol wrote:
| Intel already said they would use chiplets [1] and TSMC has
| been talking about the various packaging technologies being
| developed [2].
|
| [1] https://www.anandtech.com/show/16021/intel-moving-to-
| chiplet...
|
| [2] https://www.anandtech.com/show/16051/3dfabric-the-home-
| for-t...
| Macha wrote:
| Aren't they already for big (desktop/workstation/server)
| chips? I'd say Zen3 is the state of the art in that market
| and that uses a mixed process. The IO dies are
| GlobalFoundries 12nm for AMD.
|
| The mobile market cares more about efficiency than easily
| scaling up to much bigger chips, so the M1 and other ARM
| chips are probably going to ignore this without much
| consequence for smaller chips.
|
| Intel still tops sales because of non-perf related reasons
| like refresh cycles, distrust of AMD from last time they
| fell apart in the server space, producing chips in
| sufficient quantities unlike the entire rest of the
| industry fighting over TSMC's capacity, etc.
| dragontamer wrote:
| EPYC is a split L3 cache. Any particular core only benefits
| from 32MB of L3; the 33rd MB is "on another chip". (EDIT: Zen2
| was 16MB, Zen3 is 32MB. Fixed numbers for Zen3)
|
| As such, AMD can make absolutely huge amounts of L3 cache
| (well, many parallel L3 clusters), while other CPU designers
| need to figure out how to combine the L3 so that a single core
| can benefit from it all.
| xirbeosbwo1234 wrote:
| That's not quite accurate. Every core has access to the
| entire L3, including the L3 on an entirely different socket.
| CPUs communicate through caches, so if a core just plain
| couldn't talk to another core's cache then cache coherency
| algorithms wouldn't work. Though a core can access the entire
| cache, the latency is higher when going off-die. It is
| _really_ high when going to another socket.
|
| The first generation of Epyc had a complicated hierarchy that
| made latency quite hard to predict, but the new architecture
| is simpler. A CPU can talk to a cache in the same package but
| on a different die with reasonably low latency.
|
| (I don't have numbers. Still reading.)
| dragontamer wrote:
| In Zen1, the "remote L3" caches had longer read/write times
| than DDR4.
|
| Think of the MESI messages that must happen before you can
| talk to a remote L3 cache:
|
| 1. Core#0 tries to talk to L3 cache associated with
| Core#17.
|
| 2. Core#17 has to evict data from L1 and L2, ensuring that
| its L3 cache is in fact up to date. During this time,
| Core#0 is stalled (or working on its hyperthread instead).
|
| 3. Once done, then Core#17's L3 cache can send the data to
| Core#0's L3 cache.
|
| ----------
|
| In contrast, step#2 doesn't happen with raw DDR4 (no core
| owns the data).
|
| This fact doesn't change with the new "star" architecture
| of Zen2 or Zen3. The I/O die just makes it a bit more
| efficient. I'd still expect remote L3 communications to be
| as slow as, or slower than, DDR4.
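|
| If you want to see that coherency cost directly, a minimal
| (untuned, assumption-laden) sketch is two threads bouncing a
| single cache line back and forth; pin them to the same CCX
| vs different dies and watch the round-trip time jump:
|
|     #include <pthread.h>
|     #include <stdatomic.h>
|     #include <stdio.h>
|     #include <time.h>
|
|     #define ITERS 1000000
|     static _Atomic int token;  /* the contested cache line */
|
|     static void *pong(void *arg) {
|         for (int i = 0; i < ITERS; i++) {
|             while (atomic_load(&token) != 1) ; /* wait ping */
|             atomic_store(&token, 0);           /* send pong */
|         }
|         return arg;
|     }
|
|     int main(void) {
|         pthread_t t;
|         struct timespec a, b;
|         pthread_create(&t, NULL, pong, NULL);
|         clock_gettime(CLOCK_MONOTONIC, &a);
|         for (int i = 0; i < ITERS; i++) {
|             atomic_store(&token, 1);           /* send ping */
|             while (atomic_load(&token) != 0) ; /* wait pong */
|         }
|         clock_gettime(CLOCK_MONOTONIC, &b);
|         pthread_join(t, NULL);
|         double ns = (b.tv_sec - a.tv_sec) * 1e9
|                   + (b.tv_nsec - a.tv_nsec);
|         printf("round trip: %.0f ns\n", ns / ITERS);
|         return 0;
|     }
|
| Compile with -pthread and run it twice under taskset (e.g.
| taskset -c 0,1 and then a pair of cores you know sit on
| different CCDs) to compare same-CCX vs cross-die round trips.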
| rbanffy wrote:
| What AMD does is not magic and is not beyond what others can
| do. My question is why they chose to have just 32MB for up to
| 80 cores when AMD can choose to have 32MB per 8-core chiplet.
|
| As a comparison, an IBM z15 mainframe CPU has 10 cores and
| 256MB per socket.
| dragontamer wrote:
| > As a comparison, an IBM z15 mainframe CPU has 10 cores
| and 256MB per socket.
|
| Well, that's eDRAM magic, isn't it? Most manufacturers are
| unable to make eDRAM on a CPU.
|
| > My question is why they chose to have just 32MB for up to
| 80 cores when AMD can choose to have 32MB per 8-core
| chiplet.
|
| From my understanding, those ARM chips are largely I/O
| devices: read from disk -> output to Ethernet.
|
| In contrast, IBM's are known for database backends, which
| likely benefits from gross amounts of L3 cache. EPYC is
| general purpose: you might run a database on it, you might
| run I/O constrained apps on it. So kind of a middle ground.
| meepmorp wrote:
| IBM doesn't fab its own chips, right? I thought they used
| GF.
| wmf wrote:
| It's basically IBM fabs that were "sold" to
| GlobalFoundries. AFAIK IBM processors use a customized
| process that isn't used by any other GF customers.
| ChuckNorris89 wrote:
| Limitations due to die size and power consumption since Intel
| Xeon is still on the _ye olde_ 14nm++ process.
|
| Also, since Xeon dies are monolithic, unlike AMD's chiplet
| design, increasing the size of certain components on the die,
| like cache for example, increases the risk of defects, which
| reduces yields, making them unprofitable.
| rbanffy wrote:
| True, but the ARM ones have just 32MB for up to 80 threads.
|
| I wonder if we could get numbers for L3 misses and cycles
| spent waiting for main memory under realistic workloads.
| dragontamer wrote:
| That information changes with every application. Literally
| every single program in the world has its own cache
| characteristics.
|
| I suggest learning to read performance counters, so that
| you can get information like this yourself! L3 cache is a
| bit difficult for AMD processors (because many cores share
| the L3 cache), but L2 cache is pretty easy to work with and
| profile.
|
| General memory reads / memory latency are pretty easy to
| read with various performance counters. Given the amount of
| latency, you can sorta guess if it's in L3 or in DDR4.
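|
| As a starting point, here's a minimal Linux perf_event_open
| sketch that counts cache misses around a workload. It uses
| the generic PERF_COUNT_HW_CACHE_MISSES event; picking exact
| per-level AMD counters would need raw event codes from AMD's
| docs, and error handling is omitted:
|
|     #include <linux/perf_event.h>
|     #include <sys/syscall.h>
|     #include <sys/ioctl.h>
|     #include <unistd.h>
|     #include <string.h>
|     #include <stdint.h>
|     #include <stdio.h>
|
|     int main(void) {
|         struct perf_event_attr pe;
|         memset(&pe, 0, sizeof(pe));
|         pe.type = PERF_TYPE_HARDWARE;
|         pe.size = sizeof(pe);
|         pe.config = PERF_COUNT_HW_CACHE_MISSES;
|         pe.disabled = 1;
|         pe.exclude_kernel = 1;
|
|         /* measure this process, on any CPU */
|         int fd = syscall(SYS_perf_event_open, &pe,
|                          0, -1, -1, 0);
|
|         ioctl(fd, PERF_EVENT_IOC_RESET, 0);
|         ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
|
|         /* ... the workload you want to profile ... */
|
|         ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
|         uint64_t count;
|         read(fd, &count, sizeof(count));
|         printf("cache misses: %llu\n",
|                (unsigned long long)count);
|         close(fd);
|         return 0;
|     }
|
| perf stat on the command line gets you the same numbers
| without writing any code.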
| rbanffy wrote:
| There must be a typo on the 74F3 price. US$2900 for it is a
| steal.
| tecleandor wrote:
| I found it in the original press release (price for 1K units,
| of course)
|
| https://ir.amd.com/news-events/press-releases/detail/993/amd...
| masklinn wrote:
| Still seems like a typo, it doesn't make any sense that the
| 24 / 48 would be priced between the 8 / 16 and the 16 / 32.
| Either the prices of the 73 and 74 were swapped or the tag is
| just plain wrong. "2900" is also very suspiciously round
| compared to every other price on the press release.
| dragontamer wrote:
| How is it suspicious?
|
| 256MB L3 (or really, 8 x 32MB L3) and 24 cores suggest
| the bottom-of-the-barrel 3 cores active per 8-core CCX.
|
| 8x CCX with 3 cores each. The yields on those chips must be
| outstanding: it's like 62.5% of the cores could have
| critical errors and they can still sell it at that price.
|
| EDIT: My numbers were wrong at first. Fixed. Zen3 is
| double-sized CCX (32MBs / CCX instead of 16MBs/CCX)
|
| ---------
|
| In contrast, the 28-core 7453 is $1,570. I personally would
| probably go with the 28-core (with only 2x32MB L3 cache, or
| 64MBs) rather than the 24-core with 256MBs L3 cache.
|
| For my applications, I bet that having 7 cores share an L3
| cache (and therefore be able to communicate quickly) is better
| than having 1 or 2 cores with 32MB of L3 to themselves.
|
| There are also significant price savings, as well as
| significant power / wattage savings with the 28-core /
| 64MBs model.
| [deleted]
| masklinn wrote:
| > In contrast, the 28-core 7453 is $1,570.
|
| Which is cheaper than the 24c 7443 and 7413 but not the
| 16c 7343 and 7313.
|
| And it only has half the L3 compared to its siblings
| (1/4th compared to the 7543 top end), a lower turbo than
| every other processor in the range (whether lower or
| higher core counts), as well as an unimpressive base
| frequency, and a fairly high TDP by comparison (as high
| as the 7543).
|
| The 74F3 has no such discrepancy: it has the same L3 as
| every other F-series and slots neatly into the range
| frequency-wise: same turbo as its siblings (with the 72
| being 100MHz higher), a base 300MHz lower than the 73 and
| 250MHz higher than the 75.
| dragontamer wrote:
| > Which is cheaper than the 24c 7443 and 7413 but not the
| 16c 7343 and 7313.
|
| 28-cores for $1570 seems to be the "cheapest per core" in
| the entire lineup.
|
| It all comes down to whether you want those cores
| actually communicating over L3 cache, or not. Do you want
| 7-cores per L3 cache, or do you prefer 4-cores per L3
| cache?
|
| 4 cores per L3 cache benefit from having more overall
| cache per core. But more cores per L3 cache means that
| more of your threads can tightly communicate cheaply and
| effectively.
|
| ---------
|
| More L3 cache per core probably benefits cloud deployments,
| virtual desktops, and similar (since those cores aren't
| communicating as much).
|
| More cores per L3 cache benefits more tightly integrated
| multicore applications.
|
| EDIT: Also note that "more cores" means more L1 and L2
| cache, which is arguably more important in compute-heavy
| situations. L3 cache size is great of course, but many
| applications are L1 / L2 constrained and will prefer more
| cores instead. 24c 7443 with 2x32MB L3 is probably a
| better chess-engine than 16c 7343 4x32MB L3.
| mrb wrote:
| It doesn't seem to be a typo. AMD offers many variations of
| each core configuration, with different base frequencies.
| It's just that there are simply pricing overlaps between
| some low-core high-freq versions and some higher-core
| lower-freq versions. For example the 7513 (32 cores) is also
| cheaper than the 73F3 (16 cores).
|
|     75F3   32-core   2.95GHz   $4,860
|     7543   32-core   2.80GHz   $3,761
|     7513   32-core   2.60GHz   $2,840
|     74F3   24-core   3.20GHz   $2,900
|     7443   24-core   2.85GHz   $2,010
|     7413   24-core   2.65GHz   $1,825
|     73F3   16-core   3.50GHz   $3,521
|     7343   16-core   3.20GHz   $1,565
|     7313   16-core   3.00GHz   $1,083
|
| Source: https://ir.amd.com/news-events/press-
| releases/detail/993/amd...
| coder543 wrote:
| It makes perfect sense if you're an enterprise customer and
| your software dependencies charge you very different
| license tiers for different maximum core counts. AMD is
| selling a license-optimized part at a higher price because
| there will be plenty of demand for it.
|
| People who don't save a boatload by getting the license-
| optimized CPU will invariably choose to buy the 24-core
| one, which helps AMD by making it easier for them to keep
| up with the demand for the 16-core variant, and the 16-core
| variant gets an unusually nice profit margin. Win win.
|
| This is not the first time AMD or Intel have offered a
| weird inverse-pricing jump like this... I highly doubt it
| is a typo.
|
| My other comment reiterates some of these points a
| different way:
| https://news.ycombinator.com/item?id=26469182
| rodgerd wrote:
| Yep - in the past I've done "special orders" for not-
| publicly-advertised CPU configs from our hardware vendor
| to get low core count, high-clock servers for products
| like Oracle DB.
| fvv wrote:
| right, I think 3900 may be the correct price
| wffurr wrote:
| RTFA:
|
| " Users will notice that the 16-core processor is more
| expensive ($3521) than the 24 core processor ($2900) here. This
| was the same in the previous generation, however in that case
| the 16-core had the higher TDP. For this launch, both the
| 16-core F and 24-core F have the same TDP, so the only reason I
| can think of for AMD to have a higher price on the 16-core
| processor is that it only has 2 cores per chiplet active,
| rather than three? Perhaps it is easier to bin a processor with
| an even number of cores active"
| coder543 wrote:
| I really don't think the article's speculation there is
| helpful... it's really reaching.
|
| As I said below the article in the comments:
|
| > If I were to speculate, I would strongly guess that the
| actual reason is licensing. AMD knows that more people are
| going to want the 16 core CPUs in order to fit into certain
| brackets of software licensing, so AMD charges more for those
| to maximize profit and availability of the 16 core parts. For
| those customers, moving to a 24 core processor would probably
| mean paying _significantly_ more for whatever software they
| 're licensing.
|
| This is the more compelling reason to me, and it matches with
| server processors that Intel and AMD have charged more for in
| the past.
|
| "Even vs odd" affecting the difficulty of the binning process
| just sounds extremely arbitrary... definitely not likely to
| affect customer prices, given how many other products are in
| AMD's stack that don't show this same inverse pricing
| discrepancy.
| cm2187 wrote:
| I am building a single socket server right now, I can't really
| justify more than twice the price of a 7443P for a marginally
| higher base clock and twice the cache. Does the cache makes
| that much of a difference? I thought these are already very
| large caches vs lots of Intel CPUs.
| dragontamer wrote:
| Hmm, with AMD Threadripper, you're already looking at TLB
| issues at these L3 sizes. So if you actually want to take
| advantage of lots of L3, you need either many cores, or
| hugepages.
|
| Case in point: AMD Zen2 has 2048 L2 TLB entries, under a
| default (in Linux and Windows) of 4kB per TLB entry. That's
| 8MB of TLB coverage before your processor starts to page-walk.
|
| Emphasis: your application will page-walk while the data still
| fits in L3 cache.
|
| ------------
|
| I'm looking at some of this lineup with 3 cores per CCX
| (32MB L3 cache), which means that under default 4kB pages,
| those cores will always require page-walks just to read/write
| their 32MB of L3 cache effectively.
|
| With that being said: 2048 TLB entries for Zen2 processors.
| Maybe AMD has increased the TLB entries for Zen3. Either way,
| you probably should start looking at hugepage configuration
| settings...
|
| These L3 cache sizes are absurd, to the point where it's kind
| of unwieldy. I mean, with enough configuration / programming,
| you can really make these things fly. But it's not exactly
| plug-and-play.
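|
| For reference, a minimal sketch of grabbing an explicit 2MB
| hugepage on Linux with mmap. It assumes hugepages have been
| reserved first (e.g. echo 64 > /proc/sys/vm/nr_hugepages
| as root):
|
|     #include <sys/mman.h>
|     #include <stdio.h>
|
|     #define HUGE_2MB (2UL * 1024 * 1024)
|
|     int main(void) {
|         void *p = mmap(NULL, HUGE_2MB,
|                        PROT_READ | PROT_WRITE,
|                        MAP_PRIVATE | MAP_ANONYMOUS |
|                        MAP_HUGETLB, -1, 0);
|         if (p == MAP_FAILED) {
|             perror("mmap(MAP_HUGETLB)"); /* none reserved? */
|             return 1;
|         }
|         /* One TLB entry now covers 2MB instead of 4kB, so
|          * 2048 entries cover 4GB rather than 8MB. */
|         printf("2MB hugepage mapped at %p\n", p);
|         munmap(p, HUGE_2MB);
|         return 0;
|     }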
| justincormack wrote:
| The 64k page size available on Arm (and Power) makes a lot
| more sense with these kinds of cache sizes. With 2MB amd64
| hugepages it's only 16 different pages in that L3 cache,
| which for a cluster of up to 8 CPUs is not much at all when
| using huge pages.
| dragontamer wrote:
| TLB misses always slow down your code, even when you're
| out of cache.
|
| So having 2MB (or even 1GB) hugepages is a big advantage
| in memory-heavy applications, like databases. No, 1GB
| pages won't fit in L3 cache, but it still means you won't
| have to page-walk when looking for memory.
|
| 1GB pages might be too big for today's computers, but 2MB
| pages might be good enough as a default now. Historically,
| 4kB was needed for swap purposes (going to 2MB with swap
| would incur too much latency when data paged out), but with
| 32GB of RAM + SSDs on today's computers... fewer and fewer
| people seem to need swap.
|
| There might be some kind of fragmentation benefit to using
| smaller pages, but it really is a hassle for your CPU's TLB
| to try to keep track of all that virtual memory and put it
| back in order.
|
| ---------
|
| While there are performance hits associated with page-
| walks, the page-walk process is fortunately pretty fast.
| So most applications probably won't notice a major
| speedup... still though, the idea of tons of unnecessary
| page-walks slowing down untold amounts of code bothers me
| a bit for some reason.
|
| Note: ARM also supports hugepages. So going up to 2MB
| (or bigger) on ARM is also possible.
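|
| And if reserving explicit hugepages up front is too much
| hassle, a sketch of the transparent route is
| madvise(MADV_HUGEPAGE) on a normal anonymous mapping;
| unlike explicit hugetlb pages, the kernel can still swap or
| split these:
|
|     #include <sys/mman.h>
|     #include <stdio.h>
|
|     int main(void) {
|         size_t len = 64UL * 1024 * 1024; /* 64MB */
|         void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
|                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
|         if (p == MAP_FAILED) { perror("mmap"); return 1; }
|
|         /* Hint: back this range with transparent 2MB pages
|          * where possible. */
|         if (madvise(p, len, MADV_HUGEPAGE) != 0)
|             perror("madvise(MADV_HUGEPAGE)");
|
|         printf("mapped %zu bytes at %p with THP hint\n",
|                len, p);
|         munmap(p, len);
|         return 0;
|     }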
| temptemptemp111 wrote:
| But why won't AMD let us do any secure booting?
| ChuckMcM wrote:
| Nice bump in specs. Perhaps now they will announce the Zen3
| Threadripper :-). As others have mentioned, the TR can starve
| itself on memory accesses when doing a lot of cache invalidation
| (think pointer chasing through large datasets). If the EPYC
| improvement of having all the cores on a chiplet share L3 cache
| moves into the TR space (which one might assume it will[1]) then
| this could be a reason to upgrade.
|
| [1] I may be wrong here but the TR looks to me like an EPYC chip
| with the multi-CPU stuff all pulled off. It would be interesting
| to have a decap with the chiplets identified.
| gameswithgo wrote:
| Yes, TR will have the new cache configuration, just like regular
| Ryzen and EPYC do.
___________________________________________________________________
(page generated 2021-03-15 23:01 UTC)