[HN Gopher] AMD EPYC 7C13 Is a Surprisingly Cheap and Good CPU
       ___________________________________________________________________
        
       AMD EPYC 7C13 Is a Surprisingly Cheap and Good CPU
        
       Author : PaulHoule
       Score  : 131 points
       Date   : 2024-03-27 15:12 UTC (7 hours ago)
        
 (HTM) web link (www.servethehome.com)
 (TXT) w3m dump (www.servethehome.com)
        
       | jeffbee wrote:
       | This is some fell-off-a-truck stuff. Aren't the weird part
       | numbers with infix letters custom made for large customers
       | (Amazon, Google, et al.)?
        
         | astrodust wrote:
         | Large cloud providers dump their gear in bulk all the time, and
         | these parts get picked, tested and packaged for resale.
         | 
         | I'm not sure these custom parts are barred from resale like the
         | "ES" (Engineering Sample) type chips are.
        
           | jeffbee wrote:
           | I don't think they are barred from sale, but I do think that
           | if you're selling secondhand CPUs on Newegg, the used nature
           | of the hardware should be prominently stated. That is, for
           | customers who are still willing to risk their money on
           | Newegg.
           | 
           | With those caveats it's a great deal for something like a
           | build box. You could put this into an existing ATX case with
           | $1000 worth of RAM (that you may already own?) for less than
           | the price of a new Threadripper CPU.
        
             | CogitoCogito wrote:
              | Is Newegg bad now? It's been a long time since I've ordered
              | from them, but I've only had positive experiences with them.
        
               | radicality wrote:
                | It's become a marketplace ever since it was bought out,
                | so there are lots of sellers of varying quality. As long
                | as you always filter by "sold and shipped by Newegg" you
                | should be fine.
        
               | mindcrime wrote:
                | Who are the recommended vendors for purchasing PC parts
                | these days, then? That is, who (if anybody) fills New
                | Egg's previous niche? I've actually just bought a bunch
                | of stuff from New Egg, after not doing any PC building
                | for 15+ years, and didn't initially realize how much
                | they had switched to the "marketplace" model.
        
               | jeffbee wrote:
               | A good online retailer is B&H Photo. As far as I have
               | seen, everything they sell is first-party. It's not a
               | marketplace like Amazon or Newegg.
        
           | kjs3 wrote:
            | They don't have to be used; there are overstock and grey-
            | market possibilities.
           | 
            | I've seen many (usually smaller) runs of products with
            | house-marked or otherwise oddly identified chips, and when
            | asked, the producer said "the OEM didn't use them so they
            | sold them to us cheap". And I've certainly bought a couple
           | brand new big-box labeled motherboards that were really (and
           | obviously) minor variations of existing Asus, Gigabyte or
           | Supermicro motherboards. Shoot, somewhere I've got a NiB
           | Intel Phi card with a weird part number only because it was
           | made for (I think) Dell and now that Phi is dead they were
           | being fire-saled.
        
         | pengaru wrote:
         | Or whole-system vendors like Lenovo/HP?
        
         | JonChesterfield wrote:
          | Nah, it's just variation in the price of older stock.
          | Relatively few buyers and relatively low stock push the
          | variance up.
         | E.g. I see a 7763 listed at 3k in one store and 4k in another.
         | 
          | If you can find a motherboard to match, it's a lot of computer
         | for the price.
        
         | derefr wrote:
         | Some maybe-interesting observations about my experiences over
         | the last few years, as someone who uses both cloud-provisioned
         | (GCP N2D) and dedicated-server (e.g. OVH HGR-HCI-class) AMD
         | EPYC-based machines at $work.
         | 
         | * GCP N2D instances always had a strict per-AZ allocation
         | quota. This allocation quota has _not_ increased over time. And
         | when we asked to have it bumped up, it was the only time a
         | quota-increase request of ours has ever been denied.
         | 
         | * When OVH was offering their HGR-HCI-6 machine type (2x EPYC
         | 7532), we provisioned a few of them. The first few, leased ~2
         | years back, each took a few days to provision -- presumably,
         | OVH doesn't buy these expensive CPUs until a customer asks for
         | a machine to be stood up with one in them. More recently,
         | though (~6mo ago), for the same machine type, they gave us a
         | provisioning lead time of more than a month, due to supply
         | difficulties for the CPU.
         | 
         | * These chips were buggy! Again on OVH, when allocating these
         | HGR-HCI-6 machines, we were allocated two separate machines
         | that ended up having CPU faults. (Symptoms: random reboots
         | after an hour or two of heavily utilizing the native AES-NI
         | instructions; and spurious "PCI-e link training errors" in
         | talking to the network card and/or NVMe drives.) They were
         | replaced quickly, but I've never seen this kind of CPU fault on
         | a hardware-managed system before or since.
         | 
         | * Just a month ago, the high-end dedicated-server hosters (OVH,
         | but also Hetzner and so forth) seem to have removed all their
         | SKUs that use 2nd- and 3rd-gen EPYC 7xxx CPUs. (Except for one
         | baseline SKU on OVH, which is probably there because they have
         | a big pile of them.) Everything suddenly switched over to 4th-
         | gen 9xxx EPYCs just a month or two ago. It might just be that
         | availability of these 9xxx EPYCs is finally reaching levels
         | where these providers think they can meet demand with them --
         | but everyone switching over simultaneously, _and_ dropping
         | their old SKUs at the same time?
         | 
         | * GCP recently launched the storage-optimized Z3 instance type.
         | They chose to build this instance type on an Intel platform
         | (Sapphire Rapids.) That's even though AMD EPYCs have had enough
         | PCIe lanes to deliver equivalent performance to this Z3
         | platform -- ignoring the "Titanium offload" part, which isn't
         | CPU-platform-specific -- for years. (In fact, the need for a
         | huge pool of fast NVMe is in part _why_ we switched some of our
         | base load from GCP over to those OVH HGR-HCI-6 instances --
         | which satisfied our needs quite well.) GCP could _in theory_
         | have launched something akin to this instance type, with the
         | same 36TiB storage pool size (but PCIe 4.0 speeds rather than
          | 5.0) three years ago, using EPYC 7xxxs. Customers have been
         | asking for something like that for years now -- wondering why
         | GCP instances are all limited to 8.8TiB of local NVMe. (We
          | actually asked them ourselves, back then, where "the instance
         | type with more local NVMe" was. They gave a very handwave-y
         | response, which in retrospect, may have been a "we're trying,
         | but it's not looking good for delivering this at scale right
         | now" response.)
         | 
         | These points all lead me to believe that something _weird_
          | happened with the EPYC 7xxx rollout. Supply didn't grow to
         | meet demand over time.
         | 
         | And then, suddenly, _after_ the seeming EOL of these chips --
         | but long _before_ cloud providers would normally cycle them out
          | -- we're seeing 7xxxs ending up on the open market, in enough
         | bulk to make them affordable? Bizarre.
         | 
         | ---
         | 
          | My own vague hypothesis at this point is that at some point
         | during the generation, AMD discovered a fatal flaw in the
         | silicon of the entire EPYC 7xxx platform. Maybe it was the
         | hardware crypto instructions, like I saw. Or maybe it was some
         | capability more specific to cloud-computing customers (SEV-
         | SNP?) that turned out to not work right (which would make more
         | sense given that Threadrippers didn't see the same problems.)
         | So the big cloud customers immediately halted their purchase
         | orders (keeping only what they had already installed so as to
         | not disturb existing customer workloads); and AMD responded by
         | scaling down production.
         | 
         | This resulted in two things: a supply shock of AMD EPYC-based
         | machines/VMs that lasted for a while; but also, negotiated
         | settlements with the cloud vendors, where AMD was now obligated
          | to fulfill existing POs for 7xxx parts _with 9xxxs_, as they
         | ramped up production of those. Which is why 9xxxs have taken so
         | long (2 years!) to make it onto the open market: the lines have
         | been dedicated to fulfilling not just 9xxx bulk purchase-
         | orders, but also 7xxx purchase-orders.
         | 
         | (And which is why the switchover to 9xxx among smaller players
         | is so immediate: such a switchover has been on every hosting
         | company's roadmap for a long time now, having been repeatedly
         | delayed by supply issues due to the huge volume of 9xxx parts
         | required to satisfy the clouds' backlogged demand. They've had
         | a stock of 9xxx-compatible motherboards + memory + PSUs +
         | chassis just sitting there for months/years now, waiting for
         | 9xxx CPUs to slot into them.)
         | 
         | Perhaps we're seeing these cloud-customer 7Cxx parts on the
         | open market now, because the clouds have finally received
         | enough 9xxxs to satisfy their actual demand for 9xxxs, _and_
         | their backlogged demand for 7xxxs; and the clouds are now
         | finally at the point where they can replace their initial
          | _actual_ (faulty / feature-disabled) 7xxx parts they were sent
         | with 9xxxs, selling off the 7xxx parts.
         | 
         | My guess is that, now that they have "fixed" AMD chips in
         | place, we'll soon see the cloud providers heavily hyping up
         | some particular AMD-silicon-enabled feature that they had been
         | _starting_ to market four years ago, but then went radio-silent
         | on. ( "Confidential computing", maybe.)
         | 
         | ---
         | 
         | I'd love to hear what someone with more insider knowledge
         | thinks is happening here.
        
           | AnthonyMouse wrote:
           | CPU faults on individual machines aren't that rare. The
           | machine has a dodgy power supply that almost works but has
           | voltage drop under load etc. Sometimes this can be caused by
           | environmental factors. The rack is positioned poorly and has
           | thermal issues, the UPS is supplying bad power etc. Then you
           | can see issues with multiple machines, or replace the machine
           | without fixing the issue. Vendors often put machines for the
           | same customer in the same rack for various reasons, e.g.
           | because they might send a lot of traffic to each other and
           | put less load on their network if connected to the same
           | switch, but then if there is a problem in that rack it
           | affects more of your machines.
           | 
           | The Epyc 7000 series was popular. There have been enough of
           | them in private hands for long enough that if there were
           | widespread issues they would be well-known.
           | 
           | It's possible that AMD didn't order enough capacity from TSMC
           | to meet demand, and couldn't get more during the COVID supply
           | chain issues. For the 9000 series they learned from their
           | mistake, or there is otherwise more fab capacity available
           | now, so customers can get them. Meanwhile cloud providers
           | really like Zen4c because they can sell "cores" that cost
           | less and use less power, so they're buying it and replacing
           | their existing hardware as they tend to do regardless. That
           | is typically how they expand their business: If you add more
           | servers you need more real estate and power and cooling. If
           | you replace older servers with faster ones, you don't.
        
             | derefr wrote:
             | To be clear, it was a CPU fault that doesn't occur at all
             | when running e.g. stress-ng, but _only_ (as far as I know)
             | when running our particular production workload.
             | 
             | And only after _several hours_ of running our production
             | workload.
             | 
             | But then, once it's known to be provokeable for a given
             | machine, it's extremely reliable to trigger it again -- in
             | that it seems to take the same number of executed
             | instructions that utilize the faulty part of the die, since
             | power on. (I.e. if I run a workload that's 50% AES-NI and
             | 50% something else, then it takes exactly twice as long to
             | fault as if the workload was 100% AES-NI.)
             | 
              | And it _isn't_ provoked any more quickly by having just
             | provoked it and then running the same workload again --
             | i.e. there's no temporal locality to it. Which would make
             | both "environmental conditions" and "CPU is overheating /
             | overvolting" much less likely as contributing factors.
             | 
             | > There have been enough of them in private hands for long
             | enough that if there were widespread issues they would be
             | well-known.
             | 
             | Our setup is likely a bit unusual. These machines that
              | experienced the faults have every available PCIe lane
             | (other than the few given to the NIC) dedicated to NVMe;
             | where we've got the NVMe sticks stuck together in
             | extremely-wide software RAID0 (meaning that every disk read
             | fans in as many almost-precisely-parallel PCIe packets
             | contending for bus time to DMA their way back into the
             | kernel BIO buffers.) On top of this, we then have every
             | core saturated with parallel CPU-bottlenecked activity,
             | with a heavy focus on these AES-NI instructions; and a high
              | level of rapid allocation/deallocation of multi-GB per-
             | client working arenas, contending against a very large
              | _and_ very hot disk page cache, for a working set that's
             | far, far larger than memory.
             | 
              | I'll put it like this: _some_ of these machines are "real-
             | time OLAP" DB (Postgres) servers. And under load, our PG
             | transactions sit in WAIT_LWLOCK waiting to start up,
             | because they're actually (according to our profiling)
             | _contending over acquiring the global in-memory pg_locks
             | table_ in order to write their per-table READ_SHARED locks
              | there (in turn because they're dealing with wide joins
             | across N tables in M schemas where each table has hundreds
             | of partitions and the query is an aggregate so no
             | constraint-exclusion can be used. Our pg_locks Prometheus
             | metrics look _crazy_.) Imagine the TLB havoc going on, as
             | those forked-off heavy-workload query workers also all
             | fight to memory-map the same huge set of backing table heap
             | files.
             | 
             | It's to the point that if we don't either terminate our
             | long-lived client connections (even when _not_ idle), or
             | restart our PG servers at least once a month, we actually
             | see per-backend resource leaks that eventually cause PG to
             | get OOMed!
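              | 
              | (Not our exact tooling -- just a minimal sketch of how
              | to watch this kind of contention, assuming psycopg2 and
              | a placeholder DSN; pg_stat_activity and pg_locks are
              | stock Postgres views:)
              | 
              |     import psycopg2
              |     conn = psycopg2.connect("dbname=app")  # placeholder
              |     with conn.cursor() as cur:
              |         # backends currently stuck on lightweight locks
              |         cur.execute("""SELECT wait_event, count(*)
              |             FROM pg_stat_activity
              |             WHERE wait_event_type = 'LWLock'
              |             GROUP BY 1 ORDER BY 2 DESC""")
              |         print(cur.fetchall())
              |         # rows in the shared lock table (what the
              |         # heavily partitioned joins inflate)
              |         cur.execute("SELECT count(*) FROM pg_locks")
              |         print(cur.fetchone())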
             | 
              | The machines that _aren't_ DB servers, meanwhile -- but
             | are still set up the same on an OS level -- are blockchain
             | nodes, running https://github.com/ledgerwatch/erigon, which
             | likes to do its syncing work in big batches: download N
             | blocks, then execute N blocks, then index N blocks. The
             | part that reliably causes the faults is "hashing N blocks",
             | for sufficiently large values of N that you only ever
             | really hit during a backfill sync, not live sync.
             | 
             | In neither case would I expect many others to have hit on
             | just the right combination of load to end up with the same
             | problems.
             | 
             | (Which is why I don't really believe that whatever problem
              | AMD might have seen is related to this one. This seems
             | more like a single-batch production error than anything,
             | where OVH happened to acquire multiple CPUs from that
             | single batch.)
             | 
             | ---
             | 
             | > It's possible that AMD didn't order enough capacity from
             | TSMC to meet demand, and couldn't get more during the COVID
             | supply chain issues.
             | 
             | Yes, but that doesn't explain why they weren't able to ramp
             | up production at _any_ point in the last four years. Even
             | now, there are still likely some smaller hosts that would
             | like to buy EPYC 7xxxs at more-affordable prices, if AMD
             | would make them.
             | 
             | You need an additional factor to explain this lack of ramp-
              | up _post_-COVID; and to explain why the cloud providers
             | never _started_ receiving more 7xxxs (which they _would_
             | normally do, to satisfy legacy clients who want to
              | replicate their exact setup across more AZs/regions.)
             | Server CPUs don't normally have 2-year purchase
             | commitments! It's normally more like 6!
             | 
             | Sure, maybe Zen4c was super-marketable to the clouds'
             | customers and saved them a bunch of OpEx -- so they
             | negotiated with AMD to _drop_ all their existing spend
             | commitments on 7xxx parts purchases in favor of committing
             | to 9xxx parts purchases.
             | 
             | But why would AMD agree to that, without anything the
             | clouds could hold over their head to force them into it? It
             | would mean shutting down many of the 7xxx production lines
             | early, translating to the CapEx for those production lines
             | not getting paid off! Being able to pay off the production
             | lines is why CPU vendors negotiate these long purchase
             | commitments in the first place!
             | 
             | And if the clouds _are_ replacing capacity, then where are
             | all those _used_ CPUs going?
             | 
             | Take notice that the OP article isn't talking about a used
             | CPU, but a "new server" -- namely (I think) this one:
             | https://www.newegg.com/tyan-s8030gm4ne-2t-supports-amd-
             | epyc-...
             | 
             | This server was never in an IaaS datacenter. This is a
             | motherboard straight from the motherboard vendor, with an
             | EPYC 7C13 prepopulated into it.
             | 
             | This isn't the sort of thing you get when a cloud resells.
             | This is the sort of thing you get when a cloud (or other
             | hosting provider) _stops buying unexpectedly_ -- and
              | upstream suppliers/manufacturers/integrators are left
              | holding the bag of preconfigured-to-spec hardware they no
             | longer have a pre-committed buyer for.
        
       | 486sx33 wrote:
        | NOT shipped by Newegg, but very interesting:
       | https://www.newegg.com/tyan-s8030gm4ne-2t-supports-amd-epyc-...
       | 
        | Seems like good value per dollar for a Monero mining rig:
        | approximately 4x the performance of a 5950X on the
        | monerobenchmark site. Given that the EPYC has about 4 times the
        | cache (256MB vs 64MB), this makes sense in the Monero world. In
        | a real-world side-by-side comparison I'd assume the EPYC would
        | get past 4x a real-world 5950X, which requires a lot of tweaking
        | to get anywhere close to the monerobenchmark numbers. I'd expect
        | the EPYC to run better out of the box.
        
         | bethekind wrote:
          | I've always wondered how the EPYCs with huge amounts of L3
          | would perform on Monero.
         | 
          | Is there an optimal core-to-L3 ratio, or is more always
          | better?
        
           | 486sx33 wrote:
            | Here is a quick blurb from someone on Reddit which sums up
            | the general advice in a way that matches my experience:
           | 
           | "most mining algorithms targeted for CPUs require certain
           | amount of L3 cache per thread (core), usually 1-4MB, so just
           | divide your total amount of CPU L3 cache by this number and
           | the result is how many threads can you run max on your cpu.
           | For example if an algorithm requires 2MB of cache per thread
           | and you have a 10-core 16MB L3 cache cpu, you can run at most
           | 16/2=8 threads, 9 or 10 threads will result in worse
           | performance as cores will be kicking out each other's data
           | from the cache. " https://www.reddit.com/r/MoneroMining/comme
           | nts/jurv6j/proces...
           | 
            | Below are two real-world examples of Monero mining that I
            | run myself.
           | 
            | One example is my i9-10850K: exact same performance using 8
            | cores or 10 cores (16 threads or 20 threads); in fact
            | performance goes down slightly at 20 threads. Given that it
            | has 20MB of cache, it is an example of the bottom limit not
            | being optimal. Using this example I'd say the minimum is
            | around 1.12MB per thread, or 2.25MB per core.
           | 
            | Another machine I have, the 5950X, crunches away all day
            | and night using all threads (32) on all cores (16) with
            | 64MB of cache, no problem; this correlates to 4MB per core
            | / 2MB per thread. It seems like more than it needs, because
            | I can use the machine all day for daily tasks with no
            | hiccups while mining flat out. If you have a desktop at
            | work and need to hammer it all day long, the 5950X will
            | absolutely take anything and everything you throw at it.
            | Samsung B-die RAM helps Monero mining as well; I run only
            | 32GB, but in 4x 8GB B-die sticks.
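            | 
            | That rule of thumb in a few lines of Python (the 2MB-per-
            | thread figure is the commonly cited RandomX scratchpad
            | size, so treat it as an assumption; it's a bit more
            | conservative than my 10850K numbers above):
            | 
            |     def max_threads(l3_mb, threads, mb_per_thread=2.0):
            |         # cache-limited thread count for RandomX mining
            |         return min(threads, int(l3_mb // mb_per_thread))
            |     print(max_threads(20, 20))    # i9-10850K: rule says 10
            |     print(max_threads(64, 32))    # 5950X: 32, core-bound
            |     print(max_threads(256, 128))  # EPYC 7C13: 128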
        
             | nsbk wrote:
             | Is Monero CPU mining profitable nowadays or does it need
             | free energy to be?
             | 
             | Edit: I have a spare 5950x collecting dust
        
               | bethekind wrote:
                | Profitable, yes, but pennies per day. A 5950X should
                | make ~$5/month after all is said and done, IIRC.
               | 
               | If you live in cold climates where space heaters are
               | used, it makes sense, as you would've been heating the
               | house anyways.
        
           | pclmulqdq wrote:
           | I think ~2 MB (a bit more) is that ratio, at least as
           | designed. It's possible that you can hyperthread Monero with
           | 4-5 MB of cache per core.
        
       | mise_en_place wrote:
        | It's good, but I have a feeling anything you find on the
        | aftermarket will be used and abused. These chips are designed to
        | handle high thermal load, but if a chip has been in a DC or
        | server room, that may impact its longevity.
        
         | wmf wrote:
         | "Abused" chips are mostly a myth and Milan is not that old so
         | these chips should have plenty of life left in them.
        
           | londons_explore wrote:
            | Agreed - with CPUs, if it's working the day you buy it, it
           | will most likely still be working in 10 years with typical
           | desktop use cases, no matter the past life it had.
        
             | paulmd wrote:
              | That's not true at all: XMP is very much within the
              | wheelhouse of "typical desktop use-cases" and can
              | absolutely damage a CPU through electromigration within a
              | matter of years.
             | 
             | (or rather, the overclocked/out-of-spec memory controller
             | usually requires the board to kick up the CPU memory
             | controller (VCCSA/VSOC) voltages, and that's what does the
             | damage.)
             | 
             | https://youtu.be/HLNk0NNQQ8s?t=510
             | 
             | https://www.youtube.com/watch?v=uMHUz16MuYA
             | 
             | People have generally convinced themselves that it's safe
             | but, the rate of CPU failures is _incredibly high_ among
             | enthusiasts compared to the general enterprise fleet and
              | the reason is XMP. This has been "out there" for a long
             | time if you know to look for it. But, enthusiasts fall into
             | that classic "can't make a man understand when his salary
             | depends on not understanding it" thing - everyone has every
             | reason to convince themselves it doesn't, because it would
             | affect "their lifesyle".
             | 
             | But electromigration exists. Electromigration affects parts
             | on consumer-relevant timescales, if you overclock.
             | Electromigration particularly affects memory
             | controllers/system fabric nowadays. And yes, you can
             | absolutely burn out a memory controller with just XMP (and
             | the aggressive CPU voltages it applies) and this is not new
             | or a secret. And the problem of electromigration/lifespan
             | is accelerating as the operating range becomes narrower on
             | newer nodes etc.
             | 
             | https://semiengineering.com/aging-problems-at-5nm-and-
             | below/
             | 
             | https://semiengineering.com/3d-ic-reliability-degrades-
             | with-...
             | 
             | https://semiengineering.com/on-chip-power-distribution-
             | model...
             | 
             | Similarly: "24/7 safe" fabric overclocks are really not.
             | Not on the order of years. Everyone is already incentivized
             | to push the "official" limit as much as is safe/reliable -
             | AMD/Intel know about the impact on benchmark scores too,
             | they want their parts to look as good as they can. There is
             | no "safe" increase above the official spec, not really.
             | 
             | The unique thing about Asus wasn't that they killed a chip
             | from XMP - it's that they put _so much_ voltage into it
             | that it went into immediate runaway and popped instantly,
              | explosively, and visibly. And it's not surprising it was
             | Asus (those giant memory QVLs come from just throwing
             | voltage at the problem) but low-key everyone has been
             | applying _at least some_ additional voltage for a long
              | time. Eventually it kills chips. It's overclocking/out-of-
             | spec and very deliberately and specifically excluded from
             | the warranty (AMD GD-106/GD-112).
             | 
             | It's completely understandable why AMD wants to make some
             | fuses/degradation canary cells to monitor whether the CPU
             | has operated out-of-spec as far as warranty coverage. This
             | is a serious failure mode that probably causes a large % of
             | overall/total "CPU premature failure" warranty returns etc.
             | And essentially it continues to get worse on every new node
             | and with every new DDR standard, and with the increased
             | thermals that currently are characteristic of stacked
             | solutions etc.
             | 
             | https://www.amd.com/en/legal/claims/gaming-details.html
             | 
             | https://www.extremetech.com/computing/amds-new-
             | threadripper-...
        
               | wmf wrote:
               | Fortunately servers don't have XMP.
        
               | paulmd wrote:
              | True, I am just pushing back on the idea that "abusing a
               | chip is mostly a myth" and "if a CPU is working on the
               | day you buy it, it's fine for desktop use-cases". For
               | server parts that can't be OC'd - true, I guess. For
               | regular CPUs? Absolutely not true, enthusiasts abuse the
               | shit out of them and even if you do no further damage
               | yourself, the degradation can continue over time etc as
               | parts of the circuit just become critically unstable from
               | small routine usage etc.
               | 
               | (people treat their CPUs like gamer piss-jugs, big
               | deferred cost tomorrow for a small benefit today.)
               | 
               | But yes - ironically this means surplus server CPUs are
               | actually way more reliable than used enthusiast CPUs. In
               | some cases they are drop-in compatible in the consumer
               | platform (although not so much in newer stuff), and the
               | server stuff got the better bins in the first place, and
               | it's cheaper (because they sold a lot more units), and
               | also hasn't been abused by an enthusiast for 5 years etc.
               | If you are on a platform like Z97 or X99 that supports
               | the Xeon chips, the server chips are a complete no-
               | brainer.
               | 
               | And some xeons are even multiplier unlocked etc - used to
               | be a thing, back in the day.
               | 
               | ("server bins are binned for leakage and don't compete
               | with gaming cpus" is another myth that is not really true
               | except for XOC binning - server CPUs are better binned
               | than enthusiast ones for ambient use-cases.)
        
               | londons_explore wrote:
               | But are there many cases of a CPU being overclocked (&
               | overheated & overvolted), then later not being
               | overclocked (and working fine), but then failing shortly
               | afterwards?
               | 
               | Yes, I understand it is theoretically possible. But I
               | think it is just super rare - I've never heard of a
               | single case.
        
           | irusensei wrote:
            | It's kind of like the myth that mining GPUs are bad, when
            | in fact miners nursed those GPUs like babies because their
            | income depended on those devices.
        
             | namibj wrote:
             | Tbf the fans may be broken on them, or at least not far
             | from being broken. I.e., plan to waterblock it.
        
             | epolanski wrote:
              | People forget there are decade-old servers out there
              | working 24/7.
              | 
              | Anyway, there's a Microsoft research paper on silicon which
              | essentially says that CPU failure rates increase with
              | mostly two factors:
              | 
              | - Cycles: the more calculations, the higher the rate of
              | failure.
              | 
              | - Temperature/power: I will let you guess this one
              | yourself. Even minor overvoltage and overclock slips can
              | increase failure rates by orders of magnitude.
             | 
              | Getting back to your comment: I would choose a GPU used
              | for mining over the years (if properly cleaned during its
              | lifespan, which is far from a given) rather than one used
              | by some kid benchmarking and overclocking, any day. Years
              | of crunching calculations do very little damage compared
              | to a kid trying to find the overclock limits for a few
              | days. Most mining GPUs were run undervolted and
              | underclocked (especially as Ethereum mining was memory-
              | rather than core-intensive).
        
         | usefulcat wrote:
         | I'd much rather have something that came from a server room.
         | Lots of cool, dust-free air--far better than a machine that's
         | been sitting under someone's desk, clogged with dust and
          | exposed to who knows what temperatures.
        
       | jmole wrote:
       | I recently picked up a 64-core AMD EPYC Genoa QS (eng sample) on
       | ebay for $1600, and have been very pleased with the performance.
        
         | mhuffman wrote:
          | Agreed! I have an AMD EPYC 7702P in an ASRock motherboard and
          | have been very pleased with its performance in a homelab.
        
       | naked-ferret wrote:
       | What app are they using in this screenshot?
       | https://www.servethehome.com/amd-epyc-7c13-is-a-surprisingly...
       | 
       | Looks very neat!
        
         | qwertox wrote:
         | s-tui: https://github.com/amanusk/s-tui
        
       | MenhirMike wrote:
       | Still rocking my EPYC 7282 in my home server, which really sits
        | in a sweet spot: 16 Cores, about $700, 120W TDP (because of the
       | reduced memory bandwidth).
       | 
       | Looks like the 7303 fills that same niche in the Milan generation
        | (and should be compatible with any Rome mainboard, possibly after
       | a BIOS update), or if you're building a new system you can get
       | the 32-Core Siena 8324PN for about 130W TDP.
       | 
       | (While it may be silly to look at TDP for a server CPU, it does
       | matter for home servers if you want a regular PC Chassis and not
       | a 1U/2U case with a 12W Delta cooling fan that is audible three
        | cities over. In fact, you can get the 8-Core 80W 8024PN and still
       | get all those nice PCIe lanes to connect NVMe SSDs to, and of
       | course ECC RAM that doesn't require hunting down a specific
       | motherboard and hoping for the best.)
        
         | semi-extrinsic wrote:
         | With these constraints, what is the benefit of Epyc over
         | Threadripper? I've been running a 3970x in my workstation for
         | several years now. Sure it's about twice the TDP, but with
          | water cooling it stays quite silent even at full load.
        
           | MenhirMike wrote:
           | I wanted remote management (IPMI), which none of the
           | Threadripper boards offered. I went with the ASRock Rack
           | ROMED8-2T, which also has 2x 10G Ethernet on board, which was
           | another nice thing I didn't have to sacrifice a PCIe slot
           | for. It does require a Tower Case with space for fans on top
           | though, because the CPU slot is rotated 90 Degrees compared
           | to Threadripper boards, so the airflow is different.
           | 
            | The EPYC CPU was also quite a bit cheaper than the then-
           | equivalent Threadripper 2950X (though the mainboard being
           | $600 made up for that). This is even more true today because
           | AMD really jacked up the prices for Threadripper to the point
           | that EPYC is actually a good budget alternative. I guess that
           | making 16 Core Ryzen made low-end Threadrippers less
           | attractive, but it's the PCIe slots that were so great about
           | those!
           | 
            | Also, I do believe that it was much easier to find 64 GB
            | RDIMMs, whereas 64 GB ECC UDIMMs were not available or were
            | much more expensive -- though my memory (ha!) is hazy on
            | that; I just remember it being a PITA.
           | 
           | So that EPYC system was just much more compelling.
        
             | paulmd wrote:
             | ROMED8-2T is one of the all-star boards of the modern era
             | imo. Like that's literally "ATX-maxxed" in the Panamax
             | sense - you can't go bigger than that in a traditional ATX
             | layout, and there is no point to having a bigger CPU (even
             | if you do not use all the pins) because it starts to eat up
             | the space for the Other Stuff. It's a local optimum in
             | board design.
             | 
             | EEB/EE-ATX can push things a little farther (like
             | GENOAD8X-2T) but you can't pull any more PCIe slots off, so
             | it has to be MCIO/oculink instead. And imo this is the
             | absolute limit of what can be done with single-socket Epyc.
             | 
             | And you can't really get more than 8 memory slots without
             | moving the CPU over to the other side of the board, like
             | MZ32-AR0 or MZ33-AR0, which means it overhangs the PCIe
              | slots etc. IIRC you can _sorta_ do 16-dimm SP3 if you
              | don't do OCP 2.0 (gigabyte or asus might have some of
              | these iirc) and you drop to like 5 pcie slots or
              | something. But it's really hard to get 2DPC on epyc at
              | all, the layouts get very janky very quickly.
             | 
             | You can fit more RAM slots into EEB/EE-ATX with a smaller
             | socket (dual 2011-3 with 3DPC goes up to 24 slots in EE-
             | ATX) but 2DPC is as big as you can go with epyc in a
             | commodity form-factor. In SP5 this gets fully silly,
             | MZ33-AR0 is an example of 2DPC 12-channel SP5, and it's
             | like, oops all memory slots, _even with EEB and completely
             | overlapping every single pcie slot_.
             | 
             | And of course dual-socket epyc gets very cramped even on
             | EEB/EE-ATX even with only 8 slots per socket (MZ72-HB0).
             | You just are throwing away a tremendous amount of board
             | space and you lose pcie, MCIO, everything. SP3 is already a
             | honkin big socket let alone SP5, let alone two SP3, let
             | alone two SP5 (me when I see a honkin pair), etc... they
             | are big enough that you have to make Tough Choices about
             | what parts of the platform you are going to exploit, or
             | accept a non-"standard" form factor (it's not standard for
             | anyone except home users/beige boxes). Servers don't use
             | EEB/EE-ATX form factors anymore, because it just isn't the
             | right shape for these platforms. And you need to be pulling
             | a significant amount of the IO off in high-density
             | formfactors (MCIO, Oculink, SlimSAS, ...) already, and your
             | case ecosystem needs to support that riser-based paradigm,
             | etc. ATX is dying and enthusiasts are not even close to
             | being ready for the ground to shift underneath them like
             | this.
             | 
             | There's still good AM4, AM5, and LGA1700 server boards
             | (with ECC) btw - check out AM5D5ID-2T, X570D4I-2T, X470D4U,
             | W680 ACE IPMI, W680D4U-2L2T/G5, X11SAE-M, X11SAE-F,
             | IMB-X1231, IMB-X1314, X300TM-ITX, etc. And Asrock Rack and
             | Supermicro do make threadripper boards too, although I
             | think they're not viable since threadripper is leaning
             | farther and farther into the OEM market and it just doesn't
             | make cost sense unless you really need the clocks. It's not
             | like the X99 days where HEDT was just "better platform for
             | enthusiasts", there is a big penalty to choosing HEDT right
             | now if you don't need it.
             | 
             | Unregistered DDR4 tops out at 32GB per stick (UDIMM or
             | SODIMM), registered can go larger. DDR5 unregistered will
             | go larger, and actually a few 48GB sticks do exist already,
             | but generally you can't use all four slots without a
             | massive hit to clocks (current LGA1700/AM5 drop to 3600
             | MT/s) so consumers/prosumers have to consider that one
             | carefully.
             | 
             | (this generally means that drop-in upgrades are not viable
             | for DDR5 memory btw - 4-stick configs suck, you should plan
             | on just buying 2 new sticks when you need more. And the
             | slots on the mobo are worse than useless, since the empty
             | slots worsen the signal integrity compared to 2-slot
             | configurations without the extra parasitics...)
        
               | MenhirMike wrote:
               | I agree, the ROMED8-2T has everything I want and
               | compromises almost nothing. One of the PCI Express slots
               | is shared with one of the on-board M.2 slots, SATA, and
               | Oculink, but even then, you get to choose: Run the slot
               | at x16 and turn off M2/Sata/Oculink? Run the Slot in x8
               | and get M2/Sata but lose Oculink? Or disable the slot and
               | get M2/Sata/Oculink? I think that's a great compromise (I
               | run the slot at x8 and use it for a Fibre Channel card to
               | my backup tape drive). Lovely block diagram in the manual
               | as well.
               | 
                | Plenty of fan headers as well, and using SFF-8643
                | connectors for the SATA ports makes so much sense (though
                | it's an extra cost for the cables). They even put in a
                | power header for when you run too many high-powered PCIe
                | cards (since PCIe AFAIK allows pulling up to 75W from the
                | slot).
               | 
               | They really put every feature that makes sense onto that
               | board, and yeah, if you want Dual CPUs or 16 DIMM Slots,
               | chances are that a proper vendor server is more what you
               | want.
               | 
               | I can't think of anything that I don't like about the
               | board. Well, I wish the built-in Ethernet ports weren't
               | RJ45 but SFP+, but that's really the only thing I wish to
               | change.
        
         | z8 wrote:
         | It should be noted that non-vendorlocked 7282s can be had for
         | as little as 80 bucks on eBay. Bought one just a few weeks ago.
         | Lovely piece of silicon.
        
           | gigatexal wrote:
           | What issues come with the vendor locking?
        
             | MenhirMike wrote:
             | Only works on the original motherboard (or maybe only
             | motherboards made by the specific vendor the CPU was locked
             | to) - so if you buy a used vendor-locked CPU, there's a
             | risk it's basically just a nice looking paperweight. Serve
             | The Home has a pretty good video:
             | https://www.youtube.com/watch?v=kNVuTAVYxpM
        
       | tiffanyh wrote:
       | Last Generation
       | 
        | Am I mistaken, or isn't this AMD's previous-generation server
        | processor?
       | 
       | The current generation is 7xx4 / 9xx4.
       | 
        | Which means it shouldn't be surprising that it's cheaper.
        
         | wmf wrote:
         | Yes, it's the previous generation. Homelabbers mostly buy older
         | used equipment at deep discounts.
        
       | dheera wrote:
       | This naming is confusing. Is 7C13 > 7950X? Why can't companies
       | stick to simple conventions of "higher numbers are better" ...
       | 
       | Even NVIDIA ... A800 > A100 > A10 but A6000 < A100
        
         | Osiris wrote:
         | Completely different platform. The Ryzen 7950X is a consumer
         | CPU. The 7C13 is a server CPU and follows a separate naming
         | convention.
         | 
         | It shouldn't be confusing because you really wouldn't be
         | comparing them to each other.
        
       ___________________________________________________________________
       (page generated 2024-03-27 23:01 UTC)