[HN Gopher] Jam 80 Cores, 768GB of RAM into E-ATX Case with This...
___________________________________________________________________
Jam 80 Cores, 768GB of RAM into E-ATX Case with This Tiny Board
Author : rbanffy
Score : 65 points
Date : 2021-09-17 12:14 UTC (10 hours ago)
(HTM) web link (www.tomshardware.com)
(TXT) w3m dump (www.tomshardware.com)
| bullen wrote:
| I find it fascinating that companies still do not mention the
| Gflops/watt/$ for their products.
|
| I mean, I understand why - the Raspberry Pi 4 melts all the
| competition into the ground on that metric - but my concern is why
| nobody is asking.
|
| If you need atomic parallelism you should be fine with a 25W 8-core
| Atom machine that you can passively cool.
|
| Also, it would be interesting to see how this performs in an atomic
| parallel scenario. My guess is my HTTP server would not perform so
| well, because the selector thread would not be able to service 79
| other cores, but I might be wrong about that.
|
| I'm pretty sure the RAM will throttle the 80 cores if they work
| on a joint problem though!
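|
| (On the selector-thread point: one way around a single accept/
| selector thread is to give every core its own listening socket and
| accept loop. A minimal, untested sketch, assuming Linux's
| SO_REUSEPORT - the port, worker count, and canned response are
| purely illustrative:)
|
|       #include <arpa/inet.h>
|       #include <netinet/in.h>
|       #include <pthread.h>
|       #include <string.h>
|       #include <sys/socket.h>
|       #include <unistd.h>
|
|       #define NUM_WORKERS 80   /* one accept loop per core */
|       #define PORT 8080        /* illustrative */
|
|       static void *worker(void *arg)
|       {
|           (void)arg;
|           int fd = socket(AF_INET, SOCK_STREAM, 0);
|           int one = 1;
|           /* SO_REUSEPORT lets every thread bind the same port; the
|              kernel spreads incoming connections across them, so no
|              single selector thread has to feed the other cores. */
|           setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
|
|           struct sockaddr_in a;
|           memset(&a, 0, sizeof(a));
|           a.sin_family = AF_INET;
|           a.sin_addr.s_addr = htonl(INADDR_ANY);
|           a.sin_port = htons(PORT);
|           bind(fd, (struct sockaddr *)&a, sizeof(a));
|           listen(fd, 128);
|
|           for (;;) {
|               int c = accept(fd, NULL, NULL);
|               if (c < 0)
|                   continue;
|               const char *r =
|                   "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok";
|               write(c, r, strlen(r));
|               close(c);
|           }
|           return NULL;
|       }
|
|       int main(void)
|       {
|           pthread_t t[NUM_WORKERS];
|           for (int i = 0; i < NUM_WORKERS; i++)
|               pthread_create(&t[i], NULL, worker, NULL);
|           for (int i = 0; i < NUM_WORKERS; i++)
|               pthread_join(t[i], NULL);
|       }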
| srcmap wrote:
| Yep.
|
| Someone would likely just use Docker/VMs to partition those 80
| cores into microservices handling the httpd proxy, app, and DB
| backend.
|
| Those apps/VMs could easily be converted to apps running on Pi
| modules.
|
| 20 Pi modules likely have much better DDR, SSD, and network
| bandwidth. They'd probably scale from 2 Pis to 200 Pis as easily
| as a typical VM setup - and they come with GPUs for free for
| those who need them.
| dragontamer wrote:
| > 20 Pi modules likely have much better DDR, SSD, and network
| bandwidth
|
| 20 Pi modules of 8GB each is only 160GB.
|
| Ignoring that: bigger nodes are better in all practical
| scenarios. With 768GB of RAM, this single big server can likely
| keep a large collection of information in memory (e.g. all of
| English Wikipedia likely fits inside that RAM).
|
| 20x Rasp. Pi cannot access all of English-Wikipedia in RAM.
| This means that you can't index, you can't search, you can't
| analyze the pages. Even if you could: you'd need to have a
| collaborative external memory model, which is not easy to
| program.
|
| 80 cores with access to all 768GB can do many, many more
| things than 20x 4-core Rasp. Pis each working on only 8GB at a
| time.
| wmf wrote:
| If you care about GFLOPS/Watt/$ then GPUs beat everything. Real
| requirements are usually more subtle, including things like
| compatibility with existing software, latency (single-thread
| performance), and management overhead per node.
| joe_the_user wrote:
| That is my impression too; it would still be good to have an
| at least partly objective evaluation of all the parallel
| compute devices that one can see appearing.
|
| Also, to have a number of benchmarks that involve memory,
| caching, etc.
| bch wrote:
| I'm somewhat ignorant here, and indeed also niche (I'm a
| NetBSD user), but:
|
| > GPUs beat everything
|
| that's workload dependent, and also subject to support, no?
| Doesn't GPGPU basically mean CUDA currently, and therefore
| beholden to Nvidia support for your hardware/software
| platform?
| dragontamer wrote:
| AMD GPUs are supported through their open-source ROCm builds.
| HIP (CUDA-clone) works with Vega and CDNA.
|
| It seems like RDNA (5xxx and 6xxx cards) is less supported,
| but reports are that OpenCL kinda-sorta works (RDNA cards have
| a very different architecture than Vega / CDNA).
| UncleEntity wrote:
| Nah, I was playing around with an AMD integrated GPU a
| while ago using OpenCL (I believe).
| dragontamer wrote:
| > I find it fascinating that companies still do not mention the
| Gflops/watt/$ for their products?
|
| GPUs win in TFlops (like 20 or 40 TFlops these days), and
| TFlops/watt (A modest 20 TFlop GPU these days would be under
| 400W).
|
| GPUs also have 1TBps memory bandwidth thanks to HBM2 (on the
| high end), or at least 500GBps (thanks to GDDR6 on the low
| end).
|
| Any serious compute problem with "obvious" parallelism will run
| on a GPU these days. CPUs are for sequential problems... (and
| running many sequential problems in parallel: which GPUs are
| kinda bad at actually. No GPU would ever be able to serve web-
| pages like a CPU: branch divergence is just too high)
| belltaco wrote:
| What is the size comparison to a system with x86 cores and
| similar RAM etc? Epyc has 64 cores right?
| wmf wrote:
| Yeah, to get 80 x86 cores you'd need dual socket but that may
| also fit in EATX.
|
| N1 cores are weaker, so you should really compare 80 threads vs.
| 80 threads, which would be a single-socket Epyc or Xeon that fits
| on the same size board or smaller.
| JoshTriplett wrote:
| Right. You can get 56 cores and 112 threads in one socket, and in
| terms of performance that's going to be much closer to 112x than
| 56x: https://ark.intel.com/content/www/us/en/ark/products/194
| 146/...
| dragontamer wrote:
| EPYC is up to 64-core/128-threads per socket. A dual-EPYC will
| get you 128-cores / 256-threads.
|
| > similar RAM
|
| IIRC, EPYC supports 2TB of RAM using LRDIMMs. Maybe 4TB now, but
| you need many, many DIMMs for that: like 16 DIMMs or something
| on a dual-socket EPYC.
| zamadatix wrote:
| 128 core/256 thread, 2 TB RAM:
| https://www.newegg.com/p/296-0002-003A3
|
| 64 core/128 thread micro ATX: https://www.newegg.com/asrock-
| rack-romed6u-2l2t-amd-epyc-700...
|
| Bonus board just because it's badass:
| https://www.newegg.com/asus-pro-ws-wrx80e-sage-se-wifi/p/N82...
| wolrah wrote:
| > Jam 80 Cores, 768GB of RAM into E-ATX Case with This Tiny Board
|
| The headline sounds like a brag, but isn't exactly impressive.
|
| There have been multiple EATX Dual EPYC motherboards on sale for
| some time now. The limited board area means only one DIMM per
| channel, but with eight channels and 128GB DIMMs that still means
| you could have 128 cores and 2TB of RAM in a single EATX system
| with "ordinary" AMD hardware.
| Pet_Ant wrote:
| So back in 2017 Sun had the SPARC T8-4 with 4TB of RAM and 4x
| SPARC M8 CPUs (each one 32-core, 256-thread) in a 5U enclosure,
| which isn't that much bigger than my current computer case, so I'm
| not sure what is so exciting here?
|
| https://en.wikipedia.org/wiki/SPARC_T_series#SPARC_M8
| wmf wrote:
| HN doesn't know anything about hardware so everything is
| exciting.
| dreen wrote:
| I've been considering a small ATX gaming build, but I'm really
| worried it will heat up like crazy after a while, even if it
| doesn't have 80 cores. Am I completely wrong?
| dragontamer wrote:
| I've got an AMD Zen Threadripper (1st generation) with 180W of
| TDP.
|
| If I'm doing something computationally expensive, I can feel
| the room heat up for sure: just 2 degrees F in practice (~1C or
| so). Enough to notice + enough to see on the thermometer I keep
| in my room... but not enough to be worried about anything.
|
| And honestly, I don't spin up all 16c/32 threads that often.
| kiran-rao wrote:
| The site claims:
|
| > The Ampere Altra Q80-28 SoC with 80 cores runs at 2.80 GHz
| and consumes around 175 Watts.
|
| 175W is nothing to scoff at, but it also isn't completely
| ridiculous. GPUs will end up consuming more than that in a
| typical ATX gaming machine.
| Arrath wrote:
| _If_ the case has the headroom to fit a decent cooler, a
| smaller case sounds almost better, as one intake and one
| exhaust fan would be more than sufficient to continually
| exchange the volume of air within the case?
| dreen wrote:
| Yeah, that makes sense, but since I'm talking about a gaming
| build I would think there's not much space left after you put
| in the GPU, which is the main heat source, thus making airflow
| difficult.
| Arrath wrote:
| That concern came to mind as I was typing my comment; it
| doesn't take too many other bits in the case to really
| compromise airflow. Again, assuming it would even have space
| for the yacht that modern GPUs have become.
| favorited wrote:
| There are really neat ITX case layouts that use PCIE
| riser cables to mount the GPU in a separate chamber, or
| behind the motherboard
| (https://i.ytimg.com/vi/v4dtjsJEFQw/maxresdefault.jpg).
|
| Even ITX cases without riser cables end up positioning
| the GPU right at the bottom of the case, so it gets fresh
| air (https://cdna.pcpartpicker.com/static/forever/images/
| userbuil...)
| Arrath wrote:
| Neat! Thanks!
|
| Though my inner troubleshooter is thinking "oh man more
| connections to re-seat"
| favorited wrote:
| Haha you're not wrong! It's another potentially
| incompatible part, especially now as PCIE generations
| change. People will upgrade their GPU and suddenly won't
| have any video output because it's a gen. 4 motherboard
| talking to a gen. 4 GPU over a gen. 3 riser cable.
|
| That said, you really only run into that kind of stuff
| when you're entering hobbyist mode. If what someone wants
| is a compact gaming PC, there are cases like the
| CoolerMaster NR200 (https://i.redd.it/fq9y7mevznb71.jpg)
| which are affordable and just as easy to build in as any
| mid-tower. The only difference is you'll be using a mini
| ITX motherboard, and an SFX power supply.
| favorited wrote:
| I've built a few small form factor gaming PCs over the last 5
| or so years. The heat produced is directly proportional to the
| power consumed.
|
| If you're worried about your components overheating in a small
| case, there are absolutely ways to cram a ton of performance
| into a small package. You can fit top of the line gaming
| hardware in as little as 10-15 liters of case volume.
|
| If you're worried about your PC becoming a space heater, you'll
| need to go with less power-hungry components, but you can still
| absolutely build a tiny and capable gaming PC.
| dragontamer wrote:
| I'm curious why the Neoverse N1/N2 designers left SMT off the
| table.
|
| I'd assume that any workload that benefits from 80 cores would
| benefit from 160-threads (on those 80 cores). Apple's decision to
| avoid SMT on M1 kinda-sorta makes sense, from the perspective
| that phones probably don't have throughput-sensitive workloads
| like servers.
|
| But if databases / other systems with lots of I/O or RAM-heavy
| wait times start coming up, surely SMT would easily improve
| performance without much costs in area or power?
|
| ---------
|
| It seems like the lower-power E1 core (Efficiency core) has SMT.
| So the ARM / Neoverse team has the experience to bring SMT should
| they desire it. This suggests that there's some design reason
| they left SMT off the table.
|
| The N1/N2 cores are more "general purpose", so I'd assume that
| they'd see more workloads than E1. If E1 benefits from SMT, why
| not N1/N2?
| SkyMarshal wrote:
| Could it have anything to do with the Portsmash attack against
| hyperthreaded systems?
|
| I'm not current on this one, but I recall when it first came
| out that disabling hyperthreading was the only solution.
|
| Has it been solved yet, or are some chipmakers avoiding
| enabling hyperthreading now as a result?
|
| https://nvd.nist.gov/vuln/detail/CVE-2018-5407
| dragontamer wrote:
| I doubt it. Otherwise, they would have disabled SMT /
| Hyperthreading on E1 cores.
|
| That's the confusing thing: they have the tech on one core,
| but not the other.
| twoodfin wrote:
| Perhaps the rationale is something like, "If your workload is
| memory-bandwidth-bound on 80 threads, this system is not for
| you."
|
| Seems like its ideal workload is lots of compute on a cache-
| friendly quantity of data.
| dragontamer wrote:
| The opposite.
|
| If you hit a memory-bandwidth bound on 80 threads, there's no
| point going up to 160 threads.
|
| In most situations, I expect code to be memory-latency bound
| on a single thread. (Ex: node = node->next style traversals
| are quite common, and you cannot progress until the memory
| has responded). This is exceptionally common in interpreted
| code (Java, Javascript, PHP, Python), especially OOP-code.
|
| So your 80 cores are sitting there waiting on RAM latency.
| Wouldn't it be nice if they could execute 80 other threads in
| parallel while waiting? This converts a RAM-latency problem
| into a RAM-bandwidth problem.
|
| ------------
|
| ONLY problems that are memory-bandwidth bound on 80 threads
| will benefit from this architecture.
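|
| (A minimal sketch of that latency-bound pattern - a linked-list
| walk where each load depends on the previous one; node_t and the
| list itself are made up for illustration:)
|
|       #include <stddef.h>
|
|       typedef struct node {
|           struct node *next;
|           long payload;
|       } node_t;
|
|       long sum_list(const node_t *head)
|       {
|           long sum = 0;
|           /* Each iteration's load address comes from the previous
|              load, so out-of-order execution cannot overlap the
|              cache misses; the core mostly sits waiting on DRAM.
|              An SMT sibling thread could use the idle pipeline
|              during those stalls. */
|           for (const node_t *n = head; n != NULL; n = n->next)
|               sum += n->payload;
|           return sum;
|       }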
| rbanffy wrote:
| One of the reasons for SMT is to keep a wide backend occupied.
| If a single thread plus a reorder buffer and speculative
| execution can keep the backend busy, there's no reason to have
| more threads.
|
| OTOH, if your reorder buffer can't keep your backend full,
| adding threads may be cheaper in terms of silicon area.
| dragontamer wrote:
| Yes, your comment makes sense but...
|
| The Neoverse E1 is a 2-wide decode core with SMT (wtf??).
|
| The Neoverse N1 is a 4-wide decode core (1 thread per core).
|
| -------
|
| You're right in that the N1 is narrower than Skylake / Zen.
| But N1 isn't too shabby: it has 8 execution pipelines: 1
| branch, 3x 64-bit integer, 2x 128-bit vector, and 2
| load/store.
|
| Furthermore: the core that ARM decided to shove their SMT
| effort into is the E1, which is probably half the size of an
| N1 (well, at least half-width decode).
| my123 wrote:
| For E1, it's because of less aggressive OoO for power
| saving, and a very aggressive power target of 180mW at
| 2.5GHz on 7nm.
|
| It's within 1.5x of Cortex-A55 perf on single threaded
| workloads, and with efficiency as the mantra, SMT was worth
| it there. (but we'll see what happens in future designs...
| there's both the in-order Cortex-A5xx and the OoO
| Cortex-A6x which is Neoverse E for the power efficient
| role)
| bingohbangoh wrote:
| Where in the world do you buy these things?!
|
| I keep seeing talk about Ampere. Can I buy from them, or do I
| need to be a business of some sort and speak with a rep?
|
| nvm: they're not being sold quite yet.
| wmf wrote:
| https://store.avantek.co.uk/ampere-altra-64bit-arm-workstati...
|
| https://store.avantek.co.uk/ampere-altra-1u-server-r152-p30....
| bingohbangoh wrote:
| many thanks
| LeifCarrotson wrote:
| I'm curious - what are people running on boxes like these that
| makes good use of 80 cores and 768 GB of RAM?
| uncertainrhymes wrote:
| Fastly has published their server specs:
|
| 2 Intel(R) Xeon(R) CPU E5-2690 @ 2.90GHz
|
| 768 GB of RAM (384 GB per Processor)
|
| 18 TB of SSD Storage (Intel 3 Series or Samsung 840 Pro
| Enterprise Series)
| jenny91 wrote:
| I was just looking for that as an example, remembered they
| had something akin to this!
| qsort wrote:
| Microsoft Teams?
| wizzwizz4 wrote:
| > good use
| glitchc wrote:
| Close the thread, you win.
| lunatuna wrote:
| This is just hitting the market so it will be interesting to
| see where it goes.
|
| If the vendors were ready for something like this on the
| software side, this would be great for edge compute when low
| latency response is required - remote utility substation
| handling and reacting to a large array of sensors feeding at 60
| data points per second. In some use cases going to the control
| centre and back would be too slow to benefit. Basic grid
| control is well handled, but I could see optimizations
| benefiting from this. Vendors and utilities are way behind on
| this though.
| vmception wrote:
| _moves goal post of the term good_
| [deleted]
| freedomben wrote:
| I would use it for CI runners.
| gjsman-1000 wrote:
| Oracle Cloud provides 4 Altra Cores, 24GB of RAM, and 200GB of
| storage for free (supposedly indefinitely). I use it for a
| Minecraft server. Handles ~15 players with a half-dozen plugins
| without players complaining about any lag. I only use 4GB of
| the RAM because of Java's Garbage Collector - and Minecraft is
| _heavily_ single-threaded, so I'm probably not using all cores
| very effectively, but it's free and works.
| whazor wrote:
| With four 10GbE ports, so many cores, and so much memory, I can
| imagine this is perfect for web hosting or virtual machines.
| junon wrote:
| A single electron application. Pick any of them.
| goldenkey wrote:
| Hacker News: Desktop Edition. Keeps an open tab for every
| submission favorited. ;-)
| vgeek wrote:
| Do you think it will run Doom?
| penagwin wrote:
| one instance of doom compiled with wasm on electron, but
| only if you lower the resolution to get it to run smooth
| nottorp wrote:
| Isn't that single threaded?
| philliphaydon wrote:
| Sure but these days you need to run many of them for
| different apps :D
| junon wrote:
| Nope! Two threads at least are used.
| binarymax wrote:
| Low-latency information retrieval for heavy-read/low-write
| content sets.
|
| An example that might solidify the idea: pack a Wikipedia
| snapshot into it for search, and serve ~1m queries per second
| on it (12k qps per core).
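|
| A rough sketch of that read-heavy shape, with every thread
| querying the same immutable in-memory index and no locks on the
| read path (the sorted-array "index" and all the numbers are
| stand-ins, not a real search engine):
|
|       #include <pthread.h>
|       #include <stdio.h>
|
|       #define N_KEYS   1000000   /* stand-in for a real index */
|       #define N_CORES  80
|       #define QUERIES  100000    /* per thread */
|
|       static long keys[N_KEYS];  /* built once, then read-only */
|
|       static int contains(long k) /* plain binary search */
|       {
|           size_t lo = 0, hi = N_KEYS;
|           while (lo < hi) {
|               size_t mid = lo + (hi - lo) / 2;
|               if (keys[mid] < k) lo = mid + 1; else hi = mid;
|           }
|           return lo < N_KEYS && keys[lo] == k;
|       }
|
|       static void *reader(void *arg)
|       {
|           unsigned long x = (unsigned long)(size_t)arg; /* seed */
|           long hits = 0;
|           /* No locking: the index never changes after startup. */
|           for (int i = 0; i < QUERIES; i++) {
|               x ^= x << 13; x ^= x >> 7; x ^= x << 17;
|               hits += contains((long)(x % (2 * N_KEYS)));
|           }
|           return (void *)hits;
|       }
|
|       int main(void)
|       {
|           for (long i = 0; i < N_KEYS; i++) /* "load the snapshot" */
|               keys[i] = 2 * i;
|           pthread_t t[N_CORES];
|           for (long i = 0; i < N_CORES; i++)
|               pthread_create(&t[i], NULL, reader, (void *)(i + 1));
|           long total = 0;
|           for (int i = 0; i < N_CORES; i++) {
|               void *h;
|               pthread_join(t[i], &h);
|               total += (long)h;
|           }
|           printf("hits: %ld\n", total);
|       }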
| prox wrote:
| Nice.
| cogman10 wrote:
| I'd throw some video encoding work at something like that. Easy
| enough to eat up all that RAM and all those cores.
| dragontamer wrote:
| https://chipsandcheese.com/2021/08/05/neoverse-n1-vs-
| zen-2-a...
|
| No. This core is terrible at encoding.
|
| EDIT: And encoding is limited to ~16 cores in practice. It
| seems like after that, the communication between threads gets
| to be too much to be useful. Unless you plan on doing 5
| simultaneous encodes at a time, you're gonna have to find
| something else to do with all those cores.
| cogman10 wrote:
| Most of the new encoding tools split videos by scene and
| then run parallel encodes from there, such as av1an.
|
| For a decently sized video (say a TV episode) there are
| usually like 100 split points to divvy out to encoders.
|
| https://github.com/master-of-zen/Av1an
| LeifCarrotson wrote:
| Is there a reason that 175W of processing power on 80 small
| cores at 2 GHz would be faster than, for example, an AMD EPYC
| 7F32, which has a similar TDP of 180W and 8 cores with 2
| threads each that run at ~4 GHz?
|
| Naively, assuming identical instruction sets (I know they're
| not), 16 threads at 4 GHz is less than half as good as 80 at
| 2 GHz. But that can't be the whole story.
| dragontamer wrote:
| AVX2 (256-bit SIMD instructions) is huge in the encoding
| world. A lot of these encoding algorithms operate over
| reasonable block sizes (8x8 macroblocks) that ensure that
| SIMD-instruction sets benefit greatly.
|
| ARM only has 128-bit SIMD through NEON. It's reasonably well
| designed, but nothing beats the brute force of just doing 256
| bits at a time (or 512 bits in the case of Intel's AVX-512).
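|
| For a concrete picture of where the extra width helps, here is a
| sketch of the classic motion-estimation kernel - an 8x8 sum of
| absolute differences - using AVX2. It assumes both blocks are
| stored contiguously (real encoders use strided loads), and a
| 128-bit NEON version needs roughly twice as many vector
| operations to cover the same 64 bytes:
|
|       #include <immintrin.h>
|       #include <stdint.h>
|
|       /* Requires AVX2; compile with -mavx2. */
|       static uint32_t sad_8x8(const uint8_t *a, const uint8_t *b)
|       {
|           /* Two 32-byte loads cover four 8-pixel rows each. */
|           __m256i a0 = _mm256_loadu_si256((const __m256i *)a);
|           __m256i a1 = _mm256_loadu_si256((const __m256i *)(a + 32));
|           __m256i b0 = _mm256_loadu_si256((const __m256i *)b);
|           __m256i b1 = _mm256_loadu_si256((const __m256i *)(b + 32));
|
|           /* vpsadbw gives four partial 64-bit sums per register. */
|           __m256i s = _mm256_add_epi64(_mm256_sad_epu8(a0, b0),
|                                        _mm256_sad_epu8(a1, b1));
|
|           /* Horizontal reduction of the four partial sums. */
|           __m128i lo = _mm256_castsi256_si128(s);
|           __m128i hi = _mm256_extracti128_si256(s, 1);
|           __m128i t  = _mm_add_epi64(lo, hi);
|           t = _mm_add_epi64(t, _mm_srli_si128(t, 8));
|           return (uint32_t)_mm_cvtsi128_si32(t);
|       }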
| packetslave wrote:
| In my case, Houdini simulations for my VFX hobby ("only" 64
| cores and 128GB of RAM, though).
| sleepybrett wrote:
| might be nice in a render farm?
| packetslave wrote:
| Not at 2.80 GHz it isn't.
| [deleted]
| OptionX wrote:
| Based on the spec and the form-factor I assume an oven.
| clircle wrote:
| I think the target is low power consumption server
| applications.
| m4rtink wrote:
| Minecraft servers? ;-)
| gjsman-1000 wrote:
| Oracle Cloud provides 4 Altra Cores and 24GB of RAM for free.
| I can support ~15 players with a half-dozen plugins without
| players complaining of any lag. Minecraft is very single-
| threaded though and I'm only using 4GB of the RAM because of
| Java's Garbage Collector - but it does work and it's
| supposedly free indefinitely.
| [deleted]
| baybal2 wrote:
| Cost?
| nottorp wrote:
| Speaking of which, do they have smaller, cheaper systems for
| hobbyists to play around with, or are they enterprise-only?
| wmf wrote:
| Everything from Ampere is crazy expensive (unless you count
| the Oracle Cloud free tier).
| banana_giraffe wrote:
| A 2U server with this same processor (and a case and
| power supply) is in the $10k range, so I'd expect something
| around that.
| wmf wrote:
| Probably over $5,000... for the motherboard with no CPU
| included.
| LargoLasskhyfv wrote:
| related:
|
| [0] https://www.oracle.com/news/announcement/oracle-unlocks-
| powe...
| iflp wrote:
| This is $7.20/mo, at which rate you can rent Xeon cores as well.
___________________________________________________________________
(page generated 2021-09-17 23:01 UTC)