[HN Gopher] Jam 80 Cores, 768GB of RAM into E-ATX Case with This...
       ___________________________________________________________________
        
       Jam 80 Cores, 768GB of RAM into E-ATX Case with This Tiny Board
        
       Author : rbanffy
       Score  : 65 points
       Date   : 2021-09-17 12:14 UTC (10 hours ago)
        
 (HTM) web link (www.tomshardware.com)
 (TXT) w3m dump (www.tomshardware.com)
        
       | bullen wrote:
        | I find it fascinating that companies still do not mention
        | Gflops/watt/$ for their products.
        | 
        | I mean, I understand why: the Raspberry Pi 4 grinds the
        | competition into the ground. But why nobody is asking is my
        | concern.
       | 
       | If you need atomic parallelism you should be fine with a 25W Atom
       | 8-core machine that you can passively cool.
       | 
        | It would also be interesting to see how this performs in an
        | atomic-parallel scenario. My guess is my HTTP server would not
        | perform so well, because the selector thread would not be able
        | to service the 79 other cores, but I might be wrong about
        | that.
       | 
       | I'm pretty sure the RAM will throttle the 80 cores if they work
       | on a joint problem though!
        
         | srcmap wrote:
          | Yep.
          | 
          | Someone would likely just use Docker/VMs to partition those
          | 80 cores into microservices handling the httpd proxy, the
          | app, and the DB backend.
          | 
          | Those apps/VMs could easily be converted to apps running on
          | Pi modules.
          | 
          | 20 Pi modules likely have much better aggregate DDR, SSD,
          | and network bandwidth. They'd probably scale from 2 Pis to
          | 200 Pis as easily as a typical VM setup, and they come with
          | GPUs for free for those who need them.
        
           | dragontamer wrote:
           | > 20 pi modules likely have much better DDR, SSD, Network
           | bandwidth
           | 
            | 20 Pi modules at 8GB each is only 160GB.
           | 
           | Ignoring that: Bigger nodes are better in all practical
           | scenarios. With 768 GBs of RAM, this singular big server can
           | likely keep in-memory a large collection of information (ex:
           | all of English-Wikipedia likely fits inside of that RAM).
           | 
           | 20x Rasp. Pi cannot access all of English-Wikipedia in RAM.
           | This means that you can't index, you can't search, you can't
           | analyze the pages. Even if you could: you'd need to have a
           | collaborative external memory model, which is not easy to
           | program.
           | 
           | 80-cores with access to all 768GBs can do many, many more
           | things than 20x 4-core Rasp. Pi working on only 32GB at a
           | time.
        
         | wmf wrote:
         | If you care about GFLOPS/Watt/$ then GPUs beat everything. Real
         | requirements are usually more subtle, including things like
         | compatibility with existing software, latency (single-thread
         | performance), and management overhead per node.
        
           | joe_the_user wrote:
            | That is my impression too, but it would still be good to
            | have an at least semi-objective evaluation of all the
            | parallel compute devices one sees appearing.
            | 
            | Also, to have a number of benchmarks that involve memory,
            | caching, etc.
        
           | bch wrote:
           | I'm somewhat ignorant here, and indeed also niche (I'm a
           | NetBSD user), but:
           | 
           | > GPUs beat everything
           | 
            | that's workload dependent, and also subject to support, no?
           | Doesn't GPGPU basically mean CUDA currently, and therefore
           | beholden to Nvidia support for your hardware/software
           | platform?
        
             | dragontamer wrote:
             | AMD GPUs are supported over their open-source ROCm builds.
             | HIP (CUDA-clone) works with Vega and CDNA.
             | 
             | It seems like RDNA (5xxx and 6xxx cards) are less
             | supported, but reports are that OpenCL kinda-sorta works
             | (RDNA cards have a very different architecture than Vega /
              | CDNA).
        
             | UncleEntity wrote:
              | Nah, I was playing around with an AMD integrated GPU a
              | while ago using OpenCL (I believe).
        
         | dragontamer wrote:
         | > I find it fascinating that companies still do not mention the
         | Gflops/watt/$ for their products?
         | 
         | GPUs win in TFlops (like 20 or 40 TFlops these days), and
         | TFlops/watt (A modest 20 TFlop GPU these days would be under
         | 400W).
         | 
         | GPUs also have 1TBps memory bandwidth thanks to HBM2 (on the
         | high end), or at least 500GBps (thanks to GDDR6 on the low
         | end).
         | 
         | Any serious compute problem with "obvious" parallelism will run
         | on a GPU these days. CPUs are for sequential problems... (and
         | running many sequential problems in parallel: which GPUs are
         | kinda bad at actually. No GPU would ever be able to serve web-
         | pages like a CPU: branch divergence is just too high)
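
Working out the efficiency figure implied by the comment above (assumed round numbers from the comment itself, not a benchmark):

```python
# Assumed round numbers from the comment, not measured values.
gpu_tflops = 20     # a "modest" current high-end GPU
gpu_watts = 400     # stated board power ceiling

gflops_per_watt = gpu_tflops * 1000 / gpu_watts
print(gflops_per_watt)  # 50.0 GFLOPS per watt
```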
        
       | belltaco wrote:
       | What is the size comparison to a system with x86 cores and
        | similar RAM, etc.? Epyc has 64 cores, right?
        
         | wmf wrote:
         | Yeah, to get 80 x86 cores you'd need dual socket but that may
         | also fit in EATX.
         | 
          | N1 cores are weaker, so you should really compare 80
          | threads vs. 80 threads, which would be a single-socket Epyc
          | or Xeon that fits into the same or a smaller board.
        
           | JoshTriplett wrote:
           | Right. You can get 56-core, 112-thread in one socket, and in
            | terms of performance that's going to be much closer to 112x than
           | 56x: https://ark.intel.com/content/www/us/en/ark/products/194
           | 146/...
        
         | dragontamer wrote:
         | EPYC is up to 64-core/128-threads per socket. A dual-EPYC will
         | get you 128-cores / 256-threads.
         | 
         | > similar RAM
         | 
          | IIRC, EPYC supports 2TB of RAM with LRDIMMs. Maybe 4TB now,
         | but you need many, many DIMMs for that: like 16 DIMMs or
         | something on a dual-socket EPYC.
        
         | zamadatix wrote:
          | 128 core/256 thread, 2 TB RAM:
          | https://www.newegg.com/p/296-0002-003A3
         | 
          | 64 core/128 thread micro-ATX: https://www.newegg.com/asrock-
         | rack-romed6u-2l2t-amd-epyc-700...
         | 
         | Bonus board just because it's badass:
         | https://www.newegg.com/asus-pro-ws-wrx80e-sage-se-wifi/p/N82...
        
       | wolrah wrote:
       | > Jam 80 Cores, 768GB of RAM into E-ATX Case with This Tiny Board
       | 
       | The headline sounds like a brag, but isn't exactly impressive.
       | 
       | There have been multiple EATX Dual EPYC motherboards on sale for
       | some time now. The limited board area means only one DIMM per
       | channel, but with eight channels and 128GB DIMMs that still means
       | you could have 128 cores and 2TB of RAM in a single EATX system
       | with "ordinary" AMD hardware.
        
       | Pet_Ant wrote:
        | So back in 2017 Sun had the SPARC T8-4 with 4TB of RAM and 4x
        | SPARC M8 CPUs (each one 32-core, 256-thread) in a 5U
        | enclosure, which isn't that much bigger than my current
        | computer case, so I'm not sure what is so exciting here.
       | 
       | https://en.wikipedia.org/wiki/SPARC_T_series#SPARC_M8
        
         | wmf wrote:
         | HN doesn't know anything about hardware so everything is
         | exciting.
        
       | dreen wrote:
        | I've been considering a small ATX gaming build, but I'm
        | really worried it will heat up like crazy after a while, even
        | if it doesn't have 80 cores. Am I completely wrong?
        
         | dragontamer wrote:
         | I've got an AMD Zen Threadripper (1st generation) with 180W of
         | TDP.
         | 
          | If I'm doing something computationally expensive, I can
          | feel the room heat up for sure: just 2 degrees F in practice
          | (~1C or so). Enough to notice, and enough to see on the
          | thermometer I keep in my room... but not enough to worry
          | about.
         | 
         | And honestly, I don't spin up all 16c/32 threads that often.
        
         | kiran-rao wrote:
         | The site claims:
         | 
         | > The Ampere Altra Q80-28 SoC with 80 cores runs at 2.80 GHz
         | and consumes around 175 Watts.
         | 
          | 175W is nothing to scoff at, but also isn't completely
         | ridiculous. GPUs will end up consuming more than that in a
         | typical ATX gaming machine.
        
         | Arrath wrote:
         | _If_ the case has the headroom to fit a decent cooler, a
         | smaller case sounds almost better, as one intake and one
         | exhaust fan would be more than sufficient to continually
         | exchange the volume of air within the case?
        
           | dreen wrote:
            | Yeah, that makes sense, but since I'm talking about a
            | gaming build, I would think there's not much space left
            | after you put in the GPU, which is the main heat source,
            | thus making airflow difficult.
        
             | Arrath wrote:
              | That concern came to mind as I was typing my comment;
              | it doesn't take too many other bits in the case to
              | really compromise airflow. And that's assuming it would
              | even have space for the yacht that modern GPUs have
              | become.
        
               | favorited wrote:
               | There are really neat ITX case layouts that use PCIE
               | riser cables to mount the GPU in a separate chamber, or
               | behind the motherboard
               | (https://i.ytimg.com/vi/v4dtjsJEFQw/maxresdefault.jpg).
               | 
               | Even ITX cases without riser cables end up positioning
               | the GPU right at the bottom of the case, so it gets fresh
               | air (https://cdna.pcpartpicker.com/static/forever/images/
               | userbuil...)
        
               | Arrath wrote:
               | Neat! Thanks!
               | 
               | Though my inner troubleshooter is thinking "oh man more
               | connections to re-seat"
        
               | favorited wrote:
               | Haha you're not wrong! It's another potentially
               | incompatible part, especially now as PCIE generations
               | change. People will upgrade their GPU and suddenly won't
               | have any video output because it's a gen. 4 motherboard
               | talking to a gen. 4 GPU over a gen. 3 riser cable.
               | 
               | That said, you really only run into that kind of stuff
               | when you're entering hobbyist mode. If what someone wants
               | is a compact gaming PC, there are cases like the
               | CoolerMaster NR200 (https://i.redd.it/fq9y7mevznb71.jpg)
               | which are affordable and just as easy to build in as any
               | mid-tower. The only difference is you'll be using a mini
               | ITX motherboard, and an SFX power supply.
        
         | favorited wrote:
         | I've built a few small form factor gaming PCs over the last 5
         | or so years. The heat produced is directly proportional to the
         | power consumed.
         | 
         | If you're worried about your components overheating in a small
         | case, there are absolutely ways to cram a ton of performance
         | into a small package. You can fit top of the line gaming
         | hardware in as little as 10-15 liters of case volume.
         | 
         | If you're worried about your PC becoming a space heater, you'll
         | need to go with less power-hungry components, but you can still
         | absolutely build a tiny and capable gaming PC.
        
       | dragontamer wrote:
        | I'm curious why the Neoverse N1/N2 designers left SMT off the
        | table.
       | 
       | I'd assume that any workload that benefits from 80 cores would
       | benefit from 160-threads (on those 80 cores). Apple's decision to
       | avoid SMT on M1 kinda-sorta makes sense, from the perspective
       | that phones probably don't have throughput-sensitive workloads
       | like servers.
       | 
       | But if databases / other systems with lots of I/O or RAM-heavy
       | wait times start coming up, surely SMT would easily improve
       | performance without much costs in area or power?
       | 
       | ---------
       | 
       | It seems like the lower-power E1 core (Efficiency core) has SMT.
       | So the ARM / Neoverse team has the experience to bring SMT should
       | they desire it. This suggests that there's some design reason
       | they left SMT off the table.
       | 
       | The N1/N2 cores are more "general purpose", so I'd assume that
       | they'd see more workloads than E1. If E1 benefits from SMT, why
       | not N1/N2?
        
         | SkyMarshal wrote:
         | Could it have anything to do with the Portsmash attack against
         | hyperthreaded systems?
         | 
         | I'm not current on this one, but I recall when it first came
         | out that disabling hyperthreading was the only solution.
         | 
         | Has it been solved yet, or are some chipmakers avoiding
         | enabling hyperthreading now as a result?
         | 
         | https://nvd.nist.gov/vuln/detail/CVE-2018-5407
        
           | dragontamer wrote:
           | I doubt it. Otherwise, they would have disabled SMT /
           | Hyperthreading on E1 cores.
           | 
           | That's the confusing thing: they have the tech on one core,
           | but not the other.
        
         | twoodfin wrote:
         | Perhaps the rationale is something like, "If your workload is
         | memory-bandwidth-bound on 80 threads, this system is not for
         | you."
         | 
         | Seems like its ideal workload is lots of compute on a cache-
         | friendly quantity of data.
        
           | dragontamer wrote:
           | The opposite.
           | 
           | If you hit a memory-bandwidth bound on 80 threads, there's no
           | point going up to 160 threads.
           | 
           | In most situations, I expect code to be memory-latency bound
           | on a single thread. (Ex: node = node->next style traversals
           | are quite common, and you cannot progress until the memory
            | has responded). This is exceptionally common in
            | interpreted and managed code (Java, JavaScript, PHP,
            | Python), especially OOP code.
           | 
           | So your 80 cores are sitting there waiting for RAM-latency to
           | respond. Wouldn't it be nice if they could execute 80-other-
           | threads in parallel while waiting? This converts a RAM-
           | latency problem into a RAM-bandwidth problem.
           | 
           | ------------
           | 
           | ONLY problems that are memory-bandwidth bound on 80 threads
           | will benefit from this architecture.
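
A minimal sketch of the `node = node->next` pattern described above, in Python for illustration: each iteration's load depends on the previous one, so there is nothing independent for the hardware to overlap, which is exactly the stall SMT would fill with another thread.

```python
# Illustrative only: a linked-list traversal whose next address depends
# on the previous load -- the latency-bound pattern from the comment.

class Node:
    __slots__ = ("value", "next")

    def __init__(self, value):
        self.value = value
        self.next = None

def build_list(n):
    """Build a chain of n nodes valued 0..n-1."""
    head = Node(0)
    cur = head
    for i in range(1, n):
        cur.next = Node(i)
        cur = cur.next
    return head

def traverse(head):
    """Sum the chain; each step waits on the previous pointer load."""
    total = 0
    node = head
    while node is not None:
        total += node.value
        node = node.next    # the dependent load: nothing to overlap
    return total

print(traverse(build_list(1000)))  # 499500
```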
        
         | rbanffy wrote:
         | One of the reasons for SMT is to keep a wide backend occupied.
         | If a single thread plus a reorder buffer and speculative
         | execution can keep the backend busy, there's no reason to have
         | more threads.
         | 
         | OTOH, if your reorder buffer can't keep your backend full,
         | adding threads may be cheaper in terms of silicon area.
        
           | dragontamer wrote:
           | Yes, your comment makes sense but...
           | 
           | The Neoverse E1 is a 2-wide decode core with SMT (wtf??).
           | 
           | The Neoverse N1 is a 4-wide decode core (1-thread per
           | 1-core).
           | 
           | -------
           | 
           | You're right in that the N1 is narrower than Skylake / Zen.
           | But N1 isn't too shabby: it has 8 execution pipelines: 1
           | branch, 3x 64-bit integer, 2x 128-bit vector, and 2
           | load/store.
           | 
           | Furthermore: the core that ARM decided to shove their SMT-
           | effort into is the E1, which is probably 1/2 the size of a N1
           | (well, at least 1/2 sized decode).
        
             | my123 wrote:
             | For E1, it's because of less aggressive OoO for power
             | saving, and a very aggressive power target of 180mW at
             | 2.5GHz on 7nm.
             | 
             | It's within 1.5x of Cortex-A55 perf on single threaded
             | workloads, and with efficiency as the mantra, SMT was worth
             | it there. (but we'll see what happens in future designs...
             | there's both the in-order Cortex-A5xx and the OoO
             | Cortex-A6x which is Neoverse E for the power efficient
             | role)
        
       | bingohbangoh wrote:
       | Where in the world do you buy these things?!
       | 
        | I keep seeing talk about Ampere. Can I buy from them, or do I
        | need to be a business of some sort and speak with a rep?
       | 
       | nvm: they're not being sold quite yet.
        
         | wmf wrote:
         | https://store.avantek.co.uk/ampere-altra-64bit-arm-workstati...
         | 
         | https://store.avantek.co.uk/ampere-altra-1u-server-r152-p30....
        
           | bingohbangoh wrote:
           | many thanks
        
       | LeifCarrotson wrote:
       | I'm curious - what are people running on boxes like these that
       | makes good use of 80 cores and 768 GB of RAM?
        
         | uncertainrhymes wrote:
         | Fastly has published their server specs:
         | 
         | 2 Intel(R) Xeon(R) CPU E5-2690 @ 2.90GHz
         | 
          | 768 GB of RAM (384 GB per processor)
         | 
         | 18 TB of SSD Storage (Intel 3 Series or Samsung 840 Pro
         | Enterprise Series)
        
           | jenny91 wrote:
           | I was just looking for that as an example, remembered they
           | had something akin to this!
        
         | qsort wrote:
         | Microsoft Teams?
        
           | wizzwizz4 wrote:
           | > good use
        
           | glitchc wrote:
           | Close the thread, you win.
        
         | lunatuna wrote:
         | This is just hitting the market so it will be interesting to
         | see where it goes.
         | 
         | If the vendors were ready for something like this on the
          | software side, this would be great for edge compute where a
          | low-latency response is required: a remote utility
          | substation handling and reacting to a large array of sensors
          | feeding 60 data points per second. In some use cases, going
          | to the control centre and back would be too slow to benefit.
         | control is well handled, but I could see optimizations
         | benefiting from this. Vendors and utilities are way behind on
         | this though.
        
         | vmception wrote:
         | _moves goal post of the term good_
        
         | [deleted]
        
         | freedomben wrote:
         | I would use it for CI runners.
        
         | gjsman-1000 wrote:
         | Oracle Cloud provides 4 Altra Cores, 24GB of RAM, and 200GB of
         | storage for free (supposedly indefinitely). I use it for a
         | Minecraft server. Handles ~15 players with a half-dozen plugins
         | without players complaining about any lag. I only use 4GB of
          | the RAM because of Java's garbage collector, and Minecraft
          | is _heavily_ single-threaded, so I'm probably not using all
          | the cores very effectively, but it's free and it works.
        
         | whazor wrote:
          | With four 10GbE ports, and with so many cores and so much
          | memory, I can imagine this is perfect for web hosting or
          | virtual machines.
        
         | junon wrote:
         | A single electron application. Pick any of them.
        
           | goldenkey wrote:
           | Hacker News: Desktop Edition. Keeps an open tab for every
           | submission favorited. ;-)
        
           | vgeek wrote:
           | Do you think it will run Doom?
        
             | penagwin wrote:
             | one instance of doom compiled with wasm on electron, but
             | only if you lower the resolution to get it to run smooth
        
           | nottorp wrote:
           | Isn't that single threaded?
        
             | philliphaydon wrote:
             | Sure but these days you need to run many of them for
             | different apps :D
        
             | junon wrote:
             | Nope! Two threads at least are used.
        
         | binarymax wrote:
         | Low-latency information retrieval for heavy-read/low-write
         | content sets.
         | 
         | An example that might solidify the idea: pack a Wikipedia
         | snapshot into it for search, and serve ~1m queries per second
         | on it (12k qps per core).
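
Checking the arithmetic in the example above: 1M queries/second spread over 80 cores, and the per-query time budget that implies.

```python
total_qps = 1_000_000
cores = 80

qps_per_core = total_qps / cores        # 12,500 queries/second per core
budget_us = 1_000_000 / qps_per_core    # 80 microseconds per query
print(qps_per_core, budget_us)
```

So each core gets roughly an 80-microsecond budget per query, which is plausible for in-memory retrieval over a read-heavy snapshot.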
        
           | prox wrote:
           | Nice.
        
         | cogman10 wrote:
         | I'd throw some video encoding work at something like that. Easy
          | enough to eat up all that RAM and the cores.
        
           | dragontamer wrote:
           | https://chipsandcheese.com/2021/08/05/neoverse-n1-vs-
           | zen-2-a...
           | 
           | No. This core is terrible at encoding.
           | 
            | EDIT: And encoding is limited to ~16 cores in practice.
            | It seems like after that, the communication between
            | threads gets to be too much to be useful. Unless you plan
            | on doing 5 simultaneous encodes at a time, you're gonna
            | have to find something else to do with all those cores.
        
             | cogman10 wrote:
             | Most of the new encoding tools split videos by scene and
             | then run parallel encodes from there, such as av1an.
             | 
              | For a decently sized video (say a TV episode) there are
              | usually around 100 split points to divvy out to
              | encoders.
             | 
             | https://github.com/master-of-zen/Av1an
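
A rough sketch of the split-by-scene idea described above (this is not Av1an's actual code; the scene-cut list and the encoder call are placeholders):

```python
# Hypothetical sketch: cut a video at scene boundaries, encode the
# chunks in parallel, then stitch. Scene detection and the real encoder
# invocation (e.g. an aomenc subprocess) are stubbed out.
from multiprocessing import Pool

def chunks_from_scenes(scene_cuts, total_frames):
    """Turn scene-cut frame indices into (start, end) frame ranges."""
    bounds = [0] + sorted(scene_cuts) + [total_frames]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if a < b]

def encode_chunk(chunk):
    start, end = chunk
    # Placeholder for encoding frames [start, end); here we just
    # report how many frames the chunk covers.
    return end - start

if __name__ == "__main__":
    # Assumed scene cuts for a 2000-frame episode.
    chunks = chunks_from_scenes([300, 900, 1500], total_frames=2000)
    with Pool() as pool:
        encoded = pool.map(encode_chunk, chunks)
    assert sum(encoded) == 2000  # every frame covered exactly once
```

With ~100 real split points per episode, each chunk becomes an independent job, which is how this kind of tool keeps many cores busy despite per-encode thread-scaling limits.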
        
           | LeifCarrotson wrote:
           | Is there a reason that 175W of processing power on 80 small
           | cores at 2 GHz would be faster than, for example, an AMD EPYC
           | 7F32, which has a similar TDP of 180W and 8 cores with 2
           | threads each that run at ~4 GHz?
           | 
           | Naively, assuming identical instruction sets (I know they're
           | not), 16 threads at 4 GHz is less than half as good as 80 at
           | 2 GHz. But that can't be the whole story.
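
The naive clock-times-threads comparison from the question above, ignoring IPC, ISA, and memory differences as the commenter notes:

```python
epyc_threads, epyc_ghz = 16, 4.0    # EPYC 7F32: 8 cores x 2 threads, ~4 GHz
altra_threads, altra_ghz = 80, 2.0  # 80 cores, ~2 GHz per the comment

epyc_total = epyc_threads * epyc_ghz      # 64 GHz-threads
altra_total = altra_threads * altra_ghz   # 160 GHz-threads
print(epyc_total / altra_total)  # 0.4 -- "less than half", as stated
```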
        
             | dragontamer wrote:
             | AVX2 (256-bit SIMD instructions) is huge in the encoding
             | world. A lot of these encoding algorithms operate over
             | reasonable block sizes (8x8 macroblocks) that ensure that
             | SIMD-instruction sets benefit greatly.
             | 
              | ARM only has 128-bit SIMD through NEON. It's reasonably
              | well designed, but nothing beats the brute force of just
              | doing 256 bits at a time (or 512 bits in the case of
              | Intel's AVX-512).
        
         | packetslave wrote:
         | In my case, Houdini simulations for my vfx hobby ("only" 64
         | cores and 128gb RAM, though).
        
         | sleepybrett wrote:
         | might be nice in a render farm?
        
           | packetslave wrote:
           | Not at 2.80 GHz it isn't.
        
         | [deleted]
        
         | OptionX wrote:
         | Based on the spec and the form-factor I assume an oven.
        
         | clircle wrote:
         | I think the target is low power consumption server
         | applications.
        
         | m4rtink wrote:
         | Minecraft servers? ;-)
        
           | gjsman-1000 wrote:
           | Oracle Cloud provides 4 Altra Cores and 24GB of RAM for free.
           | I can support ~15 players with a half-dozen plugins without
           | players complaining of any lag. Minecraft is very single-
           | threaded though and I'm only using 4GB of the RAM because of
           | Java's Garbage Collector - but it does work and it's
           | supposedly free indefinitely.
        
         | [deleted]
        
       | baybal2 wrote:
       | Cost?
        
         | nottorp wrote:
         | Speaking of which, do they have smaller cheaper systems for
         | hobbyists playing around, or are they just enterprise?
        
           | wmf wrote:
           | Everything from Ampere is crazy expensive (unless you count
           | the Oracle Cloud free tier).
        
         | banana_giraffe wrote:
         | A 2U server with this same processor (and a case and
         | powersupply) is in the $10k range, so I'd expect something
         | around that.
        
         | wmf wrote:
         | Probably over $5,000... for the motherboard with no CPU
         | included.
        
       | LargoLasskhyfv wrote:
       | related:
       | 
       | [0] https://www.oracle.com/news/announcement/oracle-unlocks-
       | powe...
        
         | iflp wrote:
          | This is $7.20/mo, at which rate you can rent Xeon cores as
          | well.
        
       ___________________________________________________________________
       (page generated 2021-09-17 23:01 UTC)