[HN Gopher] First true exascale supercomputer?
       ___________________________________________________________________
        
       First true exascale supercomputer?
        
       Author : AliCollins
       Score  : 51 points
       Date   : 2022-07-06 15:57 UTC (7 hours ago)
        
 (HTM) web link (www.top500.org)
 (TXT) w3m dump (www.top500.org)
        
       | rektide wrote:
       | > _8,730,112 total cores_
       | 
        | This must include the GPUs; otherwise, at 64 cores per socket,
        | it'd be 136,408 sockets. For a 42U rack of 4-socket 1U servers
        | (not that that's what's actually in use, just an understandable
        | napkin figure), that'd be 812 racks (napkin math spelled out
        | below).
       | 
        | Frontier's own page says 74 "cabinets"/racks, and this is just
        | for the compute (and perhaps switching and/or power? storage is
        | elsewhere). It's made up of 9,408 nodes with 4 MI250X GPU
        | accelerators each -- those accelerators being dual-die monsters
        | with 8x HBM2e stacks apiece. From AnandTech [1], we can see the
        | liquid-cooled half-width sleds are dual socket, and packed
        | packed packed.
       | 
       | [1] https://www.anandtech.com/show/17074/amds-instinct-
       | mi250x-re...
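        | 
        | Spelling out that napkin math (a rough sketch only; the 64-core-
        | per-socket figure matches Frontier's EPYC CPUs, while the 42x
        | 4P 1U rack layout is just the hypothetical above, not the real
        | packaging):
        | 
        |   // back-of-the-envelope socket/rack count
        |   #include <cstdio>
        |   int main() {
        |       const long cores = 8730112;     // TOP500 core count
        |       const int per_socket = 64;      // 64-core EPYC CPUs
        |       const int per_rack = 42 * 4;    // 42x 4P 1U servers
        |       long sockets = cores / per_socket;
        |       long racks = (sockets + per_rack - 1) / per_rack;
        |       printf("%ld sockets, %ld racks\n", sockets, racks);
        |       // -> 136408 sockets, 812 racks
        |       return 0;
        |   }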
        
         | JonChesterfield wrote:
          | It's a bit hard to guess what a 'core' would be on a GPU. A
          | compute unit / streaming multiprocessor, perhaps.
        
       | seiferteric wrote:
       | > and relies on gigabit ethernet for data transfer.
       | 
        | This seems surprising to me; I would have expected 10Gb at
        | least, if not something like InfiniBand.
        
         | xtreme wrote:
          | That's a typo; Frontier uses the Slingshot network from
          | Cray/HPE. The table below has the correct information.
        
         | dixie_land wrote:
         | seems to be a proprietary interconnect that is "Ethernet
         | compatible"
         | 
         | https://www.hpe.com/us/en/compute/hpc/slingshot-interconnect...
        
           | pclmulqdq wrote:
            | As far as I know, Slingshot uses a layer 1 that is exactly
            | the same as Ethernet and allows layer 2 Ethernet packets to
            | enter its switches. However, it has several layer 1/2
            | extensions that let it look more like InfiniBand for use
            | cases that need it, including flow control and congestion
            | control.
        
         | jagger27 wrote:
         | It's wrong, and quite a funny typo. The interconnect is 100
         | gigabit.
         | 
         | https://www.olcf.ornl.gov/frontier/#4
        
       | jmpman wrote:
        | Back in the 2010 timeframe, there were articles about how an
        | exascale supercomputer might be impossible. It would be
        | interesting if someone could go back and assess where those
        | predictions were wrong and where they held, and how the
        | architecture changed to get around the real scaling limits.
        
         | porcoda wrote:
         | Power efficiency mostly. The power requirements of an exascale
         | machine with 2010-timeframe hardware would be crazy.
        
           | peter303 wrote:
            | Oak Ridge's machine still consumes about 20 megawatts. With
            | 2010-era technology, however, it looked like it would have
            | required a gigawatt.
        
       | peter303 wrote:
        | Onward to zettaflops around 2037, assuming an order of magnitude
        | every five years (exa to zetta is three orders, so ~15 years
        | out). That's been pretty much the case for 60 years.
        
       | einpoklum wrote:
       | > "This HPE Cray EX system is the first US system with a peak
       | performance exceeding one ExaFlop/s."
       | 
       | So, it's not actually the first one? And another one already
       | exists outside the US?
        
         | jagger27 wrote:
         | Yes. It is assumed that China is downplaying how capable theirs
         | are.
        
         | ncmncm wrote:
         | We may assume NSA has faster ones, devoted to speech
         | transcription and codebreaking.
        
         | SkyMarshal wrote:
         | That was an odd qualification. The only thing they mention is
         | that the #2 computer in Japan is theoretically capable of an
         | Exaflop, but hasn't demonstrated it yet.
        
       | kvetching wrote:
        | If they truly wanted to solve world problems, they'd need to
        | let an AGI company like DeepMind or OpenAI use it. The people
        | now using it are likely wasting so much money on outdated
        | technologies.
        
         | ncmncm wrote:
         | As in Vernor Vinge's "A Fire upon the Deep", where Powers in
         | the Great Beyond transcend existence while regular people are
         | condemned to live out their lives running a trivial program.
        
       | sriram_malhar wrote:
       | 21 MW power! Insane.
       | 
       | Interestingly, the second one is 30 MW.
        
         | krylon wrote:
          | For reference, Roadrunner, the first petascale system back in
          | 2008, used 2.35 MW (according to Wikipedia). So this one
          | gives us roughly 1,000 times the performance for about 9
          | times the power (quick check below). From a performance-per-
          | watt perspective, that is an impressive improvement.
         | 
         | EDIT: Wikipedia also says Roadrunner was not considered power-
         | efficient in its day, which led to it being decommissioned
         | after only five years of operation.
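          | 
          | A quick sanity check of that ratio (a sketch plugging in the
          | published figures of roughly 1.0 PFlop/s at 2.35 MW for
          | Roadrunner and 1.1 EFlop/s at about 21 MW for Frontier):
          | 
          |   // performance-per-watt comparison, TOP500 figures
          |   #include <cstdio>
          |   int main() {
          |       const double rr_gflops = 1.026e6, rr_mw = 2.35;
          |       const double fr_gflops = 1.102e9, fr_mw = 21.0;
          |       printf("perf:  %.0fx\n", fr_gflops / rr_gflops);
          |       printf("power: %.1fx\n", fr_mw / rr_mw);
          |       printf("GFlops/W: %.2f -> %.2f\n",
          |              rr_gflops / (rr_mw * 1e6),
          |              fr_gflops / (fr_mw * 1e6));
          |       // ~1074x the performance for ~8.9x the power,
          |       // i.e. roughly 120x better GFlops per watt
          |       return 0;
          |   }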
        
           | porcoda wrote:
            | That, and the application teams hated the architecture
            | given its very poor tooling support -- especially since the
            | writing was on the wall that the future was GPGPU
            | accelerators and the Cell was a dead end. The Roadrunner
            | processors were awesome on paper, but not so much when it
            | came to working with them. Kind of a shame, really: there
            | were some genuinely interesting ideas in that processor
            | design.
        
         | dekhn wrote:
         | To me 50 megawatts is the baseline of what I would expect for a
         | decent cluster.
        
           | jeffbee wrote:
           | Perhaps you in fact forgot how to count lower than 50.
        
             | beckingz wrote:
             | Reference link: https://www.youtube.com/watch?v=3t6L-FlfeaI
        
             | ncmncm wrote:
             | He is so fast that just counting to ten, he gets that far
             | before he can stop.
        
             | dekhn wrote:
             | I can't even get out of bed for less than 3 peer bonuses!
        
         | ncmncm wrote:
          | This is the first time I've noticed them reporting power
          | draw. It seems immoral to run it for anything that doesn't
          | help stop the global climate catastrophe. (Presumably global
          | thermonuclear war would suffice, but carbon capture afterward
          | would be hard to arrange.)
         | 
          | I wonder whether they measure power while benchmarking, or
          | just add up the chips' maximum power ratings.
         | 
         | Did any old mainframe ever burn like that? E.g. the first big
         | USAF missile tracking system, the one that filled four floors
         | of a custom building?
        
           | jeffbee wrote:
           | Are you kidding? You know a single widebody airliner uses
           | more energy than that, right?
        
             | ncmncm wrote:
             | Widebody airliners aren't doing much for the climate,
             | either.
        
               | ben_w wrote:
               | While true, I think the point is more about how minuscule
               | 21 MW is when considered in isolation -- 0.001 percent of
               | global electricity usage.
        
           | sbierwagen wrote:
            | Using the secret skill of "clicking on the links for the
            | other lists", I discovered that the first TOP500 list with
            | a machine reporting its power draw was the top 10 of the
            | November 1996 list:
            | https://top500.org/lists/top500/1996/11/
            | 
            | (498 kW for 229 GFlops -- 136,317 times more power draw
            | per flop than the current leader on the Green500.)
        
       | causi wrote:
       | It feels like it's been a long time since supercomputers were
       | interesting. They're just oodles of identical processors
       | connected together like legos. "We can afford more bricks than
       | the next guy" is not exciting. When was the last time we had a
       | "fastest supercomputer" that could do something the second-
       | fastest couldn't also do?
        
         | zamadatix wrote:
          | Speed is just a measure of how fast it does something, not a
          | measure of what it's capable of doing. I wouldn't expect to
          | divine information like "what new things can it do" from that
          | number alone, beyond "things we didn't have enough compute
          | time for before, we do now".
          | 
          | Lego-style supercomputers are still very interesting in my
          | eyes, though. While scaling raw compute performance has
          | simplified to a "how many do you want" problem, the technical
          | complexity in the interconnects has remained interesting and
          | innovative, both intra-node and inter-node. You won't really
          | see that in the FLOPS number that makes the headlines, but
          | the interconnect can be the difference between a type of
          | workload being feasible or not. The main push is how large
          | you can make each level of shared memory access, and at what
          | latency, so you can run larger jobs instead of just more
          | jobs.
        
           | convolvatron wrote:
           | there is also a huge amount of work remaining to be done in
           | programming models and consistency.
        
         | rfoo wrote:
         | > They're just oodles of identical processors connected
         | together like legos.
         | 
         | That's the Cloud, not supercomputing. Supercomputing is all
         | about interconnect.
        
           | agumonkey wrote:
            | I also wonder how the software side of things changes in
            | these settings -- how do people design programs and
            | algorithms around fast, wide data paths like these?
        
             | colatkinson wrote:
             | I have a bit of experience programming for a highly-
             | parallel supercomputer, specifically in my case an IBM
             | BlueGene/Q. In that case, the answer is a lot of message
             | passing (we used Open MPI [0]). Since the nodes are
             | discrete and don't have any shared memory, you end up with
             | something kinda reminiscent of the actor model as
             | popularized by Erlang and co -- but in C for number-
             | crunching performance.
             | 
             | That said, each of the nodes is itself composed of multiple
             | cores with shared memory. So in cases where you really want
             | to grind out performance, you actually end up using message
             | passing to divvy up chunks of work, and then use classic
             | pthreads to parallelize things further, with lower latency.
             | 
             | I forget the exact terminology used, but the parent is
             | right that the interconnect is the "killer feature." To
              | make that message passing fast, there's a lot of crazy
              | topology work to keep the number of hops down. To that
              | end, the Q had its nodes connected in a "torus"
              | configuration [1].
             | 
             | Debugging is a bit of a nightmare, though, since some bugs
             | inevitably only come up once you have a large number of
             | nodes running the algorithm in parallel. But you'll
             | probably be in a mainframe-style time-sharing setup, so you
             | may have to wait hours or more to rerun things.
             | 
             | This applies less to some of the newer supercomputers,
             | which are more or less clusters of GPUs instead of clusters
             | of CPUs. I imagine there's some commonality, but I haven't
             | worked with any of them so I can't really say.
             | 
             | [0] https://www.open-mpi.org/
             | 
             | [1] https://www.scorec.rpi.edu/~shephard/FEP19/notes-2019/I
             | ntrod...
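              | 
              | A minimal sketch of that "MPI across nodes, threads
              | within a node" pattern (illustrative only; it uses
              | std::thread rather than raw pthreads for brevity, and
              | just sums a series split across ranks and threads):
              | 
              |   // hybrid MPI + shared-memory threading sketch
              |   #include <mpi.h>
              |   #include <cstdio>
              |   #include <thread>
              |   #include <vector>
              | 
              |   int main(int argc, char** argv) {
              |       MPI_Init(&argc, &argv);
              |       int rank, nranks;
              |       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
              |       MPI_Comm_size(MPI_COMM_WORLD, &nranks);
              | 
              |       // each rank owns a contiguous slice of the
              |       // global index space (assumes even division)
              |       const long n = 1L << 24;
              |       const long chunk = n / nranks;
              |       const long base = rank * chunk;
              | 
              |       // inside the node: shared-memory threads,
              |       // no message passing
              |       const int nt = 4;
              |       std::vector<double> part(nt, 0.0);
              |       std::vector<std::thread> pool;
              |       for (int t = 0; t < nt; ++t)
              |           pool.emplace_back([&, t] {
              |               long lo = base + t * (chunk / nt);
              |               long hi = lo + chunk / nt;
              |               for (long i = lo; i < hi; ++i)
              |                   part[t] += 1.0 / (i + 1.0);
              |           });
              |       for (auto& th : pool) th.join();
              | 
              |       double local = 0.0, total = 0.0;
              |       for (double p : part) local += p;
              | 
              |       // across nodes: message passing only
              |       MPI_Reduce(&local, &total, 1, MPI_DOUBLE,
              |                  MPI_SUM, 0, MPI_COMM_WORLD);
              |       if (rank == 0) printf("sum = %f\n", total);
              |       MPI_Finalize();
              |       return 0;
              |   }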
        
         | freemint wrote:
          | Well, fundamentally all supercomputers are Turing machines,
          | so "one can do X while the other cannot" doesn't really make
          | sense in that context.
          | 
          | However, the second-fastest (the ARM-based Fugaku) absolutely
          | wipes the floor with the fastest in certain tasks due to a
          | difference in interconnect topology. Fugaku furthermore has
          | no GPUs, unlike many other supercomputers, and instead uses
          | CPUs with vector instructions, leading to a different
          | programming model.
          | 
          | If you are more into specialized hardware, Anton 3 is
          | amazing.
        
         | bitwize wrote:
          | Building the communication fabric it takes to make those
          | oodles of identical processors exchange and share data
          | quickly -- so they don't get bogged down in their own
          | communication overhead -- is a profoundly interesting
          | problem, and by "profoundly interesting" I mean "call Richard
          | Feynman in to help you solve it":
         | 
         | https://longnow.org/essays/richard-feynman-connection-machin...
         | 
         | Besides which, at that level the goal is not to go "look at
         | this cool thing we built", it's more like "how do we cheaply
         | and effectively build something that can solve these massive
         | weather/nuclear explosion/human brain/etc. simulation problems
         | we have?" and if ganging together lots of off-the-shelf
         | CPUs/GPUs achieves that goal with less time, effort, and cost
         | than building super-custom, boutique-schmoutique hardware, so
         | be it.
        
         | guenthert wrote:
          | Not sure about exciting, but I'd think the technical
          | challenges, particularly regarding intra-cluster
          | communication, can be interesting to some. There's a lot of
          | money in these machines; they had better do something useful
          | (more useful than running LINPACK or calculating digits of
          | pi), rather than being just showcases.
          | 
          | That said, #1 is about twice as fast as #2, which is about
          | three times as fast as #3. Those gaps are much wider than I
          | would have expected this late in the game.
        
         | kjs3 wrote:
          | You can still get the NEC SX series, which is a non-x86,
          | non-ARM vector super. They're pretty nifty. "Fastest" has
          | gone in a different direction, though.
        
       | inasio wrote:
        | There's a bit of drama in that there are unofficial reports of
        | two systems in China with higher performance. The arXiv paper
        | linked below [0] describes a 40-million-core system with around
        | double the theoretical performance of Frontier, and there's
        | apparently a second system online with similar performance. I
        | personally suspect they didn't submit benchmarks to the TOP500
        | simply because the benchmarks don't run well enough on those
        | systems.
       | 
       | [0] https://arxiv.org/pdf/2204.07816.pdf
        
         | dekhn wrote:
          | Let them build those machines, and if they are any good we
          | can steal all their ideas. Turnabout is fair play.
        
         | oefrha wrote:
         | I heard they won't submit anymore so as to not draw further
         | scrutiny and possible sanctions onto their suppliers. Not sure
         | if true, but keeping a low profile certainly makes sense given
         | the blows dealt to the more visible vendors in the past few
         | years.
        
           | greggsy wrote:
           | What are the concerns for vendors?
        
             | perihelions wrote:
              | The US banned the sale of American HPC components to
              | Chinese supercomputers.
             | 
             | https://news.ycombinator.com/item?id=9349116 (2015, 93
             | comments)
             | 
             | https://news.ycombinator.com/item?id=26740371 (2021, 151
             | comments) etc.
        
               | throwaway4good wrote:
                | They also prevent Chinese supercomputing-related
                | companies from having their chips fabbed in Taiwan.
        
       | marcodiego wrote:
        | I remember trying to convince people to use Linux in the early
        | 2000s and being mocked that it was a "toy" or "not professional
        | enough". While I argued at the time that it was more stable,
        | more secure and better performing than the competition, and
        | that it was improving continuously, some people still made fun
        | of me. It is a good thing that, for a long time now, I've been
        | able to point to this: https://www.top500.org/statistics/list/
        | -- choose Category: "Operating system family" and click
        | "Submit".
        
         | jacquesm wrote:
         | The writing has been on the wall since the early Beowulf
         | success stories hit.
         | 
          | I'm pretty bullish on the long-term survival of Linux in
          | some form or other; proprietary OSes, not so much.
        
       | jpsamaroo wrote:
       | This is exciting news! What's also exciting is that it's not just
       | C++ that can run on this supercomputer; there is also good
       | (currently unofficial) support for programming those GPUs from
       | Julia, via the AMDGPU.jl library (note: I am the
       | author/maintainer of this library). Some of our users have been
       | able to run AMDGPU.jl's testsuite on the Crusher test system
       | (which is an attached testing system with the same hardware
       | configuration as Frontier), as well as their own domain-specific
       | programs that use AMDGPU.jl.
       | 
       | What's nice about programming GPUs in Julia is that you can write
       | code once and execute it on multiple kinds of GPUs, with
       | excellent performance. The KernelAbstractions.jl library makes
       | this possible for compute kernels by acting as a frontend to
       | AMDGPU.jl, CUDA.jl, and soon Metal.jl and oneAPI.jl, allowing a
       | single piece of code to be portable to AMD, NVIDIA, Intel, and
       | Apple GPUs, and also CPUs. Similarly, the GPUArrays.jl library
       | allows the same behavior for idiomatic array operations, and will
       | automatically dispatch calls to BLAS, FFT, RNG, linear solver,
       | and DNN vendor-provided libraries when appropriate.
       | 
       | I'm personally looking forward to helping researchers get their
       | Julia code up and running on Frontier so that we can push
       | scientific computing to the max!
       | 
       | Library link: <https://github.com/JuliaGPU/AMDGPU.jl>
        
       | linsomniac wrote:
        | TL;DR: Wow! ~9 million cores, 21 megawatts, >2x the performance
        | of #2 while pulling less power (21 MW vs. 30 MW). #3 is
        | 0.15 EFLOPS but draws only 3 MW.
        
       | jakear wrote:
        | The spec sheet mentions they're moving from CUDA, which powered
        | their prior supercomputer, to "HIP" for this one. This is the
        | first I've heard of HIP; does anyone have experience with it?
        | My impression was that GPU programming tended to mean CUDA,
        | which isn't cross-platform (as opposed to HIP).
       | 
       | https://developer.amd.com/resources/rocm-learning-center/fun....
        
         | eslaught wrote:
         | HIP is basically CUDA with s/cuda/hip/g.
         | 
          | My experience is that the stack is pretty rough around the
          | edges. But when it works, you can (almost) literally find-
          | and-replace, and it pretty much works as advertised. However,
          | just because you can get to correct code doesn't necessarily
          | mean that code will achieve optimal performance (without
          | further tuning, of course).
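          | 
          | To make the s/cuda/hip/g point concrete, here's a minimal
          | sketch (a toy saxpy, not code from Frontier; it assumes
          | hipcc, and each hip* call mirrors a cuda* counterpart):
          | 
          |   #include <hip/hip_runtime.h>
          |   #include <cstdio>
          |   #include <vector>
          | 
          |   // y[i] += a * x[i]; __global__ works as in CUDA
          |   __global__ void saxpy(int n, float a,
          |                         const float* x, float* y) {
          |       int i = blockIdx.x * blockDim.x + threadIdx.x;
          |       if (i < n) y[i] += a * x[i];
          |   }
          | 
          |   int main() {
          |       const int n = 1 << 20;
          |       std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
          |       float *dx, *dy;
          |       hipMalloc((void**)&dx, n * sizeof(float));
          |       hipMalloc((void**)&dy, n * sizeof(float));
          |       hipMemcpy(dx, hx.data(), n * sizeof(float),
          |                 hipMemcpyHostToDevice);
          |       hipMemcpy(dy, hy.data(), n * sizeof(float),
          |                 hipMemcpyHostToDevice);
          |       // hipcc also accepts the CUDA-style <<<>>> launch
          |       saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);
          |       hipDeviceSynchronize();
          |       hipMemcpy(hy.data(), dy, n * sizeof(float),
          |                 hipMemcpyDeviceToHost);
          |       printf("y[0] = %.1f\n", hy[0]);  // expect 5.0
          |       hipFree(dx);
          |       hipFree(dy);
          |       return 0;
          |   }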
        
           | latchkey wrote:
           | Not only tuning the code, but also the bazillion knobs on the
           | GPUs themselves.
        
             | gh02t wrote:
              | That's the upside of a supercomputer, though: a fixed
              | architecture to target, with enough weight behind it that
              | the tuning is worthwhile.
        
         | latchkey wrote:
          | If you have AMD GPUs, then you need to use HIP to run all
          | those CUDA applications.
        
       | CoastalCoder wrote:
       | I used to be really excited about supercomputers. It's part of
       | why I pursued HPC-related work.
       | 
       | But I think that having no interest in their actual applications
       | has curbed my enthusiasm. I wish I could make a good living in
        | something that interested me more.
        
         | JonChesterfield wrote:
          | You could work on the supercomputer hardware / toolchain /
          | libraries instead of the applications.
        
         | linksnapzz wrote:
         | I love the applications, but I'm dismayed at the stagnation in
         | programming models used to get the best performance out of
         | modern clusters. This sums up my feelings:
         | 
         | https://www.usenix.org/conference/atc21/presentation/fri-key...
        
       | nabla9 wrote:
        | For comparison, the 2000 SP Power3 375 MHz at Oak Ridge
        | National Laboratory delivered the same order of magnitude of
        | GFlops as an iPhone with the A14 chip can.
        
       ___________________________________________________________________
       (page generated 2022-07-06 23:01 UTC)