[HN Gopher] A day in the life of the fastest supercomputer
       ___________________________________________________________________
        
       A day in the life of the fastest supercomputer
        
       Author : nradclif
       Score  : 55 points
       Date   : 2024-09-04 23:38 UTC (3 days ago)
        
 (HTM) web link (www.nature.com)
 (TXT) w3m dump (www.nature.com)
        
       | ungreased0675 wrote:
       | I was hoping for a list of projects this system has queued up.
       | It'd be interesting to see where the priorities are for something
       | so powerful.
        
         | dekhn wrote:
         | I haven't been able to find a web-accessible version of their
         | SLURM queue, nor could I find the allocations (compute amounts
         | given to specific groups). You can see a subset of the
         | allocations here: https://www.ornl.gov/news/incite-program-
         | awards-supercomputi...
        
         | pelagicAustral wrote:
         | You can infer a little from this [0] article:
         | 
         | ORNL and its partners continue to execute the bring-up of
         | Frontier on schedule. Next steps include continued testing and
         | validation of the system, which remains on track for final
         | acceptance and early science access later in 2022 and open for
         | full science at the beginning of 2023.
         | 
         | UT-Battelle manages ORNL for the Department of Energy's Office
         | of Science, the single largest supporter of basic research in
         | the physical sciences in the United States. _The Office of
         | Science is working to address some of the most pressing
         | challenges of our time. For more information, please visit
          | energy.gov/science_
         | 
         | [0] https://www.ornl.gov/news/frontier-supercomputer-debuts-
         | worl...
        
       | iJohnDoe wrote:
       | The analogies used in this article were a bit weird.
       | 
        | Two things I've always wondered, since I'm not an expert:
       | 
        | 1. Obviously, applications must be written to distribute their
        | load effectively across the supercomputer. I wonder how often
        | this prevents useful things from being considered as candidates
        | to run on the supercomputer.
        | 
        | 2. It always seems like getting access to run anything on the
        | supercomputer is very competitive, or even artificially limited.
        | A shame this isn't open to more people. That much processing
        | power seems like it should be put to work on many more things.
        
         | tryauuum wrote:
          | I feel like the name "supercomputer" is overhyped. It's just
          | many normal x86 machines running Linux, connected by a fast
          | network.
          | 
          | Here in Finland I think you can use the LUMI supercomputer for
          | free, on the condition that the results are made publicly
          | available.
        
           | bjelkeman-again wrote:
            | How to get access to LUMI: https://www.lumi-
            | supercomputer.eu/get-started/
        
           | NegativeK wrote:
           | I think you've used the "just" trap to trivialize something.
           | 
           | I'm surprised that Frontier is free with the same conditions;
           | I expected researchers to need grant money or whatever to
           | fund their time. Neat.
        
           | lokimedes wrote:
            | In the beginning they were just "Beowulf clusters" compared
            | with "real" supercomputers. Isn't it always like this: the
            | romantic and exceptional are absorbed by the sheer scale of
            | the practical and common once someone discovers a way to
            | drive the economy at scale? Cars, aircraft, long-distance
            | communications, now perhaps AI? Yet the words may still
            | capture the early romance.
        
           | markstock wrote:
            | FYI: LUMI uses a nearly identical architecture to Frontier's
            | (AMD CPUs and GPUs), and was also built by HPE.
        
         | msteffen wrote:
         | My former employer (Pachyderm) was acquired by HPE, who built
         | Frontier (and sells supercomputers in general), and I've
         | learned a lot about that area since the acquisition.
         | 
          | One of the main differences between supercomputers and, e.g.,
          | a datacenter is that in the former case, application authors
          | do not, as a rule, assume hardware or network issues and
          | engineer around them. A typical supercomputer workload will
          | fail overall if any one of its hundreds or thousands of
          | workers fails. This assumption greatly simplifies the work of
          | writing such software, as error handling is typically one of
          | the biggest sources of complexity, if not the biggest, in a
          | distributed system. It makes engineering the hardware much
          | harder, of course, but that's how HPE makes money.
          | 
          | A second difference is that RDMA (Remote Direct Memory
          | Access: the ability for one computer to access another
          | computer's memory without going through its CPU, since the
          | network card can access memory directly) is standard. This
          | removes all the complexity of an RPC framework from
          | supercomputer workloads. Also, the L1 protocol used has orders
          | of magnitude lower latency than Ethernet, such that it's often
          | faster to read memory on a remote machine than to do any kind
          | of local caching.
         | 
         | The result is that the frameworks for writing these workloads
         | let you more or less call an arbitrary function, run it on a
         | neighbor, and collect the result in roughly the same amount of
         | time it would've taken to run it locally.
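          | 
          | A minimal sketch of that "ship a function to a neighbor"
          | pattern, using mpi4py purely for illustration (an assumption
          | on my part; real codes are more often MPI from C, C++ or
          | Fortran):
          | 
          |     # Launch with e.g.: mpirun -np 2 python neighbor.py
          |     from mpi4py import MPI
          | 
          |     comm = MPI.COMM_WORLD
          |     rank = comm.Get_rank()
          | 
          |     def work(x):
          |         return x * x  # stand-in for an arbitrary function
          | 
          |     if rank == 0:
          |         comm.send(21, dest=1, tag=0)       # ship the input
          |         print(comm.recv(source=1, tag=1))  # collect: 441
          |     elif rank == 1:
          |         x = comm.recv(source=0, tag=0)
          |         comm.send(work(x), dest=0, tag=1)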
        
       | kaycebasques wrote:
       | What's the documentation like for supercomputers? I.e. when a
       | researcher gets approved to use a supercomputer, do they get lots
       | of documentation explaining how to set up and run their program?
       | I got the sense from a physicist buddy that a lot of experimental
       | physics stuff is shared informally and never written down. Or
       | maybe each field has a couple popular frameworks for running
       | simulations, and the Frontier people just make sure that Frontier
       | runs each framework well?
        
         | tryauuum wrote:
          | Google openmpi, mpirun, slurm. It's not complex.
          | 
          | It's like Kubernetes, but invented long before Kubernetes
          | existed.
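          | 
          | A hello-world sketch with mpi4py (assuming it's installed;
          | the launch commands are the standard Open MPI and SLURM
          | ones):
          | 
          |     # Run with: mpirun -np 4 python hello.py
          |     # or, under SLURM: srun -n 4 python hello.py
          |     from mpi4py import MPI
          | 
          |     comm = MPI.COMM_WORLD
          |     print(f"rank {comm.Get_rank()} of {comm.Get_size()}")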
        
         | sega_sai wrote:
          | I know that DOE's NERSC supercomputing center has a lot of
          | documentation: https://docs.nersc.gov/getting-started/ . They
          | also hold weekly events where you can ask questions about
          | code, optimisation, etc. (I have never attended those, but
          | regularly get emails about them).
        
         | piombisallow wrote:
         | Take a look here if you're curious, as an example:
         | https://docs.ncsa.illinois.edu/systems/delta/en/latest/
         | 
         | 90% of my interactions are ssh'ing into a login node and
         | running code with SLURM, then downloading the data.
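          | 
          | For the curious, a sketch of that loop scripted from a
          | laptop, wrapping the standard SLURM CLI over ssh (host and
          | file names here are made up):
          | 
          |     import subprocess
          | 
          |     HOST = "login.delta.example.edu"  # hypothetical host
          | 
          |     def ssh(*cmd: str) -> str:
          |         res = subprocess.run(["ssh", HOST, *cmd],
          |                              capture_output=True, text=True)
          |         return res.stdout
          | 
          |     # sbatch --parsable prints just the job ID
          |     job_id = ssh("sbatch", "--parsable", "job.sh").strip()
          |     print(ssh("squeue", "-j", job_id))  # check queue status
          | 
          |     # once the job finishes, pull the results back down
          |     subprocess.run(["scp", f"{HOST}:results.tar.gz", "."])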
        
         | markstock wrote:
         | https://docs.olcf.ornl.gov/systems/frontier_user_guide.html
         | 
         | This will have much of what you need.
        
       | kkielhofner wrote:
       | I have a project on Frontier - happy to answer any questions!
       | 
       | Funny story about Bronson Messer (quoted in the article):
       | 
       | On my first trip to Oak Ridge we went on a tour of "The Machine".
       | Afterwards we were hanging out on the observation deck and got
       | introduced to something like 10 people.
       | 
       | Everyone at Oak Ridge is just Tom, Bob, etc. No titles or any of
       | that stuff - I'm not sure I've ever heard anyone refer to
       | themselves or anyone else as "Doctor".
       | 
       | Anyway, the guy to my right asks me a question about ML
       | frameworks or something (don't even remember it specifically).
       | Then he says "Sorry, I'm sure that seems like a really basic
       | question, I'm still learning this stuff. I'm a nuclear
       | astrophysicist by training".
       | 
       | Then someone yells out "AND a three-time Jeopardy champion"!
       | Everyone laughs.
       | 
       | You guessed it, guy was Bronson.
       | 
       | Place is wild.
        
         | johnklos wrote:
         | > anyone refer to themselves or anyone else as "Doctor".
         | 
         | Reminds me of the t-shirt I had that said, "Ok, Ok, so you've
         | got a PhD. Just don't touch anything."
        
         | ai_slurp_bot wrote:
          | Hey, my sister Katie is the reason he wasn't a 4-day champ!
          | Beat him by $1. She also lost her next game.
        
       | 7373737373 wrote:
       | So what is the actual utilization % of this machine?
        
         | nradclif wrote:
         | I don't know the exact utilization, but most large
         | supercomputers that I'm familiar with have very high
         | utilization, like around 90%. The Slurm/PBS queue times can
         | sometimes be measured in days.
        
       | dauertewigkeit wrote:
       | Don't the industry labs have bigger machines by now? I lost
       | track.
        
         | Mistletoe wrote:
         | Not any that we know about.
         | 
         | https://top500.org/lists/top500/list/2024/06/
        
       | cubefox wrote:
       | > With its nearly 38,000 GPUs, Frontier occupies a unique public-
       | sector role in the field of AI research, which is otherwise
       | dominated by industry.
       | 
       | Is it really realistic to assume that this is the "fastest
       | supercomputer"? What are estimated sizes for supercomputers used
       | by OpenAI, Microsoft, Google etc?
       | 
       | Strangely enough, the Nature piece only mentions possible secret
       | military supercomputers, but not ones used by AI companies.
        
         | rcxdude wrote:
          | There is a difference between a supercomputer and just a
          | large cluster of compute nodes, and it's mainly in the
          | bandwidth between the nodes. I suspect industry uses a larger
          | number of smaller groups of highly connected GPUs for AI work.
        
           | p1esk wrote:
            | Do you mean this supercomputer has slower internode links?
            | What are its links? For example, xAI just brought up a
            | 100k-GPU cluster, most likely with 800 Gbps internode links,
            | or maybe even double that.
            | 
            | I think the main difference is in the target numerical
            | precision: supercomputers such as this one focus on
            | maximizing FP64 throughput, while the GPU clusters used by
            | OpenAI or xAI want to compute in 16- or even 8-bit precision
            | (BF16 or FP8).
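            | 
            | To make the precision point concrete, a small sketch with
            | numpy (illustrative only): accumulating 0.1 a hundred
            | thousand times in FP16 vs FP64.
            | 
            |     import numpy as np
            | 
            |     s16, s64 = np.float16(0.0), np.float64(0.0)
            |     for _ in range(100_000):
            |         s16 += np.float16(0.1)  # rounded at every step
            |         s64 += np.float64(0.1)
            | 
            |     print(s64)  # ~10000.0, as expected
            |     print(s16)  # stalls far below 10000 once the running
            |                 # sum outgrows FP16's resolution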
        
             | markstock wrote:
              | Each node has 4 GPUs, and each of those has a dedicated
              | network interface card capable of 200 Gbps each way. Data
              | can move directly from one GPU's memory to another. But
              | it's not just bandwidth that allows the machine to run so
              | well; it's also a very low-latency network. Many science
              | codes require very frequent synchronizations, and low
              | latency permits them to scale out to tens of thousands of
              | endpoints.
        
       | langcss wrote:
        | Or the world's smallest cloud provider?
        
       ___________________________________________________________________
       (page generated 2024-09-08 23:01 UTC)