[HN Gopher] A day in the life of the fastest supercomputer
___________________________________________________________________
A day in the life of the fastest supercomputer
Author : nradclif
Score : 55 points
Date : 2024-09-04 23:38 UTC (3 days ago)
(HTM) web link (www.nature.com)
(TXT) w3m dump (www.nature.com)
| ungreased0675 wrote:
| I was hoping for a list of projects this system has queued up.
| It'd be interesting to see where the priorities are for something
| so powerful.
| dekhn wrote:
| I haven't been able to find a web-accessible version of their
| SLURM queue, nor could I find the allocations (compute amounts
| given to specific groups). You can see a subset of the
| allocations here: https://www.ornl.gov/news/incite-program-
| awards-supercomputi...
| pelagicAustral wrote:
| You can infer a little from this [0] article:
|
| ORNL and its partners continue to execute the bring-up of
| Frontier on schedule. Next steps include continued testing and
| validation of the system, which remains on track for final
| acceptance and early science access later in 2022 and open for
| full science at the beginning of 2023.
|
| UT-Battelle manages ORNL for the Department of Energy's Office
| of Science, the single largest supporter of basic research in
| the physical sciences in the United States. _The Office of
| Science is working to address some of the most pressing
| challenges of our time. For more information, please visit
| energy.gov/science_
|
| [0] https://www.ornl.gov/news/frontier-supercomputer-debuts-
| worl...
| iJohnDoe wrote:
| The analogies used in this article were a bit weird.
|
| Two things I've always wondered since I'm not an expert.
|
| 1. Obviously, applications must be written to distribute their
| load effectively across the supercomputer. I wonder how often
| this prevents useful things from being considered for a run on
| the supercomputer.
|
| 2. It always seems like getting access to run anything on the
| supercomputer is very competitive, or even artificially limited.
| A shame this isn't open to more people. That much processing
| power seems like it should be put to use for far more things.
| tryauuum wrote:
| I feel like the name "supercomputer" is overhyped. It's just
| many normal x86 machines running Linux and connected by a fast
| network.
|
| Here in Finland I think you can use the LUMI supercomputer for
| free, with the condition that the results should be publicly
| available.
| bjelkeman-again wrote:
| How to get access to Lumi https://www.lumi-
| supercomputer.eu/get-started/
| NegativeK wrote:
| I think you've used the "just" trap to trivialize something.
|
| I'm surprised that Frontier is free with the same conditions;
| I expected researchers to need grant money or whatever to
| fund their time. Neat.
| lokimedes wrote:
| In the beginning they were just "Beowulf clusters" compared
| to "real" supercomputers. Isn't it always like this: the
| romantic and exceptional gets absorbed by the sheer scale of
| the practical and common once someone discovers a way to
| drive the economics at scale? Cars, aircraft, long-distance
| communications, now perhaps AI? Yet the words may still
| capture the early romance.
| markstock wrote:
| FYI: LUMI uses nearly the same architecture as Frontier
| (AMD CPUs and GPUs), and was also built by HPE.
| msteffen wrote:
| My former employer (Pachyderm) was acquired by HPE, who built
| Frontier (and sells supercomputers in general), and I've
| learned a lot about that area since the acquisition.
|
| One of the main differences between supercomputers and, e.g., a
| datacenter is that in the former case, application authors do
| not, as a rule, assume hardware or network failures and
| engineer around them. A typical supercomputer workload fails
| overall if any one of its hundreds or thousands of workers
| fails. This assumption greatly simplifies the work of writing
| such software, since error handling is typically one of the
| biggest, if not the biggest, sources of complexity in a
| distributed system. It makes engineering the hardware much
| harder, of course, but that's how HPE makes money.
|
| A second difference is that RDMA (Remote Direct Memory Access:
| the ability for one computer to access another computer's
| memory without going through that computer's CPU, because the
| network card can access memory directly) is standard. This
| removes all the complexity of an RPC framework from
| supercomputer workloads. Also, the low-level network protocol
| used has orders of magnitude lower latency than Ethernet, such
| that it's often faster to read memory on a remote machine than
| to do any kind of local caching.
|
| The result is that the frameworks for writing these workloads
| let you more or less call an arbitrary function, run it on a
| neighbor, and collect the result in roughly the same amount of
| time it would've taken to run it locally.
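|
| As a rough illustration (not Frontier-specific, and using plain
| MPI in C for the sketch rather than whatever framework a given
| project actually uses), the "call a function on a neighbor and
| collect the result" pattern looks something like this:
|
|     /* neighbor_call.c -- sketch: offload one computation to a
|        neighboring MPI rank and collect the result.
|        Build: mpicc neighbor_call.c -o neighbor_call
|        Run:   mpirun -n 2 ./neighbor_call  (or srun under SLURM) */
|     #include <mpi.h>
|     #include <stdio.h>
|
|     /* The "arbitrary function" we want a neighbor to run. */
|     static double expensive(double x) { return x * x; }
|
|     int main(int argc, char **argv) {
|         MPI_Init(&argc, &argv);
|         int rank;
|         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
|
|         if (rank == 0) {            /* the caller */
|             double arg = 42.0, result;
|             MPI_Send(&arg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
|             MPI_Recv(&result, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
|                      MPI_STATUS_IGNORE);
|             printf("neighbor computed %f\n", result);
|         } else if (rank == 1) {     /* the neighbor doing the work */
|             double arg, result;
|             MPI_Recv(&arg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
|                      MPI_STATUS_IGNORE);
|             result = expensive(arg);
|             MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
|         }
|
|         MPI_Finalize();
|         return 0;
|     }
|
| Note the complete absence of error handling: if either rank
| dies, the whole job aborts, which is exactly the failure model
| described above.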
| kaycebasques wrote:
| What's the documentation like for supercomputers? I.e. when a
| researcher gets approved to use a supercomputer, do they get lots
| of documentation explaining how to set up and run their program?
| I got the sense from a physicist buddy that a lot of experimental
| physics stuff is shared informally and never written down. Or
| maybe each field has a couple popular frameworks for running
| simulations, and the Frontier people just make sure that Frontier
| runs each framework well?
| tryauuum wrote:
| Google openmpi, mpirun, slurm. It's not complex.
|
| It's like Kubernetes, but invented long before Kubernetes was.
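|
| A minimal sketch of what that looks like in practice (assuming
| Open MPI and SLURM, as suggested; module names and launch
| flags vary by site): the same binary starts on every node,
| each copy learns its rank, and the launcher handles placement.
|
|     /* mpi_hello.c -- every rank reports who and where it is.
|        Build: mpicc mpi_hello.c -o mpi_hello
|        Run:   srun -N 2 --ntasks-per-node=4 ./mpi_hello  (SLURM)
|           or: mpirun -n 8 ./mpi_hello                    (Open MPI) */
|     #include <mpi.h>
|     #include <stdio.h>
|
|     int main(int argc, char **argv) {
|         MPI_Init(&argc, &argv);
|
|         int rank, size, len;
|         char host[MPI_MAX_PROCESSOR_NAME];
|         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
|         MPI_Comm_size(MPI_COMM_WORLD, &size);
|         MPI_Get_processor_name(host, &len);
|
|         printf("rank %d of %d on %s\n", rank, size, host);
|
|         MPI_Finalize();
|         return 0;
|     }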
| sega_sai wrote:
| I know that DOE's supercomputing center NERSC has a lot of
| documentation: https://docs.nersc.gov/getting-started/ . They
| also have weekly events where you can ask questions about
| code, optimisation, etc. (I have never attended those, but
| regularly get emails about them.)
| piombisallow wrote:
| Take a look here if you're curious, as an example:
| https://docs.ncsa.illinois.edu/systems/delta/en/latest/
|
| 90% of my interactions are ssh'ing into a login node and
| running code with SLURM, then downloading the data.
| markstock wrote:
| https://docs.olcf.ornl.gov/systems/frontier_user_guide.html
|
| This will have much of what you need.
| kkielhofner wrote:
| I have a project on Frontier - happy to answer any questions!
|
| Funny story about Bronson Messer (quoted in the article):
|
| On my first trip to Oak Ridge we went on a tour of "The Machine".
| Afterwards we were hanging out on the observation deck and got
| introduced to something like 10 people.
|
| Everyone at Oak Ridge is just Tom, Bob, etc. No titles or any of
| that stuff - I'm not sure I've ever heard anyone refer to
| themselves or anyone else as "Doctor".
|
| Anyway, the guy to my right asks me a question about ML
| frameworks or something (don't even remember it specifically).
| Then he says "Sorry, I'm sure that seems like a really basic
| question, I'm still learning this stuff. I'm a nuclear
| astrophysicist by training".
|
| Then someone yells out "AND a three-time Jeopardy champion"!
| Everyone laughs.
|
| You guessed it, guy was Bronson.
|
| Place is wild.
| johnklos wrote:
| > anyone refer to themselves or anyone else as "Doctor".
|
| Reminds me of the t-shirt I had that said, "Ok, Ok, so you've
| got a PhD. Just don't touch anything."
| ai_slurp_bot wrote:
| Hey, my sister Katie is the reason he wasn't a 4-day champ!
| Beat him by $1. She also lost her next game.
| 7373737373 wrote:
| So what is the actual utilization % of this machine?
| nradclif wrote:
| I don't know the exact utilization, but most large
| supercomputers that I'm familiar with have very high
| utilization, like around 90%. The Slurm/PBS queue times can
| sometimes be measured in days.
| dauertewigkeit wrote:
| Don't the industry labs have bigger machines by now? I lost
| track.
| Mistletoe wrote:
| Not any that we know about.
|
| https://top500.org/lists/top500/list/2024/06/
| cubefox wrote:
| > With its nearly 38,000 GPUs, Frontier occupies a unique public-
| sector role in the field of AI research, which is otherwise
| dominated by industry.
|
| Is it really realistic to assume that this is the "fastest
| supercomputer"? What are estimated sizes for supercomputers used
| by OpenAI, Microsoft, Google etc?
|
| Strangely enough, the Nature piece only mentions possible secret
| military supercomputers, but not ones used by AI companies.
| rcxdude wrote:
| There is a difference between a supercomputer and just a large
| cluster of compute nodes, mainly in the bandwidth between the
| nodes. I suspect industry uses a larger number of smaller
| groups of highly connected GPUs for AI work.
| p1esk wrote:
| Do you mean this supercomputer has slower internode links?
| What are its links? For example, xAI just brought up a
| 100k-GPU cluster, most likely with 800 Gbps internode links,
| or maybe even double that.
|
| I think the main difference is in the target numerical
| precision: supercomputers such as this one focus on
| maximizing FP64 throughput, while GPU clusters used by OpenAI
| or xAI want to compute in 16- or even 8-bit precision (BF16
| or FP8).
| markstock wrote:
| Each node has 4 GPUs, and each of those has a dedicated
| network interface card capable of 200 Gbps each way. Data
| can move straight from one GPU's memory to another. But it's
| not just bandwidth that allows the machine to run so well;
| it's also a very low-latency network. Many science codes
| require very frequent synchronizations, and low latency
| permits them to scale out to tens of thousands of
| endpoints.
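|
| (Back-of-the-envelope arithmetic, not an official spec: 4
| NICs x 200 Gbps comes to roughly 800 Gbps, or about 100 GB/s,
| of injection bandwidth per node in each direction.)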
| langcss wrote:
| Or the world's smallest cloud provider?
___________________________________________________________________
(page generated 2024-09-08 23:01 UTC)