[HN Gopher] The AI Research SuperCluster
___________________________________________________________________
The AI Research SuperCluster
Author : minimaxir
Score : 56 points
Date : 2022-01-24 18:34 UTC (4 hours ago)
(HTM) web link (ai.facebook.com)
(TXT) w3m dump (ai.facebook.com)
| zwieback wrote:
| What's the track record on "super computers"? It seems risky to
| pour so much money into a platform that will be superseded in a
| few years' time. Then again, it's not clear that there are really
| good alternatives.
| halfeatenpie wrote:
| Well it's partially the cost/benefit analysis of:
|
| 1. How much benefit will we get?
|
| 2. Will this benefit be higher than the cost of purchasing this
| right now?
|
| 3. What other alternatives will satisfy our needs?
|
| For most of these solutions, the answers are:
|
| 1. A lot. We need computing power to perform these analyses
| faster and to stay competitive.
|
| 2. Yes, as continuing to innovate in this space will keep us
| competitive and give our scientists the resources to remain
| productive.
|
| 3. AWS/GCP/Azure are alternatives, sure, but given the rate at
| which Meta probably uses these resources, it probably costs them
| less to build this out than to pay AWS/GCP/Azure for access to
| the same hardware.
| tedivm wrote:
| Major GPU architecture changes only happen every few years.
|
| * K80 in 2014, and this really was not that great of a chip.
|
| * V100 in 2017, and I'd consider this the first "built from the
| ground up" ML chip.
|
| * A100 in late 2020, with the first major cloud general
| availability being in 2021.
|
| Even when new chips come out, the old ones are still usable: you
| can rent K80s fairly cheaply on all the cloud providers, and they
| have kept a surprising amount of their resale value. The V100s
| are also very much still in demand.
|
| The A100 is also an amazing system: the new NVSwitch
| architecture means the A100s work together far, far better than
| their V100 counterparts. I was part of an upgrade project
| setting up an A100 cluster with InfiniBand, and it really is
| amazing how well these chips work together. That communication
| barrier was a pretty obvious next step, though (the K80s had
| crap inter-GPU communication, the V100s introduced NVLink, and
| NVSwitch was the obvious way to go). There isn't an obvious next
| step, and I expect the A100s to be the standard for at least the
| next four years (with lots of continued use after that).
| boulos wrote:
| You skipped the T4 which was in between Volta and Ampere, as
| well as the P100 between K80 and V100. So I'd say "meaningful
| chip changes" is closer to every 18 months.
|
| The T4, though, isn't a "big part", but for people who fit
| within its envelope it's a huge win (since its cost is so much
| lower). A lot of deep learning folks had built out
| Turing-based workstations in that time period, and I think
| they're still reasonable value for money.
| codeulike wrote:
| The Terminator franchise is based around the folly of letting an
| AI control a nuclear arsenal. But here we are building the
| biggest AI ever and letting it analyse our social interactions.
| Think of the power this could have if it goes rogue! It could
| manipulate entire populations by minutely controlling what they
| see and read. Surely if manipulation at that scale and fidelity
| became possible it would be something to be concerned about?
|
| .... Oh wait ....
| georgeecollins wrote:
| The premise of the science fiction novel "After On" is that the
| first AI to reach sentience is running a dating app. It's
| actually a good, well-researched book.
| abhinai wrote:
| I wish researchers outside Meta were allowed to rent this
| SuperCluster for maximum benefit to humanity.
| forgotmyoldacc wrote:
| Why not just rent AWS/Azure/GCP instead? They're all about the
| same: top-of-the-line enterprise GPUs with fast interconnect.
| tedivm wrote:
| They are not the same at all. AWS has the best GPU instances
| right now, but there are some huge differences in networking
| speed. The P4 instances have 400Gbps per machine, with 8 GPUs on
| each machine. If you were to self-host a cluster using the
| standard DGX machines you get 200Gbps per GPU, for a total of
| 1600Gbps for just the GPUs. The DGX machine has another two
| InfiniBand ports that can be used to attach to storage at pretty
| intense speeds as well.
|
| This makes a huge difference when using more than a single
| machine. I've done the math and purchased the machines at a
| previous company: assuming you aren't leaving the machines
| idle most of the time, you save a considerable amount of money
| and get a lot better performance when building your own
| cluster.
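|
| Roughly, in Python (just a sketch of the arithmetic above; the
| 400Gbps and 200Gbps figures are the per-instance and per-GPU
| numbers quoted above, not measured throughput):
|
|     gpus_per_node = 8
|
|     # cloud instance: 8 GPUs sharing one 400Gbps network pipe
|     cloud_node_gbps = 400
|
|     # DGX A100: one 200Gbps InfiniBand link per GPU
|     dgx_gpu_gbps = 200
|     dgx_node_gbps = dgx_gpu_gbps * gpus_per_node  # 1600 Gbps
|
|     print(f"cloud: {cloud_node_gbps / gpus_per_node:.0f} Gbps/GPU")
|     print(f"DGX:   {dgx_gpu_gbps} Gbps/GPU "
|           f"({dgx_node_gbps} Gbps per node for GPU traffic)")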
| boulos wrote:
| Disclosure: I used to work for Google Cloud.
|
| > you save a considerable amount of money and get a lot
| better performance when building your own cluster.
|
| This _heavily_ depends on how much benefit you get from
| improving GPU performance from each generation. A lot of
| people assume a 3- or 4-year TCO. If you instead "rent" for 1
| year at a time, you've been getting >2x benefit per generation
| lately.
|
| Most folks also measure "occupancy" for clusters like this
| rather than "utilization". That is, if a job is using 128
| "GPUs" that counts as 128 in use. But that ignores that
| many jobs might have been just fine with T4s (which are a
| lot cheaper) versus A100s. (Depends a lot on the model, the
| I/O, etc.) Once you've bought a physical cluster, you're
| kind of stuck with that configuration (for better or
| worse).
|
| tl;dr: It's not just about "idle".
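|
| A toy example of that distinction, in Python (all numbers here
| are hypothetical, just to show the bookkeeping):
|
|     cluster_gpus = 256
|
|     # (gpus_allocated, could_a_cheaper_gpu_have_done_the_job)
|     jobs = [(128, False), (64, True), (32, True)]
|
|     occupied = sum(g for g, _ in jobs)
|     right_sized = sum(g for g, cheaper_ok in jobs if not cheaper_ok)
|
|     print(f"occupancy: {occupied / cluster_gpus:.0%}")
|     print(f"jobs that really needed these GPUs: "
|           f"{right_sized / occupied:.0%} of occupied GPUs")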
| tedivm wrote:
| These generations tend to be three years apart though, so
| if you're buying as the new generation comes out then
| your total TCO period has you running almost peak
| hardware (there were several versions of the V100, each
| with minor improvements). Many vendors also offer buy-back
| and upgrade programs.
|
| At the same time, it's hard to overstate how different the
| prices are here. Our break-even point for using on-prem
| compared to AWS was about nine months. After that we saved
| money for the rest of that hardware's lifetime.
|
| I definitely agree that people shouldn't just rush out
| and buy these without benchmarking and examining their
| use case. The cloud is really good for this! At the same
| time, though, I have yet to see any cloud provider offer
| anything even approaching the interconnect I can get
| on-prem, which means it's basically impossible to get the
| same performance out of the cloud as on-prem right now.
| boulos wrote:
| > using on prem compared to AWS was about nine months.
|
| At _list_ price or a moderate discount. The folks at this
| scale aren't paying that :).
| dekhn wrote:
| the network topology and the network switch itself also can
| make a huge difference depending on traffic conditions; so
| you might have tons of fat NICs per GPU but if all of them
| want to alltoall, you better have a ton of cross section
| bandwidth.
|
| I always wonder about performance on these clusters. Back
| in MY day, I'd wait a week or more for results from my
| jobs, and immediately resubmit to wait two weeks in a queue
| for another week of runtime and do lots of data processing
| in the downtime. Then I moved to the cloud and decided on
| "what can I afford to do overnight" (i.e., I set my time to
| result to be about 12 hours). I have a hard time justifying
| additional hardware to get results in 10 minutes versus a
| day; it seems like at that point you're just using it to
| get fast cycle times on new ideas, but who has a new idea
| every 10 minutes?
| tedivm wrote:
| > the network topology and the network switch itself also
| can make a huge difference depending on traffic
| conditions; so you might have tons of fat NICs per GPU
| but if all of them want to alltoall, you better have a
| ton of cross section bandwidth.
|
| This is the beauty of the new NVSwitch chips and of using
| InfiniBand networks instead of Ethernet. Anyone who is
| doing this is setting up a fully switched, high-bandwidth
| InfiniBand network with ridiculous traffic between the
| machines. Nvidia purchased Mellanox a year or two ago;
| combine that with the ridiculously awesome NVSwitch in the
| A100 DGX machines and there's a huge jump in cross-chip
| traffic capability. At the same time, though, a decent
| Mellanox switch is probably going to set you back $30k.
| dekhn wrote:
| I'm not aware of any cost-effective switch that permits
| scaling all-to-all to arbitrary sizes. That's my point.
| Modern InfiniBand uses nearly all the same tech as previous
| supercomputers, but with faster interfaces, and more of
| them. For example, the Facebook cluster is a dual-layer
| Clos network, which is one of a few cost-effective ways to
| get very high randomly distributed traffic, but all-to-all
| communication scales as n^2, and n^2 wires get expensive
| fast.
|
| Better to find algorithms that need less communication
| than to make faster computers that allow you to write
| algorithms that need lots of communication. Otherwise
| you'll always pay $$$ to reach peak GPU performance.
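|
| To make the n^2 point concrete (a minimal sketch that just
| counts the distinct communicating pairs for n endpoints):
|
|     for n in (8, 64, 512, 4096):
|         pairs = n * (n - 1) // 2   # all-to-all pairs, O(n^2)
|         print(f"{n:5d} endpoints -> {pairs:>10,d} pairs")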
| ska wrote:
| On-prem for nearly anything is going to be at least a bit
| of a win if your utilization is uniform or predictable. The
| real win for not doing it is in adaptability.
| tedivm wrote:
| Yeah, definitely, but I think it's important to talk
| about the scale of that win.
|
| I mentioned in another comment that GPU generations are
| roughly three years apart between major architecture
| upgrades; this has held true for a while now, and that gap
| may even stretch out a little. When the average company
| builds one of these clusters it's safe to assume they'll
| either run it for three years or sell it back for some
| return.
|
| Going with the cloud, and assuming you don't commit to
| several years (losing that adaptability), the yearly cost
| of a p4 instance is $287,087. Over three years that's
| $861,261 to run a single machine. For about $450k you can
| build out a solid two-machine (16-GPU) cluster (including
| InfiniBand networking gear and a solid NAS) that will
| easily last three years. There are datacenters which
| specialize in this and companies that can manage these
| machines. If you don't have the cash up front you can
| lease the machines on good terms and your yearly bill will
| still be much lower than AWS.
|
| Model training is basically the one use case where I'm
| really willing to purchase equipment instead of using the
| cloud. The money it saves is enough to hire one or two
| more staff members, and the maintenance is shockingly low
| if you get it set up right from the start.
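|
| Spelled out with the rough numbers above (just a sketch; the
| cloud figure is the on-demand yearly cost mentioned earlier,
| and the on-prem figure is my estimate for the two-machine
| cluster):
|
|     cloud_per_machine_per_year = 287_087
|     machines, years = 2, 3
|
|     cloud_total = cloud_per_machine_per_year * machines * years
|     on_prem_total = 450_000   # 2 nodes + InfiniBand gear + NAS
|
|     months_to_break_even = (
|         on_prem_total / (cloud_per_machine_per_year * machines) * 12)
|
|     print(f"cloud, {machines} machines x {years} yr: ${cloud_total:,}")
|     print(f"on-prem estimate: ${on_prem_total:,}")
|     print(f"break-even after ~{months_to_break_even:.0f} months")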
| halfeatenpie wrote:
| You can rent cluster access (to an extent) using Azure Batch.
|
| Granted it's probably not at this scale, but it gives you
| access to a ton of resources.
| ankeshanand wrote:
| You can also rent a Cloud TPU v4 pod
| (https://cloud.google.com/tpu), which has 4096 TPU v4 chips
| with fast interconnect, amounting to around 1.1 exaflops of
| compute. It won't be cheap though (in excess of $20M/year,
| I believe).
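|
| Back of the envelope (assuming Google's quoted peak of roughly
| 275 bf16 TFLOPS per v4 chip):
|
|     chips = 4096
|     tflops_per_chip = 275          # peak bf16 per TPU v4 chip
|     print(f"~{chips * tflops_per_chip / 1e6:.2f} exaflops peak")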
| caaqil wrote:
| This thread should probably be merged with this one:
| https://news.ycombinator.com/item?id=30062019
| kn8a wrote:
| Are there any alternatives to gradient-based learning that could
| make this less useful? Is there another type of compute unit that
| is the next evolution of CPU -> GPU -> ?
| winterismute wrote:
| It's a tough question. It's not even just back-propagation but
| sometimes the "parameters" of the models themselves: for
| example, [1] shows that models such as ResNeXt already perform
| better on a very different architecture such as Graphcore's,
| for some sizes of convolutions. Older models, or models that
| get tuned for existing GPUs, do not perform as well.
|
| It's tough to come up with a new architecture that has an
| advantage on both current and future models, at least from a
| peak-perf point of view; from a perf/watt point of view, for
| example, the scaled-up Apple GPUs instead seem to show
| interesting new properties. But the Graphcore architecture is
| quite interesting, being able to act somehow as both a SIMD
| machine and a task-parallel machine.
|
| [1] https://arxiv.org/pdf/1912.03413v1.pdf
| randcraw wrote:
| Based on the following constraints that lie at the center of AI
| and parallelism, I'd say no: stochastic gradient pursuit using
| vector processors like GPUs is inescapable in all future AI
| advances.
|
| 1) All AI is based in search (esp. non-convex, where heuristics
| are insufficient to provide a global convex solution), and thus
| is inevitably implemented using iteration, driven locally by
| gradient-pursuit and globally by... ways to efficiently gather
| information to optimize the loss function that measures how
| well that info gain is being refined and exploited.
|
| 2) Search that is inherently non-convex and inefficient
| requires as much compute power as possible, i.e. using
| supercomputers.
|
| 3) All supercomputer-based solutions to non-convex problems are
| implemented iteratively, where results are improved not using
| closed-form math or complete info, but by incremental
| optimization of partial results that aggregate with the
| iterations, like repeated stochastic gradient descent that
| creates and enhances 'resonant' clusters of 'neurons'.
|
| 4) The only form of supercomputing that has proven to scale up
| anywhere near indefinitely is data parallelism (a dataflow-
| specific form of SIMD), where the search space is spread as
| evenly (and naively) as possible across as many processing
| elements as possible.
|
| 5) Vector-processing hardware like GPUs implements data
| parallelism as well as any HPC architecture yet devised.
|
| Thus, I believe that AI is stuck with GPUs, or equivalent
| meshes of vector processors, indefinitely.
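|
| The kind of iterative, locally gradient-driven loop described
| in (1)-(3), stripped down to a one-parameter toy in plain
| Python (the data, learning rate, and step count are made up
| purely for illustration):
|
|     import random
|
|     # toy data: y = 3x plus a little noise
|     data = [(i / 100, 3 * i / 100 + random.gauss(0, 0.05))
|             for i in range(100)]
|
|     w, lr = 0.0, 0.1          # single parameter, learning rate
|     for step in range(5000):
|         x, y = random.choice(data)     # "stochastic": one sample
|         grad = 2 * (w * x - y) * x     # d/dw of (w*x - y)^2
|         w -= lr * grad                 # small local improvement
|
|     print(f"fitted w = {w:.2f} (true slope is 3)")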
| buildbot wrote:
| It's not larger than Microsoft's?
| https://blogs.microsoft.com/ai/openai-azure-supercomputer/
| boulos wrote:
| That was a V100 cluster though. 10k V100s is less powerful (for
| ML stuff) than ~6k A100s.
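|
| Roughly (assuming NVIDIA's peak dense FP16 tensor-core numbers,
| ~125 TFLOPS per V100 and ~312 TFLOPS per A100):
|
|     v100s, a100s = 10_000, 6_000
|     print(f"10k V100s: ~{v100s * 125 / 1e6:.2f} exaflops peak")
|     print(f"~6k A100s: ~{a100s * 312 / 1e6:.2f} exaflops peak")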
| mooneater wrote:
| > All this infrastructure must be extremely reliable, as we
| estimate some experiments could run for weeks and require
| thousands of GPUs
|
| Is it hardware-fault tolerant? Curious how well this will work
| otherwise as it scales.
| etaioinshrdlu wrote:
| It is interesting that it only allows training on anonymized and
| encrypted data. I wonder how much these restrictions slow down
| their research?
|
| Although, they are definitely a good idea considering the data
| source.
___________________________________________________________________
(page generated 2022-01-24 23:03 UTC)