Subj : xAI cluster is now the most powerful AI training system in the wo
To   : All
From : TechnologyDaily
Date : Tue Sep 17 2024 03:15:05

xAI cluster is now the most powerful AI training system in the world but questions remain over storage capacity, power usage and why it's actually called Colossus

Date: Tue, 17 Sep 2024 02:02:00 +0000
Description: Elon Musk says xAI's Colossus AI training system, built in just 122 days, is now online.

FULL STORY
======================================================================

We recently got a glimpse of what $1 billion worth of AI GPUs looks like when Elon Musk shared a brief video tour of Cortex, X's AI training supercomputer currently under construction at Tesla's Giga Texas plant. More recently, Musk took to his social media platform to announce that Colossus, a new 100k H100 training cluster, is now up and running.

Musk claims that Colossus is "the most powerful AI training system in the world" and that it was built "from start to finish" in just 122 days. That's quite an achievement. Servers for the xAI cluster were reportedly provided by Dell and Supermicro, and the cost of the project is estimated at between $3 billion and $4 billion.

"This weekend, the @xAI team brought our Colossus 100k H100 training cluster online. From start to finish, it was done in 122 days. Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200k (50k H200s) in a few months. Excellent" (September 2, 2024)

Where does Colossus get its name?

Tom's Hardware notes, "Although all of these clusters are formally operational and even training AI models, it is entirely unclear how many are actually online today. First, it takes some time to debug and optimize the settings of those superclusters. Second, X needs to ensure that they get enough power, and while Elon Musk's company has been using 14 diesel generators to power its Memphis supercomputer, they were still not enough to feed all 100,000 H100 GPUs."
The Colossus system is set to eventually double in capacity, with plans to incorporate an additional 100,000 GPUs - 50,000 H100 units and 50,000 of Nvidia's next-gen H200 chips. The supercluster will primarily be used to train xAI's Grok-3, the company's latest, most advanced AI model. We've yet to see any mention of storage for the new system, but it will need to be huge.

The naming of the new supercomputer has raised more than a few eyebrows, however, with people noting that it shares its name with a 1970 sci-fi movie (based on a 1966 novel by D.F. Jones) about a supercomputer that becomes sentient after being given control of the US nuclear arsenal. Things, predictably, go horribly wrong for humanity. Both the novel and the film explore timely themes: AI autonomy, the dangers of relinquishing control to machines, and the ethical implications of artificial intelligence.

It's possible that Musk wasn't aware of this when the name was chosen for his new AI training system, and it might have been selected purely to emphasize the sheer scale of the supercluster. Then again, given Musk's track record, it wouldn't be surprising if the reference was entirely intentional - he knows exactly what he's doing.

======================================================================
Link to news story:
https://www.techradar.com/pro/xai-cluster-is-now-the-most-powerful-ai-training-system-in-the-world-but-questions-remain-over-storage-capacity-power-usage-and-why-it-s-actually-called-colossus

--- Mystic BBS v1.12 A47 (Linux/64)
 * Origin: tqwNet Technology News (1337:1/100)