[HN Gopher] Intel Gaudi 3 the New 128GB HBM2e AI Chip in the Wild
       ___________________________________________________________________
        
       Intel Gaudi 3 the New 128GB HBM2e AI Chip in the Wild
        
       Author : rbanffy
       Score  : 116 points
       Date   : 2024-04-22 15:56 UTC (7 hours ago)
        
 (HTM) web link (www.servethehome.com)
 (TXT) w3m dump (www.servethehome.com)
        
       | loudmax wrote:
       | The "in the wild" part of the title is misleading. These chips
       | are being presented in a very controlled environment.
       | 
       | An interesting aspect of Intel's design is they use Ethernet for
       | connectivity. If they can get the performance on par with NVLink,
       | that by itself could be a win because everybody knows how to
       | manage Ethernet. Very few people know how to manage an NVLink
       | network.
       | 
       | To be clear, this is data center hardware. The lower-power
       | versions of these cards consume around 600W, and the article
       | makes no mention of pricing.
        
         | benreesman wrote:
         | I agree that it's misleading to act like this is a product in
         | market, we don't _really_ even know if the yield will happen.
         | 
         | But it's a serious thing if it happens.
        
         | foobiekr wrote:
         | Very few people actually know how to provision and manage a
         | lossless Ethernet fabric, and I'd wager someone who had
         | literally never touched InfiniBand would have an easier time
         | accomplishing it from zero with IB than with Ethernet on real
         | vendor gear.
         | 
         | Ethernet has so, so many gotchas. Maybe if it was a layer 3
         | only network it would work. Maybe.
        
           | alfalfasprout wrote:
           | I'm guessing it's RDMA over Ethernet too, which often has a
           | lot of gotchas depending on the exact hardware being used.
        
             | epistasis wrote:
             | Ironically enough, since NVIDIA bought Mellanox, it's
             | likely that the best-documented route to get RoCE v2 going
             | is with switches purchased from NVIDIA...
             | 
             | Edit: and yes, it's RDMA over Ethernet:
             | https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Archit...
        
         | dogma1138 wrote:
         | You don't need to manage NVLink.
         | 
         | NVLink either talks native NVLink to itself, when you are using
         | NVLink switches intra-server or intra-rack, or
         | 
         | it can talk PCIe over NVLink when talking to a PCIe endpoint.
         | 
         | Or you can run InfiniBand or Ethernet on top of it and talk to
         | whatever is on the other side.
         | 
         | Gaudi isn't that different; remember, Ethernet != TCP/IP.
        
           | dboreham wrote:
           | So it works with Ethernet switches?
        
             | dogma1138 wrote:
             | What does?
        
         | latchkey wrote:
         | > An interesting aspect of Intel's design is they use Ethernet
         | for connectivity.
         | 
         | Interestingly, Tenstorrent is doing something similar with
         | their Wormhole cards.
         | 
         | I'm not convinced yet that it is the right way to go. If the
         | switching fabric on the card fails, you lose the whole card.
         | Keeping it separated out is a bit less risky, at the cost of
         | some speed.
         | 
         | I'm more partial to composable fabrics, but they aren't ready
         | yet for PCIe5 and we have PCIe6 just around the corner next
         | year.
        
           | wmf wrote:
           | You want each ASIC to have 24 external NICs (so 192 NICs for
           | a server?) with all the cabling/backplanes that would
           | require?
        
             | latchkey wrote:
             | They are 24x200G, which is already outdated. Everything we
             | are doing is currently 400G (via 8xCX7 cards running in
             | Ethernet mode) and 800G at the spine. 800G NICs, which will
             | come with PCIe6 next year, cut the number of connections
             | down.
             | 
             | What I'd prefer is for the connection to go through the
             | UBB/OAM baseboard, so that you have PCIe connections. Look into
             | what GigaIO and Liqid are doing. There is a 3rd option that
             | is even cooler than those two, but I don't want to mention
             | it here. ;-)
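The port counts traded in this subthread reduce to simple arithmetic. Below is a minimal Python sketch using only figures quoted above: 24x200G per ASIC, 192 NICs per server (which implies 8 ASICs), and 8 CX7 cards at 400G. The function name `aggregate_gbps` and the constant names are illustrative, not from any vendor tool. The totals are raw line rates and say nothing about how ports are split between in-server and scale-out links.

```python
# Figures quoted in the thread; all names here are illustrative only.
GAUDI3_PORTS_PER_ASIC = 24      # "24x200G"
GAUDI3_GBPS_PER_PORT = 200
ASICS_PER_SERVER = 192 // 24    # "192 NICs for a server" implies 8 ASICs
CX7_NICS_PER_SERVER = 8         # "8xCX7 cards running in ethernet mode"
CX7_GBPS_PER_NIC = 400


def aggregate_gbps(ports: int, gbps_per_port: int) -> int:
    """Raw aggregate line rate in Gb/s, ignoring protocol overhead."""
    return ports * gbps_per_port


per_asic = aggregate_gbps(GAUDI3_PORTS_PER_ASIC, GAUDI3_GBPS_PER_PORT)
ports_per_server = GAUDI3_PORTS_PER_ASIC * ASICS_PER_SERVER
per_server_integrated = aggregate_gbps(ports_per_server, GAUDI3_GBPS_PER_PORT)
per_server_discrete = aggregate_gbps(CX7_NICS_PER_SERVER, CX7_GBPS_PER_NIC)

print(f"Integrated: {ports_per_server} ports, {per_server_integrated} Gb/s per server")
print(f"Discrete:   {CX7_NICS_PER_SERVER} NICs, {per_server_discrete} Gb/s per server")
```

Under these assumptions the integrated design exposes far more raw bandwidth per server, at the cost of the port and cabling count questioned above.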
        
         | choppaface wrote:
         | Many years ago I met Naveen Rao and tried to demo the Nervana-
         | derived Intel card, which at the time Facebook and a couple
         | others were sampling. During more formal talks, Intel sent him
         | literally surrounded by a Xeon sales team that sidetracked the
         | whole meeting.
         | 
         | When these Intel GPUs are "in the wild" it actually means Xeon
         | salespeople are out on the hunt.
        
         | conradev wrote:
         | RoCE is an InfiniBand Trade Association (IBTA) standard for this:
         | https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet
         | 
         | In my understanding one of the big advantages of the protocol
         | (v2, that is) is that it is routed over IP and can work with
         | existing switches ($$) instead of needing specialized ones
         | ($$$$)
        
           | wmf wrote:
           | RoCE never required expensive switches but getting the PFC
           | configuration right can be tricky.
        
       | jsight wrote:
       | So, where can I use instances of Gaudi 3 and what is the hourly
       | price for these instances?
        
         | sidkashyap wrote:
         | https://www.intel.com/content/www/us/en/developer/tools/devc...
        
         | latchkey wrote:
         | This is something I'd like to offer via my business (Hot Aisle)
         | at some point in the near future. Right now, we are just
         | getting started and focused on MI300x, but the long term goal
         | is to offer any type of high end compute that people are
         | willing to rent.
        
           | alchemist1e9 wrote:
           | I'm interested and have questions, but
           | https://www.hotaisle.xyz/ doesn't exactly provide a lot of
           | answers.
        
             | latchkey wrote:
             | Sorry about that. We are just getting started, so the
             | lowest priority right now is the website.
             | 
             | Additionally, due to the KYC requirements around these GPUs
             | (due to US export controls), we really want to get to know
             | our customers first.
             | 
             | Feel free to ping me on email and happy to get on a call
             | and talk more.
        
           | CapeTheory wrote:
           | What USP are you aiming for, to differentiate from the many
           | companies who have tried and failed to offer some form of
           | HPCaaS over the last 10-15 years?
        
             | latchkey wrote:
             | Great question! I'm going to answer it the only way I know
             | how... with a bit of a story of the history of things.
             | Sorry if this bores you.
             | 
             | The problem I realized over a year ago was that nobody had
             | hourly rental access to high end AMD GPUs. In addition,
             | access to high end Nvidia was equally difficult. I signed
             | up for a CoreWeave account, put in my credit card and was
             | told a few weeks later that my account was not approved.
             | 
             | In effect, the only way to get access to super high end
             | compute was to be involved in HPC, and that requires
             | connections. At the time, we also didn't even know if AMD
             | was going to seriously adopt AI as a strategy.
             | 
             | My view was that there were actually two problems: a lack
             | of general access, and everyone putting all their eggs into
             | a single basket, mostly because of that lack of access and
             | because AMD was lacking a great developer flywheel story.
             | 
             | I spent August to December building a business plan,
             | closing funding, forming the business, hiring my co-founder
             | full time, securing data center space, securing direct
             | relationships with vendors, and designing the system we
             | were going to deploy. There are a million other little
             | details in there, but this is long enough as it is.
             | 
             | Oct/Nov of last year rolls around and suddenly AMD has
             | changed their tune. Lisa Su doubles down. Dec 6th, MI300x
             | rolls out. We made our first PoC order in January, received
             | it in March. It just goes to show how cutting edge and how
             | long all of this takes. 3 more small (not hyperscaler)
             | businesses sprang up during that time, all offering
             | effectively the same product. We went to the data center,
             | deployed our PoC and about 2 weeks later, we had our first
             | customer onboarded. I call all of that validation, and was
             | able to secure further funding based on it.
             | 
             | To answer your question, I'm not sure that I need a
             | specific USP. The demand for compute isn't going down. If I
             | have a product that people want, and I can offer them
             | ethical, honest, truthful, great service around that
             | product, all based on decades of experience, can't that be
             | enough? My investors and I believe so.
        
       | 1024core wrote:
       | About time. NVIDIA needs some serious competition.
        
         | talldayo wrote:
         | _pokes OpenCL 's corpse with a stick_
         | 
         | C'mon, do something...
        
           | ein0p wrote:
           | Ironically, Transformers are relatively simple architectures
           | - all you really need is a high performance matmul. So OpenCL
           | could "do something" at this point, if it were alive.
        
           | imtringued wrote:
           | https://github.com/ROCm/ROCm/issues/2754
           | 
           | Wow and I thought that the latest generation of GPUs was
           | better.
        
         | FuriouslyAdrift wrote:
         | AMD is doing several billion in AI processor sales already and
         | the new chip is selling as fast as they can make them. At least
         | with AMD, a customer can actually get them now as opposed to
         | the nearly 1 year lead time from NVIDIA.
        
           | latchkey wrote:
           | Confirmed. Buying them up as fast as I can. =)
        
           | fransje26 wrote:
           | Now, if they could also do the performant, unified, software
           | and driver part..
        
         | doctorpangloss wrote:
         | The only meaningful hardware competition, meaning lower prices,
         | will come from Chinese designed, Chinese manufactured parts.
         | This is still a long ways out.
         | 
         | Is it inevitable? I think so. Before 2019 there wasn't an
         | opportunity, now there is.
         | 
         | For software, Chinese universities, Alibaba, Tencent and
         | ByteDance are already releasing models, training code and in
         | rare cases datasets that are competitive with private
         | offerings. CogVLM/CogAgent is one that I use. It's very
         | promising.
        
           | elzbardico wrote:
           | How much time for that? I wouldn't expect anything in
           | industrial volumes for the next few years, maybe 2028? Who
           | knows?
           | 
           | But, anyway, we will probably be prohibited from buying it.
           | We still can't buy Cuban cigars.
        
             | wmf wrote:
             | I don't think we'll be legally prohibited from buying it
             | but there will be zero English docs (see Allwinner and
             | such). Maybe if you're lucky you'll get an uncommented code
             | dump with a forked years-old version of PyTorch.
        
         | rbanffy wrote:
         | Competition doesn't do much when all production everywhere is
         | already taken in preorders. It'll only change when there is
         | surplus production.
        
       | seventytwo wrote:
       | What is the row of green rectangles in the middle of the long
       | edges?
        
       ___________________________________________________________________
       (page generated 2024-04-22 23:01 UTC)