[HN Gopher] Intel Gaudi 3 the New 128GB HBM2e AI Chip in the Wild
___________________________________________________________________
Intel Gaudi 3 the New 128GB HBM2e AI Chip in the Wild
Author : rbanffy
Score : 116 points
Date : 2024-04-22 15:56 UTC (7 hours ago)
(HTM) web link (www.servethehome.com)
(TXT) w3m dump (www.servethehome.com)
| loudmax wrote:
| The "in the wild" part of the title is misleading. These chips
| are being presented in a very controlled environment.
|
| An interesting aspect of Intel's design is they use Ethernet for
| connectivity. If they can get the performance on par with NVLink,
| that by itself could be a win because everybody knows how to
| manage Ethernet. Very few people know how to manage an NVLink
| network.
|
| To be clear, this is data center hardware. The lower power
| versions of these cards consume around 600W, and the article
| makes no mention of pricing.
| benreesman wrote:
| I agree that it's misleading to act like this is a product in
| market; we don't _really_ even know if the yield will happen.
|
| But it's a serious thing if it happens.
| foobiekr wrote:
| Very few people actually know how to provision and manage a
| lossless Ethernet fabric and I'd wager someone who had
| literally never touched infiniband would have an easier time
| accomplishing it from zero with IB than with Ethernet on real
| vendor gear.
|
| Ethernet has so, so many gotchas. Maybe if it was a layer 3
| only network it would work. Maybe.
| alfalfasprout wrote:
| I'm guessing it's RDMA over ethernet too which often has a
| lot of gotchas depending on the exact hardware being used.
| epistasis wrote:
| Ironically enough, since NVIDIA bought Mellanox, it's
| likely that the best documented route to get RoCE v2 going
| is with switches purchased from NVIDIA...
|
| Edit: and yes, it's RDMA over ethernet:
| https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Archit...
| dogma1138 wrote:
| You don't need to manage NVLink.
|
| NVLink either talks native NVLink to itself, when you are using
| NVLink switches intra-server or intra-rack, or:
|
| It can talk PCIe over NVLink when talking to a PCIe endpoint.
|
| Or you can run Infiniband or Ethernet on top of it and talk to
| w/e is on the other side.
|
| Gaudi isn't that different; remember, Ethernet != TCP/IP.
| dboreham wrote:
| So it works with Ethernet switches?
| dogma1138 wrote:
| What does?
| latchkey wrote:
| > An interesting aspect of Intel's design is they use Ethernet
| for connectivity.
|
| Interestingly, Tenstorrent is doing something similar with
| their Wormhole cards.
|
| I'm not convinced yet that it is the right way to go. If the
| switching fabric on the card fails, you lose the whole card.
| Keeping it separated out is a bit less risky, at the cost of
| some speed.
|
| I'm more partial to composable fabrics, but they aren't ready
| yet for PCIe5 and we have PCIe6 just around the corner next
| year.
| wmf wrote:
| You want each ASIC to have 24 external NICs (so 192 NICs for
| a server?) with all the cabling/backplanes that would
| require?
| latchkey wrote:
| They are 24x200G, which is already outdated. Everything we
| are doing is currently 400G (via 8xCX7 cards running in
| ethernet mode) and 800G at the spine. 800G NICs, which will
| come with PCIe6 next year, cut the number of connections
| down.
|
| What I'd prefer is the connection is through the UBB/OAM
| baseboard, such that you have PCIe connections. Look into
| what GigaIO and Liqid are doing. There is a 3rd option that
| is even cooler than those two, but I don't want to mention
| it here. ;-)
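|
| The port counts in this exchange can be checked with a small
| back-of-the-envelope sketch. It uses only the figures quoted in
| the thread (24 x 200G per ASIC, 192 ports per server) and assumes
| 8 ASICs per server, which is implied by 192 / 24. It treats every
| port as an external link, which is the hypothetical being argued
| about, not necessarily how a real server is wired.

```python
# Back-of-the-envelope port math for the figures quoted in this thread.
PORTS_PER_ASIC = 24      # from the thread: 24 x 200G per ASIC
PORT_SPEED_GBPS = 200    # from the thread
ASICS_PER_SERVER = 8     # assumption, implied by "192 NICs for a server"


def aggregate_gbps(ports: int, speed_gbps: int) -> int:
    """Total one-direction bandwidth across all ports, in Gb/s."""
    return ports * speed_gbps


def ports_needed(total_gbps: int, speed_gbps: int) -> int:
    """Ports required to carry total_gbps at a given port speed, rounded up."""
    return -(-total_gbps // speed_gbps)


per_asic = aggregate_gbps(PORTS_PER_ASIC, PORT_SPEED_GBPS)
server_ports = PORTS_PER_ASIC * ASICS_PER_SERVER

print(per_asic)                     # 4800
print(server_ports)                 # 192
print(ports_needed(per_asic, 400))  # 12
print(ports_needed(per_asic, 800))  # 6
```

| Matching the same per-ASIC bandwidth with faster links takes 12
| ports at 400G or 6 at 800G, which is the "cuts the number of
| connections down" point.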
| choppaface wrote:
| Many years ago I met Naveen Rao and tried to demo the Nervana-
| derived Intel card, which at the time Facebook and a couple
| others were sampling. During more formal talks, Intel sent him
| literally surrounded by a Xeon sales team that sidetracked the
| whole meeting.
|
| When these Intel GPUs are "in the wild" it actually means Xeon
| salespeople are out on the hunt.
| conradev wrote:
| RoCE is an InfiniBand Trade Association (IBTA) standard for this:
| https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet
|
| In my understanding one of the big advantages of the protocol
| (v2, that is) is that it is routed over IP and can work with
| existing switches ($$) instead of needing specialized ones
| ($$$$)
| wmf wrote:
| RoCE never required expensive switches but getting the PFC
| configuration right can be tricky.
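|
| To make "routed over IP" concrete: RoCE v2 carries the InfiniBand
| transport headers inside ordinary UDP datagrams addressed to
| destination port 4791, which is why regular IP switches and
| routers can forward it. A minimal sketch of spotting such a
| datagram is below. The sample bytes and helper names are
| hypothetical, and this says nothing about PFC, which is switch
| configuration rather than packet format.

```python
import struct

ROCE_V2_UDP_PORT = 4791  # IANA-assigned UDP destination port for RoCE v2


def parse_udp_header(datagram: bytes):
    """Return (src_port, dst_port, length, checksum) from a raw UDP datagram."""
    if len(datagram) < 8:
        raise ValueError("a UDP header is 8 bytes")
    return struct.unpack("!HHHH", datagram[:8])


def is_roce_v2(datagram: bytes) -> bool:
    """Classify a UDP datagram as RoCE v2 by its destination port alone."""
    _, dst_port, _, _ = parse_udp_header(datagram)
    return dst_port == ROCE_V2_UDP_PORT


# Hypothetical sample: a UDP header followed by 12 zero bytes standing in
# for the InfiniBand base transport header that RoCE v2 carries as payload.
sample = struct.pack("!HHHH", 49152, ROCE_V2_UDP_PORT, 8 + 12, 0) + bytes(12)
print(is_roce_v2(sample))  # True
```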
| jsight wrote:
| So, where can I use instances of Gaudi 3 and what is the hourly
| price for these instances?
| sidkashyap wrote:
| https://www.intel.com/content/www/us/en/developer/tools/devc...
| latchkey wrote:
| This is something I'd like to offer via my business (Hot Aisle)
| at some point in the near future. Right now, we are just
| getting started and focused on MI300x, but the long term goal
| is to offer any type of high end compute that people are
| willing to rent.
| alchemist1e9 wrote:
| I'm interested and have questions, but
| https://www.hotaisle.xyz/ doesn't exactly provide a lot of
| answers.
| latchkey wrote:
| Sorry about that. We are just getting started, so the
| lowest priority right now is the website.
|
| Additionally, due to the KYC requirements around these GPUs
| (due to US export controls), we really want to get to know
| our customers first.
|
| Feel free to ping me on email and happy to get on a call
| and talk more.
| CapeTheory wrote:
| What USP are you aiming for, to differentiate from the many
| companies who have tried and failed to offer some form of
| HPCaaS over the last 10-15 years?
| latchkey wrote:
| Great question! I'm going to answer it the only way I know
| how... with a bit of a story of the history of things.
| Sorry if this bores you.
|
| The problem I realized over a year ago was that nobody had
| hourly rental access to high end AMD GPUs. In addition,
| access to high end Nvidia was equally difficult. I signed
| up for a CoreWeave account, put in my credit card and was
| told a few weeks later that my account was not approved.
|
| In effect, the only way to get access to super high end
| compute was to be involved in HPC, and that requires
| connections. At the time, we also didn't even know if AMD
| was going to seriously adopt AI as a strategy.
|
| My view was that there were actually two problems, lack of
| general access and that everyone was putting all their eggs
| into a single basket. Mostly because of that lack of
| access, and because AMD was lacking a great developer
| flywheel story.
|
| I spent August to December building a business plan,
| closing funding, forming the business, hiring my co-founder
| full time, securing data center space, securing direct
| relationships with vendors, and designing the system we
| were going to deploy. There are a million other little
| details in there, but this is long enough as it is.
|
| Oct/Nov of last year rolls around and suddenly AMD has
| changed their tune. Lisa Su doubles down. Dec 6th, MI300x
| rolls out. We made our first PoC order in January, received
| it in March. It just goes to show how cutting edge and how
| long all of this takes. 3 more small (not hyperscaler)
| businesses sprung up during that time, all offering
| effectively the same product. We went to the data center,
| deployed our PoC and about 2 weeks later, we had our first
| customer onboarded. I call all of that validation, and was
| able to secure further funding based on it.
|
| To answer your question, I'm not sure that I need a
| specific USP. The demand for compute isn't going down. If I
| have a product that people want, and I can offer them
| ethical, honest, truthful, great service around that
| product, all based on decades of experience, can't that be
| enough? My investors and I believe so.
| 1024core wrote:
| About time. NVIDIA needs some serious competition.
| talldayo wrote:
| _pokes OpenCL 's corpse with a stick_
|
| C'mon, do something...
| ein0p wrote:
| Ironically, Transformers are relatively simple architectures
| - all you really need is a high performance matmul. So OpenCL
| could "do something" at this point, if it were alive.
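|
| As a rough illustration of that point, scaled dot-product
| attention (the core transformer operation) is two matrix
| multiplications around a row-wise softmax. The sketch below is
| plain Python with naive loops, so it only shows the structure. A
| real kernel would be written against a GPU API, and all function
| names here are made up for the example.

```python
import math


def matmul(a, b):
    """Naive matrix product of two lists-of-lists."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "shape mismatch"
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d)) V, where d is the width of Q."""
    d = len(q[0])
    scores = matmul(q, transpose(k))  # first matmul
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), v)  # second matmul


# A zero query scores every key equally, so the output is the mean of V's rows.
out = attention([[0.0, 0.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[2.0, 0.0], [0.0, 4.0]])
print(out)  # [[1.0, 2.0]]
```

| The softmax is the only non-matmul step here, and it is cheap
| next to the two products.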
| imtringued wrote:
| https://github.com/ROCm/ROCm/issues/2754
|
| Wow and I thought that the latest generation of GPUs was
| better.
| FuriouslyAdrift wrote:
| AMD is doing several billion in AI processor sales already and
| the new chip is selling as fast as they can make them. At least
| with AMD, a customer can actually get them now as opposed to
| the nearly 1 year lead time from NVIDIA.
| latchkey wrote:
| Confirmed. Buying them up as fast as I can. =)
| fransje26 wrote:
| Now, if they could also do the performant, unified, software
| and driver part..
| doctorpangloss wrote:
| The only meaningful hardware competition, meaning lower prices,
| will come from Chinese designed, Chinese manufactured parts.
| This is still a long ways out.
|
| Is it inevitable? I think so. Before 2019 there wasn't an
| opportunity, now there is.
|
| For software, Chinese universities, Alibaba, Tencent and
| ByteDance are already releasing models, training code and in
| rare cases datasets that are competitive with private
| offerings. CogVLM/CogAgent is one that I use. It's very
| promising.
| elzbardico wrote:
| How much time for that? I wouldn't expect anything in
| industrial volumes for the next few years, maybe 2028? Who
| knows?
|
| But, anyway, we will probably be prohibited from buying it.
| We still can't buy Cuban cigars.
| wmf wrote:
| I don't think we'll be legally prohibited from buying it
| but there will be zero English docs (see Allwinner and
| such). Maybe if you're lucky you'll get an uncommented code
| dump with a forked years-old version of PyTorch.
| rbanffy wrote:
| Competition doesn't do much when all production everywhere is
| already taken in preorders. It'll only change when there is
| surplus production.
| seventytwo wrote:
| What is the row of green rectangles in the middle of the long
| edges?
___________________________________________________________________
(page generated 2024-04-22 23:01 UTC)