[HN Gopher] Intel Gaudi 3 AI Accelerator
___________________________________________________________________
Intel Gaudi 3 AI Accelerator
Author : goldemerald
Score : 277 points
Date : 2024-04-09 16:21 UTC (6 hours ago)
(HTM) web link (www.intel.com)
(TXT) w3m dump (www.intel.com)
| 1024core wrote:
| > Memory Boost for LLM Capacity Requirements: 128 gigabytes (GB)
| of HBM2e memory capacity, 3.7 terabytes (TB) of memory bandwidth
| ...
|
| I didn't know "terabytes (TB)" was a unit of memory bandwidth...
| throwup238 wrote:
| It's equivalent to about thirteen football fields per arn if
| that helps.
| gnabgib wrote:
| Bit of an embarrassing typo, they do later qualify it as
| 3.7TB/s
| SteveNuts wrote:
| Most of the time bandwidth is expressed in
| giga/gibi/tera/tebi _bits_ per second so this is also
| confusing to me
| sliken wrote:
| Only for networking, not for anything measured inside a
| node. Disk bandwidth, cache bandwidth, and memory bandwidth
| are nearly always measured in bytes/sec (bandwidth), or
| ns/cache line or similar (which is a mix of bandwidth and
| latency).
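|
| For concreteness, a rough sketch of the conversions in play
| (Python, using the 3.7 TB/s figure quoted above):
|
|   tb_per_s = 3.7               # bytes-based, as Intel quotes it
|   tbit_per_s = tb_per_s * 8   # bits-based, networking convention
|   print(f"{tb_per_s} TB/s = {tbit_per_s:.1f} Tb/s")
|   # latency-flavored view: time to move one 64-byte cache line
|   ns_per_line = 64 / (tb_per_s * 1e12) * 1e9
|   print(f"~{ns_per_line:.4f} ns per 64 B cache line")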
| rileyphone wrote:
| 128GB in one chip seems important with the rise of sparse
| architectures like MoE. Hopefully these are competitive with
| Nvidia's offerings, though in the end they will be competing for
| the same fab space as Nvidia if I'm not mistaken.
| latchkey wrote:
| AMD MI300x is 192GB.
| tucnak wrote:
| Which would be impressive had it _actually_ worked for ML
| workloads.
| Hugsun wrote:
| Does it not work for them? Where can I learn why?
| tucnak wrote:
| Just have a look around the GitHub issues in their ROCm
| repositories. A few months back the top excuse re: AMD was
| that we're not supposed to use their "consumer" cards, and
| that the datacenter stuff is kosher. Well, guess what: we
| purchased their datacenter card, the MI50, and it's similarly
| screwed. Too many bugs in the kernel, kernel crashes, hangs,
| and the ROCm code is buggy / incomplete. When it works, it
| works for a short period of time, and yes, HBM memory is kind
| of nice, but the whole thing is not worth it. Some say the
| MI210 and MI300 are better, but that's just wishful thinking,
| as all the bugs are in the software, kernel driver, and
| firmware. I have spent too many hours troubleshooting entry-
| level datacenter-grade Instinct cards, with no recourse from
| AMD whatsoever, to pay $10k+ for an MI210, a couple-year-old
| piece of underpowered hardware, and the MI300 is just
| unavailable.
|
| Not even from cloud providers, which should be telling
| enough.
| Workaccount2 wrote:
| It's seriously impressive how well AMD has been able to
| maintain their incredible software deficiency for over a
| decade now.
| alexey-salmin wrote:
| They deeply care about the tradition of ATI kernel
| modules from 2004
| amirhirsch wrote:
| Buying Xilinx helped a lot here.
| jmward01 wrote:
| Yeah, this has stopped me from trying anything with them.
| They need to lead with their consumer cards so that
| developers can test/build/evaluate/gain trust locally and
| then their enterprise offerings need to 100% guarantee
| that the stuff developers worked on will work in the data
| center. I keep hoping to see this but every time I look
| it isn't there. There is way more support for Apple
| silicon out there than ROCm, and that has no path to
| enterprise. AMD is missing the boat.
| latchkey wrote:
| You are right, AMD should do more with consumer cards,
| but I understand why they aren't today. It is a big ship;
| they've really only started changing course as of last
| Oct/Nov, before the release of MI300x in Dec. If you have
| limited resources and a whole culture to change, you have
| to give them time to fix that.
|
| That said, if you're on the inside, like I am, and you
| talk to people at AMD (just got off two separate back to
| back calls with them), rest assured, they are dedicated
| to making this stuff work.
|
| Part of that is to build a developer flywheel by making
| their top end hardware available to end users. That's
| where my company Hot Aisle comes into play. Something
| that wasn't available before outside of the HPC markets
| is now going to be made available.
| tucnak wrote:
| > developer flywheel
|
| This is peak comedy
| latchkey wrote:
| https://news.ycombinator.com/newsguidelines.html
|
| Comments should get more thoughtful and substantive, not
| less, as a topic gets more divisive.
| jmward01 wrote:
| I look forward to seeing it. NVIDIA needs real
| competition for their own benefit if not the market as a
| whole. I want a richer ecosystem where Intel, AMD, NVIDIA
| and other players all join in with the winner being the
| consumer. From a selfish point of view I also want to do
| more home experimentation. LLMs are so new that you can
| make breakthroughs without a huge team but it really
| helps to have hardware to make it easier to play with
| ideas. Consumer card memory limitations are hurting that
| right now.
| latchkey wrote:
| > I want a richer ecosystem where Intel, AMD, NVIDIA and
| other players all join in with the winner being the
| consumer.
|
| This is _exactly_ the void I'm trying to fill.
| JonChesterfield wrote:
| In fairness, it wasn't Apple who implemented the non-Mac
| uses of their hardware.
|
| AMD's driver is in your kernel, all the userspace is on
| GitHub. The ISA is documented. It's entirely possible to
| treat the ASICs as mass market subsidized floating point
| machines and run your own code on them.
|
| Modulo firmware. I'm vaguely on the path to working out
| what's going on there. Changing that without talking to
| the hardware guys in real time might be rather difficult
| even with the code available though.
| JonChesterfield wrote:
| We absolutely hammered the MI50 in internal testing for
| ages. Was solid as far as I can tell.
|
| Rocm is sensitive to matching kernel version to driver
| version to userspace version. Staying very much on the
| kernel version from an official release and using the
| corresponding driver is drastically more robust than
| optimistically mixing different components. In
| particular, rocm is released and tested as one large
| blob, and running that large blob on a slightly different
| kernel version can go very badly. Mixing things from
| GitHub with things from your package manager is also
| optimistic.
|
| Imagine it as huge ball of code where cross version
| compatibility of pieces is totally untested.
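|
| A minimal sketch of the kind of sanity check that implies
| (Python; both paths are assumptions that vary by distro and
| ROCm release):
|
|   from pathlib import Path
|
|   def read(p):
|       f = Path(p)
|       return f.read_text().strip() if f.exists() else "unknown"
|
|   # userspace ROCm release (present in typical /opt/rocm installs)
|   rocm = read("/opt/rocm/.info/version")
|   # out-of-tree amdgpu dkms driver version, if one is installed
|   driver = read("/sys/module/amdgpu/version")
|   print(f"ROCm userspace: {rocm}, amdgpu driver: {driver}")
|   # the point: keep these from the same tested release, rather
|   # than mixing package-manager and GitHub builds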
| tucnak wrote:
| I would run simple llama.cpp batch jobs for 10 minutes
| before they would suddenly fail and require a restart.
| Random VM_L2_PROTECTION_FAULT in dmesg, something having
| to do with doorbells. I did report this, never heard back
| from them.
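|
| A crude watchdog sketch for that failure mode (Python; the
| fault string is from the dmesg output above, and the service
| name is a hypothetical placeholder):
|
|   import subprocess
|
|   # follow the kernel log and restart the job when the fault hits
|   with subprocess.Popen(["dmesg", "--follow"],
|                         stdout=subprocess.PIPE, text=True) as p:
|       for line in p.stdout:
|           if "VM_L2_PROTECTION_FAULT" in line:
|               print("GPU fault detected, restarting job")
|               subprocess.run(["systemctl", "restart", "llama-batch"])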
| huac wrote:
| There's a number of scaled AMD deployments, including Lamini
| (https://www.lamini.ai/blog/lamini-amd-paving-the-road-to-gpu...)
| specifically for LLMs. There's also a number of HPC
| configurations, including the world's largest publicly
| disclosed supercomputer (Frontier) and Europe's largest
| supercomputer (LUMI), running on MI250x. Multiple teams
| have trained models on those HPC setups too.
|
| Do you have any more evidence as to why these categorically
| don't work?
| latchkey wrote:
| > _Do you have any more evidence as to why these
| categorically don't work?_
|
| They don't. Loud voices parroting George, with nothing to
| back it up.
|
| Here are another couple of good links:
|
| https://www.evp.cloud/post/diving-deeper-insights-from-our-l...
|
| https://www.databricks.com/blog/training-llms-scale-amd-mi25...
| latchkey wrote:
| > the only MLPerf-benchmarked alternative for LLMs on the market
|
| I hope to work on this for AMD MI300x soon. My company just got
| added to the MLCommons organization.
| riskable wrote:
| > Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into
| every Intel Gaudi 3 accelerator
|
| WHAT!? It's basically got the equivalent of a 24-port,
| 200-gigabit switch built into it. How does that make sense? Can
| you imagine stringing 24 Cat 8 cables between servers in a single
| rack? Wait: How do you even _decide_ where those cables go? Do
| you buy 24 Gaudi 3 accelerators and run cables directly between
| every single one of them so they can all talk 200-gigabit
| ethernet to each other?
|
| Also: If you've got that many Cat 8 cables coming out the back of
| the thing _how do you even access it_? You'll have to unplug
| half of them (better keep track of which was connected to what
| port!) just to be able to grab the shell of the device in the
| rack. 24 ports is usually enough to take up the majority of
| horizontal space in the rack so maybe this thing requires a
| minimum of 2-4U just to use it? That would make more sense but
| not help in the density department.
|
| I'm imagining a lot of orders for "a gradient" of colors of
| cables so the data center folks wiring the things can keep track
| of which cable is supposed to go where.
| radicaldreamer wrote:
| The amount of power that will use is massive; they should've
| gone for some fiber instead.
| buildbot wrote:
| It will be fiber, Ethernet is just the protocol not the
| physical interface.
| KeplerBoy wrote:
| The fiber optics are also extremely power hungry. For short
| runs people use direct attach copper cables to avoid having
| to deal with fiber optics.
| brookst wrote:
| Audio folks solved the "which cable goes where" problem ages
| ago with cable snakes:
| https://www.seismicaudiospeakers.com/products/24-channel-xlr...
|
| But I'm not sure how big and how expensive a 24-channel
| Cat 8 snake would be (!).
| nullindividual wrote:
| I wouldn't think that would be appropriate for Ethernet due
| to cross talk.
| wmf wrote:
| Four-lane and eight-lane twinax cables exist; I think each
| pair is individually shielded. Beyond that there's fiber.
| pezezin wrote:
| Those cables definitely exist for Ethernet, and regarding
| cross talk, that's what shielding is for.
|
| Although not at 200 Gbps; at that rate you either use big
| twinax DACs or go to fibre.
| gaogao wrote:
| InfiniBand is, I've heard, incredibly annoying to procure,
| among other things, so lots of folks are very happy to get
| RoCE (Ethernet) working instead, even if it is a bit
| cumbersome.
| blackeyeblitzar wrote:
| See https://www.nextplatform.com/2024/04/09/with-gaudi-3-intel-c...
| for more details. Here are the relevant bits,
| although you should visit the article to see the networking
| diagrams:
|
| > The Gaudi 3 accelerators inside of the nodes are connected
| using the same OSFP links to the outside world as happened with
| the Gaudi 2 designs, but in this case the doubling of the speed
| means that Intel has had to add retimers between the Ethernet
| ports on the Gaudi 3 cards and the six 800 Gb/sec OSFP ports
| that come out of the back of the system board. Of the 24 ports
| on each Gaudi 3, 21 of them are used to make a high-bandwidth
| all-to-all network linking those Gaudi 3 devices tightly to
| each other. Like this:
|
| > As you scale, you build a sub-cluster with sixteen of these
| eight-way Gaudi 3 nodes, with three leaf switches - generally
| based on the 51.2 Tb/sec "Tomahawk 5" StrataXGS switch ASICs
| from Broadcom, according to Medina - that have half of their 64
| ports running at 800 Gb/sec pointing down to the servers and
| half of their ports pointing up to the spine network. You need
| three leaf switches to do the trick:
|
| > To get to 4,096 Gaudi 3 accelerators across 512 server nodes,
| you build 32 sub-clusters and you cross link the 96 leaf
| switches with three banks of sixteen spine switches, which
| will give you three different paths to link any Gaudi 3 to any
| other Gaudi 3 through two layers of network. Like this:
|
| The cabling works out neatly in the rack configurations they
| envision. The idea here is to use standard Ethernet instead of
| proprietary Infiniband (which Nvidia got from acquiring
| Mellanox). Because each accelerator can reach other
| accelerators via multiple paths that will (ideally) not be
| over-utilized, you will be able to perform large operations
| across them efficiently without needing to get especially
| optimized about how your software manages communication.
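|
| The arithmetic in the quoted description checks out (a quick
| Python sketch, using only the numbers from the article):
|
|   accels_per_node = 8
|   nodes_per_subcluster = 16
|   subclusters = 32
|   nodes = nodes_per_subcluster * subclusters    # 512 nodes
|   accels = nodes * accels_per_node              # 4096 Gaudi 3s
|   leaf_switches = 3 * subclusters               # 96 leaves
|   spine_switches = 3 * 16                       # 3 banks of 16
|   down_ports = 64 // 2     # per leaf: half down, half up
|   print(accels, leaf_switches, spine_switches, down_ports)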
| Manabu-eo wrote:
| The PCIe HL-338 version also lists 24 200GbE RDMA NICs in a
| dual-slot configuration. How would they be connected?
| wmf wrote:
| They may go to the top of the card where you can use an
| SLI-like bridge to connect multiple cards.
| buildbot wrote:
| 200Gb is not going to be using Cat cable; it will be fiber
| (or direct attached copper cable, as noted by dogma1138)
| with a QSFP interface.
| dogma1138 wrote:
| It will most likely use copper QSFP56 cables, since these
| interfaces are used either for in-rack or adjacent-rack
| direct attachments or to the nearest switch.
|
| 0.5-1.5/2m copper cables are easily available and cheap, and
| 4-8m (and even longer) is also possible with copper but
| tends to be more expensive and harder to come by.
|
| Even 800Gb is possible with copper cables these days, but
| you'll end up spending just as much if not more on cabling
| as the rest of your kit...
| https://www.fibermall.com/sale-460634-800g-osfp-acc-3m-flt.h...
| buildbot wrote:
| Fair point!
| juliangoldsmith wrote:
| For Gaudi2, it looks like 21/24 ports are internal to the
| server. I highly doubt those have actual individual cables.
| Most likely they're just carried on PCBs like any other signal.
|
| 100GbE is only supported on twinax anyway, so Cat 8 is
| irrelevant here. The other 3 ports are probably QSFP or
| something.
| colechristensen wrote:
| Anyone have experience and suggestions for an AI accelerator?
|
| Think prototype consumer product with total cost preferably <
| $500, definitely less than $1000.
| jsheard wrote:
| The default answer is to get the biggest Nvidia gaming card you
| can afford, prioritizing VRAM size over speed. Ideally one of
| the 24GB ones.
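|
| A rule-of-thumb sketch for why the VRAM matters (Python; the
| 1.2x overhead factor for KV cache and activations is a loose
| assumption, not a measured number):
|
|   def vram_gb(params_b, bytes_per_param, overhead=1.2):
|       """Rough VRAM estimate for inference, in GB."""
|       return params_b * bytes_per_param * overhead
|
|   # a 13B-parameter model at fp16, 8-bit, and 4-bit weights
|   for bpp in (2, 1, 0.5):
|       print(f"{bpp} B/param -> ~{vram_gb(13, bpp):.1f} GB")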
| mirekrusin wrote:
| Rent or 3090, maybe used 4090 if you're lucky.
| Hugsun wrote:
| You can get very cheap Tesla P40s with 24 GB of RAM. They
| are much, much slower than the newer cards but offer decent
| value for running a local chatbot.
|
| I can't speak to the ease of configuration but know that some
| people have used these successfully.
| jononor wrote:
| What is the workload?
| hedgehog wrote:
| What else is on the BOM? Volume? At that price you likely want
| to use whatever resources are on the SoC that runs the thing
| and work around that. Feel free to e-mail me.
| wmf wrote:
| AMD Hawk Point?
| dist-epoch wrote:
| All new CPUs will have so-called NPUs inside them, for
| helping run models locally.
| JonChesterfield wrote:
| I liked my 5700XT. That seems to be $200 now. Ran arbitrary
| code on it just fine. Lots of machine learning seems to be
| obsessed with amount of memory though and increasing that is
| likely to increase the price. Also HN doesn't like ROCm much,
| so there's that.
| neilmovva wrote:
| A bit surprised that they're using HBM2e, which is what Nvidia
| A100 (80GB) used back in 2020. But Intel is using 8 stacks here,
| so Gaudi 3 achieves comparable total bandwidth (3.7TB/s) to H100
| (3.4TB/s) which uses 5 stacks of HBM3. Hopefully the older HBM
| has better supply - HBM3 is hard to get right now!
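|
| The per-stack math behind that (a rough Python sketch, from
| the figures above):
|
|   # Gaudi 3: 3.7 TB/s over 8 HBM2e stacks
|   # H100:    3.4 TB/s over 5 HBM3 stacks
|   for name, tb_s, stacks in [("Gaudi 3 HBM2e", 3.7, 8),
|                              ("H100 HBM3", 3.4, 5)]:
|       print(f"{name}: ~{tb_s / stacks * 1000:.0f} GB/s per stack")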
|
| The Gaudi 3 multi-chip package also looks interesting. I see 2
| central compute dies, 8 HBM die stacks, and then 6 small dies
| interleaved between the HBM stacks - curious to know whether
| those are also functional, or just structural elements for
| mechanical support.
| bayindirh wrote:
| > A bit surprised that they're using HBM2e, which is what
| Nvidia A100 (80GB) used back in 2020.
|
| This is one of the secret recipes of Intel. They can use older
| tech and push it a little further to catch/surpass current gen
| tech until current gen becomes easier/cheaper to
| produce/acquire/integrate.
|
| They have done it with their first quad-core processors by
| merging two dual-core processors (the Q6xxx series), and by
| creating absurdly clocked single-core processors aimed at
| very niche market segments.
|
| We had not seen it until now because they were asleep at
| the wheel, knocked unconscious by AMD.
| mvkel wrote:
| Interesting.
|
| Would you say this means Intel is "back," or just not
| completely dead?
| bayindirh wrote:
| No, this means Intel has woken up and is trying. There's no
| guarantee in anything. I'm more of an AMD person, but I
| want to see fierce competition, not monopoly, even if it's
| "my team's monopoly".
| chucke1992 wrote:
| Well, the only reason AMD is doing well at CPUs is
| because Intel is sleeping. Otherwise it would be Nvidia
| vs AMD (on fewer steroids, though).
| bayindirh wrote:
| EPYC is actually pretty good. It's true that Intel was
| sleeping, but AMD's new architecture is a beast. Has
| better memory support, more PCIe lanes and better overall
| system latency and throughput.
|
| Intel's TDP problems and AVX clock issues leave a bitter
| taste in the mouth.
| alexey-salmin wrote:
| Oh dear, Q6600 was so bad, I regret ever owning it
| PcChip wrote:
| Really? I never owned one but even I remember the famous
| SLACR, I thought they were the hot item back then
| alexey-salmin wrote:
| It was "hot" but using one as a main desktop in 2007 was
| depressing due to abysmal single-core performance.
| mrybczyn wrote:
| What? It was outstanding for the time, great price
| performance, and very tunable for clock / voltage IIRC.
| alexey-salmin wrote:
| Well, overclocked I don't know, but out-of-the-box
| single-core performance completely sucked. And in 2007 not
| enough applications had threads to make it up in the
| number of cores.
|
| It was fun to play with but you'd also expect the higher-
| end desktop to e.g. handle x264 videos which was not the
| case (search for q6600 on videolan forum). And
| depressingly many cheaper CPUs of the time did it easily.
| JonChesterfield wrote:
| 65nm tolerated a lot of voltage. Fun thing to overclock.
| bayindirh wrote:
| I owned one, it was a performant little chip. Developed my
| first multi core stuff with it.
|
| I loved it, to be honest.
| chucke1992 wrote:
| Q6600 was quite good but E8400 was the best.
| alexey-salmin wrote:
| E8400 was actually good, yes
| astrodust wrote:
| Q6600 is the spiritual successor to the ABIT BP6 Dual
| Celeron option: https://en.wikipedia.org/wiki/ABIT_BP6
| bayindirh wrote:
| ABIT was a legend in motherboards. I used their AN-7
| Ultra and AN-8 Ultra. No newer board gave the flexibility
| and capabilities of these series.
|
| My latest ASUS was good enough, but I didn't (and
| probably won't) build any newer systems, so ABITs will
| have the crown.
| JonChesterfield wrote:
| > This is one of the secret recipes of Intel
|
| Any other examples of this? I remember the secret sauce being
| a process advantage over the competition, exactly the
| opposite of making old tech outperform the state of the art.
| calaphos wrote:
| Intel's surprisingly fast 14nm processors come to mind,
| born of necessity as they couldn't get their 10nm and
| later 7nm processes working for years. Despite that, Intel
| managed to keep up in single-core performance with newer
| 7nm AMD chips, although at a much higher power draw.
| Dalewyn wrote:
| Or today with Alder Lake and Raptor Lake (Refresh), where
| their CPUs made on Intel 7 (10nm) are on par with, if not
| slightly better than, AMD's offerings made on TSMC 5nm.
| deepnotderp wrote:
| That's because CPU performance cares less about
| transistor density and more about transistor performance,
| and 14nm drive strength was excellent
| timr wrote:
| Back in the day, Intel was great for overclocking because
| all of their chips could run at _significantly_ higher
| speeds and voltages than on the tin. This was because they
| basically just targeted the higher specs, and sold the
| underperforming silicon as lower-tier products.
|
| Don't know if this counts, but feels directionally similar.
| sairahul82 wrote:
| Can we expect the price of 'Gaudi 3 PCIe' to be reasonable enough
| to put in a workstation? That would be a game changer for local
| LLMs.
| CuriouslyC wrote:
| Just based on the RAM alone, let's say that if you can't buy
| a Vision Pro without a second thought about the price tag,
| don't get your hopes up.
| wongarsu wrote:
| Probably not. A 40GB Nvidia A100 is arguably reasonable for a
| workstation at $6000. Depending on your definition an 80GB A100
| for $16000 is still reasonable. I don't see this being cheaper
| than an 80GB A100. Probably a good bit more expensive, seeing
| as it has more RAM, compares itself favorably to the H100, and
| has enough compelling features that it probably doesn't have to
| (strongly) compete on price.
| chessgecko wrote:
| I think you're right on the price, but just to give some
| false hope: I think newish HBM (and this is HBM2e, which is
| a little older) is around $15/GB, so for 128 GB that's
| $1920. There are some other COGS, but in theory they could
| sell this for like $3-4k and make some gross profit while
| getting some hobbyist mindshare/research code written for
| it. I doubt they will though; it might eat too much into
| profits from the non-PCIe variants.
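|
| Spelled out (Python; the $15/GB figure and the non-HBM cost
| are both guesses):
|
|   hbm_cost = 128 * 15   # ~$15/GB for 128 GB of HBM2e, assumed
|   other_cogs = 800      # dies, package, board: a guess
|   for price in (3000, 4000):
|       margin = price - (hbm_cost + other_cogs)
|       print(f"${price}: ~${margin} gross profit per card")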
| p1esk wrote:
| _in theory they could sell this for like $3-4k_
|
| You're joking, right? They will price it to match current
| H100 pricing. Multiply your estimate by 10x.
| chessgecko wrote:
| They could. I know they won't, but they wouldn't lose
| money on the parts.
| Workaccount2 wrote:
| Interestingly, they are using HBM2e memory, which is a few
| years old at this point. The price might end up being
| surprisingly good because of this.
| 0cf8612b2e1e wrote:
| Surely NVidia's pricing is more what the market will bear vs
| an intrinsic cost to build. Intel being the underdog should
| be willing to offer a discount just to get their foot in the
| door.
| wmf wrote:
| Nvidia is charging $35K so a discount relative to that is
| still very expensive.
| tormeh wrote:
| Pricing is normally what the market will bear. If this is
| below your cost as supplier you exit the market.
| AnthonyMouse wrote:
| But if your competitor's price is dramatically _above_
| your cost, you can provide a huge discount as an
| incentive for customers to pay the transition cost to
| your system while still turning a tidy profit.
| narrator wrote:
| Isn't it much better to get a Mac Studio with an M2 Max and
| 192 GB of RAM and 31 teraflops for $6599 and run llama.cpp?
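|
| For a rough sense of what that buys you for local LLMs (a
| sketch; the bandwidth-bound rule of thumb and both numbers
| are assumptions):
|
|   mem_bw_gb_s = 400    # M2 Max unified memory bandwidth, approx
|   model_gb = 35        # e.g. a ~70B model at 4-bit quantization
|   # decoding reads roughly the whole model once per token
|   print(f"~{mem_bw_gb_s / model_gb:.0f} tokens/s upper bound")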
| ipsum2 wrote:
| It won't be under $10k.
| yieldcrv wrote:
| Has anyone here bought an AI accelerator to run their AI SaaS
| service from their home for customers, instead of trying to
| make a profit on top of OpenAI or Replicate?
|
| Seems like an okay $8,000 - $30,000 investment, and bare metal
| server maintenance isn't that complicated these days.
| shiftpgdn wrote:
| Dingboard runs off of the owner's pile of used gamer cards. The
| owner frequently posts about it on twitter.
| kaycebasques wrote:
| Wow, I very much appreciate the use of the 5 Ws and H [1] in this
| announcement. Thank you Intel for not subjecting my eyes to corp
| BS
|
| [1] https://en.wikipedia.org/wiki/Five_Ws
| belval wrote:
| I wonder if, with the advent of LLMs being able to spit out
| perfect corpo-speak, everyone will recenter on succinct,
| short "here's the gist" writing, as the long version becomes
| associated with cheap automated output.
| YetAnotherNick wrote:
| So now hardware companies have stopped reporting FLOP/s
| numbers and report in arbitrary units of parallel
| operations/s.
| AnonMO wrote:
| 1835 TFLOPS FP8. You have to look for it, but they posted it.
| The link in the OP is just an announcement; the white paper
| has more info.
| https://www.intel.com/content/www/us/en/content-details/8174...
| whalesalad wrote:
| https://www.merriam-webster.com/dictionary/gaudy
| jagger27 wrote:
| https://en.wikipedia.org/wiki/Antoni_Gaud%C3%AD
| riazrizvi wrote:
| That's an i. He's one of the greatest architects of all time.
| https://www.archdaily.com/877599/10-must-see-gaudi-buildings...
| TheAceOfHearts wrote:
| Honestly, I thought the same thing upon reading the name. I'm
| aware of the reference to Antoni Gaudi, but having the name
| sound so close to gaudy seems a bit unfortunate. Surely they
| must've had better options? Then again I don't know how these
| sorts of names get decided anymore.
| whalesalad wrote:
| To be fair, Intel is not known for naming things well.
| brookst wrote:
| Yeah I can't believe people are nitpicking the name when it
| could just as easily have been AIX19200xvr4200AI.
| prewett wrote:
| 'Gaudi' is properly pronounced Ga-oo-DEE in his native
| Catalan, whereas (in my dialect) 'gaudy' is pronounced GAW-
| dee. My guess is Intel wasn't even thinking about 'gaudy'
| because they were thinking about "famous architects" or
| whatever the naming pool was. Although I had heard that
| 'gaudy' came from the architect's name, because of what
| people thought of his work. (I'm not sure this is correct;
| it was just my first introduction to the word.)
| andersa wrote:
| Price?
| mpreda wrote:
| How much does one such card cost?
| kylixz wrote:
| This is a bit snarky -- but will Intel actually keep this product
| line alive for more than a few years? Having been bitten by
| building products around some of their non-x86 offerings where
| they killed good IP off and then failed to support it... I'm
| skeptical.
|
| I truly do hope it is successful so we can have some alternative
| accelerators.
| forkerenok wrote:
| I'm not very involved in the broader topic, but isn't the
| shortage of hardware for AI-related workloads intense enough
| to grant them the benefit of the doubt?
| jtriangle wrote:
| The real question is, how long does it actually have to
| hang around? With the way this market is going, it probably
| only has to be supported in earnest for a few years by which
| point it'll be so far obsolete that everyone who matters will
| have moved on.
| AnthonyMouse wrote:
| We're talking about the architecture, not the hardware model.
| What people want is to have a new, faster version in a few
| years that will run the same code written for this one.
|
| Also, hardware has a lifecycle. At some point the old
| hardware isn't worth running in a large scale operation
| because it consumes more in electricity to run 24/7 than it
| would cost to replace with newer hardware. But then it falls
| into the hands of people who aren't going to run it 24/7,
| like hobbyists and students, which as a manufacturer you
| still want to support because that's how you get people to
| invest their time in your stuff instead of a competitor's.
| riffic wrote:
| Itanic was a fun era
| cptskippy wrote:
| Itanium only stuck around as long as it did because they were
| obligated to support HP.
| cptskippy wrote:
| I think it's a valid question. Intel has a habit of whispering
| away anything that doesn't immediately ship millions of units
| or that they're contractually obligated to support.
| astrodust wrote:
| I hope it pairs well with Optane modules!
| VHRanger wrote:
| I'll add it right next to my Xeon Phi!
| iamleppert wrote:
| Long enough for you to get in, develop some AI product, raise
| investment funds, and get out with your bag!
| AnonMO wrote:
| it's crazy that Intel can't manufacture its own chips atm, but it
| looks like that might change in the coming years as new fabs come
| online.
| alecco wrote:
| Gaudi 3 has PCIe 4.0 (vs. H100 PCIe 5.0, so 2x the bandwidth).
| Probably not a deal-breaker but it's strange for Intel (of all
| vendors) to lag behind in PCIe.
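|
| For scale, the per-direction link bandwidth (a sketch; x16
| links and 128b/130b encoding assumed for both generations):
|
|   def pcie_gb_s(gt_per_s, lanes=16, encoding=128 / 130):
|       """Per-direction PCIe bandwidth in GB/s."""
|       return gt_per_s * lanes * encoding / 8
|
|   print(f"PCIe 4.0 x16: ~{pcie_gb_s(16):.1f} GB/s")   # ~31.5
|   print(f"PCIe 5.0 x16: ~{pcie_gb_s(32):.1f} GB/s")   # ~63.0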
| wmf wrote:
| N5, PCIe 4.0, and HBM2e. This chip was probably delayed two
| years.
| alecco wrote:
| Good point, it's built on TSMC while Intel is pushing to
| become the #2 foundry. Probably it's because Gaudi was made
| by an Israeli company Intel acquired in 2019 (not an internal
| project). Who knows.
|
| https://www.semianalysis.com/p/is-intel-back-foundry-and-
| pro...
| KeplerBoy wrote:
| The whitepaper says it's PCIe 5 on Gaudi 3.
| brcmthrowaway wrote:
| Does this support apple silicon?
| ancharm wrote:
| Is the scheduling / bare metal software open source through
| OneAPI? Can a link be posted showing it if so?
| chessgecko wrote:
| I feel a little misled by the speedup numbers. They are
| comparing lower-batch-size H100/H200 numbers to
| higher-batch-size Gaudi 3 numbers for throughput (which is
| heavily improved by increasing batch size). I feel like there
| are some inference scenarios where this is better, but it's
| really hard to tell from the numbers in the paper.
| m3kw9 wrote:
| Can you run CUDA on it?
| boroboro4 wrote:
| No one runs CUDA; everyone runs PyTorch, which you can run
| on it.
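|
| Roughly like this with the Habana PyTorch bridge (a sketch;
| the habana_frameworks package ships with Intel's Gaudi
| software stack, and details vary by release):
|
|   import torch
|   import habana_frameworks.torch.core as htcore
|
|   device = torch.device("hpu")    # Gaudi shows up as "hpu"
|   x = torch.randn(1024, 1024, device=device)
|   y = x @ x                       # dispatched to the accelerator
|   htcore.mark_step()              # flush the lazy-mode graph
|   print(y.sum().item())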
| geertj wrote:
| I wonder if someone knowledgeable could comment on OneAPI vs
| Cuda. I feel like if Intel is going to be a serious competitor to
| Nvidia, both software and hardware are going to be equally
| important.
| ZoomerCretin wrote:
| I'm not familiar with the particulars of OneAPI, but it's just
| a matter of rewriting CUDA kernels into OneAPI. This is pretty
| trivial for the vast majority of small (<5 LoC) kernels. Unlike
| AMD, it looks like they're serious about dogfooding their own
| chips, and they have a much better reputation for their driver
| quality.
| JonChesterfield wrote:
| All the dev work at AMD is on our own hardware. Even things
| like the corporate laptops are Ryzen based. The first gen
| Ryzen laptop I got was _terrible_ but it wasn't Intel. We
| also do things like develop ROCm on the non-qualified cards
| and build our tools with our tools. It would be crazy not to.
| ZoomerCretin wrote:
| Yes that's why I qualified "serious" dogfooding. Of course
| you use your hardware for your own development work, but
| it's clearly not enough given that showstopper driver
| issues are going unfixed for half a year.
| sorenjan wrote:
| Why isn't AMD part of the UXL Foundation? What does AMD
| gain from not working together with other companies to make
| an open alternative to CUDA?
|
| Please make SYCL a priority, cross platform code would make
| AMD GPUs a viable alternative in the future.
| alecco wrote:
| Trivial??
| TApplencourt wrote:
| You have SYCLomatic to help.
| ZoomerCretin wrote:
| That statement has two qualifications.
| wmf wrote:
| IMO dogfooding Gaudi would mean training a model on it (and
| the only way to "prove" it would be to release that model).
| meragrin_ wrote:
| Apparently, Google, Qualcomm, Samsung, and ARM are rallying
| around oneAPI:
|
| https://uxlfoundation.org/
| mk_stjames wrote:
| One nice thing about this (and the new offerings from AMD) is
| that they will be using the "open accelerator module (OAM)"
| interface, which standardizes the connector that they use to
| put them on baseboards, similar to the SXM connections of
| Nvidia that use MegArray connectors to their baseboards.
|
| With Nvidia, the SXM connection pinouts have always been held
| proprietary and confidential. For example, P100's and V100's have
| standard PCI-e lanes connected to one of the two sides of their
| MegArray connectors, and if you know that pinout you could
| literally build PCI-e cards with SXM2/3 connectors to repurpose
| those now obsolete chips (this has been done by one person).
|
| There are thousands, maybe tens of thousands of P100's you could
| pick up for literally <$50 apiece these days, which technically
| give you more Tflops/$ than anything on the market, but they are
| useless because their interface was not ever made open and has
| not been reverse engineered openly and the OEM baseboards (Dell,
| Supermicro mainly) are still hideously expensive outside China.
|
| I'm one of those people who finds 'retro-super-computing' a cool
| hobby and thus the interfaces like OAM being open means that
| these devices may actually have a life for hobbyists in 8~10
| years instead of being sent directly to the bins due to secret
| interfaces and obfuscated backplane specifications.
| JonChesterfield wrote:
| I really like this side to AMD. There's a strategic call
| somewhere high up to bias towards collaboration with other
| companies. Sharing the fabric specifications with broadcom was
| an amazing thing to see. It's not out of the question that
| we'll see single chips with chiplets made by different
| companies attached together.
| wmf wrote:
| Why don't they sell used P100 DGX/HGX servers as a unit? Are
| those bare P100s only so cheap precisely because they're
| useless?
| mk_stjames wrote:
| I have a theory some big cloud provider moved a ton of racks
| from SXM2 P100's to SXM2 V100's (those were a thing) and thus
| orphaned an absolute ton of P100's without their baseboards.
|
| Or, these salvage operations just stripped the racks and
| kept the small stuff, e-wasting the racks because they
| figured that was the more efficient use of their storage
| space and would be easier to sell, without thinking it
| through.
| formerly_proven wrote:
| The price is low because they're useless (except for
| replacing dead cards in a DGX); if you had a $40 PCIe
| AIC-to-SXM adapter, the price would go up a lot.
|
| > I'm one of those people who finds 'retro-super-computing' a
| cool hobby and thus the interfaces like OAM being open means
| that these devices may actually have a life for hobbyists in
| 8~10 years instead of being sent directly to the bins due to
| secret interfaces and obfuscated backplane specifications.
|
| Very cool hobby. It's also unfortunate how stringent e-waste
| rules lead to so much perfectly fine hardware being
| scrapped, and how the remainder is typically pulled apart to
| the board / module level for spares. That makes it very
| unlikely to stumble over more or less complete-ish systems.
| KeplerBoy wrote:
| I'm not sure the prices would go up that much. What would
| anyone buy that card for?
|
| Yes, it has a decent memory bandwidth (~750 GB/s) and it runs
| CUDA. But it only has 16 GB and doesn't support tensor cores
| or low precision floats. It's in a weird place.
| trueismywork wrote:
| Scientific computing would buy it up like hot cakes.
| KeplerBoy wrote:
| Only if the specific workload needs FP64 (4.5 Tflop/s);
| the 9 Tflop/s of FP32 can be had for cheap with Turing
| or Ampere consumer cards.
|
| Still, your point stands. It's crazy how that 2016 GPU
| has two thirds the FP32 power of this new 2024 unobtanium
| card and infinitely more FP64.
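|
| Concretely (a quick sketch; the P100 figures are from
| memory):
|
|   p100_fp32, p100_fp64 = 9.3, 4.7   # Tflop/s, approx
|   gaudi3_vec_fp32 = 14.0            # from the whitepaper figure
|   print(f"P100 FP32 is {p100_fp32 / gaudi3_vec_fp32:.0%}"
|         f" of Gaudi 3's vector FP32")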
| buildbot wrote:
| The SXM2 interface is actually publicly documented! There is
| an Open Compute spec for an 8-way baseboard. You can find
| the pinouts there.
| mk_stjames wrote:
| I had read their documents such as the spec for the Big Basin
| JBOG, where everything is documented except the actual
| pinouts on the base board. Everything leading up to it and
| from it is there but the actual MegArray pinout connection to
| a single P100/V100 I never found.
|
| But maybe there was more I missed. I'll take another look.
| throwaway4good wrote:
| Worth noting that it is fabbed by TSMC.
| amelius wrote:
| Missing in these pictures are the thermal management solutions.
| InitEnabler wrote:
| If you look at one of the pictures you can get a peek at
| what they look like (I think...) in the bottom right.
|
| https://www.intel.com/content/dam/www/central-libraries/us/e...
| KeplerBoy wrote:
| Vector floating-point performance comes in at 14 Tflop/s for
| FP32 and 28 Tflop/s for FP16.
|
| Not the best of times for stuff that doesn't fit matrix
| processing units.
| einpoklum wrote:
| If your metric is memory bandwidth or memory size, then this
| announcement gives you some concrete information. But - suppose
| my metric for performance is matrix-multiply-add (or just matrix-
| multiply) bandwidth. What MMA primitives does Gaudi offer (i.e.
| type combinations and matrix dimension combinations), and how
| many of such ops per second, in practice? The linked page says
| "64,000 in parallel", but that does not actually tell me much.
___________________________________________________________________
(page generated 2024-04-09 23:00 UTC)