[HN Gopher] Intel Gaudi 3 AI Accelerator
       ___________________________________________________________________
        
       Intel Gaudi 3 AI Accelerator
        
       Author : goldemerald
       Score  : 277 points
       Date   : 2024-04-09 16:21 UTC (6 hours ago)
        
 (HTM) web link (www.intel.com)
 (TXT) w3m dump (www.intel.com)
        
       | 1024core wrote:
       | > Memory Boost for LLM Capacity Requirements: 128 gigabytes (GB)
       | of HBMe2 memory capacity, 3.7 terabytes (TB) of memory bandwidth
       | ...
       | 
       | I didn't know "terabytes (TB)" was a unit of memory bandwidth...
        
         | throwup238 wrote:
         | It's equivalent to about thirteen football fields per arn if
         | that helps.
        
         | gnabgib wrote:
         | Bit of an embarrassing typo, they do later qualify it as
         | 3.7TB/s
        
           | SteveNuts wrote:
           | Most of the time bandwidth is expressed in
           | giga/gibi/tera/tebi _bits_ per second so this is also
           | confusing to me
        
             | sliken wrote:
             | Only for networking, not for anything measured inside a
             | node. Disk bandwidth, cache bandwidth, and memory bandwidth
              | are nearly always measured in bytes/sec (bandwidth), or
              | ns/cache line or similar (which is a mix of bandwidth and
              | latency).
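To make the units concrete, here is a tiny conversion sketch (the 3.7 TB/s figure is the one quoted from the article; the byte-to-bit factor is the only thing added):

```python
# Bandwidth units sanity check: memory bandwidth is usually quoted in
# bytes/s, networking in bits/s. 1 byte = 8 bits.

def tb_per_s_to_tbit_per_s(tbytes_per_s: float) -> float:
    """Convert terabytes/second to terabits/second."""
    return tbytes_per_s * 8

# The 3.7 TB/s HBM figure from the article, expressed network-style:
print(tb_per_s_to_tbit_per_s(3.7))  # 29.6 (Tb/s)
```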
        
       | rileyphone wrote:
       | 128GB in one chip seems important with the rise of sparse
       | architectures like MoE. Hopefully these are competitive with
       | Nvidia's offerings, though in the end they will be competing for
       | the same fab space as Nvidia if I'm not mistaken.
        
         | latchkey wrote:
         | AMD MI300x is 192GB.
        
           | tucnak wrote:
           | Which would be impressive had it _actually_ worked for ML
           | workloads.
        
             | Hugsun wrote:
             | Does it not work for them? Where can I learn why?
        
               | tucnak wrote:
                | Just go have a look around the GitHub issues in their
                | ROCm repositories. A few months back the top excuse
               | re: AMD was that we're not supposed to use their
               | "consumer" cards, however the datacenter stuff is kosher.
               | Well, guess what, we have purchased their datacenter
               | card, MI50, and it's similarly screwed. Too many bugs in
               | the kernel, kernel crashes, hangs, and the ROCm code is
               | buggy / incomplete. When it works, it works for a short
               | period of time, and yes HBM memory is kind of nice, but
               | the whole thing is not worth it. Some say MI210 and MI300
               | are better, but it's just wishful thinking as all the
               | bugs are in the software, kernel driver, and firmware. I
                | have spent too many hours troubleshooting entry-level
                | datacenter-grade Instinct cards, with no recourse from
                | AMD whatsoever, to pay $10k+ for an MI210: couple-year-
                | old, underpowered hardware. And MI300 is just unavailable.
               | 
               | Not even from cloud providers which should be telling
               | enough.
        
               | Workaccount2 wrote:
               | It's seriously impressive how well AMD has been able to
               | maintain their incredible software deficiency for over a
               | decade now.
        
               | alexey-salmin wrote:
               | They deeply care about the tradition of ATI kernel
               | modules from 2004
        
               | amirhirsch wrote:
               | Buying Xilinx helped a lot here.
        
               | jmward01 wrote:
               | Yeah, this has stopped me from trying anything with them.
               | They need to lead with their consumer cards so that
               | developers can test/build/evaluate/gain trust locally and
               | then their enterprise offerings need to 100% guarantee
               | that the stuff developers worked on will work in the data
               | center. I keep hoping to see this but every time I look
                | it isn't there. There is way more support for Apple
                | silicon out there than ROCm, and that has no path to
                | enterprise. AMD is missing the boat.
        
               | latchkey wrote:
               | You are right, AMD should do more with consumer cards,
               | but I understand why they aren't today. It is a big ship,
               | they've really only started changing course as of last
               | Oct/Nov, before the release of MI300x in Dec. If you have
               | limited resources and a whole culture to change, you have
               | to give them time to fix that.
               | 
               | That said, if you're on the inside, like I am, and you
               | talk to people at AMD (just got off two separate back to
               | back calls with them), rest assured, they are dedicated
               | to making this stuff work.
               | 
               | Part of that is to build a developer flywheel by making
               | their top end hardware available to end users. That's
                | where my company Hot Aisle comes into play. Something
                | that wasn't available before outside of the HPC markets
                | is now going to be made available.
        
               | tucnak wrote:
               | > developer flywheel
               | 
               | This is peak comedy
        
               | latchkey wrote:
               | https://news.ycombinator.com/newsguidelines.html
               | 
               | Comments should get more thoughtful and substantive, not
               | less, as a topic gets more divisive.
        
               | jmward01 wrote:
               | I look forward to seeing it. NVIDIA needs real
               | competition for their own benefit if not the market as a
               | whole. I want a richer ecosystem where Intel, AMD, NVIDIA
               | and other players all join in with the winner being the
               | consumer. From a selfish point of view I also want to do
               | more home experimentation. LLMs are so new that you can
               | make breakthroughs without a huge team but it really
               | helps to have hardware to make it easier to play with
               | ideas. Consumer card memory limitations are hurting that
               | right now.
        
               | latchkey wrote:
               | > I want a richer ecosystem where Intel, AMD, NVIDIA and
               | other players all join in with the winner being the
               | consumer.
               | 
                | This is _exactly_ the void I'm trying to fill.
        
               | JonChesterfield wrote:
               | In fairness it wasn't Apple who implemented the non-mac
               | uses of their hardware.
               | 
               | AMD's driver is in your kernel, all the userspace is on
               | GitHub. The ISA is documented. It's entirely possible to
               | treat the ASICs as mass market subsidized floating point
               | machines and run your own code on them.
               | 
               | Modulo firmware. I'm vaguely on the path to working out
               | what's going on there. Changing that without talking to
               | the hardware guys in real time might be rather difficult
               | even with the code available though.
        
               | JonChesterfield wrote:
               | We absolutely hammered the MI50 in internal testing for
               | ages. Was solid as far as I can tell.
               | 
               | Rocm is sensitive to matching kernel version to driver
                | version to userspace version. Staying on the kernel
                | version from an official release and using the
                | corresponding driver is drastically more robust than
               | optimistically mixing different components. In
               | particular, rocm is released and tested as one large
               | blob, and running that large blob on a slightly different
               | kernel version can go very badly. Mixing things from
               | GitHub with things from your package manager is also
               | optimistic.
               | 
               | Imagine it as huge ball of code where cross version
               | compatibility of pieces is totally untested.
        
               | tucnak wrote:
               | I would run simple llama.cpp batch jobs for 10 minutes
               | when it would suddenly fail, and require a restart.
               | Random VM_L2_PROTECTION_FAULT in dmesg, something having
               | to do with doorbells. I did report this, never heard back
               | from them.
        
             | huac wrote:
             | There's a number of scaled AMD deployments, including
             | Lamini (https://www.lamini.ai/blog/lamini-amd-paving-the-
              | road-to-gpu...) specifically for LLMs. There's also a
             | number of HPC configurations, including the world's largest
             | publicly disclosed supercomputer (Frontier) and Europe's
             | largest supercomputer (LUMI) running on MI250x. Multiple
             | teams have trained models on those HPC setups too.
             | 
             | Do you have any more evidence as to why these categorically
             | don't work?
        
               | latchkey wrote:
                | > _Do you have any more evidence as to why these
                | categorically don't work?_
               | 
               | They don't. Loud voices parroting George, with nothing to
               | back it up.
               | 
               | Here are another couple good links:
               | 
               | https://www.evp.cloud/post/diving-deeper-insights-from-
               | our-l...
               | 
               | https://www.databricks.com/blog/training-llms-scale-amd-
               | mi25...
        
       | latchkey wrote:
       | > the only MLPerf-benchmarked alternative for LLMs on the market
       | 
       | I hope to work on this for AMD MI300x soon. My company just got
       | added to the MLCommons organization.
        
       | riskable wrote:
       | > Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into
       | every Intel Gaudi 3 accelerator
       | 
       | WHAT!? It's basically got the equivalent of a 24-port,
       | 200-gigabit switch built into it. How does that make sense? Can
       | you imaging stringing 24 Cat 8 cables between servers in a single
       | rack? Wait: How do you even _decide_ where those cables go? Do
       | you buy 24 Gaudi 3 accelerators and run cables directly between
       | every single one of them so they can all talk 200-gigabit
       | ethernet to each other?
       | 
       | Also: If you've got that many Cat 8 cables coming out the back of
        | the thing, _how do you even access it_? You'll have to unplug
       | half of them (better keep track of which was connected to what
       | port!) just to be able to grab the shell of the device in the
       | rack. 24 ports is usually enough to take up the majority of
       | horizontal space in the rack so maybe this thing requires a
       | minimum of 2-4U just to use it? That would make more sense but
       | not help in the density department.
       | 
       | I'm imagining a lot of orders for "a gradient" of colors of
       | cables so the data center folks wiring the things can keep track
       | of which cable is supposed to go where.
        
         | radicaldreamer wrote:
         | The amount of power that will use up is massive, they should've
         | gone for some fiber instead
        
           | buildbot wrote:
           | It will be fiber, Ethernet is just the protocol not the
           | physical interface.
        
           | KeplerBoy wrote:
           | The fiber optics are also extremely power hungry. For short
           | runs people use direct attach copper cables to avoid having
           | to deal with fiberoptics.
        
         | brookst wrote:
         | Audio folks solved the "which cable goes where" problem ages
         | ago with cable snakes:
         | https://www.seismicaudiospeakers.com/products/24-channel-xlr...
         | 
          | But I'm not sure how big and how expensive a 24 channel cat 8
          | snake would be (!).
        
           | nullindividual wrote:
           | I wouldn't think that would be appropriate for Ethernet due
           | to cross talk.
        
             | wmf wrote:
             | Four-lane and eight-lane twinax cables exist; I think each
             | pair is individually shielded. Beyond that there's fiber.
        
             | pezezin wrote:
             | Those cables definitely exist for Ethernet, and regarding
             | cross talk, that's what shielding is for.
             | 
             | Although not for 200 Gbps, at that rate you either use big
             | twinax DACs, or go to fibre.
        
         | gaogao wrote:
         | Infiniband I've heard as incredibly annoying to deal with
         | procuring as well as some other aspects of it, so lots of folks
         | very happy to get RoCE (ethernet) working instead, even if it
         | is a bit cumbersome.
        
         | blackeyeblitzar wrote:
         | See https://www.nextplatform.com/2024/04/09/with-
         | gaudi-3-intel-c... for more details. Here's the relevant bits,
         | although you should visit the article to see the networking
         | diagrams:
         | 
         | > The Gaudi 3 accelerators inside of the nodes are connected
         | using the same OSFP links to the outside world as happened with
         | the Gaudi 2 designs, but in this case the doubling of the speed
         | means that Intel has had to add retimers between the Ethernet
         | ports on the Gaudi 3 cards and the six 800 Gb/sec OSFP ports
         | that come out of the back of the system board. Of the 24 ports
         | on each Gaudi 3, 21 of them are used to make a high-bandwidth
         | all-to-all network linking those Gaudi 3 devices tightly to
         | each other. Like this:
         | 
         | > As you scale, you build a sub-cluster with sixteen of these
         | eight-way Gaudi 3 nodes, with three leaf switches - generally
         | based on the 51.2 Tb/sec "Tomahawk 5" StrataXGS switch ASICs
          | from Broadcom, according to Medina - that have half of their 64
          | ports running at 800 Gb/sec pointing down to the servers and
         | half of their ports pointing up to the spine network. You need
         | three leaf switches to do the trick:
         | 
         | > To get to 4,096 Gaudi 3 accelerators across 512 server nodes,
          | you build 32 sub-clusters and you cross-link the 96 leaf
          | switches with three banks of sixteen spine switches, which
         | will give you three different paths to link any Gaudi 3 to any
         | other Gaudi 3 through two layers of network. Like this:
         | 
         | The cabling works out neatly in the rack configurations they
         | envision. The idea here is to use standard Ethernet instead of
         | proprietary Infiniband (which Nvidia got from acquiring
         | Mellanox). Because each accelerator can reach other
         | accelerators via multiple paths that will (ideally) not be
         | over-utilized, you will be able to perform large operations
         | across them efficiently without needing to get especially
         | optimized about how your software manages communication.
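The quoted topology can be sanity-checked with a few lines of arithmetic (all inputs are the figures from the Next Platform excerpt above, not official Intel specs):

```python
# Arithmetic check of the cluster topology described in the excerpt.
# All inputs are the figures quoted from The Next Platform.

gaudis_per_node = 8
nodes_per_subcluster = 16
leaf_switches_per_subcluster = 3
subclusters = 32
spine_banks = 3
spines_per_bank = 16

total_nodes = nodes_per_subcluster * subclusters           # 512 servers
total_gaudis = gaudis_per_node * total_nodes               # 4096 accelerators
total_leaves = leaf_switches_per_subcluster * subclusters  # 96 leaf switches
total_spines = spine_banks * spines_per_bank               # 48 spine switches

print(total_nodes, total_gaudis, total_leaves, total_spines)  # 512 4096 96 48
```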
        
           | Manabu-eo wrote:
            | The PCIe HL-338 version also lists 24 200GbE RDMA NICs in a
            | dual-slot configuration. How would they be connected?
        
             | wmf wrote:
             | They may go to the top of the card where you can use an
             | SLI-like bridge to connect multiple cards.
        
         | buildbot wrote:
          | 200Gb is not going to be using Cat cabling; it will be fiber
          | (or direct attach copper cable, as noted by dogma1138) with a
          | QSFP interface.
        
           | dogma1138 wrote:
           | It will most likely use copper QSFP56 cables since these
           | interfaces are either used in inter rack or adjacent rack
           | direct attachments or to the nearest switch.
           | 
            | 0.5-1.5/2m copper cables are easily available and cheap, and
            | 4-8m (and even longer) is also possible with copper but tends
            | to be more expensive and harder to come by.
           | 
            | Even 800Gb is possible with copper cables these days but
            | you'll end up spending just as much if not more on cabling as
            | the rest of your kit...
            | https://www.fibermall.com/sale-460634-800g-osfp-
            | acc-3m-flt.h...
        
             | buildbot wrote:
             | Fair point!
        
         | juliangoldsmith wrote:
         | For Gaudi2, it looks like 21/24 ports are internal to the
         | server. I highly doubt those have actual individual cables.
         | Most likely they're just carried on PCBs like any other signal.
         | 
          | 100GbE is only supported on twinax anyway, so Cat8 is
         | irrelevant here. The other 3 ports are probably QSFP or
         | something.
        
       | colechristensen wrote:
       | Anyone have experience and suggestions for an AI accelerator?
       | 
       | Think prototype consumer product with total cost preferably <
       | $500, definitely less than $1000.
        
         | jsheard wrote:
         | The default answer is to get the biggest Nvidia gaming card you
         | can afford, prioritizing VRAM size over speed. Ideally one of
         | the 24GB ones.
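For context on why VRAM is the priority, a hypothetical sizing rule of thumb (the helper function and the 1.2x overhead factor are illustrative assumptions, not from the thread):

```python
# Hypothetical rule-of-thumb for sizing a card by VRAM. The 1.2x
# overhead factor (KV cache, activations) is an assumption, not a spec.

def min_vram_gb(params_billion: float, bytes_per_param: float,
                overhead: float = 1.2) -> float:
    """Rough GB of VRAM needed to hold model weights for inference.

    bytes_per_param: 2 for fp16/bf16, 1 for 8-bit, 0.5 for 4-bit quant.
    """
    return params_billion * bytes_per_param * overhead

print(min_vram_gb(13, 2))    # ~31 GB: a 13B fp16 model misses a 24GB card
print(min_vram_gb(13, 0.5))  # ~8 GB: 4-bit quantized, fits easily
```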
        
         | mirekrusin wrote:
         | Rent or 3090, maybe used 4090 if you're lucky.
        
         | Hugsun wrote:
          | You can get very cheap Tesla P40s with 24GB of RAM. They are
          | much, much slower than the newer cards but offer decent value
          | for running a local chatbot.
         | 
         | I can't speak to the ease of configuration but know that some
         | people have used these successfully.
        
         | jononor wrote:
         | What is the workload?
        
         | hedgehog wrote:
          | What else is on the BOM? Volume? At that price you likely want
         | to use whatever resources are on the SoC that runs the thing
         | and work around that. Feel free to e-mail me.
        
         | wmf wrote:
         | AMD Hawk Point?
        
         | dist-epoch wrote:
          | All new CPUs will have so-called NPUs inside them, to help run
          | models locally.
        
         | JonChesterfield wrote:
         | I liked my 5700XT. That seems to be $200 now. Ran arbitrary
         | code on it just fine. Lots of machine learning seems to be
         | obsessed with amount of memory though and increasing that is
         | likely to increase the price. Also HN doesn't like ROCm much,
         | so there's that.
        
       | neilmovva wrote:
       | A bit surprised that they're using HBM2e, which is what Nvidia
       | A100 (80GB) used back in 2020. But Intel is using 8 stacks here,
       | so Gaudi 3 achieves comparable total bandwidth (3.7TB/s) to H100
       | (3.4TB/s) which uses 5 stacks of HBM3. Hopefully the older HBM
       | has better supply - HBM3 is hard to get right now!
       | 
       | The Gaudi 3 multi-chip package also looks interesting. I see 2
       | central compute dies, 8 HBM die stacks, and then 6 small dies
       | interleaved between the HBM stacks - curious to know whether
       | those are also functional, or just structural elements for
       | mechanical support.
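The per-stack bandwidth implied by those totals (illustrative arithmetic only; the 3.7 and 3.4 TB/s figures are the ones in the comment above):

```python
# Per-stack bandwidth implied by the totals in the comment above
# (3.7 TB/s over 8 stacks of HBM2e vs ~3.4 TB/s over 5 stacks of HBM3).

gaudi3_bw_tbps, gaudi3_stacks = 3.7, 8
h100_bw_tbps, h100_stacks = 3.4, 5

gaudi3_per_stack = gaudi3_bw_tbps / gaudi3_stacks  # ~0.46 TB/s per stack
h100_per_stack = h100_bw_tbps / h100_stacks        # ~0.68 TB/s per stack
print(gaudi3_per_stack, h100_per_stack)
```

So each newer HBM3 stack moves roughly half again as much data as an HBM2e stack here; Intel compensates with more stacks.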
        
         | bayindirh wrote:
         | > A bit surprised that they're using HBM2e, which is what
         | Nvidia A100 (80GB) used back in 2020.
         | 
         | This is one of the secret recipes of Intel. They can use older
         | tech and push it a little further to catch/surpass current gen
         | tech until current gen becomes easier/cheaper to
         | produce/acquire/integrate.
         | 
         | They have done it with their first quad core processors by
         | merging two dual core processors (Q6xxx series), or by creating
         | absurdly clocked single core processors aimed at very niche
         | market segments.
         | 
            | We had not seen it until now because they were asleep at the
            | wheel, knocked unconscious by AMD.
        
           | mvkel wrote:
           | Interesting.
           | 
           | Would you say this means Intel is "back," or just not
           | completely dead?
        
             | bayindirh wrote:
              | No, this means Intel has woken up and is trying. There's
              | no guarantee of anything. I'm more of an AMD person, but I
             | want to see fierce competition, not monopoly, even if it's
             | "my team's monopoly".
        
               | chucke1992 wrote:
                | Well, the only reason AMD is doing well at CPUs is
                | because Intel is sleeping. Otherwise it would be Nvidia
                | vs AMD (less steroids though).
        
               | bayindirh wrote:
               | EPYC is actually pretty good. It's true that Intel was
               | sleeping, but AMD's new architecture is a beast. Has
               | better memory support, more PCIe lanes and better overall
               | system latency and throughput.
               | 
               | Intel's TDP problems and AVX clock issues leave a bitter
               | taste in the mouth.
        
           | alexey-salmin wrote:
           | Oh dear, Q6600 was so bad, I regret ever owning it
        
             | PcChip wrote:
             | Really? I never owned one but even I remember the famous
             | SLACR, I thought they were the hot item back then
        
               | alexey-salmin wrote:
               | It was "hot" but using one as a main desktop in 2007 was
               | depressing due to abysmal single-core performance.
        
             | mrybczyn wrote:
             | What? It was outstanding for the time, great price
             | performance, and very tunable for clock / voltage IIRC.
        
               | alexey-salmin wrote:
                | Well, overclocked I don't know, but out-of-the-box
                | single-core performance completely sucked. And in 2007
                | not enough applications were multithreaded to make up
                | for it with more cores.
               | 
               | It was fun to play with but you'd also expect the higher-
               | end desktop to e.g. handle x264 videos which was not the
               | case (search for q6600 on videolan forum). And
               | depressingly many cheaper CPUs of the time did it easily.
        
             | JonChesterfield wrote:
             | 65nm tolerated a lot of voltage. Fun thing to overclock.
        
             | bayindirh wrote:
             | I owned one, it was a performant little chip. Developed my
             | first multi core stuff with it.
             | 
             | I loved it, to be honest.
        
             | chucke1992 wrote:
             | Q6600 was quite good but E8400 was the best.
        
               | alexey-salmin wrote:
               | E8400 was actually good, yes
        
               | astrodust wrote:
               | Q6600 is the spiritual successor to the ABIT BP6 Dual
               | Celeron option: https://en.wikipedia.org/wiki/ABIT_BP6
        
               | bayindirh wrote:
               | ABIT was a legend in motherboards. I used their AN-7
               | Ultra and AN-8 Ultra. No newer board gave the flexibility
               | and capabilities of these series.
               | 
               | My latest ASUS was good enough, but I didn't (and
               | probably won't) build any newer systems, so ABITs will
               | have the crown.
        
           | JonChesterfield wrote:
           | > This is one of the secret recipes of Intel
           | 
           | Any other examples of this? I remember the secret sauce being
           | a process advantage over the competition, exactly the
           | opposite of making old tech outperform the state of the art.
        
             | calaphos wrote:
              | Intel's surprisingly fast 14nm processors come to mind.
              | Born of necessity, as they couldn't get their 10nm and
              | later 7nm processes working for years. Despite that, Intel
              | managed to keep up in single-core performance with newer
              | 7nm AMD chips, although at a much higher power draw.
        
               | Dalewyn wrote:
               | Or today with Alder Lake and Raptor Lake(Refresh), where
               | their CPUs made on Intel 7 (10nm) are on par if not
               | slightly better than AMD's offerings made on TSMC 5nm.
        
               | deepnotderp wrote:
               | That's because CPU performance cares less about
               | transistor density and more about transistor performance,
               | and 14nm drive strength was excellent
        
             | timr wrote:
             | Back in the day, Intel was great for overclocking because
             | all of their chips could run at _significantly_ higher
             | speeds and voltages than on the tin. This was because they
             | basically just targeted the higher specs, and sold the
             | underperforming silicon as lower-tier products.
             | 
             | Don't know if this counts, but feels directionally similar.
        
       | sairahul82 wrote:
       | Can we expect the price of 'Gaudi 3 PCIe' to be reasonable enough
       | to put in a workstation? That would be a game changer for local
       | LLMs
        
         | CuriouslyC wrote:
          | Just based on the RAM alone: if you can't buy a Vision Pro
          | without a second thought about the price tag, don't get your
          | hopes up.
        
         | wongarsu wrote:
          | Probably not. A 40GB Nvidia A100 is arguably reasonable for a
         | workstation at $6000. Depending on your definition an 80GB A100
         | for $16000 is still reasonable. I don't see this being cheaper
         | than an 80GB A100. Probably a good bit more expensive, seeing
         | as it has more RAM, compares itself favorably to the H100, and
         | has enough compelling features that it probably doesn't have to
         | (strongly) compete on price.
        
           | chessgecko wrote:
            | I think you're right on the price, but just to give some
            | false hope: I think newish HBM (and this is HBM2e, which is a
            | little older) is around $15/GB, so for 128GB that's $1,920.
            | There are some other COGS, but in theory they could sell this
            | for like $3-4k and make some gross profit while getting some
            | hobbyist mindshare/research code written for it. I doubt they
            | will though; it might eat too much into profits from the non-
            | PCIe variants.
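The arithmetic in that estimate checks out (the $15/GB figure is the commenter's rough number, not a verified market price):

```python
# Checking the HBM cost estimate above: assumed $15/GB (the commenter's
# figure, not a verified price) times 128 GB of capacity.

hbm_price_per_gb = 15   # USD/GB, rough estimate from the comment
capacity_gb = 128

hbm_cost = hbm_price_per_gb * capacity_gb
print(hbm_cost)  # 1920 -> matches the ~$1920 in the comment
```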
        
             | p1esk wrote:
             | _in theory they could sell this for like $3-4k_
             | 
             | You're joking, right? They will price it to match current
             | H100 pricing. Multiply your estimate by 10x.
        
               | chessgecko wrote:
                | They could. I know they won't, but they wouldn't lose
                | money on the parts.
        
           | Workaccount2 wrote:
            | Interestingly they are using HBM2e memory, which is a few
            | years old at this point. The price might end up being
            | surprisingly good because of this.
        
           | 0cf8612b2e1e wrote:
           | Surely NVidia's pricing is more what the market will bear vs
           | an intrinsic cost to build. Intel being the underdog should
           | be willing to offer a discount just to get their foot in the
           | door.
        
             | wmf wrote:
             | Nvidia is charging $35K so a discount relative to that is
             | still very expensive.
        
             | tormeh wrote:
             | Pricing is normally what the market will bear. If this is
             | below your cost as supplier you exit the market.
        
               | AnthonyMouse wrote:
               | But if your competitor's price is dramatically _above_
               | your cost, you can provide a huge discount as an
               | incentive for customers to pay the transition cost to
               | your system while still turning a tidy profit.
        
           | narrator wrote:
            | Isn't it much better to get a Mac Studio with an M2 Max and
            | 192GB of RAM and 31 teraflops for $6,599 and run llama.cpp?
        
         | ipsum2 wrote:
         | It won't be under $10k.
        
       | yieldcrv wrote:
        | Has anyone here bought an AI accelerator to run their AI SaaS
        | service from their home to customers, instead of trying to make
        | a profit on top of OpenAI or Replicate?
        | 
        | Seems like an okay $8,000 - $30,000 investment, and bare metal
        | server maintenance isn't that complicated these days.
        
         | shiftpgdn wrote:
         | Dingboard runs off of the owner's pile of used gamer cards. The
         | owner frequently posts about it on twitter.
        
       | kaycebasques wrote:
       | Wow, I very much appreciate the use of the 5 Ws and H [1] in this
       | announcement. Thank you Intel for not subjecting my eyes to corp
       | BS
       | 
       | [1] https://en.wikipedia.org/wiki/Five_Ws
        
         | belval wrote:
          | I wonder if, with the advent of LLMs being able to spit out
          | perfect corpo-speak, everyone will recenter on succinct "here's
          | the gist" writing, as the long version becomes associated with
          | cheap automated output.
        
       | YetAnotherNick wrote:
        | So now hardware companies have stopped reporting FLOP/s and
        | instead report in arbitrary units of parallel operations/s.
        
         | AnonMO wrote:
          | 1,835 TFLOPS FP8. You have to look for it, but they posted it.
          | The link in the OP is just an announcement; the white paper has
          | more info. https://www.intel.com/content/www/us/en/content-
          | details/8174...
        
       | whalesalad wrote:
       | https://www.merriam-webster.com/dictionary/gaudy
        
         | jagger27 wrote:
         | https://en.wikipedia.org/wiki/Antoni_Gaud%C3%AD
        
         | riazrizvi wrote:
          | That's an i. He's one of the greatest architects of all time.
         | https://www.archdaily.com/877599/10-must-see-gaudi-buildings...
        
         | TheAceOfHearts wrote:
         | Honestly, I thought the same thing upon reading the name. I'm
         | aware of the reference to Antoni Gaudi, but having the name
         | sound so close to gaudy seems a bit unfortunate. Surely they
         | must've had better options? Then again I don't know how these
         | sorts of names get decided anymore.
        
           | whalesalad wrote:
           | to be fair intel is not known for naming things well.
        
             | brookst wrote:
             | Yeah I can't believe people are nitpicking the name when it
             | could just as easily have been AIX19200xvr4200AI.
        
           | prewett wrote:
           | 'Gaudi' is properly pronounced Ga-oo-DEE in his native
           | Catalan, whereas (in my dialect) 'gaudy' is pronounced GAW-
           | dee. My guess is Intel wasn't even thinking about 'gaudy'
           | because they were thinking about "famous architects" or
           | whatever the naming pool was. Although, I had heard that the
           | 'gaudy' came from the architect's name because of what people
           | thought of his work. (I'm not sure this is correct, it was
           | just my first introduction to the word.)
        
       | andersa wrote:
       | Price?
        
       | mpreda wrote:
       | How much does one such card cost?
        
       | kylixz wrote:
       | This is a bit snarky -- but will Intel actually keep this product
       | line alive for more than a few years? Having been bitten by
       | building products around some of their non-x86 offerings where
       | they killed good IP off and then failed to support it... I'm
       | skeptical.
       | 
       | I truly do hope it is successful so we can have some alternative
       | accelerators.
        
         | forkerenok wrote:
         | I'm not very involved in the broader topic, but isn't the
         | shortage of hardware for AI-related workloads intense enough so
         | as to grant them the benefit of the doubt?
        
         | jtriangle wrote:
         | The real question is, how long does it actually have to hang
         | around really? With the way this market is going, it probably
         | only has to be supported in earnest for a few years by which
         | point it'll be so far obsolete that everyone who matters will
         | have moved on.
        
           | AnthonyMouse wrote:
           | We're talking about the architecture, not the hardware model.
           | What people want is to have a new, faster version in a few
           | years that will run the same code written for this one.
           | 
           | Also, hardware has a lifecycle. At some point the old
           | hardware isn't worth running in a large scale operation
           | because it consumes more in electricity to run 24/7 than it
           | would cost to replace with newer hardware. But then it falls
           | into the hands of people who aren't going to run it 24/7,
           | like hobbyists and students, which as a manufacturer you
           | still want to support because that's how you get people to
           | invest their time in your stuff instead of a competitor's.
        
         | riffic wrote:
         | Itanic was a fun era
        
           | cptskippy wrote:
           | Itanium only stuck around as long as it did because they were
           | obligated to support HP.
        
         | cptskippy wrote:
          | I think it's a valid question. Intel has a habit of quietly
          | killing off anything that doesn't immediately ship millions of
          | units or that they're contractually obligated to support.
        
         | astrodust wrote:
         | I hope it pairs well with Optane modules!
        
           | VHRanger wrote:
           | I'll add it right next to my Xeon Phi!
        
         | iamleppert wrote:
         | Long enough for you to get in, develop some AI product, raise
         | investment funds, and get out with your bag!
        
       | AnonMO wrote:
       | it's crazy that Intel can't manufacture its own chips atm, but it
       | looks like that might change in the coming years as new fabs come
       | online.
        
       | alecco wrote:
       | Gaudi 3 has PCIe 4.0 (vs. H100 PCIe 5.0, so 2x the bandwidth).
       | Probably not a deal-breaker but it's strange for Intel (of all
       | vendors) to lag behind in PCIe.
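Whichever generation the shipping part actually has, the raw per-direction difference for a x16 link is easy to quantify; a quick sketch (both generations use 128b/130b encoding):

```python
# Per-direction bandwidth of a x16 PCIe link, in GB/s.
# Gen4 runs at 16 GT/s per lane, Gen5 at 32 GT/s; both use 128b/130b.
def pcie_x16_gb_s(gt_per_s):
    return gt_per_s * 16 * (128 / 130) / 8  # lanes * encoding / bits-per-byte

gen4 = pcie_x16_gb_s(16.0)  # ~31.5 GB/s
gen5 = pcie_x16_gb_s(32.0)  # ~63.0 GB/s
```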
        
         | wmf wrote:
         | N5, PCIe 4.0, and HBM2e. This chip was probably delayed two
         | years.
        
           | alecco wrote:
            | Good point, it's built on TSMC while Intel is pushing to
            | become the #2 foundry. Probably because Gaudi was made by
            | Habana Labs, an Israeli company Intel acquired in 2019 (not
            | an internal project). Who knows.
           | 
           | https://www.semianalysis.com/p/is-intel-back-foundry-and-
           | pro...
        
         | KeplerBoy wrote:
         | The whitepaper says it's PCIe 5 on Gaudi 3.
        
       | brcmthrowaway wrote:
       | Does this support apple silicon?
        
       | ancharm wrote:
       | Is the scheduling / bare metal software open source through
       | OneAPI? Can a link be posted showing it if so?
        
       | chessgecko wrote:
        | I feel a little misled by the speedup numbers. They are
        | comparing lower batch size H100/H200 numbers to higher batch
        | size Gaudi 3 numbers for throughput (which is heavily improved
        | by increasing batch size). I feel like there are some inference
        | scenarios where this is better, but it's really hard to tell
        | from the numbers in the paper.
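The batch-size effect described above can be made concrete with a toy roofline model of LLM decode; every number below is an illustrative assumption, not a measured Gaudi 3 or H100 figure:

```python
# Toy model: each decode step either streams the weights once
# (memory-bound) or does batch * flops_per_token of compute
# (compute-bound); throughput is batch / max of the two times.
def tokens_per_s(batch, weight_bytes, bw_bytes_s, peak_flops, flops_per_token):
    t_mem = weight_bytes / bw_bytes_s            # weights streamed once per step
    t_cmp = batch * flops_per_token / peak_flops
    return batch / max(t_mem, t_cmp)

# Hypothetical 70B-parameter FP8 model on a 3.7 TB/s, ~1.8 Pflop/s part:
small = tokens_per_s(1, 70e9, 3.7e12, 1.8e15, 140e9)
large = tokens_per_s(64, 70e9, 3.7e12, 1.8e15, 140e9)
# While memory-bound, throughput scales linearly with batch (64x here),
# which is why comparisons at different batch sizes inflate speedups.
```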
        
       | m3kw9 wrote:
       | Can you run Cuda on it?
        
         | boroboro4 wrote:
         | No one runs Cuda, everyone runs PyTorch. Which you can run on
         | it.
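The usual pattern, sketched here under the assumption that Intel's stock `habana_frameworks` PyTorch bridge is installed on Gaudi machines (importing it registers the `hpu` device with PyTorch); anywhere else this falls back to CPU:

```python
# Hedged sketch: pick the Gaudi ("hpu") device string when the Habana
# PyTorch bridge is importable, else fall back to "cpu". The package
# name follows Intel's habana_frameworks distribution.
def pick_device():
    try:
        import habana_frameworks.torch.core  # noqa: F401 -- registers "hpu"
        return "hpu"
    except ImportError:
        return "cpu"

device = pick_device()  # pass to torch.device(...) / tensor factories
```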
        
       | geertj wrote:
       | I wonder if someone knowledgeable could comment on OneAPI vs
       | Cuda. I feel like if Intel is going to be a serious competitor to
       | Nvidia, both software and hardware are going to be equally
       | important.
        
         | ZoomerCretin wrote:
         | I'm not familiar with the particulars of OneAPI, but it's just
         | a matter of rewriting CUDA kernels into OneAPI. This is pretty
         | trivial for the vast majority of small (<5 LoC) kernels. Unlike
         | AMD, it looks like they're serious about dogfooding their own
         | chips, and they have a much better reputation for their driver
         | quality.
        
           | JonChesterfield wrote:
           | All the dev work at AMD is on our own hardware. Even things
           | like the corporate laptops are ryzen based. The first gen
            | ryzen laptop I got was _terrible_ but it wasn't Intel. We
           | also do things like develop ROCm on the non-qualified cards
           | and build our tools with our tools. It would be crazy not to.
        
             | ZoomerCretin wrote:
             | Yes that's why I qualified "serious" dogfooding. Of course
             | you use your hardware for your own development work, but
             | it's clearly not enough given that showstopper driver
             | issues are going unfixed for half a year.
        
             | sorenjan wrote:
             | Why isn't AMD part of the UXL Foundation? What does AMD
              | gain from not working together with other companies to make
             | an open alternative to Cuda?
             | 
             | Please make SYCL a priority, cross platform code would make
             | AMD GPUs a viable alternative in the future.
        
           | alecco wrote:
           | Trivial??
        
             | TApplencourt wrote:
             | You have SYCLomatic to help.
        
             | ZoomerCretin wrote:
             | That statement has two qualifications.
        
           | wmf wrote:
           | IMO dogfooding Gaudi would mean training a model on it (and
           | the only way to "prove" it would be to release that model).
        
         | meragrin_ wrote:
         | Apparently, Google, Qualcomm, Samsung, and ARM are rallying
         | around oneAPI:
         | 
         | https://uxlfoundation.org/
        
       | mk_stjames wrote:
       | One nice thing about this (and the new offerings from AMD) is
       | that they will be using the "open accelerator module (OAM)"
       | interface- which standardizes the connector that they use to put
       | them on baseboards, similar to the SXM connections of Nvidia that
        | use MegArray connectors to their baseboards.
       | 
       | With Nvidia, the SXM connection pinouts have always been held
       | proprietary and confidential. For example, P100's and V100's have
       | standard PCI-e lanes connected to one of the two sides of their
       | MegArray connectors, and if you know that pinout you could
       | literally build PCI-e cards with SXM2/3 connectors to repurpose
       | those now obsolete chips (this has been done by one person).
       | 
       | There are thousands, maybe tens of thousands of P100's you could
        | pick up for literally <$50 apiece these days which technically
       | give you more Tflops/$ than anything on the market, but they are
       | useless because their interface was not ever made open and has
       | not been reverse engineered openly and the OEM baseboards (Dell,
       | Supermicro mainly) are still hideously expensive outside China.
       | 
       | I'm one of those people who finds 'retro-super-computing' a cool
       | hobby and thus the interfaces like OAM being open means that
       | these devices may actually have a life for hobbyists in 8~10
       | years instead of being sent directly to the bins due to secret
       | interfaces and obfuscated backplane specifications.
        
         | JonChesterfield wrote:
         | I really like this side to AMD. There's a strategic call
         | somewhere high up to bias towards collaboration with other
         | companies. Sharing the fabric specifications with broadcom was
         | an amazing thing to see. It's not out of the question that
         | we'll see single chips with chiplets made by different
         | companies attached together.
        
         | wmf wrote:
         | Why don't they sell used P100 DGX/HGX servers as a unit? Are
         | those bare P100s only so cheap precisely because they're
         | useless?
        
           | mk_stjames wrote:
           | I have a theory some big cloud provider moved a ton of racks
           | from SXM2 P100's to SXM2 V100's (those were a thing) and thus
           | orphaned an absolute ton of P100's without their baseboards.
           | 
            | Or, these salvage operations just stripped the racks and
            | kept the small stuff, e-wasting the racks because they
            | thought that was a more efficient use of their storage
            | space and easier to sell, without thinking it through.
        
         | formerly_proven wrote:
         | The price is low because they're useless (except for replacing
          | dead cards in a DGX); if you had a $40 PCIe AIC-to-SXM adapter,
         | the price would go up a lot.
         | 
         | > I'm one of those people who finds 'retro-super-computing' a
         | cool hobby and thus the interfaces like OAM being open means
         | that these devices may actually have a life for hobbyists in
         | 8~10 years instead of being sent directly to the bins due to
         | secret interfaces and obfuscated backplane specifications.
         | 
         | Very cool hobby. It's also unfortunate how stringent e-waste
          | rules lead to so much perfectly fine hardware being scrapped.
         | And how the remainder is typically pulled apart to the board /
         | module level for spares. Makes it very unlikely to stumble over
         | more or less complete-ish systems.
        
           | KeplerBoy wrote:
           | I'm not sure the prices would go up that much. What would
           | anyone buy that card for?
           | 
           | Yes, it has a decent memory bandwidth (~750 GB/s) and it runs
           | CUDA. But it only has 16 GB and doesn't support tensor cores
           | or low precision floats. It's in a weird place.
        
             | trueismywork wrote:
             | Scientific computing would buy it up like hot cakes.
        
               | KeplerBoy wrote:
               | Only if the specific workload needs FP64 (4.5 Tflop/s),
               | the 9 Tflop/s for FP32 can be had for cheap with Turing
               | or Ampere consumer cards.
               | 
               | Still, your point stands. It's crazy how that 2016 GPU
               | has two thirds the FP32 power of this new 2024 unobtanium
               | card and infinitely more FP64.
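The P100 figures above follow directly from core count and clock; a quick check (the clock used is the PCIe variant's boost, an assumption, since the SXM2 part boosts higher):

```python
# Peak = cores * 2 ops/cycle (FMA) * clock. GP100 has a 1:2 FP64 ratio.
cuda_cores = 3584
boost_ghz = 1.303               # PCIe P100 boost; SXM2 boosts to ~1.48 GHz
fp32_tflops = cuda_cores * 2 * boost_ghz / 1000   # ~9.3 Tflop/s
fp64_tflops = fp32_tflops / 2                     # ~4.7 Tflop/s
```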
        
         | buildbot wrote:
         | The SXM2 interface is actually publicly documented! There is an
         | open compute spec for a 8-way baseboard. You can find the
         | pinouts there.
        
           | mk_stjames wrote:
           | I had read their documents such as the spec for the Big Basin
           | JBOG, where everything is documented except the actual
           | pinouts on the base board. Everything leading up to it and
           | from it is there but the actual MegArray pinout connection to
           | a single P100/V100 I never found.
           | 
           | But maybe there was more I missed. I'll take another look.
        
       | throwaway4good wrote:
       | Worth noting that it is fabbed by TSMC.
        
       | amelius wrote:
       | Missing in these pictures are the thermal management solutions.
        
         | InitEnabler wrote:
          | If you look at one of the pictures you can get a peek at what
         | they look like (I think...) in the bottom right.
         | 
         | https://www.intel.com/content/dam/www/central-libraries/us/e...
        
       | KeplerBoy wrote:
        | Vector floating point performance comes in at 14 Tflop/s for
        | FP32 and 28 Tflop/s for FP16.
       | 
       | Not the best of times for stuff that doesn't fit matrix
       | processing units.
        
       | einpoklum wrote:
       | If your metric is memory bandwidth or memory size, then this
       | announcement gives you some concrete information. But - suppose
       | my metric for performance is matrix-multiply-add (or just matrix-
       | multiply) bandwidth. What MMA primitives does Gaudi offer (i.e.
       | type combinations and matrix dimension combinations), and how
        | many such ops per second, in practice? The linked page says
       | "64,000 in parallel", but that does not actually tell me much.
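One hedged way to decode the "64,000 in parallel" figure: if it means MACs per matrix engine, and one assumes eight MMEs and a ~1.8 GHz clock (both assumptions here, not stated on the linked page), the arithmetic lands near the white paper's quoted FP8 peak:

```python
# Hypothetical decoding of "64,000 in parallel" as MACs per matrix
# engine. The MME count and clock are assumptions for this sanity check.
macs_per_mme = 64_000
num_mme = 8
clock_hz = 1.8e9
peak_tflops = macs_per_mme * num_mme * 2 * clock_hz / 1e12  # MAC = 2 ops
# ~1843 Tflops, close to the white paper's 1835 Tflops FP8 figure.
```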
        
       ___________________________________________________________________
       (page generated 2024-04-09 23:00 UTC)