[HN Gopher] PyTorch Library for Running LLM on Intel CPU and GPU
___________________________________________________________________
PyTorch Library for Running LLM on Intel CPU and GPU
Author : ebalit
Score : 265 points
Date : 2024-04-03 10:28 UTC (12 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| tomrod wrote:
| Looking forward to reviewing!
| Hugsun wrote:
| I'd be interested in seeing benchmark data. The speed seemed
| pretty good in those examples.
| antonp wrote:
| Hm, no major cloud provider offers Intel GPUs.
| anentropic wrote:
| Lots offer Intel CPUs though...
| VHRanger wrote:
| No, but for consumers they're a great offering.
|
| 16GB of VRAM and performance around a 4060 Ti or so, but for
| 65% of the price.
| _joel wrote:
| and 65% of the software support, or less, I'm inclined to
| believe? Although having more players in the fold is
| definitely a good thing.
| VHRanger wrote:
| Intel is historically really good at the software side,
| though.
|
| For all their hardware research hiccups in the last 10
| years, they've been delivering on open source machine
| learning libraries.
|
| It's apparently the same on driver improvements and gaming
| GPU features in the last year.
| frognumber wrote:
| I'm optimistic Intel will get the software right in due
| course. Last I looked, it wasn't all there yet, but it
| was on the right track.
|
| Right now, I have a nice Nvidia card, but if things stay
| on track, my next GPU may well be Intel. Open source, not
| to mention better value.
| HarHarVeryFunny wrote:
| But even if Intel have stable optimized drivers and ML
| support, it'd still need to be supported by PyTorch/etc
| for most developers to want to use it. People want to
| write at high level, not at CUDA-type level.
| VHRanger wrote:
| Intel is supported in PyTorch, though. It's supported
| from their own branch, which is presumably a big
| annoyance to install, but it does work.
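|
| A minimal sketch of that path (assuming the "xpu" build of
| intel-extension-for-pytorch is installed alongside a matching
| PyTorch build; the tiny model is purely illustrative):
|
|   import torch
|   import intel_extension_for_pytorch as ipex
|
|   # Intel GPUs are exposed as the "xpu" device by the extension
|   model = torch.nn.Sequential(torch.nn.Linear(512, 512),
|                               torch.nn.ReLU()).to("xpu").eval()
|   model = ipex.optimize(model)  # Intel-specific kernel opts
|   x = torch.randn(8, 512, device="xpu")
|   with torch.no_grad():
|       y = model(x)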
| HarHarVeryFunny wrote:
| I just tried googling for Intel's PyTorch, and it's clear
| as mud exactly what runs on the GPU and what does not. I
| assume they'd be bragging about it if this ran everything
| on their GPU the same as it would on Nvidia, so I'm
| guessing it just accelerates some operations.
| belthesar wrote:
| Intel GPUs got quite a bit of penetration in the SE Asian
| market, and Intel is close to releasing a new generation. In
| addition, Intel's allowing for GPU virtualization without
| additional license fees (unlike Nvidia and GRID licenses),
| allowing hosting operators to carve up these cards. I have a
| feeling we're going to see a lot more Intel offerings
| available.
| DrNosferatu wrote:
| Any performance benchmark against 'llamafile'[0] or others?
|
| [0] - https://github.com/mozilla-Ocho/llamafile
| VHRanger wrote:
| You can already use Intel GPUs (both Arc and iGPUs) with
| llama.cpp on a bunch of backends:
|
| - SYCL [1]
|
| - Vulkan
|
| - OpenCL
|
| I don't own the hardware, but I imagine SYCL is the most
| performant for Arc, because it's the one Intel is pushing
| for their datacenter stuff.
|
| [1]:
| https://www.intel.com/content/www/us/en/developer/articles/t...
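|
| For example, via the llama-cpp-python bindings (a sketch; it
| assumes the wheel was built against one of those GPU backends,
| and the model path is hypothetical):
|
|   from llama_cpp import Llama
|
|   llm = Llama(
|       model_path="models/mistral-7b.Q4_K_M.gguf",  # any GGUF model
|       n_gpu_layers=-1,  # offload all layers to the GPU backend
|   )
|   out = llm("Explain SYCL in one sentence:", max_tokens=64)
|   print(out["choices"][0]["text"])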
| captaindiego wrote:
| Are there any Intel GPUs with a lot of vRAM that someone could
| recommend that would work with this?
| goosedragons wrote:
| For consumer stuff there's the Intel Arc A770 with 16GB VRAM.
| More than that and you start moving into enterprise stuff.
| ZeroCool2u wrote:
| Which seems like their biggest mistake. If they would just
| release a card with more than 24GB VRAM, people would be
| clamoring for their cards, even if they were marginally
| slower. It's the same reason that 3090s are still in high
| demand compared to the 4090s.
| Aromasin wrote:
| There's the Max GPU (Ponte Vecchio), their datacentre offering,
| with 128GB of HBM2e memory, 408 MB of L2 cache, and 64 MB of L1
| cache. Then there's Gaudi, which has similar numbers but with
| cores specific for AI workloads (as far as I know from the
| marketing).
|
| You can pick them up in prebuilds from Dell and Supermicro:
| https://www.supermicro.com/en/accelerators/intel
|
| Read more about them here: https://www.servethehome.com/intel-
| shows-gpu-max-1550-perfor...
| vegabook wrote:
| The company that did 4-cores-forever has the opportunity to
| redeem itself in its next consumer GPU release by disrupting
| the "8-16GB VRAM forever" that AMD and Nvidia have been
| imposing on us for a decade. It would be poetic to see
| 32-48GB at a non-eye-watering price point.
|
| Intel definitely seems to be doing all the right things on
| software support.
| sitkack wrote:
| What is obvious to us, is an industry standard to Product
| Managers. When is the last time you have seen an industry
| player upset the status quo? Intel has not changed _that_ much.
| zoobab wrote:
| "It would be poetic to see 32-48GB at a non-eye-watering price
| point."
|
| I heard some ASRock motherboard BIOSes could set the VRAM up
| to 64GB on Ryzen 5.
|
| Doing some investigations with different AMD hardware atm.
| stefanka wrote:
| That would be an interesting information. Which MB works with
| with which APU with 32 or more GB of VRAM. Can you post your
| findings please?
| LoganDark wrote:
| When has an APU ever been as fast as a GPU? How much cache
| does it have, a few hundred megabytes? That can't possibly be
| enough for matmul, no matter how much slow DDR4/5 is
| technically addressable.
| whalesalad wrote:
| Still wondering why we can't have GPUs with SODIMM slots so
| you can crank the VRAM.
| amir_karbasi wrote:
| I believe the issue is that graphics cards require really
| fast memory, which requires close memory placement (that's
| why the memory sits so close to the core on the board).
| Expandable memory would not be able to provide the required
| bandwidth here.
| frognumber wrote:
| The universe used to have hierarchies. Fast memory close,
| slow memory far. Registers. L1. L2. L3. RAM. Swap.
|
| The same thing would make a lot of sense here. Super-fast
| memory close, with overflow into classic DDR slots.
|
| As a footnote, going parallel also helps. Eight sticks of
| RAM at 1/8 the bandwidth each give the same aggregate
| bandwidth as one stick at the full rate, if you don't
| multiplex them onto the same traces.
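|
| A toy version of that arithmetic (numbers are illustrative):
|
|   per_stick_gb_s = 8   # hypothetical bandwidth of one slow stick
|   channels = 8         # independent, non-multiplexed channels
|   print(per_stick_gb_s * channels)  # 64 GB/s, one fast stick's worth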
| riskable wrote:
| It's not so simple... The way GPU architecture works is
| that it _needs_ as-fast-as-possible access to its VRAM.
| The concept of "overflow memory" for a GPU is your PC's
| RAM. Adding a secondary memory controller and equivalent
| DRAM to the card itself would only provide a trivial
| improvement over "just using the PC RAM".
|
| Point of fact: GPUs don't even use all the PCI Express
| lanes they have available to them! Most GPUs (even top of
| the line ones like Nvidia's 4090) only use about 8 lanes
| of bandwidth. This is why some newer GPUs are being
| offered with M.2 slots so you can add an SSD
| (https://press.asus.com/news/asus-dual-geforce-
| rtx-4060-ti-ss... ).
| wongarsu wrote:
| GPUs have memory hierarchies too. A 4090 has about 16MB
| of L1 cache and 72MB of L2 cache, followed by the 24GB of
| GDDR6 RAM, followed by host ram that can be accessed via
| PCIe.
|
| The issue is that GPUs are massively parallel. A 4090 has
| 128 streaming multiprocessors, each executing 128
| "threads" or "lanes" in parallel. If each "thread" works
| on a different part of memory that leaves you with 1kB of
| L1 cache per thread, and 4.5kB of L2 cache each. For each
| clock cycle you might be issuing thousands of requests to
| your memory controller for cache misses and prefetching.
| That's why you want insanely fast RAM.
|
| You can write CUDA code that directly accesses your host
| memory as a layer beyond that, but usually you want to
| transfer that data in bigger chunks. You probably could
| make a card that adds DDR4 slots as an additional level
| of hierarchy. It's the kind of weird stuff Intel might do
| (the Phi had some interesting memory layout ideas).
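|
| The per-thread figures fall out of simple division (using the
| approximate 4090 numbers cited above):
|
|   sms = 128               # streaming multiprocessors
|   lanes = 128             # "threads" in flight per SM
|   l1_kb = 16 * 1024       # ~16MB of L1 total
|   l2_kb = 72 * 1024       # ~72MB of L2 total
|   threads = sms * lanes   # 16384 concurrent lanes
|   print(l1_kb / threads)  # ~1 kB of L1 per lane
|   print(l2_kb / threads)  # ~4.5 kB of L2 per lane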
| chessgecko wrote:
| You could, but the memory bandwidth wouldn't be amazing
| unless you had a lot of sticks, and it would end up getting
| pretty expensive.
| justsomehnguy wrote:
| Look at motherboards with more than two memory channels:
| that requires a lot of physical space, which is quite
| restricted on a 50-year-old expansion card standard.
| riskable wrote:
| You can do this sort of thing but you can't use SODIMM slots
| because that places the actual memory chips too far away from
| the GPU. Instead what you need is something like BGA sockets
| (https://www.nxp.com/design/design-center/development-
| boards/... ) which are _stupidly expensive_ (e.g. $600 per
| socket).
| monocasa wrote:
| You could probably use something like CAMM, which solved a
| similar problem for LPDDR.
|
| https://en.wikipedia.org/wiki/CAMM_(memory_module)
| chessgecko wrote:
| Going above 24GB is probably not going to be cheap until GDDR7
| is out, and even that will only push it to 36GB. The fancier
| stacked GDDR6 stuff is probably pretty expensive, and you
| can't just add more dies because of signal integrity issues.
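|
| A rough sketch of where 36GB comes from (assuming the usual
| back-of-envelope: a 384-bit bus, with per-chip density moving
| from 16Gbit on GDDR6 to 24Gbit on GDDR7):
|
|   bus_width = 384
|   chip_width = 32                  # bits per GDDR chip
|   chips = bus_width // chip_width  # 12 chips on a 384-bit bus
|   print(chips * 2)                 # 16Gbit (2GB) dies -> 24GB (4090)
|   print(chips * 3)                 # 24Gbit (3GB) dies -> 36GB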
| frognumber wrote:
| Assuming you want to maintain full bandwidth.
|
| Which I don't care too much about.
|
| However, even 16->24GB is a big step, since a lot of
| models are developed for 3090/4090-class hardware. 36GB would
| place it close to the class of the fancy 40GB data center
| cards.
|
| If Intel decided to push VRAM, it would definitely have a
| market. Critically, a lot of folks would also be incentivized
| to make software compatible, since it would be the cheapest
| way to run models.
| 0cf8612b2e1e wrote:
| At this point, I cannot run an entire class of models
| without OOM. I will take a performance hit if it lets me
| run it at all.
|
| I want a consumer card that can do some number of tokens
| per second. I do not need a monster that can serve as the
| basis for a startup.
| hnfong wrote:
| A maxed out Mac Studio probably fits your requirements as
| stated.
| 0cf8612b2e1e wrote:
| If I were willing to drop $4k on that setup, I might as
| well get the real Nvidia offering.
|
| The hobbyist market needs something priced well under $1k
| to make it accessible.
| rnewme wrote:
| How come you don't care about full bandwidth?
| Dalewyn wrote:
| The thing about RAM speed (aka bandwidth) is that it
| becomes irrelevant if you run out and have to page out to
| slower tiers of storage.
| riskable wrote:
| No kidding... Intel is playing catch-up with Nvidia in the AI
| space and a big reason for that is their offerings aren't
| competitive. You can get an Intel Arc A770 with 16GB of VRAM
| (which was released in October 2022) for about $300, or an
| Nvidia 4060 Ti with 16GB of VRAM for ~$500, which is _twice_
| as fast for AI workloads in reality (see:
| https://cdn.mos.cms.futurecdn.net/FtXkrY6AD8YypMiHrZuy4K-120...
| )
|
| This is a huge problem because _in theory_ the Arc A770 is
| faster! Its theoretical performance (TFLOPS) is _more_ than
| twice that of an Nvidia 4060 (see:
| https://cdn.mos.cms.futurecdn.net/Q7WgNxqfgyjCJ5kk8apUQE-120...
| ). So why does it perform so poorly? Because everything
| AI-related has been developed and optimized to run on Nvidia's
| CUDA.
|
| Mostly, this is a mindshare issue. If Intel offered a
| workstation GPU (i.e. _not_ a ridiculously expensive
| "enterprise" monster) with something like 32GB or 64GB of
| VRAM that developers could use, it would sell! They'd sell
| zillions of them! In fact, I'd wager that they'd be _so_
| popular it'd be hard for consumers to even get their hands on
| one because it would sell out everywhere.
|
| It doesn't even need to be the fastest card. It just needs to
| offer more VRAM than the competition. Right now, if you want to
| do things like training or video generation the lack of VRAM is
| a bigger bottleneck than the speed of the GPU. How does Intel
| not see this!? They have the power to step up and take over a
| huge section of the market but instead they're just copying
| (poorly) what everyone else is doing.
| Workaccount2 wrote:
| Based on leaks, it looks like Intel somehow missed an easy
| opportunity here. There is an insane demand for high-VRAM
| cards now, and it seems the next Intel cards will be 12GB.
|
| Intel, screw everything else, just pack as much VRAM in those
| as you can. Build it and they will come.
| dheera wrote:
| Exactly, I'd love to have 1TB of RAM that can be accessed
| at 6000 MT/s.
| talldayo wrote:
| Optane is crying and punching the walls right now.
| yjftsjthsd-h wrote:
| Does optane have an advantage over RAM here?
| watersb wrote:
| Optane products were sold as DIMMs with single-DIMM
| capacity as high as 512 GB. With an Intel memory
| controller that could make it look like DRAM.
|
| 512 GB.
|
| It was slower than conventional DRAM.
|
| But for AI models, Optane may have an advantage: it's
| bit-addressable.
|
| I'm not aware of any memory controllers that exposed that
| single-bit granularity; Optane was fighting to create a
| niche for itself, between DRAM and NAND Flash: pretending
| to be both, when it was neither.
|
| Bit-level operations, with computational units in the same
| device as massive storage, form an architecture that has
| yet to be developed.
|
| AI GPUs try to be such an architecture by plopping 16GB
| of HBM next to a sea of little dot-product engines.
| glitchc wrote:
| I think the answer to that is fairly straightforward. Intel
| isn't in the business of producing RAM. They would have to
| buy and integrate a third-party product which is likely not
| something their business side has ever contemplated as a
| viable strategy.
| monocasa wrote:
| Their GPUs as sold already include RAM.
| glitchc wrote:
| Yes, but they don't fab their own RAM. It's a cost center
| for them.
| monocasa wrote:
| If they can sell the board with more RAM for more than
| their extra RAM costs, or can sell more GPUs total but
| the RAM itself is priced essentially at cost, then it's
| not a cost center.
| RussianCow wrote:
| That's not what a cost center is. There is an opportunity
| for them to make more money by putting more RAM into
| their GPUs and exposing themselves to a different market.
| Whether they physically manufacture that RAM doesn't
| matter in the slightest.
| ponector wrote:
| I don't agree. Who will buy it? A few enthusiasts who want
| to run LLMs locally but cannot afford an M3 or a 4090?
|
| It will be a niche product with poor sales.
| talldayo wrote:
| > Who will buy it?
|
| Frustrated AMD customers willing to put their money where
| their mouth is?
| bt1a wrote:
| I think there's more than a few enthusiasts who would be
| very interested in buying 1 or more of these cards (if
| they had 32+ GB of memory), but I don't have any data to
| back that opinion up. It is not only those who can't afford
| a 4090 though.
|
| While the 4090 can run models that use less than 24GB of
| memory at blistering speeds, models are going to continue
| to scale up and 24GB is fairly limiting. Because LLM
| inference can take advantage of splitting the layers among
| multiple GPUs, high memory GPUs that aren't super expensive
| are desirable.
|
| To share a personal perspective, I have a desktop with a
| 3090 and an M1 Max Studio with 64GB of memory. I use the M1
| for local LLMs because I can use up to ~57GB of memory,
| even though the output (in terms of tok/s) is much slower
| than ones I can fit on a 3090.
| Dalewyn wrote:
| >models are going to continue to scale up and 24GB is
| fairly limiting
|
| >24GB is fairly limiting
|
| Can I take a moment to suggest that maybe we're very
| spoiled?
|
| 24GB of VRAM is more than most peoples' system RAM, and
| that is "fairly limiting"?
|
| To think Bill once said 640KB would be enough.
| hnfong wrote:
| It doesn't matter whether anyone is "spoiled" or not.
|
| The fact is large language models require a lot of VRAM,
| and the more interesting ones need more than 24GB to run.
|
| The people who are able to afford systems with more than
| 24GB VRAM will go buy hardware that gives them that, and
| when GPU vendors release products with insufficient VRAM
| they limit their market.
|
| I mean inequality is definitely increasing at a worrying
| rate these days, but let's keep the discussion on
| topic...
| Dalewyn wrote:
| I'm just fascinated that the response/demand to running
| out of RAM is _"Just sell us more RAM, god damn!"_
| instead of engineering a solution to make do with what
| is practically (and realistically) available.
| xoranth wrote:
| People have engineered solutions to make what is
| available practical (see all the various quantization
| schemes that have come out).
|
| It is just that there's a limit to how much you can
| compress the models.
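|
| A toy example of the idea behind those schemes (symmetric int8
| quantization of a weight tensor; the numbers are illustrative):
|
|   import numpy as np
|
|   w = np.random.randn(4096).astype(np.float32)  # stand-in weights
|   scale = np.abs(w).max() / 127.0               # per-tensor scale
|   q = np.round(w / scale).astype(np.int8)       # 4x smaller than fp32
|   w_hat = q.astype(np.float32) * scale          # dequantize to compute
|   print(np.abs(w - w_hat).max())                # error at most ~scale/2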
| dekhn wrote:
| I would say that increasing RAM to avoid engineering a
| solution has long been a successful strategy.
|
| I learned my RAM lesson when I bought my first real Linux
| PC. It had 4MB of RAM, which was enough to run X, bash,
| xterm, and emacs. But once I ran all that and also wanted
| to compile with g++, it would start swapping, which in
| the days of slow hard drives, was death to productivity.
|
| I spent $200 to double to 8MB, and then another $200 to
| double to 16MB, and then finally, $200 to max out the RAM
| on my machine-- 32MB! And once I did that everything
| flew.
|
| Rather than attempting to solve the problem by making
| emacs (eight megs and constantly swapping) use less RAM,
| or find a way to hack without X, I deployed money to max
| out my machine (which was practical, but not
| realistically available to me unless I gave up other
| things in life for the short term). Not only was I more
| productive, I used that time to work on _other_
| engineering problems which helped build my career, while
| also learning an important lesson about swapping/paging.
|
| People demand RAM and what was not practically available
| is often available 2 years later as standard. Seems like
| a great approach to me, especially if you don't have
| enough smart engineers to work around problems like that
| (see "How would you sort 4M integers in 2M of RAM?")
| watersb wrote:
| _> I spent $200 to double to 8MB, and then another $200
| to double to 16MB, and then finally, $200 to max out the
| RAM on my machine-- 32MB!_
|
| Thank you. Now I feel a lot better for dropping $700 on
| the 32MB of RAM when I built my first rig.
| whiplash451 wrote:
| By the same logic, we'd still be writing assembly code on
| 640KB RAM machines in 2024.
| michaelt wrote:
| There has in fact been a great deal of careful
| engineering to allow 70 billion parameter models to run
| on _just_ 48GB of VRAM.
|
| The people _training_ 70B parameter models from scratch
| need ~600GB of VRAM to do it!
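|
| The rough arithmetic behind those figures (byte counts are the
| usual rules of thumb, not exact):
|
|   params = 70e9
|   print(params * 2 / 1e9)    # fp16 inference: ~140GB, over 48GB
|   print(params * 0.5 / 1e9)  # 4-bit quantized: ~35GB, fits 48GB
|   # training also stores gradients and optimizer state, commonly
|   # estimated at ~8+ bytes per parameter before activations:
|   print(params * 8 / 1e9)    # ~560GB, in line with ~600GB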
| nl wrote:
| While saying "we want more efficiency" is great there is
| a trade off between size and accuracy here.
|
| It is possible that compressing and using all of human
| knowledge takes a lot of memory and in some cases the
| accuracy is more important than reducing memory usage.
|
| For example, [1] shows how Gemma 2B using AVX512
| instructions could solve problems it couldn't solve using
| AVX2 because of rounding issues with the lower-memory
| instructions. It's likely that most quantization (and
| other memory reduction schemes) have similar problems.
|
| As we develop more multi-modal models that can do things
| like understand 3D video in better than real time, it's
| likely memory requirements will _increase_, not
| decrease.
|
| [1] https://github.com/google/gemma.cpp/issues/23
| loudmax wrote:
| I tend to agree that it would be niche. The machine
| learning enthusiast market is far smaller than the gamer
| market.
|
| But selling to machine learning enthusiasts is not a bad
| place to be. A lot of these enthusiasts are going to go on
| to work at places that are deploying enterprise AI at
| scale. Right now, almost all of their experience is CUDA
| and they're likely to recommend hardware they're familiar
| with. By making consumer Intel GPUs attractive to ML
| enthusiasts, Intel would make their enterprise GPUs much
| more interesting for enterprise.
| mysteria wrote:
| The problem is that this now becomes a long term
| investment, which doesn't work out when we have CEOs
| chasing quarterly profits and all that. Meanwhile Nvidia
| stuck with CUDA all those years back (while ensuring that
| it worked well on both the consumer and enterprise line)
| and now they reap the rewards.
| Wytwwww wrote:
| Current Intel and its leadership seem to be much more
| focused on long-term goals and growth than before, or so
| they claim.
| antupis wrote:
| It would be the same playbook NVIDIA ran with CUDA. Where
| was the market in 2010? It was research labs and hobbyists
| doing vector calculations.
| resource_waste wrote:
| I need offline LLMs for work.
|
| It doesn't need to be consumer grade, and it doesn't need
| to be ultra high end either.
|
| It needs to be cheap enough for my department to
| expense it via petty cash.
| Aerroon wrote:
| It's about mindshare. Random people using your product to
| do AI means that the tooling is going to improve, because
| people will try to use it. But as it stands right now, if
| you think there's any chance you want to use AI in the next
| 5 years, then why would you buy anything other than Nvidia?
|
| It doesn't even matter if that's your primary goal or not.
| alecco wrote:
| AFAIK, unless you are a huge American corp with orders
| above $100m, Nvidia will only sell you old and expensive
| server cards like the crappy A40 PCIe 4.0 48GB GDDR6 at
| $5,000. Good luck getting SXM H100s or GH200.
|
| If Intel sells a stackable kit with a lot of RAM and a
| reasonable interconnect, a lot of corporate customers will
| buy. It doesn't even have to be that good, just halfway
| between PCIe 5.0 and NVLink.
|
| But it seems they are still too stuck in their old ways. I
| wouldn't count on them waking up. Nor AMD. It's sad.
| ponector wrote:
| The parent comment requested a non-enterprise, consumer
| grade GPU with tons of memory. I'm sure there is no market
| for this.
|
| However, server solutions could have some traction.
| resource_waste wrote:
| >M3
|
| >4090
|
| These are noob hardware. A6000 is my choice.
|
| Which really only further emphasizes your point.
|
| >CPU based is a waste of everyone's time/effort
|
| >GPU based is 100% limited by VRAM, and is what you are
| realistically going to use.
| jmward01 wrote:
| Microsoft got where they are because they developed tools
| that everyone used. They got the developers, and the
| consumers followed. Intel (or AMD) could do the same thing:
| get a big card with lots of RAM so that developers get
| used to your ecosystem, and then sell the enterprise GPUs to
| make the $$$. It is a clear path with a lot of history, and
| it blows my mind Intel and AMD aren't doing it.
| belter wrote:
| AMD making drivers of high quality? I would pay to see that :-)
| haunter wrote:
| First crypto, then AI. I wish GPUs were left alone for gaming.
| azinman2 wrote:
| Didn't Nvidia try to block this in software by slowing down
| mining?
|
| Seems like we just need consumer matrix math cards with
| literally no video out, and then a different set of
| requirements for those with a video out.
| wongarsu wrote:
| But Nvidia doesn't want to make consumer compute cards
| because those might steal market share from the datacenter
| compute cards they are selling at 5x markup.
| talldayo wrote:
| Are there actually gamers out there that are still struggling
| to source GPUs? Even at the height of the mining craze, it
| was still possible to backorder cards at MSRP if you were
| patient.
|
| The serious crypto and AI nuts are all using custom hardware.
| Crypto moved onto ASICs for anything power-efficient, and
| Nvidia's DGX systems aren't being cannibalized from the
| gaming market.
| baq wrote:
| They were.
|
| But then those pesky researchers and hackers figured out how
| to use the matmul hardware for non-gaming.
| UncleOxidant wrote:
| > Intel definitely seems to be doing all the right things on
| software support.
|
| Can you elaborate on this? Intel's reputation for software
| support hasn't been stellar, what's changed?
| OkayPhysicist wrote:
| The issue from the manufacturer's perspective is that they've
| got two different customer bases with wildly different
| willingness to pay, but not substantially different needs from
| their product. If Nvidia and AMD didn't split the two markets
| somehow, then there would be no cards available to the PC
| market, since the AI companies with much deeper pockets would
| buy up the lot. This is undesirable from the manufacturer's
| perspective for a couple reasons, but I suspect a big one is
| worries that the next AI winter would cause their entire
| business to crater out, whereas the PC market is pretty
| reliable for the foreseeable future.
|
| Right now, the best discriminator they have is that PC users
| are willing to put up with much smaller amounts of VRAM.
| donnygreenberg wrote:
| Would be nice if this came with scripts which could launch the
| examples on compatible GPUs on cloud providers (rather than
| trying to guess?). Would anyone else be interested in that?
| Considering putting it together.
___________________________________________________________________
(page generated 2024-04-03 23:01 UTC)