[HN Gopher] Efficient LLM inference solution on Intel GPU
___________________________________________________________________
Efficient LLM inference solution on Intel GPU
Author : PaulHoule
Score : 88 points
Date : 2024-01-20 17:11 UTC (5 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| brucethemoose2 wrote:
| I find this paper kind of "medium" (as the transformers baseline
| they compare against is _glacially_ slow), but Intel's
| integration efforts in open source LLM runtimes are real and very
| interesting. See, for instance, their PRs in llama.cpp, not to
| speak of all their other open source efforts.
| baq wrote:
| Their software folks are really good, at least in the GPU
| space.
|
| The hardware seems stuck in the past decade and the process woes
| don't help either - but the software should be ready if they
| ever dig themselves out of the hardware hole.
| selimthegrim wrote:
| At NeurIPS it seemed like the Intel Labs/materials
| research/hardware people were out in force
| brucethemoose2 wrote:
| I mean, if they sell me a 48GB GPU that's not $5K like a
| W7900, I will drop my Nvidia card faster than I can blink. I
| don't even care if it's kinda slow and uses an older process.
|
| And then I may end up fixing Intel bugs in ML projects we
| use, and deploying them on Intel cloud hardware... which is
| hopefully more available as well.
| kiratp wrote:
| +1 to this.
|
| I've been working with their research teams on this, and credit
| is due indeed.
|
| Some context:
| https://github.com/ggerganov/llama.cpp/issues/2555#issuecomm...
|
| https://github.com/ggerganov/llama.cpp/discussions/3965
| NelsonMinar wrote:
| Well, that sounds encouraging; Intel QuickSync has made
| efficient video encoding really accessible. Would love something
| similar for LLM inference.
|
| Is this paper the same work as goes into Intel's BigDL-LLM code?
| That's been out for a few months now but I haven't seen it in use
| yet. https://medium.com/intel-tech/bigdl-llm-easily-optimize-
| your...
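|
| For reference, BigDL-LLM's transformers-style Python API looks
| roughly like the sketch below; the model path is a placeholder
| and the 4-bit flag follows its docs, not anything in the paper:
|
|     from bigdl.llm.transformers import AutoModelForCausalLM
|     from transformers import AutoTokenizer
|
|     path = "mistralai/Mistral-7B-Instruct-v0.1"  # placeholder
|     # load_in_4bit quantizes the weights to int4 at load time
|     model = AutoModelForCausalLM.from_pretrained(
|         path, load_in_4bit=True)
|     tok = AutoTokenizer.from_pretrained(path)
|     ids = tok("Hello", return_tensors="pt").input_ids
|     print(tok.decode(model.generate(ids, max_new_tokens=64)[0]))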
| brucethemoose2 wrote:
| > I haven't seen it in use yet
|
| Neither have I, but this is interesting and I am bookmarking
| it, thanks.
|
| The Gen AI space is littered with very interesting high
| performance runtimes that precisely no one uses. There are too
| many for integrators to even keep up with, much less integrate!
| wmf wrote:
| Note that this is about Intel Ponte Vecchio professional GPUs,
| not the consumer Alchemist GPUs. Do they use the same software
| stack?
| xipix wrote:
| Yes, and newer iGPUs (11th Gen onwards) can use the same
| software stack.
| newsclues wrote:
| Are we at or approaching the point where people can self-host a
| useful LLM for personal use on affordable hardware?
| adastra22 wrote:
| Is a MacBook Air affordable? Because we're already at that
| point.
| newsclues wrote:
| The base model?
|
| Given that the article is about GPUs, I was thinking more
| along the lines of a desktop with a consumer GPU.
| brucethemoose2 wrote:
| Intel (and AMD) reportedly have big M-Pro-like iGPUs in the
| pipe, and Intel has the Battlemage GPU line coming up, so
| we shall see.
| adastra22 wrote:
| By base model do you mean off-the-shelf? I think the bigger
| limitation there would be RAM, as the machine comes with
| just 8GB soldered on board. You can expand to 24GB for +$400,
| though, and increase the GPU cores by 25% for an extra $100.
|
| The CPU/GPU speed of the Air is the same as the MacBook Pro
| base model, though. The only difference is the lack of
| active cooling, which can degrade performance on sustained
| workloads. Chat apps are intrinsically interactive, only
| using bursts of GPU time during inference, so that shouldn't
| be an issue.
|
| The expensive Pro/Max variants of the MacBook Pro would be
| 2x-4x faster, but the plain M2 in the MacBook Air is
| sufficient to get real-time speeds on sizable models.
|
| When the M3 comes out for the MacBook Air, discounted M2
| machines, new or refurbished, should be an ideal entry-
| level machine for local inference.
| washadjeffmad wrote:
| > discounted M2 machines, new or refurbished, should be
| an ideal entry-level machine for local inference
|
| At the point of purchase of the lowest-cost configuration
| with 24GB unified memory, you've already paid the
| equivalent of over 2200 hours of GPU compute time on an
| RTX 4090 24GB, with performance that exceeds the
| MacBook's by around 1200% (it/s).
|
| If you buy that MacBook for AI, you would have to run
| continuous generative inference on it for over a decade
| to match the return of just having used runpod instead.
| Even Apple doesn't use Macs to do AI - they use GCP TPUs.
| Buy the Mac if you like it, by all means, but be
| realistic.
|
| LLM performance on Apple Silicon is decent for small
| models but does not scale. If any Mac architecture were
| cost-effective at any scale for AI, we would be putting
| them in data centers, and we're not.
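|
| (Back-of-envelope on that break-even - the Mac price and the
| hourly 4090 rental rate below are illustrative assumptions,
| not figures quoted in this thread:)
|
|     mac_price = 1099 + 400  # assumed M2 Air base + 24GB, USD
|     rate = 0.69             # assumed cloud RTX 4090 rate, USD/hr
|     hours = mac_price / rate
|     print(round(hours))     # ~2170 hours, same ballpark as above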
| brucethemoose2 wrote:
| I run Yi 34B at 45K-70K context as a GPT 3.5 replacement on an
| Nvidia 3090.
|
| It's quite smart, and fast.
|
| The whole rig cost me $2.1K, but it could have easily been
| $1.5K without splurging on certain parts like I did. And it's
| 10 liters, small enough to move around.
| washadjeffmad wrote:
| Same, but mine is 11L (A4-H2O). Which case did you get?!
| brucethemoose2 wrote:
| A Node 202! I wanted something flat enough for a suitcase.
|
| https://www.reddit.com/r/sffpc/comments/18a7mal/ducted_3090
| _...
| montebicyclelo wrote:
| Consumer laptops, such as Apple Silicon MacBooks with just 32GB
| RAM, can run fairly large models such as Mixtral-8x7B, an open-
| source model that's comparable to GPT-3.5, well enough for
| interactive chat (and it seems likely models of this size will
| keep improving) - ofc 64GB RAM is preferable, because it unlocks
| larger models. And 3090s are the other way to go: faster, but
| with less VRAM, though you can use multiple.
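|
| One common way to try this locally is llama.cpp's Python
| bindings with Metal offload; a rough sketch, where the GGUF
| filename is a placeholder for whatever quantized build you use:
|
|     from llama_cpp import Llama  # pip install llama-cpp-python
|
|     llm = Llama(
|         model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # placeholder
|         n_gpu_layers=-1,  # offload all layers to the GPU (Metal)
|         n_ctx=4096,
|     )
|     out = llm("Q: What is Mixtral-8x7B? A:", max_tokens=64)
|     print(out["choices"][0]["text"])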
| tudorw wrote:
| I'm using GPT4ALL with Mistral on a $400 Intel NUC, so yes, if
| the use fits!
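|
| (With GPT4All's Python bindings that is only a few lines; the
| GGUF filename below is a placeholder for whichever Mistral
| build gets downloaded:)
|
|     from gpt4all import GPT4All  # pip install gpt4all
|
|     model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf")
|     with model.chat_session():
|         print(model.generate("Hello from a NUC", max_tokens=64))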
| Const-me wrote:
| If you use Windows and you have a discrete GPU from any vendor
| with at least 6GB VRAM, you could test my implementation of
| Mistral model: https://github.com/Const-me/Cgml
|
| Screenshot: https://github.com/Const-
| me/Cgml/blob/master/Mistral/Mistral...
| spearman wrote:
| I skimmed the paper but couldn't find it: What API did they use
| to write their kernels? I would have guessed SYCL since that's
| what Intel is pushing for GPU programming but I couldn't find any
| reference to SYCL in the paper.
| spearman wrote:
| OK I found it. Looks like they use SYCL (which for some reason
| they've rebranded to DPC++): https://github.com/intel/intel-
| extension-for-pytorch/tree/v2...
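|
| The user-facing side is plain PyTorch with an "xpu" device; a
| minimal sketch of what calling into that extension looks like,
| with a placeholder model (the fused kernels described in the
| paper stay hidden behind this API):
|
|     import torch
|     import intel_extension_for_pytorch as ipex  # assumes XPU build
|     from transformers import AutoModelForCausalLM, AutoTokenizer
|
|     name = "gpt2"  # placeholder model
|     tok = AutoTokenizer.from_pretrained(name)
|     model = AutoModelForCausalLM.from_pretrained(name).eval()
|     model = ipex.optimize(model.to("xpu"), dtype=torch.float16)
|     ids = tok("Ponte Vecchio is", return_tensors="pt").input_ids
|     out = model.generate(ids.to("xpu"), max_new_tokens=32)
|     print(tok.decode(out[0]))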
| mepian wrote:
| SYCL is a standard; DPC++ is a particular implementation of
| that standard.
___________________________________________________________________
(page generated 2024-01-20 23:01 UTC)