[HN Gopher] Efficient LLM inference solution on Intel GPU
       ___________________________________________________________________
        
       Efficient LLM inference solution on Intel GPU
        
       Author : PaulHoule
       Score  : 88 points
       Date   : 2024-01-20 17:11 UTC (5 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | brucethemoose2 wrote:
       | I find this paper kind of "medium" (as the transformers baseline
        | they compare against is _glacially_ slow), but Intel's
       | integration efforts in open source LLM runtimes are real and very
       | interesting. See, for instance, their PRs in llama.cpp, not to
       | speak of all their other open source efforts.
        
         | baq wrote:
         | Their software folks are really good, at least in the GPU
         | space.
         | 
         | The hardware seems stuck in past decade and the process woes
         | don't help either - but the software should be ready if they
         | ever dig themselves out of the hardware hole.
        
           | selimthegrim wrote:
           | At NeurIPS it seemed like the Intel Labs/materials
           | research/hardware people were out in force
        
           | brucethemoose2 wrote:
           | I mean, if they sell me a 48GB GPU that's not $5K like a
           | W7900, I will drop my Nvidia card faster than I can blink. I
            | don't even care if it's kinda slow and uses an older process.
           | 
           | And then I may end up fixing Intel bugs in ML projects we
           | use, and deploying them on Intel cloud hardware... which is
           | hopefully more available as well.
        
         | kiratp wrote:
         | +1 to this.
         | 
          | I've been working with their research teams on this, and credit
          | is due indeed.
         | 
         | Some context:
         | https://github.com/ggerganov/llama.cpp/issues/2555#issuecomm...
         | 
         | https://github.com/ggerganov/llama.cpp/discussions/3965
        
       | NelsonMinar wrote:
        | Well, that sounds encouraging; Intel QuickSync has made
        | efficient video encoding really accessible. Would love something
        | similar for LLM inference.
       | 
        | Is this paper the same work that goes into Intel's BigDL-LLM code?
       | That's been out for a few months now but I haven't seen it in use
       | yet. https://medium.com/intel-tech/bigdl-llm-easily-optimize-
       | your...
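        | 
        | (For reference, BigDL-LLM advertises a drop-in transformers-style
        | API. A rough, untested sketch of what that looks like on an Intel
        | GPU; the model name is just a placeholder, not something from the
        | paper:)
        | 
        |     import torch
        |     import intel_extension_for_pytorch as ipex  # registers "xpu"
        |     from transformers import AutoTokenizer
        |     # bigdl.llm wraps transformers and quantizes weights on load
        |     from bigdl.llm.transformers import AutoModelForCausalLM
        | 
        |     name = "mistralai/Mistral-7B-Instruct-v0.1"  # placeholder
        |     model = AutoModelForCausalLM.from_pretrained(
        |         name, load_in_4bit=True).to("xpu")
        |     tok = AutoTokenizer.from_pretrained(name)
        | 
        |     ids = tok("What does QuickSync do?",
        |               return_tensors="pt").to("xpu")
        |     with torch.inference_mode():
        |         out = model.generate(**ids, max_new_tokens=64)
        |     print(tok.decode(out[0], skip_special_tokens=True))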
        
         | brucethemoose2 wrote:
         | > I haven't seen it in use yet
         | 
         | Neither have I, but this is interesting and I am bookmarking
         | it, thanks.
         | 
         | The Gen AI space is littered with very interesting high
         | performance runtimes that precisely no one uses. There are too
         | many for integrators to even keep up with, much less integrate!
        
       | wmf wrote:
       | Note that this is about Intel Ponte Vecchio professional GPUs,
       | not the consumer Alchemist GPUs. Do they use the same software
       | stack?
        
         | xipix wrote:
          | Yes, and newer iGPUs (11th Gen onwards) can use the same
          | software.
        
       | newsclues wrote:
        | Are we at or approaching the point where people can self-host a
        | useful LLM for personal use on affordable hardware?
        
         | adastra22 wrote:
         | Is a MacBook Air affordable? Because we're already at that
         | point.
        
           | newsclues wrote:
           | The base model?
           | 
            | Given that the article is about GPUs, I was thinking more
            | along the lines of a desktop with a consumer GPU.
        
             | brucethemoose2 wrote:
             | Intel (and AMD) reportedly have big M-Pro-like iGPUs in the
             | pipe, and Intel has the Battlemage GPU line coming up, so
             | we shall see.
        
             | adastra22 wrote:
             | By base model do you mean off-the-shelf? I think the bigger
             | limitation there would be RAM, as the machine comes with
              | just 8GB soldered on board. You can expand to 24GB for $400
              | more though, and increase the GPU cores by 25% for another
              | $100.
             | 
              | The CPU/GPU speed of the Air is the same as the MacBook Pro
              | base model, though. The only difference is the lack of
              | active cooling, which for large workloads can result in
              | performance degradation. But chat apps are intrinsically
              | interactive, only using bursts of GPU while performing
              | inference, so that shouldn't be an issue.
             | 
              | The expensive Pro/Max variants of the MacBook Pro would be
              | 2x-4x faster. But the plain M2 in the MacBook Air is
             | sufficient to get real-time speeds on sizable models.
             | 
             | When the M3 comes out for the MacBook Air, discounted M2
             | machines, new or refurbished, should be an ideal entry-
             | level machine for local inference.
        
               | washadjeffmad wrote:
               | > discounted M2 machines, new or refurbished, should be
               | an ideal entry-level machine for local inference
               | 
                | At the point of purchase of the lowest-cost configuration
                | with 24GB Unified Memory, you've already paid the
                | equivalent of over 2200 hours of GPU compute time on an
                | RTX 4090 24GB, with performance that exceeds the MacBook
                | by around 1200% (it/s).
               | 
               | If you buy that MacBook for AI, you would have to run
               | continuous generative inference on it for over a decade
               | to match the return of just having used runpod instead.
               | Even Apple doesn't use Macs to do AI - they use GCP TPUs.
               | Buy the Mac if you like it, by all means, but be
               | realistic.
               | 
               | LLM performance on Apple Silicon is decent for small
               | models but does not scale. If any Mac architecture were
               | cost effective at any scale for AI, we would be putting
               | them in data centers, and we're not.
        
         | brucethemoose2 wrote:
          | I run Yi 34B at 45K-70K context as a GPT-3.5 replacement on an
          | Nvidia 3090.
         | 
          | It's quite smart, and fast.
         | 
          | The whole rig cost me $2.1K, but it could have easily been
          | $1.5K without splurging on certain parts like I did. And it's
          | 10 liters, small enough to move around.
        
           | washadjeffmad wrote:
           | Same, but mine is 11L (A4-H2O). Which case did you get?!
        
             | brucethemoose2 wrote:
             | A Node 202! I wanted something flat enough for a suitcase.
             | 
             | https://www.reddit.com/r/sffpc/comments/18a7mal/ducted_3090
             | _...
        
         | montebicyclelo wrote:
          | Consumer laptops, such as Apple Silicon MacBooks with just 32GB
          | of RAM, can run fairly large models, such as Mixtral-8x7B, an
          | open-source model that's comparable to GPT-3.5, well enough for
          | interactive chat (and it seems likely models of this size will
          | keep improving). Of course, 64GB of RAM is preferable, because
          | it unlocks larger models. 3090s are the other way to go: they're
          | faster but have less VRAM, though you can use multiple.
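          | 
          | (As a rough sketch of what that looks like in practice with
          | llama-cpp-python; the GGUF path and quant are placeholders, and
          | a Q4_K_M Mixtral weighs roughly 26GB, which is why 32GB of
          | unified memory is about the floor:)
          | 
          |     from llama_cpp import Llama
          | 
          |     llm = Llama(
          |         model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
          |         n_ctx=4096,       # context window
          |         n_gpu_layers=-1)  # offload every layer to the GPU/Metal
          | 
          |     out = llm("Q: Name three uses of an iGPU.\nA:",
          |               max_tokens=128, stop=["Q:"])
          |     print(out["choices"][0]["text"])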
        
         | tudorw wrote:
          | I'm using GPT4All with Mistral on a $400 Intel NUC, so yes, if
          | the use case fits!
        
         | Const-me wrote:
         | If you use Windows and you have a discrete GPU from any vendor
          | with at least 6GB VRAM, you could test my implementation of the
          | Mistral model: https://github.com/Const-me/Cgml
         | 
         | Screenshot: https://github.com/Const-
         | me/Cgml/blob/master/Mistral/Mistral...
        
       | spearman wrote:
       | I skimmed the paper but couldn't find it: What API did they use
       | to write their kernels? I would have guessed SYCL since that's
        | what Intel is pushing for GPU programming, but I couldn't find
        | any reference to SYCL in the paper.
        
         | spearman wrote:
          | OK, I found it. Looks like they use SYCL (which for some reason
         | they've rebranded to DPC++): https://github.com/intel/intel-
         | extension-for-pytorch/tree/v2...
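          | 
          | (For anyone curious what that looks like from the Python side,
          | here is a rough sketch of the user-facing
          | intel-extension-for-pytorch calls; the model name and dtype are
          | placeholders, and the SYCL/DPC++ kernels presumably live
          | underneath ipex.optimize rather than in user code:)
          | 
          |     import torch
          |     import intel_extension_for_pytorch as ipex  # adds "xpu"
          |     from transformers import AutoModelForCausalLM, AutoTokenizer
          | 
          |     name = "meta-llama/Llama-2-7b-hf"  # placeholder model
          |     tok = AutoTokenizer.from_pretrained(name)
          |     model = AutoModelForCausalLM.from_pretrained(
          |         name, torch_dtype=torch.float16).to("xpu").eval()
          |     # apply IPEX operator fusions/optimizations for inference
          |     model = ipex.optimize(model, dtype=torch.float16)
          | 
          |     ids = tok("Hello from an Intel GPU:",
          |               return_tensors="pt").to("xpu")
          |     with torch.inference_mode():
          |         out = model.generate(**ids, max_new_tokens=32)
          |     print(tok.decode(out[0], skip_special_tokens=True))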
        
           | mepian wrote:
           | SYCL is a standard, DPC++ is a particular implementation of
           | this standard.
        
       ___________________________________________________________________
       (page generated 2024-01-20 23:01 UTC)