[HN Gopher] LlamaF: An Efficient Llama2 Architecture Accelerator...
       ___________________________________________________________________
        
       LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded
       FPGAs
        
       Author : PaulHoule
       Score  : 73 points
       Date   : 2024-09-27 12:16 UTC (10 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | jsheard wrote:
        | Is there any particular reason you'd _want_ to use an FPGA for
        | this? Unless your problem space is highly dynamic (e.g.
        | prototyping) or you're making products in vanishingly low
        | quantities for a price-insensitive market (e.g. military), an
        | ASIC is always going to be better.
       | 
        | There doesn't seem to be much flux in the low-level architectures
        | used for inference at this point, so you may as well commit to an
        | ASIC, as is already happening with Apple, Qualcomm, etc. building
        | NPUs into their SoCs.
        
         | someguydave wrote:
          | Gotta prototype the thing somewhere. If the LLM algorithms do
          | end up maturing, I suspect accelerators of all kinds will be
          | baked into silicon, especially for inference.
        
           | jsheard wrote:
           | That's the thing though, we're already there. Every new
           | consumer ARM and x86 ASIC is shipping with some kind of NPU,
           | the time for tentatively testing the waters with FPGAs was a
           | few years ago before this stuff came to market.
        
             | PaulHoule wrote:
              | But the NPU might be poorly suited to your model or
              | workload, or just poorly designed.
        
           | mistrial9 wrote:
           | like this? https://www.d-matrix.ai/product/
        
         | PaulHoule wrote:
          | (1) Academics can make an FPGA design but not an ASIC, (2) an
          | FPGA is a first step toward making an ASIC.
        
         | wongarsu wrote:
         | This specific project looks like a case of "we have this
         | platform for automotive and industrial use, running Llama on
         | the dual-core ARM CPU is slow but there's an FPGA right next to
         | it". That's all the justification you really need for a
         | university project.
         | 
          | Not sure how useful this is for anyone who isn't already locked
          | into this specific architecture. But it might be a useful
          | benchmark or jumping-off point for more useful FPGA-based
          | accelerators, like ones optimized for 1-bit or 1.58-bit LLMs.
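
          For context, a minimal sketch (not from the paper or the thread)
          of the kind of kernel a 1.58-bit accelerator would target: a
          matrix-vector product over ternary {-1, 0, +1} weights, where
          every multiply collapses into an add or a subtract. Plain NumPy,
          assuming BitNet-style ternary weights:

              import numpy as np

              def ternary_matvec(w_ternary, x):
                  # w_ternary: (rows, cols) with entries in {-1, 0, +1}
                  # x: activation vector of length cols
                  # Each output is a sum/difference of activations, so no
                  # real multiplications are needed.
                  pos = np.where(w_ternary == 1, x, 0.0).sum(axis=1)
                  neg = np.where(w_ternary == -1, x, 0.0).sum(axis=1)
                  return pos - neg

              # Example: random 4x8 ternary weights applied to a vector
              rng = np.random.default_rng(0)
              w = rng.integers(-1, 2, size=(4, 8))
              x = rng.standard_normal(8)
              print(ternary_matvec(w, x))

          On hardware the same structure maps to adder trees with no
          multipliers, which is what makes these formats attractive for
          FPGAs.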
        
         | israrkhan wrote:
          | You can open-source your FPGA designs for wider collaboration
          | with the community. Also, an FPGA is the starting step for
          | making any modern digital chip.
        
         | danielmarkbruce wrote:
         | Model architecture changes fast. Maybe it will slow down.
        
       | fhdsgbbcaA wrote:
       | Looks like LLM inference will follow the same path as Bitcoin:
       | CPU -> GPU -> FPGA -> ASIC.
        
         | hackernudes wrote:
         | I really doubt it. Bitcoin mining is quite fixed, just massive
         | amounts of SHA256. On the other hand, ASICs for accelerating
         | matrix/tensor math are already around. LLM architecture is far
         | from fixed and currently being figured out. I don't see an ASIC
         | any time soon unless someone REALLY wants to put a specific
         | model on a phone or something.
        
           | YetAnotherNick wrote:
            | Google's TPU is an ASIC and performs competitively. Also,
            | Tesla and Meta are building something AFAIK.
            | 
            | Although I doubt you could get a lot better, as GPUs already
            | have half the die area reserved for matrix multiplication.
        
             | danielmarkbruce wrote:
             | It depends on your precise definition of ASIC. The FPGA
             | thing here would be analogous to an MSIC where m = model.
             | 
              | Building a chip for a specific model is clearly different
              | from what a TPU is.
             | 
             | Maybe we'll start seeing MSICs soon.
        
               | YetAnotherNick wrote:
                | LLMs and many other models spend ~99% of their FLOPs in
                | matrix multiplication. And the TPU initially had just a
                | single operation, i.e. matrix multiply. Even if the MSIC
                | were 100x better than a GPU at the other operations, that
                | only touches the remaining ~1% of the work, so it would
                | be at most about 1% faster overall.
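
                A back-of-the-envelope check of that argument (Amdahl's
                law, using the fractions from the comment; the helper
                function is illustrative only):

                    def overall_speedup(matmul_frac, other_speedup):
                        # The untouched matmul fraction plus the accelerated
                        # remainder sets a hard ceiling on overall speedup.
                        return 1.0 / (matmul_frac +
                                      (1.0 - matmul_frac) / other_speedup)

                    print(overall_speedup(0.99, 100))  # ~1.0099, ~1% faster
                    print(overall_speedup(0.99, 1e9))  # ~1.0101, the ceiling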
        
               | danielmarkbruce wrote:
                | You can still optimize the memory hierarchy for a
                | specific model, make it all 8-bit or 4-bit or whatever
                | you want, maybe burn in a specific activation function,
                | all kinds of stuff.
                | 
                | There's no chance you'd only get a 1% speedup on a chip
                | designed for a specific model.
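
                As a toy illustration of that kind of baking-in (assumed
                details, not something described in the thread): weights
                fixed as 4-bit codes with per-row scales, and the
                activation function hard-coded at design time rather than
                configurable:

                    import numpy as np

                    def silu(x):
                        # Activation "burned in" at design time (SiLU here).
                        return x / (1.0 + np.exp(-x))

                    def fused_int4_layer(codes, scales, x):
                        # codes:  (rows, cols) int4 codes in [-8, 7]
                        # scales: (rows,) per-row dequantization scales
                        # x:      input activation vector
                        w = codes.astype(np.float32) * scales[:, None]
                        return silu(w @ x)

                The point is just that the dequantize-multiply-activate
                chain could be fixed in hardware for one specific model.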
        
         | bee_rider wrote:
         | LLM inference is a small task built into some other program you
         | are running, right? Like an office suite with some sentence
         | suggestion feature, probably a good use for an LLM, would be...
         | mostly office suite, with a little LLM inference sprinkled in.
         | 
         | So, the "ASIC" here is probably the CPU with, like, slightly
         | better vector extensions. AVX1024-FP16 or something, haha.
        
         | winwang wrote:
         | As far as I understand, the main issue for LLM inference is
         | memory bandwidth and capacity. Tensor cores are already an ASIC
         | for matmul, and they idle half the time waiting on memory.
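
          Rough numbers for intuition (hypothetical model and device, not
          figures from the paper): batch-1 decoding has to stream roughly
          all of the weights for every generated token, so memory bandwidth
          caps throughput regardless of how fast the matmul units are:

              def max_tokens_per_sec(model_bytes, bandwidth_bytes_per_s):
                  # Batch-1 decode reads ~all weights once per token, so
                  # bandwidth / model size bounds tokens per second.
                  return bandwidth_bytes_per_s / model_bytes

              # e.g. a 7B-parameter model at 8 bits (~7 GB) on a device
              # with ~100 GB/s of memory bandwidth:
              print(max_tokens_per_sec(7e9, 100e9))  # ~14 tokens/s cap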
        
       | KeplerBoy wrote:
        | 4 times as efficient as on the SoC's low-end ARM cores, so many
        | times less efficient than on modern GPUs, I guess?
        | 
        | Not that I was expecting GPU-like efficiency from a fairly
        | small-scale FPGA project. Nvidia engineers have spent thousands
        | of man-years making sure that stuff works well on GPUs.
        
       | bitdeep wrote:
        | Not sure if you guys know: Groq is already doing this with their
        | ASIC chips. So they've already passed the FPGA phase and are in
        | the ASIC phase.
        | 
        | The problem is that their costs seem to be 1x or 2x what they
        | are charging.
        
         | qwertox wrote:
          | The way I see it, one day we'll be buying small LLM
          | cartridges.
        
       ___________________________________________________________________
       (page generated 2024-09-27 23:00 UTC)