[HN Gopher] LlamaF: An Efficient Llama2 Architecture Accelerator...
       ___________________________________________________________________
        
       LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded
       FPGAs
        
       Author : PaulHoule
       Date   : 2024-09-27 12:16 UTC (1 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | jsheard wrote:
       | Is there any particular reason you'd _want_ to use an FPGA for
       | this? Unless your problem space is highly dynamic (e.g.
        | prototyping) or you're making products in vanishingly low
        | quantities for a price-insensitive market (e.g. military), an
        | ASIC is always going to be better.
       | 
        | There doesn't seem to be much flux in the low-level architectures
        | used for inferencing at this point, so you may as well commit to
        | an ASIC, as is already happening with Apple, Qualcomm, etc.
        | building NPUs into their SoCs.
        
         | someguydave wrote:
         | gotta prototype the thing somewhere. If it turns out that the
         | LLM algos become pretty mature I suspect accelerators of all
         | kinds will be baked into silicon, especially for inference.
        
           | jsheard wrote:
           | That's the thing though, we're already there. Every new
            | consumer ARM and x86 ASIC is shipping with some kind of NPU;
            | the time for tentatively testing the waters with FPGAs was a
            | few years ago, before this stuff came to market.
        
             | PaulHoule wrote:
              | But the NPU might be a poor fit for your model or workload,
              | or just poorly designed.
        
           | mistrial9 wrote:
           | like this? https://www.d-matrix.ai/product/
        
         | PaulHoule wrote:
          | (1) Academics can make an FPGA design but not an ASIC; (2) an
          | FPGA is a first step toward making an ASIC.
        
         | wongarsu wrote:
         | This specific project looks like a case of "we have this
         | platform for automotive and industrial use, running Llama on
         | the dual-core ARM CPU is slow but there's an FPGA right next to
         | it". That's all the justification you really need for a
         | university project.
         | 
          | Not sure how useful this is for anyone who isn't already locked
          | into this specific architecture. But it might be a useful
          | benchmark or jumping-off point for more useful FPGA-based
          | accelerators, like ones optimized for 1-bit or 1.58-bit LLMs.
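          | 
          | To make that concrete, a minimal Python sketch of why ternary
          | (1.58-bit) weights suit custom logic: every "multiply" becomes
          | an add, a subtract, or a skip, so no hardware multipliers are
          | needed for the weights. (Illustrative only, not the paper's
          | design.)
          | 
          |     import numpy as np
          | 
          |     def ternary_matvec(w, x):
          |         # Weights are in {-1, 0, +1}: add where +1,
          |         # subtract where -1, skip zeros. No multiplies
          |         # ever touch the weights.
          |         out = np.zeros(w.shape[0], dtype=x.dtype)
          |         for i, row in enumerate(w):
          |             out[i] = x[row == 1].sum() - x[row == -1].sum()
          |         return out
          | 
          |     rng = np.random.default_rng(0)
          |     w = rng.integers(-1, 2, size=(4, 8))  # ternary weights
          |     x = rng.standard_normal(8)
          |     assert np.allclose(ternary_matvec(w, x), w @ x)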
        
         | israrkhan wrote:
          | You can open-source your FPGA designs for wider collaboration
          | with the community. Also, an FPGA is the starting step for
          | making any modern digital chip.
        
         | danielmarkbruce wrote:
         | Model architecture changes fast. Maybe it will slow down.
        
       | fhdsgbbcaA wrote:
       | Looks like LLM inference will follow the same path as Bitcoin:
       | CPU -> GPU -> FPGA -> ASIC.
        
         | hackernudes wrote:
         | I really doubt it. Bitcoin mining is quite fixed, just massive
         | amounts of SHA256. On the other hand, ASICs for accelerating
         | matrix/tensor math are already around. LLM architecture is far
         | from fixed and currently being figured out. I don't see an ASIC
         | any time soon unless someone REALLY wants to put a specific
         | model on a phone or something.
        
           | YetAnotherNick wrote:
            | Google's TPU is an ASIC and performs competitively. Also,
            | Tesla and Meta are building something, AFAIK.
            | 
            | Although I doubt you could get a lot better, as GPUs already
            | have half the die area reserved for matrix multiplication.
        
             | danielmarkbruce wrote:
             | It depends on your precise definition of ASIC. The FPGA
             | thing here would be analogous to an MSIC where m = model.
             | 
              | Building a chip for a specific model is clearly different
              | from what a TPU is.
             | 
             | Maybe we'll start seeing MSICs soon.
        
               | YetAnotherNick wrote:
                | LLMs and many other models spend 99% of their FLOPs in
                | matrix multiplication, and the TPU initially had just a
                | single operation, i.e. matrix multiply. Even if the MSIC
                | were 100x better than a GPU at the other operations, it
                | would only be about 1% faster overall.
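                | 
                | A quick back-of-the-envelope in Python (assuming the
                | 99% matmul split above and a 100x speedup on
                | everything else):
                | 
                |     # Amdahl's law: only the non-matmul 1% gets faster.
                |     matmul, other, speedup = 0.99, 0.01, 100.0
                |     overall = 1.0 / (matmul + other / speedup)
                |     print(f"{overall:.3f}x")  # ~1.01x, about 1%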
        
               | danielmarkbruce wrote:
                | You can still optimize the various layers of memory for a
                | specific model, make it all 8-bit or 4-bit or whatever
                | you want, maybe burn in a specific activation function,
                | all kinds of stuff.
               | 
               | No chance you'd only get 1% speedup on a chip designed
               | for a specific model.
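                | 
                | For example, a minimal sketch of the kind of thing a
                | model-specific chip could bake in (symmetric per-tensor
                | int8, purely illustrative, values chosen arbitrarily):
                | 
                |     import numpy as np
                | 
                |     def quantize_int8(w):
                |         # 4x less weight memory than fp32; a fixed
                |         # model could hard-wire this format and fold
                |         # the scale into the datapath.
                |         scale = np.abs(w).max() / 127.0
                |         q = np.clip(np.round(w / scale), -127, 127)
                |         return q.astype(np.int8), scale
                | 
                |     w = np.random.randn(256, 256)
                |     w = w.astype(np.float32)
                |     q, s = quantize_int8(w)
                |     w_hat = q.astype(np.float32) * s
                |     err = np.abs(w - w_hat).max()
                |     ratio = w.nbytes // q.nbytes
                |     print(ratio, "x smaller, max err", err)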
        
           | pzo wrote:
            | Apple has the Neural Engine and it really speeds up many
            | CoreML models: if most operators are implemented on the NPU,
            | inference is significantly faster than on the GPU of my
            | MacBook M2 Max (and it has a similar NPU to the ones in e.g.
            | the iPhone 13). These ASIC NPUs just implement many of the
            | typical low-level operators used in most ML models.
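            | 
            | A minimal sketch with coremltools (the model path, input
            | name and shape below are made up); pinning compute_units to
            | the Neural Engine is one way to check whether a model's ops
            | actually run on the NPU:
            | 
            |     import numpy as np
            |     import coremltools as ct
            | 
            |     # Hypothetical compiled model package.
            |     model = ct.models.MLModel(
            |         "MyModel.mlpackage",
            |         compute_units=ct.ComputeUnit.CPU_AND_NE,
            |     )
            |     x = np.zeros((1, 3, 224, 224), dtype=np.float32)
            |     out = model.predict({"input": x})  # name assumed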
        
           | imtringued wrote:
            | 99% of the time is spent on matrix-matrix or matrix-vector
            | multiplication. Activation functions, softmax, RoPE, etc.
            | basically cost nothing in comparison.
           | 
           | Most NPUs are programmable, because the bottleneck is data
           | SRAM and memory bandwidth instead of instruction SRAM.
           | 
            | For classic matrix-matrix multiplication, the SRAM bottleneck
            | is the number of matrix outputs you can store in SRAM: N rows
            | and M columns get you N x M accumulator outputs. The
            | calculation of each dot product can be split into separate
            | steps without losing the N x M scaling, so the SRAM consumed
            | by the row and column vectors is insignificant in the limit.
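            | 
            | A minimal output-stationary tiling sketch in Python
            | (illustrative, tile sizes arbitrary): only an N x M block
            | of accumulators has to stay on chip while the K dimension
            | is streamed through.
            | 
            |     import numpy as np
            | 
            |     def tiled_matmul(A, B, tn=4, tm=4):
            |         N, K = A.shape
            |         _, M = B.shape
            |         C = np.zeros((N, M))
            |         for i in range(0, N, tn):
            |             for j in range(0, M, tm):
            |                 # tn x tm accumulators live in "SRAM"
            |                 acc = np.zeros((min(tn, N - i),
            |                                 min(tm, M - j)))
            |                 for k in range(K):  # stream the K dim
            |                     acc += np.outer(A[i:i+tn, k],
            |                                     B[k, j:j+tm])
            |                 C[i:i+tn, j:j+tm] = acc
            |         return C
            | 
            |     A, B = np.random.rand(8, 16), np.random.rand(16, 12)
            |     assert np.allclose(tiled_matmul(A, B), A @ B)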
           | 
           | For the MLP layers in the unbatched case, the bottleneck lies
           | in the memory bandwidth needed to load the model parameters.
            | The problem is therefore how fast your DDR, GDDR, or HBM
            | memory and your NoC/system bus let you transfer data to the
            | NPU.
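            | 
            | A rough upper bound for the unbatched case (the numbers
            | below are illustrative assumptions, not measurements):
            | every weight has to be read once per token, so tokens/s is
            | bounded by bandwidth over weight bytes.
            | 
            |     params = 7e9           # 7B-parameter model, assumed
            |     bytes_per_param = 1    # int8 weights, assumed
            |     dram_bw = 25.6e9       # ~25.6 GB/s DDR4, assumed
            |     print(dram_bw / (params * bytes_per_param),
            |           "tokens/s at best")   # ~3.7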
           | 
           | Having a programmable processor that controls the matrix
           | multiplication function unit costs you silicon area for the
            | instruction SRAM. For matrix-vector multiplication, the
            | memory bottleneck is so big that it doesn't matter what
            | architecture you are using; even CPUs are fast enough. There
            | is no demand for getting rid of the not-very-costly
            | instruction SRAM.
           | 
           | "but what about the area taken up by the processor itself?"
           | 
            | Hahahaha. Nice joke.
            | 
            | Wait... you were serious? The area taken up by an in-order
            | VLIW/TTA processor is so insignificant that I jammed it into
            | the routing gap between two SRAM blocks. Sure, the matrix
            | multiplication unit might take up some space, but decoding
            | instructions is such an insignificant cost that anyone
            | opposing programmability must have completely different goals
            | and priorities than LLMs or machine learning.
        
         | bee_rider wrote:
         | LLM inference is a small task built into some other program you
         | are running, right? Like an office suite with some sentence
         | suggestion feature, probably a good use for an LLM, would be...
         | mostly office suite, with a little LLM inference sprinkled in.
         | 
         | So, the "ASIC" here is probably the CPU with, like, slightly
         | better vector extensions. AVX1024-FP16 or something, haha.
        
           | p1esk wrote:
           | _would be... mostly office suite, with a little LLM inference
           | sprinkled in._
           | 
           | No, it would be LLM inference with a little bit of an office
           | suite sprinkled in.
        
         | winwang wrote:
         | As far as I understand, the main issue for LLM inference is
         | memory bandwidth and capacity. Tensor cores are already an ASIC
         | for matmul, and they idle half the time waiting on memory.
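          | 
          | A roofline-style sanity check with illustrative numbers (both
          | hardware figures below are assumptions, not a specific GPU's
          | specs):
          | 
          |     # Unbatched matvec does ~2 FLOPs per fp16 weight (2
          |     # bytes), far below the machine balance needed to keep
          |     # the math units busy.
          |     matvec_intensity = 2 / 2   # FLOPs per byte
          |     peak_flops = 300e12        # ~300 TFLOPS fp16, assumed
          |     mem_bw = 2e12              # ~2 TB/s HBM, assumed
          |     balance = peak_flops / mem_bw
          |     print(balance, "FLOPs/byte needed vs", matvec_intensity)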
        
         | evanjrowley wrote:
         | You forgot to place "vertically-integrated unobtanium" after
         | ASIC.
        
           | namibj wrote:
           | Soooo.... TPUv4?
        
             | evanjrowley wrote:
             | Yes, but the kinds that aren't on the market.
        
       | KeplerBoy wrote:
        | 4 times as efficient as the SoC's low-end ARM cores, so many
        | times less efficient than modern GPUs, I guess?
        | 
        | Not that I was expecting GPU-like efficiency from a fairly
        | small-scale FPGA project. Nvidia engineers have spent thousands
        | of man-years making sure that stuff works well on GPUs.
        
       | bitdeep wrote:
        | Not sure if you guys know: Groq is already doing this with their
        | ASIC chips. So... they already passed the FPGA phase and are in
        | the ASIC phase.
        | 
        | The problem is: it seems their costs are 1x or 2x what they are
        | charging.
        
         | qwertox wrote:
          | The way I see it, one day we'll be buying small LLM
          | cartridges.
        
         | latchkey wrote:
         | Probably more than 2x...
         | 
         | "Semi analysis did some cost estimates, and I did some but
         | you're likely paying somewhere in the 12 million dollar range
         | for the equipment to serve a single query using llama-70b.
         | Compare that to a couple of gpus, and it's easy to see why they
         | are struggling to sell hardware, they can't scale down.
         | 
         | Since they didn't use hbm, you need to stich enough cards
         | together to get the memory to hold your model. It takes a lot
         | of 256mb cards to get to 64gb, and there isn't a good way to
         | try the tech out since a single rack really can't serve an
         | LLM."
         | 
         | https://news.ycombinator.com/item?id=39966620
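          | 
          | The arithmetic behind that last point (capacities as stated
          | in the quote):
          | 
          |     card_sram = 256 * 2**20   # 256 MB of SRAM per card
          |     weights = 64 * 2**30      # 64 GB of model weights
          |     print(weights // card_sram, "cards needed")  # 256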
        
         | faangguyindia wrote:
          | Groq is unpredictable: while it might be fast for some
          | requests, it's super slow or fails on others.
          | 
          | The fastest commercial model is Google's Gemini Flash
          | (predictable speed).
        
       | rldjbpin wrote:
        | as of now there are way too many parallel developments across
        | abstraction layers, hardware and software, to really have the
        | best combo just yet. even this example is for an older
        | architecture, because certain things just move slower than
        | others.
        | 
        | but when things plateau, this, and then ASICs, would probably be
        | the most efficient way ahead for "stable" versions of AI models
        | during inference.
        
       ___________________________________________________________________
       (page generated 2024-09-28 23:02 UTC)