[HN Gopher] LlamaF: An Efficient Llama2 Architecture Accelerator...
___________________________________________________________________
LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded
FPGAs
Author : PaulHoule
Score : 112 points
Date : 2024-09-27 12:16 UTC (1 day ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| jsheard wrote:
| Is there any particular reason you'd _want_ to use an FPGA for
| this? Unless your problem space is highly dynamic (e.g.
| prototyping) or you're making products in vanishingly low
| quantities for a price-insensitive market (e.g. military), an
| ASIC is always going to be better.
|
| There doesn't seem to be much flux in the low-level
| architectures used for inference at this point, so you may as
| well commit to an ASIC, as is already happening with Apple,
| Qualcomm, etc. building NPUs into their SoCs.
| someguydave wrote:
| gotta prototype the thing somewhere. If it turns out that the
| LLM algos become pretty mature I suspect accelerators of all
| kinds will be baked into silicon, especially for inference.
| jsheard wrote:
| That's the thing though, we're already there. Every new
| consumer ARM and x86 ASIC is shipping with some kind of NPU,
| the time for tentatively testing the waters with FPGAs was a
| few years ago before this stuff came to market.
| PaulHoule wrote:
| But the NPU might be a poor fit for your model or workload,
| or just poorly designed.
| mistrial9 wrote:
| like this? https://www.d-matrix.ai/product/
| PaulHoule wrote:
| (1) Academics can build an FPGA design but not an ASIC, (2) an
| FPGA is a first step toward making an ASIC.
| wongarsu wrote:
| This specific project looks like a case of "we have this
| platform for automotive and industrial use, running Llama on
| the dual-core ARM CPU is slow but there's an FPGA right next to
| it". That's all the justification you really need for a
| university project.
|
| Not sure how useful this is for anyone who isn't already locked
| into this specific architecture. But it might be a useful
| benchmark or jumping-off point for more specialized FPGA-based
| accelerators, like ones optimized for 1-bit or 1.58-bit LLMs.
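|
| A minimal sketch (not from the paper) of why ternary / 1.58-bit
| weights are so FPGA-friendly: the dot products become
| multiply-free, so you don't need DSP multiplier blocks at all.
|
|    import numpy as np
|
|    def ternary_matvec(W_t, x):
|        # W_t holds only {-1, 0, +1}; the dot product becomes
|        # multiply-free: add x where the weight is +1, subtract
|        # where it is -1. On an FPGA this needs no DSP multipliers.
|        return np.array([x[row == 1].sum() - x[row == -1].sum()
|                         for row in W_t])
|
|    W_t = np.random.choice([-1, 0, 1], size=(4, 8))
|    x = np.random.randn(8).astype(np.float32)
|    assert np.allclose(ternary_matvec(W_t, x), W_t @ x)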
| israrkhan wrote:
| You can open-source your FPGA designs for wider collaboration
| with the community. Also, an FPGA is the starting step for
| making any modern digital chip.
| danielmarkbruce wrote:
| Model architecture changes fast. Maybe it will slow down.
| fhdsgbbcaA wrote:
| Looks like LLM inference will follow the same path as Bitcoin:
| CPU -> GPU -> FPGA -> ASIC.
| hackernudes wrote:
| I really doubt it. Bitcoin mining is quite fixed, just massive
| amounts of SHA256. On the other hand, ASICs for accelerating
| matrix/tensor math are already around. LLM architecture is far
| from fixed and currently being figured out. I don't see an ASIC
| any time soon unless someone REALLY wants to put a specific
| model on a phone or something.
| YetAnotherNick wrote:
| Google's TPU is an ASIC and performs competitively. Also,
| Tesla and Meta are building something AFAIK.
|
| Although I doubt you could get a lot better, as GPUs already
| have half the die area reserved for matrix multiplication.
| danielmarkbruce wrote:
| It depends on your precise definition of ASIC. The FPGA
| thing here would be analogous to an MSIC where m = model.
|
| Building a chip for a specific model is clearly different
| from building something like a TPU.
|
| Maybe we'll start seeing MSICs soon.
| YetAnotherNick wrote:
| LLMs and many other models spend 99% of their FLOPs in
| matrix multiplication, and the TPU initially had just a
| single operation, i.e. matrix multiply. Even if the MSIC
| were 100x better than a GPU at the other operations, it
| would only be about 1% faster overall.
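|
| Rough Amdahl's-law arithmetic behind that 1% figure, assuming
| the 99%/1% split above:
|
|    # Amdahl's law: speed up only the non-matmul 1% of the work.
|    matmul_frac = 0.99     # fraction of time in matrix multiplication
|    other_speedup = 100.0  # hypothetical 100x on everything else
|
|    overall = 1 / (matmul_frac + (1 - matmul_frac) / other_speedup)
|    print(f"{overall:.4f}x overall")  # ~1.0101x, i.e. about 1% faster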
| danielmarkbruce wrote:
| You can still optimize the various layers of memory for a
| specific model, make it all 8-bit or 4-bit or whatever you
| want, maybe burn in a specific activation function, all
| kinds of stuff.
|
| No chance you'd only get 1% speedup on a chip designed
| for a specific model.
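|
| A purely illustrative sketch of what "make it all 8-bit" plus
| a burned-in activation might look like (hypothetical names and
| sizes):
|
|    import numpy as np
|
|    def quantize_int8(w):
|        # Symmetric per-tensor quantization: int8 weights + one scale.
|        scale = np.abs(w).max() / 127.0
|        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
|        return q, scale
|
|    def fixed_relu_layer(x, q_w, scale):
|        # The activation is "burned in": this layer only does ReLU.
|        return np.maximum(0.0, (q_w.astype(np.int32) @ x) * scale)
|
|    w = np.random.randn(16, 32).astype(np.float32)
|    x = np.random.randn(32).astype(np.float32)
|    q_w, scale = quantize_int8(w)
|    y = fixed_relu_layer(x, q_w, scale)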
| pzo wrote:
| Apple has the Neural Engine and it really speeds up many
| CoreML models - if most operators are implemented on the NPU,
| inference will be significantly faster than on the GPU on my
| MacBook M2 Max (and it has a similar NPU to the one in e.g.
| the iPhone 13). Those ASIC NPUs just implement many typical
| low-level operators used in most ML models.
| imtringued wrote:
| 99% of the time is spent on matrix-matrix or matrix-vector
| multiplication. Activation functions, softmax, RoPE, etc.
| basically cost nothing in comparison.
|
| Most NPUs are programmable, because the bottleneck is data
| SRAM and memory bandwidth instead of instruction SRAM.
|
| For classic matrix-matrix multiplication, the SRAM bottleneck
| is the number of matrix outputs you can store in SRAM: N rows
| and M columns get you N x M accumulator outputs. The
| calculation of the dot product can be split into separate
| steps without losing the N x M scaling, so the SRAM consumed
| by the row and column vectors is insignificant in the limit.
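|
| A minimal NumPy sketch of that tiling idea (tile sizes are
| made up): the N x M accumulator stays resident while the K
| dimension is streamed in small chunks, so the row/column
| panels stay tiny.
|
|    import numpy as np
|
|    def tiled_matmul(A, B, k_chunk=8):
|        # A is (N, K), B is (K, M). Only the (N, M) accumulator has
|        # to stay resident; the A/B panels are streamed k_chunk
|        # columns/rows at a time and can be made arbitrarily small.
|        N, K = A.shape
|        M = B.shape[1]
|        acc = np.zeros((N, M), dtype=np.float32)
|        for k in range(0, K, k_chunk):
|            acc += A[:, k:k + k_chunk] @ B[k:k + k_chunk, :]
|        return acc
|
|    A = np.random.randn(16, 64).astype(np.float32)
|    B = np.random.randn(64, 32).astype(np.float32)
|    assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-4)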
|
| For the MLP layers in the unbatched case, the bottleneck lies
| in the memory bandwidth needed to load the model parameters.
| The problem is therefore how fast your DDR, GDDR, or HBM
| memory and your NoC/system bus can transfer data to the NPU.
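|
| A back-of-the-envelope sketch of that bound, with made-up
| example numbers: unbatched decode has to read every weight
| once per token, so tokens/s is roughly bandwidth divided by
| model size in bytes.
|
|    # Every parameter is read once per generated token, so memory
|    # bandwidth caps unbatched decode speed (illustrative numbers).
|    params = 7e9           # e.g. a 7B-parameter model
|    bytes_per_param = 1    # 8-bit weights
|    bandwidth = 50e9       # ~50 GB/s of DDR-class bandwidth
|
|    tokens_per_s = bandwidth / (params * bytes_per_param)
|    print(f"~{tokens_per_s:.1f} tokens/s upper bound")  # ~7.1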
|
| Having a programmable processor that controls the matrix
| multiplication function unit costs you silicon area for the
| instruction SRAM. For matrix-vector multiplication, the
| memory bottleneck is so big that it doesn't matter what
| architecture you are using; even CPUs are fast enough. There
| is no demand for getting rid of the not-very-costly
| instruction SRAM.
|
| "but what about the area taken up by the processor itself?"
|
| HAHAHAHA. Nice joke.
|
| Wait... you were serious? The area taken up by an in-order
| VLIW/TTA processor is so insignificant that I jammed it into
| the routing gap between two SRAM blocks. Sure, the matrix
| multiplication unit might take up some space, but decoding
| instructions is such an insignificant cost that anyone
| opposing programmability must have completely different goals
| and priorities than LLMs or machine learning.
| bee_rider wrote:
| LLM inference is a small task built into some other program you
| are running, right? Like an office suite with some sentence
| suggestion feature, probably a good use for an LLM, would be...
| mostly office suite, with a little LLM inference sprinkled in.
|
| So, the "ASIC" here is probably the CPU with, like, slightly
| better vector extensions. AVX1024-FP16 or something, haha.
| p1esk wrote:
| _would be... mostly office suite, with a little LLM inference
| sprinkled in._
|
| No, it would be LLM inference with a little bit of an office
| suite sprinkled in.
| winwang wrote:
| As far as I understand, the main issue for LLM inference is
| memory bandwidth and capacity. Tensor cores are already an ASIC
| for matmul, and they idle half the time waiting on memory.
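|
| A rough roofline sketch of why that happens (the GPU numbers
| are only ballpark figures): unbatched matrix-vector work has
| far too little arithmetic per byte of weights to keep the
| tensor cores busy.
|
|    # Roofline-style estimate with rough, illustrative GPU numbers.
|    peak_flops = 1.0e15      # ~1000 TFLOP/s dense FP16 tensor math
|    mem_bandwidth = 3.35e12  # ~3.35 TB/s of HBM bandwidth
|
|    # Unbatched matrix-vector: ~2 FLOPs per FP16 weight (2 bytes).
|    arith_intensity = 2 / 2  # about 1 FLOP per byte streamed
|
|    needed = peak_flops / mem_bandwidth  # FLOP/byte to saturate compute
|    print(f"need ~{needed:.0f} FLOP/byte, have ~{arith_intensity}")
|    print(f"tensor core utilization ~{arith_intensity / needed:.1%}")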
| evanjrowley wrote:
| You forgot to place "vertically-integrated unobtanium" after
| ASIC.
| namibj wrote:
| Soooo.... TPUv4?
| evanjrowley wrote:
| Yes, but the kinds that aren't on the market.
| KeplerBoy wrote:
| 4 times as efficient as on the SoC's low-end ARM cores, so
| many times less efficient than on modern GPUs, I guess?
|
| Not that I was expecting GPU-like efficiency from a fairly
| small-scale FPGA project. Nvidia engineers have spent
| thousands of man-years making sure that stuff works well on
| GPUs.
| bitdeep wrote:
| Not sure if you guys know: Groq is already doing this with
| their ASIC chips. So... they already passed the FPGA phase and
| are in the ASIC phase.
|
| The problem is: it seems that their costs are 1x or 2x what
| they are charging.
| qwertox wrote:
| The way I see it, one day we'll be buying small LLM
| cartridges.
| latchkey wrote:
| Probably more than 2x...
|
| "Semi analysis did some cost estimates, and I did some but
| you're likely paying somewhere in the 12 million dollar range
| for the equipment to serve a single query using llama-70b.
| Compare that to a couple of gpus, and it's easy to see why they
| are struggling to sell hardware, they can't scale down.
|
| Since they didn't use hbm, you need to stitch enough cards
| together to get the memory to hold your model. It takes a lot
| of 256mb cards to get to 64gb, and there isn't a good way to
| try the tech out since a single rack really can't serve an
| LLM."
|
| https://news.ycombinator.com/item?id=39966620
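|
| Quick arithmetic with the numbers in that quote:
|
|    card_mem_gb = 256 / 1024  # 256 MB of SRAM per card, in GB
|    model_gb = 64             # memory needed to hold the model
|    print(f"{model_gb / card_mem_gb:.0f} cards")  # 256 cards to fit it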
| faangguyindia wrote:
| Groq is unpredictable: while it might be fast for some
| requests, it's super slow or fails on others.
|
| The fastest commercial model is Google's Gemini Flash
| (predictable speed).
| rldjbpin wrote:
| As of now there are way too many parallel developments across
| abstraction layers, hardware and software, to really have the
| best combo just yet. Even this example is for an older
| architecture, because certain things just move slower than
| others.
|
| But when things plateau off, this, and then ASICs, would
| probably be the most efficient way ahead for "stable" versions
| of AI models during inference.
___________________________________________________________________
(page generated 2024-09-28 23:02 UTC)