[HN Gopher] LlamaF: An Efficient Llama2 Architecture Accelerator...
___________________________________________________________________
LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded
FPGAs
Author : PaulHoule
Score : 73 points
Date : 2024-09-27 12:16 UTC (10 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| jsheard wrote:
| Is there any particular reason you'd _want_ to use an FPGA for
| this? Unless your problem space is highly dynamic (e.g.
| prototyping) or you're making products in vanishingly low
| quantities for a price-insensitive market (e.g. military), an
| ASIC is always going to be better.
|
| There doesn't seem to be much flux in the low-level architectures
| used for inference at this point, so you may as well commit to an
| ASIC, as is already happening with Apple, Qualcomm, etc. building
| NPUs into their SoCs.
| someguydave wrote:
| Gotta prototype the thing somewhere. If it turns out that LLM
| algorithms have become pretty mature, I suspect accelerators of
| all kinds will be baked into silicon, especially for inference.
| jsheard wrote:
| That's the thing though, we're already there. Every new consumer
| ARM and x86 ASIC is shipping with some kind of NPU; the time for
| tentatively testing the waters with FPGAs was a few years ago,
| before this stuff came to market.
| PaulHoule wrote:
| But the NPU might be poorly suited to your model or workload, or
| just poorly designed.
| mistrial9 wrote:
| like this? https://www.d-matrix.ai/product/
| PaulHoule wrote:
| (1) Academics can build an FPGA design but not an ASIC; (2) an
| FPGA is a first step toward an ASIC.
| wongarsu wrote:
| This specific project looks like a case of "we have this
| platform for automotive and industrial use, running Llama on
| the dual-core ARM CPU is slow but there's an FPGA right next to
| it". That's all the justification you really need for a
| university project.
|
| Not sure how useful this is for anyone who isn't already locked
| into this specific architecture. But it might be a useful
| benchmark or jumping-off point for more useful FPGA-based
| accelerators, like ones optimized for 1-bit or 1.58-bit LLMs.
| israrkhan wrote:
| You can open-source your FPGA designs for wider collaboration
| with the community. Also, an FPGA is the starting step for making
| any modern digital chip.
| danielmarkbruce wrote:
| Model architecture changes fast. Maybe it will slow down.
| fhdsgbbcaA wrote:
| Looks like LLM inference will follow the same path as Bitcoin:
| CPU -> GPU -> FPGA -> ASIC.
| hackernudes wrote:
| I really doubt it. Bitcoin mining is quite fixed, just massive
| amounts of SHA256. On the other hand, ASICs for accelerating
| matrix/tensor math are already around. LLM architecture is far
| from fixed and currently being figured out. I don't see an ASIC
| any time soon unless someone REALLY wants to put a specific
| model on a phone or something.
| YetAnotherNick wrote:
| Google's TPU is an ASIC and performs competitively. Also, Tesla
| and Meta are building something, AFAIK.
|
| Although I doubt you could get a lot better, as GPUs already have
| half the die area reserved for matrix multiplication.
| danielmarkbruce wrote:
| It depends on your precise definition of ASIC. The FPGA
| thing here would be analogous to an MSIC where m = model.
|
| Building a chip for a specific model is clearly different from
| what a TPU is.
|
| Maybe we'll start seeing MSICs soon.
| YetAnotherNick wrote:
| LLMs and many other models spend 99% of their FLOPs in matrix
| multiplication. And the TPU initially had just a single
| operation, i.e. matrix multiply. Even if the MSIC were 100x
| better than a GPU at the other operations, it would only be
| about 1% faster overall.
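| A back-of-the-envelope Amdahl's-law check of that claim, with
| illustrative numbers rather than anything measured (a sketch,
| not from the paper or the thread):
|
|     # Assume 99% of runtime is matrix multiplication. Even if a
|     # model-specific chip sped up everything else by 100x, the
|     # overall gain stays around 1%.
|     matmul_fraction = 0.99  # share of runtime spent in matmul
|     other_speedup = 100.0   # assumed speedup on the remaining 1%
|     overall = 1.0 / (matmul_fraction
|                      + (1.0 - matmul_fraction) / other_speedup)
|     print(f"overall speedup: {overall:.3f}x")  # ~1.010x, about 1%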
| danielmarkbruce wrote:
| You can still optimize the various layers of memory for a
| specific model, make it all 8-bit or 4-bit or whatever you want,
| maybe burn in a specific activation function, all kinds of
| stuff.
|
| No chance you'd only get 1% speedup on a chip designed
| for a specific model.
| bee_rider wrote:
| LLM inference is a small task built into some other program you
| are running, right? Like an office suite with some sentence
| suggestion feature, probably a good use for an LLM, would be...
| mostly office suite, with a little LLM inference sprinkled in.
|
| So, the "ASIC" here is probably the CPU with, like, slightly
| better vector extensions. AVX1024-FP16 or something, haha.
| winwang wrote:
| As far as I understand, the main issue for LLM inference is
| memory bandwidth and capacity. Tensor cores are already an ASIC
| for matmul, and they idle half the time waiting on memory.
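| A rough sketch of why memory bandwidth tends to be the ceiling
| for single-stream decoding; the model size and bandwidth below
| are assumed example figures, not numbers from the paper:
|
|     # Each generated token reads (roughly) every weight once, so
|     # decode throughput is capped by bandwidth / model bytes.
|     model_params = 7e9        # assumed 7B-parameter model
|     bytes_per_param = 2       # FP16 weights
|     bandwidth_bytes_s = 1e12  # ~1 TB/s, HBM-class GPU memory
|
|     bytes_per_token = model_params * bytes_per_param
|     max_tokens_s = bandwidth_bytes_s / bytes_per_token
|     print(f"bandwidth-bound limit: ~{max_tokens_s:.0f} tok/s")  # ~71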
| KeplerBoy wrote:
| 4 times as efficient as on the SoC's low-end ARM cores, so many
| times less efficient than on modern GPUs, I guess?
|
| Not that I was expecting GPU-like efficiency from a fairly
| small-scale FPGA project. Nvidia engineers spent thousands of
| man-years making sure that stuff works well on GPUs.
| bitdeep wrote:
| Not sure if you guys know: Groq is already doing this with their
| ASIC chips. So they have already passed the FPGA phase and are
| in the ASIC phase.
|
| The problem is that their costs seem to be 1x or 2x what they
| are charging.
| qwertox wrote:
| The way I see it, one day we'll be buying small LLM cartridges.
___________________________________________________________________
(page generated 2024-09-27 23:00 UTC)