[HN Gopher] Energy-Efficient Llama 2 Inference on FPGAs via High...
___________________________________________________________________
Energy-Efficient Llama 2 Inference on FPGAs via High Level
Synthesis
Author : PaulHoule
Score : 81 points
Date : 2024-05-10 02:46 UTC (20 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| emsal wrote:
| Perhaps this is the obvious comment, but I really hope that
| something that employs technology like this can get off the
| ground and become a services company. The ideas about
| democratizing the AI inference hardware space and making it
| energy efficient really resonate with me.
| om8 wrote:
    | Don't you think that LLM inference is already very democratic?
    | Not saying that there is no room for improvement there --
    | there is still a lot to do in the space of speculative
    | decoding, quantization and other stuff. I'm saying that every
    | 16 year old with a decent enough personal computer can run the
    | latest open-weights models, like Llama3-8b, fully locally, and
    | they beat almost everything we had a year ago.
|
    | The part of this ecosystem that is as non-democratized as it
    | can be is training. It's currently impossible to train a
    | decent enough model with the resources available to one
    | person.
| almostgotcaught wrote:
| am i missing something? since when does vitis connect to vivado
| and do the p&r too?
|
| anyway if you're tempted by this, i strongly advise you to ponder
| this:
|
| > Run the Hardware build, should take around ~12 hours.
| dailykoder wrote:
| Welcome to the world of FPGA synthesis :-)
|
| > am i missing something? since when does vitis connect to
| vivado and do the p&r too?
|
    | I haven't done much HLS, but isn't that the normal case?
    | Translating the HLS into HDL and then doing the P&R with
    | Vivado?
| almostgotcaught wrote:
| i've done plenty of FPGA and HLS as well
|
| > Translating the HLS into HDL and then do the pnr with
| vivado?
|
| As far as I know, vitis does not give you tcl scripts (or
| whatever) for vivado - you have to do that yourself.
| ChrisKjellqvist wrote:
| I've been using HLS recently and from my understanding, all
| you need is `v++`.
|
        | [link to example HLS makefile](https://github.com/Xilinx/Vitis_Accel_Examples/blob/f61637e9...)
| bnprks wrote:
  | Seems like the abstract's speed and energy-efficiency claims
  | relative to an RTX 3090 are for the GPU running at a batch size
  | of 1. I wonder if someone with more experience can
| comment on how much throughput gain is possible on a GPU by
| increasing batch size without severely harming latency (and what
| the power consumption change might be).
|
  | And from a hardware cost perspective, the AWS f1.2xlarge
  | instances they used are $1.65/hr on-demand, vs. say $1.29/hr
  | for an A100 from Lambda Labs. Using FPGAs is a very interesting
  | line of thinking, but I'm not sure this really describes a
  | viable competitor to GPUs even for inference-only scenarios.
| dhruvdh wrote:
    | The FPGA being used is, I believe, one of the lowest-spec
    | SKUs.
|
    | AWS instance prices are more of a supply/demand/availability
    | thing; it would be more interesting to compare from a total
    | cost of ownership / perf-power-area perspective.
| sandGorgon wrote:
  | if i want to play around with this at home on an fpga
  | devkit...which fpga kit should i use?
|
  | the Xilinx Virtex UltraScale+ VU9P FPGA prototyping boards seem
  | to be 9000 USD. Anything in the $1000 range?
| almostgotcaught wrote:
| ebay has a bunch of Alveo U30 cards for ~500. 500k luts, 3000
| dsp slices, ~1M registers, even some DDR.
|
| fair warning: one does not "play around" with an FPGA. they are
| the antithesis of user friendly.
| BonusPlay wrote:
    | Note that the paper provides an environment for AWS FPGAs,
    | which you can rent on a per-hour basis.
|
    | As for cheaper FPGAs, the paper notes that the bottleneck is
    | the size of on-chip memory, so I doubt it will be easy to find
    | a cheaper model to reduce costs.
|
    | Another hidden cost would be the Vivado and Vitis (tooling)
    | licenses, which you need for most higher-end FPGAs.
| jesprenj wrote:
| > Although the GPU performs inference faster than the FPGA, one
| of the primary bottlenecks of deep learning inference is memory
| bandwidth and the availability of on-chip memory (Balasubramanian
| et al., 2021). A RTX 3090 has 24GB VRAM running at 1219 MHz with
| a base core clock of 1395 MHz (TechPowerUp, 2024). In comparison,
| a VU9P FPGA has 345.9 MB of combined on-chip BRAM and URAM,
| running at a much slower clock speed of around 200-300 MHz
| depending on the module; however, with much lower clock speeds,
| the FPGA is able to achieve better efficiency on power and energy
| consumption, as shown below.
|
| So as far as I can understand, the biggest "bottleneck"/limiting
| factor with using FPGAs for LLMs is the available memory -- with
| current large models exceeding 40 GiB in parameter size, GPUs and
| TPUs with DRAM look like the only way to go forward for the
| months to come ... Thoughts?
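  |
  | A quick back-of-envelope in Python (the parameter counts are
  | the published Llama 2 sizes; the 345.9 MB figure is from the
  | quote above):
  |
  |   # Weight footprint of each Llama 2 size vs. the VU9P's
  |   # 345.9 MB of combined on-chip BRAM + URAM quoted above.
  |   ON_CHIP_MB = 345.9
  |   for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
  |       for bits in (16, 8, 4):
  |           mb = params * bits / 8 / 1e6
  |           print(name, bits, "bit:", round(mb), "MB",
  |                 "fits" if mb <= ON_CHIP_MB else "too big")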
| dailykoder wrote:
    | Wouldn't be surprised if AMD or Intel come up with an FPGA
    | especially for this application. At least AMD advertises a lot
    | with their AI FPGA stuff, so they'll probably build one with
    | either a lot more BRAM or the ability to attach some very fast
    | RAM? But going from a few megabytes of on-chip RAM to
    | gigabytes sounds very expensive. DRAM is just too slow, I
    | guess.
| bnprks wrote:
| Yeah, I think DRAM is almost certainly the future, just in
| terms of being able to afford the memory capacity to fit large
    | models. Even Cerebras, using a full wafer, only gets up to 44
    | GB of SRAM on a chip (at a cost of over $2M).
|
    | An interesting twist is that this DRAM might not need to be a
    | central pool where bandwidth must be shared globally -- e.g.
    | the Tenstorrent strategy seems to be aiming at using smaller
    | chips that each have their own memory. Splitting up the memory
    | should yield very high aggregate bandwidth even with slower
    | DRAM, which is great as long as they can figure out the cross-
    | chip data flow to avoid networking bottlenecks.
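    |
    | Toy numbers to illustrate (the chip count and per-chip
    | bandwidth here are rough assumptions, not Tenstorrent specs):
    |
    |   # Aggregate bandwidth of many small chips with private DRAM
    |   # vs. one shared pool; every figure here is assumed.
    |   chips = 32
    |   per_chip_gb_s = 100.0      # modest GDDR/LPDDR per chip
    |   shared_pool_gb_s = 2000.0  # ballpark for one big HBM part
    |   print(chips * per_chip_gb_s, "vs", shared_pool_gb_s, "GB/s")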
| cherioo wrote:
  | While an FPGA may prove more efficient than a 3090, which is
  | primarily a gaming card, I can't see how it could be more
  | efficient than a dedicated training/inference card, as the
  | latter is effectively an ASIC, not to mention the FPGA's memory
  | and bandwidth limitations.
|
  | Is there something I am missing that makes FPGAs potentially
  | more viable, besides not feeding into NVIDIA's greed?
| KeplerBoy wrote:
    | A dedicated training/inference card is still more general
    | than a Llama 2 inference card. It's obvious that you will get
    | better efficiency the more you tailor your silicon to the
    | task, albeit with diminishing returns.
| kernelsanderz wrote:
  | What's really cool is that this was built on Karpathy's llama2.c
| repo https://github.com/karpathy/llama2.c
|
  | A great example of the unexpected things that happen when you
  | put great code into the commons.
|
| I bet Andrej never expected anything like this when he released
| it.
| samus wrote:
| Keep in mind that Andrej is probably holding back a lot of
| optimizations for the sake of keeping the code comprehensible!
| hongspike wrote:
  | Trying to understand in which cases we would want to use FPGAs
  | rather than GPUs.
|
| Memory bandwidth for FPGAs seems worse, so for serving models
| don't GPUs still win out?
| bunnie wrote:
  | Not exactly a fair comparison - the design is limited to running
  | models that fit entirely in the FPGA's on-chip RAM. This greatly
  | reduces power consumption because the FPGA does not have to pay
  | the overhead of DRAM PHYs, termination, DRAM chips, etc., which
  | is a relatively fixed cost irrespective of capacity. This means
  | the energy cost per bit transferred is much higher for the GPU
  | than for the FPGA.
|
  | Thus the GPU is storing a 110M model in gigabytes of external
  | RAM, and paying the power penalty associated with the excess
  | capacity, while the chosen 110M model fits neatly within the
  | FPGA's on-chip RAM, and the design can trim all that overhead
  | accordingly.
|
  | A fairer comparison would either run a larger model that has
  | both systems hitting external RAM, or compare power/performance
  | against some sort of inference ASIC that has all the RAM on
  | chip (maybe a Cerebras, but scaled according to the portion of
  | the wafer actually used for the model).
|
| That being said, it's neat that they open sourced their work, and
| it's worth looking down this path more.
| dhruvdh wrote:
| I don't know what you are trying to say here. If one system
| doesn't need to move as much data because it is more flexible,
| that is a good thing. What do we gain by making it "fair"?
| alpacalaca wrote:
      | If you're limiting the size of the model to 110 million
      | parameters (105 MiB assuming int8) because that's what will
      | fit onto your FPGA, then of course it's going to be more
      | energy efficient than a Broadwell-era Xeon with a 24GB RTX
      | 3090. It's like concluding that a rickshaw is more efficient
      | than a train: absolutely true in a technical sense if you're
      | only transporting a single passenger, but it makes no sense
      | if you're transporting hundreds if not thousands of
      | passengers.
|
      | A more apt comparison would have been with a phone made in
      | the past 5 years; even without an AI accelerator chip I'm
      | sure you could manage 20-30+ t/s from a 110M model, though
      | this depends entirely on the memory bandwidth of the phone.
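      |
      | Rough roofline for the memory-bound, batch-1 case (the
      | phone bandwidth figures below are guesses, not measured
      | numbers):
      |
      |   # Batch-1 decoding streams the full weight set once per
      |   # token, so tokens/s <= memory bandwidth / model size.
      |   model_gb = 0.110          # 110M params at int8
      |   for phone, gb_s in [("older", 15), ("flagship", 50)]:
      |       print(phone, round(gb_s / model_gb), "tok/s ceiling")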
| mikewarot wrote:
| Most of the work of LLMs is in large matrix multiply-accumulate
| operations. You could take all those constants and convert
| everything to fixed point, then compile it down to the smallest
| possible directed acyclic graph of binary logical operations. (In
| other words, do the NAND to Tetris thing in reverse)
|
  | The problem is that FPGAs are heterogeneous and highly
  | optimized to reduce latency rather than for efficiency, so
  | there are strong limits to this approach.
|
  | On the plus side, LLMs are made of many layers, so you could
  | put one layer per FPGA and just use the chips' high-speed links
  | to feed the results to the next FPGA.
|
  | You'd have maybe a millisecond of latency at a clock rate of
  | 100 MHz, but possibly a million tokens per second across
  | serial / parallel streams of execution.
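  |
  | Back-of-envelope on those last two numbers, just spelling out
  | the arithmetic:
  |
  |   # 1 ms at 100 MHz is a pipeline ~100,000 cycles deep; 1M
  |   # tokens/s means a new token enters every ~100 cycles.
  |   clock_hz = 100e6
  |   depth_cycles = clock_hz * 1e-3    # 1 ms latency target
  |   cycles_per_token = clock_hz / 1e6
  |   print(depth_cycles, cycles_per_token)  # 100000.0 100.0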
| creativeSlumber wrote:
  | Is there anything preventing designing a custom ASIC for LLM
  | training / inference rather than relying on GPUs? I've been
  | away from hardware for a long time, but if you can compile it
  | to run on an FPGA, then a custom ASIC should be even faster
  | (clock speed) and more power efficient.
| brcmthrowaway wrote:
| SiFive should work on this!
| zekrioca wrote:
| If you are getting a 403, you may read the HTML at
| https://bytez.com/docs/arxiv/2405.00738/paper
___________________________________________________________________
(page generated 2024-05-10 23:02 UTC)