[HN Gopher] Energy-Efficient Llama 2 Inference on FPGAs via High...
       ___________________________________________________________________
        
       Energy-Efficient Llama 2 Inference on FPGAs via High Level
       Synthesis
        
       Author : PaulHoule
       Score  : 81 points
       Date   : 2024-05-10 02:46 UTC (20 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | emsal wrote:
       | Perhaps this is the obvious comment, but I really hope that
       | something that employs technology like this can get off the
       | ground and become a services company. The ideas about
       | democratizing the AI inference hardware space and making it
       | energy efficient really resonate with me.
        
         | om8 wrote:
         | Don't you think that LLM inference is already very democratic?
         | Not saying that there is no room for improvements there --
         | there is still a lot to do in the space of speculative
          | decoding, quantization, and other stuff. I'm saying that every
          | 16-year-old with a decent enough personal computer can run the
          | latest open-weights models, like Llama3-8B, fully locally, and
          | they beat almost everything we had a year ago.
         | 
         | The part of this ecosystem that is as non-democratized as it
          | can be is training. It's currently impossible to train a
          | decent model with the resources available to one person.
        
       | almostgotcaught wrote:
       | am i missing something? since when does vitis connect to vivado
       | and do the p&r too?
       | 
       | anyway if you're tempted by this, i strongly advise you to ponder
       | this:
       | 
       | > Run the Hardware build, should take around ~12 hours.
        
         | dailykoder wrote:
         | Welcome to the world of FPGA synthesis :-)
         | 
         | > am i missing something? since when does vitis connect to
         | vivado and do the p&r too?
         | 
         | I haven't done much HLS, but isn't that the normal case?
          | Translating the HLS into HDL and then doing the P&R with Vivado?
        
           | almostgotcaught wrote:
           | i've done plenty of FPGA and HLS as well
           | 
            | > Translating the HLS into HDL and then doing the P&R
            | with Vivado?
           | 
           | As far as I know, vitis does not give you tcl scripts (or
           | whatever) for vivado - you have to do that yourself.
        
             | ChrisKjellqvist wrote:
             | I've been using HLS recently and from my understanding, all
             | you need is `v++`.
             | 
              | [link to example HLS makefile](https://github.com/Xilinx/Vitis_Accel_Examples/blob/f61637e9...)
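              | 
              | For context, a bare-bones HLS kernel plus the two v++
              | steps might look roughly like this (a hypothetical
              | sketch, not the paper's kernel; the platform name is a
              | placeholder):
              | 
              |   // toy_kernel.cpp -- hypothetical minimal HLS kernel.
              |   // Build (placeholder platform):
              |   //   v++ -c -t hw --platform <platform> \
              |   //       -k vadd -o vadd.xo toy_kernel.cpp
              |   //   v++ -l -t hw --platform <platform> \
              |   //       -o vadd.xclbin vadd.xo
              |   extern "C" void vadd(const int *a, const int *b,
              |                        int *out, int n) {
              |   #pragma HLS INTERFACE m_axi port=a bundle=gmem0
              |   #pragma HLS INTERFACE m_axi port=b bundle=gmem1
              |   #pragma HLS INTERFACE m_axi port=out bundle=gmem0
              |       for (int i = 0; i < n; i++) {
              |   #pragma HLS PIPELINE II=1
              |           out[i] = a[i] + b[i];
              |       }
              |   }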
        
       | bnprks wrote:
        | Seems like the abstract's speed and energy-efficiency claims
        | relative to an RTX 3090 are for the GPU running at a batch
        | size of 1. I wonder if someone with more experience can
       | comment on how much throughput gain is possible on a GPU by
       | increasing batch size without severely harming latency (and what
       | the power consumption change might be).
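        | 
        | My very rough mental model (hand-wavy numbers, treating
        | batch-1 decode as purely memory-bandwidth bound, so tokens/s
        | is about bandwidth divided by bytes of weights read per
        | token, and a batch of B reuses each weight fetch B times
        | until compute becomes the limit):
        | 
        |   // Rough estimate only; the numbers below are assumptions,
        |   // not measurements from the paper.
        |   #include <cstdio>
        |   int main() {
        |       double bw_gbs   = 936.0; // RTX 3090 bandwidth, GB/s
        |       double model_gb = 14.0;  // e.g. a 7B model in fp16
        |       double batch1   = bw_gbs / model_gb; // ~67 tok/s cap
        |       for (int b = 1; b <= 64; b *= 4)
        |           std::printf("batch %2d: ~%.0f tok/s (ideal)\n",
        |                       b, batch1 * b);
        |   }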
       | 
       | And from a hardware cost perspective the AWS f1.2xlarge instances
       | they used are $1.65/hr on-demand, vs say $1.29/hr for an A100
       | from Lambda Labs. A very interesting line of thinking to use
       | FPGAs, but I'm not sure if this is really describing a viable
       | competitor to GPUs even for inference-only scenarios.
        
         | dhruvdh wrote:
          | The FPGA being used is, I believe, one of the lowest-specced SKUs.
         | 
          | AWS instance prices are more of a supply/demand/availability
          | thing; it would be more interesting to compare from a total
          | cost of ownership / perf-power-area perspective.
        
       | sandGorgon wrote:
        | if i want to play around with this at home on an fpga
        | devkit... which fpga kit should i use?
       | 
        | the Xilinx Virtex UltraScale+ VU9P FPGA prototyping boards seem
        | to be 9000 USD. Anything in the $1000 range?
        
         | almostgotcaught wrote:
         | ebay has a bunch of Alveo U30 cards for ~500. 500k luts, 3000
         | dsp slices, ~1M registers, even some DDR.
         | 
         | fair warning: one does not "play around" with an FPGA. they are
         | the antithesis of user friendly.
        
         | BonusPlay wrote:
          | Note that the paper provides an environment for AWS FPGAs,
          | which you can rent on a per-hour basis.
         | 
         | As for cheaper FPGAs, the paper notes that the bottleneck is
          | the size of on-chip memory. So I doubt it will be easy to
          | find a cheaper model to reduce costs.
         | 
          | Another hidden cost would be Vivado and Vitis (tooling)
          | licenses, which you need for most upper-end FPGAs.
        
       | jesprenj wrote:
       | > Although the GPU performs inference faster than the FPGA, one
       | of the primary bottlenecks of deep learning inference is memory
       | bandwidth and the availability of on-chip memory (Balasubramanian
       | et al., 2021). A RTX 3090 has 24GB VRAM running at 1219 MHz with
       | a base core clock of 1395 MHz (TechPowerUp, 2024). In comparison,
       | a VU9P FPGA has 345.9 MB of combined on-chip BRAM and URAM,
       | running at a much slower clock speed of around 200-300 MHz
       | depending on the module; however, with much lower clock speeds,
       | the FPGA is able to achieve better efficiency on power and energy
       | consumption, as shown below.
       | 
       | So as far as I can understand, the biggest "bottleneck"/limiting
       | factor with using FPGAs for LLMs is the available memory -- with
       | current large models exceeding 40 GiB in parameter size, GPUs and
       | TPUs with DRAM look like the only way to go forward for the
       | months to come ... Thoughts?
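        | 
        | To make the gap concrete, a quick sizing sketch (my own
        | rough numbers, assuming int8, i.e. one byte per parameter):
        | 
        |   // Bytes of weights at int8 versus the VU9P's ~345.9 MB of
        |   // combined on-chip BRAM + URAM.
        |   #include <cstdio>
        |   int main() {
        |       const double on_chip_mb = 345.9;
        |       struct Model { const char *name; double params_m; };
        |       Model models[] = {{"llama2.c 110M", 110},
        |                         {"Llama 2 7B", 7000},
        |                         {"Llama 2 70B", 70000}};
        |       for (const Model &m : models)
        |           std::printf("%-14s ~%6.0f MB of weights"
        |                       " (on-chip: %.1f MB)\n",
        |                       m.name, m.params_m, on_chip_mb);
        |   }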
        
         | dailykoder wrote:
          | Wouldn't be surprised if AMD or Intel come up with an FPGA
          | especially for this application. At least AMD advertises
          | its AI FPGA stuff a lot, so they'll probably build one with
          | either a lot more BRAM or the ability to attach some very
          | fast RAM? But going from a few megabytes of RAM to
          | gigabytes sounds very expensive. DRAM is just too slow, I
          | guess.
        
         | bnprks wrote:
         | Yeah, I think DRAM is almost certainly the future, just in
         | terms of being able to afford the memory capacity to fit large
         | models. Even Cerebras using a full wafer only gets up to 44 GB
         | of SRAM on a chip (at a cost over $2M).
         | 
         | An interesting twist is that this DRAM might not need to be a
         | central pool where bandwidth must be shared globally -- e.g.
          | the Tenstorrent strategy seems to be aiming for using smaller
         | chips that each have their own memory. Splitting up memory
         | should yield very high aggregate bandwidth even with slower
         | DRAM, which is great as long as they can figure out the cross-
          | chip data flow to avoid networking bottlenecks.
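          | 
          | (Rough illustration with made-up numbers: split the model
          | across chips that each have their own DRAM, and aggregate
          | bandwidth scales with the chip count.)
          | 
          |   // Made-up numbers, purely to show the scaling argument.
          |   #include <cstdio>
          |   int main() {
          |       double per_chip_gbs = 100.0; // modest local DRAM, GB/s
          |       for (int chips = 1; chips <= 64; chips *= 4)
          |           std::printf("%2d chips: ~%.1f TB/s aggregate\n",
          |                       chips, chips * per_chip_gbs / 1000.0);
          |   }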
        
       | cherioo wrote:
        | While an FPGA may prove more efficient than a 3090, a primarily
        | gaming card, I can't see how it could be more efficient than a
        | dedicated training/inference card, as the latter is effectively
        | an ASIC, not to mention memory and bandwidth limitations.
       | 
        | Is there something I am missing that makes FPGAs potentially
        | more viable, besides not feeding into NVIDIA's greed?
        
         | KeplerBoy wrote:
         | A dedicated training/inference card is still more general than
          | a Llama 2 inference card. It's obvious that you will get
         | better efficiency the more you tailor your silicon to the task,
         | with diminishing gains but still.
        
       | kernelsanderz wrote:
        | What's really cool is that this was built on Karpathy's llama2.c
       | repo https://github.com/karpathy/llama2.c
       | 
        | A great example of the unexpected things that happen when you
        | put great code into the commons.
       | 
       | I bet Andrej never expected anything like this when he released
       | it.
        
         | samus wrote:
         | Keep in mind that Andrej is probably holding back a lot of
         | optimizations for the sake of keeping the code comprehensible!
        
       | hongspike wrote:
        | Trying to understand in what cases we would want to use FPGAs
        | rather than GPUs.
       | 
       | Memory bandwidth for FPGAs seems worse, so for serving models
       | don't GPUs still win out?
        
       | bunnie wrote:
        | Not exactly a fair comparison - the design is limited to
        | running models that fit entirely in the FPGA's on-chip RAM.
        | This greatly reduces power consumption because the FPGA does
        | not have to pay the overhead of DRAM PHYs, termination, DRAM
        | chips, etc., which is a relatively fixed cost irrespective of
        | capacity. This means the energy cost per bit transferred is
        | much higher for the GPU than for the FPGA.
       | 
       | Thus the GPU is storing a 110M model in gigabytes of external
       | RAM, and paying the power penalty associated with the excess
        | capacity, while the chosen 110M model fits neatly within the
        | FPGA's on-chip RAM, and the design can trim all that overhead
       | accordingly.
       | 
        | A fairer comparison would either run a larger model that has
        | both systems hitting external RAM, or compare power/performance
        | against some sort of inference ASIC that has all the RAM on
        | chip (maybe a Cerebras, but scaled according to the portion of
        | the wafer actually used for the model).
       | 
       | That being said, it's neat that they open sourced their work, and
       | it's worth looking down this path more.
        
         | dhruvdh wrote:
         | I don't know what you are trying to say here. If one system
         | doesn't need to move as much data because it is more flexible,
         | that is a good thing. What do we gain by making it "fair"?
        
           | alpacalaca wrote:
           | If you're limiting the size of the model to 110 million
           | parameters (105MiB assuming int8) because that's what will
            | fit onto your FPGA, then of course it's going to be more
            | energy-efficient than a Broadwell-era Xeon with a 24GB RTX
           | 3090. It's like concluding that a rickshaw is more efficient
           | than a train, something that will absolutely be true in a
           | technical sense if you're only transporting a single
           | passenger, but makes no sense if you're transporting hundreds
           | if not thousands of passengers.
           | 
            | A more apt comparison would have been with a phone made in
            | the past 5 years; even without an AI accelerator chip, I'm
            | sure you could manage 20-30+ t/s from a 110M model, though
            | this depends entirely on the memory bandwidth of the phone.
        
       | mikewarot wrote:
       | Most of the work of LLMs is in large matrix multiply-accumulate
       | operations. You could take all those constants and convert
       | everything to fixed point, then compile it down to the smallest
       | possible directed acyclic graph of binary logical operations. (In
       | other words, do the NAND to Tetris thing in reverse)
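        | 
        | As a starting point, the thing you'd be folding down is just
        | a fixed-point matvec with the weights baked in as constants,
        | something like this toy sketch (int8 weights, 32-bit
        | accumulator; the values are made up):
        | 
        |   #include <cstdint>
        |   constexpr int N = 4;
        |   // made-up int8 weights; in practice these would be the
        |   // quantized trained parameters, folded in as constants
        |   constexpr int8_t W[N][N] = {{ 3, -1,  0,  2},
        |                               {-2,  5,  1,  0},
        |                               { 1,  0, -4,  3},
        |                               { 0,  2,  2, -1}};
        |   void matvec(const int8_t x[N], int32_t y[N]) {
        |       for (int i = 0; i < N; i++) {
        |           int32_t acc = 0;
        |           for (int j = 0; j < N; j++)
        |               acc += int32_t(W[i][j]) * x[j]; // fixed-point MAC
        |           y[i] = acc;
        |       }
        |   }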
       | 
        | The problem is that FPGAs are heterogeneous and highly
        | optimized to reduce latency rather than for efficiency, so
        | there are strong limits to this approach.
       | 
        | On the plus side, LLMs are made of a lot of layers, so you
        | could take one layer per FPGA and just use the high-speed
        | links in the chips to feed the results to the next FPGA.
       | 
        | You'd have maybe a millisecond of latency at a clock rate of
        | 100 MHz, but possibly a million tokens per second across
        | serial / parallel streams of execution.
        
       | creativeSlumber wrote:
        | Is there anything preventing designing a custom ASIC for LLM
        | training / inference rather than relying on GPUs? I've been
        | away from hardware for a long time, but if you can compile it
        | to run on an FPGA, then a custom ASIC should be even faster
        | (clock speed) and more power efficient.
        
       | brcmthrowaway wrote:
       | SiFive should work on this!
        
       | zekrioca wrote:
       | If you are getting a 403, you may read the HTML at
       | https://bytez.com/docs/arxiv/2405.00738/paper
        
       ___________________________________________________________________
       (page generated 2024-05-10 23:02 UTC)