[HN Gopher] Hardware Acceleration of LLMs: A comprehensive surve...
___________________________________________________________________
Hardware Acceleration of LLMs: A comprehensive survey and
comparison
Author : matt_d
Score : 248 points
Date   : 2024-09-06 22:09 UTC (1 day ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| moffkalast wrote:
| In-memory sounds like the way to go, not just in terms of
| performance, but in that it makes no sense to build an ASIC or
| program an FPGA for a model that will most likely be obsolete
| in a few months, if you're lucky.
| limit499karma wrote:
| https://arxiv.org/pdf/2402.09709
| throwawaymaths wrote:
| Yeah, it's not like foundational models ever share compute
| kernels, or anything.
| moffkalast wrote:
| Eh, there are so many shenanigans these days even in fine
| tuning, people adding empty layers and pruning and whatnot,
| that it's unlikely that even models based on the same one will
| have the same architecture.
|
| For new foundation models it's even worse, because there's
| some fancy experiment every time and the llama.cpp team needs
| two weeks to figure out how to implement it so the model can
| even run.
| throwawaymaths wrote:
| Yeah so you might need to implement a weird activation or
| positional encoding in software or something but I suspect
| 90% will probably be the same... If it's just layer count
| or skipped matrices I assume it should be possible to write
| an orchestrator that could run most of those models...
| Unless we move to mamba or something
| yjftsjthsd-h wrote:
| I'm unfamiliar; in this context is "in-memory" specialized
| hardware that combines CPU+RAM?
| limit499karma wrote:
| In-mem (generally) means no (re)loading of data from a storage
| device.
| yjftsjthsd-h wrote:
| Sure, but I don't think that makes sense here; when I run an
| LLM on a CPU, I load it into memory and run it; when I run on
| a GPU, I load the model into the GPU's memory and run it; and
| while I don't have anything like that much money to burn, I
| imagine that if I used an FPGA I would load the model into its
| memory and run it from there. So the fact that they're saying
| "in-memory" in _contrast_ to e.g. a GPU makes me think they're
| talking about something different here.
| mmoskal wrote:
| It's a different kind of memory chip that also does some
| computation. See
| https://en.m.wikipedia.org/wiki/In-memory_processing
| adrian_b wrote:
| While this has been proposed repeatedly for many decades,
| I doubt that it will ever become useful.
|
| Combining memory with computation seems good in theory,
| but it is difficult to do in practice.
|
| The fabrication technologies for DRAM and for computational
| devices are very different. If you implement computational
| units on a DRAM chip, they will have much worse performance
| than units implemented with a dedicated fabrication process:
| their performance per watt and per unit area will be worse,
| leading to higher costs than using separate memory and
| compute devices.
|
| The higher cost might be acceptable in certain cases if much
| higher performance were obtained. However, unlike a
| CPU/GPU/FPGA, which you can easily reprogram to run a
| completely different algorithm, a device with in-memory
| computation would inevitably be much less flexible. Either it
| implements extremely simple operations, like adding to or
| multiplying memory, which would not improve performance much
| due to communication overheads, or it implements more complex
| operations tied to whatever ML/AI algorithm is popular at the
| moment, which would be hard to adapt when better algorithms
| are discovered.
| vlovich123 wrote:
| I suspect that the attempts to remove the DRAM controller and
| embed it into the chips directly will succeed in meaningfully
| reducing the power per retrieval and increasing bandwidth by
| enough that it'll postpone these more esoteric architectures,
| even though it's pretty clear that bulk data processing like
| LLMs (and maybe even graphics) is better suited to this
| architecture, since it's cheaper to fan out the code than it
| is to shuffle all these bits back and forth.
| p1esk wrote:
| In-memory doesn't mean in-DRAM.
|
| https://arxiv.org/pdf/2406.08413
| adrian_b wrote:
| SRAM does not have enough capacity to be useful for in-
| memory computation.
|
| The existing CPUs, GPUs and FPGAs are full of SRAM that
| is intimately mixed with the computational parts of the
| chips and you could not find any structure improving on
| that.
|
| All the talk about in-memory computing is strictly about
| DRAM, because only DRAM could increase the amount of memory
| from the at most hundreds of MB currently contained inside
| the biggest CPUs or GPUs to the hundreds of GB that might be
| needed by the biggest ML/AI applications.
|
| All the other memory technologies mentioned in the paper you
| linked are many years or even decades away from being usable
| as simple memory devices. In order to be
| used for in-memory computing, one must first solve the
| problem of making them work as memories. For now, it is
| not even clear if this simpler problem can be solved.
| p1esk wrote:
| Let's see: Mythic uses flash, d-Matrix uses SRAM.
| Encharge is the only one who uses capacitor based
| crossbars, but those are custom built from scratch and
| very different from any existing DRAM technology.
|
| Which companies are using DRAM for in-memory computing?
| adrian_b wrote:
| Mythic does not do in-memory computing, despite their
| claims.
|
| Flash cannot be used for in-memory computing, because
| writing it is too slow.
|
| According to what they say, they have a device that uses
| analog computing for inference. They have a
| flash memory, but that stores only the weights of the
| model, which are constant during the computation, so the
| flash is not a working memory, it is used only for the
| reconfiguration of the device, when a new model is
| loaded.
|
| Analog computing for inference is actually something that
| is much more promising than in-memory computing, so
| Mythic might be able to develop useful devices.
|
| d-Matrix appears to do true in-memory computing, but the
| price of their devices for an amount of memory matching a
| current GPU will be astronomical.
|
| Perhaps there will be organizations willing to pay huge
| amounts of money for a very high performance, like those
| which are buying Cerebras nowadays, but such an expensive
| technology will always be a niche too small to be
| relevant for most users.
| p1esk wrote:
| You don't need to write anything back to flash to use it
| to compute something: the output of a floating gate
| transistor is written to some digital buffer nearby
| (usually SRAM). Yes, it's only used for inference, not
| sure how that disqualifies it from being in-memory
| computing? In-memory computing simply means there's a
| memory device/circuit (transistor, capacitor, memristor,
| etc) that holds a value and is used to compute another
| value based on some input received by the cell. As
| opposed to a traditional ALU which receives two inputs
| from a separate memory circuit (registers) to compute the
| output.
| adrian_b wrote:
| This is not in-memory computing, because from the point
| of view of the inference algorithm the flash memory is
| not a memory.
|
| You can remove all the flash memory and replace all its
| bits with suitable connections to ground or the supply
| voltage, corresponding to the weights of the model.
|
| Then the device without any flash memory will continue to
| function exactly like before, computing the inference
| algorithm without changes. Therefore it should be obvious
| that this is not in-memory computing, if you can remove
| the memory without affecting the computing.
|
| The memory is needed only if you want to be able to
| change the model, by loading another set of weights.
|
| The flash memory is a configuration memory, exactly like
| the configuration memories of logic devices like FPGAs or
| CPLDs. In FPGAs or CPLDs you do the same thing, you load
| the configuration memory with a new set of values, then
| the FPGA/CPLD will implement a new logic device, until
| the next reloading of the configuration memory.
|
| Exactly like in this device, the configuration memory of
| the FPGAs/CPLDs, which may be a flash memory too, is not
| counted as a working memory. The FPGAs/CPLDs contain
| memories and registers, but those are distinct from the
| configuration memory and they cannot be implemented with
| flash memory, like the configuration memory.
|
| In this inference device with analog computing there must
| also be a working memory, which contains mutable state,
| but that must be implemented with capacitors that store
| analog voltages.
|
| You might talk about in-memory computing only with reference
| to the analog memory built from capacitors, but even that
| description is likely to be misleading: the inference device
| is more probably some kind of dataflow structure, where the
| memory capacitors implement something like analog shift
| registers rather than memory cells in which information is
| stored for later retrieval.
| vlovich123 wrote:
| Am I misreading something?
|
| > At their core, NVM arrays are arranged in two
| dimensions and programmed to discrete conductances (Fig.
| 5). Each crosspoint in the array has two terminals
| connected to a word line and a bit line. Digital inputs
| are converted to voltages, which then activate the word
| lines. The multiplication operation is performed between
| the conductance g_ij and the voltage V_i by applying Ohm's
| law at each cell, while currents I_j accumulate along each
| column according to Kirchhoff's current law
|
| Sounds like the compute element is embedded within the DRAM,
| but instead of doing a digital computation it's done in
| analog space (which feels a bit wrong, since the DAC+ADC
| combo would eat quite a bit of power, but maybe it's easier
| to manufacture or there are other reasons to do it in
| analog space).
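|
| Concretely, the quoted scheme amounts to a matrix-vector
| multiply in the analog domain: each column current is
| I_j = sum_i(g_ij * V_i). A minimal numerical sketch of that
| operation in plain Python (illustrative only; it ignores the
| DAC/ADC conversion and device non-idealities):
|
|     # Idealized crossbar MVM: I_j = sum_i g_ij * V_i
|     # (Ohm's law per cell, Kirchhoff's current law per column).
|     def crossbar_mvm(g, v):
|         # g: conductances (rows = word lines, cols = bit lines)
|         # v: word-line voltages; returns the column currents
|         currents = [0.0] * len(g[0])
|         for i, vi in enumerate(v):      # each voltage drives a row
|             for j in range(len(currents)):
|                 currents[j] += g[i][j] * vi  # accumulate on column j
|         return currents
|
|     # Example: a 3x2 conductance array holding (scaled) weights.
|     g = [[0.5, 1.0],
|          [0.0, 0.25],
|          [1.0, 0.75]]
|     v = [0.25, 0.5, 0.125]
|     print(crossbar_mvm(g, v))           # -> [0.25, 0.46875]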
|
| Or you're saying it would be better with flash storage
| because it could be used for even larger models. I think
| that's right, but my overall point holds - removing the DRAM
| controller could free up a significant amount of DRAM
| bandwidth (like 20x IIRC) and reduce power (by 100x IIRC).
| There's value in that regardless; it would just be a free
| speedup and would significantly benefit existing LLMs that
| rely on RAM. An analog compute circuit embedded within flash
| would be usable basically only for today's LLM architectures,
| wouldn't be very flexible, and would require a huge change in
| how this stuff works to take advantage of it. It might still
| make sense if the architecture remains largely unchanged and
| other approaches can't be as competitive, but it does lock
| you into a design more than something more digitally
| programmable that can also do other things.
| sroussey wrote:
| Using analog means it will be faster (digital is slow,
| waiting for the carry on each bit), but I am curious how they
| do the ADC. RAM processes are generally so different that it
| makes sense not to introduce logic gates into the memory.
| vlovich123 wrote:
| Digital is slow, but I would think converting the signal
| to/from digital might be slow too. Maybe it's taking the
| analog signal from the RAM itself & storing back the
| analog signal with a little bit of cleanup without ever
| going into the digital domain?
| sroussey wrote:
| Oh, absolutely. Never switching to digital would be the
| way. And not hard for low bit counts like 4. I am very
| interested in the methodology if they do this with 64bit.
| p1esk wrote:
| _Am I misreading something?_
|
| Yes, you are. NVM stands for "non-volatile memory", which
| is literally the opposite of DRAM.
|
| Analog computation can be done using any memory cell
| technology (transistor, capacitor, memristor, etc), but
| the result will always go through ADC to be stored in a
| digital buffer.
|
| Flash does not provide any advantage as far as model size
| goes; the size of a crossbar is constrained by other factors
| (e.g. current leakage), and typically it's in the ballpark of
| 1k x 1k matmuls. You simply put more of them on a chip and
| try to parallelize as much as possible.
|
| But I largely agree with your conclusion.
| janwas wrote:
| +1. Personal opinion: accelerators are useful today but have
| kept us in a local minimum which is certainly not ideal.
| There are interesting approaches such as near-linear low-rank
| approximation of attention gradients [1]. Would we rather
| have that, or somewhat better constant factors?
|
| [1] https://arxiv.org/html/2408.13233v1
| fulafel wrote:
| Not in the context of discussing hardware architectures.
|
| (Context in the abstract is "First, we present the
| accelerators based on FPGAs, then we present the accelerators
| targeting GPUs and finally accelerators ported on ASICs and
| In-memory architectures" and the section title in the paper
| body is "V. In-Memory Hardware Accelerators")
| kurthr wrote:
| I'd expect it to be MAC hardware embedded on the DRAM die (or
| in the case of stacked HBM, possibly on the substrate die).
|
| To quote from an old article about such acceleration, which
| sees 19x improvements over DRAM + GPU:
|
|     Since MAC operations consume the dominant part of most ML
|     workload runtime, we propose in-subarray multiplication
|     coupled with intra-bank accumulation. The multiplication
|     operation is performed by performing AND operations and
|     addition in column-based fashion while only adding less
|     than 1% area overhead.
|
| https://arxiv.org/pdf/2105.03736
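|
| The "AND plus column-wise addition" trick is essentially
| shift-and-add multiplication done one bit-plane at a time,
| which maps well onto DRAM rows. A rough software analogue in
| Python (illustrative only, not the paper's actual in-subarray
| circuit):
|
|     # Bit-serial multiply: a * b as a sum of AND'ed, shifted
|     # bit-planes. In-DRAM schemes apply the AND across a whole
|     # row of operands at once and accumulate per column.
|     def bitserial_mul(a, b, bits=8):
|         acc = 0
|         for k in range(bits):
|             bit = (b >> k) & 1
|             partial = a & (-bit)   # AND a with the broadcast bit
|             acc += partial << k    # column-wise shifted accumulation
|         return acc
|
|     assert bitserial_mul(13, 11) == 143   # sanity check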
| jumploops wrote:
| Curious if anyone is making AccelTran ASICs?
| refibrillator wrote:
| This paper is light on background so I'll offer some additional
| context:
|
| As early as the 90s it was observed that CPU speed (FLOPS) was
| improving faster than memory bandwidth. In 1995 William Wulf
| and Sally McKee predicted this divergence would lead to a
| "memory wall", where most computations would be bottlenecked
| by data access rather than arithmetic operations.
|
| Over the past 20 years peak server hardware FLOPS has been
| scaling at 3x every 2 years, outpacing the growth of DRAM and
| interconnect bandwidth, which have only scaled at 1.6 and 1.4
| times every 2 years, respectively.
|
| Thus for training and inference of LLMs, the performance
| bottleneck is increasingly shifting toward memory bandwidth.
| Particularly for autoregressive Transformer decoder models, it
| can be the _dominant_ bottleneck.
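|
| A quick back-of-the-envelope illustration of both points,
| using the growth rates above plus a hypothetical 7B-parameter
| decoder (the specific numbers below are illustrative
| assumptions, not from the survey):
|
|     # 1) How fast the compute-to-bandwidth gap compounds at the
|     #    quoted rates (3x FLOPS vs 1.6x DRAM BW every 2 years).
|     years = 10
|     gap = (3.0 / 1.6) ** (years / 2)
|     print(f"compute/bandwidth gap after {years} years: {gap:.1f}x")
|     # -> ~23x
|
|     # 2) Why batch-1 autoregressive decoding is bandwidth-bound:
|     #    every generated token streams all the weights from DRAM.
|     params = 7e9              # hypothetical 7B-parameter model
|     bytes_per_param = 2       # fp16 weights
|     hbm_bw = 3.35e12          # ~3.35 TB/s, roughly an H100 SXM
|     tok_s = hbm_bw / (params * bytes_per_param)
|     print(f"decode ceiling: ~{tok_s:.0f} tokens/s")   # ~239
|     # ...no matter how many FLOPS the chip can deliver.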
|
| This is driving the need for new tech like compute-in-memory
| (CIM), also known as processing-in-memory (PIM): hardware in
| which operations are performed directly on the data in
| memory, rather than transferring data to CPU registers first,
| thereby improving latency and power consumption and possibly
| sidestepping the great "memory wall".
|
| Notably, to compare ASIC and FPGA hardware across varying
| semiconductor process sizes, the paper uses a fitted polynomial
| to extrapolate to a common denominator of 16nm:
|
| _> Based on the article by Aaron Stillmaker and B.Baas titled
| "Scaling equations for the accurate prediction of CMOS device
| performance from 180 nm to 7nm," we extrapolated the performance
| and the energy efficiency on a 16nm technology to make a fair
| comparison_
|
| But extrapolation for CIM/PIM is not done because they claim:
|
| _> As the in-memory accelerators the performance is not based
| only on the process technology, the extrapolation is performed
| only on the FPGA and ASIC accelerators where the process
| technology affects significantly the performance of the systems._
|
| Which strikes me as an odd claim at face value, but perhaps
| others here could offer further insight on that decision.
|
| Links below for further reading.
|
| https://arxiv.org/abs/2403.14123
|
| https://en.m.wikipedia.org/wiki/In-memory_processing
|
| http://vcl.ece.ucdavis.edu/pubs/2017.02.VLSIintegration.Tech...
| bilsbie wrote:
| Thanks for the background. Whatever happened to memristors and
| the promise of memory living alongside cpu?
| dewarrn1 wrote:
| That's funny, I had thought that memristors were a solved
| problem based on this talk from a while back (2010!):
| https://www.youtube.com/watch?v=bKGhvKyjgLY, but I gather HP
| never really commercialized the technology. More recently,
| there does seem to be interest in and research on the topic
| for the reasons you and the GP post noted (e.g.,
| https://www.nature.com/articles/s41586-023-05759-5).
| tonetegeatinst wrote:
| I believe Asianometry did a YouTube video on memristors... it
| might be worth watching.
| Lerc wrote:
| Or even an architecture akin to an astonishingly large number
| of RP2050's. It does seem like it would work well for certain
| types of nnet architectures.
|
| I've always been partial to the idea of two parallel surfaces
| with optical links: make a Connection Machine-style hypercube
| where the bits of the ID of every processor indicate its
| location in the hypercube. Place all of the even-parity CPUs
| on one surface and the odd-parity CPUs on the other surface;
| every CPU would have line of sight to its neighbours in the
| hypercube (as well as to the diametrically opposed CPU with
| all the ID bits flipped).
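|
| The parity trick works because hypercube neighbours differ in
| exactly one ID bit, so neighbours always have opposite parity
| and therefore always sit on opposite surfaces. A tiny Python
| check of that property (illustrative only):
|
|     # Every hypercube link connects an even-parity node to an
|     # odd-parity node, i.e. it crosses between the two surfaces.
|     def parity(x):
|         return bin(x).count("1") & 1
|
|     dims = 10                          # 1024 "CPUs"
|     for node in range(1 << dims):
|         for bit in range(dims):
|             neighbour = node ^ (1 << bit)   # flip one ID bit
|             assert parity(node) != parity(neighbour)
|     print("every link crosses between the two surfaces")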
| phh wrote:
| > Or even an architecture akin to an astonishingly large
| number of RP2050's.
|
| Groq and Cerebras are probably that kind of architecture.
| iml7 wrote:
| It came and went in the form of Optane.
| chatmasta wrote:
| I'm a layman on this topic, so I'm definitely about to say
| something wrong. But I recall an intriguing idea about a sort
| of "reversion to analog," whereby we use the full range of
| voltage crossing a resistor. Instead of cutting it in half to
| produce binary (high voltage is 1, low voltage is 0), we
| could treat the voltage as a scalar weight within a network
| of resistors.
|
| Has anyone else heard of this idea or have any insight on it?
| gugagore wrote:
| The question then emerges: how do we program these things?
|
| Sara Acour has some answers, e.g.
| https://scholar.google.com/citations?view_op=view_citation&h...
| nickpsecurity wrote:
| They mostly failed in the market. I have a list of them here:
|
| https://news.ycombinator.com/item?id=41069685
|
| I like the one that's in RAM sticks with an affordable price. I
| could imagine cramming a bunch of them into a 1U board with
| high-speed interconnects. Or just PCI cards full of them.
| _zoltan_ wrote:
| While this might have been true for a while before 2018, since
| then 400GbE Ethernet became the fastest adopted interconnect,
| and _today_ 1.6Tbit interconnects exist. PCIe v4 came and went
| so fast that it lived maybe 2 years.
|
| NVMeOF has been scaling with fabric performance and it's been
| great.
|
| 400GB/s interconnect on the H100 DGX today.
| fsndz wrote:
| True. Samsung's Dr. Jung Bae Lee was also talking about that
| recently.
|
| "rapid growth of AI models is being constrained by a growing
| disparity between compute performance and memory bandwidth.
| While next-generation models like GPT-5 are expected to reach
| an unprecedented scale of 3-5 trillion parameters, the
| technical bottleneck of memory bandwidth is becoming a critical
| obstacle to realizing their full potential."
|
| https://www.lycee.ai/blog/2024-09-04-samsung-memory-bottlene...
| next_xibalba wrote:
| Could an FPGA + ASIC + in-memory hybrid architecture have any
| role to play in scaling/flexibility? Each one has its own
| benefits (e.g., FPGAs for flexibility, ASICs for performance,
| in-memory for energy efficiency), so could a hybrid approach
| integrating all three juice LLM perf even further?
| synergy20 wrote:
| Normally it's FPGA + memory first; when it hits a sweet spot
| in the market with volume, you then turn the FPGA into an
| ASIC for performance and cost savings. Big companies will go
| ASIC directly.
| synergy20 wrote:
| Memory movement is the bottleneck these days, thus the
| expensive HBM. Nvidia's design is also memory-optimized, since
| memory is the true bottleneck both chip-wise and system-wise.
| DrNosferatu wrote:
| Why haven't all GPUs migrated to HBMx?
|
| You seldom see it.
| iml7 wrote:
| expensive
| deepnotderp wrote:
| It's expensive, not needed for most consumer workloads, and,
| ironically, it's actually often _worse for latency_ for many
| access patterns, even though it has much higher bandwidth.
| ska wrote:
| why is that ironic? Many ways to increase bandwidth come at
| the cost of latency...
| koolala wrote:
| I'd love to watch an LLM run in WebGL where everything is
| textures. It would be neat to visually see the difference
| between architectures.
| vanviegen wrote:
| Wouldn't that be just like watching static noise?
| archerx wrote:
| I think some patterns would appear.
| Twirrim wrote:
| Why would you expect patterns to appear?
| iml7 wrote:
| Doesn't Google have a tool that allows you to check the
| activation status of the matrix? Gemma Scope.
| smcleod wrote:
| Is there a "nice" way to read content on Arxiv?
|
| Every time I land on that site I'm so confused / lost in its
| interface (or lack thereof) that I usually end up leaving
| without getting to the content.
| buildbot wrote:
| It's a preprint site, so everything is formatted as PDF by
| default. They recently added HTML:
| https://arxiv.org/html/2409.03384v1 - that's the best way per
| paper. There are also a few arXiv frontends, like
| https://arxiv-sanity-lite.com/
| Noumenon72 wrote:
| Same here -- I visited this link earlier today and thought "Oh,
| it's just an abstract, I'm out". I've read Arxiv papers before
| but the UI just doesn't look like it offers any content.
| johndough wrote:
| Click on "View PDF" or "HTML (experimental)" on the top right
| to get to the content.
| TheMysteryTrain wrote:
| I've always had the same issue as OP - it's never bothered me
| much because I'm rarely in the mood to read something so
| dense.
|
| But I find it quite interesting that I've managed to
| completely miss the big obvious blue buttons every time; I
| just immediately scan down to the first paragraph.
|
| The cynic in me guesses it's because I'm so used to
| extraneous content taking up space that I instinctively skim
| past, but maybe that's too pessimistic & there's another
| UX/psychological reason for it.
| beefnugs wrote:
| It's the idea of the gateway: we click on a link and expect
| it to be what we are looking for. Having some "summary" or
| paywall or anything between us and what we thought we were
| getting is QUICK-CLOSE-THE-STUPID-WEBSITE-as-fast-as-possible
| triggering.
| fulafel wrote:
| Related:
|
| https://arxiv.org/pdf/2406.08413 Memory Is All You Need: An
| Overview of Compute-in-Memory Architectures for Accelerating
| Large Language Model Inference
| smusamashah wrote:
| There was a paper about an LLM running on the same power as a
| light bulb.
|
| https://arxiv.org/abs/2406.02528
|
| https://news.ucsc.edu/2024/06/matmul-free-llm.html
| transpute wrote:
| Claims 90% memory reduction with OSS code for replication on
| standard GPUs, https://github.com/ridgerchu/matmulfreellm
|
| _> ..avoid using matrix multiplication using two main
| techniques. The first is a method to force all the numbers
| within the matrices to be ternary, meaning they can take one of
| three values: negative one, zero, or positive one. This allows
| the computation to be reduced to summing numbers rather than
| multiplying.. Instead of multiplying every single number in one
| matrix with every single number in the other matrix.. the
| matrices are overlaid and only the most important operations
| are performed.. researchers were able to maintain the
| performance of the neural network by introducing time-based
| computation in the training of the model. This enables the
| network to have a "memory" of the important information it
| processes, enhancing performance.
|
| > ... On standard GPUs.. network achieved about 10 times less
| memory consumption and operated about 25 percent faster.. could
| provide a path forward to enabling the algorithms to run at
| full capacity on devices with smaller memory like smartphones..
| Over three weeks, the team created a [FPGA] prototype of their
| hardware.. surpasses human-readable throughput.. on just 13
| watts of power. Using GPUs would require about 700 watts of
| power, meaning that the custom hardware achieved more than 50
| times the efficiency of GPUs._
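|
| A minimal sketch of the ternary-weight idea in plain Python
| (not the authors' code - see the linked repo for the real
| implementation): with weights restricted to {-1, 0, +1}, each
| output is just a sum of some inputs minus a sum of others, so
| no multiplications are needed.
|
|     # Ternary "matmul-free" linear layer: weights in {-1, 0, 1},
|     # so each output is additions/subtractions only.
|     def ternary_matvec(W, x):
|         out = []
|         for row in W:
|             acc = 0.0
|             for w, xi in zip(row, x):
|                 if w == 1:
|                     acc += xi      # +1 weight: add the input
|                 elif w == -1:
|                     acc -= xi      # -1 weight: subtract it
|                                    #  0 weight: skip it entirely
|             out.append(acc)
|         return out
|
|     W = [[1, 0, -1, 1],
|          [-1, 1, 0, 0]]
|     x = [0.5, -2.0, 1.5, 3.0]
|     print(ternary_matvec(W, x))    # -> [2.0, -2.5]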
| mikewarot wrote:
| I've always been partial to systolic arrays. I iterated through a
| bunch of options over the past few decades, and settled upon what
| I think is the optimal solution: a Cartesian grid of cells.
|
| Each cell would have 4 input bits, 1 each from the neighbors, and
| 4 output bits, again, one to each neighbor. In the middle would
| be 64 bits of shift register from a long scan chain, the output
| of which goes to 4 16:1 multiplexers, and 4 bits of latch.
|
| Through the magic of graph coloring, a checkerboard pattern would
| be used to clock all of the cells to allow data to flow in any
| direction without preference, and _without race conditions_. All
| of the inputs to any given cell would be stable.
|
| This allows the flexibility of an FPGA, without the need to worry
| about timing issues or race conditions, glitches, etc. This also
| keeps all the lines short, so everything is local and fast/low
| power.
|
| What it doesn't do is be efficient with gates, nor give the
| fastest path for logic. Every single operation happens
| effectively in parallel. All computation is pipelined.
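|
| A rough Python emulation of one such cell, reading the 64
| config bits as four 16-entry lookup tables (one per output
| bit, addressed by the 4 input bits). This is a sketch based
| on the description above, not a reference implementation:
|
|     # One cell: 4 neighbour input bits address four 16-entry
|     # LUTs (the 64 config bits), producing 4 latched outputs.
|     class Cell:
|         def __init__(self, config):    # 64 bits, scan-chain order
|             assert len(config) == 64
|             self.luts = [config[i*16:(i+1)*16] for i in range(4)]
|             self.outputs = [0, 0, 0, 0]  # latched N, E, S, W bits
|
|         def step(self, n, e, s, w):
|             # Only cells of one checkerboard colour step per
|             # phase, so these four inputs are guaranteed stable.
|             idx = (n << 3) | (e << 2) | (s << 1) | w
|             self.outputs = [lut[idx] for lut in self.luts]
|             return self.outputs
|
|     # Example: north output = XOR of the four inputs.
|     xor_lut = [bin(i).count("1") & 1 for i in range(16)]
|     cell = Cell(xor_lut + [0] * 48)
|     print(cell.step(1, 0, 1, 1))       # -> [1, 0, 0, 0]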
|
| I've had this idea since about 1982... I really wish someone
| would pick it up and run with it. I call it the BitGrid.
| covoeus wrote:
| Sounds similar to the GA144 chip from the inventor of Forth
| mikewarot wrote:
| The GA144 chip is a 12x12 grid of RISC Stack oriented CPUs
| with networking hardware between them. Each can be doing
| something different.
|
| The BitGrid is bit-oriented, not byte- or word-oriented, which
| makes it flexible for doing FP8, FP16, or whatever other type of
| calculation is required. Since everything is clocked at the
| same rate/step, data flows continuously, without need for
| queue/dequeue, etc.
|
| Ideally, a chip could be made with a billion cells. The
| structure is simple and uniform. An emulator exists, and it
| would return exactly the same answer (albeit much more
| slowly) as a billion-cell chip. You could divide up the
| simulation among a network of CPUs to speed it up.
| DrNosferatu wrote:
| The values (namely the FPGAs) should also have been normalized
| by price.
| fsndz wrote:
| This explains the success of Groq's ASIC-powered LPUs. LLM
| inference on Groq Cloud is blazingly fast. Also, the reduction in
| energy consumption is nice.
___________________________________________________________________
(page generated 2024-09-07 23:01 UTC)