[HN Gopher] Matrix-vector multiplication implemented in off-the-...
___________________________________________________________________
Matrix-vector multiplication implemented in off-the-shelf DRAM for
Low-Bit LLMs
Author : cpldcpu
Score : 205 points
Date : 2025-05-04 23:35 UTC (23 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| Bolwin wrote:
| They're doing matrix operations in the DRAM itself? That
| sounds insane and also fascinating.
| summarity wrote:
| Getting LLM inference running on anything is going to be the
| next "it runs Doom".
| iszomer wrote:
| I guess the more contextual nuance would be "...it runs
| Quake".
| im3w1l wrote:
| Well the goal here isn't to _just_ run it. The goal is to run
| it at an attractive price/performance.
| nkurz wrote:
| Yup, and incredibly they are able to do this on standard RAM by
| "intentionally violating the timing parameters":
|
| _Processing-Using-DRAM (PUD) leverages the inherent analog
| operational characteristics of DRAM to enable highly parallel
| bit-serial computations directly within memory arrays. Prior
| research has demonstrated that commercial off-the-shelf DRAM
| can achieve PUD functionality without hardware modifications by
| intentionally violating the timing parameters._
|
| _These studies have established two fundamental PUD
| operations: RowCopy and majority-of-X (MAJX) (Fig. 1). The
| RowCopy operation facilitates data movement between different
| rows within a subarray by issuing a PRE command followed
| immediately by an ACT command before bitline precharging
| completes, enabling data transfer through the bitlines. This
| operation affects all cells along a row simultaneously, making
| it approximately 100 times faster than processor-mediated data
| movement. The MAJX operation performs a majority vote among X
| cells sharing the same bitline that are activated
| simultaneously, implemented in commercial DRAM by issuing ACT,
| PRE, and ACT commands in rapid succession without delays. This
| allows concurrent activation of 2~32 rows. MAJX enables bit-
| serial computations that leverage the parallelism of subarrays
| with 65,536 columns, serving as the fundamental computational
| unit for PUD._
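|
| A minimal Python sketch of what these two primitives do,
| modeled functionally (the toy subarray, function names, and
| constant rows here are illustrative, not from the paper):
|
|     # Each row is a list of bits; a real subarray has 65,536
|     # columns, so one MAJX is 65,536 majority votes at once.
|
|     def row_copy(subarray, src, dst):
|         # Models RowCopy: PRE followed immediately by ACT
|         # moves the source row through the bitlines into the
|         # destination row.
|         subarray[dst] = subarray[src][:]
|
|     def majx(subarray, rows):
|         # Models MAJX: ACT-PRE-ACT in rapid succession
|         # activates several rows on the same bitlines, and
|         # each bitline settles to the per-column majority.
|         x = len(rows)
|         return [int(sum(subarray[r][c] for r in rows) > x // 2)
|                 for c in range(len(subarray[0]))]
|
|     # MAJ3 yields AND/OR when one row holds a constant:
|     #   MAJ3(a, b, 0) == a AND b,  MAJ3(a, b, 1) == a OR b
|     sub = [[1, 0, 1, 1],  # row 0: operand a
|            [1, 1, 0, 1],  # row 1: operand b
|            [0, 0, 0, 0],  # row 2: all zeros
|            [1, 1, 1, 1]]  # row 3: all ones
|     print(majx(sub, [0, 1, 2]))  # AND -> [1, 0, 0, 1]
|     print(majx(sub, [0, 1, 3]))  # OR  -> [1, 1, 1, 1]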
| nayuki wrote:
| This kind of low-level protocol manipulation of DRAM has some
| similarities to rowhammer attacks.
| gwern wrote:
| Can it be used to covertly run computations invisible to
| the OS or CPU?
| nsteel wrote:
| This research requires a custom memory controller that's
| doing "weird" things, the CPU isn't really getting
| involved here. It's very different from row hammer, in my
| opinion. If you have a custom memory controller
| then I think all bets are off.
| wtallis wrote:
| Only to the same extent that any other co-processor add-
| in card can do stuff that's not observable by the CPU.
| Your CPU's RAM is managed by the CPU's memory controller
| hardware, and that memory controller does not give
| software the ability to issue individual DRAM commands
| like precharge. This research uses a memory controller
| implemented on an FPGA, talking to its own pool of RAM.
| elcritch wrote:
| I hope Micron or another commercial player builds a product
| on this!
| tamlin wrote:
| Samsung and SK-Hynix have had specs and papers for a few
| years already for HBM and GDDR. e.g.
|
| * https://www.servethehome.com/sk-hynix-ai-memory-at-hot-chips...
| * https://www.servethehome.com/samsung-processing-in-memory-te...
|
| Not sure anyone has started using it in production.
| nsteel wrote:
| And as mentioned in a comment elsewhere, LPDDR6-PIM is coming
| along too:
| https://wccftech.com/samsung-collaborates-with-sk-hynix-in-p...
|
| We'll see that before anything built around HBM or GDDR.
| robwwilliams wrote:
| This is just mind-bendingly weird and wonderfully creative. It
| can pay to work in the weeds! Bravo.
| userbinator wrote:
| This behaviour has been around since the earliest DRAMs with
| multiplexed row/column addresses. The Mostek MK4096 of 1973
| could probably do this. Only took about half a century for
| someone to figure it out.
| walterbell wrote:
| _> By intentionally issuing DRAM commands that violate
| manufacturer-specified timing parameters.. [gaining] massive
| parallelism up to 65,536 bitwise operations in parallel._
|
| Take that, binary blobs for DRAM training!
| willvarfar wrote:
| Can we expect to see matrix multiplication and perhaps other ops
| move from classic CPUs out into the DRAM, perhaps with deliberate
| hardware support?
|
| And does such a processing shift give advantage to Samsung etc?
| Where does this leave NVIDIA etc?
| imtringued wrote:
| Your questions are kind of amusing since Apple will use
| LPDDR6-PIM on the next generation of iPhones.
|
| https://www.patentlyapple.com/2024/12/apple-plans-to-transit...
| nsteel wrote:
| I don't get it, what's the joke?
| userbinator wrote:
| Did anyone else notice the absolutely insane author lists of
| references 1 and 3?
|
| I was expecting to find this 2016 article in there:
| https://news.ycombinator.com/item?id=12469270
|
| This 2019 one does show up:
| https://news.ycombinator.com/item?id=22712811
|
| Of course, this "out of spec" behaviour of DRAM, more
| specifically the ability to do copying, is also implicated in
| this infamous bug: https://news.ycombinator.com/item?id=5314959
|
| It seems more than one person independently observed such a
| thing, and thought "this might be a useful behaviour".
| s-macke wrote:
| This seems to be a formatting error. For such a huge author
| list, you usually write only the first author's name and then
| "et al." for "and others".
| tomsmeding wrote:
| The 'et al.' is used for in-article citations, if done in
| author-year format; references in the reference list are, to
| the extent that I've seen, always written out in full. I
| guess Google just wanted to make the life of any academic
| citing their work miserable. There are (unfortunately)
| conferences that have page limits that _include_ the
| reference list; I wonder if an exception would be made here.
| esafak wrote:
| They want authors to think twice before citing someone. A
| curious incentive!
| throwaway519 wrote:
| One day, I'm going to credit my entire department, the deli,
| and everyone in the park at 2pm as contributors too.
| swimwiththebeat wrote:
| So is this a new technique for doing computations within
| existing DRAM to overcome the memory wall issue of modern
| computing?
| cpldcpu wrote:
| Some more background information:
|
| One of the original proposals for in-DRAM compute:
| https://users.ece.cmu.edu/~omutlu/pub/in-DRAM-bulk-AND-OR-ie...
|
| First demonstration with off-the-shelf parts:
| https://parallel.princeton.edu/papers/micro19-gao.pdf
|
| DRAM Bender, the tool they are using to implement this:
| https://github.com/CMU-SAFARI/DRAM-Bender
|
| Memory-Centric Computing: Recent Advances in Processing-in-DRAM:
| https://arxiv.org/abs/2412.19275
| xhkkffbf wrote:
| In-DRAM compute goes back a long time. There were plenty of
| papers in the 90s about various ideas for turning a bank of
| DRAM into a SIMD machine. They weren't as clever or as well
| developed as some of these ideas, but these papers are just
| the latest versions of an old idea.
| therealcamino wrote:
| Do any of those techniques use unmodified DRAM or are you
| talking about processor-in-memory approaches?
| dr_zoidberg wrote:
| The abstract of OP's link mentions "Processing-Using-DRAM
| (PUD)" as exactly that: using off-the-shelf components. I do
| wonder how they achieve it; I guess by fiddling with the
| controller in ways that are not standard but get the job
| (processing data in memory) done.
|
| Edit: Oh, and cpldcpu linked the ComputeDRAM paper that
| explains how to do it with off-the-shelf parts.
| jiveturkey wrote:
| That context is very helpful. But you don't need to pooh-pooh
| the ideas as "just another iteration". Everything we have
| today is built on top of decades of prior work. The paper
| itself mentions a lot of prior work.
| morphle wrote:
| A bit unscientific that they don't cite the original Intelligent
| RAM (IRAM) sources from 1997:
|
| https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=iram...
| cpldcpu wrote:
| I also strongly suspect that there are earlier sources.
|
| However, IRAM looks like _compute near memory_, where they
| add an ALU to the memory chip. _Compute in memory_ is about
| using the memory array itself.
|
| To be fair, CIM looked much less appealing before the advent
| of deep learning with its crazy vector lengths. So people
| instead tried to build something that allows more fine-
| grained control of the operations.
| morphle wrote:
| >I also strongly suspect that there are earlier sources.
|
| You are right, I remember 1972-ish papers where they did
| compute in memory. I just couldn't locate links to these
| papers in a few minutes.
| xiphias2 wrote:
| This would be a cool way to make a cheap inference device for
| the largest LLMs.
| protocolture wrote:
| > General matrix-vector multiplication (GeMV)
|
| Ok, so my math isn't great.
|
| When I was studying quaternions during my 3D math class (that
| I failed the first time; like I said, not a math guy) they
| briefly covered the history of matrix calculation in graphics
| development.
|
| My understanding is that quaternions became popular because
| they are _almost_ as accurate as matrices but much less
| complex computationally.
|
| Has anyone tried building an LLM using Quats instead of matrices?
|
| Or are the optimisations with quaternions more useful in
| realtime?
| monocasa wrote:
| My understanding was that the main benefit of quaternions in
| computer graphics was representing rotations in a way that
| doesn't result in gimbal lock.
|
| And beyond that, for those rotations, a quaternion doesn't
| scale nearly as well as you add dimensions. Complex numbers
| are a complex representation of 2-space, quaternions are a
| complex representation of 3-space, and to go to 4-space you
| have octonions, which have eight elements.
| eru wrote:
| Quaternions have four dimensions.
| thomaskoopman wrote:
| Yes, but quaternions of unit length are a representation of
| the rotation group in 3D space
| (https://en.wikipedia.org/wiki/Representation_theory_of_SU(2)...),
| which is how they are used for rotations.
| suspended_state wrote:
| The original question was: can quaternions be used in
| place of matrices to perform LLM tasks? And the answer
| is: quaternions are 4-dimensional, with the implied
| meaning that matrices can cover different
| dimensionalities, which are needed for LLMs (and neural
| networks in general).
| formerly_proven wrote:
| Axis-angle also doesn't have gimbal lock - the main advantage
| of quaternions is that actually performing rotations with
| them only involves addition and multiplication, no
| trigonometry at all. The same is true for using proper
| rotation matrices, but those use a lot more memory. Plus, you
| can actually lerp between quaternions (more generally, they
| compose). That doesn't work with matrices (I think).
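|
| A small Python sketch of that point (illustrative only; the
| helper names are mine): once you have a unit quaternion,
| applying and composing rotations is pure multiply-and-add
| via the Hamilton product, with no trig calls.
|
|     def qmul(p, q):
|         # Hamilton product of quaternions given as (w, x, y, z).
|         pw, px, py, pz = p
|         qw, qx, qy, qz = q
|         return (pw*qw - px*qx - py*qy - pz*qz,
|                 pw*qx + px*qw + py*qz - pz*qy,
|                 pw*qy - px*qz + py*qw + pz*qx,
|                 pw*qz + px*qy - py*qx + pz*qw)
|
|     def rotate(q, v):
|         # Rotate 3-vector v by unit quaternion q: q * v * conj(q).
|         qc = (q[0], -q[1], -q[2], -q[3])  # conjugate = inverse
|         return qmul(qmul(q, (0.0,) + v), qc)[1:]
|
|     s = 0.5 ** 0.5                     # 90 degrees about z
|     q = (s, 0.0, 0.0, s)
|     print(rotate(q, (1.0, 0.0, 0.0)))  # ~(0, 1, 0)
|     print(qmul(q, q))                  # two 90s compose to 180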
| monocasa wrote:
| Axis-angle has gimbal lock when composed.
| thomaskoopman wrote:
| A matrix is a representation of a linear function (i.e. a
| function that plays nice with + and scalar multiplication). A
| specific subset can be used to describe rotations in 3D space.
| Quaternions can (arguably) do this better. But quaternions
| cannot be used to describe any linear function. So I do not
| think this makes sense for LLMs.
| tzs wrote:
| > But quaternions cannot be used to describe any linear
| function
|
| Does this mean all functions that can be described by
| quaternions are non-linear, or does it mean that quaternions
| can describe some linear functions, such as the ones
| associated with rotations in 3D space, but there are linear
| functions they cannot describe?
| thomaskoopman wrote:
| Quaternions (when viewed as vectors) are not linear
| functions, but the arguments to linear functions. You can
| add them: (a + bi + cj + dk) + (a' + b'i + c'j + d'k) = (a
| + a') + (b + b')i + (c + c')j + (d + d')k, and multiply
| them by a scalar: lambda * (a + bi + cj + dk) = (lambda * a)
| + (lambda * b)i + (lambda * c)j + (lambda * d)k. An example
| of a linear function on quaternions is the zero function.
| After all, zero(q + q') = 0 = 0 + 0 = zero(q) + zero(q'),
| and zero(lambda * q) = 0 = lambda * 0 = lambda * zero(q).
|
| Matrices and quaternions take different approaches to
| describing rotations: a matrix sees a rotation as a linear
| function, and quaternions see rotations as a group
| (confusingly represented with matrices; this field is
| called representation theory, if you want to know more).
|
| So the answer to your question: there are linear functions
| that quaternions cannot describe. And quaternions can only
| describe a very specific class of linear functions (with
| some rather complicated maths behind them).
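|
| To make the "represented with matrices" remark concrete,
| here is an illustrative numpy sketch (the helper name is
| mine): left-multiplication by a fixed quaternion q is a
| linear map on R^4, so it has a 4x4 matrix, but a generic
| 4x4 matrix is not of this form. That is the sense in which
| quaternions only reach a very specific class of linear
| functions.
|
|     import numpy as np
|
|     def left_mul_matrix(q):
|         # 4x4 matrix of the linear map r -> q * r, writing
|         # quaternions as (a, b, c, d) = a + bi + cj + dk.
|         a, b, c, d = q
|         return np.array([[a, -b, -c, -d],
|                          [b,  a, -d,  c],
|                          [c,  d,  a, -b],
|                          [d, -c,  b,  a]])
|
|     q = np.array([1.0, 2.0, 3.0, 4.0])
|     r = np.array([5.0, 6.0, 7.0, 8.0])
|     print(left_mul_matrix(q) @ r)  # the Hamilton product q * r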
| eru wrote:
| Quaternions only have four fixed dimensions. For neural
| networks you need many, many more dimensions.
| benob wrote:
| I think you are mixing things up. Quaternions are in the same
| category as complex numbers. They can be represented as
| matrices, and there are probably nice uses of matrices whose
| elements are quaternions instead of real numbers (such as
| QDNNs). My experience is that in massive architectures such as
| LLMs, simpler forms are more successful unless there is a true
| benefit to representing things with more elaborate types of
| scalars (such as in physics, or 3d graphics).
| chasd00 wrote:
| In the hardware world, are there risks to taking advantage of
| a bug, knowing that the manufacturer may someday fix it? I
| know in the software world it's a bad idea to leverage a bug
| in a platform to enable a feature (or fix another bug). The
| bug you're counting on being present may get fixed 15 years
| in the future, and then your system explodes and no one knows
| why.
|
| edit: seems like there was a recent discussion about something
| similar... undefined behavior in some C function iirc
| vlovich123 wrote:
| Undefined behavior in C/C++ has been discussed for a very,
| very long time. I'd say its impact, when combined with
| optimizing compilers, first came to broader public awareness
| around 2010 or so (maybe 2013?), which is now about 12+ years
| ago.
|
| As for this paper, it's not about relying on a bug but rather
| presenting what might be possible with DRAM in the hopes of
| standardizing capabilities.
| alexpotato wrote:
| This pops up in low-latency HFT, specifically with networking
| cards.
|
| Certain network cards have either a bug or a combination of
| features that interact in an interesting way to the benefit
| of the trading firm.
|
| These bugs (and features too) sometimes get removed, either
| because the manufacturer fixes the bug or because the
| features are seen as not needed by the larger market.
| Therefore, firms will sometimes attempt to buy up all
| available supply of certain models.
___________________________________________________________________
(page generated 2025-05-05 23:01 UTC)