[HN Gopher] Matrix-vector multiplication implemented in off-the-...
       ___________________________________________________________________
        
       Matrix-vector multiplication implemented in off-the-shelf DRAM for
       Low-Bit LLMs
        
       Author : cpldcpu
       Score  : 205 points
       Date   : 2025-05-04 23:35 UTC (23 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | Bolwin wrote:
        | They're doing matrix operations in the DRAM itself? That sounds
        | insane and also fascinating.
        
         | summarity wrote:
          | Getting LLM inference running on anything is going to be the
         | next "it runs Doom"
        
           | iszomer wrote:
           | I guess the more contextual nuance would be "..it runs
           | Quake".
        
           | im3w1l wrote:
           | Well the goal here isn't to _just_ run it. The goal is to run
            | it at an attractive price/performance.
        
         | nkurz wrote:
         | Yup, and incredibly they are able to do this on standard RAM by
         | "intentionally violating the timing parameters":
         | 
         |  _Processing-Using-DRAM (PUD) leverages the inherent analog
         | operational characteristics of DRAM to enable highly parallel
         | bit-serial computations directly within memory arrays. Prior
         | research has demonstrated that commercial off-the-shelf DRAM
         | can achieve PUD functionality without hardware modifications by
         | intentionally violating the timing parameters._
         | 
         |  _These studies have established two fundamental PUD
         | operations: RowCopy and majority-of-X (MAJX) (Fig. 1). The
         | RowCopy operation facilitates data movement between different
         | rows within a subarray by issuing a PRE command followed
         | immediately by an ACT command before bitline precharging
         | completes, enabling data transfer through the bitlines. This
         | operation affects all cells along a row simultaneously, making
         | it approximately 100 times faster than processor-mediated data
         | movement. The MAJX operation performs a majority vote among X
         | cells sharing the same bitline that are activated
         | simultaneously, implemented in commercial DRAM by issuing ACT,
         | PRE, and ACT commands in rapid succession without delays. This
         | allows concurrent activation of 2~32 rows. MAJX enables bit-
         | serial computations that leverage the parallelism of subarrays
         | with 65,536 columns, serving as the fundamental computational
         | unit for PUD._
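          | 
          | A toy way to see what MAJX buys you (my own sketch in plain
          | Python/numpy, not the paper's code or the DRAM Bender API):
          | treat each row as 65,536 bits and MAJ3 as a per-column
          | majority vote. Combining two operand rows with a constant
          | all-0 or all-1 row turns MAJ3 into a bitwise AND or OR across
          | the whole subarray at once, which is the building block for
          | the bit-serial arithmetic:
          | 
          |     import numpy as np
          | 
          |     COLS = 65_536  # columns per subarray, as quoted above
          | 
          |     def maj3(r1, r2, r3):
          |         # Logical model of MAJ3; on real DRAM this is
          |         # triggered by ACT, PRE, ACT issued back to back
          |         # (RowCopy is similarly PRE then ACT before the
          |         # precharge completes).
          |         s = r1.astype(int) + r2 + r3
          |         return (s >= 2).astype(np.uint8)
          | 
          |     rng = np.random.default_rng(0)
          |     a = rng.integers(0, 2, COLS, dtype=np.uint8)
          |     b = rng.integers(0, 2, COLS, dtype=np.uint8)
          |     zeros = np.zeros(COLS, dtype=np.uint8)  # all-0 row
          |     ones = np.ones(COLS, dtype=np.uint8)    # all-1 row
          | 
          |     # MAJ3 with a constant row is a 65,536-wide AND / OR:
          |     assert np.array_equal(maj3(a, b, zeros), a & b)
          |     assert np.array_equal(maj3(a, b, ones), a | b)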
        
           | nayuki wrote:
           | This kind of low-level protocol manipulation of DRAM has some
           | similarities to rowhammer attacks.
        
             | gwern wrote:
             | Can it be used to covertly run computations invisible to
             | the OS or CPU?
        
               | nsteel wrote:
               | This research requires a custom memory controller that's
               | doing "weird" things, the CPU isn't really getting
               | involved here. It's very different compared to row hammer
               | in my opinion. If you have a custom memory controller
               | then I think all bets are off.
        
               | wtallis wrote:
               | Only to the same extent that any other co-processor add-
               | in card can do stuff that's not observable by the CPU.
               | Your CPU's RAM is managed by the CPU's memory controller
               | hardware, and that memory controller does not give
               | software the ability to issue individual DRAM commands
               | like precharge. This research uses a memory controller
                | implemented on an FPGA, talking to its own pool of RAM.
        
           | elcritch wrote:
           | I hope Micron or another commercial player builds a product
           | on this!
        
             | tamlin wrote:
             | Samsung and SK-Hynix have had specs and papers for a few
             | years already for HBM and GDDR. e.g.
             | 
              | * https://www.servethehome.com/sk-hynix-ai-memory-at-hot-chips...
              | 
              | * https://www.servethehome.com/samsung-processing-in-memory-te...
             | 
             | Not sure anyone has started using it in production.
        
               | nsteel wrote:
               | And as mentioned in a comment elsewhere, LPDDR6-PIM is
                | coming along too: https://wccftech.com/samsung-
               | collaborates-with-sk-hynix-in-p...
               | 
               | We'll see that before anything built around HBM or GDDR.
        
       | robwwilliams wrote:
       | This is just mind-bendingly weird and wonderfully creative. It
       | can pay to work in the weeds! Bravo.
        
         | userbinator wrote:
         | This behaviour has been around since the earliest DRAMs with
         | multiplexed row/column addresses. The Mostek MK4096 of 1973
         | could probably do this. Only took about half a century for
         | someone to figure it out.
        
       | walterbell wrote:
       | _> By intentionally issuing DRAM commands that violate
       | manufacturer-specified timing parameters.. [gaining] massive
       | parallelism up to 65,536 bitwise operations in parallel._
       | 
       | Take that, binary blobs for DRAM training!
        
       | willvarfar wrote:
       | Can we expect to see matrix multiplication and perhaps other ops
       | move from classic CPUs out into the DRAM, perhaps with deliberate
       | hardware support?
       | 
       | And does such a processing shift give advantage to Samsung etc?
       | Where does this leave NVIDIA etc?
        
         | imtringued wrote:
         | Your questions are kind of amusing since Apple will use
         | LPDDR6-PIM on the next generation of iPhones.
         | 
         | https://www.patentlyapple.com/2024/12/apple-plans-to-transit...
        
           | nsteel wrote:
           | I don't get it, what's the joke?
        
       | userbinator wrote:
       | Did anyone else notice the absolutely insane author lists of
       | references 1 and 3?
       | 
       | I was expecting to find this 2016 article in there:
       | https://news.ycombinator.com/item?id=12469270
       | 
       | This 2019 one does show up:
       | https://news.ycombinator.com/item?id=22712811
       | 
       | Of course, this "out of spec" behaviour of DRAM, more
       | specifically the ability to do copying, is also implicated in
       | this infamous bug: https://news.ycombinator.com/item?id=5314959
       | 
       | It seems more than one person independently observed such a
       | thing, and thought "this might be a useful behaviour".
        
         | s-macke wrote:
          | This seems to be a formatting error. For such a huge author
          | list, you usually write only the first author's name and then
          | "et al." for "and others".
        
           | tomsmeding wrote:
           | The 'et al.' is used for in-article citations, if done in
           | author-year format; references in the reference list are, to
           | the extent that I've seen, always written out in full. I
           | guess Google just wanted to make the life of any academic
           | citing their work miserable. There are (unfortunately)
           | conferences that have page limits that _include_ the
           | reference list; I wonder if an exception would be made here.
        
             | esafak wrote:
             | They want authors to think twice before citing someone. A
             | curious incentive!
        
         | throwaway519 wrote:
         | One day, I'm going to credit my entire department, deli and
         | everyone in the park at 2pm as contributors too.
        
       | swimwiththebeat wrote:
       | So is this a new technique of doing computations within existing
       | DRAM to overcome the memory wall issue of modern computing?
        
       | cpldcpu wrote:
       | Some more background information:
       | 
       | One of the original proposals for in-DRAM compute:
       | https://users.ece.cmu.edu/~omutlu/pub/in-DRAM-bulk-AND-OR-ie...
       | 
       | First demonstration with off-the-shelf parts:
       | https://parallel.princeton.edu/papers/micro19-gao.pdf
       | 
       | DRAM Bender, the tool they are using to implement this:
       | https://github.com/CMU-SAFARI/DRAM-Bender
       | 
        | Memory-Centric Computing: Recent Advances in Processing-in-
        | DRAM: https://arxiv.org/abs/2412.19275
        
         | xhkkffbf wrote:
          | In-DRAM compute goes back a long time. There were plenty of
          | papers in the 90s about various ideas for turning a bank of
          | DRAM into a SIMD machine. They weren't as clever or as well
          | developed as some of these newer ideas, but these papers are
          | just the latest versions of an old idea.
        
           | therealcamino wrote:
           | Do any of those techniques use unmodified DRAM or are you
           | talking about processor-in-memory approaches?
        
             | dr_zoidberg wrote:
              | The abstract of OP's link mentions "Processing-Using-DRAM
              | (PUD)" as exactly that, using off-the-shelf components. I
              | do wonder how they achieve that; I guess by fiddling with
              | the controller in ways that are not standard but get the
              | job (processing data in memory) done.
              | 
              | Edit: Oh, and cpldcpu linked the ComputeDRAM paper that
              | explains how to do it with off-the-shelf parts.
        
           | jiveturkey wrote:
            | That context is very helpful. But you don't need to pooh-pooh
            | the ideas as "just another iteration". Everything we have
           | today is built on top of decades of prior work. The paper
           | itself mentions a lot of prior work.
        
       | morphle wrote:
       | A bit unscientific that they don't cite the original Intelligent
       | RAM (IRAM) sources from 1997:
       | 
       | https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=iram...
        
         | cpldcpu wrote:
         | I also strongly suspect that there are earlier sources.
         | 
          | However, IRAM looks like _compute near memory_, where they add
          | an ALU to the memory chip. _Compute in memory_ is about using
          | the memory array itself.
          | 
          | To be fair, CIM looked much less appealing before the advent of
          | deep learning with its crazy vector lengths. So people instead
          | tried to build something that allows more fine-grained control
          | of the operations.
        
           | morphle wrote:
           | >I also strongly suspect that there are earlier sources.
           | 
           | You are right, I remember 1972-ish papers where they did
           | compute in memory. I just couldn't locate links to these
           | papers in a few minutes.
        
       | xiphias2 wrote:
        | This would be a cool way to make a cheap inferencing device for
        | the largest LLMs.
        
       | protocolture wrote:
       | >General matrix-vector multiplication (GeMV)
       | 
        | Ok, so my math isn't great.
        | 
        | When I was studying Quaternions during my 3d math class (that I
        | failed the first time; like I said, not a math guy) they briefly
       | covered the history of matrix calculation in graphics
       | development.
       | 
       | My understanding is that Quaternions became popular because they
       | are _almost_ as accurate as matrices but much less complex
       | computationally.
       | 
       | Has anyone tried building an LLM using Quats instead of matrices?
       | 
       | Or are the optimisations with Quaternions more useful in
       | realtime?
        
         | monocasa wrote:
         | My understanding was that the main benefit of quaternions in
         | computer graphics was representing rotations in a way that
          | doesn't result in gimbal lock.
         | 
         | And beyond that, for those rotations, a quaternion doesn't
         | scale nearly as well as you add dimensions. Complex numbers are
         | a complex representation of two space, quaternions are a
         | complex representation of three space, and to go to four space
         | you have octonions, which have eight elements.
        
           | eru wrote:
           | Quaternions have four dimensions.
        
             | thomaskoopman wrote:
             | Yes, but quaternions of unit length are a representation of
             | the rotation group in 3D space ( https://en.wikipedia.org/w
             | iki/Representation_theory_of_SU(2)... ), which is how they
             | are used for rotations.
        
               | suspended_state wrote:
                | The original question was: can quaternions be used in
                | place of matrices to perform LLM tasks? And the answer
                | is: quaternions are 4-dimensional, with the implied
                | meaning that matrices can cover different
                | dimensionalities, which are needed for LLMs (and neural
                | networks in general).
        
           | formerly_proven wrote:
           | Axis-angle also doesn't have gimbal lock - the main advantage
           | of quaternions is that actually performing rotations with
           | them only involves addition and multiplication, no
           | trigonometry at all. The same is true for using proper
           | rotation matrices, but those use a lot more memory. Plus, you
           | can actually lerp between quaternions (more generally - they
            | compose). That doesn't work with matrices (I think).
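            | 
            | A quick check of the "only addition and multiplication"
            | point (my own sketch, not from any particular library):
            | rotating a vector is the conjugation q v q*, built from
            | nothing but multiplies and adds, and it agrees with the
            | corresponding rotation matrix:
            | 
            |     import numpy as np
            | 
            |     def qmul(p, q):
            |         # Hamilton product, components (w, x, y, z).
            |         pw, px, py, pz = p
            |         qw, qx, qy, qz = q
            |         return np.array([
            |             pw*qw - px*qx - py*qy - pz*qz,
            |             pw*qx + px*qw + py*qz - pz*qy,
            |             pw*qy - px*qz + py*qw + pz*qx,
            |             pw*qz + px*qy - py*qx + pz*qw])
            | 
            |     def rotate(q, v):
            |         # q v q* -- multiplies and adds only.
            |         qc = q * np.array([1, -1, -1, -1])
            |         return qmul(qmul(q, np.r_[0.0, v]), qc)[1:]
            | 
            |     t = 0.3  # angle in radians, about the z axis
            |     q = np.array([np.cos(t/2), 0, 0, np.sin(t/2)])
            |     R = np.array([[np.cos(t), -np.sin(t), 0],
            |                   [np.sin(t),  np.cos(t), 0],
            |                   [0,          0,         1]])
            |     v = np.array([1.0, 2.0, 3.0])
            |     assert np.allclose(rotate(q, v), R @ v)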
        
             | monocasa wrote:
              | Axis-angle has gimbal lock when composed.
        
         | thomaskoopman wrote:
          | A matrix is a representation of a linear function (i.e. a
          | function that plays nice with + and scalar multiplication). A
         | specific subset can be used to describe rotations in 3D space.
         | Quaternions can (arguably) do this better. But quaternions
         | cannot be used to describe any linear function. So I do not
         | think this makes sense for LLMs.
        
           | tzs wrote:
           | > But quaternions cannot be used to describe any linear
           | function
           | 
           | Does this mean all functions that can be described by
           | quaternions are non-linear, or does it mean that quaternions
           | can describe some linear functions such as the ones
           | associated with rotations in 3D space but there are linear
           | function they cannot describe?
        
             | thomaskoopman wrote:
             | Quaternions (when viewed as vectors) are not linear
             | functions, but the arguments to linear functions. You can
             | add them: (a + bi + cj + dk) + (a' + b'i + c'j + d'k) = (a
             | + a') + (b + b')i + (c + c')j + (d + d')k, and multiply
              | them by a scalar: lambda * (a + bi + cj + dk) = (lambda * a)
             | + (lambda * b)i + (lambda * c)j + (lambda * d)k. An example
             | of a linear function on quaternions is the zero function.
             | After all, zero(q + q') = 0 = 0 + 0 = zero(q) + zero(q'),
             | and zero(lambda * q) = 0 = lambda * 0 = lambda * zero(q).
             | 
             | Matrices and quaternions take different approaches to
             | describing rotations: a matrix sees a rotation as a linear
             | function, and quaternions see rotations as a group
             | (confusingly represented with matrices, this field is
             | called representation theory if you want to know more).
             | 
             | So the answer to your question: there are linear functions
             | that quaternions cannot describe. And quaternions can only
             | describe a very specific class of linear functions (with
             | some rather complicated maths behind them).
        
         | eru wrote:
         | Quaternions only have four fixed dimensions. For neural
         | networks you need many, many more dimensions.
        
         | benob wrote:
          | I think you are mixing things up. Quaternions are in the same
          | category as complex numbers. They can be represented as
          | matrices, and there are probably nice uses of matrices whose
          | elements are quaternions instead of real numbers (such as
          | QDNNs). My experience is that in massive architectures such as
         | LLMs, simpler forms are more successful unless there is a true
         | benefit to representing things with more elaborate types of
         | scalars (such as in physics, or 3d graphics).
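          | 
          | A small illustration of "can be represented as matrices" (my
          | own sketch): left-multiplication by a + bi + cj + dk is a 4x4
          | real matrix, and the map q -> L(q) preserves products, so the
          | quaternion algebra embeds into matrix algebra:
          | 
          |     import numpy as np
          | 
          |     def left_mult_matrix(q):
          |         a, b, c, d = q
          |         return np.array([[a, -b, -c, -d],
          |                          [b,  a, -d,  c],
          |                          [c,  d,  a, -b],
          |                          [d, -c,  b,  a]])
          | 
          |     def qmul(p, q):
          |         # Hamilton product via the matrix form.
          |         return left_mult_matrix(p) @ np.asarray(q, float)
          | 
          |     p = np.array([1.0, 2.0, 3.0, 4.0])
          |     q = np.array([0.5, -1.0, 0.25, 2.0])
          |     lhs = left_mult_matrix(p) @ left_mult_matrix(q)
          |     rhs = left_mult_matrix(qmul(p, q))
          |     assert np.allclose(lhs, rhs)  # L(p) L(q) == L(p q)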
        
       | chasd00 wrote:
       | In the hardware world are there risks of taking advantage of a
       | bug knowing that the manufacturer may someday fix the bug? I know
       | in the software world it's a bad idea to leverage a bug in a
       | platform to enable a feature (or fix another bug). The bug you're
       | counting on being present may get fixed 15 years in the future
       | and then your system explodes and no one knows why.
       | 
       | edit: seems like there was a recent discussion about something
       | similar... undefined behavior in some C function iirc
        
         | vlovich123 wrote:
         | Undefined behavior in C/C++ has been discussed for a very very
         | long time. I'd say the impact of it when combined with
         | optimizing compilers first came to broader public awareness
          | around the 2010-ish time frame (maybe 2013?), which is now
          | about 12+ years ago.
         | 
         | As for this paper, it's not about relying on a bug but rather
         | presenting what might be possible with DRAM in the hopes of
         | standardizing capabilities.
        
         | alexpotato wrote:
          | This pops up in low-latency HFT, specifically with networking
         | cards.
         | 
          | Certain network cards have either a bug or a combination of
          | features that works in an interesting way to the benefit of
          | the trading firm.
          | 
          | These bugs (and features too) sometimes get removed, either
          | because the vendor fixes the bug or because those features
          | are seen as not needed by the larger market, etc. Therefore,
          | firms will sometimes attempt to buy up all available supply
          | of certain models.
        
       ___________________________________________________________________
       (page generated 2025-05-05 23:01 UTC)