[HN Gopher] SeedLM: Compressing LLM Weights into Seeds of Pseudo...
       ___________________________________________________________________
        
       SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random
       Generators
        
       Author : pizza
       Score  : 147 points
       Date   : 2025-04-06 08:53 UTC (14 hours ago)
        
 (HTM) web link (machinelearning.apple.com)
 (TXT) w3m dump (machinelearning.apple.com)
        
       | elashri wrote:
       | I think it would be better to just link directly to the paper
       | [1]. It is a work by researchers at Apple and Meta.
       | 
       | [1] https://arxiv.org/abs/2410.10714
        
       | anshumankmr wrote:
        | All this and they can't launch Apple Intelligence on schedule :(
        
         | timschmidt wrote:
         | Honestly, this seems like enabling work. Even the iPhone 16 Pro
         | Max only seems to have 8GB RAM. If Apple Intelligence plans to
         | do anything useful on-device, they need work like this.
        
           | manmal wrote:
            | Personally, I'm holding off on purchasing a new iPhone for
            | this reason, even though, as a full-time iOS dev, I find my
            | 13 Pro getting long in the tooth. The coming generation is
            | rumored to have more and better memory (LPDDR5X?) and better
            | cooling (a vapor chamber).
        
         | brookst wrote:
         | Research vs productization. Very different.
        
         | echelon wrote:
         | This unblocks future product work. You have to lay the
         | groundwork.
         | 
          | We need advancements like this if we want on-device AI to work
          | well. It's especially the kind of thing Apple Silicon needs,
          | since it's weak relative to Nvidia consumer chips.
        
       | gblargg wrote:
        | It sounds like they basically find the part of a pseudo-random
        | sequence that is closest to the desired data, then store the
        | random seed and corrections (which are small, so they take less
        | space).
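        | 
        | Roughly something like this, I think (a toy sketch of my reading,
        | with made-up seed counts and residual bits, not the paper's exact
        | scheme):
        | 
        |   import numpy as np
        | 
        |   def encode_block(block, n_seeds=4096, resid_bits=3):
        |       # Try candidate seeds; keep the one whose pseudo-random
        |       # sequence (after a least-squares scale) is closest to
        |       # the block.
        |       best = None
        |       for seed in range(n_seeds):
        |           rng = np.random.default_rng(seed)
        |           basis = rng.standard_normal(block.size)
        |           scale = basis @ block / (basis @ basis)
        |           resid = block - scale * basis
        |           err = np.sum(resid ** 2)
        |           if best is None or err < best[0]:
        |               best = (err, seed, scale, resid)
        |       _, seed, scale, resid = best
        |       # Store only the seed, the scale, and coarsely quantized
        |       # corrections.
        |       step = np.abs(resid).max() / 2 ** (resid_bits - 1) + 1e-12
        |       q_resid = np.round(resid / step).astype(np.int8)
        |       return seed, scale, step, q_resid
        | 
        |   def decode_block(seed, scale, step, q_resid):
        |       # Regenerate the pseudo-random sequence from the seed and
        |       # add back the stored corrections.
        |       rng = np.random.default_rng(seed)
        |       out = scale * rng.standard_normal(q_resid.size)
        |       return out + q_resid * step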
        
         | joerick wrote:
         | Pretty fascinating from an information theory point of view.
         | Surprising that it works at all. Is this, like, the JPEG of
         | uniformly distributed, uncorrelated data?
        
           | barotalomey wrote:
           | You might find The Library of Babel fascinating [1, 2]
           | 
           | 1: https://libraryofbabel.info/
           | 
           | 2: https://news.ycombinator.com/item?id=9480949
        
           | samus wrote:
           | We don't know. They basically look for sequences that
           | approximate NN weights well, in the same way sinusoidal
           | functions work well with "natural" images, but not with
           | graphics with hard edges.
        
       | visarga wrote:
       | Very interesting trick, using a dictionary of basis vectors which
       | are quickly computed from a seed without storage. But the result
       | is the same 3 or 4 bit quantization, with only a slight
       | improvement. Their tiles are small, just 8 or 12 weights, it's
       | why compression doesn't go too far. It would have been great if
       | this trick lowered quantization <1 bit/weight, that would require
       | longer tiles. Wondering what are the limits if we use a larger
       | reservoir of cheap entropy as part of neural net architecture,
       | even in training.
       | 
       | Congrats to Apple and Meta, makes sense they did the research,
       | this will go towards efficient serving of LLMs on phones. And
       | it's very easy to implement.
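        | 
        | For a rough sense of the bit budget (my own made-up numbers, not
        | exactly the paper's):
        | 
        |   # Per-tile storage: one PRNG seed plus a few low-bit
        |   # coefficients (all values below are assumptions).
        |   tile_len = 8       # weights per tile
        |   seed_bits = 16     # bits to store the seed
        |   n_coeffs = 3       # basis vectors per tile
        |   coeff_bits = 4     # bits per quantized coefficient
        | 
        |   bits_per_tile = seed_bits + n_coeffs * coeff_bits
        |   print(bits_per_tile / tile_len)  # 3.5 bits/weight here
        | 
        | With tiles this short, the per-tile seed overhead alone keeps you
        | in the 3-4 bit range; getting below 1 bit/weight really would
        | need much longer tiles.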
        
         | kingsleyopara wrote:
         | I was about to post something similar. While the research is
         | interesting, it doesn't offer any advantages over 3- or 4-bit
         | quantization. I also have to assume they explored using longer
         | tiles but found it to be ineffective -- which would make sense
         | to me from an information theory perspective.
        
           | timschmidt wrote:
           | > it doesn't offer any advantages over 3- or 4-bit
           | quantization.
           | 
           | "zero-shot accuracy retention at 4- and 3-bit compression to
           | be on par with or better than state-of-the-art methods, while
           | maintaining performance comparable to FP16 baselines."
           | 
           | My reading of that says FP16 accuracy at Q3 or Q4 size /
           | memory bandwidth. Which is a huge advantage.
        
             | kingsleyopara wrote:
             | For zero-shot accuracy from Table 3:
             | 
             | * LLaMA 3 8B: baseline 72.26, 4-bit 71.31, 3-bit 62.79
             | 
             | * LLaMA 3 70B: baseline 79.51, 4-bit 78.06, 3-bit 74.68
             | 
             | These results seem comparable to modern quantization
             | methods--for example, the ~4-bit results for smaller LLaMA
             | models listed here: https://ai.meta.com/blog/meta-llama-
             | quantized-lightweight-mo...
        
               | timschmidt wrote:
                | I don't see any comparable numbers on the page you
                | linked. It seems to only have numbers for 1B and 3B
                | parameter models. Comparisons to AWQ and OmniQuant in
                | Table 3 seem quite favorable, with SeedLM showing 10%-50%
                | better performance.
               | 
               | Also seems like the techniques may be possible to
               | combine.
        
               | _0ffh wrote:
                | As a rule of thumb, the bigger the model is, the more
                | gracefully it degrades under quantisation. So you may
                | assume the performance loss for an 8B model would be
                | lower than for a 3B model. (I know that doesn't make up
                | for the missing numbers in the link, just FYI.)
        
           | jsenn wrote:
           | I think the main advantage is that you can compute the extra
           | parameters (the PRNG seeds) from the network weights alone,
           | whereas most other quantization methods require simulating
           | the quantization procedure at training time (Quantization-
           | Aware Training) or setting them from a calibration dataset
           | (Post-Training Quantization)
        
           | hedgehog wrote:
           | This technique has three significant advantages over popular
           | low bit quantization: 1) it retains more accuracy, 2) it does
           | not require calibration data, 3) it's easier to implement in
           | hardware.
        
         | samus wrote:
          | It should definitely be worth it, because you can reuse
          | databases of sequence-to-seed mappings for _all_ future models.
        
       | torginus wrote:
        | This sounds like compression with extra steps... What makes this
        | technique particular to LLM weights rather than general-purpose
        | data?
        
         | pornel wrote:
         | Weights in neural networks don't always need to be precise. Not
         | all weights are equally useful to the network. There seems to
         | be a lot of redundancy that can be replaced with
         | approximations.
         | 
          | This technique seems a bit similar to lossy image compression
          | that replaces exact pixels with a combination of pre-defined
          | patterns (the DCT in JPEG), except here the patterns come not
          | from a cosine function but from a pseudo-random one.
          | 
          | It may also beat simple quantization just by adding noise that
          | acts as dithering and breaks up the bands created by
          | combinations of quantized numbers.
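          | 
          | A crude way to see the dithering effect (a toy example, nothing
          | to do with the paper's actual method):
          | 
          |   import numpy as np
          | 
          |   rng = np.random.default_rng(0)
          |   x = np.sin(np.linspace(0, 3, 2000))   # smooth "signal"
          |   step = 0.25                           # coarse quantizer
          | 
          |   plain = np.round(x / step) * step
          |   d = rng.uniform(-step / 2, step / 2, x.size)
          |   dithered = np.round((x + d) / step) * step - d
          | 
          |   # plain collapses onto a handful of bands; the dithered
          |   # version spreads values between the bands while the average
          |   # error stays about the same.
          |   print(len(np.unique(plain)), len(np.unique(dithered)))
          |   print(np.abs(plain - x).mean(), np.abs(dithered - x).mean())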
        
         | jsenn wrote:
         | > What makes this technique particular to LLM weights
         | 
         | This is my understanding as a non-expert.
         | 
         | LLM activations tend to be relatively sparse with large
         | outliers. With linear quantization, this means you either have
         | to clip off the outliers or you have to stretch your range to
         | include the outliers, which wastes precious bits. Neither of
         | these works well, so essentially all LLM quantization research
         | is using various heuristics to get around these outliers. For
         | example, you can do linear quantization but split the
         | activations up into smaller blocks to make it less likely that
         | any given block contains an outlier.
         | 
         | Another trick people have discovered (predates LLMs) is
         | applying a random rotation/projection to the embeddings. This
         | has the effect of making sure no one dimension in the vector
         | dominates the others (which again hurts quantization). This
         | works because in order for a single dimension to dominate, all
         | the others have to "conspire" to be near zero. When you have
         | 10,000+ dimensions, that's very unlikely.
         | 
         | This paper applies the latter trick. Instead of pre-generating
         | the random projection matrices, they generate them on the fly
         | on the accelerator from a seed that is fixed for each block.
         | The seed is chosen from an offline brute-force search that
         | needs only the weights of the network. This separates it from a
         | lot of other quantization methods that either require
         | calibration data or have to be simulated at training time so
         | the network learns the quantization parameters itself.
         | 
         | You might think this is wasteful/might hurt performance, but it
         | turns out that LLM inference is heavily memory-bound as it
         | involves streaming a very large neural network into the
         | accelerator (GPU/TPU/NPU/whatever) to operate on a relatively
         | small amount of data, so there are lots of "free cycles" to
         | generate these random numbers. Of course, if you care about
         | power usage that might not be a great idea...
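          | 
          | A toy illustration of the random-rotation point (not the
          | paper's actual transform, just the general idea):
          | 
          |   import numpy as np
          | 
          |   rng = np.random.default_rng(42)   # fixed, regenerable seed
          |   d = 1024
          |   x = rng.standard_normal(d)
          |   x[7] = 100.0                      # one huge outlier
          | 
          |   # Random orthogonal rotation derived from the seed (QR of a
          |   # Gaussian matrix). Invertible, so it can be undone after
          |   # dequantization.
          |   Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
          |   y = Q @ x
          | 
          |   def peak_to_rms(v):
          |       return np.abs(v).max() / np.sqrt(np.mean(v ** 2))
          | 
          |   print(peak_to_rms(x))  # large: the outlier dominates
          |   print(peak_to_rms(y))  # much smaller: easier to quantize
          | 
          | The nice part, as described above, is that the rotation never
          | needs to be stored or pre-generated; it can be recreated on the
          | fly from a per-block seed.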
        
       | jlcases wrote:
        | This compression approach reminds me of human knowledge transfer.
        | In both cases, we're looking for compact representations that can
        | reconstruct complex information.
       | 
       | For technical documentation, I'm experimenting with a similar
       | concept: instead of exhaustively documenting every implementation
       | detail, defining a minimal set of principles and architectural
       | decisions that allow "regenerating" the complete understanding.
       | 
       | Current LLMs excel at expanding compressed concepts, but we're
       | still far from finding the optimal balance between explicit
       | knowledge (detailed documentation) and implicit knowledge
       | (patterns and principles). Is anyone working on systems applying
       | similar ideas to technical knowledge management?
        
       | EGreg wrote:
        | What did Zuck mean when he said that Llama 4 Behemoth is already
        | the highest-performing base model and hasn't even finished
        | training yet? What are the benchmarks then?
        | 
        | Does he mean they did pretraining but not fine-tuning?
        
         | tintor wrote:
          | You can fine-tune a checkpoint of a model during pre-training.
        
       | benob wrote:
        | A variant I have been thinking of: each parameter matrix (or
        | block) is the sum of a random matrix (generated from a seed) and
        | a low-rank matrix (a LoRA). I'd like to experiment with training
        | from scratch in that setting.
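        | 
        | Something like this, in PyTorch terms (names, sizes, and scales
        | are just placeholders):
        | 
        |   import torch
        | 
        |   class SeededLoRALinear(torch.nn.Module):
        |       """W = G(seed) + A @ B: a frozen seeded random base plus a
        |       trainable low-rank correction."""
        | 
        |       def __init__(self, d_in, d_out, rank, seed):
        |           super().__init__()
        |           g = torch.Generator().manual_seed(seed)
        |           base = torch.randn(d_out, d_in, generator=g) / d_in ** 0.5
        |           # Not trained and not saved (persistent=False); it can
        |           # be regenerated from the seed at load time.
        |           self.register_buffer("base", base, persistent=False)
        |           self.A = torch.nn.Parameter(0.01 * torch.randn(d_out, rank))
        |           self.B = torch.nn.Parameter(torch.zeros(rank, d_in))
        | 
        |       def forward(self, x):
        |           return x @ (self.base + self.A @ self.B).T
        | 
        | Only A and B would need gradients and a place in the checkpoint;
        | the base matrix is reproducible from the seed.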
        
         | sadiq wrote:
         | There's a related write-up here you might find interesting:
         | https://wandb.ai/learning-at-home/LM_OWT/reports/Parameter-s...
         | 
         | It covers some experiments on weight tying, one of which is
         | actually LoRA and random weights.
        
       | _0ffh wrote:
        | I suspect an April Fools' joke.
       | 
       | In general, compression using PRNGs is not a thing. There might
       | be a special exception for this case, but I somewhat doubt it. =)
        
         | RainyDayTmrw wrote:
         | The version on Arxiv dates to October 2024, which likely rules
         | this out.
        
           | _0ffh wrote:
            | I submit that there is no "compression into PRNG seeds" going
            | on here. This is just a quantisation method that happens to
            | leverage pseudo-random numbers, which might have some specific
            | advantages and disadvantages. What I am sure it does not do is
            | what its title seems to claim, if taken literally. I suspect
            | they're having a good laugh, getting away with what they must
            | know is borderline trolling. I'm impressed!
        
         | threeseed wrote:
          | I suspect that you are literally the _only_ person on this
          | planet who would find this funny enough to justify wasting the
          | time of a dozen AI researchers, Meta, arXiv, and Apple Legal,
          | who vet everything.
        
       | RainyDayTmrw wrote:
       | How do you reconcile this with the (I believe) widely accepted
       | idea that you can't meaningfully compress data using offsets into
       | Pi?
        
         | fc417fc802 wrote:
         | Not an expert but my impression is that the title and intro are
         | worded in a highly misleading manner.
         | 
         | IIUC they're transforming the data before compressing it. Also
         | IIUC this is an established method.
         | 
         | Because of the nature of the data and the transform involved,
         | you can get reasonable results with random numbers. That's
         | already been done, but this work brute forces seeds to optimize
         | the compression ratio and then derives the transform on the fly
         | from the seed in order to save on memory bandwidth.
         | 
         | I feel like (again, non-expert) there are much deeper
         | implications about current ML models here. The fact that a
         | randomized transform can have this sort of impact seems to
         | imply that there's much less information encoded by the data
         | than we otherwise might expect given its sheer size.
         | 
         | Regarding Pi. You can't encode arbitrary data using arbitrary
         | sequences and expect to come out ahead on average. But you can
         | encode specific data using algorithms that exhibit specific
         | behavior.
        
           | fc417fc802 wrote:
            | Maybe I'm wrong. Figure 2 seems to depict exactly what's
            | described by the title: searching for a combination of random
            | numbers that recovers an approximation of the weights. But if
            | that's true, then I have the same question about information
            | theory that you posed above.
        
         | diegoperini wrote:
         | You get to choose your own, more efficient "PI" for your model.
         | Still, it's a valid question.
        
       | htrp wrote:
       | Also from October 2024 (https://arxiv.org/abs/2410.10714)
        
       ___________________________________________________________________
       (page generated 2025-04-06 23:00 UTC)