[HN Gopher] SeedLM: Compressing LLM Weights into Seeds of Pseudo...
___________________________________________________________________
SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random
Generators
Author : pizza
Score : 147 points
Date : 2025-04-06 08:53 UTC (14 hours ago)
(HTM) web link (machinelearning.apple.com)
(TXT) w3m dump (machinelearning.apple.com)
| elashri wrote:
| I think it would be better to just link directly to the paper
| [1]. It is a work by researchers at Apple and Meta.
|
| [1] https://arxiv.org/abs/2410.10714
| anshumankmr wrote:
| all this and they can't launch Apple Intelligence on schedule :(
| timschmidt wrote:
| Honestly, this seems like enabling work. Even the iPhone 16 Pro
| Max only seems to have 8GB RAM. If Apple Intelligence plans to
| do anything useful on-device, they need work like this.
| manmal wrote:
| Personally, I'm holding off on purchasing a new iPhone for
| this reason, even though, as a full-time iOS dev, I find my
| 13 Pro is getting long in the tooth. The coming generation is
| rumored to have more and better memory (LPDDR5X?) and better
| cooling (vapor chamber).
| brookst wrote:
| Research vs productization. Very different.
| echelon wrote:
| This unblocks future product work. You have to lay the
| groundwork.
|
| We need advancements like this if we want on-device AI to work
| well. This is the kind of thing Apple Silicon needs especially.
| It's weak relative to Nvidia consumer chips.
| gblargg wrote:
| It sounds like they basically find the part of a pseudo-random
| sequence that is closest to the desired data, then store the
| random seed and corrections (which are small, so they take less
| space).
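|
| A toy sketch of that idea (illustrative only, not the paper's
| exact algorithm; names and bit widths here are made up):
|
|     import numpy as np
|
|     def compress_block(block, num_seeds=4096, residual_bits=4):
|         # Pick the seed whose pseudo-random vector is closest to
|         # `block`, then keep the seed plus small quantized
|         # corrections.
|         best_seed, best_err, best_rand = None, np.inf, None
|         for seed in range(num_seeds):
|             rand = np.random.default_rng(seed).standard_normal(block.size)
|             err = np.sum((block - rand) ** 2)
|             if err < best_err:
|                 best_seed, best_err, best_rand = seed, err, rand
|         residual = block - best_rand
|         scale = np.abs(residual).max() / (2 ** (residual_bits - 1) - 1)
|         corrections = np.round(residual / scale).astype(np.int8)
|         return best_seed, scale, corrections
|
|     def decompress_block(seed, scale, corrections):
|         rand = np.random.default_rng(seed).standard_normal(corrections.size)
|         return rand + scale * corrections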
| joerick wrote:
| Pretty fascinating from an information theory point of view.
| Surprising that it works at all. Is this, like, the JPEG of
| uniformly distributed, uncorrelated data?
| barotalomey wrote:
| You might find The Library of Babel fascinating [1, 2]
|
| 1: https://libraryofbabel.info/
|
| 2: https://news.ycombinator.com/item?id=9480949
| samus wrote:
| We don't know. They basically look for sequences that
| approximate NN weights well, in the same way sinusoidal
| functions work well with "natural" images, but not with
| graphics with hard edges.
| visarga wrote:
| Very interesting trick: a dictionary of basis vectors that are
| quickly computed from a seed, with no storage needed. But the
| result is the same 3- or 4-bit quantization, with only a slight
| improvement. Their tiles are small, just 8 or 12 weights, which
| is why the compression doesn't go further. It would have been
| great if this trick pushed quantization below 1 bit/weight, but
| that would require longer tiles. I wonder what the limits are if
| we use a larger reservoir of cheap entropy as part of the neural
| net architecture, even in training.
|
| Congrats to Apple and Meta; it makes sense that they did the
| research, since this will go toward efficient serving of LLMs on
| phones. And it's very easy to implement.
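|
| For reference, a rough sketch of the basis-vector trick being
| described (tile size, rank, and bit width here are assumptions,
| not the paper's exact settings): each tile is approximated as a
| seeded pseudo-random basis matrix times a few low-bit
| coefficients, so only the seed, a scale, and the coefficients
| are stored.
|
|     import numpy as np
|
|     TILE = 8        # weights per tile
|     RANK = 3        # basis vectors per tile
|     COEFF_BITS = 4  # bits per coefficient
|
|     def basis_from_seed(seed):
|         # Regenerated cheaply at inference time, never stored.
|         return np.random.default_rng(seed).standard_normal((TILE, RANK))
|
|     def encode_tile(tile, num_seeds=256):
|         best = None
|         for seed in range(num_seeds):
|             U = basis_from_seed(seed)
|             c, *_ = np.linalg.lstsq(U, tile, rcond=None)  # best-fit coeffs
|             scale = np.abs(c).max() / (2 ** (COEFF_BITS - 1) - 1) or 1.0
|             q = np.round(c / scale).astype(np.int8)
|             err = np.sum((tile - U @ (scale * q)) ** 2)
|             if best is None or err < best[0]:
|                 best = (err, seed, scale, q)
|         return best[1:]  # (seed, scale, quantized coefficients)
|
|     def decode_tile(seed, scale, coeffs):
|         return basis_from_seed(seed) @ (scale * coeffs)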
| kingsleyopara wrote:
| I was about to post something similar. While the research is
| interesting, it doesn't offer any advantages over 3- or 4-bit
| quantization. I also have to assume they explored using longer
| tiles but found it to be ineffective -- which would make sense
| to me from an information theory perspective.
| timschmidt wrote:
| > it doesn't offer any advantages over 3- or 4-bit
| quantization.
|
| "zero-shot accuracy retention at 4- and 3-bit compression to
| be on par with or better than state-of-the-art methods, while
| maintaining performance comparable to FP16 baselines."
|
| My reading of that says FP16 accuracy at Q3 or Q4 size /
| memory bandwidth. Which is a huge advantage.
| kingsleyopara wrote:
| For zero-shot accuracy from Table 3:
|
| * LLaMA 3 8B: baseline 72.26, 4-bit 71.31, 3-bit 62.79
|
| * LLaMA 3 70B: baseline 79.51, 4-bit 78.06, 3-bit 74.68
|
| These results seem comparable to modern quantization
| methods--for example, the ~4-bit results for smaller LLaMA
| models listed here: https://ai.meta.com/blog/meta-llama-
| quantized-lightweight-mo...
| timschmidt wrote:
| I don't see any comparable numbers on the page you
| linked. Seems to only have numbers for 1B and 3B
| parameter models. Comparisons to AWQ and OmniQuant in
| Table 3 seem quite favorable with SeedLM showing 10% -
| 50% better performance.
|
| Also seems like the techniques may be possible to
| combine.
| _0ffh wrote:
| As a rule of thumb, the bigger the model is, the more
| gracefully it degrades under quantisation. So you may
| assume the performance loss for an 8B model would be lower
| than for a 3B model. (I know that doesn't make up for the
| missing numbers in the link, just fyi.)
| jsenn wrote:
| I think the main advantage is that you can compute the extra
| parameters (the PRNG seeds) from the network weights alone,
| whereas most other quantization methods require simulating
| the quantization procedure at training time (Quantization-
| Aware Training) or setting them from a calibration dataset
| (Post-Training Quantization).
| hedgehog wrote:
| This technique has three significant advantages over popular
| low-bit quantization: 1) it retains more accuracy, 2) it does
| not require calibration data, and 3) it's easier to implement
| in hardware.
| samus wrote:
| It should be definitely worth it because you can reuse
| databases of sequence to seed mappings for _all_ future models.
| torginus wrote:
| This sounds like compression with extra steps... What makes this
| technique particular to LLM weights instead of general-purpose
| data?
| pornel wrote:
| Weights in neural networks don't always need to be precise. Not
| all weights are equally useful to the network. There seems to
| be a lot of redundancy that can be replaced with
| approximations.
|
| This technique seems a bit similar to lossy image compression
| that replaces exact pixels with a combination of pre-defined
| patterns (the DCT in JPEG), but here the patterns come not from
| a cosine function but from a pseudo-random one.
|
| It may also be beating simple quantization from just adding
| noise that acts as dithering, and breaks up the bands created
| by combinations of quantized numbers.
| jsenn wrote:
| > What makes this technique particular to LLM weights
|
| This is my understanding as a non-expert.
|
| LLM activations tend to be relatively sparse with large
| outliers. With linear quantization, this means you either have
| to clip off the outliers or you have to stretch your range to
| include the outliers, which wastes precious bits. Neither of
| these works well, so essentially all LLM quantization research
| is using various heuristics to get around these outliers. For
| example, you can do linear quantization but split the
| activations up into smaller blocks to make it less likely that
| any given block contains an outlier.
|
| Another trick people have discovered (predates LLMs) is
| applying a random rotation/projection to the embeddings. This
| has the effect of making sure no one dimension in the vector
| dominates the others (which again hurts quantization). This
| works because in order for a single dimension to dominate, all
| the others have to "conspire" to be near zero. When you have
| 10,000+ dimensions, that's very unlikely.
|
| This paper applies the latter trick. Instead of pre-generating
| the random projection matrices, they generate them on the fly
| on the accelerator from a seed that is fixed for each block.
| The seed is chosen from an offline brute-force search that
| needs only the weights of the network. This separates it from a
| lot of other quantization methods that either require
| calibration data or have to be simulated at training time so
| the network learns the quantization parameters itself.
|
| You might think this is wasteful/might hurt performance, but it
| turns out that LLM inference is heavily memory-bound as it
| involves streaming a very large neural network into the
| accelerator (GPU/TPU/NPU/whatever) to operate on a relatively
| small amount of data, so there are lots of "free cycles" to
| generate these random numbers. Of course, if you care about
| power usage that might not be a great idea...
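|
| A minimal sketch of the rotate-then-quantize idea described
| above, assuming a seeded random orthogonal matrix for the
| rotation (block size, seed, and bit width are arbitrary):
|
|     import numpy as np
|
|     def random_rotation(dim, seed):
|         # Orthogonal matrix from the QR decomposition of a seeded
|         # Gaussian matrix; can be regenerated from the seed alone.
|         g = np.random.default_rng(seed).standard_normal((dim, dim))
|         q, _ = np.linalg.qr(g)
|         return q
|
|     def quantize(x, bits=4):
|         scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
|         return np.round(x / scale).astype(np.int8), scale
|
|     dim, seed = 64, 1234
|     w = np.random.standard_normal(dim)   # stand-in for a weight block
|     R = random_rotation(dim, seed)
|
|     # Rotate so no single dimension dominates, then quantize.
|     q, scale = quantize(R @ w)
|
|     # At inference: dequantize and undo the rotation.
|     w_hat = R.T @ (q * scale)
|     print(np.max(np.abs(w - w_hat)))     # small reconstruction error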
| jlcases wrote:
| This compression approach reminds me of human knowledge
| transfer. In both cases, we're looking for compact
| representations that can reconstruct complex information.
|
| For technical documentation, I'm experimenting with a similar
| concept: instead of exhaustively documenting every implementation
| detail, defining a minimal set of principles and architectural
| decisions that allow "regenerating" the complete understanding.
|
| Current LLMs excel at expanding compressed concepts, but we're
| still far from finding the optimal balance between explicit
| knowledge (detailed documentation) and implicit knowledge
| (patterns and principles). Is anyone working on systems applying
| similar ideas to technical knowledge management?
| EGreg wrote:
| What did Zuck mean when he said Llama 4 Behemoth is already the
| highest-performing base model and hasn't even finished training
| yet? What are the benchmarks, then?
|
| Does he mean they did pretraining but not fine-tuning?
| tintor wrote:
| You can fine-tune a checkpoint of the model during pre-training.
| benob wrote:
| A variant I have been thinking of: each parameter matrix (or
| block) is the sum of a random matrix (generated from a seed) and
| a low-rank matrix (a LoRA). I'd like to experiment with training
| from scratch in that setting.
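|
| A minimal sketch of that parameterization, with arbitrary names
| and shapes: only the seed and the two low-rank factors would be
| trained and stored; the random matrix is regenerated on demand.
|
|     import numpy as np
|
|     def materialize(seed, A, B):
|         # W = frozen random matrix (from seed) + low-rank update A @ B
|         out_dim, in_dim = A.shape[0], B.shape[1]
|         G = np.random.default_rng(seed).standard_normal((out_dim, in_dim))
|         return G + A @ B
|
|     out_dim, in_dim, rank, seed = 128, 256, 8, 42
|     A = np.zeros((out_dim, rank))                          # trainable
|     B = np.random.standard_normal((rank, in_dim)) * 0.01   # trainable
|     W = materialize(seed, A, B)   # used in the forward pass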
| sadiq wrote:
| There's a related write-up here you might find interesting:
| https://wandb.ai/learning-at-home/LM_OWT/reports/Parameter-s...
|
| It covers some experiments on weight tying, one of which is
| actually LoRA and random weights.
| _0ffh wrote:
| I suspect an April Fools' joke.
|
| In general, compression using PRNGs is not a thing. There might
| be a special exception for this case, but I somewhat doubt it. =)
| RainyDayTmrw wrote:
| The version on arXiv dates to October 2024, which likely rules
| this out.
| _0ffh wrote:
| I submit that there is no "compression into PRNG seeds" going
| on here. This is just a quantisation method that happens to
| leverage PRNs, which might have some specific advantages and
| disadvantages. What I am sure it does not do is what its title
| seems to claim, if taken literally. I suspect they're having a
| good laugh, getting away with what they must know is borderline
| trolling. I'm impressed!
| threeseed wrote:
| I suspect that you are literally the _only_ person on this
| planet who would find this to be funny enough for Apple to
| waste the time of a dozen AI Researchers, Meta, arXiv and Apple
| Legal who vet everything.
| RainyDayTmrw wrote:
| How do you reconcile this with the (I believe) widely accepted
| idea that you can't meaningfully compress data using offsets into
| Pi?
| fc417fc802 wrote:
| Not an expert but my impression is that the title and intro are
| worded in a highly misleading manner.
|
| IIUC they're transforming the data before compressing it. Also
| IIUC this is an established method.
|
| Because of the nature of the data and the transform involved,
| you can get reasonable results with random numbers. That's
| already been done, but this work brute forces seeds to optimize
| the compression ratio and then derives the transform on the fly
| from the seed in order to save on memory bandwidth.
|
| I feel like (again, non-expert) there are much deeper
| implications about current ML models here. The fact that a
| randomized transform can have this sort of impact seems to
| imply that there's much less information encoded by the data
| than we otherwise might expect given its sheer size.
|
| Regarding Pi. You can't encode arbitrary data using arbitrary
| sequences and expect to come out ahead on average. But you can
| encode specific data using algorithms that exhibit specific
| behavior.
| fc417fc802 wrote:
| Maybe I'm wrong. Figure 2 seems to depict exactly what's
| described by the title: searching for a combination of random
| numbers that recovers an approximation of the weights. But if
| that's true, then I have the same information-theoretic
| question that you posed above.
| diegoperini wrote:
| You get to choose your own, more efficient "PI" for your model.
| Still, it's a valid question.
| htrp wrote:
| Also from October 2024 (https://arxiv.org/abs/2410.10714)
___________________________________________________________________
(page generated 2025-04-06 23:00 UTC)