[HN Gopher] From Memorization to Reasoning in the Spectrum of Lo...
___________________________________________________________________
From Memorization to Reasoning in the Spectrum of Loss Curvature
Author : andy12_
Score : 54 points
Date : 2025-11-07 12:43 UTC (10 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| andy12_ wrote:
| Very concise summary of the procedure described in this paper:
|
| 1. Run the model once across a dataset to estimate loss curvature
| per MLP weight matrix via K-FAC (activation/gradient
| covariances).
|
| 2. Decompose each weight matrix into curvature-ordered
| components; low-curvature directions correspond most to verbatim
| memorization, higher curvature to shared/general mechanisms.
|
| 3. Edit by dropping the low-curvature subspace and keeping only
| the top directions (rough sketch below).
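|
| A rough numpy sketch of what such an edit could look like (my own
| illustration, assuming the two K-FAC factors have already been
| estimated over a dataset; the function name and thresholding rule
| are illustrative, not the paper's exact implementation):
|
|   import numpy as np
|
|   def curvature_edit(W, A, G, keep_frac=0.9):
|       # W: (d_out, d_in) weights of a layer computing y = W x
|       # A: (d_in, d_in) input-activation covariance E[x x^T]
|       # G: (d_out, d_out) output-gradient covariance E[g g^T]
|       sA, UA = np.linalg.eigh(A)  # eigenbases of the K-FAC factors
|       sG, UG = np.linalg.eigh(G)
|       W_rot = UG.T @ W @ UA       # W in the Kronecker eigenbasis
|       curv = np.outer(sG, sA)     # curvature of component (i, j)
|       # keep only the highest-curvature ("shared/general") part,
|       # dropping the low-curvature ("memorization") subspace
|       mask = curv >= np.quantile(curv, 1.0 - keep_frac)
|       return UG @ (W_rot * mask) @ UA.T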
| vessenes wrote:
| Thank you for this huge time saver.
|
| Now, about the paper: that's super interesting. I imagine the
| dream here is to distil the model down into a "reasoning" core.
| Or maybe reclaim space for more generalization. Lots of
| interesting use cases.
| getnormality wrote:
| Thank you!
|
| I think you may have accidentally switched low and high in #2,
| no? The abstract speaks of high curvature as associated with
| memorization:
|
| > curvature for memorized training points is much sharper than
| non memorized
| radarsat1 wrote:
| This sounds more correct to me. I've read elsewhere that better
| generalization is usually associated with wider, smoother
| minima, and that this is why regularization is important: it
| has a smoothing effect on the loss landscape.
| getnormality wrote:
| Yes. This is also not hard to see intuitively from scratch.
|
| Say you have a smooth but highly flexible model y = f(x)
| and some data points you are fitting with a machine
| learning algorithm. For whatever reason, the algorithm
| decides it wants to reduce training error by interpolating
| some specific point, (x0,y0), without negatively affecting
| training error on nearby points. The direct, guaranteed-to-work
| way to do this is to adjust the model so that f(x0) = y0
| exactly, by adding a Dirac delta at x0 and leaving the rest of
| f exactly as-is. But this cannot be done with a differentiable
| model, as it would create a discontinuity. The next best thing
| such a model can actually do is replace the Dirac delta with a
| smooth but very narrow bump (e.g. a Gaussian). That narrow bump
| will inevitably have extremely high curvature at x0: the bump
| is flat (zero slope) at its peak, yet it has to merge with the
| neighborhood around x0 within a very short distance.
|
| Think of driving: if you have to change lanes in a very
| short distance, you're going to have to steer hard.
| Steering is curvature.
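|
| A throwaway numeric check of that picture (my own illustration,
| not from the paper): the second derivative of a Gaussian bump of
| height A and width sigma, taken at its centre, is -A / sigma^2,
| so the narrower the bump, the sharper the curvature.
|
|   import numpy as np
|
|   def peak_curvature(A=1.0, sigma=0.1, x0=0.0, h=1e-4):
|       f = lambda x: A * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
|       # central finite-difference estimate of f''(x0)
|       return (f(x0 + h) - 2 * f(x0) + f(x0 - h)) / h ** 2
|
|   for sigma in (1.0, 0.1, 0.01):
|       print(sigma, peak_curvature(sigma=sigma))
|   # roughly -1, -100, -10000: curvature blows up like 1 / sigma^2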
| woadwarrior01 wrote:
| That's very reminiscent of the idea behind the SAM
| (Sharpness Aware Minimization) family of optimizers.
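|
| For anyone who hasn't seen it, the core idea of SAM in a few
| lines (a simplified full-batch sketch, not anything from this
| paper):
|
|   import numpy as np
|
|   def sam_step(w, loss_grad, lr=0.01, rho=0.05):
|       g = loss_grad(w)
|       # climb to the approximate worst point in a rho-ball around w
|       eps = rho * g / (np.linalg.norm(g) + 1e-12)
|       # then descend using the gradient taken there, which biases
|       # the optimizer away from sharp minima
|       return w - lr * loss_grad(w + eps)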
| andy12_ wrote:
| Actually, no! Look at this passage from the paper:
|
| > In extending from studying per-example to bulk
| memorization, we propose a novel inversion of the previous
| interpretation of loss curvature: while individual memorized
| points are associated with high curvature, the direction of
| curvature varies across examples, meaning that, averaged
| across multiple examples, memorization directions are
| actually flatter than generalizing directions, which maintain
| a consistent moderate curvature across points
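|
| A toy numpy illustration of why averaging flips the picture (my
| own construction, not from the paper): give every example a sharp
| curvature direction of its own plus a moderate direction shared
| by all of them, then average.
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   d, n = 50, 1000
|   shared = np.zeros(d); shared[0] = 1.0  # one shared direction
|   avg = np.zeros((d, d))
|   for _ in range(n):
|       m = rng.standard_normal(d); m /= np.linalg.norm(m)
|       # sharp (10.0) along a per-example direction, moderate (1.0)
|       # along the shared one
|       avg += 10.0 * np.outer(m, m) + np.outer(shared, shared)
|   avg /= n
|
|   vals = np.linalg.eigvalsh(avg)
|   print(vals[-1], np.median(vals))  # ~1.2 vs ~0.2: the shared
|   # direction is sharpest on average; the per-example ones flatten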
| getnormality wrote:
| Ah! I figured I should be very circumspect in the question,
| since I hadn't read the paper in full and there could be some
| crazy reason it's actually the opposite.
| vatsachak wrote:
| The decomposition they use "averages out the points of high
| curvature"; therefore, the components of the decomposition that
| correspond to "higher curvature" are the ones used across
| multiple data points, which makes them the "general reasoning"
| components.
| kingstnap wrote:
| A very similar idea is presented in the first 5 minutes of this
| recent talk, though more from observing a kink in loss curves.
|
| https://youtu.be/UyK3DgWY7yw?si=NN3f9Erik8o_Nfbs
| NitpickLawyer wrote:
| > Our work enhances the understanding of memorization in neural
| networks with practical applications towards removing it
|
| Cool stuff. In a recent podcast Karpathy was also talking about
| this. He sees it as the next "target": models that don't
| memorise facts, because those can be looked up in an oracle, but
| still keep the "reasoning" qualities.
| esafak wrote:
| How can you generalize without facts? They are the foundation
| on which generalization is built. It's like programming without
| memorizing the keywords. Unless you make a distinction between
| facts that let you generalize and facts that do not, like
| random ID numbers.
| icandoit wrote:
| We want the LLM to learn the multiplication algorithm, not an
| incomplete set of tables. The algorithm might be smaller, and
| it will be more complete.
|
| Honestly, our technology has outpaced our epistemology, so we
| don't really know what a fact is or isn't. Are facts what we
| call our supervised learning experiences? You think the sun
| rises; no, the earth spins. Your belief that the sun rises
| helps you predict sunset and sunrise. Your belief would be
| quaint to someone born and raised on a space station. Apollo's
| chariot moves the sun across the sky, doesn't it?
| esafak wrote:
| There is a related line of work suggesting that spikes in the
| ESD (empirical spectral density) of the weight matrices are
| related to generalization vs. memorization too; e.g.,
|
| _From Spikes to Heavy Tails: Unveiling the Spectral Evolution of
| Neural Networks_ (https://openreview.net/pdf?id=DJHB8eBUnt)
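|
| For anyone unfamiliar with the term, a quick illustrative sketch
| of computing a weight matrix's ESD (not that paper's code):
|
|   import numpy as np
|
|   def esd(W):
|       # eigenvalues of W^T W / n (one common convention), i.e.
|       # squared singular values of W scaled by the larger dimension
|       n = W.shape[0]
|       return np.linalg.eigvalsh(W.T @ W / n)
|
|   # a purely random layer has no outliers; "spikes" that emerge
|   # during training are what gets tied to memorization vs.
|   # generalization
|   W = np.random.default_rng(0).standard_normal((1024, 256))
|   print(esd(W)[-5:])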
___________________________________________________________________
(page generated 2025-11-07 23:01 UTC)