[HN Gopher] From Memorization to Reasoning in the Spectrum of Lo...
       ___________________________________________________________________
        
       From Memorization to Reasoning in the Spectrum of Loss Curvature
        
       Author : andy12_
       Score  : 54 points
       Date   : 2025-11-07 12:43 UTC (10 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | andy12_ wrote:
       | Very concise summary of the procedure described in this paper:
       | 
       | 1. Run the model once across a dataset to estimate loss curvature
       | per MLP weight matrix via K-FAC (activation/gradient
       | covariances).
       | 
       | 2. Decompose each weight matrix into curvature-ordered
       | components; low-curvature directions correspond most to verbatim
       | memorization, higher curvature to shared/general mechanisms.
       | 
        | 3. Edit by dropping the low-curvature subspace, keeping only
        | the top directions (rough sketch below).
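        | 
        | A minimal numpy sketch of the pipeline (shapes, data, and the
        | keep-fraction are my illustrative assumptions, not the paper's
        | actual code):
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        |   d_in, d_out, n = 64, 32, 1000
        |   W = rng.normal(size=(d_out, d_in))  # one MLP weight matrix
        |   A = rng.normal(size=(n, d_in))      # input activations
        |   G = rng.normal(size=(n, d_out))     # output gradients
        | 
        |   # 1. K-FAC factors: activation and gradient covariances.
        |   A_cov = A.T @ A / n
        |   G_cov = G.T @ G / n
        | 
        |   # 2. Eigendecompose each factor; under the Kronecker
        |   # approximation, component (i, j) of W in this basis has
        |   # curvature lam_G[i] * lam_A[j].
        |   lam_A, U_A = np.linalg.eigh(A_cov)
        |   lam_G, U_G = np.linalg.eigh(G_cov)
        |   curv = np.outer(lam_G, lam_A)
        | 
        |   # 3. Zero out the low-curvature subspace, keep the top half.
        |   W_eig = U_G.T @ W @ U_A
        |   k = curv.size // 2                  # illustrative choice
        |   thresh = np.partition(curv.ravel(), -k)[-k]
        |   W_edited = U_G @ np.where(curv >= thresh, W_eig, 0.0) @ U_A.T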
        
         | vessenes wrote:
         | Thank you for this huge time saver.
         | 
          | Now, about the paper: it's super interesting. I imagine the
          | dream here is to distil a model down into a "reasoning" core,
          | or maybe to reclaim space for more generalization. Lots of
          | interesting use cases.
        
         | getnormality wrote:
         | Thank you!
         | 
         | I think you may have accidentally switched low and high in #2,
         | no? The abstract speaks of high curvature as associated with
         | memorization:
         | 
         | > curvature for memorized training points is much sharper than
         | non memorized
        
           | radarsat1 wrote:
            | This sounds more correct to me. I've previously read
            | somewhere that better generalization is usually associated
            | with wider, smoother minima, and that this is why
            | regularization is important: it has a smoothing effect on
            | the loss landscape.
        
             | getnormality wrote:
             | Yes. This is also not hard to see intuitively from scratch.
             | 
              | Say you have a smooth but highly flexible model y = f(x)
              | and some data points you are fitting with a machine
              | learning algorithm. For whatever reason, the algorithm
              | decides it wants to reduce training error by
              | interpolating one specific point (x0, y0) without
              | negatively affecting training error on nearby points. The
              | direct, guaranteed-successful way to do this is to adjust
              | the model so that f(x0) = y0 exactly, by adding a Dirac
              | delta at x0 and leaving the rest of f exactly as-is. But
              | a differentiable model cannot do this, as it would create
              | a discontinuity. The next best thing such a model can
              | actually do is replace the Dirac delta with a smooth but
              | very narrow bump (e.g. a Gaussian). But this narrow bump
              | will inevitably have extremely high curvature near x0,
              | since it has zero slope at the peak yet has to merge with
              | the surrounding function within a very short distance.
             | 
             | Think of driving: if you have to change lanes in a very
             | short distance, you're going to have to steer hard.
             | Steering is curvature.
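              | 
              | A tiny numeric check of that intuition (my sketch, not
              | anything from the paper): the peak curvature of a
              | Gaussian bump of width sigma scales like 1/sigma^2, so
              | the narrower the bump, the harder the "steering".
              | 
              |   import numpy as np
              | 
              |   def peak_curvature(sigma, eps=1e-4):
              |       # central-difference second derivative of
              |       # exp(-x^2 / (2 sigma^2)) at its peak x = 0
              |       f = lambda x: np.exp(-x**2 / (2 * sigma**2))
              |       return (f(eps) - 2 * f(0.0) + f(-eps)) / eps**2
              | 
              |   for sigma in (1.0, 0.1, 0.01):
              |       print(sigma, peak_curvature(sigma))  # ~ -1/sigma^2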
        
             | woadwarrior01 wrote:
              | That's very reminiscent of the idea behind the SAM
              | (Sharpness-Aware Minimization) family of optimizers.
        
           | andy12_ wrote:
            | Actually, no! Look at this passage in the paper:
           | 
           | > In extending from studying per-example to bulk
           | memorization, we propose a novel inversion of the previous
           | interpretation of loss curvature: while individual memorized
           | points are associated with high curvature, the direction of
           | curvature varies across examples, meaning that, averaged
           | across multiple examples, memorization directions are
           | actually flatter than generalizing directions, which maintain
           | a consistent moderate curvature across points
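            | 
            | A toy numpy illustration of that inversion (mine, not the
            | paper's): curvature pointing in a fresh random direction
            | per example averages out, while a direction shared across
            | examples keeps a consistent moderate value.
            | 
            |   import numpy as np
            | 
            |   rng = np.random.default_rng(0)
            |   d, n = 50, 200
            |   shared = np.zeros(d)
            |   shared[0] = 1.0              # one generalizing direction
            | 
            |   H_avg = np.zeros((d, d))
            |   for _ in range(n):
            |       mem = rng.normal(size=d)
            |       mem /= np.linalg.norm(mem)  # new direction each time
            |       # sharp (10.0) along the per-example memorized
            |       # direction, moderate (1.0) along the shared one
            |       H_avg += 10.0 * np.outer(mem, mem)
            |       H_avg += 1.0 * np.outer(shared, shared)
            |   H_avg /= n
            | 
            |   print(H_avg[0, 0])            # ~1.2: shared survives
            |   print(np.diag(H_avg)[1:].mean())  # ~0.2: memorized
            |                                     # directions flatten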
        
             | getnormality wrote:
              | Ah! I figured I should be very circumspect in the
              | question, since I hadn't read the paper in full and there
              | could be some crazy reason it's actually the opposite.
        
           | vatsachak wrote:
            | The decomposition they use "averages out the points of high
            | curvature", so the components of the decomposition that
            | correspond to "higher curvature" are the ones used across
            | multiple data points. Those are therefore the "general
            | reasoning" components.
        
       | kingstnap wrote:
        | A very similar idea is presented in the first 5 minutes of
        | this recent talk, though arrived at more by observing a kink
        | in loss curves.
       | 
       | https://youtu.be/UyK3DgWY7yw?si=NN3f9Erik8o_Nfbs
        
       | NitpickLawyer wrote:
       | > Our work enhances the understanding of memorization in neural
       | networks with practical applications towards removing it
       | 
        | Cool stuff. In a recent podcast, Karpathy was also talking
        | about this. He sees it as the next "target": models that don't
        | memorise facts, because those can be looked up in an oracle,
        | but that still keep the "reasoning" qualities.
        
         | esafak wrote:
          | How can you generalize without facts? They are the
          | foundation on which generalization is built; it's like
          | programming without memorizing the keywords. Unless you make
          | a distinction between facts that let you generalize and
          | facts that do not, like random ID numbers.
        
           | icandoit wrote:
            | We want the LLM to learn the multiplication algorithm, not
            | an incomplete set of tables. The algorithm might be smaller,
            | and it will be more complete (see the toy contrast below).
           | 
            | Honestly, our technology has outpaced our epistemology, so
            | we don't really know what a fact is or isn't. Are facts
            | what we call our supervised learning experiences? You think
            | the sun rises; no, the earth spins. Your belief that the
            | sun rises helps you predict sunset and sunrise, yet it
            | would be quaint to someone born and raised on a space
            | station. Apollo's chariot moves the sun across the sky,
            | doesn't it?
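            | 
            | A toy contrast (hypothetical names, just to illustrate the
            | point above): a memorized table fails off its support,
            | while the shift-and-add algorithm covers every case.
            | 
            |   table = {(2, 3): 6, (4, 5): 20, (7, 8): 56}  # incomplete
            | 
            |   def long_multiply(a, b):
            |       # grade-school shift-and-add over the digits of b
            |       result, shift = 0, 0
            |       while b:
            |           b, digit = divmod(b, 10)
            |           result += a * digit * 10**shift
            |           shift += 1
            |       return result
            | 
            |   print(table.get((12, 34)))    # None: cannot extrapolate
            |   print(long_multiply(12, 34))  # 408: generalizes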
        
       | esafak wrote:
        | There is a related line of work suggesting that spikes in the
        | ESD (empirical spectral density) of the weight matrices are
        | related to generalization vs. memorization too; e.g.,
       | 
       |  _From Spikes to Heavy Tails: Unveiling the Spectral Evolution of
       | Neural Networks_ (https://openreview.net/pdf?id=DJHB8eBUnt)
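        | 
        | A minimal sketch of what such a spike looks like (sizes and
        | spike strength are my assumptions): the eigenvalues of W^T W
        | form a random bulk, and a planted rank-one signal pops out as
        | an outlier.
        | 
        |   import numpy as np
        | 
        |   rng = np.random.default_rng(0)
        |   n, m = 512, 256
        |   W = rng.normal(size=(n, m)) / np.sqrt(n)   # random bulk
        |   spike = np.outer(rng.normal(size=n), rng.normal(size=m))
        |   W += 5.0 * spike / np.sqrt(n * m)          # rank-one signal
        | 
        |   eigs = np.linalg.eigvalsh(W.T @ W)
        |   print(eigs[-2:])  # top eigenvalue separates from the bulk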
        
       ___________________________________________________________________
       (page generated 2025-11-07 23:01 UTC)