[HN Gopher] Grokking at the edge of linear separability
       ___________________________________________________________________
        
       Grokking at the edge of linear separability
        
       Author : marojejian
       Score  : 67 points
       Date   : 2024-10-11 16:09 UTC (6 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | diwank wrote:
       | Grokking is so cool. What does it even mean that grokking
       | exhibits similarities to criticality? As in, what are the
       | philosophical ramifications of this?
        
         | hackinthebochs wrote:
         | Criticality is the boundary between order and chaos, which also
         | happens to be the boundary at which information dynamics and
         | computation can occur. Think of it like this: a highly ordered
         | structure cannot carry much information because there are few
         | degrees of freedom. The other extreme is too many degrees of
         | freedom in a chaotic environment; any correlated state quickly
         | gets destroyed by entropy. The point at which the two dynamics
         | are balanced is where computation can occur. This point has
         | enough dynamics that state can change in a controlled manner,
         | and enough order so that state can reliably persist over time.
         | 
         | I would speculate that the connection between grokking and
         | criticality is that grokking represents the point at which a
         | network maximizes the utility of information in service to
         | prediction. This maximum would be when dynamics and rigidity
         | are finely tuned to the constraints of the problem the network
         | is solving, when computation is being leveraged to maximum
         | effect. Presumably this maximum leverage of computation is the
         | point of ideal generalization.
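          | 
          | To make the "balance point" idea concrete, here's a toy
          | numerical sketch (entirely my own illustration, nothing from
          | the paper): a branching process in which every active unit
          | spawns a Poisson(branching_ratio) number of active units at
          | the next step. Subcritical activity fizzles almost immediately
          | (state can't persist), supercritical activity either dies
          | young or saturates a cap (state can't change in a controlled
          | way), and only near the critical ratio of 1 do you see
          | persistence across a broad range of timescales:
          | 
          |     import numpy as np
          | 
          |     def survival_time(branching_ratio, rng,
          |                       max_steps=500, cap=10**6):
          |         """Steps until activity dies out in a toy branching
          |         process. Activity is capped to mimic a finite system
          |         and keep the numbers sane."""
          |         active = 1
          |         for t in range(1, max_steps + 1):
          |             active = min(rng.poisson(branching_ratio * active),
          |                          cap)
          |             if active == 0:
          |                 return t
          |         return max_steps  # still active when we stop looking
          | 
          |     rng = np.random.default_rng(0)
          |     for sigma in (0.8, 1.0, 1.2):  # sub-, critical, super-
          |         times = np.array([survival_time(sigma, rng)
          |                           for _ in range(2000)])
          |         print(f"branching ratio {sigma}: "
          |               f"median {np.median(times):4.0f}, "
          |               f"99th pct {np.percentile(times, 99):4.0f}, "
          |               f"max {times.max()}")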
        
           | Agingcoder wrote:
            | This looks very interesting. Do you have any references? (Not
            | necessarily on grokking, but on the part about computation
            | occurring only when the right balance is found.)
        
             | hackinthebochs wrote:
              | Hard to pin down a single citation for that point. But some
             | good places to start are:
             | 
             | https://en.wikipedia.org/wiki/Critical_brain_hypothesis
             | 
             | https://journals.aps.org/pre/abstract/10.1103/PhysRevE.79.0
             | 4... (on sci-hub)
        
           | soulofmischief wrote:
           | A scale-free network is one whose degree distribution follows
           | a power law. [0]
           | 
           | Self-organized criticality describes a phenomenon where
           | certain complex systems naturally evolve toward a critical
           | state where they exhibit power-law behavior and scale
           | invariance. [1]
           | 
           | The power laws observed in such systems suggest they are at
           | the edge between order and chaos. In intelligent systems,
           | such as the brain, this edge-of-chaos behavior is thought to
           | enable maximal adaptability, information processing, and
           | optimization.
           | 
           | The brain has been proposed to operate near critical points,
           | with neural avalanches following power laws. This allows a
           | very small amount of energy to have an outsized impact, the
           | key feature of scale-free networks. This phenomenon is a
           | natural extension of the stationary action principle.
           | 
           | [0] https://en.wikipedia.org/wiki/Scale-free_network
           | 
           | [1] https://www.researchgate.net/publication/235741761_Self-
           | Orga...
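            | 
            | To get a hands-on feel for what "scale-free" means in [0],
            | here's a small sketch (my own toy code, not from either
            | reference): grow a preferential-attachment graph and compare
            | its degrees to a uniformly random graph with the same number
            | of nodes and edges. The former develops hubs whose degree
            | dwarfs the mean; the latter's degrees stay tightly bunched
            | around the mean.
            | 
            |     import networkx as nx
            | 
            |     # Preferential attachment: new nodes favor already
            |     # well-connected nodes, giving a heavy-tailed
            |     # (power-law-ish) degree distribution.
            |     ba = nx.barabasi_albert_graph(n=100_000, m=2, seed=0)
            | 
            |     # Baseline: same node/edge count, edges placed uniformly
            |     # at random, so degrees concentrate around the mean.
            |     er = nx.gnm_random_graph(n=100_000,
            |                              m=ba.number_of_edges(), seed=0)
            | 
            |     for name, g in [("preferential attachment", ba),
            |                     ("uniform random", er)]:
            |         degs = sorted((d for _, d in g.degree()),
            |                       reverse=True)
            |         mean = sum(degs) / len(degs)
            |         print(f"{name:24s} mean degree {mean:.1f}  "
            |               f"top degrees {degs[:5]}")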
        
       | delichon wrote:
       | I think this means that when training a cat detector it's better
       | to have more bobcats and lynx and fewer dogs.
        
       | kouru225 wrote:
       | And winner of Best Title of the Year goes to:
        
         | bbor wrote:
         | I'm glad I'm not the only one initially drawn in by the title!
          | As the old meme goes:
         | 
         | > If you can't describe your job in 3 Words, you have a BS job:
         | 
         | > 1. "I catch fish" _Real job!_
         | 
         | > 2. "I drive taxis" _Real job!_
         | 
         | > 3. "I grok at the edge of linear separability" _BS Job!_
        
           | sva_ wrote:
           | > ai researcher
        
       | alizaid wrote:
       | Grokking is fascinating! It seems tied to how neural networks hit
       | critical points in generalization. Could this concept also
       | enhance efficiency in models dealing with non-linearly separable
       | data?
        
         | wslh wrote:
          | Could you expand on grokking [1]? I superficially understand
          | what it means, but it seems more important than the article
          | conveys.
         | 
         | Particularly:
         | 
         | > Grokking can be understood as a phase transition during the
         | training process. While grokking has been thought of as largely
         | a phenomenon of relatively shallow models, grokking has been
         | observed in deep neural networks and non-neural models and is
         | the subject of active research.
         | 
         | Does that paper add more insights?
         | 
         | [1]
         | https://en.wikipedia.org/wiki/Grokking_(machine_learning)?wp...
        
           | tanananinena wrote:
           | This is probably the most interesting (and insightful) paper
           | on grokking I've read recently:
           | https://arxiv.org/abs/2402.15555
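            | 
            | And if you'd rather watch the phenomenon than read about it,
            | here's a minimal sketch of the kind of toy setup people use
            | to elicit grokking (my own code, not from that paper;
            | whether and when the test accuracy jumps depends a lot on
            | the train fraction, the weight decay, and the seed): an
            | over-parameterized MLP on modular addition, trained
            | full-batch with strong weight decay. Train accuracy
            | typically saturates early while test accuracy can lag far
            | behind before climbing.
            | 
            |     import torch
            |     import torch.nn as nn
            | 
            |     P = 97  # task: predict (a + b) % P from the pair (a, b)
            |     torch.manual_seed(0)
            |     pairs = torch.cartesian_prod(torch.arange(P),
            |                                  torch.arange(P))
            |     labels = (pairs[:, 0] + pairs[:, 1]) % P
            |     perm = torch.randperm(len(pairs))
            |     n_train = int(0.4 * len(pairs))  # small train fraction
            |     train_idx, test_idx = perm[:n_train], perm[n_train:]
            | 
            |     class Net(nn.Module):
            |         def __init__(self, dim=128):
            |             super().__init__()
            |             self.emb = nn.Embedding(P, dim)
            |             self.mlp = nn.Sequential(
            |                 nn.Linear(2 * dim, 256), nn.ReLU(),
            |                 nn.Linear(256, P))
            |         def forward(self, ab):
            |             return self.mlp(self.emb(ab).flatten(1))
            | 
            |     model = Net()
            |     opt = torch.optim.AdamW(model.parameters(), lr=1e-3,
            |                             weight_decay=1.0)
            |     loss_fn = nn.CrossEntropyLoss()
            | 
            |     def acc(idx):
            |         with torch.no_grad():
            |             pred = model(pairs[idx]).argmax(-1)
            |             return (pred == labels[idx]).float().mean().item()
            | 
            |     for step in range(20_001):  # full-batch training
            |         opt.zero_grad()
            |         loss_fn(model(pairs[train_idx]),
            |                 labels[train_idx]).backward()
            |         opt.step()
            |         if step % 1000 == 0:
            |             print(f"step {step:6d}  "
            |                   f"train {acc(train_idx):.2f}  "
            |                   f"test {acc(test_idx):.2f}")
            | 
            | This runs in a few minutes on a CPU; if nothing interesting
            | happens, the usual knobs to turn are the training fraction
            | and the weight decay.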
        
       | bbor wrote:
       | Wow, fascinating stuff and "grokking" is news to me. Thanks for
       | sharing! In typical HN fashion, I'd like to come in as an amateur
        | and nitpick the terminology/philosophy choices of this nascent-
        | yet-burgeoning subfield:
        | 
        |     We begin by examining the optimal generalizing solution, that
        |     indicates the network has properly learned the task... the
        |     network should put all points in R^d on the same side of the
        |     separating hyperplane, or in other words, push the decision
        |     boundary to infinity... Overfitting occurs when the hyperplane
        |     is only far enough from the data to correctly classify all the
        |     training samples.
       | 
        | This is such a dumb idea at first glance; I'm so impressed that
        | they pushed past that and used it for serious insights. It truly
       | is a kind of atomic/fundamental/formalized/simplified way to
       | explore overfitting on its own.
       | 
       | Ultimately their thesis, as I understand it from the top of page
       | 5, is roughly these two steps (with some slight rewording):
        | 
        |     [I.] We call a training set separable if there exists a vector
        |     [that divides the data, like a 2D vector from the origin
        |     dividing two sets of 2D points]... The training set is almost
        |     surely separable [when there's twice as many dimensions as
        |     there are points, and almost surely inseparable otherwise]...
       | 
       | Again, dumb observation that's obvious in hindsight, which makes
       | it all the more impressive that they found a use for it. This is
       | how paradigm shifts happen! An alternate title for the paper
       | could've been "A Vector Is All You Need (to understand
        | grokking)". Ok but assuming I understood the setup right, here's
        | the actual finding:
        | 
        |     [II.] [Given infinite training time,] the model will always
        |     overfit for separable training sets[, and] for inseparable
        |     training sets the model will always generalize perfectly.
        |     However, when the training set is on the verge of
        |     separability... dynamics may take arbitrarily long times to
        |     reach the generalizing solution [rather than overfitting].
        |     **This is the underlying mechanism of grokking in this
        |     setting**.
       | 
        | Or, in other words from Appendix B:
        | 
        |     grokking occurs near critical points in which solutions
        |     exchange stability and dynamics are generically slow
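        | 
        | Since claim [I.] sounded almost too clean, I hacked up a quick
        | sanity check (my own toy code; the exact critical ratio depends
        | on the paper's precise setup, which I may well be garbling
        | above): draw n standard Gaussian points in d dimensions and ask
        | a linear program whether some vector w puts all of them strictly
        | on one side of a hyperplane through the origin. The probability
        | of separability really does flip from ~1 to ~0 over a narrow
        | window (around n = 2d in this particular toy version, per the
        | classic Wendel/Cover counting argument):
        | 
        |     import numpy as np
        |     from scipy.optimize import linprog
        | 
        |     def separable_from_origin(X):
        |         """Is there a w with X @ w >= 1, i.e. all rows of X
        |         strictly on one side of a hyperplane through the
        |         origin? Any strictly positive margin can be rescaled
        |         to 1, so this is just an LP feasibility problem."""
        |         n, d = X.shape
        |         res = linprog(c=np.zeros(d), A_ub=-X, b_ub=-np.ones(n),
        |                       bounds=[(None, None)] * d, method="highs")
        |         return res.status == 0  # 0 = feasible optimum found
        | 
        |     rng = np.random.default_rng(0)
        |     d, trials = 20, 200
        |     for n in (10, 20, 30, 40, 50, 60, 80):
        |         hits = sum(separable_from_origin(
        |                        rng.standard_normal((n, d)))
        |                    for _ in range(trials))
        |         print(f"d={d}, n={n:3d}: "
        |               f"P(separable) ~ {hits / trials:.2f}")
        | 
        | (The LP is just the least code; a perceptron or a hard-margin
        | SVM would answer the same feasibility question.)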
       | 
       | Assuming I understood that all correctly, this finally brings me
       | to my philosophical critique of "grokking", which ends up being a
        | compliment to this paper: grokking is just a modal transition in
       | algorithmic structure, which is exactly why it's seemingly
       | related to topics as diverse as physical phase changes and the
       | sudden appearance of large language models. I don't blame the
       | statisticians for not recognizing it, but IMO they're capturing
       | something far more fundamental than a behavioral quirk in some
       | mathematical tool.
       | 
       | Non-human animals (and maybe some really smart plants) obviously
       | are capable of "learning" in some human-like way, but it rarely
       | surpasses the basics of Pavlovian conditioning: they delineate
       | quantitative objects in their perceptive field (as do unconscious
       | particles when they mechanically interact with each other),
       | computationally attach qualitative symbols to them based on
       | experience (as do plants), and then calculate relations/groups of
       | that data based on some evolutionarily-tuned algorithms (again, a
       | capability I believe to be unique to animals and weird plants).
       | Humans, on the other hand, not only perform calculations about
       | our immediate environment, but also freely engage in meta-
       | calculations -- this is why our smartest primate relatives are
       | still incapable of posing questions, yet humans pose them
       | naturally from an extremely young age.
       | 
       | Details aside, my point is that different orders of cognition are
       | different not just in some _quantitative_ way, like an increase
        | in linear efficiency, but rather in a _qualitative_ -- or, to use
        | the hot lingo, _emergent_ -- way. In my non-credentialed opinion,
        | this paper is a beautiful formalization of that phenomenon, even
        | though it necessarily is stuck at the bottom of the stack, so to
        | speak, describing the switch in cognitive capacity from direct
       | quantification to symbolic qualification.
       | 
       | It's very possible I'm clouded by the need to confirm my priors,
        | but if not, I hope this paper sees wide use among ML
       | researchers as a clean, simplified exposition of what we're all
       | really trying to do here on a fundamental level. A generalization
       | of generalization, if you will!
       | 
       | Alon, Noam, and Yohai, if you're in here, congrats for devising
       | such a dumb paper that is all the more useful & insightful
       | because of it. I'd love to hear your hot takes on the connections
       | between grokking, cognition, and physics too, if you have any
       | that didn't make the cut!
        
         | anigbrowl wrote:
         | It's just another garbage buzzword. We already have perfectly
         | good words for this like _understanding_ and _comprehension_.
          | The use of _grokking_ is a form of in-group signaling to get
          | buy-in from other Cool Kids Who Like Robert Heinlein, but it's
         | so obviously a nerdspeak effort at branding that it's probably
         | never going to catch on outside of that demographic, no matter
         | how fetch it is.
        
           | kaibee wrote:
           | > It's just another garbage buzzword. We already have
           | perfectly good words for this like understanding and
           | comprehension.
           | 
           | Yeah, try telling people that NNs contain actual
           | understanding and comprehension. That won't be controversial
           | at all.
        
       | PoignardAzur wrote:
       | I feel super confused about this paper.
       | 
       | Apparently their training goal is for the model to ignore all
       | input values and output a constant. Sure.
       | 
       | But then they outline some kind of equation of when grokking will
       | or won't happen, and... I don't get it?
       | 
       | For a goal that simple, won't any neural network with any amount
       | of weight decay eventually converge to a stack of all-zeros
       | matrices (plus a single bias)?
       | 
       | What is this paper even saying, on an empirical level?
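        | 
        | To poke at my own question, here's a tiny numpy experiment
        | (purely my own toy setup: a bare linear model with a bias,
        | logistic loss on an "always output 1" task, and weight decay on
        | the weights but not on the bias; I make no claim that this
        | matches the paper's setting). Plain gradient descent does drive
        | the weights toward zero while the bias slowly grows, i.e. the
        | "all-zeros plus a bias" solution I had in mind, so I assume the
        | interesting dynamics in the paper show up when the
        | regularization is weak or absent.
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     n, d = 50, 20
        |     X = rng.standard_normal((n, d))  # the target is simply "1"
        |     lam, lr = 0.01, 0.1              # decay on w only, not b
        |     w, b = np.zeros(d), 0.0
        | 
        |     for step in range(20_001):
        |         margins = X @ w + b
        |         if step % 5000 == 0:
        |             loss = (np.mean(np.log1p(np.exp(-margins)))
        |                     + 0.5 * lam * w @ w)
        |             print(f"step {step:6d}  "
        |                   f"||w|| = {np.linalg.norm(w):.4f}  "
        |                   f"b = {b:.2f}  loss = {loss:.4f}")
        |         g = -1.0 / (1.0 + np.exp(margins))  # dloss/dmargin
        |         w -= lr * (X.T @ g / n + lam * w)
        |         b -= lr * g.mean()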
        
       ___________________________________________________________________
       (page generated 2024-10-11 23:01 UTC)