[HN Gopher] Grokking at the edge of linear separability
___________________________________________________________________
Grokking at the edge of linear separability
Author : marojejian
Score : 67 points
Date : 2024-10-11 16:09 UTC (6 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| diwank wrote:
| Grokking is so cool. What does it even mean that grokking
| exhibits similarities to criticality? As in, what are the
| philosophical ramifications of this?
| hackinthebochs wrote:
| Criticality is the boundary between order and chaos, which also
| happens to be the boundary at which information dynamics and
| computation can occur. Think of it like this: a highly ordered
| structure cannot carry much information because there are few
| degrees of freedom. The other extreme is too many degrees of
| freedom in a chaotic environment; any correlated state quickly
| gets destroyed by entropy. The point at which the two dynamics
| are balanced is where computation can occur. This point has
| enough dynamics that state can change in a controlled manner,
| and enough order so that state can reliably persist over time.
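|
| A toy way to see this, nothing rigorous and entirely my own
| illustration: run elementary cellular automata from Wolfram's
| ordered, chaotic, and "complex" classes, flip one cell in the
| initial state, and watch how far the perturbation spreads. The
| ordered rule erases it, the chaotic rule smears it across the
| whole system, and the edge-of-chaos rule (Rule 110, famously
| Turing-complete) carries it as long-lived structure:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|
|     def step(state, rule):
|         # one synchronous update of an elementary CA, periodic boundary
|         left, right = np.roll(state, 1), np.roll(state, -1)
|         idx = 4 * left + 2 * state + right   # 3-cell neighborhood -> 0..7
|         table = (rule >> np.arange(8)) & 1   # Wolfram rule number -> lookup table
|         return table[idx]
|
|     N, T = 400, 200
|     init = rng.integers(2, size=N)
|     for rule, label in ((250, "ordered"), (110, "edge of chaos"), (30, "chaotic")):
|         a, b = init.copy(), init.copy()
|         b[N // 2] ^= 1                       # single-cell perturbation
|         for _ in range(T):
|             a, b = step(a, rule), step(b, rule)
|         diff = np.count_nonzero(a != b)
|         print(f"rule {rule:3d} ({label}): {diff}/{N} cells differ after {T} steps")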
|
| I would speculate that the connection between grokking and
| criticality is that grokking represents the point at which a
| network maximizes the utility of information in service to
| prediction. This maximum would be when dynamics and rigidity
| are finely tuned to the constraints of the problem the network
| is solving, when computation is being leveraged to maximum
| effect. Presumably this maximum leverage of computation is the
| point of ideal generalization.
| Agingcoder wrote:
| This looks very interesting. Would you have references? (Not
| necessarily on grokking, but about the part where computation
| can occur only when the right balance is found.)
| hackinthebochs wrote:
| Hard to pin down a single citation of that point. But some
| good places to start are:
|
| https://en.wikipedia.org/wiki/Critical_brain_hypothesis
|
| https://journals.aps.org/pre/abstract/10.1103/PhysRevE.79.0
| 4... (on sci-hub)
| soulofmischief wrote:
| A scale-free network is one whose degree distribution follows
| a power law. [0]
|
| Self-organized criticality describes a phenomenon where
| certain complex systems naturally evolve toward a critical
| state where they exhibit power-law behavior and scale
| invariance. [1]
|
| The power laws observed in such systems suggest they are at
| the edge between order and chaos. In intelligent systems,
| such as the brain, this edge-of-chaos behavior is thought to
| enable maximal adaptability, information processing, and
| optimization.
|
| The brain has been proposed to operate near critical points,
| with neural avalanches following power laws. This allows a
| very small amount of energy to have an outsized impact, the
| key feature of scale-free networks. This phenomenon is a
| natural extension of the stationary action principle.
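|
| If you want to play with self-organized criticality directly,
| the classic toy is the Bak-Tang-Wiesenfeld sandpile. A minimal
| sketch of my own (not from the references below): drop grains
| one at a time, topple any site holding four or more onto its
| neighbors, and the avalanche sizes come out roughly power-law
| distributed without tuning any parameter:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     L = 40
|     grid = np.zeros((L, L), dtype=int)
|     sizes = []                               # topplings per avalanche
|
|     for _ in range(20000):
|         i, j = rng.integers(L, size=2)       # drop one grain at a random site
|         grid[i, j] += 1
|         topples = 0
|         while True:
|             unstable = np.argwhere(grid >= 4)
|             if len(unstable) == 0:
|                 break
|             for r, c in unstable:
|                 grid[r, c] -= 4              # topple: shed 4 grains to neighbors
|                 topples += 1
|                 for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
|                     nr, nc = r + dr, c + dc
|                     if 0 <= nr < L and 0 <= nc < L:
|                         grid[nr, nc] += 1    # grains off the edge are lost
|         if topples:
|             sizes.append(topples)
|
|     # log-binned histogram: roughly a straight line on log-log axes
|     hist, edges = np.histogram(sizes, bins=np.logspace(0, 3, 13))
|     for count, lo, hi in zip(hist, edges[:-1], edges[1:]):
|         print(f"avalanches of size {lo:6.1f}-{hi:6.1f}: {count}")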
|
| [0] https://en.wikipedia.org/wiki/Scale-free_network
|
| [1] https://www.researchgate.net/publication/235741761_Self-
| Orga...
| delichon wrote:
| I think this means that when training a cat detector it's better
| to have more bobcats and lynx and fewer dogs.
| kouru225 wrote:
| And winner of Best Title of the Year goes to:
| bbor wrote:
| I'm glad I'm not the only one initially drawn in by the title!
| As the old meme goes:
|
| > If you can't describe your job in 3 Words, you have a BS job:
|
| > 1. "I catch fish" _Real job!_
|
| > 2. "I drive taxis" _Real job!_
|
| > 3. "I grok at the edge of linear separability" _BS Job!_
| sva_ wrote:
| > ai researcher
| alizaid wrote:
| Grokking is fascinating! It seems tied to how neural networks hit
| critical points in generalization. Could this concept also
| enhance efficiency in models dealing with non-linearly separable
| data?
| wslh wrote:
| Could you expand on grokking [1]? I superficially understand
| what it means, but it seems more important than the article
| conveys.
|
| Particularly:
|
| > Grokking can be understood as a phase transition during the
| training process. While grokking has been thought of as largely
| a phenomenon of relatively shallow models, grokking has been
| observed in deep neural networks and non-neural models and is
| the subject of active research.
|
| Does that paper add more insights?
|
| [1]
| https://en.wikipedia.org/wiki/Grokking_(machine_learning)?wp...
| tanananinena wrote:
| This is probably the most interesting (and insightful) paper
| on grokking I've read recently:
| https://arxiv.org/abs/2402.15555
| bbor wrote:
| Wow, fascinating stuff and "grokking" is news to me. Thanks for
| sharing! In typical HN fashion, I'd like to come in as an amateur
| and nitpick the terminology/philosophy choices of this nascent-
| yet-burgeoning subfield:
|
|     We begin by examining the optimal generalizing solution,
|     that indicates the network has properly learned the task...
|     the network should put all points in R^d on the same side
|     of the separating hyperplane, or in other words, push the
|     decision boundary to infinity... Overfitting occurs when
|     the hyperplane is only far enough from the data to
|     correctly classify all the training samples.
|
| This is such a dumb idea on first glance, I'm so impressed that
| they pushed past that and used it for serious insights. It truly
| is a kind of atomic/fundamental/formalized/simplified way to
| explore overfitting on its own.
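|
| To make that concrete, this is how I'd write the setup down (my
| own notation, and a guessed parametrization f(x) = w.x + b with
| logistic loss on all-positive labels; the paper's exact model
| may differ):
|
|     L(w, b) = \frac{1}{n} \sum_{i=1}^{n}
|               \log\!\left(1 + e^{-(w \cdot x_i + b)}\right),
|     \qquad y_i = +1 \text{ for all } i
|
|     \text{generalize: } b / \lVert w \rVert \to \infty
|     \text{ (the boundary leaves all of } \mathbb{R}^d
|     \text{ on the positive side)}
|
|     \text{overfit: } \lVert w \rVert \to \infty
|     \text{ with } b / \lVert w \rVert \text{ finite, just large}
|     \text{ enough that } w \cdot x_i + b > 0
|     \text{ on the training set}
|
| (Which of those two runaway directions plain gradient descent
| actually picks is, as far as I can tell, exactly where the
| separability condition below comes in.)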
|
| Ultimately their thesis, as I understand it from the top of page
| 5, is roughly these two steps (with some slight rewording):
|
|     [I.] We call a training set separable if there exists a
|     vector [that divides the data, like a 2D vector from the
|     origin dividing two sets of 2D points]... The training set
|     is almost surely separable [when there are fewer than twice
|     as many training points as dimensions, and almost surely
|     inseparable otherwise]...
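|
| That threshold is easy to poke at numerically. A quick sketch of
| my own (not code from the paper), using an LP feasibility check
| for "some w puts every training point strictly on its positive
| side":
|
|     import numpy as np
|     from scipy.optimize import linprog
|
|     rng = np.random.default_rng(0)
|
|     def separable_from_origin(X):
|         # feasible iff some w satisfies X @ w >= 1 componentwise,
|         # which (by rescaling w) is the same as X @ w > 0
|         n, d = X.shape
|         res = linprog(c=np.zeros(d), A_ub=-X, b_ub=-np.ones(n),
|                       bounds=[(None, None)] * d, method="highs")
|         return res.status == 0               # 0 = feasible solution found
|
|     d, trials = 40, 100
|     for ratio in (1.0, 1.5, 2.0, 2.5, 3.0):
|         n = int(ratio * d)
|         hits = sum(separable_from_origin(rng.standard_normal((n, d)))
|                    for _ in range(trials))
|         print(f"n/d = {ratio:.1f}: separable in {hits}/{trials} trials")
|
| (Right at n = 2d the probability is exactly 1/2, and it tips
| toward always/never on either side as the dimension grows.)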
|
| Again, dumb observation that's obvious in hindsight, which makes
| it all the more impressive that they found a use for it. This is
| how paradigm shifts happen! An alternate title for the paper
| could've been "A Vector Is All You Need (to understand
| grokking)". Ok but assuming I understood the setup right, here's
| the actual finding:
|
|     [II.] [Given infinite training time,] the model will always
|     overfit for separable training sets[, and] for inseparable
|     training sets the model will always generalize perfectly.
|     However, when the training set is on the verge of
|     separability... dynamics may take arbitrarily long times to
|     reach the generalizing solution [rather than overfitting].
|     **This is the underlying mechanism of grokking in this
|     setting**.
|
| Or, in other words from Appendix B:
|
|     grokking occurs near critical points in which solutions
|     exchange stability and dynamics are generically slow
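|
| If anyone wants to poke at that mechanism, here's a little toy I
| put together: logistic regression with a bias on Gaussian inputs
| that all share the label +1, trained by plain full-batch
| gradient descent with no weight decay. This is my own guess at
| the setting, not the authors' code and possibly not their exact
| model. "Generalize" here means the boundary runs off to infinity
| (b/||w|| grows without bound); "overfit" means it parks at a
| finite distance that happens to clear the training points:
|
|     import numpy as np
|     from math import erf, sqrt
|     from scipy.special import expit      # numerically stable sigmoid
|
|     rng = np.random.default_rng(0)
|
|     def run(n, d, steps=1 << 17, lr=0.1):
|         X = rng.standard_normal((n, d))   # training inputs, all labeled +1
|         w, b = np.zeros(d), 0.0
|         for t in range(1, steps + 1):
|             z = X @ w + b
|             g = -expit(-z)                # d/dz of log(1 + e^{-z}) for label +1
|             w -= lr * (g @ X) / n         # full-batch gradient descent
|             b -= lr * g.mean()
|             if (t & (t - 1)) == 0:        # report at powers of two
|                 margin = b / (np.linalg.norm(w) + 1e-12)
|                 # accuracy on fresh Gaussian points = P(w.x + b > 0) = Phi(margin)
|                 acc = 0.5 * (1.0 + erf(margin / sqrt(2.0)))
|                 print(f"n/d={n/d:4.1f}  step={t:6d}  "
|                       f"b/||w||={margin:9.3f}  test_acc={acc:.4f}")
|
|     d = 50
|     for n in (d // 2, 2 * d, 8 * d):      # separable / near-critical / inseparable
|         run(n, d)
|         print()
|
| The thing to watch, if the paper's picture carries over to this
| toy, is how long b/||w|| keeps climbing in the near-critical
| case compared to the other two.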
|
| Assuming I understood that all correctly, this finally brings me
| to my philosophical critique of "grokking", which ends up being a
| complement to this paper: grokking is just a modal transition in
| algorithmic structure, which is exactly why it's seemingly
| related to topics as diverse as physical phase changes and the
| sudden appearance of large language models. I don't blame the
| statisticians for not recognizing it, but IMO they're capturing
| something far more fundamental than a behavioral quirk in some
| mathematical tool.
|
| Non-human animals (and maybe some really smart plants) obviously
| are capable of "learning" in some human-like way, but it rarely
| surpasses the basics of Pavlovian conditioning: they delineate
| quantitative objects in their perceptive field (as do unconscious
| particles when they mechanically interact with each other),
| computationally attach qualitative symbols to them based on
| experience (as do plants), and then calculate relations/groups of
| that data based on some evolutionarily-tuned algorithms (again, a
| capability I believe to be unique to animals and weird plants).
| Humans, on the other hand, not only perform calculations about
| our immediate environment, but also freely engage in meta-
| calculations -- this is why our smartest primate relatives are
| still incapable of posing questions, yet humans pose them
| naturally from an extremely young age.
|
| Details aside, my point is that different orders of cognition are
| different not just in some _quantitative_ way, like an increase
| in linear efficiency, but rather in a _qualitative_ (or, to use
| the hot lingo, _emergent_) way. In my non-credentialed opinion,
| this paper is a beautiful formalization of that phenomenon, even
| though it necessarily is stuck at the bottom of the stack, so to
| speak, describing the switch in cognitive capacity from direct
| quantification to symbolic qualification.
|
| It's very possible I'm clouded by the need to confirm my priors,
| but if not, I hope to see this paper see wide use among ML
| researchers as a clean, simplified exposition of what we're all
| really trying to do here on a fundamental level. A generalization
| of generalization, if you will!
|
| Alon, Noam, and Yohai, if you're in here, congrats for devising
| such a dumb paper that is all the more useful & insightful
| because of it. I'd love to hear your hot takes on the connections
| between grokking, cognition, and physics too, if you have any
| that didn't make the cut!
| anigbrowl wrote:
| It's just another garbage buzzword. We already have perfectly
| good words for this like _understanding_ and _comprehension_.
| The use of _grokking_ is a form of in-group signaling to get
| buy-in from other Cool Kids Who Like Robert Heinlein, but it's
| so obviously a nerdspeak effort at branding that it's probably
| never going to catch on outside of that demographic, no matter
| how fetch it is.
| kaibee wrote:
| > It's just another garbage buzzword. We already have
| perfectly good words for this like understanding and
| comprehension.
|
| Yeah, try telling people that NNs contain actual
| understanding and comprehension. That won't be controversial
| at all.
| PoignardAzur wrote:
| I feel super confused about this paper.
|
| Apparently their training goal is for the model to ignore all
| input values and output a constant. Sure.
|
| But then they outline some kind of equation of when grokking will
| or won't happen, and... I don't get it?
|
| For a goal that simple, won't any neural network with any amount
| of weight decay eventually converge to a stack of all-zeros
| matrices (plus a single bias)?
|
| What is this paper even saying, on an empirical level?
___________________________________________________________________
(page generated 2024-10-11 23:01 UTC)