[HN Gopher] Do Machine Learning Models Memorize or Generalize?
       ___________________________________________________________________
        
       Do Machine Learning Models Memorize or Generalize?
        
       Author : 1wheel
       Score  : 344 points
       Date   : 2023-08-10 13:56 UTC (9 hours ago)
        
 (HTM) web link (pair.withgoogle.com)
 (TXT) w3m dump (pair.withgoogle.com)
        
       | lewhoo wrote:
        | So, the TLDR could be: they memorize at first and then
        | generalize?
        
       | mjburgess wrote:
       | Statistical learning can typically be phrased in terms of k
       | nearest neighbours
       | 
       | In the case of NNs we have a "modal knn" (memorising) going to a
       | "mean knn" ('generalising') under the right sort of training.
       | 
       | I'd call both of these memorising, but the latter is a kind of
       | weighted recall.
       | 
        | Generalisation as a property of statistical models (i.e., models
       | of conditional freqs) is not the same property as generalisation
       | in the case of scientific models.
       | 
       | In the latter a scientific model is general because it models
       | causally necessary effects from causes -- so, _necessarily_ if X
       | then Y.
       | 
       | Whereas generalisation in associative stats is just about whether
       | you're drawing data from the empirical freq. distribution or
       | whether you've modelled first. In all automated stats the only
       | diff between the "model" and "the data" is some sort of weighted
       | averaging operation.
       | 
        | So in automated stats (i.e., ML/AI) it's really just whether the
       | model uses a mean.
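        | 
        | For concreteness, a toy NumPy sketch of the two readings I have
        | in mind (the mean/mode split and the names are just my framing):
        | 
        |     import numpy as np
        | 
        |     def knn_predict(X_train, y_train, x, k=5, how="mean"):
        |         d = np.linalg.norm(X_train - x, axis=1)
        |         nn = y_train[np.argsort(d)[:k]]
        |         if how == "mean":          # "generalising"
        |             return nn.mean()       # weighted recall
        |         vals, counts = np.unique(nn, return_counts=True)
        |         return vals[np.argmax(counts)]   # "memorising"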
        
         | autokad wrote:
          | I disagree; it feels like you are just fussing over words and
          | not what's happening in the real world. If you were right, a
          | human wouldn't learn anything either, they'd just memorize.
          | 
          | You can look at it by results: I give these models inputs
          | they've never seen before, but they give me outputs that are
          | correct / acceptable.
          | 
          | You can look at it in terms of data: we took petabytes of
          | data, and with an 8 GB model (Stable Diffusion) we can output
          | an image of anything. That's an unheard-of compression, only
          | possible if it's generalizing - not memorizing.
        
         | bippihippi1 wrote:
          | It's been proven that all models learned by gradient descent
          | are approximately equivalent to kernel machines. Interpolation
          | isn't generalization: if there's a new input sufficiently
          | different from the training data, the behaviour is unknown.
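          | 
          | The form the paper means, roughly (its kernel is a "path
          | kernel" built up during training; an RBF stands in here just
          | to show the shape of the predictor):
          | 
          |     import numpy as np
          | 
          |     def rbf(x, xi, gamma=1.0):
          |         return np.exp(-gamma * np.sum((x - xi) ** 2))
          | 
          |     def kernel_machine(x, X_train, a, b=0.0):
          |         # f(x) = sum_i a_i * K(x, x_i) + b
          |         return sum(ai * rbf(x, xi)
          |                    for ai, xi in zip(a, X_train)) + b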
        
           | xapata wrote:
           | One weird trick ...
           | 
           | There's some fox and hedgehog analogy I've never understood.
        
           | visarga wrote:
           | but when the model trains on 13T tokens it is hard to be OOD
        
         | ActivePattern wrote:
         | I'd be curious how much of the link you read.
         | 
          | What they demonstrate is a neural network learning an
          | algorithm that approximates modular addition. The exact
          | workings of this algorithm are explained in the footnotes. The
          | learned algorithm is general -- it is just as valid on unseen
          | inputs as on seen inputs.
          | 
          | There's no memorization going on in this case. It's _actually_
          | approximating the process used to generate the data, which
          | just isn't possible using k nearest neighbors.
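          | 
          | A hand-written version of the trig trick, to give a flavour of
          | it (single frequency here; the trained network spreads this
          | over several frequencies):
          | 
          |     import numpy as np
          | 
          |     def mod_add(a, b, p=67, k=1):
          |         c = np.arange(p)
          |         # cos(2*pi*k*(a+b-c)/p) peaks exactly at
          |         # c == (a + b) mod p when gcd(k, p) == 1
          |         logits = np.cos(2 * np.pi * k * (a + b - c) / p)
          |         return int(np.argmax(logits))
          | 
          |     assert mod_add(50, 60) == (50 + 60) % 67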
        
         | visarga wrote:
         | > Statistical learning can typically be phrased in terms of k
         | nearest neighbours
         | 
         | We have suspected that neural nets are a kind of kNN. Here's a
         | paper:
         | 
         | Every Model Learned by Gradient Descent Is Approximately a
         | Kernel Machine
         | 
         | https://arxiv.org/abs/2012.00152
        
       | [deleted]
        
       | xaellison wrote:
       | what's the TLDR: memorize, or generalize?
        
       | greenflag wrote:
        | It seems the take-home is that weight decay induces sparsity,
        | which helps learn the "true" representation rather than an
        | overfit one. It's interesting that the human brain has a
        | comparable mechanism prevalent in development [1]. I would love
        | to know from someone in the field whether this was the
        | inspiration for weight decay (or, presumably, for the more
        | directly analogous NN pruning [2]).
       | 
       | [1] https://en.wikipedia.org/wiki/Synaptic_pruning [2]
       | https://en.wikipedia.org/wiki/Pruning_(artificial_neural_net...
        
         | tbalsam wrote:
         | ML researcher here wanting to offer a clarification.
         | 
         | L1 induces sparsity. Weight decay explicitly _does not_, as it
         | is L2. This is a common misconception.
         | 
          | Something a lot of people don't know is that weight decay
          | works because, when applied as regularization, it pushes the
          | network toward the MDL (minimum description length), which
          | reduces regret during training.
         | 
         | Pruning in the brain is somewhat related, but because the brain
         | uses sparsity to (fundamentally, IIRC) induce representations
         | instead of compression, it's basically a different motif
         | entirely.
         | 
         | If you need a hint here on this one, think about the implicit
         | biases of different representations and the downstream impacts
         | that they can have on the learned (or learnable)
         | representations of whatever system is in question.
         | 
         | I hope this answers your question.
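          | 
          | A toy demo of the L1 vs L2 point (numbers are arbitrary):
          | decoupled L2 decay shrinks every weight multiplicatively and
          | never hits exactly zero, while an L1 proximal step
          | (soft-thresholding) snaps small weights to zero.
          | 
          |     import numpy as np
          | 
          |     w = np.array([0.8, 0.05, -0.3, 0.01])
          |     lr, lam = 0.1, 0.5
          | 
          |     w_l2 = w * (1 - lr * lam)       # weight decay step
          |     w_l1 = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0)
          | 
          |     print(w_l2)   # all shrunk, none exactly zero
          |     print(w_l1)   # small entries -> exactly zero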
        
           | joaogui1 wrote:
           | That looks interesting, do you know what paper talks about
           | the connection between MDL, regret, and weight decay?
        
             | tbalsam wrote:
             | I would start with Shannon's information theory and the
             | Wikipedia page on L2/the MDL as a decent starting point.
             | 
             | For the first, there are a few good papers that simplify
             | the concepts even further.
        
         | pcwelder wrote:
          | Afaik weight decay is inspired by L2 regularisation, which
          | goes back to linear regression, where L2 regularisation is
          | equivalent to having a Gaussian prior on the weights with zero
          | mean.
          | 
          | Note that L1 regularisation produces much more sparsity, but
          | it doesn't perform as well.
        
           | nonameiguess wrote:
            | This. Weight decay is just a method of dropping most weights
            | to zero, which is a standard technique used by statisticians
            | for regularization purposes for decades. As far as I
            | understand, it goes back at least to Tikhonov, and was
            | mostly called ridge regression in the regression context.
            | Ordinary least squares attempts to minimize the squared L2
            | norm of the residuals. When a system is ill-conditioned,
            | adding a penalty term (usually just a scalar multiple of an
            | identity matrix) and also minimizing the L2 norm of the
            | weights biases the model to produce mostly near-zero
            | weights. This helps with underdetermined systems and gives
            | a better-conditioned model matrix that is actually possible
            | to solve numerically without underflow.
           | 
           | It's kind of amazing to watch this from the sidelines, a
           | process of engineers getting ridiculously impressive results
           | from some combo of sheer hackery and ingenuity, great data
           | pipelining and engineering, extremely large datasets,
           | extremely fast hardware, and computational methods that scale
           | very well, but at the same time, gradually relearning lessons
           | and re-inventing techniques that were perfected by
           | statisticians over half a century ago.
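            | 
            | The regression-era recipe, as a sketch (nothing neural-net
            | specific here, just the standard ridge closed form):
            | 
            |     # w = argmin ||Xw - y||^2 + lam * ||w||^2
            |     #   = (X^T X + lam * I)^-1 X^T y
            |     import numpy as np
            | 
            |     def ridge(X, y, lam=1.0):
            |         d = X.shape[1]
            |         A = X.T @ X + lam * np.eye(d)
            |         return np.linalg.solve(A, X.T @ y)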
        
             | whimsicalism wrote:
              | this comment is so off base. first off, no, L2 does not
              | encourage near-zero weights; second, they are not
              | relearning, everyone already knew what L1/L2 penalties are
        
             | tbalsam wrote:
             | L1 drops weights to zero, L2 biases towards Gaussianality.
             | 
             | It's not always relearning lessons or people entirely
             | blindly trying things either, many researchers use the
             | underlying math to inform decisions for network
             | optimization. If you're seeing that, then that's probably a
             | side of the field where people are newer to some of the
             | math behind it, and that will change as things get more
             | established.
             | 
             | The underlying mathematics behind these kinds of systems
             | are what has motivated a lot of the improvements in hlb-
             | CIFAR10, for example. I don't think I would have been able
             | to get there without sitting down with the fundamentals,
             | planning, thinking, and working a lot, and then executing.
             | There is a good place for blind empirical research too, but
             | it loses its utility past a certain point of overuse.
        
         | visarga wrote:
          | The inspiration for weight decay was to reduce the model's
          | capacity to memorize until it matches the complexity of the
          | task, no more, no less. A model more complex than the task
          | over-fits; a less complex one under-fits. Got to balance them
          | out.
          | 
          | But the best cure for over-fitting is to make the dataset
          | larger and ensure data diversity. LLMs have datasets so large
          | they usually train for only one epoch.
        
           | crdrost wrote:
           | And there have been a lot of approaches to do this, my
           | favorite one being the idea that maybe if we just randomly
           | zap out some of the neurons while we train the rest, that
           | forcing it to acquire that redundancy might privilege
           | structured representations over memorization. Just always
           | seemed like some fraternity prank, "if you REALLY know the
           | tenets of Delta Mu Beta you can recite them when drunk after
           | we spin you around in a circle twelve times fast!"
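            | 
            | (That's dropout. A minimal sketch of the inverted-dropout
            | variant, assuming NumPy, just to pin the idea down:)
            | 
            |     import numpy as np
            | 
            |     def dropout(h, p=0.5, training=True):
            |         if not training:
            |             return h   # test time: full network
            |         keep = np.random.rand(*h.shape) > p
            |         return h * keep / (1 - p)  # keep expectation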
        
             | whimsicalism wrote:
             | https://nitter.net/Yampeleg/status/1688441683946377216
        
           | kaibee wrote:
           | > But the best cure for over-fitting is to make the dataset
           | larger and ensure data diversity.
           | 
           | This is also good life advice.
        
           | nightski wrote:
           | It sounds nice in theory, but the data itself could be
           | problematic. There is no temporal nature to it. You can have
           | duplicate data points, many data points that are closely
           | related but describe the same thing/event/etc.. So while only
           | showing the model each data point once ensures you do not
           | introduce any extra weight on a data point, if the dataset
           | itself is skewed it doesn't help you at all.
           | 
            | Just by trying to make the dataset diverse you could skew
            | things to not reflect reality. I just don't think enough
            | attention has been paid to the data, and too much has been
            | paid to the model. But I could be very wrong.
           | 
           | There is a natural temporality to the data humans receive.
           | You can't relive the same moment twice. That said, human
           | intelligence is on a scale too and may be affected in the
           | same way.
        
             | visarga wrote:
             | > I just don't think enough attention has been paid to the
             | data, and too much the model.
             | 
             | I wholly agree. Everyone is blinded by models - GPT4 this,
             | LLaMA2 that - but the real source of the smarts is in the
              | dataset. Why do models, no matter how their architectures
              | are tweaked, all learn roughly the same abilities from the
              | same data? Why are humans all able to learn the same
              | skills when every brain is quite different? It was the
              | data, not the model.
             | 
             | And since we are exhausting all the available quality text
             | online we need to start engineering new data with LLMs and
             | validation systems. AIs need to introspect more into their
             | training sets, not just train to reproduce them, but
             | analyse, summarise and comment on them. We reflect on our
             | information, AIs should do more reflection before learning.
             | 
             | More fundamentally, how are AIs going to evolve past human
             | level unless they make their own data or they collect data
             | from external systems?
        
               | Salgat wrote:
               | This is definitely current models' biggest issue. You're
               | training a model against millions of books worth of data
               | (which would take a human tens of thousands of lifetimes)
               | to achieve a superficial level of conversational ability
               | to match a human, which can consume at most 3 novels a
               | day without compromising comprehension. Current models
               | are terribly inefficient when it comes to learning from
               | data.
        
               | whimsicalism wrote:
               | You have to count the training process from the origin of
               | the human brain imo, not from the birth of any individual
               | human.
               | 
               | Neural nets look much more competitive by that standard.
        
               | imtringued wrote:
               | They are inefficient by design. Gradient descent and
               | backpropagation scale poorly, but they work and GPUs are
               | cheap, so here we are.
        
               | og_kalu wrote:
                | Modern LLMs are nowhere near the scale of the human
                | brain however you want to slice things, so "terribly
                | inefficient" is very arguable. Also, language skills
                | seemingly take much less data and scale when you aren't
                | trying to have the model learn the sum total of human
                | knowledge.
               | https://arxiv.org/abs/2305.07759
        
               | Salgat wrote:
               | Scale is a very subjective thing since one is analog (86B
               | neurons) and one is digital (175B parameters).
               | Additionally, consider how many compute hours GPT 3 took
               | to train (10,000 V100s were set aside for exclusive
               | training of GPT 3). I'd say that GPT 3 scale vastly
               | dwarfs the human brain, which runs at a paltry 12 watts.
        
               | ben_w wrote:
               | > It was the data, not the model
               | 
               | It's _both_.
               | 
               | It's clearly impossible to learn how to translate Linear
               | A into modern English using only content written in pure
               | Japanese that never references either.
               | 
               | Yet also, none of the algorithms before Transformers were
               | able to first ingest the web, then answer a random
               | natural language question in any domain -- closest was
               | Google etc. matching on indexed keywords.
               | 
               | > how are AIs going to evolve past human level unless
               | they make their own data?
               | 
               | Who says they can't make their own data?
               | 
               | Both _a priori_ (by development of  "new" mathematical
               | and logical tautological deductions), and _a posteriori_
               | by devising, and observing the results of, various
               | experiments.
               | 
               | Same as us, really.
        
               | whimsicalism wrote:
               | > Yet also, none of the algorithms before Transformers
               | were able to first ingest the web, then answer a random
               | natural language question in any domain -- closest was
               | Google etc. matching on indexed keywords.
               | 
               | Wrong, recurrent models were able to do this, just not as
               | well.
        
               | riversflow wrote:
               | I see this brought up consistently on the topic of AI
               | take-off/X-risk.
               | 
               | How does an AI _language model_ devise an experiment and
               | observe the results? The language model is only trained
               | on what's already known, I'm extremely incredulous that
               | this language model technique can actually reason a
               | genuinely novel hypothesis.
               | 
                | An LLM is a series of weights sitting in the RAM of a
                | GPU cluster; it's really just a fancy prediction
                | function. It doesn't have the sort of biological
                | imperatives (a result of being completely independent
                | beings) or entropy that drive living systems.
               | 
                | Moreover, if we consider how it works for humans, people
                | have to _think_ about problems. Do we even have a model,
                | or even an idea, of what "thinking" is? Meanwhile,
                | science is a looping process that mostly requires a
                | physical element (testing/verification). So unless we
                | make some radical breakthroughs in general-purpose
                | robotics, as well as overcome the thinking problem, I
                | don't see how AI can do some sort of tech
                | breakout/runaway.
        
               | ben_w wrote:
               | Starting with the end so we're on the same page about
               | framing the situation:
               | 
                | > I don't see how AI can do some sort of tech
                | breakout/runaway.
               | 
               | I'm expecting (in the mode, but with a wide and shallow
               | distribution) a roughly 10x increase in GDP growth, from
                | increased automation etc., _not_ a singularity/foom.
               | 
               | I think the main danger is bugs and misuse (both
               | malicious and short-sighted).
               | 
               | -
               | 
               | > How does an AI language model devise an experiment and
               | observe the results?
               | 
               | Same way as Helen Keller.
               | 
               | Same way scientists with normal senses do for data
               | outside human sense organs, be that the LHC or nm/s^2
               | acceleration of binary stars or gravity waves (or the
               | confusingly similarly named but very different
               | gravitational waves).
               | 
               | > The language model is only trained on what's already
               | known, I'm extremely incredulous that this language model
               | technique can actually reason a genuinely novel
               | hypothesis.
               | 
               | Were you, or any other human, trained on things
               | _unknown_?
               | 
               | If so, how?
               | 
               | > A LLM is a series of weights sitting in the ram of GPU
               | cluster, it's really just a fancy prediction function. It
               | doesn't have the sort of biological imperatives (a result
               | of being complete independent beings) or entropy that
               | drive living systems.
               | 
               | Why do you believe that biological imperatives are in any
               | way important?
               | 
                | I can't see how any of a desire to eat, shag, fight, run
                | away, or freeze up... helps with either the scientific
                | method or pure maths.
                | 
                | Even the "special sauce" that humans have over other
                | animals didn't lead to _any_ of us doing the scientific
                | method until very recently, and _most_ of us still
                | don't.
               | 
               | > Do we even have a model or even an idea about what
               | "thinking" is?
               | 
               | AFAIK, only in terms of output, not qualia or anything
               | like that.
               | 
               | Does it matter if the thing a submarine does is swimming,
               | if it gets to the destination? LLMs, for all their
               | mistakes and their... utterly inhuman minds and
               | transhuman training experience... can do many things
               | which would've been considered "implausible" even in a
               | sci-fi setting a decade ago.
               | 
               | > So unless we make some radical breakthroughs in general
               | purpose robotics
               | 
               | I don't think it needs to be _general_ , as labs are
               | increasingly automated even without general robotics.
        
               | imtringued wrote:
               | It's not just a series of weights. It is an unchanging
               | series of weights. This isn't necessarily artificial
               | intelligence. It is the intelligence of the dead.
        
         | BaseballPhysics wrote:
         | The human brain has synaptic pruning. The exact purpose of it
         | is theorized but not actually understood, and it's a gigantic
         | leap to assume some sort of analogous mechanism between LLMs
         | and the human brain.
        
         | [deleted]
        
       | djha-skin wrote:
        | How is this even a shock?
        | 
        | Anyone who has so much as taken a class on this knows that even
        | the simplest of perceptron networks, decision trees, or any
        | other form of machine learning model generalizes. That's why we
        | use them. If they don't, it's called _overfitting_ [1], where
        | the model is so accurate on the training data that its
        | inferential ability on new data suffers.
       | 
       | I know that the article might be talking about a higher form of
       | generalization with LLMs or whatever, but I don't see why the
       | same principle of "don't overfit the data" wouldn't apply to that
       | situation.
       | 
       | No, really: what part of their base argument is novel?
       | 
       | 1: https://en.wikipedia.org/wiki/Overfitting
        
         | halflings wrote:
         | The interesting part is the sudden generalization.
         | 
         | Simple models predicting simple things will generally slowly
         | overfit, and regularization keeps that overfitting in check.
         | 
         | This "grokking" phenomenon is when a model first starts by
         | aggressively overfitting, then gradually prunes unnecessary
         | weights until it _suddenly_ converges on the one generalizable
         | combination of weights (as it 's the only one that both solves
         | the training data _and_ minimizes weights).
         | 
         | Why is this interesting? Because you could argue that this
         | justifies using overparametrized models with high levels of
         | regularization; e.g. models that will tend to aggressively
         | overfit, but over time might converge to a better solution by
         | gradual pruning of weights. The traditional approach is not to
         | do this, but rather to use a simpler model (which would
         | initially generalize better, but due to its simplicity might
         | not be able to learn the underlying mechanism and reach higher
         | accuracy).
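          | 
          | A rough sketch of the kind of setup where this gets reported
          | (sizes, split and hyperparameters here are illustrative, not
          | the article's): a tiny network on modular addition, most pairs
          | held out, strong weight decay, trained far past the point
          | where train accuracy hits 100%.
          | 
          |     import torch, torch.nn as nn
          |     import torch.nn.functional as F
          | 
          |     p = 67
          |     ar = torch.arange(p)
          |     pairs = torch.cartesian_prod(ar, ar)
          |     y = (pairs[:, 0] + pairs[:, 1]) % p
          |     x = F.one_hot(pairs, p).float().flatten(1)
          |     idx = torch.randperm(len(y))
          |     tr, te = idx[:len(y) // 3], idx[len(y) // 3:]
          | 
          |     model = nn.Sequential(nn.Linear(2 * p, 128),
          |                           nn.ReLU(), nn.Linear(128, p))
          |     opt = torch.optim.AdamW(model.parameters(),
          |                             lr=1e-3, weight_decay=1.0)
          |     for step in range(50_000):
          |         loss = F.cross_entropy(model(x[tr]), y[tr])
          |         opt.zero_grad(); loss.backward(); opt.step()
          |         if step % 1000 == 0:
          |             pred = model(x[te]).argmax(1)
          |             acc = (pred == y[te]).float().mean()
          |             print(step, float(loss), float(acc))
          |     # hope: train loss ~0 early, test acc jumps much later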
        
           | timy2shoes wrote:
           | It's interesting that the researchers chose example problems
           | where the minimum norm solution is the best at
           | generalization. What if that's not the case?
        
         | godelski wrote:
          | It's because you over-generalized your simple understanding.
          | There is a lot more nuance to that thing you are calling
          | overfitting (and underfitting). We do not know why or when it
          | happens in all cases. We do know cases where it happens and
          | why, but that doesn't mean there aren't others we don't
          | understand. There is still a lot of interpretation needed. How
          | much was overfit? How much underfit? Can these happen at the
          | same time? (yes) Which layers do this, what causes it, and how
          | can we avoid it? Reading the article shows you that this is
          | far from a trivial task. This is all before we even introduce
          | the concept of sudden generalization. Once we do that, all
          | these questions start again, but under a completely different
          | context that is even more surprising. We also need to talk
          | about new aspects like the rate of generalization, the rate of
          | memorization, and what affects these.
          | 
          | tldr: don't oversimplify things: you underfit
         | 
         | P.S. please don't fucking review. Your complaints aren't
         | critiques.
        
       | tipsytoad wrote:
       | Seriously, are they only talking about weight decay? Why so
       | complicated?
        
       | SimplyUnknown wrote:
        | First of all, great blog post with great examples. Reminds me of
        | what distill.pub used to be.
       | 
       | Second, the article correctly states that typically L2 weight
       | decay is used, leading to a lot of weights with small magnitudes.
       | For models that generalize better, would it then be better to
       | always use L1 weight decay to promote sparsity in combination
       | with longer training?
       | 
       | I wonder whether deep learning models that only use sparse
       | fourier features rather than dense linear layers would work
       | better...
        
         | qumpis wrote:
          | Slightly related, but the sparsity-inducing activation
          | function ReLU is often used in neural networks.
        
         | medium_spicy wrote:
          | Short answer: if the inputs can be represented well in the
          | Fourier basis, yes. I have a patent in process on this,
          | fingers crossed.
          | 
          | Longer answer: deep learning models are usually trying to find
          | the best nonlinear basis in which to represent inputs; if the
          | inputs are well-represented (read that as: can be sparsely
          | represented) in some basis known a priori, it usually helps to
          | just put them in that basis, e.g., by FFT'ing RF signals.
         | 
         | The challenge is that the overall-optimal basis might not be
         | the same as those of any local minima, so you've got to do some
         | tricks to nudge the network closer.
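          | 
          | E.g., in the simplest case, something like handing the model
          | magnitude and phase instead of raw samples (a sketch; real
          | pipelines window, normalize, etc.):
          | 
          |     import numpy as np
          | 
          |     def fourier_features(signal):
          |         spec = np.fft.rfft(signal)
          |         return np.concatenate([np.abs(spec),
          |                                np.angle(spec)])
          | 
          |     t = np.linspace(0, 1, 256)
          |     x = fourier_features(np.sin(2 * np.pi * 5 * t))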
        
       | superkuh wrote:
       | There were no auto-discovery RSS/Atom feeds in the HTML, no links
       | to the RSS feed anywhere, but by guessing at possible feed names
       | and locations I was able to find the "Explorables" RSS feed at:
       | https://pair.withgoogle.com/explorables/rss.xml
        
       | flyer_go wrote:
       | I don't think I have seen an answer here that actually challenges
       | this question - from my experience, I have yet to see a neural
       | network actually learn representations outside the range in which
       | it was trained. Some papers have tried to use things like
       | sinusoidal activation functions that can force a neural network
       | to fit a repeating function, but on its own I would call it pure
       | coincidence.
       | 
        | On generalization - it's still memorization. I think there has
        | been some proof that ChatGPT does 'try' to perform some higher-
        | level thinking, but it still has problems due to the dictionary-
        | type lookup table it uses. The higher-level thinking or AGI that
        | people are excited about is a form of generalization that is so
        | impressive we don't really think of it as memorization. But I
        | actually question whether our capacity for original thought is
        | really separate from what we are currently seeing.
        
         | smaddox wrote:
         | > I have yet to see a neural network actually learn
         | representations outside the range in which it was trained
         | 
         | Generalization doesn't require _learning representations_
         | outside of the training set. It requires learning reusable
         | representations that compose in ways that enable solving unseen
         | problems.
         | 
         | > On generalization - its still memorization
         | 
          | Not sure what you mean by this. This statement sounds self-
          | contradictory to me. Generalization requires abstraction /
         | compression. Not sure if that's what you mean by memorization.
         | 
         | Overparameterized models are able to generalize (and tend to,
         | when trained appropriately) because there are far more
         | parameterizations that minimize loss by compressing knowledge
         | than there are parameterizations that minimize loss without
         | compression.
         | 
         | This is fairly easy to see. Imagine a dataset and model such
         | that the model has barely enough capacity to learn the dataset
         | without compression. The only degrees of freedom would be
         | through changes in basis. In contrast, if the model uses
         | compression, that would increase the degrees of freedom. The
         | more compression, the more degrees of freedom, and the more
         | parameterizations that would minimize the loss.
         | 
         | If stochastic gradient descent is sufficiently equally as
         | likely to find any given compressed minimum as any given
         | uncompressed minimum, then the fact that there are
         | exponentially many more compressed minimums than uncompressed
         | minimums means it will tend to find a compressed minimum.
         | 
         | Of course this is only a probabilistic argument, and doesn't
         | guarantee compression / generalization. And in fact we know
         | that there are ways to train a model such that it will not
         | generalize, such as training for many epochs on a small dataset
         | without augmentation.
        
         | jhaenchen wrote:
         | The issue is that we are prone to inflate the complexity of our
         | own processing logic. Ultimately we are pattern recognition
         | machines in combination with abstract representation. This
         | allows us to connect the dots between events in the world and
         | apply principles in one domain to another.
         | 
          | But, like all complexity, it is reducible to component parts.
          | 
          | (In fact, we know this because we evolved to have this
          | ability.)
        
           | agalunar wrote:
           | Calling us "pattern recognition machines capable of abstract
           | representation" I think is correct, but is (rather) broad
           | description of what we can do and not really a comment on how
           | our minds work. Sure, from personal observation, it seems
           | like we sometimes overcomplicate self-analysis ("I'm feeling
           | bad - why? oh, there are these other things that happened and
           | related problems I have and maybe they're all manifestations
           | of one or two deeper problems, &c" when in reality I'm just
           | tired or hungry), but that seems like evidence we're both
           | simpler than we think and also more complex than you'd expect
           | (so much mental machinery for such straightforward
           | problems!).
           | 
           | I read _Language in Our Brain_ [1] recently and I was amazed
            | by what we've learned about the neurological basis of
           | language, but I was even more astounded at how profoundly
           | _little_ we know.
           | 
            | > But, like all complexity, it is reducible to component
           | parts.
           | 
           | This is just false, no? Sometimes horrendously complicated
           | systems are made of simple parts that interact in ways that
           | are intractable to predict or that defy reduction.
           | 
           | [1] https://mitpress.mit.edu/9780262036924/language-in-our-
           | brain
        
       | huijzer wrote:
        | A bit of both, but it does certainly generalize. Just look into
        | the sentiment neuron from OpenAI in 2017, or come up with a
        | unique question for ChatGPT.
        
       | _ache_ wrote:
        | Does anyone know how the charts are created? I bet they're half
        | generated by some sort of library and then manually improved,
        | but the generated animated SVGs are beautiful.
        
         | 1wheel wrote:
         | Basically just a bunch of d3 -- could be cleaned up
         | significantly, but that's hard to do while iterating and
         | polishing the charts.
         | 
         | I also have a couple of little libraries for things like
         | annotations, interleaving svg/canvas and making d3 a bit less
         | verbose.
         | 
         | - https://github.com/PAIR-code/ai-
         | explorables/tree/master/sour...
         | 
         | - https://1wheel.github.io/swoopy-drag/
         | 
         | - https://github.com/gka/d3-jetpack
         | 
         | - https://roadtolarissa.com/hot-reload/
        
           | iaw wrote:
           | I was going to ask the same question. Those are some great
           | visualizations
        
       | davidguetta wrote:
       | hierarchize would be a better term than generalize
        
         | 3cats-in-a-coat wrote:
          | Generalizing is seeing common principles, patterns, between
          | disparate instances of a phenomenon. It's a proper word for
          | this.
        
           | Chabsff wrote:
           | That's a common mechanism to achieve generalization, but the
           | term is a little more general (heh) than that. It
           | specifically refers to correctly handling data that lives
           | outside the distribution presented by the training data.
           | 
            | It's a description of a _behavior_, not a mechanism. Which
           | may or may not be appropriate depending on whether you are
           | talking about *what* the model does or *how* it achieves it.
        
             | 3cats-in-a-coat wrote:
             | Kinda fuzzy what's "in the distribution", because it
             | depends on how deeply the model interprets it. If it
             | understands examples outside the distribution... that kinda
             | puts them in the distribution.
             | 
             | General understanding makes the information in the
             | distribution very wide. Shallow understanding makes it very
             | narrow. Like say recognizing only specific combinations of
             | pixels verbatim.
        
               | Chabsff wrote:
               | I think you are misinterpreting. The distribution present
               | in the training set in isolation (the one I'm referring
               | to, and is not fuzzy in the slightest) is not the same
               | thing as the distribution understood by the trained model
               | (the one you are referring to, and is definitely more
               | conceptual and hard to characterize in non-trivial
               | cases).
               | 
               | "Generalization" is simply the theoretical measure of how
               | much the later extends beyond the former, regardless of
               | how that's achieved.
        
           | davidguetta wrote:
              | Generalize has a tendency to imply you can extrapolate,
              | and in most cases it's actually the opposite that happens:
              | neural nets tend to COMPRESS the data (which in turn is a
              | good thing in many cases, because the data is noisy).
        
             | 3cats-in-a-coat wrote:
             | The point of compression is to decompress after. That's
             | what happens during inference, and when the extrapolation
             | occurs.
             | 
             | Let's say I tell GPT "write 8 times foobar". Will it? Well
             | then it understands me and can extrapolate from the request
             | to the proper response, without having specifically "write
             | 8 times foobar" in its model.
             | 
              | Most modern compression algorithms focus on predicting
              | the next token (byte, term, etc.), believe it or not. The
              | more accurately they predict the next token, the less
              | information you need to store to correct mispredictions.
        
         | ot wrote:
         | "hierarchize" only describes your own mental model of how
         | knowledge organization and reasoning may work in the model, not
         | the actual phenomenon being observed here.
         | 
         | "generalize" means going from specific examples to general
         | cases not seen before, which is a perfectly good description of
         | the phenomenon. Why try to invent a new word?
        
           | davidguetta wrote:
           | > hierarchize" only describes your own mental model of how
           | knowledge organization and reasoning may work in the model,
           | not the actual phenomenon being observed here
           | 
            | That's not true: if you look at a deep CNN, the lower layers
            | show lines, the higher ones complex stuff like eyes or
            | football players, etc. Hierarchisation of information
            | actually emerges naturally in NNs.
           | 
            | Generalization often implies extrapolation on new data,
            | which is just not what happens most of the time with NNs,
            | and that's why I didn't like the word.
        
         | version_five wrote:
         | Anything would be better than "grokking".
         | 
          | From what I gather they're talking about double descent,
          | which afaik is the consequence of overparameterization
          | leading to a smooth interpolation between the training data
          | points, as opposed to what happens in traditional
          | overfitting. Imagine a polynomial fit with the same degree as
          | the number of data points (swinging up and down wildly away
          | from the data) compared with a much higher-degree fit that
          | could smoothly interpolate between the points while still
          | landing right on them.
         | 
         | None of this is what I would call generalization, it's good
         | interpolation, which is what deep learning does in a very high
         | dimensional space. It's notoriously awful at extrapolating, ie
         | generalizing to anything without support in the training data.
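          | 
          | The polynomial picture is easy to play with, for what it's
          | worth (lstsq gives the minimum-norm coefficients once the fit
          | is underdetermined; whether that interpolant looks "tame" is
          | exactly the double-descent question):
          | 
          |     import numpy as np
          | 
          |     x = np.linspace(-1, 1, 10)
          |     y = np.sin(3 * x)
          | 
          |     V_exact = np.vander(x, 10)    # exact-degree fit
          |     V_over = np.vander(x, 100)    # overparameterized
          |     c_exact = np.linalg.solve(V_exact, y)
          |     c_over = np.linalg.lstsq(V_over, y, rcond=None)[0]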
        
           | visarga wrote:
           | > It's notoriously awful at extrapolating, ie generalizing to
           | anything without support in the training data.
           | 
           | Scientists are also pretty lousy at making new discoveries
           | without labs. They just need training data.
        
           | Jack000 wrote:
           | double descent is a different phenomenon from grokking
        
       | blueyes wrote:
       | If your data set is too small, they memorize. If you train them
       | well on a large dataset, they learn to generalize.
        
         | visarga wrote:
         | they only generalise with big datasets, that is the rule
        
           | blueyes wrote:
           | That's what I said.
        
       | ajuc wrote:
       | I was trying to make an AI for my 2d sidescrolling game with
       | asteroid-like steering learn from recorded player input +
       | surroundings.
       | 
        | It generalized splendidly - its conclusion was that you always
        | need to press "forward" and do nothing else, no matter what
        | happens :)
        
       | mostertoaster wrote:
        | Sometimes I think the reason human memory is in some sense so
        | amazing is that what we lack in the storage capacity machines
        | have, we make up for in our ability to create patterns that
        | dramatically compress the amount of information stored, and
        | then it is like we compress those patterns together with other
        | patterns and are able to extract things from them. It is an
        | incredibly lossy compression, but it gets the job done.
        
         | tbalsam wrote:
         | For more information and the related math behind associative
         | memories, please see Hopfield Neural Networks.
         | 
          | While the upper bound is technically "infinity", there is a
          | tradeoff between the number of concepts stored and the
          | fundamental amount of information storable per concept,
          | similar to how other tradeoff principles, like the
          | uncertainty principle, work.
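          | 
          | A toy classical Hopfield net, if you want to poke at it
          | (synchronous updates for brevity; capacity for this classical
          | variant is roughly 0.14 * n patterns):
          | 
          |     import numpy as np
          | 
          |     def store(patterns):          # rows of +/-1 values
          |         P = np.asarray(patterns, dtype=float)
          |         W = P.T @ P / len(P)
          |         np.fill_diagonal(W, 0.0)  # no self-connections
          |         return W
          | 
          |     def recall(W, probe, steps=10):
          |         s = np.asarray(probe, dtype=float)
          |         for _ in range(steps):
          |             s = np.sign(W @ s)
          |         return s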
        
         | bobboies wrote:
          | Good example: in my math and physics classes I found it
          | really helpful to understand the general concepts; then,
          | instead of memorizing formulas, I could actually derive them
          | from other known (perhaps easier-to-remember) facts.
         | 
         | Geometry is good for training in this way--and often very
         | helpful for physics proofs too!
        
         | BSEdlMMldESB wrote:
          | yes, when we do this to history, it becomes filled with
          | conspiracies. but it is merely a process to 'understand'
          | history by projecting intentionalities.
         | 
         | this 'compression' is what 'understanding' something really
         | entails; at first... but then there's more.
         | 
         | when knowledge becomes understood it enables perception (e.g.
         | we perceive meaning in words once we learn to read).
         | 
         | when we get really good at this understanding-perception we may
         | start to 'manipulate' the abstractions we 'perceive'. an
         | example would be to 'understand a cube' and then being able to
          | rotate it around so as to predict what would happen without
         | needing the cube. but this is an overly simplistic example
        
         | pillefitz wrote:
         | That is essentially what embeddings do
        
           | nightski wrote:
           | Maybe, except from my understanding an embedding vector tends
           | to be much larger than the source token (due to the high
           | dimensionality of the embedding space). So it's almost like a
            | reverse compression in a way. That said, I know vector DBs
            | have much more efficient ways of storing those vector
            | embeddings.
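            | 
            | Concretely (sizes illustrative): a token id is one integer,
            | while its embedding is a whole learned vector.
            | 
            |     import torch, torch.nn as nn
            | 
            |     emb = nn.Embedding(50_000, 768)   # vocab x dim
            |     token_id = torch.tensor([42])     # one token
            |     vec = emb(token_id)               # shape (1, 768)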
        
             | jncfhnb wrote:
             | Tokens are not 1:1 with vectors.
        
         | bufferoverflow wrote:
         | There are rare people who remember everything
         | 
         | https://youtu.be/hpTCZ-hO6iI
        
           | svachalek wrote:
           | It's pretty fascinating to me how "normal" Marilu Henner
           | seems to be. I'm getting older and my memory is not what it
           | was, but when I was younger it was pretty extraordinary. I
           | did really well in school and college but over time I've
           | realized it was mostly due to being able to remember most
           | things pretty effortlessly, over being truly "smart" in a
           | classic sense.
           | 
           | But having so much of the past being so accessible is tough.
           | There are lots of memories I'd rather not have, that are
           | vivid and easily called up. And still, I think it's only a
           | fraction of what her memory seems to be like.
        
             | 93po wrote:
             | As someone on the other end of the spectrum, I have an
             | awful memory, and don't remember most of my life aside from
             | really wide, sweeping generalizations and maybe a couple
             | hundred very specific memories. My way of existence is also
             | very sad, and it makes me feel like I've not really lived.
        
             | TheRealSteel wrote:
             | " I did really well in school and college but over time
             | I've realized it was mostly due to being able to remember
             | most things pretty effortlessly"
             | 
             | Same! They thought I was a genius in primary school but I
             | ended up a loser adult with a dead end job. Turns out I
             | just liked technology and was good at remembering facts and
             | names for things.
        
           | hgsgm wrote:
           | Is there scientific evidence of that or just claims?
        
             | badumtsss wrote:
             | some people don't want to be studied or tested.
        
         | ComputerGuru wrote:
         | That's not exactly true, there doesn't seem to be an upper
         | bound (that we can reach) on storage capacity in the brain [0].
         | Instead, the brain actually works to actively distill knowledge
         | that doesn't need to be memorized verbatim into its essential
         | components in order to achieve exactly this "generalized
         | intuition and understanding" to avoid overfitting.
         | 
         | [0]: https://www.scientificamerican.com/article/new-estimate-
         | boos...
        
           | halflings wrote:
           | > That's not exactly true [...] Instead, the brain actually
           | works to actively distill knowledge that doesn't need to be
           | memorized verbatim into its essential components
           | 
           | ...but that's exactly what OP said, no?
           | 
           | I remember attending an ML presentation where the speaker
           | shared a quote I can't find anymore (speaking of memory and
           | generalization :)), which said something like: "To learn is
           | to forget"
           | 
           | If we memorized everything perfectly, we would not learn
           | anything: instead of remembering the concept of a "chair",
           | you would remember thousands of separate instances of things
           | you've seen that have a certain combination of colors and
           | shapes etc
           | 
           | It's the fact that we forget certain details (small
           | differences between all these chairs) that makes us learn
           | what a "chair" is.
           | 
           | Likewise, if you remembered every single word in a book, you
           | would not understand its meaning; understanding its meaning =
           | being able to "summarize" (compress) this long list of words
           | into something more essential: storyline, characters,
           | feelings, etc.
        
             | JieJie wrote:
             | My mind is a blurry jpeg of my life.
             | 
             | (https://www.newyorker.com/tech/annals-of-
             | technology/chatgpt-...)
        
             | WanderPanda wrote:
             | Compression = Intelligence
             | 
             | http://prize.hutter1.net/
        
             | ComputerGuru wrote:
             | > but that's exactly what OP said, no?
             | 
             | Not precisely. We don't know if verbatim capacity is
             | limited (and it doesn't seem to be) but the brain operates
             | in a space-efficient manner all the same. So there isn't
             | necessarily a causative relationship between "memory
             | capacity" and "means of storage".
             | 
             | > Likewise, if you remembered every single word in a book,
             | you would not understand its meaning
             | 
             | I understand your meaning but I want to clarify for the
             | sake of the discussion that unlike with ML, the human brain
             | can both memorize verbatim _and_ understand the meaning
             | because there is no mechanism for memorizing something but
             | not processing it (i.e. purely storage). The first pass(es)
             | are stripped to their essentials but subsequent passes
             | provide the ability to memorize the same input.
        
               | whimsicalism wrote:
               | > verbatim capacity is limited
               | 
               | I am but a simple physicist and I can already tell you it
               | is.
        
               | SanderNL wrote:
               | We know for certain it is limited. Do brains not adhere
               | to physics?
        
             | cmpalmer52 wrote:
             | There's a story by Jorge Luis Borges called "Funes the
             | Memorious" about a man who remembers everything, but can't
             | generalize. There's a line about him not knowing if a dog
             | on the square glimpsed at noon from the side is the same
             | dog as the one seen from the back at 12:01 or something
             | like that. Swirls of smoke from a cigarette are memorized
             | forever. He mostly sits in a dark room.
        
           | jjk166 wrote:
           | Distilling knowledge is data compression.
        
             | w10-1 wrote:
             | You're conflating memorization with generalization, no?
        
               | jjk166 wrote:
               | Memorization is storing data. Generalization is
               | developing the heuristics by which you compress stored
               | data. To distill knowledge is to apply heuristics to
               | lossily-compress a large amount of data to a much smaller
               | amount of data from which you nevertheless can recover
               | enough information to be useful in the future.
        
           | downboots wrote:
           | Can "distill knowledge" be made precise ?
        
             | __loam wrote:
             | Unless you know something the neuroscientists don't, it
             | cannot.
        
             | ComputerGuru wrote:
             | As best as I've been able to research, it's still under
             | active exploration and there are hypotheses but no real
             | answers. I believe research has basically been circling
             | around the recent understanding that in addition to being
             | part of how the brain is wired, it is also an active,
             | deliberate (if unconscious) mechanism that takes place in
             | the background and is run "at a higher priority" during
             | sleep (sort of like an indexing daemon running at low
             | priority during waking hours then getting the bulk of
             | system resources devoted to it during idle).
             | 
             | There are also studies that show "data" in the brain isn't
             | stored read-only and the process of accessing that memory
             | involves remapping the neurons (which is how fake memories
             | are possible) - so my take is if you access a memory or
             | datum sequentially start to finish each time the brain
             | knows this is to be stored verbatim for as-is retrieval but
             | if you access snapshots of it or actively seek to and
             | replay a certain part while trying to relate that memory to
             | a process or a new task, the brain rewires the neural
              | pathways accordingly. Which implies that there is an
             | unconscious part that takes place globally plus an active,
             | modifying process where how we use a stored memory affects
             | how it is stored and indexed (so data isn't accessed by
             | simple fields but rather by complex properties or getters,
             | in programming parlance).
             | 
             | I guess the key difference from how machine learning works
             | (and I believe an integral part of AGI, if it is even
             | possible) is that inference is constant, even when you're
             | only "looking up" data and you don't know the right answer
             | (i.e. not training stage). The brain recognizes how the new
             | query differs from queries it has been trained on and can
             | modify its own records to take into account the new data.
             | For example, let's say you're trying to classify animals
             | into groups and you've "been trained" on a dataset that
             | doesn't include monotremes or marsupials. The first time
             | you come across a platypus in the wild (with its mammaries
             | but no nipples, warm-blooded but lays eggs, and a single
             | duct for waste and reproduction) you wouldn't just
             | mistakenly classify it as a bird or mammal - you would
             | actively trigger a (delayed/background) reclassification of
             | all your existing inferences to account for this new
             | phenomenon, _even though you don't know what the answer to
             | the platypus classification question is_.
        
             | clord wrote:
             | imo, it amounts to revisiting concepts once more general
             | principles are found -- and needed. For instance, you learn
             | the alphabet, and it's hard. the order is tricky. the
             | sounds are tricky, etc. but eventually, it get distilled to
             | a pattern. But you still have to start from A to remember
             | what letter 6 is, until you encounter that problem many
             | times, and then the brain creates a 6=F mapping. I think of
             | it in economic terms: when the brain realizes it's cheaper
             | to create a generalization, it does so on the fly, and that
             | generalization takes over the task.
             | 
              | Sometimes it's almost like creating a specialist shard to
             | take over the task. Driving is hard at first, with very
             | high task overload, lots to pay attention to. With
             | practice, it becomes a little automated part of yourself
             | takes care of those tasks while your main general
             | intelligence can do whatever it likes, even as the "driver"
             | deals with seriously difficult tasks.
        
             | esafak wrote:
             | https://en.wikipedia.org/wiki/Rate%E2%80%93distortion_theor
             | y
        
           | nonameiguess wrote:
           | I've thought about this a lot in the context of the desire
           | people seem to have to try and achieve human immortality or
           | at least indefinite lifespans. If SciAm is correct here and
           | the upper bound is a quadrillion bytes, we may not be able to
           | hit that given the bound on possible human experiences, but
           | someone who lived long enough would eventually hit that.
           | After a hundred million years or whatever the real number is
           | of life, you'd either lose the ability to form new memories
           | or you'd have to overwrite old ones to do so.
           | 
           | Aside from having to eventually experience the death of all
           | stars and light and the decay of most of the universe's
           | baryonic matter and then face an eternity of darkness with
           | nothing to touch, it's yet another reason I don't think
           | immortality (as opposed to just a very long lifespan) is
           | actually desirable.
        
             | mewpmewp2 wrote:
              | I imagine there would be tech or techniques that let you
              | choose which memories to compress, and countless others,
              | like extra storage you can access instantly, so I don't
              | see these as real arguments against becoming immortal.
              | If the choice is between being dead and memoryless or
              | losing some of my memories but still being alive, why
              | would I choose being dead and memoryless?
              | 
              | And when losing memories you would first just discard
              | some details, like you do now anyway, but eventually you
              | would compress centuries into rough ideas of what
              | happened; it's just the details that would be a bit
              | lacking.
             | 
              | I don't see it being a problem at all. And if something
              | really does happen to the Universe, sure, I can die then,
              | but why would I want to die before that?
              | 
              | I want to know what happens, what gets discovered, what
              | becomes of humanity, how far we get in understanding
              | what is going on in this place. Why are we here? Imagine
              | dying and not even knowing why you were here.
        
             | imtringued wrote:
              | Longtermists argue that we will be harvesting Hawking
              | radiation from black holes trillions of years after the
              | heat death of the universe.
        
               | __loam wrote:
               | The last civilizations will be built around black holes.
        
           | TheRealSteel wrote:
           | You seem to have just re-stated what the other person said.
        
             | whimsicalism wrote:
             | Thank you, thought I was losing it for a second
        
           | gattilorenz wrote:
            | Is there a "realistic upper bound" on things that can be
            | memorized verbatim? Ancient Greeks probably memorized the
            | Iliad and other poems (rhyming and metre might work as a
            | substitute for data compression, in this case), and many
            | medieval preachers apparently memorized the whole Bible...
        
             | [deleted]
        
       | gorjusborg wrote:
       | Grr, the AI folks are ruining the term 'grok'.
       | 
       | It means roughly 'to understand completely, fully'.
       | 
       | To use the same term to describe generalization... just shows you
       | didn't grok grokking.
        
         | erwald wrote:
         | "Grok" in AI doesn't quite describe generalization, it's more
         | specific that that. It's more like "delayed and fairly sudden
         | generalization" or something like that. There was some
         | discussion of this in the comments of this post[1], which
         | proposes calling the phenomenon "eventual recovery from
         | overfitting" instead.
         | 
         | [1]
         | https://www.lesswrong.com/posts/GpSzShaaf8po4rcmA/qapr-5-gro...
        
           | gorjusborg wrote:
           | Whoever suggested 'eventual recovery from overfitting' is a
           | kindred spirit.
           | 
           | Why throw away the context and nuance?
           | 
           | That decision only further leans into the 'AI is magic'
           | attitude.
        
             | jeremyjh wrote:
              | No, actually this is just how language evolves. I'm glad
              | we have the word "car" instead of "carriage powered by an
              | internal combustion engine," even if it confused some
              | people 100 years ago when the term came to be used
              | exclusively to mean something a bit more specific.
              | 
              | Of course, the jargon used in a specific sub-field
              | evolves much more quickly than common usage, because the
              | intended audience of a paper like this is expected to be
              | well-read and current in the field already.
        
               | smolder wrote:
                | Language devolves just as it evolves. We (the grand
                | we) regularly introduce ambiguity -- words and
                | meanings with no useful purpose, or that are worse
                | than useless.
                | 
                | I'm not really weighing in on the appropriateness of
                | the use of "grok" in this case. It's just a pet peeve
                | of mine that people bring out "language evolves" as an
                | excuse for why any arbitrary change is natural and
                | therefore acceptable and we should go with the flow.
                | Some changes are strictly bad ones.
               | 
               | A go-to example is when "literally" no longer means
               | "literally", but its opposite, or nothing at all. We
               | don't have a replacement word, so now in some contexts
               | people have to explain that they "literally mean
               | literally".
        
               | krapp wrote:
               | Language only evolves, "devolving" isn't a thing. All
               | changes are arbitrary. Language is always messy, fluid
                | and ambiguous. You should go with the flow because being
               | a prescriptivist about the way other people speak is
               | obnoxious and pointless.
               | 
               | And "literally" has been used to mean "figuratively" for
               | as long as the word has existed[0].
               | 
               | [0]https://blogs.illinois.edu/view/25/96439
        
               | mdp2021 wrote:
               | > _devolving isn 't a thing_
               | 
               | Incompetent use is devolution.
        
               | gorjusborg wrote:
                | Also being overlooked is that the nuances in what we
                | accept are in large part how we define group culture.
               | 
               | If you want to use the word 'irregardless' unironically
               | there are people who will accept that. Then there are the
               | rest of us.
        
               | smolder wrote:
                | I'm going to take a rosier view of prescriptivists and
                | say they are a necessary part of the speaking/writing
                | public, doing the valuable work of fighting the
                | entropic forces that would dumb our language down.
                | They don't always need to win or be right.
               | 
               | That's the first time I've seen literally-as-figuratively
                | defended from a historical perspective. I still think
                | we'd all be better off if people didn't mindlessly use
                | it as a filler word or for emphasis -- which is what
                | people are generally doing these days, and what the
                | controversy is about -- rather than reviving an
                | archaic usage.
               | 
               | Also, it's kind of ironic you corrected my use of
               | "devolves", where many would accept it. :)
        
               | gorjusborg wrote:
               | > No, actually this is just how language evolves
               | 
               | Stop making 'fetch' happen, it's not going to happen.
        
               | [deleted]
        
           | tbalsam wrote:
           | Part of the issue here is posting a LessWrong post. There is
           | some good in there, but much of that site is like a Flat
           | Earth conspiracy theory for neural networks.
           | 
            | Neural network training [edit: on a fixed point task, as
            | is often the case {such as image->label}] is always
            | (always) necessarily biphasic, so there is no "eventual
            | recovery from overfitting". In my experience, it is
            | usually people newer to the field, or just noodling
            | around, fundamentally misunderstanding what is happening
            | as their network goes through a very delayed phase change.
            | Unfortunately these kinds of posts get significantly
            | amplified, as people like chasing the new shiny of some
            | fad-or-another-that-does-not-actually-exist instead of the
            | much more "boring" (which I find fascinating) math
            | underneath it all.
           | 
            | To me, as someone who specializes in optimizing network
            | training speeds, it just indicates poor engineering of the
            | problem on the part of the person running the experiments.
            | It is not a new or strange phenomenon; it is a direct
            | consequence of the information theory underlying neural
            | network training.
        
             | tbalsam wrote:
              | To further clarify things, the reason there is no
              | mystical "eventual recovery from overfitting" is that
              | overfitting is a stable bound that is approached. Giving
              | it that label implies a non-biphasic nature to neural
              | network training, and adds false information that wasn't
              | there before.
             | 
              | Thankfully things are pretty stable in the
              | over/underfitting regime. I feel sad when I see ML
              | misinformation propagated on a forum that requires
              | little experience but has high leverage, due to the
              | rampant misuse of existing terms and the wholesale
              | invention of an in-group language that has little
              | contact with the mathematical foundations of what's
              | happening behind the scenes. I've done this for 7-8
              | years at this point, at a pretty deep level, and have a
              | strong pocket of expertise, so I'm not swinging at this
              | one blindly.
        
             | ShamelessC wrote:
             | > Part of the issue here is posting a LessWrong post. There
             | is some good in there, but much of that site is like a Flat
             | Earth conspiracy theory for neural networks.
             | 
             | Indeed! It's very frustrating that so many people here are
             | such staunch defenders of LessWrong. Some/much of the
             | behavior there is honestly concerning.
        
         | NikkiA wrote:
         | I've always taken 'grok' to be in the same sense as 'to be one
         | with'
        
           | gorjusborg wrote:
            | Yeah, there is definite irony in me trying to push my own
            | definition of an extraterrestrial word while complaining
            | that someone is ruining it.
           | 
           | If anyone wants to come up with their own definition, read
           | Robert Heinlein's 'Stranger in a Strange Land'. There is no
           | definition in there, but you build an intuition of the
           | meaning by its use.
           | 
           | One of the issues I have w/ the use in AI is that using the
           | word 'grok' suggests that the machine understands (that's a
           | common interpretation of the word grok, that it is an
           | understanding greater than normal understanding).
           | 
           | By using an alien word, we are both suggesting something that
           | probably isn't technically true, while simultaneously giving
           | ourselves a slimy out. If you are going to suggest that AI
           | understands, just have the courage to say it with common
            | English, and be ready for argument.
           | 
           | Redefining a word that already exists to make the argument
           | technical feels dishonest.
        
             | snewman wrote:
             | Actually the definition of 'grok' is discussed in the book;
             | you can find some relevant snippets at
             | https://en.m.wikipedia.org/wiki/Grok. My recollection is
             | that the book says the original / literal meaning is
             | "drink", but this isn't supported by the Wikipedia quotes
              | and perhaps I am misremembering; it has been a long time.
        
         | 93po wrote:
          | I have heard "grok" used tremendously more frequently in
          | the past year or two, and I find it annoying because people
          | are using it as a replacement for the word "understand" for
          | reasons I don't "grok".
        
         | whimsicalism wrote:
         | I literally do not see the difference between the two uses that
         | you are trying to make
        
         | mxwsn wrote:
         | They're just defining grokking in a different way. It's
          | reasonable to me though -- grokking suggests elements of
         | intuitive understanding, and a sudden, large increase in
         | understanding. These mirror what happens to the loss.
        
         | thuuuomas wrote:
         | "Grok" is more about in-group signaling like "LaTex
         | credibility" or publishing blog posts on arxiv.
        
         | jjk166 wrote:
         | I've always considered the important part of grokking something
         | to be the intuitiveness of the understanding, rather than the
         | completeness.
        
         | momirlan wrote:
         | grok, implying a mystical union, is not applicable to AI
        
           | Filligree wrote:
           | Why not?
        
         | benreesman wrote:
         | Sci-Fi Nerd Alert:
         | 
         | "Grok" was Valentine Michael Smith's rendering for human ears
         | and vocal cords of a Martian word with a precise denotational
         | semantic of "to drink". The connotational semantics range from
         | to literally or figuratively "drink deeply" all the way up
         | through to consume the absented carcass of a cherished one.
         | 
         | I highly recommend Stranger in A Strange Land (and make sure to
         | get the unabridged re-issue, 1990 IIRC).
        
         | paulddraper wrote:
          | What's the difference between understanding and generalizing?
         | 
         | And what is the indicator for a machine understanding
         | something?
        
       | jimwhite42 wrote:
        | I'm not sure if I'm remembering it right, but I think it was
        | in a Raphael Milliere interview on Mindscape where Raphael
        | said something along these lines: when there are many
        | dimensions in a machine learning model, the distinction
        | between interpolation and extrapolation is not as clear as it
        | is in our usual areas of reasoning. I can't work out whether
        | this is something similar to what the article is talking
        | about.
        
       | MagicMoonlight wrote:
        | Memorise, because there is no decision component. It just
        | attempts to brute-force a pattern rather than thinking
        | through the information and drawing a conclusion.
        
       | lachlan_gray wrote:
       | It looks like grid cells!
       | 
       | https://en.wikipedia.org/wiki/Grid_cell
       | 
        | If you plot a heat map of a neuron in the hidden layer on a
        | 2D chart where one axis is $a$ and the other is $b$, I think
        | you might get a triangular lattice. If it's doing what I
        | think it is, then looking at another hidden neuron would give
        | a different lattice with another orientation and scale.
        | 
        | Also, you could make a base-67 adding machine by chaining
        | these together.
        | 
        | I also can't shake the gut feeling that the relationship
        | between W_in-proj's neurons and W_out-proj's neurons looks
        | like the same mapping as the one between the semitone circle
        | and the circle of fifths
       | 
       | https://upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Pi...
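        | 
        | If anyone wants to check the lattice hunch, here is a minimal
        | sketch (a toy setup of my own, not the article's model: a
        | one-hot MLP instead of learned embeddings, with guessed
        | hyperparameters) that trains on a + b mod 67 and heat-maps
        | one hidden neuron over the full (a, b) grid:
        | 
        |     import torch
        |     import torch.nn as nn
        |     import torch.nn.functional as F
        |     import matplotlib.pyplot as plt
        | 
        |     P = 67
        |     pairs = torch.cartesian_prod(torch.arange(P),
        |                                  torch.arange(P))
        |     labels = (pairs[:, 0] + pairs[:, 1]) % P
        |     # one-hot a and b, concatenated, instead of embeddings
        |     x = torch.cat([F.one_hot(pairs[:, 0], P),
        |                    F.one_hot(pairs[:, 1], P)], dim=1).float()
        | 
        |     model = nn.Sequential(nn.Linear(2 * P, 128), nn.ReLU(),
        |                           nn.Linear(128, P))
        |     opt = torch.optim.AdamW(model.parameters(), lr=1e-3,
        |                             weight_decay=1.0)
        |     for step in range(5000):  # guessed training budget
        |         opt.zero_grad()
        |         F.cross_entropy(model(x), labels).backward()
        |         opt.step()
        | 
        |     with torch.no_grad():  # one hidden neuron over the grid
        |         h = torch.relu(model[0](x))[:, 0].reshape(P, P)
        |     plt.imshow(h.numpy(), origin="lower")
        |     plt.xlabel("b"); plt.ylabel("a")
        |     plt.show()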
        
       | esafak wrote:
       | I haven't read the latest literature but my understanding is that
       | "grokking" is the phase transition that occurs during the
       | coalescing of islands of understanding (increasingly abstract
       | features) that eventually form a pathway to generalization. And
       | that this is something associated with over-parameterized models,
       | which have the potential to learn multiple paths (explanations).
       | 
       | https://en.wikipedia.org/wiki/Percolation_theory
       | 
       | A relevant, recent paper I found from a quick search: _The
       | semantic landscape paradigm for neural networks_
       | (https://arxiv.org/abs/2307.09550)
        
       | ComputerGuru wrote:
       | PSA: if you're interested in the details of this topic, it's
       | probably best to view TFA on a computer as there is data in the
       | visualizations that you can't explore on mobile.
        
       | tehjoker wrote:
        | Well, they memorize points and fit lines (or tanhs) between
        | different parts of the space, right? So it depends on whether
        | a useful generalization can be extracted from the line
        | estimation and on how dense the points on the landscape are,
        | no?
        
       | [deleted]
        
       | taeric wrote:
        | I'm curious how representative the target function is. I get
        | that it is common to want a model to learn the important
        | pieces of an input, but a string of bits where only the first
        | three matter feels particularly contrived. Literally a truth
        | table of size 8 over the relevant parameters? And trained
        | with 4.8 million samples? Or am I misunderstanding something
        | there? (I fully expect I'm misunderstanding something.)
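        | 
        | For scale, a concrete (and assumed) version of that setup --
        | the labeling rule here is a guess, not necessarily the
        | article's exact one:
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     # 30-bit strings; label depends only on the first 3 bits
        |     X = rng.integers(0, 2, size=(100_000, 30))
        |     y = X[:, :3].sum(axis=1) % 2  # guessed rule: parity
        | 
        |     # the "true" function is an 8-row truth table, even though
        |     # the raw input space has 2**30 (about a billion) strings
        |     table = {bits: sum(bits) % 2
        |              for bits in np.ndindex(2, 2, 2)}
        |     print(table)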
        
         | jaggirs wrote:
         | I have observed this pattern before in computer vision tasks
         | (train accuracy flatlining for a while before test acc starts
         | to go up). The point of the simple tasks is to be able to
         | interpret what could be going on behind the scenes when this
         | happens.
        
           | taeric wrote:
           | No doubt. But I have also seen what people thought were
           | generalized models failing on outlier, but valid, data. Quite
           | often.
           | 
            | Put another way, it isn't just that this task seems
            | simple in the number of terms that matter; isn't it also
            | a rather dense function?
            | 
            | Probably the better question to ask is how sensitive
            | models that are looking at less dense functions (or
            | denser ones) are to this. I'm not trying to disavow the
            | ideas.
        
             | visarga wrote:
              | Maybe humans also fail a lot in out-of-distribution
              | settings. It might be inherent.
        
               | taeric wrote:
                | We have names for that. :D Stereotypes being a large
                | one, and racism being motivated interpretation of the
                | same ideas. Right?
        
       | agumonkey wrote:
       | They ponderize.
        
       ___________________________________________________________________
       (page generated 2023-08-10 23:00 UTC)