[HN Gopher] How do machines 'grok' data?
       ___________________________________________________________________
        
       How do machines 'grok' data?
        
       Author : nsoonhui
       Score  : 89 points
        Date   : 2024-04-13 05:17 UTC (1 day ago)
        
 (HTM) web link (www.quantamagazine.org)
 (TXT) w3m dump (www.quantamagazine.org)
        
       | verisimi wrote:
       | > Automatic testing revealed this unexpected accuracy to the rest
       | of the team, and they soon realized that the network had found
       | clever ways of arranging the numbers a and b. Internally, the
       | network represents the numbers in some high-dimensional space,
       | but when the researchers projected these numbers down to 2D space
       | and mapped them, the numbers formed a circle.
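        | 
        | A toy illustration of that projection (not the paper's trained
        | network; the embedding below is built by hand to hide a circle
        | in a high-dimensional space, then recovered with PCA):
        | 
        |     import numpy as np
        | 
        |     n, d = 97, 128  # modulus and embedding width (arbitrary)
        |     rng = np.random.default_rng(0)
        | 
        |     # Place each residue a on a circle at angle 2*pi*a/n.
        |     angles = 2 * np.pi * np.arange(n) / n
        |     circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)
        | 
        |     # Hide the circle in d dimensions (orthonormal map + noise).
        |     Q, _ = np.linalg.qr(rng.normal(size=(d, 2)))
        |     E = circle @ Q.T + 0.01 * rng.normal(size=(n, d))
        | 
        |     # PCA: project onto the top two principal components.
        |     E = E - E.mean(axis=0)
        |     _, _, Vt = np.linalg.svd(E, full_matrices=False)
        |     proj = E @ Vt[:2].T
        | 
        |     # Near-constant radii confirm the 2D picture is a circle.
        |     radii = np.linalg.norm(proj, axis=1)
        |     print(radii.std() / radii.mean())  # close to 0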
        
       | mjburgess wrote:
       | NNs (indeed, all statistical fitting algs) have no relevant
       | properties here: properties derive just from the structure of the
       | dataset. Here (https://arxiv.org/abs/2201.02177) NNs are trained
        | on a 'complete world' problem, i.e., modular arithmetic, where
        | the whole outcome space is trivial in size and abstract, with
        | complete information.
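        | 
        | For scale, a minimal sketch of that complete world (the paper
        | actually trains on subsets of this table, but the whole outcome
        | space fits in a dict):
        | 
        |     # Every instance of (a + b) mod 97 there is.
        |     n = 97
        |     world = {(a, b): (a + b) % n
        |              for a in range(n) for b in range(n)}
        |     print(len(world))  # 9409 input pairs in total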
       | 
        | Why should NNs eventually find a representation of this tiny,
        | abstract, fully representable outcome space after an arbitrary
        | amount of training time? Well, they will do so eventually, if
        | this outcome space can be fully represented by sequences of
        | conditional probabilities.
       | 
        | There is nothing more to this 'discovery' than that some trivial
        | abstract mathematical spaces can be represented as conditional
        | probability structures. Is this even a discovery?
       | 
        | One has to imagine this deception is perpetrated because the
        | peddlers of such systems want to attribute the structure of the
        | problem to properties of NNs in general, and thereby say, "well,
        | if you train NNs on face shapes, phrenology becomes possible!"
        | I.e., as a way of whitewashing their broken, half-baked
        | generative AI systems, where the problem domain isn't arithmetic
        | mod 97.
        
         | sillysaurusx wrote:
         | Firstly, thanks for linking the paper.
         | 
         | I think you're being unfair. I say that as someone who has done
         | his share of charlatan hunting.
         | 
         | They explicitly list their contributions in the paper. They're
         | not saying they did something they didn't. It's not like that
         | bogus "rat brain flies plane" paper that was doing something
         | simple under the hood and then dressing it up as a world
         | changing discovery in order to gain funding. They are doing
         | something simple and studying it carefully. This is grade A
         | science as far as I'm concerned, every bit as good as studying
         | the motion of the stars and trying to find patterns.
         | 
          | It did not occur to me that NNs might be able to extrapolate
          | from a tiny training dataset to a complete solution. ML
          | scientists are taught from an early age that you need a large
         | dataset to make any progress. I don't know whether this paper
         | counts as a "discovery", but it was certainly a fun read, which
         | was nice enough for me.
         | 
         | In some sense this paper is a proof of the idea that NNs can
         | extrapolate, not merely memorize. This is in contrast to recent
         | work where researchers have been claiming otherwise.
         | 
         | The weight decay study was also a nice touch. It's no discovery
         | to say that weight decay helps, but it's a reminder to use it.
         | I haven't. Adam has always been reliable for me, and now I
         | wonder if it was a mistake to shy away from weight decay. (We
         | wanted to keep our training pipeline as simple as possible, in
         | case any part of it might be causing problems. And lots of
         | unexpected parts did.)
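          | 
          | For reference, a minimal sketch of the switch (the model, lr,
          | and decay constant are placeholders, not values from the
          | paper):
          | 
          |     import torch
          | 
          |     model = torch.nn.Linear(97, 97)  # stand-in model
          | 
          |     # What I'd been doing: plain Adam, no decay.
          |     opt = torch.optim.Adam(model.parameters(), lr=1e-3)
          | 
          |     # AdamW decays the weights directly at each step
          |     # ("decoupled" weight decay) rather than folding an L2
          |     # term into Adam's gradient statistics.
          |     opt = torch.optim.AdamW(model.parameters(), lr=1e-3,
          |                             weight_decay=0.1)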
         | 
         | Now, I haven't read the article submitted here, only the paper
         | you linked. Maybe they're claiming something more than the
         | paper. But if so, then that is a (very real) problem with
         | scientific journalism, and not necessarily the scientists
         | themselves. It depends how much the scientists are leading or
         | misleading the reporters. It's important to separate the
         | criticism of the work from the reporting around the work.
         | 
         | I'd also be curious if you have any citations for your claim
         | that if an outcome space can be represented as a sequence of
         | conditional probabilities, then NNs are guaranteed to find a
         | solution after some unknown amount of training time. This is a
         | surprising thing to me.
        
         | gus_massa wrote:
         | The interesting part is that with training you can go:
         | 
          | random -> generalization -> overfitting -> better generalization
         | 
         | The last step was unknown IIUC.
         | 
         | The relevant trick is "regularization":
         | 
         | > _This is a process that reduces the model's functional
         | capacity -- the complexity of the function that the model can
         | learn._
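          | 
          | A minimal sketch for watching those phases yourself (the
          | hyperparameters are guesses, not the paper's, and the late
          | jump in test accuracy can take many thousands of steps):
          | 
          |     import torch
          | 
          |     n = 97
          |     pairs = torch.cartesian_prod(torch.arange(n),
          |                                  torch.arange(n))
          |     labels = (pairs[:, 0] + pairs[:, 1]) % n
          |     onehot = torch.nn.functional.one_hot
          |     x = torch.cat([onehot(pairs[:, 0], n),
          |                    onehot(pairs[:, 1], n)], dim=1).float()
          | 
          |     # Train on half the table, test on the other half.
          |     perm = torch.randperm(len(x))
          |     tr, te = perm[:len(x) // 2], perm[len(x) // 2:]
          | 
          |     model = torch.nn.Sequential(
          |         torch.nn.Linear(2 * n, 256), torch.nn.ReLU(),
          |         torch.nn.Linear(256, n))
          |     opt = torch.optim.AdamW(model.parameters(), lr=1e-3,
          |                             weight_decay=1.0)
          |     loss_fn = torch.nn.CrossEntropyLoss()
          | 
          |     def acc(idx):
          |         with torch.no_grad():
          |             preds = model(x[idx]).argmax(dim=1)
          |             return (preds == labels[idx]).float().mean().item()
          | 
          |     for step in range(50_000):
          |         opt.zero_grad()
          |         loss_fn(model(x[tr]), labels[tr]).backward()
          |         opt.step()
          |         if step % 1000 == 0:
          |             # Train accuracy saturates long before test does.
          |             print(step, acc(tr), acc(te))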
        
       | qrios wrote:
        | A good read if you're interested in how and why NNs work. The
        | HN discussion [1] about this paper [2] is also related, from my
        | point of view.
       | 
       | [1] https://news.ycombinator.com/item?id=40019217
       | 
       | [2] "From Words to Numbers: Your Large Language Model Is Secretly
       | A Capable Regressor When Given In-Context Examples"
       | https://arxiv.org/abs/2404.07544v1
       | 
       | (Edit: format)
        
       | schlauerfox wrote:
        | Machines lack the ability to 'grok'.
        
         | ang_cire wrote:
         | Yep. Computers are just binary-state machines. Everything you
         | can do with a computer, you could do with water channels and
         | pulleys and gates (if you could assemble trillions and
         | trillions of them). No one would ask whether the water-channel
         | system 'groks' things, but because they are miniaturized and
         | (to 99.9999% of people) esoteric in their actual workings,
         | people treat them like magic.
         | 
         | Clarke was right, but sadly it doesn't take alien technology
         | for people to start anthropomorphizing computers or thinking
         | that they magically have transcended mere (though obviously
         | very complexly-arranged) logic gates, it just takes a GUI that
         | spits out algorithmically-generated pictures and text.
        
           | munchausen42 wrote:
           | I think most of what a brain cell effectively does could be
           | simulated with water channels, pulleys and gates - so don't
           | expect humans to grok either.
        
           | __MatrixMan__ wrote:
           | > No one would ask whether the water-channel system 'groks'
           | things
           | 
           | I would. It's very common to describe the flow of electricity
           | as similar to the flow of water. If it's electricity in my
           | brain that allows me to understand, why couldn't there be an
           | analogous system involving water which also understands?
           | 
           | Any substrate which supports the necessary logical operations
           | ought to be sufficient. To believe otherwise seems needlessly
           | anthropocentric.
        
             | jrflowers wrote:
             | > Any substrate which supports the necessary logical
             | operations ought to be sufficient. To believe otherwise
             | seems needlessly anthropocentric.
             | 
             | This makes sense. Anything that performs logic using
             | electricity is conscious and comparable to the human brain.
              | This is obviously true because there is no word or concept
              | for anthropomorphizing.
        
         | Starman_Jones wrote:
         | I thought the article did a very good job of explaining why the
         | term 'grokking' was used to describe this emergent behavior,
         | and how it fit Heinlein's original definition. I'm curious
         | which part of their explanation you feel is incorrect.
        
       | zeegroen wrote:
        | So the article mentions "regularization" as the secret ingredient
        | for getting to a generalized solution, but it doesn't explain it.
        | Does anyone know what that is? Or is it an industrial secret of
        | OpenAI?
        
         | liliumregale wrote:
         | Regularization as a concept is taught in introductory ML
         | classes. A simple example is called L2 regularization: you
         | include in your loss function the sum of squares of the
         | parameters (times some constant k). This causes the parameter
         | values to compete between being good at modeling the training
         | data and satisfying this constraint--which (hopefully!) reduces
         | overfitting.
         | 
         | The specific regularization techniques that any one model is
         | trained with may not be publicly revealed, but OAI hardly
         | deserves credit for the concept.
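          | 
          | A minimal sketch of that L2 term (PyTorch; the model and k
          | below are placeholders):
          | 
          |     import torch
          | 
          |     model = torch.nn.Linear(10, 1)
          |     mse = torch.nn.MSELoss()
          |     k = 1e-4  # regularization strength
          | 
          |     def loss(x, y):
          |         data_loss = mse(model(x), y)
          |         # The L2 penalty: sum of squared parameters.
          |         l2 = sum((p ** 2).sum() for p in model.parameters())
          |         return data_loss + k * l2
          | 
          | Optimizers' weight_decay arguments implement (variants of)
          | the same idea without touching the loss function.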
        
       | mannykannot wrote:
       | _Sometimes, the networks instead find what the researchers call
       | the "pizza" algorithm. This approach imagines a pizza divided
       | into slices and numbered in order. To add two numbers, imagine
       | drawing arrows from the center of the pizza to the numbers in
       | question, then calculating the line that bisects the angle formed
       | by the first two arrows. This line passes through the middle of
       | some slice of the pizza: The number of the slice is the sum of
       | the two numbers._
       | 
        | I feel I'm looking really dumb here, but it is not obvious to me
        | that this works. Seeing as this is a pizza analogy, take, for
        | example, (1 + 1) mod 8. I do not see how the algorithm as set
        | out in the paper can be properly described in this way.
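        | 
        | To make the confusion concrete, here is a quick numeric check
        | of the literal reading (my interpretation of the article's
        | wording, not necessarily the paper's actual mechanism):
        | 
        |     import numpy as np
        | 
        |     def naive_pizza(a, b, n):
        |         # Arrows from the center to a and b on the rim.
        |         ta, tb = 2 * np.pi * a / n, 2 * np.pi * b / n
        |         va = np.array([np.cos(ta), np.sin(ta)])
        |         vb = np.array([np.cos(tb), np.sin(tb)])
        |         # The vector sum points along the bisector. (If the
        |         # arrows are opposite, the sum vanishes and the
        |         # bisector is undefined.)
        |         s = va + vb
        |         ang = np.arctan2(s[1], s[0]) % (2 * np.pi)
        |         return ang / (2 * np.pi / n)  # slice coordinate
        | 
        |     print(naive_pizza(1, 1, 8))  # 1.0, but (1 + 1) % 8 == 2
        |     print(naive_pizza(1, 3, 8))  # 2.0, but (1 + 3) % 8 == 4
        | 
        | The bisector tracks (a + b) / 2, not a + b, so the description
        | as quoted seems to compress at least one step.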
        
       ___________________________________________________________________
       (page generated 2024-04-14 23:02 UTC)