[HN Gopher] How do machines 'grok' data?
___________________________________________________________________
How do machines 'grok' data?
Author : nsoonhui
Score : 89 points
Date : 2024-04-13 05:17 UTC (1 day ago)
(HTM) web link (www.quantamagazine.org)
(TXT) w3m dump (www.quantamagazine.org)
| verisimi wrote:
| > Automatic testing revealed this unexpected accuracy to the rest
| of the team, and they soon realized that the network had found
| clever ways of arranging the numbers a and b. Internally, the
| network represents the numbers in some high-dimensional space,
| but when the researchers projected these numbers down to 2D space
| and mapped them, the numbers formed a circle.
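|
| (A minimal sketch of that kind of projection, assuming a learned
| embedding matrix is already in hand; the shapes and names below
| are placeholders, not the researchers' actual code:)
|
|     import numpy as np
|
|     # Placeholder for the trained network's learned embeddings:
|     # one vector per residue mod 97.
|     emb = np.random.randn(97, 128)
|
|     # PCA projection to 2D: center, keep the top two directions.
|     centered = emb - emb.mean(axis=0)
|     _, _, vt = np.linalg.svd(centered, full_matrices=False)
|     coords_2d = centered @ vt[:2].T   # (97, 2) points to plot;
|                                       # with the real trained
|                                       # embeddings, the article
|                                       # says these form a circle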
| mjburgess wrote:
| NNs (indeed, all statistical fitting algs) have no relevant
| properties here: the properties derive purely from the structure
| of the dataset. Here (https://arxiv.org/abs/2201.02177) NNs are
| trained on a 'complete world' problem, i.e., modular arithmetic,
| where the whole outcome space is trivial in size, abstract, and
| comes with complete information.
|
| Why should NNs eventually find a representation of this tiny,
| abstract, fully-representable outcome space after an arbitrary
| amount of training time? Well, they will do so, eventually, if
| this outcome space can be fully represented by sequences of
| conditional probabilities.
|
| There is nothing more to this 'discovery' than that some trivial
| abstract mathematical spaces can be represented as conditional
| probability structures. Is this even a discovery?
|
| One has to imagine this deception is perpetrated because the
| peddlers of such systems want to attribute the problem's
| structure to properties of NNs in general, and thereby say,
| "well, if you train NNs on face shapes, phrenology becomes
| possible!" I.e., it is a way of whitewashing their broken,
| half-baked generative AI systems, where the problem domain isn't
| arithmetic mod 97.
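|
| (For concreteness, a minimal sketch of the 'complete world' setup
| being described: the full addition table mod 97 with part of it
| held out for testing. The split fraction and seed are arbitrary
| placeholders, not taken from the paper:)
|
|     import random
|
|     P = 97  # modulus; the outcome space is all (a, b) pairs
|
|     # Enumerate the complete outcome space: 97 * 97 = 9409 rows.
|     table = [(a, b, (a + b) % P) for a in range(P) for b in range(P)]
|
|     random.seed(0)                       # arbitrary seed
|     random.shuffle(table)
|     split = len(table) // 2              # arbitrary 50/50 split
|     train, test = table[:split], table[split:]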
| sillysaurusx wrote:
| Firstly, thanks for linking the paper.
|
| I think you're being unfair. I say that as someone who has done
| his share of charlatan hunting.
|
| They explicitly list their contributions in the paper. They're
| not saying they did something they didn't. It's not like that
| bogus "rat brain flies plane" paper that was doing something
| simple under the hood and then dressing it up as a world
| changing discovery in order to gain funding. They are doing
| something simple and studying it carefully. This is grade A
| science as far as I'm concerned, every bit as good as studying
| the motion of the stars and trying to find patterns.
|
| It did not occur to me that NNs might be able to extrapolate a
| tiny training dataset into complete solutions. ML scientists
| are taught from an early age that you need to have a large
| dataset to make any progress. I don't know whether this paper
| counts as a "discovery", but it was certainly a fun read, and
| that was enough for me.
|
| In some sense this paper is a proof of the idea that NNs can
| extrapolate, not merely memorize. This is in contrast to recent
| work where researchers have been claiming otherwise.
|
| The weight decay study was also a nice touch. It's no discovery
| to say that weight decay helps, but it's a reminder to use it.
| I haven't. Adam has always been reliable for me, and now I
| wonder if it was a mistake to shy away from weight decay. (We
| wanted to keep our training pipeline as simple as possible, in
| case any part of it might be causing problems. And lots of
| unexpected parts did.)
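|
| (For what it's worth, adding weight decay on top of Adam is a
| one-line change in PyTorch via AdamW; the model and numbers
| below are placeholders, not anyone's actual settings:)
|
|     import torch
|
|     model = torch.nn.Linear(16, 16)   # placeholder model
|
|     # Plain Adam: no weight decay by default.
|     opt_plain = torch.optim.Adam(model.parameters(), lr=1e-3)
|
|     # AdamW: decoupled weight decay applied at every step.
|     opt_decay = torch.optim.AdamW(model.parameters(), lr=1e-3,
|                                   weight_decay=1e-2)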
|
| Now, I haven't read the article submitted here, only the paper
| you linked. Maybe they're claiming something more than the
| paper. But if so, then that is a (very real) problem with
| scientific journalism, and not necessarily the scientists
| themselves. It depends how much the scientists are leading or
| misleading the reporters. It's important to separate the
| criticism of the work from the reporting around the work.
|
| I'd also be curious if you have any citations for your claim
| that if an outcome space can be represented as a sequence of
| conditional probabilities, then NNs are guaranteed to find a
| solution after some unknown amount of training time. This is a
| surprising thing to me.
| gus_massa wrote:
| The interesting part is that with training you can go:
|
| random -> generalization -> overfitting -> better generalization
|
| The last step was unknown IIUC.
|
| The relevant trick is "regularization":
|
| > _This is a process that reduces the model's functional
| capacity -- the complexity of the function that the model can
| learn._
| qrios wrote:
| A good read if the how and why of NNs working is of interest.
| Also, the HN discussion [1] about this paper [2] is related,
| from my point of view.
|
| [1] https://news.ycombinator.com/item?id=40019217
|
| [2] "From Words to Numbers: Your Large Language Model Is Secretly
| A Capable Regressor When Given In-Context Examples"
| https://arxiv.org/abs/2404.07544v1
|
| (Edit: format)
| schlauerfox wrote:
| machines lack the ability to 'grok'.
| ang_cire wrote:
| Yep. Computers are just binary-state machines. Everything you
| can do with a computer, you could do with water channels and
| pulleys and gates (if you could assemble trillions and
| trillions of them). No one would ask whether the water-channel
| system 'groks' things, but because they are miniaturized and
| (to 99.9999% of people) esoteric in their actual workings,
| people treat them like magic.
|
| Clarke was right, but sadly it doesn't take alien technology
| for people to start anthropomorphizing computers or thinking
| that they have magically transcended mere (though obviously
| very complexly-arranged) logic gates; it just takes a GUI that
| spits out algorithmically-generated pictures and text.
| munchausen42 wrote:
| I think most of what a brain cell effectively does could be
| simulated with water channels, pulleys and gates - so don't
| expect humans to grok either.
| __MatrixMan__ wrote:
| > No one would ask whether the water-channel system 'groks'
| things
|
| I would. It's very common to describe the flow of electricity
| as similar to the flow of water. If it's electricity in my
| brain that allows me to understand, why couldn't there be an
| analogous system involving water which also understands?
|
| Any substrate which supports the necessary logical operations
| ought to be sufficient. To believe otherwise seems needlessly
| anthropocentric.
| jrflowers wrote:
| > Any substrate which supports the necessary logical
| operations ought to be sufficient. To believe otherwise
| seems needlessly anthropocentric.
|
| This makes sense. Anything that performs logic using
| electricity is conscious and comparable to the human brain.
| This is obviously true because there is no word or concept
| for anthropomorphizing
| Starman_Jones wrote:
| I thought the article did a very good job of explaining why the
| term 'grokking' was used to describe this emergent behavior,
| and how it fit Heinlein's original definition. I'm curious
| which part of their explanation you feel is incorrect.
| zeegroen wrote:
| So the article mentions "regularization" as the secret ingredient
| to get to a generalized solution, but they don't explain it. Does
| someone know what that is? Or is it a trade secret of OpenAI?
| liliumregale wrote:
| Regularization as a concept is taught in introductory ML
| classes. A simple example is called L2 regularization: you
| include in your loss function the sum of squares of the
| parameters (times some constant k). This causes the parameter
| values to compete between being good at modeling the training
| data and satisfying this constraint--which (hopefully!) reduces
| overfitting.
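|
| (A minimal sketch of that L2 penalty in PyTorch; the model, k,
| and data below are placeholders, just to show where the
| sum-of-squares term enters the loss:)
|
|     import torch
|
|     model = torch.nn.Linear(16, 1)    # placeholder model
|     k = 1e-4                          # regularization strength
|
|     x, y = torch.randn(8, 16), torch.randn(8, 1)   # dummy batch
|     data_loss = torch.nn.functional.mse_loss(model(x), y)
|
|     # L2 penalty: k times the sum of squared parameter values.
|     l2 = k * sum((p ** 2).sum() for p in model.parameters())
|     loss = data_loss + l2
|     loss.backward()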
|
| The specific regularization techniques that any one model is
| trained with may not be publicly revealed, but OAI hardly
| deserves credit for the concept.
| mannykannot wrote:
| _Sometimes, the networks instead find what the researchers call
| the "pizza" algorithm. This approach imagines a pizza divided
| into slices and numbered in order. To add two numbers, imagine
| drawing arrows from the center of the pizza to the numbers in
| question, then calculating the line that bisects the angle formed
| by the first two arrows. This line passes through the middle of
| some slice of the pizza: The number of the slice is the sum of
| the two numbers._
|
| I feel like I'm being really dumb here, but it is not obvious to
| me that this works. Seeing as this is a pizza analogy, take, for
| example, 1 + 1 mod 8 = ? I do not see how the algorithm as set
| out in the paper can properly be described in this way.
___________________________________________________________________
(page generated 2024-04-14 23:02 UTC)