[HN Gopher] Neurons in Large Language Models: Dead, N-Gram, Posi...
___________________________________________________________________
Neurons in Large Language Models: Dead, N-Gram, Positional
Author : PaulHoule
Score : 86 points
Date : 2023-09-20 12:03 UTC (10 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| ACacctfortaday wrote:
| hmmm... For many decades researchers and experts have stated
| that we really don't understand what the heck is going on
| inside these artificial neural networks... Sounds like the
| decades-long quest to understand how a network of artificial
| neurons processes input data into useful output (for example:
| categorizing, deciding yes or no, or controlling the actuators
| of robots, etc.) has been cracked. That is, by working through
| linguistic concepts like 'tokens' and 'parsing', the
| researchers have been able to trace through the steps of what
| the FFNs are logically doing. [Regarding other commenters'
| comments, two points to keep in mind: 1) any trained ANN
| (FFNs included) reduces, after training, to a single fixed
| mathematical function: the learned weights define a closed-form
| expression that can be evaluated on its own to replicate the
| trained net, without ever running the training machinery again.
| {See most graduate-level machine learning textbooks for the
| details.} 2) "sparse" does not equal "unnecessary"; as others
| have suggested, it behaves more like a decision tree than a
| tangle of connections between artificial neurons doing heaven
| only knows what logically.]
|
| ~jgroch
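|
| A minimal sketch of point 1 above: once a small feed-forward
| net is trained, its forward pass is just a fixed composition of
| matrix products and nonlinearities that can be evaluated with
| no training machinery at all. The weights below are made up
| for illustration, not taken from any real model.
|
|     import numpy as np
|
|     # Made-up weights standing in for a trained two-layer net.
|     W1 = np.array([[0.5, -1.2], [0.8, 0.3]])   # hidden weights
|     b1 = np.array([0.1, -0.4])                 # hidden biases
|     W2 = np.array([[1.0, -0.7]])               # output weights
|     b2 = np.array([0.2])                       # output bias
|
|     def trained_net(x):
|         # The whole "net" is one closed-form expression:
|         # f(x) = W2 @ relu(W1 @ x + b1) + b2
|         h = np.maximum(0.0, W1 @ x + b1)
|         return W2 @ h + b2
|
|     print(trained_net(np.array([1.0, 2.0])))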
| FrustratedMonky wrote:
| Finding more and more similarities between computer-based neural
| networks and carbon-based brain neural networks, every day.
|
| It is only scale at this point. Human level functioning is
| coming.
|
| The bad news: humans are quirky and violent, don't have the
| best reinforcement learning themselves, and have poor alignment.
|
| Basically humans are a low bar for AI to reach.
| ben_w wrote:
| Perceptrons are, by design, analogous to organic neurones --
| only analogies, not fantastic models. Likewise the artificial
| networks are analogies to organic networks.
|
| It's therefore not surprising that the artificial behaves
| analogously to the organic, but it would be a mistake to assume
| they reproduce us accurately: GPT-3 is about a thousand times
| less complex than a human, while trained on more text than any
| of us could experience in ten millennia _and only text_. It has
| no emotions, unless the capacity to simulate affect happened to
| lead to this as an emergent property by accident (which I can't
| rule out but don't actually expect to be true).
| BobbyJo wrote:
| > GPT-3 is about a thousand times less complex than a human
|
| Also note that it takes the 'brute force' approach to
| architecture by using a transformer model, basically learning
| a connection graph from scratch. If you want that to scale to
| human complexity in function, you're probably going to have
| to overshoot in size by an order of magnitude.
| p1esk wrote:
| _GPT-3 is about a thousand times less complex than a human_
|
| How did you figure that? How many synapses are dedicated to
| text processing in a human brain? How much information do
| those synapses encode compared to the information encoded in
| GPT-3? How about GPT-4?
| __loam wrote:
| Estimates for how much information the brain encodes are
| several orders of magnitude higher than the biggest LLM, to
| the point where trying to replicate it is pushing the
| boundaries of computability. The brain is also
| significantly more adaptable and generalized thanks to
| neuroplasticity.
| Filligree wrote:
| I don't think GP was restricting 'human' to 'text
| processing elements of a human'.
|
| At any rate, it's certainly true that GPT-3 is missing a
| great deal. So is GPT-4.
| ben_w wrote:
| Correct; mine was an overall statement rather than purely
| about human language processing.
| FrustratedMonky wrote:
| Sure, they are modeled. One is math, one is atoms.
|
| But look at a single 'real' neuron, it is just calcium ions,
| electrical potentials, there is no 'emotion'.
|
| Once you can completely model a 'real' neuron (I know there
| is still some way to go in scale to achieve this) and then link
| together these exactly modeled 'real' neurons, what is to say
| it is not experiencing 'emotions', even though it is silicon?
|
| Humans give themselves too much credit for being special. "I
| feeeeeel, the computer can't". That is just not understanding
| yourself.
| ben_w wrote:
| > But look at a single 'real' neuron, it is just calcium
| ions, electrical potentials, there is no 'emotion'.
|
| Indeed; this is why I am willing to entertain the
| possibility that emotions may be present as an emergent
| property of simulating us. I don't expect it, but I can't
| rule it out.
| __loam wrote:
| You clearly lack understanding here too. Simulating "real"
| neurons at the scale required to simulate a brain is
| probably NP-hard. Even if we wanted to try, we don't have
| any maps of neuron connectivity with nearly the resolution
| required to do so.
| ben_w wrote:
| I think the bigger problem is we have no idea what the
| necessary or sufficient requirements are, neither for
| qualia nor for intelligence. (Not sure why you think it's
| NP-hard rather than just _lots_?)
|
| With intelligence we can at least tell when we've
| achieved it, to whatever standard we like.
|
| Emotions could probably be a thing where we can map some
| internal state to some emotional affect _display_,
| eventually; but what about any question of emotional
| _qualia_?
|
| AFAICT, we don't even have a good grasp of the question
| yet. We each have access to only the one inside our own
| head, and minimal capacity to even share what these are
| like with each other. When did you learn about
| aphantasia, for example? Is there a word for equivalents
| for that for other senses besides vision? I can
| "visualise" sounds and totally override the direction of
| down coming from my inner ears, but I can't "visualise"
| smells, and I don't have a non-visually oriented word for
| the idea, as both "visualise" and "imagine" clearly are
| and "idea" itself more subtly also is.
| _0ffh wrote:
| "Qualia" are really just a reframing of (an aspect of)
| consciousness, which has been speculated to be purely
| epiphenomenal. Maybe we're just along for the ride, and
| our actions merely happen to mirror our decisions - or
| the other way around, same difference.
| ben_w wrote:
| Qualia is how to discuss the problem of consciousness
| without a pointless discussion about if this is the
| opposite of unconscious, the opposite of subconscious,
| something that needs a soul, or any of the other things
| that go wrong if you don't taboo the "c" word.
|
| We don't know the answers (nor, I assert, the correct
| questions), though "what even is this?" and "what has
| it?" were already relevant to animal rights questions
| (qualia might have started with humans, but there's no
| reason to assume that) well before current AI, and even
| if we find a convenient proof that current AI definitely
| can't and that's fine... some of us want to advance the
| AI at least as far as brain uploading, where the qualia
| is the goal, though virtual hells like in the novel
| _Surface Detail_ seem to me to be an extremely plausible
| dystopian outcome if uploads ever actually happen.
| pushfoo wrote:
| > we don't have any maps of neuron connectivity
|
| We do for at least one classic, small-scale example: C.
| elegans. Despite mapping its roughly 300 neurons, the
| simulation attempts I'm aware of weren't very fruitful.
|
| > with nearly the resolution required to do so.
|
| I agree this may be part of why. Accurate simulation may
| require replicating subtle behavior outside the neuron
| body. Further maps or simulation attempts may have since
| been made, and possibly with better results. Given I
| don't remember headlines about this, it's likely that any
| improvements weren't groundbreaking.
|
| I don't know enough about the roles of glia and inter-
| neuron (not interneuron) behavior to discuss this further
| beyond wild speculation. Nor does anyone, as far as I
| know. Gaining that knowledge would probably be necessary
| to build connectomes with sufficient accuracy for
| simulation.
| FrustratedMonky wrote:
| Look up the paper on the ability of neurons to do XOR
| logic. It is far less than NP-hard.
| ben_w wrote:
| Do you mean biological neurones or perceptrons? There
| have been publications about both regarding XOR. If the
| latter, be aware that this was only about single layers
| _and_ that perceptrons have an unnecessary restriction on
| the way they can combine inputs.
| PaulHoule wrote:
| I find it pretty remarkable how in many visual recognition
| neural networks (say for the MNIST digits) you see neurons
| close to the input layer that respond similarly to neurons in
| the V1 area of the visual cortex.
|
| http://www.lifesci.sussex.ac.uk/home/George_Mather/Linked%20...
| sdenton4 wrote:
| It's wavelets all the way down.
| cuteboy19 wrote:
| There's basically only one way to solve that problem. Are
| we surprised every time 2+2 is 4?
| ben_w wrote:
| Did we have any reason to expect that in advance? This
| was the first time we tried, in this metaphor, to
| calculate 2+2 in the abstract.
| pmarreck wrote:
| > It is only scale at this point.
|
| I see, this is basically the "AI of the gaps" argument. :P
|
| It is NOT "only scale" at this point. But at the current rate,
| you will see this soon enough (I just happen to see it
| _already_). We'll have some very intelligent-seeming, very
| useful, but relatively uncreative zombies missing a "spark" (or
| any willpower, or any real source of what is valuable to them,
| or any sense of what is aesthetically pleasing or preferable or
| enjoyable or beautiful, or any consciousness for that matter).
| It will allow us to redefine what it means to be human. But our
| distinct human-ness will stand out even more at that point.
| FrustratedMonky wrote:
| LOL. Falling back on metaphysical arguments about 'sparks'
| and 'uniqueness' is NOT you 'seeing' something nobody else
| is able to.
| PaulHoule wrote:
| I would point to the problem that chatbots fail not at having
| a "spark" but at things ordinary computer software does well.
| The other day somebody pointed out in an HN conversation that
| I had gotten the 1984 Super Bowl confused with the 1986
| Super Bowl.
|
| That's a very human mistake. I'm sure somebody can tell you
| who played in every Super Bowl and what the score was, but
| people do misremember things frequently and we don't call it
| a hallucination (which is a defect in perception).
|
| "Superhuman intelligence" is easy to realize for sports
| statistics if you do the ontology and data entry work and put
| the data in a relational or similar database.
|
| The thing is that chatbots get 90% accuracy for cases where
| you can get 99.99% accuracy (sometimes the data entry is
| wrong) with conventional technology. There is a kind of faith
| that we can go to 10^17 or 10^30 parameters or something and
| at some point perfect performance will "emerge", but no, I
| think it is more like it will approach some asymptote, say
| 95%, and you will try harder and harder and it will be like
| pushing a bubble around under a rug. It's a common situation
| in failing technology projects, quite well documented in
|
| https://www.amazon.com/Friends-High-Places-Livingston-1990-1...
|
| but boy people are seduced by those situations and have a
| hard time recognizing that they are in them.
|
| In a certain sense chatbots already have superhuman powers of
| seduction that, I think, come from not having a "self" which
| makes mirroring easier to attain. People wouldn't ordinarily
| be impressed by a computer program that can sort a list of
| numbers 90% correctly, but give it the ability to apologize
| and many people will think it is _really sincere_ and really
| promising, it just needs a few trillion more transistors.
| (See the story _Trurl's Machine_ in Stanislaw Lem's excellent
| _Cyberiad_, except that _that_ machine is belligerent rather
| than agreeable.)
|
| Now an obvious path is to have the chatbot turn a question
| into a SQL query and then feed the results into conversation
| and that's a great idea and an active research area, but I'd
| point out the dialogues between Achilles and the Tortoise in
|
| https://en.wikipedia.org/wiki/G%C3%B6del,_Escher,_Bach
|
| which people mistakenly think are about symbolic A.I. but
| which are really about the problem of solving problems where
| the correct solution has a logical aspect. Even though logic
| isn't everything, the formulation of most problems (like "Who
| won the soccer game at Cornell last night?") is fundamentally
| logical and leads you straight to paradoxes that can have you
| forever pushing a bubble under the rug and thinking "just one
| more" little hack will fix it...
| fnordpiglet wrote:
| LLMs are just one tool in a collection. Intelligence is
| based on many models, not just the language parts of our
| brain - and I expect AI to incorporate more models in a
| system approach. Why does it matter if LLMs are able to
| play chess at a grandmaster level or not? They can delegate
| the actual chess optimization problem to a chess optimizing
| program. While it's interesting that language alone is as
| powerful as it is, it's very myopic to judge the tool alone
| and not as a part of a toolbox.
| FrustratedMonky wrote:
| Exactly. It is NOT all about LLMs. There are a lot of
| other successful models, from AlphaGo to vision
| systems to robotics. LLMs are just the latest shiny thing.
|
| At some point they will all be tied together, and at that
| point it will start to look a lot more like sections of
| our brain: one is vision, one is language, one is
| movement, etc.
| candiodari wrote:
| I think it's already been made clear that the main reason
| for the "asymptote" is wrong data input. These models
| attempt to learn from random internet text ... and this
| turns out to not be all that accurate.
|
| Also, I've observed a model I was training having the same
| problem as I do myself. If I at any point learn wrong data,
| which happens of course, then getting that wrong data back
| out is very hard and requires 10 or 50 times the effort I
| spent learning the initial data. In fact I strongly suspect
| I never unlearn bad data, I just _additionally_ learn "if
| I say X, it's wrong, say Y instead".
| jacquesm wrote:
| Brains suck for exact work such as database work or precise
| calculations over longer chains. But they excel at
| approximate work, and that's a very useful skill to have as
| long as _if you have to_ you can fire up the pencil and
| paper and do your precise calculations that way. And paper
| works fine for database work as well and will remember all
| of those sports stats for as long as you care (and even
| after you're dead).
|
| Brains are so powerful because they are universal, they can
| use auxiliary data stores and co-processors just fine.
| visarga wrote:
| I think much of that spark is actually contextual language
| generation. We're relying on learned language patterns for
| most of our sparks.
| fnordpiglet wrote:
| Agency is a key part of that spark, but we have done all
| sorts of research into agency, and I think providing goal-
| based agents within an AI framework incorporating LLMs as
| well as other optimizers and solvers will provide the
| majority of that spark. The process of creativity depends on
| both internal agency and goal setting unmoored from external
| dictation, and on semantic synthesis of abductively reasoned
| concepts, with an aesthetic that feeds into the goal-based
| optimizer. These are things that can be simulated to the
| point that, while there may be an uncanny valley somewhere,
| it'll be close enough to be hard to distinguish.
|
| But I do wonder if the practical utility of such an entity is
| worth the amount of effort and capital required to build and
| sustain it. I suspect it'll be more a novelty than a
| practical tool.
| kelseyfrog wrote:
| Our distinct humanness will be found in the narcissism of
| small differences.
| jeroenvlek wrote:
| Great work, please put this on a tile :D
| __loam wrote:
| The only narcissists here are the computer scientists who
| have convinced themselves they've made god.
| FrustratedMonky wrote:
| Nobody is claiming they have made god today.
|
| That is on the roadmap for 2030.
| pushfoo wrote:
| If I read the paper correctly, it seems to support the old quip
| about all AI being decision trees, at least for smaller model
| sizes.
|
| It also raises an interesting UX question: is there an implicit
| tradeoff between legibility and power for notations, as there is
| for different ways of expressing rotation, or is that a consequence
| of using graphics hardware to implement AI?
| PaulHoule wrote:
| A ReLU network looks a bit like a network of decision trees,
| that's for sure.
| jszymborski wrote:
| When I read "Dead Neuron" I definitely associated it with
| ReLUs.
|
| It's claimed that this tendency to create dead neurons has a
| regularizing effect, but frankly I don't buy that.
|
| Anecdotally, I've always had better results with Leaky ReLUs
| or Mish.
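|
| A rough sketch of why ReLU units "die" while the leaky variants
| keep a gradient (toy numbers, PyTorch, purely illustrative):
|
|     import torch
|     import torch.nn.functional as F
|
|     # A pre-activation that has drifted strongly negative.
|     z = torch.tensor([-3.0], requires_grad=True)
|
|     relu_out = F.relu(z)         # 0.0, and gradient 0: "dead"
|     leaky_out = F.leaky_relu(z)  # small negative, nonzero gradient
|     mish_out = F.mish(z)         # smooth, also keeps some gradient
|
|     relu_out.backward()
|     print(z.grad)                # tensor([0.]) -- no signal to recover
|     z.grad = None
|     leaky_out.backward()
|     print(z.grad)                # tensor([0.0100]) -- can still learn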
| pmarreck wrote:
| I guess the obvious next step is to cull the dead neurons from
| the model data? I imagine the brain does the same thing after the
| childhood years.
| kmeisthax wrote:
| Most models have a purely fixed architecture: i.e. if you train
| three layers of six feed-forward neurons apiece, your model
| and the training data it learns from will just fit within that
| architecture to the extent that gradient descent can _force_ it
| to fit. There is no mechanism to say "oh these neurons will
| never activate, let's prune them", or "we'd have much better
| loss if we added a layer here".
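|
| To make that concrete, a minimal PyTorch sketch of such a fixed
| architecture (the input and output sizes here are made up):
|
|     import torch.nn as nn
|
|     # Three hidden layers of six neurons apiece, fixed at
|     # construction time.
|     model = nn.Sequential(
|         nn.Linear(4, 6), nn.ReLU(),
|         nn.Linear(6, 6), nn.ReLU(),
|         nn.Linear(6, 6), nn.ReLU(),
|         nn.Linear(6, 1),
|     )
|
|     # Gradient descent only ever adjusts these fixed-shape
|     # tensors; nothing built in prunes a dead unit or adds a
|     # layer.
|     print([tuple(p.shape) for p in model.parameters()])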
|
| In the dark times before PyTorch there was an idea called NEAT:
| "neuroevolution of augmenting topologies", which tried to use a
| genetic algorithm (i.e. testing a bunch of slightly modified
| solutions for loss) to discover both optimal weights _and_
| optimal network structure. I don't think this gets used all
| that often, if at all.
|
| I hear stories every few years about how many models have
| unused neurons that can be pruned or how hyperparameter
| selection is a pain in the ass, but nothing about automating
| the pruning and parameter selection in such a way as to
| efficiently use the whole model. I'm not sure it's necessary
| anyway.
| tysam_and wrote:
| DL researcher here. It's also hard, I think, because
| experimentally many of us have noted (also some research on
| this) that there's a critical early phase to learning that
| conditions the network a certain way, and adding and/or
| removing layers later seems to actually be quite difficult
| (removing easier than adding, by far, esp w/ something like
| variational dropout [yes, i cite the old deep magicks]:
| https://arxiv.org/abs/1701.05369)
| hiddencost wrote:
| I had success doubling the width of a layer by copying,
| adding epsilon noise and then fine tuning.
|
| Transform from [X] to [X X+epsilon].
|
| In that case I was trying to fuse two signals that had
| fairly similar characteristics.
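|
| Something like this is how I read that (a sketch, not the exact
| procedure; the function name and epsilon are mine): copy the
| layer's weights, add a little noise to the copy, and widen the
| next layer's inputs to match before fine-tuning.
|
|     import torch
|     import torch.nn as nn
|
|     def double_width(fc1: nn.Linear, fc2: nn.Linear, eps=1e-2):
|         """Widen fc1 from [X] to [X, X + epsilon] and adapt fc2."""
|         w, b = fc1.weight.data, fc1.bias.data
|         new1 = nn.Linear(fc1.in_features, 2 * fc1.out_features)
|         new1.weight.data = torch.cat(
|             [w, w + eps * torch.randn_like(w)], dim=0)
|         new1.bias.data = torch.cat([b, b.clone()], dim=0)
|
|         # Halve and duplicate fc2's input columns so the overall
|         # function is (roughly) unchanged before fine-tuning.
|         w2 = fc2.weight.data
|         new2 = nn.Linear(2 * fc2.in_features, fc2.out_features)
|         new2.weight.data = torch.cat([w2 / 2, w2 / 2], dim=1)
|         new2.bias.data = fc2.bias.data.clone()
|         return new1, new2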
| tysam_and wrote:
| Yes, unfortunately I have yet to get that technique to beat
| just using the larger network from the start in wallclock
| convergence speed. :'((((
|
| Whoever figures out how to clearly and effectively do it
| consistently faster than a 'full network from the start'
| version will open up an entirely new subfield of neural
| network training efficiency. :'))))
| sdenton4 wrote:
| There's a huge amount of work on model pruning, especially
| with an eye towards model reduction for on-device
| deployments. I've done a bunch of work in this space focused
| on getting speech synthesis working in real time on phones.
| It works and can be automated.
|
| There's a lot of nuance though. What has typically worked
| best are variants on iterative magnitude weight pruning
| rather than activation pruning.
|
| This can often get rid of 90% of the weights with minimal
| impact on quality. Structured pruning lets you remove blocks
| of weights while retaining good SIMD utilization... and so
| on.
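|
| For anyone curious, a bare-bones version of magnitude pruning
| (illustrative only; the iterative and structured variants add
| more machinery on top of this):
|
|     import torch
|     import torch.nn as nn
|
|     def magnitude_prune(layer: nn.Linear, sparsity=0.9):
|         """Zero the smallest-magnitude weights in one layer."""
|         w = layer.weight.data
|         threshold = torch.quantile(w.abs().flatten(), sparsity)
|         mask = (w.abs() >= threshold).float()
|         layer.weight.data = w * mask
|         return mask  # re-apply after any further training steps
|
|     layer = nn.Linear(128, 128)
|     mask = magnitude_prune(layer, sparsity=0.9)
|     print(f"kept {int(mask.sum())} of {mask.numel()} weights")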
| radarsat1 wrote:
| > There is no mechanism to say "oh these neurons will never
| activate, let's prune them"
|
| Not at training time. But after training you can evaluate
| which activations rarely fire and simply remove their columns
| from the weight matrix.
|
| Or you could re-randomize them and continue training.
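|
| A rough sketch of that post-hoc step for a simple two-layer
| MLP (my own naming): run some data through, find hidden units
| that never fire, then drop the matching rows of the first
| layer and columns of the second.
|
|     import torch
|     import torch.nn as nn
|
|     def prune_dead_units(fc1: nn.Linear, fc2: nn.Linear,
|                          x: torch.Tensor):
|         """Drop hidden units whose ReLU output never fires on x."""
|         with torch.no_grad():
|             acts = torch.relu(fc1(x))         # (batch, hidden)
|             alive = (acts > 0).any(dim=0)     # fired at least once
|         kept = int(alive.sum())
|         new1 = nn.Linear(fc1.in_features, kept)
|         new1.weight.data = fc1.weight.data[alive]     # keep rows
|         new1.bias.data = fc1.bias.data[alive]
|         new2 = nn.Linear(kept, fc2.out_features)
|         new2.weight.data = fc2.weight.data[:, alive]  # keep cols
|         new2.bias.data = fc2.bias.data.clone()
|         return new1, new2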
___________________________________________________________________
(page generated 2023-09-20 23:01 UTC)