[HN Gopher] Neurons in Large Language Models: Dead, N-Gram, Posi...
       ___________________________________________________________________
        
       Neurons in Large Language Models: Dead, N-Gram, Positional
        
       Author : PaulHoule
       Score  : 86 points
       Date   : 2023-09-20 12:03 UTC (10 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | ACacctfortaday wrote:
        | Hmmm... For many decades researchers and experts have said that
        | we really don't understand what is going on inside these
        | artificial neural networks... Sounds like the decades-long quest
        | to understand how a network of artificial neurons processes
        | input data into useful output (for example: categorizing,
        | deciding yes or no, or controlling the actuators of robots)
        | ...has been cracked. That is, by working through linguistic
        | concepts like 'tokens' and 'parsing', the researchers have been
        | able to trace the steps of what the FFNs are logically doing.
        | [Regarding other commenters' comments, two points to keep in
        | mind: 1) any trained ANN (in machine learning generally, as well
        | as FFNs) can be reduced to a single mathematical formula that
        | replicates the processing done by the network. In other words,
        | after training, the result can be reduced to a piece of math
        | that alone can be reused to replicate the trained net WITHOUT
        | the actual ANN anymore - the ANN is used only for training, in
        | order to derive a definite mathematical result for future use.
        | {See most graduate-level textbooks on machine learning for the
        | details.} 2) "sparse" does not equal "unnecessary"; as others
        | have suggested, it sounds like a decision tree rather than a
        | tangle of connections between artificial neurons doing heaven
        | only knows what logically.]
       | 
       | ~jgroch
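
        A minimal sketch of the "single formula" point in (1) above: once
        trained, a small feed-forward network is just a fixed composition
        of matrix multiplies and nonlinearities, and can be evaluated
        without any ML framework. The weights below are made-up
        placeholders, not exported from a real trained model.

            import numpy as np

            # Stand-in weights for a "trained" 2-layer network; in practice
            # these would be exported after training.
            W1 = np.array([[0.2, -0.5], [0.7, 0.1], [-0.3, 0.4]])  # 3 inputs -> 2 hidden
            b1 = np.array([0.1, -0.2])
            W2 = np.array([[1.5], [-0.8]])                         # 2 hidden -> 1 output
            b2 = np.array([0.05])

            def net(x):
                """One closed-form expression: f(x) = relu(x W1 + b1) W2 + b2."""
                h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer
                return h @ W2 + b2

            print(net(np.array([1.0, 0.0, -1.0])))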
        
       | FrustratedMonky wrote:
        | Finding more and more similarities between computer-based neural
        | networks and carbon-based brain neural networks, every day.
       | 
       | It is only scale at this point. Human level functioning is
       | coming.
       | 
        | The bad news: humans are quirky and violent, don't have the
        | best reinforcement learning themselves, and have poor alignment.
       | 
       | Basically humans are a low bar for AI to reach.
        
         | ben_w wrote:
         | Perceptrons are, by design, analogous to organic neurones --
         | only analogies, not fantastic models. Likewise the artificial
         | networks are analogies to organic networks.
         | 
         | It's therefore not surprising that the artificial behaves
         | analogously to the organic, but it would be a mistake to assume
         | they reproduce us accurately: GPT-3 is about a thousand times
         | less complex than a human, while trained on more text than any
         | of us could experience in ten millennia _and only text_. It has
         | no emotions, unless the capacity to simulate affect happened to
          | lead to this as an emergent property by accident (which I
          | can't rule out but don't actually expect to be true).
        
           | BobbyJo wrote:
           | > GPT-3 is about a thousand times less complex than a human
           | 
           | Also note that it takes the 'brute force' approach to
           | architecture by using a transformer model, basically learning
           | a connection graph from scratch. If you want that to scale to
           | human complexity in function, you're probably going to have
           | to overshoot in size by an order of magnitude.
        
           | p1esk wrote:
           | _GPT-3 is about a thousand times less complex than a human_
           | 
           | How did you figure that? How many synapses are dedicated to
           | text processing in a human brain? How much information do
           | those synapses encode compared to the information encoded in
           | GPT-3? How about GPT-4?
        
             | __loam wrote:
             | Estimates for how much information the brain encodes are
              | several orders of magnitude higher than the biggest LLM, to
             | the point where trying to replicate it is pushing the
             | boundaries of computability. The brain is also
             | significantly more adaptable and generalized thanks to
             | neuroplasticity.
        
             | Filligree wrote:
             | I don't think GP was restricting 'human' to 'text
             | processing elements of a human'.
             | 
             | At any rate, it's certainly true that GPT-3 is missing a
             | great deal. So is GPT-4.
        
               | ben_w wrote:
               | Correct; mine was an overall statement rather than purely
               | about human language processing.
        
           | FrustratedMonky wrote:
           | Sure, they are modeled. One is math, one is atoms.
           | 
           | But look at a single 'real' neuron, it is just calcium ions,
           | electrical potentials, there is no 'emotion'.
           | 
            | Once you can completely model a 'real' neuron (I know
            | there is still some way to go to achieve this) and then
            | link these exactly modeled 'real' neurons together, what
            | is to say it is not experiencing 'emotions', even though
            | it is silicon?
           | 
           | Humans give themselves too much credit for being special. "I
           | feeeeeel, the computer can't". That is just not understanding
           | yourself.
        
             | ben_w wrote:
             | > But look at a single 'real' neuron, it is just calcium
             | ions, electrical potentials, there is no 'emotion'.
             | 
             | Indeed; this is why I am willing to entertain the
             | possibility that emotions may be present as an emergent
             | property of simulating us. I don't expect it, but I can't
             | rule it out.
        
             | __loam wrote:
             | You clearly lack understanding here too. Simulating "real"
             | neurons at the scale required to simulate a brain is
              | probably NP-hard. Even if we wanted to try, we don't have
             | any maps of neuron connectivity with nearly the resolution
             | required to do so.
        
               | ben_w wrote:
               | I think the bigger problem is we have no idea what the
               | necessary or sufficient requirements are, neither for
               | qualia nor for intelligence. (Not sure why you think it's
                | NP-hard rather than just _lots_?)
               | 
               | With intelligence we can at least tell when we've
               | achieved it, to whatever standard we like.
               | 
               | Emotions could probably be a thing where we can map some
                | internal state to some emotional affect _display_,
               | eventually; but what about any question of emotional
               | _qualia_?
               | 
               | AFAICT, we don't even have a good grasp of the question
                | yet. We each have access only to the one inside our own
               | head, and minimal capacity to even share what these are
               | like with each other. When did you learn about
               | aphantasia, for example? Is there a word for equivalents
               | for that for other senses besides vision? I can
               | "visualise" sounds and totally override the direction of
               | down coming from my inner ears, but I can't "visualise"
               | smells, and I don't have a non-visually oriented word for
               | the idea, as both "visualise" and "imagine" clearly are
               | and "idea" itself more subtly also is.
        
               | _0ffh wrote:
               | "Qualia" are really just a reframing of (an aspect of)
               | consciousness, which has been speculated to be purely
               | epiphenomenal. Maybe we're just along for the ride, and
               | our actions merely happen to mirror our decisions - or
               | the other way around, same difference.
        
               | ben_w wrote:
               | Qualia is how to discuss the problem of consciousness
               | without a pointless discussion about if this is the
               | opposite of unconscious, the opposite of subconscious,
               | something that needs a soul, or any of the other things
               | that go wrong if you don't taboo the "c" word.
               | 
               | We don't know the answers (nor, I assert, the correct
               | questions), though "what even is this?" and "what has
               | it?" were already relevant to animal rights questions
               | (qualia might have started with humans, but there's no
               | reason to assume that) well before current AI, and even
               | if we find a convenient proof that current AI definitely
               | can't and that's fine... some of us want to advance the
               | AI to at least as far as brain uploading where the qualia
               | is the goal, though virtual hells like in the novel
               | _Surface Detail_ seem to me to be an extremely plausible
               | dystopian outcome if uploads ever actually happen.
        
               | pushfoo wrote:
               | > we don't have any maps of neuron connectivity
               | 
                | We do for at least one classic, small-scale example:
                | C. elegans. Despite mapping the roughly 300 neurons, the
               | simulation attempts I'm aware of weren't very fruitful.
               | 
               | > with nearly the resolution required to do so.
               | 
               | I agree this may be part of why. Accurate simulation may
               | require replicating subtle behavior outside the neuron
               | body. Further maps or simulation attempts may have since
               | been made, and possibly with better results. Given I
               | don't remember headlines about this, it's likely that any
               | improvements weren't groundbreaking.
               | 
               | I don't know enough about the roles of glia and inter-
               | neuron (not interneuron) behavior to discuss this further
               | beyond wild speculation. Nor does anyone, as far as I
               | know. Gaining that knowledge would probably be necessary
               | to build connectomes with sufficient accuracy for
               | simulation.
        
               | FrustratedMonky wrote:
                | Look up the paper on the ability of single neurons to
                | do XOR logic. It is far less than NP-hard.
        
               | ben_w wrote:
               | Do you mean biological neurones or perceptrons? There
               | have been publications about both regarding XOR. If the
               | latter, be aware that this was only about single layers
               | _and_ that perceptrons have an unnecessary restriction on
               | the way they can combine inputs.
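
        For reference, the classic result being alluded to on the
        artificial side: a single linear-threshold unit cannot compute
        XOR, while one hidden layer can. A tiny sketch with hand-picked
        (not trained) weights:

            import numpy as np

            step = lambda z: (z > 0).astype(int)

            def xor_two_layer(x1, x2):
                # Two hidden units compute OR and AND; the output is OR AND (NOT AND).
                x = np.array([x1, x2])
                h = step(x @ np.array([[1, 1], [1, 1]]) + np.array([-0.5, -1.5]))  # [OR, AND]
                return int(step(h @ np.array([1, -1]) - 0.5))

            for a in (0, 1):
                for b in (0, 1):
                    print(a, b, xor_two_layer(a, b))  # prints 0, 1, 1, 0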
        
           | PaulHoule wrote:
           | I find it pretty remarkable how in many visual recognition
           | neural networks (say for the MNIST digits) you see neurons
           | close to the input layer that respond similarly to neurons in
           | the V1 area of the visual cortex.
           | 
           | http://www.lifesci.sussex.ac.uk/home/George_Mather/Linked%20.
           | ..
        
             | sdenton4 wrote:
             | It's wavelets all the way down.
        
             | cuteboy19 wrote:
             | There's basically only one way to solve that problem. Are
             | we surprised every time 2+2 is 4?
        
               | ben_w wrote:
               | Did we have any reason to expect that in advance? This
                | was the first time we tried, in this metaphor, to
               | calculate 2+2 in the abstract.
        
         | pmarreck wrote:
         | > It is only scale at this point.
         | 
         | I see, this is basically the "AI of the gaps" argument. :P
         | 
         | It is NOT "only scale" at this point. But at the current rate,
         | you will see this soon enough (I just happen to see it
          | _already_). We'll have some very intelligent-seeming, very
         | useful, but relatively uncreative zombies missing a "spark" (or
         | any willpower, or any real source of what is valuable to them,
         | or any sense of what is aesthetically pleasing or preferable or
         | enjoyable or beautiful, or any consciousness for that matter).
         | It will allow us to redefine what it means to be human. But our
         | distinct human-ness will stand out even more at that point.
        
           | FrustratedMonky wrote:
           | LOL. Falling back on metaphysical arguments about 'sparks'
            | and 'uniqueness' is NOT you 'seeing' something nobody else
            | is able to.
        
           | PaulHoule wrote:
           | I would point to the problem that chatbots fail not at having
           | a "spark" but at things ordinary computer software does well.
           | The other day somebody pointed out in an HN conversation that
            | I had gotten the 1984 Super Bowl confused with the 1986
            | Super Bowl.
           | 
            | That's a very human mistake; I'm sure somebody can tell
            | you who played in every Super Bowl and what the score
            | was, but people do misremember things frequently and we
            | don't call that a hallucination (which is a defect in
            | perception).
           | 
           | "Superhuman intelligence" is easy to realize for sports
           | statistics if you do the ontology and data entry work and put
           | the data in a relational or related database.
           | 
           | The thing is that chatbots get 90% accuracy for cases where
           | you can get 99.99% accuracy (sometimes the data entry is
           | wrong) with conventional technology. There is a kind of faith
           | that we can go to 10^17 or 10^30 parameters or something and
            | at some point perfect performance will "emerge", but no,
            | I think it is more like it will approach some asymptote,
            | say 95%, and you will try harder and harder and it will
            | be like pushing a bubble around under a rug. It's a
            | common situation
           | in failing technology projects, quite well documented in
           | 
           | https://www.amazon.com/Friends-High-Places-
           | Livingston-1990-1...
           | 
           | but boy people are seduced by those situations and have a
           | hard time recognizing that they are in them.
           | 
           | In a certain sense chatbots already have superhuman powers of
            | seduction that, I think, come from not having a "self", which
           | makes mirroring easier to attain. People wouldn't ordinarily
           | be impressed by a computer program that can sort a list of
           | numbers 90% correctly but give it the ability to apologize
           | and many people will think it is _really sincere_ and think
           | it is really promising, it just needs a few trillion more
           | transistors. (See the story _Trurl's Machine_ in Stanislaw
           | Lem's excellent _Cyberiad_ except _that_ machine is
           | belligerent and not agreeable)
           | 
           | Now an obvious path is to have the chatbot turn a question
           | into a SQL query and then feed the results into conversation
           | and that's a great idea and an active research area, but I'd
           | point out the dialogues between Achilles and the Tortoise in
           | 
           | https://en.wikipedia.org/wiki/G%C3%B6del,_Escher,_Bach
           | 
            | which people mistakenly think is about symbolic A.I. but
            | which is really about the problem of solving problems
            | where the
           | correct solution has a logical aspect. Even though logic
           | isn't everything, the formulation of most problems (like "Who
           | won the soccer game at Cornell last night?") is fundamentally
           | logical and leads you straight to paradoxes that can have you
           | forever pushing a bubble under the rug and thinking "just one
           | more" little hack will fix it...
        
             | fnordpiglet wrote:
             | LLMs are just one tool in a collection. Intelligence is
             | based on many models, not just the language parts of our
             | brain - and I expect AI to incorporate more models in a
             | system approach. Why does it matter if LLMs are able to
             | play chess at a grandmaster level or not? They can delegate
             | the actual chess optimization problem to a chess optimizing
              | program. While it's interesting that language alone is as
             | powerful as it is, it's very myopic to judge the tool alone
             | and not as a part of a toolbox.
        
               | FrustratedMonky wrote:
                | Exactly. It is NOT all about LLMs. There are a lot of
                | other successful models, from AlphaGo to vision
                | systems to robotics. LLMs are just the latest shiny
                | thing.
               | 
               | At some point they will all be tied together, and at that
               | point it will start to look a lot more like sections of
                | our brain: one is vision, one is language, one is
                | movement, etc.
        
             | candiodari wrote:
             | I think it's already been made clear that the main reason
             | for the "asymptote" is wrong data input. These models
             | attempt to learn from random internet text ... and this
             | turns out to not be all that accurate.
             | 
             | Also, I've observed a model I was training having the same
             | problem as I do myself. If I at any point learn wrong data,
             | which happens of course, then getting that wrong data back
             | out is very hard and requires 10 or 50 times the effort I
             | spent learning the initial data. In fact I strongly suspect
              | I never unlearn bad data, I just _additionally_ learn "if
             | I say X, it's wrong, say Y instead".
        
             | jacquesm wrote:
             | Brains suck for exact work such as database work or precise
             | calculations over longer chains. But they excel at
             | approximate work, and that's a very useful skill to have as
             | long as _if you have to_ you can fire up the pencil and
             | paper and do your precise calculations that way. And paper
             | works fine for database work as well and will remember all
             | of those sports stats for as long as you care (and even
              | after you're dead).
             | 
             | Brains are so powerful because they are universal, they can
             | use auxiliary data stores and co-processors just fine.
        
           | visarga wrote:
           | I think much of that spark is actually contextual language
           | generation. We're relying on learned language patterns for
           | most of our sparks.
        
           | fnordpiglet wrote:
            | Agency is a key part of that spark, but we have done
            | all sorts of research into agency, and I think that
            | providing goal-based agents in an AI framework
            | incorporating LLMs as well as other optimizers and
            | solvers will provide the majority of that spark. The
            | process of creativity depends on internal agency, on
            | goal setting unmoored from external dictation, and on
            | semantic synthesis of abductively reasoned concepts,
            | with an aesthetic that feeds into the goal-based
            | optimizer. These are things that can be simulated to
            | the point that, while there may be an uncanny valley
            | somewhere, it'll be close enough to be hard to
            | distinguish.
           | 
           | But I do wonder if the practical utility of such an entity is
           | worth the amount of effort and capital required to build and
           | sustain it. I suspect it'll be more a novelty than a
           | practical tool.
        
           | kelseyfrog wrote:
           | Our distinct humanness will be found in the narcissism of
           | small differences.
        
             | jeroenvlek wrote:
             | Great work, please put this on a tile :D
        
             | __loam wrote:
             | The only narcissists here are the computer scientists who
             | have convinced themselves they've made god.
        
               | FrustratedMonky wrote:
               | Nobody is claiming they have made god today.
               | 
               | That is on the roadmap for 2030.
        
       | pushfoo wrote:
       | If I read the paper correctly, it seems to support the old quip
       | about all AI being decision trees, at least for smaller model
       | sizes.
       | 
       | It also raises an interesting UX question: is there an implicit
        | tradeoff between legibility and power for notations, as there
        | is for different ways of expressing rotation, or is that a
        | consequence of using graphics hardware to implement AI?
        
         | PaulHoule wrote:
          | A ReLU network looks a bit like a network of decision trees,
         | that's for sure.
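
        One way to see the resemblance: each ReLU unit splits input space
        with a hyperplane test ("is w.x + b > 0?"), so the on/off pattern
        of all the units acts like a path through a tree of oblique
        splits, and within each resulting region the network is a fixed
        linear map. A small sketch counting those regions on random probe
        points, with made-up weights:

            import numpy as np

            rng = np.random.default_rng(0)
            W, b = rng.normal(size=(2, 8)), rng.normal(size=8)  # one ReLU layer, 2-D input

            X = rng.uniform(-3, 3, size=(5000, 2))              # random probe points
            patterns = (X @ W + b > 0)                          # each unit: one half-space test

            # Each distinct on/off pattern is a "leaf": a region where the net is linear.
            n_regions = len({tuple(p) for p in patterns})
            print(f"{n_regions} linear regions hit, out of {2**8} possible patterns")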
        
           | jszymborski wrote:
           | When I read "Dead Neuron" I definitely associated it with
           | ReLUs.
           | 
           | It's claimed that this tendency to create dead neurons has a
           | regularizing effect, but frankly I don't buy that.
           | 
           | Anecdotally, I've always had better results with Leaky ReLUs
           | or Mish.
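
        A sketch of what "dead" means operationally: a unit whose pre-
        activation is never positive on the data you care about, so its
        output and its gradient through the ReLU are always zero. The
        weights and data below are hypothetical, with a few units rigged
        to be dead so the check has something to find:

            import numpy as np

            rng = np.random.default_rng(1)
            W, b = rng.normal(size=(16, 32)), rng.normal(size=32)
            b[:5] = -30.0                          # rig a few units dead for the demo
            X = rng.normal(size=(10_000, 16))      # stand-in for a dataset

            fired = (X @ W + b > 0).mean(axis=0)   # fraction of inputs where each unit fires
            dead = np.flatnonzero(fired == 0)      # never fires -> zero output, zero gradient
            print(f"{dead.size}/32 units are dead on this data")

            # A Leaky ReLU keeps a small negative-side slope, so such units still
            # pass gradient during training and can, in principle, recover.
            leaky = lambda z: np.where(z > 0, z, 0.01 * z)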
        
       | pmarreck wrote:
       | I guess the obvious next step is to cull the dead neurons from
       | the model data? I imagine the brain does the same thing after the
        | childhood years.
        
         | kmeisthax wrote:
         | Most models have a purely fixed architecture: i.e. if you train
          | three layers of six feed-forward neurons apiece, your model
         | and the training data it learns from will just fit within that
         | architecture to the extent that gradient descent can _force_ it
          | to fit. There is no mechanism to say "oh these neurons will
         | never activate, let's prune them", or "we'd have much better
         | loss if we added a layer here".
         | 
         | In the dark times before PyTorch there was an idea called NEAT:
         | "neuroevolution of augmenting topologies", which tried to use a
         | genetic algorithm (i.e. testing a bunch of slightly modified
         | solutions for loss) to discover both optimal weights _and_
          | optimal network structure. I don't think this gets used all
         | that often, if at all.
         | 
         | I hear stories every few years about how many models have
         | unused neurons that can be pruned or how hyperparameter
         | selection is a pain in the ass, but nothing about automating
         | the pruning and parameter selection in such a way as to
         | efficiently use the whole model. I'm not sure it's necessary
         | anyway.
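
        Not NEAT itself (no crossover or speciation), but a toy version of
        the evaluate / mutate / select loop it builds on, here mutating
        only the hidden width of a one-hidden-layer regression net; every
        detail below is illustrative rather than taken from the NEAT
        paper:

            import numpy as np

            rng = np.random.default_rng(0)
            X = rng.uniform(-3, 3, size=(256, 1))
            y = np.sin(X)                                      # toy regression target

            def fit_and_score(width):
                """Train a ReLU net of the given hidden width briefly; return its MSE."""
                W1, b1 = rng.normal(size=(1, width)) * 0.5, np.zeros(width)
                W2, b2 = rng.normal(size=(width, 1)) * 0.5, np.zeros(1)
                for _ in range(500):                           # plain gradient descent
                    h = np.maximum(0, X @ W1 + b1)
                    pred = h @ W2 + b2
                    g = 2 * (pred - y) / len(X)                # dMSE/dpred
                    gh = (g @ W2.T) * (h > 0)                  # backprop through the ReLU
                    W2 -= 0.1 * h.T @ g;  b2 -= 0.1 * g.sum(0)
                    W1 -= 0.1 * X.T @ gh; b1 -= 0.1 * gh.sum(0)
                return float(((pred - y) ** 2).mean())

            width = 2
            for _ in range(10):                                # evolve the topology parameter
                candidate = max(1, width + int(rng.integers(-2, 3)))  # mutate width a little
                if fit_and_score(candidate) < fit_and_score(width):
                    width = candidate                          # keep the better topology
            print("selected hidden width:", width)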
        
           | tysam_and wrote:
            | DL researcher here. It's also hard, I think, because many
            | of us have noted experimentally (there is also some
            | research on this) that there's a critical early phase to
            | learning that conditions the network a certain way, and
            | adding and/or removing layers later seems to actually be
            | quite difficult (removing is far easier than adding, esp.
            | w/ something like variational dropout [yes, I cite the
            | old deep magicks]:
           | https://arxiv.org/abs/1701.05369)
        
             | hiddencost wrote:
             | I had success doubling the width of a layer by copying,
             | adding epsilon noise and then fine tuning.
             | 
             | Transform from [X] to [X X+epsilon].
             | 
             | In that case I was trying to fuse two signals that had
             | fairly similar characteristics.
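
        A sketch of that width-doubling trick ([X] -> [X, X+epsilon]) in
        plain NumPy. Halving the outgoing weights is an extra step added
        here (not necessarily what the commenter did) so the widened
        stack stays approximately function-preserving before fine-tuning:

            import numpy as np

            rng = np.random.default_rng(0)
            W_in  = rng.normal(size=(64, 128))   # layer being widened: 64 -> 128 units
            W_out = rng.normal(size=(128, 10))   # next layer: 128 units -> 10 outputs

            eps = 1e-3
            W_in_wide  = np.concatenate([W_in, W_in + eps * rng.normal(size=W_in.shape)],
                                        axis=1)                            # [X, X + epsilon]
            W_out_wide = np.concatenate([0.5 * W_out, 0.5 * W_out], axis=0)  # halve to match

            # Sanity check: nearly function-preserving for the linear part.
            x = rng.normal(size=(1, 64))
            print(np.abs(x @ W_in @ W_out - x @ W_in_wide @ W_out_wide).max())  # ~eps-sized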
        
               | tysam_and wrote:
                | Yes, unfortunately I have yet to get that technique
                | to beat, in wallclock convergence speed, just using
                | the larger network from the start. :'((((
               | 
               | Whoever figures out how to clearly and effectively do it
               | consistently faster than a 'full network from the start'
               | version will open up an entirely new subfield of neural
               | network training efficiency. :'))))
        
           | sdenton4 wrote:
           | There's a huge amount of work on model pruning, especially
           | with an eye towards model reduction for on-device
           | deployments. I've done a bunch of work in this space focused
           | on getting speech synthesis working in real time on phones.
           | It works and can be automated.
           | 
           | There's a lot of nuance though. What has typically worked
           | best are variants on iterative magnitude weight pruning
           | rather than activation pruning.
           | 
           | This can often get rid of 90% of the weights with minimal
           | impact on quality. Structured pruning lets you remove blocks
           | of weights while retaining good SIMD utilization... and so
           | on.
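
        A minimal sketch of one round of unstructured magnitude weight
        pruning, the variant mentioned above: zero out the smallest-
        magnitude weights, then fine-tune and repeat at a higher
        sparsity. The weight matrix is a made-up stand-in:

            import numpy as np

            rng = np.random.default_rng(0)
            W = rng.normal(size=(256, 256))      # stand-in for one trained weight matrix

            def magnitude_prune(W, sparsity=0.9):
                """Zero the smallest-|w| entries so `sparsity` of the weights are removed."""
                threshold = np.quantile(np.abs(W), sparsity)
                mask = np.abs(W) >= threshold
                return W * mask, mask

            W_pruned, mask = magnitude_prune(W, sparsity=0.9)
            print("fraction of weights kept:", mask.mean())   # ~0.1

            # Iterative schemes re-apply this after some fine-tuning, ratcheting the
            # sparsity up gradually (e.g. 50% -> 70% -> 90%) rather than all at once.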
        
           | radarsat1 wrote:
           | > There is no mechanism to say "oh these neurons will never
           | activate, let's prune them"
           | 
           | Not at training time. But after training you can evaluate
           | which activations rarely fire and simply remove their columns
           | from the weight matrix.
           | 
           | Or you could re-randomize them and continue training.
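
        A sketch of that post-training cleanup for one feed-forward
        block: drop a dead unit's column in the first weight matrix (and
        its bias) plus the matching row in the second, and the function
        computed on that data is unchanged, since the unit only ever
        output zero there. Weights and data are hypothetical, with a few
        units rigged to be dead:

            import numpy as np

            rng = np.random.default_rng(2)
            W1, b1 = rng.normal(size=(32, 64)), rng.normal(size=64)
            b1[:8] = -50.0                        # rig a few units dead for the demo
            W2, b2 = rng.normal(size=(64, 8)), rng.normal(size=8)
            X = rng.normal(size=(2000, 32))

            H = np.maximum(0, X @ W1 + b1)
            alive = H.max(axis=0) > 0             # units that fire at least once on this data

            # Remove dead units: their columns in W1/b1 and the matching rows in W2.
            W1s, b1s, W2s = W1[:, alive], b1[alive], W2[alive, :]

            out_full  = H @ W2 + b2
            out_small = np.maximum(0, X @ W1s + b1s) @ W2s + b2
            print(alive.sum(), "of 64 units kept; max diff:",
                  np.abs(out_full - out_small).max())   # 0.0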
        
       ___________________________________________________________________
       (page generated 2023-09-20 23:01 UTC)