[HN Gopher] Self-Compressing Neural Networks
       ___________________________________________________________________
        
       Self-Compressing Neural Networks
        
       Author : bilsbie
       Score  : 124 points
       Date   : 2024-08-04 12:17 UTC (10 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | bilsbie wrote:
       | dynamic quantization-aware training that puts size (in bytes) of
       | the model in the loss
        
         | diimdeep wrote:
         | I know where you saw that
         | 
         | https://x.com/realGeorgeHotz/status/1819963680739512550
         | 
          | > This is one of the coolest papers I've seen in a while.
          | "Self-Compressing Neural Networks" is dynamic quantization-
          | aware training that puts size (in bytes) of the model in the
          | loss!
          | 
          | > My implementation (in @__tinygrad__):
         | 
         | https://github.com/geohot/ai-notebooks/blob/master/mnist_sel...
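          | 
          | The core trick, as a minimal sketch (my own reading of the
          | idea in PyTorch; the paper and geohot's tinygrad notebook
          | differ in details, and the names here are made up): give
          | each weight tensor a learnable bit depth, keep rounding
          | differentiable with a straight-through estimator, and add
          | the byte count those bit depths imply to the loss.
          | 
          |   import torch
          |   import torch.nn as nn
          |   
          |   class SelfCompressingLinear(nn.Module):
          |       """Linear layer with a learnable bit depth per row."""
          |       def __init__(self, fan_in, fan_out):
          |           super().__init__()
          |           w = torch.randn(fan_out, fan_in) * 0.02
          |           self.w = nn.Parameter(w)
          |           # learnable bit depth b and exponent e per row
          |           self.b = nn.Parameter(torch.full((fan_out, 1), 8.0))
          |           self.e = nn.Parameter(torch.zeros(fan_out, 1))
          |   
          |       def forward(self, x):
          |           b = torch.relu(self.b)     # bits can shrink to 0
          |           scale = torch.exp2(self.e)
          |           lo = -torch.exp2(b - 1)
          |           hi = torch.exp2(b - 1) - 1
          |           q = torch.clamp(self.w / scale, lo, hi)
          |           # straight-through estimator: round in the forward
          |           # pass, identity gradient in the backward pass
          |           q = q + (q.round() - q).detach()
          |           return x @ (q * scale).t()
          |   
          |       def size_bytes(self):
          |           # differentiable size term: #weights * bits / 8
          |           n_per_row = self.w.shape[1]
          |           return torch.relu(self.b).sum() * n_per_row / 8
          | 
          | Training then minimizes task_loss + gamma *
          | sum(layer.size_bytes() for all layers), where gamma trades
          | accuracy against bytes; rows whose bit depth gets pushed to
          | zero can be pruned outright.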
        
       | mlajtos wrote:
        | That is pretty cool. I found a follow-up work that applies
        | this technique to LLMs:
        | https://konczer.github.io/doc/Poster_EEML23_JozsefKonczer.pd...
        
       | andrewflnr wrote:
       | This kind of thing, much more than LLMs, makes me worry about AGI
       | takeoff.
        
         | tazu wrote:
         | Why? Do you think lossless compression is intelligence?
        
           | luckystarr wrote:
            | Parent's thinking was probably: if you can achieve similar
            | results with a fraction of the memory/compute, then
            | capability at the same hardware level will increase even
            | more.
        
             | andrewflnr wrote:
             | It's specifically the fact that the network is directing
             | its own optimization. Which yes, could then potentially be
             | used to get more capability from the hardware, but that's
             | true of manually optimized networks as well. Needing less
             | human help is the... interesting part.
        
           | andrewflnr wrote:
           | No, but since you mentioned it:
           | https://en.wikipedia.org/wiki/Hutter_Prize
           | 
           | Anyway, OP is about lossy compression. I can't fully follow
           | it but they talk about techniques for mitigating loss later
           | in the paper.
        
           | idiotsecant wrote:
            | Compressing understanding (not just information) in a way
            | that exploits the semantic links in that information is a
            | big part of intelligence, I'd say.
        
             | visarga wrote:
             | We're doing a double search - searching for experience
             | outside, collecting data - and searching for understanding
             | inside, by compressing the data. Search and learn, they
             | define both AI and us.
        
       | throwup238 wrote:
       | I think this might be the first step to making neural networks
       | that actually mimic biological brains. IMO the biggest piece
       | missing from NN architectures is a mechanism like neuroplasticity
       | that modifies the topology of neurons. Brains reorganize
       | themselves around the things they learn.
       | 
        | This paper is a long way from implementing synaptic
        | pruning/strengthening/weakening, neurogenesis, or
        | synaptogenesis, but it's the first one I've seen where the
        | network is self-optimizing.
        
         | wigster wrote:
          | I know nothing of which I speak... but the theme reminded me
          | of insect brains, which manage pretty extraordinary feats
          | with relatively few neurons. I guess random evolutionary
          | pruning happens, and if there is no detrimental effect,
          | cheerio.
        
         | nyrikki wrote:
          | Unfortunately dendritic compartmentalization, spike timing,
          | etc. are still not present. All the SNN models I know of
          | have hit problems like riddled basins so far; that is what
          | to watch for in moving past the limits of perceptron-based
          | networks, IMHO.
         | 
         | As PAC learning with autograd and perceptrons is just
         | compression, or set shattering, this paper is more of an
         | optimization method that reduces ANN expressiveness through
         | additional compression. Being able to control loss of precision
         | is exciting though.
         | 
          | It may help in some cases, especially practical ones, but
          | the potential problems with noisy loss functions that the
          | authors mention in passing still need to be addressed.
         | 
          | Another example: human biological neurons can compute XOR in
          | their dendrites without involving the soma at all.
         | 
         | If you haven't heard about dendritic compartmentalization and
         | plasticity, here is a paper.
         | 
         | https://www.cell.com/neuron/fulltext/S0896-6273(11)00993-7
         | 
         | > In conclusion our results support the view that experience
         | can drive clustered synaptic enhancement onto neuronal
         | dendritic subcompartments, providing fundamental architecture
         | to circuit development and function
        
           | derefr wrote:
           | > reduces ANN expressiveness
           | 
           | But does it? It's been my hypothesis for a while that every
           | grad-trained NN is hauling around a lot of "nascent" nodes --
           | nodes that were on their way to being useful, but haven't
           | received enough input _yet_ to actually have their outputs be
           | distinguishable from noise  / ever influence the output. Sort
           | of the neuroplastic equivalent of an evolutionary _pre-
           | adaptation_.
           | 
            | If such nodes exist in NNs, they would be important for
            | decreasing the training time to learn new concepts _given
            | further training_; but if there will _be_ no more
            | training, then they could be pruned with literally no
            | change in expressivity (i.e. the optimality of the NN as
            | an autoencoder of the existing training data).
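            | 
            | A hypothetical probe for this (my own sketch, nothing from
            | the paper; the threshold is made up): run a trained net
            | over a batch and flag hidden units whose activations
            | barely vary across inputs, i.e. are indistinguishable
            | from a constant plus noise.
            | 
            |   import torch
            |   
            |   @torch.no_grad()
            |   def nascent_units(acts, noise_floor=1e-3):
            |       # acts: (batch, units) activations of one layer
            |       # from a trained net; units whose spread across
            |       # the batch is below the floor are candidates
            |       return acts.std(dim=0) < noise_floor
            | 
            | Zeroing the flagged units and re-measuring the loss would
            | test whether they really contribute nothing yet.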
        
             | nyrikki wrote:
              | This is the easiest way I can explain my claim.
             | 
              | Consider when you use 'partial connectivity', e.g.
              | convolution or pooling layers for local feature
              | extraction on, say, MNIST.
             | 
             | While useful, those partial connection layers are
             | explicitly used because fully connected layers do not have
             | translational invariance.
             | 
             | So with a fully connected network, shifting the letter 'i'
             | a few pixels to the right wouldn't match.
             | 
              | We choose to discard some of those connections to get
              | local feature detection. But the reason the fully
              | connected model lacks translational invariance is that
              | it maintains that position data.
             | 
             | Note how that is more 'expressive', even if
             | counterproductive for the actual use case.
             | 
              | Another lens is the fact that neural networks have an
              | extreme simplicity bias: they learn only the simplest
              | features that solve the task at hand.
             | 
              | If you want to recognize an 'i' irrespective of its
              | location, that bias is useful. But you 'throw away' (in
              | a very loose sense) the positional data to do so.
             | 
             | Horses for courses, not good vs bad.
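              | 
              | A toy demonstration of that claim (my own sketch, not
              | from the thread): shift a stroke two pixels and compare
              | a fully connected layer's output against a convolution
              | followed by global max-pooling.
              | 
              |   import torch
              |   import torch.nn as nn
              |   
              |   torch.manual_seed(0)
              |   img = torch.zeros(1, 1, 8, 8)
              |   img[0, 0, 2:6, 2] = 1.0          # a vertical stroke
              |   shifted = torch.roll(img, 2, dims=3)
              |   
              |   fc = nn.Linear(64, 4)    # sees absolute positions
              |   conv = nn.Conv2d(1, 4, 3, padding=1)
              |   
              |   d_fc = fc(img.flatten(1)) - fc(shifted.flatten(1))
              |   d_cv = (conv(img).amax(dim=(2, 3))
              |           - conv(shifted).amax(dim=(2, 3)))
              |   # large: FC features moved with the pixels
              |   print(d_fc.abs().max())
              |   # ~0: conv + global pool is shift invariant
              |   print(d_cv.abs().max())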
        
           | IIAOPSW wrote:
           | You piqued my curiosity, so I looked for a paper. I found
           | something tangential but fascinating.
           | 
           | "Naud and Sprekeler (2018) suggest that this could be
           | achieved using a synaptic strategy that facilitates summation
           | for simple action potentials arriving on the basal dendrites
           | and depresses faster burst-like events arriving on the distal
           | tuft"
           | 
            | Oh, it's frequency multiplexing with a band-pass filter.
            | The same trick the analog phone system used to reduce the
            | amount of wire needed in the network. Same problem, same
            | solution. Convergent evolution.
           | 
            | I wonder if there are ways to do phreaking on neurons.
           | 
           | https://www.sciencedirect.com/science/article/pii/S030645222.
           | ..
        
           | dontwearitout wrote:
            | Do dendritic sub-compartments need to be explicitly
            | modeled, or does this work just imply that biological
            | neurons are complicated and are better modeled as a multi-
            | layered artificial network rather than as a single simple
            | computational unit?
           | 
           | Similarly, do you think that spiking networks are important,
           | or just a specific mechanism used in the brain to transmit
           | information, which dense (or sparse) vectors of floats do in
           | artificial neural networks?
        
             | nyrikki wrote:
             | If the goal was to create an artificial neural network that
             | better approximated the biological human brain, yes the
             | perceptron model is insufficient.
             | 
             | If your goal is to produce a useful model on real hardware
             | and it works...no
             | 
              | Remember the constraints for ANNs being universal
              | approximators (in theory):
              | 
              | 1) The function you are learning needs to be continuous.
              | 2) Your model is over a closed, bounded subset of R^n.
              | 3) The activation function is bounded and monotonic.
             | 
              | Obviously those are the theoretical UAT constraints. For
              | the gradient descent typically used in real ML models,
              | the restriction to finding only smooth approximations of
              | continuous functions can be problematic, depending on
              | your needs.
             | 
             | But people leveraged phlogiston theory for beer brewing
             | with great success and obviously Newtonian Mechanics is
             | good enough for many tasks.
             | 
             | SNNs in theory should be able to solve problems that are
             | challenging for perceptron models, but as I said, features
             | like riddled basins are problematic so far.
             | 
             | https://arxiv.org/abs/1711.02160
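              | 
              | For reference, a standard (Cybenko/Hornik-style) form of
              | the theorem behind those constraints, sketched in LaTeX:
              | 
              |   % one-hidden-layer networks are dense in C(K)
              |   \forall f \in C(K),\; K \subset \mathbb{R}^n
              |   \text{ compact},\; \forall \varepsilon > 0,\;
              |   \exists N,\, a_i, b_i \in \mathbb{R},\,
              |   w_i \in \mathbb{R}^n :
              |   \sup_{x \in K} \Big| f(x) -
              |   \sum_{i=1}^{N} a_i\, \sigma(w_i^{\top} x + b_i)
              |   \Big| < \varepsilon
              | 
              | with \sigma bounded, continuous, and nonconstant
              | (Hornik 1991); the classic statements also assume a
              | monotone sigmoidal \sigma.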
        
         | TheDudeMan wrote:
         | Stop trying to mimic brains. Do what works best for
         | transistors.
        
           | sva_ wrote:
           | It would be foolish not to look for inspiration in a system
           | that had billions of years of evolution invested in it.
        
             | p1esk wrote:
             | We already found the inspiration. That's how we invented
             | neural networks. Now we need to focus on what works.
        
         | kklisura wrote:
         | > mechanism like neuroplasticity that modifies the topology of
         | neurons
         | 
         | Isn't this already accomplished via weights?
        
       | Version467 wrote:
       | This is super cool. It's surprising to me that it took so long
       | for someone to try this. It seems like such an obvious idea (in
       | hindsight). But I guess that's easy to say now that someone came
       | up with it. If this turns out to work well even for much larger
       | models, then we might see loss functions that incorporate ever
       | more specific performance metrics, conceivably even actual
       | execution times on specific hardware.
        
         | xpe wrote:
         | There was related work that happened before, as mentioned in
         | the paper.
        
       | spacemanspiff01 wrote:
       | So this was published a year and a half ago? Is there a reason it
       | did not catch on?
        
         | svantana wrote:
         | It's not really that innovative. As the paper notes, there are
         | several similar previous works. Also, it sounds like they have
         | done a bunch of tweaking to reduce the "irreversible
         | forgetting" specifically for this particular dataset and
         | network, which is not very scientific. Further testing is
         | required to see if this method really has legs.
        
       | w-m wrote:
        | Using as few computational resources (memory and/or FLOPs) as
        | possible as an additional optimization criterion when training
        | NNs is an interesting avenue. I think the current state of
        | pre-trained model families is weird. Take Llama 3.1 or Segment
        | Anything 2: you get tiny/small/medium/large/huge models, where
        | for each tier the model size was predefined, and they are
        | trained somewhat (completely?) independently. This feels iffy,
        | patchy, and like we haven't really arrived yet.
       | 
       | I'd want a model that scales up and down depending on the task
       | given at inference, and a model that doesn't have a fixed size
       | when starting the training. Shouldn't it specialize over training
       | progress, when seeing more tokens, and grow larger where needed?
       | Without some human fixing a size beforehand?
       | 
        | Self-organization is a fascinating topic to me. This last year
        | I've been working on Self-Organizing Gaussian Splats [0]. With
        | a lot of squinting, this lives in a similar space as the Self-
        | Compressing Neural Networks from the link above. The idea of
        | the Gaussians was to build on Self-Organizing Maps (a lovely
        | 90s concept; look for some GIFs if you don't know it, or see
        | the sketch at the end of this comment) and use them to
        | represent 3D scenes in a memory-efficient way, by mapping
        | attributes into a locally smooth 2D grid. It's quite a simple
        | algorithm, but it works really well, better than many far more
        | complicated coding schemes. So this has me excited that we'll
        | (re-)discover great methods in this space in the near future.
       | 
       | [0]: https://fraunhoferhhi.github.io/Self-Organizing-Gaussians/
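        | 
        | For anyone who hasn't seen a SOM, the whole Kohonen update
        | rule fits in a few lines. A generic sketch (not our Gaussian
        | splats code; the decay constants are made up):
        | 
        |   import numpy as np
        |   
        |   def som_step(grid, x, t, lr0=0.5, sig0=3.0, tau=200.0):
        |       # grid: (H, W, D) map of D-dim vectors; x: one sample
        |       h, w, _ = grid.shape
        |       # best-matching unit: the grid cell closest to x
        |       d = np.linalg.norm(grid - x, axis=2)
        |       bi, bj = np.unravel_index(d.argmin(), (h, w))
        |       # learning rate and neighbourhood shrink over time t
        |       lr = lr0 * np.exp(-t / tau)
        |       sig = sig0 * np.exp(-t / tau)
        |       ii, jj = np.meshgrid(range(h), range(w), indexing="ij")
        |       g = np.exp(-((ii - bi)**2 + (jj - bj)**2)
        |                  / (2 * sig**2))
        |       # pull the BMU and its 2D-grid neighbours toward x;
        |       # this is what makes the grid locally smooth
        |       grid += lr * g[..., None] * (x - grid)
        |       return grid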
        
       ___________________________________________________________________
       (page generated 2024-08-04 23:00 UTC)