[HN Gopher] Self-Compressing Neural Networks
___________________________________________________________________
Self-Compressing Neural Networks
Author : bilsbie
Score : 124 points
Date : 2024-08-04 12:17 UTC (10 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| bilsbie wrote:
| dynamic quantization-aware training that puts size (in bytes) of
| the model in the loss
| diimdeep wrote:
| I know where you saw that
|
| https://x.com/realGeorgeHotz/status/1819963680739512550
|
| > This is one of the coolest papers I've seen in a while.
| "Self-Compressing Neural Networks" is dynamic quantization-
| aware training that puts size (in bytes) of the model in the
| loss!
|
| > My implementation (in @__tinygrad__):
|
| https://github.com/geohot/ai-notebooks/blob/master/mnist_sel...
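|
| The gist, as a rough sketch (PyTorch here rather than tinygrad;
| the per-output-row bit depths, the straight-through round, and
| the 1e-5 size weight are my own guesses, not the paper's or
| geohot's exact formulation):
|
|   import torch
|   import torch.nn as nn
|
|   def ste_round(x):
|       # round() has zero gradient almost everywhere; pass
|       # gradients straight through instead
|       return x + (torch.round(x) - x).detach()
|
|   class SelfCompressingLinear(nn.Module):
|       def __init__(self, n_in, n_out):
|           super().__init__()
|           self.w = nn.Parameter(torch.randn(n_out, n_in) * 0.1)
|           self.b = nn.Parameter(torch.zeros(n_out))
|           # learnable quantization parameters, one per output row
|           self.bits = nn.Parameter(torch.full((n_out,), 8.0))
|           self.exp = nn.Parameter(torch.zeros(n_out))
|
|       def forward(self, x):
|           bits = torch.relu(self.bits)   # bit depth can shrink to 0
|           scale = torch.exp2(self.exp).unsqueeze(1)
|           hi = torch.exp2(bits - 1).unsqueeze(1)
|           # fake-quantize the weights to signed ints of width `bits`
|           q = torch.clamp(ste_round(self.w / scale), -hi, hi - 1)
|           return nn.functional.linear(x, q * scale, self.b)
|
|       def size_bits(self):
|           # size term: bits per row times weights per row
|           return torch.relu(self.bits).sum() * self.w.shape[1]
|
|   model = SelfCompressingLinear(784, 10)
|   x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
|   task = nn.functional.cross_entropy(model(x), y)
|   loss = task + 1e-5 * model.size_bits()  # size (bits) in the loss
|   loss.backward()
|
| Rows whose bit depth gets pushed to zero carry no information and
| can be dropped outright, which is where the compression comes from.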
| mlajtos wrote:
| That is pretty cool. I found a follow-up work that applies this
| technique to LLMs:
| https://konczer.github.io/doc/Poster_EEML23_JozsefKonczer.pd...
| andrewflnr wrote:
| This kind of thing, much more than LLMs, makes me worry about AGI
| takeoff.
| tazu wrote:
| Why? Do you think lossless compression is intelligence?
| luckystarr wrote:
| Parent's thinking was probably: if you can achieve similar
| results with a fraction of the memory/compute usage, then
| capability at the same hardware level will increase even
| more.
| andrewflnr wrote:
| It's specifically the fact that the network is directing
| its own optimization. Which yes, could then potentially be
| used to get more capability from the hardware, but that's
| true of manually optimized networks as well. Needing less
| human help is the... interesting part.
| andrewflnr wrote:
| No, but since you mentioned it:
| https://en.wikipedia.org/wiki/Hutter_Prize
|
| Anyway, OP is about lossy compression. I can't fully follow
| it but they talk about techniques for mitigating loss later
| in the paper.
| idiotsecant wrote:
| Compressing understanding (not just information) in a way
| that uses semantic links in information is a big part of
| intelligence, I'd say.
| visarga wrote:
| We're doing a double search - searching for experience
| outside, collecting data - and searching for understanding
| inside, by compressing the data. Search and learn, they
| define both AI and us.
| throwup238 wrote:
| I think this might be the first step to making neural networks
| that actually mimic biological brains. IMO the biggest piece
| missing from NN architectures is a mechanism like neuroplasticity
| that modifies the topology of neurons. Brains reorganize
| themselves around the things they learn.
|
| This paper is a long way from implementing synaptic
| pruning/strengthening/weakening, neurogenesis, or synaptogenesis,
| but it's the first one I've seen where the network is
| self-optimizing.
| wigster wrote:
| I know nothing of which I speak... but the theme reminded me
| of insect brains, which manage pretty extraordinary feats with
| relatively few neurons. I guess random evolutionary pruning
| happens, and if there is no detrimental effect, cheerio.
| nyrikki wrote:
| Unfortunately dendritic compartmentalization, spike timing, etc.
| are still not present. All the SNN models I know of have hit
| problems like riddled basins so far; that is what to watch for
| to move past the limits of perceptron-based networks, IMHO.
|
| As PAC learning with autograd and perceptrons is just
| compression, or set shattering, this paper is more of an
| optimization method that reduces ANN expressiveness through
| additional compression. Being able to control loss of precision
| is exciting though.
|
| It may help in some cases, especially practical use cases, but
| the potential problems with noisy loss functions that they
| mention in passing still need to be addressed.
|
| Another example: human biological neurons can compute XOR in
| the dendrites without the signal hitting the soma at all.
|
| If you haven't heard about dendritic compartmentalization and
| plasticity, here is a paper.
|
| https://www.cell.com/neuron/fulltext/S0896-6273(11)00993-7
|
| > In conclusion our results support the view that experience
| can drive clustered synaptic enhancement onto neuronal
| dendritic subcompartments, providing fundamental architecture
| to circuit development and function
| derefr wrote:
| > reduces ANN expressiveness
|
| But does it? It's been my hypothesis for a while that every
| grad-trained NN is hauling around a lot of "nascent" nodes --
| nodes that were on their way to being useful, but haven't
| received enough input _yet_ to actually have their outputs be
| distinguishable from noise / ever influence the output. Sort
| of the neuroplastic equivalent of an evolutionary _pre-
| adaptation_.
|
| If such nodes exist in NNs, they would be important for
| decreasing the training time needed to learn new concepts
| _given further training_; but if there will _be_ no more
| training, then they could be pruned for literally no change in
| expressivity (i.e. the optimality of the NN as an autoencoder
| of the existing training data).
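|
| A toy illustration of that pruning claim (PyTorch; the sizes,
| the variance threshold, and the artificially deadened units are
| all made up for the demo):
|
|   import torch
|   import torch.nn as nn
|
|   torch.manual_seed(0)
|   net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
|                       nn.Linear(64, 4))
|   x = torch.randn(512, 16)
|
|   with torch.no_grad():
|       net[0].bias[:20] -= 5.0  # simulate 20 near-dead "nascent" units
|       hidden = net[1](net[0](x))       # (512, 64) activations
|       live = hidden.std(dim=0) > 1e-3  # units whose output ever varies
|       before = net(x)
|       # zeroing a unit's outgoing weights == removing the unit
|       net[2].weight[:, ~live] = 0.0
|       after = net(x)
|
|   print(f"pruned {int((~live).sum())} of 64 units, "
|         f"max output change: {(before - after).abs().max():.2e}")
|
| The near-constant units contribute (almost) nothing, so dropping
| them leaves the function the network computes essentially
| unchanged.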
| nyrikki wrote:
| Easiest way I can figure out how to explain my claim.
|
| Consider when you use 'partial connectivity', e.g. convolution
| or pooling layers for local feature extraction on, say, MNIST.
|
| While useful, those partial connection layers are
| explicitly used because fully connected layers do not have
| translational invariance.
|
| So with a fully connected network, shifting the letter 'i'
| a few pixels to the right wouldn't match.
|
| We choose to discard some of those connections for local
| feature detection. But the reason the fully connected model
| lacks translational invariance is that it maintains that
| position data.
|
| Note how that is more 'expressive', even if
| counterproductive for the actual use case.
|
| Another lens is the fact that neural networks have an extreme
| simplicity bias, in that they learn only the simplest features
| that solve the task at hand.
|
| If you want to recognize an 'i' irrespective of its
| translational location, that bias is useful. But you 'throw
| away' (in a very loose sense) the positional data to do so.
|
| Horses for courses, not good vs bad.
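|
| A concrete sketch of the invariance point (PyTorch; the 28x28
| input, the stroke position, and the layer widths are made up):
|
|   import torch
|   import torch.nn as nn
|
|   torch.manual_seed(0)
|   img = torch.zeros(1, 1, 28, 28)
|   img[0, 0, 10:18, 13:15] = 1.0  # a crude vertical stroke ('i')
|   shifted = torch.roll(img, shifts=3, dims=3)  # 3 px to the right
|
|   fc = nn.Linear(28 * 28, 8)
|   conv = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1),
|                        nn.AdaptiveMaxPool2d(1))  # global max pool
|
|   fc_delta = (fc(img.flatten(1))
|               - fc(shifted.flatten(1))).abs().max()
|   conv_delta = (conv(img) - conv(shifted)).abs().max()
|
|   print(f"fully connected change: {fc_delta:.4f}")  # clearly nonzero
|   print(f"conv + pool change:     {conv_delta:.4f}")  # ~0
|
| The fully connected layer 'sees' the shift because it keeps the
| position data; the conv features just translate, and the global
| pool throws the position away.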
| IIAOPSW wrote:
| You piqued my curiosity, so I looked for a paper. I found
| something tangential but fascinating.
|
| "Naud and Sprekeler (2018) suggest that this could be
| achieved using a synaptic strategy that facilitates summation
| for simple action potentials arriving on the basal dendrites
| and depresses faster burst-like events arriving on the distal
| tuft"
|
| Oh, it's frequency multiplexing with a band-pass filter. Same
| trick the analog phone system used to reduce the amount of
| wire needed in the network. Same problem, same solution.
| Convergent evolution.
|
| I wonder if there are ways to do phreaking on neurons.
|
| https://www.sciencedirect.com/science/article/pii/S030645222.
| ..
| dontwearitout wrote:
| Is it necessary to explicitly model dendritic sub-compartments,
| or does this work just imply that biological neurons are
| complicated and better modeled as a multi-layered artificial
| network rather than a single simple computational unit?
|
| Similarly, do you think that spiking networks are important,
| or just a specific mechanism used in the brain to transmit
| information, which dense (or sparse) vectors of floats do in
| artificial neural networks?
| nyrikki wrote:
| If the goal is to create an artificial neural network that
| better approximates the biological human brain, then yes, the
| perceptron model is insufficient.
|
| If your goal is to produce a useful model on real hardware
| and it works...no
|
| Remember the constraints under which ANNs are universal
| approximators (in theory):
|
| 1) The function you are learning needs to be continuous
| 2) Your model is over a closed, bounded subset of R^n
| 3) The activation function is bounded and monotonic
|
| Obviously those are the theoretical UAT constraints. For the
| gradient descent typically used in real ML models, the
| constraint of finding only smooth approximations of continuous
| functions can be problematic depending on your needs.
|
| But people leveraged phlogiston theory for beer brewing with
| great success, and obviously Newtonian mechanics is good
| enough for many tasks.
|
| SNNs in theory should be able to solve problems that are
| challenging for perceptron models, but as I said, features
| like riddled basins are problematic so far.
|
| https://arxiv.org/abs/1711.02160
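|
| As a minimal illustration of those constraints (PyTorch; the
| target function, hidden width, and training budget here are
| arbitrary): one bounded-activation hidden layer fitting a
| continuous function on a closed, bounded interval.
|
|   import torch
|   import torch.nn as nn
|
|   torch.manual_seed(0)
|   # closed, bounded subset of R; continuous target
|   x = torch.linspace(0, 2 * torch.pi, 256).unsqueeze(1)
|   y = torch.cos(x)
|
|   net = nn.Sequential(nn.Linear(1, 64), nn.Sigmoid(),
|                       nn.Linear(64, 1))
|   opt = torch.optim.Adam(net.parameters(), lr=1e-2)
|
|   for _ in range(2000):
|       opt.zero_grad()
|       loss = ((net(x) - y) ** 2).mean()
|       loss.backward()
|       opt.step()
|
|   print(f"final MSE: {loss.item():.5f}")  # small on [0, 2*pi]
|
| Outside [0, 2*pi], or for a discontinuous target, no such
| guarantee holds.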
| TheDudeMan wrote:
| Stop trying to mimic brains. Do what works best for
| transistors.
| sva_ wrote:
| It would be foolish not to look for inspiration in a system
| that had billions of years of evolution invested in it.
| p1esk wrote:
| We already found the inspiration. That's how we invented
| neural networks. Now we need to focus on what works.
| kklisura wrote:
| > mechanism like neuroplasticity that modifies the topology of
| neurons
|
| Isn't this already accomplished via weights?
| Version467 wrote:
| This is super cool. It's surprising to me that it took so long
| for someone to try this. It seems like such an obvious idea (in
| hindsight). But I guess that's easy to say now that someone came
| up with it. If this turns out to work well even for much larger
| models, then we might see loss functions that incorporate ever
| more specific performance metrics, conceivably even actual
| execution times on specific hardware.
| xpe wrote:
| There is related prior work, as mentioned in the paper.
| spacemanspiff01 wrote:
| So this was published a year and a half ago? Is there a reason it
| did not catch on?
| svantana wrote:
| It's not really that innovative. As the paper notes, there are
| several similar previous works. Also, it sounds like they have
| done a bunch of tweaking to reduce the "irreversible
| forgetting" specifically for this particular dataset and
| network, which is not very scientific. Further testing is
| required to see if this method really has legs.
| w-m wrote:
| Using as few computational resources (memory and/or FLOPS) as
| possible as an additional optimization criterion when training
| NNs is an interesting avenue. I think the current state of pre-
| trained model families is weird. Take Llama 3.1 or Segment
| Anything 2: you get tiny/small/medium/large/huge models, where
| for each tier the model size was predefined, and they are trained
| somewhat (completely?) independently. This feels iffy, patchy,
| and like we haven't really arrived yet.
|
| I'd want a model that scales up and down depending on the task
| given at inference, and a model that doesn't have a fixed size
| when starting training. Shouldn't it specialize as training
| progresses and it sees more tokens, and grow larger where
| needed, without some human fixing a size beforehand?
|
| Self-organization is a fascinating topic to me. This last year
| I've been working on Self-Organizing Gaussian Splats [0]. With a
| lot of squinting, this lives in a similar space to the Self-
| Compressing Neural Networks from the link above. The idea of the
| Gaussians was to build on Self-Organizing Maps (a lovely 90s
| concept; look for some GIFs if you don't know it) and use them
| to represent 3D scenes in a memory-efficient way, by mapping
| attributes into a locally smooth 2D grid. It's quite a simple
| algorithm, but it works really well, better than many far more
| complicated coding schemes. So this has me excited that we'll
| (re-)discover great methods in this space in the near future.
|
| [0]: https://fraunhoferhhi.github.io/Self-Organizing-Gaussians/
___________________________________________________________________
(page generated 2024-08-04 23:00 UTC)