[HN Gopher] Towards 1-bit Machine Learning Models
___________________________________________________________________
Towards 1-bit Machine Learning Models
Author : homarp
Score : 166 points
Date : 2024-03-28 10:58 UTC (1 day ago)
(HTM) web link (mobiusml.github.io)
(TXT) w3m dump (mobiusml.github.io)
| corysama wrote:
| I look forward to the inevitable probabilistic sub-bit machine
| learning models :)
| Cieric wrote:
| I assume this is meant to be a joke, but isn't this actually
| being worked on? (minus the probabilistic portion)
| https://arxiv.org/abs/2310.16795
| Cheer2171 wrote:
| My 0-bit model can predict if you have cancer with 99.5%
| accuracy. It doesn't even need input data! Don't ask about the
| false negative rate though.
| asah wrote:
| My 0-bit, no-input model can predict if you have cancer with
| 99.5% accuracy and 0.5% false negative rate. Don't ask about
| whether the cancer cell(s) are benign.
| paul_mk1 wrote:
| Sub 1-bit has been done at least as far back as 2016 for VGG
| style networks (my work).
|
| I was able to get 0.68 "effective" bits.
|
| The idea is that in each forward pass you add noise to _each_
| weight, independently drawn from a normal distribution, and
| when you calculate the SNR, it's sub 1 bit. Points to the idea
| that a stochastic memory element can be used.
|
| https://arxiv.org/abs/1606.01981
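|
| Rough Python sketch of the setup, with made-up numbers (not the
| exact recipe from the paper):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   W = rng.standard_normal((256, 256))  # trained float weights
|   sigma = 0.5                          # noise std per forward pass
|
|   def noisy_forward(x, W):
|       # fresh Gaussian noise on every weight, every forward pass
|       return x @ (W + sigma * rng.standard_normal(W.shape)).T
|
|   snr_db = 10 * np.log10(W.var() / sigma**2)
|   eff_bits = (snr_db - 1.76) / 6.02    # standard ENOB formula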
| iamgopal wrote:
| So will new AI chips be coming, optimised for a large-scale,
| bit-only (non-byte) AI world?
| thesz wrote:
| Every CPU and GPU now has popcount [1] instruction. You can
| thank US three letter agencies for that [2].
| [1]
| https://en.wikipedia.org/wiki/Hamming_weight#Processor_support
| [2] https://vaibhavsagar.com/blog/2019/09/08/popcount/
|
| Multiplication then is just bitwise AND and a popcount.
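|
| A toy example, assuming 0/1 activations and weights packed into
| machine words:
|
|   a = 0b10110110                # 8 binary activations in one word
|   w = 0b11010100                # 8 binary weights
|   dot = bin(a & w).count("1")   # AND, then popcount -> 3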
| ttul wrote:
| Makes you wonder what the agencies know that we don't...
| rapatel0 wrote:
| It's just cryptography and encoding. Popcount is useful for
| understanding entropy.
| JKCalhoun wrote:
| As an AI layman, I am for some reason excited by the low-bit
| learning models.
|
| Apple, now leaving behind the albatross that was the Apple car,
| could be the one to roll their own silicon and put an LLM in our
| back pockets.
|
| Goodbye Siri, hello Seriously.
| talldayo wrote:
| Or we could end up getting another Google integration, with iOS
| being too rigid and locked-down to feasibly support third-party
| private AI alternatives. Fate has a cruel sense of humor,
| sometimes.
| odyssey7 wrote:
| Are you referring to the Swift version of TensorFlow?
|
| https://www.infoworld.com/article/3608151/swift-for-
| tensorfl...
| Me1000 wrote:
| I believe they're referring to these rumors:
| https://www.macrumors.com/2024/03/21/apple-iphone-ai-
| talks-g...
| JKCalhoun wrote:
| Could be a way for Apple to buy time as well until they
| do have something competitive.
| lawlessone wrote:
| > could be the one to roll their own silicon and put an LLM in
| our back pockets.
|
| Wouldn't they have to do a lot of catching up with everyone
| else already working on it?
| declaredapple wrote:
| They've been designing their own chips a while now, including
| with an NPU.
|
| Also because of their unified memory design, they actually
| have insane bandwidth which is incredibly useful for LLMs.
| IMO they may have a head-start in that respect for on-device
| inference of large models (e.g. 1B+ params).
| talldayo wrote:
| I don't think people are running 1B+ models on the Neural
| Engine these days. The high-performance models I've seen
| all rely on Metal Performance Shaders, which scales with
| how powerful your GPU is. It's not terribly slow on iPhone,
| but I think some people get the wrong idea and correlate an
| ambient processor like the Neural Engine with LLMs.
|
| The bigger bottleneck seems like memory, to me. iPhones
| have traditionally skimped on RAM more so than even cheap
| and midrange Android counterparts. I can imagine running an
| LLM in the background on my S10 - it's a bit harder to
| envision iOS swapping everything smoothly on a similarly-
| aged iPhone.
| quentinp wrote:
| Actually they plan to put an LLM in your back pocket using
| flash memory, not silicon: https://arxiv.org/abs/2312.11514
| declaredapple wrote:
| The flash doesn't do the computations though, that's just a
| method of getting it to the processor
| sroussey wrote:
| It would be better to have EEPROM or some such directly
| attached as memory. No loading.
| fabmilo wrote:
| I like how it goes from 13.5 GB to 1.76 GB and gets comparable
| results. There is definitely some underlying reason why this
| works so well that we are missing. Are the bits selecting the
| different subspaces? Can we improve this process by using better
| normalization layers / gating units?
| nobodyandproud wrote:
| I wonder what information theory can tell us here.
| ttul wrote:
| Neural networks work, but why they work is not well known.
| Researchers continue to find new "free lunches" all the time.
| rapatel0 wrote:
| The entropy of English is ~1 bit per letter (Measured in a
| funny way by Shannon - https://cs.stanford.edu/people/erobert
| s/courses/soco/project...)
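|
| For a feel of the numbers, a quick unigram estimate (on real
| English text this lands around 4 bits/letter; the ~1 bit figure
| only appears once you condition on long contexts, which is what
| a language model exploits):
|
|   import math
|   from collections import Counter
|
|   # placeholder text; use a real English corpus for a fair number
|   text = "replace me with a decent chunk of english text".lower()
|   counts = Counter(c for c in text if c.isalpha())
|   n = sum(counts.values())
|   H = -sum(k / n * math.log2(k / n) for k in counts.values())
|   print(f"{H:.2f} bits per letter (unigram)")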
|
| In general, it's a bit funny how we build ML models. You take
| a bucket of matrices, fill them with random numbers, and then
| start to introduce "biases" through back propagation. A model
| can converge to a low loss, but most of the matrices are still
| filled with random noise.
|
| Binary weights are somewhat obvious in retrospect. Weights
| indicate the strength of an association between two neurons.
| Intuitively, most of the value is probably just in whether an
| association between two neurons exists, and whether it's
| positively or negatively associated.
| Animats wrote:
| I'm surprised that 1-bit works. Trinary (-1, 0, 1) makes
| sense, but if you only have 0 and 1, the asymmetry ought to
| be a problem. Or is multiplying an XOR, so 1x1 = 0?
| edflsafoiewq wrote:
| There's a scale and bias afterwards, so it's not
| necessarily asymmetric.
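|
| Something like this (my notation, not necessarily the
| article's), so the two codes don't have to mean 0 and 1:
|
|   import numpy as np
|
|   B = np.array([1, 0, 1, 1], dtype=np.float32)  # 1-bit weights
|   scale, zero = 0.12, -0.05                     # per-group params
|   W_approx = scale * B + zero  # {0,1} -> {-0.05, 0.07}, signed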
| scarmig wrote:
| It's more like, addition is an XOR. (In fact, AND as mult
| and XOR as add are GF(2)). Throw in NOT (or, really, just
| a constant 1) and you can compute any circuit.
|
| Biologically, inhibitory neurons are every bit as
| important as excitatory ones, so if you squint just
| right, XOR looks like a neuron's activation being
| inhibited by another presynaptic neuron.
| rapatel0 wrote:
| Yeah, basically. With -1/+1 values encoded as bits (1 for +1,
| 0 for -1), multiplication is an XNOR operation:
|
| XNOR(0,0) = 1
|
| XNOR(0,1) = 0
|
| XNOR(1,0) = 0
|
| XNOR(1,1) = 1
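|
| Which also makes packed dot products cheap, e.g. (toy sketch):
|
|   n = 8
|   a = 0b10110110
|   w = 0b11010100
|   same = bin(~(a ^ w) & ((1 << n) - 1)).count("1")  # XNOR+popcount
|   dot = 2 * same - n   # +1 per match, -1 per mismatch -> 2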
| mlsu wrote:
| Wow, fascinating read. 1999
|
| _Would a monkey who knew the n-gram frequencies of the
| letters in English where n is large be able to produce
| credible English text? Furthermore, does this monkey "know"
| English? If the N-gram monkey is behind one door and a
| human is behind the other, could a third-party observer
| tell which was the monkey? This question rings of the
| Turing test for artificial intelligence, to which there is
| no easy answer._
|
| No easy answer indeed!
| convolvatron wrote:
| idk that we need to think that hard. we have shown we can
| build arithmetic functions from networks of logic gates given
| sufficient connectivity. so in some sense it just drops the
| learning network below the level of words and asks it to make
| its own math.
|
| since this generalizability thing seems to be real, this kind
| of directly folds in quantization-aware-training, doesn't it?
| WithinReason wrote:
| 1-bit weights have been a thing since at least 2016:
|
| https://arxiv.org/abs/1606.06160
| buildbot wrote:
| XNOR-Net was the "first" in that generation, as I recall. It's
| not a new idea at all though.
|
| Check out the dates on these papers -
|
| https://ieeexplore.ieee.org/abstract/document/286901 < 1994
|
| https://ieeexplore.ieee.org/abstract/document/344783 < 1993
|
| https://link.springer.com/article/10.1007/BF00337115 < 1986
| bjornsing wrote:
| First version of BNN was submitted to ArXiv a month before
| XNOR-Net: https://arxiv.org/abs/1602.02830
| buildbot wrote:
| Oh nice, good catch!
| thechao wrote:
| We're just speed-running the NN pendulum, at this point.
| rapatel0 wrote:
| Don't forget Michael
|
| https://arxiv.org/abs/1511.00363 < 2015
| redskyluan wrote:
| what will be the next step of 1 bit? could it be 0.5 bit? Any
| kind of quantization is not gonna last long.
| CamperBob2 wrote:
| Not using traditional numbers at all, I imagine. It's possible
| that trying to build neural nets out of vectors that are
| themselves built with quantized scalars is a suboptimal
| approach.
| pixelpoet wrote:
| The way things are going we'll have -1 bit before long :)
| ganzuul wrote:
| https://en.wikipedia.org/wiki/Adjacency_matrix
|
| What would a graph look like when edges are represented with less
| than 1-bit?
| robrenaud wrote:
| It would have predictable structure. There would be groupwise,
| regular sparsity. You could exploit these patterns to infer
| properties like, "there is no connection from this vertex to
| this entire span of vertices".
|
| Just think about representing a compressed version of a matrix.
| You want to have a short encoding for matrices that are more
| ordered in some real, exploitable way. With some deterministic
| computation of the compressed representation, you
| could produce the full adjacency matrix.
| ganzuul wrote:
| https://proceedings.mlr.press/v206/meller23a/meller23a.pdf
|
| It seems to have a lot of similarities to the technique in
| OPs article.
| grungegun wrote:
| Does anyone know if this works on vanilla deep networks? These
| quantization articles always seem to target LLMs, which leads me
| to wonder if there's something special about the LLM architecture
| vs a vanilla deep architecture.
| Y_Y wrote:
| Certainly does, people have been doing this in computer vision
| for years.
| alephxyz wrote:
| LLMs have been trending towards an obscenely large number of
| parameters (314B for Grok), which makes quantization crucial if
| you want to run them without a Meta-sized budget.
| zaptrem wrote:
| Transformer LLMs are just a bunch of MLPs (linear layers) where
| you sometimes multiply/softmax the output in a funny way
| (attention). In other words, they're arguably more "vanilla
| deep net" than most architectures (e.g., conv nets).
|
| (There are also positional/token embeddings and normalization
| but those are a tiny minority of the parameters)
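|
| A bare-bones numpy sketch of that "funny multiply" (single head,
| no masking, names are mine):
|
|   import numpy as np
|
|   def softmax(z):
|       e = np.exp(z - z.max(axis=-1, keepdims=True))
|       return e / e.sum(axis=-1, keepdims=True)
|
|   rng = np.random.default_rng(0)
|   T, d = 4, 8                            # sequence length, width
|   x = rng.standard_normal((T, d))
|   Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
|   Q, K, V = x @ Wq, x @ Wk, x @ Wv       # plain linear layers
|   y = softmax(Q @ K.T / np.sqrt(d)) @ V  # softmax-weighted mixing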
| amelius wrote:
| Ok, but what does a perceptron look like in 1-bit? Would it
| be just some simple logic gate, like an OR-gate?
| ianbicking wrote:
| Reduction to decision tree!
|
| But I'm unclear how it actually runs, and the article talks about
| the conversion and training but doesn't describe how it runs... I
| suppose because it's obvious to someone who has followed
| quantization.
|
| Thinking out loud... if you have a model of just 1 and 0, my
| first thought is that the outputs are 1's and 0's but I think
| that's wrong. Instead it's a bunch of floats, and you multiply
| them by 1 or 0 (in a sense you are sampling the output of the
| previous layer?), add them up, and put the result through some
| activation function. And two-bit quantization sounds kind of
| similar, just with a _little_ scale, going from -1 to 2 instead
| of 0 to 1.
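|
| Something like this toy forward pass, if I have it right (0/1
| weights gating float activations):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   x = rng.standard_normal(16)                    # float activations
|   W = (rng.random((8, 16)) > 0.5).astype(float)  # 1-bit weights
|   y = np.maximum(W @ x, 0.0)  # sum the selected inputs, then ReLU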
|
| It seems kind of interesting that you now have a bunch of weights
| that are exactly 0, meaning you can assert something about what
| parameters and weights affect what parts of the output. Though in
| some sense the way they compress the weights down to one bit is
| also a heuristic you could use to interpret the original model...
| this just makes it clearer that in totality you are making a
| defensible simplification, because the end result is still a
| working model.
|
| It also seems like you could make a lot of mathematical
| assertions about a one bit model that would be harder to make
| otherwise. Like maybe you could start thinking of a model as an
| equation, a _particular_ (though giant) equation, and look at its
| properties and consider symbolic transformations to that
| equation.
| fabmilo wrote:
| I like the decision tree analogy
| mmoskal wrote:
| It seems the trick here is they first quantize it to 1- or 2-bit,
| and then they fine-tune the quantization bias parameters (the
| parameters that dequantize from 1-2 to 16 bit) via LoRA. Then
| they have specialized kernels to do matrix multiplication at the
| bit level.
|
| Also, the 2-bit model seems much better than the 1-bit model -
| they use [-1, 0, 1, 2] - I wonder if '2' is needed in light of
| the 1.58b paper (which claims -1 is def. needed).
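|
| Roughly this, as I read it (numpy sketch, names and shapes are
| mine, details may be off):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   out_f, in_f, r = 8, 16, 2
|   Wq = rng.integers(0, 2, (out_f, in_f)).astype(np.float32)  # frozen
|   scale = np.full((out_f, 1), 0.1)    # trainable dequant scale
|   zero = np.full((out_f, 1), -0.05)   # trainable dequant zero
|   A = 0.01 * rng.standard_normal((out_f, r))  # LoRA factors,
|   B = np.zeros((r, in_f))                     # start as a no-op
|
|   def forward(x):
|       W = scale * Wq + zero      # dequantize on the fly
|       return (W + A @ B) @ x     # plus low-rank correction
|
|   y = forward(rng.standard_normal(in_f))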
| andy_xor_andrew wrote:
| Interesting, and it kinda makes sense. You quantize, which
| invariably means you lose some precision, but then you can
| finetune post-quantization to recover at least some of it. Neat
| idea.
| r1chardnl wrote:
| Is this similar to binary search, or could it be a binary search
| where you're just excluding all the possibilities that you know
| can't be the result that you're looking for?
| nyrikki wrote:
| Feed-forward networks are effectively DAGs.
|
| With the smaller model size, quantized models have less
| accuracy and less stable training, but large models take
| advantage of increased parallelism.
|
| Binary search is more linear; circuits are a better lens than
| Turing machines.
|
| While it still needs to be verified, if you look into what a
| uniform constant-depth threshold circuit is, that will help.
|
| Although I guess binary weights may be in AC0 and not TC0, but
| that may not hold with billions of parameters.
| londons_explore wrote:
| I believe the future is 1 bit models - for both training and
| inference.
|
| When people make custom silicon for 1 bit models, they'll find
| that it is _sooooo_ much more power and silicon-space efficient
| to do 1 bit math than 16 bit floating point - like 100x or more.
|
| That extra model size will vastly overshadow any worse
| performance of the models.
| JHonaker wrote:
| > That extra model size will vastly overshadow any worse
| performance of the models.
|
| ...What?
| sroussey wrote:
| I think OP was referring to parameter size. You can make up
| for quantization by having more parameters.
| cma wrote:
| For training how do you get any kind of meaningful derivative
| with it?
| twelfthnight wrote:
| Maybe evolutionary algorithms instead? Hasn't proven super
| useful historically, but maybe at the scale of enormous LLMs
| it will be?
| bionhoward wrote:
| Evolutionary algorithms made you, didn't they?
| chalst wrote:
| The OP explicitly excludes training.
| wongarsu wrote:
| Probably more than 100x for inference. Not only are you
| drastically reducing the number of bits and replacing float
| math with integer math, you can do matrix multiplication with
| only addition (as pointed out in the BitNet b1.58 paper).
| Additions require a lot less hardware to implement than
| multiplication. Adding one-bit or two-bit numbers requires
| barely any hardware at all. A traditional two-bit adder without
| a carry-out is three XOR gates and an AND gate.
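|
| The b1.58 point in miniature (toy values, plain Python):
|
|   x = [0.5, -1.2, 0.7, 2.0]   # float activations
|   w = [1, 0, -1, 1]           # ternary weights
|   acc = 0.0
|   for wi, xi in zip(w, x):
|       if wi == 1:
|           acc += xi           # add
|       elif wi == -1:
|           acc -= xi           # subtract, never multiply
|   # acc == 0.5 - 0.7 + 2.0 == 1.8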
| programjames wrote:
| > I believe the future is 1 bit models - for both training and
| inference.
|
| 1 bit's nothin'. The future is training directly on
| electronic/photonic circuits.
| bionhoward wrote:
| Isn't 1 bit too low for optimal radix economy (Euler's number)
| though?
|
| I want to see somebody try "imbalanced quaternary" -,0,+,2
| twelfthnight wrote:
| Haven't heard this argument before. But from the Wikipedia
| article it seems base 3 has the best asymptotic radix
| economy, but isn't much better than base 2 and base 2 is
| seemingly easier to program and optimize.
|
| Since this is a new argument I've not heard, would be curious
| if you had links or some more explanation.
| littlestymaar wrote:
| Stupid question: how can you do deep learning on 1 bit? DL works
| by calculating a gradient and following it to minimize a loss
| function, but if all your parameters are 1 bit, there's no such
| thing as a gradient as you don't have differentiation at all. So
| how does it work?
| edflsafoiewq wrote:
| Not sure what you mean, why wouldn't there be a gradient? There
| is a problem that small changes to weights may have no effect
| (eg ((0 + 0.3) + 0.3) + 0.4 = 0 if you have to round the result
| to 0 or 1 after every step), but to fix this I believe everyone
| maintains high precision weights for the backwards pass, from
| which the quantized weights are derived.
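|
| A minimal sketch of that latent-weight trick (toy gradient, not
| an actual training loop):
|
|   import numpy as np
|
|   w_latent = np.array([0.1, -0.2, 0.4])  # high-precision copy
|   lr = 0.1
|   for step in range(3):
|       w_bin = np.sign(w_latent)     # 1-bit weights the forward
|                                     # pass would use
|       grad = np.array([0.3, 0.3, -0.1])  # pretend gradient
|       w_latent -= lr * grad         # update the latent weights,
|                                     # so small steps accumulate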
| Centigonal wrote:
| Not a stupid question at all.
|
| Early quantization approaches just quantized a high-bit-
| precision pre-trained model - no need to calculate gradients on
| the quantized weights. BitNet[1] changed the game by training a
| low-precision model from scratch. It achieves this by keeping
| high precision in the gradients, optimizer state, and in
| "latent weights" which are then quantized on the fly. I don't
| really understand the finer details of how this works, so check
| out the paper if you're interested.
|
| This article's approach is interesting. They start by
| quantizing a pre-trained high-precision model, and then they
| fine-tune the quantized model using LoRA (which dramatically
| improves the performance of the quantized model). They don't
| talk about the bit depth of the values in the LoRA matrices, so
| it may be that those are higher bit-depth.
|
| [1] https://arxiv.org/pdf/2310.11453.pdf
| kromem wrote:
| The most exciting part about ternary or binary weights is the
| inevitable hardware revolution for AI dedicated chips that's
| going to result from it.
| jstmm wrote:
| In both the '1-bit Model' and '2-bit Model' tables, the forward
| time (sec) for Llama2-7B with FP16 (full precision) is 0.1 s,
| whereas it's ~0.231, ~0.257, ~0.353 s respectively for HQQ
| (1-bit) / HQQ+ (1-bit) / Quip# (2-bit), meaning the FP16 model has
| ~3x lower inference time.
|
| On the contrary, in the BitNet b1.58 paper [0] the authors report
| their 7b model has 2.9x reduced inference latency.
|
| It's not clear to me what's happening here. Can someone explain
| why the 1/2-bit HQQ/HQQ+ models are so much slower than the
| BitNet b1.58 models?
|
| [0] https://arxiv.org/pdf/2402.17764.pdf
| thatguysaguy wrote:
| Sounds like these guys didn't use custom kernels, but BitNet
| did.
| londons_explore wrote:
| GPUs aren't really designed for 1-bit math, so it doesn't run
| much faster there than floating point math.
|
| Whereas a custom ASIC or updated design of GPU could give
| _massive_ speedups with 1 bit math.
| bee_rider wrote:
| For 1 bit math, at least it should be possible to populate
| every other bit of an integer type, right? Surely one could
| do _better_ with a dedicated type for this, but at least we
| could pack 16 single-bit weights into a 32 bit int for
| addition, right?
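|
| Quick sketch of that packing idea (one weight per 2-bit lane, so
| a pairwise add can't carry into its neighbour):
|
|   def pack(bits):                   # 16 values in {0, 1}
|       word = 0
|       for i, b in enumerate(bits):
|           word |= (b & 1) << (2 * i)
|       return word
|
|   a = pack([1, 0, 1, 1] + [0] * 12)
|   b = pack([1, 1, 0, 1] + [0] * 12)
|   s = a + b                         # lane-wise sums, each 0, 1 or 2
|   lane0 = s & 0b11                  # == 2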
| UncleOxidant wrote:
| Yes, exactly. Neither GPUs nor CPUs are set up for 1-bit math.
| Pulling 1 or 2 bits out of a word isn't all that
| straightforward on CPU or GPU - lots of shifting and masking.
| I wonder how long it's going to be before we see custom
| hardware for bitnets? I suspect we'll see it on FPGAs first.
___________________________________________________________________
(page generated 2024-03-29 23:00 UTC)