[HN Gopher] Towards 1-bit Machine Learning Models
       ___________________________________________________________________
        
       Towards 1-bit Machine Learning Models
        
       Author : homarp
       Score  : 166 points
       Date   : 2024-03-28 10:58 UTC (1 day ago)
        
 (HTM) web link (mobiusml.github.io)
 (TXT) w3m dump (mobiusml.github.io)
        
       | corysama wrote:
       | I look forward to the inevitable probabilistic sub-bit machine
       | learning models :)
        
         | Cieric wrote:
          | I assume this is meant to be a joke, but isn't this actually
         | being worked on? (minus the probabilistic portion)
         | https://arxiv.org/abs/2310.16795
        
         | Cheer2171 wrote:
         | My 0-bit model can predict if you have cancer with 99.5%
         | accuracy. It doesn't even need input data! Don't ask about the
         | false negative rate though.
        
           | asah wrote:
           | My 0-bit, no-input model can predict if you have cancer with
           | 99.5% accuracy and 0.5% false negative rate. Don't ask about
           | whether the cancer cell(s) are benign.
        
         | paul_mk1 wrote:
         | Sub 1-bit has been done at least as far back as 2016 for VGG
         | style networks (my work).
         | 
         | I was able to get 0.68 "effective" bits.
         | 
          | The idea is that in each forward pass you add noise to _each_
          | weight, independently drawn from a normal distribution, and when
          | you calculate the SNR, it's sub 1 bit. Points to the idea that a
          | stochastic memory element can be used.
         | 
         | https://arxiv.org/abs/1606.01981
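          | 
          | A rough sketch of the "effective bits" measurement, assuming
          | the Gaussian-channel capacity 0.5*log2(1+SNR) is the metric
          | (the paper's exact procedure may differ):
          | 
          |   import numpy as np
          | 
          |   def effective_bits(weights, noise_std):
          |       # Shannon capacity of a Gaussian channel, in bits
          |       snr = np.var(weights) / noise_std**2
          |       return 0.5 * np.log2(1.0 + snr)
          | 
          |   w = np.random.randn(10_000) * 0.05       # toy weight tensor
          |   print(effective_bits(w, noise_std=0.04)) # ~0.68 bits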
        
       | iamgopal wrote:
        | So will new AI chips be coming, optimised for a large-scale,
        | bit-only (non-byte) AI world?
        
         | thesz wrote:
          | Every CPU and GPU now has a popcount [1] instruction. You can
          | thank US three-letter agencies for that [2].
         | [1]
         | https://en.wikipedia.org/wiki/Hamming_weight#Processor_support
         | [2] https://vaibhavsagar.com/blog/2019/09/08/popcount/
         | 
         | Multiplication then is just bitwise AND and a popcount.
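          | 
          | E.g., for {0,1} weights and activations packed into machine
          | words, a dot product is AND plus popcount (a minimal Python
          | sketch, values made up):
          | 
          |   # sum_i a_i * b_i == popcount(a & b) for {0,1} vectors
          |   a = 0b10110110
          |   b = 0b11010011
          |   dot = (a & b).bit_count()  # popcount, Python 3.10+
          |   print(dot)                 # 3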
        
           | ttul wrote:
           | Makes you wonder what the agencies know that we don't...
        
             | rapatel0 wrote:
             | It's just cryptography and encoding. Popcount is useful for
             | understanding entropy.
        
       | JKCalhoun wrote:
       | As an AI layman, I am for some reason excited by the low-bit
       | learning models.
       | 
        | Apple, now leaving behind the albatross that was the Apple car,
       | could be the one to roll their own silicon and put an LLM in our
       | back pockets.
       | 
       | Goodbye Siri, hello Seriously.
        
         | talldayo wrote:
         | Or we could end up getting another Google integration, with iOS
         | being too rigid and locked-down to feasibly support third-party
         | private AI alternatives. Fate has a cruel sense of humor,
         | sometimes.
        
           | odyssey7 wrote:
           | Are you referring to the Swift version of TensorFlow?
           | 
           | https://www.infoworld.com/article/3608151/swift-for-
           | tensorfl...
        
             | Me1000 wrote:
             | I believe they're referring to these rumors:
             | https://www.macrumors.com/2024/03/21/apple-iphone-ai-
             | talks-g...
        
               | JKCalhoun wrote:
               | Could be a way for Apple to buy time as well until they
               | do have something competitive.
        
         | lawlessone wrote:
         | > could be the one to roll their own silicon and put an LLM in
         | our back pockets.
         | 
         | Wouldn't they have to do a lot of catching up with everyone
         | else already working on it?
        
           | declaredapple wrote:
            | They've been designing their own chips for a while now,
            | including an NPU.
           | 
           | Also because of their unified memory design, they actually
           | have insane bandwidth which is incredibly useful for LLMs.
           | IMO they may have a head-start in that respect for on-device
           | inference of large models (e.g. 1B+ params).
        
             | talldayo wrote:
             | I don't think people are running 1B+ models on the Neural
             | Engine these days. The high-performance models I've seen
             | all rely on Metal Performance Shaders, which scales with
             | how powerful your GPU is. It's not terribly slow on iPhone,
             | but I think some people get the wrong idea and correlate an
             | ambient processor like the Neural Engine with LLMs.
             | 
             | The bigger bottleneck seems like memory, to me. iPhones
             | have traditionally skimped on RAM moreso than even cheap
             | and midrange Android counterparts. I can imagine running an
             | LLM in the background on my S10 - it's a bit harder to
             | envision iOS swapping everything smoothly on a similarly-
             | aged iPhone.
        
         | quentinp wrote:
         | Actually they plan to put an LLM in your back pocket using
         | flash memory, not silicon: https://arxiv.org/abs/2312.11514
        
           | declaredapple wrote:
           | The flash doesn't do the computations though, that's just a
           | method of getting it to the processor
        
             | sroussey wrote:
              | It would be better to have EEPROM or some such directly
              | attached as memory. No loading.
        
       | fabmilo wrote:
        | I like how it goes from 13.5 GB to 1.76 GB and gets comparable
        | results. There is definitely some underlying reason why this
        | works so well that we are missing. Are the bits selecting the
        | different subspaces? Can we improve this process by using better
        | normalization layers / gating units?
        
         | nobodyandproud wrote:
         | I wonder what information theory can tell us here.
        
           | ttul wrote:
           | Neural networks work, but why they work is not well known.
           | Researchers continue to find new "free lunches" all the time.
        
           | rapatel0 wrote:
           | The entropy of English is ~1 bit per letter (Measured in a
           | funny way by Shannon - https://cs.stanford.edu/people/erobert
           | s/courses/soco/project...)
           | 
            | In general, it's a bit funny how we build ML models. You
            | take a bucket of matrices, fill them with random numbers, and
            | then start to introduce "biases" through backpropagation. A
            | model can converge to a low loss, but most of them are still
            | filled with random noise.
           | 
           | Binary weights are somewhat obvious in retrospect. Weights
           | indicate the strength of an association between two neurons.
            | Intuitively, most of the value is probably just that an
            | association between two neurons exists, and whether it's
            | positively or negatively associated.
        
             | Animats wrote:
             | I'm surprised that 1-bit works. Trinary (-1, 0, 1) makes
             | sense, but if you only have 0 and 1, the asymmetry ought to
             | be a problem. Or is multiplying an XOR, so 1x1 = 0?
        
               | edflsafoiewq wrote:
                | There's a scale and bias afterwards, so it's not
                | necessarily asymmetric.
        
               | scarmig wrote:
                | It's more like: addition is an XOR. (In fact, AND as
                | multiplication and XOR as addition form GF(2).) Throw in
                | NOT (or, really, just a constant 1) and you can compute any
                | circuit.
               | 
               | Biologically, inhibitory neurons are every bit as
               | important as excitatory ones, so if you squint just
               | right, XOR looks like a neuron's activation being
               | inhibited by another presynaptic neuron.
        
               | rapatel0 wrote:
                | Yeah, basically. With 0 encoding -1 and 1 encoding +1,
                | multiplication is an XNOR operation:
                | 
                |   0 XNOR 0 = 1   (-1 * -1 = +1)
                |   0 XNOR 1 = 0   (-1 * +1 = -1)
                |   1 XNOR 0 = 0   (+1 * -1 = -1)
                |   1 XNOR 1 = 1   (+1 * +1 = +1)
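                | 
                | So a +-1 dot product becomes XNOR plus popcount; a
                | minimal sketch (packing and widths are made up):
                | 
                |   BITS = 8
                |   a = 0b10110110                 # 1 -> +1, 0 -> -1
                |   b = 0b11010011
                |   same = (~(a ^ b)) & ((1 << BITS) - 1)  # XNOR, masked
                |   dot = 2 * same.bit_count() - BITS      # matches - mismatches
                |   print(dot)                     # 0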
        
             | mlsu wrote:
             | Wow, fascinating read. 1999
             | 
             |  _Would a monkey who knew the n-gram frequencies of the
             | letters in English where n is large be able to produce
             | credible English text? Furthermore, does this monkey "know"
             | English? If the N-gram monkey is behind one door and a
             | human is behind the other, could a third-party observer
             | tell which was the monkey? This question rings of the
             | Turing test for artificial intelligence, to which there is
             | no easy answer._
             | 
             | No easy answer indeed!
        
         | convolvatron wrote:
          | idk that we need to think that hard. we have shown we can build
          | arithmetic functions from networks of logic gates given
          | sufficient connectivity. so in some sense it just drops the
          | learning network below the level of words and asks it to make
          | its own math.
          | 
          | since this generalizability thing seems to be real, this kind
          | of directly folds in quantization-aware training, doesn't it?
        
       | WithinReason wrote:
       | 1-bit weights have been a thing since at least 2016:
       | 
       | https://arxiv.org/abs/1606.06160
        
         | buildbot wrote:
         | XNOR-Net was the "first" in that generation, as I recall. It's
         | not a new idea at all though.
         | 
         | Check out the dates on these papers -
         | 
         | https://ieeexplore.ieee.org/abstract/document/286901 < 1994
         | 
         | https://ieeexplore.ieee.org/abstract/document/344783 < 1993
         | 
         | https://link.springer.com/article/10.1007/BF00337115 < 1986
        
           | bjornsing wrote:
           | First version of BNN was submitted to ArXiv a month before
           | XNOR-Net: https://arxiv.org/abs/1602.02830
        
             | buildbot wrote:
             | Oh nice, good catch!
        
           | thechao wrote:
           | We're just speed-running the NN pendulum, at this point.
        
         | rapatel0 wrote:
         | Don't forget Michael
         | 
         | https://arxiv.org/abs/1511.00363 < 2015
        
       | redskyluan wrote:
        | What will be the next step after 1 bit? Could it be 0.5 bit? Any
        | kind of quantization is not gonna last long.
        
         | CamperBob2 wrote:
         | Not using traditional numbers at all, I imagine. It's possible
         | that trying to build neural nets out of vectors that are
         | themselves built with quantized scalars is a suboptimal
         | approach.
        
         | pixelpoet wrote:
         | The way things are going we'll have -1 bit before long :)
        
       | ganzuul wrote:
       | https://en.wikipedia.org/wiki/Adjacency_matrix
       | 
       | What would a graph look like when edges are represented with less
       | than 1-bit?
        
         | robrenaud wrote:
          | It would have predictable structure. There would be groupwise,
          | regular sparsity. You could exploit these patterns to infer
          | properties like, "there is no connection from this vertex to
          | this entire span of vertices".
          | 
          | Just think about representing a compressed version of a matrix.
          | The matrices you want to give a short encoding are ones that
          | are more ordered in some real, exploitable way. With some
          | deterministic computation on the compressed representation, you
          | could produce the full adjacency matrix.
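          | 
          | A toy illustration (my own, not from the linked work): if
          | edges can only appear inside allowed b x b blocks, storing one
          | bit per block costs 1/b^2 bits per potential edge, and a
          | deterministic expansion recovers the full adjacency matrix:
          | 
          |   import numpy as np
          | 
          |   n, b = 16, 4
          |   blocks = (np.random.rand(n // b, n // b) < 0.3).astype(np.uint8)
          |   adj = np.kron(blocks, np.ones((b, b), dtype=np.uint8))
          |   print(adj.shape, blocks.size / n**2, "bits per edge")  # 0.0625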
        
           | ganzuul wrote:
           | https://proceedings.mlr.press/v206/meller23a/meller23a.pdf
           | 
            | It seems to have a lot of similarities to the technique in the
            | OP's article.
        
       | grungegun wrote:
       | Does anyone know if this works on vanilla deep networks? These
        | quantization articles always seem to target LLMs, which leads me
       | to wonder if there's something special about the LLM architecture
       | vs a vanilla deep architecture.
        
         | Y_Y wrote:
         | Certainly does, people have been doing this in computer vision
         | for years.
        
         | alephxyz wrote:
          | LLMs have been trending towards obscenely large numbers of
          | parameters (314B for Grok), which makes quantization crucial if
          | you want to run them without a Meta-sized budget.
        
         | zaptrem wrote:
         | Transformer LLMs are just a bunch of MLPs (linear layers) where
         | you sometimes multiply/softmax the output in a funny way
         | (attention). In other words, they're arguably more "vanilla
         | deep net" than most architectures (e.g., conv nets).
         | 
         | (There are also positional/token embeddings and normalization
         | but those are a tiny minority of the parameters)
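          | 
          | A toy single-head attention sketch (illustrative, not any
          | particular model) showing it's really just linear layers plus
          | a softmax:
          | 
          |   import numpy as np
          | 
          |   def softmax(x):
          |       e = np.exp(x - x.max(-1, keepdims=True))
          |       return e / e.sum(-1, keepdims=True)
          | 
          |   T, d = 5, 8
          |   x = np.random.randn(T, d)
          |   Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
          |   q, k, v = x @ Wq, x @ Wk, x @ Wv          # linear layers
          |   out = softmax(q @ k.T / np.sqrt(d)) @ v   # attention
          |   print(out.shape)                          # (5, 8)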
        
           | amelius wrote:
           | Ok, but what does a perceptron look like in 1-bit? Would it
           | be just some simple logic gate, like an OR-gate?
        
       | ianbicking wrote:
       | Reduction to decision tree!
       | 
        | But I'm unclear on how it actually runs; the article talks about
        | the conversion and training but doesn't describe how it runs... I
        | suppose because it's obvious to someone who has followed
        | quantization.
       | 
       | Thinking out loud... if you have a model of just 1 and 0, my
       | first thought is that the outputs are 1's and 0's but I think
       | that's wrong. Instead it's a bunch of floats, and you multiply
       | them by 1 or 0 (in a sense you are sampling the output of the
       | previous layer?), add them up, and put the result through some
       | activation function. And two-bit quantization sounds kind of
       | similar, just with a _little_ scale, going from -1 to 2 instead
       | of 0 to 1.
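        | 
        | A rough sketch of that reading (the scale/zero-point names are
        | my guesses, not the article's exact kernel): weights are 0/1,
        | activations stay float, and a per-channel scale and zero-point
        | restore the dynamic range:
        | 
        |   import numpy as np
        | 
        |   def onebit_linear(x, w_bits, scale, zero):
        |       w = w_bits * scale + zero   # dequantize 0/1 on the fly
        |       return x @ w.T
        | 
        |   x = np.random.randn(1, 64)
        |   w_bits = (np.random.rand(128, 64) > 0.5).astype(np.float32)
        |   scale = np.random.randn(128, 1) * 0.1
        |   zero = np.random.randn(128, 1) * 0.01
        |   print(onebit_linear(x, w_bits, scale, zero).shape)  # (1, 128)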
       | 
       | It seems kind of interesting that you now have a bunch of weights
       | that are exactly 0, meaning you can assert something about what
       | parameters and weights affect what parts of the output. Though in
       | some sense the way they compress the weights down to one bit is
       | also a heuristic you could use to interpret the original model...
       | this just makes it clearer that in totality you are making a
       | defensible simplification, because the end result is still a
       | working model.
       | 
       | It also seems like you could make a lot of mathematical
       | assertions about a one bit model that would be harder to make
       | otherwise. Like maybe you could start thinking of a model as an
       | equation, a _particular_ (though giant) equation, and look at its
       | properties and consider symbolic transformations to that
       | equation.
        
         | fabmilo wrote:
         | I like the decision tree analogy
        
       | mmoskal wrote:
        | It seems the trick here is they first quantize it to 1 or 2 bits,
        | and then they fine-tune the quantization bias parameters (the
        | parameters that dequantize from 1-2 bits back to 16 bits) via
        | LoRA. Then they have specialized kernels to do matrix
        | multiplication at the bit level.
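        | 
        | A rough sketch of that trick as I read it (names and shapes are
        | made up): the frozen weights are 1-2 bit codes, and only the
        | dequantization parameters plus a low-rank (LoRA) correction get
        | fine-tuned:
        | 
        |   import numpy as np
        | 
        |   def forward(x, w_q, scale, zero, lora_A, lora_B):
        |       w = w_q * scale + zero                  # dequantize codes
        |       return x @ w.T + (x @ lora_A) @ lora_B  # LoRA correction
        | 
        |   out_f, in_f, r = 128, 64, 8
        |   w_q = np.random.randint(0, 2, (out_f, in_f)).astype(np.float32)
        |   scale = np.random.randn(out_f, 1) * 0.1    # trainable
        |   zero = np.zeros((out_f, 1))                # trainable
        |   lora_A = np.random.randn(in_f, r) * 0.01   # trainable
        |   lora_B = np.zeros((r, out_f))              # trainable
        |   x = np.random.randn(1, in_f)
        |   print(forward(x, w_q, scale, zero, lora_A, lora_B).shape)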
       | 
       | Also, the 2-bit model seems much better than the 1-bit model -
       | they use [-1, 0, 1, 2] - I wonder if '2' is needed in light of
       | the 1.58b paper (which claims -1 is def. needed).
        
         | andy_xor_andrew wrote:
         | Interesting, and it kinda makes sense. You quantize, which
         | invariably means you lose some precision, but then you can
         | finetune post-quantization to recover at least some of it. Neat
         | idea.
        
       | r1chardnl wrote:
        | Is this similar to binary search, or could it be a kind of binary
        | search where you're just excluding all possibilities that you know
        | can't be the result you're looking for?
        
         | nyrikki wrote:
         | Feed forward networks are effectively DAGs
         | 
         | With the smaller model size, quantized models have less
         | accuracy and less stable training, but large models take
         | advantage of increased parallelism.
         | 
          | Binary search is more linear; circuits are a better lens than
          | Turing machines.
          | 
          | While it still needs to be verified, if you look into what a
          | uniform constant-depth threshold circuit is, that will help.
          | 
          | Although I guess binary weights may be in AC0 and not TC0, but
          | that may not hold with billions of parameters.
        
       | londons_explore wrote:
       | I believe the future is 1 bit models - for both training and
       | inference.
       | 
       | When people make custom silicon for 1 bit models, they'll find
       | that it is _sooooo_ much more power and silicon-space efficient
       | to do 1 bit math than 16 bit floating point - like 100x or more.
       | 
       | That extra model size will vastly overshadow any worse
       | performance of the models.
        
         | JHonaker wrote:
         | > That extra model size will vastly overshadow any worse
         | performance of the models.
         | 
         | ...What?
        
           | sroussey wrote:
           | I think OP was referring to parameter size. You can make up
           | for quantization by having more parameters.
        
         | cma wrote:
         | For training how do you get any kind of meaningful derivative
         | with it?
        
           | twelfthnight wrote:
           | Maybe evolutionary algorithms instead? Hasn't proven super
           | useful historically, but maybe at the scale of enormous LLMs
           | it will be?
        
             | bionhoward wrote:
             | Evolutionary algorithms made you, didn't they?
        
           | chalst wrote:
           | The OP explicitly excludes training.
        
         | wongarsu wrote:
         | Probably more than 100x for inference. Not only are you
         | drastically reducing the number of bits and replacing float
         | math with integer math, you can do matrix multiplication with
         | only addition (as pointed out in the BitNet b1.58 paper).
         | Additions require a lot less hardware to implement than
         | multiplication. Adding one-bit or two-bit numbers requires
         | barely any hardware at all. A traditional two-bit adder without
          | carry bit is three XOR gates and an AND gate.
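          | 
          | E.g., with ternary weights {-1, 0, +1} a matrix-vector product
          | is just selective add/subtract (illustrative sketch only):
          | 
          |   import numpy as np
          | 
          |   def ternary_matvec(W, x):
          |       out = np.zeros(W.shape[0])
          |       for i, row in enumerate(W):
          |           out[i] = x[row == 1].sum() - x[row == -1].sum()
          |       return out
          | 
          |   W = np.random.choice([-1, 0, 1], size=(4, 8))
          |   x = np.random.randn(8)
          |   assert np.allclose(ternary_matvec(W, x), W @ x)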
        
         | programjames wrote:
         | > I believe the future is 1 bit models - for both training and
         | inference.
         | 
         | 1 bit's nothin'. The future is training directly on
         | electronic/photonic circuits.
        
         | bionhoward wrote:
          | Isn't 1 bit too low for optimal radix economy (Euler's number)
          | though?
         | 
         | I want to see somebody try "imbalanced quaternary" -,0,+,2
        
           | twelfthnight wrote:
            | Haven't heard this argument before. But from the Wikipedia
            | article it seems base 3 has the best asymptotic radix economy,
            | though it isn't much better than base 2, and base 2 is
            | seemingly easier to program and optimize.
           | 
           | Since this is a new argument I've not heard, would be curious
           | if you had links or some more explanation.
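            | 
            | For reference, asymptotic radix economy is proportional to
            | b / ln(b), minimized at b = e ~ 2.718, so base 3 only
            | narrowly beats base 2:
            | 
            |   import math
            |   for b in (2, 3, 4):
            |       print(b, b / math.log(b))
            |   # 2 -> 2.885, 3 -> 2.731, 4 -> 2.885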
        
       | littlestymaar wrote:
       | Stupid question: how can you do deep learning on 1 bit? DL works
       | by calculating a gradient and following it to minimize a loss
       | function, but if all your parameters are 1 bit, there's no such
       | thing as a gradient as you don't have differentiation at all. So
       | how does it work?
        
         | edflsafoiewq wrote:
         | Not sure what you mean, why wouldn't there be a gradient? There
         | is a problem that small changes to weights may have no effect
         | (eg ((0 + 0.3) + 0.3) + 0.4 = 0 if you have to round the result
         | to 0 or 1 after every step), but to fix this I believe everyone
         | maintains high precision weights for the backwards pass, from
         | which the quantized weights are derived.
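          | 
          | The usual trick is a straight-through estimator (a sketch of
          | the general idea, not necessarily this article's method):
          | 
          |   import torch
          | 
          |   w_latent = torch.randn(16, requires_grad=True)  # float copy
          |   w_bin = torch.sign(w_latent)                    # {-1, +1}
          |   w_ste = w_latent + (w_bin - w_latent).detach()  # fwd: binary
          |   loss = (w_ste.sum() - 3.0) ** 2
          |   loss.backward()
          |   print(w_latent.grad)  # gradients land on the float copy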
        
         | Centigonal wrote:
         | Not a stupid question at all.
         | 
         | Early quantization approaches just quantized a high-bit-
         | precision pre-trained model - no need to calculate gradients on
         | the quantized weights. BitNet[1] changed the game by training a
         | low-precision model from scratch. It achieves this by keeping
         | high precision in the gradients, optimizer state, and in
         | "latent weights" which are then quantized on the fly. I don't
         | really understand the finer details of how this works, so check
         | out the paper if you're interested.
         | 
         | This article's approach is interesting. They start by
         | quantizing a pre-trained high-precision model, and then they
         | fine-tune the quantized model using LoRA (which dramatically
         | improves the performance of the quantized model). They don't
         | talk about the bit depth of the values in the LoRA matrices, so
         | it may be that those are higher bit-depth.
         | 
         | [1] https://arxiv.org/pdf/2310.11453.pdf
        
       | kromem wrote:
       | The most exciting part about ternary or binary weights is the
       | inevitable hardware revolution for AI dedicated chips that's
       | going to result from it.
        
       | jstmm wrote:
       | In both the '1-bit Model' and '2-bit Model' tables, the forward
       | time (sec) for Llama2-7B with FP16 (full precision) is 0.1 s,
       | whereas it's ~0.231, ~0.257, ~0.353 s respectively for HQQ
       | (1-bit) / HQQ+ (1-bit) / Quip# (2-bit) meaning the FP16 model has
       | ~3x lower inference time.
       | 
        | On the contrary, in the BitNet b1.58 paper [0] the authors report
        | that their 7B model has 2.9x reduced inference latency.
        | 
        | It's not clear to me what's happening here. Can someone explain
        | why the 1/2-bit HQQ/HQQ+ models are so much slower than the
        | BitNet b1.58 models?
       | 
       | [0] https://arxiv.org/pdf/2402.17764.pdf
        
         | thatguysaguy wrote:
         | Sounds like these guys didn't use custom kernels, but BitNet
         | did.
        
         | londons_explore wrote:
          | GPUs aren't really designed for 1-bit math... it doesn't run
          | much faster than floating-point math on them.
         | 
         | Whereas a custom ASIC or updated design of GPU could give
         | _massive_ speedups with 1 bit math.
        
           | bee_rider wrote:
           | For 1 bit math, at least it should be possible to populate
           | every other bit of an integer type, right? Surely one could
           | do _better_ with a dedicated type for this, but at least we
           | could pack 16 single-bit weights into a 32 bit int for
           | addition, right?
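            | 
            | E.g., a sketch of that packing in Python (2-bit lanes, so a
            | few packed adds don't carry into the neighbouring weight):
            | 
            |   def pack(bits):   # 16 one-bit weights into one 32-bit word
            |       return sum(b << (2 * i) for i, b in enumerate(bits))
            | 
            |   def unpack(x):
            |       return [(x >> (2 * i)) & 0b11 for i in range(16)]
            | 
            |   a = pack([1, 0, 1, 1] * 4)
            |   b = pack([1, 1, 0, 1] * 4)
            |   print(unpack(a + b))   # lane-wise sums: [2, 1, 1, 2, ...]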
        
           | UncleOxidant wrote:
            | Yes, exactly. Neither GPUs nor CPUs are set up for 1-bit math.
            | Pulling 1 or 2 bits out of a word isn't all that
            | straightforward on a CPU or GPU - lots of shifting and masking.
           | I wonder how long it's going to be before we see custom
           | hardware for bitnets? I suspect we'll see it on FPGAs first.
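            | 
            | E.g., unpacking 2-bit codes from a 32-bit word is a pile of
            | shifts and masks plus a lookup (illustrative only; the
            | dequant table is made up):
            | 
            |   word = 0b11001001_11001001_11001001_11001001
            |   codes = [(word >> (2 * i)) & 0b11 for i in range(16)]
            |   lut = [-1.0, -0.33, 0.33, 1.0]   # hypothetical dequant table
            |   weights = [lut[c] for c in codes]
            |   print(codes[:4], weights[:4])    # [1, 2, 0, 3] ...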
        
       ___________________________________________________________________
       (page generated 2024-03-29 23:00 UTC)