[HN Gopher] Towards 1-bit Machine Learning Models
       ___________________________________________________________________
        
       Towards 1-bit Machine Learning Models
        
       Author : homarp
       Score  : 315 points
       Date   : 2024-03-28 10:58 UTC (2 days ago)
        
 (HTM) web link (mobiusml.github.io)
 (TXT) w3m dump (mobiusml.github.io)
        
       | corysama wrote:
       | I look forward to the inevitable probabilistic sub-bit machine
       | learning models :)
        
         | Cieric wrote:
          | I assume this is meant to be a joke, but isn't this
          | actually being worked on? (minus the probabilistic portion)
         | https://arxiv.org/abs/2310.16795
        
         | Cheer2171 wrote:
         | My 0-bit model can predict if you have cancer with 99.5%
         | accuracy. It doesn't even need input data! Don't ask about the
         | false negative rate though.
        
           | asah wrote:
           | My 0-bit, no-input model can predict if you have cancer with
           | 99.5% accuracy and 0.5% false negative rate. Don't ask about
           | whether the cancer cell(s) are benign.
        
             | ganzuul wrote:
             | My computation-free model can give you cancer with 100%
             | certainty.
        
         | paul_mk1 wrote:
         | Sub 1-bit has been done at least as far back as 2016 for VGG
         | style networks (my work).
         | 
         | I was able to get 0.68 "effective" bits.
         | 
          | The idea is that in each forward pass you add noise to
          | _each_ weight, independently drawn from a normal
          | distribution, and when you calculate the SNR, it's sub 1
          | bit. Points to the idea that a stochastic memory element
          | can be used.
         | 
         | https://arxiv.org/abs/1606.01981
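          | 
          | A minimal sketch of the noisy-forward idea, assuming the
          | "effective bits" come from the SNR via the Gaussian channel
          | capacity 0.5*log2(1 + SNR) (the paper's exact accounting
          | may differ):
          | 
          |     import math, torch
          | 
          |     w = torch.randn(4096, 4096)  # pretrained layer weights
          |     sigma = 0.8 * w.std()        # noise scale, illustrative
          | 
          |     def noisy_forward(x, w, sigma):
          |         # fresh noise per forward pass, per weight
          |         return x @ (w + sigma * torch.randn_like(w)).T
          | 
          |     snr = (w.var() / sigma**2).item()
          |     bits = 0.5 * math.log2(1 + snr)
          |     print(f"SNR={snr:.2f}, ~{bits:.2f} effective bits")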
        
           | overengineer wrote:
            | If noise is desirable in this way, why not just switch
            | back to analog computing, which comes with free noise?
        
       | iamgopal wrote:
        | So new AI chips will be coming, optimised for a large-scale,
        | bit-only (non-byte) AI world?
        
         | thesz wrote:
          | Every CPU and GPU now has a popcount [1] instruction. You
          | can thank US three-letter agencies for that [2].
         | [1]
         | https://en.wikipedia.org/wiki/Hamming_weight#Processor_support
         | [2] https://vaibhavsagar.com/blog/2019/09/08/popcount/
         | 
         | Multiplication then is just bitwise AND and a popcount.
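          | 
          | For illustration, with {0,1} weights and activations packed
          | into machine words, a dot product really is just AND plus
          | popcount (toy sketch; int.bit_count needs Python 3.10+):
          | 
          |     def binary_dot(a_bits: int, w_bits: int) -> int:
          |         # elementwise multiply of {0,1} vectors == AND,
          |         # then the dot product is the population count
          |         return (a_bits & w_bits).bit_count()
          | 
          |     print(binary_dot(0b10110110, 0b11100011))  # 3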
        
           | ttul wrote:
           | Makes you wonder what the agencies know that we don't...
        
             | rapatel0 wrote:
             | It's just cryptography and encoding. Popcount is useful for
             | understanding entropy.
        
       | JKCalhoun wrote:
       | As an AI layman, I am for some reason excited by the low-bit
       | learning models.
       | 
        | Apple, now leaving behind the albatross that was the Apple car,
       | could be the one to roll their own silicon and put an LLM in our
       | back pockets.
       | 
       | Goodbye Siri, hello Seriously.
        
         | talldayo wrote:
         | Or we could end up getting another Google integration, with iOS
         | being too rigid and locked-down to feasibly support third-party
         | private AI alternatives. Fate has a cruel sense of humor,
         | sometimes.
        
           | odyssey7 wrote:
           | Are you referring to the Swift version of TensorFlow?
           | 
           | https://www.infoworld.com/article/3608151/swift-for-
           | tensorfl...
        
             | Me1000 wrote:
             | I believe they're referring to these rumors:
             | https://www.macrumors.com/2024/03/21/apple-iphone-ai-
             | talks-g...
        
               | JKCalhoun wrote:
               | Could be a way for Apple to buy time as well until they
               | do have something competitive.
        
         | lawlessone wrote:
         | > could be the one to roll their own silicon and put an LLM in
         | our back pockets.
         | 
         | Wouldn't they have to do a lot of catching up with everyone
         | else already working on it?
        
           | declaredapple wrote:
            | They've been designing their own chips for a while now,
            | including with an NPU.
           | 
           | Also because of their unified memory design, they actually
           | have insane bandwidth which is incredibly useful for LLMs.
           | IMO they may have a head-start in that respect for on-device
           | inference of large models (e.g. 1B+ params).
        
             | talldayo wrote:
             | I don't think people are running 1B+ models on the Neural
             | Engine these days. The high-performance models I've seen
             | all rely on Metal Performance Shaders, which scales with
             | how powerful your GPU is. It's not terribly slow on iPhone,
             | but I think some people get the wrong idea and correlate an
             | ambient processor like the Neural Engine with LLMs.
             | 
             | The bigger bottleneck seems like memory, to me. iPhones
             | have traditionally skimped on RAM moreso than even cheap
             | and midrange Android counterparts. I can imagine running an
             | LLM in the background on my S10 - it's a bit harder to
             | envision iOS swapping everything smoothly on a similarly-
             | aged iPhone.
        
               | JKCalhoun wrote:
               | Sure, but we're discussing 1.8-bit models that, again I'm
               | a layman, I assume are over an order of magnitude smaller
               | in their memory overhead.
        
         | quentinp wrote:
         | Actually they plan to put an LLM in your back pocket using
         | flash memory, not silicon: https://arxiv.org/abs/2312.11514
        
           | declaredapple wrote:
           | The flash doesn't do the computations though, that's just a
           | method of getting it to the processor
        
             | sroussey wrote:
              | It would be better to have EEPROM or some such directly
              | attached as memory. No loading.
        
       | fabmilo wrote:
        | I like how it goes from 13.5 GB to 1.76 GB and gets
        | comparable results. There is definitely some underlying
        | reason why this works so well that we are missing. Are the
        | bits selecting different subspaces? Can we improve this
        | process by using better normalization layers / gating units?
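        | 
        | Back-of-the-envelope check of those sizes, assuming Llama2-7B
        | has ~6.74B parameters:
        | 
        |     params = 6.74e9
        |     print(params * 16 / 8 / 1e9)   # fp16: ~13.5 GB
        |     print(params * 2.1 / 8 / 1e9)  # ~2.1 bits/param: ~1.77 GB
        | 
        | so the 1.76 GB checkpoint works out to roughly 2 effective
        | bits per weight once the group meta-data is counted.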
        
         | nobodyandproud wrote:
         | I wonder what information theory can tell us here.
        
           | ttul wrote:
           | Neural networks work, but why they work is not well known.
           | Researchers continue to find new "free lunches" all the time.
        
           | rapatel0 wrote:
           | The entropy of English is ~1 bit per letter (Measured in a
           | funny way by Shannon - https://cs.stanford.edu/people/erobert
           | s/courses/soco/project...)
           | 
            | In general, it's a bit funny how we build ML models. You
            | take a bucket of matrices, fill them with random numbers,
            | and then start to introduce "biases" through
            | backpropagation. A model can converge to a low loss, but
            | most of the weights still look like random noise.
           | 
           | Binary weights are somewhat obvious in retrospect. Weights
           | indicate the strength of an association between two neurons.
           | Intuitively, most of the value is probably just that an
           | association between two neurons exists, and whether it's
            | positively or negatively associated.
        
             | Animats wrote:
             | I'm surprised that 1-bit works. Trinary (-1, 0, 1) makes
             | sense, but if you only have 0 and 1, the asymmetry ought to
             | be a problem. Or is multiplying an XOR, so 1x1 = 0?
        
               | edflsafoiewq wrote:
                | There's a scale and bias afterwards, so it's not
                | necessarily asymmetric.
        
               | scarmig wrote:
               | It's more like, addition is an XOR. (In fact, AND as mult
               | and XOR as add are GF(2)). Throw in NOT (or, really, just
               | a constant 1) and you can compute any circuit.
               | 
               | Biologically, inhibitory neurons are every bit as
               | important as excitatory ones, so if you squint just
               | right, XOR looks like a neuron's activation being
               | inhibited by another presynaptic neuron.
        
               | rapatel0 wrote:
                | Yeah, basically. With {-1, +1} encoded as bits,
                | multiplication is an XNOR operation:
                | 
                |     0 XNOR 0 = 1
                |     0 XNOR 1 = 0
                |     1 XNOR 0 = 0
                |     1 XNOR 1 = 1
        
               | zamadatix wrote:
                | 0*0 = 1? I've always heard it as being the output of AND
        
               | overengineer wrote:
               | For balanced binary [-1, 1] yes. -1*-1=1
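                | 
                | A tiny sketch of that encoding (bit 0 for
                | -1, bit 1 for +1; bit_count needs Python
                | 3.10+):
                | 
                |     # sign product = XNOR,
                |     # dot = 2*popcount - n
                |     def pm1_dot(a, w, n):
                |         m = (1 << n) - 1
                |         agree = ~(a ^ w) & m
                |         return 2 * agree.bit_count() - n
                | 
                |     # a=[+1,-1,+1], w=[+1,+1,-1] -> -1
                |     print(pm1_dot(0b101, 0b110, 3))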
        
               | zamadatix wrote:
               | That makes sense, thanks!
        
               | scarmig wrote:
               | XNOR does not distribute over AND or any other binary
               | operators (try 0 XNOR (0 AND 1)), nor does it have a
               | multiplicative identity, so it's not really
               | multiplication in a way that's useful.
        
               | gpm wrote:
               | It seems like they have learned floating point parameters
               | x and y so that dequantized(bit 0) = x and
               | dequantized(bit 1) = y. Thus there is no built in
               | asymmetry. Or more precisely they learned a zero point
               | and a scale but it's equivalent to this simpler model in
               | the 1 bit case.
               | 
               | It still seems like there would be a problem because
               | either [x, y] looks like [0ish, 1ish] and you can't have
               | negative weights, or [x, y] looks like [-1ish, 1ish] and
               | you can't have "don't care" weights. But if you have some
               | redundancy in your neurons I guess this is acceptable
               | because you can just cancel out the positive contribution
               | from a neuron you don't care about with a negative
               | contribution from a very similar neuron that you also
               | don't care about.
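                | 
                | A minimal sketch of that dequantization
                | (per-group scale and zero point; names
                | here are just illustrative):
                | 
                |     import torch
                | 
                |     # 1-bit codes, groups of 8
                |     q = torch.randint(0, 2, (64, 8)).float()
                |     scale = torch.randn(64, 1)
                |     zero = torch.randn(64, 1)
                |     w = (q - zero) * scale
                |     # bit 0 -> (0 - zero) * scale
                |     # bit 1 -> (1 - zero) * scale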
        
             | mlsu wrote:
             | Wow, fascinating read. 1999
             | 
             |  _Would a monkey who knew the n-gram frequencies of the
             | letters in English where n is large be able to produce
             | credible English text? Furthermore, does this monkey "know"
             | English? If the N-gram monkey is behind one door and a
             | human is behind the other, could a third-party observer
             | tell which was the monkey? This question rings of the
             | Turing test for artificial intelligence, to which there is
             | no easy answer._
             | 
             | No easy answer indeed!
        
             | nobodyandproud wrote:
             | Down to one bit but that's taking the 2.62 bits and then
             | applying the redundancy factor.
             | 
             | What's cool is that the differentiable activation function
             | is important---to avoid the linear, perceptron limitation--
             | but the weight scaling can be so simple, at least for LLMs.
             | 
             | It makes me wonder whether the extra layers are effectively
             | compensating; in other words, can the number of layers or
             | hidden neurons be trimmed down if we then add more bits to
             | each weight and still see equivalent effectiveness?
        
               | imtringued wrote:
               | You can just delete whole layers, because the point of
               | residual layers is to make the model learn the
               | hyperparameter "layer count" automatically. Compare this
               | to the absence of residual layers where the model must
               | use all layers. Then you will have to get the layer count
               | perfect, but this isn't possible, since each data point
               | might benefit from a different layer count. The extra
                | layers therefore exist primarily so the model becomes
                | robust to poorly chosen hyperparameters. You still
                | need a minimum number of layers, but that isn't the
                | problem.
               | 
               | https://arxiv.org/abs/2403.17887
        
         | convolvatron wrote:
          | idk that we need to think that hard. we have shown we can
          | build arithmetic functions from networks of logic gates
          | given sufficient connectivity. so in some sense it just
          | drops the learning network below the level of words and
          | asks it to make its own math.
          | 
          | since this generalizability thing seems to be real, this
          | kind of directly folds in quantization-aware training,
          | doesn't it?
        
       | WithinReason wrote:
       | 1-bit weights have been a thing since at least 2016:
       | 
       | https://arxiv.org/abs/1606.06160
        
         | buildbot wrote:
         | XNOR-Net was the "first" in that generation, as I recall. It's
         | not a new idea at all though.
         | 
         | Check out the dates on these papers -
         | 
         | https://ieeexplore.ieee.org/abstract/document/286901 < 1994
         | 
         | https://ieeexplore.ieee.org/abstract/document/344783 < 1993
         | 
         | https://link.springer.com/article/10.1007/BF00337115 < 1986
        
           | bjornsing wrote:
           | First version of BNN was submitted to ArXiv a month before
           | XNOR-Net: https://arxiv.org/abs/1602.02830
        
             | buildbot wrote:
             | Oh nice, good catch!
        
           | thechao wrote:
           | We're just speed-running the NN pendulum, at this point.
        
         | rapatel0 wrote:
         | Don't forget Michael
         | 
         | https://arxiv.org/abs/1511.00363 < 2015
        
         | elcomet wrote:
         | They have been a thing for decades
         | 
         | https://ieeexplore.ieee.org/abstract/document/363432
        
           | WithinReason wrote:
           | I don't think that paper's methods could be applied to LLMs
        
       | redskyluan wrote:
        | What will be the next step after 1 bit? Could it be 0.5 bit?
        | Any kind of quantization is not gonna last long.
        
         | CamperBob2 wrote:
         | Not using traditional numbers at all, I imagine. It's possible
         | that trying to build neural nets out of vectors that are
         | themselves built with quantized scalars is a suboptimal
         | approach.
        
         | pixelpoet wrote:
         | The way things are going we'll have -1 bit before long :)
        
       | ganzuul wrote:
       | https://en.wikipedia.org/wiki/Adjacency_matrix
       | 
       | What would a graph look like when edges are represented with less
       | than 1-bit?
        
         | robrenaud wrote:
          | It would have predictable structure. There would be
          | groupwise, regular sparsity. You could exploit these
          | patterns to infer properties like "there is no connection
          | from this vertex to this entire span of vertices".
          | 
          | Just think about representing a compressed version of a
          | matrix. You want a short encoding for matrices that are
          | more ordered in some real, exploitable way. With some
          | deterministic computation on the compressed representation,
          | you could produce the full adjacency matrix.
        
           | ganzuul wrote:
           | https://proceedings.mlr.press/v206/meller23a/meller23a.pdf
           | 
           | It seems to have a lot of similarities to the technique in
           | OPs article.
        
         | selecsosi wrote:
         | I think the problem is that you are also trying to represent
         | the biases of directionality in the connection e.g.: [-1,0,1].
         | In the adjacency matrix, if you store the sign (direction) of
         | the edge, then I think that would map fairly isomorphically.
        
           | ganzuul wrote:
           | Basically factorized into a giant state machine. Would be
           | logical.
        
       | grungegun wrote:
       | Does anyone know if this works on vanilla deep networks? These
       | quantization articles always seem to target LLM's which leads me
       | to wonder if there's something special about the LLM architecture
       | vs a vanilla deep architecture.
        
         | Y_Y wrote:
         | Certainly does, people have been doing this in computer vision
         | for years.
        
         | alephxyz wrote:
          | LLMs have been trending towards obscenely large numbers of
          | parameters (314B for Grok), which makes quantization
          | crucial if you want to run them without a Meta-sized
          | budget.
        
         | zaptrem wrote:
         | Transformer LLMs are just a bunch of MLPs (linear layers) where
         | you sometimes multiply/softmax the output in a funny way
         | (attention). In other words, they're arguably more "vanilla
         | deep net" than most architectures (e.g., conv nets).
         | 
         | (There are also positional/token embeddings and normalization
         | but those are a tiny minority of the parameters)
        
           | amelius wrote:
           | Ok, but what does a perceptron look like in 1-bit? Would it
           | be just some simple logic gate, like an OR-gate?
        
             | zaptrem wrote:
             | Not my area of expertise but I'd assume it becomes a
             | decision tree or something.
             | 
             | Edit: lol https://news.ycombinator.com/item?id=39868508
        
           | grungegun wrote:
            | So there's no performance gain for quantization enabled
            | by the transformer architecture? It seems very strange
            | that quantization works so well, since in most of my
            | experiments the internal model weights of MLPs look
            | random.
        
       | ianbicking wrote:
       | Reduction to decision tree!
       | 
        | But I'm unclear how it actually runs, and the article talks about
       | the conversion and training but doesn't describe how it runs... I
       | suppose because it's obvious to someone who has followed
       | quantization.
       | 
       | Thinking out loud... if you have a model of just 1 and 0, my
       | first thought is that the outputs are 1's and 0's but I think
       | that's wrong. Instead it's a bunch of floats, and you multiply
       | them by 1 or 0 (in a sense you are sampling the output of the
       | previous layer?), add them up, and put the result through some
       | activation function. And two-bit quantization sounds kind of
       | similar, just with a _little_ scale, going from -1 to 2 instead
       | of 0 to 1.
       | 
       | It seems kind of interesting that you now have a bunch of weights
       | that are exactly 0, meaning you can assert something about what
       | parameters and weights affect what parts of the output. Though in
       | some sense the way they compress the weights down to one bit is
       | also a heuristic you could use to interpret the original model...
       | this just makes it clearer that in totality you are making a
       | defensible simplification, because the end result is still a
       | working model.
       | 
       | It also seems like you could make a lot of mathematical
       | assertions about a one bit model that would be harder to make
       | otherwise. Like maybe you could start thinking of a model as an
       | equation, a _particular_ (though giant) equation, and look at its
       | properties and consider symbolic transformations to that
       | equation.
        
         | fabmilo wrote:
         | I like the decision tree analogy
        
         | bick_nyers wrote:
         | A comment I really liked on a previous post about ternary
         | highlighted that what you are actually measuring with {-1, 0,
         | 1} is inverse correlation, no correlation, and correlation.
        
       | mmoskal wrote:
       | It seems the trick here is they first quantize it to 1- or 2-bit,
       | and then they fine-tune the quantization bias parameters (the
       | parameters that dequantize from 1-2 to 16 bit) via LoRA. Then
       | they have specialized kernels to do matrix multiplication at the
       | bit level.
       | 
       | Also, the 2-bit model seems much better than the 1-bit model -
       | they use [-1, 0, 1, 2] - I wonder if '2' is needed in light of
       | the 1.58b paper (which claims -1 is def. needed).
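        | 
        | A rough sketch of that recipe (not the authors' code; the
        | per-output-channel scale/zero and the shapes here are only
        | illustrative):
        | 
        |     import torch, torch.nn as nn
        | 
        |     class QuantLinearLoRA(nn.Module):
        |         def __init__(self, w_q, scale, zero, rank=8):
        |             super().__init__()
        |             out_f, in_f = w_q.shape
        |             self.register_buffer("w_q", w_q)  # frozen codes
        |             self.scale = nn.Parameter(scale)  # trainable
        |             self.zero = nn.Parameter(zero)    # trainable
        |             self.A = nn.Parameter(torch.zeros(out_f, rank))
        |             self.B = nn.Parameter(0.01 * torch.randn(rank, in_f))
        | 
        |         def forward(self, x):
        |             # dequantize, then add the low-rank correction
        |             w = (self.w_q.float() - self.zero) * self.scale
        |             return x @ w.T + (x @ self.B.T) @ self.A.T
        | 
        |     layer = QuantLinearLoRA(
        |         torch.randint(0, 2, (32, 16), dtype=torch.uint8),
        |         torch.randn(32, 1), torch.randn(32, 1))
        |     y = layer(torch.randn(4, 16))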
        
         | andy_xor_andrew wrote:
         | Interesting, and it kinda makes sense. You quantize, which
         | invariably means you lose some precision, but then you can
         | finetune post-quantization to recover at least some of it. Neat
         | idea.
        
         | jimmySixDOF wrote:
         | Which is itself a little counterintuitive as the arxiv papers
         | they cite say models need to be pretrained from the ground up
         | using 1- or 2-bit (or 1.58bit). It definitely adds some
         | interesting data points for the open source community who are
         | experimenting in every possible direction.
        
       | r1chardnl wrote:
       | Is this similar to binary search or could it be a binary search
       | where you're just excluding all possibilities that you know can't
       | be the probable result that you're looking for?
        
         | nyrikki wrote:
         | Feed forward networks are effectively DAGs
         | 
         | With the smaller model size, quantized models have less
         | accuracy and less stable training, but large models take
         | advantage of increased parallelism.
         | 
          | Binary search is more linear; circuits are a better lens
          | than Turing machines here.
          | 
          | While it still needs to be verified, if you look into what
          | a uniform constant-depth threshold circuit is, that will
          | help.
          | 
          | Although I guess binary weights may be in AC0 and not TC0,
          | that may not hold with billions of parameters.
        
       | londons_explore wrote:
       | I believe the future is 1 bit models - for both training and
       | inference.
       | 
       | When people make custom silicon for 1 bit models, they'll find
       | that it is _sooooo_ much more power and silicon-space efficient
       | to do 1 bit math than 16 bit floating point - like 100x or more.
       | 
       | That extra model size will vastly overshadow any worse
       | performance of the models.
        
         | JHonaker wrote:
         | > That extra model size will vastly overshadow any worse
         | performance of the models.
         | 
         | ...What?
        
           | sroussey wrote:
           | I think OP was referring to parameter size. You can make up
           | for quantization by having more parameters.
        
             | JHonaker wrote:
             | I know. I was being a bit mean, but they seem to be
             | implying that more parameters _even with_ worse performance
             | is preferable. That seems backward to me. There is no
             | reason for parameter size to be intrinsically valuable, and
             | I think people are a bit infatuated with it.
        
         | cma wrote:
         | For training how do you get any kind of meaningful derivative
         | with it?
        
           | twelfthnight wrote:
           | Maybe evolutionary algorithms instead? Hasn't proven super
           | useful historically, but maybe at the scale of enormous LLMs
           | it will be?
        
             | bionhoward wrote:
             | Evolutionary algorithms made you, didn't they?
        
               | TheDudeMan wrote:
                | That does not prove that they can beat gradient descent.
        
               | bick_nyers wrote:
               | It took a lot of human brain flops to get to this point
               | in time though, I wonder how many orders of magnitude
               | more than it took to train ChatGPT...
        
             | VHRanger wrote:
             | Nope, they're orders of magnitude more inefficient because
             | they don't leverage gradient descent.
             | 
             | Rule of thumb in optimization: real numbers are easy,
             | integers are hard
        
               | markisus wrote:
               | This may be the status quo because of the so called
               | "hardware lottery" which has historically been optimized
               | for floating point. I'm speculating, but if hardware
               | designers were instead only concerned about raw xnor
               | density and throughput, we might end up with chips
               | powerful enough that giant 1-bit nets could be trained
               | purely through evolution.
        
               | imtringued wrote:
               | How do you optimize memory for floating point?
        
               | VHRanger wrote:
               | BF8 and other similar formats?
        
               | VHRanger wrote:
               | No, it's a fact at the mathematical level that you can
               | enshrine in big O terms if you want to
        
             | bick_nyers wrote:
             | Gradient-directed evolutionary algorithm sounds kinda
             | interesting.
        
           | chalst wrote:
           | The OP explicitly excludes training.
        
             | cma wrote:
             | The one I replied to said 1-bit for both training and
             | inference.
        
           | scotty79 wrote:
           | Maybe something probabilistic?
        
         | wongarsu wrote:
         | Probably more than 100x for inference. Not only are you
         | drastically reducing the number of bits and replacing float
         | math with integer math, you can do matrix multiplication with
         | only addition (as pointed out in the BitNet b1.58 paper).
         | Additions require a lot less hardware to implement than
         | multiplication. Adding one-bit or two-bit numbers requires
          | barely any hardware at all. A traditional two-bit adder
          | without a carry bit is three XOR gates and an AND gate.
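          | 
          | A quick sketch of the "matmul is just additions" point for
          | ternary weights (values and shapes are made up):
          | 
          |     import torch
          | 
          |     x = torch.randn(16)
          |     w = torch.randint(-1, 2, (4, 16)).float()  # {-1,0,+1}
          | 
          |     y_ref = w @ x
          |     y_add = torch.stack(
          |         [x[r == 1].sum() - x[r == -1].sum() for r in w])
          |     assert torch.allclose(y_ref, y_add, atol=1e-6)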
        
           | fasa99 wrote:
            | To me the most exciting thing is that if it is training
            | that is sped up on the order of 100x-1000x, a large
            | cluster may be well suited to gradient-descent-style
            | hyperparameter tuning, running LLM training again and
            | again at scale -- this is the first foot in the door
            | towards an AI that may iteratively improve itself.
        
             | mobicham wrote:
             | LoRA training should benefit from the same speed-up,
             | because the 1-bit weights will be frozen and all you need
             | for both the forward and backward pass is a binary matmul,
             | then maybe cast after to get more stable gradients.
        
         | programjames wrote:
         | > I believe the future is 1 bit models - for both training and
         | inference.
         | 
         | 1 bit's nothin'. The future is training directly on
         | electronic/photonic circuits.
        
         | bionhoward wrote:
         | Isn't 1bit too low for optimal radix economy (Euler's number)
         | though?
         | 
         | I want to see somebody try "imbalanced quaternary" -,0,+,2
        
           | twelfthnight wrote:
           | Haven't heard this argument before. But from the Wikipedia
            | article it seems base 3 has the best asymptotic radix
           | economy, but isn't much better than base 2 and base 2 is
           | seemingly easier to program and optimize.
           | 
           | Since this is a new argument I've not heard, would be curious
           | if you had links or some more explanation.
        
           | johnmorrison wrote:
           | people are indeed working on -1,0,1,2 Q2 models, I read
           | something about it the other day but don't recall the title.
           | 
           | they mentioned decomposition of Q2 into ternary+binary i.e.
           | [[1,2],[-1,0]] -> [[1,1],[0,0]] + [[0,1],[-1,0]]
        
           | Dylan16807 wrote:
           | I bet the optimal "large" value is bigger than 2.
        
           | hervature wrote:
           | Radix economy is all about which base is the most efficient
           | to represent a given number. It is simple to show that, for
           | large numbers, this is equivalent to how efficient a base can
           | represent itself, b/ln(b). Simple calculus shows this is
           | minimized at e (Euler's number) or 3 if integer (closely
           | followed by 2).
           | 
           | It sounds like you have something to add but you are already
           | dictating the base by saying "bit". Literally from "binary
           | digit". Anyway, quantization is not about which number system
           | is best - virtually all computer systems we use today
           | represents numbers in base 2. Quantization, at its core, is
           | lossy compression. How do you go from a large model trained
           | to high precision to a smaller model without hindering
           | performance? This can be studied without needing to know the
           | base.
           | 
           | Suppose you are using a decimal computer. You can ask
           | yourself, I have a 128-decimal precision numbers, do I need
           | that much precision? What happens if I just use 1-decimal
           | precision by chopping off the 127 digits after the first
           | decimal? You realize that there are two parts of an
           | operation. The numbers involved (the operands) and the
           | operation itself. You then ask yourself, if I keep one of the
           | operands fixed (the original input), can I represent my
           | 128-decimal precision neural network simply as a series of
           | operations without the other operand? Perhaps only the most
           | basic ones? Like: noops (add 0 or multiply by 1), increments
           | (add 1), decrements (subtract 1), negations (multiply by -1),
           | and clears (multiply by 0)? You count those numbers (-1, 0,
           | and 1). There are 3 so you proudly proclaim you've made a
           | neural network that only uses 0.477 dits. People get excited
           | and confused because that is less than 1 dit which seems like
           | a fundamental point. You further surprise the scientific
           | field by finding a clever trick for getting rid of negations.
           | You beat your previous record and now you only need 0.301
           | dits to represent your network. You are about to accept your
           | Turing reward when the ghost of Claude Shannon appears and
           | says "Why are you using a unit that measures entropy to mean
           | how many symbols you have? If you insist, at least realize
           | 0.301 dits is 1 bit." You are shocked when you realize
           | 10^0.301 = 2^1. Reviewing Shannon's seminal paper, you are
           | awestruck by Shannon's prescient comment "Change from the
           | base a to base b merely requires multiplication by
           | log_b(a).". You humbly give your award to Shannon. You keep
            | the $1M since ghosts aren't as fast as a new NVIDIA DGX. No
           | matter how quantized the ghost is.
           | 
           | [1] - https://people.math.harvard.edu/~ctm/home/text/others/s
           | hanno...
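            | 
            | The b/ln(b) factor above, for anyone who wants to see the
            | numbers:
            | 
            |     import math
            | 
            |     # cost of base b ~ b * log_b(N) = (b/ln b) * ln N
            |     for b in (2, 3, 4, math.e):
            |         print(b, b / math.log(b))
            |     # e minimizes b/ln(b); 3 is the best integer
            |     # base, with 2 a close second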
        
         | api wrote:
         | Doesn't training need higher precision to avoid getting stuck
         | at local minima, at least with back propagation style learning?
         | 
         | Maybe something alternate like evolutionary algorithms could
         | work in a domain like this, but so far those have proven to be
         | less efficient.
        
           | sp332 wrote:
           | A recent paper used a single ternary "trit" ~1.5 bits per
           | parameter for training.
           | https://news.ycombinator.com/item?id=39535800 They said it
           | would be difficult to apply this technique to pre-trained
           | models and had to be trained in 1-trit from scratch.
        
             | gliptic wrote:
             | But it's using larger weights during training, not just the
             | trits.
        
         | mikewarot wrote:
          | I believe the future is 4x4-bit lookup tables with output
          | latches, with a bit to/from each Cartesian neighbor. Clock
          | them like the colors of a chessboard, in 2 phases, and you
          | don't have to worry about timing dependencies.
         | 
         | All of your code gets converted to a directed acyclic
         | graph(DAG), executing at Ghz rates if you want.
         | 
         | Imagine a machine that can output a million parallel GPT-4
         | streams at 1000 tokens per second each.
         | 
         | If the architecture changes it's just a different DAG. Unlike
         | with FPGAs and their custom blocks that have to be optimally
         | used, you can compile a DAG almost instantly.
        
           | smusamashah wrote:
           | Is this something from a research finding or is it your idea?
        
             | mikewarot wrote:
             | It's my idea... I've been writing about it for over a
             | decade. I hope to get a small version of it made via tiny
             | tapeout later this year once I'm out of the precariat.
             | 
              | The idea is that you can decompose any non-cyclic code
              | (for example, a matrix multiply, FFT, etc.) into a
              | directed graph of bitwise operations, then map those
              | operations out onto the grid of LUTs. Pipe inputs in
              | one side of the grid, get the outputs out the other
              | side, and all of the bitwise operations happen in
              | lock-step parallelism (unlike an FPGA, which is
              | asynchronous).
             | 
             | If you can decompose one stage of an LLM down into a
             | bitwise graph of computation, you can easily map it into
             | the grid. If there's a bad cell in the grid, you can map
             | around it.
             | 
             | Because all of the high speed logic lines are short (to the
             | neighbors) you don't have the long lines / high capacitance
             | issues driving signals all the way across an FPGA or ASIC,
             | thus you can really crank up the clock rate (and/or it's
             | very power efficient).
             | 
             | It should be trivial, design wise, to scale from a 16x16
             | grid up through chips that could crank out Petaflops.
             | 
             | Here's a simulator for the thing, written in Pascal.
             | 
             | https://github.com/mikewarot/Bitgrid
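              | 
              | A toy version of the idea, with the wiring conventions
              | made up here rather than taken from the BitGrid code
              | (each cell is a 4-in/4-out LUT fed by its N/E/S/W
              | neighbours, updated in two chessboard phases):
              | 
              |     import numpy as np
              | 
              |     H = W = 8
              |     rng = np.random.default_rng(0)
              |     lut = rng.integers(0, 2, (H, W, 16, 4),
              |                        dtype=np.uint8)
              |     out = np.zeros((H, W, 4), dtype=np.uint8)
              | 
              |     def tick(phase):
              |         new = out.copy()
              |         for y in range(H):
              |             for x in range(W):
              |                 if (x + y) % 2 != phase:
              |                     continue
              |                 # read one bit from each neighbour
              |                 n = out[(y - 1) % H, x, 2]
              |                 e = out[y, (x + 1) % W, 3]
              |                 s = out[(y + 1) % H, x, 0]
              |                 w = out[y, (x - 1) % W, 1]
              |                 i = n | (e << 1) | (s << 2) | (w << 3)
              |                 new[y, x] = lut[y, x, i]
              |         out[:] = new
              | 
              |     tick(0); tick(1)  # one two-phase clock cycle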
        
               | sitkack wrote:
               | You can simulate this on a GPU now. This is like running
               | a CA on a GPU with slightly different rules.
               | 
               | I think you would also be delighted to know that many of
               | the non-GPU ai accelerators are in a mesh topology.
               | 
               | Cerebras, Tenstorrent, Esperanto
               | 
               | https://www.esperanto.ai/wp-
               | content/uploads/2021/08/HC2021.E...
        
               | mikewarot wrote:
                | I've been intrigued by this approach. A more highly
                | optimized (but harder to program) take is the GA144
                | [1] from Chuck Moore, the inventor of Forth. It's a
                | Cartesian grid of 144 F18 Forth-based processors.
                | These processors are far more limited, but then again
                | they take far less power as well.
               | 
               | [1] https://www.greenarraychips.com/
        
               | sitkack wrote:
               | That is a pretty cool chip. Chuck Moore is a fascinating
               | character.
               | 
               | You would probably get a kick out of David Ackley's T2
               | Tile Project.
               | 
               | https://t2tile.com/
               | 
               | https://www.youtube.com/@T2TileProject/videos
               | 
               | https://www.youtube.com/@DaveAckley
               | 
               | https://en.wikipedia.org/wiki/Ackley_function
               | 
               | https://www.cs.unm.edu/~ackley/
               | 
               | Fun fact, Tenstorrent wanted to add instructions to
               | enqueue data between processors connected in a mesh and
               | Arm said no (they don't do architectural licenses
               | anymore), so Tenstorrent used RISCV.
               | 
               | Implementing something like a GA144 but using small
               | RISC-V processors would be an interesting hack.
               | https://www.youtube.com/watch?v=6QRKpd28NEE
        
           | Dylan16807 wrote:
           | 1. If you write FPGA code as a grid of lookup tables then I
           | would expect it to be easy to compile instantly.
           | 
           | 2. In what way is this "acyclic"?
           | 
           | 3. Won't putting your code _into_ this form be the hard part?
            | Even if you start with a DAG, 99.99% of them won't fit this
           | form without intense restructuring. So you just moved the
           | hard step over by one.
        
         | bgnn wrote:
         | This! Maybe just integer, but not floating point. That's a
         | ridiculous way to do computation when you don't really need the
         | precision.
        
         | imtringued wrote:
          | The ternary (1.58-bit) models only compress the weights.
          | You still add and subtract using bfloat16 for better
          | accuracy. Dedicated
         | silicon is mostly a waste, because you are only processing two
         | operations per parameter during inference. Loading the
         | parameters from slow DDR, GDDR or HBM memory is the bottleneck
         | in practice and the only solution is PIM. I was honestly
         | disappointed by Nvidia's Blackwell since it is just barely
         | competitive with GDDR PIM.
        
           | mobicham wrote:
           | At least you can copy 16 times more data to the shared memory
           | with binary weights.
        
       | littlestymaar wrote:
       | Stupid question: how can you do deep learning on 1 bit? DL works
       | by calculating a gradient and following it to minimize a loss
       | function, but if all your parameters are 1 bit, there's no such
       | thing as a gradient as you don't have differentiation at all. So
       | how does it work?
        
         | edflsafoiewq wrote:
         | Not sure what you mean, why wouldn't there be a gradient? There
         | is a problem that small changes to weights may have no effect
         | (eg ((0 + 0.3) + 0.3) + 0.4 = 0 if you have to round the result
         | to 0 or 1 after every step), but to fix this I believe everyone
         | maintains high precision weights for the backwards pass, from
         | which the quantized weights are derived.
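          | 
          | For reference, the usual trick is a straight-through
          | estimator: binarize the fp latent weights in the forward
          | pass but let the gradient flow to them as if the rounding
          | were the identity. Minimal sketch:
          | 
          |     import torch
          | 
          |     class BinarizeSTE(torch.autograd.Function):
          |         @staticmethod
          |         def forward(ctx, w):
          |             return torch.sign(w)      # {-1, +1} forward
          |         @staticmethod
          |         def backward(ctx, g):
          |             return g                  # identity backward
          | 
          |     w = torch.randn(8, requires_grad=True)
          |     BinarizeSTE.apply(w).sum().backward()
          |     print(w.grad)   # ones: gradient reaches fp weights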
        
           | scotty79 wrote:
            | You could treat fractional addition as a probabilistic
            | operation.
        
         | Centigonal wrote:
         | Not a stupid question at all.
         | 
         | Early quantization approaches just quantized a high-bit-
         | precision pre-trained model - no need to calculate gradients on
         | the quantized weights. BitNet[1] changed the game by training a
         | low-precision model from scratch. It achieves this by keeping
         | high precision in the gradients, optimizer state, and in
         | "latent weights" which are then quantized on the fly. I don't
         | really understand the finer details of how this works, so check
         | out the paper if you're interested.
         | 
         | This article's approach is interesting. They start by
         | quantizing a pre-trained high-precision model, and then they
         | fine-tune the quantized model using LoRA (which dramatically
         | improves the performance of the quantized model). They don't
         | talk about the bit depth of the values in the LoRA matrices, so
         | it may be that those are higher bit-depth.
         | 
         | [1] https://arxiv.org/pdf/2310.11453.pdf
        
       | kromem wrote:
       | The most exciting part about ternary or binary weights is the
       | inevitable hardware revolution for AI dedicated chips that's
       | going to result from it.
        
         | imtringued wrote:
         | Your hardware already supports addition and subtraction and the
         | tensor cores of NVIDIA GPUs are already fast enough to keep up.
         | The only benefit is reducing memory capacity and bandwidth
         | requirements.
        
         | Matumio wrote:
         | You mean we'll have hardware accelerated ternary instructions?
         | https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
         | 
         | (Okay probably those are not ready to be used as NN weights if
         | the activations are not binary too, but... the gap to what CPUs
         | already can do is getting smaller.)
        
       | jstmm wrote:
       | In both the '1-bit Model' and '2-bit Model' tables, the forward
       | time (sec) for Llama2-7B with FP16 (full precision) is 0.1 s,
       | whereas it's ~0.231, ~0.257, ~0.353 s respectively for HQQ
       | (1-bit) / HQQ+ (1-bit) / Quip# (2-bit) meaning the FP16 model has
       | ~3x lower inference time.
       | 
        | On the contrary, in the BitNet b1.58 paper [0] the authors report
       | their 7b model has 2.9x reduced inference latency.
       | 
       | It's not clear to me what's happening here. Can someone explain
       | why the 1/2bit HQQ/HQQ+ models are so much slower than the BitNet
        | b1.58 models?
       | 
       | [0] https://arxiv.org/pdf/2402.17764.pdf
        
         | thatguysaguy wrote:
         | Sounds like these guys didn't use custom kernels, but BitNet
         | did.
        
           | mobicham wrote:
            | That's correct. Only the dequantization is done in CUDA;
            | the matmul is done with PyTorch. If they made their
            | kernels open-source we could re-use them!
        
         | londons_explore wrote:
          | GPUs aren't really designed for 1-bit math... they don't
          | run it much faster than floating-point math.
         | 
         | Whereas a custom ASIC or updated design of GPU could give
         | _massive_ speedups with 1 bit math.
        
           | bee_rider wrote:
           | For 1 bit math, at least it should be possible to populate
           | every other bit of an integer type, right? Surely one could
           | do _better_ with a dedicated type for this, but at least we
           | could pack 16 single-bit weights into a 32 bit int for
           | addition, right?
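            | 
            | Something like this, as a toy sketch (each weight gets a
            | 2-bit "lane", so a few packed words can be added before
            | lanes overflow into their neighbours):
            | 
            |     def pack(bits):   # 16 values in {0, 1}
            |         return sum(b << (2 * i)
            |                    for i, b in enumerate(bits))
            | 
            |     a = pack([1, 0, 1, 1] * 4)
            |     b = pack([1, 1, 0, 1] * 4)
            |     c = a + b         # lane-wise sums, 0..2 each
            |     print([(c >> (2 * i)) & 0b11 for i in range(16)])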
        
           | UncleOxidant wrote:
           | Yes, exactly. Neither GPUs nor CPUs are setup for 1 bit math.
           | Pulling 1 or 2 bits out of a word isn't all that
           | straightforward on CPU or GPU - lots of shifting and masking.
           | I wonder how long it's going to be before we see custom
           | hardware for bitnets? I suspect we'll see it on FPGAs first.
        
           | imtringued wrote:
           | You're telling me GPUs aren't designed for additions and
           | subtractions? Where did you hear that?
        
             | bick_nyers wrote:
              | I think they are more so saying that GPUs are not
              | optimized for those operations. CPUs aren't "designed"
              | for matrix multiplies, yet we can still run them,
              | albeit at a slower rate than on a GPU.
        
         | brucethemoose2 wrote:
         | Real world GPU performance is hugely influenced by hand
         | optimization of the CUDA kernels.
        
       | vladf wrote:
       | Really strong binary results. So strong it was fishy. I hope
       | someone can explain my confusion below.
       | 
       | > We compared the performance of the Llama2-7B model in three
       | configurations: FP16 (full precision), HQQ (without fine-tuning),
       | and HQQ+ (with adapter layers) using a group-size of 8.
       | 
       | Interesting, what is "group-size of 8"?
       | 
       | From their HQQ post (https://mobiusml.github.io/hqq_blog/), it's
       | the block size at which they add scales (presumably 16-bit) and
       | shifts (in that post, it's 8-bit).
       | 
       | So for every 8 binary weights we have a 16-bit scale and 8-bit
       | shift?
       | 
       | > Fine-tuning with Low-Rank Adapters
       | 
       | They say they inline the shift into the LoRA but how can you do
       | this, block-wise, without increasing your LoRA rank by num-blocks
       | (they claim to only use 1 additional rank)?
       | 
       | Then, the reported 7B sizes, in GB:
       | 
       | > 13.5 (fp16) 1.76 (HQQ 1-bit) 1.85 (HQQ+ 1-bit) 2.72 (quip#
       | 2-bit)
       | 
       | those numbers would make sense if it was _actually_ 1 bit. But if
       | you include the overhead of 16-bit scales (and why is the shift
       | inlineable into lora? still unexplained) it'd be more like 3-bit.
       | 
       | From their HF page:
       | 
       | > This version offloads the meta-data to the CPU, so only the
       | binary weights and the low-rank adapters are stored in the GPU
       | memory.
       | 
       | Interesting, so we have to go back to CPU to rescale? Is this how
       | they counted GB? This should have been clearly caveated in the
       | table. I also am amazed they got latency lower than quip if they
       | pingpong to CPU.
        
         | danielhanchen wrote:
          | When one does quantization, it's done in blocks.
          | Bitsandbytes uses a blocksize of 64 I think. W * scale +
          | zero_point is needed for each block. So you need 2 fp16
          | numbers for each group of 64 numbers. For BnB you get
          | ~4.5 bits, since (64*4bit + 16bit + 16bit) / 64 = 288/64 =
          | 4.5. So 4-bit is actually 4.5-bit.
         | 
          | For HQQ 1-bit, a group size of 8 needs 2 extra numbers (you
          | mentioned fp16 scale and 8-bit shift). So you need 8*1bit +
          | 16bit + 8bit for each group, i.e. 32 bits per group of 8,
          | or 4 bits per param.
          | 
          | I'm assuming the scale and zero_point can both be moved to
          | 8-bit, so (8*1bit + 8bit + 8bit) / 8 = 24/8 = 3 bits per
          | param?
         | 
         | "This version offloads the meta-data to the CPU, so only the
         | binary weights and the low-rank adapters are stored in the GPU
          | memory.", so the 8+8-bit scale / zero_point moves to the
          | CPU. So GPU memory holds the 1-bit weights, while the CPU
          | holds the rest as meta-data - very smart!
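          | 
          | The arithmetic above in code form, for playing with group
          | sizes:
          | 
          |     # bits/param = (g*w_bits + scale_bits + zero_bits) / g
          |     def eff_bits(g, w_bits, scale_bits, zero_bits):
          |         return (g * w_bits + scale_bits + zero_bits) / g
          | 
          |     print(eff_bits(64, 4, 16, 16))  # BnB 4-bit   -> 4.5
          |     print(eff_bits(8, 1, 16, 8))    # fp16 + 8bit -> 4.0
          |     print(eff_bits(8, 1, 8, 8))     # 8bit + 8bit -> 3.0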
        
           | vladf wrote:
           | Err, you are just restating what I'm saying, without
           | addressing the concerns.
           | 
            | 1 - is it fair to use RAM in two places and report only
            | one of them without any asterisk? (If you think this is
            | fair - oh boy, wait till you hear about my 0GB-HBM
            | inference algorithm)
           | 
            | 2 - I know how subchannel quantization works. Are they
            | hitting those reported latency numbers with per-layer CPU
            | ping-pong to rescale?
        
             | danielhanchen wrote:
             | Oh sorry - did not expect to restate what you said whoops -
             | all train of thought!
             | 
             | You can see from
             | https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-
             | hf_1bi... that the model disk space is 3GB + 100MB of LoRA
             | weights. I also uploaded a 4bit one to https://huggingface.
             | co/unsloth/llama-2-7b-bnb-4bit/tree/main which uses 3.87GB.
             | 
             | So because of the offloading trick, the GPU VRAM is less,
             | but in actuality, still 3GB is needed.
             | 
             | Unsure on latency sadly
        
           | Dylan16807 wrote:
           | > "This version offloads the meta-data to the CPU, so only
           | the binary weights and the low-rank adapters are stored in
           | the GPU memory.", so the 8+8 scale / zero_point moves to the
           | CPU. So GPU memory 1bit, but CPU meta data is the rest - very
           | smart!
           | 
           | Doesn't it need all the weight metadata for a layer to use
           | that layer?
           | 
           | * If yes, then can't _any_ algorithm offload x% of its data
           | as a balancing act between speed and RAM?
           | 
           | * If no, then what's it for and when does it get used?
        
             | danielhanchen wrote:
              | Oh yes, you need all the metadata, but because it's
              | just 2 numbers, the scale and zero_point, I think the
              | movement of single scalars to the GPU registers is
              | super fast - cannot confirm though.
             | 
              | It's like in cuBLAS where you do alpha*A*B + beta*C,
              | and alpha and beta are both scalars which can be on
              | the CPU and moved to the GPU in nanoseconds.
             | 
             | I'm unsure though since I haven't tested it
        
               | Dylan16807 wrote:
               | It still has to go through the entire memory system. It's
               | hard for me to imagine that transferring a number from
               | the CPU to the GPU is faster than transferring a byte,
               | and if you have 2 CPU-resident numbers per GPU-resident
               | byte that's a lot of transferring.
        
               | danielhanchen wrote:
                | I don't disagree - fair point, there definitely is a
                | transfer latency overhead. I would suspect one could
                | prefetch it by calling `.to("cuda", non_blocking =
                | True)` say 2 layers ahead, so you can in theory hide
                | the movement.
               | 
               | I think somewhere the blog did mention HQQ for 1 bit is
               | slower for now, maybe due to the transfer overhead,
               | although I couldn't exactly remember where
        
               | Dylan16807 wrote:
               | My point is more that if it's that many bytes flowing
               | around on demand, you're basically swapping layers in and
               | out of the GPU as you use them (or x% of each layer).
               | 
               | Which is fine, and it's a valid feature, but you don't
               | need to split those bytes into "data" and "metadata" to
               | make that happen.
               | 
               | Is there actually something they gain from this
               | particular method of splitting?
        
               | danielhanchen wrote:
               | I guess it's mainly to reduce VRAM usage. Assuming we
               | don't do this, then a 7b model with 1bit group size 8
               | will use 3GB or something of VRAM, whilst a 4bit group
               | size of 64 will use 4GB approx.
               | 
               | Assume we have a 100b model - with 4bit, VRAM is around
               | 57GB or so. With 1bit, 43GB VRAM, but by moving the
               | scalars and zero point to RAM, VRAM use is like 15GB or
               | something, at the cost of like 28GB RAM usage.
               | 
               | I guess maybe a valid approach is to dynamically select
               | which ones to move to RAM or VRAM, given your VRAM
               | budget. Say you have a 40GB GPU, clearly move more of the
               | scalars to GPU. But if you have a weak GPU, then you need
               | to use more RAM.
        
               | Dylan16807 wrote:
               | I still don't think I'm getting my point across.
               | 
               | What if you store 40% of the data and 40% of the metadata
               | in CPU memory. Is that the same for performance?
               | 
               | Is there a reason we want _particular parts_ of the layer
               | to stay on the GPU? Or is it just number of bytes.
        
         | mobicham wrote:
         | Hello, I am the main author, would love to clarify a couple of
         | things:
         | 
         | All the linear-quantization methods have meta-data, including
         | the 1.58bit paper. You can control the quality vs. memory usage
          | by reducing the group-size. However, the meta-data is not
          | the same thing as the quantized weights, for many reasons:
         | 
          | > The meta-data size doesn't change the fact that you can
          | do a binary/ternary matmul, which is the most important
          | thing in this story.
         | 
          | > The meta-data size doesn't increase the actual compute:
          | these are point-wise operations, and even if you have 1
          | scalar you still need to multiply the same number of
          | weights.
         | 
         | > Meta-data is offloaded to the CPU with pinned-memory, which
         | allows non-blocking transfers. Technically, you can trigger the
         | copy in the layer before and synchronize and will make it
         | almost seamless. I did some experiments with cuda streams that
         | worked very well on an older machine, but then I tried a better
         | machine and the transfer was much faster. Obviously if you are
         | trying it on Google colab it's very slow for this reason.
         | 
         | > Smaller models like Llama2-7B are very hard to directly
         | quantize at very low bits, so they need a lower group-size to
         | function well. Larger models (like what we showed for Mixtral),
         | can be quantized to 2-bit on the fly, without any data, and
         | still work very well. So basically larger models are less
         | sensitive to extreme quantization and you could use a much
         | larger group-size. I still think that the meta-data size is
         | really not a big deal for the reasons I have explained above.
         | 
         | > There are many other ways to increase the group-size or even
         | get rid of it all together, many ideas available but needs lots
         | of experimentation.
         | 
          | > Binary/ternary CUDA matmul kernels don't exist yet. The
          | current code implements the dequantization step in CUDA
          | but then uses torch.matmul in fp16. I tried doing matmul at
         | low-bits with CUDA but it is very difficult to even beat cuBLAS
         | with fp16, especially for a novice CUDA coder like me :)
         | 
         | Please note: this is early experimental work. Since it showed
         | promising results, we wanted to share it with the community
         | first as we progress. There's still a lot of things to be done
         | and we are actively working on it, despite the very limited
         | resources we have.
         | 
         | Happy to answer any questions here!
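          | 
          | For readers curious what the pinned-memory / non-blocking
          | transfer pattern looks like in PyTorch, a rough sketch (not
          | the actual HQQ code; layer count and shapes are made up,
          | and it needs a CUDA device):
          | 
          |     import torch
          | 
          |     L = 32  # layers
          |     meta = [torch.randn(4096 // 8, 2).pin_memory()
          |             for _ in range(L)]   # per-group scale/zero
          |     copy_stream = torch.cuda.Stream()
          | 
          |     def prefetch(i):
          |         with torch.cuda.stream(copy_stream):
          |             return meta[i].to("cuda", non_blocking=True)
          | 
          |     nxt = prefetch(0)
          |     for i in range(L):
          |         torch.cuda.current_stream().wait_stream(copy_stream)
          |         m = nxt                    # meta-data for layer i
          |         if i + 1 < L:
          |             nxt = prefetch(i + 1)  # overlap the next copy
          |         # ... dequantize layer i's bits with m, then
          |         # torch.matmul in fp16 ...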
        
           | mikeravkine wrote:
           | Thank you for your efforts on behalf of the GPU poor!
           | 
           | It's getting tougher to use older, cheaper GPUs
           | (Pascal/Maxwell) with modern quantization schemes so anything
           | you can do to keep kernels compatible with SM52 and SM61
           | would be greatly appreciated.
        
             | mobicham wrote:
             | Thank you, very glad to hear that!
        
           | vladf wrote:
           | Thanks for the reply. I'm quite familiar with subchannel
           | quant, but still feel like my questions did not get
           | addressed.
           | 
           | 1 Could you post the full memory use of the methods? E.g. you
           | include quip metadata in its GB but not hqq metadata in its
           | GB.
           | 
            | 2 If you have to go to the CPU to shift and scale, how
            | did you get latency lower than pure on-device? Was this
            | bsz=1? No speculative decoding?
           | 
           | 3 how can lora absorb shifts with only increasing rank by 1
           | if you have a shift per group?
        
       | danielhanchen wrote:
        | Highly respect the HQQ team's work - same accuracy as GPTQ /
        | AWQ and with no activation-aware calibration step - i.e. no
        | more 3-hour calibration runs! A big fan of the 4-bit ones
        | especially, and the 3-, 2-, and now 1-bit ones!
       | 
       | Also super cool idea of 1bit needing some calibration like AWQ -
       | no calibration data shows very bad results, but with some LoRA
       | adapters and finetuning, a great recovery of performance is
       | possible.
       | 
       | Planning to add support for this inside Unsloth to make all low
       | bit finetuning 2x faster and save tonnes of VRAM!
        
       | shrubble wrote:
       | I wonder if there is a correspondence to Sony's 1-bit DSD audio
       | stream? Could hardware developed for this purpose be used for
       | some part of the processing?
        
       | codedokode wrote:
        | I was thinking about this idea (using only additions for
        | inference) before too. The article talks about using 1-bit
        | values for weights, but there is also another option: use
        | integers for weights, but allow the output of a neuron to
        | take only 0 and 1 (for example, 1 if the output is > 0 and 0
        | otherwise). In this case we do not need multiplications
        | either, but we cannot save memory on storing the weights.
        | Maybe it has some other advantages, who knows. Never had
        | time to research this further.
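        | 
        | A small sketch of what that would look like (shapes and
        | weight ranges are arbitrary):
        | 
        |     import torch
        | 
        |     w1 = torch.randint(-4, 5, (32, 16)).float()  # int weights
        |     w2 = torch.randint(-4, 5, (8, 32)).float()
        | 
        |     x = torch.randn(16)
        |     h = (w1 @ x > 0).float()  # binary step: outputs in {0,1}
        |     y = w2 @ h                # sums w2 columns where h == 1
        |     assert torch.allclose(y, w2[:, h.bool()].sum(dim=1))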
        
         | Greenpants wrote:
          | You may be interested in the "binary step" activation
          | function. This does what you're suggesting. In general,
          | though, complex behaviour really takes a hit when using
          | this as the activation function of a neuron (and I'm not
          | sure which papers show metrics on it being used for
          | transformer models).
        
       ___________________________________________________________________
       (page generated 2024-03-30 23:01 UTC)