[HN Gopher] Towards 1-bit Machine Learning Models
___________________________________________________________________
Towards 1-bit Machine Learning Models
Author : homarp
Score : 315 points
Date : 2024-03-28 10:58 UTC (2 days ago)
(HTM) web link (mobiusml.github.io)
(TXT) w3m dump (mobiusml.github.io)
| corysama wrote:
| I look forward to the inevitable probabilistic sub-bit machine
| learning models :)
| Cieric wrote:
| I assume this is meant to be a joke, but isn't this actually
| being worked on? (minus the probabilistic portion)
| https://arxiv.org/abs/2310.16795
| Cheer2171 wrote:
| My 0-bit model can predict if you have cancer with 99.5%
| accuracy. It doesn't even need input data! Don't ask about the
| false negative rate though.
| asah wrote:
| My 0-bit, no-input model can predict if you have cancer with
| 99.5% accuracy and 0.5% false negative rate. Don't ask about
| whether the cancer cell(s) are benign.
| ganzuul wrote:
| My computation-free model can give you cancer with 100%
| certainty.
| paul_mk1 wrote:
| Sub 1-bit has been done at least as far back as 2016 for VGG
| style networks (my work).
|
| I was able to get 0.68 "effective" bits.
|
| The idea is that in each forward pass you add noise to _each_
| weight, drawn independently from a normal distribution, and when
| you calculate the SNR, it's sub 1 bit. Points to the idea that a
| stochastic memory element can be used.
|
| https://arxiv.org/abs/1606.01981
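|
| (For illustration only, a minimal NumPy sketch of the idea with
| made-up shapes and noise levels; "effective bits" here is read
| off the SNR via the Gaussian channel capacity 0.5*log2(1+SNR),
| which is one common convention:)
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   W = rng.normal(0.0, 0.05, size=(256, 256))  # trained weights
|   sigma = 0.04                     # noise std per forward pass
|
|   def noisy_forward(x, W, sigma):
|       # fresh noise, drawn independently for every weight
|       W_noisy = W + rng.normal(0.0, sigma, size=W.shape)
|       return x @ W_noisy
|
|   snr = W.var() / sigma**2
|   bits = 0.5 * np.log2(1.0 + snr)  # ~0.68 with these numbers
|   print(f"SNR={snr:.2f}, effective bits/weight={bits:.2f}")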
| overengineer wrote:
| If noise is desirable in this way, why not just switch
| back to analog computing, which comes with noise for free?
| iamgopal wrote:
| So new AI chips will be coming, optimised for a large-scale,
| bit-only (non-byte) AI world?
| thesz wrote:
| Every CPU and GPU now has a popcount [1] instruction. You can
| thank US three letter agencies for that [2].
| [1]
| https://en.wikipedia.org/wiki/Hamming_weight#Processor_support
| [2] https://vaibhavsagar.com/blog/2019/09/08/popcount/
|
| Multiplication then is just bitwise AND and a popcount.
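|
| (Rough Python sketch of a {0,1} dot product on packed 64-bit
| words, using int.bit_count() as the popcount:)
|
|   def dot_binary(a_words, b_words):
|       # AND keeps positions where both bits are 1,
|       # popcount sums them up
|       total = 0
|       for a, b in zip(a_words, b_words):
|           total += (a & b).bit_count()
|       return total
|
|   # two 8-bit vectors packed into single ints
|   print(dot_binary([0b10110110], [0b11010100]))  # -> 3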
| ttul wrote:
| Makes you wonder what the agencies know that we don't...
| rapatel0 wrote:
| It's just cryptography and encoding. Popcount is useful for
| understanding entropy.
| JKCalhoun wrote:
| As an AI layman, I am for some reason excited by the low-bit
| learning models.
|
| Apple, now leaving behind the albatross that was the Apple car,
| could be the one to roll their own silicon and put an LLM in our
| back pockets.
|
| Goodbye Siri, hello Seriously.
| talldayo wrote:
| Or we could end up getting another Google integration, with iOS
| being too rigid and locked-down to feasibly support third-party
| private AI alternatives. Fate has a cruel sense of humor,
| sometimes.
| odyssey7 wrote:
| Are you referring to the Swift version of TensorFlow?
|
| https://www.infoworld.com/article/3608151/swift-for-
| tensorfl...
| Me1000 wrote:
| I believe they're referring to these rumors:
| https://www.macrumors.com/2024/03/21/apple-iphone-ai-
| talks-g...
| JKCalhoun wrote:
| Could be a way for Apple to buy time as well until they
| do have something competitive.
| lawlessone wrote:
| > could be the one to roll their own silicon and put an LLM in
| our back pockets.
|
| Wouldn't they have to do a lot of catching up with everyone
| else already working on it?
| declaredapple wrote:
| They've been designing their own chips a while now, including
| with an NPU.
|
| Also because of their unified memory design, they actually
| have insane bandwidth which is incredibly useful for LLMs.
| IMO they may have a head-start in that respect for on-device
| inference of large models (e.g. 1B+ params).
| talldayo wrote:
| I don't think people are running 1B+ models on the Neural
| Engine these days. The high-performance models I've seen
| all rely on Metal Performance Shaders, which scales with
| how powerful your GPU is. It's not terribly slow on iPhone,
| but I think some people get the wrong idea and correlate an
| ambient processor like the Neural Engine with LLMs.
|
| The bigger bottleneck seems like memory, to me. iPhones
| have traditionally skimped on RAM more so than even cheap
| and midrange Android counterparts. I can imagine running an
| LLM in the background on my S10 - it's a bit harder to
| envision iOS swapping everything smoothly on a similarly-
| aged iPhone.
| JKCalhoun wrote:
| Sure, but we're discussing 1.58-bit models that (again, I'm
| a layman) I assume have over an order of magnitude less
| memory overhead.
| quentinp wrote:
| Actually they plan to put an LLM in your back pocket using
| flash memory, not silicon: https://arxiv.org/abs/2312.11514
| declaredapple wrote:
| The flash doesn't do the computations though, that's just a
| method of getting it to the processor
| sroussey wrote:
| It would be better to have eeprom or some such directly
| attached as memory. No loading.
| fabmilo wrote:
| I like how it goes from 13.5 GB to 1.76 GB and gets comparable
| results. There is definitely some underlying reason why this
| works so well that we are missing. Are the bits selecting the
| different subspaces? Can we improve this process by using better
| normalization layers / gating units?
| nobodyandproud wrote:
| I wonder what information theory can tell us here.
| ttul wrote:
| Neural networks work, but why they work is not well known.
| Researchers continue to find new "free lunches" all the time.
| rapatel0 wrote:
| The entropy of English is ~1 bit per letter (Measured in a
| funny way by Shannon - https://cs.stanford.edu/people/erobert
| s/courses/soco/project...)
|
| In general, it's a bit funny how we build ML models. You
| take a bucket of matrices, fill them with random numbers, and
| then start to introduce "biases" through backpropagation. A
| model can converge to a low loss, but most of the weights are
| still filled with random noise.
|
| Binary weights are somewhat obvious in retrospect. Weights
| indicate the strength of an association between two neurons.
| Intuitively, most of the value is probably just that an
| association between two neurons exists, and whether it's
| positively or negatively associated.
| Animats wrote:
| I'm surprised that 1-bit works. Trinary (-1, 0, 1) makes
| sense, but if you only have 0 and 1, the asymmetry ought to
| be a problem. Or is multiplying an XOR, so 1x1 = 0?
| edflsafoiewq wrote:
| There's a scale and bias afterwards, so it's not
| necessarily asymmetric.
| scarmig wrote:
| It's more like, addition is an XOR. (In fact, AND as mult
| and XOR as add give you GF(2).) Throw in NOT (or, really, just
| a constant 1) and you can compute any circuit.
|
| Biologically, inhibitory neurons are every bit as
| important as excitatory ones, so if you squint just
| right, XOR looks like a neuron's activation being
| inhibited by another presynaptic neuron.
| rapatel0 wrote:
| Yeah basically. In binary, multiplication is an XNOR
| operation.
|
| 0 XNOR 0 = 1
| 0 XNOR 1 = 0
| 1 XNOR 0 = 0
| 1 XNOR 1 = 1
| zamadatix wrote:
| 0*0 = 1? I've always heard it as being the output of AND
| overengineer wrote:
| For balanced binary [-1, 1] yes. -1*-1=1
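|
| (Tiny sketch: map bit b to the sign 2b-1;
| the +-1 product then matches XNOR of the
| bits:)
|
|   for a in (0, 1):
|       for b in (0, 1):
|           sign_prod = (2*a - 1) * (2*b - 1)
|           xnor = 1 - (a ^ b)
|           assert 2*xnor - 1 == sign_prod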
| zamadatix wrote:
| That makes sense, thanks!
| scarmig wrote:
| XNOR does not distribute over AND or XOR (try
| 0 XNOR (0 AND 1)), so it doesn't form a ring with
| the usual Boolean operations - it's not really
| multiplication in a way that's useful here.
| gpm wrote:
| It seems like they have learned floating point parameters
| x and y so that dequantized(bit 0) = x and
| dequantized(bit 1) = y. Thus there is no built in
| asymmetry. Or more precisely they learned a zero point
| and a scale but it's equivalent to this simpler model in
| the 1 bit case.
|
| It still seems like there would be a problem because
| either [x, y] looks like [0ish, 1ish] and you can't have
| negative weights, or [x, y] looks like [-1ish, 1ish] and
| you can't have "don't care" weights. But if you have some
| redundancy in your neurons I guess this is acceptable
| because you can just cancel out the positive contribution
| from a neuron you don't care about with a negative
| contribution from a very similar neuron that you also
| don't care about.
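|
| (A minimal sketch of that dequantization with made-up numbers;
| the actual HQQ parameterization may differ in details:)
|
|   import numpy as np
|
|   bits  = np.array([1, 0, 0, 1, 1, 1, 0, 1], dtype=np.float32)
|   scale = 0.7     # learned, one per group of 8
|   zero  = 0.4     # learned, one per group of 8
|
|   W_deq = (bits - zero) * scale
|   print(W_deq)    # bit 0 -> -0.28, bit 1 -> 0.42
|
| So the two levels need not be symmetric around zero, and they
| are not forced to include an exact zero either.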
| mlsu wrote:
| Wow, fascinating read. 1999
|
| _Would a monkey who knew the n-gram frequencies of the
| letters in English where n is large be able to produce
| credible English text? Furthermore, does this monkey "know"
| English? If the N-gram monkey is behind one door and a
| human is behind the other, could a third-party observer
| tell which was the monkey? This question rings of the
| Turing test for artificial intelligence, to which there is
| no easy answer._
|
| No easy answer indeed!
| nobodyandproud wrote:
| Down to one bit but that's taking the 2.62 bits and then
| applying the redundancy factor.
|
| What's cool is that the differentiable activation function
| is important (to avoid the linear, perceptron limitation),
| but the weight scaling can be so simple, at least for LLMs.
|
| It makes me wonder whether the extra layers are effectively
| compensating; in other words, can the number of layers or
| hidden neurons be trimmed down if we then add more bits to
| each weight and still see equivalent effectiveness?
| imtringued wrote:
| You can just delete whole layers, because the point of
| residual layers is to make the model learn the
| hyperparameter "layer count" automatically. Compare this
| to a network without residual connections, where the model
| must use every layer: then you have to get the layer count
| exactly right, which isn't possible, since each data point
| might benefit from a different layer count. The extra
| layers therefore exist primarily so the model becomes
| robust to poorly chosen hyperparameters. You still need
| a minimum number of layers, but that isn't the problem.
|
| https://arxiv.org/abs/2403.17887
| convolvatron wrote:
| idk that we need to think that hard. we have shown you can
| build arithmetic functions from networks of logic gates given
| sufficient connectivity. so in some sense this just drops the
| learning network below the level of words and asks it to make
| its own math.
|
| since this generalizability thing seems to be real, this kind
| of directly folds in quantization-aware training, doesn't it?
| WithinReason wrote:
| 1-bit weights have been a thing since at least 2016:
|
| https://arxiv.org/abs/1606.06160
| buildbot wrote:
| XNOR-Net was the "first" in that generation, as I recall. It's
| not a new idea at all though.
|
| Check out the dates on these papers -
|
| https://ieeexplore.ieee.org/abstract/document/286901 < 1994
|
| https://ieeexplore.ieee.org/abstract/document/344783 < 1993
|
| https://link.springer.com/article/10.1007/BF00337115 < 1986
| bjornsing wrote:
| First version of BNN was submitted to ArXiv a month before
| XNOR-Net: https://arxiv.org/abs/1602.02830
| buildbot wrote:
| Oh nice, good catch!
| thechao wrote:
| We're just speed-running the NN pendulum, at this point.
| rapatel0 wrote:
| Don't forget Michael
|
| https://arxiv.org/abs/1511.00363 < 2015
| elcomet wrote:
| They have been a thing for decades
|
| https://ieeexplore.ieee.org/abstract/document/363432
| WithinReason wrote:
| I don't think that paper's methods could be applied to LLMs
| redskyluan wrote:
| what will be the next step of 1 bit? could it be 0.5 bit? Any
| kind of quantization is not gonna last long.
| CamperBob2 wrote:
| Not using traditional numbers at all, I imagine. It's possible
| that trying to build neural nets out of vectors that are
| themselves built with quantized scalars is a suboptimal
| approach.
| pixelpoet wrote:
| The way things are going we'll have -1 bit before long :)
| ganzuul wrote:
| https://en.wikipedia.org/wiki/Adjacency_matrix
|
| What would a graph look like when edges are represented with less
| than 1-bit?
| robrenaud wrote:
| It would have predictable structure. There would be groupwise,
| regular sparsity. You could exploit these patterns to infer
| properties like, "there is no connection from this vertex to
| this entire span of vertices".
|
| Just think about representing a compressed version of a matrix.
| The matrices you want a short encoding for are the ones that
| are more ordered in some real, exploitable way. With some
| deterministic computation on the compressed representation, you
| could produce the full adjacency matrix.
| ganzuul wrote:
| https://proceedings.mlr.press/v206/meller23a/meller23a.pdf
|
| It seems to have a lot of similarities to the technique in
| OPs article.
| selecsosi wrote:
| I think the problem is that you are also trying to represent
| the biases of directionality in the connection e.g.: [-1,0,1].
| In the adjacency matrix, if you store the sign (direction) of
| the edge, then I think that would map fairly isomorphically.
| ganzuul wrote:
| Basically factorized into a giant state machine. Would be
| logical.
| grungegun wrote:
| Does anyone know if this works on vanilla deep networks? These
| quantization articles always seem to target LLMs, which leads me
| to wonder if there's something special about the LLM architecture
| vs a vanilla deep architecture.
| Y_Y wrote:
| Certainly does, people have been doing this in computer vision
| for years.
| alephxyz wrote:
| LLMs have been trending towards obscenely large numbers of
| parameters (314B for Grok), which makes quantization crucial if
| you want to run them without a Meta-sized budget.
| zaptrem wrote:
| Transformer LLMs are just a bunch of MLPs (linear layers) where
| you sometimes multiply/softmax the output in a funny way
| (attention). In other words, they're arguably more "vanilla
| deep net" than most architectures (e.g., conv nets).
|
| (There are also positional/token embeddings and normalization
| but those are a tiny minority of the parameters)
| amelius wrote:
| Ok, but what does a perceptron look like in 1-bit? Would it
| be just some simple logic gate, like an OR-gate?
| zaptrem wrote:
| Not my area of expertise but I'd assume it becomes a
| decision tree or something.
|
| Edit: lol https://news.ycombinator.com/item?id=39868508
| grungegun wrote:
| So there's no performance gain for quantization enabled by
| the transformer architecture? It seems very strange that
| quantization works so well since in most of my experiments,
| the internal model weights of MLPs look random.
| ianbicking wrote:
| Reduction to decision tree!
|
| But I'm unclear how it actually runs, and the article talks about
| the conversion and training but doesn't describe how it runs... I
| suppose because it's obvious to someone who has followed
| quantization.
|
| Thinking out loud... if you have a model of just 1 and 0, my
| first thought is that the outputs are 1's and 0's but I think
| that's wrong. Instead it's a bunch of floats, and you multiply
| them by 1 or 0 (in a sense you are sampling the output of the
| previous layer?), add them up, and put the result through some
| activation function. And two-bit quantization sounds kind of
| similar, just with a _little_ scale, going from -1 to 2 instead
| of 0 to 1.
|
| It seems kind of interesting that you now have a bunch of weights
| that are exactly 0, meaning you can assert something about what
| parameters and weights affect what parts of the output. Though in
| some sense the way they compress the weights down to one bit is
| also a heuristic you could use to interpret the original model...
| this just makes it clearer that in totality you are making a
| defensible simplification, because the end result is still a
| working model.
|
| It also seems like you could make a lot of mathematical
| assertions about a one bit model that would be harder to make
| otherwise. Like maybe you could start thinking of a model as an
| equation, a _particular_ (though giant) equation, and look at its
| properties and consider symbolic transformations to that
| equation.
| fabmilo wrote:
| I like the decision tree analogy
| bick_nyers wrote:
| A comment I really liked on a previous post about ternary
| highlighted that what you are actually measuring with {-1, 0,
| 1} is inverse correlation, no correlation, and correlation.
| mmoskal wrote:
| It seems the trick here is they first quantize it to 1- or 2-bit,
| and then they fine-tune the quantization bias parameters (the
| parameters that dequantize from 1-2 bit to 16-bit) via LoRA. Then
| they have specialized kernels to do matrix multiplication at the
| bit level.
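|
| (Rough PyTorch sketch of that recipe - shapes and init are made
| up, and HQQ+ groups the scale/zero per 8 weights rather than
| per row as done here:)
|
|   import torch
|
|   d, r = 1024, 16                    # hidden size, LoRA rank
|   W_bits = torch.randint(0, 2, (d, d)).float()  # frozen 1-bit
|   scale = torch.full((d, 1), 0.02, requires_grad=True)
|   zero  = torch.full((d, 1), 0.50, requires_grad=True)
|   A = (0.01 * torch.randn(r, d)).requires_grad_()
|   B = torch.zeros(d, r, requires_grad=True)  # update starts at 0
|
|   def forward(x):                    # x: (batch, d)
|       W = (W_bits - zero) * scale    # dequantize on the fly
|       return x @ W.T + (x @ A.T) @ B.T
|
| Only scale/zero and the A/B adapters get gradients; the 1-bit
| base weights stay frozen.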
|
| Also, the 2-bit model seems much better than the 1-bit model -
| they use [-1, 0, 1, 2] - I wonder if '2' is needed in light of
| the 1.58b paper (which claims -1 is def. needed).
| andy_xor_andrew wrote:
| Interesting, and it kinda makes sense. You quantize, which
| invariably means you lose some precision, but then you can
| finetune post-quantization to recover at least some of it. Neat
| idea.
| jimmySixDOF wrote:
| Which is itself a little counterintuitive as the arxiv papers
| they cite say models need to be pretrained from the ground up
| using 1- or 2-bit (or 1.58bit). It definitely adds some
| interesting data points for the open source community who are
| experimenting in every possible direction.
| r1chardnl wrote:
| Is this similar to binary search or could it be a binary search
| where you're just excluding all possibilities that you know can't
| be the probable result that you're looking for?
| nyrikki wrote:
| Feed forward networks are effectively DAGs
|
| With the smaller model size, quantized models have less
| accuracy and less stable training, but large models take
| advantage of increased parallelism.
|
| Binary search is more linear, circuits are a better lens than
| Turing machines.
|
| While it still needs to be verified, if you look into what a
| uniform constant-depth threshold circuit is, that will help.
|
| Although I guess binary weights may be in AC0 rather than
| TC0, though that may not hold with billions of parameters.
| londons_explore wrote:
| I believe the future is 1 bit models - for both training and
| inference.
|
| When people make custom silicon for 1 bit models, they'll find
| that it is _sooooo_ much more power and silicon-space efficient
| to do 1 bit math than 16 bit floating point - like 100x or more.
|
| That extra model size will vastly overshadow any worse
| performance of the models.
| JHonaker wrote:
| > That extra model size will vastly overshadow any worse
| performance of the models.
|
| ...What?
| sroussey wrote:
| I think OP was referring to parameter size. You can make up
| for quantization by having more parameters.
| JHonaker wrote:
| I know. I was being a bit mean, but they seem to be
| implying that more parameters _even with_ worse performance
| is preferable. That seems backward to me. There is no
| reason for parameter size to be intrinsically valuable, and
| I think people are a bit infatuated with it.
| cma wrote:
| For training how do you get any kind of meaningful derivative
| with it?
| twelfthnight wrote:
| Maybe evolutionary algorithms instead? Hasn't proven super
| useful historically, but maybe at the scale of enormous LLMs
| it will be?
| bionhoward wrote:
| Evolutionary algorithms made you, didn't they?
| TheDudeMan wrote:
| That does not prove that they can beat gradient descent.
| bick_nyers wrote:
| It took a lot of human brain flops to get to this point
| in time though, I wonder how many orders of magnitude
| more than it took to train ChatGPT...
| VHRanger wrote:
| Nope, they're orders of magnitude more inefficient because
| they don't leverage gradient descent.
|
| Rule of thumb in optimization: real numbers are easy,
| integers are hard
| markisus wrote:
| This may be the status quo because of the so called
| "hardware lottery" which has historically been optimized
| for floating point. I'm speculating, but if hardware
| designers were instead only concerned about raw xnor
| density and throughput, we might end up with chips
| powerful enough that giant 1-bit nets could be trained
| purely through evolution.
| imtringued wrote:
| How do you optimize memory for floating point?
| VHRanger wrote:
| BF8 and other similar formats?
| VHRanger wrote:
| No, it's a fact at the mathematical level that you can
| enshrine in big O terms if you want to
| bick_nyers wrote:
| Gradient-directed evolutionary algorithm sounds kinda
| interesting.
| chalst wrote:
| The OP explicitly excludes training.
| cma wrote:
| The one I replied to said 1-bit for both training and
| inference.
| scotty79 wrote:
| Maybe something probabilistic?
| wongarsu wrote:
| Probably more than 100x for inference. Not only are you
| drastically reducing the number of bits and replacing float
| math with integer math, you can do matrix multiplication with
| only addition (as pointed out in the BitNet b1.58 paper).
| Additions require a lot less hardware to implement than
| multiplication. Adding one-bit or two-bit numbers requires
| barely any hardware at all. A traditional two-bit adder without
| carry bit is three xor gates and an and gate.
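|
| (Toy sketch of the add-only dot product that ternary {-1, 0, 1}
| weights allow:)
|
|   def ternary_dot(weights, activations):
|       acc = 0.0
|       for w, a in zip(weights, activations):
|           if w == 1:
|               acc += a        # add
|           elif w == -1:
|               acc -= a        # subtract
|           # w == 0: skip entirely
|       return acc
|
|   print(ternary_dot([1, -1, 0, 1], [0.5, 2.0, 3.0, -1.0]))  # -2.5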
| fasa99 wrote:
| to me the most exciting thing is that if it is training that
| is sped up on the order of 100x-1000x, a large cluster may be
| well suited to gradient-descent tuning of hyperparameters by
| training LLMs again and again at scale - this is the first
| foot in the door towards an AI that iteratively may improve
| itself
| mobicham wrote:
| LoRA training should benefit from the same speed-up,
| because the 1-bit weights will be frozen and all you need
| for both the forward and backward pass is a binary matmul,
| then maybe cast after to get more stable gradients.
| programjames wrote:
| > I believe the future is 1 bit models - for both training and
| inference.
|
| 1 bit's nothin'. The future is training directly on
| electronic/photonic circuits.
| bionhoward wrote:
| Isn't 1bit too low for optimal radix economy (Euler's number)
| though?
|
| I want to see somebody try "imbalanced quaternary" -,0,+,2
| twelfthnight wrote:
| Haven't heard this argument before. But from the Wikipedia
| article it seems base 3 has the best asymptotic radix
| economy, but isn't much better than base 2 and base 2 is
| seemingly easier to program and optimize.
|
| Since this is a new argument I've not heard, would be curious
| if you had links or some more explanation.
| johnmorrison wrote:
| people are indeed working on -1,0,1,2 Q2 models, I read
| something about it the other day but don't recall the title.
|
| they mentioned decomposition of Q2 into ternary+binary i.e.
| [[1,2],[-1,0]] -> [[1,1],[0,0]] + [[0,1],[-1,0]]
| Dylan16807 wrote:
| I bet the optimal "large" value is bigger than 2.
| hervature wrote:
| Radix economy is all about which base is the most efficient
| to represent a given number. It is simple to show that, for
| large numbers, this is equivalent to how efficient a base can
| represent itself, b/ln(b). Simple calculus shows this is
| minimized at e (Euler's number) or 3 if integer (closely
| followed by 2).
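|
| (For completeness, that "simple calculus" for f(b) = b / \ln b:)
|
|   f'(b) = \frac{\ln b - 1}{(\ln b)^2} = 0
|   \;\Rightarrow\; \ln b = 1 \;\Rightarrow\; b = e
|
|   f(2) \approx 2.885, \quad f(3) \approx 2.731, \quad
|   f(e) \approx 2.718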
|
| It sounds like you have something to add but you are already
| dictating the base by saying "bit". Literally from "binary
| digit". Anyway, quantization is not about which number system
| is best - virtually all computer systems we use today
| represent numbers in base 2. Quantization, at its core, is
| lossy compression. How do you go from a large model trained
| to high precision to a smaller model without hindering
| performance? This can be studied without needing to know the
| base.
|
| Suppose you are using a decimal computer. You can ask
| yourself, I have 128-decimal precision numbers, do I need
| that much precision? What happens if I just use 1-decimal
| precision by chopping off the 127 digits after the first
| decimal? You realize that there are two parts of an
| operation. The numbers involved (the operands) and the
| operation itself. You then ask yourself, if I keep one of the
| operands fixed (the original input), can I represent my
| 128-decimal precision neural network simply as a series of
| operations without the other operand? Perhaps only the most
| basic ones? Like: noops (add 0 or multiply by 1), increments
| (add 1), decrements (subtract 1), negations (multiply by -1),
| and clears (multiply by 0)? You count those numbers (-1, 0,
| and 1). There are 3 so you proudly proclaim you've made a
| neural network that only uses 0.477 dits. People get excited
| and confused because that is less than 1 dit which seems like
| a fundamental point. You further surprise the scientific
| field by finding a clever trick for getting rid of negations.
| You beat your previous record and now you only need 0.301
| dits to represent your network. You are about to accept your
| Turing Award when the ghost of Claude Shannon appears and
| says "Why are you using a unit that measures entropy to mean
| how many symbols you have? If you insist, at least realize
| 0.301 dits is 1 bit." You are shocked when you realize
| 10^0.301 = 2^1. Reviewing Shannon's seminal paper, you are
| awestruck by Shannon's prescient comment "Change from the
| base a to base b merely requires multiplication by
| log_b(a).". You humbly give your award to Shannon. You keep
| the $1M since ghosts aren't as fast as a new NVIDIA DGX. No
| matter how quantized the ghost is.
|
| [1] - https://people.math.harvard.edu/~ctm/home/text/others/s
| hanno...
| api wrote:
| Doesn't training need higher precision to avoid getting stuck
| at local minima, at least with back propagation style learning?
|
| Maybe something alternate like evolutionary algorithms could
| work in a domain like this, but so far those have proven to be
| less efficient.
| sp332 wrote:
| A recent paper used a single ternary "trit" (~1.58 bits) per
| parameter for training.
| https://news.ycombinator.com/item?id=39535800 They said it
| would be difficult to apply this technique to pre-trained
| models, and that they had to be trained at 1 trit from scratch.
| gliptic wrote:
| But it's using higher-precision weights during training, not
| just the trits.
| mikewarot wrote:
| I believe the future is 4*4 bit look up tables with output
| latches, with a bit to/from each Cartesian neighbor. Clock them
| like the colors of a chessboard, in 2 phases, and you don't
| have to worry about timing dependencies.
|
| All of your code gets converted to a directed acyclic
| graph (DAG), executing at GHz rates if you want.
|
| Imagine a machine that can output a million parallel GPT-4
| streams at 1000 tokens per second each.
|
| If the architecture changes it's just a different DAG. Unlike
| with FPGAs and their custom blocks that have to be optimally
| used, you can compile a DAG almost instantly.
| smusamashah wrote:
| Is this something from a research finding or is it your idea?
| mikewarot wrote:
| It's my idea... I've been writing about it for over a
| decade. I hope to get a small version of it made via tiny
| tapeout later this year once I'm out of the precariat.
|
| The idea is that you can decompose any non-cyclic code, for
| example, a matrix multiply, FFT, etc., into a directed
| graph of bitwise operations, then map those operations out
| into the grid of LUTs. Pipe inputs in one side of the grid,
| and get the outputs out the other side, and all of the
| bitwise operations happen in lock-step parallelism. (Unlike
| an FPGA, which is asynchronous.)
|
| If you can decompose one stage of an LLM down into a
| bitwise graph of computation, you can easily map it into
| the grid. If there's a bad cell in the grid, you can map
| around it.
|
| Because all of the high speed logic lines are short (to the
| neighbors) you don't have the long lines / high capacitance
| issues driving signals all the way across an FPGA or ASIC,
| thus you can really crank up the clock rate (and/or it's
| very power efficient).
|
| It should be trivial, design wise, to scale from a 16x16
| grid up through chips that could crank out Petaflops.
|
| Here's a simulator for the thing, written in Pascal.
|
| https://github.com/mikewarot/Bitgrid
| sitkack wrote:
| You can simulate this on a GPU now. This is like running
| a CA on a GPU with slightly different rules.
|
| I think you would also be delighted to know that many of
| the non-GPU AI accelerators are in a mesh topology.
|
| Cerebras, Tenstorrent, Esperanto
|
| https://www.esperanto.ai/wp-
| content/uploads/2021/08/HC2021.E...
| mikewarot wrote:
| I've been intrigued by this approach. A more highly
| optimized (but harder to program) take is the GA144 [1]
| from Chuck Moore, the inventor of Forth. It's a grid of
| 144 F18 Forth-based processors in a Cartesian grid. These
| processors are far more limited, but then again they take
| far less power as well.
|
| [1] https://www.greenarraychips.com/
| sitkack wrote:
| That is a pretty cool chip. Chuck Moore is a fascinating
| character.
|
| You would probably get a kick out of David Ackley's T2
| Tile Project.
|
| https://t2tile.com/
|
| https://www.youtube.com/@T2TileProject/videos
|
| https://www.youtube.com/@DaveAckley
|
| https://en.wikipedia.org/wiki/Ackley_function
|
| https://www.cs.unm.edu/~ackley/
|
| Fun fact, Tenstorrent wanted to add instructions to
| enqueue data between processors connected in a mesh and
| Arm said no (they don't do architectural licenses
| anymore), so Tenstorrent used RISC-V.
|
| Implementing something like a GA144 but using small
| RISC-V processors would be an interesting hack.
| https://www.youtube.com/watch?v=6QRKpd28NEE
| Dylan16807 wrote:
| 1. If you write FPGA code as a grid of lookup tables then I
| would expect it to be easy to compile instantly.
|
| 2. In what way is this "acyclic"?
|
| 3. Won't putting your code _into_ this form be the hard part?
| Even if you start with a DAG, 99.99% of them won't fit this
| form without intense restructuring. So you just moved the
| hard step over by one.
| bgnn wrote:
| This! Maybe just integer, but not floating point. That's a
| ridiculous way to do computation when you don't really need the
| precision.
| imtringued wrote:
| The ternary (1.58-bit) models only compress the weights. You still
| add and subtract using bfloat16 for better accuracy. Dedicated
| silicon is mostly a waste, because you are only processing two
| operations per parameter during inference. Loading the
| parameters from slow DDR, GDDR or HBM memory is the bottleneck
| in practice and the only solution is PIM. I was honestly
| disappointed by Nvidia's Blackwell since it is just barely
| competitive with GDDR PIM.
| mobicham wrote:
| At least you can copy 16 times more data to the shared memory
| with binary weights.
| littlestymaar wrote:
| Stupid question: how can you do deep learning on 1 bit? DL works
| by calculating a gradient and following it to minimize a loss
| function, but if all your parameters are 1 bit, there's no such
| thing as a gradient as you don't have differentiation at all. So
| how does it work?
| edflsafoiewq wrote:
| Not sure what you mean, why wouldn't there be a gradient? There
| is a problem that small changes to weights may have no effect
| (eg ((0 + 0.3) + 0.3) + 0.4 = 0 if you have to round the result
| to 0 or 1 after every step), but to fix this I believe everyone
| maintains high precision weights for the backwards pass, from
| which the quantized weights are derived.
| scotty79 wrote:
| You could treat fractional addition as a probabilistic
| operation.
| Centigonal wrote:
| Not a stupid question at all.
|
| Early quantization approaches just quantized a high-bit-
| precision pre-trained model - no need to calculate gradients on
| the quantized weights. BitNet[1] changed the game by training a
| low-precision model from scratch. It achieves this by keeping
| high precision in the gradients, optimizer state, and in
| "latent weights" which are then quantized on the fly. I don't
| really understand the finer details of how this works, so check
| out the paper if you're interested.
|
| This article's approach is interesting. They start by
| quantizing a pre-trained high-precision model, and then they
| fine-tune the quantized model using LoRA (which dramatically
| improves the performance of the quantized model). They don't
| talk about the bit depth of the values in the LoRA matrices, so
| it may be that those are higher bit-depth.
|
| [1] https://arxiv.org/pdf/2310.11453.pdf
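|
| (Rough sketch of the "latent weights" trick as I understand it:
| keep fp weights, binarize them on the fly in the forward pass,
| and let gradients flow straight through the sign() as if it
| were the identity - the straight-through estimator:)
|
|   import torch
|
|   class BinaryLinear(torch.nn.Module):
|       def __init__(self, d_in, d_out):
|           super().__init__()
|           self.latent = torch.nn.Parameter(
|               0.02 * torch.randn(d_out, d_in))
|
|       def forward(self, x):
|           w = self.latent
|           w_bin = torch.sign(w)  # +-1 (sign(0)=0 edge case)
|           # forward uses w_bin, backward sees w (STE)
|           w_ste = w + (w_bin - w).detach()
|           return x @ w_ste.T
|
| The optimizer state and gradients then stay in high precision
| even though the matmul only ever sees +-1.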
| kromem wrote:
| The most exciting part about ternary or binary weights is the
| inevitable hardware revolution for AI dedicated chips that's
| going to result from it.
| imtringued wrote:
| Your hardware already supports addition and subtraction and the
| tensor cores of NVIDIA GPUs are already fast enough to keep up.
| The only benefit is reducing memory capacity and bandwidth
| requirements.
| Matumio wrote:
| You mean we'll have hardware accelerated ternary instructions?
| https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
|
| (Okay probably those are not ready to be used as NN weights if
| the activations are not binary too, but... the gap to what CPUs
| already can do is getting smaller.)
| jstmm wrote:
| In both the '1-bit Model' and '2-bit Model' tables, the forward
| time (sec) for Llama2-7B with FP16 (full precision) is 0.1 s,
| whereas it's ~0.231, ~0.257, ~0.353 s respectively for HQQ
| (1-bit) / HQQ+ (1-bit) / Quip# (2-bit) meaning the FP16 model has
| ~3x lower inference time.
|
| On the contrary, in the BitNet b1.58 paper [0] the authors report
| their 7b model has 2.9x reduced inference latency.
|
| It's not clear to me what's happening here. Can someone explain
| why the 1/2bit HQQ/HQQ+ models are so much slower than the BitNet
| b1.58 models?
|
| [0] https://arxiv.org/pdf/2402.17764.pdf
| thatguysaguy wrote:
| Sounds like these guys didn't use custom kernels, but BitNet
| did.
| mobicham wrote:
| That's correct. Only the dequantization is done in CUDA; the
| matmul is done with PyTorch. If they made their kernels open
| source we could re-use them!
| londons_explore wrote:
| GPUs aren't really designed for 1-bit math... it doesn't
| run much faster than floating-point math on them.
|
| Whereas a custom ASIC or updated design of GPU could give
| _massive_ speedups with 1 bit math.
| bee_rider wrote:
| For 1 bit math, at least it should be possible to populate
| every other bit of an integer type, right? Surely one could
| do _better_ with a dedicated type for this, but at least we
| could pack 16 single-bit weights into a 32 bit int for
| addition, right?
| UncleOxidant wrote:
| Yes, exactly. Neither GPUs nor CPUs are set up for 1-bit math.
| Pulling 1 or 2 bits out of a word isn't all that
| straightforward on CPU or GPU - lots of shifting and masking.
| I wonder how long it's going to be before we see custom
| hardware for bitnets? I suspect we'll see it on FPGAs first.
| imtringued wrote:
| You're telling me GPUs aren't designed for additions and
| subtractions? Where did you hear that?
| bick_nyers wrote:
| I think they are more so saying that GPUs are not optimized
| for those operations. CPUs aren't "designed" for matrix
| multiplies yet we can still run them, albeit at a slower
| rate than on a GPU.
| brucethemoose2 wrote:
| Real world GPU performance is hugely influenced by hand
| optimization of the CUDA kernels.
| vladf wrote:
| Really strong binary results. So strong it seemed fishy. I hope
| someone can explain my confusion below.
|
| > We compared the performance of the Llama2-7B model in three
| configurations: FP16 (full precision), HQQ (without fine-tuning),
| and HQQ+ (with adapter layers) using a group-size of 8.
|
| Interesting, what is "group-size of 8"?
|
| From their HQQ post (https://mobiusml.github.io/hqq_blog/), it's
| the block size at which they add scales (presumably 16-bit) and
| shifts (in that post, it's 8-bit).
|
| So for every 8 binary weights we have a 16-bit scale and 8-bit
| shift?
|
| > Fine-tuning with Low-Rank Adapters
|
| They say they inline the shift into the LoRA but how can you do
| this, block-wise, without increasing your LoRA rank by num-blocks
| (they claim to only use 1 additional rank)?
|
| Then, the reported 7B sizes, in GB:
|
| > 13.5 (fp16) 1.76 (HQQ 1-bit) 1.85 (HQQ+ 1-bit) 2.72 (quip#
| 2-bit)
|
| those numbers would make sense if it was _actually_ 1 bit. But if
| you include the overhead of 16-bit scales (and why is the shift
| inlineable into lora? still unexplained) it'd be more like 3-bit.
|
| From their HF page:
|
| > This version offloads the meta-data to the CPU, so only the
| binary weights and the low-rank adapters are stored in the GPU
| memory.
|
| Interesting, so we have to go back to CPU to rescale? Is this how
| they counted GB? This should have been clearly caveated in the
| table. I also am amazed they got latency lower than quip if they
| pingpong to CPU.
| danielhanchen wrote:
| When one does quantization, it's done in blocks. Bitsandbytes
| uses a blocksize of 64 I think. W * scale + zero_point is
| needed for each block. So you need 2 numbers in fp16
| for each group of 64 numbers. For BnB, you get 4.5 bits approx
| since (64*4bit + 16bit + 16bit) / 64 = 288/64 = 4.5. So 4-bit
| is actually 4.5-bit.
|
| For HQQ 1bit, a group size of 8 needs 2 fp16 numbers (you
| mentioned 8bit for shift). So you need 8 * 1bit + 16bit + 8bit
| for each group, i.e. 32 bits for each group of 8, or 4 bits per
| param.
|
| I'm assuming the scale and zero_point are both moved to 8bit
| maybe, so (8*1bit + 8bit + 8bit) / 8 = 24/8 = 3 bits per param?
|
| "This version offloads the meta-data to the CPU, so only the
| binary weights and the low-rank adapters are stored in the GPU
| memory.", so the 8+8 scale / zero_point moves to the CPU. So
| GPU memory is ~1 bit per param, and the CPU holds the rest of
| the meta-data - very smart!
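|
| (The arithmetic above as a one-liner, for playing with group
| sizes:)
|
|   def bits_per_param(wbits, group, scale_bits, zero_bits):
|       # weight bits plus per-group scale/zero meta-data
|       return (group * wbits + scale_bits + zero_bits) / group
|
|   print(bits_per_param(4, 64, 16, 16))  # BnB-style 4-bit -> 4.5
|   print(bits_per_param(1, 8, 16, 8))    # 1-bit, fp16+int8 -> 4.0
|   print(bits_per_param(1, 8, 8, 8))     # 1-bit, int8+int8 -> 3.0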
| vladf wrote:
| Err, you are just restating what I'm saying, without
| addressing the concerns.
|
| 1 - is it fair to use RAM in two places and report only one
| of them without any asterisk? (If you think this is fair - oh
| boy, wait till you hear about my inference algorithm that
| uses 0GB of HBM)
|
| 2 - I know how subchannel quantization works. Are they
| hitting those reported latency numbers with per-layer CPU
| ping-pong to rescale?
| danielhanchen wrote:
| Oh sorry - didn't mean to restate what you said, whoops -
| all train of thought!
|
| You can see from
| https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-
| hf_1bi... that the model disk space is 3GB + 100MB of LoRA
| weights. I also uploaded a 4bit one to https://huggingface.
| co/unsloth/llama-2-7b-bnb-4bit/tree/main which uses 3.87GB.
|
| So because of the offloading trick, the GPU VRAM is less,
| but in actuality, still 3GB is needed.
|
| Unsure on latency sadly
| Dylan16807 wrote:
| > "This version offloads the meta-data to the CPU, so only
| the binary weights and the low-rank adapters are stored in
| the GPU memory.", so the 8+8 scale / zero_point moves to the
| CPU. So GPU memory 1bit, but CPU meta data is the rest - very
| smart!
|
| Doesn't it need all the weight metadata for a layer to use
| that layer?
|
| * If yes, then can't _any_ algorithm offload x% of its data
| as a balancing act between speed and RAM?
|
| * If no, then what's it for and when does it get used?
| danielhanchen wrote:
| Oh yes you need all the metadata, but because it's 2
| numbers the scale and zero_point, I think the movement of
| singular digits are super fast to the GPU registers -
| cannot confirm though.
|
| It's like in cuBLAS you do alpha*A*B + beta*C, and alpha
| and beta are both scalars which can be on the CPU, and
| moved to the GPU in nanoseconds.
|
| I'm unsure though since I haven't tested it
| Dylan16807 wrote:
| It still has to go through the entire memory system. It's
| hard for me to imagine that transferring a number from
| the CPU to the GPU is faster than transferring a byte,
| and if you have 2 CPU-resident numbers per GPU-resident
| byte that's a lot of transferring.
| danielhanchen wrote:
| I don't disagree - fair point, there definitely is a
| transfer latency overhead. I would suspect one could
| prefetch it by calling `.to("cuda", non_blocking = True)`
| say 2 layers ahead, so you can in theory hide the
| movement.
|
| I think somewhere the blog did mention HQQ for 1 bit is
| slower for now, maybe due to the transfer overhead,
| although I couldn't remember exactly where
| Dylan16807 wrote:
| My point is more that if it's that many bytes flowing
| around on demand, you're basically swapping layers in and
| out of the GPU as you use them (or x% of each layer).
|
| Which is fine, and it's a valid feature, but you don't
| need to split those bytes into "data" and "metadata" to
| make that happen.
|
| Is there actually something they gain from this
| particular method of splitting?
| danielhanchen wrote:
| I guess it's mainly to reduce VRAM usage. Assuming we
| don't do this, then a 7b model with 1bit group size 8
| will use 3GB or something of VRAM, whilst a 4bit group
| size of 64 will use 4GB approx.
|
| Assume we have a 100b model - with 4bit, VRAM is around
| 57GB or so. With 1bit, 43GB VRAM, but by moving the
| scalars and zero point to RAM, VRAM use is like 15GB or
| something, at the cost of like 28GB RAM usage.
|
| I guess maybe a valid approach is to dynamically select
| which ones to move to RAM or VRAM, given your VRAM
| budget. Say you have a 40GB GPU, clearly move more of the
| scalars to GPU. But if you have a weak GPU, then you need
| to use more RAM.
| Dylan16807 wrote:
| I still don't think I'm getting my point across.
|
| What if you store 40% of the data and 40% of the metadata
| in CPU memory. Is that the same for performance?
|
| Is there a reason we want _particular parts_ of the layer
| to stay on the GPU? Or is it just number of bytes.
| mobicham wrote:
| Hello, I am the main author, would love to clarify a couple of
| things:
|
| All the linear-quantization methods have meta-data, including
| the 1.58bit paper. You can control the quality vs. memory usage
| by reducing the group-size. However, the meta-data is not
| the same thing as the quantized weights for many reasons:
|
| > The meta-data size doesn't change the fact that you can do
| binary/ternary matmul, which is the most important thing in this
| story.
|
| > The meta-data size doesn't increase the actual compute: these
| are point-wise operations and even if you have 1 scalar you
| still need to multiply the same number of weights.
|
| > Meta-data is offloaded to the CPU with pinned-memory, which
| allows non-blocking transfers. Technically, you can trigger the
| copy in the layer before and synchronize, which makes it
| almost seamless. I did some experiments with cuda streams that
| worked very well on an older machine, but then I tried a better
| machine and the transfer was much faster. Obviously if you are
| trying it on Google colab it's very slow for this reason.
|
| > Smaller models like Llama2-7B are very hard to directly
| quantize at very low bits, so they need a lower group-size to
| function well. Larger models (like what we showed for Mixtral),
| can be quantized to 2-bit on the fly, without any data, and
| still work very well. So basically larger models are less
| sensitive to extreme quantization and you could use a much
| larger group-size. I still think that the meta-data size is
| really not a big deal for the reasons I have explained above.
|
| > There are many other ways to increase the group-size or even
| get rid of it altogether; many ideas are available but they
| need lots of experimentation.
|
| > Binary/ternary CUDA matmul kernels don't exist yet. The
| current code implements the dequantization step in CUDA
| but then uses torch.matmul in fp16. I tried doing matmul at
| low-bits with CUDA but it is very difficult to even beat cuBLAS
| with fp16, especially for a novice CUDA coder like me :)
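|
| (Schematic of that inference path - not the actual HQQ code,
| just the shape of it: meta-data sits in pinned CPU memory, is
| copied over with a non-blocking transfer, then the binary
| weights are dequantized and fed to a regular fp16 matmul:)
|
|   import torch
|
|   def layer_forward(x, w_bits_gpu, scale_cpu, zero_cpu):
|       scale = scale_cpu.to("cuda", non_blocking=True)
|       zero  = zero_cpu.to("cuda", non_blocking=True)
|       w = (w_bits_gpu.half() - zero) * scale  # dequantize
|       return x @ w.T                          # fp16 matmul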
|
| Please note: this is early experimental work. Since it showed
| promising results, we wanted to share it with the community
| first as we progress. There's still a lot of things to be done
| and we are actively working on it, despite the very limited
| resources we have.
|
| Happy to answer any questions here!
| mikeravkine wrote:
| Thank you for your efforts on behalf of the GPU poor!
|
| It's getting tougher to use older, cheaper GPUs
| (Pascal/Maxwell) with modern quantization schemes so anything
| you can do to keep kernels compatible with SM52 and SM61
| would be greatly appreciated.
| mobicham wrote:
| Thank you, very glad to hear that!
| vladf wrote:
| Thanks for the reply. I'm quite familiar with subchannel
| quant, but still feel like my questions did not get
| addressed.
|
| 1 Could you post the full memory use of the methods? E.g. you
| include quip metadata in its GB but not hqq metadata in its
| GB.
|
| 2 If you have to go to the CPU to shift and scale, how did
| you get latency lower than pure on-device? Was this bsz=1?
| No speculative decoding?
|
| 3 How can LoRA absorb shifts while only increasing rank by 1
| if you have a shift per group?
| danielhanchen wrote:
| Highly respect HQQ team's work - same accuracy as GPTQ / AWQ and
| with no activation-aware calibration step - i.e. no more 3-hour
| calibration runs!
| the 3,2, and now 1bit ones!
|
| Also super cool idea of 1bit needing some calibration like AWQ -
| no calibration data shows very bad results, but with some LoRA
| adapters and finetuning, a great recovery of performance is
| possible.
|
| Planning to add support for this inside Unsloth to make all low
| bit finetuning 2x faster and save tonnes of VRAM!
| shrubble wrote:
| I wonder if there is a correspondence to Sony's 1-bit DSD audio
| stream? Could hardware developed for this purpose be used for
| some part of the processing?
| codedokode wrote:
| I was thinking about this idea (using only additions for
| inference) before too. The article talks about using 1-bit
| values for weights, but there is also another option: use
| integers for weights, but allow the output of a neuron to take
| only 0 and 1 (for example, 1 if the output is > 0 and 0
| otherwise). In this case we do not need multiplications either,
| but we cannot save memory on storing the weights. Maybe it has
| some other advantages, who knows.
| Never had time to research this further.
| Greenpants wrote:
| You may be interested in the "binary step" activation function.
| This does what you're suggesting. Complex behaviour really
| takes a hit, though, when you use this as the activation
| function of a neuron (and I'm also not sure which papers
| show metrics for this being used in transformer models).
___________________________________________________________________
(page generated 2024-03-30 23:01 UTC)