[HN Gopher] The Era of 1-bit LLMs: ternary parameters for cost-e...
       ___________________________________________________________________
        
       The Era of 1-bit LLMs: ternary parameters for cost-effective
       computing
        
       Author : fgfm
       Score  : 566 points
       Date   : 2024-02-28 09:28 UTC (13 hours ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | transfire wrote:
       | Shouldn't that be "1-trit"?
        
         | QuesnayJr wrote:
         | They call it 1.58-bit in the paper. (1.58 is roughly the base 2
         | logarithm of 3.)
        
           | jmmcd wrote:
           | So by "1-bit" they mean "less than 2 bits". AI is an
           | insufferable field at times like this.
        
             | riskable wrote:
             | What else are they going to call it? Nobody wants to say
             | they wrote some two-bit paper about AI!
        
               | jmmcd wrote:
               | The whole thing is a bit of a scam
        
         | bmacho wrote:
          | Read the PDF (https://arxiv.org/pdf/2402.17764.pdf); they
          | call it 1-bit everywhere.
          | 
          | I don't know why they do this; 1-bit seems to be a very
          | wrong name for {-1, 0, 1}.
        
           | FrustratedMonky wrote:
            | Yes, technically, but it is catchy for the masses. 1-bit seems
           | to get the idea across, even if not technically describing
           | {-1,0,1}.
        
           | edflsafoiewq wrote:
           | I think 0 "doesn't count", since you don't have to add or
           | subtract anything for it, just mask it out.
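            | 
            | A rough numpy sketch of what that reduction looks like
            | (variable names are just illustrative): the +1 and -1
            | entries pick activations to add or subtract, and the 0
            | entries never touch the adder at all.
            | 
            |     import numpy as np
            | 
            |     # w ternary in {-1, 0, +1}, x = activations
            |     w = np.array([1, -1, 0, 0, 1, -1])
            |     x = np.random.randn(6)
            | 
            |     # no multiplies: add where w == +1, subtract where
            |     # w == -1, and simply skip (mask out) the zeros
            |     y = x[w == 1].sum() - x[w == -1].sum()
            |     assert np.allclose(y, w @ x)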
        
             | numpad0 wrote:
              | Ternary, or three-valued logic, is a thing in CS [1]
             | 
             | 1: https://en.wikipedia.org/wiki/Three-valued_logic
        
             | paipa wrote:
             | Would be cool to see what happens if you quantize towards
             | zero preferentially. Sparsifying the matrix should improve
             | inference speed directly, right?
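              | 
              | A quick sketch of that idea (the zero_band knob is mine,
              | not from the paper): widening the band that rounds to 0
              | trades a little fidelity for a sparser ternary matrix.
              | 
              |     import numpy as np
              | 
              |     def ternary_sparse(w, zero_band=0.7):
              |         # absmean-style scale, then round to
              |         # {-1, 0, +1}; a wider zero_band sends
              |         # more weights to exactly zero
              |         gamma = np.abs(w).mean() + 1e-8
              |         s = w / gamma
              |         q = np.where(np.abs(s) < zero_band,
              |                      0, np.sign(s))
              |         return q.astype(np.int8), gamma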
        
       | tuananh wrote:
        | A major breakthrough in the LLM scene: it achieves performance
        | and perplexity equivalent to full FP16 models of the same
        | parameter size.
        | 
        | And you could fit a 120B model on a single card with 24GB of
        | VRAM. This is mind blowing.
        
         | cyanydeez wrote:
          | I mean, it expands the hardware selection, but until there
          | are models and leaderboards etc., we can't really say it's a
          | breakthrough.
        
       | anon373839 wrote:
       | > BitNet b1.58 can match the performance of the full precision
       | baseline starting from a 3B size. ... This demonstrates that
       | BitNet b1.58 is a Pareto improvement over the state-of-the-art
       | LLM models.
       | 
       | > BitNet b1.58 is enabling a new scaling law with respect to
       | model performance and inference cost. As a reference, we can have
       | the following equivalence between different model sizes in
       | 1.58-bit and 16-bit based on the results in Figure 2 and 3.
       | 
       | > * 13B BitNet b1.58 is more efficient, in terms of latency,
       | memory usage and energy consumption, than 3B FP16 LLM.
       | 
       | > * 30B BitNet b1.58 is more efficient, in terms of latency,
       | memory usage and energy consumption, than 7B FP16 LLM.
       | 
       | > * 70B BitNet b1.58 is more efficient, in terms of latency,
       | memory usage and energy consumption, than 13B FP16 LLM.
       | 
       | This paper seems to represent a monumental breakthrough in LLM
       | efficiency, as the efficiency gains come with zero (or negative)
       | performance penalty.
       | 
       | Does it seem at all likely that existing models could be
       | converted?
        
         | accurrent wrote:
          | They seem to be using LLaMA. Might be worth trying out. Their
         | conversion formula seems stupidly simple.
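          | 
          | If I'm reading the paper right, the quantization step is
          | just absmean rounding: scale by the mean absolute weight,
          | then round and clip to {-1, 0, +1}. Roughly (a sketch, not
          | their code):
          | 
          |     import numpy as np
          | 
          |     def absmean_quant(W, eps=1e-8):
          |         # scale by mean |w|, then snap to {-1, 0, +1}
          |         gamma = np.abs(W).mean() + eps
          |         Wq = np.clip(np.round(W / gamma), -1, 1)
          |         return Wq.astype(np.int8), gamma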
        
           | wongarsu wrote:
           | However they trained their models from scratch, which is also
           | why they only have meaningful numbers for 700M, 1.3B, 3B and
           | 3.9B models. Apparently they are following BitNet's approach
           | of replacing linear layers with quantized layers during
           | training? If it was trivial to convert existing models
           | without performance loss I would have expected them to
           | include a benchmark of that somewhere in the paper to
           | generate even more impact.
        
             | imjonse wrote:
             | They present numbers for 7B to 70B models as well.
        
               | sp332 wrote:
               | They do not have perplexity numbers for the larger models
               | (see Table 2), only speed and memory benchmarks.
        
               | imjonse wrote:
                | You're both right; I skimmed the paper and saw the
                | large-model numbers but didn't notice they were only
                | for speed. On the HF page they say those models are
                | being trained.
               | 
               | https://huggingface.co/papers/2402.17764
               | 
               | "We haven't finished the training of the models beyond 3B
               | as it requires much much more resources. However, we're
               | optimistic about the results because we have verified
               | that BitNet follows a similar performance-parameter
               | scaling law as the full-precision LLMs. We'll update the
               | results on larger models once they're ready."
        
               | anon373839 wrote:
               | Those numbers are for cost only, not performance. It's
               | not clear they actually _trained_ a 70B vs. just using
               | randomly initialized parameters.
        
           | FrustratedMonky wrote:
            | Yes. I wonder how long before someone who does have a lot
            | of compute power, like OpenAI/MS or others, can rapidly
            | pivot and try this out on even larger models.
            | 
            | Doesn't this mean that the current big players can rapidly
            | expand model sizes by huge multiples?
        
         | ignoramous wrote:
          | I wonder if 1-bit quantization is the _main_ reason why pplx.ai
         | is faster than any other RAG or chatbot. For instance, Gemini
         | in comparison is a turtle, though it is better at explanations,
         | while pplx is concise.
        
         | btbuildem wrote:
         | Discussion on HF [1] implies that no, conversion is not
         | helpful. It would take training the model from scratch.
         | 
         | 1: https://huggingface.co/papers/2402.17764
        
           | anon373839 wrote:
           | It's a pity if realizing these gains absolutely requires full
           | pre-training from scratch. I imagine more than a few people
           | will at least try to find a way to repurpose the knowledge
           | contained in existing models.
        
             | cooljoseph wrote:
             | You can also have another model "mentor" a new model you
             | are teaching to speed up training. You don't have to start
              | from scratch with zero knowledge. This is done a lot in
              | what is called distillation.
        
       | imjonse wrote:
       | Too bad there seem to be no pretrained models to download. This
       | is not a quantization method to apply on existing models, so
       | having the pretrained weights is needed if one wants to test it.
        
         | bArray wrote:
          | +1 on this; the real proof would have been testing both
          | models side by side.
         | 
         | It seems that it may be published on GitHub [1] according to
         | HuggingFace [2].
         | 
         | [1] https://github.com/microsoft/unilm/tree/master/bitnet
         | 
         | [2] https://huggingface.co/papers/2402.17764
        
           | imjonse wrote:
            | Nothing there yet, but it's good to know they want to
            | publish and just haven't gotten around to it yet.
        
           | SushiHippie wrote:
           | From [2]:
           | 
           | > We would definitely be happy to open-source the models for
           | future research. Please stay tuned!
        
           | UncleOxidant wrote:
           | link #2 appears to be broken.
        
       | dindobre wrote:
        | A refreshing paper as machine learning papers go: simple
        | explanation, easy to replicate, no alchemy-tier
        | interpretations. Can't wait to see it replicated or disproved
        | on real-life production tasks.
        
         | imjonse wrote:
          | The presentation is simplified because it assumes knowledge
          | of its predecessor, BitNet: https://arxiv.org/abs/2310.11453
        
           | dindobre wrote:
           | Makes sense!
        
         | wongarsu wrote:
         | The most glaring omission is that they only compared to fp16
         | models, not to quantized models. And of course the benchmarks
         | might be misleading compared to the real experience.
         | 
         | But if you wanted to make LLM-specific hardware (or x64
         | instructions tuned for LLMs) this model architecture makes that
          | extremely cheap. Multiplication requires a lot of transistors;
          | this architecture requires only two-bit adders. You could make
         | SIMD instructions that do thousands of these in parallel, for
         | fairly little silicon cost.
        
       | the8472 wrote:
       | What does it mean for future hardware if it's not using floating
       | point matrix multiplication units?
        
         | cyanydeez wrote:
         | https://stackoverflow.com/questions/45373679/why-is-it-faste...
        
           | gpderetta wrote:
            | As per the answer, the reason float is faster than int is
            | that a) hardware companies provide more float ALUs than
            | integer ALUs and b) float FMA is a thing, while integer
            | FMA isn't. Both are because currently most HPC-like loads
            | use floats instead of integers, not because of intrinsic
            | hardware reasons.
        
             | KeplerBoy wrote:
              | If it's desired, integer performance could far exceed float
             | performance, since ALUs need less die area than FPUs.
             | 
             | If this paper holds, I'd expect that's where custom
             | accelerators will be heading.
        
               | gpderetta wrote:
               | Oh, I agree, I'm just saying that there is no reason in
                | principle for float performance to be better than
               | integer.
               | 
               | edit: also this might be implementable purely using
               | bitwise vector operations. Would need to check the
               | throughput of those.
        
         | KeplerBoy wrote:
         | Expect Nvidia to advertise with their TOPS numbers instead of
         | their FLOPS.
        
           | rfoo wrote:
           | Already happened years ago. They advertised TOPS for
           | int8/int4 [0], and with 50% sparsity [1].
           | 
           | [0] low-bit CNNs worked pretty well actually.
           | 
           | [1] Totally useless marketing snake oil.
        
       | hoseja wrote:
       | Balanced ternary, my beloved.
        
       | yieldcrv wrote:
        | This is great; my employer just gave me an M1 laptop with only
        | 16GB of RAM, and I had to downgrade my 7B-parameter local LLMs
        | to 3-bit quantization. They've been surprisingly okay!
        | 
        | On my personal machine with 64GB of RAM, I usually use 8x7B at
        | Q5 or 70B at Q4.
        | 
        | It's Mistral all the way down! Imagining a Q1.58 that does well
        | makes me happy.
        
         | turnsout wrote:
         | Quantized 7B LLMs should work fine on your machine, though
         | maybe you're talking about speed?
        
           | yieldcrv wrote:
           | 7B works fine
        
         | woadwarrior01 wrote:
          | You can run 4-bit quantized versions of SOLAR-10.7B and
          | Llama 2 13B-based models quite well on 16GB M1 laptops.
        
         | FergusArgyll wrote:
         | You shouldn't have to quantize it that much, maybe you're
         | running a lot of other programs while running inference?
         | 
          | Also, try using pure llama.cpp; AFAIK it has the least
          | possible overhead.
        
           | regularfry wrote:
           | Getting more value out of phi-2-sized models is where you
           | really want to be on lower-end M1's.
        
       | lucubratory wrote:
        | After reading the results I skipped back to the comment
        | section to ask if this was real, because it looks a little too
        | good to be true, but I figured I should check the authors, and
        | it's Microsoft Research and UCAS, so yeah, real. This is going
        | to change a lot of things: obviously the edge computing
        | applications they point out, but it's also going to bottom out
        | the cost of providing high-performance LLMs in the cloud. I
        | don't know what that means for the economics long term;
        | naively, much lower costs might mean new entrants without an
        | entire cloud available can compete more easily? I do wonder if
        | something like this has already been found and implemented by
        | either OpenAI or Google.
        
         | anon373839 wrote:
         | It also means the largest models can be scaled up significantly
         | with the same inference budget.
        
           | llm_trw wrote:
            | Depends. The only paper they cite for training,
            | https://arxiv.org/pdf/2310.11453.pdf, doesn't improve
            | training costs much, and most models are already
            | training-constrained. Not everyone has $200m to throw at
            | training another model from scratch.
        
             | arunk47 wrote:
             | Is there any scope for indie builders?
        
         | aurareturn wrote:
          | After playing with OpenAI's GPT4 API, I'm quite convinced
          | that LLMs would be in everything and everywhere today if
          | inference cost were as low as loading a website and context
          | size were 100x higher.
          | 
          | In other words, only inference cost is holding it back from
          | completely changing everything.
          | 
          | So if we get a shortcut to running something like GPT4
          | locally on a small device, watch out.
        
           | rvnx wrote:
           | It's coming in October with the new Apple chip
        
             | sigmoid10 wrote:
             | I'd be very surprised if Apple can put something on the
             | level of GPT4 on a handheld. Remember, GPT4 is estimated to
                | be around 1.7 trillion parameters. That's 3.4TB at 16
                | bits, and it would still be ~340GB at 1.58 bits. The
                | best we can hope for is a low-ish few-billion-parameter
                | model.
             | Which would still be cool on a phone, but as of today these
             | models are nowhere near GPT4.
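                | 
                | Back-of-the-envelope, using the 1.7T figure from this
                | thread (an estimate, not an official number):
                | 
                |     params = 1.7e12
                |     print(params * 16 / 8 / 1e12)   # ~3.4 TB at fp16
                |     print(params * 1.58 / 8 / 1e9)  # ~336 GB at 1.58b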
        
               | jairuhme wrote:
               | They won't have something at that size because as you
               | pointed out, it is still huge. But depending on how they
               | are used, smaller parameter models may be better for
               | specific on-phone tasks that start to make the size of
               | the model not a problem. GPT4 is so large because it is
               | very general purpose with the goal seeming to be to
               | answer anything. You could have a smaller model focused
               | solely on Siri or something that wouldn't require the
               | parameter size of GPT4
        
               | sigmoid10 wrote:
                | The thing about GPT4 that matters so much is not just
               | raw knowledge retention, but complex, abstract reasoning
               | and even knowing what it doesn't know. We haven't seen
               | that yet in smaller models and it's unclear if it is even
               | possible. The best we could hope for right now is a
               | better natural language interface than Siri for calling
               | OS functions.
        
               | ynniv wrote:
               | You don't need "GPT4" though. Mixtral 8x7B is robust and
               | can be run in 36 Gb, 24 Gb if you're willing to
               | compromise. A 1.5 bit quantization should bring it down
               | to 16. That's still a lot compared to the iPhone 15's 6,
               | but it's close enough to imagine it happening soon. With
               | some kind of streaming-from-flash architecture you might
               | be in the realm already.
        
               | creshal wrote:
               | > With some kind of streaming-from-flash architecture you
               | might be in the realm already.
               | 
               | I thought mmap'ing models to only keep the currently
               | needed pieces in RAM was something that was figured out
                | ~6 months ago? Performance wasn't terribly great IIRC,
                | but with how much faster 1.58-bit is, it should still
                | be okay-ish.
        
               | imtringued wrote:
               | I'm not sure what use that is, other than to maintain the
               | KV cache across requests.
        
               | liuliu wrote:
               | There is a more detailed paper from Apple on this.
               | Basically, you can do a little bit better than only
               | keeping current weights in RAM with mmap.
               | 
                | For LLMs, you are mostly dealing with b = W @ a where
                | a and b are vectors; only W is a matrix. If a is
                | sparse (i.e. has some 0s), you don't need all the
                | columns of W to do the matrix-vector multiplication. A
                | cleverly arranged W can make sure that during
                | inference only the relevant columns are loaded from
                | flash. Furthermore, if you apply the "One Weird Trick"
                | paper to this matrix-vector multiplication, you can
                | shard W by rows, i.e. `b[i:i+n] = W[i:i+n,:] @ a for i
                | in range(0, N, n)`, so that while the previous
                | b[i:i+n] is still computing, you already have
                | visibility into which columns of the next matrix need
                | to be loaded.
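                | 
                | As a toy numpy sketch of that column-loading trick
                | (names are mine, not from the Apple paper):
                | 
                |     import numpy as np
                | 
                |     W = np.random.randn(8, 8)
                |     a = np.array([0., 1.3, 0., 0.,
                |                   -2.1, 0., 0., 0.7])
                | 
                |     nz = np.nonzero(a)[0]
                |     # only these columns of W need to be
                |     # resident in RAM for this matvec
                |     b = W[:, nz] @ a[nz]
                |     assert np.allclose(b, W @ a)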
        
               | cjbprime wrote:
               | You need all of the model in RAM to perform the matmult
               | that gets you the next token from it. There's no
               | shortcut.
        
           | jart wrote:
           | LLMs will give normal people a firmer standing in
           | technological society. That's a good thing. But will it
           | change everything? Not a chance. Even if LLMs did change
           | everything, that probably would not be a good thing. Dijkstra
           | says Muslim algebra died when it returned to the rhetoric
           | style, and the modern civilized world could only emerge --for
           | better or for worse-- when Western Europe could free itself
           | from the fetters of medieval scholasticism --a vain attempt
           | at verbal precision!--thanks to the carefully, or at least
           | consciously designed formal symbolisms that we owe to people
           | like Vieta, Descartes, Leibniz, and (later) Boole. So don't
           | be so proud of these graphics cards you've made, because the
           | ability to understand the human tongue is insignificant
           | compared to the power of math.
        
             | rafaelero wrote:
                | LLMs can do math as well.
        
               | jart wrote:
               | What makes you think that? Which LLMs?
        
               | rafaelero wrote:
               | https://deepmind.google/discover/blog/alphageometry-an-
               | olymp...
        
               | dns_snek wrote:
               | Last time I checked, GPT-4 couldn't reliably add 2
               | numbers, never mind anything more complex.
        
               | vidarh wrote:
               | Last I checked (and confirmed by repeating it just now)
               | GPT-4 did just fine at adding 2 numbers up, because it
               | knows better now than to do that manually and will
               | express it as Python. It does _worse_ if you try to force
                | it to do it step by step like a child and don't
               | reinforce adherence to the rules every step, because just
               | like humans it gets "sloppy" when you try to get it to
               | repeat the same steps over and over.
               | 
               | If you want to measure its ability to do mindlessly
               | repetitive tasks without diverging from instructions, you
               | should compare it to humans doing the same, not expect it
               | to act like a calculator.
               | 
               | If you want to measure its ability to _solve problems_
               | that involve many such steps that are simple to express
               | but tedious to carry out, ask it to write and evaluate
               | code to do it instead.
        
               | dns_snek wrote:
               | The claim was that "LLMs can do math". Below they linked
               | a model from Google that might be capable of that, but as
               | a general rule (and with OpenAI's models specifically)
               | LLMs can't "do math" by any reasonable definition.
        
               | vidarh wrote:
               | I've had it do plenty of math. Some it does badly at,
               | some it does fine. Generally it's not "disciplined"
               | enough to do things that requires lots of rote repetitive
               | tasks, but neither are most humans, and that has improved
               | drastically as they've adjusted it to instead do what
               | most humans do and use tools. Would it be nice if it
               | _also_ got more willing to  "stick to it" when given rote
               | tasks? Sure.
               | 
               | But whether or not it can "do maths" to your definition
               | depends very much on what you want it to do, and how you
               | define "do maths". To me it's irrelevant if it's doing
               | the low-level calculations as long as it knows how to
               | express them as code. If I wanted a calculator I'd use a
               | calculator. And I don't consider a calculator able to "do
               | math" just because it can precisely add numbers.
               | 
               | Meanwhile I've had lengthy discussions with GPT about
               | subjects like orbital mechanics and calculating
               | atmospheric effects where it correctly used maths that I
               | had to double-check not because I didn't trust GPT
                | (though I _also_ wanted to verify for that reason) but
               | because I didn't know the maths (not that it was anything
               | particularly advanced, but I lost interest in maths
               | during my CS degree and picked the minimum amount of
               | maths I could get away with).
               | 
               | By _my_ definition it can  "do maths" just fine. I guess
               | you don't consider my view of that "reasonable". I can
               | live with that, as meanwhile, it will keep doing maths
               | for me when I need it.
               | 
               | Of course this was also a case of moving the goalposts to
               | set up a strawman - in the comment of yours I replied to,
               | you claimed it couldn't reliably _add two numbers_.
        
               | dns_snek wrote:
               | It often fails at basic 3-4 digit arithmetic. If you're
               | stretching that definition far enough to claim that GPT4
               | can "do math" then I should be able to call myself a
               | commercial pilot because I can land a plane in a sim 20%
               | of the time.
               | 
               | I'm not moving goalposts, the original claim was that
               | LLMs can "do math". Primary school arithmetic is math.
               | 
                | GPT-4 can't do math and that's _okay_; I don't
               | understand why so many of you are so touchy and defensive
               | about this. It's a limitation that exists, nothing more,
               | nothing less.
        
               | int_19h wrote:
               | GPT-4 is a tiny subset of "LLMs".
               | 
               | If you train a model to do math (and optimize
               | representation for that), it'll do math. GPT-4 just
               | isn't, and, generally speaking, they aren't, because it's
               | much more efficient to train them to "use a calculator".
               | Same as with humans.
        
               | imtringued wrote:
               | You do realize that arithmetic is a very simple symbolic
               | manipulation task? All you have to do is keep track of
               | the carry. I haven't seen an LLM that couldn't get digit
               | by digit addition done, but they always mess up the
               | carry.
        
               | vidarh wrote:
                | Just like humans. Try to get _regular people_ to add,
                | e.g., 15-16 digit numbers (which is typically where I'd
                | see GPT4 start to get "sloppy" unless you prompt it
                | the way you would a child who's learning and is still
                | prone to get annoyed and wonder why the hell you make
                | them do it manually), and see how many start making
                | mistakes.
               | 
                | I find it really comical that this is what people
                | complain about GPT over - there's zero benefit to
                | getting LLMs good at this over other tasks. To the
                | extent we get it "for free" as a benefit of other
                | learning, sure; but when we make kids practice this
                | _over and over again_ to drill doing it without
                | getting sloppy, it has traditionally been out of some
                | belief that it's important. A computer will always
                | have a "calculator" at its disposal that is far more
                | efficient than the LLM, and it's idiocy to care about
                | whether it does that part the tedious and hard way or
                | knows how to describe the problem to a more efficient
                | tool.
               | 
                | I also find it comical that people use tasks where LLM
                | behaviour is, if anything, most human-like - in its
                | tendency to lose focus and start taking shortcuts when
                | presented with stupidly repetitive tasks (before GPT4
                | started writing Python instead, it'd for a while try
                | _really_ hard not to give you a step by step breakdown
                | and would instead clearly take shortcuts even if you
                | prompted it heavily to reason through it step by step)
                | - as examples of how they're not good enough.
        
               | mikewarot wrote:
               | GPT-x can't add, or subtract, or do anything else of the
               | type... it can APPEAR to do so, because that's what it
               | was built to do.... act like the text it's seen
               | previously and predict what the next text would be.
               | 
               | If you include a large amount of properly solved math in
               | its training text, it gets MUCH better at that kind of
               | math.
               | 
               | It has a very deep set of intelligences that are alien to
               | us, that allow it to predict and ACT LIKE us, when it
               | comes to generating the next word. You're only seeing the
               | output of those intelligences through a very lossy
               | channel.
               | 
               | As a side note, there are structures in human language
                | that apparently encode much more information than you
                | might think at first glance. The fact that Word2Vec had
                | such mathematical properties, despite its relative
                | simplicity, astounds me to this day. Throwing a bunch of
               | sine/cosine values on top of that to represent position
               | in a sentence to enable LLMs is also amazing in that it
               | works.
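                | 
                | For reference, that sine/cosine trick is roughly the
                | standard transformer positional encoding (sketched
                | from memory, so treat it as approximate):
                | 
                |     import numpy as np
                | 
                |     def sin_cos_positions(seq_len, d):
                |         pos = np.arange(seq_len)[:, None]
                |         i = np.arange(d // 2)[None, :]
                |         ang = pos / 10000 ** (2 * i / d)
                |         pe = np.zeros((seq_len, d))
                |         pe[:, 0::2] = np.sin(ang)  # even dims
                |         pe[:, 1::2] = np.cos(ang)  # odd dims
                |         return pe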
        
               | ekianjo wrote:
               | most open models do it poorly though. ChatGPT is better
               | at it.
        
               | lovasoa wrote:
                | - Hey ChatGPT! What is 69*94?
                | 
                | - The result of 69*94 is 6466.
        
               | cooper_ganglia wrote:
               | This comment reminded me of that scene in Indiana Jones
               | where the guy is spinning the sword around about to
               | attack Indy, and then Indy just pulls out his pistol and
               | shoots him.
        
             | samatman wrote:
              | I agree with your basic thesis here; retrospection will
              | view LLMs as a transitional architecture.
             | 
             | However, this paper is evidence that the field is figuring
              | out how to build what's actually needed, which is a good
             | thing.
        
             | ordu wrote:
             | _> the modern civilized world could only emerge --for
             | better or for worse-- when Western Europe could free itself
             | from the fetters of medieval scholasticism_
             | 
             | I can propose an alternate view of things. Not that I'm
             | going to argue that it is the only true statement in the
             | world, but I think it is necessary for a thought to
             | progress to have an alternative hypothesis.
             | 
              | So the proposition is: formal symbolisms can deal only
              | with those problems that were already solved in imprecise
              | human languages.
             | 
              | To invent calculus and orbital mechanics you first need
              | to talk for several centuries (or thousands of years?)
              | about what position and velocity are, you need to talk
              | your way up to acceleration, and then you need to find a
              | way to measure them and to define them in strict
              | geometric terms. Ah, and infinity: it was a very
              | counter-intuitive idea; Zeno invented some of his
              | paradoxes specifically to point at its
              | counter-intuitiveness. When Newton came, all these talks
              | and debates had done most of the work for him.
             | 
             |  _> the ability to understand the human tongue is
             | insignificant compared to the power of math._
             | 
              | But the fun part is: you cannot know if someone
              | understands math if they do not understand human
              | language too. You cannot teach math to those who cannot
              | speak a human language.
              | 
              | Math is a cream on top, with limited applicability. What
              | can math say about love? I do not like to sound like
              | Dumbledore, but really, behind all we do there are
              | emotions motivating us. Math cannot deal with emotions,
              | because it was built that way _and_ because non-math
              | talk about emotions hasn't produced a good model of
              | emotions which math could express in a formalized
              | language.
             | 
             |  _> Dijkstra says _
             | 
              | I wonder when he said it? Before expert systems based on
              | logic were acknowledged in AI to be a failure, or after
              | that?
        
           | declaredapple wrote:
           | I'll agree with you, and add that inference speed is a big
           | factor too.
           | 
            | SDXL-Lightning/Cascade can generate images in 200ms, which
            | is fast enough to fit in a web request, and paradoxically
            | makes it even cheaper to generate.
            | 
            | And using Groq at 500 t/s is wild compared to any of the
            | other platforms.
        
             | pennomi wrote:
             | 500 t/s is uncomfortably fast to me. Generating high
             | quality answers at speeds faster than I can read is the
             | point at which I feel like LLMs are magic.
             | 
             | I'm glad people are doing it though, and I'll happily adapt
             | to accessing inference at that speed.
        
               | azinman2 wrote:
               | That's important for new applications to emerge where
               | this happens on lots of data. You can't run LLMs at scale
               | on tasks like Google might (every webpage) when the cost
               | of each document is so high to process. Interactive
               | chatbots are just the tip.
        
           | gitfan86 wrote:
            | That is the plan. Even if these independent software
            | improvements don't create 10x gains, NVDA and others are
            | making huge improvements.
        
         | wongarsu wrote:
         | I wouldn't be surprised if this causes hardware startups to pop
         | up that build accelerator cards tuned for this architecture. It
         | seems stupidly simple to do inference in hardware, and with
         | most of the training being quantized as well you might even be
         | able to provide speedups (and energy savings) for training with
         | reasonable investment and on cheaper processor nodes than what
         | Nvidia is using.
         | 
         | Sure, Nvidia might eat their lunch in a couple of years, but
         | bitcoin ASICs prove that you can have a niche producing
         | specialized processors, and VCs would probably jump at the
         | thought of disrupting Nvidia's high margin business.
        
           | anon291 wrote:
           | There's like a million startups promising analog / bit-level
           | computation, inference-only, cheap computation.
           | 
           | There's rain.ai, d-matrix, etc.
        
         | btbuildem wrote:
         | If this dethrones Nvidia, it would be a wonderful side effect
        
           | rafaelero wrote:
            | It's more likely that Nvidia will offer support for INT2
            | in the next generation and keep their dominance.
        
             | Klipper3 wrote:
              | INT2 ternary is equivalent to INT1 + a binary mask.
              | Nvidia supported INT1 matrix multiply in the RTX20 and
              | RTX30 generations, nobody used it, so they removed INT1
              | support from the RTX40 generation.
        
       | raghavtoshniwal wrote:
       | Sooo, short Nvidia?
        
         | MadDemon wrote:
         | Depends if this results in more efficient models or simply
         | larger, more capable models.
        
           | wongarsu wrote:
           | In both cases this is a prime opportunity for anyone to
           | disrupt Nvidia. They are in this market position in large
           | part because both video games and neural networks do a lot of
           | highly parallel floating point math, especially matrix
           | multiplication. This model architecture doesn't do any of
           | that.
           | 
           | Of course it should be fairly simple for Nvidia to add
           | special silicon and instructions for two-bit addition to a
           | future generation of their cards. But it'll take a while
           | because they already have a roadmap and preexisting
           | commitments. And any competitor doesn't have to copy
           | everything Nvidia does to make floating point numbers go
           | fast, they can just focus on making two-bit data handling and
           | addition go fast.
        
         | sebzim4500 wrote:
         | These still run on GPUs
        
           | leroman wrote:
            | - we have llama.cpp (it could be enough, or at least, as
            | mentioned in the paper, a co-processor to accelerate the
            | calculation could be added; less need for large RAM /
            | high-end hardware)
            | 
            | - as most work is inference, there might not be a need for
            | as many GPUs
            | 
            | - consumer cards (24GB) could possibly run the big models
        
             | sebzim4500 wrote:
             | If consumer cards can run the big models, then datacenter
             | cards will be able to efficiently run the really big
             | models.
        
               | leroman wrote:
               | Some tasks we are using LLMs for are performing very
                | close to GPT-4 levels using 7B models, so it really
                | depends on what value you are looking to get.
        
           | londons_explore wrote:
            | GPUs aren't yet awfully efficient at 1-bit math.
            | 
            | I could imagine FPGA designs might be competitive.
            | 
            | And dedicated ASICs would almost certainly beat both by a
            | decent margin.
        
             | sebzim4500 wrote:
             | I'm very unconvinced that ASICs are better suited for this
             | than for FP16/FP8 models that are being used today.
        
               | londons_explore wrote:
                | FP16 is a pretty big unit in an ASIC - you need at
                | least 9 * 5 gates to calculate the exponent of the
                | result, a 10 bit barrel shifter (10*10 +
                | 10*ceil(log2(10)) gates), and a 10 bit multiplier
                | (approximately 10 * 10 * 9 gates).
                | 
                | Total = 1085 gates. The reality is probably far more,
                | because you're going to want to use carry-look-ahead
                | adders and pipelining.
                | 
                | Whereas 1-bit multiplies and adds into, say, a 16 bit
                | accumulator use... 16 gates! (And probably half that,
                | since you can probably use scheduling tricks to skip
                | past the zeros, at the expense of variable latency...)
                | 
                | So when 1-bit math uses only 1/100th of the silicon
                | area of 16-bit math, and according to this paper gets
                | the same results, the future is clearly silicon that
                | can do 1-bit math.
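                | 
                | Spelling that rough count out:
                | 
                |     exp_add  = 9 * 5           # 45
                |     shifter  = 10*10 + 10*4    # 140
                |     multiply = 10 * 10 * 9     # 900
                |     print(exp_add + shifter + multiply)  # 1085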
        
             | int_19h wrote:
             | I don't think it would be difficult to make them efficient.
             | 
             | The main reason why we run this stuff on GPUs is their
             | memory bandwidth, anyway.
        
         | etiam wrote:
          | Hardly for this reason, but it does look suspiciously high,
          | doesn't it?
        
       | leroman wrote:
       | Can someone versed in the ways of math explain how this is
       | different from previous quantization methods?
       | 
        | And specifically, seeing how going from FP16 to 8-bit mostly
        | gives the same perplexity while anything further seems to lose
        | quality / dumb down the model, how is this even less precise
        | method able to achieve this?
        
         | IanCal wrote:
         | It's not quantising existing models, they're training new ones.
        
           | leroman wrote:
            | I understand that part, but it seemed that 16->8->4 etc.
            | is similar to compressing the "net" and seemed to lower
            | quality below 8 bits.
        
         | TheCoreh wrote:
          | If I understand it correctly, this seems to be more than
          | just quantizing: the models are apparently trained in this
          | format as well. So it's possible that the many layers adjust
          | themselves in a way that "cancels out" the inaccuracies of
          | the lower bit count.
        
       | llm_trw wrote:
       | So are there any details on the algorithms they used for
       | backprop? I'm not seeing any in the paper other than "we used a
       | lot of tokens".
        
         | IanCal wrote:
         | Does this help? https://arxiv.org/abs/2310.11453
         | 
         | It seems to have more details (it's the paper before the linked
         | one) about the actual training, but I'm scanning it and this
         | isn't my field so maybe it's too light also.
        
           | llm_trw wrote:
            | Not really, that's for the binary version of the
            | algorithm; the ternary version can propagate a lot more
            | information in the backwards pass using the fact that
            | outputs are either -1, 0, or 1.
           | 
           | But I imagine they are using the same thing since a bunch of
           | the authors are the same.
        
         | wongarsu wrote:
         | It's a fairly straightforward modification of BitNet, so I
         | assume this quote from the BitNet paper applies:
         | 
         | To train our 1-bit model, we employ the straight-through
         | estimator (STE)[BLC13 ] to approximate the gradient during
         | backpropagation. This method bypasses the non-differentiable
         | functions, such as the Sign (Eq. 2) and Clip (Eq. 5) functions,
         | during the backward pass. STE allows gradients to flow through
         | the network without being affected by these non-differentiable
          | functions, making it possible to train our quantized model.
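          | 
          | In PyTorch-ish terms, STE usually boils down to something
          | like this (a sketch of the general trick, not their code):
          | 
          |     import torch
          | 
          |     def ternary_ste(w, eps=1e-5):
          |         # forward: absmean quantization to {-1, 0, +1}*gamma
          |         gamma = w.abs().mean().clamp(min=eps)
          |         w_q = (w / gamma).round().clamp(-1, 1) * gamma
          |         # backward: the gradient passes through as if the
          |         # quantization were the identity function
          |         return w + (w_q - w).detach()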
        
       | sp332 wrote:
       | 1-bit LLMs remind me of a random forum post I read about SACD and
       | limitations of the 1-bit DSD audio format.
       | https://www.audiosciencereview.com/forum/index.php?threads/d...
       | Accumulating approximate values in one bit leads to being
       | "constantly overloaded", with any error correction overwriting
       | all of your real signal from the next step. I think this trinary
       | system might leave enough room to avoid this problem.
        
       | Alifatisk wrote:
       | If this paper (especially the results on Table 4) is true, then
       | this is a game changer!
        
       | osigurdson wrote:
       | I have often mused that, in some ways, it seems like the
       | transistor is really being wasted in AI applications. We use
       | binary states in normal computing to reduce entropy. In AI this
       | is less of a concern, so why not use more of the available
       | voltage range? Basically, re-think the role of the transistor and
       | re-design from the ground up - maybe NAND gates are not the ideal
       | fundamental building block here?
        
         | adrianN wrote:
          | You could call them Connection Machines and perhaps have an
          | LLM trained on Feynman help with the design.
        
         | sigmoid10 wrote:
         | People are working on that [1]. In some sense, it's a step back
         | to analog computing. Add/multiply is possible to do directly in
         | memory with voltages, but it's less versatile (and stable) than
         | digital computing. So you can't do _all_ calculations in a
         | neural network that way, meaning some digital components will
          | always be necessary. But I'm pretty sure analog will make a
         | comeback for AI chips sooner or later.
         | 
         | [1] https://www.nature.com/articles/s41586-023-06337-5
        
           | zcw100 wrote:
           | Reminds me of my father saying something about how vacuum
           | tubes are great integrators.
        
             | monocasa wrote:
             | Chips are too. Opamps can add, multiply, subtract, divide,
             | integrate and differentiate depending on how they're
             | plugged in.
        
               | klysm wrote:
               | Hence the name 'operational' amplifier
        
           | irrelative wrote:
            | Hadn't thought about it this way before, but given that
            | LLMs are autoregressive (they use their own output as the
            | next input), they're sensitive to error drift in ways that
            | are rather similar to analog computers.
        
           | trebligdivad wrote:
            | Ternary, however, is an interesting middle ground; people
            | built ternary hardware long ago. It feels like you could
            | make natively ternary hardware for something like this; it
            | might even be quite a win.
        
             | thsksbd wrote:
             | Can you make a "CMOS" three voltage level circuit though?
             | One where the only current flow is when the state changes?
             | 
             | Im not in this field but that's a question that's been
             | bugging me for a while. Off you can't do this wouldn't
             | energy consumption balloon?
        
             | int_19h wrote:
             | People haven't built _reliable_ ternary electronics,
             | though. Soviets tried with Setun, but they eventually had
             | to resort to emulating each trit with two hardware bits
             | (and wasting one state out of the possible four).
        
         | wakawaka28 wrote:
         | I have heard of people trying to build analog AI devices but
         | that seems like years ago, and no news has come out about it in
         | recent times. Maybe it is harder than it seems. I bet it is
          | expensive to regulate voltage so precisely, and it's not a
          | flexible enough scheme to support training neural networks
          | like we have now, which are highly reconfigurable. I've also
          | heard of people trying to use analog computing for more
          | mundane things. But no devices have hit the market after so
          | many years, so I'm assuming it is a super hard problem,
          | maybe even intractable.
        
           | osigurdson wrote:
           | Perhaps another variation on the idea is to allow a higher
           | error rate. For example, if a 0.01% error rate was acceptable
           | in AI, perhaps the voltage range between states could be
           | lowered (which has a quadratic relationship to power
           | consumption) and clock speed could increase.
        
         | loudmax wrote:
         | The Veritasium Youtube channel did a video about this about a
         | year ago: https://www.youtube.com/watch?v=GVsUOuSjvcg
         | 
         | They visit Texas company Mythic AI to discuss how they use
         | flash memory for machine learning. There's a California company
         | named Syntiant doing something similar.
        
         | gryn wrote:
          | The reason why digital/numeric processing won is the power
          | loss in the analog world. When you design an analog circuit,
          | the next processing stage you add at the end has an impact
          | on the ones before it.
          | 
          | This then requires higher skill from the
          | engineers/consumers.
          | 
          | If you want to avoid that you need to add op-amps with a
          | gain of 1 at the boundary of each stage; this also takes
          | care of the power loss at each stage.
          | 
          | The other part is that there's a limit to the amount of
          | useful information/computation you can do with analog
          | processing too, once you take voltage noise into account.
          | When you do a comparison there are cases where analog wins
          | but also cases where digital wins.
          | 
          | I'll edit this later with a link to some papers that discuss
          | these topics, if I manage to find them in my mess.
        
           | dazed_confused wrote:
           | Good explanation. When I was working at a semiconductor
           | manufacturer, our thresholds were like 0 - 0.2V to 0.8 -
           | 1.0V. Additionally, if you look at QLC SSDs, their longevity
           | is hugely degraded. Analog computing is non-trivial, to say
           | the least.
        
           | im3w1l wrote:
           | For the specific case of neural networks they seem to be very
           | resistant to noise. That's why quantization works in the
           | first place.
        
         | BlueTemplar wrote:
         | I have heard that the first commercial neural network chip (by
         | Intel, in the 90s) was analog ?
        
         | barrenko wrote:
         | Hmm, maybe some (signaling) inspiration from biology other than
         | neural signaling.
        
         | StableAlkyne wrote:
          | It would be something of a full circle, I feel, if we went
          | back to dedicated circuits for NNs - that's how they began
          | life when Rosenblatt built his Perceptron.
         | 
         | I remember reading a review on the history in grad school
         | (can't remember the paper) where the author stated that one of
         | the initial interests in NNs by the military was their
         | distributed nature. Even back then, people realized you could
         | remove a neuron or break a connection and they would still work
         | (and even today, dropout is a way of regularizing the network).
         | The thinking was that being able to build a computer or
         | automated device that could be damaged (radiation flipping
         | bits, an impact destroying part of the circuit, etc) and still
          | work would be an advantage given the perceived inevitability of
         | nuclear war.
         | 
         | Compared to a normal von Neumann machine which is very fault
         | intolerant - remove the CPU and no processing, no memory=no
         | useful calculation, etc. One reason people may have avoided
         | further attempts at physical neural networks is it's
         | intrinsically more complex than von Neumann, since now your
          | processing and memory are intertwined (the NN is the processor
         | and the program and the memory at the same time).
        
           | kurisufag wrote:
           | >von Braun machine
           | 
           | von neumann? though it is funny to imagine von braun
           | inventing computer architecture as a side hustle to inventing
           | rocket science.
        
             | StableAlkyne wrote:
             | Oh fuck, thanks for catching that!
        
         | the8472 wrote:
         | Bits are copyable without data loss. Analog properties of
         | individual transistors are less so.
        
         | seydor wrote:
         | let's use cells
        
           | Razengan wrote:
           | We already do.
        
         | drexlspivey wrote:
         | Next Up: Quantum AI
        
         | mikewarot wrote:
         | >maybe NAND gates are not the ideal fundamental building block
         | here?
         | 
         | It's my long held opinion that LUTs (Look Up Tables) are the
         | basis of computation for the future. I've been pondering this
         | for a long time since George Gilder told us that wasting
         | transistors was the winning strategy. What could be more
         | wasteful than just making a huge grid of LUTs that all
         | interconnect, with NO routing hardware?
         | 
         | As time goes by, the idea seems to have more and more merit.
         | Imagine a grid of 4x4 bit look up tables, each connected to its
         | neighbors, and clocked in 2 phases, to prevent race conditions.
         | You eliminate the high speed long lines across chips that cause
         | so much grief (except the clock signals, and bits to load the
         | tables, which don't happen often).
         | 
          | What you lose in performance (in terms of latency), you make
          | up for with a homogeneous architecture that is easy to think
          | about, can route around bad cells, and can be compiled to
          | almost instantly, thanks to the lack of special cases. You
          | also never have to worry about latency; it's constant.
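          | 
          | A toy simulation of the idea, under my own reading of a
          | "grid of 4x4-bit LUTs with two-phase clocking" (purely
          | hypothetical, just to make the scheme concrete):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     H = W = 8
          |     # per cell: a 16-entry LUT mapping 4 input bits
          |     # (one from each neighbour) to 4 output bits
          |     luts = rng.integers(0, 16, (H, W, 16))
          |     state = rng.integers(0, 16, (H, W))
          | 
          |     def step(phase):
          |         # two-phase "checkerboard" clocking: only half
          |         # the cells latch per tick, so the neighbours
          |         # they read from stay stable
          |         n = np.roll(state, 1, 0) & 1
          |         s = np.roll(state, -1, 0) & 1
          |         e = np.roll(state, -1, 1) & 1
          |         w = np.roll(state, 1, 1) & 1
          |         idx = (n << 3) | (e << 2) | (s << 1) | w
          |         out = np.take_along_axis(
          |             luts, idx[..., None], 2)[..., 0]
          |         y, x = np.indices((H, W))
          |         mask = (y + x) % 2 == phase
          |         state[mask] = out[mask]
          | 
          |     for t in range(10):
          |         step(t % 2)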
        
           | phdelightful wrote:
           | It's been a long time since I worked on FPGAs, but it sounds
           | like FPGAs! What do you see as the main differences?
        
             | mikewarot wrote:
              | No routing, no fast lines that cut across the chip;
              | those cut way down on latency, but they make FPGAs
              | harder to build, and especially hard to compile for once
              | you want to use them.
             | 
             | All that routing hardware, and the special function units
             | featured in many FPGAs are something you have to optimize
             | the usage of, and route to. You end up with using solvers,
             | simulated annealing, etc... instead of a straight compile
             | to binary expressions, and mapping to the grid.
             | 
             | Latency minimization is the key to getting a design to run
             | fast in an FPGA. In a BitGrid, you know the clock speed,
             | you know the latency by just counting the steps in the
             | graph. BitGrid performance is determined by how many
             | answers/second you can get from a given chip. If you had a
              | 1 GHz rack of BitGrid chips that could run GPT-4, with a
              | latency of 1 ms per token, you'd think that was horrible,
             | but you could run a million such streams in parallel.
        
       | londons_explore wrote:
       | Powers of 3 don't pack well into binary memory...
       | 
        | A 1-bit multiplier in silicon is a single logic gate, but a
        | ternary decoder to decode a packed tri-state 'weight' is
        | bigger.
        | 
        | I therefore suspect that this method will be extended to make
        | all weights simply 1 or 0 (i.e. binary). Perhaps that will be
        | done by having half the weights take values 1 or 0, while the
        | other half are -1 or 0.
        
         | baq wrote:
         | You can build dedicated silicon with ternary gates:
         | https://medium.com/@rxseger/exploring-ternary-logic-tnand-an...
         | 
          | Not sure if it's more efficient than just binary digital
          | circuits in a highly integrated chip, though.
        
           | samatman wrote:
           | It's optimal if your program is naturally ternary, which this
           | one is. Using three signals, rather than ternary gates, is
           | less effective, because you need much more precision to
           | detect two different voltage levels rather than just up and
           | down.
        
         | tromp wrote:
         | 5 trits fit into 1 byte pretty well, since 3^5 = 243 is just
         | under 2^8 = 256.
         | 
         | That should be called an 8/5 = 1.6 bit model though, while the
         | paper names it 1.58 bit, closer to log_2(3) ~ 1.5849625
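          | 
          | A quick sketch of the 5-trits-per-byte packing (base-3
          | digits, with {-1, 0, +1} offset to {0, 1, 2}):
          | 
          |     def pack5(trits):
          |         # 5 trits -> one value in 0..242
          |         n = 0
          |         for t in trits:
          |             n = 3 * n + t + 1
          |         return n
          | 
          |     def unpack5(n):
          |         out = []
          |         for _ in range(5):
          |             out.append(n % 3 - 1)
          |             n //= 3
          |         return out[::-1]
          | 
          |     t = [1, -1, 0, 0, 1]
          |     assert unpack5(pack5(t)) == t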
        
           | JKCalhoun wrote:
            | Would be nice to have hardware instructions that work on 5
            | trits natively.
        
           | londons_explore wrote:
           | But the decoder for that will be 25+ gates, which is huge
           | compared to the handful of gates to use the resulting
           | weights.
        
         | fabmilo wrote:
          | Can't you have 2 bits? First bit for the sign, second bit
          | for the magnitude: you can represent -1, +1, +0, -0.
        
       | K0IN wrote:
        | When can we expect the first ~100+ million parameter models to
        | run on a Raspberry Pi Pico?
        
       | Klipper3 wrote:
       | The theoretical capacity of a binary network is 69% of the
        | capacity of a full-weight network, so it makes sense that LLMs
        | would converge to 1-bit networks in the long term.
       | 
       | It's nice to finally see practical networks reach the theoretical
       | limits found in the statistical mechanics of Ising models. A good
       | pointer to efficient 1-bit training, from the statistical
       | mechanics point of view, is here:
       | 
       | https://www.pnas.org/doi/full/10.1073/pnas.0700324104
        
         | arunk47 wrote:
          | What is stopping us right now from doing these one-bit
          | networks?
        
       | ulnarkressty wrote:
       | Take this with a grain of salt until someone reproduces it.
       | Improvements such as these require extraordinary evidence. Not to
       | mention extreme quantization has been tried before.
        
       | joelthelion wrote:
       | Assuming this is confirmed, what's the impact on training?
       | 
       | Inference is definitely an issue for LLMs right now. But if
       | training were suddenly possible for lone hackers (or maybe
       | smaller companies), it would open up a lot of new possibilities
       | as well.
        
       | stormfather wrote:
       | How does backprop work here? I can't imagine flipping bits of
       | everything upstream of an error is effective.
        
         | joelthelion wrote:
         | (haven't read the paper). Maybe you can flip bits with a
         | probability distribution that depends on the gradient?
        
           | stormfather wrote:
           | That's an interesting idea! Would love to try that on MNIST
           | one day.
        
         | spyder wrote:
         | From the BitNet paper:
         | 
         |  _" Straight-through estimator. To train our 1-bit model, we
         | employ the straight-through estimator (STE)[BLC13] to
         | approximate the gradient during backpropagation. This method
         | bypasses the nondifferentiable functions, such as the Sign (Eq.
         | 2) and Clip (Eq. 5) functions, during the backward pass. STE
         | allows gradients to flow through the network without being
         | affected by these non-differentiable functions, making it
         | possible to train our quantized model."_
         | 
         | also the author's (@shumingma) answer in the comments:
         | https://huggingface.co/papers/2402.17764#65df17ed4d436404cdc...
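          | 
          | In PyTorch-style pseudocode the STE trick boils down to one
          | line: quantize in the forward pass, but let gradients flow as
          | if the quantization were the identity. A rough sketch (not the
          | paper's exact code; `scale` is a placeholder for whatever
          | scaling factor the quantizer uses):
          | 
          |     import torch
          | 
          |     def ste_ternary(w, scale):
          |         # Forward: project to ternary {-1, 0, 1} * scale.
          |         w_q = scale * torch.clamp(torch.round(w / scale), -1, 1)
          |         # Backward: pretend the projection was the identity,
          |         # so gradients skip the non-differentiable round/clamp.
          |         return w + (w_q - w).detach()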
        
       | alexey-salmin wrote:
       | Also from Microsoft in 2021: Make Every feature Binary: A 135B
       | parameter sparse neural network for massively improved search
       | relevance [1]
       | 
       | [1] https://www.microsoft.com/en-us/research/blog/make-every-
       | fea...
        
       | yousif_123123 wrote:
       | Any models published as well?
        
         | jonbaer wrote:
          | I really can't tell, but it seems to be a continuation of this
          | work if I read the To-Dos correctly - what do you think? Here
          | it seems to be 1-bit on just the transformer:
         | https://huggingface.co/shi3z/BitNetWikipedia110M
        
       | naasking wrote:
       | Interesting return to ternary. Effectively, each weight says only
       | whether it's correlated (+1), uncorrelated (0), or anti-
       | correlated (-1) with the input, and the structure of the network
       | is the actual computation over that information.
        
       | wenyuanyu wrote:
       | I wonder how the training process works...
        
       | wenyuanyu wrote:
        | If this turns out to be true, it could indeed be a game
       | changer... Given the advanced AI chip shortage... Also, for the
       | chip ban on China...
        
       | rafaelero wrote:
       | Looks like we have finally rediscovered a biological neuron.
        
         | bilsbie wrote:
         | How so?
        
           | rafaelero wrote:
           | They propagate information in a binary way (either they
           | activate or not).
        
       | cs702 wrote:
       | There are two findings I find _shocking_ in this work:
       | 
       | * In existing LLMs, we can replace all parameter floating-point
       | values representing real numbers with ternary values representing
       | (-1, 0, 1).
       | 
       | * In matrix multiplications (e.g., weights by vectors), we can
       | replace elementwise products in each dot product (a1b1 + a2b2
       | ...) with elementwise _additions_ (a1+b1 + a2+b2 ...), in which
       | signs depend on each value. See the paper for exact details.
       | 
       | On existing hardware, the gains in compute and memory efficiency
       | are significant, without performance degradation (as tested by
       | the authors).
       | 
       | If the proposed methods are implemented in hardware, we will see
       | _even greater gains_ in compute and memory efficiency.
       | 
       | Wow.
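        | 
        | To make the second point concrete, here's one way to picture the
        | multiplication-free dot product: every ternary weight either
        | adds, subtracts, or skips its activation (a toy sketch, not the
        | paper's kernel):
        | 
        |     def ternary_dot(weights, acts):
        |         # weights in {-1, 0, 1}; acts are (quantized) activations
        |         total = 0
        |         for w, a in zip(weights, acts):
        |             if w == 1:
        |                 total += a      # correlated: add
        |             elif w == -1:
        |                 total -= a      # anti-correlated: subtract
        |             # w == 0: skip entirely
        |         return total
        | 
        |     assert ternary_dot([1, 0, -1], [5, 9, 2]) == 3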
        
         | creshal wrote:
         | > * In existing LLMs, we can replace all parameter floating-
         | point values representing real numbers with ternary values
         | representing (-1, 0, 1).
         | 
         | Why is this so shocking? Quantization has been widely explored,
         | driving that to its extreme (and blowing up parameter count to
         | make up for it) just seems like a natural extension of that.
         | 
         | Easier said than done, of course, and very impressive that they
         | pulled it off.
         | 
         | > In matrix multiplications (e.g., weights by vectors), we can
         | replace elementwise products in each dot product (a1b1 + a2b2
         | ...) with elementwise additions (a1+b1 + a2+b2 ...), in which
         | signs depend on each value
         | 
         | I feel like this follows naturally from having only ternary
         | values, multiplication doesn't really bring much to the table
         | here. It's a bit surprising that it's performing so well on
         | existing hardware, usually multiplication hardware sees more
         | optimization, especially for GPGPU hardware.
        
           | cs702 wrote:
           | _> Why is this so shocking? Quantization has been widely
           | explored, driving that to its extreme (and blowing up
           | parameter count to make up for it) just seems like a natural
           | extension of that._
           | 
           | I find it shocking that we don't even need lower floating-
            | point precision. _We don't need precision at all_. We only
           | need three symbols to represent every value.
           | 
           |  _> I feel like this follows naturally from having only
            | ternary values, multiplication doesn't really bring much to
           | the table here. It's a bit surprising that it's performing so
           | well on existing hardware, usually multiplication hardware
           | sees more optimization, especially for GPGPU hardware._
           | 
           | I find it shocking. Consider that associative addition over
           | ternary digits, or trits, represented by three symbols
           | (a,b,c) has only three possible input pairs, (a,b), (a,c), or
           | (b,c) (within each pair, order doesn't matter), and only
           | three possible outputs, a, b, or c. Matrix multiplications
           | could be executed via crazy-cheap tritwise operations in
           | hardware. Maybe ternary hardware[a] will become a thing in
           | AI?
           | 
           | ---
           | 
           | [a] https://en.wikipedia.org/wiki/Ternary_computer
        
             | jerf wrote:
             | An integer is just a concatenation of bits. Floating point
             | appears more complicated but from an information theory
             | perspective it is also just a concatenation of bits. If,
             | for the sake of argument, one replaced a 64-bit int with 64
             | individual bits, that's really the same amount of
             | information and a structure could hypothetically then
             | either choose to recreate the original 64-bit int, or use
             | the 64-bits more efficiently by choosing from the much
             | larger set of possibilities of ways to use such resources.
             | 
             | Trits are helpful for neural nets, though, since they
             | really love signs and they need a 0.
             | 
             | So from the perspective that it's all just bits in the end
             | the only thing that is interesting is how useful it is to
             | arrange those bits into trits for this particular
             | algorithm, and that the algorithm seems to be able to use
             | things more effectively that way than with raw bits.
             | 
              | This may seem an absolutely bizarre zigzag, but I am
              | reminded of Busy Beavers, because of the way they take the
              | very small primitives of a Turing Machine, break them down
              | to the smallest pieces, then combine them in ways that
              | almost immediately cease to be humanly comprehensible.
             | Completely different selection mechanism for what appears,
             | but it turns out Turing Machine states can do a lot "more"
             | than you might think simply by looking at human-designed
             | TMs. We humans have very stereotypical design methodologies
             | and they have their advantages, but sometimes just letting
             | algorithms rip can result in much better things than we
             | could ever hope to design with the same resources.
        
               | cs702 wrote:
                | _> So from the perspective that it's all just bits in
               | the end the only thing that is interesting is how useful
               | it is to arrange those bits into trits for this
               | particular algorithm, and that the algorithm seems to be
               | able to use things more effectively that way than with
               | raw bits._
               | 
               | Thank you. I find many other things interesting here,
               | including the potential implications for hardware, but
               | otherwise, yes, I agree with you, _that_ is interesting.
        
               | SkyBelow wrote:
               | This sort of breakdown also reminds me of the explanation
               | of why busy beavers grow faster than anything humans can
                | ever define. Anything a human can define is a finite
                | number of steps that can be represented by some Turing
                | machine of size M. A Turing machine of size N > M can
                | then use M as a subset of it, growing faster than
                | the Turing machine of size M. Either it is the busy
                | beaver for size N, or it grows slower than the busy
                | beaver for size N. Either way, the busy beaver for size N
                | grows faster than whatever the human defined that was
                | captured by the Turing machine of size M. This
                | explanation was what helped me understand why the busy
                | beaver function grows faster than any operator that can
                | be formally defined (obviously you can define an operator
                | that references busy beaver itself, but busy beaver can
                | be considered to not be formally defined, and thus any
                | operator defined using it isn't formally defined either).
               | 
               | The bit about floating point numbers just being a
               | collection of bits interpreted in a certain way helps
               | make sense why a bigger model doesn't need floating
               | points at all.
        
             | jxy wrote:
             | The matrices (weights) are ternary.
             | 
             | The vectors are not.
        
               | cs702 wrote:
                | The _activations_ are in (-1, 1), so they're also
                | representable by (-1, 0, 1).
        
           | satellite2 wrote:
           | Because it's no longer a linear optimization or curve fitting
           | problem. It becomes a voting or combinatorial problem. Which
           | at least in my mind are two completely different areas of
           | research.
        
             | HPsquared wrote:
             | With enough parameters, it probably starts looking
             | continuous again. Like how in physics everything is
             | quantised at the smallest scale but if you put enough atoms
             | together it all smooths out and behaves "classically".
        
               | amelius wrote:
               | Yes, but we can simulate classical physics using
               | mathematical shortcuts. Simulating every little atom
               | would take a lot more work.
        
           | ncruces wrote:
           | Well I guess it's the "blowing up parameter count to make up
           | for it" that confuses me, but maybe it's just ignorance.
           | 
            | Like what would be the expected factor of this blow-up to
            | make up the difference between ternary and whatever 16-bit
            | encoding they were using?
           | 
           | I mean intuitively I'd expect to need ~10x the symbols to
           | encode the same information? Are they using an order of
           | magnitude more parameters, or is that not how it works?
        
             | int_19h wrote:
             | With existing common quantization techniques, a 70b model
             | quantized to 3-bit still drastically outperforms an
             | unquantized 35b model.
        
         | nutanc wrote:
         | We have been experimenting with the paper(https://www.researchg
         | ate.net/publication/372834606_ON_NON-IT...).
         | 
         | There is a mathematical proof that binary representation is
         | enough to capture the latent space. And in fact we don't even
         | need to do "training" to get that representation.
         | 
         | The practical application we tried out for this algorithm was
         | to create an alternate space for mpnet embeddings of Wikipedia
         | paragraphs. Using Bit embedding we are able to represent 36
         | million passages of Wikipedia in
         | 2GB.(https://gpt3experiments.substack.com/p/building-a-vector-
         | dat...)
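          | 
          | For readers who haven't seen bit embeddings before, the generic
          | idea (which is *not* the non-iterative method from the paper
          | above, just the simplest possible illustration) is to keep only
          | the sign of each embedding dimension and compare vectors by
          | Hamming distance:
          | 
          |     import numpy as np
          | 
          |     def binarize(emb):
          |         # Keep only the sign of each dim; pack 8 signs per byte
          |         return np.packbits((emb > 0).astype(np.uint8), axis=-1)
          | 
          |     def hamming(a, b):
          |         # differing signs = popcount of the XOR-ed bytes
          |         return int(np.unpackbits(a ^ b).sum())
          | 
          |     e1, e2 = np.random.randn(2, 768).astype(np.float32)
          |     b1, b2 = binarize(e1), binarize(e2)  # 96 bytes per vector
          |     print(hamming(b1, b2))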
        
           | cs702 wrote:
           | You're talking about mapping floating-point vector
           | representations, i.e., embeddings, computed by a pretrained
           | LLM to binary vector representations, right? And you're
            | talking about doing this by first having _someone else_'s
           | pretrained LLM compute the embeddings, right? Sorry, but that
           | seems only minimally, tangentially related to the topic of
           | running LLMs in ternary space. I don't see how your comment
           | is relevant to the discussion here.
        
             | nutanc wrote:
             | Yeah, sorry, needed a much bigger canvas than a comment to
             | explain. Let me try again. The example I took was to show
             | mapping from one space to another space and it may have
             | just come across as not learning anything. Yes. You are
             | right it was someone else's pretrained LLM. But this new
             | space learnt the latent representations of the original
             | embedding space. Now, instead of the original embedding
             | space it could also have been some image representation or
             | some audio representation. Even neural networks take input
             | in X space and learn a representation in Y space. The paper
             | shows that any layer of a neural network can in fact be
             | replaced with a set of planes and we can represent a space
             | using those planes and that those planes can be created in
             | a non iterative way. Not sure if I am being clear, but have
             | written a small blog post to show for MNIST how an NN
             | creates the planes(https://gpt3experiments.substack.com/p/u
              | nderstanding-neural-...). I will write more on how, once
              | these planes are drawn, we can use a bit representation
              | instead of floating point values to get similar accuracy in
              | prediction, and next how we can draw those planes without
              | the iterative training process.
        
           | SushiHippie wrote:
           | Wow, this works better than I would've thought.
           | 
           | > Who moderates Hacker News?
           | 
           | First result:
           | 
           | > Hacker News
           | 
           | > At the end of March 2014, Graham stepped away from his
           | leadership role at Y Combinator, leaving Hacker News
           | administration in the hands of other staff members. The site
           | is currently moderated by Daniel Gackle who posts under the
           | username "dang".
        
           | m3kw9 wrote:
           | How is this not lossy compression?
        
             | rf15 wrote:
             | It kind of is!
        
             | sandyarmstrong wrote:
             | LLMs and vector embeddings are always lossy compression,
             | yes?
        
             | baq wrote:
             | kind of related:
             | https://medium.com/@heinrichpeters/commentary-gzip-knn-
             | beats...
        
           | fabmilo wrote:
           | I find this extremely interesting. Do you share the source
           | code of the process? any more references?
        
         | rhaps0dy wrote:
         | I think you need more evidence than this paper (which is very
         | short and light on actual numbers) to be this shocked.
         | 
         | For example, most of the plots in the paper are actually of
         | throughput, memory, etc. all performance characteristics that
         | are better on the ternary version. Which, of course.
         | 
         | The only thing that contains perplexities are Table 1 and 2.
         | There, they compare "BitNet b1.58 to our reproduced FP16 LLaMA
         | LLM in various sizes" on the RedPajama data set. The first
         | thing to note is the perplexities are very high: they're all at
         | least ~9.9, which compared for example with quantized Llama on
         | wikitext-2 which is 6.15
         | (https://www.xzh.me/2023/09/a-perplexity-benchmark-of-
         | llamacp...). Maybe RedPajama is a lot harder than wikitext-2,
         | but that's a big gap.
         | 
         | I think probably their benchmark (their "reproduced FP16 LLaMA
         | LLM") is just not very good. They didn't invest much in
         | training their baseline and so they handily beat it.
        
           | cs702 wrote:
           | Thank you. I think the paper as it is provides enough
           | evidence to support the claims. If I understand the authors
           | correctly, they trained the compared models on only 100B
           | tokens, all drawn from RedPajama, to make the comparisons
           | apples-to-apples. That's sensible. It allows for easier
           | replication of the results. Otherwise, I agree with you that
           | more extensive testing, after more extensive pretraining, is
           | still necessary.
        
         | lr1970 wrote:
          | The authors reported perplexity only for small models, up to 3B
          | weights. On the other hand, they reported throughput for the
          | 70B model, but not its performance (perplexity, end-to-end
          | tasks). A very unfortunate omission. Overall, the paper is
          | rather poorly written.
        
           | cs702 wrote:
           | If I understand the authors correctly, they trained the
           | compared models on only 100B tokens, all drawn from
           | RedPajama, to make the comparisons apples-to-apples. That's
           | sensible. It allows for easier replication of the results.
           | Otherwise, I agree with you that more extensive testing,
           | after more extensive pretraining, at larger model sizes, is
           | still necessary.
        
             | lr1970 wrote:
             | towards the end of the paper they mentioned training on 2T
             | tokens.
        
               | cs702 wrote:
               | You're right. Thank you for pointing that out.
        
         | beagle3 wrote:
          | I haven't been keeping tabs, but this seems very much like the
          | RIP / Achlioptas version of the Johnson-Lindenstrauss lemma.
         | 
         | Perhaps the rest of the JL lemma promise applies as well -
         | compressing the number of parameters by a few orders of
         | magnitude as well.
        
         | jandrese wrote:
         | It seems like the AI space is slowly coming back around to the
         | old Thinking Machines CM-1 architecture. It's not too often in
         | computing where you see ideas a full 40 years ahead of their
         | time make it into production.
        
           | giantrobot wrote:
           | IIUC the main issue with the CM-1 architecture was feeding
           | the processor cluster with data. That required a heftier
           | front end system than was practical/affordable at the time.
           | With modern CPUs and memory subsystems the GPUs can be
           | saturated pretty easily. So going back to huge clusters of
           | super narrow cores won't starve them for work.
        
         | abeppu wrote:
         | > On existing hardware, the gains in compute and memory
         | efficiency are significant, without performance degradation (as
         | tested by the authors).
         | 
         | Did they actually show absence of performance degradation?
         | 
         | I think it's conspicuous that Table 1 and Table 2 in the paper,
         | which show perplexity and accuracy results respectively, are
         | only for small model sizes, whereas Figure 2, Figure 3
         | (latency, memory, energy consumption) and Table 3 (throughput)
         | all show larger model sizes. So it seems like they had every
         | opportunity to show the perplexity/accuracy comparisons at the
         | larger model sizes, but did not include them.
        
           | cs702 wrote:
           | Others have already made the same point in this thread. See
           | my response here:
           | https://news.ycombinator.com/item?id=39539508
        
         | vessenes wrote:
         | I'd be VERRY cautious about being excited here.
         | 
         | My priors are like this:
         | 
         | 1. Initial training of a neural network moves all weights
         | around a large amount at first.
         | 
         | 2. Later training of the network adjusts them a small amount.
         | 
         | 3. An undertrained network will therefore look a lot like
         | figuring out "positive, negative, or 0?" for each node during
         | early training.
         | 
         | If all these things are true, then
         | 
         | 1. Early training of an fp16 network and a bitnet with 0 added
         | will be _roughly_ similar in results
         | 
         | 2. Later training will yield different / worse results, as the
         | network gets into the 'fine tuning' part of the training.
         | 
         | I think the paper's stats back these priors up -- they say
         | "this works on (3B+) large networks, but not small ones." They
         | then imply there's something about the structure of a large
         | network that allows a bitnet to do well. It seems more likely
         | to me it works on large networks because they have not put the
         | compute into 3B+ networks to get past the 'gross tuning' phase.
         | 
          | The networks they do have the compute to get 'fully' trained --
          | those networks don't show the same results.
         | 
         | Also, a quick reminder that Perplexity 12 is really terrible.
         | You would not want to use such a network. Hopefully I'm wrong
          | and we can get something for free here! But I'm cautious-to-
          | skeptical.
        
           | mise_en_place wrote:
           | Intuitively I've always been a bit skeptical of quantization.
           | Wouldn't there be a tiny loss in precision by doing this type
           | of quantization? I could imagine the error function
           | increasing by utilizing these types of techniques.
        
             | eightysixfour wrote:
             | It does increase the "error" (meaning it is less likely to
             | predict the next word when compared against a dataset) but
             | the losses are lower than your intuition would guide you to
             | believe.
        
             | int_19h wrote:
             | Quantization does reduce quality of the outputs. But the
             | point is that you save enough memory doing so that you can
             | cram a larger model into the same hardware, and this more
             | than compensates for lost precision.
        
             | spencerchubb wrote:
             | Yes each weight will not be able to "learn" as much if it
             | has less bits of precision. But the idea is that you can
             | use more weights, and the big question is whether these
             | low-precision weights can make the model more accurate, as
             | a whole.
        
             | thesz wrote:
              | John Carmack pointed out (and I learned it here at HN) that
              | what training really needs is the *sign* of each individual
              | gradient parameter. I.e., you can quantize the gradient to
              | -1, 0 and 1 and still have the neural network learn much of
              | the dataset.
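              | 
              | A tiny sketch of that idea (sign-only gradient updates, in
              | the spirit of signSGD; not necessarily what Carmack had in
              | mind):
              | 
              |     import torch
              | 
              |     def sign_sgd_step(params, lr=1e-3):
              |         # Update with only the sign of each gradient
              |         # entry, i.e. the gradient quantized to {-1,0,1}.
              |         with torch.no_grad():
              |             for p in params:
              |                 if p.grad is not None:
              |                     p -= lr * torch.sign(p.grad)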
        
           | svantana wrote:
           | Wait, are we reading the same paper? What I'm seeing is
           | comparable accuracy to unquantized models for <4B params, and
           | nothing reported for larger models except resource
           | consumption.
        
             | vessenes wrote:
             | Nope, you're right, I got the table inverted in my head.
             | I'm updating my top comment.
        
           | gliptic wrote:
           | > Also, a quick reminder that Perplexity 12 is really
           | terrible.
           | 
           | The 3B model had a perplexity of 9.91, less than LLaMa 1 in
           | fp16.
        
           | cs702 wrote:
           | Thank you. Your key point -- that so far all models with the
           | proposed methods may have been only "grossly trained" -- is
           | compelling. If I understand the authors correctly, they
           | trained the compared models on only 100B tokens, all drawn
           | from RedPajama, to make the comparisons apples-to-apples.
           | That seems sensible to me, and makes replication easier, but
           | I agree we need more to see extensive testing, after more
           | extensive pretraining, on models of larger sizes.
        
             | gliptic wrote:
             | They also trained 3B with 2 trillion tokens.
             | 
             | > The number of training tokens is a crucial factor for
             | LLMs. To test the scalability of BitNet b1.58 in terms of
             | tokens, we trained a BitNet b1.58 model with 2T tokens
             | following the data recipe of StableLM-3B [ TBMR], which is
             | the state-of-the-art open-source 3B model.
             | 
             | > [..]
             | 
             | > Our findings shows that BitNet b1.58 achieves a superior
             | performance on all end tasks, indicating that 1.58-bit LLMs
             | also have strong generalization capabilities.
        
               | cs702 wrote:
               | You're right. Thank you for pointing that out!
        
           | vessenes wrote:
           | Update - I'm still cautious about this paper, but I had the
           | table numbers inverted in my head while thinking about it.
           | The paper shows better perplexity results than competing
           | models at larger parameter sizes, so I was wrong.
        
         | flockonus wrote:
          | Considering how much faster additions are processed, and how a
          | particular silicon chip could be optimized for this very
          | specific case, all parts added together could perhaps show a
          | >100x speedup vs current systems.
         | 
         | I must concur, "wow".
        
         | sva_ wrote:
         | I'm also curious about the potential speed gains in automatic
         | differentiation, as there are way less branches to 'go up'. Or
         | am I wrong here?
        
           | lumost wrote:
           | They actually use a relu to represent the model weights. But
           | I'm not convinced that this can't be avoided. We do gradient
           | boosted decision tree training without this trick.
        
         | PaulHoule wrote:
         | I am not startled at all. Dense vector representations are
         | pretty silly, they can't really be the road to knowledge
         | representation.
        
         | p1esk wrote:
         | Ternary networks have been used since 2015. There are hundreds
         | of papers. They all require full QAT (training from scratch).
         | Not sure why you're shocked.
        
         | acchow wrote:
         | Conversely, this also implies our current model sizes can still
         | embed a _ton_ more "understanding"
        
         | Noe2097 wrote:
         | There is another _shocking_ realization in this work: there are
         | 11 types of people: those who know what binary means, those who
         | don't, and those who say they do but actually don't.
         | 
         | "The era of 1-bit LLMs"
         | 
         | Representing { -1, 0, 1 } can't be done with 1-bit, I'm sorry
         | -- and sad, please let's all get back to something vaguely
         | sound and rigorous.
        
           | npunt wrote:
           | Ternary supporters are always bitter about this
           | 
           | (I'll let myself out)
        
         | AaronFriel wrote:
         | In undergrad, some of us math majors would joke that there's
         | really only three quantities: 0, 1, infinity.
         | 
         | So, do we need the -1, and/or would a 2.32 bit (5 state, or 6
         | with +/-0) LLM perform better than a 1.58 bit LLM?
        
         | fzliu wrote:
         | This will be big for FPGAs - adders are extremely cheap
         | compared to multipliers and other DSP blocks.
        
         | paul_mk1 wrote:
         | Fun to see ternary weights making a comeback. This was hot back
         | in 2016 with BinaryConnect and TrueNorth chip from IBM research
         | (disclosure, I was one of the lead chip architects there).
         | 
         | Authors seemed to have missed the history. They should at least
         | cite Binary Connect or Straight Through Estimators (not my
         | work).
         | 
         | Helpful hint to authors: you can get down to 0.68 bits / weight
         | using a similar technique, good chance this will work for LLMs
         | too.
         | 
         | https://arxiv.org/abs/1606.01981
         | 
         | This was a passion project of mine in my last few months at IBM
         | research :).
         | 
          | I am convinced there is a deep connection to understanding why
          | backprop is unreasonably effective, and the result that you can
          | train low-precision DNNs; for those not familiar, the
          | technique is to compute the loss wrt the low-precision
          | parameters (eg project to ternary) but apply the gradient to a
          | high-precision copy of the parameters (known as the straight-
          | through estimator). This is a biased estimator and there is no
          | theoretical underpinning for why this should work, but in
          | practice it works well.
         | 
         | My best guess is that it is encouraging the network to choose
         | good underlying subnetworks to solve the problem, similar to
         | Lottery Ticket Hypothesis. With ternary weights it is just
         | about who connects to who (ie a graph), and not about the
         | individual weight values anymore.
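          | 
          | Roughly, the training step then looks like this - a sketch of
          | the generic low-precision-forward / high-precision-update
          | recipe, not the exact BinaryConnect or BitNet code:
          | 
          |     import torch
          | 
          |     def train_step(master_w, x, target, lr=1e-2):
          |         # master_w: full-precision copy kept across steps
          |         # (requires_grad=True); only its ternary projection
          |         # is used in the forward pass.
          |         w_t = torch.clamp(torch.round(master_w), -1, 1)
          |         w_fwd = master_w + (w_t - master_w).detach()  # STE
          |         loss = ((x @ w_fwd - target) ** 2).mean()
          |         loss.backward()
          |         with torch.no_grad():
          |             master_w -= lr * master_w.grad  # update fp copy
          |             master_w.grad = None
          |         return loss.item()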
        
       | nutate wrote:
       | Triggered by the use of 1-bit to describe a trit.
        
       | checker659 wrote:
       | If all the weights are either 1, 0 or -1, isn't this what
       | biological neurons do?
        
         | nathan_compton wrote:
         | Not even remotely. I suppose you could kind of say that
         | activations are boolean in the sense that neurons emit spikes,
         | but arguably significant information is encoded in spike
         | timing.
        
       | w-m wrote:
       | I was reading _Exposing Floating Point_ today (as Airfoil is on
       | the HN front page and I was perusing the archive of the author).
        | It's a blog explaining the inner workings of floating point
       | representations. About zero values it says [0]:
       | 
       | > Yes, the floating point standard specifies both +0.0 and -0.0.
       | This concept is actually useful because it tells us from which
       | "direction" the 0 was approached as a result of storing value too
       | small to be represented in a float. For instance -10e-30f /
       | 10e30f won't fit in a float, however, it will produce the value
       | of -0.0.
       | 
        | The authors of the LLM paper use the values {-1, 0, 1}.
       | Connecting the two ideas, I'm now wondering whether having a
       | 2-bit {-1, -0, 0, 1} representation might have any benefit over
       | the proposed 1.58 bits. Could the additional -0 carry some
       | pseudo-gradient information, ("the 0 leaning towards the negative
       | side")?
       | 
       | Also, I've seen 2-bit quantizations being proposed in other LLM
       | quantization papers. What values are they using?
       | 
       | [0] https://ciechanow.ski/exposing-floating-point/#zero
        
         | rfoo wrote:
         | Interesting, how do you use -0 in the add, then? Is -0+1-1 a 0
         | or a -0?
         | 
         | > Could the additional -0 carry some pseudo-gradient
         | information
         | 
          | It looks like training was done in fp32 or bf16. Low-bit
          | quantization is approximated with STE during training. I'd
          | expect training itself to cause each point to "polarize"
          | towards 1 or -1.
         | 
         | > 2-bit quantizations being proposed
         | 
         | Symmetric (i.e. without 0) exponential values were pretty
         | popular IIRC.
        
           | w-m wrote:
           | > how do you use -0 in the add
           | 
           | In my mind the two zero values would represent a tiny epsilon
           | around 0, let's say -0.01 and +0.01. Looking at them like
           | this, it would mean                 +0 +0 -0 = +0       +0 -0
           | -0 = -0       +1 * +0 = +0       -1 * +0 = -0
           | 
           | Performing addition with the same sign count in each group
           | would be problematic. How to decide on the sign of +0-0 or
           | +1-1, other than flipping a coin?
        
         | creshal wrote:
         | > Could the additional -0 carry some pseudo-gradient
         | information, ("the 0 leaning towards the negative side")?
         | 
         | Probably, but is it worth the cost? One of the goals behind
         | BitNet and this paper is to find a way to implement LLMs as
         | efficiently in hardware as possible, and foregoing floating
         | point semantics is a big part of it. I'm not sure if there's a
         | way to encode -0 that doesn't throw out half the performance
         | gains.
        
           | SushiHippie wrote:
           | But if I understand it correctly, they already need to use 2
           | bits, one for the sign and another one for the value, so
           | there is already one wasted state, which could be used for
           | -0.
        
             | pennomi wrote:
             | You can pack two trits into three bits, however. So one
             | byte could hold 5 values instead of 4.
        
               | para_parolu wrote:
                | Can a processor perform addition on them efficiently?
        
               | threatripper wrote:
                | How exactly would you do that? 3 states need 1.58 bits
                | which is a tad more than 1.5. Two 3-states have 3^2 = 9
                | states while three bits only give you 2^3 = 8 states.
        
       | BenoitEssiambre wrote:
        | Low-bit parameters are always talked about in terms of
        | performance benefits, but I wonder whether allowing the LLM to
        | combine parameters to represent values means it can select the
        | resolution of each value, that is, use a kind of internal
        | scientific notation to track the uncertainty of values. More
        | low-bit parameters combined together means more precision and
        | resolution; fewer can mean more uncertainty. This might allow the
        | LLM to better calibrate the uncertainty of its knowledge in a
        | Bayesian way, to prevent hallucinations from the overconfidence
        | you get from overfitting on too many bits.
        
       | singularity2001 wrote:
       | So we almost go back full circle to human (animal) brain binary
       | spikes?
        
         | concrete_head wrote:
          | It's not quite spikes, but it's getting closer to the idea. I'm
         | amazed it has taken this long for this type of thing to reach
         | HN which gives next to no attention to spiking neural networks.
         | 
         | Simon Thorpe, a CNRS researcher has got some fascinating papers
         | and lectures on YouTube on using binary weights on neuromorphic
         | hardware which has had practical applications for over 20 years
         | already.
         | 
         | I made an account just to drop his name somewhere on this
         | forum.
        
       | elromulous wrote:
       | So for the uninitiated (me), does this mean the input is not a
       | float (i.e. is quantized on input), such that all the math can be
       | done with int operations?
       | 
       | This seems almost too good to be true.
       | 
       | Edit: Answering my own question, yes. The details are in the
       | original bitnet paper: https://arxiv.org/abs/2310.11453
        
       | Mizza wrote:
       | I hope somebody gives this team access to the good data and a lot
       | of crunch, I'd love to see what happens when you train the big
       | fella.
        
       | rapatel0 wrote:
        | The mathematics of BNNs are sound. The Shannon entropy of a
        | word is really small (I vaguely remember ~2 bits). Also, all
        | neural networks are ridiculously over-provisioned.
        | 
        | I worked 7 years ago on trying to efficiently binarize CNNs from
        | existing models. The difficulty was getting training running
        | without the losses going too high. I think that vision models
        | will be much more difficult to binarize, but you might not need
        | to with CLIP if the vision encoder stays in regular math
        | {fp16, int8}.
        
       | ein0p wrote:
        | How is it a 1-bit LLM if 2 bits are required for each weight
        | (and one of the 4 possible states is wasted to be able to
        | represent 0)?
        
         | ricardobeat wrote:
         | As someone else pointed out here, you can store 5 ternary
         | values in 1 byte, 3^5 == 243.
        
           | ein0p wrote:
           | That's still not 1 bit, and that would basically destroy
           | whatever perf advantage you might hope to get if you want to
           | keep the model in memory in that format rather than unpack it
           | on load.
        
       | anon291 wrote:
       | This is something that's been tried many times before. 1-bit to
       | 2-bit models and binary NNs have a long history.
        
       | superdisk wrote:
       | Is there anything about this specific to LLMs, or could you use
       | it for any transformer based model? It seems like they made a
       | modified transformer.
        
       | fgfm wrote:
       | It's funny how discoveries in NLP & computer vision complement
       | each other. The replacement of multiplication by additions made
       | me think about the AdderNet paper
        | (https://arxiv.org/abs/1912.13200), which concluded that you
        | suffer almost no performance drop.
       | 
       | Perhaps the accumulators in current hardware cannot leverage this
       | to its full potential, but combined with such a strict
       | quantization, this would open LLM to the wider ML community much
       | earlier than expected (when consumer hardware allows you to train
       | near SOTA LLMs from scratch on your machine).
        
       | gojomo wrote:
       | That's not a 'bit' ("Binary digIT"). It's closer to a 'trit'
       | ("TeRnary-digIT"). Specifically, ternary digits spanning {-1, 0,
       | 1} (rather than the usual {0, 1, 2} in a base-3 numbering system)
       | are 'balanced ternary'.
       | 
       | A great intro to the theoretical reasons ternary might have some
       | promise in computing is this 2001 article from 'American
       | Scientist', "Third Base", which quotes Knuth calling balanced-
       | ternary "perhaps the prettiest numbering system of all" and also
       | discusses an abortive Soviet effort in the direction of ternary
       | computing:
       | 
       | http://web.archive.org/web/20011205185830/http://americansci...
       | 
       | In an aside, the article hints that _e_ -nary digits (base
       | 2.718...) if somehow made practical/meaningful, might actually be
       | better than ternary (or perhaps even optimal?).
       | 
        | So maybe this paper's observation that ~"1.58 bits" (log_2(3)
        | binary digits) is a sweet spot could be further refined into some
        | method for representing the state of an e-nary-modeled algorithm
        | in log_2(e) binary digits (~"1.44 bits") per underlying e-it.
       | 
       | (As it may be of renewed interest, I've also put this 2001
       | "American Scientist" base-3 intro as a new HN submission for
       | discussion: https://news.ycombinator.com/item?id=39541756)
        
         | bee_rider wrote:
         | It is obviously pretty common to represent matrices with lots
         | of zeros in a sparse format, like csr or something. I wonder if
         | they could get away with 1-bit representation using a sparse
         | matrix. Of course, it would be a little different from a
         | typical sparse matrix because there's no problem normally
         | having a zero-value in a structurally non-zero location.
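          | 
          | For what it's worth, the sign-only non-zeros do fit into an
          | off-the-shelf CSR container; a minimal scipy sketch, just to
          | illustrate the storage idea:
          | 
          |     import numpy as np
          |     from scipy.sparse import csr_matrix
          | 
          |     # Ternary weights; zeros vanish from the CSR arrays and
          |     # the stored non-zeros are all +1 or -1.
          |     W = np.array([[ 1,  0, -1],
          |                   [ 0,  0,  1]], dtype=np.int8)
          |     W_sp = csr_matrix(W)
          | 
          |     x = np.array([3, 7, 2], dtype=np.int32)
          |     print(W_sp @ x)  # [1 2], only adds/subtracts of x entries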
        
         | no_identd wrote:
         | See also:
         | 
         | https://en.wikipedia.org/wiki/Nat_(unit) (make sure to read the
         | footnotes, too)
         | 
         | Edit: See also also, on the radix economy of balanced ternary
         | (called "tristate") vs base 3:
         | https://web.archive.org/web/20090312094241/http://abhijit.in...
         | + a wild Marvin Minsky appears: https://archive.fo/gL2Bv
         | 
         | That page also brings up the whole "but division" problem with
         | balanced ternary, however, I personally suspect that
         | http://degiorgi.math.hr/aaa_sem/Div_Krishna/887-889.pdf ("A
         | Division Algorithm for Signed-Digit Arithmetic" by Chin Tung,
         | from 1968 !) might offer an overlooked path to a solution to
         | that problem
         | 
         | And see also also2, this quote from TAOCP:
         | 
         | "Cauchy pointed out that negative digits make it unneccesary
         | for a person to memorize the multiplication table past 5x5."
         | 
         | The--INCREDIBLY ANNOYING TO LOCATE--source for which is "105.
         | Calculs numeriques. sur les moyens d'eviter les erreurs dans
         | les calculs numeriques." on Pdf page 445/document page 431
         | here:
         | 
         | https://www.e-rara.ch/download/pdf/5702285?name=Tome%2520V%4...
         | 
         | See also also3:
         | https://pdfs.semanticscholar.org/5f77/b1cf105024b41b6824ba91...
         | (Vince, Andrew - Radix Representation and Rep-Tiling)
         | 
         | ( +a vaguely related paper here on quantum mechanics & radix
         | economy, BUT it makes the mistake of using an overly specific
         | formula applicable only to unsigned-digit representations thus
         | drawing the wrong conclusions:
         | https://www.researchgate.net/profile/Vladimir_Garcia-Morales...
         | )
        
         | nighthawk454 wrote:
         | Base e is the optimal base for number representation, so that's
         | probably why. Followed by base 3, then base 2.
         | 
         | https://en.m.wikipedia.org/wiki/Radix_economy
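          | 
          | Concretely, radix economy scores base b for representing N
          | values as b * log_b(N) = b * ln(N) / ln(b), which is minimized
          | at b = e; a two-line check:
          | 
          |     import math
          | 
          |     N = 10**6   # any large N gives the same ordering
          |     bases = (2, math.e, 3, 10)
          |     cost = {b: b * math.log(N) / math.log(b) for b in bases}
          |     print(sorted(cost.items(), key=lambda kv: kv[1]))
          |     # cheapest first: e, then base 3, then base 2, then base 10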
        
           | o11c wrote:
           | FSVO "optimal". In practice, both physical reality and
           | algorithm design strongly favors base 2.
        
             | nighthawk454 wrote:
             | Yeah, specifically, the definition of optimal provided -
             | radix economy. There are plenty of other considerations one
             | could make in other contexts. Practically, a transcendental
             | base seems... rather impractical. And base 2 is not so much
             | 'more optimal' than base 3 to warrant the electrical
             | complexity probably, for example.
        
         | dekhn wrote:
         | How useful are -0 and 0? You could splurge on two bits per
         | value which gives you { -1, -0, 0, 1 }
        
           | ant6n wrote:
           | 3^5=243. so use a byte to represent a vector of 5 ternary
           | values, leaving some possible signaling values.
        
         | Razengan wrote:
         | Why not a tit?
        
           | jdiff wrote:
           | Because bi- is two, tri- is three. Ti- is meaningless, and
           | not good enough of a joke to make up for it.
        
           | esafak wrote:
           | They renamed the biggest ML conference (NIPS) over the same
           | joke, so don't count on it.
        
         | kleiba wrote:
         | Note that they're not claiming that their LLM is 1-bit -
         | they're saying that there is a 1-bit era of LLMs. What they do
          | say is that their approach is _a variant_ of a 1-bit LLM,
          | namely a ternary LLM (they explicitly state that in the
          | abstract).
        
       | brunooliv wrote:
       | Do the implications at a practical level mean that the size of
       | gguf files will become smaller?
        
       | klysm wrote:
       | Does this mean we can compile LLMs to run on FPGAs directly?
        
         | loa_in_ wrote:
         | I don't know if ternary gate arrays are a thing, but if so then
         | yes.
        
       | karmasimida wrote:
        | This is exciting news. If the 8B numbers are true, we can
        | already use a model like Mixtral 8x7B, even with a single GPU?
       | 
       | But further into the development, we need comparison to large
       | model sizes. 70B might be too much to ask, but 13B should be
       | there at least.
        
         | cjbprime wrote:
         | You could already run Mixtral on the more expensive single
         | consumer GPUs (with 24GB VRAM) before this paper, at e.g.
         | 3-bits per weight.
        
       | Havoc wrote:
       | If true then I'm guessing this would make ASICs for this far more
       | simple too, right?
        
       | Avisite wrote:
       | Does quantization need to be an all or nothing? with the kind of
       | low bit models we have seen, my assumption would be that only
       | certain weights would benefit from the extra precision. A mixture
       | of precision with 2-bit, 3-bit, to 8-bit weights might perform
       | well, but I am unsure if any training process could identify the
       | weights that need the extra precision.
        
       | Blackthorn wrote:
       | Is there any rigorous way to answer the question of how much
       | information (be it entropy or some other measurement) is
       | contained in a model's weights?
        
       | oxxoxoxooo wrote:
       | Prior art:
       | 
       | Binarized Neural Networks: Training Deep Neural Networks with
       | Weights and Activations Constrained to +1 or -1
       | 
       | https://arxiv.org/abs/1602.02830
       | 
       | Ternary Neural Networks for Resource-Efficient AI Applications
       | 
       | https://arxiv.org/abs/1609.00222
        
         | kandu wrote:
         | Also: training neural networks by turning connections on and
         | off, or by just flipping the sign of the weights:
         | https://arxiv.org/abs/2006.16627
        
       | modeless wrote:
       | Maybe a silly question but nonlinearity is important for neural
       | nets. Wouldn't it make more sense for the three values to be e.g.
       | (2, 0, -1) so they are not colinear?
       | 
       | Also, what are the prospects for FPGA implementations of this?
        
       | bilsbie wrote:
       | This really just sounds absurd. How can ternary possibly encode
       | enough information?
       | 
       | Anyone willing to explain it like I'm a Django developer who
       | watched half a karpathy video?
        
       | bilsbie wrote:
       | How would you use this in something like PyTorch? There's no
       | ternary data type.
        
       | bilsbie wrote:
       | Could there be some value in recognizing areas where the model
       | needs finer grained weights and somehow using a different data
       | type just in certain areas?
        
         | fabiospampinato wrote:
          | It seems tough to do. Besides, I'm not sure what the benefit
          | would be: with that you can't do the optimized matrix
          | multiplication anymore, and if you need more precision you can
          | presumably just add more neurons and/or train for longer and/or
          | with better data.
        
       | kouru225 wrote:
       | Ok can someone catch me up to speed on LLM hardware requirements?
       | Last I looked I needed a 20 gb vram card to run a good one. Is
       | that not true anymore?
        
         | SushiHippie wrote:
         | Not true anymore, but it also highly depends on what your
         | definition of "a good one" is.
         | 
         | Many people find Mistral 7B to be excellent, around gpt-3.5
         | level of good.
         | 
         | Mistral 7B normally requires like 20gb VRAM, but with llama.cpp
         | and quantization, you could even run it on your phone (albeit
         | bad quality).
         | 
          | Quantizations >= q4_K_M seem to provide nearly as good
          | responses as the unquantized model, and q4_K_M only needs ~7GB
          | of VRAM.
         | 
         | See the table here:
         | 
         | https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGU...
         | 
         | Using ollama you can get up and running even a bit faster than
         | with llama.cpp directly (ollama uses llama.cpp under the hood).
        
           | kouru225 wrote:
           | Oh Jesus so basically it's very feasible for me to run my own
           | local llm on a NAS or a server or something... well I guess
           | it's time for me to get on with the times...
           | 
           | Thanks!
        
       | esha_manideep wrote:
        | These models are compatible with llama.cpp out of the box,
       | we (GigaML - https://gigaml.com) are planning to train a small
       | model (3-4B, 1-bit, opensource) with the latest stack-v2 dataset
       | released today. Let me know if anyone is interested in
       | collaborating with us.
        
       | arunk47 wrote:
       | Okay wait, can I train my own llm yet?
        
       | eigenvalue wrote:
       | Is it really so surprising that something like this works given
       | how human brain neurons work? My admittedly basic understanding
       | is that these operate through an all-or-nothing principle for
       | their action potentials (firing): they either fire or they don't,
       | based on whether the input signals reach a certain threshold. So
       | the output is already sort of binary in biological neurons. The
       | inputs are more like continuous values, since they are the sum of
       | many different neurons sending signals into each neuron, but in
       | this paper the activations are 8-bit, not binary/ternary. Can any
       | neuroscientists here comment?
        
         | m00x wrote:
         | This isn't really how neurons work.
         | 
          | First of all, they operate independently of a synchronized
          | clock, and they can also accumulate signals instead of
          | executing on an input. Neuromorphic chips are closer to how
          | the brain works,
         | but they're still super early. I believe Intel has the best one
         | with the Loihi 2.
         | 
         | (Not a neuroscientist but my wife is and that's what I
         | understand from our chats)
        
       ___________________________________________________________________
       (page generated 2024-02-28 23:00 UTC)