[HN Gopher] The Era of 1-bit LLMs: ternary parameters for cost-e...
___________________________________________________________________
The Era of 1-bit LLMs: ternary parameters for cost-effective
computing
Author : fgfm
Score : 566 points
Date : 2024-02-28 09:28 UTC (13 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| transfire wrote:
| Shouldn't that be "1-trit"?
| QuesnayJr wrote:
| They call it 1.58-bit in the paper. (1.58 is roughly the base 2
| logarithm of 3.)
| jmmcd wrote:
| So by "1-bit" they mean "less than 2 bits". AI is an
| insufferable field at times like this.
| riskable wrote:
| What else are they going to call it? Nobody wants to say
| they wrote some two-bit paper about AI!
| jmmcd wrote:
| The whole thing is a bit of a scam
| bmacho wrote:
| Read the pdf (https://arxiv.org/pdf/2402.17764.pdf); they call it
| 1-bit everywhere.
|
| I don't know why they do this; 1-bit seems to be a very
| wrong name for {-1, 0, 1}.
| FrustratedMonky wrote:
| Yes, technically, but it is catchy for the masses. "1-bit" seems
| to get the idea across, even if it doesn't technically describe
| {-1, 0, 1}.
| edflsafoiewq wrote:
| I think 0 "doesn't count", since you don't have to add or
| subtract anything for it, just mask it out.
| numpad0 wrote:
| Ternary or three-value logic is a thing in CS[1]
|
| 1: https://en.wikipedia.org/wiki/Three-valued_logic
| paipa wrote:
| Would be cool to see what happens if you quantize towards
| zero preferentially. Sparsifying the matrix should improve
| inference speed directly, right?
| tuananh wrote:
| Major breakthrough in the LLM scene: it achieves performance and
| perplexity equivalent to full FP16 models of the same parameter count.
|
| And you could fit a 120B model on a single 24GB VRAM card. This is
| mind-blowing.
| cyanydeez wrote:
| I mean, it expands the hardware selection, but until there are
| models and leaderboards etc., you can't really say it's a
| breakthrough.
| anon373839 wrote:
| > BitNet b1.58 can match the performance of the full precision
| baseline starting from a 3B size. ... This demonstrates that
| BitNet b1.58 is a Pareto improvement over the state-of-the-art
| LLM models.
|
| > BitNet b1.58 is enabling a new scaling law with respect to
| model performance and inference cost. As a reference, we can have
| the following equivalence between different model sizes in
| 1.58-bit and 16-bit based on the results in Figure 2 and 3.
|
| > * 13B BitNet b1.58 is more efficient, in terms of latency,
| memory usage and energy consumption, than 3B FP16 LLM.
|
| > * 30B BitNet b1.58 is more efficient, in terms of latency,
| memory usage and energy consumption, than 7B FP16 LLM.
|
| > * 70B BitNet b1.58 is more efficient, in terms of latency,
| memory usage and energy consumption, than 13B FP16 LLM.
|
| This paper seems to represent a monumental breakthrough in LLM
| efficiency, as the efficiency gains come with zero (or negative)
| performance penalty.
|
| Does it seem at all likely that existing models could be
| converted?
| accurrent wrote:
| They seem to be using LLaMA. Might be worth trying out. Their
| conversion formula seems stupidly simple.
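|
| If I'm reading the paper right, it's absmean quantization: scale
| the weight matrix by its mean absolute value, then round each
| entry to the nearest of {-1, 0, 1}. A rough sketch in NumPy
| (function and variable names here are mine, not the authors'):
|
|     import numpy as np
|
|     def absmean_quantize(W, eps=1e-6):
|         # scale by the mean absolute value, then round and clip to {-1, 0, 1}
|         gamma = np.abs(W).mean()
|         W_t = np.clip(np.round(W / (gamma + eps)), -1, 1)
|         return W_t.astype(np.int8), gamma  # keep gamma to rescale outputs
|
|     W = np.random.randn(4, 8)
|     W_t, gamma = absmean_quantize(W)
|     print(np.unique(W_t))  # some subset of [-1, 0, 1]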
| wongarsu wrote:
| However they trained their models from scratch, which is also
| why they only have meaningful numbers for 700M, 1.3B, 3B and
| 3.9B models. Apparently they are following BitNet's approach
| of replacing linear layers with quantized layers during
| training? If it was trivial to convert existing models
| without performance loss I would have expected them to
| include a benchmark of that somewhere in the paper to
| generate even more impact.
| imjonse wrote:
| They present numbers for 7B to 70B models as well.
| sp332 wrote:
| They do not have perplexity numbers for the larger models
| (see Table 2), only speed and memory benchmarks.
| imjonse wrote:
| You're both right, I skimmed the paper, saw large model
| numbers but didn't notice it was for speed. On the HF
| page they say those models are being trained.
|
| https://huggingface.co/papers/2402.17764
|
| "We haven't finished the training of the models beyond 3B
| as it requires much much more resources. However, we're
| optimistic about the results because we have verified
| that BitNet follows a similar performance-parameter
| scaling law as the full-precision LLMs. We'll update the
| results on larger models once they're ready."
| anon373839 wrote:
| Those numbers are for cost only, not performance. It's
| not clear they actually _trained_ a 70B vs. just using
| randomly initialized parameters.
| FrustratedMonky wrote:
| Yes. I wonder, then, how long before someone who does have a lot
| of compute power, like OpenAI/MS or others, can rapidly pivot and
| try this out on some even larger models.
|
| Doesn't this mean that the current big players can rapidly expand
| by huge multiples in size?
| ignoramous wrote:
| I wonder if 1-bit quantization is the _main_ reason why pplx.ai
| is faster than any other RAG or chatbot. For instance, Gemini
| in comparison is a turtle, though it is better at explanations,
| while pplx is concise.
| btbuildem wrote:
| Discussion on HF [1] implies that no, conversion is not
| helpful. It would take training the model from scratch.
|
| 1: https://huggingface.co/papers/2402.17764
| anon373839 wrote:
| It's a pity if realizing these gains absolutely requires full
| pre-training from scratch. I imagine more than a few people
| will at least try to find a way to repurpose the knowledge
| contained in existing models.
| cooljoseph wrote:
| You can also have another model "mentor" a new model you
| are teaching to speed up training. You don't have to start
| from scratch with zero knowledge. This is done a lot in
| what is called distillation.
| imjonse wrote:
| Too bad there seem to be no pretrained models to download. This
| is not a quantization method to apply on existing models, so
| having the pretrained weights is needed if one wants to test it.
| bArray wrote:
| +1 On this, the real proof would have been testing both models
| side-by-side.
|
| It seems that it may be published on GitHub [1] according to
| HuggingFace [2].
|
| [1] https://github.com/microsoft/unilm/tree/master/bitnet
|
| [2] https://huggingface.co/papers/2402.17764
| imjonse wrote:
| Nothing there yet, but it's good to know they want to publish
| and just haven't gotten around to it yet.
| SushiHippie wrote:
| From [2]:
|
| > We would definitely be happy to open-source the models for
| future research. Please stay tuned!
| UncleOxidant wrote:
| link #2 appears to be broken.
| dindobre wrote:
| Refreshing as machine learning papers go: simple explanation,
| easy to replicate, no alchemy-tier interpretations.
| Can't wait to see it replicated or disproved when it
| comes to real-life production tasks.
| imjonse wrote:
| The presentation is simplified because it implies knowledge of
| its predecessor, BitNet: https://arxiv.org/abs/2310.11453
| dindobre wrote:
| Makes sense!
| wongarsu wrote:
| The most glaring omission is that they only compared to fp16
| models, not to quantized models. And of course the benchmarks
| might be misleading compared to the real experience.
|
| But if you wanted to make LLM-specific hardware (or x64
| instructions tuned for LLMs) this model architecture makes that
| extremely cheap. Multiplication requires a lot of transistors;
| this architecture requires only two-bit adders. You could make
| SIMD instructions that do thousands of these in parallel, for
| fairly little silicon cost.
| the8472 wrote:
| What does it mean for future hardware if it's not using floating
| point matrix multiplication units?
| cyanydeez wrote:
| https://stackoverflow.com/questions/45373679/why-is-it-faste...
| gpderetta wrote:
| As per the answer, the reason float is faster than int is that
| a) hardware companies provide more float ALUs than integer ALUs
| and b) float FMA is a thing, while integer FMA isn't. Both
| are because currently most HPC-like loads use floats instead
| of integers, not because of intrinsic hardware reasons.
| KeplerBoy wrote:
| If it's desired, integer performance could far exceed float
| performance, since ALUs need less die area than FPUs.
|
| If this paper holds, I'd expect that's where custom
| accelerators will be heading.
| gpderetta wrote:
| Oh, I agree, I'm just saying that there is no reason in
| principle for float performance to be better than
| integer.
|
| edit: also this might be implementable purely using
| bitwise vector operations. Would need to check the
| throughput of those.
| KeplerBoy wrote:
| Expect Nvidia to advertise with their TOPS numbers instead of
| their FLOPS.
| rfoo wrote:
| Already happened years ago. They advertised TOPS for
| int8/int4 [0], and with 50% sparsity [1].
|
| [0] low-bit CNNs worked pretty well actually.
|
| [1] Totally useless marketing snake oil.
| hoseja wrote:
| Balanced ternary, my beloved.
| yieldcrv wrote:
| This is great. My employer just gave me an M1 laptop with only
| 16GB RAM and I had to downgrade my 7B parameter local LLMs to
| 3-bit quantization; they've been surprisingly okay!
|
| On my personal machine with 64GB RAM, I usually use 8x7B at Q5 or
| 70B at Q4.
|
| It's Mistral all the way down! Imagining a Q1.58 that does well
| makes me happy.
| turnsout wrote:
| Quantized 7B LLMs should work fine on your machine, though
| maybe you're talking about speed?
| yieldcrv wrote:
| 7B works fine
| woadwarrior01 wrote:
| You can run 4 bit quantized versions of SOLAR-10.7B and Llama 2
| 13B based models quite well on 16GB M1 laptops.
| FergusArgyll wrote:
| You shouldn't have to quantize it that much, maybe you're
| running a lot of other programs while running inference?
|
| Also, try using pure llama.cpp; AFAIK it has the least possible
| overhead.
| regularfry wrote:
| Getting more value out of phi-2-sized models is where you
| really want to be on lower-end M1's.
| lucubratory wrote:
| After reading the results I skipped back to the comment section
| to ask if this was real, because it looks a little too good to be
| true, but I figured I should check the authors, and it's Microsoft
| Research and UCAS, so yeah, real. This is going to change a lot of
| things, obviously the edge computing applications they point out,
| but also this is going to bottom out the cost of providing high-
| performance LLMs in the cloud. I don't know what that means for
| the economics long term; naively, much lower costs may mean new
| entrants without an entire cloud available can compete more
| easily? I do wonder if something like this has already been found
| and implemented by either OpenAI or Google.
| anon373839 wrote:
| It also means the largest models can be scaled up significantly
| with the same inference budget.
| llm_trw wrote:
| Depends. The only paper they cite for training:
| https://arxiv.org/pdf/2310.11453.pdf doesn't improve training
| costs much and most models are already training constrained.
| Not everyone has $200m to throw at training another model
| from scratch.
| arunk47 wrote:
| Is there any scope for indie builders?
| aurareturn wrote:
| After playing with OpenAI's GPT4 API, I'm quite convinced that
| LLMs would be in everything and everywhere today if inference
| cost is as low as loading a website and context size is 100x
| higher.
|
| In other words, only inference cost is holding it back from
| completely changing everything.
|
| So if we have a shortcut to getting something like GPT4 to run
| locally on a small device, watch out.
| rvnx wrote:
| It's coming in October with the new Apple chip
| sigmoid10 wrote:
| I'd be very surprised if Apple can put something on the
| level of GPT4 on a handheld. Remember, GPT4 is estimated to
| be around 1.7 trillion parameters. That's 3.4TB at 16 bit
| and it would still be ~340GB at 1.58bits. The best we can
| hope for is a low-ish level few billion parameter model.
| Which would still be cool on a phone, but as of today these
| models are nowhere near GPT4.
| jairuhme wrote:
| They won't have something at that size because as you
| pointed out, it is still huge. But depending on how they
| are used, smaller parameter models may be better for
| specific on-phone tasks that start to make the size of
| the model not a problem. GPT4 is so large because it is
| very general purpose with the goal seeming to be to
| answer anything. You could have a smaller model focused
| solely on Siri or something that wouldn't require the
| parameter size of GPT4
| sigmoid10 wrote:
| The thing about GPT4 that matters so much is not just
| raw knowledge retention, but complex, abstract reasoning
| and even knowing what it doesn't know. We haven't seen
| that yet in smaller models and it's unclear if it is even
| possible. The best we could hope for right now is a
| better natural language interface than Siri for calling
| OS functions.
| ynniv wrote:
| You don't need "GPT4" though. Mixtral 8x7B is robust and
| can be run in 36 GB, or 24 GB if you're willing to
| compromise. A 1.5-bit quantization should bring it down
| to 16 GB. That's still a lot compared to the iPhone 15's 6 GB,
| but it's close enough to imagine it happening soon. With
| some kind of streaming-from-flash architecture you might
| be in the realm already.
| creshal wrote:
| > With some kind of streaming-from-flash architecture you
| might be in the realm already.
|
| I thought mmap'ing models to only keep the currently
| needed pieces in RAM was something that was figured out
| ~6 months ago? Performance wasn't terribly great iirc,
| but with how much faster 1.58-bit is, it should still be
| okay-ish.
| imtringued wrote:
| I'm not sure what use that is, other than to maintain the
| KV cache across requests.
| liuliu wrote:
| There is a more detailed paper from Apple on this.
| Basically, you can do a little bit better than only
| keeping current weights in RAM with mmap.
|
| For LLMs, you are mostly dealing with b = W @ a, where a
| and b are vectors and only W is a matrix. If a is sparse
| (i.e. has some 0s), you don't need all the columns of
| W to do the matrix-vector multiplication. A cleverly
| arranged W can make sure that during inference, only the
| related columns are loaded from flash. Furthermore, if you
| can apply the "One Weird Trick" paper to this matrix-vector
| multiplication, you can shard W by rows, i.e. `b[i:i+n] =
| W[i:i+n,:] @ a for i in range(0, N, n)`, such that
| while the previous b[i:i+n] is still computing, you already
| have visibility on which columns of the next matrix need to
| be loaded.
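|
| A toy NumPy version of the column-skipping idea (names and shapes
| here are mine, just to illustrate):
|
|     import numpy as np
|
|     def sparse_matvec(W, a):
|         # only the columns of W that line up with nonzero activations
|         # contribute, so only those would need to be fetched from flash
|         idx = np.nonzero(a)[0]
|         return W[:, idx] @ a[idx]
|
|     W = np.random.randn(16, 64)
|     a = np.random.randn(64) * (np.random.rand(64) < 0.1)  # mostly zeros
|     assert np.allclose(sparse_matvec(W, a), W @ a)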
| cjbprime wrote:
| You need all of the model in RAM to perform the matmult
| that gets you the next token from it. There's no
| shortcut.
| jart wrote:
| LLMs will give normal people a firmer standing in
| technological society. That's a good thing. But will it
| change everything? Not a chance. Even if LLMs did change
| everything, that probably would not be a good thing. Dijkstra
| says Muslim algebra died when it returned to the rhetoric
| style, and the modern civilized world could only emerge --for
| better or for worse-- when Western Europe could free itself
| from the fetters of medieval scholasticism --a vain attempt
| at verbal precision!--thanks to the carefully, or at least
| consciously designed formal symbolisms that we owe to people
| like Vieta, Descartes, Leibniz, and (later) Boole. So don't
| be so proud of these graphics cards you've made, because the
| ability to understand the human tongue is insignificant
| compared to the power of math.
| rafaelero wrote:
| LLM's can do math as well.
| jart wrote:
| What makes you think that? Which LLMs?
| rafaelero wrote:
| https://deepmind.google/discover/blog/alphageometry-an-
| olymp...
| dns_snek wrote:
| Last time I checked, GPT-4 couldn't reliably add 2
| numbers, never mind anything more complex.
| vidarh wrote:
| Last I checked (and confirmed by repeating it just now)
| GPT-4 did just fine at adding 2 numbers up, because it
| knows better now than to do that manually and will
| express it as Python. It does _worse_ if you try to force
| it to do it step by step like a child and don 't
| reinforce adherence to the rules every step, because just
| like humans it gets "sloppy" when you try to get it to
| repeat the same steps over and over.
|
| If you want to measure its ability to do mindlessly
| repetitive tasks without diverging from instructions, you
| should compare it to humans doing the same, not expect it
| to act like a calculator.
|
| If you want to measure its ability to _solve problems_
| that involve many such steps that are simple to express
| but tedious to carry out, ask it to write and evaluate
| code to do it instead.
| dns_snek wrote:
| The claim was that "LLMs can do math". Below they linked
| a model from Google that might be capable of that, but as
| a general rule (and with OpenAI's models specifically)
| LLMs can't "do math" by any reasonable definition.
| vidarh wrote:
| I've had it do plenty of math. Some it does badly at,
| some it does fine. Generally it's not "disciplined"
| enough to do things that requires lots of rote repetitive
| tasks, but neither are most humans, and that has improved
| drastically as they've adjusted it to instead do what
| most humans do and use tools. Would it be nice if it
| _also_ got more willing to "stick to it" when given rote
| tasks? Sure.
|
| But whether or not it can "do maths" to your definition
| depends very much on what you want it to do, and how you
| define "do maths". To me it's irrelevant if it's doing
| the low-level calculations as long as it knows how to
| express them as code. If I wanted a calculator I'd use a
| calculator. And I don't consider a calculator able to "do
| math" just because it can precisely add numbers.
|
| Meanwhile I've had lengthy discussions with GPT about
| subjects like orbital mechanics and calculating
| atmospheric effects where it correctly used maths that I
| had to double-check not because I didn't trust GPT
| (though I _also_ wanted to verify for that reason) but
| because I didn't know the maths (not that it was anything
| particularly advanced, but I lost interest in maths
| during my CS degree and picked the minimum amount of
| maths I could get away with).
|
| By _my_ definition it can "do maths" just fine. I guess
| you don't consider my view of that "reasonable". I can
| live with that, as meanwhile, it will keep doing maths
| for me when I need it.
|
| Of course this was also a case of moving the goalposts to
| set up a strawman - in the comment of yours I replied to,
| you claimed it couldn't reliably _add two numbers_.
| dns_snek wrote:
| It often fails at basic 3-4 digit arithmetic. If you're
| stretching that definition far enough to claim that GPT4
| can "do math" then I should be able to call myself a
| commercial pilot because I can land a plane in a sim 20%
| of the time.
|
| I'm not moving goalposts, the original claim was that
| LLMs can "do math". Primary school arithmetic is math.
|
| GPT-4 can't do math and that's _okay_ , I don't
| understand why so many of you are so touchy and defensive
| about this. It's a limitation that exists, nothing more,
| nothing less.
| int_19h wrote:
| GPT-4 is a tiny subset of "LLMs".
|
| If you train a model to do math (and optimize
| representation for that), it'll do math. GPT-4 just
| isn't, and, generally speaking, they aren't, because it's
| much more efficient to train them to "use a calculator".
| Same as with humans.
| imtringued wrote:
| You do realize that arithmetic is a very simple symbolic
| manipulation task? All you have to do is keep track of
| the carry. I haven't seen an LLM that couldn't get digit
| by digit addition done, but they always mess up the
| carry.
| vidarh wrote:
| Just like humans. Try to get _regular people_ to e.g. add
| 15-16 digit numbers (which is typically where I'd see
| GPT4 start to get "sloppy" unless you prompt it the way
| you would a child who's learning and is still prone to
| get annoyed and wonder why the hell you make them do it
| manually), and see how many start making mistakes.
|
| I find it really comical that this is what people
| complain about GPT over - there's zero benefit to get
| LLMs to get good at this over other tasks. To the extent
| we get it "for free" as a benefit of other learning,
| sure, but when we make kids practice this _over and over
| again_ to drill doing it without getting sloppy, it has
| traditionally been out of some belief that it's
| important, but a computer will always have a "calculator"
| that is far more efficient than the LLM at its disposal
| and it's idiocy to care about whether it does that part
| well the tedious and hard way or knows how to describe
| the problem to a more efficient tool
|
| I also find it comical that people use tasks where LLM
| behaviour is, if anything, most human-like, in its tendency
| to lose focus and start taking shortcuts (before GPT4
| started writing Python instead, it'd for a while try
| _really_ hard to not give you a step by step breakdown
| and instead clearly take shortcuts even if you prompted it
| heavily to reason through it step by step), when
| presented with stupidly repetitive tasks as examples of
| how they're not good enough.
| mikewarot wrote:
| GPT-x can't add, or subtract, or do anything else of the
| type... it can APPEAR to do so, because that's what it
| was built to do.... act like the text it's seen
| previously and predict what the next text would be.
|
| If you include a large amount of properly solved math in
| its training text, it gets MUCH better at that kind of
| math.
|
| It has a very deep set of intelligences that are alien to
| us, that allow it to predict and ACT LIKE us, when it
| comes to generating the next word. You're only seeing the
| output of those intelligences through a very lossy
| channel.
|
| As a side note, there are structures in human language
| that apparently encode much more information than you
| might think at first glance. The fact that Word2Vec had
| such mathematical properties, despite its relative
| simplicity, astounds me to this day. Throwing a bunch of
| sine/cosine values on top of that to represent position
| in a sentence to enable LLMs is also amazing in that it
| works.
| ekianjo wrote:
| most open models do it poorly though. ChatGPT is better
| at it.
| lovasoa wrote:
| - Hey ChatGPT! What is 69*94?
|
| - The result of 69*94 is 6466.
| cooper_ganglia wrote:
| This comment reminded me of that scene in Indiana Jones
| where the guy is spinning the sword around about to
| attack Indy, and then Indy just pulls out his pistol and
| shoots him.
| samatman wrote:
| I agree with your basic thesis here: retrospection will
| view LLMs as a transitional architecture.
|
| However, this paper is evidence that the field is figuring
| out how to build what's actually needed, which is a good
| thing.
| ordu wrote:
| _> the modern civilized world could only emerge --for
| better or for worse-- when Western Europe could free itself
| from the fetters of medieval scholasticism_
|
| I can propose an alternate view of things. Not that I'm
| going to argue that it is the only true statement in the
| world, but I think it is necessary for a thought to
| progress to have an alternative hypothesis.
|
| So the proposition is: formal symbolisms can deal only with
| those problems that were already solved in imprecise
| human languages.
|
| To invent calculus and orbital mechanics you first need to
| talk for several centuries (or thousands of years?) about
| what position and velocity are, you need to talk your way
| up to acceleration, and then you need to find a way to
| measure them and to define them in strict geometric terms.
| Ah, and infinity: it was a very counter-intuitive idea; Zeno
| invented some of his paradoxes specifically to point at its
| counter-intuitiveness. When Newton came, all these talks and
| debates had done most of the work for him.
|
| _> the ability to understand the human tongue is
| insignificant compared to the power of math._
|
| But the fun is: you cannot know if someone understands math
| if they do not understand human language too. You cannot
| teach math to those who cannot speak human language.
|
| Math is a cream on top, with limited applicability. What can
| math say about love? I do not like to sound like
| Dumbledore, but really, behind all we do there are emotions
| motivating us. Math cannot deal with emotions, because it
| was built that way _and_ because non-math talk about
| emotions hasn't produced a good model for emotions which
| math could express in a formalized language.
|
| _> Dijkstra says _
|
| I wonder when he said it? Before expert systems based on
| logic were acknowledged to be a failure, or after?
| declaredapple wrote:
| I'll agree with you, and add that inference speed is a big
| factor too.
|
| SDXL-Lightning/Cascade can generate images in 200ms, which is
| fast enough to fit in a web request, and paradoxically makes
| it even cheaper to generate.
|
| And using groq at 500 t/s is wild compared to any of the
| other platforms.
| pennomi wrote:
| 500 t/s is uncomfortably fast to me. Generating high
| quality answers at speeds faster than I can read is the
| point at which I feel like LLMs are magic.
|
| I'm glad people are doing it though, and I'll happily adapt
| to accessing inference at that speed.
| azinman2 wrote:
| That's important for new applications to emerge where
| this happens on lots of data. You can't run LLMs at scale
| on tasks like Google might (every webpage) when the cost
| of each document is so high to process. Interactive
| chatbots are just the tip.
| gitfan86 wrote:
| That is the plan. Even if these independent software
| improvements don't create 10x gains, NVDA and others
| are making huge improvements.
| wongarsu wrote:
| I wouldn't be surprised if this causes hardware startups to pop
| up that build accelerator cards tuned for this architecture. It
| seems stupidly simple to do inference in hardware, and with
| most of the training being quantized as well you might even be
| able to provide speedups (and energy savings) for training with
| reasonable investment and on cheaper processor nodes than what
| Nvidia is using.
|
| Sure, Nvidia might eat their lunch in a couple of years, but
| bitcoin ASICs prove that you can have a niche producing
| specialized processors, and VCs would probably jump at the
| thought of disrupting Nvidia's high margin business.
| anon291 wrote:
| There's like a million startups promising analog / bit-level
| computation, inference-only, cheap computation.
|
| There's rain.ai, d-matrix, etc.
| btbuildem wrote:
| If this dethrones Nvidia, it would be a wonderful side effect
| rafaelero wrote:
| It's more likely that Nvidia will offer support to INT2 in
| the next generation and keep their dominance.
| Klipper3 wrote:
| INT2 ternary is equivalent to INT1 + binary mask. Nvidia
| supported INT1 matrix multiply in the RTX20 and RTX30
| generations, nobody used it, so they removed INT1 support
| from the RTX40 generation.
| raghavtoshniwal wrote:
| Sooo, short Nvidia?
| MadDemon wrote:
| Depends if this results in more efficient models or simply
| larger, more capable models.
| wongarsu wrote:
| In both cases this is a prime opportunity for anyone to
| disrupt Nvidia. They are in this market position in large
| part because both video games and neural networks do a lot of
| highly parallel floating point math, especially matrix
| multiplication. This model architecture doesn't do any of
| that.
|
| Of course it should be fairly simple for Nvidia to add
| special silicon and instructions for two-bit addition to a
| future generation of their cards. But it'll take a while
| because they already have a roadmap and preexisting
| commitments. And any competitor doesn't have to copy
| everything Nvidia does to make floating point numbers go
| fast, they can just focus on making two-bit data handling and
| addition go fast.
| sebzim4500 wrote:
| These still run on GPUs
| leroman wrote:
| - we have llama.cpp (could be enough or at least as mentioned
| in the paper a co-processor to accelerate the calc can be
| added, less need for large RAM / high end hardware)
|
| - as most work is inference, might not need for as many GPUs
|
| - consumer cards (24G) could possibly run the big models
| sebzim4500 wrote:
| If consumer cards can run the big models, then datacenter
| cards will be able to efficiently run the really big
| models.
| leroman wrote:
| Some tasks we are using LLMs for are performing very
| close to GPT-4 levels using 7B models, so it really depends
| on what value you are looking to get.
| londons_explore wrote:
| GPUs aren't yet awfully efficient at 1-bit math.
|
| I could imagine FPGA designs might be competitive.
|
| And dedicated ASICs would almost certainly beat both by a
| decent margin.
| sebzim4500 wrote:
| I'm very unconvinced that ASICs are better suited for this
| than for FP16/FP8 models that are being used today.
| londons_explore wrote:
| BF16 is a pretty big unit in an ASIC - You need at least
| 9 * 5 gates to calculate the exponent of the result, a 10
| bit barrel shifter (10*10 + 10*ceil(log2(10)) gates), and
| a 10 bit multiplier (approximately 10 * 10 * 9 gates)
|
| Total = 1085 gates. The reality is probably far more,
| because you're going to want to use carry-look-ahead and
| pipelining.
|
| Whereas 1-bit multiplies and adds into, say, a 16-bit
| accumulator use... 16 gates! (and probably half that, since
| you can probably use scheduling tricks to skip past the
| zeros, at the expense of variable latency...)
|
| So when 1-bit math uses only 1/100th of the silicon area
| of 16-bit math, and according to this paper gets the same
| results, the future is clearly silicon that can do 1-bit
| math.
| int_19h wrote:
| I don't think it would be difficult to make them efficient.
|
| The main reason why we run this stuff on GPUs is their
| memory bandwidth, anyway.
| etiam wrote:
| Hardly for this reason, but it does look suspiciously high,
| doesn't it?
| leroman wrote:
| Can someone versed in the ways of math explain how this is
| different from previous quantization methods?
|
| And specifically, seeing how going from FP16 to 8-bit mostly gives
| the same perplexity while anything further seems to lose quality /
| dumb down the model, how is this even less precise method able
| to achieve this?
| IanCal wrote:
| It's not quantising existing models, they're training new ones.
| leroman wrote:
| I understand this part, but it seemed that going 16->8->4 etc.
| is similar to compressing the "net", and quality seemed to drop
| below 8 bits.
| TheCoreh wrote:
| If I understand it correctly, this seems to be more than just
| quantizing, the models are apparently trained in this format as
| well. So it's possible that the many layers adjust themselves
| in a way that "cancels out" the inaccuracies of the lower bit
| count
| llm_trw wrote:
| So are there any details on the algorithms they used for
| backprop? I'm not seeing any in the paper other than "we used a
| lot of tokens".
| IanCal wrote:
| Does this help? https://arxiv.org/abs/2310.11453
|
| It seems to have more details (it's the paper before the linked
| one) about the actual training, but I'm scanning it and this
| isn't my field so maybe it's too light also.
| llm_trw wrote:
| Not really, that's for the binary version of the algorithm,
| the ternary version can propagate a lot more information in
| the backwards pass using the fact that outputs are either
| -1, 0, or 1.
|
| But I imagine they are using the same thing since a bunch of
| the authors are the same.
| wongarsu wrote:
| It's a fairly straightforward modification of BitNet, so I
| assume this quote from the BitNet paper applies:
|
| To train our 1-bit model, we employ the straight-through
| estimator (STE)[BLC13 ] to approximate the gradient during
| backpropagation. This method bypasses the non-differentiable
| functions, such as the Sign (Eq. 2) and Clip (Eq. 5) functions,
| during the backward pass. STE allows gradients to flow through
| the network without being affected by these non-differentiable
| functions, making it possible to train our quantized model
| sp332 wrote:
| 1-bit LLMs remind me of a random forum post I read about SACD and
| limitations of the 1-bit DSD audio format.
| https://www.audiosciencereview.com/forum/index.php?threads/d...
| Accumulating approximate values in one bit leads to being
| "constantly overloaded", with any error correction overwriting
| all of your real signal from the next step. I think this trinary
| system might leave enough room to avoid this problem.
| Alifatisk wrote:
| If this paper (especially the results on Table 4) is true, then
| this is a game changer!
| osigurdson wrote:
| I have often mused that, in some ways, it seems like the
| transistor is really being wasted in AI applications. We use
| binary states in normal computing to reduce entropy. In AI this
| is less of a concern, so why not use more of the available
| voltage range? Basically, re-think the role of the transistor and
| re-design from the ground up - maybe NAND gates are not the ideal
| fundamental building block here?
| adrianN wrote:
| You could call them Connection Machines and perhaps have an LLM
| trained on Feynman help with the design.
| sigmoid10 wrote:
| People are working on that [1]. In some sense, it's a step back
| to analog computing. Add/multiply is possible to do directly in
| memory with voltages, but it's less versatile (and stable) than
| digital computing. So you can't do _all_ calculations in a
| neural network that way, meaning some digital components will
| always be necessary. But I'm pretty sure analog will make a
| comeback for AI chips sooner or later.
|
| [1] https://www.nature.com/articles/s41586-023-06337-5
| zcw100 wrote:
| Reminds me of my father saying something about how vacuum
| tubes are great integrators.
| monocasa wrote:
| Chips are too. Opamps can add, multiply, subtract, divide,
| integrate and differentiate depending on how they're
| plugged in.
| klysm wrote:
| Hence the name 'operational' amplifier
| irrelative wrote:
| Hadn't thought about it this way before, but given that LLMs
| are auto regressive (use their own data for next data),
| they're sensitive to error drift in ways that are rather
| similar to analog computers.
| trebligdivad wrote:
| Trinary however is an interesting middle; people have built
| trinary hardware long ago; it feels like you could make
| natively trinary hardware for something like this; it might
| even be quite a win.
| thsksbd wrote:
| Can you make a "CMOS" three voltage level circuit though?
| One where the only current flow is when the state changes?
|
| I'm not in this field, but that's a question that's been
| bugging me for a while. If you can't do this, wouldn't
| energy consumption balloon?
| int_19h wrote:
| People haven't built _reliable_ ternary electronics,
| though. Soviets tried with Setun, but they eventually had
| to resort to emulating each trit with two hardware bits
| (and wasting one state out of the possible four).
| wakawaka28 wrote:
| I have heard of people trying to build analog AI devices but
| that seems like years ago, and no news has come out about it in
| recent times. Maybe it is harder than it seems. I bet it is
| expensive to regulate voltage so precisely and it's not a
| flexible enough scheme to support training neural networks
| like we have now, which are highly reconfigurable. I've also
| heard of people trying to use analog computing for more mundane
| things. But no devices have hit the market after so many years
| so I'm assuming it is a super hard problem, maybe even
| intractable.
| osigurdson wrote:
| Perhaps another variation on the idea is to allow a higher
| error rate. For example, if a 0.01% error rate was acceptable
| in AI, perhaps the voltage range between states could be
| lowered (which has a quadratic relationship to power
| consumption) and clock speed could increase.
| loudmax wrote:
| The Veritasium Youtube channel did a video about this about a
| year ago: https://www.youtube.com/watch?v=GVsUOuSjvcg
|
| They visit Texas company Mythic AI to discuss how they use
| flash memory for machine learning. There's a California company
| named Syntiant doing something similar.
| gryn wrote:
| The reason why digital/numeric processing won is the power loss
| in the analog world. When designing an analog circuit, the next
| processing stage you add at the end has an impact on the ones
| before it.
|
| This then requires more skill from the engineers/consumers.
|
| If you want to avoid that you need to add op-amps with a gain
| of 1 at the boundary of each stage; this also takes care of the
| power loss at each stage.
|
| The other part is that there's a limit to the amount of
| useful information/computation you can do with analog
| processing once you take voltage noise into account. When
| you do a comparison there are cases where analog wins but also
| cases where digital wins.
|
| I'll edit this later with a link to some papers that discuss
| these topics if I manage to find them in my mess.
| dazed_confused wrote:
| Good explanation. When I was working at a semiconductor
| manufacturer, our thresholds were like 0 - 0.2V to 0.8 -
| 1.0V. Additionally, if you look at QLC SSDs, their longevity
| is hugely degraded. Analog computing is non-trivial, to say
| the least.
| im3w1l wrote:
| For the specific case of neural networks they seem to be very
| resistant to noise. That's why quantization works in the
| first place.
| BlueTemplar wrote:
| I have heard that the first commercial neural network chip (by
| Intel, in the 90s) was analog ?
| barrenko wrote:
| Hmm, maybe some (signaling) inspiration from biology other than
| neural signaling.
| StableAlkyne wrote:
| It would be something of a full circle, I feel, if we went back to
| dedicated circuits for NNs - that's how they began life when
| Rosenblatt built his Perceptron.
|
| I remember reading a review on the history in grad school
| (can't remember the paper) where the author stated that one of
| the initial interests in NNs by the military was their
| distributed nature. Even back then, people realized you could
| remove a neuron or break a connection and they would still work
| (and even today, dropout is a way of regularizing the network).
| The thinking was that being able to build a computer or
| automated device that could be damaged (radiation flipping
| bits, an impact destroying part of the circuit, etc) and still
| work would be an advantage given the perceived inevitability of
| nuclear war.
|
| Compared to a normal von Neumann machine which is very fault
| intolerant - remove the CPU and no processing, no memory=no
| useful calculation, etc. One reason people may have avoided
| further attempts at physical neural networks is it's
| intrinsically more complex than von Neumann, since now your
| processing and memory is intertwined (the NN is the processor
| and the program and the memory at the same time).
| kurisufag wrote:
| >von Braun machine
|
| von neumann? though it is funny to imagine von braun
| inventing computer architecture as a side hustle to inventing
| rocket science.
| StableAlkyne wrote:
| Oh fuck, thanks for catching that!
| the8472 wrote:
| Bits are copyable without data loss. Analog properties of
| individual transistors are less so.
| seydor wrote:
| let's use cells
| Razengan wrote:
| We already do.
| drexlspivey wrote:
| Next Up: Quantum AI
| mikewarot wrote:
| >maybe NAND gates are not the ideal fundamental building block
| here?
|
| It's my long held opinion that LUTs (Look Up Tables) are the
| basis of computation for the future. I've been pondering this
| for a long time since George Gilder told us that wasting
| transistors was the winning strategy. What could be more
| wasteful than just making a huge grid of LUTs that all
| interconnect, with NO routing hardware?
|
| As time goes by, the idea seems to have more and more merit.
| Imagine a grid of 4x4 bit look up tables, each connected to its
| neighbors, and clocked in 2 phases, to prevent race conditions.
| You eliminate the high speed long lines across chips that cause
| so much grief (except the clock signals, and bits to load the
| tables, which don't happen often).
|
| What you lose in performance (in terms of latency), you make up
| for with the homogeneous architecture that is easy to think
| about, can route around bad cells, and be compiled to almost
| instantly, thanks to the lack of special cases. You also don't
| ever have to worry about latency, it's constant.
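|
| To make that concrete, here's a tiny toy simulator of how I
| picture it (entirely my own sketch; the grid size, neighbor
| wiring and LUT contents are made up):
|
|     import random
|
|     W, H = 8, 8
|     # each cell: a 16-entry table of 4-bit values (4 inputs -> 4 outputs)
|     lut = [[random.randrange(16) for _ in range(16)] for _ in range(W * H)]
|     out = [[0] * 4 for _ in range(W * H)]  # each cell's output bits (N, E, S, W)
|
|     def neighbor_bits(x, y):
|         # gather the bit each neighbor currently drives toward (x, y)
|         n = out[x + ((y - 1) % H) * W][2]   # north neighbor's south output
|         e = out[((x + 1) % W) + y * W][3]   # east neighbor's west output
|         s = out[x + ((y + 1) % H) * W][0]   # south neighbor's north output
|         w = out[((x - 1) % W) + y * W][1]   # west neighbor's east output
|         return (n << 3) | (e << 2) | (s << 1) | w
|
|     def step(phase):
|         # two-phase clock: only half the cells (checkerboard) update at once,
|         # so a cell never reads a neighbor that changes in the same phase
|         for y in range(H):
|             for x in range(W):
|                 if (x + y) % 2 == phase:
|                     v = lut[x + y * W][neighbor_bits(x, y)]
|                     out[x + y * W] = [(v >> i) & 1 for i in range(4)]
|
|     for _ in range(10):
|         step(0)
|         step(1)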
| phdelightful wrote:
| It's been a long time since I worked on FPGAs, but it sounds
| like FPGAs! What do you see as the main differences?
| mikewarot wrote:
| No routing, no fast lines that cut across the chip, which
| cut way down on latency, but make FPGAs harder to build,
| and especially hard to compile to once you want to use
| them.
|
| All that routing hardware, and the special function units
| featured in many FPGAs are something you have to optimize
| the usage of, and route to. You end up using solvers,
| simulated annealing, etc... instead of a straight compile
| to binary expressions, and mapping to the grid.
|
| Latency minimization is the key to getting a design to run
| fast in an FPGA. In a BitGrid, you know the clock speed,
| you know the latency by just counting the steps in the
| graph. BitGrid performance is determined by how many
| answers/second you can get from a given chip. If you had a
| 1 GHz rack of BitGrid chips that could run GPT-4, with a
| latency of 1 ms per token, you'd think that was horrible,
| but you could run a million such streams in parallel.
| londons_explore wrote:
| Powers of 3 don't pack well into binary memory...
|
| A 1-bit multiplier in silicon is a single logic gate, but a
| ternary decoder to decode a packed tri-state 'weight' is bigger.
|
| I therefore suspect that this method will be extended to make all
| weights simply 1 or 0 (i.e. binary). Perhaps that will be done by
| having half the weights have 1 or 0 values, while the other half
| are -1 or 0.
| baq wrote:
| You can build dedicated silicon with ternary gates:
| https://medium.com/@rxseger/exploring-ternary-logic-tnand-an...
|
| Not sure if it's more efficient than just binary digital
| circuits in highly integrated chip, though.
| samatman wrote:
| It's optimal if your program is naturally ternary, which this
| one is. Using three signals, rather than ternary gates, is
| less effective, because you need much more precision to
| detect three different voltage levels rather than just up and
| down.
| tromp wrote:
| 5 trits fit into 1 byte pretty well, since 3^5 = 243 is just
| under 2^8 = 256.
|
| That should be called an 8/5 = 1.6 bit model though, while the
| paper names it 1.58 bit, closer to log_2(3) ~ 1.5849625
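|
| For example, a quick packing sketch (my own, not from the paper),
| treating 5 trits as a base-3 number in one byte:
|
|     def pack5(trits):            # five values from {-1, 0, 1}
|         b = 0
|         for t in trits:
|             b = b * 3 + (t + 1)  # map {-1, 0, 1} -> {0, 1, 2}
|         return b                 # 0..242, fits in one byte
|
|     def unpack5(b):
|         out = []
|         for _ in range(5):
|             out.append(b % 3 - 1)
|             b //= 3
|         return out[::-1]
|
|     assert unpack5(pack5([-1, 0, 1, 1, -1])) == [-1, 0, 1, 1, -1]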
| JKCalhoun wrote:
| Would be nice to have hardware instructions that work on 5
| trits natively.
| londons_explore wrote:
| But the decoder for that will be 25+ gates, which is huge
| compared to the handful of gates to use the resulting
| weights.
| fabmilo wrote:
| can't you have 2 bits ? first bit for the sign second bit for
| the 1 0 you can represent -1 +1 +0 -0
| K0IN wrote:
| when can we expect the first ~100+ million parameter models to
| run on a Raspberry Pi Pico?
| Klipper3 wrote:
| The theoretical capacity of a binary network is 69% of the
| capacity of a full-weight network, so it makes sense that LLMs
| would converge to 1-bit networks in the long term.
|
| It's nice to finally see practical networks reach the theoretical
| limits found in the statistical mechanics of Ising models. A good
| pointer to efficient 1-bit training, from the statistical
| mechanics point of view, is here:
|
| https://www.pnas.org/doi/full/10.1073/pnas.0700324104
| arunk47 wrote:
| What is stopping us right now from doing these one-bit
| networks?
| ulnarkressty wrote:
| Take this with a grain of salt until someone reproduces it.
| Improvements such as these require extraordinary evidence. Not to
| mention extreme quantization has been tried before.
| joelthelion wrote:
| Assuming this is confirmed, what's the impact on training?
|
| Inference is definitely an issue for LLMs right now. But if
| training were suddenly possible for lone hackers (or maybe
| smaller companies), it would open up a lot of new possibilities
| as well.
| stormfather wrote:
| How does backprop work here? I can't imagine flipping bits of
| everything upstream of an error is effective.
| joelthelion wrote:
| (haven't read the paper). Maybe you can flip bits with a
| probability distribution that depends on the gradient?
| stormfather wrote:
| That's an interesting idea! Would love to try that on MNIST
| one day.
| spyder wrote:
| From the BitNet paper:
|
| _" Straight-through estimator. To train our 1-bit model, we
| employ the straight-through estimator (STE)[BLC13] to
| approximate the gradient during backpropagation. This method
| bypasses the nondifferentiable functions, such as the Sign (Eq.
| 2) and Clip (Eq. 5) functions, during the backward pass. STE
| allows gradients to flow through the network without being
| affected by these non-differentiable functions, making it
| possible to train our quantized model."_
|
| also the author's (@shumingma) answer in the comments:
| https://huggingface.co/papers/2402.17764#65df17ed4d436404cdc...
| alexey-salmin wrote:
| Also from Microsoft in 2021: Make Every feature Binary: A 135B
| parameter sparse neural network for massively improved search
| relevance [1]
|
| [1] https://www.microsoft.com/en-us/research/blog/make-every-
| fea...
| yousif_123123 wrote:
| Any models published as well?
| jonbaer wrote:
| I really can't tell but it seems to be a continuation of this
| work if I read the To-Dos correctly, what do you think? Here it
| seems to be 1-bit on just the transformer,
| https://huggingface.co/shi3z/BitNetWikipedia110M
| naasking wrote:
| Interesting return to ternary. Effectively, each weight says only
| whether it's correlated (+1), uncorrelated (0), or anti-
| correlated (-1) with the input, and the structure of the network
| is the actual computation over that information.
| wenyuanyu wrote:
| I wonder how the training process works...
| wenyuanyu wrote:
| If this turns out to be true, it could indeed be a game
| changer... given the advanced AI chip shortage... and also the
| chip ban on China...
| rafaelero wrote:
| Looks like we have finally rediscovered a biological neuron.
| bilsbie wrote:
| How so?
| rafaelero wrote:
| They propagate information in a binary way (either they
| activate or not).
| cs702 wrote:
| There are two findings I find _shocking_ in this work:
|
| * In existing LLMs, we can replace all parameter floating-point
| values representing real numbers with ternary values representing
| (-1, 0, 1).
|
| * In matrix multiplications (e.g., weights by vectors), we can
| replace elementwise products in each dot product (a1b1 + a2b2
| ...) with elementwise _additions_ (a1+b1 + a2+b2 ...), in which
| signs depend on each value. See the paper for exact details.
|
| On existing hardware, the gains in compute and memory efficiency
| are significant, without performance degradation (as tested by
| the authors).
|
| If the proposed methods are implemented in hardware, we will see
| _even greater gains_ in compute and memory efficiency.
|
| Wow.
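|
| Concretely, here's a toy version of the add-only dot product as I
| read it (my own sketch, with made-up shapes; the paper's kernel is
| of course more involved):
|
|     import numpy as np
|
|     def ternary_matvec(W_t, a):
|         # W_t has entries in {-1, 0, +1}, so each dot product reduces to
|         # adding and subtracting activations; the masks just pick which
|         return (a * (W_t == 1)).sum(axis=1) - (a * (W_t == -1)).sum(axis=1)
|
|     W_t = np.random.choice([-1, 0, 1], size=(4, 8))
|     a = np.random.randn(8)
|     assert np.allclose(ternary_matvec(W_t, a), W_t @ a)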
| creshal wrote:
| > * In existing LLMs, we can replace all parameter floating-
| point values representing real numbers with ternary values
| representing (-1, 0, 1).
|
| Why is this so shocking? Quantization has been widely explored,
| driving that to its extreme (and blowing up parameter count to
| make up for it) just seems like a natural extension of that.
|
| Easier said than done, of course, and very impressive that they
| pulled it off.
|
| > In matrix multiplications (e.g., weights by vectors), we can
| replace elementwise products in each dot product (a1b1 + a2b2
| ...) with elementwise additions (a1+b1 + a2+b2 ...), in which
| signs depend on each value
|
| I feel like this follows naturally from having only ternary
| values, multiplication doesn't really bring much to the table
| here. It's a bit surprising that it's performing so well on
| existing hardware, usually multiplication hardware sees more
| optimization, especially for GPGPU hardware.
| cs702 wrote:
| _> Why is this so shocking? Quantization has been widely
| explored, driving that to its extreme (and blowing up
| parameter count to make up for it) just seems like a natural
| extension of that._
|
| I find it shocking that we don't even need lower floating-
| point precision. _We don't need precision at all_. We only
| need three symbols to represent every value.
|
| _> I feel like this follows naturally from having only
| ternary values, multiplication doesn't really bring much to
| the table here. It's a bit surprising that it's performing so
| well on existing hardware, usually multiplication hardware
| sees more optimization, especially for GPGPU hardware._
|
| I find it shocking. Consider that associative addition over
| ternary digits, or trits, represented by three symbols
| (a,b,c) has only three possible input pairs, (a,b), (a,c), or
| (b,c) (within each pair, order doesn't matter), and only
| three possible outputs, a, b, or c. Matrix multiplications
| could be executed via crazy-cheap tritwise operations in
| hardware. Maybe ternary hardware[a] will become a thing in
| AI?
|
| ---
|
| [a] https://en.wikipedia.org/wiki/Ternary_computer
| jerf wrote:
| An integer is just a concatenation of bits. Floating point
| appears more complicated but from an information theory
| perspective it is also just a concatenation of bits. If,
| for the sake of argument, one replaced a 64-bit int with 64
| individual bits, that's really the same amount of
| information and a structure could hypothetically then
| either choose to recreate the original 64-bit int, or use
| the 64-bits more efficiently by choosing from the much
| larger set of possibilities of ways to use such resources.
|
| Trits are helpful for neural nets, though, since they
| really love signs and they need a 0.
|
| So from the perspective that it's all just bits in the end
| the only thing that is interesting is how useful it is to
| arrange those bits into trits for this particular
| algorithm, and that the algorithm seems to be able to use
| things more effectively that way than with raw bits.
|
| This may seem an absolutely bizarre zigzag, but I am
| reminded of Busy Beavers, because of the way they take
| the very small primitives of a Turing Machine, break it
| down to the smallest pieces, then combine them in ways that
| almost immediately cease to be humanly comprehensible.
| Completely different selection mechanism for what appears,
| but it turns out Turing Machine states can do a lot "more"
| than you might think simply by looking at human-designed
| TMs. We humans have very stereotypical design methodologies
| and they have their advantages, but sometimes just letting
| algorithms rip can result in much better things than we
| could ever hope to design with the same resources.
| cs702 wrote:
| _> So from the perspective that it's all just bits in
| the end the only thing that is interesting is how useful
| it is to arrange those bits into trits for this
| particular algorithm, and that the algorithm seems to be
| able to use things more effectively that way than with
| raw bits._
|
| Thank you. I find many other things interesting here,
| including the potential implications for hardware, but
| otherwise, yes, I agree with you, _that_ is interesting.
| SkyBelow wrote:
| This sort of breakdown also reminds me of the explanation
| of why busy beavers grow faster than anything humans can
| ever define. Anything a human can define is a finite
| number of steps that can be represented by some Turing
| machine of size M. A Turing machine of size N > M can
| then use M as a subset of it, growing faster than
| the Turing machine of size M. Either it is the busy
| beaver for size N, or it grows slower than the busy
| beaver for size N. Either way, the busy beaver for size N
| grows faster than whatever the human defined that was
| captured by the Turing machine of size M. This
| explanation was what helped me understand why busy
| beavers grow faster than any operator that can be
| formally defined (obviously you can define an operator
| that references busy beaver itself, but busy beaver can
| be considered to not be formally defined, and thus any
| operator defined using it isn't formally defined either).
|
| The bit about floating point numbers just being a
| collection of bits interpreted in a certain way helps
| make sense why a bigger model doesn't need floating
| points at all.
| jxy wrote:
| The matrices (weights) are ternary.
|
| The vectors are not.
| cs702 wrote:
| The _activations_ are in (-1, 1), so they're also
| representable by (-1, 0, 1).
| satellite2 wrote:
| Because it's no longer a linear optimization or curve fitting
| problem. It becomes a voting or combinatorial problem. Which
| at least in my mind are two completely different areas of
| research.
| HPsquared wrote:
| With enough parameters, it probably starts looking
| continuous again. Like how in physics everything is
| quantised at the smallest scale but if you put enough atoms
| together it all smooths out and behaves "classically".
| amelius wrote:
| Yes, but we can simulate classical physics using
| mathematical shortcuts. Simulating every little atom
| would take a lot more work.
| ncruces wrote:
| Well I guess it's the "blowing up parameter count to make up
| for it" that confuses me, but maybe it's just ignorance.
|
| Like what would be the expected factor of this blow up to
| make up the difference between ternary and whatever 16 bits
| encoding they were using?
|
| I mean intuitively I'd expect to need ~10x the symbols to
| encode the same information? Are they using an order of
| magnitude more parameters, or is that not how it works?
| int_19h wrote:
| With existing common quantization techniques, a 70b model
| quantized to 3-bit still drastically outperforms an
| unquantized 35b model.
| nutanc wrote:
| We have been experimenting with the paper(https://www.researchg
| ate.net/publication/372834606_ON_NON-IT...).
|
| There is a mathematical proof that binary representation is
| enough to capture the latent space. And in fact we don't even
| need to do "training" to get that representation.
|
| The practical application we tried out for this algorithm was
| to create an alternate space for mpnet embeddings of Wikipedia
| paragraphs. Using Bit embedding we are able to represent 36
| million passages of Wikipedia in
| 2GB.(https://gpt3experiments.substack.com/p/building-a-vector-
| dat...)
| cs702 wrote:
| You're talking about mapping floating-point vector
| representations, i.e., embeddings, computed by a pretrained
| LLM to binary vector representations, right? And you're
| talking about doing this by first having _someone else's_
| pretrained LLM compute the embeddings, right? Sorry, but that
| seems only minimally, tangentially related to the topic of
| running LLMs in ternary space. I don't see how your comment
| is relevant to the discussion here.
| nutanc wrote:
| Yeah, sorry, needed a much bigger canvas than a comment to
| explain. Let me try again. The example I took was to show
| mapping from one space to another space and it may have
| just come across as not learning anything. Yes. You are
| right it was someone else's pretrained LLM. But this new
| space learnt the latent representations of the original
| embedding space. Now, instead of the original embedding
| space it could also have been some image representation or
| some audio representation. Even neural networks take input
| in X space and learn a representation in Y space. The paper
| shows that any layer of a neural network can in fact be
| replaced with a set of planes and we can represent a space
| using those planes and that those planes can be created in
| a non-iterative way. Not sure if I am being clear, but I have
| written a small blog post showing, for MNIST, how an NN
| creates the planes (https://gpt3experiments.substack.com/p/u
| nderstanding-neural-...). I will write more on how, once these
| planes are drawn, we can use a bit representation
| instead of floating point values to get similar prediction
| accuracy, and then on how we can draw those planes without
| the iterative training process.
| SushiHippie wrote:
| Wow, this works better than I would've thought.
|
| > Who moderates Hacker News?
|
| First result:
|
| > Hacker News
|
| > At the end of March 2014, Graham stepped away from his
| leadership role at Y Combinator, leaving Hacker News
| administration in the hands of other staff members. The site
| is currently moderated by Daniel Gackle who posts under the
| username "dang".
| m3kw9 wrote:
| How is this not lossy compression?
| rf15 wrote:
| It kind of is!
| sandyarmstrong wrote:
| LLMs and vector embeddings are always lossy compression,
| yes?
| baq wrote:
| kind of related:
| https://medium.com/@heinrichpeters/commentary-gzip-knn-
| beats...
| fabmilo wrote:
| I find this extremely interesting. Do you share the source
| code of the process? any more references?
| rhaps0dy wrote:
| I think you need more evidence than this paper (which is very
| short and light on actual numbers) to be this shocked.
|
| For example, most of the plots in the paper are actually of
| throughput, memory, etc. all performance characteristics that
| are better on the ternary version. Which, of course.
|
| The only thing that contains perplexities are Table 1 and 2.
| There, they compare "BitNet b1.58 to our reproduced FP16 LLaMA
| LLM in various sizes" on the RedPajama data set. The first
| thing to note is the perplexities are very high: they're all at
| least ~9.9, compared with, for example, quantized Llama's 6.15 on
| wikitext-2 (https://www.xzh.me/2023/09/a-perplexity-benchmark-of-
| llamacp...). Maybe RedPajama is a lot harder than wikitext-2, but
| that's a big gap.
|
| I think probably their benchmark (their "reproduced FP16 LLaMA
| LLM") is just not very good. They didn't invest much in
| training their baseline and so they handily beat it.
| cs702 wrote:
| Thank you. I think the paper as it is provides enough
| evidence to support the claims. If I understand the authors
| correctly, they trained the compared models on only 100B
| tokens, all drawn from RedPajama, to make the comparisons
| apples-to-apples. That's sensible. It allows for easier
| replication of the results. Otherwise, I agree with you that
| more extensive testing, after more extensive pretraining, is
| still necessary.
| lr1970 wrote:
| Authors reported perplexity only for small up to 3B weights
| models. On the other hand, they reported throughput for 70B
| model, but not its performance (perplexity, end-to-end tasks).
| Very unfortunate omission. Overall, the paper is rather poorly
| written.
| cs702 wrote:
| If I understand the authors correctly, they trained the
| compared models on only 100B tokens, all drawn from
| RedPajama, to make the comparisons apples-to-apples. That's
| sensible. It allows for easier replication of the results.
| Otherwise, I agree with you that more extensive testing,
| after more extensive pretraining, at larger model sizes, is
| still necessary.
| lr1970 wrote:
| towards the end of the paper they mentioned training on 2T
| tokens.
| cs702 wrote:
| You're right. Thank you for pointing that out.
| beagle3 wrote:
| I haven't been keeping tabs, but this seems very much like the RIP
| / Achlioptas version of the Johnson-Lindenstrauss lemma.
|
| Perhaps the rest of the JL lemma promise applies as well -
| compressing the number of parameters by a few orders of
| magnitude.
| jandrese wrote:
| It seems like the AI space is slowly coming back around to the
| old Thinking Machines CM-1 architecture. It's not too often in
| computing where you see ideas a full 40 years ahead of their
| time make it into production.
| giantrobot wrote:
| IIUC the main issue with the CM-1 architecture was feeding
| the processor cluster with data. That required a heftier
| front end system than was practical/affordable at the time.
| With modern CPUs and memory subsystems the GPUs can be
| saturated pretty easily. So going back to huge clusters of
| super narrow cores won't starve them for work.
| abeppu wrote:
| > On existing hardware, the gains in compute and memory
| efficiency are significant, without performance degradation (as
| tested by the authors).
|
| Did they actually show absence of performance degradation?
|
| I think it's conspicuous that Table 1 and Table 2 in the paper,
| which show perplexity and accuracy results respectively, are
| only for small model sizes, whereas Figure 2, Figure 3
| (latency, memory, energy consumption) and Table 3 (throughput)
| all show larger model sizes. So it seems like they had every
| opportunity to show the perplexity/accuracy comparisons at the
| larger model sizes, but did not include them.
| cs702 wrote:
| Others have already made the same point in this thread. See
| my response here:
| https://news.ycombinator.com/item?id=39539508
| vessenes wrote:
| I'd be VERY cautious about being excited here.
|
| My priors are like this:
|
| 1. Initial training of a neural network moves all weights
| around a large amount at first.
|
| 2. Later training of the network adjusts them a small amount.
|
| 3. An undertrained network will therefore look a lot like
| figuring out "positive, negative, or 0?" for each node during
| early training.
|
| If all these things are true, then
|
| 1. Early training of an fp16 network and a bitnet with 0 added
| will be _roughly_ similar in results
|
| 2. Later training will yield different / worse results, as the
| network gets into the 'fine tuning' part of the training.
|
| I think the paper's stats back these priors up -- they say
| "this works on (3B+) large networks, but not small ones." They
| then imply there's something about the structure of a large
| network that allows a bitnet to do well. It seems more likely
| to me it works on large networks because they have not put the
| compute into 3B+ networks to get past the 'gross tuning' phase.
|
| The networks they do have the compute to get 'fully' trained --
| those networks don't show the results.
|
| Also, a quick reminder that a perplexity of 12 is really terrible.
| You would not want to use such a network. Hopefully I'm wrong
| and we can get something for free here! But I'm cautious-to-
| skeptical.
| mise_en_place wrote:
| Intuitively I've always been a bit skeptical of quantization.
| Wouldn't there be a tiny loss in precision by doing this type
| of quantization? I could imagine the error function
| increasing by utilizing these types of techniques.
| eightysixfour wrote:
| It does increase the "error" (meaning it is less likely to
| predict the next word when compared against a dataset) but
| the losses are lower than your intuition would guide you to
| believe.
| int_19h wrote:
| Quantization does reduce quality of the outputs. But the
| point is that you save enough memory doing so that you can
| cram a larger model into the same hardware, and this more
| than compensates for lost precision.
| spencerchubb wrote:
| Yes each weight will not be able to "learn" as much if it
| has less bits of precision. But the idea is that you can
| use more weights, and the big question is whether these
| low-precision weights can make the model more accurate, as
| a whole.
| thesz wrote:
| John Carmack pointed out (and I learned it here at HN) that
| what training really needs is the *sign" of each individual
| gradient parameter. I.e., you can quantize gradient to -1,
| 0 and 1 and still have neural network learn much of the
| dataset.
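| A minimal sketch of that idea (a signSGD-style update in plain
| NumPy; illustrative only, not the recipe from the paper or from
| Carmack):
|
|   import numpy as np
|
|   def sign_sgd_step(weights, grads, lr=0.01):
|       # Keep only the sign of each gradient entry: -1, 0, or +1.
|       return weights - lr * np.sign(grads)
|
|   w = np.random.randn(4, 4).astype(np.float32)
|   g = np.random.randn(4, 4).astype(np.float32)
|   w = sign_sgd_step(w, g)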
| svantana wrote:
| Wait, are we reading the same paper? What I'm seeing is
| comparable accuracy to unquantized models for <4B params, and
| nothing reported for larger models except resource
| consumption.
| vessenes wrote:
| Nope, you're right, I got the table inverted in my head.
| I'm updating my top comment.
| gliptic wrote:
| > Also, a quick reminder that Perplexity 12 is really
| terrible.
|
| The 3B model had a perplexity of 9.91, less than LLaMa 1 in
| fp16.
| cs702 wrote:
| Thank you. Your key point -- that so far all models with the
| proposed methods may have been only "grossly trained" -- is
| compelling. If I understand the authors correctly, they
| trained the compared models on only 100B tokens, all drawn
| from RedPajama, to make the comparisons apples-to-apples.
| That seems sensible to me, and makes replication easier, but
| I agree we need to see more extensive testing, after more
| extensive pretraining, on models of larger sizes.
| gliptic wrote:
| They also trained 3B with 2 trillion tokens.
|
| > The number of training tokens is a crucial factor for
| LLMs. To test the scalability of BitNet b1.58 in terms of
| tokens, we trained a BitNet b1.58 model with 2T tokens
| following the data recipe of StableLM-3B [ TBMR], which is
| the state-of-the-art open-source 3B model.
|
| > [..]
|
| > Our findings shows that BitNet b1.58 achieves a superior
| performance on all end tasks, indicating that 1.58-bit LLMs
| also have strong generalization capabilities.
| cs702 wrote:
| You're right. Thank you for pointing that out!
| vessenes wrote:
| Update - I'm still cautious about this paper, but I had the
| table numbers inverted in my head while thinking about it.
| The paper shows better perplexity results than competing
| models at larger parameter sizes, so I was wrong.
| flockonus wrote:
| Considering how much faster additions are processed, and how a
| particular silicon chip could be optimized for this very
| specific case, all parts added together could perhaps show a
| >100x speed-up vs current systems.
|
| I must concur, "wow".
| sva_ wrote:
| I'm also curious about the potential speed gains in automatic
| differentiation, as there are way fewer branches to 'go up'. Or
| am I wrong here?
| lumost wrote:
| They actually use a relu to represent the model weights. But
| I'm not convinced that this can't be avoided. We do gradient
| boosted decision tree training without this trick.
| PaulHoule wrote:
| I am not startled at all. Dense vector representations are
| pretty silly, they can't really be the road to knowledge
| representation.
| p1esk wrote:
| Ternary networks have been used since 2015. There are hundreds
| of papers. They all require full QAT (training from scratch).
| Not sure why you're shocked.
| acchow wrote:
| Conversely, this also implies our current model sizes can still
| embed a _ton_ more "understanding"
| Noe2097 wrote:
| There is another _shocking_ realization in this work: there are
| 11 types of people: those who know what binary means, those who
| don't, and those who say they do but actually don't.
|
| "The era of 1-bit LLMs"
|
| Representing { -1, 0, 1 } can't be done with 1-bit, I'm sorry
| -- and sad, please let's all get back to something vaguely
| sound and rigorous.
| npunt wrote:
| Ternary supporters are always bitter about this
|
| (I'll let myself out)
| AaronFriel wrote:
| In undergrad, some of us math majors would joke that there's
| really only three quantities: 0, 1, infinity.
|
| So, do we need the -1, and/or would a 2.32 bit (5 state, or 6
| with +/-0) LLM perform better than a 1.58 bit LLM?
| fzliu wrote:
| This will be big for FPGAs - adders are extremely cheap
| compared to multipliers and other DSP blocks.
| paul_mk1 wrote:
| Fun to see ternary weights making a comeback. This was hot back
| in 2016 with BinaryConnect and TrueNorth chip from IBM research
| (disclosure, I was one of the lead chip architects there).
|
| The authors seem to have missed the history. They should at least
| cite BinaryConnect or Straight-Through Estimators (not my
| work).
|
| Helpful hint to authors: you can get down to 0.68 bits / weight
| using a similar technique, good chance this will work for LLMs
| too.
|
| https://arxiv.org/abs/1606.01981
|
| This was a passion project of mine in my last few months at IBM
| research :).
|
| I am convinced there is a deep connection to understanding why
| backprop is unreasonably effective, and the result that you can
| train low-precision DNNs; for those not familiar, the
| technique is to compute the loss w.r.t. the low-precision
| parameters (e.g. projected to ternary) but apply the gradient to a
| high-precision copy of the parameters (known as the straight-
| through estimator). This is a biased estimator and there is no
| theoretical underpinning for why this should work, but in
| practice it works well.
|
| My best guess is that it is encouraging the network to choose
| good underlying subnetworks to solve the problem, similar to the
| Lottery Ticket Hypothesis. With ternary weights it is just
| about who connects to whom (i.e., a graph), and not about the
| individual weight values anymore.
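| A minimal PyTorch-style sketch of the straight-through estimator
| described above (the threshold, init and layer shape here are
| made-up illustrations, not the paper's exact quantizer):
|
|   import torch
|
|   def ternarize(w, threshold=0.05):
|       # Project each weight to {-1, 0, +1}.
|       return torch.sign(w) * (w.abs() > threshold).float()
|
|   class TernaryLinear(torch.nn.Module):
|       def __init__(self, in_f, out_f):
|           super().__init__()
|           # High-precision "master" weights that the gradient updates.
|           self.weight = torch.nn.Parameter(torch.randn(out_f, in_f) * 0.02)
|
|       def forward(self, x):
|           w_q = ternarize(self.weight)
|           # Straight-through: the forward pass sees w_q, the backward
|           # pass treats quantization as identity and updates self.weight.
|           w_ste = self.weight + (w_q - self.weight).detach()
|           return x @ w_ste.t()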
| nutate wrote:
| Triggered by the use of 1-bit to describe a trit.
| checker659 wrote:
| If all the weights are either 1, 0 or -1, isn't this what
| biological neurons do?
| nathan_compton wrote:
| Not even remotely. I suppose you could kind of say that
| activations are boolean in the sense that neurons emit spikes,
| but arguably significant information is encoded in spike
| timing.
| w-m wrote:
| I was reading _Exposing Floating Point_ today (as Airfoil is on
| the HN front page and I was perusing the archive of the author).
| It's a blog explaining the inner workings of floating point
| representations. About zero values it says [0]:
|
| > Yes, the floating point standard specifies both +0.0 and -0.0.
| This concept is actually useful because it tells us from which
| "direction" the 0 was approached as a result of storing value too
| small to be represented in a float. For instance -10e-30f /
| 10e30f won't fit in a float, however, it will produce the value
| of -0.0.
|
| The authors of the LLM paper use the values {-1, 0, 1}.
| Connecting the two ideas, I'm now wondering whether having a
| 2-bit {-1, -0, 0, 1} representation might have any benefit over
| the proposed 1.58 bits. Could the additional -0 carry some
| pseudo-gradient information, ("the 0 leaning towards the negative
| side")?
|
| Also, I've seen 2-bit quantizations being proposed in other LLM
| quantization papers. What values are they using?
|
| [0] https://ciechanow.ski/exposing-floating-point/#zero
| rfoo wrote:
| Interesting, how do you use -0 in the add, then? Is -0+1-1 a 0
| or a -0?
|
| > Could the additional -0 carry some pseudo-gradient
| information
|
| It looks like training was done on fp32 or bf16. Low-bit
| quantization is approximated with STE during training. I'd
| expect training itself to cause each weight to "polarize" towards
| 1 or -1.
|
| > 2-bit quantizations being proposed
|
| Symmetric (i.e. without 0) exponential values were pretty
| popular IIRC.
| w-m wrote:
| > how do you use -0 in the add
|
| In my mind the two zero values would represent a tiny epsilon
| around 0, let's say -0.01 and +0.01. Looking at them like
| this, it would mean:
|
|   +0 +0 -0 = +0
|   +0 -0 -0 = -0
|   +1 * +0 = +0
|   -1 * +0 = -0
|
| Performing addition with the same sign count in each group
| would be problematic. How to decide on the sign of +0-0 or
| +1-1, other than flipping a coin?
| creshal wrote:
| > Could the additional -0 carry some pseudo-gradient
| information, ("the 0 leaning towards the negative side")?
|
| Probably, but is it worth the cost? One of the goals behind
| BitNet and this paper is to find a way to implement LLMs as
| efficiently in hardware as possible, and foregoing floating
| point semantics is a big part of it. I'm not sure if there's a
| way to encode -0 that doesn't throw out half the performance
| gains.
| SushiHippie wrote:
| But if I understand it correctly, they already need to use 2
| bits, one for the sign and another one for the value, so
| there is already one wasted state, which could be used for
| -0.
| pennomi wrote:
| You can pack two trits into three bits, however. So one
| byte could hold 5 values instead of 4.
| para_parolu wrote:
| Can a processor perform addition on them effectively?
| threatripper wrote:
| How exactly would you do that? 3 states need 1.58 bits,
| which is a tad more than 1.5. Two 3-states have 3^2 = 9
| states while three bits only give you 2^3 = 8 states.
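| One common answer (a sketch, not anything from the paper): give up
| on an exact bit packing and store 5 trits per byte instead, since
| 3^5 = 243 <= 256. Packing and unpacking are a few integer ops:
|
|   def pack5(trits):
|       # trits: 5 values from {-1, 0, +1} -> one byte (base-3 digits).
|       assert len(trits) == 5
|       b = 0
|       for t in reversed(trits):
|           b = b * 3 + (t + 1)      # map {-1, 0, 1} -> {0, 1, 2}
|       return b                      # 0..242, fits in one byte
|
|   def unpack5(b):
|       out = []
|       for _ in range(5):
|           out.append(b % 3 - 1)     # map {0, 1, 2} -> {-1, 0, 1}
|           b //= 3
|       return out
|
|   assert unpack5(pack5([1, -1, 0, 0, 1])) == [1, -1, 0, 0, 1]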
| BenoitEssiambre wrote:
| Low-bit parameters are always talked about in terms of performance
| benefits, but I wonder whether allowing the LLM to combine
| parameters to represent values means it can select the resolution
| of each value, that is, use a kind of internal scientific notation
| to track the uncertainty of values. More low-bit parameters
| combined together can mean more precision and resolution; fewer
| can mean more uncertainty. This might allow the LLM to better
| calibrate the uncertainty of its knowledge in a Bayesian way, to
| prevent hallucinations from the overconfidence you get from
| overfitting on too many bits.
| singularity2001 wrote:
| So we almost go back full circle to human (animal) brain binary
| spikes?
| concrete_head wrote:
| It's not quite spikes, but it's getting closer to the idea. I'm
| amazed it has taken this long for this type of thing to reach
| HN, which gives next to no attention to spiking neural networks.
|
| Simon Thorpe, a CNRS researcher has got some fascinating papers
| and lectures on YouTube on using binary weights on neuromorphic
| hardware which has had practical applications for over 20 years
| already.
|
| I made an account just to drop his name somewhere on this
| forum.
| elromulous wrote:
| So for the uninitiated (me), does this mean the input is not a
| float (i.e. is quantized on input), such that all the math can be
| done with int operations?
|
| This seems almost too good to be true.
|
| Edit: Answering my own question, yes. The details are in the
| original bitnet paper: https://arxiv.org/abs/2310.11453
| Mizza wrote:
| I hope somebody gives this team access to the good data and a lot
| of crunch, I'd love to see what happens when you train the big
| fella.
| rapatel0 wrote:
| The mathematics of BNNs is sound. The Shannon entropy of a
| word is really small (I vaguely remember ~2 bits). Also, all
| neural networks are ridiculously over-provisioned.
|
| I worked on this 7 years ago, trying to efficiently binarize CNNs
| from existing models. The difficulty was getting training running
| without the losses going too high. I think that vision models will
| be much more difficult to binarize, but you might not need to with
| CLIP if the vision encoder stays in regular math {fp16, int8}.
| ein0p wrote:
| How is it a 1-bit LLM if 2 bits are required for each weight (and
| one of the 4 possible states is wasted to be able to represent 0)?
| ricardobeat wrote:
| As someone else pointed out here, you can store 5 ternary
| values in 1 byte, 3^5 == 243.
| ein0p wrote:
| That's still not 1 bit, and that would basically destroy
| whatever perf advantage you might hope to get if you want to
| keep the model in memory in that format rather than unpack it
| on load.
| anon291 wrote:
| This is something that's been tried many times before. 1-bit to
| 2-bit models and binary NNs have a long history.
| superdisk wrote:
| Is there anything about this specific to LLMs, or could you use
| it for any transformer based model? It seems like they made a
| modified transformer.
| fgfm wrote:
| It's funny how discoveries in NLP & computer vision complement
| each other. The replacement of multiplications by additions made
| me think of the AdderNet paper
| (https://arxiv.org/abs/1912.13200), which concluded that you
| suffer almost no performance drop.
|
| Perhaps the accumulators in current hardware cannot leverage this
| to its full potential, but combined with such strict
| quantization, this would open LLMs to the wider ML community much
| earlier than expected (when consumer hardware allows you to train
| near-SOTA LLMs from scratch on your machine).
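| To make the "multiplications become additions" point concrete,
| here is a toy matrix-vector product with ternary weights where
| each output is just sums and differences of activations (NumPy,
| purely illustrative, not AdderNet's or the paper's kernel):
|
|   import numpy as np
|
|   def ternary_matvec(W, x):
|       # W has entries in {-1, 0, +1}; each output row is
|       # (sum of x where W == +1) - (sum of x where W == -1).
|       plus = np.where(W == 1, x, 0.0).sum(axis=1)
|       minus = np.where(W == -1, x, 0.0).sum(axis=1)
|       return plus - minus
|
|   W = np.array([[1, 0, -1], [0, 1, 1]])
|   x = np.array([0.5, -2.0, 3.0])
|   print(ternary_matvec(W, x))   # [-2.5  1. ]
|   print(W @ x)                  # same result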
| gojomo wrote:
| That's not a 'bit' ("Binary digIT"). It's closer to a 'trit'
| ("TeRnary-digIT"). Specifically, ternary digits spanning {-1, 0,
| 1} (rather than the usual {0, 1, 2} in a base-3 numbering system)
| are 'balanced ternary'.
|
| A great intro to the theoretical reasons ternary might have some
| promise in computing is this 2001 article from 'American
| Scientist', "Third Base", which quotes Knuth calling balanced-
| ternary "perhaps the prettiest numbering system of all" and also
| discusses an abortive Soviet effort in the direction of ternary
| computing:
|
| http://web.archive.org/web/20011205185830/http://americansci...
|
| In an aside, the article hints that _e_ -nary digits (base
| 2.718...) if somehow made practical/meaningful, might actually be
| better than ternary (or perhaps even optimal?).
|
| So maybe this paper's observation that ~"1.58 bits" (log2(3)
| binary digits) is a sweet spot could be further refined into some
| method for representing the state of an e-nary-modeled algorithm
| in log2(e) binary digits (~"1.44 bits") per underlying e-it.
|
| (As it may be of renewed interest, I've also put this 2001
| "American Scientist" base-3 intro as a new HN submission for
| discussion: https://news.ycombinator.com/item?id=39541756)
| bee_rider wrote:
| It is obviously pretty common to represent matrices with lots
| of zeros in a sparse format, like csr or something. I wonder if
| they could get away with 1-bit representation using a sparse
| matrix. Of course, it would be a little different from a
| typical sparse matrix because there's no problem normally
| having a zero-value in a structurally non-zero location.
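| A sketch of that idea (hypothetical layout, not from the paper):
| a CSR-like index structure for the nonzeros plus a single sign
| bit per stored entry, instead of a float value:
|
|   import numpy as np
|
|   def to_sign_csr(W):
|       # W entries in {-1, 0, +1}; store only the nonzeros.
|       indptr, indices, signs = [0], [], []
|       for row in W:
|           nz = np.flatnonzero(row)
|           indices.extend(nz.tolist())
|           signs.extend((row[nz] > 0).tolist())  # True: +1, False: -1
|           indptr.append(len(indices))
|       return indptr, indices, signs
|
|   def sign_csr_matvec(indptr, indices, signs, x):
|       y = np.zeros(len(indptr) - 1)
|       for r in range(len(y)):
|           for k in range(indptr[r], indptr[r + 1]):
|               y[r] += x[indices[k]] if signs[k] else -x[indices[k]]
|       return y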
| no_identd wrote:
| See also:
|
| https://en.wikipedia.org/wiki/Nat_(unit) (make sure to read the
| footnotes, too)
|
| Edit: See also also, on the radix economy of balanced ternary
| (called "tristate") vs base 3:
| https://web.archive.org/web/20090312094241/http://abhijit.in...
| + a wild Marvin Minsky appears: https://archive.fo/gL2Bv
|
| That page also brings up the whole "but division" problem with
| balanced ternary, however, I personally suspect that
| http://degiorgi.math.hr/aaa_sem/Div_Krishna/887-889.pdf ("A
| Division Algorithm for Signed-Digit Arithmetic" by Chin Tung,
| from 1968 !) might offer an overlooked path to a solution to
| that problem
|
| And see also also2, this quote from TAOCP:
|
| "Cauchy pointed out that negative digits make it unneccesary
| for a person to memorize the multiplication table past 5x5."
|
| The--INCREDIBLY ANNOYING TO LOCATE--source for which is "105.
| Calculs numeriques. sur les moyens d'eviter les erreurs dans
| les calculs numeriques." on Pdf page 445/document page 431
| here:
|
| https://www.e-rara.ch/download/pdf/5702285?name=Tome%2520V%4...
|
| See also also3:
| https://pdfs.semanticscholar.org/5f77/b1cf105024b41b6824ba91...
| (Vince, Andrew - Radix Representation and Rep-Tiling)
|
| ( +a vaguely related paper here on quantum mechanics & radix
| economy, BUT it makes the mistake of using an overly specific
| formula applicable only to unsigned-digit representations thus
| drawing the wrong conclusions:
| https://www.researchgate.net/profile/Vladimir_Garcia-Morales...
| )
| nighthawk454 wrote:
| Base e is the optimal base for number representation, so that's
| probably why. Followed by base 3, then base 2.
|
| https://en.m.wikipedia.org/wiki/Radix_economy
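| The "radix economy" behind that ranking is roughly proportional to
| b / ln(b) (states per digit times digits needed); a quick check of
| the standard formula (not something from the thread):
|
|   import math
|
|   for b in (2, math.e, 3, 4, 10):
|       print(b, b / math.log(b))   # lower is better; minimum at b = e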
| o11c wrote:
| FSVO "optimal". In practice, both physical reality and
| algorithm design strongly favors base 2.
| nighthawk454 wrote:
| Yeah, specifically, the definition of optimal provided:
| radix economy. There are plenty of other considerations one
| could make in other contexts. Practically, a transcendental
| base seems... rather impractical. And base 3 is probably not so
| much 'more optimal' than base 2 as to warrant the extra
| electrical complexity, for example.
| dekhn wrote:
| How useful are -0 and 0? You could splurge on two bits per
| value which gives you { -1, -0, 0, 1 }
| ant6n wrote:
| 3^5=243. so use a byte to represent a vector of 5 ternary
| values, leaving some possible signaling values.
| Razengan wrote:
| Why not a tit?
| jdiff wrote:
| Because bi- is two, tri- is three. Ti- is meaningless, and
| not good enough of a joke to make up for it.
| esafak wrote:
| They renamed the biggest ML conference (NIPS) over the same
| joke, so don't count on it.
| kleiba wrote:
| Note that they're not claiming that their LLM is 1-bit -
| they're saying that there is a 1-bit era of LLMs. What they do
| say is that their approach is _a variant_ of a 1-bit LLM,
| namely a ternary LLM (they explicitly state that in the
| abstract).
| brunooliv wrote:
| Do the implications at a practical level mean that the size of
| gguf files will become smaller?
| klysm wrote:
| Does this mean we can compile LLMs to run on FPGAs directly?
| loa_in_ wrote:
| I don't know if ternary gate arrays are a thing, but if so then
| yes.
| karmasimida wrote:
| This is exciting news. If the 8B numbers are true, we can already
| use a model like Mixtral 8x7B, even with a single GPU?
|
| But further into the development, we need comparisons to larger
| model sizes. 70B might be too much to ask, but 13B should be
| there at least.
| cjbprime wrote:
| You could already run Mixtral on the more expensive single
| consumer GPUs (with 24GB VRAM) before this paper, at e.g.
| 3-bits per weight.
| Havoc wrote:
| If true then I'm guessing this would make ASICs for this far more
| simple too, right?
| Avisite wrote:
| Does quantization need to be an all or nothing? with the kind of
| low bit models we have seen, my assumption would be that only
| certain weights would benefit from the extra precision. A mixture
| of precision with 2-bit, 3-bit, to 8-bit weights might perform
| well, but I am unsure if any training process could identify the
| weights that need the extra precision.
| Blackthorn wrote:
| Is there any rigorous way to answer the question of how much
| information (be it entropy or some other measurement) is
| contained in a model's weights?
| oxxoxoxooo wrote:
| Prior art:
|
| Binarized Neural Networks: Training Deep Neural Networks with
| Weights and Activations Constrained to +1 or -1
|
| https://arxiv.org/abs/1602.02830
|
| Ternary Neural Networks for Resource-Efficient AI Applications
|
| https://arxiv.org/abs/1609.00222
| kandu wrote:
| Also: training neural networks by turning connections on and
| off, or by just flipping the sign of the weights:
| https://arxiv.org/abs/2006.16627
| modeless wrote:
| Maybe a silly question but nonlinearity is important for neural
| nets. Wouldn't it make more sense for the three values to be e.g.
| (2, 0, -1) so they are not collinear?
|
| Also, what are the prospects for FPGA implementations of this?
| bilsbie wrote:
| This really just sounds absurd. How can ternary possibly encode
| enough information?
|
| Anyone willing to explain it like I'm a Django developer who
| watched half a karpathy video?
| bilsbie wrote:
| How would you use this in something like PyTorch? There's no
| ternary data type.
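| There isn't a ternary dtype; in practice you'd emulate one, e.g.
| keep the ternary values in an int8 tensor plus a per-tensor scale
| and dequantize (or call a custom kernel) at matmul time. A rough
| sketch of that emulation, not an official API and not the paper's
| exact recipe:
|
|   import torch
|
|   def quantize_ternary(w_fp):
|       # Scale by the mean absolute value, then round to {-1, 0, +1}.
|       scale = w_fp.abs().mean().clamp(min=1e-8)
|       w_t = (w_fp / scale).round().clamp(-1, 1).to(torch.int8)
|       return w_t, scale
|
|   def ternary_linear(x, w_t, scale):
|       # Dequantize on the fly; a dedicated kernel would skip the
|       # multiply and just add/subtract activations.
|       return x @ (w_t.to(x.dtype) * scale).t()
|
|   w = torch.randn(8, 16)
|   w_t, s = quantize_ternary(w)
|   y = ternary_linear(torch.randn(2, 16), w_t, s)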
| bilsbie wrote:
| Could there be some value in recognizing areas where the model
| needs finer grained weights and somehow using a different data
| type just in certain areas?
| fabiospampinato wrote:
| It seems tough to do. Besides, I'm not sure what the benefit
| would be: with that you can't do the optimized matrix
| multiplication anymore, and if you need more precision you can
| presumably just add more neurons and/or train for longer and/or
| with better data.
| kouru225 wrote:
| Ok can someone catch me up to speed on LLM hardware requirements?
| Last I looked I needed a 20 gb vram card to run a good one. Is
| that not true anymore?
| SushiHippie wrote:
| Not true anymore, but it also highly depends on what your
| definition of "a good one" is.
|
| Many people find Mistral 7B to be excellent, around GPT-3.5
| level of good.
|
| Mistral 7B normally requires something like 20 GB of VRAM, but
| with llama.cpp and quantization you could even run it on your
| phone (albeit at reduced quality).
|
| Quantizations >= q4_K_M seem to provide nearly as good responses
| as the unquantized model, and q4_K_M only needs ~7 GB of VRAM.
|
| See the table here:
|
| https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGU...
|
| Using ollama you can get up and running even a bit faster than
| with llama.cpp directly (ollama uses llama.cpp under the hood).
| kouru225 wrote:
| Oh Jesus so basically it's very feasible for me to run my own
| local llm on a NAS or a server or something... well I guess
| it's time for me to get on with the times...
|
| Thanks!
| esha_manideep wrote:
| These models are compatible with llama.cpp out of the box;
| we (GigaML - https://gigaml.com) are planning to train a small
| model (3-4B, 1-bit, open source) with the latest Stack v2 dataset
| released today. Let me know if anyone is interested in
| collaborating with us.
| arunk47 wrote:
| Okay wait, can I train my own llm yet?
| eigenvalue wrote:
| Is it really so surprising that something like this works given
| how human brain neurons work? My admittedly basic understanding
| is that these operate through an all-or-nothing principle for
| their action potentials (firing): they either fire or they don't,
| based on whether the input signals reach a certain threshold. So
| the output is already sort of binary in biological neurons. The
| inputs are more like continuous values, since they are the sum of
| many different neurons sending signals into each neuron, but in
| this paper the activations are 8-bit, not binary/ternary. Can any
| neuroscientists here comment?
| m00x wrote:
| This isn't really how neurons work.
|
| First of all, they operate independently of a synchronized clock,
| and they can also accumulate signals instead of executing on an
| input. Neuromorphic chips are closer to how the brain works,
| but they're still super early. I believe Intel has the best one
| with the Loihi 2.
|
| (Not a neuroscientist but my wife is and that's what I
| understand from our chats)
___________________________________________________________________
(page generated 2024-02-28 23:00 UTC)