[HN Gopher] 1-Bit AI Infrastructure
___________________________________________________________________
1-Bit AI Infrastructure
Author : galeos
Score : 142 points
Date : 2024-11-15 14:28 UTC (5 days ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| ttyprintk wrote:
| A later a4.8 quantization paper by some of the same team:
|
| https://news.ycombinator.com/item?id=42092724
|
| https://arxiv.org/abs/2411.04965
| skavi wrote:
| and the repo for this project:
| https://github.com/microsoft/BitNet
| sinuhe69 wrote:
| The demo they showed was full of repeated sentences. The 3B
| model looks quite dense, TBH. Did they just want to show the
| speed?
| newswasboring wrote:
| 3B models, especially in quantized state, almost always
| behave like this.
| dailykoder wrote:
| I first read about this a few weeks ago and found it very
| interesting.
|
| Now that I have done more than enough CPU design inside FPGAs, I
| wanted to try something new, some computation-heavy things that
| could benefit from an FPGA. Does anyone here know how feasible
| it'd be to implement something like that on an FPGA? I only have
| rather small chips (an Artix-7 35T and a PolarFire SoC with 95k
| logic slices), so I know I won't be able to fit a full LLM into
| them, but something should be possible.
|
| Maybe I should refresh the fundamentals though and start with
| MNIST. But the question is rather: What is a realistic goal that
| I could possibly reach with these small FPGAs? Performance might
| be secondary, I am rather interested in what's possible regarding
| complexity/features on a small device.
|
| Also, has anyone here compiled OpenCL (or OpenGL?) kernels for
| FPGAs and can give me a starting point? I was wondering if it's
| possible to have a working backend for something like
| tinygrad[1]. I think this would be a good way to learn how all
| the different layers of such frameworks actually work.
|
| - [1] https://github.com/tinygrad/tinygrad
| svantana wrote:
| Couldn't you implement a bitnet kernel, and use that as a co-
| processor to a PC? Or is the I/O bandwidth so low that it won't
| be worth it?
| dailykoder wrote:
| Since I don't have a board with PCIe port the fastest I could
| get is 100MBit ethernet, i think. Or rather use the Microchip
| board which has a hard RISC-V quad core processor on it
| connected via an AXI-Bus with the FPGA fabric. The CPU itself
| run at only 625MHz, so there is huge potential to speed up
| some fancy computation
| mysteria wrote:
| Even with a PCIe FPGA card you're still going to be memory
| bound during inference. When running llama.cpp on the CPU
| alone, memory bandwidth, not CPU power, is always the
| bottleneck.
|
| Now if the FPGA card had a large amount of GPU tier memory
| then that would help.
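| As a rough sketch of why decode is bandwidth bound: each
| generated token has to stream essentially every weight from
| memory once, so bandwidth divided by model size caps the token
| rate. The model size and bandwidth numbers below are
| illustrative assumptions, not measurements:

```python
def max_tokens_per_sec(model_bytes, bandwidth_bytes_per_sec):
    """Upper bound on autoregressive decode speed.

    Each generated token must stream essentially every weight
    from memory once, so throughput is capped by
    bandwidth / model size, no matter how fast the compute is.
    """
    return bandwidth_bytes_per_sec / model_bytes

# e.g. a 7B model quantized to ~4 bits (~3.5 GB) on a desktop
# with ~50 GB/s of DDR bandwidth: capped near 14 tokens/s
print(round(max_tokens_per_sec(3.5e9, 50e9), 1))
```

| This is also why shrinking weights to ~1.58 bits helps even if
| the unpacking costs some extra compute.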
| verytrivial wrote:
| You gain potential parallelism with an FPGA, so with very
| small "at the edge" models they could speed things up, right?
| But the models are always going to be large, so memory
| bandwidth is going to be a bottleneck unless some very fancy
| FPGA memory "fabric" is possible. Perhaps for extremely low
| latency classification tasks? I'm having trouble picturing
| that application though.
|
| The code itself is surprisingly small/tight. I've been playing
| with llama.cpp for the last few days. The CPU-only archive is
| like 8 MB on GitHub, and there is no memory allocation at
| runtime. My _ancient_ laptop (as in 2014!) is sweating but
| producing spookily good output with quantized 7B models.
|
| (I'm mainly commenting to have someone correct me, by the way,
| since I'm interested in this question too!)
| dailykoder wrote:
| > Perhaps for extremely low latency classification tasks? I'm
| having trouble picturing that application though.
|
| Possibly, yes. I have no concrete plans yet. Maybe language
| models are the wrong area though. Some general image
| classification or object detection would be neat (say, lane
| detection with a camera or something like that).
| tgv wrote:
| Real-time translation or speech transcription for the
| hearing-impaired onto AR-glasses? Now you've got a good
| reason to make it look like a Star Trek device.
|
| Or glasses that can detect threats/opportunities in the
| environment and call them out via ear plugs, for the
| vision-impaired.
| UncleOxidant wrote:
| Lower latency, but also much lower power. This sort of thing
| would be of great interest to companies running AI
| datacenters (which is why Microsoft is doing this research,
| I'd think). Low latency is also quite useful for real-time
| tasks.
|
| > The code itself is surprisingly small/tight. I've been
| playing with llama.cpp for the last few days.
|
| Is there a bitnet model that runs on llama.cpp? (Looks like it:
| https://www.reddit.com/r/LocalLLaMA/comments/1dmt4v7/llamacp...)
| Which bitnet model did you use?
| nickpsecurity wrote:
| This submission should help you:
|
| https://news.ycombinator.com/item?id=41470074
| dailykoder wrote:
| Thanks!
| UncleOxidant wrote:
| I've had the same idea. One way to go about it would be to
| modify an existing RISC-V cpu to include the ternary math ops
| to accelerate bitnet operations. And vector/matrix extensions
| based on those. Then your LLM is implemented in RISC-V assembly
| using those extensions. (It would be possible to do some work
| on the LLVM backend so you could use a C implementation of the
| LLM, but that starts to be a lot of work. Also, we'd need 2 bit
| signed int types in C.)
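| The inner loop such ternary extensions would accelerate is
| simple; a minimal Python sketch (illustrative only, not the
| actual RISC-V implementation):

```python
def ternary_dot(weights, xs):
    """Dot product with ternary weights in {-1, 0, 1}.

    No multiplications at all: a weight of 1 adds the
    activation, -1 subtracts it, and 0 skips it. A custom
    ALU or vector unit would do many of these per cycle.
    """
    acc = 0
    for w, x in zip(weights, xs):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
    return acc

print(ternary_dot([1, -1, 0, 1], [3, 5, 7, 2]))  # 3 - 5 + 2 = 0
```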
|
| A completely different approach is differentiable logic
| networks. You end up with a logic-gate network after training.
| This logic gate network would be very easy to translate into
| Verilog or VHDL. https://github.com/Felix-Petersen/difflogic
| WiSaGaN wrote:
| I would expect research along these lines to pick up quite a
| bit if we confirm that the pretraining stage is not scaling as
| previously expected; the scale and architecture would then be
| more stable in the near future, especially if the focus shifts
| to inference-time scaling.
| sva_ wrote:
| It seems like arxiv replaced 'bitnet.cpp' with a link 'this
| http url', even though '.cpp' is clearly not a TLD. Poor regex?
| bc569a80a344f9c wrote:
| Sort of. And not on the author's side.
|
| https://academia.stackexchange.com/questions/132315/how-to-a...
| Joker_vD wrote:
| > '.cpp' is clearly not a TLD.
|
| Is it that clear? Because e.g. .app and .cpa _are_ TLDs. So
| are .py and .so.
| Natfan wrote:
| and .com is a TLD[0] and also a file type[1], to further
| complicate matters.
|
| ---
|
| [0]: https://en.wikipedia.org/wiki/.com
|
| [1]: https://en.wikipedia.org/wiki/COM_file
| js8 wrote:
| It's technically not 1-bit, but 2-bit.
|
| Anyway, I wonder if there is some HW support in modern CPUs/GPUs
| for linear algebra (like matrix multiplication) over Z_2^n ? I
| think it would be useful for SAT solving.
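| Without dedicated hardware, GF(2) matrix multiplication still
| maps nicely onto bitwise ops: AND, then popcount parity. A
| minimal Python sketch (bit-packed rows, with bit j of a row
| holding column j, are an assumption of this sketch):

```python
def gf2_matmul(A, B):
    """Multiply bit-matrices over GF(2).

    A, B: lists of row bitmasks (Python ints); bit j of a row
    is the entry in column j. Entry (i, j) of the product is
    the parity of popcount(row_i(A) AND col_j(B)).
    """
    # Transpose B into column bitmasks.
    width = max(B).bit_length() if B else 0
    cols = []
    for j in range(width):
        col = 0
        for i, row in enumerate(B):
            col |= ((row >> j) & 1) << i
        cols.append(col)
    out = []
    for row in A:
        r = 0
        for j, col in enumerate(cols):
            r |= (bin(row & col).count("1") & 1) << j
        out.append(r)
    return out

# The 2x2 matrix [[1,1],[0,1]] squared over GF(2) is the identity.
print(gf2_matmul([0b11, 0b10], [0b11, 0b10]))  # [1, 2]
```

| On x86 the inner loop would map to AND + POPCNT; the missing
| hardware piece is doing this across whole matrix tiles at once.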
| almostgotcaught wrote:
| Not on CPU/GPU, but on FPGAs finite-field arithmetic is a
| thing; there's plenty of stuff like this around:
| https://ieeexplore.ieee.org/document/4392002
| scarmig wrote:
| There's carry-less multiplication
| (https://en.m.wikipedia.org/wiki/CLMUL_instruction_set),
| introduced by Intel in 2010.
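| In software, a carry-less multiply is just shift-and-XOR, i.e.
| polynomial multiplication over GF(2); a minimal Python sketch
| of what the CLMUL (PCLMULQDQ) instruction computes:

```python
def clmul(a, b):
    """Carry-less multiply of two non-negative ints.

    Like schoolbook binary multiplication, but partial
    products are combined with XOR instead of ADD, so no
    carries propagate (polynomial multiplication over GF(2)).
    """
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

# (x + 1) * (x + 1) = x^2 + 1 over GF(2): no middle term,
# because the two x contributions cancel under XOR.
print(bin(clmul(0b11, 0b11)))  # 0b101
```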
| JKCalhoun wrote:
| Or, technically, 1.58 bit. ;-)
| meindnoch wrote:
| https://en.m.wikipedia.org/wiki/CLMUL_instruction_set
| yalok wrote:
| So basically the idea is to pack 3 ternary weights (-1, 0, 1)
| into 5 bits instead of 6, but they compare the results with an
| fp16 model, which would use 48 bits for those 3 weights...
|
| And the speedup comes from the memory I/O, compensated a bit
| by the need to unpack these weights before using them...
|
| Did I get this right?
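| The arithmetic works out because 3 ternary weights have
| 3^3 = 27 states, which fit in 5 bits (2^5 = 32), i.e. about
| 1.67 bits per weight vs 16 for fp16. A minimal Python sketch
| of such base-3 packing (the actual bitnet.cpp layout may
| differ):

```python
def pack3(w):
    """Pack three ternary weights in {-1, 0, 1} into 5 bits.

    Shift each weight to {0, 1, 2} and encode base-3:
    27 possible codes, all < 32, so they fit in 5 bits.
    (Illustrative; the real kernel layout may differ.)
    """
    assert len(w) == 3 and all(x in (-1, 0, 1) for x in w)
    return (w[0] + 1) * 9 + (w[1] + 1) * 3 + (w[2] + 1)

def unpack3(code):
    """Recover the three ternary weights from a 5-bit code."""
    w2 = code % 3
    w1 = (code // 3) % 3
    w0 = (code // 9) % 3
    return [w0 - 1, w1 - 1, w2 - 1]

print(pack3([-1, 0, 1]))            # 5 (= 0*9 + 1*3 + 2)
print(unpack3(pack3([-1, 0, 1])))   # [-1, 0, 1]
```

| Unpacking is a few cheap integer ops, which is why trading it
| for ~10x less memory traffic vs fp16 is a win.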
| UncleOxidant wrote:
| Yeah, that seems to be the case. Though I suspect Microsoft is
| interested in implementing something like a custom RISC-V CPU
| with an ALU tuned for this ternary math, plus custom
| vector/matrix instructions. Something like that could save
| them a lot of power in their data centers.
|
| If it were to catch on, then perhaps we'd see Intel, AMD, and
| ARM adding ops optimized for ternary math?
| hidelooktropic wrote:
| Does anyone have the actual "this http url"?
| dkrajews wrote:
| https://github.com/microsoft/BitNet
___________________________________________________________________
(page generated 2024-11-20 23:01 UTC)