[HN Gopher] 1-Bit AI Infrastructure
       ___________________________________________________________________
        
       1-Bit AI Infrastructure
        
       Author : galeos
       Score  : 142 points
       Date   : 2024-11-15 14:28 UTC (5 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | ttyprintk wrote:
        | The later a4.8 quantization work by some of the same team:
       | 
       | https://news.ycombinator.com/item?id=42092724
       | 
       | https://arxiv.org/abs/2411.04965
        
         | skavi wrote:
         | and the repo for this project:
         | https://github.com/microsoft/BitNet
        
           | sinuhe69 wrote:
           | The demo they showed was full of repeated sentences. The 3B
           | model looks quite dense, TBH. Did they just want to show the
           | speed?
        
             | newswasboring wrote:
             | 3B models, especially in quantized state, almost always
             | behave like this.
        
       | dailykoder wrote:
        | I first read about this quite a few weeks ago and found it very
        | interesting.
        | 
        | Now that I have done more than enough CPU design inside FPGAs, I
        | want to try something new: some computation-heavy task that
        | could benefit from an FPGA. Does anyone here know how feasible
        | it'd be to implement something like that on an FPGA? I only have
        | rather small chips (an Artix-7 35T and a PolarFire SoC with 95k
        | logic slices). So I know I won't be able to squeeze a full LLM
        | into those, but something should be possible.
        | 
        | Maybe I should refresh the fundamentals though and start with
        | MNIST. But the question is rather: what is a realistic goal that
        | I could possibly reach with these small FPGAs? Performance might
        | be secondary; I am more interested in what's possible regarding
        | complexity/features on a small device.
        | 
        | Also, has anyone here compiled OpenCL (or OpenGL?) kernels for
        | FPGAs and can give me a starting point? I was wondering if it's
        | possible to have a working backend for something like
        | tinygrad[1]. I think this would be a good way to learn all the
        | different layers of how such frameworks actually work.
       | 
       | - [1] https://github.com/tinygrad/tinygrad
        
         | svantana wrote:
         | Couldn't you implement a bitnet kernel, and use that as a co-
         | processor to a PC? Or is the I/O bandwidth so low that it won't
         | be worth it?
        
           | dailykoder wrote:
            | Since I don't have a board with a PCIe port, the fastest I
            | could get is 100 MBit Ethernet, I think. Or rather use the
            | Microchip board, which has a hard RISC-V quad-core processor
            | on it, connected to the FPGA fabric via an AXI bus. The CPU
            | itself runs at only 625 MHz, so there is huge potential to
            | speed up some fancy computation.
        
             | mysteria wrote:
              | Even with a PCIe FPGA card you're still going to be memory
              | bound during inference. When running llama.cpp on a plain
              | CPU, memory bandwidth, not CPU power, is always the
              | bottleneck.
              | 
              | Now if the FPGA card had a large amount of GPU-tier memory,
              | that would help.
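              | 
              | A rough back-of-the-envelope sketch (the bandwidth and
              | model-size numbers below are illustrative assumptions, not
              | measurements):
              | 
              |     #include <cstdio>
              | 
              |     // Memory-bound decode: each generated token streams all
              |     // weights from RAM once, so tok/s <= bandwidth / size.
              |     int main() {
              |         double bw   = 50.0;  // GB/s, assumed DDR5 bandwidth
              |         double fp16 = 14.0;  // GB, ~7B params * 2 bytes
              |         double tern = 1.4;   // GB, ~7B params * ~1.6 bits
              |         std::printf("fp16:    ~%.1f tok/s\n", bw / fp16);
              |         std::printf("ternary: ~%.1f tok/s\n", bw / tern);
              |     }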
        
         | verytrivial wrote:
          | You gain in potential parallelism with an FPGA, so with very
          | small "at the edge" models they could speed things up, right?
          | But the models are always going to be large, so memory
          | bandwidth is going to be a bottleneck unless some very fancy
          | FPGA memory "fabric" is possible. Perhaps for extremely low
          | latency classification tasks? I'm having trouble picturing
          | that application though.
          | 
          | The code itself is surprisingly small/tight. I've been playing
          | with llama.cpp for the last few days. The CPU-only archive is
          | like 8 MB on GitHub, and there is no memory allocation during
          | runtime. My _ancient_ laptop (as in 2014!) is sweating but
          | producing spookily good output with quantized 7B models.
         | 
         | (I'm mainly commenting to have someone correct me, by the way,
         | since I'm interested in this question too!)
        
           | dailykoder wrote:
           | > Perhaps for extremely low latency classification tasks? I'm
           | having trouble picturing that application though.
           | 
            | Possibly, yes. I have no concrete plans yet. Maybe language
            | models are the wrong area though. Some general image
            | classification or object detection would be neat (say, lane
            | detection with a camera or something like that).
        
             | tgv wrote:
              | Real-time translation or speech transcription for the
              | hearing-impaired, displayed on AR glasses? Now you've got a
              | good reason to make it look like a Star Trek device.
              | 
              | Or glasses that can detect threats/opportunities in the
              | environment and call them out via earpieces, for the
              | vision-impaired.
        
           | UncleOxidant wrote:
           | Lower latency, but also much lower power. This sort of thing
           | would be of great interest to companies running AI
           | datacenters (which is why Microsoft is doing this research,
           | I'd think). Low latency is also quite useful for real-time
           | tasks.
           | 
            | > The code itself is surprisingly small/tight. I've been
            | playing with llama.cpp for the last few days.
            | 
            | Is there a bitnet model that runs on llama.cpp? (Looks like
            | it: https://www.reddit.com/r/LocalLLaMA/comments/1dmt4v7/llam
            | acp...) Which bitnet model did you use?
        
         | nickpsecurity wrote:
         | This submission should help you:
         | 
         | https://news.ycombinator.com/item?id=41470074
        
           | dailykoder wrote:
           | Thanks!
        
         | UncleOxidant wrote:
          | I've had the same idea. One way to go about it would be to
          | modify an existing RISC-V CPU to include ternary math ops
          | to accelerate bitnet operations, plus vector/matrix extensions
          | based on those. Then your LLM is implemented in RISC-V assembly
          | using those extensions. (It would be possible to do some work
          | on the LLVM backend so you could use a C implementation of the
          | LLM, but that starts to be a lot of work. Also, we'd need 2-bit
          | signed int types in C.)
         | 
         | A completely different approach is differentiable logic
         | networks. You end up with a logic-gate network after training.
         | This logic gate network would be very easy to translate into
         | Verilog or VHDL. https://github.com/Felix-Petersen/difflogic
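          | 
          | For a feel of what such an instruction would replace, here is
          | a minimal scalar sketch (plain C++, weights packed 4-per-byte
          | as 2-bit codes; the encoding is my own choice, not the actual
          | bitnet.cpp kernel):
          | 
          |     #include <cstdint>
          |     #include <cstdio>
          | 
          |     // 2-bit weight codes: 0b00 = 0, 0b01 = +1, 0b11 = -1.
          |     int32_t ternary_dot(const uint8_t* w, const int8_t* x, int n) {
          |         int32_t acc = 0;
          |         for (int i = 0; i < n; ++i) {
          |             uint8_t c = (w[i >> 2] >> ((i & 3) * 2)) & 0x3;
          |             acc += (c == 1 ? x[i] : c == 3 ? -x[i] : 0);
          |         }
          |         return acc;
          |     }
          | 
          |     int main() {
          |         uint8_t w[1] = {0x35};  // w0=+1, w1=+1, w2=-1, w3=0
          |         int8_t  x[4] = {3, -2, 5, 7};
          |         std::printf("%d\n", ternary_dot(w, x, 4));  // 3 - 2 - 5 = -4
          |     }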
        
       | WiSaGaN wrote:
        | I would expect research along these lines to pick up quite a bit
        | if we confirm that the pretraining stage is not scaling as
        | previously expected. In that case the model scale and
        | architecture would be more stable in the near future, especially
        | if the focus shifts to inference-time scaling.
        
       | sva_ wrote:
        | It seems like arXiv replaced 'bitnet.cpp' with a 'this http url'
        | link, even though '.cpp' is clearly not a TLD. Poor regex?
        
         | bc569a80a344f9c wrote:
         | Sort of. And not on the author's side.
         | 
         | https://academia.stackexchange.com/questions/132315/how-to-a...
        
         | Joker_vD wrote:
         | > '.cpp' is clearly not a tld.
         | 
         | Is it that clear? Because e.g. .app and .cpa _are_ TLDs. So are
         | .py and .so.
        
           | Natfan wrote:
           | and .com is a TLD[0] and also a file type[1], to further
           | complicate matters.
           | 
           | ---
           | 
           | [0]: https://en.wikipedia.org/wiki/.com
           | 
           | [1]: https://en.wikipedia.org/wiki/COM_file
        
       | js8 wrote:
       | It's technically not 1-bit, but 2-bit.
       | 
        | Anyway, I wonder if there is some HW support in modern CPUs/GPUs
        | for linear algebra (like matrix multiplication) over Z_2^n? I
        | think it would be useful for SAT solving.
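        | 
        | For context, without special hardware a Z_2 matrix-vector
        | product already maps onto plain bitwise ops: multiply is AND,
        | add is XOR, so each dot product is just a parity. A minimal
        | sketch of what I mean, for n <= 64:
        | 
        |     #include <bitset>
        |     #include <cstdint>
        |     #include <cstdio>
        | 
        |     // y = A*x over Z_2; rows and vectors are 64-bit masks.
        |     uint64_t gf2_matvec(const uint64_t A[64], uint64_t x) {
        |         uint64_t y = 0;
        |         for (int i = 0; i < 64; ++i) {
        |             uint64_t bit = std::bitset<64>(A[i] & x).count() & 1u;
        |             y |= bit << i;
        |         }
        |         return y;
        |     }
        | 
        |     int main() {
        |         uint64_t A[64];
        |         for (int i = 0; i < 64; ++i) A[i] = 1ull << i;  // identity
        |         std::printf("%llx\n",
        |             (unsigned long long)gf2_matvec(A, 0xdeadbeefULL));
        |     }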
        
         | almostgotcaught wrote:
          | Not CPU/GPU, but on FPGAs finite-field arithmetic is a thing;
          | there's plenty of stuff like this around:
          | https://ieeexplore.ieee.org/document/4392002
        
         | scarmig wrote:
         | There's carry-less multiplication
         | (https://en.m.wikipedia.org/wiki/CLMUL_instruction_set),
         | introduced by Intel in 2010.
        
         | JKCalhoun wrote:
         | Or, technically, 1.58 bit. ;-)
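          | (Each ternary weight carries log2(3) ~= 1.585 bits of
          | information, hence the figure.)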
        
         | meindnoch wrote:
         | https://en.m.wikipedia.org/wiki/CLMUL_instruction_set
        
       | yalok wrote:
        | So basically the idea is to pack 3 ternary weights (-1,0,1) into
        | 5 bits instead of 6, but they compare the results with an fp16
        | model, which would use 48 bits for those 3 weights...
        | 
        | And the speedup comes from reduced memory I/O, offset a bit by
        | the need to unpack these weights before using them...
       | 
       | Did I get this right?
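        | 
        | For concreteness: 3 ternary weights have 3^3 = 27 combinations,
        | which fit into 2^5 = 32 codes. A toy base-3 pack/unpack sketch
        | (not necessarily the layout bitnet.cpp actually uses):
        | 
        |     #include <cstdint>
        |     #include <cstdio>
        | 
        |     // Pack three weights from {-1, 0, +1} into one 5-bit code.
        |     uint8_t pack3(int8_t w0, int8_t w1, int8_t w2) {
        |         return (uint8_t)((w0 + 1) + 3 * (w1 + 1) + 9 * (w2 + 1));
        |     }
        | 
        |     // Recover the three weights from the 5-bit code.
        |     void unpack3(uint8_t code, int8_t w[3]) {
        |         for (int i = 0; i < 3; ++i) {
        |             w[i] = (int8_t)(code % 3) - 1;
        |             code /= 3;
        |         }
        |     }
        | 
        |     int main() {
        |         int8_t w[3];
        |         unpack3(pack3(-1, 0, 1), w);
        |         std::printf("%d %d %d\n", w[0], w[1], w[2]);  // -1 0 1
        |     }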
        
         | UncleOxidant wrote:
          | Yeah, that seems to be the case. Though I suspect Microsoft is
          | interested in implementing something like a custom RISC-V CPU
          | with an ALU tuned for this ternary math, plus custom
          | vector/matrix instructions. Something like that could save
          | them a lot of power in their data centers.
          | 
          | If it were to catch on, then perhaps we'd see Intel, AMD, and
          | ARM adding math ops optimized for ternary arithmetic?
        
       | hidelooktropic wrote:
       | Does anyone have the actual "this http url"?
        
         | dkrajews wrote:
         | https://github.com/microsoft/BitNet
        
       ___________________________________________________________________
       (page generated 2024-11-20 23:01 UTC)