[HN Gopher] Lm.rs: Minimal CPU LLM inference in Rust with no dep...
       ___________________________________________________________________
        
       Lm.rs: Minimal CPU LLM inference in Rust with no dependency
        
       Author : littlestymaar
       Score  : 156 points
       Date   : 2024-10-11 16:46 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | fuddle wrote:
        | Nice work! It would be great to see some benchmarks comparing it
        | to llm.c.
        
         | littlestymaar wrote:
          | I doubt it would compare favorably at the moment; I don't think
          | it's particularly well optimized beyond using rayon to get CPU
          | parallelism and wide for a bit of SIMD.
         | 
         | It's good enough to get pretty good performance for little
         | effort, but I don't think it would win a benchmark race either.
        
       | echelon wrote:
       | This is really cool.
       | 
       | It's already using Dioxus (neat). I wonder if WASM could be put
       | on the roadmap.
       | 
       | If this could run a lightweight LLM like RWKV in the browser,
       | then the browser unlocks a whole class of new capabilities
       | without calling any SaaS APIs.
        
         | marmaduke wrote:
         | I was poking at this a bit here
         | 
         | https://github.com/maedoc/rwkv.js
         | 
          | using rwkv.cpp compiled with emscripten, but I didn't quite
          | figure out the tokenizers part (yet; I've only spent about an
          | hour on it).
          | 
          | Nevertheless I am pretty sure the 1.6B RWKV6 would be totally
          | usable offline, browser-only. It's not capable enough for
          | general chat, but for RAG etc. it could be quite enough.
        
         | littlestymaar wrote:
         | > I wonder if WASM could be put on the roadmap.
         | 
          | The library itself should be able to compile to WASM with very
          | little change: _rayon_ and _wide_, the only mandatory
          | dependencies, support wasm out of the box, and you can get rid
          | of _memmap2_ by replacing the `Mmap` type in _transformer.rs_
          | with `&[u8]`.
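          | 
          | Roughly the kind of change I mean, sketched with hypothetical
          | names (not the actual lm.rs code): make the weight holder
          | generic over anything that derefs to bytes, so a `Mmap` works
          | on native and a plain `Vec<u8>` works on wasm32.
          | 
          |   // Hypothetical sketch, not actual lm.rs code.
          |   use std::ops::Deref;
          | 
          |   pub struct Weights<B: Deref<Target = [u8]>> {
          |       data: B, // memmap2::Mmap natively, Vec<u8> on wasm
          |   }
          | 
          |   impl<B: Deref<Target = [u8]>> Weights<B> {
          |       pub fn new(data: B) -> Self {
          |           Self { data }
          |       }
          | 
          |       // Read a little-endian f32 at a byte offset in the blob.
          |       pub fn read_f32(&self, offset: usize) -> f32 {
          |           // Deref coercion: Mmap or Vec<u8> -> &[u8]
          |           let blob: &[u8] = &self.data;
          |           let bytes: [u8; 4] = blob[offset..offset + 4]
          |               .try_into()
          |               .expect("offset out of bounds");
          |           f32::from_le_bytes(bytes)
          |       }
          |   }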
         | 
          | That being said, RWKV is a completely different architecture,
          | so it would have to be reimplemented entirely and is not likely
          | to ever be part of the roadmap (I'm not the main author so I
          | can't say for sure, but I really doubt it).
        
       | wyldfire wrote:
       | The title is less clear than it could be IMO.
       | 
       | When I saw "no dependency" I thought maybe it could be no_std
       | (llama.c is relatively lightweight in this regard). But it's
       | definitely not `no_std` and in fact seems like it has several
       | dependencies. Perhaps all of them are rust dependencies?
        
         | saghm wrote:
         | The readme seems to indicate that it expects pytorch alongside
         | several other Python dependencies in a requirements.txt file
         | (which is the only place I can find any form of the word
         | "dependency" on the page). I'm very confused by the
         | characterization in the title here given that it doesn't seem
          | to be claimed at all by the project itself (which simply has
         | the subtitle "Minimal LLM inference in Rust").
         | 
         | From the git history, it looks like the username of the person
         | who posted this here is someone who has contributed to the
         | project but isn't the primary author. If they could elaborate
         | on what exactly they mean by saying this has "zero
         | dependencies", that might be helpful.
        
           | littlestymaar wrote:
           | > The readme seems to indicate that it expects pytorch
           | alongside several other Python dependencies in a
           | requirements.txt file
           | 
            | That's only if you want to convert the model yourself; you
            | don't need that if you use the converted weights on the
            | author's huggingface page (in the "prepared-models" table of
            | the README).
           | 
           | > From the git history, it looks like the username of the
           | person who posted this here is someone who has contributed to
           | the project but isn't the primary author.
           | 
            | Yup, that's correct; so far I've only authored the Dioxus
            | GUI app.
           | 
           | > If they could elaborate on what exactly they mean by saying
           | this has "zero dependencies", that might be helpful.
           | 
           | See my other response:
           | https://news.ycombinator.com/item?id=41812665
        
             | J_Shelby_J wrote:
             | What do you think about implementing your gui for other
             | rust LLM projects? I'm looking for a front end for my
             | project: https://github.com/ShelbyJenkins/llm_client
        
         | vitaminka wrote:
            | is rust cargo basically like npm at this point? like how on
            | earth does sixteen dependencies mean no dependencies lol
        
           | littlestymaar wrote:
            | > like how on earth does sixteen dependencies mean no
            | dependencies lol
            | 
            | You're counting optional dependencies used in the binaries,
            | which isn't fair (obviously the GUI app or the backend of the
            | webui are going to have dependencies!). But yes, 3
            | dependencies isn't literally no dependencies.
        
           | tormeh wrote:
           | Yes, basically. Someone who is a dependency maximalist (never
           | write any code that can be replaced by a dependency) then you
           | can easily end up with a thousand dependencies. I don't like
           | things being that way, but others do.
           | 
           | It's worth noting that Rust's std library is really small,
           | and you therefore need more dependencies in Rust than in some
           | other languages like Python. There are some "blessed" crates
           | though, like the ones maintained by the rust-lang team
           | themselves (https://crates.io/teams/github:rust-lang:libs and
           | https://crates.io/teams/github:rust-lang-nursery:libs). Also,
           | when you add a dependency like Tokio, Axum, or Polars, these
           | are often ecosystems of crates rather than singular crates.
           | 
           | Tl;dr: Good package managers end up encouraging micro-
           | dependencies and dependency bloat because these things are
           | now painless. Cargo is one of these good package managers.
        
             | jll29 wrote:
              | How about designing a "proper" standard library for Rust
              | (comparable to Java's or Common Lisp's), to ensure a richer
              | experience, avoid dependency explosions, and also ensure
              | things are written in a uniform interface style? Is that
              | something the Rust folks are considering or actively
              | working on?
             | 
             | EDIT: nobody is helped by 46 regex libraries, none of which
             | implements Unicode fully, for example (not an example taken
             | from the Rust community).
        
               | tormeh wrote:
               | Just use the rust-lang org's regex crate. It's
               | fascinating that you managed to pick one of like 3 high-
               | level use-cases that are covered by official rust-lang
               | crates.
        
         | littlestymaar wrote:
         | Titles are hard.
         | 
          | What I wanted to express is that it doesn't have any PyTorch,
          | CUDA, ONNX or other deep learning dependency, and that all the
          | logic is self-contained.
          | 
          | To be totally transparent, it has 5 Rust dependencies by
          | default: two of them should be feature-gated for the _chat_
          | binary (chrono and clap), and then there are 3 utility crates
          | that are used to get a little more performance out of the
          | hardware (`rayon` for easier parallelization, `wide` for
          | helping with SIMD, and `memmap2` for memory mapping of the
          | model file).
        
           | J_Shelby_J wrote:
            | Yeah, it's hard to not be overly verbose. "No massive
            | dependencies with long build times and deep abstractions!"
            | is not as catchy.
        
             | 0x457 wrote:
              | "No dependencies" in this case (and pretty much any Rust
              | project) means: to build you need rustc+cargo, and to use
              | it you just need the resulting binary.
              | 
              | As in, you don't need a C compiler, Python, or dynamic
              | libraries. "Pure Rust" would be a better way to describe
              | it.
        
               | littlestymaar wrote:
                | It's a little bit more than pure Rust: to build the
                | library there are basically only two dependencies (rayon
                | and wide), which bring in only 14 transitive dependencies
                | (anyone who's built even a simple Rust program knows that
                | this is a very small number).
                | 
                | And there's more: rayon and wide are only needed for
                | performance, so we could trivially put them behind a
                | feature flag, have zero dependencies, and actually have
                | the library work in a no_std context, but it would be so
                | slow it would have no use at all, so I don't really think
                | it makes sense to do except in order to win an
                | argument...
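                | 
                | A rough sketch of what that feature gating could look
                | like (hypothetical code, not what lm.rs actually does):
                | 
                |   // Hypothetical sketch: fall back to a serial loop when
                |   // the "parallel" cargo feature is disabled.
                |   #[cfg(feature = "parallel")]
                |   use rayon::prelude::*;
                | 
                |   // Naive matrix-vector product: out[i] = dot(w[i,:], x),
                |   // where `w` is row-major with `n` columns.
                |   pub fn matvec(
                |       out: &mut [f32], w: &[f32], x: &[f32], n: usize,
                |   ) {
                |       #[cfg(feature = "parallel")]
                |       out.par_iter_mut().enumerate().for_each(|(i, o)| {
                |           *o = w[i * n..(i + 1) * n]
                |               .iter()
                |               .zip(x)
                |               .map(|(a, b)| a * b)
                |               .sum();
                |       });
                | 
                |       #[cfg(not(feature = "parallel"))]
                |       for (i, o) in out.iter_mut().enumerate() {
                |           *o = w[i * n..(i + 1) * n]
                |               .iter()
                |               .zip(x)
                |               .map(|(a, b)| a * b)
                |               .sum();
                |       }
                |   }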
        
         | ctz wrote:
         | The original may have made sense, eg "no hardware dependency",
         | or "no GPU dependency". Unfortunately HN deletes words from
         | titles with no rhyme or reason, and no transparency.
        
       | lucgagan wrote:
        | Correct me if I am wrong, but these implementations are all
        | CPU-bound? I.e., if I have a good GPU, I should look for
        | alternatives.
        
         | littlestymaar wrote:
          | It's all implemented on the CPU, yes; there's no GPU
          | acceleration whatsoever (at the moment at least).
          | 
          | > if I have a good GPU, I should look for alternatives.
          | 
          | If you actually want to run it, even just on the CPU, you
          | should look for an alternative (and the alternative is called
          | llama.cpp); this is more of an educational resource about how
          | things work when you remove all the layers of complexity in the
          | ecosystem.
          | 
          | LLMs are somewhat magic in how effective they can be, but in
          | terms of code they're really simple.
        
         | bt1a wrote:
         | You are correct. This project is "on the CPU", so it will not
         | utilize your GPU for computation. If you would like to try out
         | a Rust framework that does support GPUs, Candle
         | https://github.com/huggingface/candle/tree/main may be worth
         | exploring
        
         | J_Shelby_J wrote:
          | Yes. Depending on the GPU, a 10-20x difference.
          | 
          | For Rust you have the llama.cpp wrappers like llm_client
          | (mine), and the Candle-based projects mistral.rs and Kalosm.
          | 
          | Although my project does try to provide an implementation using
          | mistral.rs, I haven't fully migrated from llama.cpp. A full
          | Rust implementation would be nice for quick install times
          | (among other reasons). Right now my crate has to clone and
          | build. It's automated for Mac, PC, and Linux, but it adds about
          | a minute of build time.
        
       | J_Shelby_J wrote:
       | Neat.
       | 
        | FYI I have a whole bunch of Rust tools[0] for loading models and
        | other LLM tasks. For example, auto-selecting the largest quant
        | based on available memory, extracting a tokenizer from a GGUF,
        | prompting, etc. You could use these to remove some of the Python
        | dependencies you have.
        | 
        | Currently they support llama.cpp, but this is pretty neat too.
        | Any plans to support grammars?
       | 
       | [0] https://github.com/ShelbyJenkins/llm_client
        
       | gip wrote:
       | Great! Did something similar some time ago [0] but the
       | performance was underwhelming compared to C/C++ code running on
       | CPU (which points to my lack of understanding of how to make Rust
       | fast). Would be nice to have some benchmarks of the different
       | Rust implementations.
       | 
       | Implementing LLM inference should/could really become the new
       | "hello world!" for serious programmers out there :)
       | 
       | [0] https://github.com/gip/yllama.rs
        
       | simonw wrote:
        | This is impressive. I just ran the 1.2G llama3.2-1b-it-q80.lmrs
        | on an M2 64GB MacBook and it felt speedy and used 1000% of CPU
        | across 13 threads (according to Activity Monitor).
        | 
        |   cd /tmp
        |   git clone https://github.com/samuel-vitorino/lm.rs
        |   cd lm.rs
        |   RUSTFLAGS="-C target-cpu=native" cargo build --release --bin chat
        |   curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/tokenizer.bin?download=true'
        |   curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/llama3.2-1b-it-q80.lmrs?download=true'
        |   ./target/release/chat --model llama3.2-1b-it-q80.lmrs
        
         | littlestymaar wrote:
          | Could you try with
          | 
          |   ./target/release/chat --model llama3.2-1b-it-q80.lmrs --show-metrics
          | 
          | To know how many token/s you get?
        
           | simonw wrote:
            | Nice, just tried that with "tell me a long tall tale" as the
            | prompt and got:
            | 
            |   Speed: 26.41 tok/s
            | 
            | Full output:
            | https://gist.github.com/simonw/6f25fca5c664b84fdd4b72b091854...
        
         | amelius wrote:
         | Not sure how to formulate this, but what does this mean in the
         | sense of how "smart" it is compared to the latest chatgpt
         | version?
        
           | simonw wrote:
           | The model I'm running here is Llama 3.2 1B, the smallest on-
           | device model I've tried that has given me good results.
           | 
            | The fact that a 1.2GB download can do as well as this is
            | honestly astonishing to me - but it's going to be laughably
            | poor in comparison to something like GPT-4o - which I'm
            | guessing is measured in the 100s of GBs.
           | 
           | You can try out Llama 3.2 1B yourself directly in your
           | browser (it will fetch about 1GB of data) at
           | https://chat.webllm.ai/
        
           | littlestymaar wrote:
            | The implementation has no control over "how smart" the model
            | is, and when it comes to Llama 1B, it's not very smart by
            | current standards (but it would still have blown everyone's
            | mind just a few years back).
        
             | KeplerBoy wrote:
              | The implementation absolutely can influence the outputs.
              | 
              | If you have a sloppy implementation which somehow
              | accumulates a lot of error in its floating point math, you
              | will get worse results.
              | 
              | It's rarely talked about, but it's a real thing. Floating
              | point addition and multiplication are non-associative, so
              | the order of operations affects correctness and
              | performance. Developers might (unknowingly) trade
              | performance for correctness. And it matters a lot more at
              | the low precisions we operate at today. Just try different
              | methods of summing a vector containing 9,999 fp16 ones in
              | fp16. Hint: it will never be 9,999.0, and you won't get
              | close to the best approximation if you do it in a naive
              | loop.
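              | 
              | A quick way to try that experiment in Rust (just a sketch
              | using the `half` crate for fp16; it's my own example,
              | nothing to do with lm.rs):
              | 
              |   // Sketch: sum 9,999 fp16 ones two different ways.
              |   use half::f16; // external crate, assumed here
              | 
              |   fn main() {
              |       let ones = vec![f16::ONE; 9_999];
              | 
              |       // Naive fp16 accumulation: once the running sum hits
              |       // 2048, adding 1.0 no longer changes it (the spacing
              |       // between consecutive fp16 values there is 2.0).
              |       let naive = ones
              |           .iter()
              |           .copied()
              |           .fold(f16::ZERO, |acc, x| acc + x);
              | 
              |       // Accumulate in f32, round to fp16 once at the end.
              |       let wide: f32 = ones.iter().map(|x| x.to_f32()).sum();
              |       let rounded = f16::from_f32(wide);
              | 
              |       // Prints 2048 vs 10000 (the nearest fp16 to 9,999).
              |       println!("naive fp16 sum: {naive}");
              |       println!("f32-accumulated sum: {rounded}");
              |   }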
        
               | littlestymaar wrote:
               | TIL, thanks.
        
               | sroussey wrote:
               | How well does bf16 work in comparison?
        
               | KeplerBoy wrote:
                | Even worse, I'd say, since it has fewer bits for the
                | fraction. At least in the example I was mentioning, where
                | you run into precision limits, not range limits.
               | 
               | I believe bf16 was primarily designed as a storage
               | format, since it just needs 16 zero bits added to be a
               | valid fp32.
        
               | jiggawatts wrote:
               | I thought all current implementations accumulate into a
               | fp32 instead of accumulating in fp16.
        
       | jll29 wrote:
       | This is beautifully written, thanks for sharing.
       | 
       | I could see myself using some of the source code in the classroom
       | to explain how transformers "really" work; code is more
       | concrete/detailed than all those pictures of attention heads etc.
       | 
       | Two points of minor criticism/suggestions for improvement:
       | 
        | - libraries should not print to stdout, as that output may
        | destroy application output (imagine I want to use the library in
        | a text editor to offer style checking). So it's best to write to
        | a string buffer owned by a logging object associated with an
        | lm.rs instance (see the sketch below).
        | 
        | - Is it possible to do all this without "unsafe", without
        | twisting one's arm? I see there are uses of "unsafe", e.g. to
        | force data alignment in the model reader.
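        | 
        | A rough sketch of the kind of API I mean (hypothetical, not the
        | actual lm.rs interface):
        | 
        |   // Hypothetical sketch: let the caller decide where output goes
        |   // instead of printing to stdout inside the library.
        |   pub trait OutputSink {
        |       fn emit(&mut self, text: &str);
        |   }
        | 
        |   // A sink that buffers everything into a String the caller owns.
        |   pub struct BufferSink(pub String);
        | 
        |   impl OutputSink for BufferSink {
        |       fn emit(&mut self, text: &str) {
        |           self.0.push_str(text);
        |       }
        |   }
        | 
        |   // The generation loop would then hand each decoded token to
        |   // the sink rather than calling println! directly.
        |   pub fn generate(prompt: &str, sink: &mut dyn OutputSink) {
        |       // ... run inference; placeholder just echoes the prompt.
        |       sink.emit(prompt);
        |   }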
       | 
       | Again, thanks and very impressive!
        
       | marques576 wrote:
       | Such a talented guy!
        
       ___________________________________________________________________
       (page generated 2024-10-11 23:00 UTC)