[HN Gopher] Lm.rs: Minimal CPU LLM inference in Rust with no dep...
___________________________________________________________________
Lm.rs: Minimal CPU LLM inference in Rust with no dependency
Author : littlestymaar
Score : 156 points
Date : 2024-10-11 16:46 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| fuddle wrote:
| Nice work, it would be great to see some benchmarks comparing it
| to llm.c.
| littlestymaar wrote:
| I doubt it would compare favorably at the moment; I don't think
| it's particularly well optimized beyond using _rayon_ to get CPU
| parallelism and _wide_ for a bit of SIMD.
|
| It's good enough to get pretty good performance for little
| effort, but I don't think it would win a benchmark race either.
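|
| For flavor, here's a hedged sketch of the kind of thing _wide_
| makes easy (illustrative code, not lm.rs's actual kernels): an
| 8-lane f32 dot product with a scalar tail.
|
|     use wide::f32x8; // assumed: wide = "0.7"
|
|     fn dot(a: &[f32], b: &[f32]) -> f32 {
|         let mut acc = f32x8::ZERO;
|         // Process 8 lanes at a time.
|         for (ca, cb) in a.chunks_exact(8).zip(b.chunks_exact(8)) {
|             let va = f32x8::from(<[f32; 8]>::try_from(ca).unwrap());
|             let vb = f32x8::from(<[f32; 8]>::try_from(cb).unwrap());
|             acc += va * vb;
|         }
|         // Horizontal sum of the 8 lanes, then the leftover elements.
|         let mut sum = acc.reduce_add();
|         let tail = a.len() - a.len() % 8;
|         for i in tail..a.len() {
|             sum += a[i] * b[i];
|         }
|         sum
|     }
|
|     fn main() {
|         let v: Vec<f32> = (0..19).map(|i| i as f32).collect();
|         assert_eq!(dot(&v, &v), (0..19).map(|i| (i * i) as f32).sum());
|     }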
| echelon wrote:
| This is really cool.
|
| It's already using Dioxus (neat). I wonder if WASM could be put
| on the roadmap.
|
| If this could run a lightweight LLM like RWKV in the browser,
| then the browser unlocks a whole class of new capabilities
| without calling any SaaS APIs.
| marmaduke wrote:
| I was poking at this a bit here
|
| https://github.com/maedoc/rwkv.js
|
| using rwkv.cpp compiled with Emscripten, but I didn't quite
| figure out the tokenizer part (yet; I've only spent about an hour
| on it).
|
| Nevertheless I am pretty sure the 1.6B RWKV6 would be totally
| usable offline, browser-only. It's not capable enough for
| general chat, but for RAG etc. it could be quite enough.
| littlestymaar wrote:
| > I wonder if WASM could be put on the roadmap.
|
| The library itself should be able to compile to WASM with very
| little change: _rayon_ and _wide_ (the only mandatory
| dependencies) support WASM out of the box; the remaining step is
| to get rid of _memmap2_ by replacing the `Mmap` type in
| _transformer.rs_ with `&[u8]`.
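|
| A minimal sketch of that idea (illustrative, not the actual
| lm.rs code): since `Mmap` derefs to `[u8]`, a loader written
| against `&[u8]` works with a native memory map and with bytes
| fetched in the browser alike.
|
|     // Read one little-endian f32 out of a raw weight buffer.
|     fn read_f32(data: &[u8], offset: usize) -> f32 {
|         let bytes: [u8; 4] = data[offset..offset + 4].try_into().unwrap();
|         f32::from_le_bytes(bytes)
|     }
|
|     fn main() {
|         // Native: let mmap = unsafe { memmap2::Mmap::map(&file)? };
|         //         read_f32(&mmap, 0); // Mmap derefs to [u8]
|         // WASM:   read_f32(&bytes_fetched_in_browser, 0);
|         let fake_model = 1.0f32.to_le_bytes();
|         assert_eq!(read_f32(&fake_model, 0), 1.0);
|     }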
|
| That being said, RWKV is a completely different architecture, so
| it would have to be reimplemented entirely and is not likely to
| be part of the roadmap ever (I'm not the main author so I can't
| say for sure, but I really doubt it).
| wyldfire wrote:
| The title is less clear than it could be IMO.
|
| When I saw "no dependency" I thought maybe it could be no_std
| (llama.c is relatively lightweight in this regard). But it's
| definitely not `no_std` and in fact seems like it has several
| dependencies. Perhaps all of them are rust dependencies?
| saghm wrote:
| The readme seems to indicate that it expects pytorch alongside
| several other Python dependencies in a requirements.txt file
| (which is the only place I can find any form of the word
| "dependency" on the page). I'm very confused by the
| characterization in the title here given that it doesn't seem
| to be claimed at all by the project itself (which simply has
| the subtitle "Minimal LLM inference in Rust").
|
| From the git history, it looks like the username of the person
| who posted this here is someone who has contributed to the
| project but isn't the primary author. If they could elaborate
| on what exactly they mean by saying this has "zero
| dependencies", that might be helpful.
| littlestymaar wrote:
| > The readme seems to indicate that it expects pytorch
| alongside several other Python dependencies in a
| requirements.txt file
|
| That's only if you want to convert the model yourself; you
| don't need that if you use the converted weights from the
| author's Hugging Face page (in the "prepared-models" table of
| the README).
|
| > From the git history, it looks like the username of the
| person who posted this here is someone who has contributed to
| the project but isn't the primary author.
|
| Yup that's correct, so far I've only authored the dioxus GUI
| app.
|
| > If they could elaborate on what exactly they mean by saying
| this has "zero dependencies", that might be helpful.
|
| See my other response:
| https://news.ycombinator.com/item?id=41812665
| J_Shelby_J wrote:
| What do you think about implementing your gui for other
| rust LLM projects? I'm looking for a front end for my
| project: https://github.com/ShelbyJenkins/llm_client
| vitaminka wrote:
| is rust cargo basically like npm at this point? like how on
| earth does sixteen dependencies mean no dependencies lol
| littlestymaar wrote:
| > like how on earth is sixteen dependencies means no
| dependencies lol
|
| You're counting optional dependencies used by the binaries,
| which isn't fair (obviously the GUI app or the backend of the
| webui are going to have dependencies!). But yes, 3
| dependencies isn't literally zero dependencies.
| tormeh wrote:
| Yes, basically. If you're a dependency maximalist (never
| write any code that can be replaced by a dependency), you can
| easily end up with a thousand dependencies. I don't like
| things being that way, but others do.
|
| It's worth noting that Rust's std library is really small,
| and you therefore need more dependencies in Rust than in some
| other languages like Python. There are some "blessed" crates
| though, like the ones maintained by the rust-lang team
| themselves (https://crates.io/teams/github:rust-lang:libs and
| https://crates.io/teams/github:rust-lang-nursery:libs). Also,
| when you add a dependency like Tokio, Axum, or Polars, these
| are often ecosystems of crates rather than singular crates.
|
| Tl;dr: Good package managers end up encouraging micro-
| dependencies and dependency bloat because these things are
| now painless. Cargo is one of these good package managers.
| jll29 wrote:
| How about designing a "proper" standard library for Rust
| (comparable to Java's or Common Lisp's), to ensure a richer
| experience, avoid dependency explosions, and also ensure
| things are written in a uniform interface style? Is that
| something the Rust folks are considering or actively
| working on?
|
| EDIT: nobody is helped by 46 regex libraries, none of which
| implements Unicode fully, for example (not an example taken
| from the Rust community).
| tormeh wrote:
| Just use the rust-lang org's regex crate. It's
| fascinating that you managed to pick one of like 3 high-
| level use-cases that are covered by official rust-lang
| crates.
| littlestymaar wrote:
| Titles are hard.
|
| What I wanted to express is that it doesn't have any PyTorch,
| CUDA, ONNX, or other deep-learning dependency, and that all
| the logic is self-contained.
|
| To be totally transparent, it has 5 Rust dependencies by
| default. Two of them should really be feature-gated behind the
| _chat_ binary (chrono and clap), and the other 3 are utility
| crates used to get a little more performance out of the
| hardware (`rayon` for easier parallelization, `wide` for
| helping with SIMD, and `memmap2` for memory-mapping the model
| file).
| J_Shelby_J wrote:
| Yeah, it's hard not to be overly verbose. "No massive
| dependencies with long build times and deep abstractions!" is
| not as catchy.
| 0x457 wrote:
| No dependencies in this case (and pretty much any rust
| project) means: to build you need rustc+cargo, and to use it
| you just need the resulting binary.
|
| As in, you don't need a C compiler, python, or dynamic
| libraries. "Pure rust" would be a better way to describe
| it.
| littlestymaar wrote:
| It's a little bit more than pure Rust: to build the
| library there are basically only two dependencies (rayon
| and wide), which bring in only 14 transitive dependencies
| (anyone who's built even a simple Rust program knows that
| this is a very small number).
|
| And there's more: rayon and wide are only needed for
| performance, so we could trivially put them behind a
| feature flag, have zero dependencies, and actually have the
| library work in a no-std context. But it would be so slow
| it would have no use at all, so I don't really think that
| makes sense to do except in order to win an
| argument...
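|
| A hypothetical sketch of what such a feature gate could look
| like (the "parallel" feature and all names here are made up,
| not the actual lm.rs code): the same matrix-vector product,
| parallel when the feature is on, a plain serial loop when
| it's off.
|
|     #[cfg(feature = "parallel")]
|     use rayon::prelude::*;
|
|     // out[i] = dot(row i of w, x), with w stored row-major (rows of n).
|     fn matvec(out: &mut [f32], w: &[f32], x: &[f32], n: usize) {
|         #[cfg(feature = "parallel")]
|         out.par_iter_mut().enumerate().for_each(|(i, o)| {
|             *o = w[i * n..(i + 1) * n].iter().zip(x).map(|(a, b)| a * b).sum();
|         });
|
|         #[cfg(not(feature = "parallel"))]
|         for (i, o) in out.iter_mut().enumerate() {
|             *o = w[i * n..(i + 1) * n].iter().zip(x).map(|(a, b)| a * b).sum();
|         }
|     }
|
|     fn main() {
|         let w = [1.0, 2.0, 3.0, 4.0]; // 2x2, row-major
|         let (mut out, x) = ([0.0f32; 2], [1.0f32, 1.0]);
|         matvec(&mut out, &w, &x, 2);
|         assert_eq!(out, [3.0, 7.0]);
|     }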
| ctz wrote:
| The original may have made sense, eg "no hardware dependency",
| or "no GPU dependency". Unfortunately HN deletes words from
| titles with no rhyme or reason, and no transparency.
| lucgagan wrote:
| Correct me if I am wrong, but these implementations are all
| CPU-only? I.e., if I have a good GPU, I should look for
| alternatives.
| littlestymaar wrote:
| It's all implemented on the CPU, yes, there's no GPU
| acceleration whatsoever (at the moment at least).
|
| > if I have a good GPU, I should look for alternatives.
|
| If you actually want to run it, even just on the CPU, you
| should look for an alternative (and the alternative is called
| llama.cpp); this is more of an educational resource about how
| things work when you remove all the layers of complexity in the
| ecosystem.
|
| LLMs are somewhat magic in how effective they can be, but in
| terms of code they're really simple.
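|
| To illustrate "really simple", here is a hedged sketch of the
| essence of autoregressive decoding, with the transformer
| forward pass abstracted away as a closure (all names are
| illustrative, not the lm.rs API):
|
|     // Index of the largest logit = greedy choice of next token.
|     fn argmax(logits: &[f32]) -> u32 {
|         logits.iter().enumerate()
|             .max_by(|a, b| a.1.total_cmp(b.1))
|             .map(|(i, _)| i as u32)
|             .unwrap()
|     }
|
|     fn generate(mut forward: impl FnMut(&[u32]) -> Vec<f32>, prompt: &[u32], steps: usize) -> Vec<u32> {
|         let mut tokens = prompt.to_vec();
|         for _ in 0..steps {
|             let logits = forward(&tokens); // one transformer pass -> next-token logits
|             tokens.push(argmax(&logits));  // real samplers add temperature/top-p
|         }
|         tokens
|     }
|
|     fn main() {
|         // Toy "model" that always predicts token 2; a real forward
|         // pass would run the transformer over the whole context.
|         let toy = |_: &[u32]| vec![0.1, 0.2, 0.7];
|         assert_eq!(generate(toy, &[1], 3), vec![1, 2, 2, 2]);
|     }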
| bt1a wrote:
| You are correct. This project is "on the CPU", so it will not
| utilize your GPU for computation. If you would like to try out
| a Rust framework that does support GPUs, Candle
| https://github.com/huggingface/candle/tree/main may be worth
| exploring.
| J_Shelby_J wrote:
| Yes. Depending on the GPU, a 10-20x difference.
|
| For rust you have the llama.cpp wrappers like llm_client
| (mine), and the candle-based projects mistral.rs and Kalosm.
|
| Although my project does try to provide a mistral.rs
| implementation, I haven't fully migrated off llama.cpp. A full
| rust implementation would be nice for quick install times
| (among other reasons). Right now my crate has to clone and
| build llama.cpp. It's automated for macOS, Windows, and Linux,
| but it adds about a minute of build time.
| J_Shelby_J wrote:
| Neat.
|
| FYI I have a whole bunch of rust tools[0] for loading models and
| other LLM tasks: for example, auto-selecting the largest quant
| based on available memory, extracting a tokenizer from a GGUF,
| prompting, etc. You could use these to remove some of the python
| dependencies you have.
|
| Currently they're built to support llama.cpp, but this is pretty
| neat too. Any plans to support grammars?
|
| [0] https://github.com/ShelbyJenkins/llm_client
| gip wrote:
| Great! Did something similar some time ago [0], but the
| performance was underwhelming compared to C/C++ code running on
| the CPU (which points to my lack of understanding of how to make
| Rust fast). Would be nice to have some benchmarks of the
| different Rust implementations.
|
| Implementing LLM inference should/could really become the new
| "hello world!" for serious programmers out there :)
|
| [0] https://github.com/gip/yllama.rs
| simonw wrote:
| This is impressive. I just ran the 1.2G llama3.2-1b-it-q80.lmrs
| on a M2 64GB MacBook and it felt speedy and used 1000% of CPU
| across 13 threads (according to Activity Monitor).
|     cd /tmp
|     git clone https://github.com/samuel-vitorino/lm.rs
|     cd lm.rs
|     RUSTFLAGS="-C target-cpu=native" cargo build --release --bin chat
|     curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/tokenizer.bin?download=true'
|     curl -LO 'https://huggingface.co/samuel-vitorino/Llama-3.2-1B-Instruct-Q8_0-LMRS/resolve/main/llama3.2-1b-it-q80.lmrs?download=true'
|     ./target/release/chat --model llama3.2-1b-it-q80.lmrs
| littlestymaar wrote:
| Could you try with
|
|     ./target/release/chat --model llama3.2-1b-it-q80.lmrs --show-metrics
|
| to know how many token/s you get?
| simonw wrote:
| Nice, just tried that with "tell me a long tall tale" as the
| prompt and got:
|
|     Speed: 26.41 tok/s
|
| Full output: https://gist.github.com/simonw/6f25fca5c664b84fd
| d4b72b091854...
| amelius wrote:
| Not sure how to formulate this, but what does this mean in
| terms of how "smart" it is compared to the latest ChatGPT
| version?
| simonw wrote:
| The model I'm running here is Llama 3.2 1B, the smallest on-
| device model I've tried that has given me good results.
|
| The fact that a 1.2GB download can do as well as this is
| honestly astonishing to me - but it's going to be laughably
| poor in comparison to something like GPT-4o - which I'm
| guessing is measured in the 100s of GBs.
|
| You can try out Llama 3.2 1B yourself directly in your
| browser (it will fetch about 1GB of data) at
| https://chat.webllm.ai/
| littlestymaar wrote:
| The implementation has no control over "how smart" the model
| is, and when it comes to llama 1B, it's not very smart by
| current standards (but it would still have blown everyone's
| mind just a few years back).
| KeplerBoy wrote:
| The implementation absolutely can influence the outputs.
|
| If you have a sloppy implementation which somehow
| accumulates a lot of error in its floating-point math, you
| will get worse results.
|
| It's rarely talked about, but it's a real thing. Floating-
| point addition and multiplication are non-associative, so
| the order of operations affects both correctness and
| performance. Developers might (unknowingly) trade
| performance for correctness. And it matters a lot more in
| the low-precision modes we operate in today. Just try
| different methods of summing a vector containing 9,999 fp16
| ones in fp16. Hint: it will never be 9,999.0, and you won't
| get close to the best approximation if you do it in a naive
| loop.
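|
| A quick demonstration of that experiment (a sketch using the
| half crate for fp16 - an assumption here, not an lm.rs
| dependency): the naive loop stalls at 2048, because once the
| accumulator reaches 2048, adding 1.0 rounds straight back to
| 2048 in fp16. Pairwise summation keeps the addends at similar
| magnitudes and lands on the best fp16 approximation of 9,999.
|
|     use half::f16; // assumed: half = "2" in Cargo.toml
|
|     fn naive_sum(xs: &[f16]) -> f16 {
|         let mut acc = f16::ZERO;
|         for &x in xs {
|             acc = acc + x; // rounds to fp16 after every addition
|         }
|         acc
|     }
|
|     fn pairwise_sum(xs: &[f16]) -> f16 {
|         match xs.len() {
|             0 => f16::ZERO,
|             1 => xs[0],
|             n => {
|                 let (lo, hi) = xs.split_at(n / 2);
|                 pairwise_sum(lo) + pairwise_sum(hi)
|             }
|         }
|     }
|
|     fn main() {
|         let ones = vec![f16::ONE; 9_999];
|         println!("naive:    {}", naive_sum(&ones));    // 2048
|         println!("pairwise: {}", pairwise_sum(&ones)); // 10000, the nearest fp16 to 9999
|     }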
| littlestymaar wrote:
| TIL, thanks.
| sroussey wrote:
| How well does bf16 work in comparison?
| KeplerBoy wrote:
| Even worse, I'd say, since it has fewer bits for the
| fraction. At least in the example I was mentioning, where
| you run into precision limits, not range limits.
|
| I believe bf16 was primarily designed as a storage
| format, since it just needs 16 zero bits appended to be a
| valid fp32.
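|
| A small sketch of that relationship (illustrative): bf16 is
| just the top 16 bits of an fp32, so widening is a 16-bit
| shift.
|
|     fn bf16_bits_to_f32(b: u16) -> f32 {
|         f32::from_bits((b as u32) << 16) // append 16 zero bits
|     }
|
|     fn f32_to_bf16_bits(x: f32) -> u16 {
|         (x.to_bits() >> 16) as u16 // truncation; real converters round-to-nearest
|     }
|
|     fn main() {
|         let b = f32_to_bf16_bits(1.0);
|         assert_eq!(bf16_bits_to_f32(b), 1.0);
|     }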
| jiggawatts wrote:
| I thought all current implementations accumulate into an
| fp32 instead of accumulating in fp16.
| jll29 wrote:
| This is beautifully written, thanks for sharing.
|
| I could see myself using some of the source code in the classroom
| to explain how transformers "really" work; code is more
| concrete/detailed than all those pictures of attention heads etc.
|
| Two points of minor criticism/suggestions for improvement:
|
| - libraries should not print to stdout, as that output may
| destroy application output (imagine I want to use the library
| in a text editor to offer style checking). So it's best to
| write to a string buffer owned by a logging class instance
| associated with an lm.rs object (see the sketch below).
|
| - Is it possible to do all this without "unsafe", without
| twisting one's arm? I see there are uses of "unsafe", e.g. to
| force data alignment in the model reader.
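|
| A minimal sketch of the stdout suggestion (names are
| hypothetical, not the lm.rs API): make the library generic
| over an output sink, so a CLI passes stdout while an editor
| passes an in-memory buffer.
|
|     use std::io::Write;
|
|     // The library writes tokens to whatever sink the caller owns.
|     fn generate<W: Write>(prompt: &str, out: &mut W) -> std::io::Result<()> {
|         // ... run inference here, streaming tokens to `out` ...
|         write!(out, "echo: {prompt}")
|     }
|
|     fn main() -> std::io::Result<()> {
|         let mut buf: Vec<u8> = Vec::new();        // a text editor's buffer
|         generate("hello", &mut buf)?;
|         generate("hello", &mut std::io::stdout()) // a CLI's stdout
|     }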
|
| Again, thanks and very impressive!
| marques576 wrote:
| Such a talented guy!
___________________________________________________________________
(page generated 2024-10-11 23:00 UTC)