[HN Gopher] Llama.rs - Rust port of llama.cpp for fast LLaMA inf...
       ___________________________________________________________________
        
       Llama.rs - Rust port of llama.cpp for fast LLaMA inference on CPU
        
       Author : rrampage
       Score  : 147 points
       Date   : 2023-03-15 17:15 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | unshavedyak wrote:
        | Anyone know if these LLaMA models can have a large pile of
        | context fed in? E.g. to have the "AI" act like ChatGPT with a
        | specific knowledge base you feed in?
        | 
        | I.e. imagine you feed in the last year of your chat logs, and
        | then ask the assistant questions about those logs. Compound that
        | with your wiki, itinerary, etc. Is this possible with LLaMA?
        | Where might it fail in doing this?
        | 
        |  _(and yes, I know this is basically autocomplete on steroids.
        | I'm still curious hah)_
        
         | [deleted]
        
         | grepLeigh wrote:
         | I cracked up while reading the README. To the author: may you
         | always get as much joy as you're giving with these projects.
         | Nice work!
        
         | kevin42 wrote:
          | You can feed in a lot of context, but the memory requirements
          | grow with it. It was crashing on me in some experiments, but I
          | saw a pull request that lets you set the context size on the
          | command line. I was using the 30B model, and with 500 words of
          | context it was using over 100GB of RAM.
         | 
         | You probably would be better off fine-tuning the LLaMA model
         | like they did with Alpaca though.
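
        For a rough sense of why memory grows with context length: a
        transformer caches one key vector and one value vector per layer
        for every token of context. The sketch below is a back-of-envelope
        estimate only, assuming LLaMA-30B's published shape (60 layers,
        6656-dim embeddings) and f32 cache entries; it counts just the KV
        cache, not the weights or ggml's scratch buffers, so the real
        footprint is larger.

          // Back-of-envelope KV-cache size for a LLaMA-style model.
          // Assumed figures: 60 layers, 6656-dim embeddings (LLaMA-30B),
          // f32 cache entries (4 bytes each).
          fn kv_cache_bytes(n_layers: u64, n_embd: u64, n_ctx: u64) -> u64 {
              // 2x for keys and values; one n_embd-long vector per
              // layer per cached token.
              2 * n_layers * n_ctx * n_embd * 4
          }

          fn main() {
              for n_ctx in [512u64, 2048] {
                  let bytes = kv_cache_bytes(60, 6656, n_ctx);
                  let gib = bytes as f64 / (1u64 << 30) as f64;
                  println!("n_ctx = {n_ctx}: ~{gib:.1} GiB of KV cache");
              }
          }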
        
       | taf2 wrote:
        | Funny that he had a hard time converting llama.cpp to expose a
        | web server... I was just asking GPT-4 to write one for me... will
        | hopefully have a PR ready soon.
        
       | cozzyd wrote:
       | I feel like https://github.com/ggerganov/llama.cpp/issues/171 is
       | a better approach here?
       | 
       | With how fast llama.cpp is changing, this seems like a lot of
       | churn for no reason.
        
         | ronsor wrote:
         | Oh hey, that's my issue.
         | 
          | It seems like most of the work would simply be moving the
          | inference stuff (feeding tokens to the model and sampling)
          | outside of main(). Most of the other functionality, such as
          | model weight loading, already lives in its own functions.
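
        To make that concrete, here is a hypothetical sketch (in Rust,
        since that is the port under discussion) of what pulling the
        generation loop out of main() could look like. The type and
        method names are illustrative only, not the actual llama.cpp or
        llama-rs API.

          // Illustrative only: a session type owns the generation state,
          // so main() (or an HTTP handler) just drives a next_token() loop.
          pub struct Model; // weights, vocab, hyperparameters would live here

          pub struct Session<'a> {
              model: &'a Model,
              tokens: Vec<u32>, // prompt + generated tokens; KV cache alongside
          }

          impl Model {
              // Weight loading already has its own function upstream; a
              // library wrapper would mostly re-export it.
              pub fn load(_path: &str) -> std::io::Result<Self> {
                  Ok(Model)
              }

              pub fn session(&self, prompt: Vec<u32>) -> Session<'_> {
                  Session { model: self, tokens: prompt }
              }
          }

          impl Session<'_> {
              // One step: evaluate the model on the context so far, then
              // sample the next token (top-k/top-p would be parameters
              // here). Returns None at end-of-text.
              pub fn next_token(&mut self) -> Option<u32> {
                  let _ = (self.model, &self.tokens);
                  None // placeholder for the eval + sampling calls
              }
          }

          fn main() {
              // Hypothetical model path; the stub above never reads it.
              let model = Model::load("ggml-model-q4_0.bin").unwrap();
              let mut session = model.session(vec![1, 2, 3]); // tokenized prompt
              while let Some(tok) = session.next_token() {
                  print!("{tok} ");
              }
          }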
        
           | tantony wrote:
           | I am working on
           | https://github.com/ggerganov/llama.cpp/issues/77 to
           | "librarify" the code a bit. I eventually did want to Rewrite
           | It In Rust(tm), but OP just beat me to it.
        
       | ilovefood wrote:
        | Great job porting the C++ code! Seems like the reasoning was to
        | provide the code as a library to embed in an HTTP server; can't
        | wait to see that happen and try it out.
       | 
       | Looking at how the inference runs, this shouldn't be a big
       | problem, right? https://github.com/setzer22/llama-
       | rs/blob/main/llama-rs/src/...
        
         | saidinesh5 wrote:
          | It shouldn't be too much effort to extend/rewrite that function
          | to use websockets (I wouldn't use plain HTTP for something like
          | this). All the important functions it calls (llama_eval,
          | sample_top_p_top_k, etc.) seem to be public anyway.
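
        One shape that embedding could take, assuming the library exposes
        a per-token eval + sample step (which is what llama_eval and
        sample_top_p_top_k suggest): run generation on its own thread and
        push each sampled token through a channel that the websocket or
        HTTP handler drains. The next_token function below is a stand-in,
        not a real llama-rs call.

          use std::sync::mpsc;
          use std::thread;

          // Hypothetical stand-in for one eval + sample step; a real
          // server would wrap the library's llama_eval / sample_top_p_top_k.
          fn next_token(step: usize) -> Option<String> {
              if step < 4 { Some(format!("token{} ", step)) } else { None }
          }

          fn main() {
              let (tx, rx) = mpsc::channel::<String>();

              // Generation thread: emit tokens as soon as they are sampled.
              thread::spawn(move || {
                  let mut step = 0;
                  while let Some(tok) = next_token(step) {
                      if tx.send(tok).is_err() {
                          break; // receiver (client connection) went away
                      }
                      step += 1;
                  }
              });

              // A websocket/HTTP handler would own the receiving end and
              // forward each token to the client as it arrives.
              for tok in rx {
                  print!("{tok}");
              }
              println!();
          }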
        
       | recuter wrote:
       | From the readme, to preempt the moaning: "I just like collecting
       | imaginary internet points, in the form of little stars, that
       | people seem to give to me whenever I embark on pointless quests
       | for rewriting X thing, but in Rust."
       | 
       | OK? Just don't. Let us have this. :)
        
         | galangalalgol wrote:
         | I like the tag, because I wouldn't read the story otherwise.
         | Any time someone ports something to rust the story is usually
         | interesting. Sometimes good interesting, sometimes bad
         | interesting. I don't enjoy reading about ports to Go. It is
         | almost always uneventful. Some more performance perhaps, didn't
         | take long. Wasn't too hard. Stubbed my toe on the error
         | checking etc. With rust, even if the port itself was easy, and
         | the performance gains were minimal, there is usually some bit
         | about a weird error rust found in the old code base, or how the
         | borrow checker ate their baby and everyone panicked, but then
         | everything was fine.
         | 
         | It isn't rust specific, I'd similarly like to know if someone
         | rewrote something in haskell or austral, or lisp. Not because
         | of the languages, but because they make good stories.
        
         | macawfish wrote:
         | I think it's fun
        
       | petercooper wrote:
        | Can someone a lot smarter than me give a basic explanation as to
        | why something like this can run at a respectable speed on the
        | CPU, whereas Stable Diffusion is next to useless on one? (That is
        | to say, 10-100x slower, whereas I have not seen GPU-based LLaMA
        | go 10-100x faster than the demo here.) I had assumed there were
        | similar algorithms at play.
        
         | tantony wrote:
         | Stable Diffusion runs pretty fast on Apple Silicon. Not sure if
         | that uses the GPU though.
         | 
         | I think one reason in this particular case may be the 4-bit
         | quantization.
        
           | alwayslikethis wrote:
            | Quantization is the answer here. Running the large models on
            | the CPU at 16 bits (which in practice means 32, since most
            | CPUs don't have native FP16 arithmetic) would be really slow.
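
        For reference, a minimal sketch of 4-bit block quantization in the
        spirit of ggml's scheme (not a byte-for-byte reimplementation):
        weights are split into blocks of 32, and each block stores one f32
        scale plus 32 values packed two per byte, so a 32-bit weight
        shrinks to roughly 5 bits.

          // Quantize one block of 32 f32 weights to 4-bit ints plus a
          // scale (illustrative; ggml's Q4_0 layout differs in details).
          const BLOCK: usize = 32;

          struct Q4Block {
              scale: f32,              // per-block scale factor
              packed: [u8; BLOCK / 2], // two 4-bit values per byte
          }

          fn quantize_block(w: &[f32; BLOCK]) -> Q4Block {
              let amax = w.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
              let scale = amax / 7.0; // map values into [-7, 7]
              let inv = if scale != 0.0 { 1.0 / scale } else { 0.0 };

              let mut packed = [0u8; BLOCK / 2];
              for i in 0..BLOCK / 2 {
                  // Round to the nearest int in [-7, 7], then bias by 8
                  // into [1, 15] so each value fits an unsigned nibble.
                  let q0 = ((w[2 * i] * inv).round() as i8 + 8) as u8;
                  let q1 = ((w[2 * i + 1] * inv).round() as i8 + 8) as u8;
                  packed[i] = q0 | (q1 << 4);
              }
              Q4Block { scale, packed }
          }

          fn dequantize_block(b: &Q4Block) -> [f32; BLOCK] {
              let mut out = [0.0f32; BLOCK];
              for i in 0..BLOCK / 2 {
                  out[2 * i] = ((b.packed[i] & 0x0f) as i32 - 8) as f32 * b.scale;
                  out[2 * i + 1] = ((b.packed[i] >> 4) as i32 - 8) as f32 * b.scale;
              }
              out
          }

          fn main() {
              let w: [f32; BLOCK] = core::array::from_fn(|i| (i as f32 - 16.0) / 10.0);
              let q = quantize_block(&w);
              let back = dequantize_block(&q);
              println!("original[3] = {:.3}, roundtrip[3] = {:.3}", w[3], back[3]);
          }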
        
       | xiphias2 wrote:
        | ggml should be ported as well to make it really count; use Rust's
        | multithreading for fun.
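
        For flavor, a minimal sketch of the kind of multithreading ggml
        does by hand with its thread pool, expressed with Rust's scoped
        threads: split a matrix-vector product (the dominant operation in
        token-by-token inference) across threads by rows. This illustrates
        the approach only; it is not code from any of the ports.

          // Row-parallel y = W * x with W stored row-major; each thread
          // takes a contiguous chunk of output rows.
          fn matvec_parallel(w: &[f32], x: &[f32], y: &mut [f32], n_threads: usize) {
              let (rows, cols) = (y.len(), x.len());
              assert_eq!(w.len(), rows * cols);
              let chunk = ((rows + n_threads - 1) / n_threads).max(1);
              std::thread::scope(|s| {
                  for (i, y_chunk) in y.chunks_mut(chunk).enumerate() {
                      let w_chunk = &w[i * chunk * cols..];
                      s.spawn(move || {
                          for (r, out) in y_chunk.iter_mut().enumerate() {
                              let row = &w_chunk[r * cols..(r + 1) * cols];
                              *out = row.iter().zip(x).map(|(a, b)| a * b).sum();
                          }
                      });
                  }
              });
          }

          fn main() {
              let (rows, cols) = (8usize, 4usize);
              let w: Vec<f32> = (0..rows * cols).map(|v| v as f32).collect();
              let x = vec![1.0f32; cols];
              let mut y = vec![0.0f32; rows];
              matvec_parallel(&w, &x, &mut y, 4);
              println!("{:?}", y); // each entry is the sum of one row of W
          }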
        
       | adeon wrote:
        | I've counted three different Rust LLaMA implementations on the
        | r/rust subreddit this week:
       | 
       | https://github.com/Noeda/rllama/ (pure Rust+OpenCL)
       | 
       | https://github.com/setzer22/llama-rs/ (ggml based)
       | 
       | https://github.com/philpax/ggllama (also ggml based)
       | 
        | There's also a GitHub issue on setzer's repo about collaborating
        | a bit across these separate efforts:
        | https://github.com/setzer22/llama-rs/issues/4
        
         | comex wrote:
         | Do you know if any of them support GPTQ [1], either end-to-end
         | or just by importing weights that were previously quantized
         | with GPTQ? Apparently GPTQ provides a significant quality boost
         | "for free".
         | 
         | I haven't had time to look into this in detail, but apparently
         | llama.cpp doesn't support it yet [2] though it will soon. And
         | the original implementation only works with CUDA.
         | 
         | [1] https://github.com/qwopqwop200/GPTQ-for-LLaMa/
         | 
         | [2] https://github.com/ggerganov/llama.cpp/issues/9
        
           | adeon wrote:
            | As far as I know, the two ggml ones are basically just
            | llama.cpp ports that include the ggml source code, so if the
            | support is not in llama.cpp, I don't think it's in these
            | implementations either. Although maybe that also means
            | they'll gain that ability as soon as llama.cpp does.
            | 
            | I'm the author of the first one, rllama, and it has no
            | quantization whatsoever. I don't think any of these are
           | improvements over llama.cpp for end-users at this time.
           | Unless you really really really want your software to be in
           | Rust in particular.
        
         | hummus_bae wrote:
          | corresponding reddit threads:
          | 
          | - https://www.reddit.com/r/rust/comments/6jm58w/rllama_a_rust_...
          | 
          | - https://www.reddit.com/r/rust/comments/6jm6gk/off_by_one_abs...
          | 
          | - https://www.reddit.com/r/rust/comments/6jmpu3/llama_rs_simpl...
        
           | comex wrote:
           | Two of those links don't work.
        
       ___________________________________________________________________
       (page generated 2023-03-15 23:01 UTC)