[HN Gopher] Llama.rs - Rust port of llama.cpp for fast LLaMA inf...
___________________________________________________________________
Llama.rs - Rust port of llama.cpp for fast LLaMA inference on CPU
Author : rrampage
Score : 147 points
Date : 2023-03-15 17:15 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| unshavedyak wrote:
| Anyone know if these LLaMA models can have a large pile of
| context fed in? E.g. to have the "AI" act like ChatGPT with a
| specific knowledge base you feed in?
|
| I.e., imagine you feed in the last year of your chat logs, and
| then ask the assistant queries about them. Compound that with
| your wiki, itinerary, etc. Is this possible with LLaMA? Where
| might it fail in doing this?
|
| _(and yes, I know this is basically autocomplete on steroids.
| I'm still curious hah)_
| [deleted]
| grepLeigh wrote:
| I cracked up while reading the README. To the author: may you
| always get as much joy as you're giving with these projects.
| Nice work!
| kevin42 wrote:
| You can feed in a lot of context, but the memory requirements
| go up and up. It was crashing on me in some experiments, but I
| saw a pull request that lets you set the context size on the
| command line. I was using the 30B model and with 500 words of
| context it was using over 100GB of RAM.
|
| You probably would be better off fine-tuning the LLaMA model
| like they did with Alpaca though.
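|
| Part of why memory climbs with context: on top of the weights
| and scratch buffers, every layer caches keys and values for
| each token seen so far. A back-of-the-envelope sketch (the
| parameter values are illustrative, not llama.cpp's exact
| numbers):
|
|     // Rough size of the key/value cache for a decoder-only
|     // transformer: 2 tensors (K and V) per layer, each of
|     // shape [n_ctx, n_embd].
|     fn kv_cache_bytes(n_layer: usize, n_ctx: usize,
|                       n_embd: usize, bytes_per_elem: usize) -> usize {
|         2 * n_layer * n_ctx * n_embd * bytes_per_elem
|     }
|
|     fn main() {
|         // e.g. a 30B-class model (60 layers, 6656-dim embeddings)
|         // with a 2048-token context, cached in f32
|         let bytes = kv_cache_bytes(60, 2048, 6656, 4);
|         println!("KV cache: ~{:.1} GiB",
|                  bytes as f64 / (1024.0 * 1024.0 * 1024.0));
|     }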
| taf2 wrote:
| Funny that he had a hard time converting llama.cpp to expose a
| web server... I was just asking GPT-4 to write one for me...
| will hopefully have a PR ready soon
| cozzyd wrote:
| I feel like https://github.com/ggerganov/llama.cpp/issues/171 is
| a better approach here?
|
| With how fast llama.cpp is changing, this seems like a lot of
| churn for no reason.
| ronsor wrote:
| Oh hey, that's my issue.
|
| It seems like most of the work would simply be moving the
| inference stuff (feeding tokens to the model and sampling)
| outside of main(). Most of the other functionality, such as
| loading the model weights, already lives in its own
| functions.
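|
| Very roughly, that refactor could leave a library surface like
| the sketch below. The names here are made up, not the actual
| llama.cpp API; the point is just that the feed/sample loop
| hands tokens to a callback instead of printing from main():
|
|     // Hypothetical "librarified" API: the token feeding and
|     // sampling loop lives in the library and hands tokens to
|     // whatever front end drives it (CLI, HTTP server, ...).
|     pub struct Session { /* model, KV cache, RNG state */ }
|
|     impl Session {
|         pub fn generate(&mut self, prompt: &[u32],
|                         mut on_token: impl FnMut(u32)) {
|             for &tok in prompt {
|                 self.eval(tok); // run one transformer step
|             }
|             loop {
|                 let next = self.sample_top_p_top_k();
|                 if self.is_end_of_text(next) { break; }
|                 on_token(next);
|                 self.eval(next);
|             }
|         }
|
|         fn eval(&mut self, _token: u32) { /* ... */ }
|         fn sample_top_p_top_k(&mut self) -> u32 { unimplemented!() }
|         fn is_end_of_text(&self, _token: u32) -> bool { unimplemented!() }
|     }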
| tantony wrote:
| I am working on
| https://github.com/ggerganov/llama.cpp/issues/77 to
| "librarify" the code a bit. I eventually did want to Rewrite
| It In Rust(tm), but OP just beat me to it.
| ilovefood wrote:
| Great job porting the C++ code! Seems like the reasoning was to
| provide the code as a library to embed in an HTTP server;
| cannot wait to see that happen and try it out.
|
| Looking at how the inference runs, this shouldn't be a big
| problem, right? https://github.com/setzer22/llama-
| rs/blob/main/llama-rs/src/...
| saidinesh5 wrote:
| It shouldn't be too much effort to extend/rewrite that function
| to use websockets (i wouldn't use just http for something like
| this). All the important functions used by that function
| (llama_eval, sample_top_p_top_k etc.. ) seem to be public
| anyway.
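|
| As a rough sketch of the glue (plain TCP here just to stay
| std-only; a real server would use an HTTP/WebSocket crate), with
| `generate` standing in for a loop over those public functions:
|
|     use std::io::{BufRead, BufReader, Write};
|     use std::net::TcpListener;
|
|     // Stand-in for the real inference loop (llama_eval +
|     // sample_top_p_top_k); here it just echoes the prompt back
|     // piece by piece.
|     fn generate(prompt: &str, mut on_piece: impl FnMut(&str)) {
|         for word in prompt.split_whitespace() {
|             on_piece(word);
|             on_piece(" ");
|         }
|     }
|
|     fn main() -> std::io::Result<()> {
|         let listener = TcpListener::bind("127.0.0.1:4000")?;
|         for stream in listener.incoming() {
|             let mut stream = stream?;
|             // treat the first line from the client as the prompt
|             let mut prompt = String::new();
|             BufReader::new(&stream).read_line(&mut prompt)?;
|             // stream pieces back as they are produced instead of
|             // waiting for the full completion
|             generate(prompt.trim(), |piece| {
|                 let _ = stream.write_all(piece.as_bytes());
|                 let _ = stream.flush();
|             });
|         }
|         Ok(())
|     }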
| recuter wrote:
| From the readme, to preempt the moaning: "I just like collecting
| imaginary internet points, in the form of little stars, that
| people seem to give to me whenever I embark on pointless quests
| for rewriting X thing, but in Rust."
|
| OK? Just don't. Let us have this. :)
| galangalalgol wrote:
| I like the tag, because I wouldn't read the story otherwise.
| Any time someone ports something to Rust the story is usually
| interesting. Sometimes good interesting, sometimes bad
| interesting. I don't enjoy reading about ports to Go. It is
| almost always uneventful. Some more performance perhaps, didn't
| take long, wasn't too hard, stubbed my toe on the error
| checking, etc. With Rust, even if the port itself was easy and
| the performance gains were minimal, there is usually some bit
| about a weird error Rust found in the old code base, or how the
| borrow checker ate their baby and everyone panicked, but then
| everything was fine.
|
| It isn't Rust specific; I'd similarly like to know if someone
| rewrote something in Haskell, or Austral, or Lisp. Not because
| of the languages, but because they make good stories.
| macawfish wrote:
| I think it's fun
| petercooper wrote:
| Can someone a lot smarter than me give a basic explanation as to
| why something like this can run at a respectable speed on the CPU
| whereas Stable Diffusion is next to useless on CPUs? (That is to
| say, 10-100x slower, whereas I have not seen GPU based LLaMA go
| 10-100x faster than the demo here.) I had assumed there were
| similar algorithms at play.
| tantony wrote:
| Stable Diffusion runs pretty fast on Apple Silicon. Not sure if
| that uses the GPU though.
|
| I think one reason in this particular case may be the 4-bit
| quantization.
| alwayslikethis wrote:
| Quantization is the answer here. Running the large models on
| the CPU at 16 bits (which in practice means 32, because most
| CPUs do not support FP16 natively) would be really slow.
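|
| The general idea behind the 4-bit formats (a sketch of the
| concept, not ggml's exact q4_0 layout): each block of 32
| weights stores one f32 scale plus 32 four-bit integers, i.e.
| about 5 bits per weight instead of 32.
|
|     const BLOCK: usize = 32;
|
|     struct QBlock {
|         scale: f32,
|         quants: [u8; BLOCK / 2], // two 4-bit values per byte
|     }
|
|     fn quantize(block: &[f32; BLOCK]) -> QBlock {
|         let amax = block.iter().fold(0f32, |m, x| m.max(x.abs()));
|         let scale = amax / 7.0; // map values into roughly [-7, 7]
|         let inv = if scale != 0.0 { 1.0 / scale } else { 0.0 };
|         let mut quants = [0u8; BLOCK / 2];
|         for i in 0..BLOCK / 2 {
|             // shift each signed 4-bit value into 0..=15, pack two per byte
|             let q0 = ((block[2 * i] * inv).round() as i32 + 8).clamp(0, 15) as u8;
|             let q1 = ((block[2 * i + 1] * inv).round() as i32 + 8).clamp(0, 15) as u8;
|             quants[i] = q0 | (q1 << 4);
|         }
|         QBlock { scale, quants }
|     }
|
|     fn dequantize(q: &QBlock) -> [f32; BLOCK] {
|         let mut out = [0f32; BLOCK];
|         for i in 0..BLOCK / 2 {
|             out[2 * i] = ((q.quants[i] & 0x0F) as i32 - 8) as f32 * q.scale;
|             out[2 * i + 1] = ((q.quants[i] >> 4) as i32 - 8) as f32 * q.scale;
|         }
|         out
|     }
|
| The matrix multiplies dequantize blocks on the fly, so the
| memory footprint and bandwidth drop by roughly 6x versus f32.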
| xiphias2 wrote:
| ggml should be ported as well to make it really count; use
| Rust's multithreading for fun
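|
| E.g. the hot matrix-vector products could be split across
| scoped threads, something like the sketch below (ggml does the
| equivalent with its own thread pool; a real port would likely
| reach for rayon and SIMD):
|
|     // Split the output rows of a matrix-vector product across
|     // threads. `mat` is row-major with `out.len()` rows of
|     // `vec.len()` columns.
|     fn matvec_parallel(mat: &[f32], vec: &[f32], out: &mut [f32],
|                        n_threads: usize) {
|         let cols = vec.len();
|         assert_eq!(mat.len(), out.len() * cols);
|         let rows_per_chunk = ((out.len() + n_threads - 1) / n_threads).max(1);
|         std::thread::scope(|s| {
|             for (i, out_chunk) in out.chunks_mut(rows_per_chunk).enumerate() {
|                 let mat = &mat[i * rows_per_chunk * cols..];
|                 s.spawn(move || {
|                     for (r, o) in out_chunk.iter_mut().enumerate() {
|                         let row = &mat[r * cols..(r + 1) * cols];
|                         *o = row.iter().zip(vec).map(|(a, b)| a * b).sum();
|                     }
|                 });
|             }
|         });
|     }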
| adeon wrote:
| I've counted three different Rust LLaMA implementations on the
| r/rust subreddit this week:
|
| https://github.com/Noeda/rllama/ (pure Rust+OpenCL)
|
| https://github.com/setzer22/llama-rs/ (ggml based)
|
| https://github.com/philpax/ggllama (also ggml based)
|
| There's also a GitHub issue on setzer's repo discussing
| collaboration between these separate efforts:
| https://github.com/setzer22/llama-rs/issues/4
| comex wrote:
| Do you know if any of them support GPTQ [1], either end-to-end
| or just by importing weights that were previously quantized
| with GPTQ? Apparently GPTQ provides a significant quality boost
| "for free".
|
| I haven't had time to look into this in detail, but apparently
| llama.cpp doesn't support it yet [2] though it will soon. And
| the original implementation only works with CUDA.
|
| [1] https://github.com/qwopqwop200/GPTQ-for-LLaMa/
|
| [2] https://github.com/ggerganov/llama.cpp/issues/9
| adeon wrote:
| As far as I know, the two ggml ones are basically just
| llama.cpp ports that include the ggml source code, so if the
| support is not in llama.cpp, I don't think it's in these
| implementations either. Although maybe that also means that
| they'll gain that ability as soon as llama.cpp does.
|
| I'm the author of the first one, rllama, and it has no
| quantization whatsoever. I don't think any of these are
| improvements over llama.cpp for end-users at this time.
| Unless you really really really want your software to be in
| Rust in particular.
| hummus_bae wrote:
| corresponding reddit threads:
|
| - https://www.reddit.com/r/rust/comments/6jm58w/rllama_a_rust_...
|
| - https://www.reddit.com/r/rust/comments/6jm6gk/off_by_one_abs...
|
| - https://www.reddit.com/r/rust/comments/6jmpu3/llama_rs_simpl...
| comex wrote:
| Two of those links don't work.
___________________________________________________________________
(page generated 2023-03-15 23:01 UTC)