[HN Gopher] Run Llama locally with only PyTorch on CPU
___________________________________________________________________
Run Llama locally with only PyTorch on CPU
Author : anordin95
Score : 107 points
Date : 2024-10-08 01:45 UTC (3 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| anordin95 wrote:
| Peel back the layers of the onion and other gluey-mess to gain
| insight into these models.
| yjftsjthsd-h wrote:
| If your goal is
|
| > I want to peel back the layers of the onion and other gluey-
| mess to gain insight into these models.
|
| Then this is great.
|
| If your goal is
|
| > Run and explore Llama models locally with minimal dependencies
| on CPU
|
| then I recommend https://github.com/Mozilla-Ocho/llamafile which
| ships as a single file with no dependencies and runs on CPU with
| great performance. Like, such great performance that I've mostly
| given up on GPU for LLMs. It was a game changer.
| hedgehog wrote:
| Ollama (also wrapping llama.cpp) has GPU support; unless you're
| really in love with the idea of bundling weights into the
| inference executable, it's probably a better choice for most
| people.
| yjftsjthsd-h wrote:
| When I said
|
| > such great performance that I've mostly given up on GPU for
| LLMs
|
| I mean I used to run ollama on GPU, but llamafile gave
| approximately the same performance on just CPU, so I switched.
| Now that might just be because my GPU is weak by current
| standards, but that is in fact the comparison I was making.
|
| Edit: Though to be clear, ollama would easily be my second
| pick; it also has minimal dependencies and is super easy to
| run locally.
| jart wrote:
| Ollama is great if you're really in love with the idea of
| having your multi gigabyte models (likely the majority of
| your disk space) stored in obfuscated UUID filenames. Ollama
| also still hasn't addressed the license violations I reported
| to them back in March.
| https://github.com/ollama/ollama/issues/3185
| hedgehog wrote:
| I wasn't aware of the license issue, wow. Not a good look
| especially considering how simple that is to resolve.
|
| The model storage doesn't bother me but I also use Docker
| so I'm used to having a lot of tool-managed data to deal
| with. YMMV.
|
| Edit: Removed question about GPU support.
| codetrotter wrote:
| I think this is also a problem in a lot of tools, one that is
| never talked about.
|
| Even I haven't thought about this very deeply, even though I
| am also very concerned about honoring other people's work and
| making sure that licenses are followed.
|
| I have some command line tools for example that I've
| written in Rust that depend on various libraries. But
| because I distribute my software in source form mostly, I
| haven't really paid attention to how a command-line tool
| which is distributed as a compiled binary would make sure
| to include attribution and copies of the licenses of its
| dependencies.
|
| And so the main place where I've given more thought to
| those concerns is for example in full-blown GUI apps. There
| they usually have an about menu that will include info
| about their dependencies. And the other part where I've
| thought about it is in commercial electronics making use of
| open source software in their firmware. In those physical
| products they usually include some printed documents
| alongside the product where attributions and license texts
| are sometimes found, and sometimes, if the product has a
| display or a display output, they have a menu you can find
| somewhere with that sort of info.
|
| I know that for example Debian is very good at being
| thorough with details about licenses, but I've never looked
| at what they do with command line tools that compile third-
| party code into them. Like, do Debian package maintainers
| then, for example, dig up copies of the licenses from the
| source and dependencies and put them somewhere in /usr/share/
| as plain text files? Or do the .deb files themselves contain
| license text copies you can view but which are not installed
| onto the system? Or do they work with software authors to add
| a flag that will show licenses? Or something else?
| jart wrote:
| A great place to start is with the LLaMA 3.2 q6 llamafile I
| posted a few days ago.
| https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafi...
| We have a new CLI chatbot interface that's really fun to use.
| Syntax highlighting and all. You can also use GPU by passing
| the -ngl 999 flag.
| cromka wrote:
| "On _Windows_ , only the graphics card driver needs to be
| installed if you own an NVIDIA GPU. On _Windows_ , if you
| have an AMD GPU, you should install the ROCm SDK v6.1 and
| then pass the flags --recompile --gpu amd the first time you
| run your llamafile."
|
| Looks like there's a typo, Windows is mentioned twice.
| yumraj wrote:
| Can it use the GPU if available, say on Apple silicon Macs?
| unkeen wrote:
| > GPU on MacOS ARM64 is supported by compiling a small module
| using the Xcode Command Line Tools, which need to be
| installed. This is a one time cost that happens the first
| time you run your llamafile.
| xyc wrote:
| I wonder if it's possible for llamafile to distribute
| without the need for Xcode Command Line Tools, but perhaps
| it's necessary for the single cross-platform binary.
|
| Loved llamafile and used it to build the first version of
| https://recurse.chat/, but live compilation using the Xcode
| Command Line Tools is a no-go for Mac App Store builds (which
| run in the Mac App Sandbox). llama.cpp doesn't need compiling
| on the user's machine, fwiw.
| rmbyrro wrote:
| Do you have a ballpark idea of how much RAM would be necessary
| to run llama 3.1 8b and 70b on 8-quant?
| karolist wrote:
| Roughly, at Q8 the parameter counts translate directly to GB,
| so ~8GB and ~70GB.
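|
| A quick sanity check of that rule of thumb in plain Python (a
| rough sketch of weight memory alone; real inference needs some
| extra headroom for the KV cache and activations):
|
|     # Weight memory estimate: bits per parameter -> gigabytes.
|     def weight_gb(params_billions, bits_per_param):
|         return params_billions * 1e9 * bits_per_param / 8 / 1e9
|
|     for b in (8, 70):
|         print(f"{b}B @ Q8 ~ {weight_gb(b, 8):.0f} GB, "
|               f"Q4 ~ {weight_gb(b, 4):.0f} GB")
|     # 8B:  ~8 GB at Q8,  ~4 GB at Q4
|     # 70B: ~70 GB at Q8, ~35 GB at Q4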
| AlfredBarnes wrote:
| Thanks for posting this!
| bagels wrote:
| How great is the performance? Tokens/s?
| littlestymaar wrote:
| In the same spirit, but without even PyTorch as a dependency,
| there's a straightforward CPU implementation of Llama/Gemma in
| Rust: https://github.com/samuel-vitorino/lm.rs/
|
| It's impressive to realize how little code is needed to run these
| models at all.
| Ship_Star_1010 wrote:
| PyTorch has a native LLM solution, torchchat, which supports
| all the Llama models on CPU, MPS, and CUDA:
| https://github.com/pytorch/torchchat I'm getting 4.5 tokens a
| second with 3.1 8B at full precision using CPU only on my M1.
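|
| For anyone who wants to try it, the commands in the torchchat
| README are roughly along these lines (treat this as a sketch;
| the exact subcommands and model aliases may have changed, so
| check the repo's README):
|
|     python3 torchchat.py download llama3.1
|     python3 torchchat.py chat llama3.1
|     python3 torchchat.py generate llama3.1 --prompt "Hello"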
| ajaksalad wrote:
| > I was a bit surprised Meta didn't publish an example way to
| simply invoke one of these LLM's with only torch (or some
| minimal set of dependencies)
|
| Seems like torchchat is exactly what the author was looking
| for.
|
| > And the 8B model typically gets killed by the OS for using
| too much memory.
|
| Torchchat also provides some quantization options so you can
| reduce the model size to fit into memory.
| tcdent wrote:
| > from llama_models.llama3.reference_impl.model import
| Transformer
|
| This just imports the Llama reference implementation and patches
| the device FYI.
|
| There are more robust implementations out there.
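|
| For the curious, "patching the device" boils down to something
| like the following (a sketch of the general PyTorch mechanism,
| not this repo's actual code; the checkpoint filename is
| hypothetical):
|
|     import torch
|
|     # Route tensors/modules created by GPU-oriented code to CPU.
|     torch.set_default_device("cpu")  # available in PyTorch >= 2.0
|
|     # The reference Transformer can then be instantiated as usual
|     # and its checkpoint loaded with an explicit CPU map_location:
|     # from llama_models.llama3.reference_impl.model import Transformer
|     # state = torch.load("consolidated.00.pth", map_location="cpu")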
| I_am_tiberius wrote:
| Does anyone know what the easiest way to finetune a model
| locally is today?
___________________________________________________________________
(page generated 2024-10-11 23:00 UTC)