[HN Gopher] Run Llama locally with only PyTorch on CPU
       ___________________________________________________________________
        
       Run Llama locally with only PyTorch on CPU
        
       Author : anordin95
       Score  : 107 points
       Date   : 2024-10-08 01:45 UTC (3 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | anordin95 wrote:
       | Peel back the layers of the onion and other gluey-mess to gain
       | insight into these models.
        
       | yjftsjthsd-h wrote:
       | If your goal is
       | 
       | > I want to peel back the layers of the onion and other gluey-
       | mess to gain insight into these models.
       | 
       | Then this is great.
       | 
       | If your goal is
       | 
       | > Run and explore Llama models locally with minimal dependencies
       | on CPU
       | 
       | then I recommend https://github.com/Mozilla-Ocho/llamafile which
       | ships as a single file with no dependencies and runs on CPU with
       | great performance. Like, such great performance that I've mostly
       | given up on GPU for LLMs. It was a game changer.
        
         | hedgehog wrote:
          | Ollama (also wrapping llama.cpp) has GPU support; unless
          | you're really in love with the idea of bundling weights into
          | the inference executable, it's probably a better choice for
          | most people.
        
           | yjftsjthsd-h wrote:
           | When I said
           | 
           | > such great performance that I've mostly given up on GPU for
           | LLMs
           | 
            | I mean that I used to run ollama on GPU, but llamafile gave
            | approximately the same performance on CPU alone, so I
            | switched.
           | Now that might just be because my GPU is weak by current
           | standards, but that is in fact the comparison I was making.
           | 
           | Edit: Though to be clear, ollama would easily be my second
           | pick; it also has minimal dependencies and is super easy to
           | run locally.
        
           | jart wrote:
           | Ollama is great if you're really in love with the idea of
            | having your multi-gigabyte models (likely the majority of
           | your disk space) stored in obfuscated UUID filenames. Ollama
           | also still hasn't addressed the license violations I reported
           | to them back in March.
           | https://github.com/ollama/ollama/issues/3185
        
             | hedgehog wrote:
              | I wasn't aware of the license issue, wow. Not a good
              | look, especially considering how simple that is to
              | resolve.
             | 
             | The model storage doesn't bother me but I also use Docker
             | so I'm used to having a lot of tool-managed data to deal
             | with. YMMV.
             | 
             | Edit: Removed question about GPU support.
        
             | codetrotter wrote:
              | I think this is also a problem in a lot of tools, and one
              | that is never talked about.
              | 
              | Even I haven't thought about this very deeply, even
              | though I am also very concerned about honoring other
              | people's work and making sure licenses are followed.
              | 
              | For example, I have some command-line tools written in
              | Rust that depend on various libraries. But because I
              | mostly distribute my software in source form, I haven't
              | really paid attention to how a command-line tool that is
              | distributed as a compiled binary should include
              | attribution and copies of the licenses of its
              | dependencies.
              | 
              | The places where I've given these concerns more thought
              | are full-blown GUI apps, which usually have an about menu
              | with info about their dependencies, and commercial
              | electronics that use open-source software in their
              | firmware. Those physical products usually include printed
              | documents alongside the product where attributions and
              | license texts are sometimes found, and if the product has
              | a display, or a display output, there is often a menu
              | somewhere with that sort of info.
              | 
              | I know that Debian, for example, is very thorough about
              | license details, but I've never looked at what they do
              | with command-line tools that compile third-party code
              | into them. Do the package maintainers dig up copies of
              | the licenses from the source and dependencies and put
              | them somewhere in /usr/share/ as plain text files? Do the
              | .deb files themselves contain license text copies you can
              | view but which are not installed onto the system? Do they
              | work with software authors to add a flag that shows the
              | licenses? Or something else?
        
         | jart wrote:
         | A great place to start is with the LLaMA 3.2 q6 llamafile I
         | posted a few days ago.
         | https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafi...
         | We have a new CLI chatbot interface that's really fun to use.
         | Syntax highlighting and all. You can also use GPU by passing
         | the -ngl 999 flag.
        
           | cromka wrote:
            | "On _Windows_, only the graphics card driver needs to be
            | installed if you own an NVIDIA GPU. On _Windows_, if you
            | have an AMD GPU, you should install the ROCm SDK v6.1 and
            | then pass the flags --recompile --gpu amd the first time
            | you run your llamafile."
            | 
            | Looks like there's a typo; Windows is mentioned twice.
        
         | yumraj wrote:
          | Can it use the GPU if available, say on Apple silicon Macs?
        
           | unkeen wrote:
           | > GPU on MacOS ARM64 is supported by compiling a small module
           | using the Xcode Command Line Tools, which need to be
           | installed. This is a one time cost that happens the first
           | time you run your llamafile.
        
             | xyc wrote:
              | I wonder if it's possible for llamafile to be distributed
              | without needing the Xcode Command Line Tools, but perhaps
              | they're necessary for the single cross-platform binary.
              | 
              | Loved llamafile and used it to build the first version of
              | https://recurse.chat/, but live compilation using the
              | Xcode Command Line Tools is a no-go for Mac App Store
              | builds (which run in the Mac App Sandbox). llama.cpp
              | doesn't need compiling on the user's machine, FWIW.
        
         | rmbyrro wrote:
         | Do you have a ballpark idea of how much RAM would be necessary
         | to run llama 3.1 8b and 70b on 8-quant?
        
           | karolist wrote:
            | Roughly, at Q8 the parameter count in billions translates
            | to the size in GB, so ~8 GB and ~70 GB.
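            | 
            | A rough back-of-the-envelope in Python (the 20% overhead
            | factor for the KV cache and runtime buffers is a guess on
            | my part, not a measurement):
            | 
            |     def approx_ram_gb(params_billion, bits_per_weight=8,
            |                       overhead=1.2):
            |         # weights alone: params * bits / 8 bytes per param
            |         weight_gb = params_billion * bits_per_weight / 8
            |         return weight_gb * overhead
            | 
            |     approx_ram_gb(8)   # ~9.6 GB for Llama 3.1 8B at Q8
            |     approx_ram_gb(70)  # ~84 GB for Llama 3.1 70B at Q8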
        
         | AlfredBarnes wrote:
         | Thanks for posting this!
        
         | bagels wrote:
         | How great is the performance? Tokens/s?
        
       | littlestymaar wrote:
        | With the same mindset, but without even PyTorch as a
        | dependency, there's a straightforward CPU implementation of
        | llama/gemma in Rust: https://github.com/samuel-vitorino/lm.rs/
       | 
       | It's impressive to realize how little code is needed to run these
       | models at all.
        
       | Ship_Star_1010 wrote:
        | PyTorch has a native LLM solution that supports all the Llama
        | models, on CPU, MPS and CUDA:
        | https://github.com/pytorch/torchchat. I'm getting 4.5 tokens a
        | second with 3.1 8B at full precision, CPU only, on my M1.
        
         | ajaksalad wrote:
         | > I was a bit surprised Meta didn't publish an example way to
         | simply invoke one of these LLM's with only torch (or some
         | minimal set of dependencies)
         | 
         | Seems like torchchat is exactly what the author was looking
         | for.
         | 
         | > And the 8B model typically gets killed by the OS for using
         | too much memory.
         | 
         | Torchchat also provides some quantization options so you can
         | reduce the model size to fit into memory.
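          | 
          | As a rough illustration of why quantization helps (this is
          | generic PyTorch dynamic quantization, not torchchat's own
          | mechanism, and the layer sizes are made up):
          | 
          |     import torch
          | 
          |     # A toy float32 model; int8-quantizing the nn.Linear
          |     # weights roughly quarters the memory they occupy.
          |     model = torch.nn.Sequential(
          |         torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
          |         torch.nn.Linear(4096, 4096))
          |     qmodel = torch.ao.quantization.quantize_dynamic(
          |         model, {torch.nn.Linear}, dtype=torch.qint8)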
        
       | tcdent wrote:
       | > from llama_models.llama3.reference_impl.model import
       | Transformer
       | 
       | This just imports the Llama reference implementation and patches
       | the device FYI.
       | 
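        | Roughly the pattern, as a sketch of the idea rather than the
        | repo's exact code (the checkpoint path is a placeholder, and
        | building the Transformer from params.json is elided):
        | 
        |     import torch
        |     from llama_models.llama3.reference_impl.model import Transformer
        | 
        |     # One way to express the "device patch": keep everything
        |     # on CPU rather than whatever device the reference code
        |     # would otherwise pick.
        |     torch.set_default_device("cpu")
        |     state_dict = torch.load("consolidated.00.pth",  # placeholder
        |                             map_location="cpu")
        |     # Transformer is then constructed from the args in
        |     # params.json and fed state_dict via load_state_dict().
        | 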
       | There are more robust implementations out there.
        
       | I_am_tiberius wrote:
        | Does anyone know what the easiest way to finetune a model
        | locally is today?
        
       ___________________________________________________________________
       (page generated 2024-10-11 23:00 UTC)