# Running Llama locally with minimal dependencies

## Motivation

I want to peel back the layers of the onion and other gluey-mess to gain insight into these models. There are other popular ways to invoke these models, such as Ollama and Hugging Face's general API package, transformers, but those hide the interesting details behind an API.

I was a bit surprised Meta didn't publish an example of how to simply invoke one of these LLMs with only torch (or some minimal set of dependencies), though I am obviously grateful for, and so pleased with, their contribution of the public weights!

## Setup steps

1. Download the relevant model weight(s) via https://www.llama.com/llama-downloads/
2. `$ pip install -r requirements.txt`
3. `$ cd llama-models; pip install -e .; cd ..`
4. `$ python minimal_run_inference.py` or `$ python run_inference.py`

## Exploring the model & outputs

`run_inference.py` is more bloated than `minimal_run_inference.py`: it implements beam-search and features far more explanatory comments. `minimal_run_inference.py` is a simple, few-lines-of-code way to run the Llama models. It's a great place to start hacking around or exploring on your own. If one of the steps in it doesn't make sense, peek over at `run_inference.py`, where there are likely detailed comments.

## Script parameters

The global variables in the inference scripts (`MODEL_NAME`, `LLAMA_MODELS_DIR`, `INPUT_STRING` and `DEVICE`) take the values you'd expect; there are adjacent comments with examples and more details too. Modify them as you see fit.

## Technical Overview

### Dependencies

The minimal set of dependencies I found includes: torch (perhaps obviously); fairscale, a lesser-known library also published by Meta, which implements a variety of highly scalable/parallelizable analogues of torch operators; and blobfile, which implements a general file I/O mechanism that Meta's Tokenizer implementation uses.

Meta provides the language-model weights in a simple way, but a model architecture to drop them into is still needed. That is provided, in a less obvious way, in the llama_models repo. The model-architecture class therein relies on both torch and fairscale, and expects each, specifically torch.distributed and fairscale, to be initialized appropriately. The use of CUDA is hard-coded in a few places in the official repo; I changed that and bundled that version here (as a git submodule). With those initializations squared away, the model-architecture class can be instantiated. Though, that model is largely a blank slate until we drop the weights in.

The tokenizer is similarly available in llama_models and relies on a dictionary-like file distributed along with the model weights. I'm not sure why, but that file's strings (which map to unique integers, or indices) are base64 encoded. Technically, you don't need to know that to use the Tokenizer, but if you're curious to see the actual tokens the system uses, make sure to decode appropriately!
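To make those moving pieces concrete, here is a rough sketch of the initialization flow: bring up torch.distributed and fairscale's model-parallel state, instantiate the architecture, drop in the weights, and load the tokenizer. The checkpoint directory, checkpoint filenames, `ModelArgs` fields and `llama_models` module paths below are my assumptions about the package layout and the downloaded weight bundle, not a copy of `run_inference.py`, so treat it as an illustration rather than a drop-in script.

```python
import base64
import json
import os
from pathlib import Path

import torch
from fairscale.nn.model_parallel.initialize import initialize_model_parallel

# Assumed module paths within the llama_models package; adjust if your version differs.
from llama_models.llama3.api.args import ModelArgs
from llama_models.llama3.api.tokenizer import Tokenizer
from llama_models.llama3.reference_impl.model import Transformer

# Hypothetical location of a downloaded weight bundle; point this at yours.
CHECKPOINT_DIR = Path.home() / ".llama" / "checkpoints" / "Llama3.2-1B"

# Even a single-process CPU run needs torch.distributed and fairscale's
# model-parallel state initialized, because the model class is written
# against fairscale's parallel layers.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
torch.distributed.init_process_group(backend="gloo", rank=0, world_size=1)
initialize_model_parallel(1)

# Instantiate the (blank-slate) architecture from the params.json shipped
# alongside the weights; max_seq_len/max_batch_size here are illustrative.
with open(CHECKPOINT_DIR / "params.json") as f:
    model_args = ModelArgs(max_seq_len=512, max_batch_size=1, **json.load(f))
model = Transformer(model_args)

# Drop the published weights into the architecture.
state_dict = torch.load(CHECKPOINT_DIR / "consolidated.00.pth", map_location="cpu")
model.load_state_dict(state_dict, strict=False)

# The tokenizer relies on the dictionary-like file mentioned above: one
# "<base64-encoded token> <token id>" pair per line.
tokenizer = Tokenizer(str(CHECKPOINT_DIR / "tokenizer.model"))
with open(CHECKPOINT_DIR / "tokenizer.model") as f:
    b64_token, token_id = f.readline().split()
    print(token_id, base64.b64decode(b64_token))  # decode to see the actual token bytes
```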
### Beam-search

I believe most systems use beam-search rather than greedily taking the most likely token at each time-step, so I implemented the same. Beam-search takes the k (say, 5) most likely tokens at the first time-step and uses them as seeds for k distinct sequences. At every later time-step, only the most likely next token is appended to each sequence. At the end, the overall most likely sequence is selected.
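In code, that procedure looks roughly like the sketch below. The `logits_fn` callable is a hypothetical stand-in for the Llama model's forward pass (the real scripts call the model directly); everything else is just the branch-then-greedy bookkeeping described above.

```python
import torch

def beam_search(logits_fn, prompt_tokens: list[int], k: int = 5, max_new_tokens: int = 50) -> list[int]:
    """Branch into the k most likely tokens at the first step, continue each
    candidate greedily, then keep the overall most likely sequence.

    logits_fn(tokens) is assumed to return a 1-D tensor of next-token logits.
    """
    # First time-step: the k most likely tokens seed k distinct sequences.
    first_log_probs = torch.log_softmax(logits_fn(prompt_tokens), dim=-1)
    top_log_probs, top_tokens = torch.topk(first_log_probs, k)
    sequences = [prompt_tokens + [token.item()] for token in top_tokens]
    scores = [log_prob.item() for log_prob in top_log_probs]  # cumulative log-probability per sequence

    # Every later time-step: append only the single most likely next token to each sequence.
    for _ in range(max_new_tokens - 1):
        for i, sequence in enumerate(sequences):
            log_probs = torch.log_softmax(logits_fn(sequence), dim=-1)
            next_log_prob, next_token = log_probs.max(dim=-1)
            sequence.append(next_token.item())
            scores[i] += next_log_prob.item()

    # At the end, select the overall most likely sequence.
    return sequences[max(range(k), key=lambda i: scores[i])]
```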
### Performance notes

Using CPU, I can pretty comfortably run the 1B model on my MacBook Air (M1, 16GB of RAM), averaging about 1 token per second. The 3B model struggles and gets about 1 token every 60 seconds. And the 8B model typically gets killed by the OS for using too much memory.

Initially, using mps (Metal Performance Shaders), i.e. Apple's GPU, would produce all NaNs as model output. The issue is due to a known bug in torch.triu, which I implemented a workaround for in the llama-models git submodule.

With mps, the inference time of the first few tokens on the 1B model is notably faster, but the memory usage is much higher. It's not entirely clear to me why the memory usage differs so notably, particularly given Apple's unified memory layout (i.e. CPU & GPU share memory). Once the sequence reaches about 100 or 200 tokens, the throughput slows down notably, to about half of the CPU's throughput. I suspect that the relatively higher memory load of the GPU (caused for unknown reasons), in conjunction with a growing sequence length, starts to swamp my system's available memory to a degree that affects the computation speed.

Aside on GPU memory: I'm using a batch-size of 1, so there's no batch parallelism (i.e. presumably multiple full models in memory). And the memory used by each transformer layer should be relatively constant, unless perhaps each attention head's parameters are loaded into memory then discarded, whereas in the parallel (i.e. GPU) case all heads are simultaneously loaded, AND that difference is enough to cause a notable change in memory load. If you know why, drop a note!
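For reference, the torch.triu issue mentioned in the performance notes presumably bites where the reference implementation builds its additive causal attention mask. The actual fix lives in the bundled llama-models submodule; the snippet below is only a rough illustration of one way to sidestep it (construct the mask on CPU, then move it to the device), not the patch itself.

```python
import torch

def causal_mask(seq_len: int, device: torch.device) -> torch.Tensor:
    """Build an additive causal attention mask (-inf strictly above the diagonal).

    Calling torch.triu directly on an mps tensor has been known to produce NaNs,
    so this sketch builds the mask on CPU and only then moves it to the device.
    """
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask = torch.triu(mask, diagonal=1)  # keep -inf strictly above the diagonal, zeros elsewhere
    return mask.to(device)

# Example: a 4x4 mask for mps, falling back to CPU if mps is unavailable.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
print(causal_mask(4, device))
```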