# Chat with LLMs Everywhere

torchchat is a small codebase showcasing the ability to run large language models (LLMs) seamlessly. With torchchat, you can run LLMs using Python, within your own (C/C++) application (desktop or server) and on iOS and Android.

## What can you do with torchchat?

* Run models via PyTorch / Python
  + Chat
  + Generate
  + Run chat in the Browser
* Run models on desktop/server without Python
  + Use AOT Inductor for faster execution
  + Run in C++ using the runner
* Run models on mobile
  + Deploy and run on iOS
  + Deploy and run on Android
* Evaluate a model

## Highlights

* Command line interaction with popular LLMs such as Llama 3, Llama 2, Stories, Mistral and more
* PyTorch-native execution with performance
* Supports popular hardware and OS
  + Linux (x86)
  + macOS (M1/M2/M3)
  + Android (devices that support XNNPACK)
  + iOS 17+ (iPhone 13 Pro+)
* Multiple data types including: float32, float16, bfloat16
* Multiple quantization schemes
* Multiple execution modes including: Python (Eager, Compile) or Native (AOT Inductor (AOTI), ExecuTorch)

## Installation

The following steps require that you have Python 3.10 installed.

```bash
# get the code
git clone https://github.com/pytorch/torchchat.git
cd torchchat

# set up a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# install dependencies
./install_requirements.sh
```
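If you want to confirm the environment before moving on, a minimal sanity check (plain PyTorch calls, not part of torchchat) is to import torch from the activated virtual environment and see which accelerators it can reach:

```python
# Quick environment check after ./install_requirements.sh
# (assumes the .venv created above is active).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())          # Linux/NVIDIA
print("MPS available:  ", torch.backends.mps.is_available())  # Apple Silicon
```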
## Commands

The interfaces of torchchat are exposed through Python commands and native runners. The Python commands are enumerated in the `--help` menu, while the native runners are covered in their respective sections below.

```
python3 torchchat.py --help

# Output
usage: torchchat [-h] {chat,browser,generate,export,eval,download,list,remove,where,server} ...

positional arguments:
  {chat,browser,generate,export,eval,download,list,remove,where,server}
                        The specific command to run
    chat                Chat interactively with a model via the CLI
    generate            Generate responses from a model given a prompt
    browser             Chat interactively with a model in a locally hosted browser
    export              Export a model artifact to AOT Inductor or ExecuTorch
    download            Download model artifacts
    list                List all supported models
    remove              Remove downloaded model artifacts
    where               Return directory containing downloaded model artifacts
    server              [WIP] Starts a locally hosted REST server for model interaction
    eval                Evaluate a model via lm-eval

options:
  -h, --help            show this help message and exit
```

### Python Inference (chat, generate, browser, server)

* These commands represent different flavors of performing model inference in a Python environment.
* Models are constructed either from CLI args or from loading exported artifacts.

### Exporting (export)

* This command generates model artifacts that are consumed by Python Inference or Native Runners.
* More information is provided in the AOT Inductor and ExecuTorch sections.

### Inventory Management (download, list, remove, where)

* These commands are used to manage and download models.
* More information is provided in the Download Weights section.

### Evaluation (eval)

* This command tests model fidelity via EleutherAI's lm_evaluation_harness.
* More information is provided in the Evaluation section.

## Download Weights

Most models use Hugging Face as the distribution channel, so you will need to create a Hugging Face account. Create a Hugging Face user access token, as documented in the Hugging Face Hub docs, with the `write` role.

Log into Hugging Face:

```bash
huggingface-cli login
```

Once this is done, torchchat will be able to download model artifacts from Hugging Face.

```bash
python3 torchchat.py download llama3.1
```

> **Note**: This command may prompt you to request access to Llama 3 via Hugging Face if you do not already have access. Simply follow the prompts and re-run the command when access is granted.

### Additional Model Inventory Management Commands

#### List

This subcommand shows the available models.

```bash
python3 torchchat.py list
```

#### Where

This subcommand shows the location of a particular model, which is useful in scripts when you do not want to hard-code paths.

```bash
python3 torchchat.py where llama3.1
```

#### Remove

This subcommand removes the specified model.

```bash
python3 torchchat.py remove llama3.1
```

More information about these commands can be found by adding the `--help` option.
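Because `where` prints the model directory, it composes well with scripts. A small sketch (assuming `torchchat.py` is in the working directory and `llama3.1` has already been downloaded) that resolves the tokenizer path without hard-coding it:

```python
import subprocess

# Ask torchchat where the downloaded artifacts live instead of hard-coding paths.
model_dir = subprocess.run(
    ["python3", "torchchat.py", "where", "llama3.1"],
    capture_output=True, text=True, check=True,
).stdout.strip()

tokenizer_path = f"{model_dir}/tokenizer.model"
print("Tokenizer:", tokenizer_path)
```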
## Running via PyTorch / Python

The simplest way to run a model in PyTorch is via eager execution. This is the default execution mode for both PyTorch and torchchat. It performs inference without exporting artifacts or using a separate runner. The model used for inference can also be configured and tailored to specific needs (compilation, quantization, etc.). See the customization guide for the options supported by torchchat.

> **Tip**: For more information about these commands, please refer to the `--help` menu.

### Chat

This mode allows you to chat with an LLM in an interactive fashion.

```bash
python3 torchchat.py chat llama3.1
```

### Generate

This mode generates text based on an input prompt.

```bash
python3 torchchat.py generate llama3.1 --prompt "write me a story about a boy and his bear"
```

### Browser

This mode allows you to chat with the model using a UI in your browser. Running the command automatically opens a tab in your browser.

```bash
streamlit run torchchat.py -- browser llama3.1
```

### Server

> **Note**: This feature is still a work in progress and not all endpoints are working.

This mode exposes a REST API that matches the OpenAI API spec for interacting with a model.

To test out the REST API, you'll need two terminals: one to host the server, and one to send the request.

In one terminal, start the server:

```bash
python3 torchchat.py server llama3.1
```

In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond.

Setting `stream` to "true" in the request emits a response in chunks. Currently, this response is plain text and is not formatted to the OpenAI API specification. If `stream` is unset or not "true", the client will await the full response from the server.

**Example input + output**

```bash
curl http://127.0.0.1:5000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "stream": "true",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```

```
{"response":" I'm a software developer with a passion for building innovative and user-friendly applications. I have experience in developing web and mobile applications using various technologies such as Java, Python, and JavaScript. I'm always looking for new challenges and opportunities to learn and grow as a developer.\n\nIn my free time, I enjoy reading books on computer science and programming, as well as experimenting with new technologies and techniques. I'm also interested in machine learning and artificial intelligence, and I'm always looking for ways to apply these concepts to real-world problems.\n\nI'm excited to be a part of the developer community and to have the opportunity to share my knowledge and experience with others. I'm always happy to help with any questions or problems you may have, and I'm looking forward to learning from you as well.\n\nThank you for visiting my profile! I hope you find my information helpful and interesting. If you have any questions or would like to discuss any topics, please feel free to reach out to me. I"}
```
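The same request can be issued from Python. A minimal sketch using the `requests` package (not installed by torchchat; `pip install requests` if needed), mirroring the curl payload above and printing streamed chunks as they arrive:

```python
import requests

# Assumes `python3 torchchat.py server llama3.1` is running in another terminal.
payload = {
    "model": "llama3.1",
    "stream": "true",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
}

with requests.post("http://127.0.0.1:5000/chat", json=payload, stream=True) as resp:
    resp.raise_for_status()
    # While the server is a work in progress, streamed chunks are plain text
    # rather than OpenAI-style SSE events, so just print them as they arrive.
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```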
## Desktop/Server Execution

### AOTI (AOT Inductor)

AOTI compiles models before execution for faster inference. The process creates a DSO model (represented by a file with the extension `.so`) that is then loaded for inference. This can be done from both Python and C++ environments.

The following example exports and executes the Llama3.1 8B Instruct model. The first command compiles and performs the actual export.

```bash
python3 torchchat.py export llama3.1 --output-dso-path exportedModels/llama3.1.so
```

> **Note**: If your machine has CUDA, add the flag `--quantize config/data/cuda.json` when exporting for better performance. For more details on quantization and what settings to use for your use case, visit our customization guide.

#### Run in a Python Environment

To run in a Python environment, use the `generate` subcommand like before, but include the DSO file.

```bash
python3 torchchat.py generate llama3.1 --dso-path exportedModels/llama3.1.so --prompt "Hello my name is"
```

> **Note**: Depending on which accelerator was used to generate the `.so` file, the command may need the device specified: `--device (cuda | cpu)`.

#### Run using our C++ Runner

To run in a C++ environment, we first need to build the runner binary.

```bash
scripts/build_native.sh aoti
```

Then run the compiled executable, with the exported DSO from earlier.

```bash
cmake-out/aoti_run exportedModels/llama3.1.so -z `python3 torchchat.py where llama3.1`/tokenizer.model -l 3 -i "Once upon a time"
```

> **Note**: Depending on which accelerator was used to generate the `.so` file, the runner may need the device specified: `-d (CUDA | CPU)`.
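The `generate` subcommand and the C++ runner load the DSO for you. If you want to inspect the artifact from a standalone Python script, recent PyTorch builds expose `torch._export.aot_load` for AOTI-compiled models; treat the snippet below as a sketch, since the callable's exact input signature (token and position tensors) is defined by torchchat's export code and the helper is a private API that may change between releases.

```python
import torch

# Assumes a PyTorch build that provides torch._export.aot_load and a DSO
# exported for the device type passed here ("cpu" or "cuda").
runner = torch._export.aot_load("exportedModels/llama3.1.so", device="cpu")

# `runner` is a callable wrapping the compiled model. Its expected inputs
# follow torchchat's export signature, so in practice the torchchat CLI or
# the C++ runner remain the simplest ways to drive it end to end.
print(type(runner))
```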
## Mobile Execution

ExecuTorch enables you to optimize your model for execution on a mobile or embedded device.

### Set Up ExecuTorch

Before running any commands in torchchat that require ExecuTorch, you must first install ExecuTorch.

To install ExecuTorch, run the following commands. This will download the ExecuTorch repo to `./et-build/src` and install various ExecuTorch libraries to `./et-build/install`.

> **Important**: The following commands should be run from the torchchat root directory.

```bash
export TORCHCHAT_ROOT=${PWD}
./scripts/install_et.sh
```

### Export for mobile

Similar to AOTI, to deploy onto device we first export the PTE artifact, then load the artifact for inference. The following example uses the Llama3.1 8B Instruct model.

```bash
# Export
python3 torchchat.py export llama3.1 --quantize config/data/mobile.json --output-pte-path llama3.1.pte
```

> **Note**: We use `--quantize config/data/mobile.json` to quantize the llama3.1 model to reduce model size and improve performance for on-device use cases. For more details on quantization and what settings to use for your use case, visit our customization guide.

### Deploy and run on Desktop

While ExecuTorch does not focus on desktop inference, it is capable of it. This is handy for testing out PTE models without sending them to a physical device. There are two ways of doing so: pure Python, or via a runner.

#### Deploying via Python

```bash
# Execute
python3 torchchat.py generate llama3.1 --device cpu --pte-path llama3.1.pte --prompt "Hello my name is"
```

#### Deploying via a Runner

Build the runner:

```bash
scripts/build_native.sh et
```

Execute using the runner:

```bash
cmake-out/et_run llama3.1.pte -z `python3 torchchat.py where llama3.1`/tokenizer.model -l 3 -i "Once upon a time"
```

### Deploy and run on iOS

The following assumes you've completed the steps for Setting up ExecuTorch.

#### Deploying with Xcode

**Requirements**

* Xcode 15.0 or later
* CMake 3.19 or later
  + Download and open the macOS `.dmg` installer and move the CMake app to the /Applications folder.
  + Install CMake command line tools: `sudo /Applications/CMake.app/Contents/bin/cmake-gui --install`
* A development provisioning profile with the increased-memory-limit entitlement.

**Steps**

1. Open the Xcode project:

   ```bash
   open et-build/src/executorch/examples/demo-apps/apple_ios/LLaMA/LLaMA.xcodeproj
   ```

   Note: If you're running into any issues related to package dependencies, close Xcode, clean some of the caches and/or the build products, and open the Xcode project again:

   ```bash
   rm -rf \
     ~/Library/org.swift.swiftpm \
     ~/Library/Caches/org.swift.swiftpm \
     ~/Library/Caches/com.apple.dt.Xcode \
     ~/Library/Developer/Xcode/DerivedData
   ```

2. Click the Play button to launch the app in the Simulator.

3. To run on a device, ensure you have it set up for development and a provisioning profile with the increased-memory-limit entitlement. Update the app's bundle identifier to match your provisioning profile with the required capability.

4. After successfully launching the app, copy the exported ExecuTorch model (`.pte`) and tokenizer (`.model`) files to the iLLaMA folder. You can find the model file called `llama3.1.pte` in the current torchchat directory and the tokenizer file at `$(python3 torchchat.py where llama3.1)/tokenizer.model`.

   + For the Simulator: Drag and drop both files onto the Simulator window and save them in the On My iPhone > iLLaMA folder.
   + For a device: Open a separate Finder window, navigate to the Files tab, drag and drop both files into the iLLaMA folder, and wait for the copying to finish.

5. Follow the app's UI guidelines to select the model and tokenizer files from the local filesystem and issue a prompt.

Click the image below to see it in action!

*iOS app running a Llama model*

### Deploy and run on Android

The following assumes you've completed the steps for Setting up ExecuTorch.

#### Approach 1 (Recommended): Android Studio

**Requirements**

* Android Studio
* Java 17
* Android SDK 34
* adb

**Steps**

1. Download the AAR file, which contains the Java library and corresponding JNI library, to build and run the app.
   + executorch-llama-tiktoken-rc3-0719.aar (SHASUM: c3e5d2a97708f033c2b1839a89f12f737e3bbbef)

2. Rename the downloaded AAR file to `executorch.aar` and move it to `android/torchchat/app/libs/`. You may need to create the directory `android/torchchat/app/libs/` if it does not exist.

3. Push the model and tokenizer file to your device. You can find the model file called `llama3.1.pte` in the current torchchat directory and the tokenizer file at `$(python3 torchchat.py where llama3.1)/tokenizer.model`.

   ```bash
   adb shell mkdir -p /data/local/tmp/llama
   adb push llama3.1.pte /data/local/tmp/llama
   adb push $(python3 torchchat.py where llama3.1)/tokenizer.model /data/local/tmp/llama
   ```

4. Use Android Studio to open the torchchat app skeleton, located at `android/torchchat`.

5. Click the Play button (^R) to launch it to an emulator/device.
   + We recommend using a device with at least 12GB RAM and 20GB storage.
   + If using an emulated device, refer to this post on how to set the RAM.

6. Follow the app's UI guidelines to pick the model and tokenizer files from the local filesystem. Then issue a prompt.

> **Note**: The AAR file listed in Step 1 bundles the tiktoken tokenizer, which is used for Llama 3. To tweak or use a custom tokenizer and runtime, modify the ExecuTorch code and use this script to build the AAR library. For convenience, we also provide an AAR for the sentencepiece tokenizer (e.g. Llama 2): executorch-llama-bpe-rc3-0719.aar (SHASUM: d5fe81d9a4700c36b50ae322e6bf34882134edb0)

*Android app running a Llama model*

#### Approach 2: E2E Script

Alternatively, you can run `scripts/android_example.sh`, which sets up Java, the Android SDK Manager, the Android SDK, an Android emulator (if no physical device is found), builds the app, and launches it for you. It can be used if you don't have a GUI.

```bash
export TORCHCHAT_ROOT=$(pwd)
export USE_TIKTOKEN=ON # Set this only for tiktoken tokenizer
sh scripts/android_example.sh
```

## Eval

> **Note**: This feature is still a work in progress and not all features are working.

Uses the lm_eval library to evaluate model accuracy on a variety of tasks. It defaults to wikitext and can be manually controlled using the tasks and limit args. See Evaluation for more details.

**Examples**

Eager mode:

```bash
python3 torchchat.py eval llama3.1 --dtype fp32 --limit 5
```

To test the perplexity for a lowered or quantized model, pass it in the same way you would to generate:

```bash
python3 torchchat.py eval llama3.1 --pte-path llama3.1.pte --limit 5
```

## Models

The following models are supported by torchchat and have associated aliases.

| Model | Notes |
| --- | --- |
| meta-llama/Meta-Llama-3.1-8B-Instruct | Tuned for chat. Alias to `llama3.1`. |
| meta-llama/Meta-Llama-3.1-8B | Best for generate. Alias to `llama3.1-base`. |
| meta-llama/Meta-Llama-3-8B-Instruct | Tuned for chat. Alias to `llama3`. |
| meta-llama/Meta-Llama-3-8B | Best for generate. Alias to `llama3-base`. |
| meta-llama/Llama-2-7b-chat-hf | Tuned for chat. Alias to `llama2`. |
| meta-llama/Llama-2-13b-chat-hf | Tuned for chat. Alias to `llama2-13b-chat`. |
| meta-llama/Llama-2-70b-chat-hf | Tuned for chat. Alias to `llama2-70b-chat`. |
| meta-llama/Llama-2-7b-hf | Best for generate. Alias to `llama2-base`. |
| meta-llama/CodeLlama-7b-Python-hf | Tuned for Python and generate. Alias to `codellama`. |
| meta-llama/CodeLlama-34b-Python-hf | Tuned for Python and generate. Alias to `codellama-34b`. |
| mistralai/Mistral-7B-v0.1 | Best for generate. Alias to `mistral-7b-v01-base`. |
| mistralai/Mistral-7B-Instruct-v0.1 | Tuned for chat. Alias to `mistral-7b-v01-instruct`. |
| mistralai/Mistral-7B-Instruct-v0.2 | Tuned for chat. Alias to `mistral`. |
| tinyllamas/stories15M | Toy model for generate. Alias to `stories15M`. |
| tinyllamas/stories42M | Toy model for generate. Alias to `stories42M`. |
| tinyllamas/stories110M | Toy model for generate. Alias to `stories110M`. |
| openlm-research/open_llama_7b | Best for generate. Alias to `open-llama`. |

While we describe how to use torchchat with the popular llama3 model, you can perform the example commands with any of these models.
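Because every entry has a CLI alias, bulk operations are easy to script. A small sketch (assuming `torchchat.py` is in the working directory) that fetches the toy `stories*` checkpoints for quick experiments and then lists the supported models:

```python
import subprocess

# Download the small tinyllamas "stories" models listed above.
for alias in ["stories15M", "stories42M", "stories110M"]:
    subprocess.run(["python3", "torchchat.py", "download", alias], check=True)

# Show the supported models.
subprocess.run(["python3", "torchchat.py", "list"], check=True)
```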
## Design Principles

torchchat embodies PyTorch's design philosophy, especially "usability over everything else".

### Native PyTorch

torchchat is a native-PyTorch library. While we provide integrations with the surrounding ecosystem (e.g. Hugging Face models), all of the core functionality is written in PyTorch.

### Simplicity and Extensibility

torchchat is designed to be easy to understand, use and extend.

* Composition over implementation inheritance - layers of inheritance for code re-use make the code hard to read and extend
* No training frameworks - explicitly outlining the training logic makes it easy to extend for custom use cases
* Code duplication is preferred over unnecessary abstractions
* Modular building blocks over monolithic components

### Correctness

torchchat provides well-tested components with a high bar on correctness. We provide:

* Extensive unit tests to ensure things operate as they should

## Community Contributions

We really value our community and the contributions made by our wonderful users. We'll use this section to call out some of these contributions! If you'd like to help out as well, please see the CONTRIBUTING guide.

## Troubleshooting

**CERTIFICATE_VERIFY_FAILED**

Run `pip install --upgrade certifi`.

**Access to model is restricted and you are not in the authorized list**

Some models require an additional step to access. Follow the link provided in the error to get access.

**Installing ET Fails**

If `./scripts/install_et.sh` fails with an error like `Building wheel for executorch (pyproject.toml) did not run successfully`, it's possible that it's linking to an older version of PyTorch installed some other way, such as via Homebrew. You can break the link by uninstalling other versions, for example `brew uninstall pytorch`. Note: you may break something that depends on this, so be aware.

## Filing Issues

Please include the exact command you ran and the output of that command. Also, run this script and include the output saved to `system_info.txt` so that we can better debug your issue.
```bash
(echo "Operating System Information"; uname -a; echo ""; cat /etc/os-release; echo ""; echo "Python Version"; python --version || python3 --version; echo ""; echo "PIP Version"; pip --version || pip3 --version; echo ""; echo "Installed Packages"; pip freeze || pip3 freeze; echo ""; echo "PyTorch Version"; python -c "import torch; print(torch.__version__)" || python3 -c "import torch; print(torch.__version__)"; echo ""; echo "Collection Complete") > system_info.txt
```

## Disclaimer

The torchchat Repository Content is provided without any guarantees about performance or compatibility. In particular, torchchat makes available model architectures written in Python for PyTorch that may not perform in the same manner or meet the same standards as the original versions of those models. When using the torchchat Repository Content, including any model architectures, you are solely responsible for determining the appropriateness of using or redistributing the torchchat Repository Content and assume any risks associated with your use of the torchchat Repository Content or any models, outputs, or results, both alone and in combination with any other technologies. Additionally, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, weights, data, or other technologies, and you are solely responsible for complying with all such obligations.

## Acknowledgements

Thank you to the community for all the awesome libraries and tools you've built around local LLM inference.

* Georgi Gerganov and his GGML project, for shining a spotlight on community-based enablement and inspiring so many other projects.
* Andrej Karpathy and his llama2.c project. So many great (and simple!) ideas in llama2.c that we have directly adopted (both ideas and code) from his repo. You can never go wrong by following Andrej's work.
* Michael Gschwind, Bert Maher, Scott Wolchok, Bin Bao, Chen Yang, Huamin Li and Mu-Chu Li, who built the first version of nanoGPT (DSOGPT) with AOT Inductor, proving that AOTI can be used to build efficient LLMs and that DSOs are a viable distribution format for models.
* Bert Maher and his llama2.so, which built on Andrej's llama2.c and on DSOGPT to close the loop on Llama models with AOTInductor.
* Christian Puhrsch, Horace He, Joe Isaacson and many more for their contributions to accelerating GenAI models in the "Anything, Fast!" pytorch.org blogs, and, in particular, Horace He for GPT, Fast!, which we have directly adopted (both ideas and code) from his repo.

## License

torchchat is released under the BSD 3 license. (Additional code in this distribution is covered by the MIT and Apache Open Source licenses.) However, you may have other legal obligations that govern your use of content, such as the terms of service for third-party models.