https://github.com/mlc-ai/web-llm

mlc-ai / web-llm

Bringing large-language models and chat to web browsers. Everything runs inside the browser with no server support. mlc.ai/web-llm

License: Apache-2.0 license
Latest commit: tqchen, "remove miscategorization", 514f6e3, Apr 15, 2023 (26 commits, 2 branches, 0 tags)

Files

Name          Commit time           Latest commit message
3rdparty      April 14, 2023 09:25  Initial Commit
log_db        April 14, 2023 09:25  Initial Commit
scripts       April 14, 2023 20:21  update gh
site          April 15, 2023 17:03  remove miscategorization
web           April 14, 2023 21:48  Add stats
web_llm       April 14, 2023 20:03  Quantization with optional transposition and fusion with matmul (#15)
.gitignore    April 14, 2023 09:25  Initial Commit
.gitmodules   April 14, 2023 09:25  Initial Commit
LICENSE       April 14, 2023 09:25  add readme
README.md     April 15, 2023 17:02  Update README.md
build.py      April 14, 2023 20:12  Add shader dump
chat.py       April 14, 2023 09:25  Initial Commit
evaluate.py   April 14, 2023 09:25  Introducing per-function profiling (#6)
setup.py      April 14, 2023 09:25  Initial Commit

README.md

Web LLM

This project brings language model chats directly onto web browsers. Everything runs inside the browser with no server support and is accelerated with WebGPU.
This opens up a lot of fun opportunities to build AI assistants for everyone and to preserve privacy while enjoying GPU acceleration. Check out our demo webpage to try it out!

[demo]

We have been seeing amazing progress in generative AI and LLMs recently. Thanks to open-source efforts like LLaMA, Alpaca, Vicuna, and Dolly, we can now see an exciting future of building our own open-source language models and personal AI assistants. These models are usually big and compute-heavy. To build a chat service, we need a large cluster to run an inference server, while clients send requests to the servers and retrieve the inference output. We also usually have to run on a specific type of GPU where popular deep-learning frameworks are readily available.

This project is our step to bring more diversity to the ecosystem. Specifically, can we simply bake LLMs directly into the client side and run them inside a browser? If that can be realized, we could offer support for client-side personal AI models with the benefits of cost reduction, better personalization, and privacy protection. The client side is getting pretty powerful.

Wouldn't it be even more amazing if we could simply open up a browser and bring AI natively to your browser tab? There is some level of readiness in the ecosystem. WebGPU has just shipped and enables native GPU execution in the browser.

Still, there are big hurdles to cross, to name a few:

* We need to bring the models somewhere without the relevant GPU-accelerated Python frameworks.
* Most AI frameworks rely heavily on optimized compute libraries that are maintained by hardware vendors. We need to start from scratch.
* We need careful planning of memory usage and aggressive compression of weights so that the models fit into memory.

We also do not want to do this for just one model.
Instead, we would like to present a repeatable and hackable workflow that enables anyone to easily develop and optimize these models in a productive Python-first approach, and deploy them universally, including on the web. Besides supporting WebGPU, this project also provides the harness for other kinds of GPU backends that TVM supports (such as CUDA, OpenCL, and Vulkan), which makes LLM deployment broadly accessible.

How

The key technology here is machine learning compilation (MLC). Our solution stands on the shoulders of the open-source ecosystem, including Hugging Face, model variants from LLaMA and Vicuna, wasm, and WebGPU. The main flow builds on Apache TVM Unity, an exciting ongoing development in the Apache TVM community.

* We bake a language model's IRModule in TVM with native dynamic shape support, avoiding the need to pad to the maximum length and reducing both the amount of computation and memory usage.
* Each function in TVM's IRModule can be further transformed to generate runnable code that can be deployed universally on any environment supported by a minimal TVM runtime (JavaScript being one of them).
* TensorIR is the key technique used to generate optimized programs. We provide productive solutions by quickly transforming TensorIR programs based on a combination of expert knowledge and an automated scheduler.
* Heuristics are used when optimizing light-weight operators in order to reduce the engineering pressure.
* We use int4 quantization techniques to compress the model weights so that they can fit into memory.
* We build static memory planning optimizations to reuse memory across multiple layers.
* We use Emscripten and TypeScript to build a TVM web runtime that can deploy generated modules.
* We also leverage a wasm port of the SentencePiece tokenizer.

[web-llm]

All parts of this workflow are done in Python, with the exception, of course, of the last part, which builds a 600-line JavaScript app that connects things together.
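To make the int4 weight compression mentioned in the list above more concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization. It is an illustration under stated assumptions, not the scheme this project actually implements: the function names `quantize_int4` and `dequantize_int4`, the `group_size` parameter, and the symmetric one-scale-per-group layout are all hypothetical choices made for the example.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float], group_size: int = 4) -> Tuple[bytes, List[float]]:
    """Map floats to 4-bit codes (one scale per group) and pack two codes per byte.

    Hypothetical helper for illustration; not part of web-llm or TVM.
    """
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Symmetric range [-7, 7]; fall back to 1.0 for an all-zero group.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = max(-7, min(7, round(w / scale)))
            codes.append(q + 8)  # shift into the unsigned range 1..15
    if len(codes) % 2:
        codes.append(8)  # pad with the code that represents zero
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scales


def dequantize_int4(packed: bytes, scales: List[float], count: int,
                    group_size: int = 4) -> List[float]:
    """Unpack the 4-bit codes and rescale them back to approximate floats."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble was stored first
        codes.append(byte >> 4)    # high nibble was stored second
    return [(codes[i] - 8) * scales[i // group_size] for i in range(count)]


if __name__ == "__main__":
    original = [0.7, -0.35, 0.0, 0.1, 2.0, -1.0, 0.5, 1.5]
    packed, scales = quantize_int4(original)
    restored = dequantize_int4(packed, scales, len(original))
    print(len(packed), "bytes for", len(original), "weights")
    print(restored)
```

Each weight takes 4 bits plus a share of its group's scale, and the rounding error per weight is bounded by half of that scale. Smaller groups track the weights more closely at the cost of storing more scales.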
Building this workflow is also a fun process of interactive development when bringing in new models.

All of this is made possible by the open-source ecosystem that we leverage. Specifically, we make heavy use of TVM Unity, an exciting recent development in the TVM project that enables a Python-first interactive MLC development experience. It allows us to easily compose new optimizations, all in Python, and incrementally bring our app to the web. TVM Unity also provides an easy way to compose new solutions in the ecosystem. We will continue to bring further optimizations, such as fused quantization kernels, and bring them to more platforms.

One key characteristic of LLMs is their dynamic nature. As the decoding and encoding process depends on computations that grow with the number of tokens, we leverage the first-class dynamic shape support in TVM Unity, which represents sequence dimensions through symbolic integers. This allows us to plan ahead and statically allocate all the memory needed for the sequence window of interest without padding.

We also leverage the integration of tensor expressions to quickly express partial-tensor computations such as rotary embedding directly, without materializing them into full-tensor matrix computations.

Comparison to Native GPU Runtime, Limitations and Opportunities

Besides the WebGPU runtime, we also provide options for native deployment with a local GPU runtime. These can be used both as a tool to deploy in a native environment and as a reference point to compare native GPU driver performance with WebGPU.

WebGPU works by translating WGSL shaders to native shaders. We observed that there are opportunities to close the gap between the WebGPU runtime and the native environment completely.

Some of the current gap arises because Chrome's WebGPU implementation inserts bounds clamps for all array index accesses, such that a[i] becomes a[min(i, a.size - 1)]. This can be optimized out as WebGPU support continues to mature.
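As a rough illustration of what index clamping means, the sketch below emulates it in Python. It is only an analogy: the names `clamped_load` and `sum_window` are hypothetical, and the code does not reproduce the shader code that Chrome or Dawn actually generates. It shows that every load pays for an extra comparison, and that out-of-range reads silently return an edge element instead of failing.

```python
from typing import Sequence


def clamped_load(buffer: Sequence[float], index: int) -> float:
    """Read buffer[index] after forcing index into [0, len(buffer) - 1].

    Hypothetical helper that mimics a robustness-style bounds clamp.
    """
    last = len(buffer) - 1
    return buffer[max(0, min(index, last))]


def sum_window(buffer: Sequence[float], start: int, length: int) -> float:
    """Sum `length` elements from `start`, clamping every single access."""
    total = 0.0
    for offset in range(length):
        # One extra min/max per element: this is the per-access overhead.
        total += clamped_load(buffer, start + offset)
    return total


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    print(clamped_load(data, 1))   # in-range access is unchanged
    print(clamped_load(data, 5))   # out-of-range access reads the last element
    print(sum_window(data, 1, 4))  # indices 1, 2, 3, 4 with 3 and 4 clamped to 2
```

When a compiler has already proven that all indices are in range, as a statically planned kernel can, the clamp adds cost without adding safety. That is why removing it can narrow the gap to native execution.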
You can get around the bounds clamping by launching Chrome with a special flag (thanks to the Dawn developers for providing the pointers). Exit Chrome completely, then type the following on the command line:

    /path/to/Chrome --enable-dawn-features=disable_robustness

You will then find that the execution speed is as fast as in the native GPU environment. We anticipate that this problem will be resolved as WebGPU matures. WebGPU has just shipped, and we are excited to see the opportunities it can unlock. There are also a lot of exciting upcoming features, such as fp16 extensions, that we can leverage to further improve things.

Links

* Demo page
* You might also be interested in Web Stable Diffusion.

Acknowledgement

This project is made possible thanks to collaboration with CMU School of Computer Science, Catalyst, MLC, OctoML, UW, and SJTU.

This project is only possible thanks to the open-source ecosystems whose shoulders we stand on. We want to thank the Apache TVM community and the developers of the TVM Unity effort. We also thank the open-source ML community members who made these models publicly available, and the PyTorch and Hugging Face communities that make these models accessible. We would like to thank the teams behind Vicuna, SentencePiece, LLaMA, and Alpaca. We would also like to thank the WebAssembly, Emscripten, and WebGPU communities. Finally, thanks to the Dawn and WebGPU developers.

Contributors: Tianqi Chen (@tqchen), Hongyi Jin (@jinhongyii), Ruihang Lai (@MasterJH5574), Bohan Hou (@spectrometerHBH)