https://github.com/mlc-ai/web-llm

mlc-ai / web-llm Public

Bringing large-language models and chat to web browsers. Everything
runs inside the browser with no server support.

mlc.ai/web-llm

License

Apache-2.0 license
633 stars 26 forks

Latest commit

@tqchen: remove miscategorization (514f6e3, Apr 15, 2023)

Git stats

  * 26 commits

Files

Name          Commit time            Latest commit message
3rdparty      April 14, 2023 09:25   Initial Commit
log_db        April 14, 2023 09:25   Initial Commit
scripts       April 14, 2023 20:21   update gh
site          April 15, 2023 17:03   remove miscategorization
web           April 14, 2023 21:48   Add stats
web_llm       April 14, 2023 20:03   Quantization with optional transposition and fusion with matmul (#15)
.gitignore    April 14, 2023 09:25   Initial Commit
.gitmodules   April 14, 2023 09:25   Initial Commit
LICENSE       April 14, 2023 09:25   add readme
README.md     April 15, 2023 17:02   Update README.md
build.py      April 14, 2023 20:12   Add shader dump
chat.py       April 14, 2023 09:25   Initial Commit
evaluate.py   April 14, 2023 09:25   Introducing per-function profiling (#6)
setup.py      April 14, 2023 09:25   Initial Commit

README.md

 Web LLM

This project brings language model chat directly into web browsers.
Everything runs inside the browser with no server support and is
accelerated with WebGPU. This opens up a lot of fun opportunities to
build AI assistants for everyone and to preserve privacy while
enjoying GPU acceleration.

Check out our demo webpage to try it out!

[demo]

We have been seeing amazing progress in generative AI and LLMs
recently. Thanks to open-source efforts like LLaMA, Alpaca, Vicuna,
and Dolly, we can now see an exciting future of building our own
open-source language models and personal AI assistants.

These models are usually big and compute-heavy. To build a chat
service, we need a large cluster to run an inference server, while
clients send requests to the servers and retrieve the inference
output. We also usually have to run on specific types of GPUs where
popular deep-learning frameworks are readily available.

This project is our step toward bringing more diversity to the
ecosystem. Specifically, can we simply bake LLMs directly into the
client side and run them inside a browser? If that can be realized,
we could offer support for client-side personal AI models with the
benefits of cost reduction, enhanced personalization, and privacy
protection. The client side is getting pretty powerful.

Wouldn't it be even more amazing if we could simply open up a browser
and bring AI natively to the browser tab? There is some level of
readiness in the ecosystem. WebGPU has just shipped and enables
native GPU execution in the browser.

Still, there are big hurdles to cross. To name a few:

  * We need to bring the models somewhere without the relevant
    GPU-accelerated Python frameworks.
  * Most AI frameworks rely heavily on optimized compute libraries
    that are maintained by hardware vendors. We need to start from
    scratch.
  * We need careful planning of memory usage and aggressive
    compression of weights so that we can fit the models into memory.
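To make the memory hurdle concrete, here is a rough back-of-the-envelope sketch of how the raw weight footprint shrinks with narrower weight types. The 7-billion parameter count and the helper name `weight_footprint_gib` are assumptions chosen for illustration, not figures from this project.

```python
def weight_footprint_gib(num_params: int, bits_per_weight: int) -> float:
    """Return the size of the raw weights in GiB for a given bit width."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical model size, assumed only for this illustration.
NUM_PARAMS = 7_000_000_000

for name, bits in [("fp32", 32), ("fp16", 16), ("int4", 4)]:
    print(f"{name}: {weight_footprint_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights take roughly 26 GiB in fp32, 13 GiB in fp16, and 3.3 GiB in int4. The estimate ignores quantization scales, activations, and any runtime overhead.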

We also do not want to do this for just one model. Instead, we would
like to present a repeatable and hackable workflow that enables
anyone to easily develop and optimize these models in a productive
Python-first approach, and deploy them universally, including on the
web.

Besides supporting WebGPU, this project also provides a harness for
other kinds of GPU backends that TVM supports (such as CUDA, OpenCL,
and Vulkan), enabling accessible deployment of LLMs.

 How

The key technology here is machine learning compilation (MLC). Our
solution builds on the shoulders of the open-source ecosystem,
including Hugging Face, model variants from LLaMA and Vicuna, wasm,
and WebGPU. The main flow builds on Apache TVM Unity, an exciting
ongoing development in the Apache TVM community:

  * We bake a language model's IRModule in TVM with native dynamic
    shape support, avoiding the need for padding to the maximum
    length and reducing both the amount of computation and memory
    usage.
  * Each function in TVM's IRModule can be further transformed to
    generate runnable code that can be deployed universally on any
    environment supported by the minimal TVM runtime (JavaScript
    being one of them).
  * TensorIR is the key technique used to generate optimized
    programs. We provide productive solutions by quickly transforming
    TensorIR programs based on a combination of expert knowledge and
    an automated scheduler.
  * Heuristics are used when optimizing light-weight operators in
    order to reduce the engineering pressure.
  * We utilize int4 quantization techniques to compress the model
    weights so that they can fit into memory.
  * We build static memory planning optimizations to reuse memory
    across multiple layers.
  * We use Emscripten and TypeScript to build a TVM web runtime that
    can deploy generated modules.
  * We also leverage a wasm port of the SentencePiece tokenizer.
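As an illustration of the int4 quantization step listed above, here is a minimal sketch that stores two 4-bit codes per byte with one scale per group of weights. The symmetric scheme and the function names `quantize_int4` and `dequantize_int4` are assumptions made for this example; the project's actual quantization may differ.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[bytes, float]:
    """Symmetric int4 quantization of one group of weights.

    Returns the packed bytes (two 4-bit codes per byte) and the scale.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Map each weight to an integer in [-7, 7], then shift it to [1, 15]
    # so that it fits in an unsigned 4-bit nibble.
    codes = [max(-7, min(7, round(w / scale))) + 8 for w in weights]
    if len(codes) % 2:
        codes.append(8)  # pad with the code for 0.0
    packed = bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))
    return packed, scale


def dequantize_int4(packed: bytes, scale: float, count: int) -> List[float]:
    """Unpack the nibbles and rescale them to approximate the original weights."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return [(c - 8) * scale for c in codes[:count]]


weights = [0.7, -0.35, 0.0, 0.1]
packed, scale = quantize_int4(weights)
restored = dequantize_int4(packed, scale, len(weights))
```

Each weight now costs half a byte plus a share of the per-group scale, and the reconstruction error per weight is at most half the scale.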

web-llm

All parts of this workflow are done in Python, with the exception, of
course, of the last part, which builds a 600-line JavaScript app that
connects things together. This is also a fun process of interactive
development when bringing in new models.

All of this is made possible by the open-source ecosystem that we
leverage. Specifically, we make heavy use of TVM Unity, an exciting
recent development in the TVM project that enables a Python-first,
interactive MLC development experience. It allows us to easily
compose new optimizations, all in Python, and incrementally bring our
app to the web.

TVM Unity also provides an easy way to compose new solutions in the
ecosystem. We will continue to bring further optimizations such as
fused quantization kernels, and bring them to more platforms.

One key characteristic of LLMs is the dynamic nature of the model. As
the decoding and encoding process depends on computations that grow
with the number of tokens, we leverage the first-class dynamic shape
support in TVM Unity, which represents sequence dimensions through
symbolic integers. This allows us to plan ahead and statically
allocate all the memory needed for the sequence window of interest
without padding.
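The idea of allocating once for the sequence window while computing only over the actual sequence length can be sketched in plain Python. This is a conceptual illustration, not TVM code: the `KVCache` class and its methods are hypothetical names, and `length` stands in for the symbolic sequence dimension.

```python
from typing import List


class KVCache:
    """Key cache allocated once for a fixed window, used with a dynamic length."""

    def __init__(self, max_window: int, head_dim: int) -> None:
        self.max_window = max_window
        self.head_dim = head_dim
        # One up-front allocation sized for the largest sequence we plan for.
        self.keys: List[List[float]] = [[0.0] * head_dim for _ in range(max_window)]
        self.length = 0  # the sequence length n, known only at run time

    def append(self, key: List[float]) -> None:
        if len(key) != self.head_dim:
            raise ValueError("key has the wrong dimension")
        if self.length == self.max_window:
            raise ValueError("sequence window is full")
        self.keys[self.length][:] = key  # copy in place, no new buffer
        self.length += 1

    def scores(self, query: List[float]) -> List[float]:
        # Only the first n rows participate, so the work scales with n
        # rather than with max_window: no padded positions are computed.
        return [
            sum(q * k for q, k in zip(query, self.keys[i]))
            for i in range(self.length)
        ]


cache = KVCache(max_window=8, head_dim=2)
cache.append([1.0, 0.0])
cache.append([0.0, 2.0])
result = cache.scores([1.0, 1.0])
```

The buffer holds eight rows throughout, yet the dot products touch only the two rows that have been filled.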

We also leveraged the integration of tensor expressions to quickly
express partial-tensor computations such as rotary embedding directly
without materializing them into full-tensor matrix computations.
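To illustrate what computing rotary embedding directly on parts of a tensor can look like, here is a minimal sketch that rotates each pair of elements in place of multiplying by a full rotation matrix. The interleaved pairing, the base of 10000, and the name `rotary_embed` are assumptions for this example and may not match the project's kernels.

```python
import math
from typing import List


def rotary_embed(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate consecutive pairs of x by position-dependent angles."""
    d = len(x)
    if d % 2:
        raise ValueError("dimension must be even")
    out = [0.0] * d
    for i in range(d // 2):
        # Each pair gets its own frequency; no d-by-d matrix is ever built.
        angle = position * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


rotated = rotary_embed([1.0, 2.0, 3.0, 4.0], position=5)
```

The loop does a constant amount of work per pair, whereas a materialized rotation matrix would need d * d entries that are mostly zero.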

 Comparison to Native GPU Runtime, Limitations and Opportunities

Besides the WebGPU runtime, we also provide options for native
deployment with a local GPU runtime. These can be used both as a tool
to deploy on native environments and as a reference point to compare
native GPU driver performance with WebGPU.

WebGPU works by translating WGSL shaders to native shaders. We
observed that there are opportunities to reach zero gap between the
WebGPU runtime and the native environment.

Some of the current gaps arise because Chrome's WebGPU implementation
inserts bounds clips for all array index accesses, such that a[i]
becomes a[min(i, a.size)]. This can be optimized out as WebGPU
support continues to mature.
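The effect of the bounds clip can be emulated in a few lines of Python. This is only a sketch of the idea: the function name `robust_load` is hypothetical, and it clamps to the last valid index (size minus one) for a non-empty array, which is an assumption about the intended behavior, not a description of Chrome's generated shader code.

```python
from typing import Sequence


def robust_load(a: Sequence[float], i: int) -> float:
    """Emulate a bounds-clamped read: out-of-range indices are clamped into range."""
    # Every access pays for an extra min/max, even when i is already valid.
    # Assumes a is non-empty.
    return a[max(0, min(i, len(a) - 1))]


data = [10.0, 20.0, 30.0]
in_range = robust_load(data, 1)
out_of_range = robust_load(data, 99)
```

An in-range read returns the same element as a plain index, while an out-of-range read returns the last element and never touches memory outside the array.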

You can get around this by launching Chrome with a special flag
(thanks to the Dawn developers for providing the pointers). Exit
Chrome completely, then on the command line, type

/path/to/Chrome --enable-dawn-features=disable_robustness

Then you will find that the execution speed is as fast as in the
native GPU environment. We anticipate this problem will be resolved
as WebGPU matures. WebGPU just shipped, and we are excited to see the
opportunities it can unlock. There are also a lot of exciting
upcoming features, such as fp16 extensions, that we can leverage to
further improve things.

 Links

  * Demo page
  * You might also be interested in Web Stable Diffusion.

 Acknowledgement

This project is made possible thanks to collaboration with

CMU School of Computer Science, Catalyst, MLC, OctoML, UW, SJTU

This project is only possible thanks to the open-source ecosystems
whose shoulders we stand on. We want to thank the Apache TVM
community and the developers of the TVM Unity effort. We thank the
open-source ML community members who made these models publicly
available, and the PyTorch and Hugging Face communities that make
these models accessible. We would like to thank the teams behind
Vicuna, SentencePiece, LLaMA, and Alpaca. We also would like to thank
the WebAssembly, Emscripten, and WebGPU communities. Finally, thanks
to the Dawn and WebGPU developers.

Topics

deep-learning language-model webgpu tvm webml llm chatgpt

Contributors 4

  * @tqchen tqchen Tianqi Chen
  * @jinhongyii jinhongyii Hongyi Jin
  * @MasterJH5574 MasterJH5574 Ruihang Lai
  * @spectrometerHBH spectrometerHBH Bohan Hou

Languages

  * Python 91.0%
  * JavaScript 6.3%
  * Shell 1.1%
  * Other 1.6%
