https://github.com/kyutai-labs/moshi
Moshi: a speech-text foundation model for real-time dialogue

[Read the paper] [Demo] [Hugging Face]

Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec. Mimi compresses 24 kHz audio down to a 12.5 Hz representation with a bandwidth of 1.1 kbps, in a fully streaming manner (latency of 80 ms, the frame size), yet performs better than existing non-streaming codecs such as SpeechTokenizer (50 Hz, 4 kbps) or SemantiCodec (50 Hz, 1.3 kbps).

Moshi models two streams of audio: one corresponds to Moshi, and the other one to the user.
At inference, the stream from the user is taken from the audio input, and the one for Moshi is sampled from the model's output. Alongside these two audio streams, Moshi predicts text tokens corresponding to its own speech, its inner monologue, which greatly improves the quality of its generation. A small Depth Transformer models inter-codebook dependencies for a given time step, while a large, 7B-parameter Temporal Transformer models the temporal dependencies. Moshi achieves a theoretical latency of 160 ms (80 ms for the frame size of Mimi + 80 ms of acoustic delay), with a practical overall latency as low as 200 ms on an L4 GPU.

Talk to Moshi now on our live demo.

Figure: Schema representing the structure of Moshi.

Mimi builds on previous neural audio codecs such as SoundStream and EnCodec, adding a Transformer in both the encoder and the decoder, and adapting the strides to match an overall frame rate of 12.5 Hz. This brings Mimi closer to the average frame rate of text tokens (~3-4 Hz) and limits the number of autoregressive steps in Moshi. Similarly to SpeechTokenizer, Mimi uses a distillation loss so that the first codebook tokens match a self-supervised representation from WavLM, which allows modeling semantic and acoustic information with a single model. Interestingly, while Mimi is fully causal and streaming, it learns to match the non-causal representation from WavLM sufficiently well, without introducing any delays.
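
The frame-size, latency, and bandwidth figures quoted above follow from simple arithmetic. The sketch below reproduces them in Python. The sample rate, frame rate, and acoustic delay are the values stated in this README; the codebook count (8) and codebook size (2048) used for the bitrate check are assumptions that are not stated in this text.

```python
# Values stated in the README.
SAMPLE_RATE_HZ = 24_000     # Mimi input audio sample rate
FRAME_RATE_HZ = 12.5        # Mimi latent frame rate
ACOUSTIC_DELAY_MS = 80.0    # extra acoustic delay quoted for Moshi

# Assumptions (NOT stated in this README): 8 codebooks of 2048 entries each.
NUM_CODEBOOKS = 8
CODEBOOK_SIZE = 2048

# One latent frame covers this many input samples and this much time.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ
frame_ms = 1000.0 / FRAME_RATE_HZ

# Theoretical latency = one frame of audio + the acoustic delay.
theoretical_latency_ms = frame_ms + ACOUSTIC_DELAY_MS

# Bits needed to index one codebook entry, then bits per second overall.
bits_per_code = (CODEBOOK_SIZE - 1).bit_length()
bitrate_bps = NUM_CODEBOOKS * bits_per_code * FRAME_RATE_HZ

print(f"samples per frame:   {samples_per_frame:.0f}")
print(f"frame duration:      {frame_ms:.0f} ms")
print(f"theoretical latency: {theoretical_latency_ms:.0f} ms")
print(f"bitrate:             {bitrate_bps / 1000:.1f} kbps")
```

Under these assumptions the computed bitrate matches the 1.1 kbps figure, and the frame duration matches the 80 ms streaming latency of Mimi.
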
Finally, similarly to EBEN, Mimi uses only an adversarial training loss, along with feature matching, showing strong improvements in terms of subjective quality despite its low bitrate.

Figure: Schema representing the structure of Mimi, our proposed neural codec. Mimi contains a Transformer in both its encoder and decoder, and achieves a frame rate closer to that of text tokens. This allows us to reduce the number of autoregressive steps taken by Moshi, thus reducing the latency of the model.

Organisation of the repository

There are three separate versions of the Moshi inference stack in this repo.

* The Python version using PyTorch is in the moshi/ directory.
* The Python version using MLX for M series Macs is in the moshi_mlx/ directory.
* The Rust version used in production is in the rust/ directory. This contains in particular a Mimi implementation in Rust, with Python bindings available as rustymimi.

Finally, the code for the live demo is provided in the client/ directory.

Models

We release three models:

* our speech codec Mimi,
* Moshi fine-tuned on a male synthetic voice (Moshiko),
* Moshi fine-tuned on a female synthetic voice (Moshika).

Depending on the backend, the available file formats and quantizations will vary. Here is the list of Hugging Face repositories for each model. Mimi is bundled in each of them, and always uses the same checkpoint format.

* Moshika for PyTorch (bf16): kyutai/moshika-pytorch-bf16.
* Moshiko for PyTorch (bf16): kyutai/moshiko-pytorch-bf16.
* Moshika for MLX (int4, int8, bf16): kyutai/moshika-mlx-q4, kyutai/moshika-mlx-q8, kyutai/moshika-mlx-bf16.
* Moshiko for MLX (int4, int8, bf16): kyutai/moshiko-mlx-q4, kyutai/moshiko-mlx-q8, kyutai/moshiko-mlx-bf16.
* Moshika for Rust/Candle (int8, bf16): kyutai/moshika-candle-q8, kyutai/moshika-mlx-bf16.
* Moshiko for Rust/Candle (int8, bf16): kyutai/moshiko-candle-q8, kyutai/moshiko-mlx-bf16.

All models are released under the CC-BY 4.0 license.
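
Since the available quantizations differ per backend, it can help to encode the list above as a lookup table. The following is a minimal sketch: the repository names are copied from the list, while the `HF_REPOS` table, the `hf_repo` helper, and the short backend and quantization labels are hypothetical names introduced for illustration and are not part of the Moshi packages.

```python
# Repository name templates copied from the model list in this README.
# The table layout and the helper below are hypothetical (illustration only).
HF_REPOS = {
    ("pytorch", "bf16"): "kyutai/{voice}-pytorch-bf16",
    ("mlx", "q4"): "kyutai/{voice}-mlx-q4",
    ("mlx", "q8"): "kyutai/{voice}-mlx-q8",
    ("mlx", "bf16"): "kyutai/{voice}-mlx-bf16",
    ("candle", "q8"): "kyutai/{voice}-candle-q8",
    # The README lists the MLX bf16 repository for the Rust/Candle backend.
    ("candle", "bf16"): "kyutai/{voice}-mlx-bf16",
}
VOICES = ("moshika", "moshiko")


def hf_repo(voice: str, backend: str, quant: str) -> str:
    """Return the listed repository for a voice/backend/quantization combination."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice {voice!r}, expected one of {VOICES}")
    try:
        template = HF_REPOS[(backend, quant)]
    except KeyError:
        raise ValueError(f"no listed repository for {backend} with {quant}") from None
    return template.format(voice=voice)


print(hf_repo("moshika", "mlx", "q4"))
print(hf_repo("moshiko", "pytorch", "bf16"))
```

Requesting a combination that is not listed, such as a q8 model for PyTorch, raises a `ValueError` instead of producing a repository name that may not exist.
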

Requirements

You will need at least Python 3.10, with 3.12 recommended. For specific requirements, please check the individual backend directories. You can install the PyTorch and MLX clients with the following:

    pip install moshi      # moshi PyTorch, from PyPI
    pip install moshi_mlx  # moshi MLX, from PyPI, best with Python 3.12.
    # Or the bleeding edge versions for Moshi and Moshi-MLX.
    pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi&subdirectory=moshi"
    pip install -e "git+https://git@github.com/kyutai-labs/moshi.git#egg=moshi_mlx&subdirectory=moshi_mlx"

    pip install rustymimi  # mimi, rust implementation with Python bindings from PyPI

If you are not using Python 3.12, you might get an error when installing moshi_mlx or rustymimi (which moshi_mlx depends on). In that case, you will need to install the Rust toolchain, or switch to Python 3.12.

While we hope that the present codebase will work on Windows, we do not provide official support for it. We have tested the MLX version on a MacBook Pro M3. At the moment, we do not support quantization for the PyTorch version, so you will need a GPU with a significant amount of memory (24 GB).

To use the Rust backend, you will need a recent version of the Rust toolchain. To compile GPU support, you will also need CUDA properly installed for your GPU, in particular with nvcc.

Python (PyTorch)

The PyTorch-based API can be found in the moshi directory. It provides a streaming version of the audio tokenizer (mimi) and the language model (moshi).

To run in interactive mode, you need to start a server which will run the model. You can then use either the web UI or a command line client.

Start the server with:

    python -m moshi.server [--gradio-tunnel] [--hf-repo kyutai/moshika-pytorch-bf16]

Then access the web UI on localhost:8998. If your GPU is on a remote machine with no direct access, --gradio-tunnel will create a tunnel with a URL accessible from anywhere.
Keep in mind that this tunnel goes through the US and can add significant latency (up to 500 ms from Europe). You can use --gradio-tunnel-token to set a fixed secret token and reuse the same address over time. Alternatively, you might want to use SSH to redirect your connection.

You can use --hf-repo to select a different pretrained model, by setting the proper Hugging Face repository.

Accessing a server that is not localhost via http may cause issues with using the microphone in the web UI (in some browsers this is only allowed using https).

A local client is also available, as

    python -m moshi.client [--url URL_TO_GRADIO]

Note however that, unlike the web browser, this client is bare-bones: it does not perform any echo cancellation, nor does it try to compensate for a growing lag by skipping frames.

For more information, in particular on how to use the API directly, please check out moshi/README.md.

Python (MLX) for local inference on macOS

Once you have installed moshi_mlx, you can run

    python -m moshi_mlx.local -q 4   # weights quantized to 4 bits
    python -m moshi_mlx.local -q 8   # weights quantized to 8 bits
    # And using a different pretrained model:
    python -m moshi_mlx.local -q 4 --hf-repo kyutai/moshika-mlx-q4
    python -m moshi_mlx.local -q 8 --hf-repo kyutai/moshika-mlx-q8
    # be careful to always match the `-q` and `--hf-repo` flag.

This command line interface is also bare-bones. It does not perform any echo cancellation, nor does it try to compensate for a growing lag by skipping frames.

Alternatively, you can run python -m moshi_mlx.local_web to use the web UI. The connection is via http and will be at localhost:8998.

Rust

To run the Rust inference server, use the following command from within the rust directory:

    cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config.json standalone

When using macOS, you can replace --features cuda with --features metal.
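
If the server is launched from a script, the feature flag can be picked per platform. The sketch below only assembles the command line shown above; the `backend_command` helper is a hypothetical name introduced here, and the choice of `cuda` for every non-macOS system is an assumption that mirrors the two options mentioned in this README.

```python
import platform
from typing import Optional


def backend_command(system: Optional[str] = None,
                    config: str = "moshi-backend/config.json") -> list[str]:
    """Build the cargo argument list for the Rust server (hypothetical helper).

    `system` defaults to platform.system(), which returns "Darwin" on macOS.
    """
    system = system or platform.system()
    # Assumption: Metal on macOS, CUDA everywhere else.
    feature = "metal" if system == "Darwin" else "cuda"
    return [
        "cargo", "run", "--features", feature,
        "--bin", "moshi-backend", "-r", "--",
        "--config", config, "standalone",
    ]


print(" ".join(backend_command()))
```

The resulting list can be passed to `subprocess.run(..., cwd="rust", check=True)` so that the command runs from within the rust directory.
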
Alternatively, you can use config-q8.json rather than config.json to use the quantized q8 model. You can select a different pretrained model, e.g. Moshika, by changing the "hf_repo" key in either file.

Once the server has printed 'standalone worker listening', you can use the web UI. By default the Rust server uses https, so it will be at localhost:8998.

You will get warnings about the site being unsafe. When using Chrome, you can bypass these by selecting "Details" or "Advanced", then "Visit this unsafe site" or "Proceed to localhost (unsafe)".

Clients

We recommend using the web UI, as it provides additional echo cancellation that helps the overall model quality. Note that most commands will directly serve this UI at the provided URL, and there is in general nothing more to do. Alternatively, we provide command line interfaces for the Rust and Python versions. The protocol is the same as with the web UI, so there is nothing to change on the server side.

For reference, here is the list of clients for Moshi.

Rust Command Line

From within the rust directory, run the following:

    cargo run --bin moshi-cli -r -- tui --host localhost

Python with PyTorch

    python -m moshi.client

WebUI

The web UI can be built from this repo via the following steps (these require npm to be installed).

    cd client
    npm install
    npm run build

The web UI can then be found in the client/dist directory.

Development

If you wish to install from a clone of this repository, for instance to further develop Moshi, you can do the following:

    # From the root of the clone of the repo
    pip install -e 'moshi[dev]'
    pip install -e 'moshi_mlx[dev]'
    pre-commit install

If you wish to build rustymimi locally (assuming you have Rust properly installed):

    pip install maturin
    maturin dev -r -m rust/mimi-pyo3/Cargo.toml

FAQ

Check out the Frequently Asked Questions section before opening an issue.

License

The present code is provided under the MIT license for the Python parts, and the Apache license for the Rust backend.
The web client code is provided under the MIT license. Note that parts of this code are based on AudioCraft, released under the MIT license.

The weights for the models are released under the CC-BY 4.0 license.

Citation

If you use either Mimi or Moshi, please cite the following paper:

    @techreport{kyutai2024moshi,
        author = {Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and Am\'elie Royer and
                  Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
        title = {Moshi: a speech-text foundation model for real-time dialogue},
        institution = {Kyutai},
        year={2024},
        month={September},
        url={http://kyutai.org/Moshi.pdf},
    }