# LLaMA-Omni: Seamless Speech Interaction with Large Language Models

**Authors:** Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng*

**Paper:** https://arxiv.org/abs/2409.06666

LLaMA-Omni is a speech-language model built upon Llama-3.1-8B-Instruct. It supports low-latency, high-quality speech interaction, simultaneously generating both text and speech responses from speech instructions.

## Highlights

* Built on Llama-3.1-8B-Instruct, ensuring high-quality responses.
* Low-latency speech interaction, with latency as low as 226 ms.
* Simultaneous generation of both text and speech responses.
* Trained in less than 3 days using just 4 GPUs.

Demo video: demo.mp4

## Install

1. Clone this repository.

```shell
git clone https://github.com/ictnlp/LLaMA-Omni
cd LLaMA-Omni
```

2. Install packages.
```shell
conda create -n llama-omni python=3.10
conda activate llama-omni
pip install pip==24.0
pip install -e .
```

3. Install fairseq.

```shell
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install -e . --no-build-isolation
```

4. Install flash-attention.

```shell
pip install flash-attn --no-build-isolation
```

## Quick Start

1. Download the Llama-3.1-8B-Omni model from Huggingface.

2. Download the Whisper-large-v3 model.

```python
import whisper
model = whisper.load_model("large-v3", download_root="models/speech_encoder/")
```

3. Download the unit-based HiFi-GAN vocoder.

```shell
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000 -P vocoder/
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json -P vocoder/
```

## Gradio Demo

1. Launch a controller.

```shell
python -m omni_speech.serve.controller --host 0.0.0.0 --port 10000
```

2. Launch a Gradio web server.

```shell
python -m omni_speech.serve.gradio_web_server --controller http://localhost:10000 --port 8000 --model-list-mode reload --vocoder vocoder/g_00500000 --vocoder-cfg vocoder/config.json
```

3. Launch a model worker.

```shell
python -m omni_speech.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path Llama-3.1-8B-Omni --model-name Llama-3.1-8B-Omni --s2s
```

4. Visit http://localhost:8000/ and interact with LLaMA-3.1-8B-Omni!

**Note:** Because streaming audio playback in Gradio is unstable, we have only implemented streaming audio synthesis without enabling autoplay. If you have a good solution, feel free to submit a PR. Thanks!

## Local Inference

To run inference locally, organize the speech instruction files according to the format in the `omni_speech/infer/examples` directory, then refer to the following script.

```shell
bash omni_speech/infer/run.sh omni_speech/infer/examples
```

## License

Our code is released under the Apache-2.0 License.
Since our model is built on Llama 3.1, it must also comply with the Llama 3.1 License.

## Acknowledgements

* LLaVA: the codebase we built upon.
* SLAM-LLM: we borrowed some code for the speech encoder and speech adaptor.

## Citation

If you have any questions, please feel free to submit an issue or contact fangqingkai21b@ict.ac.cn.

If our work is useful for you, please cite as:

```
@article{fang-etal-2024-llama-omni,
  title={LLaMA-Omni: Seamless Speech Interaction with Large Language Models},
  author={Fang, Qingkai and Guo, Shoutao and Zhou, Yan and Ma, Zhengrui and Zhang, Shaolei and Feng, Yang},
  journal={arXiv preprint arXiv:2409.06666},
  year={2024}
}
```
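If you want a quick smoke test of the audio front end and don't have a speech recording handy, a plain 16 kHz mono WAV is sufficient, since the Whisper speech encoder operates on 16 kHz mono input. Below is a stdlib-only sketch that synthesizes such a file; the output file name and the sine-tone content are arbitrary placeholders, not part of the LLaMA-Omni pipeline itself.

```python
import math
import struct
import wave

# Whisper expects 16 kHz mono audio, so we write a 1-second,
# 16-bit PCM WAV at that sample rate.
SAMPLE_RATE = 16000   # Hz
DURATION_S = 1.0      # seconds
FREQ_HZ = 440.0       # placeholder tone (A4)

n_samples = int(SAMPLE_RATE * DURATION_S)
samples = [
    int(32767 * 0.5 * math.sin(2 * math.pi * FREQ_HZ * i / SAMPLE_RATE))
    for i in range(n_samples)
]

with wave.open("test_input.wav", "wb") as wav:
    wav.setnchannels(1)          # mono
    wav.setsampwidth(2)          # 16-bit PCM
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(struct.pack(f"<{n_samples}h", *samples))
```

A real speech instruction will of course produce far more meaningful responses, but a file like this is enough to verify that the audio loading and encoding path runs end to end.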