https://research.myshell.ai/jetmoe

JetMoE: Reaching LLaMA2 Performance with 0.1M Dollars

myshell-x-mitperformance

Key Messages

 1. JetMoE-8B is trained with less than $ 0.1 million^1 cost but
    outperforms LLaMA2-7B from Meta AI, who has multi-billion-dollar
    training resources. LLM training can be much cheaper than people
    generally thought.
 2. JetMoE-8B is very open and academia-friendly because:
     1. It only uses public datasets for training, and the code is
        open-sourced. No proprietary resource is needed.
     2. It can be finetuned with very limited compute budget (e.g.,
        consumer-grade GPU) that most labs can afford.
 3. JetMoE-8B only has 2.2B active parameters during inference, which
    drastically lowers the computational cost. Compared to a model
    with similar inference computation, like Gemma-2B, JetMoE-8B
    achieves constantly better performance.

^1 We used a 96xH100 GPU cluster for 2 weeks, which cost ~$0.08
million.

  * Github: https://github.com/myshell-ai/JetMoE
  * HuggingFace: https://huggingface.co/jetmoe/jetmoe-8b
  * Chat Demo on Lepton AI: https://www.lepton.ai/playground/chat?
    model=jetmoe-8b-chat

Authors

The project is contributed by Yikang Shen, Zhen Guo, Tianle Cai and
Zengyi Qin. For technical inquiries, please contact Yikang Shen. For
media and collaboration inquiries, please contact Zengyi Qin.

Collaboration

If you have great ideas but need more resources (GPU, data, funding,
etc.), welcome to contact Zengyi Qin. We are open to collaborations
and are actively supporting high-quality open-source projects.

Benchmarks

We use the same evaluation methodology as in the Open LLM
leaderboard. For MBPP code benchmark, we use the same evaluation
methodology as in the LLaMA2 and Deepseek-MoE paper. The results are
shown below:

                Active Training       Open LLM                                             GSM
     Model      Params  Tokens  MBPP Leaderboard ARC  Hellaswag MMLU TruthfulQA WinoGrande  8K
                                       Average
Gemma-2B        2B     2T       28.0 46.4        48.4 71.8      41.8 33.1       66.3       16.9
DeepseekMoE-16B 2.8B   2T       34.0 51.1        53.2 79.8      46.3 36.1       73.7       17.3
LLaMA2-7B       7B     2T       20.8 51.0        53.1 78.6      46.9 38.8       74.0       14.5
LLaMA-13B       13B    1T       22.0 51.4        56.2 80.9      47.7 39.5       76.2       7.6
JetMoE-8B       2.2B   1.25T    34.2 53.0        48.7 80.5      49.2 41.7       70.2       27.8

     Model       MT-Bench Score
GPT-4            9.014
GPT-3.5-turbo    7.995
Claude-v1        7.923
JetMoE-8B-chat   6.681
Llama-2-13b-chat 6.650
Vicuna-13b-v1.3  6.413
Wizardlm-13b     6.353
Llama-2-7b-chat  6.269

To our surprise, despite the lower training cost and computation,
JetMoE-8B performs even better than LLaMA2-7B, LLaMA-13B, and
DeepseekMoE-16B. Compared to a model with similar training and
inference computation, like Gemma-2B, JetMoE-8B achieves better
performance.

Model Details

JetMoE uses a sparsely activated architecture inspired by
ModuleFormer. JetMoE-8B has 24 blocks. Each block has two MoE layers:
Mixture of Attention heads (MoA) and Mixture of MLP Experts (MoE).
Each MoA and MoE layer has 8 expert, and 2 experts are activated for
each input token. It has 8 billion parameters in total and 2.2B
active parameters. JetMoE-8B is trained on 1.25T tokens from publicly
available datasets, with a learning rate of 5.0 x 10-4 and a global
batch-size of 4M tokens.

architecture

Training Details

Our training recipe follows the MiniCPM's two-phases training method.
Phase 1 uses a constant learning rate with linear warmup and is
trained on 1 trillion tokens from large-scale open-source pretraining
datasets, including RefinedWeb, Pile, Github data, etc. Phase 2 uses
exponential learning rate decay and is trained on 250 billion tokens
from phase 1 datasets and extra high-quality open-source datasets. We
used a 96xH100 GPU cluster for 2 weeks to train the model.

phase1-dataphase2-data

Technical Report

For more technical details, please refer to the JetMoE Technical
Report (Coming Soon).

Acknowledgement

We express our gratitude to Shengding Hu for his valuable advice on
the Phase 2 data mixture. We also express our gratitude to Exabits
for their assistance in setting up the GPU clusters, and to Lepton AI
for their support in setting up the chat demo.