# Tencent-Hunyuan-Large

中文 | English

Hugging Face | Official Website | HunyuanAPI | Technical Report | Demo | Tencent Cloud TI

## Model Introduction

With the rapid development of artificial intelligence technology, large language models (LLMs) have made significant progress in fields such as natural language processing, computer vision, and scientific tasks. However, as these models scale up, optimizing resource consumption while maintaining high performance has become a key challenge.
To address this challenge, we have explored Mixture of Experts (MoE) models. The newly unveiled Hunyuan-Large (Hunyuan-MoE-A52B) is currently the largest open-source Transformer-based MoE model in the industry, featuring 389 billion total parameters and 52 billion active parameters. By open-sourcing Hunyuan-Large and sharing the related technical details, we hope to inspire more researchers with innovative ideas and collectively advance the progress and application of AI technology. We welcome you to join our open-source community to explore and optimize future AI models together!

## Technical Advantages

### Model

* **High-Quality Synthetic Data**: By enhancing training with synthetic data, Hunyuan-Large learns richer representations, handles long-context inputs, and generalizes better to unseen data.
* **KV Cache Compression**: Grouped Query Attention (GQA) and Cross-Layer Attention (CLA) strategies significantly reduce the memory usage and computational overhead of KV caches, improving inference throughput.
* **Expert-Specific Learning Rate Scaling**: Different learning rates are set for different experts to ensure each sub-model effectively learns from the data and contributes to overall performance.
* **Long-Context Processing Capability**: The pre-trained model supports text sequences of up to 256K tokens, and the Instruct model supports up to 128K, significantly enhancing the ability to handle long-context tasks.
* **Extensive Benchmarking**: Extensive experiments across various languages and tasks validate the practical effectiveness and safety of Hunyuan-Large.

### Inference Framework

This open-source release offers two inference backends tailored for the Hunyuan-Large model: the popular vLLM backend and the TRT-LLM backend. Both include performance optimizations. For instance, the new CLA structure significantly reduces GPU memory usage, saving 50% of the KV cache and ensuring efficient handling of long-text scenarios. Additionally, FP8 quantization achieves a 50% reduction in memory usage compared to traditional FP16/BF16, while maintaining precision and delivering a 70% increase in throughput. By leveraging the efficient operators at its core, the TRT-LLM solution outperforms vLLM by over 30% and is widely used in Tencent's Hunyuan project. In this release we are initially open-sourcing the vLLM solution, with plans to release the TRT-LLM solution in the near future.

### Training Framework

The Hunyuan-Large open-source model is fully compatible with the Hugging Face format, enabling researchers and developers to fine-tune it using the hf-deepspeed framework. We also support training acceleration through flash attention. To further ease adoption, the corresponding training scripts and model implementations are publicly available in this release, facilitating subsequent model training and fine-tuning.
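To make the savings claimed above concrete, here is a rough back-of-envelope estimate of KV-cache size under GQA, CLA, and FP8. This is an illustrative sketch only: the layer count, head counts, head dimension, and CLA sharing factor below are hypothetical placeholders, not Hunyuan-Large's actual configuration.

```python
# Back-of-envelope KV-cache sizing (illustrative numbers, not the real config).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem, cla_share=1):
    # K and V caches each hold: layers/cla_share x kv_heads x head_dim x seq_len x batch
    effective_layers = layers / cla_share  # CLA shares one KV cache across adjacent layers
    return 2 * effective_layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical MHA baseline in BF16 (2 bytes/element).
baseline = kv_cache_bytes(layers=64, kv_heads=64, head_dim=128,
                          seq_len=8192, batch=1, bytes_per_elem=2)
# GQA shrinks kv_heads, CLA (cla_share=2) halves the per-layer caches,
# and FP8 halves bytes/element.
optimized = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128,
                           seq_len=8192, batch=1, bytes_per_elem=1, cla_share=2)
print(f"baseline: {baseline / 2**30:.1f} GiB, GQA+CLA+FP8: {optimized / 2**30:.2f} GiB")
```

Under these placeholder numbers the combined techniques cut the cache by roughly 32x, which is the kind of headroom that makes very long contexts and larger batch sizes feasible.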
## Related News

* 2024.11.5 The TI Platform has already integrated the Hunyuan-Large model; you can train and deploy it in just a few steps. Visit Chat with Hunyuan-Large to experience real-time conversations with the model, and explore Hunyuan-Large Best Practice on TI to create your own customized Hunyuan-Large model.
* 2024.11.5 We have open-sourced Hunyuan-A52B-Pretrain, Hunyuan-A52B-Instruct, and Hunyuan-A52B-Instruct-FP8 on Hugging Face. We also released a technical report and a training and inference operations manual, providing detailed information on the model's capabilities and the procedures for training and inference.

## Benchmark Evaluation

The Hunyuan-Large pre-trained model achieves the best overall performance compared to both Dense and MoE competitors with similar activated parameter sizes. On aggregated benchmarks such as MMLU, MMLU-Pro, and CMMLU, Hunyuan-Large consistently achieves the best results, confirming its comprehensive abilities on aggregated tasks. Hunyuan-Large also shows superior performance in commonsense understanding and reasoning, and in classical NLP tasks such as QA and reading comprehension (e.g., CommonsenseQA, PIQA, and TriviaQA). In mathematics, Hunyuan-Large outperforms all baselines on GSM8K and MATH, and also achieves the best result on the Chinese benchmark CMATH. We also observe that Hunyuan-Large achieves the overall best performance across all Chinese tasks (e.g., CMMLU, C-Eval).

| Benchmark | LLama3.1-405B | LLama3.1-70B | Mixtral-8x22B | DeepSeek-V2 | Hunyuan-Large |
|---|---|---|---|---|---|
| MMLU | 85.2 | 79.3 | 77.8 | 78.5 | 88.4 |
| MMLU-Pro | 61.6 | 53.8 | 49.5 | - | 60.2 |
| BBH | 85.9 | 81.6 | 78.9 | 78.9 | 86.3 |
| HellaSwag | - | - | 88.7 | 87.8 | 86.8 |
| CommonsenseQA | 85.8 | 84.1 | 82.4 | - | 92.9 |
| WinoGrande | 86.7 | 85.3 | 85.0 | 84.9 | 88.7 |
| PIQA | - | - | 83.6 | 83.7 | 88.3 |
| NaturalQuestions | - | - | 39.6 | 38.7 | 52.8 |
| DROP | 84.8 | 79.6 | 80.4 | 80.1 | 88.9 |
| ARC-C | 96.1 | 92.9 | 91.2 | 92.4 | 95.0 |
| TriviaQA | - | - | 82.1 | 79.9 | 89.2 |
| CMMLU | - | - | 60.0 | 84.0 | 90.2 |
| C-Eval | - | - | 59.6 | 81.7 | 91.9 |
| C3 | - | - | 71.4 | 77.4 | 82.3 |
| GSM8K | 89.0 | 83.7 | 83.7 | 79.2 | 92.8 |
| MATH | 53.8 | 41.4 | 42.5 | 43.6 | 69.8 |
| CMATH | - | - | 72.3 | 78.7 | 91.3 |
| HumanEval | 61.0 | 58.5 | 53.1 | 48.8 | 71.4 |
| MBPP | 73.4 | 68.6 | 64.2 | 66.6 | 72.6 |

Hunyuan-Large-Instruct achieves consistent improvements on most types of tasks compared to LLMs with similar numbers of activated parameters, indicating the effectiveness of our post-training. Looking at performance across benchmark categories, our instruct model achieves the best results on the MMLU and MATH datasets. Notably, on MMLU our model outperforms LLama3.1-405B by 2.6%, an improvement that is not merely marginal but indicative of Hunyuan-Large-Instruct's superior understanding and reasoning across a wide array of language understanding tasks. Its prowess is further underscored on MATH, where it surpasses LLama3.1-405B by a notable margin of 3.6%. Remarkably, this leap in accuracy is achieved with only 52 billion activated parameters, underscoring the efficiency of our model.

| Benchmark | LLama3.1-405B Inst. | LLama3.1-70B Inst. | Mixtral-8x22B Inst. | DeepSeekV2.5 Chat | Hunyuan-Large Inst. |
|---|---|---|---|---|---|
| MMLU | 87.3 | 83.6 | 77.8 | 80.4 | 89.9 |
| CMMLU | - | - | 61.0 | - | 90.4 |
| C-Eval | - | - | 60.0 | - | 88.6 |
| BBH | - | - | 78.4 | 84.3 | 89.5 |
| HellaSwag | - | - | 86.0 | 90.3 | 88.5 |
| ARC-C | 96.9 | 94.8 | 90.0 | - | 94.6 |
| GPQA_diamond | 51.1 | 46.7 | - | - | 42.4 |
| MATH | 73.8 | 68.0 | 49.8 | 74.7 | 77.4 |
| HumanEval | 89.0 | 80.5 | 75.0 | 89.0 | 90.0 |
| AlignBench | 6.0 | 5.9 | 6.2 | 8.0 | 8.3 |
| MT-Bench | 9.1 | 8.8 | 8.1 | 9.0 | 9.4 |
| IFEval strict-prompt | 86.0 | 83.6 | 71.2 | - | 85.0 |
| Arena-Hard | 69.3 | 55.7 | - | 76.2 | 81.8 |
| AlpacaEval-2.0 | 39.3 | 34.3 | 30.9 | 50.5 | 51.8 |

## Data

Hunyuan-Large provides the processes related to model training.
In this section, you can process the training data into the format used for model training.

### Training Data Format and Processing

Training data is processed into a messages format. The default system prompt for both training and inference is "You are a helpful assistant." Below are examples of single-turn and multi-turn data (user/assistant content translated from the original Chinese):

```python
# Single-turn data format
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Why is seawater salty?"},
    {"role": "assistant", "content": "Seawater is salty mainly because it contains large amounts of dissolved salts and minerals. These salts and minerals come from chemical substances in the rocks and soil on the Earth's surface, and over time they are carried into the ocean. When seawater evaporates, the water evaporates away but the salts and minerals remain, making the seawater saltier. The salinity of seawater is therefore determined by its content of salts and minerals."},
]

# Multi-turn data format
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Do you know the movie Chungking Express?"},
    {"role": "assistant", "content": "Yes, it is a film directed by Wong Kar-wai."},
    {"role": "user", "content": "Faye Wong is one of the two leads, and the film was acclaimed as soon as it was released."},
    {"role": "assistant", "content": "That is why it won Best Film at the 14th Hong Kong Film Awards."},
]

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./models", use_fast=False, trust_remote_code=True)
ids = tokenizer.apply_chat_template(messages)
```

For more usage references, see the ./models/test.py file.

## Quick Start

You can get started quickly by referring to the Quick Start Guide.

## Model Training

To simplify the training process, HunyuanLLM provides a pre-built Docker image: hunyuaninfer/hunyuan-large.

### Hardware Requirements

Tested on H20 GPUs, without enabling make_moe_param_leaf_module and using zero3+offload with a max_seq_length of 2048: full fine-tuning requires at least 32 GPUs, and LoRA fine-tuning requires at least 8 GPUs.

### Training Performance

With the minimum configuration (8 GPUs, LoRA fine-tuning), per_device_train_batch_size set to 1, and gradient_accumulation_steps set to 1, each iteration takes approximately 35 seconds.

### Launch Method

Refer to: HuggingFace Transformers Trainer

#### Single-Machine Training

In the train directory, execute:

```shell
pip install -r requirements.txt
bash train.sh
```
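The --train_data_file consumed by train.sh (documented under Key Parameters below) is a JSONL file. As a minimal, hedged sketch of data preparation, the snippet below writes messages samples in the format shown above to JSONL. The exact per-line schema (here, one {"messages": [...]} object per line) is our assumption for illustration, so verify it against the training code before use.

```python
import json

# Hypothetical JSONL schema: one {"messages": [...]} object per line.
# Check the repo's training code for the authoritative format.
samples = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is seawater salty?"},
        {"role": "assistant", "content": "Mainly because of dissolved salts and minerals."},
    ],
]

with open("train_data.jsonl", "w", encoding="utf-8") as f:
    for messages in samples:
        f.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")
```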
#### Multi-Machine Training

To train on multiple machines, follow the steps below and ensure that all machines are within the same cluster.

##### Configure Passwordless SSH Login Between Machines

The following steps use two machines as an example, with their IPs represented as ${ip1} and ${ip2}. These operations are performed inside the Docker container on each machine. First, configure passwordless SSH between the containers:

```shell
ssh-keygen                  # Generate id_rsa and id_rsa.pub for passwordless login
ssh-keygen -t rsa -A        # Generate /etc/ssh/ssh_host_rsa_key and ssh_host_ecdsa_key so sshd can start later
/usr/sbin/sshd -p 36005 -o ListenAddress=0.0.0.0   # Start sshd listening on port 36005
echo "Port 36005" > ~/.ssh/config                  # Change the SSH connection port to 36005
passwd root                 # Set the root password to avoid alerts from monitoring platforms
```

Note: 36005 is only an example. You can choose any port, as long as it is open and not occupied by another process.

Next, inside the container on each machine, execute:

```shell
cat ~/.ssh/id_rsa.pub
```

Copy the output SSH public key and paste it into the ~/.ssh/authorized_keys file, one public key per line. This must be done on every machine. Ultimately, the ~/.ssh/authorized_keys file on each machine should be identical and contain the public keys of all machines.

Note that during multi-node training, the code executed on each node must be identical. It is recommended to mount a shared network drive; if that is not possible, manually copy the dataset, scripts, and code to the same directory on all machines.

##### Start Multi-Machine Training

Once the preparation steps are complete and the dependencies are confirmed to be installed (if not, run pip install -r requirements.txt), add the following configuration at the beginning of train.sh:

```shell
export HOST_GPU_NUM=8                       # Number of GPUs on each machine
export LOCAL_IP=${ip1}                      # Current machine's IP
export NODE_IP_LIST="${ip1}:8,${ip2}:8"     # Multi-node machine IPs and GPU counts, separated by commas
export NODES=2                              # Number of machine nodes
export NODE_NUM=$((${NODES} * ${HOST_GPU_NUM}))   # Total number of GPUs
```

Note: Replace ${ip1} and ${ip2} with the actual IP addresses!

Then, on the machine with ${ip1}, execute bash train.sh in the train/ directory. Note that on the first run you may see the following output:

```
The authenticity of host '[ip]:36005 ([ip]:36005)' can't be established.
ECDSA key fingerprint is xxxxxx.
ECDSA key fingerprint is MD5:xxxxxx.
Are you sure you want to continue connecting (yes/no)?
```

Type yes to continue.

##### Key Parameters

The key parameters in the script are as follows:

* --deepspeed: Path to a DeepSpeed configuration file. The train folder provides three defaults: ds_zero2_no_offload.json, ds_zero3_no_offload.json, and ds_zero3_offload.json; the required GPU memory decreases in that order.
* --model_name_or_path: Path to the HF pre-trained model. This path must contain the modeling_hunyuan.py and configuration_hunyuan.py files; otherwise the model cannot be loaded.
* --tokenizer_name_or_path: Path to the tokenizer folder. This path must contain the tokenization_hy.py file; otherwise the tokenizer cannot be loaded.
* --train_data_file: Path to the training file, which should be a JSONL file.
* --output_dir: Output directory where logs, tensorboard files, and model weights are stored.
* --per_device_train_batch_size: Batch size per GPU.
* --gradient_accumulation_steps: Number of gradient accumulation steps. The global batch size is per_device_train_batch_size * gradient_accumulation_steps * dp_size (for example, 1 * 1 * 8 = 8 under the minimum 8-GPU LoRA configuration).
* --max_steps: Total number of training steps.
* --save_steps: Number of steps between checkpoints.
* --use_lora: Whether to train with LoRA. This also accepts --lora_rank, --lora_alpha, and --lora_dropout. LoRA is applied by default to the 'q_proj', 'k_proj', 'v_proj', and 'o_proj' parameters;
if you need to change these target modules, modify them in the code. Note: when training with LoRA, only the LoRA weights are saved, not the base model weights. If you need to merge the LoRA weights back into the base model, see the "Merging LoRA Models" section below.
* --make_moe_param_leaf_module: When using zero3 with MoE training, treat the MoE module as a leaf module, meaning its parameters are not partitioned by zero3. This option is expected to significantly increase memory usage.
* --gradient_checkpointing: Enable gradient checkpointing.
* --train_attention_params_only: Whether to train only the attention parameters.
* --learning_rate: Maximum learning rate during training.
* --min_lr: Minimum learning rate during training.
* --use_flash_attn: Enable flash-attention to accelerate training.

Notes:

* To continue training from a previously saved checkpoint instead of loading pre-trained weights, specify --resume_from_checkpoint with the path to the checkpoint, and do not specify --model_name_or_path, as that would load only the weights and not the training state.
* When continuing training from a checkpoint, loss may deviate slightly due to randomness introduced by some non-deterministic algorithms; this is normal. Refer to: HuggingFace Transformers Trainer Randomness.
* When --model_name_or_path is specified, all model-related parameters are ignored.
* Samples within a batch are padded to the longest sample in the batch, with each sample capped at max_seq_length; any excess is truncated.
* Warnings about bias weights not being loaded can be ignored, since biases are not used in Hunyuan-Large.

### What to Do If You Run Out of Memory?

Refer to: DeepSpeed Configuration

You can try modifying the DeepSpeed configuration by removing the auto attribute from the following parameters and reducing their values:

* stage3_param_persistence_threshold
* stage3_prefetch_bucket_size
* stage3_max_reuse_distance

### Merging LoRA Models

LoRA weights saved during training cannot be merged into the zero3 model on the fly, because with zero3 enabled the model weights are partitioned across data-parallel ranks. To merge LoRA weights into the base model, do the merge offline to obtain the merged weight file. Execute merge_lora_weight.sh to merge the LoRA weights with the base model weights. Its parameters are:

* --base_model_path: Directory of the base model weights
* --adapter_model_path: Directory of the LoRA weights
* --output_path: Directory in which to save the merged weights
* --save_dtype: Data format for the merged weights; available options: fp16, bf16, fp32
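As an alternative to merge_lora_weight.sh, the same offline merge can be sketched with the Hugging Face peft library. This is a generic illustration rather than the repository's own implementation: the paths are placeholders, trust_remote_code is needed because Hunyuan-Large ships custom modeling code, and merging a model of this size requires correspondingly large host memory.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model_path = "./models"           # placeholder: base model weights
adapter_model_path = "./lora_adapter"  # placeholder: LoRA weights saved by training
output_path = "./merged_model"         # placeholder: where to write merged weights

# Load the base model on CPU; a 389B-parameter model needs very large host memory.
base = AutoModelForCausalLM.from_pretrained(
    base_model_path, torch_dtype=torch.bfloat16, trust_remote_code=True)
model = PeftModel.from_pretrained(base, adapter_model_path)
merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained(output_path, safe_serialization=True)
```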
## Inference and Deployment

HunyuanLLM uses TRT-LLM and vLLM for deployment. We are open-sourcing the vLLM deployment (see "Using vLLM for Inference" below); the TRT-LLM deployment will be available in the near future.

### Using TRT-LLM for Inference

Coming soon.

### Using vLLM for Inference

#### Docker

To simplify the deployment process, HunyuanLLM provides a pre-built Docker image: hunyuaninfer/hunyuan-large. You only need to download the model files and start the Docker container with the command below to begin model inference.

```shell
docker run --name hunyuanLLM_infer -itd --privileged --user root --net=host --ipc=host --gpus=8 hunyuaninfer/hunyuan-large:infer-open-source
```

Note on Docker container privilege management: the command above starts the container in privileged mode (--privileged), which grants the container elevated privileges and increases the risk of data leakage and cluster security threats. Avoid privileged mode unless necessary; where it is required, conduct a thorough security assessment and implement appropriate security monitoring and hardening measures.

#### Configure Passwordless SSH Login Between Machines

The following steps use two machines as an example, with their IPs represented as ${ip1} and ${ip2}. These operations are performed inside the Docker container. First, run passwd on both machines to set a password, for example: Tmp123,./

Copy inference/login_ssh.py into the container and execute the following command, making sure the IPs and password are entered correctly:

```shell
python3 login_ssh.py --ips ${ip1},${ip2} --port 36000 --password=Tmp123,./
```

Note: Before starting, be sure to verify multi-machine communication using vLLM's debugging script: https://docs.vllm.ai/en/latest/getting_started/debugging.html

#### BF16 Deployment

BF16 deployment requires 16 H800 or H20 GPUs. After verifying that multi-machine communication works, set the following environment variables before running the commands:

* ${LOCAL_IP}: The IP corresponding to bond1 on the current machine
* ${MODEL_PATH}: Path to the Hunyuan LLM model

##### Step 1: Start Ray

Ray is an open-source library for parallel and distributed Python; here we use it for multi-machine communication.

Ray component configuration hardening: by default, Ray's service ports (e.g., 6379, 8265) have no authentication enabled, posing risks of unauthorized access and command execution. Deploy Ray components only in trusted internal network environments, or enforce strict access control list (ACL) policies on these ports to prevent unauthorized network access.

First, start Ray on each node (either in the background or with the terminal kept open).

On the head node:

```shell
export VLLM_HOST_IP=${LOCAL_IP}
export NCCL_SOCKET_IFNAME=bond1
export GLOO_SOCKET_IFNAME=bond1
ray start --block --head --node-ip-address=${LOCAL_IP} --port=6379
```

On all worker nodes (replace {HEAD NODE $LOCAL_IP} with the actual ${LOCAL_IP} of the head node):

```shell
export VLLM_HOST_IP=${LOCAL_IP}
export NCCL_SOCKET_IFNAME=bond1
export GLOO_SOCKET_IFNAME=bond1
ray start --block --address={HEAD NODE $LOCAL_IP}:6379 --node-ip-address=${LOCAL_IP}
```

If Ray fails to start, execute ray stop and then run the above commands again.
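Once Ray is running on every node, it can help to confirm that the cluster actually sees all machines and GPUs before launching vLLM. The check below is a generic Ray usage sketch, not a script shipped with this repository; run it on the head node.

```python
import ray

# Attach to the cluster started by `ray start` above.
ray.init(address="auto")

resources = ray.cluster_resources()
print(f"nodes: {len(ray.nodes())}, total GPUs: {resources.get('GPU', 0)}")
# For the two-machine BF16 deployment described here, expect 16 GPUs.
```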
##### Step 2: Execute Inference

Method 1: Command-line inference. Below is a code snippet demonstrating how to quickly request the chat model with vLLM.

Note on vLLM remote code execution protection: in the code below, enabling the trust-remote-code option allows vLLM to load and execute code from remote model repositories, which may lead to the execution of malicious code. Unless explicitly required by business needs, keep this option disabled to reduce potential security threats.

```python
import os
from vllm import LLM, SamplingParams

model_path = os.environ.get('MODEL_PATH')

llm = LLM(
    model=model_path,
    tokenizer=model_path,
    trust_remote_code=True,
    max_model_len=10240,
    dtype='bfloat16',
    tensor_parallel_size=16,
    pipeline_parallel_size=1,
    disable_log_stats=False,
    gpu_memory_utilization=0.98,
    disable_custom_all_reduce=True,
    # distributed_executor_backend='ray',
    enforce_eager=True,
    max_num_seqs=8,
    use_v2_block_manager=True,
    quantization=None,
)

prompts = ["Why is seawater salty?"]

sampling_params = SamplingParams(
    temperature=0.7, top_p=0.6, max_tokens=200, top_k=20, repetition_penalty=1.05)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

Method 2: Service-based inference. Below we demonstrate how to deploy the model as a service with vLLM and make requests against it.

On the head node, run:

```shell
export VLLM_HOST_IP=${LOCAL_IP}
export NCCL_SOCKET_IFNAME=bond1
export GLOO_SOCKET_IFNAME=bond1
```

Next, start the service:

```shell
cd inference
sh run_server.sh
```

Troubleshooting tip: if you encounter the following error:

```
ray.exceptions.RaySystemError: System error: No module named 'transformers_modules'
traceback: Traceback (most recent call last):
ModuleNotFoundError: No module named 'transformers_modules'
```

copy the ~/.cache/huggingface/modules/ directory from the head node to the corresponding path on all worker nodes.

After run_server.sh is running successfully, execute the request script:

```shell
sh openapi.sh
```

Be sure to modify ${LOCAL_IP} and ${MODEL_PATH} in openapi.sh to match your service.

#### Quantized Model Deployment

This section describes deploying quantized models with vLLM. The deployment image is the same as for BF16.

##### Int8 Quantized Model Deployment

To deploy the Int8 weight-only version of the Hunyuan-Large model, set the following environment variables in run_server_int8.sh:

* ${MODEL_PATH}: Path to the BF16 model
* ${LOCAL_IP}: The IP corresponding to bond1 on the current machine

Then start the Int8 service:

```shell
sh run_server_int8.sh
```

After run_server_int8.sh is running successfully, execute the request script:

```shell
sh openapi.sh
```

##### FP8 Quantized Model Deployment

To deploy the W8A8C8 (FP8) version of the Hunyuan-Large model, set the following environment variables in run_server_fp8.sh:

* ${MODEL_PATH}: Path to the FP8 model
* ${LOCAL_IP}: The IP corresponding to bond1 on the current machine

Then start the FP8 service:

```shell
sh run_server_fp8.sh
```

After run_server_fp8.sh is running successfully, execute the request script:

```shell
sh openapi.sh
```
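The request script openapi.sh drives all three services above. Assuming run_server.sh and its variants expose vLLM's OpenAI-compatible HTTP API, an equivalent request can also be issued directly from Python. The host, port (8000 here), and model name below are placeholders we are assuming for illustration; take the authoritative values from openapi.sh.

```python
import os
import requests

# Placeholder endpoint; the real host and port live in openapi.sh.
url = f"http://{os.environ.get('LOCAL_IP', '127.0.0.1')}:8000/v1/chat/completions"

payload = {
    "model": os.environ.get("MODEL_PATH", "hunyuan-large"),  # model name/path as served
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is seawater salty?"},
    ],
    "temperature": 0.7,
    "top_p": 0.6,
    "max_tokens": 200,
}

resp = requests.post(url, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```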
##### FP8 Benchmark

This section presents benchmark results for the FP8-quantized (W8A8C8) Hunyuan-Large Instruct model.

| Dataset | BF16 | W8A8C8-FP8 |
|---|---|---|
| ARC-C | 94.6 | 94.2 |
| C-Eval | 88.6 | 89.2 |
| CMMLU | 90.4 | 89.8 |
| MMLU | 89.9 | 88.9 |

#### Inference Performance

This section presents the efficiency test results of deploying the original and quantized models with vLLM, including inference speed (tokens/s) under different batch sizes.

| Inference Framework | Model | Number of GPUs (H20) | input_length | batch=1 | batch=4 |
|---|---|---|---|---|---|
| vLLM | Hunyuan-Large | 16 | 2048 | 20.2 | 75.5 |
| vLLM | Hunyuan-Large (int8 weight only) | 8 | 2048 | 19.3 | 73.6 |
| vLLM | Hunyuan-Large (W8A8C8-FP8) | 8 | 2048 | 19.8 | 74.9 |

## Tokenizer

The tokenizer used in the Hunyuan-Large model balances compression rate and effectiveness, ensuring that embeddings are sufficiently trained. The vocabulary includes 100K tokens integrated from tiktoken. Additionally, we trained an extra 29K Chinese tokens on a large amount of high-quality Chinese training data to enhance the model's Chinese capabilities and the tokenizer's compression rate. Combined, the new tokenizer improves the compression rate over the LLaMA3 tokenizer, from 2.78 characters/token to 3.13 characters/token.

## Hunyuan API

You can experience our Hunyuan-Large model on Tencent Cloud. For details, visit: https://cloud.tencent.com/document/product/1729/97730.

## Interactive Demo

The Hunyuan-Large web demo is now open. Visit https://huggingface.co/spaces/tencent/Hunyuan-Large to easily experience our model.

## Training/Inference on TI

Tencent Cloud's TI Platform is a comprehensive machine learning platform tailored for AI engineers. With the Hunyuan-Large model already integrated, you can easily train and deploy it in just a few steps. Visit Chat with Hunyuan-Large to experience real-time conversations with the model, and explore Hunyuan-Large Best Practice on TI to create your own customized Hunyuan-Large model.

## Citation

If you find our work helpful, feel free to cite us.

```
@misc{sun2024hunyuanlargeopensourcemoemodel,
      title={Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent},
      author={Xingwu Sun and Yanfeng Chen and Yiqing Huang and Ruobing Xie and Jiaqi Zhu and Kai Zhang and Shuaipeng Li and Zhen Yang and Jonny Han and Xiaobo Shu and Jiahao Bu and Zhongzhi Chen and Xuemeng Huang and Fengzong Lian and Saiyong Yang and Jianfeng Yan and Yuyuan Zeng and Xiaoqin Ren and Chao Yu and Lulu Wu and Yue Mao and Tao Yang and Suncong Zheng and Kan Wu and Dian Jiao and Jinbao Xue and Xipeng Zhang and Decheng Wu and Kai Liu and Dengpeng Wu and Guanghui Xu and Shaohua Chen and Shuang Chen and Xiao Feng and Yigeng Hong and Junqiang Zheng and Chengcheng Xu and Zongwei Li and Xiong Kuang and Jianglu Hu and Yiqi Chen and Yuchi Deng and Guiyang Li and Ao Liu and Chenchen Zhang and Shihui Hu and Zilong Zhao and Zifan Wu and Yao Ding and Weichao Wang and Han Liu and Roberts Wang and Hao Fei and Peijie She and Ze Zhao and Xun Cao and Hai Wang and Fusheng Xiang and Mengyuan Huang and Zhiyuan Xiong and Bin Hu and Xuebin Hou and Lei Jiang and Jiajia Wu and Yaping Deng and Yi Shen and Qian Wang and Weijie Liu and Jie Liu and Meng Chen and Liang Dong and Weiwen Jia and Hu Chen and Feifei Liu and Rui Yuan and Huilin Xu and Zhenxiang Yan and Tengfei Cao and Zhichao Hu and Xinhua Feng and Dong Du and Tinghao She and Yangyu Tao and Feng Zhang and Jianchen Zhu and Chengzhong Xu and Xirui Li and Chong Zha and Wen Ouyang and Yinben Xia and Xiang Li and Zekun He and Rongpeng Chen and Jiawei Song and Ruibin Chen and Fan Jiang and Chongqing Zhao and Bo Wang and Hao Gong and Rong Gan and Winston Hu and Zhanhui Kang and Yong Yang and Yuhong Liu and Di Wang and Jie Jiang},
      year={2024},
      eprint={2411.02265},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.02265},
}
```

## Contact Us

If you would like to leave a message for our R&D and product teams, you are welcome to contact our open-source team. You can also reach us via email (hunyuan_opensource@tencent.com).