# Qwen

The official repository of Qwen (Tongyi Qianwen), the chat and pretrained large language models proposed by Alibaba Cloud.
中文 | English | 日本語

Hugging Face | ModelScope | Paper | Demo | WeChat | DingTalk | Discord

|     | Qwen-Chat     | Qwen-Chat (Int4)   | Qwen     |
|-----|---------------|--------------------|----------|
| 7B  | Qwen-7B-Chat  | Qwen-7B-Chat-Int4  | Qwen-7B  |
| 14B | Qwen-14B-Chat | Qwen-14B-Chat-Int4 | Qwen-14B |

We open-source our Qwen series, now including Qwen, the base language models, namely Qwen-7B and Qwen-14B, as well as Qwen-Chat, the chat models, namely Qwen-7B-Chat and Qwen-14B-Chat. The models are listed in the table above; check their model cards on Hugging Face or ModelScope for details. We also release the technical report; please click the paper link and check it out!

In brief, we have strong base language models, which have been stably pretrained on up to 3 trillion tokens of multilingual data with wide coverage of domains and languages (with a focus on Chinese and English), and which achieve competitive performance on benchmark datasets. Additionally, we have chat models that are aligned with human preference based on SFT and RLHF (not released yet), which are able to chat, create content, extract information, summarize, translate, code, solve math problems, and so on, and which can use tools, play as agents, or even act as code interpreters.

In this repo, you can find:

* Quickstart with Qwen, and enjoy simple inference.
* Details about the quantized models, including usage, memory, and inference speed.
  For comparison, we also provide the statistics of the BF16 models.
* Tutorials on finetuning, including full-parameter tuning, LoRA, and Q-LoRA.
* Instructions on building demos, including a web UI, a CLI demo, etc.
* Information about using Qwen for tool use, agents, and code interpreters.
* Statistics of the long-context understanding evaluation.
* License agreement.
* ...

Also, if you run into problems, turn to the FAQ for help first. Still stuck? Feel free to open issues (preferably in English so that more people can understand you)! If you would like to help us, send us pull requests without hesitation! We are always excited about PRs! Want to chat with us or meet us over coffee? Welcome to our Discord or WeChat!

## News and Updates

* 2023.9.25 We release Qwen-14B and Qwen-14B-Chat on ModelScope and Hugging Face, along with qwen.cpp and Qwen-Agent. Codes and checkpoints of Qwen-7B and Qwen-7B-Chat are also updated. PLEASE PULL THE LATEST VERSION!
  * Compared to Qwen-7B (original), Qwen-7B uses more training tokens, increasing from 2.2T to 2.4T tokens, while the context length is extended from 2048 to 8192. The Chinese knowledge and coding ability of Qwen-7B have been further improved.
* 2023.9.12 We now support finetuning on the Qwen-7B models, including full-parameter finetuning, LoRA and Q-LoRA.
* 2023.8.21 We release the Int4 quantized model for Qwen-7B-Chat, Qwen-7B-Chat-Int4, which lowers memory costs and improves inference speed. Besides, there is no significant performance degradation on the benchmark evaluation.
* 2023.8.3 We release both Qwen-7B and Qwen-7B-Chat on ModelScope and Hugging Face. We also provide a technical memo for more details about the model, including training details and model performance.

## Performance

Qwen-14B and Qwen-7B (the new version trained with more tokens, with the context length extended from 2048 to 8192) outperform baseline models of similar sizes on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., which evaluate the models' capabilities on natural language understanding, mathematical problem solving, coding, etc. However, even Qwen-14B still falls significantly behind GPT-3.5, let alone GPT-4. See the results below.

| Model | MMLU (5-shot) | C-Eval (5-shot) | GSM8K (8-shot) | MATH (4-shot) | HumanEval (0-shot) | MBPP (3-shot) | BBH (3-shot) | CMMLU (5-shot) |
|---|---|---|---|---|---|---|---|---|
| LLaMA2-7B | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 | 31.8 |
| LLaMA2-13B | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 | 38.4 |
| LLaMA2-34B | 62.6 | - | 42.2 | 6.2 | 22.6 | 33.0 | 44.1 | - |
| ChatGLM2-6B | 47.9 | 51.7 | 32.4 | 6.5 | - | - | 33.7 | - |
| InternLM-7B | 51.0 | 53.4 | 31.2 | 6.3 | 10.4 | 14.0 | 37.0 | 51.8 |
| InternLM-20B | 62.1 | 58.8 | 52.6 | 7.9 | 25.6 | 35.6 | 52.5 | 59.0 |
| Baichuan2-7B | 54.7 | 56.3 | 24.6 | 5.6 | 18.3 | 24.2 | 41.6 | 57.1 |
| Baichuan2-13B | 59.5 | 59.0 | 52.8 | 10.1 | 17.1 | 30.2 | 49.0 | 62.0 |
| Qwen-7B (original) | 56.7 | 59.6 | 51.6 | 10.4 | 24.4 | 31.2 | 40.6 | 58.8 |
| Qwen-7B | 58.2 | 63.5 | 51.7 | 11.6 | 29.9 | 31.6 | 45.0 | 62.2 |
| Qwen-14B | 66.3 | 72.1 | 61.3 | 24.8 | 32.3 | 40.8 | 53.4 | 71.0 |

For all compared models, we report the best scores between their officially reported results and OpenCompass.

For more experimental results (detailed model performance on more benchmark datasets) and details, please refer to our technical report by clicking here.

## Requirements

* python 3.8 and above
* pytorch 1.12 and above, 2.0 and above are recommended
* transformers 4.32 and above
* CUDA 11.4 and above are recommended (this is for GPU users, flash-attention users, etc.)
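If you are unsure whether your environment meets these requirements, a quick version check can help before going further. Below is a minimal sketch; it only prints versions and does not install anything:

```python
# Minimal environment check against the requirements listed above (illustrative only).
import sys

import torch
import transformers

print("python        :", sys.version.split()[0])        # needs 3.8+
print("torch         :", torch.__version__)             # needs 1.12+, 2.0+ recommended
print("transformers  :", transformers.__version__)      # needs 4.32+
print("CUDA available:", torch.cuda.is_available())     # CUDA 11.4+ recommended for GPU / flash-attention users
```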
## Quickstart

Below, we provide simple examples to show how to use Qwen-Chat with ModelScope and Transformers.

Before running the code, make sure you have set up the environment and installed the required packages. Make sure you meet the above requirements, and then install the dependent libraries.

```bash
pip install -r requirements.txt
```

If your device supports fp16 or bf16, we recommend installing flash-attention for higher efficiency and lower memory usage. (flash-attention is optional and the project can run normally without it.)

```bash
git clone -b v1.0.8 https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .
# Below are optional. Installing them might be slow.
# pip install csrc/layer_norm
# pip install csrc/rotary
```

Now you can start with ModelScope or Transformers.

### Transformers

To use Qwen-Chat for inference, all you need to do is to input a few lines of code as demonstrated below. Remember to pass in the correct model names or paths, such as "Qwen/Qwen-7B-Chat" and "Qwen/Qwen-14B-Chat". However, please make sure that you are using the latest code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Model names: "Qwen/Qwen-7B-Chat", "Qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True
).eval()

# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", trust_remote_code=True)

# 1st dialogue turn
response, history = model.chat(tokenizer, "你好", history=None)  # "Hello"
print(response)
# Hello! I'm glad to be of help.

# 2nd dialogue turn
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)  # "Tell me a story about a young person who strives to start a business and finally succeeds."
print(response)
# (model output, translated from the original Chinese)
# This is a story about a young man who worked hard to start a business and finally succeeded.
# The protagonist is Li Ming, who comes from an ordinary family of ordinary workers. From a young age, Li Ming set himself a goal: to become a successful entrepreneur.
# To achieve this goal, Li Ming studied hard and was admitted to university. During university, he actively took part in various entrepreneurship competitions and won quite a few awards. He also used his spare time for internships and accumulated valuable experience.
# After graduation, Li Ming decided to start his own business. He began looking for investment, but was rejected many times. However, he did not give up. He kept working hard, continuously improving his business plan and looking for new investment opportunities.
# In the end, Li Ming secured an investment and set out on his entrepreneurial path. He founded a technology company focused on developing new software. Under his leadership, the company grew rapidly and became a successful tech enterprise.
# Li Ming's success was no accident. He was diligent, resilient, willing to take risks, and kept learning and improving himself. His success also proves that as long as you work hard, anyone can succeed.

# 3rd dialogue turn
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)  # "Give this story a title"
print(response)
# 《...》
```
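The chat models also support streaming generation, which is what cli_demo.py and web_demo.py use for token-by-token output. The sketch below assumes the chat_stream method exposed by the model's remote code and that it yields the partial response accumulated so far; check cli_demo.py in this repo for the reference usage.

```python
# Streaming chat sketch, continuing from the Transformers example above.
# Assumption: the remote code loaded with trust_remote_code=True provides model.chat_stream,
# which yields the partial response generated so far (see cli_demo.py for the actual usage).
query = "你好"  # "Hello"
printed = 0
for partial in model.chat_stream(tokenizer, query, history=None):
    print(partial[printed:], end="", flush=True)  # print only the newly generated part
    printed = len(partial)
print()
```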
Running the Qwen pretrained base model is also simple.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# Model names: "Qwen/Qwen-7B", "Qwen/Qwen-14B"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", device_map="cpu", trust_remote_code=True).eval()
# use auto mode, automatically select precision based on the device.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B",
    device_map="auto",
    trust_remote_code=True
).eval()

# Specify hyperparameters for generation. But if you use transformers>=4.32.0, there is no need to do this.
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

# Prompt: "The capital of Mongolia is Ulaanbaatar\nThe capital of Iceland is Reykjavik\nThe capital of Ethiopia is"
inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# The model continues the pattern and completes the last line with 亚的斯亚贝巴 (Addis Ababa) ...
```

### ModelScope

ModelScope is an open-source platform for Model-as-a-Service (MaaS), which provides flexible and cost-effective model service to AI developers. Similarly, you can run the models with ModelScope as shown below:

```python
from modelscope import AutoModelForCausalLM, AutoTokenizer
from modelscope import GenerationConfig

# Model names: "qwen/Qwen-7B-Chat", "qwen/Qwen-14B-Chat"
tokenizer = AutoTokenizer.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("qwen/Qwen-7B-Chat", revision='v1.0.5', device_map="auto", trust_remote_code=True, fp16=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-7B-Chat", revision='v1.0.5', trust_remote_code=True)  # you can specify different generation lengths, top_p and other related hyperparameters

response, history = model.chat(tokenizer, "你好", history=None)  # "Hello"
print(response)
response, history = model.chat(tokenizer, "浙江的省会在哪里？", history=history)  # "What is the capital of Zhejiang province?"
print(response)
response, history = model.chat(tokenizer, "它有什么好玩的景点", history=history)  # "What fun attractions does it have?"
print(response)
```
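If you want to adjust the generation behavior mentioned in the comment above (generation length, top_p, and other hyperparameters), the attributes of the loaded generation config can be modified before calling chat. A minimal sketch follows; the values are illustrative, not recommended settings.

```python
# Adjust generation hyperparameters on the model loaded above (illustrative values only).
model.generation_config.top_p = 0.8            # nucleus-sampling threshold
model.generation_config.temperature = 1.0      # sampling temperature
model.generation_config.max_new_tokens = 512   # upper bound on the number of generated tokens

response, _ = model.chat(tokenizer, "你好", history=None)
print(response)
```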
## Quantization

### Usage

We provide a solution based on AutoGPTQ and release the Int4 quantized models Qwen-7B-Chat-Int4 and Qwen-14B-Chat-Int4, which achieve nearly lossless model quality while reducing memory costs and improving inference speed. Here we demonstrate how to use the provided quantized models for inference.

Before you start, make sure you meet the requirements of auto-gptq (e.g., torch 2.0 and above, transformers 4.32.0 and above, etc.) and install the required packages:

```bash
pip install auto-gptq optimum
```

If you meet problems installing auto-gptq, we advise you to check out the official repo to find a wheel.

Then you can load the quantized model easily and run inference just as usual:

```python
# Model names: "Qwen/Qwen-7B-Chat-Int4", "Qwen/Qwen-14B-Chat-Int4"
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat-Int4",
    device_map="auto",
    trust_remote_code=True
).eval()
response, history = model.chat(tokenizer, "Hi", history=None)
```

### Performance

We illustrate the performance of both the BF16 and Int4 models on benchmarks, and we find that the quantized model does not suffer from significant performance degradation. Results are shown below:

| Quantization | MMLU | CEval (val) | GSM8K | HumanEval |
|---|---|---|---|---|
| Qwen-7B-Chat (BF16) | 53.9 | 54.2 | 41.1 | 24.4 |
| Qwen-7B-Chat (Int4) | 52.6 | 52.9 | 38.1 | 23.8 |
| Qwen-14B-Chat (BF16) | 64.6 | 69.8 | 61.0 | 43.9 |
| Qwen-14B-Chat (Int4) | 63.3 | 69.0 | 59.8 | 45.7 |

### Inference Speed

We measured the average inference speed (tokens/s) of generating 2048 and 8192 tokens under BF16 precision and Int4 quantization, respectively.

| Quantization | Speed (2048 tokens) | Speed (8192 tokens) |
|---|---|---|
| Qwen-7B-Chat (BF16) | 30.34 | 29.32 |
| Qwen-7B-Chat (Int4) | 43.56 | 33.92 |
| Qwen-14B-Chat (BF16) | 30.70 | 21.73 |
| Qwen-14B-Chat (Int4) | 37.11 | 26.11 |

In detail, the profiling setting is generating 8192 new tokens with 1 context token. The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. The inference speed is averaged over the 8192 generated tokens.

### GPU Memory Usage

We also profile the peak GPU memory usage for encoding 2048 tokens as context (and generating a single token) and for generating 8192 tokens (with a single token as context) under BF16 and Int4 quantization, respectively. The results are shown below.

| Quantization | Peak Usage for Encoding 2048 Tokens | Peak Usage for Generating 8192 Tokens |
|---|---|---|
| Qwen-7B-Chat (BF16) | 17.66GB | 22.58GB |
| Qwen-7B-Chat (Int4) | 8.21GB | 13.62GB |
| Qwen-14B-Chat (BF16) | 30.15GB | 38.94GB |
| Qwen-14B-Chat (Int4) | 13.00GB | 21.79GB |

The above speed and memory profiling are conducted using this script.

## Quantization of KV cache

The attention KV cache can be quantized and compressed for storage, to obtain a higher sample throughput.

### Usage

The parameters use_cache_quantization and use_cache_kernel control the KV-cache quantization behavior. When use_cache_quantization=True and use_cache_kernel=True, KV-cache quantization is enabled. The specific usage is as follows:

```python
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    device_map="auto",
    trust_remote_code=True,
    use_cache_quantization=True,
    use_cache_kernel=True,
    use_flash_attn=False
)
```

Attention: currently, KV-cache quantization and flash attention cannot be turned on at the same time. If you enable both (use_flash_attn=True, use_cache_quantization=True, use_cache_kernel=True), use_flash_attn is disabled by default (use_flash_attn=False).
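To build an intuition for what it means to store the cache in int8 together with a scale and a zero point (the format described in the layer-past section below), here is a small, self-contained sketch of asymmetric int8 quantization and dequantization of a tensor. It is purely illustrative and is not the repository's actual kernel.

```python
# Illustrative asymmetric int8 quantization round trip. This is NOT the repo's
# quantize_cache_v / dequantize_cache_torch kernels, only the underlying idea.
import torch

def quantize_int8(x: torch.Tensor):
    """Per-tensor asymmetric quantization: returns (q, scale, zero_point)."""
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min).clamp(min=1e-8) / 255.0
    zero_point = torch.round(-x_min / scale) - 128
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127).to(torch.int8)
    return q, scale, zero_point

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor, zero_point: torch.Tensor) -> torch.Tensor:
    """Map the int8 values back to float using the stored scale and zero point."""
    return (q.to(torch.float32) - zero_point) * scale

# Round trip on a fake value cache of shape (batch, heads, seq_len, head_dim).
v = torch.randn(1, 32, 1024, 128)
q, scale, zero_point = quantize_int8(v)
v_restored = dequantize_int8(q, scale, zero_point)
print("max reconstruction error:", (v - v_restored).abs().max().item())
print("storage ratio (int8 vs fp32):", q.element_size() / v.element_size())  # 0.25
```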
### Comparative Results

#### Results

We have verified that the use of the quantized int8-kvcache model does not suffer from significant performance degradation.

#### Memory usage comparison

The profiling runs on a single A100-SXM4-80G GPU with PyTorch 2.0.1 and CUDA 11.4. We use BF16 models and generate 1024 tokens (seq-length=1024) by default; oom indicates out of memory. With KV-cache quantization turned on, we can run a larger batch size (bs):

| USE KV Cache | bs=1 | bs=4 | bs=16 | bs=32 | bs=64 | bs=100 |
|---|---|---|---|---|---|---|
| no | 16.3GB | 24.1GB | 31.7GB | 48.7GB | oom | oom |
| yes | 15.5GB | 17.2GB | 22.3GB | 30.2GB | 48.2GB | 72.4GB |

With KV-cache quantization turned on, the model can also save more memory when generating longer sequences (sl, the number of tokens generated) at inference time:

| USE KV Cache | sl=512 | sl=1024 | sl=2048 | sl=4096 | sl=8192 |
|---|---|---|---|---|---|
| no | 15.2GB | 16.3GB | 17.6GB | 19.5GB | 23.2GB |
| yes | 15GB | 15.5GB | 15.8GB | 16.6GB | 17.6GB |

#### Difference of storage in layer-past

With KV-cache quantization turned on, the model converts the format of layer-past from float to int8; the quantized layer-past also stores the quantization parameters of the current values. The specific steps are as follows:

1. Quantize the key/value:

```
qv, scale, zero_point = quantize_cache_v(v)
```

2. Store them into layer_past. The format of the quantized layer_past is:

```
layer_past = ((q_key, key_scale, key_zero_point),
              (q_value, value_scale, value_zero_point))
```

while the basic format of layer_past is:

```
layer_past = (key, value)
```

If you want to use the attention KV which is quantized, you can use the dequantization operation to convert the int8 key/value back to the float format:

```
v = dequantize_cache_torch(qv, scale, zero_point)
```

## Finetuning

We now provide the official training script, finetune.py, for users to finetune the pretrained model for downstream applications in a simple fashion. Additionally, we provide shell scripts to launch finetuning with no worries. The script supports training with DeepSpeed and FSDP. The shell scripts that we provide use DeepSpeed (note: this may have conflicts with the latest version of pydantic) and PEFT. You can install them by:

```bash
pip install peft deepspeed
```

To prepare your training data, you need to put all the samples into a list and save it to a JSON file. Each sample is a dictionary consisting of an id and a list for the conversation. Below is a simple example list with one sample, where the user says "Hello" and the assistant replies "I am a language model, and my name is Tongyi Qianwen.":

```json
[
  {
    "id": "identity_0",
    "conversations": [
      {
        "from": "user",
        "value": "你好"
      },
      {
        "from": "assistant",
        "value": "我是一个语言模型，我叫通义千问。"
      }
    ]
  }
]
```

After data preparation, you can use the provided shell scripts to run finetuning. Remember to specify the path to the data file, $DATA.

The finetuning scripts allow you to perform:

* Full-parameter finetuning
* LoRA
* Q-LoRA

Full-parameter finetuning requires updating all parameters in the whole training process. To launch your training, run the following script:

```bash
# Distributed training. We do not provide a single-GPU training script, as insufficient GPU memory would break the training.
sh finetune/finetune_ds.sh
```

Remember to specify the correct model name or path, the data path, as well as the output directory in the shell scripts. Another thing to notice is that we use DeepSpeed ZeRO 3 in this script. If you want to make changes, just remove the argument --deepspeed or modify the DeepSpeed configuration JSON file based on your requirements. Additionally, this script supports mixed-precision training, and thus you can use --bf16 True or --fp16 True. Empirically, we advise you to use bf16 to make your training consistent with our pretraining and alignment if your machine supports it, and thus we use it by default.

Similarly, to run LoRA, use another script as shown below. Before you start, make sure that you have installed peft. Also, you need to specify the paths to your model, data, and output. We advise you to use an absolute path for your pretrained model.
This is because LoRA only saves the adapter, and the absolute path recorded in the adapter configuration JSON file is used to find the pretrained model to load. Also, this script supports both bf16 and fp16.

```bash
# Single GPU training
sh finetune/finetune_lora_single_gpu.sh
# Distributed training
sh finetune/finetune_lora_ds.sh
```

In comparison with full-parameter finetuning, LoRA (paper) only updates the parameters of the adapter layers but keeps the original large language model layers frozen. This allows much lower memory costs and thus lower computation costs. However, if you still suffer from insufficient memory, you can consider Q-LoRA (paper), which uses the quantized large language model and other techniques such as paged attention to allow even lower memory costs. To run Q-LoRA, directly run the following script:

```bash
# Single GPU training
sh finetune/finetune_qlora_single_gpu.sh
# Distributed training
sh finetune/finetune_qlora_ds.sh
```

For Q-LoRA, we advise you to load our provided quantized model, e.g., Qwen-7B-Chat-Int4. However, different from full-parameter finetuning and LoRA, only fp16 is supported for Q-LoRA.

Different from full-parameter finetuning, the training of both LoRA and Q-LoRA only saves the adapter parameters. Suppose your training starts from Qwen-7B; you can then load the finetuned model for inference as shown below:

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    path_to_adapter,  # path to the output directory
    device_map="auto",
    trust_remote_code=True
).eval()
```

The shell scripts use torchrun to run single-GPU or multi-GPU training. For multi-GPU training, you need to specify the proper hyperparameters for distributed training based on your machine.

## Demo

### Web UI

We provide code for users to build a web UI demo (thanks to @wysaid). Before you start, make sure you install the following packages:

```bash
pip install -r requirements_web_demo.txt
```

Then run the command below and click on the generated link:

```bash
python web_demo.py
```

### CLI Demo

We provide a CLI demo example in cli_demo.py, which supports streaming output for the generation. Users can interact with Qwen-7B-Chat by inputting prompts, and the model returns its outputs in streaming mode. Run the command below:

```bash
python cli_demo.py
```

## API

We provide methods to deploy a local API based on the OpenAI API (thanks to @hanpenggit). Before you start, install the required packages:

```bash
pip install fastapi uvicorn openai "pydantic>=2.3.0" sse_starlette
```

Then run the command to deploy your API:

```bash
python openai_api.py
```

You can change your arguments, e.g., -c for the checkpoint name or path, --cpu-only for CPU deployment, etc. If you meet problems launching your API deployment, updating the packages to the latest version can probably solve them.

Using the API is also simple. See the example below:

```python
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

# create a request activating streaming response
for chunk in openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "user", "content": "你好"}
    ],
    stream=True
    # Specifying stop words in streaming output format is not yet supported and is under development.
):
    if hasattr(chunk.choices[0].delta, "content"):
        print(chunk.choices[0].delta.content, end="", flush=True)

# create a request not activating streaming response
response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[
        {"role": "user", "content": "你好"}
    ],
    stream=False,
    stop=[]  # You can add custom stop words here, e.g., stop=["Observation:"] for ReAct prompting.
)
print(response.choices[0].message.content)
```

Function calling is also supported (but only when stream=False for the moment). See the example usage here.
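For reference, a function-calling request against this local deployment might look like the sketch below. It assumes the deployed openai_api.py accepts the standard OpenAI functions field; the get_current_weather schema is made up purely for illustration, so adapt it to the linked example for the exact supported format.

```python
# Hedged sketch of a function-calling request to the local openai_api.py deployment.
# Assumption: the server accepts the standard OpenAI "functions" field; the function
# schema below is illustrative and not part of this repository.
import openai

openai.api_base = "http://localhost:8000/v1"
openai.api_key = "none"

response = openai.ChatCompletion.create(
    model="Qwen",
    messages=[{"role": "user", "content": "What is the weather like in Beijing?"}],
    functions=[
        {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location.",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        }
    ],
    stream=False,  # function calling currently requires stream=False
)

message = response.choices[0].message
# If the model decides to call the function, the reply carries a function_call instead of plain content.
print(message.get("function_call") or message.get("content"))
```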
## Deployment

It is simple to run the model on CPU, which requires you to specify the device:

```python
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B-Chat", device_map="cpu", trust_remote_code=True).eval()
```

If you suffer from lack of GPU memory and you would like to run the model on more than one GPU, you can use our provided script utils.py:

```python
from utils import load_model_on_gpus

model = load_model_on_gpus('Qwen/Qwen-7B-Chat', num_gpus=2)
```

Then you can run the 7B chat model on 2 GPUs using the above script.

We also provide a pure C++ implementation of Qwen-LM and tiktoken; see qwen.cpp for details.

## Tool Usage

Qwen-Chat has been optimized for tool usage and function calling capabilities. Users can develop agents, LangChain applications, and even augment Qwen with a Python code interpreter.

We provide documentation on how to implement tool calls based on the principle of ReAct prompting; please refer to the ReAct example. Based on this principle, we provide support for function calling in openai_api.py.

We have tested the model's tool calling capabilities on our open-source Chinese evaluation benchmark and found that Qwen-Chat consistently performs well:

Chinese Tool-Use Benchmark

| Model | Tool Selection (Acc.) | Tool Input (Rouge-L) | False Positive Error |
|---|---|---|---|
| GPT-4 | 95% | 0.90 | 15.0% |
| GPT-3.5 | 85% | 0.88 | 75.0% |
| Qwen-7B-Chat | 98% | 0.91 | 7.3% |
| Qwen-14B-Chat | 98% | 0.93 | 2.4% |

To assess Qwen's ability to use the Python code interpreter for tasks such as mathematical problem solving, data visualization, and other general-purpose tasks such as file handling and web scraping, we have created and open-sourced a benchmark specifically designed for evaluating these capabilities. You can find the benchmark at this link.

We have observed that Qwen performs well in terms of code executability and result accuracy when generating code:

Executable Rate of Generated Code (%)

| Model | Math | Visualization | General |
|---|---|---|---|
| GPT-4 | 91.9 | 85.9 | 82.8 |
| GPT-3.5 | 89.2 | 65.0 | 74.1 |
| LLaMA2-7B-Chat | 41.9 | 33.1 | 24.1 |
| LLaMA2-13B-Chat | 50.0 | 40.5 | 48.3 |
| CodeLLaMA-7B-Instruct | 85.1 | 54.0 | 70.7 |
| CodeLLaMA-13B-Instruct | 93.2 | 55.8 | 74.1 |
| InternLM-7B-Chat-v1.1 | 78.4 | 44.2 | 62.1 |
| InternLM-20B-Chat | 70.3 | 44.2 | 65.5 |
| Qwen-7B-Chat | 82.4 | 64.4 | 67.2 |
| Qwen-14B-Chat | 89.2 | 84.1 | 65.5 |

Accuracy of Code Execution Results (%)

| Model | Math | Visualization-Hard | Visualization-Easy |
|---|---|---|---|
| GPT-4 | 82.8 | 66.7 | 60.8 |
| GPT-3.5 | 47.3 | 33.3 | 55.7 |
| LLaMA2-7B-Chat | 3.9 | 14.3 | 39.2 |
| LLaMA2-13B-Chat | 8.3 | 8.3 | 40.5 |
| CodeLLaMA-7B-Instruct | 14.3 | 26.2 | 60.8 |
| CodeLLaMA-13B-Instruct | 28.2 | 27.4 | 62.0 |
| InternLM-7B-Chat-v1.1 | 28.5 | 4.8 | 40.5 |
| InternLM-20B-Chat | 34.6 | 21.4 | 45.6 |
| Qwen-7B-Chat | 41.9 | 40.5 | 54.4 |
| Qwen-14B-Chat | 58.4 | 53.6 | 59.5 |

In addition, we also provide experimental results demonstrating that our model is capable of acting as a HuggingFace Agent. For more information, please refer to the example documentation.
The model's performance on the evaluation dataset provided by Hugging Face is as follows:

HuggingFace Agent Benchmark - Run Mode

| Model | Tool Selection | Tool Used | Code |
|---|---|---|---|
| GPT-4 | 100 | 100 | 97.4 |
| GPT-3.5 | 95.4 | 96.3 | 87.0 |
| StarCoder-Base-15B | 86.1 | 87.0 | 68.9 |
| StarCoder-15B | 87.0 | 88.0 | 68.9 |
| Qwen-7B-Chat | 87.0 | 87.0 | 71.5 |
| Qwen-14B-Chat | 93.5 | 94.4 | 87.0 |

HuggingFace Agent Benchmark - Chat Mode

| Model | Tool Selection | Tool Used | Code |
|---|---|---|---|
| GPT-4 | 97.9 | 97.9 | 98.5 |
| GPT-3.5 | 97.3 | 96.8 | 89.6 |
| StarCoder-Base-15B | 97.9 | 97.9 | 91.1 |
| StarCoder-15B | 97.9 | 97.9 | 89.6 |
| Qwen-7B-Chat | 94.7 | 94.7 | 85.1 |
| Qwen-14B-Chat | 97.9 | 97.9 | 95.5 |

## Long-Context Understanding

To extend the context length and break the bottleneck of training sequence length, we introduce several techniques, including NTK-aware interpolation, window attention, and LogN attention scaling, to extend the context length of Qwen-7B/14B from 2K to over 8K tokens, and that of Qwen-7B from 8K to 32K tokens. We conduct language modeling experiments on the arXiv dataset with PPL evaluation and find that Qwen can reach outstanding performance in the long-context scenario. Results (PPL by evaluation sequence length) are shown below:

| Model | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 |
|---|---|---|---|---|---|---|
| Qwen-7B (original) | 4.23 | 3.78 | 39.35 | 469.81 | 2645.09 | - |
| + dynamic_ntk | 4.23 | 3.78 | 3.59 | 3.66 | 5.71 | - |
| + dynamic_ntk + logn | 4.23 | 3.78 | 3.58 | 3.56 | 4.62 | - |
| + dynamic_ntk + logn + window_attn | 4.23 | 3.78 | 3.58 | 3.49 | 4.32 | - |
| Qwen-7B | 4.23 | 3.81 | 3.52 | 3.31 | 7.27 | 181.49 |
| + dynamic_ntk + logn + window_attn | 4.23 | 3.81 | 3.52 | 3.33 | 3.22 | 3.17 |
| Qwen-14B | - | 3.46 | 22.79 | 334.65 | 3168.35 | - |
| + dynamic_ntk + logn + window_attn | - | 3.46 | 3.29 | 3.18 | 3.42 | - |

## Tokenizer

Our tokenizer, based on tiktoken, is different from other tokenizers, e.g., the sentencepiece tokenizer. You need to pay attention to special tokens, especially in finetuning. For more detailed information on the tokenizer and its use in finetuning, please refer to the documentation.

## Reproduction

For your reproduction of the model performance on benchmark datasets, we provide scripts to reproduce the results. Check eval/EVALUATION.md for more information. Note that the reproduction may lead to slight differences from our reported results.

## FAQ

If you meet problems, please refer to the FAQ and the issues first to search for a solution before you open a new issue.

## License Agreement

Researchers and developers are free to use the code and model weights of both Qwen and Qwen-Chat. We also allow their commercial use. Check our license at LICENSE for more details. If you have requirements for commercial use, please fill out the form (7B, 14B) to apply.

## Contact Us

If you are interested in leaving a message to either our research team or product team, join our Discord or WeChat groups! Also, feel free to send an email to qianwen_opensource@alibabacloud.com.