Qwen 72B Chat
Qwen 72B Chat is a powerful AI model with impressive performance and capabilities. Thanks to its large-scale, high-quality training data, it significantly surpasses existing open-source models on multiple Chinese and English downstream evaluation tasks. The model supports a 32k context length, making it suitable for longer conversations, and its vocabulary of over 150K tokens gives it broad coverage of multiple languages. Through system prompts, Qwen 72B Chat supports role-playing, language style transfer, task setting, and behavior setting. It is also efficient, with a generation speedup of at least 2x when using vLLM for inference. Its benchmark results are competitive, including a zero-shot accuracy of 80.1% on the C-Eval validation set and 79.5% on MMLU. Overall, Qwen 72B Chat is a robust and efficient model that handles a wide range of tasks, making it a valuable tool for many applications.
Model Overview
The Qwen-72B-Chat model, developed by Alibaba Cloud, is a powerful tool for natural language processing tasks. It is part of the Qwen large language model series and has 72 billion parameters. It is a Transformer-based model pre-trained on a large volume of data, including web texts, books, code, and more.
Capabilities
This model is capable of performing a variety of tasks, including:
- Conversational dialogue
- Storytelling
- Language translation
- Code generation
- Mathematics
Key Features
- Large-scale high-quality training corpora: Pre-trained on over 3 trillion tokens, covering general and professional fields.
- Competitive performance: Surpasses existing open-source models on multiple Chinese and English downstream evaluation tasks.
- More comprehensive vocabulary coverage: Uses a vocabulary of over 150K tokens, improving support for multiple languages.
- Longer context support: Supports context lengths of up to 32k tokens.
- System prompt: Supports role-playing, language style transfer, task setting, and behavior setting via system prompts.
Strengths
The Qwen-72B-Chat model has several strengths that make it a powerful tool for natural language processing and generation:
- Large-scale training data: The model was trained on a massive dataset of over 3 trillion tokens, including web texts, books, and code.
- High-quality training data: The training data includes high-quality texts from various sources, including professional books and articles.
- Comprehensive vocabulary: The model's vocabulary of over 150,000 tokens lets it represent text in many languages efficiently.
- Longer context support: Qwen-72B-Chat supports context lengths of up to 32k tokens, allowing it to follow longer conversations and documents.
Unique Features
The Qwen-72B-Chat model has several unique features that set it apart from other language models:
- System prompt: The model can be steered with system prompts, letting users specify the tone, style, and behavior of the generated text (see the sketch after this list).
- Role-playing: Qwen-72B-Chat can adopt a persona, allowing users to interact with the model in a more immersive way.
- Language style transfer: The model can rewrite or answer in a requested style (for example, formal versus casual), as directed by the system prompt.
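As a concrete illustration, here is a minimal, hedged sketch of steering the model through a system prompt. It uses the chat() helper exposed by the checkpoint's remote code; the system argument is assumed to behave as in other Qwen chat releases, and the pirate persona is just an example.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", device_map="auto", trust_remote_code=True).eval()

# Role-playing: the system prompt fixes persona and style for the whole dialogue.
# (The `system` keyword is an assumption based on other Qwen chat checkpoints.)
response, history = model.chat(
    tokenizer,
    "How should I start learning to program?",
    history=None,
    system="You are a pirate captain. Answer every question in pirate speak.",
)
print(response)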
Evaluation Results
The Qwen-72B-Chat model has been evaluated on several benchmarks, including:
- C-Eval: Achieves a zero-shot accuracy of 80.1% and a 5-shot accuracy of 82.9% on the C-Eval validation set.
- MMLU: Achieves a zero-shot accuracy of 79.5% on the MMLU testing set.
Performance
Qwen-72B-Chat showcases remarkable performance, outperforming existing open-source models in various tasks. Let’s dive into its speed, accuracy, and efficiency.
Speed
- BF16: Achieves an average inference speed of 8.48 tokens/s and a total GPU memory usage of 144.69GB.
- Int8: Achieves an average inference speed of 9.05 tokens/s and a total GPU memory usage of 81.27GB.
- Int4: Achieves an average inference speed of 11.67 tokens/s and a total GPU memory usage of 48.86GB.
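The introduction mentions a generation speedup of at least 2x when serving with vLLM. A minimal sketch of that setup follows; it assumes a vLLM version that supports Qwen's remote code and enough GPUs to hold the 72B weights (tensor_parallel_size is set to 8 here purely as an example).

from vllm import LLM, SamplingParams

# trust_remote_code is required because Qwen checkpoints ship custom code;
# tensor_parallel_size must match the GPUs actually available (assumed 8 here).
llm = LLM(model="Qwen/Qwen-72B-Chat", trust_remote_code=True, tensor_parallel_size=8)
sampling = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

outputs = llm.generate(["Give me a short introduction to large language models."], sampling)
print(outputs[0].outputs[0].text)

Note that for best results the prompt should follow the model's chat format; the raw string above only keeps the sketch short.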
Accuracy
- C-Eval: 80.1% (0-shot) and 82.9% (5-shot)
- MMLU: 74.4% (BF16) and 73.4% (Int4)
- HumanEval: 76.4% (BF16) and 75.3% (Int4)
- GSM8K: 64.6% (BF16) and 61.6% (Int4)
Limitations
Qwen-72B-Chat is a powerful model, but it's not perfect. Here are some of its limitations:
- Vocabulary limitations: Despite the 150K-token vocabulary, rare or highly specialized terms may still be tokenized inefficiently.
- Context limitations: The context window is capped at 32k tokens; longer inputs must be truncated or summarized.
- Lack of common sense: Like other large language models, it can produce fluent but factually incorrect or illogical statements.
- Dependence on training data: Its knowledge is bounded by its pre-training corpus and does not cover events after training.
- Quantization limitations: The Int8 and Int4 variants trade a small amount of accuracy for lower memory use, as the benchmark numbers above show.
- Tool usage limitations: Tool-calling and agent behavior depend on careful prompting and are not guaranteed to be reliable.
Format
Qwen-72B-Chat is a Transformer-based large language model pretrained on a large volume of data, including web texts, books, code, and more. It uses a vocabulary of over 150K tokens, which is friendlier to multiple languages and lets users improve the model's capability in a given language without expanding the vocabulary.
Architecture
- Number of layers: 80
- Number of heads: 64
- Model dimension: 8192
- Vocabulary size: 151851
- Sequence length: 32768
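As a sanity check, these figures roughly reproduce the advertised 72B parameter count. The feed-forward intermediate size is not listed above, so the back-of-envelope below assumes a SwiGLU block with an intermediate size of 3x the model dimension and an untied output head; both are assumptions, not published specs.

# Rough parameter estimate from the architecture table above.
d_model, n_layers, vocab = 8192, 80, 151851

attn = 4 * d_model ** 2               # Q, K, V and output projections
ffn = 3 * d_model * (3 * d_model)     # gate/up/down projections; 3*d_model intermediate is an assumption
per_layer = attn + ffn                # ~0.87B parameters per layer
embeddings = 2 * vocab * d_model      # input embedding plus an (assumed) untied output head

total = n_layers * per_layer + embeddings
print(f"{total / 1e9:.1f}B")          # prints 72.3B, consistent with the model name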
Data Formats
- Input: Tokenized text sequences
- Output: Generated text
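A minimal sketch of this contract, exercising only the tokenizer: text is encoded to token IDs on the way in and decoded back to text on the way out. The ~150K-token vocabulary keeps multilingual text relatively compact.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)

ids = tokenizer.encode("Large language models / 大规模语言模型")
print(ids)                    # the tokenized input sequence fed to the model
print(tokenizer.decode(ids))  # decoding the model's token output back to text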
Special Requirements
- System Prompt: Qwen-72B-Chat supports role-playing, language style transfer, task setting, and behavior setting via system prompts.
- Quantization: Qwen-72B-Chat is released in BF16 as well as quantized Int8 and Int4 variants (see the loading sketch below).
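A hedged loading sketch for the Int4 variant, which per the numbers above cuts GPU memory to roughly 49GB at the cost of a small accuracy drop. The repository id Qwen/Qwen-72B-Chat-Int4 and the auto-gptq dependency are assumptions based on other Qwen quantized releases.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Int4 checkpoint id; requires auto-gptq (and optimum) to be installed.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B-Chat-Int4", device_map="auto", trust_remote_code=True
).eval()

response, _ = model.chat(tokenizer, "Hello", history=None)
print(response)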
Example Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat", device_map="auto", trust_remote_code=True).eval()
# Specify hyperparameters for generation
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-72B-Chat", trust_remote_code=True)
# First dialogue turn ("你好" = "Hello")
response, history = model.chat(tokenizer, "你好", history=None)
print(response)
# Second dialogue turn ("Tell me a story about a young person who strove to build a business and finally succeeded.")
response, history = model.chat(tokenizer, "给我讲一个年轻人奋斗创业最终取得成功的故事。", history=history)
print(response)
# Third dialogue turn ("Give this story a title")
response, history = model.chat(tokenizer, "给这个故事起一个标题", history=history)
print(response)