Qwen 72B
Qwen-72B is a large language model developed by Alibaba Cloud with 72 billion parameters. It is trained on a massive dataset of over 3 trillion tokens covering multiple languages, including Chinese and English. The model excels at tasks such as knowledge retrieval, translation, and mathematical reasoning, outperforming other open-source models in its class. Qwen-72B also supports context lengths of up to 32k tokens, making it well suited to long inputs. Its vocabulary of over 150K tokens gives broad coverage of many languages, so capabilities for a specific language can be strengthened without expanding the vocabulary. With its competitive performance and efficient design, Qwen-72B is a powerful tool for a wide range of applications.
Model Overview
The Qwen-72B model, developed by Alibaba Cloud, is a powerful tool for natural language processing tasks. It is a Transformer-based large language model with 72 billion parameters, trained on a massive dataset of over 3 trillion tokens spanning a diverse range of texts, such as web pages, books, code, and mathematics.
Capabilities
The Qwen-72B model is a powerful language model that can perform a variety of tasks, including:
- Language understanding: It can understand and process human language, including Chinese, English, and multiple other languages.
- Text generation: It can generate high-quality text based on a given prompt or topic.
- Code generation: It can also generate code in various programming languages.
- Mathematical reasoning: It has been trained on a large dataset of mathematical problems and can perform mathematical reasoning tasks.
- Translation: It can translate text from one language to another.
Strengths
This model has several strengths that make it a powerful language model:
- Large-scale, high-quality training data: It was trained on a massive, carefully curated dataset of over 3 trillion tokens, which allows it to learn patterns and relationships in language that other models may not be able to capture.
- Longer context support: It can process longer sequences of text than many other language models, which makes it well-suited for tasks that require understanding complex texts.
- More comprehensive vocabulary coverage: It has a larger vocabulary than many other language models, which allows it to understand and generate text that includes a wider range of words and phrases.
Technical Details
- Model architecture: It uses a Transformer-based architecture with 80 layers, 64 attention heads, and a hidden size of 8192 (see the config sketch after this list).
- Tokenizer: It uses a custom tokenizer based on tiktoken, rather than the sentencepiece-based tokenizers used by many other models.
- Position encoding: It uses RoPE relative position encoding.
- FFN activation function: It uses SwiGLU as its activation function.
- Normalization: It uses RMSNorm for normalization.
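If you want to check these hyperparameters yourself, the configuration can be inspected without downloading the weights. This is a minimal sketch: the attribute names used below (num_hidden_layers, num_attention_heads, hidden_size, vocab_size) follow the usual Hugging Face convention and are an assumption about Qwen's remote-code config; consult the repository's config.json if they differ.

```python
from transformers import AutoConfig

# Load only the configuration (no weights) and print the architecture
# hyperparameters quoted in the list above.
config = AutoConfig.from_pretrained("Qwen/Qwen-72B", trust_remote_code=True)

print(config.num_hidden_layers)    # expected: 80 Transformer layers
print(config.num_attention_heads)  # expected: 64 attention heads
print(config.hidden_size)          # expected: hidden size of 8192
print(config.vocab_size)           # expected: a vocabulary of over 150K tokens
```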
Evaluation Results
It has been evaluated on multiple benchmarks, including MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, and CMMLU, and achieves leading results among open-source models of comparable scale across these tasks.
| Model | MMLU | C-Eval | GSM8K | MATH | HumanEval | MBPP | BBH | CMMLU |
|---|---|---|---|---|---|---|---|---|
| LLaMA2-7B | 24.4 | 46.8 | 32.5 | 16.7 | 3.3 | 12.8 | 20.8 | 38.2 |
| LLaMA2-13B | 31.3 | 55.0 | 41.4 | 29.6 | 5.0 | 18.9 | 30.3 | 45.6 |
| LLaMA2-70B | 45.7 | 69.7 | 50.1 | 63.5 | 12.0 | 26.2 | 39.6 | 64.9 |
| Qwen-72B | 66.4 | 77.4 | 83.3 | 78.9 | 35.2 | 35.4 | 52.2 | 67.7 |
Long-Context Evaluation
It supports a sequence length of up to 32k tokens, making it well suited to modeling long-range dependencies in text. In long-context evaluation, the model achieves a perplexity (PPL) of 2.8282 on the arXiv dataset, indicating that it handles long-context tasks effectively.
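As a rough illustration of what such a perplexity number means, the sketch below scores a single long document with the model's own next-token loss. This is not the exact Qwen evaluation pipeline: the input file paper.txt is a hypothetical placeholder, the 32k truncation is an assumption, and the loss computation relies on the standard Hugging Face labels interface.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B", device_map="auto", trust_remote_code=True
).eval()

# paper.txt is a placeholder for any long, arXiv-style document.
long_text = open("paper.txt").read()
ids = tokenizer(long_text, return_tensors="pt").input_ids[:, :32768]  # cap at the 32k context
ids = ids.to(model.device)

with torch.no_grad():
    # With labels set, the model returns the mean next-token cross-entropy;
    # perplexity is the exponential of that loss.
    loss = model(input_ids=ids, labels=ids).loss

print("perplexity:", torch.exp(loss).item())
```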
Limitations
While it’s a powerful language model, it’s not perfect. Here are some of its limitations:
- Training Data Bias: It was trained on a massive dataset, but it’s still possible that the data contains biases and inaccuracies.
- Limited Domain Knowledge: While it has been trained on a wide range of topics, its knowledge in certain domains may be limited.
- Lack of Common Sense: It’s a large language model, but it doesn’t have the same level of common sense as a human.
- Dependence on Tokenization: It uses a tokenizer to break down text into individual tokens. However, this can lead to issues with words that have multiple meanings or are not well-represented in the training data.
- Limited Context Length: It has a maximum context length of 32k tokens. While this is a significant improvement over earlier models, it can still lead to issues with longer texts or more complex conversations.
Format
It is a large language model based on the Transformer architecture, supporting a context length of up to 32k tokens. It uses a vocabulary of over 150K tokens, making it more friendly to multiple languages.
Input Format
The model accepts input in the form of tokenized text sequences. You can use the model's tiktoken-based tokenizer, loaded through the transformers library, to tokenize your input text.
Output Format
The model generates output as a sequence of token ids. You can use the tokenizer.decode() function to convert the output tokens back to text.
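Before the full generation example in the next section, here is a minimal encode/decode round trip that illustrates the input and output formats described above; the prompt string is only a placeholder.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B", trust_remote_code=True)

prompt = "Qwen-72B supports sequences of up to 32k tokens."
token_ids = tokenizer.encode(prompt)   # input format: a list of token ids
print(len(token_ids), token_ids[:8])

text = tokenizer.decode(token_ids)     # output format: token ids back to text
print(text)                            # should reproduce the original prompt
```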
Example Code
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer (Qwen requires trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B", device_map="auto", trust_remote_code=True)

# Tokenize the input prompt ("The capital of Mongolia is Ulaanbaatar\n
# The capital of Iceland is Reykjavik\nThe capital of Ethiopia is")
inputs = tokenizer('蒙古国的首都是乌兰巴托(Ulaanbaatar)\n冰岛的首都是雷克雅未克(Reykjavik)\n埃塞俄比亚的首都是', return_tensors='pt')
inputs = inputs.to(model.device)

# Generate a continuation
outputs = model.generate(**inputs)

# Convert the output tokens back to text
output_text = tokenizer.decode(outputs.cpu()[0], skip_special_tokens=True)
print(output_text)
```
Requirements
To run this model, you need to have:
- Python 3.8 or later
- PyTorch 1.12 or later (recommended 2.0 or later)
- CUDA 11.4 or later (recommended for GPU users)
Note that running the model in bf16 or fp16 mode requires at least 144GB of GPU memory, while running it in int4 mode requires at least 48GB of GPU memory.
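To stay within those memory budgets, the model is usually loaded in reduced precision. The sketch below shows both options using the generic transformers arguments (torch_dtype for bf16, BitsAndBytesConfig for 4-bit); whether the bitsandbytes 4-bit path works with Qwen's remote code, as opposed to the officially released Int4 checkpoints such as Qwen-72B-Chat-Int4, is an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Option 1: bf16 weights, needing roughly 144GB of GPU memory in total,
# sharded across the GPUs selected by device_map="auto".
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Option 2 (alternative, do not load both at once): 4-bit quantization via
# bitsandbytes, fitting in roughly 48GB of GPU memory. This generic
# quantization path is an assumption; the officially documented int4 route
# is the prequantized Qwen checkpoints.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-72B",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    trust_remote_code=True,
)
```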