Qwen 72B Chat Int4

Large Language Model

Qwen 72B Chat Int4 is an Int4-quantized release of Alibaba Cloud's Qwen-72B-Chat, built for efficient, fast inference. It is based on the Transformer architecture and was trained on a massive dataset of over 3 trillion tokens, including web texts, books, code, and more. The model supports context lengths of up to 32k tokens, handles multiple languages, and can be steered through system prompts for role-playing, language style transfer, task setting, and behavior setting. It performs competitively on benchmarks such as C-Eval, MMLU, HumanEval, and GSM8K, and has also shown strong results in long-context understanding and tool usage. By trading a small amount of numerical precision for a much smaller memory footprint, it balances efficiency and quality, making it a practical choice for users who need a capable assistant without the GPU budget of the full-precision model.

Model Overview

The Qwen-72B-Chat-Int4 model is a large language model developed by Alibaba Cloud. It is the Int4-quantized release of Qwen-72B-Chat, part of the Qwen series, with 72 billion parameters. The model is based on the Transformer architecture and was trained on a massive dataset of over 3 trillion tokens, including web texts, books, code, and more.

Capabilities

This model is a powerful tool that can perform a variety of tasks, including:

  • Text Generation: Generate human-like text based on a given prompt or topic.
  • Code Generation: Generate code in various programming languages.
  • Conversational AI: Build conversational AI systems that can engage in natural-sounding conversations with humans.
  • Long-Context Understanding: Understand and process long pieces of text, making it suitable for tasks such as document summarization and question answering.

Strengths

This model has several strengths that make it a powerful tool:

  • High-Quality Training Data: Trained on a large and diverse dataset, which enables it to generate high-quality text and code.
  • Large Vocabulary: Has a large vocabulary, which allows it to understand and generate a wide range of words and phrases.
  • Long-Context Support: Can process long pieces of text, making it suitable for tasks that require a deep understanding of context.
  • System Prompt: Can be controlled using system prompts, which allows users to customize its behavior and generate specific types of text or code.

Unique Features

This model has several unique features that set it apart from other models:

  • Quantization: Available in various quantization levels, including BF16, Int8, and Int4, making it suitable for different deployment scenarios.
  • Flash Attention: Uses flash attention, which is a technique that enables the model to process long sequences of text more efficiently.
  • vLLM: Compatible with vLLM, which provides higher efficiency and lower memory usage.
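For the vLLM path, offline inference might be set up roughly as follows. This is a hedged sketch, not the project's documented recipe: exact flags vary by vLLM version, the `quantization="gptq"` setting assumes the Int4 checkpoint is GPTQ-format, and a 72B model still requires one or more large GPUs.

```python
# Hypothetical sketch: offline inference with vLLM. Requires a machine with
# enough GPU memory for the Int4 weights (see the Performance section).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen-72B-Chat-Int4",
    quantization="gptq",        # assumption: the Int4 checkpoint is GPTQ-format
    trust_remote_code=True,
    tensor_parallel_size=2,     # adjust to the number of available GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Tell me a short joke."], params)
print(outputs[0].outputs[0].text)
```

vLLM batches and schedules requests internally, which is where the efficiency and memory savings mentioned above come from.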

Performance

The model performs strongly across Chinese and English understanding, coding, and mathematics tasks.

  • Fast Inference: Achieves fast inference speeds, with an average of 11.67 tokens per second on a single A100-80G GPU.
  • Efficient GPU Usage: Thanks to Int4 quantization, the model runs in roughly 48GB of GPU memory, far less than the full-precision version requires.
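
The ~48GB figure follows from a back-of-envelope calculation: 72 billion parameters at 4 bits each is about 33.5 GiB for the weights alone, with the remainder going to the KV cache and activation overhead. A quick sketch:

```python
# Rough estimate of the Int4 weight footprint (ignores KV cache and activations).
params = 72e9                   # 72B parameters
bits_per_param = 4              # Int4 quantization
weights_bytes = params * bits_per_param / 8
weights_gib = weights_bytes / 1024**3
print(f"{weights_gib:.1f} GiB")  # ~33.5 GiB for weights; runtime overhead pushes the total toward 48GB
```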

Evaluation Results

This model has been evaluated on several benchmarks, including:

  • C-Eval: Achieved a zero-shot accuracy of 80.1 on the C-Eval benchmark.
  • MMLU: Achieved a zero-shot accuracy of 74.3 on the MMLU benchmark.
  • HumanEval: Achieved a zero-shot pass rate of 64.6 on the HumanEval benchmark.
  • GSM8K: Achieved an accuracy of 76.4 on the GSM8K benchmark.

Limitations

This model has some limitations, including:

  • Data Bias: May learn to replicate biases or inaccuracies present in the training data.
  • Limited Contextual Understanding: May struggle with very complex or abstract concepts.
  • Dependence on System Prompt: Customized behavior (role-play, style transfer, task setting) depends on well-crafted system prompts.
  • Limited Long-Context Understanding: May struggle to understand the nuances of very long texts or conversations.
Examples

The zero-shot and 8-shot accuracy of Qwen-72B-Chat on GSM8K, compared with other chat models, is shown below:

  Model                             Acc.
  LLaMA2-7B-Chat                    26.3
  LLaMA2-13B-Chat                   37.1
  LLaMA2-70B-Chat                   59.3
  Yi-34B-Chat                       71.6
  ChatGPT3.5                        73.2
  ChatGPT4                          91.4
  Qwen-7B-Chat (original) (0-shot)  41.1
  Qwen-7B-Chat (0-shot)             50.3
  Qwen-7B-Chat (8-shot)             54.1
  Qwen-14B-Chat (0-shot)            60.1
  Qwen-14B-Chat (8-shot)            59.3
  Qwen-72B-Chat (0-shot)            76.4
  Qwen-72B-Chat (8-shot)            75.7

Prompt: Write a poem about the beauty of nature.
Response:
  In the forest, where the trees preside,
  A gentle breeze stirs, side by side.
  The sunbeams filter, green and bright,
  A soothing melody, a wondrous sight.

Prompt: Summarize the plot of the book 'Pride and Prejudice' by Jane Austen.
Response: The novel follows Elizabeth Bennet and Mr. Darcy as they navigate societal expectations, family obligations, and their own biases. Through a series of misadventures and misunderstandings, they eventually come to realize their love for each other.

Format

This model accepts tokenized text sequences as input and supports a maximum context length of 32k tokens.

  • Supported Data Formats: Tokenized text sequences
  • Special Requirements: Input text must be tokenized using the tiktoken library, and the model requires a significant amount of GPU memory (at least 48GB)
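
Given the hard 32k limit, an application may want to check prompt length before calling the model. Below is a minimal sketch; `count_tokens` is a hypothetical stand-in (in practice you would use the model's own tokenizer, e.g. `len(tokenizer.encode(text))`).

```python
# Minimal sketch: guard a prompt against the 32k context window before generation.
MAX_CONTEXT = 32 * 1024  # 32k-token context length

def count_tokens(text: str) -> int:
    # Hypothetical placeholder: roughly 4 characters per token for English text.
    # Replace with len(tokenizer.encode(text)) using the real Qwen tokenizer.
    return max(1, len(text) // 4)

def fits_in_context(prompt: str, reserved_for_output: int = 512) -> bool:
    # Leave headroom for the generated reply as well as the prompt itself.
    return count_tokens(prompt) + reserved_for_output <= MAX_CONTEXT

print(fits_in_context("What is the capital of France?"))  # True: a short prompt fits easily
```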

Example Code

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-72B-Chat-Int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-72B-Chat-Int4", device_map="auto", trust_remote_code=True).eval()

# Prepare the input text ("你好" means "Hello")
input_text = "你好"

# Generate a response; model.chat tokenizes the input internally
response, history = model.chat(tokenizer, input_text, history=None)

# Print the response
print(response)

System Prompt

This model supports system prompts, which allow for role-playing, language style transfer, task setting, and behavior setting. For example:

# Ask "你好呀" ("Hi there") with a system prompt meaning
# "Please talk to me in a cute anime tone"
response, _ = model.chat(tokenizer, "你好呀", history=None, system="请用二次元可爱语气和我说话")
print(response)

This will generate a response in a cute and playful tone.

Dataloop's AI Development Platform
Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.
Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAIF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.
Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.