Qwen2-VL-7B-Instruct
The Qwen2-VL-7B-Instruct model is a powerful multimodal model that achieves state-of-the-art performance on a range of visual understanding benchmarks. It can understand images of diverse resolutions and aspect ratios, as well as videos up to 20 minutes long, enabling high-quality video-based question answering, dialogue, and content creation. The model also reads multilingual text within images, including most European languages, Japanese, Korean, Arabic, and Vietnamese, and it can act as an agent that operates devices such as mobile phones and robots through complex reasoning and decision making. Its main limitations are the lack of audio support, an image dataset that only extends to June 2023, and limited ability to recognize specific individuals and intellectual property. Despite these limitations, the model supports a wide range of applications.
Model Overview
The Qwen2-VL model, representing nearly a year of innovation, is designed to understand and interact with visual information, including images and videos, in a more human-like way. It covers a variety of capabilities, including image understanding, video understanding, multilingual text recognition within images, and device operation as an agent.
Key Enhancements
- State-of-the-art understanding of images: Achieves top performance on various visual understanding benchmarks, including MathVista, DocVQA, and RealWorldQA.
- Video understanding: Can comprehend videos up to 20 minutes long, enabling high-quality video-based question answering, dialog, and content creation.
- Multilingual support: Supports understanding texts in multiple languages, including most European languages, Japanese, Korean, Arabic, Vietnamese, and more.
- Agent capabilities: Can operate devices like mobile phones and robots, making decisions based on the visual environment and text instructions.
Model Architecture Updates
- Naive Dynamic Resolution: Can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens (see the sketch after this list for how to bound the token budget).
- Multimodal Rotary Position Embedding (M-ROPE): Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information.
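Because the number of visual tokens scales with image resolution, the processor exposes a pixel budget that trades detail for memory. Below is a minimal sketch using the Hugging Face processor for this checkpoint; the specific min_pixels/max_pixels values are illustrative choices, not required defaults.

```python
from transformers import AutoProcessor

# Bound the dynamic-resolution token budget: images are resized so their area
# stays within [min_pixels, max_pixels] before being split into visual tokens.
min_pixels = 256 * 28 * 28    # illustrative lower bound
max_pixels = 1280 * 28 * 28   # illustrative upper bound; lower it to save GPU memory
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```

Smaller budgets cut memory use and latency; larger budgets preserve fine detail in dense documents and charts.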
Model Variants
We have three models with 2B, 7B, and 72B parameters. This repository contains the instruction-tuned 7B Qwen2-VL model.
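As a minimal sketch of selecting a variant, the instruction-tuned checkpoints follow the `Qwen/Qwen2-VL-<size>-Instruct` naming pattern on the Hugging Face Hub (assuming the corresponding checkpoint is available to you; `device_map="auto"` additionally assumes the accelerate package is installed):

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Pick a variant: "Qwen/Qwen2-VL-2B-Instruct", "Qwen/Qwen2-VL-7B-Instruct", or "Qwen/Qwen2-VL-72B-Instruct"
model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)
```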
Performance
Qwen2-VL handles a wide range of visual tasks with strong speed, accuracy, and efficiency.
Speed
The model processes images and videos quickly, making it well suited to applications that require fast responses; inference can be accelerated further with an optimized attention implementation (see the sketch below).
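As a hedged sketch of one such optimization: transformers lets you select FlashAttention-2 as the attention backend when loading the model, which typically speeds up generation and lowers memory use. This assumes the flash-attn package is installed and a compatible GPU and dtype (bf16 here).

```python
import torch
from transformers import Qwen2VLForConditionalGeneration

# Load with FlashAttention-2 for faster, more memory-efficient inference.
# Requires the flash-attn package and a GPU/dtype it supports.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```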
Accuracy
Qwen2-VL boasts state-of-the-art performance on visual understanding benchmarks, including:
| Benchmark | Qwen2-VL-7B | InternVL2-8B | MiniCPM-V 2.6 | GPT-4o-mini |
| --- | --- | --- | --- | --- |
| MMMU (val) | 54.1 | 51.8 | 49.8 | 60 |
| DocVQA (test) | 94.5 | 91.6 | 90.8 | - |
| InfoVQA (test) | 76.5 | 74.8 | - | - |
| ChartQA (test) | 83.0 | 83.3 | - | - |
| TextVQA (val) | 84.3 | 77.4 | 80.1 | - |
Efficiency
Qwen2-VL is designed to be efficient, with the ability to handle arbitrary image resolutions and dynamic visual tokens.
Limitations
While Qwen2-VL is a powerful tool, it’s not perfect. Here are some areas where it needs improvement:
Lack of Audio Support
- Qwen2-VL can’t understand audio information within videos.
Data Timeliness
- Our image dataset is only updated until June 2023.
Constraints in Individuals and Intellectual Property (IP)
- Qwen2-VL has limited capacity to recognize specific individuals or IPs.
Limited Capacity for Complex Instruction
- When faced with intricate multi-step instructions, Qwen2-VL’s understanding and execution capabilities need enhancement.
Insufficient Counting Accuracy
- Particularly in complex scenes, the accuracy of object counting is not high.
Weak Spatial Reasoning Skills
- Especially in 3D spaces, Qwen2-VL’s inference of object positional relationships is inadequate.
Format
Qwen2-VL uses a multimodal architecture that can handle both images and videos, along with text inputs.
Supported Data Formats
- Images: local files, base64, and URLs
- Videos: local files (currently, no support for URLs or base64)
- Text: plain text
Input Requirements
- Images can be provided as local files, base64-encoded strings, or URLs; videos must currently be provided as local files.
- Text inputs should be plain text.
- For videos, you can provide an ordered list of image frames or a single video file (see the examples after this list).
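Here is a hedged sketch of what these inputs look like in the message format consumed by qwen_vl_utils.process_vision_info; every local file path and the base64 payload below are hypothetical placeholders.

```python
# Image inputs: URL, local file, or base64 data URI (placeholder values).
image_from_url = {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}
image_from_file = {"type": "image", "image": "file:///path/to/your_image.jpg"}
image_from_base64 = {"type": "image", "image": "data:image;base64,<base64-encoded bytes>"}

# Video inputs: a single local file, or an ordered list of frame images.
video_from_file = {"type": "video", "video": "file:///path/to/your_video.mp4"}
video_from_frames = {
    "type": "video",
    "video": ["file:///path/to/frame_0001.jpg", "file:///path/to/frame_0002.jpg"],
}

# Any of the dicts above can be placed in a user message alongside a text prompt.
messages = [
    {"role": "user", "content": [image_from_url, {"type": "text", "text": "Describe this image."}]}
]
```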
Output Format
- The model generates text outputs based on the input visual content.
Code Examples
Here’s an example of how to use the Qwen2-VL model with the transformers library; it also relies on the qwen_vl_utils helper package (installable via pip install qwen-vl-utils):
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor (dtype and device placement are picked automatically;
# device_map="auto" requires the accelerate package)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Define a sample input message with one image and a text prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the input: build the chat prompt and extract the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

# Generate the output and strip the prompt tokens before decoding
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text)
```
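The same pipeline extends to video inputs. Below is a minimal sketch reusing the model and processor loaded above; the local video path is a hypothetical placeholder, and decoding the file through qwen_vl_utils may additionally require a video backend such as torchvision or decord.

```python
# Sketch: video question answering with the model and processor loaded above.
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/your_video.mp4"},  # placeholder path
            {"type": "text", "text": "Summarize what happens in this video."},
        ],
    }
]

# Build the prompt and extract the video frames, then run generation as before
text = processor.apply_chat_template(video_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False))
```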