Qwen2 VL 7B Instruct

Multimodal AI model

Qwen2-VL-7B-Instruct is a powerful multimodal AI model that achieves state-of-the-art performance on a range of visual understanding benchmarks. It can understand images of diverse resolutions and aspect ratios, as well as videos up to 20 minutes long, enabling high-quality video-based question answering, dialog, and content creation. The model also supports multilingual text understanding within images, covering most European languages as well as Japanese, Korean, Arabic, and Vietnamese. With its ability to operate devices such as mobile phones and robots, Qwen2-VL-7B demonstrates complex reasoning and decision-making capabilities. However, it has limitations, including a lack of audio support, an image-data cutoff of June 2023, and a limited ability to recognize specific individuals and intellectual property. Despite these limitations, the model is a significant advancement in AI technology, offering a wide range of applications.

Maintainer: Qwen · License: apache-2.0

Model Overview

The Qwen2-VL model, representing nearly a year of innovation, is designed to understand and interact with visual information, including images and videos, in a more human-like way. It is a powerful tool that can perform a variety of tasks, including image understanding, video understanding, multilingual support, and device operation.

Key Enhancements

  • State-of-the-art understanding of images: Achieves top performance on various visual understanding benchmarks, including MathVista, DocVQA, and RealWorldQA.
  • Video understanding: Can comprehend videos up to 20 minutes long, enabling high-quality video-based question answering, dialog, and content creation.
  • Multilingual support: Supports understanding texts in multiple languages, including most European languages, Japanese, Korean, Arabic, Vietnamese, and more.
  • Agent capabilities: Can operate devices like mobile phones and robots, making decisions based on the visual environment and text instructions.

Model Architecture Updates

  • Naive Dynamic Resolution: Can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens.
  • Multimodal Rotary Position Embedding (M-ROPE): Decomposes the positional embedding into temporal, height, and width components to capture 1D textual, 2D visual, and 3D video positional information (a toy illustration follows below).
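
As a rough illustration of the M-ROPE idea (not the model's actual implementation), the sketch below builds separate temporal, height, and width position indices for a sequence that mixes text tokens with a grid of visual tokens: text tokens share the same index on all three axes, while visual tokens carry their frame, row, and column coordinates.

import numpy as np

def mrope_position_ids(num_text_tokens, grid_t, grid_h, grid_w):
    """Toy construction of (temporal, height, width) position ids.

    Text tokens use the same index on all three axes (plain 1D positions);
    visual tokens use their frame/row/column coordinates. This is only a
    conceptual sketch, not Qwen2-VL's real position-id logic.
    """
    # Text part: identical ids on every axis, i.e. ordinary 1D RoPE.
    text_ids = np.arange(num_text_tokens)
    text_pos = np.stack([text_ids, text_ids, text_ids])  # shape (3, num_text_tokens)

    # Visual part: each token's (frame, row, col) coordinate, offset so it
    # continues after the text positions.
    t, h, w = np.meshgrid(np.arange(grid_t), np.arange(grid_h), np.arange(grid_w), indexing="ij")
    vis_pos = np.stack([t.ravel(), h.ravel(), w.ravel()]) + num_text_tokens  # (3, grid_t*grid_h*grid_w)

    return np.concatenate([text_pos, vis_pos], axis=1)  # (3, total_tokens)

# Example: 5 text tokens followed by a single-frame 2x3 grid of visual tokens.
print(mrope_position_ids(5, 1, 2, 3))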

Model Variants

Qwen2-VL is available in three sizes, with 2B, 7B, and 72B parameters. This repository contains the instruction-tuned 7B model.

Performance

Qwen2-VL handles a wide range of visual tasks with strong speed, accuracy, and efficiency.

Speed

The model processes images and videos quickly, making it well suited to applications that require fast responses.

Accuracy

Qwen2-VL boasts state-of-the-art performance on visual understanding benchmarks, including:

Benchmark        Qwen2-VL-7B   InternVL2-8B   MiniCPM-V 2.6   GPT-4o-mini
MMMU (val)       54.1          51.8           49.8            60.0
DocVQA (test)    94.5          91.6           90.8            -
InfoVQA (test)   76.5          74.8           -               -
ChartQA (test)   83.0          83.3           -               -
TextVQA (val)    84.3          77.4           80.1            -

Efficiency

Qwen2-VL is designed to be efficient, with the ability to handle arbitrary image resolutions and dynamic visual tokens.
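
Concretely, the Hugging Face processor for this model accepts min_pixels and max_pixels arguments that bound the per-image visual token budget; the values below are illustrative and can be tuned to trade accuracy against memory.

from transformers import AutoProcessor

# Each visual token corresponds to a 28x28 pixel patch, so this range allows
# roughly 256-1280 visual tokens per image. The exact values here are only
# an example; adjust them for your own memory/accuracy trade-off.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)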

Examples

  • Prompt: Describe the image at https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg
    Response: A white cat sitting on a wooden table, looking directly at the camera with its paws resting on the edge of the table.
  • Prompt: Identify the similarities between these two images: file:///path/to/image1.jpg and file:///path/to/image2.jpg
    Response: Both images depict a sunny day with clear blue skies and a few white clouds.
  • Prompt: Describe the video at file:///path/to/video1.mp4
    Response: A short video showing a person playing a guitar in a park on a sunny day.

Limitations

While Qwen2-VL is a powerful tool, it’s not perfect. Here are some areas where it needs improvement:

Lack of Audio Support

  • Qwen2-VL can’t understand audio information within videos.

Data Timeliness

  • The image training data extends only to June 2023, so content from after that date may not be covered.

Constraints in Individuals and Intellectual Property (IP)

  • Qwen2-VL has limited capacity to recognize specific individuals or IPs.

Limited Capacity for Complex Instruction

  • When faced with intricate multi-step instructions, Qwen2-VL’s understanding and execution capabilities need enhancement.

Insufficient Counting Accuracy

  • Particularly in complex scenes, the accuracy of object counting is not high.

Weak Spatial Reasoning Skills

  • Especially in 3D spaces, Qwen2-VL’s inference of object positional relationships is inadequate.

Format

Qwen2-VL uses a multimodal architecture that can handle both images and videos, along with text inputs.

Supported Data Formats

  • Images: local files, base64, and URLs
  • Videos: local files (currently, no support for URLs or base64)
  • Text: plain text

Input Requirements

  • Images can be provided as local files, base64-encoded strings, or URLs; videos must currently be provided as local files.
  • Text inputs should be plain text.
  • For videos, you can provide either a list of image frames or a single video file (see the example below).
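
The snippet below sketches how each supported input form is expressed in the chat message format used with this model; the local paths and the base64 string are placeholders, not real assets.

# Illustrative message entries for each supported input form.
# All local paths and the base64 payload below are placeholders.
image_from_url = {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"}
image_from_file = {"type": "image", "image": "file:///path/to/image1.jpg"}
image_from_base64 = {"type": "image", "image": "data:image;base64,/9j/..."}

# Videos: either a single local video file or a list of pre-extracted frames.
video_from_file = {"type": "video", "video": "file:///path/to/video1.mp4"}
video_from_frames = {"type": "video", "video": ["file:///path/to/frame1.jpg", "file:///path/to/frame2.jpg"]}

messages = [
    {"role": "user", "content": [image_from_url, {"type": "text", "text": "Describe this image."}]}
]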

Output Format

  • The model generates text outputs based on the input visual content.

Code Examples

Here’s an example of how to use the Qwen2-VL model with the transformers library:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor (torch_dtype/device_map follow the upstream model card's recommendation)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Define a sample input message containing one image and a text prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the input: render the chat template, then collect the image/video tensors
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

# Generate the output and strip the prompt tokens from the returned sequences
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(output_text)
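
Video inputs follow the same pipeline. The sketch below reuses the model and processor loaded above and swaps in a video entry (the local path is a placeholder):

# Video inference: same preprocessing and generation path, with a video entry
# in the message content. The local path below is a placeholder.
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video1.mp4"},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

text = processor.apply_chat_template(video_messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))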

Dataloop's AI Development Platform

Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.

Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-built pipelines for RAG, RLHF, RLAIF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.

Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.