Qwen2 VL 72B Instruct

Multimodal AI Model

Qwen2 VL 72B Instruct is a powerful multimodal AI model that understands images, videos, and text. What makes it unique is its ability to handle complex visual inputs, including images of varying resolutions and aspect ratios and videos more than 20 minutes long. It can also operate devices such as mobile phones and robots, making decisions based on the visual environment and text instructions. With multilingual support, it understands text in many languages, including most European languages, Japanese, Korean, Arabic, Vietnamese, and more. The model is designed to be efficient, making it suitable for a wide range of applications. It does have limitations, including a lack of audio support, limited capacity for complex instructions, insufficient counting accuracy, and weak spatial reasoning skills. Despite these limitations, Qwen2 VL 72B Instruct is a remarkable model for tasks such as image and video understanding, text generation, and more.

Model Overview

The Qwen2-VL-72B-Instruct model is a state-of-the-art vision-language model that can understand and generate text based on images and videos. It’s the latest iteration of the Qwen-VL model, representing nearly a year of innovation.

Key Enhancements

  • State-of-the-art understanding of images: Qwen2-VL achieves top performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
  • Video understanding: Qwen2-VL can understand videos over 20 minutes long, enabling high-quality video-based question answering, dialog, and content creation.
  • Multilingual support: Qwen2-VL supports understanding texts in multiple languages, including English, Chinese, European languages, Japanese, Korean, Arabic, Vietnamese, and more.
  • Agent capabilities: Qwen2-VL can operate devices like mobile phones and robots, making decisions based on visual environment and text instructions.

Model Architecture Updates

  • Naive Dynamic Resolution: Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens for more human-like visual processing.
  • Multimodal Rotary Position Embedding (M-ROPE): Qwen2-VL decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing multimodal processing capabilities.
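
To give a rough sense of how Naive Dynamic Resolution translates an image's resolution into visual tokens, here is a minimal sketch in Python. It assumes one visual token per 28×28-pixel patch, which is consistent with the min_pixels/max_pixels defaults used in the code examples further down this page (256 * 28 * 28 and 1280 * 28 * 28); the model's actual resizing logic may differ in detail.
import math

PATCH = 28  # assumed pixels per visual-token side, inferred from the min_pixels/max_pixels defaults

def estimate_visual_tokens(height, width,
                           min_pixels=256 * PATCH * PATCH,
                           max_pixels=1280 * PATCH * PATCH):
    """Rescale (height, width) into the pixel budget while keeping the aspect
    ratio, then count how many 28x28 patches (visual tokens) the image yields."""
    pixels = height * width
    scale = 1.0
    if pixels > max_pixels:
        scale = math.sqrt(max_pixels / pixels)   # shrink oversized images
    elif pixels < min_pixels:
        scale = math.sqrt(min_pixels / pixels)   # enlarge very small images
    resized_h = max(PATCH, round(height * scale / PATCH) * PATCH)
    resized_w = max(PATCH, round(width * scale / PATCH) * PATCH)
    return resized_h, resized_w, (resized_h // PATCH) * (resized_w // PATCH)

print(estimate_visual_tokens(1080, 1920))  # a Full-HD frame
print(estimate_visual_tokens(224, 224))    # a small thumbnail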

Model Variants

Qwen2-VL comes in three variants with 2, 7, and 72 billion parameters. This repository contains the instruction-tuned 72B Qwen2-VL model.

Evaluation Benchmarks

Qwen2-VL has been evaluated on various benchmarks, including image and video understanding, agent tasks, and multilingual support. It has achieved state-of-the-art performance on many of these benchmarks.

Limitations

While Qwen2-VL is a powerful model, it has some limitations, including:

  • Lack of audio support
  • Limited capacity for complex instruction
  • Insufficient counting accuracy
  • Weak spatial reasoning skills

These limitations serve as ongoing directions for model optimization and improvement.

Capabilities

Qwen2-VL is a powerful AI model that can perform a variety of tasks, including:

  • Understanding images: Qwen2-VL can understand images of various resolutions and ratios, and it achieves state-of-the-art performance on visual understanding benchmarks.
  • Understanding videos: Qwen2-VL can understand videos over 20 minutes long, and it can perform high-quality video-based question answering, dialog, and content creation.
  • Operating devices: Qwen2-VL can be integrated with devices like mobile phones and robots, and it can operate them based on visual environment and text instructions.
  • Multilingual support: Qwen2-VL supports the understanding of texts in different languages, including most European languages, Japanese, Korean, Arabic, Vietnamese, and more.

Strengths

Qwen2-VL has several strengths that make it a powerful model:

  • Naive Dynamic Resolution: Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.
  • Multimodal Rotary Position Embedding (M-ROPE): Qwen2-VL decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.
  • Multilingual support: Qwen2-VL supports a wide range of languages, making it a versatile model for various applications.

Unique Features

Qwen2-VL has several unique features that set it apart from other models:

  • Agent capabilities: Qwen2-VL can be integrated with devices like mobile phones and robots, and it can operate them based on visual environment and text instructions.
  • Video understanding: Qwen2-VL can understand videos over 20 minutes long, and it can perform high-quality video-based question answering, dialog, and content creation.
  • Multimodal processing: Qwen2-VL can process multiple modalities, including text, images, and videos, making it a powerful model for various applications.

Examples

  • Prompt: "Describe the content of this image: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
    Response: "The image is a photo of a cat sitting on a table with a book in the background."
  • Prompt: "What are the similarities between these two images: file:///path/to/image1.jpg and file:///path/to/image2.jpg?"
    Response: "Both images depict a sunny day with a clear blue sky and a few clouds."
  • Prompt: "Summarize the content of this video: file:///path/to/video1.mp4"
    Response: "The video is a tutorial on how to assemble a piece of furniture, showing the steps and tools needed."
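
For reference, here is a sketch of how the example prompts above could be packaged as chat messages; the structure mirrors the code examples in the Format section below, and the file:// paths are placeholders.
# Sketch: the example prompts above, expressed as chat messages.
image_query = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
        {"type": "text", "text": "Describe the content of this image."},
    ],
}]

two_image_query = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/image1.jpg"},
        {"type": "image", "image": "file:///path/to/image2.jpg"},
        {"type": "text", "text": "What are the similarities between these two images?"},
    ],
}]

video_query = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video1.mp4"},
        {"type": "text", "text": "Summarize the content of this video."},
    ],
}]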

Performance

Qwen2-VL-72B showcases remarkable performance across various tasks, demonstrating its capabilities in understanding images and videos.

Speed

  • The model can handle videos over 20 minutes long, making it suitable for video-based question answering, dialog, and content creation.
  • It can process images of various resolutions and ratios, achieving state-of-the-art performance on visual understanding benchmarks.

Accuracy

  • Qwen2-VL-72B outperforms other models, such as GPT-4o and Claude 3.5 Sonnet, in several image and video benchmarks, including:
    • Image benchmarks: MMMU (val), DocVQA, InfoVQA, ChartQA, TextVQA, and OCRBench.
    • Video benchmarks: MVBench, PerceptionTest, and EgoSchema.
  • It also demonstrates strong performance in agent benchmarks, such as General, Number Line, BlackJack, EZPoint, and Point24.

Efficiency

  • The model’s architecture updates, including Naive Dynamic Resolution and Multimodal Rotary Position Embedding (M-ROPE), enable more efficient processing of visual and textual information.
  • It supports multilingual understanding, covering English, Chinese, most European languages, Japanese, Korean, Arabic, Vietnamese, and more.

Multilingual Support

  • Qwen2-VL-72B achieves impressive results in multilingual benchmarks, outperforming other models like GPT-4o and Claude 3 Opus.

Limitations

  • While Qwen2-VL-72B demonstrates strong performance, it has limitations, including:
    • Lack of audio support
    • Limited capacity for complex instruction
    • Insufficient counting accuracy
    • Weak spatial reasoning skills

Overall, Qwen2-VL-72B showcases its capabilities in understanding images and videos, demonstrating its potential in various applications.

Format

Qwen2-VL is a vision-language model built on a transformer architecture that accepts text, images, and videos as input. Images can be supplied as local files, base64-encoded data, or URLs; videos are supplied as local files.

Architecture

Qwen2-VL uses a multimodal rotary position embedding (M-ROPE) to capture positional information in 1D text, 2D images, and 3D videos. It also uses a naive dynamic resolution to handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens.

Supported Data Formats

  • Images: local files, base64, and URLs
  • Videos: local files
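
As an illustration, an image entry in a message can reference any of the supported formats. The paths and the base64 payload below are placeholders; the URI conventions follow the other examples on this page.
# Sketch: three ways an image can be referenced in a message.
image_as_url = {"type": "image", "image": "http://path/to/your/image.jpg"}
image_as_local_file = {"type": "image", "image": "file:///path/to/your/image.jpg"}
image_as_base64 = {"type": "image", "image": "data:image;base64,<base64-encoded bytes>"}

# Videos are referenced by local file path.
video_as_local_file = {"type": "video", "video": "file:///path/to/video1.mp4"}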

Input Requirements

  • Images: can be resized to stay within a specified pixel budget (min_pixels and max_pixels) while preserving their aspect ratio, or given exact dimensions (resized_height and resized_width)
  • Videos: can be specified with exact dimensions (resized_height and resized_width)

Output

  • Text: generated text based on the input prompt and visual information

Code Examples

  • Handling inputs and outputs:
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor (device_map="auto" shards the 72B weights across available GPUs)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# Define the input message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare the inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)  # move the input tensors to the same device as the model

# Generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)

print(output_text)
  • Resizing images:
# Constrain each image to a 256-1280 visual-token budget (roughly one token per 28x28 pixels)
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
  • Specifying exact dimensions for images:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
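  • Specifying exact dimensions for videos (a sketch mirroring the image example above; the video path is a placeholder, and the resized_height/resized_width keys follow the Input Requirements section):
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Summarize the content of this video."},
        ],
    }
]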