Qwen2 VL 2B Instruct Unsloth Bnb 4bit
Qwen2 VL 2B Instruct Unsloth Bnb 4bit is an efficient 4-bit quantization of Qwen2-VL-2B-Instruct, a vision-language model that processes images and videos for tasks such as image description and video analysis. What sets this quantization apart is that it selectively leaves certain parameters unquantized, improving accuracy while keeping VRAM usage low. The underlying model is multilingual, understands text embedded in images, and handles complex visual inputs thanks to dynamic resolution support and multimodal rotary position embedding (M-ROPE). Qwen2-VL ships in 2, 7, and 72 billion-parameter sizes and can be fine-tuned for specific tasks, making this 2B variant a practical choice for applications that need fast, accurate visual processing.
Model Overview
This repository hosts Unsloth's Dynamic 4-bit Quants build of the model. Dynamic quantization selectively avoids quantizing certain parameters, which improves accuracy while keeping VRAM usage similar to standard BnB 4-bit.
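The motivation for skipping some tensors can be illustrated with a toy sketch (pure Python, not Unsloth's actual kernels): blockwise absmax quantization to 4 bits works well for well-behaved tensors, but a single outlier inflates the block scale and destroys precision for every other value in the block. Dynamic quants leave such sensitive tensors in higher precision.

```python
def quantize_4bit(values, block_size=64):
    """Toy blockwise absmax 4-bit quantization (illustrative, not a real kernel).

    Each block is scaled so its largest magnitude maps into the int range [-7, 7],
    then every value is rounded to one of those 15 levels and mapped back.
    """
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block) or 1.0
        scale = absmax / 7.0
        out.extend(round(v / scale) * scale for v in block)
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

weights = [0.001 * i for i in range(-50, 50)]   # a well-behaved tensor
outlier = weights[:-1] + [5.0]                  # same tensor with one outlier

# The outlier inflates its block's scale, so the small values all round to 0.
err_plain = mean_abs_error(weights, quantize_4bit(weights))
err_outlier = mean_abs_error(outlier, quantize_4bit(outlier))
print(err_plain < err_outlier)  # → True
```

This is why quantizing everything uniformly hurts accuracy on certain layers, and why selectively skipping them recovers most of the quality at little VRAM cost.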
Capabilities
This model can understand images and videos of various resolutions and aspect ratios, and it can run on mobile phones, robots, and other devices. It can also read text in different languages inside images.
Key Features
- Faster Finetuning: Finetune popular models like Llama 3.2, Qwen 2.5, and Gemma 2 up to 5x faster with 70% less memory.
- Multilingual Support: Understand texts in multiple languages, including European languages, Japanese, Korean, Arabic, Vietnamese, and more.
- State-of-the-Art Performance: Achieve top-notch results on visual understanding benchmarks like MathVista, DocVQA, and RealWorldQA.
How it Works
The model uses a multimodal architecture that can handle both images and videos, along with text inputs. This allows it to understand and process multiple types of input, making it a versatile model for various applications.
Performance
The model delivers high accuracy across a range of visual tasks, including image and video understanding. The subsections below summarize speed, benchmark accuracy, and memory efficiency.
Speed
The model is designed to handle a wide range of resolutions. By default it processes inputs at their native resolution; higher resolutions can improve results at the cost of more computation, and the pixel budget is configurable to trade quality for speed.
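The resolution handling can be sketched as follows. This mirrors, in simplified form, the `smart_resize` helper shipped with `qwen_vl_utils`: dimensions are snapped to multiples of the 28-pixel patch stride, then rescaled so the total pixel count stays within a `[min_pixels, max_pixels]` budget (treat the exact defaults here as assumptions).

```python
import math

def smart_resize(height, width, factor=28,
                 min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    """Simplified sketch of Qwen2-VL's dynamic-resolution resizing rule.

    Snaps both sides to multiples of `factor` (the vision patch stride),
    then scales the image so its area lies within [min_pixels, max_pixels]
    while preserving the aspect ratio.
    """
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too many pixels: shrink uniformly, rounding down to the stride.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too few pixels: enlarge uniformly, rounding up to the stride.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

print(smart_resize(1000, 1500))  # → (812, 1204)
```

Raising `max_pixels` lets the model see more detail per image; lowering it bounds the number of vision tokens and thus the compute per request.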
| Model | Finetuning speedup (via Unsloth) |
|---|---|
| Qwen2-VL (2B) | 2x faster |
| Qwen2-VL (7B) | 1.8x faster |
| Qwen2-VL (72B) | Not specified |
Accuracy
The model achieves state-of-the-art performance on various visual understanding benchmarks, including:
| Benchmark | Qwen2-VL (2B) | Qwen2-VL (7B) | Qwen2-VL (72B) |
|---|---|---|---|
| MathVista | 46.0 | 43.0 | Not specified |
| DocVQA | 86.9 | 90.1 | Not specified |
| RealWorldQA | 57.3 | 62.9 | Not specified |
Efficiency
The model is designed to be efficient in terms of memory usage. It can operate with lower memory requirements compared to other models.
| Model | Memory savings (finetuning via Unsloth) |
|---|---|
| Qwen2-VL (2B) | 40% less |
| Qwen2-VL (7B) | 40% less |
| Qwen2-VL (72B) | Not specified |
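As a back-of-envelope check on why 4-bit storage saves so much memory (my own arithmetic, not an official figure): 2B weights in 16-bit floats occupy about 4 GB, while 4-bit storage needs roughly 1 GB plus a small overhead for the per-block scales.

```python
params = 2_000_000_000

fp16_gb = params * 2 / 1e9      # 2 bytes per weight in fp16/bf16
bits4_gb = params * 0.5 / 1e9   # 4 bits = 0.5 bytes per weight

# Blockwise quantization stores roughly one fp16 scale per 64-weight block.
scale_gb = (params / 64) * 2 / 1e9

print(round(fp16_gb, 2), round(bits4_gb + scale_gb, 2))  # → 4.0 1.06
```

Activations, the KV cache, and unquantized layers add to this, so real VRAM usage is higher, but the roughly 4x reduction in weight storage is what makes the 2B model fit on small GPUs.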
Limitations
While this model is powerful, it has some limitations. Here are some of the known restrictions:
Lack of Audio Support
The current model does not comprehend audio information within videos. This means that if you provide a video with audio, the model will only analyze the visual content and ignore the audio.
Data Timeliness
The image training data extends only through June 2023; information from after that date may not be covered, so the model may lack knowledge of very recent events or developments.
Constraints in Individuals and Intellectual Property (IP)
The model’s capacity to recognize specific individuals or IPs is limited, potentially failing to comprehensively cover all well-known personalities or brands.
Limited Capacity for Complex Instruction
When faced with intricate multi-step instructions, the model's understanding and execution can fall short: it may struggle with tasks that require several steps or nuanced interpretation.
Insufficient Counting Accuracy
The model may not always be able to accurately count objects or quantities in images or videos.
Format
Inputs are multimodal: text prompts can be combined with images and videos in a single chat-style message.
Supported Data Formats
- Images: local files, base64, and URLs
- Videos: local files
- Text: raw text inputs
Input Requirements
- Images are resized while keeping their aspect ratio so their pixel count falls within a configurable range (`min_pixels` to `max_pixels`)
- Videos can be input as a list of frames or a single video file
- Text inputs can be combined with images or videos to create a multimodal input
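The requirements above can be combined in one message. The per-image `min_pixels`/`max_pixels` keys follow the `qwen_vl_utils` conventions; the file path and pixel limits below are placeholders.

```python
# A multimodal message mixing an image (with per-image pixel limits) and text.
# The file path and the exact limits are illustrative placeholders.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/photo.jpg",
                "min_pixels": 256 * 28 * 28,   # lower bound on total pixels
                "max_pixels": 1280 * 28 * 28,  # cap to limit vision tokens
            },
            {"type": "text", "text": "What text appears in this image?"},
        ],
    }
]

print(messages[0]["content"][0]["max_pixels"] // (28 * 28))  # → 1280
```

Each content item carries its own settings, so a single request can mix images at different resolution budgets.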
Output Requirements
- The model generates text outputs based on the input prompt and visual information
Code Examples
Here’s an example of how to use the model with a single image input:
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Define the input message
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

# Generate, then strip the prompt tokens so only the response is decoded
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
And here’s an example of how to use the model with a video input:
```python
# Reuses the model and processor loaded in the previous example.
# Define the input message: the video is given as a list of frame files.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": ["file:///path/to/frame1.jpg", "file:///path/to/frame2.jpg", "file:///path/to/frame3.jpg"],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Build the chat prompt and collect the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

# Generate, then strip the prompt tokens so only the response is decoded
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```


