InternVL2-26B
InternVL2-26B is a multimodal large language model that performs strongly across a range of tasks, including document and chart comprehension, infographics QA, scene text understanding, and OCR. The model has 25.5 billion parameters and is trained with an 8k context window, allowing it to handle long texts, multiple images, and videos. It surpasses most open-source models and is competitive with proprietary commercial models, and its architecture and training data make it a strong choice for tasks that require a deep understanding of multimodal inputs.
Model Overview
The InternVL2-26B model is a multimodal large language model that combines computer vision and natural language processing capabilities. It is designed to handle a wide range of tasks, from document and chart comprehension to scene text understanding and OCR.
Capabilities
The model is a powerful multimodal large language model that can handle multiple forms of input, including text, images, and videos.
Primary Tasks
- Text Understanding: The model can comprehend and respond to text-based inputs, making it suitable for tasks like chatbots, language translation, and text summarization.
- Image Understanding: It can analyze and describe images, enabling applications like image captioning, object detection, and visual question answering.
- Video Understanding: The model can process and understand videos, making it useful for tasks like video captioning, action recognition, and video question answering.
Strengths
- Multimodal Capabilities: The model can handle multiple forms of input, making it a versatile model for various applications.
- Competitive Performance: It performs on par with proprietary commercial models across many capabilities.
- Large Context Window: The model is trained with an 8k context window, allowing it to process and understand longer inputs.
Unique Features
- Instruction-Tuned: The model is instruction-tuned, meaning it has been fine-tuned to follow user instructions across tasks.
- Large Language Model: With 25.5B parameters, the model is a powerful tool for complex tasks and applications.
- Support for Multiple GPUs: The model can be sharded across multiple GPUs so that its weights fit in memory during inference (see the sketch below).
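As a rough illustration of the multi-GPU point above, the snippet below shards the model across all visible GPUs. Note that `device_map="auto"` (via Accelerate) is an assumption made here for brevity; the official model card instead builds an explicit device map so the vision encoder and language-model layers are placed deliberately.

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-26B"
# Shard the 25.5B-parameter model across available GPUs so it fits in memory.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",  # assumption: automatic sharding rather than the model card's explicit map
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```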
Performance Benchmarks
The model delivers strong results across document, chart, and text-centric benchmarks:
| Benchmark | InternVL2-26B |
|---|---|
| DocVQA | 92.9 |
| ChartQA | 84.9 |
| InfoVQA | 75.9 |
| TextVQA | 82.3 |
| OCRBench | 825 |
Grounding Ability
The model demonstrates strong grounding ability, with high scores in various grounding benchmarks:
| Model | avg. | RefCOCO (val) | RefCOCO (testA) | RefCOCO (testB) |
|---|---|---|---|---|
| InternVL2-26B | 88.5 | 91.2 | 93.3 | 87.4 |
Limitations
While the model is powerful, it’s not perfect. Here are some of its limitations:
- Biases and Harmful Content: The model may still produce unexpected outputs, such as biases, discrimination, or other harmful content.
- Context Window Limitations: The model has an 8k context window, which means it can only process a limited amount of text or image data at a time.
- Quantization Errors: When using BNB 4-bit quantization, the model may produce nonsensical outputs and fail to understand images.
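Because 4-bit BNB quantization can break image understanding, 8-bit loading is the safer quantized configuration. A minimal sketch, assuming the `bitsandbytes` package is installed:

```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-26B"
# 8-bit weights roughly halve memory versus bfloat16 while avoiding the
# nonsensical outputs reported with the 4-bit setting above.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```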
Format
The model is a multimodal large language model that accepts text, image, and video inputs. It is designed to process long texts, multiple images, and video frames, making it a powerful tool for various tasks.
Architecture
The model consists of three main components:
- InternViT-6B-448px-V1-5: the vision encoder that processes image tiles
- internlm2-chat-20b: the language model that handles text understanding and generation
- MLP projector: a small network that maps vision features into the language model's input space
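Once the model is loaded (see the Code Examples section below), the three components can be inspected directly. The attribute names used here (`vision_model`, `language_model`, `mlp1`) are an assumption based on the remote-code `InternVLChatModel` implementation rather than a documented API, so verify them with `print(model)` if needed.

```python
# Assumed attribute names from the remote-code implementation; check print(model)
# if your installed revision names them differently.
print(model.vision_model.__class__.__name__)    # vision encoder (InternViT-6B-448px-V1-5)
print(model.language_model.__class__.__name__)  # language model (internlm2-chat-20b)
print(model.mlp1)                               # MLP projector bridging vision and language
```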
Data Formats
The model supports the following data formats:
- Text: tokenized text sequences
- Images: images in various formats (e.g., JPEG, PNG)
- Videos: videos in various formats (e.g., MP4, AVI)
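For video, the usual pattern is to sample a handful of frames and feed them to the model as a sequence of images. The sketch below is only an outline and rests on several assumptions: it uses the `decord` library to read frames, reuses the 448×448 `build_transform` helper from the Code Examples section below, treats each frame as a single tile, and assumes `chat()` accepts a `num_patches_list` argument as in the official model card; `'video.mp4'` is a placeholder path.

```python
import numpy as np
import torch
from PIL import Image
from decord import VideoReader, cpu

def sample_frames(video_path, num_segments=8):
    # Evenly sample num_segments frames from the video as RGB PIL images.
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    indices = np.linspace(0, len(vr) - 1, num_segments).astype(np.int64)
    return [Image.fromarray(vr[int(i)].asnumpy()).convert('RGB') for i in indices]

frames = sample_frames('video.mp4', num_segments=8)
transform = build_transform(input_size=448)  # helper defined in the Code Examples section
pixel_values = torch.stack([transform(f) for f in frames]).to(torch.bfloat16).cuda()
num_patches_list = [1] * len(frames)  # one 448x448 tile per frame

video_prefix = ''.join(f'Frame{i + 1}: <image>\n' for i in range(len(frames)))
question = video_prefix + 'Describe what happens in the video.'
response = model.chat(tokenizer, pixel_values, question, generation_config,
                      num_patches_list=num_patches_list)
print(response)
```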
Special Requirements
When using the model, keep the following in mind:
- Input size: images are split into 448×448 tiles before being passed to the vision encoder
- Batch size: the default preprocessing uses up to 12 tiles per image
- Tokenization: text inputs should be tokenized using the `AutoTokenizer` from the `transformers` library (with `use_fast=False`)
Code Examples
Here are some code examples to get you started with the model:
Loading the model
```python
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-26B"
# Load the model in bfloat16 on a single GPU, with FlashAttention enabled.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
```
Processing images
```python
import torch
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode
from PIL import Image

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    # Resize to input_size x input_size, convert to a tensor, and ImageNet-normalize.
    return T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
    ])

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size)
    # dynamic_preprocess (from the official model card) tiles the image into up to max_num 448x448 crops.
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = torch.stack([transform(tile) for tile in images])
    return pixel_values
```
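`dynamic_preprocess` is the tiling helper defined in the official model card: it splits the image into up to `max_num` aspect-ratio-matched 448×448 crops (plus an optional thumbnail). If you only need the snippet above to run end to end, a simplified stand-in that skips tiling looks like this (at the cost of detail on large or wide images):

```python
def dynamic_preprocess(image, image_size=448, use_thumbnail=True, max_num=12):
    # Simplified stand-in: return the whole image as a single tile. The real
    # helper from the model card returns several aspect-ratio-matched crops
    # so the model sees high-resolution detail.
    return [image.resize((image_size, image_size))]
```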
Generating text
```python
# Pure-text conversation: pass pixel_values=None when no image is provided.
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```
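Building on the helpers above, a single-image visual question answering call looks roughly like this, following the pattern in the official model card; `'./example.jpg'` is a placeholder path and the `<image>` token marks where the image is inserted into the prompt:

```python
# Preprocess the image, move it to the GPU in bfloat16, and ask a question about it.
pixel_values = load_image('./example.jpg', max_num=12).to(torch.bfloat16).cuda()
question = '<image>\nPlease describe the image in detail.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')
```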