InternVL2 26B

Multimodal LLM

InternVL2-26B is a multimodal large language model that performs strongly across a wide range of tasks, including document and chart comprehension, infographics QA, scene text understanding, and OCR. The model has 25.5 billion parameters and is trained with an 8K context window, allowing it to handle long texts, multiple images, and videos. It surpasses most open-source models and is competitive with proprietary commercial models across many capabilities, making it a strong choice for tasks that require a deep understanding of multimodal inputs.

OpenGVLab · MIT License · Updated 8 months ago

Model Overview

The InternVL2-26B model is a multimodal large language model that combines computer vision and natural language processing capabilities. It is designed to handle a wide range of tasks, from document and chart comprehension to scene text understanding and OCR.

Capabilities

InternVL2-26B is a multimodal large language model that accepts multiple forms of input, including text, images, and videos.

Primary Tasks

  • Text Understanding: The model can comprehend and respond to text-based inputs, making it suitable for tasks like chatbots, language translation, and text summarization.
  • Image Understanding: It can analyze and describe images, enabling applications like image captioning, object detection, and visual question answering.
  • Video Understanding: The model can process and understand videos, making it useful for tasks like video captioning, action recognition, and video question answering.

Strengths

  • Multimodal Capabilities: The model can handle multiple forms of input, making it a versatile model for various applications.
  • Competitive Performance: It demonstrates competitive performance on par with proprietary commercial models across various capabilities.
  • Large Context Window: The model is trained with an 8k context window, allowing it to process and understand longer inputs.

Unique Features

  • Instruction-Tuned: The model is instruction-tuned, so it is designed to follow user instructions across chat and task-oriented prompts.
  • Large Language Model: With 25.5B parameters, the model is a powerful tool for complex tasks and applications.
  • Support for Multiple GPUs: The model can be sharded across multiple GPUs, enabling inference when a single device does not have enough memory (see the sketch below).
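
The multi-GPU support mentioned above can be exercised by letting the transformers/accelerate stack place the weights automatically. This is a minimal sketch rather than the repository's own device-splitting recipe; device_map="auto" is an assumption that automatic sharding suits your GPU topology.

import torch
from transformers import AutoModel

path = "OpenGVLab/InternVL2-26B"
# device_map="auto" shards the 25.5B parameters across all visible GPUs
# (assumption: accelerate is installed and automatic placement is acceptable).
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",
).eval()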

Performance Benchmarks

The model posts strong results across standard multimodal benchmarks:

Benchmark    InternVL2-26B
DocVQA       92.9
ChartQA      84.9
InfoVQA      75.9
TextVQA      82.3
OCRBench     825

Grounding Ability

The model demonstrates strong grounding ability, with high scores in various grounding benchmarks:

Model            avg.    RefCOCO (val)    RefCOCO (testA)    RefCOCO (testB)
InternVL2-26B    88.5    91.2             93.3               87.4

Limitations

While the model is powerful, it’s not perfect. Here are some of its limitations:

  • Biases and Harmful Content: The model may still produce unexpected outputs, such as biases, discrimination, or other harmful content.
  • Context Window Limitations: The model has an 8k context window, which means it can only process a limited amount of text or image data at a time.
  • Quantization Errors: When using BNB 4-bit quantization, the model may produce nonsensical outputs and fail to understand images.
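
As noted above, 4-bit BNB quantization can produce nonsensical outputs, so 8-bit loading is the safer fallback when GPU memory is tight. A minimal sketch, assuming bitsandbytes is installed; the exact memory savings will depend on your hardware.

from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-26B"
# 8-bit weights roughly halve memory relative to bf16 while avoiding the
# 4-bit failure mode described above (assumption: bitsandbytes is installed).
model = AutoModel.from_pretrained(
    path,
    load_in_8bit=True,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
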
Examples

Prompt: Describe the content of the image in detail.
Response: The image depicts a beautiful sunset over a serene ocean. The sky is painted with hues of orange, pink, and purple, while the waves gently lap against the shore. In the distance, a few seagulls can be seen flying overhead.

Prompt: Can you solve the math problem: 2x + 5 = 11?
Response: To solve for x, we need to isolate the variable. Subtracting 5 from both sides gives us 2x = 6. Dividing both sides by 2, we get x = 3.

Prompt: Write a short poem about the image of a cat sitting on a windowsill.
Response: Whiskers twitching, eyes so bright, our feline friend basks in the light. The windowsill, a cozy throne, where she surveys her kingdom, all her own.

Format

The model is a multimodal large language model that can handle both text and image inputs. It’s designed to process long texts, multiple images, and videos, making it a powerful tool for various tasks.

Architecture

The model consists of three main components:

  • InternViT-6B-448px-V1-5: a vision part that processes images
  • internlm2-chat-20b: a language part that processes text
  • MLP projector: a component that combines the outputs of the vision and language parts
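
How these three components are wired together is visible in the model configuration. Below is a minimal sketch for inspecting it; the printed field names come from the repository's custom configuration class and may differ between versions.

from transformers import AutoConfig

# trust_remote_code pulls in the repository's custom chat configuration class;
# this downloads only the config, not the weights.
config = AutoConfig.from_pretrained("OpenGVLab/InternVL2-26B", trust_remote_code=True)
print(config)  # shows the vision encoder and language model sub-configs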

Data Formats

The model supports the following data formats:

  • Text: tokenized text sequences
  • Images: images in various formats (e.g., JPEG, PNG)
  • Videos: videos in various formats (e.g., MP4, AVI)
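
For video inputs, a common approach is to sample a fixed number of frames and pass them to the model as a sequence of images. The sketch below uses decord for frame extraction; the library choice, frame count, and even-spacing strategy are illustrative assumptions rather than the only supported pipeline.

import numpy as np
from decord import VideoReader, cpu
from PIL import Image

def sample_frames(video_path, num_frames=8):
    # Pick num_frames evenly spaced frames from the video (assumed strategy).
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return [Image.fromarray(vr[i].asnumpy()) for i in indices]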

Special Requirements

When using the model, keep the following in mind:

  • Input size: images are preprocessed into 448x448 pixel tiles
  • Tiling: each image is split into at most 12 tiles (max_num=12) during dynamic preprocessing
  • Tokenization: text inputs should be tokenized using the AutoTokenizer from the transformers library

Code Examples

Here are some code examples to get you started with the model:

Loading the model

import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-26B"
# Load the weights in bfloat16 with FlashAttention enabled; trust_remote_code
# pulls in the custom InternVL chat model implementation.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

Processing images

import torch
import torchvision.transforms as T
from torchvision.transforms import InterpolationMode
from PIL import Image

# ImageNet normalization statistics used by the InternViT vision encoder.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD)
    ])
    # dynamic_preprocess (splitting the image into up to max_num 448x448 tiles,
    # plus an optional thumbnail) is defined in the official model card.
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

Generating text

generation_config = dict(max_new_tokens=1024, do_sample=True)

# Pure-text conversation: passing None for pixel_values means no image is attached.
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
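
Chatting about an image

Building on load_image above, a single-image conversation uses the same chat interface, with the pixel values passed in and an <image> placeholder in the prompt. The image path below is a placeholder for your own file.

# Single-image, single-round conversation.
pixel_values = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
question = '<image>\nPlease describe the image in detail.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')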