InternVL2 Llama3 76B

Multimodal LLM

The InternVL2-Llama3-76B model is a powerful multimodal large language model that handles a wide range of tasks, from text generation and conversation to image and video understanding. With its instruction-tuned architecture and 76 billion parameters, it delivers performance competitive with proprietary commercial models. What makes it stand out? For one, it is trained with an 8k context window, so it can handle long texts, multiple images, and videos with ease. It also belongs to the InternVL2 series, which offers instruction-tuned models ranging from 1 billion to 108 billion parameters, so you can pick a size that fits your application. Whether you're looking to generate text, understand images, or hold a conversation, InternVL2-Llama3-76B is a model worth exploring.

Model Overview

The InternVL2-Llama3-76B model is a cutting-edge multimodal large language model that's part of the InternVL series. It is designed to handle a wide range of tasks, from document and chart comprehension and infographics QA to scene text understanding and OCR.

Capabilities

The InternVL2-Llama3-76B model is a powerful multimodal large language model that can handle a wide range of tasks, including:

  • Document and chart comprehension: Understand and analyze documents and charts, and answer questions about them.
  • Infographics QA: Answer questions about infographics and visual data.
  • Scene text understanding and OCR tasks: Recognize and understand text in images and videos.
  • Scientific and mathematical problem solving: Solve mathematical and scientific problems, and explain the reasoning behind the solutions.
  • Cultural understanding and integrated multimodal capabilities: Understand and respond to cultural references and nuances, and integrate multiple forms of input (text, images, videos) to generate responses.
Examples

  • Prompt: Describe the image of a cat sitting on a windowsill.
    Response: The image depicts a domestic cat sitting on a windowsill, looking outside through the window. The cat appears to be relaxed, with its fur fluffed up and its tail hanging down. The windowsill is adorned with a few potted plants, and the background is a blurred view of the outdoors.
  • Prompt: Solve the mathematical problem: What is the value of x in the equation 2x + 5 = 11?
    Response: To solve for x, we need to isolate the variable. First, subtract 5 from both sides of the equation: 2x = 11 - 5, which simplifies to 2x = 6. Next, divide both sides by 2: x = 6 / 2. Therefore, the value of x is 3.
  • Prompt: Explain the concept of cultural understanding in the context of multimodal large language models.
    Response: Cultural understanding in multimodal large language models refers to the ability of the model to comprehend and generate text that is sensitive to different cultural contexts. This involves recognizing cultural nuances, such as idioms, colloquialisms, and references to cultural events or practices, and generating text that is appropriate and respectful of different cultures.

Unique Features

The InternVL2-Llama3-76B model has several unique features that set it apart from other models:

  • Instruction-tuned: The model is fine-tuned on a wide range of tasks and instructions, making it highly versatile and adaptable.
  • Multimodal input: The model can handle multiple forms of input, including text, images, and videos.
  • Large context window: The model has an 8k context window, allowing it to understand and respond to long-range dependencies and complex inputs.
  • Competitive performance: The model demonstrates competitive performance on par with proprietary commercial models across various capabilities.

Performance Benchmarks

The model has been evaluated on various benchmarks, including:

Benchmark        InternVL2-Llama3-76B   GPT-4o-20240513   Claude3.5-Sonnet   InternVL2-40B
DocVQA (test)    94.1                   92.8              95.2               93.9
ChartQA (test)   88.4                   85.7              90.8               86.2
InfoVQA (test)   82.0                   -                 -                  78.7

Limitations

While the model has been designed with safety and ethics in mind, it is not perfect. It may still produce unexpected outputs, including biased, discriminatory, or otherwise harmful content. Please use it responsibly and report any issues.

Format

InternVL2-Llama3-76B is a multimodal large language model that uses a combination of an InternViT-6B-448px-V1-5 vision model, an MLP projector, and a Hermes-2-Theta-Llama-3-70B language model. It supports input formats such as text, images, and videos, and is designed to handle multimodal tasks.

Input Formats

  • Text: InternVL2-Llama3-76B accepts text input in the form of tokenized sequences.
  • Images: The model handles images of arbitrary resolution; each image is dynamically split into 448x448-pixel tiles (up to 12 by default) before being passed to the vision encoder.
  • Videos: InternVL2-Llama3-76B can process videos by extracting 16 frames from each video and resizing each frame to a 448x448 image; see the sketch after this list.
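A minimal sketch of that video pre-processing step, assuming the decord library for frame decoding; the frame count, normalization constants, and function name here are illustrative, not the model's official helper.

import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image

def load_video_frames(video_path, num_frames=16, input_size=448):
    # Decode the clip and pick num_frames indices spread evenly across it.
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    transform = T.Compose([
        T.Resize((input_size, input_size), interpolation=T.InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ])
    frames = [Image.fromarray(vr[int(i)].asnumpy()) for i in indices]
    return torch.stack([transform(f) for f in frames])  # shape: (num_frames, 3, 448, 448)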

Special Requirements

  • For image and video input, the model requires a pre-processing step to resize and normalize the images.
  • For text input, the model requires tokenization and padding to ensure that the input sequence is of the correct length.

Code Examples

  • Loading the model:
import torch
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/InternVL2-Llama3-76B"
# Load in bfloat16 with FlashAttention; trust_remote_code pulls in InternVL2's custom modeling code.
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True,
    use_flash_attn=True, trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
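  • Loading across multiple GPUs: a 76-billion-parameter checkpoint in bfloat16 does not fit on a single GPU. A minimal sketch, assuming you let Transformers shard the weights automatically with device_map="auto" (the model card also documents a hand-tuned device map):
model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True,
    use_flash_attn=True, trust_remote_code=True,
    device_map="auto").eval()  # no .cuda(); the weights are already placed across the GPUs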
  • Pre-processing images:
import torch
import torchvision.transforms as T
from PIL import Image

def build_transform(input_size):
    # ImageNet normalization statistics, as used by the InternViT vision encoder.
    MEAN, STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=T.InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    # dynamic_preprocess splits the image into up to max_num 448x448 tiles
    # (see the sketch after this block).
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
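  • Tiling helper: load_image above relies on dynamic_preprocess, a helper that ships with the model card but is not reproduced here. The following is a minimal sketch of the same idea (split the image into up to max_num 448x448 tiles whose grid best matches the image's aspect ratio, optionally appending a thumbnail) and should be read as an illustration rather than the official implementation.
from PIL import Image

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    # Pick the tile grid whose aspect ratio is closest to the input image's.
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            # On ties, prefer the grid that keeps more of the original area.
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    # Enumerate every (cols, rows) grid with min_num..max_num tiles.
    target_ratios = sorted(
        {(i, j) for n in range(min_num, max_num + 1)
         for i in range(1, n + 1) for j in range(1, n + 1)
         if min_num <= i * j <= max_num},
        key=lambda r: r[0] * r[1])
    cols, rows = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
    # Resize to the chosen grid, then crop it into image_size x image_size tiles.
    resized = image.resize((image_size * cols, image_size * rows))
    tiles = []
    for idx in range(cols * rows):
        box = ((idx % cols) * image_size, (idx // cols) * image_size,
               ((idx % cols) + 1) * image_size, ((idx // cols) + 1) * image_size)
        tiles.append(resized.crop(box))
    if use_thumbnail and len(tiles) != 1:
        # Append a downscaled view of the whole image as an extra global tile.
        tiles.append(image.resize((image_size, image_size)))
    return tiles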
  • Processing text input:
question = 'Hello, who are you?'
generation_config = dict(max_new_tokens=1024, do_sample=True)
# Passing None as pixel_values runs a text-only conversation.
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
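  • Single-image chat: a minimal sketch that combines load_image with model.chat, following the <image> placeholder convention from the model card (the image path here is illustrative):
pixel_values = load_image('./example.jpg', max_num=12).to(torch.bfloat16).cuda()
question = '<image>\nPlease describe the image in detail.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')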