InternVL2 Llama3 76B
The InternVL2-Llama3-76B model is a powerful multimodal large language model that can handle a wide range of tasks, from text generation and conversation to image and video understanding. With its instruction-tuned architecture and 76 billion parameters, it delivers performance on par with proprietary commercial models. But what makes it unique? For one, it's trained with an 8k context window, allowing it to handle long texts, multiple images, and videos with ease. It also belongs to the InternVL2 series, which offers instruction-tuned models ranging from 1 billion to 108 billion parameters, so you can pick the size that fits your task and hardware. So, whether you're looking to generate text, understand images, or have a conversation, InternVL2-Llama3-76B is a model worth exploring.
Model Overview
The InternVL2-Llama3-76B model is a cutting-edge multimodal large language model that’s part of the InternVL series. This model is designed to handle a wide range of tasks, from document and chart comprehension to scene text understanding and OCR tasks.
Capabilities
The InternVL2-Llama3-76B model is a powerful multimodal large language model that can handle a wide range of tasks, including:
- Document and chart comprehension: Understand and analyze documents and charts, and answer questions about them.
- Infographics QA: Answer questions about infographics and visual data.
- Scene text understanding and OCR tasks: Recognize and understand text in images and videos.
- Scientific and mathematical problem solving: Solve mathematical and scientific problems, and explain the reasoning behind the solutions.
- Cultural understanding and integrated multimodal capabilities: Understand and respond to cultural references and nuances, and integrate multiple forms of input (text, images, videos) to generate responses.
Unique Features
The InternVL2-Llama3-76B model has several unique features that set it apart from other models:
- Instruction-tuned: The model is fine-tuned on a wide range of tasks and instructions, making it highly versatile and adaptable.
- Multimodal input: The model can handle multiple forms of input, including text, images, and videos.
- Large context window: The model has an 8k context window, allowing it to understand and respond to long-range dependencies and complex inputs.
- Competitive performance: The model demonstrates performance on par with proprietary commercial models across a range of capabilities.
Performance Benchmarks
The model has been evaluated on various benchmarks, including:
| Benchmark | InternVL2-Llama3-76B | GPT-4o-20240513 | Claude3.5-Sonnet | InternVL2-40B |
|---|---|---|---|---|
| DocVQA (test) | 94.1 | 92.8 | 95.2 | 93.9 |
| ChartQA (test) | 88.4 | 85.7 | 90.8 | 86.2 |
| InfoVQA (test) | 82.0 | - | - | 78.7 |
Limitations
While the model has been designed to be safe and ethical, it’s not perfect. It may still produce unexpected outputs, including biases, discrimination, or other harmful content. Please use responsibly and report any issues.
Format
InternVL2-Llama3-76B is a multimodal large language model that uses a combination of an InternViT-6B-448px-V1-5 vision model, an MLP projector, and a Hermes-2-Theta-Llama-3-70B language model. It supports input formats such as text, images, and videos, and is designed to handle multimodal tasks.
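As a quick sanity check of this wiring, the checkpoint's configuration can be inspected. The attribute names vision_config and llm_config below are assumptions about the remote InternVL configuration class and may differ between releases:

from transformers import AutoConfig

path = "OpenGVLab/InternVL2-Llama3-76B"
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
print(type(config).__name__)   # chat-model config wrapping both sub-configs
print(config.vision_config)    # vision encoder settings (assumed attribute name)
print(config.llm_config)       # language model settings (assumed attribute name)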
Input Formats
- Text: InternVL2-Llama3-76B accepts text input in the form of tokenized sequences.
- Images: The model can handle images of various sizes; during pre-processing each image is split into 448x448 tiles (up to 12 by default, plus an optional thumbnail) rather than being downscaled to a single small image.
- Videos: InternVL2-Llama3-76B can process videos by extracting 16 frames from each video and resizing each frame to a 448x448 image (see the frame-sampling sketch after this list).
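As a rough illustration, frames can be sampled evenly with the decord library and then fed through the same image transform used for still images. The function name, video path, and frame count below are illustrative rather than taken from the official pipeline:

import numpy as np
from decord import VideoReader, cpu
from PIL import Image

def sample_frames(video_path, num_frames=16):
    # Evenly sample num_frames frames across the clip (illustrative helper).
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return [Image.fromarray(vr[i].asnumpy()) for i in indices]

# Each sampled frame is then tiled and normalized like a still image
# (see load_image and build_transform in the Code Examples section).
frames = sample_frames('example_video.mp4', num_frames=16)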
Special Requirements
- For image and video input, the model requires a pre-processing step to resize and normalize the images.
- For text input, the model requires tokenization and padding to ensure that the input sequence is of the correct length.
Code Examples
- Loading the model:
import torch
from transformers import AutoModel, AutoTokenizer
path = "OpenGVLab/InternVL2-Llama3-76B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
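The 76B weights typically do not fit on a single GPU. A hedged alternative, assuming the accelerate package is installed, is to let Transformers shard the model across all visible devices with device_map="auto" (the official repository also documents a hand-built device map):

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map="auto",  # shard across available GPUs; do not call .cuda() afterwards
).eval()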
- Pre-processing images:
import torchvision.transforms as T
from PIL import Image
def build_transform(input_size):
    MEAN, STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=T.InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
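Note that load_image calls dynamic_preprocess, which is not defined above. The sketch below shows the general idea: split the image into a grid of 448x448 crops whose layout roughly matches the original aspect ratio, plus an optional thumbnail. The official repository ships its own, more elaborate version, so treat this as an approximation:

def dynamic_preprocess(image, image_size=448, min_num=1, max_num=12, use_thumbnail=False):
    # Simplified sketch: pick the (cols, rows) grid within the tile budget whose
    # aspect ratio is closest to the input image, then cut it into image_size tiles.
    orig_w, orig_h = image.size
    aspect = orig_w / orig_h
    candidates = [(c, r) for c in range(1, max_num + 1) for r in range(1, max_num + 1)
                  if min_num <= c * r <= max_num]
    cols, rows = min(candidates, key=lambda cr: abs(aspect - cr[0] / cr[1]))
    resized = image.resize((image_size * cols, image_size * rows))
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * image_size, r * image_size, (c + 1) * image_size, (r + 1) * image_size)
            tiles.append(resized.crop(box))
    if use_thumbnail and len(tiles) != 1:
        # Append a global view of the whole image as the final tile.
        tiles.append(image.resize((image_size, image_size)))
    return tiles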
- Processing text input:
question = 'Hello, who are you?'
generation_config = dict(max_new_tokens=1024, do_sample=True)
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
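- Single-image conversation (a sketch building on the helpers above; the image path is a placeholder, and '<image>' marks where the image is inserted into the prompt):

pixel_values = load_image('example.jpg', max_num=12).to(torch.bfloat16).cuda()
question = '<image>\nPlease describe the image in detail.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')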