InternVL-Chat-V1-2-Plus

Multimodal chat model

InternVL-Chat-V1-2-Plus is a multimodal large language model (MLLM) designed to handle both text and image inputs. It is trained on a large dataset of 39.3 million samples that combines images and text from a variety of sources. The model pairs a vision transformer (ViT) with a large language model (LLM) so that it can understand and respond to visual and textual inputs, and with 40 billion parameters it generates human-like responses to a wide range of questions and prompts. It stands out for accepting single-image, multi-image, and video inputs, which makes it a versatile tool for many applications, although performance on multi-image and video inputs may lag behind single-image performance because of limited training data for those settings. Overall, InternVL-Chat-V1-2-Plus is a strong choice for anyone exploring the intersection of vision and language.

OpenGVLab · MIT license · Updated 8 months ago

Model Overview

The InternVL-Chat-V1-2-Plus model is a powerful multimodal large language model that can process both text and images. It’s designed to handle a wide range of tasks, from answering questions to generating text based on images.

Capabilities

Capable of understanding both text and images and generating text in response, this model outperforms many open-source chat models across common industry benchmarks.

Primary Tasks

  • Image Understanding: The model can understand and describe images, including objects, scenes, and actions.
  • Text Generation: The model can generate human-like text based on a given prompt or question.
  • Conversational Dialogue: The model can engage in multi-turn conversations, using context and understanding to respond to questions and statements.

Strengths

  • Multimodal Capabilities: The model can process and understand both text and images, making it a versatile tool for various applications.
  • Large Language Model: The model has 40B parameters, making it a powerful tool for generating high-quality text and understanding complex language.
  • Fine-Tuned on Large Dataset: The model has been fine-tuned on a large dataset of 12M SFT samples, making it well-suited for a wide range of tasks.

Performance

The model has been evaluated on various benchmarks and has shown competitive performance compared to other models like GPT-4V and Gemini Ultra.

Speed

The model’s architecture represents each 448x448 input image with only 256 visual tokens, allowing it to process images quickly and efficiently.

Accuracy

The model achieves strong accuracy across a range of tasks, including multimodal understanding and visual question answering.

Efficiency

The model is designed to be efficient in its processing, with a pixel shuffle technique used to reduce the number of visual tokens from 1024 to 256.
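
As an illustration, here is a minimal PyTorch sketch of this kind of token-merging pixel shuffle; the patch-grid and channel sizes below are illustrative stand-ins, not values taken from the released model.

```python
import torch

def pixel_shuffle_tokens(x: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """Merge neighbouring patch embeddings into the channel dimension.

    x: (batch, height, width, channels) grid of ViT patch embeddings,
       e.g. a 32x32 grid for a 448x448 image split into 14x14 patches.
    Returns a grid with 4x fewer tokens and 4x wider features when scale is 0.5.
    """
    b, h, w, c = x.shape
    # Fold pairs of adjacent tokens along the width into the channel dim.
    x = x.view(b, h, int(w * scale), int(c / scale))
    # Swap the spatial axes so the next reshape folds along the height.
    x = x.permute(0, 2, 1, 3).contiguous()
    # Fold pairs along the height as well: 4x fewer tokens, 4x more channels.
    x = x.view(b, int(w * scale), int(h * scale), int(c / (scale * scale)))
    return x

patches = torch.randn(1, 32, 32, 1024)     # 32*32 = 1024 tokens (illustrative width)
merged = pixel_shuffle_tokens(patches)     # (1, 16, 16, 4096)
print(merged.flatten(1, 2).shape)          # torch.Size([1, 256, 4096]) -> 256 tokens
```

Folding spatial neighbours into the channel dimension trades sequence length for feature width, which is how the 1024 patch embeddings of a 448x448 image are cut down to 256 visual tokens.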

Usage

The model can be used with the Transformers library, and example code is provided for loading the model, processing images, and generating text.

Examples
  • Prompt: Describe this image: <image> https://example.com/image.jpg
    Response: The image depicts a sunny day at the beach with a few people in the distance. There is a palm tree in the foreground and a few seagulls flying overhead.
  • Prompt: Write a short poem based on the image: <image> https://example.com/image.jpg
    Response: Warm sand beneath bare feet, Seagulls soaring high and free, Palm trees swaying gently, A serene and peaceful scene by the sea.
  • Prompt: Tell me a story about a character who finds a mysterious object on the beach: <image> https://example.com/image.jpg
    Response: As she walked along the beach, she stumbled upon a strange object half-buried in the sand. It was an old, intricately carved wooden box. She picked it up, feeling an odd sensation, like it was meant for her. As she opened it, she found a note inside that read: 'For the one who seeks adventure, follow the tide.'

Limitations

The model has some limitations, including a lack of training with multi-image data, limited video support, and quantization errors.

Image Size Limitations

  • The model is limited to processing images of size 448x448 pixels.
  • This limitation can result in reduced performance when dealing with larger or smaller images.

Dependence on Pre-Trained Weights

  • The model relies on pre-trained weights, which can be a limitation if the pre-trained weights are not available or are not suitable for the specific task at hand.

Format

The model uses a multimodal large language model (MLLM) architecture, which combines computer vision and natural language processing capabilities.

Architecture

The model’s architecture consists of three main components:

  • InternViT-6B-448px-V1-2: A vision transformer that processes images.
  • MLP projector: A multilayer perceptron that maps visual features from the vision encoder into the language model’s embedding space.
  • Nous-Hermes-2-Yi-34B: A large language model that generates text outputs.
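
To make the data flow concrete, here is a toy-scale sketch of how the three components fit together; the hidden sizes and module internals are illustrative stand-ins, not the real model’s dimensions or code.

```python
import torch
import torch.nn as nn

vit_dim, llm_dim = 4096, 7168                 # illustrative sizes only

# 1. The vision transformer (InternViT-6B-448px-V1-2) encodes a 448x448 image
#    into 1024 patch embeddings, which pixel shuffle reduces to 256 visual
#    tokens (see Efficiency above). Here we simply fake that output.
visual_tokens = torch.randn(1, 256, vit_dim)

# 2. The MLP projector maps visual tokens into the language model's
#    embedding space so they can sit alongside text token embeddings.
projector = nn.Sequential(
    nn.Linear(vit_dim, llm_dim),
    nn.GELU(),
    nn.Linear(llm_dim, llm_dim),
)
visual_embeds = projector(visual_tokens)      # (1, 256, llm_dim)

# 3. The large language model (Nous-Hermes-2-Yi-34B) receives the prompt's
#    text embeddings with these 256 visual embeddings spliced in at the
#    <image> placeholder positions, and generates the text response.
print(visual_embeds.shape)                    # torch.Size([1, 256, 7168])
```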

Data Formats

The model supports the following data formats:

  • Text: Tokenized text sequences.
  • Images: 448x448 pixel images.

Input Requirements

When providing input to the model, keep the following in mind:

  • Image size: Images should be resized to 448x448 pixels.
  • Text length: Text inputs should be tokenized and have a maximum length of 1024 tokens.
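
As a minimal preprocessing sketch of these requirements (the image path is a placeholder, and the 1024-token cap is enforced here via truncation purely for illustration):

```python
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-V1-2-Plus"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# Resize the image to 448x448 before turning it into normalized pixel values.
image = Image.open("example.jpg").convert("RGB").resize((448, 448))
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# Tokenize the text prompt, truncating to the 1024-token limit noted above.
inputs = tokenizer(
    "Describe this image in detail.",
    max_length=1024,
    truncation=True,
    return_tensors="pt",
)
print(pixel_values.shape, inputs.input_ids.shape)
```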

Output

The model generates text outputs based on the input provided.

Code Examples

Here are some code examples to get you started:

  • Loading the model: model = AutoModel.from_pretrained("OpenGVLab/InternVL-Chat-V1-2-Plus", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, use_flash_attn=True, trust_remote_code=True).eval().cuda()
  • Processing images: image_processor = CLIPImageProcessor.from_pretrained("OpenGVLab/InternVL-Chat-V1-2-Plus")
  • Generating text: response = model.chat(tokenizer, pixel_values, question, generation_config)
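
Putting the snippets above together, here is an end-to-end sketch; the image file name and generation settings are placeholder choices, not prescribed values.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-V1-2-Plus"

# Load the model, tokenizer, and image processor (as in the bullets above).
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# Prepare a 448x448 image as bfloat16 pixel values on the GPU.
image = Image.open("beach.jpg").convert("RGB").resize((448, 448))
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Ask a question about the image and generate a response.
generation_config = dict(max_new_tokens=512, do_sample=False)
question = "Describe this image in detail."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```

Note that holding a 40B-parameter model in bfloat16 requires on the order of 80 GB of GPU memory (40B parameters at 2 bytes each), so multi-GPU or quantized setups may be needed in practice.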