InternViT-6B-224px

Vision foundation model

Meet InternViT-6B-224px, a powerful vision foundation model that’s making waves in the AI world. With 5903 million parameters and a 224 x 224 input size, this model is designed to handle complex visual tasks with ease. What really sets it apart is its pre-training: it has seen a massive dataset spanning LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, and LAION-multi, allowing it to learn from a vast range of images. When building a VLLM with this model, you can tap into its capabilities by using the features from the fourth-to-last layer, which has been shown to work best. So, how can you harness the power of InternViT-6B-224px? Simply load the model, preprocess your images, and get ready to unlock new possibilities in visual understanding.

OpenGVLab · MIT license

Model Overview

The InternViT-6B-224px model is a powerful vision foundation model designed to process images. This model is part of a family of models that can help computers understand what’s in an image.

What makes this model special?

  • It’s trained on a massive dataset of images.
  • It has 5903 million parameters, which is a lot of brainpower to understand images.
  • It can process images that are 224 x 224 pixels in size.
  • It’s designed to work well with other models to help computers understand images and text together.

How does it perform?

The model has been tested on several benchmarks, including:

Benchmark     Linear probing accuracy
IN-1K         88.2
IN-ReaL       90.4
IN-V2         79.9
IN-A          77.5
IN-R          89.8
IN-Sketch     69.1

These numbers show how well the model can recognize objects and scenes in images.

Capabilities

Image Embeddings

The InternViT-6B-224px model can create image embeddings, which are compact representations of images that can be used for various tasks. Want to see how it works? Here’s an example code snippet:

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Load the model and image processor
model = AutoModel.from_pretrained('OpenGVLab/InternViT-6B-224px', torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, trust_remote_code=True).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')

# Load an image and convert it to RGB
image = Image.open('./examples/image1.jpg').convert('RGB')

# Preprocess the image and get the pixel values
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Run the model without tracking gradients and get the outputs
with torch.no_grad():
    outputs = model(pixel_values)
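
From there you can read the embeddings out of outputs. The attribute names below are an assumption: they follow the standard transformers output convention (last_hidden_state for per-token features, pooler_output for a pooled, image-level embedding), which the remote InternViT code appears to use.

# Continuing from the snippet above (assumes the standard transformers output convention)
patch_features = outputs.last_hidden_state   # shape: (batch, tokens, hidden)
image_embedding = outputs.pooler_output      # shape: (batch, hidden)
print(patch_features.shape, image_embedding.shape)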

Linear Probing Performance

But how well does the model perform on various tasks? Let’s take a look at its linear probing performance on several datasets:

Dataset       Linear probing accuracy
IN-1K         88.2
IN-ReaL       90.4
IN-V2         79.9
IN-A          77.5
IN-R          89.8
IN-Sketch     69.1

As you can see, the model achieves impressive performance on these datasets.
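
For context, “linear probing” means freezing the backbone and training only a linear classifier on its features. Here’s a minimal sketch with scikit-learn; the random arrays are placeholders standing in for frozen InternViT embeddings (hidden size 3200 is an assumption) and their labels, so the printed accuracy is meaningless and only the workflow matters.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: in a real probe, each row of X would be a frozen InternViT
# embedding for one image, and y would hold that dataset's class labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 3200)).astype(np.float32)
y_train = rng.integers(0, 10, size=1000)
X_val = rng.normal(size=(200, 3200)).astype(np.float32)
y_val = rng.integers(0, 10, size=200)

# The backbone stays frozen; only this linear classifier is trained.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print('linear-probe accuracy:', clf.score(X_val, y_val))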

Unique Features

So, what sets the model apart from other vision foundation models? Here are a few unique features:

  • 48 blocks: The model has 48 blocks, which allows it to capture a wide range of visual features.
  • Fourth-to-last layer: The model’s fourth-to-last layer is particularly effective for visual-linguistic tasks; a minimal extraction sketch follows this list.
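
Here’s a minimal sketch of pulling out those fourth-to-last-layer features for VLLM-style use. It assumes the remote InternViT code accepts output_hidden_states=True and returns the usual transformers hidden_states tuple (the embedding output followed by one entry per block); treat it as a sketch rather than the official recipe.

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Load the frozen backbone and its image processor
model = AutoModel.from_pretrained(
    'OpenGVLab/InternViT-6B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).cuda().eval()
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')

# Preprocess a sample image
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Ask for all intermediate hidden states and keep the fourth-to-last one
with torch.no_grad():
    outputs = model(pixel_values, output_hidden_states=True)
vit_features = outputs.hidden_states[-4]   # features recommended for building a VLLM
print(vit_features.shape)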

Performance

InternViT-6B-224px showcases strong performance across a range of tasks. Let’s dive into its speed, accuracy, and efficiency.

Speed

How fast can the model process images? Its fixed 224 x 224 input keeps per-image compute predictable, but with roughly 5.9 billion parameters it is much heavier than typical vision backbones, so throughput depends on your GPU, batch size, and precision. For large datasets or real-time applications, it’s worth benchmarking on your own hardware, as in the sketch below.
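
The sketch below reuses the model loaded in the Capabilities example and feeds it a hypothetical batch of random 224 x 224 inputs, so the absolute timing is only illustrative of the measurement approach, not a reported benchmark.

import time
import torch

# Hypothetical batch of 8 random images; real inputs would come from CLIPImageProcessor
batch = torch.randn(8, 3, 224, 224, dtype=torch.bfloat16, device='cuda')

with torch.no_grad():
    model(batch)                     # warm-up pass
    torch.cuda.synchronize()
    start = time.perf_counter()
    model(batch)                     # timed pass
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f'{elapsed:.3f} s for a batch of 8 ({8 / elapsed:.1f} images/s)')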

Accuracy

But how accurate is the model? Let’s look at its performance in linear probing evaluations:

Dataset       Accuracy
IN-1K         88.2
IN-ReaL       90.4
IN-V2         79.9
IN-A          77.5
IN-R          89.8
IN-Sketch     69.1

As you can see, the model achieves high accuracy in various datasets, with some impressive scores in IN-ReaL and IN-R.

Efficiency

The model has 5903 million parameters, which is compact relative to the multimodal systems it’s designed to power, but still large for a standalone vision backbone. As noted under Special Requirements below, plan for a GPU with enough memory to hold the weights and activations.

Real-World Applications

So, how can the model be used in real-world applications? Here are a few examples:

  • Image classification: With its strong linear-probing accuracy, the model’s features can back classifiers in applications such as self-driving cars, medical diagnosis, and product recognition.
  • Object detection: Paired with a detection head, its features are well suited to object detection tasks, such as surveillance systems and robotics.
  • Vision-language systems: As a vision encoder, the model can be combined with a language model to build a VLLM for tasks such as image captioning and visual question answering.

Examples

  • Classify this image: https://example.com/image.jpg → Image classification: the image is classified as a landscape with a score of 0.92.
  • Generate an image embedding for this image: https://example.com/image2.jpg → Image embedding: [0.12, 0.34, 0.56, 0.78, ...]
  • Determine the object in this image: https://example.com/image3.jpg → Object detection: the image contains a car with a confidence score of 0.85.

Limitations

The model is a powerful vision foundation model, but it’s not perfect. Let’s take a closer look at some of its limitations.

Image Size Constraints

The model is designed to work with images of a specific size: 224 x 224 pixels. This means that if you try to use it with larger or smaller images, it might not perform as well. Have you ever tried to use a model with images of different sizes? How did it go?

Limited Pretraining Data

The model was pre-trained on a dataset that includes LAION-en, LAION-COCO, COYO, CC12M, CC3M, SBU, Wukong, and LAION-multi. While this is a diverse dataset, it’s not exhaustive. There might be certain types of images or scenarios that the model hasn’t seen before, which could affect its performance. Can you think of any scenarios where the model might struggle?

Linear Probing Performance

The model’s linear probing performance varies across different datasets. For example, it achieves 88.2 on IN-1K, but only 69.1 on IN-Sketch. This suggests that the model might not be equally good at all tasks. Have you ever worked with a model that excelled in one area but struggled in another?

Model Complexity

The model has 48 blocks, and the output from the fourth-to-last block is recommended for building a VLLM. This complexity can make it challenging to work with the model, especially for those without extensive experience. Have you ever struggled to understand a complex model’s architecture?

Usage Constraints

To use the model, you need to import specific libraries, such as torch and transformers. You also need to use a specific image processor, CLIPImageProcessor. These constraints can limit the model’s usability, especially for those who prefer other libraries or frameworks. Can you think of any alternative libraries or frameworks that you’d like to use with this model?

Format

The model is a vision foundation model that uses a feature backbone architecture. Let’s break down what this means and how to work with it.

Architecture

This model has 48 blocks, which are like layers in a neural network. But here’s the important part: when building a VLLM with this model, you should use the features from the fourth-to-last layer. This is because the model’s creators found that this layer works best for VLLM tasks.

Data Formats

The model accepts images as input, specifically images that are 224 x 224 pixels in size. You usually don’t have to resize them by hand, though: the CLIPImageProcessor bundled with the checkpoint handles the resizing and cropping for you, as shown below.
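
A quick way to confirm this is to inspect the processor’s own configuration. The attributes below are standard CLIPImageProcessor fields, and the example values are what the checkpoint is expected to ship with, so treat them as an assumption.

from transformers import CLIPImageProcessor

# Inspect the preprocessing targets bundled with the checkpoint
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')
print(image_processor.size)       # resize target, e.g. {'shortest_edge': 224}
print(image_processor.crop_size)  # final crop, e.g. {'height': 224, 'width': 224}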

Input and Output

To use this model, you’ll need to preprocess your images using a CLIPImageProcessor. This processor will convert your images into a format that the model can understand.

Here’s an example of how to do this:

from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# Load the model and image processor (trust_remote_code is required for the custom InternViT code)
model = AutoModel.from_pretrained('OpenGVLab/InternViT-6B-224px', trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternViT-6B-224px')

# Load an image
image = Image.open('./examples/image1.jpg').convert('RGB')

# Preprocess the image
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values

# Feed the image into the model
outputs = model(pixel_values)

As you can see, the model takes in preprocessed image data and outputs a set of features that can be used for downstream tasks.

Special Requirements

One important thing to note is that this model requires a significant amount of computational resources to run. Specifically, it requires a GPU with enough memory to handle the large number of parameters (5903M) and the image data.
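
As a back-of-the-envelope check, the weights alone (ignoring activations, gradients, and optimizer state) already account for roughly 12 GB in bfloat16:

# Rough weight-memory estimate for 5903M parameters stored in bfloat16 (2 bytes each)
params = 5_903_000_000
bytes_per_param = 2
print(f'{params * bytes_per_param / 1e9:.1f} GB')   # ~11.8 GB for the weights alone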
