VLM2Vec Full

Multimodal Embedding Model

VLM2Vec Full is a unified multimodal embedding model built on top of a well-trained vision-language backbone (Phi-3.5-V). It is trained with contrastive learning using in-batch negatives on MMEB-train and, evaluated across the 36 datasets of MMEB-eval, it outperforms existing baselines by a significant margin. Because it embeds both images and text into a single vector space, VLM2Vec Full can efficiently handle tasks like image-text matching and text-image retrieval.

Developed by TIGER Lab · License: apache-2.0

Model Overview

The VLM2Vec model maps images and text into a shared embedding space, so it can look at a picture and score how well a piece of text describes it, or take a sentence and find a matching image.

What makes VLM2Vec special?

  • It’s trained on a large, diverse set of image-text tasks (MMEB-train), which helps it learn the connections between the two modalities.
  • It uses contrastive learning with in-batch negatives, which teaches it to pull matching image-text pairs together and push mismatched pairs apart (a minimal sketch of this objective follows this list).
  • It can be applied to a variety of embedding tasks, such as classification, visual question answering, retrieval, and visual grounding.
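
To make the training objective concrete, here is a minimal sketch of a contrastive loss with in-batch negatives, written as a standard InfoNCE-style formulation; the exact loss, temperature, and batching used to train VLM2Vec may differ from this illustration.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, target_emb, temperature=0.05):
    # query_emb, target_emb: (batch, dim) L2-normalized embeddings of matching pairs
    logits = query_emb @ target_emb.T / temperature
    # the matching target for each query sits on the diagonal of the (batch, batch)
    # similarity matrix; every other entry in the row acts as an in-batch negative
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)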

Capabilities

The VLM2Vec model is a powerful tool for multimodal embedding tasks. It can handle a wide range of tasks, from image-text matching to text-image retrieval.

What can VLM2Vec do?

  • Image-Text Matching: VLM2Vec takes an image and a text query as input and scores how well they match. For example, given an image of a cat and a dog and the text “A cat and a dog”, it returns a high similarity score.
  • Text-Image Retrieval: VLM2Vec can also take a text query as input and find the most relevant image in a database. For example, asked to find an image matching the caption “A cat and a tiger”, it returns an image containing a cat and a tiger (a minimal retrieval sketch follows this list).
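
As a concrete illustration of the retrieval use case, here is a minimal sketch of ranking a gallery of image embeddings against a text-query embedding. The tensors below are random placeholders and the 3072-dimensional size is only an assumption; in practice the embeddings would come from the model calls shown later in this card.

import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for VLM2Vec outputs (dimension assumed)
text_query_emb = F.normalize(torch.randn(1, 3072), dim=-1)   # embedding of the caption query
image_embs = F.normalize(torch.randn(1000, 3072), dim=-1)    # embeddings of a gallery of 1000 images

scores = text_query_emb @ image_embs.T                       # (1, 1000) cosine similarities
top_scores, top_idx = scores.topk(k=5, dim=-1)               # the 5 best-matching images
print(top_idx.tolist())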

Performance

How does VLM2Vec compare with other models? The table below shows scores for two example tasks.

Task                 | VLM2Vec | Other Models
Image + Text -> Text | 0.3008  | 0.2000
Text -> Image        | 0.2930  | 0.2500

As you can see, VLM2Vec performs significantly better than other models in these tasks.

Examples

Let’s take a look at some examples of how VLM2Vec performs in different tasks.

  • Image + Text -> Text: the query "<|image_1|> Represent the given image with the following question: What is in the image" scored against the text "A cat and a dog" gives tensor([[0.3008]], device='cuda:0', dtype=torch.bfloat16)
  • Text -> Image: the query "Find me an everyday image that matches the given caption: A cat and a dog." scored against the example image gives tensor([[0.2930]], device='cuda:0', dtype=torch.bfloat16)
  • Text -> Image (mismatched caption): the query "Find me an everyday image that matches the given caption: A cat and a tiger." scored against the same image gives tensor([[0.2012]], device='cuda:0', dtype=torch.bfloat16)

Limitations

While VLM2Vec is a powerful model, it’s not perfect. Let’s take a closer look at some of its limitations.

Training Data

The model was trained on a specific dataset (MMEB-train) and evaluated on another (MMEB-eval). Its strong results are tied to those 36 benchmark datasets, so performance may not generalize as well to datasets or tasks that look very different.

Contrastive Learning

The model uses contrastive learning, which can lead to some challenges. For example, the model might not perform well when the input data is not well-represented in the training set.

Format

VLM2Vec is a vision-language model that uses a transformer architecture. It’s designed to handle both image and text inputs, and it’s great at understanding the relationships between them.

Supported Data Formats

VLM2Vec accepts two types of inputs:

  • Text: You can input text sequences, like sentences or phrases.
  • Image: You can input images, like photos or diagrams.

Special Requirements

To use VLM2Vec, you need to preprocess your inputs with the model’s processor, which tokenizes the text and converts images into the tensors the model expects.

Here’s an example of how to use the processor:

# model_args and load_processor come from the VLM2Vec repository code (see the fuller setup sketch below)
processor = load_processor(model_args)
# '<|image_1|>' marks where the image is inserted into the prompt
inputs = processor('<|image_1|> Represent the given image with the following question: What is in the image', [Image.open('figures/example.jpg')])

In this example, we’re using the processor to convert an image and a text sequence into a format that VLM2Vec can understand.
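
For a fuller picture, here is a setup sketch in the style of the VLM2Vec repository’s example usage. The src.* imports, the ModelArguments fields, and the MMEBModel.load call are assumptions about that repository’s interface, so treat this as a sketch rather than a definitive recipe.

import torch
from PIL import Image
from src.model import MMEBModel            # modules from the VLM2Vec repository (assumed layout)
from src.arguments import ModelArguments
from src.utils import load_processor

model_args = ModelArguments(
    model_name='TIGER-Lab/VLM2Vec-Full',
    pooling='last',                         # pool the last token's hidden state (assumed setting)
    normalize=True,                         # L2-normalize the embeddings (assumed setting)
    model_backbone='phi3_v')                # Phi-3.5-V backbone

processor = load_processor(model_args)
model = MMEBModel.load(model_args)
model = model.to('cuda', dtype=torch.bfloat16)
model.eval()

# Prepare an (instruction + image) query, then move the tensors to the GPU
inputs = processor('<|image_1|> Represent the given image with the following question: What is in the image',
                   [Image.open('figures/example.jpg')])
inputs = {key: value.to('cuda') for key, value in inputs.items()}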

Model Architecture

VLM2Vec is based on the Phi-3.5-V model, which is a well-known vision-language model. It’s been fine-tuned for multimodal embedding tasks, which means it’s great at understanding the relationships between images and text.
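
Conceptually, the backbone encodes the interleaved image and text tokens, and a single hidden state is pooled and L2-normalized to produce the embedding. The sketch below illustrates last-token pooling, matching the pooling setting assumed in the setup example above; it is an illustration of the idea, not the model’s internal code.

import torch
import torch.nn.functional as F

def pool_last_token(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, dim) from the backbone; attention_mask: (batch, seq_len)
    last_idx = attention_mask.sum(dim=1) - 1                      # position of the last real token
    pooled = hidden_states[torch.arange(hidden_states.size(0)), last_idx]
    return F.normalize(pooled, p=2, dim=-1)                       # unit-length embedding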

Handling Inputs and Outputs

Here’s an example of how to use VLM2Vec to compute the similarity between an image and a text sequence:

# 'inputs' is the (instruction + image) pair prepared with the processor above
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]   # query-side embedding

# Encode the candidate text and compare it to the query embedding
string = 'A cat and a dog'
inputs = processor(string)
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]   # target-side embedding
print(string, '=', model.compute_similarity(qry_output, tgt_output))
## A cat and a dog = tensor([[0.3008]], device='cuda:0', dtype=torch.bfloat16)

In this example, we’re using VLM2Vec to compute the similarity between an image and a text sequence. The output is a tensor that represents the similarity between the two inputs.
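
The reverse direction (Text -> Image) follows the same pattern with the roles swapped: the caption becomes the query and the image becomes the target. A sketch, reusing the model and processor from above and the prompts from the Examples section; the exact target-side instruction is an assumption.

# Text -> Image: the caption is now the query
inputs = processor('Find me an everyday image that matches the given caption: A cat and a dog.')
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]

# The image is now the target
inputs = processor('<|image_1|> Represent the given image.', [Image.open('figures/example.jpg')])
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(model.compute_similarity(qry_output, tgt_output))
# the Examples section above reports tensor([[0.2930]]) for this caption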

Dataloop's AI Development Platform
Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.

Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.

Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.