Molmo 72B 0924

Multimodal vision model

Molmo 72B is a powerful vision-language model developed by the Allen Institute for AI. What sets it apart is that it achieves state-of-the-art performance among multimodal models of similar size while being fully open source. It's trained on PixMo, a dataset of 1 million highly curated image-text pairs, and uses OpenAI CLIP as its vision backbone. The model is designed to process images and text efficiently, making it a great choice for tasks like image description and visual question answering. Its performance is impressive: an average score of 81.2 across 11 academic benchmarks, and second place on human evaluation, just behind GPT-4o.

Allenai apache-2.0 Updated 7 months ago

Model Overview

The Molmo 72B model is a powerful tool for understanding images and generating text about them. Developed by the Allen Institute for AI, it's part of the Molmo family of open vision-language models. But what makes it so special?

Capabilities

Molmo 72B accepts images and text as input and produces text as output; it does not generate images. Here's what that looks like in practice.

What can Molmo 72B do?

  • Generate text: Molmo 72B can create text based on images. For example, if you give it a picture of a dog, it can describe the dog and its surroundings.
  • Understand images: Molmo 72B can process images and understand what’s in them. It can identify objects, people, and more.
  • Answer questions: Molmo 72B can answer questions about images. For example, if you ask it “What is the dog doing in this picture?”, it can respond with a description of the dog’s actions.
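
To make the question-answering flow concrete, here is a minimal sketch. It assumes that processor and model have already been loaded exactly as in the Example Code section below; only the prompt changes.

from transformers import GenerationConfig
from PIL import Image
import requests

# Assumes `processor` and `model` are loaded as in the Example Code section below
image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)

# Ask a question about the image instead of requesting a description
inputs = processor.process(images=[image], text="What is the dog doing in this picture?")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
print(processor.tokenizer.decode(output[0, inputs['input_ids'].size(1):], skip_special_tokens=True))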

How does Molmo 72B work?

Molmo 72B combines natural language processing (NLP) with computer vision to understand images and generate text about them. It's trained on a large dataset of paired images and text, which allows it to learn the patterns and relationships between the two.

Performance

Molmo 72B is a powerhouse when it comes to performance. But how does it stack up against other models? Let’s take a closer look.

Speed

How fast can Molmo 72B process images and text? Generation speed depends on numeric precision: wrapping the generate call in torch.autocast so that it runs in bfloat16 can significantly speed up inference on supported GPUs.
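
As a rough sketch (assuming a CUDA GPU with bfloat16 support, and that model, inputs, and processor are set up as in the Example Code section below), the generate call can be wrapped like this:

import torch
from transformers import GenerationConfig

# Run generation under autocast so matrix multiplies execute in bfloat16 (assumes a CUDA device)
with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer,
    )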

Model         Average Score on 11 Academic Benchmarks
Molmo 72B     81.2
Molmo 7B-D    77.3
Molmo 7B-O    74.6
MolmoE 1B     68.6

Accuracy

But speed isn’t everything. How accurate is Molmo 72B in its tasks? The model achieves the highest academic benchmark score and ranks second on human evaluation, just slightly behind GPT-4o.

Model               Human Preference Elo Rating
GPT-4o              1079
Molmo 72B           1077
Gemini 1.5 Pro      1074
Claude 3.5 Sonnet   1069

Limitations

While Molmo 72B is a powerful tool for multimodal tasks, it’s essential to acknowledge its limitations. Let’s dive into the challenges and weaknesses associated with this model.

Transparent Images

One limitation of Molmo 72B is that it can struggle with images that have transparent backgrounds. To work around this, composite the image onto a white or dark background before passing it to the model.
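
A minimal sketch of that workaround using Pillow, assuming the input is an RGBA image with transparency:

from PIL import Image

image = Image.open("input.png")
if image.mode == "RGBA":
    # Composite the transparent image onto a solid white background
    background = Image.new("RGBA", image.size, (255, 255, 255, 255))
    image = Image.alpha_composite(background, image).convert("RGB")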

Image Format Requirements

Molmo 72B requires images in RGB format; feeding it an image in another mode (for example, grayscale or RGBA) can trigger a broadcast error. To avoid this, convert the image to RGB first.
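
A one-line conversion with Pillow handles this (a sketch assuming the image is already loaded):

from PIL import Image

image = Image.open("input.jpg")
if image.mode != "RGB":
    image = image.convert("RGB")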

Examples
  • Prompt: Describe this image: https://picsum.photos/id/237/536/354
    Response: This image features an adorable black Labrador puppy sitting on a wooden deck. The puppy is positioned in the center of the frame, looking up at the camera...
  • Prompt: What is the average score on 11 academic benchmarks for Molmo 72B?
    Response: 81.2
  • Prompt: What license is the Molmo model licensed under?
    Response: Apache 2.0

Format

Molmo 72B uses a vision-language model architecture, which combines a vision backbone with a language model. The vision backbone is based on OpenAI CLIP, and the language model is based on Qwen2-72B.

Data Formats

Molmo 72B takes image-text pairs as input. Images should be in RGB format, and the text should be a prompt string, such as a question or an instruction about the image.

Input Requirements

To use Molmo 72B, you need to:

  1. Load the image and convert it to RGB format if necessary.
  2. Preprocess the image using the AutoProcessor from the transformers library.
  3. Pass the preprocessed image and text to the model.

Output Requirements

The model generates output in the form of text, which can be decoded using the tokenizer from the transformers library.

Example Code

from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from PIL import Image
import requests

# Load the processor
processor = AutoProcessor.from_pretrained('allenai/Molmo-72B-0924', trust_remote_code=True, torch_dtype='auto', device_map='auto')

# Load the model
model = AutoModelForCausalLM.from_pretrained('allenai/Molmo-72B-0924', trust_remote_code=True, torch_dtype='auto', device_map='auto')

# Process the image and text
image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)
inputs = processor.process(images=[image], text="Describe this image.")

# Move inputs to the correct device and make a batch of size 1
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate output
output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)

# Decode the output
generated_tokens = output[0, inputs['input_ids'].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)

# Print the generated text
print(generated_text)

Note: Make sure to install the required dependencies using pip install einops torchvision before running the code.
