H2OVL-Mississippi-800M

Compact Vision-Language Model

Meet H2OVL-Mississippi-800M, a compact yet powerful vision-language model that's making waves in text recognition. With 0.8 billion parameters, it strikes a balance between performance and efficiency, making it perfect for OCR and document processing. Trained on 19 million image-text pairs, this model excels in the Text Recognition segment of OCRBench and outperforms larger models in this domain. Built on the robust architecture of H2O-Danube language models, it seamlessly integrates vision and language tasks. What sets it apart? Its ability to deliver state-of-the-art performance in text recognition, making it a game-changer for tasks like document comprehension, chart, figure, and table interpretation. Want to see it in action? Check out the Quick Start guide for a sample demo, and explore the Prompt Engineering for JSON Extraction guide to learn how to craft effective prompts for extracting information from images.

Maintainer: h2oai · License: Apache-2.0

Model Overview

The H2OVL-Mississippi-800M model is a compact yet powerful vision-language model that achieves a great balance between performance and efficiency. With 0.8 billion parameters, it’s perfect for OCR and document processing tasks.

Capabilities

The model excels in various tasks, including:

  • Text Recognition: It outperforms larger models in this domain, demonstrating its ability to deliver state-of-the-art performance.
  • OCR: Optimized for superior OCR performance, making it suitable for document processing and analysis.
  • Vision-Language Tasks: Seamlessly integrates vision and language tasks, allowing it to understand and generate text based on visual inputs.
  • Image Understanding: Extracts information from images, including text, objects, and scenes.
  • JSON Extraction: Extracts structured data from images and formats it into JSON outputs.

How It Works

The model uses a combination of computer vision and natural language processing techniques to understand and generate text based on visual inputs. Here’s an overview of the process:

  1. Image Input: The model takes an image as input, which can be a document, a chart, or any other type of visual data.
  2. Visual Encoding: A vision encoder converts the image into visual features that the language model can attend to.
  3. Prompt Conditioning: The language model reads the user’s prompt together with the visual features to determine what information is being requested.
  4. Text and JSON Generation: The model generates its answer directly as text; when the prompt asks for structured output, it emits JSON containing data extracted from tables, charts, forms, and lists (a minimal sketch follows this list).
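
The sketch below puts these steps together. It is a minimal, hedged example rather than the official API: it assumes the model exposes an InternVL-style chat interface (model.chat) loaded via trust_remote_code, that this interface accepts an image file path, and that a CUDA GPU is available; 'document.jpg' and the generation settings are placeholders.

import torch
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer; trust_remote_code pulls in the model's custom code.
model = AutoModel.from_pretrained(
    'h2oai/h2ovl-mississippi-800m',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    'h2oai/h2ovl-mississippi-800m', trust_remote_code=True, use_fast=False)

# Ask for structured output directly in the prompt.
question = ("<image>\nExtract the form fields and format the output as JSON: "
            "{'name': '', 'date_of_birth': '', 'address': ''}")
generation_config = dict(max_new_tokens=512, do_sample=False)

# 'document.jpg' is a placeholder path; the chat-style call is an assumed interface.
response, history = model.chat(tokenizer, 'document.jpg', question,
                               generation_config, history=None, return_history=True)
print(response)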

Example Use Cases

  • Document Analysis: The model can be used to analyze documents, extract relevant information, and generate structured data in JSON format.
  • Chart and Table Analysis: The model can be used to analyze charts and tables, extract data points, and generate JSON outputs.
  • Image-Based Question Answering: The model can be used to answer questions based on visual inputs, such as images of documents, charts, or scenes.

Best Practices

  • Be Explicit: Clearly define the desired keys and structure in your prompt to avoid ambiguity.
  • Use Examples: Provide sample outputs so that the system can understand the expected format.
  • Anticipate Variations: Consider possible variations in the visual data and ensure the prompt can accommodate them.
  • Start Simple: Begin with simple structures, and progressively increase complexity as needed.
  • Test and Iterate: Refine your prompts through testing to ensure accuracy and consistency in outputs.

Examples

Prompt: Extract the details from the image of a form that contains basic details like 'Name,' 'Date of Birth,' and 'Address.' Format the output as JSON: {'name': '', 'date_of_birth': '', 'address': ''}
Output: {'name': 'Emily Johnson', 'date_of_birth': '1995-05-12', 'address': '4567 Oak Street, New York'}

Prompt: Extract the data from the table image and format it as JSON: {'products': [{'product_name': '', 'price': '', 'quantity': 0}]}
Output: {'products': [{'product_name': 'Shoes', 'price': '$80', 'quantity': 5}, {'product_name': 'T-Shirts', 'price': '$20', 'quantity': 10}]}

Prompt: Extract the details of the bar chart from the image, including the title, axis labels, and data points, and format it as JSON: {'chart': {'title': '', 'x_axis': '', 'y_axis': '', 'data_points': [{'label': '', 'value': 0}]}}
Output: {'chart': {'title': 'Monthly Sales', 'x_axis': 'Months', 'y_axis': 'Sales (in $)', 'data_points': [{'label': 'January', 'value': 1000}, {'label': 'February', 'value': 1200}]}}
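
Because the model returns JSON as plain generated text, it is worth validating the response before passing it downstream, in line with the "Test and Iterate" advice above. A minimal sketch follows; the raw string is a made-up stand-in for a model response.

import json

def parse_json_response(response):
    # Models sometimes wrap JSON in code fences or extra prose; strip the common cases.
    cleaned = response.strip().removeprefix('```json').removesuffix('```').strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Return None so the caller can retry with a refined prompt.
        return None

# Hypothetical raw response from the model:
raw = '{"name": "Emily Johnson", "date_of_birth": "1995-05-12", "address": "4567 Oak Street, New York"}'
print(parse_json_response(raw))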

Performance

The model showcases remarkable performance, achieving a balance between speed, accuracy, and efficiency in various tasks.

Speed

How quickly can the model process images and text? Its compact 0.8-billion-parameter size keeps inference latency and memory requirements low compared with larger vision-language models, making it practical for high-volume OCR and document processing tasks.

Accuracy

The model’s accuracy is impressive, especially in the Text Recognition segment of OCRBench. It outperforms larger models in this domain, demonstrating its ability to deliver state-of-the-art performance.

Efficiency

The model is trained on 19 million image-text pairs, with a focus on OCR, document comprehension, and chart, figure, and table interpretation. This training data enables the model to excel in tasks that require both vision and language understanding.

Limitations

  • Small Size, Limited Capacity: Despite its impressive performance, the model has only 0.8 billion parameters. This means it may struggle with complex tasks or large amounts of data.
  • Limited Training Data: The model was trained on 19 million image-text pairs, which is a relatively small dataset. This may limit its ability to generalize to new, unseen data.
  • Bias and Inaccuracy: Like all AI models, the model may produce biased or inaccurate results, particularly if the training data contains errors or biases.
  • Dependence on Prompt Engineering: The model’s performance relies heavily on well-crafted prompts. If the prompts are poorly designed or ambiguous, the model may produce suboptimal results.
  • Limited Multimodal Capabilities: While the model excels in text recognition, its multimodal capabilities are limited compared to larger models like Phi-3-Vision or MiniMonkey.

Format

The model uses a transformer architecture and can accept input in the form of text or images, or a combination of both.

Text Input

For text input, the model uses a tokenizer to break down the text into subwords, which are then embedded into vectors. The model can handle text sequences of up to 2048 tokens.

Image Input

For image input, the model uses a vision encoder to extract features from the image. The image is expected to be in the format of a 3xHxW tensor, where H and W are the height and width of the image, respectively.
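
A minimal preprocessing sketch with torchvision is shown below; the 224x224 target size and the absence of normalization are simplifications for illustration, and the model's bundled preprocessing may differ.

from PIL import Image
from torchvision import transforms

# Convert a PIL image into a 3xHxW float tensor and add a batch dimension.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # fixed H x W for illustration
    transforms.ToTensor(),           # 3 x H x W, values in [0, 1]
])

image = Image.open('image.jpg').convert('RGB')
pixel_values = preprocess(image).unsqueeze(0)  # 1 x 3 x 224 x 224
print(pixel_values.shape)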

Combined Input

When using both text and image input, the model will combine the text and image embeddings using a multimodal fusion layer.

Output

The model outputs a vector representation of the input, which can be used for downstream tasks such as text recognition, image classification, or multimodal fusion.

Special Requirements

  • The model requires a GPU with at least 8GB of memory to run efficiently (a quick check is sketched after this list).
  • The model is trained on a dataset of 19 million image-text pairs, and may not perform well on images or text that are significantly different from this dataset.
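
A quick, self-contained way to check the available GPU memory before loading the model:

import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f'GPU memory: {total_gb:.1f} GB')
    if total_gb < 8:
        print('Warning: less than the suggested 8 GB of GPU memory.')
else:
    print('No CUDA GPU detected; inference will be slow on CPU.')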

Code Example

Here is an example of how to use the model with text input:

import torch
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer (trust_remote_code is required for the custom architecture)
model = AutoModel.from_pretrained('h2oai/h2ovl-mississippi-800m', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('h2oai/h2ovl-mississippi-800m', trust_remote_code=True)

# Define the input text
input_text = 'Hello, how are you?'

# Tokenize the input text (sequences longer than 2048 tokens are truncated)
inputs = tokenizer(input_text, return_tensors='pt', truncation=True, max_length=2048)

# Run the model
outputs = model(**inputs)

# Print the embedding of the first token
print(outputs.last_hidden_state[:, 0, :])

And here is an example of how to use the model with image input:

from PIL import Image
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer (trust_remote_code is required for the custom architecture)
model = AutoModel.from_pretrained('h2oai/h2ovl-mississippi-800m', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('h2oai/h2ovl-mississippi-800m', trust_remote_code=True)

# Load the image and convert it to the expected 3xHxW tensor with a batch dimension
image = Image.open('image.jpg').convert('RGB')
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
pixel_values = preprocess(image).unsqueeze(0)  # shape: 1 x 3 x 224 x 224

# Run the model on the image
# (depending on the forward signature, a text prompt may also be required;
#  the chat-style sketch in the How It Works section is usually more convenient)
outputs = model(pixel_values=pixel_values)

# Print the output
print(outputs.last_hidden_state[:, 0, :])