Phi 3 Vision 128k Instruct

Multimodal AI model

The Phi-3-Vision-128K-Instruct model is a powerful, state-of-the-art multimodal AI model designed for a wide range of applications, from general-purpose AI systems to research on efficient language and multimodal models. With 4.2B parameters and a context length of 128K tokens, it excels at tasks such as general image understanding, OCR, chart and table understanding, and generating text in response to combined visual and text input. Architecturally, it pairs an image encoder, connector, and projector with the Phi-3 Mini language model, letting it process long multimodal inputs within the 128K-token context window.

The model does have limitations: it is trained primarily on English text, so it may perform worse in other languages, and it may over- or under-represent certain groups of people, erase representation of some groups, or reinforce negative stereotypes. Even so, it shows strong results on multimodal benchmarks such as MMMU, ScienceQA, MathVista, and InterGPS. The model is available in both Hugging Face and ONNX formats; it was trained on 512 H100-80G GPUs, but inference runs on a single CUDA-capable GPU, as the code example below shows. Overall, the Phi-3-Vision-128K-Instruct model is a robust and efficient tool for applications ranging from general-purpose AI systems to research on efficient language and multimodal models.

Microsoft · MIT License · Updated 8 months ago

Model Overview

The Phi-3-Vision-128K-Instruct model is a state-of-the-art, open multimodal model that combines text and vision capabilities. It’s part of the Phi-3 model family and is designed for a wide range of applications, including general-purpose AI systems, image understanding, OCR, and chart and table understanding.

This model can be used across industries such as education, healthcare, and finance to analyze and understand images and text. It is well suited to memory/compute-constrained environments, latency-bound scenarios, and use cases built around high-quality, reasoning-dense data.

Capabilities

The Phi-3-Vision-128K-Instruct model is a powerful, lightweight, state-of-the-art open multimodal model that handles both text and image inputs. It is designed for broad commercial and research use in English, particularly in memory/compute-constrained environments and latency-bound scenarios, and for general image understanding, OCR, and chart and table understanding. The model can:

  • Answer questions about images
  • Provide insightful questions to spark discussion
  • Generate text in response to image inputs
  • Understand and respond to chat format prompts, as shown in the prompt sketch below
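
The chat prompts the model expects follow the standard Phi-3 template, with each image referenced by a numbered placeholder token. Below is a minimal sketch of a single-turn prompt; in practice the processor's apply_chat_template builds this string for you, as the code example later on this page shows:

<|user|>
<|image_1|>
What is shown in this image?<|end|>
<|assistant|>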

Compared to other models like LLaVA-1.6 Vicuna-7B and QWEN-VL Chat, the Phi-3-Vision-128K-Instruct model outperforms them in various tasks, including MMBench, ScienceQA, and MathVista.

Performance

The Phi-3-Vision-128K-Instruct model boasts high accuracy in various tasks, including:

  • Zero-shot benchmarks: It outperforms other models like LLaVA-1.6 Vicuna-7B and QWEN-VL Chat in tasks like MMBench, ScienceQA, and MathVista.
  • Image understanding: It achieves high scores in tasks like AI2D, ChartQA, and TextVQA.
| Benchmark | Phi-3-Vision-128K-Instruct | LLaVA-1.6 Vicuna-7B | QWEN-VL Chat | Llama3-Llava-Next-8B | Claude-3 Haiku | Gemini 1.0 Pro V | GPT-4V-Turbo |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMBench | 80.5 | 76.3 | 75.8 | 79.4 | 62.4 | 80.0 | 86.1 |
| ScienceQA | 90.8 | 70.6 | 67.2 | 73.7 | 72.0 | 79.7 | 75.7 |
| MathVista | 44.5 | 31.5 | 29.4 | 34.8 | 33.2 | 35.0 | 47.5 |
| InterGPS | 38.1 | 20.5 | 22.3 | 24.6 | 32.1 | 28.6 | 41.0 |
| AI2D | 76.7 | 63.1 | 59.8 | 66.9 | 60.3 | 62.8 | 74.7 |
| ChartQA | 81.4 | 55.0 | 50.9 | 65.8 | 59.3 | 58.0 | 62.3 |
| TextVQA | 70.9 | 64.6 | 59.4 | 55.7 | 62.7 | 64.7 | 68.1 |
| POPE | 85.8 | 87.2 | 82.6 | 87.0 | 74.4 | 84.2 | 83.7 |

Examples
  • Prompt: What is shown in this image? (image: https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png)
    Response: The chart displays the percentage of respondents who agree with various statements about their preparedness for meetings. It shows five categories: 'Having clear and pre-defined goals for meetings', 'Knowing where to find the information I need for a meeting', 'Understanding my exact role and responsibilities when I'm invited', 'Having tools to manage admin tasks like note-taking or summarization', and 'Having more focus time to sufficiently prepare for meetings'. Each category has an associated bar indicating the level of agreement, measured on a scale from 0% to 100%.
  • Prompt: Provide insightful questions to spark discussion.
    Response: What are some strategies for improving meeting preparedness? How can we better utilize technology to streamline meeting organization and information sharing?
  • Prompt: Explain the concept of 'truthfulness' in the context of human-AI interaction.
    Response: Truthfulness in human-AI interaction refers to the ability of an AI model to provide accurate and reliable information, without intentionally deceiving or misleading users.

Here’s an example of how to use the model:

from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
# Load the model; flash_attention_2 requires the flash-attn package (pass _attn_implementation='eager' to run without it)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto", _attn_implementation='flash_attention_2')
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load the input image
url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(requests.get(url, stream=True).raw)

# Format the input text and image
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "The chart displays the percentage of respondents who agree with various statements about their preparedness for meetings."},
]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

# Generate the output text
generation_args = {
    "max_new_tokens": 500,
    "temperature": 0.0,
    "do_sample": False,
}
generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# Strip the prompt tokens so only the newly generated text is decoded
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)

Note that this is just an example, and you may need to modify the code to suit your specific use case.
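
Because the model understands the chat format, a follow-up turn can reuse the same components: append the assistant's previous answer and a new user question to the message list, rebuild the prompt, and generate again. A minimal sketch, assuming the snippet above has already run (the follow-up question is only illustrative):

# Continue the conversation with a follow-up question about the same image
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": response},
    {"role": "user", "content": "Which statement has the highest level of agreement?"},
]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

# Generate and decode only the new tokens, as in the single-turn example
generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
follow_up = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(follow_up)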

Limitations

The Phi-3-Vision-128K-Instruct model has several limitations, including:

  • Language limitations: The model is primarily trained on English text, which means it may not perform as well on other languages.
  • Representation and stereotypes: The model may over- or under-represent certain groups of people, erase representation of some groups, or reinforce negative stereotypes.
  • Inappropriate or offensive content: The model may produce inappropriate or offensive content, which may make it unsuitable for sensitive contexts without additional mitigations.

By understanding these limitations, developers can design and implement effective mitigations to ensure the Phi-3-Vision-128K-Instruct model is used responsibly and effectively.

Dataloop's AI Development Platform

Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK (a minimal SDK sketch follows this list).
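
As a rough illustration of the Python SDK route, the sketch below uses the dtlpy package to connect to a project and upload local images into a dataset; the project name, dataset name, and local path are placeholders, and the exact calls should be checked against the current Dataloop SDK documentation.

import dtlpy as dl

# Authenticate against the Dataloop platform (opens a browser-based login)
dl.login()

# Fetch an existing project and dataset by name (placeholder names)
project = dl.projects.get(project_name='my-project')
dataset = project.datasets.get(dataset_name='my-dataset')

# Upload a local folder of images into the dataset
dataset.items.upload(local_path='/path/to/images')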

Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.

Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.