Phi-3-Vision-128K-Instruct
The Phi-3-Vision-128K-Instruct model is a powerful, state-of-the-art multimodal AI designed for a wide range of applications, from general-purpose AI systems to research on efficient language and multimodal models. With 4.2B parameters and a context length of 128K tokens, it excels at general image understanding, OCR, chart and table understanding, and generating text in response to combined visual and text input. It achieves this by pairing an image encoder, connector, and projector with the Phi-3 Mini language model, giving it robust multimodal understanding over long contexts. The model does have limitations: it is trained primarily on English text, so performance on other languages may be worse, and it may over- or under-represent certain groups of people, erase representation of some groups, or reinforce negative stereotypes. Despite these limitations, it performs strongly on multimodal benchmarks such as ScienceQA, MathVista, and InterGPS (see the table below). It is distributed in both Hugging Face and ONNX formats and was trained on 512 H100-80G GPUs.
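As a rough sketch of how those components fit together (illustrative only, with assumed dimensions, a toy projector, and placeholder names rather than the actual Phi-3-Vision implementation), the vision encoder's patch features are projected into the language model's embedding space and concatenated with the text token embeddings before the Phi-3 Mini decoder attends over them:

```python
import torch
import torch.nn as nn

class ToyMultimodalPipeline(nn.Module):
    """Conceptual sketch of an encoder -> projector -> language-model pipeline.

    Dimensions and structure are assumptions for illustration, not the real model.
    """

    def __init__(self, vision_dim=1024, lm_dim=3072, vocab_size=32064):
        super().__init__()
        # The "connector/projector": maps vision features into the LM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )
        # Stand-in for the language model's token embedding table.
        self.text_embed = nn.Embedding(vocab_size, lm_dim)

    def forward(self, image_features, text_token_ids):
        # image_features: (batch, num_patches, vision_dim) from the image encoder
        img_tokens = self.projector(image_features)   # -> (batch, num_patches, lm_dim)
        txt_tokens = self.text_embed(text_token_ids)  # -> (batch, seq_len, lm_dim)
        # The concatenated sequence is what the decoder would attend over.
        return torch.cat([img_tokens, txt_tokens], dim=1)

pipe = ToyMultimodalPipeline()
fused = pipe(torch.randn(1, 576, 1024), torch.randint(0, 32064, (1, 16)))
print(fused.shape)  # torch.Size([1, 592, 3072])
```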
Model Overview
The Phi-3-Vision-128K-Instruct model is a state-of-the-art, open multimodal model that combines text and vision capabilities. It’s part of the Phi-3 model family and is designed for a wide range of applications, including general-purpose AI systems, image understanding, OCR, and chart and table understanding.
This model can be applied across industries such as education, healthcare, and finance to analyze and understand images and text. It is well suited to memory/compute-constrained environments and latency-bound scenarios, and it was trained with a focus on high-quality data.
Capabilities
The Phi-3-Vision-128K-Instruct model is a lightweight, state-of-the-art open multimodal model that handles both text and image inputs. It is designed for broad commercial and research use in English, particularly in memory/compute-constrained environments, latency-bound scenarios, and applications involving general image understanding, OCR, and chart and table understanding. Among other things, it can:
- Answer questions about images
- Provide insightful questions to spark discussion
- Generate text in response to image inputs
- Understand and respond to chat format prompts
Compared to models such as LLaVA-1.6 Vicuna-7B and QWEN-VL Chat, Phi-3-Vision-128K-Instruct performs better on a range of benchmarks, including MMBench, ScienceQA, and MathVista.
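The chat-format support deserves a brief illustration. Turns are wrapped in role tags, and an `<|image_1|>` placeholder marks where the attached image goes; the full usage example further down builds this string automatically with `apply_chat_template`. A minimal sketch of a single-turn, single-image prompt (the exact tags shown here follow the commonly documented Phi-3-vision convention, so treat them as illustrative):

```python
# Single-turn, single-image prompt in the Phi-3 chat format.
# <|image_1|> is a placeholder the processor replaces with image tokens.
question = "What is shown in this image?"
prompt = f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"
print(prompt)
```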
Performance
The Phi-3-Vision-128K-Instruct model boasts high accuracy in various tasks, including:
- Zero-shot benchmarks: It outperforms models such as LLaVA-1.6 Vicuna-7B and QWEN-VL Chat on tasks like MMBench, ScienceQA, and MathVista.
- Image understanding: It achieves high scores in tasks like AI2D, ChartQA, and TextVQA.
| Benchmark | Phi-3-Vision-128K-Instruct | LLaVA-1.6 Vicuna-7B | QWEN-VL Chat | Llama3-Llava-Next-8B | Claude-3 Haiku | Gemini 1.0 Pro V | GPT-4V-Turbo |
|---|---|---|---|---|---|---|---|
| MMBench | 80.5 | 76.3 | 75.8 | 79.4 | 62.4 | 80.0 | 86.1 |
| ScienceQA | 90.8 | 70.6 | 67.2 | 73.7 | 72.0 | 79.7 | 75.7 |
| MathVista | 44.5 | 31.5 | 29.4 | 34.8 | 33.2 | 35.0 | 47.5 |
| InterGPS | 38.1 | 20.5 | 22.3 | 24.6 | 32.1 | 28.6 | 41.0 |
| AI2D | 76.7 | 63.1 | 59.8 | 66.9 | 60.3 | 62.8 | 74.7 |
| ChartQA | 81.4 | 55.0 | 50.9 | 65.8 | 59.3 | 58.0 | 62.3 |
| TextVQA | 70.9 | 64.6 | 59.4 | 55.7 | 62.7 | 64.7 | 68.1 |
| POPE | 85.8 | 87.2 | 82.6 | 87.0 | 74.4 | 84.2 | 83.7 |
Here’s an example of how to use the model:
```python
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

# Load the model and processor (flash_attention_2 requires a compatible GPU and flash-attn)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load the input image
url = "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"
image = Image.open(requests.get(url, stream=True).raw)

# Format the chat history; <|image_1|> marks where the image is inserted
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "The chart displays the percentage of respondents who agree with various statements about their preparedness for meetings."},
    {"role": "user", "content": "Provide insightful questions to spark discussion."},
]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

# Generate the output text
generation_args = {
    "max_new_tokens": 500,
    "temperature": 0.0,
    "do_sample": False,
}
generate_ids = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, **generation_args)

# Strip the prompt tokens so only the newly generated text is decoded
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
```
Note that this is just an example, and you may need to modify the code to suit your specific use case.
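One common adjustment: the example above requests `flash_attention_2`, which needs a compatible GPU and the flash-attn package installed. If that is not available in your environment, you can typically fall back to the default eager attention implementation when loading the model; a minimal sketch, assuming everything else stays the same:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

# Fallback loading path for environments without flash-attn:
# use the standard "eager" attention implementation instead.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation="eager",  # assumption: flash-attn is not installed
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```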
Limitations
The Phi-3-Vision-128K-Instruct model has several limitations, including:
- Language limitations: The model is primarily trained on English text, which means it may not perform as well on other languages.
- Representation and stereotypes: The model may over- or under-represent certain groups of people, erase representation of some groups, or reinforce negative stereotypes.
- Inappropriate or offensive content: The model may produce inappropriate or offensive content, which may make it unsuitable for sensitive contexts without additional mitigations.
By understanding these limitations, developers can design and implement effective mitigations to ensure the Phi-3-Vision-128K-Instruct model is used responsibly and effectively.