Florence-2 Large
Florence-2 is a vision foundation model developed by Microsoft. It uses a prompt-based approach to handle a wide range of vision and vision-language tasks, including captioning, object detection, and segmentation: simple text prompts tell the model which task to perform. Its sequence-to-sequence architecture, pre-trained on the large-scale FLD-5B dataset, lets it perform well in both zero-shot and fine-tuned settings. Its performance may degrade, however, on tasks that require more specialized knowledge or expertise. The model is available in four variants, ranging from 0.23B to 0.77B parameters.
Model Overview
The Florence-2 model, developed by Microsoft, is a cutting-edge vision foundation model that can handle a wide range of vision and vision-language tasks. It uses a prompt-based approach to interpret simple text prompts and perform tasks like captioning, object detection, and segmentation.
Key Features
- Unified Representation: a single, unified representation covers a variety of vision tasks, making the model versatile.
- Prompt-based Approach: tasks are selected with simple text prompts, making the model flexible and easy to use.
- Large-scale Pre-training: pre-trained on FLD-5B, a dataset containing 5.4 billion annotations across 126 million images.
- Sequence-to-Sequence Architecture: enables strong performance in both zero-shot and fine-tuned settings.
Capabilities
Florence-2 can perform a variety of tasks, including the following (the corresponding prompt tokens are sketched after this list):
- Image Captioning
- Object Detection
- Segmentation
- Dense Region Caption
- Region Proposal
- OCR
- OCR with Region
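Each of these tasks is selected with a dedicated prompt token. The mapping below is a minimal sketch based on the task tokens listed on the public Florence-2 model card; treat the exact strings as assumptions and verify them against the checkpoint you load.

```python
# Task prompt tokens for the capabilities above (taken from the public model
# card; verify against the checkpoint you use, as the strings are assumptions here).
TASK_PROMPTS = {
    "image captioning": "<CAPTION>",
    "object detection": "<OD>",
    "segmentation (referring expression)": "<REFERRING_EXPRESSION_SEGMENTATION>",
    "dense region caption": "<DENSE_REGION_CAPTION>",
    "region proposal": "<REGION_PROPOSAL>",
    "ocr": "<OCR>",
    "ocr with region": "<OCR_WITH_REGION>",
}
```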
Model Variants
There are four variants of the model (a loading sketch follows the table):
| Model | Parameters |
|---|---|
| Florence-2-base | 0.23B |
| Florence-2-large | 0.77B |
| Florence-2-base-ft | 0.23B |
| Florence-2-large-ft | 0.77B |
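All four variants are published under the microsoft organization on the Hugging Face Hub, so switching between them only changes the checkpoint name. A minimal loading sketch, assuming the Hub IDs match the variant names in the table:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Any of: microsoft/Florence-2-base, microsoft/Florence-2-large,
#         microsoft/Florence-2-base-ft, microsoft/Florence-2-large-ft
checkpoint = "microsoft/Florence-2-base-ft"

# Florence-2 ships custom modeling code on the Hub, so trust_remote_code=True is needed.
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
```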
How it Works
Florence-2 uses a prompt-based approach to handle different tasks: you provide a prompt to the model, and it generates the corresponding output. For example, if you provide an image-captioning prompt, the model generates a caption for the image.
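In practice, only the prompt string changes between tasks; the image handling and the generate call stay the same. A minimal sketch, assuming a model and processor loaded as in the Getting Started section below (the helper name run_task is ours):

```python
def run_task(model, processor, image, prompt):
    """Run a single Florence-2 task; the prompt token alone selects the task."""
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    return processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Same image, different prompts, different tasks:
# caption    = run_task(model, processor, image, "<CAPTION>")
# detections = run_task(model, processor, image, "<OD>")
```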
Strengths
- Zero-shot performance: performs well on tasks it has not been specifically trained on
- Fine-tuned performance: can be fine-tuned for specific tasks to achieve even better performance
- Multi-task learning: can learn to perform multiple tasks simultaneously
Comparison to Other Models
Florence-2 has been compared to other models on various benchmarks and has achieved competitive results. For example, it outperforms the Flamingo model on the COCO Captioning task and achieves results similar to the PaLI model on Visual Question Answering.
Performance
Florence-2 demonstrates strong performance across a range of vision tasks, combining high accuracy with good speed and efficiency.
Speed
| Model | Parameters | Inference Time (ms) |
|---|---|---|
| Florence-2-base | 0.23B | 34.7 |
| Florence-2-large | 0.77B | 37.5 |
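Inference time depends heavily on hardware, image size, and generation settings, so figures like those above are best reproduced locally. The sketch below is one way to measure wall-clock latency for a single generate call; it does not reproduce the exact benchmark conditions behind the table.

```python
import time

import torch


def time_generation(model, processor, image, prompt="<CAPTION>", runs=5):
    """Return the average wall-clock latency (ms) of a generate call."""
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        with torch.no_grad():
            model.generate(
                input_ids=inputs["input_ids"],
                pixel_values=inputs["pixel_values"],
                max_new_tokens=1024,
                num_beams=3,
            )
        latencies.append((time.perf_counter() - start) * 1000)
    return sum(latencies) / len(latencies)
```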
Accuracy
Florence-2 achieves high accuracy in various vision tasks, including:
- Image captioning: outperforms Flamingo and Kosmos-2 in zero-shot CIDEr on the COCO Captions test set.
- Object detection: achieves competitive mAP with specialist models such as SeqTR and PolyFormer on COCO val2017.
Limitations
Florence-2 is a powerful vision foundation model, but it is not perfect. Some of its limitations are outlined below.
Training Data Bias
Florence-2 was trained on a massive dataset of 126 million images with 5.4 billion annotations. While this dataset is diverse, it is still biased toward certain types of images and annotations, and this bias can affect the model's performance on tasks that involve underrepresented image types or annotations.
Task-Specific Performance
Florence-2 may not perform as well as specialist models that are fine-tuned for a single task. In object detection, for example, it may not match the performance of a model fine-tuned solely for object detection.
OCR Limitations
Florence-2 relies on OCR (Optical Character Recognition) to recognize text in images. OCR can be limited by certain fonts, languages, or low image quality, which can affect the model's performance on tasks that involve text recognition.
Format
Florence-2 is a vision foundation model that uses a sequence-to-sequence architecture to handle a wide range of vision and vision-language tasks. It accepts images and text prompts as input and can perform tasks such as captioning, object detection, and segmentation.
Architecture
The model’s architecture is based on a transformer, which is a type of neural network that is well-suited for sequential data. The transformer architecture allows the model to process input sequences of varying lengths and to capture long-range dependencies between input elements.
Data Formats
Florence-2 accepts input in the following formats:
- Images: common image formats, including JPEG and PNG.
- Text prompts: plain strings.
Input Requirements
To use the model, you will need to provide the following inputs:
- Image: the image file you want the model to process.
- Text prompt: a prompt describing the task you want the model to perform. For example, to generate a caption for an image, you would provide a prompt such as <CAPTION>.
Output Formats
The model's output depends on the task you are performing. For example (a parsing sketch follows this list):
- Captioning: the model outputs a caption for the input image.
- Object detection: the model outputs a list of bounding boxes and class labels for the objects detected in the input image.
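The decoded text still contains task-specific location tokens. The processor published with Florence-2 includes a post_process_generation helper (part of the custom code on the Hub; signature as shown on the model card) that turns the raw string into structured output. A hedged sketch:

```python
# generated_text: the decoded model output; task_prompt: e.g. "<CAPTION>" or "<OD>".
parsed = processor.post_process_generation(
    generated_text,
    task=task_prompt,
    image_size=(image.width, image.height),
)
# For "<CAPTION>" the result maps the task token to a caption string; for "<OD>"
# it contains bounding boxes and labels, e.g. {"<OD>": {"bboxes": [...], "labels": [...]}}.
print(parsed)
```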
Getting Started
To get started with the model, you can use the following code:
```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Load the model and processor (Florence-2 ships custom code on the Hub,
# so trust_remote_code=True is required)
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

# Load the image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Define the text prompt
prompt = "<CAPTION>"

# Preprocess the input
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate the output
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
    do_sample=False,
)

# Decode the output
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Print the output
print(generated_text)
```
Note that this is just an example, and you will need to modify the code to suit your specific use case.