Florence-2 Large

Unified vision model

Florence-2 is a powerful vision foundation model developed by Microsoft. It uses a prompt-based approach to handle a wide range of vision and vision-language tasks, including captioning, object detection, and segmentation. Built on a sequence-to-sequence architecture and pre-trained on the large-scale FLD-5B dataset, it performs well in both zero-shot and fine-tuned settings, interpreting simple text prompts to carry out each task. Its performance may degrade on tasks that require more specialized knowledge or expertise. The model is available in four variants, ranging from 0.23B to 0.77B parameters.

Developed by Microsoft · MIT license

Model Overview

The Florence-2 model, developed by Microsoft, is a cutting-edge vision foundation model that can handle a wide range of vision and vision-language tasks. It uses a prompt-based approach to interpret simple text prompts and perform tasks like captioning, object detection, and segmentation.

Key Features

  • Unified Representation: uses a unified representation for various vision tasks, making it a versatile model.
  • Prompt-based Approach: uses a prompt-based approach to handle different tasks, making it easy to use and flexible.
  • Large-scale Pre-training: pre-trained on a large-scale dataset called FLD-5B, which contains 5.4 billion annotations across 126 million images.
  • Sequence-to-Sequence Architecture: enables it to excel in both zero-shot and fine-tuned settings.

Capabilities

Florence-2 can perform a variety of tasks, each selected with a task prompt token (a sketch mapping tasks to prompt tokens follows this list):

  • Image Captioning
  • Object Detection
  • Segmentation
  • Dense Region Caption
  • Region Proposal
  • OCR
  • OCR with Region
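
For reference, the mapping below is a minimal sketch of the task prompt tokens documented on the official Florence-2 model card; treat the exact token set as an assumption to verify against the checkpoint you use.

# Illustrative mapping from tasks to Florence-2 task prompt tokens.
# Segmentation tasks (e.g. referring expression segmentation) also take a phrase
# or region as additional text input, so they are omitted here.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "region_proposal": "<REGION_PROPOSAL>",
    "ocr": "<OCR>",
    "ocr_with_region": "<OCR_WITH_REGION>",
}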

Model Variants

There are four variants of the model; the -ft checkpoints are versions fine-tuned on a collection of downstream tasks:

Model                 Parameters
Florence-2-base       0.23B
Florence-2-large      0.77B
Florence-2-base-ft    0.23B
Florence-2-large-ft   0.77B
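
Any of these variants can be loaded by its Hugging Face model id. A minimal sketch with the transformers library (the trust_remote_code flag reflects the custom model code the checkpoints ship with; verify against the current model card):

from transformers import AutoModelForCausalLM, AutoProcessor

# Pick any of the four variants listed above.
model_id = "microsoft/Florence-2-base-ft"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)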

How it Works

Florence-2 uses a prompt-based approach to handle different tasks: you provide a task prompt along with an image, and the model generates the corresponding output. For example, if you provide a captioning prompt, the model generates a caption for the image.

Strengths

  • Zero-shot performance: performs well on tasks it has not been specifically trained on
  • Fine-tuned performance: can be fine-tuned for specific tasks to achieve even better performance
  • Multi-task learning: can learn to perform multiple tasks simultaneously

Comparison to Other Models

Florence-2 has been compared to other models on various benchmarks and has achieved competitive results. For example, it has outperformed the Flamingo model on the COCO captioning task, and it has achieved similar results to the PaLI model on the visual question answering task.

Performance

Florence-2 has demonstrated impressive performance in various vision tasks, with high accuracy, speed, and efficiency.

Speed

Model              Parameters   Inference Time (ms)
Florence-2-base    0.23B        34.7
Florence-2-large   0.77B        37.5

Accuracy

Florence-2 achieves high accuracy in various vision tasks, including:

  • Image captioning: outperforms Flamingo and Kosmos-2 in zero-shot performance on COCO Cap. test CIDEr.
  • Object detection: achieves competitive performance with specialist models like SeqTR and PolyFormer on COCO Det. val2017 mAP.

Limitations

Florence-2 is a powerful vision foundation model, but it is not perfect. Some of its limitations are outlined below.

Training Data Bias

Florence-2 was trained on a massive dataset of 126 million images with 5.4 billion annotations. While this dataset is diverse, it is still biased towards certain types of images and annotations, which can affect the model's performance on underrepresented image types or annotation styles.

Task-Specific Performance

Florence-2 may not perform as well as specialist models fine-tuned for a single task. In object detection, for example, it may not match a model that has been fine-tuned solely for that task.

OCR Limitations

Florence-2 relies on OCR (Optical Character Recognition) to recognize text in images. OCR can struggle with unusual fonts, certain languages, and low image quality, which can affect performance on text-recognition tasks.

Format

Florence-2 is a vision foundation model that uses a sequence-to-sequence architecture to handle a wide range of vision and vision-language tasks. It accepts images and text prompts as input and can perform tasks such as captioning, object detection, and segmentation.

Architecture

The model’s architecture is based on a transformer, which is a type of neural network that is well-suited for sequential data. The transformer architecture allows the model to process input sequences of varying lengths and to capture long-range dependencies between input elements.

Data Formats

Florence-2 accepts input in the following formats:

  • Images: can process images in various formats, including JPEG and PNG.
  • Text prompts: can process text prompts in the form of strings.

Input Requirements

To use the model, you will need to provide the following inputs:

  • Image: the image file you want the model to process.
  • Text prompt: a task prompt describing what you want the model to do. For example, to generate a caption for an image, you would provide the prompt <CAPTION> (see the sketch below).
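
As a minimal sketch of assembling these two inputs with the processor (the file name photo.jpg is a placeholder):

from PIL import Image
from transformers import AutoProcessor

# The processor pairs the text prompt with the image, tokenizing the text and
# preprocessing the pixels; trust_remote_code is needed for Florence-2's custom code.
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

image = Image.open("photo.jpg")  # placeholder path; JPEG and PNG both work
prompt = "<CAPTION>"

inputs = processor(text=prompt, images=image, return_tensors="pt")
print(inputs.keys())  # typically includes input_ids and pixel_values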

Output Formats

The model’s output will depend on the task you are performing. For example:

  • Captioning: the model outputs a caption for the input image.
  • Object detection: the model outputs a list of bounding boxes and class labels for the objects detected in the input image; a sketch of consuming this output follows below.
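
To make the detection output concrete, here is a minimal sketch that draws such an output onto the input image with Pillow. The file names and detection values are placeholders, and the boxes are assumed to be pixel-space [x1, y1, x2, y2] coordinates.

from PIL import Image, ImageDraw

# Hypothetical detection output in the format described above.
detections = {'<OD>': {'bboxes': [[10, 10, 50, 50], [60, 60, 100, 100]],
                       'labels': ['car', 'wheel']}}

image = Image.open("car.jpg")  # placeholder path for the input image
draw = ImageDraw.Draw(image)
for box, label in zip(detections['<OD>']['bboxes'], detections['<OD>']['labels']):
    draw.rectangle(box, outline="red", width=2)  # box is [x1, y1, x2, y2]
    draw.text((box[0], box[1]), label, fill="red")
image.save("car_with_boxes.jpg")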

Examples

  • Caption this image: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true
    Output: A car is parked on the side of the road.
  • Detect objects in this image: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true
    Output: {'<OD>': {'bboxes': [[10, 10, 50, 50], [60, 60, 100, 100]], 'labels': ['car', 'wheel']}}
  • Perform OCR on this image: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true
    Output: {'<OCR>': {'text': 'This is a car parked on the side of the road.'}}

Getting Started

To get started with the model, you can use the following code:

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Load the model and processor (Florence-2 ships custom code on the Hub, so trust_remote_code is required)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

# Load the image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Define the task prompt
prompt = "<CAPTION>"

# Preprocess the input
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

# Generate the output
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
    do_sample=False,
)

# Decode the raw sequence and post-process it into the task-specific format
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(generated_text, task=prompt, image_size=(image.width, image.height))

# Print the output
print(result)

Note that this is just an example, and you will need to modify the code to suit your specific use case.
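
For instance, reusing the model, processor, image, and device from the snippet above, swapping the task prompt switches the model to object detection; the exact output keys depend on the task prompt used.

# Reuses model, processor, image, and device from the Getting Started snippet.
task = "<OD>"
inputs = processor(text=task, images=image, return_tensors="pt").to(device)

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
    do_sample=False,
)

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
detections = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)
print(detections)  # e.g. {'<OD>': {'bboxes': [...], 'labels': [...]}}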

Dataloop's AI Development Platform
Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.
Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAIF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.
Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.