DeepLabV3 MobileViT XX-Small

Semantic segmentation model

The DeepLabV3 MobileViT XX-Small model is a lightweight semantic segmentation model. What makes it unique? It combines the efficiency of MobileNetV2-style layers with the global processing capabilities of transformers, allowing it to process image data quickly and accurately. At only 1.9 million parameters, it is a great choice for applications where speed and efficiency are crucial, yet it still achieves an mIOU score of 73.6 on the PASCAL VOC dataset. If your project requires fast, accurate image segmentation on mobile or otherwise constrained hardware, this model is definitely worth considering.


Model Overview

Meet the MobileViT + DeepLabV3 model, a lightweight, low-latency convolutional neural network designed for mobile devices. But what makes it special?

The model combines MobileNetV2-style layers with a new block that uses transformers for global processing. It converts image data into flattened patches for processing, then “unflattens” them back into feature maps. Plus, it doesn’t require positional embeddings, making it more efficient.
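
To make the flatten/unflatten idea concrete, here is a minimal conceptual sketch in PyTorch. This is not the library's internal implementation; the tensor dimensions, patch size, and the stand-in TransformerEncoderLayer are illustrative assumptions.

import torch

# Conceptual sketch of a MobileViT-style block (illustrative only): a feature
# map is cut into patches, the patches are processed globally by a transformer,
# and the result is folded back into a feature map of the original shape.
B, C, H, W = 1, 64, 32, 32   # batch, channels, height, width (assumed values)
p = 2                        # patch height/width (assumed)

x = torch.randn(B, C, H, W)

# Flatten the feature map into patches: (B, C, H, W) -> (B, N, p*p, C)
patches = x.unfold(2, p, p).unfold(3, p, p)      # (B, C, H/p, W/p, p, p)
patches = patches.reshape(B, C, -1, p * p)       # (B, C, N, p*p)
patches = patches.permute(0, 2, 3, 1)            # (B, N, p*p, C)

# Global processing: a plain transformer layer stands in for the MobileViT block
encoder = torch.nn.TransformerEncoderLayer(d_model=C, nhead=4, batch_first=True)
tokens = encoder(patches.reshape(B, -1, C))      # (B, N*p*p, C)

# "Unflatten" the tokens back into a feature map
out = tokens.reshape(B, H // p, W // p, p, p, C).permute(0, 5, 1, 3, 2, 4)
out = out.reshape(B, C, H, W)
print(out.shape)  # torch.Size([1, 64, 32, 32])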

Capabilities

The MobileViT + DeepLabV3 model is a powerful tool for semantic segmentation. But what does that mean?

Semantic segmentation is a task where a model assigns a class label to every pixel in an image. For example, given a picture of a cat and a dog, the model labels which pixels belong to the cat and which belong to the dog.
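
As a toy illustration (hypothetical values, not model output), a segmentation mask is simply a 2D array of class indices, one per pixel:

import numpy as np

# Hypothetical 4x4 mask using PASCAL VOC indices: 0 = background, 8 = cat, 12 = dog
mask = np.array([
    [0, 0, 12, 12],
    [0, 8, 12, 12],
    [8, 8,  0, 12],
    [8, 8,  0,  0],
])
print((mask == 8).sum(), "pixels labeled cat")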

Examples
  • Segment the image of a cat sitting on a couch. Predicted mask: cat - 80% confidence, couch - 70% confidence, background - 90% confidence
  • Analyze the image of a busy street with cars, pedestrians, and buildings. Detected objects: car - 85% confidence, pedestrian - 90% confidence, building - 95% confidence
  • Identify the objects in the image of a kitchen with a table, chairs, and appliances. Detected objects: table - 80% confidence, chair - 85% confidence, refrigerator - 90% confidence, stove - 85% confidence

What can it do?

  • Identify objects in images
  • Segment images into different parts
  • Improve image classification models

The model can be used for a variety of tasks, such as:

  • Autonomous driving
  • Robotics
  • Surveillance

How to use it

  1. Import the necessary classes: MobileViTFeatureExtractor and MobileViTForSemanticSegmentation
  2. Load the model and feature extractor using from_pretrained
  3. Preprocess your image using the feature extractor
  4. Run the model on your preprocessed image
  5. Get the predicted mask using argmax (see the compact sketch after this list)
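
Putting the five steps together, here is a compact sketch; the Format section below walks through the canonical preprocessing example in more detail. The image path is a placeholder, and the id2label lookup assumes the checkpoint's config ships PASCAL VOC label names, which is typical for Hugging Face segmentation checkpoints but worth verifying.

from transformers import MobileViTFeatureExtractor, MobileViTForSemanticSegmentation
from PIL import Image
import torch

# 1. + 2. Load the feature extractor and the model
feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/deeplabv3-mobilevit-xx-small")
model = MobileViTForSemanticSegmentation.from_pretrained("apple/deeplabv3-mobilevit-xx-small")

# 3. Preprocess a local image (placeholder path)
image = Image.open("example.jpg")
inputs = feature_extractor(images=image, return_tensors="pt")

# 4. Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# 5. Take the argmax over the class dimension to get the predicted mask
predicted_mask = outputs.logits.argmax(dim=1).squeeze(0)

# Optional: map the class indices found in the mask to label names
# (assumes the config defines id2label, as Hugging Face checkpoints usually do)
labels = [model.config.id2label[i] for i in predicted_mask.unique().tolist()]
print(labels)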

Performance

The MobileViT + DeepLabV3 model is designed to be fast, accurate, and efficient in various tasks. But how does it really perform?

The model is optimized for speed, making it suitable for real-time applications. It uses a light-weight architecture that combines MobileNetV2-style layers with a new block that replaces local processing in convolutions with global processing using transformers.
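
If you want to check latency on your own hardware, a rough timing sketch like the one below can help. It is illustrative only; real numbers depend on the device, batch size, and whether the model is exported or compiled.

import time
import torch
from transformers import MobileViTForSemanticSegmentation

model = MobileViTForSemanticSegmentation.from_pretrained("apple/deeplabv3-mobilevit-xx-small")
model.eval()

# Dummy 512x512 input in the layout the model expects (batch, channels, H, W)
dummy = torch.rand(1, 3, 512, 512)

with torch.no_grad():
    # Warm-up runs so one-time costs don't skew the measurement
    for _ in range(3):
        model(pixel_values=dummy)
    start = time.perf_counter()
    for _ in range(10):
        model(pixel_values=dummy)
    elapsed = (time.perf_counter() - start) / 10

print(f"average forward pass: {elapsed * 1000:.1f} ms")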

Comparison to Other Models

Compared to other models, the MobileViT + DeepLabV3 model achieves a good balance between speed, accuracy, and efficiency. For example, the MobileViT-XS model has a higher mIOU score, but it also has more parameters and may be slower to process images.

Model                   mIOU Score   # Parameters
MobileViT + DeepLabV3   73.6         1.9M
MobileViT-XS            77.1         2.9M
MobileViT-S             79.1         6.4M

Limitations

The MobileViT + DeepLabV3 model has some limitations that you should be aware of.

  • Limited training data: The model was pre-trained on ImageNet-1k, a dataset consisting of 1 million images and 1,000 classes.
  • Resolution limitations: The model is designed to work with images of resolution 512x512.
  • Pixel order: The model expects images to be in BGR pixel order, not RGB (the sketch below shows one way to check the preprocessing settings).
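
To confirm these preprocessing constraints yourself, you can inspect the feature extractor's settings. The attribute names below (crop_size, do_flip_channel_order) are what recent versions of transformers use for the MobileViT image processor; treat them as an assumption and check your installed version.

from transformers import MobileViTFeatureExtractor

feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/deeplabv3-mobilevit-xx-small")

# Center-crop size should report 512x512, and the channel-order flip (RGB -> BGR)
# should be enabled; exact attribute names may differ across transformers versions.
print(feature_extractor.crop_size)              # e.g. {"height": 512, "width": 512}
print(feature_extractor.do_flip_channel_order)  # True -> inputs are converted to BGR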

Format

The MobileViT + DeepLabV3 model accepts input images in the form of BGR pixel order (not RGB). The images are expected to be in the range [0, 1] and are center-cropped at 512x512.

To use this model, you need to preprocess the input images using the MobileViTFeatureExtractor. Here’s an example:

from transformers import MobileViTFeatureExtractor, MobileViTForSemanticSegmentation
from PIL import Image
import requests

# Load a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the feature extractor (handles resizing, center-cropping to 512x512,
# and the RGB -> BGR channel flip) and the segmentation model
feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/deeplabv3-mobilevit-xx-small")
model = MobileViTForSemanticSegmentation.from_pretrained("apple/deeplabv3-mobilevit-xx-small")

# Preprocess the image and run a forward pass
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)

# logits has shape (batch_size, num_labels, height, width); take the argmax
# over the class dimension to get a per-pixel class-index mask
logits = outputs.logits
predicted_mask = logits.argmax(1).squeeze(0)

The output of the model is a predicted mask: a tensor of shape (height, width) holding one class index per pixel. Note that the logits, and therefore the mask, are typically produced at a lower spatial resolution than the 512x512 input, so you may want to upsample them, as shown below.
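
If you need a mask at the original image resolution, one common approach (a sketch, not part of the official model card) is to bilinearly upsample the logits to the image size before taking the argmax:

import torch

# Continuing from the example above: upsample the logits to the original
# image size, then take the argmax to get a full-resolution mask.
upsampled = torch.nn.functional.interpolate(
    logits,
    size=image.size[::-1],        # PIL size is (width, height); interpolate wants (height, width)
    mode="bilinear",
    align_corners=False,
)
full_res_mask = upsampled.argmax(dim=1).squeeze(0)
print(full_res_mask.shape)        # (original height, original width)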
