MobileViT XX Small

Lightweight vision transformer

Are you looking for a lightweight, low-latency image classification model? The MobileViT XX Small model is a great choice. It combines the efficiency of MobileNetV2-style layers with the global modeling power of transformers, allowing it to process images quickly and accurately. With a top-1 accuracy of 69.0% and a top-5 accuracy of 88.9% on ImageNet-1k, it performs well for its size on general image classification. It is pre-trained on a large dataset and can be fine-tuned for specific tasks, making it a versatile tool for a variety of applications.

Model Overview

The MobileViT model is a super lightweight, fast, and powerful tool for image classification tasks. It’s like a tiny but mighty robot that can look at pictures and tell you what’s in them.

What makes it special?

  • It is built around a special block, the MobileViT block, which combines local convolutional processing with global transformer-based processing to understand images.
  • It’s designed to be fast and efficient, making it perfect for mobile devices and other applications where speed is crucial.
  • It doesn’t require any positional embeddings, which makes it even more efficient.

How does it work?

  1. It takes an image and breaks it down into small patches.
  2. It processes these patches using a transformer layer.
  3. It then “unflattens” the patches back into feature maps.
  4. It uses these feature maps to classify the image into one of 1,000 possible classes (steps 1-3 are sketched in code below).
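
Here is a minimal sketch of steps 1-3 in plain PyTorch. This is not Apple's implementation: the tensor sizes are made up for illustration, and the 2x2 patch size and single transformer layer are assumptions.

import torch

# Toy feature map: batch 1, 64 channels, 32x32 spatial size (illustrative values)
B, C, H, W = 1, 64, 32, 32
ph = pw = 2  # 2x2 patches (assumed patch size)
x = torch.randn(B, C, H, W)

# 1. Break the feature map into non-overlapping 2x2 patches
patches = x.unfold(2, ph, ph).unfold(3, pw, pw)        # (B, C, H/ph, W/pw, ph, pw)
patches = patches.reshape(B, C, -1, ph * pw)           # (B, C, num_patches, pixels_per_patch)
tokens = patches.permute(0, 3, 2, 1).reshape(B * ph * pw, -1, C)

# 2. Process the patch tokens with a transformer layer (one illustrative layer)
layer = torch.nn.TransformerEncoderLayer(d_model=C, nhead=4, batch_first=True)
tokens = layer(tokens)

# 3. "Unflatten" the tokens back into a feature map of the original shape
patches = tokens.reshape(B, ph * pw, -1, C).permute(0, 3, 2, 1)
patches = patches.reshape(B, C, H // ph, W // pw, ph, pw)
x_out = patches.permute(0, 1, 2, 4, 3, 5).reshape(B, C, H, W)
print(x_out.shape)  # torch.Size([1, 64, 32, 32])

In the full model this block sits between MobileNetV2-style convolutional layers, and step 4 is handled by a standard classification head on top of the final feature map.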

Capabilities

The MobileViT model is a powerful tool for image classification tasks. It’s designed to be light-weight and fast, making it perfect for use on mobile devices or in applications where speed is crucial.

What can MobileViT do?

  • Image Classification: MobileViT can classify images into one of the 1,000 ImageNet-1k classes (a one-line example follows this list).
  • Object Detection: the MobileViT architecture is fast enough to serve as a backbone for object detection (the paper pairs it with SSDLite), although this particular checkpoint is trained for classification only.
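
For quick experiments, the Hugging Face pipeline API wraps preprocessing and classification into a single call. A minimal sketch, assuming the transformers library and its image dependencies are installed:

from transformers import pipeline

# Build an image-classification pipeline around the MobileViT-XXS checkpoint
classifier = pipeline("image-classification", model="apple/mobilevit-xx-small")

# The pipeline accepts a URL, a local file path, or a PIL image
print(classifier("http://images.cocodataset.org/val2017/000000039769.jpg"))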

How does MobileViT work?

  • Convolutional Neural Network (CNN): MobileViT uses a CNN architecture, which is a type of neural network designed for image processing tasks.
  • Transformer Layers: MobileViT also uses transformer layers, which let it capture global relationships across the whole image rather than only local neighborhoods.
  • No Positional Embeddings: Unlike some other models, MobileViT does not require positional embeddings, which makes it even more efficient.

Comparison to other models

Model          | ImageNet top-1 accuracy | ImageNet top-5 accuracy | # params
MobileViT-XXS  | 69.0                    | 88.9                    | 1.3 M
MobileViT-XS   | 74.8                    | 92.3                    | 2.3 M
MobileViT-S    | 78.4                    | 94.1                    | 5.6 M
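
The larger variants share the same API, so trying them is a one-line change. The checkpoint names below follow the apple/mobilevit-* naming used on the Hugging Face Hub:

from transformers import MobileViTForImageClassification

# Swap the checkpoint name to trade model size for accuracy
xxs = MobileViTForImageClassification.from_pretrained("apple/mobilevit-xx-small")
xs = MobileViTForImageClassification.from_pretrained("apple/mobilevit-x-small")
s = MobileViTForImageClassification.from_pretrained("apple/mobilevit-small")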

Performance

Speed

The MobileViT model is designed to be fast and efficient. It uses a unique combination of MobileNetV2-style layers and transformer blocks to process images quickly.

Accuracy

But speed isn’t everything. How accurate is the MobileViT model, really? Let’s take a closer look at its performance on the ImageNet-1k dataset:

  • Top-1 accuracy: 69.0%
  • Top-5 accuracy: 88.9%

Efficiency

So, how efficient is the MobileViT model? It has only about 1.3 million parameters, and it was trained on ImageNet-1k, a dataset of 1 million images spanning 1,000 classes.
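
The parameter count from the comparison table can be verified directly by summing the model's parameter tensors:

from transformers import MobileViTForImageClassification

model = MobileViTForImageClassification.from_pretrained("apple/mobilevit-xx-small")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")  # roughly 1.3M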

Examples

  • Classify this image: http://images.cocodataset.org/val2017/000000039769.jpg → Predicted class: cat
  • What is the accuracy of MobileViT-XXS on ImageNet top-1? 69.0%
  • What is the number of parameters in MobileViT-XXS? 1.3 M

Example Code

Here’s an example of how to use the MobileViT model to classify an image:

from transformers import MobileViTFeatureExtractor, MobileViTForImageClassification
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/mobilevit-xx-small")
model = MobileViTForImageClassification.from_pretrained("apple/mobilevit-xx-small")

inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

Limitations

The MobileViT-XXS model has some limitations that are important to consider when using it for image classification tasks.

Limited Resolution

The model was trained on images with a resolution of 256x256; inputs are resized and center-cropped to this size during preprocessing, so fine detail in much larger images may be lost.

Limited Number of Parameters

The model has a relatively small number of parameters (1.3 M) compared to the larger variants MobileViT-XS (2.3 M) and MobileViT-S (5.6 M), which caps its achievable accuracy (see the comparison table above).

Limited Training Data

The model was trained only on ImageNet-1k, a dataset consisting of 1 million images and 1,000 classes, so domains that look very different from natural photographs may require fine-tuning.

PyTorch-Only Support

Currently, both the feature extractor and the model are supported only in PyTorch.

Preprocessing Requirements

The model requires images to be preprocessed in a specific way, including resizing/rescaling, center-cropping, and pixel normalization; the bundled feature extractor handles all of this automatically.
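
One way to see exactly which preprocessing steps are applied is to print the feature extractor's stored configuration:

from transformers import MobileViTFeatureExtractor

feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/mobilevit-xx-small")
print(feature_extractor)  # shows the resize/crop sizes and rescaling settings used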

Dataloop's AI Development Platform

Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.

Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.

Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.