OpenAI Clip

Mobile Vision Model

Have you ever wondered how AI models can understand both images and text? OpenAI CLIP is a multi-modal foundational model that does just that. It learns joint visual and language features and uses them for tasks like image-text similarity and zero-shot image classification. This release is optimized for mobile deployment and runs on a range of Qualcomm devices, processing images and text quickly enough for applications that need both visual and language understanding. What really sets CLIP apart is that it is trained with natural language supervision, which lets it generalize to new tasks without task-specific labeled data. Whether you're a developer building apps or a researcher exploring multi-modal AI, OpenAI CLIP is worth checking out.

Qualcomm · MIT License · Updated 5 months ago

Model Overview

The OpenAI-Clip model is a multi-modal foundational model for vision and language tasks: it maps images and text into a shared embedding space so the two can be compared directly.

This model is great for tasks like:

  • Image/text similarity
  • Zero-shot image classification

It uses a special technique called Contrastive Language-Image Pre-Training (CLIP) to learn visual and text features. These features can then be used for a variety of zero-shot learning tasks.
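
To make that concrete, here is a minimal sketch of zero-shot classification with CLIP-style features. Random tensors stand in for the encoder outputs, and all names (zero_shot_classify, the label prompts, the 512-dimensional embedding size) are illustrative rather than part of the released model's API:

import torch
import torch.nn.functional as F

def zero_shot_classify(image_embedding, label_embeddings, labels):
    # Normalize so the dot product equals cosine similarity
    image_embedding = F.normalize(image_embedding, dim=-1)    # (1, D)
    label_embeddings = F.normalize(label_embeddings, dim=-1)  # (N, D)

    # Similarity of the image to each candidate label prompt
    similarity = image_embedding @ label_embeddings.T          # (1, N)
    probs = similarity.softmax(dim=-1)
    return labels[int(probs.argmax())]

# Random tensors stand in for real CLIPImageEncoder / CLIPTextEncoder outputs
labels = ["a photo of a cat", "a photo of a dog", "a photo of a sunset"]
image_embedding = torch.randn(1, 512)
label_embeddings = torch.randn(len(labels), 512)
print(zero_shot_classify(image_embedding, label_embeddings, labels))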

Capabilities

The OpenAI-Clip model is a powerful multi-modal foundational model for vision and language tasks. It can perform a variety of tasks, including:

  • Image/text similarity
  • Zero-shot image classification

This model combines a ViT-like transformer that extracts visual features with a causal language model that extracts text features; the two encoders map their outputs into the same embedding space.

Specialized models may excel at individual tasks, but OpenAI-Clip demonstrates well-rounded zero-shot performance across a wide range of vision-language tasks.

Key Features

  • Multi-modal: The model can process both images and text
  • Zero-shot learning: The model can perform tasks without requiring labeled training data
  • High performance: The model has been optimized for mobile deployment and can run on a variety of devices

Model Statistics

Model               Number of Parameters   Model Size
CLIPImageEncoder    115M                   437 MB
CLIPTextEncoder     76.0M                  290 MB

Device Support

The model has been tested on a variety of devices, including:

  • Samsung Galaxy S23
  • Samsung Galaxy S24
  • Snapdragon 8 Elite QRD
  • QCS8550 (Proxy)
  • SA7255P ADP
  • SA8255 (Proxy)
  • SA8295P ADP
  • SA8650 (Proxy)
  • SA8775P ADP
  • QCS8450 (Proxy)
  • Snapdragon X Elite CRD

Inference Time

The model’s inference time varies depending on the device and runtime. Here are some examples:

Device                    Runtime   Inference Time (ms)
Samsung Galaxy S23        TFLITE    34.591
Samsung Galaxy S24        TFLITE    27.035
Snapdragon 8 Elite QRD    TFLITE    24.249

Performance

The model’s performance can vary depending on the device and runtime used. For example, on the Samsung Galaxy S23, the estimated inference time is around 34.6 ms for the CLIPImageEncoder and 5.8 ms for the CLIPTextEncoder.
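
The figures above come from Qualcomm's device benchmarks. As a rough sanity check on your own hardware (not a substitute for on-device profiling), you could time an encoder with a simple warm-up-then-average loop; image_encoder below is a placeholder for however you load the model:

import time
import torch

def measure_latency_ms(model, example_input, warmup=5, runs=20):
    """Average wall-clock latency of model(example_input) in milliseconds."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):   # warm-up iterations are not timed
            model(example_input)
        start = time.perf_counter()
        for _ in range(runs):
            model(example_input)
        elapsed = time.perf_counter() - start
    return elapsed / runs * 1000.0

# Example (placeholder): time an image encoder on a 224x224 input
# image_encoder = ...  # load the CLIPImageEncoder for your target runtime
# print(measure_latency_ms(image_encoder, torch.rand(1, 3, 224, 224)))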

Examples
  • Classify the following image: a cat sitting on a windowsill, looking outside. → The image is classified as: Cat
  • What is the similarity between the following image and text: an image of a sunset and the text 'a beautiful sunset'? → The similarity between the image and text is: 0.85 (highly similar)
  • Classify the following image: a dog running in a park. → The image is classified as: Dog

For instance, you can use the model to classify an image against a set of text labels, or to score how well a piece of text describes an image.

Precision

The model supports FP16 precision.

Primary Compute Unit

The model uses the NPU (Neural Processing Unit) as its primary compute unit.

Limitations

The model has some limitations that are important to consider.

Image Input Resolution

The model is optimized for images with a resolution of 224x224 pixels. If you try to use images with a different resolution, the model might not work as well.

Text Context Length

The model can only handle text inputs with a maximum length of 77 tokens. If you need to process longer texts, you might need to split them into smaller chunks.
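
Here is a minimal sketch of such chunking, using whitespace tokens as a stand-in for CLIP's real BPE tokenizer (so the token counts are only approximate):

def chunk_text(text, max_tokens=77):
    """Split a long string into chunks of at most max_tokens whitespace tokens."""
    tokens = text.split()
    return [' '.join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

long_text = 'a very long description ' * 50   # roughly 200 tokens
chunks = chunk_text(long_text)
print(len(chunks), [len(c.split()) for c in chunks])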

Device Compatibility

The model is optimized for certain devices, such as the Samsung Galaxy S23 and Snapdragon 8 Elite QRD. If you try to run the model on a different device, it might not work as well or at all.

Inference Time

The model’s inference time can vary depending on the device and runtime used. For example, on the Samsung Galaxy S23, the estimated inference time is around 34.6 ms for the CLIPImageEncoder and 5.8 ms for the CLIPTextEncoder.

Peak Memory Usage

The model’s peak memory usage can also vary depending on the device and runtime used. For example, on the Samsung Galaxy S23, the estimated peak memory usage is around 57 MB for the CLIPImageEncoder and 24 MB for the CLIPTextEncoder.

Number of Parameters

The model has a large number of parameters, with around 76.0M parameters for the CLIPTextEncoder and 115M parameters for the CLIPImageEncoder. This can make it difficult to deploy the model on devices with limited resources.

Model Size

The model’s size can also be a limitation, with the CLIPTextEncoder being around 290 MB and the CLIPImageEncoder being around 437 MB.
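
As a rough sanity check, the listed sizes are consistent with about four bytes of storage per parameter; the bytes-per-parameter figure below is an assumption used for the arithmetic, not a published detail:

# Back-of-the-envelope size estimate from the parameter counts in the table
for name, params in [('CLIPImageEncoder', 115e6), ('CLIPTextEncoder', 76.0e6)]:
    size_mb = params * 4 / (1024 ** 2)   # assumed ~4 bytes per parameter
    print(f'{name}: ~{size_mb:.0f} MB')  # prints ~439 MB and ~290 MB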

Format

The model accepts two types of inputs:

  • Images: The model expects images to be resized to 224x224 pixels and normalized to have pixel values between 0 and 1.
  • Text: The model expects text inputs to be tokenized and have a maximum length of 77 tokens.

For example, you can pre-process an image and a text prompt as follows before passing them to the model.

import numpy as np
import torch
from PIL import Image

# Load an image and resize it to the expected 224x224 input resolution
image = Image.open('image.jpg').convert('RGB')
image = image.resize((224, 224))

# Convert to a float tensor with pixel values in [0, 1], then reorder to the
# (batch, channels, height, width) layout vision models typically expect
image = torch.from_numpy(np.array(image)).float() / 255.0
image = image.permute(2, 0, 1).unsqueeze(0)  # shape: (1, 3, 224, 224)

# Tokenize a text input and truncate it to the 77-token context length.
# Whitespace splitting with a placeholder vocabulary is used here only for
# illustration; a real deployment would use CLIP's BPE tokenizer.
text = 'This is a sample text input.'
text_tokens = text.split()[:77]
vocab = {token: idx for idx, token in enumerate(sorted(set(text_tokens)))}
token_ids = [vocab[token] for token in text_tokens]

# Pad to the fixed context length and add a batch dimension
token_ids += [0] * (77 - len(token_ids))
text_tensor = torch.tensor([token_ids])  # shape: (1, 77)

You can then pass these pre-processed inputs to the model to get the output features. For example:

# Illustrative usage: OpenAI_Clip is a stand-in for however the deployed model
# is loaded in your environment (for example, as separate CLIPImageEncoder and
# CLIPTextEncoder modules); the exact loading call depends on your package.
model = OpenAI_Clip()
image_features = model(image)        # embedding for the pre-processed image
text_features = model(text_tensor)   # embedding for the tokenized text
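
From there, a typical next step is to compare the two embeddings, for example with cosine similarity. This sketch continues the snippet above; the exact output shapes depend on how the encoders are exported:

import torch.nn.functional as F

# Cosine similarity between the image and text embeddings from the snippet
# above; values near 1 indicate a strong image-text match
similarity = F.cosine_similarity(image_features, text_features, dim=-1)
print(float(similarity))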