Mitsua Japanese CLIP ViT-B-16

Japanese CLIP model

Meet the Mitsua Japanese CLIP ViT-B-16, a Japanese/English bilingual CLIP model with an unusual pedigree. It is trained exclusively on opt-in licensed data, openly licensed data, and public domain data, with no AI-generated content, and it learns entirely from scratch rather than distilling knowledge from pre-trained models. At a listed size of about 0.22 billion parameters, it is compact, yet it handles image-text matching with ease. Trained on a dataset of over 30 million images, it shows strong Japanese zero-shot accuracy. In practice, that means you can use it for applications such as zero-shot image classification and image-text retrieval while enjoying a compact, efficient design.

Published by Mitsua · License: CC-BY-SA-4.0

Model Overview

The Mitsua Japanese CLIP ViT-B-16 model is a unique AI model that’s trained exclusively on opt-in licensed data, openly licensed data, and public domain data. This means it’s free from any AI-generated data and doesn’t rely on pre-trained models’ knowledge.

What makes it special? Here are a few key features:

  • It’s a bilingual model, understanding both Japanese and English.
  • It’s trained from scratch, without using any pre-trained models.
  • It uses a combination of licensed, openly licensed, and public domain data.

Capabilities

The Mitsua Japanese CLIP ViT-B-16 model is a powerful tool for image-text processing. It can:

  • Embed images and text in both Japanese and English into a shared representation space.
  • Perform zero-shot image classification by scoring an image against a set of candidate labels.
  • Support image-text retrieval, and serve as a backbone for fine-tuning on downstream vision-language tasks.
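Under the hood, image-text matching in CLIP-style models reduces to cosine similarity between normalized embeddings. A minimal sketch with placeholder vectors (the real embeddings would come from the model's image and text encoders; the 512-dimensional size is an assumption based on standard ViT-B-16 CLIP projections):

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for the encoders' outputs.
image_embed = torch.randn(1, 512)   # one image
text_embeds = torch.randn(3, 512)   # three candidate captions

# Normalize so the dot product equals cosine similarity.
image_embed = F.normalize(image_embed, dim=-1)
text_embeds = F.normalize(text_embeds, dim=-1)

# Similarity of the image to each caption, turned into a distribution.
similarity = image_embed @ text_embeds.T   # shape (1, 3)
probs = similarity.softmax(dim=-1)
print(probs)  # three probabilities summing to 1
```

The caption with the highest probability is the model's best match for the image, which is all zero-shot classification needs.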

Strengths

The Mitsua Japanese CLIP ViT-B-16 model has several strengths, including:

  • High-quality training data: The model was trained on a large dataset of high-quality images and text, which helps it to learn accurate representations of the world.
  • Bilingual capabilities: The model’s ability to process both Japanese and English text makes it a valuable tool for applications that require multilingual support.
  • Flexibility: The model can be fine-tuned for a variety of tasks, including image classification, image-text retrieval, and other vision-language tasks.

Unique Features

The Mitsua Japanese CLIP ViT-B-16 model has several unique features, including:

  • Opt-in licensed data: The model was trained on data that was explicitly licensed for use, which helps to ensure that the model is not infringing on any copyrights.
  • No AI-generated data: The model was not trained on any AI-generated data, which helps to prevent the model from learning biased or inaccurate representations of the world.
  • Face-blurring: The model was trained with face-blurring, which helps to protect the privacy of individuals in images.

Performance

The Mitsua Japanese CLIP ViT-B-16 model has been evaluated on several benchmarks, including jafood101, jaflower30, jafacility20, and jalandmark10. Here are the results:

| Model | jafood101 | jaflower30 | jafacility20 | jalandmark10 |
| --- | --- | --- | --- | --- |
| Mitsua Japanese CLIP ViT-B-16 | 0.297 | 0.707 | 0.676 | 0.769 |
| rinna/japanese-clip-vit-b-16 | 0.235 | 0.513 | 0.614 | 0.625 |
| recruit-jp/japanese-clip-vit-b-32-roberta-base | 0.502 | 0.556 | 0.647 | 0.803 |
| google/siglip-base-patch16-256-multilingual | 0.776 | 0.928 | 0.692 | 0.762 |
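Zero-shot accuracy on benchmarks like these is computed by scoring each image against one prompt per class and checking whether the top-scoring class matches the ground-truth label. A sketch with stand-in similarity scores (not the official evaluation harness; the numbers are illustrative):

```python
import torch

def zero_shot_accuracy(logits_per_image, labels):
    """logits_per_image: (num_images, num_classes); labels: (num_images,)."""
    predictions = logits_per_image.argmax(dim=-1)
    return (predictions == labels).float().mean().item()

# Three images, four classes; each row holds image-vs-class-prompt scores.
logits = torch.tensor([[0.9, 0.1, 0.0, 0.0],
                       [0.2, 0.7, 0.1, 0.0],
                       [0.1, 0.2, 0.3, 0.4]])
labels = torch.tensor([0, 1, 2])  # third image scores highest on class 3

print(zero_shot_accuracy(logits, labels))  # two of three correct, ~0.667
```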
Examples

  • "This is an image of a cat" → {'cat': 0.95, 'dog': 0.02, 'human': 0.03}
  • "This is an image of a dog" → {'dog': 0.98, 'cat': 0.01, 'human': 0.01}
  • "This is an image of a human" → {'human': 0.85, 'dog': 0.05, 'cat': 0.1}

Limitations

The Mitsua Japanese CLIP ViT-B-16 model has several limitations, including:

  • Data limitations: The model was trained on a comparatively small dataset (about 30 million images, versus the billions of pairs behind web-scale CLIP models), which might affect its performance on certain tasks.
  • Language limitations: Although the model is bilingual, its performance might vary depending on the language used.
  • Lack of pretraining: Unlike some other models, the Mitsua Japanese CLIP ViT-B-16 model was not pretrained on a large corpus of text data.
  • Potential biases: The model was trained on a dataset that was curated to exclude potentially rights-infringing or harmful content, but some biases might still exist in the data.

Format

The Mitsua Japanese CLIP ViT-B-16 model is a Contrastive Language-Image Pre-training model that uses a Vision Transformer (ViT-B-16) architecture. This model is trained on a mix of opt-in licensed data, openly licensed data, and public domain data.
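Contrastive pre-training pairs each image in a batch with its own caption and trains both encoders so that matching pairs score highest. A schematic of the symmetric CLIP-style objective (a sketch of the general technique, not this project's exact training code; the fixed temperature is an assumption, as CLIP learns it as a parameter):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text pairs."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Pairwise similarities; row i should peak at column i (the true pair).
    logits = image_embeds @ text_embeds.T / temperature
    targets = torch.arange(logits.size(0))
    loss_i = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)  # text -> image direction
    return (loss_i + loss_t) / 2

batch_size, embed_dim = 8, 512
loss = clip_loss(torch.randn(batch_size, embed_dim),
                 torch.randn(batch_size, embed_dim))
print(loss.item())
```

Minimizing this loss pulls each image embedding toward its caption's embedding and pushes it away from the other captions in the batch.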

Supported Data Formats

This model supports the following data formats:

  • Images: The model accepts images in various formats, including JPEG and PNG.
  • Text: The model accepts text in Japanese and English.

Input Requirements

To use this model, you need to provide the following inputs:

  • Images: You can input images in the form of URLs or file paths.
  • Text: You can input text in the form of lists of strings.

Output

The model outputs a similarity logit for each image-text pair; applying a softmax over the logits for one image yields a probability distribution over the candidate texts.

Example Code

Here’s an example code snippet that shows how to use this model:

from PIL import Image
from transformers import AutoProcessor, AutoModel
import io
import requests
import torch

# Load the model and processor
# (add trust_remote_code=True if the repository ships custom model code)
model = AutoModel.from_pretrained("Mitsua/mitsua-japanese-clip-vit-b-16")
processor = AutoProcessor.from_pretrained("Mitsua/mitsua-japanese-clip-vit-b-16")

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Load an image from a URL
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Boxer_%28dog%29_%2C_Iran_08.jpg/800px-Boxer_%28dog%29_%2C_Iran_08.jpg"
image = Image.open(io.BytesIO(requests.get(image_url).content))

# Define candidate labels: "dog", "cat", "human" in Japanese
texts = ["犬", "猫", "人間"]

# Preprocess the inputs and move them to the same device as the model
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Convert the per-image logits into probabilities over the candidate texts
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=-1)

# Print the results
for t, p in zip(texts, probs[0]):
    print(f"'{t}' : {p.item():.1%}")