Mitsua Japanese CLIP ViT-B-16
Meet the Mitsua Japanese CLIP ViT-B-16, a bilingual Japanese/English CLIP model with an unusual pedigree. What makes it stand out? For starters, it's trained exclusively on opt-in licensed data, openly licensed data, and public domain data - no AI-generated content in sight. It is also trained entirely from scratch, without relying on any pre-trained model's knowledge. At roughly 0.221B parameters it's surprisingly compact, but don't let its size fool you - it handles image-text matching with ease. Trained on a dataset of over 30 million images, it has shown impressive results in Japanese zero-shot accuracy. So, what does this mean for you? It means you can use this model for applications such as zero-shot image classification and image-text retrieval, all while enjoying the benefits of a compact and efficient design.
Model Overview
The Mitsua Japanese CLIP ViT-B-16 model is a unique AI model that’s trained exclusively on opt-in licensed data, openly licensed data, and public domain data. This means it’s free from any AI-generated data and doesn’t rely on pre-trained models’ knowledge.
What makes it special? Here are a few key features:
- It’s a bilingual model, understanding both Japanese and English.
- It’s trained from scratch, without using any pre-trained models.
- It uses a combination of licensed, openly licensed, and public domain data.
Capabilities
The Mitsua Japanese CLIP ViT-B-16 model is a powerful tool for image-text processing. It can:
- Understand images and text in both Japanese and English.
- Be used for tasks like zero-shot image classification, image-text retrieval, and more.
- Produce image and text embeddings that can support downstream tasks such as image captioning (when paired with a separate decoder).
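At its core, image-text matching in a CLIP-style model comes down to comparing image and text embeddings by cosine similarity: the best-matching caption is the one whose embedding lies closest to the image's. A minimal sketch of that comparison, using made-up placeholder vectors in place of real encoder outputs:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings standing in for the model's encoder outputs
image_embedding = [0.8, 0.1, 0.3]
text_embeddings = {
    "犬": [0.7, 0.2, 0.4],  # "dog"
    "猫": [0.1, 0.9, 0.2],  # "cat"
}

# Rank candidate captions by similarity to the image embedding
best = max(text_embeddings, key=lambda t: cosine_similarity(image_embedding, text_embeddings[t]))
print(best)  # the caption whose embedding is closest to the image's
```

In the real model, the embeddings come from its vision and text encoders; the toy vectors here only illustrate the ranking step.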
Strengths
The Mitsua Japanese CLIP ViT-B-16 model has several strengths, including:
- High-quality training data: The model was trained on a large dataset of high-quality images and text, which helps it to learn accurate representations of the world.
- Bilingual capabilities: The model’s ability to process both Japanese and English text makes it a valuable tool for applications that require multilingual support.
- Flexibility: The model can be fine-tuned for a variety of tasks, including image classification and image-text retrieval.
Unique Features
The Mitsua Japanese CLIP ViT-B-16 model has several unique features, including:
- Opt-in licensed data: The model was trained on data that was explicitly licensed for use, which helps to ensure that the model is not infringing on any copyrights.
- No AI-generated data: The model was not trained on any AI-generated data, which helps to prevent the model from learning biased or inaccurate representations of the world.
- Face-blurring: The model was trained with face-blurring, which helps to protect the privacy of individuals in images.
Performance
The Mitsua Japanese CLIP ViT-B-16 model has been evaluated on several benchmarks, including jafood101, jaflower30, jafacility20, and jalandmark10. Here are the results:
| Model | jafood101 | jaflower30 | jafacility20 | jalandmark10 |
|---|---|---|---|---|
| Mitsua Japanese CLIP ViT-B-16 | 0.297 | 0.707 | 0.676 | 0.769 |
| rinna/japanese-clip-vit-b-16 | 0.235 | 0.513 | 0.614 | 0.625 |
| recruit-jp/japanese-clip-vit-b-32-roberta-base | 0.502 | 0.556 | 0.647 | 0.803 |
| google/siglip-base-patch16-256-multilingual | 0.776 | 0.928 | 0.692 | 0.762 |
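The benchmark numbers above are zero-shot top-1 accuracies: each image is scored against every class name, and the prediction counts as correct when the highest-scoring class matches the ground-truth label. A minimal sketch of that metric, using hypothetical predictions and labels:

```python
def top1_accuracy(predictions, labels):
    """Fraction of examples whose highest-scoring prediction equals the true label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Hypothetical argmax predictions and ground-truth labels for four images
predictions = ["sushi", "ramen", "tempura", "ramen"]
labels      = ["sushi", "ramen", "soba",    "ramen"]
print(top1_accuracy(predictions, labels))  # 0.75
```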
Limitations
The Mitsua Japanese CLIP ViT-B-16 model has several limitations, including:
- Data limitations: The model was trained on a dataset that is relatively small by CLIP standards (around 30 million images), which might affect its performance on certain tasks.
- Language limitations: Although the model is bilingual, its performance might vary depending on the language used.
- Lack of pretraining: Unlike some other models, the Mitsua Japanese CLIP ViT-B-16 model was not pretrained on a large corpus of text data.
- Potential biases: The model was trained on a dataset that was curated to exclude potentially rights-infringing or harmful content, but some biases might still exist in the data.
Format
The Mitsua Japanese CLIP ViT-B-16 model is a Contrastive Language-Image Pre-training model that uses a Vision Transformer (ViT-B-16) architecture. This model is trained on a mix of opt-in licensed data, openly licensed data, and public domain data.
Supported Data Formats
This model supports the following data formats:
- Images: The model accepts images in various formats, including JPEG and PNG.
- Text: The model accepts text in Japanese and English.
Input Requirements
To use this model, you need to provide the following inputs:
- Images: You can input images as PIL images (loaded from URLs or file paths).
- Text: You can input text in the form of lists of strings.
Output
The model outputs a similarity logit for each image-text pair; applying a softmax over the text candidates turns these logits into a probability distribution indicating how well each text matches the image.
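To make that concrete, here is the logits-to-probabilities step in isolation, with made-up logit values standing in for the model's output for one image against three text candidates:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one image against three candidate texts
logits_per_image = [4.0, 1.0, 0.5]
probs = softmax(logits_per_image)
print([round(p, 3) for p in probs])  # probabilities summing to 1, highest logit wins
```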
Example Code
Here’s an example code snippet that shows how to use this model:
from PIL import Image
from transformers import AutoProcessor, AutoModel
import io
import requests
import torch
# Load the model and processor
model = AutoModel.from_pretrained("Mitsua/mitsua-japanese-clip-vit-b-16")
processor = AutoProcessor.from_pretrained("Mitsua/mitsua-japanese-clip-vit-b-16")
# Move the model to the GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Load an image from a URL
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Boxer_%28dog%29_%2C_Iran_08.jpg/800px-Boxer_%28dog%29_%2C_Iran_08.jpg"
image = Image.open(io.BytesIO(requests.get(image_url).content))
# Define a list of candidate texts ("dog", "cat", "human")
texts = ["犬", "猫", "人間"]
# Preprocess the inputs and move them to the same device as the model
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
# Convert the per-image logits to probabilities over the text candidates
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=-1)
# Print the match probability for each text
for t, p in zip(texts, probs[0]):
    print(f"'{t}' : {p:.1%}")


