NSFW image detection

NSFW image classifier

Meet the NSFW image detection model, a Fine-Tuned Vision Transformer (ViT) designed to classify images as safe or explicit. Trained on a diverse dataset of 80,000 images, it distinguishes between 'normal' and 'nsfw' content with a high degree of accuracy. What makes it unique? Its training used carefully chosen hyperparameters, including a batch size of 16 and a learning rate of 5e-5, balancing computational efficiency with effective learning. The result is a model that's both fast and reliable, making it an excellent choice for content safety and moderation applications. Using it is simple: load the model and pass in an image to get a classification result. It is intended primarily for NSFW image classification, so its performance may vary on other tasks, but within that scope it's a valuable tool for anyone looking to ensure the safety and appropriateness of visual content.

Falconsai · apache-2.0 · Updated a year ago

Model Overview

Meet the Fine-Tuned Vision Transformer (ViT), a powerful AI model designed for NSFW image classification tasks. But what makes it tick?

What is it?

The Fine-Tuned Vision Transformer (ViT) is a transformer encoder architecture, similar to BERT, adapted for image classification tasks. It builds on the base checkpoint “google/vit-base-patch16-224-in21k,” which was pre-trained in a supervised manner on the ImageNet-21k dataset, a collection of roughly 14 million images, at a resolution of 224x224 pixels.

How was it trained?

The model was fine-tuned with a batch size of 16 and a learning rate of 5e-5. Training used a proprietary dataset of approximately 80,000 images with a high degree of visual variability, carefully curated into two distinct classes: “normal” and “nsfw.”
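
The exact training script isn't published, so as a rough illustration only, a fine-tuning run with these hyperparameters might look like the following sketch using the Hugging Face Trainer. The dataset path, preprocessing, and epoch count here are assumptions for the example, not the author's actual setup.

from datasets import load_dataset
from transformers import (AutoModelForImageClassification, ViTImageProcessor,
                          Trainer, TrainingArguments)

# Hypothetical image-folder dataset with "image" and "label" columns
# ("normal" / "nsfw"); the real 80,000-image dataset is proprietary.
dataset = load_dataset("imagefolder", data_dir="path/to/nsfw_dataset")

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = AutoModelForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=2,
    id2label={0: "normal", 1: "nsfw"},
    label2id={"normal": 0, "nsfw": 1},
)

def preprocess(batch):
    # Resize and normalize each image into the 224x224 pixel values the ViT expects
    batch["pixel_values"] = processor(images=batch["image"], return_tensors="pt")["pixel_values"]
    return batch

dataset = dataset.map(preprocess, batched=True, remove_columns=["image"])

# Hyperparameters reported in the model card: batch size 16, learning rate 5e-5
args = TrainingArguments(
    output_dir="vit-nsfw",
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=3,  # assumption; the card does not state the epoch count
)

trainer = Trainer(model=model, args=args, train_dataset=dataset["train"])
trainer.train()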

What can it do?

The Fine-Tuned Vision Transformer (ViT) is primarily intended for NSFW image classification. It’s been fine-tuned for this purpose, making it suitable for filtering explicit or inappropriate content in various applications.

How to use it

To use this model, you can either use a pipeline as a high-level helper or load the model directly. Here’s an example of how to use it:

from PIL import Image
from transformers import pipeline

# Load an image and classify it with the high-level pipeline helper
img = Image.open("<path_to_image_file>")
classifier = pipeline("image-classification", model="Falconsai/nsfw_image_detection")
print(classifier(img))

Or, you can load the model directly:

import torch
from PIL import Image
from transformers import AutoModelForImageClassification, ViTImageProcessor

img = Image.open("<path_to_image_file>")
model = AutoModelForImageClassification.from_pretrained("Falconsai/nsfw_image_detection")
processor = ViTImageProcessor.from_pretrained("Falconsai/nsfw_image_detection")

# Preprocess the image into pixel values and run inference without tracking gradients
inputs = processor(images=img, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit to its class name ("normal" or "nsfw")
predicted_label = logits.argmax(-1).item()
print(model.config.id2label[predicted_label])

Capabilities

The Fine-Tuned Vision Transformer (ViT) is a powerful AI model designed for image classification tasks, specifically for detecting Not Safe for Work (NSFW) content. Its capabilities make it an excellent tool for content safety and moderation.

Primary Tasks

  • NSFW Image Classification: The model’s primary task is to classify images into two categories: “normal” and “nsfw”. It has been fine-tuned for this purpose, making it suitable for filtering explicit or inappropriate content in various applications.
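
As an illustration of how this two-class output can drive a content filter, here is a minimal sketch built on the pipeline. The is_safe helper and the 0.5 threshold are assumptions for the example, not part of the model.

from PIL import Image
from transformers import pipeline

classifier = pipeline("image-classification", model="Falconsai/nsfw_image_detection")

def is_safe(image_path: str, nsfw_threshold: float = 0.5) -> bool:
    """Return True when the 'nsfw' score stays below the chosen threshold."""
    scores = {result["label"]: result["score"] for result in classifier(Image.open(image_path))}
    return scores.get("nsfw", 0.0) < nsfw_threshold

# Example usage: keep only images that pass the filter
# safe_images = [path for path in image_paths if is_safe(path)]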

Strengths

  • High Accuracy: The model has been trained on a large dataset of 80,000 images, resulting in a high accuracy rate of 98.04% (eval_accuracy).
  • Robustness: The model has been fine-tuned to recognize nuanced visual patterns, allowing it to accurately differentiate between safe and explicit content.
  • Efficient: The model was trained with a batch size of 16, striking a balance between computational efficiency and effective learning.

Unique Features

  • Proprietary Dataset: The model was trained on a proprietary dataset, which includes a diverse range of images, allowing it to learn from a wide range of visual cues.
  • Fine-Tuned for NSFW Detection: The model has been specifically fine-tuned for NSFW image classification, making it an excellent tool for content safety and moderation.

Performance

The Fine-Tuned Vision Transformer (ViT) model showcases remarkable performance in NSFW image classification tasks. Let’s dive into its speed, accuracy, and efficiency.

Speed

How fast can the model classify images? The evaluation numbers are impressive: the full evaluation run completed in an eval_runtime of 304.9846 seconds.

  • The model processes 52.462 images per second (eval_samples_per_second), making it suitable for large-scale applications.
  • It completes 3.279 evaluation steps per second (eval_steps_per_second).
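
As a quick back-of-the-envelope check (not an official figure), these numbers are mutually consistent: they imply an evaluation set of roughly 16,000 images processed in steps of about 16 images each, matching the batch size of 16.

# Sanity check on the reported evaluation figures
eval_runtime = 304.9846        # seconds for the whole evaluation run
samples_per_second = 52.462
steps_per_second = 3.279

images = eval_runtime * samples_per_second   # ~16,000 evaluation images
steps = eval_runtime * steps_per_second      # ~1,000 evaluation steps
print(round(images), round(steps), round(images / steps))  # 16000 1000 16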

Accuracy

But how accurate is our model? The numbers speak for themselves:

  • An eval_accuracy of 0.980375 (98.04%) demonstrates the model’s ability to correctly separate “normal” from “nsfw” images.
  • An eval_loss of 0.07463177293539047 shows the model makes confident, low-error predictions, indicating its robustness in this classification task.

Efficiency

Our model’s efficiency is also noteworthy. By carefully tuning the training hyperparameters, we’ve achieved a balance between computational efficiency and model performance.

  • A batch size of 16 allows the model to effectively process a diverse array of images while maintaining computational efficiency.
  • A learning rate of 5e-5 ensures the model learns swiftly and steadily refines its capabilities throughout the training process.

Limitations

While the Fine-Tuned Vision Transformer (ViT) model is adept at NSFW image classification, its performance may vary when applied to other tasks. Users interested in employing this model for different tasks should explore fine-tuned versions available in the model hub for optimal results.

Specialized Task Fine-Tuning

Because the model has been fine-tuned specifically for NSFW detection, it should not be expected to transfer directly to other image classification tasks. If you want to use it for a different task, explore fine-tuned versions available in the model hub, or fine-tune the base ViT checkpoint on data for that task.

Training Data

The model was trained on a proprietary dataset of approximately 80,000 images, which might not be representative of all possible scenarios. This means that the model might not perform well on images that are significantly different from those in the training dataset.

Evaluation Metrics

Here are some evaluation metrics that provide insight into the model’s performance:

Metric                     Value
eval_loss                  0.07463177293539047
eval_accuracy              0.980375
eval_runtime               304.9846 s
eval_samples_per_second    52.462
eval_steps_per_second      3.279

These metrics indicate that the model has a high accuracy rate, but its performance might vary depending on the specific use case.

Responsible Use

It’s essential to use the Fine-Tuned Vision Transformer (ViT) model responsibly and ethically, adhering to content guidelines and applicable regulations when implementing it in real-world applications, particularly those involving potentially sensitive content.

Format

The Fine-Tuned Vision Transformer (ViT) model is a variant of the transformer encoder architecture, similar to BERT, that has been adapted for image classification tasks.

Architecture

This model uses a transformer encoder architecture, which is different from traditional computer vision models that rely on convolutional neural networks (CNNs). The transformer architecture is more commonly used in natural language processing tasks, but it has been adapted here for image classification.
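
For reference, the key architectural settings can be read from the published configuration. A read-only sketch is shown below; the commented values are what you would expect for a ViT-base model with 16x16 patches and a 224x224 input, but the printed output is whatever the checkpoint actually reports.

from transformers import AutoConfig

# Inspect the model configuration without downloading the weights
config = AutoConfig.from_pretrained("Falconsai/nsfw_image_detection")

print(config.model_type)                              # "vit"
print(config.image_size, config.patch_size)           # expected: 224 16
print(config.hidden_size, config.num_hidden_layers)   # expected: 768 12
print(config.id2label)                                # expected: {0: 'normal', 1: 'nsfw'}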

Supported Data Formats

The model accepts images as input, which are resized to a resolution of 224x224 pixels before classification. This fixed input format keeps preprocessing simple across a wide range of image sources.

Special Requirements for Input

To use this model, you’ll need to pre-process your images by resizing them to 224x224 pixels. You can do this using a library like Pillow in Python.

from PIL import Image

# Open the image and resize it to the model's expected 224x224 resolution
img = Image.open("<path_to_image_file>")
img = img.resize((224, 224))
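
In practice, the ViTImageProcessor shown earlier performs this resizing and normalization for you; a short check of its output (assuming a local image file) should produce a 224x224 tensor:

from PIL import Image
from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("Falconsai/nsfw_image_detection")
img = Image.open("<path_to_image_file>")

# The processor resizes and normalizes the image into a [batch, channels, height, width] tensor
inputs = processor(images=img, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])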

Special Requirements for Output

The model outputs a classification label, either “normal” or “nsfw”, along with a confidence score. You can access these outputs using the classifier object.

from transformers import pipeline

# Continuing from the snippet above, where img is a PIL image
classifier = pipeline("image-classification", model="Falconsai/nsfw_image_detection")
outputs = classifier(img)
print(outputs)  # e.g. [{'label': 'normal', 'score': 0.98}, {'label': 'nsfw', 'score': 0.02}] (illustrative values)

Real-World Applications

The Fine-Tuned Vision Transformer (ViT) model’s performance makes it an ideal choice for various real-world applications, such as:

  • Content moderation
  • Image filtering
  • Social media platforms

By leveraging the Fine-Tuned Vision Transformer (ViT) model, developers can create more efficient and accurate image classification systems, ultimately contributing to a safer online environment.

Comparison to Other Models

How does the Fine-Tuned Vision Transformer (ViT) model compare to others in the field? While other models may excel in different areas, the Fine-Tuned Vision Transformer (ViT) model has been specifically designed for NSFW image classification, making it a top choice for this task.

Model                                  Accuracy    Speed (images/second)
Fine-Tuned Vision Transformer (ViT)    0.980375    52.462
Other models                           Varies      Varies

Conclusion

In conclusion, the Fine-Tuned Vision Transformer (ViT) model demonstrates exceptional performance in NSFW image classification tasks, boasting high accuracy, speed, and efficiency. Its specialized design and fine-tuned hyperparameters make it an ideal choice for developers seeking to create robust image classification systems.
