Stable Diffusion 3 Medium

Text-to-image model

The Stable Diffusion 3 Medium model is a powerful text-to-image generator that produces high-quality images from text prompts. What makes it unique is its ability to understand complex prompts, render legible typography, and run efficiently on modest hardware. This makes it ideal for applications in design, artistic workflows, educational tools, and research on generative models. It's available under the Stability Community License; a paid Enterprise license is required for commercial use by organizations with annual revenues exceeding $1M.

Model Overview

The Stable Diffusion 3 Medium Model is a powerful text-to-image model developed by Stability AI. It’s designed to generate high-quality images based on text prompts, and it’s got some impressive features that set it apart from other models.

What makes it special?

  • Improved performance: It's got better image quality, typography, and complex prompt understanding compared to other models.
  • Multimodal Diffusion Transformer: It uses a diffusion transformer architecture (MMDiT) that generates images conditioned on embeddings from multiple pretrained text encoders.
  • Resource-efficient: It’s designed to be more efficient with resources, making it a great choice for a wide range of applications.

Capabilities

The model is a powerhouse when it comes to performance. Let’s break it down:

Speed

This model is designed to be fast and efficient. With its compact architecture and a typical setting of 28 denoising steps, it can turn a prompt into a finished image in seconds on a modern GPU, rather than minutes or hours. That's the kind of speed we're talking about here.
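Actual numbers depend heavily on your GPU, so the honest answer is to measure. Here's a minimal timing sketch; the prompt, warm-up, and step counts are placeholders, not recommended settings:

import time

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

# Warm-up run so one-time CUDA setup doesn't skew the measurement
pipe("warm-up", num_inference_steps=2)

start = time.perf_counter()
image = pipe("a lighthouse at sunset", num_inference_steps=28).images[0]
print(f"Generated one image in {time.perf_counter() - start:.1f}s")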

Accuracy

But speed is only half the story. The Stable Diffusion 3 Medium model is also strong on prompt adherence: it can parse complex, multi-part prompts and generate images that are both visually convincing and faithful to the input. Whether you're generating art, designing products, or simply exploring the possibilities of AI-generated images, this model delivers.

Efficiency

So, how does it achieve this impressive performance? The answer lies in its efficient design. The Multimodal Diffusion Transformer processes text and image tokens in one joint architecture, and the roughly 2-billion-parameter backbone is small enough to run on consumer GPUs. This means you can focus on creating, rather than waiting for your model to catch up.
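If GPU memory is the bottleneck rather than speed, diffusers offers standard offloading switches that trade some throughput for a much smaller VRAM footprint. A minimal sketch (requires the accelerate package; the prompt is a placeholder):

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)

# Move each submodule to the GPU only while it is needed;
# note we do NOT call pipe.to("cuda") when offloading is enabled.
pipe.enable_model_cpu_offload()

image = pipe("a watercolor fox", num_inference_steps=28).images[0]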

How it Works

The model uses three fixed, pre-trained text encoders:

  • OpenCLIP-ViT/G
  • CLIP-ViT/L
  • T5-xxl

These encoders turn the input prompt into text embeddings; the diffusion transformer then generates the image conditioned on those embeddings.
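In the diffusers implementation these correspond to the pipeline's text_encoder, text_encoder_2, and text_encoder_3 components. Because T5-xxl is by far the largest of the three, the diffusers documentation describes loading the pipeline without it to save memory, at some cost on typography-heavy or very long prompts. A minimal sketch:

import torch
from diffusers import StableDiffusion3Pipeline

# Load the pipeline without the third (T5-xxl) text encoder;
# the two CLIP encoders alone still handle most prompts well.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,
    tokenizer_3=None,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a cat holding a sign that says hello world", num_inference_steps=28).images[0]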

Example Use Case

Let’s say you want to generate an image of a cat holding a sign that says “hello world”. You can pass this text prompt to the model, and it will generate an image based on your input.

Examples
  • Prompt: "Generate an image of a futuristic cityscape with sleek skyscrapers and flying cars" → Image generated: a futuristic cityscape with sleek skyscrapers and flying cars
  • Prompt: "Create a portrait of Albert Einstein with a friendly smile" → Image generated: a portrait of Albert Einstein with a friendly smile
  • Prompt: "Draw a fantasy landscape with rolling hills, a castle, and a dragon flying overhead" → Image generated: a fantasy landscape with rolling hills, a castle, and a dragon flying overhead
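If you want to run these prompts yourself, here's a minimal sketch that loops over them and saves each result; the filenames and sampling settings are arbitrary choices, not recommendations:

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

prompts = [
    "Generate an image of a futuristic cityscape with sleek skyscrapers and flying cars",
    "Create a portrait of Albert Einstein with a friendly smile",
    "Draw a fantasy landscape with rolling hills, a castle, and a dragon flying overhead",
]

# Generate and save one image per prompt
for i, prompt in enumerate(prompts):
    pipe(prompt, num_inference_steps=28, guidance_scale=7.0).images[0].save(f"example_{i}.png")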

Format

The model supports the following data formats:

  • Text prompts: The model accepts text prompts as input, which are then processed by the text encoders to generate images.
  • Image data: The model outputs images (PIL images in the diffusers pipeline), which can be saved in standard formats such as PNG or JPEG.
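Since the pipeline returns standard PIL images, any format PIL supports works for saving. A minimal sketch, using a blank stand-in image so the snippet runs without a GPU:

from PIL import Image

# Stand-in for an image returned by the pipeline (pipe(...).images[0])
image = Image.new("RGB", (1024, 1024))

image.save("output.png")              # lossless PNG
image.save("output.jpg", quality=95)  # JPEG with an explicit quality setting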

Handling Inputs and Outputs

Here’s an example of how to handle inputs and outputs for this model using the diffusers library:

import torch
from diffusers import StableDiffusion3Pipeline

# Load the pretrained pipeline in half precision, then move it to the GPU
pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Calling the pipeline returns a list of PIL images; take the first one
image = pipe("A cat holding a sign that says hello world", negative_prompt="", num_inference_steps=28, guidance_scale=7.0).images[0]
image.save("cat.png")  # filename is arbitrary

In this example, we first load the pre-trained model using the from_pretrained method and move it to the CUDA device using the to method. Finally, we call the pipeline directly with a text prompt; it returns the generated image, which we save to disk.
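Outputs vary from run to run because sampling is stochastic. If you need reproducible results, you can pass a seeded generator; a minimal sketch (the seed and filename are arbitrary):

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

# Fixing the RNG makes runs repeatable for the same prompt and settings
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    "A cat holding a sign that says hello world",
    num_inference_steps=28,
    guidance_scale=7.0,
    generator=generator,
).images[0]
image.save("cat_seed42.png")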

Limitations

The Stable Diffusion 3 Medium model is a powerful tool for generating images from text prompts, but it's not perfect. Let's take a closer look at some of its limitations.

Training Data

The model was trained on a large dataset of 1 billion images, but this dataset may not be representative of all possible scenarios. For example, the model may not perform well on images that are significantly different from those in the training dataset.

Lack of Factual Accuracy

The model is not designed to generate factual or true representations of people or events. If you try to use it for this purpose, you may get inaccurate or misleading results.

Safety Concerns

While the model has been designed with safety in mind, there is still a risk of generating harmful or objectionable content. This could include toxic or biased content, or content that is not suitable for all audiences.

Key Statistics

Metric             Value
Training data      1 billion images
Fine-tuning data   30M high-quality aesthetic images
Preference data    3M images
Model type         Multimodal Diffusion Transformer (MMDiT)