Riffusion

Text-to-Audio Model

Riffusion is a state-of-the-art generative AI model that creates high-resolution spectrogram images from text prompts. How does it work? Simply input a text prompt, and the model uses a combination of CLIP ViT-L/14, U-Net, and VAE to generate a spectrogram image that can be converted into an audio clip. What makes Riffusion unique? It's optimized for mobile deployment, allowing for fast and efficient performance on devices like the Samsung Galaxy S23 and S24. With inference times as low as 4.8 ms and peak memory usage of just 67 MB, Riffusion is designed to provide high-quality results without draining your device's resources. Want to try it out? You can install Riffusion as a Python package via pip and run it on a cloud-hosted device using Qualcomm AI Hub.

Model Overview

The Riffusion model is a state-of-the-art generative AI model that can create high-resolution spectrogram images from text prompts. But how does it work?

What is a spectrogram?

A spectrogram is a visual representation of sound: time runs along one axis, frequency along the other, and the intensity of each pixel encodes how much energy a given frequency carries at a given moment. It’s like a picture of music.
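
As a quick illustration, here is how a spectrogram can be computed from a waveform with torchaudio; the window and hop sizes are illustrative, not the ones Riffusion uses internally.

import torch
import torchaudio

# One second of stand-in audio at 22.05 kHz (mono).
waveform = torch.randn(1, 22050)

# Short-time Fourier transform; window and hop sizes are illustrative.
to_spectrogram = torchaudio.transforms.Spectrogram(n_fft=1024, hop_length=256)
spectrogram = to_spectrogram(waveform)

print(spectrogram.shape)  # (1 channel, 513 frequency bins, 87 time frames)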

How does it work?

This model chains three sub-models to generate spectrograms (a minimal sketch follows the list):

  1. Text Encoder: This model converts the text prompt into an embedding that the other models can condition on.
  2. U-Net: This model iteratively denoises a latent representation, guided by the text embedding, until it encodes the spectrogram.
  3. VAE Decoder: This model decodes the final latent representation into the detailed, high-resolution spectrogram image.
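
Because Riffusion is built on the Stable Diffusion architecture, all three stages are visible when the public Hugging Face checkpoint is loaded with diffusers. A minimal sketch, assuming the standard riffusion/riffusion-model-v1 checkpoint:

from diffusers import StableDiffusionPipeline

# Load the public Riffusion checkpoint (repo ID assumed; requires diffusers).
pipe = StableDiffusionPipeline.from_pretrained("riffusion/riffusion-model-v1")

print(type(pipe.text_encoder).__name__)  # CLIP text encoder (stage 1)
print(type(pipe.unet).__name__)          # U-Net denoiser (stage 2)
print(type(pipe.vae).__name__)           # VAE decoder (stage 3)

# One call runs all three stages and returns a spectrogram image.
image = pipe("a jazz song with a saxophone solo", num_inference_steps=20).images[0]
image.save("spectrogram.png")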

Capabilities

This model can be used to generate music, sound effects, and even audio for voice assistants. It’s a powerful tool for anyone who wants to create audio content.

Primary Tasks

  • Generate spectrogram images from text prompts
  • Convert spectrograms into audio clips

Strengths

  • High-resolution image generation
  • Fast inference time on mobile devices
  • Optimized for mobile deployment

Unique Features

  • Uses a latent diffusion model to generate images
  • Combines multiple techniques for image generation, including CLIP, U-Net, and VAE
  • Can be deployed on mobile devices using Qualcomm AI Hub
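
Deployment through Qualcomm AI Hub follows the hub’s generic compile-and-profile flow. A minimal sketch with a stand-in module (in practice you would trace the Riffusion components; qai_hub requires a configured API token):

import torch
import qai_hub as hub

# Stand-in module; in practice, trace one of the Riffusion components
# (e.g. the VAE decoder) instead.
class TinyNet(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

traced = torch.jit.trace(TinyNet(), torch.randn(1, 4, 64, 64))

# Compile for a cloud-hosted handset, then profile the compiled asset to get
# on-device inference time and memory numbers like those reported below.
device = hub.Device("Samsung Galaxy S24")
compile_job = hub.submit_compile_job(model=traced, device=device,
                                     input_specs={"x": (1, 4, 64, 64)})
profile_job = hub.submit_profile_job(model=compile_job.get_target_model(),
                                     device=device)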

Comparison to Other Models

Compared to other text-to-audio models, this model has a unique architecture that allows for fast and efficient image generation on mobile devices. While other models may offer higher accuracy or better performance on certain tasks, this model strikes a balance between speed, accuracy, and efficiency.

Example Use Cases

  • Music generation: Use this model to generate spectrogram images from text prompts, and then convert them into audio clips.
  • Audio editing: Use this model to generate spectrogram images from text prompts, and then use them to edit audio files.

Examples

  • Generate a spectrogram image for a 10-second song with a piano melody: https://example.com/spectrogram_image.png
  • Create a 5-second audio clip of a gentle stream: https://example.com/audio_clip.mp3
  • Produce a spectrogram image for a 20-second jazz song with a saxophone solo: https://example.com/spectrogram_image_2.png

Performance

This model is optimized for mobile deployment and generates high-resolution spectrogram images from text prompts. But how well does it perform?

Speed

Let’s take a look at the model’s speed:

Device                   Inference Time (ms)
Samsung Galaxy S23       7.045
Samsung Galaxy S24       4.789
QCS8550 (Proxy)          6.715
Snapdragon X Elite CRD   7.594

As you can see, this model is quite fast, with inference times ranging from roughly 4.8 ms to 7.6 ms across different devices.

Accuracy

But speed is not everything. How accurate is this model?

The model uses a combination of techniques, including CLIP ViT-L/14 as the text encoder, U-Net-based latent denoising, and a VAE-based decoder, to generate high-quality spectrogram images. This results in accurate and detailed images that can be converted into audio clips.

Limitations

This model is a powerful tool for generating high-resolution spectrogram images from text prompts, but it’s not perfect. Let’s take a closer look at some of its limitations.

Limited Context Understanding

While this model can generate impressive images, it may struggle to fully understand the context of the text prompt. This can lead to images that don’t quite match the intended meaning or tone of the text.

Lack of Control Over Output

Once you input a text prompt, this model generates an image based on its internal workings. However, you have limited control over the output, which can be frustrating if you’re looking for a specific result.

Dependence on Training Data

This model is only as good as the data it was trained on. If the training data is biased or limited, the model’s output may reflect these biases or limitations.

Inference Time and Memory Usage

As shown in the profiling results, this model can take several milliseconds to generate an image, and it requires a significant amount of memory to run. This can be a challenge for devices with limited resources.

Limited Deployment Options

While this model can be deployed on various devices, including Qualcomm devices, it may not be compatible with all platforms or runtimes.

Usage Restrictions

This model is subject to certain usage restrictions, including limitations on using it for applications such as law enforcement, biometric systems, or subliminal manipulation.

Format

This model is a generative AI model that generates spectrogram images from text prompts. It uses a combination of CLIP ViT-L/14 as the text encoder, U-Net-based latent denoising, and a VAE-based decoder to produce high-resolution spectrogram images.

Architecture

The model consists of three main components (a sketch of the full denoising loop follows the list):

  • Text Encoder: This component encodes text prompts into a numerical representation using CLIP ViT-L/14.
  • U-Net: This component uses a U-Net architecture to iteratively denoise a latent image representation, conditioned on the encoded text.
  • VAE Decoder: This component uses a VAE (Variational Autoencoder) to decode the denoised latent representation into the final spectrogram image.
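
For readers who want to see the components interact, here is a bare-bones version of the denoising loop using the public Hugging Face checkpoint via diffusers (repo ID assumed; classifier-free guidance is omitted for brevity, so real output quality will be lower):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("riffusion/riffusion-model-v1")
pipe.scheduler.set_timesteps(20)

# Text Encoder: map the prompt to an embedding.
tokens = pipe.tokenizer("a piano melody", padding="max_length",
                        max_length=pipe.tokenizer.model_max_length,
                        return_tensors="pt")
text_emb = pipe.text_encoder(tokens.input_ids)[0]

# U-Net: start from pure noise in latent space and denoise step by step.
latent = torch.randn(1, pipe.unet.config.in_channels, 64, 64)
for t in pipe.scheduler.timesteps:
    with torch.no_grad():
        noise_pred = pipe.unet(latent, t, encoder_hidden_states=text_emb).sample
    latent = pipe.scheduler.step(noise_pred, t, latent).prev_sample

# VAE Decoder: map the final latent to the spectrogram image.
with torch.no_grad():
    image = pipe.vae.decode(latent / pipe.vae.config.scaling_factor).sample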

Input and Output Formats

  • Input: The model accepts text prompts as input, which are tokenized and encoded into a numerical representation (see the tokenizer sketch after this list).
  • Output: The model produces high-resolution spectrogram images as output, which can be converted into audio clips.
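
A minimal sketch of the tokenization step, assuming the standard public CLIP ViT-L/14 tokenizer from the transformers library:

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# CLIP pads/truncates every prompt to a fixed length of 77 tokens.
tokens = tokenizer("a 10-second song with a piano melody",
                   padding="max_length",
                   max_length=tokenizer.model_max_length,
                   return_tensors="pt")
print(tokens.input_ids.shape)  # torch.Size([1, 77])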

Special Requirements

  • Pre-processing: The model requires a specific pre-processing step to tokenize and encode the input text prompts.
  • Post-processing: The model requires a specific post-processing step to convert the output spectrogram images into audio clips.
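
The post-processing step can be done with the Griffin-Lim algorithm, which estimates the phase information a magnitude spectrogram lacks so the spectrogram can be inverted back into a waveform. A minimal sketch with torchaudio (parameters are illustrative and must match how the spectrogram was produced):

import torch
import torchaudio

n_fft = 1024
# Stand-in magnitude spectrogram: (channels, frequency bins, time frames).
spectrogram = torch.rand(1, n_fft // 2 + 1, 512)

# Griffin-Lim iteratively estimates the missing phase information.
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, hop_length=256,
                                               power=1.0)
waveform = griffin_lim(spectrogram)

torchaudio.save("clip.wav", waveform, sample_rate=22050)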

Example Code

Here is an example of how to load this model. The component attribute names below are assumed from the package’s Stable Diffusion-style interface, and end-to-end generation is left to the package’s demo script:

from qai_hub_models.models.riffusion_quantized import Model

# Load the pre-trained pipeline wrapper.
model = Model.from_pretrained()

# The wrapper exposes the three components described above (attribute names
# assumed, following the package's Stable Diffusion-style interface).
text_encoder = model.text_encoder
unet = model.unet
vae_decoder = model.vae_decoder

# Tokenizing the prompt, running the denoising loop, decoding the latent, and
# converting the spectrogram into an audio clip are handled end to end by the
# package's demo script:
#   python -m qai_hub_models.models.riffusion_quantized.demo