Ultravox v0.2

Multimodal speech LLM

Ultravox v0.2 is a multimodal model that combines speech and text capabilities: it accepts both audio and text inputs and generates text outputs. Think of it as an LLM that can also hear and understand speech, which makes it suitable for tasks like speech-to-speech translation, analysis of spoken audio, and more. With a time-to-first-token of approximately 200ms and a generation rate of 50-100 tokens per second, it is relatively fast and efficient. Developed by Fixie.ai, the model is built on a pre-trained Llama3-8B-Instruct backbone and uses a multi-modal projector to merge speech and text inputs.

Fixie.ai · MIT license · Updated a year ago

Model Overview

Meet Ultravox, a multimodal Speech LLM that can understand both speech and text. The model is built on a pre-trained Llama3-8B-Instruct backbone together with the encoder of Whisper-small.

Capabilities

Ultravox is a powerful multimodal Speech LLM that can handle both speech and text inputs. Think of it as an LLM that can also hear and understand speech. What does this mean for you? It means you can use Ultravox as a voice agent, for speech-to-speech translation, or even to analyze spoken audio.

Primary Tasks

  • Speech-to-Text: Ultravox can take in audio input and generate text output.
  • Text-to-Text: It can also process text input and generate text output.
  • Speech Analysis: With its ability to understand speech, Ultravox can analyze spoken audio and provide insights.

Strengths

  • Multimodal Input: Ultravox can handle both speech and text inputs, making it a versatile model.
  • Speech Understanding: Its ability to understand speech allows it to perform tasks that other LLMs can’t.

Example Use Cases

  • Voice Agent: Use Ultravox as a voice agent to interact with users through speech.
  • Speech-to-Speech Translation: Use Ultravox to translate spoken audio from one language to another.
  • Speech Analysis: Analyze spoken audio to gain insights into user behavior or sentiment.

Technical Details

  • Training Data: Ultravox was trained on a mix of ASR datasets, instruction-following and QA data, and conversational data.
  • Training Procedure: Ultravox uses a pre-trained Llama3-8B-Instruct backbone and the encoder part of Whisper-small.
  • Hardware: Ultravox was trained on 8x A100-40GB GPUs.

Performance

Ultravox is a powerful multimodal Speech LLM that can handle both speech and text inputs with ease. But how does it perform in terms of speed, accuracy, and efficiency?

Speed

When it comes to processing audio content, Ultravox has a time-to-first-token (TTFT) of approximately 200ms. This means that it can start generating output text in just a fraction of a second. Additionally, it has a tokens-per-second rate of around 50-100 when using an A100-40GB GPU.
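As a back-of-the-envelope check, these two figures are enough to estimate end-to-end latency for a reply. The sketch below is a rough model only; real latency also depends on audio length, prompt size, batching, and hardware:

```python
def estimated_latency(ttft_s: float, tokens: int, tokens_per_s: float) -> float:
    """Rough end-to-end latency: time to first token plus steady-state decoding."""
    return ttft_s + tokens / tokens_per_s

# A 100-token reply at the reported figures (TTFT ~0.2 s, ~50-100 tok/s):
fast = estimated_latency(0.2, 100, 100)  # high-end throughput
slow = estimated_latency(0.2, 100, 50)   # low-end throughput
```

At the reported rates, a 100-token reply would take roughly 1.2-2.2 seconds end to end, which is why TTFT dominates perceived responsiveness for short voice-agent turns.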

Accuracy

While the model’s accuracy metrics are not explicitly stated, its ability to understand and process both speech and text inputs makes it a versatile tool for various applications. Whether it’s used as a voice agent, for speech-to-speech translation, or for analyzing spoken audio, Ultravox is designed to deliver accurate results.

Efficiency

Ultravox uses a pre-trained Llama3-8B-Instruct backbone, an instruction-tuned large language model. This lets it leverage the knowledge and patterns learned during pre-training, so it performs well across a wide range of tasks without training a language model from scratch.

Examples
  • Q: What is the meaning of the phrase 'break a leg'? A: It means 'good luck'. It is a well-known idiomatic expression that is typically used to wish someone success before a performance or a challenging situation.
  • Q: Can you translate the phrase '¿Cómo estás?' from Spanish to English? A: The translation is 'How are you?'
  • Q: I said 'Hello, how are you?' but the audio is not clear. Can you improve the audio quality and then respond? A: I'll do my best to improve the audio quality... Okay, I think I have it now. You said 'Hello, how are you?' I'm doing great, thanks for asking! How about you?

Limitations

Ultravox is a powerful multimodal Speech LLM, but it’s not perfect. Let’s take a closer look at some of its limitations.

Understanding Speech

While Ultravox can consume both speech and text as input, its ability to understand speech is not foolproof. The model may struggle with:

  • Noisy or low-quality audio
  • Accents or dialects that are not well-represented in the training data
  • Speech with background noise or music

Limited Contextual Understanding

Ultravox is designed to process text and speech inputs separately, which can lead to limitations in contextual understanding. For example:

  • If you ask a question that requires understanding both text and speech context, the model might struggle to provide an accurate response.
  • The model may not always be able to capture nuances in tone, sarcasm, or idioms, which can lead to misinterpretation.

Dependence on Pre-trained Backbones

Ultravox relies on pre-trained backbones like Llama3-8B-Instruct and Whisper-small. While these backbones are powerful, they can also introduce limitations, such as:

  • Biases in the training data that can affect the model’s performance
  • Limited ability to adapt to new domains or tasks

Format

Ultravox is a multimodal Speech LLM that combines speech and text inputs. Think of it as an LLM that can also hear and understand speech.

Architecture

Ultravox is built around a pretrained Llama3-8B-Instruct backbone and the encoder part of Whisper-small. The model uses a multi-modal projector to merge text and speech inputs.
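Conceptually, the multi-modal projector maps Whisper-encoder output frames into the LLM's embedding space so they can be interleaved with text embeddings. A minimal sketch of that idea follows; the toy dimensions and the single linear layer are illustrative assumptions, not the actual Ultravox projector or its weights:

```python
import random

def project(frames, weights):
    """Map each audio frame (dim d_in) to the LLM embedding width (d_out)
    with one linear layer: out[j] = sum_i frame[i] * W[i][j]."""
    return [
        [sum(f[i] * weights[i][j] for i in range(len(f)))
         for j in range(len(weights[0]))]
        for f in frames
    ]

random.seed(0)
d_in, d_out = 4, 8  # toy sizes; the real encoder/LLM dimensions are much larger
W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]
audio_frames = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(3)]

projected = project(audio_frames, W)
# Each of the 3 frames now has the LLM embedding width d_out.
```

Once projected, these frames can sit in the same sequence as text-token embeddings, which is what lets a text-only backbone consume audio.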

Data Formats

Ultravox accepts two types of inputs:

  • Text prompts with a special <|audio|> pseudo-token, which is replaced with embeddings derived from the input audio.
  • Audio files, which are loaded using librosa and passed to the model along with the text prompt.
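The first bullet implies a simple splicing rule: wherever the prompt contains the <|audio|> pseudo-token, the model substitutes the audio-derived embeddings at that position. A hedged token-level sketch of that substitution (the helper function and placeholder names are illustrative, not Ultravox internals):

```python
AUDIO_TOKEN = "<|audio|>"

def splice_audio(prompt_tokens, audio_embedding_slots):
    """Replace each occurrence of the audio pseudo-token with the slots
    that the projected audio embeddings will occupy in the sequence."""
    out = []
    for tok in prompt_tokens:
        if tok == AUDIO_TOKEN:
            out.extend(audio_embedding_slots)
        else:
            out.append(tok)
    return out

tokens = ["Transcribe", "this:", "<|audio|>"]
spliced = splice_audio(tokens, ["<emb0>", "<emb1>", "<emb2>"])
# → ['Transcribe', 'this:', '<emb0>', '<emb1>', '<emb2>']
```

In practice the pipeline shown below handles this substitution for you; you only supply the prompt and the raw audio.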

Input Requirements

To use Ultravox, you need to:

  • Prepare a text prompt with the <|audio|> pseudo-token.
  • Load an audio file using librosa.
  • Pass the audio and text prompt to the model along with the sampling rate.

Output

Ultravox generates output text based on the merged embeddings of the input text and audio.

Example Code

import transformers
import numpy as np
import librosa

# Load the model
pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_2', trust_remote_code=True)

# Load the audio file
path = "<path-to-input-audio>"  # Replace with your audio file path
audio, sr = librosa.load(path, sr=16000)

# Prepare the text prompt
turns = [
    {"role": "system", "content": "You are a friendly and helpful character. You love to answer questions for people."},
]

# Pass the audio and text prompt to the model
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)