Ultravox v0.4.1 Llama 3.1 8B
Meet Ultravox, a multimodal Speech LLM that understands both text and speech. It's built on a pre-trained Llama 3.1 8B Instruct and Whisper large-v3-turbo backbone, making it a powerful tool for speech translation, voice agents, and more. What really sets it apart is its ability to take speech and text in the same prompt, allowing it to generate more accurate and natural responses. With a time-to-first-token of around 150ms and a generation rate of 50-100 tokens per second on an A100-40GB GPU, Ultravox is fast and efficient. Plus, it's been trained on a mix of ASR and speech translation datasets, giving it a solid grounding in spoken language. So, what can you do with Ultravox? Think of it as a voice agent that can also understand and respond to text-based inputs. It's a game-changer for anyone looking to create more interactive and engaging experiences.
Model Overview
The Ultravox model is a multimodal Speech LLM that can understand both speech and text. Think of it as an LLM that can also hear and understand speech.
Key Features
- Can consume both speech and text as input
- Uses a special `<|audio|>` pseudo-token to merge embeddings from input audio and text (see the sketch after this list)
- Can generate output text based on the merged embeddings
- Future revisions plan to support generation of semantic and acoustic audio tokens
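To give a feel for the `<|audio|>` merging mechanism, here is a minimal conceptual sketch (not the actual Ultravox implementation): the pseudo-token's position in the text embedding sequence is replaced by the projected audio embeddings before the sequence is fed to the LLM.

```python
import torch

def merge_audio_into_text(text_embeds: torch.Tensor,   # (T, hidden)
                          audio_embeds: torch.Tensor,  # (A, hidden)
                          audio_token_index: int) -> torch.Tensor:
    """Splice audio embeddings in place of the <|audio|> pseudo-token.

    Conceptual sketch only: the single pseudo-token at position
    `audio_token_index` is replaced by the whole audio embedding sequence.
    """
    before = text_embeds[:audio_token_index]
    after = text_embeds[audio_token_index + 1:]
    return torch.cat([before, audio_embeds, after], dim=0)  # (T - 1 + A, hidden)
```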
Capabilities
Ultravox is a powerful multimodal Speech LLM that can understand and process both speech and text inputs. Think of it as a super smart assistant that can hear and respond to voice commands, just like a human!
What can Ultravox do?
- Speech-to-Text: Convert spoken audio into text, so you can analyze or respond to it (see the sketch after this list).
- Speech-to-Speech Translation: Translate spoken audio from one language to another, in real-time!
- Voice Agent: Use Ultravox as a voice assistant to answer questions, provide information, or even control devices.
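As an example of the speech-to-text capability above, you can steer the model toward transcription purely through the prompt. This is a hedged sketch that reuses the `pipe`, `audio`, and `sr` objects from the Code Example section later on this page; the system prompt is an illustrative choice, not an official one.

```python
# Prompt-driven transcription (illustrative; reuses `pipe`, `audio`, `sr`
# from the Code Example section below).
asr_turns = [{
    "role": "system",
    "content": "Transcribe the user's speech exactly, with no commentary."
}]
transcript = pipe({'audio': audio, 'turns': asr_turns, 'sampling_rate': sr},
                  max_new_tokens=200)
print(transcript)
```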
How does Ultravox work?
- Multimodal Input: Ultravox takes in both text and speech, using a special `<|audio|>` token to merge the two into a single embedding sequence.
- Knowledge Distillation: The model is trained with a distillation loss that matches the logits of the text-based Llama 3.1 8B Instruct backbone, ensuring high-quality outputs (a sketch follows this list).
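To make the distillation idea concrete, here is a minimal sketch of that kind of loss. This is not the actual Ultravox training code; it assumes you already have per-token logits from the speech model (student) and from the text-only Llama 3.1 8B Instruct teacher run on the transcript.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student next-token distributions.

    Both tensors have shape (num_tokens, vocab_size). A sketch of the
    knowledge-distillation objective described above, not Ultravox's code.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # kl_div expects log-probabilities for the input and probabilities
    # for the target.
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2
```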
What makes Ultravox special?
- Fast Response Times: Ultravox responds to audio input with a time-to-first-token of approximately 150ms and generates 50-100 tokens per second on an A100-40GB GPU.
- Improved Translation: The model has shown modest improvements in translation evaluations, making it a good choice for speech-to-speech translation tasks.
Performance
Ultravox is a powerful multimodal model that can handle both speech and text inputs. But how does it perform in real-world tasks?
Speed
Let’s talk about speed. When it comes to processing audio content, Ultravox has a time-to-first-token (TTFT) of approximately 150ms. That’s fast! But what about generating text? Ultravox can produce around 50-100 tokens per second on an A100-40GB GPU. Not bad, right?
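If you want to sanity-check these numbers on your own hardware, a rough end-to-end timing works. Note this measures total latency rather than true TTFT, which requires a streaming API, and it assumes the `pipe`, `audio`, and `turns` objects from the Code Example section below.

```python
import time

start = time.perf_counter()
result = pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr},
              max_new_tokens=64)
elapsed = time.perf_counter() - start
# Rough upper bound: assumes all 64 tokens were generated and folds the
# prefill (TTFT) time into the total.
print(f"total latency: {elapsed:.2f}s, ~{64 / elapsed:.0f} tokens/sec")
```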
Accuracy
But speed isn’t everything. How accurate is Ultravox? Let’s look at some translation scores (higher is better):
| Language Pair | Ultravox | Other Models |
|---|---|---|
| en_ar | 11.17 | 12.28 |
| en_de | 25.47 | 27.13 |
| es_en | 37.11 | 39.16 |
| ru_en | 38.96 | 39.65 |
| en_ca | 27.46 | 29.94 |
| zh_en | 10.08 | 14.55 |
As the table shows, Ultravox is competitive across a range of translation pairs, though it trails the comparison models on every pair listed, most noticeably on zh_en.
Limitations
Ultravox is a powerful multimodal Speech LLM, but it’s not perfect. Let’s explore some of its limitations.
Limited Audio Generation Capabilities
Currently, Ultravox can only consume audio as input and generate text output. It can’t produce voice output or generate semantic and acoustic audio tokens, so use cases like voice agents or speech-to-speech translation need a separate text-to-speech stage.
No Preference Tuning
This version of Ultravox hasn’t undergone preference tuning, which means it may not always generate output that aligns with human preferences.
Limited Training Data
The training dataset is a mix of ASR datasets and speech translation datasets, which may not cover all possible scenarios or languages. This could lead to biases or inaccuracies in the model’s output.
Speed and Performance
While Ultravox can process audio input relatively quickly (~150ms time-to-first-token), its generation rate is limited to roughly 50-100 tokens per second on an A100-40GB GPU. This may not be sufficient for real-time applications or large-scale deployments.
Evaluation Metrics
Ultravox’s evaluation metrics show varying levels of performance across different languages and tasks. For example, its score on English-Arabic translation (en_ar) is far lower than on Spanish-English translation (es_en).
Comparison to Other Models
Ultravox’s performance is comparable to other models, but it may not always outperform them. For instance, its score on English-German translation (en_de) trails the comparison models, as the table in the Performance section shows.
These limitations highlight areas where Ultravox can be improved or fine-tuned for specific use cases.
Format
Overview
Ultravox is a multimodal Speech LLM that can handle both speech and text as input. It’s built around a pre-trained Llama 3.1 8B Instruct and Whisper large-v3-turbo backbone.
Architecture
The model uses a transformer architecture, with a special twist. It can consume both text and speech as input, which is merged into a single embedding. This embedding is then used to generate output text.
Data Formats
Ultravox supports the following data formats:
- Text: Input text can be any sequence of characters.
- Speech: Input speech is represented as audio embeddings, which are derived from the input audio using the Whisper large-v3-turbo encoder (a sketch of the first step follows this list).
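For intuition, here is a sketch of the first step in that derivation: the 16 kHz waveform is turned into log-mel features for the Whisper encoder. The Ultravox pipeline does this internally, so you would not normally call it yourself.

```python
import librosa
from transformers import WhisperFeatureExtractor

# Load a mono waveform at the 16 kHz rate Whisper-family encoders expect.
audio, sr = librosa.load("<path-to-input-audio>", sr=16000)

# Convert the waveform to log-mel features (the encoder's input).
fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3-turbo")
features = fe(audio, sampling_rate=sr, return_tensors="pt")
print(features.input_features.shape)  # (1, n_mels, n_frames)
```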
Input Requirements
To use Ultravox, you need to provide the following inputs:
- Text prompt: A text sequence that includes a special `<|audio|>` pseudo-token, which will be replaced with the audio embeddings (illustrated below).
- Audio: The input audio file, which is used to generate the audio embeddings.
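For illustration, this is roughly what the conversation turns look like with the pseudo-token in place. With the pipeline shown in the Code Example below, you pass the audio separately and the `<|audio|>` insertion is handled for you.

```python
# Illustrative only: the user turn carries the <|audio|> placeholder that is
# later swapped for the audio embeddings. The pipeline builds this for you.
turns = [
    {"role": "system", "content": "You are a friendly and helpful character."},
    {"role": "user", "content": "<|audio|>"},
]
```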
Output
The model generates output text, which can be used for various applications such as speech-to-speech translation, analysis of spoken audio, and more.
Code Example
Here’s an example of how to use Ultravox in Python:
```python
import transformers
import numpy as np
import librosa

# Load the Ultravox pipeline; trust_remote_code is required because the
# model ships its own processing code.
pipe = transformers.pipeline(model='fixie-ai/ultravox-v0_4_1-llama-3_1-8b',
                             trust_remote_code=True)

path = "<path-to-input-audio>"  # TODO: pass the audio here
audio, sr = librosa.load(path, sr=16000)  # the model expects 16 kHz audio

turns = [
    {
        "role": "system",
        "content": "You are a friendly and helpful character. You love to answer questions for people."
    },
]
pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=30)
```
Note that you need to install the `transformers`, `peft`, and `librosa` libraries to use this code.
Special Requirements
Ultravox requires a GPU with at least 40GB of memory to run efficiently. The model is trained using a knowledge-distillation loss, where the goal is to match the logits of the text-based Llama3.1-8B-Instruct backbone.
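A quick way to check whether your GPU meets the ~40GB guidance above (a trivial sketch using PyTorch's device query):

```python
import torch

# Print the name and total memory of the first visible CUDA device.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.0f} GB")
else:
    print("No CUDA device available.")
```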