Ultravox v0.3
Have you ever wondered how AI models can understand both text and speech? Ultravox v0.3 is a multimodal Speech LLM that can do just that. Built on top of a pre-trained Llama3.1-8B-Instruct and Whisper-small backbone, this model can consume both speech and text as input, making it a powerful tool for tasks like voice agents, speech-to-speech translation, and analysis of spoken audio. What's remarkable is that it reaches a time-to-first-token of approximately 200ms, with a generation rate of around 50-100 tokens per second on an A100-40GB GPU. The model was trained with a knowledge-distillation loss and BF16 mixed precision. While it's not perfect, Ultravox v0.3 has shown impressive results in various evaluations, making it a notable model in the field of multimodal AI.
Table of Contents
- Model Overview
- Capabilities
- Performance
- Limitations
- Format
Model Overview
Meet Ultravox, a game-changing multimodal Speech LLM that can understand both speech and text inputs. Imagine having a conversation with a model that can not only read your messages but also listen to your voice. That’s what Ultravox can do!
Capabilities
Ultravox is a powerful multimodal model that can handle both speech and text input. But what can you do with it? Here are some examples:
- Use it as a voice agent to have conversations.
- Translate speech-to-speech in real-time.
- Analyze spoken audio to gain insights.
But how does it work? Simply put, you give Ultravox a text prompt with a special <|audio|> token, and the model processor replaces this token with embeddings derived from the input audio. The model then generates output text as usual.
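The splicing step above can be sketched in a few lines. This is an illustrative toy, not Ultravox's actual processor API: the function name, dimensions, and random embeddings are all made up to show how a single placeholder position in the text-embedding sequence gets replaced by a longer run of audio-derived embeddings.

```python
import numpy as np

def splice_audio_embeddings(text_embeds, audio_embeds, audio_token_idx):
    """Replace the single <|audio|> placeholder embedding with the
    sequence of audio-derived embeddings. Conceptual sketch only."""
    before = text_embeds[:audio_token_idx]
    after = text_embeds[audio_token_idx + 1:]
    return np.concatenate([before, audio_embeds, after], axis=0)

# Toy dimensions: 10 text-token embeddings (position 4 is <|audio|>),
# 25 audio-frame embeddings, hidden size 8.
text_embeds = np.random.randn(10, 8)
audio_embeds = np.random.randn(25, 8)
spliced = splice_audio_embeddings(text_embeds, audio_embeds, audio_token_idx=4)
print(spliced.shape)  # (34, 8): 9 remaining text positions + 25 audio frames
```

From the LLM's point of view, the audio embeddings then occupy ordinary positions in the input sequence, and text generation proceeds as usual.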
Performance
So, how does Ultravox perform? Let’s dive in and explore its speed, accuracy, and efficiency.
- Time-to-first-token (TTFT): approximately 200ms.
- Tokens-per-second rate: 50-100 when using an A100-40GB GPU.
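These two numbers give a quick back-of-envelope estimate of end-to-end response latency. The sketch below is not a measurement; the 75 tokens/sec figure is simply the midpoint of the quoted 50-100 range:

```python
def estimated_latency_ms(num_tokens, ttft_ms=200.0, tokens_per_sec=75.0):
    """Rough wall-clock estimate for generating num_tokens output tokens:
    time-to-first-token plus the remaining tokens at the steady-state rate.
    75 tok/s is just the midpoint of the quoted 50-100 range."""
    return ttft_ms + (num_tokens - 1) / tokens_per_sec * 1000.0

print(round(estimated_latency_ms(76)))  # 200ms + 75 more tokens at 75 tok/s -> 1200 ms
```

For short voice-agent replies (a sentence or two), this puts total latency comfortably under two seconds on an A100-40GB.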
But what about accuracy? Here are some benchmarks (BLEU: higher is better; WER: lower is better):
| Task | Ultravox v0.2 | Ultravox v0.3 | Whisper-Llama3.1 | Llama3.1 (text-only) |
|---|---|---|---|---|
| en_de (BLEU) | 12.07 | 22.68 | 24.89 | 31.95 |
| es_en (BLEU) | 15.17 | 24.10 | 28.67 | 38.28 |
| LibriSpeech test-clean (WER) | 6.07 | 6.67 | 3.4 | - |
Limitations
While Ultravox is a powerful tool, it’s not perfect. Here are some limitations to keep in mind:
- Weaknesses in speech understanding, such as noisy or low-quality audio.
- Limited contextual understanding, which can make it struggle with long or complex conversations.
- Lack of common sense, which can lead to unexpected responses.
- Dependence on quality of input audio, which can affect performance.
But don’t worry, these limitations are being worked on, and future revisions will likely address these issues.
Format
So, how do you use Ultravox? Here’s a breakdown of its format:
- Architecture: Ultravox uses a transformer architecture, similar to other models.
- Input format: You’ll need to create a text prompt with a special <|audio|> pseudo-token, which will be replaced with embeddings derived from the input audio.
- Audio input: Ultravox supports audio input in the form of WAV files.
- Output format: Ultravox generates output text as usual, but future revisions will expand its capabilities to generate semantic and acoustic audio tokens.
By following these guidelines, you can unlock the full potential of Ultravox and use it for a variety of applications, from voice agents to speech-to-speech translation.
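As a starting point, here is a minimal sketch of preparing WAV audio for the model using only the standard library and NumPy. It assumes 16-bit PCM input and leaves out resampling; speech backbones like Whisper typically expect 16 kHz mono, so check your files' sample rate before passing them on.

```python
import wave

import numpy as np

def load_wav_mono(path):
    """Read a 16-bit PCM WAV file and return (float32 samples in [-1, 1],
    sample rate). Multi-channel audio is downmixed by averaging channels.
    Resampling to 16 kHz is deliberately left out of this sketch."""
    with wave.open(path, "rb") as wf:
        assert wf.getsampwidth() == 2, "sketch handles 16-bit PCM only"
        frames = wf.readframes(wf.getnframes())
        audio = np.frombuffer(frames, dtype=np.int16).astype(np.float32) / 32768.0
        if wf.getnchannels() > 1:
            audio = audio.reshape(-1, wf.getnchannels()).mean(axis=1)
        return audio, wf.getframerate()
```

The returned samples and sample rate can then be handed to the model's processor together with a text prompt containing the <|audio|> pseudo-token, as described above.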