Speaker Diarization 3.0
Speaker Diarization 3.0 is a pipeline designed to identify and separate speakers in audio files. How does it work? It ingests mono audio sampled at 16kHz and outputs speaker diarization as an Annotation instance. What makes it flexible? It was trained on a combination of several datasets, including AISHELL, AliMeeting, and AMI, allowing it to handle a wide range of audio files. And the results? It has been benchmarked on a large collection of datasets, achieving competitive diarization error rates (DER) with minimal manual intervention. What about speed? It can process a one-hour conversation in approximately 1.5 minutes, making it a fast and efficient solution for real-world applications.
Model Overview
Meet Speaker Diarization 3.0, a pipeline designed to identify and separate speakers in audio files, making it a valuable tool for a wide range of audio processing applications.
What can it do?
- Take in mono audio files sampled at 16kHz and output speaker diarization as an Annotation instance
- Automatically downmix stereo or multi-channel audio files to mono by averaging the channels
- Resample audio files sampled at a different rate to 16kHz upon loading
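As a rough illustration of the downmixing step above, averaging channels sample-by-sample looks like this (a minimal sketch, not the pipeline's actual implementation):

```python
# Minimal sketch of channel averaging (illustrative only; the
# pipeline performs this downmix internally when loading audio).
def downmix_to_mono(channels):
    """Average N channels sample-by-sample into one mono channel."""
    return [sum(samples) / len(samples) for samples in zip(*channels)]

left = [0.2, 0.4, -0.6]
right = [0.0, 0.4, 0.6]
print(downmix_to_mono([left, right]))  # [0.1, 0.4, 0.0]
```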
How does it work?
- Installation: Install pyannote.audio 3.0 with pip and accept the user conditions
- Usage: Instantiate the pipeline, run it on an audio file, and dump the diarization output to disk using RTTM format
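For reference, the RTTM format mentioned above is a plain-text format with one SPEAKER line per speaker turn. Here is a sketch of what such a line contains, following the standard RTTM field layout (`rttm_line` is a hypothetical helper for illustration, not a pyannote function):

```python
# Hypothetical helper that formats one RTTM SPEAKER record:
# type, file id, channel, onset, duration, then the speaker label,
# with <NA> placeholders for the unused fields.
def rttm_line(uri, start, duration, speaker):
    return (f"SPEAKER {uri} 1 {start:.3f} {duration:.3f} "
            f"<NA> <NA> {speaker} <NA> <NA>")

print(rttm_line("audio", 0.5, 2.25, "SPEAKER_00"))
# SPEAKER audio 1 0.500 2.250 <NA> <NA> SPEAKER_00 <NA> <NA>
```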
Capabilities
Speaker Diarization 3.0 is a powerful speaker diarization pipeline that can automatically identify and separate speakers in audio files. It’s like having a super-smart assistant that can listen to a conversation and tell you who’s speaking and when!
Primary Tasks
- Speaker Diarization: The model takes an audio file as input and outputs a diarization of the speakers, including the start and end times of each speaker’s turn.
- Audio Processing: The model can handle mono audio files sampled at 16kHz and can automatically downmix stereo or multi-channel files to mono.
Strengths
- Fast and Accurate: The model has been benchmarked on a large collection of datasets and has shown impressive performance, with a diarization error rate (DER) of around 12.3% on the AISHELL-4 dataset.
- Faster than Real-Time: With a real-time factor of around 2.5% on a single Nvidia Tesla V100 SXM2 GPU, the pipeline processes audio roughly 40 times faster than real-time.
Unique Features
- Automatic Speaker Detection: The model can automatically detect the number of speakers in an audio file, without the need for manual voice activity detection or fine-tuning of internal models.
- Progress Monitoring: The model provides hooks to monitor the progress of the pipeline, allowing you to track the processing of your audio files.
- Controlling the Number of Speakers: You can provide the number of speakers as an option, or set lower and upper bounds on the number of speakers using the min_speakers and max_speakers options.
Performance
Speaker Diarization 3.0 has been trained on a combination of several audio datasets, including AISHELL, AliMeeting, and AMI. But how well does it perform in real-world tasks? Let’s dive into its speed and accuracy.
Speed
Imagine you have an hour-long conversation that you want to analyze. How long would Speaker Diarization 3.0 take to process it? Approximately 1.5 minutes, running on an Nvidia Tesla V100 SXM2 GPU paired with an Intel Cascade Lake 6248 CPU. That corresponds to a real-time factor of around 2.5%, meaning the pipeline processes audio roughly 40 times faster than real-time.
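The quoted numbers are consistent with each other, as a quick back-of-the-envelope check shows:

```python
# A 2.5% real-time factor applied to one hour of audio
# should give the quoted ~1.5-minute processing time.
audio_seconds = 60 * 60            # one-hour conversation
real_time_factor = 0.025           # ~2.5% on a Tesla V100 SXM2
processing_minutes = audio_seconds * real_time_factor / 60
print(processing_minutes)  # 1.5
```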
Accuracy
But speed is not everything. How accurate is Speaker Diarization 3.0 at identifying speakers in an audio file? It has been benchmarked on a large collection of datasets, including AISHELL, AliMeeting, and AMI, achieving a diarization error rate (DER) of around 12.3% on the AISHELL-4 dataset.
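DER itself is a simple ratio: the total duration of false alarms, missed detections, and speaker confusion, divided by the total duration of speech. A small sketch with hypothetical durations (chosen here only to land near the quoted 12.3%):

```python
# DER = (false alarm + missed detection + speaker confusion) / total speech
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    return (false_alarm + missed + confusion) / total_speech

# hypothetical error durations (seconds) over 365 s of speech
der = diarization_error_rate(false_alarm=5, missed=10, confusion=30,
                             total_speech=365)
print(f"{der:.1%}")  # 12.3%
```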
Limitations
Speaker Diarization 3.0 is a powerful tool for speaker diarization, but it’s not perfect. Let’s take a closer look at some of its limitations.
Audio Requirements
Speaker Diarization 3.0 only works with mono audio sampled at 16kHz. If your audio file is stereo or multi-channel, it will be automatically downmixed to mono by averaging the channels, which may affect the quality of the output.
Processing Time
While Speaker Diarization 3.0 processes audio much faster than real-time (about 1.5 minutes for a one-hour conversation), it is an offline pipeline rather than a streaming one: it operates on complete recordings. This can be a challenge if you need live results or must process very large volumes of audio quickly.
Number of Speakers
If the number of speakers is not known in advance, Speaker Diarization 3.0 may struggle to identify them accurately. You can provide the number of speakers with the num_speakers option, or bound it with the min_speakers and max_speakers options, but this information is not always available.
Format
Speaker Diarization 3.0 uses a pipeline architecture and accepts input in the form of mono audio files sampled at 16kHz. Don’t worry if your audio files are stereo or multi-channel - the pipeline will automatically downmix them to mono by averaging the channels. If your audio files are sampled at a different rate, they’ll be resampled to 16kHz upon loading.
Here’s a breakdown of the input and output formats:
| Input | Output |
|---|---|
| Mono audio file (16kHz) | Speaker diarization as an Annotation instance |
What’s an Annotation instance? An Annotation (from pyannote.core) is a data structure that maps time intervals in the audio file to speaker labels; in other words, it records who speaks and when.
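Conceptually, the information an Annotation carries boils down to a set of labeled time intervals. Here is a hypothetical, pyannote-free sketch of that content (the real pyannote.core API differs; this only illustrates what the output encodes):

```python
# Hypothetical (start, end, speaker) turns, illustrating the kind
# of information an Annotation instance records.
turns = [
    (0.0, 3.2, "SPEAKER_00"),
    (3.2, 7.8, "SPEAKER_01"),
    (7.8, 9.1, "SPEAKER_00"),
]
for start, end, speaker in turns:
    print(f"{speaker}: {start:.1f}s -> {end:.1f}s")
```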
How do I use the pipeline?
To use the pipeline, you’ll need to install pyannote.audio 3.0 with pip and accept the user conditions. Then, you can instantiate the pipeline, run it on an audio file, and dump the result to disk in RTTM format like this:

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.0", use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```
What about GPU processing? By default, the pipeline runs on CPU, but you can send it to GPU with a couple of lines of code:

```python
import torch

pipeline.to(torch.device("cuda"))
```
This can significantly speed up processing; on an Nvidia Tesla V100 SXM2 GPU, the real-time factor is around 2.5%.
Can I control the number of speakers?
Yes, you can! If you know the number of speakers in advance, you can use the num_speakers option:

```python
diarization = pipeline("audio.wav", num_speakers=2)
```
You can also provide lower and/or upper bounds on the number of speakers using min_speakers and max_speakers options.


