Speaker Diarization 3.1
The Speaker Diarization 3.1 model is an open-source pipeline designed to identify and segment speakers within audio files. It processes mono audio sampled at 16kHz and outputs speaker diarization as an `Annotation` instance. Its pure PyTorch implementation makes it easy to deploy and can yield faster inference. Benchmarks on a large collection of datasets show that it identifies speakers accurately with minimal manual intervention, enabling applications from automatic speech recognition to speaker identification and analysis.
Model Overview
The pyannote/speaker-diarization-3.1 model is a powerful tool for speaker diarization tasks. It’s an open-source model that can help you identify who is speaking and when in an audio file.
Capabilities
This model takes an audio file as input and outputs a speaker diarization, which is a list of speakers and the time intervals when they spoke. It’s like a transcript of the conversation, but instead of just the words, you get the speakers’ identities and timestamps.
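For example, the returned annotation can be walked speaker turn by speaker turn. This is a minimal sketch using `itertracks`, the standard way to iterate over a pyannote `Annotation` (see How to Use below for access-token details):

```python
from pyannote.audio import Pipeline

# Instantiate the pipeline (requires a Hugging Face access token)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
diarization = pipeline("audio.wav")

# Walk the Annotation speaker turn by speaker turn
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # Each turn carries a start time, an end time, and a speaker label
    print(f"{speaker}: {turn.start:.1f}s -> {turn.end:.1f}s")
```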
The model uses a combination of speaker segmentation and embedding to identify the speakers. It’s trained on a large dataset of audio files and can handle different types of audio, including mono and multi-channel files.
Using this model can save you time and effort in manual voice activity detection and speaker identification. It’s also fully automatic, which means you don’t need to fine-tune the model or adjust hyper-parameters for each dataset.
Performance
The model has been benchmarked on several datasets and has shown promising results. Here are some metrics to give you an idea of its performance:
| Dataset | DER% | FA% | Miss% | Conf% |
|---|---|---|---|---|
| AISHELL-4 | 12.2 | 3.8 | 4.4 | 4.0 |
| AliMeeting (channel 1) | 24.4 | 4.4 | 10.0 | 10.0 |
| AMI (headset mix, only_words) | 18.8 | 3.6 | 9.5 | 5.7 |
| … | … | … | … | … |
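As a sanity check on these numbers (a standard property of the metric rather than something stated in the table itself), the diarization error rate is the sum of its three components: DER = FA + Miss + Conf. For AISHELL-4, 3.8 + 4.4 + 4.0 = 12.2, which matches the reported DER.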
How to Use
You can use this model by installing the `pyannote.audio` library and following the instructions in the documentation. You’ll need to create an access token and instantiate the pipeline using the `Pipeline.from_pretrained` method.
Here’s an example of how to use the model:
```python
from pyannote.audio import Pipeline

# Instantiate the pretrained pipeline (requires a Hugging Face access token)
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# Run speaker diarization on an audio file
diarization = pipeline("audio.wav")
```
Note that you’ll need to replace `HUGGINGFACE_ACCESS_TOKEN_GOES_HERE` with your actual access token.
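If a CUDA-capable GPU is available, you can move the pipeline to it for faster inference. A minimal sketch, assuming a CUDA device and using the `to` method of pyannote.audio pipelines:

```python
import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# Move the whole pipeline to GPU (assumes a CUDA-capable device)
pipeline.to(torch.device("cuda"))

diarization = pipeline("audio.wav")
```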
Limitations
This model has some limitations, including:
- It only works with mono audio files sampled at 16kHz; audio files sampled at a different rate will be resampled to 16kHz upon loading.
- If the number of speakers is not known in advance, the model may not perform well.
- The model can take lower and/or upper bounds on the number of speakers through the `min_speakers` and `max_speakers` options, but these bounds may not always be honored accurately (see the sketch after this list).
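Here is a minimal sketch of how those options are passed at call time. The `num_speakers` keyword for a known speaker count is an additional option documented for this pipeline; treat the exact keyword names as assumptions to verify against the pyannote.audio version you install.

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# If the exact number of speakers is known in advance:
diarization = pipeline("audio.wav", num_speakers=2)

# Otherwise, constrain the search with lower and/or upper bounds:
diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
```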
Format
This model accepts mono audio files sampled at 16kHz. If you have stereo or multi-channel audio files, don’t worry! The model will automatically downmix them to mono by averaging the channels. If your audio files are sampled at a different rate, they’ll be resampled to 16kHz upon loading.
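If you prefer to load audio yourself (for example, to downmix or resample ahead of time), the pipeline also accepts in-memory audio as a dictionary with a waveform and sample rate. A minimal sketch, assuming torchaudio is installed:

```python
import torchaudio
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# Load the audio yourself and hand the waveform to the pipeline
waveform, sample_rate = torchaudio.load("audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```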
To use this model, you’ll need to:

- Install `pyannote.audio` version 3.1 or higher using pip
- Accept the user conditions for `pyannote/segmentation-3.0` and `pyannote/speaker-diarization-3.1`
- Create an access token at hf.co/settings/tokens
The model outputs speaker diarization as an `Annotation` instance, which can be dumped to disk using the RTTM format.
```python
from pyannote.audio import Pipeline

# Instantiate the pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# Run the pipeline on an audio file
diarization = pipeline("audio.wav")

# Dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```
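Each line of the resulting RTTM file describes one speaker turn using the standard layout `SPEAKER <file-id> <channel> <onset> <duration> <NA> <NA> <speaker-label> <NA> <NA>`; for example, a turn starting at 0.50s and lasting 2.30s by the first detected speaker would look roughly like `SPEAKER audio 1 0.50 2.30 <NA> <NA> SPEAKER_00 <NA> <NA>` (the file ID and labels depend on your input).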