Speaker Diarization 3.1

The Speaker Diarization 3.1 model is an open-source pipeline for identifying and segmenting speakers within audio files. It processes mono audio sampled at 16kHz and outputs speaker diarization as an Annotation instance. Its pure PyTorch implementation makes it easy to deploy and can yield faster inference. The pipeline has been benchmarked on a large collection of datasets and identifies speakers accurately with no manual intervention, supporting applications from automatic speech recognition to speaker identification and analysis.

Model Overview

The pyannote/speaker-diarization-3.1 model is a powerful tool for speaker diarization tasks. It’s an open-source model that can help you identify who is speaking and when in an audio file.

Capabilities

This model takes an audio file as input and outputs a speaker diarization, which is a list of speakers and the time intervals when they spoke. It’s like a transcript of the conversation, but instead of just the words, you get the speakers’ identities and timestamps.
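
To make that output concrete, here is a minimal sketch that prints each speaker turn. It assumes the pipeline has been instantiated as shown in the How to Use section below; "audio.wav" is a placeholder file name, and the itertracks iteration comes from pyannote's Annotation API.

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
diarization = pipeline("audio.wav")  # placeholder audio file

# Each track is a (segment, track id, speaker label) triple
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker} speaks from {turn.start:.1f}s to {turn.end:.1f}s")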

The pipeline uses a combination of speaker segmentation and speaker embedding models to identify the speakers. It is trained on a large collection of audio files and can handle different kinds of audio; stereo and multi-channel recordings are automatically downmixed to mono.

Using this model can save you time and effort in manual voice activity detection and speaker identification. It’s also fully automatic, which means you don’t need to fine-tune the model or adjust hyper-parameters for each dataset.

Performance

The model has been benchmarked on several datasets with fully automatic processing. DER% is the diarization error rate, which is the sum of the false alarm (FA%), missed detection (Miss%), and speaker confusion (Conf%) rates:

Dataset                       | DER% | FA% | Miss% | Conf%
AISHELL-4                     | 12.2 | 3.8 |  4.4  |  4.0
AliMeeting (channel 1)        | 24.4 | 4.4 | 10.0  | 10.0
AMI (headset mix, only_words) | 18.8 | 3.6 |  9.5  |  5.7

How to Use

You can use this model by installing the pyannote.audio library and following the instructions in the documentation. You’ll need to create an access token and instantiate the pipeline using the Pipeline.from_pretrained method.

Here’s an example of how to use the model:

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
diarization = pipeline("audio.wav")

Note that you’ll need to replace HUGGINGFACE_ACCESS_TOKEN_GOES_HERE with your actual Hugging Face access token.

Limitations

This model has some limitations, including:

  • It only works with mono audio sampled at 16kHz. Stereo or multi-channel audio is downmixed to mono, and audio sampled at a different rate is resampled to 16kHz upon loading.
  • If the number of speakers is not known in advance, automatic speaker counting may be inaccurate.
  • You can provide lower and/or upper bounds on the number of speakers with the min_speakers and max_speakers options (see the sketch after the examples below), but bounds do not guarantee a correct speaker count.

Examples

  • Diarize the speakers in 'meeting.wav' and output the result in RTTM format: Speaker 1: 0.0 - 1.2, Speaker 2: 1.2 - 2.5, Speaker 1: 2.5 - 3.8
  • Process 'interview.wav' on GPU for faster processing: Speaker 1: 0.0 - 0.8, Speaker 2: 0.8 - 1.5
  • Diarize 'podcast.wav' with a minimum of 2 and a maximum of 5 speakers: Speaker 1: 0.0 - 2.1, Speaker 2: 2.1 - 4.2, Speaker 3: 4.2 - 6.5
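
The GPU and speaker-bound scenarios above map directly onto pipeline options. Here is a minimal sketch, assuming a CUDA-capable GPU is available; 'podcast.wav' and the token string are placeholders:

import torch
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# Move the whole pipeline to GPU for faster inference
pipeline.to(torch.device("cuda"))

# Bound the number of speakers when the exact count is unknown
diarization = pipeline("podcast.wav", min_speakers=2, max_speakers=5)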

Format

This model accepts mono audio files sampled at 16kHz. If you have stereo or multi-channel audio files, don’t worry! The model will automatically downmix them to mono by averaging the channels. If your audio files are sampled at a different rate, they’ll be resampled to 16kHz upon loading.
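
If you have already loaded the audio in memory, you can pass a waveform directly instead of a file path, which avoids repeated disk reads. A minimal sketch, assuming torchaudio is installed and "audio.wav" stands in for your file:

import torchaudio
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# Pre-load the audio; the pipeline accepts a {"waveform", "sample_rate"} mapping
waveform, sample_rate = torchaudio.load("audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})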

To use this model, you’ll need to:

  • Install pyannote.audio version 3.1 or higher using pip
  • Accept the user conditions for pyannote/segmentation-3.0 and pyannote/speaker-diarization-3.1
  • Create an access token at hf.co/settings/tokens

The model outputs speaker diarization as an Annotation instance, which can be dumped to disk using RTTM format.

from pyannote.audio import Pipeline

# Instantiate the pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# Run the pipeline on an audio file
diarization = pipeline("audio.wav")

# Dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)

Dataloop's AI Development Platform

Build end-to-end workflows

Dataloop is a complete AI development stack that makes data, elements, models, and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.

Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAIF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.

Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.