Low Frame Rate Speech Codec 22kHz

Low-bitrate audio codec

Have you ever wondered how to compress audio files while maintaining high quality? The Low Frame Rate Speech Codec 22kHz is a neural audio codec that achieves just that. By leveraging finite scalar quantization and adversarial training with large speech language models, it compresses audio to a bitrate of 1.89 kbps at a frame rate of 21.5 frames per second. What makes it unique? The model uses a fully convolutional generator neural network and three discriminators, one of which is a Speech Language Model (SLM). Together these capture information ranging from acoustic to semantic aspects, yielding accurate pronunciation even at very low frame rates. The released checkpoint can be used for inference or fine-tuned on another dataset, making it a practical tool for audio compression tasks.


Model Overview

The NVIDIA Low Frame-rate Speech Codec is a neural audio codec that compresses audio to a bitrate of just 1.89 kbps while maintaining high-quality sound. But how does it do it?

Capabilities

This model compresses and reconstructs audio files at high quality by combining a convolutional neural codec with speech language models that serve as discriminators during training.

What can it do?

  • Compress audio files to a bitrate of 1.89 kbps at a frame rate of 21.5 frames per second (a quick arithmetic check follows this list)
  • Reconstruct audio files from compressed tokens
  • Handle audio files with a sample rate of 22,050 Hz and a mono-channel format
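
To build intuition for these numbers, here is a quick back-of-envelope check using only the figures stated on this card (the per-frame layout is derived arithmetic, not an official specification):

# Rough token budget for one second of speech, using the numbers above
sample_rate = 22050       # Hz
frame_rate = 21.5         # frames per second
codebooks = 8             # FSQ codebooks per frame (stated below)

samples_per_frame = sample_rate / frame_rate   # ~1026 samples covered per frame
tokens_per_second = frame_rate * codebooks     # 172 discrete tokens per second
print(f"~{samples_per_frame:.0f} samples/frame, {tokens_per_second:.0f} tokens/s")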

How does it work?

The model uses a fully convolutional generator neural network and three discriminators to compress and reconstruct audio files. It also utilizes speech language models to improve the quality of the reconstructed audio.

Examples
  • Compress the audio file 'audio_example.wav' with the codec; the compressed output is saved as 'compressed_audio.wav' at a bitrate of 1.89 kbps and a frame rate of 21.5 frames per second.
  • Reconstruct the original audio from the compressed tokens; the reconstructed audio is saved as 'reconstructed_audio.wav' with a sample rate of 22,050 Hz and a mono channel.
  • Evaluate the model on the MLS English dataset, where it achieves a Squim MOS of 4.43, SI-SDR of 4.77, Mel Dist. of 0.143, STFT Dist. of 0.060, and CER of 2.16 (a metric sketch follows this list).
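
The card does not spell out its metric configuration, but as an illustration, a mel-spectral distance between a reference and its reconstruction can be computed along these lines (the n_fft and n_mels values here are assumptions, not the paper's exact settings):

import torch
import torchaudio

# Illustrative mel-spectral distance; the paper's exact settings may differ.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=22050, n_fft=1024, n_mels=80)

def mel_distance(reference: torch.Tensor, reconstruction: torch.Tensor) -> float:
    # Compare log-mel spectrograms with an L1 distance
    # (assumes both waveforms have the same length).
    m_ref = torch.log(mel(reference) + 1e-5)
    m_rec = torch.log(mel(reconstruction) + 1e-5)
    return torch.mean(torch.abs(m_ref - m_rec)).item()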

What makes it special?

  • It uses Finite Scalar Quantization (FSQ) with eight codebooks and four dimensions per code to compress audio files (a minimal sketch follows this list)
  • It employs a multi-receptive field fusion (MRF) module in the encoder to improve the quality of the compressed audio
  • It uses a HiFi-GAN-based decoder to reconstruct audio files from compressed tokens
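
To make the FSQ idea concrete, here is a minimal sketch of finite scalar quantization: each latent dimension is bounded and rounded onto a small fixed grid of levels, so no learned codebook lookup is needed. The level counts below are placeholders, not this model's actual configuration:

import torch

def fsq_quantize(z: torch.Tensor, levels: list[int]) -> torch.Tensor:
    """Minimal FSQ sketch: round each latent dimension to a fixed grid.
    z: (..., D) latents; levels: number of quantization levels per dimension."""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    # Bound each dimension to [-1, 1], then scale onto the level grid.
    z = torch.tanh(z) * (L - 1) / 2
    # Round to the nearest level; the straight-through trick keeps gradients.
    return z + (torch.round(z) - z).detach()

# Example: one 4-dimensional code (placeholder level counts)
codes = fsq_quantize(torch.randn(1, 4), levels=[8, 8, 8, 4])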

Performance

So, how does the NVIDIA Low Frame-rate Speech Codec perform?

Speed

The codec's defining feature is its low frame rate: it represents 22,050 Hz audio using only 21.5 frames per second. Fewer frames mean fewer tokens to generate, store, and feed into downstream models, which keeps processing fast and makes the codec suitable for real-time applications.

Accuracy

But how accurate is it? The model’s performance is evaluated using multiple objective audio quality metrics, and the results are impressive. For example, it achieves a Squim MOS (Mean Opinion Score) of 4.43 on the MLS dataset and 4.69 on the DAPS dataset. This indicates that the compressed audio is of high quality and comparable to state-of-the-art codecs.
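
Metrics such as Squim MOS and SI-SDR can be estimated without a clean reference. As one possible setup (not necessarily the evaluation pipeline used for this card), TorchAudio ships SQUIM models for this:

import torchaudio
from torchaudio.pipelines import SQUIM_OBJECTIVE

# Illustrative, reference-free quality estimation with TorchAudio's SQUIM
# objective model; this card's exact evaluation setup is not published here.
model = SQUIM_OBJECTIVE.get_model()

wav, sr = torchaudio.load("reconstructed_audio.wav")
# SQUIM expects 16 kHz input, so resample the 22,050 Hz codec output first.
wav = torchaudio.functional.resample(wav, sr, SQUIM_OBJECTIVE.sample_rate)

stoi, pesq, si_sdr = model(wav)
print(f"STOI={stoi.item():.3f}  PESQ={pesq.item():.2f}  SI-SDR={si_sdr.item():.2f} dB")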

Efficiency

The model’s efficiency is also noteworthy. It uses a bitrate of 1.89 kbps, which is significantly lower than other audio codecs. This means it can compress audio to a much smaller size without sacrificing quality, making it ideal for applications where storage or bandwidth is limited.
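
The stated numbers are internally consistent, as a quick check shows (the per-codebook size is inferred from the arithmetic, not from an official spec):

# Sanity-check the stated 1.89 kbps using the card's own figures
bitrate_bps = 1890
frame_rate = 21.5
codebooks = 8

bits_per_frame = bitrate_bps / frame_rate    # ~87.9 bits per frame
bits_per_code = bits_per_frame / codebooks   # ~11 bits -> ~2048 entries per codebook
print(f"{bits_per_frame:.1f} bits/frame, {bits_per_code:.1f} bits/codebook entry")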

Comparison to Other Models

So, how does the NVIDIA Low Frame-rate Speech Codec compare to other audio codecs? The results show that it outperforms state-of-the-art codecs in many metrics, including Squim MOS, SI-SDR (Scale-Invariant Signal-to-Distortion Ratio), and Mel Dist. (Mel Spectral Distance).

Real-World Applications

The NVIDIA Low Frame-rate Speech Codec has many potential applications, such as:

  • Real-time audio compression for streaming or video conferencing
  • Audio compression for storage or transmission in resource-constrained environments
  • Improving the quality of low-bitrate audio in various applications

Limitations

The Low Frame-rate Speech Codec is a powerful tool for audio compression, but it has some limitations. Here are a few things to keep in mind:

Training Data

The model was trained on a large dataset of 28.7k hours of speech data from 105 languages. However, this data may not be representative of all languages or speaking styles. For example, the model may not perform as well on languages that are not well-represented in the training data.

Bitrate and Frame Rate

The model is designed to operate at a bitrate of 1.89 kbps and a frame rate of 21.5 frames per second. While this is suitable for many applications, it may not be sufficient for high-quality audio or applications that require a higher bitrate or frame rate.

Quantization

The model uses finite scalar quantization (FSQ) to compress the audio data. While FSQ is effective for many types of audio, it may not be suitable for all types of audio. For example, audio with a lot of high-frequency content may not be well-represented by FSQ.

Format

The Low Frame-rate Speech Codec is a neural audio codec that uses a fully convolutional generator neural network and three discriminators. It’s designed to compress audio while maintaining high quality.

Architecture

The model’s architecture consists of:

  • A generator neural network with an encoder, finite scalar quantization (FSQ), and a HiFi-GAN-based decoder (see the skeleton after this list)
  • Three discriminators: a multi-period discriminator, a multi-scale complex STFT discriminator, and a Speech Language Model (SLM) discriminator
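
A hypothetical skeleton of how the generator pieces fit together (component internals omitted; this is not the actual NeMo implementation):

import torch.nn as nn

class CodecGeneratorSketch(nn.Module):
    """Illustrative only: encoder -> FSQ bottleneck -> HiFi-GAN-style decoder."""
    def __init__(self, encoder: nn.Module, quantizer: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder      # fully convolutional, downsamples to 21.5 frames/s
        self.quantizer = quantizer  # FSQ: 8 codebooks, 4 dimensions per code
        self.decoder = decoder      # HiFi-GAN-based upsampler back to 22,050 Hz

    def forward(self, audio):
        latents = self.encoder(audio)    # continuous latents at the low frame rate
        codes = self.quantizer(latents)  # discretized latents
        return self.decoder(codes)       # reconstructed waveform

During training, this generator is trained adversarially against the three discriminators listed above (multi-period, multi-scale complex STFT, and SLM).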

Data Formats

The model supports:

  • Input format: .wav files
  • Output format: .wav files
  • Sample rate: 22,050 Hz
  • Frame rate: 21.5 frames per second
  • Bit rate: 1.89 kbps

Input and Output

  • Input type: Audio
  • Input parameters: One-dimensional (1D)
  • Output type: Audio
  • Output parameters: One-dimensional (1D)

Special Requirements

  • The model requires its audio inputs to be pre-processed to 22,050 Hz mono (a sketch follows this list)
  • The model uses Finite Scalar Quantization (FSQ) with eight codebooks and four dimensions per code
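
The card does not document the pre-processing in detail; here is a plausible helper, assuming the requirement is simply 22,050 Hz mono input with explicit lengths (the function name is hypothetical):

import librosa
import torch

def load_for_codec(path: str, target_sr: int = 22050):
    """Hypothetical helper: resample to 22,050 Hz, downmix to mono,
    and build the (audio, length) pair the codec's encode() expects."""
    audio, _ = librosa.load(path, sr=target_sr, mono=True)
    audio_tensor = torch.from_numpy(audio).unsqueeze(0)   # (1, T) batch of one
    audio_len = torch.tensor([audio_tensor.shape[1]])     # per-item lengths
    return audio_tensor, audio_len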

Example Code

Here’s an example of how to use the model for inference:

import librosa
import torch
import soundfile as sf
from nemo.collections.tts.models import AudioCodecModel

# Pick a device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pretrained codec and put it in evaluation mode
nemo_codec_model = AudioCodecModel.from_pretrained("nvidia/low-frame-rate-speech-codec-22khz").to(device).eval()

# Load input audio, resampled to the codec's sample rate (22,050 Hz)
audio, _ = librosa.load("input_audio.wav", sr=nemo_codec_model.sample_rate)

# Pre-process audio: add a batch dimension and build a length tensor
audio_tensor = torch.from_numpy(audio).unsqueeze(dim=0).to(device)
audio_len = torch.tensor([audio_tensor[0].shape[0]]).to(device)

with torch.no_grad():
    # Encode audio into discrete tokens
    encoded_tokens, encoded_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)

    # Reconstruct audio from the tokens
    reconstructed_audio, _ = nemo_codec_model.decode(tokens=encoded_tokens, tokens_len=encoded_len)

# Save reconstructed audio
output_audio = reconstructed_audio.cpu().numpy().squeeze()
sf.write("output_audio.wav", output_audio, nemo_codec_model.sample_rate)

Note that you’ll need to replace "input_audio.wav" and "output_audio.wav" with the actual file paths for your input and output audio files.
