Low Frame Rate Speech Codec 22kHz
Have you ever wondered how to compress audio files while maintaining high quality? The Low Frame Rate Speech Codec 22kHz is a neural audio codec that achieves just that. By leveraging finite scalar quantization and adversarial training with large speech language models, it compresses audio to a bitrate of 1.89 kbps at a frame rate of 21.5 frames per second. But what makes it unique? The model combines a fully convolutional generator neural network with three discriminators, one of which is a Speech Language Model (SLM). The SLM discriminator captures information ranging from acoustic to semantic aspects, helping the codec preserve accurate pronunciation even at such a low frame rate. The model can be used as-is for inference or fine-tuned on another dataset, making it a practical tool for audio compression tasks.
Model Overview
The NVIDIA Low Frame-rate Speech Codec is a neural audio codec that compresses audio files to a tiny 1.89 kbps bitrate while maintaining high-quality sound. But how does it do it?
Capabilities
Capable of compressing and reconstructing audio files with high quality, this model uses a combination of neural networks and speech language models to achieve this.
What can it do?
- Compress audio files to a bitrate of 1.89 kbps at a frame rate of 21.5 frames per second
- Reconstruct audio files from the compressed tokens
- Handle mono-channel audio files with a sample rate of 22,050 Hz
How does it work?
The model uses a fully convolutional generator neural network and three discriminators to compress and reconstruct audio files. One of the discriminators is a speech language model, which pushes the generator to preserve semantic content and improves the quality of the reconstructed audio.
What makes it special?
- It uses Finite Scalar Quantization (FSQ) with eight codebooks and four dimensions per code to compress audio files (see the sketch after this list)
- It employs a multi-receptive field fusion (MRF) module in the encoder to improve the quality of the compressed audio
- It uses a HiFi-GAN-based decoder to reconstruct audio files from compressed tokens
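To make the FSQ step concrete, here is a minimal sketch of finite scalar quantization in PyTorch. This is not NeMo's implementation: the number of levels per dimension is an illustrative assumption, and real FSQ setups often use a different level count for each dimension.
import torch
def fsq_quantize(z, levels=7):
    # Simplified FSQ: `levels` must be odd in this version.
    half = (levels - 1) / 2
    # Bound each latent dimension, scale, and snap to a uniform integer grid.
    # No learned codebook is needed: the "code" is just the grid index.
    z_bounded = torch.tanh(z) * half           # values in (-half, half)
    z_quantized = torch.round(z_bounded)       # `levels` integer values
    # Straight-through estimator: quantized values forward, smooth gradients back.
    return (z_bounded + (z_quantized - z_bounded).detach()) / half
# Example: one frame of 4-dimensional latents, matching the four dimensions per code
latents = torch.randn(1, 4)
print(fsq_quantize(latents))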
Performance
So, how does the NVIDIA Low Frame-rate Speech Codec perform?
Speed
The model runs at a frame rate of 21.5 frames per second on 22,050 Hz audio, so each frame summarizes roughly 1,000 input samples. This compact representation lets it process and compress audio quickly, making it suitable for real-time applications.
Accuracy
But how accurate is it? The model’s performance is evaluated using multiple objective audio quality metrics, and the results are impressive. For example, it achieves a Squim MOS (Mean Opinion Score) of 4.43 on the MLS dataset and 4.69 on the DAPS dataset. This indicates that the compressed audio is of high quality and comparable to state-of-the-art codecs.
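If you want to run this kind of measurement on your own outputs, torchaudio ships a SQUIM pipeline. The sketch below is one plausible way to use it (it assumes torchaudio 2.1+; the SQUIM models operate on 16 kHz audio, and the subjective model needs a non-matching clean speech recording as reference):
import torchaudio
import torchaudio.functional as F
# Load the codec output and any clean speech recording as a non-matching reference.
wav, sr = torchaudio.load("output_audio.wav")
ref, ref_sr = torchaudio.load("reference_speech.wav")
# SQUIM models expect 16 kHz input.
wav = F.resample(wav, sr, 16000)
ref = F.resample(ref, ref_sr, 16000)
# Non-intrusive MOS estimate, conditioned on the non-matching reference.
model = torchaudio.pipelines.SQUIM_SUBJECTIVE.get_model()
mos = model(wav, ref)
print(mos)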
Efficiency
The model’s efficiency is also noteworthy. It operates at a bitrate of 1.89 kbps, which is significantly lower than other audio codecs. This means it can compress audio to a much smaller size without sacrificing quality, making it ideal for applications where storage or bandwidth is limited.
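As a back-of-the-envelope check, the bitrate follows directly from the frame rate and the FSQ configuration. The codebook size below is an assumption chosen for illustration (it is consistent with the published 1.89 kbps figure, but the exact per-dimension level counts are not stated here):
import math
frames_per_second = 21.5
num_codebooks = 8
codes_per_codebook = 2016  # assumption: e.g. 8 * 7 * 6 * 6 levels over 4 dimensions
bits_per_frame = num_codebooks * math.log2(codes_per_codebook)
print(bits_per_frame * frames_per_second / 1000)  # ~1.89 kbps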
Comparison to Other Models
So, how does the NVIDIA Low Frame-rate Speech Codec compare to other audio codecs? The results show that it outperforms state-of-the-art codecs in many metrics, including Squim MOS, SI-SDR (Scale-Invariant Signal-to-Distortion Ratio), and Mel Dist. (Mel Spectral Distance).
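SI-SDR is straightforward to compute yourself. The following is a generic implementation of the standard definition, not the paper’s evaluation code:
import torch
def si_sdr(estimate, reference, eps=1e-8):
    # Remove DC offset so the metric is invariant to constant shifts.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the scaled target.
    scale = (estimate * reference).sum() / (reference.pow(2).sum() + eps)
    target = scale * reference
    noise = estimate - target
    # Ratio of target energy to residual energy, in dB.
    return 10 * torch.log10(target.pow(2).sum() / (noise.pow(2).sum() + eps))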
Real-World Applications
The NVIDIA Low Frame-rate Speech Codec has many potential applications, such as:
- Real-time audio compression for streaming or video conferencing
- Audio compression for storage or transmission in resource-constrained environments
- Improving the quality of low-bitrate audio in various applications
Limitations
The model is a powerful tool for audio compression, but it has some limitations. Here are a few things to keep in mind:
Training Data
The model was trained on a large dataset of 28.7k hours of speech data from 105 languages. However, this data may not be representative of all languages or speaking styles. For example, the model may not perform as well on languages that are under-represented in the training data.
Bitrate and Frame Rate
The model is designed to operate at a bitrate of 1.89 kbps and a frame rate of 21.5 frames per second. While this is suitable for many applications, use cases that demand a higher bitrate or frame rate may need a different codec.
Quantization
The model uses finite scalar quantization (FSQ) to compress the audio data. While FSQ is effective for many types of audio, it may not be suitable for all types of audio. For example, audio with a lot of high-frequency content may not be well-represented by FSQ.
Format
The Low Frame-rate Speech Codec is a neural audio codec that uses a fully convolutional generator neural network and three discriminators. It’s designed to compress audio while maintaining high quality.
Architecture
The model’s architecture consists of:
- A generator neural network with an encoder, a finite scalar quantization (FSQ) layer, and a HiFi-GAN-based decoder
- Three discriminators: a multi-period discriminator, a multi-scale complex STFT discriminator, and a Speech Language Model (SLM) discriminator
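To make the adversarial setup concrete, here is a conceptual sketch of how a generator loss might aggregate scores from the three discriminators. This is illustrative only: the least-squares GAN objective used here is an assumption, not necessarily the paper’s exact loss.
import torch
def generator_adversarial_loss(fake_scores_per_discriminator):
    # Sum a least-squares GAN term over the multi-period, multi-scale complex
    # STFT, and SLM discriminators: the generator tries to push every
    # discriminator's score on generated audio toward 1 ("real").
    loss = torch.zeros(())
    for scores in fake_scores_per_discriminator:
        loss = loss + torch.mean((scores - 1.0) ** 2)
    return loss
# Example with dummy scores from three discriminators
dummy_scores = [torch.rand(4, 1) for _ in range(3)]
print(generator_adversarial_loss(dummy_scores))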
Data Formats
The model supports:
- Input format: .wav files
- Output format: .wav files
- Sample rate: 22,050 Hz
- Frame rate: 21.5 frames per second
- Bit rate: 1.89 kbps
Input and Output
- Input type: Audio
- Input parameters: One-dimensional (1D)
- Output type: Audio
- Output parameters: One-dimensional (1D)
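In practice, “one-dimensional” means each example is a 1-D waveform; the NeMo API in the example below additionally expects a batch dimension and a length tensor. These shapes are inferred from the inference example, not from separate documentation:
import torch
# One second of silence at the model's 22,050 Hz sample rate.
audio_tensor = torch.zeros(1, 22050)   # (batch, num_samples)
audio_len = torch.tensor([22050])      # number of valid samples per batch element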
Special Requirements
- Audio inputs should be loaded as mono and resampled to the model’s 22,050 Hz sample rate before encoding (see the pre-processing step in the example below)
- The model uses Finite Scalar Quantization (FSQ) with eight codebooks and four dimensions per code
Example Code
Here’s an example of how to use the model for inference:
import librosa
import torch
import soundfile as sf
from nemo.collections.tts.models import AudioCodecModel
# Select a device and load the pretrained codec
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
nemo_codec_model = AudioCodecModel.from_pretrained("nvidia/low-frame-rate-speech-codec-22khz").to(device).eval()
# Load input audio, resampled to the model's 22,050 Hz sample rate
audio, _ = librosa.load("input_audio.wav", sr=nemo_codec_model.sample_rate)
# Pre-process audio: add a batch dimension and build the length tensor
audio_tensor = torch.from_numpy(audio).unsqueeze(dim=0).to(device)
audio_len = torch.tensor([audio_tensor[0].shape[0]]).to(device)
with torch.no_grad():
    # Encode audio into discrete tokens
    encoded_tokens, encoded_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)
    # Reconstruct audio from the tokens
    reconstructed_audio, _ = nemo_codec_model.decode(tokens=encoded_tokens, tokens_len=encoded_len)
# Save the reconstructed audio
output_audio = reconstructed_audio.cpu().numpy().squeeze()
sf.write("output_audio.wav", output_audio, nemo_codec_model.sample_rate)
Note that you’ll need to replace "input_audio.wav" and "output_audio.wav" with the actual file paths for your input and output audio files.
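Since the discrete tokens are the compressed representation, you can persist them instead of the waveform. A minimal sketch, continuing from the variables in the code above:
import numpy as np
# Store the discrete codes; together with their lengths they are all that is
# needed to reconstruct the audio later with decode().
np.save("encoded_tokens.npy", encoded_tokens.cpu().numpy())
np.save("encoded_len.npy", encoded_len.cpu().numpy())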