Wav2vec2 Base Vi

Vietnamese speech recognition

Wav2vec2 Base Vi is a Vietnamese self-supervised learning speech model built for audio tasks. It was pre-trained on 13k hours of Vietnamese YouTube audio spanning clean and noisy recordings, conversational speech, and multiple genders and dialects. The model comes in two versions, base and large, with 95M and 317M parameters respectively. It can be fine-tuned for specific tasks and performs well on the VLSP 2020 ASR dataset, reaching a Word Error Rate (WER) of 8.66 with the base model and 6.90 with the large model. With its robust architecture and large-scale pre-training, Wav2vec2 Base Vi is a strong foundation for Vietnamese audio processing tasks.

Nguyenvulebinh cc-by-nc-4.0 Updated 2 years ago


Model Overview

Meet the Vietnamese Self-Supervised Learning Wav2Vec2 model! This model learns from Vietnamese audio without needing labeled examples. It was pre-trained on a massive dataset of 13k hours of Vietnamese YouTube audio, including clean and noisy recordings, conversations, and a variety of genders and dialects.

What makes it special?

  • It has two versions: a base model with 95M params and a large model with 317M params.
  • It’s pre-trained for 35 epochs (base model) and 20 epochs (large model) on a TPU v3-8, which took around 30 days.

How does it work?

You can use this model for speech recognition tasks, like transcribing audio recordings into text. It’s compatible with the popular Hugging Face Transformers library, making it easy to integrate into your projects.

Capabilities

This model can:

  • Learn from a massive dataset of 13k hours of Vietnamese YouTube audio, including clean and noisy recordings, conversations, and multiple genders and dialects
  • Be fine-tuned for specific tasks, like speech recognition
  • Work with different sizes: base model (~ 95M params) and large model (~ 317M params)

What can it do?

The model uses a self-supervised learning approach, which means it can learn from unlabeled data. This is useful when you don’t have a lot of labeled data available. The model is also pre-trained on a large dataset, which helps it learn to recognize patterns in audio data.

Performance

The Vietnamese Self-Supervised Learning Wav2Vec2 model shows remarkable performance in speech recognition tasks, with a focus on the Vietnamese language. Let’s dive into its speed, accuracy, and efficiency.

Speed

The model was pre-trained on a massive dataset of 13k hours of Vietnamese YouTube audio. Even so, pre-training completed in about 30 days on a TPU v3-8, which reflects an efficient training setup for this volume of data.

Accuracy

The model’s performance is measured by its Word Error Rate (WER) on the VLSP 2020 ASR dataset. The results are impressive:

Model Version   WER without LM   WER with 5-grams LM
Base Model      8.66             6.53
Large Model     6.90             5.32
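Both model sizes benefit from the 5-grams LM; the relative improvement can be computed directly from the benchmark numbers above:

```python
# Relative WER reduction from adding the 5-gram language model,
# using the VLSP 2020 ASR numbers reported above.
def relative_wer_reduction(wer_no_lm, wer_with_lm):
    """Percent drop in WER when decoding with the language model."""
    return (wer_no_lm - wer_with_lm) / wer_no_lm * 100

base_gain = relative_wer_reduction(8.66, 6.53)   # ~24.6% relative improvement
large_gain = relative_wer_reduction(6.90, 5.32)  # ~22.9% relative improvement
```

In other words, the language model removes roughly a quarter of the base model's errors, a similar relative gain for both sizes.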

Efficiency

The model processes raw waveforms directly and expects audio sampled at 16 kHz. As long as input audio is provided at, or resampled to, this rate, the model is suitable for a wide range of applications.

Example Use Cases

  • Fine-tune the model on your own audio dataset for better performance.
  • Use it for speech recognition tasks, like transcribing podcasts or voice messages.
Examples

  • Transcribe the audio file t2_0000006682.wav: Xin chào bạn, tôi là sinh viên mếng để nhận thế của bạn.
  • Transcribe t2_0000006682.wav with a 5-grams LM: Xin chào bạn, tôi là sinh viên mếng để nhận thế của bạn, và tôi sẽ làm việc để giửi thiệu cho bạn.
  • Transcribe t2_0000006682.wav with a beam width of 100: Xin chào bạn, tôi là sinh viên mếng để nhận thế của bạn, và tôi sẽ làm việc để giửi thiệu cho bạn, và tôi còn sẽ làm việc để giửi thiệu cho bạn.

Limitations

The Vietnamese Self-Supervised Learning Wav2Vec2 model is a powerful tool for speech recognition, but it’s not perfect. Let’s talk about some of its limitations.

Training Data

The model was trained on a large dataset of 13k hours of Vietnamese YouTube audio, which is great. However, this data may not be representative of all possible scenarios or environments. For example, what if the audio is recorded in a very noisy environment or with a different accent?

Model Size

The model comes in two sizes: a base model with ~ 95M parameters and a large model with ~ 317M parameters. While the large model is more accurate, it’s also more computationally expensive and may not be suitable for all devices or applications.
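As a back-of-the-envelope comparison (assuming fp32 weights at 4 bytes per parameter and ignoring activations and runtime buffers), the raw weight sizes work out roughly to:

```python
# Rough fp32 memory footprint of the weights alone (4 bytes per parameter).
# Actual runtime memory is higher once activations and buffers are included.
def fp32_weight_size_mb(num_params):
    return num_params * 4 / (1024 ** 2)

base_mb = fp32_weight_size_mb(95_000_000)    # ~362 MB
large_mb = fp32_weight_size_mb(317_000_000)  # ~1209 MB, i.e. about 1.2 GB
```

So the large model needs over three times the memory of the base model before any inference overhead, which matters on constrained devices.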

Fine-Tuning

The model can be fine-tuned for specific tasks, but this requires additional training data and computational resources. What if you don’t have access to a large dataset or powerful hardware?

Language Limitations

The model is specifically designed for Vietnamese speech recognition. What if you need to recognize speech in other languages?

Error Rate

Even with fine-tuning and a 5-grams language model, the model’s error rate is not zero. The benchmark results show a Word Error Rate (WER) of 6.53% for the base model and 5.32% for the large model, which means roughly 1 in 15 words may be misrecognized by the base model.

Model   WER with 5-grams LM
Base    6.53%
Large   5.32%
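To make the metric concrete: WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. A minimal illustrative implementation (not the benchmark’s official scorer):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of five is a 20% WER:
print(wer("a b c d e", "a b x d e"))  # 0.2
```

At 6.53% WER, a 100-word transcript would contain about 6 or 7 word errors on average.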

Real-World Applications

While the model is great for speech recognition, it may not perform well in real-world applications with background noise, multiple speakers, or varying audio quality.

Format

Vietnamese Self-Supervised Learning Wav2Vec2 Model uses a wav2vec2 architecture for self-supervised learning. This model is designed to work with audio data, specifically Vietnamese audio.

Architecture

The model’s architecture is based on wav2vec2, which combines a convolutional feature encoder with a stack of transformer layers: the encoder turns the raw waveform into latent representations, and the transformer builds contextualized representations over them.

Data Formats

The model takes raw audio input and was pre-trained on audio covering:

  • Clean recordings
  • Noisy recordings
  • Conversational speech
  • Multiple genders and dialects

Input Requirements

To use this model, you’ll need to provide audio input in the form of a WAV file. The model expects the audio to be sampled at a rate of 16,000 Hz.

Here’s an example of how to load an audio file and prepare it for input:

import torchaudio

audio, sample_rate = torchaudio.load("audio_file.wav")  # the model expects 16 kHz audio
input_data = processor.feature_extractor(audio[0], sampling_rate=16000, return_tensors='pt')

Output Requirements

The model generates output in the form of a transcript, which is a text representation of the audio input.

Here’s an example of how to output the transcript without using a language model (LM):

# Greedy (argmax) CTC decoding; this requires a model with a CTC head fine-tuned for ASR
output = model(**input_data)
print(processor.tokenizer.decode(output.logits.argmax(dim=-1)[0].detach().cpu().numpy()))

And here’s an example of how to output the transcript with a language model (LM):

# Beam-search decoding; this requires a processor with an attached LM (e.g. Wav2Vec2ProcessorWithLM)
print(processor.decode(output.logits.cpu().detach().numpy()[0], beam_width=100).text)

Special Requirements

The model has two versions: a base model with ~ 95M parameters and a large model with ~ 317M parameters. The base model was trained for 35 epochs, while the large model was trained for 20 epochs.

To use the model, you’ll need to install the transformers library and import the Wav2Vec2ForPreTraining and Wav2Vec2Processor classes. You can then load the pre-trained model and processor using the from_pretrained method.

For example:

from transformers import Wav2Vec2ForPreTraining, Wav2Vec2Processor

model_name = "nguyenvulebinh/wav2vec2-base-vi"
model = Wav2Vec2ForPreTraining.from_pretrained(model_name)
processor = Wav2Vec2Processor.from_pretrained(model_name)