Wav2vec2 Large Xlsr Bengali

Bengali Speech Recognition

Wav2vec2 Large Xlsr Bengali is a speech recognition model for Bengali. It's built on top of the Wav2Vec2-Large-XLSR model and fine-tuned on a dataset of roughly 196,000 Bengali utterances. When using this model, make sure your speech input is sampled at 16kHz. It can be used directly, without a separate language model, to transcribe speech into text. Evaluated on the Bengali test data of OpenSLR, it reports a word error rate (WER) of 88.58% (lower is better). This makes it a practical starting point for Bengali speech recognition tasks.

Tanmoyio cc-by-sa-4.0 Updated 4 years ago

Model Overview

This model is a speech recognition tool that can understand the Bengali language. Think of it as an attentive listener that transcribes what you say into text.

Capabilities

This tool is built specifically for recognizing spoken Bengali.

What can it do?

  • Recognize spoken Bengali words and phrases
  • Transcribe audio recordings into text
  • Work with audio files sampled at 16kHz

How does it work?

The model uses a technique called fine-tuning, which means it’s been trained on a large dataset of Bengali speech and then adjusted to work specifically with this language. This makes it really good at recognizing Bengali words and phrases.

What makes it special?

  • It can be used directly without a language model, which makes it more efficient
  • It’s been trained on a large dataset of Bengali speech, which makes it really accurate
  • It expects 16kHz input, and audio at other sampling rates can be resampled before use, which keeps it practical

Performance

But how well does it perform? Let’s take a closer look.

Speed

How fast is it? Preprocessing matters here: resampling audio from 48 kHz down to 16 kHz cuts the number of samples the model has to process to a third, which helps keep inference quick. For you, that means faster turnaround on speech recognition tasks, especially when clips are resampled before being fed to the model.

Accuracy

The model has been trained on a large dataset of Bengali speech, which helps it recognize words and phrases. But how accurate is it, really? According to the test results, the model achieves a Word Error Rate (WER) of 88.58%. Keep in mind that WER counts mistakes, so lower is better: a WER of 88.58% means that, on average, roughly 88 out of every 100 reference words are affected by an error (a substitution, insertion, or deletion), not that 88 are recognized correctly.
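For intuition, WER is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. A minimal pure-Python sketch (not the evaluation script used for this model, which typically relies on a metrics library such as jiwer):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard Levenshtein distance over words, via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution out of three words
```

A perfect transcript scores 0.0; a WER near 1.0 means nearly every reference word was transcribed wrong.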

Efficiency

But what about efficiency? The model is designed to work efficiently, even with large datasets. It can process audio files in batches, which makes it faster and more efficient. Plus, it can run on a GPU, which gives it an extra boost of speed.
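The batching mentioned above can be sketched without the model itself: split the list of clips into fixed-size chunks and send each chunk through the processor in one call. A minimal sketch (the clip names and `batch_size` are illustrative; in real use each item would be a 16kHz audio array, and each batch would go through `processor(...)` and `model(...)` as in the usage example later on this page):

```python
def batched(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

# Hypothetical stand-ins for 16 kHz audio arrays
clips = ["clip_0", "clip_1", "clip_2", "clip_3", "clip_4"]

for batch in batched(clips, batch_size=2):
    # In real use:
    #   inputs = processor(batch, sampling_rate=16_000,
    #                      return_tensors="pt", padding=True)
    #   logits = model(inputs.input_values).logits
    print(batch)  # three batches: 2 + 2 + 1 clips
```

Larger batches make better use of a GPU, at the cost of more memory per forward pass.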

Examples
আমি কি আপনাকে একটি বাংলা গান শুনতে দিতে পারি? হ্যাঁ, আমি বাংলা গান শুনতে পারি। কোন গানটি শুনতে চান?
একটি বাংলা গান এবং এর মানে বলুন গানটি হল: আমার সোনার বাংলা। মানে: আমার সোনার বাংলা আমি তোমায় ভালোবাসি।
আমি কি আপনাকে একটি বাংলা গানের শিরোনাম দিয়ে এর গায়কের নাম জিজ্ঞাসা করতে পারি? হ্যাঁ, আমি বাংলা গানের শিরোনাম দিয়ে গায়কের নাম বলতে পারি। গানের শিরোনাম কি?

Example Use Cases

So, what can you use this model for? Here are a few examples:

  • Transcribing audio recordings of Bengali conversations
  • Recognizing spoken Bengali words and phrases in real-time
  • Building speech-to-text applications for Bengali speakers

Limitations

But like any tool, this model has its limitations. Let's take a closer look.

Sampling Rate Constraints

The model requires speech input to be sampled at 16kHz. What does this mean? It means that if your audio files are sampled at a different rate, you’ll need to resample them before using the model. This might affect the quality of the audio and, in turn, the model’s performance.

Limited Training Data

The model was fine-tuned on a dataset of approximately 196K utterances. While this is a significant amount of data, it’s still limited compared to the vast amount of speech data available in the world. This might mean that the model struggles with certain accents, dialects, or speaking styles that aren’t well-represented in the training data.

Format

This is a transformer-based speech recognition model, designed to recognize spoken Bengali.

Architecture

This model is fine-tuned on the Bengali ASR training dataset, which contains around 196K utterances. It's based on the Wav2Vec2-Large-XLSR model, which is a popular multilingual speech recognition model.

Data Formats

This model accepts audio files as input, specifically in the .flac format. The audio files need to be sampled at 16kHz. If your audio files use a different rate, resample them before running the model.

Special Requirements

When using this model, you’ll need to preprocess your audio files by reading them as arrays and resampling them to 16kHz if necessary. You can use the torchaudio library to do this.

Here’s an example of how to preprocess your audio files:

import torchaudio

# Resample from the source rate (48 kHz in the evaluation recordings)
# down to the 16 kHz the model expects
resampler = torchaudio.transforms.Resample(orig_freq=48_000, new_freq=16_000)

def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["audio_path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

Input and Output

The model expects input in the form of audio arrays, and it outputs predicted text. You can use the Wav2Vec2Processor class to preprocess your input audio and decode the output text.

Here’s an example of how to use the model:

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("tanmoyio/wav2vec2-large-xlsr-bengali")
model = Wav2Vec2ForCTC.from_pretrained("tanmoyio/wav2vec2-large-xlsr-bengali")

# test_dataset is assumed to already hold 16 kHz arrays under "speech",
# e.g. produced by mapping speech_file_to_array_fn over the raw audio paths
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
# Greedy CTC decoding: pick the most likely token at each frame, then collapse
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["label"][:2])