Wav2vec2 Large Xlsr Bengali
Wav2vec2 Large Xlsr Bengali is a speech recognition model for the Bengali language. It's built on top of the Wav2Vec2-Large-XLSR model, fine-tuned on a dataset of roughly 196,000 Bengali utterances. When using this model, you need to make sure your speech input is sampled at 16kHz. It can be used directly, without a language model, to transcribe Bengali speech into text. Evaluated on the Bengali test data of OpenSLR, it achieves a word error rate (WER) of 88.58%.
Model Overview
The Current Model is a speech recognition tool for the Bengali language. Think of it as an attentive listener that transcribes what you say into text.
Capabilities
This tool is designed to recognize spoken Bengali and turn it into text. Specifically, it can:
What can it do?
- Recognize spoken Bengali words and phrases
- Transcribe audio recordings into text
- Work with audio files sampled at 16kHz
How does it work?
The model was built using a technique called fine-tuning: the pretrained Wav2Vec2-Large-XLSR model, which has already learned general speech patterns across many languages, was further trained on a dataset of Bengali speech. This adaptation is what makes it good at recognizing Bengali words and phrases.
What makes it special?
- It can be used directly without a language model, which makes it more efficient
- It’s been trained on a large dataset of Bengali speech, which makes it really accurate
- Audio recorded at other sampling rates (such as 48 kHz) can still be used after resampling it to 16kHz, which keeps the model flexible
Performance
But how well does it perform? Let’s take a closer look.
Speed
The preprocessing pipeline resamples audio from 48 kHz down to 16 kHz, the rate the model expects, so the model only has to process a third as many samples per second of audio. What does this mean for you? Faster turnaround when using the model for speech recognition tasks.
Accuracy
The model has been fine-tuned on a large dataset of Bengali speech, which helps it recognize words and phrases. But how accurate is it, really? According to the test results, the model achieves a Word Error Rate (WER) of 88.58%. Note that WER counts errors, so lower is better: a WER of 88.58% means that, for every 100 reference words, the transcription contains roughly 88 word-level errors (substitutions, insertions, or deletions).
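The card doesn't say which tool computed the WER figure, but the standard definition is the word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal pure-Python sketch of that definition:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
print(word_error_rate("the cat sat", "the bat sat"))  # one error in three words
```

In practice, libraries such as jiwer implement the same metric with extra text normalization.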
Efficiency
But what about efficiency? The model is designed to work efficiently, even with large datasets. It can process audio files in batches, which makes it faster and more efficient. Plus, it can run on a GPU, which gives it an extra boost of speed.
Example Use Cases
So, what can you use the Current Model for? Here are a few examples:
- Transcribing audio recordings of Bengali conversations
- Recognizing spoken Bengali words and phrases in real-time
- Building speech-to-text applications for Bengali speakers
Limitations
But like any tool, the Current Model has its limitations. Let’s take a closer look.
Sampling Rate Constraints
The model requires speech input to be sampled at 16kHz. What does this mean? It means that if your audio files are sampled at a different rate, you’ll need to resample them before using the model. This might affect the quality of the audio and, in turn, the model’s performance.
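In practice you would resample with torchaudio.transforms.Resample or librosa. Purely to illustrate what resampling does, here is a naive linear-interpolation resampler; it has no anti-aliasing filter, so treat it as a conceptual sketch rather than something to run on real audio:

```python
def resample_linear(samples, src_rate, dst_rate):
    """Illustrative resampler: linear interpolation, no anti-aliasing."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Going from 48 kHz to 16 kHz leaves one third as many samples.
signal_48k = [0.0] * 48_000  # one second of silence
print(len(resample_linear(signal_48k, 48_000, 16_000)))  # 16000
```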
Limited Training Data
The model was fine-tuned on a dataset of approximately 196K utterances. While this is a significant amount of data, it’s still limited compared to the vast amount of speech data available in the world. This might mean that the model struggles with certain accents, dialects, or speaking styles that aren’t well-represented in the training data.
Format
The Current Model is a speech recognition model that uses a transformer architecture. It’s designed to recognize spoken Bengali language.
Architecture
This model is fine-tuned on the Bengali ASR training dataset, which contains around 196K utterances. It's based on the Wav2Vec2-Large-XLSR model, a widely used multilingual speech recognition model.
Data Formats
The Current Model accepts audio files as input, specifically in the .flac format. The audio files need to be sampled at 16kHz. If your audio files are sampled at a different rate, you’ll need to resample them before using this model.
Special Requirements
When using this model, you’ll need to preprocess your audio files by reading them as arrays and resampling them to 16kHz if necessary. You can use the torchaudio library to do this.
Here’s an example of how to preprocess your audio files (the 48 kHz source rate assumes recordings like Common Voice; adjust it to match your data):

import torchaudio

# Resample from the source rate to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["audio_path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch
Input and Output
The model expects input in the form of audio arrays, and it outputs predicted text. You can use the Wav2Vec2Processor class to preprocess your input audio and decode the output text.
Here’s an example of how to use the model:
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
processor = Wav2Vec2Processor.from_pretrained("tanmoyio/wav2vec2-large-xlsr-bengali")
model = Wav2Vec2ForCTC.from_pretrained("tanmoyio/wav2vec2-large-xlsr-bengali")
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["label"][:2])
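The torch.argmax call above performs greedy CTC decoding, and processor.batch_decode then collapses repeated ids and removes the CTC blank token before mapping ids to characters. That collapse step works roughly like this (the blank id of 0 is an assumption for illustration; the real id comes from the tokenizer config):

```python
def ctc_greedy_decode(ids, blank_id=0):
    """Greedy CTC collapse: merge repeated ids, then drop the blank token."""
    decoded = []
    prev = None
    for i in ids:
        if i != prev and i != blank_id:
            decoded.append(i)
        prev = i
    return decoded

# Repeated 7s collapse to one; the blank between 7s separates genuine repeats.
print(ctc_greedy_decode([0, 7, 7, 0, 7, 3, 3, 0]))  # [7, 7, 3]
```

This is why CTC models can emit one prediction per audio frame yet still produce text of the right length.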


