Sharif Wav2vec2
Meet Sharif Wav2vec2, a fine-tuned model for Farsi speech recognition. Trained on 108 hours of Common Voice's Farsi samples, it transcribes Farsi speech with high accuracy, and you can use it either through a hosted inference API or by running it locally. The model is also efficient, delivering fast results without heavy compute. And you don't have to take our word for it: evaluated on the Common Voice 6.1 dataset, it achieves a 6.1% WER. Whether you're building a Farsi speech application or just looking for a reliable transcription model, Sharif Wav2vec2 is worth checking out.
Model Overview
Sharif-wav2vec2 is a wav2vec 2.0 model fine-tuned for the Farsi language. It was trained on 108 hours of Farsi speech samples from the Common Voice dataset, with a sampling rate of 16 kHz.
Capabilities
This model uses a 5-gram language model, trained with the KenLM toolkit, to improve its accuracy in online Automatic Speech Recognition (ASR) tasks. Its primary task is to transcribe spoken words into written text with high accuracy.
What makes it special?
- Fine-tuned for Farsi: This model has been fine-tuned on 108 hours of Common Voice's Farsi samples, making it particularly effective for transcribing Farsi speech.
- High accuracy: By using a 5-gram language model trained with the KenLM toolkit, Sharif-wav2vec2 achieves high accuracy in transcribing spoken words.
- Easy to use: With a simple installation process and a provided example code, you can easily integrate this model into your projects.
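To give a feel for what the 5-gram language model contributes, here is a toy count-based n-gram scorer. This is a sketch of the general idea only, not the KenLM implementation (KenLM adds smoothing and an efficient binary format), and the tokens and the small `n` in the test are illustrative assumptions:

```python
from collections import defaultdict

def train_ngram(tokens, n=5):
    """Count n-gram and (n-1)-gram context occurrences over a token sequence."""
    counts, context_counts = defaultdict(int), defaultdict(int)
    padded = ["<s>"] * (n - 1) + tokens  # pad so every token has a full context
    for i in range(len(tokens)):
        context = tuple(padded[i:i + n - 1])
        counts[context + (padded[i + n - 1],)] += 1
        context_counts[context] += 1
    return counts, context_counts

def score(tokens, counts, context_counts, n=5):
    """Product of conditional probabilities P(word | previous n-1 words)."""
    padded = ["<s>"] * (n - 1) + tokens
    p = 1.0
    for i in range(len(tokens)):
        context = tuple(padded[i:i + n - 1])
        if context_counts[context] == 0:  # unseen context: no smoothing here
            return 0.0
        p *= counts[context + (padded[i + n - 1],)] / context_counts[context]
    return p
```

During decoding, a scorer like this re-ranks candidate transcriptions so that word sequences common in Farsi text win out over acoustically similar but unlikely ones.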
How does it work?
- Speech input: The model takes in speech input sampled at 16 kHz.
- Processing: The speech input is processed using the `AutoProcessor` and `AutoModelForCTC` classes from the transformers library.
- Transcription: The processed speech is then transcribed into written text using the `batch_decode` method.
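Under the hood, `batch_decode` with a language model runs a beam search (via pyctcdecode) over the frame-wise CTC logits. As a rough illustration of the underlying CTC idea only, and not the actual decoder, here is a minimal greedy decode sketch with a hypothetical toy vocabulary: pick the best token per frame, collapse repeats, and drop the blank token.

```python
import numpy as np

def greedy_ctc_decode(logits, vocab, blank_id=0):
    """Collapse repeated argmax ids and drop blanks (greedy CTC decoding)."""
    ids = np.argmax(logits, axis=-1)  # best token id per audio frame
    collapsed, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            collapsed.append(i)
        prev = i
    return "".join(vocab[i] for i in collapsed)
```

The real decoder keeps many candidate paths alive and weights them with the 5-gram language model, but the collapse-repeats-and-drop-blanks step is the same.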
Performance
Speed
How fast can Sharif-wav2vec2 process audio files? Quite fast: it can transcribe an audio clip in a matter of seconds, depending on the length of the clip and the computing power of your machine.
Accuracy
But speed isn't everything - what about accuracy? The model has been fine-tuned on 108 hours of Farsi audio samples, which significantly improved its accuracy: it achieves a Word Error Rate (WER) of 6.1% on the Common Voice 6.1 dataset. That's an impressive score!
Efficiency
So, how efficient is the model? It pairs its acoustic model with a lightweight 5-gram language model during decoding, which improves accuracy at little extra computational cost, so audio files can be processed quickly and accurately.
Examples
You can use Sharif-wav2vec2 to transcribe a Farsi audio file using the following code:
```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForCTC

# Load the processor (feature extractor, tokenizer, and LM decoder) and the model
processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

# Load the audio; the model expects a 16 kHz sampling rate
speech_array, sampling_rate = torchaudio.load("path/to/your.wav")
speech_array = speech_array.squeeze().numpy()

features = processor(
    speech_array,
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    logits = model(features.input_values, attention_mask=features.attention_mask).logits

# batch_decode applies the 5-gram language model while decoding the logits
prediction = processor.batch_decode(logits.numpy()).text
print(prediction[0])
```
Limitations
Sharif-wav2vec2 is a powerful speech recognition model, but it’s not perfect. Let’s explore some of its limitations.
Sampling Rate Constraints
The model is trained on audio samples with a sampling rate of 16kHz. If your audio input has a different sampling rate, you may need to resample it to match the model’s requirements. This can affect the accuracy of the transcription.
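In practice you would resample with a proper anti-aliased resampler such as `torchaudio.transforms.Resample`. Purely to illustrate what resampling does, here is a minimal pure-NumPy linear-interpolation sketch (it skips anti-aliasing, so don't use it for production audio):

```python
import numpy as np

def resample_linear(signal, orig_sr, target_sr=16000):
    """Resample a 1-D signal by linear interpolation (illustrative sketch only)."""
    duration = len(signal) / orig_sr          # clip length in seconds
    n_target = int(round(duration * target_sr))
    old_t = np.arange(len(signal)) / orig_sr  # original sample times
    new_t = np.arange(n_target) / target_sr   # target sample times
    return np.interp(new_t, old_t, signal)
```

For example, a one-second clip recorded at 8 kHz (8,000 samples) becomes 16,000 samples at the model's expected 16 kHz rate.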
Limited Training Data
The model was fine-tuned on 108 hours of Common Voice's Farsi samples. While this is a significant amount of data, it may not cover all possible scenarios or dialects, which can reduce accuracy in certain situations.
Dependence on External Tools
The model relies on external tools like KenLM and pyctcdecode for optimal performance. This adds complexity to the setup process and may require additional troubleshooting.
Evaluation Challenges
Evaluating the model’s performance requires a specific dataset format and the installation of additional libraries like jiwer. This can make it difficult to assess the model’s accuracy in certain situations.
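For reference, WER is the word-level edit distance (substitutions, deletions, and insertions against the reference) divided by the number of reference words. The jiwer library computes it for you; a minimal sketch of the metric itself looks like this:

```python
def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all of ref[:i]
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

So a 6.1% WER means roughly six word-level errors for every hundred words of reference transcript.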
Potential for Errors
Like any speech recognition model, Sharif-wav2vec2 is not immune to errors. It may struggle with:
- Noisy or low-quality audio
- Unfamiliar dialects or accents
- Technical jargon or specialized vocabulary
- Long or complex sentences
If you’re planning to use Sharif-wav2vec2 for critical applications, it’s essential to carefully evaluate its performance and consider these limitations.