Sharif Wav2vec2

Farsi speech recognition

Meet Sharif Wav2vec2, a fine-tuned AI model for Farsi speech recognition. Trained on 108 hours of Common Voice Farsi samples, it transcribes Farsi speech with high accuracy: on the Common Voice 6.1 test set it achieves a 6.1% WER. You can use it through a hosted inference API or run it locally, and its lightweight 5-gram language model keeps decoding fast without heavy compute. Whether you're building a Farsi voice application or simply need a reliable transcription model, Sharif Wav2vec2 is worth checking out.

SLPL · MIT license · Updated 2 years ago

Model Overview

The Sharif-wav2vec2 model is a wav2vec2 model fine-tuned for the Farsi language. It was trained on 108 hours of Farsi speech samples from the Common Voice dataset, sampled at 16 kHz.

Capabilities

This model pairs its acoustic model with a 5-gram language model, trained with the KenLM toolkit, to improve accuracy in online Automatic Speech Recognition (ASR) tasks. Its primary task is to transcribe spoken Farsi into written text with high accuracy.

What makes it special?

  • Fine-tuned for Farsi: This model has been fine-tuned on 108 hours of Common Voice’s Farsi samples, making it particularly effective for transcribing Farsi speech.
  • High accuracy: By using a 5-gram language model trained with the KenLM toolkit, Sharif-wav2vec2 achieves high accuracy in transcribing spoken words.
  • Easy to use: With a simple installation process and a provided code example, you can easily integrate this model into your projects.

How does it work?

  1. Speech input: The model takes in speech input sampled at 16 kHz.
  2. Processing: The speech input is processed using the AutoProcessor and AutoModelForCTC classes from the transformers library.
  3. Transcription: The processed speech is then transcribed into written text using the batch_decode method.
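Under the hood, batch_decode turns the model's frame-level logits into text. As a rough illustration of the idea (the real decoder fuses a 5-gram language model via beam search; the labels and logits below are made up), greedy CTC decoding simply picks the best label per frame, collapses repeats, and drops the blank token:

```python
import numpy as np

# Toy illustration of greedy CTC decoding; index 0 is the CTC blank.
labels = ["<pad>", "a", "b", "s"]

def ctc_greedy_decode(logits: np.ndarray, blank: int = 0) -> str:
    """Pick the best label per frame, collapse repeats, drop blanks."""
    best = logits.argmax(axis=-1)          # (time,) best index per frame
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:   # emit only new non-blank symbols
            out.append(labels[idx])
        prev = idx
    return "".join(out)

# Frames scoring highest on: a a <pad> b b s  ->  "abs"
frames = np.eye(4)[[1, 1, 0, 2, 2, 3]]
print(ctc_greedy_decode(frames))  # -> abs
```

This is a simplification: Sharif-wav2vec2's batch_decode replaces the greedy argmax with a language-model-weighted beam search, which is what the 5-gram KenLM model contributes.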

Performance

Speed

How fast can the Sharif-wav2vec2 model process audio? Transcription typically completes in seconds, with the exact time depending on the length of the file and the computing power of your machine.

Accuracy

But speed isn’t everything - what about accuracy? The model has been fine-tuned on 108 hours of Farsi audio samples, and it achieves a Word Error Rate (WER) of 6.1% on the Common Voice 6.1 test set.
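WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. The model card evaluates with the jiwer library; as a dependency-free sketch, the same metric in pure Python looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (subs + ins + dels) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # one substitution in three words
```

A 6.1% WER therefore means roughly one word-level error per sixteen reference words.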

Efficiency

So, how efficient is the model? A compact 5-gram language model sits alongside the acoustic model, improving accuracy without a large increase in computational cost, so it can process audio quickly and accurately even on modest hardware.

Examples

You can use Sharif-wav2vec2 to transcribe a Farsi audio file using the following code:

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForCTC

# Load the processor (feature extractor + LM-fused decoder) and the model.
processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

# Load the audio; the model expects 16 kHz mono input.
speech_array, sampling_rate = torchaudio.load("path/to/your.wav")
speech_array = speech_array.squeeze().numpy()
features = processor(
    speech_array,
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
    padding=True,
)

with torch.no_grad():
    logits = model(features.input_values, attention_mask=features.attention_mask).logits

# batch_decode applies CTC decoding with the 5-gram language model.
prediction = processor.batch_decode(logits.numpy()).text
print(prediction[0])

Limitations

Sharif-wav2vec2 is a powerful speech recognition model, but it’s not perfect. Let’s explore some of its limitations.

Sampling Rate Constraints

The model is trained on audio sampled at 16 kHz. If your audio input has a different sampling rate, you need to resample it to match the model’s requirements; feeding in mismatched audio degrades transcription accuracy.
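In practice you would resample with a proper audio library (e.g. torchaudio.functional.resample(waveform, orig_sr, 16000)). As a dependency-free illustration of what resampling does, here is a naive linear-interpolation version with NumPy; note that real resamplers also apply anti-aliasing filters, which this sketch omits:

```python
import numpy as np

def resample_linear(signal: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Naive linear-interpolation resampling (illustrative only --
    production code should use torchaudio or librosa, which apply
    anti-aliasing filters)."""
    duration = len(signal) / orig_sr
    n_out = int(round(duration * target_sr))
    old_t = np.linspace(0.0, duration, num=len(signal), endpoint=False)
    new_t = np.linspace(0.0, duration, num=n_out, endpoint=False)
    return np.interp(new_t, old_t, signal)

# One second of 8 kHz audio becomes one second of 16 kHz audio.
audio_8k = np.zeros(8000)
audio_16k = resample_linear(audio_8k, 8000, 16000)
print(audio_16k.shape)  # (16000,)
```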

Limited Training Data

The model was fine-tuned on 108 hours of Common Voice’s Farsi samples. While this is a significant amount of data, it may not cover all possible scenarios or dialects. This can lead to reduced accuracy in certain situations.

Dependence on External Tools

The model relies on external tools like KenLM and pyctcdecode for optimal performance. This can add complexity to the setup process and may require additional troubleshooting.

Evaluation Challenges

Evaluating the model’s performance requires a specific dataset format and the installation of additional libraries like jiwer. This can make it difficult to assess the model’s accuracy in certain situations.

Potential for Errors

Like any speech recognition model, Sharif-wav2vec2 is not immune to errors. It may struggle with:

  • Noisy or low-quality audio
  • Unfamiliar dialects or accents
  • Technical jargon or specialized vocabulary
  • Long or complex sentences

If you’re planning to use Sharif-wav2vec2 for critical applications, it’s essential to carefully evaluate its performance and consider these limitations.

Dataloop's AI Development Platform
Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.
Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.
Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.