Wav2vec2 Large Xlsr 53 Th

Thai speech recognition

Wav2vec2 Large Xlsr 53 Th is a speech recognition model designed to efficiently process and understand spoken Thai. It is built on top of the wav2vec2-large-xlsr-53 model and fine-tuned on the Thai Common Voice Corpus 7.0 dataset, which contains over 133 validated hours of Thai speech. On the test set it achieves a word error rate (WER) of 13.63% with PyThaiNLP tokenization and a character error rate (CER) of 2.81%, outperforming commercial systems such as Google Web Speech API and Microsoft Bing Speech API on CER. Its accurate recognition of spoken Thai makes it a valuable tool for applications like voice assistants, transcription services, and language learning platforms.

AIResearch · cc-by-sa-4.0 · Updated 4 years ago

Model Overview

The Current Model is a speech recognition model fine-tuned for the Thai language. It is based on the wav2vec2-large-xlsr-53 architecture, with fine-tuning and post-processing choices that adapt it to Thai speech.

What can it do? This model can take in audio files and transcribe them into text. It’s been trained on a large dataset of Thai speech, which means it’s good at understanding the nuances of the language.

Capabilities

The model’s primary tasks include:

  • Speech-to-Text: Transcribing spoken Thai language into written text
  • Speech Recognition: Identifying and recognizing spoken words and phrases in Thai

The model’s strengths include:

  • High Accuracy: Achieving a low Word Error Rate (WER) of 13.63% and Character Error Rate (CER) of 2.81% on the test set
  • Robustness: Performing well on a variety of Thai speech datasets and scenarios

The model also offers some unique features, including:

  • Syllable Tokenization: Using PyThaiNLP and deepcut tokenizers to tokenize Thai text into syllables
  • Spell Correction: Applying spell correction using TNC ngrams to improve transcription accuracy
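Because the reported WER depends on how Thai text is tokenized (PyThaiNLP and deepcut split text differently), it helps to see how WER is actually computed over token lists. Below is a minimal sketch using Levenshtein edit distance in plain Python; this is an illustration, not the project's actual evaluation script:

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over two sequences.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)]

def word_error_rate(ref_tokens, hyp_tokens):
    # WER = edit distance / number of reference tokens.
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

# One deleted token out of four reference tokens -> WER 0.25.
print(word_error_rate(["กิน", "ข้าว", "หรือ", "ยัง"], ["กิน", "ข้าว", "ยัง"]))  # 0.25
```

CER is the same computation applied to character sequences instead of token lists, which is why it is unaffected by tokenizer choice.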

Performance

The Current Model processes speech inputs efficiently and can handle batches of audio files. Preprocessing is lightweight: input audio only needs to be resampled to 16,000 Hz before being fed to the model.

The model’s accuracy is comparable to other state-of-the-art systems, such as Google Web Speech API and Microsoft Bing Speech API.

Here are some benchmark results:

| Model | WER (PyThaiNLP) | WER (deepcut) | CER |
| --- | --- | --- | --- |
| Current Model | 13.63 | 8.15 | 2.81 |
| Google Web Speech API | 13.71 | 10.86 | 7.36 |
| Microsoft Bing Speech API | 12.58 | 9.62 | 5.02 |
| Amazon Transcribe | 21.86 | 14.49 | 7.08 |
| NECTEC AI for Thai Partii API | 20.11 | 15.52 | 9.55 |

All values are percentages.

Example Use Cases

Examples
  • Transcribe the following Thai audio file: audio_file.wav → และเขาก็สัมผัสดีบุก ("and he also touched tin")
  • Recognize speech from this Thai audio clip: clip_1.wav → คุณสามารถรับทราบเมื่อข้อความนี้ถูกอ่านแล้ว ("you can be notified once this message has been read")
  • Convert Thai speech to text: speech_to_text.wav → และเขาก็สัมผัสดีบุก คุณสามารถรับทราบเมื่อข้อความนี้ถูกอ่านแล้ว

The Current Model can be used in a variety of applications, such as:

  • Virtual assistants: Current Model can be used to power virtual assistants that can understand and respond to voice commands.
  • Speech-to-text systems: Current Model can be used to transcribe audio files into text with high accuracy.
  • Voice-controlled interfaces: Current Model can be used to create voice-controlled interfaces for devices and applications.

Limitations

The Current Model has some limitations, including:

  • Limited Training Data: The model was trained on a relatively small dataset, which may affect its performance on unseen data or in real-world scenarios.
  • Dependence on Tokenization: The model’s performance relies heavily on the quality of the tokenization process.
  • Limited Generalizability: The model may not generalize well to other datasets or languages.

Format

The Current Model uses the wav2vec2-large-xlsr-53 architecture, fine-tuned on the Thai Common Voice 7.0 dataset. It accepts input in the form of audio files.

To use this model, you’ll need to preprocess your input audio files by resampling them to 16,000 Hz. You can do this using the torchaudio library:

import torchaudio

def speech_file_to_array_fn(batch, text_col="sentence", fname_col="path", resampling_to=16000):
    # Load the audio file referenced by this example (waveform tensor + native sampling rate).
    speech_array, sampling_rate = torchaudio.load(batch[fname_col])
    # Resample from the file's native rate to the 16 kHz expected by the model.
    resampler = torchaudio.transforms.Resample(sampling_rate, resampling_to)
    batch["speech"] = resampler(speech_array)[0].numpy()
    batch["sampling_rate"] = resampling_to
    # Keep the reference transcription alongside the audio for later comparison.
    batch["target_text"] = batch[text_col]
    return batch
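As a conceptual aside, resampling maps samples recorded at one rate onto the time grid of another. The toy sketch below uses linear interpolation purely for illustration; torchaudio's Resample uses a proper windowed-sinc filter, so this is not what runs internally:

```python
def resample_linear(samples, src_rate, dst_rate):
    # Toy resampler: place each output sample on the source time axis and
    # linearly interpolate between the two nearest input samples.
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        t = i * src_rate / dst_rate           # position in source-sample units
        lo = int(t)
        hi = min(lo + 1, len(samples) - 1)
        frac = t - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Downsample an 8-sample signal "recorded" at 32 kHz to 16 kHz -> every other sample.
print(resample_linear([0, 1, 2, 3, 4, 5, 6, 7], 32000, 16000))  # [0.0, 2.0, 4.0, 6.0]
```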

Once you’ve preprocessed your audio files, pass the speech arrays to the processor to build model inputs (the reference transcriptions are kept aside for comparison):

inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

The model will output a set of logits, which you can then use to predict the transcribed text:

import torch

with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
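The argmax-and-decode step above is greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, and drop the blank token. A standalone sketch of what processor.batch_decode does for a single sequence of CTC output ids, using a hypothetical toy vocabulary (the real processor uses the model's own tokenizer):

```python
def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0):
    # Collapse runs of identical ids (CTC emits one id per audio frame),
    # then remove the blank id and map the rest to characters.
    collapsed = []
    prev = None
    for tid in frame_ids:
        if tid != prev:
            collapsed.append(tid)
        prev = tid
    return "".join(id_to_token[t] for t in collapsed if t != blank_id)

# Toy vocabulary: 0 = CTC blank, then a few Thai characters.
vocab = {0: "<blank>", 1: "ก", 2: "ิ", 3: "น"}
# Per-frame argmax ids, as they might come out of torch.argmax(logits, dim=-1).
print(ctc_greedy_decode([1, 1, 0, 2, 2, 2, 0, 0, 3], vocab))  # กิน
```

Note that a blank between two identical ids keeps them as separate characters, which is how CTC distinguishes repeated letters from a single held sound.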