Wav2vec2 Large Xlsr 53 Th
Wav2vec2 Large Xlsr 53 Th is a speech recognition model designed to process and transcribe spoken Thai. It is built on top of the wav2vec2-large-xlsr-53 model and fine-tuned on the Thai Common Voice Corpus 7.0 dataset, which contains over 133 validated hours of Thai speech. The model achieves a word error rate (WER) of 13.63% and a character error rate (CER) of 2.81% on the test set, results competitive with commercial services such as the Google Web Speech API and Microsoft Bing Speech API. Its ability to accurately recognize spoken Thai makes it a valuable tool for applications such as voice assistants, transcription services, and language learning platforms.
Table of Contents
- Model Overview
- Capabilities
- Performance
- Example Use Cases
- Limitations
- Format
Model Overview
The Current Model is a speech recognition model fine-tuned for the Thai language. It is based on the wav2vec2-large-xlsr-53 architecture, adapted to work well for Thai speech.
What can it do? This model can take in audio files and transcribe them into text. It’s been trained on a large dataset of Thai speech, which means it’s good at understanding the nuances of the language.
Capabilities
The model’s primary tasks include:
- Speech-to-Text: Transcribing spoken Thai language into written text
- Speech Recognition: Identifying and recognizing spoken words and phrases in Thai
The model’s strengths include:
- High Accuracy: Achieving a low Word Error Rate (WER) of 13.63% and Character Error Rate (CER) of 2.81% on the test set
- Robustness: Performing well on a variety of Thai speech datasets and scenarios
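The WER and CER figures quoted here are edit-distance metrics. As a minimal illustration of how they are computed (evaluation harnesses typically use a dedicated library such as jiwer, so this pure-Python sketch is for intuition only):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions, deletions, substitutions)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def wer(ref_words, hyp_words):
    # Word Error Rate: edit distance over word sequences, normalized by reference length.
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(ref_text, hyp_text):
    # Character Error Rate: the same metric computed over characters.
    return edit_distance(list(ref_text), list(hyp_text)) / len(ref_text)
```

Because written Thai has no spaces between words, WER depends on the word tokenizer used to segment both reference and hypothesis, which is why results are reported separately for PyThaiNLP and deepcut tokenization.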
The model also offers some unique features, including:
- Thai Word Tokenization: Using the PyThaiNLP and deepcut tokenizers to segment Thai text into words for evaluation
- Spell Correction: Applying spell correction using TNC ngrams to improve transcription accuracy
Performance
The Current Model processes speech input efficiently. Preprocessing is lightweight, limited to resampling audio files to 16,000 Hz, and batches of audio can be transcribed in a single forward pass.
The model’s accuracy is comparable to other state-of-the-art systems, such as the Google Web Speech API and Microsoft Bing Speech API.
Here are some benchmark results:
| Model | WER (PyThaiNLP, %) | WER (deepcut, %) | CER (%) |
|---|---|---|---|
| Current Model | 13.63 | 8.15 | 2.81 |
| Google Web Speech API | 13.71 | 10.86 | 7.36 |
| Microsoft Bing Speech API | 12.58 | 9.62 | 5.02 |
| Amazon Transcribe | 21.86 | 14.49 | 7.08 |
| NECTEC AI for Thai Partii API | 20.11 | 15.52 | 9.55 |
Example Use Cases
The Current Model can be used in a variety of applications, such as:
- Virtual assistants: Powering assistants that understand and respond to Thai voice commands
- Speech-to-text systems: Transcribing Thai audio files into text with high accuracy
- Voice-controlled interfaces: Enabling hands-free control of devices and applications
Limitations
The Current Model has some limitations, including:
- Limited Training Data: The model was trained on a relatively small dataset, which may affect its performance on unseen data or in real-world scenarios.
- Dependence on Tokenization: The model’s performance relies heavily on the quality of the tokenization process.
- Limited Generalizability: The model may not generalize well to other datasets or languages.
Format
The Current Model uses a popular speech recognition architecture, fine-tuned on the Thai Common Voice 7.0 dataset. It supports input in the form of audio files.
To use this model, you’ll need to preprocess your input audio files by resampling them to 16,000 Hz. You can do this using the torchaudio library:
```python
import torchaudio

def speech_file_to_array_fn(batch, text_col="sentence", fname_col="path", resampling_to=16000):
    # Load the audio file and resample it to the target rate (16 kHz).
    speech_array, sampling_rate = torchaudio.load(batch[fname_col])
    resampler = torchaudio.transforms.Resample(sampling_rate, resampling_to)
    batch["speech"] = resampler(speech_array)[0].numpy()
    batch["sampling_rate"] = resampling_to
    batch["target_text"] = batch[text_col]
    return batch
```
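Resampling simply maps N samples recorded at one rate to round(N × 16000 / original_rate) samples at 16 kHz. As a rough illustration of that rate conversion (torchaudio's Resample applies proper band-limited filtering and should be used in practice), a naive linear-interpolation sketch:

```python
import numpy as np

def naive_resample(signal, orig_rate, target_rate=16000):
    """Resample a 1-D signal by linear interpolation.

    Illustration only: torchaudio.transforms.Resample performs
    band-limited filtering and should be preferred for real audio.
    """
    n_out = int(round(len(signal) * target_rate / orig_rate))
    # Positions of the output samples on the original sample axis.
    old_positions = np.arange(len(signal))
    new_positions = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_positions, old_positions, signal)
```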
Once you’ve preprocessed your input audio files, you can pass the resulting speech arrays to the processor to build model inputs:
```python
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
```
The model will output a set of logits, which you can then use to predict the transcribed text:
```python
import torch

# Run inference without tracking gradients, then take the most likely
# token at each time step (greedy CTC decoding).
with torch.no_grad():
    logits = model(inputs.input_values).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
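Under the hood, batch_decode performs CTC decoding on the argmax ids: consecutive repeats are collapsed, then blank tokens are dropped. A minimal sketch of that step (the toy vocabulary and blank id below are illustrative assumptions, not the model's actual vocabulary):

```python
import itertools

def ctc_greedy_decode(ids, id_to_char, blank_id=0):
    """Collapse repeated ids, then remove blanks — the core of CTC greedy decoding."""
    collapsed = [k for k, _ in itertools.groupby(ids)]  # merge consecutive duplicates
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)

# Toy vocabulary for illustration; the real processor holds the model's vocab.
vocab = {0: "<blank>", 1: "ส", 2: "วั", 3: "ส", 4: "ดี"}
```

For example, the id sequence [1, 1, 0, 2, 2, 0, 0, 3, 4, 4] decodes to "สวัสดี": the repeated 1s, 2s, and 4s collapse to single tokens, and the 0s (blanks) are removed.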