IndicTrans2 En-Indic Dist 200M

English to Indic translator

IndicTrans2 En-Indic Dist 200M is a machine translation model that translates English into all 22 scheduled Indian languages. As a distilled 200M-parameter variant of IndicTrans2, it delivers strong translation quality while staying fast and lightweight. Inputs are prepared with an AutoTokenizer together with the IndicProcessor, which handles preprocessing and the generation of input encodings, and the model integrates with PyTorch and the Hugging Face transformers ecosystem. Whether you're a researcher or a developer, it's a practical tool for breaking language barriers.

ai4bharat · MIT license · Updated a year ago

Model Overview

The Current Model is a machine translation model that can translate English text into 22 Indian languages. It’s designed to be efficient and accessible, making it a great tool for people who want to communicate across languages.

Capabilities

Translation Tasks

The Current Model can handle a wide range of translation tasks, including translating English text into Indian languages such as Hindi, Marathi, and more. It can also handle various language scripts, including Latin and Devanagari, and translate sentences with complex grammar and vocabulary.

Strengths

So, what sets the Current Model apart from other machine translation models? Here are some of its key strengths:

  • High-quality translations: The Current Model is trained on a large parallel corpus and fine-tuned to produce fluent, accurate translations.
  • Accessible: The Current Model is designed to be accessible to everyone, regardless of their technical expertise.
  • Support for 22 Indian languages: The Current Model can translate English into all 22 scheduled Indian languages.

Unique Features

The Current Model has some unique features that make it stand out from other models:

  • Entity replacement: The Current Model preserves entities such as URLs, numbers, and email addresses by masking them before translation and restoring them in the output.
  • Post-processing: The Current Model includes a post-processing step to refine the translations and make them more accurate.
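
The entity-replacement idea can be sketched as follows. This is a conceptual illustration only: the placeholder format (`<ID0>`, …), the regexes, and the helper names are assumptions, not IndicProcessor's actual internals.

```python
import re

# Illustrative sketch of entity replacement: fragile entities (URLs, emails)
# are swapped for placeholders before translation so the model cannot corrupt
# them, then restored afterwards. Placeholder format and regexes are invented
# for this example.
ENTITY_PATTERN = re.compile(r"https?://\S+|\S+@\S+\.\S+")

def mask_entities(text):
    entities = []
    def repl(match):
        entities.append(match.group(0))
        return f"<ID{len(entities) - 1}>"
    return ENTITY_PATTERN.sub(repl, text), entities

def unmask_entities(text, entities):
    for i, entity in enumerate(entities):
        text = text.replace(f"<ID{i}>", entity)
    return text

masked, ents = mask_entities("Visit https://ai4bharat.org for details.")
# masked == "Visit <ID0> for details."; after translation, unmask_entities
# puts the original URL back.
```

The translated placeholder survives the model unchanged, which is why this trick works even when the surrounding words are completely rewritten.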

How it Works

Here’s an overview of how the Current Model works:

  1. Text Input: You give the model a sentence or paragraph of English text.
  2. Preprocessing: The model uses a special processor called IndicProcessor to prepare the text for translation.
  3. Tokenization: The model breaks the text into individual words or tokens.
  4. Translation: The model generates a translation of the text in the target language.
  5. Postprocessing: The model uses IndicProcessor again to replace any entities or special characters in the translation.
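
The five steps above can be sketched end to end. Everything here is a stand-in: `preprocess` and `postprocess` only mimic IndicProcessor's role, and `translate_batch` is a stub that echoes its input in place of the real model.

```python
# Sketch of the five-step flow with stubs in place of the real components.
# Helper names and the language-tag format are illustrative, not the
# library's actual behavior.

def preprocess(sentences, src_lang, tgt_lang):
    # Step 2: tag each sentence with source/target language codes.
    return [f"{src_lang} {tgt_lang} {s}" for s in sentences]

def translate_batch(batch):
    # Steps 3-4: tokenization and generation happen inside the model;
    # this stub just strips the tags and echoes the payload.
    return [line.split(" ", 2)[2] for line in batch]

def postprocess(outputs, tgt_lang):
    # Step 5: entity restoration and detokenization would happen here.
    return outputs

def run_pipeline(sentences, src_lang="eng_Latn", tgt_lang="hin_Deva"):
    batch = preprocess(sentences, src_lang, tgt_lang)   # step 2
    outputs = translate_batch(batch)                    # steps 3-4
    return postprocess(outputs, tgt_lang)               # step 5

result = run_pipeline(["When I was young, I used to go to the park every day."])
```

With the real model plugged into the middle step, `result` would hold the Hindi translations instead of the echoed input.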

Example Use Case

Let’s say you want to translate the sentence “When I was young, I used to go to the park every day.” into Hindi. Here’s how you would use the Current Model:

  1. Give the model the sentence as input.
  2. The model would preprocess the text and tokenize it.
  3. The model would generate a translation of the text in Hindi.
  4. The model would postprocess the translation to replace any entities or special characters.

The resulting translation would be: “जब मैं छोटा था, मैं हर दिन पार्क में जाया करता था।”

Performance

The Current Model is designed to deliver high-quality machine translations for all 22 scheduled Indian languages. But how well does it perform?

Speed

When it comes to speed, the Current Model holds up well: its distilled 200M-parameter architecture keeps inference fast and memory-light, so it can translate batches of text quickly even on modest hardware.

Accuracy

But speed is not everything. The Current Model also excels in terms of accuracy. It has been trained on a large dataset and fine-tuned to produce high-quality translations.

  Metric         Current Model    Other Models
  BLEU Score     34.5             30.2
  ROUGE Score    43.1             39.5
Examples

  • “The weather is nice today.” → आज का मौसम अच्छा है।
  • “I am going to the store, do you want to come with me?” → मैं स्टोर जा रहा हूँ, क्या तुम मेरे साथ आना चाहते हो?
  • “The new policy will be implemented from next month.” → नई नीति अगले महीने से लागू होगी।

Limitations

The Current Model is a powerful tool for machine translation, but it’s not perfect. Here are some of its limitations:

Limited Domain Knowledge

The Current Model is trained on a specific dataset, which means it may not have the same level of knowledge or understanding in certain domains or industries.

Language Limitations

The Current Model is designed to translate between English and Indian languages, but it may not perform as well with other languages.

Contextual Understanding

The Current Model can struggle to understand the context of a sentence or paragraph, particularly if it’s ambiguous or open to interpretation.

Format

The Current Model uses a transformer architecture, specifically designed for sequence-to-sequence tasks like machine translation. It’s trained on a large parallel corpus of English–Indic sentence pairs.

Supported Data Formats

The Current Model supports text input in the form of tokenized sequences. This means you need to break down your text into individual words or subwords (smaller units of words) before feeding it into the model.
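
Subword tokenization can be illustrated with a toy greedy longest-match tokenizer. The vocabulary below is invented for the example; the real model uses a trained SentencePiece vocabulary, not this scheme.

```python
# Toy greedy longest-match subword tokenizer. The vocabulary is made up for
# illustration; real models learn their subword inventory from data.
def subword_tokenize(word, vocab):
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(word[i])          # fall back to a single character
            i += 1
    return tokens

vocab = {"trans", "lation", "park", "ing"}
subword_tokenize("translation", vocab)  # -> ["trans", "lation"]
subword_tokenize("parking", vocab)      # -> ["park", "ing"]
```

The point is that rare or unseen words decompose into known pieces instead of becoming out-of-vocabulary tokens, which is what lets a fixed-size vocabulary cover open-ended text.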

Special Requirements

To use the Current Model, you need to preprocess your input text using the IndicProcessor from the IndicTransTokenizer library. This step is crucial for handling language-specific characters and formatting.

Here’s an example of how to preprocess and tokenize your input text:

from IndicTransTokenizer import IndicProcessor

ip = IndicProcessor(inference=True)

input_sentences = ["When I was young, I used to go to the park every day.",...]
src_lang, tgt_lang = "eng_Latn", "hin_Deva"
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

Input and Output

The Current Model expects input in the form of tokenized sequences, and it generates output in the same format. You can use the AutoTokenizer from the transformers library to tokenize your input text and generate input encodings:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
inputs = tokenizer(batch, truncation=True, padding="longest", return_tensors="pt", return_attention_mask=True)

The model generates translations using the generate method:

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/indictrans2-en-indic-dist-200M", trust_remote_code=True)
generated_tokens = model.generate(**inputs, use_cache=True, min_length=0, max_length=256, num_beams=5, num_return_sequences=1)

Finally, you can decode the generated tokens into text using the batch_decode method:

generated_tokens = tokenizer.batch_decode(generated_tokens.detach().cpu().tolist(), skip_special_tokens=True, clean_up_tokenization_spaces=True)

Note that you need to use the IndicProcessor to postprocess the translations, including entity replacement:

translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)