IndicTrans2 Indic-Indic Dist 320M

Indic language translator

Are you working with Indian languages? IndicTrans2 Indic-Indic Dist 320M is a distilled machine-translation model built specifically for high-quality translation between Indian languages. It combines two variants, Indic-En Distilled 200M and En-Indic Distilled 200M, and supports direct translation across all 22 scheduled Indian languages. Want to know how it works? Load the model and tokenizer, preprocess your input sentences with the IndicProcessor, and generate translations. The model is compatible with AutoTokenizer from the transformers library, so it slots easily into an existing workflow, and its compact, distilled size keeps inference fast.

ai4bharat · MIT license · Updated 8 months ago

Model Overview

The IndicTrans2 model is a tool for translating text between Indian languages. Under the hood it is a neural sequence-to-sequence model: it reads a sentence in one language and generates the corresponding sentence in another.

What makes it unique?

  • It can translate text between all 22 scheduled Indian languages, each identified by a FLORES-style language code (a few are sketched below).
  • It’s trained on a large dataset of text, which helps it learn the nuances of each language.
  • It uses a technique called “distillation”, which compresses a larger teacher model into a smaller, faster student model while retaining most of its translation quality.
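
For reference, the model identifies each language with a FLORES-style code that pairs a language with its script. Here is a minimal, non-exhaustive sketch of some of these codes (check the model card for the full list):

# FLORES-style language codes used by IndicTrans2 (non-exhaustive;
# the suffix names the script, e.g. Deva = Devanagari, Taml = Tamil).
LANG_CODES = {
    "Assamese": "asm_Beng",
    "Bengali": "ben_Beng",
    "Gujarati": "guj_Gujr",
    "Hindi": "hin_Deva",
    "Kannada": "kan_Knda",
    "Malayalam": "mal_Mlym",
    "Marathi": "mar_Deva",
    "Tamil": "tam_Taml",
    "Telugu": "tel_Telu",
    "Urdu": "urd_Arab",
}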

Capabilities

The IndicTrans2 model is a powerful tool for machine translation. It can translate text from one Indian language to another, and it’s designed to work with all 22 scheduled Indian languages.

What can it do?

  • Translate text from one Indian language to another
  • Work with all 22 scheduled Indian languages
  • Generate high-quality translations

How does it work?

The model uses a combination of machine learning algorithms and large datasets to learn the patterns and structures of Indian languages. It can take in text in one language and generate text in another language.

What makes it unique?

  • It’s specifically designed for Indian languages, which can be challenging for machine translation models
  • It’s compatible with the popular transformers library, making it easy to use and integrate with other tools
  • It includes a dedicated IndicProcessor tool (from the IndicTransToolkit library) for preprocessing text before tokenization, as sketched below
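
To see what that preprocessing step does in isolation, here is a minimal sketch; the exact output format may vary across IndicTransToolkit versions:

from IndicTransToolkit import IndicProcessor

ip = IndicProcessor(inference=True)
batch = ip.preprocess_batch(
    ["जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।"],
    src_lang="hin_Deva",
    tgt_lang="tam_Taml",
)
# Each entry is a normalized sentence with the source and target
# language tags attached, ready to be passed to the tokenizer.
print(batch[0])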

Performance

The IndicTrans2 model shows remarkable performance in various tasks, especially when it comes to speed, accuracy, and efficiency. Let’s dive into the details.

Speed

How fast is the IndicTrans2 model? At 320M parameters it is a compact, distilled model, so it loads and translates quickly compared with larger translation models, and it can translate whole batches of sentences in a single forward pass rather than one sentence at a time.
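
The claim above is qualitative; to measure throughput on your own hardware, a minimal timing sketch along these lines works (the batch contents and generation settings are illustrative, not tuned):

import time

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor

model_name = "ai4bharat/indictrans2-indic-indic-dist-320M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)
ip = IndicProcessor(inference=True)

# An illustrative batch: one Hindi sentence repeated eight times.
batch = ip.preprocess_batch(
    ["जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।"] * 8,
    src_lang="hin_Deva",
    tgt_lang="tam_Taml",
)
inputs = tokenizer(batch, truncation=True, padding="longest", return_tensors="pt")

start = time.perf_counter()
with torch.no_grad():
    generated = model.generate(**inputs, max_length=256, num_beams=5)
elapsed = time.perf_counter() - start
print(f"Translated {generated.shape[0]} sentences in {elapsed:.2f}s")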

Accuracy

But how accurate is the IndicTrans2 model? It has been trained on a massive dataset and has achieved impressive results in machine translation tasks. For example, it can translate text from Hindi to Tamil with high accuracy.

Efficiency

The IndicTrans2 model is also efficient in its use of resources. It can run on a variety of devices, including those with limited processing power. This makes it accessible to a wide range of users.
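
A common way to take advantage of this is device-aware loading. The pattern below is a generic transformers sketch rather than anything specific to IndicTrans2: fall back to CPU when no GPU is available, and use half precision on GPU to cut memory use:

import torch
from transformers import AutoModelForSeq2SeqLM

device = "cuda" if torch.cuda.is_available() else "cpu"
# float16 roughly halves GPU memory use; CPUs generally run float32.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/indictrans2-indic-indic-dist-320M",
    trust_remote_code=True,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)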

Examples

Let’s take a look at some examples of the IndicTrans2 model in action:

Hindi (hin_Deva): जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।
Tamil (tam_Taml): நான் சிறியவனாக இருந்தபோது, நான் ஒவ்வொரு நாளும் பூங்காவிற்குச் சென்றேன்.

Hindi (hin_Deva): हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।
Tamil (tam_Taml): நாங்கள் கடந்த வாரம் ஒரு புதிய படத்தைப் பார்த்தோம், அது மிகவும் ஊக்கமூட்டும் ஒன்றாக இருந்தது.

Hindi (hin_Deva): अगर तुम मुझे उस समय पास मिलते, तो हम बाहर खाना खाने चलते।
Tamil (tam_Taml): நீங்கள் என்னை அந்த நேரத்தில் அருகில் சந்தித்திருந்தால், நாங்கள் வெளியே சென்று உணவு உண்ண வேண்டும்.

Limitations

The IndicTrans2 model is a powerful tool, but it’s not perfect. Let’s talk about some of its weaknesses.

Limited Context Understanding

While the IndicTrans2 model can understand a lot of context, it’s not always able to grasp the nuances of human communication. For example, if you give it a sentence with sarcasm or idioms, it might not understand the intended meaning.

Dependence on Training Data

The IndicTrans2 model is only as good as the data it was trained on. If the training data is biased or incomplete, the model’s performance will suffer. This means that it might not perform well on tasks that require a deep understanding of specific domains or cultures.

Lack of Common Sense

The IndicTrans2 model doesn’t have the same level of common sense as humans. It might not understand the implications of certain actions or the consequences of its own outputs.

Limited Ability to Reason

While the IndicTrans2 model can process vast amounts of information, it’s not always able to reason about that information in a logical way. This means that it might not be able to draw conclusions or make decisions based on the data it’s been given.

Vulnerability to Adversarial Attacks

The IndicTrans2 model can be vulnerable to adversarial attacks, which are designed to trick the model into producing incorrect outputs. This is a concern for applications where security is a top priority.

Limited Multilingual Support

The IndicTrans2 model has been trained on a large dataset of text, but it’s not equally proficient in all languages. Its performance might suffer when dealing with languages that are less well-represented in the training data.

Format

The IndicTrans2 model uses a transformer architecture, which is a type of neural network designed for sequence-to-sequence tasks like machine translation.
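
As a quick sanity check, you can load the model and count its parameters to confirm the 320M figure in the name. A short sketch, assuming the transformers library is installed:

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(
    "ai4bharat/indictrans2-indic-indic-dist-320M", trust_remote_code=True
)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")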

Supported Data Formats

This model supports text data in the form of tokenized sequences. You’ll need to preprocess your input text using the IndicProcessor from the IndicTransToolkit library before feeding it into the model.

Special Requirements for Input

When preparing your input text, keep the following in mind:

  • Language codes: You’ll need to specify the source and target languages using their respective language codes (e.g., hin_Deva for Hindi and tam_Taml for Tamil).
  • Tokenization: Use the AutoTokenizer from the transformers library to tokenize your input text. Set truncation=True and padding="longest" so that each batch is padded to its longest sequence and over-long inputs are truncated.

Special Requirements for Output

When generating translations, you’ll need to:

  • Decode generated tokens: Use the batch_decode method from the AutoTokenizer to convert the generated tokens into text.
  • Postprocess translations: Use the postprocess_batch method from the IndicProcessor to replace entities and perform any necessary cleanup.

Example Code

Here’s an example of how to use the IndicTrans2 model for machine translation:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from IndicTransToolkit import IndicProcessor

# Load the model and tokenizer
model_name = "ai4bharat/indictrans2-indic-indic-dist-320M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)

# Pick a device and move the model to it (falls back to CPU without a GPU)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(DEVICE)

# Create an instance of the IndicProcessor
ip = IndicProcessor(inference=True)

# Define your input sentences and language codes
input_sentences = [
    "जब मैं छोटा था, मैं हर रोज़ पार्क जाता था।",
    "हमने पिछले सप्ताह एक नई फिल्म देखी जो कि बहुत प्रेरणादायक थी।",
]
src_lang, tgt_lang = "hin_Deva", "tam_Taml"

# Preprocess the input sentences
batch = ip.preprocess_batch(input_sentences, src_lang=src_lang, tgt_lang=tgt_lang)

# Tokenize the input sentences and generate input encodings
inputs = tokenizer(batch, truncation=True, padding="longest", return_tensors="pt", return_attention_mask=True).to(DEVICE)

# Generate translations using the model
with torch.no_grad():
    generated_tokens = model.generate(**inputs, use_cache=True, min_length=0, max_length=256, num_beams=5, num_return_sequences=1)

# Decode the generated tokens into text
with tokenizer.as_target_tokenizer():
    generated_tokens = tokenizer.batch_decode(generated_tokens.detach().cpu().tolist(), skip_special_tokens=True, clean_up_tokenization_spaces=True)

# Postprocess the translations
translations = ip.postprocess_batch(generated_tokens, lang=tgt_lang)

# Print the translations
for input_sentence, translation in zip(input_sentences, translations):
    print(f"{src_lang}: {input_sentence}")
    print(f"{tgt_lang}: {translation}")