Multilingual E5 Large
The Multilingual E5 Large model is a powerful text embedding model that supports 100 languages. It is designed to handle tasks such as passage retrieval, semantic similarity, and bitext mining with high accuracy and efficiency. With 24 layers and an embedding size of 1024, it achieves state-of-the-art results on benchmarks including Mr. TyDi and MTEB. What sets it apart is its continual training on a mixture of multilingual datasets, which lets it capture nuances across languages. It is not without limits: low-resource languages may see performance degradation, and long texts are truncated to 512 tokens. Still, it serves a wide range of applications, from open-domain question answering to ad-hoc information retrieval.
Model Overview
The Multilingual-E5-large model is a versatile tool for natural language processing. Because it works across many languages, it is a strong choice for applications that need to handle multilingual text.
Capabilities
Primary Tasks
This model excels at:
- Text retrieval: It can efficiently search and retrieve relevant passages from a large database.
- Semantic similarity: It can determine the similarity between two pieces of text, making it useful for tasks like paraphrase detection and bitext mining.
- Linear probing classification: It can be used as a frozen feature extractor for downstream classification tasks (a rough sketch follows this list).
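As a rough sketch of linear probing, the frozen embeddings can be fed to a simple linear classifier. The texts, labels, and use of scikit-learn's LogisticRegression below are assumptions for illustration, not part of the model's documentation.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer('intfloat/multilingual-e5-large')

# Toy labeled data, assumed purely for illustration; note the "query: " prefix.
texts = ["query: great movie, loved it", "query: terrible plot and acting",
         "query: a wonderful performance", "query: boring and far too long"]
labels = [1, 0, 1, 0]

# Linear probing: keep the encoder frozen, embed the texts, and fit a linear classifier on top.
features = model.encode(texts, normalize_embeddings=True)
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(model.encode(["query: an absolute masterpiece"], normalize_embeddings=True)))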
Strengths
- Multilingual support: It supports 100 languages, making it a great choice for applications that require language flexibility.
- High performance: It achieves state-of-the-art results on various benchmarks, including the Mr. TyDi and MTEB evaluations.
Unique Features
- Contrastive pre-training: It was trained using a contrastive pre-training approach, which enables it to learn effective text embeddings.
- Weak supervision: It was trained on a mixture of multilingual datasets with weak supervision, making it robust and adaptable to different tasks.
How it Works
The model learns text embeddings through contrastive learning: it is trained so that embeddings of related text pairs (for example, a query and a relevant passage) score higher than embeddings of unrelated pairs. After this weakly supervised contrastive pre-training, the model is fine-tuned on a range of labeled datasets to improve its performance on specific tasks.
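To make this concrete, here is a minimal sketch of an in-batch contrastive (InfoNCE-style) objective over query and passage embeddings. It is an illustration of the general technique, not the actual E5 training code; the temperature value and tensor names are assumptions.

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, passage_emb, temperature=0.05):
    # query_emb and passage_emb are (batch, dim) tensors; row i of each forms a matching pair.
    # Normalize so that the dot product equals cosine similarity.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # Similarity of every query against every passage in the batch.
    logits = q @ p.T / temperature
    # The positive for query i is passage i; all other passages act as in-batch negatives.
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)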
Supported Languages
This model supports 100 languages, including but not limited to:
- English
- Spanish
- French
- German
- Chinese
- Japanese
- Korean
Note that while it supports many languages, low-resource languages may see performance degradation.
Example Use Cases
- Text retrieval: Use the model to search for relevant documents in a large corpus.
- Semantic similarity: Use the model to measure the similarity between two pieces of text.
- Bitext mining: Use the model to find pairs of text that are translations of each other (a minimal sketch follows this list).
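As a rough illustration of the bitext-mining use case, the sketch below embeds candidate sentences in two languages and pairs each English sentence with its nearest neighbour by cosine similarity. The example sentences are assumptions, and a real mining pipeline would typically add margin-based scoring and filtering.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/multilingual-e5-large')

# Candidate sentences in two languages; the "query: " prefix is used because
# this is a symmetric task rather than query/passage retrieval.
english = ["query: The weather is nice today.",
           "query: I would like a cup of coffee."]
german = ["query: Ich hätte gern eine Tasse Kaffee.",  # "I would like a cup of coffee."
          "query: Das Wetter ist heute schön."]         # "The weather is nice today."

en_emb = model.encode(english, normalize_embeddings=True)
de_emb = model.encode(german, normalize_embeddings=True)

# With normalized embeddings, the dot product is the cosine similarity.
scores = en_emb @ de_emb.T
for i, j in enumerate(scores.argmax(axis=1)):
    print(english[i], "<->", german[j], f"(cosine={scores[i, j]:.2f})")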
Performance
The model's performance on the Mr. TyDi benchmark is strong, with an average MRR@10 of 70.5 across the 11 evaluated languages. MRR@10 is the mean reciprocal rank of the first relevant passage within the top 10 retrieved results, so higher is better; in practice, this means the model retrieves relevant passages with high precision.
Benchmark Results
| Language | MRR@10 |
|---|---|
| ar | 70.5 |
| bn | 77.5 |
| en | 73.2 |
| fi | 60.8 |
| id | 66.8 |
| ja | 68.5 |
| ko | 62.5 |
| ru | 61.6 |
| sw | 65.8 |
| te | 72.7 |
| th | 90.2 |
Limitations
- Long texts will be truncated: The model handles inputs of up to 512 tokens; anything beyond that is cut off. If you need to process longer texts, split them into smaller chunks first (a sketch follows this list).
- Low-resource languages may see performance degradation: While the model supports many languages, low-resource languages may not perform as well as languages with more training data.
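A minimal sketch of one way to split a long passage into chunks that fit within the limit, using the model's tokenizer from the transformers library. The chunk size of 500 tokens leaves room for the "passage: " prefix and special tokens; it is an assumption rather than a recommendation from the model authors, and decoding token windows back to text is a simplification.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-large')

def chunk_text(text, max_tokens=500):
    # Tokenize without special tokens, split the token ids into fixed-size windows,
    # and decode each window back into text.
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

long_document = " ".join(["Some very long passage."] * 400)  # placeholder long text
chunks = ["passage: " + chunk for chunk in chunk_text(long_document)]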
Getting Started
To use the model, you'll need to install the sentence_transformers library and download the model weights. You can then use the model to encode text and compute similarity scores.
Format
The model expects input texts to start with specific prefixes:
- query: for queries
- passage: for passages

This applies even to non-English texts. If you're using the model for tasks other than retrieval, you can simply use the query: prefix.
Here’s an example of how to handle inputs:
input_texts = [
    'query: how much protein should a female eat',
    'query: 南瓜的家常做法',  # "home-style pumpkin recipes"
    "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day....",
    "passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精...",  # "1. Stir-fried shredded pumpkin. Ingredients: half a young pumpkin. Seasonings: scallion, salt, sugar, chicken bouillon..."
]
The model outputs embeddings, which can be used for various tasks like semantic similarity, bitext mining, or paraphrase retrieval.
Here's an example of how to get embeddings using the sentence_transformers library:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('intfloat/multilingual-e5-large')
input_texts = [...]  # the prefixed input texts defined above
embeddings = model.encode(input_texts, normalize_embeddings=True)
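Because the embeddings are L2-normalized (normalize_embeddings=True), cosine similarity reduces to a dot product. As a short follow-on, assuming the four prefixed input_texts from the earlier example (two queries followed by two passages):

# Score each query (first two rows) against each passage (last two rows).
# With normalized embeddings, the matrix product gives cosine similarities.
scores = embeddings[:2] @ embeddings[2:].T
print(scores)  # higher values indicate a better query/passage match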