Multilingual E5 Large

Multilingual Embeddings

The Multilingual E5 Large model is a powerful text embeddings model that supports 100 languages. It's designed to handle tasks like passage retrieval, semantic similarity, and bitext mining with high accuracy and efficiency. With 24 layers and an embedding size of 1024, the model achieves state-of-the-art results on benchmarks such as Mr. TyDi and MTEB. What makes it unique? It was continually trained on a mixture of multilingual datasets, allowing it to capture nuances across different languages. It isn't perfect: low-resource languages may see performance degradation, and long texts are truncated to 512 tokens. Still, it's a remarkable model that can be used for a wide range of applications, from open QA to ad-hoc information retrieval.

intfloat · MIT license · Updated a year ago

Model Overview

The Multilingual-E5-large model is a powerful tool for natural language processing tasks. It’s designed to work across many languages, making it a great choice for applications that need to handle multilingual text.

Capabilities

Primary Tasks

This model excels at:

  • Text retrieval: It can efficiently search and retrieve relevant passages from a large database.
  • Semantic similarity: It can determine the similarity between two pieces of text, making it useful for tasks like paraphrase detection and bitext mining.
  • Linear probing classification: It can be used as a frozen feature extractor for downstream classification tasks (see the sketch after this list).
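
As an illustration of linear probing, here is a minimal sketch that trains a logistic-regression classifier on frozen E5 embeddings; the toy texts, labels, and use of scikit-learn are illustrative assumptions, not a prescribed recipe:

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer('intfloat/multilingual-e5-large')

# Toy sentiment data; real linear probing would use a proper labelled dataset
texts = ['query: this film was wonderful', 'query: a complete waste of time',
         'query: absolutely loved it', 'query: terrible and boring']
labels = [1, 0, 1, 0]

# Frozen embeddings act as fixed features for a simple linear classifier
features = model.encode(texts, normalize_embeddings=True)
clf = LogisticRegression().fit(features, labels)
print(clf.predict(model.encode(['query: I really enjoyed this'], normalize_embeddings=True)))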

Strengths

  • Multilingual support: It supports 100 languages, making it a great choice for applications that require language flexibility.
  • High performance: It achieves state-of-the-art results on various benchmarks, including the Mr. TyDi and MTEB evaluations.

Unique Features

  • Contrastive pre-training: It was trained using a contrastive pre-training approach, which enables it to learn effective text embeddings.
  • Weak supervision: It was trained on a mixture of multilingual datasets with weak supervision, making it robust and adaptable to different tasks.

How it Works

The model uses a technique called contrastive learning to learn text embeddings. During pre-training, it is shown pairs of related and unrelated texts and learns to pull the embeddings of related pairs together while pushing unrelated pairs apart. The model is then fine-tuned on a range of labeled datasets to improve its performance on specific tasks.
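
To make the idea concrete, here is a minimal sketch of an InfoNCE-style contrastive objective over a batch of query/passage embeddings; the in-batch negatives setup and the temperature value are illustrative assumptions, not the model's exact training recipe:

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, passage_emb, temperature=0.05):
    # Normalize so that dot products become cosine similarities
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # Score every query against every passage in the batch
    logits = q @ p.t() / temperature
    # The matching passage for query i sits at index i; the rest act as negatives
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random vectors standing in for real embeddings
loss = info_nce_loss(torch.randn(8, 1024), torch.randn(8, 1024))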

Supported Languages

This model supports 100 languages, including but not limited to:

  • English
  • Spanish
  • French
  • German
  • Chinese
  • Japanese
  • Korean

Note that while it supports many languages, low-resource languages may see performance degradation.

Example Use Cases

  • Text retrieval: Use the model to search for relevant documents in a large corpus.
  • Semantic similarity: Use the model to measure the similarity between two pieces of text.
  • Bitext mining: Use the model to find pairs of text that are translations of each other (see the sketch after this list).
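
As an illustration of bitext mining, here is a minimal sketch that pairs English and German sentences by cosine similarity; the sentences are made-up examples, and a real pipeline would typically add score thresholds or margin-based filtering:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/multilingual-e5-large')

# Bitext mining is a symmetric task, so the "query: " prefix is used on both sides
en = ['query: The weather is nice today.', 'query: I like reading books.']
de = ['query: Ich lese gerne Bücher.', 'query: Das Wetter ist heute schön.']

en_emb = model.encode(en, normalize_embeddings=True)
de_emb = model.encode(de, normalize_embeddings=True)

# Cosine similarity matrix; each English sentence picks its best German match
sims = en_emb @ de_emb.T
for i, j in enumerate(sims.argmax(axis=1)):
    print(en[i], '<->', de[j], f'(score {sims[i, j]:.3f})')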

Examples

  • Query: "What is the daily protein requirement for women?" → "46 grams per day"
  • Query: "What are some common ways to cook pumpkin?" → "Stir-fry with garlic and ginger, boil and mash, or roast with olive oil and salt."
  • Example query-passage similarity scores: [[0.876], [0.542], [0.812], [0.921]]

Performance

The model’s performance on the Mr. TyDi benchmark is impressive, with an average MRR@10 of 70.5 across the benchmark’s languages. But what does this mean in practice? Simply put, for a typical query the first relevant passage appears at or very near the top of the ranked results.
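
For reference, MRR@10 is the mean, over all queries, of the reciprocal rank of the first relevant passage within the top 10 results (counted as 0 if no relevant passage appears). A toy calculation with made-up ranks:

def mrr_at_10(first_relevant_ranks):
    # first_relevant_ranks[i] is the 1-based rank of the first relevant
    # passage for query i, or None if none appears in the top 10
    scores = [1.0 / r if r is not None and r <= 10 else 0.0 for r in first_relevant_ranks]
    return sum(scores) / len(scores)

print(mrr_at_10([1, 2, None, 5]))  # (1 + 0.5 + 0 + 0.2) / 4 = 0.425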

Benchmark Results

Language    MRR@10
ar          77.5
bn          73.2
en          60.8
fi          66.8
id          68.5
ja          62.5
ko          61.6
ru          65.8
sw          72.7
te          90.2
Average     70.5

Limitations

  • Long texts will be truncated: The model can only handle texts up to 512 tokens long. If you need to process longer texts, you may need to split them into smaller chunks (see the sketch after this list).
  • Low-resource languages may see performance degradation: While the model supports many languages, low-resource languages may not perform as well as languages with more training data.
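
A minimal chunking sketch using the model's tokenizer; the chunk_text helper and the 500-token budget (headroom for the "passage: " prefix and special tokens) are illustrative assumptions, and a naive split like this can cut across sentences:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('intfloat/multilingual-e5-large')

def chunk_text(text, max_tokens=500):
    # Split a long text into pieces that stay under the 512-token limit
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

long_document = "..."  # placeholder for a document longer than 512 tokens
passages = ['passage: ' + chunk for chunk in chunk_text(long_document)]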

Getting Started

To use the model, you’ll need to install the sentence_transformers library and download the model weights. You can then use the model to encode text and compute similarity scores.

Format

The model expects input texts to start with specific prefixes:

  • query: for queries
  • passage: for passages

This applies to non-English texts as well. For symmetric tasks such as semantic similarity, bitext mining, or paraphrase retrieval, simply use the query: prefix for all inputs; the passage: prefix is only needed for the document side of retrieval-style tasks.

Here’s an example of how to handle inputs:

input_texts = ['query: how much protein should a female eat', 
               'query: 南瓜的家常做法', 
               "passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day....", 
               "passage: 1.清炒南瓜丝 原料:嫩南瓜半个 调料:葱、盐、白糖、鸡精..."]

The model outputs embeddings, which can be used for various tasks like semantic similarity, bitext mining, or paraphrase retrieval.

Here’s an example of how to get embeddings using the sentence_transformers library:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('intfloat/multilingual-e5-large')
input_texts = [...]
embeddings = model.encode(input_texts, normalize_embeddings=True)
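
To turn these embeddings into query-passage scores like the ones shown in the Examples section, take dot products of the normalized vectors (with normalize_embeddings=True this equals cosine similarity). A small sketch, assuming input_texts is the four-item list shown earlier (two queries followed by two passages):

# Two queries followed by two passages, as in the earlier input_texts example
query_emb, passage_emb = embeddings[:2], embeddings[2:]

# Normalized embeddings, so the dot product is the cosine similarity
scores = query_emb @ passage_emb.T
print(scores)  # 2 x 2 matrix of query-passage similarity scores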