BGE Base En v1.5

English Embedding Model

The bge-base-en-v1.5 model is an English text-embedding model from BAAI (MIT-licensed). With its efficient design and support for diverse retrieval augmentation needs, it's well suited to tasks like passage retrieval and semantic similarity. It maps text to 768-dimensional dense vectors, can be fine-tuned for specific tasks, and posts strong results on embedding benchmarks such as MTEB. A short instruction can be prepended to queries for short-query-to-long-passage retrieval, making it versatile across applications. Version 1.5 improves the model's similarity distribution and strengthens its retrieval ability even without that instruction. It's also compatible with frameworks like FlagEmbedding, Sentence-Transformers, LangChain, and HuggingFace Transformers, making it easy to integrate into existing workflows.
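
For a quick orientation, here is a minimal sketch of encoding text with the model via Sentence-Transformers (assuming the sentence-transformers package is installed; the sample sentences are illustrative):

```python
from sentence_transformers import SentenceTransformer

# Downloads the model from the HuggingFace Hub on first use.
model = SentenceTransformer('BAAI/bge-base-en-v1.5')
embeddings = model.encode(
    ["BGE embeddings work well for retrieval.", "Semantic search maps text to vectors."],
    normalize_embeddings=True,  # unit-length vectors: dot product == cosine similarity
)
print(embeddings.shape)  # (2, 768): the base model produces 768-dimensional vectors
```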


Model Overview

The bge-base-en-v1.5 model is a general-purpose English embedding model built for retrieval-augmented LLMs. It's part of BAAI's BGE model series and offers features like:

  • Dense embeddings: Maps sentences and passages to 768-dimensional vectors
  • BERT-style encoder: Handles input lengths up to 512 tokens
  • Contrastive training: Tuned for retrieval, semantic similarity, and other embedding tasks

The English v1.5 models rank highly on the MTEB benchmark, and their Chinese counterparts do the same on C-MTEB. Want to know more about the technical details? Check out the technical report and code on the FlagEmbedding GitHub page!

Capabilities

The bge-base-en-v1.5 model is a powerful tool for text retrieval and embedding. It's designed to support diverse retrieval augmentation needs for Large Language Models (LLMs). Here are some of its key capabilities:

  • Language Coverage: This model targets English; sibling BGE models (e.g., bge-large-zh-v1.5) cover Chinese.
  • Dense Retrieval: It encodes queries and passages as dense vectors so relevant passages can be found by fast similarity search.
  • Fine-tuning: The model can be fine-tuned for specific tasks, such as passage retrieval or semantic similarity.
  • Reranking: Its top-k retrieved documents can be re-scored with the companion bge-reranker cross-encoder models.
  • Instruction-Aware Queries: A short retrieval instruction can be prepended to queries to improve short-query-to-long-passage retrieval.
  • Context Length: It can handle input lengths up to 512 tokens.

Strengths

The bge-base-en-v1.5 model has several strengths that make it a valuable tool for text retrieval and embedding:

  • High Performance: The BGE v1.5 models post top results on embedding benchmarks, the English models on MTEB and the Chinese models on C-MTEB.
  • Efficient: At BERT-base size, it's fast and scalable enough for large-scale applications.
  • Flexible: The model can be fine-tuned for specific tasks and pairs naturally with the bge-reranker models for higher-precision reranking.

Unique Features

The bge-base-en-v1.5 model has several features that set it apart from other models:

  • Unified Embedding Model: One encoder serves diverse retrieval augmentation needs for LLMs, from search to similarity scoring.
  • Improved Similarity Distribution: Version 1.5 spreads similarity scores more reasonably, making results easier to interpret and threshold.
  • Instruction-Free Retrieval: v1.5 retrieves well even without the query instruction, which simplifies deployment.

Comparison to Other Models

Within the BGE family, bge-base-en-v1.5 sits between bge-small-en-v1.5 and bge-large-en-v1.5: the large model scores somewhat higher on benchmarks, while the base model offers a better speed/quality trade-off. For workloads that need many languages or inputs longer than 512 tokens, the multilingual bge-m3 model (100+ languages, up to 8192 tokens) is the better fit.

Performance

The bge-base-en-v1.5 model performs strongly across embedding tasks, especially in retrieval-augmented language model pipelines. Let's dive into its speed, accuracy, and efficiency.

Speed

The bge-base-en-v1.5 model is designed to be fast and efficient. Passages are embedded once and stored, so query-time retrieval reduces to encoding a single short query and running a vector similarity search, which takes milliseconds against a precomputed index.
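
To put numbers on this for your own setup, a rough throughput check like the sketch below can help (figures vary widely with hardware and batch size; the repeated sentence and batch size here are arbitrary):

```python
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-base-en-v1.5')
sentences = ["An example passage about the Eiffel Tower in Paris."] * 512

start = time.perf_counter()
model.encode(sentences, batch_size=64, normalize_embeddings=True)
elapsed = time.perf_counter() - start
print(f"{len(sentences) / elapsed:.0f} sentences/sec")
```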

Accuracy

The bge-base-en-v1.5 model achieves state-of-the-art results on embedding benchmarks such as MTEB (C-MTEB for the Chinese variants). It is trained with contrastive learning, and fine-tuning on task-specific data can push accuracy further on classification and retrieval tasks.

Efficiency

The bge-base-en-v1.5 model is economical with computational resources: it is a BERT-base-sized encoder, runs comfortably on a single GPU, and supports fp16 inference for extra speed. It can also be fine-tuned with a modest amount of data and still achieve competitive performance.

Example Use Cases

The bge-base-en-v1.5 model can be used in various applications, such as:

  • Retrieval-augmented language models
  • Text classification tasks
  • Passage retrieval
  • Semantic similarity tasks
Examples

Encode short queries and long passages for retrieval. With FlagEmbedding, encode_queries automatically adds the retrieval instruction to queries, while passages are encoded as-is:

```python
from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-base-en-v1.5', use_fp16=True)
q_embeddings = model.encode_queries(['What is the capital of France?', 'Where is the Eiffel Tower located?'])
p_embeddings = model.encode(['Paris is the capital of France.', 'The Eiffel Tower is located in Paris.'])
scores = q_embeddings @ p_embeddings.T
```

Compute the similarity between two sentences using the Chinese BAAI/bge-large-zh-v1.5 model (the instruction string translates to "Generate a representation for this sentence for retrieving related articles:"):

```python
from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-large-zh-v1.5',
                  query_instruction_for_retrieval='为这个句子生成表示以用于检索相关文章:',
                  use_fp16=True)
embeddings_1 = model.encode(['样例数据-1'])  # "sample data 1"
embeddings_2 = model.encode(['样例数据-2'])  # "sample data 2"
similarity = embeddings_1 @ embeddings_2.T
```

To fine-tune a model such as BAAI/bge-large-en-v1.5, prepare training data and mine hard negatives following the examples in the FlagEmbedding repository; hard-negative mining is a straightforward way to improve retrieval performance.

Limitations

The bge-base-en-v1.5 model has some limitations that you should be aware of. While it's great at retrieving information, it's not perfect.

Limited Context Length

The encoder accepts at most 512 tokens per input. Longer queries or passages are truncated, so any information past that limit is never seen by the model.
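
A quick way to see the limit in action is to count tokens before encoding. This sketch uses the model's own tokenizer; the oversized passage is illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-base-en-v1.5')

passage = "A very long passage about many different topics. " * 200
print(len(tokenizer(passage)['input_ids']))  # far more than 512 tokens

# Truncate explicitly so you control what the model actually sees.
encoded = tokenizer(passage, truncation=True, max_length=512, return_tensors='pt')
print(encoded['input_ids'].shape)  # (1, 512): everything past the limit is dropped
```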

Similarity Distribution

The model's similarity scores need careful interpretation. Because it is fine-tuned with contrastive learning at a low temperature, scores crowd into a narrow high band (roughly 0.6 to 1), so a score of 0.5 or higher doesn't necessarily mean two sentences are similar. Rank candidates against each other, or calibrate a threshold for your specific use case.
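
The sketch below illustrates the point: even two unrelated sentences can land above 0.5 (the sentences are arbitrary and exact values depend on the inputs):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-base-en-v1.5')
emb = model.encode(
    ["The cat sat on the mat.", "Quarterly revenue grew four percent."],
    normalize_embeddings=True,
)
print(float(emb[0] @ emb[1]))  # often above 0.5 despite unrelated content

# Prefer relative ranking (or a calibrated threshold) over a fixed 0.5 cutoff.
```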

Query Instructions

For short queries retrieving long passages, prepend the retrieval instruction to the query ('Represent this sentence for searching relevant passages: ' for the English models); passages need no instruction. With v1.5 the instruction is optional, but it can still help on this task.
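
With Sentence-Transformers you prepend the instruction by hand, as in this sketch (FlagEmbedding's encode_queries does the same automatically; the query and passage are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-base-en-v1.5')
instruction = "Represent this sentence for searching relevant passages: "

# Instruction on the query side only; passages are encoded as-is.
q_emb = model.encode([instruction + "what is the capital of France?"], normalize_embeddings=True)
p_emb = model.encode(["Paris is the capital of France."], normalize_embeddings=True)
print(float(q_emb[0] @ p_emb[0]))
```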

Limited Languages

The bge-base-en-v1.5 model is trained for English only. If you need to work with other languages, use the Chinese BGE variants or a multilingual model like bge-m3.

Fine-tuning

Fine-tuning the model can be tricky. You'll typically need to mine hard negatives to see real gains, and even then results vary by domain.
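
For reference, the FlagEmbedding fine-tuning examples expect JSON-lines training records shaped roughly like the sketch below; the file name and sentences are illustrative, and the authoritative format lives in the FlagEmbedding repository:

```python
import json

# One JSON object per line: "pos" holds relevant passages,
# "neg" holds the mined hard negatives.
record = {
    "query": "what is the capital of France?",
    "pos": ["Paris is the capital of France."],
    "neg": ["Berlin is the capital of Germany."],
}
with open("finetune_data.jsonl", "w") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```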

Retrieval Performance

As a bi-encoder, the model trades some precision for speed, so its raw retrieval ranking might not always be the best. For top results, re-rank the top-k candidates with a cross-encoder model like bge-reranker.
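
A re-ranking pass might look like the following sketch, assuming FlagEmbedding's FlagReranker class (the query/passage pairs are illustrative):

```python
from FlagEmbedding import FlagReranker

# Cross-encoder: scores each (query, passage) pair jointly; slower but more precise.
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)
scores = reranker.compute_score([
    ["what is the capital of France?", "Paris is the capital of France."],
    ["what is the capital of France?", "The Eiffel Tower has 1,083 steps."],
])
print(scores)  # higher score = more relevant; re-order your top-k by these values
```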

Technical Requirements

To use the bge-base-en-v1.5 model, you'll need a certain level of technical expertise: you should know how to work with FlagEmbedding, Sentence-Transformers, LangChain, or HuggingFace Transformers.

Format

The bge-base-en-v1.5 model uses a BERT-style transformer encoder and accepts input in the form of tokenized text sequences. The v1.5 family includes English and Chinese variants, and each model handles input lengths up to 512 tokens.

Input Format

  • Input text sequences should be tokenized using a compatible tokenizer.
  • For short query to long passage retrieval tasks, an instruction should be added to the query, but not to the passages.
  • For the English models the instruction is 'Represent this sentence for searching relevant passages: '; the Chinese models use the instruction shown in the code examples below.

Output Format

  • The model outputs a dense vector representation of the input text sequence (768 dimensions for the base model).
  • The output vectors can be used for similarity calculations; L2-normalize them and cosine similarity reduces to a dot product, as the sketch below shows.
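
As a sanity check, this sketch confirms that with normalized outputs the dot product equals the cosine similarity (the input phrases are arbitrary):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-base-en-v1.5')
a, b = model.encode(["deep learning", "machine learning"], normalize_embeddings=True)

dot = float(a @ b)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
assert abs(dot - cosine) < 1e-6  # unit-length vectors: dot == cosine
```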

Special Requirements

  • For fine-tuning, it is recommended to mine hard negatives to improve retrieval performance.
  • The model can be used with or without instructions, but using instructions can improve retrieval performance for short query to long passage tasks.

Code Examples

  • Using FlagEmbedding:

```python
from FlagEmbedding import FlagModel

# The query instruction (Chinese models) translates to:
# "Generate a representation for this sentence for retrieving related articles:"
model = FlagModel('BAAI/bge-large-zh-v1.5',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
                  use_fp16=True)
sentences = ["样例数据-1", "样例数据-2"]  # "sample data 1", "sample data 2"
embeddings = model.encode(sentences)
```
  • Using Sentence-Transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
sentences = ["样例数据-1", "样例数据-2"]
embeddings = model.encode(sentences, normalize_embeddings=True)
```
  • Using LangChain:

```python
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': True}
# English models take the English retrieval instruction.
model = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    query_instruction="Represent this sentence for searching relevant passages: ",
)
```
  • Using HuggingFace Transformers:

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
model.eval()

sentences = ["样例数据-1", "样例数据-2"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
# Take the [CLS] token embedding as the sentence embedding, then L2-normalize.
sentence_embeddings = model_output[0][:, 0]
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
```