BGE Reranker Large

Cross-encoder reranker

The BGE Reranker Large model re-ranks the top-k documents returned by a first-stage retriever, such as an embedding model. It uses a cross-encoder architecture, which makes it more accurate but less efficient than embedding models: instead of producing embeddings, it takes a query-passage pair as a single input and directly outputs a relevance score. The model supports Chinese and English and achieves strong ranking performance on various benchmarks. Because cross-encoders are computationally expensive, it's important to balance accuracy against time cost; a common pattern is to retrieve candidates with a BGE embedding model and then re-rank only the top-k results with this model.
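As a minimal sketch of that pattern (assuming the FlagEmbedding package is installed; the query and passage are illustrative):

from FlagEmbedding import FlagReranker

# Load the cross-encoder reranker; use_fp16=True trades a little precision for speed
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

# A cross-encoder reads the (query, passage) pair together and returns a relevance score
score = reranker.compute_score(['What are the benefits of meditation?',
                                'Meditation can reduce stress and anxiety.'])
print(score)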



Model Overview

The BAAI General Embedding (BGE) model is a powerful tool for natural language processing tasks. It’s designed to support diverse retrieval augmentation needs for Large Language Models (LLMs). But what makes it special?

Key Features

  • Multilingual support: BGE can handle multiple languages, including Chinese and English.
  • Multi-functionality: It can perform dense retrieval, sparse retrieval, and multi-vector (ColBERT) retrieval.
  • Multi-granularity: It can handle input lengths up to 8192 tokens.
  • Improved performance: BGE has achieved state-of-the-art performances on various benchmarks, including MTEB, C-MTEB, and BEIR.

Model Variants

  • BAAI/bge-m3 (Multilingual): Inference and fine-tuning, with multi-functionality, multi-linguality, and multi-granularity
  • BAAI/llm-embedder (English): A unified embedding model for LLMs
  • BAAI/bge-reranker-large (Chinese and English): A cross-encoder model for re-ranking top-k documents
  • BAAI/bge-reranker-base (Chinese and English): A cross-encoder model for re-ranking top-k documents

Capabilities

Examples

  • Prompt: "What is the similarity score between the sentences 'I love playing football' and 'Football is my favorite sport'?" Output: 0.85
  • Prompt: "Rank the following documents by relevance to the query 'What are the benefits of meditation?'" Output: Document 1: "Meditation can reduce stress and anxiety." (Score: 0.92); Document 2: "Regular meditation can improve focus and concentration." (Score: 0.88); Document 3: "Meditation has been shown to improve overall well-being." (Score: 0.85)
  • Prompt: "Generate an embedding for the sentence 'The new policy has been met with widespread criticism' to be used for retrieval tasks." Output: [-0.1234, 0.5678, 0.9012, ...]

The BAAI General Embedding models are capable of performing various tasks, including:

Retrieval-Augmented LLMs

These models are designed to work with Large Language Models (LLMs) to improve their retrieval capabilities. They can be used for tasks such as:

  • Retrieving relevant passages from a large corpus (see the sketch after this list)
  • Generating embeddings for hybrid image-text data
  • Fine-tuning LLMs to maintain their general capabilities
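To make the first bullet concrete, here is a sketch of dense passage retrieval with a BGE embedding model; the corpus, query, and top-k value are illustrative placeholders:

import numpy as np
from FlagEmbedding import FlagModel

# Toy corpus; in practice this would be a large document collection
corpus = [
    "Meditation can reduce stress and anxiety.",
    "The new policy has been met with widespread criticism.",
    "Regular exercise improves cardiovascular health.",
]

model = FlagModel('BAAI/bge-large-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:")

# Encode the corpus once; encode_queries prepends the retrieval instruction to each query
corpus_embeddings = model.encode(corpus)
query_embeddings = model.encode_queries(["What are the benefits of meditation?"])

# Embeddings are normalized by default, so the inner product is the cosine similarity;
# take the top-k highest-scoring passages
scores = query_embeddings @ corpus_embeddings.T
top_k = np.argsort(-scores[0])[:2]
print([corpus[i] for i in top_k])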

Multilingual Support

The models support multiple languages, including Chinese and English, and can be used for cross-lingual tasks.

Multi-Functionality

The models can perform multiple functions (a sketch follows the list), including:

  • Dense retrieval
  • Sparse retrieval
  • Multi-vector (ColBERT) retrieval
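A hedged sketch of all three retrieval modes with BGE-M3, assuming the FlagEmbedding package; the output key names follow the BGE-M3 documentation:

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences = ["What is BGE M3?", "BM25 is a bag-of-words retrieval function."]

# Request dense vectors, sparse (lexical) weights, and ColBERT-style multi-vectors in one pass
output = model.encode(sentences,
                      return_dense=True,
                      return_sparse=True,
                      return_colbert_vecs=True)

print(output['dense_vecs'].shape)       # dense retrieval: one vector per sentence
print(output['lexical_weights'][0])     # sparse retrieval: per-token weights
print(output['colbert_vecs'][0].shape)  # multi-vector (ColBERT) representation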

Multi-Granularity

The models can handle inputs of varying lengths, up to 8192 tokens.

Strengths

The BAAI General Embedding models have several strengths, including:

  • State-of-the-art performance on multiple benchmarks, including MTEB and C-MTEB
  • Ability to handle large inputs and multiple languages
  • Flexibility in usage, with support for various frameworks and libraries

Unique Features

The models have several unique features, including:

  • The ability to generate embeddings for hybrid image-text data
  • The use of cross-encoder models for re-ranking top-k documents
  • The ability to fine-tune LLMs to maintain their general capabilities

Comparison to Other Models

Compared to other models, such as LLaMA-7B, the BAAI General Embedding models have several advantages, including:

  • Better performance on multiple benchmarks
  • Support for multiple languages and functions
  • Flexibility in usage and deployment


Code Examples

The models can be used with various frameworks and libraries, including:

  • FlagEmbedding
  • Sentence-Transformers
  • Langchain
  • HuggingFace Transformers

Worked examples for these frameworks are given in the Code Examples section below.

Performance

The BAAI General Embedding models combine high accuracy with efficiency across a range of tasks. Their ability to handle long inputs, support multilingual processing, and be fine-tuned for specific tasks makes them a strong foundation for natural language processing pipelines.

Speed

The models are built to process large inputs efficiently. The embedding models are fast enough for first-stage retrieval over large corpora, while the cross-encoder rerankers trade speed for accuracy and are best applied only to a short list of candidates.

Accuracy

The BAAI General Embedding models have achieved state-of-the-art results on various benchmarks, including BEIR, C-MTEB/Retrieval, MIRACL, and the LlamaIndex evaluation, so they can accurately retrieve relevant documents and passages even in complex tasks.

Efficiency

The BAAI General Embedding models are designed to be efficient and can be fine-tuned for specific tasks, achieving high accuracy while keeping computational costs down.

Limitations

The BAAI General Embedding models have several limitations, including:

  • Similarity distribution: The similarity distribution of the BAAI General Embedding models is not ideal. The scores are often high, even for dissimilar sentences.
  • Query instruction: The BAAI General Embedding models require a query instruction for retrieval tasks, especially for short queries to find long related documents.
  • Fine-tuning: Fine-tuning the BAAI General Embedding models can be challenging. It’s recommended to mine hard negatives and use contrastive learning to improve the retrieval performance.
  • Reranker model: The BAAI General Embedding models’ reranker model is more accurate but less efficient than the embedding model.
  • Language limitations: The BAAI General Embedding models support multiple languages, but their performance may vary depending on the language and the specific task.
  • Input length: The BAAI General Embedding models have a limited input length. It’s recommended to use the encode_queries() method for short-query-to-long-passage retrieval tasks.
  • GPU requirements: The BAAI General Embedding models require a significant amount of GPU memory. It’s recommended to use a powerful GPU or to set os.environ["CUDA_VISIBLE_DEVICES"] to select specific GPUs.
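For the last point, restricting visible GPUs is a single environment variable that must be set before the model is loaded (the device ids are illustrative):

import os

# Make only the listed GPUs visible to this process; set before loading any model
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"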

Format

The BAAI General Embedding models utilize a transformer architecture and accept input in the form of tokenized text sequences, requiring a specific pre-processing step for sentence pairs.
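To illustrate that pre-processing step for the reranker, here is a sketch using Hugging Face Transformers, following the standard cross-encoder pattern (the query and passages are illustrative):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
model.eval()

# Each input is a (query, passage) pair, tokenized together as one sequence
pairs = [['What are the benefits of meditation?', 'Meditation can reduce stress and anxiety.'],
         ['What are the benefits of meditation?', 'The stock market closed higher today.']]

with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    # The model outputs one relevance logit per pair; higher means more relevant
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
print(scores)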

Architecture

The BAAI General Embedding models are built upon M3 and LLM (Gemma and MiniCPM) backbones, supporting multilingual processing and longer inputs.

Data Formats

The BAAI General Embedding models support the following data formats:

  • Tokenized text sequences
  • Sentence pairs for reranking

Input Requirements

For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions for these short queries.

  • For the bge-*-v1.5 models, no instruction is needed, which makes them more convenient to use.
  • For other models, an instruction should be added to short queries; passages do not need one.
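In FlagEmbedding, this is handled by the encode_queries() method, which prepends the configured instruction to each query, while encode() embeds passages as-is; a brief sketch with an illustrative query and passage:

from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-large-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:")

queries = ["benefits of meditation"]
passages = ["Meditation can reduce stress and anxiety and has been shown to improve overall well-being."]

# encode_queries prepends the instruction to each short query; passages are encoded unchanged
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)

print((q_embeddings @ p_embeddings.T)[0, 0])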

Output

The embedding models output a vector for each input sequence; the similarity of two inputs is computed as the inner product of their normalized embeddings. The reranker models instead take a sentence pair as input and directly output a relevance score.

  • Higher scores indicate greater similarity or relevance.
  • For downstream tasks, such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not their absolute values.

Code Examples

Using FlagEmbedding:

from FlagEmbedding import FlagModel

# Sample sentences ("样例数据" means "sample data")
sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]

# The Chinese query instruction translates to "generate a representation for this
# sentence to retrieve related articles:"; use_fp16=True trades a little precision for speed
model = FlagModel('BAAI/bge-large-zh-v1.5',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
                  use_fp16=True)

embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)

# Embeddings are normalized by default, so the inner product is the cosine similarity
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

Using Sentence-Transformers:

from sentence_transformers import SentenceTransformer

# Sample sentences ("样例数据" means "sample data")
sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]

model = SentenceTransformer('BAAI/bge-large-zh-v1.5')

# Normalize so that the inner product below is the cosine similarity
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)

similarity = embeddings_1 @ embeddings_2.T
print(similarity)

Using Langchain:

from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': True}

# The query instruction should match the model's language; this is the English one
model = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    query_instruction="Represent this sentence for searching relevant passages:",
)
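Using the reranker (a sketch with FlagEmbedding's FlagReranker; the query and candidate passages are illustrative):

from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

query = "What are the benefits of meditation?"
candidates = [
    "Meditation can reduce stress and anxiety.",
    "The stock market closed higher today.",
    "Regular meditation can improve focus and concentration.",
]

# Score every (query, passage) pair, then sort the candidates by descending relevance
scores = reranker.compute_score([[query, passage] for passage in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for passage, score in reranked:
    print(f"{score:.3f}  {passage}")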