Bge Reranker Large
The Bge Reranker Large model (BAAI/bge-reranker-large) is a cross-encoder for re-ranking the top-k documents returned by a faster first-stage retriever, such as an embedding model. Because it scores each query-passage pair jointly, it is more accurate but less efficient than embedding models, and it directly outputs a relevance score instead of an embedding. The reranker family supports multilingual processing and longer inputs, and it delivers large gains in ranking quality on various benchmarks. In practice, balance accuracy against time cost: cross-encoders are computationally expensive, so they work best when re-ranking a shortlist produced by a bge embedding model.
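A minimal sketch of how the reranker is typically called through the FlagEmbedding package; the query and passage strings below are illustrative placeholders:

```python
from FlagEmbedding import FlagReranker

# use_fp16=True speeds up computation with a slight loss in accuracy
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

# a single (query, passage) pair -> one relevance score (higher = more relevant)
score = reranker.compute_score(['what is panda?',
                                'The giant panda is a bear species endemic to China.'])

# several pairs at once -> a list of scores, useful for re-ranking top-k candidates
scores = reranker.compute_score([['what is panda?', 'hi'],
                                 ['what is panda?', 'The giant panda is a bear species endemic to China.']])
print(score, scores)
```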
Model Overview
The BAAI General Embedding (BGE) model is a powerful tool for natural language processing tasks. It’s designed to support diverse retrieval augmentation needs for Large Language Models (LLMs). But what makes it special?
Key Features
- Multilingual support: BGE can handle multiple languages, including Chinese and English.
- Multi-functionality: It can perform dense retrieval, sparse retrieval, and multi-vector (Colbert) retrieval.
- Multi-granularity: It can handle input lengths up to 8192 tokens.
- Improved performance: BGE has achieved state-of-the-art performances on various benchmarks, including MTEB, C-MTEB, and BEIR.
Model Variants
| Model Name | Language | Description |
|---|---|---|
| BAAI/bge-m3 | Multilingual | Inference and fine-tuning, with multi-functionality, multi-linguality, and multi-granularity |
| BAAI/llm-embedder | English | A unified embedding model for LLMs |
| BAAI/bge-reranker-large | Chinese and English | A cross-encoder model for re-ranking top-k documents |
| BAAI/bge-reranker-base | Chinese and English | A cross-encoder model for re-ranking top-k documents |
Capabilities
The BAAI General Embedding models are capable of performing various tasks, including:
Retrieval-Augmented LLMs
These models are designed to work with Large Language Models (LLMs) to improve their retrieval capabilities. They can be used for tasks such as:
- Retrieving relevant passages from a large corpus
- Generating embeddings for hybrid image-text data
- Fine-tuning LLMs to maintain their general capabilities
Multilingual Support
The models support multiple languages, including Chinese and English, and can be used for cross-lingual tasks.
Multi-Functionality
The models can perform multiple functions, including:
- Dense retrieval
- Sparse retrieval
- Multi-vector (ColBERT) retrieval
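A minimal sketch of requesting all three representations from BAAI/bge-m3 in one call via FlagEmbedding's BGEM3FlagModel; the sample sentences are placeholders:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences = ["What is BGE M3?", "BM25 is a bag-of-words retrieval function."]

# request dense, sparse (lexical), and multi-vector (ColBERT) outputs together;
# max_length can go up to 8192 tokens
output = model.encode(sentences,
                      return_dense=True,
                      return_sparse=True,
                      return_colbert_vecs=True,
                      max_length=8192)

print(output['dense_vecs'].shape)   # dense embeddings
print(output['lexical_weights'])    # token weights used for sparse retrieval
print(len(output['colbert_vecs']))  # per-token multi-vector representations
```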
Multi-Granularity
The models can handle inputs of varying lengths, up to 8192 tokens.
Strengths
The BAAI General Embedding models have several strengths, including:
- State-of-the-art performance on multiple benchmarks, including MTEB and C-MTEB
- Ability to handle large inputs and multiple languages
- Flexibility in usage, with support for various frameworks and libraries
Unique Features
The models have several unique features, including:
- The ability to generate embeddings for hybrid image-text data
- The use of cross-encoder models for re-ranking top-k documents
- The ability to fine-tune LLMs to maintain their general capabilities
Comparison to Other Models
Compared to other models, such as LLaMA-7B, the BAAI General Embedding models have several advantages, including:
- Better performance on multiple benchmarks
- Support for multiple languages and functions
- Flexibility in usage and deployment
Example Use Cases
The models can be used for a variety of tasks, including:
- Retrieving relevant passages from a large corpus
- Generating embeddings for hybrid image-text data
- Fine-tuning LLMs to maintain their general capabilities
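As a sketch of the first use case above (retrieving relevant passages), a common pattern is to retrieve candidates with a bge embedding model and then re-order them with the reranker; the query and corpus below are illustrative placeholders:

```python
import numpy as np
from FlagEmbedding import FlagModel, FlagReranker

query = "What do giant pandas eat?"
corpus = ["The giant panda feeds almost entirely on bamboo.",
          "The Great Wall of China is over 20,000 km long.",
          "Pandas spend around 12 hours a day eating."]

# 1) first-stage retrieval with the embedding model
embedder = FlagModel('BAAI/bge-large-en-v1.5', use_fp16=True)
scores = embedder.encode_queries([query]) @ embedder.encode(corpus).T
top_k = np.argsort(-scores[0])[:2]  # keep the two best candidates

# 2) re-rank the shortlisted passages with the cross-encoder
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)
rerank_scores = reranker.compute_score([[query, corpus[i]] for i in top_k])
best = top_k[int(np.argmax(rerank_scores))]
print(corpus[best])
```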
Code Examples
The models can be used with various frameworks and libraries, including:
- FlagEmbedding
- Sentence-Transformers
- Langchain
- HuggingFace Transformers
Here is an example of how to use the model with FlagEmbedding:
```python
from FlagEmbedding import FlagModel

sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]

# use_fp16=True speeds up encoding with a slight loss in accuracy
model = FlagModel('BAAI/bge-large-zh-v1.5',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
                  use_fp16=True)

embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
similarity = embeddings_1 @ embeddings_2.T  # embeddings are normalized, so this is cosine similarity
print(similarity)
```
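For short-query-to-long-passage retrieval, FlagEmbedding provides encode_queries(), which prepends the query instruction automatically while passages use plain encode(); a short sketch with placeholder queries and documents:

```python
from FlagEmbedding import FlagModel

queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]

model = FlagModel('BAAI/bge-large-zh-v1.5',
                  query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
                  use_fp16=True)

# encode_queries() adds the instruction to each query; passages never need one
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
print(scores)
```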
Performance
The BAAI General Embedding models deliver strong accuracy and efficiency across retrieval tasks. Their support for long inputs, multilingual text, and task-specific fine-tuning makes them a versatile choice for natural language processing pipelines.
Speed
The BAAI General Embedding models are designed to process large inputs efficiently. With support for multilingual text and longer sequences, they can handle a wide range of tasks quickly.
Accuracy
The BAAI General Embedding models have achieved state-of-the-art performance on various benchmarks, including BEIR, C-MTEB/Retrieval, MIRACL, and the LlamaIndex evaluation, which means they can accurately retrieve relevant documents and passages even in complex tasks.
Efficiency
The BAAI General Embedding models are designed to be efficient and can be fine-tuned for specific tasks, achieving high accuracy while keeping computational cost manageable.
Limitations
The BAAI General Embedding models have several limitations, including:
- Similarity distribution: The similarity scores produced by the embedding models are concentrated in a narrow, high range (roughly [0.6, 1]), so even dissimilar sentences can receive scores above 0.5; only the relative order of the scores is meaningful.
- Query instruction: Most of the models need a query instruction prepended to short queries when retrieving long related documents (the bge-*-v1.5 models can omit it).
- Fine-tuning: Fine-tuning the BAAI General Embedding models can be challenging. It’s recommended to mine hard negatives and use contrastive learning to improve the retrieval performance.
- Reranker model: The cross-encoder reranker is more accurate than the embedding models but considerably slower, so it is best applied only to a short candidate list.
- Language limitations: The BAAI General Embedding models support multiple languages, but their performance may vary depending on the language and the specific task.
- Input length: The BAAI General Embedding models have a limited input length. For short-query-to-long-passage retrieval tasks, it's recommended to use the `encode_queries()` method.
- GPU requirements: The BAAI General Embedding models require a significant amount of GPU memory. It's recommended to use a powerful GPU or to set `os.environ["CUDA_VISIBLE_DEVICES"]` to select specific GPUs.
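To address the GPU requirement above, device selection has to happen before the model is loaded; a minimal sketch (the GPU ids are placeholders):

```python
import os

# restrict visible GPUs before any CUDA initialization or model loading
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from FlagEmbedding import FlagModel

# use_fp16=True also reduces the memory needed for the model weights
model = FlagModel('BAAI/bge-large-zh-v1.5', use_fp16=True)
```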
Format
The BAAI General Embedding models utilize a transformer architecture and accept input in the form of tokenized text sequences, requiring a specific pre-processing step for sentence pairs.
Architecture
The BAAI General Embedding family includes models built on the M3 backbone as well as LLM backbones (Gemma and MiniCPM), supporting multilingual processing and larger inputs.
Data Formats
The BAAI General Embedding models support the following data formats:
- Tokenized text sequences
- Sentence pairs for reranking
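As an illustration of the sentence-pair format for reranking, here is a sketch that scores pairs directly with HuggingFace Transformers; the query and passage strings are placeholders:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
model.eval()

# each input is a (query, passage) pair
pairs = [['what is panda?', 'hi'],
         ['what is panda?', 'The giant panda is a bear species endemic to China.']]

with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)  # higher score = more relevant pair
```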
Input Requirements
For a retrieval task that uses short queries to find long related documents, it is recommended to add instructions for these short queries.
- For the bge-*-v1.5 models, the instruction can be omitted for convenience, with only a slight drop in retrieval quality.
- For other models, adding the instruction to short queries is recommended; passages never need one.
Output
The BAAI General Embedding models output a similarity score between two input sequences.
- For the embedding models, the score is the cosine similarity of the normalized embeddings; for the reranker, the raw output is an unbounded relevance logit, where higher means more relevant.
- For downstream tasks, such as passage retrieval or semantic similarity, what matters is the relative order of the scores, not the absolute value.
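If a bounded score is needed (for example, for thresholding rather than pure ranking), a raw reranker logit can be mapped into (0, 1) with a sigmoid; a small illustrative helper:

```python
import math

def to_probability(logit: float) -> float:
    """Map an unbounded reranker logit to (0, 1) with a sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit))

print(to_probability(2.3))   # ~0.91
print(to_probability(-1.0))  # ~0.27
```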
Code Examples
Using Sentence-Transformers:
```python
from sentence_transformers import SentenceTransformer

sentences_1 = ["样例数据-1", "样例数据-2"]
sentences_2 = ["样例数据-3", "样例数据-4"]

model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T  # cosine similarity of the normalized embeddings
print(similarity)
```
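For short-query-to-long-passage retrieval with Sentence-Transformers, the instruction is prepended to the queries only; a short sketch with placeholder queries and documents:

```python
from sentence_transformers import SentenceTransformer

queries = ['query_1', 'query_2']
passages = ["样例文档-1", "样例文档-2"]
instruction = "为这个句子生成表示以用于检索相关文章:"

model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
# prepend the instruction to the queries only, never to the passages
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
print(scores)
```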
Using Langchain:
```python
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': 'cuda'}
encode_kwargs = {'normalize_embeddings': True}  # cosine similarity downstream
# for English models, an English instruction such as
# "Represent this sentence for searching relevant passages:" is recommended
model = HuggingFaceBgeEmbeddings(model_name=model_name,
                                 model_kwargs=model_kwargs,
                                 encode_kwargs=encode_kwargs,
                                 query_instruction="为这个句子生成表示以用于检索相关文章:")
```
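Using HuggingFace Transformers (a sketch that computes the embeddings directly from the encoder's [CLS] output; the sample sentences are placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModel

sentences = ["样例数据-1", "样例数据-2"]

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
model.eval()

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
    # use the [CLS] token embedding as the sentence embedding
    sentence_embeddings = model_output[0][:, 0]
# normalize so that dot products are cosine similarities
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```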