Bge Reranker Base

Cross-encoder reranker

The Bge Reranker Base model is a cross-encoder that re-ranks the top-k documents retrieved by other models. Because it scores each query-document pair jointly instead of comparing precomputed embeddings, it is more accurate than an embedding model but less efficient, and it needs more computational resources per document. In practice it is applied to a short candidate list from a faster first-stage retriever, where it can improve the quality of the final ranking. It handles Chinese and English input.

BAAI | MIT license | Updated 10 months ago

Model Overview

BGE (BAAI General Embedding) is a family of models for natural language processing tasks, especially retrieval-augmented language models. It belongs to the FlagEmbedding project, which spans several sub-projects: Long-Context LLM, Fine-tuning of LM, Embedding Model, Reranker Model, and Benchmark.

Capabilities

The capabilities below describe the BGE family at its broadest, as found in the BGE-M3 variant. Individual variants, including the rerankers, support a subset.

  • Multilingual Support: Over 100 languages are supported, making the model a versatile tool for various applications.
  • Multi-Functionality: One model can perform dense retrieval, sparse retrieval, and multi-vector (ColBERT) retrieval.
  • Multi-Granularity: Inputs of varying lengths can be processed, up to 8192 tokens.
  • Improved Retrieval Ability: BGE models have reported state-of-the-art results on benchmarks including MTEB, C-MTEB, and MIRACL.
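
The three retrieval modes listed above differ mainly in how a query-passage score is computed. The following is a minimal sketch with toy inputs, not the library's implementation: the function names are hypothetical, the vectors and token weights are made up, and the multi-vector score averages the per-token maxima (some implementations sum them instead).

```python
import math


def dense_score(q_vec, p_vec):
    """Dense retrieval: cosine similarity between one query vector and one passage vector."""
    dot = sum(a * b for a, b in zip(q_vec, p_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in p_vec))
    return dot / norm


def sparse_score(q_weights, p_weights):
    """Sparse retrieval: sum of weight products over tokens that occur in both texts."""
    return sum(w * p_weights[tok] for tok, w in q_weights.items() if tok in p_weights)


def multi_vector_score(q_vecs, p_vecs):
    """Multi-vector (ColBERT-style) retrieval: each query token vector is matched to
    its most similar passage token vector, and the maxima are averaged."""
    best = [max(dense_score(q, p) for p in p_vecs) for q in q_vecs]
    return sum(best) / len(best)


# Toy inputs (assumed values, for illustration only)
print(dense_score([3.0, 4.0], [3.0, 4.0]))
print(sparse_score({"bge": 0.5, "rerank": 0.2}, {"bge": 0.4, "model": 0.9}))
print(multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 1.0]]))
```

Dense scoring compares one vector per text, sparse scoring only rewards shared tokens, and multi-vector scoring keeps token-level detail at a higher storage cost.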

Model Variants

The BGE Model has several variants, each with its strengths and weaknesses. Some of the notable variants include:

| Model Name | Language | Description |
| --- | --- | --- |
| BAAI/bge-m3 | Multilingual | Inference and fine-tune, multi-functionality, multi-linguality, and multi-granularity |
| BAAI/llm-embedder | English | Inference and fine-tune, unified embedding model for diverse retrieval augmentation needs |
| BAAI/bge-reranker-large | Chinese and English | Inference and fine-tune, cross-encoder model for re-ranking top-k documents |
| BAAI/bge-reranker-base | Chinese and English | Inference and fine-tune, cross-encoder model for re-ranking top-k documents |

Performance

BGE models have reported leading results on retrieval benchmarks such as MTEB and C-MTEB. BGE-M3 is the most notable case, since it combines multi-linguality, multi-granularity, and multi-functionality in a single model.

Limitations

While the BGE Model is a powerful tool, it’s not perfect. Some of its limitations include:

  • Language limitations: The model may not perform equally well in all languages.
  • Length limitations: Each model has a maximum input length, and longer texts are truncated, so content past the limit does not affect the result.
  • Similarity score limitations: The similarity score between two dissimilar sentences may be higher than expected. The relative order of scores is more informative than their absolute values.
  • Query instruction limitations: Where a query instruction is expected, omitting it or using the wrong one can hurt retrieval quality.
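
Examples

The sample exchanges below use placeholder inputs ('样例数据-1' means "sample data 1") and illustrative output values.

  • Prompt: What is the similarity between the sentences '样例数据-1' and '样例数据-3'? Response: 0.85
  • Prompt: Represent the sentence 'query_1' for searching relevant passages. Response: Embedding vector: [0.2, 0.5, 0.1]
  • Prompt: Rank the documents ['passage_1', 'passage_2', 'passage_3'] based on their similarity to the query 'query_2'. Response: Ranking: passage_2 (0.9), passage_1 (0.8), passage_3 (0.7)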

Usage

BGE models can be loaded through several libraries, including FlagEmbedding, Sentence-Transformers, Langchain, and HuggingFace Transformers. They can be fine-tuned for specific tasks and applied to retrieval tasks such as passage retrieval and semantic similarity.

Format

Architecture

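BGE models are built on transformer encoders. Input length and language coverage depend on the variant: BGE-M3, for example, accepts up to 8192 tokens and is multilingual, while the rerankers target Chinese and English. The description "a unified embedding model that supports diverse retrieval augmentation needs for LLMs" applies specifically to BAAI/llm-embedder.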

Data Formats

BGE supports multiple data formats, including:

  • Text: BGE can handle text input in various languages, including English, Chinese, and many others.
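  • Tokenized text: The models operate on token IDs rather than raw strings. FlagEmbedding, Sentence-Transformers, and Langchain tokenize internally; with HuggingFace Transformers you call the tokenizer yourself.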

Input Requirements

  • Query instruction: For some tasks, such as passage retrieval, an instruction is prepended to the query text before the query is embedded.
  • Passages: Passages are embedded as-is, with no instruction.
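
The asymmetry between queries and passages can be sketched as a small preprocessing helper. This is an illustration only: `build_retrieval_inputs` is a hypothetical name, and the English instruction string is assumed to be the one used by the English v1.5 embedding models.

```python
# Assumed instruction string for English BGE v1.5 embedding models
EN_INSTRUCTION = "Represent this sentence for searching relevant passages: "


def build_retrieval_inputs(queries, passages, query_instruction=EN_INSTRUCTION):
    """Prefix every query with the instruction; return passages unchanged."""
    prefixed_queries = [query_instruction + q for q in queries]
    return prefixed_queries, list(passages)


queries = ["what is a reranker?"]
passages = ["A reranker re-orders retrieved documents by relevance."]
model_queries, model_passages = build_retrieval_inputs(queries, passages)
print(model_queries[0])
print(model_passages[0])
```

Libraries that accept a query instruction parameter apply this prefix for you, so the helper is only needed when calling a model directly.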

Output

The embedding models output vectors that can be used for passage retrieval, semantic similarity, and similar tasks. The reranker models instead output a relevance score for each query-passage pair.
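
When the output is a reranker relevance score rather than an embedding, it is typically an unbounded value, not a probability. The sketch below assumes that behaviour and shows how a sigmoid maps such scores into the range 0 to 1 without changing their order. The raw scores are invented for illustration.

```python
import math


def sigmoid(score):
    """Map an unbounded score to the range 0..1 (numerically stable for large magnitudes)."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)


# Hypothetical raw reranker scores for three passages
raw_scores = {"passage_1": -2.0, "passage_2": 0.0, "passage_3": 5.5}

normalized = {name: sigmoid(s) for name, s in raw_scores.items()}
ranked = sorted(normalized, key=normalized.get, reverse=True)
print(ranked)
```

Because the sigmoid is monotonic, the ranking is the same before and after normalization; only the scale changes.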

Special Requirements

  • Fine-tuning: BGE models can be fine-tuned for specific tasks, which requires task-specific training data.
  • Hard negatives: Fine-tuning data should include hard negatives, that is, passages that look relevant to a query but are not. They are usually mined by retrieving candidates with an existing model and sampling from the results that are not known positives.
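
Hard-negative mining can be sketched as sampling from a rank window of an existing retriever's results. Everything below is an assumption made for illustration: the function name, the window bounds, the toy ranking, and the record layout with `query`, `pos`, and `neg` keys (modelled on the JSON-lines format commonly used for fine-tuning data).

```python
import json
import random


def mine_hard_negatives(ranked_passages, positives, sample_range=(1, 5), num_negatives=2, seed=0):
    """Sample negatives from ranks [start, stop) of a retriever's results, skipping known positives."""
    start, stop = sample_range
    window = ranked_passages[start:stop]
    candidates = [p for p in window if p not in positives]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return rng.sample(candidates, min(num_negatives, len(candidates)))


# Toy retriever output for one query, best match first
ranked = ["p0", "p1", "p2", "p3", "p4", "p5"]
positives = {"p0", "p2"}

record = {
    "query": "example query",
    "pos": sorted(positives),
    "neg": mine_hard_negatives(ranked, positives),
}
print(json.dumps(record, ensure_ascii=False))
```

Skipping the very top ranks reduces the chance of sampling unlabeled positives, while staying near the top keeps the negatives difficult.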

Code Examples

Here are some code examples for using BGE with different libraries:

FlagEmbedding

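from FlagEmbedding import FlagModel

sentences = ["样例数据-1", "样例数据-2"]  # placeholder sentences ("sample data 1", "sample data 2")
# The instruction means "Generate a representation for this sentence to retrieve relevant articles:".
# use_fp16=True speeds up computation at a small cost in precision.
model = FlagModel('BAAI/bge-large-zh-v1.5', query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:", use_fp16=True)
embeddings = model.encode(sentences)  # encode() does not add the instruction; use encode_queries() for queries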

Sentence-Transformers

from sentence_transformers import SentenceTransformer

sentences = ["样例数据-1", "样例数据-2"]
model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
# Normalized embeddings have unit length, so a dot product gives cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)

Langchain

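from langchain.embeddings import HuggingFaceBgeEmbeddings  # newer LangChain releases: langchain_community.embeddings

model_name = "BAAI/bge-large-en-v1.5"
model_kwargs = {'device': 'cuda'}  # use 'cpu' if no GPU is available
encode_kwargs = {'normalize_embeddings': True}  # unit-length vectors, so dot product equals cosine similarity
# The model is English, so the query instruction is the English one rather than the Chinese one
model = HuggingFaceBgeEmbeddings(model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs, query_instruction="Represent this sentence for searching relevant passages: ")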

HuggingFace Transformers

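from transformers import AutoTokenizer, AutoModel
import torch

sentences = ["样例数据-1", "样例数据-2"]
# Load the tokenizer and model weights
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
model.eval()

# Tokenize; inputs longer than the model's maximum length are truncated
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
# CLS pooling: the hidden state of the first token serves as the sentence embedding
sentence_embeddings = model_output[0][:, 0]
# L2-normalize so that dot products equal cosine similarities
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)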
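
Reranker

The examples above cover the embedding models only. For the reranker that this page describes, a minimal sketch follows. It assumes the `FlagReranker` class from FlagEmbedding and its `compute_score` method, which takes a list of [query, passage] pairs. The helper names `load_reranker_scorer`, `rerank`, and `toy_overlap_scorer` are hypothetical, and the toy scorer is a stand-in (not the model) so that the sketch runs without downloading weights.

```python
def load_reranker_scorer(model_name="BAAI/bge-reranker-base", use_fp16=False):
    """Return a function that scores [query, passage] pairs with FlagReranker.

    The import is deferred so the rest of this sketch runs without FlagEmbedding installed.
    """
    from FlagEmbedding import FlagReranker  # pip install -U FlagEmbedding

    reranker = FlagReranker(model_name, use_fp16=use_fp16)

    def score_pairs(pairs):
        scores = reranker.compute_score(pairs)
        try:
            return [float(s) for s in scores]
        except TypeError:
            # Some versions return a bare number when given a single pair
            return [float(scores)]

    return score_pairs


def rerank(query, passages, score_pairs, top_n=None):
    """Score every (query, passage) pair and return (passage, score) tuples, best first."""
    if not passages:
        return []
    pairs = [[query, p] for p in passages]
    scores = score_pairs(pairs)
    ranked = sorted(zip(passages, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def toy_overlap_scorer(pairs):
    """Stand-in scorer for demonstration: counts lowercase words shared by query and passage."""
    return [float(len(set(q.lower().split()) & set(p.lower().split()))) for q, p in pairs]


query = "what is a panda"
candidates = [
    "hi",
    "The giant panda is a bear species endemic to China",
    "Paris is the capital of France",
]
for passage, score in rerank(query, candidates, toy_overlap_scorer):
    print(score, passage)
```

To score with the actual model, pass `load_reranker_scorer()` in place of `toy_overlap_scorer`. Since a cross-encoder runs one forward pass per pair, it is usually applied only to the top-k candidates from a faster retriever.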