Bge Small En

Text Embeddings

The Bge Small En model is a compact text embedding model for retrieval-augmented language tasks such as passage retrieval and semantic similarity. It belongs to the BAAI General Embedding (BGE) family, which has achieved state-of-the-art performance on the MTEB and C-MTEB benchmarks. With competitive accuracy at a small size, it is a strong choice for applications where speed and efficiency are crucial. This variant targets English text (the BGE family also includes Chinese models), and it can be fine-tuned for specific use cases.

Published by BAAI under the MIT license.


Model Overview

The BAAI General Embedding (BGE) models are text embedding models designed for retrieval-augmented language models. They are developed by BAAI (Beijing Academy of Artificial Intelligence) and have achieved state-of-the-art performance on various benchmarks.

What makes BGE special?

  • Unified embedding model: BGE supports diverse retrieval augmentation needs for language models.
  • Improved similarity distribution: The latest version (v1.5) has a more reasonable similarity distribution, making it more effective for retrieval tasks.
  • Cross-encoder reranker: BGE has a powerful reranker model that can re-rank top-k documents retrieved by other models, leading to more accurate results.
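
To make the reranking step concrete, here is a minimal sketch using the FlagEmbedding package with the BAAI/bge-reranker-large checkpoint (one option among several; treat the checkpoint choice as an assumption):

from FlagEmbedding import FlagReranker

# Cross-encoder reranker; use_fp16 trades a little accuracy for speed.
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

query = "what is panda?"
candidates = [
    "The giant panda is a bear species endemic to China.",
    "How to fine-tune bge embedding model?",
]

# compute_score takes [query, passage] pairs and returns unbounded relevance scores.
scores = reranker.compute_score([[query, passage] for passage in candidates])

# Re-rank first-stage candidates by reranker score, highest first.
for passage, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {passage}")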

Capabilities

The BAAI General Embedding models are capable of text retrieval and semantic similarity tasks. They achieve state-of-the-art performance on both MTEB and C-MTEB benchmarks.

Primary Tasks

  • Text Retrieval: The models can be used to retrieve relevant passages from a large corpus of text based on a given query.
  • Semantic Similarity: The models can be used to measure the similarity between two pieces of text.
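
As a minimal sketch of the semantic-similarity use case (assuming the sentence-transformers package and the bge-small-en-v1.5 checkpoint):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
# For sentence-to-sentence similarity, no query instruction is needed.
emb = model.encode(["The cat sat on the mat.", "A cat is sitting on a mat."],
                   normalize_embeddings=True)
# With unit-normalized embeddings, the inner product is the cosine similarity.
print(float(emb[0] @ emb[1]))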

Strengths

  • High Performance: The models achieve state-of-the-art performance on both MTEB and C-MTEB benchmarks.
  • Efficient: The models are designed to be efficient and can be used for large-scale text retrieval and semantic similarity tasks.

Unique Features

  • Retrieval-Augmented LLMs: The models are designed to supply retrieved context to retrieval-augmented language models, letting an LLM ground its answers in relevant passages from a large corpus.
  • Dense Retrieval: The models embed queries and passages into a shared vector space, so relevant passages can be found efficiently via nearest-neighbor search, as sketched below.
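
For large corpora, the dense embeddings are typically placed in a nearest-neighbor index. A sketch using faiss (the library choice is an assumption; any vector store works):

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
passages = [
    "The giant panda is a bear species endemic to China.",
    "BGE models embed queries and passages into one vector space.",
]
emb = model.encode(passages, normalize_embeddings=True).astype(np.float32)

# Inner-product index; with normalized vectors this is cosine similarity.
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

query = "Represent this sentence for searching relevant passages: what is panda?"
q = model.encode([query], normalize_embeddings=True).astype(np.float32)
scores, ids = index.search(q, 2)
print(scores, ids)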

Model Variants

The BAAI General Embedding models come in different variants, including:

  • BAAI/bge-large-en-v1.5: A large English model (1024-dimensional embeddings) with the highest performance on text retrieval and semantic similarity tasks.
  • BAAI/bge-base-en-v1.5: A base-size English model (768-dimensional embeddings) with competitive performance.
  • BAAI/bge-small-en-v1.5: A small English model (384-dimensional embeddings) that trades a small amount of accuracy for speed and a smaller footprint.
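
If embedding size matters for your vector store, a quick way to confirm each variant's dimensionality (the sizes of 384, 768, and 1024 stated above can be verified this way):

from sentence_transformers import SentenceTransformer

# Downloads each checkpoint on first run.
for name in ['BAAI/bge-small-en-v1.5', 'BAAI/bge-base-en-v1.5', 'BAAI/bge-large-en-v1.5']:
    model = SentenceTransformer(name)
    print(name, model.get_sentence_embedding_dimension())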

Performance

BAAI/bge-small-en-v1.5 performs strongly for its size across MTEB tasks, including retrieval, semantic similarity, and text classification, and it scales well to large datasets.

Speed

Because it is the smallest English BGE variant, the model encodes text quickly, which is especially useful when embedding large corpora or serving latency-sensitive queries.

Accuracy

BAAI/bge-small-en-v1.5 achieves high accuracy, outperforming other models of similar size. This is attributed to its architecture and training methods.

Efficiency

The model’s efficiency is notable, requiring fewer computational resources compared to other models. This makes it an excellent choice for applications with limited resources.

Comparison to Other Models

Compared to other compact embedding models, BAAI/bge-small-en-v1.5 stands out for well-rounded performance: while alternatives may excel in a single dimension, it rates highly on speed, accuracy, and efficiency at once.

Model                    | Speed  | Accuracy | Efficiency
BAAI/bge-small-en-v1.5   | High   | High     | High
Other compact models     | Medium | Medium   | Medium

Limitations

The model has several limitations that are important to consider when using it for your tasks.

Similarity Distribution

The similarity distribution of the model is not ideal: cosine scores are compressed into a narrow, high range, so the score between two dissimilar sentences can exceed 0.5. Relying on absolute score thresholds can therefore produce incorrect results in downstream tasks such as passage retrieval or semantic similarity; relative ranking is unaffected.

Query Instruction

The model requires a query instruction for certain tasks, such as retrieving long passages with short queries, while no instruction is needed for the passages themselves. This asymmetry can be confusing and may lead to incorrect usage.

Retrieval Ability

Without instructions, the model's retrieval ability is reduced. The BGE v1.5 models alleviate this issue, but the trade-off between retrieval quality and instruction usage is still worth considering.

Fine-tuning

Fine-tuning the model requires careful handling of hard negatives and the contrastive learning objective. If not done correctly, the fine-tuned model may not achieve optimal performance.
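
For orientation, this is a minimal sketch of the in-batch-negatives contrastive (InfoNCE) objective commonly used for embedding fine-tuning; it illustrates the idea rather than the exact FlagEmbedding training code, and the temperature value is an assumption:

import torch
import torch.nn.functional as F

def in_batch_infonce(query_emb, passage_emb, temperature=0.02):
    # Each query's positive is the passage at the same index;
    # every other passage in the batch serves as a negative.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = (q @ p.T) / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Mined hard negatives would be encoded and appended as extra columns of the logits.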

Examples
  • Query with retrieval instruction: "Represent this sentence for searching relevant passages: How to fine-tune bge embedding model?" → similarity score 0.847.
  • Reranker relevance score (higher means more relevant): query "what is panda?" against the passage "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China." → 0.97.
  • Two dissimilar sentences, "How to fine-tune bge embedding model?" and "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.", still score 0.55, illustrating the similarity-distribution limitation above.

Format

BAAI General Embedding utilizes a transformer architecture and accepts input in the form of tokenized text sequences.

Model Architecture

The model is a unified embedding model designed to support diverse retrieval augmentation needs for Large Language Models (LLMs). It’s a dense retrieval model that can be used for tasks such as passage retrieval or semantic similarity.

Supported Data Formats

The model supports text input in the form of tokenized sequences. For a retrieval task that uses short queries to find long related documents, it’s recommended to add instructions for these short queries.

Input Requirements

  • For a retrieval task that uses short queries to find long related documents, add instructions for these short queries.
  • No instruction is needed for passages.
  • The model can be used with or without instructions for queries, but using instructions can improve retrieval performance.
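
A minimal sketch of this asymmetric usage (the English instruction string below is the one BGE documents for its English models):

from sentence_transformers import SentenceTransformer

INSTRUCTION = "Represent this sentence for searching relevant passages: "

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
queries = ["what is panda?"]
passages = ["The giant panda is a bear species endemic to China."]

# Prefix the instruction to short queries only; passages are encoded as-is.
q_emb = model.encode([INSTRUCTION + q for q in queries], normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(q_emb @ p_emb.T)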

Output

The embedding model outputs a dense vector for each input sequence; the similarity between two sequences is the inner product of their embeddings. The reranker outputs a relevance score that is not bounded to a specific range and can be used to rank the relevance of different passages.

Special Requirements

  • The reranker is optimized with a cross-entropy loss, so its relevance score is a raw logit and is not bounded to a specific range.
  • To balance accuracy and time cost, a cross-encoder is widely used to re-rank the top-k documents retrieved by simpler (e.g., embedding-based) models.
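
Because the reranker score is a logit, a sigmoid is one common way to map it into (0, 1) when a bounded score is needed for display or thresholding; it is monotonic, so ranking order is unchanged (a sketch, not part of the official API):

import math

def to_unit_interval(raw_score: float) -> float:
    # Sigmoid maps an unbounded logit to (0, 1) without changing rank order.
    return 1.0 / (1.0 + math.exp(-raw_score))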

Code Examples

Using FlagEmbedding:

from FlagEmbedding import FlagModel

sentences_1 = ["sample data 1", "sample data 2"]
sentences_2 = ["sample data 3", "sample data 4"]
# The instruction is prepended to queries by model.encode_queries();
# use_fp16 speeds up inference at a slight cost in accuracy.
model = FlagModel('BAAI/bge-small-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
                  use_fp16=True)
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
# Inner product of the embeddings gives the similarity matrix.
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

Using Sentence-Transformers:

from sentence_transformers import SentenceTransformer

sentences_1 = ["sample data 1", "sample data 2"]
sentences_2 = ["sample data 3", "sample data 4"]
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
# normalize_embeddings=True makes the inner product equal to cosine similarity.
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

Using Langchain:

from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en-v1.5"
model_kwargs = {'device': 'cuda'}  # use 'cpu' if no GPU is available
encode_kwargs = {'normalize_embeddings': True}
# query_instruction is prepended automatically when embed_query() is called.
model = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    query_instruction="Represent this sentence for searching relevant passages: ",
)

Using HuggingFace Transformers:

from transformers import AutoTokenizer, AutoModel
import torch

sentences = ["sample data 1", "sample data 2"]
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-small-en-v1.5')
model.eval()

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
# Use the [CLS] token's last hidden state as the sentence embedding.
sentence_embeddings = model_output[0][:, 0]
# L2-normalize so inner products equal cosine similarities.
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)

Dataloop's AI Development Platform

Build end-to-end workflows

Dataloop is a complete AI development stack that lets data, elements, models, and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.
Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAIF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.
Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.