BGE Small En
The BGE Small En model is a compact embedding model for retrieval-augmented language tasks such as passage retrieval and semantic similarity. It is part of the BAAI General Embedding (BGE) family, which has achieved state-of-the-art results on the MTEB and C-MTEB benchmarks. With competitive accuracy at a small size, bge-small-en is a strong choice when speed and resource efficiency matter, and it can be fine-tuned for specific use cases.
Model Overview
The BAAI General Embedding (BGE) models are embedding models built to serve retrieval-augmented language models. Developed by the BAAI team, they have achieved state-of-the-art performance on the MTEB and C-MTEB benchmarks.
What makes BGE special?
- Unified embedding model: BGE supports diverse retrieval augmentation needs for language models.
- Improved similarity distribution: The latest version (v1.5) has a more reasonable similarity distribution, making it more effective for retrieval tasks.
- Cross-encoder reranker: BGE also provides reranker models that re-score the top-k documents retrieved by an embedding model, leading to more accurate final rankings (see the sketch after this list).
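As a minimal sketch of how the reranker is typically used (assuming the FlagEmbedding package; the checkpoint name and texts are illustrative, adjust to your setup):

```python
from FlagEmbedding import FlagReranker

# Load a BGE reranker; use_fp16 trades a little precision for speed on GPU.
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

# Score a (query, passage) pair; a higher score means more relevant.
score = reranker.compute_score(['what is dense retrieval?',
                                'Dense retrieval encodes queries and documents as vectors.'])
print(score)
```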
Capabilities
The BAAI General Embedding models are capable of text retrieval and semantic similarity tasks. They achieve state-of-the-art performance on both MTEB and C-MTEB benchmarks.
Primary Tasks
- Text Retrieval: The models can be used to retrieve relevant passages from a large corpus of text based on a given query.
- Semantic Similarity: The models can be used to measure how similar two pieces of text are (a minimal sketch follows this list).
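A minimal sketch of the semantic-similarity use case with sentence-transformers (the sentences are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# Normalized embeddings make the dot product equal to cosine similarity.
emb = model.encode(["A cat sits on the mat.",
                    "A kitten is resting on a rug."], normalize_embeddings=True)
print(util.cos_sim(emb[0], emb[1]))  # a value close to 1 means semantically similar
```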
Strengths
- High Performance: The models achieve state-of-the-art performance on both MTEB and C-MTEB benchmarks.
- Efficient: The models are designed to be efficient and can be used for large-scale text retrieval and semantic similarity tasks.
Unique Features
- Retrieval-Augmented LLMs: The models are designed to supply the retrieval step in retrieval-augmented generation, fetching passages that ground an LLM's answers in a corpus.
- Dense Retrieval: Queries and passages are encoded into a shared vector space, so relevant passages can be found efficiently with nearest-neighbor search (see the sketch after this list).
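One common way to serve dense retrieval at scale is a nearest-neighbor index. A hedged sketch with FAISS (assuming faiss-cpu and sentence-transformers are installed; the index choice, corpus, and query are illustrative):

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
corpus = ["Dense retrieval encodes text as vectors.",
          "BM25 is a classic sparse retrieval method.",
          "Embeddings can be indexed for nearest-neighbor search."]

# Normalized embeddings + inner-product index = cosine-similarity search.
doc_embs = model.encode(corpus, normalize_embeddings=True)
index = faiss.IndexFlatIP(doc_embs.shape[1])
index.add(doc_embs)

query_emb = model.encode(["how does dense retrieval work?"], normalize_embeddings=True)
scores, ids = index.search(query_emb, 2)  # top-2 passages
print([(corpus[i], float(s)) for i, s in zip(ids[0], scores[0])])
```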
Model Variants
The BAAI General Embedding models come in different variants, including:
- BAAI/bge-large-en-v1.5: The largest English model (1024-dimensional embeddings), with the highest performance on text retrieval and semantic similarity tasks.
- BAAI/bge-base-en-v1.5: A mid-sized English model (768-dimensional embeddings) with competitive performance.
- BAAI/bge-small-en-v1.5: A compact English model (384-dimensional embeddings) that trades a little accuracy for speed and memory savings.
Performance
BAAI/bge-small-en-v1.5 delivers strong retrieval and semantic-similarity accuracy for its size, and its small footprint makes it practical for processing large-scale datasets.
Speed
As the smallest model in the family, bge-small-en-v1.5 encodes text faster than the base and large variants, which matters for latency-sensitive applications and for embedding large corpora.
Accuracy
For its parameter count, bge-small-en-v1.5 scores competitively on MTEB, approaching larger embedding models while remaining far cheaper to run.
Efficiency
The model requires less memory and compute than the base and large variants, making it a good choice for resource-constrained deployments.
Comparison to Other Models
Compared to other small embedding models, BAAI/bge-small-en-v1.5 stands out for its balance of speed, accuracy, and efficiency. While some alternatives may excel in a single area, bge-small-en-v1.5 offers well-rounded performance.
| Model | Speed | Accuracy | Efficiency |
|---|---|---|---|
| BAAI/bge-small-en-v1.5 | High | High | High |
| Other small embedding models | Medium | Medium | Medium |
Limitations
The model has several limitations that are important to consider before adopting it for your tasks.
Similarity Distribution
Because the model is trained with contrastive learning at a low temperature, cosine similarity scores cluster in a narrow high range: even two dissimilar sentences can score above 0.5. The absolute value therefore matters less than the relative ranking. The v1.5 models improve the distribution, but downstream thresholds should still be tuned on your own data rather than assumed.
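A minimal sketch of why ranking should be preferred over a fixed threshold (the scores below are hypothetical, not measured):

```python
import numpy as np

# Hypothetical cosine scores from a BGE model for one query against three passages.
# Note that even the irrelevant passage scores above 0.5.
scores = np.array([0.82, 0.74, 0.58])  # [relevant, related, irrelevant]

# Prefer relative ranking (argsort) over an absolute cutoff like score > 0.5.
ranked = np.argsort(-scores)
print(ranked)  # [0 1 2]: the ordering is reliable even when raw values are inflated
```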
Query Instruction
For short-query-to-long-passage retrieval, a query instruction should be prepended to queries, while passages are encoded without it. This asymmetry is easy to get wrong: forgetting the instruction on queries, or adding it to passages, degrades retrieval quality. The code examples below show both manual prefixing (Sentence-Transformers) and automatic handling (FlagModel.encode_queries).
Retrieval Ability
Without the instruction, retrieval quality drops for short queries. The BGE v1.5 models alleviate this issue, but the trade-off between retrieval quality and instruction usage is still worth measuring on your own data.
Fine-tuning
Fine-tuning the model requires careful mining of hard negatives and a contrastive learning objective; poorly chosen negatives can leave the fine-tuned model below baseline performance. A sketch of the expected training-data format follows.
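FlagEmbedding's fine-tuning pipeline expects JSON-lines training data with a query, positive passages, and mined hard negatives. A hedged sketch of one record (field names follow the FlagEmbedding documentation; the content is illustrative):

```python
import json

# One training example: a query, relevant passages (pos), and hard negatives (neg).
example = {
    "query": "what is dense retrieval?",
    "pos": ["Dense retrieval encodes queries and documents as vectors."],
    "neg": ["BM25 ranks documents with sparse lexical features."],  # hard negatives
}
with open("train.jsonl", "a") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```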
Format
BAAI General Embedding models use a BERT-style transformer encoder and accept tokenized text sequences as input.
Model Architecture
The model is a unified embedding model designed to support diverse retrieval augmentation needs for Large Language Models (LLMs). It’s a dense retrieval model that can be used for tasks such as passage retrieval or semantic similarity.
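A quick way to inspect what a given variant produces (assuming sentence-transformers; for bge-small-en-v1.5 the embedding dimension should be 384):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-small-en-v1.5')
print(model.get_sentence_embedding_dimension())  # 384 for the small English variant
print(model.max_seq_length)                      # maximum input length in tokens
```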
Supported Data Formats
The model accepts text input as tokenized sequences; whether an instruction should be prepended depends on the task, as summarized below.
Input Requirements
- For a retrieval task that uses short queries to find long related documents, add instructions for these short queries.
- No instruction is needed for passages.
- The model can be used with or without instructions for queries, but using instructions can improve retrieval performance.
Output
The embedding model outputs a dense vector for each input sequence; similarity between two texts is computed as the inner product (or cosine similarity) of their embeddings. The cross-encoder reranker, by contrast, outputs a relevance score directly for a (query, passage) pair. That score is not bounded to a specific range and should be used to rank passages rather than interpreted on an absolute scale.
Special Requirements
- The reranker is optimized with cross-entropy loss, so its relevance score is not bounded to a specific range; apply a sigmoid if a value in [0, 1] is needed (see the sketch after this list).
- To balance accuracy and latency, a cross-encoder is typically used only to re-rank the top-k documents returned by a fast embedding-based retriever.
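A hedged sketch of re-ranking several candidates and squashing the unbounded scores with a sigmoid (assuming the FlagEmbedding package; the checkpoint name and texts are illustrative):

```python
import math

from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)

query = "what is dense retrieval?"
candidates = ["Dense retrieval encodes queries and documents as vectors.",
              "The weather is sunny today."]

# Raw scores are unbounded logits; rank by them, and apply a sigmoid only
# if a probability-like value in [0, 1] is needed downstream.
scores = reranker.compute_score([[query, c] for c in candidates])
for c, s in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{1 / (1 + math.exp(-s)):.3f}  {c}")
```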
Code Examples
Using FlagEmbedding (the upstream README demonstrates the Chinese models; here the English variant this card covers is used, with the English query instruction):

```python
from FlagEmbedding import FlagModel

sentences_1 = ["sample sentence 1", "sample sentence 2"]
sentences_2 = ["sample sentence 3", "sample sentence 4"]

# For English BGE models the recommended query instruction is:
# "Represent this sentence for searching relevant passages: "
model = FlagModel('BAAI/bge-small-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
                  use_fp16=True)  # fp16 speeds up encoding with a small accuracy cost

embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)

# FlagModel returns normalized embeddings, so the inner product is cosine similarity.
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
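For short-query-to-passage retrieval, FlagModel also provides encode_queries, which prepends the configured instruction to each query automatically, while passages use plain encode. A brief sketch, continuing with the model object from the example above (the sentences are illustrative):

```python
queries = ["what is dense retrieval?"]
passages = ["Dense retrieval encodes queries and documents as vectors.",
            "BM25 ranks documents with sparse lexical features."]

q_embeddings = model.encode_queries(queries)  # instruction added automatically
p_embeddings = model.encode(passages)         # passages need no instruction
scores = q_embeddings @ p_embeddings.T
print(scores)
```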
Using Sentence-Transformers:

```python
from sentence_transformers import SentenceTransformer

sentences_1 = ["sample sentence 1", "sample sentence 2"]
sentences_2 = ["sample sentence 3", "sample sentence 4"]

model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# Normalize so that the inner product below equals cosine similarity.
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
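For short-query-to-passage retrieval with Sentence-Transformers, prepend the instruction to queries manually and encode passages without it, following the pattern in the upstream BGE README (continuing with the model above; the sentences are illustrative):

```python
queries = ["what is dense retrieval?"]
passages = ["Dense retrieval encodes queries and documents as vectors."]
instruction = "Represent this sentence for searching relevant passages: "

# Only queries get the instruction; passages are encoded as-is.
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
print(scores)
```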
Using LangChain (the English instruction replaces the Chinese one, which applies only to the zh models):

```python
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en-v1.5"
model_kwargs = {'device': 'cuda'}  # use 'cpu' if no GPU is available
encode_kwargs = {'normalize_embeddings': True}

model = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    query_instruction="Represent this sentence for searching relevant passages: ",
)
```
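In LangChain, embed_query applies the query instruction while embed_documents does not, matching the query/passage asymmetry described above. A brief usage sketch (the texts are illustrative):

```python
query_embedding = model.embed_query("what is dense retrieval?")  # instruction prepended
doc_embeddings = model.embed_documents(["Dense retrieval encodes text as vectors."])
print(len(query_embedding), len(doc_embeddings[0]))  # both 384 for bge-small-en-v1.5
```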
Using Hugging Face Transformers:

```python
import torch
from transformers import AutoTokenizer, AutoModel

sentences = ["sample sentence 1", "sample sentence 2"]

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-small-en-v1.5')
model.eval()

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
    # BGE uses the [CLS] token's last hidden state as the sentence embedding.
    sentence_embeddings = model_output[0][:, 0]

# L2-normalize so that inner products equal cosine similarities.
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```