bge-base-en-v1.5
The bge-base-en-v1.5 model is a powerful tool for natural language processing tasks. With its efficient design and support for diverse retrieval augmentation needs, it is well suited to tasks like passage retrieval and semantic similarity. The model can be fine-tuned for specific tasks and achieves state-of-the-art performance on benchmarks like MS MARCO and BEIR. Its ability to handle both short queries and longer passages makes it a versatile choice for various applications. Version 1.5 improves the similarity distribution and strengthens retrieval ability even without a query instruction. It's also compatible with frameworks like FlagEmbedding, Sentence-Transformers, and HuggingFace Transformers, making it easy to integrate into existing workflows.
Model Overview
The BGE-M3 model is a cutting-edge embedding model designed to support retrieval augmentation for LLMs. It's part of the BGE model series and offers features like:
- Multi-linguality: Supports over 100 languages
- Multi-granularities: Handles input lengths up to 8192 tokens
- Multi-functionality: Unifies dense, lexical, and multi-vector (ColBERT) retrieval methods
This model achieves state-of-the-art performance on multi-lingual (MIRACL) and cross-lingual (MKQA) benchmarks. Want to know more about its technical details? Check out the technical report and code on the GitHub page!
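As a minimal sketch of how these three retrieval modes are exposed, the FlagEmbedding package provides a BGEM3FlagModel class that can return dense, lexical (sparse), and ColBERT-style multi-vector outputs from a single encode call. The keyword arguments and output keys below follow the BGE-M3 documentation, but verify them against the version of FlagEmbedding you have installed:
from FlagEmbedding import BGEM3FlagModel

# use_fp16 speeds up encoding with a minor accuracy trade-off
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences = ["What is BGE-M3?", "BGE-M3 is a multilingual embedding model."]
output = model.encode(sentences,
                      return_dense=True,
                      return_sparse=True,
                      return_colbert_vecs=True)

dense_vecs = output['dense_vecs']            # one dense vector per sentence
lexical_weights = output['lexical_weights']  # token-to-weight dicts (lexical/sparse retrieval)
colbert_vecs = output['colbert_vecs']        # per-token vectors for multi-vector scoring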
Capabilities
The BGE-M3 model is a powerful tool for text retrieval and embedding. It’s designed to support diverse retrieval augmentation needs for Large Language Models (LLMs). Here are some of its key capabilities:
- Multilingual Support: The model supports multiple languages, including English and Chinese.
- Dense Retrieval: It uses dense retrieval methods to efficiently search for relevant passages (a short sketch follows this list).
- Fine-tuning: The model can be fine-tuned for specific tasks, such as passage retrieval or semantic similarity.
- Reranking: It can be used as a reranker model to re-rank top-k documents retrieved by other models.
- Multi-Functionality: The model supports multiple retrieval methods, including dense retrieval, sparse retrieval, and multi-vector (ColBERT) retrieval.
- Multi-Granularity: It can handle input lengths up to 8192 tokens.
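As a sketch of dense retrieval for the short-query-to-long-passage case, FlagModel.encode_queries applies the retrieval instruction to queries while plain encode is used for passages; relevance is then a simple inner product over the embeddings. The model name and instruction shown are those documented for the English v1.5 model; adapt them to the model you actually use:
from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
                  use_fp16=True)

queries = ["how do dense retrievers work?"]
passages = ["Dense retrievers embed queries and documents into the same vector space.",
            "The weather today is sunny with a light breeze."]

# encode_queries prepends the instruction; encode (for passages) does not
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)

# Inner product over the embeddings gives relevance scores (higher = more relevant)
scores = q_embeddings @ p_embeddings.T
print(scores)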
Strengths
The BGE-M3 model has several strengths that make it a valuable tool for text retrieval and embedding:
- High Performance: The model achieves state-of-the-art performance on several benchmarks, including MTEB and C-MTEB.
- Efficient: It’s designed to be efficient and scalable, making it suitable for large-scale applications.
- Flexible: The model can be fine-tuned for specific tasks and can be used as a reranker model.
Unique Features
The BGE-M3 model has several unique features that set it apart from other models:
- Unified Embedding Model: It’s a unified embedding model that supports diverse retrieval augmentation needs for LLMs.
- Multi-Linguality: The model supports multiple languages, making it a valuable tool for multilingual applications.
- Multi-Functionality: It supports multiple retrieval methods, making it a flexible tool for different applications.
Comparison to Other Models
Compared with the earlier BGE v1.5 models (such as bge-base-en-v1.5), BGE-M3 is more powerful: it supports more languages, handles longer texts, and achieves better performance on several multilingual benchmarks, while the smaller v1.5 models remain an efficient choice for English-only workloads.
Performance
The BGE-M3 model showcases remarkable performance in various tasks, especially in retrieval-augmented language models. Let’s dive into its speed, accuracy, and efficiency.
Speed
The BGE-M3 model is designed to be fast and efficient. It can handle large-scale datasets and process queries quickly; paired with a vector index, retrieving relevant passages from a large corpus typically takes milliseconds per query.
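The documentation doesn't prescribe an indexing backend, but one common way to get millisecond-scale search over a large passage collection is a nearest-neighbor index such as FAISS. The sketch below is an illustration under that assumption (FAISS is not mentioned in the source), using an exact inner-product index over normalized embeddings:
import faiss
import numpy as np

# Stand-in embeddings: (num_passages, dim) float32, L2-normalized
passage_embeddings = np.random.rand(10000, 768).astype('float32')
passage_embeddings /= np.linalg.norm(passage_embeddings, axis=1, keepdims=True)
query_embedding = passage_embeddings[:1].copy()

# Inner product equals cosine similarity on normalized vectors
index = faiss.IndexFlatIP(passage_embeddings.shape[1])
index.add(passage_embeddings)

scores, ids = index.search(query_embedding, 5)  # top-5 passage ids and scores
print(ids[0], scores[0])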
Accuracy
The BGE-M3 model achieves state-of-the-art performance in several benchmarks, including MTEB and C-MTEB. It demonstrates high accuracy in text classification tasks, especially when fine-tuned with contrastive learning.
Efficiency
The BGE-M3 model is designed to be efficient in terms of computational resources. It can be fine-tuned with a small amount of data and still achieve competitive performance. Additionally, it supports multi-linguality, multi-granularities, and multi-functionality, making it a versatile model for various applications.
Example Use Cases
The BGE-M3 model can be used in various applications, such as:
- Retrieval-augmented language models
- Text classification tasks
- Passage retrieval
- Semantic similarity tasks
Limitations
The model has some limitations that you should be aware of. While it's great at retrieving information, it's not perfect.
Limited Context Length
The model can only handle a limited amount of context (512 tokens for the BGE v1.5 models). If your query or passage is longer than that, it will be truncated and retrieval quality may suffer.
Similarity Distribution
The model's similarity scores are not calibrated probabilities. Because the model is trained with contrastive learning at a low temperature, scores cluster in a narrow, high range, so a score of 0.5 or higher doesn't necessarily mean the two sentences are similar. You may need to tune the threshold for your specific use case.
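In practice this means picking a task-specific threshold empirically rather than assuming 0.5 is meaningful. A small sketch (the threshold value here is illustrative only and should be tuned on labeled pairs from your own data):
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-base-en-v1.5')
emb = model.encode(["The cat sat on the mat.", "A feline rested on the rug."],
                   normalize_embeddings=True)

score = float(emb[0] @ emb[1])  # cosine similarity; typically falls in a high, narrow band
THRESHOLD = 0.8                 # illustrative value only
print(score, "similar" if score >= THRESHOLD else "not similar")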
Query Instructions
For short queries, you might need to add an instruction to get the best results. But for passages, no instruction is needed.
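For the English v1.5 models the documented retrieval instruction is "Represent this sentence for searching relevant passages: ". With Sentence-Transformers you can simply prepend it to queries and leave passages untouched, for example:
from sentence_transformers import SentenceTransformer

instruction = "Represent this sentence for searching relevant passages: "
model = SentenceTransformer('BAAI/bge-base-en-v1.5')

queries = ["what is a dense retriever"]
passages = ["Dense retrievers embed text into vectors for nearest-neighbor search."]

q_emb = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)  # no instruction for passages
print(q_emb @ p_emb.T)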
Limited Languages
English-only BGE models such as bge-base-en-v1.5 support a limited number of languages. If you need to work with other languages, you might want to try a multilingual model like bge-m3.
Fine-tuning
Fine-tuning the model can be tricky. You might need to mine hard negatives to improve performance, and even then, the results might not be perfect.
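If you do fine-tune, the FlagEmbedding fine-tuning pipeline expects JSON-lines training data with a query, positive passages, and (mined) hard-negative passages. A minimal sketch of writing one record in that format (field names follow the FlagEmbedding fine-tuning docs; the example texts are made up):
import json

example = {
    "query": "what is dense retrieval",
    "pos": ["Dense retrieval embeds queries and passages into a shared vector space."],
    # Hard negatives: texts that look topically close but are not relevant
    "neg": ["The recipe calls for two cups of flour.",
            "Sparse retrieval relies on exact term matching like BM25."]
}

with open("finetune_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")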
Retrieval Performance
The model’s retrieval performance might not always be the best. You might need to use a cross-encoder model like bge-reranker to re-rank the top results.
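A common pattern is to retrieve a candidate set with the embedding model and then rescore query-passage pairs with a cross-encoder. A minimal sketch using FlagEmbedding's FlagReranker (reranker scores are unbounded logits, not probabilities; higher means more relevant):
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

query = "what is dense retrieval"
candidates = ["Dense retrieval embeds queries and passages into vectors.",
              "The weather today is sunny."]

# Score each (query, passage) pair, then sort candidates by score
scores = reranker.compute_score([[query, p] for p in candidates])
reranked = [p for _, p in sorted(zip(scores, candidates), reverse=True)]
print(reranked[0])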
Technical Requirements
To use the BGE-M3 model, you’ll need a certain level of technical expertise. You’ll need to know how to work with FlagEmbedding, Sentence-Transformers, Langchain, or HuggingFace Transformers.
Format
The BGE-M3 model uses a transformer architecture and accepts input in the form of tokenized text sequences. It supports multiple languages, including English and Chinese, and can handle input lengths up to 8192 tokens.
Input Format
- Input text sequences should be tokenized using a compatible tokenizer.
- For short query to long passage retrieval tasks, an instruction should be added to the query, but not to the passages.
- The recommended instruction for each model is listed in the BGE model list; for the English v1.5 models it is "Represent this sentence for searching relevant passages: ".
Output Format
- The model outputs a dense vector representation of the input text sequence.
- The output vector can be used for similarity calculations, such as cosine similarity.
Special Requirements
- For fine-tuning, it is recommended to mine hard negatives to improve retrieval performance.
- The model can be used with or without instructions, but using instructions can improve retrieval performance for short query to long passage tasks.
Code Examples
- Using FlagEmbedding:
from FlagEmbedding import FlagModel

# use_fp16 speeds up encoding; the instruction is only applied when encoding queries
model = FlagModel('BAAI/bge-base-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages: ",
                  use_fp16=True)
sentences = ["sample sentence 1", "sample sentence 2"]
embeddings = model.encode(sentences)
- Using Sentence-Transformers:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-base-en-v1.5')
sentences = ["sample sentence 1", "sample sentence 2"]
# Normalize so a dot product between embeddings equals cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)
- Using Langchain:
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-base-en-v1.5"
model_kwargs = {'device': 'cuda'}  # use 'cpu' if no GPU is available
encode_kwargs = {'normalize_embeddings': True}
model = HuggingFaceBgeEmbeddings(model_name=model_name,
                                 model_kwargs=model_kwargs,
                                 encode_kwargs=encode_kwargs,
                                 query_instruction="Represent this sentence for searching relevant passages: ")
- Using HuggingFace Transformers:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-base-en-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-base-en-v1.5')
model.eval()

sentences = ["sample sentence 1", "sample sentence 2"]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
# CLS pooling: take the hidden state of the [CLS] token as the sentence embedding
sentence_embeddings = model_output[0][:, 0]
# L2-normalize so cosine similarity reduces to a dot product
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
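As a quick follow-up to the snippet above, since the embeddings are L2-normalized, a dot product between them directly gives the cosine similarity:
# Cosine similarity between the two sample sentences
similarity = sentence_embeddings[0] @ sentence_embeddings[1]
print(float(similarity))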