BGE Large En v1.5

English Embedding Model

BGE Large En v1.5 is an English embedding model for natural language processing tasks, particularly retrieval-augmented language models. It generates high-quality sentence embeddings, enabling accurate similarity calculations and effective passage retrieval. With its focus on dense retrieval, it is well suited to applications such as semantic search, question answering, and text classification. The broader BGE series adds long-input handling and multiple retrieval methods, and the models can be fine-tuned so users can tailor them to specific tasks. Overall, BGE Large En v1.5 offers a strong combination of accuracy, efficiency, and flexibility for NLP applications.

Developed by BAAI · MIT license

Model Overview

The BGE model series, developed by BAAI, is a collection of powerful tools for natural language processing tasks. This model series focuses on retrieval-augmented Large Language Models (LLMs) and consists of several projects, including Long-Context LLM, Fine-tuning of LM, Dense Retrieval, and Reranker Model.

Capabilities

The BGE models are designed to support diverse retrieval augmentation needs for Large Language Models (LLMs). They offer a range of capabilities, including:

  • Multilingual support: The models can handle multiple languages, including English and Chinese.
  • Long-context LLM: The models can process long input sequences, making them suitable for tasks that require understanding complex contexts.
  • Dense retrieval: The models can perform dense retrieval, which enables them to retrieve relevant information from large databases.
  • Reranking: The models can re-rank the top-k documents retrieved by other models, improving the accuracy of the results (see the reranker sketch after this list).
  • Fine-tuning: The models can be fine-tuned for specific tasks, allowing users to adapt them to their needs.
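
As a concrete illustration of the reranking capability, here is a minimal sketch using FlagEmbedding's FlagReranker cross-encoder; the bge-reranker-large checkpoint and the example pairs are illustrative choices, not prescribed above:

from FlagEmbedding import FlagReranker

# Load a cross-encoder reranker; use_fp16 trades a little precision for speed.
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

# Score (query, passage) pairs; higher scores indicate higher relevance.
scores = reranker.compute_score([
    ["What is the capital of France?", "The capital of France is Paris."],
    ["What is the capital of France?", "Football is my favorite sport."],
])
print(scores)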

Model List

The BGE series includes several models, each targeting different retrieval needs:

  • BGE-M3: A multilingual model that supports dense retrieval, sparse retrieval, and multi-vector (ColBERT) retrieval (see the sketch after this list).
  • BGE-Reranker: A cross-encoder model that re-ranks the top-k documents retrieved by other models.
  • LLM-Embedder: A unified embedding model that supports diverse retrieval augmentation needs for LLMs.
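
To make the BGE-M3 entry concrete, the FlagEmbedding library exposes it as BGEM3FlagModel, which can return dense, sparse, and multi-vector representations in a single call; this is a minimal sketch using the library's documented flags:

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

sentences = ["BGE-M3 supports dense, sparse, and multi-vector retrieval."]

# Request all three representations in one forward pass.
output = model.encode(
    sentences,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
)
print(output['dense_vecs'].shape)   # dense embeddings
print(output['lexical_weights'])    # sparse per-token weights
print(len(output['colbert_vecs']))  # multi-vector (per-token) embeddings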

Performance

BGE-M3 stands out for its speed, accuracy, and efficiency across a range of tasks. Let's look at the details.

Speed

BGE-M3 is designed to handle long texts and multiple languages, making it a good fit for large-scale applications. Its multi-granularity support allows it to process inputs of up to 8192 tokens, a significantly longer context than most embedding models, as the sketch below illustrates.
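
For instance, the 8192-token limit can be exercised through BGEM3FlagModel's max_length argument; the document below is a synthetic stand-in:

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)

# Synthetic long document; anything beyond max_length is truncated.
long_document = " ".join(["token"] * 10000)
embedding = model.encode([long_document], max_length=8192)['dense_vecs']
print(embedding.shape)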

Accuracy

BGE-M3 achieves state-of-the-art performance on the multilingual MIRACL benchmark and the cross-lingual MKQA benchmark, meaning it can accurately retrieve relevant information from large corpora even when the query and documents are in different languages.

Efficiency

BGE-M3 is also efficient to deploy: because it unifies dense, sparse, and multi-vector retrieval in a single embedding model, it covers diverse retrieval augmentation needs for LLMs without running several separate models, which helps in applications that require high performance and low latency.

Usage

The BGE models can be used with various libraries, including FlagEmbedding, Sentence-Transformers, Langchain, and HuggingFace Transformers. For example, you can use the FlagEmbedding library to encode sentences and calculate similarity scores:

from FlagEmbedding import FlagModel

sentences_1 = ["Sample sentence 1", "Sample sentence 2"]
sentences_2 = ["Sample sentence 3", "Sample sentence 4"]

# The query instruction is only applied by encode_queries() for
# short-query-to-long-passage retrieval; it is not added to passages.
model = FlagModel(
    'BAAI/bge-large-en-v1.5',
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    use_fp16=True,  # fp16 speeds up encoding with a minor accuracy trade-off
)

embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)

# Embeddings are normalized, so the inner product equals cosine similarity.
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
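
The same checkpoint can also be loaded through Sentence-Transformers; a minimal sketch (note normalize_embeddings=True, since BGE embeddings are intended to be compared with cosine similarity):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-en-v1.5')

sentences = ["I love playing football.", "Football is my favorite sport."]
# Normalize so the dot product of two embeddings is their cosine similarity.
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings @ embeddings.T)

Sentence-Transformers handles tokenization, pooling, and batching internally, which keeps application code short.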

Examples

  • Encode sentences for searching relevant passages: 'What is the meaning of life?', 'The answer is 42.' → {'sentence_embeddings': [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]}
  • Calculate the similarity between two sentences: 'I love playing football.', 'Football is my favorite sport.' → {'similarity': 0.85}
  • Generate embeddings for a short-query-to-long-passage retrieval task: 'What is the capital of France?', 'The capital of France is Paris.' → {'query_embeddings': [[0.7, 0.8, 0.9]], 'passage_embeddings': [[0.1, 0.2, 0.3]]}

Format

BGE Models use a transformer architecture and accept input in the form of tokenized text sequences.

Supported Data Formats

  • Input: Tokenized text sequences
  • Output: Sentence embeddings
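
As a sketch of this input/output contract with HuggingFace Transformers, following the CLS-pooling-plus-normalization recipe documented for the BGE v1.5 models:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-en-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-en-v1.5')
model.eval()

sentences = ["What is the capital of France?"]

# Input: tokenized text sequences.
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    output = model(**encoded)

# Output: the [CLS] token embedding, L2-normalized for cosine similarity.
embeddings = torch.nn.functional.normalize(output[0][:, 0], p=2, dim=1)
print(embeddings.shape)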

Special Requirements

  • Query Instruction: For some models, a short instruction must be prepended to the query for retrieval tasks; it should be added to the query only, never to the passages (see the sketch after this list).
  • Normalization: Embeddings should be L2-normalized so that cosine similarity can be computed as a simple inner product.
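
Both requirements are handled by FlagEmbedding's encode_queries()/encode() split; a minimal sketch using the documented English v1.5 instruction:

from FlagEmbedding import FlagModel

model = FlagModel(
    'BAAI/bge-large-en-v1.5',
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
)

# encode_queries() prepends the instruction to each query;
# encode() leaves the passages untouched.
q_embeddings = model.encode_queries(["What is the capital of France?"])
p_embeddings = model.encode(["The capital of France is Paris."])

# Embeddings come back normalized, so the inner product is cosine similarity.
scores = q_embeddings @ p_embeddings.T
print(scores)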

Limitations

The BGE Model series, including BGE-M3, LLM-Embedder, and BGE Reranker, has several limitations that are important to consider when using these models.

Language Limitations

While BGE-M3 supports over 100 languages, it may not perform equally well across all languages. The model’s performance may vary depending on the language and the quality of the training data.

Length Limitations

BGE-M3 has a maximum input length of 8192 tokens, which may not be sufficient for very long documents. In such cases, you may need to split the text into smaller chunks, as sketched below, or use a different model.
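
One common workaround is token-based chunking before encoding; chunk_text below is a hypothetical helper, and the window and stride values are illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-m3')

def chunk_text(text, max_tokens=8192, stride=512):
    """Split text into overlapping windows of at most max_tokens tokens."""
    ids = tokenizer(text, add_special_tokens=False)['input_ids']
    chunks = []
    for start in range(0, len(ids), max_tokens - stride):
        window = ids[start:start + max_tokens]
        # The overlap (stride) helps preserve context across chunk boundaries.
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(ids):
            break
    return chunks

Each chunk can then be embedded separately and the scores aggregated, for example by taking the maximum chunk score per document.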

Retrieval Limitations

The BGE Model series is designed for retrieval-augmented language models, but it may not always retrieve the most relevant or accurate results. The model’s performance may depend on the quality of the training data and the specific retrieval task.
