Mxbai Embed Large V1

Sentence embeddings

Mxbai Embed Large V1 is a sentence embedding model from Mixedbread AI that supports tasks such as retrieval, classification, clustering, reranking, and summary evaluation. It reports state-of-the-art results for BERT-large sized models on the MTEB benchmark, with a higher average score than OpenAI's text-embedding-3-large and results comparable to models about 20 times its size. It generalizes well across domains, tasks, and text lengths. It also supports Matryoshka Representation Learning and binary quantization, which reduce memory usage and lower the cost of storing embeddings in a vector database.

Mixedbread AI · apache-2.0 · Updated 7 months ago

Model Overview

Mxbai Embed Large V1, developed by Mixedbread AI, converts text into numerical representations called embeddings. Texts with similar meanings map to nearby vectors, so their meaning can be compared numerically.

What can you do with this model?

  • Sentence Embeddings: Convert sentences into numerical representations that can be used for various NLP tasks.
  • Retrieval: Use the model to find relevant passages or documents that match a given query.
  • Matryoshka Representation Learning: Reduce the number of dimensions of an embedding to lower memory usage.
  • Binary Quantization: Transform float32 values into lower precision (int8 or binary) to reduce memory usage.

How does it work?

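The model maps each input text to a fixed-length vector. Two optional techniques then reduce how much space those vectors take: Matryoshka Representation Learning lets you keep only the leading dimensions of each vector, and quantization stores each remaining dimension as int8 or a single bit instead of float32. You can use the model from Python, JavaScript, and other programming languages.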
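The dimension-reduction idea can be sketched without loading the model. The snippet below is a minimal illustration, assuming that embeddings trained with Matryoshka Representation Learning carry most of their information in the leading dimensions, so a vector can be cut to its first few values and rescaled to unit length. The helper name `truncate_embedding` and the toy 8-dimensional vector are hypothetical and not part of any library.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]


# Toy 8-dimensional vector standing in for a real embedding (illustrative only).
full = [0.5, -0.5, 0.4, 0.3, 0.1, -0.1, 0.05, 0.02]
short = truncate_embedding(full, 4)
```

Rescaling after truncation keeps cosine and dot-product comparisons between shortened vectors on the same scale.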

Capabilities

Primary Tasks

This model is designed to produce high-quality sentence embeddings that can be used for a variety of tasks, including:

  • Retrieval: Find relevant passages or documents that match a given query.
  • Classification: Classify text into different categories or labels.
  • Clustering: Group similar text together based on their semantic meaning.
  • Pair Classification: Determine whether two pieces of text are similar or not.
  • Reranking: Reorder a list of text based on their relevance to a given query.
  • Summarization: Score how closely a candidate summary matches a reference text by comparing their embeddings. The model does not generate summaries itself.
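Most of the tasks above reduce to comparing embeddings. As a rough sketch of retrieval and reranking, the snippet below ranks documents by cosine similarity to a query. The 3-dimensional vectors are toy values standing in for real embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings (illustrative only).
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_documents(query, docs)
```

Here the second document points in nearly the same direction as the query, so it ranks first, and the orthogonal first document ranks last.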

Strengths

So, what makes this model so special? Here are a few of its strengths:

  • State-of-the-art performance: The model achieves SOTA performance for BERT-large sized models on the MTEB benchmark.
  • High accuracy: It outperforms commercial models like OpenAI’s text-embedding-3-large on the MTEB average and matches the performance of models 20x its size.
  • Good generalization: The model generalizes well across several domains, tasks, and text lengths.

Performance

Mxbai Embed Large V1 scores well across a range of tasks. But how does it really perform?

Speed

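The model is BERT-large sized, which makes it small next to multi-billion-parameter embedding models. For example, it can encode a batch of short sentences in a matter of milliseconds on suitable hardware. Actual throughput depends on the hardware, batch size, and input length.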

Accuracy

But what about accuracy? Mxbai Embed Large V1 outperforms many other models on the MTEB benchmark, including commercial ones like OpenAI’s text-embedding-3-large, which it beats on the overall average and on classification, although not on clustering. It even matches the performance of models 20 times its size, like echo-mistral-7b.

Efficiency

The model’s efficiency is also noteworthy. It uses a technique called Matryoshka Representation Learning (MRL) to reduce the number of dimensions of an embedding, making it more memory-efficient. Additionally, it supports binary quantization, which transforms the value of each dimension from a float32 to a lower precision (int8 or even binary). This combination of MRL and quantization allows for significant reductions in memory usage, leading to lower costs when using a vector database.
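To make the savings concrete, here is a back-of-the-envelope calculation. It assumes 1024-dimensional embeddings (an assumption, since this page does not state the output size) and a hypothetical collection of one million vectors, and it ignores index and metadata overhead.

```python
def bytes_per_vector(dims, bits_per_dim):
    """Storage for one embedding, ignoring any index or metadata overhead."""
    return dims * bits_per_dim // 8


# Assumption: 1024-dimensional embeddings; adjust to your model's output size.
FULL_DIMS = 1024
NUM_VECTORS = 1_000_000  # hypothetical collection size

configs = {
    "float32, 1024 dims": bytes_per_vector(FULL_DIMS, 32),
    "int8, 1024 dims": bytes_per_vector(FULL_DIMS, 8),
    "binary, 1024 dims": bytes_per_vector(FULL_DIMS, 1),
    "binary, 512 dims (MRL)": bytes_per_vector(512, 1),
}

for name, size in configs.items():
    total_mb = size * NUM_VECTORS / 1_000_000
    print(f"{name}: {size} bytes per vector, {total_mb:,.0f} MB per million vectors")
```

Under these assumptions, going from float32 to binary shrinks each vector by a factor of 32, and halving the dimensions as well brings the total reduction to 64x.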

Comparison with Other Models

Here’s a comparison of Mxbai Embed Large V1 with other models on MTEB:

| Model | Avg (56 datasets) | Classification (12 datasets) | Clustering (11 datasets) |
|---|---|---|---|
| mxbai-embed-large-v1 | 64.68 | 75.64 | 46.71 |
| OpenAI text-embedding-3-large | 64.58 | 75.45 | 49.01 |
| Cohere embed-english-v3.0 | 64.47 | 76.49 | 47.43 |
| jina-embeddings-v2-base-en | 60.38 | 73.45 | 41.73 |

As you can see, Mxbai Embed Large V1 has the highest average score in this comparison. It trails Cohere embed-english-v3.0 on classification and both commercial models on clustering.

Limitations

Mxbai Embed Large V1 is a powerful tool, but it’s not perfect. Let’s talk about some of its limitations.

Limited Domain Knowledge

While the model performs well on a wide range of tasks, it may not have the same level of expertise as a human in a specific domain. For example, it may miss the nuances of a particular industry or technical field.

Dependence on Training Data

The model is only as good as the data it was trained on. If the training data is biased or incomplete, the model may perform poorly on certain tasks or may even perpetuate existing biases.

Examples

  • Input: Represent this sentence for searching relevant passages: What is the capital of Australia?
    Output: Canberra is the capital of Australia.
  • Input: A man is eating a piece of bread
    Output: Similarity scores: 0.7919578577247139, 0.6369278664248345, 0.16512018371357193, 0.3620778366720027
  • Input: What is the memory footprint of embeddings when used at scale?
    Output: Embeddings have a high memory footprint when used at scale, but approaches like Matryoshka Representation Learning and binary quantization can reduce the memory usage significantly.

Format

Mxbai Embed Large V1 uses a transformer architecture. It takes plain text as input, which a tokenizer converts into token sequences, and returns one embedding per input text. The embeddings can be used for a range of tasks, such as retrieval and classification.

Architecture

The model is based on the transformer architecture, which is a type of neural network designed primarily for natural language processing tasks. It uses self-attention mechanisms to weigh the importance of different words in a sentence.
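As a rough illustration of self-attention, the sketch below implements scaled dot-product attention in plain Python. It is a simplification and not the model's actual implementation: queries, keys, and values are all the raw token vectors, whereas real transformer layers first apply learned projections and use several attention heads. The function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens.

    Learned projections and multiple heads are omitted to keep the sketch short.
    """
    scale = math.sqrt(len(tokens[0]))
    outputs = []
    for query in tokens:
        # One score per token: how strongly this query attends to each key.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        mixed = [
            sum(w * value[d] for w, value in zip(weights, tokens))
            for d in range(len(query))
        ]
        outputs.append(mixed)
    return outputs


# Three toy 2-dimensional token vectors (illustrative only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(tokens)
```

Each output vector is a weighted average of all token vectors, with larger weights going to tokens whose vectors align with the query token.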

Input Requirements

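To use the model, you need to provide input as tokenized text sequences, which the model's tokenizer produces from plain text. For retrieval tasks, you also need to put the prompt "Represent this sentence for searching relevant passages: " in front of the query text.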

Here’s an example of how to prepare input text in Python:

import torch
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("mixedbread-ai/mxbai-embed-large-v1")

# Define the input text
input_text = "A man is eating a piece of bread"

# Tokenize the input text
inputs = tokenizer(input_text, return_tensors="pt")
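For retrieval, the prompt can be added with a small helper before tokenizing or encoding. This is a minimal sketch: the prompt string is the one shown in the retrieval example on this page, the helper name `prepare_query` is hypothetical, and the choice to leave documents unprefixed is an assumption based on that example.

```python
# Prompt text taken from the retrieval example on this page.
RETRIEVAL_PROMPT = "Represent this sentence for searching relevant passages: "


def prepare_query(query):
    """Prefix a search query with the retrieval prompt."""
    return RETRIEVAL_PROMPT + query.strip()


query_text = prepare_query("What is the capital of Australia?")

# Assumption: documents are passed to the model unchanged, without the prompt.
documents = ["Canberra is the capital of Australia."]

print(query_text)
```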

Output Format

The model outputs embeddings, which are numerical representations of the input text. The embeddings can be used for various tasks, such as similarity search, clustering, and classification.

Here’s an example of how to compute similarity scores between two embeddings in Python:

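from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model (the weights are downloaded on first use)
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Encode the two sentences into embeddings
embedding1 = model.encode("A man is eating a piece of bread")
embedding2 = model.encode("A man is eating food")

# Compute the cosine similarity (returned as a 1x1 tensor)
similarity = cos_sim(embedding1, embedding2)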

Special Requirements

The model supports binary quantization and Matryoshka Representation Learning, which can be used to reduce the memory footprint of the embeddings.

Here’s an example of how to use binary quantization in Python:

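from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# Load the model (the weights are downloaded on first use)
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# Encode the sentences into float32 embeddings
embeddings = model.encode(["A man is eating a piece of bread", "A man is eating food"])

# Quantize to packed unsigned binary (8 dimensions per byte)
binary_embeddings = quantize_embeddings(embeddings, precision="ubinary")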
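Binary embeddings are usually compared by Hamming distance, the number of differing bits, rather than cosine similarity. The sketch below shows the idea in plain Python. It assumes each dimension becomes 1 when its value is positive and 0 otherwise, with 8 bits packed per byte. The function names and toy vectors are hypothetical and not part of sentence-transformers.

```python
def binarize(vector):
    """Map each value to one bit (1 if positive, else 0) and pack 8 bits per byte."""
    if len(vector) % 8 != 0:
        raise ValueError("vector length must be a multiple of 8")
    packed = bytearray()
    for start in range(0, len(vector), 8):
        byte = 0
        for value in vector[start:start + 8]:
            byte = (byte << 1) | (1 if value > 0 else 0)
        packed.append(byte)
    return bytes(packed)


def hamming_distance(a, b):
    """Count the bit positions where two equal-length packed vectors differ."""
    if len(a) != len(b):
        raise ValueError("packed vectors must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Toy 8-dimensional vectors standing in for real embeddings (illustrative only).
bread = binarize([0.3, -0.1, 0.8, 0.2, -0.5, 0.1, -0.2, 0.4])
food = binarize([0.2, -0.3, 0.7, -0.1, -0.4, 0.3, -0.1, 0.5])
distance = hamming_distance(bread, food)
```

A smaller distance means more similar texts. In this toy case the two vectors differ in sign at a single dimension, so the distance is 1.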