all-MiniLM-L6-v2

Sentence Embeddings

The all-MiniLM-L6-v2 model encodes sentences and short paragraphs, mapping them to a 384-dimensional dense vector space. It takes input text and outputs a vector that captures the semantic information of the input, which makes it useful for tasks like information retrieval, clustering, and sentence similarity. Trained with a contrastive learning objective and the AdamW optimizer, it delivers strong results on these tasks while remaining efficient: it handles input text of up to 256 word pieces (longer text is truncated) and produces dense vector representations that support fast, effective text analysis. Typical uses include clustering sentences by semantic meaning, scoring the similarity between pairs of sentences, and generating vector representations for downstream search or analysis.

Library: Sentence Transformers | License: apache-2.0

Model Overview

The all-MiniLM-L6-v2 model is a powerful tool for natural language processing tasks. It’s a sentence-transformers model that maps sentences and paragraphs to a 384-dimensional dense vector space. This means it can help with tasks like clustering or semantic search.
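For a quick sanity check of that claim, the following minimal sketch (assuming the sentence-transformers package is installed; the example sentence is arbitrary) encodes one sentence and prints the embedding size:

from sentence_transformers import SentenceTransformer

# Encode a single sentence and check that the embedding has 384 dimensions
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embedding = model.encode("The quick brown fox jumps over the lazy dog")
print(embedding.shape)  # (384,)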

Capabilities

The all-MiniLM-L6-v2 model is capable of:

  • Taking a sentence or paragraph as input and outputting a vector that captures its semantic information
  • Being used for information retrieval, clustering, or sentence similarity tasks
  • Handling input text up to 256 word pieces (longer text is truncated)

How it Works

The model uses a contrastive learning objective: given one sentence from a pair, it is trained to predict which of the candidate sentences in a batch was actually paired with it in the dataset. This lets the model learn rich sentence representations that transfer to a variety of tasks.
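As an illustration, here is a minimal sketch of a contrastive objective with in-batch negatives and cosine similarity; the function name and temperature value are assumptions for the example, not the exact training code:

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_embeddings, positive_embeddings, temperature=0.05):
    # Each row i of the two inputs is an embedding of one sentence from pair i.
    anchors = F.normalize(anchor_embeddings, dim=1)
    positives = F.normalize(positive_embeddings, dim=1)
    # Cosine similarity of every anchor against every candidate in the batch.
    scores = anchors @ positives.T / temperature
    # For row i, the true paired sentence is at column i; all other columns act as negatives.
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)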

Training and Performance

The model builds on a pre-trained MiniLM checkpoint and was fine-tuned on a dataset of over 1 billion sentence pairs using a self-supervised contrastive objective. Fine-tuning was run on a TPU v3-8 with a batch size of 1024 and a learning rate of 2e-5.

Training Data

The model was trained on a concatenation of multiple datasets, including:

Dataset | Number of Training Tuples
Reddit comments (2015-2018) | 726,484,430
S2ORC Citation pairs (Abstracts) | 116,288,806
WikiAnswers Duplicate question pairs | 77,427,422

Performance

The model performs well across a range of tasks, including:

  • Clustering similar sentences or paragraphs together
  • Searching for semantically similar text
  • Information retrieval
  • Sentence similarity tasks

Example Use Cases

Examples:

  • Sentence similarity: comparing 'I love playing football' and 'I enjoy watching football games' gives a similarity score of 0.85.
  • Clustering: 'The capital of France is Paris', 'The Eiffel Tower is in Paris', and 'Paris is the most romantic city' are grouped into a single cluster based on their semantic meaning.
  • Ranking: the top 3 sentences most similar to 'I am looking for a new job' from the list ['I need a new job', 'I want to change my career', 'I am searching for a new opportunity', 'I am happy with my current job'] are ['I need a new job', 'I am searching for a new opportunity', 'I want to change my career'].
  • Information retrieval: Use the model to retrieve relevant documents or sentences based on their semantic similarity.
  • Clustering: Group similar sentences or documents together using the model.
  • Sentence similarity: Measure the similarity between two sentences using the model (see the sketch after this list).
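A small sketch of the similarity and retrieval use cases with the sentence-transformers library (the sentences, the scores you will see, and the top_k value are illustrative):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Sentence similarity between a pair of sentences
emb = model.encode(["I love playing football", "I enjoy watching football games"])
print(util.cos_sim(emb[0], emb[1]))  # cosine similarity as a 1x1 tensor

# Simple semantic search: rank candidate sentences against a query
query = model.encode("I am looking for a new job")
corpus = ["I need a new job", "I want to change my career",
          "I am searching for a new opportunity", "I am happy with my current job"]
corpus_emb = model.encode(corpus)
hits = util.semantic_search(query, corpus_emb, top_k=3)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], hit['score'])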

Technical Details

  • Model architecture: Based on the MiniLM-L6-H384-uncased model
  • Training data: Trained on a concatenation of multiple datasets
  • Hyperparameters: Trained with a batch size of 1024, learning rate of 2e-5, and sequence length of 128 tokens

Alternatives

  • BERT
  • RoBERTa
  • Other sentence embedding models

Limitations

The model has some limitations, including:

  • Limited input length: By default, input text longer than 256 word pieces is truncated (see the snippet after this list).
  • Training data bias: The model was trained on a large dataset, but this dataset may still contain biases.
  • Lack of contextual understanding: The model may not always understand the context in which a sentence or paragraph is being used.
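With the sentence-transformers library, the truncation limit is exposed as the model's max_seq_length attribute; a short sketch (the value of 128 below is only an example):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
print(model.max_seq_length)  # word-piece limit used for truncation (256 for this model)
model.max_seq_length = 128   # e.g. lower it to match the 128-token training sequence length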

Getting Started

To use the all-MiniLM-L6-v2 model, you can install the sentence-transformers library and use the following code:

from sentence_transformers import SentenceTransformer

# Sentences to encode
sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model and compute one 384-dimensional embedding per sentence
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings)

Alternatively, you can use the transformers library directly; in that case you need to apply mean pooling to the token embeddings yourself:

from transformers import AutoTokenizer, AutoModel
import torch

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ["This is an example sentence", "Each sentence is converted"]
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Tokenize, run the model, and pool the token embeddings into sentence embeddings
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print(sentence_embeddings)
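The upstream usage example also L2-normalizes the pooled embeddings, which makes cosine similarity reduce to a dot product; if you want that behaviour, add one step:

import torch.nn.functional as F

# Normalize each sentence embedding to unit length
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)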