All MiniLM L12 V2

Sentence Embeddings

The all-MiniLM-L12-v2 model is a compact, efficient encoder for sentences and short paragraphs. It takes input text and outputs a vector that captures its semantic content, making it useful for tasks like information retrieval, clustering, and sentence similarity. What sets this model apart is its training: it was fine-tuned on over 1 billion sentence pairs with a self-supervised contrastive learning objective, so it learns from the relationships between sentences rather than from labeled data. The result is high-quality sentence embeddings that work across a wide range of applications. Note that while the model handles short paragraphs well, input longer than 256 word pieces is truncated by default.

Maintained by Sentence Transformers · License: apache-2.0


Model Overview

The all-MiniLM-L12-v2 model is a powerful tool for natural language processing tasks. It maps sentences and paragraphs to a 384-dimensional dense vector space, making it perfect for tasks like clustering or semantic search.

Capabilities

This model can:

  • Encode sentences into vectors
  • Perform clustering and semantic search
  • Encode short paragraphs as well as single sentences (input longer than 256 word pieces is truncated by default)

How it Works

The model uses a contrastive learning objective: given one sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in the dataset. This pushes the embeddings of true pairs together and unrelated sentences apart, which helps the model capture the semantic meaning of sentences.
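
To make the objective concrete, here is a minimal sketch of an in-batch-negatives contrastive loss of the kind described above; the toy batch, the 384-dimensional random vectors, and the scale factor are illustrative assumptions, not the exact training code:

import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb, positive_emb, scale=20.0):
    # Normalize so the dot product below is a cosine similarity.
    anchor_emb = F.normalize(anchor_emb, p=2, dim=1)
    positive_emb = F.normalize(positive_emb, p=2, dim=1)
    # Similarity of every anchor against every candidate in the batch.
    scores = anchor_emb @ positive_emb.T * scale  # shape: (batch, batch)
    # The true pairing sits on the diagonal; everything else is a negative.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# Toy usage with random vectors standing in for sentence embeddings.
loss = contrastive_loss(torch.randn(8, 384), torch.randn(8, 384))
print(loss)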

Key Features

  • Maps sentences and paragraphs to a 384-dimensional dense vector space
  • Can be used for tasks like clustering or semantic search
  • Outputs a vector that captures the semantic information
  • Truncates input text longer than 256 word pieces by default

Training Procedure

The model was trained on a dataset of over 1 billion sentence pairs using a contrastive objective. The training procedure fine-tuned the pre-trained microsoft/MiniLM-L12-H384-uncased checkpoint on a TPU v3-8 for 100k steps, with a batch size of 1024, a learning rate of 2e-5, and a sequence length of 128 tokens.
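
The original training script is not reproduced here, but the sentence-transformers library ships a comparable in-batch-negatives loss (MultipleNegativesRankingLoss). The sketch below shows how a similar fine-tuning run could be wired up; the toy sentence pairs and the tiny batch size are placeholders for the real 1B-pair dataset and batch size of 1024:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Placeholder pairs; the real run used over 1 billion sentence pairs.
train_examples = [
    InputExample(texts=["A man is eating food.", "A man is eating a piece of bread."]),
    InputExample(texts=["The girl is carrying a baby.", "A woman carries her child."]),
]

# Start from the pre-trained base checkpoint (mean pooling is added automatically).
model = SentenceTransformer('microsoft/MiniLM-L12-H384-uncased')
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)  # original run: 1024
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch-negatives contrastive loss

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    optimizer_params={'lr': 2e-5},
)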

Comparison to Other Models

How does all-MiniLM-L12-v2 compare to other models? Let’s take a look:

Model                                   | Accuracy | Speed  | Efficiency
----------------------------------------|----------|--------|-----------
all-MiniLM-L12-v2                       | High     | Fast   | Efficient
microsoft/MiniLM-L12-H384-uncased       | High     | Medium | Medium
sentence-transformers/all-MiniLM-L6-v2  | Medium   | Fast   | Efficient

Intended Uses

This model is intended to be used as a sentence and short paragraph encoder. It can be used for a variety of natural language processing tasks, such as:

  • Information retrieval
  • Clustering
  • Sentence similarity tasks

Examples

  • Similarity: asked for the similarity between 'I love reading books' and 'I am an avid reader', the embeddings give a similarity of 0.85, indicating high semantic similarity.
  • Clustering: given 'I love playing football', 'I am a big fan of tennis', 'I enjoy reading books', 'I am a bookworm', the embeddings separate into Cluster 1 (Sports): 'I love playing football', 'I am a big fan of tennis' and Cluster 2 (Reading): 'I enjoy reading books', 'I am a bookworm'.
  • Similar-sentence search: for the query 'I am looking for a new job' against the list 'I am looking for a new apartment', 'I am searching for a new career opportunity', 'I am interested in learning a new language', the most similar sentence is 'I am searching for a new career opportunity', with a similarity score of 0.92.
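
The similarity scores above are illustrative and will vary with the model version and exact inputs. A score like the one in the first example can be computed with the util.cos_sim helper from the sentence-transformers library, as in this sketch:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')

# Encode both sentences and compare them with cosine similarity.
embeddings = model.encode(["I love reading books", "I am an avid reader"], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())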

Example Use Cases

  • Clustering similar sentences or paragraphs together
  • Searching for similar sentences or paragraphs in a large corpus
  • Determining the semantic similarity between two sentences or paragraphs
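
The search use case above can be sketched with util.semantic_search from the sentence-transformers library; the corpus below reuses the sentences from the example prompts, and the printed score will vary:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')

corpus = [
    'I am looking for a new apartment',
    'I am searching for a new career opportunity',
    'I am interested in learning a new language',
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode('I am looking for a new job', convert_to_tensor=True)

# Rank the corpus against the query by cosine similarity and keep the best hit.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)[0]
print(corpus[hits[0]['corpus_id']], hits[0]['score'])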

Evaluation Results

The model has been evaluated on the Sentence Embeddings Benchmark and shows strong results across a broad range of sentence-embedding tasks; the full automated evaluation is available at https://seb.sbert.net.

Real-World Applications

So, how can you use all-MiniLM-L12-v2 in real-world applications? Here are a few examples:

  • Information Retrieval: Use the model to retrieve relevant documents or web pages based on a search query.
  • Clustering: Group similar text documents or sentences together using the model’s semantic embeddings.
  • Sentence Similarity: Measure the similarity between two sentences or text documents using the model’s embeddings.
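
For the clustering application, a common pattern is to feed the embeddings into an off-the-shelf clustering algorithm. The sketch below uses scikit-learn's KMeans; scikit-learn is an extra dependency and the cluster count is an assumption for this toy data, not something prescribed by the model:

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')

sentences = [
    'I love playing football',
    'I am a big fan of tennis',
    'I enjoy reading books',
    'I am a bookworm',
]
embeddings = model.encode(sentences)  # numpy array of shape (4, 384)

# Two clusters for the two topics in the toy data.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(embeddings)
for sentence, label in zip(sentences, kmeans.labels_):
    print(label, sentence)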

Limitations

While all-MiniLM-L12-v2 is a powerful model, it’s not perfect. Here are some of its limitations:

  • Limited Context Understanding: The model is not designed to understand long texts or complex narratives.
  • Dependence on Training Data: The model was trained on a specific dataset, and may not perform well on data that’s significantly different.
  • Computational Requirements: The model requires significant computational resources, especially when dealing with large inputs.

Format

all-MiniLM-L12-v2 is a sentence-transformers model that maps sentences and paragraphs to a 384-dimensional dense vector space. This model is perfect for tasks like clustering or semantic search.

Architecture

This model uses a transformer architecture, which is a type of neural network designed for natural language processing tasks. It’s trained on a massive dataset of sentence pairs, which allows it to learn the relationships between sentences and capture their semantic meaning.

Data Formats

all-MiniLM-L12-v2 takes text that has been tokenized into word-piece sequences. In practice you rarely do this by hand: the sentence-transformers library tokenizes internally, and with the Hugging Face Transformers library the model's tokenizer handles it, as shown in the examples below.
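
To see what "word pieces" look like in practice, the model's tokenizer can be inspected directly; this sketch uses the Hugging Face AutoTokenizer, and the exact splits depend on the vocabulary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L12-v2')

# Words outside the vocabulary are split into '##'-prefixed word pieces.
print(tokenizer.tokenize('Sentence embeddings map text into a vector space'))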

Special Requirements

When using this model, keep in mind that:

  • Input text longer than 256 word pieces is truncated by default.
  • The model was trained with a sequence length of 128 tokens, so embedding quality may degrade for inputs much longer than that.
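
If you use the sentence-transformers library, the truncation limit is exposed as max_seq_length and can be inspected or adjusted; this sketch assumes that library, and note that raising the limit does not change the fact that the model was trained on shorter sequences:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
print(model.max_seq_length)  # current truncation limit, in word pieces

# Inputs longer than the limit are silently truncated; adjust with care, since
# very long inputs were not seen during training.
model.max_seq_length = 256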

Handling Inputs and Outputs

Here’s an example of how to use all-MiniLM-L12-v2 with the sentence-transformers library:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2')
embeddings = model.encode(sentences)
print(embeddings)

And here’s an example of how to use the model with the Hugging Face Transformers library:

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: average the token embeddings, ignoring padding tokens.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element holds all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['This is an example sentence', 'Each sentence is converted']
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L12-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L12-v2')

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling and normalization
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:")
print(sentence_embeddings)

Note that these examples assume you have the necessary libraries installed and imported.

Dataloop's AI Development Platform

Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.

Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAIF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.

Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.