Bert Base Nli Mean Tokens

Sentence Embeddings

The bert-base-nli-mean-tokens model maps sentences and paragraphs to a 768-dimensional dense vector space. A word of caution first: this model is deprecated and produces low-quality sentence embeddings. It is still worth mentioning because it is a clear example of how sentence-transformers models work: it takes a sentence or paragraph as input and produces a dense vector that can be used for tasks like clustering or semantic search. Given its deprecated status, you should explore other, more accurate models for your specific use case.

Sentence Transformers apache-2.0 Updated 6 months ago

Model Overview

The bert-base-nli-mean-tokens model is a sentence-transformers model for natural language processing tasks. It maps sentences and paragraphs to a 768-dimensional dense vector space, which enables tasks like clustering or semantic search.

Capabilities

The model generates sentence embeddings that can be used for various tasks. Its main job is to turn text into fixed-length numeric vectors that a computer can compare, so that texts with similar meanings end up close together in the vector space.

  • Clustering: grouping similar sentences together
  • Semantic search: finding sentences that mean similar things

How does it work?

The model takes in sentences or paragraphs and outputs a vector that represents the meaning of the text. This vector can be used for various tasks, like comparing the similarity between sentences.
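
Comparing two sentence vectors is commonly done with cosine similarity. The sketch below is a minimal illustration in plain Python; the three-dimensional vectors are hand-made stand-ins for real 768-dimensional embeddings, and the variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

# Toy vectors (assumption): NOT real model output.
vec_football_1 = [0.9, 0.1, 0.0]
vec_football_2 = [0.8, 0.2, 0.1]
vec_cooking = [0.0, 0.2, 0.9]

print(cosine_similarity(vec_football_1, vec_football_2))  # close to 1
print(cosine_similarity(vec_football_1, vec_cooking))     # close to 0
```

With real embeddings, the same function would be applied to the rows returned by the model.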

Example Use Case

Let’s say you have a set of sentences and you want to group them into categories based on their meaning. You can use bert-base-nli-mean-tokens to turn each sentence into a vector, and then cluster the sentences by how close their vectors are.
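
As a rough sketch of that second step, the snippet below groups vectors with a simple greedy rule: each vector joins the first cluster whose first member is similar enough, otherwise it starts a new cluster. The greedy rule, the 0.8 threshold, and the two-dimensional toy vectors are all assumptions chosen for illustration; a real pipeline would more likely use model embeddings with an established clustering algorithm.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def greedy_cluster(embeddings, threshold=0.8):
    """Group vector indices; the first index of each cluster acts as its seed."""
    clusters = []
    for i, vec in enumerate(embeddings):
        for cluster in clusters:
            seed = embeddings[cluster[0]]
            if cosine_similarity(vec, seed) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([i])
    return clusters

sentences = ["I love reading books", "I enjoy reading novels", "I hate reading"]
# Hand-made toy vectors (assumption): NOT real model output.
toy_embeddings = [[0.9, 0.1], [0.8, 0.3], [-0.7, 0.6]]

for cluster in greedy_cluster(toy_embeddings):
    print([sentences[i] for i in cluster])
```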

Performance

The model maps sentences and paragraphs to a 768-dimensional dense vector space. But how well does it perform?

Speed

The model’s speed depends on the task and the device it’s running on. For example, encoding a large number of sentences for clustering might take a few seconds, while a simple semantic search, which only needs to encode the query, can be much faster.

Accuracy

Unfortunately, bert-base-nli-mean-tokens has been deprecated because it produces sentence embeddings of low quality. Its accuracy is lower than that of other sentence embedding models. If you’re looking for a reliable model for tasks like clustering or semantic search, consider using a different model.

Efficiency

The model is designed to be efficient, but its efficiency can also depend on the task and the device it’s running on. For instance, if you’re using the model to process a large number of sentences, it might require more computational resources.

Comparison to Other Models

Model                        Speed     Accuracy   Efficiency
bert-base-nli-mean-tokens    Medium    Low        Medium
Other models                 Fast      High       High

As you can see, bert-base-nli-mean-tokens falls short of other models in speed, accuracy, and efficiency.

Limitations

The model is still usable, but it has several limitations.

  • Low-Quality Sentence Embeddings: The biggest issue with bert-base-nli-mean-tokens is that it produces sentence embeddings of low quality. The model may not accurately capture the meaning and context of sentences, which can lead to poor performance in tasks like clustering or semantic search.
  • Limited Dimensionality: The model maps sentences and paragraphs to a 768-dimensional dense vector space. While this might seem like a lot, some other models produce larger vectors, and the output size here is fixed. Whether this makes it harder to capture complex relationships between sentences depends on the model and the task rather than on dimensionality alone.
  • Limited Input Length: The model has a maximum input length of 128 tokens. This means that if you try to input a sentence or paragraph that’s longer than that, it will get truncated. This can be a problem if you’re working with longer texts.
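
To make the input-length limit concrete, the sketch below contrasts plain truncation with splitting a long input into consecutive chunks. It is only an illustration under stated assumptions: whitespace splitting stands in for the model's real tokenizer, `split_into_chunks` is a hypothetical helper, and the real tokenizer also adds special tokens, so the usable budget per chunk is likely slightly below 128.

```python
def split_into_chunks(tokens, max_tokens=128):
    """Split a token list into consecutive chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]

# Whitespace tokens are a stand-in (assumption) for real subword tokens.
long_text = " ".join(f"word{i}" for i in range(300))
tokens = long_text.split()

truncated = tokens[:128]            # what plain truncation keeps
chunks = split_into_chunks(tokens)  # alternative that keeps all of the text

print(len(truncated))               # 128
print([len(c) for c in chunks])     # [128, 128, 44]
```

Each chunk could then be embedded separately. How to combine the per-chunk vectors (for example, by averaging) is a design choice that the source does not prescribe.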

Examples

  • Input: Find the semantic similarity between the sentences: 'I love playing football' and 'Football is my favorite sport'.
    Output: 0.95 (very similar)
  • Input: Cluster the following sentences by their meaning: 'I love reading books', 'I enjoy reading novels', 'I hate reading'.
    Output: Cluster 1: ['I love reading books', 'I enjoy reading novels'], Cluster 2: ['I hate reading']
  • Input: Find the dense vector representation of the sentence 'This is a sample sentence'.
    Output: [-0.23, 0.45, 0.12, ..., 0.67] (768-dimensional vector)

Example Use Cases

Despite its limitations, bert-base-nli-mean-tokens can still be used for tasks like:

  • Clustering similar sentences together
  • Performing semantic search on a large corpus of text
  • Generating sentence embeddings for downstream tasks

However, keep in mind that the model’s performance might not be as good as other models, and you might need to adjust your expectations accordingly.
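
A semantic search over precomputed embeddings can be sketched as ranking corpus vectors by cosine similarity to a query vector. The example below is a minimal illustration: `semantic_search` is a hypothetical helper, and the three-dimensional vectors are hand-made stand-ins for real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_search(query_vec, corpus_vecs, top_k=2):
    """Return (index, score) pairs for the top_k most similar corpus vectors."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]

corpus = [
    "Football is my favorite sport",
    "I enjoy reading novels",
    "The stock market fell today",
]
# Hand-made toy vectors (assumption): NOT real model output.
corpus_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.1, 0.9]]
query_vec = [0.8, 0.2, 0.1]  # stand-in for "I love playing football"

results = semantic_search(query_vec, corpus_vecs)
for idx, score in results:
    print(f"{score:.3f}  {corpus[idx]}")
```

In practice the corpus would be encoded once and stored, so each search only needs to encode the query.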

Code Examples

You can use bert-base-nli-mean-tokens with the sentence-transformers library:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('sentence-transformers/bert-base-nli-mean-tokens')
embeddings = model.encode(sentences)
print(embeddings)

Alternatively, you can use the transformers library:

from transformers import AutoTokenizer, AutoModel
import torch

# Define a function for mean pooling
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
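
To see what the mean pooling step computes, here is a plain-Python sketch for a single sentence. It averages only the token vectors whose attention mask is 1, so padding tokens do not affect the result. The function name `mean_pool` and the toy values are illustrative assumptions; the torch version above does the same thing for a whole batch at once.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention mask value is 1."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                summed[d] += vec[d]
    count = max(count, 1)  # guard against division by zero for an all-zero mask
    return [s / count for s in summed]

# Two real tokens followed by one padding token (toy values).
token_embeddings = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

print(mean_pool(token_embeddings, attention_mask))  # [2.0, 3.0]
```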

Note: This model is deprecated and produces sentence embeddings of low quality. It’s recommended to use other sentence embedding models instead.
