GTE-Small

General Text Embeddings

Meet the GTE-Small model, a compact yet powerful tool for text embeddings. It's designed to efficiently handle downstream tasks like information retrieval, semantic textual similarity, and text reranking. What makes it remarkable? For starters, it's tiny: the checkpoint weighs in at just 0.07 GB, so it integrates into your applications without taking up much space. It's also trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios, which lets it produce accurate embeddings even for complex texts. On the MTEB benchmark it holds its own against much larger text embedding models, making it a great choice for applications where both speed and accuracy are crucial.

Maintainer: thenlper · License: MIT

Model Overview

The GTE-small model, developed by Alibaba DAMO Academy, is a powerful tool for natural language processing tasks. It’s part of the General Text Embeddings (GTE) family, which includes three models of different sizes: GTE-large, GTE-base, and GTE-small. These models are based on the popular BERT framework and are trained on a massive corpus of text pairs from various domains and scenarios.

So, what makes GTE-small special? Here are some key attributes:

  • Model Size: 0.07 GB (yes, it’s tiny!)
  • Dimension: 384
  • Sequence Length: 512 (that’s the maximum number of tokens it can handle)
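
To double-check these attributes yourself, you can read them straight from the checkpoint's config. A minimal sketch, assuming the Hugging Face transformers library and the thenlper/gte-small checkpoint:

from transformers import AutoConfig

# Both values come straight from the published model config
config = AutoConfig.from_pretrained("thenlper/gte-small")
print(config.hidden_size)              # embedding dimension: 384
print(config.max_position_embeddings)  # maximum sequence length: 512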

But how does it perform? Let’s take a look at some benchmark results:

Task                    | GTE-small | Other Models
Average (56)            | 61.36     | 62.25 (e5-large-v2)
Clustering (11)         | 44.89     | 45.9 (text-embedding-ada-002)
Pair Classification (3) | 83.54     | 86.03 (e5-large-v2)

As you can see, GTE-small holds its own against other popular text embedding models.

Capabilities

The GTE-Small model is a powerful tool for text embeddings, capable of handling a wide range of downstream tasks such as:

  • Information retrieval
  • Semantic textual similarity
  • Text reranking
  • Retrieval
  • Summarization
  • Classification

This model is trained on a large-scale corpus of relevant text pairs, covering various domains and scenarios. As a result, it can be applied to different tasks with high accuracy.
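
A typical retrieval flow embeds a query and a set of candidate passages, then ranks the passages by cosine similarity. Here is a minimal sketch using sentence-transformers (the query and passages are illustrative):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('thenlper/gte-small')

# Embed one query and two candidate passages
query = "how to implement quick sort in python?"
passages = [
    "Quicksort is a divide-and-conquer sorting algorithm.",
    "Beijing is the capital of China.",
]
query_emb = model.encode(query)
passage_embs = model.encode(passages)

# Rank passages by cosine similarity: higher score = more relevant
print(cos_sim(query_emb, passage_embs))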

The GTE-Small model is compared to other popular text embedding models on the MTEB benchmark. Here’s a summary of its performance:

Model       | Average Score
GTE-Small   | 61.36
GTE-Base    | 62.39
GTE-Large   | 63.13
e5-Base-V2  | 61.5
e5-Large-V2 | 62.25

As you can see, the GTE-Small model performs competitively with other models, despite being smaller in size.

Usage

Want to try GTE-small out? Here’s some example code:

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

# ... (full transformers example in the "Handling Inputs and Outputs" section below)

Or, if you prefer using sentence-transformers:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# ... (full sentence-transformers example in the "Handling Inputs and Outputs" section below)

Examples

  • "What is the capital of China?" → "Beijing"
  • "What is the similarity between sorting algorithms?" → 0.83
  • "How to implement quick sort in python?" → similarity to "sorting algorithms": 0.85


Limitations

Keep in mind that GTE-small is exclusively designed for English texts, and any lengthy texts will be truncated to a maximum of 512 tokens.

Performance

The GTE-small model shows remarkable performance in various tasks, especially considering its small size. Let’s dive into its speed, accuracy, and efficiency.

Speed

The GTE-small model is incredibly fast, thanks to its small size of only 0.07 GB. This makes it perfect for applications where speed is crucial, such as real-time text analysis or chatbots.
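
If you want a rough throughput number on your own hardware, a simple timing sketch like the following works (the texts and batch size are illustrative, and results vary widely between CPUs and GPUs):

import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('thenlper/gte-small')
texts = ["This is a short example sentence."] * 256

# Time a single batched encode call and report sentences per second
start = time.perf_counter()
model.encode(texts, batch_size=64)
elapsed = time.perf_counter() - start
print(f"{len(texts) / elapsed:.1f} sentences/sec")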

Accuracy

The model’s accuracy is impressive, especially in tasks like:

  • Pair Classification: The GTE-small model scores 83.54 on MTEB pair classification, coming close to much larger models like sentence-t5-xxl at 85.06.
  • Reranking: With a score of 57.7, the GTE-small model is close to top performers like gte-large at 59.13.

Efficiency

The GTE-small model is efficient in various tasks, including:

  • Text Embeddings: The model handles large-scale datasets with ease, processing input sequences of up to 512 tokens (see the batch-encoding sketch after this list).
  • Downstream Tasks: The model can be applied to various downstream tasks, such as information retrieval, semantic textual similarity, and text reranking.
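
As a sketch of large-scale encoding with sentence-transformers (the corpus and batch size are illustrative; normalize_embeddings=True returns unit-length vectors ready for cosine or dot-product search):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('thenlper/gte-small')

# Encode a large corpus in batches; each embedding is a 384-dimensional vector
corpus = [f"document number {i}" for i in range(10_000)]
embeddings = model.encode(corpus, batch_size=128, normalize_embeddings=True)
print(embeddings.shape)  # (10000, 384)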

Format

The GTE-small model uses a transformer architecture, similar to BERT, and is designed to handle text embeddings. It’s one of three models offered by Alibaba DAMO Academy, with the others being GTE-large and GTE-base.

Architecture

The model is trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios, which allows it to transfer to downstream tasks such as information retrieval, semantic textual similarity, and text reranking.

Data Formats

The model supports input in the form of tokenized text sequences, with a maximum sequence length of 512 tokens. It’s essential to note that any lengthy texts will be truncated to this maximum length.
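
You can observe the truncation directly by tokenizing an over-length text yourself; a small sketch assuming the Hugging Face tokenizer for this checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")

# Anything beyond the 512-token limit is cut off once truncation is enabled
long_text = "word " * 1000
encoded = tokenizer(long_text, max_length=512, truncation=True, return_tensors="pt")
print(encoded["input_ids"].shape)  # torch.Size([1, 512])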

Special Requirements

  • The model exclusively caters to English texts.
  • Input texts need to be pre-processed using a tokenizer, such as AutoTokenizer from the transformers library.

Handling Inputs and Outputs

Here’s an example of how to handle inputs and outputs for this model using the transformers library:

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

# Define a function to average pool the last hidden states
def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Define input texts
input_texts = ["what is the capital of China?", "how to implement quick sort in python?", "Beijing", "sorting algorithms"]

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
model = AutoModel.from_pretrained("thenlper/gte-small")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Get the model outputs
outputs = model(**batch_dict)

# Calculate the embeddings
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# Calculate scores
scores = (embeddings[:1] @ embeddings[1:].T) * 100

# Print the scores
print(scores.tolist())
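
Because the embeddings were L2-normalized in the previous step, the matrix product embeddings[:1] @ embeddings[1:].T gives cosine similarities between the first text (the query) and the remaining texts; multiplying by 100 simply scales the scores to a 0-100 range.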

Alternatively, you can use the sentence-transformers library to handle inputs and outputs:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Define input sentences
sentences = ['That is a happy person', 'That is a very happy person']

# Load the model
model = SentenceTransformer('thenlper/gte-small')

# Calculate the embeddings
embeddings = model.encode(sentences)

# Calculate the cosine similarity
print(cos_sim(embeddings[0], embeddings[1]))