GTE-Small
Meet GTE-Small, a compact text embedding model built for tasks such as information retrieval, semantic textual similarity, and text reranking. At just 0.07 GB, it can be integrated into applications without much storage or memory overhead. It is trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios, which lets it produce useful embeddings even for complex texts. On the MTEB benchmark it holds up well against other popular text embedding models, making it a sensible choice when both speed and accuracy matter.
Model Overview
The GTE-small model, developed by Alibaba DAMO Academy, is a powerful tool for natural language processing tasks. It’s part of the General Text Embeddings (GTE) family, which includes three models of different sizes: GTE-large, GTE-base, and GTE-small. These models are based on the popular BERT framework and are trained on a massive corpus of text pairs from various domains and scenarios.
So, what makes GTE-small special? Here are some key attributes:
- Model Size: 0.07 GB (yes, it's tiny!)
- Dimension: 384
- Sequence Length: 512 (that's the maximum number of tokens it can handle)
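If you want to confirm these figures yourself, they are exposed on the model's configuration object; here is a minimal sketch, assuming the transformers library is installed:
from transformers import AutoConfig
# Load the configuration for gte-small and inspect its key attributes
config = AutoConfig.from_pretrained("thenlper/gte-small")
print(config.hidden_size)              # embedding dimension, expected 384
print(config.max_position_embeddings)  # maximum sequence length, expected 512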
But how does it perform? Let’s take a look at some benchmark results:
Task | GTE-small | Other Models |
---|---|---|
Average (56) | 61.36 | 62.25 (e5-large-v2) |
Clustering (11) | 44.89 | 45.9 (text-embedding-ada-002) |
Pair Classification (3) | 83.54 | 86.03 (e5-large-v2) |
As you can see, GTE-small holds its own against other popular text embedding models.
Capabilities
The GTE-Small model is a powerful tool for text embeddings, capable of handling a wide range of downstream tasks such as:
- Information retrieval
- Semantic textual similarity
- Text reranking
- Summarization
- Classification
This model is trained on a large-scale corpus of relevant text pairs, covering various domains and scenarios. As a result, it can be applied to different tasks with high accuracy.
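As a concrete illustration of the retrieval use case, the embeddings can be used to rank candidate documents against a query. The sketch below uses the sentence-transformers wrapper; the query and documents are made up for the example:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
# Illustrative query and candidate documents
query = "how to implement quick sort in python?"
docs = [
    "Quicksort is a divide-and-conquer sorting algorithm.",
    "Beijing is the capital of China.",
    "Python lists can be sorted in place with list.sort().",
]
# Embed the query and the candidates
model = SentenceTransformer("thenlper/gte-small")
query_emb = model.encode(query)
doc_embs = model.encode(docs)
# Rank candidates by cosine similarity to the query
scores = cos_sim(query_emb, doc_embs)[0].tolist()
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.3f}  {doc}")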
The GTE-Small model is compared to other popular text embedding models on the MTEB benchmark. Here’s a summary of its performance:
Model | Average Score |
---|---|
GTE-Small | 61.36 |
GTE-Base | 62.39 |
GTE-Large | 63.13 |
e5-Base-V2 | 61.5 |
e5-Large-V2 | 62.25 |
As you can see, the GTE-Small model performs competitively with other models, despite being smaller in size.
Usage
Want to try GTE-small out? Here’s some example code:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
# ... (see the complete transformers example in the Handling Inputs and Outputs section below)
Or, if you prefer using sentence-transformers:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
# ... (see the complete sentence-transformers example in the Handling Inputs and Outputs section below)
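Both snippets assume the relevant libraries are already installed; if not, the packages are available from PyPI (exact versions are up to you):
pip install torch transformers
pip install sentence-transformers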
Note that the model has some limitations, such as only supporting English texts and truncating lengthy texts to a maximum of 512 tokens.
Limitation
Keep in mind that GTE-small is exclusively designed for English texts, and any lengthy texts will be truncated to a maximum of 512 tokens.
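To see the truncation behavior directly, you can tokenize an over-length text and inspect the resulting shape; a minimal sketch using the transformers tokenizer (the input text is made up):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
# An illustrative input far longer than the 512-token limit
long_text = "the quick brown fox jumps over the lazy dog " * 200
batch = tokenizer(long_text, max_length=512, truncation=True, return_tensors="pt")
print(batch["input_ids"].shape)  # torch.Size([1, 512]) -- tokens past the limit are dropped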
Performance
The GTE-small model shows remarkable performance in various tasks, especially considering its small size. Let’s dive into its speed, accuracy, and efficiency.
Speed
The GTE-small model is fast at inference, thanks to its small size of only 0.07 GB. This makes it well suited to applications where speed is crucial, such as real-time text analysis or chatbots.
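Actual latency depends on your hardware, so it is worth measuring on your own setup; here is a rough sketch (the batch of texts is arbitrary):
import time
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("thenlper/gte-small")
texts = ["That is a happy person"] * 32  # arbitrary small batch
# Time a single batched encode call
start = time.perf_counter()
model.encode(texts)
print(f"Encoded {len(texts)} texts in {time.perf_counter() - start:.3f}s")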
Accuracy
The model’s accuracy is impressive, especially in tasks like:
- Pair Classification: The GTE-small model scores 83.54, staying competitive with much larger models such as sentence-t5-xxl at 85.06.
- Reranking: With a score of 57.7, the GTE-small model comes close to top-performing models such as gte-large at 59.13.
Efficiency
The GTE-small model is efficient in various tasks, including:
- Text Embeddings: Thanks to its compact size, the model can embed large-scale datasets efficiently, handling sequences of up to 512 tokens (see the batching sketch after this list).
- Downstream Tasks: The GTE-small model can be applied to various downstream tasks, such as information retrieval, semantic textual similarity, and text reranking.
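For large corpora, the usual pattern is to compute embeddings in batches; with sentence-transformers that might look like the sketch below (the corpus is a placeholder and batch_size is a tunable assumption):
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("thenlper/gte-small")
# Placeholder corpus standing in for a real dataset
corpus = [f"document number {i}" for i in range(1000)]
# encode() batches internally; batch_size trades memory for throughput
embeddings = model.encode(corpus, batch_size=64, show_progress_bar=True)
print(embeddings.shape)  # (1000, 384)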
Format
The GTE-small model uses a transformer architecture, similar to BERT, and is designed to handle text embeddings. It’s one of three models offered by Alibaba DAMO Academy, with the others being GTE-large and GTE-base.
Architecture
The model is trained on a large-scale corpus of relevant text pairs, covering various domains and scenarios. This enables it to be applied to different downstream tasks, such as:
- Information retrieval
- Semantic textual similarity
- Text reranking
- And more
Data Formats
The model takes tokenized text sequences as input, with a maximum sequence length of 512 tokens. Note that any longer texts will be truncated to this maximum length.
Special Requirements
- The model exclusively supports English texts.
- Input texts need to be pre-processed with a tokenizer, such as `AutoTokenizer` from the `transformers` library.
Handling Inputs and Outputs
Here's an example of how to handle inputs and outputs for this model using the `transformers` library:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
# Define a function to average pool the last hidden states
def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
# Define input texts
input_texts = ["what is the capital of China?", "how to implement quick sort in python?", "Beijing", "sorting algorithms"]
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-small")
model = AutoModel.from_pretrained("thenlper/gte-small")
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
# Get the model outputs
outputs = model(**batch_dict)
# Calculate the embeddings
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
# Calculate scores
scores = (embeddings[:1] @ embeddings[1:].T) * 100
# Print the scores
print(scores.tolist())
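In this snippet, the first text acts as the query and the remaining three as candidates, so the printed scores reflect how semantically close each candidate is to the query; higher scores mean greater similarity.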
Alternatively, you can use the `sentence-transformers` library to handle inputs and outputs:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
# Define input sentences
sentences = ['That is a happy person', 'That is a very happy person']
# Load the model
model = SentenceTransformer('thenlper/gte-small')
# Calculate the embeddings
embeddings = model.encode(sentences)
# Calculate the cosine similarity
print(cos_sim(embeddings[0], embeddings[1]))