GTE-Base
GTE-base is a General Text Embeddings model designed to handle a variety of text-related tasks efficiently. It was trained on a large-scale corpus of relevance text pairs covering multiple domains and scenarios. Built on the BERT framework, the model strikes a balance between performance and size: with an embedding dimension of 768 and a maximum sequence length of 512 tokens, it can be applied to tasks like information retrieval, semantic textual similarity, and text reranking. While it's limited to English texts and a 512-token input cap, it provides a robust solution for text embedding tasks. If you're looking for a compact model that handles text-related tasks efficiently, GTE-base is worth considering.
Model Overview
The GTE-base General Text Embeddings model is a powerful tool for natural language processing tasks. Developed by Alibaba DAMO Academy, it’s mainly based on the BERT framework and is designed to handle a wide range of domains and scenarios.
Capabilities
This model can handle various downstream tasks, including information retrieval, semantic textual similarity, and text reranking, making it a versatile tool for text analysis.
What can it do?
- Text Embeddings: The model can create vector representations of text, allowing for efficient comparison and analysis of text data.
- Information Retrieval: It can help you find relevant information in large datasets by matching search queries with relevant documents.
- Semantic Textual Similarity: The model can determine the similarity between two pieces of text, enabling applications such as plagiarism detection and text summarization.
- Text Reranking: It can re-rank search results to improve the relevance and accuracy of search queries.
Example Use Cases
Here’s an example of how you can use the GTE-base model to calculate the similarity between two pieces of text:
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Two sentences with closely related meanings
sentences = ['That is a happy person', 'That is a very happy person']

# Load GTE-base from the Hugging Face Hub and encode both sentences
model = SentenceTransformer('thenlper/gte-base')
embeddings = model.encode(sentences)

# Cosine similarity between the two embeddings
print(cos_sim(embeddings[0], embeddings[1]))
```
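To give a flavor of the information retrieval and reranking use cases, here's a minimal sketch that ranks a handful of candidate documents against a query by cosine similarity. The query and document strings below are purely illustrative, not from the model card:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('thenlper/gte-base')

# Illustrative query and candidate documents (not from the model card)
query = "how do I sort a list in Python?"
documents = [
    "Use the sorted() built-in or list.sort() to sort a list in Python.",
    "The capital of France is Paris.",
    "Quick sort is a divide-and-conquer sorting algorithm.",
]

# Embed the query and all candidate documents
query_emb = model.encode(query)
doc_embs = model.encode(documents)

# Rank documents by cosine similarity to the query, highest first
scores = cos_sim(query_emb, doc_embs)[0]
for doc, score in sorted(zip(documents, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```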
Performance
The GTE-base model shows impressive performance in various text embedding tasks.
Speed
The model is lightweight, with a size of just 0.22 GB, which keeps it fast to load and run. This is much smaller than some other models, such as sentence-t5-xxl, which has a massive size of 9.73 GB.
Accuracy
The model achieves high accuracy in many tasks, including:
- Semantic Textual Similarity (STS): 82.3 out of 100
- Text Reranking: 58.61 out of 100
- Information Retrieval: 51.14 out of 100
Comparison with Other Models
The GTE-base model has been compared to other popular text embedding models on the MTEB benchmark, and it has shown impressive results. Here’s a brief comparison:
| Model | Average Score |
|---|---|
| GTE-base | 62.39 |
| e5-large-v2 | 62.25 |
| e5-base-v2 | 61.5 |
| sentence-t5-xxl | 59.51 |
Limitations
The GTE-base model has some limitations that are worth considering.
Language Limitation
The model only works well with English texts. If you try to use it with texts in other languages, it might not perform as well.
Text Length Limitation
The model can only handle texts up to 512 tokens; anything longer will be truncated, which means some important information might be lost.
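As a quick illustration, you can use the model's tokenizer to see where truncation kicks in. The long input here is synthetic, just to exceed the limit:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")

# A deliberately long input: 1,000 repeated words (illustrative)
long_text = "word " * 1000

# Without truncation, the sequence exceeds the model's 512-token limit
ids = tokenizer(long_text)["input_ids"]
print(len(ids))  # > 512

# With truncation enabled, everything past 512 tokens is silently dropped
truncated = tokenizer(long_text, max_length=512, truncation=True)["input_ids"]
print(len(truncated))  # 512
```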
Domain Limitation
Although the model has been trained on a wide range of domains and scenarios, it might not perform equally well in all areas.
Format
The GTE-base model uses a BERT-like architecture and is designed to handle text embeddings.
Input Requirements
To use the model, you’ll need to prepare your input text data in a specific way:
- Tokenization: Your text data needs to be tokenized, which means breaking it down into individual words or subwords.
- Maximum sequence length: The model has a maximum sequence length of 512 tokens. If your text is longer, it will be truncated.
- English only: The model is exclusively designed for English texts, so you'll need to ensure your input data is in English.
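As a small sketch of the tokenization step, the model's BERT-style tokenizer breaks text into word and subword pieces (the example sentence is illustrative; exact splits depend on the vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")

# BERT-style tokenizers split text into word and subword pieces
tokens = tokenizer.tokenize("Embeddings are useful")
print(tokens)
```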
Output
The model outputs a vector representation of the input text, which can be used for various downstream tasks, such as:
- Text similarity: You can use the output vectors to calculate the similarity between different text sequences.
- Text classification: You can use the output vectors as input to a classification model.
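For example, here's a minimal sketch of the classification use case, pairing GTE-base embeddings with a scikit-learn logistic regression. The choice of scikit-learn and the toy sentiment dataset are illustrative assumptions, not part of the model card:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer('thenlper/gte-base')

# Illustrative toy dataset: 1 = positive, 0 = negative
texts = [
    "I love this product, it works great",
    "Absolutely fantastic experience",
    "Terrible quality, broke after one day",
    "Very disappointed with the service",
]
labels = [1, 1, 0, 0]

# Use the 768-dimensional embeddings as features for a linear classifier
features = model.encode(texts)
clf = LogisticRegression().fit(features, labels)

print(clf.predict(model.encode(["This was a wonderful purchase"])))
```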
Code Examples
Here’s an example of how to use the GTE-base model directly with PyTorch and Hugging Face Transformers, including the mean-pooling step that turns token-level outputs into a single embedding per text:
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the sequence dimension
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")
model = AutoModel.from_pretrained("thenlper/gte-base")

# Tokenize the input texts, truncating anything beyond 512 tokens
input_texts = ["What is the capital of China?", "How to implement quick sort in Python?"]
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Get the output vectors and mean-pool them into one embedding per text
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize the embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# Calculate the similarity between the two input texts
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
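Note that once the embeddings are L2-normalized, the matrix product in the final step is exactly cosine similarity; multiplying by 100 just rescales the scores for readability.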