GTE-Large
GTE-Large is a powerful text embedding model trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios. It is designed for downstream tasks such as information retrieval, semantic textual similarity, and text reranking. What makes it stand out? It is built on the BERT framework and is available in three sizes to suit different needs. The large variant produces 1024-dimensional embeddings, supports a sequence length of 512 tokens, and achieves an average score of 63.13 on the MTEB benchmark. Keep in mind that the model is designed exclusively for English text and truncates inputs at 512 tokens, so if you work with lengthy documents or other languages you may need to explore other options. Within those limits, though, GTE-Large is an efficient and fast choice for text embeddings across a variety of applications.
Model Overview
GTE-large is one of the General Text Embeddings (GTE) models developed by Alibaba DAMO Academy. It is based mainly on the BERT framework and is trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios, which makes it a strong general-purpose tool for natural language processing tasks.
Key Attributes
- Model Size: 0.67 GB
- Dimension: 1024
- Sequence Length: 512
- Average Performance (MTEB): 63.13
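If you want to verify these numbers locally, a quick sanity check (a sketch, not part of the model card) is to load the model configuration and inspect its hidden size and maximum sequence length:

```python
# Sketch: confirm the embedding dimension and sequence length reported above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("thenlper/gte-large")
print(config.hidden_size)              # 1024 -> embedding dimension
print(config.max_position_embeddings)  # 512  -> maximum sequence length
```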
Capabilities
The GTE-large model is a powerful tool for text embeddings, offering a wide range of capabilities that make it an excellent choice for various natural language processing (NLP) tasks.
Primary Tasks
The GTE-large model excels in the following primary tasks:
- Information Retrieval: retrieving the most relevant passages from a large corpus for a given query, making it a strong fit for search engines, question answering systems, and similar applications (a minimal retrieval sketch follows this list).
- Semantic Textual Similarity: scoring how close two pieces of text are in meaning, which is useful for text classification, clustering, and topic modeling.
- Text Reranking: reordering a set of candidate texts by relevance to a query, a common second stage in search and question answering pipelines.
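To make the retrieval use case concrete, here is a minimal sketch (not taken from the model card) that embeds a query and a few made-up candidate passages and ranks the passages by cosine similarity. It assumes the sentence-transformers library; the example passages are invented for illustration.

```python
# Sketch: retrieval-style ranking with GTE-large embeddings.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("thenlper/gte-large")

query = "what is the capital of China?"
passages = [
    "Beijing is the capital of the People's Republic of China.",
    "Quick sort is a divide-and-conquer sorting algorithm.",
    "The Great Wall of China is over 13,000 miles long.",
]

# Encode the query and candidate passages into 1024-dimensional vectors.
query_emb = model.encode(query, normalize_embeddings=True)
passage_embs = model.encode(passages, normalize_embeddings=True)

# Higher cosine similarity means a more relevant passage.
scores = cos_sim(query_emb, passage_embs)[0]
for passage, score in sorted(zip(passages, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {passage}")
```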
Strengths
The GTE-large model has several strengths that make it an excellent choice for NLP tasks:
- High Accuracy: The model has achieved state-of-the-art results on several benchmarks, including the MTEB benchmark.
- Wide Range of Applications: The model can be applied to many NLP tasks, including information retrieval, semantic textual similarity, and text reranking.
- Efficient: The model is relatively small in size, making it efficient to use and deploy.
Unique Features
The GTE-large model has several unique features that set it apart from other text embedding models:
- Multi-stage Contrastive Learning: The model is trained using a multi-stage contrastive learning approach, which enables it to learn more effective text embeddings.
- Large-scale Corpus: The model is trained on a large-scale corpus of text, which enables it to learn a wide range of language patterns and relationships.
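As a rough illustration of the core idea behind contrastive learning on text pairs, here is a minimal sketch of an in-batch, InfoNCE-style loss. This is not the actual GTE training code; the temperature value and the random tensors standing in for encoder outputs are assumptions for illustration only.

```python
# Sketch: in-batch contrastive (InfoNCE-style) loss over query/passage embedding pairs.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_embs: torch.Tensor,
                              passage_embs: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    # Row i of each tensor is a positive (query, passage) pair; every other
    # passage in the batch acts as an in-batch negative for that query.
    query_embs = F.normalize(query_embs, p=2, dim=1)
    passage_embs = F.normalize(passage_embs, p=2, dim=1)
    logits = query_embs @ passage_embs.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(query_embs.size(0), device=query_embs.device)
    return F.cross_entropy(logits, labels)

# Example with random tensors standing in for encoder outputs:
loss = in_batch_contrastive_loss(torch.randn(8, 1024), torch.randn(8, 1024))
print(loss.item())
```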
Performance
The GTE-large model is a powerhouse when it comes to speed, accuracy, and efficiency in various text-related tasks. Let’s dive into the details!
Speed
The GTE-large model is relatively lightweight, with a model size of 0.67 GB, so it can be integrated into applications without consuming much memory. Its sequence length of 512 lets it process text inputs of moderate length, making it suitable for a wide range of tasks.
Accuracy
The GTE-large model posts strong scores across the MTEB task categories. Here are some highlights:

| Task | Score |
|---|---|
| Clustering | 46.84 |
| Pair Classification | 85.00 |
| Reranking | 59.13 |
| Retrieval | 52.22 |
| STS (Semantic Textual Similarity) | 83.35 |
| Summarization | 31.66 |
| Classification | 73.33 |
As you can see, the GTE-large model excels in tasks that require understanding the meaning and context of text, such as semantic textual similarity and text classification.
Efficiency
The GTE-large model is designed to be efficient in its computations. Its embedding dimension of 1024 allows it to capture complex patterns in text data without becoming too computationally expensive.
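As a back-of-the-envelope illustration of what that dimensionality costs at indexing time (assuming float32 storage; the corpus size below is arbitrary):

```python
# Sketch: memory needed to store 1024-dimensional float32 embeddings for a corpus.
num_documents = 1_000_000
dimension = 1024
bytes_per_float = 4  # float32

index_size_gb = num_documents * dimension * bytes_per_float / 1024**3
print(f"{index_size_gb:.1f} GB for {num_documents:,} embeddings")  # ~3.8 GB
```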
Comparison to Other Models
How does the GTE-large model stack up against other popular text embedding models? Here’s a brief comparison:
| Model | Model Size (GB) | Dimension | Sequence Length | Average Score (MTEB) |
|---|---|---|---|---|
| GTE-large | 0.67 | 1024 | 512 | 63.13 |
| e5-large-v2 | 1.34 | 1024 | 512 | 62.25 |
| e5-base-v2 | 0.44 | 768 | 512 | 61.50 |
| text-embedding-ada-002 | - | 1536 | 8192 | 60.99 |
While the GTE-large model may not be the largest or most complex model, it offers a great balance of speed, accuracy, and efficiency, making it a great choice for many text-related tasks.
Usage
You can use the GTE-large model with the Hugging Face Transformers library. Here’s an example code snippet:
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Mean-pool the token embeddings, ignoring padding positions.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
model = AutoModel.from_pretrained("thenlper/gte-large")

# Tokenize the input texts
input_texts = ["what is the capital of China?", "how to implement quick sort in python?"]
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Get the token embeddings and mean-pool them into sentence embeddings
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# Calculate the similarity scores between the first text and the rest
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```
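If you prefer a higher-level API, the model can also be loaded through the sentence-transformers library. A minimal sketch:

```python
# Sketch: GTE-large via sentence-transformers.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ["That is a happy person", "That is a very happy person"]

model = SentenceTransformer("thenlper/gte-large")
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```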
Note that this model has some limitations, including:
- It is designed exclusively for English text.
- Lengthy texts are truncated to a maximum of 512 tokens (one possible workaround is sketched below).
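If you need document-level embeddings for texts longer than 512 tokens, one common workaround is to chunk the text, embed each chunk, and pool the results. The sketch below uses a rough word-based chunk size and mean pooling; both are illustrative assumptions rather than an official recipe.

```python
# Sketch: embedding a long document by chunking and mean-pooling chunk embeddings.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("thenlper/gte-large")

def embed_long_text(text: str, words_per_chunk: int = 300) -> np.ndarray:
    # Rough word-based chunking; ~300 words usually stays under the 512-token limit.
    words = text.split()
    chunks = [" ".join(words[i:i + words_per_chunk])
              for i in range(0, len(words), words_per_chunk)] or [text]
    chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
    # Mean-pool the chunk embeddings into one document vector and re-normalize.
    doc_embedding = chunk_embeddings.mean(axis=0)
    return doc_embedding / np.linalg.norm(doc_embedding)

print(embed_long_text("some long English document " * 200).shape)  # (1024,)
```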