GTE-large

General Text Embeddings

The GTE-large model is a general-purpose text embedding model trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios. It is designed for downstream tasks such as information retrieval, semantic textual similarity, and text reranking. What makes it stand out? It is built on the BERT framework and comes in three sizes to suit different needs; the large variant produces 1024-dimensional embeddings, handles sequences of up to 512 tokens, and reaches an average score of 63.13 on the MTEB benchmark. Keep in mind that the model is designed exclusively for English text and truncates anything beyond 512 tokens, so lengthy documents or non-English languages may call for other options. Within those limits, GTE-large is an efficient, fast, and accurate embedding model for a wide range of applications.

Maintainer: thenlper · License: MIT


Model Overview

GTE-large is the large variant of the General Text Embeddings (GTE) family developed by Alibaba DAMO Academy. It is based on the BERT framework and trained on a large-scale corpus of relevance text pairs covering a wide range of domains and scenarios, which makes it a versatile embedding model for natural language processing tasks.

Key Attributes

  • Model Size: 0.67 GB
  • Dimension: 1024
  • Sequence Length: 512
  • Average Score (MTEB): 63.13
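
If you want to confirm the dimension and sequence length from the checkpoint itself, a minimal check with the Hugging Face Transformers library (assuming the model id thenlper/gte-large is reachable on the Hub) looks like this:

# Read the embedding dimension and maximum sequence length from the model config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("thenlper/gte-large")
print(config.hidden_size)              # embedding dimension: 1024
print(config.max_position_embeddings)  # maximum sequence length: 512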

Capabilities

The GTE-large model produces dense text embeddings that transfer well across a broad set of natural language processing (NLP) tasks, which makes it a strong default choice for many applications.

Primary Tasks

The GTE-large model excels in the following primary tasks:

  • Information Retrieval: The model retrieves relevant documents from a large corpus, making it well suited to search engines, question answering systems, and similar applications.
  • Semantic Textual Similarity: The model scores how close two pieces of text are in meaning, which is useful for text classification, clustering, and topic modeling.
  • Text Reranking: The model reorders candidate passages by their relevance to a query, a common second stage in search and question answering pipelines (see the sketch after this list).
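
As a concrete illustration of the similarity and reranking tasks above, here is a minimal sketch using the sentence-transformers library. The query and candidate sentences are made up for illustration, and the exact scores will differ from any figures quoted elsewhere on this page.

# Minimal reranking sketch: embed a query and its candidate answers with
# GTE-large, then sort the candidates by cosine similarity to the query.
# Assumes the sentence-transformers package is installed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-large")

query = "What is the best way to learn a new language?"
candidates = [
    "Language learning apps are very effective.",
    "Language exchange programs are a great way to practice.",
    "Watching TV shows in the target language can be helpful.",
]

query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(query_emb, cand_embs)[0]

# Higher cosine similarity means more relevant to the query
for text, score in sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.2f}  {text}")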

Strengths

The GTE-large model has several strengths that make it an excellent choice for NLP tasks:

  • High Accuracy: The model has achieved state-of-the-art results on several benchmarks, including the MTEB benchmark.
  • Wide Range of Applications: The model can be applied to many NLP tasks, including information retrieval, semantic textual similarity, and text reranking.
  • Efficient: The model is relatively small in size, making it efficient to use and deploy.

Unique Features

The GTE-large model has several unique features that set it apart from other text embedding models:

  • Multi-stage Contrastive Learning: The model is trained with a multi-stage contrastive learning objective, which helps it learn more effective text embeddings (a simplified sketch of this style of loss follows this list).
  • Large-scale Corpus: The model is trained on a large-scale corpus of text, which enables it to learn a wide range of language patterns and relationships.
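
The exact training recipe is not reproduced here, but the core of contrastive training for embeddings is an in-batch, InfoNCE-style loss over paired texts: each query is pulled toward its positive document and pushed away from the other documents in the batch. A simplified sketch of that loss (illustrative only, not the authors' training code):

# Simplified in-batch contrastive (InfoNCE-style) loss over paired embeddings.
# Illustrative sketch only; the actual GTE training pipeline has multiple stages.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    # query_emb, doc_emb: (batch, dim) tensors; row i of doc_emb is the positive
    # document for row i of query_emb, and every other row acts as a negative.
    query_emb = F.normalize(query_emb, p=2, dim=1)
    doc_emb = F.normalize(doc_emb, p=2, dim=1)
    logits = query_emb @ doc_emb.T / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0))          # positives sit on the diagonal
    return F.cross_entropy(logits, labels)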

Performance

The GTE-large model is a powerhouse when it comes to speed, accuracy, and efficiency in various text-related tasks. Let’s dive into the details!

Speed

The GTE-large model is relatively lightweight, with a model size of 0.67 GB. This means it can be easily integrated into applications without consuming too much memory. Its sequence length of 512 allows it to process text inputs of moderate length, making it suitable for a wide range of tasks.

Accuracy

The GTE-large model posts strong scores across the MTEB task categories. Here are some highlights:

Task                                  Score
Clustering                            46.84%
Pair Classification                   85.00%
Reranking                             59.13%
Retrieval                             52.22%
STS (Semantic Textual Similarity)     83.35%
Summarization                         31.66%
Classification                        73.33%

As you can see, the GTE-large model excels in tasks that require understanding the meaning and context of text, such as semantic textual similarity and text classification.

Efficiency

The GTE-large model is designed to be efficient in its computations. Its dimensionality of 1024 allows it to capture complex patterns in text data without becoming too computationally expensive.
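
As a rough back-of-the-envelope estimate (assuming float32 storage and no index overhead), each 1024-dimensional embedding occupies 1024 × 4 bytes = 4 KB, so a corpus of one million embeddings needs roughly 4 GB before any compression or quantization:

# Back-of-the-envelope storage estimate for 1024-dimensional float32 embeddings.
dim = 1024
bytes_per_float = 4          # float32
num_vectors = 1_000_000

total_bytes = dim * bytes_per_float * num_vectors
print(f"{total_bytes / 1024**3:.2f} GiB")   # ~3.81 GiB for one million vectors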

Comparison to Other Models

How does the GTE-large model stack up against other popular text embedding models? Here’s a brief comparison:

Model                    Model Size (GB)   Dimension   Sequence Length   Average Score
GTE-large                0.67              1024        512               63.13%
e5-large-v2              1.34              1024        512               62.25%
e5-base-v2               0.44              768         512               61.50%
text-embedding-ada-002   -                 1536        8192              60.99%

While the GTE-large model may not be the largest or most complex model, it offers a great balance of speed, accuracy, and efficiency, making it a great choice for many text-related tasks.

Examples

  • Query: What is the semantic textual similarity between the sentences 'I love playing football' and 'Playing football is my favorite hobby'?
    Output: 0.85
  • Query: Provide a text embedding for the sentence 'The capital of France is Paris'.
    Output: [-0.03, 0.01, 0.02, ..., 0.05]
  • Query: Rank the following sentences by their relevance to the query 'What is the best way to learn a new language?': 'Language learning apps are very effective.', 'Language exchange programs are a great way to practice.', 'Watching TV shows in the target language can be helpful.'
    Output: 1. Language learning apps are very effective. (0.8), 2. Language exchange programs are a great way to practice. (0.7), 3. Watching TV shows in the target language can be helpful. (0.6)

Usage

You can use the GTE-large model with the Hugging Face Transformers library. Here’s an example code snippet:

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

# Mean-pool the token states, ignoring padded positions
def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
model = AutoModel.from_pretrained("thenlper/gte-large")

# Tokenize the input texts
input_texts = ["what is the capital of China?", "how to implement quick sort in python?"]
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Get the embeddings
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# Calculate the similarity scores
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

Note that this model has some limitations, including:

  • It supports English text only.
  • Lengthy inputs are truncated to a maximum of 512 tokens (see the quick check below).
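
A quick way to see the truncation behaviour for yourself, reusing the tokenizer from the Usage snippet above:

# Inputs longer than 512 tokens are silently cut off; check token counts up front.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-large")
long_text = "word " * 1000
encoded = tokenizer(long_text, max_length=512, truncation=True)
print(len(encoded["input_ids"]))   # 512: everything past the limit was dropped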