GTE-base

General Text Embeddings

GTE-base is a General Text Embeddings model designed to handle a variety of text-related tasks efficiently. It is trained on a large-scale corpus of relevant text pairs covering multiple domains and scenarios. The model is based on the BERT framework and offers a good balance between performance and size: it produces 768-dimensional embeddings and supports input sequences of up to 512 tokens. GTE-base can be applied to tasks such as information retrieval, semantic textual similarity, and text reranking. While it is limited to English texts and truncates inputs longer than 512 tokens, it provides a robust solution for text embedding tasks. If you're looking for a compact model that handles text embedding efficiently, GTE-base is worth considering.

Maintainer: thenlper | License: MIT

Model Overview

The GTE-base General Text Embeddings model is a powerful tool for natural language processing tasks. Developed by Alibaba DAMO Academy, it’s mainly based on the BERT framework and is designed to handle a wide range of domains and scenarios.

Capabilities

Capable of handling various downstream tasks, including information retrieval, semantic textual similarity, text reranking, and more, this model is a versatile tool for text analysis.

What can it do?

  • Text Embeddings: The model can create vector representations of text, allowing for efficient comparison and analysis of text data.
  • Information Retrieval: It can help you find relevant information in large datasets by matching search queries with relevant documents.
  • Semantic Textual Similarity: The model can determine the similarity between two pieces of text, enabling applications such as plagiarism detection and text summarization.
  • Text Reranking: It can re-rank search results to improve the relevance and accuracy of search queries.

Example Use Cases

Here’s an example of how you can use the GTE-base model to calculate the similarity between two pieces of text:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']
model = SentenceTransformer('thenlper/gte-base')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
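
Beyond pairwise similarity, the same embeddings can be used for retrieval or reranking by scoring a query against a set of candidate passages. The snippet below is a minimal sketch of this idea; the query and passages are illustrative placeholders, not data from the model card.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('thenlper/gte-base')

# Illustrative query and candidate passages
query = "How to implement quick sort in Python?"
passages = [
    "Quicksort is a divide-and-conquer sorting algorithm.",
    "Beijing is the capital of China.",
]

# Encode the query and passages, then rank passages by cosine similarity
query_embedding = model.encode(query)
passage_embeddings = model.encode(passages)
scores = cos_sim(query_embedding, passage_embeddings)[0]

for passage, score in sorted(zip(passages, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {passage}")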

Performance

The GTE-base model shows impressive performance in various text embedding tasks.

Speed

At 0.22 GB, the model is compact, which keeps inference fast and memory requirements low. For comparison, sentence-t5-xxl weighs in at a massive 9.73 GB.

Accuracy

On the MTEB benchmark, the model achieves strong average scores across task groups, including:

  • Semantic Textual Similarity (STS): 82.3 out of 100
  • Text Reranking: 58.61 out of 100
  • Information Retrieval: 51.14 out of 100

Comparison with Other Models

The GTE-base model has been compared to other popular text embedding models on the MTEB benchmark, and it has shown impressive results. Here’s a brief comparison:

Model             Average Score
GTE-base          62.39
e5-large-v2       62.25
e5-base-v2        61.5
sentence-t5-xxl   59.51

Limitations

The GTE-base model has some limitations that are worth considering.

Language Limitation

The model only works well with English texts. If you try to use it with texts in other languages, it might not perform as well.

Text Length Limitation

The model can only handle texts up to a certain length. If your text is longer than 512 tokens, it will be truncated. This means that some important information might be lost.
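
If you need to embed documents longer than 512 tokens, a common workaround (not something the model provides out of the box) is to split the text into chunks, embed each chunk, and combine the chunk vectors, for example by averaging. A minimal sketch, assuming sentence-transformers and naive whitespace chunking:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('thenlper/gte-base')

def embed_long_text(text, chunk_words=200):
    # Naive whitespace chunking; a real pipeline might split on sentences or tokens instead
    words = text.split()
    chunks = [' '.join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    # Embed each chunk and average into a single document vector
    chunk_embeddings = model.encode(chunks, normalize_embeddings=True)
    return np.mean(chunk_embeddings, axis=0)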

Domain Limitation

Although the model has been trained on a wide range of domains and scenarios, it might not perform equally well in all areas.

Examples
  • Similarity between 'That is a happy person' and 'That is a very happy person': 0.996
  • Query 'What is the capital of China?': best-matching passage is 'Beijing'
  • Query 'How to implement quick sort in python?': similarity scores against candidate passages: [0.64, 0.44]

Format

The GTE-base model uses a BERT-like architecture and is designed to handle text embeddings.

Input Requirements

To use the model, you’ll need to prepare your input text data in a specific way:

  • Tokenization: Your text data needs to be tokenized, which means breaking it down into individual words or subwords.
  • Maximum sequence length: The model has a maximum sequence length of 512 tokens. If your text is longer, it will be truncated (you can check token counts beforehand, as sketched below).
  • English only: The model is exclusively designed for English texts, so you’ll need to ensure your input data is in English.
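
To check whether a given input will be truncated, you can count tokens with the model's tokenizer before encoding. A small sketch (the example text and warning message are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")

text = "Your input text goes here."
token_count = len(tokenizer(text)["input_ids"])
if token_count > 512:
    print(f"Input has {token_count} tokens; everything past 512 will be truncated.")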

Output

The model outputs a vector representation of the input text, which can be used for various downstream tasks, such as:

  • Text similarity: You can use the output vectors to calculate the similarity between different text sequences.
  • Text classification: You can use the output vectors as input features for a classification model (see the sketch below).
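
As a rough illustration of the classification use case, the embeddings can serve as features for an off-the-shelf classifier. This sketch assumes scikit-learn is installed and uses a tiny placeholder dataset:

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer('thenlper/gte-base')

# Tiny placeholder training set
texts = ["I loved this movie", "Fantastic experience", "Terrible film", "I hated it"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Embed the texts and fit a simple classifier on the vectors
train_embeddings = model.encode(texts)
classifier = LogisticRegression().fit(train_embeddings, labels)

# Classify a new piece of text
new_embedding = model.encode(["What a great story"])
print(classifier.predict(new_embedding))  # likely [1], i.e. positive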

Code Examples

Here’s an example of how to use the GTE-base model with Hugging Face Transformers and PyTorch, using mean pooling over the token outputs:

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

# Mean pooling over the token embeddings, ignoring padding positions
def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")
model = AutoModel.from_pretrained("thenlper/gte-base")

# Tokenize the input text
input_texts = ["What is the capital of China?", "How to implement quick sort in Python?"]
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

# Get the output vectors
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize the embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)

# Calculate the similarity between the two input texts
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())