gte-large-en-v1.5

Long-context text embeddings

Have you ever wondered how some AI models can understand and process long pieces of text so efficiently? The gte-large-en-v1.5 model is a great example. With a maximum sequence length of 8192 tokens, it is built to handle even very long documents. What makes it really special is that it achieves state-of-the-art scores on the MTEB benchmark while remaining competitive on the LoCo long-context retrieval tests. This comes down to its architecture, which pairs a transformer++ encoder backbone with techniques like rotary position embeddings (RoPE) and gated linear units (GLU). Whether you're building a retrieval system or just need reliable text embeddings for your NLP tasks, gte-large-en-v1.5 is worth checking out.

Alibaba-NLP · apache-2.0 license

Model Overview

The gte-large-en-v1.5 model, developed by the Institute for Intelligent Computing at Alibaba Group, is a powerful tool for natural language processing tasks. It's a text embedding model: it maps a piece of text to a dense vector that captures its meaning, so downstream systems can compare, cluster, and search texts numerically.

Here are some of its key attributes:

  • Language: English
  • Model Size: 434M parameters
  • Max Sequence Length: 8192 tokens
  • Embedding Dimension: 1024

But what does it all mean? The model accepts inputs of up to 8192 tokens and produces a single 1024-dimensional embedding per text. This makes it well suited for tasks like text retrieval, question answering, and semantic search.
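
If you want to check these numbers programmatically, you can read them off the model's configuration. A minimal sketch, assuming the standard Hugging Face config attribute names carry over to this custom architecture:

from transformers import AutoConfig

# The model ships custom code, hence trust_remote_code=True
config = AutoConfig.from_pretrained('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
print(config.hidden_size)              # embedding dimension: 1024
print(config.max_position_embeddings)  # maximum sequence length: 8192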

Capabilities

Primary Tasks

This model excels at:

  • Text embeddings: It can represent text in a way that captures its meaning and context, making it useful for tasks like text classification, clustering, and retrieval.
  • Long-context text representation: It can handle text sequences of up to 8192 tokens, making it suitable for tasks that require understanding long pieces of text.
  • Reranking: It can re-order candidate passages by their relevance to a query, making it useful for search and question answering (see the sketch after this list).
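
To make the retrieval and reranking capabilities concrete, here is a minimal sketch that embeds a query and two candidate passages, then ranks the passages by cosine similarity. The embed helper is illustrative, not part of the model's API:

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_path = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).eval()

def embed(texts):
    # Tokenize, run the encoder, and take the [CLS] token as the text embedding
    batch = tokenizer(texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        out = model(**batch)
    return F.normalize(out.last_hidden_state[:, 0], p=2, dim=1)

query = embed(["how do transformers handle long documents?"])
passages = ["RoPE extends positional encodings to long contexts.",
            "Quick sort has average complexity O(n log n)."]
scores = (query @ embed(passages).T).squeeze(0)

# Rank passages from most to least relevant to the query
for score, passage in sorted(zip(scores.tolist(), passages), reverse=True):
    print(f"{score:.3f}  {passage}")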

Strengths

The gte-large-en-v1.5 model has several strengths:

  • State-of-the-art performance: It achieves state-of-the-art scores on the MTEB benchmark and competitive results on the LoCo long-context retrieval tests.
  • English specialization: As the name suggests, it is trained specifically for English text, which makes it a focused, dependable choice when multilingual coverage isn't required.
  • Efficient training: It uses a multi-stage training strategy that enables it to learn from large amounts of data efficiently.

Performance

gte-large-en-v1.5 is a powerful text embedding model that has shown remarkable performance in various tasks. Let’s dive into its speed, accuracy, and efficiency.

Speed

How fast can gte-large-en-v1.5 process text? As a 434M-parameter encoder, it runs a single forward pass quickly on modern hardware. Keep in mind, though, that transformer attention cost grows with sequence length, so encoding text near the 8192-token limit takes noticeably longer than encoding short passages; batching short inputs together is the usual way to keep throughput high.
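
To get a feel for how encoding time grows with input length, you can time a forward pass at a few truncation lengths. A rough sketch; absolute numbers depend entirely on your hardware:

import time
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).eval()

text = "long context " * 4000  # roughly 8000 tokens of filler
for max_len in (512, 2048, 8192):
    batch = tokenizer([text], max_length=max_len, truncation=True, return_tensors='pt')
    start = time.perf_counter()
    with torch.no_grad():
        model(**batch)
    print(f"max_length={max_len}: {time.perf_counter() - start:.2f}s")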

Accuracy

gte-large-en-v1.5 has achieved state-of-the-art scores on the MTEB benchmark, outperforming other models in its category. It has also shown competitive results on the LoCo long-context retrieval tests. But what does this mean for you? It means that gte-large-en-v1.5 can accurately capture the meaning and context of text, making it a reliable choice for text classification, retrieval, and other tasks.

Efficiency

gte-large-en-v1.5 is not only accurate but also efficient. With a model size of 434M parameters and an embedding dimension of 1024, it processes text quickly without sacrificing quality, and its embeddings remain compact enough for fast similarity search over large collections.
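
If memory or latency is a concern, one common trick is to load the weights in half precision. A sketch, assuming a CUDA device is available; whether fp16 affects this particular architecture's accuracy is something to verify on your own data:

import torch
from transformers import AutoModel

# Load the 434M-parameter encoder in fp16 to roughly halve its memory footprint
model = AutoModel.from_pretrained(
    'Alibaba-NLP/gte-large-en-v1.5',
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to('cuda')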

Evaluation

The model has been evaluated on several benchmarks, including MTEB and LoCo. Here are some of its scores:

Benchmark    Score
MTEB         65.39
LoCo         86.71

As you can see, the model performs well on both benchmarks. But how does it compare to other models? Let’s take a look:

Model                            MTEB Score    LoCo Score
gte-large-en-v1.5                65.39         86.71
mxbai-embed-large-v1             64.68         85
multilingual-e5-large-instruct   64.41         84.78

The gte-large-en-v1.5 model comes out ahead of both comparison models on both benchmarks.

Getting Started

Want to try out the model for yourself? Here’s some sample code to get you started:

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer (custom architecture, hence trust_remote_code=True)
model_path = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

# Tokenize some input text
input_texts = ["What is the capital of China?", "How to implement quick sort in python?"]
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

# Get the embeddings: the [CLS] token's final hidden state represents each text
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = outputs.last_hidden_state[:, 0]

# Normalize so the dot product below equals cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)

# Calculate the similarity scores (scaled to 0-100)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

Examples

  • Similarity score between "what is the capital of China?" and "Beijing": 41.86354093370361
  • Embedding dimension of the gte-large-en-v1.5 model: 1024
  • Average score of the gte-large-en-v1.5 model on the MTEB benchmark: 65.39

Limitations

While gte-large-en-v1.5 is a powerful tool, it’s not perfect. Let’s explore some of its limitations.

Context Length Limitations

While gte-large-en-v1.5 can handle a context length of up to 8192 tokens, this is still a limit. What if you need to analyze longer texts? You'll have to split the text into smaller chunks, which can affect the accuracy of the results; a simple chunk-and-average workaround is sketched below.
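
Here is a minimal sketch of that chunk-and-average workaround; embed_long_text is a hypothetical helper, and whether mean-pooling chunk embeddings preserves enough signal is task-dependent:

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_path = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).eval()

def embed_long_text(text, chunk_tokens=8000):
    # Split the token stream into windows that fit the 8192-token context
    # (leaving headroom for special tokens; re-tokenizing decoded chunks
    # can shift boundaries slightly, which is fine for a sketch)
    ids = tokenizer(text, add_special_tokens=False)['input_ids']
    chunks = [tokenizer.decode(ids[i:i + chunk_tokens]) for i in range(0, len(ids), chunk_tokens)]
    batch = tokenizer(chunks, max_length=8192, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        out = model(**batch)
    # Mean-pool the per-chunk [CLS] embeddings into one document vector
    return F.normalize(out.last_hidden_state[:, 0].mean(dim=0, keepdim=True), p=2, dim=1)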

Training Data Limitations

gte-large-en-v1.5 was trained on a specific dataset (c4-en) and may not perform well on texts from other domains or languages. Have you ever tried to use a model trained on one type of text on a completely different type of text? It might not work as well as you expect.

Fine-Tuning Limitations

gte-large-en-v1.5 requires fine-tuning for specific tasks, which can be time-consuming and require significant computational resources. What if you don’t have the resources or expertise to fine-tune the model? You might not get the best results.
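
If you do want to fine-tune, the sentence-transformers library keeps the boilerplate small. A minimal sketch with a toy dataset of (query, relevant passage) pairs; note that passing trust_remote_code through SentenceTransformer requires a recent library version, which is an assumption to verify:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)

# Toy training pairs -- replace with your own domain data
train_examples = [
    InputExample(texts=["what is the capital of China?", "Beijing is the capital of China."]),
    InputExample(texts=["how to sort a list in python?", "Use the built-in sorted() function."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: each pair's passage serves as a negative for the other queries
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)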

Conclusion

The gte-large-en-v1.5 model is a powerful tool for natural language processing tasks. With its ability to handle long pieces of text and produce high-quality embeddings, it’s perfect for tasks like text retrieval, question answering, and more. Give it a try and see what you can do with it!

Dataloop's AI Development Platform

Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.

Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAIF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Version your pipelines to make sure the deployed pipeline is always the stable one.

Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.