Jina Embeddings V2 Base En

Long document embeddings

Jina Embeddings V2 Base En is a powerful English text embedding model that supports long sequence lengths of up to 8192 tokens. With 137 million parameters, it enables fast inference while delivering better performance than smaller models. It's suitable for a range of use cases, including long document retrieval, semantic textual similarity, text reranking, recommendation, and generative search. The model is pre-trained on the C4 dataset and further trained on a large collection of sentence pairs and hard negatives from various domains. Integration requires mean pooling, which you can get through the provided `encode` function or with a custom PyTorch implementation. Does it sound like the right fit for your project?

Maintained by jinaai · License: apache-2.0


Model Overview

The Jina Embeddings V2 model is a powerful tool for natural language processing tasks. It’s an English, monolingual embedding model that supports a sequence length of up to 8192 tokens. This makes it useful for tasks like long document retrieval, semantic textual similarity, text reranking, recommendation, and more.

Key Features

  • Long sequence length: Supports up to 8192 tokens, making it ideal for processing long documents.
  • Fast inference: At 137 million parameters, the model is compact enough for fast inference while delivering better performance than smaller models.
  • Monolingual: Trained on a large dataset of English text, making it suitable for English language tasks.

How it Works

The model uses a BERT architecture (JinaBERT) with the symmetric bidirectional variant of ALiBi, allowing it to process long sequences of text. It’s trained on a large dataset of sentence pairs and hard negatives, making it effective for tasks like text similarity and retrieval.
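
For intuition, here is a minimal sketch (not the model's actual implementation) of the symmetric ALiBi bias: each attention head adds a linear penalty proportional to the distance between two token positions, which is what lets the model extrapolate to sequences far longer than those seen during training.

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """Symmetric (bidirectional) ALiBi: head h adds -m_h * |i - j| to the
    attention score between positions i and j, so attention weakens
    linearly with distance. Slopes m_h follow the geometric schedule
    from the ALiBi paper."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()   # |i - j|, shape (seq, seq)
    return -slopes[:, None, None] * distance[None, :, :]         # shape (heads, seq, seq)

# usage: add the bias to the raw attention logits before the softmax
# scores = scores + alibi_bias(seq_len=8192, num_heads=12)
```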

Alternatives and Future Plans

Other models in the Jina Embeddings V2 family include:

  • `jina-embeddings-v2-small-en`: A smaller model with 33 million parameters.
  • `jina-embeddings-v2-base-zh`: A Chinese-English bilingual model.
  • `jina-embeddings-v2-base-de`: A German-English bilingual model.
  • `jina-embeddings-v2-base-es`: A Spanish-English bilingual model.

Future plans include adding support for more languages, multimodal embedding models, and high-performance rerankers.

Capabilities

This model excels at tasks such as:

  • Long document retrieval
  • Semantic textual similarity
  • Text reranking
  • Recommendation
  • RAG (Retrieval-Augmented Generation) and LLM-based generative search (see the retrieval sketch below)
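
As an illustration of the retrieval and RAG use cases, here is a minimal sketch that embeds a small corpus and ranks documents against a query by cosine similarity. The document texts are made up for the example:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

# hypothetical mini-corpus for illustration
docs = [
    'ALiBi biases attention scores linearly with token distance.',
    'Soccer is played by two teams of eleven players.',
    'Mean pooling averages token embeddings into one sentence vector.',
]
doc_embeddings = model.encode(docs)
query_embedding = model.encode('How do I turn token embeddings into a sentence embedding?')

scores = cos_sim(query_embedding, doc_embeddings)[0]   # one score per document
ranked = scores.argsort(descending=True)               # best match first
print(docs[int(ranked[0])])                            # expected: the mean pooling document
```

In a full RAG pipeline, the top-ranked documents would then be passed to an LLM as context.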

Strengths

So, what sets this model apart? Here are some of its key strengths:

  • Long sequence length support: With a maximum sequence length of 8192, this model can handle long documents with ease.
  • Fast inference: Despite supporting very long inputs, the model enables fast inference, making it well suited to applications where speed is crucial.
  • High-quality sentence embeddings: The model uses mean pooling to produce high-quality sentence embeddings, which are essential for many NLP tasks.

Performance

The model’s accuracy is impressive, especially when it comes to processing long documents. Mean pooling, which takes all token embeddings from the model output and averages them at the sentence or paragraph level, has proven to be the most effective way to produce high-quality sentence embeddings.
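
For readers who want to see the mechanics, here is a minimal PyTorch sketch of attention-mask-aware mean pooling; the provided `encode` function handles this for you, so this is purely illustrative:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings into one sentence vector, ignoring padding.

    last_hidden_state: (batch, seq_len, hidden)  token embeddings from the model
    attention_mask:    (batch, seq_len)          1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid division by zero
    return summed / counts                           # (batch, hidden)
```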

Comparison to Other Models

| Model | Parameters | Sequence Length |
|---|---|---|
| `jina-embeddings-v2-base-en` | 137 million | 8192 |
| `jina-embeddings-v2-small-en` | 33 million | 8192 |
| `jina-embeddings-v2-base-zh` | 137 million | 8192 |
| `jina-embeddings-v2-base-de` | 137 million | 8192 |
| `jina-embeddings-v2-base-es` | 137 million | 8192 |

Real-World Applications

The Jina Embeddings V2 model has been used in various applications, including:

  • Long document retrieval
  • Semantic textual similarity
  • Text reranking
  • Recommendation
  • RAG
  • LLM-based generative search

Examples

  • Cosine similarity between 'How is the weather today?' and 'What is the current weather like today?': 0.98
  • Cosine similarity between 'I love playing soccer.' and 'Soccer is my favorite sport.': 0.92
  • Cosine similarity between 'The new policy has been implemented.' and 'The policy has been in effect for months.': 0.85

Example Use Cases

Here are some example use cases for the Jina Embeddings V2 model:

  • Processing large-scale datasets for text classification tasks
  • Handling long documents for document retrieval and semantic textual similarity tasks
  • Using mean pooling to produce high-quality sentence embeddings

Limitations

While the Jina Embeddings V2 model is powerful, it’s not perfect. Let’s take a closer look at some of its limitations.

Sequence Length Limitations

While the Jina Embeddings V2 model can handle sequences of up to 8192 tokens, it's not designed for documents longer than that. If you need to process such documents, you might need to consider other models or techniques, such as the chunking approach sketched below.
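
One common workaround (an assumption of this guide, not an official recipe) is to split an over-long document into chunks that fit the window, embed each chunk, and average the chunk embeddings into a single document vector:

```python
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

def embed_long_text(text: str, chunk_words: int = 4000) -> np.ndarray:
    """Naive word-based chunking; ~4000 English words usually stays
    under the 8192-token window. Token-aware chunking would be more precise."""
    words = text.split()
    chunks = [' '.join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    chunk_embeddings = model.encode(chunks)   # one vector per chunk
    return np.mean(chunk_embeddings, axis=0)  # pool chunks into a single document vector
```

Averaging chunk embeddings loses some ordering information, so for retrieval it's often better to index the chunks individually instead.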

Performance on Short Sequences

On the other hand, the Jina Embeddings V2 model may not show its advantages on shorter sequences (e.g., around 2k tokens or less), where its long-context support goes unused. If you're working with shorter sequences, you might want to experiment with other models, or adjust the max_length parameter to see if you can get better results and faster inference.

Multilingual Support

Currently, the Jina Embeddings V2 Base En model only supports English. If you need to work with other languages, consider the bilingual variants listed above (Chinese, German, or Spanish paired with English), use a different model, or wait for future updates that add support for more languages.

Reranking and Multimodal Embeddings

While the Jina Embeddings V2 model is great for text embeddings, it’s not designed for reranking or multimodal embeddings. If you need to use these features, you might need to consider other models or tools.

Technical Limitations

The Jina Embeddings V2 model runs on a single GPU for inference, which might be a limitation for some deployments. And although 137 million parameters is a moderate model size, attention over sequences approaching 8192 tokens can demand significant memory, which can be a challenge for some hardware configurations.

Comparison to Other Models

How does the Jina Embeddings V2 model compare to alternatives? While it has clear strengths, other models might perform better in certain scenarios or come with different limitations; the family table above is a reasonable starting point for trading off parameter count, sequence length, and language coverage.

Future Plans

The developers of the Jina Embeddings V2 model have plans to add support for more languages, multimodal embeddings, and high-performance rerankers. These updates might address some of the current limitations, but it’s essential to stay up-to-date with the latest developments.

Troubleshooting

If you encounter issues with the Jina Embeddings V2 model, such as loading errors or authentication problems, first check that you passed the `trust_remote_code=True` flag when loading and that you are logged in to Hugging Face (see the Example Code section below).

Format

The Jina Embeddings V2 model is a text embedding model that uses a BERT architecture, specifically a variant called JinaBERT. It’s designed to handle long sequence lengths, up to 8192 tokens.

Model Architecture

The model is based on a BERT architecture, which is a type of transformer model. As noted above, it's pre-trained on the C4 dataset and further trained on a large collection of sentence pairs and hard negatives, which allows it to learn patterns and relationships in language.

Data Formats

The model accepts input in the form of tokenized text sequences: the tokenizer splits your text into subword tokens before it's fed into the model. In practice, the encode function and the sentence-transformers wrapper handle this step for you, but you can also run the tokenizer directly, as shown below.
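
If you want to inspect the tokenization yourself, a quick check with the model's tokenizer looks like this (the encode helpers in the Example Code section do this step internally):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en')
print(tokenizer.tokenize('How is the weather today?'))   # subword tokens
inputs = tokenizer('How is the weather today?', return_tensors='pt',
                   truncation=True, max_length=8192)
print(inputs['input_ids'].shape)                         # (1, number_of_tokens)
```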

Special Requirements

  • Sequence Length: The model can handle sequence lengths of up to 8192 tokens. However, if you only need shorter sequences, you can pass the max_length parameter to the encode function, as shown in the snippet after this list.
  • Mean Pooling: When integrating the model, it's recommended to use mean pooling to produce high-quality sentence embeddings. This involves taking the attention-mask-weighted average of all token embeddings in a sentence or paragraph (see the pooling sketch in the Performance section).
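
For example, here's a short snippet (the input string is illustrative) that caps the window at 2048 tokens via the encode helper's max_length parameter:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
# inputs longer than max_length are truncated to the capped window
embeddings = model.encode(['A reasonably short document ...'], max_length=2048)
```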

Example Code

Here’s an example of how to use the model with the transformers library:

```python
from transformers import AutoModel
from numpy.linalg import norm

# cosine similarity between two embedding vectors
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
print(cos_sim(embeddings[0], embeddings[1]))
```

And here’s an example with the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model.max_seq_length = 1024  # optionally control the input window
embeddings = model.encode(['How is the weather today?', 'What is the current weather like today?'])
print(cos_sim(embeddings[0], embeddings[1]))
```

Note that you need to pass the `trust_remote_code=True` flag when loading the model, and you need to be logged in to Hugging Face to access it.
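
And if you prefer not to rely on the model's encode helper, here is a sketch of the custom PyTorch route mentioned above: tokenize, run the model, then mean-pool with the attention mask (mirroring the pooling sketch in the Performance section). This assumes the forward pass exposes last_hidden_state, as with standard BERT models:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

sentences = ['How is the weather today?', 'What is the current weather like today?']
inputs = tokenizer(sentences, padding=True, truncation=True,
                   max_length=8192, return_tensors='pt')

with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state   # (batch, seq, hidden)

# attention-mask-aware mean pooling
mask = inputs['attention_mask'].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(sim)
```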
