Sarashina-Embedding-v1-1B
The Sarashina-Embedding-v1-1B model is a Japanese text embedding model built on the 1.2B-parameter Japanese LLM Sarashina2.1-1B and trained with multi-stage contrastive learning, achieving a state-of-the-art average score across 16 datasets in the Japanese Massive Text Embedding Benchmark (JMTEB). It maps sentences and paragraphs to a 1792-dimensional dense vector space, enabling applications such as semantic textual similarity, semantic search, and text classification. Its supervised fine-tuning stage teaches it accurate query-document similarity, making it a valuable resource for a wide range of Japanese natural language processing tasks.
Model Overview
The Sarashina-Embedding-v1-1B model is based on the 1.2B-parameter Japanese LLM “Sarashina2.1-1B” and is trained with multi-stage contrastive learning. It maps sentences and paragraphs to a 1792-dimensional dense vector space, making it well suited to tasks such as semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.
Capabilities
The Sarashina-Embedding-v1-1B model maps Japanese sentences and paragraphs to a dense vector space, which makes it useful for a variety of applications.
What can it do?
- Semantic textual similarity: Measure how similar two pieces of text are in meaning.
- Semantic search: Find relevant documents or sentences based on their meaning.
- Paraphrase mining: Identify different ways of expressing the same idea.
- Text classification: Categorize text into predefined categories.
- Clustering: Group similar texts together (a short clustering sketch follows this list).
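As an illustration of the clustering use case above, the sketch below groups sentence embeddings with k-means. This is a minimal example assuming scikit-learn is installed; the sentences are hypothetical and the snippet is not taken from the model's documentation.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Hypothetical sentences: two about the weather, two about food
sentences = [
    "今日は晴れていて気持ちがいい。",
    "明日の天気予報は雨です。",
    "このラーメン屋の味噌ラーメンが好きだ。",
    "近所に新しい寿司屋がオープンした。",
]

# Encode to dense vectors, then cluster the vectors
embeddings = model.encode(sentences)
labels = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(embeddings)

for sentence, label in zip(sentences, labels):
    print(label, sentence)  # sentences on the same topic should share a cluster id
```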
How does it work?
The model uses a two-stage learning process (a simplified sketch of the contrastive objective follows this list):
- Weakly-supervised learning: Trained on a large dataset of web-crawled data and open data to achieve generic text embedding performance.
- Supervised fine-tuning: Fine-tuned on specific datasets to learn accurate query-document similarity.
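The exact training recipe is not reproduced here, but the core idea behind contrastive learning with in-batch negatives can be sketched as follows. This is a simplified, illustrative loss function (InfoNCE-style, with an assumed temperature value), not the model's actual training code.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              doc_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """Simplified InfoNCE-style objective: pull each query toward its paired
    document and push it away from every other document in the batch."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # Cosine similarity between every query and every document in the batch
    logits = query_emb @ doc_emb.T / temperature
    # The correct document for query i is at index i (the diagonal)
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)
```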
Key features
- Large capacity: Handles inputs of up to 8,192 tokens (see the snippet after this list).
- High-dimensional output: Produces 1,792-dimensional dense vectors.
- Cosine similarity: Uses cosine similarity to measure the similarity between vectors.
- Japanese language support: Specifically designed for Japanese text embedding.
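These properties can be checked directly through the Sentence Transformers API. The attribute names below follow recent sentence-transformers releases (v3+), and the values in the comments reflect the figures stated above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

print(model.max_seq_length)                      # maximum input length: 8192 tokens
print(model.get_sentence_embedding_dimension())  # output dimensionality: 1792
print(model.similarity_fn_name)                  # similarity function used by model.similarity()
```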
Performance
The Sarashina-Embedding-v1-1B model performs strongly on Japanese embedding tasks. The sections below look at its context length, accuracy, and efficiency.
Context Length
The model accepts sequences of up to 8,192 tokens, so it can embed lengthy texts and documents. Handling long inputs lets it capture more of the surrounding context and nuance, which helps in downstream tasks.
Accuracy
The model achieves a state-of-the-art average score across the 16 datasets of the Japanese Massive Text Embedding Benchmark (JMTEB), a testament to its ability to learn accurate and informative representations of Japanese text. In particular, it performs well on:
- Semantic textual similarity: measuring the similarity between two pieces of text
- Semantic search: finding relevant documents or passages based on a query (see the retrieval sketch after this list)
- Paraphrase mining: identifying similar phrases or sentences
- Text classification: categorizing text into predefined categories
- Clustering: grouping similar texts together
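To make the semantic search setting concrete, the sketch below ranks a small set of hypothetical documents against a query by cosine similarity. It is an illustrative snippet, not an official benchmark or evaluation script.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

query = "埋め込みモデルの使い方を教えてください。"  # hypothetical query
documents = [                                      # hypothetical document collection
    "このライブラリでは文をベクトルに変換して類似度を計算できます。",
    "東京タワーの高さは333メートルです。",
    "埋め込みモデルは検索や分類など様々なタスクに利用できます。",
]

query_emb = model.encode([query])
doc_embs = model.encode(documents)

# Cosine similarity between the query and each document, highest score first
scores = model.similarity(query_emb, doc_embs)[0]
ranked = sorted(zip(documents, scores.tolist()), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")
```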
Efficiency
The model uses a Transformer-based architecture with 1.2B parameters, which keeps embedding computation practical and allows it to process large volumes of text quickly and accurately.
Comparison to Other Models
Here’s a comparison of the Sarashina-Embedding-v1-1B model with other models on the JMTEB benchmark:
| Model | Max Tokens | JMTEB Avg. Score |
|---|---|---|
| Sarashina-Embedding-v1-1B | 8192 | 75.50 |
| OpenAI/text-embedding-3-large | 8191 | 74.05 |
| cl-nagoya/ruri-large | 512 | 73.31 |
| pkshatech/GLuCoSE-base-ja-v2 | 512 | 72.23 |
| pkshatech/RoSEtta-base-ja | 1024 | 72.04 |
| intfloat/multilingual-e5-large | 512 | 70.90 |
Usage
To get started, simply install the Sentence Transformers library and load the model:
```bash
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")
```
Then, you can use the model to encode sentences and calculate similarities:
```python
# Encode sentences into 1792-dimensional embeddings
sentences = [...]  # your Japanese sentences go here
embeddings = model.encode(sentences)

# Compute the pairwise similarity matrix between all embeddings
similarities = model.similarity(embeddings, embeddings)
```
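The resulting `similarities` object is an N x N matrix whose (i, j) entry is the cosine similarity between `sentences[i]` and `sentences[j]`. A minimal end-to-end illustration, using hypothetical example sentences:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Hypothetical example sentences: the first two are related, the third is not
sentences = [
    "更級日記は平安時代中期に書かれた回想録です。",
    "更級日記は平安時代の文学作品です。",
    "今日の東京の天気は晴れです。",
]

embeddings = model.encode(sentences)
print(embeddings.shape)    # (3, 1792): one 1792-dimensional vector per sentence

similarities = model.similarity(embeddings, embeddings)
print(similarities)        # 3x3 matrix; the first two sentences should score highest together
```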
Limitations
The Sarashina-Embedding-v1-1B model is not perfect. Let’s explore some of its limitations.
Training Data Bias
The model was trained on a large dataset, but it’s possible that the data may contain biases or imbalances. For example, the dataset may have more texts from certain domains or styles, which could affect the model’s performance on other types of texts.
Limited Contextual Understanding
While the model can understand the meaning of sentences and paragraphs, it may struggle with more complex or nuanced contexts. For instance, it may not fully grasp the subtleties of human emotions, sarcasm, or implied meaning.
Dependence on Pre-Training Data
The model’s performance is heavily dependent on the quality and diversity of the pre-training data. If the data is limited or biased, the model’s performance may suffer.
Limited Handling of Out-of-Vocabulary Words
The model may produce weaker embeddings for terms that are rare or absent in its training data, such as highly specialized jargon or newly coined words. This can be a challenge when working with specialized domains or emerging topics.
Computational Resources
The model requires significant computational resources to run, which could be a limitation for users with limited hardware or infrastructure.
Commercial Use Restrictions
The model is licensed under the Sarashina Model NonCommercial License Agreement, which restricts commercial use. If you’re interested in using the model for your business, please contact the developers through their contact page.