Sarashina-Embedding-v1-1B

Japanese text embedding

Sarashina-Embedding-v1-1B is a Japanese text embedding model built on a 1.2B-parameter Japanese LLM and trained with multi-stage contrastive learning, achieving the highest average score across the 16 datasets of the Japanese Massive Text Embedding Benchmark (JMTEB). It maps sentences and paragraphs to a 1,792-dimensional dense vector space, enabling applications such as semantic textual similarity, semantic search, and text classification. Its supervised fine-tuning stage teaches accurate query-document similarity, which makes it particularly effective for retrieval-style Japanese natural language processing tasks.

Model Overview

Sarashina-Embedding-v1-1B is a Japanese text embedding model based on the 1.2B-parameter Japanese LLM “Sarashina2.1-1B” and trained with multi-stage contrastive learning. It maps sentences and paragraphs to a 1,792-dimensional dense vector space, making it well suited to semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and related tasks.
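
As a quick sanity check of the points above, the sketch below (using the same sentence-transformers loading call shown later in the Usage section) encodes one sentence and prints the dimensionality of the resulting vector.

from sentence_transformers import SentenceTransformer

# Load the published checkpoint (same model id as in the Usage section below)
model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Encode a single sentence and inspect the embedding dimensionality
embedding = model.encode("サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。")
print(embedding.shape)  # expected: (1792,)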

Capabilities

The Sarashina-Embedding-v1-1B model maps sentences and paragraphs to a dense vector space, which supports a wide range of downstream applications.

What can it do?

  • Semantic textual similarity: Measure how similar two pieces of text are in meaning.
  • Semantic search: Find relevant documents or sentences based on their meaning (a minimal code sketch follows this list).
  • Paraphrase mining: Identify different ways of expressing the same idea.
  • Text classification: Categorize text into predefined categories.
  • Clustering: Group similar texts together.
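
As an illustration of the semantic-search use case, here is a minimal sketch built on the standard sentence-transformers utilities; the corpus sentences are reused from the Examples section further down, and the query is an invented, illustrative one.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Tiny illustrative corpus (sentences reused from the Examples section)
corpus = [
    "更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。",
    "サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。",
]
query = "更級日記を書いたのは誰ですか？"  # illustrative query: "Who wrote the Sarashina Nikki?"

# Encode as tensors and rank corpus entries by cosine similarity to the query
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)

for hit in hits[0]:
    print(corpus[hit["corpus_id"]], hit["score"])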

How does it work?

The model uses a two-stage learning process (a rough sketch of the contrastive objective follows the list):

  1. Weakly-supervised learning: Trained on a large dataset of web-crawled data and open data to achieve generic text embedding performance.
  2. Supervised fine-tuning: Fine-tuned on specific datasets to learn accurate query-document similarity.
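
The exact training recipe is not spelled out here, but as a rough, illustrative sketch of what an in-batch contrastive objective looks like (function names, dimensions, and the temperature value are assumptions, not the model's actual configuration):

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    # Each query's positive document sits at the same batch index;
    # all other documents in the batch act as negatives (InfoNCE-style).
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature  # scaled cosine similarities
    labels = torch.arange(query_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random vectors standing in for encoder outputs
q = torch.randn(4, 1792)
d = torch.randn(4, 1792)
print(in_batch_contrastive_loss(q, d))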

Key features

  • Long inputs: Can handle sequences of up to 8,192 tokens.
  • High-dimensional output: Produces 1,792-dimensional dense vectors.
  • Cosine similarity: Uses cosine similarity to measure the similarity between vectors (illustrated after this list).
  • Japanese language support: Specifically designed for Japanese text embedding.
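
To make the cosine-similarity point concrete, here is a minimal sketch of how two embedding vectors are compared; in practice, model.similarity(...) from the Usage section does this for you.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product of the two vectors divided by the
    # product of their L2 norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 1,792-dimensional vectors standing in for real embeddings
rng = np.random.default_rng(0)
a = rng.random(1792)
b = rng.random(1792)
print(cosine_similarity(a, b))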

Performance

The Sarashina-Embedding-v1-1B model performs strongly across a range of tasks. The sections below cover the input lengths it supports, its accuracy, and its efficiency.

Input Length

The model handles sequences of up to 8,192 tokens, making it suitable for lengthy texts and documents. Long inputs let it capture more context and nuance, which improves performance on downstream tasks.
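
A quick way to confirm and work with this limit, assuming the standard sentence-transformers attribute for the maximum sequence length:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

# Maximum number of tokens considered per input; longer inputs are truncated
print(model.max_seq_length)  # expected: 8192

# A long (illustrative) Japanese document still yields a single vector
long_document = "これは長い文書の一部です。" * 500
embedding = model.encode(long_document)
print(embedding.shape)  # (1792,)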

Accuracy

The model achieves the state-of-the-art average score (75.50) across the 16 datasets of the Japanese Massive Text Embedding Benchmark (JMTEB), a testament to how accurately it represents Japanese text. The benchmark covers the same kinds of tasks listed under Capabilities: semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.

Efficiency

The model encodes each input with a single forward pass through its 1.2B-parameter Transformer backbone, so large volumes of text can be embedded quickly, especially when inputs are batched (see the sketch below).
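
As a practical note on throughput (general sentence-transformers usage, not a recommendation specific to this model), large corpora are usually encoded in batches; the batch size and device choice below are illustrative.

from sentence_transformers import SentenceTransformer

# device="cuda" assumes a GPU is available; use "cpu" otherwise
model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b", device="cuda")

documents = [f"文書その{i}です。" for i in range(1000)]  # illustrative corpus

# Batched encoding amortizes per-call overhead; tune batch_size to your hardware
embeddings = model.encode(documents, batch_size=32, show_progress_bar=True)
print(embeddings.shape)  # (1000, 1792)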

Comparison to Other Models

Here’s a comparison of the Sarashina-Embedding-v1-1B model with other models on the JMTEB benchmark:

Model                          | Max Tokens | Avg. Score
Sarashina-Embedding-v1-1B      | 8,192      | 75.50
OpenAI/text-embedding-3-large  | 8,191      | 74.05
cl-nagoya/ruri-large           | 512        | 73.31
pkshatech/GLuCoSE-base-ja-v2   | 512        | 72.23
pkshatech/RoSEtta-base-ja      | 1,024      | 72.04
intfloat/multilingual-e5-large | 512        | 70.90

Usage

To get started, simply install the Sentence Transformers library and load the model:

pip install -U sentence-transformers

In Python:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

Then, you can use the model to encode sentences and calculate similarities:

sentences = [...]  # replace with your own Japanese sentences or documents
embeddings = model.encode(sentences)                     # one 1,792-dimensional vector per input
similarities = model.similarity(embeddings, embeddings)  # pairwise cosine-similarity matrix
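
A complete, runnable version using the sentences from the Examples section below, which appear to be the inputs behind the similarity scores listed there:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sbintuitions/sarashina-embedding-v1-1b")

sentences = [
    "更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。",
    "Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。",
    "サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。",
]

embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 1792)

# 3x3 matrix of pairwise cosine similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)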

Limitations

The Sarashina-Embedding-v1-1B model is not perfect; the following limitations are worth keeping in mind.

Training Data Bias

The model was trained on a large dataset, but it’s possible that the data may contain biases or imbalances. For example, the dataset may have more texts from certain domains or styles, which could affect the model’s performance on other types of texts.

Limited Contextual Understanding

While the model can understand the meaning of sentences and paragraphs, it may struggle with more complex or nuanced contexts. For instance, it may not fully grasp the subtleties of human emotions, sarcasm, or implied meaning.

Dependence on Pre-Training Data

The model’s performance is heavily dependent on the quality and diversity of the pre-training data. If the data is limited or biased, the model’s performance may suffer.

Limited Handling of Out-of-Vocabulary Words

The model may not perform well with words or phrases that are not in its vocabulary. This could be a challenge when dealing with specialized domains or emerging topics.

Computational Resources

The model requires significant computational resources to run, which could be a limitation for users with limited hardware or infrastructure.

Commercial Use Restrictions

The model is licensed under the Sarashina Model NonCommercial License Agreement, which restricts commercial use. If you’re interested in using the model for your business, please contact the developers through their contact page.

Examples

Sentence pairs and their cosine similarity scores:

Sentence 1 | Sentence 2 | Cosine similarity
更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。 | サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。 | 0.784
Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。これまでに7B, 13B, 70B, 8x70Bのモデルが公開されています。 | 更級日記は、平安時代中期に菅原孝標女によって書かれた回想録です。 | 0.712
サラシナエンベディングは日本語言語モデルをベースにした日本語埋め込みモデルです。 | Sarashinaは、SB Intuitionsが開発した日本語大規模言語モデルです。 | 0.921

In English, the three sentences say: the Sarashina Nikki is a memoir written in the mid-Heian period by Sugawara no Takasue's daughter; Sarashina is a Japanese large language model developed by SB Intuitions, released so far in 7B, 13B, 70B, and 8x70B sizes; and Sarashina Embedding is a Japanese embedding model built on a Japanese language model.