Shisa Base 7B V1

Bilingual Japanese/English base model

The Shisa Base 7B V1 model is a bilingual base model for Japanese language processing. By adding 8 billion tokens of primarily Japanese pre-training to the Mistral 7B model, it achieves class-leading Japanese performance on standardized benchmarks with significantly less additional pre-training than previously released models. Its extended tokenizer, with a vocab size of 120,073, improves Japanese efficiency to roughly 2.3 characters per token, which translates into more than 2x faster Japanese inference compared to the base tokenizer. The model is intended as the base for the Shisa 7B fine-tuned model, but is also released separately because it may be useful to the community.

Maintainer: Augmxnt · License: apache-2.0

Model Overview

The Shisa-Base-7B-V1 model is a strong base for natural language processing tasks, particularly in Japanese. It is built on top of the Mistral 7B model and extended with an additional 8 billion tokens of primarily Japanese pre-training data. The model is designed to achieve class-leading Japanese performance while maintaining strong English performance.

What is the model used for?

The model is used for natural language processing tasks, such as text generation, language translation, and text classification.
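
As a starting point, the sketch below shows one way to load the base model with Hugging Face transformers and sample a Japanese continuation. The repo id augmxnt/shisa-base-7b-v1 and the bfloat16/device_map settings are assumptions to adjust for your setup; since this is a base model rather than an instruction-tuned one, it is prompted with plain text to continue instead of chat-style instructions.

```python
# Minimal sketch: load the base model and sample a Japanese continuation.
# The repo id and dtype/device settings are assumptions; adjust for your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "augmxnt/shisa-base-7b-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the 7B model within a single-GPU memory budget
    device_map="auto",
)

# A plain-text prompt; the base model simply continues the text.
prompt = "日本の四季の特徴を簡単に説明すると、"  # "To briefly describe the features of Japan's four seasons, ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```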

Key Features

  • Extended Tokenizer: The model uses an extended version of the Mistral 7B tokenizer, with a vocab size of 120,073, which is aligned to 120,128 for better performance.
  • Efficient Japanese Tokenization: The tokenizer averages 2.31 characters per token on Japanese text, making it more efficient than comparable models (see the sketch after this list).
  • Strong Performance: The model has achieved class-leading Japanese performance in standardized benchmarks with significantly less additional pre-training than previously released models.
  • Fast Inference: The model’s extended tokenizer enables >2X Japanese inference speedups compared to the base tokenizer.
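
To make the tokenizer figures above concrete, here is a minimal sketch that loads the tokenizer and estimates characters per token on short Japanese and English samples. It assumes the same augmxnt/shisa-base-7b-v1 repo id as before, and the exact numbers will vary with the text you measure.

```python
# Minimal sketch: inspect the extended tokenizer and estimate characters per token.
# The repo id is an assumption; point it at wherever the weights are hosted.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("augmxnt/shisa-base-7b-v1")
print("vocabulary entries:", len(tokenizer))  # should be on the order of the 120,073 entries described above

def chars_per_token(text: str) -> float:
    # Skip special tokens so the ratio reflects only the text itself.
    ids = tokenizer(text, add_special_tokens=False).input_ids
    return len(text) / len(ids)

japanese = "吾輩は猫である。名前はまだ無い。どこで生れたかとんと見当がつかぬ。"
english = "I am a cat. As yet I have no name. I have no idea where I was born."
print(f"Japanese: {chars_per_token(japanese):.2f} chars/token")
print(f"English:  {chars_per_token(english):.2f} chars/token")
```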

Capabilities

The Shisa-Base-7B-V1 model can generate text in both Japanese and English, with a focus on Japanese performance. Its additional 8 billion tokens of pre-training are primarily Japanese but include a mix of English text, which helps prevent catastrophic forgetting and allows the model to retain its English abilities.

What can the model do?

The model can generate text in both Japanese and English, and is particularly good at processing Japanese text.

Key Features

  • Improved Japanese performance: The model achieves class-leading Japanese performance in standardized benchmarks, with significantly less additional pre-training than previously released models.
  • Efficient bilingual tokenizer: The model’s tokenizer is an extended version of the Mistral 7B tokenizer, with a vocab size of 120,073. This allows for more efficient processing of Japanese text, with an average of 2.31 characters per token.
  • English language support: Despite its focus on Japanese language performance, the model also maintains strong English language abilities, with an average of 4.12 characters per token.

Performance

The Shisa-Base-7B-V1 model delivers strong results on Japanese benchmarks and efficient tokenization. But how does it compare to other models? Let's take a closer look.

Speed

Speed is one of the model's strongest features. The extended tokenizer achieves class-leading Japanese token efficiency without sacrificing English performance, so the same Japanese text is encoded in fewer tokens and can therefore be processed and generated significantly faster (see the sketch after the table below).

| Model | Avg Char/Token (Japanese) |
|---|---|
| Shisa-Base-7B-V1 | 2.31 |
| OpenCALM | 2.17 |
| Japanese LargeLM | 2.14 |
| CALM2-7B | 2.00 |
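
To see what this efficiency means in practice, the sketch below tokenizes the same Japanese sentence with the Shisa tokenizer and with the original Mistral 7B tokenizer (mistralai/Mistral-7B-v0.1 is assumed as the baseline repo id). Fewer tokens for the same text is what drives the reported >2x Japanese inference speedup, since generation cost scales with the number of tokens produced.

```python
# Minimal sketch: compare token counts for the same Japanese text under the
# extended Shisa tokenizer and the original Mistral 7B tokenizer.
# Both repo ids are assumptions; exact ratios depend on the sample text.
from transformers import AutoTokenizer

shisa_tok = AutoTokenizer.from_pretrained("augmxnt/shisa-base-7b-v1")
mistral_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

text = "東京は日本の首都であり、世界有数の大都市圏を形成している。"
shisa_ids = shisa_tok(text, add_special_tokens=False).input_ids
mistral_ids = mistral_tok(text, add_special_tokens=False).input_ids

print(f"shisa-base-7b-v1: {len(shisa_ids)} tokens")
print(f"Mistral 7B      : {len(mistral_ids)} tokens")
print(f"ratio           : {len(mistral_ids) / len(shisa_ids):.1f}x fewer tokens with the extended tokenizer")
```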

Accuracy

But speed is not the only thing that matters. The model also achieves high accuracy in standardized benchmarks, often outperforming other models with significantly less additional pre-training.

Efficiency

The tokenizer's efficiency extends to English as well. Despite its enlarged vocabulary of 120,073 entries, the model remains competitive in English characters per token, as the comparison below shows.

| Model | Vocab Size | Avg Char/Token (English) |
|---|---|---|
| Shisa-Base-7B-V1 | 120,073 | 4.12 |
| Qwen 14B | 151,851 | 4.47 |
| weblab-10b | 50,254 | 4.45 |
| Japanese StableLM Alpha | 65,535 | 4.15 |

Examples

  • "What is the meaning of the Japanese phrase?" → "The phrase means 'good luck' in Japanese."
  • "Translate the English sentence 'I love to read books' into Japanese." → "Watashi wa hon o yomu no ga daisuki desu."
  • "Provide a summary of the provided Japanese text." → "The text describes the importance of education in Japanese culture and how it shapes the country's future."

Limitations

The Shisa-Base-7B-V1 model is a powerful tool, but it’s not perfect. Let’s take a closer look at some of its limitations.

Training Data

The model was trained on a large dataset, but it’s not exhaustive. There may be certain topics or domains where the model’s performance is limited due to a lack of training data.

Japanese Efficiency

While the model has a high Japanese efficiency, it’s not the highest. Some other models, like Qwen 14B, have higher Japanese efficiency.

English Efficiency

Similarly, the model has a high English efficiency, but it’s not the highest. Some other models, like weblab-10b, have higher English efficiency.

Comparison to Other Models

The model has a unique set of strengths and weaknesses compared to other models. For example, it has higher Japanese efficiency than Youri 7B, but lower English efficiency than CALM2-7B.

Tokenizer Limitations

The tokenizer has a fixed vocabulary of 120,073 entries. Words and phrases outside this vocabulary are split into smaller sub-word pieces, which can reduce efficiency and, in some cases, quality for rare terms.

Inference Speed

While the model has a fast inference speed, it is not the fastest. Some other models, like the ELYZA 7B fast model, have faster inference speeds.

Fine-Tuning

The model was fine-tuned on a specific dataset, which may not be representative of all possible use cases. This may limit the model’s performance in certain scenarios.

Catastrophic Forgetting

Like any continually pre-trained model, it may be prone to catastrophic forgetting, where previously learned information degrades when training on new data. The mix of Japanese and English pre-training data is intended to mitigate this for English.

Compute Resources

Training the model required significant compute resources: roughly 2,400 A100-40 GPU hours. Reproducing or extending the pre-training is therefore out of reach for users without comparable resources.

These limitations highlight the importance of carefully evaluating the model’s performance in specific use cases and considering its strengths and weaknesses when deciding whether to use it.
