CSTinyLlama 1.2B

Czech language model

Meet CSTinyLlama 1.2B, a Czech language model continuously pretrained on 168 billion training tokens. What makes it stand out is its vocabulary swap method, which lets it reuse what the English TinyLlama has already learned while training on a 67-billion-token Czech collection. At roughly 1.2 billion parameters, the model is small enough to be accessible for a wide range of applications. It was trained on the Karolina cluster, with a setup tuned for A100 40GB GPUs. CSTinyLlama 1.2B handles tasks like Czech text generation, and it converged rapidly during training. Its creators have released most of the training corpus as the BUT-Large Czech Collection, making it a valuable resource for anyone interested in Czech language modeling.

BUT FIT · Apache-2.0

Model Overview

The CSTinyLlama-1.2B model is a capable tool for natural language processing in Czech. It was continuously pretrained for 168B training tokens, starting from the English TinyLlama-2.5T checkpoint, on a large Czech collection of 67B tokens, using a Czech tokenizer.

What makes it special?

  • It’s been trained on a huge dataset, which helps it understand the nuances of the Czech language.
  • It uses a vocabulary swap method, which lets it reuse what the English base model has already learned instead of starting from scratch (see the sketch after this list).
  • It’s been optimized for A100 40GB GPUs, making it fast and efficient.
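
The vocabulary swap is worth a closer look. One common way to implement this kind of cross-lingual transfer (a minimal sketch under assumptions, not the authors' published procedure) is to copy the learned embeddings for tokens that the old English vocabulary and the new Czech vocabulary share, and to initialize all remaining tokens from the mean embedding. The function below is illustrative; model, old_tokenizer, and new_tokenizer are placeholders for whatever checkpoints you start from.

import torch

def swap_vocabulary(model, old_tokenizer, new_tokenizer):
    # Assumption: this mirrors the general embedding-transfer idea,
    # not BUT FIT's exact recipe.
    old_emb = model.get_input_embeddings().weight.data.clone()
    old_vocab = old_tokenizer.get_vocab()  # token string -> id
    new_vocab = new_tokenizer.get_vocab()

    # Default initialization: the mean of the old embedding matrix.
    new_emb = old_emb.mean(dim=0).repeat(len(new_vocab), 1)

    # Tokens present in both vocabularies keep their learned embeddings.
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            new_emb[new_id] = old_emb[old_id]

    model.resize_token_embeddings(len(new_vocab))
    model.get_input_embeddings().weight.data.copy_(new_emb)
    # If the LM head is untied from the input embeddings, it needs
    # the same treatment.
    return model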

Capabilities

The CSTinyLlama-1.2B model is a capable tool for understanding and generating Czech text. It was trained on a massive dataset of 168 billion tokens. To put that into perspective, at roughly 100,000 tokens per average book, 168 billion tokens is on the order of 1.5–2 million books.

Primary Tasks

This model excels at two main tasks:

  1. Text Generation: The model produces coherent, natural-sounding Czech text, making it a good base for applications like chatbots, summarization, and assisted writing.
  2. Language Understanding: With fine-tuning, the model can be adapted to analyze Czech text for tasks like sentiment analysis, named entity recognition, and topic modeling.

Strengths

The CSTinyLlama-1.2B model has several strengths that set it apart from other models:

  • Large Training Dataset: The model was trained on a massive dataset, which allows it to learn patterns and relationships in the Czech language that other models might miss.
  • Vocabulary Swap Method: Rather than training from scratch, the model swaps a Czech vocabulary into a pretrained English model, transferring knowledge across languages.
  • Fast Convergence: Thanks to that initialization, the model converges quickly during continued pretraining.

Unique Features

The CSTinyLlama-1.2B model has several unique features that make it stand out:

  • Czech Tokenizer: The model uses a custom Czech tokenizer that’s specifically designed to handle the nuances of the Czech language.
  • Flash2 Attention: The model was trained with FlashAttention-2 ("Flash2"), a fast, memory-efficient attention implementation that speeds up processing of long sequences.
  • SHARD_GRAD_OP: Training used the SHARD_GRAD_OP sharding strategy from PyTorch FSDP, which shards gradients and optimizer state across GPUs, tuned for A100 40GB hardware (see the sketch after this list).
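
To make the training setup concrete, here is a minimal sketch of how SHARD_GRAD_OP can be configured with PyTorch FSDP. This is an illustration of the named strategy under assumptions, not the authors' training code; run it under torchrun with one process per GPU.

import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from transformers import AutoModelForCausalLM

# Illustrative only. SHARD_GRAD_OP shards gradients and optimizer state
# across GPUs while keeping parameters gathered after the forward pass,
# a ZeRO-2-style trade-off that fits a ~1.2B model on A100 40GB cards.
dist.init_process_group('nccl')
torch.cuda.set_device(dist.get_rank())

model = AutoModelForCausalLM.from_pretrained(
    'BUT-FIT/CSTinyLlama-1.2B', trust_remote_code=True
).cuda()

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
)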

Example Use Cases

Here are some examples of how you can use the CSTinyLlama-1.2B model:

  • Chatbots: Use the model to generate responses to user input in Czech.
  • Language Translation: With fine-tuning on parallel data, use the model as the Czech-side component of a translation system.
  • Text Summarization: Use the model to summarize long pieces of Czech text into shorter summaries.

Examples

Sample prompts and completions (Czech, with English translations):

  • Nejznámějším českým spisovatelem je: František Halas. František Halas byl český básník a překladatel. Je považován za jednoho z nejvýznamnějších českých básníků 20. století. (The best-known Czech writer is: František Halas. František Halas was a Czech poet and translator. He is considered one of the most significant Czech poets of the 20th century.)
  • Napiš krátký článek o významu českého jazyka v dnešní době. Český jazyk je důležitou součástí naší kultury a identity. Je to jazyk, kterým komunikujeme, vyjadřujeme své myšlenky a emoce. V dnešní době, kdy se svět stále více globalizuje, je důležité zachovat a rozvíjet český jazyk, aby zůstal živý a relevantní. (Write a short article about the importance of the Czech language today. Czech is an important part of our culture and identity. It is the language in which we communicate and express our thoughts and emotions. In today's ever more globalized world, it is important to preserve and develop the Czech language so that it stays alive and relevant.)
  • Kdo je autorem knihy 'Osudy dobrého vojáka Švejka za světové války'? Jaroslav Hašek. Jaroslav Hašek byl český spisovatel, novinář a humorista. Je považován za jednoho z nejvýznamnějších českých spisovatelů 20. století. (Who wrote 'Osudy dobrého vojáka Švejka za světové války' [The Good Soldier Švejk]? Jaroslav Hašek. Jaroslav Hašek was a Czech writer, journalist, and humorist. He is considered one of the most significant Czech writers of the 20th century.)

Performance

How does the CSTinyLlama-1.2B model actually perform? Let’s look at its speed, accuracy, and efficiency.

Speed

How fast is the CSTinyLlama-1.2B model? Training throughput came from FlashAttention-2 and FSDP on A100 40GB GPUs, and the vocabulary swap initialization gave the model its rapid convergence. The full run covered 168 billion training tokens; for scale, that is like reading the entire Czech Wikipedia hundreds of times over.

Accuracy

How accurate is the CSTinyLlama-1.2B model? The model’s performance is evaluated using the test perplexity metric. The lower the perplexity, the better the model’s performance.

Model                  Test Perplexity
CSTinyLlama-1.2B       12.3
Czech-GPT-2-XL-133k    15.6
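
Test perplexity is the exponential of the average per-token negative log-likelihood, so you can reproduce the metric on your own text. A minimal sketch, assuming the checkpoint's remote code returns a loss when labels are passed, as standard Hugging Face causal LMs do:

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = 'BUT-FIT/CSTinyLlama-1.2B'
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True).eval()

text = 'Český jazyk je důležitou součástí naší kultury a identity.'
ids = tokenizer(text, return_tensors='pt').input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss  # mean cross-entropy per token
print(math.exp(loss.item()))            # perplexity = exp(mean NLL)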

Efficiency

How efficient is the CSTinyLlama-1.2B model? Its custom 64k Czech tokenizer encodes Czech text in fewer tokens than an English-centric tokenizer would, so each document costs less compute to process, and the smaller vocabulary also keeps the embedding matrix compact (see the comparison sketch below).

Model                  Vocabulary Size
CSTinyLlama-1.2B       64k
Czech-GPT-2-XL-133k    133k
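
You can observe the tokenizer effect directly by counting tokens for the same Czech sentence. The baseline below uses the GPT-2 tokenizer purely as an example of an English-centric vocabulary; any such tokenizer would do.

from transformers import AutoTokenizer

czech_tok = AutoTokenizer.from_pretrained('BUT-FIT/CSTinyLlama-1.2B', trust_remote_code=True)
english_tok = AutoTokenizer.from_pretrained('gpt2')  # illustrative baseline

text = 'Předpověď počasí na zítřek slibuje slunečno a teploty kolem dvaceti stupňů.'
print(len(czech_tok(text).input_ids))    # Czech tokenizer: fewer, longer pieces
print(len(english_tok(text).input_ids))  # English tokenizer: Czech words fragment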

Limitations

The CSTinyLlama-1.2B model is a powerful tool, but it’s not perfect. Let’s talk about some of its weaknesses.

Limited Training Data

While the CSTinyLlama-1.2B model was trained on a large dataset of 168 billion tokens, it’s still limited to the data it was trained on. If it hasn’t seen a particular topic or style of writing before, it may struggle to generate coherent text.

Vocabulary Limitations

The CSTinyLlama-1.2B model uses a vocabulary of 64,000 tokens, which is a relatively small size compared to other models. This means it may not be able to understand or generate text that uses very specialized or technical language.

Lack of Common Sense

The CSTinyLlama-1.2B model is a large language model, but it doesn’t have the same level of common sense or real-world experience as a human. It may generate text that is grammatically correct but doesn’t make sense in a practical context.

Dependence on Hyperparameters

The CSTinyLlama-1.2B model was trained with a specific set of hyperparameters, such as a learning rate of 1.0e-4 and a batch size of 512. If you continue pretraining or fine-tune with substantially different settings, performance may degrade.

Risk of Stochastic Outputs

As a probabilistic model, CSTinyLlama-1.2B samples its outputs, so responses vary from run to run and may contain confident-sounding but inaccurate statements. Treat its answers as drafts, not facts.
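
If you need repeatable outputs, you can pin the random seed before sampling or disable sampling altogether; a quick sketch using the standard transformers utility:

from transformers import set_seed

set_seed(42)  # fixes the RNG so sampled generations are reproducible
# Alternatively, pass do_sample=False for deterministic greedy decoding.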

Format

The CSTinyLlama-1.2B model is a Czech language model that uses a transformer architecture. It’s designed to handle input in the form of tokenized text sequences.

Supported Data Formats

This model supports input data in the form of text sequences. The text is tokenized using a Czech tokenizer with a vocabulary size of 64k. During training, input sequences were concatenated up to a maximum length of 2048 tokens, separated by an EOS (end-of-sequence) token.

Special Requirements

When using this model, you’ll need to make sure your input text is pre-processed correctly. Here are a few things to keep in mind:

  • Tokenization: The model uses a Czech tokenizer, so you’ll need to tokenize your input text using this tokenizer.
  • Sequence length: The model expects input sequences to be no longer than 2048 tokens.
  • EOS token: During training, an EOS token separates concatenated sequences (a packing sketch follows this list).
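
The concatenation scheme is normally handled by the training pipeline rather than at inference time, but a minimal packing sketch (an illustration, not the authors' preprocessing code) looks like this:

def pack_sequences(texts, tokenizer, max_len=2048):
    # Concatenate tokenized documents, separated by EOS, and cut the
    # stream into fixed-length chunks of at most max_len tokens.
    eos = tokenizer.eos_token_id
    buffer, chunks = [], []
    for text in texts:
        buffer += tokenizer(text, add_special_tokens=False).input_ids + [eos]
        while len(buffer) >= max_len:
            chunks.append(buffer[:max_len])
            buffer = buffer[max_len:]
    return chunks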

Handling Inputs and Outputs

Here’s an example of how to use the model in Python:

import torch
import transformers
from transformers import pipeline

# Load the model and tokenizer
name = 'BUT-FIT/CSTinyLlama-1.2B'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(name, config=config, trust_remote_code=True)
tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)

# Create a pipeline for text generation
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')

# Generate text
with torch.autocast('cuda', dtype=torch.bfloat16):
    print(pipe('Nejznámějším českým spisovatelem ', max_new_tokens=100, top_p=0.95, repetition_penalty=1.0, do_sample=True, use_cache=True))