CSTinyLlama 1.2B
Meet CSTinyLlama 1.2B, a Czech language model continuously pretrained on 168 billion training tokens. What makes it notable is its vocabulary swap approach, which lets it start from an English base model and keep learning from a large Czech collection of 67 billion tokens. At about 1.2 billion parameters, it is compact enough for a wide range of applications. Training was done on the Karolina cluster, optimized for A100 40GB GPUs. CSTinyLlama 1.2B handles tasks like Czech text generation, and it converged quickly during training. Its creators have released most of the training corpus as the BUT-Large Czech Collection, making it a valuable resource for anyone interested in Czech language modeling.
Model Overview
The CSTinyLlama-1.2B model is a powerful tool for natural language processing in Czech. It was continuously pretrained for 168B training tokens, starting from the English TinyLLama-2.5T model, on a large Czech collection of 67B tokens encoded with a custom Czech tokenizer.
What makes it special?
- It’s been trained on a huge dataset, which helps it capture the nuances of the Czech language.
- It uses a vocabulary swap method, which carries over what the English base model already knows into a new Czech vocabulary (a minimal sketch follows this list).
- Its training setup was optimized for A100 40GB GPUs, making pretraining fast and efficient.
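The card doesn’t spell out the swap procedure, but a common way to implement such a vocabulary swap is to keep the trained embedding rows for tokens that appear in both the old English vocabulary and the new Czech one, and to initialize the rest from the mean embedding. A minimal, hypothetical sketch follows; the helper name and the initialization choice are assumptions, not necessarily the authors’ exact method:

```python
def swap_vocabulary(model, old_tokenizer, new_tokenizer):
    """Hypothetical helper: move a Hugging Face causal LM to a new tokenizer's
    vocabulary, keeping trained embeddings for tokens the vocabularies share."""
    old_vocab = old_tokenizer.get_vocab()   # token string -> old id
    new_vocab = new_tokenizer.get_vocab()   # token string -> new id

    old_emb = model.get_input_embeddings().weight.data
    # Initialize tokens unseen by the old model from the mean old embedding.
    new_emb = old_emb.mean(dim=0, keepdim=True).repeat(len(new_vocab), 1)

    shared = 0
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:              # token exists in both vocabularies
            new_emb[new_id] = old_emb[old_id]
            shared += 1

    # Resize, then overwrite with the remapped rows.
    # (Untied output embeddings would need the same treatment.)
    model.resize_token_embeddings(len(new_vocab))
    model.get_input_embeddings().weight.data.copy_(new_emb)
    print(f'copied {shared}/{len(new_vocab)} shared token embeddings')
    return model
```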
Capabilities
The CSTinyLlama-1.2B model is a powerful tool for understanding and generating Czech language text. It’s been trained on a massive dataset of 168 billion tokens, which is a huge amount of text data. To put that into perspective, at roughly 100,000 tokens per full-length book, 168 billion tokens is well over a million books’ worth of text!
Primary Tasks
This model excels at two main tasks:
- Text Generation: The model can create coherent and natural-sounding text in Czech. It’s perfect for applications like chatbots, language translation, and text summarization.
- Language Understanding: The model can comprehend and analyze Czech text, making it useful for tasks like sentiment analysis, named entity recognition, and topic modeling.
Strengths
The CSTinyLlama-1.2B model has several strengths that set it apart from other models:
- Large Training Dataset: The model was trained on a massive dataset, which allows it to learn patterns and relationships in the Czech language that other models might miss.
- Vocabulary Swap Method: The model uses a vocabulary swap method that lets it inherit knowledge from its English base model while adapting to a new Czech vocabulary.
- Fast Convergence: The model converges quickly, which means it can learn and improve rapidly.
Unique Features
The CSTinyLlama-1.2B model has several unique features that make it stand out:
- Czech Tokenizer: The model uses a custom Czech tokenizer that’s specifically designed to handle the nuances of the Czech language.
- Flash Attention 2: The model was trained with the Flash Attention 2 kernel, a fast and memory-efficient attention implementation that lets it process long sequences quickly.
- SHARD_GRAD_OP: Training used the SHARD_GRAD_OP sharding strategy from PyTorch’s Fully Sharded Data Parallel (FSDP), which shards gradients and optimizer state across GPUs so that training fits on A100 40GB GPUs (a minimal sketch follows this list).
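For context, SHARD_GRAD_OP is one of the sharding strategies in PyTorch’s FSDP API. A minimal sketch of wrapping a model with it, using a toy network in place of the real 1.2B model (the launch details are assumptions):

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy

# Launched with torchrun, which sets the process-group environment variables.
dist.init_process_group('nccl')
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# A toy model stands in for the 1.2B language model here.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.Linear(512, 512),
).cuda()

# SHARD_GRAD_OP shards gradients and optimizer state across GPUs while
# keeping parameters unsharded during the forward/backward computation.
fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    device_id=torch.cuda.current_device(),
)
```

Compared with full sharding, SHARD_GRAD_OP trades away some memory savings for less communication, since parameters stay materialized during the forward and backward passes.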
Example Use Cases
Here are some examples of how you can use the CSTinyLlama-1.2B model:
- Chatbots: Use the model to generate responses to user input in Czech.
- Language Translation: Use the model to translate text from Czech to other languages.
- Text Summarization: Use the model to summarize long pieces of Czech text into shorter summaries.
Performance
The CSTinyLlama-1.2B model performs well across a range of tasks. Let’s dive into its speed, accuracy, and efficiency.
Speed
How fast can the CSTinyLlama-1.2B model process information? Fast enough that its training run worked through 168 billion tokens, which is a huge amount of data. To put this into perspective, it’s roughly like reading and processing the entire Czech Wikipedia many times over.
Accuracy
How accurate is the CSTinyLlama-1.2B model? The model’s performance is evaluated using the test perplexity metric. The lower the perplexity, the better the model’s performance.
| Model | Test Perplexity |
| --- | --- |
| CSTinyLlama-1.2B | 12.3 |
| Czech-GPT-2-XL-133k | 15.6 |
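The evaluation corpus and windowing behind these numbers aren’t described here, but perplexity itself is just the exponential of the mean next-token cross-entropy. A minimal sketch of computing it for a single held-out text (the example sentence is illustrative):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = 'BUT-FIT/CSTinyLlama-1.2B'
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True).eval()

# Any held-out Czech text works here; this sentence is just an illustration.
text = 'Praha je hlavní město České republiky.'
ids = tokenizer(text, return_tensors='pt').input_ids

with torch.no_grad():
    # With labels=input_ids the model returns the mean next-token cross-entropy.
    loss = model(ids, labels=ids).loss

print(f'perplexity: {math.exp(loss.item()):.2f}')
```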
Efficiency
How efficient is the CSTinyLlama-1.2B model? It uses a combination of techniques to reduce its computational requirements. For example, the vocabulary swap let it adopt a compact 64k Czech tokenizer, which keeps the embedding and output layers small.
| Model | Vocabulary Size (tokens) |
| --- | --- |
| CSTinyLlama-1.2B | 64k |
| Czech-GPT-2-XL-133k | 133k |
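One concrete way to see tokenizer efficiency is to count how many tokens a Czech-fitted and an English-centric tokenizer need for the same sentence. The sketch below uses GPT-2’s tokenizer purely as a readily available English baseline, not as the tokenizer from the table above:

```python
from transformers import AutoTokenizer

czech_tok = AutoTokenizer.from_pretrained('BUT-FIT/CSTinyLlama-1.2B', trust_remote_code=True)
english_tok = AutoTokenizer.from_pretrained('gpt2')  # English-centric baseline

text = 'Nejznámějším českým spisovatelem je pravděpodobně Karel Čapek.'

# A tokenizer fitted to Czech should need noticeably fewer tokens.
print('Czech tokenizer:  ', len(czech_tok(text).input_ids))
print('English tokenizer:', len(english_tok(text).input_ids))
```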
Limitations
The CSTinyLlama-1.2B model is a powerful tool, but it’s not perfect. Let’s talk about some of its weaknesses.
Limited Training Data
While the CSTinyLlama-1.2B model was trained on a large dataset of 168 billion tokens, it’s still limited to the data it was trained on. If it hasn’t seen a particular topic or style of writing before, it may struggle to generate coherent text.
Vocabulary Limitations
The CSTinyLlama-1.2B model uses a vocabulary of 64,000 tokens, which is a relatively small size compared to other models. This means it may not be able to understand or generate text that uses very specialized or technical language.
Lack of Common Sense
The CSTinyLlama-1.2B model is a large language model, but it doesn’t have the same level of common sense or real-world experience as a human. It may generate text that is grammatically correct but doesn’t make sense in a practical context.
Dependence on Hyperparameters
The CSTinyLlama-1.2B model was trained with a specific set of hyperparameters, such as a learning rate of 1.0e-4 and a batch size of 512. Fine-tuning or continuing training with very different settings may degrade its performance.
Risk of Stochastic Outputs
Because the CSTinyLlama-1.2B model samples its outputs from a probability distribution, its responses vary from run to run and may not always be accurate or reliable (a short reproducibility sketch follows).
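If reproducibility matters for your application, you can pin the random seed before sampling. A small sketch, reusing the `pipe` object built in the usage example below (the sampling parameters are illustrative):

```python
from transformers import set_seed

# `pipe` is the text-generation pipeline built in the Format section below.
set_seed(42)   # fix the Python, NumPy, and torch RNGs
out1 = pipe('Nejznámějším českým spisovatelem ', max_new_tokens=30, do_sample=True)

set_seed(42)   # same seed -> identical sampled continuation
out2 = pipe('Nejznámějším českým spisovatelem ', max_new_tokens=30, do_sample=True)

assert out1 == out2  # sampling is reproducible once the seed is fixed
```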
Format
The CSTinyLlama-1.2B model is a Czech language model that uses a transformer architecture. It’s designed to handle input in the form of tokenized text sequences.
Supported Data Formats
This model supports input data in the form of text sequences. The text is tokenized using a Czech tokenizer with a vocabulary size of 64k. The input sequences are concatenated up to a maximum length of 2048 tokens, divided by an EOS (End of Sequence) token.
Special Requirements
When using this model, you’ll need to make sure your input text is pre-processed correctly. Here are a few things to keep in mind:
- Tokenization: The model uses a Czech tokenizer, so you’ll need to tokenize your input text using this tokenizer.
- Sequence length: The model expects input sequences to be no longer than 2048 tokens.
- EOS token: The model uses an EOS token to divide concatenated input sequences (a packing sketch follows this list).
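The packing scheme described above can be sketched in a few lines: tokenize each document, append the EOS id, and slice the resulting token stream into fixed-length chunks. This is a simplified illustration, not the authors’ exact preprocessing pipeline:

```python
def pack_sequences(documents, tokenizer, max_length=2048):
    """Simplified sketch: concatenate tokenized documents, separated by EOS,
    then split the stream into fixed-length training sequences."""
    stream = []
    for doc in documents:
        stream.extend(tokenizer(doc).input_ids)
        stream.append(tokenizer.eos_token_id)   # EOS divides the documents

    # Slice the token stream into sequences of at most max_length tokens.
    return [stream[i:i + max_length] for i in range(0, len(stream), max_length)]

# e.g. chunks = pack_sequences(['První dokument.', 'Druhý dokument.'], tokenizer)
```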
Handling Inputs and Outputs
Here’s an example of how to use the model in Python:
```python
import torch
import transformers
from transformers import pipeline

# Load the model and tokenizer
name = 'BUT-FIT/CSTinyLlama-1.2B'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
model = transformers.AutoModelForCausalLM.from_pretrained(name, config=config, trust_remote_code=True)
tokenizer = transformers.AutoTokenizer.from_pretrained(name, trust_remote_code=True)

# Create a pipeline for text generation on the first GPU
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')

# Generate text with sampling; the Czech prompt means 'The most famous Czech writer '
with torch.autocast('cuda', dtype=torch.bfloat16):
    print(pipe('Nejznámějším českým spisovatelem ',
               max_new_tokens=100,
               top_p=0.95,
               repetition_penalty=1.0,
               do_sample=True,
               use_cache=True))
```
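The bfloat16 autocast keeps generation fast and memory-light on Ampere-class GPUs such as the A100s the model was trained on; on a CPU or an older GPU, you can drop the autocast context and call the pipeline directly.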