Jerteh 81
Meet Jerteh 81, a BERT-style model designed specifically for the Serbian language. With 81 million parameters, it is based on the RoBERTa-base architecture and trained on a corpus of 4 billion tokens. The model performs strongly at masked language modeling for Serbian and handles Cyrillic and Latin script equally well. It can produce vector representations of words and fill in missing words in text, so whether you're working with Serbian language data or just curious about what the model can do, Jerteh 81 is worth exploring.
Model Overview
The jerteh-81 model is a natural language processing model designed specifically for the Serbian language. It is based on the RoBERTa-base architecture, has 81 million parameters, and was trained on a large corpus of Serbian texts consisting of 4 billion tokens.
Capabilities
The jerteh-81 model is capable of performing various tasks, including:
- Vectorizing words: The model can convert words into numerical vectors (embeddings) that can be used for tasks such as similarity comparison or as features for downstream models.
- Filling in missing words: Give the model a sentence with a missing word, and it can try to fill it in for you.
- Working with both Cyrillic and Latin alphabets: The model is trained on both alphabets, so you can use it with either one.
Example Use Cases
- Text Completion: The model can be used to complete sentences, like this: “Kada bi čovek znao gde će pasti on bi<mask>.”
- Vector Comparison: The model can compare the similarity between words, like “pas” and “mačka”, or “pas” and “svemir”.
Need a Larger Model?
If you need a more powerful model, consider the jerteh-355, the largest BERT model for the Serbian language.
Performance
With only 81 million parameters, the jerteh-81 model is fast at inference, which makes it a good fit for applications that need quick responses or have to process large volumes of Serbian text.
Speed
- Fast processing: The small model size keeps inference quick, even on large amounts of text.
- High volume handling: The low per-request cost makes it practical to serve many requests in parallel or to process text in batches.
Accuracy
- High accuracy: The model was trained on a corpus of 4 billion tokens, which helps it capture the nuances of the Serbian language.
- Accurate predictions: It produces strong masked-word predictions and word vectors, and it can serve as a base for fine-tuning on downstream tasks such as text classification.
Efficiency
- Minimal computational resources: The model produces high-quality results with modest computational resources, making it a good choice for applications where hardware is limited.
Limitations
While the jerteh-81 model is a powerful tool, it’s not perfect. Let’s talk about some of its limitations.
Limited Training Data
- Biased outputs: If the training data is biased, the model’s outputs may reflect those biases.
- Limited knowledge: If the training data doesn’t cover a specific topic, the model may not have enough knowledge to provide accurate or helpful responses.
Contextual Understanding
- Ambiguity: If the input is ambiguous or open to multiple interpretations, the model may struggle to provide a clear answer.
- Sarcasm and humor: The model may not always understand sarcasm or humor, which can lead to misinterpretation.
Dependence on Input Quality
- Poorly written input: If the input is poorly written, contains typos, or is unclear, the model’s output may suffer.
- Incomplete input: If the input is incomplete or lacks context, the model may not be able to provide a helpful response.
Format
The jerteh-81 model is a BERT-style model designed specifically for the Serbian language. It uses the RoBERTa-base architecture, has 81 million parameters, and was trained on a large corpus of Serbian texts consisting of 4 billion tokens.
Architecture
The model’s architecture is based on the RoBERTa-base model, which is a type of transformer architecture. This means that it uses self-attention mechanisms to process input sequences.
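If you want to confirm these details yourself, you can inspect the configuration published with the model. This is a minimal sketch using the transformers library; the printed values are whatever the hub configuration reports, not figures asserted here.
from transformers import AutoConfig
# Load the configuration published alongside the model
config = AutoConfig.from_pretrained('jerteh/jerteh-81')
print(config.model_type)           # architecture family (RoBERTa-style)
print(config.num_hidden_layers)    # number of transformer layers
print(config.hidden_size)          # dimensionality of the hidden states
print(config.num_attention_heads)  # self-attention heads per layer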
Data Formats
The model supports input in the form of tokenized text sequences. It can handle both Cyrillic and Latin alphabets.
Input Requirements
To use this model, you need to preprocess your input text by tokenizing it. You can use the AutoTokenizer class from the transformers library to do this.
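Here is a minimal sketch of that preprocessing step; the example sentences in Latin and Cyrillic script are only illustrations.
from transformers import AutoTokenizer
# Load the tokenizer trained together with jerteh-81
tokenizer = AutoTokenizer.from_pretrained('jerteh/jerteh-81')
# The same preprocessing works for both scripts
latin = "Danas je lep dan."
cyrillic = "Данас је леп дан."
# return_tensors='pt' produces PyTorch tensors ready to feed to the model
inputs_latin = tokenizer(latin, return_tensors='pt')
inputs_cyrillic = tokenizer(cyrillic, return_tensors='pt')
print(inputs_latin['input_ids'])
print(inputs_cyrillic['input_ids'])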
Output
The model outputs contextual vector representations (hidden states) for the tokens in the input text. These vectors can be pooled into a single representation and used for tasks such as text classification or clustering.
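There is no single prescribed way to turn the per-token vectors into one vector for a whole text; a common choice is to mean-pool the last hidden state. The sketch below assumes that choice, with an illustrative example sentence.
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch
tokenizer = AutoTokenizer.from_pretrained('jerteh/jerteh-81')
model = AutoModelForMaskedLM.from_pretrained('jerteh/jerteh-81', output_hidden_states=True)
model.eval()
text = "Ovo je primer rečenice."
inputs = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
# Last hidden state has shape (batch, sequence_length, hidden_size)
last_hidden = outputs.hidden_states[-1]
# Mean-pool over the token dimension to get one vector for the whole text
text_vector = last_hidden.mean(dim=1).squeeze(0)
print(text_vector.shape)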
Example Usage
Here’s an example of how to use the model to fill in missing words in a sentence:
from transformers import pipeline
# Load a fill-mask pipeline backed by the jerteh-81 model
unmasker = pipeline('fill-mask', model='jerteh/jerteh-81')
unmasker("Kada bi čovek znao gde će pasti on bi<mask>.")
This code will output a list of possible completions for the sentence, along with their corresponding scores.
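The pipeline returns a list of dictionaries, each holding a predicted token and its score; here is a short sketch of how you might print them, continuing from the unmasker defined above.
# Each prediction includes keys such as 'score', 'token_str' and 'sequence'
predictions = unmasker("Kada bi čovek znao gde će pasti on bi<mask>.")
for prediction in predictions:
    print(prediction['token_str'], round(prediction['score'], 4))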
You can also use the model to compute the similarity between two input sequences:
from transformers import AutoTokenizer, AutoModelForMaskedLM
from torch import LongTensor, no_grad
from scipy import spatial
# Load the tokenizer and the model, keeping hidden states in the output
tokenizer = AutoTokenizer.from_pretrained('jerteh/jerteh-81')
model = AutoModelForMaskedLM.from_pretrained('jerteh/jerteh-81', output_hidden_states=True)
x = "pas"
y = "mačka"
z = "svemir"
# Encode each word without special tokens and add a batch dimension
tensor_x = LongTensor(tokenizer.encode(x, add_special_tokens=False)).unsqueeze(0)
tensor_y = LongTensor(tokenizer.encode(y, add_special_tokens=False)).unsqueeze(0)
tensor_z = LongTensor(tokenizer.encode(z, add_special_tokens=False)).unsqueeze(0)
model.eval()
with no_grad():
    # Use the last hidden state as the vector representation of each word
    vektor_x = model(input_ids=tensor_x).hidden_states[-1].squeeze()
    vektor_y = model(input_ids=tensor_y).hidden_states[-1].squeeze()
    vektor_z = model(input_ids=tensor_z).hidden_states[-1].squeeze()
# Cosine distance: lower values mean more similar vectors
print(spatial.distance.cosine(vektor_x, vektor_y))
print(spatial.distance.cosine(vektor_x, vektor_z))
This code prints the cosine distance between the vector representations of the words; a lower distance means the vectors are more similar, so you would expect “pas” and “mačka” to come out closer than “pas” and “svemir”.