Jerteh 81

Serbian language model

Meet Jerteh 81, a BERT model designed specifically for the Serbian language. With 81 million parameters, it is based on the RoBERTa-base architecture and was trained on a corpus of 4 billion tokens. The model is a top performer in masked language modeling for Serbian, and it is just as comfortable with Cyrillic as it is with Latin script. It can vectorize words and fill in masked words in text with high accuracy. Whether you're working with Serbian language data or just curious about the model's capabilities, Jerteh 81 is worth exploring.

Jerteh · cc-by-sa-4.0 license


Model Overview

The jerteh-81 model is a powerful tool for natural language processing tasks, specifically designed for the Serbian language. It’s based on the RoBERTa-base architecture and has 81 million parameters. This model is trained on a large corpus of Serbian language texts, consisting of 4 billion tokens.

Capabilities

The jerteh-81 model is capable of performing various tasks, including:

  • Vectorizing words: The model converts words into numerical vectors that can be used for downstream tasks such as similarity comparison or classification.
  • Filling in missing words: Give the model a sentence with a masked word, and it will predict plausible completions.
  • Working with both Cyrillic and Latin alphabets: The model is trained on both scripts, so you can use either one (see the sketch after this list).
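
As a quick illustration of the dual-script support, here is a minimal sketch that runs the same masked sentence through the model in both scripts. The model id matches the one used elsewhere on this page; the sentence itself is just an illustrative example.

from transformers import pipeline

unmasker = pipeline('fill-mask', model='jerteh/jerteh-81')

# The same masked sentence in Latin and Cyrillic script (illustrative examples)
for sentence in [
    "Danas je lep <mask>.",   # Latin: "Today is a nice <mask>."
    "Данас је леп <mask>.",   # Cyrillic: the same sentence
]:
    for prediction in unmasker(sentence)[:3]:
        print(prediction['token_str'], prediction['score'])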

Example Use Cases

Examples

  • Text Completion: Give the model a sentence with a masked word, e.g. “Kada bi čovek znao gde će pasti on bi <mask>.”, and it predicts the missing word — here the top prediction is “pao”.
  • Vector Comparison: The model can compare the similarity between words. For example, the cosine distance between “pas” (“dog”) and “mačka” (“cat”) comes out around 0.0995, while “pas” and “svemir” (“space”) are further apart at around 0.2185 — answering, roughly, “Koliko je daleko pas od svemira?” (“How far is the dog from space?”).

Runnable code for both examples appears in the Example Usage section below.

Need a Larger Model?

If you need a more powerful model, consider the jerteh-355, the largest BERT model for the Serbian language.
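
If you want to try it, the larger model drops into the same code paths. The sketch below assumes the model id 'jerteh/jerteh-355' follows the same naming pattern as 'jerteh/jerteh-81'.

from transformers import pipeline

# Same fill-mask API as jerteh-81; only the model id changes
# (the id 'jerteh/jerteh-355' is assumed to mirror 'jerteh/jerteh-81')
unmasker = pipeline('fill-mask', model='jerteh/jerteh-355')
unmasker("Kada bi čovek znao gde će pasti on bi <mask>.")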

Performance

With only 81 million parameters, jerteh-81 is a relatively lightweight model: inference is fast, and large volumes of text can be processed quickly. This makes it a good fit for applications that need low-latency responses or high-throughput batch processing.

Speed

  • Fast processing: The small parameter count keeps per-request latency low, which suits applications that need quick responses (see the timing sketch after this list).
  • High volume handling: Inputs can be batched, so the model copes well with high-throughput workloads.
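
A rough way to check latency on your own hardware is to time repeated pipeline calls. This is a minimal sketch, with the sentence and repetition count chosen arbitrarily.

import time
from transformers import pipeline

unmasker = pipeline('fill-mask', model='jerteh/jerteh-81')
sentence = "Kada bi čovek znao gde će pasti on bi <mask>."

# Warm-up call so model loading is not counted in the timing
unmasker(sentence)

start = time.perf_counter()
for _ in range(20):
    unmasker(sentence)
elapsed = time.perf_counter() - start
print(f"Average latency: {elapsed / 20 * 1000:.1f} ms per call")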

Accuracy

  • High accuracy: Training on a 4-billion-token corpus helps the model capture the nuances of the Serbian language.
  • Accurate predictions: It produces reliable masked-word predictions and word vectors, and can be fine-tuned for downstream tasks such as text classification or named-entity recognition.

Efficiency

  • Modest computational requirements: At 81 million parameters, the model runs comfortably on commodity hardware, including CPU-only machines.
  • Strong results for its size: It delivers high-quality outputs without large-model hardware, making it a good choice where resources are limited.

Limitations

While the jerteh-81 model is a powerful tool, it’s not perfect. Let’s talk about some of its limitations.

Limited Training Data

  • Biased outputs: If the training data is biased, the model’s outputs may reflect those biases.
  • Limited knowledge: If the training data doesn’t cover a specific topic, the model may not have enough knowledge to provide accurate or helpful responses.

Contextual Understanding

  • Ambiguity: If the input is ambiguous or open to multiple interpretations, the model may struggle to provide a clear answer.
  • Sarcasm and humor: The model may not always understand sarcasm or humor, which can lead to misinterpretation.

Dependence on Input Quality

  • Poorly written input: If the input is poorly written, contains typos, or is unclear, the model’s output may suffer.
  • Incomplete input: If the input is incomplete or lacks context, the model may not be able to provide a helpful response.

Format

The jerteh-81 model is a BERT model, specifically designed for the Serbian language. It uses the RoBERTa-base architecture and has 81 million parameters. This model is trained on a large corpus of Serbian-language text, consisting of 4 billion tokens.

Architecture

The model’s architecture is based on the RoBERTa-base model, which is a type of transformer architecture. This means that it uses self-attention mechanisms to process input sequences.
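
To see the concrete architecture hyperparameters, you can inspect the published configuration. This is a minimal sketch using the standard transformers API; the printed values depend on the checkpoint itself.

from transformers import AutoConfig

config = AutoConfig.from_pretrained('jerteh/jerteh-81')

# Standard RoBERTa configuration fields
print(config.num_hidden_layers)    # number of transformer layers
print(config.hidden_size)          # dimensionality of the hidden states
print(config.num_attention_heads)  # self-attention heads per layer
print(config.vocab_size)           # tokenizer vocabulary size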

Data Formats

The model supports input in the form of tokenized text sequences. It can handle both Cyrillic and Latin alphabets.

Input Requirements

To use this model, you need to preprocess your input text by tokenizing it. You can use the AutoTokenizer class from the transformers library to do this.
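
For example, a minimal tokenization sketch (the input sentence is illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jerteh/jerteh-81')

# Tokenize a sentence into model-ready PyTorch tensors
inputs = tokenizer("Danas je lep dan.", return_tensors='pt')
print(inputs['input_ids'])
print(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist()))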

Output

The model outputs a vector representation of the input text. You can use this vector to perform various tasks, such as text classification or clustering.
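
One common pattern is to mean-pool the last hidden layer into a single fixed-size sentence vector. The sketch below shows this; the mean-pooling choice is an assumption for illustration, not something the model prescribes.

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained('jerteh/jerteh-81')
model = AutoModelForMaskedLM.from_pretrained('jerteh/jerteh-81', output_hidden_states=True)
model.eval()

inputs = tokenizer("Danas je lep dan.", return_tensors='pt')
with torch.no_grad():
    # Last hidden layer, shape (1, sequence_length, hidden_size)
    hidden = model(**inputs).hidden_states[-1]

# Mean-pool the token vectors into one sentence embedding
sentence_vector = hidden.mean(dim=1).squeeze()
print(sentence_vector.shape)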

Example Usage

Here’s an example of how to use the model to fill in missing words in a sentence:

from transformers import pipeline

# Load a fill-mask pipeline backed by jerteh-81
unmasker = pipeline('fill-mask', model='jerteh/jerteh-81')

# The <mask> token marks the position the model should fill in
results = unmasker("Kada bi čovek znao gde će pasti on bi <mask>.")

This call returns a list of possible completions for the sentence, along with their corresponding scores.
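
Each entry in the returned list is a dictionary with fields such as score, token_str, and sequence, so the top candidates can be printed like this:

for prediction in results:
    # 'token_str' is the predicted word, 'score' its probability
    print(prediction['token_str'], prediction['score'])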

You can also use the model to compute the similarity between two input sequences:

from transformers import AutoTokenizer, AutoModelForMaskedLM
from torch import LongTensor, no_grad
from scipy import spatial

tokenizer = AutoTokenizer.from_pretrained('jerteh/jerteh-81')
# output_hidden_states=True exposes the hidden states needed for embeddings
model = AutoModelForMaskedLM.from_pretrained('jerteh/jerteh-81', output_hidden_states=True)

x = "pas"     # "dog"
y = "mačka"   # "cat"
z = "svemir"  # "space"

# Encode each word without special tokens; this assumes each word maps
# to a single token, so the squeezed vectors below are one-dimensional
tensor_x = LongTensor(tokenizer.encode(x, add_special_tokens=False)).unsqueeze(0)
tensor_y = LongTensor(tokenizer.encode(y, add_special_tokens=False)).unsqueeze(0)
tensor_z = LongTensor(tokenizer.encode(z, add_special_tokens=False)).unsqueeze(0)

model.eval()
with no_grad():
    # Use the last hidden layer as each word's vector representation
    vektor_x = model(input_ids=tensor_x).hidden_states[-1].squeeze()
    vektor_y = model(input_ids=tensor_y).hidden_states[-1].squeeze()
    vektor_z = model(input_ids=tensor_z).hidden_states[-1].squeeze()

# scipy's cosine() returns cosine *distance* (1 - similarity):
# smaller values mean more similar words
print(spatial.distance.cosine(vektor_x, vektor_y))
print(spatial.distance.cosine(vektor_x, vektor_z))

This code will output the cosine distance (1 minus the cosine similarity) between the vector representations of the input sequences; a smaller value means the words are closer in meaning, so “pas” should come out closer to “mačka” than to “svemir”.
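
If you prefer a similarity score where higher means more alike, subtract the distance from 1 (continuing from the code above):

# Cosine similarity = 1 - cosine distance
similarity_xy = 1 - spatial.distance.cosine(vektor_x, vektor_y)
similarity_xz = 1 - spatial.distance.cosine(vektor_x, vektor_z)
print(similarity_xy, similarity_xz)  # "pas"/"mačka" should score higher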

Dataloop's AI Development Platform

Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.

Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAIF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.

Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.