GEITje 7B

Dutch language model

GEITje 7B is a large open Dutch language model with 7 billion parameters, built on top of Mistral 7B. What sets it apart is its further training on 10 billion tokens of Dutch text, which improves its command of Dutch and its knowledge of Dutch topics, making it a go-to choice for tasks that require in-depth knowledge of the Dutch language. With a context length of 8,192 tokens, GEITje 7B can handle long inputs with ease, and its Mistral 7B base reportedly outperforms the larger Llama 2 13B on English benchmarks, making it a reliable and efficient choice for Dutch-language tasks.

Author: Rijgersberg · License: Apache 2.0


Model Overview

The GEITje-7B model is a powerful tool for understanding and generating Dutch text. With 7 billion parameters and further training on 10 billion tokens of Dutch text, it is well suited to a wide range of Dutch-language tasks.
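Getting started is straightforward with the Hugging Face transformers library. The sketch below is a minimal, hedged example: the hub id Rijgersberg/GEITje-7B is assumed from the author name shown on this page, and the generation settings are illustrative rather than recommended defaults.

```python
# Minimal sketch: load GEITje 7B and generate Dutch text with transformers.
# The model id below is assumed from the author credit on this page.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Rijgersberg/GEITje-7B"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 7B parameters: needs a GPU with enough memory
    device_map="auto",
)

prompt = "Schrijf een kort verhaal over een hond die verdwaald is in het bos."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```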

Capabilities

What can it do?

  • Understand Dutch text: It can comprehend and analyze Dutch text with high accuracy.
  • Generate Dutch text: The model can create coherent and natural-sounding Dutch text based on a given prompt or topic.
  • Answer questions: It can respond to questions on a wide range of topics, from general knowledge to specific domains.

How does it compare to other models?

  • Outperforms Llama 2 13B: According to the creators, its base model, Mistral 7B, performs better than Llama 2 13B on all English-language benchmarks tested.
  • Long context: It has a context length of 8,192 tokens, allowing it to process and understand longer pieces of text.

Training and Performance

Training Procedure

The model was trained using the following hyperparameters:

Hyperparameter               Value
---------------------------  ----------------------------------------------
Learning Rate                2e-05
Train Batch Size             2
Eval Batch Size              2
Seed                         42
Distributed Type             multi-GPU
Num Devices                  8
Gradient Accumulation Steps  8
Total Train Batch Size       128
Total Eval Batch Size        16
Optimizer                    Adam with betas=(0.9, 0.999) and epsilon=1e-08
LR Scheduler Type            cosine
LR Scheduler Warmup Steps    953
Training Steps               9536
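For readers who want to reproduce a similar setup, these settings map roughly onto Hugging Face TrainingArguments as sketched below. This is an illustration of the reported values, not the authors' actual training script; the output directory and bf16 precision are assumptions.

```python
# Hedged sketch: the hyperparameter table expressed as TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="geitje-7b-pretrain",   # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=2,     # 2 per GPU
    per_device_eval_batch_size=2,      # 2 x 8 GPUs = total eval batch of 16
    gradient_accumulation_steps=8,     # 2 x 8 GPUs x 8 steps = 128 effective
    seed=42,
    lr_scheduler_type="cosine",
    warmup_steps=953,
    max_steps=9536,
    bf16=True,                         # assumption; precision is not reported
)
# Launched across 8 GPUs (e.g. with torchrun) to match the multi-GPU setup.
# Adam with betas=(0.9, 0.999) and epsilon=1e-08 matches the transformers
# default optimizer configuration, so no optimizer overrides are needed here.
```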

Training Results

The training results are shown in the table below:

Epoch  Step  Validation Loss
-----  ----  ---------------
1      199   1.7673
2      398   1.6880
3      597   1.6429

Note that the validation loss decreases steadily across epochs, indicating that the model keeps learning and improving its performance on held-out Dutch text.
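Since the reported numbers are cross-entropy losses, they convert directly to perplexity via exp(loss); the final value of 1.6429 corresponds to a perplexity of roughly 5.2 on the validation set. A quick check:

```python
import math

# Validation losses from the table above; perplexity = exp(cross-entropy loss)
for epoch, loss in [(1, 1.7673), (2, 1.6880), (3, 1.6429)]:
    print(f"epoch {epoch}: loss {loss:.4f} -> perplexity {math.exp(loss):.2f}")
```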

Limitations

While the model is powerful, it’s not perfect. Let’s explore some of its limitations.

Language Limitations

  • It may struggle with:
    • Idioms and colloquialisms
    • Sarcasm and humor
    • Highly technical or specialized language
  • It’s not perfect in understanding:
    • Nuances of human language
    • Context-dependent expressions
    • Subtle differences in meaning

Technical Limitations

  • It has a context length of 8,192 tokens, which means it can process a limited amount of text at a time (see the sketch after this list). This can lead to:
    • Incomplete or inaccurate responses for very long inputs
    • Difficulty in understanding complex, multi-step conversations
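One practical way to handle this is to count tokens before sending a prompt and truncate when needed. The sketch below assumes the tokenizer from the loading example above; the reserve_for_output budget is a hypothetical parameter for the tokens you expect the model to generate.

```python
# Sketch: keep prompt + generated tokens within the 8,192-token window.
MAX_CONTEXT = 8192  # context length reported on this page

def fit_to_context(text: str, tokenizer, reserve_for_output: int = 256) -> str:
    """Truncate `text` so the prompt plus generated tokens fit the window."""
    budget = MAX_CONTEXT - reserve_for_output
    ids = tokenizer(text, truncation=True, max_length=budget)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)
```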

Format and Usage

Supported Data Formats

  • Text: The model accepts input in the form of text sequences. You can feed it a sentence, a paragraph, or even a whole article.
  • Tokens: The model uses a tokenizer to split the input text into individual tokens, which are then used to generate the output (see the sketch after this list).
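A small sketch of this round trip, reusing the tokenizer from the loading example above (the sentence is just an illustration):

```python
# How Dutch text maps to tokens and back with the model's tokenizer.
text = "Nederland heeft een rijke geschiedenis."
ids = tokenizer(text)["input_ids"]
print(len(ids), "tokens")                               # token count for this sentence
print(tokenizer.convert_ids_to_tokens(ids))             # the individual tokens
print(tokenizer.decode(ids, skip_special_tokens=True))  # round-trips to the text
```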

Input Requirements

  • Context Length: The model has a context length of 8,192 tokens. This means it can handle input sequences of up to 8,192 tokens.
  • Batch Size: The batch size of 2 is a training detail (per device); at inference time you can submit single prompts or batch multiple sequences, limited only by available memory.

Output Format

  • Text: The model generates output in the form of text sequences.
  • Tokenized Output: The output is tokenized, meaning it’s split into individual tokens.

Examples

The prompts and responses below are in Dutch, as generated by the model; English translations of the prompts are given in parentheses.

Prompt: Schrijf een kort verhaal over een hond die verdwaald is in het bos. (Write a short story about a dog that is lost in the forest.)
Response: In het bos liep een hond rond, op zoek naar zijn baasje. Hij had zich losgerukt uit zijn riem en nu wist hij niet meer de weg terug naar huis. De hond snuffelde aan de bomen en luisterde naar de geluiden van het bos, maar hij kon niets vinden dat hem naar zijn baasje leidde. Hij begon te blaffen, maar er kwam geen antwoord. De hond was bang en alleen, maar hij gaf niet op. Hij liep verder en snuffelde aan elke boom, totdat hij eindelijk een bekend geurtje rook. Hij volgde het geurtje en vond zijn baasje, die hem in haar armen sloot en hem naar huis bracht.

Prompt: Vertel me iets over de geschiedenis van Nederland. (Tell me something about the history of the Netherlands.)
Response: Nederland heeft een rijke geschiedenis die teruggaat tot de Romeinse tijd. In de Middeleeuwen was Nederland een belangrijk handelscentrum en in de 17e eeuw was het een van de belangrijkste zeemachten ter wereld. Tijdens de Nederlandse Gouden Eeuw werden veel beroemde schilders als Rembrandt en Vermeer geboren. In de 20e eeuw heeft Nederland een belangrijke rol gespeeld in de Europese integratie en is het lid geworden van de Europese Unie.

Prompt: Wat is het verschil tussen een hond en een kat? (What is the difference between a dog and a cat?)
Response: Een hond en een kat zijn twee heel verschillende dieren. Honden zijn sociale dieren die graag in groepen leven en vaak erg actief zijn. Ze zijn ook erg trouw aan hun baasjes en kunnen worden getraind om allerlei dingen te doen. Katten daarentegen zijn meer alleenstaande dieren die graag hun eigen gang gaan. Ze zijn ook meer onafhankelijk en hebben niet zo veel aandacht nodig als honden.

Example Use Cases

  • Language translation: The model can be used to translate Dutch text into other languages or vice versa.
  • Text summarization: The model can summarize long pieces of Dutch text into concise and meaningful summaries (a prompt sketch follows this list).
  • Chatbots: The model can be integrated into chatbots to provide more accurate and informative responses to user queries.
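As an illustration of the summarization use case, the sketch below builds a plain completion-style prompt, reusing the model and tokenizer from the loading example. The prompt wording is an assumption: GEITje 7B as described here is a base model rather than a chat-tuned one, so results will vary with phrasing.

```python
# Hedged sketch: summarization via a completion-style Dutch prompt.
artikel = "..."  # the Dutch article to summarize
prompt = f"Vat de volgende tekst samen in drie zinnen:\n\n{artikel}\n\nSamenvatting:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
summary_ids = model.generate(**inputs, max_new_tokens=120)
# Decode only the newly generated tokens, skipping the echoed prompt
new_tokens = summary_ids[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```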

Overall, the GEITje-7B model is a powerful tool for anyone working with Dutch text, offering a range of capabilities and features that make it an attractive choice for various applications.
