T5 Base Dutch

Dutch language model

The T5 Base Dutch model is a 222M-parameter T5 model for Dutch natural language processing. It was pre-trained on a large corpus of cleaned Dutch mC4 text with a masked language modeling (span-corruption) objective, using a sequence length of 512 and a batch size of 128. Fine-tuned versions of the model have been evaluated on summarization and translation tasks, reporting Rouge and Bleu scores, with a reported processing rate of 45.19 samples per second. The base checkpoint itself requires fine-tuning for specific tasks, but it is a valuable starting point for anyone working with Dutch language data.

Author: Yhavinga | License: apache-2.0

Model Overview

t5-base-dutch is a Dutch language model for natural language processing tasks. It was pre-trained to process and understand Dutch text and serves as a base model for downstream fine-tuning.

Key Attributes

  • Model Type: T5
  • Parameters: 222M
  • Pre-training Objective: Masked language modeling (denoise token span corruption)
  • Dataset: mc4_nl_cleaned (Dutch mC4)
  • Sequence Length: 512
  • Batch Size: 128
  • Total Steps: 527,500
  • Epochs: 1
  • Duration: 2 days 9 hours

Functionalities

  • Language Understanding: The model is designed to understand and process Dutch language inputs.
  • Text Generation: The model can be fine-tuned for text generation tasks such as summarization and translation.
  • Downstream Tasks: The model can be used for various downstream tasks such as language translation, text summarization, and more.

Capabilities

t5-base-dutch is capable of performing a variety of tasks, including:

  • Text Summarization: The model can summarize long pieces of text into shorter, more digestible versions.
  • Translation: The model can translate text from one language to another, with a focus on Dutch and English.
  • Masked Language Modeling: The model can fill in missing words or phrases in a sentence.
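To make the masked language modeling objective concrete, the sketch below fills in a corrupted span using T5's sentinel tokens. This is a minimal illustration that assumes the checkpoint is published on the Hugging Face Hub as yhavinga/t5-base-dutch; the completions of the base (not fine-tuned) model should be treated as indicative only.

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Assumed Hugging Face Hub id for this model
tokenizer = T5Tokenizer.from_pretrained('yhavinga/t5-base-dutch')
model = T5ForConditionalGeneration.from_pretrained('yhavinga/t5-base-dutch')

# The sentinel token <extra_id_0> marks the corrupted span the model must predict
text = "De hond liep om de <extra_id_0> heen."
inputs = tokenizer(text, return_tensors='pt')

# The prediction is framed by sentinel tokens, mirroring the denoising objective
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))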

Strengths

  • High Accuracy: Fine-tuned versions of the model achieve strong Rouge and Bleu scores on summarization and translation benchmarks.
  • Efficient Training: A single pre-training epoch over the cleaned Dutch mC4 corpus took about 2 days and 9 hours, helped by the relatively compact 222M-parameter size.
  • Flexibility: The model can be fine-tuned for a variety of tasks, including text summarization, translation, and more.

Examples

  • Input: "Vertaal de Engelse zin 'The cat sat on the mat' naar het Nederlands." (Translate the English sentence 'The cat sat on the mat' into Dutch.)
    Output: "De kat zat op het kleed."
  • Input: "Vat het volgende artikel samen: 'De afgelopen week is de temperatuur in Nederland flink gestegen. De temperaturen zijn in enkele dagen tijd met meer dan 10 graden gestegen. Dit is een ongekende stijging.'" (Summarize the following article about the sharp temperature rise in the Netherlands.)
    Output: "De temperatuur in Nederland is de afgelopen week flink gestegen." (The temperature in the Netherlands has risen sharply over the past week.)
  • Input: "Vertaal de Nederlandse zin 'De hond liep om de hoek' naar het Engels." (Translate the Dutch sentence 'De hond liep om de hoek' into English.)
    Output: "The dog walked around the corner."
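Prompts like these only yield sensible output from a checkpoint that has actually been fine-tuned for the task. The sketch below shows how such a prompt could be run; the checkpoint name and the prompt format are assumptions for illustration, not a published configuration.

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Hypothetical fine-tuned checkpoint; substitute one of the translation models
# listed in the Fine-Tuning section below, or your own fine-tuned model
checkpoint = 'yhavinga/t5-base-36L-ccmatrix-multi'

tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# The exact prompt/prefix format depends on how the checkpoint was fine-tuned
prompt = "Vertaal de Nederlandse zin 'De hond liep om de hoek' naar het Engels."
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))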

Comparison to Other Models

t5-base-dutch is not the only model trained on Dutch text. The table below compares its scores with those of related checkpoints:

Model                        | Rouge1 Score | Bleu Score
-----------------------------|--------------|-----------
t5-base-dutch (this model)   | 33.38        | 45.88
t5-v1.1-base-dutch-uncased   | 33.97        | 51.21
t5-v1.1-base-dutch-cased     | 34.39        | 48.31
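For reference, Rouge and Bleu scores like these are commonly computed with the evaluate library. The snippet below is a minimal sketch with placeholder predictions and references, not the evaluation data behind the table above.

import evaluate

# Placeholder outputs; the actual evaluation used held-out summarization and
# translation sets that are not reproduced here
predictions = ["De temperatuur in Nederland is flink gestegen."]
references = ["De temperatuur in Nederland is de afgelopen week flink gestegen."]

rouge = evaluate.load("rouge")
sacrebleu = evaluate.load("sacrebleu")

print(rouge.compute(predictions=predictions, references=references)["rouge1"])
print(sacrebleu.compute(predictions=predictions,
                        references=[[r] for r in references])["score"])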

Fine-Tuning

t5-base-dutch can be fine-tuned for specific tasks such as translation. Two related models, t5-small-24L-dutch-english and t5-base-36L-dutch-english, have been fine-tuned for both language directions on the CCMatrix dataset, with the following results:

Model                       | Source Language | Target Language | Bleu Score
----------------------------|-----------------|-----------------|-----------
t5-base-36L-ccmatrix-multi  | English         | Dutch           | 56.8
t5-base-36L-ccmatrix-multi  | Dutch           | English         | 62.8
t5-small-24L-ccmatrix-multi | English         | Dutch           | 57.4
t5-small-24L-ccmatrix-multi | Dutch           | English         | 63.1
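A translation fine-tuning run along these lines can be set up with Seq2SeqTrainer. The sketch below is schematic: the dataset identifier, data slice, and hyperparameters are illustrative assumptions, not the settings used to produce the checkpoints above.

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = 'yhavinga/t5-base-dutch'  # assumed hub id of the base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Illustrative parallel corpus slice; substitute whichever en-nl CCMatrix
# mirror or other parallel data you have access to
dataset = load_dataset('yhavinga/ccmatrix', 'en-nl', split='train[:10000]')

def preprocess(batch):
    sources = [pair['en'] for pair in batch['translation']]
    targets = [pair['nl'] for pair in batch['translation']]
    model_inputs = tokenizer(sources, max_length=512, truncation=True)
    labels = tokenizer(text_target=targets, max_length=512, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir='t5-base-dutch-english',  # illustrative hyperparameters only
    per_device_train_batch_size=8,
    learning_rate=5e-4,
    num_train_epochs=1,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()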

Limitations

t5-base-dutch has several limitations that are important to consider when using it for downstream tasks.

Limited Training Data

The model was trained on a single web-crawled corpus (cleaned Dutch mC4), which may not be representative of all domains and writing styles. This can lead to biases and inaccuracies in the model's outputs.

Lack of Common Sense

While t5-base-dutch is good at understanding language, it sometimes lacks common sense or real-world knowledge. This can result in outputs that are not practical or realistic.

Inability to Reason Abstractly

The model is not capable of abstract reasoning or complex problem-solving. It is best suited for tasks that require language understanding and generation.

Format

t5-base-dutch uses an encoder-decoder transformer architecture designed for text-to-text tasks. The model accepts input in the form of tokenized text sequences.

Input Format

  • The model expects input text to be pre-processed using a SentencePiece tokenizer.
  • The tokenizer is configured with the following normalizers:
    • Nmt
    • NFKC
    • Replace multiple spaces with a single space
  • The model has a vocabulary size of 32,003 tokens.
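The tokenizer configuration described above can be inspected directly. This is a small sketch, again assuming the yhavinga/t5-base-dutch hub id and that the fast tokenizer is available.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('yhavinga/t5-base-dutch')  # assumed hub id

# Vocabulary size (should match the 32,003 figure above)
print(len(tokenizer))

# The fast tokenizer exposes its normalizer pipeline (Nmt, NFKC, space collapsing)
print(tokenizer.backend_tokenizer.normalizer)

# SentencePiece sub-word segmentation of a Dutch sentence
print(tokenizer.tokenize("Dit is een voorbeeldzin."))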

Output Format

  • The model generates output in the form of tokenized text sequences.
  • The output can be converted back to plain text using the same SentencePiece tokenizer.

Special Requirements

  • The model was pre-trained with a sequence length of 512 tokens; longer inputs should be truncated to this length (see the sketch below).
  • The batch size of 128 refers to pre-training; any batch size can be used for fine-tuning or inference.
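In practice the 512-token limit is handled with truncation and padding at tokenization time. A short sketch follows; the batch contents and size here are arbitrary, not the 128 used during pre-training.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('yhavinga/t5-base-dutch')  # assumed hub id

texts = ["Dit is een korte zin.", "Dit is een iets langere voorbeeldzin."]

# Pad to the longest item in the batch and truncate anything beyond 512 tokens
batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                  return_tensors='pt')
print(batch['input_ids'].shape)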

Example Code

Here’s an example of how to use t5-base-dutch in Python:

from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load the pre-trained model and tokenizer from the Hugging Face Hub
tokenizer = T5Tokenizer.from_pretrained('yhavinga/t5-base-dutch')
model = T5ForConditionalGeneration.from_pretrained('yhavinga/t5-base-dutch')

# Tokenize the input text, truncating to the 512-token sequence length
input_text = "Dit is een voorbeeldzin."
inputs = tokenizer(input_text,
                   max_length=512,
                   truncation=True,
                   return_tensors='pt')

# Generate output; note that the base checkpoint is a pre-trained denoiser and
# should be fine-tuned before it will produce useful task-specific output
outputs = model.generate(inputs['input_ids'],
                         attention_mask=inputs['attention_mask'],
                         max_length=512)

# Convert the output tokens back to plain text
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(output_text)