GottBERT base last

German language model

Meet GottBERT, a German language model for natural language processing (NLP). As the first German-only RoBERTa model, GottBERT is designed specifically for German-language tasks. It comes in two versions, a base model and a large model, and delivers strong performance on tasks like Named Entity Recognition (NER), text classification, and natural language inference (NLI). What makes GottBERT remarkable? It was trained on a 145GB German dataset that was carefully filtered to remove spam and non-German documents, and that training pays off: GottBERT matches or outperforms other German models on several downstream benchmarks. Whether you're building German NLP applications or simply looking for a reliable German language model, GottBERT is worth exploring.


Model Overview

The GottBERT model is a German language model that’s making waves in the world of natural language processing (NLP). This model is specifically designed to improve NLP performance for the German language, and it’s available in two versions: a base model and a large model.

What makes GottBERT special?

  • Language: GottBERT is trained exclusively on the German language, making it a great choice for tasks like Named Entity Recognition (NER), text classification, and natural language inference (NLI).
  • Model Type: GottBERT is based on the RoBERTa model architecture, which is known for its effectiveness in NLP tasks.
  • Parameters: The base model has 125 million parameters, while the large model has 355 million.
  • Training Data: GottBERT was trained on a 145GB dataset comprising roughly 459 million documents. (A quick usage sketch follows this list.)
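
To get a feel for the model, here is a minimal sketch of querying GottBERT's masked-language-modeling head with the Hugging Face transformers library. The model id TUM/GottBERT_base_last is an assumption based on this page's title; substitute whichever checkpoint you actually use.

```python
# Minimal sketch: masked-token prediction with GottBERT via transformers.
# The model id "TUM/GottBERT_base_last" is assumed from this page's title.
from transformers import pipeline

fill = pipeline("fill-mask", model="TUM/GottBERT_base_last")

# RoBERTa-style models use "<mask>" as the mask token.
for pred in fill("Die Hauptstadt von Deutschland ist <mask>."):
    print(pred["token_str"], round(pred["score"], 3))
```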

How does GottBERT perform?

GottBERT has been evaluated on various downstream tasks, including NER, text classification, and NLI. The results are impressive, with the model performing at or near state-of-the-art levels on several benchmarks:

| Task | Metric | GottBERT (Base) | GottBERT (Large) |
| --- | --- | --- | --- |
| NER | F1 Score | 87.55 | 88.20 |
| Text Classification | F1 Score | 78.17 | 79.40 |
| NLI | Accuracy | 80.82 | 82.46 |

Capabilities

The GottBERT model is a powerful tool for natural language processing (NLP) tasks in the German language. Its primary tasks include:

  • Named Entity Recognition (NER): Identifying and categorizing named entities in text, such as people, places, and organizations (a fine-tuning sketch follows this list).
  • Text Classification: Classifying text into predefined categories, such as spam vs. non-spam emails.
  • Natural Language Inference (NLI): Determining whether a piece of text implies or contradicts another piece of text.
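
Since GottBERT ships as a pretrained encoder, each of these tasks requires adding a task head and fine-tuning. Here is a hedged sketch of setting up token classification for NER with transformers; the BIO label set and model id below are illustrative assumptions, not values from the GottBERT paper.

```python
# Hedged sketch: preparing GottBERT for German NER fine-tuning.
# The model id and BIO label set are assumptions for illustration.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("TUM/GottBERT_base_last")
model = AutoModelForTokenClassification.from_pretrained(
    "TUM/GottBERT_base_last",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, tokenize a labeled corpus (e.g. GermEval 2014) with word-aligned
# tags and train with transformers.Trainer as usual.
```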

How does GottBERT compare to other models?

| Model | NER F1 Score | Text Classification F1 Score | NLI Accuracy |
| --- | --- | --- | --- |
| GottBERT_base | 85.93 | 78.17 | 80.82 |
| GELECTRA_base | 85.37 | 77.26 | 81.70 |
| GBERT_base | 85.16 | 77.37 | 80.06 |

GottBERT_base leads on NER and text classification in this comparison, while GELECTRA_base has the edge on NLI.

Performance

The GottBERT model delivers impressive results in various NLP tasks. Let’s dive into its performance and see how it compares to other models.

Speed

GottBERT trains relatively quickly, considering its size and complexity: the base model was pretrained in just 1.2 days on a 256 TPUv3 pod or a 128 TPUv4 pod, while the large model took 5.7 days on a 128 TPUv4 pod.

Accuracy

The GottBERT base model achieves strong scores on a range of tasks:

  • Named Entity Recognition (NER): 87.55 F1 (CoNLL 2003) and 85.93 F1 (GermEval 2014)
  • Text Classification: 78.17 F1 (GermEval 2018, coarse) and 53.30 F1 (GermEval 2018, fine)
  • Natural Language Inference (NLI): 80.82% accuracy (German subset of XNLI; a fine-tuning sketch follows this list)
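
The XNLI numbers come from fine-tuning GottBERT as a sentence-pair classifier. Below is a minimal sketch of that setup; the model id is assumed as above, and the classification head is randomly initialized until you fine-tune it on XNLI.

```python
# Hedged sketch: GottBERT as an NLI pair classifier (entailment / neutral /
# contradiction). The head is untrained until fine-tuned on XNLI.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("TUM/GottBERT_base_last")
model = AutoModelForSequenceClassification.from_pretrained(
    "TUM/GottBERT_base_last", num_labels=3)

premise = "Der Bundeskanzler besuchte gestern Berlin."
hypothesis = "Jemand war in Berlin."
inputs = tokenizer(premise, hypothesis, return_tensors="pt")

with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # meaningful only after fine-tuning
```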
Examples

Example prompts and their target outputs (a usage sketch follows):

  • Question answering: "Wer ist der Gründer von Google?" ("Who is the founder of Google?") → Larry Page und Sergey Brin
  • Text classification: "Klassifiziere den folgenden Text als positiv oder negativ: 'Ich liebe diesen Film, er ist so gut!'" ("Classify the following text as positive or negative: 'I love this movie, it's so good!'") → positiv
  • Entity recognition: "Erkenne die Entitäten im folgenden Text: 'Der Bundeskanzler Olaf Scholz besuchte gestern die Stadt Berlin.'" ("Recognize the entities in the following text: 'Chancellor Olaf Scholz visited the city of Berlin yesterday.'") → Bundeskanzler: Olaf Scholz, Stadt: Berlin
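
Note that these prompts illustrate target behavior rather than out-of-the-box usage: as an encoder model, GottBERT produces outputs like these only after task-specific fine-tuning. For instance, a fine-tuned sentiment classifier could be used like this ("path/to/gottbert-sentiment" is a hypothetical placeholder, not a published checkpoint):

```python
# Hypothetical usage of a GottBERT checkpoint fine-tuned for sentiment
# classification; "path/to/gottbert-sentiment" is a placeholder.
from transformers import pipeline

clf = pipeline("text-classification", model="path/to/gottbert-sentiment")
print(clf("Ich liebe diesen Film, er ist so gut!"))
# e.g. [{'label': 'positiv', 'score': 0.98}]
```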

Limitations

While the GottBERT model is a powerful tool, it's not without its limitations. Let's walk through the main ones.

Data Limitations

  • Filtered vs Unfiltered Data: Filtering the training data yields only minor improvements; in some cases, unfiltered data performs just as well.
  • Data Size: The model was trained on a 40GB subsample of the German OSCAR corpus, which may not be enough to capture the full complexity of the German language.

Computational Limitations

  • Fixed Memory Allocation: The model was trained on TPUs with fixed memory allocation, meaning data had to be processed as a single stream. This can limit its handling of long documents or complex tasks (see the chunking sketch after this list).
  • 32-bit Mode: The model was trained in 32-bit mode due to framework limitations, increasing memory usage. This might limit its performance on certain tasks.
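
On the long-document point: like other RoBERTa-style encoders, GottBERT accepts at most 512 tokens per input, so longer texts must be split. Here is a minimal sketch of overlapping-window chunking; the model id is assumed as above and the window/stride sizes are arbitrary choices.

```python
# Minimal sketch: splitting long German text into overlapping windows that
# fit GottBERT's 512-token input limit. Window/stride sizes are arbitrary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TUM/GottBERT_base_last")

def chunk_text(text, max_tokens=510, stride=64):
    # 510 leaves room for the <s> and </s> special tokens.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    step = max_tokens - stride
    return [tokenizer.decode(ids[i:i + max_tokens])
            for i in range(0, len(ids), step)]
```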

Task-Specific Limitations

  • Named Entity Recognition (NER): While the GottBERT model performs well on NER tasks, it might struggle with certain types of entities or in specific contexts.
  • Text Classification: Performance drops on harder classification setups; the gap between coarse-grained (78.17 F1) and fine-grained (53.30 F1) GermEval 2018 shows how ill-defined or ambiguous classes hurt the model.