DistilBERT Base Uncased

Efficient BERT model

Meet DistilBERT, a distilled version of the BERT base model that's smaller, faster, and cheaper. It was pretrained on the same corpus as BERT in a self-supervised fashion, using the BERT base model as a teacher. The model is uncased, meaning it doesn't differentiate between English and english. It's best suited for tasks that use the whole sentence to make decisions, such as sequence classification, token classification, or question answering. Keep in mind that its predictions can be biased, especially around sensitive attributes like race or gender. With its balance of performance and efficiency, DistilBERT is a great choice for fine-tuning on downstream tasks.

Model Overview

The DistilBERT model is a smaller and faster version of the popular BERT model. But what makes it special? Let’s dive in!

Key Attributes

  • Smaller and faster: DistilBERT is a distilled version of BERT, making it more efficient and lightweight.
  • Uncased: It doesn’t differentiate between English and english, making it case-insensitive.
  • Pretrained on the same corpus: DistilBERT was trained on the same data as BERT, including BookCorpus and English Wikipedia.

Functionalities

  • Masked language modeling: DistilBERT can predict masked words in a sentence (see the sketch after this list).
  • Feature extraction: It produces contextual token embeddings that downstream models can build on; unlike BERT, it was not trained with a next sentence prediction objective.
  • Fine-tuning: DistilBERT can be fine-tuned for specific downstream tasks, such as sequence classification, token classification, or question answering.
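
As a quick illustration of the masked language modeling functionality, the snippet below uses the transformers fill-mask pipeline with the distilbert-base-uncased checkpoint; the prompt is an arbitrary example.

from transformers import pipeline

# Load a fill-mask pipeline backed by DistilBERT
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')

# Returns the top candidate tokens for the [MASK] position together with their scores
print(unmasker("Hello I'm a [MASK] model."))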

Capabilities

The DistilBERT model is a powerful tool for natural language processing tasks. It’s a smaller and faster version of the popular BERT model, but still packs a punch.

What can it do?

  • Masked Language Modeling: The model can fill in missing words in a sentence. For example, if you give it the sentence “Hello I’m a [MASK] model.”, it can predict the missing word.
  • Feature Extraction: The model produces a contextual embedding for each token, which can be used as features in downstream models.
  • Text Classification: The model can be fine-tuned for specific text classification tasks, such as sentiment analysis or spam detection (a minimal sketch follows this list).
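
To give a feel for the text classification use case, here is a minimal sketch of loading DistilBERT with a classification head through the transformers Auto classes. The label count and example sentence are placeholders, and the head is randomly initialized, so its scores are meaningless until the model is fine-tuned on labeled data.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Put a (randomly initialized) classification head on top of DistilBERT;
# num_labels=2 is an arbitrary choice for a binary task such as sentiment analysis
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

inputs = tokenizer("I really enjoyed this movie.", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_labels)
print(logits)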

How does it work?

The model was pretrained with a combination of three objectives (a rough sketch of how they combine appears after this list):

  1. Distillation loss: The model is trained to mimic the behavior of the BERT base model.
  2. Masked Language Modeling: The model is trained to predict missing words in a sentence.
  3. Cosine embedding loss: The model is trained to generate hidden states that are close to the BERT base model.
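
The sketch below is a rough, illustrative combination of these three objectives in PyTorch, not the original training code: a softened KL-divergence distillation loss against the teacher's MLM logits, the standard masked language modeling cross-entropy, and a cosine embedding loss pulling the student's hidden states toward the teacher's. The temperature and the equal weighting are placeholder choices.

import torch
import torch.nn.functional as F

def distillation_objectives(student_logits, teacher_logits, student_hidden, teacher_hidden, labels, T=2.0):
    # 1. Distillation loss: match the teacher's softened output distribution
    loss_distill = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * (T * T)

    # 2. Masked language modeling loss: predict the original tokens at masked positions
    # (assumes labels use -100 for positions that should be ignored)
    loss_mlm = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    # 3. Cosine embedding loss: keep student hidden states close to the teacher's
    target = torch.ones(student_hidden.size(0) * student_hidden.size(1), device=student_hidden.device)
    loss_cos = F.cosine_embedding_loss(
        student_hidden.view(-1, student_hidden.size(-1)),
        teacher_hidden.view(-1, teacher_hidden.size(-1)),
        target,
    )

    # Equal weighting is a placeholder; the actual training used its own weighting scheme
    return loss_distill + loss_mlm + loss_cos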

Performance

DistilBERT is a smaller and faster version of the popular BERT model. But how does it perform? Let’s take a closer look.

Speed

DistilBERT is designed to be faster than BERT, and it delivers: it runs about 60% faster than BERT base. That makes it well suited to applications where speed matters, such as real-time language processing or large-scale data analysis.

Accuracy

But speed is not the only thing that matters. DistilBERT also achieves high accuracy in various tasks, such as:

  • Masked language modeling: DistilBERT can predict missing words in a sentence with high accuracy.
  • GLUE benchmark: DistilBERT retains about 97% of BERT’s language understanding performance, as measured on the GLUE benchmark.
  • Sequence classification: DistilBERT can classify sequences of text into different categories.

Efficiency

DistilBERT is not only fast and accurate but also efficient. It has roughly 40% fewer parameters than BERT base (about 66M vs. 110M), making it more suitable for deployment on devices with limited resources; the snippet below shows one way to check this.
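
As a rough sanity check of the size difference, the following sketch counts parameters for both checkpoints with transformers (it downloads both models); exact totals may differ slightly from the rounded figures above.

from transformers import AutoModel

def count_parameters(name):
    model = AutoModel.from_pretrained(name)
    return sum(p.numel() for p in model.parameters())

# Roughly 66M for DistilBERT vs. roughly 110M for BERT base
print('distilbert-base-uncased:', count_parameters('distilbert-base-uncased'))
print('bert-base-uncased:      ', count_parameters('bert-base-uncased'))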

Limitations

While DistilBERT is a powerful model, it’s not perfect. It can have biased predictions, especially when it comes to sensitive topics. For example:

unmasker("The White man worked as a [MASK].")
unmasker("The Black woman worked as a [MASK].")

These biases can also affect fine-tuned versions of the model.

Format

DistilBERT is a smaller and faster version of the BERT model. It uses a transformer architecture and accepts input in the form of tokenized text sequences.

Architecture

DistilBERT is a distilled version of BERT, which means it was trained to mimic the behavior of BERT while being smaller and faster. It was pretrained on the same corpus as BERT in a self-supervised fashion, using BERT as a teacher model.

Data Formats

DistilBERT supports the following data formats:

  • Tokenized text sequences
  • Sentence pairs, which the tokenizer encodes as a single sequence separated by [SEP] tokens (see the sketch below)
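
Here is a minimal sketch of encoding a sentence pair (the example sentences are arbitrary); passing two texts to the tokenizer produces a single sequence of the form [CLS] sentence A [SEP] sentence B [SEP].

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# The tokenizer joins the two sentences into one sequence with special tokens
encoded_pair = tokenizer("The man went to the store.", "He bought some milk.", return_tensors='pt')
print(tokenizer.decode(encoded_pair['input_ids'][0]))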

Input Requirements

To use DistilBERT, you need to preprocess your input text data into tokenized sequences. You can use the DistilBertTokenizer class to do this.

Here’s an example of how to preprocess input text in PyTorch:

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
text = "Replace me by any text you'd like."
# encoded_input holds the input_ids and attention_mask tensors the model expects
encoded_input = tokenizer(text, return_tensors='pt')

Examples

Given the prompt “The new restaurant in town is looking for a [MASK].”, the model fills in the blank with a plausible word such as “chef”, producing “The new restaurant in town is looking for a chef.”

With task-specific fine-tuning, the same backbone can also make decisions about sentence pairs (for example, whether “Then he returned home and poured himself a glass.” plausibly follows “The man went to the store and bought some milk.”) or classify statements (for example, labeling “The capital of France is Berlin.” as false, since the capital of France is Paris).

Output Requirements

DistilBERT outputs a sequence of vectors, where each vector represents a token in the input sequence.

Here’s an example of how to get the output of DistilBERT in PyTorch:

from transformers import DistilBertModel

model = DistilBertModel.from_pretrained('distilbert-base-uncased')
output = model(**encoded_input)
# output.last_hidden_state has shape (batch_size, sequence_length, 768):
# one 768-dimensional vector per input token

Special Requirements

DistilBERT has some special requirements:

  • It’s primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering; a minimal question-answering sketch follows this list.
  • For tasks such as text generation, you should look at models like GPT-2.
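
For instance, extractive question answering works well with a DistilBERT checkpoint that has already been fine-tuned on SQuAD. The sketch below assumes the distilbert-base-uncased-distilled-squad checkpoint from the Hugging Face Hub; the question and context are arbitrary examples.

from transformers import pipeline

# Assumes the SQuAD fine-tuned DistilBERT checkpoint is available on the Hugging Face Hub
qa = pipeline('question-answering', model='distilbert-base-uncased-distilled-squad')

result = qa(
    question="What is DistilBERT a distilled version of?",
    context="DistilBERT is a smaller, faster distilled version of the BERT base model.",
)
print(result['answer'], result['score'])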

Dataloop's AI Development Platform

Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.

Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAIF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.

Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.