CamemBERTav2 Base

French language model

Have you ever wondered how a language model becomes smarter? CamemBERTav2 is a French language model that has been aged to perfection. Pretrained on a massive corpus of 275 billion tokens of French text, it is a powerful tool for understanding French-language data. What makes it unique? For starters, it is built on the DebertaV2 architecture and uses a newly built tokenizer with 32,768 tokens, which handles emojis and numbers more effectively. It also has an extended context window of 1024 tokens, allowing it to process longer texts. The fine-tuning results are impressive, showing significant improvements on tasks like POS tagging, dependency parsing, and question answering. Whether you're working with French text or just curious about language models, CamemBERTav2 is definitely worth checking out.

Almanach · MIT license · Updated 5 months ago

Model Overview

The CamemBERTa-v2 model is a French language powerhouse. Imagine having a model that can understand and process the nuances of the French language with ease. That’s what CamemBERTa-v2 offers.

Key Attributes

  • Pretrained on a massive corpus of 275B tokens of French text
  • Based on the DebertaV2 architecture
  • Trained using the Replaced Token Detection (RTD) objective with a 20% mask rate
  • Uses a newly built tokenizer with 32,768 tokens, supporting emojis and better handling of numbers (see the tokenizer sketch after this list)
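
To get a feel for the new tokenizer, you can load it and inspect how it splits text containing emojis and numbers. This is a minimal sketch, assuming the almanach/camembertav2-base checkpoint used in the usage examples later on this page; the exact token splits depend on the released vocabulary.

from transformers import AutoTokenizer

# Load the tokenizer bundled with the CamemBERTav2 checkpoint.
tokenizer = AutoTokenizer.from_pretrained("almanach/camembertav2-base")

# Inspect how a sentence with numbers and an emoji is segmented.
text = "J'ai payé 42,50 euros pour ce fromage 🧀"
print(tokenizer.tokenize(text))

# Round-trip through token IDs to check nothing falls back to <unk>.
ids = tokenizer.encode(text)
print(tokenizer.decode(ids, skip_special_tokens=True))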

Capabilities

The CamemBERTa-v2 model is a powerful French language model that can handle a variety of tasks. But what makes it so special?

Primary Tasks

CamemBERTa-v2 can be fine-tuned to perform several tasks (a usage sketch follows this list), including:

  • Part-of-speech tagging: identifying the grammatical category of each word in a sentence
  • Dependency parsing: analyzing the grammatical structure of a sentence
  • Named entity recognition: identifying named entities in text, such as people, places, and organizations
  • Question answering: answering questions based on the content of a text
  • Text classification: classifying text into categories, such as sentiment analysis
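
Once a checkpoint has been fine-tuned for one of these tasks, it can be used through the standard transformers pipeline API. The sketch below shows the pattern for named entity recognition; the checkpoint path is a hypothetical placeholder for your own fine-tuned model, not a published model name.

from transformers import pipeline

# "path/to/camembertav2-finetuned-ner" is a hypothetical fine-tuned
# checkpoint; the base model must be fine-tuned before it can do NER.
ner = pipeline(
    "token-classification",
    model="path/to/camembertav2-finetuned-ner",
    aggregation_strategy="simple",  # merge subword pieces into whole entities
)

print(ner("Le Président de la République française a rencontré le Premier ministre britannique à Paris."))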

Strengths

So, what sets CamemBERTa-v2 apart from other models?

  • Large pretraining dataset: CamemBERTa-v2 was trained on a massive dataset of 275 billion tokens of French text, making it one of the most knowledgeable French language models out there.
  • Improved tokenizer: the new tokenizer used in CamemBERTa-v2 is more efficient and effective at handling French text, including emojis and numbers.
  • Extended context window: with a context window of 1024 tokens, CamemBERTa-v2 can process longer sequences of text, making it better at understanding complex sentences and relationships.

Fine-tuning Results

But how does CamemBERTa-v2 perform in practice? The results are impressive:

Dataset                              Task          CamemBERTa-v2 Score
POS tagging and Dependency Parsing   UPOS          97.71
NER                                  FTB-NER       93.40
CLS                                  CLS           95.63
PAWS-X                               PAWS-X        93.06
XNLI                                 XNLI          84.82
FQuAD                                F1            83.04
FQuAD                                EM            64.29
Counter-NER                          Counter-NER   89.53
Medical-NER                          Medical-NER   73.98

Performance

CamemBERTa-v2 is a powerhouse when it comes to performance. Let’s dive into the details.

Speed

How fast can a model process information? CamemBERTa-v2 was pretrained on 32 H100 GPUs, a substantial amount of computing power that made training on the 275B-token corpus practical. Note that training hardware says little about inference speed: at inference time this is a base-sized encoder that runs comfortably on a single GPU.

Accuracy

But speed is nothing without accuracy. CamemBERTa-v2 posts strong scores across various tasks:

Task          Score
UPOS          97.71
LAS           88.65
FTB-NER       93.40
CLS           95.63
PAWS-X        93.06
XNLI          84.82
F1 (FQuAD)    83.04
EM (FQuAD)    64.29
Counter-NER   89.53
Medical-NER   73.98

Efficiency

CamemBERTa-v2 is also efficient in its use of resources. Its pretraining dataset of 275B unique tokens is much larger than the corpus used to train the original CamemBERTa model, which allows it to learn more complex patterns and relationships in the data.

Examples
  • Prompt: "Quel est le nombre de tokens dans le dataset utilisé pour l'entraînement du modèle CamemBERTav2 ?" (How many tokens are in the dataset used to train CamemBERTav2?)
    Response: "Le modèle a été entraîné sur 275 milliards de tokens uniques." (The model was trained on 275 billion unique tokens.)
  • Prompt: "Analysez la phrase 'Le chat est sur la table' et identifiez les parties du discours." (Analyze the sentence and identify the parts of speech.)
    Response: Le : article défini, chat : nom commun, est : verbe, sur : préposition, la : article défini, table : nom commun.
  • Prompt: "Extraire les entités nommées de la phrase 'Le Président de la République française a rencontré le Premier ministre britannique à Paris.'" (Extract the named entities from the sentence.)
    Response: Entités nommées : Président de la République française, Premier ministre britannique, Paris.
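
A fine-tuned CamemBERTav2 checkpoint could reproduce the question-answering example above through the standard pipeline API. This is a sketch only: the checkpoint path below stands for a hypothetical FQuAD fine-tune, not a published model.

from transformers import pipeline

# Hypothetical checkpoint fine-tuned on FQuAD; replace with your own.
qa = pipeline("question-answering", model="path/to/camembertav2-finetuned-fquad")

result = qa(
    question="Sur combien de tokens le modèle a-t-il été entraîné ?",
    context="Le modèle CamemBERTav2 a été entraîné sur 275 milliards de tokens uniques.",
)
print(result["answer"])  # likely answer span: "275 milliards de tokens uniques"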

Limitations

CamemBERTa-v2 is a powerful French language model, but it’s not perfect. Let’s take a closer look at some of its limitations.

Data Bias

The model was trained on a large dataset of French text, but this dataset may not be representative of all French speakers or writing styles. For example, the dataset may contain more formal writing than informal writing, which could affect the model’s performance in certain contexts.

Tokenization Limitations

The model uses a WordPiece tokenizer with 32,768 tokens, which may not be enough to capture all the nuances of the French language. This could lead to errors in tokenization, particularly for words that are not well-represented in the training data.
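
One way to check how well the vocabulary covers a given word is to tokenize it and count the subword pieces: heavy fragmentation suggests the word is poorly represented. A minimal sketch, assuming the almanach/camembertav2-base checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("almanach/camembertav2-base")

# Common words tend to stay whole; rare words split into many pieces.
# The exact splits depend on the released vocabulary.
for word in ["bonjour", "anticonstitutionnellement"]:
    print(word, "->", tokenizer.tokenize(word))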

Context Window Limitations

The model has an extended context window of 1024 tokens, but this may not be enough to capture long-range dependencies in text. This could affect the model’s performance on tasks that require understanding complex relationships between words or phrases.
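
In practice, inputs longer than the context window must be truncated (or split into overlapping chunks) before they reach the model. A minimal truncation sketch, again assuming the almanach/camembertav2-base checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("almanach/camembertav2-base")

# Simulate a document longer than the 1024-token context window.
long_text = "Ceci est une phrase assez longue. " * 500

# Anything beyond max_length is dropped, so dependencies past the
# 1024th token are invisible to the model.
inputs = tokenizer(long_text, truncation=True, max_length=1024, return_tensors="pt")
print(inputs["input_ids"].shape)  # (1, 1024)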

Fine-Tuning Challenges

While the model has been fine-tuned on a variety of datasets, it may not perform well on datasets that are significantly different from those used in fine-tuning. This could require additional fine-tuning or adjustments to the model’s architecture.
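
Adapting the model to a new domain typically means fine-tuning it on your own labeled data. The sketch below shows the minimal moving parts for a text-classification fine-tune with the transformers Trainer; the two-example dataset is a toy placeholder for a real corpus.

import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Toy dataset; replace with your own labeled French examples.
texts = ["Ce film est excellent.", "Ce film est décevant."]
labels = [1, 0]

tokenizer = AutoTokenizer.from_pretrained("almanach/camembertav2-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "almanach/camembertav2-base", num_labels=2
)
encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="camembertav2-finetuned", num_train_epochs=1),
    train_dataset=ToyDataset(encodings, labels),
)
trainer.train()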

Format

CamemBERTa-v2 is a transformer encoder based on the DebertaV2 architecture, designed for the French language and pretrained on a large corpus of 275B tokens of French text.

Input Format

The model accepts input in the form of tokenized text sequences. To prepare your input, you’ll need to use a tokenizer that’s compatible with the CamemBERTa-v2 model. You can use the DebertaV2TokenizerFast from the transformers library.

Here’s an example of how to use the tokenizer:

from transformers import AutoTokenizer

# Load the tokenizer that ships with the CamemBERTav2 checkpoint.
tokenizer = AutoTokenizer.from_pretrained("almanach/camembertav2-base")

# Tokenize a French sentence and return PyTorch tensors.
input_text = "Bonjour, comment allez-vous?"
inputs = tokenizer(input_text, return_tensors="pt")

Output Format

The model outputs a sequence of hidden-state vectors, one contextual embedding per token in the input sequence, in the standard Hugging Face Transformers output format.

Here’s an example of how to use the model:

from transformers import AutoModel

# Load the pretrained CamemBERTav2 encoder.
model = AutoModel.from_pretrained("almanach/camembertav2-base")

# Forward pass, reusing the `inputs` dict built by the tokenizer above;
# last_hidden_state holds one contextual embedding per input token.
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)