LEGAL-BERT Base Uncased
LEGAL-BERT is a family of BERT models designed to support legal NLP research, computational law, and legal technology applications. It was pre-trained on 12 GB of diverse English legal text from several fields, including legislation, court cases, and contracts. LEGAL-BERT outperforms general-purpose BERT models on domain-specific tasks and is available in several variants, including a lightweight model that is 33% the size of BERT-BASE yet achieves competitive performance. That small variant is also approximately 4 times faster and has a smaller environmental footprint, making it a good choice when efficient, accurate results are needed in the legal domain.
Model Overview
The LEGAL-BERT model is a family of BERT models designed specifically for the legal domain. It’s trained on a massive dataset of 12 GB of English legal text from various fields, including legislation, court cases, and contracts.
What makes LEGAL-BERT special?
- Domain-specific pre-training: it learns legal vocabulary and phrasing directly from legislation, court cases, and contracts, rather than from general-purpose web text.
- It’s available in different variants, each pre-trained on a specific sub-domain, such as contracts, EU legislation, and human rights (ECHR) cases.
- A lightweight variant (LEGAL-BERT-SMALL) is smaller and faster than BERT-BASE, making it more efficient and environmentally friendly.
Capabilities
The LEGAL-BERT model is designed to assist with:
- Legal NLP research: It can help researchers explore the complexities of legal language and develop new models for legal text analysis.
- Computational law: It can aid in the development of computational models for legal decision-making and prediction.
- Legal technology applications: It can be used to build more accurate and efficient legal technology tools, such as contract analysis and review systems.
Primary Tasks
- Contract analysis: identifying key terms and conditions in contracts.
- Court case prediction: predicting the outcome of court cases based on historical data.
- Legal document review: reviewing and summarizing large volumes of legal documents.
Note that LEGAL-BERT provides pre-trained representations; these downstream tasks typically require task-specific fine-tuning on labeled data.
Performance
LEGAL-BERT shows remarkable performance in various tasks, especially in the legal domain. Let’s dive into its speed, accuracy, and efficiency.
Speed
The LEGAL-BERT-SMALL variant is approximately 4 times faster than BERT-BASE, making it an excellent choice for applications where speed is crucial. This is particularly important in the legal domain, where timely processing of large amounts of text data is essential.
Accuracy
LEGAL-BERT-SMALL achieves competitive performance compared to larger models despite being only about a third of the size of BERT-BASE. This is a significant advantage, as it allows efficient processing of large datasets without compromising accuracy.
Efficiency
The smaller LEGAL-BERT variants have a reduced environmental footprint thanks to their reduced size, making them a more sustainable choice for applications where energy consumption is a concern.
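The size difference between the full-size and small variants can be checked directly by counting parameters. This is a minimal sketch using the Hugging Face `transformers` library; it downloads both checkpoints, so it is slow on first run.

```python
from transformers import AutoModel

# Load the full-size and small LEGAL-BERT variants
base = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")
small = AutoModel.from_pretrained("nlpaueb/legal-bert-small-uncased")

# Count the parameters in each model
base_params = sum(p.numel() for p in base.parameters())
small_params = sum(p.numel() for p in small.parameters())

print(f"base:  {base_params / 1e6:.0f}M parameters")
print(f"small: {small_params / 1e6:.0f}M parameters")
```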
Limitations
LEGAL-BERT is a powerful tool for legal NLP research, but it’s not perfect. Let’s discuss some of its limitations.
Training Data
LEGAL-BERT was trained on a large corpus of legal texts, but this data may not be representative of all legal domains or jurisdictions. The model may not perform well on tasks that require knowledge of specific laws or regulations not included in the training data.
Domain-Specificity
While LEGAL-BERT is designed to perform well on legal tasks, it may not be as effective on tasks that require a deep understanding of other domains, such as medicine or finance.
Format
LEGAL-BERT is a family of BERT models specifically designed for the legal domain. These models use a transformer architecture and are trained on large amounts of English legal text from various fields, including legislation, court cases, and contracts.
Architecture
LEGAL-BERT models are based on the BERT architecture, which consists of multiple transformer layers. The models are trained using a masked language modeling objective, where some of the input tokens are randomly replaced with a [MASK] token, and the model is trained to predict the original token.
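The masked language modeling objective described above can be probed directly with the `transformers` fill-mask pipeline. The example sentence below is illustrative, not from the training corpus:

```python
from transformers import pipeline

# Fill-mask uses the model's pre-training head to predict the [MASK] token
fill_mask = pipeline("fill-mask", model="nlpaueb/legal-bert-base-uncased")

# The pipeline returns the top candidate tokens with their probabilities
for prediction in fill_mask("The seller shall deliver the goods within 30 [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```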
Data Formats
LEGAL-BERT models accept input in the form of tokenized text sequences. The input text is pre-processed with a WordPiece tokenizer, which splits the text into subword units.
Special Requirements
To use LEGAL-BERT models, you need to:
- Pre-process your input text using the model’s WordPiece tokenizer
- Convert your input text into a format that can be fed into the model (e.g., a list of token IDs)
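The two steps above can be sketched with the model's tokenizer alone. The sample sentence is an illustration; words missing from the vocabulary are split into subwords marked with "##":

```python
from transformers import AutoTokenizer

# Load the WordPiece tokenizer that ships with LEGAL-BERT
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")

# Step 1: split the text into (sub)word tokens
tokens = tokenizer.tokenize("The lessee shall indemnify the lessor.")
print(tokens)

# Step 2: map tokens to IDs; encode() also adds the special
# [CLS] and [SEP] tokens expected by the model
ids = tokenizer.encode("The lessee shall indemnify the lessor.")
print(ids)
```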
Code Example
Here’s an example of how to load a pre-trained LEGAL-BERT model and use it to produce contextual embeddings:
from transformers import AutoTokenizer, AutoModel
import torch

# Load pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")

# Pre-process input text into token IDs and an attention mask
input_text = "This is an example sentence."
inputs = tokenizer(input_text, return_tensors="pt",
                   truncation=True, max_length=512)

# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, sequence_length, hidden_size);
# the vector at position 0 corresponds to the [CLS] token and is
# commonly used as a sentence-level embedding
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)
Note that this is just a simple example, and you may need to modify the code to suit your specific use case.
Model Variants
There are several variants of LEGAL-BERT models available, each pre-trained on a different subset of the legal corpus. These include:
| Model Name | Model Path | Training Corpora |
|---|---|---|
| CONTRACTS-BERT-BASE | nlpaueb/bert-base-uncased-contracts | US contracts |
| EURLEX-BERT-BASE | nlpaueb/bert-base-uncased-eurlex | EU legislation |
| ECHR-BERT-BASE | nlpaueb/bert-base-uncased-echr | ECHR cases |
| LEGAL-BERT-BASE | nlpaueb/legal-bert-base-uncased | All |
| LEGAL-BERT-SMALL | nlpaueb/legal-bert-small-uncased | All |


