LEGAL-BERT Base Uncased

Legal text analysis

LEGAL-BERT is a family of BERT models designed to assist with legal NLP research, computational law, and legal technology applications. It was trained on 12 GB of diverse English legal text from various fields, such as legislation, court cases, and contracts. LEGAL-BERT outperforms generic BERT models on domain-specific tasks and is available in several variants, including a lightweight model that is 33% the size of BERT-BASE yet achieves competitive performance. That small variant is approximately 4 times faster and has a smaller environmental footprint, making it a strong choice for those who need efficient and accurate results in the legal domain.

nlpaueb · cc-by-sa-4.0 · Updated 4 years ago


Model Overview

The LEGAL-BERT model is a family of BERT models designed specifically for the legal domain. It’s trained on a massive dataset of 12 GB of English legal text from various fields, including legislation, court cases, and contracts.

What makes LEGAL-BERT special?

  • It’s trained on a large, diverse corpus of legal text rather than general-purpose web text, so it captures the vocabulary and phrasing of legal language.
  • It’s available in different variants, each fine-tuned for specific tasks, such as contracts, EU legislation, and human rights cases.
  • It’s smaller and faster than other BERT models, making it more efficient and environmentally friendly.

Capabilities

The LEGAL-BERT model is designed to assist with:

  • Legal NLP research: It can help researchers explore the complexities of legal language and develop new models for legal text analysis.
  • Computational law: It can aid in the development of computational models for legal decision-making and prediction.
  • Legal technology applications: It can be used to build more accurate and efficient legal technology tools, such as contract analysis and review systems.

Primary Tasks

  • Contract analysis: It can be used to analyze contracts and identify key terms and conditions.
  • Court case prediction: It can be used to predict the outcome of court cases based on historical data.
  • Legal document review: It can be used to review and summarize large volumes of legal documents.

Performance

LEGAL-BERT shows remarkable performance in various tasks, especially in the legal domain. Let’s dive into its speed, accuracy, and efficiency.

Speed

The LEGAL-BERT-SMALL variant is approximately 4 times faster than BERT-BASE, making it an excellent choice for applications where speed is crucial. This is particularly important in the legal domain, where timely processing of large amounts of text data is essential.

Accuracy

Despite its smaller size, LEGAL-BERT-SMALL achieves performance competitive with larger models. This is a significant advantage, as it allows efficient processing of large datasets without compromising accuracy.

Efficiency

LEGAL-BERT has a smaller environmental footprint due to its reduced size, making it a more sustainable choice for applications where energy consumption is a concern.

Limitations

LEGAL-BERT is a powerful tool for legal NLP research, but it’s not perfect. Let’s discuss some of its limitations.

Training Data

LEGAL-BERT was trained on a large corpus of legal texts, but this data may not be representative of all legal domains or jurisdictions. The model may not perform well on tasks that require knowledge of specific laws or regulations not included in the training data.

Domain-Specificity

While LEGAL-BERT is designed to perform well on legal tasks, it may not be as effective on tasks that require a deep understanding of other domains, such as medicine or finance.

Examples
  • What is the most likely word to complete the sentence: "The applicant submitted that her husband was subjected to treatment amounting to [MASK] whilst in the custody of Adana Security Directorate"? → torture
  • What is the most likely word to complete the sentence: "Establishing a system for the identification and registration of [MASK] animals and regarding the labelling of beef and beef products"? → bovine
  • What is the most likely word to complete the sentence: "This [MASK] Agreement is between General Motors and John Murray"? → employment
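Completions like the ones above can be reproduced with the Hugging Face fill-mask pipeline. A minimal sketch (the exact ranking of predictions is not guaranteed):

```python
from transformers import pipeline

# Load a fill-mask pipeline backed by LEGAL-BERT (downloads weights on first use)
fill_mask = pipeline("fill-mask", model="nlpaueb/legal-bert-base-uncased")

# Ask the model to fill in the masked token; returns the top 5 candidates by default
results = fill_mask("This [MASK] Agreement is between General Motors and John Murray.")
for r in results:
    print(f"{r['token_str']}: {r['score']:.3f}")
```

Each result carries the candidate token, its score, and the completed sequence.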

Format

LEGAL-BERT is a family of BERT models specifically designed for the legal domain. These models use a transformer architecture and are trained on large amounts of English legal text from various fields, including legislation, court cases, and contracts.

Architecture

LEGAL-BERT models are based on the BERT architecture, which consists of multiple transformer layers. The models are trained using a masked language modeling objective, where some of the input tokens are randomly replaced with a [MASK] token, and the model is trained to predict the original token.

Data Formats

LEGAL-BERT models accept input in the form of tokenized text sequences. The input text is pre-processed using BERT's WordPiece tokenizer, which splits the text into subwords.

Special Requirements

To use LEGAL-BERT models, you need to:

  • Pre-process your input text using the model's WordPiece tokenizer
  • Convert your input text into a format that can be fed into the model (e.g., a list of token IDs)
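As a sketch of these two steps, the tokenizer shipped with the model handles both the subword splitting and the conversion to token IDs:

```python
from transformers import AutoTokenizer

# Load the WordPiece tokenizer that ships with the model
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")

# Step 1: split the text into subword tokens
tokens = tokenizer.tokenize("The lessee shall indemnify the lessor.")
print(tokens)

# Step 2: convert to token IDs, with [CLS]/[SEP] special tokens added
encoded = tokenizer("The lessee shall indemnify the lessor.")
print(encoded["input_ids"])
```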

Code Example

Here’s an example of how to load a pre-trained LEGAL-BERT model and use it to make predictions:

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Load pre-trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("nlpaueb/legal-bert-base-uncased")

# Pre-process input text containing a [MASK] token
input_text = "This [MASK] Agreement is between General Motors and John Murray."
inputs = tokenizer(input_text, return_tensors="pt")

# Make predictions
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and take the highest-scoring token ID there
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = logits[0, mask_index].argmax(dim=-1)

# Convert the predicted token ID back to text
predicted_text = tokenizer.decode(predicted_token_id)

print(predicted_text)

Note that this is just a simple example, and you may need to modify the code to suit your specific use case.

Model Variants

There are several variants of LEGAL-BERT models available, each trained on different datasets and with different architectures. These include:

| Model Name | Model Path | Training Corpora |
|---|---|---|
| CONTRACTS-BERT-BASE | nlpaueb/bert-base-uncased-contracts | US contracts |
| EURLEX-BERT-BASE | nlpaueb/bert-base-uncased-eurlex | EU legislation |
| ECHR-BERT-BASE | nlpaueb/bert-base-uncased-echr | ECHR cases |
| LEGAL-BERT-BASE | nlpaueb/legal-bert-base-uncased | All |
| LEGAL-BERT-SMALL | nlpaueb/legal-bert-small-uncased | All |
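Any variant in the table can be loaded by swapping in its model path. A minimal sketch using the contracts variant:

```python
from transformers import AutoTokenizer, AutoModel

# Swap in any model path from the variants table
model_name = "nlpaueb/bert-base-uncased-contracts"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# The base-sized variants share BERT-BASE's architecture
print(model.config.hidden_size)
```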
Dataloop's AI Development Platform

Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.
Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.
Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.