Llama Guard 3 8B
Llama Guard 3 is a content safety classifier that moderates both the prompts sent to an LLM and the responses it generates. Fine-tuned from Llama 3.1 8B, it classifies content against 14 hazard categories, including violent crimes, hate speech, and intellectual property infringement, and supports eight languages. When it flags content as unsafe, it also lists the specific categories that were violated, giving moderators a concrete signal to act on rather than a bare yes/no. It handles a broad range of inputs, from search queries to code interpreter tool calls, and it improves on its predecessors with higher accuracy and a lower false positive rate, making it a practical choice for keeping an online community safe.
Model Overview
The Llama Guard 3 model is a powerful tool for content safety classification. It’s a Llama-3.1-8B pretrained model, fine-tuned to classify content in both LLM inputs (prompt classification) and LLM responses (response classification). This model acts as an LLM, generating text that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.
Capabilities
Primary Tasks
- Content Safety Classification: Classify content into 14 hazard categories, including violent crimes, non-violent crimes, and sex-related crimes; the full set of category codes is sketched after this list.
- Multilingual Support: Supports content safety classification in 8 languages: English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai.
- Tool Use Capability: Detects code interpreter abuse, including denial-of-service attacks, container escapes, and privilege escalation exploits.
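These hazards are referred to by short category codes in the model's output. As a rough sketch (the label wording follows the MLCommons-aligned taxonomy; treat the exact strings as illustrative), a lookup table for turning codes into readable names could look like this:

```python
# Hazard category codes used by Llama Guard 3 (MLCommons-aligned taxonomy,
# plus S14 for code interpreter abuse). Useful for turning a verdict such
# as "unsafe\nS1" into human-readable labels.
HAZARD_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}
```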
Key Features
- Industry-Leading System-Level Safety: Designed to be deployed alongside Llama 3.1, moderating both prompts and responses to provide industry-leading system-level safety performance.
- Quantization: Available in half-precision and INT8-quantized versions; the 8-bit checkpoint is about 40% smaller with minimal impact on classification performance.
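For the 8-bit option, one route is to quantize at load time through transformers and bitsandbytes. This is a minimal sketch, assuming the bitsandbytes package is installed and a CUDA GPU is available; it is distinct from downloading an already-quantized INT8 checkpoint, but illustrates the idea:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-Guard-3-8B"

# Load the weights and quantize linear layers to 8-bit on the fly
# (requires the bitsandbytes package and a CUDA GPU).
quant_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
```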
Performance
Speed
Because the model only needs to generate a short verdict rather than a long response, classification calls are fast and large volumes of content can be moderated quickly.
Accuracy
Achieves high accuracy on content safety classification, outperforming models such as GPT4 on English, multilingual, and tool use evaluations (see the comparison below).
Efficiency
Runs out of the box with Hugging Face transformers; support was added in transformers version 4.43.
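Because support landed in transformers 4.43, a quick version check at startup can fail fast on older installs. A minimal sketch (the packaging dependency is assumed):

```python
from packaging import version
import transformers

# Llama Guard 3 support was added in transformers 4.43.
if version.parse(transformers.__version__) < version.parse("4.43.0"):
    raise RuntimeError("Please upgrade: pip install -U 'transformers>=4.43'")
```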
Comparison with Other Models
| Model | F1 Score ↑ | AUPRC ↑ | False Positive Rate ↓ |
|---|---|---|---|
| Llama Guard 2 | 0.877 | 0.927 | 0.081 |
| Llama Guard 3 | 0.939 | 0.985 | 0.040 |
| GPT4 | 0.805 | N/A | 0.152 |
Usage
You can use Llama Guard 3 with Hugging Face transformers (version 4.43 or later). Here's an example:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-Guard-3-8B"
device = "cuda"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map=device)

def moderate(chat):
    # Format the conversation with Llama Guard's chat template and move it to the GPU.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(device)
    # The verdict is short, so a small max_new_tokens budget is enough.
    output = model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    # Return only the newly generated verdict, not the echoed prompt.
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

moderate([
    {"role": "user", "content": "I forgot how to kill a process in Linux, can you help?"},
    {"role": "assistant", "content": "Sure! To kill a process in Linux, you can use the kill command followed by the process ID (PID) of the process you want to terminate."},
])
```
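The `moderate` call returns a short verdict: `safe`, or `unsafe` followed on the next line by the violated category codes (for example `unsafe` then `S1`). A small helper for turning that string into structured data might look like the sketch below; `parse_verdict` is an illustrative name, not part of the transformers API, and it assumes the output format described above:

```python
def parse_verdict(text: str):
    """Split Llama Guard's raw output into (is_safe, violated_category_codes)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or lines[0].lower() == "safe":
        return True, []
    # Category codes appear on the line after "unsafe", comma-separated.
    codes = lines[1].split(",") if len(lines) > 1 else []
    return False, [code.strip() for code in codes]

print(parse_verdict("safe"))            # (True, [])
print(parse_verdict("unsafe\nS1,S10"))  # (False, ['S1', 'S10'])
```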
Limitations
While Llama Guard 3 provides industry-leading safety performance, deploying it as a filter may increase refusals of benign prompts (false positives). The model's performance may also vary depending on the specific use case and deployment.
Language Limitations
Content safety classification is supported in the 8 languages listed above, but performance may vary across languages.
False Positives
The model may incorrectly classify some benign prompts or responses as unsafe.
Quantization
Quantization reduces deployment cost and memory footprint, but it may slightly degrade classification performance.
Training Data
Trained on a specific dataset, which may not cover all possible scenarios.
Industry Standards
Aligned with the MLCommons standardized hazards taxonomy, but there’s still a need for industry standards in the LLM safety and content evaluation space.
Deployment
Available by default on Llama 3.1 reference implementations, but deploying it may require additional configuration and customization.