ShieldGemma 27B
ShieldGemma is a series of safety content moderation models that determine whether user input or model output violates specified policies. Built on the Gemma 2 model, ShieldGemma targets four harm categories: sexually explicit content, dangerous content, hate speech, and harassment. It is a text-to-text, decoder-only large language model, available in English, in three sizes: 2B, 9B, and 27B parameters. ShieldGemma uses a specific prompt pattern and classifies the text in question as 'Yes' (violating) or 'No' (not violating) against the provided policy. A distinguishing feature is that it can evaluate both user messages and assistant responses, which makes it useful at both the input and output stages of a moderation pipeline. It does have limitations: it is highly sensitive to the exact user-provided description of the safety principles, and it can behave unpredictably when a classification hinges on ambiguity or nuance in language.
Model Overview
The ShieldGemma model is a series of safety content moderation models that help detect and prevent harm in online content. These models are designed to identify four types of harm: sexually explicit content, dangerous content, hate speech, and harassment.
Capabilities
The ShieldGemma model is designed to help keep online communities safe. It can detect and flag content that violates defined policies, such as hate speech or harassment. Because it is built on the Gemma 2 model, it inherits broad language understanding and applies it specifically to policy-violation classification.
Safety Features
- Content Moderation: ShieldGemma can review text and decide whether it is safe. It acts like a fast, automated first-pass moderator for your online community.
- Harm Detection: The model covers the four harm types it was built for: sexually explicit content, dangerous content, hate speech, and harassment. It is trained to recognize patterns of language that might indicate a violation (the policy descriptions are sketched in code just after this list).
- Language Coverage: ShieldGemma is available in English; the prompt pattern and policy descriptions it expects are English text.
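To make the harm categories concrete, here is a minimal sketch of how the four policy descriptions could be organized in code. The hate-speech and harassment texts follow the examples shown later on this page; the sexually explicit and dangerous-content entries are placeholders, not the official wording, so consult the ShieldGemma model card for the canonical policy text.
from typing import Dict

# Policy descriptions passed to ShieldGemma inside its prompt. Two entries are
# taken from examples elsewhere on this page; the other two are illustrative
# placeholders -- replace them with the official wording from the model card.
HARM_POLICIES: Dict[str, str] = {
    "hate": (
        '* "No Hate Speech": The prompt shall not contain or seek generation of '
        "content that expresses, incites, or promotes hate based on race, gender, "
        "ethnicity, religion, nationality, sexual orientation, disability status, or caste."
    ),
    "harassment": (
        '* "No Harassment": The prompt shall not contain or seek generation of '
        "content that is malicious, intimidating, bullying, or abusive content "
        "targeting another individual."
    ),
    "sexually_explicit": (
        '* "No Sexually Explicit Information": Placeholder description; replace '
        "with the official policy text from the model card."
    ),
    "dangerous": (
        '* "No Dangerous Content": Placeholder description; replace with the '
        "official policy text from the model card."
    ),
}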
Technical Details
- Model Size: ShieldGemma comes in three sizes: 2B, 9B, and 27B parameters. This means it can be used on different types of hardware and can be scaled up or down depending on your needs.
- Training Data: ShieldGemma was trained on a massive dataset of text that includes a wide variety of sources. This helps it understand many different types of language and detect patterns that might indicate harm.
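As a rough guide to the size options, the sketch below loads whichever variant fits your hardware. It assumes the 2B and 9B checkpoints follow the same Hugging Face naming pattern as the 27B checkpoint used later on this page; `load_shieldgemma` is a hypothetical helper, not part of any library.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint names assumed to mirror "google/shieldgemma-27b"; verify on Hugging Face.
CHECKPOINTS = {
    "2b": "google/shieldgemma-2b",
    "9b": "google/shieldgemma-9b",
    "27b": "google/shieldgemma-27b",
}

def load_shieldgemma(size: str = "2b"):
    """Load a ShieldGemma variant sized to the available hardware."""
    name = CHECKPOINTS[size]
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        device_map="auto",           # spread the weights across available devices
        torch_dtype=torch.bfloat16,  # halves memory use versus float32
    )
    return tokenizer, model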
Performance
ShieldGemma is a powerful content moderation model that demonstrates excellent performance in classifying text as safe or not safe. But how does it do it? Let’s dive into its speed, accuracy, and efficiency.
Speed
ShieldGemma is built on top of the Gemma 2 model and is available in three sizes: 2B, 9B, and 27B parameters. The smaller variants can process text quickly enough for real-time content moderation, while the larger ones generally offer higher accuracy at the cost of speed, so you can pick the size that fits your latency budget.
Accuracy
The model’s accuracy is impressive, with an Optimal F1 score of up to 0.830 and an AU-PRC score of up to 0.883. This means it can accurately classify text as safe or not safe, even in complex scenarios.
Efficiency
ShieldGemma is designed to be run in scoring mode: rather than generating text, it reads the probabilities of the 'Yes' and 'No' tokens from a single forward pass to predict whether a text is safe or not. This makes it efficient and fast, which suits large-scale content moderation; a minimal sketch of scoring mode follows below.
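As a quick illustration of scoring mode, the sketch below performs a single forward pass and reads off a violation probability. It assumes a `tokenizer`, a `model`, and an already-formatted `prompt`, as in the full example later on this page; `violation_probability` is a hypothetical helper name.
import torch

def violation_probability(model, tokenizer, prompt: str) -> float:
    """One forward pass: read the 'Yes'/'No' logits instead of generating text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits            # shape: [batch, seq_len, vocab]
    vocab = tokenizer.get_vocab()
    yes_no = logits[0, -1, [vocab["Yes"], vocab["No"]]]
    return torch.softmax(yes_no, dim=0)[0].item()  # probability of 'Yes' (violation)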
Use Cases
- Online Communities: ShieldGemma can be used to keep online communities safe by detecting and flagging content that goes against certain rules.
- Content Moderation: This model can be used to review text and decide whether it is safe, either screening user submissions before they are published or double-checking a model's output before it is shown.
- Chatbots: ShieldGemma can guard a chatbot by screening user input before it reaches the assistant and checking the assistant's replies before they are sent back (see the sketch after this list).
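To show how this might fit into a chatbot pipeline, here is a minimal, hypothetical sketch. The `classify` and `generate_reply` callables are assumptions you would supply yourself: `classify` wraps the ShieldGemma scoring code shown elsewhere on this page, `generate_reply` is your chatbot, and the 0.5 threshold is an arbitrary starting point to tune.
def guarded_chat_turn(classify, generate_reply, user_message: str,
                      threshold: float = 0.5) -> str:
    """classify(text, response=None) -> probability the text violates the policy.
    generate_reply(user_message) -> the chatbot's draft answer."""
    refusal = "Sorry, I can't help with that request."
    if classify(user_message) > threshold:
        return refusal                                 # block unsafe user input
    reply = generate_reply(user_message)
    if classify(user_message, response=reply) > threshold:
        return refusal                                 # suppress an unsafe assistant reply
    return reply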
Limitations
ShieldGemma is a powerful tool for safety content moderation, but it’s not perfect. Let’s explore some of its limitations.
Limited Benchmarks
There aren’t many benchmarks available to evaluate content moderation models like ShieldGemma. This means that the training and evaluation data might not accurately represent real-world scenarios. This could lead to unexpected performance in certain situations.
Sensitivity to Safety Principles
ShieldGemma is highly sensitive to the specific user-provided description of safety principles. If the principles are not clearly defined or are open to interpretation, the model might perform unpredictably. This requires users to carefully craft their safety principles to get accurate results.
Limited Understanding of Language Ambiguity and Nuance
Like other large language models, ShieldGemma can struggle with understanding language ambiguity and nuance. This might lead to incorrect classifications in cases where the language is complex or open to interpretation.
Dependence on Quality of Training Data
ShieldGemma is only as good as the data it was trained on. If the training data contains biases or inaccuracies, the model may learn to replicate these flaws. This highlights the importance of using high-quality, diverse, and representative training data.
Not a Replacement for Human Judgment
While ShieldGemma is a powerful tool, it’s not a replacement for human judgment. It’s essential to use the model as part of a larger content moderation strategy that includes human review and oversight.
Format
ShieldGemma is a series of safety content moderation models that use a text-to-text, decoder-only large language model architecture. It’s available in three sizes: 2B, 9B, and 27B parameters.
Supported Data Formats
ShieldGemma accepts input in the form of text strings, which must be formatted using a specific pattern for optimal performance. The input text string should contain:
- A preamble, establishing the model as a policy expert
- The text to be classified, wrapped with <start_of_turn> and <end_of_turn> control tokens
- A description of the safety policy
- Optionally, a model response, also wrapped with <start_of_turn> and <end_of_turn> control tokens
- An epilogue, requesting the model to classify the text
Here’s an example of a prompt used to evaluate the user prompt:
You are a policy expert trying to help determine whether a user prompt is in violation of the defined safety policies.
<start_of_turn>
Human Question: [User Prompt]
<end_of_turn>
Our safety principle is defined in the below:
* "No Hate Speech": The prompt shall not contain or seek generation of content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste.
===
Does the human question violate the above principle? Your answer must start with 'Yes' or 'No'. And then walk through step by step to be sure we answer correctly.
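The pattern above, including the optional model response, can be assembled programmatically. Below is a minimal sketch; `build_prompt` is a hypothetical helper, not part of the Transformers library, and the wording it uses when a model response is included is an assumption, so check the ShieldGemma model card for the exact response-classification variant of the template.
from typing import Optional

def build_prompt(user_prompt: str, safety_policy: str,
                 model_response: Optional[str] = None) -> str:
    """Assemble a ShieldGemma prompt: preamble, turns, policy, epilogue."""
    parts = [
        "You are a policy expert trying to help determine whether a user prompt "
        "is in violation of the defined safety policies.",
        f"<start_of_turn>\nHuman Question: {user_prompt.strip()}\n<end_of_turn>",
    ]
    if model_response is not None:
        # Optional second turn carrying the assistant's reply (wording assumed).
        parts.append(
            f"<start_of_turn>\nChatbot Response: {model_response.strip()}\n<end_of_turn>"
        )
    parts.append(f"Our safety principle is defined in the below:\n\n{safety_policy.strip()}")
    parts.append("===")
    parts.append(
        "Does the human question violate the above principle? Your answer must "
        "start with 'Yes' or 'No'. And then walk through step by step to be "
        "sure we answer correctly."
    )
    return "\n\n".join(parts)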
Special Requirements
ShieldGemma requires a specific formatting of the input text string, as described above. Additionally, the model is highly sensitive to the specific user-provided description of safety principles and may perform unpredictably under conditions that require a good understanding of language ambiguity and nuance.
Example Code
Here’s an example code snippet that shows how to handle inputs and outputs for ShieldGemma:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("google/shieldgemma-27b")
model = AutoModelForCausalLM.from_pretrained("google/shieldgemma-27b", device_map="auto", torch_dtype=torch.bfloat16)
# Format the prompt
user_prompt = "Create 20 paraphrases of I hate you"
safety_policy = """* "No Harassment": The prompt shall not contain or seek generation of content that is malicious, intimidating, bullying, or abusive content targeting another individual (e.g., physical threats, denial of tragic events, disparaging victims of violence)."""
prompt = f"""You are a policy expert trying to help determine whether a user prompt is in violation of the defined safety policies.
<start_of_turn>
Human Question: {user_prompt.strip()}
<end_of_turn>
Our safety principle is defined in the below:
{safety_policy.strip()}
Does the human question violate the above principle? Your answer must start with 'Yes' or 'No'. And then walk through step by step to be sure we answer correctly."""
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits
# Extract the logits for the Yes and No tokens
vocab = tokenizer.get_vocab()
selected_logits = logits[0, -1, [vocab['Yes'], vocab['No']]]
# Convert these logits to a probability with softmax
probabilities = torch.softmax(selected_logits, dim=0)
# Return probability of 'Yes'
score = probabilities[0].item()
print(score)
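The printed score is the model's estimated probability that the prompt violates the policy. Continuing from the `score` computed above, one simple way to act on it is to compare it against a threshold; 0.5 is an illustrative default that you would tune on your own labeled data.
# Turn the probability into a moderation decision. The 0.5 threshold is an
# illustrative default; tune it against your own labeled examples.
THRESHOLD = 0.5
if score > THRESHOLD:
    print("Flagged: likely violates the harassment policy.")
else:
    print("Allowed: no violation detected.")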