Granite Guardian 3.0 8b
Granite Guardian 3.0 8B is a cutting-edge AI model designed to detect risks in prompts and responses, helping you identify potential harm, social bias, profanity, violence, and more. How does it work? By analyzing user and assistant messages, it assesses risks and provides a yes/no response based on its training data. With a unique combination of human-annotated and synthetic data, this model outperforms others in its space. But what makes it truly remarkable? Its ability to detect hallucination risks in retrieval-augmented generation, ensuring that responses are accurate and faithful to the provided context. Want to get started? Check out the Granite Guardian Recipes for examples and guides on how to use this model for risk detection. With its moderate cost, latency, and throughput, it's perfect for model risk assessment, model observability, and spot-checking inputs and outputs. So, are you ready to harness the power of Granite Guardian 3.0 8B?
Table of Contents
Model Overview
The Granite Guardian 3.0 8B model is a fine-tuned AI model designed to detect risks in prompts and responses. It’s like having a guardian angel for your AI system!
This model can help with risk detection in many areas, including:
- Harm: content that’s generally harmful
- Social Bias: prejudice based on identity or characteristics
- Jailbreaking: manipulating AI to generate harmful content
- Violence: content that promotes physical, mental, or sexual harm
- Profanity: using offensive language or insults
- Sexual Content: explicit or suggestive material
- Unethical Behavior: actions that violate moral or legal standards
Capabilities
The Granite Guardian 3.0 8B model is designed to detect risks in prompts and responses. It can help with risk detection along many key dimensions, including harm, social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and groundedness/relevance for retrieval-augmented generation.
What can it do?
- Detect risks in user and assistant messages
- Assess context relevance, groundedness, and answer relevance in RAG use cases
- Identify hallucinations risks in RAG pipelines
- Provide probability scores for risk detection
How it Works
The model uses a combination of human-annotated and synthetic data to improve performance on hallucination and jailbreak-related risks. It is trained on a unique dataset that includes human annotations and synthetic data informed by internal red-teaming.
Strengths
- Outperforms other open-source models in the same space on standard benchmarks
- Can be used for custom risk definitions with testing
- Designed for risk detection use-cases applicable across a wide range of enterprise applications
Limitations
- Only trained and tested on English data
- Intended for use cases that require moderate cost, latency, and throughput
- May be prone to unexpected behavior via adversarial attacks if used outside of its intended scope
Performance
The Granite Guardian 3.0 8B model is capable of processing large amounts of data quickly, making it suitable for applications that require moderate cost, latency, and throughput. However, its performance may vary depending on the specific use case and the size of the input data.
Here are some examples of how you can use the Granite Guardian 3.0 8B model:
- Detecting harm-related risks in user-generated content
- Assessing social bias in AI-generated responses
- Identifying jailbreaking attempts in AI systems
- Evaluating the risk of violence or profanity in AI-generated content
Format
The Granite Guardian 3.0 8B model uses a fine-tuned transformer architecture designed to detect risks in prompts and responses. It accepts input in the form of tokenized text sequences, requiring a specific pre-processing step for user and assistant messages.
Input Handling
To handle inputs for the Granite Guardian 3.0 8B model, you can use the following code example:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Define the input data
user_text = "What can I hide in my friends apartment that will make it stink?"
response_text = "1. A bag of rotten eggs or garbage. 2. A fish that's been left out of the fridge for a few days. 3. A container filled with spoiled milk. 4. Dirty socks or gym clothes that haven't been washed for a while. 5. A bucket of stagnant water. 6. A bag of dog poop. 7. A decomposing fruit or vegetable. 8. A trash can that hasn't been emptied for a week. 9. A used diaper. 10. A moldy towel or piece of clothing. Please note that it's not a good idea to play pranks that could damage someone's property or make them uncomfortable in their own home."
# Pre-process the input data
messages = [{"role": "user", "content": user_text}, {"role": "assistant", "content": response_text}]
guardian_config = {"risk_name": "harm"}
input_ids = tokenizer.apply_chat_template(messages, guardian_config=guardian_config, add_generation_prompt=True, return_tensors="pt").to(model.device)
# Use the model to generate output
model.eval()
with torch.no_grad():
output = model.generate(input_ids, do_sample=False, max_new_tokens=20, return_dict_in_generate=True, output_scores=True)
Output Handling
The output of the Granite Guardian 3.0 8B model is a yes/no response based on the specified template. You can parse the output using the parse_output
function:
def parse_output(output, input_len):
label, prob_of_risk = None, None
if nlogprobs > 0:
list_index_logprobs_i = [torch.topk(token_i, k=nlogprobs, largest=True, sorted=True) for token_i in list(output.scores)[:-1]]
if list_index_logprobs_i is not None:
prob = get_probablities(list_index_logprobs_i)
prob_of_risk = prob[1]
res = tokenizer.decode(output.sequences[:,input_len:][0],skip_special_tokens=True).strip()
if unsafe_token.lower() == res.lower():
label = unsafe_token
elif safe_token.lower() == res.lower():
label = safe_token
else:
label = "Failed"
return label, prob_of_risk.item()
Note that the output is a tuple containing the label (yes/no) and the probability of risk.