Granite Guardian 3.0 2B
Granite Guardian 3.0 2B is an AI model designed to detect risks in prompts and responses. It is trained on unique data, including human annotations and synthetic data, and outperforms other open-source models in the same space. The model targets risk definitions such as general harm, social bias, profanity, violence, sexual content, unethical behavior, and jailbreaking, as well as groundedness and relevance checks for retrieval-augmented generation (RAG); custom risk definitions are also supported but require testing. It is intended for use cases that require moderate cost, latency, and throughput. Because it detects risks and returns yes/no verdicts according to a specified template, Granite Guardian 3.0 2B is a valuable tool for risk assessment, model observability, and monitoring.
Model Overview
The Granite Guardian 3.0 2B model is a fine-tuned AI model designed to detect risks in prompts and responses. It’s like a guardian angel for your conversations!
Capabilities
The model can detect risks across a range of dimensions, including harm, social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and groundedness/relevance for retrieval-augmented generation. Specifically, it can:
- Detect risks in user and assistant messages
- Assess context relevance, groundedness, and answer relevance in RAG use cases
- Identify potential harm, social bias, profanity, violence, sexual content, and unethical behavior
- Detect jailbreaking attempts and hallucinations in RAG pipelines
The model outperforms other open-source models in the same space on standard benchmarks.
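For illustration, here is a minimal sketch of invoking the model through Hugging Face transformers to screen a user prompt for the general-harm risk. The checkpoint id (`ibm-granite/granite-guardian-3.0-2b`) and the `guardian_config`/`risk_name` template argument are assumptions about the published interface; check the official model card for the exact usage.

```python
# Minimal sketch (not the official recipe): screening a user prompt for the
# "harm" risk with Hugging Face transformers. The model id and the
# guardian_config template argument are assumptions; verify them against
# the official model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-guardian-3.0-2b"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model.eval()

# One user turn to screen; the guardian config picks which risk to test for.
messages = [{"role": "user", "content": "How do I pick a lock to get into my neighbor's house?"}]
guardian_config = {"risk_name": "harm"}  # e.g. harm, social_bias, jailbreak, ...

input_ids = tokenizer.apply_chat_template(
    messages,
    guardian_config=guardian_config,
    add_generation_prompt=True,
    return_tensors="pt",
)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=20)

# The model replies with a yes/no verdict according to its template.
verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(verdict.strip())  # expected: "Yes" (risk detected) or "No" (no risk)
```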
Performance
The model performs strongly at detecting risks in prompts and responses.
- Efficient Risk Detection: The model is designed for quick, efficient risk detection, making it suitable for use cases that require moderate cost, latency, and throughput.
- Optimized for Real-World Applications: The model is intended for use cases such as model risk assessment, model observability and monitoring, and spot-checking inputs and outputs.
Limitations
While the model is powerful, it’s not perfect. Here are some limitations to consider:
- Data Bias: The model is trained on a combination of human-annotated and synthetic data, which may still reflect biases present in the data collection process.
- Limited Scope: The model is only trained and tested on English data, which means it may not perform well on non-English inputs.
- Adversarial Attacks: The model may be prone to adversarial attacks, which could lead to unexpected or harmful outputs.
Format
The model uses a transformer architecture and accepts tokenized text as input; user and assistant messages must first be wrapped in the model's risk-detection template as a pre-processing step (a sketch of the rendered prompt follows the list below).
- Architecture: The model is a fine-tuned version of the Granite 3.0 2B Instruct model, trained on a combination of human-annotated and synthetic data.
- Data Formats: Inputs are tokenized text sequences produced from that template, designed specifically for risk-detection use cases.
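To see what the pre-processing step produces, the template can be rendered to plain text before tokenization. This sketch reuses the `tokenizer` and `messages` from the Capabilities sketch above; the `guardian_config` argument carries the same assumption noted there.

```python
# Sketch: inspect the rendered risk-detection prompt (reuses `tokenizer` and
# `messages` from the Capabilities sketch). tokenize=False returns the raw
# templated string instead of token ids.
prompt_text = tokenizer.apply_chat_template(
    messages,
    guardian_config={"risk_name": "harm"},
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt_text)  # the risk definition plus the wrapped user message
```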
Example Use Cases
- Risk Detection in Prompts and Responses: Screen user and assistant messages for the risks listed above, for example to spot-check inputs and outputs or to monitor conversations in production.
- Hallucination Risk Assessment: Flag responses in retrieval-augmented generation (RAG) pipelines that are not grounded in the retrieved context, helping keep generated answers faithful to the provided documents (see the sketch below).
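Below is a hedged sketch of a groundedness check on a RAG answer, reusing the `model` and `tokenizer` loaded in the Capabilities sketch. The `context` role and the `groundedness` risk name are assumptions about the template's interface; confirm the exact roles and risk identifiers in the model card.

```python
# Hedged sketch: flag a RAG answer that is not supported by its retrieved
# context. Reuses `model` and `tokenizer` from the Capabilities sketch; the
# "context" role and "groundedness" risk name are assumptions.
import torch

context_text = "The 2023 report states total revenue of $4.2 billion, up 8% year over year."
response_text = "Per the report, revenue grew 15% to $5 billion in 2023."

messages = [
    {"role": "context", "content": context_text},     # retrieved passage
    {"role": "assistant", "content": response_text},  # generated answer to audit
]

input_ids = tokenizer.apply_chat_template(
    messages,
    guardian_config={"risk_name": "groundedness"},
    add_generation_prompt=True,
    return_tensors="pt",
)
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=20)

verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(verdict.strip())  # "Yes" flags a hallucination risk (answer not grounded)
```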
Evaluation Metrics
The model's performance is evaluated with standard classification metrics such as F1 score and recall; the table below reports F1 on a set of public safety benchmarks.
| Metric | AegisSafetyTest | BeaverTails | OAI moderation | SafeRLHF(test) | HarmBench | SimpleSafety | ToxicChat | xstest_RH | xstest_RR | xstest_RR(h) | Aggregate F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| F1 | 0.84 | 0.75 | 0.60 | 0.77 | 0.98 | 1.00 | 0.37 | 0.82 | 0.38 | 0.74 | 0.67 |
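For reference, the reported numbers are F1 scores over binary yes/no verdicts. The toy sketch below shows how F1 and recall could be computed from such verdicts using scikit-learn; this is an illustrative assumption, not IBM's evaluation harness, and the labels shown are hypothetical.

```python
# Toy sketch (not IBM's evaluation harness): F1 and recall over binary
# risk verdicts, with 1 meaning "risk detected".
from sklearn.metrics import f1_score, recall_score

gold      = [1, 1, 0, 1, 0, 0, 1, 0]  # hypothetical benchmark labels
predicted = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical model verdicts

print("F1:    ", round(f1_score(gold, predicted), 2))      # 2PR / (P + R)
print("Recall:", round(recall_score(gold, predicted), 2))  # TP / (TP + FN)
```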