Granite Guardian 3.0 2B

Risk detection model

Granite Guardian 3.0 2B is an AI model designed to detect risks in prompts and responses. It is trained on unique data, including human annotations and synthetic data, and outperforms other open-source models in the same space. It covers risk definitions such as general harm, social bias, profanity, violence, sexual content, unethical behavior, and jailbreaking, as well as groundedness and relevance checks for retrieval-augmented generation (RAG); custom risk definitions can also be used but require testing. The model is intended for use cases that require moderate cost, latency, and throughput. Given a conversation and a risk definition supplied through its prompt template, it answers Yes or No, making it a practical tool for risk assessment, model observability, and monitoring.

IBM Granite · Apache-2.0 license

Model Overview

The Granite Guardian 3.0 2B model is a fine-tuned AI model designed to detect risks in prompts and responses. It’s like a guardian angel for your conversations!

Capabilities

The model can detect a broad range of risks across prompts, responses, and RAG pipelines:

  • Detect risks in user and assistant messages
  • Assess context relevance, groundedness, and answer relevance in RAG use cases
  • Identify potential harm, social bias, profanity, violence, sexual content, and unethical behavior
  • Detect jailbreaking attempts and hallucinations in RAG pipelines

The model outperforms other open-source models in the same space on standard benchmarks.
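
As a quick reference, here is a minimal sketch of those risk dimensions expressed as the risk_name strings that might be passed to the model's prompt template. The exact string identifiers are assumptions here; check the official model card for the canonical names.

```python
# Hedged sketch: risk dimensions as the risk_name strings one might pass to the
# chat template's guardian config. The exact identifiers are assumptions.
RISK_NAMES = [
    "harm",                # umbrella category for general harm
    "social_bias",
    "profanity",
    "violence",
    "sexual_content",
    "unethical_behavior",
    "jailbreak",
    # RAG-specific checks
    "context_relevance",
    "groundedness",
    "answer_relevance",
]
```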

Performance

The model delivers strong performance when detecting risks in prompts and responses (see Evaluation Metrics below).

  • Efficient Risk Detection: At 2B parameters, the model is compact enough to provide quick risk detection, making it suitable for use cases that require moderate cost, latency, and throughput.
  • Optimized for Real-World Applications: The model is intended for use cases such as model risk assessment, model observability and monitoring, and spot-checking inputs and outputs.

Limitations

While the model is powerful, it’s not perfect. Here are some limitations to consider:

  • Data Bias: The model is trained on a combination of human-annotated and synthetic data, which may still reflect biases present in the data collection process.
  • Limited Scope: The model is trained and tested only on English-language data, so it may not perform well on non-English inputs.
  • Adversarial Attacks: The model may be prone to adversarial attacks, which could lead to unexpected or harmful outputs.

Format

The model is a decoder-only transformer that requires a specific pre-processing step: user and assistant messages (plus optional retrieved context) are wrapped by its chat template together with the risk definition to check, and the model answers with a Yes/No verdict.

  • Architecture: The model is a fine-tuned version of the Granite 3.0 2B Instruct model, trained on a combination of human-annotated and synthetic data.
  • Data Formats: Inputs are user/assistant messages (and, for RAG, retrieved context) formatted with the model's chat template and the chosen risk definition; the output is a Yes or No token indicating whether the risk is present.
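
A minimal usage sketch under these assumptions: the Hugging Face transformers library, the ibm-granite/granite-guardian-3.0-2b checkpoint name, and a chat template that accepts a guardian_config argument with a risk_name field, following the pattern described in the upstream model card. Treat it as illustrative rather than the definitive integration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; adjust to the checkpoint you actually deploy.
model_path = "ibm-granite/granite-guardian-3.0-2b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", torch_dtype=torch.bfloat16
)
model.eval()

# The chat template wraps the conversation together with the risk definition
# to check (here: general harm).
messages = [{"role": "user", "content": "How can I figure out my neighbor's wifi password?"}]
guardian_config = {"risk_name": "harm"}  # assumed identifier

input_ids = tokenizer.apply_chat_template(
    messages,
    guardian_config=guardian_config,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# The model answers with a short Yes/No verdict.
with torch.no_grad():
    output = model.generate(input_ids, do_sample=False, max_new_tokens=20)

verdict = tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True).strip()
print(verdict)  # expected: "Yes" or "No"
```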

Examples

  • What is the risk level of the following user message: 'I hate people from Karethia, they are so stupid.' → Risk detected: Yes, probability of risk: 0.943
  • Assess the groundedness of the following assistant response: 'The film Eat was first shown by Jonas Mekas on December 24, 1922 at the Washington Square Gallery at 530 West Broadway.' → Risk detected: Yes, probability of risk: 0.971
  • Determine the harm level of the following assistant response: '1. A bag of rotten eggs or garbage. 2. A fish that's been left out of the fridge for a few days.' → Risk detected: Yes, probability of risk: 0.924
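
The "probability of risk" in these examples can be understood as a softmax over the logits of the Yes and No answer tokens. The sketch below illustrates that arithmetic with made-up logits; it is not the model's official post-processing.

```python
import math

def risk_probability(yes_logit: float, no_logit: float) -> float:
    """Softmax over the Yes/No answer-token logits -> probability of risk.

    Illustrative only; the official post-processing may differ.
    """
    yes, no = math.exp(yes_logit), math.exp(no_logit)
    return yes / (yes + no)

# Made-up logits: a 'Yes' logit 2.8 above 'No' yields roughly the 0.943
# probability shown in the first example above.
print(round(risk_probability(2.8, 0.0), 3))  # 0.943
```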

Example Use Cases

  • Risk Detection in Prompts and Responses: The model can be used to detect risks in user and assistant messages, providing a safe and reliable way to monitor and manage conversations.
  • Hallucination Risk Assessment: The model can be used to assess hallucination risks in retrieval-augmented generation (RAG) use cases, ensuring that generated responses are accurate and faithful to the provided context.
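
For the RAG case, here is a hedged sketch of a groundedness check. The "context" role and the "groundedness" risk name follow the upstream model card's RAG examples but should be treated as assumptions, and the retrieved passage is a stand-in.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-guardian-3.0-2b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# Hypothetical retrieved passage and the generated answer to verify against it.
context = "Eat is a 1963 American underground film directed by Andy Warhol."
answer = ("The film Eat was first shown by Jonas Mekas on December 24, 1922 "
          "at the Washington Square Gallery at 530 West Broadway.")

messages = [
    {"role": "context", "content": context},    # assumed role name for RAG context
    {"role": "assistant", "content": answer},
]
guardian_config = {"risk_name": "groundedness"}  # assumed identifier

input_ids = tokenizer.apply_chat_template(
    messages,
    guardian_config=guardian_config,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, do_sample=False, max_new_tokens=20)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True).strip())
```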

Evaluation Metrics

The model’s performance is evaluated using a range of metrics, including F1 scores and recall.

F1 scores on harm benchmarks:

  • AegisSafetyTest: 0.84
  • BeaverTails: 0.75
  • OAI moderation: 0.60
  • SafeRLHF (test): 0.77
  • HarmBench: 0.98
  • SimpleSafety: 1.00
  • ToxicChat: 0.37
  • xstest_RH: 0.82
  • xstest_RR: 0.38
  • xstest_RR(h): 0.74
  • Aggregate F1: 0.67
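
For context, here is a small sketch of how an F1 score and recall are computed from binary risky/safe decisions; the labels below are made up and unrelated to the reported numbers.

```python
from sklearn.metrics import f1_score, recall_score

# Made-up gold labels and model decisions (1 = risky, 0 = safe).
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("F1:", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
print("Recall:", recall_score(y_true, y_pred)) # fraction of risky items caught
```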