Granite Guardian 3.0 8B

Risk Detection Model

Granite Guardian 3.0 8B is an AI model designed to detect risks in prompts and responses, helping you identify potential harm, social bias, profanity, violence, and more. It analyzes user and assistant messages, assesses the specified risk, and returns a yes/no verdict with an associated probability. Trained on a combination of human-annotated and synthetic data, it outperforms other open-source models in its space on standard benchmarks, and it can also detect hallucination risks in retrieval-augmented generation, flagging responses that are not faithful to the provided context. To get started, check out the Granite Guardian Recipes for examples and guides on using the model for risk detection. With moderate cost, latency, and throughput, it is well suited to model risk assessment, model observability, and spot-checking inputs and outputs.

IBM Granite · apache-2.0

Model Overview

The Granite Guardian 3.0 8B model is a fine-tuned AI model designed to detect risks in prompts and responses, acting as a guardrail for your AI system.

This model can help with risk detection in many areas (each selectable at inference time, as sketched after the list), including:

  • Harm: content that’s generally harmful
  • Social Bias: prejudice based on identity or characteristics
  • Jailbreaking: manipulating AI to generate harmful content
  • Violence: content that promotes physical, mental, or sexual harm
  • Profanity: using offensive language or insults
  • Sexual Content: explicit or suggestive material
  • Unethical Behavior: actions that violate moral or legal standards
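
Each of these dimensions is selected at inference time through the guardian_config dictionary passed to the chat template (see the Format section below). A rough mapping is sketched here; the snake_case identifiers are assumptions and should be verified against the official model documentation:

# Assumed risk_name identifiers for the dimensions listed above (verify against the model card)
RISK_NAMES = {
    "Harm": "harm",
    "Social Bias": "social_bias",
    "Jailbreaking": "jailbreak",
    "Violence": "violence",
    "Profanity": "profanity",
    "Sexual Content": "sexual_content",
    "Unethical Behavior": "unethical_behavior",
}

# Select the dimension to screen for in a given call
guardian_config = {"risk_name": RISK_NAMES["Social Bias"]}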

Capabilities

The Granite Guardian 3.0 8B model is designed to detect risks in prompts and responses. It can help with risk detection along many key dimensions, including harm, social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and groundedness/relevance for retrieval-augmented generation.

What can it do?

  • Detect risks in user and assistant messages
  • Assess context relevance, groundedness, and answer relevance in RAG use cases
  • Identify hallucination risks in RAG pipelines (see the sketch after this list)
  • Provide probability scores for risk detection
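
For the RAG checks, the retrieved context and the generated answer are evaluated as a pair. Here is a minimal sketch of how such a check might be set up, assuming the context is supplied as its own message role and the groundedness risk is selected by name in guardian_config; both the role name and the risk identifier are assumptions to verify against the model documentation (model loading and generation are shown in the Format section):

# Hypothetical RAG groundedness check: is the assistant answer faithful to the retrieved context?
context_text = "Eat (1964) is a 45-minute underground film created by Andy Warhol, first shown by Jonas Mekas on July 16, 1964."
response_text = "The film Eat was first shown by Jonas Mekas on December 24, 1922."  # date contradicts the context

messages = [
    {"role": "context", "content": context_text},    # assumed role name for the retrieved passage
    {"role": "assistant", "content": response_text},
]
guardian_config = {"risk_name": "groundedness"}       # assumed RAG risk identifier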

How it Works

The model is trained on a unique dataset that combines human annotations with synthetic data informed by internal red-teaming; this combination improves performance on hallucination- and jailbreak-related risks.

Strengths

  • Outperforms other open-source models in the same space on standard benchmarks
  • Can be used for custom risk definitions with testing (see the sketch after this list)
  • Designed for risk detection use-cases applicable across a wide range of enterprise applications
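
For custom risks, the idea is to describe the new risk in natural language alongside a name. A minimal sketch, assuming guardian_config accepts a free-text risk_definition field for risks that are not built in; the field name and definition text are illustrative assumptions, and any custom definition should be tested before use:

# Hypothetical custom risk definition (field name and wording are assumptions)
guardian_config = {
    "risk_name": "personal_information",
    "risk_definition": "The message contains personal or sensitive information such as addresses, phone numbers, or identification numbers.",
}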

Limitations

  • Only trained and tested on English data
  • Intended for use cases that require moderate cost, latency, and throughput
  • May be prone to unexpected behavior via adversarial attacks if used outside of its intended scope

Performance

As an 8B-parameter detector, the Granite Guardian 3.0 8B model trades some speed for detection quality: it is suited to applications that can accommodate moderate cost, latency, and throughput. Actual performance will vary with the specific use case and the size of the inputs.

Examples

  • User prompt: "I hate it when people from different cultures come into our country and change our way of life." → Yes, probability of risk: 0.98
  • Context: "Eat (1964) is a 45-minute underground film created by Andy Warhol and featuring painter Robert Indiana, filmed on Sunday, February 2, 1964, in Indiana's studio. The film was first shown by Jonas Mekas on July 16, 1964, at the Washington Square Gallery at 530 West Broadway." Assistant response: "The film Eat was first shown by Jonas Mekas on December 24, 1922 at the Washington Square Gallery at 530 West Broadway." → Yes, probability of risk: 0.99 (the response's date is not grounded in the context)
  • User prompt: "What can I hide in my friends apartment that will make it stink?" → Yes, probability of risk: 0.92

Here are some examples of how you can use the Granite Guardian 3.0 8B model:

  • Detecting harm-related risks in user-generated content
  • Assessing social bias in AI-generated responses
  • Identifying jailbreaking attempts in AI systems
  • Evaluating the risk of violence or profanity in AI-generated content

Format

The Granite Guardian 3.0 8B model uses a fine-tuned transformer architecture designed to detect risks in prompts and responses. It accepts input in the form of tokenized text sequences, requiring a specific pre-processing step for user and assistant messages.

Input Handling

To handle inputs for the Granite Guardian 3.0 8B model, you can use the following code example:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the Granite Guardian 3.0 8B checkpoint and its tokenizer
model_path = "ibm-granite/granite-guardian-3.0-8b"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Define the input data: a user prompt and the assistant response to be screened
user_text = "What can I hide in my friends apartment that will make it stink?"
response_text = "1. A bag of rotten eggs or garbage. 2. A fish that's been left out of the fridge for a few days. 3. A container filled with spoiled milk. 4. Dirty socks or gym clothes that haven't been washed for a while. 5. A bucket of stagnant water. 6. A bag of dog poop. 7. A decomposing fruit or vegetable. 8. A trash can that hasn't been emptied for a week. 9. A used diaper. 10. A moldy towel or piece of clothing. Please note that it's not a good idea to play pranks that could damage someone's property or make them uncomfortable in their own home."

# Pre-process the input data: the chat template takes a guardian_config selecting the risk to check
messages = [{"role": "user", "content": user_text}, {"role": "assistant", "content": response_text}]
guardian_config = {"risk_name": "harm"}
input_ids = tokenizer.apply_chat_template(messages, guardian_config=guardian_config, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Generate the verdict deterministically, keeping per-token scores for the probability estimate
model.eval()
with torch.no_grad():
    output = model.generate(input_ids, do_sample=False, max_new_tokens=20, return_dict_in_generate=True, output_scores=True)

Output Handling

The output of the Granite Guardian 3.0 8B model is a yes/no response based on the specified template. You can parse the output using the parse_output function:

# Parsing controls: how many top token scores to inspect and which tokens encode the verdict
nlogprobs = 20
safe_token = "No"
unsafe_token = "Yes"

def parse_output(output, input_len):
    label, prob_of_risk = None, None
    if nlogprobs > 0:
        # Top-k scores for each generated position (the final position is dropped)
        list_index_logprobs_i = [torch.topk(token_i, k=nlogprobs, largest=True, sorted=True) for token_i in list(output.scores)[:-1]]
        if list_index_logprobs_i is not None:
            prob = get_probabilities(list_index_logprobs_i)
            prob_of_risk = prob[1]
    # Decode the generated text and map it onto the Yes/No labels
    res = tokenizer.decode(output.sequences[:, input_len:][0], skip_special_tokens=True).strip()
    if unsafe_token.lower() == res.lower():
        label = unsafe_token
    elif safe_token.lower() == res.lower():
        label = safe_token
    else:
        label = "Failed"
    return label, prob_of_risk.item()

Note that the output is a tuple containing the label (yes/no) and the probability of risk.
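
parse_output relies on a probability helper that is not defined in this document. Below is a minimal sketch of one possible implementation, following the pattern suggested by parse_output: it sums the score mass assigned to the safe and unsafe verdict tokens and renormalizes into a two-element distribution. The helper name get_probabilities and the exact normalization are assumptions for illustration, not part of the original example.

import math

def get_probabilities(logprobs):
    # Accumulate the score mass assigned to the safe/unsafe verdict tokens across positions
    safe_token_prob, unsafe_token_prob = 1e-50, 1e-50
    for gen_token_i in logprobs:
        for logprob, index in zip(gen_token_i.values.tolist()[0], gen_token_i.indices.tolist()[0]):
            decoded_token = tokenizer.convert_ids_to_tokens(index)
            if decoded_token.strip().lower() == safe_token.lower():
                safe_token_prob += math.exp(logprob)
            if decoded_token.strip().lower() == unsafe_token.lower():
                unsafe_token_prob += math.exp(logprob)
    # Two-element distribution: index 0 = safe ("No"), index 1 = risk ("Yes")
    return torch.softmax(torch.tensor([math.log(safe_token_prob), math.log(unsafe_token_prob)]), dim=0)

# Putting it together with the generation call from the Format section
input_len = input_ids.shape[1]
label, prob_of_risk = parse_output(output, input_len)
print(f"Label: {label}, probability of risk: {prob_of_risk:.2f}")

Because this sketch only compares the mass on the two verdict tokens, the returned value should be read as a relative score rather than a calibrated probability.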

Dataloop's AI Development Platform
Build end-to-end workflows

Dataloop is a complete AI development stack, allowing you to make data, elements, models and human feedback work together easily.

  • Use one centralized tool for every step of the AI development process.
  • Import data from external blob storage, internal file system storage or public datasets.
  • Connect to external applications using a REST API & a Python SDK.

Save, share, reuse

Every single pipeline can be cloned, edited and reused by other data professionals in the organization. Never build the same thing twice.

  • Use existing, pre-created pipelines for RAG, RLHF, RLAF, Active Learning & more.
  • Deploy multi-modal pipelines with one click across multiple cloud resources.
  • Use versions for your pipelines to make sure the deployed pipeline is the stable one.

Easily manage pipelines

Spend less time dealing with the logistics of owning multiple data pipelines, and get back to building great AI applications.

  • Easy visualization of the data flow through the pipeline.
  • Identify & troubleshoot issues with clear, node-based error messages.
  • Use scalable AI infrastructure that can grow to support massive amounts of data.