Granite Guardian - Paper Detail

Abstract

We introduce the Granite Guardian models, a suite of safeguards designed to provide risk detection for prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open-source, Granite Guardian aims to promote responsible AI development across the community. https://github.com/ibm-granite/granite-guardian