Granite Guardian

  • 2024-12-10 18:17:02
  • Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri

Abstract

We introduce the Granite Guardian models, a suite of safeguards designed to provide risk detection for prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open-source, Granite Guardian aims to promote responsible AI development across the community. https://github.com/ibm-granite/granite-guardian