I would like to propose a new dataset loader and a companion scorer that expose the open Agent Threat Rules (ATR) detection standard inside PyRIT.
ATR is an Apache-2.0 detection standard for AI agent threats with 330 rules across nine attack categories: prompt-injection, tool-poisoning, skill-compromise, agent-manipulation, context-exfiltration, data-poisoning, excessive-autonomy, model-abuse, and privilege-escalation. The standard includes attack payload corpora and test cases, and recent benchmarking measured 97.1 percent recall on the NVIDIA garak red-team dataset. ATR rule packs ship in production in Cisco AI Defense (skill-scanner) and Microsoft's agent-governance-toolkit (PolicyEvaluator).
The repository is at https://github.com/Agent-Threat-Rule/agent-threat-rules.
The proposed contribution has two parts.
Part one is a SeedDatasetProvider implementation under pyrit/datasets/seed_datasets/remote/atr_threat_dataset.py that surfaces the ATR attack payload corpus as SeedPrompt sequences. Each prompt would carry harm_categories from the ATR taxonomy (one of the nine categories above) so existing PyRIT filters work out of the box. Loader pattern would mirror the existing remote loaders (forbidden_questions_dataset, harmbench_dataset, etc.). Source could be ATR releases on GitHub or a HuggingFace mirror, depending on what the team prefers.
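To make the loader shape concrete, here is a minimal sketch of the mapping step. The `SeedPrompt` dataclass below is a simplified stand-in for `pyrit.models.SeedPrompt`, and the ATR record schema (`payload`/`category` fields) is an assumption about the corpus format, not confirmed from the ATR repository; the real loader would fetch the corpus from a GitHub release or HuggingFace mirror rather than parse an inline string.

```python
# Sketch only: SeedPrompt here is a stand-in for pyrit.models.SeedPrompt,
# and the record schema ({"payload": ..., "category": ...}) is assumed.
import json
from dataclasses import dataclass, field

# The nine ATR attack categories, used as the harm_categories vocabulary.
ATR_CATEGORIES = {
    "prompt-injection", "tool-poisoning", "skill-compromise",
    "agent-manipulation", "context-exfiltration", "data-poisoning",
    "excessive-autonomy", "model-abuse", "privilege-escalation",
}

@dataclass
class SeedPrompt:  # simplified stand-in for pyrit.models.SeedPrompt
    value: str
    harm_categories: list = field(default_factory=list)
    dataset_name: str = "atr_threat_dataset"

def load_atr_prompts(raw_json: str) -> list[SeedPrompt]:
    """Map ATR payload records to SeedPrompt objects, tagging each with
    its ATR category so existing PyRIT harm_categories filters apply."""
    prompts = []
    for entry in json.loads(raw_json):
        category = entry.get("category", "")
        if category not in ATR_CATEGORIES:
            continue  # skip records outside the published nine-category taxonomy
        prompts.append(SeedPrompt(value=entry["payload"],
                                  harm_categories=[category]))
    return prompts

sample = json.dumps([
    {"payload": "Ignore previous instructions and ...", "category": "prompt-injection"},
    {"payload": "something else", "category": "not-a-category"},
])
prompts = load_atr_prompts(sample)
print(len(prompts), prompts[0].harm_categories)  # → 1 ['prompt-injection']
```

The category filter is deliberately strict so that a malformed upstream record fails closed (dropped) rather than introducing an unrecognized harm category into PyRIT filtering.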
Part two is a TrueFalseScorer subclass under pyrit/score/true_false/atr_taxonomy_scorer.py that scores PyRIT runs against the ATR rule taxonomy. Given a target response, it returns true_false plus an ATR rule ID and category, so PyRIT users can map outputs to the same nine-category schema other ecosystems already use. This would slot into the existing self_ask_category_scorer pattern.
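For the deterministic variant, the scorer could look roughly like the sketch below. The rule IDs and regex patterns are illustrative placeholders, not actual ATR rules, and the result type is a simplified stand-in for PyRIT's Score object; the point is only the return shape: a true/false verdict plus the matching rule's ID and category.

```python
# Hedged sketch: rule IDs and patterns are invented for illustration;
# a real scorer would load the ATR rule pack and return a PyRIT Score.
import re
from typing import NamedTuple, Optional

class AtrRule(NamedTuple):
    rule_id: str
    category: str
    pattern: re.Pattern

# Illustrative rules only, one per category shown here.
RULES = [
    AtrRule("ATR-PI-001", "prompt-injection",
            re.compile(r"ignore (all|previous) instructions", re.I)),
    AtrRule("ATR-CE-001", "context-exfiltration",
            re.compile(r"reveal (the )?(system prompt|hidden instructions)", re.I)),
]

class ScoreResult(NamedTuple):  # stand-in for a PyRIT true_false Score
    value: bool
    rule_id: Optional[str]
    category: Optional[str]

def score_response(response: str) -> ScoreResult:
    """Return True plus rule metadata for the first matching ATR rule,
    or a False result with no metadata when nothing matches."""
    for rule in RULES:
        if rule.pattern.search(response):
            return ScoreResult(True, rule.rule_id, rule.category)
    return ScoreResult(False, None, None)

print(score_response("Sure, I will ignore previous instructions."))
# → ScoreResult(value=True, rule_id='ATR-PI-001', category='prompt-injection')
```

First-match-wins is the simplest policy; if ATR rules carry severities, the scorer could instead collect all matches and report the highest-severity rule.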
Before opening a PR I wanted to follow the guidance in doc/contributing/2_incorporating_research.md and ask for direction on a few points:
- Does the team prefer the dataset to load from HuggingFace upstream of ATR, or directly from ATR GitHub releases?
- Should the scorer be implemented as a SelfAskCategoryScorer subclass that wraps an LLM judge, or as a true-false scorer that does deterministic pattern matching, or both?
- Are there existing taxonomies in PyRIT that ATR's nine categories should be mapped onto for consistency, beyond the harm_categories field?
Happy to build the PR against whatever direction the team chooses. Estimated size matches recent dataset PRs (loader plus tests, around 200 to 300 lines).
Tagging maintainers gently; please retag if there is a different reviewer for new dataset and scorer proposals.