
Proposal: Add Agent Threat Rules (ATR) dataset loader and taxonomy scorer #1702

@eeee2345

Description

I would like to propose a new dataset loader and a companion scorer that expose the open Agent Threat Rules (ATR) detection standard inside PyRIT.

ATR is an Apache-2.0-licensed detection standard for AI agent threats with 330 rules across nine attack categories: prompt-injection, tool-poisoning, skill-compromise, agent-manipulation, context-exfiltration, data-poisoning, excessive-autonomy, model-abuse, and privilege-escalation. The standard ships with attack payload corpora and test cases, and recent benchmarking measured 97.1 percent recall on the NVIDIA garak red-team dataset. ATR rule packs ship in production in Cisco AI Defense (skill-scanner) and in Microsoft's agent-governance-toolkit (PolicyEvaluator).

The repository is at https://github.com/Agent-Threat-Rule/agent-threat-rules.

The proposed contribution has two parts.

Part one is a SeedDatasetProvider implementation under pyrit/datasets/seed_datasets/remote/atr_threat_dataset.py that surfaces the ATR attack payload corpus as SeedPrompt sequences. Each prompt would carry harm_categories from the ATR taxonomy (one of the nine categories above) so existing PyRIT filters work out of the box. The loader would mirror the existing remote loaders (forbidden_questions_dataset, harmbench_dataset, etc.). The source could be ATR releases on GitHub or a HuggingFace mirror, depending on what the team prefers.
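
A minimal sketch of the shape I have in mind, assuming a fetch-style loader in the spirit of the existing remote loaders. The release asset URL, the payload JSON layout (payload, category, rule_id fields), and the exact SeedPrompt/SeedPromptDataset constructor arguments are all assumptions to be confirmed against the current API:

```python
# Hypothetical sketch only. The ATR release asset name, its JSON schema, and the
# SeedPrompt/SeedPromptDataset fields used here are assumptions, not confirmed API.
import requests

from pyrit.models import SeedPrompt, SeedPromptDataset

# Assumed release asset; the real path depends on how ATR publishes its corpora.
ATR_PAYLOADS_URL = (
    "https://github.com/Agent-Threat-Rule/agent-threat-rules/"
    "releases/latest/download/attack_payloads.json"
)


def fetch_atr_threat_dataset(source: str = ATR_PAYLOADS_URL) -> SeedPromptDataset:
    """Load the ATR attack payload corpus as a SeedPromptDataset."""
    # Assumed layout: a JSON list of {"payload": str, "category": str, "rule_id": str}.
    entries = requests.get(source, timeout=30).json()
    prompts = [
        SeedPrompt(
            value=entry["payload"],
            data_type="text",
            dataset_name="agent_threat_rules",
            harm_categories=[entry["category"]],  # one of the nine ATR categories
            source=source,
        )
        for entry in entries
    ]
    return SeedPromptDataset(prompts=prompts)
```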

Part two is a TrueFalseScorer subclass under pyrit/score/true_false/atr_taxonomy_scorer.py that scores PyRIT runs against the ATR rule taxonomy. Given a target response, it returns true_false plus an ATR rule ID and category, so PyRIT users can map outputs to the same nine-category schema other ecosystems already use. This would slot into the existing self_ask_category_scorer pattern.
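
For the deterministic variant (question 2 below), the matching core could look roughly like the sketch that follows. The ATRRule fields and the regex-based rule format are assumptions about the ATR rule pack schema; the PyRIT scorer wiring around it would follow whichever pattern the team prefers:

```python
# Sketch of the deterministic matching core only; the ATR rule schema shown here
# (rule_id, category, regex pattern) is an assumption about the rule pack format.
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ATRRule:
    rule_id: str         # e.g. "ATR-PI-0042"; ID format assumed
    category: str        # one of the nine ATR categories
    pattern: re.Pattern  # detection regex compiled from the rule pack


@dataclass(frozen=True)
class ATRMatch:
    rule_id: str
    category: str


def match_atr_rules(response_text: str, rules: list[ATRRule]) -> Optional[ATRMatch]:
    """Return the first rule that fires on the response (true), or None (false)."""
    for rule in rules:
        if rule.pattern.search(response_text):
            return ATRMatch(rule_id=rule.rule_id, category=rule.category)
    return None
```

The true/false scorer would wrap this: a match maps to true with the rule ID and category attached to the score, and no match maps to false.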

Before opening a PR I wanted to follow the guidance in doc/contributing/2_incorporating_research.md and check for direction:

  1. Does the team prefer the dataset to load from a HuggingFace mirror of ATR, or directly from ATR GitHub releases?
  2. Should the scorer be implemented as a SelfAskCategoryScorer subclass that wraps an LLM judge, or as a true-false scorer that does deterministic pattern matching, or both?
  3. Are there existing taxonomies in PyRIT that ATR's nine categories should be mapped onto for consistency, beyond the harm_categories field?

Happy to build the PR against whatever direction the team chooses. Estimated size matches recent dataset PRs (loader plus tests, around 200 to 300 lines).

Tagging maintainers gently; please retag if there is a different reviewer for new dataset and scorer proposals.
