
Proposal: Add Agent Threat Rules (ATR) dataset loader and taxonomy scorer #1702

@eeee2345

Description

I would like to propose a new dataset loader and a companion scorer that expose the open Agent Threat Rules (ATR) detection standard inside PyRIT.

ATR is an Apache-2.0-licensed detection standard for AI agent threats with 330 rules across nine attack categories: prompt-injection, tool-poisoning, skill-compromise, agent-manipulation, context-exfiltration, data-poisoning, excessive-autonomy, model-abuse, and privilege-escalation. The standard ships with attack payload corpora and test cases, and recent benchmarking measured 97.1 percent recall on the NVIDIA garak red-team dataset. ATR rule packs ship in production in Cisco AI Defense (skill-scanner) and in Microsoft's agent-governance-toolkit (PolicyEvaluator).

The repository is at https://github.com/Agent-Threat-Rule/agent-threat-rules.

The proposed contribution has two parts.

Part one is a SeedDatasetProvider implementation under pyrit/datasets/seed_datasets/remote/atr_threat_dataset.py that surfaces the ATR attack payload corpus as SeedPrompt sequences. Each prompt would carry harm_categories from the ATR taxonomy (one of the nine categories above) so existing PyRIT filters work out of the box. The loader would mirror the existing remote loaders (forbidden_questions_dataset, harmbench_dataset, etc.). The source could be ATR releases on GitHub or a HuggingFace mirror, depending on what the team prefers.
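
A minimal sketch of the shape I have in mind, assuming a fetch-style loader in the spirit of the existing remote loaders. The release asset URL, the payload JSON layout (payload, category, rule_id fields), and the exact SeedPrompt/SeedPromptDataset constructor arguments are all assumptions to be confirmed against the current API:

```python
# Hypothetical sketch only. The ATR release asset name, its JSON schema, and the
# SeedPrompt/SeedPromptDataset fields used here are assumptions, not confirmed API.
import requests

from pyrit.models import SeedPrompt, SeedPromptDataset

# Assumed release asset; the real path depends on how ATR publishes its corpora.
ATR_PAYLOADS_URL = (
    "https://github.com/Agent-Threat-Rule/agent-threat-rules/"
    "releases/latest/download/attack_payloads.json"
)


def fetch_atr_threat_dataset(source: str = ATR_PAYLOADS_URL) -> SeedPromptDataset:
    """Load the ATR attack payload corpus as a SeedPromptDataset."""
    # Assumed layout: a JSON list of {"payload": str, "category": str, "rule_id": str}.
    entries = requests.get(source, timeout=30).json()
    prompts = [
        SeedPrompt(
            value=entry["payload"],
            data_type="text",
            dataset_name="agent_threat_rules",
            harm_categories=[entry["category"]],  # one of the nine ATR categories
            source=source,
        )
        for entry in entries
    ]
    return SeedPromptDataset(prompts=prompts)
```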

Part two is a TrueFalseScorer subclass under pyrit/score/true_false/atr_taxonomy_scorer.py that scores PyRIT runs against the ATR rule taxonomy. Given a target response, it returns true_false plus an ATR rule ID and category, so PyRIT users can map outputs to the same nine-category schema other ecosystems already use. This would slot into the existing self_ask_category_scorer pattern.
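
For the deterministic variant (question 2 below), the matching core could look roughly like the sketch that follows. The ATRRule fields and the regex-based rule format are assumptions about the ATR rule pack schema; the PyRIT scorer wiring around it would follow whichever pattern the team prefers:

```python
# Sketch of the deterministic matching core only; the ATR rule schema shown here
# (rule_id, category, regex pattern) is an assumption about the rule pack format.
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ATRRule:
    rule_id: str         # e.g. "ATR-PI-0042"; ID format assumed
    category: str        # one of the nine ATR categories
    pattern: re.Pattern  # detection regex compiled from the rule pack


@dataclass(frozen=True)
class ATRMatch:
    rule_id: str
    category: str


def match_atr_rules(response_text: str, rules: list[ATRRule]) -> Optional[ATRMatch]:
    """Return the first rule that fires on the response (true), or None (false)."""
    for rule in rules:
        if rule.pattern.search(response_text):
            return ATRMatch(rule_id=rule.rule_id, category=rule.category)
    return None
```

The true/false scorer would wrap this: a match maps to true with the rule ID and category attached to the score, and no match maps to false.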

Before opening a PR I wanted to follow the guidance in doc/contributing/2_incorporating_research.md and check for direction:

  1. Does the team prefer the dataset to load from a HuggingFace mirror of ATR, or directly from ATR GitHub releases?
  2. Should the scorer be implemented as a SelfAskCategoryScorer subclass that wraps an LLM judge, or as a true-false scorer that does deterministic pattern matching, or both?
  3. Are there existing taxonomies in PyRIT that ATR's nine categories should be mapped onto for consistency, beyond the harm_categories field?

Happy to build the PR against whatever direction the team chooses. Estimated size matches recent dataset PRs (loader plus tests, around 200 to 300 lines).

Tagging maintainers gently; please retag if there is a different reviewer for new dataset and scorer proposals.
