[DRAFT] FEAT add Agent Threat Rules (ATR) adversarial payload dataset loader by eeee2345 · Pull Request #1715 · microsoft/PyRIT

eeee2345 · 2026-05-11T00:50:34Z

Description

Draft PR implementing the dataset loader proposed in #1702, per @romanlutz's directional guidance (GitHub-hosted source, taxonomy as-is, scorer kept separate for a follow-up). Now incorporating Roman's review feedback (see "Review feedback addressed" section below).

What this PR adds

A new remote dataset loader at pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py that surfaces the Agent Threat Rules (ATR) autoresearch adversarial-payload corpus as a PyRIT SeedDataset.

ATR is an open MIT-licensed detection standard for AI agent threats. The autoresearch corpus (data/autoresearch/adversarial-samples.json) contains 1,054 attack-payload entries across ten base rule scenarios in six of the ten ATR categories (prompt-injection, tool-poisoning, context-exfiltration, agent-manipulation, privilege-escalation, skill-compromise). Each payload carries an attack technique label (paraphrase, language_switch, encoding, role_play, and 17 others) and the agent surface it targets (user_input, content, tool_args, tool_name, tool_response, agent_output).

Reference: https://github.com/Agent-Threat-Rule/agent-threat-rules
License: MIT

Files touched

pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py (new) — the loader, three companion enums (ATRCategory, ATRDetectionField, ATRVariationType), and a _RULE_ID_TO_CATEGORY dict that resolves each rule_id to its ATR taxonomy category (typed as ATRCategory so the enum is the single source of truth)
pyrit/datasets/seed_datasets/remote/__init__.py — adds the import to trigger auto-registration via SeedDatasetProvider.__init_subclass__, plus four entries in __all__
tests/unit/datasets/test_agent_threat_rules_dataset.py (new) — 19 unit tests covering happy path, missing-key validation, unknown rule_id skip path, all four filter axes, enum-validation errors, empty-filter rejection, per-rule description, and dict-enum source-of-truth invariant

No PyRIT core code is modified. No new dependencies are added. The manual doc/code/datasets/0_dataset.md entry was removed in 44dce8b per Roman's pointer to #1707, which regenerates that listing from the notebook.

Implementation notes

Source URL is pinned to the ATR commit db793f9 (current main HEAD when this PR was authored). This mirrors HarmBench's pinning convention (c0423b9) for reproducibility; to track upstream instead, pass the raw URL for main or for a different tag.
Each row of adversarial-samples.json maps to one SeedPrompt. The payload becomes value. The four upstream metadata fields (original_rule_id, technique, detection_field, variation_type) plus the upstream entry id are preserved on SeedPrompt.metadata. harm_categories is set to a single-element list with the ATR taxonomy category resolved via the loader's _RULE_ID_TO_CATEGORY dict.
The _RULE_ID_TO_CATEGORY dict is typed dict[str, ATRCategory] and stores enum members directly — a typo on either side is a static error rather than a silent data-quality bug at SeedPrompt construction.
Optional categories, techniques, detection_fields, and variation_types arguments narrow the dataset client-side after fetch. Enum arguments are validated against their expected types via the inherited _validate_enums helper, matching the pattern in _PromptIntelDataset. An empty list is rejected with ValueError (pass None to disable a filter) — the previous "empty list silently disables" behavior was a footgun.
Each SeedPrompt's description references the rule's own category (e.g. "Agent Threat Rules (ATR) adversarial payload in the prompt injection family. Rule ATR-2026-00001.") so downstream consumers reading metadata see the family that actually applies, not a corpus-wide list that ignores the rule's specific category.
Entries whose original_rule_id is not in the loader's category mapping are skipped (not errored) with an aggregate warning. This handles upstream rule additions that land before the loader's mapping is extended — users get a working dataset minus the unmapped rules, not a runtime failure.
The loader extends _RemoteDatasetLoader, so caching, file-type inference, and the public_url/file switch are all inherited — no duplicated infrastructure.

Review feedback addressed

Roman's three inline comments + PR-body note from 5/13:

Enum as single source of truth (line 32 → now defined before the dict). _RULE_ID_TO_CATEGORY is now dict[str, ATRCategory] and stores enum members directly, and harm_categories=[category.value] is built at the SeedPrompt construction site. A new test (test_rule_id_mapping_uses_enum) asserts the invariant at test time so future edits to the dict stay aligned with the enum.
Empty-filter footgun (line 187). Empty list now raises ValueError for all four filter arguments (categories, techniques, detection_fields, variation_types). The "pass None to include all" path remains. Four new tests cover the raises.
Per-rule description (line 228). Description is now computed per-seed from the rule's own category label, so a _AgentThreatRulesDataset(categories=[ATRCategory.PROMPT_INJECTION]) returns seeds whose description references prompt injection only, not all six families. A new test (test_per_rule_description_reflects_category) asserts descriptions differ across categories and each seed's description references its own rule_id.
PR-body note about 0_dataset.md — removed; the manual entry was already dropped in 44dce8b per DOC: Execute 1_loading_datasets notebook to populate cell outputs #1707.

What this PR does NOT include (per #1702 discussion)

No scorer. Per @romanlutz's guidance to keep that separate, the ATR taxonomy scorer is a follow-up after this loader lands.
No HuggingFace mirror. Source is GitHub-hosted per the initial direction; a HuggingFace sibling release is straightforward to add later if users want it.
No taxonomy mapping into other PyRIT category schemas. ATR's taxonomy is preserved on harm_categories as-is per the same guidance.

Optional context for PyRIT users

ATR was recently integrated into MISP at two layers (merged 2026-05-10 by Alexandre Dulaunoy, MISP project lead):

Add agent-threat-rules taxonomy MISP/misp-taxonomies#323 — 10 predicate categories and 330 rule IDs as machine tags
Add Agent Threat Rules galaxy + cluster (336 rules) MISP/misp-galaxy#1207 — 336 cluster values, each carrying kill-chain category, severity, and cve / owasp_llm / mitre_atlas cross-references per cluster

Mentioning since PyRIT users routing red-team output into MISP-compatible threat-intel or CSIRT pipelines could benefit from the original_rule_id metadata on each SeedPrompt resolving natively as MISP machine tags downstream — no translation layer needed. Not required for the loader itself; just a downstream interop note.

Tests and Documentation

19 unit tests in tests/unit/datasets/test_agent_threat_rules_dataset.py covering:

Dataset name and SeedDataset construction
SeedPrompt field population including metadata
Missing-key validation raises
Unknown rule_id skipped with aggregate warning
All four filter axes (categories, techniques, detection_fields, variation_types)
Combined filters
Invalid enum types raise (3 tests)
Empty filter lists raise (4 new tests, one per filter axis)
Per-rule description reflects the seed's own category and rule_id
_RULE_ID_TO_CATEGORY values are ATRCategory instances

ruff check and ruff format --check both pass on the new files and the modified __init__.py.

A real-network fetch against the pinned upstream URL was verified locally: 1,054 seeds load with the expected category distribution (prompt-injection 496, context-exfiltration 186, skill-compromise 124, tool-poisoning 93, agent-manipulation 93, privilege-escalation 62).
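As a quick arithmetic check, the per-category counts reported above do sum to 1,054. The sketch below also shows one way a loaded dataset could be tallied the same way; `tally_categories` is a hypothetical helper that assumes each seed carries a single-element harm_categories list, as the implementation notes describe.

```python
from collections import Counter

# Per-category counts as reported in the PR description.
reported = Counter({
    "prompt-injection": 496,
    "context-exfiltration": 186,
    "skill-compromise": 124,
    "tool-poisoning": 93,
    "agent-manipulation": 93,
    "privilege-escalation": 62,
})


def tally_categories(harm_categories_per_seed: list[list[str]]) -> Counter:
    """Count seeds per category from each seed's single-element list."""
    return Counter(categories[0] for categories in harm_categories_per_seed)


total = sum(reported.values())
print(total)  # 1054

# Tiny invented input, only to show the helper's shape.
demo = tally_categories([["prompt-injection"], ["tool-poisoning"], ["prompt-injection"]])
```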

The new loader will be picked up automatically by tests/end_to_end/test_all_datasets.py via SeedDatasetProvider.get_all_providers() discovery — no parametrization update needed there.

JupyText was not run because this PR does not touch any notebooks or doc/code/ .py files.

Adds a new remote dataset loader at pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py that surfaces the ATR autoresearch corpus (1,054 attack-payload entries across six ATR taxonomy categories) as a PyRIT SeedDataset.

Implements proposal in microsoft#1702, per directional guidance in that issue:

- Source pinned to GitHub (not HuggingFace) for the initial cut
- ATR taxonomy preserved as-is on harm_categories
- Scorer kept separate as a follow-up after this loader lands
- No PyRIT core code modified

Adds 13 unit tests covering happy path, missing-key validation, the unknown-rule_id skip path, all four filter axes (categories, techniques, detection_fields, variation_types), and enum-validation errors.

Updates pyrit/datasets/seed_datasets/remote/__init__.py to register the loader via SeedDatasetProvider.__init_subclass__, and adds a one-line entry to doc/code/datasets/0_dataset.md.

ruff check + ruff format both clean. Real-network fetch verified locally against the pinned upstream URL.

romanlutz · 2026-05-11T00:52:25Z

 - `harmbench`: Standard harmful behavior benchmarks
 - `dark_bench`: Dark pattern detection examples
 - `airt_*`: Various harm categories from AI Red Team
+- `agent_threat_rules`: Agent Threat Rules (ATR) adversarial payloads for prompt injection, tool poisoning, and other AI-agent attack classes


If you rerun the 1_loading_datasets notebook, it will update the list there, too. This is just a small subset. In fact, I have a PR out for doing that: #1707

Got it — dropped the 0_dataset.md line in 44dce8b. Once #1707 lands and the notebook is re-executed against main, agent_threat_rules will show up in the canonical list automatically via SeedDatasetProvider auto-registration. Thanks for the pointer.

@romanlutz

@romanlutz pointed out the manual entry in 0_dataset.md is a small hardcoded subset; the canonical list is generated by re-executing 1_loading_datasets.ipynb (which his microsoft#1707 handles). Dropping the manual line; auto-registration via SeedDatasetProvider already ensures agent_threat_rules appears in the regenerated notebook output once microsoft#1707 lands.

romanlutz

Thanks for the PR — I ran it locally and the diff is small, well-tested, and slots in cleanly. Three inline comments below on the loader plus one note on the PR body.

Verified locally on 44dce8b0: 13 unit tests pass; the e2e test tests/end_to_end/test_all_datasets.py::TestAllDatasets::test_fetch_dataset[_AgentThreatRulesDataset-_AgentThreatRulesDataset] is auto-discovered and passes against the pinned commit; ruff is clean; _parse_metadata() returns a valid SeedDatasetMetadata. The upstream JSON shape matches the enums exactly (10 unique original_rule_ids — all mapped — 6 detection_field values, 2 variation_type values, 21 unique techniques).

One meta note: the PR description still claims a doc/code/datasets/0_dataset.md change, but commit 44dce8b0 ("DOC: drop manual 0_dataset.md entry per #1707") removed it. Could you refresh the PR body so reviewers don't go looking for a file that isn't there?

romanlutz · 2026-05-13T13:51:50Z

+    "ATR-2026-00040": "privilege-escalation",
+    "ATR-2026-00060": "skill-compromise",
+    "ATR-2026-00064": "skill-compromise",
+}


The dict values duplicate string literals that already exist in ATRCategory immediately above. This is the highest-signal change I'd ask for: a typo here (e.g. "prompt_injection" vs "prompt-injection") would silently produce inconsistent harm_categories on the resulting SeedPrompts, and no existing test would catch it. The enum should be the single source of truth.

Something like:

    _RULE_ID_TO_CATEGORY: dict[str, ATRCategory] = {
        "ATR-2026-00001": ATRCategory.PROMPT_INJECTION,
        "ATR-2026-00002": ATRCategory.PROMPT_INJECTION,
        ...
    }

and then harm_categories=[category.value] at the SeedPrompt construction site. That way drift between the dict and the enum becomes a static error rather than a silent data-quality bug.
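To illustrate the difference, here is a small self-contained sketch. The enum is a three-member subset using values from this thread, the misspellings are deliberate, and `mapping_uses_enum` is a hypothetical helper showing the kind of invariant a unit test could assert; none of this is the PR's actual code.

```python
from enum import Enum


class ATRCategory(Enum):
    # Subset of the category values that appear in this thread.
    PROMPT_INJECTION = "prompt-injection"
    PRIVILEGE_ESCALATION = "privilege-escalation"
    SKILL_COMPROMISE = "skill-compromise"


valid_values = {category.value for category in ATRCategory}

# String-valued mapping: an underscore typo is accepted without complaint.
string_mapping = {"ATR-2026-00001": "prompt_injection"}
typo_is_silent = string_mapping["ATR-2026-00001"] not in valid_values

# Enum-valued mapping: a misspelled member name fails as soon as the
# dict literal is evaluated (and a type checker would flag it earlier).
typo_error = None
try:
    broken_mapping = {"ATR-2026-00001": ATRCategory.PROMPT_INJECTON}  # deliberate typo
except AttributeError as exc:
    typo_error = exc


def mapping_uses_enum(mapping: dict) -> bool:
    """Invariant check: every value must be an ATRCategory member."""
    return all(isinstance(value, ATRCategory) for value in mapping.values())


good_mapping = {
    "ATR-2026-00001": ATRCategory.PROMPT_INJECTION,
    "ATR-2026-00040": ATRCategory.PRIVILEGE_ESCALATION,
}
```

The string typo only shows up if something compares the value against the enum, whereas the enum typo cannot get past module import.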

romanlutz · 2026-05-13T13:51:50Z

+        self._categories = {c.value for c in categories} if categories else None
+        self._techniques = set(techniques) if techniques else None
+        self._detection_fields = {d.value for d in detection_fields} if detection_fields else None
+        self._variation_types = {v.value for v in variation_types} if variation_types else None


Empty filter list silently disables the filter: categories=[] becomes set(), which is falsy, so if self._categories and ... skips filtering entirely and the user gets back the whole dataset. Most users would expect "no categories selected → empty result."

I'd either raise on empty (simplest) or normalize empty-to-None only after a deliberate decision is made about the semantics. Same applies to techniques, detection_fields, variation_types.
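The falsy-set behavior and the "raise on empty" option can be shown in isolation. This is a sketch only: `normalize_filter_buggy` mirrors the truthiness check quoted in the diff above, while `normalize_filter_strict` and its error message are hypothetical and not the loader's code.

```python
from typing import Optional, Sequence


def normalize_filter_buggy(values: Optional[Sequence[str]]) -> Optional[set[str]]:
    # Truthiness check: both None and [] fall through to "no filter".
    return set(values) if values else None


def normalize_filter_strict(name: str, values: Optional[Sequence[str]]) -> Optional[set[str]]:
    # None means "filter disabled"; an empty list is rejected explicitly.
    if values is None:
        return None
    if len(values) == 0:
        raise ValueError(f"{name} must not be empty; pass None to disable this filter.")
    return set(values)


# The empty list is indistinguishable from None in the buggy version.
buggy_result = normalize_filter_buggy([])

try:
    normalize_filter_strict("categories", [])
    strict_raised = False
except ValueError:
    strict_raised = True
```

Checking `values is None` separately from emptiness is what lets the two cases carry different meanings.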

romanlutz · 2026-05-13T13:51:50Z

+            "autoresearch dataset. Attack payloads spanning prompt injection, "
+            "tool poisoning, context exfiltration, agent manipulation, "
+            "privilege escalation, and skill compromise."
+        )


Per-SeedPrompt description always lists all six categories regardless of which filters were applied. If a user calls _AgentThreatRulesDataset(categories=[ATRCategory.PROMPT_INJECTION]), every returned prompt's description still claims the corpus spans tool poisoning, context exfiltration, etc. Mildly confusing for downstream consumers reading the metadata.

Either describe the per-rule semantics (since harm_categories already carries the actual category) or compute the description from the active filter set.
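The per-rule option could look roughly like the sketch below. The sentence template follows the example wording given in the PR description; `describe_seed` is a hypothetical helper, and deriving the family label by replacing hyphens with spaces is an assumption about how the label is produced.

```python
from enum import Enum


class ATRCategory(Enum):
    # Two category values and rule IDs that appear in this thread.
    PROMPT_INJECTION = "prompt-injection"
    SKILL_COMPROMISE = "skill-compromise"


def describe_seed(category: ATRCategory, rule_id: str) -> str:
    """Build a description from the seed's own category and rule ID."""
    family = category.value.replace("-", " ")  # assumed label derivation
    return (
        f"Agent Threat Rules (ATR) adversarial payload in the {family} family. "
        f"Rule {rule_id}."
    )


injection_text = describe_seed(ATRCategory.PROMPT_INJECTION, "ATR-2026-00001")
skill_text = describe_seed(ATRCategory.SKILL_COMPROMISE, "ATR-2026-00060")
```

Since each description depends only on the seed's own category and rule ID, it stays accurate no matter which filters the caller applied.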

@romanlutz

Three refactors per @romanlutz's 5/13 review:

1. ATRCategory enum is now the single source of truth for category strings. _RULE_ID_TO_CATEGORY is typed dict[str, ATRCategory] and stores enum members directly, so a typo on either side becomes a static error rather than a silent data-quality bug at SeedPrompt construction. harm_categories is built via category.value at the construction site.
2. Empty filter lists ([]) now raise ValueError for all four filter arguments (categories, techniques, detection_fields, variation_types). The previous behavior (empty list silently disabled the filter and returned the full dataset) was a footgun. Pass None to disable.
3. Per-SeedPrompt description is computed from the seed's own category label and rule_id, so a filtered call returns seeds whose description references only the active family, not a corpus-wide list.

Five new unit tests cover the new contracts (empty-list raises x4 and per-rule description). One additional invariant test asserts that _RULE_ID_TO_CATEGORY values are ATRCategory instances.

eeee2345 · 2026-05-13T14:16:11Z

Thanks for the thorough review. All three inline points + the PR body note are addressed in 5f4490c:

Enum as single source of truth — _RULE_ID_TO_CATEGORY is now dict[str, ATRCategory] with enum members as values. harm_categories=[category.value] at construction site. New invariant test test_rule_id_mapping_uses_enum so future drift becomes a static error.
Empty-filter footgun — Empty list now raises ValueError for all four filter arguments. Pass None to disable. Four new tests cover the raises.
Per-rule description — Computed per-seed from the rule's own category label and rule_id. A _AgentThreatRulesDataset(categories=[ATRCategory.PROMPT_INJECTION]) returns seeds whose description references only prompt-injection. New test test_per_rule_description_reflects_category asserts descriptions vary and each references its own rule_id.
PR body — Refreshed to reflect 44dce8b dropping the manual 0_dataset.md entry, and the new 19-test count.

Branch is now caught up with main (via update-branch — local rebase hit unrelated add/add conflicts in tests/unit/scenario/ and uv.lock that are not in any path this PR touches).

Local verification on 5f4490c: 19 unit tests pass (13 original + 6 new), ruff check + ruff format both clean.

Total unit tests in this PR: 19.

eeee2345 mentioned this pull request May 11, 2026

Proposal: Add Agent Threat Rules (ATR) dataset loader and taxonomy scorer #1702

Open

romanlutz reviewed May 11, 2026


eeee2345 mentioned this pull request May 11, 2026

Companion package proposal: counterfit-detection-atr plugin (out-of-tree, no core changes) Azure/counterfit#96

Open

romanlutz reviewed May 13, 2026


eeee2345 and others added 2 commits May 13, 2026 22:04

Merge branch 'main' into feat/atr-dataset-loader

8d8cf4d

eeee2345 wants to merge 4 commits into microsoft:main from eeee2345:feat/atr-dataset-loader
