[DRAFT] FEAT add Agent Threat Rules (ATR) adversarial payload dataset loader #1715

Draft
eeee2345 wants to merge 4 commits into microsoft:main from eeee2345:feat/atr-dataset-loader

@eeee2345 eeee2345 commented May 11, 2026

Description

Draft PR implementing the dataset loader proposed in #1702, per @romanlutz's directional guidance (GitHub-hosted source, taxonomy as-is, scorer kept separate for a follow-up). It now incorporates Roman's review feedback (see the "Review feedback addressed" section below).

What this PR adds

A new remote dataset loader at pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py that surfaces the Agent Threat Rules (ATR) autoresearch adversarial-payload corpus as a PyRIT SeedDataset.

ATR is an open MIT-licensed detection standard for AI agent threats. The autoresearch corpus (data/autoresearch/adversarial-samples.json) contains 1,054 attack-payload entries across ten base rule scenarios in six of the ten ATR categories (prompt-injection, tool-poisoning, context-exfiltration, agent-manipulation, privilege-escalation, skill-compromise). Each payload carries an attack technique label (paraphrase, language_switch, encoding, role_play, and 17 others) and the agent surface it targets (user_input, content, tool_args, tool_name, tool_response, agent_output).

Reference: https://github.com/Agent-Threat-Rule/agent-threat-rules
License: MIT

Files touched

  • pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py (new) — the loader, three companion enums (ATRCategory, ATRDetectionField, ATRVariationType), and a _RULE_ID_TO_CATEGORY dict that resolves each rule_id to its ATR taxonomy category (typed as ATRCategory so the enum is the single source of truth)
  • pyrit/datasets/seed_datasets/remote/__init__.py — adds the import to trigger auto-registration via SeedDatasetProvider.__init_subclass__, plus four entries in __all__
  • tests/unit/datasets/test_agent_threat_rules_dataset.py (new) — 19 unit tests covering happy path, missing-key validation, unknown rule_id skip path, all four filter axes, enum-validation errors, empty-filter rejection, per-rule description, and dict-enum source-of-truth invariant

No PyRIT core code is modified. No new dependencies are added. The manual doc/code/datasets/0_dataset.md entry was removed in 44dce8b per Roman's pointer to #1707, which regenerates that listing from the notebook.

Implementation notes

  • Source URL is pinned to the ATR commit db793f9 (current main HEAD when this PR was authored). This mirrors HarmBench's pinning convention (c0423b9) for reproducibility. To track upstream instead, pass the raw URL for main or for a different tag.
  • Each row of adversarial-samples.json maps to one SeedPrompt. The payload becomes value. The four upstream metadata fields (original_rule_id, technique, detection_field, variation_type) plus the upstream entry id are preserved on SeedPrompt.metadata. harm_categories is set to a single-element list with the ATR taxonomy category resolved via the loader's _RULE_ID_TO_CATEGORY dict.
  • The _RULE_ID_TO_CATEGORY dict is typed dict[str, ATRCategory] and stores enum members directly — a typo on either side is a static error rather than a silent data-quality bug at SeedPrompt construction.
  • Optional categories, techniques, detection_fields, and variation_types arguments narrow the dataset client-side after fetch. Enum arguments are validated against their expected types via the inherited _validate_enums helper, matching the pattern in _PromptIntelDataset. An empty list is rejected with ValueError (pass None to disable a filter) — the previous "empty list silently disables" behavior was a footgun.
  • Each SeedPrompt's description references the rule's own category (e.g. "Agent Threat Rules (ATR) adversarial payload in the prompt injection family. Rule ATR-2026-00001.") so downstream consumers reading metadata see the family that actually applies, not a corpus-wide list that ignores the rule's specific category.
  • Entries whose original_rule_id is not in the loader's category mapping are skipped (not errored) with an aggregate warning. This handles upstream rule additions that land before the loader's mapping is extended — users get a working dataset minus the unmapped rules, not a runtime failure.
  • The loader extends _RemoteDatasetLoader, so caching, file-type inference, and the public_url/file switch are all inherited — no duplicated infrastructure.
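
As a rough illustration of the row-to-seed mapping and the skip path described in the notes above, here is a minimal sketch that does not import PyRIT. `Category`, `Seed`, `rows_to_seeds`, and `METADATA_KEYS` are hypothetical stand-ins for `ATRCategory`, `SeedPrompt`, and the loader internals; the JSON key names `payload` and `id` are assumptions, and the mapping is a two-entry excerpt rather than the full ten-rule dict.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum

logger = logging.getLogger(__name__)


class Category(Enum):
    """Hypothetical stand-in for ATRCategory (two of the six values)."""

    PROMPT_INJECTION = "prompt-injection"
    SKILL_COMPROMISE = "skill-compromise"


# Excerpt only: the loader's real mapping covers ten rule IDs.
RULE_ID_TO_CATEGORY: dict[str, Category] = {
    "ATR-2026-00001": Category.PROMPT_INJECTION,
    "ATR-2026-00060": Category.SKILL_COMPROMISE,
}

# Upstream fields kept on each seed's metadata ("id" is an assumed key name).
METADATA_KEYS = ("id", "original_rule_id", "technique", "detection_field", "variation_type")


@dataclass
class Seed:
    """Simplified stand-in for PyRIT's SeedPrompt."""

    value: str
    harm_categories: list[str]
    metadata: dict[str, str] = field(default_factory=dict)


def rows_to_seeds(rows: list[dict[str, str]]) -> list[Seed]:
    seeds: list[Seed] = []
    unmapped: set[str] = set()
    for row in rows:
        rule_id = row["original_rule_id"]  # a missing key raises KeyError
        category = RULE_ID_TO_CATEGORY.get(rule_id)
        if category is None:
            unmapped.add(rule_id)  # skip the row, do not fail the load
            continue
        seeds.append(
            Seed(
                value=row["payload"],  # assumed key name for the payload text
                harm_categories=[category.value],
                metadata={key: row[key] for key in METADATA_KEYS},
            )
        )
    if unmapped:
        # One aggregate warning instead of one warning per skipped row.
        logger.warning("Skipped entries for unmapped rule IDs: %s", sorted(unmapped))
    return seeds
```

Collecting the unmapped IDs in a set keeps the warning to a single line even when many rows share the same unknown rule.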

Review feedback addressed

Roman's three inline comments + PR-body note from 5/13:

  1. Enum as single source of truth (line 32 → now defined before the dict). _RULE_ID_TO_CATEGORY is now dict[str, ATRCategory] and stores enum members directly. harm_categories=[category.value] at the SeedPrompt construction site. A new test (test_rule_id_mapping_uses_enum) asserts the invariant statically so future edits to the dict stay aligned with the enum.
  2. Empty-filter footgun (line 187). Empty list now raises ValueError for all four filter arguments (categories, techniques, detection_fields, variation_types). The "pass None to include all" path remains. Four new tests cover the raises.
  3. Per-rule description (line 228). Description is now computed per-seed from the rule's own category label, so a _AgentThreatRulesDataset(categories=[ATRCategory.PROMPT_INJECTION]) returns seeds whose description references prompt injection only, not all six families. A new test (test_per_rule_description_reflects_category) asserts descriptions differ across categories and each seed's description references its own rule_id.
  4. PR-body note about 0_dataset.md. Removed; the manual entry was already dropped in 44dce8b per #1707 (DOC: Execute 1_loading_datasets notebook to populate cell outputs).

What this PR does NOT include (per #1702 discussion)

  • No scorer. Per @romanlutz's guidance to keep that separate, the ATR taxonomy scorer is a follow-up after this loader lands.
  • No HuggingFace mirror. Source is GitHub-hosted per the initial direction; a HuggingFace sibling release is straightforward to add later if users want it.
  • No taxonomy mapping into other PyRIT category schemas. ATR's taxonomy is preserved on harm_categories as-is per the same guidance.

Optional context for PyRIT users

ATR was recently integrated into MISP at two layers (merged 2026-05-10 by Alexandre Dulaunoy, MISP project lead).

Mentioning this because PyRIT users who route red-team output into MISP-compatible threat-intel or CSIRT pipelines could benefit: the original_rule_id metadata on each SeedPrompt resolves natively as MISP machine tags downstream, with no translation layer needed. This is not required for the loader itself; it is just a downstream interop note.

Tests and Documentation

19 unit tests in tests/unit/datasets/test_agent_threat_rules_dataset.py covering:

  • Dataset name and SeedDataset construction
  • SeedPrompt field population including metadata
  • Missing-key validation raises
  • Unknown rule_id skipped with aggregate warning
  • All four filter axes (categories, techniques, detection_fields, variation_types)
  • Combined filters
  • Invalid enum types raise (3 tests)
  • Empty filter lists raise (4 new tests, one per filter axis)
  • Per-rule description reflects the seed's own category and rule_id
  • _RULE_ID_TO_CATEGORY values are ATRCategory instances

ruff check and ruff format --check both pass on the new files and the modified __init__.py.

A real-network fetch against the pinned upstream URL was verified locally: 1,054 seeds load with the expected category distribution (prompt-injection 496, context-exfiltration 186, skill-compromise 124, tool-poisoning 93, agent-manipulation 93, privilege-escalation 62).
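
The reported distribution can be cross-checked against the corpus size with a few lines of arithmetic. The counts below are copied from the paragraph above; the variable names are illustrative only.

```python
# Per-category counts as reported for the pinned upstream commit.
distribution = {
    "prompt-injection": 496,
    "context-exfiltration": 186,
    "skill-compromise": 124,
    "tool-poisoning": 93,
    "agent-manipulation": 93,
    "privilege-escalation": 62,
}

total = sum(distribution.values())  # 1054, matching the stated corpus size
largest = max(distribution, key=distribution.get)
share = distribution[largest] / total

print(f"total={total}, largest={largest} ({share:.1%})")
```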

The new loader will be picked up automatically by tests/end_to_end/test_all_datasets.py via SeedDatasetProvider.get_all_providers() discovery — no parametrization update needed there.

JupyText was not run because this PR does not touch any notebooks or doc/code/ .py files.

Adds a new remote dataset loader at pyrit/datasets/seed_datasets/remote/
agent_threat_rules_dataset.py that surfaces the ATR autoresearch corpus
(1,054 attack-payload entries across six ATR taxonomy categories) as a
PyRIT SeedDataset.

Implements proposal in microsoft#1702, per directional guidance in that issue:
- Source pinned to GitHub (not HuggingFace) for the initial cut
- ATR taxonomy preserved as-is on harm_categories
- Scorer kept separate as a follow-up after this loader lands
- No PyRIT core code modified

Adds 13 unit tests covering happy path, missing-key validation, the
unknown-rule_id skip path, all four filter axes (categories, techniques,
detection_fields, variation_types), and enum-validation errors.

Updates pyrit/datasets/seed_datasets/remote/__init__.py to register the
loader via SeedDatasetProvider.__init_subclass__, and adds a one-line
entry to doc/code/datasets/0_dataset.md.

ruff check + ruff format both clean. Real-network fetch verified locally
against the pinned upstream URL.
Comment thread doc/code/datasets/0_dataset.md Outdated
- `harmbench`: Standard harmful behavior benchmarks
- `dark_bench`: Dark pattern detection examples
- `airt_*`: Various harm categories from AI Red Team
- `agent_threat_rules`: Agent Threat Rules (ATR) adversarial payloads for prompt injection, tool poisoning, and other AI-agent attack classes
Contributor

If you rerun the 1_loading_datasets notebook, it will update the list there, too. This is just a small subset. In fact, I have a PR out for doing that: #1707

Author

Got it — dropped the 0_dataset.md line in 44dce8b. Once #1707 lands and the notebook is re-executed against main, agent_threat_rules will show up in the canonical list automatically via SeedDatasetProvider auto-registration. Thanks for the pointer.

@romanlutz pointed out the manual entry in 0_dataset.md is a small
hardcoded subset; the canonical list is generated by re-executing
1_loading_datasets.ipynb (which his microsoft#1707 handles). Dropping the
manual line; auto-registration via SeedDatasetProvider already
ensures agent_threat_rules appears in the regenerated notebook
output once microsoft#1707 lands.
Contributor

@romanlutz romanlutz left a comment

Thanks for the PR — I ran it locally and the diff is small, well-tested, and slots in cleanly. Three inline comments below on the loader plus one note on the PR body.

Verified locally on 44dce8b0: 13 unit tests pass; the e2e test tests/end_to_end/test_all_datasets.py::TestAllDatasets::test_fetch_dataset[_AgentThreatRulesDataset-_AgentThreatRulesDataset] is auto-discovered and passes against the pinned commit; ruff is clean; _parse_metadata() returns a valid SeedDatasetMetadata. The upstream JSON shape matches the enums exactly (10 unique original_rule_ids — all mapped — 6 detection_field values, 2 variation_type values, 21 unique techniques).

One meta note: the PR description still claims a doc/code/datasets/0_dataset.md change, but commit 44dce8b0 ("DOC: drop manual 0_dataset.md entry per #1707") removed it. Could you refresh the PR body so reviewers don't go looking for a file that isn't there?

"ATR-2026-00040": "privilege-escalation",
"ATR-2026-00060": "skill-compromise",
"ATR-2026-00064": "skill-compromise",
}
Contributor

The dict values duplicate string literals that already exist in ATRCategory immediately above. This is the highest-signal change I'd ask for: a typo here (e.g. "prompt_injection" vs "prompt-injection") would silently produce inconsistent harm_categories on the resulting SeedPrompts, and no existing test would catch it. The enum should be the single source of truth.

Something like:

_RULE_ID_TO_CATEGORY: dict[str, ATRCategory] = {
    "ATR-2026-00001": ATRCategory.PROMPT_INJECTION,
    "ATR-2026-00002": ATRCategory.PROMPT_INJECTION,
    ...
}

and then harm_categories=[category.value] at the SeedPrompt construction site. That way drift between the dict and the enum becomes a static error rather than a silent data-quality bug.
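
A small, self-contained sketch of why the enum-valued dict is safer, using a hypothetical `Category` enum in place of `ATRCategory`. Note that in plain Python a misspelled member surfaces as an `AttributeError` when the module is imported (and as a type-checker error under a static checker), whereas a misspelled string literal is accepted silently.

```python
from enum import Enum


class Category(Enum):
    """Hypothetical stand-in for ATRCategory (two of the six values)."""

    PROMPT_INJECTION = "prompt-injection"
    PRIVILEGE_ESCALATION = "privilege-escalation"


# String-valued mapping: the underscore typo below is accepted without complaint.
STRING_MAPPING = {"ATR-2026-00001": "prompt_injection"}

valid_values = {member.value for member in Category}
drifted = [rule_id for rule_id, value in STRING_MAPPING.items() if value not in valid_values]

# Enum-valued mapping: values can only be real members.
ENUM_MAPPING: dict[str, Category] = {
    "ATR-2026-00001": Category.PROMPT_INJECTION,
    "ATR-2026-00040": Category.PRIVILEGE_ESCALATION,
}


def misspelled_member_fails() -> bool:
    """Show that a misspelled enum member raises instead of passing silently."""
    try:
        Category.PROMPT_INJECTON  # deliberate typo
    except AttributeError:
        return True
    return False
```

An invariant test along the lines of `all(isinstance(v, Category) for v in ENUM_MAPPING.values())` then guards against someone reintroducing raw strings later.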

self._categories = {c.value for c in categories} if categories else None
self._techniques = set(techniques) if techniques else None
self._detection_fields = {d.value for d in detection_fields} if detection_fields else None
self._variation_types = {v.value for v in variation_types} if variation_types else None
Contributor

Empty filter list silently disables the filter: categories=[] becomes set(), which is falsy, so if self._categories and ... skips filtering entirely and the user gets back the whole dataset. Most users would expect "no categories selected → empty result."

I'd either raise on empty (simplest) or normalize empty-to-None only after a deliberate decision is made about the semantics. Same applies to techniques, detection_fields, variation_types.
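
To make the two semantics concrete, here is a hedged sketch with hypothetical helper names. `legacy_normalize` mirrors the shape of the expressions quoted above, where an empty list is falsy and therefore ends up as `None` (filter disabled). `strict_normalize` shows the "raise on empty" option.

```python
from typing import Optional


def legacy_normalize(values: Optional[list[str]]) -> Optional[set[str]]:
    # Same shape as `set(x) if x else None`: [] is falsy, so it maps to None.
    return set(values) if values else None


def strict_normalize(values: Optional[list[str]], name: str) -> Optional[set[str]]:
    if values is None:
        return None  # filter deliberately disabled
    if len(values) == 0:
        raise ValueError(f"{name} must not be empty; pass None to disable the filter")
    return set(values)


def raises_on_empty() -> bool:
    """Return True if the strict helper rejects an empty list."""
    try:
        strict_normalize([], "categories")
    except ValueError:
        return True
    return False
```

Distinguishing `None` from `[]` explicitly, rather than relying on truthiness, is what removes the ambiguity.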

"autoresearch dataset. Attack payloads spanning prompt injection, "
"tool poisoning, context exfiltration, agent manipulation, "
"privilege escalation, and skill compromise."
)
Contributor

Per-SeedPrompt description always lists all six categories regardless of which filters were applied. If a user calls _AgentThreatRulesDataset(categories=[ATRCategory.PROMPT_INJECTION]), every returned prompt's description still claims the corpus spans tool poisoning, context exfiltration, etc. Mildly confusing for downstream consumers reading the metadata.

Either describe the per-rule semantics (since harm_categories already carries the actual category) or compute the description from the active filter set.
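
The first option (a description derived from each rule's own category) could look roughly like the sketch below. `describe` is a hypothetical helper, and the wording follows the example description quoted in the PR body; the actual loader may format it differently.

```python
def describe(category_value: str, rule_id: str) -> str:
    """Build a per-seed description from the seed's own category and rule ID."""
    family = category_value.replace("-", " ")  # "prompt-injection" -> "prompt injection"
    return f"Agent Threat Rules (ATR) adversarial payload in the {family} family. Rule {rule_id}."


prompt_injection_text = describe("prompt-injection", "ATR-2026-00001")
skill_compromise_text = describe("skill-compromise", "ATR-2026-00060")
```

With this approach a filtered dataset never mentions families that were filtered out, because no corpus-wide list is embedded in the text.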

eeee2345 and others added 2 commits May 13, 2026 22:04
Three refactors per @romanlutz's 5/13 review:

1. ATRCategory enum is now the single source of truth for category
   strings. _RULE_ID_TO_CATEGORY is typed dict[str, ATRCategory] and
   stores enum members directly, so a typo on either side becomes a
   static error rather than a silent data-quality bug at SeedPrompt
   construction. harm_categories is built via category.value at the
   construction site.

2. Empty filter lists ([]) now raise ValueError for all four filter
   arguments (categories, techniques, detection_fields, variation_types).
   The previous behavior — empty list silently disabled the filter and
   returned the full dataset — was a footgun. Pass None to disable.

3. Per-SeedPrompt description is computed from the seed's own category
   label and rule_id, so a filtered call returns seeds whose description
   references only the active family, not a corpus-wide list.

Five new unit tests cover the new contracts (empty-list raises x4 and
per-rule description). One additional invariant test asserts that
_RULE_ID_TO_CATEGORY values are ATRCategory instances.
@eeee2345
Author

Thanks for the thorough review. All three inline points + the PR body note are addressed in 5f4490c:

  1. Enum as single source of truth — _RULE_ID_TO_CATEGORY is now dict[str, ATRCategory] with enum members as values. harm_categories=[category.value] at construction site. New invariant test test_rule_id_mapping_uses_enum so future drift becomes a static error.

  2. Empty-filter footgun — Empty list now raises ValueError for all four filter arguments. Pass None to disable. Four new tests cover the raises.

  3. Per-rule description — Computed per-seed from the rule's own category label and rule_id. A _AgentThreatRulesDataset(categories=[ATRCategory.PROMPT_INJECTION]) returns seeds whose description references only prompt-injection. New test test_per_rule_description_reflects_category asserts descriptions vary and each references its own rule_id.

  4. PR body — Refreshed to reflect 44dce8b dropping the manual 0_dataset.md entry, and the new 19-test count.

Branch is now caught up with main (via update-branch — local rebase hit unrelated add/add conflicts in tests/unit/scenario/ and uv.lock that are not in any path this PR touches).

Local verification on 5f4490c: 19 unit tests pass (13 original + 6 new), ruff check + ruff format both clean.

Total unit tests in this PR: 19.
