[DRAFT] FEAT add Agent Threat Rules (ATR) adversarial payload dataset loader#1715
eeee2345 wants to merge 4 commits into
Adds a new remote dataset loader at pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py that surfaces the ATR autoresearch corpus (1,054 attack-payload entries across six ATR taxonomy categories) as a PyRIT SeedDataset.

Implements proposal in microsoft#1702, per directional guidance in that issue:
- Source pinned to GitHub (not HuggingFace) for the initial cut
- ATR taxonomy preserved as-is on harm_categories
- Scorer kept separate as a follow-up after this loader lands
- No PyRIT core code modified

Adds 13 unit tests covering happy path, missing-key validation, the unknown-rule_id skip path, all four filter axes (categories, techniques, detection_fields, variation_types), and enum-validation errors.

Updates pyrit/datasets/seed_datasets/remote/__init__.py to register the loader via SeedDatasetProvider.__init_subclass__, and adds a one-line entry to doc/code/datasets/0_dataset.md.

ruff check + ruff format both clean. Real-network fetch verified locally against the pinned upstream URL.
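For readers unfamiliar with the "unknown-rule_id skip path" mentioned above, the following is a minimal sketch of the idea, not the loader's actual code. The function name `resolve_categories`, the two-entry mapping, and the sample entries are illustrative assumptions; only the `original_rule_id` key and the rule-id/category pairs come from this PR.

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)

# Illustrative two-entry mapping (assumption); the real loader maps more rule ids.
RULE_ID_TO_CATEGORY = {
    "ATR-2026-00001": "prompt-injection",
    "ATR-2026-00040": "privilege-escalation",
}


def resolve_categories(entries):
    """Pair each entry with its category, skipping entries with unmapped rule ids."""
    kept = []
    skipped = Counter()
    for entry in entries:
        rule_id = entry["original_rule_id"]
        category = RULE_ID_TO_CATEGORY.get(rule_id)
        if category is None:
            # Count the skip instead of raising, so new upstream rules do not break loading.
            skipped[rule_id] += 1
            continue
        kept.append((entry, category))
    if skipped:
        # One aggregate warning rather than one log line per skipped entry.
        logger.warning(
            "Skipped %d entries with unmapped rule ids: %s",
            sum(skipped.values()),
            dict(skipped),
        )
    return kept


# Hypothetical sample: the second rule id is deliberately absent from the mapping.
sample = [
    {"original_rule_id": "ATR-2026-00001", "payload": "a"},
    {"original_rule_id": "ATR-2099-99999", "payload": "b"},
]
resolved = resolve_categories(sample)
```

With this shape, an upstream rule that lands before the mapping is extended costs the user only the unmapped entries, not the whole fetch.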
- `harmbench`: Standard harmful behavior benchmarks
- `dark_bench`: Dark pattern detection examples
- `airt_*`: Various harm categories from AI Red Team
- `agent_threat_rules`: Agent Threat Rules (ATR) adversarial payloads for prompt injection, tool poisoning, and other AI-agent attack classes
If you rerun the 1_loading_datasets notebook, it will update the list there, too. This is just a small subset. In fact, I have a PR out for doing that: #1707
@romanlutz pointed out the manual entry in 0_dataset.md is a small hardcoded subset; the canonical list is generated by re-executing 1_loading_datasets.ipynb (which his microsoft#1707 handles). Dropping the manual line; auto-registration via SeedDatasetProvider already ensures agent_threat_rules appears in the regenerated notebook output once microsoft#1707 lands.
romanlutz left a comment
Thanks for the PR — I ran it locally and the diff is small, well-tested, and slots in cleanly. Three inline comments below on the loader plus one note on the PR body.
Verified locally on 44dce8b0: 13 unit tests pass; the e2e test tests/end_to_end/test_all_datasets.py::TestAllDatasets::test_fetch_dataset[_AgentThreatRulesDataset-_AgentThreatRulesDataset] is auto-discovered and passes against the pinned commit; ruff is clean; _parse_metadata() returns a valid SeedDatasetMetadata. The upstream JSON shape matches the enums exactly (10 unique original_rule_ids — all mapped — 6 detection_field values, 2 variation_type values, 21 unique techniques).
One meta note: the PR description still claims a doc/code/datasets/0_dataset.md change, but commit 44dce8b0 ("DOC: drop manual 0_dataset.md entry per #1707") removed it. Could you refresh the PR body so reviewers don't go looking for a file that isn't there?
    "ATR-2026-00040": "privilege-escalation",
    "ATR-2026-00060": "skill-compromise",
    "ATR-2026-00064": "skill-compromise",
}
The dict values duplicate string literals that already exist in ATRCategory immediately above. This is the highest-signal change I'd ask for: a typo here (e.g. "prompt_injection" vs "prompt-injection") would silently produce inconsistent harm_categories on the resulting SeedPrompts, and no existing test would catch it. The enum should be the single source of truth.
Something like:
_RULE_ID_TO_CATEGORY: dict[str, ATRCategory] = {
"ATR-2026-00001": ATRCategory.PROMPT_INJECTION,
"ATR-2026-00002": ATRCategory.PROMPT_INJECTION,
...
}
and then harm_categories=[category.value] at the SeedPrompt construction site. That way drift between the dict and the enum becomes a static error rather than a silent data-quality bug.
self._categories = {c.value for c in categories} if categories else None
self._techniques = set(techniques) if techniques else None
self._detection_fields = {d.value for d in detection_fields} if detection_fields else None
self._variation_types = {v.value for v in variation_types} if variation_types else None
Empty filter list silently disables the filter: categories=[] becomes set(), which is falsy, so if self._categories and ... skips filtering entirely and the user gets back the whole dataset. Most users would expect "no categories selected → empty result."
I'd either raise on empty (simplest) or normalize empty-to-None only after a deliberate decision is made about the semantics. Same applies to techniques, detection_fields, variation_types.
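To make the "raise on empty" option concrete, here is a minimal sketch. The helper names `normalize_filter` and `keep` are hypothetical and not part of PyRIT; the point is only that the check uses `is None` rather than truthiness, so an empty list can no longer be mistaken for "no filter".

```python
from typing import Iterable, Optional


def normalize_filter(name: str, values: Optional[Iterable[str]]) -> Optional[set[str]]:
    """Return None for 'filter disabled', a non-empty set otherwise."""
    if values is None:
        return None
    normalized = set(values)
    if not normalized:
        # An empty list is ambiguous, so reject it instead of guessing the intent.
        raise ValueError(f"{name} must not be empty; pass None to disable this filter.")
    return normalized


def keep(value: str, allowed: Optional[set[str]]) -> bool:
    """Compare against None explicitly so an empty set is never treated as 'no filter'."""
    return allowed is None or value in allowed
```

The same helper could be applied to techniques, detection_fields, and variation_types after converting enum members to their values.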
    "autoresearch dataset. Attack payloads spanning prompt injection, "
    "tool poisoning, context exfiltration, agent manipulation, "
    "privilege escalation, and skill compromise."
)
Per-SeedPrompt description always lists all six categories regardless of which filters were applied. If a user calls _AgentThreatRulesDataset(categories=[ATRCategory.PROMPT_INJECTION]), every returned prompt's description still claims the corpus spans tool poisoning, context exfiltration, etc. Mildly confusing for downstream consumers reading the metadata.
Either describe the per-rule semantics (since harm_categories already carries the actual category) or compute the description from the active filter set.
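One possible shape for the per-rule option, sketched under assumptions: the function name `describe` and the exact wording of the description string are illustrative, and the category is taken to be the hyphenated taxonomy string already carried on harm_categories.

```python
def describe(category: str, rule_id: str) -> str:
    """Build a description that names only the seed's own category and rule."""
    # Turn the hyphenated taxonomy value into readable words.
    family = category.replace("-", " ")
    return (
        f"Agent Threat Rules (ATR) adversarial payload in the {family} family. "
        f"Rule {rule_id}."
    )


example = describe("prompt-injection", "ATR-2026-00001")
```

Because the text is derived from each seed's own fields, it stays accurate under any combination of filters without needing to know which filters were active.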
Three refactors per @romanlutz's 5/13 review:

1. ATRCategory enum is now the single source of truth for category strings. _RULE_ID_TO_CATEGORY is typed dict[str, ATRCategory] and stores enum members directly, so a typo on either side becomes a static error rather than a silent data-quality bug at SeedPrompt construction. harm_categories is built via category.value at the construction site.
2. Empty filter lists ([]) now raise ValueError for all four filter arguments (categories, techniques, detection_fields, variation_types). The previous behavior, where an empty list silently disabled the filter and returned the full dataset, was a footgun. Pass None to disable.
3. Per-SeedPrompt description is computed from the seed's own category label and rule_id, so a filtered call returns seeds whose description references only the active family, not a corpus-wide list.

Five new unit tests cover the new contracts (empty-list raises x4 and per-rule description). One additional invariant test asserts that _RULE_ID_TO_CATEGORY values are ATRCategory instances.
Thanks for the thorough review. All three inline points + the PR body note are addressed in 5f4490c:
Branch is now caught up with main (via update-branch; local rebase hit unrelated
Local verification on 5f4490c: 19 unit tests pass (13 original + 6 new); ruff check and ruff format are both clean.
Description
Draft PR implementing the dataset loader proposed in #1702, per @romanlutz's directional guidance (GitHub-hosted source, taxonomy as-is, scorer kept separate for a follow-up). Now incorporating Roman's review feedback (see "Review feedback addressed" section below).
What this PR adds
A new remote dataset loader at pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py that surfaces the Agent Threat Rules (ATR) autoresearch adversarial-payload corpus as a PyRIT SeedDataset.

ATR is an open MIT-licensed detection standard for AI agent threats. The autoresearch corpus (data/autoresearch/adversarial-samples.json) contains 1,054 attack-payload entries across ten base rule scenarios in six of the ten ATR categories (prompt-injection, tool-poisoning, context-exfiltration, agent-manipulation, privilege-escalation, skill-compromise). Each payload carries an attack technique label (paraphrase, language_switch, encoding, role_play, and 17 others) and the agent surface it targets (user_input, content, tool_args, tool_name, tool_response, agent_output).

Reference: https://github.com/Agent-Threat-Rule/agent-threat-rules
License: MIT
Files touched
- pyrit/datasets/seed_datasets/remote/agent_threat_rules_dataset.py (new): the loader, three companion enums (ATRCategory, ATRDetectionField, ATRVariationType), and a _RULE_ID_TO_CATEGORY dict that resolves each rule_id to its ATR taxonomy category (typed as ATRCategory so the enum is the single source of truth)
- pyrit/datasets/seed_datasets/remote/__init__.py: adds the import to trigger auto-registration via SeedDatasetProvider.__init_subclass__, plus four entries in __all__
- tests/unit/datasets/test_agent_threat_rules_dataset.py (new): 19 unit tests covering happy path, missing-key validation, unknown rule_id skip path, all four filter axes, enum-validation errors, empty-filter rejection, per-rule description, and dict-enum source-of-truth invariant

No PyRIT core code is modified. No new dependencies are added. The manual doc/code/datasets/0_dataset.md entry was removed in 44dce8b per Roman's pointer to #1707, which regenerates that listing from the notebook.

Implementation notes
- Source pinned to db793f9 (current main HEAD when this PR was authored). This mirrors HarmBench's pinning convention (c0423b9) for reproducibility; pass the raw URL on main or a different tag to track upstream.
- Each entry in adversarial-samples.json maps to one SeedPrompt. The payload becomes value. The four upstream metadata fields (original_rule_id, technique, detection_field, variation_type) plus the upstream entry id are preserved on SeedPrompt.metadata. harm_categories is set to a single-element list with the ATR taxonomy category resolved via the loader's _RULE_ID_TO_CATEGORY dict.
- The _RULE_ID_TO_CATEGORY dict is typed dict[str, ATRCategory] and stores enum members directly, so a typo on either side is a static error rather than a silent data-quality bug at SeedPrompt construction.
- The categories, techniques, detection_fields, and variation_types arguments narrow the dataset client-side after fetch. Enum arguments are validated against their expected types via the inherited _validate_enums helper, matching the pattern in _PromptIntelDataset. An empty list is rejected with ValueError (pass None to disable a filter); the previous "empty list silently disables" behavior was a footgun.
- Each SeedPrompt's description references the rule's own category (e.g. "Agent Threat Rules (ATR) adversarial payload in the prompt injection family. Rule ATR-2026-00001.") so downstream consumers reading metadata see the family that actually applies, not a corpus-wide list that ignores the rule's specific category.
- Entries whose original_rule_id is not in the loader's category mapping are skipped (not errored) with an aggregate warning. This handles upstream rule additions that land before the loader's mapping is extended: users get a working dataset minus the unmapped rules, not a runtime failure.
- Subclasses _RemoteDatasetLoader, so caching, file-type inference, and the public_url/file switch are all inherited, with no duplicated infrastructure.

Review feedback addressed
Roman's three inline comments + PR-body note from 5/13:
- _RULE_ID_TO_CATEGORY is now dict[str, ATRCategory] and stores enum members directly, with harm_categories=[category.value] at the SeedPrompt construction site. A new test (test_rule_id_mapping_uses_enum) asserts the invariant so future edits to the dict stay aligned with the enum.
- Empty filter lists now raise ValueError for all four filter arguments (categories, techniques, detection_fields, variation_types). The "pass None to include all" path remains. Four new tests cover the raises.
- _AgentThreatRulesDataset(categories=[ATRCategory.PROMPT_INJECTION]) returns seeds whose description references prompt injection only, not all six families. A new test (test_per_rule_description_reflects_category) asserts descriptions differ across categories and each seed's description references its own rule_id.
- PR-body reference to 0_dataset.md removed; the manual entry was already dropped in 44dce8b per DOC: Execute 1_loading_datasets notebook to populate cell outputs #1707.

What this PR does NOT include (per #1702 discussion)
- ATR taxonomy preserved on harm_categories as-is per the same guidance.

Optional context for PyRIT users
ATR was recently integrated into MISP at two layers (merged 2026-05-10 by Alexandre Dulaunoy, MISP project lead):
- cve/owasp_llm/mitre_atlas cross-references per cluster

Mentioning since PyRIT users routing red-team output into MISP-compatible threat-intel or CSIRT pipelines could benefit from the original_rule_id metadata on each SeedPrompt resolving natively as MISP machine tags downstream, with no translation layer needed. Not required for the loader itself; just a downstream interop note.

Tests and Documentation
19 unit tests in tests/unit/datasets/test_agent_threat_rules_dataset.py covering:
- _RULE_ID_TO_CATEGORY values are ATRCategory instances

ruff check and ruff format --check both pass on the new files and the modified __init__.py.

A real-network fetch against the pinned upstream URL was verified locally: 1,054 seeds load with the expected category distribution (prompt-injection 496, context-exfiltration 186, skill-compromise 124, tool-poisoning 93, agent-manipulation 93, privilege-escalation 62).

The new loader will be picked up automatically by tests/end_to_end/test_all_datasets.py via SeedDatasetProvider.get_all_providers() discovery; no parametrization update is needed there.

JupyText was not run because this PR does not touch any notebooks or doc/code/ .py files.