What An Agent Asks · Getting Started · Pipeline · Tools · Verification · Calibration · Architecture · Zetetic Standard
Companion projects:
- Cortex — persistent memory that injects past decisions into every PRD
- zetetic-team-subagents — 97 genius reasoning patterns that judge each claim
- automatised-pipeline — the codebase intelligence layer this generator consumes upstream
Every AI agent that drafts a PRD eventually invents a function that doesn't exist, claims latency it can't measure, or writes acceptance criteria that don't tie back to the requirements they're supposed to test. The output sounds confident. It is not actionable. The next stage in the pipeline — code generation, ticket import, sprint planning — silently inherits the hallucination, ships it, and pays for it later.
prd-spec-generator is a TypeScript MCP server that fixes this at the structural level. The pipeline is a stateless reducer (step(state, result?) → next_state, action) driven by a host (Claude Code or any MCP-speaking agent). Sections are produced one at a time, validated by deterministic Hard Output Rules before the host ever sees them, and every load-bearing claim is judged by a panel of genius reasoning agents drawn from zetetic-team-subagents against the codebase graph from automatised-pipeline. Phase 4 then closes the loop: per-judge reliability is calibrated from history, retry budgets are derived from survival statistics, KPI gates are tuned against frozen baselines, and held-out partitions are mechanically sealed so no calibration result can be peeked at before evaluation.
10 packages. 17 MCP tools. 10 pipeline steps. Multi-judge verification with consensus. Closed-loop calibration with externally-grounded falsifiers. 583 tests. Every numeric constant traces to a citation, a benchmark, or a // source: provisional heuristic admission.
The verification subsystem is no longer a one-shot pass/fail report. Every claim resolution can flush an observation back to a calibration repository, every consensus run can pull calibrated posteriors from history, and every closed loop runs an external control arm so the calibration's effect is measured, not assumed.
- Per-judge Bayesian reliability calibration — Beta(7,3) prior with a sensitivity / specificity split per `claim_type`. Posteriors stored in a SQLite-backed `ReliabilityRepository`; observations flushed on every claim resolution (see the sketch after this list).
- `MAX_ATTEMPTS` retry calibration — Kaplan-Meier survival math (`kmEstimate` / `kmMedianAttempts` / `logRankTest` with Greenwood + Brookmeyer-Crowley CIs); Schoenfeld sample-size derivation event-rate-corrected to ~519 (was 823) against the measured `event_rate=0.4762`, CP CI [0.4456, 0.5069].
- KPI gate tuning — Clopper-Pearson exact CIs; per-machine-class wall_time normalization with the 5-bucket `detectMachineClass`; frozen-baseline content-hash assertion; `loadCalibratedGates` + `hold_provisional` ratchet protection.
- Plan-mismatch fire-rate — measured via XmR control charts (Wheeler 1995, Western Electric 1956) with a synthetic-injection round-trip pre-flight that catches drift between the diagnostic prefix and the regex matcher.
- Externally-grounded held-out subsets — Ajv schema oracle, mathjs oracle, `tsc` subprocess code oracle, `validateSection` spec oracle. An `OracleUnavailableError` typed throw replaces stub-mode fabrication. This is the layer that breaks annotator circularity — judges and oracles share no inference path.
- CC-3 forced-exploration control arms — every closed loop carves out a 20% partition that reverts to the prior. Without it, calibration-on-calibration looks like progress whether or not it actually is.
- Cross-arm comparison metrics — `computeAblationComparison` / `computeReliabilityComparison` / `computeKpiGateComparison` produce paired-bootstrap CIs (Efron & Tibshirani 1993 §16.4; deterministic mulberry32 RNG; 12-decimal reproducibility pin). The outcome is a falsifiable recommendation: `calibrated_helps`, `prior_helps`, or `inconclusive_underpowered`.
- Mechanically-enforced held-out partition seals — three sealed lock files (`maxattempts-heldout.lock.json`, `kpigates-heldout.lock.json`, `heldout-partition.lock.json`) commit a sha256 of the partition before evaluation. The `SEAL_VERIFIED` typeof sentinel is the only way to compute cross-arm metrics on a sealed partition; passing anything else is a type error at the boundary.
- Production-mode dispatcher — `makeProductionDispatcher` + the `AgentInvoker` interface. The CLI `--mode production|canned` flag selects whether calibration sees real verdicts or canned ones; the canned arm is preserved for offline reproducibility.
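A minimal sketch of the reliability update in the first item above, with illustrative names (the actual `ReliabilityRepository` tracks a sensitivity / specificity split per `claim_type`; this collapses it to a single accuracy for brevity):

```typescript
// Illustrative only, not the repository's API. A Beta(7,3) prior (mean 0.7,
// effective sample size 10) updated from resolved observations.
interface BetaPosterior { alpha: number; beta: number; }

const RELIABILITY_PRIOR: BetaPosterior = { alpha: 7, beta: 3 };

interface Observation { judgeId: string; claimType: string; correct: boolean; }

function updatePosterior(prior: BetaPosterior, observations: Observation[]): BetaPosterior {
  const correct = observations.filter(o => o.correct).length;
  return {
    alpha: prior.alpha + correct,
    beta: prior.beta + (observations.length - correct),
  };
}

// Posterior mean used as the judge's consensus weight for that claim_type.
const posteriorMean = (p: BetaPosterior): number => p.alpha / (p.alpha + p.beta);
```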
start_pipeline(feature_description, codebase_path?)
→ returns the first NextAction; the host executes it and feeds the result
back via submit_action_result. Nine steps later: 9 PRD files written.
submit_action_result(run_id, result)
→ drives the reducer one more step. The host sees only SUBSTANTIVE actions
(ask_user, call_pipeline_tool, call_cortex_tool, spawn_subagents,
write_file, done, failed). emit_message is coalesced into the
messages array; the host never has to "advance past" a banner.
validate_prd_section(content, section_type)
→ deterministic Hard Output Rules — zero LLM calls, pure regex/parsing.
→ returns: violations[], hasCriticalViolations, totalScore.
validate_prd_document(sections[])
→ cross-section checks: SP arithmetic, AC numbering, FR-AC coverage,
test traceability. Catches what per-section validation misses.
coordinate_context_budget(prd_context, completed_sections[])
→ per-section retrieval/generation token budgets so Cortex recall and
section drafting don't fight over the same context window.
map_failure_to_retrieval(violations[])
→ closes the validation→retrieval feedback loop. When a section fails
validation, this returns the corrective Cortex query that would
have prevented the failure.
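The first two tools above define the entire host loop. A schematic sketch, assuming a generic `callTool` / `executeAction` pair standing in for the host's MCP plumbing (neither is exported by this project):

```typescript
// Schematic host loop. The result and action shapes are simplified stand-ins.
type NextAction =
  | { type: "ask_user" | "call_pipeline_tool" | "call_cortex_tool" | "spawn_subagents" | "write_file"; payload: unknown }
  | { type: "done"; verification?: unknown }
  | { type: "failed"; reason: string };

declare function callTool(
  name: "start_pipeline" | "submit_action_result",
  args: Record<string, unknown>,
): Promise<{ run_id: string; action: NextAction; messages: string[] }>;

declare function executeAction(action: NextAction): Promise<unknown>;

async function driveRun(feature_description: string, codebase_path?: string): Promise<NextAction> {
  let { run_id, action } = await callTool("start_pipeline", { feature_description, codebase_path });

  // The host only ever sees substantive actions; emit_message is coalesced
  // into `messages`, so there is nothing to "advance past".
  while (action.type !== "done" && action.type !== "failed") {
    const result = await executeAction(action);
    ({ action } = await callTool("submit_action_result", { run_id, result }));
  }
  return action;
}
```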
claude plugin marketplace add cdeust/prd-spec-generator
claude plugin install prd-spec-generator

Restart your Claude Code session. The 17 MCP tools register on first stdio handshake. Then:
/generate-prd build OAuth login for the admin console
The plugin's bundled MCP server at mcp-server/index.js is self-contained
(only better-sqlite3 is an optional native dependency for the evidence
repository — gracefully degrades to in-memory mode when absent).
For full effect, install the three companion plugins so the pipeline can consume codebase intelligence, persistent memory, and the genius-agent panel:
claude plugin marketplace add cdeust/automatised-pipeline # codebase graph intel
claude plugin marketplace add cdeust/Cortex # persistent memory
claude plugin marketplace add cdeust/zetetic-team-subagents # the genius + team agents
claude plugin install automatised-pipeline
claude plugin install cortex
claude plugin install zetetic-team-subagents

Each plugin is independently useful; together they are the ai-architect ecosystem. See Companion ecosystem above.
For development or to run the audit cycle locally:
git clone https://github.com/cdeust/prd-spec-generator.git
cd prd-spec-generator
pnpm install --frozen-lockfile
pnpm build # builds all 9 buildable packages via tsc
pnpm bundle # produces the standalone mcp-server/index.js
pnpm test # 583 tests + 2 integration skipped (live MCP integration
                 #   env-gated by AIPRD_PIPELINE_BIN)

`pnpm verify` runs all of the above (install + build + bundle + test) — same as CI.
Prerequisites for source builds: Node.js 20.x or 22.x, pnpm v10+
(corepack enable && corepack prepare pnpm@10).
# Reducer end-to-end without a real host (uses the canned dispatcher):
pnpm test --filter @prd-gen/orchestration smoke
# Benchmark KPI run:
pnpm test --filter @prd-gen/benchmark pipeline-kpis

Both run in <2s on an M-series Mac. No LLM calls, no MCP traffic — the reducer is fully driven by canned ActionResults so you can audit behaviour offline.
The reducer produces ten sequential steps. Each step emits at most one substantive action; the host executes it and feeds the result back. A typical trial-tier feature run (11 sections) takes ~62 host-visible iterations.
| # | Step | What it produces |
|---|---|---|
| 1 | banner | Welcome banner with run ID + feature description + capability summary |
| 2 | context_detection | Detects PRD type from trigger words; asks user when ambiguous |
| 3 | input_analysis | Calls index_codebase (automatised-pipeline) when a path is provided; sets codebase_graph_path |
| 4 | feasibility_gate | Detects epic-scope inputs (≥2 EPIC_SIGNALS); asks user to focus |
| 5 | clarification | Compose-then-answer rounds (4–10 depending on tier); short-circuits on "proceed" |
| 6 | budget | Per-section retrieval/generation token allocation via Cortex paper's 60/30/10 split |
| 7 | section_generation | One section at a time: Cortex recall → engineer draft → validate → (retry up to 3) |
| 8 | jira_generation | Synthesises JIRA tickets from requirements + user_stories + acceptance_criteria |
| 9 | file_export | Writes 9 files (6 core + 3 companion) per SKILL.md Phase 4 |
| 10 | self_check | Two-phase multi-judge verification (see below); typed verification field on done |
Every step is independently testable (stepOnce(state, result?) returns the same shape as the runner). The runner coalesces emit_message actions internally so the host never sees a no-op.
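A sketch of what that testability looks like, with simplified stand-in types (the real PipelineState and ActionResult shapes live in @prd-gen/core and are richer than this):

```typescript
// Simplified stand-ins for illustration, not the project's exported types.
interface FakeState { run_id: string; step: string; }
interface FakeResult { kind: string; payload: unknown; }
interface StepOutput { state: FakeState; action: { type: string }; }

declare function stepOnce(state: FakeState, result?: FakeResult): StepOutput;

// Drive the clarification step with a canned "proceed" answer and assert the
// reducer advances without surfacing any emit_message banner to the host.
const before: FakeState = { run_id: "run-123", step: "clarification" };
const canned: FakeResult = { kind: "ask_user", payload: "proceed" };

const { state: after, action } = stepOnce(before, canned);
console.assert(after.step !== before.step, "reducer should advance the step");
console.assert(action.type !== "emit_message", "banners are coalesced, never surfaced");
```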
Three surfaces. The reducer drives the full pipeline; the validation + verification + budget tools can be consumed directly by other systems without entering the pipeline; the diagnostics surface exposes config + health + history.
Reducer (3):
start_pipeline Initialize a run; returns first NextAction
submit_action_result Drive the reducer one step; returns next NextAction
get_pipeline_state Read-only state snapshot for diagnostics
Validation (2):
validate_prd_section Hard Output Rules — single section
validate_prd_document Cross-section checks (SP/AC/FR/test traceability)
Verification (3):
plan_section_verification Extract claims + select judge panels
plan_document_verification Same, document-wide
conclude_verification Aggregate JudgeVerdict[] → VerificationReport;
accepts optional `claims` array carrying
`external_grounding` so oracle-resolved ground
truth can replace LLM-only consensus where
schema/math/code/spec oracles are available
Budget + feedback (2):
coordinate_context_budget Per-section token allocation
map_failure_to_retrieval Validation failure → corrective Cortex query
Diagnostics (7):
get_config, read_skill_config, check_health, get_prd_context_info,
list_available_strategies, get_quality_history, get_strategy_effectiveness
Each tool takes structured Zod-validated arguments and returns a typed response. No tool calls an LLM — section drafts and judge verdicts come back via the host's spawn_subagents action so the same pipeline runs against any agent runtime.
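As an illustration of what "Zod-validated" means at a tool boundary, here is a hypothetical schema for validate_prd_section's arguments (the field names and enum values are placeholders, not the project's actual definition):

```typescript
import { z } from "zod";

// Illustrative schema; the real one lives in the core package and may differ.
const ValidateSectionArgs = z.object({
  content: z.string().min(1),
  section_type: z.enum(["requirements", "user_stories", "acceptance_criteria", "nfr"]),
});

type ValidateSectionArgs = z.infer<typeof ValidateSectionArgs>;

// Malformed host input fails before any Hard Output Rule runs.
const parsed = ValidateSectionArgs.safeParse({ content: "", section_type: "nfr" });
if (!parsed.success) console.error(parsed.error.issues);
```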
The self_check step is a two-phase contract. Phase A plans the verification batch and persists a snapshot of (claim_ids, judges) to state. Phase B receives the verdicts, parses them against the snapshot, and aggregates via the consensus engine.
plan_document_verification(sections[])
→ extracts atomic Claims (FR-001, AC-005, NFR-LATENCY-1, ...)
→ selects a panel per claim type:
architecture → liskov + alexander + dijkstra + architect
performance → fermi + carnot + curie + erlang
security → wu + ibnalhaytham + security-auditor
data_model → mendeleev + dba + lavoisier
acceptance_criteria → toulmin + popper + test-engineer
...
[host spawns the panel; each agent returns a JSON verdict]
conclude_verification(verdicts[])
→ Per claim, runs consensus():
strategy: weighted_average (default) | bayesian
fail_threshold: 0.5 (≥50% confidence-weighted FAIL → forces FAIL)
precautionary tie-breaker: more-severe verdict wins
→ distribution_suspicious flag fires when 100% PASS over ≥5 claims
→ returns ConsensusVerdict[] with full distribution + dissenting list
The verdict taxonomy is deliberately five-level — not binary. NFR claims (latency, fps, throughput, storage) MUST NOT receive PASS: they are SPEC-COMPLETE if a measurement method is specified, NEEDS-RUNTIME otherwise. Judges that default to PASS for everything are caught by the distribution_suspicious detector and flagged in the typed done.verification field.
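For intuition, a condensed sketch of the confidence-weighted consensus rule; the real engine also implements the Bayesian strategy and per-judge calibrated weights, and both the fifth verdict level ("WARN") and the severity ordering below are assumptions made for illustration:

```typescript
// The README names PASS / SPEC-COMPLETE / NEEDS-RUNTIME / FAIL; "WARN" stands in
// for the unnamed fifth level, and the severity ordering is an assumption.
type Verdict = "PASS" | "SPEC-COMPLETE" | "NEEDS-RUNTIME" | "WARN" | "FAIL";
const SEVERITY: Verdict[] = ["PASS", "SPEC-COMPLETE", "NEEDS-RUNTIME", "WARN", "FAIL"];

interface JudgeVerdictLite { verdict: Verdict; confidence: number; } // confidence in [0,1]

function consensus(verdicts: JudgeVerdictLite[], failThreshold = 0.5): Verdict {
  const total = verdicts.reduce((s, v) => s + v.confidence, 0) || 1;
  const mass = new Map<Verdict, number>();
  for (const v of verdicts) mass.set(v.verdict, (mass.get(v.verdict) ?? 0) + v.confidence / total);

  // >= 50% confidence-weighted FAIL forces FAIL outright.
  if ((mass.get("FAIL") ?? 0) >= failThreshold) return "FAIL";

  // Otherwise the highest-mass verdict wins; ties break toward the more severe one.
  let best: Verdict = "PASS";
  let bestMass = -1;
  for (const v of SEVERITY) {
    const m = mass.get(v) ?? 0;
    if (m > 0 && m >= bestMass) { best = v; bestMass = m; } // >= prefers later, more severe entries
  }
  return best;
}
```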
The verification subsystem is itself a hypothesis: that consensus weighted by historically-calibrated reliability outperforms consensus weighted by a uniform prior. Phase 4 is the closed loop that tests it.
- Observe. Each verification run can flush per-judge observations (claim_id, claim_type, judge_id, verdict, oracle_truth?) to a SQLite reliability repository. Observations carry an `external_grounding` field that propagates from `Claim` through the orchestrator to the oracle resolution path; when an external oracle (Ajv schema, mathjs, `tsc`, `validateSection`) can resolve the claim, its truth replaces LLM-only consensus.
- Calibrate. On subsequent runs, calibrated posteriors weight consensus per judge, per `claim_type`. A 20% control-arm partition is forced-explored using the prior (`getReliabilityForRun` / `getRetryArmForRun` decide which arm a given run lands in deterministically from `run_id`). Without the control arm, calibration-on-calibration looks like progress whether or not it actually is.
- Compare. Cross-arm metrics (`computeAblationComparison`, `computeReliabilityComparison`, `computeKpiGateComparison`) run paired-bootstrap CIs (Efron & Tibshirani 1993 §16.4; deterministic mulberry32 RNG; 12-decimal reproducibility pin) and emit one of three falsifiable recommendations: `calibrated_helps`, `prior_helps`, or `inconclusive_underpowered` (a stripped-down resampling sketch follows this list).
- Seal. Held-out partitions are committed to lock files (`maxattempts-heldout.lock.json` for §4.2, `kpigates-heldout.lock.json` for §4.5, `heldout-partition.lock.json` for the §4.1 50-claim externally-grounded corpus) with a sha256 hash of the partition. The cross-arm metric functions accept a `SEAL_VERIFIED` typeof sentinel as a parameter; the only way to obtain that sentinel is to verify the seal first. Peeking at a held-out partition before evaluation is a type error.
- Ground. Where an external oracle can resolve a claim deterministically, it does. Where it cannot, `OracleUnavailableError` is thrown rather than fabricating a stub-mode truth. This is the line that breaks annotator circularity: judges trained against (or biased toward) LLM-style reasoning cannot poison calibration that uses non-LLM truth.
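A stripped-down sketch of the Compare step's resampling skeleton, assuming a higher-is-better outcome metric and illustrative names (the committed implementation follows Efron & Tibshirani §16.4 and pins reproducibility to 12 decimals; none of that rigor is reproduced here):

```typescript
// Deterministic PRNG (mulberry32) so the resampling is reproducible from a seed.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface PairedOutcome { calibrated: number; prior: number; } // same run, both arms

type Recommendation = "calibrated_helps" | "prior_helps" | "inconclusive_underpowered";

function pairedBootstrap(pairs: PairedOutcome[], reps = 2000, seed = 42) {
  const rng = mulberry32(seed);
  const diffs = pairs.map(p => p.calibrated - p.prior);
  const means: number[] = [];
  for (let r = 0; r < reps; r++) {
    let sum = 0;
    for (let i = 0; i < diffs.length; i++) sum += diffs[Math.floor(rng() * diffs.length)];
    means.push(sum / diffs.length);
  }
  means.sort((x, y) => x - y);
  const ciLow = means[Math.floor(0.025 * reps)];
  const ciHigh = means[Math.ceil(0.975 * reps) - 1];
  // A CI that excludes zero supports one arm; otherwise the comparison is underpowered.
  const recommendation: Recommendation =
    ciLow > 0 ? "calibrated_helps" : ciHigh < 0 ? "prior_helps" : "inconclusive_underpowered";
  return { ciLow, ciHigh, recommendation };
}
```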
The lock files, the seal-verification dance, and the control-arm partition together mean: when a Phase 4 cross-arm comparison says "calibrated_helps with 95% CI excluding zero," the claim is measured, not vibes-checked. When it says "inconclusive_underpowered," that is also a falsifiable claim — you need more data, not more confidence.
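The "type error at the boundary" guarantee can be pictured as a branded sentinel: the only function that returns it verifies the sha256 of the partition against the lock file first. A sketch of the pattern (simplified; the real lock-file schema and function names may differ):

```typescript
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

// Branded sentinel: values of this type cannot be constructed outside verifySeal.
declare const sealBrand: unique symbol;
type SealVerified = { readonly [sealBrand]: true };

interface HeldOutLock { partition_sha256: string; } // illustrative lock shape

function verifySeal(lockPath: string, partition: unknown[]): SealVerified {
  const lock: HeldOutLock = JSON.parse(readFileSync(lockPath, "utf8"));
  const actual = createHash("sha256").update(JSON.stringify(partition)).digest("hex");
  if (actual !== lock.partition_sha256) {
    throw new Error(`held-out partition does not match sealed hash in ${lockPath}`);
  }
  return {} as SealVerified; // the only construction site
}

// Cross-arm metrics demand the sentinel; passing anything else fails to compile.
function computeComparison(partition: unknown[], seal: SealVerified): void {
  void partition; void seal; // ... paired-bootstrap over the sealed partition
}
```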
Ten workspace packages, each independently buildable, with strict Clean Architecture layering enforced by package boundaries.
core ← domain types, schemas, agent identities
│ no I/O, no infrastructure dependency
│ Zod-validated; the only place where verdict /
│ section_type / capability shapes are defined
▼
validation ← Hard Output Rules (per-section + cross-section)
│ pure functions; no I/O
▼
strategy ← thinking-strategy selector (genius pattern routing)
│
meta-prompting ← prompt builders for clarification / draft / jira
│ pure string composition
▼
verification ← claim extraction + judge selection +
│ consensus engine (weighted_average + Bayesian)
│ + buildJudgePrompt
▼
orchestration ← stateless reducer, 9 step handlers, runner
│ step(state, result?) → next_state, action
│ emit_message coalescing; canned-dispatcher utility
▼
ecosystem-adapters← StdioMcpClient, AutomatisedPipelineClient, CortexClient
│ the only package allowed to do I/O
▼
mcp-server ← composition root; 17 tools registered;
│ evidence repository (better-sqlite3, optional)
▼
benchmark ← pipeline KPI measurements + golden-fixture HOR scoring
│ + calibration/ subtree (Phase 4):
│ · ReliabilityRepository (SQLite, observation flush)
│ · Kaplan-Meier + log-rank + Schoenfeld N
│ · Clopper-Pearson exact CI + XmR control charts
│ · paired-bootstrap (Efron-Tibshirani 1993 §16.4)
│ · external oracles (Ajv / mathjs / tsc / validate)
│ · machine-class detector + frozen-baseline gates
│ · sealed held-out lock files (sha256 + SEAL_VERIFIED)
│ · production-mode dispatcher + AgentInvoker seam
│ Audit lineage: JSONL + .xmr sidecars per run.
skill ← SKILL.md + slash-command definitions for Claude Code
Every package's package.json is checked: core depends only on zod; verification depends only on core; orchestration depends on core/validation/verification/meta-prompting (NOT on ecosystem-adapters); ecosystem-adapters depends on core + verification; mcp-server is the only place where everything composes.
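One way to picture that check, with the allow-list transcribed from the sentence above (illustrative enforcement; the repo's actual mechanism may be a CI script or lint rule, and the package names are assumed from the test filters shown earlier):

```typescript
import { readFileSync } from "node:fs";

// Allowed workspace dependencies per package, as described above.
const ALLOWED: Record<string, string[]> = {
  "@prd-gen/core": [], // only the external zod dependency
  "@prd-gen/verification": ["@prd-gen/core"],
  "@prd-gen/orchestration": ["@prd-gen/core", "@prd-gen/validation", "@prd-gen/verification", "@prd-gen/meta-prompting"],
  "@prd-gen/ecosystem-adapters": ["@prd-gen/core", "@prd-gen/verification"],
};

function layerViolations(pkgJsonPath: string, name: string): string[] {
  const pkg = JSON.parse(readFileSync(pkgJsonPath, "utf8"));
  const deps = Object.keys(pkg.dependencies ?? {}).filter(d => d.startsWith("@prd-gen/"));
  const allowed = new Set(ALLOWED[name] ?? []);
  return deps.filter(d => !allowed.has(d)); // any hit is a layering violation
}
```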
The Phase 3+4 cross-audit found and fixed two layer violations:
- `orchestration` was importing `extractJsonObject` and `buildJudgePrompt` from `ecosystem-adapters` — pure utilities lived in the wrong package; moved to `core` and `verification` respectively.
- Pure domain types (`Claim`, `JudgeVerdict`, `JudgeRequest`, `AgentIdentity`) lived in `ecosystem-adapters/contracts/subagent.ts`; moved to `core/domain/agent.ts`. The infrastructure package now re-exports them as a backward-compat shim.
| Failure mode | What we do |
|---|---|
| Section drift between turns | Single immutable PipelineState snapshot per step; reducer is pure; host can replay any step |
| Hallucinated symbols | validate_prd_section runs Hard Output Rules; symbols cross-checked against automatised-pipeline graph if codebase_path is set |
| NFRs claiming PASS without measurement | Verdict taxonomy refuses PASS for latency/throughput/fps/storage; consensus engine forwards SPEC-COMPLETE / NEEDS-RUNTIME |
| Confirmatory bias (every judge says PASS) | distribution_suspicious flag fires at 100% PASS over ≥5 claims; surfaced in typed done.verification.distribution_suspicious |
| Acceptance criteria not traceable to requirements | Cross-document validator checks FR-AC coverage and AC numbering gaps |
| Tests claiming "comprehensive" without listing what they cover | Test-traceability rule: every section's claimed test must reference an FR or AC ID |
| Retries that use the same context as the failure | map_failure_to_retrieval closes the validator→Cortex feedback loop; corrective queries before retry |
| Magic-number budgets ("we'll use 4K tokens for retrieval") | coordinate_context_budget produces per-section allocations from the canonical SECTIONS_BY_CONTEXT plan |
┌────────────────────────────────────┐
│ Claude Code (host) │
└────────────────────┬───────────────┘
│ stdio MCP
┌──────────────────────────────────┼──────────────────────────────────┐
▼ ▼ ▼
┌───────────────────┐ ┌────────────────────┐ ┌───────────────────┐
│ automatised- │ graph_path │ prd-spec- │ recall │ Cortex │
│ pipeline │ ───────────► │ generator │ ◄─────────── │ (memory engine) │
│ (Rust MCP) │ │ (TS MCP) │ │ (Python MCP) │
│ │ symbols │ │ excerpts │ │
│ read-only │ ◄──────────► │ stateless │ ───────────► │ thermodynamic │
│ intelligence │ │ reducer │ │ memory │
└───────────────────┘ └─────────┬──────────┘ └───────────────────┘
│
│ spawn_subagents
▼
┌─────────────────────────────┐
│ zetetic-team-subagents │
│ 97 genius + 19 team │
│ Each judge cites its │
│ primary paper. │
└─────────────────────────────┘
Each project owns one concern. automatised-pipeline knows what's true about the code. Cortex knows what we already decided. zetetic-team-subagents knows how to reason about a specific shape of claim. prd-spec-generator is the deterministic glue that turns those three signals into a PRD an agent can act on.
Every load-bearing constant in this codebase carries a // source: annotation. Three forms are accepted:
// source: <citation> // a paper, a spec, a referenced design doc
// source: benchmark <path> // a committed benchmark whose output produced this value
// source: provisional heuristic — <calibration plan>
// honest admission; tells the next reader
// (a) why the value is what it is today and
//   (b) what evidence would change it

The cross-audit found and tagged every previously bare constant. Examples:
// pipeline-kpis.ts
const KPI_GATES = {
/** source: provisional heuristic. Smoke baseline = 62 iterations on
* trial+codebase; cap is 100 (~60% headroom). dijkstra cross-audit
* derived a structural max of 9 emit_message hops; the substantive-
* action count builds on that. Phase 4.5 will replace with measured
* P95 + 1σ. */
iteration_count_max: 100,
...
};
// verification/consensus.ts
/** source: provisional heuristic — Beta(7,3) (mean 0.7, ESS=10,
* moderately informative toward reliability). Phase 4.1 will replace
* with per-agent Beta(α+correct, β+incorrect) calibrated from history. */
const DEFAULT_RELIABILITY_PRIOR_MEAN = 0.7;

The four pillars (consistent / true / useful / necessary) and the seven rules of zetetic inquiry are inherited from the zetetic-team-subagents standard. Provisional values are not silently propagated as truth.
The same standard applied to itself.
- It does not write code. This generator produces a PRD. The downstream coding agent (separate system) reads the PRD, the graph, and Cortex memory; it writes the implementation. Symbols in the PRD are validated against the graph but never edited by us.
- It does not validate prose quality. Hard Output Rules check structural invariants (FR numbering, AC traceability, NFR shape, cross-references). They do not check whether a sentence is well-written or persuasive. That is what the multi-judge phase is for, and even there the judges return verdicts on claims — atomic assertions — not on style.
- The judge phase is end-to-end testable but the judges are not deterministic. In tests we use a canned dispatcher that returns 100% PASS by construction; the `distribution_suspicious` detector exists precisely because real judge panels can also degenerate into confirmatory consensus, and we do not pretend otherwise.
- The KPI gates were provisional; Phase 4.5 has shipped. `iteration_count_max`, `wall_time_ms_max`, and `mean_section_attempts_max` were originally canned-dispatcher baselines. They are now calibrated against the K=100 frozen baseline with Clopper-Pearson exact CIs, per-machine-class wall_time normalization, and `loadCalibratedGates` + `hold_provisional` ratchet protection (a minimal Clopper-Pearson sketch follows this list). The §4.5 lock file commits a content-hash of the baseline; mutating it post hoc fails the seal verification. Where data is still thin, gates remain `hold_provisional` rather than locked. See docs/PHASE_4_PLAN.md for the full pre-registration.
- Citation presence ≠ citation validity. A `// source: Knuth 1998` comment satisfies the convention whether or not Knuth 1998 exists or supports the value. We enforce that the citation IS THERE; the cross-audit cycle (genius + team review every phase) is what keeps it honest.
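For reference, the Clopper-Pearson interval behind the gate calibration can be computed with no statistics dependency; a self-contained sketch follows (the benchmark package's implementation may differ):

```typescript
// Exact binomial CI via bisection on the binomial tails. Illustrative only.
function logBinomPmf(k: number, n: number, p: number): number {
  let logC = 0;
  for (let i = 0; i < k; i++) logC += Math.log(n - i) - Math.log(i + 1);
  return logC + k * Math.log(p) + (n - k) * Math.log(1 - p);
}

function binomCdf(k: number, n: number, p: number): number {
  let s = 0;
  for (let i = 0; i <= k; i++) s += Math.exp(logBinomPmf(i, n, p));
  return Math.min(s, 1);
}

// Root of a monotonically increasing f on (0, 1).
function solveIncreasing(f: (p: number) => number): number {
  let lo = 0, hi = 1;
  for (let i = 0; i < 60; i++) {
    const mid = (lo + hi) / 2;
    if (f(mid) < 0) lo = mid; else hi = mid;
  }
  return (lo + hi) / 2;
}

/** Exact two-sided CI for k successes out of n trials. */
function clopperPearson(k: number, n: number, alpha = 0.05): [number, number] {
  const lower = k === 0 ? 0 : solveIncreasing(p => (1 - binomCdf(k - 1, n, p)) - alpha / 2);
  const upper = k === n ? 1 : solveIncreasing(p => alpha / 2 - binomCdf(k, n, p));
  return [lower, upper];
}

// e.g. 48 passes out of a 100-run frozen baseline gives roughly [0.38, 0.58].
console.log(clopperPearson(48, 100));
```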
The repo ships a multi-agent cross-audit workflow. After every non-trivial phase:
# Engineering team review:
# architect, code-reviewer, refactorer, test-engineer, security-auditor,
# devops-engineer, dba (when relevant)
# Genius team review:
# feynman (integrity), curie (measurement), popper (falsifiability),
# dijkstra (correctness), shannon (signal), deming (variation),
#   poincare (qualitative), ...

Each agent reads the current state of the code (not from memory) and produces a ranked finding list. The Phase 3+4 cycle generated 30 findings; 28 were closed in the same cycle (4 CRIT + 13 HIGH + 11 MED). Two are deferred to Phase 4 calibration with the evidence required to close them documented in docs/PHASE_4_PLAN.md.
packages/
├── core/ Domain types · schemas · agent identities · evidence repo
├── validation/ Hard Output Rules · per-section + cross-section validators
├── verification/ Claim extraction · judge selection · consensus engine
├── meta-prompting/ Prompt builders (clarification / draft / jira)
├── strategy/ Thinking-strategy selector
├── orchestration/ Stateless reducer · 9 step handlers · runner · canned-dispatcher
├── ecosystem-adapters/ StdioMcpClient · AutomatisedPipelineClient · CortexClient
├── mcp-server/ Composition root · 17 MCP tools registered
├── benchmark/ Pipeline KPI measurement · golden-fixture HOR scoring
│ └── calibration/ Phase 4: ReliabilityRepository · KM survival ·
│ Clopper-Pearson · XmR · paired-bootstrap ·
│ external oracles · sealed held-out partitions ·
│ production-mode dispatcher
└── skill/ SKILL.md · slash-command definitions
MIT.
Don't ship a PRD that invents functions that don't exist and claims latency it can't measure.
Ship one whose every claim was judged by Pearl, Curie, Liskov, and a panel of seven others, validated against the call graph, and grounded in what Cortex remembers from yesterday.