ROADMAP.md

Clawable Coding Harness Roadmap

Goal

Turn claw-code into the most clawable coding harness:

no human-first terminal assumptions
no fragile prompt injection timing
no opaque session state
no hidden plugin or MCP failures
no manual babysitting for routine recovery

This roadmap assumes the primary users are claws wired through hooks, plugins, sessions, and channel events.

Definition of "clawable"

A clawable harness is:

deterministic to start
machine-readable in state and failure modes
recoverable without a human watching the terminal
branch/test/worktree aware
plugin/MCP lifecycle aware
event-first, not log-first
capable of autonomous next-step execution

Current Pain Points

1. Session boot is fragile

trust prompts can block TUI startup
prompts can land in the shell instead of the coding agent
"session exists" does not mean "session is ready"

2. Truth is split across layers

tmux state
clawhip event stream
git/worktree state
test state
gateway/plugin/MCP runtime state

3. Events are too log-shaped

claws currently infer too much from noisy text
important states are not normalized into machine-readable events

4. Recovery loops are too manual

restart worker
accept trust prompt
re-inject prompt
detect stale branch
retry failed startup
classify infra vs code failures manually

5. Branch freshness is not enforced enough

side branches can miss already-landed main fixes
broad test failures can be stale-branch noise instead of real regressions

6. Plugin/MCP failures are under-classified

startup failures, handshake failures, config errors, partial startup, and degraded mode are not exposed cleanly enough

7. Human UX still leaks into claw workflows

too much depends on terminal/TUI behavior instead of explicit agent state transitions and control APIs

Product Principles

State machine first — every worker has explicit lifecycle states.
Events over scraped prose — channel output should be derived from typed events.
Recovery before escalation — known failure modes should auto-heal once before asking for help.
Branch freshness before blame — detect stale branches before treating red tests as new regressions.
Partial success is first-class — e.g. MCP startup can succeed for some servers and fail for others, with structured degraded-mode reporting.
Terminal is transport, not truth — tmux/TUI may remain implementation details, but orchestration state must live above them.
Policy is executable — merge, retry, rebase, stale cleanup, and escalation rules should be machine-enforced.

Roadmap

Phase 1 — Reliable Worker Boot

1. Ready-handshake lifecycle for coding workers

Add explicit states:

spawning
trust_required
ready_for_prompt
prompt_accepted
running
blocked
finished
failed

Acceptance:

prompts are never sent before ready_for_prompt
trust prompt state is detectable and emitted
shell misdelivery becomes detectable as a first-class failure state

2. Trust prompt resolver

Add allowlisted auto-trust behavior for known repos/worktrees.

Acceptance:

trusted repos auto-clear trust prompts
events emitted for trust_required and trust_resolved
non-allowlisted repos remain gated

3. Structured session control API

Provide machine control above tmux:

create worker
await ready
send task
fetch state
fetch last error
restart worker
terminate worker

Acceptance:

a claw can operate a coding worker without raw send-keys as the primary control plane

Phase 2 — Event-Native Clawhip Integration

4. Canonical lane event schema

Define typed events such as:

lane.started
lane.ready
lane.prompt_misdelivery
lane.blocked
lane.red
lane.green
lane.commit.created
lane.pr.opened
lane.merge.ready
lane.finished
lane.failed
branch.stale_against_main

Acceptance:

clawhip consumes typed lane events
Discord summaries are rendered from structured events instead of pane scraping alone

5. Failure taxonomy

Normalize failure classes:

prompt_delivery
trust_gate
branch_divergence
compile
test
plugin_startup
mcp_startup
mcp_handshake
gateway_routing
tool_runtime
infra

Acceptance:

blockers are machine-classified
dashboards and retry policies can branch on failure type

6. Actionable summary compression

Collapse noisy event streams into:

current phase
last successful checkpoint
current blocker
recommended next recovery action

Acceptance:

channel status updates stay short and machine-grounded
claws stop inferring state from raw build spam

Phase 3 — Branch/Test Awareness and Auto-Recovery

7. Stale-branch detection before broad verification

Before broad test runs, compare current branch to main and detect if known fixes are missing.

Acceptance:

emit branch.stale_against_main
suggest or auto-run rebase/merge-forward according to policy
avoid misclassifying stale-branch failures as new regressions

8. Recovery recipes for common failures

Encode known automatic recoveries for:

trust prompt unresolved
prompt delivered to shell
stale branch
compile red after cross-crate refactor
MCP startup handshake failure
partial plugin startup

Acceptance:

one automatic recovery attempt occurs before escalation
the attempted recovery is itself emitted as structured event data

9. Green-ness contract

Workers should distinguish:

targeted tests green
package green
workspace green
merge-ready green

Acceptance:

no more ambiguous "tests passed" messaging
merge policy can require the correct green level for the lane type

Phase 4 — Claws-First Task Execution

10. Typed task packet format

Define a structured task packet with fields like:

objective
scope
repo/worktree
branch policy
acceptance tests
commit policy
reporting contract
escalation policy

Acceptance:

claws can dispatch work without relying on long natural-language prompt blobs alone
task packets can be logged, retried, and transformed safely

11. Policy engine for autonomous coding

Encode automation rules such as:

if green + scoped diff + review passed -> merge to dev
if stale branch -> merge-forward before broad tests
if startup blocked -> recover once, then escalate
if lane completed -> emit closeout and cleanup session

Acceptance:

doctrine moves from chat instructions into executable rules

12. Claw-native dashboards / lane board

Expose a machine-readable board of:

repos
active claws
worktrees
branch freshness
red/green state
current blocker
merge readiness
last meaningful event

Acceptance:

claws can query status directly
human-facing views become a rendering layer, not the source of truth

Phase 5 — Plugin and MCP Lifecycle Maturity

13. First-class plugin/MCP lifecycle contract

Each plugin/MCP integration should expose:

config validation contract
startup healthcheck
discovery result
degraded-mode behavior
shutdown/cleanup contract

Acceptance:

partial-startup and per-server failures are reported structurally
successful servers remain usable even when one server fails

14. MCP end-to-end lifecycle parity

Close gaps from:

config load
server registration
spawn/connect
initialize handshake
tool/resource discovery
invocation path
error surfacing
shutdown/cleanup

Acceptance:

parity harness and runtime tests cover healthy and degraded startup cases
broken servers are surfaced as structured failures, not opaque warnings

Immediate Backlog (from current real pain)

Priority order: P0 = blocks CI/green state, P1 = blocks integration wiring, P2 = clawability hardening, P3 = swarm-efficiency improvements.

P0 — Fix first (CI reliability)

Isolate render_diff_report tests into tmpdir — done: render_diff_report_for() tests run in temp git repos instead of the live working tree, and targeted cargo test -p rusty-claude-cli render_diff_report -- --nocapture now stays green during branch/worktree activity
Expand GitHub CI from single-crate coverage to workspace-grade verification — done: .github/workflows/rust-ci.yml now runs cargo test --workspace plus fmt/clippy at the workspace level
Add release-grade binary workflow — done: .github/workflows/release.yml now builds tagged Rust release artifacts for the CLI
Add container-first test/run docs — done: Containerfile + docs/container.md document the canonical Docker/Podman workflow for build, bind-mount, and cargo test --workspace usage
Surface doctor / preflight diagnostics in onboarding docs and help — done: README + USAGE now put claw doctor / /doctor in the first-run path and point at the built-in preflight report
Automate branding/source-of-truth residue checks in CI — done: .github/scripts/check_doc_source_of_truth.py and the doc-source-of-truth CI job now block stale repo/org/invite residue in tracked docs and metadata
Eliminate warning spam from first-run help/build path — done: current cargo run -q -p rusty-claude-cli -- --help renders clean help output without a warning wall before the product surface
Promote doctor from slash-only to top-level CLI entrypoint — done: claw doctor is now a local shell entrypoint with regression coverage for direct help and health-report output
Make machine-readable status commands actually machine-readable — done: claw --output-format json status and claw --output-format json sandbox now emit structured JSON snapshots instead of prose tables
Unify legacy config/skill namespaces in user-facing output — done: skills/help JSON/text output now present .claw as the canonical namespace and collapse legacy roots behind .claw-shaped source ids/labels
Honor JSON output on inventory commands like skills and mcp — done: direct CLI inventory commands now honor --output-format json with structured payloads for both skills and MCP inventory
Audit --output-format contract across the whole CLI surface — done: direct CLI commands now honor deterministic JSON/text handling across help/version/status/sandbox/agents/mcp/skills/bootstrap-plan/system-prompt/init/doctor, with regression coverage in output_format_contract.rs and resumed /status JSON coverage

P1 — Next (integration wiring, unblocks verification) 2. Add cross-module integration tests — done: 12 integration tests covering worker→recovery→policy, stale_branch→policy, green_contract→policy, reconciliation flows 3. Wire lane-completion emitter — done: lane_completion module with detect_lane_completion() auto-sets LaneContext::completed from session-finished + tests-green + push-complete → policy closeout 4. Wire SummaryCompressor into the lane event pipeline — done: compress_summary_text() feeds into LaneEvent::Finished detail field in tools/src/lib.rs

P2 — Clawability hardening (original backlog) 5. Worker readiness handshake + trust resolution — done: WorkerStatus state machine with Spawning → TrustRequired → ReadyForPrompt → PromptAccepted → Running lifecycle, trust_auto_resolve + trust_gate_cleared gating 6. Prompt misdelivery detection and recovery — done: prompt_delivery_attempts counter, PromptMisdelivery event detection, auto_recover_prompt_misdelivery + replay_prompt recovery arm 7. Canonical lane event schema in clawhip — done: LaneEvent enum with Started/Blocked/Failed/Finished variants, LaneEvent::new() typed constructor, tools/src/lib.rs integration 8. Failure taxonomy + blocker normalization — done: WorkerFailureKind enum (TrustGate/PromptDelivery/Protocol/Provider), FailureScenario::from_worker_failure_kind() bridge to recovery recipes 9. Stale-branch detection before workspace tests — done: stale_branch.rs module with freshness detection, behind/ahead metrics, policy integration 10. MCP structured degraded-startup reporting — done: McpManager degraded-startup reporting (+183 lines in mcp_stdio.rs), failed server classification (startup/handshake/config/partial), structured failed_servers + recovery_recommendations in tool output 11. Structured task packet format — done: task_packet.rs module with TaskPacket struct, validation, serialization, TaskScope resolution (workspace/module/single-file/custom), integrated into tools/src/lib.rs 12. Lane board / machine-readable status API — done: Lane completion hardening + LaneContext::completed auto-detection + MCP degraded reporting surface machine-readable state 13. Session completion failure classification — done: WorkerFailureKind::Provider + observe_completion() + recovery recipe bridge landed 14. Config merge validation gap — done: config.rs hook validation before deep-merge (+56 lines), malformed entries fail with source-path context instead of merged parse errors 15. MCP manager discovery flaky test — done: manager_discovery_report_keeps_healthy_servers_when_one_server_fails now runs as a normal workspace test again after repeated stable passes, so degraded-startup coverage is no longer hidden behind #[ignore]

Commit provenance / worktree-aware push events — done: LaneCommitProvenance now carries branch/worktree/canonical-commit/supersession metadata in lane events, and dedupe_superseded_commit_events() is applied before agent manifests are written so superseded commit events collapse to the latest canonical lineage
Orphaned module integration audit — done: runtime now keeps session_control and trust_resolver behind #[cfg(test)] until they are wired into a real non-test execution path, so normal builds no longer advertise dead clawability surface area.
Context-window preflight gap — done: provider request sizing now emits context_window_blocked before oversized requests leave the process, using a model-context registry instead of the old naive max-token heuristic.
Subcommand help falls through into runtime/API path — done: claw doctor --help, claw status --help, claw sandbox --help, and nested mcp/skills help are now intercepted locally without runtime/provider startup, with regression tests covering the direct CLI paths.
Session state classification gap (working vs blocked vs finished vs truly stale) — done: agent manifests now derive machine states such as working, blocked_background_job, blocked_merge_conflict, degraded_mcp, interrupted_transport, finished_pending_report, and finished_cleanable, and terminal-state persistence records commit provenance plus derived state so downstream monitoring can distinguish quiet progress from truly idle sessions.
Resumed /status JSON parity gap — dogfooding shows fresh claw status --output-format json now emits structured JSON, but resumed slash-command status still leaks through a text-shaped path in at least one dispatch path. Local CI-equivalent repro fails rust/crates/rusty-claude-cli/tests/resume_slash_commands.rs::resumed_status_command_emits_structured_json_when_requested with expected value at line 1 column 1, so resumed automation can receive text where JSON was explicitly requested. Action: unify fresh vs resumed /status rendering through one output-format contract and add regression coverage so resumed JSON output is guaranteed valid.
Opaque failure surface for session/runtime crashes — repeated dogfood-facing failures can currently collapse to generic wrappers like Something went wrong while processing your request. Please try again, or use /new to start a fresh session. without exposing whether the fault was provider auth, session corruption, slash-command dispatch, render failure, or transport/runtime panic. This blocks fast self-recovery and turns actionable clawability bugs into blind retries. Action: preserve a short user-safe failure class (provider_auth, session_load, command_dispatch, render, runtime_panic, etc.), attach a local trace/session id, and ensure operators can jump from the chat-visible error to the exact failure log quickly.
doctor --output-format json check-level structure gap — direct dogfooding shows claw doctor --output-format json exposes has_failures at the top level, but individual check results (auth, config, workspace, sandbox, system) are buried inside flat prose fields like message / report. That forces claws to string-scrape human text instead of consuming stable machine-readable diagnostics. Action: emit structured per-check JSON (name, status, summary, details, and relevant typed fields such as sandbox fallback reason) while preserving the current human-readable report for text mode. P3 — Swarm efficiency
Swarm branch-lock protocol — done: branch_lock::detect_branch_lock_collisions() now detects same-branch/same-scope and nested-module collisions before parallel lanes drift into duplicate implementation
Commit provenance / worktree-aware push events — done: lane event provenance now includes branch/worktree/superseded/canonical lineage metadata, and manifest persistence de-dupes superseded commit events before downstream consumers render them

Suggested Session Split

Session A — worker boot protocol

Focus:

trust prompt detection
ready-for-prompt handshake
prompt misdelivery detection

Session B — clawhip lane events

Focus:

canonical lane event schema
failure taxonomy
summary compression

Session C — branch/test intelligence

Focus:

stale-branch detection
green-level contract
recovery recipes

Session D — MCP lifecycle hardening

Focus:

startup/handshake reliability
structured failed server reporting
degraded-mode runtime behavior
lifecycle tests/harness coverage

Session E — typed task packets + policy engine

Focus:

structured task format
retry/merge/escalation rules
autonomous lane closure behavior

MVP Success Criteria

We should consider claw-code materially more clawable when:

a claw can start a worker and know with certainty when it is ready
claws no longer accidentally type tasks into the shell
stale-branch failures are identified before they waste debugging time
clawhip reports machine states, not just tmux prose
MCP/plugin startup failures are classified and surfaced cleanly
a coding lane can self-recover from common startup and branch issues without human babysitting

Short Version

claw-code should evolve from:

a CLI a human can also drive

to:

a claw-native execution runtime
an event-native orchestration substrate
a plugin/hook-first autonomous coding harness

FilesExpand file tree

ROADMAP.md

Latest commit

History

ROADMAP.md

File metadata and controls

ROADMAP.md

Clawable Coding Harness Roadmap

Goal

Definition of "clawable"

Current Pain Points

1. Session boot is fragile

2. Truth is split across layers

3. Events are too log-shaped

4. Recovery loops are too manual

5. Branch freshness is not enforced enough

6. Plugin/MCP failures are under-classified

7. Human UX still leaks into claw workflows

Product Principles

Roadmap

Phase 1 — Reliable Worker Boot

1. Ready-handshake lifecycle for coding workers

2. Trust prompt resolver

3. Structured session control API

Phase 2 — Event-Native Clawhip Integration

4. Canonical lane event schema

5. Failure taxonomy

6. Actionable summary compression

Phase 3 — Branch/Test Awareness and Auto-Recovery

7. Stale-branch detection before broad verification

8. Recovery recipes for common failures

9. Green-ness contract

Phase 4 — Claws-First Task Execution

10. Typed task packet format

11. Policy engine for autonomous coding

12. Claw-native dashboards / lane board

Phase 5 — Plugin and MCP Lifecycle Maturity

13. First-class plugin/MCP lifecycle contract

14. MCP end-to-end lifecycle parity

Immediate Backlog (from current real pain)

Suggested Session Split

Session A — worker boot protocol

Session B — clawhip lane events

Session C — branch/test intelligence

Session D — MCP lifecycle hardening

Session E — typed task packets + policy engine

MVP Success Criteria

Short Version