-
Notifications
You must be signed in to change notification settings - Fork 5
Compile Pipeline
The compile pipeline turns a directory of markdown files into a concept graph and surplus signals. It is a zero-dependency Python module (compiler/, stdlib only) invoked by the MCP namespace compile.*. This page walks the stages, the data that flows between them, and the four MCP operations that drive them from outside.
Repeated from Rationale axiom 3: compile that only re-formats is not compile. The stages below exist so that the output is not the same information in a different file -- it is new information (resolved aliases, transitive links, orphan and stale flags, contradiction candidates) that the raw markdown did not carry.
The stdlib-only Python modules, roughly in dependency order:
| Module | Role |
|---|---|
_md_parse.py |
Frontmatter + markdown tokeniser. No third-party dep. |
chunker.py |
Splits notes into overlap-windowed chunks. Config: chunk_size, chunk_overlap. |
extractor.py |
Pulls wikilinks, aliases, tags, headings, mtime out of each note. |
link_discovery.py |
Resolves [[target]] to real note paths; handles alias maps, case-folding, missing targets. |
concept_graph.py |
Builds {nodes, edges, orphans, unresolvedLinks} from the extraction pass. |
evaluate.py |
Surplus signals: orphans, stale notes, contradictions, concept coverage gaps. |
scheduler.py |
Decides which agent action (compile |
frontmatter_generator.py |
Writes computed metadata back into note frontmatter on request. |
kb_meta.py |
Top-level catalog state (_index.md, log.md) per Karpathy LLM Wiki discipline. |
compile.py |
Entry point. Orchestrates stages and writes graph to disk. |
models.py |
Dataclasses for Node, Edge, GraphSummary, EvaluationReport. |
No embeddings. No LLM calls inside the pipeline. Everything is deterministic text processing. Optional semantic features live in separate adapters; see Adapter-Spec.
Pipeline flow, roughly:
raw .md files
-> _md_parse (frontmatter + body tokens)
-> chunker (optional, for downstream semantic adapters)
-> extractor (links, aliases, tags, headings, mtime)
-> link_discovery (resolve [[target]] -> path, build alias map)
-> concept_graph (nodes, edges, orphan/unresolved flags)
-> evaluate (surplus: stale, contradictions, coverage gaps)
-> scheduler (which agent action to run next)
-> graph written back to disk
_md_parse.py reads each .md file, splits frontmatter YAML from body, and emits a token stream. Malformed YAML is logged and the file is indexed as body-only; the pipeline does not crash on one bad note.
chunker.py produces overlap-windowed chunks sized by compiler.chunk_size / compiler.chunk_overlap in vault-mind.yaml. Used by the optional memU and vaultbrain adapters; the core graph build does not need chunks.
extractor.py walks the parsed token stream and collects, per note:
- wikilinks
[[Target]]and[[Target|alias]] - declared aliases (
aliases:in frontmatter) - tags (inline
#tagandtags:in frontmatter) - headings (for anchor targets)
- mtime (for stale detection)
link_discovery.py builds a name-to-path map (including aliases) and resolves every [[target]] to a real path. Case folding and whitespace normalization live here. Unresolved links become unresolvedLinks in the graph summary -- not errors, signals.
concept_graph.py emits the graph structure exposed via vault.graph:
-
nodes[]withexistsflag -
edges[]with{from, to, count} -
orphans[]--.mdfiles with no inbound wikilinks -
unresolvedLinkscount
evaluate.py is where surplus is produced. Outputs:
- stale-note candidates (mtime older than threshold + no recent inbound traffic)
- contradiction candidates (same concept, divergent claims across notes)
- coverage gaps (topic folders thinner than their sibling folders)
- orphan + broken-link counts
scheduler.py reads the evaluation report and proposes the next agent action. Actions: compile, emerge, reconcile, prune, challenge. This is the agent.trigger MCP operation's brain; see agent.schedule for cron-based scheduling.
Four operations, all in the compile.* namespace (source: docs/mcp-tools-reference.md):
| Operation | Mutating | What it does |
|---|---|---|
compile.run |
yes | Run the pipeline. Optional topic param scopes to a subdirectory. |
compile.status |
no | Current compile state (idle |
compile.diff |
no | Show what would change on the next compile (optional topic filter). |
compile.abort |
yes | Cancel a running compile. Useful when a large vault mid-compile starts pinning CPU. |
compile.run is incremental by default -- unchanged nodes are reused. A full rebuild is achieved by deleting the on-disk graph cache and re-invoking.
The compiler writes its artefacts as plain files alongside the vault (exact path configurable via vault-mind.yaml; default is a .compile/ subdirectory). Filesystem-only. No database.
This is load-bearing: it is what makes headless operation possible. See Rationale "Why not just a vector database?" and Architecture for the headless-first invariant.
On the author's reference vault: 554 notes, 2507 edges, full compile in under 3 seconds on a laptop. Incremental compile of 1-2 edits in under 500 ms. 10k+ note vaults have not been benchmarked -- if you have one, open an issue and we will replace this paragraph with real numbers.
compile.run is one of five actions the agent.* namespace can trigger:
| Action | When it runs |
|---|---|
compile |
Staleness threshold exceeded or scheduler chose it as the primary contradiction. |
emerge |
Surplus evaluation found enough new pattern-level signal to propose a synthesis. |
reconcile |
Contradiction candidates exceed threshold; agent proposes dry-run merges. |
prune |
Orphans + stale exceed threshold; agent proposes dry-run deletions/archives. |
challenge |
Concept coverage gap or group of notes that asserts without citation. |
Only compile is pure code. The other four emit dry-run proposals for a persona (usually vault-architect or vault-curator) to review. Nothing mutates the vault without human approval; see Security-Model.
Add a new stage by:
- Dropping a module into
compiler/that accepts aCompileContextand returns a new field on the report. - Registering it in
compile.pybetween existing stages. - If the stage emits a new kind of signal the agent scheduler should act on, extend
scheduler.pywith a rule. - If the stage's output should be queryable from outside, add an MCP operation in the
compile.*orquery.*namespace. See Architecture "Extension points".
- Architecture -- the compiler is layer 3 of four; this page zooms into that layer.
- Rationale -- axiom 3 explains why compile must produce surplus.
- Adapter-Spec -- semantic adapters that consume chunks from this pipeline.
- Security-Model -- why agent-proposed mutations are dry-run by default.
-
docs/philosophy.md-- the Aufhebung framing of compile as de-alienation. -
docs/mcp-tools-reference.md-- authoritative signatures for the fourcompile.*operations.