Python companion for the memex family. Recovers legacy Lance datasets, reshapes them into rmcp-memex-native chunks, injects namespaces. Turns dark-matter memory into searchable memory.
The memex family uses LanceDB, but real-world datasets come in two shapes:
- Legacy root-as-table: `<db>/{_versions,_transactions,data}`, produced by older Python-lance ad-hoc pipelines where db-path == table-path.
- v2 db-with-subtables: `<db>/<name>.lance/`, the current layout used by rmcp-memex, rust-memex, and aicx.
rmcp-memex 0.5.x only reads v2. pymemex reads both, so you can inspect, sample, and recover legacy data without touching the originals.
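The two layouts above can be told apart from the directory structure alone. The following is a minimal sketch of such a check, not pymemex's actual detection code; the function name `detect_layout` and the `"unknown"` return value are hypothetical, and the layout labels are borrowed from the `inspect` output.

```python
from pathlib import Path


def detect_layout(db_path: Path) -> str:
    """Classify a directory as 'legacy-root', 'v2-subtable', or 'unknown'.

    Hypothetical helper for illustration; pymemex may use different rules.
    """
    # Legacy: the db directory is itself a Lance table, so the Lance
    # bookkeeping directory sits directly under it.
    if (db_path / "_versions").is_dir():
        return "legacy-root"
    # v2: the db directory holds one <name>.lance/ directory per table.
    if any(p.is_dir() and p.suffix == ".lance" for p in db_path.iterdir()):
        return "v2-subtable"
    return "unknown"
```

The check only reads directory names, so it is safe to run against original data.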
Dark-matter data has a second, deeper problem: flat legacy tables have no concept of a namespace. rmcp-memex requires one on every chunk — it's how retrieval stays scoped per project / corpus / tenant. Orphaned Lance tables with no namespace field can't be imported as-is. pymemex bridges that gap: it reshapes legacy rows into the rmcp-memex schema and injects a namespace (static or per-row) on the way out, so previously unreachable memory becomes searchable.
| Tool | Language | Role |
|---|---|---|
| rust-memex | Rust | Core semantic retrieval library (embedding + BM25 over LanceDB) |
| rmcp-memex | Rust | MCP server over rust-memex |
| aicx | Rust | Canonical corpus + MCP (session ingest) |
| pymemex (this) | Python | Introspection + legacy bridge |
Dev install from source:

    git clone https://github.com/Loctree/pymemex
    cd pymemex
    uv sync

Run via uv without activating a venv:

    uv run pymemex --help

Inspect a dataset:

    uv run pymemex inspect /path/to/lancedb

Works for both layouts. Shows:
- detected layout (legacy-root vs v2-subtable)
- subtable list (v2 only)
- schema with vector column detection (dim)
- row count, latest version
- last 5 versions with row/size deltas
Sample rows (vectors hidden by default):

    uv run pymemex sample /path/to/lancedb -n 5
    uv run pymemex sample /path/to/lancedb -n 2 --show-vectors

List subtables:

    uv run pymemex tables /path/to/lancedb

Legacy Python-lance pipelines often left behind ad-hoc tables: a flat schema with id, vector, a text column like summary or content, plus a few metadata fields, and crucially no namespace. Such a dataset (for example, a legacy 24k-row corpus of agent-session summaries) is invisible to rmcp-memex even though every row is already embedded. pymemex recover reshapes those rows into the rmcp-memex chunk schema, injects a namespace, and emits JSONL ready for import.

    pymemex recover /path/to/legacy-lancedb \
      --namespace proj:my-corpus \
      -o /tmp/corpus.jsonl

Column auto-detection picks id, the text column (text / summary / content / body), and the first `fixed_size_list<float*>` as the vector. Override any of them with the explicit flags below.

    rmcp-memex import --namespace proj:my-corpus --input /tmp/corpus.jsonl

After import, the previously dark-matter data becomes searchable:

    rmcp-memex rag_search -n proj:my-corpus -q "..."

| Flag | Purpose |
|---|---|
| --namespace | Static namespace injected on every row. |
| --namespace-from | Column name to use as per-row namespace (overrides --namespace). |
| --text-column | Explicit text column name. Defaults to auto-detection. |
| --id-column | Explicit id column name. Defaults to auto-detection. |
| --metadata-include (-m) | Repeatable. Source columns to forward into the metadata field. |
| --min-importance | Skip rows with importance < value (requires an importance column). |
| --permanent-only | Keep only rows where permanent is truthy. |
| --include-embeddings | Emit the vector field alongside text. Omit to re-embed downstream. |
| --dry-run | Count + stats only; no JSONL output. |
| -o / --output | Output path for JSONL. Default: stdout. |
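The two row filters in the table (--min-importance and --permanent-only) can be read as a simple predicate over each source row. The sketch below illustrates that reading; `keep_row` is a hypothetical name, not part of the pymemex API, and it assumes the importance value is present and non-null whenever a threshold is given.

```python
from typing import Any, Mapping, Optional


def keep_row(
    row: Mapping[str, Any],
    min_importance: Optional[float] = None,
    permanent_only: bool = False,
) -> bool:
    """Return True if the row passes both optional filters (illustrative only)."""
    # --min-importance: skip rows whose importance is below the threshold.
    # Assumption: the column exists and holds a number for every row.
    if min_importance is not None and row["importance"] < min_importance:
        return False
    # --permanent-only: keep only rows where `permanent` is truthy.
    if permanent_only and not row.get("permanent"):
        return False
    return True
```

With neither filter set, every row is kept, which matches the flags being optional.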
- Static (--namespace proj:my-corpus): every emitted row gets the same namespace. Best for single-origin corpora.
- Per-row (--namespace-from project): read the namespace from a source column. Best when one legacy table spans multiple projects and each row carries its own label.
When both flags are present, --namespace-from wins and --namespace acts as the fallback for rows where the source column is null.
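The precedence rule above (per-row value first, static namespace as the fallback for nulls) can be sketched in a few lines. `resolve_namespace` is a hypothetical helper written for illustration; in particular, raising an error when neither source yields a namespace is an assumption, since the behavior in that case is not described here.

```python
from typing import Any, Mapping, Optional


def resolve_namespace(
    row: Mapping[str, Any],
    static_namespace: Optional[str] = None,
    namespace_from: Optional[str] = None,
) -> str:
    """Pick the namespace for one row (illustrative, not pymemex's code)."""
    # Per-row mode wins when the source column holds a non-null value.
    if namespace_from is not None:
        value = row.get(namespace_from)
        if value is not None:
            return str(value)
    # Otherwise fall back to the static namespace, if one was given.
    if static_namespace is not None:
        return static_namespace
    # Assumption: a row without any namespace cannot be emitted.
    raise ValueError("no namespace available for row")
```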
Auto-detection heuristics:
- id: first match in `("id", "uuid", "_id")`
- text: first match in `("text", "summary", "content", "body")`
- vector: first column typed `fixed_size_list<float*>`
Override the id or text choice with --id-column or --text-column. If auto-detection guesses wrong, inspect the schema via pymemex inspect first to confirm the column names.
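The heuristics above can be sketched over a schema given as (column name, Arrow type string) pairs. This is an illustration under stated assumptions, not pymemex's implementation: `detect_columns` and `is_float_vector` are hypothetical names, "first match" is read as candidate priority order, and the vector test is a plain string check on the type name.

```python
from typing import Dict, Optional, Sequence, Tuple

ID_CANDIDATES = ("id", "uuid", "_id")
TEXT_CANDIDATES = ("text", "summary", "content", "body")


def first_match(candidates: Sequence[str], names: Sequence[str]) -> Optional[str]:
    """Return the first candidate (in priority order) present in the schema."""
    for candidate in candidates:
        if candidate in names:
            return candidate
    return None


def is_float_vector(type_str: str) -> bool:
    """Assumed string test for a fixed-size list of floats."""
    return type_str.startswith("fixed_size_list<") and "float" in type_str


def detect_columns(schema: Sequence[Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """Guess the id, text, and vector columns; None means nothing matched."""
    names = [name for name, _ in schema]
    vector = next((name for name, t in schema if is_float_vector(t)), None)
    return {
        "id": first_match(ID_CANDIDATES, names),
        "text": first_match(TEXT_CANDIDATES, names),
        "vector": vector,
    }
```

A `None` entry in the result is the signal to pass an explicit column flag instead of relying on detection.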

    from pathlib import Path
    from pymemex import open_dataset, inspect_columns, row_count

    info = open_dataset(Path("/path/to/lancedb"))
    print(info.layout, info.table_name, row_count(info))
    for col in inspect_columns(info):
        print(col.name, col.arrow_type, col.is_vector, col.vector_dim)

Recovery is usable as a library, not just a CLI:

    from pathlib import Path

    from pymemex import RecoveryConfig, recover

    # Static namespace; read the text from the legacy "summary" column.
    config = RecoveryConfig(namespace="proj:my-corpus", text_column="summary")
    for row in recover(Path("/path/to/legacy-lancedb"), config):
        # row: {"id", "namespace", "text", "vector", "content_hash", "metadata"}
        ...

All helpers are read-only. pymemex never mutates the source dataset.
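To make the emitted row shape concrete, here is a rough sketch of reshaping one flat legacy row into a chunk with the keys listed in the comment above, then serializing it as one JSONL line. `to_chunk` is a hypothetical function, and computing `content_hash` as the SHA-256 of the text is an assumption; the hashing scheme pymemex actually uses is not documented here.

```python
import hashlib
import json
from typing import Any, Dict, Mapping, Optional, Sequence


def to_chunk(
    row: Mapping[str, Any],
    namespace: str,
    id_column: str = "id",
    text_column: str = "text",
    vector_column: Optional[str] = None,
    metadata_include: Sequence[str] = (),
) -> Dict[str, Any]:
    """Reshape a flat legacy row into a chunk-like dict (illustrative only)."""
    text = row[text_column]
    chunk: Dict[str, Any] = {
        "id": str(row[id_column]),
        "namespace": namespace,
        "text": text,
        # Assumption: hash of the UTF-8 text; the real scheme may differ.
        "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        # Forward only the requested source columns, as with --metadata-include.
        "metadata": {k: row[k] for k in metadata_include if k in row},
    }
    # Emit the vector only when asked, mirroring --include-embeddings.
    if vector_column is not None:
        chunk["vector"] = list(row[vector_column])
    return chunk


legacy_row = {"id": 7, "summary": "hello", "project": "demo", "vector": [0.1, 0.2]}
line = json.dumps(to_chunk(legacy_row, "proj:my-corpus", text_column="summary",
                           metadata_include=["project"]))
```

Leaving `vector_column` unset keeps the line small and defers embedding to the downstream importer.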
- inspect: layout detection, schema, version summary
- sample: pretty sample (vectors hidden)
- tables: subtable listing
- recover: legacy → rmcp-memex JSONL with namespace injection
- stats: per-column statistics (null %, unique %, distributions)
- simplify: chunking + dedup
- map: YAML rules engine
- embed: optional re-embedding via pluggable endpoint
- pipe: one-shot recover → simplify → embed → import
BUSL-1.1 (see LICENSE). Change License: Apache 2.0 on 2030-04-15.
Vibecrafted. with AI Agents by VetCoders (c)2024-2026 The LibraxisAI Team