pymemex

Python companion for the memex family. Recovers legacy Lance datasets, reshapes them into rmcp-memex-native chunks, injects namespaces. Turns dark-matter memory into searchable memory.

Why this exists

The memex family uses LanceDB, but real-world datasets come in two shapes:

Legacy root-as-table: <db>/{_versions,_transactions,data} — produced by older Python-lance ad-hoc pipelines where db-path == table-path.
v2 db-with-subtables: <db>/<name>.lance/ — current layout used by rmcp-memex, rust-memex, aicx.

rmcp-memex 0.5.x only reads v2. pymemex reads both, so you can inspect, sample, and recover legacy data without touching the originals.
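
The two layouts can be told apart from directory structure alone. The following is a minimal sketch of that idea, not pymemex's actual detection code: `detect_layout` is a hypothetical helper, and it relies only on the directory names listed above (`_versions` and `data` at the root for legacy, `<name>.lance/` subdirectories for v2).

```python
from pathlib import Path


def detect_layout(db_path: Path) -> str:
    """Classify a directory as 'legacy-root', 'v2-subtable', or 'unknown'.

    Hypothetical helper for illustration; pymemex's real logic may differ.
    """
    # Legacy root-as-table: the Lance dataset directories sit directly in <db>/.
    if (db_path / "_versions").is_dir() and (db_path / "data").is_dir():
        return "legacy-root"
    # v2 db-with-subtables: each table lives in its own <name>.lance/ directory.
    if any(p.is_dir() and p.suffix == ".lance" for p in db_path.iterdir()):
        return "v2-subtable"
    return "unknown"
```

Checking the legacy shape first means a root-as-table dataset is never misread as an empty v2 database.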

Dark-matter data has a second, deeper problem: flat legacy tables have no concept of a namespace. rmcp-memex requires one on every chunk — it's how retrieval stays scoped per project / corpus / tenant. Orphaned Lance tables with no namespace field can't be imported as-is. pymemex bridges that gap: it reshapes legacy rows into the rmcp-memex schema and injects a namespace (static or per-row) on the way out, so previously unreachable memory becomes searchable.

Family

Tool	Language	Role
rust-memex	Rust	Core semantic retrieval library (embedding + BM25 over LanceDB)
rmcp-memex	Rust	MCP server over `rust-memex`
aicx	Rust	Canonical corpus + MCP (session ingest)
pymemex (this)	Python	Introspection + legacy bridge

Install

Dev install from source:

git clone https://github.com/Loctree/pymemex
cd pymemex
uv sync

Run via uv without activating a venv:

uv run pymemex --help

Usage — Introspection (v0.1)

Inspect a Lance dataset (auto-detects layout)

uv run pymemex inspect /path/to/lancedb

Works for both layouts. Shows:

detected layout (legacy-root vs v2-subtable)
subtable list (v2 only)
schema with vector column detection (dim)
row count, latest version
last 5 versions with row/size deltas

Sample rows (vectors hidden by default)

uv run pymemex sample /path/to/lancedb -n 5
uv run pymemex sample /path/to/lancedb -n 2 --show-vectors

List subtables (v2 layout)

uv run pymemex tables /path/to/lancedb

Usage — Recovery (v0.2)

Legacy Python-lance pipelines often left behind ad-hoc tables: a flat schema with id, vector, a text column like summary or content, plus a few metadata fields, and crucially no namespace. Such a dataset — for example, a legacy 24k-row corpus of agent-session summaries — is invisible to rmcp-memex even though every row is already embedded. pymemex recover reshapes those rows into the rmcp-memex chunk schema, injects a namespace, and emits JSONL ready for import.
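
To make the reshaping step concrete, here is a rough sketch of turning one flat legacy row into a chunk-shaped dict. The output keys follow the row shape listed under Programmatic API. The `reshape_row` name, the SHA-256 content hash, and the sample column names are assumptions for illustration, not pymemex internals.

```python
import hashlib
import json


def reshape_row(row: dict, namespace: str, text_column: str = "summary",
                metadata_include: tuple = ()) -> dict:
    """Map a flat legacy row onto a chunk-shaped dict (illustrative only)."""
    text = row[text_column]
    return {
        "id": str(row["id"]),
        "namespace": namespace,
        "text": text,
        # Assumption: the hash is SHA-256 of the UTF-8 text; the real scheme may differ.
        "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        # Forward only the explicitly requested source columns.
        "metadata": {k: row[k] for k in metadata_include if k in row},
    }


legacy_row = {"id": 7, "summary": "refactored the ingest loop", "agent": "a1"}
chunk = reshape_row(legacy_row, "proj:my-corpus", metadata_include=("agent",))
print(json.dumps(chunk))  # one JSONL line
```

The sketch leaves out the `vector` key, which corresponds to running without `--include-embeddings` and re-embedding downstream.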

Basic command

pymemex recover /path/to/legacy-lancedb \
    --namespace proj:my-corpus \
    -o /tmp/corpus.jsonl

Column auto-detection picks the id column, the text column (text / summary / content / body), and the first fixed_size_list<float*> column as the vector. Override the id or text choice with --id-column or --text-column (see Options).

Pipe to rmcp-memex

rmcp-memex import --namespace proj:my-corpus --input /tmp/corpus.jsonl

After import, the previously dark-matter data becomes searchable:

rmcp-memex rag_search -n proj:my-corpus -q "..."

Options

Flag	Purpose
`--namespace`	Static namespace injected on every row.
`--namespace-from`	Column name to use as per-row namespace (overrides `--namespace`).
`--text-column`	Explicit text column name. Defaults to auto-detection.
`--id-column`	Explicit id column name. Defaults to auto-detection.
`--metadata-include` (`-m`)	Repeatable. Source columns to forward into the `metadata` field.
`--min-importance`	Skip rows with `importance < value` (requires an `importance` column).
`--permanent-only`	Keep only rows where `permanent` is truthy.
`--include-embeddings`	Emit the `vector` field alongside `text`. Omit to re-embed downstream.
`--dry-run`	Count + stats only; no JSONL output.
`-o` / `--output`	Output path for JSONL. Default: stdout.
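
The two filter flags amount to a per-row predicate. Below is a minimal sketch of that logic, assuming `importance` is numeric and `permanent` is any truthy or falsy value. `keep_row` is a hypothetical name, and the handling of a null `importance` is not specified above; this sketch drops such rows when a threshold is set.

```python
from typing import Optional


def keep_row(row: dict, min_importance: Optional[float] = None,
             permanent_only: bool = False) -> bool:
    """Return True if the row survives the filters (illustrative only)."""
    if min_importance is not None:
        importance = row.get("importance")
        # Assumption: rows with a missing/null importance are skipped.
        if importance is None or importance < min_importance:
            return False
    if permanent_only and not row.get("permanent"):
        return False
    return True


rows = [
    {"id": "a", "importance": 0.9, "permanent": True},
    {"id": "b", "importance": 0.2, "permanent": True},
    {"id": "c", "importance": 0.8, "permanent": False},
]
kept = [r["id"] for r in rows if keep_row(r, min_importance=0.5, permanent_only=True)]
```

With both filters active only row `a` is kept: `b` falls below the threshold and `c` is not permanent.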

Namespace injection strategies

Static (--namespace proj:my-corpus) — every emitted row gets the same namespace. Best for single-origin corpora.
Per-row (--namespace-from project) — read the namespace from a source column. Best when one legacy table spans multiple projects and each row carries its own label.

When both flags are present, --namespace-from wins and --namespace acts as the fallback for rows where the source column is null.
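
The precedence rule can be written as a small function. This is a sketch under the rules stated above, with `resolve_namespace` as a hypothetical name rather than a pymemex API.

```python
from typing import Optional


def resolve_namespace(row: dict, static: Optional[str] = None,
                      from_column: Optional[str] = None) -> Optional[str]:
    """Pick the namespace for one row: per-row column first, static as fallback."""
    if from_column is not None:
        value = row.get(from_column)
        if value is not None:
            return str(value)
    # Either no per-row column was requested or this row's value is null.
    return static
```

For example, `resolve_namespace({"project": None}, "proj:my-corpus", "project")` falls back to the static value, while a row with `project` set keeps its own label.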

Schema mapping

Auto-detection heuristics:

id: first match in ("id", "uuid", "_id")
text: first match in ("text", "summary", "content", "body")
vector: first column typed fixed_size_list<float*>

Override the id and text choices with --id-column and --text-column. If auto-detection guesses wrong, run pymemex inspect first to confirm the column names.
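
The heuristics above can be sketched in a few lines. This is an illustration rather than pymemex's implementation. It works on plain `(column_name, type_string)` pairs as a simplified stand-in for Arrow types, it assumes "first match" means candidate priority order (not schema order), and `detect_columns` and `first_match` are hypothetical names.

```python
from typing import Optional, Sequence, Tuple

ID_CANDIDATES = ("id", "uuid", "_id")
TEXT_CANDIDATES = ("text", "summary", "content", "body")


def first_match(names: Sequence[str], candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate (in priority order) present in the schema."""
    return next((c for c in candidates if c in names), None)


def detect_columns(schema: Sequence[Tuple[str, str]]) -> dict:
    """schema is a list of (column_name, type_string) pairs, in schema order."""
    names = [name for name, _ in schema]
    # Stand-in check: a real implementation would inspect Arrow types, not strings.
    vector = next(
        (name for name, type_str in schema
         if type_str.startswith("fixed_size_list<float")),
        None,
    )
    return {
        "id": first_match(names, ID_CANDIDATES),
        "text": first_match(names, TEXT_CANDIDATES),
        "vector": vector,
    }


schema = [
    ("uuid", "string"),
    ("content", "string"),
    ("summary", "string"),
    ("embedding", "fixed_size_list<float32>[384]"),
]
detected = detect_columns(schema)
```

Here `summary` is chosen over `content` because it comes earlier in the candidate tuple, even though `content` appears first in the schema.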

Programmatic API

from pathlib import Path
from pymemex import open_dataset, inspect_columns, row_count

info = open_dataset(Path("/path/to/lancedb"))
print(info.layout, info.table_name, row_count(info))
for col in inspect_columns(info):
    print(col.name, col.arrow_type, col.is_vector, col.vector_dim)

Recovery is usable as a library, not just a CLI:

from pathlib import Path

from pymemex import RecoveryConfig, recover

config = RecoveryConfig(namespace="proj:my-corpus", text_column="summary")
for row in recover(Path("/path/to/legacy-lancedb"), config):
    # row: {"id", "namespace", "text", "vector", "content_hash", "metadata"}
    ...

All helpers are read-only. pymemex never mutates the source dataset.
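
Since `recover` yields one dict per row, writing JSONL from library code is a short loop. The `write_jsonl` helper below is a hypothetical convenience, not part of pymemex, and accepts any iterable of dicts.

```python
import json
import sys
from typing import IO, Iterable


def write_jsonl(rows: Iterable[dict], out: IO[str] = sys.stdout) -> int:
    """Write one JSON object per line; return the number of rows written."""
    count = 0
    for row in rows:
        # ensure_ascii=False keeps non-ASCII text readable in the output file.
        out.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count
```

Assuming the `recover` and `config` names from the example above, a call such as `write_jsonl(recover(path, config), fh)` inside `with open("/tmp/corpus.jsonl", "w", encoding="utf-8") as fh:` would stream rows to disk without holding the whole corpus in memory.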

Roadmap

inspect — layout detection, schema, version summary
sample — pretty sample (vectors hidden)
tables — subtable listing
recover — legacy → rmcp-memex JSONL with namespace injection
stats — per-column statistics (null %, unique %, distributions)
simplify — chunking + dedup
map — YAML rules engine
embed — optional re-embedding via pluggable endpoint
pipe — one-shot recover → simplify → embed → import

License

BUSL-1.1 (see LICENSE). Change License: Apache 2.0 on 2030-04-15.
