pymemex

Python companion for the memex family. Recovers legacy Lance datasets, reshapes them into rmcp-memex-native chunks, injects namespaces. Turns dark-matter memory into searchable memory.

Why this exists

The memex family uses LanceDB, but real-world datasets come in two shapes:

  • Legacy root-as-table: <db>/{_versions,_transactions,data} — produced by older Python-lance ad-hoc pipelines where db-path == table-path.
  • v2 db-with-subtables: <db>/<name>.lance/ — current layout used by rmcp-memex, rust-memex, aicx.

rmcp-memex 0.5.x reads only the v2 layout. pymemex reads both, so you can inspect, sample, and recover legacy data without touching the originals.
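
As a rough illustration of how the two layouts can be told apart, here is a minimal sketch that only looks at directory names. The function name `detect_layout`, the returned labels, and the choice of `_versions` and `data` as the minimum legacy markers are assumptions made for this example, not pymemex's actual detection logic.

```python
from pathlib import Path

# Assumed minimum markers of a legacy root-as-table dataset.
LEGACY_MARKERS = ("_versions", "data")


def detect_layout(db_path: Path) -> str:
    """Classify a directory as 'v2-subtable', 'legacy-root', or 'unknown'."""
    if not db_path.is_dir():
        return "unknown"
    # v2: the db directory holds one or more <name>.lance/ subdirectories.
    if any(p.is_dir() and p.suffix == ".lance" for p in db_path.iterdir()):
        return "v2-subtable"
    # legacy: the db directory is itself the table.
    if all((db_path / marker).is_dir() for marker in LEGACY_MARKERS):
        return "legacy-root"
    return "unknown"
```

In practice `pymemex inspect` reports the detected layout, so a check like this is only useful for scripting around the tool.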

Dark-matter data has a second, deeper problem: flat legacy tables have no concept of a namespace. rmcp-memex requires one on every chunk — it's how retrieval stays scoped per project / corpus / tenant. Orphaned Lance tables with no namespace field can't be imported as-is. pymemex bridges that gap: it reshapes legacy rows into the rmcp-memex schema and injects a namespace (static or per-row) on the way out, so previously unreachable memory becomes searchable.

Family

| Tool | Language | Role |
|---|---|---|
| rust-memex | Rust | Core semantic retrieval library (embedding + BM25 over LanceDB) |
| rmcp-memex | Rust | MCP server over rust-memex |
| aicx | Rust | Canonical corpus + MCP (session ingest) |
| pymemex (this) | Python | Introspection + legacy bridge |

Install

Dev install from source:

git clone https://github.com/Loctree/pymemex
cd pymemex
uv sync

Run via uv without activating a venv:

uv run pymemex --help

Usage — Introspection (v0.1)

Inspect a Lance dataset (auto-detects layout)

uv run pymemex inspect /path/to/lancedb

Works for both layouts. Shows:

  • detected layout (legacy-root vs v2-subtable)
  • subtable list (v2 only)
  • schema with vector column detection (dim)
  • row count, latest version
  • last 5 versions with row/size deltas

Sample rows (vectors hidden by default)

uv run pymemex sample /path/to/lancedb -n 5
uv run pymemex sample /path/to/lancedb -n 2 --show-vectors

List subtables (v2 layout)

uv run pymemex tables /path/to/lancedb

Usage — Recovery (v0.2)

Legacy Python-lance pipelines often left behind ad-hoc tables: a flat schema with id, vector, a text column like summary or content, plus a few metadata fields, and crucially no namespace. Such a dataset — for example, a legacy 24k-row corpus of agent-session summaries — is invisible to rmcp-memex even though every row is already embedded. pymemex recover reshapes those rows into the rmcp-memex chunk schema, injects a namespace, and emits JSONL ready for import.
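
To make the reshaping step concrete, the sketch below maps one flat legacy row onto the output keys listed under Programmatic API (id, namespace, text, vector, content_hash, metadata). The helper `reshape_row`, its parameters, and the use of a SHA-256 digest of the text as `content_hash` are assumptions for illustration; the README does not say how pymemex computes the hash.

```python
import hashlib
from typing import Any, Dict, Sequence


def reshape_row(
    row: Dict[str, Any],
    namespace: str,
    id_column: str = "id",
    text_column: str = "summary",
    vector_column: str = "vector",
    metadata_include: Sequence[str] = (),
) -> Dict[str, Any]:
    """Map a flat legacy row onto a chunk-shaped dict (hypothetical helper)."""
    text = row[text_column]
    return {
        "id": str(row[id_column]),
        "namespace": namespace,
        "text": text,
        "vector": row.get(vector_column),
        # Assumption: the hash is a SHA-256 hex digest of the UTF-8 text.
        "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        # Forward only the requested source columns that exist on this row.
        "metadata": {key: row[key] for key in metadata_include if key in row},
    }
```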

Basic command

pymemex recover /path/to/legacy-lancedb \
    --namespace proj:my-corpus \
    -o /tmp/corpus.jsonl

Column auto-detection picks the id column, the text column (text / summary / content / body), and the first fixed_size_list<float*> column as the vector. Override the id and text choices with --id-column and --text-column.

Pipe to rmcp-memex

rmcp-memex import --namespace proj:my-corpus --input /tmp/corpus.jsonl

After import, the previously dark-matter data becomes searchable:

rmcp-memex rag_search -n proj:my-corpus -q "..."

Options

| Flag | Purpose |
|---|---|
| --namespace | Static namespace injected on every row. |
| --namespace-from | Column name to use as per-row namespace (overrides --namespace). |
| --text-column | Explicit text column name. Defaults to auto-detection. |
| --id-column | Explicit id column name. Defaults to auto-detection. |
| --metadata-include (-m) | Repeatable. Source columns to forward into the metadata field. |
| --min-importance | Skip rows with importance < value (requires an importance column). |
| --permanent-only | Keep only rows where permanent is truthy. |
| --include-embeddings | Emit the vector field alongside text. Omit to re-embed downstream. |
| --dry-run | Count + stats only; no JSONL output. |
| -o / --output | Output path for JSONL. Default: stdout. |
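
The two row filters can be pictured as a simple generator over row dicts. This is a sketch under stated assumptions: `filter_rows` is a hypothetical name, and dropping rows whose importance is missing when a threshold is set is a choice made for this example, since the README only says the filter requires an importance column.

```python
from typing import Any, Dict, Iterable, Iterator, Optional


def filter_rows(
    rows: Iterable[Dict[str, Any]],
    min_importance: Optional[float] = None,
    permanent_only: bool = False,
) -> Iterator[Dict[str, Any]]:
    """Yield only rows that pass the importance and permanence filters."""
    for row in rows:
        if min_importance is not None:
            importance = row.get("importance")
            # Assumption: a missing importance value fails the threshold.
            if importance is None or importance < min_importance:
                continue
        if permanent_only and not row.get("permanent"):
            continue
        yield row
```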

Namespace injection strategies

  • Static (--namespace proj:my-corpus) — every emitted row gets the same namespace. Best for single-origin corpora.
  • Per-row (--namespace-from project) — read the namespace from a source column. Best when one legacy table spans multiple projects and each row carries its own label.

When both flags are present, --namespace-from wins and --namespace acts as the fallback for rows where the source column is null.
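
The precedence rule above can be sketched as a small function. The name `resolve_namespace` and the decision to raise when neither source yields a value are assumptions for illustration, not a description of pymemex internals.

```python
from typing import Any, Dict, Optional


def resolve_namespace(
    row: Dict[str, Any],
    static: Optional[str] = None,
    from_column: Optional[str] = None,
) -> str:
    """Pick the namespace for one row: per-row column first, static as fallback."""
    if from_column is not None:
        value = row.get(from_column)
        if value is not None:
            return str(value)
    if static is not None:
        return static
    # Assumption: a row with no resolvable namespace is treated as an error.
    raise ValueError("no namespace available for row")
```

With `static="proj:fallback"` and `from_column="project"`, a row carrying `project="alpha"` resolves to `alpha`, while a row whose `project` is null resolves to `proj:fallback`.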

Schema mapping

Auto-detection heuristics:

  • id: first match in ("id", "uuid", "_id")
  • text: first match in ("text", "summary", "content", "body")
  • vector: first column typed fixed_size_list<float*>

Override the id and text choices with --id-column and --text-column. If auto-detection guesses wrong, run pymemex inspect first to confirm the column names.
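
The heuristics listed above can be approximated in a few lines of plain Python. In this sketch the `Column` tuple, the `detect_columns` function, and the type-string format (for example `fixed_size_list<float32>[384]`) are all assumptions; it also reads "first match" as candidate priority order rather than schema order, which the README does not specify.

```python
from typing import Dict, NamedTuple, Optional, Sequence

ID_CANDIDATES = ("id", "uuid", "_id")
TEXT_CANDIDATES = ("text", "summary", "content", "body")


class Column(NamedTuple):
    name: str
    arrow_type: str  # assumed textual form of the Arrow type


def first_match(names: Sequence[str], candidates: Sequence[str]) -> Optional[str]:
    """Return the highest-priority candidate present in the schema, if any."""
    for candidate in candidates:
        if candidate in names:
            return candidate
    return None


def detect_columns(schema: Sequence[Column]) -> Dict[str, Optional[str]]:
    """Guess the id, text, and vector columns from a list of columns."""
    names = [col.name for col in schema]
    vector = next(
        (col.name for col in schema if col.arrow_type.startswith("fixed_size_list<float")),
        None,
    )
    return {
        "id": first_match(names, ID_CANDIDATES),
        "text": first_match(names, TEXT_CANDIDATES),
        "vector": vector,
    }
```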

Programmatic API

from pathlib import Path
from pymemex import open_dataset, inspect_columns, row_count

info = open_dataset(Path("/path/to/lancedb"))
print(info.layout, info.table_name, row_count(info))
for col in inspect_columns(info):
    print(col.name, col.arrow_type, col.is_vector, col.vector_dim)

Recovery is usable as a library, not just a CLI:

from pathlib import Path

from pymemex import RecoveryConfig, recover

config = RecoveryConfig(namespace="proj:my-corpus", text_column="summary")
for row in recover(Path("/path/to/legacy-lancedb"), config):
    # Each row is a dict with keys: id, namespace, text, vector, content_hash, metadata
    ...

All helpers are read-only. pymemex never mutates the source dataset.
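
When using the library instead of the CLI, the rows still have to be serialized before rmcp-memex can import them. The sketch below writes any iterable of row dicts as JSONL using only the standard library. `write_jsonl` is a hypothetical helper, and dropping the `vector` key when embeddings are not requested is an assumption about the output shape modeled on the --include-embeddings flag.

```python
import json
from typing import IO, Any, Dict, Iterable


def write_jsonl(
    rows: Iterable[Dict[str, Any]],
    out: IO[str],
    include_embeddings: bool = False,
) -> int:
    """Write one JSON object per line and return the number of rows written."""
    count = 0
    for row in rows:
        record = dict(row)  # copy so the caller's row is left untouched
        if not include_embeddings:
            # Assumption: omit the vector so it can be re-embedded downstream.
            record.pop("vector", None)
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count
```

The iterable could be the generator returned by `recover(...)`, with `out` being an open text file or `sys.stdout`.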

Roadmap

  • inspect — layout detection, schema, version summary
  • sample — pretty sample (vectors hidden)
  • tables — subtable listing
  • recover — legacy → rmcp-memex JSONL with namespace injection
  • stats — per-column statistics (null %, unique %, distributions)
  • simplify — chunking + dedup
  • map — YAML rules engine
  • embed — optional re-embedding via pluggable endpoint
  • pipe — one-shot recover → simplify → embed → import

License

BUSL-1.1 (see LICENSE). Change License: Apache 2.0 on 2030-04-15.


Vibecrafted. with AI Agents by VetCoders (c)2024-2026 The LibraxisAI Team
