
Commit 2ada47f

AlexMikhalev and claude committed
docs: update handover and lessons learned with real model inference session
Add findings from real MedGemma 4B GGUF inference on CPU: PEP 668 venv requirements, persistent subprocess IPC pattern, MEDGEMMA_PYTHON env var for venv discovery, and 3-level backend fallback chain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a5828f6 commit 2ada47f

File tree

2 files changed: +98 lines, -8 lines


HANDOVER.md

Lines changed: 43 additions & 8 deletions
@@ -2,17 +2,19 @@
 
 **Date**: 2026-02-22
 **Project**: MedGemma Competition - Terraphim-AI Crate Integration
-**Status**: E2E Verified (49/49 checks pass)
+**Status**: Real Model E2E Verified (10/10 eval cases, 95% grounding, 0 safety failures)
 **Handover To**: Development Team / Maintainers
 
 ---
 
 ## Executive Summary
 
 Completed full migration of medgemma-competition from standalone reimplementations to
-shared terraphim-ai crates behind `medical` feature flags. The pipeline is end-to-end
-verified: knowledge graphs populate and traverse correctly, entity extraction works,
-UMLS artifacts load, PGx validation runs, and the full 6-step consultation completes.
+shared terraphim-ai crates behind `medical` feature flags, then proved end-to-end with
+real local MedGemma 4B model inference on CPU. The pipeline is fully verified: knowledge
+graphs populate and traverse correctly, entity extraction works, UMLS artifacts load,
+PGx validation runs, real GGUF model inference completes for all 10 scenarios, and the
+3-gate evaluation harness produces solid reports (10/10 pass, 0 safety failures).
 
 **Two repositories involved:**
 - `terraphim/terraphim-ai` - upstream crate library (PR #551, branch `medical-extensions`)
@@ -58,6 +60,17 @@ UMLS artifacts load, PGx validation runs, and the full 6-step consultation compl
 - 49-check comprehensive pipeline example: `crates/terraphim-demo/examples/e2e_pipeline.rs`
 - Commit: `2e317ff`
 
+### Real Model Inference E2E
+- Wired `LocalMedGemmaClient` into demo.rs backend fallback (Proxy -> GGUF -> Mock)
+- Added `MEDGEMMA_PYTHON` env var for venv support via `resolve_python_binary()`
+- Created persistent Python GGUF server (`scripts/medgemma_server.py`) - loads 2.3GB model once
+- Added `run_with_local_model()` to evaluation_runner.rs (was mock-only before)
+- Comprehensive `e2e_real_model.rs` example: data loading, KG (36 nodes), entity extraction,
+  PGx validation, real model inference (10 cases, ~96s/case on CPU), 3-gate evaluation
+- Result: 10/10 cases passed, 0 safety failures, 95% avg grounding score
+- Reports: JSON + Markdown in `tests/evaluation/output/`
+- Commit: `a5828f6`
+
 ---
 
## Current State
@@ -71,8 +84,9 @@ UMLS artifacts load, PGx validation runs, and the full 6-step consultation compl
 - **Related issue**: #549
 
 ### medgemma-competition (main)
-- All migration commits on `main`, pushed to origin
+- All migration commits on `main` (1 commit ahead of origin)
 - E2E pipeline passes 49/49 checks
+- Real model E2E passes 10/10 eval cases with 95% grounding score
 - Open PR #38: Clinical Trial Protocol Parser (separate feature, by Kimiko)
 - Open issues: #33 (Meta-Cortex), #34 (Pre-serialized artifacts - partially done), #35 (SNOMED download)
 
@@ -152,18 +166,39 @@ Once terraphim-ai main has the medical feature, medgemma-competition can optiona
 | `crates/terraphim-medical-agents/Cargo.toml` | Agent infra deps |
 | `crates/terraphim-medical-agents/src/lib.rs` | Removed mailbox/router/supervisor modules |
 | `crates/terraphim-medical-agents/src/agents/role_graph_search.rs` | MedicalRoleGraph API |
+| `crates/terraphim-demo/src/demo.rs` | LocalMedGemmaClient fallback chain |
+| `crates/terraphim-demo/examples/e2e_real_model.rs` | NEW - real model e2e proof (10 scenarios) |
+| `crates/medgemma-client/src/local_inference.rs` | resolve_python_binary(), MEDGEMMA_PYTHON |
+| `crates/medgemma-client/src/lib.rs` | Export resolve_python_binary |
+| `crates/terraphim-evaluation/src/bin/evaluation_runner.rs` | Real model support via LocalMedGemmaClient |
+| `scripts/medgemma_server.py` | NEW - persistent GGUF inference server |
 
 ---
 
-## Running the E2E Test
+## Running the E2E Tests
 
+### Pipeline verification (no model required)
 ```bash
-cd /home/alex/projects/terraphim/medgemma-competition
 cargo run --example e2e_pipeline --package terraphim-demo
 ```
-
 Expected: 49 passed, 0 failed (~17s total, dominated by UMLS artifact load)
 
+### Real model inference (requires Python venv)
+```bash
+# First time setup
+python3 -m venv .venv
+.venv/bin/pip install llama-cpp-python huggingface-hub
+
+# Run (downloads 2.3GB model on first run)
+MEDGEMMA_PYTHON=.venv/bin/python3 cargo run --release --example e2e_real_model --package terraphim-demo
+```
+Expected: 10/10 eval cases pass, ~16min on CPU, reports in `tests/evaluation/output/`
+
+### Evaluation harness with real model
+```bash
+MEDGEMMA_PYTHON=.venv/bin/python3 cargo run --release --bin evaluation-runner --package terraphim-evaluation
+```
+
 ---
 
## Known Issues

lessons-learned.md

Lines changed: 55 additions & 0 deletions
@@ -100,3 +100,58 @@ artifact. Without artifacts, every cold start is a 14-minute wait. The artifact
 Representing each node as (ancestors, descendants, depth) and computing Jaccard similarity
 produces ontologically meaningful scores: NSCLC/SCLC (siblings) score 1.0, NSCLC/Breast (cousins)
 score 0.62, NSCLC/Lung Cancer (parent-child) score 0.43. No vector database needed.
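The ancestor-set idea is compact enough to sketch. This is an illustrative reimplementation, not the project's code; the 1.0/0.62/0.43 figures come from the real UMLS hierarchy, and the toy ancestor sets below are invented.

```python
def concept_similarity(ancestors_a, ancestors_b):
    """Jaccard similarity over ancestor sets (illustrative sketch).

    The project's nodes also carry descendants and depth; plain ancestor
    Jaccard is enough to show why siblings score 1.0: they share every
    ancestor, so the intersection equals the union.
    """
    a, b = set(ancestors_a), set(ancestors_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Toy hierarchy: two siblings under the same chain of parents
nsclc = {"neoplasm", "lung neoplasm", "lung carcinoma"}
sclc = {"neoplasm", "lung neoplasm", "lung carcinoma"}
print(concept_similarity(nsclc, sclc))  # 1.0
```

A parent scores below 1.0 against its child because the child's ancestor set strictly contains the parent's, which matches the 0.43 parent-child figure being the lowest of the three.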
+
+---
+
+## Session 2: Real Model Inference E2E (2026-02-22)
+
+**Scope**: Proving end-to-end pipeline with real MedGemma 4B GGUF model on CPU
+
+### Technical Discoveries
+
+#### PEP 668 blocks system-wide pip install on modern Linux
+Ubuntu/Debian now mark system Python as "externally managed" (PEP 668), so `pip3 install` fails
+with "externally-managed-environment". Solution: always use a project-local venv (`python3 -m venv
+.venv`) and install there. Add `.venv/` to `.gitignore`.
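The condition can be detected programmatically before any install attempt: PEP 668 specifies an `EXTERNALLY-MANAGED` marker file in the interpreter's stdlib directory. A minimal sketch (`externally_managed` is a hypothetical helper, not part of the project):

```python
import sysconfig
from pathlib import Path

def externally_managed() -> bool:
    """True if this interpreter carries the PEP 668 marker file.

    pip refuses system-wide installs into such an environment; a venv
    created with `python3 -m venv` never has the marker, so installs
    inside it always work.
    """
    stdlib = Path(sysconfig.get_path("stdlib"))
    return (stdlib / "EXTERNALLY-MANAGED").exists()

if externally_managed():
    print("use a project-local venv: python3 -m venv .venv")
else:
    print("system-wide pip install is allowed here")
```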
+
+#### GGUF inference on CPU is viable but slow
+MedGemma 4B Q4_K_M (2.3GB GGUF) loads in ~42s and takes ~96s to generate each clinical scenario
+response on CPU. Total wall time for 10 cases is ~16 minutes: viable for evaluation/CI, but not
+for interactive use. The model download (~2.3GB from HuggingFace) happens only on the first run
+and is cached after that.
+
+#### Persistent subprocess beats per-call subprocess for model inference
+`LocalMedGemmaClient` spawns a new Python process per call, reloading the 2.3GB model each time
+(~42s load + ~96s generation). For 10 cases that's ~23 minutes of wall time, with ~6 minutes
+spent on redundant loads. The persistent server approach (load once, stdin/stdout JSON-lines
+protocol) eliminates 9 of the 10 loads and brings the run down to ~16 minutes.
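The arithmetic, using the measured figures from this session (~42s load, ~96s per generation, 10 cases):

```python
LOAD_S, GEN_S, CASES = 42, 96, 10  # measured figures from this session

per_call = CASES * (LOAD_S + GEN_S)  # fresh subprocess per case: reload every time
persistent = LOAD_S + CASES * GEN_S  # server loads once, then only generates

print(f"per-call:   {per_call // 60} min {per_call % 60} s")      # 23 min 0 s
print(f"persistent: {persistent // 60} min {persistent % 60} s")  # 16 min 42 s
saved = per_call - persistent
print(f"saved: {saved} s ({100 * saved // per_call}%)")           # 378 s (27%)
```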
+
+#### MEDGEMMA_PYTHON env var solves the venv discovery problem
+When packages are installed in a venv but the Rust code calls `python3` (which resolves to system
+Python without the packages), inference fails. Rather than hardcoding venv paths, the
+`MEDGEMMA_PYTHON` env var lets users point to any Python binary with the right packages installed.
+This is more flexible than `.venv/bin/python3` assumptions.
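A sketch of the resolution order in Python. The real `resolve_python_binary()` lives in Rust in `crates/medgemma-client`; only the env-var step is documented here, so the `.venv` and PATH fallbacks below are assumptions for illustration.

```python
import os
import shutil

def resolve_python_binary() -> str:
    """Pick the Python interpreter to run GGUF inference with.

    Assumed order: explicit MEDGEMMA_PYTHON env var, then a
    project-local venv, then whatever `python3` is on PATH.
    """
    explicit = os.environ.get("MEDGEMMA_PYTHON")
    if explicit:
        return explicit
    venv_python = ".venv/bin/python3"
    if os.path.exists(venv_python):
        return venv_python
    return shutil.which("python3") or "python3"

os.environ["MEDGEMMA_PYTHON"] = "/opt/ml/.venv/bin/python3"  # hypothetical path
print(resolve_python_binary())  # /opt/ml/.venv/bin/python3
```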
+
+### Pitfalls to Avoid
+
+#### Don't assume system Python has your packages
+On modern Linux, system Python may not even allow package installation. Always check with
+`python3 -c "import llama_cpp"` before assuming the package is available. Better yet, provide
+a configurable Python binary path.
+
+#### Dead code warnings from renamed struct fields
+Renaming a struct field from `load_time_s` to `_load_time_s` (to suppress unused warnings) requires
+updating the constructor too: `_load_time_s: load_time_s`. Easy to miss when the original variable
+and the field had the same name.
+
+### Best Practices Discovered
+
+#### stdin/stdout JSON-lines is the simplest IPC for model inference
+A Python subprocess that reads JSON requests from stdin and writes JSON responses to stdout
+(one per line, flushed) is simpler than HTTP servers, Unix sockets, or gRPC. No port conflicts,
+no connection management, no serialization framework dependencies. The parent process just writes
+a line and reads a line. Use `flush=True` in Python's print() to avoid buffering deadlocks.
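The protocol is small enough to show whole. A minimal sketch in the spirit of `scripts/medgemma_server.py`; the `ok`/`result`/`error` field names are invented for illustration, and the lambda stands in for the real llama-cpp-python model call:

```python
import json
import sys

def handle_line(line, handle):
    """Turn one JSON-lines request into one JSON-lines response string."""
    try:
        request = json.loads(line)
        return json.dumps({"ok": True, "result": handle(request)})
    except Exception as exc:  # report errors in-band so the loop never dies
        return json.dumps({"ok": False, "error": str(exc)})

def serve(handle):
    """Read one JSON request per stdin line, write one JSON response per
    stdout line. flush=True is essential: without it the parent blocks
    forever on a buffered pipe."""
    for line in sys.stdin:
        if line.strip():
            print(handle_line(line, handle), flush=True)

# One request through the handler (a real server would call serve(model_fn)):
print(handle_line('{"prompt": "summarize"}', lambda req: req["prompt"].upper()))
```

The parent side is symmetric: write one line to the child's stdin, then block on reading one line from its stdout.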
+
+#### Three-level backend fallback chain works well
+The pattern Proxy -> Local GGUF -> Mock gives maximum flexibility: production uses the proxy,
+development uses the local model if available, tests use the mock. The `check_gguf_available()`
+function that tries importing the Python package is cheap (~100ms) and reliable.
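A sketch of the selection logic. The real chain is wired in Rust in `crates/terraphim-demo/src/demo.rs`; the `MEDGEMMA_PROXY_URL` variable below is an invented stand-in for however proxy availability is detected, and `check_gguf_available()` is modeled on the description above:

```python
import os
import shutil
import subprocess

def check_gguf_available(python_bin: str) -> bool:
    """Cheap probe (~100ms per the notes): can this interpreter import llama_cpp?"""
    try:
        proc = subprocess.run(
            [python_bin, "-c", "import llama_cpp"],
            capture_output=True, timeout=15,
        )
        return proc.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False

def pick_backend() -> str:
    """Proxy -> Local GGUF -> Mock: try the most capable backend first."""
    if os.environ.get("MEDGEMMA_PROXY_URL"):  # invented variable name
        return "proxy"
    python_bin = os.environ.get("MEDGEMMA_PYTHON", "python3")
    resolved = python_bin if os.path.exists(python_bin) else shutil.which(python_bin)
    if resolved and check_gguf_available(resolved):
        return "local-gguf"
    return "mock"  # tests always have this to fall back on
```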
