
Commit 2ada47f

AlexMikhalev and claude committed
docs: update handover and lessons learned with real model inference session
Add findings from real MedGemma 4B GGUF inference on CPU: PEP 668 venv requirements, persistent subprocess IPC pattern, MEDGEMMA_PYTHON env var for venv discovery, and 3-level backend fallback chain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a5828f6 commit 2ada47f

File tree

2 files changed: +98 lines, -8 lines


HANDOVER.md

Lines changed: 43 additions & 8 deletions
@@ -2,17 +2,19 @@
 
 **Date**: 2026-02-22
 **Project**: MedGemma Competition - Terraphim-AI Crate Integration
-**Status**: E2E Verified (49/49 checks pass)
+**Status**: Real Model E2E Verified (10/10 eval cases, 95% grounding, 0 safety failures)
 **Handover To**: Development Team / Maintainers
 
 ---
 
 ## Executive Summary
 
 Completed full migration of medgemma-competition from standalone reimplementations to
-shared terraphim-ai crates behind `medical` feature flags. The pipeline is end-to-end
-verified: knowledge graphs populate and traverse correctly, entity extraction works,
-UMLS artifacts load, PGx validation runs, and the full 6-step consultation completes.
+shared terraphim-ai crates behind `medical` feature flags, then proved end-to-end with
+real local MedGemma 4B model inference on CPU. The pipeline is fully verified: knowledge
+graphs populate and traverse correctly, entity extraction works, UMLS artifacts load,
+PGx validation runs, real GGUF model inference completes for all 10 scenarios, and the
+3-gate evaluation harness produces solid reports (10/10 pass, 0 safety failures).
 
 **Two repositories involved:**
 - `terraphim/terraphim-ai` - upstream crate library (PR #551, branch `medical-extensions`)
@@ -58,6 +60,17 @@ UMLS artifacts load, PGx validation runs, and the full 6-step consultation compl
 - 49-check comprehensive pipeline example: `crates/terraphim-demo/examples/e2e_pipeline.rs`
 - Commit: `2e317ff`
 
+### Real Model Inference E2E
+- Wired `LocalMedGemmaClient` into demo.rs backend fallback (Proxy -> GGUF -> Mock)
+- Added `MEDGEMMA_PYTHON` env var for venv support via `resolve_python_binary()`
+- Created persistent Python GGUF server (`scripts/medgemma_server.py`) - loads 2.3GB model once
+- Added `run_with_local_model()` to evaluation_runner.rs (was mock-only before)
+- Comprehensive `e2e_real_model.rs` example: data loading, KG (36 nodes), entity extraction,
+  PGx validation, real model inference (10 cases, ~96s/case on CPU), 3-gate evaluation
+- Result: 10/10 cases passed, 0 safety failures, 95% avg grounding score
+- Reports: JSON + Markdown in `tests/evaluation/output/`
+- Commit: `a5828f6`
+
 ---
 
## Current State
@@ -71,8 +84,9 @@ UMLS artifacts load, PGx validation runs, and the full 6-step consultation compl
 - **Related issue**: #549
 
 ### medgemma-competition (main)
-- All migration commits on `main`, pushed to origin
+- All migration commits on `main` (1 commit ahead of origin)
 - E2E pipeline passes 49/49 checks
+- Real model E2E passes 10/10 eval cases with 95% grounding score
 - Open PR #38: Clinical Trial Protocol Parser (separate feature, by Kimiko)
 - Open issues: #33 (Meta-Cortex), #34 (Pre-serialized artifacts - partially done), #35 (SNOMED download)
 
@@ -152,18 +166,39 @@ Once terraphim-ai main has the medical feature, medgemma-competition can optiona
 | `crates/terraphim-medical-agents/Cargo.toml` | Agent infra deps |
 | `crates/terraphim-medical-agents/src/lib.rs` | Removed mailbox/router/supervisor modules |
 | `crates/terraphim-medical-agents/src/agents/role_graph_search.rs` | MedicalRoleGraph API |
+| `crates/terraphim-demo/src/demo.rs` | LocalMedGemmaClient fallback chain |
+| `crates/terraphim-demo/examples/e2e_real_model.rs` | NEW - real model e2e proof (10 scenarios) |
+| `crates/medgemma-client/src/local_inference.rs` | resolve_python_binary(), MEDGEMMA_PYTHON |
+| `crates/medgemma-client/src/lib.rs` | Export resolve_python_binary |
+| `crates/terraphim-evaluation/src/bin/evaluation_runner.rs` | Real model support via LocalMedGemmaClient |
+| `scripts/medgemma_server.py` | NEW - persistent GGUF inference server |
 
 ---
 
-## Running the E2E Test
+## Running the E2E Tests
 
+### Pipeline verification (no model required)
 ```bash
-cd /home/alex/projects/terraphim/medgemma-competition
 cargo run --example e2e_pipeline --package terraphim-demo
 ```
-
 Expected: 49 passed, 0 failed (~17s total, dominated by UMLS artifact load)
 
+### Real model inference (requires Python venv)
+```bash
+# First time setup
+python3 -m venv .venv
+.venv/bin/pip install llama-cpp-python huggingface-hub
+
+# Run (downloads 2.3GB model on first run)
+MEDGEMMA_PYTHON=.venv/bin/python3 cargo run --release --example e2e_real_model --package terraphim-demo
+```
+Expected: 10/10 eval cases pass, ~16min on CPU, reports in `tests/evaluation/output/`
+
+### Evaluation harness with real model
+```bash
+MEDGEMMA_PYTHON=.venv/bin/python3 cargo run --release --bin evaluation-runner --package terraphim-evaluation
+```
+
 ---
 
## Known Issues

lessons-learned.md

Lines changed: 55 additions & 0 deletions
@@ -100,3 +100,58 @@ artifact. Without artifacts, every cold start is a 14-minute wait. The artifact
 Representing each node as (ancestors, descendants, depth) and computing Jaccard similarity
 produces ontologically meaningful scores: NSCLC/SCLC (siblings) score 1.0, NSCLC/Breast (cousins)
 score 0.62, NSCLC/Lung Cancer (parent-child) score 0.43. No vector database needed.
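The ancestor-set idea is compact enough to sketch. This is an illustrative reimplementation, not the project's code; the 1.0/0.62/0.43 figures come from the real UMLS hierarchy, and the toy ancestor sets below are invented.

```python
def concept_similarity(ancestors_a, ancestors_b):
    """Jaccard similarity over ancestor sets (illustrative sketch).

    The project's nodes also carry descendants and depth; plain ancestor
    Jaccard is enough to show why siblings score 1.0: they share every
    ancestor, so the intersection equals the union.
    """
    a, b = set(ancestors_a), set(ancestors_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Toy hierarchy: two siblings under the same chain of parents
nsclc = {"neoplasm", "lung neoplasm", "lung carcinoma"}
sclc = {"neoplasm", "lung neoplasm", "lung carcinoma"}
print(concept_similarity(nsclc, sclc))  # 1.0
```

A parent scores below 1.0 against its child because the child's ancestor set strictly contains the parent's, which matches the 0.43 parent-child figure being the lowest of the three.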
+
+---
+
+## Session 2: Real Model Inference E2E (2026-02-22)
+
+**Scope**: Proving end-to-end pipeline with real MedGemma 4B GGUF model on CPU
+
+### Technical Discoveries
+
+#### PEP 668 blocks system-wide pip install on modern Linux
+Ubuntu/Debian now mark system Python as "externally managed" (PEP 668), so `pip3 install` fails
+with "externally-managed-environment". Solution: always use a project-local venv (`python3 -m venv
+.venv`) and install there. Add `.venv/` to `.gitignore`.
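The condition can be detected programmatically before any install attempt: PEP 668 specifies an `EXTERNALLY-MANAGED` marker file in the interpreter's stdlib directory. A minimal sketch (`externally_managed` is a hypothetical helper, not part of the project):

```python
import sysconfig
from pathlib import Path

def externally_managed() -> bool:
    """True if this interpreter carries the PEP 668 marker file.

    pip refuses system-wide installs into such an environment; a venv
    created with `python3 -m venv` never has the marker, so installs
    inside it always work.
    """
    stdlib = Path(sysconfig.get_path("stdlib"))
    return (stdlib / "EXTERNALLY-MANAGED").exists()

if externally_managed():
    print("use a project-local venv: python3 -m venv .venv")
else:
    print("system-wide pip install is allowed here")
```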
+
+#### GGUF inference on CPU is viable but slow
+MedGemma 4B Q4_K_M (2.3GB GGUF) loads in ~42s and takes ~96s to generate each clinical scenario
+response on CPU. Total wall time for 10 cases is ~16 minutes: viable for evaluation/CI, but not
+for interactive use. The model download (~2.3GB from HuggingFace) happens only on the first run
+and is cached after that.
+
+#### Persistent subprocess beats per-call subprocess for model inference
+`LocalMedGemmaClient` spawns a new Python process per call, reloading the 2.3GB model each time
+(~42s load + ~96s generation). For 10 cases that's ~23 minutes of wall time, with ~6 minutes
+spent on redundant loads. The persistent server approach (load once, stdin/stdout JSON-lines
+protocol) eliminates 9 of the 10 loads and brings the run down to ~16 minutes.
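The arithmetic, using the measured figures from this session (~42s load, ~96s per generation, 10 cases):

```python
LOAD_S, GEN_S, CASES = 42, 96, 10  # measured figures from this session

per_call = CASES * (LOAD_S + GEN_S)  # fresh subprocess per case: reload every time
persistent = LOAD_S + CASES * GEN_S  # server loads once, then only generates

print(f"per-call:   {per_call // 60} min {per_call % 60} s")      # 23 min 0 s
print(f"persistent: {persistent // 60} min {persistent % 60} s")  # 16 min 42 s
saved = per_call - persistent
print(f"saved: {saved} s ({100 * saved // per_call}%)")           # 378 s (27%)
```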
+
+#### MEDGEMMA_PYTHON env var solves the venv discovery problem
+When packages are installed in a venv but the Rust code calls `python3` (which resolves to system
+Python without the packages), inference fails. Rather than hardcoding venv paths, the
+`MEDGEMMA_PYTHON` env var lets users point to any Python binary with the right packages installed.
+This is more flexible than `.venv/bin/python3` assumptions.
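A sketch of the resolution order in Python. The real `resolve_python_binary()` lives in Rust in `crates/medgemma-client`; only the env-var step is documented here, so the `.venv` and PATH fallbacks below are assumptions for illustration.

```python
import os
import shutil

def resolve_python_binary() -> str:
    """Pick the Python interpreter to run GGUF inference with.

    Assumed order: explicit MEDGEMMA_PYTHON env var, then a
    project-local venv, then whatever `python3` is on PATH.
    """
    explicit = os.environ.get("MEDGEMMA_PYTHON")
    if explicit:
        return explicit
    venv_python = ".venv/bin/python3"
    if os.path.exists(venv_python):
        return venv_python
    return shutil.which("python3") or "python3"

os.environ["MEDGEMMA_PYTHON"] = "/opt/ml/.venv/bin/python3"  # hypothetical path
print(resolve_python_binary())  # /opt/ml/.venv/bin/python3
```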
+
+### Pitfalls to Avoid
+
+#### Don't assume system Python has your packages
+On modern Linux, system Python may not even allow package installation. Always check with
+`python3 -c "import llama_cpp"` before assuming the package is available. Better yet, provide
+a configurable Python binary path.
+
+#### Dead code warnings from renamed struct fields
+Renaming a struct field from `load_time_s` to `_load_time_s` (to suppress unused warnings) requires
+updating the constructor too: `_load_time_s: load_time_s`. Easy to miss when the original variable
+and the field had the same name.
+
+### Best Practices Discovered
+
+#### stdin/stdout JSON-lines is the simplest IPC for model inference
+A Python subprocess that reads JSON requests from stdin and writes JSON responses to stdout
+(one per line, flushed) is simpler than HTTP servers, Unix sockets, or gRPC. No port conflicts,
+no connection management, no serialization framework dependencies. The parent process just writes
+a line and reads a line. Use `flush=True` in Python's print() to avoid buffering deadlocks.
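The protocol is small enough to show whole. A minimal sketch in the spirit of `scripts/medgemma_server.py`; the `ok`/`result`/`error` field names are invented for illustration, and the lambda stands in for the real llama-cpp-python model call:

```python
import json
import sys

def handle_line(line, handle):
    """Turn one JSON-lines request into one JSON-lines response string."""
    try:
        request = json.loads(line)
        return json.dumps({"ok": True, "result": handle(request)})
    except Exception as exc:  # report errors in-band so the loop never dies
        return json.dumps({"ok": False, "error": str(exc)})

def serve(handle):
    """Read one JSON request per stdin line, write one JSON response per
    stdout line. flush=True is essential: without it the parent blocks
    forever on a buffered pipe."""
    for line in sys.stdin:
        if line.strip():
            print(handle_line(line, handle), flush=True)

# One request through the handler (a real server would call serve(model_fn)):
print(handle_line('{"prompt": "summarize"}', lambda req: req["prompt"].upper()))
```

The parent side is symmetric: write one line to the child's stdin, then block on reading one line from its stdout.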
+
+#### Three-level backend fallback chain works well
+The pattern Proxy -> Local GGUF -> Mock gives maximum flexibility: production uses the proxy,
+development uses the local model if available, tests use the mock. The `check_gguf_available()`
+function that tries importing the Python package is cheap (~100ms) and reliable.
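A sketch of the selection logic. The real chain is wired in Rust in `crates/terraphim-demo/src/demo.rs`; the `MEDGEMMA_PROXY_URL` variable below is an invented stand-in for however proxy availability is detected, and `check_gguf_available()` is modeled on the description above:

```python
import os
import shutil
import subprocess

def check_gguf_available(python_bin: str) -> bool:
    """Cheap probe (~100ms per the notes): can this interpreter import llama_cpp?"""
    try:
        proc = subprocess.run(
            [python_bin, "-c", "import llama_cpp"],
            capture_output=True, timeout=15,
        )
        return proc.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False

def pick_backend() -> str:
    """Proxy -> Local GGUF -> Mock: try the most capable backend first."""
    if os.environ.get("MEDGEMMA_PROXY_URL"):  # invented variable name
        return "proxy"
    python_bin = os.environ.get("MEDGEMMA_PYTHON", "python3")
    resolved = python_bin if os.path.exists(python_bin) else shutil.which(python_bin)
    if resolved and check_gguf_available(resolved):
        return "local-gguf"
    return "mock"  # tests always have this to fall back on
```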
