Calibration Pipeline User's Manual

The unified calibration pipeline reweights cloned CPS records to match administrative targets using L0-regularized optimization. This guide covers the main workflows: lightweight build-then-fit, full pipeline with PUF, and fitting from a saved package.

This is the current production calibration path. The older national-only Enhanced CPS path (make data-legacy) remains in the repo for legacy reproduction and uses a separate EnhancedCPS / build_loss_matrix() flow.

Quick Start

# Build matrix only from stratified CPS (no PUF, no re-imputation):
python -m policyengine_us_data.calibration.unified_calibration \
  --target-config policyengine_us_data/calibration/target_config.yaml \
  --skip-source-impute \
  --skip-takeup-rerandomize \
  --build-only

# Fit weights from a saved package:
python -m policyengine_us_data.calibration.unified_calibration \
  --package-path storage/calibration/calibration_package.pkl \
  --epochs 500 --device cuda

# Resume a previous fit for 500 more epochs:
python -m policyengine_us_data.calibration.unified_calibration \
  --package-path storage/calibration/calibration_package.pkl \
  --resume-from storage/calibration/calibration_weights.npy \
  --epochs 500 --device cuda

# Full pipeline with PUF (build + fit in one shot):
make calibrate

Architecture Overview

The pipeline has two phases:

  1. Matrix build: Clone CPS records, assign geography, compute all target variable values, assemble a sparse calibration matrix. Optionally includes PUF cloning (doubles record count) and source re-imputation.
  2. Weight fitting (~5-20 min on GPU): L0-regularized optimization to find household weights that reproduce administrative targets.

The calibration package checkpoint lets you run phase 1 once and iterate on phase 2 with different hyperparameters or target selections---without rebuilding.

Prerequisites

The matrix build requires two inputs from the data pipeline:

  • Stratified CPS (storage/stratified_extended_cps_2024.h5): ~12K households, built by make data. This is the base dataset that gets cloned.
  • Target database (storage/calibration/policy_data.db): Administrative targets, built by make database.

Both must exist before running calibration. The stratified CPS already contains all CPS variables needed for calibration; PUF cloning and source re-imputation are optional enhancements that happen at calibration time.
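A quick way to confirm both inputs are in place before kicking off a build (paths are the defaults listed above):

ls -lh storage/stratified_extended_cps_2024.h5 \
       storage/calibration/policy_data.db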

Workflows

1. Lightweight build-then-fit (recommended for iteration)

Build the matrix from the stratified CPS without PUF cloning or re-imputation. This is the fastest way to get a calibration package for experimentation.

Step 1: Build the matrix (~12K base records x 430 clones = ~5.2M columns).

python -m policyengine_us_data.calibration.unified_calibration \
  --target-config policyengine_us_data/calibration/target_config.yaml \
  --skip-source-impute \
  --skip-takeup-rerandomize \
  --build-only

This saves storage/calibration/calibration_package.pkl (default location). Use --package-output to specify a different path.

Step 2: Fit weights from the package (fast, repeatable).

python -m policyengine_us_data.calibration.unified_calibration \
  --package-path storage/calibration/calibration_package.pkl \
  --epochs 1000 \
  --lambda-l0 1e-8 \
  --beta 0.65 \
  --lambda-l2 1e-8 \
  --device cuda

You can re-run Step 2 as many times as you want with different hyperparameters. The expensive matrix build only happens once.
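For example, a small sweep over the L0 penalty that reuses the same package (the --output naming scheme here is just illustrative):

for l0 in 1e-8 1e-6 1e-4; do
  python -m policyengine_us_data.calibration.unified_calibration \
    --package-path storage/calibration/calibration_package.pkl \
    --epochs 200 \
    --lambda-l0 $l0 \
    --device cuda \
    --output storage/calibration/weights_l0_${l0}.npy
done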

Every fit now also writes a checkpoint next to the weights output (calibration_weights.checkpoint.pt by default). To continue the same fit, pass --resume-from with the weights file or checkpoint path. If a sibling checkpoint exists next to the weights file, it is used automatically so the L0 gate state is restored as well.

python -m policyengine_us_data.calibration.unified_calibration \
  --package-path storage/calibration/calibration_package.pkl \
  --epochs 2000 \
  --beta 0.65 \
  --lambda-l0 1e-4 \
  --lambda-l2 1e-12 \
  --log-freq 500 \
  --target-config policyengine_us_data/calibration/target_config.yaml \
  --device cpu \
  --output policyengine_us_data/storage/calibration/national/weights.npy \
  --resume-from policyengine_us_data/storage/calibration/national/weights.npy

When --resume-from points to a checkpoint, --epochs means additional epochs to run beyond the saved checkpoint epoch count. If only a .npy weights file exists, the run warm-starts from those weights.

2. Full pipeline with PUF

Adding --puf-dataset doubles the record count (~24K base records x 430 clones = ~10.3M columns) by creating PUF-imputed copies of every CPS record. This also triggers source re-imputation unless skipped.

Single-pass (build + fit):

python -m policyengine_us_data.calibration.unified_calibration \
  --puf-dataset policyengine_us_data/storage/puf_2024.h5 \
  --target-config policyengine_us_data/calibration/target_config.yaml \
  --epochs 200 \
  --device cuda

Or equivalently: make calibrate

Output:

  • storage/calibration/calibration_weights.npy --- calibrated weight vector
  • storage/calibration/unified_diagnostics.csv --- per-target error report
  • storage/calibration/unified_run_config.json --- full run configuration
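To eyeball the per-target error report after a run, pandas is enough (print the columns rather than assuming names; the exact layout is whatever the pipeline wrote):

import pandas as pd

diag = pd.read_csv("storage/calibration/unified_diagnostics.csv")
print(diag.columns.tolist())  # inspect available columns first
print(diag.head())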

Build-only (save package for later fitting):

python -m policyengine_us_data.calibration.unified_calibration \
  --puf-dataset policyengine_us_data/storage/puf_2024.h5 \
  --target-config policyengine_us_data/calibration/target_config.yaml \
  --build-only

Or equivalently: make calibrate-build

This saves storage/calibration/calibration_package.pkl (default location). Use --package-output to specify a different path.

Then fit from the package using the same Step 2 command from Workflow 1.

3. Re-filtering a saved package

A saved package contains all targets from the database (before target config filtering). You can apply a different target config at fit time:

python -m policyengine_us_data.calibration.unified_calibration \
  --package-path storage/calibration/calibration_package.pkl \
  --target-config my_custom_config.yaml \
  --epochs 200

This lets you experiment with which targets to include without rebuilding the matrix.

4. Running on Modal (GPU cloud)

From a pre-built package (recommended):

Use --package-path to point at a local .pkl file. The runner automatically uploads it to the Modal Volume and then fits from it on the GPU, avoiding the function argument size limit.

modal run modal_app/remote_calibration_runner.py \
  --package-path policyengine_us_data/storage/calibration/calibration_package.pkl \
  --branch calibration-pipeline-improvements \
  --gpu T4 \
  --epochs 1000 \
  --beta 0.65 \
  --lambda-l0 1e-8 \
  --lambda-l2 1e-8

If a package already exists on the volume from a previous upload, you can also use --prebuilt-matrices to fit directly without re-uploading.
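For example (assuming --prebuilt-matrices combines with the same fit flags as the --package-path form):

modal run modal_app/remote_calibration_runner.py \
  --prebuilt-matrices \
  --branch calibration-pipeline-improvements \
  --gpu T4 \
  --epochs 1000 \
  --beta 0.65 \
  --lambda-l0 1e-8 \
  --lambda-l2 1e-8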

Full pipeline (builds matrix from scratch on Modal):

modal run modal_app/remote_calibration_runner.py \
  --branch calibration-pipeline-improvements \
  --gpu T4 \
  --epochs 1000 \
  --beta 0.65 \
  --lambda-l0 1e-8 \
  --lambda-l2 1e-8 \
  --target-config policyengine_us_data/calibration/target_config.yaml

The target config YAML is read from the cloned repo inside the container, so it must be committed to the branch you specify.

5. Portable fitting (Kaggle, Colab, etc.)

Transfer the package file to any environment with scipy, numpy, pandas, torch, and l0-python installed:

from policyengine_us_data.calibration.unified_calibration import (
    load_calibration_package,
    apply_target_config,  # optional: re-filter targets before fitting
    fit_l0_weights,
)

package = load_calibration_package("calibration_package.pkl")
targets_df = package["targets_df"]
X_sparse = package["X_sparse"]

weights = fit_l0_weights(
    X_sparse=X_sparse,
    targets=targets_df["value"].values,
    lambda_l0=1e-8,
    epochs=500,
    device="cuda",
    beta=0.65,
    lambda_l2=1e-8,
)
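After fitting, persist the weights so they can be copied back into the repo's expected location (a minimal follow-up, assuming fit_l0_weights returns a NumPy array, matching the .npy format the CLI writes):

import numpy as np

np.save("calibration_weights.npy", weights)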

Target Config

The target config controls which targets reach the optimizer. It uses a YAML exclusion list:

exclude:
  - variable: rent
    geo_level: national
  - variable: eitc
    geo_level: district
  - variable: snap
    geo_level: state
    domain_variable: snap   # optional: further narrow the match

Each rule drops rows from the calibration matrix where all specified fields match. Unrecognized variables silently match nothing.
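For example, this single rule drops every state-level snap target, regardless of domain, while leaving national and district snap targets in place:

exclude:
  - variable: snap
    geo_level: state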

Fields

| Field | Required | Values | Description |
| --- | --- | --- | --- |
| variable | Yes | Any variable name in target_overview | The calibration target variable |
| geo_level | Yes | national, state, district | Geographic aggregation level |
| domain_variable | No | Any domain variable in target_overview | Narrows match to a specific domain |

Default config

The checked-in config at policyengine_us_data/calibration/target_config.yaml reproduces the junkyard notebook's 22 excluded target groups. It drops:

  • 13 national-level variables: alimony, charitable deduction, child support, interest deduction, medical expense deduction, net worth, person count, real estate taxes, rent, social security dependents/survivors
  • 9 district-level variables: ACA PTC, EITC, income tax before credits, medical expense deduction, net capital gains, rental income, tax unit count, partnership/S-corp income, taxable social security

Applying this config reduces targets from ~37K to ~21K, matching the junkyard's target selection.

Writing a custom config

To experiment, copy the default and edit:

cp policyengine_us_data/calibration/target_config.yaml my_config.yaml
# Edit my_config.yaml to add/remove exclusion rules
python -m policyengine_us_data.calibration.unified_calibration \
  --package-path storage/calibration/calibration_package.pkl \
  --target-config my_config.yaml \
  --epochs 200

To see what variables and geo_levels are available in the database:

SELECT DISTINCT variable, geo_level
FROM target_overview
ORDER BY variable, geo_level;
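Run it with the sqlite3 CLI against the default database path, e.g.:

sqlite3 storage/calibration/policy_data.db \
  "SELECT DISTINCT variable, geo_level FROM target_overview ORDER BY variable, geo_level;"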

CLI Reference

Core flags

| Flag | Default | Description |
| --- | --- | --- |
| --dataset | storage/stratified_extended_cps_2024.h5 | Path to CPS h5 file |
| --db-path | storage/calibration/policy_data.db | Path to target database |
| --output | storage/calibration/calibration_weights.npy | Weight output path |
| --puf-dataset | None | Path to PUF h5 (enables PUF cloning) |
| --preset | local | L0 preset: local (1e-8) or national (1e-4) |
| --lambda-l0 | None | Custom L0 penalty (overrides --preset) |
| --epochs | 100 | Training epochs |
| --device | cpu | cpu or cuda |
| --n-clones | 430 | Number of dataset clones |
| --seed | 42 | Random seed for geography assignment |
| --national | False | Use national preset (λ_L0=1e-4, ~50K records) |
| --workers | 1 | Parallel workers for per-state precomputation |
| --county-level | False | Include county-level targets (slower) |

Target selection

| Flag | Default | Description |
| --- | --- | --- |
| --target-config | None | Path to YAML exclusion config |
| --domain-variables | None | Comma-separated domain filter (SQL-level) |
| --hierarchical-domains | None | Domains for hierarchical uprating |

Checkpoint flags

| Flag | Default | Description |
| --- | --- | --- |
| --build-only | False | Build matrix, save package, skip fitting |
| --package-path | None | Load pre-built package (uploads to Modal volume automatically when using the Modal runner) |
| --package-output | Auto (when --build-only) | Where to save the package |

Hyperparameter flags

| Flag | Default | Junkyard value | Description |
| --- | --- | --- | --- |
| --beta | 0.35 | 0.65 | L0 gate temperature (higher = softer gates) |
| --lambda-l2 | 1e-12 | 1e-8 | L2 regularization on weights |
| --learning-rate | 0.15 | 0.15 | Optimizer learning rate |

Skip flags

| Flag | Description |
| --- | --- |
| --skip-puf | Skip PUF clone + QRF imputation |
| --skip-source-impute | Skip ACS/SIPP/SCF re-imputation |
| --skip-takeup-rerandomize | Skip takeup re-randomization |

Calibration Package Format

The package is a pickled Python dict:

{
    "X_sparse": scipy.sparse.csr_matrix,  # (n_targets, n_records)
    "targets_df": pd.DataFrame,           # target metadata + values
    "target_names": list[str],            # human-readable names
    "metadata": {
        "dataset_path": str,
        "db_path": str,
        "n_clones": int,
        "n_records": int,
        "seed": int,
        "created_at": str,       # ISO timestamp
        "target_config": dict,   # config used at build time
    },
}

The targets_df DataFrame has columns: variable, geo_level, geographic_id, domain_variable, value, and others from the database.
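A quick structural sanity check when handling a package by hand (a minimal sketch using the loader shown in the portable-fitting workflow above):

from policyengine_us_data.calibration.unified_calibration import (
    load_calibration_package,
)

package = load_calibration_package("storage/calibration/calibration_package.pkl")
X = package["X_sparse"]
targets_df = package["targets_df"]

assert X.shape[0] == len(targets_df)  # one matrix row per target
print(X.shape, package["metadata"]["created_at"])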

Validating a Package

Before uploading a package to Modal, validate it:

# Default package location
python -m policyengine_us_data.calibration.validate_package

# Specific package
python -m policyengine_us_data.calibration.validate_package path/to/calibration_package.pkl

# Strict mode: fail if any target has row_sum/target < 1%
python -m policyengine_us_data.calibration.validate_package --strict

Exit codes: 0 = pass, 1 = impossible targets, 2 = strict ratio failures.
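The exit code makes it easy to gate scripts or CI on validation, e.g.:

python -m policyengine_us_data.calibration.validate_package --strict \
  || { echo "package validation failed (exit $?)"; exit 1; }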

Validation also runs automatically after --build-only.

Hyperparameter Tuning Guide

The three key hyperparameters control the tradeoff between target accuracy and sparsity:

  • beta (L0 gate temperature): Controls how sharply the L0 gates open/close. Higher values (0.5--0.8) give softer decisions and more exploration early in training. Lower values (0.2--0.4) give harder on/off decisions.

  • lambda_l0 (via --preset or --lambda-l0): Controls how many records survive. 1e-8 (local preset) keeps millions of records for local-area analysis. 1e-4 (national preset) keeps ~50K for the web app.

  • lambda_l2: Regularizes weight magnitudes. Larger values (1e-8) prevent any single record from having extreme weight. Smaller values (1e-12) allow more weight concentration.

Suggested starting points

For local-area calibration (millions of records):

--lambda-l0 1e-8 --beta 0.65 --lambda-l2 1e-8 --epochs 500

For national web app (~50K records):

--lambda-l0 1e-4 --beta 0.35 --lambda-l2 1e-12 --epochs 200

Makefile Targets

| Target | Description |
| --- | --- |
| make calibrate | Full local pipeline with target config |
| make calibrate-build | Build-only mode (saves package, no fitting) |
| make build-matrices | Build calibration matrices on Modal (CPU) |
| make calibrate-modal | Fit county-level weights on Modal GPU |
| make calibrate-modal-national | Fit national weights on Modal GPU (T4) |
| make calibrate-both | Run county + national fits in parallel |
| make stage-h5s | Build state/district/city H5s on Modal |
| make stage-national-h5 | Build national US.h5 on Modal |
| make stage-all-h5s | Run both staging jobs in parallel |
| make promote | Promote staged files to versioned HF paths |
| make pipeline | Print sequential steps for full pipeline |
| make validate-staging | Validate staged H5s against targets (states only) |
| make validate-staging-full | Validate staged H5s (states + districts) |
| make upload-validation | Push validation_results.csv to HF |
| make check-staging | Smoke test: sum key variables across all state H5s |
| make check-sanity | Quick structural integrity check on one state |
| make upload-calibration | Upload weights, blocks, and logs to HF |

Takeup Rerandomization

The calibration pipeline uses two independent code paths to compute the same target variables:

  1. Matrix builder (UnifiedMatrixBuilder.build_matrix): Computes a sparse calibration matrix $X$ where each row is a target and each column is a cloned household. The optimizer finds weights $w$ that minimize $|Xw - t|$, where $t$ is the vector of target values.

  2. Stacked builder (create_sparse_cd_stacked_dataset): Produces the .h5 files that users load in Microsimulation. It reconstructs each congressional district by combining base CPS records with calibrated weights and block-level geography.

For the calibration to be meaningful, both paths must produce identical values for every target variable. If the matrix builder computes $X_{\text{snap},\text{NC}} \cdot w = \$5.2\text{B}$ but the stacked NC.h5 file yields sim.calculate("snap") * household_weight = \$4.8B, then the optimizer's solution does not actually match the target.

The problem with takeup variables

Variables like snap, aca_ptc, ssi, and medicaid depend on takeup draws — random Bernoulli samples that determine whether an eligible household actually claims the benefit. By default, PolicyEngine draws these at simulation time using Python's built-in hash(), which is randomized per process.

This means loading the same H5 file in two different processes can produce different SNAP totals, even with the same weights. Worse, the matrix builder runs in process A while the stacked builder runs in process B, so their draws can diverge.

The solution: block-level seeding

Both paths call seeded_rng(variable_name, salt=f"{block_geoid}:{household_id}") to generate deterministic takeup draws. This ensures:

  • The same household at the same block always gets the same draw
  • Draws are stable across processes (no dependency on hash())
  • Draws are stable when aggregating to any geography (state, CD, county)

All takeup variables in SIMPLE_TAKEUP_VARS (in utils/takeup.py) receive block-seeded draws in the H5 builder, including would_file_taxes_voluntarily. The calibration matrix uses TAKEUP_AFFECTED_TARGETS to identify which target variables need takeup-adjusted rows, but the H5 builder applies draws to all SIMPLE_TAKEUP_VARS so that every takeup variable gets proper block-seeded values.

The --skip-takeup-rerandomize flag disables this rerandomization for faster iteration when you only care about non-takeup variables. Do not use it for production calibrations.

Block-Level Seeding

Each cloned household is assigned to a Census block (15-digit GEOID) during the assign_random_geography step. The first 2 digits are the state FIPS code, which determines the household's takeup rates (since benefit eligibility rules are state-specific).

Mechanism

# Deterministic draw keyed on variable name, block GEOID, and household ID
rng = seeded_rng(variable_name, salt=f"{block_geoid}:{household_id}")
draw = rng.random()
# state_fips = first 2 digits of block_geoid; takeup rates are state-specific
takes_up = draw < takeup_rate[state_fips]

The seeded_rng function uses _stable_string_hash — a deterministic hash that does not depend on Python's PYTHONHASHSEED. This is critical because Python's built-in hash() is randomized per process by default (since Python 3.3).
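A minimal sketch of the idea (not the repo's actual implementation): derive the seed from a cryptographic digest of the string, which is identical in every process:

import hashlib

def stable_string_hash(s: str) -> int:
    # md5 of the UTF-8 bytes does not depend on PYTHONHASHSEED,
    # unlike the built-in hash()
    return int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "little")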

Why block (not CD or state)?

Blocks are the finest Census geography. A household's block assignment stays the same regardless of how blocks are aggregated — the same household-block-draw triple produces the same result whether you are building an H5 for a state, a congressional district, or a county. This means:

  • State H5s and district H5s are consistent (no draw drift)
  • Future county-level H5s will also be consistent
  • Re-running the pipeline with different area selections yields the same per-household values

Inactive records

When converting to stacked format, households that are not assigned to a given CD get zero weight. These inactive records must receive an empty string "" as their block GEOID, not a real block. If they received real blocks, they would inflate the entity count n passed to the RNG, shifting the draw positions for active entities and breaking the $X \cdot w$ consistency invariant.
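A toy model of the failure mode (np.random stands in for the block-seeded RNG, which emits one draw per entity in order):

import numpy as np

def block_draws(n_entities, seed=12345):
    # one draw per entity at this block, in entity order
    return np.random.default_rng(seed).random(n_entities)

active = ["A", "B", "C"]
padded = ["A", "X_inactive", "B", "C"]  # inactive record wrongly given this block

d_good = dict(zip(active, block_draws(len(active))))
d_bad = dict(zip(padded, block_draws(len(padded))))
print(d_good["B"] == d_bad["B"])  # False: B's draw position shifted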

The $X \cdot w$ Consistency Invariant

Formal statement

For every target variable $v$ and geography $g$:

$$X_{v,g} \cdot w = \sum_{i \in g} \text{sim.calculate}(v)_i \times w_i$$

where the left side comes from the matrix builder and the right side comes from loading the stacked H5 and running Microsimulation.calculate().

Why it matters

This invariant is what makes calibration meaningful. Without it, the optimizer's solution (which minimizes $|Xw - t|$) does not actually produce a dataset that matches the targets. The weights would be "correct" in the matrix builder's view but produce different totals in the H5 files that users actually load.

Known sources of drift

  1. Mismatched takeup draws: The matrix builder and stacked builder use different RNG states. Solved by block-level seeding (see above).

  2. Different block assignments: The stacked format uses first-clone-wins for multi-clone-same-CD records. With ~11M blocks and 3-10 clones, collision rate is ~0.7-10% of records. In practice, the residual mismatch is negligible.

  3. Inactive records in RNG calls: If inactive records (w=0) receive real block GEOIDs, they inflate the entity count for that block's RNG call, shifting draw positions. Solved by using "" for inactive blocks.

  4. Entity ordering: Both paths must iterate over entities in the same order (sim.calculate("{entity}_id", map_to=entity)). NumPy boolean masking preserves order, so draws[i] maps to the same entity in both paths.

Testing

The test_xw_consistency.py test (pytest -m slow) verifies this invariant end-to-end:

  1. Load base dataset, create geography with uniform weights
  2. Build $X$ with the matrix builder (including takeup rerandomization)
  3. Convert weights to stacked format
  4. Build stacked H5 for selected CDs
  5. Compare $X \cdot w$ vs sim.calculate() * household_weight — assert ratio within 1%
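A hedged sketch of the core comparison in step 5 (the dataset path, target row lookup, and Microsimulation constructor usage here are illustrative, not the test's exact code):

import numpy as np
from policyengine_us import Microsimulation  # assumed standard import

lhs = X_sparse @ weights  # matrix builder's view: one value per target row

sim = Microsimulation(dataset="staged/NC.h5")  # hypothetical staged H5
values = sim.calculate("snap", map_to="household").values
hh_weight = sim.calculate("household_weight").values
rhs = (values * hh_weight).sum()

row = targets_df.query(
    "variable == 'snap' and geographic_id == 'NC'"
).index[0]  # illustrative row lookup
assert abs(lhs[row] / rhs - 1) < 0.01  # ratio within 1%, per the test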

Post-Calibration Gating Workflow

After the pipeline stages H5 files to HuggingFace, two manual review gates determine whether to promote to production.

Gate 1: Review calibration fit

Load calibration_log.csv in the microcalibrate dashboard. This file contains the $X \cdot w$ values from the matrix builder for every target at every epoch.

What to check:

  • Loss curve converges (no divergence in later epochs)
  • No individual target groups diverging while others improve
  • Final loss is comparable to or better than the previous production run

If fit is poor, re-calibrate with different hyperparameters (learning rate, lambda_l0, beta, epochs).

Gate 2: Review simulation quality

make validate-staging          # states only (~30 min)
make validate-staging-full     # states + districts (~3 hrs)
make upload-validation         # push CSV to HF

This produces validation_results.csv with sim.calculate() values for every target. Load it in the dashboard's Combined tab alongside calibration_log.csv.

What to check:

  • CalibrationVsSimComparison shows the gap between $X \cdot w$ and sim.calculate() values
  • No large regressions vs the previous production run
  • Sanity check column has no FAIL entries

Promote

If both gates pass:

  • Run the "Promote Local Area H5 Files" GitHub workflow, OR
  • Manually copy staged files to the production paths in the HF repo

Structural pre-flight

For a quick structural check without loading the full database:

make check-sanity              # one state, ~2 min

This runs weight non-negativity, entity ID uniqueness, NaN/Inf detection, person-household mapping, boolean takeup validation, and per-household range checks.