The unified calibration pipeline reweights cloned CPS records to match administrative targets using L0-regularized optimization. This guide covers the main workflows: lightweight build-then-fit, full pipeline with PUF, and fitting from a saved package.
This is the current production calibration path. The older national-only Enhanced CPS path
(make data-legacy) remains in the repo for legacy reproduction and uses a separate EnhancedCPS /
build_loss_matrix() flow.
# Build matrix only from stratified CPS (no PUF, no re-imputation):
python -m policyengine_us_data.calibration.unified_calibration \
--target-config policyengine_us_data/calibration/target_config.yaml \
--skip-source-impute \
--skip-takeup-rerandomize \
--build-only
# Fit weights from a saved package:
python -m policyengine_us_data.calibration.unified_calibration \
--package-path storage/calibration/calibration_package.pkl \
--epochs 500 --device cuda
# Resume a previous fit for 500 more epochs:
python -m policyengine_us_data.calibration.unified_calibration \
--package-path storage/calibration/calibration_package.pkl \
--resume-from storage/calibration/calibration_weights.npy \
--epochs 500 --device cuda
# Full pipeline with PUF (build + fit in one shot):
make calibrateThe pipeline has two phases:
- Matrix build: Clone CPS records, assign geography, compute all target variable values, assemble a sparse calibration matrix. Optionally includes PUF cloning (doubles record count) and source re-imputation.
- Weight fitting (~5-20 min on GPU): L0-regularized optimization to find household weights that reproduce administrative targets.
The calibration package checkpoint lets you run phase 1 once and iterate on phase 2 with different hyperparameters or target selections---without rebuilding.
The matrix build requires two inputs from the data pipeline:
- Stratified CPS (
storage/stratified_extended_cps_2024.h5): ~12K households, built bymake data. This is the base dataset that gets cloned. - Target database (
storage/calibration/policy_data.db): Administrative targets, built bymake database.
Both must exist before running calibration. The stratified CPS already contains all CPS variables needed for calibration; PUF cloning and source re-imputation are optional enhancements that happen at calibration time.
Build the matrix from the stratified CPS without PUF cloning or re-imputation. This is the fastest way to get a calibration package for experimentation.
Step 1: Build the matrix (~12K base records x 430 clones = ~5.2M columns).
python -m policyengine_us_data.calibration.unified_calibration \
--target-config policyengine_us_data/calibration/target_config.yaml \
--skip-source-impute \
--skip-takeup-rerandomize \
--build-onlyThis saves storage/calibration/calibration_package.pkl (default location). Use --package-output
to specify a different path.
Step 2: Fit weights from the package (fast, repeatable).
python -m policyengine_us_data.calibration.unified_calibration \
--package-path storage/calibration/calibration_package.pkl \
--epochs 1000 \
--lambda-l0 1e-8 \
--beta 0.65 \
--lambda-l2 1e-8 \
--device cudaYou can re-run Step 2 as many times as you want with different hyperparameters. The expensive matrix build only happens once.
Every fit now also writes a checkpoint next to the weights output
(calibration_weights.checkpoint.pt by default). To continue the same fit, pass --resume-from
with the weights file or checkpoint path. If a sibling checkpoint exists next to the weights file,
it is used automatically so the L0 gate state is restored as well.
python -m policyengine_us_data.calibration.unified_calibration \
--package-path storage/calibration/calibration_package.pkl \
--epochs 2000 \
--beta 0.65 \
--lambda-l0 1e-4 \
--lambda-l2 1e-12 \
--log-freq 500 \
--target-config policyengine_us_data/calibration/target_config.yaml \
--device cpu \
--output policyengine_us_data/storage/calibration/national/weights.npy \
--resume-from policyengine_us_data/storage/calibration/national/weights.npyWhen --resume-from points to a checkpoint, --epochs means additional epochs to run beyond the
saved checkpoint epoch count. If only a .npy weights file exists, the run warm-starts from those
weights.
Adding --puf-dataset doubles the record count (~24K base records x 430 clones = ~10.3M columns) by
creating PUF-imputed copies of every CPS record. This also triggers source re-imputation unless
skipped.
Single-pass (build + fit):
python -m policyengine_us_data.calibration.unified_calibration \
--puf-dataset policyengine_us_data/storage/puf_2024.h5 \
--target-config policyengine_us_data/calibration/target_config.yaml \
--epochs 200 \
--device cudaOr equivalently: make calibrate
Output:
storage/calibration/calibration_weights.npy--- calibrated weight vectorstorage/calibration/unified_diagnostics.csv--- per-target error reportstorage/calibration/unified_run_config.json--- full run configuration
Build-only (save package for later fitting):
python -m policyengine_us_data.calibration.unified_calibration \
--puf-dataset policyengine_us_data/storage/puf_2024.h5 \
--target-config policyengine_us_data/calibration/target_config.yaml \
--build-onlyOr equivalently: make calibrate-build
This saves storage/calibration/calibration_package.pkl (default location). Use --package-output
to specify a different path.
Then fit from the package using the same Step 2 command from Workflow 1.
A saved package contains all targets from the database (before target config filtering). You can apply a different target config at fit time:
python -m policyengine_us_data.calibration.unified_calibration \
--package-path storage/calibration/calibration_package.pkl \
--target-config my_custom_config.yaml \
--epochs 200This lets you experiment with which targets to include without rebuilding the matrix.
From a pre-built package (recommended):
Use --package-path to point at a local .pkl file. The runner automatically uploads it to the
Modal Volume and then fits from it on the GPU, avoiding the function argument size limit.
modal run modal_app/remote_calibration_runner.py \
--package-path policyengine_us_data/storage/calibration/calibration_package.pkl \
--branch calibration-pipeline-improvements \
--gpu T4 \
--epochs 1000 \
--beta 0.65 \
--lambda-l0 1e-8 \
--lambda-l2 1e-8If a package already exists on the volume from a previous upload, you can also use
--prebuilt-matrices to fit directly without re-uploading.
Full pipeline (builds matrix from scratch on Modal):
modal run modal_app/remote_calibration_runner.py \
--branch calibration-pipeline-improvements \
--gpu T4 \
--epochs 1000 \
--beta 0.65 \
--lambda-l0 1e-8 \
--lambda-l2 1e-8 \
--target-config policyengine_us_data/calibration/target_config.yamlThe target config YAML is read from the cloned repo inside the container, so it must be committed to the branch you specify.
Transfer the package file to any environment with scipy, numpy, pandas, torch, and
l0-python installed:
from policyengine_us_data.calibration.unified_calibration import (
load_calibration_package,
apply_target_config,
fit_l0_weights,
)
package = load_calibration_package("calibration_package.pkl")
targets_df = package["targets_df"]
X_sparse = package["X_sparse"]
weights = fit_l0_weights(
X_sparse=X_sparse,
targets=targets_df["value"].values,
lambda_l0=1e-8,
epochs=500,
device="cuda",
beta=0.65,
lambda_l2=1e-8,
)The target config controls which targets reach the optimizer. It uses a YAML exclusion list:
exclude:
- variable: rent
geo_level: national
- variable: eitc
geo_level: district
- variable: snap
geo_level: state
domain_variable: snap # optional: further narrow the matchEach rule drops rows from the calibration matrix where all specified fields match. Unrecognized variables silently match nothing.
| Field | Required | Values | Description |
|---|---|---|---|
variable |
Yes | Any variable name in target_overview |
The calibration target variable |
geo_level |
Yes | national, state, district |
Geographic aggregation level |
domain_variable |
No | Any domain variable in target_overview |
Narrows match to a specific domain |
The checked-in config at policyengine_us_data/calibration/target_config.yaml reproduces the
junkyard notebook's 22 excluded target groups. It drops:
- 13 national-level variables: alimony, charitable deduction, child support, interest deduction, medical expense deduction, net worth, person count, real estate taxes, rent, social security dependents/survivors
- 9 district-level variables: ACA PTC, EITC, income tax before credits, medical expense deduction, net capital gains, rental income, tax unit count, partnership/S-corp income, taxable social security
Applying this config reduces targets from ~37K to ~21K, matching the junkyard's target selection.
To experiment, copy the default and edit:
cp policyengine_us_data/calibration/target_config.yaml my_config.yaml
# Edit my_config.yaml to add/remove exclusion rules
python -m policyengine_us_data.calibration.unified_calibration \
--package-path storage/calibration/calibration_package.pkl \
--target-config my_config.yaml \
--epochs 200To see what variables and geo_levels are available in the database:
SELECT DISTINCT variable, geo_level
FROM target_overview
ORDER BY variable, geo_level;| Flag | Default | Description |
|---|---|---|
--dataset |
storage/stratified_extended_cps_2024.h5 |
Path to CPS h5 file |
--db-path |
storage/calibration/policy_data.db |
Path to target database |
--output |
storage/calibration/calibration_weights.npy |
Weight output path |
--puf-dataset |
None | Path to PUF h5 (enables PUF cloning) |
--preset |
local |
L0 preset: local (1e-8) or national (1e-4) |
--lambda-l0 |
None | Custom L0 penalty (overrides --preset) |
--epochs |
100 | Training epochs |
--device |
cpu |
cpu or cuda |
--n-clones |
430 | Number of dataset clones |
--seed |
42 | Random seed for geography assignment |
--national |
False | Use national preset (λ_L0=1e-4, ~50K records) |
--workers |
1 | Parallel workers for per-state precomputation |
--county-level |
False | Include county-level targets (slower) |
| Flag | Default | Description |
|---|---|---|
--target-config |
None | Path to YAML exclusion config |
--domain-variables |
None | Comma-separated domain filter (SQL-level) |
--hierarchical-domains |
None | Domains for hierarchical uprating |
| Flag | Default | Description |
|---|---|---|
--build-only |
False | Build matrix, save package, skip fitting |
--package-path |
None | Load pre-built package (uploads to Modal volume automatically when using Modal runner) |
--package-output |
Auto (when --build-only) |
Where to save package |
| Flag | Default | Junkyard value | Description |
|---|---|---|---|
--beta |
0.35 | 0.65 | L0 gate temperature (higher = softer gates) |
--lambda-l2 |
1e-12 | 1e-8 | L2 regularization on weights |
--learning-rate |
0.15 | 0.15 | Optimizer learning rate |
| Flag | Description |
|---|---|
--skip-puf |
Skip PUF clone + QRF imputation |
--skip-source-impute |
Skip ACS/SIPP/SCF re-imputation |
--skip-takeup-rerandomize |
Skip takeup re-randomization |
The package is a pickled Python dict:
{
"X_sparse": scipy.sparse.csr_matrix, # (n_targets, n_records)
"targets_df": pd.DataFrame, # target metadata + values
"target_names": list[str], # human-readable names
"metadata": {
"dataset_path": str,
"db_path": str,
"n_clones": int,
"n_records": int,
"seed": int,
"created_at": str, # ISO timestamp
"target_config": dict, # config used at build time
},
}The targets_df DataFrame has columns: variable, geo_level, geographic_id, domain_variable,
value, and others from the database.
Before uploading a package to Modal, validate it:
# Default package location
python -m policyengine_us_data.calibration.validate_package
# Specific package
python -m policyengine_us_data.calibration.validate_package path/to/calibration_package.pkl
# Strict mode: fail if any target has row_sum/target < 1%
python -m policyengine_us_data.calibration.validate_package --strictExit codes: 0 = pass, 1 = impossible targets, 2 = strict ratio failures.
Validation also runs automatically after --build-only.
The three key hyperparameters control the tradeoff between target accuracy and sparsity:
-
beta(L0 gate temperature): Controls how sharply the L0 gates open/close. Higher values (0.5--0.8) give softer decisions and more exploration early in training. Lower values (0.2--0.4) give harder on/off decisions. -
lambda_l0(via--presetor--lambda-l0): Controls how many records survive.1e-8(local preset) keeps millions of records for local-area analysis.1e-4(national preset) keeps ~50K for the web app. -
lambda_l2: Regularizes weight magnitudes. Larger values (1e-8) prevent any single record from having extreme weight. Smaller values (1e-12) allow more weight concentration.
For local-area calibration (millions of records):
--lambda-l0 1e-8 --beta 0.65 --lambda-l2 1e-8 --epochs 500For national web app (~50K records):
--lambda-l0 1e-4 --beta 0.35 --lambda-l2 1e-12 --epochs 200| Target | Description |
|---|---|
make calibrate |
Full local pipeline with target config |
make calibrate-build |
Build-only mode (saves package, no fitting) |
make build-matrices |
Build calibration matrices on Modal (CPU) |
make calibrate-modal |
Fit county-level weights on Modal GPU |
make calibrate-modal-national |
Fit national weights on Modal GPU (T4) |
make calibrate-both |
Run county + national fits in parallel |
make stage-h5s |
Build state/district/city H5s on Modal |
make stage-national-h5 |
Build national US.h5 on Modal |
make stage-all-h5s |
Run both staging jobs in parallel |
make promote |
Promote staged files to versioned HF paths |
make pipeline |
Print sequential steps for full pipeline |
make validate-staging |
Validate staged H5s against targets (states only) |
make validate-staging-full |
Validate staged H5s (states + districts) |
make upload-validation |
Push validation_results.csv to HF |
make check-staging |
Smoke test: sum key variables across all state H5s |
make check-sanity |
Quick structural integrity check on one state |
make upload-calibration |
Upload weights, blocks, and logs to HF |
The calibration pipeline uses two independent code paths to compute the same target variables:
-
Matrix builder (
UnifiedMatrixBuilder.build_matrix): Computes a sparse calibration matrix$X$ where each row is a target and each column is a cloned household. The optimizer finds weights$w$ that minimize$|Xw - t|$ (target values). -
Stacked builder (
create_sparse_cd_stacked_dataset): Produces the.h5files that users load inMicrosimulation. It reconstructs each congressional district by combining base CPS records with calibrated weights and block-level geography.
For the calibration to be meaningful, both paths must produce identical values for every target
variable. If the matrix builder computes $X_{snap,NC} \cdot w = sim.calculate("snap") * household_weight = $4.8B, then the optimizer's solution does not
actually match the target.
Variables like snap, aca_ptc, ssi, and medicaid depend on takeup draws — random
Bernoulli samples that determine whether an eligible household actually claims the benefit. By
default, PolicyEngine draws these at simulation time using Python's built-in hash(), which is
randomized per process.
This means loading the same H5 file in two different processes can produce different SNAP totals, even with the same weights. Worse, the matrix builder runs in process A while the stacked builder runs in process B, so their draws can diverge.
Both paths call seeded_rng(variable_name, salt=f"{block_geoid}:{household_id}") to generate
deterministic takeup draws. This ensures:
- The same household at the same block always gets the same draw
- Draws are stable across processes (no dependency on
hash()) - Draws are stable when aggregating to any geography (state, CD, county)
All takeup variables in SIMPLE_TAKEUP_VARS (in utils/takeup.py) receive block-seeded draws in
the H5 builder, including would_file_taxes_voluntarily. The calibration matrix uses
TAKEUP_AFFECTED_TARGETS to identify which target variables need takeup-adjusted rows, but the H5
builder applies draws to all SIMPLE_TAKEUP_VARS so that every takeup variable gets proper
block-seeded values.
The --skip-takeup-rerandomize flag disables this rerandomization for faster iteration when you
only care about non-takeup variables. Do not use it for production calibrations.
Each cloned household is assigned to a Census block (15-digit GEOID) during the
assign_random_geography step. The first 2 digits are the state FIPS code, which determines the
household's takeup rates (since benefit eligibility rules are state-specific).
rng = seeded_rng(variable_name, salt=f"{block_geoid}:{household_id}")
draw = rng.random()
takes_up = draw < takeup_rate[state_fips]The seeded_rng function uses _stable_string_hash — a deterministic hash that does not depend on
Python's PYTHONHASHSEED. This is critical because Python's built-in hash() is randomized per
process by default (since Python 3.3).
Blocks are the finest Census geography. A household's block assignment stays the same regardless of how blocks are aggregated — the same household-block-draw triple produces the same result whether you are building an H5 for a state, a congressional district, or a county. This means:
- State H5s and district H5s are consistent (no draw drift)
- Future county-level H5s will also be consistent
- Re-running the pipeline with different area selections yields the same per-household values
When converting to stacked format, households that are not assigned to a given CD get zero weight.
These inactive records must receive an empty string "" as their block GEOID, not a real block. If
they received real blocks, they would inflate the entity count n passed to the RNG, shifting the
draw positions for active entities and breaking the
For every target variable
where the left side comes from the matrix builder and the right side comes from loading the stacked
H5 and running Microsimulation.calculate().
This invariant is what makes calibration meaningful. Without it, the optimizer's solution (which
minimizes
-
Mismatched takeup draws: The matrix builder and stacked builder use different RNG states. Solved by block-level seeding (see above).
-
Different block assignments: The stacked format uses first-clone-wins for multi-clone-same-CD records. With ~11M blocks and 3-10 clones, collision rate is ~0.7-10% of records. In practice, the residual mismatch is negligible.
-
Inactive records in RNG calls: If inactive records (w=0) receive real block GEOIDs, they inflate the entity count for that block's RNG call, shifting draw positions. Solved by using
""for inactive blocks. -
Entity ordering: Both paths must iterate over entities in the same order (
sim.calculate("{entity}_id", map_to=entity)). NumPy boolean masking preserves order, sodraws[i]maps to the same entity in both paths.
The test_xw_consistency.py test (pytest -m slow) verifies this invariant end-to-end:
- Load base dataset, create geography with uniform weights
- Build
$X$ with the matrix builder (including takeup rerandomization) - Convert weights to stacked format
- Build stacked H5 for selected CDs
- Compare
$X \cdot w$ vssim.calculate() * household_weight— assert ratio within 1%
After the pipeline stages H5 files to HuggingFace, two manual review gates determine whether to promote to production.
Load calibration_log.csv in the microcalibrate dashboard. This file contains the
What to check:
- Loss curve converges (no divergence in later epochs)
- No individual target groups diverging while others improve
- Final loss is comparable to or better than the previous production run
If fit is poor, re-calibrate with different hyperparameters (learning rate, lambda_l0, beta, epochs).
make validate-staging # states only (~30 min)
make validate-staging-full # states + districts (~3 hrs)
make upload-validation # push CSV to HFThis produces validation_results.csv with sim.calculate() values for every target. Load it in
the dashboard's Combined tab alongside calibration_log.csv.
What to check:
-
CalibrationVsSimComparisonshows the gap between$X \cdot w$ andsim.calculate()values - No large regressions vs the previous production run
- Sanity check column has no FAIL entries
If both gates pass:
- Run the "Promote Local Area H5 Files" GitHub workflow, OR
- Manually copy staged files to the production paths in the HF repo
For a quick structural check without loading the full database:
make check-sanity # one state, ~2 minThis runs weight non-negativity, entity ID uniqueness, NaN/Inf detection, person-household mapping, boolean takeup validation, and per-household range checks.