Fix h5 files by saving calibration geography artifact, and model fit resume function#708

Open
baogorek wants to merge 10 commits into main from fine-agi-brackets

@baogorek baogorek commented Apr 9, 2026

Summary

Continues the fine-agi-brackets branch after #695 (which added fine AGI bracket targets from SOI stubs 9/10 and Table 1.4, re-enabled income_tax_positive, and added net_worth and district SNAP targets). This PR adds two infrastructure improvements to the calibration pipeline, along with an ITIN imputation and a crash fix:

  • Saved geography artifacts (fixes #706, "Local area publish regenerates random geography instead of reusing calibration geography"): The publish and worker steps were calling assign_random_geography() to generate a fresh geography assignment, meaning the H5 files were built with a different geography than the weights were optimized against. Calibration now saves geography_assignment.npz alongside weights, and all downstream steps load it instead of regenerating. Backward compatibility with legacy stacked_blocks.npy is preserved via reconstruct_geography_from_blocks.
  • Calibration resume/checkpoint support (fixes #707, "Calibration fits cannot be resumed after interruption"): Long fits (2000+ epochs on 10M features) couldn't be resumed if interrupted. Adds --resume-from and --checkpoint-output flags to unified_calibration.py. Full checkpoint resume restores L0 gate state and continues epoch numbering; warm-start from .npy weights is also supported. Hyperparameter compatibility is validated on resume.
  • ITIN holder imputation: Undocumented (code-0) persons who file taxes with ITINs were incorrectly getting has_tin = False, disqualifying them from ODC ($500 credit). New impute_itin_status() function selects tax units with code-0 earners via weighted random sampling targeting 4.4M ITIN returns (IRS NTA benchmark), then marks all code-0 members of selected units as ITIN holders. Updates has_tin = (ssn_card_type != 0) | has_itin_number.
  • String constraint crash fix: The calibration matrix builder crashed on string variables like ssn_card_type when casting to float32. Both sites now try float32 first and fall back to keeping raw values.
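The float32 fallback described in the last bullet can be sketched as follows. This is a minimal illustration, not the PR's code: the helper name `to_float32_or_raw` and the sample string values are hypothetical.

```python
import numpy as np


def to_float32_or_raw(values):
    """Cast to float32 when possible; otherwise keep the raw values.

    Hypothetical helper: the name and signature are assumptions.
    """
    arr = np.asarray(values)
    try:
        return arr.astype(np.float32)
    except (ValueError, TypeError):
        # Non-numeric strings cannot be cast; keep them as they are.
        return arr


numeric = to_float32_or_raw([1, 2, 3])
strings = to_float32_or_raw(["CITIZEN", "NONE"])  # placeholder string codes
```

One caveat of this pattern is that strings that happen to look numeric (for example "1") are still cast to floats, so a constraint comparing against a string literal would need the raw branch.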

Also:

  • Adds --national-only flag to publish_local_area.py for building just the national US.h5
  • Gitignores *.csv.gz to prevent accidental commit of cached ORG data

@baogorek baogorek force-pushed the fine-agi-brackets branch from 65a8134 to 1d25092 April 9, 2026 03:17
@baogorek baogorek requested review from MaxGhenis and juaristi22 April 9, 2026 03:23
@baogorek baogorek changed the title from "Fine AGI bracket targets, saved geography, and calibration resume" to "Save calibration geography artifact and add fit resume support" Apr 9, 2026
@baogorek baogorek changed the title from "Save calibration geography artifact and add fit resume support" to "Fix h5 files by saving calibration geography artifact, and model fit resume function" Apr 9, 2026

@juaristi22 juaristi22 left a comment

Approved. The new saved-geography flow fixes the real issue here: calibration now uses the same geography assignment when building the matrix and when producing calibrated H5s, instead of trying to regenerate it later from (n_records, n_clones, seed).
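To illustrate why persisting the assignment matters, here is a minimal round-trip sketch using NumPy's `.npz` format. The array name `block_geoid` and the toy sizes are assumptions made for illustration; only the file name geography_assignment.npz comes from the PR.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for the assignment drawn during calibration.
rng = np.random.default_rng(0)
n_records, n_clones = 4, 2
block_geoid = rng.integers(0, 1000, size=n_records * n_clones)  # hypothetical field

# Calibration side: save the assignment next to the weights.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "geography_assignment.npz")
np.savez(path, block_geoid=block_geoid, n_records=n_records, n_clones=n_clones)

# Publish/worker side: load the saved assignment instead of redrawing it.
with np.load(path) as saved:
    loaded = saved["block_geoid"]
    loaded_clones = int(saved["n_clones"])
```

Because the downstream step reads the stored array, it no longer depends on reproducing the same random draw from (n_records, n_clones, seed).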

One note: the legacy fallback is no longer backward compatible for older artifacts that only contain weights and the dataset. If neither geography_assignment.npz nor stacked_blocks.npy is present, publish/worker now fail instead of rebuilding via the old regeneration path. I’m fine with that tradeoff, but it should be documented explicitly as a compatibility break.
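The lookup order described in this note might look roughly like the sketch below. The function name `load_geography` and the dictionary key `block_geoid` are hypothetical, and the legacy branch only wraps the array, where the real pipeline would call reconstruct_geography_from_blocks.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_geography(artifact_dir):
    """Prefer the saved artifact, then the legacy file, else fail loudly."""
    directory = Path(artifact_dir)
    saved = directory / "geography_assignment.npz"
    legacy = directory / "stacked_blocks.npy"
    if saved.exists():
        with np.load(saved) as data:
            return {key: data[key] for key in data.files}
    if legacy.exists():
        # Stand-in for reconstruct_geography_from_blocks (not reproduced here).
        return {"block_geoid": np.load(legacy)}
    # No regeneration path: older artifacts with only weights + dataset fail.
    raise FileNotFoundError(
        f"No geography_assignment.npz or stacked_blocks.npy in {directory}"
    )


with tempfile.TemporaryDirectory() as tmp:
    np.save(Path(tmp) / "stacked_blocks.npy", np.array([10, 20, 30]))
    legacy_geo = load_geography(tmp)
```

Failing with an explicit error, rather than silently redrawing geography, is what makes this a compatibility break for artifacts that predate both files.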

baogorek commented Apr 9, 2026

Paused, waiting on #711

PR #711 already has a fix for this. The commit "Harden CPS ORG month loading" replaces the old pd.read_csv(..., usecols=CPS_BASIC_MONTHLY_ORG_COLUMNS)

@baogorek baogorek force-pushed the fine-agi-brackets branch from 6acc3eb to 5eb5eb0 April 9, 2026 22:52
@vercel

vercel bot commented Apr 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
pipeline-diagrams Error Error Apr 10, 2026 1:25pm


baogorek and others added 4 commits April 9, 2026 21:22
Calibration now persists geography_assignment.npz alongside weights so
that downstream publish and worker steps use the exact same geography
instead of regenerating it randomly. Adds --resume-from and
--checkpoint-output flags to unified_calibration for continuing fits
from a saved checkpoint or warm-starting from weights. Also gitignores
*.csv.gz to prevent accidental commits of cached ORG data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
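The two resume modes in this commit message (full checkpoint versus warm start from weights) can be sketched as below. This is an assumption-laden illustration in plain NumPy: the function names, the checkpoint layout, and the hyperparameter names `l0_lambda` and `learning_rate` are all hypothetical, and the actual fit state in unified_calibration.py may be stored differently.

```python
import json
import os
import tempfile

import numpy as np


def save_checkpoint(path, weights, gate_logits, epoch, hparams):
    """Write weights, L0 gate state, epoch counter and hyperparameters."""
    np.savez(
        path,
        weights=weights,
        gate_logits=gate_logits,
        epoch=epoch,
        hparams=json.dumps(hparams, sort_keys=True),
    )


def resume_from(path, hparams):
    """Return (weights, gate_logits, start_epoch) for a resumed fit."""
    if path.endswith(".npy"):
        # Warm start: weights only, gates re-initialised, epochs restart at 0.
        return np.load(path), None, 0
    with np.load(path) as ckpt:
        saved = json.loads(ckpt["hparams"].item())
        if saved != hparams:
            raise ValueError(f"Checkpoint hyperparameters {saved} != {hparams}")
        return ckpt["weights"], ckpt["gate_logits"], int(ckpt["epoch"])


hparams = {"l0_lambda": 1e-6, "learning_rate": 0.1}  # hypothetical names
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.npz")
save_checkpoint(ckpt_path, np.ones(5), np.zeros(5), epoch=500, hparams=hparams)

weights, gates, start_epoch = resume_from(ckpt_path, hparams)
resumed_epochs = list(range(start_epoch, start_epoch + 3))  # continues numbering
```

Validating hyperparameters before restoring state guards against silently continuing a fit under a different regularisation strength than the one the checkpoint was trained with.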
@baogorek baogorek force-pushed the fine-agi-brackets branch from 5eb5eb0 to 5ca1241 April 10, 2026 01:22
baogorek and others added 5 commits April 9, 2026 21:33
…nfig

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix calibration crash on string constraint variables (ssn_card_type) by
falling back from float32 cast when values are non-numeric.

Impute ITIN status for undocumented (code-0) persons: select tax units
with code-0 earners via weighted random sampling targeting 4.4M ITIN
returns (IRS NTA), then mark all code-0 members of those units. Updates
has_tin = (ssn_card_type != 0) | has_itin_number so ITIN holders
correctly qualify for ODC ($500 credit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
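The ITIN imputation described in this commit can be sketched on toy data as follows. This is not the PR's impute_itin_status(): the signature, the array layout, and the sampling scheme (a uniform random ordering of candidate tax units, accumulating survey weight until the target is reached) are assumptions, and the PR's actual weighted sampling may differ.

```python
import numpy as np


def impute_itin_sketch(person_tax_unit, ssn_card_type, is_earner,
                       tax_unit_weight, target_returns, seed=0):
    """Flag code-0 members of randomly selected tax units as ITIN holders."""
    rng = np.random.default_rng(seed)
    code0 = ssn_card_type == 0
    # Candidate tax units contain at least one code-0 earner.
    candidates = np.unique(person_tax_unit[code0 & is_earner])
    # Random order, then take units until their weight reaches the target.
    order = rng.permutation(candidates)
    cumulative = np.cumsum(tax_unit_weight[order])
    n_pick = np.searchsorted(cumulative, target_returns, side="left") + 1
    selected = order[:n_pick]
    # Mark every code-0 member of a selected unit, not only the earner.
    return code0 & np.isin(person_tax_unit, selected)


# Toy data: six persons in four tax units (ids index tax_unit_weight).
person_tax_unit = np.array([0, 0, 1, 1, 2, 3])
ssn_card_type = np.array([0, 0, 1, 1, 0, 0])  # 1 is a placeholder non-zero code
is_earner = np.array([True, False, True, False, True, False])
tax_unit_weight = np.array([100.0, 100.0, 100.0, 100.0])

has_itin_number = impute_itin_sketch(
    person_tax_unit, ssn_card_type, is_earner, tax_unit_weight, target_returns=150
)
has_tin = (ssn_card_type != 0) | has_itin_number
```

In this toy case both candidate units (0 and 2) are needed to reach the target, so every code-0 member of those units gains a TIN, while the code-0 person in unit 3 (no code-0 earner) does not.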
Development

Successfully merging this pull request may close these issues.

Calibration fits cannot be resumed after interruption (#707)
Local area publish regenerates random geography instead of reusing calibration geography (#706)

3 participants