Motivation
The API v2 alpha and policyengine package's PolicyEngineUSDataset require entity-level Pandas HDFStore format (one table per entity: person, household, tax_unit, spm_unit, family, marital_unit). Currently, -us-data publishes only variable-centric h5py format (variable/year → array).
Converting between these formats via create_datasets() is extremely slow (~1hr+ per state) because it routes every variable through sim.calculate(), invoking the full simulation engine's dependency resolution for each variable × each year.
The UK avoids this: -uk-data publishes entity-level HDFStore directly, and policyengine-uk has extend_single_year_dataset() which uprates DataFrames via simple multiplication — no simulation engine needed.
Changes
1. HDFStore serialization in stacked_dataset_builder.py
After the existing h5py serialization, create_sparse_cd_stacked_dataset() now also:
- Splits combined_df into entity DataFrames: classifies each variable by entity using system.variables[var].entity.key and deduplicates group entities by entity ID
- Builds an uprating manifest: records each variable's entity and uprating parameter path (from system.variables[var].uprating)
- Saves as HDFStore: written with a .hdfstore.h5 suffix alongside the existing .h5 file
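The split and manifest steps above can be sketched roughly as follows. This is an illustration, not the code in stacked_dataset_builder.py: the function names split_by_entity and build_manifest are hypothetical, the plain dicts stand in for system.variables[var].entity.key and system.variables[var].uprating, and the sketch assumes combined_df has one row per person with a person_id column and one <entity>_id column per group entity.

```python
import pandas as pd

GROUP_ENTITIES = ("household", "tax_unit", "spm_unit", "family", "marital_unit")


def split_by_entity(combined_df, variable_entities):
    """Split a person-level frame into one table per entity (hypothetical helper).

    variable_entities maps variable name -> entity key; it stands in for
    system.variables[var].entity.key.
    """
    # Assumption: membership IDs are stored as "<entity>_id" columns.
    id_cols = [f"{e}_id" for e in GROUP_ENTITIES if f"{e}_id" in combined_df.columns]

    def columns_for(entity):
        return [
            var
            for var, ent in variable_entities.items()
            if ent == entity
            and var in combined_df.columns
            and var not in id_cols
            and var != "person_id"
        ]

    # The person table keeps every row, plus the entity membership IDs.
    person_cols = ["person_id", *id_cols, *columns_for("person")]
    tables = {"person": combined_df[person_cols].reset_index(drop=True)}

    # Group entities are deduplicated so each entity ID appears once.
    for entity in GROUP_ENTITIES:
        id_col = f"{entity}_id"
        if id_col not in combined_df.columns:
            continue
        tables[entity] = (
            combined_df[[id_col, *columns_for(entity)]]
            .drop_duplicates(subset=id_col)
            .reset_index(drop=True)
        )
    return tables


def build_manifest(variable_entities, uprating_paths):
    """Record each variable's entity and uprating parameter path (hypothetical helper)."""
    return pd.DataFrame(
        {
            "variable": list(variable_entities),
            "entity": list(variable_entities.values()),
            "uprating": [uprating_paths.get(var) for var in variable_entities],
        }
    )
```

Each resulting table could then be written to a pandas.HDFStore with store.put(name, table), which requires PyTables to be installed.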
2. Upload pipeline in publish_local_area.py
HDFStore files are uploaded to dedicated subdirectories:
states_hdfstore/
districts_hdfstore/
cities_hdfstore/
Both GCS and HuggingFace uploads are handled.
3. Comparison test
tests/test_format_comparison.py validates both formats contain identical data:
- Compares all ~183 variables between h5py and HDFStore
- Handles person-level (direct comparison) vs group-entity (unique value comparison)
- Tests manifest presence and entity table completeness
- Runnable as pytest or standalone CLI
pytest test_format_comparison.py --h5py-path NV.h5 --hdfstore-path NV.hdfstore.h5
# or
python -m policyengine_us_data.tests.test_format_comparison --h5py-path NV.h5 --hdfstore-path NV.hdfstore.h5
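The two comparison modes listed above might look something like the sketch below. The function names and the tolerance are assumptions for illustration and are not taken from test_format_comparison.py. Person-level arrays have the same length in both formats, so they are compared element by element; group-entity arrays differ in length (the h5py side may repeat values, the HDFStore side is deduplicated), so only the sets of unique values are compared.

```python
import numpy as np


def person_values_match(h5py_values, hdfstore_values, rtol=1e-6):
    """Element-wise comparison for person-level variables (hypothetical helper)."""
    a = np.asarray(h5py_values)
    b = np.asarray(hdfstore_values)
    if a.shape != b.shape:
        return False
    # Numeric arrays are compared with a tolerance; anything else exactly.
    if a.dtype.kind in "fiu" and b.dtype.kind in "fiu":
        return bool(np.allclose(a, b, rtol=rtol, equal_nan=True))
    return bool(np.array_equal(a, b))


def group_values_match(h5py_values, hdfstore_values):
    """Unique-value comparison for group-entity variables (hypothetical helper).

    This sketch does not treat NaN values as equal to each other.
    """
    return bool(np.array_equal(np.unique(h5py_values), np.unique(hdfstore_values)))
```

The unique-value check is deliberately weaker than a row-by-row comparison: it confirms that the same values are present but not that they are attached to the same entity IDs.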
HDFStore structure
/person → DataFrame (all person-entity vars + entity membership IDs)
/household → DataFrame (deduplicated by household_id)
/tax_unit → DataFrame (deduplicated by tax_unit_id)
/spm_unit → DataFrame (deduplicated by spm_unit_id)
/family → DataFrame (deduplicated by family_id)
/marital_unit → DataFrame (deduplicated by marital_unit_id)
/_variable_metadata → DataFrame (variable, entity, uprating columns)
/_time_period → Series (base year)
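A consumer could read this layout roughly as sketched below. The helper name read_entity_tables is hypothetical, and the sketch assumes the base year is the first element of the /_time_period Series. The function only relies on keys() and key lookup, both of which pandas.HDFStore provides.

```python
import pandas as pd

ENTITY_KEYS = ("person", "household", "tax_unit", "spm_unit", "family", "marital_unit")


def read_entity_tables(store):
    """Return (entity tables, variable metadata, base year) from an open store.

    `store` is expected to be an open pandas.HDFStore (hypothetical helper).
    """
    present = set(store.keys())  # HDFStore keys carry a leading slash
    tables = {
        name: store[f"/{name}"] for name in ENTITY_KEYS if f"/{name}" in present
    }
    metadata = store["/_variable_metadata"]
    # Assumption: the Series holds a single value, the base year.
    base_year = int(store["/_time_period"].iloc[0])
    return tables, metadata, base_year


# Example usage (requires PyTables and an existing file):
# with pd.HDFStore("NV.hdfstore.h5", mode="r") as store:
#     tables, metadata, base_year = read_entity_tables(store)
```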
Future work
policyengine-us will add extend_single_year_dataset() to consume the HDFStore directly, enabling instant year projection without the simulation engine. The embedded uprating manifest makes each file self-describing and allows fallback when the package version doesn't exactly match the version used to build the dataset.
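To illustrate why the manifest makes simulation-free projection possible, here is a minimal sketch of uprating by multiplication. It is not the planned extend_single_year_dataset() implementation: the function name uprate_table and the growth_factors mapping (uprating parameter path to a multiplicative factor between the base year and the target year) are assumptions made for the example.

```python
import pandas as pd


def uprate_table(table, metadata, growth_factors):
    """Return a copy of `table` with uprated columns scaled (hypothetical helper).

    metadata: DataFrame with "variable" and "uprating" columns, as in
        /_variable_metadata.
    growth_factors: assumed mapping of uprating parameter path -> factor.
    """
    uprated = table.copy()
    path_by_variable = dict(zip(metadata["variable"], metadata["uprating"]))
    for column in uprated.columns:
        path = path_by_variable.get(column)
        # Columns with no uprating path, or an unknown path, are left unchanged.
        if path in growth_factors:
            uprated[column] = uprated[column] * growth_factors[path]
    return uprated
```

Because the uprating paths travel with the file, a reader in this style would not need to look them up in the installed package's variable definitions, which is what allows the fallback when package versions differ.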
Branch
add-hdfstore-output