An end-to-end analytics platform built on the modern data stack. Synthea synthetic healthcare data → DuckDB warehouse → dbt transformations (sources → staging → intermediate → marts) → a Streamlit + Plotly operations UI that surfaces $2.6M outstanding across 14,176 claims and ranks which ones need attention first.
📄 Full slide deck (PDF) - problem framing, architecture, design choices, and UI walkthrough.
Claims and transaction data are fragmented across tables - it's hard to see a claim's resolution posture quickly. Revenue-cycle teams need to know, in one glance:
- Where the dollars sit - open balances, aged claims, concentration risk
- What's aging - what's been open 30 / 60 / 90+ days
- What's complex - high transaction counts, unusual payer/provider patterns
- Who it clusters on - which organizations, providers, and payers drive the biggest outstanding balances
Without a shared decision layer, teams fall back on ad-hoc SQL, inconsistent definitions, and tribal rules for what counts as "urgent."
| Dimension | Cost of getting it wrong |
|---|---|
| 💰 Cash & capacity | Open and aged claims delay money and burn limited team time |
| 🎯 Wrong focus = risk | Without a shared priority view, effort misses high-balance, high-complexity hotspots |
| 🤝 Trust | One repeatable definition of "urgent" beats a patchwork of ad-hoc SQL |
Dataset: Synthea synthetic healthcare data - a realistic population simulator that generates CSVs with the shape of real healthcare data but zero actual PHI. Safe to share publicly, safe to host anywhere.
Nine tables are in scope - the rest were intentionally left out to keep the project focused on ops and revenue risk, not general healthcare analytics:
| Table | Why it's in scope | How it's used |
|---|---|---|
claims |
Core business object | Claim grain: status, service date, identifiers |
claims_transactions |
Financial activity around claims | Rollups: payments, adjustments, transfers |
patients |
Person context | Demographics and geography for encounters |
encounters |
Visit / service context | Links claims to utilization patterns |
payers |
Coverage context | Payer dimension for portfolio views |
providers |
Performer context | Provider dimension + concentration analysis |
organizations |
Facility context | Org-level outstanding balances |
procedures |
Service detail | Procedure context where relevant |
conditions |
Clinical context | Diagnosis/condition flags for context |
Five stages, one design principle: business logic lives in the warehouse (DuckDB + dbt). The app reads from curated marts - it doesn't compute.
- Ingest - download Synthea CSV extracts, load shortlisted tables into DuckDB (
load_to_duckdb.py) - Verify - row counts + sanity checks (
inspect_schema.py,table_schema_summary.csv) - Model - dbt runs the four-layer transformation: sources → staging → intermediate → marts
- Serve - mart tables are the stable interface between warehouse and app; schema changes are versioned through dbt, not the UI
- Experience - Streamlit + Plotly render monitoring, ranking, and drill-down from the marts
DuckDB tables registered in dbt via YAML declarations.
Standardize naming (snake_case), cast types, keep only relevant fields, add foundational tests (not_null on keys, unique where expected).
Two key models do the heavy lifting:
int_claim_financials- aggregates transactions to the claim grain: payments, adjustments, transfers, totals, latest outstanding balance, transaction count, first/last transaction dates.int_claim_enriched- joins claims with the financial summary, adds patient/provider/org context, and computes derived analytical fields (age buckets, complexity signals, concentration flags).
mart_claim_priority_queue- one row per claim, ranked by priority score and enriched with every field the queue view needsmart_claims_overview- aggregated portfolio view for the command center KPIs
Why the separation: the app is dumb and fast. Priority logic is readable SQL in dbt, not Python buried inside a Streamlit callback. Change the definition of "urgent" in one place and both views update on the next dbt run.
The app is organized as a three-step operational workflow: monitor → prioritize → investigate.
KPIs and trends at a glance - volume, total outstanding, aging distribution, priority mix, top attention list.
14,176total claims ·$2,616,097outstanding ·434high-priority ·14,113aged 90+ days
A ranked work surface telling the team what to review first. Filtered live from the same rules as the command center - no silent divergence between views.
One claim, end-to-end: balances, financial rollup, clinical and payer context, and - crucially - a "why flagged" explanation that names the specific rule(s) that put the claim on the queue. Transparent, auditable, no black-box.
Global sidebar filters work across all three pages: organization, provider, priority bucket, age bucket, primary status, latest-outstanding range, and a claim-ID search.
| Tool | Why it fits |
|---|---|
| DuckDB | One-file SQL engine - fast joins and rollups, zero server ops, easy to ship as a portfolio piece |
| dbt | Repeatable, testable, documented transformations - metrics live in versioned models, not scattered notebook cells |
| Streamlit | Fast Python UI for filters, charts, and tables that mostly reads from curated tables |
- Right question first - focus on balances, payments, and open work, not "every healthcare table."
- Logic in the warehouse - the app displays; DuckDB + dbt own the math.
- Layered models - clean raw → connect → final reporting tables that people can trust.
- Tight scope - only the sources that serve ops + revenue risk. Less noise, faster delivery.
- Synthea on purpose - realistic data shape, zero PHI, safe to share and host publicly.
- Explainable outputs - flags and segments are tied to visible numbers, not black-box ML.
- Easy to rerun or host - simple dependencies and a clean run path so it's portable when needed.
| Dimension | What this project demonstrates |
|---|---|
| 🧠 Analytics engineering | Business logic lives in versioned, testable dbt models - not buried in UI code |
| 📊 Business usability | Revenue-risk signals made operational: balance, age, complexity, concentration - not just dashboards |
| 🎯 Judgment | Reframed the problem to match the available data; shortlisted tables with intent rather than dumping every source |
| 🛠️ Product thinking | Workflow-shaped app: monitor → prioritize → investigate - not a generic multi-tab dashboard |
| Layer | Tools |
|---|---|
| Language | Python 3.10+, SQL |
| Warehouse | DuckDB (embedded, single-file) |
| Transformations | dbt-duckdb (sources, staging, intermediate, marts) |
| App | Streamlit |
| Visualization | Plotly |
| Data | Synthea synthetic healthcare CSV extracts |
git clone https://github.com/Isha2605/healthcare-claims-operations-revenue-risk-platform.git
cd healthcare-claims-operations-revenue-risk-platform
python -m venv venv
source venv/bin/activate # macOS / Linux
# venv\Scripts\activate # Windows
pip install -r requirements.txtDownload a Synthea CSV extract from https://synthea.mitre.org/downloads (the "synthea_sample_data_csv" bundle works well), unzip it under data/, then:
python load_to_duckdb.py
python inspect_schema.py # optional sanity checkThis produces warehouse/claims.duckdb (ignored by git) with the nine shortlisted tables loaded.
cd dbt_project
dbt deps
dbt seed # if you're using any seed CSVs
dbt run
dbt testcd ../app
streamlit run app.pyOpen http://localhost:8501.
├── app/ # Streamlit operations UI
│ └── app.py # entrypoint with three views
├── assets/ # Architecture diagrams (SVG)
│ ├── ppt-architecture-horizontal.svg
│ └── ppt-dbt-layer-cake.svg
├── data/ # Raw Synthea CSV extracts (gitignored)
├── dbt_project/ # dbt transformation layer
│ ├── models/
│ │ ├── staging/ # stg_* - cleaned, typed, tested
│ │ ├── intermediate/ # int_claim_financials, int_claim_enriched
│ │ └── marts/ # mart_claim_priority_queue, mart_claims_overview
│ ├── tests/
│ ├── dbt_project.yml
│ └── profiles.yml
├── docs/ # Supporting documentation
├── notebooks/ # Exploration notebooks
├── screenshots/ # App screenshots (for this README)
├── warehouse/ # DuckDB files (gitignored)
├── inspect_schema.py # Schema inventory + sanity checks
├── load_to_duckdb.py # CSV → DuckDB loader
├── load_to_duckdb.txt # Load notes
├── table_schema_summary.csv # Table schema reference
└── README.md
The project uses Synthea synthetic data - a fully simulated population created by MITRE. No real patient health information is used or stored, which means:
- Safe to share the full repo publicly
- Safe to host the app without HIPAA infrastructure
- Realistic data shape for demonstrating healthcare analytics patterns without the risk surface of real PHI
If adapting this pipeline to real healthcare data, you'd need to add: HIPAA-compliant hosting, row-level access controls in the warehouse, audit logging on the Streamlit app, and credentialed-user authentication.
- Authentication on the Streamlit app (Streamlit's native auth or an external identity provider)
- Live refresh - replace manual
dbt runwith a scheduled trigger (dbt Cloud, GitHub Actions, or Prefect) - Forecasting - project outstanding balances 30 / 60 / 90 days forward per org
- Collection-action recommendations - extend the priority score into a "next best action" field
- Snowflake / BigQuery port - swap the dbt target to demonstrate cloud-warehouse portability
- Data contracts - formalize the mart schemas as contracts so downstream consumers can depend on them
Isha Narkhede · Portfolio · LinkedIn · ishajayant207@gmail.com
MIT - see LICENSE.


