Find the change that broke your data timeline.
BreakSight extends OpenMetadata with two interconnected tools that answer the two questions every data engineer dreads:
- "What broke my pipeline at 3 AM?" → Time Machine walks backward through schema changes, DQ failures, pipeline runs, and team signals to reconstruct the incident timeline and name the root cause — in under 30 seconds.
- "Is it safe to drop this column?" → Simulator walks forward through downstream lineage, replays real historical queries against the proposed schema, scores a Blast Radius (0–100), and hands you a runbook before you merge the PR.
Both tools share one Correlation Engine. Time Machine walks it backward; Simulator walks it forward. Same engine, opposite time directions.
Schema changes are the silent killers of data pipelines. A column dropped in a hotfix at 01:00 breaks a revenue dashboard by 03:00, triggers a PagerDuty alert by 04:00, and costs an on-call engineer 2 hours of git blame and log-grep before they pinpoint the culprit. OpenMetadata tracks every version, every lineage edge, every DQ test — but there is no tool that stitches those signals into a causal timeline or predicts the blast radius of a change before it ships.
BreakSight is that tool.
Pick any table in your catalog, set an alert time, and Time Machine:
- Fetches every schema version from `/api/v1/tables/{id}/versions` in the lookback window
- Pulls DQ test failures from `/api/v1/dataQuality/testCases`
- Fetches pipeline run statuses (OM Airflow connector or seeded SQLite events)
- Reads conversation signals from `/api/v1/feed`
- Scores every event with a causal likelihood model (temporal position × event type × severity)
- Assigns phases: `ROOT_CAUSE → FIRST_FAILURE → CASCADE → BUSINESS_IMPACT`
- Validates the causal chain (failures must follow the root cause, reference the changed column, overlap with downstream lineage)
- Returns a confidence score and a one-sentence recommended action
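The phase-assignment step can be sketched roughly as follows. The `Event` model and the bucketing rules here are illustrative assumptions, not BreakSight's actual code:

```python
from dataclasses import dataclass

@dataclass
class Event:
    time: float   # epoch seconds
    kind: str     # "schema_change" | "pipeline_failure" | "test_failure" | "team_signal"
    phase: str = ""

def assign_phases(events: list[Event], root_cause: Event) -> None:
    """Bucket events into the four timeline phases, relative to the root cause."""
    root_cause.phase = "ROOT_CAUSE"
    # Failures after the root cause, in time order: first one is FIRST_FAILURE,
    # the rest are the CASCADE.
    failures = sorted(
        (e for e in events
         if e.kind in ("pipeline_failure", "test_failure") and e.time > root_cause.time),
        key=lambda e: e.time,
    )
    for i, e in enumerate(failures):
        e.phase = "FIRST_FAILURE" if i == 0 else "CASCADE"
    # Team signals after the root cause are treated as business impact.
    for e in events:
        if e.kind == "team_signal" and e.time > root_cause.time:
            e.phase = "BUSINESS_IMPACT"
```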
Key metric: Mean Time to Confidence drops from ~2 hours to ~23 seconds.
Pick a table, pick a change type (drop column, rename, type change, nullable, drop table), and Simulator:
- Fetches the current schema from `/api/v1/tables/name/{fqn}` with column metadata
- Walks downstream lineage 3 hops via `/api/v1/lineage/table/name/{fqn}`
- Fetches tier, owner, and domain per downstream asset to weight risk
- Replays every historical SQL query from the local event store against the proposed schema (via `sqlglot`), classifying each as `will_error`, `will_silently_change`, `safe`, or `could_not_analyze`
- Scores Blast Radius 0–100: `0.35 × tier_weight + 0.30 × usage_score + 0.20 × count_score + 0.15 × activity`
- Lists affected users ranked by query count
- Generates a step-by-step runbook (notify owners → coordinate migration → validate → monitor)
The two tools are linked. From the Simulator result you can jump directly to Time Machine for any affected downstream entity. From a Time Machine schema_change event you can launch the Simulator to model a rollback.
```
Browser (Next.js 14 + TypeScript + Tailwind + shadcn/ui)
  /              → Landing portal (BreakSight)
  /time-machine  → Backward causal trace, schema health, version diff
  /simulator     → Forward impact prediction, blast radius, SQL replay
        |
        | REST JSON (localhost:8000)
        v
FastAPI + Python 3.11 — Correlation Engine
  ├── timeline.py      — version snapshot fetcher + differ
  ├── blast_radius.py  — lineage BFS + tier/usage scorer
  ├── sql_replay.py    — sqlglot-based query impact classifier
  ├── schema_diff.py   — column-level diff between versions
  └── runbook.py       — change-type-aware remediation generator
  SQLite (breaksight.db) — normalized event store
        |
        | HTTP + OM JWT
        v
OpenMetadata (localhost:8585)
  /api/v1/tables/{id}/versions      — schema history
  /api/v1/lineage/table/name/{fqn}  — column-level lineage
  /api/v1/dataQuality/testCases     — DQ test results
  /api/v1/queries                   — query logs for SQL replay
  /api/v1/feed                      — conversation signals
  /api/v1/usage/{entity}            — usage stats
  /api/v1/search/query              — entity picker autocomplete
```
No Kafka. No graph database. No ML framework. No microservices.
| API | Purpose |
|---|---|
| `GET /api/v1/tables/{id}/versions` | Schema version history — the primary root-cause signal |
| `GET /api/v1/lineage/table/name/{fqn}` | Downstream lineage walk (Blast Radius + impact scope) |
| `GET /api/v1/dataQuality/testCases` | DQ test failures as causal evidence |
| `GET /api/v1/queries` | Real historical SQL for replay classification |
| `GET /api/v1/feed` | Team conversation signals (escalations, alerts) |
| `GET /api/v1/usage/{entity}` | Usage weight in Blast Radius scoring |
| `GET /api/v1/search/query` | Entity picker in the UI |
| `GET /api/v1/tables` | Cross-table changelog feed |
| `GET /api/v1/pipelines/{id}/status` | Live pipeline run statuses |
- OpenMetadata running at `http://localhost:8585`
- Python 3.11+ with `uv`
- Node.js 18+
```bash
git clone <this-repo>
cd breaksight

# Configure environment
cp .env.example .env
# Edit .env — set OM_TOKEN (get from OM UI → avatar → Access Token)

# Install dependencies
make install

# Seed demo data (creates realistic ecommerce incident scenario)
make seed

# Start backend (:8000) and frontend (:3000) in parallel
make up
```

Open http://localhost:3000.
```bash
make demo
# Waits for services, seeds data, opens on :3000
```

Health check: `curl http://localhost:8000/health`

Run the tests: `make test`

| Method | Path | Description |
|---|---|---|
| GET | `/api/time-machine/investigate` | Main investigate endpoint — backward + forward scan |
| GET | `/api/time-machine/trace-backward` | Raw backward causal trace with phased timeline |
| GET | `/api/time-machine/risk-assessment` | Flattened risk card (confidence, blast radius, failure sequence) |
| GET | `/api/timeline/{fqn}` | Version history with diffs for a table |
| GET | `/api/diff/{fqn}` | Column-level diff between two versions |
| GET | `/api/schema-health/{fqn}` | Schema health score (A–F), risk factors, recommendations |
| GET | `/api/dq-tests/{fqn}` | Data quality test results for a table |
| GET | `/api/changelog` | Cross-table schema changelog (50 most recent) |
| GET | `/api/usage/{fqn}` | Query usage stats per user |
| GET | `/api/alerts/active` | Active DQ and pipeline alert counts |
Investigate example:
```bash
curl "http://localhost:8000/api/time-machine/investigate?\
entity_fqn=ecommerce_warehouse.shopflow.sales.orders&\
alert_time=2026-04-20T04:00:00Z&\
lookback_days=2"
```

```json
{
  "confidence": 87,
  "confidence_label": "HIGH",
  "root_cause_summary": "alex.junior made a schema change at 01:23 UTC that cascaded into failures.",
  "root_cause_actor": "alex.junior",
  "root_cause_column": "amount_usd",
  "root_cause_description": "dropped column amount_usd",
  "blast_radius": {
    "pipelines_failed": 3,
    "tables_no_data": 4,
    "dq_tests_failing": 2
  },
  "failure_sequence": [
    {"time": "02:15", "what": "Pipeline Failed: revenue_etl_pipeline — column reference error"},
    {"time": "03:50", "what": "DQ Test Failed: orders_amount_must_not_be_null — assertion failed"}
  ],
  "recommended_action": "Restore amount_usd column OR update 3 pipeline queries to reference the replacement column instead.",
  "timeline": [...]
}
```

| Method | Path | Description |
|---|---|---|
| POST | `/api/simulator/analyze` | Run a simulation — returns Blast Radius, SQL replay, runbook |
| GET | `/api/simulations` | List recent simulation results |
Analyze example:
```bash
curl -X POST http://localhost:8000/api/simulator/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "entity_fqn": "ecommerce_warehouse.shopflow.sales.orders",
    "change_type": "drop_column",
    "target": "amount_usd"
  }'
```

```json
{
  "impact_score": 74,
  "severity": "HIGH",
  "downstream_count": 12,
  "blast_radius_breakdown": {
    "tier_score": 1.0,
    "usage_score": 0.72,
    "count_score": 0.60,
    "final_score": 73.6
  },
  "sql_replay": {
    "total_analyzed": 31,
    "will_error": [...],
    "will_silently_change": [...],
    "safe": [...]
  },
  "most_affected_users": [
    {"user": "priya.data", "query_count": 14, "verdicts": {"will_error": 9, "safe": 5}}
  ],
  "runbook": [...]
}
```

Every event in the investigation window gets a causal likelihood score:
```
score = 0.35 × temporal_weight
      + 0.25 × event_type_weight
      + 0.20 × severity_weight
      + 0.20 × distance_weight
```
- Temporal weight: schema changes score higher earlier in the window (root causes precede effects); pipeline/test failures score higher near the alert time (cascades accumulate later)
- Event type weight: `schema_change` (1.0) > `test_failure` (0.9) > `pipeline_failure` (0.85) > `team_signal` (0.5)
- Severity weight: `error` (1.0) > `warning` (0.6) > `info` (0.2)
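Under these weights, a sketch of the scorer; the exact temporal shape and the 0–100 scaling (inferred from the "≥ 55 absolute" selection threshold) are assumptions:

```python
EVENT_TYPE_W = {"schema_change": 1.0, "test_failure": 0.9,
                "pipeline_failure": 0.85, "team_signal": 0.5}
SEVERITY_W = {"error": 1.0, "warning": 0.6, "info": 0.2}

def causal_score(event_type: str, severity: str,
                 position: float,            # 0.0 = window start, 1.0 = alert time
                 distance_weight: float) -> float:
    """Causal likelihood on a 0–100 scale (scaling is an assumption)."""
    # Assumed temporal shape: root causes score high early, cascades score high late.
    temporal = (1.0 - position) if event_type == "schema_change" else position
    return 100 * (0.35 * temporal
                  + 0.25 * EVENT_TYPE_W[event_type]
                  + 0.20 * SEVERITY_W[severity]
                  + 0.20 * distance_weight)
```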
The top schema change is selected as root cause if it is dominant (scores ≥ 1.4× the second candidate) or plausible (scores ≥ 55 absolute). No hardcoded thresholds per entity or incident type.
Confidence receives a chain bonus (up to +30%) when:
- Pipeline/test failures exist after the root cause timestamp
- Failed queries reference the changed column by name
- Failed pipeline names overlap with known downstream lineage nodes
- Gap between root cause and first failure is ≤ 4 hours
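Taken together, the chain-bonus checks might be sketched like this; the equal split of the +30% across the four checks, and the field names, are assumptions:

```python
from datetime import datetime, timedelta

def chain_bonus(root_cause_time: datetime, changed_column: str,
                failures: list[dict], downstream: set[str]) -> float:
    """Return the confidence bonus (0.0 to 0.30) earned by the causal chain."""
    after = [f for f in failures if f["time"] > root_cause_time]
    checks = [
        bool(after),                                               # failures follow root cause
        any(changed_column in f.get("query", "") for f in after),  # column referenced by name
        any(f.get("pipeline") in downstream for f in after),       # overlap with lineage
        bool(after) and                                            # first failure within 4 h
            min(f["time"] for f in after) - root_cause_time <= timedelta(hours=4),
    ]
    return 0.30 * sum(checks) / len(checks)
```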
```
score = 0.35 × max(tier_weights of downstream assets)
      + 0.30 × log10(1 + total_query_runs) / 5
      + 0.20 × downstream_asset_count / 20
      + 0.15 × activity_factor
```
Tier1 downstream assets (weight 1.0) dominate the score, ensuring business-critical tables always surface as high risk.
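The formula, sketched in Python. The Tier2/Tier3 weights and the clamping of the usage and count terms to 1.0 are assumptions (the text only fixes Tier1 at weight 1.0), and `activity_factor` is taken as a caller-supplied value in [0, 1]:

```python
import math

TIER_W = {"Tier1": 1.0, "Tier2": 0.7, "Tier3": 0.4}   # Tier2/Tier3 weights assumed

def blast_radius(tiers: list[str], total_query_runs: int,
                 downstream_count: int, activity_factor: float) -> float:
    """Blast Radius on a 0–100 scale, matching the final_score in the example above."""
    tier = max((TIER_W.get(t, 0.2) for t in tiers), default=0.0)
    usage = min(math.log10(1 + total_query_runs) / 5, 1.0)   # clamp assumed
    count = min(downstream_count / 20, 1.0)                  # clamp assumed
    return round(100 * (0.35 * tier + 0.30 * usage
                        + 0.20 * count + 0.15 * activity_factor), 1)
```

With a single Tier1 downstream asset and everything else at zero, the score is already 35.0, which is why business-critical tables always surface as high risk.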
```
breaksight/
├── backend/
│   ├── app/
│   │   ├── main.py                 # FastAPI app + CORS + lifespan
│   │   ├── api/
│   │   │   ├── time_machine.py     # All Time Machine routes
│   │   │   ├── simulator.py        # Simulator analyze + history
│   │   │   ├── tables.py           # Table schema + column fetch
│   │   │   └── dq_seed.py          # DQ test seeding helper
│   │   └── core/
│   │       ├── timeline.py         # Version snapshot fetcher
│   │       ├── blast_radius.py     # Lineage BFS + Blast Radius scorer
│   │       ├── sql_replay.py       # sqlglot query impact classifier
│   │       ├── schema_diff.py      # Column-level version differ
│   │       ├── runbook.py          # Remediation step generator
│   │       └── om_client.py        # OpenMetadata async HTTP client
│   └── tests/
├── frontend/
│   └── app/
│       ├── page.tsx                # Landing portal
│       ├── time-machine/page.tsx   # Timeline, investigate, health tabs
│       └── simulator/page.tsx      # Blast radius gauge, SQL replay panel
├── ingestion/
│   └── ingest.py                   # OpenMetadata → normalized SQLite events
├── demo-data/
│   └── seed.py                     # Realistic ecommerce incident scenario
├── .env.example
├── Makefile
└── README.md
```
| Layer | Technology |
|---|---|
| Frontend | Next.js 14, TypeScript (strict), Tailwind CSS, shadcn/ui, Recharts, framer-motion |
| Backend | Python 3.11, FastAPI, Pydantic v2, SQLAlchemy 2.0, SQLite |
| HTTP | httpx (async OpenMetadata client) |
| SQL analysis | sqlglot (dialect-aware query parser and impact classifier) |
| Packaging | uv (backend), npm (frontend) |
| Infrastructure | Docker Compose |
Team: S Nithiyashri · Subhashri S
The Correlation Engine is fully generic — it takes any (entity_fqn, incident_time, window) and investigates dynamically against live OpenMetadata data. Zero hardcoded incident knowledge in the engine. All scenario specifics live in seed data only.