subhashris/openmetadata-hackathon
BreakSight — Time Machine + Simulator for OpenMetadata

Find the variant that broke your data timeline.

BreakSight extends OpenMetadata with two interconnected tools that answer the two questions every data engineer dreads:

  • "What broke my pipeline at 3 AM?" → Time Machine walks backward through schema changes, DQ failures, pipeline runs, and team signals to reconstruct the incident timeline and name the root cause — in under 30 seconds.
  • "Is it safe to drop this column?" → Simulator walks forward through downstream lineage, replays real historical queries against the proposed schema, scores a Blast Radius (0–100), and hands you a runbook before you merge the PR.

Both tools share one Correlation Engine. Time Machine walks it backward; Simulator walks it forward. Same engine, opposite time directions.


The Problem

Schema changes are the silent killers of data pipelines. A column dropped in a hotfix at 01:00 breaks a revenue dashboard by 03:00, triggers a PagerDuty alert by 04:00, and costs an on-call engineer 2 hours of git blame and log-grep before they pinpoint the culprit. OpenMetadata tracks every version, every lineage edge, every DQ test — but there is no tool that stitches those signals into a causal timeline or predicts the blast radius of a change before it ships.

BreakSight is that tool.


Solution

Time Machine — Backward Causal Tracing

Pick any table in your catalog, set an alert time, and Time Machine:

  1. Fetches every schema version from /api/v1/tables/{id}/versions in the lookback window
  2. Pulls DQ test failures from /api/v1/dataQuality/testCases
  3. Fetches pipeline run statuses (OM Airflow connector or seeded SQLite events)
  4. Reads conversation signals from /api/v1/feed
  5. Scores every event with a causal likelihood model (temporal position × event type × severity)
  6. Assigns phases: ROOT_CAUSE → FIRST_FAILURE → CASCADE → BUSINESS_IMPACT
  7. Validates the causal chain (failures must follow the root cause, reference the changed column, overlap with downstream lineage)
  8. Returns a confidence score and a one-sentence recommended action

Key metric: Mean Time to Confidence drops from ~2 hours to ~23 seconds.
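The phase labeling in step 6 can be sketched as follows. This is an illustrative simplification (hypothetical event dicts with id, time, and type keys), not the engine's actual code:

```python
def assign_phases(events, root_cause):
    """Label scored events with incident phases (step 6): the root cause first,
    then the earliest downstream failure, then the cascade, then business impact."""
    phases = {root_cause["id"]: "ROOT_CAUSE"}
    failures = sorted(
        (e for e in events
         if e["type"] in ("pipeline_failure", "test_failure")
         and e["time"] > root_cause["time"]),
        key=lambda e: e["time"],
    )
    for i, event in enumerate(failures):
        phases[event["id"]] = "FIRST_FAILURE" if i == 0 else "CASCADE"
    for event in events:  # team signals after the root cause mark business impact
        if event["type"] == "team_signal" and event["time"] > root_cause["time"]:
            phases[event["id"]] = "BUSINESS_IMPACT"
    return phases
```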

Simulator — Forward Impact Prediction

Pick a table, pick a change type (drop column, rename, type change, nullable, drop table), and Simulator:

  1. Fetches current schema from /api/v1/tables/name/{fqn} with column metadata
  2. Walks downstream lineage 3 hops via /api/v1/lineage/table/name/{fqn}
  3. Fetches tier, owner, and domain per downstream asset to weight risk
  4. Replays every historical SQL query from the local event store against the proposed schema (via sqlglot) — classifying each as will_error, will_silently_change, safe, or could_not_analyze
  5. Scores Blast Radius 0–100: 0.35 × tier_weight + 0.30 × usage_score + 0.20 × count_score + 0.15 × activity
  6. Lists affected users ranked by query count
  7. Generates a step-by-step runbook (notify owners → coordinate migration → validate → monitor)

Interconnection

The two tools are linked. From the Simulator result you can jump directly to Time Machine for any affected downstream entity. From a Time Machine schema_change event you can launch the Simulator to model a rollback.


Architecture

Browser (Next.js 14 + TypeScript + Tailwind + shadcn/ui)
  /             → Landing portal (BreakSight)
  /time-machine → Backward causal trace, schema health, version diff
  /simulator    → Forward impact prediction, blast radius, SQL replay
        |
        | REST JSON (localhost:8000)
        v
FastAPI + Python 3.11
  Correlation Engine
    ├── timeline.py        — version snapshot fetcher + differ
    ├── blast_radius.py    — lineage BFS + tier/usage scorer
    ├── sql_replay.py      — sqlglot-based query impact classifier
    ├── schema_diff.py     — column-level diff between versions
    └── runbook.py         — change-type-aware remediation generator
  SQLite (breaksight.db)   — normalized event store
        |
        | HTTP + OM JWT
        v
OpenMetadata (localhost:8585)
  /api/v1/tables/{id}/versions      — schema history
  /api/v1/lineage/table/name/{fqn} — column-level lineage
  /api/v1/dataQuality/testCases    — DQ test results
  /api/v1/queries                  — query logs for SQL replay
  /api/v1/feed                     — conversation signals
  /api/v1/usage/{entity}           — usage stats
  /api/v1/search/query             — entity picker autocomplete

No Kafka. No graph database. No ML framework. No microservices.


OpenMetadata APIs Used

| API | Purpose |
| --- | --- |
| GET /api/v1/tables/{id}/versions | Schema version history — the primary root-cause signal |
| GET /api/v1/lineage/table/name/{fqn} | Downstream lineage walk (Blast Radius + impact scope) |
| GET /api/v1/dataQuality/testCases | DQ test failures as causal evidence |
| GET /api/v1/queries | Real historical SQL for replay classification |
| GET /api/v1/feed | Team conversation signals (escalations, alerts) |
| GET /api/v1/usage/{entity} | Usage weight in Blast Radius scoring |
| GET /api/v1/search/query | Entity picker in the UI |
| GET /api/v1/tables | Cross-table changelog feed |
| GET /api/v1/pipelines/{id}/status | Live pipeline run statuses |

Quick Start

Prerequisites

  • OpenMetadata running at http://localhost:8585
  • Python 3.11+ with uv
  • Node.js 18+

Setup

git clone <this-repo>
cd breaksight

# Configure environment
cp .env.example .env
# Edit .env — set OM_TOKEN (get from OM UI → avatar → Access Token)

# Install dependencies
make install

# Seed demo data (creates realistic ecommerce incident scenario)
make seed

# Start backend (:8000) and frontend (:3000) in parallel
make up

Open http://localhost:3000.

Docker (all-in-one)

make demo
# Waits for services, seeds data, opens on :3000

Health check

curl http://localhost:8000/health

Tests

make test

Key API Endpoints

Time Machine

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/time-machine/investigate | Main investigate endpoint — backward + forward scan |
| GET | /api/time-machine/trace-backward | Raw backward causal trace with phased timeline |
| GET | /api/time-machine/risk-assessment | Flattened risk card (confidence, blast radius, failure sequence) |
| GET | /api/timeline/{fqn} | Version history with diffs for a table |
| GET | /api/diff/{fqn} | Column-level diff between two versions |
| GET | /api/schema-health/{fqn} | Schema health score (A–F), risk factors, recommendations |
| GET | /api/dq-tests/{fqn} | Data quality test results for a table |
| GET | /api/changelog | Cross-table schema changelog (50 most recent) |
| GET | /api/usage/{fqn} | Query usage stats per user |
| GET | /api/alerts/active | Active DQ and pipeline alert counts |

Investigate example:

curl "http://localhost:8000/api/time-machine/investigate?\
entity_fqn=ecommerce_warehouse.shopflow.sales.orders&\
alert_time=2026-04-20T04:00:00Z&\
lookback_days=2"

{
  "confidence": 87,
  "confidence_label": "HIGH",
  "root_cause_summary": "alex.junior made a schema change at 01:23 UTC that cascaded into failures.",
  "root_cause_actor": "alex.junior",
  "root_cause_column": "amount_usd",
  "root_cause_description": "dropped column amount_usd",
  "blast_radius": {
    "pipelines_failed": 3,
    "tables_no_data": 4,
    "dq_tests_failing": 2
  },
  "failure_sequence": [
    {"time": "02:15", "what": "Pipeline Failed: revenue_etl_pipeline — column reference error"},
    {"time": "03:50", "what": "DQ Test Failed: orders_amount_must_not_be_null — assertion failed"}
  ],
  "recommended_action": "Restore amount_usd column OR update 3 pipeline queries to reference the replacement column instead.",
  "timeline": [...]
}

Simulator

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/simulator/analyze | Run a simulation — returns Blast Radius, SQL replay, runbook |
| GET | /api/simulations | List recent simulation results |

Analyze example:

curl -X POST http://localhost:8000/api/simulator/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "entity_fqn": "ecommerce_warehouse.shopflow.sales.orders",
    "change_type": "drop_column",
    "target": "amount_usd"
  }'

{
  "impact_score": 74,
  "severity": "HIGH",
  "downstream_count": 12,
  "blast_radius_breakdown": {
    "tier_score": 1.0,
    "usage_score": 0.72,
    "count_score": 0.60,
    "final_score": 73.6
  },
  "sql_replay": {
    "total_analyzed": 31,
    "will_error": [...],
    "will_silently_change": [...],
    "safe": [...]
  },
  "most_affected_users": [
    {"user": "priya.data", "query_count": 14, "verdicts": {"will_error": 9, "safe": 5}}
  ],
  "runbook": [...]
}

How the Correlation Engine Works

Causal Scoring

Every event in the investigation window gets a causal likelihood score:

score = 0.35 × temporal_weight
      + 0.25 × event_type_weight
      + 0.20 × severity_weight
      + 0.20 × distance_weight
  • Temporal weight: Schema changes score higher earlier in the window (root causes precede effects). Pipeline/test failures score higher near the alert time (cascades accumulate later).
  • Event type weight: schema_change (1.0) > test_failure (0.9) > pipeline_failure (0.85) > team_signal (0.5)
  • Severity weight: error (1.0) > warning (0.6) > info (0.2)
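The weighting above translates to roughly the following, scaled to 0–100 to match the selection thresholds below. The distance_weight (lineage proximity to the alerted entity) is assumed to be precomputed elsewhere, and timestamps are plain numbers for brevity:

```python
def causal_score(event, window_start, alert_time):
    """Causal likelihood of one event, scaled 0-100 (sketch of the weighting above)."""
    span = (alert_time - window_start) or 1
    pos = (event["time"] - window_start) / span  # 0 = window start, 1 = alert time
    # Root causes precede effects: schema changes score higher early in the
    # window, while pipeline/test failures score higher near the alert.
    temporal = 1 - pos if event["type"] == "schema_change" else pos
    type_weight = {"schema_change": 1.0, "test_failure": 0.9,
                   "pipeline_failure": 0.85, "team_signal": 0.5}[event["type"]]
    severity = {"error": 1.0, "warning": 0.6, "info": 0.2}[event["severity"]]
    distance = event.get("distance_weight", 1.0)  # lineage proximity, assumed precomputed
    return 100 * (0.35 * temporal + 0.25 * type_weight
                  + 0.20 * severity + 0.20 * distance)
```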

Root Cause Selection

The top-scoring schema change is selected as root cause if it is dominant (scoring ≥ 1.4× the second candidate) or plausible (absolute score ≥ 55). Both thresholds are global; nothing is hardcoded per entity or incident type.

Causal Chain Validation

Confidence receives a chain bonus (up to +30%) when:

  1. Pipeline/test failures exist after the root cause timestamp
  2. Failed queries reference the changed column by name
  3. Failed pipeline names overlap with known downstream lineage nodes
  4. Gap between root cause and first failure is ≤ 4 hours
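A sketch of that bonus, assuming each of the four checks contributes equally and timestamps are epoch seconds (field names here are hypothetical; the real validator may weight the checks differently):

```python
def chain_bonus(root_cause, failures, downstream_nodes, max_gap_hours=4):
    """Confidence bonus (0.0-0.30) from the four causal-chain checks above."""
    after = [f for f in failures if f["time"] > root_cause["time"]]
    checks = [
        bool(after),                                                     # 1. failures follow the root cause
        any(root_cause["column"] in f.get("query", "") for f in after),  # 2. a failed query names the column
        any(f.get("pipeline") in downstream_nodes for f in after),       # 3. failed pipelines are downstream
        bool(after) and min(f["time"] for f in after) - root_cause["time"]
                        <= max_gap_hours * 3600,                         # 4. first failure within the gap
    ]
    return 0.30 * sum(checks) / len(checks)
```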

Blast Radius Scoring

score = 0.35 × max(tier_weights of downstream assets)
      + 0.30 × log10(1 + total_query_runs) / 5
      + 0.20 × downstream_asset_count / 20
      + 0.15 × activity_factor

Tier1 downstream assets (weight 1.0) dominate the score, ensuring business-critical tables always surface as high risk.
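In code, the formula reads roughly as follows; clamping each component at 1.0, so the total stays within 0–100, is an assumption on our part:

```python
import math

def blast_radius_score(tier_weights, total_query_runs, downstream_count, activity_factor):
    """Blast Radius 0-100, combining the four weighted signals above."""
    tier = max(tier_weights, default=0.0)                   # Tier1 assets (1.0) dominate
    usage = min(math.log10(1 + total_query_runs) / 5, 1.0)  # saturates near 100k runs
    count = min(downstream_count / 20, 1.0)                 # saturates at 20 assets
    activity = min(activity_factor, 1.0)
    return round(100 * (0.35 * tier + 0.30 * usage
                        + 0.20 * count + 0.15 * activity), 1)
```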


Repository Structure

breaksight/
├── backend/
│   ├── app/
│   │   ├── main.py               # FastAPI app + CORS + lifespan
│   │   ├── api/
│   │   │   ├── time_machine.py   # All Time Machine routes
│   │   │   ├── simulator.py      # Simulator analyze + history
│   │   │   ├── tables.py         # Table schema + column fetch
│   │   │   └── dq_seed.py        # DQ test seeding helper
│   │   └── core/
│   │       ├── timeline.py       # Version snapshot fetcher
│   │       ├── blast_radius.py   # Lineage BFS + Blast Radius scorer
│   │       ├── sql_replay.py     # sqlglot query impact classifier
│   │       ├── schema_diff.py    # Column-level version differ
│   │       ├── runbook.py        # Remediation step generator
│   │       └── om_client.py      # OpenMetadata async HTTP client
│   └── tests/
├── frontend/
│   └── app/
│       ├── page.tsx              # Landing portal
│       ├── time-machine/page.tsx # Timeline, investigate, health tabs
│       └── simulator/page.tsx    # Blast radius gauge, SQL replay panel
├── ingestion/
│   └── ingest.py                 # OpenMetadata → normalized SQLite events
├── demo-data/
│   └── seed.py                   # Realistic ecommerce incident scenario
├── .env.example
├── Makefile
└── README.md

Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | Next.js 14, TypeScript (strict), Tailwind CSS, shadcn/ui, Recharts, framer-motion |
| Backend | Python 3.11, FastAPI, Pydantic v2, SQLAlchemy 2.0, SQLite |
| HTTP | httpx (async OpenMetadata client) |
| SQL analysis | sqlglot (dialect-aware query parser and impact classifier) |
| Packaging | uv (backend), npm (frontend) |
| Infrastructure | Docker Compose |

Built for the OpenMetadata Hackathon (Apr 17–26, 2026)

Team: S Nithiyashri · Subhashri S

The Correlation Engine is fully generic — it takes any (entity_fqn, incident_time, window) and investigates dynamically against live OpenMetadata data. Zero hardcoded incident knowledge in the engine. All scenario specifics live in seed data only.
