subhashris/openmetadata-hackathon
BreakSight — Time Machine + Simulator for OpenMetadata

Find the variant that broke your data timeline.

BreakSight extends OpenMetadata with two interconnected tools that answer the two questions every data engineer dreads:

  • "What broke my pipeline at 3 AM?" → Time Machine walks backward through schema changes, DQ failures, pipeline runs, and team signals to reconstruct the incident timeline and name the root cause — in under 30 seconds.
  • "Is it safe to drop this column?" → Simulator walks forward through downstream lineage, replays real historical queries against the proposed schema, scores a Blast Radius (0–100), and hands you a runbook before you merge the PR.

Both tools share one Correlation Engine. Time Machine walks it backward; Simulator walks it forward. Same engine, opposite time directions.


The Problem

Schema changes are the silent killers of data pipelines. A column dropped in a hotfix at 01:00 breaks a revenue dashboard by 03:00, triggers a PagerDuty alert by 04:00, and costs an on-call engineer 2 hours of git blame and log-grep before they pinpoint the culprit. OpenMetadata tracks every version, every lineage edge, every DQ test — but there is no tool that stitches those signals into a causal timeline or predicts the blast radius of a change before it ships.

BreakSight is that tool.


Solution

Time Machine — Backward Causal Tracing

Pick any table in your catalog, set an alert time, and Time Machine:

  1. Fetches every schema version from /api/v1/tables/{id}/versions in the lookback window
  2. Pulls DQ test failures from /api/v1/dataQuality/testCases
  3. Fetches pipeline run statuses (OM Airflow connector or seeded SQLite events)
  4. Reads conversation signals from /api/v1/feed
  5. Scores every event with a causal likelihood model (temporal position × event type × severity)
  6. Assigns phases: ROOT_CAUSE → FIRST_FAILURE → CASCADE → BUSINESS_IMPACT
  7. Validates the causal chain (failures must follow the root cause, reference the changed column, overlap with downstream lineage)
  8. Returns a confidence score and a one-sentence recommended action

Key metric: Mean Time to Confidence drops from ~2 hours to ~23 seconds.
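The phase labeling in step 6 can be sketched as follows. This is an illustrative simplification (hypothetical event dicts with id, time, and type keys), not the engine's actual code:

```python
def assign_phases(events, root_cause):
    """Label scored events with incident phases (step 6): the root cause first,
    then the earliest downstream failure, then the cascade, then business impact."""
    phases = {root_cause["id"]: "ROOT_CAUSE"}
    failures = sorted(
        (e for e in events
         if e["type"] in ("pipeline_failure", "test_failure")
         and e["time"] > root_cause["time"]),
        key=lambda e: e["time"],
    )
    for i, event in enumerate(failures):
        phases[event["id"]] = "FIRST_FAILURE" if i == 0 else "CASCADE"
    for event in events:  # team signals after the root cause mark business impact
        if event["type"] == "team_signal" and event["time"] > root_cause["time"]:
            phases[event["id"]] = "BUSINESS_IMPACT"
    return phases
```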

Simulator — Forward Impact Prediction

Pick a table, pick a change type (drop column, rename, type change, nullable, drop table), and Simulator:

  1. Fetches current schema from /api/v1/tables/name/{fqn} with column metadata
  2. Walks downstream lineage 3 hops via /api/v1/lineage/table/name/{fqn}
  3. Fetches tier, owner, and domain per downstream asset to weight risk
  4. Replays every historical SQL query from the local event store against the proposed schema (via sqlglot) — classifying each as will_error, will_silently_change, safe, or could_not_analyze
  5. Scores Blast Radius 0–100: 0.35 × tier_weight + 0.30 × usage_score + 0.20 × count_score + 0.15 × activity
  6. Lists affected users ranked by query count
  7. Generates a step-by-step runbook (notify owners → coordinate migration → validate → monitor)

Interconnection

The two tools are linked. From the Simulator result you can jump directly to Time Machine for any affected downstream entity. From a Time Machine schema_change event you can launch the Simulator to model a rollback.


Architecture

Browser (Next.js 14 + TypeScript + Tailwind + shadcn/ui)
  /             → Landing portal (BreakSight)
  /time-machine → Backward causal trace, schema health, version diff
  /simulator    → Forward impact prediction, blast radius, SQL replay
        |
        | REST JSON (localhost:8000)
        v
FastAPI + Python 3.11
  Correlation Engine
    ├── timeline.py        — version snapshot fetcher + differ
    ├── blast_radius.py    — lineage BFS + tier/usage scorer
    ├── sql_replay.py      — sqlglot-based query impact classifier
    ├── schema_diff.py     — column-level diff between versions
    └── runbook.py         — change-type-aware remediation generator
  SQLite (breaksight.db)   — normalized event store
        |
        | HTTP + OM JWT
        v
OpenMetadata (localhost:8585)
  /api/v1/tables/{id}/versions      — schema history
  /api/v1/lineage/table/name/{fqn} — column-level lineage
  /api/v1/dataQuality/testCases    — DQ test results
  /api/v1/queries                  — query logs for SQL replay
  /api/v1/feed                     — conversation signals
  /api/v1/usage/{entity}           — usage stats
  /api/v1/search/query             — entity picker autocomplete

No Kafka. No graph database. No ML framework. No microservices.


OpenMetadata APIs Used

| API | Purpose |
| --- | --- |
| GET /api/v1/tables/{id}/versions | Schema version history — the primary root-cause signal |
| GET /api/v1/lineage/table/name/{fqn} | Downstream lineage walk (Blast Radius + impact scope) |
| GET /api/v1/dataQuality/testCases | DQ test failures as causal evidence |
| GET /api/v1/queries | Real historical SQL for replay classification |
| GET /api/v1/feed | Team conversation signals (escalations, alerts) |
| GET /api/v1/usage/{entity} | Usage weight in Blast Radius scoring |
| GET /api/v1/search/query | Entity picker in the UI |
| GET /api/v1/tables | Cross-table changelog feed |
| GET /api/v1/pipelines/{id}/status | Live pipeline run statuses |

Quick Start

Prerequisites

  • OpenMetadata running at http://localhost:8585
  • Python 3.11+ with uv
  • Node.js 18+

Setup

git clone <this-repo>
cd breaksight

# Configure environment
cp .env.example .env
# Edit .env — set OM_TOKEN (get from OM UI → avatar → Access Token)

# Install dependencies
make install

# Seed demo data (creates realistic ecommerce incident scenario)
make seed

# Start backend (:8000) and frontend (:3000) in parallel
make up

Open http://localhost:3000.

Docker (all-in-one)

make demo
# Waits for services, seeds data, opens on :3000

Health check

curl http://localhost:8000/health

Tests

make test

Key API Endpoints

Time Machine

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/time-machine/investigate | Main investigate endpoint — backward + forward scan |
| GET | /api/time-machine/trace-backward | Raw backward causal trace with phased timeline |
| GET | /api/time-machine/risk-assessment | Flattened risk card (confidence, blast radius, failure sequence) |
| GET | /api/timeline/{fqn} | Version history with diffs for a table |
| GET | /api/diff/{fqn} | Column-level diff between two versions |
| GET | /api/schema-health/{fqn} | Schema health score (A–F), risk factors, recommendations |
| GET | /api/dq-tests/{fqn} | Data quality test results for a table |
| GET | /api/changelog | Cross-table schema changelog (50 most recent) |
| GET | /api/usage/{fqn} | Query usage stats per user |
| GET | /api/alerts/active | Active DQ and pipeline alert counts |

Investigate example:

curl "http://localhost:8000/api/time-machine/investigate?\
entity_fqn=ecommerce_warehouse.shopflow.sales.orders&\
alert_time=2026-04-20T04:00:00Z&\
lookback_days=2"

{
  "confidence": 87,
  "confidence_label": "HIGH",
  "root_cause_summary": "alex.junior made a schema change at 01:23 UTC that cascaded into failures.",
  "root_cause_actor": "alex.junior",
  "root_cause_column": "amount_usd",
  "root_cause_description": "dropped column amount_usd",
  "blast_radius": {
    "pipelines_failed": 3,
    "tables_no_data": 4,
    "dq_tests_failing": 2
  },
  "failure_sequence": [
    {"time": "02:15", "what": "Pipeline Failed: revenue_etl_pipeline — column reference error"},
    {"time": "03:50", "what": "DQ Test Failed: orders_amount_must_not_be_null — assertion failed"}
  ],
  "recommended_action": "Restore amount_usd column OR update 3 pipeline queries to reference the replacement column instead.",
  "timeline": [...]
}

Simulator

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/simulator/analyze | Run a simulation — returns Blast Radius, SQL replay, runbook |
| GET | /api/simulations | List recent simulation results |

Analyze example:

curl -X POST http://localhost:8000/api/simulator/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "entity_fqn": "ecommerce_warehouse.shopflow.sales.orders",
    "change_type": "drop_column",
    "target": "amount_usd"
  }'

{
  "impact_score": 74,
  "severity": "HIGH",
  "downstream_count": 12,
  "blast_radius_breakdown": {
    "tier_score": 1.0,
    "usage_score": 0.72,
    "count_score": 0.60,
    "final_score": 73.6
  },
  "sql_replay": {
    "total_analyzed": 31,
    "will_error": [...],
    "will_silently_change": [...],
    "safe": [...]
  },
  "most_affected_users": [
    {"user": "priya.data", "query_count": 14, "verdicts": {"will_error": 9, "safe": 5}}
  ],
  "runbook": [...]
}

How the Correlation Engine Works

Causal Scoring

Every event in the investigation window gets a causal likelihood score:

score = 0.35 × temporal_weight
      + 0.25 × event_type_weight
      + 0.20 × severity_weight
      + 0.20 × distance_weight
  • Temporal weight: Schema changes score higher earlier in the window (root causes precede effects). Pipeline/test failures score higher near the alert time (cascades accumulate later).
  • Event type weight: schema_change (1.0) > test_failure (0.9) > pipeline_failure (0.85) > team_signal (0.5)
  • Severity weight: error (1.0) > warning (0.6) > info (0.2)
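The weighting above translates to roughly the following, scaled to 0–100 to match the selection thresholds below. The distance_weight (lineage proximity to the alerted entity) is assumed to be precomputed elsewhere, and timestamps are plain numbers for brevity:

```python
def causal_score(event, window_start, alert_time):
    """Causal likelihood of one event, scaled 0-100 (sketch of the weighting above)."""
    span = (alert_time - window_start) or 1
    pos = (event["time"] - window_start) / span  # 0 = window start, 1 = alert time
    # Root causes precede effects: schema changes score higher early in the
    # window, while pipeline/test failures score higher near the alert.
    temporal = 1 - pos if event["type"] == "schema_change" else pos
    type_weight = {"schema_change": 1.0, "test_failure": 0.9,
                   "pipeline_failure": 0.85, "team_signal": 0.5}[event["type"]]
    severity = {"error": 1.0, "warning": 0.6, "info": 0.2}[event["severity"]]
    distance = event.get("distance_weight", 1.0)  # lineage proximity, assumed precomputed
    return 100 * (0.35 * temporal + 0.25 * type_weight
                  + 0.20 * severity + 0.20 * distance)
```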

Root Cause Selection

The top-scoring schema change is selected as root cause if it is dominant (scoring ≥ 1.4× the second candidate) or plausible (absolute score ≥ 55). Both thresholds are global; nothing is hardcoded per entity or incident type.

Causal Chain Validation

Confidence receives a chain bonus (up to +30%) when:

  1. Pipeline/test failures exist after the root cause timestamp
  2. Failed queries reference the changed column by name
  3. Failed pipeline names overlap with known downstream lineage nodes
  4. Gap between root cause and first failure is ≤ 4 hours
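A sketch of that bonus, assuming each of the four checks contributes equally and timestamps are epoch seconds (field names here are hypothetical; the real validator may weight the checks differently):

```python
def chain_bonus(root_cause, failures, downstream_nodes, max_gap_hours=4):
    """Confidence bonus (0.0-0.30) from the four causal-chain checks above."""
    after = [f for f in failures if f["time"] > root_cause["time"]]
    checks = [
        bool(after),                                                     # 1. failures follow the root cause
        any(root_cause["column"] in f.get("query", "") for f in after),  # 2. a failed query names the column
        any(f.get("pipeline") in downstream_nodes for f in after),       # 3. failed pipelines are downstream
        bool(after) and min(f["time"] for f in after) - root_cause["time"]
                        <= max_gap_hours * 3600,                         # 4. first failure within the gap
    ]
    return 0.30 * sum(checks) / len(checks)
```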

Blast Radius Scoring

score = 0.35 × max(tier_weights of downstream assets)
      + 0.30 × log10(1 + total_query_runs) / 5
      + 0.20 × downstream_asset_count / 20
      + 0.15 × activity_factor

Tier1 downstream assets (weight 1.0) dominate the score, ensuring business-critical tables always surface as high risk.
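In code, the formula reads roughly as follows; clamping each component at 1.0, so the total stays within 0–100, is an assumption on our part:

```python
import math

def blast_radius_score(tier_weights, total_query_runs, downstream_count, activity_factor):
    """Blast Radius 0-100, combining the four weighted signals above."""
    tier = max(tier_weights, default=0.0)                   # Tier1 assets (1.0) dominate
    usage = min(math.log10(1 + total_query_runs) / 5, 1.0)  # saturates near 100k runs
    count = min(downstream_count / 20, 1.0)                 # saturates at 20 assets
    activity = min(activity_factor, 1.0)
    return round(100 * (0.35 * tier + 0.30 * usage
                        + 0.20 * count + 0.15 * activity), 1)
```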


Repository Structure

breaksight/
├── backend/
│   ├── app/
│   │   ├── main.py               # FastAPI app + CORS + lifespan
│   │   ├── api/
│   │   │   ├── time_machine.py   # All Time Machine routes
│   │   │   ├── simulator.py      # Simulator analyze + history
│   │   │   ├── tables.py         # Table schema + column fetch
│   │   │   └── dq_seed.py        # DQ test seeding helper
│   │   └── core/
│   │       ├── timeline.py       # Version snapshot fetcher
│   │       ├── blast_radius.py   # Lineage BFS + Blast Radius scorer
│   │       ├── sql_replay.py     # sqlglot query impact classifier
│   │       ├── schema_diff.py    # Column-level version differ
│   │       ├── runbook.py        # Remediation step generator
│   │       └── om_client.py      # OpenMetadata async HTTP client
│   └── tests/
├── frontend/
│   └── app/
│       ├── page.tsx              # Landing portal
│       ├── time-machine/page.tsx # Timeline, investigate, health tabs
│       └── simulator/page.tsx    # Blast radius gauge, SQL replay panel
├── ingestion/
│   └── ingest.py                 # OpenMetadata → normalized SQLite events
├── demo-data/
│   └── seed.py                   # Realistic ecommerce incident scenario
├── .env.example
├── Makefile
└── README.md

Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | Next.js 14, TypeScript (strict), Tailwind CSS, shadcn/ui, Recharts, framer-motion |
| Backend | Python 3.11, FastAPI, Pydantic v2, SQLAlchemy 2.0, SQLite |
| HTTP | httpx (async OpenMetadata client) |
| SQL analysis | sqlglot (dialect-aware query parser and impact classifier) |
| Packaging | uv (backend), npm (frontend) |
| Infrastructure | Docker Compose |

Built for the OpenMetadata Hackathon (Apr 17–26, 2026)

Team: S Nithiyashri · Subhashri S

The Correlation Engine is fully generic — it takes any (entity_fqn, incident_time, window) and investigates dynamically against live OpenMetadata data. Zero hardcoded incident knowledge in the engine. All scenario specifics live in seed data only.
