Roll back orphaned backfill when run creation fails #68705

Open
Abdulrehman-PIAIC80387 wants to merge 2 commits into apache:main from Abdulrehman-PIAIC80387:fix-backfill-partial-creation-atomicity

@Abdulrehman-PIAIC80387 Abdulrehman-PIAIC80387 commented Jun 18, 2026

When a backfill is created via POST /api/v2/backfills, _create_backfill commits the Backfill row before creating its dag runs. If run creation then fails (the reported case is sqlite3.OperationalError: database is locked under concurrent requests, but any error would do), the already-committed Backfill row survives with no runs or only some of them. The num_active > 0 check then treats it as an in-progress backfill and blocks all future backfills for that dag with "already running backfill".

So this is an atomicity problem, not really SQLite-specific: any failure mid-creation leaves an orphaned, un-removable backfill.
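
A minimal sketch of the failure mode, using the stdlib sqlite3 module rather than Airflow's actual models. The table layout, the `completed_at IS NULL` definition of "active", and the names `create_backfill_unsafe` and `failing_runs` are assumptions made for illustration, not the real `_create_backfill` code.

```python
import sqlite3


class AlreadyRunningBackfill(Exception):
    """Stand-in for the "already running backfill" error (hypothetical here)."""


def make_db():
    # Simplified, hypothetical schema: one row per backfill.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE backfill ("
        " id INTEGER PRIMARY KEY, dag_id TEXT NOT NULL, completed_at TEXT);"
    )
    return conn


def create_backfill_unsafe(conn, dag_id, create_runs):
    # Assumed shape of the active-backfill check.
    num_active = conn.execute(
        "SELECT COUNT(*) FROM backfill WHERE dag_id = ? AND completed_at IS NULL",
        (dag_id,),
    ).fetchone()[0]
    if num_active > 0:
        raise AlreadyRunningBackfill(dag_id)
    cur = conn.execute("INSERT INTO backfill (dag_id) VALUES (?)", (dag_id,))
    conn.commit()  # early commit: the backfill row is now durable
    create_runs(conn, cur.lastrowid)  # if this raises, the row above stays
    conn.commit()
    return cur.lastrowid


def failing_runs(conn, backfill_id):
    # Simulates the reported error during run creation.
    raise sqlite3.OperationalError("database is locked")


conn = make_db()
try:
    create_backfill_unsafe(conn, "example_dag", failing_runs)
except sqlite3.OperationalError:
    pass  # the request fails, but the backfill row was already committed

# The orphaned row now counts as "active", so the retry is rejected.
try:
    create_backfill_unsafe(conn, "example_dag", lambda c, b: None)
    retry_blocked = False
except AlreadyRunningBackfill:
    retry_blocked = True
```

In this sketch nothing ever marks the orphaned row as completed, which is why the dag stays blocked until the row is removed by hand.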

Fix

Make creation self-healing: wrap run creation in a try/except and, on any failure, roll back the in-flight work and delete the orphaned Backfill row (its BackfillDagRun rows cascade) before re-raising. A failed creation now leaves no rows behind, so the dag is not blocked.
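
The cleanup pattern described above can be sketched roughly as follows. This uses stdlib sqlite3 and a simplified, hypothetical two-table schema; `create_backfill`, `insert_run` and `flaky_run` are illustrative names, and the actual change in the PR operates on Airflow's session and models rather than on raw SQL.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE
    conn.executescript(
        "CREATE TABLE backfill (id INTEGER PRIMARY KEY, dag_id TEXT NOT NULL);"
        "CREATE TABLE backfill_dag_run ("
        " id INTEGER PRIMARY KEY,"
        " backfill_id INTEGER NOT NULL REFERENCES backfill(id) ON DELETE CASCADE,"
        " logical_date TEXT NOT NULL);"
    )
    return conn


def create_backfill(conn, dag_id, dates, create_run):
    cur = conn.execute("INSERT INTO backfill (dag_id) VALUES (?)", (dag_id,))
    backfill_id = cur.lastrowid
    conn.commit()  # the early commit is kept, as in the PR description
    try:
        for logical_date in dates:
            create_run(conn, backfill_id, logical_date)
        conn.commit()
    except Exception:
        conn.rollback()  # discard any uncommitted run rows
        # Delete the orphaned backfill; committed child rows would cascade.
        conn.execute("DELETE FROM backfill WHERE id = ?", (backfill_id,))
        conn.commit()
        raise  # the caller still sees the original error
    return backfill_id


def insert_run(conn, backfill_id, logical_date):
    conn.execute(
        "INSERT INTO backfill_dag_run (backfill_id, logical_date) VALUES (?, ?)",
        (backfill_id, logical_date),
    )


def flaky_run(conn, backfill_id, logical_date):
    # Inserts the run, then fails on the second date to simulate a mid-way error.
    insert_run(conn, backfill_id, logical_date)
    if logical_date == "2026-01-02":
        raise sqlite3.OperationalError("database is locked")
```

Calling `create_backfill(conn, "example_dag", ["2026-01-01", "2026-01-02"], flaky_run)` re-raises the error and leaves both tables empty, so a later call with `insert_run` is free to succeed.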

This keeps the existing early commit (which also acts as the concurrency guard for the "one active backfill per dag" check), so it does not change concurrency behaviour — it only ensures a failed attempt cleans up after itself.

Notes

There is a separate, pre-existing concurrency question (two simultaneous requests both passing the num_active check) that would need a DB-level guard to fully close; I raised both options on the issue. This PR deliberately scopes to the orphaned-backfill bug that's actually reported.
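
For illustration only, one possible shape of a DB-level guard is a partial unique index over active backfills. This is an assumption, not necessarily either of the options raised on the issue; the schema, the index name and `try_start_backfill` are hypothetical, and partial indexes are available in SQLite and PostgreSQL but not in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE backfill ("
    " id INTEGER PRIMARY KEY, dag_id TEXT NOT NULL, completed_at TEXT);"
    # At most one row per dag_id may have completed_at IS NULL.
    "CREATE UNIQUE INDEX one_active_backfill_per_dag"
    " ON backfill (dag_id) WHERE completed_at IS NULL;"
)


def try_start_backfill(conn, dag_id):
    """Return True if the row was inserted, False if an active one exists."""
    try:
        conn.execute("INSERT INTO backfill (dag_id) VALUES (?)", (dag_id,))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # The database, not application code, rejected the second active row.
        conn.rollback()
        return False


first = try_start_backfill(conn, "example_dag")
second = try_start_backfill(conn, "example_dag")
```

With a constraint like this, two requests that both pass an application-level count check would still be unable to both insert, because the second insert violates the index.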

Tests

Added test_create_backfill_no_orphan_on_run_creation_failure: forces a failure during run creation and asserts no Backfill row remains and a subsequent backfill succeeds.

closes: #68699

@mwisnicki

Does the test trigger the bug before the fix is applied?

@Abdulrehman-PIAIC80387

Author

Yes — confirmed both directions:

  • Without the fix (reverting just the _create_backfill change, keeping the test): it fails — the orphaned Backfill row survives the run-creation failure, so the count == 0 assertion fails (and a retry would hit AlreadyRunningBackfill).
  • With the fix: it passes.

So it's a genuine regression test for the orphaned-backfill behaviour. The full test_backfill.py suite also passes (84 passed).

Development

Successfully merging this pull request may close these issues.

Concurrent POST /api/v2/backfills causes HTTP 500 + partial data with SQLite metadata DB

2 participants