Roll back orphaned backfill when run creation fails#68705
Open
Abdulrehman-PIAIC80387 wants to merge 2 commits into
Open
Roll back orphaned backfill when run creation fails#68705Abdulrehman-PIAIC80387 wants to merge 2 commits into
Abdulrehman-PIAIC80387 wants to merge 2 commits into
Conversation
|
Does the test trigger the bug before fix is applied? |
Author
|
Yes — confirmed both directions:
So it's a genuine regression test for the orphaned-backfill behaviour. The full |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When a backfill is created via
POST /api/v2/backfills,_create_backfillcommits theBackfillrow before creating its dag runs. If run creation then fails — the reported case issqlite3.OperationalError: database is lockedunder concurrent requests, but any error would do — the already-committedBackfillrow survives with no/partial runs. Thenum_active > 0check then treats it as an in-progress backfill and blocks all future backfills for that dag with "already running backfill".So this is an atomicity problem, not really SQLite-specific: any failure mid-creation leaves an orphaned, un-removable backfill.
Fix
Make creation self-healing: wrap run creation in a
try/exceptand, on any failure, roll back the in-flight work and delete the orphanedBackfillrow (itsBackfillDagRunrows cascade) before re-raising. A failed creation now leaves no rows behind, so the dag is not blocked.This keeps the existing early commit (which also acts as the concurrency guard for the "one active backfill per dag" check), so it does not change concurrency behaviour — it only ensures a failed attempt cleans up after itself.
Notes
There is a separate, pre-existing concurrency question (two simultaneous requests both passing the
num_activecheck) that would need a DB-level guard to fully close; I raised both options on the issue. This PR deliberately scopes to the orphaned-backfill bug that's actually reported.Tests
Added
test_create_backfill_no_orphan_on_run_creation_failure: forces a failure during run creation and asserts noBackfillrow remains and a subsequent backfill succeeds.closes: #68699