
[Bug] Riven v0.23.6 pipeline silently stalls — 8 bugs found and fixed: stream blacklisting on RD 429, broken DB sessions, ghost-Completed seasons, spin loop, season decomposition, PG timeout kills, add_torrent crash, watchlist RSS blocked #1376

@rshearer1

Description

Summary

The Riven pipeline (v0.23.6) silently stalls after an initial burst of processing. No errors, no crashes — the main event loop continues running, retry_library re-queues items every 60 seconds, but process_event() returns (None, []) for every item, producing zero futures. Items never progress beyond their current state.

Deep investigation over multiple debugging sessions revealed 8 distinct bugs — 5 cascading bugs causing the initial stall, plus 3 additional bugs discovered during extended operation. Each upstream failure corrupted state that downstream code couldn't recover from.

Result after patching: Pipeline fully operational — completed items rose from 6,823 (stalled) to 7,018 (counts later fluctuate as new shows are imported and Season 0 specials were cleaned up), processing new content within seconds of watchlist additions. Actively downloading and symlinking new shows (e.g., the complete 2 Broke Girls series downloaded in ~20 minutes).


Environment

  • Container: DUMB (iampuid0/dumb:latest)
  • Riven: v0.23.6
  • PostgreSQL: pool_pre_ping=True, pool_size=25, max_overflow=25
  • Debrid provider: RealDebrid
  • Scrapers: Torrentio, Comet, Zilean, Prowlarr (10 indexers)
  • Library updater: Plex

Bug 1 — PendingRollbackError: Broken DB Sessions

Severity: Critical — silently breaks all state persistence on affected threads
File: src/program/db/db_functions.py

Trigger

PostgreSQL connection drop (network hiccup, timeout, server restart) leaves the SQLAlchemy PatchedScopedSession (thread-local) in an invalid state.

What happens

  1. Session enters PendingRollbackError state
  2. ALL subsequent store_state() calls on that thread fail silently
  3. Items appear to process (logs show activity) but their state never advances in the database
  4. The thread is permanently broken until the session is manually rolled back
  5. Since sessions are thread-local, other threads continue working — making the bug intermittent and hard to detect

Impact

Items cycle through the pipeline repeatedly, consuming resources, but never actually progress. The heartbeat shows running=N but completed count never increases.

Fix

Added retry loop (2 attempts) around run_thread_with_db_item():

import time

from sqlalchemy.exc import PendingRollbackError, OperationalError

# In run_thread_with_db_item():
for attempt in range(2):
    try:
        # Safety rollback to clear any lingering transaction state
        session.rollback()
        
        # ... existing processing logic ...
        
        break  # Success, exit retry loop
    except (PendingRollbackError, OperationalError) as e:
        if attempt < 1:
            time.sleep(1)  # Brief pause; session is rolled back at the
                           # top of the next attempt
        else:
            logger.error(f"DB operation failed after retries: {e}")

Key principle

Thread-local scoped sessions need per-operation safety rollbacks. A PendingRollbackError on one thread doesn't affect others, but that thread is permanently broken until the session is explicitly rolled back.
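As a standalone illustration of this pattern, here is a minimal sketch. The names (`run_with_session_retry`, `SessionBrokenError`) are illustrative, not Riven's actual API, and a stand-in exception replaces SQLAlchemy's `PendingRollbackError` so the sketch runs without a database:

```python
import time

class SessionBrokenError(Exception):
    """Stand-in for SQLAlchemy's PendingRollbackError / OperationalError."""

def run_with_session_retry(session, operation, retryable=(SessionBrokenError,),
                           attempts=2, delay=0.0):
    """Clear any broken transaction state, run `operation`, retry on
    session errors. `session` needs only a .rollback() method."""
    last_exc = None
    for attempt in range(attempts):
        try:
            session.rollback()  # safety rollback clears broken-session state
            return operation()
        except retryable as e:
            last_exc = e
            if attempt < attempts - 1:
                time.sleep(delay)  # brief pause before the next attempt
    raise last_exc  # surface the failure instead of silently dropping it
```

The key detail is that the rollback happens at the top of every attempt, so even the retry starts from a clean session.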


Bug 2 — Stream Blacklisting on RealDebrid HTTP 429

Severity: Critical — permanent data destruction on transient errors
Files: src/program/services/downloaders/models.py, src/program/services/downloaders/realdebrid.py, src/program/services/downloaders/__init__.py

Trigger

RealDebrid returns HTTP 429 (rate limit) during get_instant_availability().

What happens

  1. get_instant_availability() returns None on 429
  2. validate_stream() interprets None as "stream is invalid"
  3. Stream is blacklisted — the StreamRelation row is DELETEd from the database
  4. Zero remaining streams → store_state() regresses items from Scraped → Indexed
  5. Regression is permanent — streams are deleted, requiring a full re-scrape
  6. Observable in DB: Scraped count decreases (483→481) while Indexed increases (357→359)

Impact

Valid streams are permanently destroyed because of a temporary rate limit. Items that were ready to download are kicked all the way back to the scraping stage, and even re-scraping may not recover the same streams.

Fix

Introduced DebridTransientError exception class to distinguish transient failures from permanent ones:

# models.py
class DebridTransientError(Exception):
    """Raised on transient RD errors (429, 5xx) - do NOT blacklist streams."""
    pass

# realdebrid.py - in get_instant_availability():
msg = str(e.args[0]) if e.args else str(e)
if " 429 " in msg or "Rate Limit Exceeded" in msg:
    raise DebridTransientError(f"RD rate limit for {infohash}")

# downloaders/__init__.py - in validate_stream():
try:
    result = debrid.get_instant_availability(stream.infohash, item)
except DebridTransientError as e:
    logger.debug(f"Skipping stream {stream.infohash} (transient RD error, NOT blacklisting): {e}")
    return None  # Don't blacklist on transient errors!

# downloaders/__init__.py - in Downloader.run():
try:
    # ... stream processing ...
except DebridTransientError as e:
    logger.debug(f"RD transient error for {stream.infohash}, NOT blacklisting: {e}")
    break  # Stop trying more streams, but don't blacklist any

Key principle

Transient debrid API failures must NEVER destroy stream data. HTTP 429 is fundamentally different from "stream doesn't exist." The code path must branch on failure type — transient errors skip processing, permanent errors blacklist.


Bug 3 — "Ghost Completed" Seasons (State Inconsistency)

Severity: Critical — ROOT CAUSE of the pipeline stall
File: src/program/state_transition.py

Trigger

Bug 2's state regression moved episodes backward (Scraped → Indexed), but parent season/show states were never updated to reflect the regression.

What happens

  1. Seasons marked Completed contain episodes actually at Indexed state (due to Bug 2's regression)
  2. process_event() for shows filters seasons by checking parent state:
    if s.last_state not in [States.Completed, States.Unreleased]:
  3. "Ghost Completed" seasons pass this filter — they look Completed so they're skipped entirely
  4. All episodes within those seasons are permanently trapped — invisible to the pipeline
  5. The main loop runs, events are consumed and processed, but every show item produces zero actionable sub-items

Impact

This is the root cause of the observed stall. 195 episodes across 13 seasons were permanently trapped in ghost-Completed seasons. The pipeline appeared to be running (heartbeat active, retry_library cycling) but produced zero actual work.

Affected shows included: Disenchantment, Regular Show (S1-S8), Arrested Development, Stranger Things, Ren & Stimpy.

Fix

Two-part fix:

1. One-time DB repair — Updated 13 ghost-Completed seasons to PartiallyCompleted:

UPDATE "MediaItem" SET last_state = 'PartiallyCompleted'
WHERE id IN (SELECT s.id FROM "Season" s JOIN "MediaItem" ms ON ms.id = s.id
             WHERE ms.last_state = 'Completed'
             AND EXISTS (SELECT 1 FROM "Episode" e JOIN "MediaItem" me ON me.id = e.id
                         WHERE e.parent_id = s.id AND me.last_state NOT IN ('Completed', 'Unreleased')));

2. Code fix — Show handler now decomposes to individual episodes, checking actual episode state instead of trusting parent season state:

# Before (broken) — trusts parent season state:
for s in item.seasons:
    if s.last_state not in [States.Completed, States.Unreleased]:
        process_event(emitted_by, s, None)

# After (fixed) — checks actual episode states:
for season in item.seasons:
    if season.last_state == States.Unreleased:
        continue
    if hasattr(season, 'episodes') and season.episodes:
        for episode in season.episodes:
            if episode.last_state != States.Completed:
                _, sub_items = process_event(emitted_by, episode, None)

Key principle

Never trust parent state aggregation. Parent (season/show) state is a cache of child states that can become stale when children regress. Stale parent state creates invisible "ghost" items that the pipeline skips forever.
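One way to make this concrete: recompute the parent state from the children's actual states on every child change, including regressions. The sketch below uses a hypothetical `States` enum mirroring the names in this report; Riven's real enum and aggregation rules may differ.

```python
from enum import Enum

class States(Enum):
    Unreleased = "Unreleased"
    Indexed = "Indexed"
    Scraped = "Scraped"
    Completed = "Completed"
    PartiallyCompleted = "PartiallyCompleted"

def derive_season_state(episode_states):
    """Recompute a season's state from its episodes' ACTUAL states.
    Called on every child change — including regressions — so a season
    can never remain 'Completed' while an episode has fallen back."""
    released = [s for s in episode_states if s is not States.Unreleased]
    if not released:
        return States.Unreleased
    if all(s is States.Completed for s in released):
        return States.Completed
    if any(s is States.Completed for s in released):
        return States.PartiallyCompleted
    # No completed episodes: report the least-advanced released state so
    # a Scraped -> Indexed regression is visible at the season level.
    return States.Indexed if States.Indexed in released else States.Scraped
```

With this rule, a "Completed" season containing an Indexed episode immediately becomes PartiallyCompleted — exactly the repair the one-time SQL performed.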


Bug 4 — submit_job() Spin Loop

Severity: High — infinite CPU-bound loop, zero progress
File: src/program/managers/event_manager.py

Trigger

After fixing Bug 3, 195+ previously trapped items flood the pipeline simultaneously. When submit_job() hits the pending buffer/future cap, it re-queues events.

What happens

  1. submit_job() detects pending cap exceeded → re-queues the event
  2. Re-queued event has run_at = now (no delay)
  3. Main loop immediately pops the event → re-submits → cap hit → re-queue → repeat
  4. Infinite tight loop at 100% CPU, processing zero items
  5. Observable: heartbeat shows queue=N fluctuating but futures=0 forever

Fix

Added 30-second delay to re-queued events:

event.run_at = datetime.now() + timedelta(seconds=30)

This gives the executor pool time to drain before retrying, preventing the spin loop while still ensuring items are eventually processed.

Key principle

Any re-queue mechanism MUST include a delay. Without it, re-queued items create tight loops that consume CPU without making progress.
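A minimal sketch of the idea (a `run_at`-aware queue, not Riven's actual EventManager — all names here are illustrative): events re-queued with a future `run_at` are invisible to the pop until due, so the main loop cannot spin on them.

```python
import heapq
from datetime import datetime, timedelta

class DelayedQueue:
    """Minimal run_at-aware event queue sketch."""
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so heapq never compares event payloads

    def push(self, event, run_at=None):
        heapq.heappush(self._heap, (run_at or datetime.now(), self._seq, event))
        self._seq += 1

    def pop_ready(self, now=None):
        now = now or datetime.now()
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[2]
        return None  # nothing due — the loop can sleep instead of spinning

def submit_or_requeue(queue, event, try_submit, delay_seconds=30):
    """On a rejected submission (executor cap hit), re-queue WITH a delay."""
    if try_submit(event):
        return True
    queue.push(event, run_at=datetime.now() + timedelta(seconds=delay_seconds))
    return False
```

Without the `run_at` check in `pop_ready()`, a rejected event would be popped again on the very next loop iteration — the spin loop described above.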


Bug 5 — Season-Level Submission to Downloader

Severity: Medium — TypeError/incorrect behavior in downstream services
File: src/program/state_transition.py

Trigger

The initial decomposition fix (Bug 3) passed whole season objects to process_event() instead of individual episodes.

What happens

Downstream services (Downloader, Symlinker) receive season objects but are designed to handle movie or episode items. This causes:

  • TypeError when accessing episode-specific attributes
  • Incorrect file matching (season has no file attribute)
  • Silent failures where the service returns None

Fix

Decomposition always iterates to individual episodes before calling process_event():

# Always decompose seasons to episodes
for season in item.seasons:
    if season.last_state == States.Unreleased:
        continue
    for episode in season.episodes:
        if episode.last_state != States.Completed:
            _, sub_items = process_event(emitted_by, episode, None)

Key principle

State transition decomposition must always reach the atomic level (episode/movie) before submitting to service handlers that expect individual items.


Bug 6 — PostgreSQL Per-Database idle_in_transaction_session_timeout

Severity: High — kills active connections during normal operation
File: PostgreSQL configuration (not Riven code)

Trigger

PostgreSQL's idle_in_transaction_session_timeout was set to 5 minutes at the DATABASE level (via ALTER DATABASE), overriding any postgresql.conf settings.

What happens

  1. Riven opens a transaction for a batch operation (e.g., processing multiple episodes)
  2. Between individual item processing, the connection sits idle-in-transaction
  3. After 5 minutes, PostgreSQL forcibly terminates the connection
  4. The terminated connection feeds back into Bug 1 (PendingRollbackError)
  5. Any in-progress batch operation is silently aborted
  6. Symlinks created during the transaction may be left without corresponding DB state updates

Impact

This was the root cause of the Sopranos symlink loss — a batch state reset operation was killed mid-transaction, leaving 73 out of 86 episodes with their DB state updated but symlinks never created.

Fix

ALTER DATABASE riven SET idle_in_transaction_session_timeout = '30min';
ALTER DATABASE riven SET statement_timeout = '10min';

30 minutes provides ample time for large batch operations while still protecting against truly abandoned transactions.
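Independent of the timeout value, keeping transactions short avoids the idle window entirely. A hedged sketch of per-item commits (the `session` here only needs `commit()`/`rollback()`, mimicking a SQLAlchemy session; `process_batch` and `handle_item` are hypothetical names, not Riven's code):

```python
def process_batch(session, items, handle_item):
    """Commit after each item so the connection never sits
    idle-in-transaction between items."""
    done, failed = 0, 0
    for item in items:
        try:
            handle_item(item)
            session.commit()    # close the transaction before any idle gap
            done += 1
        except Exception:
            session.rollback()  # never leave a failed transaction open
            failed += 1
    return done, failed
```

This trades batch atomicity for resilience: a killed connection loses at most one item's work instead of silently aborting the whole batch (the Sopranos failure mode above).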

How to detect

-- Check per-database timeout overrides (these override postgresql.conf!)
SELECT d.datname, s.setconfig
FROM pg_db_role_setting s
JOIN pg_database d ON d.oid = s.setdatabase
WHERE s.setconfig IS NOT NULL;

Key principle

Per-database ALTER DATABASE ... SET overrides are invisible in postgresql.conf and SHOW only reveals them when connected to that specific database. Always check pg_db_role_setting for hidden overrides.


Bug 7 — add_torrent() AttributeError on Missing .response

Severity: Medium — crashes torrent addition on certain error types
File: src/program/services/downloaders/realdebrid.py

Trigger

RealDebrid's add_torrent() raises an exception where e.response is None (e.g., network timeout, DNS failure).

What happens

  1. add_torrent() catches exceptions and tries to read e.response.status_code
  2. When the exception has no .response attribute (or it's None), this raises AttributeError
  3. The AttributeError propagates unhandled, crashing the download attempt
  4. The item is left in an inconsistent state

Original code (broken)

except Exception as e:
    if e.response.status_code == 503:  # AttributeError if e.response is None
        ...

Fix

Use safe attribute access with getattr():

except Exception as e:
    status_code = getattr(getattr(e, 'response', None), 'status_code', None)
    if status_code == 503:
        ...
    elif status_code == 429:
        ...
    elif status_code == 404:
        ...
    elif status_code == 400:
        ...
    elif status_code == 502:
        ...
    # Falls through gracefully if status_code is None

Key principle

Exception handlers must never assume the exception object has specific attributes. Always use defensive access patterns (getattr, hasattr, or try/except) when inspecting exception properties.


Bug 8 — Plex Watchlist API Failure Blocks RSS Feed

Severity: High — new content never imported despite working RSS
File: src/program/services/content/plex_watchlist.py

Trigger

The Plex metadata API returns HTTP 404 for the watchlist endpoint (Section 'watchlist' not found!). This appears to happen when the Plex account token doesn't have the correct permissions or the watchlist API endpoint changes.

What happens

  1. run() method has both API and RSS calls in a single try/except block:
    try:
        watchlist_items = self.api.get_items_from_watchlist()  # Throws 404!
        rss_items = self.api.get_items_from_rss()  # Never reached
    except Exception as e:
        logger.warning(f"Error fetching items: {e}")
        return  # Entire function exits — RSS never checked
  2. The API call fails with 404 → exception caught → function returns
  3. The RSS feed works perfectly (confirmed: returns IMDb IDs for all watchlist items)
  4. But it's never called because the exception handler returns early
  5. Result: No new content is ever imported, despite the RSS feed containing valid items

Observable symptoms

plex_watchlist.run - Error fetching items: (404) not_found;
Section 'watchlist' not found!

This message repeats every update_interval (60 seconds) and no items are ever fetched.

Fix

Separate API and RSS into independent try/except blocks:

def run(self) -> Generator[MediaItem, None, None]:
    watchlist_items: list[str] = []
    rss_items: list[str] = []

    # Try API watchlist — don't let failure block RSS
    try:
        watchlist_items = self.api.get_items_from_watchlist()
    except Exception as e:
        logger.warning(f"Error fetching watchlist via API (falling back to RSS): {e}")

    # Try RSS feed independently
    try:
        if self.api.rss_enabled:
            rss_items = self.api.get_items_from_rss()
    except Exception as e:
        logger.warning(f"Error fetching items from RSS: {e}")

    plex_items: set[str] = set(watchlist_items) | set(rss_items)
    items_to_yield = [MediaItem({"imdb_id": imdb_id, "requested_by": self.key})
                      for imdb_id in plex_items if imdb_id and imdb_id.startswith("tt")]
    logger.info(f"Fetched {len(items_to_yield)} items from plex watchlist "
                f"(api={len(watchlist_items)}, rss={len(rss_items)})")
    yield items_to_yield

Key principle

Independent data sources must have independent error handling. A failure in one source should never prevent the other from being checked. RSS is specifically designed as a reliable fallback — blocking it on API failure defeats its purpose.


Additional Improvements

Diagnostic Heartbeat

Added periodic heartbeat logging to program.py main loop for pipeline observability:

_hb_interval = 300  # fires every 300 main-loop iterations (~30 seconds)
if _hb_counter % _hb_interval == 0:
    logger.log("PROGRAM", f"[HEARTBEAT] loop={_hb_counter} queue={queue_depth} "
               f"running={running} futures={futures} validate={validate_result}")

This makes it immediately obvious whether the pipeline is stalled (queue>0, futures=0) vs idle (queue=0) vs working (futures>0).

retry_library Interval

Changed from 600 seconds (10 min) to 60 seconds (1 min):

self._retry_library: {"interval": 60},  # Was 600

This ensures incomplete items are retried promptly rather than sitting idle for 10 minutes between attempts.

Service-Bounce Detection

Added bounce-loop detector to event_manager.py that tracks items repeatedly cycling between the same states without progressing. After 5 consecutive bounces with exponential backoff delays, the item is dropped from the queue and logged:

BOUNCE-LOOP detected: {item_id} bounced back to {state} 5 times via {service}

Exempt services (like Symlinker, which legitimately retries) are excluded from bounce detection.
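The detector can be sketched as follows. This is an illustrative reconstruction of the behavior described above, not Riven's exact code; the exponential backoff between bounces is omitted for brevity, and names and thresholds are assumptions.

```python
from collections import defaultdict

class BounceDetector:
    """Track items repeatedly returning to the same state via the same
    service; after `limit` consecutive bounces the caller should drop
    the item from the queue and log it."""
    def __init__(self, limit=5, exempt_services=("Symlinker",)):
        self.limit = limit
        self.exempt = set(exempt_services)
        self._counts = defaultdict(int)  # (item_id, state, service) -> bounces

    def record(self, item_id, state, service):
        """Return True when the item should be dropped."""
        if service in self.exempt:  # services that legitimately retry
            return False
        key = (item_id, state, service)
        self._counts[key] += 1
        return self._counts[key] >= self.limit

    def progressed(self, item_id):
        """Clear counters once an item actually advances state."""
        for key in [k for k in self._counts if k[0] == item_id]:
            del self._counts[key]
```

Calling `progressed()` on every real state transition is what distinguishes a genuine bounce loop from an item that merely takes several attempts.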


DB State Progression

| Stage | Completed | Scraped | Indexed | Notes |
|---|---|---|---|---|
| Pre-fix (stalled) | 6,823 | 483 | 357 | Pipeline completely silent |
| After Bug 2 regression | 6,823 | 481 | 359 | Scraped → Indexed regression visible |
| After Bugs 1+2 fix | 6,823 | 481 | 359 | Sessions fixed, still zero throughput |
| After Bug 3 (state repair + decomposition) | 6,864 | ~460 | ~300 | Pipeline active, spin loop found |
| After Bug 4 (spin loop fix) | 7,018 | 449 | 197 | +195 completed, fully operational |
| After Bug 8 fix + new shows added | 6,705+ | 3 | 442 | Processing new watchlist content |

(Count fluctuates as new shows are imported and Season 0 specials were cleaned up)


Files Modified

| File | Changes |
|---|---|
| src/program/db/db_functions.py | Session retry loop (2 attempts), PendingRollbackError + OperationalError catch, safety session.rollback() |
| src/program/managers/event_manager.py | 30s delay on re-queue, service-bounce detector with exponential backoff, pipeline dashboard stats |
| src/program/services/downloaders/__init__.py | DebridTransientError catch — returns None without blacklisting, breaks loop gracefully |
| src/program/services/downloaders/realdebrid.py | Raises DebridTransientError on 429/RetryError, safe getattr() chain in add_torrent() |
| src/program/services/downloaders/models.py | New DebridTransientError exception class |
| src/program/state_transition.py | Episode-level decomposition, checks actual episode states instead of trusting parent season state |
| src/program/program.py | Diagnostic heartbeat every ~30s, retry_library interval 600→60s |
| src/program/services/content/plex_watchlist.py | Independent try/except for API vs RSS, graceful API fallback to RSS |

Architectural Recommendations

  1. Never trust parent state aggregation — Always check actual child (episode) states. Parent state is a cache of child states that can become stale when children regress. Stale parent state creates invisible "ghost" items that the pipeline skips forever.

  2. Never destroy data on transient errors — Blacklisting (DELETE from StreamRelation) on a temporary 429 is permanent data loss. Transient errors (429, 5xx, timeouts) should skip processing, not delete data. Create a distinct exception type for transient failures.

  3. Always add delay to re-queue loops — Any re-queue mechanism without a delay creates potential spin loops. 30 seconds minimum. Without delay, the main loop becomes a tight infinite cycle consuming 100% CPU.

  4. Session errors are thread-local — A PendingRollbackError on one thread doesn't affect others, but that thread is permanently broken until explicitly rolled back. Thread-local scoped sessions need per-operation safety rollbacks at the start of every DB operation.

  5. State changes must be bidirectional — If episodes regress (e.g., Scraped → Indexed), parent season/show states must regress too. The original code only propagated completion upward, never regression. This creates "ghost Completed" parents containing incomplete children.

  6. Distinguish transient vs permanent failures — HTTP 429 is fundamentally different from "stream doesn't exist." The code path should branch on failure type. A type hierarchy (e.g., DebridTransientError vs DebridPermanentError) makes the distinction explicit and prevents accidental data destruction.

  7. Independent data sources need independent error handling — If you have both an API and RSS feed as content sources, a failure in one must never prevent the other from being checked. Putting both in the same try/except block creates a silent single point of failure.

  8. Check for database-level setting overrides — ALTER DATABASE ... SET overrides are invisible in postgresql.conf and only visible via pg_db_role_setting. A 5-minute idle_in_transaction_session_timeout at the database level silently kills long-running operations regardless of what postgresql.conf says.

  9. Exception handlers must be defensive — Never assume exception objects have specific attributes (.response, .status_code). Always use getattr() or hasattr() when inspecting exception properties to prevent secondary crashes in error handling code.


How to Detect If These Bugs Recur

-- Bug 3: Check for ghost-Completed seasons (should return 0 rows)
SELECT s.title, s.last_state, COUNT(e.id) as trapped_episodes
FROM "MediaItem" s
JOIN "Season" sea ON sea.id = s.id
JOIN "Episode" e ON e.parent_id = sea.id
JOIN "MediaItem" me ON me.id = e.id
WHERE s.type = 'season'
  AND s.last_state = 'Completed'
  AND me.last_state NOT IN ('Completed', 'Unreleased')
GROUP BY s.id, s.title, s.last_state;

-- Bug 2: If Completed count in DB ever DECREASES, stream blacklisting
-- on transient errors is recurring
SELECT last_state, COUNT(*) FROM "MediaItem" GROUP BY last_state ORDER BY COUNT(*) DESC;

-- Bug 6: Check for hidden per-database timeout overrides
SELECT d.datname, s.setconfig
FROM pg_db_role_setting s
JOIN pg_database d ON d.oid = s.setdatabase
WHERE s.setconfig IS NOT NULL;

Monitor the heartbeat log for stall signatures:

  • queue>0, futures=0 for extended periods = pipeline stalled (Bug 3 or Bug 4)
  • queue=0, running=0 = pipeline idle (normal when all items are terminal)
  • queue>0, futures>0 = pipeline working normally
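These signatures are simple enough to check automatically. A small sketch classifying heartbeat lines (the log format is assumed from the heartbeat snippet earlier in this report):

```python
import re

HEARTBEAT_RE = re.compile(r"queue=(\d+).*?futures=(\d+)")

def classify_heartbeat(line):
    """Classify a heartbeat log line as stalled / idle / working,
    mirroring the signatures listed above."""
    m = HEARTBEAT_RE.search(line)
    if not m:
        return "unknown"
    queue, futures = int(m.group(1)), int(m.group(2))
    if queue > 0 and futures == 0:
        return "stalled"   # work queued but nothing executing
    if queue == 0:
        return "idle"      # normal when all items are terminal
    return "working"
```

Feeding a tail of the log through this and alerting on sustained "stalled" results would have caught the original silent stall within minutes.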

Labels

bug, pipeline, database, real-debrid, state-management, plex
