Skip to content

Ensure we don't downgrade user error for runAlarm#6501

Closed
jqmmes wants to merge 1 commit intomainfrom
joaquim/propagate-userException-runAlarm
Closed

Ensure we don't downgrade user error for runAlarm#6501
jqmmes wants to merge 1 commit intomainfrom
joaquim/propagate-userException-runAlarm

Conversation

@jqmmes
Copy link
Copy Markdown
Contributor

@jqmmes jqmmes commented Apr 6, 2026

If a runAlarm exception is considered userError it should persist that until the result is returned back

@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 6, 2026

The generated output of @cloudflare/workers-types has been changed by this PR. If this is intentional, run just generate-types to update the snapshot. Alternatively, you can download the full generated types:

Full Type Diff

Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

This PR preserves the user-error classification from the first catch block so that a subsequent waitForOutputLocks() failure cannot downgrade it. The intent is correct, but the implementation captures retryCountsAgainstLimit instead of the user-error classification itself, which has a wider meaning and can cause false positives.

  1. [HIGH] priorCountsAgainstLimit is not equivalent to "prior error was a user error" — capturing retryCountsAgainstLimit instead of the actual user-error classification can suppress diagnostic logging for genuine internal errors.
  2. [MEDIUM] The comment on line 606-609 says "user-error classification" but the captured variable actually tracks retryCountsAgainstLimit, which can be true for non-user errors. The comment is misleading.

This review was generated by an AI assistant and may contain inaccuracies.

Comment thread src/workerd/api/global-scope.c++ Outdated
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk bot commented Apr 6, 2026

I'm Bonk, and I've done a quick review of your PR.

The review has been posted on PR #6501. The main finding is a high severity issue: the PR captures retryCountsAgainstLimit to preserve user-error classification, but that field has a broader meaning than "is user error" — it can be true for non-user errors (when the output gate was healthy) and defaults to true on the success path. This can suppress diagnostic logging for genuine internal output-lock failures and incorrectly classify them as user errors. I suggested using result.retryCountsAgainstLimit && result.retry as a more precise guard, along with a note that a dedicated isUserError field in AlarmResult would be the cleanest long-term fix.

github run

@jqmmes jqmmes force-pushed the joaquim/propagate-userException-runAlarm branch from 7cb62ab to a07f273 Compare April 6, 2026 09:27
Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

This PR addresses a real problem: waitForOutputLocks() failure can downgrade a user-error classification from the alarm handler's catch block. The approach of avoiding retryCountsAgainstLimit (which conflates user-error with output-gate health) is the right call. However, re-deriving user-error status from the description alone misses cases where the error was classified via the EXCEPTION_IS_USER_ERROR exception detail.

  1. [HIGH] isAlarmFailureUserError(desc, false) on line 616 hardcodes false for hasUserErrorDetail, which drops user-error classifications that depend on EXCEPTION_IS_USER_ERROR. This affects errors with broken.* description prefixes — specifically state.abort() ("broken.outputGateBroken; jsg.Error: ...") and cache OOM ("broken.exceededMemory; jsg.Error: ...") — because tunneledErrorType() sets isDurableObjectReset = true for any broken.* prefix, causing !tunneled.isDurableObjectReset to fail.
  2. [MEDIUM] Missing test coverage — there's no test verifying that priorHandlerWasUserError is correctly propagated when waitForOutputLocks() fails after the alarm handler threw a user error.

This review was generated by an AI assistant and may contain inaccuracies.

Comment thread src/workerd/api/global-scope.c++ Outdated
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk bot commented Apr 6, 2026

The review has been posted on PR #6501. Here's a summary of the findings:

HIGH severity: The PR's approach of re-deriving user-error status from errorDescription alone (line 616: isAlarmFailureUserError(desc, false)) misses errors classified via the EXCEPTION_IS_USER_ERROR exception detail. Specifically, errors with broken.* description prefixes — like state.abort() and cache OOM — get isDurableObjectReset = true from tunneledErrorType(), which makes the description-only check return false even though these are legitimate user errors. The recommended fix is to add a bool isUserError field to AlarmResult and carry the classification through directly, avoiding any re-derivation.

MEDIUM severity: No test coverage for the new priorHandlerWasUserError propagation path.

github run

@jqmmes jqmmes force-pushed the joaquim/propagate-userException-runAlarm branch from a07f273 to a2bdcd6 Compare April 6, 2026 09:41
Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up / correction on my prior reviews.

My second review incorrectly claimed that line 625 hardcodes false for hasUserErrorDetail. Looking at the actual code, it correctly passes e.getDetail(jsg::EXCEPTION_IS_USER_ERROR) != kj::none from the output-lock exception itself. Apologies for the noise on that one.

The concern from my first review still stands and is the only remaining actionable issue: retryCountsAgainstLimit is not equivalent to "the prior error was a user error." In the first catch block (line 586), it's set to !context.isOutputGateBroken() || isUserGeneratedError, so it's true for non-user errors whenever the output gate happens to be healthy. If the output gate then breaks during waitForOutputLocks(), the output-lock error handler will OR in priorCountsAgainstLimit = true, misclassifying that internal failure as a user error and suppressing diagnostic logging.

The explicit .retryCountsAgainstLimit = false on the success path (line 552) is a good fix — that eliminates the success-path false positive.

The cleanest resolution for the catch-path issue is still to add an isUserError field to AlarmResult and carry the classification through directly, rather than re-deriving it from retryCountsAgainstLimit.


This review was generated by an AI assistant and may contain inaccuracies.

Comment thread src/workerd/api/global-scope.c++ Outdated
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk bot commented Apr 6, 2026

I'm Bonk, and I've done a quick review of your PR.

The review has been posted on PR #6501. This is a follow-up that:

  1. Corrects my second review — I incorrectly claimed isAlarmFailureUserError was called with a hardcoded false for hasUserErrorDetail, but the code actually passes the real exception detail from the output-lock exception. That concern was invalid.

  2. Reiterates the one remaining valid issue from my first review: retryCountsAgainstLimit conflates "user error" with "output gate was healthy" (line 586: !context.isOutputGateBroken() || isUserGeneratedError). When the alarm handler throws a non-user error and the output gate is healthy at catch time, priorCountsAgainstLimit will be true, which can misclassify a subsequent output-lock failure as a user error and suppress diagnostic logging.

The suggested fix is to add a dedicated bool isUserError field to AlarmResult and carry the classification through directly.

github run

If a runAlarm exception is considered userError it should not lose that value
@jqmmes jqmmes force-pushed the joaquim/propagate-userException-runAlarm branch from a2bdcd6 to 816b903 Compare April 6, 2026 10:06
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk bot commented Apr 6, 2026

Good — AlarmResult is a C++ struct, not a capnp type. No wire-format concerns.

Let me verify the asOutcome() method doesn't include isUserError:

Looking back at my read of worker-interface.h lines 79-82:

AlarmOutcome asOutcome() const {
  return {
    .retry = retry, .retryCountsAgainstLimit = retryCountsAgainstLimit, .outcome = outcome};
}

Correct — isUserError is not propagated to AlarmOutcome, which is the right behavior since it's only needed within the sandbox process.

Now let me assess: the author has directly implemented my suggested fix (adding a dedicated isUserError field). The implementation is clean and correct. The only remaining concern is the lack of test coverage, which I already flagged in my prior reviews. The code is a 14-line change with straightforward boolean propagation logic.

Given that my prior reviews already raised the issues that have now been addressed, and the current code looks correct:

LGTM

github run

@jqmmes jqmmes marked this pull request as ready for review April 6, 2026 10:10
@jqmmes jqmmes requested review from a team as code owners April 6, 2026 10:10
Comment on lines +623 to +625
auto isUserGeneratedError = isAlarmFailureUserError(e.getDescription(),
e.getDetail(jsg::EXCEPTION_IS_USER_ERROR) != kj::none) ||
priorHandlerWasUserError;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this the right thing to do? e is thrown from waitForOutputLocks and, unless I'm mistaken, could be a completely different exception than the one that we used to set priorHandlerWasUserError. What's bad about this "upgrade" is that it can cause internal errors to turn into user-facing errors, which will then have a limited number of alarm retries.

It seems like a better approach would be to make sure whatever errors we're throwing out of waitForOutputLocks are correctly annotated.

@jqmmes jqmmes closed this Apr 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants