Ensure we don't downgrade user error for runAlarm#6501
Conversation
|
The generated output of Full Type Diff |
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR.
This PR preserves the user-error classification from the first catch block so that a subsequent waitForOutputLocks() failure cannot downgrade it. The intent is correct, but the implementation captures retryCountsAgainstLimit instead of the user-error classification itself, which has a wider meaning and can cause false positives.
- [HIGH]
priorCountsAgainstLimitis not equivalent to "prior error was a user error" — capturingretryCountsAgainstLimitinstead of the actual user-error classification can suppress diagnostic logging for genuine internal errors. - [MEDIUM] The comment on line 606-609 says "user-error classification" but the captured variable actually tracks
retryCountsAgainstLimit, which can betruefor non-user errors. The comment is misleading.
This review was generated by an AI assistant and may contain inaccuracies.
|
I'm Bonk, and I've done a quick review of your PR. The review has been posted on PR #6501. The main finding is a high severity issue: the PR captures |
7cb62ab to
a07f273
Compare
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR.
This PR addresses a real problem: waitForOutputLocks() failure can downgrade a user-error classification from the alarm handler's catch block. The approach of avoiding retryCountsAgainstLimit (which conflates user-error with output-gate health) is the right call. However, re-deriving user-error status from the description alone misses cases where the error was classified via the EXCEPTION_IS_USER_ERROR exception detail.
- [HIGH]
isAlarmFailureUserError(desc, false)on line 616 hardcodesfalseforhasUserErrorDetail, which drops user-error classifications that depend onEXCEPTION_IS_USER_ERROR. This affects errors withbroken.*description prefixes — specificallystate.abort()("broken.outputGateBroken; jsg.Error: ...") and cache OOM ("broken.exceededMemory; jsg.Error: ...") — becausetunneledErrorType()setsisDurableObjectReset = truefor anybroken.*prefix, causing!tunneled.isDurableObjectResetto fail. - [MEDIUM] Missing test coverage — there's no test verifying that
priorHandlerWasUserErroris correctly propagated whenwaitForOutputLocks()fails after the alarm handler threw a user error.
This review was generated by an AI assistant and may contain inaccuracies.
|
The review has been posted on PR #6501. Here's a summary of the findings: HIGH severity: The PR's approach of re-deriving user-error status from MEDIUM severity: No test coverage for the new |
a07f273 to
a2bdcd6
Compare
There was a problem hiding this comment.
Follow-up / correction on my prior reviews.
My second review incorrectly claimed that line 625 hardcodes false for hasUserErrorDetail. Looking at the actual code, it correctly passes e.getDetail(jsg::EXCEPTION_IS_USER_ERROR) != kj::none from the output-lock exception itself. Apologies for the noise on that one.
The concern from my first review still stands and is the only remaining actionable issue: retryCountsAgainstLimit is not equivalent to "the prior error was a user error." In the first catch block (line 586), it's set to !context.isOutputGateBroken() || isUserGeneratedError, so it's true for non-user errors whenever the output gate happens to be healthy. If the output gate then breaks during waitForOutputLocks(), the output-lock error handler will OR in priorCountsAgainstLimit = true, misclassifying that internal failure as a user error and suppressing diagnostic logging.
The explicit .retryCountsAgainstLimit = false on the success path (line 552) is a good fix — that eliminates the success-path false positive.
The cleanest resolution for the catch-path issue is still to add an isUserError field to AlarmResult and carry the classification through directly, rather than re-deriving it from retryCountsAgainstLimit.
This review was generated by an AI assistant and may contain inaccuracies.
|
I'm Bonk, and I've done a quick review of your PR. The review has been posted on PR #6501. This is a follow-up that:
The suggested fix is to add a dedicated |
If a runAlarm exception is considered userError it should not lose that value
a2bdcd6 to
816b903
Compare
|
Good — Let me verify the Looking back at my read of AlarmOutcome asOutcome() const {
return {
.retry = retry, .retryCountsAgainstLimit = retryCountsAgainstLimit, .outcome = outcome};
}Correct — Now let me assess: the author has directly implemented my suggested fix (adding a dedicated Given that my prior reviews already raised the issues that have now been addressed, and the current code looks correct: LGTM |
| auto isUserGeneratedError = isAlarmFailureUserError(e.getDescription(), | ||
| e.getDetail(jsg::EXCEPTION_IS_USER_ERROR) != kj::none) || | ||
| priorHandlerWasUserError; |
There was a problem hiding this comment.
Why is this the right thing to do? e is thrown from waitForOutputLocks and, unless I'm mistaken, could be a completely different exception than the one that we used to set priorHandlerWasUserError. What's bad about this "upgrade" is that it can cause internal errors to turn into user-facing errors, which will then have a limited number of alarm retries.
It seems like a better approach would be to make sure whatever errors we're throwing out of waitForOutputLocks are correctly annotated.
If a
runAlarmexception is considereduserErrorit should persist that until the result is returned back