Fix async observe masking unresolved operation failures#594
Fix async observe masking unresolved operation failures#594fabioaraujopt wants to merge 1 commit intocrossplane:mainfrom
Conversation
📝 WalkthroughWalkthroughObserve logic for async Terraform-provider clients was changed to avoid clearing cached async errors and to mark resources as not up-to-date when recovering from partial/failed async creates; tests added to validate these behaviors. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. No actionable comments were generated in the recent review. 🎉 🧹 Recent nitpick comments
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Prevent async observe paths from clearing failure signals when the last async operation still has an unresolved error, and add regression tests for both SDK and framework clients to ensure failed async operations are not reported as healthy. Signed-off-by: Fábio Araújo <fabioaraujo@Mac-PX.local>
203b176 to
49eb20e
Compare
Problem statement
In asynchronous reconciliation flows, a terminal failure from the last async create/update operation can be unintentionally masked by a later successful observe pass.
In practice, this can make a managed resource look healthy (
Synced=True/ reconcile success) even though the most recent async operation actually failed and has not yet been resolved by a successful retry.Failure pattern (generic)
A representative sequence:
5xx, dependency not ready, or backend operation failure).LastAsyncOperation=False, reconcile error).Observe.ObservereturnsResourceExists=trueandResourceUpToDate=true(resource object may exist and desired diff is empty from Terraform perspective).Result: failure signal from step 3 could be overwritten before a genuinely successful apply cycle occurs.
Concrete generic example
InternalServerErroror equivalent).Observereports resource as existing and up-to-date.This is a status/condition precedence issue, not a provider-specific business logic issue.
Root cause
The async observe code paths were clearing failure-related conditions based only on observe success (
err == nil && exists && upToDate) without verifying whether the operation tracker still contained an unresolved async error.Additionally, in the framework
recoverExternalNamebranch (failed create + partial state), observe returnedResourceUpToDate=true, which could also contribute to a false healthy signal during recovery.What this PR changes
1) Guard condition clearing with operation error state
For both async clients:
pkg/controller/external_async_tfpluginsdk.gopkg/controller/external_async_tfpluginfw.goCondition-clearing logic now executes only when:
LastOperation.Error()isnilSo unresolved async failures are preserved until a true successful cycle clears them.
2) Harden framework recovery branch
In:
pkg/controller/external_async_tfpluginfw.goWhen recovering external name after a failed async create, observe now returns:
ResourceExists=trueResourceLateInitialized=trueResourceUpToDate=false<-- changed from trueThis keeps reconcile state conservative while a failed async operation remains unresolved.
Behavior after fix
If an async create/update fails:
Regression coverage added
SDK async client
File:
pkg/controller/external_async_tfpluginsdk_test.goNew case validates:
LastAsyncOperationremains failedFramework async client
New file:
pkg/controller/external_async_tfpluginfw_test.goCases validate:
recoverExternalNamepath does not reportResourceUpToDate=trueand preserves failure signalingWhy this is safe
Test plan
go test ./pkg/controller/...