Skip to content

Fix async observe masking unresolved operation failures#594

Open
fabioaraujopt wants to merge 1 commit intocrossplane:mainfrom
fabioaraujopt:fix/async-observe-failure-masking
Open

Fix async observe masking unresolved operation failures#594
fabioaraujopt wants to merge 1 commit intocrossplane:mainfrom
fabioaraujopt:fix/async-observe-failure-masking

Conversation

@fabioaraujopt
Copy link
Copy Markdown

@fabioaraujopt fabioaraujopt commented Feb 10, 2026

Problem statement

In asynchronous reconciliation flows, a terminal failure from the last async create/update operation can be unintentionally masked by a later successful observe pass.

In practice, this can make a managed resource look healthy (Synced=True / reconcile success) even though the most recent async operation actually failed and has not yet been resolved by a successful retry.

Failure pattern (generic)

A representative sequence:

  1. Controller starts async create/update.
  2. Provider/API returns a terminal error for that async operation (for example, transient 5xx, dependency not ready, or backend operation failure).
  3. Callback path records failure (LastAsyncOperation=False, reconcile error).
  4. Next reconcile runs Observe.
  5. Observe returns ResourceExists=true and ResourceUpToDate=true (resource object may exist and desired diff is empty from Terraform perspective).
  6. Previous code path unconditionally cleared async failure condition and set reconcile success.

Result: failure signal from step 3 could be overwritten before a genuinely successful apply cycle occurs.

Concrete generic example

  • Async create returns a terminal provider error (e.g. InternalServerError or equivalent).
  • Immediately after, Observe reports resource as existing and up-to-date.
  • Status transitions could show success while the prior async operation remained unresolved.

This is a status/condition precedence issue, not a provider-specific business logic issue.

Root cause

The async observe code paths were clearing failure-related conditions based only on observe success (err == nil && exists && upToDate) without verifying whether the operation tracker still contained an unresolved async error.

Additionally, in the framework recoverExternalName branch (failed create + partial state), observe returned ResourceUpToDate=true, which could also contribute to a false healthy signal during recovery.

What this PR changes

1) Guard condition clearing with operation error state

For both async clients:

  • pkg/controller/external_async_tfpluginsdk.go
  • pkg/controller/external_async_tfpluginfw.go

Condition-clearing logic now executes only when:

  • observe is successful and
  • LastOperation.Error() is nil

So unresolved async failures are preserved until a true successful cycle clears them.

2) Harden framework recovery branch

In:

  • pkg/controller/external_async_tfpluginfw.go

When recovering external name after a failed async create, observe now returns:

  • ResourceExists=true
  • ResourceLateInitialized=true
  • ResourceUpToDate=false <-- changed from true

This keeps reconcile state conservative while a failed async operation remains unresolved.

Behavior after fix

If an async create/update fails:

  • failure condition is retained,
  • observe-only success does not prematurely clear failure,
  • resource is not presented as fully healthy/synced until a successful apply/update resolves the error.

Regression coverage added

SDK async client

File:

  • pkg/controller/external_async_tfpluginsdk_test.go

New case validates:

  • prior async create failure + observe up-to-date
  • LastAsyncOperation remains failed
  • operation tracker error remains set

Framework async client

New file:

  • pkg/controller/external_async_tfpluginfw_test.go

Cases validate:

  1. prior async failure is not masked by observe up-to-date
  2. failed-create recoverExternalName path does not report ResourceUpToDate=true and preserves failure signaling

Why this is safe

  • No API surface changes.
  • No CRD/schema changes.
  • Changes are limited to async reconcile condition handling and test coverage.
  • Logic is stricter/conservative: success is only reported when there is no unresolved async error state.

Test plan

  • go test ./pkg/controller/...
  • Validate unresolved async error is preserved in SDK async observe flow
  • Validate unresolved async error is preserved in framework async observe flow
  • Validate framework failed-create recovery path stays not-up-to-date until resolution

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Feb 10, 2026

📝 Walkthrough

Walkthrough

Observe logic for async Terraform-provider clients was changed to avoid clearing cached async errors and to mark resources as not up-to-date when recovering from partial/failed async creates; tests added to validate these behaviors.

Changes

Cohort / File(s) Summary
Async TFPlugin Framework
pkg/controller/external_async_tfpluginfw.go, pkg/controller/external_async_tfpluginfw_test.go
Observe now sets ResourceUpToDate=false on early-return recovery from failed partial creates and only clears LastOperation/related success conditions when there is no cached async error. New tests validate error propagation, condition preservation, and external-name restoration after failed async creates.
Async TFPlugin SDK
pkg/controller/external_async_tfpluginsdk.go, pkg/controller/external_async_tfpluginsdk_test.go
Observe guarded clearing of LastAsyncOperation/reconcile success by checking for an existing LastOperation error. Tests extended with a case that simulates unresolved async failures and asserts preserved condition and last-operation error.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Template Breaking Changes ⚠️ Warning Changes to async connector instantiation in generated controllers significantly alter observe logic and failure handling across all async-enabled resources, requiring validation across multiple production providers before merging. Validate async connector changes with 2-3 generated providers (AWS, GCP, Azure), document tested providers in PR, and obtain explicit approval from provider maintainers using async operations.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and accurately summarizes the main change: fixing a bug where async observe operations were masking unresolved operation failures.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, providing clear problem statement, root cause analysis, and detailed explanation of the fix.
Configuration Api Breaking Changes ✅ Passed PR modifies only pkg/controller/ files; no changes to pkg/config/ directory where Configuration API resides.
Generated Code Manual Edits ✅ Passed PR modifies only hand-written Go source files in pkg/controller directory (external_async_tfpluginfw.go, external_async_tfpluginsdk.go and their tests). No files matching 'zz_*.go' pattern or containing generated code markers were modified.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
pkg/controller/external_async_tfpluginsdk_test.go (1)

143-155: Consider moving assertion fields from args to want.

Thanks for adding these new fields to support the regression test! Small structural note: wantCondition and wantLastOpErr represent expected outcomes, so they'd fit more naturally in the want struct alongside obs and err. The prepare function is an input/setup concern and makes sense in args, but the two want* fields are assertions. Could you share if there was a reason for grouping them in args?

♻️ Suggested refactor
 type args struct {
   r             Resource
   cfg           *config.Resource
   obj           xpresource.Managed
   prepare       func(*terraformPluginSDKAsyncExternal, xpresource.Managed)
-  wantCondition *xpv1.Condition
-  wantLastOpErr error
 }
 type want struct {
   obs managed.ExternalObservation
   err error
+  condition *xpv1.Condition
+  lastOpErr error
 }

Then update the test runner references accordingly (e.g., tc.want.condition, tc.want.lastOpErr).

As per coding guidelines: **/*_test.go — "Enforce table-driven test structure: … args/want pattern."

pkg/controller/external_async_tfpluginfw_test.go (1)

156-171: Optional: consider unconditional assertions for lastOperationErr for future-proofing.

Currently, lastOperationErr and externalName are only asserted when the want value is non-zero. This works perfectly for the existing cases, but if a future test case expects no operation error (i.e., it was cleared), the nil want.lastOperationErr would skip the check entirely. A small tweak like always asserting lastOperationErr (similar to how err is unconditionally checked on line 153) would catch that.

Totally fine to leave as-is if you prefer — just flagging for awareness since this is a regression-focused test file. Thanks for the thorough assertions overall! 🙌

♻️ Optional: make lastOperationErr assertion unconditional
-		if tc.want.lastOperationErr != nil {
-			if diff := cmp.Diff(tc.want.lastOperationErr, asyncExternal.opTracker.LastOperation.Error(), test.EquateErrors()); diff != "" {
-				t.Errorf("\nObserve(...): -want last operation error, +got last operation error:\n%s", diff)
-			}
+		if diff := cmp.Diff(tc.want.lastOperationErr, asyncExternal.opTracker.LastOperation.Error(), test.EquateErrors()); diff != "" {
+			t.Errorf("\nObserve(...): -want last operation error, +got last operation error:\n%s", diff)
 		}

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Prevent async observe paths from clearing failure signals when the last async operation still has an unresolved error, and add regression tests for both SDK and framework clients to ensure failed async operations are not reported as healthy.

Signed-off-by: Fábio Araújo <fabioaraujo@Mac-PX.local>
@fabioaraujopt fabioaraujopt force-pushed the fix/async-observe-failure-masking branch from 203b176 to 49eb20e Compare February 11, 2026 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant