Skip to content

CI: transient NuGet restore failures (NU1301) intermittently fail the 20h test chain #23

@gfraiteur

Description

@gfraiteur

Summary

The long-running CI test chain (Check All Quality Requirements, ~20h) fails frequently due to transient NuGet restore network failures, not code defects. Over the long run, restore is invoked many times (most test projects restore their own solution), each hitting api.nuget.org; a single transient blip fails a test's initialization and cascades to a full chain failure.

Example occurrence

  • Build chain Check All Quality Requirements #325389 (release/2026.0): Tests failed: 1, passed: 10435, ignored: 279.
  • Failing test: PostSharp.UserInterface: UserInterface in Test .NET Framework (X64, VS 17.0, .NET SDK 8.0) (#325376), reported as "Test initialization failed."
  • Root cause in the binlog:
NU1301: Unable to load the service index for source https://api.nuget.org/v3/index.json
A connection attempt failed because the connected party did not properly respond after a
period of time ... 150.171.109.101:443

This propagates: restore fails -> "Test initialization failed" -> TC_EXIT_CODE 2 -> SNAPSHOT_DEPENDENCY_ERROR on the whole chain.

Why it keeps happening

  • NU1301 (service-index load failure) is not reliably retried by the NuGet client, and NuGet exposes no retry-count/timeout knob via nuget.config.
  • The shared wrapper Build/Scripts/restore.ps1 has a mutex but no real retry - on failure it only re-runs once with detailed verbosity to surface the error, then returns the original failure code.
  • A few test projects (notably UserInterface/PostSharp.UserInterface.Tests.proj) bypass the wrapper and call <MSBuild Targets="Restore"> directly, so they get no resilience at all.

Proposed fix (build/test tooling - upstream release/2024.0, merges forward)

  1. Add real retry-with-exponential-backoff to Build/Scripts/restore.ps1, gated on transient-error patterns (NU1301, connection failed, timeouts, 429/5xx) so genuine errors (NU1101, version conflicts) still fail fast. Covers the ~60 test projects that funnel through $(RestorePs1).
  2. Route the direct-<MSBuild Targets="Restore"> outliers (e.g. PostSharp.UserInterface.Tests.proj, and the build-asset restore in Tests/TestingFramework/Testing/PostSharp.BuildTests.targets) through the resilient wrapper.
  3. (Optional, defense-in-depth) Persist a shared global packages folder on agents to cut network hits; longer term consider an on-agent caching NuGet proxy.

Impact

Spurious red builds on a 20h pipeline; wasted agent time and manual re-runs. No product/runtime defect.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions