Summary
The long-running CI test chain (Check All Quality Requirements, ~20h) fails frequently due to transient NuGet restore network failures, not code defects. Over the course of a run, restore is invoked many times (most test projects restore their own solution), and each invocation hits api.nuget.org; a single transient blip fails one test's initialization and cascades to a full chain failure.
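A back-of-the-envelope sketch of why many restores per run make a red chain likely even when each restore is individually very reliable. The per-restore failure rate below is a hypothetical figure, not a measurement from this chain, and the calculation assumes restores fail independently of each other, which is a simplification (real network blips tend to cluster).

```python
def chain_failure_probability(per_restore_failure_rate: float, restore_count: int) -> float:
    """Probability that at least one of `restore_count` independent restores fails."""
    return 1.0 - (1.0 - per_restore_failure_rate) ** restore_count


if __name__ == "__main__":
    # Hypothetical inputs: 60 restores per run, each failing transiently 0.5% of the time.
    print(f"{chain_failure_probability(0.005, 60):.1%}")
```

Under those assumed inputs, roughly one run in four would go red from restore failures alone.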
Example occurrence
- Build chain: Check All Quality Requirements #325389 (release/2026.0): Tests failed: 1, passed: 10435, ignored: 279.
- Failing test: PostSharp.UserInterface: UserInterface in Test .NET Framework (X64, VS 17.0, .NET SDK 8.0) (#325376), reported as "Test initialization failed."
- Root cause in the binlog:
NU1301: Unable to load the service index for source https://api.nuget.org/v3/index.json
A connection attempt failed because the connected party did not properly respond after a
period of time ... 150.171.109.101:443
This propagates: restore fails -> "Test initialization failed" -> TC_EXIT_CODE 2 -> SNAPSHOT_DEPENDENCY_ERROR on the whole chain.
Why it keeps happening
- NU1301 (service-index load failure) is not reliably retried by the NuGet client, and NuGet exposes no retry-count/timeout knob via nuget.config.
- The shared wrapper Build/Scripts/restore.ps1 has a mutex but no real retry - on failure it only re-runs once with detailed verbosity to surface the error, then returns the original failure code.
- A few test projects (notably UserInterface/PostSharp.UserInterface.Tests.proj) bypass the wrapper and call <MSBuild Targets="Restore"> directly, so they get no resilience at all.
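One possible way to list the projects that bypass the wrapper is to scan the build files for MSBuild task elements whose Targets attribute includes Restore. The sketch below is illustrative only: the function name and the file globs (*.proj, *.targets) are assumptions, and the regex is a heuristic that will miss elements whose earlier attributes contain a literal ">" character.

```python
import re
from pathlib import Path

# Heuristic: an <MSBuild ...> task element whose Targets attribute lists "Restore".
DIRECT_RESTORE = re.compile(
    r'<MSBuild\b[^>]*\bTargets\s*=\s*"[^"]*\bRestore\b[^"]*"',
    re.IGNORECASE,
)


def find_direct_restore_callers(root, patterns=("*.proj", "*.targets")):
    """Return the sorted paths under `root` that invoke the Restore target directly."""
    hits = []
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            text = path.read_text(encoding="utf-8", errors="replace")
            if DIRECT_RESTORE.search(text):
                hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # Run from the repository root to list candidate outliers.
    for path in find_direct_restore_callers("."):
        print(path)
```

The output is a candidate list to review by hand, not a definitive inventory.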
Proposed fix (build/test tooling - upstream release/2024.0, merges forward)
- Add real retry-with-exponential-backoff to Build/Scripts/restore.ps1, gated on transient-error patterns (NU1301, connection failed, timeouts, 429/5xx) so genuine errors (NU1101, version conflicts) still fail fast. Covers the ~60 test projects that funnel through $(RestorePs1).
- Route the outliers that call <MSBuild Targets="Restore"> directly (e.g. PostSharp.UserInterface.Tests.proj, and the build-asset restore in Tests/TestingFramework/Testing/PostSharp.BuildTests.targets) through the resilient wrapper.
- (Optional, defense-in-depth) Persist a shared global packages folder on agents to cut network hits; longer term consider an on-agent caching NuGet proxy.
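The gated retry in the first item could look roughly like the sketch below. It is written in Python purely to illustrate the control flow; the real change would live in the PowerShell wrapper. The pattern list, attempt count, delays, and function names are all assumptions to be tuned against real binlogs, and run_dotnet_restore is only one hypothetical way to invoke restore.

```python
import re
import subprocess
import time

# Assumed transient-error signatures, following the categories named in the list above.
TRANSIENT_PATTERNS = [
    r"\bNU1301\b",
    r"connection attempt failed",
    r"timed out|timeout",
    r"status code does not indicate success: (429|5\d\d)",
]


def is_transient(output: str) -> bool:
    """True if the restore output matches any known transient-error signature."""
    return any(re.search(p, output, re.IGNORECASE) for p in TRANSIENT_PATTERNS)


def restore_with_retry(run_restore, max_attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call run_restore() -> (exit_code, output), retrying only on transient failures.

    The delay doubles after each failed attempt (5s, 10s, 20s with the defaults).
    Non-transient failures such as NU1101 are returned immediately.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        exit_code, output = run_restore()
        if exit_code == 0 or attempt == max_attempts or not is_transient(output):
            return exit_code, output
        sleep(base_delay * 2 ** (attempt - 1))


def run_dotnet_restore(solution: str):
    """Hypothetical runner; the actual wrapper may invoke restore differently."""
    proc = subprocess.run(["dotnet", "restore", solution], capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr
```

Passing the runner and the sleep function as parameters keeps the retry policy testable without network access, for example restore_with_retry(lambda: run_dotnet_restore("Some.sln")) in real use versus a canned sequence of results in a test.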
Impact
Spurious red builds on a 20h pipeline; wasted agent time and manual re-runs. No product/runtime defect.