What would you like to be added:
I propose to establish lifecycle rules for ClusterLoader2 (CL2) variables, similar to K8s feature gates, particularly for the main load scenario.
Specifically, the proposal is:
- For the load scenario, start tracking all `CL2_` flags that can potentially be set and work on their removal (this can be a simple markdown table in the README).
- Apply Kubernetes feature-gate-style lifecycle rules to the flags, with criteria for removal and a requirement that each flag eventually become the default:
  - Alpha: newly added and optional; may be enabled in experimental scalability scenarios (e.g. resource-size); must graduate at some point or be removed.
  - Beta: can be enabled in the release-blocking scenario by passing it manually; should graduate eventually.
  - GA: enabled by default (flipped to the default in CL2); can be removed entirely at some point.
- For flags that cannot graduate to the default, or that other SIGs want to maintain regardless, we should plan to move them to a separate scenario owned by that SIG.
Why is this needed:
ClusterLoader2 feature flags are currently used by different SIGs to test a variety of things for different use cases. However, each added variable multiplies the testing matrix that a single SIG (sig-scalability) has to support.
To keep informing the release simple, there is one main scalability test (the 5k-node test): the load scenario. Over the years, all of these feature flags were added to this single scenario, but some of them were never used and are significantly complicating the test templates.
The deployment manifests have accumulated a large number of interdependent if statements that make it hard to land important improvements, such as the effort to establish "Pod Shape" as a formal scalability envelope (issue #138415). Without lifecycle rules and graduation criteria for these CL2 features, the maintenance burden on the small number of volunteers maintaining this growing testing matrix becomes unsustainable.
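To illustrate the kind of conditional templating that accumulates, here is a hedged sketch of a CL2 deployment-template fragment (not taken verbatim from the repo; the manifest structure and claim names are illustrative, while `DefaultParam` and the `CL2_` variables are the real template mechanism):

```yaml
# Hypothetical CL2 deployment template fragment: every optional CL2_
# variable adds another DefaultParam declaration and another {{if}} branch
# that the template maintainers must keep working in all combinations.
{{$ENABLE_PVS := DefaultParam .CL2_ENABLE_PVS false}}
{{$RUNTIME_CLASS_NAME := DefaultParam .CL2_RUNTIME_CLASS_NAME ""}}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-deployment-{{.Index}}
spec:
  template:
    spec:
      {{if $RUNTIME_CLASS_NAME}}
      runtimeClassName: {{$RUNTIME_CLASS_NAME}}
      {{end}}
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
      {{if $ENABLE_PVS}}
      volumes:
        - name: vol
          persistentVolumeClaim:
            claimName: pvc-{{.Index}}
      {{end}}
```

With ~20 such flags, the number of flag combinations a single template must render correctly grows combinatorially, which is the maintenance burden this proposal targets.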
/cc @wojtek-t @Qqkyu @mborsz @aojea
/sig scalability
| CL2 Variable | Proposed State | Used / Graduation Status | Notes / Next Steps |
|---|---|---|---|
| CL2_USE_HOST_NETWORK_PODS | Remove | Unused | Addressed/removed in PR #3945 |
| CL2_RUN_ON_ARM_NODES | Remove | Unused | Addressed/removed in PR #3946 |
| CL2_ENABLE_NETWORK_POLICY_ENFORCEMENT_LATENCY_TEST | Move | Used by sig-network | This test is the only one using it. Move out of the main load scenario to the sig-network testgrid. |
| CL2_DNS_QPS_PER_CLIENT | TODO | TODO | TODO |
| CL2_DEPLOYMENT_POD_PAYLOAD_SIZE | TODO | TODO | TODO |
| CL2_USE_ADVANCED_DNSTEST | TODO | TODO | TODO |
| CL2_TOLERATION | TODO | TODO | TODO |
| CL2_RUNTIME_CLASS_NAME | TODO | TODO | TODO |
| CL2_ENABLE_PVS | TODO | TODO | TODO |
| CL2_STATEFULSET_POD_PAYLOAD_SIZE | TODO | TODO | TODO |
| CL2_CHECK_IF_PODS_ARE_UPDATED | TODO | TODO | TODO |
| CL2_DISABLE_DAEMONSETS | TODO | TODO | TODO |
| CL2_ENABLE_DNSTESTS | TODO | TODO | TODO |
| CL2_ENABLE_NETWORKPOLICIES | TODO | TODO | TODO |
| CL2_NET_POLICY_ENFORCEMENT_LATENCY_TARGET_LABEL_KEY | TODO | TODO | TODO |
| CL2_NET_POLICY_ENFORCEMENT_LATENCY_TARGET_LABEL_VALUE | TODO | TODO | TODO |
| CL2_NET_POLICY_SERVER_EVERY_NTH_POD | TODO | TODO | TODO |
| CL2_JOB_POD_PAYLOAD_SIZE | TODO | TODO | TODO |
| CL2_DS_SURGE | TODO | TODO | TODO |
| CL2_NET_POLICY_ENFORCEMENT_LATENCY_NODE_LABEL_VALUE | TODO | TODO | TODO |
| CL2_DAEMONSET_POD_PAYLOAD_SIZE | TODO | TODO | |