What would you like to be added:
I propose to establish lifecycle rules for ClusterLoader2 (CL2) variables, similar to K8s feature gates, particularly for the main load scenario.
Specifically, the proposal is:
- For the load scenario, start tracking all `CL2_` flags that can potentially be set and work on their removal (this can be a simple markdown table in the README).
- Apply Kubernetes feature-gate-style lifecycle rules to the flags, with criteria for removal and a requirement that each flag eventually become the default:
  - Alpha: newly added and optional; may be enabled in experimental scalability scenarios (e.g. resource-size); must graduate at some point or be removed.
  - Beta: can be enabled in the release-blocking scenario by passing it manually; should graduate eventually.
  - GA: enabled by default (flipped to the default in CL2); can be removed entirely at some point.
- For flags that cannot graduate to the default, or that other SIGs want to maintain regardless, we should plan to move them to a separate scenario owned by that SIG.
Why is this needed:
ClusterLoader2 feature flags are currently used by different SIGs to test a variety of things for different use cases. However, each added variable multiplies the testing matrix that a single SIG (sig-scalability) has to support.
To keep informing the release simple, there is one main scalability test (the 5k-node test): the load scenario. Over the years, all of these feature flags were added to this single scenario, but some of them were never used and are significantly complicating the test templates.
The deployment manifests have accumulated a large number of interdependent if statements that make it hard to land important improvements, such as the effort to establish "Pod Shape" as a formal scalability envelope (issue #138415). Without lifecycle rules and graduation criteria for these CL2 features, the maintenance burden on the small number of volunteers maintaining this growing testing matrix becomes unsustainable.
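To illustrate the kind of conditional templating that accumulates, here is a hedged sketch of a CL2 deployment-template fragment (not taken verbatim from the repo; the manifest structure and claim names are illustrative, while `DefaultParam` and the `CL2_` variables are the real template mechanism):

```yaml
# Hypothetical CL2 deployment template fragment: every optional CL2_
# variable adds another DefaultParam declaration and another {{if}} branch
# that the template maintainers must keep working in all combinations.
{{$ENABLE_PVS := DefaultParam .CL2_ENABLE_PVS false}}
{{$RUNTIME_CLASS_NAME := DefaultParam .CL2_RUNTIME_CLASS_NAME ""}}
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-deployment-{{.Index}}
spec:
  template:
    spec:
      {{if $RUNTIME_CLASS_NAME}}
      runtimeClassName: {{$RUNTIME_CLASS_NAME}}
      {{end}}
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
      {{if $ENABLE_PVS}}
      volumes:
        - name: vol
          persistentVolumeClaim:
            claimName: pvc-{{.Index}}
      {{end}}
```

With ~20 such flags, the number of flag combinations a single template must render correctly grows combinatorially, which is the maintenance burden this proposal targets.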
/cc @wojtek-t @Qqkyu @mborsz @aojea
/sig scalability
| CL2 Variable | Proposed State | Used / Graduation Status | Notes / Next Steps |
|---|---|---|---|
| CL2_USE_HOST_NETWORK_PODS | Remove | Unused | Addressed/removed in PR #3945 |
| CL2_RUN_ON_ARM_NODES | Remove | Unused | Addressed/removed in PR #3946 |
| CL2_ENABLE_NETWORK_POLICY_ENFORCEMENT_LATENCY_TEST | Move | Used by sig-network | This test is the only one using it. Move out of the main load scenario to the sig-network testgrid. |
| CL2_DNS_QPS_PER_CLIENT | TODO | TODO | TODO |
| CL2_DEPLOYMENT_POD_PAYLOAD_SIZE | TODO | TODO | TODO |
| CL2_USE_ADVANCED_DNSTEST | TODO | TODO | TODO |
| CL2_TOLERATION | TODO | TODO | TODO |
| CL2_RUNTIME_CLASS_NAME | TODO | TODO | TODO |
| CL2_ENABLE_PVS | TODO | TODO | TODO |
| CL2_STATEFULSET_POD_PAYLOAD_SIZE | TODO | TODO | TODO |
| CL2_CHECK_IF_PODS_ARE_UPDATED | TODO | TODO | TODO |
| CL2_DISABLE_DAEMONSETS | TODO | TODO | TODO |
| CL2_ENABLE_DNSTESTS | TODO | TODO | TODO |
| CL2_ENABLE_NETWORKPOLICIES | TODO | TODO | TODO |
| CL2_NET_POLICY_ENFORCEMENT_LATENCY_TARGET_LABEL_KEY | TODO | TODO | TODO |
| CL2_NET_POLICY_ENFORCEMENT_LATENCY_TARGET_LABEL_VALUE | TODO | TODO | TODO |
| CL2_NET_POLICY_SERVER_EVERY_NTH_POD | TODO | TODO | TODO |
| CL2_JOB_POD_PAYLOAD_SIZE | TODO | TODO | TODO |
| CL2_DS_SURGE | TODO | TODO | TODO |
| CL2_NET_POLICY_ENFORCEMENT_LATENCY_NODE_LABEL_VALUE | TODO | TODO | TODO |
| CL2_DAEMONSET_POD_PAYLOAD_SIZE | TODO | TODO | |