Split Cluster Autoscaler codebase #5394

@x13n

Which component are you using?:

Cluster Autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Cluster Autoscaler has two parts: shared ("core") functionality that is generic to any cloud provider, and cloud provider specific code. CA releases contain both the shared part and the code for all (~30) cloud providers, along with its dependencies. This approach leads to several problems:

  • A large number of supported cloud providers brings an even larger number of imported libraries. This increases the risk of a CA release shipping bad dependencies, either malicious or simply buggy, which can lead to security issues (CA permissions in the cluster have to be quite wide) and performance issues (additional logic burning CPU cycles).
  • Maintainability is a problem. Many cloudprovider OWNERS are not k8s org members, so they cannot approve cloudprovider specific changes. At the same time, people with affinity to other cloud providers may not care enough, or may not be familiar enough, to review or make changes in a specific cloud provider's code.
  • Release qualification doesn't exist. With so many cloud providers it is hard to provide sufficient coverage, and we provide none (beyond unit tests). Issue #5378 ("Cluster Autoscaler: Segmentation violation caused by HintingSimulator predicateChecker nil pointer") is a recent example of that: the 1.26.0 image was released even though it panics at runtime.

Describe the solution you'd like.:

I believe cloud provider specific code should live in separate repositories. OSS Cluster Autoscaler should really be a library that is used in various ways, rather than a component trying to support all possible cloud providers. There may be an implementation or two that make sense in this repo (gRPC and Cluster API come to mind), but everything else probably belongs elsewhere.
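To make the "library" idea a bit more concrete, here is a minimal sketch of how an out-of-tree provider might plug into shared core logic. Everything in it is hypothetical: the `CloudProvider` interface below is a drastically reduced stand-in and not the real `cloudprovider.CloudProvider` interface from this repo, and `ScaleUp` and `fakeProvider` are names invented for illustration only.

```go
package main

import (
	"errors"
	"fmt"
)

// CloudProvider is a hypothetical, heavily reduced provider interface.
// It is NOT the real interface from cluster-autoscaler/cloudprovider.
type CloudProvider interface {
	Name() string
	NodeGroupSize(group string) (int, error)
	SetNodeGroupSize(group string, size int) error
}

// ScaleUp stands in for provider-agnostic core logic that would live in the
// shared library. It only talks to the provider through the interface.
func ScaleUp(p CloudProvider, group string, delta int) (int, error) {
	if delta <= 0 {
		return 0, errors.New("delta must be positive")
	}
	current, err := p.NodeGroupSize(group)
	if err != nil {
		return 0, fmt.Errorf("%s: reading size of %q: %w", p.Name(), group, err)
	}
	target := current + delta
	if err := p.SetNodeGroupSize(group, target); err != nil {
		return 0, fmt.Errorf("%s: resizing %q: %w", p.Name(), group, err)
	}
	return target, nil
}

// fakeProvider is an in-memory provider. In the proposed layout, a real
// implementation like this would live in its own repository and import
// only the core library plus its own cloud SDK.
type fakeProvider struct {
	sizes map[string]int
}

func (f *fakeProvider) Name() string { return "fake" }

func (f *fakeProvider) NodeGroupSize(group string) (int, error) {
	size, ok := f.sizes[group]
	if !ok {
		return 0, fmt.Errorf("unknown node group %q", group)
	}
	return size, nil
}

func (f *fakeProvider) SetNodeGroupSize(group string, size int) error {
	if _, ok := f.sizes[group]; !ok {
		return fmt.Errorf("unknown node group %q", group)
	}
	f.sizes[group] = size
	return nil
}
```

Under this assumed layout, a provider's own `main` would construct its implementation and hand it to the core, for example `ScaleUp(&fakeProvider{sizes: map[string]int{"pool-a": 3}}, "pool-a", 2)`. The resulting binary would then link only that provider's dependencies.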

Describe any alternative solutions you've considered.:

  • Introduce more policies around cloud providers (i.e. iterate on https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/POLICY.md). This might solve the maintainability issues, but the release qualification, security, and performance risks would be hard to mitigate this way.
  • Release a separate CA image for each cloud provider. This was previously discussed at a SIG meeting and would address the security and performance concerns caused by multiple dependencies. It might also help with release qualification. Maintainability would still be a concern.
  • Combine the two above. This presents a chance to address all issues, but it would require building a lot of infrastructure which, in my opinion, shouldn't really be a part of this repository.

Additional context.:

    Labels

    area/cluster-autoscaler
    area/core-autoscaler: Denotes an issue that is related to the core autoscaler and is not specific to any provider.
    kind/feature: Categorizes issue or PR as related to a new feature.
    lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
