Extend eksctl deployer with richer input arguments #596
Conversation
Force-pushed from 8084d6a to 5a54916.
Thanks @ytsssun for the contribution!

FYI, we have this eksapi-janitor for situations where cluster creation fails due to resource limits and needs a cleanup. You can add a similar one if this fits your use case.

Thanks, I believe eksctl supports resource deletion via "eksctl delete cluster -f xx.yaml", or even by just pointing at the cluster name. We could add a thin wrapper around that, but I think it would be no different from running the eksctl commands directly.
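For illustration, the two deletion forms mentioned above look like this (file and cluster names are placeholders):

# Delete using the same config file that created the cluster
eksctl delete cluster -f cluster-config.yaml

# Or point at the cluster by name
eksctl delete cluster --name my-test-cluster --region us-west-2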
Force-pushed from 5a54916 to eac0414.
Addressed comments. Minor refactoring to honor the cluster name from the cluster config, so that we don't need to specify the "--cluster-name" flag when using the config file directly.
LGTM, Thanks! 🚀 @ytsssun
// ClusterName is the effective cluster name (from flag or RunID)
clusterName string
I think we should always use the run ID for the cluster name. If you want to override the cluster name, you can already use --run-id to do so.
Is there a particular reason why run-id is preferred? Is it because we always want a clean end-to-end run (including cluster creation and destruction)?
IIUC this approach is taken from upstream kubetest2, but functionally it provides better logical isolation of test runs - every resource created for the test is suffixed or named based on that ID. That makes for easy lookup of related resources and can help with debugging creation/deletion issues in some cases.
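To make the run-ID naming pattern concrete, here is a minimal Go sketch; the field and helper names are illustrative, not the deployer's actual code:

package eksctl

import "fmt"

// deployerOptions is a trimmed, hypothetical subset of the deployer's inputs.
type deployerOptions struct {
	RunID       string // unique ID for this test run, supplied by kubetest2
	ClusterName string // optional explicit override from --cluster-name
}

// effectiveClusterName picks the explicit flag value when set and falls
// back to the run ID, keeping every resource tied to the test run.
func effectiveClusterName(o deployerOptions) string {
	if o.ClusterName != "" {
		return o.ClusterName
	}
	return o.RunID
}

// runScopedName derives a per-run name for auxiliary resources, e.g.
// runScopedName("nodegroup", "abc123") == "nodegroup-abc123".
func runScopedName(kind, runID string) string {
	return fmt.Sprintf("%s-%s", kind, runID)
}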
}

// parseClusterNameFromConfig extracts the cluster name from an eksctl config file
func (d *deployer) parseClusterNameFromConfig(configFilePath string) (string, error) {
I think this should go the other direction -- override any cluster name in the config file with the run ID
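For reference, the function whose signature appears in the diff could be fleshed out roughly like this; the body is an assumption (using sigs.k8s.io/yaml on the metadata section), not the PR's actual implementation, and the method receiver is dropped for self-containment:

package eksctl

import (
	"fmt"
	"os"

	"sigs.k8s.io/yaml"
)

// minimalClusterConfig mirrors only the metadata section of an eksctl
// ClusterConfig; sigs.k8s.io/yaml honors the json tags.
type minimalClusterConfig struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
}

// parseClusterNameFromConfig extracts metadata.name from an eksctl config file.
func parseClusterNameFromConfig(configFilePath string) (string, error) {
	data, err := os.ReadFile(configFilePath)
	if err != nil {
		return "", fmt.Errorf("reading config file: %w", err)
	}
	var cfg minimalClusterConfig
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return "", fmt.Errorf("parsing config file: %w", err)
	}
	if cfg.Metadata.Name == "" {
		return "", fmt.Errorf("no metadata.name in %s", configFilePath)
	}
	return cfg.Metadata.Name, nil
}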
Thanks for the feedback. I'd like to clarify my reasoning for adding the --cluster-name flag:

- Semantically, using --cluster-name is more intuitive than overriding --run-id, which may affect components beyond just the cluster name.
- As I mentioned in our use cases, we sometimes would like to add nodegroups to existing clusters, so run-id will not work for us.
- The implementation still defaults to using runID when --cluster-name isn't specified, maintaining the clean run pattern by default.

For config files, I believe we should respect the user's intent when they explicitly provide configuration. If strong naming control is needed, we could consider a separate flag to enforce runID as the cluster name.

What specific concerns do you have about allowing custom cluster names as an option?
That seems reasonable to me, but in the interest of keeping the flag profile minimal, I'm wondering if this could just be accomplished through the config file? As it is, when a config file is set, all other flags are ignored, so this functionality could be lumped in through there.
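For concreteness, a minimal eksctl config file of the kind being discussed, where metadata.name carries the cluster name (all values are placeholders):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-test-cluster   # the name the deployer would parse or override
  region: us-west-2
nodeGroups:
  - name: ng-1
    instanceType: m5.large
    desiredCapacity: 2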
internal/deployers/eksctl/up.go (Outdated)
VolumeSize int `flag:"volume-size" desc:"Size of the node root volume in GB"`
PrivateNetworking bool `flag:"private-networking" desc:"Use private networking for nodes"`
WithOIDC bool `flag:"with-oidc" desc:"Enable OIDC provider for IAM roles for service accounts"`
SkipClusterCreation bool `flag:"skip-cluster-creation" desc:"Skip cluster creation, only create nodegroups"`
eksctl should already handle this for you if a cluster with the expected name already exists. Are you trying to use pre-existing/static clusters? We generally do not aim to support that pattern.
Thank you for your feedback. Let me clarify the rationale behind the --skip-cluster-creation option:

While eksctl does have checking capabilities for nodegroups and clusters, it fails with an error when attempting eksctl create cluster on an existing cluster:

2025-03-26 20:33:44 [✖] creating CloudFormation stack "eksctl-efa-p4-cluster-125-cluster": operation error CloudFormation: CreateStack, https response error StatusCode: 400, RequestID: 3239e519-0c64-47cc-a993-efd6ca6205bc, AlreadyExistsException: Stack [eksctl-efa-p4-cluster-125-cluster] already exists
Error: failed to create cluster "efa-p4-cluster-125"

This feature addresses specific testing workflows on our side:

- Testing efficiency: Cluster creation is time-consuming. When iterating on nodegroup configurations or troubleshooting nodegroup deployment issues, redeploying the entire cluster is inefficient.
- Intermediate approach: This isn't about static clusters (which do not work for us, since they assume the cluster and nodegroups are already ready, in which case the deployer adds little value compared to just running the go test command directly). Rather, it allows controlled management of the cluster lifecycle separately from nodegroups.
- Test isolation: We often need to deploy different nodegroup configurations on the same cluster for different features, without recreating the entire cluster.

The option provides a middle ground between fully ephemeral clusters and static clusters, enabling more flexible testing patterns while still maintaining the ability to automate the full deployment process when needed.
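To make the proposed flag's behavior concrete, the branch in the deployer's Up path might look roughly like this (a sketch assuming a SkipClusterCreation option, not the PR's verbatim code):

package eksctl

// buildCreateArgs assembles the eksctl create invocation used by Up.
// When skipClusterCreation is set, only the nodegroups declared in the
// rendered config file are created against the existing cluster.
func buildCreateArgs(configPath string, skipClusterCreation bool) []string {
	if skipClusterCreation {
		return []string{"create", "nodegroup", "--config-file", configPath}
	}
	return []string{"create", "cluster", "--config-file", configPath}
}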
The functionality of --skip-cluster-creation seems reasonable to me; it exposes the otherwise concealed create nodegroup functionality from eksctl, whereas before we were limited to create cluster, and it does so without too much added logic.

I'm wondering if the name might complicate things down the line though. atm eksctl create works for:

Commands:
  eksctl create accessentry              Create access entries
  eksctl create addon                    Create an Addon
  eksctl create cluster                  Create a cluster
  eksctl create fargateprofile           Create a Fargate profile
  eksctl create iamidentitymapping       Create an IAM identity mapping
  eksctl create iamserviceaccount        Create an iamserviceaccount - AWS IAM role bound to a Kubernetes service account
  eksctl create nodegroup                Create a nodegroup
  eksctl create podidentityassociation   Create a pod identity association

If we currently support cluster and nodegroup, then try to add in something like create fargateprofile, I think it could get a bit messy.

Wondering if we can just pass it along more directly, e.g. the flag can be --create and the value of it is the argument to the eksctl create call, defaulting to cluster.
> Wondering if we can just pass it along more directly, e.g. the flag can be --create and the value of it is the argument to the eksctl create call, defaulting to cluster.

That is definitely a valid point, but it further extends the scope of the PR. Also, for the eksctl deployer, being able to redeploy the nodegroup is a more significant need compared to the other options listed there. I don't see a compelling reason why we must include all the eksctl create subcommands, even though doing so would extend the functionality.

I would still prefer to focus on supporting nodegroup creation and track the others in a separate effort.
I agree that nodegroup is probably the most likely use case currently, but I'm not sure whether eksctl might later add a target like nodepool or something to enable auto on an existing cluster, which would then be very useful and would require a breaking API change here to support intuitively.

I'm not sure I follow why it's extending the scope, I think it'd just be like:

-SkipClusterCreation bool `flag:"skip-cluster-creation" desc:"Skip cluster creation, only create nodegroups"`
+CreateTarget string `flag:"create-target" desc:"Target resource for eksctl create call, defaults to cluster"`

I added a suggestion of what that could look like on the render func; we would also need to add that default in verifyUpFlags(). I don't think we'd need to actually add additional validation of the argument, eksctl should fail pretty quickly with an invalid target.

We don't have to implement the same CLI offerings off the bat for the other options, they can just be created with a custom config file for now.
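A sketch of that default, assuming the CreateTarget field from the suggestion above (the surrounding struct is trimmed and the function body is illustrative):

package eksctl

// deployer is trimmed to the one field relevant to this sketch.
type deployer struct {
	CreateTarget string `flag:"create-target" desc:"Target resource for eksctl create call, defaults to cluster"`
}

// verifyUpFlags applies defaults for Up; only CreateTarget handling is shown.
func (d *deployer) verifyUpFlags() error {
	// Preserve current behavior: create a full cluster unless told otherwise.
	if d.CreateTarget == "" {
		d.CreateTarget = "cluster"
	}
	// No allow-list here: eksctl rejects an unknown create target quickly,
	// so invalid values fail fast downstream.
	return nil
}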
Force-pushed from eac0414 to 0cc0f5e.
Pushed changes to depend on upstream ClusterConfig and render the config file directly.
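For context, rendering the upstream ClusterConfig might look roughly like this; the import path and the use of sigs.k8s.io/yaml for marshaling are assumptions about the approach, not the PR's exact code:

package eksctl

import (
	"os"

	api "github.com/weaveworks/eksctl/pkg/apis/eksctl.io/v1alpha5"
	"sigs.k8s.io/yaml"
)

// renderClusterConfig builds an eksctl ClusterConfig in code and writes it
// out as YAML, instead of templating the config file by hand.
func renderClusterConfig(name, region, path string) error {
	cfg := api.NewClusterConfig() // fills in apiVersion/kind and defaults
	cfg.Metadata.Name = name
	cfg.Metadata.Region = region

	data, err := yaml.Marshal(cfg)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}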
Force-pushed from 0cc0f5e to 8011c03.
thanks, generally lgtm. one main follow-up regarding the SkipClusterCreation flag.
Force-pushed from 43fd06f to 66d12e9.
Adding the newest testing results I have:
- Use ClusterConfig from upstream eksctl to render the eksctl config file.
- Support more deployment flags.
- Support eksctl spec files.

Signed-off-by: Yutong Sun <[email protected]>
Force-pushed from 66d12e9 to 6359297.
Issue #, if available:
#595
Description of changes:

This PR enhances the aws-k8s-tester eksctl deployer by adding extensive configuration options. Highlights:

Added Options/Flags

- --config-file: Path to eksctl config file (if provided, other flags are ignored)
- --cluster-name: Name of the EKS cluster (defaults to RunID if not specified)
- --availability-zones: Node availability zones
- --ami-family: AMI family to use (AmazonLinux2, Bottlerocket)
- --efa-enabled: Enable Elastic Fabric Adapter for the nodegroup
- --volume-size: Size of the node root volume in GB
- --private-networking: Use private networking for nodes
- --with-oidc: Enable OIDC provider for IAM roles for service accounts
- --skip-cluster-creation: Skip cluster creation, only create nodegroups
- --unmanaged-nodegroup: Use unmanaged nodegroup instead of managed nodegroup
- --nodegroup-name: Name of the nodegroup

Test done
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.