
Extend eksctl deployer with richer input arguments #596

Open · wants to merge 1 commit into main from eksctl-deployer-add-flags

Conversation


@ytsssun ytsssun commented Mar 19, 2025

Issue #, if available:
#595
Description of changes:

This PR enhances the aws-k8s-tester eksctl deployer by adding extensive configuration options. Highlights:

  1. Supports passing a config file directly to eksctl.
  2. Supports deploying unmanaged nodegroups.
  3. Supports deploying just the nodegroup.

Added Options/Flags

  • --config-file: Path to an eksctl config file (if provided, all other flags are ignored; see the example below the list)
  • --cluster-name: Name of the EKS cluster (defaults to RunID if not specified)
  • --availability-zones: Node availability zones
  • --ami-family: AMI family to use (AmazonLinux2, Bottlerocket)
  • --efa-enabled: Enable Elastic Fabric Adapter for the nodegroup
  • --volume-size: Size of the node root volume in GB
  • --private-networking: Use private networking for nodes
  • --with-oidc: Enable OIDC provider for IAM roles for service accounts
  • --skip-cluster-creation: Skip cluster creation, only create nodegroups
  • --unmanaged-nodegroup: Use unmanaged nodegroup instead of managed nodegroup
  • --nodegroup-name: Name of the nodegroup
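
For reference, a minimal config file that --config-file accepts could look like the following (values are illustrative; the schema is eksctl's standard ClusterConfig, the same shape the deployer renders internally):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: us-west-2
  version: "1.30"
managedNodeGroups:
  - name: my-ng
    amiFamily: Bottlerocket
    instanceTypes: ["g5.2xlarge"]
    desiredCapacity: 1
    volumeSize: 400
    privateNetworking: true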

Testing done

  1. Test the new flags introduced
docker run --rm \
  --env-file <(env | grep AWS) \
  -it kubetest2 \
  kubetest2 eksctl \
  --cluster-name=my-cluster \
  --nodegroup-name=my-ng3 \
  --region=us-west-2 \
  --kubernetes-version=1.25 \
  --with-oidc \
  --ami-family=Bottlerocket \
  --ami=xxx \
  --instance-types=g5.2xlarge \
  --nodes=1 \
  --volume-size=400 \
  --private-networking \
  --skip-cluster-creation \
  --up \
  --test=exec -- /bin/nvidia.test \
  --test.timeout=60m \
  --test.v \
  --installDevicePlugin=false \
  --feature="unit-test" \
  -efaEnabled=false \
  -nvidiaTestImage=$IMAGE
  2. Test the config file
docker run --rm \
  --env-file <(env | grep AWS) \
  -v ./test-eks-cluster.yaml:/config/test-eks-cluster.yaml \
  -it kubetest2 \
  kubetest2 eksctl \
  --cluster-name=my-cluster \
  --config-file=/config/test-eks-cluster.yaml \
  --region=us-west-2 \
  --skip-cluster-creation \
  --up \
  --test=exec -- /bin/nvidia.test \
  --test.timeout=60m \
  --test.v \
  --installDevicePlugin=false \
  --feature="unit-test" \
  -efaEnabled=false \
  -nvidiaTestImage=$IMAGE

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ytsssun ytsssun force-pushed the eksctl-deployer-add-flags branch from 8084d6a to 5a54916 Compare March 20, 2025 18:34
Contributor

@Issacwww Issacwww left a comment


Thanks @ytsssun for the contribution!

FYI, we have this eksapi-janitor for situations where cluster creation fails due to resource limits and a cleanup is needed. You can add a similar one if it fits your use case

@ytsssun
Author

ytsssun commented Mar 25, 2025

FYI, we have this eksapi-janitor for situations where cluster creation fails due to resource limits and a cleanup is needed. You can add a similar one if it fits your use case

Thanks. I believe eksctl supports resource deletion via "eksctl delete cluster -f xx.yaml", or even by just pointing it at the cluster name. We could add a thin wrapper for that, but I think it would be no different from running the eksctl commands directly.
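
For example, either of (file and cluster names are placeholders):

# delete everything described by the same config file
eksctl delete cluster -f test-eks-cluster.yaml
# or delete by cluster name
eksctl delete cluster --name my-cluster --region us-west-2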

@ytsssun ytsssun force-pushed the eksctl-deployer-add-flags branch from 5a54916 to eac0414 Compare March 25, 2025 23:33
@ytsssun
Author

ytsssun commented Mar 25, 2025

Addressed comments.

Minor refactoring to honor the cluster name from the cluster config, so that we don't need to specify "--cluster-name" when using the config file directly.
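
Roughly, the lookup looks like this (simplified sketch; only metadata.name is read, the helper is shown as a free function, and the yaml import is illustrative — the actual code in the PR may differ):

package deployer

import (
	"fmt"
	"os"

	"sigs.k8s.io/yaml"
)

// parseClusterNameFromConfig extracts metadata.name from an eksctl config file.
func parseClusterNameFromConfig(configFilePath string) (string, error) {
	data, err := os.ReadFile(configFilePath)
	if err != nil {
		return "", fmt.Errorf("failed to read config file: %w", err)
	}
	// Only the metadata.name field is needed here.
	var cfg struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
	}
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return "", fmt.Errorf("failed to parse config file: %w", err)
	}
	if cfg.Metadata.Name == "" {
		return "", fmt.Errorf("no cluster name found in %s", configFilePath)
	}
	return cfg.Metadata.Name, nil
}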

Contributor

@Issacwww Issacwww left a comment


LGTM, Thanks! 🚀 @ytsssun

Comment on lines +30 to +31
// ClusterName is the effective cluster name (from flag or RunID)
clusterName string
Member


I think we should always use the run ID for the cluster name. If you want to override the cluster name, you can already use --run-id to do so
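
e.g. (kubetest2's global flag; the value is illustrative):

kubetest2 eksctl --run-id=my-custom-name --up ...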

Author


Is there a particular reason why run-id is preferred? Is it because we always want a clean end-to-end run (including cluster creation and destruction)?

Contributor


IIUC this approach is taken from upstream kubetest2, but functionally it provides better logical isolation of test runs - every resource created for the test is suffixed or named based on that ID. That makes for easy lookup of related resources and can help with debugging creation/deletion issues in some cases

https://github.com/kubernetes-sigs/kubetest2/blob/1cc02edeb0b6b06ec1cb0d9d9849272561e89ce4/pkg/types/types.go#L60

}

// parseClusterNameFromConfig extracts the cluster name from an eksctl config file
func (d *deployer) parseClusterNameFromConfig(configFilePath string) (string, error) {
Member


I think this should go the other direction -- override any cluster name in the config file with the run ID

Author


Thanks for the feedback. I'd like to clarify my reasoning for adding the --cluster-name flag:

  1. Semantically, using --cluster-name is more intuitive than overriding --run-id, which may affect components beyond just the cluster name.

  2. As I mentioned in our use cases, we sometimes want to add nodegroups to existing clusters, so run-id will not work for us.

  3. The implementation still defaults to using runID when --cluster-name isn't specified, maintaining the clean run pattern by default.

For config files, I believe we should respect the user's intent when they explicitly provide configuration. If strong naming control is needed, we could consider a separate flag to enforce runID as the cluster name.

What specific concerns do you have about allowing custom cluster names as an option?

Contributor


That seems reasonable to me, but in the interest of keeping the flag profile minimal, I'm wondering if this could just be accomplished through the config file? As it is, when a config file is set, all other flags are ignored, so this functionality could be lumped in through there

VolumeSize int `flag:"volume-size" desc:"Size of the node root volume in GB"`
PrivateNetworking bool `flag:"private-networking" desc:"Use private networking for nodes"`
WithOIDC bool `flag:"with-oidc" desc:"Enable OIDC provider for IAM roles for service accounts"`
SkipClusterCreation bool `flag:"skip-cluster-creation" desc:"Skip cluster creation, only create nodegroups"`
Member


eksctl should already handle this for you, if a cluster with the expected name already exists. Are you trying to use pre-existing/static clusters? We generally do not aim to support that pattern

Author


Thank you for your feedback. Let me clarify the rationale behind the --skip-cluster-creation option:

While eksctl does have checking capabilities for nodegroups and clusters, it fails with an error when attempting eksctl create cluster on an existing cluster:

2025-03-26 20:33:44 [✖]  creating CloudFormation stack "eksctl-efa-p4-cluster-125-cluster": operation error CloudFormation: CreateStack, https response error StatusCode: 400, RequestID: 3239e519-0c64-47cc-a993-efd6ca6205bc, AlreadyExistsException: Stack [eksctl-efa-p4-cluster-125-cluster] already exists
Error: failed to create cluster "efa-p4-cluster-125"

This feature addresses specific testing workflows on our side:

  1. Testing efficiency: Cluster creation is time-consuming. When iterating on nodegroup configurations or troubleshooting nodegroup deployment issues, redeploying the entire cluster is inefficient.

  2. Intermediate approach: This isn't about static clusters (which do not work for us, as that assumes the cluster and nodegroups are ready, in which case it adds little value compared to just running the go test command directly). Rather, it allows controlled management of the cluster lifecycle separately from nodegroups.

  3. Test isolation: We often need to deploy different nodegroup configurations on the same cluster for different features without recreating the entire cluster.

The option provides a middle ground between fully ephemeral clusters and static clusters, enabling more flexible testing patterns while still maintaining the ability to automate the full deployment process when needed.
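
Concretely, all the deployer needs to do is branch on the flag when invoking eksctl, roughly (sketch; the exact function and field names in the PR may differ):

package deployer

import (
	"os"
	"os/exec"
)

// createCommand picks the eksctl create target based on the flag:
// "cluster" creates the cluster plus nodegroups, while "nodegroup"
// only creates the nodegroups against an existing cluster.
func createCommand(skipClusterCreation bool, configPath string) *exec.Cmd {
	target := "cluster"
	if skipClusterCreation {
		target = "nodegroup"
	}
	cmd := exec.Command("eksctl", "create", target, "--config-file", configPath)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}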

Contributor


The functionality of --skip-cluster-creation seems reasonable to me: it exposes the otherwise concealed create nodegroup functionality from eksctl, whereas before we were limited to create cluster, without too much added logic to accomplish that.

I'm wondering if the name might complicate things down the line though. atm eksctl create works for

Commands:
  eksctl create accessentry                     Create access entries
  eksctl create addon                           Create an Addon
  eksctl create cluster                         Create a cluster
  eksctl create fargateprofile                  Create a Fargate profile
  eksctl create iamidentitymapping              Create an IAM identity mapping
  eksctl create iamserviceaccount               Create an iamserviceaccount - AWS IAM role bound to a Kubernetes service account
  eksctl create nodegroup                       Create a nodegroup
  eksctl create podidentityassociation          Create a pod identity association

If we currently support cluster and nodegroup, then try to add in something like just create fargateprofile, I think it could get a bit messy.

Wondering if we can just pass it along more directly, e.g. the flag can be --create and the value of it is the argument to the eksctl create call, defaulting to cluster.

Author


Wondering if we can just pass it along more directly, e.g. the flag can be --create and the value of it is the argument to the eksctl create call, defaulting to cluster.

That is definitely a valid point, but it further extends the scope of the PR. Also, for an eksctl deployer, being able to redeploy the nodegroup is a more "significant" need compared to the other options listed there. I don't see a compelling reason that we must include all the "eksctl create" subcommands, though doing so would extend the functionality.

I would still prefer to focus on supporting nodegroup creation and track the others in a separate effort.

Contributor


I agree that nodegroup is probably the most likely use case currently, but I'm not sure if eksctl might later add a target for nodepool or something similar to enable auto on an existing cluster, which would then be very useful and would require a breaking API change here to support intuitively

I'm not sure I follow why it's extending the scope, I think it'd just be like

Suggested change
SkipClusterCreation bool `flag:"skip-cluster-creation" desc:"Skip cluster creation, only create nodegroups"`
CreateTarget string `flag:"create-target" desc:"Target resource for eksctl create call, defaults to cluster"`

added a suggestion of what that could look like on the render func, would also need to add that default in verifyUpFlags(). I don't think we'd need to actually add additional validation of the argument, eksctl should fail pretty quickly with an invalid target

we don't have to implement the same CLI offerings off the bat for the other options, they can just be created with a custom config file for now.
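
i.e. roughly (sketch; the surrounding names are illustrative):

// Sketch: default the target in verifyUpFlags() and pass it straight through.
type deployer struct {
	CreateTarget string // flag:"create-target"
}

func (d *deployer) verifyUpFlags() error {
	if d.CreateTarget == "" {
		d.CreateTarget = "cluster" // keep today's default behavior
	}
	return nil
}

// the invocation then becomes: eksctl create <d.CreateTarget> --config-file <path>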



@ytsssun ytsssun force-pushed the eksctl-deployer-add-flags branch from eac0414 to 0cc0f5e Compare May 23, 2025 23:24
@ytsssun
Author

ytsssun commented May 23, 2025

Pushed changes to depend on the upstream ClusterConfig and render the config file via a direct Marshal.
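
The rendering now boils down to roughly this (sketch; assumes eksctl's v1alpha5 API package and sigs.k8s.io/yaml, and only a few fields are shown being populated):

package deployer

import (
	api "github.com/weaveworks/eksctl/pkg/apis/eksctl.io/v1alpha5"
	"sigs.k8s.io/yaml"
)

// renderClusterConfig builds the upstream ClusterConfig and marshals it
// directly to YAML, instead of templating the config file by hand.
func renderClusterConfig(name, region, version string) ([]byte, error) {
	cfg := api.NewClusterConfig()
	cfg.Metadata.Name = name
	cfg.Metadata.Region = region
	cfg.Metadata.Version = version
	return yaml.Marshal(cfg)
}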

@ytsssun ytsssun force-pushed the eksctl-deployer-add-flags branch from 0cc0f5e to 8011c03 Compare May 25, 2025 22:39
Contributor

@mselim00 mselim00 left a comment


thanks, generally lgtm. one main follow-up regarding the SkipClusterCreation flag


@ytsssun ytsssun force-pushed the eksctl-deployer-add-flags branch 2 times, most recently from 43fd06f to 66d12e9 Compare May 28, 2025 23:07
@ytsssun
Author

ytsssun commented May 28, 2025

Adding the newest testing result I have:

docker run --rm \
  -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY \
  -e AWS_SESSION_TOKEN=$AWS_SESSION_TOKEN \
  -it kubetest2 \
  kubetest2 eksctl \
  --cluster-name=my-cluster-130 \
  --nodegroup-name=my-ng2 \
  --region=us-west-2 \
  --kubernetes-version=1.30 \
  --with-oidc \
  --ami-family=Bottlerocket \
  --instance-types=g5.2xlarge \
  --nodes=1 \
  --volume-size=400 \
  --private-networking \
  --deploy-target nodegroup \
  --up \
  --test=exec -- /bin/nvidia.test \
  --test.timeout=60m \
  --test.v \
  --installDevicePlugin=false \
  --feature="unit-test" \
  -efaEnabled=false \
  -nvidiaTestImage=xxx.dkr.ecr.us-west-2.amazonaws.com/nccl-test:nvidia-v1-amd
I0530 21:37:25.866702      15 app.go:61] The files in RunDir shall not be part of Artifacts
I0530 21:37:25.866783      15 app.go:62] pass rundir-in-artifacts flag True for RunDir to be part of Artifacts
I0530 21:37:25.866789      15 app.go:64] RunDir for this run: "/workdir/_rundir/4db4e191-7958-4e1a-8794-7a453d11c6e3"
I0530 21:37:25.869547      15 app.go:136] ID for this run: "xxx"
I0530 21:37:25.869560      15 up.go:96] Using managed nodegroup for cluster my-cluster-130
I0530 21:37:25.869571      15 cluster_config.go:103] rendering cluster config yaml based on the ClusterConfig: &TypeMeta{Kind:ClusterConfig,APIVersion:eksctl.io/v1alpha5,}
I0530 21:37:25.870577      15 up.go:110] Rendered cluster config: accessConfig: {}
addonsConfig: {}
apiVersion: eksctl.io/v1alpha5
cloudWatch:
  clusterLogging: {}
iam:
  withOIDC: true
kind: ClusterConfig
kubernetesNetworkConfig:
  ipFamily: IPv4
managedNodeGroups:
- amiFamily: Bottlerocket
  desiredCapacity: 1
  efaEnabled: false
  iam:
    withAddonPolicies:
      albIngress: false
      appMesh: false
      appMeshPreview: false
      autoScaler: false
      awsLoadBalancerController: false
      certManager: false
      cloudWatch: false
      ebs: false
      efs: false
      externalDNS: false
      fsx: false
      imageBuilder: false
      xRay: false
  instanceTypes:
  - g5.2xlarge
  maxSize: 1
  minSize: 1
  name: my-ng2
  privateNetworking: true
  releaseVersion: ""
  securityGroups:
    withLocal: null
    withShared: null
  volumeSize: 400
  volumeType: gp3
metadata:
  name: my-cluster-130
  region: us-west-2
  version: "1.30"
privateCluster:
  enabled: false
  skipEndpointCreation: false
vpc:
  autoAllocateIPv6: false
  cidr: 192.168.0.0/16
  manageSharedNodeSecurityGroupRules: true
  nat:
    gateway: Single
2025-05-30 21:37:25 [!]  Amazon EKS will no longer publish EKS-optimized Amazon Linux 2 (AL2) AMIs after November 26th, 2025. Additionally, Kubernetes version 1.32 is the last version for which Amazon EKS will release AL2 AMIs. From version 1.33 onwards, Amazon EKS will continue to release AL2023 and Bottlerocket based AMIs. The default AMI family when creating clusters and nodegroups in Eksctl will be changed to AL2023 in the future.
2025-05-30 21:37:27 [ℹ]  nodegroup "my-ng2" will use "" [Bottlerocket/1.30]
2025-05-30 21:37:27 [ℹ]  1 existing nodegroup(s) (my-ng3) will be excluded
2025-05-30 21:37:27 [ℹ]  1 nodegroup (my-ng2) was included (based on the include/exclude rules)
2025-05-30 21:37:27 [ℹ]  will create a CloudFormation stack for each of 1 managed nodegroups in cluster "my-cluster-130"
2025-05-30 21:37:27 [ℹ]  
2 sequential tasks: { fix cluster compatibility, 1 task: { 1 task: { create managed nodegroup "my-ng2" } } 
}
2025-05-30 21:37:27 [ℹ]  checking cluster stack for missing resources
2025-05-30 21:37:27 [ℹ]  cluster stack has all required resources
2025-05-30 21:37:28 [ℹ]  building managed nodegroup stack "eksctl-my-cluster-130-nodegroup-my-ng2"
2025-05-30 21:37:28 [ℹ]  skipping us-west-2d from selection because it doesn't support the following instance type(s): g5.2xlarge
2025-05-30 21:37:28 [ℹ]  deploying stack "eksctl-my-cluster-130-nodegroup-my-ng2"
2025-05-30 21:37:28 [ℹ]  waiting for CloudFormation stack "eksctl-my-cluster-130-nodegroup-my-ng2"
2025-05-30 21:37:58 [ℹ]  waiting for CloudFormation stack "eksctl-my-cluster-130-nodegroup-my-ng2"
2025-05-30 21:38:46 [ℹ]  waiting for CloudFormation stack "eksctl-my-cluster-130-nodegroup-my-ng2"
2025-05-30 21:39:44 [ℹ]  waiting for CloudFormation stack "eksctl-my-cluster-130-nodegroup-my-ng2"
2025-05-30 21:39:44 [ℹ]  no tasks
2025-05-30 21:39:44 [✔]  created 0 nodegroup(s) in cluster "my-cluster-130"
2025-05-30 21:39:44 [ℹ]  nodegroup "my-ng2" has 1 node(s)
2025-05-30 21:39:44 [ℹ]  node "ip-192-168-162-117.us-west-2.compute.internal" is ready
2025-05-30 21:39:44 [ℹ]  waiting for at least 1 node(s) to become ready in "my-ng2"
2025-05-30 21:39:44 [ℹ]  nodegroup "my-ng2" has 1 node(s)
2025-05-30 21:39:44 [ℹ]  node "ip-192-168-162-117.us-west-2.compute.internal" is ready
2025-05-30 21:39:44 [✔]  created 1 managed nodegroup(s) in cluster "my-cluster-130"
2025-05-30 21:39:45 [ℹ]  checking security group configuration for all nodegroups
2025-05-30 21:39:45 [ℹ]  all nodegroups have up-to-date cloudformation templates
I0530 21:39:45.063136      15 up.go:143] Writing kubeconfig to /workdir/_rundir/4db4e191-7958-4e1a-8794-7a453d11c6e3/kubeconfig
2025-05-30 21:39:45 [✔]  saved kubeconfig as "/workdir/_rundir/4db4e191-7958-4e1a-8794-7a453d11c6e3/kubeconfig"
I0530 21:39:45.836479      15 up.go:157] Successfully wrote kubeconfig to /workdir/_rundir/4db4e191-7958-4e1a-8794-7a453d11c6e3/kubeconfig
=== RUN   TestContainerdConfig
2025/05/30 21:39:51 No node type specified. Using the node type g4dn.8xlarge in the node groups.
=== RUN   TestContainerdConfig/containerd-config-check
    env.go:438: Skipping feature "containerd-config-check": name not matched
--- PASS: TestContainerdConfig (0.00s)
    --- SKIP: TestContainerdConfig/containerd-config-check (0.00s)
=== RUN   TestMPIJobPytorchTraining
=== RUN   TestMPIJobPytorchTraining/single-node
    env.go:438: Skipping feature "single-node": name not matched
=== RUN   TestMPIJobPytorchTraining/multi-node
    env.go:438: Skipping feature "multi-node": name not matched
--- PASS: TestMPIJobPytorchTraining (0.00s)
    --- SKIP: TestMPIJobPytorchTraining/single-node (0.00s)
    --- SKIP: TestMPIJobPytorchTraining/multi-node (0.00s)
=== RUN   TestSingleNodeUnitTest
=== RUN   TestSingleNodeUnitTest/unit-test
=== RUN   TestSingleNodeUnitTest/unit-test/Unit_test_Job_succeeds
=== NAME  TestSingleNodeUnitTest/unit-test
    unit_test.go:82: Test log for unit-test-job:
    unit_test.go:83: # Running tests in gpu_unit_tests/tests/test_basic.sh
        ok - test_01_device_query
        ok - test_02_vector_add
        ok - test_03_bandwidth
        ok - test_04_dcgm_diagnostics
        # Running tests in gpu_unit_tests/tests/test_sysinfo.sh
        diff: test_sysinfo.sh.data/g4dn.8xlarge/numa_topo.txt: No such file or directory
        ok - test_numa_topo_topo
        diff: test_sysinfo.sh.data/g4dn.8xlarge/gpu_count.txt: No such file or directory
        ok - test_nvidia_gpu_count
        ok - test_nvidia_gpu_throttled
        ok - test_nvidia_gpu_unused
        diff: test_sysinfo.sh.data/g4dn.8xlarge/nvidia_persistence_status.txt: No such file or directory
        ok - test_nvidia_persistence_status
        diff: test_sysinfo.sh.data/g4dn.8xlarge/nvidia_smi_topo.txt: No such file or directory
        ok - test_nvidia_smi_topo
        
=== RUN   TestSingleNodeUnitTest/hpc-benckmarks
    env.go:438: Skipping feature "hpc-benckmarks": name not matched
--- PASS: TestSingleNodeUnitTest (130.10s)
    --- PASS: TestSingleNodeUnitTest/unit-test (130.10s)
        --- PASS: TestSingleNodeUnitTest/unit-test/Unit_test_Job_succeeds (130.01s)
    --- SKIP: TestSingleNodeUnitTest/hpc-benckmarks (0.00s)
PASS

- Use ClusterConfig from upstream eksctl to render the eksctl config file.
- Support more deployment flags.
- Support eksctl spec files.

Signed-off-by: Yutong Sun <[email protected]>
@ytsssun ytsssun force-pushed the eksctl-deployer-add-flags branch from 66d12e9 to 6359297 Compare May 30, 2025 21:34