[Improvement] Run static fleet checks only if there are static nodes #2960

Open
himani2411 wants to merge 3 commits into develop from develop-min-count-ad

himani2411 (Contributor) commented May 24, 2025

Description of changes

  • Add the cluster's TOTAL_MIN_COUNT as a comment in /opt/slurm/etc/slurm_parallelcluster.conf. The value is read later to determine whether the cluster has any static nodes.
  • Run static fleet checks only if there are static nodes, and run Protected Mode checks irrespective of node type.
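
The gating described in the second bullet can be sketched roughly as follows. This is only an illustration, not the code in this PR: `run_readiness_checks` and the `checks` hash of callables are hypothetical names, and the real checks live in the cookbook.

```ruby
# Hypothetical sketch: decide which checks to run from the static node count.
# `checks` maps a check name to a callable; both names are assumptions.
def run_readiness_checks(static_node_count, checks)
  performed = []

  # Static fleet checks only make sense when at least one static node exists.
  if static_node_count.positive?
    checks[:static_fleet].call
    performed << :static_fleet
  end

  # Protected Mode checks run irrespective of node type.
  checks[:protected_mode].call
  performed << :protected_mode

  performed
end
```

With a count of 0 (a dynamic-only cluster) only the Protected Mode check runs; with any positive count both run.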

Tests

  • Manual test for updating /opt/slurm/etc/slurm_parallelcluster.conf
test-suites:
  update:
    test_update.py::test_dynamic_file_systems_update:
      dimensions:
      - instances:
        - c5.xlarge
        oss:
        - alinux2023
        regions:
        - eu-west-2
        schedulers:
        - slurm
    test_update.py::test_dynamic_file_systems_update_data_loss:
      dimensions:
      - instances:
        - c5.xlarge
        oss:
        - rhel9
        regions:
        - eu-west-2
        schedulers:
        - slurm
    test_update.py::test_dynamic_file_systems_update_rollback:
      dimensions:
      - instances:
        - c5.xlarge
        oss:
        - rocky9
        regions:
        - rocky9
        schedulers:
        - slurm
    test_update.py::test_login_nodes_update:
      dimensions:
      - instances:
        - c5.xlarge
        oss:
        - rhel8
        regions:
        - us-east-2
        schedulers:
        - slurm
    test_update.py::test_multi_az_create_and_update:
      dimensions:
      - instances:
        - c5.xlarge
        oss:
        - alinux2
        regions:
        - eu-west-2
        schedulers:
        - slurm
    test_update.py::test_update_slurm:
      dimensions:
      - instances:
        - c5.xlarge
        oss:
        - rhel8
        regions:
        - eu-central-1

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

himani2411 requested review from a team as code owners May 24, 2025 02:08
himani2411 marked this pull request as draft May 24, 2025 02:11
himani2411 force-pushed the develop-min-count-ad branch from c52db0b to 9674576 May 27, 2025 16:01
himani2411 marked this pull request as ready for review June 6, 2025 18:37
* Run static fleet checks if there are any static nodes

Adding it as condition
himani2411 force-pushed the develop-min-count-ad branch from 86b8eb4 to 356004f June 6, 2025 18:46
himani2411 changed the title from "[Bug] Run static fleet checks only if there are static nodes" to "[Improvement] Run static fleet checks only if there are static nodes" Jun 6, 2025
@@ -178,6 +178,11 @@ def wait_cluster_ready
  end
end

def get_static_node_count
  cmd = Mixlib::ShellOut.new("cat #{node['cluster']['slurm']['install_dir']}/etc/slurm_parallelcluster.conf | grep -o '#TOTAL_MIN_COUNT=\([0-9]*\)' | cut -d'=' -f2")

Contributor

This functionality depends on the presence of a comment in a config file.
That is a fragile approach, since comments are meant to be removable without any side effects.

Why not get the total number of static nodes from the cluster config file instead?
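
A minimal sketch of that alternative, assuming the cluster config is the usual ParallelCluster YAML with `Scheduling`, `SlurmQueues`, `ComputeResources` and `MinCount` keys. The function name is hypothetical, and where the config file lives on the node is not shown in this PR, so the sketch takes the YAML text as input.

```ruby
require 'yaml'

# Hypothetical helper: sum MinCount over every compute resource in the
# Scheduling section. A missing section or missing MinCount counts as 0.
def static_node_count_from_config(config_yaml)
  config = YAML.safe_load(config_yaml) || {}
  queues = config.dig('Scheduling', 'SlurmQueues') || []

  queues.sum do |queue|
    (queue['ComputeResources'] || []).sum { |resource| resource['MinCount'].to_i }
  end
end
```

This avoids depending on a comment, at the cost of walking the Scheduling section a second time.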

Contributor Author

That was the initial idea. However, I wanted to find where we already loop through the cluster config, specifically the Scheduling section, and this file is where we do it.

Even if somebody removes the comment, they can only do so after the HeadNode has been created. And if the comment is missing, we can still assume that the number of static nodes is 1 and continue down the old path.
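
The fallback described here could look roughly like the sketch below. It is not the code in the diff: `parse_total_min_count` and `DEFAULT_STATIC_NODE_COUNT` are hypothetical names, and it assumes the marker sits on its own line as `#TOTAL_MIN_COUNT=<digits>`.

```ruby
# Assumed default when the comment is missing: treat the cluster as having
# a static node so the old code path (static fleet checks) still runs.
DEFAULT_STATIC_NODE_COUNT = 1

# Hypothetical helper: extract the count from the text of
# slurm_parallelcluster.conf, falling back to the default.
def parse_total_min_count(conf_text)
  match = conf_text.match(/^#TOTAL_MIN_COUNT=(\d+)\s*$/)
  match ? match[1].to_i : DEFAULT_STATIC_NODE_COUNT
end
```

Parsing the file contents in Ruby also keeps the fallback explicit, whereas an empty result from a shell pipeline would have to be handled separately.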
