Skip to content

[AWS] ability to specify transient per-cluster security group #5317

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

SeungjinYang
Copy link
Collaborator

@SeungjinYang SeungjinYang commented Apr 23, 2025

Related issue: #3688

Mini-proposal:

The issue raises a good point about how we deal with security groups in AWS. Currently, if the user doesn't specifically designate a security group to use, SkyPilot creates a long lived default security group to use. This security group is not deleted on cluster down.

It is worth noting this behavior is somewhat intended though. The benefit we gain by not having to delete the long lived security group for the cluster is that we can return from sky down command (or SDK/API call) before the instance is terminated. If SkyPilot is to delete the security group, which can only be deleted once attached instances are deleted, SkyPilot is forced to wait for instances to be terminated before deleting the security group and returning from sky down, significantly increasing the call duration. I.e. the current behavior is closer to a feature than a bug.

However, it is understandable that some users may want SkyPilot to clean up after itself even at the cost of increased call duration of sky down. I do think SkyPilot should at least give users the option to choose what side of the tradeoff they want to be at. This PR enables the user to use a per-cluster security group which is cleaned up on sky down

This option is especially useful for our buildkite tests which run on an ephemeral container - long lived security groups aren't reused in that case because the instance ids are essentially one time use, so it's better to use transient groups which get cleaned up on sky down.

What behavior we want to support as default (per cluster sg or long lived sg) is up for discussion.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • Manual: specify per-cluster security group name and verify a per-cluster security group managed by SkyPilot is used, and the sg is deleted on cluster down

@SeungjinYang SeungjinYang force-pushed the per-context-sg branch 2 times, most recently from f0baa37 to 3e66f98 Compare April 23, 2025 00:55
@SeungjinYang SeungjinYang changed the title [aws] ability to specify transient per-cluster security group [AWS] ability to specify transient per-cluster security group Apr 23, 2025
@SeungjinYang SeungjinYang marked this pull request as ready for review April 23, 2025 18:49
@cg505
Copy link
Collaborator

cg505 commented Apr 23, 2025

Re: #3688 presumably there a separate issue that causes the crash? (ValueError: Cannot find new security group sky-sg-llama4-9c67.) We should also fix that.
Maybe I'm misunderstanding the issue though.

@SeungjinYang
Copy link
Collaborator Author

That error message actually can print in two cases:

  • there is 0 security group with that name
  • there is more than 1 security group with that name (in which case we can't "find" the security group because it is impossible to definitively identify the security group to use)
    The first line of the output in the review is Found 2 security groups with name sky-sg-llama4-9c67. which lets me know we're hitting the second case.

As to why that can happen, I actually think the user ran into the bug to be fixed by #5316 - so the previous cluster's sg never got deleted, leading to multiple security groups existing when another named cluster is created.

@cg505
Copy link
Collaborator

cg505 commented Apr 24, 2025

Still, we should handle that case, right? Either by somehow ensuring that we never recreate the same per-cluster SG, or by gracefully handling the case where SkyPilot has somehow accidentally created two SGs to open ports. (or both)
Either way, I don't think this PR should close #3688

@SeungjinYang
Copy link
Collaborator Author

ah yes, I made PR #5316 close issue #3688 instead

@SeungjinYang SeungjinYang requested a review from zpoint April 24, 2025 02:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants