[AWS] ability to specify transient per-cluster security group #5317
Related issue: #3688
Mini-proposal:
The issue raises a good point about how we deal with security groups in AWS. Currently, if the user doesn't explicitly designate a security group, SkyPilot creates a long-lived default security group and uses it. This security group is not deleted on cluster down.
It is worth noting that this behavior is somewhat intended. Because SkyPilot does not have to delete the long-lived security group, the `sky down` command (or SDK/API call) can return before the instance is terminated. A security group can only be deleted once its attached instances are deleted, so if SkyPilot were to delete the group, it would have to wait for the instances to terminate before deleting it and returning from `sky down`, significantly increasing the call duration. I.e., the current behavior is closer to a feature than a bug.

However, it is understandable that some users may want SkyPilot to clean up after itself even at the cost of a longer `sky down`. I do think SkyPilot should at least give users the option to choose which side of the tradeoff they want to be on. This PR enables the user to use a per-cluster security group, which is cleaned up on `sky down`.
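The ordering constraint behind this tradeoff can be sketched as follows. This is a minimal illustration, not SkyPilot's actual implementation: the two function names are hypothetical, and `ec2` is assumed to be a boto3 EC2 client (or any object exposing the same `terminate_instances`, `get_waiter`, and `delete_security_group` calls).

```python
from typing import Any, List


def down_with_long_lived_sg(ec2: Any, instance_ids: List[str]) -> None:
    """Hypothetical helper: request termination and return immediately.

    The shared security group is left in place, so there is nothing to wait for.
    """
    ec2.terminate_instances(InstanceIds=instance_ids)


def down_with_per_cluster_sg(ec2: Any, instance_ids: List[str],
                             group_id: str) -> None:
    """Hypothetical helper: terminate, wait, then delete the cluster's group."""
    ec2.terminate_instances(InstanceIds=instance_ids)
    # A security group cannot be deleted while instances are still attached,
    # so block until termination has completed. This wait is the extra
    # latency discussed above.
    ec2.get_waiter('instance_terminated').wait(InstanceIds=instance_ids)
    ec2.delete_security_group(GroupId=group_id)
```

The first function returns as soon as the termination request is accepted. The second cannot return until the waiter finishes, which is where the longer `sky down` comes from.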
This option is especially useful for our Buildkite tests, which run on an ephemeral container. Long-lived security groups aren't reused in that case because the instance IDs are essentially one-time use, so it's better to use transient groups that get cleaned up on `sky down`.
Which behavior we want to support as the default (per-cluster security group or long-lived security group) is up for discussion.
Tested (run the relevant ones):
bash format.sh
Specify the `per-cluster` security group name, and verify that a per-cluster security group managed by SkyPilot is used and that the security group is deleted on cluster down