-
Notifications
You must be signed in to change notification settings - Fork 635
[k8s] Allow per-context identity configuration + in-cluster auth fixes #4136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this @romilbhardwaj! I am bit concerned for passing around the context name just for looking up the remote identity. Just left a comment : )
{% if k8s_env_vars is not none %} | ||
env: | ||
{% for key, value in k8s_env_vars.items() %} | ||
- name: {{ key }} | ||
value: {{ value }} | ||
{% endfor %} | ||
{% endif %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be overwrite by a user specifying env
in their pod_config in config.yaml
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the dictionaries will be merged. E.g., with config.yaml like:
kubernetes:
pod_config:
spec:
containers:
- env:
- name: MY_ENV_VAR
value: "my_value"
Jobs controller pod has:
env:
- name: SKYPILOT_IN_CLUSTER_CONTEXT_NAME
value: gke_sky-dev-465_us-central1-c_gkeusc6
- name: MY_ENV_VAR
value: my_value
sky/clouds/kubernetes.py
Outdated
if isinstance(remote_identity, dict): | ||
# If remote_identity is a dict, use the service account for the | ||
# current context | ||
k8s_service_account_name = remote_identity.get(context, None) | ||
if k8s_service_account_name is None: | ||
err_msg = (f'Context {context!r} not found in ' | ||
'remote identities from config.yaml') | ||
raise ValueError(err_msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid the stacktrace with ux_utils
?
Will this work with failover, e.g., if an invalid remote_identity
is specified, will it cause SkyPilot failing to failover to other clouds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we enforce any context used to exist in the dict, should we add this to precheck for config, e.g. in the schema.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah good catch, added ux_utils
to hide stack trace.
Currently this doesn't work with failover, but maybe that's desirable? This failure is not a failure of the cloud, but rather configuration failure on the user's end, so maybe good to surface this instead of silently failing over? lmk if you think otherwise, I can figure out a way to move this further downstream in the code so it becomes a part of failover.
should we add this to precheck for config, e.g. in the schema.py?
That might be tricky since schema.py
only supports simple json schema validation. We can consider adding to skypilot_config.py::_try_load_config()
; noting that might require importing k8s module dynamically if this field is specified.
# Set environment variables for the pod. Note that SkyPilot env vars | ||
# are set separately when the task is run. These env vars are | ||
# independent of the SkyPilot task to be run. | ||
k8s_env_vars = { | ||
kubernetes_utils.IN_CLUSTER_CONTEXT_NAME_ENV_VAR: context | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not super sure of using this env var for passing around the context just for looking up the context in the remote_identity dict (correct me if I am wrong).
If this is mainly for looking up a remote identity for jobs/serve controller, it might be better to just replace the key in remote_identity
to in-cluster
instead, like the one we did here:
skypilot/sky/utils/controller_utils.py
Lines 600 to 626 in f2991b1
def replace_skypilot_config_path_in_file_mounts( | |
cloud: 'clouds.Cloud', file_mounts: Optional[Dict[str, str]]): | |
"""Replaces the SkyPilot config path in file mounts with the real path.""" | |
# TODO(zhwu): This function can be moved to `backend_utils` once we have | |
# more predefined file mounts that needs to be replaced after the cluster | |
# is provisioned, e.g., we may need to decide which cloud to create a bucket | |
# to be mounted to the cluster based on the cloud the cluster is actually | |
# launched on (after failover). | |
if file_mounts is None: | |
return | |
replaced = False | |
for remote_path, local_path in list(file_mounts.items()): | |
if local_path is None: | |
del file_mounts[remote_path] | |
continue | |
if local_path.endswith(_LOCAL_SKYPILOT_CONFIG_PATH_SUFFIX): | |
with tempfile.NamedTemporaryFile('w', delete=False) as f: | |
user_config = common_utils.read_yaml(local_path) | |
config = _setup_proxy_command_on_controller(cloud, user_config) | |
common_utils.dump_yaml(f.name, dict(**config)) | |
file_mounts[remote_path] = f.name | |
replaced = True | |
if replaced: | |
logger.debug(f'Replaced {_LOCAL_SKYPILOT_CONFIG_PATH_SUFFIX} ' | |
f'with the real path in file mounts: {file_mounts}') | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than looking up remote_identity
, this envvar is also required to maintain parity with the region<->context mapping that we have created for SkyPilot.
E.g., if a user defines a YAML like:
resources:
cloud: kubernetes
region: gke_sky-dev-465_us-central1-c_gkeusc6
This YAML does not work on sky jobs launch
on current master:
(sky-3ff3-romilb, pid=1481) ValueError: Context gke_sky-dev-465_us-central1-c_gkeusc6 not found in kubeconfig. Kubernetes only supports context names as regions. Available contexts: ['in-cluster']
With this PR, we can have region map to the right context consistently, even if it may be running in-cluster.
This mapping is also needed in #4188 to allow our api server to use the context name instead of in-cluster
when running in a Kubernetes cluster.
Closes #4131. Also updates how in-cluster auth/context detection works. Instead of using
in-cluster
, we now pass the context name so that the remote_identity specified in config.yaml can be parsed correctly.Tested:
sky jobs launch
on asky local up
and GKE cluster, with both clusters configured in allowed_contexts.