[k8s_cloud] Add sshjump host support. #2201

aviweit · 2023-07-09T18:56:03Z

Add an initial support for ssh jump host.

By this, only single port (nodeport) is opened on a target k8s cluster where ssh connections established from skypilot user targeted to ray cluster - are tunneled through ssh jump host pod.

- it also applies to kubernetes where we use ssh jump host to access internal ray head pod

aviweit · 2023-07-09T19:13:06Z

sky/backends/backend_utils.py

@@ -958,7 +957,6 @@ def write_cluster_config(

                # Kubernetes only:
                'skypilot_k8s_image': k8s_image,
-                'ssh_key_secret_name': ssh_key_secret_name,


ssh_key_secret_name doesn't seem to be used.

Ah thanks for catching - this is removed in the current latest k8s_cloud branch.

aviweit · 2023-07-09T19:18:12Z

sky/templates/kubernetes-ray.yml.j2

@@ -17,7 +17,7 @@ provider:
    module: sky.skylet.providers.kubernetes.KubernetesNodeProvider

    # Use False if running from outside of k8s cluster
-    use_internal_ips: false
+    use_internal_ips: true


It causes ssh itself (not its proxy command) to target internal pod ipaddress.

aviweit · 2023-07-09T19:24:42Z

sky/backends/backend_utils.py

I am not sure how to use ssh_proxy_command_config handled here. Our ssh_proxy seems to be dynamic; port is dynamic NodePort.

Can we assume port assigned to nodeport doesn't change (since the service is created only once)?

Yes. The port is the same as long as the sshjump service is running.

aviweit · 2023-07-10T13:55:55Z

I confirm that launching the following task against kind, works well.

SSH interactions with ray head during provision phase, is done through sshjump host as well as user can access ray head with ssh where head is being reached via sshjump pod.

hello.yaml:

num_nodes: 1

resources:
  memory: 2
  cpus: 1
  # Optional; if left out, pick from the available clouds.
  cloud: kubernetes
workdir: .

setup: |
  echo "Running setup."

run: |
  echo "Hello, User, how are you?"
  conda env list

create a local kind k8s cluster:

sky local up

ensure cluster is up:

kubectl get nodes | grep skypilot-control-plane

launch the task

sky launch hello.yaml

romilbhardwaj

Thanks @aviweit! This is great. I skimmed through the code and left some comments. Haven't tried running yet. Will take another pass soon.

Can you also merge the latest changes from k8s_cloud branch? It has a few fixes, and some changes in how ssh auth works (especially after #2119), so would be good to verify this still works.

romilbhardwaj · 2023-07-13T04:40:26Z

Dockerfile_k8s_sshjump

+FROM continuumio/miniconda3:22.11.1
+
+RUN apt update -y && \
+    apt install sudo vim openssh-server -y && \
+    rm -rf /var/lib/apt/lists/*
+
+# Setup SSH and generate hostkeys
+RUN mkdir -p /var/run/sshd && \
+    sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
+    sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd && \
+    cd /etc/ssh/ && \
+    ssh-keygen -A
+
+RUN useradd -m -s /bin/bash sky && \
+    echo "sky ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers && \
+    echo 'Defaults        secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"' > /etc/sudoers.d/sky
+
+# Switch to sky user
+USER sky
+
+# Add /home/sky/.local/bin/ to PATH
+RUN echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc
+
+WORKDIR /home/sky


Curious - what do you think about reusing the image generated from Dockerfile_k8s (us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:latest) instead of creating a new one?

Pros: Reduces image maintenance burden for us, may be faster to pull since cluster would likely already have the main skypilot:latest image for running tasks
Cons: Dockerfile_k8s is quite bloated because it installs SkyPilot and its dependencies.

Might be worth looking at image sizes for both before taking a call.

Hi @romilbhardwaj ,

I built both images using build_image.sh and noticed the below size differences. I agree with your Pros and Cons.

$docker images REPOSITORY TAG IMAGE ID CREATED SIZE ... sshjump latest 9c4a7ca3d121 4 hours ago 479MB skypilot latest 448c7c3b7553 4 hours ago 1.32GB ...

Thanks @aviweit. Since this image will be pulled only once, lets remove this and reuse Dockerfile_k8s (us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:latest).

@romilbhardwaj , done.

romilbhardwaj · 2023-07-13T04:41:42Z

sky/authentication.py

@@ -10,6 +10,8 @@
 from typing import Any, Dict, Tuple
 import uuid

+from urllib.parse import urlparse


nit: since urllib is standard library, perhaps should be added at L11 before import uuid.

Hi @romilbhardwaj , I removed import urlparse, as it is now being used by clouds.Kubernetes.get_external_ip() at setup_kubernetes_authentication.

romilbhardwaj · 2023-07-13T04:43:39Z

sky/authentication.py

+    sshjump_name = clouds.Kubernetes.SKY_SSH_JUMP_NAME
+    sshjump_image =  clouds.Kubernetes.SSH_JUMP_IMAGE
+
+    def _to_sshjump_pod_spec(name, image, secret):


Can we define this pod spec as a jinja template (e.g., templates/k8s_ssh_jump.j2) that is filled in with name, image and secret through a function call? Same for the service spec below. Might be able to put them in the same file too.

Yes, I pushed the changes.

romilbhardwaj · 2023-07-13T04:44:56Z

sky/backends/backend_utils.py

Can we assume port assigned to nodeport doesn't change (since the service is created only once)?

romilbhardwaj · 2023-07-13T04:45:30Z

sky/backends/backend_utils.py

@@ -958,7 +957,6 @@ def write_cluster_config(

                # Kubernetes only:
                'skypilot_k8s_image': k8s_image,
-                'ssh_key_secret_name': ssh_key_secret_name,


Ah thanks for catching - this is removed in the current latest k8s_cloud branch.

k8s_cloud-sshjumphost2

- it seems not to be used anymore

- template file to store sshjump pod and service specs - update setup_kubernetes_authentication to use it

aviweit · 2023-07-13T15:10:11Z

Hi @romilbhardwaj , I have merged the PR with k8s_cloud branch and tested again (with the additional commits) - with the task yaml that is mentioned at #2201 (comment).

It worked well with skypilot installed on Ubuntu 20.04 host - against an external kubernetes cluster (deployed via kubeadm), and a one deployed locally (kind).

romilbhardwaj

Thanks @aviweit! We probably also need to handle the lifecycle of the jump pod - left some comments. Also, let's use us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:latest image for the ssh jump pod.

romilbhardwaj · 2023-07-17T23:13:34Z

Dockerfile_k8s_sshjump

+FROM continuumio/miniconda3:22.11.1
+
+RUN apt update -y && \
+    apt install sudo vim openssh-server -y && \
+    rm -rf /var/lib/apt/lists/*
+
+# Setup SSH and generate hostkeys
+RUN mkdir -p /var/run/sshd && \
+    sed -i 's/PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config && \
+    sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd && \
+    cd /etc/ssh/ && \
+    ssh-keygen -A
+
+RUN useradd -m -s /bin/bash sky && \
+    echo "sky ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers && \
+    echo 'Defaults        secure_path="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"' > /etc/sudoers.d/sky
+
+# Switch to sky user
+USER sky
+
+# Add /home/sky/.local/bin/ to PATH
+RUN echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc
+
+WORKDIR /home/sky


Thanks @aviweit. Since this image will be pulled only once, lets remove this and reuse Dockerfile_k8s (us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot:latest).

romilbhardwaj · 2023-07-18T22:32:34Z

sky/authentication.py

+    content = json.loads(_content)
+
+    try:
+        kubernetes.core_api().create_namespaced_pod('default', content['pod_spec'])


We need to think a bit about the lifecycle of this jump pod.

Creation - Creation of this jump ssh pod happens when the user creates their first SkyPilot cluster in the Kubernetes cluster.

Run - The jump SSH pod is kept alive as long as there is a SkyPilot cluster running (i.e., there exists a Kubernetes pod with label parent: skypilot in the same namespace)

Termination - If there hasn't been a pod with label parent: skypilot in the same namespace since the past n minutes (say, 10 minutes), this jump pod should terminate itself.

We can do this by running a script like this in the jump ssh pod as the command for the pod:

import os import datetime import pytz import time from kubernetes import client, config # Load kube config config.load_incluster_config() v1 = client.CoreV1Api() # Get the current namespace from the pod service account with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace", "r") as f: current_namespace = f.read() # Set the time delta for checking last active pods time_delta = datetime.timedelta(minutes=10) # Set delay for each retry retry_delay = 60 # In seconds while True: # Get the current time now = datetime.datetime.now(pytz.UTC) found = False # List the pods in the current namespace ret = v1.list_namespaced_pod(current_namespace) for i in ret.items: if i.metadata.labels and 'parent' in i.metadata.labels and i.metadata.labels['parent'] == 'skypilot': # Calculate the elapsed time since the pod was last active elapsed_time = now - i.metadata.creation_timestamp # If the pod was active in the last 10 minutes, set found to True if elapsed_time < time_delta: found = True break # If no active pods were found with the specified label, exit the script if not found: print("No active pods found with label 'parent: skypilot' in the past 10 minutes. Exiting...") exit(1) # If pods were found, sleep for the specified delay and then retry print(f"Active pods found with label 'parent: skypilot'. Retrying in {retry_delay} seconds...") time.sleep(retry_delay)

What do you think?

romilbhardwaj · 2023-07-18T22:39:39Z

sky/templates/kubernetes-sshjump.j2

@@ -0,0 +1,74 @@
+{


Can we have this file as a YAML instead of a json? Trying to have something that can also be easily used with kubectl apply -f for debugging/testing (of course, after replacing template variables).

@romilbhardwaj , yes. I changed. Thanks.

- use Dockerfile_k8s for sshjump host as well

aviweit · 2023-07-19T13:25:41Z

Hi @romilbhardwaj

Thanks a lot for the comments. I updated the code to reuse Dockerfile_k8s and updated the template format to yaml. I re-tested the recent changes in our environment and it seems to work well.

Your suggestion on sshjump lifecycle seems to be fine for me. There seems to be additional work such as defining RBAC for the pod to list other pods. I tested it and got that failure.

I would like to ask whether it would be ok to merge this PR and to open another PR to handle sshjump lifecycle? Other PRs can be probably opened to handle additional enhancements as needed.
What do you think?

Thanks again.

romilbhardwaj · 2023-07-20T00:01:44Z

@aviweit - this is great! Tried it out - works nicely.

There seems to be additional work such as defining RBAC for the pod to list other pods. I tested it and got that failure.

Is there some way to upload kubeconfig credentials so that jump pod can use those to authenticate with the control plane and list pods?

Asking because I think lifecycle management of the jump host is important to get right before we ship this. We do not want to leave an orphan ssh jump pod running on users' clusters after they are done using SkyPilot. If we can figure out a quick solution to the above problem, it'll be great to have in this PR.

If you think it will be an involved process, then it'll be good to have a draft solution verifying this kind of lifecycle management is possible. Once we can verify it can be done, then we can merge this PR and create another one to address it.

What do you think? Thanks again for adding this!

romilbhardwaj · 2023-07-22T01:01:07Z

@aviweit - I've made some updates to k8s_cloud branch. We use the autoscaler service account which has full permissions over the namespace. Is it possible to use the same autoscaler service account with this jump pod and use it to manage the lifecycle of the pod?

- remove skypilot label from sshjump related objects so that they are not currently being queried by sshjump lcm script

… k8s_cloud-sshjumphost2

aviweit · 2023-07-23T11:40:14Z

Hi @romilbhardwaj , I merged k8s_cloud changes into this PR and added lifecycle management of the jump host. It seems to work well in my environment. Can you please review?

Regarding reusing autoscaler service account, I think it would be good to define a dedicated one that will contain only the permissions needed by the sshjump host lifecycle logic.

I would like to ask whether this PR can be merged and if needed, additional PRs will be opened? This will save us time on keeping this PR up to date in respect to upstream changes.

@romilbhardwaj , I think that you will need to re-build and push Dockerfile_k8s because of adding sshjump-lcm.py.

Thanks again.

- add comments in sshjump lcm script

- update from greater to greater-equal

aviweit · 2023-07-26T13:22:54Z

Hi @romilbhardwaj , I have updated the code to create sshjump related resources in the correct namespace (and a small update to if statement in lcm script).

Thanks a lot.

aviweit · 2023-07-28T18:36:22Z

@romilbhardwaj , I added sshjump lcm to monitor only ray pods it is responsible for.
The changes are currently under sub-branch k8s_cloud-sshjumphost2-multiuser: https://github.com/aviweit/skypilot/tree/k8s_cloud-sshjumphost2-multiuser
commit: aviweit@21879c5.

If it is ok, it can be merged to this PR. Thanks.

Michaelvll · 2023-08-02T20:55:46Z

We should probably re-open this PR by changing the target branch. The original k8s_cloud branch has been deleted as it was merged to the master.

romilbhardwaj · 2023-08-03T02:15:31Z

Hey @aviweit ! This PR got closed when we merged #2096 into master.

For our next sprint, we'll be switching to k8s_cloud_beta1 branch. Can you re-open this PR on that branch?

aviweit added 5 commits July 9, 2023 17:45

Add ssh jumphost support.

162f535

Add Dockerfile for sshjump.

49a6324

Remove unused ssh_key_secret_name.

460fa15

get_head_ssh_port to return 22 for all clouds.

f10187c

- it also applies to kubernetes where we use ssh jump host to access internal ray head pod

Use get_external_ip of Kubernetes cloud.

b7dda3a

aviweit force-pushed the k8s_cloud-sshjumphost2 branch 3 times, most recently from 9063150 to e66cda1 Compare July 9, 2023 19:05

aviweit changed the title ~~Add sshjump host support.~~ [k8s_cloud] Add sshjump host support. Jul 9, 2023

aviweit commented Jul 9, 2023

View reviewed changes

concretevitamin requested a review from romilbhardwaj July 10, 2023 01:14

aviweit force-pushed the k8s_cloud-sshjumphost2 branch from e66cda1 to 832818c Compare July 10, 2023 07:57

Warn in case sshjump exists.

73e35d6

aviweit force-pushed the k8s_cloud-sshjumphost2 branch from 832818c to 73e35d6 Compare July 10, 2023 09:29

Add sshjump build.

2adefed

romilbhardwaj reviewed Jul 13, 2023

View reviewed changes

aviweit added 3 commits July 13, 2023 12:45

Merge remote-tracking branch 'origin/k8s_cloud' into

8cd2952

k8s_cloud-sshjumphost2

Remove import urlparse.

0d2fcb9

- it seems not to be used anymore

Create kubernetes-sshjump.j2.

9e8a085

- template file to store sshjump pod and service specs - update setup_kubernetes_authentication to use it

aviweit marked this pull request as ready for review July 13, 2023 15:10

romilbhardwaj reviewed Jul 18, 2023

View reviewed changes

aviweit added 2 commits July 19, 2023 11:42

Remove Dockerfile_k8s_sshjump.

4d125a9

- use Dockerfile_k8s for sshjump host as well

Update sshhost template to yaml format.

a5c388e

aviweit added 7 commits July 22, 2023 13:20

Merge branch 'k8s_cloud' into k8s_cloud-sshjumphost2

deead2c

Remove unused import.

4bb2d46

Add lifecycle to sshjump host.

01bf6d9

Define sshjump lcm script.

d925ac7

- remove skypilot label from sshjump related objects so that they are not currently being queried by sshjump lcm script

sshjump-lcm.py updates.

ae225cb

Delete sshjump service on pod termination.

cd64cef

Merge remote-tracking branch 'origin/k8s_cloud-sshjumphost2-lc2' into…

387fbf3

… k8s_cloud-sshjumphost2

aviweit added 4 commits July 23, 2023 15:40

Use single ServiceAccount, Role and RoleBinding for sshjump.

3201c99

- add comments in sshjump lcm script

Merge branch 'k8s_cloud' into k8s_cloud-sshjumphost2

af6d983

Create sshjump resources in the correct namespace.

4b4863f

sshjump-lcm.py: update if statement.

33067bb

- update from greater to greater-equal

Merge branch 'k8s_cloud' into k8s_cloud-sshjumphost2

f2b6c92

romilbhardwaj deleted the branch skypilot-org:k8s_cloud August 2, 2023 10:57

romilbhardwaj closed this Aug 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[k8s_cloud] Add sshjump host support. #2201

[k8s_cloud] Add sshjump host support. #2201

aviweit commented Jul 9, 2023 •

edited

Loading

aviweit Jul 9, 2023

romilbhardwaj Jul 13, 2023

aviweit Jul 9, 2023

aviweit Jul 9, 2023

romilbhardwaj Jul 13, 2023

aviweit Jul 13, 2023

aviweit commented Jul 10, 2023

romilbhardwaj left a comment

romilbhardwaj Jul 13, 2023

aviweit Jul 13, 2023

romilbhardwaj Jul 17, 2023

aviweit Jul 19, 2023

romilbhardwaj Jul 13, 2023

aviweit Jul 13, 2023

romilbhardwaj Jul 13, 2023

aviweit Jul 13, 2023

romilbhardwaj Jul 13, 2023

romilbhardwaj Jul 13, 2023

aviweit commented Jul 13, 2023 •

edited

Loading

romilbhardwaj left a comment •

edited

Loading

romilbhardwaj Jul 17, 2023

romilbhardwaj Jul 18, 2023

romilbhardwaj Jul 18, 2023

aviweit Jul 19, 2023

aviweit commented Jul 19, 2023

romilbhardwaj commented Jul 20, 2023

romilbhardwaj commented Jul 22, 2023

aviweit commented Jul 23, 2023 •

edited

Loading

aviweit commented Jul 26, 2023

aviweit commented Jul 28, 2023

Michaelvll commented Aug 2, 2023

romilbhardwaj commented Aug 3, 2023

[k8s_cloud] Add sshjump host support. #2201

[k8s_cloud] Add sshjump host support. #2201

Conversation

aviweit commented Jul 9, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aviweit commented Jul 10, 2023

romilbhardwaj left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aviweit commented Jul 13, 2023 • edited Loading

romilbhardwaj left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aviweit commented Jul 19, 2023

romilbhardwaj commented Jul 20, 2023

romilbhardwaj commented Jul 22, 2023

aviweit commented Jul 23, 2023 • edited Loading

aviweit commented Jul 26, 2023

aviweit commented Jul 28, 2023

Michaelvll commented Aug 2, 2023

romilbhardwaj commented Aug 3, 2023

aviweit commented Jul 9, 2023 •

edited

Loading

aviweit commented Jul 13, 2023 •

edited

Loading

romilbhardwaj left a comment •

edited

Loading

aviweit commented Jul 23, 2023 •

edited

Loading