Skip to content

[New Provisioner] New provisioner for k8s #3019

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 74 commits into from
Jan 30, 2024
Merged

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Jan 24, 2024

It supports all the functionality of the original k8s support, and also supports custom image (adopted from #2599)

Special thanks to @landscapepainter for the excellent contribution to the custom image support PR #2599, which formulates this PR.
Co-authored-by: @landscapepainter

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
    • pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials"
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll marked this pull request as ready for review January 24, 2024 08:01
@romilbhardwaj
Copy link
Collaborator

@Michaelvll I've pushed some changes to:

  1. Removed the python3 and python3-pip installation during setup. This was unnecessary since we install conda anyway.
  2. Added a simple kubernetes_handler under FailoverCloudErrorHandlerV2 to surface any Kubernetes provisioning errors to the suer.

Also ran some quick benchmarks to test provisioning speed compared to current master (Ray autoscaler based provisioning) using the default SkyPilot image:

#==== New provisioner ====
# Single node
time sky launch -y -c test -- echo hi # 1:11.45 total
# Multi node
time sky launch -y -c test --num-nodes 2 -- echo hi # 1:21.14 total


#==== Old Ray provisioner ====
# Single node
time sky launch -y -c test -- echo hi # 1:08.87 total
# Multi node
time sky launch -y -c test --num-nodes 2 -- echo hi # 1:40.66 total

I guess some slowdown is expected because of custom image support, which runs additional setup steps (running apt update etc), which negates the speed ups from using the new provisioner. Overall, provisioning time is roughly the same for single node, and we see some improvement for multi-node.

Running smoke tests now.

Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smoke tests (both Kubernetes and general) pass! LGTM.

@romilbhardwaj
Copy link
Collaborator

romilbhardwaj commented Jan 29, 2024

Updated docker images to 1) use the same base image as SkyPilot official docker image continuumio/miniconda3:23.3.1-0, 2) Remove unnecessary dependencies.

Tested with simple sky launch, running k8s smoke tests now.

Also tested backward compatibility (i.e., new image also works with the old provisioner).

Update - smoke tests pass, I have updated the image at us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot[-gpu]:latest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[k8s] change the container images for k8s [K8s] Install SSH server in setup phase instead of image
4 participants