[Provisioner] New provisioner for GCP TPU VM #2898

Merged
Michaelvll merged 51 commits into new_provisioner_gcp from new_provisioner_gcp_tpu_vm
Dec 28, 2023

Michaelvll
Collaborator

@Michaelvll Michaelvll commented Dec 25, 2023

This PR adds TPU VM support to our new provisioner API, which will significantly simplify the handling of many edge cases in our backend.
This PR is blocked by #1758

We should merge #2681 first and change this PR to merge into master directly.

This PR also adds the following support:

  1. Autodown for a TPU VM.
  2. Multi-node TPU VM pods.

Future TODO:

  • An example of training on 2 TPU VM pods (tpu-v2-32)
  • Move the TPU node implementation into the new provisioner

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch examples/tpu/tpuvm_mnist.yaml -c test-tpuvm -i 0 --down (TPU VM)
    • sky launch examples/tpu/tpuvm_mnist.yaml --gpus tpu-v2-32 -c test-tpuvm -i 0 --down (TPU VM pod)
    • sky launch examples/tpu/tpuvm_mnist.yaml --gpus tpu-v2-32 -c test-tpuvm -i 0 --down --use-spot (spot TPU VM pod)
    • sky launch examples/tpu/tpuvm_mnist.yaml --gpus tpu-v2-32 --num-nodes 2 -c test-tpuvm -i 0 --down; sky exec --gpus tpu-v2-32 --num-nodes 2 -c test-tpuvm echo hi; sky status -r test-tpuvm
    • pytest tests/test_smoke.py --tpu
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh
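
The manual `sky launch` checks above differ only in the accelerator, node count, and spot flag, so they can be generated from one helper. The following is a minimal sketch, not part of this PR: the helper name `build_launch_command` and the `MANUAL_TESTS` table are hypothetical, and it only assembles the command lines listed above without running them.

```python
import shlex

# Task file and cluster name taken from the manual test commands in this PR.
TASK_YAML = "examples/tpu/tpuvm_mnist.yaml"
CLUSTER = "test-tpuvm"


def build_launch_command(accelerator=None, num_nodes=None, use_spot=False):
    """Assemble one `sky launch` command line as a list of arguments.

    Hypothetical helper for illustration; it only uses flags that appear
    in the test list above (--gpus, --num-nodes, -c, -i, --down, --use-spot).
    """
    cmd = ["sky", "launch", TASK_YAML]
    if accelerator is not None:
        cmd += ["--gpus", accelerator]
    if num_nodes is not None:
        cmd += ["--num-nodes", str(num_nodes)]
    # `-i 0 --down` requests autodown as soon as the cluster becomes idle.
    cmd += ["-c", CLUSTER, "-i", "0", "--down"]
    if use_spot:
        cmd.append("--use-spot")
    return cmd


# The four launch variants exercised manually in this PR.
MANUAL_TESTS = {
    "TPU VM": build_launch_command(),
    "TPU VM pod": build_launch_command("tpu-v2-32"),
    "spot TPU VM pod": build_launch_command("tpu-v2-32", use_spot=True),
    "multi-node TPU VM pod": build_launch_command("tpu-v2-32", num_nodes=2),
}

if __name__ == "__main__":
    for name, cmd in MANUAL_TESTS.items():
        # Print instead of executing, so the sketch is safe to run anywhere.
        print(f"{name}: {shlex.join(cmd)}")
```

To actually run a variant, the argument list could be passed to `subprocess.run(cmd, check=True)` on a machine with SkyPilot installed and GCP credentials configured.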

@Michaelvll Michaelvll force-pushed the new_provisioner_gcp_tpu_vm branch from 2c658bf to 38decda Compare December 26, 2023 12:37
@Michaelvll Michaelvll marked this pull request as ready for review December 26, 2023 12:47
@Michaelvll Michaelvll mentioned this pull request Dec 27, 2023
5 tasks
@Michaelvll Michaelvll merged commit 049d219 into new_provisioner_gcp Dec 28, 2023
@Michaelvll Michaelvll deleted the new_provisioner_gcp_tpu_vm branch December 28, 2023 07:13
Michaelvll added a commit that referenced this pull request Jan 1, 2024
* init

* remove ray

* update config

* update

* update

* update

* complete bootstrapping

* add start instance

* fix

* fix

* fix

* update

* wait stopping instances

* support normal gcp tpus first

* fix gcp

* support get cluster info

* fix

* update

* wait for instance starting

* rename

* hide gcp package import

* fix

* fix

* update constants

* fix comments

* remove unused methods

* fix comments

* sync 'config' & 'constants' with upstream, Nov 16

* sync 'instace_utils' with the upstream, Nov 16

* fix typing

* parallelize provisioning

* Fix TPU node

* Fix TPU NAME env for tpu node

* implement bulk provision

* refactor selflink

* format

* reduce the sleep time for autostop

* provisioner version refactoring

* refactor

* Add logging

* avoid saving the provisioner version

* format

* format

* Fix scheduling field in config

* format

* fix public key content

* Fix provisioner version for azure

* Use ray port from head node for workers

* format

* fix ray_port

* fix smoke tests

* shorter sleep time

* refactor status refresh version

* [Provisioner] Support reserved instances in GCP (#2824)

* Support reserved instances

* remove min max count

* remove unecessary fields

* Add todo

* Add todo

* remove unused reseravation config

* Fix config.yaml tests

* format

* sync with the upstream (Dec 05, 23)

* set timeout and retries

* handle GCP creation errors

* Fix provisioning errors and improve error handling

* update blocklist for GCP

* refactor code for linting issues

* fix

* show instance status during assertion error

* Refactor error handling for failover

* adopt changes in #2854

* format

* retry for wait operation

* format

* fix typo

* fix interface

* more robust zone to region

* Fix tpu vm external IP setup

* Fix get node

* format

* revert for TPU VM pod

* Fix get_cluster_info call

* fix tab

* Fix timeout case

* remvoe \t

* GCP query statuses with new provisioner

* format

* fix import

* refactor query status

* fix stopped status

* Fix stopped status

* Add head ray start command

* Add back keys

* add workers

* Fix non stopped states

* Add more logs for autostop

* format

* increase job_docker job time

* better logging

* shorter time for recovering

* fix conflicting var

* change to V1

* fix comments

* refactor constants

* refactoring

* typo

* Fix max retry

* longer sleep time for job

* add detach setup

* revert --detach-setup

* shorter time for recovering

* more retries

* Update sky/provision/instance_setup.py

Co-authored-by: Zongheng Yang <[email protected]>

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zongheng Yang <[email protected]>

* format

* [Provisioner] New provisioner for GCP TPU VM (#2898)

* init

* test

* test ins_type

* fix

* format..

* wip

* remove TPU config

* fix node ips

* Fix TPU VM pod

* format

* use TPU VM as default

* Fix example for TPU VM

* format

* fix optimizer random dag

* set TPU-VM

* accelerator_args False

* backward compatibility

* add tpu filter for tests

* fix

* Fix

* fix status refresh for tpu VM pod

* Support autodown for TPU VM pod

* Allow multi-node TPU VM pod

* Allow multi-node TPU VM pod

* fix

* add execute for operation

* avoid from

* Wait for pending before set_labels

* format

* refactor constants

* Fix for API changes

* remove GCP failover handler v1

* format

* remove TPU VM pod specific codes as they have been moved to new provisioner

* Add error handling for TPU pod case

* fix

* fix multiple node calculation

* refactor tpu_utils to gcp_utils

* shorter time for recovering

* format

---------

Co-authored-by: Wei-Lin Chiang <[email protected]>

* better error logging

* Fix logging for TPU VM

* Fix logging

* Add insufficientCapacity to error handler

* Avoid adding duplicated resources to blocked_resources

* Fix blocked resources

* address comment

* add comment

* Add comments

* format

* Fix num_node_ips

* format

* fix smoke test for preinstalled package

* shorter wait time for recovering

* Fix TPU VM pod stop

* format

* Update sky/provision/gcp/instance_utils.py

Co-authored-by: Zongheng Yang <[email protected]>

* update

* format

* Add debug message

* revert version for handle

* disable tpu name set

---------

Co-authored-by: Zhanghao Wu <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Wei-Lin Chiang <[email protected]>