[Provisioner] New provisioner for GCP TPU VM #2898
Merged
…t into new_provisioner_gcp_tpu_vm
… into new_provisioner_gcp_tpu_vm
2c658bf
to
38decda
Compare
Michaelvll
added a commit
that referenced
this pull request
Jan 1, 2024
* init * remove ray * update config * update * update * update * complete bootstrapping * add start instance * fix * fix * fix * update * wait stopping instances * support normal gcp tpus first * fix gcp * support get cluster info * fix * update * wait for instance starting * rename * hide gcp package import * fix * fix * update constants * fix comments * remove unused methods * fix comments * sync 'config' & 'constants' with upstream, Nov 16 * sync 'instace_utils' with the upstream, Nov 16 * fix typing * parallelize provisioning * Fix TPU node * Fix TPU NAME env for tpu node * implement bulk provision * refactor selflink * format * reduce the sleep time for autostop * provisioner version refactoring * refactor * Add logging * avoid saving the provisioner version * format * format * Fix scheduling field in config * format * fix public key content * Fix provisioner version for azure * Use ray port from head node for workers * format * fix ray_port * fix smoke tests * shorter sleep time * refactor status refresh version * [Provisioner] Support reserved instances in GCP (#2824) * Support reserved instances * remove min max count * remove unecessary fields * Add todo * Add todo * remove unused reseravation config * Fix config.yaml tests * format * sync with the upstream (Dec 05, 23) * set timeout and retries * handle GCP creation errors * Fix provisioning errors and improve error handling * update blocklist for GCP * refactor code for linting issues * fix * show instance status during assertion error * Refactor error handling for failover * adopt changes in #2854 * format * retry for wait operation * format * fix typo * fix interface * more robust zone to region * Fix tpu vm external IP setup * Fix get node * format * revert for TPU VM pod * Fix get_cluster_info call * fix tab * Fix timeout case * remvoe \t * GCP query statuses with new provisioner * format * fix import * refactor query status * fix stopped status * Fix stopped status * Add head ray start command * Add 
back keys * add workers * Fix non stopped states * Add more logs for autostop * format * increase job_docker job time * better logging * shorter time for recovering * fix conflicting var * change to V1 * fix comments * refactor constants * refactoring * typo * Fix max retry * longer sleep time for job * add detach setup * revert --detach-setup * shorter time for recovering * more retries * Update sky/provision/instance_setup.py Co-authored-by: Zongheng Yang <[email protected]> * Update sky/backends/cloud_vm_ray_backend.py Co-authored-by: Zongheng Yang <[email protected]> * format * [Provisioner] New provisioner for GCP TPU VM (#2898) * init * test * test ins_type * fix * format.. * wip * remove TPU config * fix node ips * Fix TPU VM pod * format * use TPU VM as default * Fix example for TPU VM * format * fix optimizer random dag * set TPU-VM * accelerator_args False * backward compatibility * add tpu filter for tests * fix * Fix * fix status refresh for tpu VM pod * Support autodown for TPU VM pod * Allow multi-node TPU VM pod * Allow multi-node TPU VM pod * fix * add execute for operation * avoid from * Wait for pending before set_labels * format * refactor constants * Fix for API changes * remove GCP failover handler v1 * format * remove TPU VM pod specific codes as they have been moved to new provisioner * Add error handling for TPU pod case * fix * fix multiple node calculation * refactor tpu_utils to gcp_utils * shorter time for recovering * format --------- Co-authored-by: Wei-Lin Chiang <[email protected]> * better error logging * Fix logging for TPU VM * Fix logging * Add insufficientCapacity to error handler * Avoid adding duplicated resources to blocked_resources * Fix blocked resources * address comment * add comment * Add comments * format * Fix num_node_ips * format * fix smoke test for preinstalled package * shorter wait time for recovering * Fix TPU VM pod stop * format * Update sky/provision/gcp/instance_utils.py Co-authored-by: Zongheng Yang 
<[email protected]> * update * format * Add debug message * revert version for handle * disable tpu name set --------- Co-authored-by: Zhanghao Wu <[email protected]> Co-authored-by: Zongheng Yang <[email protected]> Co-authored-by: Wei-Lin Chiang <[email protected]>
This PR adds TPU VM support to our new provisioner API, which will significantly simplify the handling of many edge cases in our backend.
This PR is blocked by #1758
We should merge #2681 first and change this PR to merge into master directly.
This PR also adds the following support:
Future TODO:
Tested (run the relevant ones):
bash format.sh
sky launch examples/tpu/tpuvm_mnist.yaml -c test-tpuvm -i 0 --down (TPU VM)
sky launch examples/tpu/tpuvm_mnist.yaml --gpus tpu-v2-32 -c test-tpuvm -i 0 --down (TPU VM pod)
sky launch examples/tpu/tpuvm_mnist.yaml --gpus tpu-v2-32 -c test-tpuvm -i 0 --down --use-spot (spot TPU VM pod)
sky launch examples/tpu/tpuvm_mnist.yaml --gpus tpu-v2-32 --num-nodes 2 -c test-tpuvm -i 0 --down; sky exec --gpus tpu-v2-32 --num-nodes 2 -c test-tpuvm echo hi; sky status -r test-tpuvm
pytest tests/test_smoke.py --tpu
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh
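The `sky launch` commands in the test list above differ only in a few flags, so the variants can be generated from one helper. The following is a minimal sketch, not part of this PR: `launch_cmd`, `VARIANTS`, and the label "multi-node TPU VM pod" are hypothetical names introduced here for illustration. Only flags that appear in the list above are used, and the script prints the commands without running them.

```python
import shlex
from typing import Dict, List, Optional

# Values taken from the tested commands listed in the PR description.
YAML = "examples/tpu/tpuvm_mnist.yaml"
CLUSTER = "test-tpuvm"


def launch_cmd(gpus: Optional[str] = None,
               num_nodes: Optional[int] = None,
               use_spot: bool = False) -> List[str]:
    """Build one `sky launch` argument list (hypothetical helper)."""
    cmd = ["sky", "launch", YAML]
    if gpus is not None:
        cmd += ["--gpus", gpus]
    if num_nodes is not None:
        cmd += ["--num-nodes", str(num_nodes)]
    # Every tested variant uses the same cluster name, a zero-minute
    # idle timeout (-i 0), and --down.
    cmd += ["-c", CLUSTER, "-i", "0", "--down"]
    if use_spot:
        cmd.append("--use-spot")
    return cmd


# Labels mirror the annotations in the test list; the last one is assumed.
VARIANTS: Dict[str, List[str]] = {
    "TPU VM": launch_cmd(),
    "TPU VM pod": launch_cmd(gpus="tpu-v2-32"),
    "spot TPU VM pod": launch_cmd(gpus="tpu-v2-32", use_spot=True),
    "multi-node TPU VM pod": launch_cmd(gpus="tpu-v2-32", num_nodes=2),
}

if __name__ == "__main__":
    for label, cmd in VARIANTS.items():
        # Print only; pass `cmd` to subprocess.run to actually launch.
        print(f"# {label}")
        print(shlex.join(cmd))
```

Keeping each command as an argument list rather than a string means it can be handed to `subprocess.run` unchanged if the variants should be executed instead of printed.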