[Policy] Add SpotHedge. #4628

Merged
merged 17 commits into from
Mar 30, 2025

Conversation

cblmemo
Collaborator

@cblmemo cblmemo commented Jan 31, 2025

Add SpotHedge policy.

TODO:

Initially, 3 spot replicas are launched across regions, along with 3 on-demand fallback replicas:

$ sky serve up tests/skyserve/spot/spot_hedge.yaml --env HF_TOKEN -n spot-hedge
$ sky serve status -a
Services
NAME        VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT                    AUTOSCALING_POLICY                                                                             LOAD_BALANCING_POLICY  REQUESTED_RESOURCES  
spot-hedge  -        -       NO_REPLICA  0/6       http://54.211.240.64:30001  Autoscaling from 2 to 5 replicas with 1 overprovisioned replicas (target QPS per replica: 10)  least_load             1x[L4:1]             

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT  LAUNCHED     RESOURCES                                                                STATUS        REGION           ZONE               
spot-hedge    1   1        -         56 secs ago  1x GCP(g2-standard-32[Spot], {'L4': 1}, disk_tier=best, ports=['8081'])  PROVISIONING  asia-northeast3  asia-northeast3-a  
spot-hedge    2   1        -         56 secs ago  1x GCP(g2-standard-32[Spot], {'L4': 1}, disk_tier=best, ports=['8081'])  PROVISIONING  asia-northeast3  asia-northeast3-b  
spot-hedge    3   1        -         56 secs ago  1x GCP(g2-standard-32[Spot], {'L4': 1}, disk_tier=best, ports=['8081'])  PROVISIONING  asia-east1       asia-east1-a       
spot-hedge    4   1        -         55 secs ago  1x GCP(g2-standard-32, {'L4': 1}, disk_tier=best, ports=['8081'])        PROVISIONING  us-east4         us-east4-a         
spot-hedge    5   1        -         56 secs ago  1x GCP(g2-standard-32, {'L4': 1}, disk_tier=best, ports=['8081'])        PROVISIONING  us-east4         us-east4-a         
spot-hedge    6   1        -         56 secs ago  1x GCP(g2-standard-32, {'L4': 1}, disk_tier=best, ports=['8081'])        PROVISIONING  us-east4         us-east4-a

Later, only the 3 spot replicas are kept once they become ready:

$ sky serve status -a
Services
NAME        VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT                    AUTOSCALING_POLICY                                                                             LOAD_BALANCING_POLICY  REQUESTED_RESOURCES  
spot-hedge  1        3m 39s  READY   3/3       http://54.211.240.64:30001  Autoscaling from 2 to 5 replicas with 1 overprovisioned replicas (target QPS per replica: 10)  least_load             1x[L4:1]             

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                   LAUNCHED    RESOURCES                                                                STATUS  REGION           ZONE               
spot-hedge    1   1        http://34.47.101.41:8081   8 mins ago  1x GCP(g2-standard-32[Spot], {'L4': 1}, disk_tier=best, ports=['8081'])  READY   asia-northeast3  asia-northeast3-a  
spot-hedge    2   1        http://34.64.236.23:8081   9 mins ago  1x GCP(g2-standard-32[Spot], {'L4': 1}, disk_tier=best, ports=['8081'])  READY   asia-northeast3  asia-northeast3-b  
spot-hedge    3   1        http://34.80.227.166:8081  9 mins ago  1x GCP(g2-standard-32[Spot], {'L4': 1}, disk_tier=best, ports=['8081'])  READY   asia-east1       asia-east1-a 

After manually deleting one replica in the cloud console, one spot replica (in a different region) and one on-demand replica are scaled up:

$ sky serve status -a
Services
NAME        VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT                    AUTOSCALING_POLICY                                                                             LOAD_BALANCING_POLICY  REQUESTED_RESOURCES  
spot-hedge  1        5m 38s  READY   2/5       http://54.211.240.64:30001  Autoscaling from 2 to 5 replicas with 1 overprovisioned replicas (target QPS per replica: 10)  least_load             1x[L4:1]             

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                   LAUNCHED     RESOURCES                                                                STATUS        REGION           ZONE               
spot-hedge    1   1        http://34.47.101.41:8081   10 mins ago  1x GCP(g2-standard-32[Spot], {'L4': 1}, disk_tier=best, ports=['8081'])  READY         asia-northeast3  asia-northeast3-a  
spot-hedge    2   1        -                          -            -                                                                        PREEMPTED     -                -                  
spot-hedge    3   1        http://34.80.227.166:8081  11 mins ago  1x GCP(g2-standard-32[Spot], {'L4': 1}, disk_tier=best, ports=['8081'])  READY         asia-east1       asia-east1-a       
spot-hedge    7   1        -                          16 secs ago  1x GCP(g2-standard-32[Spot], {'L4': 1}, disk_tier=best, ports=['8081'])  PROVISIONING  asia-east1       asia-east1-b       
spot-hedge    8   1        -                          15 secs ago  1x GCP(g2-standard-32, {'L4': 1}, disk_tier=best, ports=['8081'])        PROVISIONING  us-east4         us-east4-a 

Finally, only the spot replica in the different region is kept:

$ sky serve status -a
Services
NAME        VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT                    AUTOSCALING_POLICY                                                                             LOAD_BALANCING_POLICY  REQUESTED_RESOURCES  
spot-hedge  1        16m 3s  READY   3/3       http://54.211.240.64:30001  Autoscaling from 2 to 5 replicas with 1 overprovisioned replicas (target QPS per replica: 10)  least_load             1x[L4:1]             

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT                    LAUNCHED     RESOURCES                                                                STATUS  REGION           ZONE               
spot-hedge    1   1        http://34.47.101.41:8081    21 mins ago  1x GCP(g2-standard-32[Spot], {'L4': 1}, disk_tier=best, ports=['8081'])  READY   asia-northeast3  asia-northeast3-a  
spot-hedge    3   1        http://34.80.227.166:8081   21 mins ago  1x GCP(g2-standard-32[Spot], {'L4': 1}, disk_tier=best, ports=['8081'])  READY   asia-east1       asia-east1-a       
spot-hedge    7   1        http://35.194.146.246:8081  9 mins ago   1x GCP(g2-standard-32[Spot], {'L4': 1}, disk_tier=best, ports=['8081'])  READY   asia-east1       asia-east1-b
  • smoke test

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cblmemo cblmemo marked this pull request as ready for review January 31, 2025 21:48
@cblmemo cblmemo requested a review from MaoZiming January 31, 2025 21:48
@cblmemo
Collaborator Author

cblmemo commented Jan 31, 2025

@MaoZiming I think it is ready for review. The only thing missing is an e2e test; I will add it soon.

@MaoZiming
Collaborator

@cblmemo Initially, there should be 2 OD and 1 Spot. The reason is that we launch OD instances only up to the target number of instances.

@MaoZiming
Collaborator

MaoZiming commented Feb 1, 2025

Can you also paste a screenshot of the zones/regions that SpotHedge considers with the optimizer?

    - cloud: aws
      region: us-east-1
    - cloud: gcp

@cblmemo
Collaborator Author

cblmemo commented Feb 3, 2025

> @cblmemo Initially, there should be 2 OD and 1 Spot. The reason is that we launch OD instances only up to the target number of instances.

Isn't there 1 overprovisioned replica?

Edit: Oh, I just realized that we don't need a fallback OD replica for overprovisioned replicas. Do you mean 2 OD + 3 Spot, and once the spot replicas are ready, it becomes 3 Spot?

@MaoZiming
Collaborator

@cblmemo Yes exactly
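
The counting rule agreed on in this exchange can be written down as a short sketch. It is only a reading of the thread, not SkyPilot's implementation: the names `ReplicaPlan` and `plan_replicas` are hypothetical, and the rule assumes that spot replicas cover the target plus the overprovisioned replicas while on-demand fallbacks only cover target replicas that do not yet have a ready spot replica.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaPlan:
    """Hypothetical container; not a SkyPilot type."""
    num_spot: int       # spot replicas to keep launched or launching
    num_on_demand: int  # on-demand fallback replicas to keep


def plan_replicas(num_target: int, num_overprovision: int,
                  num_ready_spot: int) -> ReplicaPlan:
    """Return the desired replica mix under the rule discussed above.

    Assumption: overprovisioned replicas are spot-only, so the on-demand
    fallback is capped at the target count and shrinks as spot replicas
    become ready.
    """
    if min(num_target, num_overprovision, num_ready_spot) < 0:
        raise ValueError('replica counts must be non-negative')
    num_spot = num_target + num_overprovision
    num_on_demand = max(0, num_target - num_ready_spot)
    return ReplicaPlan(num_spot=num_spot, num_on_demand=num_on_demand)


if __name__ == '__main__':
    # Service start: target of 2 with 1 overprovisioned, no spot ready yet.
    print(plan_replicas(num_target=2, num_overprovision=1, num_ready_spot=0))
    # All three spot replicas are ready: the fallback is no longer needed.
    print(plan_replicas(num_target=2, num_overprovision=1, num_ready_spot=3))
```

With a target of 2 and 1 overprovisioned replica, this gives 3 spot plus 2 on-demand at startup and 3 spot with no on-demand once the spot replicas are ready, which matches the "2 OD + 3 Spot, then 3 Spot" summary above.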

@cblmemo
Collaborator Author

cblmemo commented Feb 4, 2025

> Can you also paste a screenshot of the zones/regions considered by SpotHedge with the optimizer
>
>     - cloud: aws
>       region: us-east-1
>     - cloud: gcp

I 02-03 16:35:12 spot_placer.py:169] 44 possible location candidates are enabled for spot placement.
D 02-03 16:35:12 spot_placer.py:171] All possible locations: [Location(cloud=GCP, region='europe-west4', zone='europe-west4-a'), Location(cloud=GCP, region='asia-northeast1', zone='asia-northeast1-a'), Location(cloud=GCP, region='asia-northeast1', zone='asia-northeast1-c'), Location(cloud=GCP, region='asia-southeast1', zone='asia-southeast1-a'), Location(cloud=GCP, region='us-west4', zone='us-west4-c'), Location(cloud=GCP, region='europe-west3', zone='europe-west3-a'), Location(cloud=GCP, region='us-east4', zone='us-east4-a'), Location(cloud=GCP, region='europe-west6', zone='europe-west6-b'), Location(cloud=GCP, region='asia-northeast3', zone='asia-northeast3-a'), Location(cloud=GCP, region='asia-southeast1', zone='asia-southeast1-b'), Location(cloud=GCP, region='us-east1', zone='us-east1-d'), Location(cloud=GCP, region='asia-south1', zone='asia-south1-a'), Location(cloud=GCP, region='europe-west1', zone='europe-west1-c'), Location(cloud=GCP, region='us-west4', zone='us-west4-a'), Location(cloud=GCP, region='europe-west4', zone='europe-west4-c'), Location(cloud=AWS, region='us-east-1', zone='us-east-1c'), Location(cloud=GCP, region='europe-west3', zone='europe-west3-b'), Location(cloud=GCP, region='me-central2', zone='me-central2-a'), Location(cloud=GCP, region='us-central1', zone='us-central1-b'), Location(cloud=GCP, region='us-west1', zone='us-west1-a'), Location(cloud=GCP, region='us-west1', zone='us-west1-b'), Location(cloud=GCP, region='europe-west6', zone='europe-west6-c'), Location(cloud=GCP, region='us-east4', zone='us-east4-c'), Location(cloud=GCP, region='us-west1', zone='us-west1-c'), Location(cloud=GCP, region='us-east1', zone='us-east1-b'), Location(cloud=GCP, region='us-east1', zone='us-east1-c'), Location(cloud=GCP, region='asia-south1', zone='asia-south1-c'), Location(cloud=GCP, region='europe-west2', zone='europe-west2-b'), Location(cloud=GCP, region='asia-south1', zone='asia-south1-b'), Location(cloud=AWS, region='us-east-1', zone='us-east-1b'), 
Location(cloud=GCP, region='asia-northeast1', zone='asia-northeast1-b'), Location(cloud=GCP, region='asia-east1', zone='asia-east1-b'), Location(cloud=GCP, region='asia-east1', zone='asia-east1-c'), Location(cloud=AWS, region='us-east-1', zone='us-east-1a'), Location(cloud=GCP, region='us-central1', zone='us-central1-a'), Location(cloud=AWS, region='us-east-1', zone='us-east-1d'), Location(cloud=GCP, region='asia-northeast3', zone='asia-northeast3-b'), Location(cloud=GCP, region='europe-west4', zone='europe-west4-b'), Location(cloud=GCP, region='europe-west2', zone='europe-west2-a'), Location(cloud=GCP, region='europe-west1', zone='europe-west1-b'), Location(cloud=GCP, region='northamerica-northeast2', zone='northamerica-northeast2-a'), Location(cloud=GCP, region='asia-east1', zone='asia-east1-a'), Location(cloud=GCP, region='us-central1', zone='us-central1-c'), Location(cloud=GCP, region='asia-southeast1', zone='asia-southeast1-c')]
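
To illustrate how a list of candidates like the one logged above could be used, here is a minimal placement sketch. It is not the logic in `spot_placer.py`: the `Location` dataclass only mirrors the fields visible in the log output, and `pick_locations` with its spreading heuristic (skip recently preempted locations, then prefer the least-used zone and region) is a hypothetical assumption made for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Location:
    """Mirrors the fields shown in the log; hypothetical stand-in type."""
    cloud: str
    region: str
    zone: str


def pick_locations(candidates: List[Location], in_use: Iterable[Location],
                   preempted: Set[Location], count: int) -> List[Location]:
    """Pick `count` locations for new spot replicas.

    Assumed heuristic: avoid preempted locations, then prefer the zone and
    region currently hosting the fewest replicas. Ties are broken by name
    so the result is deterministic.
    """
    if not candidates:
        raise ValueError('no candidate locations')
    usable = [loc for loc in candidates if loc not in preempted]
    if not usable:
        # Assumption: if every location was preempted, retry all of them.
        usable = list(candidates)
    in_use = list(in_use)
    zone_load = Counter(in_use)
    region_load = Counter((loc.cloud, loc.region) for loc in in_use)
    picked: List[Location] = []
    for _ in range(count):
        best = min(
            usable,
            key=lambda loc: (zone_load[loc],
                             region_load[(loc.cloud, loc.region)],
                             loc.cloud, loc.region, loc.zone))
        picked.append(best)
        zone_load[best] += 1
        region_load[(best.cloud, best.region)] += 1
    return picked


if __name__ == '__main__':
    candidates = [
        Location('GCP', 'asia-east1', 'asia-east1-a'),
        Location('GCP', 'asia-east1', 'asia-east1-b'),
        Location('GCP', 'asia-northeast3', 'asia-northeast3-a'),
    ]
    for loc in pick_locations(candidates, in_use=[], preempted=set(), count=3):
        print(loc)
```

The replica placements shown earlier in this thread put two replicas in the same region, so the real placer clearly does not follow this exact heuristic; the sketch only shows one way to combine a candidate list with preemption history.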

@cblmemo
Collaborator Author

cblmemo commented Feb 4, 2025

> @cblmemo Yes exactly

After the latest commit it works well:

Services
NAME        VERSION  UPTIME  STATUS      REPLICAS  ENDPOINT                   
spot-hedge  -        -       NO_REPLICA  0/5       http://54.144.2.226:30001  

Service Replicas
SERVICE_NAME  ID  VERSION  ENDPOINT  LAUNCHED  RESOURCES                STATUS        REGION           
spot-hedge    1   1        -         1 hr ago  1x GCP([Spot]{'L4': 1})  PROVISIONING  asia-northeast3  
spot-hedge    2   1        -         1 hr ago  1x GCP([Spot]{'L4': 1})  PROVISIONING  asia-northeast3  
spot-hedge    3   1        -         1 hr ago  1x GCP([Spot]{'L4': 1})  PROVISIONING  asia-east1       
spot-hedge    4   1        -         1 hr ago  1x GCP({'L4': 1})        PROVISIONING  us-east4         
spot-hedge    5   1        -         1 hr ago  1x GCP({'L4': 1})        PROVISIONING  us-east4

@cblmemo cblmemo requested a review from Michaelvll February 4, 2025 00:41
@MaoZiming
Collaborator

LG @Michaelvll

@cblmemo
Collaborator Author

cblmemo commented Feb 10, 2025

/smoke-test --serve

@cblmemo
Collaborator Author

cblmemo commented Mar 11, 2025

/smoke-test --serve

@cblmemo
Collaborator Author

cblmemo commented Mar 11, 2025

@Michaelvll I just merged the latest master branch and triggered the smoke test. PTAL when you get time, thanks!

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @cblmemo @MaoZiming! This is awesome! Should we add a smoke test for this? Also, we may want to test backward compatibility for this.

@@ -4334,6 +4334,7 @@ def serve_up(
)
click.secho('Service spec:', fg='cyan')
click.echo(task.service)
serve_lib.validate_service_task(task)
Collaborator

Should we move this into the SDK?

Collaborator Author

This only checks whether the task is a valid service task, and it does not take much time. Do you still think it needs to be moved into the SDK?

@cblmemo
Collaborator Author

cblmemo commented Mar 14, 2025

/quicktest-core

@cblmemo
Collaborator Author

cblmemo commented Mar 26, 2025

/quicktest-core

@cblmemo
Collaborator Author

cblmemo commented Mar 26, 2025

@Michaelvll I don't think I'll have the bandwidth to add a smoke test for this, but I still want to merge it into master before our EuroSys presentation so the audience can more easily install and try it out. Do you think we can get this PR in first and add the smoke test later? I created issue #5040 for this.

@cblmemo
Collaborator Author

cblmemo commented Mar 26, 2025

/quicktest-core

@cblmemo cblmemo merged commit f90ccc1 into master Mar 30, 2025
19 checks passed
@cblmemo cblmemo deleted the spot-hedge-new branch March 30, 2025 08:23