-
Notifications
You must be signed in to change notification settings - Fork 640
[Policy] Add SpotHedge. #4628
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Policy] Add SpotHedge. #4628
Conversation
@MaoZiming I think it is ready for review. Only thing lacked is an e2e test. Will do it soon |
@cblmemo Initially, there should be 2 OD and 1 Spot. The reason is that we will launch the number of OD up to num target number of instances |
Can you also paste a screenshot of the zones/regions considered by SpotHedge with the optimizer
|
Edit: Oh, just realized that we don't need a fallback OD replica for overprovision replicas. Do you mean 2 OD + 3 Spot, and when spot is ready, it becomes 3 Spot? |
@cblmemo Yes exactly |
I 02-03 16:35:12 spot_placer.py:169] 44 possible location candidates are enabled for spot placement.
D 02-03 16:35:12 spot_placer.py:171] All possible locations: [Location(cloud=GCP, region='europe-west4', zone='europe-west4-a'), Location(cloud=GCP, region='asia-northeast1', zone='asia-northeast1-a'), Location(cloud=GCP, region='asia-northeast1', zone='asia-northeast1-c'), Location(cloud=GCP, region='asia-southeast1', zone='asia-southeast1-a'), Location(cloud=GCP, region='us-west4', zone='us-west4-c'), Location(cloud=GCP, region='europe-west3', zone='europe-west3-a'), Location(cloud=GCP, region='us-east4', zone='us-east4-a'), Location(cloud=GCP, region='europe-west6', zone='europe-west6-b'), Location(cloud=GCP, region='asia-northeast3', zone='asia-northeast3-a'), Location(cloud=GCP, region='asia-southeast1', zone='asia-southeast1-b'), Location(cloud=GCP, region='us-east1', zone='us-east1-d'), Location(cloud=GCP, region='asia-south1', zone='asia-south1-a'), Location(cloud=GCP, region='europe-west1', zone='europe-west1-c'), Location(cloud=GCP, region='us-west4', zone='us-west4-a'), Location(cloud=GCP, region='europe-west4', zone='europe-west4-c'), Location(cloud=AWS, region='us-east-1', zone='us-east-1c'), Location(cloud=GCP, region='europe-west3', zone='europe-west3-b'), Location(cloud=GCP, region='me-central2', zone='me-central2-a'), Location(cloud=GCP, region='us-central1', zone='us-central1-b'), Location(cloud=GCP, region='us-west1', zone='us-west1-a'), Location(cloud=GCP, region='us-west1', zone='us-west1-b'), Location(cloud=GCP, region='europe-west6', zone='europe-west6-c'), Location(cloud=GCP, region='us-east4', zone='us-east4-c'), Location(cloud=GCP, region='us-west1', zone='us-west1-c'), Location(cloud=GCP, region='us-east1', zone='us-east1-b'), Location(cloud=GCP, region='us-east1', zone='us-east1-c'), Location(cloud=GCP, region='asia-south1', zone='asia-south1-c'), Location(cloud=GCP, region='europe-west2', zone='europe-west2-b'), Location(cloud=GCP, region='asia-south1', zone='asia-south1-b'), Location(cloud=AWS, region='us-east-1', zone='us-east-1b'), Location(cloud=GCP, region='asia-northeast1', zone='asia-northeast1-b'), Location(cloud=GCP, region='asia-east1', zone='asia-east1-b'), Location(cloud=GCP, region='asia-east1', zone='asia-east1-c'), Location(cloud=AWS, region='us-east-1', zone='us-east-1a'), Location(cloud=GCP, region='us-central1', zone='us-central1-a'), Location(cloud=AWS, region='us-east-1', zone='us-east-1d'), Location(cloud=GCP, region='asia-northeast3', zone='asia-northeast3-b'), Location(cloud=GCP, region='europe-west4', zone='europe-west4-b'), Location(cloud=GCP, region='europe-west2', zone='europe-west2-a'), Location(cloud=GCP, region='europe-west1', zone='europe-west1-b'), Location(cloud=GCP, region='northamerica-northeast2', zone='northamerica-northeast2-a'), Location(cloud=GCP, region='asia-east1', zone='asia-east1-a'), Location(cloud=GCP, region='us-central1', zone='us-central1-c'), Location(cloud=GCP, region='asia-southeast1', zone='asia-southeast1-c')] |
After the latest commit it works well: Services
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT
spot-hedge - - NO_REPLICA 0/5 http://54.144.2.226:30001
Service Replicas
SERVICE_NAME ID VERSION ENDPOINT LAUNCHED RESOURCES STATUS REGION
spot-hedge 1 1 - 1 hr ago 1x GCP([Spot]{'L4': 1}) PROVISIONING asia-northeast3
spot-hedge 2 1 - 1 hr ago 1x GCP([Spot]{'L4': 1}) PROVISIONING asia-northeast3
spot-hedge 3 1 - 1 hr ago 1x GCP([Spot]{'L4': 1}) PROVISIONING asia-east1
spot-hedge 4 1 - 1 hr ago 1x GCP({'L4': 1}) PROVISIONING us-east4
spot-hedge 5 1 - 1 hr ago 1x GCP({'L4': 1}) PROVISIONING us-east4 |
LG @Michaelvll |
/smoke-test --serve |
/smoke-test --serve |
@Michaelvll Just merged the latest master branch and triggered smoke test. PTAL when you got time, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cblmemo @MaoZiming! This is awesome! Should we add a smoke test for this? Also, we may want to test backward compatibility for this.
@@ -4334,6 +4334,7 @@ def serve_up( | |||
) | |||
click.secho('Service spec:', fg='cyan') | |||
click.echo(task.service) | |||
serve_lib.validate_service_task(task) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make this to be in SDK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only checking whether the task is a valid service task and does not take a lot of time. Do you still think this needs to be put into sdk?
/quicktest-core |
/quicktest-core |
@Michaelvll I dont think I'll have the bandwidth to add smoke test for this, but I still want to merge it into master before our eurosys presentation so the audience can easier install and try it out. Do you think we can get this PR in first and add smoke test later? Created an issue #5040 for this. |
/quicktest-core |
Add
SpotHedge
policy.TODO:
spot-hedge
branch to make sure everything works well.Initially, 3 spot span across regions, 3 on-demand fallback:
Later, only 3 spot is kept after it becomes ready;
Manually deleting one on cloud console, one spot (in a different region) + one on-demand is scaled up:
Finally, only the spot in different region is kept.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh