
[Core] Disk tier ultra for AWS and GCP #3860


Merged (11 commits, Sep 1, 2024)

Conversation

Contributor

@Conless Conless commented Aug 22, 2024

Currently, using disk tier high (or best) does not actually provide the highest disk performance available, as previously mentioned in #3517, #3585, and #3836. However, directly modifying the existing settings for high would result in inconsistent performance across different cloud providers.

This PR introduces a new disk tier ultra, leveraging io2 Block Express on AWS and Extreme Persistent Disk on GCP. These provide high-performance disks with around 50,000 IOPS and around 4,000 MB/s throughput, roughly 10x the best previous option.

The benchmark results and prices for this tier on AWS and GCP are shown below. Benchmarks were conducted using examples/perf/storage_rawperf.yaml on n2-standard-64 (GCP) and m6i.2xlarge (AWS), and the prices are calculated for disk_size=256.

| Cloud | Disk Type | Benchmarked Read Throughput | Benchmarked Read IOPS | Benchmarked Write Throughput | Benchmarked Write IOPS | Price ($/GiB/mo) | Price ($/IOPS/mo) | Total Price per Month ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GCP | pd-extreme (IOPS=20000) | 4214.88 MB/s | 64314.03 | 3172.06 MB/s | 48401.77 | 0.125 | 0.065 | 1332 |
| Azure | - | - | - | - | - | - | - | - |
| AWS | io2 (IOPS=20000) | 3167.38 MB/s | 48330.38 | 3400.61 MB/s | 51889.15 | 0.125 | 0.065 | 1332 |
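As a sanity check on the "Total Price per Month" column, the monthly cost can be recomputed from the per-GiB and per-IOPS rates (a minimal sketch; `monthly_disk_price` is a hypothetical helper, and the 730 hours/month convention is an assumption):

```python
def monthly_disk_price(size_gib: int, iops: int,
                       price_per_gib: float, price_per_iops: float) -> float:
    """Monthly disk cost = capacity charge + provisioned-IOPS charge."""
    return size_gib * price_per_gib + iops * price_per_iops

# Figures from the table above: disk_size=256, IOPS=20000.
total = monthly_disk_price(256, 20000, 0.125, 0.065)
print(total)                   # 1332.0, matching the table
print(round(total / 730, 2))   # 1.82, approximate hourly cost assuming 730 h/month
```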

The solution in this PR has certain limitations.

  • The performance of ultra disks can only be fully realized on certain instance types. Specifically, the benchmark results above may only be achieved on larger instances with 32+ cores. On other instances, throughput may be limited to around 1200 MB/s (the same as gp3 with 16,000 IOPS). For more details on the specifications, please refer to machine support for GCP pd-extreme and Amazon EBS-optimized instance types.
  • The best disk performance on AWS has still not been fully achieved. This is because the peak performance on GCP is lower, and we need to ensure similar performance across different cloud providers.
  • Using disk_tier=ultra can lead to significant costs (see the pricing of AWS and GCP). For now, I print warning messages to notify users; other approaches could be used in the future (e.g., taking disk price into consideration in the optimizer).

If you have any concerns about the current solution, we can discuss them together.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • tests/test_optimizer_dryruns.py::test_optimize_disk_tier has been updated for new features
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Collaborator

@cblmemo cblmemo left a comment

Thanks for adding this feature @Conless ! It would be really helpful for users who want enhanced disk performance. The PR looks great to me! Left some code-style comments. Also, after a discussion w/ Zhanghao, we should consider using max as the name of the new tier and selecting the performance arguments as high as possible; the max tier does not need to be aligned across clouds.

It would be great if you could calculate the corresponding cost ($/hour) to make sure the price is reasonable; if it costs 10+ $/hour, we might want to reconsider which arguments are reasonable to use.

@Conless
Contributor Author

Conless commented Aug 22, 2024

Fix done.

> It would be great if you calculate the corresponding cost to make sure the price is reasonable.

The ultra configuration in this PR costs about $2/h extra, which seems reasonable. For comparison, the max setting on AWS is 256,000 IOPS, which costs $13.9/h.
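For reference, AWS's published io2 per-IOPS pricing is tiered ($0.065/IOPS-month up to 32,000, $0.046 for the next 32,000, $0.032 beyond 64,000), so the hourly cost of a 256,000-IOPS max configuration can be roughly estimated as below. This is a sketch only, assuming 730 hours/month and a 256 GiB disk; the rates are from AWS's public pricing page and may change, and the result lands in the same ballpark as the ~$13-14/h figures quoted in this thread.

```python
# AWS io2 tiered per-IOPS pricing: (tier size in IOPS, $/IOPS-month).
IO2_IOPS_TIERS = [(32_000, 0.065), (32_000, 0.046), (float("inf"), 0.032)]

def io2_monthly_iops_cost(iops: int) -> float:
    """Apply the tiered per-IOPS rates to a provisioned IOPS count."""
    cost, remaining = 0.0, iops
    for tier_size, rate in IO2_IOPS_TIERS:
        charged = min(remaining, tier_size)
        cost += charged * rate
        remaining -= charged
        if remaining <= 0:
            break
    return cost

monthly = io2_monthly_iops_cost(256_000) + 256 * 0.125  # plus storage charge
print(round(monthly / 730, 1))  # roughly $13/h
```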

@cblmemo
Collaborator

cblmemo commented Aug 22, 2024

> Fix done.
>
> > It would be great if you calculate the corresponding cost to make sure the price is reasonable.
>
> The ultra configuration in this PR costs about $2/h extra, which seems reasonable. For comparison, the max setting on AWS is 256,000 IOPS, which costs $13.9/h.

Thanks for the prompt fix! Do we have the cost number for GCP as well?

@Conless
Contributor Author

Conless commented Aug 22, 2024

> > Fix done.
> >
> > > It would be great if you calculate the corresponding cost to make sure the price is reasonable.
> >
> > The ultra configuration in this PR costs about $2/h extra, which seems reasonable. For comparison, the max setting on AWS is 256,000 IOPS, which costs $13.9/h.
>
> Thanks for the prompt fix! Do we have the cost number for GCP as well?

The price for max setting on GCP is $13/h, similar to that on AWS.

Collaborator

@cblmemo cblmemo left a comment

Thanks for the prompt fix! This PR is in good shape and should be ready to go after we make the max/ultra decision. Left some nitpicks ;)

@cblmemo
Collaborator

cblmemo commented Aug 23, 2024

Also, could you run the smoke tests and make sure they pass?

@Conless
Contributor Author

Conless commented Aug 23, 2024

No problem. I'll get them done later.

@cblmemo
Collaborator

cblmemo commented Aug 23, 2024

I tried this PR and found that the price warning appears twice in the log. Could you take a look at this? Ideally, displaying it once is enough.

sky launch -c tmax @temp/lmds.yaml
Task from YAML spec: @temp/lmds.yaml
I 08-23 13:37:13 optimizer.py:691] == Optimizer ==
I 08-23 13:37:13 optimizer.py:714] Estimated cost: $16.3 / hour
I 08-23 13:37:13 optimizer.py:714] 
I 08-23 13:37:13 optimizer.py:839] Considered resources (1 node):
I 08-23 13:37:13 optimizer.py:909] ------------------------------------------------------------------------------------------
I 08-23 13:37:13 optimizer.py:909]  CLOUD   INSTANCE      vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 08-23 13:37:13 optimizer.py:909] ------------------------------------------------------------------------------------------
I 08-23 13:37:13 optimizer.py:909]  AWS     g5.48xlarge   192     768       A10G:8         us-east-2     16.29         ✔     
I 08-23 13:37:13 optimizer.py:909] ------------------------------------------------------------------------------------------
I 08-23 13:37:13 optimizer.py:909] 
Launching a new cluster 'tmax'. Proceed? [Y/n]: 
I 08-23 13:37:13 cloud_vm_ray_backend.py:4354] Creating a new cluster: 'tmax' [1x AWS(g5.48xlarge, {'A10G': 8}, image_id={'us-east-2': 'ami-0f52939636b563497'}, disk_tier=ultra, ports=['8000'])].
I 08-23 13:37:13 cloud_vm_ray_backend.py:4354] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 08-23 13:37:15 cloud_vm_ray_backend.py:1314] To view detailed progress: tail -n100 -f /home/memory/sky_logs/sky-2024-08-23-13-37-12-189495/provision.log
W 08-23 13:37:15 aws.py:797] Using disk_tier=ultra on AWS will utilize io2 Block Express, which can lead to significant higher costs (~$1.8/h). For more information, see: https://aws.amazon.com/ebs/pricing.
I 08-23 13:37:17 provisioner.py:65] Launching on AWS us-east-2 (us-east-2a,us-east-2b,us-east-2c)
W 08-23 13:37:50 aws.py:797] Using disk_tier=ultra on AWS will utilize io2 Block Express, which can lead to significant higher costs (~$1.8/h). For more information, see: https://aws.amazon.com/ebs/pricing.

@cblmemo
Copy link
Collaborator

cblmemo commented Aug 23, 2024

> I tried this PR and found that the price warning appears twice in the log. Could you take a look at this? Ideally, displaying it once is enough.

Also, this warning appears after I confirm the launch of the cluster. It would be better to show these messages before confirmation so users can take this information into account when making the decision.

Collaborator

@cblmemo cblmemo left a comment

Thanks for the fix! The PR looks great to me except for several nitpicks. It should be ready to go after the smoke tests pass!

@cblmemo
Collaborator

cblmemo commented Aug 27, 2024

Also, it would be great if you included the table for performance/price like in #1812 ;)

@Conless
Contributor Author

Conless commented Aug 28, 2024

> Also, it would be great if you included the table for performance/price like in #1812 ;)

OK! Let me add it to the PR description.

# low: 1000 IOPS; read 90 MB/s; write 90 MB/s
# medium: 3000 IOPS; read 220 MB/s; write 220 MB/s
# high: 6000 IOPS; read 400 MB/s; write 400 MB/s
# ultra: 60000 IOPS; read 4000 MB/s; write 3000 MB/s
Collaborator

Suggested change:
- # ultra: 60000 IOPS; read 4000 MB/s; write 3000 MB/s
+ # ultra: 20000 IOPS; read 4000 MB/s; write 3000 MB/s

accidentally?

Collaborator

Oh, is it the actual benchmarked IOPS instead of the configured one? Never mind.

Contributor Author

Yes, it is the actual benchmarked IOPS.
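The cloud-specific disk choices discussed in this thread can be pictured as a small lookup table. This is illustrative only: the disk types and the IOPS=20000 configuration come from the PR description above, but `DISK_TIERS` and `tier_spec` are hypothetical names, not SkyPilot's actual source.

```python
# (disk type, provisioned IOPS) that the ultra tier maps to per cloud,
# per the PR description: io2 Block Express on AWS, pd-extreme on GCP.
DISK_TIERS = {
    'ultra': {
        'aws': ('io2', 20000),
        'gcp': ('pd-extreme', 20000),
    },
}

def tier_spec(cloud: str, tier: str) -> tuple:
    """Look up the disk type and provisioned IOPS for a tier/cloud pair."""
    return DISK_TIERS[tier.lower()][cloud.lower()]

print(tier_spec('AWS', 'ultra'))  # ('io2', 20000)
```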

@cblmemo
Collaborator

cblmemo commented Sep 1, 2024

Thanks for this amazing PR @Conless ! This would be very helpful for users loading large model weights. Confirmed that the smoke test passed. Merging now 🫡

@cblmemo cblmemo added this pull request to the merge queue Sep 1, 2024
Merged via the queue into skypilot-org:master with commit 2e204d0 Sep 1, 2024
20 checks passed