[Core] Disk tier ultra for AWS and GCP #3860
Conversation
Thanks for adding this feature @Conless! It would be really helpful for users who want enhanced disk performance. The PR looks great to me! Left some code-style comments. Also, after a discussion w/ Zhanghao, we should consider using `max` as the name of the new tier and select the performance arguments as high as possible — the `max` tier does not need to be aligned.
It would be great if you calculate the corresponding cost (
Fix done.
Configuration of
Thanks for the prompt fix! Do we have the cost number for GCP as well?
The price for
Thanks for the prompt fix! This PR is in good shape and should be ready to go after we make the `max`/`ultra` decision. Left some nitpicks ;)
Also, could you run the smoke tests and make sure they pass?
No problem. I'll get them done later.
I tried this PR and found that the price warning appears twice in the log. Could you take a look at this? Ideally, displaying it once should be enough.
sky launch -c tmax @temp/lmds.yaml
Task from YAML spec: @temp/lmds.yaml
I 08-23 13:37:13 optimizer.py:691] == Optimizer ==
I 08-23 13:37:13 optimizer.py:714] Estimated cost: $16.3 / hour
I 08-23 13:37:13 optimizer.py:714]
I 08-23 13:37:13 optimizer.py:839] Considered resources (1 node):
I 08-23 13:37:13 optimizer.py:909] ------------------------------------------------------------------------------------------
I 08-23 13:37:13 optimizer.py:909] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 08-23 13:37:13 optimizer.py:909] ------------------------------------------------------------------------------------------
I 08-23 13:37:13 optimizer.py:909] AWS g5.48xlarge 192 768 A10G:8 us-east-2 16.29 ✔
I 08-23 13:37:13 optimizer.py:909] ------------------------------------------------------------------------------------------
I 08-23 13:37:13 optimizer.py:909]
Launching a new cluster 'tmax'. Proceed? [Y/n]:
I 08-23 13:37:13 cloud_vm_ray_backend.py:4354] Creating a new cluster: 'tmax' [1x AWS(g5.48xlarge, {'A10G': 8}, image_id={'us-east-2': 'ami-0f52939636b563497'}, disk_tier=ultra, ports=['8000'])].
I 08-23 13:37:13 cloud_vm_ray_backend.py:4354] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 08-23 13:37:15 cloud_vm_ray_backend.py:1314] To view detailed progress: tail -n100 -f /home/memory/sky_logs/sky-2024-08-23-13-37-12-189495/provision.log
W 08-23 13:37:15 aws.py:797] Using disk_tier=ultra on AWS will utilize io2 Block Express, which can lead to significant higher costs (~$1.8/h). For more information, see: https://aws.amazon.com/ebs/pricing.
I 08-23 13:37:17 provisioner.py:65] Launching on AWS us-east-2 (us-east-2a,us-east-2b,us-east-2c)
W 08-23 13:37:50 aws.py:797] Using disk_tier=ultra on AWS will utilize io2 Block Express, which can lead to significant higher costs (~$1.8/h). For more information, see: https://aws.amazon.com/ebs/pricing.
Also, this warning appears only after I confirm the launch of the cluster. It would be better to show these messages before confirmation, so users can take this information into account when deciding.
Thanks for the fix! The PR looks great to me except for several nitpicks. It should be ready to go after the smoke tests pass!
Also, it would be great if you included a performance/price table like in #1812 ;)
OK! Let me add it to the PR description.
# low: 1000 IOPS; read 90 MB/s; write 90 MB/s
# medium: 3000 IOPS; read 220 MB/s; write 220 MB/s
# high: 6000 IOPS; read 400 MB/s; write 400 MB/s
# ultra: 60000 IOPS; read 4000 MB/s; write 3000 MB/s
Suggested change:
- # ultra: 60000 IOPS; read 4000 MB/s; write 3000 MB/s
+ # ultra: 20000 IOPS; read 4000 MB/s; write 3000 MB/s
accidentally?
Oh, is it the actual benchmarked IOPS instead of the configured one? nvm.
Yes, it is the actual benchmarked IOPS.
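For context, the per-tier performance targets quoted above can be captured in a simple lookup table. The numbers come from the comment in this PR; the dictionary itself is a sketch for illustration, not SkyPilot's internal representation:

```python
# Per-tier disk performance figures quoted in this PR (the 'ultra' IOPS
# value is the benchmarked figure discussed above, not the configured one).
DISK_TIER_SPECS = {
    'low':    {'iops': 1000,  'read_mbps': 90,   'write_mbps': 90},
    'medium': {'iops': 3000,  'read_mbps': 220,  'write_mbps': 220},
    'high':   {'iops': 6000,  'read_mbps': 400,  'write_mbps': 400},
    'ultra':  {'iops': 60000, 'read_mbps': 4000, 'write_mbps': 3000},
}

def tier_meets(tier: str, min_iops: int) -> bool:
    """Check whether a tier's benchmarked IOPS meets a requirement."""
    return DISK_TIER_SPECS[tier]['iops'] >= min_iops
```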
Thanks for this amazing PR @Conless! This will be very helpful for users loading large model weights. Confirmed that the smoke tests passed. Merging now 🫡
Currently, using disk tier `high` (or `best`) does not actually provide the highest disk performance available, as previously mentioned in #3517, #3585, and #3836. However, directly modifying the existing settings for `high` would result in inconsistent performance across different cloud providers.

This PR introduces a new disk tier `ultra`, leveraging io2 Block Express on AWS and Extreme Persistent Disk on GCP. These provide high-performance disks with around 50,000 IOPS and around 4,000 MB/s throughput, which is about 10x the best previous option.

The benchmark and price of this tier on AWS and GCP are shown below. Benchmarks were conducted using `examples/perf/storage_rawperf.yaml` on `n2-standard-64` (GCP) and `m6i.2xlarge` (AWS), and the prices are calculated for `disk_size=256`.

The solution in this PR has certain limitations: `disk_tier=ultra` can lead to significant costs (see the pricing pages of AWS and GCP). For now, warning messages are printed to notify users; other approaches could be used in the future (e.g., taking the disk price into account in the optimizer).

If you have any concerns about the current solution, we can discuss them together.
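The ~$1.8/h figure in the warning message can be sanity-checked with a back-of-the-envelope calculation. The rates below are assumptions based on AWS's published io2 pricing (roughly $0.125 per GB-month for storage and $0.065 per provisioned-IOPS-month for the first 32,000 IOPS); actual prices vary by region, so treat this as a sketch rather than a billing formula:

```python
# Rough hourly cost estimate for an io2 volume: storage + provisioned IOPS.
# Prices are assumed (approximate us-east rates); region-dependent in practice.
HOURS_PER_MONTH = 730

def io2_hourly_cost(size_gb: int, provisioned_iops: int,
                    gb_month_price: float = 0.125,
                    iops_month_price: float = 0.065) -> float:
    """Estimated hourly cost of an io2 volume under the assumed prices."""
    monthly = size_gb * gb_month_price + provisioned_iops * iops_month_price
    return monthly / HOURS_PER_MONTH

# disk_size=256 with 20,000 provisioned IOPS lands close to the ~$1.8/h
# quoted in the warning message.
print(round(io2_hourly_cost(256, 20000), 2))  # -> 1.82
```

Note that under these assumptions the provisioned IOPS, not the storage size, dominates the cost, which is why the warning is worth surfacing before the user confirms the launch.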
Tested (run the relevant ones):
- Code formatting: `bash format.sh`
- `tests/test_optimizer_dryruns.py::test_optimize_disk_tier` has been updated for the new features
- `pytest tests/test_smoke.py`
- `pytest tests/test_smoke.py::test_fill_in_the_name`
- `conda deactivate; bash -i tests/backward_compatibility_tests.sh`