[k8s] Improve /dev/fuse access on k8s #5028
Signed-off-by: Aylei <[email protected]>
/smoke-test --kubernetes -k test_kubernetes_storage_mounts
/smoke-test --kubernetes -k test_kubernetes_storage_mounts
/smoke-test --aws -k test_docker_storage_mounts
https://buildkite.com/skypilot-1/smoke-tests/builds/631
Verify there is no regression for non-k8s FUSE mounts.
Go code LGTM
This is amazing work @aylei! Sending some quick comments, will give it a go in a bit. Mostly looks good to me
sky/provision/kubernetes/manifests/fusermount-server-daemonset.yaml
@aylei have we stress tested our new fuse solution in some way? I'm trying to run
gets stuck on:
To be fair, our current smarter-devices-fuse based solution also fails with a transport endpoint failure. However, it works when running on cloud VMs.
@romilbhardwaj Thanks for testing it out! I will take a look.
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Christopher Cooper <[email protected]>
/smoke-test --aws -k test_docker_storage_mounts
@romilbhardwaj It turns out there will be an error when running
After the latest commit (5b6fc58), this problem is addressed:
fio --name=64kseqwrites --rw=write --direct=1 --bs=1k --numjobs=1 --iodepth=8 --size=1M --group_reporting
64kseqwrites: (g=0): rw=write, bs=(R) 1024B-1024B, (W) 1024B-1024B, (T) 1024B-1024B, ioengine=psync, iodepth=8
fio-3.25
Starting 1 process
64kseqwrites: Laying out IO file (1 file / 1MiB)
64kseqwrites: (groupid=0, jobs=1): err= 0: pid=17212: Fri Apr 11 09:20:07 2025
write: IOPS=8827, BW=8828KiB/s (9039kB/s)(1024KiB/116msec); 0 zone resets
clat (usec): min=66, max=919, avg=111.66, stdev=57.73
lat (usec): min=66, max=919, avg=111.78, stdev=57.78
clat percentiles (usec):
| 1.00th=[ 71], 5.00th=[ 77], 10.00th=[ 79], 20.00th=[ 84],
| 30.00th=[ 88], 40.00th=[ 91], 50.00th=[ 94], 60.00th=[ 99],
| 70.00th=[ 106], 80.00th=[ 122], 90.00th=[ 165], 95.00th=[ 208],
| 99.00th=[ 338], 99.50th=[ 412], 99.90th=[ 676], 99.95th=[ 922],
| 99.99th=[ 922]
lat (usec) : 100=61.33%, 250=35.35%, 500=3.12%, 750=0.10%, 1000=0.10%
cpu : usr=0.00%, sys=18.26%, ctx=2049, majf=0, minf=10
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,1024,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=8
Run status group 0 (all jobs):
WRITE: bw=8828KiB/s (9039kB/s), 8828KiB/s-8828KiB/s (9039kB/s-9039kB/s), io=1024KiB (1049kB), run=116-116msec
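As a sanity check on the figures above, the reported bandwidth and IOPS can be re-derived from the total I/O size, the block size, and the runtime. The snippet below is only a rough sketch: the helper names are made up for illustration, and fio uses a finer-grained runtime internally, so the last digit can differ from its report.

```python
def fio_throughput(io_kib: float, runtime_ms: float, block_size_bytes: int):
    """Re-derive bandwidth and IOPS from fio's summary numbers.

    Hypothetical helper for illustration; not part of fio or SkyPilot.
    """
    seconds = runtime_ms / 1000.0
    bw_kib_s = io_kib / seconds                       # binary units (KiB/s)
    bw_kb_s = bw_kib_s * 1024 / 1000                  # decimal units (kB/s)
    num_ios = io_kib * 1024 / block_size_bytes        # 1 MiB / 1 KiB = 1024 writes
    iops = num_ios / seconds
    return bw_kib_s, bw_kb_s, iops


def iops_from_latency(avg_lat_usec: float) -> float:
    """Upper bound on IOPS when only one request is in flight at a time."""
    return 1_000_000 / avg_lat_usec


if __name__ == "__main__":
    # Inputs taken from the run above: io=1024KiB, run=116msec, bs=1k.
    bw_kib_s, bw_kb_s, iops = fio_throughput(1024, 116, 1024)
    print(f"bandwidth: {bw_kib_s:.1f} KiB/s ({bw_kb_s:.1f} kB/s), IOPS: {iops:.1f}")
    # Average total latency reported above: 111.78 usec.
    print(f"latency-bound IOPS at depth 1: {iops_from_latency(111.78):.1f}")
```

The "IO depths" line shows 100% of requests at depth 1 even though the command asks for --iodepth=8. The psync engine is synchronous, so throughput here is bounded by per-request latency, which is consistent with the roughly 112 usec average latency and the ~8.8k IOPS reported.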
/smoke-test --kubernetes -k test_kubernetes_storage_mounts
@romilbhardwaj I also updated the stress test results in the PR description. Ready for another round of review!
Amazing work @aylei! Super excited to get this in! LGTM.
Nice! This is an exciting improvement
Given the smoke test and stress test results, merging into master and tracking follow-ups in separate issues.
close #4108
This PR introduces a privileged Kubernetes DaemonSet that proxies FUSE mount/unmount operations, so that SkyPilot Pods no longer need additional privileges and capabilities. For details, refer to https://github.com/skypilot-org/skypilot/blob/improve-k8s-fuse/addons/fuse-proxy/README.md
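The linked README is the authoritative description of the design. As a rough illustration of one mechanism such a proxy could rely on, the sketch below passes an open file descriptor from one end of a Unix domain socket to the other (SCM_RIGHTS), which is how a privileged process can hand an already-opened device or file to an unprivileged one. That the fuse-proxy works exactly this way is an assumption, not something stated in this PR text. The function names and the temporary file standing in for /dev/fuse are hypothetical. It requires Python 3.9+ on a Unix-like system.

```python
import os
import socket
import tempfile


def send_fd(sock: socket.socket, fd: int) -> None:
    """Send one open file descriptor over a Unix domain socket (SCM_RIGHTS)."""
    socket.send_fds(sock, [b"fd"], [fd])


def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor previously sent with send_fd()."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0]


def demo() -> bytes:
    # "server" stands in for the privileged proxy, "client" for the
    # unprivileged process inside the Pod (both hypothetical roles here).
    server, client = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as f:  # stand-in for an opened device
        f.write(b"hello from the privileged side")
        f.flush()
        send_fd(server, f.fileno())
        received = recv_fd(client)
    try:
        # The received descriptor stays valid after the sender closes its copy.
        os.lseek(received, 0, os.SEEK_SET)
        return os.read(received, 64)
    finally:
        os.close(received)
        server.close()
        client.close()


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that the receiving side never needs the privileges required to open the resource itself; it only needs access to the socket.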
Benchmark:
command:
fio --name=64kseqwrites --rw=write --direct=1 --bs=1k --numjobs=1 --iodepth=8 --size=100M --group_reporting
There is a performance degradation compared to a plain VM (both using 2c resources); the cause still needs to be investigated. Compared to the existing solution (smarter-device-plugin), however, there is no performance regression.
Tested (run the relevant ones):
bash format.sh
/smoke-test (CI) or pytest tests/test_smoke.py (local)
/smoke-test --kubernetes -k test_docker_storage_mounts
/smoke-test --kubernetes -k test_kubernetes_storage_mounts
/smoke-test --aws -k test_docker_storage_mounts (verify there is no regression for non-k8s fuse mount)
/quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)
Future TODOs