Add GPUDirect Storage #1774
Conversation
CC @crcrpar
```cpp
void save_data_no_gds(torch::Tensor& tensor, std::string& filename) {
  c10::cuda::CUDAGuard gpuGuard(tensor.device());
```
(not a question, nor a suggestion) oh, I've never used `CUDAGuard` directly (I've always used `OptionalDeviceGuard`).
```python
for size in [128, 1024, 8192]:
    x = torch.empty(size, device="cuda")
    gds.load_data(x, f"{size}.data")
```
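For reference, a round-trip sanity check built on the same calls might look like the sketch below. It assumes the `gds` extension from this PR is built and importable and that `save_data`/`load_data` take a tensor and a filename, mirroring the usage in the diff above; it is illustrative, not part of the PR.

```python
import torch
import gds  # assumed importable name for the GDS extension, taken from the diff above

# Hypothetical round-trip check: write device memory to storage, read it back,
# and verify the contents match without staging the data on the host.
for size in [128, 1024, 8192]:
    src = torch.randn(size, device="cuda")
    gds.save_data(src, f"{size}.data")      # device memory -> file via GDS

    dst = torch.empty(size, device="cuda")
    gds.load_data(dst, f"{size}.data")      # file -> device memory via GDS

    assert torch.equal(src, dst), f"round-trip mismatch for size {size}"
```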
How complex would we expect converting the `.data` produced by the current API to the `.pt` format that PyTorch typically uses to be (assuming that we are doing this in host code)?
I'm imagining that it would look something like: gds materializes the tensors on storage (fast), then the host wraps the storage in `.pt` files (hopefully this can be done without materializing the tensor in host memory).
A `tensor.pt` is basically a `.zip` that has the following structure:
```
.
├── byteorder
├── .data
│   └── serialization_id
├── data
│   └── 0
├── data.pkl
└── version
```
In this structure, `gds.load_data` and `gds.save_data` should be responsible for operating on the `data/0` file only.
Looking at it from a higher-level perspective, the serializer should handle zipping/unzipping the `*.pt` file, writing/reading `byteorder`, `data.pkl`, etc., and call `gds.load_data`/`gds.save_data` on `data/0`. I hope that `torch.serialization` does not materialize it 😇
The main idea is to make it more atomic, so that other serializers like safetensors could use it.
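As a rough sketch of that split, the host-side wrapping step could look something like the code below. It assumes a `gds.save_data(tensor, path)` call as used earlier in this thread, uses Python's `zipfile` to stream the already-written payload into the archive (so the tensor bytes are never materialized as a host tensor), and fills the metadata entries with simplified placeholders rather than what `torch.save` really writes.

```python
import sys
import zipfile

import torch
import gds  # hypothetical import of the GDS extension from this PR


def wrap_as_pt_like(tensor: torch.Tensor, payload_path: str, archive_path: str) -> None:
    """Assemble a .pt-style zip around a GDS-written payload.

    The layout mirrors the tree above; only data/0 is produced by GDS.
    The metadata entries are simplified stand-ins, not torch.save's real format.
    """
    # 1) GDS materializes the tensor bytes on storage (device memory -> file).
    gds.save_data(tensor, payload_path)

    # 2) The host-side serializer wraps the payload; ZipFile.write streams the
    #    file from disk in chunks, so the data never becomes a host tensor.
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.write(payload_path, arcname="data/0")     # raw tensor bytes from GDS
        zf.writestr("byteorder", sys.byteorder)      # e.g. "little"
        zf.writestr("version", "3")
        # A real serializer would pickle a record with dtype, shape, strides and
        # the storage key here; this is only a placeholder.
        zf.writestr("data.pkl", repr((str(tensor.dtype), tuple(tensor.shape))))


# usage sketch
# x = torch.randn(1024, device="cuda")
# wrap_as_pt_like(x, "x.raw", "x.pt")
```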
This commit upstreams NVIDIA/apex#1774 into pytorch without API changes. The struct and its methods are pybinded as `torch._C._CudaGdsFileBase`.

Something that needs fixing:
- In cmake/public/cuda.cmake, `CUDA::cuFile` does not seem to work despite being mentioned here: https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html#cufile. I used /usr/local/cuda/lib64/libcufile.so for now, but this needs to be properly fixed.

There is a simple sanity check in sanity_check_loop_device.py. If you do not have an ext4 or xfs mount, this is the series of commands I used to create an ext4 filesystem on a loop device:

```bash
dd if=/dev/zero of=./loopfile bs=1024 count=40000000
losetup -f
sudo losetup /dev/loop0 ./loopfile
sudo losetup /dev/loop0
sudo mkfs -t ext4 -v /dev/loop0
mkdir ./mnt/loopfs
sudo mount -t ext4 -o data=ordered /dev/loop0 ./mnt/loopfs
sudo chmod 777 ./mnt/loopfs
sudo umount ./mnt/loopfs/
losetup -d /dev/loop0
```
Based in part on NVIDIA/apex#1774
Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434)
Pull Request resolved: #130633
Approved by: https://github.com/albanD
This is a temporary placeholder for future PyTorch-native GPUDirect Storage until we push it upstream.
cc @eqy @Fuzzkatt