Add GPUDirect Storage #1774
Conversation
CC @crcrpar
```cpp
void save_data_no_gds(torch::Tensor& tensor, std::string& filename) {
  c10::cuda::CUDAGuard gpuGuard(tensor.device());
```
(not a question, nor a suggestion) oh, I've never used `CUDAGuard` directly (I've always used `OptionalDeviceGuard`).
```python
for size in [128, 1024, 8192]:
    x = torch.empty(size, device="cuda")
    gds.load_data(x, f"{size}.data")
```
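For reference, a round-trip sanity check built on the same calls might look like the sketch below. It assumes the `gds` extension from this PR is built and importable and that `save_data`/`load_data` take a tensor and a filename, mirroring the usage in the diff above; it is illustrative, not part of the PR.

```python
import torch
import gds  # assumed importable name for the GDS extension, taken from the diff above

# Hypothetical round-trip check: write device memory to storage, read it back,
# and verify the contents match without staging the data on the host.
for size in [128, 1024, 8192]:
    src = torch.randn(size, device="cuda")
    gds.save_data(src, f"{size}.data")      # device memory -> file via GDS

    dst = torch.empty(size, device="cuda")
    gds.load_data(dst, f"{size}.data")      # file -> device memory via GDS

    assert torch.equal(src, dst), f"round-trip mismatch for size {size}"
```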
How complex would we expect converting the `.data` produced by the current API to the `.pt` format that PyTorch typically uses to be (assuming that we are doing this in host code)?
I'm imagining that it would look something like: gds materializes the tensors on storage (fast), then the host wraps the storage in `.pt` files (hopefully this can be done without materializing the tensor in host memory).
A `tensor.pt` is basically a `.zip` that has the following structure:
```
.
├── byteorder
├── .data
│   └── serialization_id
├── data
│   └── 0
├── data.pkl
└── version
```
In this structure, `gds.load_data` and `gds.save_data` should be responsible for operating on the `data/0` file only.
Looking at it from a higher-level perspective, the serializer should handle zipping/unzipping the `*.pt` file, writing/reading `byteorder`, `data.pkl`, etc., and call `gds.load_data`/`gds.save_data` on `data/0`. I hope that `torch.serialization` does not materialize it 😇
The main idea is to make it more atomic, so that other serializers like safetensors could use it.
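As a rough sketch of that split, the host-side wrapping step could look something like the code below. It assumes a `gds.save_data(tensor, path)` call as used earlier in this thread, uses Python's `zipfile` to stream the already-written payload into the archive (so the tensor bytes are never materialized as a host tensor), and fills the metadata entries with simplified placeholders rather than what `torch.save` really writes.

```python
import sys
import zipfile

import torch
import gds  # hypothetical import of the GDS extension from this PR


def wrap_as_pt_like(tensor: torch.Tensor, payload_path: str, archive_path: str) -> None:
    """Assemble a .pt-style zip around a GDS-written payload.

    The layout mirrors the tree above; only data/0 is produced by GDS.
    The metadata entries are simplified stand-ins, not torch.save's real format.
    """
    # 1) GDS materializes the tensor bytes on storage (device memory -> file).
    gds.save_data(tensor, payload_path)

    # 2) The host-side serializer wraps the payload; ZipFile.write streams the
    #    file from disk in chunks, so the data never becomes a host tensor.
    with zipfile.ZipFile(archive_path, "w") as zf:
        zf.write(payload_path, arcname="data/0")     # raw tensor bytes from GDS
        zf.writestr("byteorder", sys.byteorder)      # e.g. "little"
        zf.writestr("version", "3")
        # A real serializer would pickle a record with dtype, shape, strides and
        # the storage key here; this is only a placeholder.
        zf.writestr("data.pkl", repr((str(tensor.dtype), tuple(tensor.shape))))


# usage sketch
# x = torch.randn(1024, device="cuda")
# wrap_as_pt_like(x, "x.raw", "x.pt")
```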
This commit upstreams NVIDIA/apex#1774 into pytorch without API changes. The struct and its methods are pybinded as `torch._C._CudaGdsFileBase`.

Something that needs fixing:
- In cmake/public/cuda.cmake, `CUDA::cuFile` does not seem to work despite being mentioned here: https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html#cufile. I used /usr/local/cuda/lib64/libcufile.so for now, but this needs to be properly fixed.

There is a simple sanity check in sanity_check_loop_device.py. If you do not have an ext4 or xfs mount, this is the series of commands I used to create an ext4 filesystem on a loop device:

```bash
dd if=/dev/zero of=./loopfile bs=1024 count=40000000
losetup -f
sudo losetup /dev/loop0 ./loopfile
sudo losetup /dev/loop0
sudo mkfs -t ext4 -v /dev/loop0
mkdir ./mnt/loopfs
sudo mount -t ext4 -o data=ordered /dev/loop0 ./mnt/loopfs
sudo chmod 777 ./mnt/loopfs
sudo umount ./mnt/loopfs/
losetup -d /dev/loop0
```
Based in part on NVIDIA/apex#1774
Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434)
Pull Request resolved: #130633
Approved by: https://github.com/albanD
This is a temporary placeholder for future PyTorch-native GPUDirect Storage until we push it upstream.
cc @eqy @Fuzzkatt