Nightly pip wheels incompatible with pytorch-triton workflow #1318
@dllehr-amd @ptrblck thank you for your great analysis! Based on that, pytorch/pytorch#95265 might be able to fix this issue. I have verified that the triton wheel build now packages those third_party/cuda files. See https://github.com/pytorch/pytorch/actions/runs/4239265379/jobs/7367146618#step:6:1547
Re-opening until @ptrblck confirms this is resolved, or until we have nightly pip wheels that are confirmed to resolve it. Currently, the nightly job is not done; worse, we do not have a Feb 21 nightly...
Should fix pytorch/builder#1318
Pull Request resolved: pytorch#95265
Approved by: https://github.com/ngimel
* [BE] Cleanup triton builds (#95026)
  Remove Python-3.7 clause. Do not install llvm-11, as llvm-14 is installed by the triton/python/setup.py script.
  Pull Request resolved: #95026
  Approved by: https://github.com/osalpekar, https://github.com/weiwangmeta
* Upgrade setuptools before building wheels (#95265)
  Should fix pytorch/builder#1318
  Pull Request resolved: #95265
  Approved by: https://github.com/ngimel

Co-authored-by: Nikita Shulga <[email protected]>
Co-authored-by: Wei Wang <[email protected]>
The latest nightly fixes the original issue! 🎉
In another setup, I'm still running into a failure when executing `python resnet_compile.py`:

```
/home/pbialecki/miniforge3/envs/nightly_pip_cuda118/lib/python3.8/site-packages/torch/_inductor/compile_fx.py:90: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
/tmp/tmpua10w7u7/main.c:2:10: fatal error: cuda.h: No such file or directory
    2 | #include "cuda.h"
      |          ^~~~~~~~
...
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp43bp1id7/main.c', '-O3', '-I/home/pbialecki/miniforge3/envs/nightly_pip_cuda118/include', '-I/home/pbialecki/miniforge3/envs/nightly_pip_cuda118/include/python3.8', '-I/tmp/tmp43bp1id7', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmp43bp1id7/triton_.cpython-38-x86_64-linux-gnu.so', '-L/usr/lib/x86_64-linux-gnu']' returned non-zero exit status 1.
```

This issue seems to be unrelated to the missing `third_party` files.
I have encountered this too but thought it was my own setup issue. Looks like something else might be going on... Looking forward to your analysis, thanks!
I have been testing things with `export CUDA_HOME=/usr/local/cuda-11.7`; it turns out that this was blinding me. The above error will occur if `CUDA_HOME` is not set (or the `C_INCLUDE_PATH` does not contain `cuda.h`).
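For anyone who wants to verify this, a minimal sketch of an environment setup that makes `cuda.h` visible to the `gcc` process Triton spawns (the CUDA 11.7 path is an assumption; point it at your local toolkit):

```python
# Export CUDA_HOME / C_INCLUDE_PATH before anything triggers torch.compile,
# so the gcc subprocess inherits them. The toolkit path is an assumption.
import os

cuda_home = "/usr/local/cuda-11.7"
os.environ["CUDA_HOME"] = cuda_home
os.environ["C_INCLUDE_PATH"] = os.path.join(cuda_home, "include")
```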
Description
Based on pytorch/pytorch#94818 (comment), `ptxas` should be bundled with triton (I assume it should ship in the `pytorch-triton` wheel), which does not seem to be the case using the latest nightly binary.

Setup info

`ptxas` in `pytorch-triton`
The latest nightly tags `pytorch-triton==2.0.0+c8bfe3f548`, which is correct according to `.github/ci_commit_pins/triton.txt`. `pytorch-triton` searches for the `ptxas` binary using a specified `TRITON_PTXAS_PATH` or depends on `triton/third_party/cuda/bin/ptxas`, as seen in https://github.com/openai/triton/blob/c8bfe3f548b164f745ada620a560f87f41ab8465/python/triton/compiler.py#L1066-L1067. It seems, however, that `triton/third_party/cuda` does not contain the expected `bin` folder, as seen in https://github.com/openai/triton/tree/c8bfe3f548b164f745ada620a560f87f41ab8465/python/triton/third_party/cuda
Example code snippet with failure
Using a simple RN50 with `torch.compile` fails with a `ptxas`-related error; a reconstruction of the snippet is shown below.
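A minimal sketch of such a repro, assuming torchvision's ResNet-50 and a CUDA-enabled nightly build (the original snippet was not preserved):

```python
# Hypothetical repro: compile a ResNet-50 and run one forward pass, which
# triggers Inductor -> Triton codegen and, in turn, the ptxas lookup.
# Assumes torchvision is installed and a CUDA GPU is available.
import torch
import torchvision.models as models

model = models.resnet50().cuda()
compiled = torch.compile(model)

x = torch.randn(8, 3, 224, 224, device="cuda")
out = compiled(x)  # kernel compilation happens on this first call
print(out.shape)
```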
Workaround with TRITON_PTXAS_PATH
Setting the `TRITON_PTXAS_PATH` to a valid `ptxas` location (from a locally installed CUDA toolkit) fails either with the `/usr/bin/ld: cannot find -lcuda` error on bare metal, or with an error about the missing `libdevice.10.bc` file in a docker container.
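For completeness, a sketch of how the workaround can be applied (the CUDA 11.8 path is an assumption; use your local toolkit's `ptxas`):

```python
# Point Triton at a locally installed ptxas instead of the missing bundled
# one. Must be set before the first torch.compile-triggered compilation.
import os
os.environ["TRITON_PTXAS_PATH"] = "/usr/local/cuda-11.8/bin/ptxas"  # assumed path
```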
Missing third_party dependency
The second error is strange, as it claims that not even the `libdevice.10.bc` file can be found, and indeed it seems the entire `third_party` folder is missing. To double-check it, I've downloaded the wheel manually:
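The exact download command wasn't preserved; a plausible way to fetch and inspect the wheel (the filename and index URL below are assumptions) is:

```python
# Inspect a pytorch-triton nightly wheel (a wheel is just a zip archive) for
# the files in question. Download it first, e.g. with
#   pip download --pre pytorch-triton --no-deps --index-url https://download.pytorch.org/whl/nightly/cu118
# The filename below is an assumption; use whatever pip fetched.
import zipfile

wheel = "pytorch_triton-2.0.0+c8bfe3f548-cp38-cp38-linux_x86_64.whl"
with zipfile.ZipFile(wheel) as zf:
    names = zf.namelist()

for needle in ("ptxas", "libdevice", "third_party"):
    hits = [n for n in names if needle in n]
    print(f"{needle}: {len(hits)} match(es)")
```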
After unzipping it, I also cannot find any `ptxas`, `libdevice*`, or `third_party` files.

Possible fixes
My best guess right now would be:

* `openai/triton` needs to be updated with the `bin/ptxas` file, as it's missing in `third_party/cuda`
* the `pytorch-triton` wheel needs to be rebuilt, as it's missing the entire `third_party` folder
* I'm unsure about the `/usr/bin/ld: cannot find -lcuda` error, as `gcc -lcuda` tries to use my local CUDA toolkit (should this be the case?)

Side note for the last point: my local CUDA toolkit can properly build PyTorch from source as well as a simple CUDA driver API example with `-lcuda`, so even if we expect a dependency on a locally installed CUDA toolkit, I'm still unsure why it's failing (a minimal driver-API check is sketched below).

Let me know if I'm missing something.
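For illustration, a quick way to confirm the driver library itself is usable (a ctypes sketch standing in for the C example mentioned above, not the original code):

```python
# Smoke-test the CUDA driver API via libcuda, independent of any toolkit
# headers. Assumes an NVIDIA driver providing libcuda.so.1 is installed.
import ctypes

libcuda = ctypes.CDLL("libcuda.so.1")
assert libcuda.cuInit(0) == 0          # 0 == CUDA_SUCCESS
count = ctypes.c_int()
assert libcuda.cuDeviceGetCount(ctypes.byref(count)) == 0
print(f"CUDA driver OK, {count.value} visible device(s)")
```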
CC @malfet @atalman @ngimel