rocshmem dependencies #349
Could you share a toy user submission that uses rocshmem as well? Just want to get a sense of what things will look like end to end.
Also @saienduri to sanity check.
Vibe coded this, but it's going to look similar to HIP kernels in Python.
Looks good to me. Starting a test Docker build here to check status: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17545534459
Ooh! Looks like there is some issue with UCX. I'll debug it today!
@saienduri I made some changes but I'm not sure if they work. Is there a way to test the workflow without approval? I don't have an MI300X to test on 😅
Thanks, trying a build here now: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17701378282. You can try building the Docker image locally just to see if the build passes.
Cool, the build passed and a sanity test passed here: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17702258708
@saienduri added one, lmk if it works!
Hmm getting
You want the example working with load_inline in PyTorch
done but idk if it works 😬
@saienduri can we test the provided payload example on the server directly? If it's fine then we should be good to merge
OK, running the payload in GitHub Actions yielded the following (https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17790562194):
I think it will be the same error on the server itself as well.
Pushed a commit to fix the import issue.
Ok, I'll test this on RunPod and push a working version. Apologies for all the back and forth!
@chivatam Hi, I don't have permission to push commits directly to your repo. I corrected your payload, so you can refer to that. Just use extra_ldflags instead.
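For reference, here is a minimal sketch of how linker flags can be passed to PyTorch's `load_inline` through its `extra_ldflags` parameter. It is not the payload from this PR. The `/opt/rocshmem` install prefix, the `ROCSHMEM_INSTALL_DIR` environment variable, the `-lrocshmem` library name, and both helper names are assumptions for illustration; a real rocshmem build may need more libraries and flags.

```python
import os


def rocshmem_link_kwargs(rocshmem_dir=None):
    """Build the include/link keyword arguments for load_inline.

    The default prefix and the ROCSHMEM_INSTALL_DIR variable are
    hypothetical; adjust them to wherever rocshmem is installed.
    """
    root = rocshmem_dir or os.environ.get("ROCSHMEM_INSTALL_DIR", "/opt/rocshmem")
    return {
        # Header search path for the rocshmem headers (assumed layout).
        "extra_include_paths": [os.path.join(root, "include")],
        # Linker flags go in extra_ldflags; the library name is assumed.
        "extra_ldflags": ["-L" + os.path.join(root, "lib"), "-lrocshmem"],
    }


def build_toy_module(cpp_src, hip_src, functions, rocshmem_dir=None):
    """Compile inline sources with the flags above (needs PyTorch and a GPU toolchain)."""
    # Imported lazily so the helper above can be used without PyTorch installed.
    from torch.utils.cpp_extension import load_inline

    return load_inline(
        name="rocshmem_toy",  # hypothetical module name
        cpp_sources=cpp_src,
        cuda_sources=hip_src,
        functions=functions,
        **rocshmem_link_kwargs(rocshmem_dir),
    )
```

Keeping the flag construction separate from the compile step makes it possible to check the paths without a GPU or a PyTorch install.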
@saienduri hi Sai, could you pls replace the current one with mine above and trigger the test again? Thanks
@danielhua23 just gave you write access as well
Description
Added rocshmem dependencies to the Dockerfile.
@msaroufim