workaround for cxil_map write error #161
base: main
preview available: https://docs.tds.cscs.ch/161

The following environment variable can be set to disable gdrcopy:
```bash
export FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0
```
`MPICH_GPU_IPC_ENABLED=0` may be the semantically more correct option to set. If you happen to have an easy reproducer to check whether this variable also helps, could you do so? If it takes more than a few minutes, let's just go with `FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD`.
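A minimal sketch of how the comparison requested above could be run, assuming the failure shows up as a non-zero exit status. The helper name `try_setting` and the `./reproducer` command are hypothetical placeholders, not part of this PR; the job launcher is left out on purpose.

```bash
# Sketch: run one command with a single NAME=VALUE setting applied to that
# run only, then report whether the command exited cleanly.
# try_setting is a hypothetical helper name.
try_setting() {
    local setting=$1
    shift
    # env applies the setting to this one invocation and leaves the
    # calling shell's environment untouched.
    if env "$setting" "$@"; then
        echo "$setting: ok"
    else
        echo "$setting: failed"
    fi
}

# Hypothetical usage; ./reproducer stands in for a command that triggers
# the error:
#   try_setting FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0 ./reproducer
#   try_setting MPICH_GPU_IPC_ENABLED=0 ./reproducer
```

Because each setting is scoped to a single run, the two variables cannot mask each other's effect.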
@@ -79,6 +79,13 @@ Cray MPICH may sometimes hang on larger runs.

Performance may be negatively affected by this option.

#### `"cxil_map: write error"` when doing inter-node GPU-aware MPI communication

The following environment variable can be set to disable gdrcopy:
Maybe (gdrcopy is obviously also technically correct, but I'm thinking GPUDirect might be a better known term)?
Suggested change:
- The following environment variable can be set to disable gdrcopy:
+ The following environment variable can be set to disable GPUDirect:
How about:
This error message is sometimes triggered by applications that use GPU Direct MPI calls when they trigger a bug in gdrcopy (a low-level library used to copy buffers between GPUs).
Setting the following option will completely disable gdrcopy.
Note that this has a performance impact for small message sizes, so it should only be enabled on a case-by-case basis.
You could also mention that it has been used for ICON.
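One possible way to keep the workaround opt-in, in line with the case-by-case advice above. This is only a sketch: the switch name `DISABLE_GDRCOPY` and the function name `maybe_disable_gdrcopy` are assumptions made for illustration, not existing options.

```bash
# Sketch: export the workaround only when the user asks for it.
# DISABLE_GDRCOPY is a hypothetical opt-in switch, not a real option.
maybe_disable_gdrcopy() {
    if [ "${DISABLE_GDRCOPY:-0}" = "1" ]; then
        # Variable taken from the proposed documentation change.
        export FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0
    fi
}
```

Calling `maybe_disable_gdrcopy` in a job script before the application is launched leaves the default behaviour untouched unless `DISABLE_GDRCOPY=1` is set for that job.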
No description provided.