workaround for cxil_map write error #161

Open
wants to merge 2 commits into
base: main

Conversation

simonpintarelli
Member

No description provided.

preview available: https://docs.tds.cscs.ch/161


The following environment variable can be set to disable gdrcopy:
```bash
export FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0
```
Contributor

msimberg Jun 19, 2025

`MPICH_GPU_IPC_ENABLED=0` may be the semantically more correct option to set. If you happen to have an easy reproducer to check whether this variable also helps, could you do so? If it takes more than a few minutes, let's just go with `FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD`.
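
A comparison along these lines could be scripted roughly as follows. This is only a sketch: `check_workaround` is a hypothetical helper, and `./reproducer.sh` is a placeholder for whatever launch command triggers the error. Only the two environment variable names and the error string come from this thread.

```bash
# Sketch: run a reproducer with one candidate setting and report whether
# the "cxil_map: write error" message still shows up in its output.

# check_workaround VAR=VALUE CMD [ARGS...]   (hypothetical helper)
check_workaround() {
    setting="$1"
    shift
    # env applies the setting to this one command only.
    # A non-zero exit from the reproducer is ignored; only the message is checked.
    output="$(env "$setting" "$@" 2>&1)" || true
    case "$output" in
        *"cxil_map: write error"*)
            echo "$setting: error still present"
            return 1
            ;;
    esac
    echo "$setting: error not seen"
}

# Example with a placeholder reproducer command:
# check_workaround MPICH_GPU_IPC_ENABLED=0 ./reproducer.sh
# check_workaround FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0 ./reproducer.sh
```

A single clean run does not prove much if the error is intermittent, so repeating each call a few times would give a more trustworthy answer.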

@@ -79,6 +79,13 @@ Cray MPICH may sometimes hang on larger runs.

Performance may be negatively affected by this option.

#### `"cxil_map: write error"` when doing inter-node GPU-aware MPI communication

The following environment variable can be set to disable gdrcopy:
Contributor

Maybe (gdrcopy is obviously also technically correct, but I'm thinking GPUDirect might be a better known term)?

Suggested change
- The following environment variable can be set to disable gdrcopy:
+ The following environment variable can be set to disable GPUDirect:

Member

How about:

This error message is sometimes triggered by applications that use GPUDirect MPI calls and hit a bug in gdrcopy (a low-level library used to copy buffers to and from GPU memory).
Setting the following option completely disables gdrcopy.
Note that this has a performance impact for small message sizes, so it should only be set on a case-by-case basis.

You could also mention that it has been used for ICON.
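
To keep the workaround case-by-case rather than global, a job script could apply it only when explicitly requested. A minimal sketch, assuming a hypothetical `DISABLE_GDRCOPY` switch and `maybe_disable_gdrcopy` helper (neither is part of Cray MPICH or libfabric); only `FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0` comes from this PR:

```bash
# maybe_disable_gdrcopy (hypothetical helper): export the workaround only
# when the job opts in via DISABLE_GDRCOPY=1 (hypothetical switch).
maybe_disable_gdrcopy() {
    if [ "${DISABLE_GDRCOPY:-0}" = "1" ]; then
        export FI_CXI_SAFE_DEVMEM_COPY_THRESHOLD=0
        echo "gdrcopy workaround enabled"
    else
        echo "gdrcopy workaround not enabled"
    fi
}
```

Calling the helper before the application launch line keeps the default behaviour for every job that does not set the switch, so the small-message performance cost is only paid where the error actually occurs.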

3 participants