Skip to content

accelerator/rocm: add large BAR check #12982

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 16, 2024

Conversation

edgargabriel
Copy link
Member

check whether the device has large BAR support enabled. If not, set the rocm_copy_D2H and H2D thresholds to 0, i.e. use hipMemcpy(Async) for all data transfers. Data center (Instinct) devices usually have large BAR enabled, for gaming PCs its not always set by default.

The PR also does a little bit of cleanup in the error handling of the lazy_init() routine.

Comment on lines 278 to 285
if (0 < has_large_bar) {
// Without large BAR we have to use hipMemcpy(Async) for all data transfers
opal_output(0, "Large BAR support is not enabled on current device. "
"Enable large BAR support in BIOS (Above 4G Encoding) for "
"better performance\n.");
opal_accelerator_rocm_memcpyH2D_limit = 0;
opal_accelerator_rocm_memcpyD2H_limit = 0;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what a value of has_large_bar larger than 0 means? That there it does not have large BAR? Couldn't find any documentation about it.

Also, should we limit the output to higher verbosity levels to not spam users?

Copy link
Member Author

@edgargabriel edgargabriel Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh, thank you for catching it, that should be == 0 ( I set to > zero to trigger the output once, and forgot to reset it). has_large_bar == 0 means that large BAR support is not enabled, has_large_bar == 1 means it is enabled.

Copy link
Member Author

@edgargabriel edgargabriel Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding the level, I think this is quite important, I would prefer that the user has that on their screen. Until this PR, we segfaulted if large BAR was not enabled, so it was considered a mandatory feature.

check whether the device has large BAR support enabled. If not, set the
rocm_copy_D2H and H2D thresholds to 0, i.e. use hipMemcpy(Async) for all
data transfers. Data center (Instinct) devices usually have large BAR
enabled, for gaming PCs its not always set by the user.

The PR also does a little bit of cleanup in the error handling of the
lazy_init() routine.

Signed-off-by: Edgar Gabriel <[email protected]>
@edgargabriel edgargabriel force-pushed the topic/rocm-large-bar-check branch from acfd597 to 4df81a1 Compare December 16, 2024 20:53
@edgargabriel edgargabriel merged commit 95a8c39 into open-mpi:main Dec 16, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants