Skip to content

opal/mca/accelerator: introduce get_device_pci_attr api #11687

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 18, 2023

Conversation

wenduwan
Copy link
Contributor

Introduce get_device_pci_attr api to query accelerator device PCI attributes. This enables intelligent selection of other PCI(e) devices based on affinity with the accelerator, e.g. NICs.

@@ -476,6 +478,11 @@ static int mca_accelerator_rocm_get_device(int *dev_id)
return OPAL_SUCCESS;
}

static int mca_accelerator_rocm_get_device_pci_attr(int dev_id, opal_accelerator_pci_attr_t *pci_attr)
{
return OPAL_ERR_NOT_IMPLEMENTED;
Copy link
Member

@bwbarrett bwbarrett May 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edgargabriel Hi could you please take a look at the change and let me know if it makes sense to you. Unfortunately I do not have access to rocm devices for testing, so if possible I very much appreciate if you could do a quick sanity test.🙏

@wenduwan wenduwan force-pushed the accelerator_awareness branch from 9fd8ec7 to 0d8df2d Compare May 17, 2023 00:18
@rhc54
Copy link
Contributor

rhc54 commented May 17, 2023

I do have to sigh as PMIx is supposed to provide you with such info. MPICH-4 uses it - guess it's just something OMPI hasn't caught up with yet. 🤷‍♂️

@jjhursey
Copy link
Member

bot:ibm:retest

@wenduwan
Copy link
Contributor Author

I do have to sigh as PMIx is supposed to provide you with such info.

@rhc54 Ooooh do you have a pointer to that? In theory we don't have to do this in accelerator, since PCI is not an accelerator concept(and this change implies that all accelerators must be PCI devices). But the winning motivation here is to abstract away the accelerator type, such that calling routines do not need to branch off cuda vs rocm vs etc.

Copy link
Member

@edgargabriel edgargabriel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fundamentally good for me, I can also confirm that the code also compiles correctly.

Introduce get_device_pci_attr api to query accelerator device PCI
attributes. This enables intelligent selection of other PCI(e) devices
based on affinity with the accelerator, e.g. NICs.

Signed-off-by: Wenduo Wang <[email protected]>
@wenduwan wenduwan force-pushed the accelerator_awareness branch from 0d8df2d to 00a567e Compare May 17, 2023 17:23
@rhc54
Copy link
Contributor

rhc54 commented May 17, 2023

I do have to sigh as PMIx is supposed to provide you with such info.

@rhc54 Ooooh do you have a pointer to that? In theory we don't have to do this in accelerator, since PCI is not an accelerator concept(and this change implies that all accelerators must be PCI devices). But the winning motivation here is to abstract away the accelerator type, such that calling routines do not need to branch off cuda vs rocm vs etc.

Busy moving at the moment, so let me get back to you on it. For now, you'd need this as a backup method anyway, so no harm done.

Just don't let me forget about it 😄

@wenduwan
Copy link
Contributor Author

@bwbarrett @lrbison @edgargabriel Can we merge this change? I also need to backport it.

@edgargabriel edgargabriel merged commit a1519d2 into open-mpi:main May 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants