Update RDMA Core, OFI, UCX, and Open MPI #9818

fwyzard
Contributor

@fwyzard fwyzard commented Apr 29, 2025

ROCm

  • add ROCr headers (used by OFI), remove debuginfo files.

RDMA Core

  • update RDMA Core Userspace Libraries to version 57.0
  • make libibverbs plugins relocatable

OFI

  • add Libfabric OpenFabrics version 2.1.0

UCX

  • update UCX to version 1.18.1

Open MPI

  • update Open MPI to version 4.1.8

@fwyzard
Contributor Author

fwyzard commented Apr 29, 2025

enable gpu

@fwyzard
Contributor Author

fwyzard commented Apr 29, 2025

please test

@cmsbuild
Contributor

A new Pull Request was created by @fwyzard for branch IB/CMSSW_15_1_X/master.

@iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

@cmsbuild
Contributor

cmsbuild commented Apr 29, 2025

cms-bot internal usage

@cmsbuild
Contributor

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-89dc20/45783/summary.html
COMMIT: ab0f125
CMSSW: CMSSW_15_1_X_2025-04-29-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9818/45783/install.sh to create a dev area with all the needed externals and cmssw changes.
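
The `CMSSW:` line in each report pairs an integration-build tag with an architecture string. As a minimal sketch, assuming the layout `<cycle>_<YYYY-MM-DD-HHMM>/<os>_<cpu>_<compiler>` seen in this thread, the value can be split into its parts as follows (the helper name `parse_ib` and the regular expression are illustrative, not part of cms-bot):

```python
import re
from datetime import datetime

# Assumed layout, inferred from the reports in this thread:
#   <release cycle>_<YYYY-MM-DD-HHMM>/<os>_<cpu>_<compiler>
IB_LINE = re.compile(
    r"^(?P<cycle>CMSSW_\d+_\d+_X)_(?P<stamp>\d{4}-\d{2}-\d{2}-\d{4})/"
    r"(?P<os>[^_]+)_(?P<cpu>[^_]+)_(?P<compiler>[^_]+)$"
)


def parse_ib(value):
    """Split a 'CMSSW:' report value into cycle, timestamp, OS, CPU and compiler."""
    match = IB_LINE.match(value.strip())
    if match is None:
        raise ValueError(f"unrecognised IB string: {value!r}")
    info = match.groupdict()
    # Convert the build timestamp into a datetime for easier comparison.
    info["stamp"] = datetime.strptime(info["stamp"], "%Y-%m-%d-%H%M")
    return info


ib = parse_ib("CMSSW_15_1_X_2025-04-29-1100/el8_amd64_gcc12")
print(ib["cycle"], ib["stamp"].isoformat(), ib["os"], ib["cpu"], ib["compiler"])
```

This makes it easier to see which of the later reports ran on a different CPU architecture or compiler.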

External Build

I found compilation warnings when building: see details on the summary page.

@fwyzard
Contributor Author

fwyzard commented Apr 30, 2025

ERROR: unsuccessful bootstrap

doesn't seem related to these changes :-/

@smuzaffar
Contributor

please test

@fwyzard fwyzard force-pushed the IB/CMSSW_15_1_X/master_openmpi_updates branch from ab0f125 to d0f3d26 Compare April 30, 2025 06:05
@cmsbuild
Contributor

Pull request #9818 was updated.

@fwyzard fwyzard force-pushed the IB/CMSSW_15_1_X/master_openmpi_updates branch from d0f3d26 to 25f6384 Compare April 30, 2025 06:11
@cmsbuild
Contributor

Pull request #9818 was updated.

@fwyzard fwyzard force-pushed the IB/CMSSW_15_1_X/master_openmpi_updates branch from 25f6384 to 4592b19 Compare April 30, 2025 08:30
@cmsbuild
Contributor

Pull request #9818 was updated.

@fwyzard fwyzard changed the title Open MPI-related updates Update RDMA Core, OFI, UCX, and Open MPI Apr 30, 2025
@fwyzard
Contributor Author

fwyzard commented Apr 30, 2025

please test

@smuzaffar
Contributor

please test

@cmsbuild
Contributor

cmsbuild commented May 3, 2025

-1

Failed Tests: UnitTests rocmUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-89dc20/45825/summary.html
COMMIT: 4592b19
CMSSW: CMSSW_15_1_X_2025-05-03-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/9818/45825/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-89dc20/45825/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-89dc20/45825/git-merge-result

Unit Tests

I found 1 error in the following unit tests:

---> test test_MilleZmm had ERRORS

ROCm Unit Tests

I found 2 errors in the following unit tests:

---> test testRocmSoALayoutAndView_t had ERRORS
---> test alpakaTestBufferROCmAsync had ERRORS

Comparison Summary

Summary:

  • You potentially added 17 lines to the logs
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 4037687
  • DQMHistoTests: Total failures: 52
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4037615
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

CUDA Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 0 log files, 0 edm output root files, 1 DQM output files

ROCM Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 0 log files, 0 edm output root files, 1 DQM output files

@cmsbuild
Contributor

cmsbuild commented May 8, 2025

Pull request #9818 was updated.

@fwyzard
Contributor Author

fwyzard commented May 8, 2025

please test

@fwyzard
Contributor Author

fwyzard commented May 8, 2025

please test for el8_amd64_gcc14

@cmsbuild
Contributor

cmsbuild commented May 8, 2025

-1

Failed Tests: RelVals-ROCM rocmUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-89dc20/45959/summary.html
COMMIT: 95dd427
CMSSW: CMSSW_15_1_X_2025-05-08-1100/el8_amd64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9818/45959/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-ROCM

  • 12834.402: 12834.402_TTbar_14TeV+2024_Patatrack_PixelOnlyAlpaka/step3_TTbar_14TeV+2024_Patatrack_PixelOnlyAlpaka.log

ROCm Unit Tests

I found 2 errors in the following unit tests:

---> test testRocmSoALayoutAndView_t had ERRORS
---> test alpakaTestBufferROCmAsync had ERRORS

Comparison Summary

Summary:

  • You potentially added 19 lines to the logs
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 4037687
  • DQMHistoTests: Total failures: 38
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4037629
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 49 files compared)
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

CUDA Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 1
  • DQMHistoTests: Total histograms compared: 0
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 0
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0 KiB( 0 files compared)
  • Checked 0 log files, 0 edm output root files, 1 DQM output files

@cmsbuild
Contributor

cmsbuild commented May 8, 2025

-1

Failed Tests: RelVals
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-89dc20/45966/summary.html
COMMIT: 95dd427
CMSSW: CMSSW_15_1_X_2025-05-07-2300/el8_aarch64_gcc12
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9818/45966/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-89dc20/45966/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-89dc20/45966/git-merge-result

RelVals

----- Begin Fatal Exception 08-May-2025 23:18:05 CEST-----------------------
An exception of category 'ProductNotFound' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 7 stream: 0
   [1] Running path 'validation_step'
   [2] Prefetching for module HGCalValidator/'hltHgcalValidator'
   [3] Calling method for module SimTrackstersProducer/'hltTiclSimTracksters'
Exception Message:
Principal::getByToken: Found zero products matching all criteria
Looking for type: edm::AssociationMap<edm::OneToManyWithQualityGeneric<std::vector<TrackingParticle>,edm::View<reco::Track>,double,unsigned int,edm::RefProd<std::vector<TrackingParticle> >,edm::RefToBaseProd<reco::Track>,edm::Ref<std::vector<TrackingParticle>,TrackingParticle,edm::refhelper::FindUsingAdvance<std::vector<TrackingParticle>,TrackingParticle> >,edm::RefToBase<reco::Track> > >
Looking for module label: tpToHltGeneralTrackAssociation
Looking for productInstanceName: 

   Additional Info:
      [a] If you wish to continue processing events after a ProductNotFound exception,
add "TryToContinue = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.

----- End Fatal Exception -------------------------------------------------
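
The "Additional Info" hint in the exception above refers to the `options` PSet of the job configuration. A minimal configuration fragment, assuming a standard CMSSW Python configuration (the process name `TEST` is a placeholder, and the fragment needs a CMSSW environment to import `FWCore`), might look like this:

```python
import FWCore.ParameterSet.Config as cms

# "TEST" is a hypothetical process name used only for illustration.
process = cms.Process("TEST")

# Ask the framework to continue with the next events after a
# ProductNotFound exception, as suggested by the exception message.
process.options.TryToContinue = cms.untracked.vstring('ProductNotFound')
```

Note that this only lets the job carry on past the exception; it does not supply the missing `tpToHltGeneralTrackAssociation` product.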

@cmsbuild
Contributor

cmsbuild commented May 9, 2025

-1

Failed Tests: rocmUnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-89dc20/45960/summary.html
COMMIT: 95dd427
CMSSW: CMSSW_15_1_X_2025-05-07-2300/el8_amd64_gcc14
Additional Tests: CUDA,ROCM
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/9818/45960/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-89dc20/45960/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-89dc20/45960/git-merge-result

ROCm Unit Tests

I found 2 errors in the following unit tests:

---> test testRocmSoALayoutAndView_t had ERRORS
---> test alpakaTestBufferROCmAsync had ERRORS

Comparison Summary

Summary:

  • You potentially added 756 lines to the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 100508 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 4037687
  • DQMHistoTests: Total failures: 568481
  • DQMHistoTests: Total nulls: 463
  • DQMHistoTests: Total successes: 3468723
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 2.7439999999999993 KiB( 49 files compared)
  • DQMHistoSizes: changed ( 10224.0 ): -0.054 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 13034.0 ): -0.596 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 140.045,... ): -0.004 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 141.042 ): 0.043 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 145.014 ): 0.004 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 145.408 ): -0.016 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 145.5 ): 0.008 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 145.604 ): 0.090 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 145.713 ): -0.008 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 17034.0 ): -1.074 KiB SiStrip/MechanicalView
  • DQMHistoSizes: changed ( 250202.181 ): ...
  • Checked 215 log files, 184 edm output root files, 50 DQM output files
  • TriggerResults: found differences in 23 / 48 workflows
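
The DQMHistoTests counts reported in this thread can be cross-checked with a few lines of arithmetic. The sketch below assumes that failures, nulls, successes and skipped histograms are disjoint categories that add up to the total compared; the numbers are copied from the three main comparison summaries above, and the dictionary labels are just descriptive keys:

```python
# Counts copied from the three main "Comparison Summary" sections in this thread.
runs = {
    "el8_amd64_gcc12, 4592b19": {
        "compared": 4037687, "failures": 52, "nulls": 0,
        "successes": 4037615, "skipped": 20,
    },
    "el8_amd64_gcc12, 95dd427": {
        "compared": 4037687, "failures": 38, "nulls": 0,
        "successes": 4037629, "skipped": 20,
    },
    "el8_amd64_gcc14, 95dd427": {
        "compared": 4037687, "failures": 568481, "nulls": 463,
        "successes": 3468723, "skipped": 20,
    },
}


def is_consistent(counts):
    """Check that the four categories add up to the total (assumed to be disjoint)."""
    parts = (counts["failures"] + counts["nulls"]
             + counts["successes"] + counts["skipped"])
    return parts == counts["compared"]


def failure_fraction(counts):
    """Fraction of compared histograms that were reported as failures."""
    return counts["failures"] / counts["compared"]


for label, counts in runs.items():
    print(f"{label}: consistent={is_consistent(counts)}, "
          f"failures={failure_fraction(counts):.4%}")
```

Under that assumption all three reports are internally consistent, and the gcc14 run stands out with about 14% of histograms failing, compared with a tiny fraction in the two gcc12 runs.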

@smuzaffar
Contributor

@fwyzard , this looks good to me.
Do you want to run any local tests before we merge this?

@fwyzard
Contributor Author

fwyzard commented May 9, 2025

I think we can go ahead.

@smuzaffar
Contributor

smuzaffar commented May 9, 2025

> I think we can go ahead.

OK, I will merge it once we have a green IB (hopefully tomorrow's 11h00 one).

@smuzaffar
Contributor

+externals

@smuzaffar smuzaffar merged commit a0fdd2a into cms-sw:IB/CMSSW_15_1_X/master May 10, 2025
24 of 28 checks passed
@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_15_1_X/master IBs (but tests are reportedly failing). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @rappoccio, @mandrenguyen, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@fwyzard
Contributor Author

fwyzard commented May 16, 2025

type ngt
