[AUTOGENERATED] [release/2.6] Cherry-pick PR-1889 #1893
Conversation
…ults file when offline tuning is disabled. (#1889)

This is a cherry-pick of an upstream PR that has been approved but is unmerged (due to CI delays). The PR did previously pass ROCm tests. pytorch#146574

This PR fixes UT breakage that has been reported internally and is considered high priority. When tunable.record_untuned_enable(False) is invoked, we flush the results of the untuned GEMM file. Offline tuning I/O currently doesn't have a member function to set the untuned-results filename or to write untuned results to file. When running back-to-back unit tests, the same ofstream ends up being reused between UTs; due to the way the UTs are executed, this can lead to unexpected failures.
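For context, a minimal Python sketch of the offline-tuning flow the fix touches, assuming PyTorch >= 2.6 with TunableOp support on ROCm (the fix itself lives in the C++ offline-tuning I/O, not in this Python-side usage):

```python
# Sketch of recording untuned GEMMs for offline tuning; assumes PyTorch >= 2.6
# with TunableOp available (ROCm).
import torch

if torch.cuda.is_available():
    torch.cuda.tunable.enable(True)                  # enable TunableOp
    torch.cuda.tunable.tuning_enable(False)          # skip online tuning...
    torch.cuda.tunable.record_untuned_enable(True)   # ...record untuned GEMMs instead

    a = torch.randn(128, 128, device="cuda", dtype=torch.float16)
    b = torch.randn(128, 128, device="cuda", dtype=torch.float16)
    c = a @ b  # this GEMM gets recorded to the untuned-results file

    # With this fix, disabling recording flushes the untuned-results file, so a
    # second unit test in the same process does not reuse the stale ofstream.
    torch.cuda.tunable.record_untuned_enable(False)
```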
Jenkins builds for commit b75463345818e57d4d7c5aba59fd47b863ab667d finished as FAILURE across multiple retries; several runs detected an error during base docker image building.
Commit Messages:
- update the param_id calculation so that it works in both CPX and SPX modes (#271) (#272)
- reset parameters for FusedDenseGeluDense, similar to FusedDense, to make test_gelu pass (#269) (#270). Co-authored-by: Sriram Kumar <[email protected]>
- Fix build error (#263)
- Fix warp size (#256)
  * replace C10_WARP_SIZE in fused rope
  * replace C10_WARP_SIZE in fused softmax
  * replace C10_WARP_SIZE in group batch norm
  * replace C10_WARP_SIZE in multihead attention
  * replace C10_WARP_SIZE in transducer
  * replace C10_WARP_SIZE in xentropy
  * replace C10_WARP_SIZE in sync batch normalization
  * replace warp_size in multihead attention
- Disabling AITER installation in the default build (#254)
  * added a flag (--aiter) to switch the aiter compile on or off when installing apex
  * added information on building AITER during installation to the README
- Replaced warpSize with C10_WARP_SIZE (#249)
- correct the approach to get to the apex folder from the test file (#248)
- Apex extensions import test (#245)
  * add a test that extracts the extensions from setup.py and checks that they can be imported
  * moved the test outside tests/L0
- Fixing the C10_WARP_SIZE issue: replacing the macros with at::cuda::warp_size() (#237)
- Replacing C10_WARP_SIZE with platform-based warp_size values (#228). Fixes: https://ontrack-internal.amd.com/browse/SWDEV-541725
- [master] Added AITER as a submodule and use it in fused_rope.py (#222)
  * Added aiter support in fused_rope.py for all 4 variants. Updated the fused rope test, reducing tolerances according to the unit test in the aiter repo.
  * Add aiter as a submodule and install it on ROCm. Switch the aiter backend on when running ROCm with aiter installed.
  * add pandas to the requirements so that aiter can be used without the numpy error "ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject"
  * Replace the ROCM_HOME condition with IS_ROCM_PYTORCH for installing aiter, and use `pip install -e .` instead of `python setup.py develop` to install aiter.
  * Create apex and aiter subclasses for the four variants of FusedRoPEFunc and select the apex or aiter subclass based on the AITER_ROPE_BACKEND value. The user can set the environment variable USE_ROCM_AITER_ROPE_BACKEND to select between the aiter and apex backends for fused RoPE (see the backend-selection sketch after the lists below).
  * If the AITER backend is selected, use lowered precision in the unit test; otherwise use the original precision of 1e-3.
  * warn the user about the lower precision when using the aiter backend for fused rope
  * Update fused_rope.py: remove spaces
  * simplify the switch between the aiter and apex subclasses
  * install aiter without editable mode
- Merge pull request #227 from ROCm/amd/dev/iassiour/SWDEV-541770: do not use warpSize as a constexpr in nhwc_batch_norm_kernel.h
- Do not use warpSize as a constexpr in nhwc_batch_norm_kernel.h. In ROCm 7.0 the warpSize variable is no longer constexpr; this commit replaces its uses with the correct values based on the architecture we're running on (see the warp-size sketch after the lists below).
- change the epilogue parameter for the hipblaslt matmul in the CUDA kernel for fused dense gelu dense (#223). Fixes: https://ontrack-internal.amd.com/browse/SWDEV-534531
- Reset the torch default device to cpu after running the amp unit tests. (#220)
- Fix unit tests for transformer, fused dense, mlp (#218)
  * Fix fused_dense_gelu_dense: change the names of the parameters so that the test can access them appropriately
  * Update the absolute tolerances in test_mlp from 0 and 1e-7 to 1e-5
  * Deactivate the amp state handle for optimization levels other than O0; this helps the UT pass.
  * Update the condition for deactivating the amp state handle from "opt level equal to 1" to "opt level not equal to 0"
  * Update the torch set-default-dtype call to remove a warning
  * Update the method that creates the overflow buffer for the amp optimizer
  * reset the default device to cpu so that the generator uses cuda, as the run_amp tests set it to cuda
- Update the fused layer norm code from the upstream apex repo (#215). The intra-warp reduction code inside the cuWelfordMuSigma2() function in the layer norm kernel assumes a warp size of 32, so a ROCm condition was added to support the GPU warp size (based on earlier apex code). For ROCm, the thread size is also adjusted, based on earlier apex code.
- upgrade matplotlib to resolve the setuptools_scm error (#213). The error: File "/tmp/easy_install-_pfhn8pn/matplotlib-3.5.1/.eggs/setuptools_scm-8.3.1-py3.12.egg/setuptools_scm/_integration/pyproject_reading.py", line 36, in read_pyproject: `section = defn.get(tool, {})[tool_name]` raises `KeyError: 'setuptools_scm'`. Solution: https://github.com/matplotlib/matplotlib/blob/v3.8.x/pyproject.toml#L22. matplotlib 3.8 is the first version whose pyproject.toml has this tool.setuptools_scm section, and the newer setuptools_scm expects this structure in the Python packages it installs; matplotlib 3.5.1 doesn't satisfy the condition. The fix is to change the requirement to matplotlib>=3.8.
- Update distributed fused adam: integrate pipeline operations and support different grad dtypes (#207)
  * Fix `DistributedFusedAdam` for grad dtype != param dtype (#1893)
  * Pipeline `reduce-scatter` and `all-reduce`. (#1895)
  Co-authored-by: Tailing Yuan <[email protected]> and Wil Kong <[email protected]>
- Update the condition for building the NCCL allocator: PyTorch should be greater than or equal to 2.6 (#204)
- Update version.txt (#203): change the version from 1.7.0 to 1.8.0
- [ROCm] Use at::empty to manage workspace memory to avoid hip runtime calls (#197). Optimizes memory for the fused_weight_gradient_mlp_cuda module.
- Update README.md (#198): add release notes for release/1.5, 1.6 and 1.7
- Update README.md (#196): update the supported versions for apex 1.7.0

PRs:
- https://github.com/ROCm/apex/pull/1895

Fixes:
- https://example.com/issue-271
- https://example.com/issue-249
- https://example.com/issue-254
- https://example.com/issue-228
- https://example.com/issue-263
- https://example.com/issue-223
- https://example.com/issue-237
- https://example.com/issue-203
- https://example.com/issue-256
- https://example.com/issue-245
- https://example.com/issue-272
- https://example.com/issue-204
- https://ontrack-internal.amd.com/browse/SWDEV-540029
- https://example.com/issue-222
- https://example.com/issue-220
- https://example.com/issue-248
- https://example.com/issue-1893
- https://example.com/issue-198
- https://example.com/issue-215
- https://example.com/issue-213
- https://example.com/issue-1895
- https://example.com/issue-218
- https://example.com/issue-227
- https://example.com/issue-196
- https://example.com/issue-197
- https://example.com/issue-207
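Several of the warp-size commits above (#228, #237, #227) replace a hard-coded warp size of 32 with a value queried at run time, since ROCm GPUs may report 64 and, as of ROCm 7.0, warpSize is no longer constexpr. A minimal sketch of the same idea from the Python side, assuming a PyTorch build that exposes warp_size on the device properties (the C++ commits use at::cuda::warp_size() for this):

```python
# Query the warp size at run time instead of assuming 32; ROCm GPUs may
# report 64. `warp_size` on device properties is assumed available here.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: warp_size = {getattr(props, 'warp_size', 'n/a')}")
```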
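The fused-RoPE backend switch from #222 is driven by an environment variable. A hedged sketch of that selection logic follows; only the variable name USE_ROCM_AITER_ROPE_BACKEND comes from the commit message, and `apex_impl` / `aiter_impl` are hypothetical stand-ins for the apex- and aiter-backed FusedRoPEFunc subclasses:

```python
# Hypothetical sketch of env-var-driven backend selection; the real apex code
# wires this into the FusedRoPEFunc subclasses described above.
import os
import warnings

def select_rope_backend(apex_impl, aiter_impl):
    """Return the aiter implementation only when the user opts in via env var."""
    if os.environ.get("USE_ROCM_AITER_ROPE_BACKEND", "0") == "1":
        # Per the commit message, the aiter backend runs at lower precision,
        # so the unit test loosens its tolerances in this case.
        warnings.warn("Using aiter fused-RoPE backend (lower precision).")
        return aiter_impl
    return apex_impl
```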
Cherry-pick of #1889