
Conversation

timmoon10
Collaborator

Description

Update list of authorized CI users

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Update list of authorized CI users

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@timmoon10 timmoon10 requested a review from ptrendx September 5, 2025 05:50
@timmoon10 timmoon10 added the testing Improvements to tests or testing infrastructure label Sep 5, 2025
@timmoon10 timmoon10 merged commit 603dbf7 into NVIDIA:main Sep 8, 2025
10 of 12 checks passed
@timmoon10 timmoon10 deleted the update-ci-users branch September 8, 2025 18:13
vthumbe1503 pushed a commit to vthumbe1503/TransformerEngine that referenced this pull request Sep 8, 2025
Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Varun Thumbe <[email protected]>
vthumbe1503 added a commit to vthumbe1503/TransformerEngine that referenced this pull request Sep 19, 2025
Signed-off-by: Varun Thumbe <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <[email protected]>

Add cuBLASMp-backed GEMM-like API to TE common (NVIDIA#1824)

* Pick up cuBLASMp during build

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Saving...

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Change lib order to fix link error

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Saving...

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Context creation, incomplete...

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Test fixture

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Saving...

Signed-off-by: Vladimir Cherepanov <[email protected]>

* A sanity AgGemm test, failing...

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Saving...

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Fix axes

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Take care of uneven distribution

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Use MPI to get position of local matrices

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Refactor

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Refactor & fixes

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Saving...

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Gemm-RS

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Gemm-AR, not working...

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Fixes

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Setting all-reduce epilogue for gemm-ar

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Use supported shapes for GEMM-AR

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Tweak tolerance

Signed-off-by: Vladimir Cherepanov <[email protected]>

* First shot at fp8

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Use TensorHolder in tests

Signed-off-by: Vladimir Cherepanov <[email protected]>

* More test configs

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Support comm_sm_count

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Parametrize dtypes for A, B and D separately

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Tweak scaling

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Amax ptr

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Flags parity with cublas_gemm, saving...

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Cleanup

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Bias tests

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Fix bias test

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Aux, saving...

Signed-off-by: Vladimir Cherepanov <[email protected]>

* aux_ld

Signed-off-by: Vladimir Cherepanov <[email protected]>

* A fix

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Use test::Tensor

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Set scale inv

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Remove unsupported test configs

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Tweak tests

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Replace libcal with NCCL

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Add NVTX markers to API functions

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Tweak GemmAr tests

Signed-off-by: Vladimir Cherepanov <[email protected]>

* More test config

Signed-off-by: Vladimir Cherepanov <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Fix merge fallout

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Remove MPI dependency, comment API, add algo parameter

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Fix nvshmem dependency

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Fix nvshmem build

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Exclude CommGemm tests from L0_cppunittest

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Add cpp_distributed sh file for CI

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Adapt to TensorAllocator

Signed-off-by: Vladimir Cherepanov <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Skip GemmAr test on unsupported HW

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Oversubscribe is needed on some clusters

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Fix incomplete libcal removal

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Move CI tests to L1

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Rename context to include NVTE prefix

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Remove leftover code

Signed-off-by: Vladimir Cherepanov <[email protected]>

* NVTE_WITH_CUBLASMP off by default

Signed-off-by: Vladimir Cherepanov <[email protected]>

* More detailed NVTE_CHECK diag

Signed-off-by: Vladimir Cherepanov <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Comment API

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Include stdbool header for legacy C compilers

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Remove now unused argument

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Abstract away cuBLASMp algo behind our own enum

Signed-off-by: Vladimir Cherepanov <[email protected]>

* More detailed shape diag messages

Signed-off-by: Vladimir Cherepanov <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update transformer_engine/common/include/transformer_engine/comm_gemm.h

Co-authored-by: Przemyslaw Tredak <[email protected]>
Signed-off-by: Vladimir Cherepanov <[email protected]>

* Add license

Signed-off-by: Vladimir Cherepanov <[email protected]>

---------

Signed-off-by: Vladimir Cherepanov <[email protected]>
Signed-off-by: Vladimir Cherepanov <[email protected]>
Co-authored-by: Vladimir Cherepanov <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Przemyslaw Tredak <[email protected]>
Signed-off-by: Varun Thumbe <[email protected]>
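
For orientation, the AG-GEMM, GEMM-RS, and GEMM-AR variants exercised in the commits above are the three standard ways of pairing a tensor-parallel GEMM with a collective. Below is a minimal single-process NumPy sketch of the data movement only; the names are illustrative and none of this is the cuBLASMp or TE API.

```python
# Hypothetical single-process sketch of the three comm-GEMM patterns
# (AG-GEMM, GEMM-RS, GEMM-AR) using NumPy. Shards are simulated with
# Python lists; real implementations use NCCL/cuBLASMp collectives.
import numpy as np

tp = 2                       # tensor-parallel world size
m, k, n = 4, 6, 8
rng = np.random.default_rng(0)
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))

# AG-GEMM: each rank holds a row shard of A; all-gather A, then GEMM.
A_shards = np.split(A, tp, axis=0)
A_gathered = np.concatenate(A_shards, axis=0)       # simulated all-gather
out_ag = A_gathered @ B

# GEMM-RS: each rank holds a column shard of A and a row shard of B;
# partial GEMMs are summed, then reduce-scattered over rows of the output.
A_cols = np.split(A, tp, axis=1)
B_rows = np.split(B, tp, axis=0)
partials = [a @ b for a, b in zip(A_cols, B_rows)]
out_rs = np.split(sum(partials), tp, axis=0)        # simulated reduce-scatter

# GEMM-AR: same partial GEMMs, but every rank keeps the fully all-reduced output.
out_ar = sum(partials)                              # simulated all-reduce

assert np.allclose(out_ag, A @ B)
assert np.allclose(np.concatenate(out_rs, axis=0), A @ B)
assert np.allclose(out_ar, A @ B)
```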

[PyTorch][CUDA Graph] Fix FP8 Weight Quantization Cache under CUDA Graph (NVIDIA#2119)

* add noop to comp amax

Signed-off-by: zhongboz <[email protected]>

* fix for fp8 blockwise recipe

Signed-off-by: zhongboz <[email protected]>

* resolve comments

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: zhongboz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Varun Thumbe <[email protected]>

[PyTorch] fix cross entropy vanishing gradients (NVIDIA#2139)

* fix cross entropy

Signed-off-by: Casper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Casper <[email protected]>

* fix comments

Signed-off-by: Casper <[email protected]>

* fix: few more style issues

Signed-off-by: Casper <[email protected]>

* fix: remove grad_output_stride (unnecessary)

Signed-off-by: Casper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: only backward was broken

Signed-off-by: Casper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Generalize cross entropy backward kernel to handle reduced and unreduced loss

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Casper <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Varun Thumbe <[email protected]>
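
The last commit above generalizes the cross-entropy backward kernel to handle both reduced and unreduced loss. As a hedged illustration (standard PyTorch, not TE's fused kernel), the backward pass sees a scalar grad_output for a mean-reduced loss but a per-token grad_output for an unreduced loss, and the two must produce consistent logit gradients:

```python
# Illustrative PyTorch example (not TE's fused kernel): the backward pass
# sees a scalar grad_output for reduced loss but a per-token grad_output
# for unreduced loss, and must scale logit gradients accordingly.
import torch

logits = torch.randn(4, 10, requires_grad=True)
labels = torch.randint(0, 10, (4,))

# Reduced loss: a single scalar, grad_output is implicitly 1.0.
loss_mean = torch.nn.functional.cross_entropy(logits, labels, reduction="mean")
loss_mean.backward()
grad_reduced = logits.grad.clone()

# Unreduced loss: one value per token; backward receives a per-token weight.
logits.grad = None
loss_none = torch.nn.functional.cross_entropy(logits, labels, reduction="none")
loss_none.backward(torch.full_like(loss_none, 0.25))  # same as mean over 4 tokens
grad_unreduced = logits.grad.clone()

# The two paths agree when the per-token weights reproduce the mean reduction.
torch.testing.assert_close(grad_reduced, grad_unreduced)
```

The generalized kernel needs to accept either form of grad_output rather than assuming a scalar.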

Fix bug when enabling --overlap-grad-reduce in mcore (NVIDIA#2142)

* fix bugs when enabling --overlap-grad-reduce in mcore

Signed-off-by: Hongbin Liu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix CI

Signed-off-by: Hongbin Liu <[email protected]>

* format

Signed-off-by: Hongbin Liu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: Varun Thumbe <[email protected]>

Fix CUDA version in setup.py (NVIDIA#2132)

* Fix CUDA version in setup.py

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Re-enable building comm-gemm tests

Signed-off-by: Vladimir Cherepanov <[email protected]>

* WAR for nvidia-nvshmem package

Signed-off-by: Vladimir Cherepanov <[email protected]>

---------

Signed-off-by: Vladimir Cherepanov <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Varun Thumbe <[email protected]>

[JAX] NoScaleTensor wrapper for non-quantized data (NVIDIA#2136)

* Custom call tests passing

Signed-off-by: Jeremy Berchtold <[email protected]>

* Fix test_layer.py

Signed-off-by: Jeremy Berchtold <[email protected]>

* Lint

Signed-off-by: Jeremy Berchtold <[email protected]>

* Fix comments

Signed-off-by: Jeremy Berchtold <[email protected]>

* Support using amax on HighPrecision tensor if it exists instead of recomputing for current scaling

Signed-off-by: Jeremy Berchtold <[email protected]>

* Fix shardy issue with amax being shape 1,1,1 instead of shape (1,)

Signed-off-by: Jeremy Berchtold <[email protected]>

* Add higher-precision VJP tests to test_distributed_layernorm_mlp

Signed-off-by: Jeremy Berchtold <[email protected]>

* Cast non-quantized kernels to input dtype in VJPs

Signed-off-by: Jeremy Berchtold <[email protected]>

* Rename HighPrecisionTensor to NoScaleTensor

Signed-off-by: Jeremy Berchtold <[email protected]>

* Use NoScaleTensor in pure JAX impls where it was missing

Signed-off-by: Jeremy Berchtold <[email protected]>

* Fix tests

Signed-off-by: Jeremy Berchtold <[email protected]>

---------

Signed-off-by: Jeremy Berchtold <[email protected]>
Signed-off-by: Varun Thumbe <[email protected]>

[JAX] Fix GroupedScaledTensor creation with keyword arg (NVIDIA#2154)

Fix GroupedScaledTensor creation

Signed-off-by: Phuong Nguyen <[email protected]>
Signed-off-by: Varun Thumbe <[email protected]>

Fixing few issues with multi-process launching. (NVIDIA#2155)

* Fixing few issues with multi-process launching.

Signed-off-by: Ming Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Ming Huang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Phuong Nguyen <[email protected]>
Signed-off-by: Varun Thumbe <[email protected]>

Update list of authorized CI users (NVIDIA#2152)

Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Varun Thumbe <[email protected]>

a bit of cleanup

Signed-off-by: Varun Thumbe <[email protected]>
vthumbe1503 added a commit to vthumbe1503/TransformerEngine that referenced this pull request Sep 19, 2025
author Varun Thumbe <[email protected]> 1757373536 +0000
committer Varun Thumbe <[email protected]> 1758262513 +0000

parent de9ef2f
author Varun Thumbe <[email protected]> 1757373536 +0000
committer Varun Thumbe <[email protected]> 1758262476 +0000

parent de9ef2f
author Varun Thumbe <[email protected]> 1757373536 +0000
committer Varun Thumbe <[email protected]> 1758262304 +0000

merge conflict

Signed-off-by: Varun Thumbe <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Varun Thumbe <[email protected]>

FP8 AllGather in FP8 GroupedGEMM + Fix Stream Usage Issue. (NVIDIA#2086)

* FP8 AllGather in FP8 GroupedGEMM

1. Support current scaling FP8 quantization with a given amax.
2. Support FP8 AG in fwd and BF16 RS in bwd.
3. The workflow is amax all-reduce -> FP8 quant -> FP8 AG -> FP8 GroupedGEMM.

Signed-off-by: Ming Huang <[email protected]>

* Slightly refactor

Signed-off-by: Ming Huang <[email protected]>

* Adding documents of new args.

Signed-off-by: Ming Huang <[email protected]>

* Adding unit-tests.

Signed-off-by: Ming Huang <[email protected]>

* Adding license.

Signed-off-by: Ming Huang <[email protected]>

* Move unit-tests to L1.

Signed-off-by: Ming Huang <[email protected]>

* Move quantizer store/reset into FP8 only.

Signed-off-by: Ming Huang <[email protected]>

* Adding all layout support for Blackwell+

Signed-off-by: Ming Huang <[email protected]>

* Adopt the feedback from code-review.

Signed-off-by: Ming Huang <[email protected]>

* Fixed the wrong stream used by d2d in groupedGEMM FFI.

Signed-off-by: Ming Huang <[email protected]>

---------

Signed-off-by: Ming Huang <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
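
The workflow in this commit (amax all-reduce -> FP8 quant -> FP8 AG -> FP8 GroupedGEMM) hinges on current-scaling quantization driven by an externally supplied amax. A rough single-process NumPy sketch of that scaling step, assuming E4M3 with a maximum magnitude of 448; the helper name and the emulated collectives are illustrative only:

```python
# Rough sketch of current-scaling FP8 quantization with a given (pre-reduced)
# amax, as in "amax all-reduce -> FP8 quant -> FP8 all-gather". Pure NumPy,
# single process; FP8 is emulated by clipping and rounding to a coarse grid.
import numpy as np

FP8_E4M3_MAX = 448.0   # max representable magnitude in E4M3

def quantize_current_scaling(x, amax):
    """Quantize x with a scale derived from a given amax (not recomputed locally)."""
    scale = FP8_E4M3_MAX / max(amax, 1e-12)     # current-scaling factor
    x_scaled = np.clip(x * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    x_fp8 = np.round(x_scaled)                  # stand-in for FP8 rounding
    return x_fp8, 1.0 / scale                   # data plus scale_inv for dequant

# Simulated "all-reduce max" of per-shard amaxes before quantizing each shard.
rng = np.random.default_rng(0)
shards = [rng.standard_normal(16) * s for s in (1.0, 3.0)]
global_amax = max(np.abs(sh).max() for sh in shards)

quantized = [quantize_current_scaling(sh, global_amax) for sh in shards]
gathered = np.concatenate([q * s_inv for q, s_inv in quantized])  # "all-gather"
print("max abs dequant error:", np.abs(gathered - np.concatenate(shards)).max())
```

Sharing a single, pre-reduced amax across ranks keeps every shard on the same scale, which is what makes the subsequent FP8 all-gather well-defined.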

[JAX] Delay MeshResource validation until first usage (NVIDIA#2124)

Delay MeshResource validation until first usage

Signed-off-by: Jeremy Berchtold <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>

[JAX] `dot_1_output` sharding constraint + use AXIS_IS_UNSHARDED (NVIDIA#2128)

* add dot_1_output sharding constraint + use AXIS_IS_UNSHARDED

Signed-off-by: Phuong Nguyen <[email protected]>

---------

Signed-off-by: Phuong Nguyen <[email protected]>

[JAX] Add amax input to DBiasQuantizePrimitive and FFI (NVIDIA#2118)

* add amax input to DBiasQuantizePrimitive and FFI

Signed-off-by: Phuong Nguyen <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make sure amax is init with zero

Signed-off-by: Phuong Nguyen <[email protected]>

* fix sharding rule

Signed-off-by: Phuong Nguyen <[email protected]>

---------

Signed-off-by: Phuong Nguyen <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Further relax constraints to cuDNN 9.13 for disabling fused attn for kv caching (NVIDIA#2121)

Signed-off-by: Kshitij Lakhani <[email protected]>

Temporarily remove comm_gemm tests (NVIDIA#2133)

Signed-off-by: Vladimir Cherepanov <[email protected]>

[PyTorch] Disable determinism for sm100 (NVIDIA#2130)

* disable determinism for sm100+ and cudnn<9.14

Signed-off-by: Charlene Yang <[email protected]>

* fix remaining CI failures

Signed-off-by: Charlene Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert some changes

Signed-off-by: Charlene Yang <[email protected]>

* revert more changes

Signed-off-by: Charlene Yang <[email protected]>

* remove sm100 from determinism table

Signed-off-by: Charlene Yang <[email protected]>

---------

Signed-off-by: Charlene Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

[PyTorch] ONNX export of FP8 Current Scaling (NVIDIA#2068)

* Compute amax in normalization forward in current scaling in untuned kernels

Signed-off-by: Jan Bielak <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* code drop

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* apply Tim's suggestions

Signed-off-by: Pawel Gadzinski <[email protected]>

---------

Signed-off-by: Jan Bielak <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
Co-authored-by: Jan Bielak <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

[PyTorch][MOE] Tentative Fix For Replacing from_blob with empty for experts receiving zero tokens (NVIDIA#2134)

use torch empty for empty shape instead of from_blob

Signed-off-by: zhongboz <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>

build: pull cached wheels (NVIDIA#2127)

* build: pull cached wheels

Signed-off-by: oliver könig <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update setup.py

Signed-off-by: oliver könig <[email protected]>

---------

Signed-off-by: oliver könig <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>

[Common] Add checks to CUDA kernel launch and CUDA API calls (NVIDIA#2074)

* add checks to cuda kernel launch and cuda API calls

Signed-off-by: Xin Yao <[email protected]>

* Remove exceptions from destructors

Signed-off-by: Tim Moon <[email protected]>

* fix weird dispatch in ln/rmsnorm

Signed-off-by: Xin Yao <[email protected]>

---------

Signed-off-by: Xin Yao <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>

[PyTorch] Support bf16+fp8 cudagraph (NVIDIA#2098)

* support bf16+fp8 model

Signed-off-by: Robin Zhang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: Robin Zhang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: Robin Zhang <[email protected]>

---------

Signed-off-by: Robin Zhang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>

Dropout with 8-bit RNG (NVIDIA#2014)

* Add dropout kernel with 8-bit RNG

Co-authored-by: Vasudevan Rengasamy <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix license

Signed-off-by: Tim Moon <[email protected]>

* Avoid ambiguous types

Signed-off-by: Tim Moon <[email protected]>

* Do not enforce dropout prob is representable in 8 bits

Signed-off-by: Tim Moon <[email protected]>

* Expand error message

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix small statistical bug from using less-equal instead of less-than

Refactor kernel implementations and add comments. Interpret masks as bytes rather than 16-bit uints.

Signed-off-by: Tim Moon <[email protected]>

* Fix linter warning

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unnecessary helper function in PyTorch extensions

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
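
One commit above fixes a small statistical bug from comparing the 8-bit random value with less-equal instead of less-than. A quick NumPy illustration of the effect (for intuition only, not TE's kernel): with 256 possible byte values, `<= threshold` keeps one extra value out of 256, biasing the keep rate upward by 1/256.

```python
# Why "<" vs "<=" matters for dropout with an 8-bit RNG: random bytes take
# 256 values, so "<= threshold" keeps (threshold + 1)/256 of elements while
# "< threshold" keeps threshold/256. Illustrative only, not TE's kernel.
import numpy as np

keep_prob = 0.9
threshold = int(keep_prob * 256)                  # 230
rng = np.random.default_rng(0)
rand_bytes = rng.integers(0, 256, size=1_000_000, dtype=np.uint8)

keep_lt = (rand_bytes < threshold).mean()         # ~ 230/256 = 0.8984
keep_le = (rand_bytes <= threshold).mean()        # ~ 231/256 = 0.9023

print(f"target keep prob      : {keep_prob:.4f}")
print(f"empirical with '<'    : {keep_lt:.4f}")
print(f"empirical with '<='   : {keep_le:.4f}  (biased up by ~1/256)")
```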

Create GPU reload buffers on main stream (NVIDIA#2131)

* Create GPU reload buffers on main stream

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed typo

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Fixed typo

Signed-off-by: Selvaraj Anandaraj <[email protected]>

---------

Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Paweł Gadziński <[email protected]>

Fix CI failures for UB overlap changes (NVIDIA#2149)

Signed-off-by: djns99 <[email protected]>

[JAX] Fix failing fused attn tests for dropout=0.1 and bias for sm100 (NVIDIA#2135)

* Fix failing tests for dropout=0.1 and bias for fused attn for blackwell

Signed-off-by: Kshitij Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the skip message

Signed-off-by: Kshitij Lakhani <[email protected]>

* Assert in fused attn bwd pass for sm100

Signed-off-by: Kshitij Lakhani <[email protected]>

Add check for sm100

Signed-off-by: Kshitij Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support to get all devs in the process for jax

Signed-off-by: Kshitij Lakhani <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Code clean up

Signed-off-by: Kshitij Lakhani <[email protected]>

* Make get_all_device_compute_capability more pythonic, thereby avoiding unnecessary type conversion

Signed-off-by: Kshitij Lakhani <[email protected]>

* Represent attn bias using enum instead of string

Signed-off-by: Kshitij Lakhani <[email protected]>

---------

Signed-off-by: Kshitij Lakhani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

[PyTorch][CUDA Graph] Fix FP8 Weight Quantization Cache under CUDA Graph (NVIDIA#2119)

* add noop to comp amax

Signed-off-by: zhongboz <[email protected]>

* fix for fp8 blockwise recipe

Signed-off-by: zhongboz <[email protected]>

* resolve comments

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: zhongboz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>

[PyTorch] fix cross entropy vanishing gradients (NVIDIA#2139)

* fix cross entropy

Signed-off-by: Casper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Casper <[email protected]>

* fix comments

Signed-off-by: Casper <[email protected]>

* fix: few more style issues

Signed-off-by: Casper <[email protected]>

* fix: remove grad_output_stride (unnecessary)

Signed-off-by: Casper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: only backward was broken

Signed-off-by: Casper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Generalize cross entropy backward kernel to handle reduced and unreduced loss

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Casper <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>

Fix bug when enabling --overlap-grad-reduce in mcore (NVIDIA#2142)

* fix bugs when enabling --overlap-grad-reduce in mcore

Signed-off-by: Hongbin Liu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix CI

Signed-off-by: Hongbin Liu <[email protected]>

* format

Signed-off-by: Hongbin Liu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix CUDA version in setup.py (NVIDIA#2132)

* Fix CUDA version in setup.py

Signed-off-by: Vladimir Cherepanov <[email protected]>

* Re-enable building comm-gemm tests

Signed-off-by: Vladimir Cherepanov <[email protected]>

* WAR for nvidia-nvshmem package

Signed-off-by: Vladimir Cherepanov <[email protected]>

---------

Signed-off-by: Vladimir Cherepanov <[email protected]>
Co-authored-by: Tim Moon <[email protected]>

[JAX] NoScaleTensor wrapper for non-quantized data (NVIDIA#2136)

* Custom call tests passing

Signed-off-by: Jeremy Berchtold <[email protected]>

* Fix test_layer.py

Signed-off-by: Jeremy Berchtold <[email protected]>

* Lint

Signed-off-by: Jeremy Berchtold <[email protected]>

* Fix comments

Signed-off-by: Jeremy Berchtold <[email protected]>

* Support using amax on HighPrecision tensor if it exists instead of recomputing for current scaling

Signed-off-by: Jeremy Berchtold <[email protected]>

* Fix shardy issue with amax being shape 1,1,1 instead of shape (1,)

Signed-off-by: Jeremy Berchtold <[email protected]>

* Add higher-precision VJP tests to test_distributed_layernorm_mlp

Signed-off-by: Jeremy Berchtold <[email protected]>

* Cast non-quantized kernels to input dtype in VJPs

Signed-off-by: Jeremy Berchtold <[email protected]>

* Rename HighPrecisionTensor to NoScaleTensor

Signed-off-by: Jeremy Berchtold <[email protected]>

* Use NoScaleTensor in pure JAX impls where it was missing

Signed-off-by: Jeremy Berchtold <[email protected]>

* Fix tests

Signed-off-by: Jeremy Berchtold <[email protected]>

---------

Signed-off-by: Jeremy Berchtold <[email protected]>

[JAX] Fix GroupedScaledTensor creation with keyword arg (NVIDIA#2154)

Fix GroupedScaledTensor creation

Signed-off-by: Phuong Nguyen <[email protected]>

Fixing few issues with multi-process launching. (NVIDIA#2155)

* Fixing few issues with multi-process launching.

Signed-off-by: Ming Huang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Ming Huang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Phuong Nguyen <[email protected]>

Update list of authorized CI users (NVIDIA#2152)

Signed-off-by: Tim Moon <[email protected]>

Fused RoPE with combined QKV input. (NVIDIA#2122)

* Fused RoPE with combined QKV input.

Initial commit for Dropout with 8-bit RNG

Fix documentation

Initial commit for Fused QKV RoPE

WIP

Initial tests passing

Enable rotary percent and margin

Enable CP2, start_positions, interleaved

Cleanup test

Revert "Fix documentation"

This reverts commit 53df100.

Revert "Initial commit for Dropout with 8-bit RNG"

This reverts commit 301505e.

Cleanup.

Minor cleanup

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* Optimize kernels

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* Misc. Cleanup

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* Optimize kernel performance

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* Move fused_qkv_rope test to test_fused_rope.py

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* apply shared memory optimization to separate fused rope kernels

Signed-off-by: Xin Yao <[email protected]>

* fix lint

Signed-off-by: Xin Yao <[email protected]>

---------

Signed-off-by: Vasudevan Rengasamy <[email protected]>
Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
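
The fused QKV RoPE commits above apply rotary position embedding directly to a packed QKV tensor instead of separate Q and K tensors. A short unfused PyTorch reference of what that computes (rotate Q and K, pass V through); the [seq, batch, 3, heads, dim] layout and rotate-half convention are assumptions for illustration, not the fused kernel's interface:

```python
# Unfused reference for RoPE on a combined QKV tensor: split, rotate Q and K,
# leave V untouched. Shapes and the rotate-half convention are assumptions;
# the fused kernel operates on the packed tensor in a single pass.
import torch

def rope_reference(qkv, freqs):
    # qkv: [seq, batch, 3, heads, dim], freqs: [seq, dim]
    q, k, v = qkv.unbind(dim=2)
    cos, sin = freqs.cos()[:, None, None, :], freqs.sin()[:, None, None, :]

    def rotate(x):
        x1, x2 = x.chunk(2, dim=-1)              # rotate-half convention
        return torch.cat((-x2, x1), dim=-1)

    q_out = q * cos + rotate(q) * sin
    k_out = k * cos + rotate(k) * sin
    return q_out, k_out, v

seq, batch, heads, dim = 8, 2, 4, 16
qkv = torch.randn(seq, batch, 3, heads, dim)
inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
freqs = torch.outer(torch.arange(seq).float(), inv_freq).repeat(1, 2)  # [seq, dim]
q, k, v = rope_reference(qkv, freqs)
print(q.shape, k.shape, v.shape)
```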