Bump PyTorch nightly pin past April 22, 2025 #10362

Merged
swolchok merged 5 commits into main from gh/swolchok/422/head
May 27, 2025

Conversation

swolchok
Contributor

[ghstack-poisoned]
@swolchok
Contributor Author

swolchok commented Apr 22, 2025

Stack from ghstack (oldest at bottom):

pytorch-bot bot commented Apr 22, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10362

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 3 Unrelated Failures

As of commit b8f6b50 with merge base 3066463:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

swolchok added a commit that referenced this pull request Apr 22, 2025
Found the commit hash from https://hud2.pytorch.org/hud/pytorch/pytorch/nightly/0 .


ghstack-source-id: 7ef4d5c
ghstack-comment-id: 2822111987
Pull-Request-resolved: #10362
@facebook-github-bot added the CLA Signed label Apr 22, 2025
@swolchok
Contributor Author

swolchok commented Apr 22, 2025

"Build documentation / check-urls" failure looks spurious; will rerun when the option becomes available.
ditto "trunk / test-models-linux-aarch64 (qwen2_5, portable, linux.arm64.2xlarge) / linux-job (push)"

@cccclai
Contributor

cccclai commented Apr 22, 2025

Can we hold off on bumping? pytorch/pytorch#151436, which merged last week, is causing 50+ test failures...

@swolchok
Contributor Author

Can we hold off on bumping? pytorch/pytorch#151436, which merged last week, is causing 50+ test failures...

@cccclai OK, please let me know when we can do this.

@cccclai
Contributor

cccclai commented Apr 23, 2025

@tugsbayasgalan is working on resolving the failure.

@swolchok
Contributor Author

ping @cccclai @tugsbayasgalan

@cccclai
Contributor

cccclai commented May 13, 2025

ping @cccclai @tugsbayasgalan

See the discussion in #10319 (comment); hopefully it's resolved soon.

@cccclai
Contributor

cccclai commented May 15, 2025

#10769 for the fix.

[ghstack-poisoned]
@swolchok requested a review from GregoryComer as a code owner May 21, 2025 17:36
swolchok added a commit that referenced this pull request May 21, 2025
Found the commit hash from https://hud2.pytorch.org/hud/pytorch/pytorch/nightly/0 .

ghstack-source-id: fbc5600
ghstack-comment-id: 2822111987
Pull-Request-resolved: #10362
@swolchok changed the title from "Bump PyTorch nightly pin past April 22 2025" to "Bump PyTorch nightly pin past May 21, 2025" on May 21, 2025
@swolchok
Contributor Author

Looks like we hit more breakage if I try to roll forward to the present day. Let's go back to at least catching up to April 22.

[ghstack-poisoned]
@swolchok changed the title from "Bump PyTorch nightly pin past May 21, 2025" to "Bump PyTorch nightly pin past April 22, 2025" on May 21, 2025
swolchok added a commit that referenced this pull request May 21, 2025
Found the commit hash from https://hud2.pytorch.org/hud/pytorch/pytorch/nightly/0 .

ghstack-source-id: ca64709
ghstack-comment-id: 2822111987
Pull-Request-resolved: #10362
@swolchok changed the title from "Bump PyTorch nightly pin past April 22, 2025" to "Bump PyTorch nightly pin past May 21, 2025" on May 21, 2025
@cccclai
Contributor

cccclai commented May 21, 2025

FYI, #10769 is merged.

@swolchok changed the title from "Bump PyTorch nightly pin past May 21, 2025" to "Bump PyTorch nightly pin past April 22, 2025" on May 21, 2025
[ghstack-poisoned]
swolchok added a commit that referenced this pull request May 21, 2025
Found the commit hash from https://hud2.pytorch.org/hud/pytorch/pytorch/nightly/0 .

ghstack-source-id: f50a6df
ghstack-comment-id: 2822111987
Pull-Request-resolved: #10362
@swolchok added the release notes: none label May 21, 2025
@swolchok
Contributor Author

test-llava-runner-linux seems to be consistently failing with SIGILL :(

@swolchok
Contributor Author

test-llava-runner-linux seems to be consistently failing with SIGILL :(

Hm, I successfully ran `python -m unittest examples.models.llava.test.test_llava` with this PR on a Linux box, and Google says that the runners support AVX512, so it's not something simple like that. Trying one more time.

@swolchok
Contributor Author

If I'm reading correctly, the illegal instruction is a vgf2p8affineqb somewhere in portable_lib. This is an AVX512 GFNI instruction, which requires at least Ice Lake, and the processor in question appears to be Cascade Lake.
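
One way to check a binary for such instructions is to scan its disassembly for GFNI mnemonics. The following is a minimal sketch, assuming the input is text in the tab-separated layout produced by `objdump -d` (address, raw bytes, instruction); the function name `find_gfni_instructions` is hypothetical and is not part of ExecuTorch or XNNPACK.

```python
# GFNI mnemonics: legacy SSE forms and the VEX/EVEX "v"-prefixed forms.
GFNI_MNEMONICS = {
    "gf2p8affineqb", "gf2p8affineinvqb", "gf2p8mulb",
    "vgf2p8affineqb", "vgf2p8affineinvqb", "vgf2p8mulb",
}


def find_gfni_instructions(disassembly: str) -> list[tuple[str, str]]:
    """Return (address, mnemonic) pairs for GFNI instructions.

    Assumes `objdump -d` style lines: "<addr>:\t<hex bytes>\t<mnemonic> <operands>".
    Lines that do not have all three tab-separated fields are skipped.
    """
    hits = []
    for line in disassembly.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # section headers, labels, byte-continuation lines
        address = parts[0].strip().rstrip(":")
        tokens = parts[2].split()
        if tokens and tokens[0] in GFNI_MNEMONICS:
            hits.append((address, tokens[0]))
    return hits
```

For example, the output of `objdump -d` on the shared library could be captured with `subprocess.run(..., capture_output=True, text=True)` and passed in as a string. A non-empty result only shows that the instruction is present in the binary, not that it is reachable on a given CPU.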

@swolchok
Contributor Author

This instruction does seem to be generated in xnn_f32_vrcopysignc_ukernel__sse2_u16 in my local build as well; I just have a machine handy that can execute it. I guess something in the pin bump must've caused us to start delegating this op to XNNPACK? CC @digantdesai

@swolchok
Contributor Author

swolchok commented May 23, 2025

in xnn_f32_vrcopysignc_ukernel__sse2_u16

the symbol name here may or may not be accurate; I just found that XNNPACK has an avx512vnnigfni configuration

@swolchok
Contributor Author

we are in fact building at least some avx512vnnigfni files per https://github.com/pytorch/executorch/actions/runs/15171458271/job/42808833402?pr=10362

however, this is not new; https://github.com/pytorch/executorch/actions/runs/15218616519/job/42809734929 (a recent trunk run from HUD) also builds them.

either 1) we are delegating some op that we weren't delegating before AND XNNPACK's gating for these GFNI kernels was always broken, 2) bumping the pytorch pin somehow magically caused ExecuTorch's copy of XNNPACK to change, or 3) something else I haven't thought of yet.
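
The comparison between the PR run and the trunk run can be sketched as a diff over build-log text. This is a rough illustration only: it assumes the logs have been downloaded as plain text and that compiled file paths appear as whitespace-delimited tokens; the helper names `tokens_with_marker` and `only_in_first` are hypothetical.

```python
def tokens_with_marker(build_log: str, marker: str = "avx512vnnigfni") -> set[str]:
    """Collect whitespace-delimited tokens that mention `marker` (case-insensitive)."""
    marker = marker.lower()
    return {
        token
        for line in build_log.splitlines()
        for token in line.split()
        if marker in token.lower()
    }


def only_in_first(log_a: str, log_b: str, marker: str = "avx512vnnigfni") -> list[str]:
    """Sorted tokens mentioning `marker` that appear in log_a but not in log_b."""
    return sorted(tokens_with_marker(log_a, marker) - tokens_with_marker(log_b, marker))
```

An empty result from `only_in_first(pr_log, trunk_log)` would be consistent with the observation above that both runs build the same avx512vnnigfni files, which points away from option 2.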

@digantdesai
Contributor

digantdesai commented May 24, 2025

This instruction is likely coming from the QB4W GEMM kernel. This kernel is only reachable through ExecuTorch (ET), not PyTorch (PT). Not sure why it is getting activated on a Cascade Lake CPU on the CI C5, especially on a pin bump. At compile time, this kernel's compilation is gated by XNNPACK_ENABLE_AVX512VNNIGFNI, and at runtime it is selected if cpuinfo_has_x86_avx512vnni() && cpuinfo_has_x86_gfni().

For now, you may try disabling these kernels at compile time to unblock yourself from updating the pin. I can try to take a closer look next week. Like you, the x86 box I have at hand is also Ice Lake, which supports these.
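
The runtime condition described above can be approximated from Python on a Linux machine, which may help confirm what a given runner reports. This is a sketch under assumptions: it reads the `flags` line of `/proc/cpuinfo` text (using the Linux flag names `avx512_vnni` and `gfni`) rather than calling the cpuinfo library, which queries CPUID directly, so the two could disagree. The function names are hypothetical.

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports_avx512vnni_gfni(cpuinfo_text: str) -> bool:
    """Approximate cpuinfo_has_x86_avx512vnni() && cpuinfo_has_x86_gfni()."""
    flags = cpu_flags(cpuinfo_text)
    return "avx512_vnni" in flags and "gfni" in flags
```

On a Linux box this could be called as `supports_avx512vnni_gfni(open("/proc/cpuinfo").read())`. A machine that reports `avx512_vnni` without `gfni` should return `False` here, which is the case the runtime gate is meant to exclude.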

…nts), disable XNNPACK_ENABLE_AVX512VNNIGFNI

[ghstack-poisoned]
swolchok added a commit that referenced this pull request May 27, 2025
Found the commit hash from https://hud2.pytorch.org/hud/pytorch/pytorch/nightly/0 .

ghstack-source-id: 6c5d261
ghstack-comment-id: 2822111987
Pull-Request-resolved: #10362
@swolchok
Contributor Author

failures seem to pattern-match to known issues on main. merging.

@swolchok merged commit 4cf5c06 into main May 27, 2025
288 of 295 checks passed
@swolchok deleted the gh/swolchok/422/head branch May 27, 2025 17:59
Labels
ciflow/trunk, CLA Signed, release notes: none, topic: not user facing
5 participants