Arm backend: Use dbg_fail when node visitors raise exceptions #9391

Merged
merged 9 commits into pytorch:main on Apr 1, 2025

Conversation

oscarandersson8218
Collaborator

@oscarandersson8218 oscarandersson8218 commented Mar 19, 2025

Adds a try-except around the node_visitor call so that dbg_fail() can be called when an error/exception is raised.
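
A minimal sketch of the pattern described above, not the actual Arm backend code: the visitor call is wrapped in try/except so that a debug hook can run before the exception propagates. `visit_with_debug`, `report_failure`, and their signatures are hypothetical stand-ins (the real `dbg_fail()` signature is not shown in this thread), and re-raising after the hook is an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def report_failure(node_name, error):
    """Hypothetical stand-in for dbg_fail(): record which node failed."""
    logger.error("Failed while visiting node %s: %s", node_name, error)


def visit_with_debug(visitor, node_name, on_fail=report_failure):
    """Call visitor(node_name); on any exception run the debug hook first.

    The exception is re-raised afterwards (an assumption) so that the
    caller still sees the original error and traceback.
    """
    try:
        return visitor(node_name)
    except Exception as error:
        on_fail(node_name, error)
        raise
```

With this shape, the hook only adds diagnostics; callers that already handle the visitor's exceptions keep working unchanged.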

cc @digantdesai @freddan80 @per @zingo

oscarandersson8218 and others added 3 commits March 17, 2025 13:50
Adds a try-except around the node_visitor call to be able to call
dbg_fail() when an error/exception is raised.

Signed-off-by: Oscar Andersson <[email protected]>
Change-Id: I3b633e1ff255fa5b3a5257016acfe2e9dc03b033
Signed-off-by: Oscar Andersson <[email protected]>
Change-Id: Id7f44c539682a55e06564c7b3294988c122c00b3

pytorch-bot bot commented Mar 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9391

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Cancelled Job

As of commit ac1f223 with merge base 97bca05 (image):

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 19, 2025
@oscarandersson8218 oscarandersson8218 added partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm ciflow/trunk topic: not user facing labels Mar 19, 2025
@zingo
Collaborator

zingo commented Mar 19, 2025

Hi @larryliu0820

This broke our internal tests and we need to revert it first. Could you please re-submit the PR, and wait for us to import and run internal CI, and paste the error message?

Hi, could you help us by pointing out which error you got from this, so we know how to avoid it in the future? Or, even better, could the internal tests be ported to run on GitHub as well, so that we get proper testing when merging?

@facebook-github-bot
Contributor

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@larryliu0820
Contributor

@zingo
Collaborator

zingo commented Mar 19, 2025

@zingo @oscarandersson8218 can you guys take a look at the failing job first: https://github.com/pytorch/executorch/actions/runs/13940909648/job/39017417934?pr=9391

[ 87%] Building CXX object kernels/portable/CMakeFiles/portable_kernels.dir/cpu/util/matmul_ops_util.cpp.obj
[ 87%] Building CXX object kernels/portable/CMakeFiles/optimized_portable_kernels.dir/cpu/op_unbind_copy.cpp.obj
[ 88%] Building CXX object kernels/portable/CMakeFiles/portable_kernels.dir/cpu/util/kernel_ops_util.cpp.obj
/home/ec2-user/actions-runner/_work/_temp/3423f1ab-f4a3-4d06-ba1e-bf9895ca4559.sh: line 11: 476510 Killed                  python3 "/home/ec2-user/actions-runner/_work/executorch/executorch/test-infra/.github/scripts/run_with_env_secrets.py" ""
Error: Process completed with exit code 137.

It seems to have just been killed in the middle of building.
Is there maybe a resource problem while testing?

As for https://github.com/pytorch/executorch/actions/runs/13940909648/job/39017419390?pr=9391
it has not even started. We have seen that a lot for about a week now. Is there a problem with the other Docker images?
If we retrigger many times they sometimes run, so it doesn't seem to be a logic problem.

@zingo
Collaborator

zingo commented Mar 19, 2025

I retriggered a run of the arm tests

@digantdesai
Contributor

digantdesai commented Mar 19, 2025

Talking with @huydhn, he suspects we are running out of RAM on this 16G (c5.2xlarge) runner. @zingo, do you know off the top of your head how much RAM we use during the container run? Exit code 137 looks like SIGKILL (9), so it could be OOM. We can try a larger runner and see if this goes away.
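
The arithmetic behind reading 137 as SIGKILL: shells conventionally report a process killed by a signal as 128 plus the signal number, and 137 - 128 = 9. A small sketch of that decoding, assuming a POSIX system; `describe_exit_code` is a hypothetical helper name, not something from the CI scripts.

```python
import signal


def describe_exit_code(code):
    """Translate a shell exit code into a signal name when it encodes one.

    Codes above 128 conventionally mean "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            return signal.Signals(signum).name
        except ValueError:
            return f"unknown signal {signum}"
    return f"exit status {code}"
```

On Linux, `describe_exit_code(137)` returns `"SIGKILL"`. This is consistent with an out-of-memory kill, but it does not prove one; the exit code alone does not say who sent the signal.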

Let's see if CI passes on this - #9409

@zingo
Collaborator

zingo commented Mar 19, 2025

Interesting, maybe that is what happens. But the interesting part is that it happens while we are only building, which should be about the same for everyone.
Maybe we are the only ones using cmake --parallel, making it run more things in parallel, and the build has grown over time 🤔

@zingo
Collaborator

zingo commented Mar 19, 2025

Interesting, maybe that is what happens. But the interesting part is that it happens when we only test building. Maybe we are the only ones using --parallel, making it run more things in parallel, and the build has grown over time 🤔

Also, I think the CMake version got bumped recently; maybe the --parallel flag behaves a bit differently now?
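
One way to test the --parallel theory would be to pass an explicit job count, since `cmake --build <dir> --parallel` without a number leaves the job count to the native build tool. The sketch below caps the job count by available memory. The helper names and the per-job memory figure are illustrative assumptions, not values measured on this CI.

```python
def pick_parallel_jobs(cpu_count, total_mem_gib, mem_per_job_gib):
    """Cap the number of parallel compile jobs by both CPUs and memory."""
    by_memory = int(total_mem_gib // mem_per_job_gib)
    return max(1, min(cpu_count, by_memory))


def cmake_build_command(build_dir, jobs):
    """Build the argument list for `cmake --build` with an explicit job count."""
    return ["cmake", "--build", build_dir, "--parallel", str(jobs)]


# Illustration only: 8 vCPUs and 16 GiB as on a c5.2xlarge, and an
# assumed (not measured) peak of 4 GiB per compile job.
jobs = pick_parallel_jobs(cpu_count=8, total_mem_gib=16, mem_per_job_gib=4)
command = cmake_build_command("build", jobs)
```

If the build passes on the 16G runner with a fixed, lower job count, that would point at peak memory from parallel compilation instead of a change in the tests themselves.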

@facebook-github-bot
Contributor

@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@digantdesai
Contributor

digantdesai commented Mar 21, 2025

cmake --parallel

hmm.. this is interesting. We could try and root cause this and revert back to 16G if we can. At least it is green for now.

@zingo
Collaborator

zingo commented Mar 21, 2025

cmake --parallel

hmm.. this is interesting. We could try and root cause this and revert back to 16G if we can. At least it is green for now.

Yes maybe, that would be interesting

@facebook-github-bot
Contributor

@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@digantdesai
Contributor

Sorry, this is still on me. I will take a look today or tomorrow. Retrying after a rebase to see if it works :p

@oscarandersson8218
Collaborator Author

Unrelated CI failures. @digantdesai can you have a look at this again? :)

@digantdesai
Contributor

Internal CI looks good. Stamping.

@facebook-github-bot
Contributor

@digantdesai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@oscarandersson8218 oscarandersson8218 merged commit 5cc98bc into pytorch:main Apr 1, 2025
166 of 167 checks passed
kirklandsign pushed a commit that referenced this pull request Apr 11, 2025
Adds a try-except around the node_visitor call to be able to call
dbg_fail() when an error/exception is raised.

Signed-off-by: Oscar Andersson <[email protected]>
Co-authored-by: Digant Desai <[email protected]>
Labels
ciflow/trunk CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. partner: arm For backend delegation, kernels, demo, etc. from the 3rd-party partner, Arm topic: not user facing
5 participants