[CI] Modify accelerate and transformers tests #1999
Conversation
Hi @dvrogozh, we're working on a refactor of the torch-xpu-ops CI/CD workflows, mainly covering 2 aspects:
These changes bring a lot of benefits:
Please help review this PR. After it lands, we'll remove all single-card runners.
By the way, the accelerate test has 3 failures; I guess they are not related to this PR's changes.
@mengfei25, please check the transformers test; actually no test cases are being executed.
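For reference, one quick way to confirm that symptom (a hedged sketch; the step name and test path are assumptions, not taken from this PR) is to ask pytest to collect tests without executing them:

```yaml
# Hypothetical debugging step: list what pytest would run without executing it.
- name: Verify test collection
  run: |
    # --collect-only prints the collected cases; an empty result
    # confirms that no cases are being selected at all.
    python -m pytest tests/ --collect-only -q | tail -n 5
```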
The transformers tests don't actually run anything after this change. I don't know the root cause, but this needs to be debugged and fixed.
Important: be extremely careful with parallelizing the transformers tests. In the currently executing version, test parallelism is essentially switched off (see the max-parallel jobs setting, which is set to 1). That was done because as soon as we try to parallelize, we run into networking issues, either on our side or on the Hugging Face Hub side. Last time we failed to overcome the problem, and max-parallel jobs was set to 1. The matrix is still used to break the whole test suite into smaller chunks of around ~30 minutes or less each. This allows rerunning a smaller portion of the tests on failure instead of rerunning the whole suite.
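For context, a minimal sketch of the scheme described above (assumed job name and chunk paths; not the repository's actual workflow file):

```yaml
# The matrix still splits the suite into ~30-minute chunks, but
# max-parallel: 1 runs them strictly one at a time, which sidesteps the
# networking/hub errors and lets a failed chunk be rerun on its own.
test-transformers:
  strategy:
    fail-fast: false
    max-parallel: 1                                    # parallelism effectively off
    matrix:
      chunk: [tests/models, tests/pipelines, tests/utils]   # assumed split
  steps:
    - name: Run chunk
      run: python -m pytest ${{ matrix.chunk }}
```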
These seem to be new failures. I see some test failures in this weekly run as well, while the week-old weekly run was clean.
@mengfei25, the errors in the transformers and accelerate tests look unrelated to this PR and correlate with the current version of the tests. I am impressed by the reduction in transformers test execution time. I was afraid that we would again run into the networking issue, but the test results indicate that we did not.
However, I have 2 requests before merging this PR:
- There are still a few unanswered questions from me in this PR. Can you please go through them and reply?
- I would like to be on the safe side and make sure we really can parallelize the transformers tests. I suggest force-rerunning the tests a few times to verify that we don't hit the problem. We can do that once we finish the discussion on the questions that are not yet closed.
@dvrogozh Please help review it.
I am not OK with the latest changes in the PR. A lot of new changes appeared all of a sudden, which requires reviewing the PR anew, while yesterday's version was almost ready to go with a few minor changes. We need to align again.
LGTM
@chuanqi129 Please help merge it.
1. Enable tests in a container.
2. Use the local python instead of conda.
3. Enable pytest parallel runs and continue-if-crash.
4. Use pytest-xdist to parallelize tests instead of pytest-shard on an 8-card system.
5. Run all tests on the rolling driver.

test accelerate and transformers only
disable_build
disable_ut
disable_e2e
disable_distributed
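For illustration, a minimal sketch of how items 1, 2, 4, and 5 could look in a workflow job (hypothetical runner label, container image, and paths; not this PR's exact configuration):

```yaml
test-transformers:
  runs-on: [self-hosted, rolling]        # assumed label for a rolling-driver runner (item 5)
  container:
    image: intel/deep-learning:latest    # assumed image (item 1)
    options: --device=/dev/dri           # expose the GPU devices to the container
  steps:
    - uses: actions/checkout@v4
    - name: Install deps with the system python, no conda (item 2)
      run: python -m pip install pytest pytest-xdist
    # Before: pytest-shard split the suite across separate jobs, e.g.
    #   pytest tests/ --shard-id=$SHARD --num-shards=8
    # After: pytest-xdist spawns 8 workers inside a single job (item 4).
    - name: Run tests with one worker per card
      run: python -m pytest tests/ -n 8 --dist=load
```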