
Conversation

Contributor

@mengfei25 commented Sep 1, 2025

  1. Enable tests in containers
  2. Use local Python instead of conda
  3. Enable pytest parallel runs with continue-if-crash
  4. Use pytest-xdist instead of pytest-shard to parallelize tests on an 8-card system (see the sketch after this description)
  5. Run all tests on the rolling driver

test accelerate and transformers only
disable_build
disable_ut
disable_e2e
disable_distributed
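
For item 4, here is a minimal sketch of how each pytest-xdist worker could be pinned to one card on the 8-card machine. It assumes pytest-xdist is installed (its worker_id fixture and the gw0..gwN worker names come from the plugin); the use of ZE_AFFINITY_MASK for per-worker device selection is an illustrative assumption, not necessarily what the workflow actually does.

```python
# conftest.py -- a rough sketch only, not the exact change in this PR.
import os

import pytest

NUM_CARDS = 8  # the 8-card system mentioned above


@pytest.fixture(scope="session", autouse=True)
def pin_worker_to_card(worker_id):
    """Pin each pytest-xdist worker (gw0..gwN) to one XPU card."""
    if worker_id == "master":
        return  # not running under xdist; leave device selection alone
    card = int(worker_id.lstrip("gw")) % NUM_CARDS
    # Assumption: one card per worker via the Level Zero affinity mask,
    # set before the tests initialize the XPU runtime.
    os.environ["ZE_AFFINITY_MASK"] = str(card)
```

Running with something like `pytest -n 8` then gives each of the eight workers its own card.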

@mengfei25 requested a review from dvrogozh on September 1, 2025 at 07:42
@chuanqi129
Contributor

Hi @dvrogozh, we're working on a refactor of the torch-xpu-ops CI/CD workflows, mainly covering two aspects:

  1. Containerized build and test; please refer to [CI] Refactor CICD test workflows #1862
  2. Parallelized UT tests using pytest-xdist; please refer to [CI] Enable pytest parallel run #1966

These changes bring several benefits:

  • Standardized builds and tests
  • Reduced time cost for each single UT test job
  • No conda dependency: setup-python is used directly in containers, and the test environment is fully isolated (a rough sketch follows below)
  • Simplified runner maintenance; no need to split one node into multiple runners
  • No change to matrix test support
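
To illustrate the "no conda" point, below is a rough sketch of local-Python isolation inside a container using only the standard venv module; the environment path and package list are placeholders, not the workflow's actual values.

```python
# Sketch only: build an isolated venv in the container and run the suite
# from it, with no conda involved. Paths and packages are placeholders.
import subprocess
import venv

ENV_DIR = "/tmp/xpu-test-env"          # hypothetical location
PACKAGES = ["pytest", "pytest-xdist"]  # plus whatever the suite needs

venv.EnvBuilder(with_pip=True).create(ENV_DIR)
subprocess.run([f"{ENV_DIR}/bin/pip", "install", *PACKAGES], check=True)
subprocess.run([f"{ENV_DIR}/bin/pytest", "-n", "8", "tests/"], check=True)
```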

Please help review this PR. After it lands, we'll remove all single-card runners with the linux.idc.xpu label.

@chuanqi129
Contributor

By the way, the accelerate test has 3 failures; I guess they are not related to the changes in this PR.

@chuanqi129
Contributor

@mengfei25 please check the transformers test; no test case is actually being executed.

@chuanqi129 self-requested a review on September 2, 2025 at 15:12
Contributor

@dvrogozh left a comment


Transformers tests don't actually run anything after this change. I don't know the root cause, but this needs to be debugged and fixed.

Important: be extremely careful with parallelizing the transformers tests. In the currently running version, test parallelism is essentially switched off (the max parallel jobs setting is set to 1). That was done because, as soon as we try to parallelize, we run into networking issues either on our side or on the Hugging Face Hub side. Last time we failed to overcome the problem, so max parallel jobs was set to 1. The matrix is still used to break the whole test into smaller chunks of around ~30 minutes or less each, which allows rerunning a smaller portion of the test on failure instead of rerunning the whole suite.
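
For illustration only, one way to reduce exposure to the hub networking flakiness when parallelizing is to warm the Hugging Face cache once before launching the workers and then force the workers offline. The sketch below assumes the needed checkpoints are known in advance; the model list is purely illustrative.

```python
# Sketch: prefetch checkpoints once, then let parallel workers run
# entirely from the local cache instead of hitting the network.
import os
import time

from huggingface_hub import snapshot_download

MODELS = ["hf-internal-testing/tiny-random-bert"]  # illustrative subset only

for repo_id in MODELS:
    for attempt in range(3):
        try:
            snapshot_download(repo_id)
            break
        except Exception:
            if attempt == 2:
                raise
            time.sleep(10 * (attempt + 1))  # simple backoff before retrying

# With everything cached, the workers can run fully offline.
os.environ["HF_HUB_OFFLINE"] = "1"
```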

Contributor

@dvrogozh commented Sep 2, 2025

By the way, the accelerate test has 3 failures; I guess they are not related to the changes in this PR.

These seem to be new failures. I see some test failures in this weekly run as well, while the weekly run from a week ago was clean.

@mengfei25 requested a review from dvrogozh on September 3, 2025 at 09:59
Contributor

@dvrogozh left a comment


@mengfei25, the errors in the transformers and accelerate tests look unrelated to this PR and correlate with the current version of the tests. I am impressed by the reduction in transformers test execution time. I was afraid we would run into the networking issue again, but the test results indicate that we did not.

However, I have two requests before merging this PR:

  1. There are still a few unanswered questions from me in this PR. Can you please go through them and reply?
  2. I would like to be on the safe side and make sure we really can parallelize the transformers tests. I suggest force-rerunning the test a few times to verify that we don't run into the problem (a rough rerun loop is sketched below). We can do that once we finish discussing the questions that are not yet closed.
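
A rough sketch of the rerun idea, assuming the transformers tests live under a path like tests/transformers and run with 8 xdist workers (both are assumptions):

```python
# Sketch: rerun the parallel transformers suite several times and count
# how often it fails, to check that the parallel run is stable.
import subprocess
import sys

RERUNS = 5
failures = 0
for _ in range(RERUNS):
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-n", "8", "tests/transformers"]
    )
    if proc.returncode != 0:
        failures += 1

print(f"{failures} of {RERUNS} reruns reported failures")
```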

@mengfei25
Contributor Author

@dvrogozh Please help review it

@intel deleted a comment from chuanqi129 on Sep 8, 2025
@mengfei25 requested a review from dvrogozh on September 8, 2025 at 06:42
@mengfei25 requested a review from dvrogozh on September 9, 2025 at 06:56
Contributor

@dvrogozh left a comment


I am not OK with the latest changes in the PR. A lot of new changes appeared all of a sudden, which requires reviewing the PR anew, while yesterday's version was almost ready to go with a few minor changes. We need to align again.

Contributor

@dvrogozh left a comment


LGTM

@mengfei25
Contributor Author

@chuanqi129 Please help merge it.

@chuanqi129 merged commit fa9212b into main on Sep 11, 2025
19 of 25 checks passed
@chuanqi129 deleted the mengfeil/modify-extra-tests branch on September 11, 2025 at 05:35
zhangxiaoli73 pushed a commit that referenced this pull request Sep 22, 2025