
Conversation

Contributor

@mengfei25 commented Sep 1, 2025

  1. Enable tests in containers
  2. Use local Python instead of conda
  3. Enable pytest parallel runs with continue-if-crash
  4. Use pytest-xdist instead of pytest-shard to parallelize tests on an 8-card system (see the sketch after this description)
  5. Run all tests on the rolling driver

test accelerate and transformers only
disable_build
disable_ut
disable_e2e
disable_distributed
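
For item 4, here is a minimal sketch of how each pytest-xdist worker could be pinned to one card on the 8-card machine. It assumes pytest-xdist is installed (its worker_id fixture and the gw0..gwN worker names come from the plugin); the use of ZE_AFFINITY_MASK for per-worker device selection is an illustrative assumption, not necessarily what the workflow actually does.

```python
# conftest.py -- a rough sketch only, not the exact change in this PR.
import os

import pytest

NUM_CARDS = 8  # the 8-card system mentioned above


@pytest.fixture(scope="session", autouse=True)
def pin_worker_to_card(worker_id):
    """Pin each pytest-xdist worker (gw0..gwN) to one XPU card."""
    if worker_id == "master":
        return  # not running under xdist; leave device selection alone
    card = int(worker_id.lstrip("gw")) % NUM_CARDS
    # Assumption: one card per worker via the Level Zero affinity mask,
    # set before the tests initialize the XPU runtime.
    os.environ["ZE_AFFINITY_MASK"] = str(card)
```

Running with something like `pytest -n 8` then gives each of the eight workers its own card.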

@mengfei25 requested a review from dvrogozh on September 1, 2025 at 07:42
@chuanqi129
Contributor

Hi @dvrogozh, we're working on a refactor of the torch-xpu-ops CI/CD workflows, mainly covering two aspects:

  1. Containerized build and test; please refer to [CI] Refactor CICD test workflows #1862
  2. Parallelized UT tests using pytest-xdist; please refer to [CI] Enable pytest parallel run #1966

These changes bring several benefits:

  • Standardized builds and tests
  • Reduced time cost for each single UT test job
  • No conda dependency: setup-python is used directly in containers, and the test environment is fully isolated (a rough sketch follows below)
  • Simplified runner maintenance; no need to split one node into multiple runners
  • No change to matrix test support
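
To illustrate the "no conda" point, below is a rough sketch of local-Python isolation inside a container using only the standard venv module; the environment path and package list are placeholders, not the workflow's actual values.

```python
# Sketch only: build an isolated venv in the container and run the suite
# from it, with no conda involved. Paths and packages are placeholders.
import subprocess
import venv

ENV_DIR = "/tmp/xpu-test-env"          # hypothetical location
PACKAGES = ["pytest", "pytest-xdist"]  # plus whatever the suite needs

venv.EnvBuilder(with_pip=True).create(ENV_DIR)
subprocess.run([f"{ENV_DIR}/bin/pip", "install", *PACKAGES], check=True)
subprocess.run([f"{ENV_DIR}/bin/pytest", "-n", "8", "tests/"], check=True)
```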

Please help review this PR. After it lands, we'll remove all single-card runners with the linux.idc.xpu label.

@chuanqi129
Contributor

By the way, the accelerate test has 3 failures; I guess they are not related to the changes in this PR.

@chuanqi129
Contributor

@mengfei25 please check the transformers test; no test case is actually being executed.

@chuanqi129 self-requested a review on September 2, 2025 at 15:12
Contributor

@dvrogozh left a comment


Transformers tests don't actually run anything after this change. I don't know the root cause, but this needs to be debugged and fixed.

Important: be extremely careful with parallelizing the transformers tests. In the currently running version, test parallelism is essentially switched off (the max parallel jobs setting is set to 1). That was done because, as soon as we try to parallelize, we run into networking issues either on our side or on the Hugging Face Hub side. Last time we failed to overcome the problem, so max parallel jobs was set to 1. The matrix is still used to break the whole test into smaller chunks of around ~30 minutes or less each, which allows rerunning a smaller portion of the test on failure instead of rerunning the whole suite.
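
For illustration only, one way to reduce exposure to the hub networking flakiness when parallelizing is to warm the Hugging Face cache once before launching the workers and then force the workers offline. The sketch below assumes the needed checkpoints are known in advance; the model list is purely illustrative.

```python
# Sketch: prefetch checkpoints once, then let parallel workers run
# entirely from the local cache instead of hitting the network.
import os
import time

from huggingface_hub import snapshot_download

MODELS = ["hf-internal-testing/tiny-random-bert"]  # illustrative subset only

for repo_id in MODELS:
    for attempt in range(3):
        try:
            snapshot_download(repo_id)
            break
        except Exception:
            if attempt == 2:
                raise
            time.sleep(10 * (attempt + 1))  # simple backoff before retrying

# With everything cached, the workers can run fully offline.
os.environ["HF_HUB_OFFLINE"] = "1"
```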

Contributor

@dvrogozh commented Sep 2, 2025

By the way, the accelerate test has 3 failures; I guess they are not related to the changes in this PR.

These seem to be new failures. I see some test failures in this weekly run as well, while the weekly run from a week ago was clean.

@mengfei25 requested a review from dvrogozh on September 3, 2025 at 09:59
Contributor

@dvrogozh left a comment


@mengfei25, the errors in the transformers and accelerate tests look unrelated to this PR and correlate with the current version of the tests. I am impressed by the reduction in transformers test execution time. I was afraid we would run into the networking issue again, but the test results indicate that we did not.

However, I have two requests before merging this PR:

  1. There are still a few unanswered questions from me in this PR. Can you please go through them and reply?
  2. I would like to be on the safe side and make sure we really can parallelize the transformers tests. I suggest force-rerunning the test a few times to verify that we don't run into the problem (a rough rerun loop is sketched below). We can do that once we finish discussing the questions that are not yet closed.
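
A rough sketch of the rerun idea, assuming the transformers tests live under a path like tests/transformers and run with 8 xdist workers (both are assumptions):

```python
# Sketch: rerun the parallel transformers suite several times and count
# how often it fails, to check that the parallel run is stable.
import subprocess
import sys

RERUNS = 5
failures = 0
for _ in range(RERUNS):
    proc = subprocess.run(
        [sys.executable, "-m", "pytest", "-n", "8", "tests/transformers"]
    )
    if proc.returncode != 0:
        failures += 1

print(f"{failures} of {RERUNS} reruns reported failures")
```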

@mengfei25
Contributor Author

@dvrogozh Please help review it

@intel deleted a comment from chuanqi129 on Sep 8, 2025
@mengfei25 requested a review from dvrogozh on September 8, 2025 at 06:42
@mengfei25 requested a review from dvrogozh on September 9, 2025 at 06:56
Contributor

@dvrogozh left a comment


I am not OK with the latest changes in the PR. A lot of new changes appeared all of a sudden, which requires reviewing the PR anew, while yesterday's version was almost ready to go with a few minor changes. We need to align again.

Contributor

@dvrogozh left a comment


LGTM

@mengfei25
Contributor Author

@chuanqi129 Please help merge it.

@chuanqi129 merged commit fa9212b into main on Sep 11, 2025
19 of 25 checks passed
@chuanqi129 deleted the mengfeil/modify-extra-tests branch on September 11, 2025 at 05:35
zhangxiaoli73 pushed a commit that referenced this pull request Sep 22, 2025