
[WIP] Testutils run distributed fix #1209

Closed
wants to merge 28 commits

Conversation

hyang0129
Contributor

see #1171

@hyang0129 hyang0129 changed the title Testutils run distributed fix [WIP] Testutils run distributed fix Mar 2, 2020
@hyang0129
Contributor Author

@Squadrick please add Ubuntu GPU Python3 CI

@hyang0129
Contributor Author

@seanpmorgan I thought that the GPU test would stay, but it seems that it gets removed for some reason. I'll ping you again once the code is actually ready to test on GPU.

…laining what happens if you try to init devices again
added comments
@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@seanpmorgan
Member

@googlebot I consent.

@googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@seanpmorgan
Member

It seems that the wheel build process changed sometime in the last 6 days.

The build wheels process now runs tests differently than the bazel tests. It seems that the tf context is initialized before tests are run, or does not get reset between tests.

This is very odd because the bazel tests work without issue. This also seems to bypass the check to see if logical devices have been set up already.

Hmmm thanks for bringing this up. I'm thinking it's related to running pytest as the runner:
https://github.com/tensorflow/addons/blob/master/tools/testing/addons_cpu.sh#L41

But it's not clear to me why. I just merged the most recent CI update, which splits that out as a single test so it'll be easier to debug.

@gabrieldemarmiesse
Member

gabrieldemarmiesse commented Mar 11, 2020

So a bit of background to help here:

Pytest doesn't restart the python interpreter between test files. I believe that bazel does. When running the gpu tests in the CI, there is only one worker, to avoid out-of-memory errors. If that's an issue, we can always fall back to running the bazel command in kokoro; we're compatible with both bazel and pytest, so that should be fine.

If I understand your issue correctly, you count on the python interpreter being shut down to force the cleanup of the virtual devices, right? Is it possible to do the cleanup in python without shutting down the interpreter?

Update: I tested this pull request locally on CPU. It has nothing to do with shutting down the interpreter. This pull request works fine with pytest when using a single worker (pytest -v tensorflow_addons/utils), but it fails when using multiple processes (pytest -v -n 3 tensorflow_addons/utils). Since the tests run with multiple workers in the CI, that explains the failure.
This also explains why it works with bazel. Bazel can't execute multiple test functions from the same file concurrently, but pytest can. So bazel executing your file with a single worker avoids the issue.

If possible, we should aim to have tests that work even if multiple processes are running at the same time, at least when running CPU tests.
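
The device setup at stake looks roughly like the sketch below (assuming the TF 2.1-era tf.config.experimental API; the exact calls in test_utils may differ):

    import tensorflow as tf

    # Split the first physical CPU into two logical devices so a
    # tf.distribute strategy has more than one device to place replicas on.
    cpus = tf.config.experimental.list_physical_devices("CPU")
    tf.config.experimental.set_virtual_device_configuration(
        cpus[0],
        [
            tf.config.experimental.VirtualDeviceConfiguration(),
            tf.config.experimental.VirtualDeviceConfiguration(),
        ],
    )

    # The first op executed (or a query like the one below) initializes the
    # TensorFlow context with that layout.
    print(tf.config.experimental.list_logical_devices("CPU"))

    # From here on, calling set_virtual_device_configuration with a different
    # layout raises a RuntimeError; getting a clean layout requires a fresh
    # process, which bazel's one-process-per-file model provides and a
    # long-lived pytest worker does not.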

@hyang0129
Contributor Author

hyang0129 commented Mar 11, 2020

I would like that as well. Do you have any suggestions?

I believe the issue arises from the set-logical-device-configuration command, which cannot be altered as part of this PR (it is the only way to set up virtual devices for running tests in distributed mode).

My guess is that pytest in multi-worker mode is capable of creating independent test sessions in tensorflow, but is not able to interact with the physical and logical device configurations independently. There may be something external to the tf session level that determines device configurations.

Based on the logs and the fact that it works in single-worker mode, I believe that the context is shared between test sessions in multi-worker mode. For this to be true, there would have to be some mechanism that resets the context between tests in single-worker mode but does not fire in multi-worker mode.

If that is the case, then we really only have two options.

  1. Make the set-logical-device-configuration call work in a multi-worker environment.
  2. Always run tests that interact with device configurations in single-worker mode.

The first option is probably the more correct one, but it requires changes to TensorFlow itself. We might be stuck with option 2 for the near future.
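
For illustration, a guarded setup along these lines (a hypothetical helper, not the code in this PR) is about the best that can be done inside a single process; if the context was already initialized without virtual devices, as in the multi-worker pytest run, nothing can be done in-process, which is what makes option 2 attractive.

    import tensorflow as tf

    NUM_VIRTUAL_CPUS = 2  # illustrative constant

    def ensure_virtual_cpus():
        """Set up virtual CPU devices, tolerating an already-initialized context."""
        physical = tf.config.experimental.list_physical_devices("CPU")
        try:
            tf.config.experimental.set_virtual_device_configuration(
                physical[0],
                [tf.config.experimental.VirtualDeviceConfiguration()]
                * NUM_VIRTUAL_CPUS,
            )
        except RuntimeError:
            # The context was initialized before this call (for example by an
            # earlier test in the same pytest worker), so the layout can no
            # longer be changed. All that is left is to check whether it
            # already matches what the distributed tests need.
            if len(tf.config.experimental.list_logical_devices("CPU")) < NUM_VIRTUAL_CPUS:
                raise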

How should I proceed on this?

@gabrieldemarmiesse
Member

gabrieldemarmiesse commented Mar 11, 2020

Worst case scenario, if this issue concerns only a few tests, they can be run without multiprocessing, separately from the main test suite. Pytest-xdist also has options to control how each test is executed, so we can look into that too. But if we have many tests which won't be able to run in parallel, that's a problem.
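
One way to do the "run them separately" variant with plain pytest machinery is sketched below; the marker name and the two CI invocations are hypothetical, and pytest-xdist's own distribution modes are another route.

    # conftest.py -- register a marker for tests that must not share a
    # pytest worker with the rest of the suite.
    def pytest_configure(config):
        config.addinivalue_line(
            "markers", "serial: must run in a single-worker pytest invocation"
        )


    # In the test file, mark the affected tests.
    import pytest

    @pytest.mark.serial
    def test_run_distributed_with_virtual_devices():
        ...


    # The CI would then run two invocations, e.g.:
    #   pytest -n 3 -m "not serial" tensorflow_addons
    #   pytest -m "serial" tensorflow_addons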

@hyang0129
Contributor Author

Worst case scenario, if this issue concerns only a few tests, they can be run without multiprocessing, separately from the main test suite. Pytest-xdist also has options to control how each test is executed, so we can look into that too. But if we have many tests which won't be able to run in parallel, that's a problem.

Agreed.

Is there a test config file that I should update to indicate that the tests for testutils should run with a single worker?

@hyang0129 hyang0129 changed the title Testutils run distributed fix [WIP] Testutils run distributed fix Mar 11, 2020
@gabrieldemarmiesse
Member

@hyang0129 does it make sense to run those tests on CPU? I'm asking because currently, in the CI, GPU tests run with a single worker, but CPU tests run with multiple workers. So if we disable those tests on CPU, this would work, right?

@gabrieldemarmiesse gabrieldemarmiesse self-assigned this Mar 22, 2020
@hyang0129
Contributor Author

@hyang0129 does it make sense to run those tests on CPU? I'm asking because currently, in the CI, GPU tests run with a single worker, but CPU tests run with multiple workers. So if we disable those tests on CPU, this would work, right?

Yes, that makes sense. However, doesn't that imply that distributed tests will only run in GPU mode?

@gabrieldemarmiesse
Member

Yes, they would only run in GPU mode. Is that an issue?

@hyang0129
Contributor Author

Yes, they would only run in GPU mode. Is that an issue?

That should be fine. I am not aware of any real-world application of distributed CPU training, so if we only test for distributed GPU, that should cover the distributed use cases.
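
A sketch of how the CPU side could be disabled, assuming a plain skipif guard (the test name is made up; the actual wiring in test_utils may look different):

    import pytest
    import tensorflow as tf

    # Skip distributed tests entirely when no GPU is visible, so the
    # multi-worker CPU CI job never touches the virtual-device setup.
    @pytest.mark.skipif(
        not tf.config.experimental.list_physical_devices("GPU"),
        reason="distributed tests only run on the single-worker GPU job",
    )
    def test_run_distributed_on_gpu():
        ...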

@gabrieldemarmiesse
Member

@hyang0129, I'll let you make the necessary changes. We can review once the CI is passing.

@gabrieldemarmiesse
Member

Closing this pull request because it's been superseded by #1770
