
[WIP] Testutils run distributed fix #1209

Closed
wants to merge 28 commits

Conversation

hyang0129
Contributor

see #1171

@hyang0129 hyang0129 changed the title Testutils run distributed fix [WIP] Testutils run distributed fix Mar 2, 2020
@hyang0129
Contributor Author

@Squadrick please add Ubuntu GPU Python3 CI

@hyang0129
Contributor Author

@seanpmorgan I thought that the GPU test would stay, but it seems that it gets removed for some reason. I'll ping you again once the code is actually ready to test on GPU.

…laining what happens if you try to init devices again
added comments
@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@seanpmorgan
Member

@googlebot I consent.

@googlebot

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@seanpmorgan
Member

It seems that the wheel build process changed sometime in the last 6 days.

The build wheels process now runs tests differently than the bazel tests. It seems that the tf context is initialized before tests are run, or does not get reset between tests.

This is very odd because the bazel tests work without issue. This also seems to bypass the check to see if logical devices have been set up already.

Hmmm thanks for bringing this up. I'm thinking it's related to running pytest as the runner:
https://github.com/tensorflow/addons/blob/master/tools/testing/addons_cpu.sh#L41

But it's not clear to me why. I just merged the most recent CI update, which splits that out as a single test so it'll be easier to debug.

@gabrieldemarmiesse
Member

gabrieldemarmiesse commented Mar 11, 2020

So a bit of background to help here:

Pytest doesn't restart the python interpreter between test files. I believe that bazel does. When running the gpu tests in the CI, there is only one worker, to avoid out-of-memory errors. If that's an issue, we can always fall back to running the bazel command in kokoro; we're compatible with both bazel and pytest, so that should be fine.

If I understand your issue correctly, you count on the python interpreter being shut down to force the cleanup of the virtual devices, right? Is it possible to do the cleanup in python without shutting down the interpreter?

Update: I tested this pull request locally on CPU. It has nothing to do with shutting down the interpreter. This pull request works fine with pytest when using a single worker (pytest -v tensorflow_addons/utils), but it fails when using multiple processes (pytest -v -n 3 tensorflow_addons/utils). Since the tests run with multiple workers in the CI, that explains the failure.
This also explains why it works with bazel. Bazel can't execute multiple test functions from the same file concurrently, but pytest can. So bazel executing your file with a single worker avoids the issue.

If possible, we should aim to have tests that work even if multiple processes are running at the same time, at least when running CPU tests.
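
The device setup at stake looks roughly like the sketch below (assuming the TF 2.1-era tf.config.experimental API; the exact calls in test_utils may differ):

    import tensorflow as tf

    # Split the first physical CPU into two logical devices so a
    # tf.distribute strategy has more than one device to place replicas on.
    cpus = tf.config.experimental.list_physical_devices("CPU")
    tf.config.experimental.set_virtual_device_configuration(
        cpus[0],
        [
            tf.config.experimental.VirtualDeviceConfiguration(),
            tf.config.experimental.VirtualDeviceConfiguration(),
        ],
    )

    # The first op executed (or a query like the one below) initializes the
    # TensorFlow context with that layout.
    print(tf.config.experimental.list_logical_devices("CPU"))

    # From here on, calling set_virtual_device_configuration with a different
    # layout raises a RuntimeError; getting a clean layout requires a fresh
    # process, which bazel's one-process-per-file model provides and a
    # long-lived pytest worker does not.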

@hyang0129
Contributor Author

hyang0129 commented Mar 11, 2020

I would like that as well. Do you have any suggestions?

I believe the issue arises from the set-logical-device-configuration command, which cannot be altered as part of this PR (it is the only way to set up virtual devices for running tests in distributed mode).

My guess is that pytest in multi-worker mode is capable of creating independent test sessions in tensorflow, but is not able to interact with the physical and logical device configurations independently. There may be something external to the tf session level that determines device configurations.

Based on the logs and the fact that it works in single-worker mode, I believe that the context is shared between test sessions in multi-worker mode. For this to be true, there would have to be some mechanism that resets the context between tests in single-worker mode but does not fire in multi-worker mode.

If that is the case, then we really only have two options.

  1. Make the set-logical-device-configuration call work in a multi-worker environment.
  2. Always run tests that interact with device configurations in single-worker mode.

The first option is probably the more correct one, but it requires changes to TensorFlow itself. We might be stuck with option 2 for the near future.
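
For illustration, a guarded setup along these lines (a hypothetical helper, not the code in this PR) is about the best that can be done inside a single process; if the context was already initialized without virtual devices, as in the multi-worker pytest run, nothing can be done in-process, which is what makes option 2 attractive.

    import tensorflow as tf

    NUM_VIRTUAL_CPUS = 2  # illustrative constant

    def ensure_virtual_cpus():
        """Set up virtual CPU devices, tolerating an already-initialized context."""
        physical = tf.config.experimental.list_physical_devices("CPU")
        try:
            tf.config.experimental.set_virtual_device_configuration(
                physical[0],
                [tf.config.experimental.VirtualDeviceConfiguration()]
                * NUM_VIRTUAL_CPUS,
            )
        except RuntimeError:
            # The context was initialized before this call (for example by an
            # earlier test in the same pytest worker), so the layout can no
            # longer be changed. All that is left is to check whether it
            # already matches what the distributed tests need.
            if len(tf.config.experimental.list_logical_devices("CPU")) < NUM_VIRTUAL_CPUS:
                raise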

How should I proceed on this?

@gabrieldemarmiesse
Member

gabrieldemarmiesse commented Mar 11, 2020

Worst case scenario, if this issue concerns only a few tests, they can be run without multiprocessing, separately from the main test suite. Pytest-xdist also has options to control how each test is executed, so we can look into that too. But if we have many tests which won't be able to run in parallel, that's a problem.
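
One way to do the "run them separately" variant with plain pytest machinery is sketched below; the marker name and the two CI invocations are hypothetical, and pytest-xdist's own distribution modes are another route.

    # conftest.py -- register a marker for tests that must not share a
    # pytest worker with the rest of the suite.
    def pytest_configure(config):
        config.addinivalue_line(
            "markers", "serial: must run in a single-worker pytest invocation"
        )


    # In the test file, mark the affected tests.
    import pytest

    @pytest.mark.serial
    def test_run_distributed_with_virtual_devices():
        ...


    # The CI would then run two invocations, e.g.:
    #   pytest -n 3 -m "not serial" tensorflow_addons
    #   pytest -m "serial" tensorflow_addons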

@hyang0129
Contributor Author

Worst case scenario, if this issue concerns only a few tests, they can be run without multiprocessing, separately from the main test suite. Pytest-xdist also has options to control how each test is executed, so we can look into that too. But if we have many tests which won't be able to run in parallel, that's a problem.

Agreed.

Is there a test config file that I should update to indicate that the tests for testutils should run with a single worker?

@hyang0129 hyang0129 changed the title Testutils run distributed fix [WIP] Testutils run distributed fix Mar 11, 2020
@gabrieldemarmiesse
Member

@hyang0129 does it make sense to run those tests on CPU? I'm asking because currently, in the CI, GPU tests run with a single worker, but CPU tests run with multiple workers. So if we disable those tests on CPU, this would work, right?

@gabrieldemarmiesse gabrieldemarmiesse self-assigned this Mar 22, 2020
@hyang0129
Contributor Author

@hyang0129 does it make sense to run those tests on CPU? I'm asking because currently, in the CI, GPU tests run with a single worker, but CPU tests run with multiple workers. So if we disable those tests on CPU, this would work, right?

Yes, that makes sense. However, doesn't that imply that distributed tests will only run in GPU mode?

@gabrieldemarmiesse
Member

Yes, they would only run in GPU mode. Is that an issue?

@hyang0129
Contributor Author

Yes, they would only run in GPU mode. Is that an issue?

That should be fine. I am not aware of any real-world application of distributed CPU training, so if we only test for distributed GPU, that should cover the distributed use cases.
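
A sketch of how the CPU side could be disabled, assuming a plain skipif guard (the test name is made up; the actual wiring in test_utils may look different):

    import pytest
    import tensorflow as tf

    # Skip distributed tests entirely when no GPU is visible, so the
    # multi-worker CPU CI job never touches the virtual-device setup.
    @pytest.mark.skipif(
        not tf.config.experimental.list_physical_devices("GPU"),
        reason="distributed tests only run on the single-worker GPU job",
    )
    def test_run_distributed_on_gpu():
        ...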

@gabrieldemarmiesse
Member

@hyang0129, I'll let you make the necessary changes. We can review once the CI is passing.

@gabrieldemarmiesse
Member

Closing this pull request because it's been superseded by #1770
