[WIP] Testutils run distributed fix #1209
Conversation
@Squadrick please add Ubuntu GPU Python3 CI
@seanpmorgan I thought that the gpu test would stay, but it seems that it gets removed for some reason. I'll ping you again once the code is actually ready to test on GPU.
…explaining what happens if you try to init devices again
added comments
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
@googlebot I consent.
CLAs look good, thanks!
Hmmm, thanks for bringing this up. I'm thinking it's related to running pytest as the runner, but it's not clear to me why. I just merged the most recent CI change, which splits that out as a single test, so it'll be easier to debug.
So a bit of background to help here: pytest doesn't restart the Python interpreter between test files; I believe that bazel does. When running the GPU tests in the CI, there is only one worker to avoid out-of-memory errors. If that's an issue, we can always fall back to running the bazel command in kokoro; we're compatible with both bazel and pytest, so that should be fine. If I understand your issue correctly, you count on the Python interpreter being shut down to force the cleanup of the virtual devices, right? Is it possible to do the cleanup in Python without shutting down the interpreter?

Update: I tested this pull request locally on CPU. It has nothing to do with shutting down the interpreter: this pull request works fine with pytest when using a single worker. If possible, we should aim to have tests that work even if multiple processes are running at the same time, at least when running CPU tests.
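For readers following along, here is a minimal sketch (not the code from this PR) of how virtual devices are typically created for distributed tests with the public `tf.config` API, and why re-running the setup inside a long-lived pytest interpreter is problematic:

```python
import tensorflow as tf

def setup_virtual_cpus(num_devices=2):
    # Split the single physical CPU into `num_devices` logical (virtual)
    # devices so that MirroredStrategy has several devices to place replicas on.
    cpu = tf.config.list_physical_devices("CPU")[0]
    tf.config.set_logical_device_configuration(
        cpu, [tf.config.LogicalDeviceConfiguration()] * num_devices
    )

setup_virtual_cpus(2)
print(tf.config.list_logical_devices("CPU"))  # listing devices initializes the runtime

# Once the runtime is initialized, trying to change the logical device
# configuration raises a RuntimeError. With bazel each test file gets a fresh
# interpreter, so this never happens; with a long-lived pytest worker, a
# second, different setup call in the same process will fail.
```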
I would like that as well. Do you have any suggestions? I believe the issue arises from the set-logical-device-configuration call, which cannot be altered as part of this PR (it is the only way to set up virtual devices for running tests in distributed mode). My guess is that pytest in multi-worker mode is able to create independent test sessions in TensorFlow, but is not able to manage the physical and logical device configurations independently; there may be something external to the TF session level that determines the device configurations. Based on the logs and the fact that it works in single-worker mode, I believe that the context is shared between test sessions in multi-worker mode. For this to be true, there would have to be some mechanism that resets the context between tests in single-worker mode but does not fire in multi-worker mode. If that is the case, then we really only have two options.
The first option may be the most proper, but that requires changes to code in TensorFlow. We might be stuck with option 2 for the near future. How should I proceed on this?
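For context, one possible in-repo workaround (purely a sketch; the helper name and behaviour are illustrative, not taken from this PR) is to make the setup idempotent by checking the existing configuration before trying to change it:

```python
import tensorflow as tf

def ensure_virtual_cpus(num_devices=2):
    """Create `num_devices` virtual CPUs, or verify they already exist."""
    cpu = tf.config.list_physical_devices("CPU")[0]
    existing = tf.config.get_logical_device_configuration(cpu)
    if existing is not None:
        # The configuration was already set earlier in this interpreter;
        # changing it after runtime initialization would raise, so only verify it.
        if len(existing) < num_devices:
            raise RuntimeError(
                f"Need {num_devices} virtual CPUs but only {len(existing)} were configured."
            )
        return
    tf.config.set_logical_device_configuration(
        cpu, [tf.config.LogicalDeviceConfiguration()] * num_devices
    )
```

This guard only helps within a single worker process; whether it addresses the multi-worker failure seen in CI would need to be verified.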
Worst-case scenario, if this issue concerns only a few tests, they can be run without multiprocessing, separately from the main test suite. pytest-xdist also has options to control how each test is executed, so we can look into that too. But if we have many tests which won't be able to run in parallel, that's a problem.
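As an illustration of the separate-invocation route (the marker name below is hypothetical, not something already defined in the repo), tests that need exclusive control of the device configuration could be tagged and then deselected from the parallel run with `-m "not needs_device_isolation"`, leaving them for a single-worker pass:

```python
import pytest
import tensorflow as tf

@pytest.mark.needs_device_isolation  # hypothetical marker, registered in the pytest config
def test_runs_on_two_virtual_cpus():
    # Assumes the virtual CPUs were created by the test-utility setup under discussion.
    strategy = tf.distribute.MirroredStrategy(["CPU:0", "CPU:1"])
    assert strategy.num_replicas_in_sync == 2
```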
Agreed.
Is there a test config file that I should update to indicate that the tests for testutils should run with a single worker?
@hyang0129 does it make sense to run those tests on CPU? I'm asking because currently, in the CI, GPU tests run with a single worker, but CPU tests run with multiple workers. So if we disable those tests on CPU, this would work, right?
Yes, that makes sense. However, doesn't that imply that distributed tests will only run in GPU mode?
Yes, they would only run in GPU mode; is that an issue?
That should be fine. I am not aware of any real-world application of distributed CPU training, so if we only test for distributed GPU, that should cover the distributed use cases.
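If the tests end up restricted to the GPU job, a plain pytest skip guard is one way to express that per test. This is only a sketch, and the Addons test utilities may already offer an equivalent helper:

```python
import pytest
import tensorflow as tf

# Skip when no physical GPU is visible, e.g. on the multi-worker CPU CI job.
requires_gpu = pytest.mark.skipif(
    not tf.config.list_physical_devices("GPU"),
    reason="Distributed tests only run on the single-worker GPU job.",
)

@requires_gpu
def test_distributed_replica_sum():
    strategy = tf.distribute.MirroredStrategy()
    total = strategy.reduce(
        tf.distribute.ReduceOp.SUM, strategy.run(lambda: tf.constant(1.0)), axis=None
    )
    # Each replica contributes 1.0, so the sum equals the replica count.
    assert total.numpy() == strategy.num_replicas_in_sync
```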
@hyang0129, I'll let you make the necessary changes. We can review once the CI is passing.
Closing this pull request because it's been superseded by #1770
see #1171