
Enable multiprocessing when testing with GPU and support distributed strategies in the tests. #1682


Closed
gabrieldemarmiesse opened this issue Apr 16, 2020 · 0 comments · Fixed by #1770
Labels
discussion needed test-cases Related to Addons tests

Comments

@gabrieldemarmiesse
Member

gabrieldemarmiesse commented Apr 16, 2020

Describe the feature and the current behavior/state.

Here I'm not going to discuss the bazel case as it's much more complicated to handle, and we currently advertise using pytest anyway to run the tests. We can of course make sure everything stays compatible.

This revamping of gpu testing has multiple objectives:

  • The tests should behave the same whether or not the contributor has a GPU. Meaning we shouldn't run all the tests on a GPU just because one is available, otherwise it hurts reproducibility.
  • The test suite should be able to run with multiple workers in kokoro or when a user has multiple gpus. Pytest should use all gpus visible by the main process.
  • We need to support testing with distributed strategies. Currently it doesn't work. A fix has been started in [WIP] Testutils run distributed fix #1209 but we need to update it for pytest.
  • Making the whole thing simple to use and to maintain. Notably, we would get rid of this file: https://github.com/tensorflow/addons/blob/master/tools/testing/parallel_gpu_execute.sh which is quite hard to work on.

To do all that, here is my proposal:

Stuff to know:

Test workers

Suppose we have a machine with 10 CPUs and 4 GPUs: 10 processes will start to run the test suite. Workers 0 to 3 will each have ownership of one GPU (we can use CUDA_VISIBLE_DEVICES to enforce that, but I'm not even sure that's needed with the proposed implementation). Workers 4 to 9 will have no GPU available.
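A minimal sketch of that worker-to-GPU assignment (the helper name is hypothetical, and as noted above, setting CUDA_VISIBLE_DEVICES may not even be necessary in the final implementation):

```python
import os

def gpu_for_worker(worker_id: int, num_gpus: int) -> str:
    """Value for CUDA_VISIBLE_DEVICES for a given test worker.

    Workers 0..num_gpus-1 each own exactly one GPU; the remaining
    workers see no GPU at all.
    """
    return str(worker_id) if worker_id < num_gpus else ""

# 10 workers on a 4-GPU machine:
assignments = {w: gpu_for_worker(w, 4) for w in range(10)}
# workers 0-3 each see one GPU, workers 4-9 see none

# each worker process would then do, at startup:
# os.environ["CUDA_VISIBLE_DEVICES"] = gpu_for_worker(my_worker_id, num_gpus)
```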

Virtual devices

Each of those processes, when starting, will split its physical device into two virtual devices. Tests that just need to run on GPU will use the first of those virtual devices. Tests that exercise distributed strategies will use both of them. We assume here that two virtual devices are enough to test distributed strategies.
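The device names a worker ends up seeing could look like this (a pure-Python sketch; in TF 2.x the actual split would be done with tf.config.set_logical_device_configuration, and the helper name here is made up):

```python
def worker_devices(owns_gpu: bool, num_virtual: int = 2):
    """Logical device names a worker sees after startup.

    A GPU-owning worker splits its physical GPU into `num_virtual`
    logical devices; plain GPU tests use the first one, and
    distribution-strategy tests use both.
    """
    if not owns_gpu:
        return ["CPU:0"]
    return ["GPU:%d" % i for i in range(num_virtual)]

# worker_devices(True)  -> two virtual GPUs
# worker_devices(False) -> CPU only
```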

Impact on the contributors:

For this whole machinery to work, we need to know which tests need to run on CPU, GPU, or within distributed strategies. To do that we'll use pytest markers: @pytest.mark.....

  • By default, if no marker is found, the test will run on CPU: with device("CPU:0"). It's equivalent to
    @pytest.mark.run_on(["cpu"]).
  • To run with gpu only: @pytest.mark.run_on(["gpu"]).
  • To run on the cpu and gpu: @pytest.mark.run_on(["cpu", "gpu"]) (test runs twice)
  • To run within a distributed strategy: @pytest.mark.run_on(["distributed strategy"]). (runs once here).
  • To run with everything @pytest.mark.run_on(["cpu", "gpu", "distributed strategy"])
  • To make crazy stuff, and not run the test in any device scope: @pytest.mark.no_device_scope. Then contributors can do whatever they want in the test.

Of course, if no GPU is available, we just skip the tests needing a distribution strategy or the GPU. Contributors who handle devices manually have to make sure to skip the test themselves when no GPU is available.
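The proposed marker semantics, including the skip rule, can be sketched as a small helper (names and the exact encoding of the distributed case are assumptions, not the final API):

```python
def devices_for_test(run_on, gpu_available: bool):
    """Device scopes a test should run under, per the proposed
    run_on marker semantics; an empty list means the test is skipped.
    """
    if run_on is None:          # no marker: run on CPU by default
        run_on = ["cpu"]
    scopes = []
    for target in run_on:
        if target == "cpu":
            scopes.append("CPU:0")
        elif target == "gpu" and gpu_available:
            scopes.append("GPU:0")          # first virtual device
        elif target == "distributed strategy" and gpu_available:
            scopes.append("GPU:0+GPU:1")    # both virtual devices
    return scopes

# With no GPU available, a gpu-only test gets no scope and is skipped,
# while a ["cpu", "gpu"] test still runs once, on CPU.
```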

Since GPUs are often the scarcest resource (nb gpus << nb cpus), tests needing the GPU will also be marked with @pytest.mark.tryfirst to ensure that we don't have worker starvation at the end (to get maximum speed).
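The scheduling idea amounts to a stable reordering of the collected tests so GPU tests run first (a sketch only; plain dicts stand in for pytest's collected test items):

```python
def schedule_gpu_first(tests):
    """Stable sort putting GPU-marked tests ahead of CPU-only ones,
    so GPU-owning workers stay busy from the start of the run."""
    return sorted(tests, key=lambda t: 0 if "gpu" in t["run_on"] else 1)

suite = [
    {"name": "test_a", "run_on": ["cpu"]},
    {"name": "test_b", "run_on": ["cpu", "gpu"]},
    {"name": "test_c", "run_on": ["gpu"]},
]
# schedule_gpu_first(suite) puts test_b and test_c before test_a
```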

To implement that, we first need to convert all tests to pytest (as opposed to unittest). It's currently 80% done; thanks a lot @autoih for putting a LOT of work into that.

Relevant information

  • Are you willing to contribute it (yes/no): yes
  • Are you willing to maintain it going forward? (yes/no): yes
  • Is there a relevant academic paper? (if so, where): no
  • Is there already an implementation in another framework? (if so, where): no
  • Was it part of tf.contrib? (if so, where): no

Which API type would this fall under (layer, metric, optimizer, etc.)

Testing

Who will benefit with this feature?

Contributors with gpu, CI.

Any other info.

I believe that the implementation will first go into TensorFlow Addons because we have 4 GPUs available in the CI. Later on, when it's stable, we can split it out of TensorFlow Addons and make it a separate pytest plugin with a public API.

Comments welcome. Especially from @Squadrick , @hyang0129 , @seanpmorgan since I'm not a ninja of tf.device.

@gabrieldemarmiesse gabrieldemarmiesse changed the title [WIP] Enable multiprocessing when testing with GPU and support distributed strategies in the tests. Enable multiprocessing when testing with GPU and support distributed strategies in the tests. Apr 16, 2020