
Enable multiprocessing when testing with GPU and support distributed strategies in the tests. #1682


Closed
gabrieldemarmiesse opened this issue Apr 16, 2020 · 0 comments · Fixed by #1770
Labels
discussion needed test-cases Related to Addons tests

Comments

@gabrieldemarmiesse
Member

gabrieldemarmiesse commented Apr 16, 2020

Describe the feature and the current behavior/state.

Here I'm not going to discuss the bazel case as it's much more complicated to handle, and we currently advertise using pytest anyway to run the tests. We can of course make sure everything stays compatible.

This revamping of gpu testing has multiple objectives:

  • The tests should behave the same whether or not the contributor has a GPU. Meaning we shouldn't run all the tests on a GPU just because one is available, otherwise it hurts reproducibility.
  • The test suite should be able to run with multiple workers in kokoro or when a user has multiple gpus. Pytest should use all gpus visible by the main process.
  • We need to support testing with distributed strategies. Currently it doesn't work. A fix has been started in [WIP] Testutils run distributed fix #1209 but we need to update it for pytest.
  • Making the whole thing simple to use and to maintain. Notably, we would get rid of this file: https://github.com/tensorflow/addons/blob/master/tools/testing/parallel_gpu_execute.sh which is quite hard to work on.

To do all that, here is my proposal:

Stuff to know:

Test workers

Suppose we have a machine with 10 CPUs and 4 GPUs: 10 processes will start to run the test suite. Workers 0 to 3 will each have ownership of one GPU (we can use CUDA_VISIBLE_DEVICES to enforce that, but I'm not even sure that's needed with the proposed implementation). Workers 4 to 9 will have no GPU available.
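A minimal sketch of that worker-to-GPU assignment (the helper name is hypothetical, and as noted above, setting CUDA_VISIBLE_DEVICES may not even be necessary in the final implementation):

```python
import os

def gpu_for_worker(worker_id: int, num_gpus: int) -> str:
    """Value for CUDA_VISIBLE_DEVICES for a given test worker.

    Workers 0..num_gpus-1 each own exactly one GPU; the remaining
    workers see no GPU at all.
    """
    return str(worker_id) if worker_id < num_gpus else ""

# 10 workers on a 4-GPU machine:
assignments = {w: gpu_for_worker(w, 4) for w in range(10)}
# workers 0-3 each see one GPU, workers 4-9 see none

# each worker process would then do, at startup:
# os.environ["CUDA_VISIBLE_DEVICES"] = gpu_for_worker(my_worker_id, num_gpus)
```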

Virtual devices

Each of those processes, when starting, will split its physical device into two virtual devices. Tests that just need to run on GPU will use the first of those virtual devices. Tests that exercise distributed strategies will use both of them. We assume here that two virtual devices are enough to test distributed strategies.
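The device names a worker ends up seeing could look like this (a pure-Python sketch; in TF 2.x the actual split would be done with tf.config.set_logical_device_configuration, and the helper name here is made up):

```python
def worker_devices(owns_gpu: bool, num_virtual: int = 2):
    """Logical device names a worker sees after startup.

    A GPU-owning worker splits its physical GPU into `num_virtual`
    logical devices; plain GPU tests use the first one, and
    distribution-strategy tests use both.
    """
    if not owns_gpu:
        return ["CPU:0"]
    return ["GPU:%d" % i for i in range(num_virtual)]

# worker_devices(True)  -> two virtual GPUs
# worker_devices(False) -> CPU only
```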

Impact on the contributors:

For this whole machinery to work, we need to know which tests need to run on CPU, GPU, or within distributed strategies. To do that we'll use pytest markers: @pytest.mark.....

  • By default, if no marker is found, the test will run on CPU: with device("CPU:0"). It's equivalent to
    @pytest.mark.run_on(["cpu"]).
  • To run with gpu only: @pytest.mark.run_on(["gpu"]).
  • To run on the cpu and gpu: @pytest.mark.run_on(["cpu", "gpu"]) (test runs twice)
  • To run within a distributed strategy: @pytest.mark.run_on(["distributed strategy"]). (runs once here).
  • To run with everything @pytest.mark.run_on(["cpu", "gpu", "distributed strategy"])
  • To make crazy stuff, and not run the test in any device scope: @pytest.mark.no_device_scope. Then contributors can do whatever they want in the test.

Of course, if no GPU is available, we just skip the tests needing a distribution strategy or the GPU. Contributors who handle devices manually have to make sure to skip the test themselves when no GPU is available.
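The proposed marker semantics, including the skip rule, can be sketched as a small helper (names and the exact encoding of the distributed case are assumptions, not the final API):

```python
def devices_for_test(run_on, gpu_available: bool):
    """Device scopes a test should run under, per the proposed
    run_on marker semantics; an empty list means the test is skipped.
    """
    if run_on is None:          # no marker: run on CPU by default
        run_on = ["cpu"]
    scopes = []
    for target in run_on:
        if target == "cpu":
            scopes.append("CPU:0")
        elif target == "gpu" and gpu_available:
            scopes.append("GPU:0")          # first virtual device
        elif target == "distributed strategy" and gpu_available:
            scopes.append("GPU:0+GPU:1")    # both virtual devices
    return scopes

# With no GPU available, a gpu-only test gets no scope and is skipped,
# while a ["cpu", "gpu"] test still runs once, on CPU.
```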

Since GPUs are often the scarcest resource (nb gpus << nb cpus), tests needing the GPU will also be marked with @pytest.mark.tryfirst to ensure that we don't have worker starvation at the end (to get maximum speed).
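The scheduling idea amounts to a stable reordering of the collected tests so GPU tests run first (a sketch only; plain dicts stand in for pytest's collected test items):

```python
def schedule_gpu_first(tests):
    """Stable sort putting GPU-marked tests ahead of CPU-only ones,
    so GPU-owning workers stay busy from the start of the run."""
    return sorted(tests, key=lambda t: 0 if "gpu" in t["run_on"] else 1)

suite = [
    {"name": "test_a", "run_on": ["cpu"]},
    {"name": "test_b", "run_on": ["cpu", "gpu"]},
    {"name": "test_c", "run_on": ["gpu"]},
]
# schedule_gpu_first(suite) puts test_b and test_c before test_a
```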

To implement that, we first need to convert all tests to pytest (as opposed to unittest). It's currently 80% done; thanks a lot @autoih for putting a LOT of work into that.

Relevant information

  • Are you willing to contribute it (yes/no): yes
  • Are you willing to maintain it going forward? (yes/no): yes
  • Is there a relevant academic paper? (if so, where): no
  • Is there already an implementation in another framework? (if so, where): no
  • Was it part of tf.contrib? (if so, where): no

Which API type would this fall under (layer, metric, optimizer, etc.)

Testing

Who will benefit with this feature?

Contributors with gpu, CI.

Any other info.

I believe that the implementation will first go into TensorFlow Addons because we have 4 GPUs available in the CI. Later on, when it's stable, we can split it out of TensorFlow Addons and make it a separate pytest plugin with a public API.

Comments welcome. Especially from @Squadrick , @hyang0129 , @seanpmorgan since I'm not a ninja of tf.device.

@gabrieldemarmiesse gabrieldemarmiesse changed the title [WIP] Enable multiprocessing when testing with GPU and support distributed strategies in the tests. Enable multiprocessing when testing with GPU and support distributed strategies in the tests. Apr 16, 2020