Describe the feature and the current behavior/state.
Here I'm not going to discuss the Bazel case as it's much more complicated to handle, and we currently advertise using pytest anyway to run the tests. We can of course make sure everything stays compatible though.
This revamping of GPU testing has multiple objectives:
The tests should behave the same whether the contributor has a GPU or not. Meaning we shouldn't run all the tests on a GPU just because one is available, otherwise it hurts reproducibility.
The test suite should be able to run with multiple workers in Kokoro or when a user has multiple GPUs. Pytest should use all GPUs visible to the main process.
We need to support testing with distributed strategies. Currently it doesn't work. A fix was started in [WIP] Testutils run distributed fix #1209, but we need to update it for pytest.
To do all that, here is my proposal:
Stuff to know:
Test workers
Suppose we have a machine with 10 CPUs and 4 GPUs: 10 processes will start to run the test suite. Workers 0 to 3 will each have ownership of one GPU (we can use CUDA_VISIBLE_DEVICES to enforce that, but I'm not even sure that's needed with the proposed implementation). Workers 4 to 9 will have no GPU available.
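As a rough illustration, here is a minimal sketch of how each worker could claim its GPU, assuming pytest-xdist (which exposes the worker id through the `PYTEST_XDIST_WORKER` environment variable). The hard-coded GPU count is an assumption for the sketch; a real implementation would detect it.

```python
# conftest.py -- hedged sketch, not the final implementation.
import os

def pytest_configure(config):
    # pytest-xdist sets PYTEST_XDIST_WORKER to "gw0", "gw1", ... in each worker.
    worker = os.environ.get("PYTEST_XDIST_WORKER", "gw0")
    worker_id = int(worker.lstrip("gw"))
    num_gpus = 4  # assumption: 4 physical GPUs, as in the example above

    if worker_id < num_gpus:
        # Workers 0..3 each see exactly one distinct GPU.
        os.environ["CUDA_VISIBLE_DEVICES"] = str(worker_id)
    else:
        # Remaining workers see no GPU at all.
        os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
```

This has to run before TensorFlow initializes CUDA in the worker process, which is why it lives in a `pytest_configure` hook rather than in a fixture.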
Virtual devices
Each of those processes, when starting, will split its physical device into two virtual devices. Tests that just need to run on a GPU will use the first of those virtual devices. Tests that need to exercise distributed strategies will use both of them. We assume here that two virtual devices are enough to test distributed strategies.
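For reference, a minimal sketch of that splitting step using TensorFlow's virtual device API (the 1 GB memory limit is an arbitrary placeholder, not a recommendation):

```python
import tensorflow as tf

def split_gpu_into_two_virtual_devices():
    gpus = tf.config.list_physical_devices("GPU")
    if not gpus:
        return  # CPU-only worker: nothing to split

    # Must be called before the GPU is initialized by any op.
    # Splits the single visible GPU into two logical (virtual) devices.
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [
            tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
            tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
        ],
    )
```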
Impact on the contributors:
For this whole machinery to work, we need to know which tests need to run on CPU, on GPU, or in a distributed strategy. To do that we'll use pytest markers (a usage sketch follows the list):
By default, if no marker is found, the test will run on CPU, inside `with device("CPU:0")`. It's equivalent to `@pytest.mark.run_on(["cpu"])`.
To run on GPU only: `@pytest.mark.run_on(["gpu"])`.
To run on both CPU and GPU: `@pytest.mark.run_on(["cpu", "gpu"])` (the test runs twice).
To run within a distributed strategy: `@pytest.mark.run_on(["distributed strategy"])` (runs once here).
To run with everything: `@pytest.mark.run_on(["cpu", "gpu", "distributed strategy"])`.
To do crazy stuff and not run the test in any device scope: `@pytest.mark.no_device_scope`. Then the contributor can do whatever they want in the test.
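Here is a hypothetical usage sketch of the proposed marker. The `run_on` marker doesn't exist yet; this is only what contributor-facing test code could look like under this proposal:

```python
import pytest
import tensorflow as tf

@pytest.mark.run_on(["cpu", "gpu"])
def test_sum():
    # Would run twice: once inside a CPU device scope, once inside a GPU one,
    # with the scope entered by the test machinery, not by the test itself.
    x = tf.constant([1.0, 2.0])
    assert float(tf.reduce_sum(x)) == 3.0

@pytest.mark.run_on(["distributed strategy"])
def test_sum_under_strategy():
    # Would run once, inside a strategy spanning the two virtual devices.
    x = tf.constant([1.0, 2.0])
    assert float(tf.reduce_sum(x)) == 3.0
```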
Of course, if no GPU is available, we just skip the tests needing a distributed strategy or the GPU. Contributors who handle devices manually have to make sure to skip the test themselves when it needs a GPU.
Since GPUs are often the scarcest resource (nb GPUs << nb CPUs), tests needing the GPU will also be marked with `@pytest.mark.tryfirst` to ensure that we don't have worker starvation at the end (to get maximum speed).
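A minimal sketch of the skipping side of this machinery, assuming the `run_on` marker from above (the parametrization and device-scope parts are left out, so this only covers tests that cannot run at all without a GPU):

```python
# conftest.py -- hedged sketch of the skip logic only.
import pytest
import tensorflow as tf

GPU_AVAILABLE = bool(tf.config.list_physical_devices("GPU"))

def pytest_collection_modifyitems(config, items):
    skip_no_gpu = pytest.mark.skip(reason="no GPU visible to this worker")
    for item in items:
        marker = item.get_closest_marker("run_on")
        if marker is None:
            continue  # no marker: runs inside the default CPU:0 scope
        requested = marker.args[0]
        # In the real implementation each entry of `requested` would become a
        # separate parametrized case, and only the GPU variants would be
        # skipped; here we skip only tests with no CPU variant at all.
        gpu_only = all(d in ("gpu", "distributed strategy") for d in requested)
        if gpu_only and not GPU_AVAILABLE:
            item.add_marker(skip_no_gpu)
```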
To implement all that, we first need to convert all tests to pytest (as opposed to unittest); it's currently 80% done. Thanks a lot, @autoih, for putting a LOT of work into that.
Relevant information
Are you willing to contribute it (yes/no): yes
Are you willing to maintain it going forward? (yes/no): yes
Is there a relevant academic paper? (if so, where): no
Is there already an implementation in another framework? (if so, where): no
Was it part of tf.contrib? (if so, where): no
Which API type would this fall under (layer, metric, optimizer, etc.)
Testing
Who will benefit from this feature?
Contributors with GPUs, and CI.
Any other info.
I believe the implementation will first go into TensorFlow Addons, because we have 4 GPUs available in the CI. Later on, when it's stable, we can split it out of TensorFlow Addons and make it a separate pytest plugin with a public API.
Comments welcome, especially from @Squadrick, @hyang0129, @seanpmorgan, since I'm not a ninja of tf.device.