Describe the bug
Training crashes immediately when running against the Unity Editor, or when running against a built environment while the Unity Editor (or Unity Hub) is open. The problem started after updating to ML-Agents 0.13.0.
According to this Stack Overflow thread, the problem can occur if other processes are using the GPU:
"Make sure you have no other processes using the GPU running. Run nvidia-smi to check this."
I ran nvidia-smi, and the Unity Editor and Unity Hub showed up as processes (among others). I started closing processes one by one; after closing Unity and Unity Hub, training worked as expected with a built environment. If I start the Unity Editor again, training fails; if I close it again, it works.
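For anyone who wants to check this quickly without eyeballing the nvidia-smi table, here is a minimal Python sketch (assuming nvidia-smi is on the PATH; the helper name is mine) that prints overall GPU memory usage:

```python
import subprocess

def gpu_memory_report() -> str:
    """Return total/used/free memory per GPU as CSV, via nvidia-smi."""
    return subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.total,memory.used,memory.free",
            "--format=csv",
        ],
        universal_newlines=True,  # decode bytes to str; works on Python 3.6
    )

if __name__ == "__main__":
    print(gpu_memory_report())
```

Running this with the Editor open vs. closed makes the difference in used memory easy to compare.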
Training always works if I pass --cpu when starting training.
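Not a fix, but for context: the usual TensorFlow 1.x suggestion for CUBLAS_STATUS_ALLOC_FAILED, when another process already holds GPU memory, is to stop TensorFlow from reserving all memory up front. This is only a hedged sketch of those session options; I have not checked whether ml-agents 0.13.0 exposes them anywhere:

```python
import tensorflow as tf  # TF 1.15

config = tf.ConfigProto()
# Allocate GPU memory on demand instead of reserving it all at session start.
config.gpu_options.allow_growth = True
# Alternatively, cap the process to a fraction of total GPU memory:
# config.gpu_options.per_process_gpu_memory_fraction = 0.5

sess = tf.Session(config=config)
```

Closing the Editor (or passing --cpu) remains the only thing I have actually verified to work.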
Console logs / stack traces
2020-01-10 10:40:45.137803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6283 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-01-10 10:40:45.322073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2070 SUPER major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:01:00.0
2020-01-10 10:40:45.330891: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-01-10 10:40:45.336787: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-01-10 10:40:45.342693: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2020-01-10 10:40:45.348401: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2020-01-10 10:40:45.354261: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2020-01-10 10:40:45.360015: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2020-01-10 10:40:45.365847: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-01-10 10:40:45.372212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-01-10 10:40:45.376940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-10 10:40:45.382622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-01-10 10:40:45.386354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-01-10 10:40:45.390669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6283 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-01-10 10:40:47.524845: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-01-10 10:40:47.818921: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.824753: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.830756: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.836544: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.842420: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.848479: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.854929: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.860460: W tensorflow/stream_executor/stream.cc:2041] attempting to perform BLAS operation using StreamExecutor without BLAS support
2020-01-10 10:40:47.860567: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.872424: W tensorflow/stream_executor/stream.cc:2041] attempting to perform BLAS operation using StreamExecutor without BLAS support
INFO:mlagents_envs:Environment shut down with return code 0 (CTRL_C_EVENT).
Traceback (most recent call last):
File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
return fn(*args)
File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
target_list, run_metadata)
File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(12, 8), b.shape=(8, 128), m=12, n=128, k=8
[[{{node main_graph_1/hidden_0/MatMul}}]]
[[action_probs/_15]]
(1) Internal: Blas GEMM launch failed : a.shape=(12, 8), b.shape=(8, 128), m=12, n=128, k=8
[[{{node main_graph_1/hidden_0/MatMul}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\Projects\Unity\venv\Scripts\mlagents-learn-script.py", line 11, in <module>
load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')()
File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\learn.py", line 478, in main
run_training(0, run_seed, options, Queue())
File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\learn.py", line 316, in run_training
tc.start_learning(env_manager)
File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\trainer_controller.py", line 234, in start_learning
n_steps = self.advance(env_manager)
File "d:\projects\unity\ml-agents-fork\ml-agents-envs\mlagents_envs\timers.py", line 262, in wrapped
return func(*args, **kwargs)
File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\trainer_controller.py", line 295, in advance
new_step_infos = env.step()
File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 222, in step
self._queue_steps()
File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 215, in _queue_steps
env_action_info = self._take_step(env_worker.previous_step)
File "d:\projects\unity\ml-agents-fork\ml-agents-envs\mlagents_envs\timers.py", line 262, in wrapped
return func(*args, **kwargs)
File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 310, in _take_step
brain_info
File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\tf_policy.py", line 143, in get_action
run_out = self.evaluate(brain_info) # pylint: disable=assignment-from-no-return
File "d:\projects\unity\ml-agents-fork\ml-agents-envs\mlagents_envs\timers.py", line 262, in wrapped
return func(*args, **kwargs)
File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\ppo\policy.py", line 163, in evaluate
run_out = self._execute_model(feed_dict, self.inference_dict)
File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\tf_policy.py", line 165, in _execute_model
network_out = self.sess.run(list(out_dict.values()), feed_dict=feed_dict)
File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
run_metadata_ptr)
File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
run_metadata)
File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(12, 8), b.shape=(8, 128), m=12, n=128, k=8
[[node main_graph_1/hidden_0/MatMul (defined at d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
[[action_probs/_15]]
(1) Internal: Blas GEMM launch failed : a.shape=(12, 8), b.shape=(8, 128), m=12, n=128, k=8
[[node main_graph_1/hidden_0/MatMul (defined at d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
Environment (please complete the following information):
- OS + version: Windows 10
- Python version: 3.6.7
- ML-Agents version: v0.13.0
- TensorFlow version: 1.15.0
- CUDA version: 10.0
- cuDNN version: 7.6
- Environment: 3DBall