Training using GPU crashes when Unity editor is running #3199

@dlindmark

Describe the bug
Training immediately crashes when running against the Unity editor, or against a built environment while the Unity editor (or Hub) is running. The problem started after updating to ML-Agents 0.13.0.

According to this Stack Overflow thread, the problem can occur if other processes are using the GPU:

Make sure you have no other processes using the GPU running. Run nvidia-smi to check this.

I ran nvidia-smi, and the Unity editor and Unity Hub showed up as processes (among others). I started closing processes one by one. After closing Unity and Unity Hub, training worked as expected against a built environment. If I start the Unity editor again, training fails; if I close it again, training works.
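
For reference, that check can also be scripted. This is just a rough sketch (it assumes nvidia-smi is on PATH) that reports how much GPU memory is already taken before launching mlagents-learn, so the memory held by the editor and Hub is visible up front:

```python
# Rough pre-flight check (assumes nvidia-smi is on PATH): print how much GPU
# memory is already in use before starting training, so memory held by the
# Unity editor / Hub shows up immediately.
import subprocess

csv = subprocess.check_output(
    ["nvidia-smi",
     "--query-gpu=name,memory.total,memory.used,memory.free",
     "--format=csv,noheader"],
    universal_newlines=True,
)

for line in csv.strip().splitlines():
    name, total, used, free = (field.strip() for field in line.split(","))
    print("{}: {} used / {} total ({} free)".format(name, used, total, free))
```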

The training always works if I use --cpu when starting training.
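
The symptoms point at GPU memory contention: TensorFlow 1.x reserves most of the card for its first session by default, so when the editor already holds a chunk of memory, cuBLAS handle creation can fail with CUBLAS_STATUS_ALLOC_FAILED. Below is a sketch of a possible mitigation, not an option that ML-Agents 0.13 exposes; it assumes the TensorFlow build honors on-demand GPU memory growth, and the explicit session part is only illustrative since mlagents-learn builds its own session internally:

```python
# Sketch of a possible mitigation (my assumption, not an ML-Agents option):
# let TensorFlow allocate GPU memory on demand instead of reserving almost
# the whole card, so it can coexist with the Unity editor.
import os

# Honored by TF >= 1.14 if set before TensorFlow initializes the GPU.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow as tf

# Equivalent explicit session config, shown on a standalone session; the
# trainer creates its session internally, so this part is illustrative only.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grow allocations as needed

with tf.Session(config=config) as sess:
    # Same GEMM shape that fails in the log below: (12, 8) x (8, 128).
    a = tf.random.normal([12, 8])
    b = tf.random.normal([8, 128])
    print(sess.run(tf.matmul(a, b)).shape)  # expected: (12, 128)
```

Setting the environment variable in the shell before running mlagents-learn would be the less invasive route, since it does not require touching the trainer code.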

Console logs / stack traces

2020-01-10 10:40:45.137803: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6283 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-01-10 10:40:45.322073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2070 SUPER major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:01:00.0
2020-01-10 10:40:45.330891: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
2020-01-10 10:40:45.336787: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-01-10 10:40:45.342693: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cufft64_100.dll
2020-01-10 10:40:45.348401: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library curand64_100.dll
2020-01-10 10:40:45.354261: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusolver64_100.dll
2020-01-10 10:40:45.360015: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cusparse64_100.dll
2020-01-10 10:40:45.365847: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-01-10 10:40:45.372212: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-01-10 10:40:45.376940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-10 10:40:45.382622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2020-01-10 10:40:45.386354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2020-01-10 10:40:45.390669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6283 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070 SUPER, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-01-10 10:40:47.524845: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2020-01-10 10:40:47.818921: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.824753: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.830756: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.836544: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.842420: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.848479: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.854929: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.860460: W tensorflow/stream_executor/stream.cc:2041] attempting to perform BLAS operation using StreamExecutor without BLAS support
2020-01-10 10:40:47.860567: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-01-10 10:40:47.872424: W tensorflow/stream_executor/stream.cc:2041] attempting to perform BLAS operation using StreamExecutor without BLAS support
INFO:mlagents_envs:Environment shut down with return code 0 (CTRL_C_EVENT).
Traceback (most recent call last):
  File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
    return fn(*args)
  File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(12, 8), b.shape=(8, 128), m=12, n=128, k=8
         [[{{node main_graph_1/hidden_0/MatMul}}]]
         [[action_probs/_15]]
  (1) Internal: Blas GEMM launch failed : a.shape=(12, 8), b.shape=(8, 128), m=12, n=128, k=8
         [[{{node main_graph_1/hidden_0/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Projects\Unity\venv\Scripts\mlagents-learn-script.py", line 11, in <module>
    load_entry_point('mlagents', 'console_scripts', 'mlagents-learn')()
  File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\learn.py", line 478, in main
    run_training(0, run_seed, options, Queue())
  File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\learn.py", line 316, in run_training
    tc.start_learning(env_manager)
  File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\trainer_controller.py", line 234, in start_learning
    n_steps = self.advance(env_manager)
  File "d:\projects\unity\ml-agents-fork\ml-agents-envs\mlagents_envs\timers.py", line 262, in wrapped
    return func(*args, **kwargs)
  File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\trainer_controller.py", line 295, in advance
    new_step_infos = env.step()
  File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 222, in step
    self._queue_steps()
  File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 215, in _queue_steps
    env_action_info = self._take_step(env_worker.previous_step)
  File "d:\projects\unity\ml-agents-fork\ml-agents-envs\mlagents_envs\timers.py", line 262, in wrapped
    return func(*args, **kwargs)
  File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\subprocess_env_manager.py", line 310, in _take_step
    brain_info
  File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\tf_policy.py", line 143, in get_action
    run_out = self.evaluate(brain_info)  # pylint: disable=assignment-from-no-return
  File "d:\projects\unity\ml-agents-fork\ml-agents-envs\mlagents_envs\timers.py", line 262, in wrapped
    return func(*args, **kwargs)
  File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\ppo\policy.py", line 163, in evaluate
    run_out = self._execute_model(feed_dict, self.inference_dict)
  File "d:\projects\unity\ml-agents-fork\ml-agents\mlagents\trainers\tf_policy.py", line 165, in _execute_model
    network_out = self.sess.run(list(out_dict.values()), feed_dict=feed_dict)
  File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
    run_metadata_ptr)
  File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
    run_metadata)
  File "d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(12, 8), b.shape=(8, 128), m=12, n=128, k=8
         [[node main_graph_1/hidden_0/MatMul (defined at d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
         [[action_probs/_15]]
  (1) Internal: Blas GEMM launch failed : a.shape=(12, 8), b.shape=(8, 128), m=12, n=128, k=8
         [[node main_graph_1/hidden_0/MatMul (defined at d:\projects\unity\venv\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Environment (please complete the following information):

  • OS + version: Windows 10
  • Python version: 3.6.7
  • ML-Agents version: v0.13.0
  • TensorFlow version: 1.15.0
  • CUDA version: 10.0
  • cuDNN version: 7.6
  • Environment: 3DBall

Labels

bug: Issue describes a potential bug in ml-agents.
