
Commit 5ebc7c9

pritamdamania authored and facebook-github-bot committed
Avoid holding GIL while calling retrieveContext. (pytorch#61588)
Summary: Pull Request resolved: pytorch#61588

As part of debugging pytorch#60290, we discovered the following deadlock:

```
Thread 79 (Thread 0x7f52ff7fe700 (LWP 205437)):
#0  pthread_cond_timedwait@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x0000564880199152 in PyCOND_TIMEDWAIT (cond=0x564880346080 <gil_cond>, mut=0x564880346100 <gil_mutex>, us=5000) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/condvar.h:103
#2  take_gil (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval_gil.h:224
#3  0x0000564880217b62 in PyEval_AcquireThread (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval.c:278
#4  0x00007f557d54aabd in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#5  0x00007f557da7792f in (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, _object*) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6  0x00007f5560dadba6 in c10::TensorImpl::release_resources() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so
#7  0x00007f5574c885bc in std::_Sp_counted_ptr_inplace<torch::distributed::autograd::DistAutogradContext, std::allocator<torch::distributed::autograd::DistAutogradContext>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007f5574c815e9 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false>*) [clone .isra.325] () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007f5574c81bf1 in torch::distributed::autograd::DistAutogradContainer::eraseContextIdAndReset(torch::distributed::autograd::DistAutogradContainer::ContextsShard&, long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007f5574c86e83 in torch::distributed::autograd::DistAutogradContainer::releaseContextIfPresent(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007f5574cc6395 in torch::distributed::rpc::RequestCallbackNoPython::processCleanupAutogradContextReq(torch::distributed::rpc::RpcCommandBase&) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007f5574cccf15 in torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so

Thread 72 (Thread 0x7f53077fe700 (LWP 205412)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f55bc62adbd in __GI___pthread_mutex_lock (mutex=0x564884396440) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f5574c82a2f in torch::distributed::autograd::DistAutogradContainer::retrieveContext(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007f557de9bb2f in pybind11::cpp_function::initialize<torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}, pybind11::dict, long, pybind11::name, pybind11::scope, pybind11::sibling, char [931], pybind11::arg>(torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}&&, pybind11::dict (*)(long), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [931], pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
```

In short, Thread 72 holds the GIL and is trying to acquire the DistAutogradContainer lock to perform a map lookup. Meanwhile, Thread 79 holds the DistAutogradContainer lock while removing a Tensor; as part of the TensorImpl destructor, `concrete_decref_fn` is called, which blocks waiting for the GIL. Each thread holds the lock the other needs, so we have a deadlock.

To fix this issue, I've ensured we release the GIL when calling `retrieveContext` and reacquire it later only where needed.

ghstack-source-id: 133493659

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D29682624

fbshipit-source-id: f68a1fb39040ca0447a26e456a97bce64af6b79c
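To make the lock-order inversion in the backtrace concrete, here is a minimal C++ sketch of the two call paths, with a plain `std::mutex` standing in for the GIL. All names and bodies here are illustrative stand-ins, not the actual PyTorch code:

```cpp
// Minimal sketch of the lock-order inversion shown in the backtrace.
// `gil` stands in for the Python GIL; `container_lock` stands in for the
// DistAutogradContainer shard lock.
#include <mutex>
#include <thread>

std::mutex gil;
std::mutex container_lock;

// Thread 72's path: the pybind11 entry point holds the GIL, then
// retrieveContext tries to take the container lock.
void lookup_path() {
  std::lock_guard<std::mutex> g(gil);
  std::lock_guard<std::mutex> c(container_lock);
  // ... map lookup ...
}

// Thread 79's path: eraseContextIdAndReset holds the container lock, then
// TensorImpl::release_resources -> concrete_decref_fn waits for the GIL.
void cleanup_path() {
  std::lock_guard<std::mutex> c(container_lock);
  std::lock_guard<std::mutex> g(gil);
  // ... decref the Python tensor ...
}

int main() {
  // Opposite acquisition orders: running these concurrently can hang,
  // which is exactly the deadlock described above.
  std::thread t1(lookup_path);
  std::thread t2(cleanup_path);
  t1.join();
  t2.join();
}
```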
1 parent f2adbff · commit 5ebc7c9

File tree

1 file changed (+9, -3 lines)


torch/csrc/distributed/autograd/init.cpp

Lines changed: 9 additions & 3 deletions
```diff
@@ -96,7 +96,8 @@ PyObject* dist_autograd_init(PyObject* _unused, PyObject* noargs) {
       [](int64_t context_id) -> const ContextPtr {
         return DistAutogradContainer::getInstance().retrieveContext(context_id);
       },
-      py::return_value_policy::reference);
+      py::return_value_policy::reference,
+      py::call_guard<py::gil_scoped_release>());
 
   module.def(
       "_current_context",
@@ -168,7 +169,11 @@ Example::
       [](int64_t contextId) -> py::dict {
         const auto& autogradContext =
             DistAutogradContainer::getInstance().retrieveContext(contextId);
-        return torch::jit::toPyObject(IValue(autogradContext->getGradients()));
+        auto ival = IValue(autogradContext->getGradients());
+
+        // Acquire GIL only for pyobject conversion.
+        pybind11::gil_scoped_acquire ag;
+        return torch::jit::toPyObject(ival);
       },
       R"(
       get_gradients(context_id: int) -> Dict[Tensor, Tensor]
@@ -196,7 +201,8 @@ Example::
       >>> print(grads[t1])
       >>> print(grads[t2])
       )",
-      py::arg("context_id"));
+      py::arg("context_id"),
+      py::call_guard<py::gil_scoped_release>());
 
   Py_RETURN_TRUE;
 }
```
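For reference, below is a minimal standalone pybind11 sketch of the pattern the diff applies: `py::call_guard<py::gil_scoped_release>()` drops the GIL before the bound lambda runs, the C++ lock is taken without the GIL held, and the GIL is reacquired only for the Python object conversion. The module, function, and data here (`gil_demo`, `lookup`, `contexts_`) are hypothetical stand-ins, not the real PyTorch bindings:

```cpp
// Hypothetical stand-in for the pattern used in the diff above; not the
// real PyTorch code. Build as a pybind11 extension module.
#include <pybind11/pybind11.h>
#include <mutex>
#include <unordered_map>

namespace py = pybind11;

std::mutex container_lock_;               // stand-in for the container's shard lock
std::unordered_map<long, int> contexts_;  // stand-in for the context map

PYBIND11_MODULE(gil_demo, m) {
  m.def(
      "lookup",
      [](long context_id) {
        int value;
        {
          // The C++ lock is taken while the GIL is NOT held, so a thread
          // that holds this lock and then waits on the GIL cannot deadlock us.
          std::lock_guard<std::mutex> guard(container_lock_);
          value = contexts_.at(context_id);
        }
        // Reacquire the GIL only for the Python object conversion,
        // mirroring the explicit gil_scoped_acquire in the diff.
        py::gil_scoped_acquire ag;
        return py::int_(value);
      },
      py::arg("context_id"),
      // Release the GIL before the lambda body runs.
      py::call_guard<py::gil_scoped_release>());
}
```

The explicit `gil_scoped_acquire` inside the lambda is needed because the lambda itself constructs a Python object; the call guard only restores the GIL after the lambda returns.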
