Description
I have some code that involves several operations on `DTable`s. I ran my code with `nprocs()` equal to `1`, and everything worked fine. I then added some processes so that `nprocs()` equaled `5` and ran my code again on worker 1 (so I didn't explicitly use any of the added workers). In this case, my code would hang when calling `reduce` on a `GDTable` (i.e., after calling `groupby`).
I tried to create an MWE, but I haven't yet been able to find one that hangs. Fortunately, I did find an MWE that gives a different error (`ConcurrencyViolationError("lock must be held")`); hopefully this error and the hanging I'm experiencing are different manifestations of the same issue.

EDIT: The next comment contains a simpler MWE that produces the same error (with a slightly different stacktrace, though).
Contents of `mwe.jl`:
```julia
using Distributed
nworkers = 4
addprocs(nworkers - nprocs() + 1)

@everywhere using DTables, DataFrames, CSV

function f()
    # Load the CSV into a DTable, group on three columns, then reduce two columns.
    dt = DTable(x -> CSV.File(x), ["file.csv"]; tabletype = DataFrame)
    df = fetch(dt)
    gdt = groupby(dt, Symbol.(names(df)[[6, 12, 48]]))
    sums = fetch(reduce(+, gdt; cols = Symbol.(names(df)[[93, 94]])))
end

f()
```
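`"file.csv"` is not included here (see the notes below for its shape). As a rough sketch, this is one way to generate a table of a similar shape; note that in my attempts synthetic data like this did not reproduce the error, so the column layout is illustrative only:

```julia
# Sketch only: build a CSV roughly matching the shape described in the notes
# (233930 rows, 102 columns of String and Float64 values). The split between
# String and Float64 columns is an assumption, not the real layout.
using CSV, DataFrames

function write_synthetic_csv(path; nrows = 233_930)
    df = DataFrame()
    for i in 1:8
        df[!, "s$i"] = [string("key", rand(1:50)) for _ in 1:nrows]  # String columns
    end
    for i in 1:94
        df[!, "x$i"] = rand(nrows)                                   # Float64 columns
    end
    CSV.write(path, df)
end

write_synthetic_csv("file.csv")
```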
Error:

```
julia> include("mwe.jl")
ERROR: LoadError: ThunkFailedException:
Root Exception Type: RemoteException
Root Exception:
On worker 2:
ThunkFailedException:
Root Exception Type: RemoteException
Root Exception:
On worker 2:
ConcurrencyViolationError("lock must be held")
Stacktrace:
[1] assert_havelock
@ ./condition.jl:25 [inlined]
[2] assert_havelock
@ ./condition.jl:48 [inlined]
[3] assert_havelock
@ ./condition.jl:72 [inlined]
[4] _wait2
@ ./condition.jl:83
[5] #wait#621
@ ./condition.jl:127
[6] wait
@ ./condition.jl:125 [inlined]
[7] wait_for_conn
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:195
[8] check_worker_state
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/cluster.jl:170
[9] send_msg_
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/messages.jl:172
[10] send_msg
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/messages.jl:122 [inlined]
[11] #remotecall_fetch#159
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:460
[12] remotecall_fetch
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:454
[13] #remotecall_fetch#162
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
[14] remotecall_fetch
@ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
[15] #171
@ ~/.julia/packages/MemPool/l9nLj/src/datastore.jl:424 [inlined]
[16] forwardkeyerror
@ ~/.julia/packages/MemPool/l9nLj/src/datastore.jl:409
[17] poolget
@ ~/.julia/packages/MemPool/l9nLj/src/datastore.jl:423
[18] move
@ ~/.julia/packages/Dagger/M13n0/src/chunks.jl:98
[19] move
@ ~/.julia/packages/Dagger/M13n0/src/chunks.jl:96 [inlined]
[20] move
@ ~/.julia/packages/Dagger/M13n0/src/chunks.jl:102
[21] #fetch#70
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:21
[22] fetch
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:11 [inlined]
[23] #fetch#75
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:58 [inlined]
[24] fetch
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:54 [inlined]
[25] _broadcast_getindex_evalf
@ ./broadcast.jl:683 [inlined]
[26] _broadcast_getindex
@ ./broadcast.jl:656 [inlined]
[27] _getindex
@ ./broadcast.jl:679 [inlined]
[28] _broadcast_getindex
@ ./broadcast.jl:655 [inlined]
[29] getindex
@ ./broadcast.jl:610 [inlined]
[30] copyto_nonleaf!
@ ./broadcast.jl:1068
[31] copy
@ ./broadcast.jl:920 [inlined]
[32] materialize
@ ./broadcast.jl:873 [inlined]
[33] #79
@ ~/.julia/packages/DTables/BjdY2/src/operations/operations.jl:187
[34] #invokelatest#2
@ ./essentials.jl:819 [inlined]
[35] invokelatest
@ ./essentials.jl:816 [inlined]
[36] #43
@ ~/.julia/packages/Dagger/M13n0/src/processor.jl:162
Stacktrace:
[1] wait
@ ./task.jl:349 [inlined]
[2] fetch
@ ./task.jl:369 [inlined]
[3] #execute!#42
@ ~/.julia/packages/Dagger/M13n0/src/processor.jl:172
[4] execute!
@ ~/.julia/packages/Dagger/M13n0/src/processor.jl:157
[5] #158
@ ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:1551 [inlined]
[6] #21
@ ~/.julia/packages/Dagger/M13n0/src/options.jl:17 [inlined]
[7] #1
@ ~/.julia/packages/ScopedValues/92HJZ/src/ScopedValues.jl:163
[8] with_logstate
@ ./logging.jl:514
[9] with_logger
@ ./logging.jl:626 [inlined]
[10] enter_scope
@ ~/.julia/packages/ScopedValues/92HJZ/src/payloadlogger.jl:17 [inlined]
[11] with
@ ~/.julia/packages/ScopedValues/92HJZ/src/ScopedValues.jl:162
[12] with_options
@ ~/.julia/packages/Dagger/M13n0/src/options.jl:16
[13] do_task
@ ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:1549
[14] macro expansion
@ ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:1243 [inlined]
[15] #132
@ ./task.jl:134
This Thunk: Thunk(id=9, #79(5 inputs...))
Stacktrace:
[1] #fetch#70
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:16
[2] fetch
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:11 [inlined]
[3] #fetch#75
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:58 [inlined]
[4] fetch
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:54 [inlined]
[5] #86
@ ./none:0
[6] iterate
@ ./generator.jl:47 [inlined]
[7] collect
@ ./array.jl:782
[8] #83
@ ~/.julia/packages/DTables/BjdY2/src/operations/operations.jl:205
[9] #invokelatest#2
@ ./essentials.jl:819 [inlined]
[10] invokelatest
@ ./essentials.jl:816 [inlined]
[11] #43
@ ~/.julia/packages/Dagger/M13n0/src/processor.jl:162
Stacktrace:
[1] wait
@ ./task.jl:349 [inlined]
[2] fetch
@ ./task.jl:369 [inlined]
[3] #execute!#42
@ ~/.julia/packages/Dagger/M13n0/src/processor.jl:172
[4] execute!
@ ~/.julia/packages/Dagger/M13n0/src/processor.jl:157
[5] #158
@ ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:1551 [inlined]
[6] #21
@ ~/.julia/packages/Dagger/M13n0/src/options.jl:17 [inlined]
[7] #1
@ ~/.julia/packages/ScopedValues/92HJZ/src/ScopedValues.jl:163
[8] with_logstate
@ ./logging.jl:514
[9] with_logger
@ ./logging.jl:626 [inlined]
[10] enter_scope
@ ~/.julia/packages/ScopedValues/92HJZ/src/payloadlogger.jl:17 [inlined]
[11] with
@ ~/.julia/packages/ScopedValues/92HJZ/src/ScopedValues.jl:162
[12] with_options
@ ~/.julia/packages/Dagger/M13n0/src/options.jl:16
[13] do_task
@ ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:1549
[14] macro expansion
@ ~/.julia/packages/Dagger/M13n0/src/sch/Sch.jl:1243 [inlined]
[15] #132
@ ./task.jl:134
This Thunk: Thunk(id=11, #83(5 inputs...))
Stacktrace:
[1] fetch(t::Dagger.ThunkFuture; proc::Dagger.OSProc, raw::Bool)
@ Dagger ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:16
[2] fetch
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:11 [inlined]
[3] #fetch#75
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:58 [inlined]
[4] fetch
@ ~/.julia/packages/Dagger/M13n0/src/eager_thunk.jl:54 [inlined]
[5] f()
@ Main ~/tmp/mwe.jl:11
[6] top-level scope
@ ~/tmp/mwe.jl:14
[7] include(fname::String)
@ Base.MainInclude ./client.jl:478
[8] top-level scope
@ REPL[1]:1
in expression starting at /home/steven/tmp/mwe.jl:14
```
Some notes:
- This is with Dagger v0.18.4 and DTables v0.4.2 (the latest versions).
- The error does not occur every time.
- The error also occurs (again, sometimes) when calling `f` via a `remotecall_fetch` call (a sketch of this invocation follows this list).
- As in Various errors working with DTables.jl #438, `"file.csv"` is a 157 MB table with 233930 rows and 102 columns of `String` and `Float64` values. I tried to generate data to keep the MWE self-contained, but wasn't successful.
- This MWE is very similar to the MWE in Various errors working with DTables.jl #438, just cleaned up. The biggest difference that actually matters (I think) in terms of reproducing the error is having more processes. Increasing `nworkers` to `10` in this MWE seems to make the error happen more frequently. I'm guessing the previous MWE also would have exhibited this error if `nworkers` were larger.
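For reference, the `remotecall_fetch` variant mentioned above looks roughly like the sketch below, assuming `f` from `mwe.jl` is already defined; process `1` is used because the code was only ever run on worker 1:

```julia
# Sketch: run f via remotecall_fetch on process 1 instead of calling f() directly.
# The error still shows up intermittently when invoked this way.
remotecall_fetch(f, 1)
```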