`MemPool.datastore` memory utilization keeps increasing when using DTables with multiple processes #60

New issue

Open

MemPool.datastore memory utilization keeps increasing when using DTables with multiple processes#60

Assignees

Labels

bug

StevenWhitaker

opened

on Nov 11, 2023

· edited by StevenWhitaker

I have the following setup:

Start Julia with julia --project -t1 --heap-size-hint=3G
Add 4 processes with addprocs(4; exeflags = "--heap-size-hint=3G")
Worker 1 receives a query request and then tells worker 2 to do the work

The actual query includes loading a table from a .csv file into a DTable (with a DataFrame table type). Operations include selecting columns, fetching the table into a DataFrame for adding/removing rows/columns and other processing as needed, and re-wrapping the table in a DTable to later be processed further. At the end of processing, the result is returned as a DataFrame.

The .csv file contains a table with 233930 rows and 102 columns: 1 column of InlineStrings.String15, 2 columns of InlineStrings.String1, 45 columns of Int64, and 54 columns of Float64.

The issue: I noticed that if I keep running the same query repeatedly, the MemPool.datastore on worker 2 consumes more and more memory, as determined by

remotecall_fetch(2) do
    Base.summarysize(MyPackage.Dagger.MemPool.datastore)
end

Eventually, the memory usage grows enough to cause my WSL 2 Linux OOM manager to kill worker 2, crashing my program.

Notably, I do not observe this growth in memory usage in the following scenarios:

when running everything on a single process (i.e., not calling addprocs), or
when using DataFrames exclusively (i.e., not using DTables.jl at all).

I do observe this growth in memory usage in the following additional scenarios:

when using NamedTuple as the table type for the DTables, or
when running everything on a single process, but with multiple processes available. (To clarify, my code exclusively uses worker 1 in this scenario, but it appears DTables.jl/Dagger.jl uses the other available workers. And in this case the MemPool.datastore on worker 1 (not worker 2) is what consumes more and more memory. However, I never ran into any issues with the OOM manager killing my processes.)

I'm posting this issue in DTables.jl in case there's something DTables.jl is doing that somehow causes the MemPool.jl data store to keep references around longer than expected, but of course please transfer this issue to Dagger.jl or MemPool.jl as needed.

Please let me know if there is any other information that would help with finding the root cause of this issue.

jpsamaroo

Member

Do you have a reproducer for this one, just to help me debug it reliably?

StevenWhitaker

Author

I am working on a better reproducer, but I believe the behavior I pointed out in JuliaParallel/Dagger.jl#445 (related to JuliaParallel/Dagger.jl#438) is essentially the same I am reporting here---all those CPURAMDevices needed to be evicted when Julia closed because they were not being evicted earlier. It's just that before I wasn't running my code enough times to get to the point where the OOM manager killed one of my processes.

StevenWhitaker

Author

@jpsamaroo Here's a MWE that shows the ever-growing memory utilization. It took me running main about 120 times to get the OOM manager to kill worker 2 (I have 32 GB RAM). Let me know if you need data as well (i.e., "file.csv").

using Distributed
addprocs(5 - nprocs(); exeflags = "--heap-size-hint=3G")

@everywhere using DTables, DataFrames, CSV

@everywhere const DT = Ref{DTable}()

@everywhere mutable struct DTableCols
    key_names
    value_names
    keys
    values
end

function main()

    remotecall_fetch(query, 2)

end

@everywhere function query()

    dt1 = load_dt()
    dt2 = add_value_col!(dt1)
    dt3 = update_value_col!(dt2)
    @info "" length(dt3)
    dt4 = calc_value_cols(dt3)
    dt5 = select(dt4, [6; 12; 103:113]...; copycols = false)
    dt_agg = aggregate_dt(dt5)
    return fetch(dt_agg)

end

@everywhere function load_dt()

    isassigned(DT) && return DT[]
    file = "file.csv"
    GC.enable(false)
    dt = DTable(x -> CSV.File(x), [file]; tabletype = DataFrame)
    GC.enable(true)
    DT[] = dt
    return dt

end

@everywhere function add_value_col!(dt)

    dt_cols = create_dt_cols(dt, 1:48, 49:102)
    dt_cols.value_names = [dt_cols.value_names; "RAND"]
    dt_cols.values = (dt_cols.values..., rand(length(dt_cols.values[1])))
    return create_dt_from_cols(dt_cols; is_sorted = true)

end

@everywhere function create_dt_cols(dt, key_cols, value_cols)

    df = fetch(dt)
    key_names = names(df)[key_cols]
    value_names = names(df)[value_cols]
    keys = [df[!, i] for i in key_cols]
    values = [df[!, i] for i in value_cols]
    return DTableCols(key_names, value_names, keys, values)

end

@everywhere function create_dt_from_cols(dt_cols; is_sorted = false)

    df = DataFrame(
        (dt_cols.key_names .=> dt_cols.keys)...,
        (dt_cols.value_names .=> dt_cols.values)...;
        copycols = false,
    )
    is_sorted || sort!(df)
    return DTable(df)

end

@everywhere function update_value_col!(dt)

    dt_cols = create_dt_cols(dt, 1:48, 49:103)
    dt_cols.values = (
        dt_cols.values[1:10]...,
        rand(length(dt_cols.values[1])),
        dt_cols.values[12:end]...,
    )
    return create_dt_from_cols(dt_cols; is_sorted = true)

end

@everywhere function calc_value_cols(dt)

    newvals = Vector{Float64}[]
    for i = 1:10
        v = calc_new_value(dt, i)
        push!(newvals, v)
    end
    return append_value_cols(dt, newvals)

end

@everywhere function calc_new_value(dt, i)

    dt_cols = create_dt_cols(dt, 1:48, 49:103)
    return abs.(dt_cols.values[i])

end

@everywhere function append_value_cols(dt, newvals)

    df = fetch(dt)
    for (i, v) in enumerate(newvals)
        setproperty!(df, "NEW$i", v)
    end
    return DTable(df)

end

@everywhere function aggregate_dt(dt)

    key_names = [Symbol("6"), Symbol("12")]
    gdt = groupby(fetch(dt), key_names)
    gkeys = sort!(collect(keys(gdt)))
    key_pairs = key_names .=> invert(gkeys)
    value_names = [[Symbol("RAND")]; Symbol.("NEW", 1:10)]
    sums = fetch(reduce(+, gdt; cols = value_names))
    sorted = sortperm(invert(sums[key_names]))
    value_pairs = map(value_names) do value
        value => sums[Symbol(:result_, value)][sorted]
    end
    return DTable(DataFrame(key_pairs..., value_pairs...))

end

@everywhere invert(x) = [[x[j][i] for j = 1:length(x)] for i = 1:length(x[1])]

@everywhere function Base.reduce(f, df::DataFrames.AbstractDataFrame; cols)

    NamedTuple(col => reduce(f, df[!, col]) for col in cols)

end

@everywhere function Base.reduce(f, gdt::DataFrames.GroupedDataFrame; cols)

    gkeys = keys(gdt)
    dims = keys(gkeys[1])
    merge(
        NamedTuple(dim => getproperty.(gkeys, dim) for dim in dims),
        NamedTuple(
            Symbol(:result_, col) => [reduce(f, gdt[k]; cols = [col])[col] for k in gkeys]
            for col in cols
        ),
    )

end

StevenWhitaker

mentioned this in 2 issues

on Nov 13, 2023

krynju

Member

One thing I remembered was that when I was benchmarking dtables.jl around release time I had a really bad time with running it in wsl2.
I would barely get to a quarter of the table size which I could run successfully on Linux due to the weird memory management wsl does

Let's keep this in mind when looking at this, Linux will behave differently for sure. I'll try to have a look at it this week

StevenWhitaker

Author

I just ran the exact same code as in #61 (which in turn is the same as the MWE above but with a call to enable_disk_caching!(50, 2^10 * 20)). This time, instead of a BoundsError, I got AssertionError: Failed to migrate 183.839 MiB for ref 1624:

julia> include("mwe.jl"); for i = 1:100 (i % 10 == 0 && @show(i)); main() end
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
i = 10
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
i = 20
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
i = 30
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
      From worker 2:    ┌ Info:
      From worker 2:    └   length(dt3) = 233930
ERROR: On worker 2:
AssertionError: Failed to migrate 183.839 MiB for ref 1624
Stacktrace:
  [1] #105
    @ ~/.julia/packages/MemPool/l9nLj/src/storage.jl:887
  [2] with_lock
    @ ~/.julia/packages/MemPool/l9nLj/src/lock.jl:80
  [3] #sra_migrate!#103
    @ ~/.julia/packages/MemPool/l9nLj/src/storage.jl:849
  [4] sra_migrate!
    @ ~/.julia/packages/MemPool/l9nLj/src/storage.jl:826 [inlined]
  [5] write_to_device!
    @ ~/.julia/packages/MemPool/l9nLj/src/storage.jl:817
  [6] #poolset#160
    @ ~/.julia/packages/MemPool/l9nLj/src/datastore.jl:386
  [7] #tochunk#139
    @ ~/.julia/packages/Dagger/M13n0/src/chunks.jl:267
  [8] tochunk (repeats 2 times)
    @ ~/.julia/packages/Dagger/M13n0/src/chunks.jl:259 [inlined]
  [9] #DTable#1
    @ ~/.julia/packages/DTables/BjdY2/src/table/dtable.jl:38
 [10] DTable
    @ ~/.julia/packages/DTables/BjdY2/src/table/dtable.jl:28
 [11] #create_dt_from_cols#9
    @ ~/tmp/mwe.jl:76
 [12] create_dt_from_cols
    @ ~/tmp/mwe.jl:68 [inlined]
 [13] add_value_col!
    @ ~/tmp/mwe.jl:53
 [14] query
    @ ~/tmp/mwe.jl:26
 [15] #invokelatest#2
    @ ./essentials.jl:819 [inlined]
 [16] invokelatest
    @ ./essentials.jl:816
 [17] #110
    @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:285
 [18] run_work_thunk
    @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:70
 [19] macro expansion
    @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/process_messages.jl:285 [inlined]
 [20] #109
    @ ./task.jl:514
Stacktrace:
 [1] remotecall_fetch(::Function, ::Distributed.Worker; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
   @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:465
 [2] remotecall_fetch(::Function, ::Distributed.Worker)
   @ Distributed ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:454
 [3] #remotecall_fetch#162
   @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
 [4] remotecall_fetch
   @ ~/programs/julia/julia-1.9.3/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:492 [inlined]
 [5] main
   @ ~/tmp/mwe.jl:19 [inlined]
 [6] top-level scope
   @ ./REPL[1]:1

I wonder if this is essentially the same issue as the OP, where data is being kept longer than it should. Just in this case, instead of a process getting killed, MemPool errors because we end up exceeding the 20 GB of disk space I said MemPool could use. If it is the same issue, then WSL 2 memory management shouldn't have anything to do with this.

StevenWhitaker

Author

I ran the MWE (with and without enabling disk caching) on Windows (not WSL 2).

Without disk caching: I ran main about 200 times and never ran into any memory issues.
With enable_disk_caching!(50, 2^10 * 20): I could not replicate the AssertionError: Failed to migrate, but I did always get the BoundsError mentioned in BoundsError in MemPool.sra_migrate! when calling DTable(::DataFrame) #61.

So there definitely is a difference in behavior between WSL 2 and Windows.

krynju

Member

I did get it reproduced twice with the MemPool fix from the other issue
5 processes, no threads

julia> d = DTable((a=rand(Int, N1),), N1 ÷ 100)
ERROR: AssertionError: Failed to migrate 10.240 MiB for ref 5646
Stacktrace:
  [1] (::MemPool.var"#105#113"{Bool, MemPool.SimpleRecencyAllocator, MemPool.RefState, Int64})()
    @ MemPool C:\Users\krynjupc\.julia\dev\MemPool\src\storage.jl:920
  [2] with_lock(f::MemPool.var"#105#113"{Bool, MemPool.SimpleRecencyAllocator, MemPool.RefState, Int64}, lock::MemPool.NonReentrantLock, cond::Bool)
    @ MemPool C:\Users\krynjupc\.julia\dev\MemPool\src\lock.jl:80
  [3] sra_migrate!(sra::MemPool.SimpleRecencyAllocator, state::MemPool.RefState, ref_id::Int64, to_mem::Missing; read::Bool, locked::Bool)
    @ MemPool C:\Users\krynjupc\.julia\dev\MemPool\src\storage.jl:882
  [4] sra_migrate!(sra::MemPool.SimpleRecencyAllocator, state::MemPool.RefState, ref_id::Int64, to_mem::Missing)
    @ MemPool C:\Users\krynjupc\.julia\dev\MemPool\src\storage.jl:859 [inlined]
  [5] write_to_device!(sra::MemPool.SimpleRecencyAllocator, state::MemPool.RefState, ref_id::Int64)
    @ MemPool C:\Users\krynjupc\.julia\dev\MemPool\src\storage.jl:850
  [6]
    @ MemPool C:\Users\krynjupc\.julia\dev\MemPool\src\datastore.jl:386
  [7] tochunk(x::@NamedTuple{a::Vector{Int64}}, proc::OSProc, scope::AnyScope; persist::Bool, cache::Bool, device::Nothing, kwargs::@Kwargs{})
    @ Dagger C:\Users\krynjupc\.julia\packages\Dagger\M13n0\src\chunks.jl:267
  [8] tochunk
    @ Dagger C:\Users\krynjupc\.julia\packages\Dagger\M13n0\src\chunks.jl:259 [inlined]
  [9] tochunk(x::@NamedTuple{a::Vector{Int64}})
    @ Dagger C:\Users\krynjupc\.julia\packages\Dagger\M13n0\src\chunks.jl:259
 [10] DTable(table::@NamedTuple{a::Vector{Int64}}, chunksize::Int64; tabletype::Nothing, interpartition_merges::Bool)
    @ DTables C:\Users\krynjupc\.julia\packages\DTables\BjdY2\src\table\dtable.jl:122
 [11] DTable(table::@NamedTuple{a::Vector{Int64}}, chunksize::Int64)
    @ DTables C:\Users\krynjupc\.julia\packages\DTables\BjdY2\src\table\dtable.jl:61
 [12] top-level scope
    @ REPL[56]:1
Some type information was truncated. Use `show(err)` to see complete types.

julia> map(x -> (r=x.a + 1,), d) |> fetch
ERROR: ThunkFailedException:
  Root Exception Type: CapturedException
  Root Exception:
AssertionError: Failed to migrate 10.240 MiB for ref 5051
Stacktrace:
  [1] #105
    @ C:\Users\krynjupc\.julia\dev\MemPool\src\storage.jl:920
  [2] with_lock
    @ C:\Users\krynjupc\.julia\dev\MemPool\src\lock.jl:83
  [3] #sra_migrate!#103
    @ C:\Users\krynjupc\.julia\dev\MemPool\src\storage.jl:882
  [4] #120
    @ C:\Users\krynjupc\.julia\dev\MemPool\src\storage.jl:1001
  [5] with_lock
    @ C:\Users\krynjupc\.julia\dev\MemPool\src\lock.jl:80
  [6] with_lock
    @ C:\Users\krynjupc\.julia\dev\MemPool\src\lock.jl:78
  [7] read_from_device
    @ C:\Users\krynjupc\.julia\dev\MemPool\src\storage.jl:991 [inlined]
  [8] _getlocal
    @ C:\Users\krynjupc\.julia\dev\MemPool\src\datastore.jl:433
  [9] #174
    @ C:\Users\krynjupc\.julia\dev\MemPool\src\datastore.jl:425
 [10] #invokelatest#2
    @ .\essentials.jl:899
 [11] invokelatest
    @ .\essentials.jl:896
 [12] #110
    @ C:\Users\krynjupc\AppData\Local\Programs\Julia-1.11.0-DEV\share\julia\stdlib\v1.11\Distributed\src\process_messages.jl:286
 [13] run_work_thunk
    @ C:\Users\krynjupc\AppData\Local\Programs\Julia-1.11.0-DEV\share\julia\stdlib\v1.11\Distributed\src\process_messages.jl:70
 [14] #109
    @ C:\Users\krynjupc\AppData\Local\Programs\Julia-1.11.0-DEV\share\julia\stdlib\v1.11\Distributed\src\process_messages.jl:286
Stacktrace:
  [1] #remotecall_fetch#159
    @ C:\Users\krynjupc\AppData\Local\Programs\Julia-1.11.0-DEV\share\julia\stdlib\v1.11\Distributed\src\remotecall.jl:465
  [2] remotecall_fetch
    @ C:\Users\krynjupc\AppData\Local\Programs\Julia-1.11.0-DEV\share\julia\stdlib\v1.11\Distributed\src\remotecall.jl:454
  [3] remotecall_fetch
    @ C:\Users\krynjupc\AppData\Local\Programs\Julia-1.11.0-DEV\share\julia\stdlib\v1.11\Distributed\src\remotecall.jl:492 [inlined]
  [4] #173
    @ C:\Users\krynjupc\.julia\dev\MemPool\src\datastore.jl:424 [inlined]
  [5] forwardkeyerror
    @ C:\Users\krynjupc\.julia\dev\MemPool\src\datastore.jl:409
  [6] poolget
    @ C:\Users\krynjupc\.julia\dev\MemPool\src\datastore.jl:423
  [7] move
    @ C:\Users\krynjupc\.julia\packages\Dagger\M13n0\src\chunks.jl:98
  [8] move
    @ C:\Users\krynjupc\.julia\packages\Dagger\M13n0\src\chunks.jl:96
  [9] #invokelatest#2
    @ .\essentials.jl:899 [inlined]
 [10] invokelatest
    @ .\essentials.jl:896 [inlined]
 [11] #154
    @ C:\Users\krynjupc\.julia\packages\Dagger\M13n0\src\sch\Sch.jl:1475
Stacktrace:
 [1] wait
   @ .\task.jl:354 [inlined]
 [2] fetch
   @ .\task.jl:374 [inlined]
 [3] fetch_report
   @ C:\Users\krynjupc\.julia\packages\Dagger\M13n0\src\sch\util.jl:241
 [4] do_task
   @ C:\Users\krynjupc\.julia\packages\Dagger\M13n0\src\sch\Sch.jl:1502
 [5] #132
   @ C:\Users\krynjupc\.julia\packages\Dagger\M13n0\src\sch\Sch.jl:1243
  Root Thunk:  Thunk(id=2224, #38(Dagger.WeakChunk(1, 5051, WeakRef(Dagger.Chunk{@NamedTuple{a::Vector{Int64}}, DRef, OSProc, AnyScope}(@NamedTuple{a::Vector{Int64}}, 
UnitDomain(), DRef(1, 5051, 0x0000000000a3d738), OSProc(1), AnyScope(), false))), #129))
  Inner Thunk: Thunk(id=2325, isnonempty(Thunk[2224](#38, Any[Dagger.WeakChunk(1, 5051, WeakRef(Dagger.Chunk{@NamedTuple{a::Vector{Int64}}, DRef, OSProc, AnyScope}(@NamedTuple{a::Vector{Int64}}, UnitDomain(), DRef(1, 5051, 0x0000000000a3d738), OSProc(1), AnyScope(), false))), var"#129#130"()])))
  This Thunk:  Thunk(id=2325, isnonempty(Thunk[2224](#38, Any[Dagger.WeakChunk(1, 5051, WeakRef(Dagger.Chunk{@NamedTuple{a::Vector{Int64}}, DRef, OSProc, AnyScope}(@NamedTuple{a::Vector{Int64}}, UnitDomain(), DRef(1, 5051, 0x0000000000a3d738), OSProc(1), AnyScope(), false))), var"#129#130"()])))
Stacktrace:
  [1] fetch(t::Dagger.ThunkFuture; proc::OSProc, raw::Bool)
    @ Dagger C:\Users\krynjupc\.julia\packages\Dagger\M13n0\src\eager_thunk.jl:16
  [2] fetch
    @ C:\Users\krynjupc\.julia\packages\Dagger\M13n0\src\eager_thunk.jl:11 [inlined]
  [3] #fetch#75
    @ C:\Users\krynjupc\.julia\packages\Dagger\M13n0\src\eager_thunk.jl:58 [inlined]
  [4] fetch
    @ C:\Users\krynjupc\.julia\packages\Dagger\M13n0\src\eager_thunk.jl:54 [inlined]
  [5] #10
    @ C:\Users\krynjupc\.julia\packages\DTables\BjdY2\src\table\dtable.jl:233 [inlined]
  [6] filter(f::DTables.var"#10#13"{Vector{Dagger.EagerThunk}}, a::Vector{Tuple{Int64, Union{Dagger.EagerThunk, Dagger.Chunk}}})
    @ Base .\array.jl:2673
  [7] trim!(d::DTable)
    @ DTables C:\Users\krynjupc\.julia\packages\DTables\BjdY2\src\table\dtable.jl:233
  [8] trim(d::DTable)
    @ DTables C:\Users\krynjupc\.julia\packages\DTables\BjdY2\src\table\dtable.jl:242
  [9] retrieve_partitions
    @ C:\Users\krynjupc\.julia\packages\DTables\BjdY2\src\table\dtable.jl:179 [inlined]
 [10] fetch(d::DTable)
    @ DTables C:\Users\krynjupc\.julia\packages\DTables\BjdY2\src\table\dtable.jl:167
 [11] |>(x::DTable, f::typeof(fetch))
    @ Base .\operators.jl:917
 [12] top-level scope
    @ REPL[57]:1

krynju

Member

reproducer, no files needed
just run julia -p4 and run the last 5 lines over and over till it appears (and then any further call will generate the error again)

ENV["JULIA_MEMPOOL_EXPERIMENTAL_FANCY_ALLOCATOR"] = "true"
ENV["JULIA_MEMPOOL_EXPERIMENTAL_MEMORY_BOUND"] = string(2 * (2^30)) # 2GB
# ENV["JULIA_MEMPOOL_EXPERIMENTAL_DISK_CACHE"] = "C:\\Users\\krynjupc\\.mempool\\demo_session_$(rand(Int))"

using Distributed

@info(
    "Execution environment details",
    julia_version=VERSION,
    n_workers=Distributed.nworkers(),
    n_procs=Distributed.nprocs(),
    n_threads=Threads.nthreads(),
)

function view_cache()
    !isdir(ENV["JULIA_MEMPOOL_EXPERIMENTAL_DISK_CACHE"]) && return []
    map(
        x -> (basename(x), round(filesize(x) / 2^20, digits=2)),
        readdir(ENV["JULIA_MEMPOOL_EXPERIMENTAL_DISK_CACHE"], join=true)
    )
end

using DTables
DTables.enable_disk_caching!()

using MemPool
using Dagger

N1 = 2^27 # 1GB
d = DTable((a=rand(Int, N1),), N1 ÷ 100)
map(x -> (r=x.a + 1,), d) |> fetch
MemPool.GLOBAL_DEVICE[]
view_cache()

added

mentioned this

SimpleRecencyAllocator: Decr ctrs in delete_from_device JuliaData/MemPool.jl#74

krynju

Member

Can't reproduce with the fix JuliaData/MemPool.jl#74
Stressed it really hard and I didn't get any errors

Will cut a release soon

krynju

closed this as completed

on Nov 17, 2023

StevenWhitaker

Author

I just tested the new releases of DTables.jl/Dagger.jl/MemPool.jl using the reproducer I mentioned above.

Without disk caching enabled:

WSL 2 still experiences the situation where the OOM manager kills my Julia process due to the ever-growing memory utilization of MemPool.datastore.
Windows seems fine.

With enable_disk_caching!(50, 2^10 * 20):

The AssertionError: Failed to migrate still occurs on both WSL 2 and Windows (though it does take many more calls to main than before to experience the error).

So, it looks like the issue is not entirely resolved yet.

24 remaining items

to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

jpsamaroo

Labels

bug

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`MemPool.datastore` memory utilization keeps increasing when using DTables with multiple processes #60

24 remaining items

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

MemPool.datastore memory utilization keeps increasing when using DTables with multiple processes #60

Description

Activity

jpsamaroo commented on Nov 11, 2023

StevenWhitaker commented on Nov 12, 2023

StevenWhitaker commented on Nov 13, 2023

krynju commented on Nov 13, 2023

StevenWhitaker commented on Nov 13, 2023

StevenWhitaker commented on Nov 14, 2023

krynju commented on Nov 15, 2023

krynju commented on Nov 15, 2023

krynju commented on Nov 17, 2023

StevenWhitaker commented on Nov 20, 2023

24 remaining items

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Participants

Issue actions

`MemPool.datastore` memory utilization keeps increasing when using DTables with multiple processes #60