Description
I have the following setup:
- Start Julia with `julia --project -t1 --heap-size-hint=3G`
- Add 4 processes with `addprocs(4; exeflags = "--heap-size-hint=3G")`
- Worker 1 receives a query request and then tells worker 2 to do the work
The actual query includes loading a table from a .csv file into a `DTable` (with a `DataFrame` table type). Operations include `select`ing columns, `fetch`ing the table into a `DataFrame` for adding/removing rows/columns and other processing as needed, and re-wrapping the table in a `DTable` to be processed further later. At the end of processing, the result is returned as a `DataFrame`.
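For concreteness, a minimal sketch of this kind of pipeline (the file name, chunk size, column selection, and added column are placeholders, not the actual query code, and exact constructor arguments may differ):

```julia
using Distributed
addprocs(4; exeflags = "--heap-size-hint=3G")
@everywhere using DTables, DataFrames, CSV

# Illustrative pipeline only; the real query's column selection and processing differ.
function run_query(path::AbstractString)
    dt = DTable(CSV.File(path), 100_000; tabletype = DataFrame)  # load the .csv into a DTable
    df = fetch(dt)                        # materialize as a DataFrame
    select!(df, 1:3)                      # select/drop columns (placeholder selection)
    df.derived = collect(1:nrow(df))      # add a column; other processing as needed
    dt = DTable(df)                       # re-wrap in a DTable for further processing
    return fetch(dt)                      # final result returned as a DataFrame
end
```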
The .csv file contains a table with 233930 rows and 102 columns: 1 column of `InlineStrings.String15`, 2 columns of `InlineStrings.String1`, 45 columns of `Int64`, and 54 columns of `Float64`.
The issue: I noticed that if I keep running the same query repeatedly, the `MemPool.datastore` on worker 2 consumes more and more memory, as determined by

```julia
remotecall_fetch(2) do
    Base.summarysize(MyPackage.Dagger.MemPool.datastore)
end
```
Eventually, the memory usage grows enough to cause my WSL 2 Linux OOM manager to kill worker 2, crashing my program.
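For reference, the growth can be observed with a simple loop along these lines, where `run_query` is a hypothetical stand-in for the actual query code (so this is only a sketch):

```julia
using Distributed, Dagger  # assumes the setup sketched above is already in place

for i in 1:100
    run_query("file.csv")                 # hypothetical stand-in for the actual query
    sz = remotecall_fetch(2) do
        Base.summarysize(Dagger.MemPool.datastore)
    end
    @info "MemPool.datastore size on worker 2" iteration = i bytes = sz
end
```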
Notably, I do not observe this growth in memory usage in the following scenarios:
- when running everything on a single process (i.e., not calling `addprocs`), or
- when using `DataFrame`s exclusively (i.e., not using DTables.jl at all).
I do observe this growth in memory usage in the following additional scenarios:
- when using `NamedTuple` as the table type for the `DTable`s, or
- when running everything on a single process, but with multiple processes available. (To clarify, my code exclusively uses worker 1 in this scenario, but it appears DTables.jl/Dagger.jl uses the other available workers. And in this case the `MemPool.datastore` on worker 1 (not worker 2) is what consumes more and more memory. However, I never ran into any issues with the OOM manager killing my processes.)
I'm posting this issue in DTables.jl in case there's something DTables.jl is doing that somehow causes the MemPool.jl data store to keep references around longer than expected, but of course please transfer this issue to Dagger.jl or MemPool.jl as needed.
Please let me know if there is any other information that would help with finding the root cause of this issue.
Activity
jpsamaroo commented on Nov 11, 2023
Do you have a reproducer for this one, just to help me debug it reliably?
StevenWhitaker commented on Nov 12, 2023
I am working on a better reproducer, but I believe the behavior I pointed out in JuliaParallel/Dagger.jl#445 (related to JuliaParallel/Dagger.jl#438) is essentially the same as what I am reporting here: all those `CPURAMDevice`s needed to be evicted when Julia closed because they were not being evicted earlier. It's just that before, I wasn't running my code enough times to get to the point where the OOM manager killed one of my processes.

StevenWhitaker commented on Nov 13, 2023
@jpsamaroo Here's a MWE that shows the ever-growing memory utilization. It took me running `main` about 120 times to get the OOM manager to kill worker 2 (I have 32 GB RAM). Let me know if you need data as well (i.e., `"file.csv"`).
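(The MWE itself is not included in this excerpt of the thread. Purely as an illustration, a rough sketch of the shape it describes, reusing the hypothetical `run_query` pipeline sketched in the issue description and delegating the work to worker 2, might look like the following; the real MWE differs.)

```julia
# Rough sketch only, not the actual MWE. `run_query` (from the earlier sketch)
# would need to be defined with @everywhere so worker 2 can run it.
function main()
    return remotecall_fetch(run_query, 2, "file.csv")
end

for _ in 1:120   # roughly the number of calls reported before the OOM kill
    main()
end
```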
Referenced issue: BoundsError in `MemPool.sra_migrate!` when calling `DTable(::DataFrame)` #61

krynju commented on Nov 13, 2023
One thing I remembered: when I was benchmarking DTables.jl around release time, I had a really bad time running it in WSL 2. Due to the weird memory management WSL does, I would barely get to a quarter of the table size that I could run successfully on Linux.
Let's keep this in mind when looking at this; Linux will behave differently for sure. I'll try to have a look at it this week.
StevenWhitaker commented on Nov 13, 2023
I just ran the exact same code as in #61 (which in turn is the same as the MWE above but with a call to `enable_disk_caching!(50, 2^10 * 20)`). This time, instead of a `BoundsError`, I got `AssertionError: Failed to migrate 183.839 MiB for ref 1624`.

I wonder if this is essentially the same issue as the OP, where data is being kept longer than it should. Just in this case, instead of a process getting killed, MemPool errors because we end up exceeding the 20 GB of disk space I said MemPool could use. If it is the same issue, then WSL 2 memory management shouldn't have anything to do with this.
StevenWhitaker commented on Nov 14, 2023
I ran the MWE (with and without enabling disk caching) on Windows (not WSL 2).
- Without disk caching: I ran `main` about 200 times and never ran into any memory issues.
- With `enable_disk_caching!(50, 2^10 * 20)`: I could not replicate the `AssertionError: Failed to migrate`, but I did always get the `BoundsError` mentioned in #61 (BoundsError in `MemPool.sra_migrate!` when calling `DTable(::DataFrame)`).

So there definitely is a difference in behavior between WSL 2 and Windows.
krynju commented on Nov 15, 2023
I did get it reproduced twice with the MemPool fix from the other issue
5 processes, no threads
krynju commented on Nov 15, 2023
Reproducer, no files needed. Just run `julia -p4` and run the last 5 lines over and over until it appears (and then any further call will generate the error again).
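(The reproducer snippet itself is not shown in this excerpt. Based only on the description above (synthetic data, `julia -p4`, repeating the last few lines), a guess at its general shape would be something like the following; krynju's actual snippet may differ.)

```julia
using Distributed
addprocs(4)                                 # programmatic equivalent of starting with `julia -p4`
@everywhere using DTables, DataFrames

df = DataFrame(rand(100_000, 10), :auto)    # synthetic table, no files needed

# Repeat these lines over and over until the error appears.
dt = DTable(df, 10_000)
r = fetch(dt)
size(r)
```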
krynju commented on Nov 17, 2023

Can't reproduce with the fix JuliaData/MemPool.jl#74.
Stressed it really hard and I didn't get any errors
Will cut a release soon
StevenWhitaker commented on Nov 20, 2023
I just tested the new releases of DTables.jl/Dagger.jl/MemPool.jl using the reproducer I mentioned above.
- Without disk caching enabled: I still see ever-growing memory usage of the `MemPool.datastore`.
- With `enable_disk_caching!(50, 2^10 * 20)`: the `AssertionError: Failed to migrate` still occurs on both WSL 2 and Windows (though it does take many more calls to `main` than before to experience the error).

So, it looks like the issue is not entirely resolved yet.