
Memory Leak open_mfdataset #5585

Closed
N4321D opened this issue Jul 6, 2021 · 2 comments

Comments

@N4321D

N4321D commented Jul 6, 2021

I used xarray to combine a couple of h5py-saved numpy arrays.
This had worked so far, but recently I updated to version 0.18.2 and my code stopped working.
The kernel kept dying because it ran out of memory.

Whenever I open multiple HDF5 files with open_mfdataset and try to save them, memory (including swap) fills up completely before any writing happens. If all the files fit in memory the script works, but if there are more files than would fit, it crashes. (With swap disabled, the example script crashed at around 30 files; with swap enabled, at around 50.)

I forgot to note which version of xarray I was coming from where my script still worked (I think it was 0.16?).

Python 3.8.2 (default, Mar 26 2020, 15:53:00)
IPython 7.22.0
xarray version: 0.18.2
(in Anaconda)

To reproduce:
Create the data files (warning: this takes a couple of GB of disk space):

import numpy as np
import xarray as xr
import h5py

def makefile(data, n, nfiles):
    for i in range(nfiles):
        with h5py.File(f"{i}.h5", 'w') as file:
            for par in range(n):
                file.create_dataset(f'data/{par}',
                                    data=data,
                                    dtype='f4',
                                    maxshape=(None,),
                                    chunks=(32000,),  # (dlength,)
                                    compression='gzip',
                                    compression_opts=5,
                                    fletcher32=True,
                                    shuffle=True,
                                )


data = np.random.randint(0, 0xFFFF, int(2e7))
makefile(data, 10, 50)    # ~50 files is enough to trigger the error on my 16 GB RAM / 24 GB swap machine; increase nfiles if you have more RAM

Load the files and save them as a netCDF dataset:

import xarray as xr
from dask.diagnostics import ProgressBar
ProgressBar().register()                   # see something happening

# load files:
ds = xr.open_mfdataset("*.h5", parallel=True, combine='nested', concat_dim='phony_dim_0', group='/data')

# save files:
save_opts = {key: {'zlib': True,     # change to blosc whenever available in xarray
                   'complevel': 5,
                   'shuffle': True,
                   'fletcher32': True,
                   } for key in ds}
ds.to_netcdf('delme.h5', 
             encoding=save_opts, 
             mode="w",
             #engine="h5netcdf", # "netcdf4", "scipy", "h5netcdf"
             engine='netcdf4',
             )

# wait for kernel to die because of mem overload. 

Output:
[screenshot: dask progress bar]
The kernel restarted at around 8%; only 96 kB of data had been written to disk.
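As a debugging aid, a minimal sketch (assuming the ds from the reproducer above) that only prints how dask has chunked each variable right after open_mfdataset; very large per-file chunks would be consistent with memory blowing up on write:

# Print the dask chunking of every variable in ds (from the reproducer above).
# Very large chunks (e.g. one chunk per input file) are a common cause of
# memory blow-ups when writing with to_netcdf.
for name, var in ds.data_vars.items():
    print(name, var.data.chunksize, f"{var.data.nbytes / 1e9:.1f} GB total")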

@jhamman
Member

jhamman commented Sep 12, 2023

@N4321D - apologies that we never responded to this issue. Were you able to move forward in some way?

I am hopeful that the new preferred chunks option from open_dataset(..., chunks={}) will mean better performance here.
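For reference, a minimal sketch of what that could look like with the reproducer above (assumed usage, not verified here: chunks={} asks xarray to use each file's preferred on-disk chunking for the dask chunks; the other arguments are taken from the original script):

ds = xr.open_mfdataset(
    "*.h5",
    chunks={},                 # use the files' preferred (on-disk) chunk sizes
    parallel=True,
    combine='nested',
    concat_dim='phony_dim_0',
    group='/data',
)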

xref: #7948

I'm going to close this as stale but feel free to reopen.

jhamman closed this as completed on Sep 12, 2023
@N4321D
Author

N4321D commented Sep 12, 2023

Thank you, yes, I wrote a custom function using dask array and h5py.
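For readers who land here, a hypothetical sketch of that kind of workaround (not the author's actual code): wrap each h5py dataset lazily with dask.array, concatenate, and let dask stream the result to a new HDF5 file chunk by chunk so only a few chunks are in memory at a time.

import glob

import dask.array as da
import h5py

# Open the source files lazily; nothing is read into memory yet.
files = [h5py.File(path, 'r') for path in sorted(glob.glob("*.h5"))]
arrays = [da.from_array(f['data/0'], chunks=(32000,)) for f in files]  # one variable as an example
combined = da.concatenate(arrays)

# da.to_hdf5 computes and writes chunk by chunk, so memory use stays bounded.
da.to_hdf5('combined.h5', '/data/0', combined)

for f in files:
    f.close()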
