
Memory Leak open_mfdataset #5585

Closed
N4321D opened this issue Jul 6, 2021 · 2 comments

Comments

@N4321D

N4321D commented Jul 6, 2021

I used xarray to combine a couple of h5py-saved numpy arrays.
This had worked so far, but recently I updated to version 0.18.2 and my code stopped working.
The kernel kept dying because it ran out of memory.

Whenever I open multiple HDF5 files with open_mfdataset and try to save them, memory (including swap) fills up completely before any writing happens. If all the files fit in memory the script works, but if there are more files than would fit, it crashes. (With swap disabled, the example script crashed at around 30 files; with swap enabled, at around 50.)

I forgot to note which version of xarray I was coming from where my script still worked (I think it was 0.16?).

Python 3.8.2 (default, Mar 26 2020, 15:53:00)
IPython 7.22.0
xarray version: 0.18.2
(in Anaconda)

To reproduce:
Create the data files (warning: this takes a couple of GB of disk space):

import numpy as np
import xarray as xr
import h5py

def makefile(data, n, nfiles):
    for i in range(nfiles):
        with h5py.File(f"{i}.h5", 'w') as file:
            for par in range(n):
                file.create_dataset(f'data/{par}',
                                    data=data,
                                    dtype='f4',
                                    maxshape=(None,),
                                    chunks=(32000,),  # (dlength,)
                                    compression='gzip',
                                    compression_opts=5,
                                    fletcher32=True,
                                    shuffle=True,
                                )


data = np.random.randint(0, 0xFFFF, int(2e7))
makefile(data, 10, 50)    # ~50 files is enough to trigger the error on my 16 GB RAM / 24 GB swap machine; increase nfiles if you have more RAM

Load the files and save them as a netCDF dataset:

import xarray as xr
from dask.diagnostics import ProgressBar
ProgressBar().register()                   # see something happening

# load files:
ds = xr.open_mfdataset("*.h5", parallel=True, combine='nested', concat_dim='phony_dim_0', group='/data')

# save files:
save_opts = {key: {'zlib': True,     # change to blosc whenever available in xarray
                   'complevel': 5,
                   'shuffle': True,
                   'fletcher32': True,
                   } for key in ds}
ds.to_netcdf('delme.h5', 
             encoding=save_opts, 
             mode="w",
             #engine="h5netcdf", # "netcdf4", "scipy", "h5netcdf"
             engine='netcdf4',
             )

# wait for kernel to die because of mem overload. 

Output:
[screenshot: dask progress bar]
The kernel restarted at around 8%; only 96 kB of data had been written to disk.
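As a debugging aid, a minimal sketch (assuming the ds from the reproducer above) that only prints how dask has chunked each variable right after open_mfdataset; very large per-file chunks would be consistent with memory blowing up on write:

# Print the dask chunking of every variable in ds (from the reproducer above).
# Very large chunks (e.g. one chunk per input file) are a common cause of
# memory blow-ups when writing with to_netcdf.
for name, var in ds.data_vars.items():
    print(name, var.data.chunksize, f"{var.data.nbytes / 1e9:.1f} GB total")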

@jhamman
Member

jhamman commented Sep 12, 2023

@N4321D - apologies that we never responded to this issue. Were you able to move forward in some way?

I am hopeful that the new preferred chunks option from open_dataset(..., chunks={}) will mean better performance here.
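For reference, a minimal sketch of what that could look like with the reproducer above (assumed usage, not verified here: chunks={} asks xarray to use each file's preferred on-disk chunking for the dask chunks; the other arguments are taken from the original script):

ds = xr.open_mfdataset(
    "*.h5",
    chunks={},                 # use the files' preferred (on-disk) chunk sizes
    parallel=True,
    combine='nested',
    concat_dim='phony_dim_0',
    group='/data',
)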

xref: #7948

I'm going to close this as stale but feel free to reopen.

jhamman closed this as completed on Sep 12, 2023
@N4321D
Author

N4321D commented Sep 12, 2023

Thank you, yes, I wrote a custom function using dask array and h5py.
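For readers who land here, a hypothetical sketch of that kind of workaround (not the author's actual code): wrap each h5py dataset lazily with dask.array, concatenate, and let dask stream the result to a new HDF5 file chunk by chunk so only a few chunks are in memory at a time.

import glob

import dask.array as da
import h5py

# Open the source files lazily; nothing is read into memory yet.
files = [h5py.File(path, 'r') for path in sorted(glob.glob("*.h5"))]
arrays = [da.from_array(f['data/0'], chunks=(32000,)) for f in files]  # one variable as an example
combined = da.concatenate(arrays)

# da.to_hdf5 computes and writes chunk by chunk, so memory use stays bounded.
da.to_hdf5('combined.h5', '/data/0', combined)

for f in files:
    f.close()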
