I use xarray to combine a number of numpy arrays saved with h5py.
This worked fine until recently, but after updating to version 0.18.2 my code stopped working: the kernel dies every time because it runs out of memory.
Whenever I open multiple HDF5 files with open_mfdataset and try to save the combined dataset, memory (including swap) fills up completely before any writing happens. If all the files fit into memory the script works, but if they do not, it crashes. (With swap disabled it crashes at around 30 files in the example script below; with swap enabled, at around 50.)
I did not note down which version of xarray I was coming from where my script still worked (I think it was 0.16?).
Python 3.8.2 (default, Mar 26 2020, 15:53:00)
IPython 7.22.0
Xarray version: 0.18.2
(installed via Anaconda)
To reproduce, first create the data files (warning: this takes a couple of GB of disk space):
import numpy as np
import xarray as xr
import h5py


def makefile(data, n, nfiles):
    for i in range(nfiles):
        with h5py.File(f"{i}.h5", 'w') as file:
            for par in range(n):
                file.create_dataset(f'data/{par}',
                                    data=data,
                                    dtype='f4',
                                    maxshape=(None,),
                                    chunks=(32000,),  # (dlength,),
                                    compression='gzip',
                                    compression_opts=5,
                                    fletcher32=True,
                                    shuffle=True,
                                    )


data = np.random.randint(0, 0xFFFF, int(2e7))
# ~50 files is enough to trigger the error with my 16 GB RAM + 24 GB swap; increase if you have more RAM
makefile(data, 10, 50)
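For reference, one of the generated files can be inspected like this (just a sanity check of the layout that open_mfdataset will see):

import h5py

# each generated file holds ten 1-D float32 datasets under the 'data' group
with h5py.File("0.h5", "r") as f:
    print(list(f["data"].keys()))                # ['0', '1', ..., '9']
    print(f["data/0"].shape, f["data/0"].dtype)  # (20000000,) float32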
Then load the files and save them as a netCDF dataset:
from dask.diagnostics import ProgressBar

ProgressBar().register()  # see something happening

# load files:
ds = xr.open_mfdataset("*.h5", parallel=True, combine='nested',
                       concat_dim='phony_dim_0', group='/data')
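# (sanity check added for illustration, not part of the original script:
#  at this point the variables should be lazy dask arrays, so nothing
#  should have been loaded into memory yet)
print(ds.chunks)
print({name: type(var.data) for name, var in ds.data_vars.items()})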
# save files:
save_opts = {key: {'zlib': True,  # change to blosc whenever available in xarray
                   'complevel': 5,
                   'shuffle': True,
                   'fletcher32': True,
                   } for key in ds}
ds.to_netcdf('delme.h5',
             encoding=save_opts,
             mode="w",
             # engine="h5netcdf",  # alternatives: "netcdf4", "scipy", "h5netcdf"
             engine='netcdf4',
             )
# wait for the kernel to die because of memory overload.
Output:
The kernel restarted after around 8% progress; only 96 kB of data had been written to disk.
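A variation I would expect to stream lazily instead of loading everything at once (untested sketch; the chunk size and output filename are just placeholders):

# untested sketch: rechunk explicitly and defer the write, then compute it.
# the chunk size (2e6 samples per block) and 'delme_rechunked.h5' are placeholders.
delayed = (ds.chunk({'phony_dim_0': 2_000_000})
             .to_netcdf('delme_rechunked.h5',
                        encoding=save_opts,
                        mode='w',
                        engine='netcdf4',
                        compute=False))
delayed.compute()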