Description
It seems that when I save dask arrays to netCDF with xarray, the saved data is complete rubbish when distributed is used.
This is a major problem for workflows where dask is used to run a computation on a larger-than-memory dataset which is then saved to disk.
This example reproduces my problem. Note that I am using engine='scipy' in order to avoid a known issue with saving to file with netcdf4 (pydata/xarray#1464).
In [1]:
import xarray as xr
import dask
import distributed
import netCDF4
import numpy as np
%matplotlib inline
In [2]:
dask.__version__
Out[2]:
'0.15.3'
In [3]:
distributed.__version__
Out[3]:
'1.19.1'
In [4]:
netCDF4.__version__
Out[4]:
'1.3.0'
In [5]:
xr.__version__
Out[5]:
'0.9.6'
In [6]:
# Create test frame
frame = xr.DataArray(np.random.rand(3,3,3))
print(frame.isel(dim_0=1).data)
frame.to_netcdf('frame.nc')
Out[6]:
[[ 0.13228003 0.21012342 0.98197841]
[ 0.07155916 0.49629888 0.83948875]
[ 0.6004104 0.60991927 0.26890407]]
In [7]:
# save it out and reload as dask array
frame_dask = xr.open_dataarray('frame.nc',chunks={'dim_0':1})
print(frame_dask.isel(dim_0=1).data.compute())
frame_dask.isel(dim_0=1).to_netcdf('frame_dask.nc',engine='scipy')
Out[7]:
[[ 0.13228003 0.21012342 0.98197841]
[ 0.07155916 0.49629888 0.83948875]
[ 0.6004104 0.60991927 0.26890407]]
In [8]:
# This file seems to be written properly
frame_back = xr.open_dataarray('frame_dask.nc')
frame_back.data
Out[8]:
array([[ 0.13228003, 0.21012342, 0.98197841],
[ 0.07155916, 0.49629888, 0.83948875],
[ 0.6004104 , 0.60991927, 0.26890407]])
In [9]:
# So far so good. Now let's do the same thing with distributed
client = distributed.Client()
In [10]:
# save it out and reload as dask array
frame_dask = xr.open_dataarray('frame.nc',chunks={'dim_0':1})
print(frame_dask.isel(dim_0=1).data.compute())
frame_dask.isel(dim_0=1).to_netcdf('frame_dask_distributed.nc',engine='scipy')
Out[10]:
[[ 0.13228003 0.21012342 0.98197841]
[ 0.07155916 0.49629888 0.83948875]
[ 0.6004104 0.60991927 0.26890407]]
In [11]:
# Now when loaded again, the data is complete nonsense!
frame_back = xr.open_dataarray('frame_dask_distributed.nc')
frame_back.data
Out[11]:
array([[ -1.88270321e-134, -3.91977874e+157, 3.33289716e-199],
[ -4.65532185e-152, 6.52205493e+216, 1.88071323e+204],
[ 9.92775037e+246, -2.11058862e+306, 6.41161328e-035]])
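To make the mismatch explicit rather than comparing the printed arrays by eye, a small check along these lines could be run on the reloaded data. This is only a sketch: `roundtrip_matches` is a hypothetical helper name, not part of xarray or dask, and the two arrays are copied from Out[6] and Out[11] above.

```python
import numpy as np


def roundtrip_matches(expected, actual, rtol=1e-7):
    """Return True if both arrays have the same shape and close values.

    Hypothetical helper for this report, not an xarray or dask function.
    """
    expected = np.asarray(expected)
    actual = np.asarray(actual)
    if expected.shape != actual.shape:
        return False
    return bool(np.allclose(expected, actual, rtol=rtol, atol=0.0))


# Values printed in Out[6] (the data that was written)
original = np.array([[0.13228003, 0.21012342, 0.98197841],
                     [0.07155916, 0.49629888, 0.83948875],
                     [0.6004104, 0.60991927, 0.26890407]])

# Values printed in Out[11] (read back after writing with distributed)
corrupted = np.array([[-1.88270321e-134, -3.91977874e+157, 3.33289716e-199],
                      [-4.65532185e-152, 6.52205493e+216, 1.88071323e+204],
                      [9.92775037e+246, -2.11058862e+306, 6.41161328e-35]])

print(roundtrip_matches(original, original.copy()))  # True
print(roundtrip_matches(original, corrupted))        # False
```

In the notebook, the same call could be applied to `frame.isel(dim_0=1).data` and `frame_back.data` after each write.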