It seems that when I try to save dask arrays to netCDF with xarray, the saved data is complete rubbish when distributed is used.
This presents a major problem for workflows where dask is used to run a computation on a larger-than-memory dataset which is then saved to disk.
The example below reproduces my problem. Note that I am using engine='scipy' in order to avoid a known issue with saving to file via netCDF4 (pydata/xarray#1464).
In [1]:
import xarray as xr
import dask
import distributed
import netCDF4
import numpy as np
%matplotlib inline

In [2]:
dask.__version__
Out[2]:
'0.15.3'

In [3]:
distributed.__version__
Out[3]:
'1.19.1'

In [4]:
netCDF4.__version__
Out[4]:
'1.3.0'

In [5]:
xr.__version__
Out[5]:
'0.9.6'

In [6]:
# Create test frame
frame = xr.DataArray(np.random.rand(3, 3, 3))
print(frame.isel(dim_0=1).data)
frame.to_netcdf('frame.nc')
Out[6]:
[[ 0.13228003  0.21012342  0.98197841]
 [ 0.07155916  0.49629888  0.83948875]
 [ 0.6004104   0.60991927  0.26890407]]
In [7]:
# save it out and reload as dask array
frame_dask = xr.open_dataarray('frame.nc', chunks={'dim_0': 1})
print(frame_dask.isel(dim_0=1).data.compute())
frame_dask.isel(dim_0=1).to_netcdf('frame_dask.nc', engine='scipy')
Out[7]:
[[ 0.13228003  0.21012342  0.98197841]
 [ 0.07155916  0.49629888  0.83948875]
 [ 0.6004104   0.60991927  0.26890407]]

In [8]:
# This file seems to be written properly
frame_back = xr.open_dataarray('frame_dask.nc')
frame_back.data
Out[8]:
array([[ 0.13228003,  0.21012342,  0.98197841],
       [ 0.07155916,  0.49629888,  0.83948875],
       [ 0.6004104 ,  0.60991927,  0.26890407]])
In [9]:
# So far so good. Now let's do the same thing with distributed
client = distributed.Client()

In [10]:
# save it out and reload as dask array
frame_dask = xr.open_dataarray('frame.nc', chunks={'dim_0': 1})
print(frame_dask.isel(dim_0=1).data.compute())
frame_dask.isel(dim_0=1).to_netcdf('frame_dask_distributed.nc', engine='scipy')
Out[10]:
[[ 0.13228003  0.21012342  0.98197841]
 [ 0.07155916  0.49629888  0.83948875]
 [ 0.6004104   0.60991927  0.26890407]]

In [11]:
# Now when loaded again, the data is complete nonsense!
frame_back = xr.open_dataarray('frame_dask_distributed.nc')
frame_back.data
Out[11]:
array([[ -1.88270321e-134,  -3.91977874e+157,   3.33289716e-199],
       [ -4.65532185e-152,   6.52205493e+216,   1.88071323e+204],
       [  9.92775037e+246,  -2.11058862e+306,   6.41161328e-035]])
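For completeness, a quick check along these lines (just a sketch, reusing the files produced above) makes the mismatch explicit by comparing the slice that was written with what comes back from disk:

In [12]:
import numpy as np
import xarray as xr

# The slice that was written vs. the round-tripped file.
original = xr.open_dataarray('frame.nc').isel(dim_0=1)
roundtripped = xr.open_dataarray('frame_dask_distributed.nc')
print(np.allclose(original.values, roundtripped.values))  # False when the bug hits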
SciPy is writing garbage data here, but I'm pretty sure the fundamental issue is the same as pydata/xarray#1464. In that case, it's pretty clearly an xarray issue, not a dask one, so let's deal with it over there.
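Until that is fixed upstream, one possible workaround (a sketch only, assuming the corruption is limited to the lazy write path taken when a distributed Client is active; the output file name is made up for illustration) is to load the computed result into memory before calling to_netcdf, so the scipy backend only ever writes plain in-memory data:

import xarray as xr
import distributed

client = distributed.Client()

frame_dask = xr.open_dataarray('frame.nc', chunks={'dim_0': 1})
# .load() runs the dask graph (on the cluster, since a Client is active)
# and pulls the result into memory, so the subsequent write does not go
# through the distributed lazy-write path.
result = frame_dask.isel(dim_0=1).load()
# 'frame_workaround.nc' is just an illustrative file name.
result.to_netcdf('frame_workaround.nc', engine='scipy')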