It seems that when I try to save dask arrays to netCDF with xarray, the saved data is complete rubbish when distributed is used.
This presents a major problem for workflows where dask is used to run a computation on a larger-than-memory dataset which is then saved to disk.
The example below reproduces my problem. Note that I am using engine='scipy' in order to avoid a known issue with saving to file via netCDF4 (pydata/xarray#1464).
In [1]:
import xarray as xr
import dask
import distributed
import netCDF4
import numpy as np
%matplotlib inline

In [2]:
dask.__version__
Out[2]:
'0.15.3'

In [3]:
distributed.__version__
Out[3]:
'1.19.1'

In [4]:
netCDF4.__version__
Out[4]:
'1.3.0'

In [5]:
xr.__version__
Out[5]:
'0.9.6'

In [6]:
# Create test frame
frame = xr.DataArray(np.random.rand(3, 3, 3))
print(frame.isel(dim_0=1).data)
frame.to_netcdf('frame.nc')
Out[6]:
[[ 0.13228003  0.21012342  0.98197841]
 [ 0.07155916  0.49629888  0.83948875]
 [ 0.6004104   0.60991927  0.26890407]]
In [7]:
# save it out and reload as dask array
frame_dask = xr.open_dataarray('frame.nc', chunks={'dim_0': 1})
print(frame_dask.isel(dim_0=1).data.compute())
frame_dask.isel(dim_0=1).to_netcdf('frame_dask.nc', engine='scipy')
Out[7]:
[[ 0.13228003  0.21012342  0.98197841]
 [ 0.07155916  0.49629888  0.83948875]
 [ 0.6004104   0.60991927  0.26890407]]

In [8]:
# This file seems to be written properly
frame_back = xr.open_dataarray('frame_dask.nc')
frame_back.data
Out[8]:
array([[ 0.13228003,  0.21012342,  0.98197841],
       [ 0.07155916,  0.49629888,  0.83948875],
       [ 0.6004104 ,  0.60991927,  0.26890407]])
In [9]:
# So far so good. Now let's do the same thing with distributed
client = distributed.Client()

In [10]:
# save it out and reload as dask array
frame_dask = xr.open_dataarray('frame.nc', chunks={'dim_0': 1})
print(frame_dask.isel(dim_0=1).data.compute())
frame_dask.isel(dim_0=1).to_netcdf('frame_dask_distributed.nc', engine='scipy')
Out[10]:
[[ 0.13228003  0.21012342  0.98197841]
 [ 0.07155916  0.49629888  0.83948875]
 [ 0.6004104   0.60991927  0.26890407]]

In [11]:
# Now when loaded again, the data is complete nonsense!
frame_back = xr.open_dataarray('frame_dask_distributed.nc')
frame_back.data
Out[11]:
array([[ -1.88270321e-134,  -3.91977874e+157,   3.33289716e-199],
       [ -4.65532185e-152,   6.52205493e+216,   1.88071323e+204],
       [  9.92775037e+246,  -2.11058862e+306,   6.41161328e-035]])
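For completeness, a quick check along these lines (just a sketch, reusing the files produced above) makes the mismatch explicit by comparing the slice that was written with what comes back from disk:

In [12]:
import numpy as np
import xarray as xr

# The slice that was written vs. the round-tripped file.
original = xr.open_dataarray('frame.nc').isel(dim_0=1)
roundtripped = xr.open_dataarray('frame_dask_distributed.nc')
print(np.allclose(original.values, roundtripped.values))  # False when the bug hits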
SciPy is writing garbage data here, but I'm pretty sure the fundamental issue is the same as pydata/xarray#1464. In that case, it's pretty clearly an xarray issue, not a dask one, so let's deal with it over there.
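Until that is fixed upstream, one possible workaround (a sketch only, assuming the corruption is limited to the lazy write path taken when a distributed Client is active; the output file name is made up for illustration) is to load the computed result into memory before calling to_netcdf, so the scipy backend only ever writes plain in-memory data:

import xarray as xr
import distributed

client = distributed.Client()

frame_dask = xr.open_dataarray('frame.nc', chunks={'dim_0': 1})
# .load() runs the dask graph (on the cluster, since a Client is active)
# and pulls the result into memory, so the subsequent write does not go
# through the distributed lazy-write path.
result = frame_dask.isel(dim_0=1).load()
# 'frame_workaround.nc' is just an illustrative file name.
result.to_netcdf('frame_workaround.nc', engine='scipy')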