Skip to content

to_netcdf is not idempotent when stacking rename and set_coords #5170

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
floriankrb opened this issue Apr 16, 2021 · 3 comments · Fixed by #8195
Closed

to_netcdf is not idempotent when stacking rename and set_coords #5170

floriankrb opened this issue Apr 16, 2021 · 3 comments · Fixed by #8195

Comments

@floriankrb
Copy link
Contributor

floriankrb commented Apr 16, 2021

After doing

# !wget https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/dev/xarray-issue/source.nc
ds = xr.open_dataset('source.nc') 
ds = ds.rename({'number': 'n'})
ds = ds.set_coords('valid_time')
print(ds)
ds.to_netcdf('out.nc')
ds = xr.open_dataset('out.nc')
print(ds)

'valid_time' is not a coordinate in the netcdf file coordinate is turned into a variable.

<xarray.Dataset>
Dimensions:            (heightAboveGround: 1, latitude: 121, longitude: 240, n: 2, step: 3, time: 1)
Coordinates:
  * n                  (n) int64 0 1
  * time               (time) datetime64[ns] 2020-01-02
  * step               (step) timedelta64[ns] 1 days 2 days 3 days
  * heightAboveGround  (heightAboveGround) int64 2
  * latitude           (latitude) float64 90.0 88.5 87.0 ... -87.0 -88.5 -90.0
  * longitude          (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
    valid_time         (time, step) datetime64[ns] ...
Data variables:
    t2m                (n, time, step, heightAboveGround, latitude, longitude) float32 ...
Attributes:
    GRIB_edition:            2
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2021-04-15T12:08:59 GRIB to CDM+CF via cfgrib-0....
<xarray.Dataset>
Dimensions:            (heightAboveGround: 1, latitude: 121, longitude: 240, n: 2, step: 3, time: 1)
Coordinates:
  * n                  (n) int64 0 1
  * time               (time) datetime64[ns] 2020-01-02
  * step               (step) timedelta64[ns] 1 days 2 days 3 days
  * heightAboveGround  (heightAboveGround) int64 2
  * latitude           (latitude) float64 90.0 88.5 87.0 ... -87.0 -88.5 -90.0
  * longitude          (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
Data variables:
    valid_time         (time, step) datetime64[ns] ...
    t2m                (n, time, step, heightAboveGround, latitude, longitude) float32 ...
Attributes:
    GRIB_edition:            2
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2021-04-15T12:08:59 GRIB to CDM+CF via cfgrib-0....

What happened:
Output of the MCVE is :
Coords written in copy.1.nc: ['number', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude', 'valid_time']
Reread and check copy.1.nc: ['number', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude', 'valid_time']

Coords written in copy.2.nc: ['n', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude', 'valid_time']
Reread and check copy.2.nc: ['n', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude']

Coords written in copy.4.nc: ['n', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude', 'valid_time']
Reread and check copy.4.nc: ['n', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude', 'valid_time']

What you expected to happen
Output of the MCVE should be :

Coords written in copy.1.nc: ['number', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude', 'valid_time']
Reread and check copy.1.nc: ['number', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude', 'valid_time']

Coords written in copy.2.nc: ['n', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude', 'valid_time']
Reread and check copy.2.nc: ['n', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude', 'valid_time']

Coords written in copy.4.nc: ['n', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude', 'valid_time']
Reread and check copy.4.nc: ['n', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude', 'valid_time']

Minimal Complete Verifiable Example:

import xarray as xr
import numpy as np

xr.show_versions()
print(xr.__file__)

FILE1 = 'copy.1.nc'
FILE2 = 'copy.2.nc'
FILE3 = 'copy.3.nc'
FILE4 = 'copy.4.nc'

ds = xr.open_dataset('source.nc')

# Initial dataset is ok
print(f'Coords  written in {FILE1}: {list(ds.coords)}')
ds.to_netcdf(FILE1)
print(f'Reread and check   {FILE1}: {list(xr.open_dataset(FILE1).coords)}')
#print(xr.open_dataset(FILE1))

print()

# No round trip
ds = ds.rename({'number': 'n'})
ds = ds.set_coords('valid_time')
print(f'Coords  written in {FILE2}: {list(ds.coords)}')
ds.to_netcdf(FILE2)
print(f'Reread and check   {FILE2}: {list(xr.open_dataset(FILE2).coords)}')
#print(xr.open_dataset(FILE2))

print()

# Doing a round trip solves the issue :
ds = xr.open_dataset('source.nc')
ds = ds.rename({'number': 'n'})
ds.to_netcdf(FILE3)
ds = xr.open_dataset(FILE3)
ds = ds.set_coords('valid_time')
print(f'Coords  written in {FILE4}: {list(ds.coords)}')
ds.to_netcdf(FILE4)
print(f'Reread and check   {FILE4}: {list(xr.open_dataset(FILE4).coords)}')
#print(xr.open_dataset(FILE4))

Anything else we need to know?:
The initial source.nc file is available at wget https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/dev/xarray-issue/source.nc

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.6 | packaged by conda-forge | (default, Jan 25 2021, 23:21:18)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-1160.15.2.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.12.0
libnetcdf: 4.7.4

xarray: 0.17.0
pandas: 1.2.2
numpy: 1.19.5
scipy: None
netCDF4: 1.5.6
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: 2.6.1
cftime: 1.4.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: 0.9.8.5
iris: None
bottleneck: None
dask: 2021.02.0
distributed: 2021.02.0
matplotlib: 3.3.4
cartopy: None
seaborn: None
numbagg: None
pint: None
setuptools: 49.6.0.post20210108
pip: 21.0.1
conda: 4.9.2
pytest: 5.3.1
IPython: 7.20.0
sphinx: 3.4.3

It may be related/duplicate of #4512 and linked to #4108

@kmuehlbauer
Copy link
Contributor

@floriankrb Sorry for the long delay. If you are still interested in the source of the issue, here is what I found:

By default Xarray will promote any data variable which shares it's name with a dimension to a coordinate. That accounts for ['number', 'time', 'step', 'heightAboveGround', 'latitude', 'longitude']. valid_time is a two dimensional coordinate (by CF standard) and is a coordinate here because t2m data variable has a corresponding coordinates-attribute containing valid_time. In the decoding-step valid_time gets added to the .coords. The attribute is removed from t2m's attrs and kept in t2m.encoding. So far so good.

By renaming number to n that coordinates attribute (in encoding) does not change as well. So when the data is written, t2m will still hold number in it's coordinates-attribute (on disk).

The issue manifests on subsequent read as now the decoding-step tries to align the found coordinates with the available data variables. As number is not available, no coordinate from that string will be taken into account as coordinate (note the all on line 444):

if decode_coords in [True, "coordinates", "all"]:
var_attrs = new_vars[k].attrs
if "coordinates" in var_attrs:
coord_str = var_attrs["coordinates"]
var_coord_names = coord_str.split()
if all(k in variables for k in var_coord_names):
new_vars[k].encoding["coordinates"] = coord_str
del var_attrs["coordinates"]
coord_names.update(var_coord_names)

This can easily be observed by looking into t2m.attrs where the coordinates remains instead of being preserved in .encoding.

So the source of all problems here is that the renaming number -> n was missed for coordinates-attribute of t2m's .encoding.

@kmuehlbauer
Copy link
Contributor

connected to #1809

@kmuehlbauer
Copy link
Contributor

@floriankrb I've checked with your example and #8195 will solve your issue too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants