Skip to content

DataArray.to_dataset(dim) silently drops variable if it is already a dim #7823

@mgunyho

Description

@mgunyho

What happened?

If I have a DataArray da which I split into a Dataset using da.to_dataset(dim), and one of the values of da[dim] also happens to be one of the dimensions of da, that variable is silently missing from the resulting dataset.

What did you expect to happen?

If a variable cannot be created because it is already a dimension, it should raise an exception, or possibly issue a warning and rename the variable, so that no data is lost.

Minimal Complete Verifiable Example

import xarray as xr

da = xr.DataArray(
    np.zeros((3, 3)),
    coords={
        # note how 'foo' is one of the coordinate values, and also the name of a dimension
        "x": ["foo", "bar", "baz"],
        "foo": [1, 2, 3],
    }
)

# this produces a Dataset with the variables 'bar' and 'baz', 'foo' is missing (because it is already a coordinate)
print(da.to_dataset("x"))

# this produces a dataset with the variables 'foo', 'bar', and 'baz', as epected
print(da.rename({"foo": "qux"}).to_dataset("x"))

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

# Output of first conversion
<xarray.Dataset>
Dimensions:  (foo: 3)
Coordinates:
  * foo      (foo) int64 1 2 3
Data variables:
    bar      (foo) float64 0.0 0.0 0.0
    baz      (foo) float64 0.0 0.0 0.0

# Output of second conversion
<xarray.Dataset>
Dimensions:  (qux: 3)
Coordinates:
  * qux      (qux) int64 1 2 3
Data variables:
    foo      (qux) float64 0.0 0.0 0.0
    bar      (qux) float64 0.0 0.0 0.0
    baz      (qux) float64 0.0 0.0 0.0

Anything else we need to know?

This came up when I did to_dataset("param") on the fit result returned by curvefit, and one of the data dimensions happened to be the same as one of the arguments of the function which I was fitting. I was initially very confused by this.

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.10 (main, Mar 01 2023, 21:10:14) [GCC]
python-bits: 64
OS: Linux
OS-release: 6.2.12-1-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None

xarray: 2023.4.2
pandas: 2.0.1
numpy: 1.23.5
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.14.2
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.4.1
distributed: None
matplotlib: 3.7.1
cartopy: None
seaborn: 0.12.2
numbagg: None
fsspec: 2023.4.0
cupy: None
pint: None
sparse: 0.14.0
flox: None
numpy_groupies: None
setuptools: 65.5.0
pip: 22.3.1
conda: None
pytest: 7.3.1
mypy: None
IPython: 8.13.2
sphinx: 6.2.1

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions