apply_ufunc doesn't inherit encoding #10297

Closed
5 tasks done
jpdehollain opened this issue May 8, 2025 · 6 comments

@jpdehollain

What happened?

The encoding of a data array is lost after applying any ufunc with apply_ufunc. There should be an argument for encoding in apply_ufunc, similar to keep_attrs.

What did you expect to happen?

The encoding property should be inherited (or resolved, if more than one data array is involved) when applying a ufunc.

Minimal Complete Verifiable Example

import xarray as xr
import numpy as np

my_ufunc = lambda x: x + 1
xarr1 = xr.DataArray(np.array([1, 2, 3]))
xarr1.encoding = {'dummy': 'baz'}
xarr2 = xr.apply_ufunc(my_ufunc, xarr1)
# The encoding set on xarr1 is not carried over to xarr2.
print(xarr1.encoding, xarr2.encoding)

# Workaround: copy the encoding over manually.
xarr2.encoding = xarr1.encoding.copy()
print(xarr1.encoding, xarr2.encoding)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.13.3 | packaged by conda-forge | (main, Apr 14 2025, 20:31:24) [MSC v.1943 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 11
machine: AMD64
processor: Intel64 Family 6 Model 170 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('English_Australia', '1252')
libhdf5: 1.14.6
libnetcdf: 4.9.2

xarray: 0.1.dev5937+g070af11
pandas: 2.2.3
numpy: 2.2.5
scipy: 1.15.2
netCDF4: 1.7.2
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: None
pip: 25.1.1
conda: None
pytest: None
mypy: None
IPython: 9.2.0
sphinx: None

@jpdehollain jpdehollain added the bug and needs triage labels May 8, 2025

welcome bot commented May 8, 2025

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@kmuehlbauer
Contributor

Thanks @jpdehollain for raising this. Please see #6323 for more details. Closing in light of #5336 and #5082.

@kmuehlbauer kmuehlbauer removed the bug and needs triage labels May 8, 2025
@kmuehlbauer kmuehlbauer closed this as not planned May 8, 2025
@jpdehollain
Author

Thanks @kmuehlbauer, and apologies for not finding that issue before posting (I searched for inherit instead of propagate 😆).

To add some context in case it aids the discussion: I have a dataset with data arrays that can contain either float or str (dtype=object) types. The arrays get updated one coordinate value at a time (which means that some data values get filled), and updates happen across sessions, so I need to store the Dataset.

I use encoding on the str arrays only to set the fill value, because otherwise they get converted to an incompatible type when I load the Dataset from file.

Setting the encoding on the entire Dataset at the .to_netcdf call feels inefficient because I only want it on the str type arrays, so in this particular case it would be inconvenient to drop the encoding property altogether from the variables.
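
For reference, the encoding argument of to_netcdf takes a nested dict keyed by variable name, so it can target only selected variables. The sketch below builds such a dict for the object-dtype variables only. The helper name object_var_encoding and the extra foofloat variable are hypothetical, added for illustration, and the fill value 'nan' simply mirrors the one used elsewhere in this thread. Whether passing the fill value this way round-trips string arrays as desired has not been verified here.

```python
import numpy as np
import xarray as xr


def object_var_encoding(ds, fill_value="nan"):
    """Build a per-variable encoding dict for object-dtype variables only.

    Hypothetical helper, not part of xarray.
    """
    return {
        name: {"_FillValue": fill_value}
        for name, var in ds.data_vars.items()
        if var.dtype == object
    }


ds = xr.Dataset(
    {
        "foostr": ("dim", np.array(["one", np.nan, np.nan], dtype=object)),
        "foofloat": ("dim", np.array([1.0, np.nan, np.nan])),
    },
    coords={"dim": [1, 2, 3]},
)

# Only the object-dtype variable receives an entry.
encoding = object_var_encoding(ds)
print(encoding)

# The dict can then be passed at write time (needs a netCDF backend installed):
# ds.to_netcdf("test.nc", encoding=encoding)
```

Because the dict is built at write time from the dtypes, it does not depend on the encoding property having survived earlier operations.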

@kmuehlbauer
Contributor

@jpdehollain Thanks for the additional context. Yes, this is a somewhat unsatisfactory situation. I have issues like this in my workflows, too. I usually wrap apply_ufunc in another function that fixes or copies the encoding as needed. It might be a bit disturbing, but in the end you have full control.
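
A minimal sketch of such a wrapper, assuming the encoding should simply be copied from the first DataArray argument. The name apply_ufunc_keep_encoding is hypothetical and not part of xarray, and resolving conflicting encodings across several inputs is left out.

```python
import numpy as np
import xarray as xr


def apply_ufunc_keep_encoding(func, *args, **kwargs):
    """Call xr.apply_ufunc, then copy the encoding of the first DataArray input.

    Hypothetical helper, not part of xarray.
    """
    result = xr.apply_ufunc(func, *args, **kwargs)
    # Assumption: the first DataArray argument is the source of the encoding.
    source = next((a for a in args if isinstance(a, xr.DataArray)), None)
    if source is not None and isinstance(result, xr.DataArray):
        # Copy the dict so the two arrays do not share one encoding object.
        result.encoding = dict(source.encoding)
    return result


xarr1 = xr.DataArray(np.array([1, 2, 3]))
xarr1.encoding = {"dummy": "baz"}
xarr2 = apply_ufunc_keep_encoding(lambda x: x + 1, xarr1)
print(xarr1.encoding, xarr2.encoding)
```

Keyword arguments such as keep_attrs are passed through unchanged, so the wrapper can stand in for direct apply_ufunc calls that return a single DataArray.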

@dcherian
Contributor

dcherian commented May 8, 2025

You could also just stick the information in attrs. I believe the encoding logic will look there too.

@jpdehollain
Author

You could also just stick the information in attrs. I believe the encoding logic will look there too

Thanks for the suggestion @dcherian. I just tried that but it didn't work for me, e.g., if I create:

import numpy as np
import xarray as xr

ds = xr.Dataset({'foostr': ('dim', np.array(['one', np.nan, np.nan], dtype=object))},
                coords={'dim': [1, 2, 3]})
# ds['foostr'].encoding = {'_FillValue': 'nan'}
ds['foostr'].attrs = {'_FillValue': 'nan'}

and then save it and load it:

ds.to_netcdf('test.nc')
xr.load_dataset('test.nc')

The string array gets loaded with empty strings. If I instead use the commented-out line above (setting encoding rather than attrs), the dataset loads correctly.
