
handle default fill value #2742


Open
karl-malakoff opened this issue Feb 5, 2019 · 3 comments

Comments

@karl-malakoff

karl-malakoff commented Feb 5, 2019

If a variable does not define a _FillValue attribute, the 'default fill value' is normally used where data is masked. The netCDF4 library does this by default, and the behaviour can be controlled with the set_auto_mask() function.

For example loading a NetCDF with no explicit fill value set:

In [92]: from netCDF4 import Dataset

In [93]: osd = Dataset('os150nb.nc', 'r')

In [94]: osd['u']
Out[94]:
<class 'netCDF4._netCDF4.Variable'>
float32 u(time, depth_cell)
    missing_value: 1e+38
    long_name: Zonal velocity component
    units: meter second-1
    C_format: %7.2f
    data_min: -0.6097069
    data_max: 0.6496426
unlimited dimensions:
current shape = (6830, 60)
filling on, default _FillValue of 9.969209968386869e+36 used

In [95]: osd['u'][1000]
Out[95]:
masked_array(data=[0.09373848885297775, 0.08173848688602448,
                   0.0697384923696518, 0.12273849546909332,
                   0.11573849618434906, 0.1387384980916977,
                   0.17173849046230316, 0.17673850059509277,
                   0.17673850059509277, 0.16373848915100098,
                   0.1857384890317917, 0.17673850059509277,
                   0.20173849165439606, 0.20973849296569824,
                   0.2037384957075119, 0.2297385036945343,
                   0.23273849487304688, 0.22873848676681519,
                   0.24073849618434906, 0.22873848676681519,
                   0.23073849081993103, 0.23273849487304688,
                   0.24973849952220917, 0.2467384934425354,
                   0.2207385003566742, 0.22773849964141846,
                   0.2387385070323944, 0.21473848819732666,
                   0.23973849415779114, 0.23673850297927856,
                   0.2517384886741638, 0.25273850560188293,
                   0.21973849833011627, 0.2387385070323944,
                   0.2207385003566742, 0.22373849153518677,
                   0.23473849892616272, 0.21073849499225616,
                   0.2247384935617447, --, --, --, --, --, --, --, --, --,
                   --, --, --, --, --, --, --, --, --, --, --, --],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False,  True,
                    True,  True,  True,  True,  True,  True,  True,  True,
                    True,  True,  True,  True,  True,  True,  True,  True,
                    True,  True,  True,  True],
       fill_value=9.96921e+36,
            dtype=float32)

The resulting array is a masked array where missing values are masked. You can see the default fill value has been given in the variable output.

When loading the same NetCDF file with xarray, the default fill value appears as data wherever netCDF4 would have masked the values.

In [107]: os150 = xr.open_dataset('os150nb.nc', decode_cf=True, mask_and_scale=True, decode_coords=True)
                                                                                                        
In [108]: os150.u[1000]
Out[108]:
<xarray.DataArray 'u' (depth_cell: 60)>
array([9.373849e-02, 8.173849e-02, 6.973849e-02, 1.227385e-01, 1.157385e-01,
       1.387385e-01, 1.717385e-01, 1.767385e-01, 1.767385e-01, 1.637385e-01,
       1.857385e-01, 1.767385e-01, 2.017385e-01, 2.097385e-01, 2.037385e-01,
       2.297385e-01, 2.327385e-01, 2.287385e-01, 2.407385e-01, 2.287385e-01,
       2.307385e-01, 2.327385e-01, 2.497385e-01, 2.467385e-01, 2.207385e-01,
       2.277385e-01, 2.387385e-01, 2.147385e-01, 2.397385e-01, 2.367385e-01,
       2.517385e-01, 2.527385e-01, 2.197385e-01, 2.387385e-01, 2.207385e-01,
       2.237385e-01, 2.347385e-01, 2.107385e-01, 2.247385e-01, 9.969210e+36,
       9.969210e+36, 9.969210e+36, 9.969210e+36, 9.969210e+36, 9.969210e+36,
       9.969210e+36, 9.969210e+36, 9.969210e+36, 9.969210e+36, 9.969210e+36,
       9.969210e+36, 9.969210e+36, 9.969210e+36, 9.969210e+36, 9.969210e+36,
       9.969210e+36, 9.969210e+36, 9.969210e+36, 9.969210e+36, 9.969210e+36])
Coordinates:
    time     datetime64[ns] 2018-11-26T10:24:53.971200
Dimensions without coordinates: depth_cell
Attributes:
    long_name:  Zonal velocity component
    units:      meter second-1
    C_format:   %7.2f
    data_min:   -0.6097069
    data_max:   0.6496426

While this behaviour is correct in the sense that xarray has followed the NetCDF specification, it is no longer clear that those values were missing in the original NetCDF file.

The attributes don't mention the fill value, so even though it lies outside the specified data range, one could be forgiven for thinking it is an actual value in the DataArray. It's especially confusing when you've asked for CF decoding and these values are still present.

Furthermore, if you look at the encoding for this DataArray, you can see that it incorrectly reports the missing_value as the _FillValue:

In [136]: os150['u'].encoding
Out[136]:
{'source': 'C:\\Data\\adcp_processing\\in2018_v06\\postproc\\os150nb\\contour\\os150nb.nc',
 'original_shape': (6830, 60),
 '_FillValue': 1e+38,
 'dtype': dtype('float32')}

Unless I'm missing something I think this behaviour should be changed to either:

  • Explicitly mention that the default fill value is being used in the DataArray attributes or have some other way of identifying it
    or
  • Mask this value with nan/missing_value in the resulting DataArray
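
As a stopgap for the second option, the default fill value can be masked by hand after loading. The following is only a minimal sketch using plain NumPy, not xarray's own decoding; the helper name `mask_default_fill` is hypothetical, and the constant is copied from the netCDF4 variable summary shown earlier.

```python
import numpy as np

# Default netCDF fill value for 32-bit floats, copied from the
# "default _FillValue of ... used" line in the netCDF4 output above.
DEFAULT_FILL_F4 = np.float32(9.969209968386869e+36)


def mask_default_fill(values, fill=DEFAULT_FILL_F4):
    """Return a float32 copy of `values` with entries equal to `fill` set to NaN."""
    values = np.asarray(values, dtype=np.float32)
    return np.where(values == fill, np.float32(np.nan), values)


# Two real velocities followed by two unwritten (default-filled) cells.
sample = np.array(
    [0.2107385, 0.2247385, 9.969209968386869e+36, 9.969209968386869e+36],
    dtype=np.float32,
)
cleaned = mask_default_fill(sample)
print(cleaned)
```

On an xarray object the same idea could be written as `os150.u.where(os150.u != DEFAULT_FILL_F4)`, which keeps values where the condition holds and puts NaN elsewhere.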

Note that the NetCDF file I've used here isn't publicly available yet but I can add a link to it soon once it is.

@shoyer
Member

shoyer commented Feb 5, 2019

Xarray's handling of missing values should be consistent with netCDF4-Python. In theory, both follow CF conventions.

I wonder if this is due to numeric overflow of some sort? It would definitely be helpful to have a file which reproduces this behavior so we can debug what's going on. Also, please share the output of xarray.show_versions().

@karl-malakoff
Author

Thanks for the response.

The file is now available here: http://www.marine.csiro.au/cgi-bin/marlin-dl/Investigator_NF/in2018_v06/data/in2018_v06_ADCP_nc.zip

I used the one titled 'in2018_v06_os150nb.nc' in my examples but both files display the same behaviour.

The output from xarray.show_versions() is below:

INSTALLED VERSIONS

commit: None
python: 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:25:24) [MSC v.1900 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
libhdf5: 1.10.3
libnetcdf: 4.6.1

xarray: 0.11.3
pandas: 0.23.0
numpy: 1.14.2
scipy: 0.19.0
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.0b1
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.2
distributed: 1.21.4
matplotlib: 2.0.2
cartopy: 0.16.0
seaborn: 0.7.1
setuptools: 40.6.3
pip: 19.0.1
conda: 4.5.11
pytest: 3.0.7
IPython: 7.2.0
sphinx: 1.5.6

@cofinoa

cofinoa commented Oct 26, 2022

xarray assumes that _FillValue is always an attribute in a netCDF file, but it is only an attribute when a _FillValue has been explicitly added to the variable.

The _FillValue should be requested only when the variable has fill mode on. Then, if no _FillValue attribute has been defined, the netCDF default _FillValue should be used.
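
The lookup order described above can be sketched in plain Python. This is an illustration of the proposed logic only, not xarray's implementation; `resolve_fill_value`, its parameters, and the type-code keys are hypothetical names, and the table is a hand-copied subset of the netCDF default fill constants.

```python
# Hand-copied subset of the netCDF default fill values (NC_FILL_* constants),
# keyed by NumPy-style type codes. This table is an assumption of the sketch.
DEFAULT_FILL_VALUES = {
    "i2": -32767,
    "i4": -2147483647,
    "f4": 9.969209968386869e+36,
    "f8": 9.969209968386869e+36,
}


def resolve_fill_value(attrs, dtype_code, fill_mode_on=True):
    """Pick the value to treat as missing, or None if nothing should be masked.

    attrs: dict of the variable's attributes as stored in the file.
    dtype_code: type code such as "f4".
    fill_mode_on: whether the variable has fill mode enabled.
    """
    if not fill_mode_on:
        # No pre-filling happened, so there is no fill value to look for.
        return None
    # An explicit _FillValue wins; otherwise fall back to the type's default.
    return attrs.get("_FillValue", DEFAULT_FILL_VALUES.get(dtype_code))


# Attributes of the 'u' variable from the report: no explicit _FillValue.
u_attrs = {"missing_value": 1e+38, "units": "meter second-1"}
fill = resolve_fill_value(u_attrs, "f4")
print(fill)
```

With this ordering, the `u` variable from the report would be masked at the default float fill value instead of being left undecoded.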
