xarray.open_dataset with chunks={} returns a single chunk and not engine (h5netcdf) preferred chunks #8691

Closed
abarciauskas-bgse opened this issue Jan 31, 2024 · 4 comments
Labels: bug, needs triage (Issue that has not been reviewed by xarray team member)

Comments


abarciauskas-bgse commented Jan 31, 2024

What happened?

When opening MUR SST netCDFs from S3, xarray.open_dataset(file, engine="h5netcdf", chunks={}) returns a single chunk, whereas the h5netcdf library reports a chunk shape of (1, 1023, 2047).

A notebook version of the code below includes the output: https://gist.github.com/abarciauskas-bgse/9366e04d2af09b79c9de466f6c1d3b90

What did you expect to happen?

I expected the chunks={} option to return the same chunks (1, 1023, 2047) exposed by the h5netcdf engine.
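For context, each MUR v4.1 file stores analysed_sst on a (time, lat, lon) = (1, 17999, 36000) grid (grid size assumed from the product description, not verified here). Given the on-disk chunk shape (1, 1023, 2047), the dask-style per-dimension chunk tuples that chunks={} was expected to produce can be sketched with a small helper (dask_chunks below is hypothetical, for illustration only):

```python
def dask_chunks(shape, chunk_shape):
    # Tile each dimension into full chunks plus a smaller remainder chunk,
    # mirroring the tuple-of-tuples layout dask reports per dimension.
    out = []
    for size, chunk in zip(shape, chunk_shape):
        full, rem = divmod(size, chunk)
        out.append((chunk,) * full + ((rem,) if rem else ()))
    return tuple(out)

shape = (1, 17999, 36000)       # assumed MUR grid: (time, lat, lon)
chunk_shape = (1, 1023, 2047)   # on-disk HDF5 chunk shape

time_c, lat_c, lon_c = dask_chunks(shape, chunk_shape)
print(len(lat_c), len(lon_c))   # 18 chunks along lat and lon each
```

With preferred chunks honored, the dask graph would cover the grid with 18 x 18 spatial chunks per file instead of one chunk spanning the whole array.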

Minimal Complete Verifiable Example

#!/usr/bin/env python
# coding: utf-8

# This notebook looks at how xarray and h5netcdf return different chunks.

import pandas as pd
import h5netcdf
import s3fs
import xarray as xr

dates = [
    d.to_pydatetime().strftime('%Y%m%d')
    for d in pd.date_range('2023-02-01', '2023-03-01', freq='D')
]

SHORT_NAME = 'MUR-JPL-L4-GLOB-v4.1'
s3_fs = s3fs.S3FileSystem(anon=False)
var = 'analysed_sst'

def make_filename(time):
    base_url = f's3://podaac-ops-cumulus-protected/{SHORT_NAME}/'
    # example file: "/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
    return f'{base_url}{time}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'

s3_urls = [make_filename(d) for d in dates]


def print_chunk_shape(s3_url):
    try:
        # Open the dataset using xarray
        f = s3_fs.open(s3_url)
        ds = xr.open_dataset(f, engine='h5netcdf', chunks={})

        # Print chunk shapes for the variable of interest
        print(f"\nChunk shapes for {s3_url}:")
        if ds[var].chunks is not None:
            print(f"xarray open_dataset chunks for {var}: {ds[var].chunks}")
        else:
            print(f"xarray open_dataset chunks for {var}: Not chunked")

        # Reopen the same file object with h5netcdf to read the on-disk chunk shape
        with h5netcdf.File(f, 'r') as h5file:
            h5_var = h5file[var]
            if h5_var.chunks:
                print(f"h5netcdf chunks for {var}:", h5_var.chunks)
            else:
                print("h5netcdf dataset is not chunked.")

    except Exception as e:
        print(f"Failed to process {s3_url}: {e}")

for s3_url in s3_urls:
    print_chunk_shape(s3_url)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.198-187.748.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.1
libnetcdf: 4.9.2

xarray: 2023.6.0
pandas: 2.0.3
numpy: 1.24.4
scipy: 1.11.1
netCDF4: 1.6.4
pydap: installed
h5netcdf: 1.2.0
h5py: 3.9.0
Nio: None
zarr: 2.15.0
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.6.1
distributed: 2023.6.1
matplotlib: 3.7.1
cartopy: 0.21.1
seaborn: 0.12.2
numbagg: None
fsspec: 2023.6.0
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.7.2
numpy_groupies: 0.9.22
setuptools: 68.0.0
pip: 23.1.2
conda: None
pytest: 7.4.0
mypy: None
IPython: 8.14.0
sphinx: None

@abarciauskas-bgse added the bug and needs triage labels on Jan 31, 2024

weiji14 commented Jan 31, 2024

Quoting from https://docs.xarray.dev/en/v2024.01.1/generated/xarray.open_dataset.html:

chunks={} loads the dataset with dask using engine preferred chunks if exposed by the backend, otherwise with a single chunk for all arrays.

It might be that the h5netcdf engine doesn't expose the preferred chunk scheme?

@abarciauskas-bgse

how would h5netcdf expose the preferred chunk scheme if not through ds[var].chunks?


weiji14 commented Jan 31, 2024

Oh yeah, looking at

    if var.chunks:
        encoding["preferred_chunks"] = dict(zip(var.dimensions, var.chunks))

this was implemented for h5netcdf in #7948, but you're using an old version of xarray (2023.6.0). You might need to upgrade to v2023.09.0 or newer.
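The quoted backend logic simply maps dimension names to on-disk chunk lengths. As a stand-alone sketch (dimension names and order are assumed from the MUR files, not verified here), this is the mapping that should appear in ds["analysed_sst"].encoding["preferred_chunks"] after upgrading:

```python
# Stand-alone sketch of what the h5netcdf backend stores per the quoted
# snippet from #7948: dimension names zipped with the on-disk chunk shape.
dimensions = ("time", "lat", "lon")   # assumed dimension order for analysed_sst
chunk_shape = (1, 1023, 2047)         # on-disk chunk shape reported by h5netcdf

preferred_chunks = dict(zip(dimensions, chunk_shape))
print(preferred_chunks)  # {'time': 1, 'lat': 1023, 'lon': 2047}
```

With chunks={}, xarray uses this mapping (when the backend exposes it) instead of loading each array as a single chunk.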

@abarciauskas-bgse

Aha, thank you @weiji14, I will close this then. I think we need to upgrade the xarray version in the JupyterHub we are using.
