xarray.open_dataset with chunks={} returns a single chunk and not engine (h5netcdf) preferred chunks #8691

Closed
abarciauskas-bgse opened this issue Jan 31, 2024 · 4 comments
Labels: bug, needs triage (Issue that has not been reviewed by xarray team member)

Comments


abarciauskas-bgse commented Jan 31, 2024

What happened?

When opening MUR SST netCDFs from S3, xarray.open_dataset(file, engine="h5netcdf", chunks={}) returns a single chunk, whereas the h5netcdf library reports a chunk shape of (1, 1023, 2047).

A notebook version of the code below includes the output: https://gist.github.com/abarciauskas-bgse/9366e04d2af09b79c9de466f6c1d3b90

What did you expect to happen?

I expected the chunks={} option to return the same chunks (1, 1023, 2047) exposed by the h5netcdf engine.
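For context, each MUR v4.1 file stores analysed_sst on a (time, lat, lon) = (1, 17999, 36000) grid (grid size assumed from the product description, not verified here). Given the on-disk chunk shape (1, 1023, 2047), the dask-style per-dimension chunk tuples that chunks={} was expected to produce can be sketched with a small helper (dask_chunks below is hypothetical, for illustration only):

```python
def dask_chunks(shape, chunk_shape):
    # Tile each dimension into full chunks plus a smaller remainder chunk,
    # mirroring the tuple-of-tuples layout dask reports per dimension.
    out = []
    for size, chunk in zip(shape, chunk_shape):
        full, rem = divmod(size, chunk)
        out.append((chunk,) * full + ((rem,) if rem else ()))
    return tuple(out)

shape = (1, 17999, 36000)       # assumed MUR grid: (time, lat, lon)
chunk_shape = (1, 1023, 2047)   # on-disk HDF5 chunk shape

time_c, lat_c, lon_c = dask_chunks(shape, chunk_shape)
print(len(lat_c), len(lon_c))   # 18 chunks along lat and lon each
```

With preferred chunks honored, the dask graph would cover the grid with 18 x 18 spatial chunks per file instead of one chunk spanning the whole array.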

Minimal Complete Verifiable Example

#!/usr/bin/env python
# coding: utf-8

# This notebook looks at how xarray and h5netcdf return different chunks.

import pandas as pd
import h5netcdf
import s3fs
import xarray as xr

dates = [
    d.to_pydatetime().strftime('%Y%m%d')
    for d in pd.date_range('2023-02-01', '2023-03-01', freq='D')
]

SHORT_NAME = 'MUR-JPL-L4-GLOB-v4.1'
s3_fs = s3fs.S3FileSystem(anon=False)
var = 'analysed_sst'

def make_filename(time):
    base_url = f's3://podaac-ops-cumulus-protected/{SHORT_NAME}/'
    # example file: "/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc"
    return f'{base_url}{time}090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'

s3_urls = [make_filename(d) for d in dates]


def print_chunk_shape(s3_url):
    try:
        # Open the dataset using xarray
        f = s3_fs.open(s3_url)
        ds = xr.open_dataset(f, engine='h5netcdf', chunks={})

        # Print chunk shapes for the variable of interest
        print(f"\nChunk shapes for {s3_url}:")
        if ds[var].chunks is not None:
            print(f"xarray open_dataset chunks for {var}: {ds[var].chunks}")
        else:
            print(f"xarray open_dataset chunks for {var}: Not chunked")

        # Reopen the same file object with h5netcdf to read the on-disk chunk shape
        with h5netcdf.File(f, 'r') as h5file:
            h5_var = h5file[var]
            if h5_var.chunks:
                print(f"h5netcdf chunks for {var}:", h5_var.chunks)
            else:
                print("h5netcdf dataset is not chunked.")

    except Exception as e:
        print(f"Failed to process {s3_url}: {e}")

for s3_url in s3_urls:
    print_chunk_shape(s3_url)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.198-187.748.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.1
libnetcdf: 4.9.2

xarray: 2023.6.0
pandas: 2.0.3
numpy: 1.24.4
scipy: 1.11.1
netCDF4: 1.6.4
pydap: installed
h5netcdf: 1.2.0
h5py: 3.9.0
Nio: None
zarr: 2.15.0
cftime: 1.6.2
nc_time_axis: 1.4.1
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.6.1
distributed: 2023.6.1
matplotlib: 3.7.1
cartopy: 0.21.1
seaborn: 0.12.2
numbagg: None
fsspec: 2023.6.0
cupy: None
pint: 0.22
sparse: 0.14.0
flox: 0.7.2
numpy_groupies: 0.9.22
setuptools: 68.0.0
pip: 23.1.2
conda: None
pytest: 7.4.0
mypy: None
IPython: 8.14.0
sphinx: None

@abarciauskas-bgse added the bug and needs triage labels on Jan 31, 2024

weiji14 commented Jan 31, 2024

Quoting from https://docs.xarray.dev/en/v2024.01.1/generated/xarray.open_dataset.html:

chunks={} loads the dataset with dask using engine preferred chunks if exposed by the backend, otherwise with a single chunk for all arrays.

It might be that the h5netcdf engine doesn't expose the preferred chunk scheme?

@abarciauskas-bgse

how would h5netcdf expose the preferred chunk scheme if not through ds[var].chunks?


weiji14 commented Jan 31, 2024

Oh yeah, looking at

    if var.chunks:
        encoding["preferred_chunks"] = dict(zip(var.dimensions, var.chunks))

this was implemented for h5netcdf in #7948, but you're using an old version of xarray (2023.6.0). You might need to upgrade to v2023.09.0 or newer.
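The quoted backend logic simply maps dimension names to on-disk chunk lengths. As a stand-alone sketch (dimension names and order are assumed from the MUR files, not verified here), this is the mapping that should appear in ds["analysed_sst"].encoding["preferred_chunks"] after upgrading:

```python
# Stand-alone sketch of what the h5netcdf backend stores per the quoted
# snippet from #7948: dimension names zipped with the on-disk chunk shape.
dimensions = ("time", "lat", "lon")   # assumed dimension order for analysed_sst
chunk_shape = (1, 1023, 2047)         # on-disk chunk shape reported by h5netcdf

preferred_chunks = dict(zip(dimensions, chunk_shape))
print(preferred_chunks)  # {'time': 1, 'lat': 1023, 'lon': 2047}
```

With chunks={}, xarray uses this mapping (when the backend exposes it) instead of loading each array as a single chunk.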

@abarciauskas-bgse

Aha, thank you @weiji14, I will close this then. I think we need to upgrade the xarray version in the JupyterHub we are using.
