Description
What happened?
In the temporal APIs of the xCDAT package, we generate "labeled" time coordinates, which allow grouping across multiple time components (e.g., "year", "month", etc.), since Xarray does not currently support this feature. These labeled time coordinates are added to the existing time dimension, then we use Xarray's .groupby() method to group on these "labeled" time coordinates.
However, I noticed .groupby() is much slower after adding these auxiliary coordinates to the time dimension. This performance slowdown also affects grouping on the original time dimension coordinates, which I don't think should happen.
In the MVCE, I generate a dummy dataset with a shape of (10000, 180, 360).
- Grouping on time.month without auxiliary coords: 0.0119 seconds
- Grouping on time.month with auxiliary coords: 0.4892 seconds
- Grouping on auxiliary coords: 0.5712 seconds
Note that the slowdown becomes much more pronounced as the dataset size increases.
What did you expect to happen?
Xarray's groupby() should not slow down after adding auxiliary time coordinates to the time dimension (I think?).
Minimal Complete Verifiable Example
import timeit
# Data shape is (10000, 180, 360)
setup_code = """
import cftime
import numpy as np
import pandas as pd
import xarray as xr
# Define coordinates
lat = np.linspace(-90, 90, 180)
lon = np.linspace(-180, 180, 360)
time = pd.date_range("1850-01-01", periods=10000, freq="D")
# Generate random data
data = np.random.rand(len(time), len(lat), len(lon))
# Create dataset
ds = xr.Dataset(
    {"random_data": (["time", "lat", "lon"], data)},
    coords={"lat": lat, "lon": lon, "time": time},
)
"""
# Setup code to add auxiliary coords to time dimension.
setup_code2 = """
labeled_time = []
for coord in ds.time:
    labeled_time.append(cftime.datetime(coord.dt.year, coord.dt.month, 1))
ds.coords["year_month"] = xr.DataArray(data=labeled_time, dims="time")
"""
# 1. Group by month component (without auxiliary coords) -- very fast
# Test case 1 execution time: 0.011903275270015001 seconds
# -----------------------------------------------------------------------------
test_case1 = """
ds_gb1 = ds.groupby("time.month")
"""
time_case1 = timeit.timeit(test_case1, setup=setup_code, globals=globals(), number=5)
print(f"Test case 1 execution time: {time_case1} seconds")
# 2. Group by month component (with auxiliary coords) -- very slow now
# Test case 2 execution time: 0.48919129418209195 seconds
# -----------------------------------------------------------------------------
time_case2 = timeit.timeit(
    test_case1, setup=setup_code + setup_code2, globals=globals(), number=5
)
print(f"Test case 2 execution time: {time_case2} seconds")
# 3. Group by year_month coordinates -- very slow
# Test case 3 execution time: 0.5711932387202978 seconds
# -----------------------------------------------------------------------------
test_case3 = """
ds_gb3 = ds.groupby("year_month")
"""
time_case3 = timeit.timeit(
    test_case3, setup=setup_code + setup_code2, globals=globals(), number=5
)
print(f"Test case 3 execution time: {time_case3} seconds")
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
- Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
No response
Anything else we need to know?
Related to PR #689 in the xCDAT repo.
Our workaround was to replace the dimension coordinates with the auxiliary coordinates for grouping purposes, which sped up grouping significantly.
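For illustration, here is a minimal sketch of that kind of workaround. It is not the actual xCDAT code: the dataset is much smaller than the MVCE, the name `ds_labeled` is hypothetical, and the labels are built with pandas instead of `cftime.datetime` (an assumption made to keep the sketch short, so timings may differ from the numbers above).

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small dummy dataset (much smaller than the MVCE, to keep the sketch quick).
time = pd.date_range("1850-01-01", periods=1000, freq="D")
lat = np.linspace(-90, 90, 4)
lon = np.linspace(-180, 180, 8)
data = np.random.rand(len(time), len(lat), len(lon))
ds = xr.Dataset(
    {"random_data": (["time", "lat", "lon"], data)},
    coords={"lat": lat, "lon": lon, "time": time},
)

# Labeled time: the first day of each (year, month).
# Assumption: built with pandas here; the MVCE uses cftime.datetime objects.
year_month = time.to_period("M").to_timestamp()

# Hypothetical workaround: overwrite the "time" dimension coordinate with the
# labels on a new object, instead of attaching them as an auxiliary coord.
ds_labeled = ds.assign_coords(time=("time", year_month))

# Group directly on the (now labeled) dimension coordinate.
monthly_mean = ds_labeled.groupby("time").mean()
```

Because `assign_coords` returns a new object, the original `ds` keeps its daily time coordinates, and only the copy used for grouping carries the labels.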
Environment
xarray: 2024.7.0
pandas: 2.2.2
numpy: 2.1.0
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 73.0.1
pip: 24.2
conda: None
pytest: None
mypy: None
IPython: 8.27.0
sphinx: None