Clip and aggregate spatio-temporal gridded data in netCDF or GRIB format to specified areas.
`stgrid2area` is a Python package that simplifies the workflow of extracting and aggregating spatio-temporal gridded data (e.g., climate data, weather forecasts) to geometrically defined areas such as catchments, administrative boundaries, or any other geographical region. The package handles clipping the gridded data to the specified area's boundaries and then calculating spatial statistics across the area. It also provides processor classes for efficiently processing large gridded datasets and many areas in parallel, optimized for HPC systems.
*Figure: Clip and aggregate spatio-temporal gridded data to an area. Image from https://doi.org/10.5194/essd-16-5625-2024.*
Extracting and aggregating data to areas is a common task in fields like hydrology and meteorology, where spatially distributed time series data (e.g., precipitation, air temperature) need to be aggregated to catchments or other regions of interest. This is done by clipping the gridded data to the area boundaries and then spatially aggregating the data, meaning that for each time step, spatial statistics (e.g., mean, min, max, standard deviation) are calculated across the area. The result is a time series of aggregated values, which can be saved as a CSV file.
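As a sketch of the underlying idea, the two steps can be expressed with plain rioxarray/xarray (the file paths, the variable name `precip`, and the spatial dimension names `x`/`y` are illustrative assumptions; stgrid2area wraps this workflow and adds weighted statistics and parallel processing):

```python
import xarray as xr
import geopandas as gpd
import rioxarray  # noqa: F401 -- registers the .rio accessor

stgrid = xr.open_dataset("precipitation.nc")  # spatio-temporal gridded data
catchment = gpd.read_file("catchment.gpkg")   # area geometry

# Step 1: clip the grid to the catchment boundary
stgrid = stgrid.rio.write_crs("EPSG:4326")
clipped = stgrid.rio.clip(catchment.geometry, crs=catchment.crs)

# Step 2: aggregate each time step to a single value (unweighted mean here)
ts = clipped["precip"].mean(dim=["x", "y"]).to_dataframe()
ts.to_csv("catchment_precip_mean.csv")
```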
- Clip spatio-temporal gridded data to area boundaries → `xarray.Dataset` or save to `.nc`
- Aggregate variables using various spatial statistics (mean, min, max, stdev, quantiles) → `pd.DataFrame` or save to `.csv`
- Support for both weighted (exactextract) and unweighted (xarray) spatial statistics
- Efficient parallel processing options for large datasets and many areas
- HPC integration with single-node and MPI multi-node support for large-scale processing using Dask
```bash
pip install stgrid2area
```
The workflow for clipping and aggregating spatio-temporal gridded data to specified areas usually involves the following steps:
- Read the spatio-temporal gridded (stgrid) data (e.g., NetCDF or GRIB files) using `xarray`.
- Read the geometries (e.g., catchment boundaries, administrative boundaries) using `geopandas`.
- Do preprocessing if necessary, e.g., set the coordinate reference system (CRS) of the stgrid data (`stgrid.rio.write_crs("EPSG:4326")`), reproject the geometries to match the CRS of the stgrid data using `geometries.to_crs(stgrid.rio.crs)`, or reformat the stgrid data.
- Create an `Area` object for each geometry, which encapsulates the clipping and aggregation logic. You can use the helper function `geodataframe_to_areas` to create a list of `Area` objects from a GeoDataFrame.
- Clip and aggregate the gridded data to the areas. You can either use `area.clip()` and `area.aggregate()` directly or use a processor class for parallel processing.
```python
import xarray as xr
import geopandas as gpd
from stgrid2area import Area

# Read the spatio-temporal gridded data
stgrid = xr.open_dataset("path/to/climate_data.nc")
stgrid = stgrid.rio.write_crs("EPSG:4326")  # Set the CRS if not already set

# Read the geometry data (e.g., a catchment boundary)
geometry = gpd.read_file("path/to/catchments.gpkg")
geometry = geometry.to_crs(stgrid.rio.crs)  # Ensure CRS matches the gridded data

# Create an Area object
area = Area(geometry=geometry, id="catchment1", output_dir="/path/to/output")

# Clip the gridded data to the area
clipped_data = area.clip(stgrid)

# Aggregate a single variable across the area
aggregated = area.aggregate(
    clipped_data,
    variables="temperature",
    method="exact_extract",
    operations=["mean", "min", "max"]
)

# Save the aggregated data
aggregated.to_csv("/path/to/output/temperature_stats.csv")

# Aggregate multiple variables at once
aggregated = area.aggregate(
    clipped_data,
    variables=["temperature", "precipitation", "wind_speed"],
    method="exact_extract",
    operations=["mean", "max"]
)
# The result will contain columns like: temperature_mean, temperature_max,
# precipitation_mean, precipitation_max, wind_speed_mean, wind_speed_max
```
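Since the aggregation result is a plain `pandas.DataFrame` with one row per time step, standard pandas tooling applies, for example (assuming a datetime index):

```python
# Select columns of interest and resample the time series to daily means
daily = aggregated[["temperature_mean", "precipitation_mean"]].resample("D").mean()
```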
To process many areas in parallel, create the `Area` objects with the `geodataframe_to_areas` helper and hand them to a processor class:

```python
import xarray as xr
import geopandas as gpd
from stgrid2area import LocalDaskProcessor, geodataframe_to_areas

# Read the spatio-temporal gridded data
stgrid = xr.open_dataset("path/to/climate_data.nc")

# Read multiple geometries (e.g., multiple catchments)
geometries = gpd.read_file("path/to/catchments.gpkg")

# Create a list of Area objects
areas = geodataframe_to_areas(
    geometries,
    output_dir="/path/to/output",
    id_column="catchment_id"  # Column containing unique identifiers for each area
)

# Create a single-node processor with 4 workers
processor = LocalDaskProcessor(
    areas=areas,
    stgrid=stgrid,
    variables=["temperature", "precipitation"],
    method="exact_extract",
    operations=["mean", "max"],
    n_workers=4,
    save_nc=True,  # Save clipped NetCDF files
    save_csv=True  # Save aggregated CSV files
)

# Run the processing
processor.run()
```
The `Area` class represents a geographical area (e.g., a catchment or administrative boundary) and provides methods to clip and aggregate gridded data to this area.
```python
from stgrid2area import Area

# Create an Area object
area = Area(
    geometry=geometry_gdf,        # geopandas GeoDataFrame containing the geometry
    id="unique_identifier",       # Unique identifier for this area
    output_dir="/path/to/output"  # Directory to save output files
)

# Clip the gridded data to the area's geometry
clipped_data = area.clip(
    stgrid,            # xarray.Dataset: The gridded data to clip
    all_touched=True,  # Whether to include cells that are partially within the area
    save_result=True,  # Whether to save the clipped data to disk
    skip_exist=False,  # Whether to skip clipping if the output file already exists
    filename=None      # Optional filename for the output file
)

# Aggregate the clipped data spatially
aggregated_data = area.aggregate(
    clipped_data,      # xarray.Dataset: The clipped gridded data to aggregate
    variables=["temperature", "precipitation"],  # Variable(s) to aggregate
    method="exact_extract",  # Aggregation method: "exact_extract" or "xarray"
    operations=["mean", "min", "max", "stdev"],  # Statistical operations
    save_result=True,  # Whether to save the aggregated data to disk
    skip_exist=False,  # Whether to skip aggregation if the output file already exists
    filename=None      # Optional filename for the output file
)
```
stgrid2area supports two aggregation methods:

- `exact_extract`: Uses the exactextract library to calculate weighted statistics based on cell coverage. This provides more accurate results, especially for areas with irregular shapes and cells that are only partially within the area.
- `xarray`: Uses xarray's built-in statistical functions to calculate unweighted statistics.
For processing multiple areas efficiently, stgrid2area provides several processor classes:
The `LocalDaskProcessor` is intended for processing on a single machine or a single-node HPC batch job with multiple cores. It creates a local Dask cluster and distributes the workload across multiple workers.
```python
from stgrid2area import LocalDaskProcessor

processor = LocalDaskProcessor(
    areas=areas,  # List of Area objects
    stgrid=stgrid,  # The gridded data
    variables="temperature",  # Variable(s) to aggregate
    method="exact_extract",  # Aggregation method
    operations=["mean", "max"],  # Statistical operations
    n_workers=4,  # Number of parallel workers
    skip_exist=True,  # Skip areas that already have output files
    batch_size=10,  # Process areas in batches of this size
    save_nc=True,  # Save clipped NetCDF files
    save_csv=True  # Save aggregated CSV files
)

processor.run()
```
The `MPIDaskProcessor` is designed for processing with MPI (Message Passing Interface) on HPC systems with multiple nodes, allowing efficient parallel processing of large datasets. To use it, you need to initialize a Dask MPI client yourself and pass it to `processor.run()`. On the HPC system, you would typically launch the Python processing script with `mpirun` inside a SLURM job script or similar.
```python
from stgrid2area import MPIDaskProcessor
from dask.distributed import Client
import dask_mpi

# Initialize dask_mpi
dask_mpi.initialize()
client = Client()

processor = MPIDaskProcessor(
    areas=areas,  # List of Area objects
    stgrid=stgrid,  # The gridded data
    variables=["temperature", "precipitation"],  # Variable(s) to aggregate
    method="exact_extract",  # Aggregation method
    operations=["mean", "max"],  # Statistical operations
    skip_exist=True,  # Skip areas that already have output files
    batch_size=100  # Process areas in batches of this size
)

processor.run(client=client)
```
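On the HPC side, a minimal SLURM job script sketch could look like the following (resource numbers and the script name `process_areas.py` are placeholders; dask-mpi uses rank 0 for the scheduler and rank 1 for the client script, so at least three MPI ranks are required):

```bash
#!/bin/bash
#SBATCH --job-name=stgrid2area
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=04:00:00

# Rank 0 becomes the Dask scheduler, rank 1 runs the client script,
# and all remaining ranks become Dask workers
mpirun -n $SLURM_NTASKS python process_areas.py
```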
- LocalDaskProcessor: Use locally on a single machine / a single-node HPC batch queue with multiple cores.
- MPIDaskProcessor: Use for large datasets on HPC systems using MPI for multi-node queues and distributed computing.
If your stgrid data is too large to fit in memory, you can process it in chunks. The `LocalDaskProcessor` and `MPIDaskProcessor` classes can handle a list of xarray datasets; the clipped and aggregated results are saved separately for each input dataset, so you have to take care of post-processing if you want to combine the results later.
```python
import xarray as xr
from stgrid2area import LocalDaskProcessor

# Read multiple files, each containing data for a different time period
stgrid_files = [
    "path/to/data_2020.nc",
    "path/to/data_2021.nc",
    "path/to/data_2022.nc"
]

# Create a list of xarray datasets
stgrids = [xr.open_dataset(file) for file in stgrid_files]

# Create a processor with multiple stgrids
# (areas: list of Area objects, e.g. created with geodataframe_to_areas)
processor = LocalDaskProcessor(
    areas=areas,
    stgrid=stgrids,  # List of xarray datasets
    variables="temperature",
    method="exact_extract",
    operations=["mean", "max"],
    n_workers=4
)

# Run the processing
processor.run()
```
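Because each input dataset is written out separately, combining the per-period results is up to you. A minimal post-processing sketch using pandas (the glob pattern and filenames are assumptions and depend on how the output files in your `output_dir` are named):

```python
import glob
import pandas as pd

# Concatenate the per-period CSVs of one area along the time axis
# (the path pattern below is illustrative)
files = sorted(glob.glob("/path/to/output/catchment1/*.csv"))
combined = pd.concat(
    [pd.read_csv(f, index_col=0, parse_dates=True) for f in files]
).sort_index()
combined.to_csv("/path/to/output/catchment1_combined.csv")
```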
The stgrid2area-workflows repository is a collection of data processing workflows implemented with the `stgrid2area` package. It is mainly used for processing meteorological input data for the CAMELS-DE dataset and also includes Python scripts and SLURM job scripts for running the workflows on HPC systems. Hence, it can serve as a reference for how to use `stgrid2area` in practice, even though the code is still under development and not always perfectly documented.