Skip to content

AlexDo1/stgrid2area

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

stgrid2area

Clip and aggregate spatio-temporal gridded data in netCDF or GRIB format to specified areas.

Overview

stgrid2area is a Python package that simplifies the workflow of extracting and aggregating spatio-temporal gridded data (e.g., climate data, weather forecasts) to geometrically defined areas like catchments, administrative boundaries, or any other geographical region. The package handles the process of clipping gridded data to the specified area's boundaries and then calculating spatial statistics across the area. The package also provides processor classes for efficient processing of large gridded datasets and many areas in parallel and is optimized for HPC systems.

Clip and aggregate

Figure: Clip and aggregate spatio-temporal gridded data to area. Image from https://doi.org/10.5194/essd-16-5625-2024.

Extracting and aggregating data to areas is a common task in fields like hydrology and meteorology, where spatially distributed time series data (e.g. precipitation, air temperature) needs to be aggregated to catchments or other regions of interest. This is done by clipping the gridded data to the area boundaries and then spatially aggregating the data, meaning that for each time step, spatial statistics (e.g., mean, min, max, standard deviation) are calculated for the area. The result is a time series of aggregated values which can be saved as a CSV file.

Features

  • Clip spatio-temporal gridded data to area boundaries → xarray.Dataset or save to .nc
  • Aggregate variables using various spatial statistics (mean, min, max, stdev, quantiles) → pd.DataFrame or save to .csv
  • Support for both weighted (exactextract) and unweighted (xarray) spatial statistics
  • Efficient parallel processing options for large datasets and many areas
  • HPC integration with single-node and MPI multi-node support for large-scale processing using Dask

Installation

pip install stgrid2area

Basic Usage

The workflow for clipping and aggregating spatio-temporal gridded data to specified areas usually involves the following steps:

  1. Read the spatio-temporal gridded (stgrid) data (e.g., NetCDF or GRIB files) using xarray.
  2. Read the geometries (e.g., catchment boundaries, administrative boundaries) using geopandas.
  3. Do preprocessing if necessary, e.g. set the coordinate reference system (CRS) of the stgrid data (stgrid.rio.write_crs("EPSG:4326")) and reproject the geometries to match the CRS of the stgrid data using geometries.to_crs(stgrid.crs) or reformat the stgrid data if necessary.
  4. Create an Area object for each geometry, which encapsulates the clipping and aggregation logic. You can use the helper function geodataframe_to_areas to create a list of Area objects from a GeoDataFrame.
  5. Clip and aggregate the gridded data to the areas. You can either use area.clip() and area.aggregate() directly or use a processor class for parallel processing.

Example 1: Basic Clipping and Aggregation

import xarray as xr
import geopandas as gpd
from stgrid2area import Area

# Read the spatio-temporal gridded data
stgrid = xr.open_dataset("path/to/climate_data.nc")
stgrid = stgrid.rio.set_crs("EPSG:4326")  # Set the CRS if not already set

# Read the geometry data (e.g., a catchment boundary)
geometry = gpd.read_file("path/to/catchments.gpkg")
geometry = geometry.to_crs(stgrid.crs)  # Ensure CRS matches the gridded data

# Create an Area object
area = Area(geometry=geometry, id="catchment1", output_dir="/path/to/output")

# Clip the gridded data to the area
clipped_data = area.clip(stgrid)

# Aggregate a single variable across the area
aggregated = area.aggregate(
    clipped_data, 
    variables="temperature", 
    method="exact_extract", 
    operations=["mean", "min", "max"]
)

# Save the aggregated data
aggregated.to_csv("/path/to/output/temperature_stats.csv")

Example 2: Processing Multiple Variables

# Aggregate multiple variables at once
aggregated = area.aggregate(
    clipped_data, 
    variables=["temperature", "precipitation", "wind_speed"], 
    method="exact_extract", 
    operations=["mean", "max"]
)

# The result will contain columns like: temperature_mean, temperature_max, precipitation_mean, precipitation_max, wind_speed_mean, wind_speed_max

Example 3: Processing Multiple Areas in Parallel

import geopandas as gpd
from stgrid2area import Area, LocalDaskProcessor, geodataframe_to_areas

# Read the spatio-temporal gridded data
stgrid = xr.open_dataset('path/to/climate_data.nc')

# Read multiple geometries (e.g., multiple catchments)
geometries = gpd.read_file('path/to/catchments.gpkg')

# Create a list of Area objects
areas = geodataframe_to_areas(
    geometries, 
    output_dir="/path/to/output", 
    id_column="catchment_id"  # Column containing unique identifiers for each area
)

# Create a single-node processor with 4 workers
processor = LocalDaskProcessor(
    areas=areas,
    stgrid=stgrid,
    variables=["temperature", "precipitation"],
    method="exact_extract",
    operations=["mean", "max"],
    n_workers=4,
    save_nc=True,  # Save clipped NetCDF files
    save_csv=True  # Save aggregated CSV files
)

# Run the processing
processor.run()

Documentation

The Area Class

The Area class represents a geographical area (e.g., a catchment or administrative boundary) and provides methods to clip and aggregate gridded data to this area.

from stgrid2area import Area

# Create an Area object
area = Area(
    geometry=geometry_gdf,  # geopandas GeoDataFrame containing the geometry
    id="unique_identifier",  # Unique identifier for this area
    output_dir="/path/to/output"  # Directory to save output files
)

Clipping Data

# Clip the gridded data to the area's geometry
clipped_data = area.clip(
    stgrid,  # xarray.Dataset: The gridded data to clip
    all_touched=True,  # Whether to include cells that are partially within the area
    save_result=True,  # Whether to save the clipped data to disk
    skip_exist=False,  # Whether to skip clipping if the output file already exists
    filename=None  # Optional filename for the output file
)

Aggregating Data

# Aggregate the clipped data spatially
aggregated_data = area.aggregate(
    stgrid,  # xarray.Dataset: The gridded data to aggregate
    variables=["temperature", "precipitation"],  # Variable(s) to aggregate
    method="exact_extract",  # Aggregation method: "exact_extract" or "xarray"
    operations=["mean", "min", "max", "stdev"],  # Statistical operations
    save_result=True,  # Whether to save the aggregated data to disk
    skip_exist=False,  # Whether to skip aggregation if the output file already exists
    filename=None  # Optional filename for the output file
)

Aggregation Methods

stgrid2area supports two aggregation methods:

  1. exact_extract: Uses the exactextract library to calculate weighted statistics based on cell coverage. This provides more accurate results, especially for areas with irregular shapes and with cells that are partially within the area.

  2. xarray: Uses xarray's built-in statistical functions to calculate unweighted statistics.

Processor Classes

For processing multiple areas efficiently, stgrid2area provides several processor classes:

LocalDaskProcessor

For processing on a single machine / a single-node HPC batch queue with multiple cores. This processor class creates a Dask cluster locally and distributes the workload across multiple workers.

from stgrid2area import LocalDaskProcessor

processor = LocalDaskProcessor(
    areas=areas,  # List of Area objects
    stgrid=stgrid,  # The gridded data
    variables="temperature",  # Variable(s) to aggregate
    method="exact_extract",  # Aggregation method
    operations=["mean", "max"],  # Statistical operations
    n_workers=4,  # Number of parallel workers
    skip_exist=True,  # Skip areas that already have output files
    batch_size=10,  # Process areas in batches of this size
    save_nc=True,  # Save clipped NetCDF files
    save_csv=True  # Save aggregated CSV files
)

processor.run()

MPIDaskProcessor

For processing with MPI (Message Passing Interface) on HPC systems. This processor is designed to run on distributed systems with multiple nodes, allowing for efficient parallel processing of large datasets.
To use this method you need to initialize a Dask MPI client yourself and pass it to processor.run(). On the HPC system, you would typically launch the Python processing script using mpirun inside a SLURM job script or similar.

from stgrid2area import MPIDaskProcessor
from dask.distributed import Client
import dask_mpi

# Initialize dask_mpi
dask_mpi.initialize()
client = Client()

processor = MPIDaskProcessor(
    areas=areas,  # List of Area objects
    stgrid=stgrid,  # The gridded data
    variables=["temperature", "precipitation"],  # Variable(s) to aggregate
    method="exact_extract",  # Aggregation method
    operations=["mean", "max"],  # Statistical operations
    skip_exist=True,  # Skip areas that already have output files
    batch_size=100  # Process areas in batches of this size
)

processor.run(client=client)

When to Use Which Processor

  • LocalDaskProcessor: Use locally on a single machine / a single-node HPC batch queue with multiple cores.
  • MPIDaskProcessor: Use for large datasets on HPC systems using MPI for multi-node queues and distributed computing.

Advanced Usage

Processing Multiple Time Periods

If your stgrid data is too large to fit in memory, you can process it in chunks. The LocalDaskProcessor and MPIDaskProcessor classes can handle a list of xarray datasets, the clipped and aggregated results are saved for each input dataset separately, so you have to care about post-processing the results if you want to combine them later.

import xarray as xr
from stgrid2area import LocalDaskProcessor

# Read multiple files, each containing data for a different time period
stgrid_files = [
    'path/to/data_2020.nc',
    'path/to/data_2021.nc',
    'path/to/data_2022.nc'
]

# Create a list of xarray datasets
stgrids = [xr.open_dataset(file) for file in stgrid_files]

# Create a processor with multiple stgrids
processor = LocalDaskProcessor(
    areas=areas,
    stgrid=stgrids,  # List of xarray datasets
    variables="temperature",
    method="exact_extract",
    operations=["mean", "max"],
    n_workers=4
)

# Run the processing
processor.run()

stgrid2area-workflows

The stgrid2area-workflows repository is a collection of implemented data processing workflows using the stgrid2area package. It is mainly used for processing meteorological input data for the CAMELS-DE dataset. It also includes Python scripts and SLURM job scripts for running the workflows on HPC systems.
Hence, it can serve as a reference for how to use stgrid2area in practice, even though the code is not always perfectly documented and under development.

About

Extract and aggregate spatio-temporal data in netCDF or GRIB format to a specified area.

Resources

License

Stars

Watchers

Forks

Packages

No packages published