GeoZarr compliant data model for EOPF (Earth Observation Processing Framework) datasets.
This library provides tools to convert EOPF datasets to GeoZarr-spec 0.4 compliant format while maintaining native projections and using /2 downsampling logic for multiscale support.
- GeoZarr Specification Compliance: Full compliance with GeoZarr spec 0.4
- Native CRS Preservation: No reprojection to TMS, maintains original coordinate reference systems
- Multiscale Support: COG-style /2 downsampling with overview levels as child groups
- CF Conventions: Proper CF standard names and grid_mapping attributes
- Robust Processing: Band-by-band writing with validation and retry logic
- S3 Support: Direct output to S3-compatible object storage with automatic credential validation
- Parallel Processing: Optional dask cluster support for parallel chunk processing
- Chunk Alignment: Automatic chunk alignment to prevent data corruption with dask
Converted stores include:

- `_ARRAY_DIMENSIONS` attributes on all arrays
- CF standard names for all variables
- `grid_mapping` attributes referencing CF grid_mapping variables
- `GeoTransform` attributes in grid_mapping variables
- Proper multiscales metadata structure
- Native CRS tile matrix sets
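A quick way to spot-check this metadata on a converted store is to open it with xarray and inspect the variable attributes. The output path and group name below are illustrative:

```python
import xarray as xr

# Open a converted store (path and group name are illustrative)
dt = xr.open_datatree("output.zarr", engine="zarr")
ds = dt["measurements/r10m"].to_dataset()

for name, var in ds.data_vars.items():
    # xarray surfaces _ARRAY_DIMENSIONS as .dims; CF metadata stays in .attrs
    print(name, var.dims, var.attrs.get("standard_name"), var.attrs.get("grid_mapping"))
```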
```bash
pip install eopf-geozarr
```
For development:
```bash
git clone <repository-url>
cd eopf-geozarr
pip install -e ".[dev]"
```
After installation, you can use the `eopf-geozarr` command:
```bash
# Convert EOPF dataset to GeoZarr format (local output)
eopf-geozarr convert input.zarr output.zarr

# Convert EOPF dataset to GeoZarr format (S3 output)
eopf-geozarr convert input.zarr s3://my-bucket/path/to/output.zarr

# Convert with parallel processing using dask cluster
eopf-geozarr convert input.zarr output.zarr --dask-cluster

# Convert with dask cluster and verbose output
eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose

# Get information about a dataset
eopf-geozarr info input.zarr

# Validate GeoZarr compliance
eopf-geozarr validate output.zarr

# Get help
eopf-geozarr --help
```
The library supports direct output to S3-compatible storage, including custom providers like OVH Cloud. Simply provide an S3 URL as the output path:
```bash
# Convert to S3
eopf-geozarr convert local_input.zarr s3://my-bucket/geozarr-data/output.zarr --verbose
```
Before using S3 output, ensure your S3 credentials are configured:
For AWS S3:
```bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
```
For OVH Cloud Object Storage:
```bash
export AWS_ACCESS_KEY_ID=your_ovh_access_key
export AWS_SECRET_ACCESS_KEY=your_ovh_secret_key
export AWS_DEFAULT_REGION=gra  # or other OVH region
export AWS_ENDPOINT_URL=https://s3.gra.cloud.ovh.net  # OVH endpoint
```
For other S3-compatible providers:
```bash
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=your_region
export AWS_ENDPOINT_URL=https://your-s3-endpoint.com
```
Alternative: AWS CLI Configuration
```bash
aws configure
# Note: For custom endpoints, you'll still need to set AWS_ENDPOINT_URL
```
- Custom Endpoints: Support for any S3-compatible storage (AWS, OVH Cloud, MinIO, etc.)
- Automatic Validation: The tool validates S3 access before starting conversion
- Credential Detection: Automatically detects and validates S3 credentials
- Error Handling: Provides helpful error messages for S3 configuration issues
- Performance: Optimized for S3 with proper chunking and retry logic
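If you want to confirm connectivity yourself before kicking off a long conversion, a short s3fs probe along these lines works (s3fs is the fsspec backend for `s3://` URLs; the bucket name is a placeholder):

```python
import os

import s3fs

# Credentials come from the AWS_* environment variables described above;
# endpoint_url is only needed for non-AWS providers
fs = s3fs.S3FileSystem(client_kwargs={"endpoint_url": os.environ.get("AWS_ENDPOINT_URL")})

# Listing the target bucket confirms the credentials and endpoint work
print(fs.ls("my-bucket"))
```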
The library supports parallel processing using dask clusters for improved performance on large datasets:
```bash
# Enable dask cluster for parallel processing
eopf-geozarr convert input.zarr output.zarr --dask-cluster

# With verbose output to see cluster information
eopf-geozarr convert input.zarr output.zarr --dask-cluster --verbose
```
- Local Cluster: Automatically starts a local dask cluster with multiple workers
- Dashboard Access: Provides access to the dask dashboard for monitoring (shown in verbose mode)
- Automatic Cleanup: Properly closes the cluster even if errors occur during processing
- Chunk Alignment: Automatically aligns Zarr chunks with dask chunks to prevent data corruption
- Memory Efficiency: Better memory management through parallel chunk processing
- Error Handling: Graceful handling of dask import errors with helpful installation instructions
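Conceptually, `--dask-cluster` wraps the conversion in a local cluster lifecycle like the sketch below. This uses the standard `dask.distributed` API; the worker counts are illustrative and the library's actual defaults may differ:

```python
from dask.distributed import Client, LocalCluster

# Start a local cluster; sizing here is illustrative
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)
print("Dask dashboard:", client.dashboard_link)

try:
    # Run the conversion while the cluster is up,
    # e.g. create_geozarr_dataset(...) as in the Python examples below
    ...
finally:
    # Always shut the cluster down, even if the conversion fails
    client.close()
    cluster.close()
```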
The library includes advanced chunk alignment logic to prevent the common issue of overlapping chunks when using dask:
- Smart Detection: Automatically detects if data is dask-backed and uses existing chunk structure
- Aligned Calculation: Uses `calculate_aligned_chunk_size()` to find optimal chunk sizes that divide evenly into data dimensions
- Proper Rechunking: Ensures datasets are rechunked to match encoding before writing
- Fallback Logic: For non-dask arrays, uses reasonable chunk sizes that don't exceed data dimensions
This prevents errors like:
```
❌ Failed to write tci after 2 attempts: Specified Zarr chunks encoding['chunks']=(1, 3660, 3660) for variable named 'tci' would overlap multiple Dask chunks
```
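The alignment idea itself is simple: pick the largest chunk size that does not exceed the target and still divides the dimension evenly. A minimal sketch of that idea (not the library's exact implementation; see `calculate_aligned_chunk_size` in the API reference below):

```python
def aligned_chunk_size(dimension_size: int, target_chunk_size: int) -> int:
    """Largest divisor of dimension_size that does not exceed target_chunk_size."""
    for size in range(min(target_chunk_size, dimension_size), 0, -1):
        if dimension_size % size == 0:
            return size
    return dimension_size  # not reached for positive inputs (1 always divides)

print(aligned_chunk_size(5490, 3660))  # 2745: two even chunks instead of 3660 + 1830
```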
```python
import os

import xarray as xr

from eopf_geozarr import create_geozarr_dataset

# Configure for OVH Cloud (example)
os.environ['AWS_ACCESS_KEY_ID'] = 'your_ovh_access_key'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'your_ovh_secret_key'
os.environ['AWS_DEFAULT_REGION'] = 'gra'
os.environ['AWS_ENDPOINT_URL'] = 'https://s3.gra.cloud.ovh.net'

# Load your EOPF DataTree
dt = xr.open_datatree("path/to/eopf/dataset.zarr", engine="zarr")

# Convert directly to S3
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=["/measurements/r10m", "/measurements/r20m", "/measurements/r60m"],
    output_path="s3://my-bucket/geozarr-data/output.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
    max_retries=3,
)
```
```python
import xarray as xr

from eopf_geozarr import create_geozarr_dataset

# Load your EOPF DataTree
dt = xr.open_datatree("path/to/eopf/dataset.zarr", engine="zarr")

# Define groups to convert (e.g., resolution groups)
groups = ["/measurements/r10m", "/measurements/r20m", "/measurements/r60m"]

# Convert to GeoZarr compliant format
dt_geozarr = create_geozarr_dataset(
    dt_input=dt,
    groups=groups,
    output_path="path/to/output/geozarr.zarr",
    spatial_chunk=4096,
    min_dimension=256,
    tile_width=256,
    max_retries=3,
)
```
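The result is a plain Zarr store, so it can be opened back with xarray. The group path follows the example above; the overview-level child group naming shown in the comment is illustrative:

```python
import xarray as xr

# Re-open the converted store
dt_out = xr.open_datatree("path/to/output/geozarr.zarr", engine="zarr")

# Full-resolution data lives in the converted group...
print(dt_out["measurements/r10m"])

# ...and /2 overview levels are child groups, e.g. dt_out["measurements/r10m/1"]
```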
`create_geozarr_dataset`: Create a GeoZarr-spec 0.4 compliant dataset from EOPF data.
Parameters:
- `dt_input` (xr.DataTree): Input EOPF DataTree
- `groups` (List[str]): List of group names to process as GeoZarr datasets
- `output_path` (str): Output path for the Zarr store
- `spatial_chunk` (int, default=4096): Spatial chunk size for encoding
- `min_dimension` (int, default=256): Minimum dimension for overview levels
- `tile_width` (int, default=256): Tile width for TMS compatibility
- `max_retries` (int, default=3): Maximum number of retries for network operations
Returns:
- `xr.DataTree`: DataTree containing the GeoZarr compliant data
Set up GeoZarr-spec compliant CF standard names and CRS information.
Parameters:
- `dt` (xr.DataTree): The data tree containing the datasets to process
- `groups` (List[str]): List of group names to process as GeoZarr datasets
Returns:
- `Dict[str, xr.Dataset]`: Dictionary of datasets with GeoZarr compliance applied
Downsample a 2D array using block averaging.
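Block averaging replaces each `factor x factor` block of pixels with its mean. An illustrative NumPy version of the idea (not the library's implementation):

```python
import numpy as np

def block_average_2d(arr: np.ndarray, factor: int = 2) -> np.ndarray:
    """Downsample a 2D array by averaging each factor x factor block."""
    h, w = arr.shape
    h_trim, w_trim = h - h % factor, w - w % factor  # drop edge rows/cols that don't fill a block
    blocks = arr[:h_trim, :w_trim].reshape(h_trim // factor, factor, w_trim // factor, factor)
    return blocks.mean(axis=(1, 3))

small = block_average_2d(np.arange(16, dtype=float).reshape(4, 4))  # (4, 4) -> (2, 2)
```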
Calculate a chunk size that divides evenly into the dimension size. This ensures that Zarr chunks align properly with the data dimensions, preventing chunk overlap issues when writing with Dask.
Parameters:
- `dimension_size` (int): Size of the dimension to chunk
- `target_chunk_size` (int): Desired chunk size
Returns:
- `int`: Aligned chunk size that divides evenly into `dimension_size`
Example:
```python
from eopf_geozarr.conversion.utils import calculate_aligned_chunk_size

# For a dimension of size 5490 with target chunk size 3660
aligned_size = calculate_aligned_chunk_size(5490, 3660)  # Returns 2745
```
Check if a variable is a grid_mapping variable by looking for references to it.
Validate that a specific band exists and is complete in the dataset.
The library is organized into the following modules:
- `conversion`: Core conversion tools for EOPF to GeoZarr transformation
  - `geozarr.py`: Main conversion functions and GeoZarr spec compliance
  - `utils.py`: Utility functions for data processing and validation
- `data_api`: Data access API (future development with pydantic-zarr)
This library implements the GeoZarr specification 0.4 with the following key requirements:
- Array Dimensions: All arrays must have `_ARRAY_DIMENSIONS` attributes
- CF Standard Names: All variables must have CF-compliant `standard_name` attributes
- Grid Mapping: Data variables must reference CF grid_mapping variables via `grid_mapping` attributes
attributes - Multiscales Structure: Overview levels are stored as children groups with proper tile matrix metadata
- Native CRS: Coordinate reference systems are preserved without reprojection
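Concretely, this lets you follow the `grid_mapping` reference from a band to its CRS variable on a converted dataset. The group path and variable names below are illustrative:

```python
import xarray as xr

# Group path and variable names are illustrative
ds = xr.open_datatree("output.zarr", engine="zarr")["measurements/r10m"].to_dataset()

band = ds["b02"]                        # a data variable
gm = ds[band.attrs["grid_mapping"]]     # the referenced CF grid_mapping variable

print(band.attrs.get("standard_name"))  # CF standard name
print(gm.attrs.get("GeoTransform"))     # affine transform as a space-separated string
```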
Our implementation has contributed valuable feedback to the GeoZarr specification development process. Based on our real-world experience with Earth observation data, we have identified and reported several areas for improvement:
- Arbitrary Coordinate Systems Support: Advocating for native CRS preservation instead of web mapping bias
- Chunking Performance Optimization: Proposing flexible chunking strategies for optimal performance
- Multiscale Hierarchy Clarification: Providing clear structure definitions for multiscale implementations
Our implementation demonstrates that scientific accuracy and performance can be maintained while working with arbitrary coordinate systems, not just web mapping projections. This is particularly important for Earth observation data that often comes in UTM zones, polar stereographic, or other scientific projections.
For detailed information about our contributions, see our GeoZarr Specification Contribution documentation.
```bash
# Clone the repository
git clone <repository-url>
cd eopf-geozarr

# Install in development mode with all dependencies
pip install -e ".[dev,docs,all]"

# Install pre-commit hooks
pre-commit install
```
Run the test suite with:

```bash
pytest
```
The project uses:
- Black for code formatting
- isort for import sorting
- flake8 for linting
- mypy for type checking
- pre-commit for automated checks
To build the documentation:

```bash
cd docs
make html
```
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Run tests and ensure code quality checks pass
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Built on top of the excellent xarray and zarr libraries
- Follows the GeoZarr specification for geospatial data in Zarr
- Designed for compatibility with EOPF datasets
For questions, issues, or contributions, please visit the GitHub repository.