diff --git a/README.md b/README.md index 8b784312e..9cd693df3 100644 --- a/README.md +++ b/README.md @@ -49,3 +49,46 @@ You can start editing the page by modifying `pages/index.js`. The page auto-upda

+ +## Authoring blog post tips + +1. To create a new blog post, a good place to start is copying a subfolder under `src/posts/`. For example, https://xarray.dev/blog/flox is written here: https://github.com/xarray-contrib/xarray.dev/blob/e04905f5ea039eb2eb848c0b4945beee323900e4/src/posts/flox/index.md + +### Static assets + +Once you have `src/posts/newpost/index.md`, start writing! If you want to include figures or other static assets, they go into a matching `public/posts/newpost` folder. But reference images without the `public` part of the path, like this: + +```html +

+ +

+``` + +### Xarray HTML reprs + +To include an HTML repr, you must save it first: + +```python +with open('da-repr.html', 'w') as f: + f.write(da._repr_html_()) +``` + +Then put it into the post's static assets folder `public/posts/newpost/da-repr.html`. Finally, in `src/posts/newpost/index.md` you can include it with this syntax: + +``` + +``` + +### Toggling visibility of sections (markdown comments) + +While authoring, you might want to toggle specific sections on and off during rendering. You can do that with this syntax: + +``` +{/* This is a comment that won't be rendered! */} +``` + +### Landing page banner + +If you'd like to add a link to the latest blog post on the landing page banner, edit this section here: + +https://github.com/xarray-contrib/xarray.dev/blob/e04905f5ea039eb2eb848c0b4945beee323900e4/src/components/layout.js#L18 diff --git a/public/posts/flexible-indexing/da-pandas-repr.html b/public/posts/flexible-indexing/da-pandas-repr.html new file mode 100644 index 000000000..6ef906031 --- /dev/null +++ b/public/posts/flexible-indexing/da-pandas-repr.html @@ -0,0 +1,447 @@ +
+ + + + + + + + + + + + + + +
<xarray.DataArray 'M' (X: 6)> Size: 48B
+array([10., 20., 30., 40., 50., 60.])
+Coordinates:
+  * X        (X) int64 48B 1 2 4 8 16 32
\ No newline at end of file diff --git a/public/posts/flexible-indexing/da-rasterix-repr.html b/public/posts/flexible-indexing/da-rasterix-repr.html new file mode 100644 index 000000000..232306f65 --- /dev/null +++ b/public/posts/flexible-indexing/da-rasterix-repr.html @@ -0,0 +1,457 @@ +
+ + + + + + + + + + + + + + +
<xarray.DataArray 'band_data' (y: 626401, x: 1296001)> Size: 3TB
+[811816322401 values with dtype=float32]
+Coordinates:
+    band         int64 8B 1
+    spatial_ref  int64 8B ...
+  * x            (x) float64 10MB -180.0 -180.0 -180.0 ... 180.0 180.0 180.0
+  * y            (y) float64 5MB 84.0 84.0 84.0 84.0 ... -90.0 -90.0 -90.0 -90.0
+Indexes:
+  ┌ x        RasterIndex (crs=None)
+  └ y
+Attributes:
+    AREA_OR_POINT:  Point
\ No newline at end of file diff --git a/public/posts/flexible-indexing/ds-range-repr.html b/public/posts/flexible-indexing/ds-range-repr.html new file mode 100644 index 000000000..aec8e205e --- /dev/null +++ b/public/posts/flexible-indexing/ds-range-repr.html @@ -0,0 +1,451 @@ +
+ + + + + + + + + + + + + + +
<xarray.Dataset> Size: 8MB
+Dimensions:  (x: 1000000)
+Coordinates:
+  * x        (x) float64 8MB 0.0 0.1 0.2 0.3 0.4 ... 1e+05 1e+05 1e+05 1e+05
+Data variables:
+    *empty*
+Indexes:
+    x        RangeIndex (start=0, stop=1e+05, step=0.1)
\ No newline at end of file diff --git a/public/posts/flexible-indexing/ds-range-slice-repr.html b/public/posts/flexible-indexing/ds-range-slice-repr.html new file mode 100644 index 000000000..af6ff6d72 --- /dev/null +++ b/public/posts/flexible-indexing/ds-range-slice-repr.html @@ -0,0 +1,449 @@ +
+ + + + + + + + + + + + + + +
<xarray.DataArray 'x' (x: 490)> Size: 4kB
+[490 values with dtype=float64]
+Coordinates:
+  * x        (x) float64 4kB 1e-06 1.1e-06 1.2e-06 ... 4.98e-05 4.99e-05
+Indexes:
+    x        RangeIndex (start=1e-06, stop=5e-05, step=1e-07)
diff --git a/public/posts/flexible-indexing/summary-slide.png b/public/posts/flexible-indexing/summary-slide.png new file mode 100644 index 000000000..35300973e Binary files /dev/null and b/public/posts/flexible-indexing/summary-slide.png differ diff --git a/public/posts/flexible-indexing/xvec-repr.html b/public/posts/flexible-indexing/xvec-repr.html new file mode 100644 index 000000000..355b5d734 --- /dev/null +++ b/public/posts/flexible-indexing/xvec-repr.html @@ -0,0 +1,498 @@ +
+ + + + + + + + + + + + + + +
<xarray.Dataset> Size: 173kB
+Dimensions:       (county: 3085, year: 4)
+Coordinates:
+  * county        (county) geometry 25kB POLYGON ((-95.34258270263672 48.5467...
+  * year          (year) int64 32B 1960 1970 1980 1990
+Data variables:
+    population    (county, year) int32 49kB 4304 3987 3764 ... 43766 55800 65077
+    unemployment  (county, year) float64 99kB 7.9 9.0 5.903 ... 7.018 5.489
+Indexes:
+    county   GeometryIndex (crs=EPSG:4326)
\ No newline at end of file diff --git a/public/posts/flexible-indexing/xvecfig.png b/public/posts/flexible-indexing/xvecfig.png new file mode 100644 index 000000000..31cdb83aa Binary files /dev/null and b/public/posts/flexible-indexing/xvecfig.png differ diff --git a/src/components/layout.js b/src/components/layout.js index eacaceb32..f2d94d2ad 100644 --- a/src/components/layout.js +++ b/src/components/layout.js @@ -13,26 +13,26 @@ export const Layout = ({ url = 'https://xarray.dev', enableBanner = false, }) => { - const bannerTitle = 'Check out the new blog post!:' + const bannerTitle = 'Check out the latest blog post:' // The first link will be the main description for the banner const bannerDescription = ( - + {' '} {/* Ensure it stands out a bit */} - Xarray for Biology: Learn how Xarray can be used for Biological workflows. + Flexible Indexes: Exciting new ways to slice and dice your data! ) // The second link will be passed as children, styled to be smaller - const bannerChildren = ( - - {' '} - {/* Add your second link here, smaller font */} - SciPy 2025 Click here for info about an Xarray for Bio Sprint! - - ) + // const bannerChildren = ( + // + // {' '} + // {/* Add your second link here, smaller font */} + // SciPy 2025 Click here for info about an Xarray for Bio Sprint! + // + //) // Determine the base URL based on the environment const baseUrl = process.env.NEXT_PUBLIC_VERCEL_URL @@ -77,7 +77,7 @@ export const Layout = ({
{enableBanner && ( - {bannerChildren} + {/* {bannerChildren} */} )} {children} diff --git a/src/posts/flexible-indexing/index.md b/src/posts/flexible-indexing/index.md new file mode 100644 index 000000000..750e257d5 --- /dev/null +++ b/src/posts/flexible-indexing/index.md @@ -0,0 +1,220 @@ +--- +title: 'Flexible Indexes: Exciting new ways to slice and dice your data!' +date: '2025-08-11' +authors: + - name: Scott Henderson + github: scottyhq + - name: Benoît Bovy + github: benbovy + - name: Deepak Cherian + github: dcherian + - name: Justus Magin + github: keewis +summary: 'An introduction to customizable coordinate-based data selection and alignment for more efficient handling of both traditional and more exotic data structures' +--- + +**TL;DR**: Over the last few years we've gradually refactored Xarray internals to make coordinate-based data selection and alignment customizable. As a result, Xarray>=2025.6 enables more efficient handling of both traditional and more exotic data structures. In this post we highlight a few examples that take advantage of this new superpower! See the [Gallery of Custom Index Examples](https://xarray-indexes.readthedocs.io/) for more! + +
+ +
+ *Summary schematic from our [2025 SciPy + Presentation](https://www.youtube.com/watch?v=I-NHCuLhRjY) highlighting new + custom Indexes and use cases. [Link to full slide + deck](https://docs.google.com/presentation/d/1sQU2N0-ThNZM8TUhsZy-kT0bZnu0H5X0FRJz2eKwEpA/edit?slide=id.g37373ba88e6_0_214#slide=id.g37373ba88e6_0_214)*
+
+ +## Indexing basics + +First things first, _what is an Index and why is it helpful?_ + +> In brief, an _index_ makes data retrieval and alignment more efficient. Xarray Indexes connect coordinate labels to associated data locations (array indices) and encode important contextual information about the coordinate space. + +Examples of indexes are all around you and are a fundamental way to organize and simplify access to information. +In the United States, if you want a book about Natural Sciences, you can go to your local library branch and head straight to section 500. Or if you're in the mood for a classic novel, go to section 800. Connecting thematic labels with numbers (`{'Natural Sciences': 500, 'Literature': 800}`) is a classic indexing system that has been around for well over a century [(Dewey Decimal System, 1876)](https://en.wikipedia.org/wiki/Dewey_Decimal_Classification). +The need for an index becomes critical as the size of data grows - just imagine the time it would take to find a specific novel amongst a million uncategorized books! + +The same efficiencies arise in computing. Consider a simple 1D dataset consisting of measurement values `M=[10.0, 20.0, 30.0, 40.0, 50.0, 60.0]` at six coordinate positions `X=[1, 2, 4, 8, 16, 32]`. _What was our measurement at `X = 8`?_ +To answer this in code, we could either do a brute-force linear search (or binary search if sorted) through the coordinates array, or we could build a more efficient data structure designed for fast searches: an Index. A common convenient index is a _key:value_ mapping or "hash table" between the coordinate values and their integer positions `i=[0, 1, 2, 3, 4, 5]`. Once we identify the _index_ `i=3` for our coordinate of interest (`X[3] = 8`) we use it to look up our measurement value `M[3] = 40.0`. + +> 💡 **Note:** Index structures present a trade-off: they take some time to construct and have a memory footprint, but are much faster at lookups than brute-force searches. 
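
The three lookup strategies described above can be sketched in plain Python. This is only an illustration of the idea, not how Xarray or pandas implement their indexes; the helper names `linear_lookup` and `binary_lookup` are hypothetical.

```python
import bisect

# Coordinates and measurements from the example above
X = [1, 2, 4, 8, 16, 32]
M = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]


def linear_lookup(coords, target):
    """Brute-force search: scan every coordinate until one matches."""
    for i, value in enumerate(coords):
        if value == target:
            return i
    raise KeyError(target)


def binary_lookup(coords, target):
    """Binary search: only valid when the coordinates are sorted."""
    i = bisect.bisect_left(coords, target)
    if i < len(coords) and coords[i] == target:
        return i
    raise KeyError(target)


# Hash-table index: built once, then each lookup takes constant time on average
hash_index = {value: i for i, value in enumerate(X)}

i = hash_index[8]
print(i, M[i])  # 3 40.0
```

Building `hash_index` costs one pass over the coordinates plus the memory for the dictionary, which is the trade-off mentioned in the note above.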
+ +## pandas.Index + +Xarray's [label-based selection](https://docs.xarray.dev/en/latest/user-guide/indexing.html) allows a more expressive and simple syntax in which you don't have to think about the index: `da.sel(X=8)`. To accomplish this, Xarray has historically relied on [pandas.Index](https://pandas.pydata.org/docs/user_guide/indexing.html) behind the scenes, which is still used by default: + +```python +import numpy as np +import xarray as xr + +x = np.array([1, 2, 4, 8, 16, 32]) +m = np.arange(10, 70, 10.0) +da = xr.DataArray(m, coords={'X': x}, name='M') +da +``` + + + +```python +da.sel(X=8) +# 40.0 +``` + +## Alternatives to pandas.Index + +There are many different indexing schemes and ways to generate an index. pandas.Index's approach is roughly similar to running a loop over all coordinate values to create a _coordinate:index_ mapping, optionally identifying duplicates and sorting along the way. But you might recognize that our example coordinates above can in fact be represented by a function `X(i) = 2**i` where `i` is the integer position! Given that function we can quickly get measurement values at any coordinate: `M(X=8)` = `M[log2(8)]` = `M[3] = 40`. Xarray now has a [CoordinateTransformIndex](https://xarray-indexes.readthedocs.io/blocks/transform.html) to handle this type of on-demand calculation of coordinates! + +### xarray.RangeIndex + +A simple special case of `CoordinateTransformIndex` is a `RangeIndex` where coordinates can be defined by a start, stop, and uniform step size. `pandas.RangeIndex` only supports integers, whereas Xarray handles floating-point values. Coordinate look-up is performed on-the-fly rather than loading all values into memory up-front when creating a Dataset, which is critical for the example below that has a coordinate array of 7 terabytes! + +```python +from xarray.indexes import RangeIndex + +index = RangeIndex.arange(0.0, 1000.0, 1e-9, dim='x') # 7TB coordinate array! 
+ds = xr.Dataset(coords=xr.Coordinates.from_xindex(index)) +ds +``` + + + +Selection with slices preserves the RangeIndex and does not require loading all the coordinates into memory. + +```python +sliced = ds.isel(x=slice(1_000, 50_000, 100)) +sliced.x +``` + + + +## Third-party custom Indexes + +In addition to a few new built-in indexes, `xarray.Index` provides an API that allows dealing with coordinate data and metadata in a highly customizable way for the most common Xarray operations such as `sel`, `align`, `concat`, and `stack`. This is a powerful extension mechanism that is very important for supporting a multitude of domain-specific data structures. Here are a few examples. + +### rasterix.RasterIndex + +Earlier we mentioned that coordinates may have a _functional representation_. +For 2D raster images, this function often takes the form of an [Affine Transform](https://en.wikipedia.org/wiki/Affine_transformation). +The [rasterix](https://github.com/xarray-contrib/rasterix) library extends Xarray with a `RasterIndex` which computes coordinates for geospatial images such as GeoTiffs via an Affine Transform. + +Below is a simple example of slicing a large mosaic of GeoTiffs without ever loading the coordinates into memory: + +```python +# 811816322401 values! +import rasterix + +# 26475 GeoTiffs represented by a GDAL VRT +da = ( + xr.open_dataarray( + "https://opentopography.s3.sdsc.edu/raster/COP30/COP30_hh.vrt", + engine="rasterio", + parse_coordinates=False, + ) + .squeeze() + .pipe(rasterix.assign_index) +) +da +``` + + + +After the slicing operation, a new Affine is defined. For example, notice how the origin changes below from (-180.0, 84.0) -> (-122.4, -47.1), while the spacing is unchanged (0.000277). 
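
To see why only the origin moves, here is a minimal pure-Python sketch of slicing an affine transform. It is an illustration under stated assumptions, not rasterix's implementation: the helpers `pixel_to_coord` and `slice_transform` are hypothetical, the 0.25 degree spacing is made up for readability, and the six coefficients follow the common `(a, b, c, d, e, f)` ordering where `x = a*col + b*row + c` and `y = d*col + e*row + f`. The actual rasterix output follows.

```python
def pixel_to_coord(transform, col, row):
    """Map an integer pixel position (col, row) to coordinates (x, y)."""
    a, b, c, d, e, f = transform
    return (a * col + b * row + c, d * col + e * row + f)


def slice_transform(transform, col_start, row_start):
    """Return the transform of a sub-image starting at (col_start, row_start).

    The spacing terms (a, b, d, e) are untouched; only the origin (c, f)
    moves to the coordinates of the first pixel that is kept.
    """
    a, b, c, d, e, f = transform
    new_c, new_f = pixel_to_coord(transform, col_start, row_start)
    return (a, b, new_c, d, e, new_f)


# Hypothetical global grid with 0.25 degree spacing and origin (-180, 84)
step = 0.25
original = (step, 0.0, -180.0, 0.0, -step, 84.0)

# Drop the first 8 columns and the first 4 rows
sliced = slice_transform(original, col_start=8, row_start=4)
print(sliced)  # (0.25, 0.0, -178.0, 0.0, -0.25, 83.0)
```

Because the new transform is derived from six numbers, no coordinate array ever has to be materialized, however large the raster is.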
+ +```python +print('Original geotransform:\n', da.xindexes['x'].transform()) +da_sliced = da.sel(x=slice(-122.4, -120.0), y=slice(-47.1, -49.0)) +print('Sliced geotransform:\n', da_sliced.xindexes['x'].transform()) +``` + +```python +# Original geotransform: +Affine(0.0002777777777777778, 0.0, -180.0001389, + 0.0, -0.0002777777777777778, 84.0001389) + +# Sliced geotransform: +Affine(0.0002777777777777778, 0.0, -122.40013889999999, + 0.0, -0.0002777777777777778, -47.09986109999999) +``` + +### xproj.CRSIndex + +> "real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc." -- [Xarray Documentation](https://docs.xarray.dev/en/stable/getting-started-guide/why-xarray.html#what-labels-enable) + +We often think about metadata providing context for _measurement values_, but metadata is also critical for coordinates! +In particular, to align two different datasets we must ask whether their coordinates are in the same coordinate system. + +There are currently over 7000 commonly used [Coordinate Reference Systems (CRS)](https://spatialreference.org/ref/epsg/) for geospatial data in the authoritative EPSG database! +And of course there is an infinite number of custom-defined CRSs. 
+[xproj.CRSIndex](https://xproj.readthedocs.io/en/latest/) gives Xarray objects an automatic awareness of the coordinate reference system so that operations like `xr.align()` raise an informative error when there is a CRS mismatch: + +```python +import numpy as np +import xarray as xr +from xproj import CRSIndex +lons1 = np.arange(-125, -120, 1) +lons2 = np.arange(-122, -118, 1) +ds1 = xr.Dataset(coords={'longitude': lons1}).proj.assign_crs(crs=4267) +ds2 = xr.Dataset(coords={'longitude': lons2}).proj.assign_crs(crs=4326) +ds1 + ds2 +``` + +```pytb +MergeError: conflicting values/indexes on objects to be combined for coordinate 'crs' +``` + +### xvec.GeometryIndex + +A "vector data cube" is an n-D array that has at least one dimension indexed by an array of vector geometries. +With the `xvec.GeometryIndex`, Xarray objects gain functionality equivalent to geopandas' GeoDataFrames! +For example, large vector cubes can take advantage of an [R-tree spatial index](https://en.wikipedia.org/wiki/R-tree) for efficiently selecting vector geometries within a given bounding box. +Below is a short code snippet which automatically uses R-tree selection behind the scenes: + +```python +import xarray as xr +import xvec +import geopandas as gpd +from geodatasets import get_path + +# Dataset that contains demographic data indexed by U.S. counties +counties = gpd.read_file(get_path("geoda.natregimes")) + +cube = xr.Dataset( + data_vars=dict( + population=(["county", "year"], counties[["PO60", "PO70", "PO80", "PO90"]]), + unemployment=(["county", "year"], counties[["UE60", "UE70", "UE80", "UE90"]]), + ), + coords=dict(county=counties.geometry, year=[1960, 1970, 1980, 1990]), +).xvec.set_geom_indexes("county", crs=counties.crs) +cube +``` + + + +```python +# Efficient selection using shapely.STRtree +from shapely.geometry import box + +subset = cube.xvec.query( + "county", + box(-125.4, 40, -120.0, 50), + predicate="intersects", +) + +subset['population'].xvec.plot(col='year'); +``` + +

+ +

+ +### Even more examples! + +Be sure to check out the [Gallery of Custom Index Examples](https://xarray-indexes.readthedocs.io) for more detailed examples of all the indexes mentioned in this post and more! + +## What's next? + +While we're extremely excited about what can _already_ be accomplished with the new indexing capabilities, there are plenty of exciting ideas for future work. In a follow-up blog post, we will also illustrate how Xarray's internals interact with the xarray.Index API and how it can be leveraged in order to customize the behavior of some of the most common Xarray operations like indexing and alignment. + +We believe the new flexible indexing machinery will increase usage of Xarray across scientific domains and are actively working on examples that hopefully will appeal to [astronomers](https://xarray-indexes.readthedocs.io/blocks/transform.html#example-astronomy) and [biologists](https://xarray.dev/blog/xarray-biology)! + +Have an idea for your own custom index? Check out [this section of the Xarray documentation](https://docs.xarray.dev/en/stable/internals/how-to-create-custom-index.html) and please advertise what you're working on in our [gallery of examples](https://github.com/xarray-contrib/xarray-indexes). + +## Acknowledgments + +This work would not have been possible without technical input from the Xarray core team and community! +Several developers received essential funding from a [CZI Essential Open Source Software for Science (EOSS) grant](https://xarray.dev/blog/czi-eoss-grant-conclusion) as well as NASA's Open Source Tools, Frameworks, and Libraries (OSTFL) grant 80NSSC22K0345.