Add multidimensional bins demo #203

Merged: 6 commits, Oct 5, 2023

1 change: 1 addition & 0 deletions docs/source/user-stories.md
@@ -8,4 +8,5 @@
user-stories/climatology.ipynb
user-stories/climatology-hourly.ipynb
user-stories/custom-aggregations.ipynb
user-stories/nD-bins.ipynb
```
373 changes: 373 additions & 0 deletions docs/source/user-stories/nD-bins.ipynb
@@ -0,0 +1,373 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e970d800-c612-482a-bb3a-b1eb7ad53d88",
"metadata": {
"tags": [],
"user_expressions": []
},
"source": [
"# Binning with multi-dimensional bins\n",
"\n",
"```{warning}\n",
"This post is a proof-of-concept for discussion. Expect APIs to change to enable this use case.\n",
"```\n",
"\n",
"Here we explore a binning problem where the bins are multidimensional\n",
"([xhistogram issue](https://github.com/xgcm/xhistogram/issues/28))\n",
"\n",
"> One of such multi-dim bin applications is the ranked probability score rps we\n",
"> use in `xskillscore.rps`, where we want to know how many forecasts fell into\n",
"> which bins. Bins are often defined as terciles of the forecast distribution\n",
"> and the bins for these terciles\n",
"> (`forecast_with_lon_lat_time_dims.quantile(q=[.33,.66],dim='time')`) depend on\n",
"> `lon` and `lat`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "01f1a2ef-de62-45d0-a04e-343cd78debc5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import math\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import xarray as xr\n",
"\n",
"import flox\n",
"import flox.xarray"
]
},
{
"cell_type": "markdown",
"id": "0be3e214-0cf0-426f-8ebb-669cc5322310",
"metadata": {
"user_expressions": []
},
"source": [
"## Create test data\n"
]
},
{
"cell_type": "markdown",
"id": "ce239000-e053-4fc3-ad14-e9e0160da869",
"metadata": {
"user_expressions": []
},
"source": [
"Data to be reduced\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7659c24e-f5a1-4e59-84c0-5ec965ef92d2",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"array = xr.DataArray(\n",
" np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]),\n",
" dims=(\"space\", \"time\"),\n",
" name=\"array\",\n",
")\n",
"array"
]
},
{
"cell_type": "markdown",
"id": "da0c0ac9-ad75-42cd-a1ea-99069f5bef00",
"metadata": {
"user_expressions": []
},
"source": [
"Array to group by\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4601e744-5d22-447e-97ce-9644198d485e",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"by = xr.DataArray(\n",
" np.array([[1, 2, 3], [3, 4, 5], [5, 6, 7], [6, 7, 9]]),\n",
" dims=(\"space\", \"time\"),\n",
" name=\"by\",\n",
")\n",
"by"
]
},
{
"cell_type": "markdown",
"id": "61c21c94-7b6e-46a6-b9c2-59d7b2d40c81",
"metadata": {
"tags": [],
"user_expressions": []
},
"source": [
"Multidimensional bins:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "863a1991-ab8d-47c0-aa48-22b422fcea8c",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"bins = by + 0.5\n",
"bins = xr.DataArray(\n",
" np.concatenate([bins, bins[:, [-1]] + 1], axis=-1)[:, :-1].T,\n",
" dims=(\"time\", \"nbins\"),\n",
" name=\"bins\",\n",
")\n",
"bins"
]
},
{
"cell_type": "markdown",
"id": "e65ecaba-d1cc-4485-ae58-c390cb2ebfab",
"metadata": {
"user_expressions": []
},
"source": [
"## Concept\n",
"\n",
"The key idea is that GroupBy is two steps:\n",
"\n",
"1. Factorize (a.k.a \"digitize\") : convert the `by` data to a set of integer\n",
" codes representing the bins.\n",
"2. Apply the reduction.\n",
"\n",
"We treat multi-dimensional binning as a slightly complicated factorization\n",
"problem. Assume that bins are a function of `time`. So we\n",
"\n",
"1. generate a set of appropriate integer codes by:\n",
" 1. Loop over \"time\" and factorize the data appropriately.\n",
" 2. Add an offset to these codes so that \"bin 0\" for `time=0` is different\n",
" from \"bin 0\" for `time=1`\n",
"2. apply the groupby reduction to the \"offset codes\"\n",
"3. reshape the output to the right shape\n",
"\n",
"We will work at the xarray level, so its easy to keep track of the different\n",
"dimensions.\n",
"\n",
"### Factorizing\n",
"\n",
"The core `factorize_` function (which wraps `pd.cut`) only handles 1D bins, so\n",
"we use `xr.apply_ufunc` to vectorize it for us.\n"
]
},
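{
"cell_type": "markdown",
"id": "b2f7c1d0-1d-factorize-sketch",
"metadata": {
"user_expressions": []
},
"source": [
"As a minimal sketch (purely illustrative, with made-up numbers) of what that 1D\n",
"step does for a single timestep: `pd.cut` maps each value to the integer index\n",
"of the bin it falls in, with `-1` for values outside all bins.\n",
"\n",
"```python\n",
"edges = [0.5, 1.5, 2.5, 3.5]  # hypothetical bin edges for one timestep\n",
"pd.cut([1, 2, 4], bins=pd.IntervalIndex.from_breaks(edges)).codes\n",
"# array([ 0,  1, -1], dtype=int8)\n",
"```\n"
]
},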
{
"cell_type": "code",
"execution_count": null,
"id": "aa33ab2c-0ecf-4198-a033-2a77f5d83c99",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"factorize_loop_dim = \"time\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "afcddcc1-dd57-461e-a649-1f8bcd30342f",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def factorize_nd_bins_core(by, bins):\n",
" group_idx, *_, props = flox.core.factorize_(\n",
" (by,),\n",
" axes=(-1,),\n",
" expected_groups=(pd.IntervalIndex.from_breaks(bins),),\n",
" )\n",
" # Use -1 as the NaN sentinel value\n",
" group_idx[props.nanmask] = -1\n",
" return group_idx\n",
"\n",
"\n",
"codes = xr.apply_ufunc(\n",
" factorize_nd_bins_core,\n",
" by,\n",
" bins,\n",
" # TODO: avoid hardcoded dim names\n",
" input_core_dims=[[\"space\"], [\"nbins\"]],\n",
" output_core_dims=[[\"space\"]],\n",
" vectorize=True,\n",
")\n",
"codes"
]
},
{
"cell_type": "markdown",
"id": "1661312a-dc61-4a26-bfd8-12c2dc01eb15",
"metadata": {
"user_expressions": []
},
"source": [
"### Offset the codes\n",
"\n",
"These are integer codes appropriate for a single timestep.\n",
"\n",
"We now add an offset that changes in time, to make sure \"bin 0\" for `time=0` is\n",
"different from \"bin 0\" for `time=1` (taken from\n",
"[this StackOverflow thread](https://stackoverflow.com/questions/46256279/bin-elements-per-row-vectorized-2d-bincount-for-numpy)).\n"
]
},
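{
"cell_type": "markdown",
"id": "offset-trick-sketch",
"metadata": {
"user_expressions": []
},
"source": [
"A toy sketch of that trick (illustration only, with made-up numbers): if every\n",
"row has at most 3 distinct codes, adding `row * 3` makes the codes globally\n",
"unique, so one flat reduction can be applied over all rows at once:\n",
"\n",
"```python\n",
"codes_2d = np.array([[0, 1, 2], [2, 0, 1]])  # per-row codes, each in 0..2\n",
"offset_2d = np.arange(codes_2d.shape[0])[:, np.newaxis] * 3\n",
"codes_2d + offset_2d\n",
"# array([[0, 1, 2],\n",
"#        [5, 3, 4]])  -> rows now occupy disjoint code ranges\n",
"```\n"
]
},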
{
"cell_type": "code",
"execution_count": null,
"id": "0e5801cb-a79c-4670-ad10-36bb19f1a6ff",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"N = math.prod([codes.sizes[d] for d in codes.dims if d != factorize_loop_dim])\n",
"offset = xr.DataArray(np.arange(codes.sizes[factorize_loop_dim]), dims=factorize_loop_dim)\n",
"# TODO: think about N-1 here\n",
"offset_codes = (codes + offset * (N - 1)).rename(by.name)\n",
"offset_codes.data[codes == -1] = -1\n",
"offset_codes"
]
},
{
"cell_type": "markdown",
"id": "6c06c48b-316b-4a33-9bc3-921acd10bcba",
"metadata": {
"user_expressions": []
},
"source": [
"### Reduce\n",
"\n",
"Now that we have appropriate codes, let's apply the reduction\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2cf1295e-4585-48b9-ac2b-9e00d03b2b9a",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"interim = flox.xarray.xarray_reduce(\n",
" array,\n",
" offset_codes,\n",
" func=\"sum\",\n",
" # We use RangeIndex to indicate that `-1` code can be safely ignored\n",
" # (it indicates values outside the bins)\n",
" # TODO: Avoid hardcoding 9 = sizes[\"time\"] x (sizes[\"nbins\"] - 1)\n",
" expected_groups=pd.RangeIndex(9),\n",
")\n",
"interim"
]
},
{
"cell_type": "markdown",
"id": "3539509b-d9b4-4342-a679-6ada6f285dfb",
"metadata": {
"user_expressions": []
},
"source": [
"## Make final result\n",
"\n",
"Now reshape that 1D result appropriately.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1389d37-d76d-4a50-9dfb-8710258de3fd",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"final = (\n",
" interim.coarsen(by=3)\n",
" # bin_number dimension is last, this makes sense since it is the core dimension\n",
" # and we vectorize over the loop dims.\n",
" # So the first (Nbins-1) elements are for the first index of the loop dim\n",
" .construct({\"by\": (factorize_loop_dim, \"bin_number\")})\n",
" .transpose(..., factorize_loop_dim)\n",
" .drop_vars(\"by\")\n",
")\n",
"final"
]
},
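{
"cell_type": "markdown",
"id": "reshape-equivalence-sketch",
"metadata": {
"user_expressions": []
},
"source": [
"The `coarsen`/`construct` combination is a label-aware way of reshaping. A rough\n",
"plain-NumPy sketch of the same reshape (ignoring coordinates and labels) would\n",
"be:\n",
"\n",
"```python\n",
"# split the flat offset-code axis into (time, bin_number), then put time last\n",
"interim.data.reshape(codes.sizes[factorize_loop_dim], -1).T\n",
"```\n"
]
},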
{
"cell_type": "markdown",
"id": "a98b5e60-94af-45ae-be1b-4cb47e2d77ba",
"metadata": {
"user_expressions": []
},
"source": [
"I think this is the expected answer.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "053a8643-f6d9-4fd1-b014-230fa716449c",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"array.isel(space=slice(1, None)).rename({\"space\": \"bin_number\"}).identical(final)"
]
},
{
"cell_type": "markdown",
"id": "619ba4c4-7c87-459a-ab86-c187d3a86c67",
"metadata": {
"tags": [],
"user_expressions": []
},
"source": [
"## TODO\n",
"\n",
"This could be extended to:\n",
"\n",
"1. handle multiple `factorize_loop_dim`\n",
"2. avoid hard coded dimension names in the `apply_ufunc` call for factorizing\n",
"3. avoid hard coded number of output elements in the `xarray_reduce` call.\n",
"4. Somehow propagate the bin edges to the final output.\n"
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}