Skip to content

Commit 865c6e0

Browse files
dcherianscottyhq
andauthored
Update and reorganize apply_ufunc material (#180)
--------- Co-authored-by: Scott Henderson <[email protected]>
1 parent edd9312 commit 865c6e0

11 files changed

+2605
-620
lines changed

.pre-commit-config.yaml

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -28,10 +28,6 @@ repos:
2828
hooks:
2929
- id: flake8
3030

31-
- repo: https://github.com/asottile/seed-isort-config
32-
rev: v2.2.0
33-
hooks:
34-
- id: seed-isort-config
3531
- repo: https://github.com/PyCQA/isort
3632
rev: 5.12.0
3733
hooks:

_config.yml

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -65,9 +65,10 @@ sphinx:
6565
# maintain old paths and redirect them (so google results dont go to 404)
6666
# https://github.com/wpilibsuite/sphinxext-rediraffe
6767
- sphinxext.rediraffe
68+
- sphinx_exercise
6869
config:
6970
# application/vnd.holoviews_load.v0+json, application/vnd.holoviews_exec.v0+json
70-
suppress_warnings: ['mystnb.unknown_mime_type']
71+
suppress_warnings: ['mystnb.unknown_mime_type', 'misc.highlighting_failure']
7172
notfound_context:
7273
body: "<h1>Whoops! 404 Page Not Found</h1>\n\n<p>Sorry, this page doesn't exist. Many sections of this book have been updated recently.</p><p> Try the search box 🔎 to find what you're looking for!</p>"
7374
notfound_urls_prefix: /

_toc.yml

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -47,8 +47,12 @@ parts:
4747
- file: advanced/apply_ufunc/apply_ufunc.md
4848
sections:
4949
- file: advanced/apply_ufunc/simple_numpy_apply_ufunc
50-
- file: advanced/apply_ufunc/simple_dask_apply_ufunc
51-
- file: advanced/apply_ufunc/vectorize_1d
50+
- file: advanced/apply_ufunc/core-dimensions
51+
- file: advanced/apply_ufunc/complex-output-numpy
52+
- file: advanced/apply_ufunc/automatic-vectorizing-numpy
53+
- file: advanced/apply_ufunc/dask_apply_ufunc
54+
- file: advanced/apply_ufunc/numba-vectorization
55+
- file: advanced/apply_ufunc/example-interp
5256
- file: advanced/map_blocks/map_blocks.md
5357
sections:
5458
- file: advanced/map_blocks/simple_map_blocks
Lines changed: 359 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,359 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "6849dcdc-3484-4f41-8b23-51613d36812f",
6+
"metadata": {
7+
"tags": []
8+
},
9+
"source": [
10+
"(vectorize)=\n",
11+
"# Automatic Vectorization"
12+
]
13+
},
14+
{
15+
"cell_type": "markdown",
16+
"id": "afc56d28-6e55-4967-b27d-28e2cc539cc7",
17+
"metadata": {
18+
"tags": []
19+
},
20+
"source": [
21+
"Previously we looked at [applying functions](gentle-intro) on numpy arrays, and the concept of [core dimensions](core-dimensions).\n",
22+
"We learned that functions commonly support specifying \"core dimensions\" through the `axis` keyword\n",
23+
"argument. \n",
24+
"\n",
25+
"However many functions exist, that implicitly have core dimensions, but do not provide an `axis` keyword\n",
26+
"argument. Applying such functions to a nD array usually involves one or multiple loops over the other dimensions\n",
27+
"--- termed \"loop dimensions\" or \"broadcast dimensions\".\n",
28+
"\n",
29+
"\n",
30+
"A good example is numpy's 1D interpolate function `numpy.interp`:\n",
31+
"\n",
32+
"```\n",
33+
" Signature: np.interp(x, xp, fp, left=None, right=None, period=None)\n",
34+
" Docstring:\n",
35+
" One-dimensional linear interpolation.\n",
36+
"\n",
37+
" Returns the one-dimensional piecewise linear interpolant to a function\n",
38+
" with given discrete data points (`xp`, `fp`), evaluated at `x`.\n",
39+
"```\n",
40+
"\n",
41+
"This function expects 1D arrays as input, so there is one core dimension and we cannot easily apply \n",
42+
"it to a nD array since there is no `axis` keyword argument. \n",
43+
"\n",
44+
"\n",
45+
"Our goal here is to \n",
46+
"1. Understand the difference between core dimensions and loop dimensions\n",
47+
"1. Understand vectorization\n",
48+
"1. Learn how to apply such functions without loops using `apply_ufunc` by providing the `vectorize` keyword argument.\n",
49+
"\n",
50+
"## Core dimensions and looping\n",
51+
"\n",
52+
"Let's say we want to\n",
53+
"interpolate an array with two dimensions (`space`, `time`) over the `time` dimension, we might \n",
54+
"1. loop over the `space` dimension, \n",
55+
"1. subset the array to a 1D array at that `space` location, \n",
56+
"1. Interpolate the 1D arrays to the new `time` vector, and\n",
57+
"1. Assign that new interpolated 1D array to the appropriate location of a 2D output array\n",
58+
"\n",
59+
"In pseudo-code this might look like\n",
60+
"\n",
61+
"```python\n",
62+
"for index in range(size_of_space_axis):\n",
63+
" out[index, :] = np.interp(..., array[index, :], ...)\n",
64+
"```\n",
65+
"\n",
66+
"\n",
67+
"```{exercise}\n",
68+
":label: coreloopdims\n",
69+
"\n",
70+
"Consider the example problem of interpolating a 2D array with dimensions `space` and `time` along the `time` dimension.\n",
71+
"Which dimension is the core dimension, and which is the \"loop dimension\"?\n",
72+
"```\n",
73+
"```{solution} coreloopdims\n",
74+
":class: dropdown\n",
75+
"\n",
76+
"`time` is the core dimension, and `space` is the loop dimension.\n",
77+
"```\n",
78+
"\n",
79+
"## Vectorization\n",
80+
"\n",
81+
"The pattern of looping over any number of \"loop dimensions\" and applying a function along \"core dimensions\" \n",
82+
"is so common that numpy provides wrappers that automate these steps: \n",
83+
"1. [numpy.apply_along_axis](https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axis.html)\n",
84+
"1. [numpy.apply_over_axes](https://numpy.org/doc/stable/reference/generated/numpy.apply_over_axes.html)\n",
85+
"1. [numpy.vectorize](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html)\n",
86+
"\n",
87+
"\n",
88+
"`apply_ufunc` provides an easy interface to `numpy.vectorize` through the keyword argument `vectorize`. Here we see how to use\n",
89+
"that to automatically apply `np.interp` along a single axis of a nD array\n",
90+
"\n",
91+
"## Load data\n",
92+
"\n",
93+
"First lets load an example dataset\n",
94+
"\n",
95+
"```{tip}\n",
96+
"We'll reduce the length of error messages using `%xmode minimal` See the [ipython documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xmode) for details.\n",
97+
"```"
98+
]
99+
},
100+
{
101+
"cell_type": "code",
102+
"execution_count": null,
103+
"id": "76aa13b8-5ced-4468-a72e-6b0a29172d6d",
104+
"metadata": {
105+
"tags": []
106+
},
107+
"outputs": [],
108+
"source": [
109+
"%xmode minimal\n",
110+
"\n",
111+
"import xarray as xr\n",
112+
"import numpy as np\n",
113+
"\n",
114+
"xr.set_options(display_expand_data=False)\n",
115+
"\n",
116+
"air = (\n",
117+
" xr.tutorial.load_dataset(\"air_temperature\")\n",
118+
" .air.sortby(\"lat\") # np.interp needs coordinate in ascending order\n",
119+
" .isel(time=slice(4), lon=slice(3)) # choose a small subset for convenience\n",
120+
")\n",
121+
"air"
122+
]
123+
},
124+
{
125+
"cell_type": "markdown",
126+
"id": "81356724-6c1a-4d4a-9a32-bb906a9419b2",
127+
"metadata": {
128+
"tags": []
129+
},
130+
"source": [
131+
"## Review\n",
132+
"\n",
133+
"\n",
134+
"We'll work with the `apply_ufunc` call from the section on [handling dimensions that change size](complex-output-change-size). See the \"Handling Complex Output\" section for how to get here.\n",
135+
"\n",
136+
"This version only works with 1D vectors. We will expand that to work with inputs of any number of dimensions."
137+
]
138+
},
139+
{
140+
"cell_type": "code",
141+
"execution_count": null,
142+
"id": "cb286fa0-deba-4929-b18a-79af5acb0b5b",
143+
"metadata": {
144+
"tags": []
145+
},
146+
"outputs": [],
147+
"source": [
148+
"newlat = np.linspace(15, 75, 100)\n",
149+
"\n",
150+
"xr.apply_ufunc(\n",
151+
" np.interp, # first the function\n",
152+
" newlat,\n",
153+
" air.lat,\n",
154+
" air.isel(lon=0, time=0), # this version only works with 1D vectors\n",
155+
" input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n",
156+
" output_core_dims=[[\"lat\"]],\n",
157+
" exclude_dims={\"lat\"},\n",
158+
")"
159+
]
160+
},
161+
{
162+
"cell_type": "markdown",
163+
"id": "e3382658-14e1-4842-a618-ce7a27948c31",
164+
"metadata": {
165+
"tags": []
166+
},
167+
"source": [
168+
"## Try nD input\n",
169+
"\n",
170+
"Our goal is to interpolate latitude at every longitude and time, such that we go from a dataset with dimensions `(time: 4, lat: 25, lon: 3)` to `(time: 4, lat: 100, lon: 3)`. \n",
171+
"\n",
172+
"If we blindly try passing `air` (a 3D DataArray), we get a hard-to-understand error"
173+
]
174+
},
175+
{
176+
"cell_type": "code",
177+
"execution_count": null,
178+
"id": "1476bcce-cc7b-4252-90dd-f45502dffb09",
179+
"metadata": {
180+
"tags": [
181+
"raises-exception"
182+
]
183+
},
184+
"outputs": [],
185+
"source": [
186+
"newlat = np.linspace(15, 75, 100)\n",
187+
"\n",
188+
"xr.apply_ufunc(\n",
189+
" np.interp, # first the function\n",
190+
" newlat,\n",
191+
" air.lat,\n",
192+
" air,\n",
193+
" input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n",
194+
" output_core_dims=[[\"lat\"]],\n",
195+
" exclude_dims={\"lat\"},\n",
196+
")"
197+
]
198+
},
199+
{
200+
"cell_type": "markdown",
201+
"id": "1d1da9c2-a634-4920-890c-74d9bec9eab9",
202+
"metadata": {
203+
"tags": []
204+
},
205+
"source": [
206+
"We will use a \"wrapper\" function `debug_interp` to examine what gets passed to `numpy.interp`.\n",
207+
"\n",
208+
"```{tip}\n",
209+
"Such wrapper functions are a great way to understand and debug `apply_ufunc` use cases.\n",
210+
"```"
211+
]
212+
},
213+
{
214+
"cell_type": "code",
215+
"execution_count": null,
216+
"id": "fa306dcf-eec3-425c-b278-42d15bbc0e4f",
217+
"metadata": {
218+
"tags": [
219+
"raises-exception"
220+
]
221+
},
222+
"outputs": [],
223+
"source": [
224+
"def debug_interp(xi, x, data):\n",
225+
" print(f\"data: {data.shape} | x: {x.shape} | xi: {xi.shape}\")\n",
226+
" return np.interp(xi, x, data)\n",
227+
"\n",
228+
"\n",
229+
"interped = xr.apply_ufunc(\n",
230+
" debug_interp, # first the function\n",
231+
" newlat,\n",
232+
" air.lat,\n",
233+
" air,\n",
234+
" input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n",
235+
" output_core_dims=[[\"lat\"]],\n",
236+
" exclude_dims={\"lat\"}, # dimensions allowed to change size. Must be set!\n",
237+
")"
238+
]
239+
},
240+
{
241+
"cell_type": "markdown",
242+
"id": "6f5c928b-f8cb-4016-9d6d-39743f9c2976",
243+
"metadata": {
244+
"tags": []
245+
},
246+
"source": [
247+
"That's a hard-to-interpret error from NumPy but our `print` call helpfully printed the shapes of the input data: \n",
248+
"\n",
249+
" data: (4, 3, 25) | x: (25,) | xi: (100,)\n",
250+
"\n",
251+
"We see that `apply_ufunc` passes the full 3D array to `interp1d_np` which in turn passes that on to `numpy.interp`. But `numpy.interp` requires a 1D input, and thus the error.\n",
252+
"\n",
253+
"Instead of passing the full 3D array we want loop over all combinations of `lon` and `time`; and apply our function to each corresponding vector of data along `lat`."
254+
]
255+
},
256+
{
257+
"cell_type": "markdown",
258+
"id": "737cc6b4-522f-488c-9124-524cc42ebef3",
259+
"metadata": {
260+
"tags": []
261+
},
262+
"source": [
263+
"## Vectorization with `np.vectorize`\n"
264+
]
265+
},
266+
{
267+
"cell_type": "markdown",
268+
"id": "b6dac8da-8420-4fc4-9aeb-29b8999d4b37",
269+
"metadata": {
270+
"tags": []
271+
},
272+
"source": [
273+
"`apply_ufunc` makes it easy to loop over the loop dimensions by specifying `vectorize=True`:\n",
274+
"\n",
275+
" vectorize : bool, optional\n",
276+
" If True, then assume ``func`` only takes arrays defined over core\n",
277+
" dimensions as input and vectorize it automatically with\n",
278+
" :py:func:`numpy.vectorize`. This option exists for convenience, but is\n",
279+
" almost always slower than supplying a pre-vectorized function.\n",
280+
" Using this option requires NumPy version 1.12 or newer.\n",
281+
" \n",
282+
"\n",
283+
"```{warning}\n",
284+
"Also see the numpy documentation for [numpy.vectorize](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html). Most importantly\n",
285+
"\n",
286+
" The vectorize function is provided primarily for convenience, not for performance. \n",
287+
" The implementation is essentially a for loop.\n",
288+
"```"
289+
]
290+
},
291+
{
292+
"cell_type": "code",
293+
"execution_count": null,
294+
"id": "d72fdd8c-44d2-4f6e-9fc4-7084e0e49986",
295+
"metadata": {
296+
"tags": [],
297+
"user_expressions": []
298+
},
299+
"outputs": [],
300+
"source": [
301+
"interped = xr.apply_ufunc(\n",
302+
" debug_interp, # first the function\n",
303+
" newlat,\n",
304+
" air.lat,\n",
305+
" air,\n",
306+
" input_core_dims=[[\"lat\"], [\"lat\"], [\"lat\"]],\n",
307+
" output_core_dims=[[\"lat\"]],\n",
308+
" exclude_dims={\"lat\"}, # dimensions allowed to change size. Must be set!\n",
309+
" vectorize=True,\n",
310+
")\n",
311+
"interped"
312+
]
313+
},
314+
{
315+
"cell_type": "markdown",
316+
"id": "d81f399e-1649-4d4b-ad28-81cba8403210",
317+
"metadata": {
318+
"tags": []
319+
},
320+
"source": [
321+
"Wow that worked!\n",
322+
"\n",
323+
"Notice that \n",
324+
"1. the printed input shapes are all 1D and correspond to one vector of size 25 along the `lat` dimension.\n",
325+
"2. `debug_interp` was called 4x3 = 12 times which is the total number `lat` vectors since the size along `time` is 4, and the size along `lon` is 3.\n",
326+
"3. The result `interped` is now an xarray object with coordinate values copied over from `data`. \n",
327+
"\n",
328+
"\n",
329+
"```{note}\n",
330+
"`lat` is now the *last* dimension in `interped`. This is a \"property\" of core dimensions: they are moved to the end before being sent to `interp1d_np` as noted in the docstring for `input_core_dims`\n",
331+
"\n",
332+
" Core dimensions are automatically moved to the last axes of input\n",
333+
" variables before applying ``func``, which facilitates using NumPy style\n",
334+
" generalized ufuncs [2]_.\n",
335+
"```\n",
336+
"\n",
337+
"## Conclusion\n",
338+
"This is why `apply_ufunc` is so convenient; it takes care of a lot of code necessary to apply functions that consume and produce numpy arrays to xarray objects.\n",
339+
"\n",
340+
"The `vectorize` keyword argument, when set to True, will use `numpy.vectorize` to apply the function by looping over the \"loop dimensions\" --- dimensions that are not the core dimensions for the applied function."
341+
]
342+
}
343+
],
344+
"metadata": {
345+
"language_info": {
346+
"codemirror_mode": {
347+
"name": "ipython",
348+
"version": 3
349+
},
350+
"file_extension": ".py",
351+
"mimetype": "text/x-python",
352+
"name": "python",
353+
"nbconvert_exporter": "python",
354+
"pygments_lexer": "ipython3"
355+
}
356+
},
357+
"nbformat": 4,
358+
"nbformat_minor": 5
359+
}

0 commit comments

Comments
 (0)