
Commit 82c0e4e

Added Generalized Extreme Value distribution example (#241)

* Added GEV distribution example
* Added commentary on divergences
* Updated for style and tags
* Bibtex commas
* Remove black from data formatting
* Updates following detailed review
* Typos
* ref tweak
* Minor improvements
* Updated idata & plot_dist_comparison following review
* bib file conformance
* Updated for latest changes to repo
* Removed pymc tags in first cell
* Updated nb per comments
* Corrected bib file

1 parent 1656c38 commit 82c0e4e

File tree

3 files changed (+1329, -0 lines)

examples/case_studies/GEV.ipynb

Lines changed: 1039 additions & 0 deletions
Large diffs are not rendered by default.

examples/references.bib

Lines changed: 49 additions & 0 deletions
```diff
@@ -19,6 +19,18 @@ @article{baio2010bayesian
   year      = {2010},
   publisher = {Taylor \& Francis}
 }
+@article{bali2003gev,
+  title   = {The generalized extreme value distribution},
+  journal = {Economics Letters},
+  volume  = {79},
+  number  = {3},
+  pages   = {423--427},
+  year    = {2003},
+  issn    = {0165-1765},
+  doi     = {10.1016/S0165-1765(03)00035-1},
+  url     = {https://www.sciencedirect.com/science/article/pii/S0165176503000351},
+  author  = {Turan G. Bali}
+}
 @article{bauer2005probing,
   title  = {Probing interactions in fixed and multilevel regression: Inferential and graphical techniques},
   author = {Bauer, Daniel J and Curran, Patrick J},
@@ -42,12 +54,49 @@ @book{breen1996regression
   year      = {1996},
   publisher = {Sage}
 }
+@inproceedings{caprani2009gev,
+  title     = {Estimating extreme highway bridge traffic load effects},
+  author    = {Colin C. Caprani and Eugene J. OBrien},
+  year      = {2010},
+  language  = {English},
+  isbn      = {9780415475570},
+  pages     = {1--8},
+  editor    = {Hitoshi Furuta and Frangopol, {Dan M} and Masanobu Shinozuka},
+  booktitle = {Proceedings of the 10th International Conference on Structural Safety and Reliability (ICOSSAR2009)},
+  publisher = {CRC Press},
+  url       = {https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.722.6789\&rep=rep1\&type=pdf}
+}
+@article{caprani2010gev,
+  title   = {The use of predictive likelihood to estimate the distribution of extreme bridge traffic load effect},
+  journal = {Structural Safety},
+  volume  = {32},
+  number  = {2},
+  pages   = {138--144},
+  year    = {2010},
+  issn    = {0167-4730},
+  doi     = {10.1016/j.strusafe.2009.09.001},
+  url     = {https://www.sciencedirect.com/science/article/pii/S016747300900071X},
+  author  = {Colin C. Caprani and Eugene J. OBrien}
+}
 @misc{carpenter2016hierarchical,
   title     = {Hierarchical partial pooling for repeated binary trials},
   author    = {Carpenter, Bob and Gabry, J and Goodrich, B},
   year      = {2016},
   publisher = {Technical report. Retrieved from https://mc-stan.org/users/documentation~\ldots{}}
 }
+@book{coles2001gev,
+  title     = {An introduction to statistical modeling of extreme values},
+  author    = {Coles, Stuart},
+  publisher = {Springer},
+  series    = {Springer Series in Statistics},
+  edition   = {2001},
+  month     = aug,
+  year      = {2001},
+  address   = {London, England},
+  language  = {en},
+  isbn      = {978-1-85233-459-8},
+  url       = {https://doi.org/10.1007/978-1-4471-3675-0}
+}
 @article{collinswilson2019,
   title  = {Ten simple rules for the computational modeling of behavioral data},
   author = {Wilson, Robert C and Collins, Anne GE},
```

myst_nbs/case_studies/GEV.myst.md

Lines changed: 241 additions & 0 deletions
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
  jupytext_version: 1.13.7
kernelspec:
  display_name: pymc4-dev
  language: python
  name: pymc4-dev
---

+++ {"tags": []}

# Generalized Extreme Value Distribution

:::{post} Sept 27, 2022
:tags: extreme, inference, prediction
:category: beginner
:author: Colin Caprani
:::

+++ {"tags": []}

## Introduction

The Generalized Extreme Value (GEV) distribution is a meta-distribution containing the Weibull, Gumbel, and Frechet families of extreme value distributions. It is used for modelling the distribution of extremes (maxima or minima) of stationary processes, such as the annual maximum wind speed, annual maximum truck weight on a bridge, and so on, without needing an *a priori* decision on the tail behaviour.

Following the parametrization used in {cite:t}`coles2001gev`, the GEV distribution for maxima is given by:

$$G(x) = \exp \left\{ -\left[ 1 + \xi \left( \frac{x - \mu}{\sigma} \right) \right]^{-\frac{1}{\xi}} \right\}$$

When:

- $\xi < 0$ we get the Weibull distribution with a bounded upper tail;
- $\xi = 0$, in the limit, we get the Gumbel distribution, unbounded in both tails;
- $\xi > 0$, we get the Frechet distribution, which is bounded in the lower tail.

Note that this parametrization of the shape parameter $\xi$ is opposite in sign to that used in SciPy (where it is denoted `c`). Further, the distribution for minima is readily examined by studying the distribution of maxima of the negative of the data.
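As a small editorial illustration of this sign convention (a sketch assuming SciPy is available; it is not otherwise used in this notebook), the Coles shape $\xi$ maps to SciPy's `c` via $c = -\xi$:

```{code-cell} ipython3
# Editorial sketch of the sign convention: SciPy's genextreme takes a shape
# parameter c, with c = -ξ relative to the Coles parametrization used here.
from scipy import stats

ξ = -0.1  # a Coles shape in the Weibull domain (bounded upper tail)
dist = stats.genextreme(c=-ξ)  # standard GEV: loc=0, scale=1
dist.support()  # upper bound is -1/ξ = 10 for ξ = -0.1
```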
We will use the example of the Port Pirie annual maximum sea-level data used in {cite:t}`coles2001gev`, and compare with the frequentist results presented there.

```{code-cell} ipython3
import aesara.tensor as at
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm
import pymc_experimental.distributions as pmx

from arviz.plots import plot_utils as azpu
```

## Data

The Port Pirie data is provided by {cite:t}`coles2001gev`, and repeated here:

```{code-cell} ipython3
# fmt: off
data = np.array([4.03, 3.83, 3.65, 3.88, 4.01, 4.08, 4.18, 3.80,
                 4.36, 3.96, 3.98, 4.69, 3.85, 3.96, 3.85, 3.93,
                 3.75, 3.63, 3.57, 4.25, 3.97, 4.05, 4.24, 4.22,
                 3.73, 4.37, 4.06, 3.71, 3.96, 4.06, 4.55, 3.79,
                 3.89, 4.11, 3.85, 3.86, 3.86, 4.21, 4.01, 4.11,
                 4.24, 3.96, 4.21, 3.74, 3.85, 3.88, 3.66, 4.11,
                 3.71, 4.18, 3.90, 3.78, 3.91, 3.72, 4.00, 3.66,
                 3.62, 4.33, 4.55, 3.75, 4.08, 3.90, 3.88, 3.94,
                 4.33])
# fmt: on
plt.hist(data)
plt.show()
```
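A quick numerical summary (a small editorial addition, using only the data above) is also useful when choosing priors below:

```{code-cell} ipython3
# Editorial sketch: summary statistics that inform the prior choices below.
print(f"n = {data.size}, mean = {data.mean():.2f} m, std = {data.std():.2f} m")
print(f"min = {data.min():.2f} m, max = {data.max():.2f} m")
```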
## Modelling & Prediction

In the modelling we wish to do two things:

- parameter inference on the GEV parameters, based on some fairly non-informative priors; and
- prediction of the 10-year return level.

Predictions of extreme values considering parameter uncertainty are easily accomplished in the Bayesian setting. It is interesting to compare this ease to the difficulties encountered by {cite:t}`caprani2010gev` in doing this in a frequentist setting. In any case, the predicted value at a probability of exceedance $p$ is given by:

$$ x_p = \mu - \frac{\sigma}{\xi} \left\{1 - \left[-\log\left(1-p\right)\right]^{-\xi} \right\} $$

This is a deterministic function of the parameter values, and so is accomplished using `pm.Deterministic` within the model context.

Consider then, the 10-year return period, for which $p = 1/10$:

```{code-cell} ipython3
p = 1 / 10
```
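As a quick plausibility check on this formula (an editorial sketch, not part of the original analysis), we can evaluate it at the MLE values reported in {cite:t}`coles2001gev` and quoted in the Comparison section below:

```{code-cell} ipython3
# Editorial sketch: the return-level formula at the Coles (2001) MLEs,
# (mu, sigma, xi) = (3.87, 0.198, -0.050); we expect roughly 4.3 m.
def return_level(mu, sigma, xi, p):
    return mu - sigma / xi * (1 - (-np.log(1 - p)) ** (-xi))

return_level(3.87, 0.198, -0.050, p)
```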
And now set up the model using priors estimated from a quick review of the histogram above:

- $\mu$: there is no real basis for considering anything other than a `Normal` distribution, with a standard deviation that limits negative outcomes;
- $\sigma$: this must be positive, and has a small value, so use a `HalfNormal` with a small standard deviation;
- $\xi$: we are agnostic to the tail behaviour, so centre this at zero, but limit it to the physically reasonable bounds of $\pm 0.6$, and keep it somewhat tight near zero.

```{code-cell} ipython3
:tags: []

# Optionally centre the data, depending on fitting and divergences
# cdata = (data - data.mean()) / data.std()

with pm.Model() as model:
    # Priors
    μ = pm.Normal("μ", mu=3.8, sigma=0.2)
    σ = pm.HalfNormal("σ", sigma=0.3)
    ξ = pm.TruncatedNormal("ξ", mu=0, sigma=0.2, lower=-0.6, upper=0.6)

    # Estimation
    gev = pmx.GenExtreme("gev", mu=μ, sigma=σ, xi=ξ, observed=data)
    # Return level
    z_p = pm.Deterministic("z_p", μ - σ / ξ * (1 - (-np.log(1 - p)) ** (-ξ)))
```

## Prior Predictive Checks

Let's get a feel for how well our selected priors cover the range of the data:

```{code-cell} ipython3
idata = pm.sample_prior_predictive(samples=1000, model=model)
az.plot_ppc(idata, group="prior", figsize=(12, 6))
ax = plt.gca()
ax.set_xlim([2, 6])
ax.set_ylim([0, 2]);
```

And we can look at the sampled values of the parameters using the `plot_posterior` function, passing in the `idata` object and specifying the `group` to be `"prior"`:

```{code-cell} ipython3
az.plot_posterior(
    idata, group="prior", var_names=["μ", "σ", "ξ"], hdi_prob="hide", point_estimate=None
);
```

## Inference

Press the magic Inference Button$^{\mathrm{TM}}$:

```{code-cell} ipython3
with model:
    trace = pm.sample(
        5000,
        cores=4,
        chains=4,
        tune=2000,
        initvals={"μ": -0.5, "σ": 1.0, "ξ": -0.1},
        target_accept=0.98,
    )
# add the posterior trace to the existing idata object
idata.extend(trace)
```

```{code-cell} ipython3
az.plot_trace(idata, var_names=["μ", "σ", "ξ"], figsize=(12, 12));
```

### Divergences

The trace (usually) exhibits divergences. The HMC/NUTS sampler can have problems when the bounds of support for parameters are approached, and since the bounds of the GEV change with the sign of $\xi$, it is difficult to offer a transformation that resolves this problem. One possible transformation, the Box-Cox, has been proposed by {cite:t}`bali2003gev`, but {cite:t}`caprani2009gev` find it numerically unstable, even for just maximum likelihood estimation. In any case, recommendations to alleviate divergence problems are:

- Increase the target acceptance ratio;
- Use more informative priors, especially limiting the shape parameter to physically reasonable values, typically $\xi \in [-0.5, 0.5]$;
- Decide upon the domain of attraction of the tail (i.e. Weibull, Gumbel, or Frechet), and use that distribution directly.

A quick count of the divergent transitions is sketched below.
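The divergent transitions are flagged in the trace's sample statistics, so they can be counted directly (a small editorial sketch using ArviZ's standard `diverging` flag):

```{code-cell} ipython3
# Editorial sketch: count divergent transitions across all chains and draws.
int(idata.sample_stats["diverging"].sum())
```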
### Inferences

The 95% credible interval range of the parameter estimates is:

```{code-cell} ipython3
az.hdi(idata, hdi_prob=0.95)
```

And examine the prediction distribution, considering parameter variability (and without needing to assume normality):

```{code-cell} ipython3
az.plot_posterior(idata, hdi_prob=0.95, var_names=["z_p"], round_to=4);
```

And let's compare the prior and posterior predictions of $z_p$ to see how the data has influenced things:

```{code-cell} ipython3
az.plot_dist_comparison(idata, var_names=["z_p"]);
```

## Comparison

To compare with the results given in {cite:t}`coles2001gev`, we approximate the maximum likelihood estimates (MLE) using the mode of the posterior distributions (the *maximum a posteriori*, or MAP, estimate). These are close when the prior is reasonably flat around the posterior estimate.

The MLE results given in {cite:t}`coles2001gev` are:

$$\left(\hat{\mu}, \hat{\sigma}, \hat{\xi} \right) = \left( 3.87, 0.198, -0.050 \right) $$

And the variance-covariance matrix of the estimates is:

$$ V = \left[ \begin{array}{ccc}
0.000780 & 0.000197 & -0.00107 \\
0.000197 & 0.000410 & -0.000778 \\
-0.00107 & -0.000778 & 0.00965
\end{array} \right] $$

Note that extracting the MLE estimates from our inference involves accessing some of the ArviZ back-end functions to wrangle the xarray into something examinable:

```{code-cell} ipython3
_, vals = az.sel_utils.xarray_to_ndarray(idata["posterior"], var_names=["μ", "σ", "ξ"])
mle = [azpu.calculate_point_estimate("mode", val) for val in vals]
mle
```

```{code-cell} ipython3
idata["posterior"].drop_vars("z_p").to_dataframe().cov().round(6)
```

The results are a good match, but the benefit of doing this in a Bayesian setting is that we get the full posterior joint distribution of the parameters and the return level, essentially for free. Compare this to the loose normality assumption and the computational effort required to obtain even the variance-covariance matrix, as done in {cite:t}`coles2001gev`.

Finally, we examine the pair plots to see where any difficulties in inference lie, using the divergences:

```{code-cell} ipython3
az.plot_pair(idata, var_names=["μ", "σ", "ξ"], kind="kde", marginals=True, divergences=True);
```

## Authors

* Authored by [Colin Caprani](https://github.com/ccaprani), October 2021

+++

## References

:::{bibliography}
:filter: docname in docnames
:::

+++

## Watermark

```{code-cell} ipython3
%load_ext watermark
%watermark -n -u -v -iv -w -p aesara,arviz
```
