Skip to content

Bring back updated GLM out of sample prediction notebook #486

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 28, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
115,979 changes: 414 additions & 115,565 deletions examples/generalized_linear_models/GLM-out-of-sample-predictions.ipynb

Large diffs are not rendered by default.

264 changes: 133 additions & 131 deletions examples/generalized_linear_models/GLM-out-of-sample-predictions.myst.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,87 +5,66 @@ jupytext:
format_name: myst
format_version: 0.13
kernelspec:
display_name: 'Python 3.7.6 64-bit (''website_projects'': conda)'
metadata:
interpreter:
hash: fbddea5140024843998ae64bf59a7579a9660d103062604797ea5984366c686c
display_name: pymc_env
language: python
name: python3
---

# GLM in PyMC3: Out-Of-Sample Predictions
(GLM-out-of-sample-predictions)=
# Out-Of-Sample Predictions

In this notebook I explore the [glm](https://docs.pymc.io/api/glm.html) module of [PyMC3](https://docs.pymc.io/). I am particularly interested in the model definition using [patsy](https://patsy.readthedocs.io/en/latest/) formulas, as it makes the model evaluation loop faster (easier to include features and/or interactions). There are many good resources on this subject, but most of them evaluate the model in-sample. For many applications we require doing predictions on out-of-sample data. This experiment was motivated by the discussion of the thread ["Out of sample" predictions with the GLM sub-module](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) on the (great!) forum [discourse.pymc.io/](https://discourse.pymc.io/), thank you all for your input!

**Resources**


- [PyMC3 Docs: Example Notebooks](https://docs.pymc.io/nb_examples/index.html)

- In particular check [GLM: Logistic Regression](https://docs.pymc.io/notebooks/GLM-logistic.html)

- [Bambi](https://bambinos.github.io/bambi/), a more complete implementation of the GLM submodule which also allows for mixed-effects models.

- [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb)
- [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/)

+++

## Prepare Notebook
:::{post} December, 2022
:tags: generalized linear model, logistic regression, out of sample predictions, patsy
:category: beginner
:::

```{code-cell} ipython3
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set_style(style="darkgrid", rc={"axes.facecolor": ".9", "grid.color": ".8"})
sns.set_palette(palette="deep")
sns_c = sns.color_palette(palette="deep")

import arviz as az
import patsy
import pymc3 as pm
import pymc as pm
import seaborn as sns

from pymc3 import glm
from scipy.special import expit as inverse_logit
from sklearn.metrics import RocCurveDisplay, accuracy_score, auc, roc_curve
from sklearn.model_selection import train_test_split
```

plt.rcParams["figure.figsize"] = [7, 6]
plt.rcParams["figure.dpi"] = 100
```{code-cell} ipython3
RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)
az.style.use("arviz-darkgrid")
```

## Generate Sample Data

We want to fit a logistic regression model where there is a multiplicative interaction between two numerical features.

```{code-cell} ipython3
SEED = 42
np.random.seed(SEED)

# Number of data points.
# Number of data points
n = 250
# Create features.
x1 = np.random.normal(loc=0.0, scale=2.0, size=n)
x2 = np.random.normal(loc=0.0, scale=2.0, size=n)
epsilon = np.random.normal(loc=0.0, scale=0.5, size=n)
# Define target variable.
# Create features
x1 = rng.normal(loc=0.0, scale=2.0, size=n)
x2 = rng.normal(loc=0.0, scale=2.0, size=n)
# Define target variable
intercept = -0.5
beta_x1 = 1
beta_x2 = -1
beta_interaction = 2
z = intercept + beta_x1 * x1 + beta_x2 * x2 + beta_interaction * x1 * x2
p = 1 / (1 + np.exp(-z))
y = np.random.binomial(n=1, p=p, size=n)

p = inverse_logit(z)
# note binimial with n=1 is equal to a bernoulli
y = rng.binomial(n=1, p=p, size=n)
df = pd.DataFrame(dict(x1=x1, x2=x2, y=y))

df.head()
```

Let us do some exploration of the data:

```{code-cell} ipython3
sns.pairplot(
data=df, kind="scatter", height=2, plot_kws={"color": sns_c[1]}, diag_kws={"color": sns_c[2]}
);
sns.pairplot(data=df, kind="scatter");
```

- $x_1$ and $x_2$ are not correlated.
Expand All @@ -94,91 +73,85 @@ sns.pairplot(

```{code-cell} ipython3
fig, ax = plt.subplots()
sns_c_div = sns.diverging_palette(240, 10, n=2)
sns.scatterplot(x="x1", y="x2", data=df, hue="y", palette=[sns_c_div[0], sns_c_div[-1]])
ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
sns.scatterplot(x="x1", y="x2", data=df, hue="y")
ax.legend(title="y")
ax.set(title="Sample Data", xlim=(-9, 9), ylim=(-9, 9));
```

## Prepare Data for Modeling

I wanted to use the *`classmethod`* `from_formula` (see [documentation](https://docs.pymc.io/api/glm.html)), but I was not able to generate out-of-sample predictions with this approach (if you find a way please let me know!). As a workaround, I created the features from a formula using [patsy](https://patsy.readthedocs.io/en/latest/) directly and then use *`class`* `pymc3.glm.linear.GLM` (this was motivated by going into the [source code](https://github.com/pymc-devs/pymc3/blob/master/pymc3/glm/linear.py)).

```{code-cell} ipython3
# Define model formula.
formula = "y ~ x1 * x2"
# Create features.
y, x = patsy.dmatrices(formula_like=formula, data=df)
y, x = patsy.dmatrices("y ~ x1 * x2", data=df)
y = np.asarray(y).flatten()
labels = x.design_info.column_names
x = np.asarray(x)
```

As pointed out on the [thread](https://discourse.pymc.io/t/out-of-sample-predictions-with-the-glm-sub-module/773) (thank you @Nicky!), we need to keep the labels of the features in the design matrix.

```{code-cell} ipython3
print(f"labels = {labels}")
```

Now we do a train-test split.

```{code-cell} ipython3
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=SEED)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)
```

## Define and Fit the Model

We now specify the model in PyMC3.
We now specify the model in PyMC.

```{code-cell} ipython3
with pm.Model() as model:
# Set data container.
data = pm.Data("data", x_train)
# Define GLM family.
family = pm.glm.families.Binomial()
# Set priors.
priors = {
"Intercept": pm.Normal.dist(mu=0, sd=10),
"x1": pm.Normal.dist(mu=0, sd=10),
"x2": pm.Normal.dist(mu=0, sd=10),
"x1:x2": pm.Normal.dist(mu=0, sd=10),
}
# Specify model.
glm.GLM(y=y_train, x=data, family=family, intercept=False, labels=labels, priors=priors)
# Configure sampler.
trace = pm.sample(5000, chains=5, tune=1000, target_accept=0.87, random_seed=SEED)
COORDS = {"coeffs": labels}

with pm.Model(coords=COORDS) as model:
# data containers
X = pm.MutableData("X", x_train)
y = pm.MutableData("y", y_train)
# priors
b = pm.Normal("b", mu=0, sigma=1, dims="coeffs")
# linear model
mu = pm.math.dot(X, b)
# link function
p = pm.Deterministic("p", pm.math.invlogit(mu))
# likelihood
pm.Bernoulli("obs", p=p, observed=y)

pm.model_to_graphviz(model)
```

```{code-cell} ipython3
# Plot chains.
az.plot_trace(data=trace);
with model:
idata = pm.sample()
```

```{code-cell} ipython3
az.summary(trace)
az.plot_trace(idata, var_names="b", compact=False);
```

The chains look good.

+++
```{code-cell} ipython3
az.summary(idata, var_names="b")
```

And we do a good job of recovering the true parameters for this simulated dataset.

```{code-cell} ipython3
az.plot_posterior(
idata, var_names=["b"], ref_val=[intercept, beta_x1, beta_x2, beta_interaction], figsize=(15, 4)
);
```

## Generate Out-Of-Sample Predictions

Now we generate predictions on the test set.

```{code-cell} ipython3
# Update data reference.
pm.set_data({"data": x_test}, model=model)
# Generate posterior samples.
ppc_test = pm.sample_posterior_predictive(trace, model=model, samples=1000)
with model:
pm.set_data({"X": x_test, "y": y_test})
idata.extend(pm.sample_posterior_predictive(idata))
```

```{code-cell} ipython3
# Compute the point prediction by taking the mean
# and defining the category via a threshold.
p_test_pred = ppc_test["y"].mean(axis=0)
# Compute the point prediction by taking the mean and defining the category via a threshold.
p_test_pred = idata.posterior_predictive["obs"].mean(dim=["chain", "draw"])
y_test_pred = (p_test_pred >= 0.5).astype("int")
```

Expand All @@ -187,24 +160,20 @@ y_test_pred = (p_test_pred >= 0.5).astype("int")
First let us compute the accuracy on the test set.

```{code-cell} ipython3
from sklearn.metrics import accuracy_score

print(f"accuracy = {accuracy_score(y_true=y_test, y_pred=y_test_pred): 0.3f}")
```

Next, we plot the [roc curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and compute the [auc](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve).

```{code-cell} ipython3
from sklearn.metrics import RocCurveDisplay, auc, roc_curve

fpr, tpr, thresholds = roc_curve(
y_true=y_test, y_score=p_test_pred, pos_label=1, drop_intermediate=False
)
roc_auc = auc(fpr, tpr)

fig, ax = plt.subplots()
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
roc_display = roc_display.plot(ax=ax, marker="o", color=sns_c[4], markersize=4)
roc_display = roc_display.plot(ax=ax, marker="o", markersize=4)
ax.set(title="ROC");
```

Expand Down Expand Up @@ -237,60 +206,93 @@ Observe that this curve is a hyperbola centered at the singularity point $x_1 =
Let us now plot the model decision boundary using a grid:

```{code-cell} ipython3
# Construct grid.
x1_grid = np.linspace(start=-9, stop=9, num=300)
x2_grid = x1_grid

x1_mesh, x2_mesh = np.meshgrid(x1_grid, x2_grid)

x_grid = np.stack(arrays=[x1_mesh.flatten(), x2_mesh.flatten()], axis=1)

# Create features on the grid.
x_grid_ext = patsy.dmatrix(formula_like="x1 * x2", data=dict(x1=x_grid[:, 0], x2=x_grid[:, 1]))

x_grid_ext = np.asarray(x_grid_ext)
def make_grid():
x1_grid = np.linspace(start=-9, stop=9, num=300)
x2_grid = x1_grid
x1_mesh, x2_mesh = np.meshgrid(x1_grid, x2_grid)
x_grid = np.stack(arrays=[x1_mesh.flatten(), x2_mesh.flatten()], axis=1)
return x1_grid, x2_grid, x_grid


x1_grid, x2_grid, x_grid = make_grid()

with model:
# Create features on the grid.
x_grid_ext = patsy.dmatrix(formula_like="x1 * x2", data=dict(x1=x_grid[:, 0], x2=x_grid[:, 1]))
x_grid_ext = np.asarray(x_grid_ext)
# set the observed variables
pm.set_data({"X": x_grid_ext})
# calculate pushforward values of `p`
ppc_grid = pm.sample_posterior_predictive(idata, var_names=["p"])
```

# Generate model predictions on the grid.
pm.set_data({"data": x_grid_ext}, model=model)
ppc_grid = pm.sample_posterior_predictive(trace, model=model, samples=1000)
```{code-cell} ipython3
# grid of predictions
grid_df = pd.DataFrame(x_grid, columns=["x1", "x2"])
grid_df["p"] = ppc_grid.posterior_predictive.p.mean(dim=["chain", "draw"])
p_grid = grid_df.pivot(index="x2", columns="x1", values="p").to_numpy()
```

Now we compute the model decision boundary on the grid for visualization purposes.

```{code-cell} ipython3
numerator = -(trace["Intercept"].mean(axis=0) + trace["x1"].mean(axis=0) * x1_grid)
denominator = trace["x2"].mean(axis=0) + trace["x1:x2"].mean(axis=0) * x1_grid
bd_grid = numerator / denominator

grid_df = pd.DataFrame(x_grid, columns=["x1", "x2"])
grid_df["p"] = ppc_grid["y"].mean(axis=0)
grid_df.sort_values("p", inplace=True)

p_grid = grid_df.pivot(index="x2", columns="x1", values="p").to_numpy()
def calc_decision_boundary(idata, x1_grid):
# posterior mean of coefficients
intercept = idata.posterior["b"].sel(coeffs="Intercept").mean().data
b1 = idata.posterior["b"].sel(coeffs="x1").mean().data
b2 = idata.posterior["b"].sel(coeffs="x2").mean().data
b1b2 = idata.posterior["b"].sel(coeffs="x1:x2").mean().data
# decision boundary equation
return -(intercept + b1 * x1_grid) / (b2 + b1b2 * x1_grid)
```

We finally get the plot and the predictions on the test set:

```{code-cell} ipython3
fig, ax = plt.subplots()
cmap = sns.diverging_palette(240, 10, n=50, as_cmap=True)

# data
sns.scatterplot(
x=x_test[:, 1].flatten(),
y=x_test[:, 2].flatten(),
hue=y_test,
palette=[sns_c_div[0], sns_c_div[-1]],
ax=ax,
)
sns.lineplot(x=x1_grid, y=bd_grid, color="black", ax=ax)
ax.contourf(x1_grid, x2_grid, p_grid, cmap=cmap, alpha=0.3)

# decision boundary
ax.plot(x1_grid, calc_decision_boundary(idata, x1_grid), color="black", linestyle=":")

# grid of predictions
ax.contourf(x1_grid, x2_grid, p_grid, alpha=0.3)

ax.legend(title="y", loc="center left", bbox_to_anchor=(1, 0.5))
ax.lines[0].set_linestyle("dotted")
ax.set(title="Model Decision Boundary", xlim=(-9, 9), ylim=(-9, 9), xlabel="x1", ylabel="x2");
```

**Remark:** Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete distribution (similarly for other metrics like accuracy and auc). One way of doing this is by storing and computing it inside the model definition as a `Deterministic` variable as in [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb).
Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete distribution (similarly for other metrics like accuracy and AUC).

+++

## References

- [Bayesian Analysis with Python (Second edition) - Chapter 4](https://github.com/aloctavodia/BAP/blob/master/code/Chp4/04_Generalizing_linear_models.ipynb)
- [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/)

+++

## Authors
- Created by [Juan Orduz](https://github.com/juanitorduz).
- Updated by [Benjamin T. Vincent](https://github.com/drbenvincent) to PyMC v4 in June 2022
- Re-executed by [Benjamin T. Vincent](https://github.com/drbenvincent) with PyMC v5 in December 2022

+++

## Watermark

```{code-cell} ipython3
%load_ext watermark
%watermark -n -u -v -iv -w
%watermark -n -u -v -iv -w -p pytensor
```

:::{include} ../page_footer.md
:::