(GLM-out-of-sample-predictions)=

# Out-Of-Sample Predictions

:::{post} December, 2023
:tags: generalized linear model, logistic regression, out of sample predictions, patsy
:category: beginner
:::

## Generate Sample Data

We want to fit a logistic regression model where there is a multiplicative interaction between two numerical features.
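A minimal sketch of one way to simulate such data; the sample size, seed, and true coefficient values below are illustrative assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

n = 250  # sample size (illustrative)

# Two uncorrelated standard-normal features.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Linear predictor with a multiplicative interaction term
# (the true coefficient values here are assumptions).
b0_true, b1_true, b2_true, b12_true = -0.5, 1.0, -1.0, 2.0
mu = b0_true + b1_true * x1 + b2_true * x2 + b12_true * x1 * x2

# Inverse-logit link and Bernoulli labels.
p = 1.0 / (1.0 + np.exp(-mu))
y = rng.binomial(n=1, p=p)

df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
```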

Let us do some exploration of the data (a short verification sketch follows the list):

- $x_1$ and $x_2$ are not correlated.
- Neither $x_1$ nor $x_2$ seems to separate the $y$-classes on its own.
- The distribution of $y$ is not highly unbalanced.
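These observations can be verified with a few lines, assuming the `df` generated above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Feature correlation (expected to be close to zero).
print(df[["x1", "x2"]].corr())

# Class balance of the target.
print(df["y"].value_counts(normalize=True))

# Features colored by class: neither axis alone separates the classes.
sns.scatterplot(data=df, x="x1", y="x2", hue="y")
plt.show()
```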

## Prepare Data for Modeling

Now we do a train-test split.
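One way to do this, building the design matrix with patsy (as the post's tags suggest) and splitting with scikit-learn; the variable names and test fraction are assumptions:

```python
import numpy as np
from patsy import dmatrices
from sklearn.model_selection import train_test_split

# "x1 * x2" expands to Intercept + x1 + x2 + x1:x2.
y_mat, x_mat = dmatrices("y ~ x1 * x2", data=df)
labels = x_mat.design_info.column_names

x_data = np.asarray(x_mat)
y_data = np.asarray(y_mat).flatten()

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.3, random_state=42
)
```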

## Define and Fit the Model

We now specify the model in PyMC.
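A sketch of how such a model might look in recent PyMC (v5); the prior scale and variable names are assumptions:

```python
import pymc as pm

with pm.Model(coords={"coeffs": labels}) as model:
    # Data container so the design matrix can be swapped for the test set later.
    X = pm.Data("X", x_train)
    # One weakly-informative prior per design-matrix column.
    b = pm.Normal("b", mu=0, sigma=5, dims="coeffs")
    # Linear predictor passed through the inverse-logit link.
    p = pm.Deterministic("p", pm.math.invlogit(pm.math.dot(X, b)))
    # Tying the likelihood shape to p allows a test set of a different size.
    pm.Bernoulli("y", p=p, observed=y_train, shape=p.shape)
    idata = pm.sample()
```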

The chains look good.

And we do a good job of recovering the true parameters for this simulated dataset.
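Both checks can be done with ArviZ, for instance:

```python
import arviz as az

# Trace plot: check that the chains mix well and look stationary.
az.plot_trace(idata, var_names=["b"])

# Posterior summary: compare the recovered coefficients (means and HDIs)
# against the true values used in the simulation.
print(az.summary(idata, var_names=["b"]))
```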

## Generate Out-Of-Sample Predictions

Now we generate predictions on the test set.
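With the data container in place, out-of-sample prediction amounts to swapping in the test design matrix and resampling; a sketch, continuing with the names above:

```python
with model:
    # Swap the training design matrix for the test one.
    pm.set_data({"X": x_test})
    # Resample the likelihood (and recompute p) for the new inputs.
    idata.extend(pm.sample_posterior_predictive(idata, var_names=["p", "y"]))

# Point predictions: posterior-predictive mean probability per test point.
p_test = idata.posterior_predictive["p"].mean(dim=("chain", "draw")).to_numpy()
```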

## Evaluate Model

First let us compute the accuracy on the test set.
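For example, thresholding the posterior-mean probabilities at 0.5 (assuming `p_test` and `y_test` from above):

```python
from sklearn.metrics import accuracy_score

# Threshold the posterior-mean probabilities at 0.5.
y_pred = (p_test >= 0.5).astype(int)
print(f"accuracy: {accuracy_score(y_true=y_test, y_pred=y_pred):.3f}")
```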

Next, we plot the ROC curve and compute the AUC.
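A sketch using scikit-learn's utilities:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, roc_auc_score

# AUC from the posterior-mean probabilities.
print(f"auc: {roc_auc_score(y_true=y_test, y_score=p_test):.3f}")

# ROC curve from the same scores.
RocCurveDisplay.from_predictions(y_true=y_test, y_pred=p_test)
plt.show()
```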

The model performs as expected (we of course know the data-generating process, which is almost never the case in practical applications).

## Model Decision Boundary

Finally, we will describe and plot the model decision boundary, which is the set defined as

$$\mathcal{B} = \{(x_1, x_2) \in \mathbb{R}^2 \:|\: p(x_1, x_2) = 0.5\}$$

where $p$ denotes the probability of belonging to the class $y=1$ output by the model. To make this set explicit, we simply write the condition in terms of the model parametrization:

$$0.5 = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1x_2))}$$

which implies

$$0 = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1x_2$$

Solving for $x_2$ we get the formula

$$x_2 = - \frac{\beta_0 + \beta_1 x_1}{\beta_2 + \beta_{12}x_1}$$

Observe that this curve is a hyperbola with a vertical asymptote at the singularity point $x_1 = - \beta_2 / \beta_{12}$.

Let us now plot the model decision boundary using a grid:
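For instance (the ranges below are illustrative):

```python
import numpy as np

# Regular grid covering the region of interest in feature space.
x1_grid = np.linspace(-5, 5, 200)
x2_grid = np.linspace(-5, 5, 200)
xx1, xx2 = np.meshgrid(x1_grid, x2_grid)
```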

Now we compute the model decision boundary on the grid for visualization purposes.
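A sketch using the posterior means of the coefficients, assuming the `idata` and grid from above; recall that the design-matrix columns are Intercept, x1, x2, and x1:x2:

```python
import numpy as np

# Posterior means of the coefficients, in design-matrix column order
# (Intercept, x1, x2, x1:x2).
b0, b1, b2, b12 = idata.posterior["b"].mean(dim=("chain", "draw")).to_numpy()

# Posterior-mean linear predictor and probability surface on the grid.
mu_grid = b0 + b1 * xx1 + b2 * xx2 + b12 * xx1 * xx2
p_grid = 1.0 / (1.0 + np.exp(-mu_grid))
```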

Finally, we plot the decision boundary together with the predictions on the test set:
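A sketch of the final plot, assuming the probability surface and test split from above (columns 1 and 2 of the design matrix hold x1 and x2):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Shade the posterior-mean probability surface.
cs = ax.contourf(xx1, xx2, p_grid, levels=20, cmap="RdBu_r", alpha=0.5)
fig.colorbar(cs, ax=ax, label="p(y=1)")
# The p = 0.5 level set is the decision boundary.
ax.contour(xx1, xx2, p_grid, levels=[0.5], colors="black")
# Overlay the test points (design-matrix columns 1 and 2 are x1 and x2).
ax.scatter(x_test[:, 1], x_test[:, 2], c=y_test, cmap="RdBu_r", edgecolor="k")
ax.set(xlabel="x1", ylabel="x2", title="Decision boundary (posterior mean)")
plt.show()
```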

Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete distribution (similarly for other metrics like accuracy and AUC).
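A sketch of that idea: draw the $p = 0.5$ contour for a subsample of posterior draws, which visualizes the uncertainty in the boundary itself.

```python
import matplotlib.pyplot as plt
import numpy as np

# Flatten chains and draws, then thin to a manageable subsample.
b_draws = idata.posterior["b"].stack(sample=("chain", "draw")).to_numpy()
idx = np.random.default_rng(0).choice(b_draws.shape[1], size=100, replace=False)

fig, ax = plt.subplots()
for i in idx:
    b0, b1, b2, b12 = b_draws[:, i]
    mu_draw = b0 + b1 * xx1 + b2 * xx2 + b12 * xx1 * xx2
    # One faint p = 0.5 contour per posterior draw.
    ax.contour(xx1, xx2, 1.0 / (1.0 + np.exp(-mu_draw)), levels=[0.5],
               colors="black", alpha=0.1)
ax.set(xlabel="x1", ylabel="x2", title="Decision boundary across posterior draws")
plt.show()
```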

## Authors

## Watermark

:::{include} ../page_footer.md
:::