(GLM-out-of-sample-predictions)=
:::{post} December, 2023
:tags: generalized linear model, logistic regression, out of sample predictions, patsy
:category: beginner
:::
We want to fit a logistic regression model where there is a multiplicative interaction between two numerical features.
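As a concrete example of such a data-generating process, we can simulate a dataset ourselves. This is only a sketch: the "true" parameter values, sample size, and seed below are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Hypothetical "true" parameters of the data-generating process.
b0_true, b1_true, b2_true, b12_true = -0.5, 1.0, -1.5, 2.0
n = 250

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Linear predictor with a multiplicative interaction term.
mu = b0_true + b1_true * x1 + b2_true * x2 + b12_true * x1 * x2
p = 1.0 / (1.0 + np.exp(-mu))  # inverse-logit link
y = rng.binomial(n=1, p=p)

df = pd.DataFrame({"x1": x1, "x2": x2, "y": y})
```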
Let us do some exploration of the data:
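For instance, a scatter plot of the two features colored by class makes the interaction structure visible (a sketch, assuming the simulated `df` from above):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(7, 6))
# Color the points by class to see the interaction structure.
sns.scatterplot(data=df, x="x1", y="x2", hue="y", ax=ax)
ax.set(title="Simulated data");
```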
Now we do a train-test split.
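For example, with scikit-learn's `train_test_split` (the 70/30 split ratio is an arbitrary choice):

```python
from sklearn.model_selection import train_test_split

X = df[["x1", "x2"]]
y = df["y"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)
```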
We now specify the model in PyMC.
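A minimal sketch of the model, assuming a recent PyMC version where `pm.Data` containers can be updated with `pm.set_data` (on older versions, `pm.MutableData`); the prior scales and sampler settings are illustrative choices:

```python
import arviz as az
import pymc as pm

with pm.Model() as model:
    # Shared data containers so we can swap in the test set later.
    x1_data = pm.Data("x1_data", X_train["x1"].to_numpy())
    x2_data = pm.Data("x2_data", X_train["x2"].to_numpy())
    y_data = pm.Data("y_data", y_train.to_numpy())

    # Weakly informative priors on the regression coefficients.
    b0 = pm.Normal("b0", mu=0.0, sigma=10.0)
    b1 = pm.Normal("b1", mu=0.0, sigma=10.0)
    b2 = pm.Normal("b2", mu=0.0, sigma=10.0)
    b12 = pm.Normal("b12", mu=0.0, sigma=10.0)

    # Linear predictor with the multiplicative interaction.
    mu = b0 + b1 * x1_data + b2 * x2_data + b12 * x1_data * x2_data
    p = pm.Deterministic("p", pm.math.sigmoid(mu))

    pm.Bernoulli("y", p=p, observed=y_data)

    idata = pm.sample(draws=1_000, chains=4, random_seed=42)

az.plot_trace(idata, var_names=["b0", "b1", "b2", "b12"]);
```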
The chains look good.
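To check parameter recovery, we can compare the posterior against the (hypothetical) true values used in the simulation above, e.g. via the `ref_val` argument of `az.plot_posterior`:

```python
# Compare the posterior against the hypothetical true values from the simulation.
az.plot_posterior(
    idata,
    var_names=["b0", "b1", "b2", "b12"],
    ref_val=[b0_true, b1_true, b2_true, b12_true],
);
```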
And we do a good job of recovering the true parameters for this simulated dataset.
Now we generate predictions on the test set.
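One possible pattern: swap the test inputs into the `pm.Data` containers and sample from the posterior predictive (the names match the model sketch above; `y_data` only fixes the shape of the likelihood here):

```python
with model:
    # Replace the training inputs with the test inputs. The values in
    # y_data are ignored when sampling the posterior predictive; they
    # only determine the shape of the likelihood.
    pm.set_data(
        {
            "x1_data": X_test["x1"].to_numpy(),
            "x2_data": X_test["x2"].to_numpy(),
            "y_data": np.zeros(len(X_test), dtype=int),
        }
    )
    pm.sample_posterior_predictive(
        idata, var_names=["p", "y"], extend_inferencedata=True, random_seed=42
    )
```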
First let us compute the accuracy on the test set.
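A simple point-prediction rule is to average the posterior predictive probabilities over chains and draws and threshold at $0.5$ (a sketch, following the objects defined above):

```python
# Average the posterior predictive probabilities over chains and draws,
# then threshold at 0.5 to get point predictions.
p_test = idata.posterior_predictive["p"].mean(dim=["chain", "draw"]).to_numpy()
y_pred = (p_test >= 0.5).astype(int)

accuracy = np.mean(y_pred == y_test.to_numpy())
print(f"test accuracy: {accuracy:.3f}")
```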
The model performs as expected (of course, here we know the data-generating process, which is almost never the case in practical applications).
Finally we will describe and plot the model decision boundary, which is the space defined as

$$\text{db} = \{(x_1, x_2) \in \mathbb{R}^2 \,|\, p(x_1, x_2) = 0.5\},$$

where $p$ denotes the probability of belonging to the class $y = 1$ output by the model. To make this set explicit, we simply write the condition in terms of the model parametrization:

$$0.5 = \frac{1}{1 + \exp\left(-\left(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2\right)\right)},$$

which implies

$$0 = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2.$$

Solving for $x_2$ we get the formula

$$x_2 = -\frac{\beta_0 + \beta_1 x_1}{\beta_2 + \beta_{12} x_1}.$$
Observe that this curve is a hyperbola centered at the singularity point $x_1 = - \beta_2 / \beta_{12}$.
Let us now plot the model decision boundary using a grid:
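A sketch of a grid constructor (the name `make_grid`, the limits, and the resolution are arbitrary choices):

```python
def make_grid(lim: float = 3.5, num: int = 200):
    """Square grid over the feature space; limits and resolution are arbitrary."""
    u = np.linspace(-lim, lim, num=num)
    g1, g2 = np.meshgrid(u, u)
    return u, g1, g2

u, g1, g2 = make_grid()
```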
Now we compute the model decision boundary on the grid for visualization purposes.
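For example, plugging the posterior means of the coefficients into the model gives a mean probability surface over the grid (a plug-in approximation, not the full posterior predictive):

```python
# Posterior means of the coefficients.
means = idata.posterior[["b0", "b1", "b2", "b12"]].mean(dim=["chain", "draw"])
b0_m, b1_m, b2_m, b12_m = (means[v].item() for v in ["b0", "b1", "b2", "b12"])

# Plug-in mean probability surface over the grid.
p_grid = 1.0 / (1.0 + np.exp(-(b0_m + b1_m * g1 + b2_m * g2 + b12_m * g1 * g2)))
```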
Finally, we plot the decision boundary together with the predictions on the test set:
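A sketch of such a plot, combining the probability surface, the $p = 0.5$ contour, and the test points:

```python
fig, ax = plt.subplots(figsize=(8, 6))
# Probability surface plus the p = 0.5 contour (the decision boundary).
cs = ax.contourf(g1, g2, p_grid, levels=20, cmap="RdBu_r", alpha=0.8)
ax.contour(g1, g2, p_grid, levels=[0.5], colors="black")
sns.scatterplot(x=X_test["x1"], y=X_test["x2"], hue=y_test, edgecolor="k", ax=ax)
fig.colorbar(cs, ax=ax, label="p")
ax.set(title="Decision boundary (posterior mean)", xlabel="x1", ylabel="x2");
```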
Note that we have computed the model decision boundary by using the mean of the posterior samples. However, we can generate a better (and more informative!) plot if we use the complete posterior distribution (and similarly for other metrics like accuracy and AUC).
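As a sketch of what using the full posterior could look like, we can draw the boundary curve implied by each of a subsample of posterior draws, using the closed-form expression for $x_2$ derived above (note the curves are undefined at the singularity $x_1 = -\beta_2 / \beta_{12}$):

```python
# Decision-boundary curves implied by a subsample of posterior draws.
post = az.extract(idata, var_names=["b0", "b1", "b2", "b12"], num_samples=200)

fig, ax = plt.subplots(figsize=(8, 6))
for i in range(post.sizes["sample"]):
    b0_s, b1_s, b2_s, b12_s = (post[v][i].item() for v in ["b0", "b1", "b2", "b12"])
    # x2 = -(b0 + b1 * x1) / (b2 + b12 * x1); undefined at x1 = -b2 / b12.
    ax.plot(u, -(b0_s + b1_s * u) / (b2_s + b12_s * u), color="C0", alpha=0.05)
ax.set(
    xlim=(u.min(), u.max()),
    ylim=(u.min(), u.max()),
    xlabel="x1",
    ylabel="x2",
    title="Decision boundary uncertainty",
);
```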
:::{include} ../page_footer.md
:::