(reinforcement_learning)=
:::{post} Aug 5, 2022
:tags: PyTensor, Reinforcement Learning
:category: advanced, how-to
:author: Ricardo Vieira
:::
Reinforcement Learning models are commonly used in behavioral research to model how animals and humans learn in situations where they get to make repeated choices that are followed by some form of feedback, such as a reward or a punishment.
In this notebook we will consider the simplest learning scenario, where there are only two possible actions. When an action is taken, it is always followed by an immediate reward. Finally, the outcome of each action is independent of the previous actions taken. This scenario is sometimes referred to as the multi-armed bandit problem.
Let's say that the two actions (e.g., left and right buttons) are associated with a unit reward 40% and 60% of the time, respectively. At the beginning the learning agent does not know which action $a$ is better, so they may start by assuming both actions have a mean value of 50%. We can store these values in a table, which is usually referred to as a $Q$ table:
$$
Q = \begin{cases}
    .5, & a = \text{left}\\
    .5, & a = \text{right}
\end{cases}
$$

When an action is chosen and a reward $r \in \{0, 1\}$ is observed, the estimated value of that action is updated as follows:

$$
Q_{a} = Q_{a} + \alpha (r - Q_{a})
$$
where $\alpha \in [0, 1]$ is a learning parameter that influences how much the value of an action is shifted towards the observed reward in each trial. Finally, the $Q$ table values are converted into action probabilities via the softmax transformation:

$$
P(a = \text{right}) = \frac{\exp(\beta Q_{\text{right}})}{\exp(\beta Q_{\text{right}}) + \exp(\beta Q_{\text{left}})}
$$
where the $\beta \in (0, +\infty)$ parameter determines the level of noise in the agent's choices. Larger values are associated with more deterministic choices and smaller values with increasingly random choices.
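The generative process described above translates almost line by line into a simulation. Below is a minimal sketch of the `generate_data` function referenced later in this notebook, together with a simplified plot; the random seed and the plotting details are assumptions, while the parameter values ($\alpha = .5$, $\beta = 5$, 150 trials, reward probabilities of $.4$ and $.6$) are the ones quoted in the next paragraph.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=2022)  # seed is an arbitrary choice


def generate_data(rng, alpha, beta, n=100, p_r=None):
    """Simulate n trials of a two-armed bandit task for a softmax Q-learning agent."""
    if p_r is None:
        p_r = [0.4, 0.6]
    actions = np.zeros(n, dtype="int")
    rewards = np.zeros(n, dtype="int")
    Qs = np.zeros((n, 2))

    Q = np.array([0.5, 0.5])  # initial Q table
    for i in range(n):
        # Softmax transformation of the Q values into action probabilities
        exp_Q = np.exp(beta * Q)
        prob_a = exp_Q / np.sum(exp_Q)

        # Simulate the choice and the probabilistic reward
        a = rng.choice([0, 1], p=prob_a)
        r = int(rng.random() < p_r[a])

        # Q-value update rule for the chosen action
        Q[a] = Q[a] + alpha * (r - Q[a])

        actions[i], rewards[i], Qs[i] = a, r, Q.copy()

    return actions, rewards, Qs


true_alpha, true_beta = 0.5, 5
n = 150
actions, rewards, Qs = generate_data(rng, true_alpha, true_beta, n)

# Simplified plot: Q values over trials plus the chosen actions (filled = rewarded)
_, ax = plt.subplots(figsize=(10, 3))
ax.plot(Qs[:, 0], lw=2, label="$Q_{left}$")
ax.plot(Qs[:, 1], lw=2, label="$Q_{right}$")
ax.scatter(np.arange(n), actions, c=np.where(rewards, "k", "none"), edgecolors="k", s=20)
ax.set(xlabel="trial", ylabel="Q value / action", yticks=[0, 0.5, 1])
ax.legend();
```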
The plot above shows a simulated run of 150 trials, with parameters $\alpha = .5$ and $\beta = 5$, and constant reward probabilities of $.4$ and $.6$ for the left (blue) and right (orange) actions, respectively.
Solid and empty dots indicate actions followed by rewards and no-rewards, respectively. The solid line shows the estimated $Q$ value for each action centered around the respective colored dots (the line is above its dots when the respective $Q$ value is above $.5$, and below otherwise). It can be seen that this value increases with rewards (solid dots) and decreases with non-rewards (empty dots).
The change in line height following each outcome is directly related to the $\alpha$ parameter. The influence of the $\beta$ parameter is more difficult to grasp, but one way to think about it is that the higher its value, the more an agent will stick to the action that has the highest estimated value, even if the difference between the two is quite small. Conversely, as this value approaches zero, the agent will start picking randomly between the two actions, regardless of their estimated values.
Having generated the data, the goal is now to 'invert the model' to estimate the learning parameters $\alpha$ and $\beta$. I start by doing it via Maximum Likelihood Estimation (MLE). This requires writing a custom function that computes the likelihood of the data given a potential $\alpha$ and $\beta$ and the fixed observed actions and rewards (actually, the function computes the negative log likelihood, in order to avoid underflow issues).
I employ the handy `scipy.optimize.minimize` function to quickly retrieve the values of $\alpha$ and $\beta$ that maximize the likelihood of the data (or, actually, minimize the negative log likelihood).
This was also helpful when I later wrote the PyTensor function that computes the choice probabilities in PyMC. First, the underlying logic is the same; the only thing that changes is the syntax. Second, it provides a way to be confident that I did not mess up, and that what I was actually computing was what I intended to.
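A sketch of what such a negative log-likelihood function could look like is shown below. The body is reconstructed from the description in the next paragraphs rather than copied verbatim, so treat the details as assumptions; `actions` and `rewards` are the arrays produced by the simulation sketch above.

```python
import numpy as np
import scipy.special


def llik_td(x, *args):
    # Unpack the free parameters and the data, as passed by scipy.optimize.minimize
    alpha, beta = x
    actions, rewards = args

    # Initial Q table
    Q = np.array([0.5, 0.5])

    log_prob_actions = np.zeros(len(actions))
    for t, (a, r) in enumerate(zip(actions, rewards)):
        # Log-probability of the observed action under the softmax policy,
        # computed with logsumexp for numerical stability
        log_prob_actions[t] = beta * Q[a] - scipy.special.logsumexp(beta * Q)

        # Q-value update rule for the chosen action
        Q[a] = Q[a] + alpha * (r - Q[a])

    # Negative sum of the log-probabilities, ignoring the first action
    return -np.sum(log_prob_actions[1:])
```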
The function `llik_td` is strikingly similar to the `generate_data` one, except that instead of simulating an action and reward in each trial, it stores the log-probability of the observed action.
The function `scipy.special.logsumexp` is used to compute the term $\log(\exp(\beta Q_{\text{right}}) + \exp(\beta Q_{\text{left}}))$ in a way that is more numerically stable.
In the end, the function returns the negative sum of all the log probabilities, which is equivalent to multiplying the probabilities in their original scale.
(The first action is ignored just to make the output comparable to the later PyTensor function. It doesn't actually change any estimation, as the initial probabilities are fixed and do not depend on either the $\alpha$ or $\beta$ parameters.)
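As a quick check, the function can be evaluated at the true parameter values from the simulation sketch above:

```python
llik_td([true_alpha, true_beta], *(actions, rewards))
```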
Above, I computed the negative log likelihood of the data given the true $\alpha$ and $\beta$ parameters.
Below, I let scipy find the MLE values for the two parameters:
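A minimal sketch of that call (the starting point and the choice of optimization method are assumptions):

```python
import scipy.optimize

x0 = [0.5, 1.0]  # arbitrary starting guess for alpha and beta
result = scipy.optimize.minimize(llik_td, x0, args=(actions, rewards), method="BFGS")
print(f"MLE: alpha = {result.x[0]:.2f} (true value = {true_alpha})")
print(f"MLE: beta = {result.x[1]:.2f} (true value = {true_beta})")
```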
The estimated MLE values are relatively close to the true ones. However, this procedure does not give any idea of the plausible uncertainty around these parameter values. To get that, I'll turn to PyMC for a Bayesian posterior estimation.
But before that, I will implement a simple vectorization optimization to the log-likelihood function that will be more similar to its PyTensor counterpart. The reason for this is to speed up the slower Bayesian inference engine down the road.
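A possible vectorized version is sketched below (reusing the NumPy and SciPy imports and the data arrays from the earlier sketches). It still loops over trials to build the full matrix of Q values, but applies the log-softmax to all trials in a single operation; the exact layout is an assumption consistent with the description above.

```python
def llik_td_vectorized(x, *args):
    alpha, beta = x
    actions, rewards = args
    n = len(actions)

    # Qs[t] holds the Q table used to choose the action at trial t
    Qs = np.full((n, 2), 0.5)
    for t, (a, r) in enumerate(zip(actions[:-1], rewards[:-1])):
        # Update the chosen action and carry the other one forward
        # (the Q values after the last trial are never used)
        Qs[t + 1, a] = Qs[t, a] + alpha * (r - Qs[t, a])
        Qs[t + 1, 1 - a] = Qs[t, 1 - a]

    # Vectorized log-softmax across all trials
    Qs_beta = Qs * beta
    log_prob_actions = Qs_beta - scipy.special.logsumexp(Qs_beta, axis=1, keepdims=True)

    # Log-probability of each observed action, skipping the first trial
    log_prob_obs_actions = log_prob_actions[np.arange(n), actions]
    return -np.sum(log_prob_obs_actions[1:])


llik_td_vectorized([true_alpha, true_beta], *(actions, rewards))
```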
The vectorized function gives the same results, but runs almost one order of magnitude faster.
When implemented as a PyTensor function, the difference between the vectorized and standard versions was not this drastic. Still, it ran twice as fast, which meant the model also sampled at twice the speed it would otherwise have!
The most challenging part was to create a PyTensor function/loop to estimate the Q values when sampling our parameters with PyMC.
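The core of that loop is a single-trial update function that `pytensor.scan` can apply recursively over the observed actions and rewards. The sketch below mirrors the NumPy update rule; the name `update_Q` is illustrative.

```python
import pytensor
import pytensor.tensor as pt


def update_Q(action, reward, Qs, alpha):
    """Single-trial Q-value update, applied recursively by pytensor.scan."""
    return pt.set_subtensor(Qs[action], Qs[action] + alpha * (reward - Qs[action]))
```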
Let's wrap it up in a function to test out if it's working as expected.
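A sketch of such a wrapper, returning the (positive) log-likelihood so it can later be handed to PyMC, followed by a quick compiled evaluation at the true parameter values to compare against the scipy result; the function and variable names are assumptions, and the data arrays come from the earlier sketches.

```python
def pytensor_llik_td(alpha, beta, actions, rewards):
    rewards = pt.as_tensor_variable(rewards)
    actions = pt.as_tensor_variable(actions)

    # Compute the Q values for every trial with a scan loop
    Qs = 0.5 * pt.ones((2,), dtype="float64")
    Qs, _ = pytensor.scan(
        fn=update_Q, sequences=[actions, rewards], outputs_info=[Qs], non_sequences=[alpha]
    )

    # Qs[t] is the table *after* trial t, so drop the last row and apply the softmax
    Qs = Qs[:-1] * beta
    logp_actions = Qs - pt.logsumexp(Qs, axis=1, keepdims=True)

    # Log-likelihood of the observed actions, skipping the first trial
    logp_actions = logp_actions[pt.arange(actions.shape[0] - 1), actions[1:]]
    return pt.sum(logp_actions)


# Compile and evaluate at the true parameter values (negated, to compare with scipy)
alpha_ = pt.scalar("alpha")
beta_ = pt.scalar("beta")
neg_loglike = pytensor.function(
    [alpha_, beta_], -pytensor_llik_td(alpha_, beta_, actions, rewards)
)
neg_loglike(true_alpha, true_beta)
```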
The same result is obtained, so we can be confident that the PyTensor loop is working as expected. We are now ready to implement the PyMC model.
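A sketch of the model: the custom log-likelihood term enters through `pm.Potential`, and the Beta(1, 1) and HalfNormal priors are plausible choices rather than prescribed ones.

```python
import arviz as az
import pymc as pm

with pm.Model() as m:
    alpha = pm.Beta(name="alpha", alpha=1, beta=1)
    beta = pm.HalfNormal(name="beta", sigma=10)

    # Custom likelihood term built from the PyTensor log-likelihood above
    like = pm.Potential(name="like", var=pytensor_llik_td(alpha, beta, actions, rewards))

    tr = pm.sample()

az.plot_trace(data=tr);
```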
In this example, the obtained posteriors are nicely centered around the MLE values. What we have gained is an idea of the plausible uncertainty around these values.
In this last section I provide an alternative implementation of the model using a Bernoulli likelihood.
:::{Note}
One reason why it's useful to use the Bernoulli likelihood is that one can then do prior and posterior predictive sampling as well as model comparison. With `pm.Potential` you cannot do that, because PyMC does not know what is a likelihood and what is a prior, nor how to generate random draws. Neither of these is a problem when using a `pm.Bernoulli` likelihood.
:::
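A sketch of this alternative parameterization: a helper (here called `right_action_probs`, an illustrative name) returns the per-trial probability of choosing the right action on its original scale, which is then used as the `p` of a `pm.Bernoulli` observed on the actions from the second trial onwards. It reuses `update_Q` and the data arrays from the sketches above, and the priors are again only plausible choices.

```python
def right_action_probs(alpha, beta, actions, rewards):
    rewards = pt.as_tensor_variable(rewards)
    actions = pt.as_tensor_variable(actions)

    # Compute the Q values with the same scan loop as before
    Qs = 0.5 * pt.ones((2,), dtype="float64")
    Qs, _ = pytensor.scan(
        fn=update_Q, sequences=[actions, rewards], outputs_info=[Qs], non_sequences=[alpha]
    )

    # Probability of the "right" action (coded as 1) for trials 1..n-1, on the original scale
    Qs = Qs[:-1] * beta
    logp_actions = Qs - pt.logsumexp(Qs, axis=1, keepdims=True)
    return pt.exp(logp_actions[:, 1])


with pm.Model() as m_alt:
    alpha = pm.Beta(name="alpha", alpha=1, beta=1)
    beta = pm.HalfNormal(name="beta", sigma=10)

    action_probs = right_action_probs(alpha, beta, actions, rewards)
    like = pm.Bernoulli(name="like", p=action_probs, observed=actions[1:])

    tr_alt = pm.sample()
```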
:::{bibliography}
:filter: docname in docnames
:::
- Authored by Ricardo Vieira in June 2022
- Adapted PyMC code from Maria Eckstein (GitHub, PyMC Discourse)
- Adapted MLE code from Robert Wilson and Anne Collins {cite:p}`collinswilson2019` (GitHub)
- Re-executed by Juan Orduz in August 2022 (pymc-examples#410)
:::{include} ../page_footer.md
:::