(bayesian_ab_testing_intro)=
:::{post} May 23, 2021
:tags: case study, ab test
:category: beginner, tutorial
:author: Cuong Duong
:::
This notebook demonstrates how to implement a Bayesian analysis of an A/B test. We implement the models discussed in VWO's Bayesian A/B Testing Whitepaper {cite:p}`stucchio2015bayesian`, and discuss the effect of different prior choices for these models. This notebook does not discuss other related topics such as how to choose a prior, early stopping, and power analysis.
From https://vwo.com/ab-testing/:
> A/B testing (also known as split testing) is a process of showing two variants of the same web page to different segments of website visitors at the same time and comparing which variant drives more conversions.
Specifically, A/B tests are often used in the software industry to determine whether a new feature or changes to an existing feature should be released to users, and the impact of the change on core product metrics ("conversions"). Furthermore:
If you've studied controlled experiments in the context of biology, psychology, or other sciences, A/B testing will sound a lot like a controlled experiment - and that's because it is! The concept of control and treatment groups, and the principles of experimental design, are the building blocks of A/B testing. The main difference is the context in which the experiment is run: A/B tests are typically run by online software companies, where the subjects are visitors to the website / app, and the outcomes of interest are behaviours that can be tracked, such as signing up, purchasing a product, and returning to the website.
A/B tests are typically analysed with traditional hypothesis tests (e.g. a t-test), but another method is to use Bayesian statistics. This allows us to incorporate prior distributions and produce a range of outcomes as answers to the questions "is there a winning variant?" and "by how much?".
Let's first deal with a simple two-variant A/B test, where the metric of interest is the proportion of users performing an action (e.g. purchasing at least one item) - a Bernoulli conversion. Our variants are called A and B, where A refers to the existing landing page and B refers to the new page we want to test. The outcome that we want to perform statistical inference on is whether B is "better" than A, which depends on the underlying "true" conversion rates for each variant. We can formulate this as follows:
Let $\theta_A, \theta_B$ be the true conversion rates for variants A and B respectively. Then the outcome of whether a visitor converts in variant A is the random variable $\mathrm{Bernoulli}(\theta_A)$, and $\mathrm{Bernoulli}(\theta_B)$ for variant B. If we assume that visitors' behaviour on the landing page is independent of other visitors (a fair assumption), then the total conversions $y$ for a variant follow a Binomial distribution:

$$y \sim \mathrm{Binomial}(n = N, p = \theta)$$
Under a Bayesian framework, we assume the true conversion rates $\theta_A, \theta_B$ cannot be known, and instead they each follow a Beta distribution. The underlying rates are assumed to be independent (we would split traffic between each variant randomly, so one variant would not affect the other):

$$\theta_A \sim \mathrm{Beta}(\alpha, \beta), \quad \theta_B \sim \mathrm{Beta}(\alpha, \beta)$$
The observed data for the duration of the A/B test (the likelihood) is: the number of visitors $N$ landing on the page, and the number of visitors $y$ purchasing at least one item:

$$y_A \sim \mathrm{Binomial}(n = N_A, p = \theta_A), \quad y_B \sim \mathrm{Binomial}(n = N_B, p = \theta_B)$$
With this, we can sample from the joint posterior of $\theta_A, \theta_B$.
You may have noticed that the Beta distribution is the conjugate prior for the Binomial, so we don't need MCMC sampling to estimate the posterior (the exact solution can be found in the VWO paper). We'll still demonstrate how sampling can be done with PyMC though, and doing this makes it easier to extend the model with different priors, dependency assumptions, etc.
Finally, remember that our outcome of interest is whether B is better than A. A common measure in practice of whether B is better than A is the relative uplift in conversion rates, i.e. the percentage difference of $\theta_B$ over $\theta_A$:

$$\mathrm{reluplift}_B = \frac{\theta_B}{\theta_A} - 1$$
We'll implement this model setup in PyMC below.
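Below is a minimal sketch of this setup. The `BetaPrior` and `BinomialData` containers and the `ConversionModelTwoVariant` class are illustrative names of our own choosing, not a fixed API:

```python
from dataclasses import dataclass
from typing import List

import pymc as pm


@dataclass
class BetaPrior:
    alpha: float
    beta: float


@dataclass
class BinomialData:
    trials: int
    successes: int


class ConversionModelTwoVariant:
    def __init__(self, priors: BetaPrior):
        self.priors = priors

    def create_model(self, data: List[BinomialData]) -> pm.Model:
        trials = [d.trials for d in data]
        successes = [d.successes for d in data]
        with pm.Model() as model:
            # Independent Beta priors on the true conversion rates of A and B
            p = pm.Beta("p", alpha=self.priors.alpha, beta=self.priors.beta, shape=2)
            # Binomial likelihood for the observed conversions in each variant
            pm.Binomial("y", n=trials, p=p, observed=successes)
            # Relative uplift of B over A, computed from the posterior draws
            pm.Deterministic("reluplift_b", p[1] / p[0] - 1)
        return model
```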
Now that we've defined a class that can take a prior and our synthetic data as inputs, our first step is to choose an appropriate prior. There are a few things to consider when doing this in practice, but for the purpose of this notebook we'll focus on the following:
- A weakly informative prior, with small values of `alpha` and `beta`. For example, `alpha = 1, beta = 1` leads to a uniform distribution as a prior. If we were considering one distribution in isolation, setting this prior is a statement that we don't know anything about the value of the parameter, nor our confidence around it. In the context of A/B testing however, we're interested in comparing the relative uplift of one variant over another. With a weakly informative Beta prior, this relative uplift distribution is very wide, so we're implicitly saying that the variants could be very different from each other.
- A strong prior, with large values of `alpha` and `beta`. Contrary to the above, a strong prior would imply that the relative uplift distribution is thin, i.e. our prior belief is that the variants are not very different from each other.

We illustrate these points with prior predictive checks.
Note that we can pass in arbitrary values for the observed data in these prior predictive checks. PyMC will not use that data when sampling from the prior predictive distribution.
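As a sketch (the prior strengths here are our own choices, picked to roughly reproduce the HDIs quoted below):

```python
import arviz as az

# Dummy observed data: the values are ignored when sampling from the prior predictive
dummy = [BinomialData(trials=1, successes=1), BinomialData(trials=1, successes=1)]

with ConversionModelTwoVariant(BetaPrior(alpha=100, beta=100)).create_model(dummy):
    weak_prior = pm.sample_prior_predictive()

with ConversionModelTwoVariant(BetaPrior(alpha=10_000, beta=10_000)).create_model(dummy):
    strong_prior = pm.sample_prior_predictive()

# Compare the relative uplift distributions implied by each prior
az.plot_posterior(weak_prior.prior["reluplift_b"], hdi_prob=0.94)
az.plot_posterior(strong_prior.prior["reluplift_b"], hdi_prob=0.94)
```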
With the weak prior our 94% HDI for the relative uplift for B over A is roughly [-20%, +20%], whereas it is roughly [-2%, +2%] with the strong prior. This is effectively the "starting point" for the relative uplift distribution, and will affect how the observed conversions translate to the posterior distribution.
How we choose these priors in practice depends on broader context of the company running the A/B tests. A strong prior can help guard against false discoveries, but may require more data to detect winning variants when they exist (and more data = more time required running the test). A weak prior gives more weight to the observed data, but could also lead to more false discoveries as a result of early stopping issues.
Below we'll walk through the inference results from two different prior choices.
We generate two datasets: one where the "true" conversion rate of each variant is the same, and one where variant B has a higher true conversion rate.
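A sketch of the data generation, continuing from the class above (the true rates and sample sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=42)


def generate_binomial_data(true_rates: List[float], samples_per_variant: int) -> List[BinomialData]:
    """Simulate conversion counts for each variant from its assumed true rate."""
    return [
        BinomialData(trials=samples_per_variant, successes=int(rng.binomial(samples_per_variant, rate)))
        for rate in true_rates
    ]


# Scenario 1: both variants share the same true conversion rate
data_same_rates = generate_binomial_data([0.23, 0.23], samples_per_variant=100_000)
# Scenario 2: variant B's true conversion rate is higher
data_b_better = generate_binomial_data([0.21, 0.23], samples_per_variant=100_000)
```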
We'll also write a function to wrap the data generation, sampling, and posterior plots so that we can easily compare the results of both models (strong and weak prior) under both scenarios (same true rate vs. different true rate).
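For example (a sketch building on the pieces above; the function name and plotting choices are our own):

```python
import matplotlib.pyplot as plt


def run_scenario_twovariant(
    true_rates: List[float],
    samples_per_variant: int,
    weak_prior: BetaPrior,
    strong_prior: BetaPrior,
) -> None:
    """Generate data, fit the model under both priors, and plot the uplift posteriors."""
    data = generate_binomial_data(true_rates, samples_per_variant)
    fig, axs = plt.subplots(nrows=2, figsize=(7, 7))
    for prior, ax in zip([weak_prior, strong_prior], axs):
        with ConversionModelTwoVariant(prior).create_model(data):
            idata = pm.sample(draws=2000)
        az.plot_posterior(idata.posterior["reluplift_b"], ref_val=0, ax=ax)
        ax.set_title(f"Beta({prior.alpha}, {prior.beta}) prior")
    fig.tight_layout()


# Same true rate vs. B better, each under a weak and a strong prior
run_scenario_twovariant([0.23, 0.23], 100_000, BetaPrior(100, 100), BetaPrior(10_000, 10_000))
run_scenario_twovariant([0.21, 0.23], 100_000, BetaPrior(100, 100), BetaPrior(10_000, 10_000))
```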
The above examples demonstrate how to perform an A/B test analysis for a two-variant test with the simple Beta-Binomial model, and the benefits and disadvantages of choosing a weak vs. strong prior. In the next section we provide a guide for handling a multi-variant ("A/B/n") test.
We'll continue using Bernoulli conversions and the Beta-Binomial model in this section for simplicity. The focus is on how to analyse tests with 3 or more variants - e.g. instead of just having one different landing page to test, we have multiple ideas we want to test at once. How can we tell if there's a winner amongst all of them?
There are two main approaches we can take here:
1. As in the two-variant case, calculate the relative uplift of each variant's conversion rate over the control's (variant A's).
2. Calculate the relative uplift of each variant's conversion rate over the `max()` of the other variants.

Approach 1 is intuitive to most people, and is easily explained. But what if there are two variants that both beat the control, and we want to know which one is better? We can't make that inference with the individual uplift distributions. Approach 2 does handle this case - it effectively tries to find whether there is a clear winner or clear loser(s) amongst all the variants.
We'll implement the model setup for both approaches below, cleaning up our code from before so that it generalises to the n variant case. Note that we can also re-use this model for the 2-variant case.
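A sketch of such a generalised model (the class name and the `comparison_method` strings are our own):

```python
class ConversionModel:
    def __init__(self, priors: BetaPrior):
        self.priors = priors

    def create_model(self, data: List[BinomialData], comparison_method: str) -> pm.Model:
        num_variants = len(data)
        trials = [d.trials for d in data]
        successes = [d.successes for d in data]
        with pm.Model() as model:
            p = pm.Beta("p", alpha=self.priors.alpha, beta=self.priors.beta, shape=num_variants)
            pm.Binomial("y", n=trials, p=p, observed=successes)
            for i in range(num_variants):
                if comparison_method == "compare_to_control":
                    # Approach 1: compare every variant to variant A
                    comparison = p[0]
                elif comparison_method == "best_of_rest":
                    # Approach 2: compare each variant to the max of the others
                    others = [p[j] for j in range(num_variants) if j != i]
                    comparison = others[0]
                    for other in others[1:]:
                        comparison = pm.math.maximum(comparison, other)
                else:
                    raise ValueError(f"Unknown comparison method: {comparison_method}")
                pm.Deterministic(f"reluplift_{i}", p[i] / comparison - 1)
        return model
```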
We generate data where variants B and C are well above A, but quite close to each other:
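For example (the true rates and prior strength are arbitrary choices):

```python
# Variants B and C clearly beat A, but are close to each other
data_abn = generate_binomial_data([0.21, 0.23, 0.228], samples_per_variant=100_000)

with ConversionModel(BetaPrior(alpha=5000, beta=5000)).create_model(
    data_abn, comparison_method="best_of_rest"
):
    idata_abn = pm.sample(draws=2000)
```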
When we are calculating `reluplift` for B, the maximum of the other variants will likely be variant C. Similarly, when we are calculating `reluplift` for C, it is likely being compared to B.

Now what if we wanted to compare A/B test variants in terms of how much revenue they generate, and/or estimate how much additional revenue a winning variant brings? We can't use a Beta-Binomial model for this, as the possible values for each visitor are now in the range $[0, \infty)$. The model proposed in the VWO paper is as follows:
The revenue generated by an individual visitor is

$$\mathrm{revenue} = \text{probability of paying at all} \times \text{mean amount spent when paying}$$
We assume that the probability of paying at all is independent of the mean amount spent when paying. This is a typical assumption in practice, unless we have reason to believe that the two parameters are dependent. With this, we can create separate models for the total number of visitors paying, and the total amount spent amongst the purchasing visitors (assuming independence between the behaviour of each visitor). If each purchase amount follows an $\mathrm{Exponential}(\lambda)$ distribution, the total amount spent by $K$ purchasers follows a $\mathrm{Gamma}(K, \lambda)$ distribution:

$$K \sim \mathrm{Binomial}(N, \theta)$$

$$\mathrm{revenue} \sim \mathrm{Gamma}(K, \lambda)$$
where $N$ is the total number of visitors, $K$ is the total number of visitors with at least one purchase.
We can re-use our Beta-Binomial model from before to model the Bernoulli conversions. For the mean purchase amount, we place a Gamma prior on the rate parameter $\lambda$ (the Gamma distribution is also a conjugate prior for the Gamma likelihood). So in a two-variant test, the setup is:

$$\theta_A \sim \mathrm{Beta}(\alpha, \beta), \quad \theta_B \sim \mathrm{Beta}(\alpha, \beta)$$

$$\lambda_A \sim \mathrm{Gamma}(\alpha_G, \beta_G), \quad \lambda_B \sim \mathrm{Gamma}(\alpha_G, \beta_G)$$

$$\mu_A = \theta_A \cdot \frac{1}{\lambda_A}, \quad \mu_B = \theta_B \cdot \frac{1}{\lambda_B}$$

$$\mathrm{reluplift}_B = \frac{\mu_B}{\mu_A} - 1$$
$\mu$ here represents the average revenue per visitor, including those who don't make a purchase. This is the best way to capture the overall revenue effect - some variants may increase the average sales value, but reduce the proportion of visitors that pay at all (e.g. if we promoted more expensive items on the landing page).
Below we put the model setup into code and perform prior predictive checks.
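A sketch of this model in PyMC, extending the illustrative classes from earlier (`GammaPrior`, `RevenueData`, and `RevenueModel` are again our own names):

```python
@dataclass
class GammaPrior:
    alpha: float
    beta: float


@dataclass
class RevenueData:
    visitors: int
    purchased: int
    total_revenue: float


class RevenueModel:
    def __init__(self, conversion_prior: BetaPrior, purchase_rate_prior: GammaPrior):
        self.conversion_prior = conversion_prior
        self.purchase_rate_prior = purchase_rate_prior

    def create_model(self, data: List[RevenueData]) -> pm.Model:
        num_variants = len(data)
        visitors = [d.visitors for d in data]
        purchased = [d.purchased for d in data]
        total_revenue = [d.total_revenue for d in data]
        with pm.Model() as model:
            theta = pm.Beta(
                "theta",
                alpha=self.conversion_prior.alpha,
                beta=self.conversion_prior.beta,
                shape=num_variants,
            )
            lam = pm.Gamma(
                "lam",
                alpha=self.purchase_rate_prior.alpha,
                beta=self.purchase_rate_prior.beta,
                shape=num_variants,
            )
            # K paying visitors out of N
            pm.Binomial("converted", n=visitors, p=theta, observed=purchased)
            # Total spend of K payers: sum of K Exponential(lam) amounts ~ Gamma(K, lam)
            pm.Gamma("revenue", alpha=purchased, beta=lam, observed=total_revenue)
            # Average revenue per visitor, including non-payers
            mu = pm.Deterministic("mu", theta * (1 / lam))
            for i in range(1, num_variants):
                pm.Deterministic(f"reluplift_{i}", mu[i] / mu[0] - 1)
        return model
```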
For the Beta prior, we can set a similar prior to before - centered around 0.5, with the magnitude of alpha and beta determining how "thin" the distribution is.
We need to be a bit more careful about the Gamma prior. The mean of the Gamma prior is $\dfrac{\alpha_G}{\beta_G}$, and needs to be set to a reasonable value given existing mean purchase values. For example, if `alpha` and `beta` were set such that the implied mean purchase value was 1 dollar, but the average revenue per visitor for a website is much higher at 100 dollars, this could affect our inference.
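One way to sanity-check this (the 50-dollar figure below is hypothetical): since `lam` is the rate of the spend distribution, a mean purchase value of $m$ dollars corresponds to a prior for `lam` centered near $1/m$:

```python
# Hypothetical historical figure: mean purchase value of ~50 dollars
historical_mean_purchase = 50.0
alpha_g = 2.0
# Prior mean of lam is alpha_g / beta_g; aim it at 1 / historical_mean_purchase
beta_g = alpha_g * historical_mean_purchase
mean_purchase_prior = GammaPrior(alpha=alpha_g, beta=beta_g)
```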
Similar to the model for Bernoulli conversions, the width of the prior predictive uplift distribution will depend on the strength of our priors. See the Bernoulli conversions section for a discussion of the benefits and disadvantages of using a weak vs. strong prior.
Next we generate synthetic data for the model. As in the Bernoulli case, we'll generate two scenarios: one where the variants have the same mean revenue per visitor (but differ in conversion rate and mean purchase value), and one where variant B has a higher mean revenue per visitor.
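A sketch of the data generation for these scenarios, reusing `rng` and `RevenueData` from above (all parameter values are arbitrary; note that `0.1 * 10 == 0.04 * 25`, so the first scenario's variants have equal mean revenue per visitor):

```python
def generate_revenue_data(
    true_conversion_rates: List[float],
    true_mean_purchase: List[float],
    samples_per_variant: int,
) -> List[RevenueData]:
    """Simulate visitors, payers, and total revenue for each variant."""
    data = []
    for theta, mean_purchase in zip(true_conversion_rates, true_mean_purchase):
        converted = int(rng.binomial(samples_per_variant, theta))
        # Each payer's spend is Exponential with mean `mean_purchase`
        total_revenue = float(rng.exponential(scale=mean_purchase, size=converted).sum())
        data.append(RevenueData(samples_per_variant, converted, total_revenue))
    return data


# Scenario 1: same mean revenue per visitor, different behaviour
scenario_same_rpv = generate_revenue_data([0.1, 0.04], [10.0, 25.0], 100_000)
# Scenario 2: variant B earns more per visitor
scenario_b_higher = generate_revenue_data([0.1, 0.1], [10.0, 12.0], 100_000)
```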
It's also useful to plot the posteriors of `theta` (the purchase-anything rate) and `1 / lam` (the mean purchase value) separately, to understand how the A/B test has affected visitor behaviour. We show this below.

Note that one concern with using value conversions in practice (that doesn't show up when we're just simulating synthetic data) is the existence of outliers. For example, a visitor in one variant could spend thousands of dollars, and the observed revenue data no longer follows a 'nice' distribution like Gamma. It's common to impute these outliers prior to running a statistical analysis (we have to be careful with removing them altogether, as this could bias the inference), or to fall back to Bernoulli conversions for decision making.
There are many other considerations when implementing a Bayesian framework to analyse A/B tests in practice. Some include the topics mentioned at the start of this notebook: how to choose an appropriate prior, early stopping, and power analysis.
Various textbooks and online resources dive into these areas in more detail. Doing Bayesian Data Analysis {cite:p}`kruschke2014doing` by John Kruschke is a great resource, and its examples have been translated to PyMC.
We also plan to create more PyMC tutorials on these topics, so stay tuned!
:::{bibliography}
:filter: docname in docnames
:::
:::{include} ../page_footer.md
:::