Bayesian Hierarchical A/B Testing with PyMC3


For this Product Data Science project I’ll explore the use of Bayesian Inference in A/B testing (really is't A/B/C/D testing) using the PyMC3 library. Using synthetic data, the idea behind the project will be to test 4 new playlist algorithms against the current algorithm. The metrics will focus on user interaction during the first selected song, and the metrics measures will be the skip rate and the average time it took a user to skip the song. 

I’ll keep the theory and code light but please have a look in my GitHub repo for the full code and also the resources posted after the article for further reading on Bayesian Inference


Background & Structure

For this project I wanted to explore how Bayesian A/B testing can overcome some of the limitations of the Frequentist methods. With Bayesian Inference you don’t have to correct for multiple testing (Bonferroni, Tukey, etc…), you don’t have to have run long tests to gather a sufficiently large amount of samples (Central Limit Theorem), can handle imbalanced subgroups without worry, and more. All of this is due to the fact that Bayesian Inference uses statisitical distributions to model prior beliefs, then updates these distributions as more data becomes available. These distributions are easily sampled from thus eliminating the need to collect a large number of samples.

So as I stated earlier, I am imagining that I work for music streaming company such as Spotify and that the company would like to test 4 new playlist generating algorithms against the currently running algorithm. The test will be structured to randomly assign these playlists to 1000 user’s each. Measures will have been put in place to assure independence, such as taking into account seasonality and the uniqueness of users. 


Recommendations and playlists play a huge role in the customer satisfaction at a product oriented company such as Spotify. They likely have a huge assortment of metrics and custom KPIs they keep a close eye on. For this project I am going to focus on measuring a customer’s reaction to the first recommended song only. First I will try to measure the number of skips associated with that first song out of all of the users – or skip rate. And secondly the time it takes for a user to skip that song, if they skipped – the skip time

Typically I would also want to measure things such as DAU (daily active users), WAU (weekly active users), etc… shares, as well as have some guardrail metrics in place to make sure that these new playlist algorithms don’t have any undesired effects if they do appear to be performing well. But I’m just going to keep it simple here to explore Bayesian Inference.

 Generating Synthetic Data

One of my favorite things about working with Bayesian Inference is that you can closely model real events and their statistics with distributions. For instance whether or not a user skipped a song can be modeled by a Bernoulli distribution, and the probability that many users skipped a song could be modeled by the BinomialDistribution (since it’s just multiple Bernoulli trials). The time it takes a user to skip a song on the other hand can be modeled by an Exponential distribution, which is commonly used to model the time until failure. 

To create this synthetic data I first used the random.binomial() function from numpy, which takes the number of trials and the probability of success as it’s parameters. I assigned each test group it’s own probability of success to differentiate them. I tried to create unique buckets of users based off of what I thought typical interactions with the product would look like. Then I used random.exponential() function from numpy to generate skip times. This function is parametrized by the scale parameter beta, which is just 1 / lambda, where lambda is the rate parameter and beta is the mean or expected value. Whenever a user skipped the song, I replaced the skip time value with NaNs.

n = 1000group_list = ['A', 'B', 'C', 'D', 'E']  beta_list = [4.9, 6.0, 3.5, 4.0, 5.1]             p_list = [0.58, 0.80, 0.45, 0.56, 0.62] # A - current playlist# B - longer to skip, less skips# C - quicker to skip, less skips# D - quicker to skip, same skips# E - same to skip, more skips  full_df = pd.DataFrame() for g in range(len(group_list)):         curr_group = group_list.pop(0)    curr_beta = beta_list.pop(0)    curr_p = p_list.pop(0)         skipped = np.random.binomial(size = n, n = 1, p = curr_p)    skip_times = np.round(np.random.exponential(scale = curr_beta, size = n), 6)         df = pd.DataFrame(skip_times, columns = ['skip_times'])         df['skipped'] = skipped    df['skip_times'] = skip_times    df['skip_times'] = np.where(df['skipped'] == 0, np.nan, df['skip_times'])         df['test_group'] = curr_group         full_df = full_df.append(df) full_df = full_df.reset_index(drop=True).reset_index().rename(columns = {'index':'user_id'})

 Bayesian Inference

Now I’m going to use PyMC3 to do all of the heavy lifting computationally, but you still need to know a little bit about Bayesian Inference to set it up properly. First you need to understand the mathematical concepts behind Bayesian Inference, then how to choose the proper distributions to work with (conjugate priors). From there PyMC3 can handle all of the sampling algorithms for you. 


In short the basic mathematical concept is that you start off with an idea about the distribution of a parameter, such as the skip rate. This distribution sums up your prior beliefs and is called simply the prior distribution. The prior is then multiplied by the likelihood, which is the conditional density of the data given the parameter – or how likely this data would be given the parameter value. These are then divided by the evidence, which is the probability of the observed data independent of the parameter. This results in the posterior distribution of the parameter, which represents the updated beliefs – or the probability distribution of the parameter given the observed data.

 Conjugate Priors

Not all distributions work well together in the above equation. Depending on the parameter you would like to model, you need to specifically choose a conjugate priorotherwise things can get sticky. I usually just use this table from Wikipedia, which tells me all I need to know about which distributions work well together and what their respective parameters are. Here you can see that since I’d like to use a Binomial distribution as my likelihood to model the number of skips, I’d need to use a Beta prior distribution. This combination is quite popular.

And since I’d like to use the Exponential distribution as my likelihood to model the time a user took before skipping, I’ll need to use Gamma as my prior distribution. 

 Uninformed Priors

It is fairly common to build priors off of an expert opinion or historical data. However, not all priors have to be representative of prior knowledge or beliefs. It is quite common to use uninformed priors, which are just distributions that don’t make any assumptions about the distribution of the parameter. Rather than influence the posterior toward a prior belief, an uninformed prior would influence the posterior as little as possible. The posterior will still update as more and more data gets sampled, and eventually converge toward a parameter value. In this project I’ll use both types of priors. 

PyMC3 Sampling

The real power in Bayesian Inference comes from the fact that you don’t need too much data to paint a representative picture of the true parameter. The reason for this is that Bayesian Inference utilizes sampling techniques to generate more data. Also referred to as Monte Carlo methods, there are many different algorithms that have been developed and they typically work by generating values and then accepting or rejecting those values based on certain criteria. Two common ones that you can use with PyMC3 are the Metropolis-Hastings algorithm and the NUTS algorithm.

Model for Skip Times

Now that we’ve got the basics squared away, I will walk through my model to test the time it took a user to skip a song. For this model I’m using an Exponential likelihood and a Gamma prior. I will use historical data to build my Gamma prior, meaning I will use the mean and standard deviation of the control group A, as hyperpriors and parameters to construct the Gamma distribution. In this hierarchical model I will test all of the variants A/B/C/D/E all at once. One thing to note is that this data is much smaller than each test group’s size since not all users actually skipped the first song. If I sampled 1000 users in each group, maybe only half skipped the first song. The Bayesian methods handle this through sampling. 

First I find the statistics from control group A, and use these to calculate alpha and beta based off of the expected value and variance of the Gamma distribution.

 # alpha, beta, mean, variance, and std of control group mean_of_A = skip_df[skip_df['test_group']=='A']['skip_times'].mean()std_of_A = skip_df[skip_df['test_group']=='A']['skip_times'].std()var_of_A = skip_df[skip_df['test_group']=='A']['skip_times'].var() alpha_of_A = mean_of_A**2 / var_of_Abeta_of_A = mean_of_A / var_of_A


Mean:   4.5091

Std:    4.449

Var:    19.7939

Alpha:  1.0272

Beta:   0.2278

 Now I can build my model with PyMC3, it looks straight forward but there’s a whole lot going on under the hood. I’ll take 10,000 samples, use the NUTS algorithm, use MAP estimation to infer starting values, and increase the acceptance rate to 95%. The metrics I will monitor here will be the difference of the control group from each test group.

 with pymc3.Model() as model:         # define hyperpriors    mu = pymc3.Gamma('mu', alpha=alpha_of_A, beta=beta_of_A)    sigma = pymc3.Exponential('sigma', std_of_A)    alpha =  pymc3.Deterministic('alpha', mu**2 / sigma**2)    beta = pymc3.Deterministic('beta', mu / sigma**2)     # define priors    lam_A = pymc3.Gamma('lam_A', alpha=alpha, beta=beta)    lam_B = pymc3.Gamma('lam_B', alpha=alpha, beta=beta)    lam_C = pymc3.Gamma('lam_C', alpha=alpha, beta=beta)    lam_D = pymc3.Gamma('lam_D', alpha=alpha, beta=beta)    lam_E = pymc3.Gamma('lam_E', alpha=alpha, beta=beta)         # define likelihood    obs_A = pymc3.Exponential('obs_A', lam=lam_A, observed=skip_df[skip_df['test_group'] == 'A']['skip_times'])    obs_B = pymc3.Exponential('obs_B', lam=lam_B, observed=skip_df[skip_df['test_group'] == 'B']['skip_times'])    obs_C = pymc3.Exponential('obs_C', lam=lam_C, observed=skip_df[skip_df['test_group'] == 'C']['skip_times'])    obs_D = pymc3.Exponential('obs_D', lam=lam_D, observed=skip_df[skip_df['test_group'] == 'D']['skip_times'])    obs_E = pymc3.Exponential('obs_E', lam=lam_E, observed=skip_df[skip_df['test_group'] == 'E']['skip_times'])         # define metrics    pymc3.Deterministic('difference_B', lam_B - lam_A)    pymc3.Deterministic('difference_C', lam_C - lam_A)    pymc3.Deterministic('difference_D', lam_D - lam_A)    pymc3.Deterministic('difference_E', lam_E - lam_A)     # sample distributions    trace = pymc3.sample(draws=10000,  start = pymc3.find_MAP(), step=pymc3.NUTS(target_accept=0.95), progressbar=True, return_inferencedata=False)



PyMC3 generates some nice performance plots to confirm the model ran appropriately and converged rather than diverged. After confirming the model ran smoothly, the plots of interest are those of the posterior distributions. Here you can see the mean values as well as their HDI (highest density interval). The HDI is the shortest credible interval, which is the Bayesian version of the Frequentist confidence interval – and so much better! Credible intervals and HDI’s are also very interpretable, for instance I can simply now say that:

 It is 95% probable that users in test group C fall in the interval [0.26, 0.31] with an average skip time of 0.29 seconds.

It is 95% probable that users in control group A fall in the interval [0.18, 0.21] with an average skip time of 0.19 seconds.

 To see all of the distributions together and for greater chart control, I opted to plot the data with seaborn. It is easy to gain valuable insights now and can see below that:

 plt.figure(figsize=(16,10)) posteriors = []posterior_vars = ['lam_A', 'lam_B', 'lam_C', 'lam_D', 'lam_E'] for p in posterior_vars:    posteriors.append(trace.get_values(p, burn=100)) for posterior in posteriors:    sns.kdeplot(posterior ,shade=True, palette=colors, legend = True)     plt.legend(loc="upper right", labels = posterior_vars, fontsize=12)plt.title("Posterior Distributions of Skip Times\nfor Control and Test Groups", fontsize=20)plt.grid(color='#4d4d4d')

 Test group E was most similar to control group A 

Test group B tended to skip much faster than control group A

Test groups C & D took much longer to skip the song

Notice below that test group C has a much wider distribution. This is likely because there were less people in this test group bucket due to the imbalanced data. This wideness represents my models uncertainty about the group. Still that could be a good thing, meaning that less users skipped that song and the ones that did took longer to do so. Here is the difference of the control group from each test group.

 Model for Skip Rate

Now rather than model the time it took a user to skip the first song, I will model the number of skips per each test group – the skip rate. Another 10,000 samples without actually testing more users. As mentioned earlier the likelihood will be modeled by the Binomial distribution and the prior will then be the Beta distribution. 

Rather than use historical data to build my prior, I will use an uninformed prior. It would be straightforward to use a Uniform distribution to sample alpha and beta, however doing so would result in an improper posterior distribution. Gelman poses a solution to this in his book Bayesian Data Analysis on page 109-113. His solution is to sample fro the below distribution. To do this I will have to build a custom function in PyMC3.

 def logp_ab(value):        return tt.log(tt.pow(tt.sum(value), -5/2)) 

with pymc3.Model() as model:
          # uninformative prior for alpha and beta     ab = pymc3.HalfFlat('ab', shape=2, testval=np.asarray([1., 1.]))     pymc3.Potential('p(a, b)', logp_ab(ab))          # our alpha and beta. Remove this code if you don't want to track alpha and beta     alpha = pymc3.Deterministic('alpha', ab[0])     beta = pymc3.Deterministic('beta', ab[1])          # define priors     p_A = pymc3.Beta('p_A', alpha=alpha, beta=beta)     p_B = pymc3.Beta('p_B', alpha=alpha, beta=beta)     p_C = pymc3.Beta('p_C', alpha=alpha, beta=beta)    p_D = pymc3.Beta('p_D', alpha=alpha, beta=beta)     p_E = pymc3.Beta('p_E', alpha=alpha, beta=beta)          # define likelihood     obs_A = pymc3.Binomial('obs_A', n=len(full_df[full_df['test_group'] == 'A']), p=p_A, observed=full_df[full_df['test_group'] == 'A']['skipped'].sum())    obs_B = pymc3.Binomial('obs_B', n=len(full_df[full_df['test_group'] == 'B']), p=p_B, observed=full_df[full_df['test_group'] == 'B']['skipped'].sum())    obs_C = pymc3.Binomial('obs_C', n=len(full_df[full_df['test_group'] == 'C']), p=p_C, observed=full_df[full_df['test_group'] == 'C']['skipped'].sum())     obs_D = pymc3.Binomial('obs_D', n=len(full_df[full_df['test_group'] == 'D']), p=p_D, observed=full_df[full_df['test_group'] == 'D']['skipped'].sum())    obs_E = pymc3.Binomial('obs_E', n=len(full_df[full_df['test_group'] == 'E']), p=p_E, observed=full_df[full_df['test_group'] == 'E']['skipped'].sum())          # define metrics     pymc3.Deterministic('difference_B', p_B - p_A)     pymc3.Deterministic('difference_C', p_C - p_A)    pymc3.Deterministic('difference_D', p_D - p_A)    pymc3.Deterministic('difference_E', p_E - p_A)          # sample distributions    trace = pymc3.sample(draws=10000, start = pymc3.find_MAP(), step=pymc3.NUTS(target_accept=0.95), progressbar=True, return_inferencedata=False)    


Again I’ll plot the posteriors to gain insights.

It is 95% probable that the skip rate for test group C is in the interval [42%, 48%] with an average of 45%.

It is 95% probable that the skip rate for control group A is in the interval [56%, 62%] with an average of 59%.

Looking closer at the distributions for each group.

And the distributions for the difference compared to control group A. 

 Test group C is the only test group with a lower skip rate than the control group, with an average skip rate of 45% and is 13.66 percentage points lower than control group A.

Get in touch at: