4 Bayesian Regressions: Linear, Binomial Logistic, Poisson, and Exponential

Here's a walkthrough of 4 different flavors of Bayesian regression with inference, each built around a separate case study using synthetic data. This might be interesting for someone who is familiar with the concept of regression and has always wondered what the fuss is with Bayesian statistics. You'll see that while it requires pymc, a library for Bayesian computation, the structure is very similar to the Frequentist approach. You might even find that inference with Bayesian statistics is more flexible and more insightful.

Outline

For all 4 regressions, I'll be using the pymc library for Bayesian modeling, along with arviz for inference and the graphviz library for some DAG visualizations. You can also find all of the code and data files in my GitHub repo.

Linear Regression: Predicting Online Marketplace User Spend

For the first use case, I'll be modeling user spend in an online marketplace. We're trying to understand how much users spend given some other covariates we deem important based on past analysis. There's a mix of numerical data types; the variable first_time is binary.

To set up a model with pymc, all you need to do is wrap the variables in a model object. The first thing you can do is create Data objects for each x and y. This isn't a necessary step, but it helps with predictions later on. You should specify that the x's are mutable and the y is not. I've seen many people use the x1, x2, x3,... styling, but I prefer descriptive variable names, as they'll come in handy later when we make inferences.

Next you can set variables for each beta coefficient you'll need. For brevity I typically shorten the Python variable names, but continue to give the descriptive names as a parameter. These will show up in the summary inference table later on, and it will be easier to have the descriptive names handy at that point. These are uninformative priors: normal distributions with very wide sigmas, centered at mu = 0. The idea is that since we will be running many iterations and generating many samples, the uninformative priors will eventually update themselves to be near the true beta values. This is the flexibility of Bayesian modeling. You'll also need to include a beta coefficient for the intercept, of course.

Now all we have to do is put these variables together with the x covariate variables in a linear fashion, similar to what we would do in any linear regression. Notice that creating interaction terms works exactly how you would expect: we can simply add, multiply, square, etc. the variables together. The likelihood represents our y variable, in our case spend, which we can model with a flexible Normal distribution. For this variable we use the linear combination we created as its mean, mu. For the sigma of the likelihood, I'm choosing to model it with an uninformative Gamma hyperprior. pymc can sometimes have divergence issues with very small sigma values, and a Gamma hyperprior helps avoid them. We also pass the observed spend (y) values into the likelihood through the observed parameter.
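Here's a rough sketch of what that looks like in code. The column names (such as past_spend for the user's typical spend), the file name, and the prior widths are my own illustrative choices, assuming a recent PyMC version with MutableData/ConstantData containers:

```python
import pandas as pd
import pymc as pm

df = pd.read_csv("marketplace_spend.csv")  # hypothetical file name

with pm.Model() as spend_model:
    # Data containers: mutable x's (so we can swap in new values for predictions), constant y
    past_spend = pm.MutableData("past_spend", df["past_spend"].values)
    first_time = pm.MutableData("first_time", df["first_time"].values)
    visits = pm.MutableData("times_visited_this_week", df["times_visited_this_week"].values)
    view_time = pm.MutableData("view_time", df["view_time"].values)
    spend_obs = pm.ConstantData("spend_obs", df["spend"].values)

    # Uninformative priors: wide Normals centered at 0, one per covariate plus the intercept
    b0 = pm.Normal("b_intercept", mu=0, sigma=100)
    b_ps = pm.Normal("b_past_spend", mu=0, sigma=100)
    b_ft = pm.Normal("b_first_time", mu=0, sigma=100)
    b_vis = pm.Normal("b_times_visited_this_week", mu=0, sigma=100)
    b_vt = pm.Normal("b_view_time", mu=0, sigma=100)

    # Linear combination of the covariates
    mu = b0 + b_ps * past_spend + b_ft * first_time + b_vis * visits + b_vt * view_time

    # Wide Gamma hyperprior keeps sigma positive and away from tiny values
    sigma = pm.Gamma("sigma", alpha=2, beta=0.1)

    # Likelihood; shape follows mu so out-of-sample predictions with set_data work later
    spend = pm.Normal("spend", mu=mu, sigma=sigma, observed=spend_obs, shape=mu.shape)
```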

Finally we create the trace variable, which will hold all of the samples generated by the Monte Carlo process. By setting the tune parameter we are essentially throwing out the first N samples. The purpose of these burn-in samples is to give the uninformative priors time to learn, thereby creating more accurate estimates. Typically I set it to about 10% of however many samples I'm generating, so here for 10,000 samples, I'll burn 1,000.
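Continuing the sketch:

```python
with spend_model:
    # 10,000 draws per chain; the first 1,000 tuning (burn-in) draws are discarded
    trace = pm.sample(draws=10_000, tune=1_000)
```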

After the model has finished generating its samples, you can call a summary table through the arviz package to view the coefficients and their respective 95% Bayesian credible intervals, the Bayesian equivalent of a confidence interval. HDIs, or Highest Density Intervals, are sometimes also referred to as HPDs, or Highest Posterior Density intervals. Here I'm using the var_names and filter_vars parameters to get only the coefficients that start with "b".
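Something along these lines, assuming the prior names from the sketch above:

```python
import arviz as az

# Only the coefficients whose names contain "b_", with 95% HDIs
az.summary(trace, var_names=["b_"], filter_vars="like", hdi_prob=0.95)
```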

Looking into the variables whose 95% intervals do not contain 0, we find that:


It's also really simple to add interaction terms: just create new uninformative priors for them and add them into the linear model.
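A sketch of just the lines that change inside the model block (everything else stays as before; the interaction prior's name is my own):

```python
# New uninformative prior for the interaction term
b_vt_vis = pm.Normal("b_view_time_x_visits", mu=0, sigma=100)

# Interaction terms are just products added into the linear combination
mu = (b0 + b_ps * past_spend + b_ft * first_time
      + b_vis * visits + b_vt * view_time
      + b_vt_vis * view_time * visits)
```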

Although adding interaction terms can complicate the interpretation of the coefficients, we can still gain some valuable insights.

Interestingly, you can see that view_time * times_visited_this_week resulted in a negative coefficient, while its parts both resulted in positive coefficients. The negative interaction suggests that perhaps users who visit often and view for too long are getting cold feet. This could be something to look into further, in order to try to retain these sales. Maybe these users would be good candidates for our coupons.

To make inference on the spend variable, which was our likelihood, we can use the summary function again, taking the mean over all of the samples. Here you can see that the average spend would be $95.49, but it is not a guaranteed spend, since the 95% HDI contains 0. We can say that there is a 95% probability that the user will spend somewhere in the interval of [-13.92, 204.96]. However, it's hard to spend negative dollars.

Another interesting thing we can do with our model is make predictions. For instance, let's say that we want to know the average spend for a particular type of user: they typically spend about $30, it's not their first time on the site, they've already visited the site 2 times this week, and they looked at a product for 2 minutes. Maybe we want to target this user with a coupon. We can simply update our model with the set_data function, and regenerate the summary table.
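A sketch of that prediction step, reusing the container names from above (past_spend is still my hypothetical name for the "typical spend" covariate):

```python
with spend_model:
    pm.set_data({
        "past_spend": [30.0],
        "first_time": [0],                   # not their first time on the site
        "times_visited_this_week": [2],
        "view_time": [2.0],                  # minutes
    })
    user_pred = pm.sample_posterior_predictive(trace, var_names=["spend"])

az.summary(user_pred.posterior_predictive, hdi_prob=0.95)
```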

Now we can say that we expect this user to spend about $190.19, with a 95% probability of spending in the interval of [75.43, 304.87]. Isn't that a much nicer way to make inference than, say, with a p-value? With Bayesian credible intervals you can communicate your findings about a parameter in terms of direct probability.

We can plot our posterior to visually see the distributions of each beta coefficient, along with their 95% credible intervals.
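For example:

```python
az.plot_posterior(trace, var_names=["b_"], filter_vars="like", hdi_prob=0.95)
```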

The very normal shape of each of those beta coefficient distributions is a good check that the model converged well. To evaluate the model you can generate samples from the posterior predictive distribution, then compare them to the actual known spend distribution. A familiar metric here is R², which you can call from the arviz package.
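A sketch of that check (if you ran the single-user prediction above, restore the original covariate values with pm.set_data first so the shapes match):

```python
with spend_model:
    ppc = pm.sample_posterior_predictive(trace)

# Bayesian R^2: posterior predictive draws vs. the observed spend values
y_true = df["spend"].values
y_pred = ppc.posterior_predictive["spend"].stack(sample=("chain", "draw")).values.T
az.r2_score(y_true, y_pred)
```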

Another check for your model is to plot the trace. You basically want to see lots of fuzzy caterpillars, which is exactly what we've got here.
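For example:

```python
az.plot_trace(trace, var_names=["b_"], filter_vars="like")
```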

And here's a DAG of our simple linear regression model.
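Assuming the model object is named spend_model as in the sketches above, pymc can render it directly:

```python
pm.model_to_graphviz(spend_model)
```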

 

Binomial Logistic Regression: Predicting Melanoma Survival

For the second use case, I'll be modeling counts of patients who have survived melanoma. Again, it's synthetic data. What's really interesting here is that it's set up to be a logistic regression model through the lens of the Binomial distribution, so there's an extra layer of consideration I'll have to take into account when building the model. But the point is to showcase how easy that consideration is.

The Binomial distribution models the number of successes in a certain number of trials, and takes two parameters: n (the number of trials) and p (the probability of success). We are also given the length of the melanoma (measured in mm) in our data, which we will use as our x explanatory variable. Here we set up the pymc model the same way as before. Even though we are only using a single x, we will also create Data objects for s, the number of survivals, since we can pass this into the Binomial distribution as an observed value, and for n, the total number of trials. Here, as before, I'm using uninformative priors for all beta coefficients.

Now the likelihood here will be the Binomial distribution, and we can pass the number of trials to n and the number of observed survivals to the observed parameter. It also requires p, the probability of success (survival). For this we can use the logistic function, aka the sigmoid function! Within the logistic function we pass the beta coefficients and the explanatory variable, length. These will update and generate the appropriate probabilities of survival. Wow!

With pymc we also have the option of modeling the invlogit() function if we wish to track this variable, or we can use the logit_p parameter in the Binomial distribution and simply pass in the linear equation. So flexible! By using the invlogit() function we are modeling the probability directly, not the log odds, so when we make a prediction there is no need to transform back from the log-odds scale.
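A sketch of the whole model under those assumptions (the file and column names are my own guesses):

```python
import pandas as pd
import pymc as pm

df_mel = pd.read_csv("melanoma.csv")  # hypothetical file name

with pm.Model() as melanoma_model:
    length = pm.MutableData("length", df_mel["length"].values)   # melanoma length in mm
    n = pm.MutableData("n", df_mel["n"].values)                  # number of patients (trials)
    s = pm.ConstantData("s", df_mel["survived"].values)          # observed survivals

    # Uninformative priors
    b0 = pm.Normal("b_intercept", mu=0, sigma=100)
    b_len = pm.Normal("b_length", mu=0, sigma=100)

    # invlogit (the logistic / sigmoid function) turns the linear predictor into a probability
    p = pm.Deterministic("p", pm.math.invlogit(b0 + b_len * length))

    # Likelihood: survivals out of n trials
    survived = pm.Binomial("survived", n=n, p=p, observed=s)

    # Alternative: skip invlogit and hand the linear predictor to logit_p instead
    # survived = pm.Binomial("survived", n=n, logit_p=b0 + b_len * length, observed=s)

    trace_mel = pm.sample(draws=10_000, tune=1_000)
```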

Calling the summary function, we can see that the length of the melanoma negatively affects the patient's chances of survival. Since the 95% HDI doesn't contain 0, the effect is credibly non-zero, the Bayesian analogue of statistical significance. So for every unit increase in length (1 mm), the posterior odds of survival decrease by about 46.8%.

And we can visualize that interval with the plot_posterior function.

Now what if we wanted to make a prediction based on our model? Let's say that a patient walks into our office and has a possible melanoma growth of 2.75 mm. We can use the coefficients of our model to make that prediction, but to do this we'll need to wrap our linear function in the sigmoid function. We can use scipy.special.expit() as our logistic sigmoid function. This patient would have a 91.8% probability of survival.
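A sketch using the posterior means from the trace above:

```python
from scipy.special import expit  # the logistic sigmoid function

post = trace_mel.posterior
b0_mean = post["b_intercept"].mean().item()
b_len_mean = post["b_length"].mean().item()

# Probability of survival for a 2.75 mm growth
prob_survival = expit(b0_mean + b_len_mean * 2.75)
```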

Poisson Regression: Cancer Risk from Water Samples

For the third use case, I'll be modeling counts of people who have developed a certain type of cancer within the past year in 1000 different cities, using two different chemicals in the water supply as predictors (in ppb). For context, chlorine is added to water for cleaning purposes, so a certain amount is considered fine, but radium, which is a byproduct of adding chlorine, is not good and needs to be removed. Again, it's synthetic data. This time, however, we will be using the Poisson distribution, since we are modeling counts of events which happen in a certain location at a fixed rate.

This time I can squeeze the entire model into one frame. Similar to the logistic regression from before, we can use the linear function wrapped in the appropriate link function. For the Poisson distribution, this is the exponential function (the inverse of the log link), which we can call from the pymc.math library. We're also using uninformative priors here, as they will update pretty quickly.
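A sketch of that one frame, with my own file and column names:

```python
import pandas as pd
import pymc as pm

df_water = pd.read_csv("water_samples.csv")  # hypothetical file name

with pm.Model() as water_model:
    chlorine = pm.MutableData("chlorine", df_water["chlorine"].values)  # ppb
    radium = pm.MutableData("radium", df_water["radium"].values)        # ppb

    # Uninformative priors
    b0 = pm.Normal("b_intercept", mu=0, sigma=100)
    b_cl = pm.Normal("b_chlorine", mu=0, sigma=100)
    b_ra = pm.Normal("b_radium", mu=0, sigma=100)

    # exp() is the inverse of the Poisson log link
    rate = pm.math.exp(b0 + b_cl * chlorine + b_ra * radium)

    # Likelihood; shape follows rate so set_data predictions work later
    cases = pm.Poisson("cases", mu=rate, observed=df_water["cases"].values, shape=rate.shape)

    trace_water = pm.sample(draws=10_000, tune=1_000)
```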

Calling the summary function with a 95% HDI, we can see that both chlorine and radium are significant predictors of cancer here, but in opposite directions. Because the coefficients sit on the log scale, a coefficient of 0.024 means that each additional ppb of that chemical in a city's water is associated with an increase of about 0.024 in the log of the expected cancer count, roughly a 2.4% increase in expected cases per ppb.

And let's say that we'd like to test our own city's water, and we find that there is 3.24 ppb chlorine and 8.16 ppb radium. As it turns out, we could expect about 26 people to develop this particular cancer this year, and we can say that there is a 95% probability that the number of people who develop it will be in the interval of [16.068, 35.947].
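A sketch of that prediction, reusing the data containers above:

```python
import arviz as az

with water_model:
    pm.set_data({"chlorine": [3.24], "radium": [8.16]})
    city_pred = pm.sample_posterior_predictive(trace_water, var_names=["cases"])

az.summary(city_pred.posterior_predictive, hdi_prob=0.95)
```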

Exponential Regression with Imputation: Salmon Population on Mars

For the fourth and final use case, I'll model the decline of salmon populations on Mars. In this hypothetical situation, scientists have been trying to test the viability of sustaining a salmon population in various lakes around Mars. Of course, this hypothetical situation is taking place in the future, where we have mastered space travel, or almost mastered it, because a few missions failed to return home with the data, meaning that there is missing data for some of the later missions. So besides modeling the population decline with an exponential curve, I'll also show how easy it is to impute missing data with Bayesian modeling.

Here's the synthetic data. I've added some nulls into each column individually, and also into both columns at once. Then I've added a point where the number of days is greater than our maximum number of days, with a null value for the number of salmon, just so we can see how the model can extrapolate.

 

Now I'll set some variables to help the model later on: min, max, mean, and standard deviation. Then I'm going to create numpy masks to help the model understand that there is missing data.
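Roughly like this (the file and column names are my own):

```python
import numpy as np
import pandas as pd

df_salmon = pd.read_csv("mars_salmon.csv")  # hypothetical file name

days_vals = df_salmon["days"].values.astype(float)
salmon_vals = df_salmon["salmon"].values.astype(float)

# Summary statistics for setting up the priors and the truncation bound
days_mean, days_std = np.nanmean(days_vals), np.nanstd(days_vals)
salmon_max = np.nanmax(salmon_vals)

# Masked arrays flag the missing entries so PyMC knows to impute them
days_masked = np.ma.masked_array(days_vals, mask=np.isnan(days_vals))
salmon_masked = np.ma.masked_array(salmon_vals, mask=np.isnan(salmon_vals))
```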

The model follows a similar structure to the previous models. The variable that is not the likelihood (days) I will model with a Normal distribution, specifying the sample mean and standard deviation to put it in about the right place. The pymc model will impute the missing data automatically. Yes, that's right: automatically. This works because we've already specified the distribution of the variable, and Bayesian modeling relies heavily on sampling. All of the other coefficients are set to uninformative Normal priors. The mu variable is set to an exponential equation, rather than a linear one, to model population decline. The likelihood (number of salmon) is modeled with a Truncated Normal distribution, for which you can specify cutoffs. In our case it doesn't make contextual sense to have a negative number of salmon in the lakes. For the maximum, I could have chosen to pass None to that parameter, but instead I've set it to the maximum number of salmon plus 10% for wiggle room.
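A sketch of that model under the assumptions above (the exact exponential form and prior widths are illustrative):

```python
import pymc as pm

with pm.Model() as salmon_model:
    # Partially observed predictor: missing days are imputed automatically
    days = pm.Normal("days", mu=days_mean, sigma=days_std, observed=days_masked)

    # Uninformative priors for the coefficients, wide Gamma hyperprior for sigma
    b0 = pm.Normal("b_intercept", mu=0, sigma=100)
    b_decay = pm.Normal("b_decay", mu=0, sigma=100)
    sigma = pm.Gamma("sigma", alpha=2, beta=0.1)

    # Exponential decline rather than a straight line
    mu = pm.math.exp(b0 + b_decay * days)

    # Truncated Normal likelihood: no negative salmon, upper cutoff at max observed + 10%
    salmon = pm.TruncatedNormal(
        "salmon", mu=mu, sigma=sigma,
        lower=0, upper=salmon_max * 1.1,
        observed=salmon_masked,
    )

    trace_salmon = pm.sample(draws=10_000, tune=1_000)
```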

Here's the summary. Rather than focus on the coefficients, let's focus on the imputed values. Each missing value is returned for each variable as a mean with a 95% HDI. Amazing.

Now I'm creating a new column to flag the values as imputed, and also using some simple for loops to fill in the missing values that were returned in the summary.
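One way to pull those imputed values back into the dataframe, as a rough sketch. This leans on recent PyMC versions storing the combined, partly imputed "days" and "salmon" variables in the trace; the details can differ by version:

```python
import numpy as np

# Posterior means of the partly imputed variables (observed entries come back unchanged)
days_post = trace_salmon.posterior["days"].mean(dim=("chain", "draw")).values
salmon_post = trace_salmon.posterior["salmon"].mean(dim=("chain", "draw")).values

# Flag rows that had any missing value, then fill them with the posterior means
df_salmon["imputed"] = df_salmon["days"].isna() | df_salmon["salmon"].isna()

for i in range(len(df_salmon)):
    if np.isnan(df_salmon.loc[i, "days"]):
        df_salmon.loc[i, "days"] = days_post[i]
    if np.isnan(df_salmon.loc[i, "salmon"]):
        df_salmon.loc[i, "salmon"] = salmon_post[i]
```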

And finally we can visualize our imputed data points alongside the actuals. The imputed values seem to be right in line with what I would've expected their actual values to be. Notably, you can see that they don't fall on an unrealistic straight line, but rather are scattered according to the distribution they were sampled from, which is much more natural. You can even see that the imputation method can handle not only interpolation but also extrapolation, as seen in the point to the far right. That point was 50 days away from any other actual data point, and the model imputed it perfectly.

Get in touch at: mr.sam.tritto@gmail.com