The Best Reason to Practice Bayesian Statistics That You Haven't Heard Of

This post is dedicated to Bayesian statistics, in particular how it gracefully handles summarizing parameters for multimodal data using the Highest Density Interval or HDI (also referred to as the Highest Posterior Density or HPD).

For the past few years I’ve been reading countless blog posts (mostly through Medium) on Bayesian statistics, many of which do a really great job of highlighting the many wonderful things that can come about when one practices Bayesian statistics: how it performs with small data, how it incorporates prior beliefs, Bayesian A/B testing, Naive Bayes classifiers, Bayesian Optimization (which is used in Optuna), and most notably, its interpretability. However, I have yet to come across an article covering what I believe to be its coolest feature.

If you are a serious student of statistics or have studied Bayesian statistics, then you have almost certainly come across the methods I’m referring to, but possibly not this unique feature. Even when covering the HDI, many textbooks only briefly mention it, if they mention it at all. I feel that this functionality truly sets it apart from anything Frequentist.

Therefore this blog post is my effort to call more attention to using the HDI with multimodal data. If Bayesian statistics and the HDI are old news to you, then you can simply scroll to the bottom for an understanding of what I’m hinting at. But if this is all new to you, then I will briefly go over some basics that will set you up for a better understanding.

The following work was done in R and is available as a Jupyter notebook on my GitHub repo here: https://github.com/sam-brady/bayesian-hdi

Enjoy!

A History of Beliefs

While the philosophical roots of Bayesian statistics have been around for centuries, sometime before his death in 1761 Reverend Thomas Bayes began to formulate his thoughts on conditional probability in an essay which would later become the foundation for Bayes’ Theorem. As a theologian and mathematician, his ideas were centered around the notion that prior beliefs could be used to calculate posterior probabilities. His ideas were published posthumously and used to try to prove the existence of God and miracles. An excerpt from the Cornell blog sums it up best:

Bayes’ Theorem and the Existence of God 

https://qz.com/1315731/the-most-important-formula-in-data-science-was-first-used-to-prove-the-existence-of-god/

According to the article above, Bayes’ Theorem, arguably the most influential formula in all of statistics, has been used extensively in many fields of science since its development in the 18th century. Today, the theorem is essential for statistical analysis in areas like machine learning, artificial intelligence, and medicine. Ironically, however, the first ever use of Bayes’ Rule was not to bolster a scientific understanding, but quite the opposite; it was used to prove the existence of God.

One of Thomas Bayes’ close friends was a minister named Richard Price. When his friend died in 1761, Price found the famous formula in Bayes’ An Essay towards solving a Problem in the Doctrine of Chances, and published it for him posthumously in 1763. In 1767, Price employed that same theorem in an argumentative essay in support of the existence of God.

https://blogs.cornell.edu/info2040/2018/11/28/bayes-theorem-and-the-existence-of-god/

Prior & Posterior Distributions

The building blocks of Bayesian statistics consist of parameters, prior distributions, and posterior distributions. Some parameters are known (the hyperparameters of the prior) and some are unknown, and the goal is to try to understand the value of the unknown parameters. The prior distribution represents your current belief about an unknown parameter and is updated after observations or experimentation, resulting in a posterior distribution which again describes the unknown parameter. This process can be repeated until your posterior distribution converges toward a value, or an interval of values, surrounding the parameter. Each statistical distribution (Bernoulli, Binomial, Beta, Gamma, Normal, Exponential, Poisson, Erlang, etc.) has its own parameters and hyperparameters that give it its shape and let it model real-life phenomena.

Updating a Prior Distribution

Your prior doesn’t have to closely match your posterior distribution. In fact, certain priors pair naturally with certain likelihoods (these are called conjugate priors), and there are even uninformative priors, which are designed to affect your posterior as little as possible. Typically I refer to this table on Wikipedia to choose my priors and code my hyperparameters.


https://en.wikipedia.org/wiki/Conjugate_prior

Here is an example of a small amount of observed data modeled as a Poisson distribution, with a Gamma prior distribution. The Poisson distribution is used to model the number of events that occur within a fixed amount of space or time. The unknown parameter to be described for the Poisson posterior distribution is lambda (which is its mean, expected value, and variance), and its true value is marked here at 3 with a dotted line. The Gamma distribution is typically used to model wait times. You can see here that the small sample of data was not enough to fully represent the true value of lambda.

When more data is observed, the posterior distribution updates itself accordingly and moves closer to the true lambda value of 3 and then goes slightly past it, only to come back again.

Getting warmer…

Very hot!

With an increasing sample size your chances of describing the true parameter increase as well. 
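If you’d like to see the mechanics, here is a minimal sketch of this Gamma-Poisson conjugate update in R. The prior hyperparameters and the simulated data are my own assumptions, not the exact values behind the plots above.

# Gamma(shape, rate) prior with a Poisson likelihood.
# Conjugate update: posterior is Gamma(shape + sum(y), rate + n).
set.seed(42)
true_lambda <- 3
shape_0 <- 2   # assumed prior hyperparameters
rate_0 <- 1

for (n in c(5, 50, 500)) {
  y <- rpois(n, true_lambda)      # observe n data points
  shape_n <- shape_0 + sum(y)
  rate_n <- rate_0 + n
  cat(sprintf("n = %3d: posterior mean of lambda = %.2f\n",
              n, shape_n / rate_n))
}

The posterior mean (shape_n / rate_n) tightens around the true lambda of 3 as the sample size grows, just like the plots show.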

The Credible Interval

How would a Bayesian describe this posterior distribution? In a way that looks very similar to the Frequentist confidence interval, but is much more interpretable.

A Frequentist would say something like:

If we repeated this experiment an infinite number of times, 95% of the resulting intervals would contain the population parameter.

A Bayesian would say something like: 

There’s a 95% probability that the population parameter lies in this interval. 

Let’s see an example. By setting an alpha of 0.05 and using empirical quantiles in R, it is easy to code up a 95% credible interval. Here is a normal distribution with a mean of 5 and a standard deviation of 2, whose parameter’s 95% credible interval is computed from one million samples.

alpha_conf <- 0.05
mu <- 5
sigma <- 2

# generate data to compute empirical quantiles
n_sim <- 1000000
theta <- sort(rnorm(n_sim, mu, sigma))

# indices of the lower and upper 2.5% empirical quantiles
lower_idx <- round((alpha_conf / 2) * n_sim)
upper_idx <- round((1 - alpha_conf / 2) * n_sim)
q_lower <- theta[lower_idx]
q_upper <- theta[upper_idx]

# coordinates for shading the interval under the density curve
x <- seq(-5, 15, by = 0.001)
y_val <- dnorm(x, mu, sigma)
x_coord <- c(q_lower, x[x >= q_lower & x <= q_upper], q_upper)
y_coord <- c(0, y_val[x >= q_lower & x <= q_upper], 0)

There would be a 95% probability that the true population parameter theta would lie in the interval [1.08, 8.92]. Isn’t that nice?

And just as with confidence intervals, we can also have equal-tailed or one-sided credible intervals. Here’s the same example with a one-sided credible interval.
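Computing it takes only a line or two if we reuse the sorted theta samples from above; this is just a sketch, and the lower endpoint of the plotted interval is simply the smallest sampled value.

# one-sided 95% credible interval: all of the mass below the 95th percentile
q_upper_one <- theta[round((1 - alpha_conf) * n_sim)]   # about 8.28
# the interval is effectively (-Inf, q_upper_one]; the plotted version is
# bounded below by min(theta), about -4.56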

Now there is a 95% probability that the true population parameter lies in the [-4.56, 8.28] interval. But this one-sided interval doesn’t represent the posterior distribution well, and it is a much larger interval.

The Highest Density Interval or HDI

Let’s explore now how credible intervals hold up against skewed distributions. 

Here is a skewed Gamma distribution showing a 95% equal-tailed credible interval. Notice how on the left side there are points outside of the credible interval that have a greater density than some points inside of the interval toward the right side. That’s not ideal.

How can we best represent a distribution that is skewed? Well, Bayesians have an alternative method for computing credible intervals called the Highest Density Interval, also sometimes called the Highest Posterior Density.

This is the narrowest credible interval that contains 95% of the distribution’s probability mass. Also notice how all the points inside of the interval have a higher density than those outside of it. Ahh, much better.

You can compute this easily in R using the hdi function from the HDInterval package, or simply code it yourself using optimization and the inverse of the CDF.

# theta here holds the Gamma posterior samples; alpha_1 and beta_1 are
# the Gamma shape and rate used to draw them
dens <- density(theta)
HPD_region <- HDInterval::hdi(dens, allowSplit = TRUE)
height <- attr(HPD_region, 'height')   # density cutoff defining the HDI
lower <- HPD_region[1, 1]
upper <- HPD_region[1, 2]

# coordinates for shading the interval under the density curve
x_coord <- c(lower, x[x >= lower & x <= upper], upper)
y_coord <- c(0, y_val[x >= lower & x <= upper], 0)

plot(x, dgamma(x, alpha_1, beta_1),
     type = 'l', col = 'darkslategray4', lwd = 2,
     xlab = expression(theta), ylab = 'density')
polygon(x_coord, y_coord, col = 'darkslategray2', lwd = 2,
        border = 'darkslategray4')
abline(h = height, col = 'indianred1', lty = 2, lwd = 3)
segments(lower, 0, upper, 0, lwd = 4, col = 'indianred1', lend = 'butt')
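And here is a minimal sketch of the do-it-yourself route mentioned above. For a unimodal distribution, the 95% HDI is the narrowest interval of the form [q(p), q(p + 0.95)], where q is the inverse CDF (quantile function), so we can simply optimize over the lower tail probability p. The hdi_diy name and the Gamma(2, 1) example are my own inventions.

# DIY HDI via optimization and the inverse CDF: find the lower tail
# probability p that minimizes the width of [q(p), q(p + cred)]
hdi_diy <- function(qf, cred = 0.95, ...) {
  width <- function(p) qf(p + cred, ...) - qf(p, ...)
  p_opt <- optimize(width, interval = c(0, 1 - cred))$minimum
  c(qf(p_opt, ...), qf(p_opt + cred, ...))
}

hdi_diy(qgamma, cred = 0.95, shape = 2, rate = 1)

Note that this trick only works for unimodal distributions, which is exactly the limitation the rest of this post is about.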

Why Wouldn’t You Want to Use the HDI?

Ok, so if the HDI is such a great representation of the parameter’s probability, then why shouldn’t we always use this interval?

Consider the example of a lightbulb that has an average life expectancy of 1 year. That could be modeled by an Exponential distribution with a mean of 365 days, whose rate parameter lambda is the inverse of that mean, 1/365.

When using a 95% HDI to summarize the population parameter, we can see that it is essentially a one-sided credible interval starting at zero. By this summary, the lightbulb could already be broken when you get home from the store.

However, when we look at the same scenario using a 95% equal-tailed credible interval, we can see that there is a period of about 9 days before the lightbulb is likely to break. This would be a better measure of the true population parameter in this case.
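A quick sanity check in R bears this out; these are exact quantiles of the Exponential distribution, so they will differ slightly from any sampled version.

lambda <- 1 / 365

# 95% equal-tailed credible interval: roughly [9.2, 1346.4] days
qexp(c(0.025, 0.975), rate = lambda)

# the 95% HDI of an Exponential is one-sided, starting at 0:
# roughly [0, 1093.4] days
qexp(0.95, rate = lambda)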

Multimodal Distributions

The real beauty of how Bayesians consider parameter intervals shows itself when you begin to consider multimodal data. To me this is the coolest and most elegant feature of Bayesian statistics and the reason I decided to create this post. 

Here is a mixture model comprised of two Beta distributions. First consider how this distribution’s population parameter would be represented with an equal-tailed 95% credible interval.

Notice how the middle of the interval has a very low density. Would you consider these values probable?

Now let’s see how this would be represented using a 95% HDI. 

Wow — did you know it was possible to have more than one credible interval? This is amazing. Notice how the improbable values in the middle of the mixture distribution were left out of the interval. This is a much better representation of the true population parameter. 

Using the hdi function, you can set the function’s allowSplit = TRUE parameter in order to calculate disjoint intervals.
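Here is a minimal sketch of the whole computation. The 50/50 mixture weights and the Beta(20, 50) and Beta(50, 20) components are my own stand-ins, chosen only to roughly resemble the plots above.

# sample from a hypothetical 50/50 mixture of two Beta distributions
set.seed(1)
n_sim <- 100000
comp <- rbinom(n_sim, 1, 0.5)
theta_mix <- ifelse(comp == 1, rbeta(n_sim, 50, 20), rbeta(n_sim, 20, 50))

# allowSplit = TRUE returns one row per disjoint interval
dens_mix <- density(theta_mix)
HDInterval::hdi(dens_mix, credMass = 0.95, allowSplit = TRUE)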

The Credible Set

When we have more than one credible interval, we can simply combine them using a union into what’s called a credible set. The previous interval would be written as [0.14, 0.41] U [0.61, 0.89]. I’d love to see a Frequentist confidence interval do this! Take that, z-score.

Actually, I believe a Frequentist approach would involve estimating each of the mixture densities and considering their parameters separately, but it isn’t anywhere near as elegant as how the Bayesian HDI handles it. You can find an example of this Frequentist approach in the research paper Modeling Flight Departure Delay Distributions, where the researchers try to identify the different distributions that make up a flight’s delay time. Next, I will use real flight delay data to show how this might look using the Bayesian HDI approach.

Real Multimodal Data

To model this I will use real flight delay data from the Bureau of Transportation Statistics website, showing the percentage of each type of delay for my favorite airline, Hawaiian Airlines (which is consistently recognized for its low delay times).

I will use this data to create a multimodal normal mixture model that I can sample from, but I will generate my own means and standard deviations for each delay cause as those were not provided. 
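Concretely, the sampling looks something like this; the weights play the role of the BTS delay-type percentages, and the means (in minutes) and standard deviations are the ones I made up.

# hypothetical normal mixture over three delay causes
weights <- c(0.5, 0.3, 0.2)   # stand-ins for the BTS percentages
means   <- c(-4, 5.5, 25.5)   # assumed mean delay in minutes per cause
sds     <- c(4, 1, 2)         # assumed spread per cause

set.seed(7)
n_sim <- 100000
comp <- sample(1:3, n_sim, replace = TRUE, prob = weights)
delays <- rnorm(n_sim, means[comp], sds[comp])

# three disjoint 95% HDI segments, one per mode
HDInterval::hdi(density(delays), credMass = 0.95, allowSplit = TRUE)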

Here you can see that Hawaiian Airlines’ delay times are quite good, with many flights even leaving early. However, because the different causes of delay have very different delay times associated with them, there are some delay values in between that are very improbable. Unfortunately, these improbable delay times are captured by the 95% equal-tailed credible interval.


Now compare this with the 95% HDI, where we get three disjoint intervals and the low-density values are omitted.

For this credible set we can say that with 95% probability the delay time will fall into the interval [-12.41, 1.51] U [4.15, 7.03] U [22.63, 28.5]. And that is a simple yet elegant representation of the multimodal distribution’s true population parameter, the average delay time.

Sources

Here are two books that have guided me the most in my understanding of Bayesian statistics. The first gives an entertaining walkthrough and the second a more comprehensive look at these topics.

For a great introduction to Bayesian statistics I’d like to recommend Bayesian Statistics: The Fun Way by Will Kurt. It is certainly the most entertaining book that I’ve come across and can lay a solid foundation for further study.

Much of the work above draws liberally from a great digital textbook entitled Bayesian Inference by Ville Hyvönen & Topias Tolonen, which was written as lecture notes for a course at the University of Helsinki. The explanations are thorough and the charts are elegant.

Resources

Conjugate Priors Wikipedia

R Documentation for the hdi Function

Bayesian Data Analysis, 3rd Edition, by Gelman et al.

Bayes’ Theorem and the Existence of God


Get in touch at: mr.sam.tritto@gmail.com