Machine Learning in Tableau Using R and Dynamic K-Means Clustering

 

Here is a simple tutorial on using the R statistical language in Tableau for more advanced ML features, including clustering.

Although Tableau has recently introduced some clustering functionality, I wanted to explore connecting my Tableau workbook to the R statistical language for a more nuanced and tunable approach. With R I can set the seed for reproducibility, scale my factors, tune hyperparameters such as the number of random sets of starting centers and the maximum number of iterations, and even choose which clustering algorithm the kmeans() function in R uses. The other plus is that it is extremely simple to do, especially if you are comfortable working in R. You can also change the number of clusters interactively and see the changes in real time, something you cannot easily do in Tableau otherwise. Below are the possible algorithms to choose from:

"Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"

These could potentially be hugely beneficial to the analyst who has a particularly nuanced situation to model. The only real drawback I can see to using R in Tableau is that you can no longer easily share the dashboard, even on Tableau Public. This is because Tableau establishes a connection behind the scenes to the Rserve session started from RStudio on your local computer.
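To make the tuning options above concrete, here is a minimal sketch of a kmeans() call that sets each of them. It assumes R's built-in USArrests data as a stand-in for the crime dataset used in this post, and the seed, cluster count, nstart, and iter.max values are arbitrary choices for illustration. Note that R's documentation treats "Lloyd" and "Forgy" as two names for the same algorithm.

```r
# Stand-in data: R's built-in USArrests, NOT the dataset used in this post.
# scale() standardizes each column to mean 0 and standard deviation 1.
features <- scale(USArrests[, c("Murder", "Assault", "Rape")])

set.seed(42)  # arbitrary seed, fixed so the clusters are reproducible

fit <- kmeans(
  features,
  centers   = 3,        # number of clusters (arbitrary here)
  nstart    = 25,       # number of random sets of starting centers to try
  iter.max  = 50,       # maximum number of iterations per start
  algorithm = "Lloyd"   # or "Hartigan-Wong" (the default), "Forgy", "MacQueen"
)

# How many states landed in each cluster
print(table(fit$cluster))
```

Swapping the algorithm argument or the seed and re-running is a quick way to see how sensitive the clusters are to those choices.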

The Data

The dataset I used is from the Uniform Crime Reporting Statistics. Crime statistics are available for public review through the U.S. Department of Justice and the Federal Bureau of Investigation. The dataset has information on crime rates and totals for states across the United States over a wide range of years, from 1960 to 2012. The crime reports are divided into two main categories: property crime and violent crime. Property crime refers to burglary, larceny, and motor vehicle theft, while violent crime refers to assault, murder, rape, and robbery. You can find the link here!

Installing Rserve in RStudio

The first step was to open RStudio and install the Rserve package. It is important to get the latest version and to make sure it is compatible with your version of R. For me, the following commands worked to install and load the package and then start the server:

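# Install Rserve from RForge (version 1.8-6 at the time of writing)
install.packages("Rserve", repos = "http://www.rforge.net/")

# Load the package
library(Rserve)

# Start the Rserve server that Tableau will connect to
Rserve(args = "--no-save")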
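If you are unsure whether your Rserve install matches your R version, a quick check like the one below can help. This is only a sketch that prints the two versions side by side; it does not test the Tableau connection itself.

```r
# Report the running R version and, if it is installed, the Rserve version.
r_version <- R.version.string

rserve_version <- if (requireNamespace("Rserve", quietly = TRUE)) {
  as.character(packageVersion("Rserve"))
} else {
  NA_character_  # Rserve is not installed in this library
}

cat("R:     ", r_version, "\n")
cat("Rserve:", rserve_version, "\n")
```

By default Rserve listens on port 6311, so that is the port to enter on the Tableau side unless you have changed it.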

Setting Up The Connection in Tableau

The next step was to open Tableau Desktop and establish an Analytics Extension connection. Here is a tutorial from Tableau that can help you through the process (Tableau Rserve Tutorial). 


Creating a Parameter for Cluster Size

After the data is loaded into Tableau, the first thing you need to do is create a parameter that will control the number of clusters generated in the R function. The end result will be a slider or dropdown menu allowing dynamic control of the number of clusters generated in your function and chart. I set it to be an integer from 2 to 10, the minimum and maximum number of clusters I am interested in. 


Creating a Calculated Field for Clusters

The next step is to create a calculated field to host the R code. I used the SCRIPT_INT() function since I wanted to output the cluster number as an integer; there are also SCRIPT_REAL(), SCRIPT_STR(), and SCRIPT_BOOL() for other output types. It may seem strange, but each factor is referred to as .argX in the R code, and the R code is followed by a list of the factor arguments, which are matched in order to .arg1, .arg2, and so on. The factors also need to be in aggregate form, such as SUM([Factor]). This will not affect the calculation as long as each mark in the view corresponds to a single row of data, so don't sweat it. The parameter created just before is used in the calculation, passed in with STR() at the end of the function, so the result is dynamic and driven by the parameter control. Here is my calculated field; notice the use of the scale() function and set.seed().
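As a rough sketch (not the exact field used here), the R side of such a calculation can be tried directly in RStudio before pasting it into Tableau. The example below fakes the .arg1, .arg2, and .arg3 inputs with columns from R's built-in USArrests data, and uses a hypothetical variable k in place of the Tableau parameter. The seed is arbitrary.

```r
# Simulated inputs. In Tableau these vectors would arrive from the
# aggregated fields listed after the script, one value per mark.
.arg1 <- USArrests$Murder
.arg2 <- USArrests$Assault
.arg3 <- USArrests$Rape

# Hypothetical stand-in for the cluster-count parameter from Tableau.
k <- 4

set.seed(42)  # arbitrary seed so the cluster labels are reproducible

# Standardize the inputs so no single variable dominates the distances.
features <- scale(data.frame(.arg1, .arg2, .arg3))

result <- kmeans(features, centers = k)

# SCRIPT_INT() expects integers back, one per input row.
cluster_ids <- as.integer(result$cluster)
print(head(cluster_ids))
```

The last expression is what matters to Tableau: the script has to return a vector with one cluster label for every row it was given.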

 

Dragging the Clusters to the Filter and Marks Cards

Now, just like any calculated field in Tableau, you can drag the field onto the Filters and Marks cards. I also had to drag in the State variable and then make sure to compute the filter using the State variable.

Dynamic Clusters in the Dashboard

After building a dashboard and adding the filters, you can now dynamically change the number of clusters with a dropdown or slider. This comes from using R with Tableau and can be a great tool for analyzing how the number of clusters affects the clustering.
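The same comparison can be made numerically in R. The sketch below sweeps the cluster count over the 2 to 10 range used for the parameter and records the total within-cluster sum of squares for each value. It again assumes the built-in USArrests data as a stand-in, with an arbitrary seed and nstart.

```r
# Stand-in data: R's built-in USArrests, NOT the dataset used in this post.
features <- scale(USArrests[, c("Murder", "Assault", "Rape")])

ks <- 2:10    # same range as the Tableau parameter
set.seed(42)  # arbitrary seed for reproducibility

# Total within-cluster sum of squares for each candidate cluster count.
wss <- sapply(ks, function(k) {
  kmeans(features, centers = k, nstart = 25)$tot.withinss
})

print(data.frame(k = ks, tot_withinss = round(wss, 2)))
```

Lower values mean tighter clusters, but the number always falls as k grows, so the point where the improvement levels off is usually more informative than the minimum.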

Get in touch at: mr.sam.tritto@gmail.com