Deep Learning in Tableau with a Keras Neural Network and Python (TabPy)

Lately I’ve been experimenting with the Analytics Extensions in Tableau Desktop. I haven’t seen anyone incorporate Keras Deep Learning models yet, so I thought it would be a good challenge to explore the possibilities. Below I use a data set of flight data for a handful of airports to predict, with a Keras Deep Learning model, whether or not a future flight will be delayed. Then I deploy the model to Tableau Desktop using the TabPy package.

You can find all of the code in my GitHub Repo.

Here’s the tutorial… enjoy!

The Data Set

Before I begin I would just like to state that the focus of this tutorial is not going to be related to the complexities of building a deep learning model, or even dealing with imbalanced data — more on this shortly. The focus will be on deploying a Keras Model to Tableau (which I have not seen done yet), and sharing what I’ve learned along the way. 

The data set and general idea come from a Microsoft Learn module (found here) and a great tutorial from which I borrowed much of the data preprocessing code (found here). The data set consists of data from a handful of airports during 2016 and includes factors such as origins, destinations, times, distances, tail numbers, etc. The target variable will be whether or not the flight was delayed, which I will try to predict. 

Imbalanced Data

One of the more unique challenges this data set brings to the table is its class imbalance. It can be especially hard for machine learning algorithms to learn the patterns in the data if one class is poorly represented compared with the other. In this case there are many flights classified as “on time” but far fewer that are “delayed”. There are many techniques to address this, such as undersampling, oversampling, and even threshold moving. If you would like to read a great article on this topic where the author explores some advanced techniques like SMOTE, please follow this link (found here).

Since the focus of this tutorial isn’t on the data, I will stick to a simple method laid out in this official Keras tutorial.


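import numpy as np

# count the samples in each class (index 0 = on time, index 1 = delayed)
counts = np.bincount(y_train)
print(
    "Number of positive samples in training data: {} ({:.2f}% of total)".format(
        counts[1], 100 * float(counts[1]) / len(y_train)
    )
)

# weight each class by the inverse of its frequency
weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]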

It does a fairly good job of compensating for the imbalance. Another simple safeguard is to use a stratified split on the target variable when splitting the data into train and test sets. Stratifying does not balance the classes, but it keeps the proportion of delayed flights the same in both sets. Scikit-Learn has some great and easy options you can read more about (found here).

from sklearn.model_selection import train_test_split

# stratified split to better handle class imbalance
x_train, x_test, y_train, y_test = train_test_split(
    delay_df.drop('ARR_DEL15', axis=1),
    delay_df['ARR_DEL15'],
    test_size=0.20,
    stratify=delay_df['ARR_DEL15'],
    random_state=42,
)
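To see what stratification does, here is a minimal sketch in plain Python. It is not Scikit-Learn's implementation; the `stratified_split` and `positive_rate` helpers and the toy labels are hypothetical and only illustrate the idea that each class is split separately, so the share of delayed flights ends up the same in both sets.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same ratio.

    Illustrative helper only; it is not how Scikit-Learn implements stratify.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def positive_rate(labels, indices):
    """Fraction of the selected rows that belong to class 1."""
    return sum(labels[i] for i in indices) / len(indices)


# Toy target: 85 on-time flights (0) and 15 delayed flights (1).
y = [0] * 85 + [1] * 15
train_idx, test_idx = stratified_split(y)

print(positive_rate(y, train_idx))  # share of delayed flights in the train set
print(positive_rate(y, test_idx))   # share of delayed flights in the test set
```

Both printed shares match the share in the full toy data set, which is the property the stratify argument is meant to provide.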

 Comparing Other Machine Learning Classifiers

One point I’d like to address right now is that Deep Learning might not be the best method for this data set. It certainly can outperform traditional Machine Learning models, but typically only when the data set is very large, as explained in this article (found here). I ran a comparison on a few ML models before diving into Deep Learning. You could certainly spend all day doing grid searches to find the best hyperparameters, but I did not. Even though the data set was small, the Deep Learning model still outperformed the others. As expected, it was hard for the models to learn from the minority positive class.


 In the end this becomes a business problem… should we minimize false negatives or false positives? Which is more costly? Building the model will greatly depend on answering questions like these and having a clear idea of what you would like to accomplish.
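To make the trade-off concrete, the four possible outcomes can be counted directly. This is a small sketch with made-up labels and predictions (not results from the flight data), and the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = delayed)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["tp"] += 1   # delayed flight, predicted delayed
        elif actual == 0 and predicted == 1:
            counts["fp"] += 1   # on-time flight, predicted delayed
        elif actual == 0 and predicted == 0:
            counts["tn"] += 1   # on-time flight, predicted on time
        else:
            counts["fn"] += 1   # delayed flight, predicted on time
    return counts


# Made-up example: 1 = delayed, 0 = on time.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

counts = confusion_counts(y_true, y_pred)
recall = counts["tp"] / (counts["tp"] + counts["fn"])     # share of delays that were caught
precision = counts["tp"] / (counts["tp"] + counts["fp"])  # share of delay warnings that were right
print(counts, recall, precision)
```

Lowering false negatives raises recall, usually at the price of more false positives and therefore lower precision.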

And here is the Keras Model. It still needs some tuning, but the False Negatives (flights that were predicted as on time but were actually delayed) are much lower.


Keras Deep Learning Neural Network Model 

To build the model I consulted this Keras imbalanced classification tutorial and slightly modified their model. It provided a great foundation and built upon the class weights calculated previously. At the end of the article the author makes a great point (quoted below) that gives some direction for further experimentation. A simple way to do this would be to multiply the weight_for_1 variable by some value slightly over 1.0; something around 1.125 could work well.

In the real world, one would put an even higher weight on class 1, so as to reflect that False Negatives are more costly than False Positives.
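As a rough sketch of that idea, the inverse-frequency weights can be computed with an extra multiplier on class 1. The helper name `class_weights`, the `positive_boost` parameter, and the toy label counts are assumptions for illustration; only the 1.125 factor comes from the suggestion above.

```python
from collections import Counter


def class_weights(labels, positive_boost=1.0):
    """Inverse-frequency weights for binary labels, with an optional boost for class 1."""
    counts = Counter(labels)
    weight_for_0 = 1.0 / counts[0]
    weight_for_1 = positive_boost / counts[1]
    return {0: weight_for_0, 1: weight_for_1}


# Toy labels: 850 on-time flights and 150 delayed flights (made-up numbers).
y_train = [0] * 850 + [1] * 150

baseline = class_weights(y_train)
boosted = class_weights(y_train, positive_boost=1.125)

print(baseline)
print(boosted)
```

The resulting dictionary has the same shape as the class_weight argument that Keras expects in fit(), so different boost values can be compared by retraining with each one.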

Here is my very simple neural network.


from tensorflow import keras  # or "import keras", depending on your installation

model = keras.Sequential(
    [
        keras.layers.Dense(
            256, activation="relu", input_shape=(x.shape[-1],)
        ),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.1),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.01),
        # single sigmoid unit: outputs the probability of a delay
        keras.layers.Dense(1, activation="sigmoid"),
    ]
)
model.summary()

 Put a Pickle Wrapper on the Keras Model

Now I think one of the challenges holding many people back from using Keras in Tableau is that Keras models are hard to pickle, which makes it hard to send them to the TabPy server. You can learn more in this Keras support issue here.

One workaround is to make the model picklable before sending it to Tableau. There is a wonderful pickle wrapper called keras_pickle_wrapper which accomplishes this easily.

So finally, with this workaround, here is the model being built. Notice the class_weight argument, which handles the class imbalance.

from keras_pickle_wrapper import KerasPickleWrapper

model.compile(
    optimizer=keras.optimizers.Adam(1e-2), loss="binary_crossentropy"
)

# save a checkpoint after every epoch
callbacks = [keras.callbacks.ModelCheckpoint("delay_model_at_epoch_{epoch}.h5")]
class_weight = {0: weight_for_0, 1: weight_for_1}

# wrap the model so it can be pickled; kpw() returns the underlying Keras model
kpw = KerasPickleWrapper(model)
kpw().fit(
    x,
    y,
    batch_size=64,
    epochs=100,
    verbose=0,
    callbacks=callbacks,
    validation_split=0.2,
    class_weight=class_weight,
)

 Setting up the TabPy Connection with Tableau

Now it is time to connect with Tableau. First install TabPy via pip or another method, then import the libraries in your notebook. You can follow the instructions in their GitHub Repo and learn more about TabPy here.


import tabpy
import tabpy_client
from tabpy.tabpy_tools.client import Client

Once installed you must navigate to your terminal and enter the command:

 tabpy

 Next move on over to Tableau and find the Manage Analytics Extension Connection tab. You can learn more in this post from Tableau here.

 Then choose TabPy and enter “localhost” as the server and 9004 as the port, which is TabPy’s default.


 The last step would be to navigate back to the python notebook and run the following code to establish the connection.

 client = tabpy_client.Client('http://localhost:9004/')

 Creating a Python Function and Handling Dummy Variables

Now, in order for the Tableau dashboard to interact dynamically with your model, you will need to write a Python function that receives the variables. You will notice that my function takes the input variables from Tableau, transforms them, puts them in a data frame, predicts right from the pickle wrapper, and then returns a string (returning a float32 value gives a serialization error).
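The serialization problem can be reproduced without Tableau. The sketch below assumes the error arises when the result is encoded as JSON on its way back to Tableau; the 0.8125 value is a made-up stand-in for a model output, not a real prediction.

```python
import json

import numpy as np

# Stand-in for one model output; Keras predictions are typically float32 arrays.
prediction = np.array([[0.8125]], dtype=np.float32)
value = prediction[0][0]

try:
    json.dumps(value)
    serializable = True
except TypeError:
    # The standard json encoder does not know how to handle numpy.float32.
    serializable = False

as_string = json.dumps(str(value))   # the approach used in this tutorial
as_float = json.dumps(float(value))  # alternative: convert to a built-in float

print(serializable, as_string, as_float)
```

If the function returned float(...) instead of a string, the calculated field could in principle use SCRIPT_REAL() rather than SCRIPT_STR() wrapped in FLOAT(); that variant is untested here.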

Another challenge with using Deep Learning models in Tableau is that they often require dummy variables to encode categorical data. To transform your variables you can run some simple if/else lines in Python. Make sure to access the first element of each argument, as Tableau passes its variables as lists (for example, _arg5[0]).


def PredictDelay(_arg1, _arg2, _arg3, _arg4, _arg5, _arg6):

    import pandas as pd

    # handle dummy variable assignments
    ORIGIN_ATL = 1 if _arg4[0] == 'ATL' else 0
    ORIGIN_DTW = 1 if _arg4[0] == 'DTW' else 0
    ORIGIN_JFK = 1 if _arg4[0] == 'JFK' else 0
    ORIGIN_MSP = 1 if _arg4[0] == 'MSP' else 0
    ORIGIN_SEA = 1 if _arg4[0] == 'SEA' else 0

    DEST_ATL = 1 if _arg5[0] == 'ATL' else 0
    DEST_DTW = 1 if _arg5[0] == 'DTW' else 0
    DEST_JFK = 1 if _arg5[0] == 'JFK' else 0
    DEST_MSP = 1 if _arg5[0] == 'MSP' else 0
    DEST_SEA = 1 if _arg5[0] == 'SEA' else 0

    # create a data dictionary
    row = {'MONTH': _arg1,
           'DAY_OF_MONTH': _arg2,
           'DAY_OF_WEEK': _arg3,
           'CRS_DEP_TIME': _arg6,

           'ORIGIN_ATL': ORIGIN_ATL,
           'ORIGIN_DTW': ORIGIN_DTW,
           'ORIGIN_JFK': ORIGIN_JFK,
           'ORIGIN_MSP': ORIGIN_MSP,
           'ORIGIN_SEA': ORIGIN_SEA,

           'DEST_ATL': DEST_ATL,
           'DEST_DTW': DEST_DTW,
           'DEST_JFK': DEST_JFK,
           'DEST_MSP': DEST_MSP,
           'DEST_SEA': DEST_SEA,
           }

    # convert it into a dataframe
    X = pd.DataFrame(data=row, index=[0])

    # return prediction as a string since float32 cannot be serialized
    return str(kpw().predict(X)[0][0])
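The if/else lines above can also be generated in a loop, which may be easier to maintain if more airports are added. This is only a sketch of an alternative: `AIRPORTS`, `one_hot`, and `build_row` are hypothetical names, and the column order is assumed to match the order used in the function above.

```python
AIRPORTS = ("ATL", "DTW", "JFK", "MSP", "SEA")


def one_hot(prefix, value, categories=AIRPORTS):
    """Return {prefix_CATEGORY: 0 or 1} with a 1 only for the matching category."""
    return {f"{prefix}_{code}": int(value == code) for code in categories}


def build_row(month, day_of_month, day_of_week, origin, dest, dep_time):
    """Build one model input row from Tableau-style single-element lists."""
    row = {
        "MONTH": month[0],
        "DAY_OF_MONTH": day_of_month[0],
        "DAY_OF_WEEK": day_of_week[0],
        "CRS_DEP_TIME": dep_time[0],
    }
    row.update(one_hot("ORIGIN", origin[0]))
    row.update(one_hot("DEST", dest[0]))
    return row


# Tableau passes each argument as a list, so the example does the same.
row = build_row([7], [4], [1], ["JFK"], ["SEA"], [1430])
print(row)
```

The resulting dictionary could then be turned into a one-row data frame with pd.DataFrame([row]) before calling predict, provided the columns match the ones the model was trained on.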

 Deploying the Keras Model

You need one last call to publish the Python function to the TabPy server. Setting override = True lets you redeploy the function under the same name, replacing the previously deployed version.

client.deploy(
    'PredictDelay',
    PredictDelay,
    'Returns the probability of a delayed flight',
    override=True,
)

 Creating Parameters and a Calculated Field in Tableau

Now that your Python notebook is connected and ready, you can navigate back to Tableau and create a new Calculated Field for this prediction. In this Calculated Field you will need to access Parameters, so let’s create them first. You will need one for each predictor variable in your model. Make sure to set the values and the data type to whatever makes sense. For instance, the Day of the Month parameter should accept integers from 1 to 31.


 Now to create a Calculated Field. You can follow this form, and learn more from Tableau here. I used the SCRIPT_STR() function because the Python prediction is returned as a string, then wrapped it in a FLOAT() function to convert it back to a number (which let me display it as a percentage rather than a decimal). Inside the script, use the form tabpy.query()['response'] to call the deployed Python function: the name of the function is the first argument, followed by one _argX placeholder for each input. After the script, list the corresponding Parameters in the same order. The Parameters let you control the values passed into the Python Deep Learning model.

 Building the Dashboard

Now the only thing left to do is build the dashboard. The predictions are dynamic in Tableau making this a great exploratory tool. You can play around with the Parameters by adding them to the dashboard. But remember the predictions are only as good as the data and model quality.

 And here’s a video of the dashboard in action!

 If you are interested in downloading the notebook, you can find a link to the dashboard in my GitHub Repo here

Thanks for Reading!

Please stop by my personal projects page!

Add me on LinkedIn

And Medium.


Get in touch at:       mr.sam.tritto@gmail.com