Recent Projects

The Causal Effect of the Maui Wildfires on the Unemployment Rate with CausalImpact (python) and Dynamic Time Warping (tslearn)


 Here's a short tutorial on evaluating the estimated causal effect of the recent Maui wildfires on the local unemployment rate using data from the Bureau of Labor Statistics (BLS). Rather than measure correlation or association, here I'll dive into causality using the Python implementation of the CausalImpact library, and use the Dynamic Time Warping algorithm from tslearn in an attempt to choose the most similar counties to measure against as controls (a form of Market Matching).
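As a taste of the matching step, here's a minimal pure-Python sketch of the distance that tslearn's `dtw` function computes (tslearn layers global constraints and optimized backends on top of this); the toy series are made up for illustration:

```python
import math

def dtw_distance(s1, s2):
    """Minimal Dynamic Time Warping distance between two 1-D series.
    tslearn's metrics.dtw computes the same quantity (with more options)."""
    n, m = len(s1), len(s2)
    # cost[i][j] = best cumulative cost aligning s1[:i+1] with s2[:j+1]
    cost = [[math.inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = (s1[i] - s2[j]) ** 2
            if i == 0 and j == 0:
                cost[i][j] = d
            else:
                best = min(
                    cost[i - 1][j] if i > 0 else math.inf,                 # step down
                    cost[i][j - 1] if j > 0 else math.inf,                 # step right
                    cost[i - 1][j - 1] if i > 0 and j > 0 else math.inf,   # diagonal match
                )
                cost[i][j] = d + best
    return math.sqrt(cost[-1][-1])

# Identical series have zero DTW distance, even at different "speeds"
print(dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3]))  # 0.0
```

The point for market matching: counties whose unemployment series have a small DTW distance to Maui's pre-wildfire series make plausible synthetic controls, even if their seasonal wiggles are slightly out of phase.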

Talk to Your BigQuery Data with GCP (VertexAI PaLM 2 LLM) and LangChain


Here's a short tutorial on how you can set up an LLM on GCP to talk to your BigQuery data through VertexAI using the PaLM 2 LLM... otherwise known as Table Q&A. I'll store a sample HR Attrition dataset from IBM in BigQuery and then set up an LLM in order for us to chat with the data. We'll be able to ask it simple questions and validate its answers, all in only a few lines of code.

MLOps on GCP: Upcoming Local Shows Playlist (DataOps)


This will be Part 1 of a tutorial on how to create a simple Flask web app, which will ultimately help a user create a playlist on their Spotify account containing the most popular songs from artists that will be playing in their area in the upcoming months. Part 1 will set up a simple ETL data process on GCP, focusing on pulling data from the APIs of both Spotify and SeatGeek, combining the data, and then uploading/automating the process through GCP using App Engine, Cloud Scheduler, Cloud Storage, and Secret Manager.


More Projects

Student's Unpaired t-Test (Independent / Two-Sample) for Means (A/B Testing) with SciPy, statsmodels, and pingouin


An implementation of Student's Unpaired t-Test for Means from end to end. This is the appropriate test for comparing the means between 2 independent but similar groups with small-ish to large sample sizes in an A/B Testing scenario.
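A minimal sketch of the SciPy version with invented sample values:

```python
from scipy import stats

# Hypothetical metric samples for control (A) and treatment (B)
group_a = [10, 12, 9, 11, 10]
group_b = [13, 14, 12, 15, 13]

# Student's unpaired t-test assumes roughly equal variances;
# pass equal_var=False for Welch's variant when that's doubtful
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```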

The McNemar's Test for Proportions (A/B Testing) with pingouin and statsmodels


An implementation of McNemar's Test for Proportions from end to end. This is the appropriate test for comparing the proportion of binary data between 2 paired groups in an A/B Testing scenario.
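Under the hood, the exact version of this test (as exposed by statsmodels and pingouin) boils down to a two-sided binomial test on the discordant pairs; a stdlib-only sketch, with made-up cell counts:

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact McNemar test p-value from the two discordant cell counts:
    b = pairs that flipped one way, c = pairs that flipped the other.
    Under H0 the discordant pairs split 50/50, so this is a two-sided
    binomial test on min(b, c) out of n = b + c trials."""
    n = b + c
    k = min(b, c)
    # twice the lower tail of Binomial(n, 0.5), capped at 1
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# e.g. 3 users converted only under A, 9 converted only under B
print(f"p = {mcnemar_exact(3, 9):.4f}")
```

Note that the concordant cells (pairs that behaved the same under both variants) don't enter the statistic at all; only the disagreements carry information.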

The Z-Test for Proportions (A/B Testing) with SciPy and statsmodels


An implementation of the Z-Test for Proportions from end to end. This is the appropriate test for comparing the proportion of binary data between 2 independent groups with different and large sample sizes in an A/B Testing scenario.
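statsmodels' `proportions_ztest` does the heavy lifting in the tutorial, but the classic pooled two-sided version is simple enough to sketch with nothing but the stdlib (the conversion counts below are invented):

```python
from math import sqrt, erf

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference of two independent proportions,
    using the pooled proportion to estimate the standard error under H0."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF via erf
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# e.g. 120/1000 conversions under A vs 150/1000 under B
z, p = two_proportion_ztest(120, 1000, 150, 1000)
print(f"z = {z:.3f}, p = {p:.4f}")
```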

Fisher's Exact Test (and Barnard's and Boschloo's) for Proportions (A/B Testing) with SciPy


An implementation of Fisher's Exact Test for Proportions from end to end. This is the appropriate test for comparing the proportion of categorical data between 2 independent groups with small sample sizes in an A/B Testing scenario. I'll also go over Barnard's and Boschloo's Exact tests, which are both considered improvements to Fisher's test.
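A minimal SciPy example with a made-up 2x2 table (SciPy 1.7+ also ships `scipy.stats.barnard_exact` and `scipy.stats.boschloo_exact` for the improved variants):

```python
from scipy.stats import fisher_exact

# 2x2 contingency table: rows = variant A/B, cols = converted / not converted
table = [[8, 2],
         [1, 5]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.1f}, p = {p_value:.4f}")
```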

The Binomial Test for Proportions (A/B Testing) with SciPy, statsmodels, and pingouin


An implementation of the Binomial Test for Proportions from end to end. This is the appropriate test for comparing the proportion of binary data between 2 independent groups with different sample sizes in an A/B Testing scenario.
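The one-liner at the heart of it, via `scipy.stats.binomtest` (the modern replacement for the older `binom_test`), with invented counts:

```python
from scipy.stats import binomtest

# e.g. 14 conversions in 20 trials vs. an assumed baseline rate of 0.5
result = binomtest(k=14, n=20, p=0.5, alternative="two-sided")
print(f"p = {result.pvalue:.4f}")
```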

The Chi-Squared Test for Proportions (A/B Testing) with SciPy, statsmodels, and pingouin


An implementation of the Chi-Squared Test for Proportions from end to end. This is the appropriate test for comparing the proportion of categorical data between 2 or more groups in an A/B Testing scenario.
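A minimal SciPy sketch with an invented three-variant table, showing why this test handles "2 or more" groups naturally:

```python
from scipy.stats import chi2_contingency

# rows = variants A/B/C, cols = converted / not converted
observed = [[30, 70],
            [45, 55],
            [50, 50]]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```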

4 Bayesian Regressions with Linear, Binomial Logistic, Poisson, and Exponential


 Here's a walkthrough of 4 different flavors of Bayesian regression with inference, each built around a separate case study or scenario using synthetic data. This might be interesting for someone who is familiar with the concept of regression and has always wondered what the fuss is with Bayesian statistics. You'll see that while it might require the use of pymc, a library for Bayesian computation, the structure is very similar to the Frequentist approach. You might even find that inference with Bayesian statistics is more flexible and more insightful.

Bayesian Hierarchical A/B Testing with pyMC3


For this product Data Science project I’ll explore the use of Bayesian inference in A/B testing using the PyMC3 library. Using synthetic data, the idea behind the project will be to test 4 new playlist algorithms against the current algorithm. The metrics will focus on user interaction during the first selected song: the skip rate and the average time it took a user to skip the song.
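The full hierarchical model lives in PyMC3, but for a single skip-rate comparison the core idea can be sketched with the conjugate Beta-Binomial update and NumPy alone (all counts below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic skip counts: (skips, songs served) per algorithm
skips_control, n_control = 420, 1000   # current algorithm
skips_variant, n_variant = 380, 1000   # one new playlist algorithm

# Beta(1, 1) prior + Binomial likelihood -> Beta posterior (conjugacy),
# so we can sample the posteriors directly instead of running MCMC
post_control = rng.beta(1 + skips_control, 1 + n_control - skips_control, size=100_000)
post_variant = rng.beta(1 + skips_variant, 1 + n_variant - skips_variant, size=100_000)

# Posterior probability that the new algorithm lowers the skip rate
prob_better = (post_variant < post_control).mean()
print(f"P(variant skip rate < control) = {prob_better:.3f}")
```

PyMC3 earns its keep once the model is hierarchical (partial pooling across the 4 variants) or the metric, like time-to-skip, has no tidy conjugate form.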

Hawai'i Median Home Price Forecasting with prophet


In this project I'll attempt to forecast Hawai'i median home prices with the prophet library, and explore some intermediate features while doing so. I'll take a look at seasonality, changepoints, growth modes, anomaly omission, and prior scales in order to find a plausibly accurate forecast for home prices. And while this typically would be fairly straightforward, we'll see that the pandemic has given us some volatility that needs to be accounted for in order to find a nicely fitting model.




Mauna Loa Forecasting CO2 Emissions with prophet


 This will be a very short project where I'll forecast CO2 emissions recorded on top of Mauna Loa on the Big Island of Hawai'i using the prophet library in Python. While it won't be the most complex trend, I mostly wanted to forecast this data having lived on the Big Island for a handful of years and even walked right up to the lava flow a few times... I couldn't pass up the opportunity. Plus prophet is just so easy to set up and use, and even tuning it is fairly straightforward. I'll probably circle back and do a more complicated forecast later on, but for now let's holo holo!

The Best Reason to Practice Bayesian Statistics that You Haven't Heard of

This post is dedicated to Bayesian statistics, in particular how it gracefully handles summarizing parameters for multimodal data using the Highest Density Interval or HDI (also referred to as the Highest Posterior Density or HPD). Here I'll show how it's possible to generate two or more credible intervals for your parameters.
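To make the HDI concrete, here's a minimal sketch of the usual contiguous version computed from posterior samples: the narrowest window containing the requested mass. For genuinely multimodal posteriors, libraries like ArviZ can instead return several disjoint intervals, which is the graceful behavior the post is about.

```python
import numpy as np

def hdi(samples, cred_mass=0.94):
    """Narrowest contiguous interval containing cred_mass of the samples."""
    sorted_pts = np.sort(samples)
    n_included = int(np.ceil(cred_mass * len(sorted_pts)))
    # width of every candidate window holding n_included consecutive samples
    widths = sorted_pts[n_included - 1:] - sorted_pts[:len(sorted_pts) - n_included + 1]
    best = int(np.argmin(widths))
    return sorted_pts[best], sorted_pts[best + n_included - 1]

rng = np.random.default_rng(1)
samples = rng.normal(0, 1, 10_000)   # stand-in for posterior draws
lo, hi = hdi(samples, 0.94)
print(f"94% HDI: ({lo:.2f}, {hi:.2f})")
```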

Upcoming Local Shows Playlist with the Spotify API

Here I scrape the bandsintown.com website for local upcoming shows, then connect that data to the Spotify API through fuzzy string matching on the artist name in order to generate a Spotify playlist of their 3 most popular songs. Now we can quickly and easily explore the music of artists that will be playing live in our area soon, and decide if we'd like to go out and catch a show.
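The fuzzy matching step can be sketched with nothing but the stdlib's difflib (the catalog names and threshold here are made-up stand-ins for the real Spotify search results):

```python
from difflib import SequenceMatcher

def best_artist_match(scraped_name, spotify_names, threshold=0.8):
    """Match a scraped artist name to the closest Spotify artist name,
    tolerating punctuation and casing differences; None if nothing is close."""
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    best = max(spotify_names, key=lambda name: similarity(scraped_name, name))
    return best if similarity(scraped_name, best) >= threshold else None

catalog = ["The Big Takeover", "Big Thief", "Taking Back Sunday"]
print(best_artist_match("the big takeover!", catalog))   # The Big Takeover
```

The threshold guards against the scraper pairing a local bar band with a famous artist who merely shares a word in the name.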

Machine Learning to Predict Student Grades with CatBoost and Optuna

Predicting student grades with quantitative and qualitative data at various time periods of the year using CatBoost Regression/Classification and Optuna for hyperparameter tuning.


Deep Learning in Tableau Using a Keras Neural Network with TabPy

Lately I’ve been experimenting with utilizing the Analytics Extensions in Tableau Desktop. I haven’t quite seen anyone incorporate any Keras Deep Learning models yet, so I thought it would be a good challenge to explore the possibilities. Here I used a data set containing flight data for a handful of airports to try to predict whether or not a future flight will be delayed with a Keras Deep Learning model. Then I will deploy the model to Tableau Desktop using the TabPy package.


Machine Learning in Tableau Using R and Dynamic K-Means Clustering

Here is a simple tutorial for using the R statistical language in Tableau for more advanced ML features including Clustering. Although Tableau has recently introduced some Clustering functionality, I wanted to explore connecting my Tableau workbook with the R statistical language for a more nuanced and tunable approach.



Car Financing Conversion A/B Test with scipy and statsmodels

For this project I used a synthetic dataset to analyze the results of an A/B test where a fictional car financing company experimented with lowering the APR given to a customer with the hopes of increasing both their sales and margins. 


Optimal Airline Overbooking Seat Number Simulation in R

Here is a project using statistical computer simulation to answer the question of how many overbooked seats results in the highest revenue for an airline. 
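The project itself is in R, but the simulation idea translates directly; here's a minimal Python sketch, where the ticket price, bump cost, and show-up probability are all invented placeholders:

```python
import random

def expected_revenue(tickets_sold, seats=100, show_prob=0.92,
                     ticket_price=250, bump_cost=800, n_trials=10_000, seed=7):
    """Monte Carlo estimate of revenue when selling more tickets than seats.
    Each passenger shows up independently with probability show_prob;
    every bumped passenger costs bump_cost in compensation."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        shows = sum(rng.random() < show_prob for _ in range(tickets_sold))
        bumped = max(0, shows - seats)
        total += tickets_sold * ticket_price - bumped * bump_cost
    return total / n_trials

# Sweep a few overbooking levels to see where expected revenue peaks
for sold in (100, 105, 110):
    print(sold, round(expected_revenue(sold), 2))
```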


SIR Model Monte Carlo Simulation of Pandemic Flu Spread in a Classroom Setting in python

Here is a project where I used statistical computer simulation to model a pandemic flu outbreak in a classroom setting. 
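A minimal Reed-Frost-style sketch of the generational mechanics (the class size and transmission probability are illustrative assumptions, not the project's actual parameters):

```python
import random

def simulate_outbreak(n_students=31, p_transmit=0.02, n_trials=10_000, seed=11):
    """Reed-Frost-style Monte Carlo of a classroom flu outbreak starting
    from one infected student; returns the mean total number ever infected."""
    rng = random.Random(seed)
    totals = 0
    for _ in range(n_trials):
        susceptible = n_students - 1
        infectious = 1                     # patient zero
        total_infected = 1
        while infectious > 0 and susceptible > 0:
            # a susceptible stays healthy only by escaping every infectious contact
            p_infected = 1 - (1 - p_transmit) ** infectious
            new_cases = sum(rng.random() < p_infected for _ in range(susceptible))
            susceptible -= new_cases
            total_infected += new_cases
            infectious = new_cases         # next generation of infectious students
        totals += total_infected
    return totals / n_trials

print(round(simulate_outbreak(), 2))
```

With these numbers each case infects fewer than one other student on average, so most simulated outbreaks fizzle after a generation or two.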


Twitter's 50 Trending Topics Sentiment Analysis & Dashboard with GCP and the Twitter API

Training, Automating, and Deploying a Custom Deep Neural Network on the Google Cloud Platform.

One Million Podcasts Analyzed with the Spotify API

This is a follow up project to my last article where I showed how to access the Spotify API specifically for podcast data. Previously I showed how you can amass a dataset of podcast information related to a specific search term such as “data science”. Here I will show how you can join several of these datasets together, thus enabling you to look for trends among a very large dataset composed of several search topics.


Spotify Podcast Data with the Spotify API

A tutorial on how to access the Spotify API specifically for podcast data, and what you can potentially do with it! For this project I will show how to gather podcast data for every show and episode related to a search term such as “data science”.


Creating a Willie Nelson Inspired Song with textgenrnn and spaCy

For this project I wanted to try my hand at some text generation. And as a Willie Nelson fan, I was hoping to use it to write a brand new Willie Nelson Song based off of all of his song lyrics. The lyrics were obtained easily through the Genius API and the text generation was performed with textgenrnn and spaCy.


The Big Takeover Band: 12+ Years of Real Tour Data

I’ve been playing around a lot lately with the Spotify API, but I was wondering what other kind of musical data is out there. Was there any live show data I could play around with? So I reached out to my former band mates in The Big Takeover and asked them if they had anything. Turns out the excel spreadsheet I had started 12+ years ago was still going strong, full of years worth of shows, pay, attendance, etc… And boy was it a mess (almost as messy as the band van)! For this project I wanted to see what kind of value I could create for them – if I could clean it up well enough first!

How to Write a Bestselling Book in 2020 with the New York Times API

For this article I will show you how to obtain and create a dataset from the New York Times API, containing books from one of their many weekly “Bestseller’s” lists from over a decade’s worth of publications. In this dataset you will find descriptions of all of the books that made it onto their bestsellers lists, as well as the titles and authors names. After some human data analysis I will then use Machine Learning techniques (Markov Chain Ensemble) for text generation to create a description of what we think a new bestselling book would be in 2020.
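The Markov chain piece is simple enough to sketch with the stdlib; the toy corpus below stands in for the real scraped book descriptions:

```python
import random
from collections import defaultdict

def build_markov_chain(text):
    """Word-level Markov chain: maps each word to the words observed after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length=8, seed=3):
    """Walk the chain from a start word, sampling each next word at random."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break           # dead end: the word was never followed by anything
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "a gripping tale of love and loss a gripping story of hope"
chain = build_markov_chain(corpus)
print(generate(chain, "a"))
```

Because repeated follow-ups are stored as duplicates in the lists, common transitions are sampled proportionally more often, which is what makes the generated descriptions read like plausible blurbs.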

How to Write a Hit Song in 2020 with the Spotify API

Recently I revisited my first musical data project and gave it a total makeover – inside and out! Before I knew how easy it is to access musical data using the Spotify API, I had found a dataset on Kaggle that someone had made from Spotify’s API and began to play around with it. The dataset is the top 50 tracks on Spotify from each year 2010 – 2019. In the end I got some very surprising results.


Get in touch at:       mr.sam.tritto@gmail.com