Recent Projects
2024 - 2025
Evaluating Multimodal LLMs that can... Think? with Gemini 2.0 Flash Thinking, the MATH Vision Dataset (HuggingFace), & LLM as a Judge
Here I'll evaluate the new Gemini 2.0 Flash Thinking model against the MATH Vision dataset using the LLM as a Judge technique. I'll show that this new model has impressive state-of-the-art accuracy and just might be the best scoring model available right now.
The Causal Effect of the Maui Wildfires on the Unemployment Rate with CausalImpact (python) and Dynamic Time Warping (tslearn)
Here's a short tutorial on evaluating the estimated causal effect of the recent Maui wildfires on the local unemployment rate using data from the Bureau of Labor Statistics (BLS). Rather than measure correlation or association, here I'll dive into causality using the python implementation of the CausalImpact library and use the Dynamic Time Warping algorithm from tslearn in an attempt to choose the most similar counties to measure against as controls (a form of Market Matching).
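As a rough sketch of the matching step: tslearn's dtw computes this distance on numpy arrays (with a squared-Euclidean cost); here's the same dynamic-programming idea in plain Python with an absolute-difference cost, on made-up series.

```python
def dtw_distance(s, t):
    """Classic dynamic-programming DTW distance (absolute-difference cost).

    A simplified stand-in for tslearn.metrics.dtw: the smaller the
    distance, the better a candidate control series tracks the target.
    """
    n, m = len(s), len(t)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # extend the cheapest warping path into cell (i, j)
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# hypothetical monthly unemployment series: rank candidate control
# counties by DTW distance to Maui and keep the closest ones
maui = [3.1, 3.0, 2.9, 3.2, 3.4]
candidates = {"county_a": [3.0, 3.1, 2.8, 3.1, 3.5],
              "county_b": [5.9, 6.1, 6.4, 6.0, 6.2]}
ranked = sorted(candidates, key=lambda c: dtw_distance(maui, candidates[c]))
```
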
Talk to Your BigQuery Data with GCP (VertexAI PaLM 2 LLM) and LangChain
Here's a short tutorial on how you can set up an LLM on GCP to talk to your BigQuery data through VertexAI using the PaLM 2 LLM... otherwise known as Table Q&A. I'll store a sample HR Attrition dataset from IBM in BigQuery and then set up an LLM in order for us to chat with the data. We'll be able to ask it simple questions and validate its answers, all in only a few lines of code.
More Projects
2024
MLOps on GCP: Upcoming Local Shows Playlist (DataOps)
This will be Part 1 of a tutorial on how to create a simple Flask web app, which will ultimately help a user create a playlist on their Spotify account containing the most popular songs from artists that will be playing in their area in the upcoming months. Part 1 will set up a simple ETL data process through GCP focusing on pulling data from the APIs of both Spotify and SeatGeek, combining the data, and then uploading/automating the process through GCP using App Engine, Cloud Scheduler, Cloud Storage, and Secret Manager.
Student's Paired t-Test (Dependent / Repeated Measures) for Means (A/B Testing) with SciPy, statsmodels, and pingouin
An implementation of Student's Paired t-Test for Means from end to end. This test is the appropriate test for comparing the means of one group sampled twice (once before and once after an intervention) with small-ish to large sample sizes in an A/B Testing scenario.
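As a minimal sketch of the SciPy version (the numbers below are made-up before/after measurements for the same six subjects):

```python
from scipy import stats

# hypothetical metric for the same six subjects, before and after an intervention
before = [12.1, 10.3, 11.8, 9.9, 10.5, 11.2]
after = [12.9, 10.8, 12.4, 10.1, 11.3, 11.9]

# paired (dependent) t-test: tests whether the mean within-subject
# difference is zero
t_stat, p_val = stats.ttest_rel(after, before)
```

statsmodels and pingouin expose the same test with richer output (confidence intervals, effect sizes), but the inputs are identical: two equal-length, paired samples.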
Student's Unpaired t-Test (Independent / Two-Sample) for Means (A/B Testing) with SciPy, statsmodels, and pingouin
An implementation of Student's Unpaired t-Test for Means from end to end. This test is the appropriate test for comparing the means between 2 independent but similar groups with small-ish to large sample sizes in an A/B Testing scenario.
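The SciPy call for this one differs only in that the two groups are independent and need not be the same size (made-up numbers again):

```python
from scipy import stats

# two independent groups measured on the same hypothetical metric
group_a = [23.1, 21.8, 24.5, 22.0, 23.7, 22.9]
group_b = [25.2, 24.8, 26.1, 25.5, 24.9, 26.0]

# Student's two-sample t-test; equal_var=True is the classic Student form
# (equal_var=False would give Welch's test, which drops the
# equal-variance assumption)
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=True)
```
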
The Z-Test for Proportions (A/B Testing) with SciPy and statsmodels
An implementation of the Z-Test for Proportions from end to end. This test is the appropriate test for comparing the proportion of binary data between 2 independent groups with different and large sample sizes in an A/B Testing scenario.
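statsmodels wraps this test as `proportions_ztest`; the pooled two-sided version it computes is short enough to write out in plain Python (the conversion counts below are hypothetical):

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-sided z-test for the difference of two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    # pool the successes under the null hypothesis of equal proportions
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# hypothetical A/B test: 200/1000 conversions vs 150/1000
z, p = two_proportion_ztest(200, 1000, 150, 1000)
```
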
Fisher's Exact Test (and Barnard's and Boschloo's) for Proportions (A/B Testing) with SciPy
An implementation of Fisher's Exact Test for Proportions from end to end. This test is the appropriate test for comparing the proportion of categorical data between 2 independent groups with small sample sizes in an A/B Testing scenario. I'll also go over Barnard's and Boschloo's Exact tests which are both considered improvements to Fisher's test.
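A minimal SciPy sketch, on a made-up 2x2 table; SciPy also ships `barnard_exact` and `boschloo_exact` for the improved variants discussed in the post:

```python
from scipy.stats import fisher_exact

# hypothetical small-sample 2x2 contingency table:
# rows = variant A / variant B, columns = converted / did not convert
table = [[8, 2],
         [1, 5]]

# exact test conditioned on the table margins; returns the sample
# odds ratio and the two-sided p-value
odds_ratio, p_val = fisher_exact(table)
```
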
The Binomial Test for Proportions (A/B Testing) with SciPy, statsmodels, and pingouin
An implementation of the Binomial Test for Proportions from end to end. This test is the appropriate test for comparing the proportion of binary data between 2 independent groups with different sample sizes in an A/B Testing scenario.
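In SciPy this is a one-liner via `binomtest` (an extreme made-up example, 0 successes in 10 trials against a fair-coin null):

```python
from scipy.stats import binomtest

# exact binomial test: is 0 successes out of 10 trials consistent
# with a true success proportion of 0.5?
result = binomtest(k=0, n=10, p=0.5, alternative="two-sided")
p_val = result.pvalue
```

statsmodels and pingouin offer equivalent functions; the exact p-value here is just the binomial tail probability, no normal approximation involved.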
4 Bayesian Regressions with Linear, Binomial Logistic, Poisson, and Exponential
Here's a walkthrough of 4 different flavors of Bayesian regression with inference, each around a separate case study or scenario using synthetic data. This might be interesting for someone who is familiar with the concept of regression and has always wondered what the fuss is with Bayesian statistics. You'll see that while it might require the use of pymc, a library for Bayesian computation, the structure is very similar to the Frequentist approach. You might even find that inference with Bayesian statistics is more flexible and more insightful.
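The notebooks use pymc for sampling, but the core idea is visible in closed form for the simplest case: a no-intercept linear model with known noise variance and a Normal prior on the slope, where the posterior is available without any sampler at all (all numbers below are made up):

```python
import math

# synthetic data: y = beta * x + noise, with beta roughly 2
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]
sigma = 0.5            # known noise standard deviation
mu0, tau0 = 0.0, 10.0  # weak Normal(mu0, tau0) prior on the slope

# conjugate Normal-Normal update: precision adds, means combine
# weighted by precision
prec_post = 1 / tau0**2 + sum(xi**2 for xi in x) / sigma**2
mu_post = (mu0 / tau0**2
           + sum(xi * yi for xi, yi in zip(x, y)) / sigma**2) / prec_post
sd_post = math.sqrt(1 / prec_post)
```

With a weak prior the posterior mean lands essentially on the least-squares slope, which is exactly the "very similar to the Frequentist approach" point; the payoff is that `sd_post` gives you a full distribution over the slope rather than a single estimate.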
2023
Bayesian Hierarchical A/B Testing with pyMC3
For this product Data Science project I’ll explore the use of Bayesian Inference in A/B testing using the PyMC3 library. Using synthetic data, the idea behind the project will be to test 4 new playlist algorithms against the current algorithm. The metrics will focus on user interaction during the first selected song: the skip rate and the average time it took a user to skip the song.
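For the skip-rate metric specifically, the Beta-Binomial case is conjugate, so a sketch of the comparison needs no sampler library at all; here it is in plain Python with hypothetical counts (the project's PyMC3 models generalize this to the hierarchical, multi-variant setting):

```python
import random
random.seed(42)  # reproducible demo

# hypothetical skips out of sessions; lower skip rate is better
skips_a, n_a = 420, 1000  # current algorithm
skips_b, n_b = 380, 1000  # candidate algorithm

# Beta(1, 1) priors give Beta posteriors; draw from both and count
# how often the candidate's skip rate comes out lower
draws = 20000
wins = sum(
    random.betavariate(1 + skips_b, 1 + n_b - skips_b)
    < random.betavariate(1 + skips_a, 1 + n_a - skips_a)
    for _ in range(draws)
)
prob_b_better = wins / draws  # P(candidate skip rate < current)
```
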
Hawai'i Median Home Price Forecasting with prophet
In this project I'll attempt to forecast Hawai'i Median Home Prices with the prophet library, and explore some intermediate features while doing so. I'll take a look at seasonality, changepoints, growth modes, anomaly omission, and prior scales in order to find a plausibly accurate forecast for home prices. And while this typically would be fairly straightforward, we'll see that the pandemic has given us some volatility that needs to be accounted for in order to find a nice fitting model.
Mauna Loa Forecasting CO2 Emissions with prophet
This will be a very short project where I'll forecast CO2 emissions recorded on top of Mauna Loa on the Big Island of Hawai'i using the prophet library in python. While it won't be the most complex trend, I mostly wanted to forecast this data having lived on the Big Island for a handful of years and even walked right up to the lava flow a few times... I couldn't pass up the opportunity. Plus prophet is just so easy to set up and use, and even tuning it is fairly straightforward. I'll probably circle back and do a more complicated forecast later on, but for now let's holo holo!
2022
The Best Reason to Practice Bayesian Statistics that You Haven't Heard of
This post is dedicated to Bayesian Statistics, in particular how it gracefully handles summarizing parameters for multimodal data using the Highest Density Interval or HDI (also referred to as the Highest Posterior Density or HPD). Here I'll show how it's possible to generate two or more credible intervals for your parameters.
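As a minimal sketch of the single-interval version (the post's point is that for multimodal posteriors the true HDI can split into several disjoint intervals, which needs a density-based computation; this sliding-window version returns just the shortest single interval from posterior samples):

```python
def hdi(samples, mass=0.95):
    """Shortest single interval containing `mass` of the samples.

    Unlike an equal-tailed credible interval, the HDI is the narrowest
    region with the requested probability mass.
    """
    s = sorted(samples)
    n = len(s)
    k = max(1, round(mass * n))  # samples the interval must cover
    # slide a window of k consecutive sorted samples; keep the narrowest
    width, i = min((s[i + k - 1] - s[i], i) for i in range(n - k + 1))
    return s[i], s[i + k - 1]
```

For a skewed or multimodal sample this interval hugs the densest region instead of the distribution's middle, which is exactly why it can be more honest than a symmetric summary.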
Upcoming Local Shows Playlist with the Spotify API
Here I scrape the bandsintown.com website for local upcoming shows, then connect that data to the Spotify API through fuzzy string matching on the artist name in order to generate a Spotify playlist of their 3 most popular songs. Now we can quickly and easily explore the music of artists that will be playing live in our area soon, and decide if we'd like to go out and catch a show.
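The matching step can be sketched with the standard library alone; `difflib`'s similarity ratio is one simple way to score a scraped artist name against Spotify search results (the names and helper below are illustrative, not the project's exact code):

```python
from difflib import SequenceMatcher

def best_spotify_match(scraped_name, candidates):
    """Pick the candidate artist name most similar to the scraped one.

    Case-insensitive similarity via difflib's ratio (0.0 to 1.0),
    a stdlib stand-in for heavier fuzzy-matching libraries.
    """
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, scraped_name.lower(), c.lower()).ratio(),
    )

# hypothetical scraped name vs. names returned by a Spotify search
match = best_spotify_match("the beatles", ["Beach House", "The Beatles", "Beirut"])
```
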
Deep Learning in Tableau Using a Keras Neural Network with TabPy
Lately I’ve been experimenting with utilizing the Analytics Extensions in Tableau Desktop. I haven’t quite seen anyone incorporate any Keras Deep Learning models yet, so I thought it would be a good challenge to explore the possibilities. Here I used a data set containing flight data for a handful of airports to try to predict whether or not a future flight will be delayed with a Keras Deep Learning model. Then I deployed the model to Tableau Desktop using the TabPy package.
Machine Learning in Tableau Using R and Dynamic K-Means Clustering
Here is a simple tutorial for using the R statistical language in Tableau for more advanced ML features including Clustering. Although Tableau has recently introduced some Clustering functionality, I wanted to explore connecting my Tableau workbook with the R statistical language for a more nuanced and tunable approach.
2020
One Million Podcasts Analyzed with the Spotify API
This is a follow up project to my last article where I showed how to access the Spotify API specifically for podcast data. Previously I showed how you can amass a dataset of podcast information related to a specific search term such as “data science”. Here I will show how you can join several of these datasets together, thus enabling you to look for trends among a very large dataset composed of several search topics.
Creating a Willie Nelson Inspired Song with textgenrnn and spaCy
For this project I wanted to try my hand at some text generation. And as a Willie Nelson fan, I was hoping to use it to write a brand new Willie Nelson song based on all of his song lyrics. The lyrics were obtained easily through the Genius API and the text generation was performed with textgenrnn and spaCy.
2019
The Big Takeover Band: 12+ Years of Real Tour Data
I’ve been playing around a lot lately with the Spotify API, but I was wondering what other kind of musical data is out there. Was there any live show data I could play around with? So I reached out to my former bandmates in The Big Takeover and asked them if they had anything. Turns out the Excel spreadsheet I had started 12+ years ago was still going strong, full of years worth of shows, pay, attendance, etc… And boy was it a mess (almost as messy as the band van)! For this project I wanted to see what kind of value I could create for them – if I could clean it up well enough first!
How to Write a Bestselling Book in 2020 with the New York Times API
For this article I will show you how to obtain and create a dataset from the New York Times API, containing books from one of their many weekly “Bestseller’s” lists from over a decade’s worth of publications. In this dataset you will find descriptions of all of the books that made it onto their bestsellers lists, as well as the titles and authors’ names. After some human data analysis I will then use Machine Learning techniques (Markov Chain Ensemble) for text generation to create a description of what we think a new bestselling book would be in 2020.
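The ensemble combines several chains; a single word-level Markov chain, which is the building block, fits in a few lines of plain Python (the toy corpus below is made up for illustration):

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed to follow it."""
    chain = defaultdict(list)
    words = text.split()
    for w1, w2 in zip(words, words[1:]):
        chain[w1].append(w2)
    return chain

def generate(chain, start, length=8):
    """Walk the chain from `start`, sampling each next word."""
    random.seed(0)  # reproducible for the demo
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: no word ever followed this one
        out.append(random.choice(followers))
    return " ".join(out)

# toy "book description" corpus
corpus = "a gripping tale of a gripping mystery"
chain = build_chain(corpus)
text = generate(chain, "a")
```

Because transitions are sampled in proportion to how often they occurred, a chain trained on thousands of real bestseller blurbs produces locally plausible (if globally nonsensical) new descriptions.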
How to Write a Hit Song in 2020 with the Spotify API
Recently I revisited my first musical data project and gave it a total makeover – inside and out! Before I knew how easy it was to access musical data using the Spotify API, I had found a dataset on Kaggle that someone had made from Spotify’s API and began to play around with it. The dataset is the top 50 tracks on Spotify from each year 2010 – 2019. In the end I got some very surprising results.
Get in touch at: mr.sam.tritto@gmail.com