GitHub Repo:
You can find this notebook in my GitHub Repo here:
https://github.com/sam-tritto/spotify-hit-song
OUTLINE
Turning up the Insights of a Hit Song with LightGBM, Optuna, SHAP, and Scikit-Learn
Import Libraries
The Universal Top Spotify Songs Dataset
EDA
Feature Engineering
Split the Data - Train, Validation, and Test Sets
Manually Tuned Model
Optuna Bayesian Hyperparameter Tuned Model
Final Model
SHAP Explainability
The 1-NN Algorithm... What Does a Hit Song Sound Like?
For this tutorial I'm going to circle back on my very first Data Science project, where we'll embark on a data-driven journey to uncover the intrinsic characteristics that define a "hit song" using a rich dataset from Spotify. This exploration will leverage a powerful trio of machine learning techniques: LightGBM for robust predictive modeling, Optuna for efficient hyperparameter optimization, and SHAP (SHapley Additive exPlanations) for insightful model interpretability.
In this guide, we'll delve into the nuances of building an effective predictive model for song success. We'll specifically address key considerations often encountered in real-world machine learning applications. One such consideration involves the inherent limitations of tree-based models like LightGBM, particularly their inability to extrapolate beyond the range of the training data. This means that while they excel at interpolating within learned patterns, predicting scores significantly outside the 0-100 target range observed in our Spotify data would require careful handling and potentially alternative approaches or specialized data transformations.
Furthermore, to combat this issue, we'll discuss practical strategies for transforming our target variable, which represents song popularity on a scale of 0-100. For scores clustering near these bounds, a simple linear regression might struggle. We'll explore how appropriate transformations can help our LightGBM model better capture the underlying relationships and produce more accurate predictions for songs at both ends of the popularity spectrum.
Beyond prediction, this tutorial will emphasize understanding. Once LightGBM provides predictions and SHAP illuminates the feature contributions to a song's success, we can extend our analysis using K-Nearest Neighbors (KNN) in two distinct ways. Firstly, to cut through some of the noise in the data, KNN can be invaluable for identifying the closest matching "control countries" based on various musical features. This allows us to compare how song characteristics perform in similar markets, providing a more robust understanding of generalizability or specific cultural influences on hit potential.
Secondly, and perhaps more directly applicable to our core goal, KNN will be employed to find the closest matching exemplar songs from our dataset once the characteristics of a hit song have been identified using SHAP. Imagine we've determined that a "hit" tends to have a high 'danceability' and a moderate 'energy' — KNN can then search our entire Spotify dataset to retrieve real-world songs that exhibit these precise traits most closely. This provides tangible examples, turning abstract feature importance into concrete, listenable instances that embody the identified success factors.
Finally we will see that while residual analysis for tree-based models like LightGBM isn't as critical for assumptions as it is for Ordinary Least Squares (OLS) regression, it can still serve as a valuable diagnostic tool. We'll touch upon how examining the residuals can provide insights into areas where our model might be underperforming or reveal systematic biases, thereby guiding further model refinement and feature engineering. By combining these advanced techniques and addressing these critical considerations, we aim to build an interpretable and powerful model that sheds light on what truly makes a song resonate with listeners.
Import Libraries
The libraries I'll be using here are my go-to for all things regression and classification. They are a flexible power trio that allows for explainability and sharp insights. LightGBM has both a Regressor and a Classifier. Even though this is a regression project, once I'm done fitting the regression model I'll use the KNeighborsClassifier to identify control countries and then again to identify an exemplar song from the existing dataset.
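For reference, here's a minimal sketch of the imports this stack implies; your notebook may pull in a few extra helpers, so treat this as a starting point rather than the exact list.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import lightgbm as lgb   # LGBMRegressor for the main model, plus plotting helpers
import optuna            # Bayesian hyperparameter tuning
import shap              # model explainability

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier  # used later for control countries and the 1-NN exemplar
from sklearn.metrics import mean_squared_error
```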
The Universal Top Spotify Songs Dataset
Typically I try to avoid toy datasets, but I wanted to circle back on my first Data Science project, and pulling the data from the API or the spotipy library would just take too long. If you're interested in that, I have some other project tutorials on my page as well. This dataset I found on Kaggle, and it's very clean and updated daily. There's minimal missing data and only a handful of categorical columns; the rest are the typical musical features available from the Spotipy API.
You can find more information and the data set here on Kaggle:
https://www.kaggle.com/datasets/asaniczka/top-spotify-songs-in-73-countries-daily-updated
The first thing to note is that the numerical data is not all ordinal. Or, at least I'm not sure that it is. To be safe I'll treat the time signature and key as categorical features. There are other categorical features as well, but I'll do some more interesting things with these later during Feature Engineering rather than model them in as raw features.
EDA
The most valuable thing for EDA is always to look at the underlying distribution of the data. The skewness and modality can inform your modeling strategy later on. The klib package has some nice distribution plots with summary stats.
If anything looks off, I like to plot them again with bins and matplotlib. For the target metric above, popularity, I can see a small hump near 0. It seems odd that a dataset of specifically popular songs would have 0 as a popularity score, so I'll need to look closer. Basically, if I see any outliers I'll look into the tails for bad data. Notice also that our target values stop promptly at 100. We will see later that the model might have trouble fitting when the data is clustered around the bounds.
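Here's roughly what those two passes look like, assuming the dataframe is called df and the target column is popularity; klib's dist_plot does the heavy lifting for the summary-stat plots.

```python
import klib
import matplotlib.pyplot as plt

# Distribution plots with summary stats for every numeric column
klib.dist_plot(df)

# Re-plot the target with explicit bins to inspect the hump near 0 and the hard stop at 100
plt.hist(df["popularity"], bins=50)
plt.xlabel("popularity")
plt.ylabel("count")
plt.title("Distribution of popularity")
plt.show()
```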
A safe bet is to just drop them for modeling, especially since we have enough data already and we are interested in the other end of the popularity scale anyway, the scores near 100.
The only other EDA I'll do here is check for missing values, nothing fancy, just print out the counts per feature. I'm not going to use album_name or album_release_date, so I won't worry about those, but I'm also noticing that country has roughly as many nulls as any single country has rows. This would be considered MNAR, or "Missing Not at Random".
I'm going to assume these are null because there is no country code for the API pull that covers all countries, i.e. international songs. Therefore I'll impute these with "INTERNATIONAL".
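A small cleaning sketch covering both decisions; the exact cutoff for the near-zero hump is a judgment call from the histogram, so treat the 5 below as illustrative.

```python
# Drop the small hump of near-zero popularity scores (cutoff chosen by eye from the histogram)
df = df[df["popularity"] > 5].copy()

# Null country codes come from the global, all-countries API pull, so label them explicitly
df["country"] = df["country"].fillna("INTERNATIONAL")
```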
Feature Engineering
Target Logit Transformation
Like I mentioned in the intro and as we just saw above, the popularity values clustered up against the bounds will make it hard for our model to learn the patterns of these high popularity scores. This isn't great since the whole point of this project is to fit a model that learns high popularity scores well. Tree-based models cannot extrapolate beyond the training data like OLS can.
One transformation that works particularly well for moving your target variable to an unbounded scale is the logit transformation. Logit transforms data from (0, 1) to (-inf, inf), so first we will have to transform our data to (0, 1). I'm going to add a small epsilon correction so things don't get weird around 0.
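A minimal sketch of that transform, with the epsilon value chosen arbitrarily for illustration:

```python
import numpy as np

EPS = 1e-3  # small correction so exact 0s and 100s don't map to +/- infinity

# popularity lives on [0, 100] -> squeeze to (0, 1), then logit to (-inf, inf)
p = (df["popularity"] / 100.0).clip(EPS, 1 - EPS)
df["popularity_logit"] = np.log(p / (1 - p))
```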
Looking at the data we can see the transformation at work.
The best reason I can give for this is to take a peek at the residual analysis that I'll perform after modeling. We will take a closer look at this later but it is a useful tool that I look to throughout the Feature Engineering phase to inform my modeling. Even though residuals are typically used to validate the assumptions of OLS regression, they can still be a useful tool even with tree-based regressors which are robust to most of those assumptions.
Residuals Before Transformation
If you look at the left-most Observed vs Fitted chart, you can see that in the upper right quadrant, at the highest popularity scores, the values are hitting a wall. The tree-based model cannot extrapolate (unlike OLS), so the Fitted and Observed values stop at 100.
Residuals After Transformation
You can see here that after the transformation the residuals no longer hit this wall at 100. They look much more random and continuous. This should help our model learn better for these higher scores, which are exactly the scores we would like to model. You might also notice that both before and after the transformation we still suffer from heteroscedasticity and some non-Normality, but like I mentioned, tree-based models are robust to the typical assumptions for OLS.
Interaction Terms
Next, after model fitting, when I ran the SHAP analysis I noticed in the dependence plots that the loudness feature had some interaction with the energy feature. Notice in the chart below how the loudness values which have a positive SHAP value also tend to have high energy (in red). We can code that knowledge into its own separate feature simply by creating an interaction term.
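The feature itself is a one-liner; loudness_x_energy is the name I'll refer to later.

```python
# Encode the interaction the dependence plot hinted at: loud *and* energetic songs
df["loudness_x_energy"] = df["loudness"] * df["energy"]
```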
Data Aggregation
There are a few different ways we can manipulate or aggregate this data, and each one of those ways could model a unique and interesting hypothesis. That is exactly what we are going to do next. The data starts off at the daily level, which not only is large and thus harder to work with, but also contains many duplicate song tracks. What I decided to do was create a year attribute from the date as a temporal feature, taking the maximum year a song appeared on these popular lists.
Other features we can create are the total number of days a song was on one of these popular lists and the total number of countries the song was popular in. There are many other options, but these should give us a good model fit. Grouping by the unique identifiers and categorical columns will get us to the right level of data, and then we can use a dictionary to aggregate each metric differently. Most numerical metrics will get the mean, which, since they are constant per track, just returns their original unaggregated values. Importantly, we also take the max popularity score, which is in line with our hypothesis and should result in a much stronger relationship than the average.
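Here's a sketch of that aggregation; the column names come from the Kaggle dataset, but treat the exact grouping keys and metric list as illustrative rather than the notebook's literal code.

```python
# Year a song appeared on a chart (temporal feature)
df["year"] = pd.to_datetime(df["snapshot_date"]).dt.year

agg_dict = {
    "popularity_logit": "max",   # max popularity, in line with our hypothesis
    "year": "max",               # latest year the song appeared on a popular list
    "snapshot_date": "nunique",  # total days on a popular list
    "country": "nunique",        # total countries the song charted in
    # musical features are constant per track, so the mean just returns their value
    "danceability": "mean", "energy": "mean", "loudness": "mean",
    "speechiness": "mean", "acousticness": "mean", "instrumentalness": "mean",
    "liveness": "mean", "valence": "mean", "tempo": "mean",
    "loudness_x_energy": "mean",
}

songs = (
    df.groupby(["spotify_id", "name", "artists", "key", "time_signature"], as_index=False)
      .agg(agg_dict)
      .rename(columns={"snapshot_date": "days_on_chart", "country": "n_countries"})
)
```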
Outliers
Now here's an interesting one. Since we are using a tree-based model we technically shouldn't have to worry about outliers, since these models are robust to them. However, we can still create outlier flags and use them as features in our model. Here I'll be using the modified z-score, which is a median-based analogue of the z-score. It works better for moderately skewed data, and after looking at the distribution plots I can see many of the metrics are moderately skewed, so it's an appropriate choice here. The median_abs_deviation() function from SciPy handles the intricacies of scaling the value so that it behaves like a z-score for Normal data.
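A sketch of the flagging logic, using the common 3.5 threshold for the modified z-score and assuming songs is the aggregated frame from above:

```python
from scipy.stats import median_abs_deviation

def outlier_flag(series, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold (3.5 is the usual rule of thumb)."""
    med = series.median()
    mad = median_abs_deviation(series, scale="normal")  # scaled so it behaves like a std dev for Normal data
    modified_z = (series - med) / mad
    return (modified_z.abs() > threshold).astype(int)

# One flag column per metric of interest; we'll only keep the interaction term's flag later
songs["loudness_x_energy_outlier"] = outlier_flag(songs["loudness_x_energy"])
```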
After looking at the counts of outlier flags per metric, I'll opt to use just the outlier flag from our interaction term, since only a small percentage of its values are flagged.
VIF
Another step we probably don't have to take here is calculating the VIF and removing features. Typically we would use this to identify features with a problematic level of multicollinearity and remove them so that our data can pass the assumptions of an OLS linear regression model. We are running a tree-based model specifically for predictions, so we are robust to multicollinearity. That being said, I'll still use it to see if any features are redundant. A simpler model is always the goal.
The function below utilizes the variance_inflation_factor() function from statsmodels to handle all the dirty work; just don't forget to add a constant with the add_constant() function. I'm using a threshold of 5 here, which is pretty standard.
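Something like this helper, where numeric_features is a hypothetical list of the numeric model columns:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def vif_table(X):
    """Return a VIF per feature; the constant must be added first or the scores are misleading."""
    Xc = add_constant(X)
    return pd.DataFrame({
        "feature": Xc.columns,
        "VIF": [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
    }).sort_values("VIF", ascending=False)

vif = vif_table(songs[numeric_features])  # numeric_features: hypothetical list of numeric columns
print(vif[vif["VIF"] > 5])                # 5 is the standard threshold used here
```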
And we find one feature that we can probably remove from our model, loudness. This makes sense as we used it to create our interaction term.
Split the Data - Train, Validation, and Test Sets
Since we are using tree-based models, which are easy to overfit, we are going to need a validation set in addition to our test set so that we can utilize the model's early stopping abilities. First I split the data 80/20 into train and val/test, then further split the val/test 50/50. This results in an 80/10/10 split.
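A sketch of the two-step split, with feature_cols standing in for whatever feature list you land on after the VIF step:

```python
from sklearn.model_selection import train_test_split

X = songs[feature_cols]           # feature_cols: hypothetical list of model features
y = songs["popularity_logit"]     # the logit-transformed target

# 80% train, then split the remaining 20% in half -> 80/10/10
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.20, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)
```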
Since the data is temporal, we have to be careful here. If we were interested in predicting 2025 with 2024 and 2023 data and we used this split method we would be introducing leakage since many of the popular songs are in multiple years. I've structured this model in a way where we keep one unique song for all years so we don't have to worry about that. Rather than trying to predict songs in the next year, we are modeling with a pool of popular songs from the past 3 years and trying to understand which features are relevant.
Manually Tuned Model
Now we can build the regression model. I like to start with a manually tuned model. This lets me get a feel for how the model fits the data, and also allows me to build expert knowledge into my parameters. I'll show next how, once you find parameters that work well for your data, you can further tune them with Optuna, using the ones you've found manually as a starting place.
Categorical Data
It was only a few years ago that categorical data meant one-hot encoding and dummy variables. Now LightGBM can handle categorical data by default. There is one catch though: before we pass the data to the model we need to make sure it is a categorical data type and not string or "object".
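The cast is straightforward; key and time_signature are the columns I'm treating as categorical here.

```python
# LightGBM handles categoricals natively, but only if the dtype is actually "category"
cat_cols = ["key", "time_signature"]
for frame in (X_train, X_val, X_test):
    frame[cat_cols] = frame[cat_cols].astype("category")
```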
Overfitting & Early Stopping
Like I mentioned earlier, boosted tree models are prone to overfitting (Random Forests much less so). The idea here is to set a large number of estimators and then allow the model's early stopping capabilities to halt training once it hasn't learned anything for a certain number of iterations, using the validation set to evaluate itself. You can see below that even though I've set the number of estimators to 5000, early stopping was triggered and the best iteration was at 62. Good thing we didn't keep going to 5000. I've set the early stopping parameter to 10, meaning that if the model has not improved its validation metric in 10 iterations it will stop early.
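A sketch of the manually tuned fit; the learning rate and leaf count are illustrative starting values, while the 5000 estimators and 10 early stopping rounds match what's described above.

```python
import lightgbm as lgb

model = lgb.LGBMRegressor(
    n_estimators=5000,     # intentionally large; early stopping decides the real number
    learning_rate=0.05,    # illustrative manual choices
    num_leaves=31,
    random_state=42,
)

model.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_val, y_val)],
    eval_metric=["rmse", "l1"],
    callbacks=[lgb.early_stopping(stopping_rounds=10), lgb.log_evaluation(50)],
)

print("Best iteration:", model.best_iteration_)
```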
Inverse Transform
Now, since we logit transformed our target variable earlier, our predictions will be on that same scale. This is actually fine, but if we want to keep the interpretability of our data then we should transform both back to their original scale. We just need to reverse all of the operations we transformed with.
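Reversing the transform is just the sigmoid followed by rescaling back to 0-100, consistent with the forward transform sketched earlier:

```python
import numpy as np

def inverse_logit_to_popularity(y_logit):
    """Sigmoid back to (0, 1), then rescale to the 0-100 popularity range."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(y_logit)))
    return p * 100.0

val_preds = inverse_logit_to_popularity(model.predict(X_val))
```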
Loss Curves
As the model trains it keeps a record of its evaluation scores for both the training and validation sets. We should monitor these to look for overfitting and convergence. If the train set keeps improving and the validation set does not, then it's overfitting. There are 2 curves, one for each evaluation metric. Notice the number of boosting rounds is about 72. That is 10 more than the best iteration of 62, or 62 rounds + 10 early stopping rounds = 72 iterations. If we hadn't set the early stopping parameter this would have continued up to the n_estimators parameter of 5000, even though it had stopped learning at round 62.
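LightGBM keeps that history on the fitted model, so plotting it is a couple of lines; the metric names here match whatever was passed to eval_metric.

```python
import lightgbm as lgb

# One figure per evaluation metric, each showing the train and validation curves
for metric in ["rmse", "l1"]:
    lgb.plot_metric(model, metric=metric)
```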
Residual Analysis
Now it's time to take a look at the residuals, or errors. Like I've mentioned already, this is something we need to be critical about when modeling with OLS since there are a handful of assumptions that need to be met; with a tree-based model, however, we can relax a little since it is robust to these assumptions. Specifically, when modeling with OLS and the assumptions are not met, the consequences are that your standard errors of the coefficients are likely biased (usually underestimated), your p-values will be incorrect, making you more likely to declare a feature statistically significant when it's not (Type I error) or vice-versa, and your confidence intervals for predictions will be too narrow in some areas and too wide in others, leading to misleading assessments of prediction uncertainty. None of that matters to us for this use case.
In the second Residuals vs Fitted plot you can see a clear typical cone shaped pattern which is called heteroscedasticity. When the cone shape is wider at lower predicted values and narrower at higher predicted values (meaning the residuals get smaller as the predicted value increases), this implies that your model's predictions are more accurate for higher target values and less accurate (or have more variability) for lower target values. In Ordinary Least Squares (OLS) regression, homoscedasticity is a key assumption for the BLUE (Best Linear Unbiased Estimator) property. If your errors are heteroscedastic, OLS estimators are still unbiased and consistent, but they are no longer "Best" (i.e., they are not the most efficient). This means there are other estimators (like Weighted Least Squares) that could give you more precise (lower variance) coefficient estimates.
For this project however we are perfectly fine with some heteroscedasticity especially if the accuracy favors the area of the target we are trying to predict, the high popularity songs. Rather than use these residual plots to validate assumptions for OLS, when modeling with tree-based models we can still use these plots to inform our feature engineering, parameter tuning, and modeling choices.
Optuna Bayesian Hyperparameter Tuned Model
Now that we've got our model set and we're pretty happy with the hyperparameters, we can tune them further, using the ones we've found as a starting place. We will start by defining the objective function, where we specify the type and range of each search hyperparameter. We don't have to tune every hyperparameter; we can, but it just takes longer. We can use some of the manually chosen values, and limit the search space for each hyperparameter, to save some time.
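Here's a sketch of such an objective; the search ranges are illustrative, and RMSE on the validation set is the value Optuna minimizes.

```python
import lightgbm as lgb
import optuna
from sklearn.metrics import mean_squared_error

def objective(trial):
    params = {
        "n_estimators": 5000,  # keep large; early stopping does the work
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "boosting_type": trial.suggest_categorical("boosting_type", ["gbdt", "goss"]),
        "random_state": 42,
    }
    model = lgb.LGBMRegressor(**params)
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        eval_metric="rmse",
        callbacks=[lgb.early_stopping(10, verbose=False)],
    )
    preds = model.predict(X_val)
    return mean_squared_error(y_val, preds) ** 0.5  # validation RMSE
```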
This isn't a widely used function, but one worth using for sure. If I've seen 10 Optuna tutorials, none of them have covered the enqueue_trial() function... tell your friends!
Prior Knowledge
Since Optuna is a Bayesian model under the hood, that means it has priors. And since it has priors, it allows us to build some prior or expert knowledge about each hyperparameter into the search. There is a method in Optuna called enqueue_trial() that I don't often see being used, but it can cut down your search time drastically. It simply allows us to pass in the hyperparameter dictionary as a starting place, which saves many iterations of search time.
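In practice that looks like this, with the enqueued dictionary being the manually tuned values from earlier (the numbers here are placeholders):

```python
import optuna

study = optuna.create_study(direction="minimize")

# Seed the search with the manually tuned parameters as a prior starting point
study.enqueue_trial({
    "learning_rate": 0.05,
    "num_leaves": 31,
    "min_child_samples": 20,
    "colsample_bytree": 1.0,
    "boosting_type": "gbdt",
})

study.optimize(objective, n_trials=50)
print("Best validation RMSE:", study.best_value)
```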
You can see in the Optimization Plot History that our expertly picked hyperparameters were a little worse than the best Optuna was able to find. Not a huge reduction in RMSE, but an improvement for sure.
The boosting_type hyperparameter was the one that needed the most tuning and affected our model the most.
In the parallel plot we can see that the gradient boosted tree (gbdt) type was a better fit than goss. Here we can trace all of the iterations and see which choices were most common.
Once we have run Optuna we need to extract the best hyperparameters into a dictionary for further use.
Final Model
Now that our hyperparameters are tuned, we can test our final model on the holdout set. First we join the train and validation sets back together since we won't be doing any more evaluations. Finally, we fit and predict.
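Roughly, that step looks like this; since there's no validation set left for early stopping, the n_estimators below is an assumed fixed cap rather than the notebook's exact value.

```python
import pandas as pd
import lightgbm as lgb

best_params = study.best_params  # the tuned hyperparameters as a dictionary

# Join train and validation back together -- no more early stopping, so fix n_estimators
X_train_full = pd.concat([X_train, X_val])
y_train_full = pd.concat([y_train, y_val])

final_model = lgb.LGBMRegressor(**best_params, n_estimators=500, random_state=42)
final_model.fit(X_train_full, y_train_full)

test_preds_logit = final_model.predict(X_test)
```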
Inverse Transform
Similar to how we did earlier, we should put the predictions back on their original scale for interpretation purposes.
Residual Analysis
One more time we look at our errors and our final RMSE from the holdout set.
SHAP Explainability
Now that our model is tuned and we've made some predictions, we can focus on something other than accuracy metrics. There are so many insights to be had beyond accuracy. In this section I'll go over how to utilize SHAP values to identify hidden thresholds in the data, and I'll show you how to use them for prescriptive insights. To get started we simply need to fit the explainer to our test data. This will give us SHAP values for each record in our test or prediction set. These SHAP values answer the question, "Why did our model predict this song as Popular?".
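Fitting the explainer is only a couple of lines with a tree model; final_model and X_test carry over from the previous step.

```python
import shap

explainer = shap.TreeExplainer(final_model)
shap_values = explainer.shap_values(X_test)   # one row of SHAP values per test-set song

# Global view: feature importances and the beeswarm of individual SHAP values
shap.summary_plot(shap_values, X_test)
```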
The first insights I look for are the SHAP Feature Importances. Based on this plot I might want to rework some features, or, worse, find data leakage (if one feature is overly important). Notice here (a little hard to read) that the interaction term is now the 3rd most important metric. And the top 2 metrics are ones we would consider non-actionable: we cannot write a song with these features, they are non-musical, but they are highly predictive of popularity.
Now, if you're not looking at that chart and wondering what those little dots represent, then you should be. These dots are the SHAP values of each individual prediction. The dependence plots expand these into scatterplots, but color code them with the feature that has the most interaction with the current feature. We can look for patterns in these color codings to think about possible new features we could code up.
For instance, you can see here in the SHAP values for loudness that there might be some interplay from energy. If we can use this knowledge to create a new feature, it might help the model learn the intricacies better. I had already coded this interaction feature up at this point and rerun the analysis, but the idea for the feature came from here. You can see a clear distinction between the red (high) and blue (low) energy, and that most high energy values, as well as high loudness values, sit above the y=0 line or x-axis. This indicates that loud, high energy songs are more popular according to our model.
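This is the kind of dependence plot call that surfaces such an interaction, assuming you're looking at a model run that still includes the raw loudness feature:

```python
import shap

# SHAP values for loudness, colored by the interacting feature (forced to energy here)
shap.dependence_plot("loudness", shap_values, X_test, interaction_index="energy")
```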
For me this is the most exciting part of the tutorial. I haven't seen anyone deconstruct the SHAP values for this purpose yet, but it's really cool. If we build our own scatterplots of the SHAP values for each feature, we can then fit a small LGBMRegressor() on the SHAP values for the purpose of predicting when the values will change sign, or cross the x-axis. This tells us the approximate feature value when the model's predictions start to change from "Not Popular" to "Popular". Read that again. We can now tell at which point each feature starts to turn from favorable to unfavorable.
Not all of the features will be actionable from a music composition standpoint, so we should only focus on the musical features from our model here. Here is one of the musical features we could encode into a song, loudness_x_energy, the interaction term. On the x-axis you have the feature's values; the mean loudness_x_energy is -4.2. On the y-axis we have this feature's SHAP values, where below 0 indicates a push toward "Not Popular" and above 0 a push toward "Popular". The dashed red line indicates our small regression model's prediction for where the fit equals 0. It is at this point that the model's predictions start to turn from unfavorable to favorable. You'll notice it's actually not the mean of -4.2; it's a little to the right of that, around -3.8. Turn it up!
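Here's a sketch of that sign-flip hunt: fit a small LGBMRegressor of SHAP value against feature value, then scan a grid for where the fitted curve crosses zero. The grid size and the small model's settings are arbitrary choices.

```python
import numpy as np
import lightgbm as lgb

def shap_sign_flip_thresholds(feature, X, shap_vals, grid_size=500):
    """Approximate feature values where the fitted SHAP curve crosses zero
    (i.e., where the model's view flips from unfavorable to favorable)."""
    x = X[feature].to_numpy().reshape(-1, 1)
    s = shap_vals[:, X.columns.get_loc(feature)]

    little_model = lgb.LGBMRegressor(n_estimators=100, num_leaves=7, random_state=42)
    little_model.fit(x, s)

    grid = np.linspace(x.min(), x.max(), grid_size).reshape(-1, 1)
    fitted = little_model.predict(grid)

    crossings = np.where(np.diff(np.sign(fitted)) != 0)[0]  # indices where the fit changes sign
    return grid[crossings].ravel()

print(shap_sign_flip_thresholds("loudness_x_energy", X_test, shap_values))
```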
Again, this is something I haven't seen many doing with their regression models, but something I would encourage more of. It provides thoughtful, data-driven guideposts and hypotheses. Please tell your friends!
What Does a Popular Song Sound Like?
Now that we have trained and tuned a predictive model which performs well for highly popular songs, and have attributed SHAP values to each musical feature's contribution toward that prediction, we are going to analyze these SHAP values for the most highly popular songs in the hopes that we can identify which features best characterize them.
First, we simply gather only the musical metrics and their respective SHAP value columns together in one place.
Then we can visualize them as a simple horizontal bar chart. The top 1% of popular songs happens to have a corresponding popularity score of over 90. Interestingly, we can see that their fast-paced tempo, speechiness, and danceability likely placed them at the top of the charts.
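For reference, here's roughly how that selection and chart could be put together; musical_features is an assumed list, and the top 1% is taken from the holdout set.

```python
import pandas as pd
import matplotlib.pyplot as plt

musical_features = ["danceability", "energy", "tempo", "speechiness", "acousticness",
                    "instrumentalness", "liveness", "valence", "loudness_x_energy"]  # assumed list

# SHAP values lined up with the test set's index and columns
shap_df = pd.DataFrame(shap_values, columns=X_test.columns, index=X_test.index)

# Top 1% by observed popularity (the ranking is unchanged by the logit transform)
top_idx = y_test >= y_test.quantile(0.99)
mean_shap = shap_df.loc[top_idx, musical_features].mean().sort_values()

mean_shap.plot(kind="barh")
plt.xlabel("Mean SHAP value (top 1% most popular songs)")
plt.show()
```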
Now that we understand more about which features our model thinks are highly predictive of a hit song, we are still a little unclear as to what a hit song actually sounds like. One method I really like for this is utilizing KNN to identify exemplar records from our data, sometimes called 1-NN since we are looking at the 1 closest matching neighbor. But first, since we are utilizing KNN, we need to scale our data.
I am going to filter to only the US songs since they are the most familiar to me, but this is not a necessary step. Then I fit the KNN classifier with 1 neighbor and predict on the scaled average musical features, resulting in the index of an exemplar hit song.
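A sketch of the 1-NN lookup; it assumes a US-only, song-level frame is available (if your aggregation collapsed the country column, filter the raw daily data to the US and deduplicate by track first), and that the musical_features list matches the model's columns.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

us_songs = songs[songs["country"] == "US"].reset_index(drop=True)  # hypothetical US-only pool

scaler = StandardScaler()
X_us = scaler.fit_transform(us_songs[musical_features])

# 1-NN with the row index as the "label": predicting returns the index of the closest song
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_us, us_songs.index)

# Average musical profile of the top 1% hits from the SHAP step, scaled the same way
hit_profile = X_test.loc[top_idx, musical_features].mean().to_frame().T
exemplar_idx = knn.predict(scaler.transform(hit_profile))[0]

print(us_songs.loc[exemplar_idx, ["name", "artists"]])
```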
And there we have it. This is what a hit song should sound like. The most interesting thing about this is that there are no lyrics at all. In fact this is just a tambourine played with a nice beat. Wow... maybe we are overthinking this! But when you think about it, maybe it does make a lot of sense since this model was trained on music from all over the world - the Universal Top Spotify Songs Dataset. Isn't this then the perfect representation of a universally popular song?