How to Write a Hit Song in 2020 with the Spotify API

Recently I revisited my first musical data project and gave it a total makeover – inside and out! Before I knew how easy it is to access musical data using the Spotify API, I had found a dataset on Kaggle that someone had made from Spotify’s API and began to play around with it. The dataset is the top 50 tracks on Spotify from each year 2010 – 2019. In the end I got some very surprising results that I am proud of. You can find the same dataset here, or even on my GitHub page.

In this project you will find:

Spoiler alert: I got the same prediction from several of the methods. This surprised me, since I used very different types of regression and algorithms. On top of that, some methods were trained on the yearly trend while others used Spotify’s generated Popularity score. Needless to say, I feel justified in these predictions. Beyond the surprising results, you should also understand that this is a multi-target regression problem. Simply put, I train each algorithm on only one variable and use that variable to predict several audio features (basically, given the popularity score or the year, what would a hit song’s audio features be?).

I’ll quickly go through the EDA, show some visualizations, then get right to the Machine Learning predictions!

I wanted to start off the project by giving everything a Spotify feel. That meant remaking all of my visualizations, even the EDA ones, to have the same lime-green Spotify vibe. Here are some of the highlights.

Part 1-2 EDA & Feature Engineering

Part 3 Multiple Linear Regression 

After exploring the data and how all of Spotify’s generated features relate to one another, it was time to get to the predictions. One of the simplest ways to predict data that follows a timeline is to use linear regression. It’s a time-tested method and it fit this situation perfectly. Below are the averages of all the target features for each year. Essentially I wanted to know where the next dot, for 2020, would be. In general hit songs are getting slower, less energetic, way more danceable, a little louder, slightly happier, way shorter, and way more acoustic, with fewer lyrics!

I experimented with PCA to reduce the dimensionality of the dataset, then decided to proceed without it. There really aren’t that many features to begin with, and I opted to just eliminate some from my predictive analysis altogether because I didn’t feel they were relevant to the goal of writing a hit song. For dimensionality reduction, I eliminated the Popularity feature and the Length feature. It’s not like you could decide to write a song that was “X” popular, after all.

The data has different scales and units, so some scaling needed to be done. Here you can see the effect of the different types of scalers one might want to use, and their graphical effect on the spread of the data. There weren’t too many outliers, so a simple MinMax scaler would work and put all of the values between 0 and 1. Scaling would also be critical later on, in the last part of the prediction, where I look back in the dataset to find the single nearest neighbor to my predicted values (since this algorithm uses distance).
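Here is a minimal sketch of what the two scalers do, using scikit-learn. The four rows and the three feature columns are made-up stand-ins, not values from the Kaggle dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up stand-in rows: tempo (BPM), loudness (dB), danceability (0-100).
features = np.array([
    [95.0, -7.0, 60.0],
    [120.0, -5.0, 75.0],
    [140.0, -3.0, 80.0],
    [80.0, -9.0, 45.0],
])

# MinMax: every column is rescaled so its smallest value is 0 and its largest is 1.
minmax = MinMaxScaler()
scaled = minmax.fit_transform(features)

# Standard: every column is shifted and rescaled to mean 0 and unit variance.
standard = StandardScaler()
standardized = standard.fit_transform(features)

# inverse_transform maps scaled values back to the original units,
# which is handy for reading a prediction as BPM or dB again.
restored = minmax.inverse_transform(scaled)
```

Keeping the fitted scaler around matters: any predicted feature vector has to go through the same scaler before it is compared with songs in the dataset.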

Exploring different scalers… I ended up using MinMax and Standard Scaler.

As mentioned before, this is a multi-target linear regression, so the model is trained on the year and then predicts several audio features. I experimented with a few different models and regularizations, then chose the one that gave me the best MSE (mean squared error) across my train and test sets. I also experimented with stratifying my split on the year so that every subset would have the same proportion of songs from each year, but found it had no real impact on the resulting prediction.

After playing around with and selecting the best linear regression model, it was time to train the model and make some predictions. Since the model was trained on the year, all I had to do was enter the year 2020 and the model provided me with the predicted audio features for that year.
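The idea can be sketched in a few lines with scikit-learn, whose `LinearRegression` accepts several target columns at once. Everything about the data here is an assumption for illustration: the three features, their trends, and the noise are randomly generated placeholders, and scaling is left out to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: 50 songs per year for 2010-2019, three invented features.
years = np.repeat(np.arange(2010, 2020), 50)
X = years.reshape(-1, 1).astype(float)
trend = np.column_stack([
    120 - 0.5 * (years - 2010),  # tempo drifting down
    60 + 1.0 * (years - 2010),   # danceability drifting up
    -6 + 0.1 * (years - 2010),   # loudness drifting up
])
y = trend + rng.normal(scale=2.0, size=trend.shape)

# Stratify on the year so train and test keep the same share of each year.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=years, random_state=0
)

# One input column (the year), several output columns (the audio features).
model = LinearRegression().fit(X_train, y_train)
test_mse = mean_squared_error(y_test, model.predict(X_test))

# Ask the fitted model for the year 2020: one row, one value per feature.
prediction_2020 = model.predict(np.array([[2020.0]]))
```

Swapping `LinearRegression` for `Ridge` or `Lasso` from the same module is one way to try the regularized variants and compare their MSE.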

Once I had the predicted audio features, the best way for me to understand them would be to find a song from the dataset whose features were closest to the prediction’s features: an exemplar song. To do this I decided to use the K-Nearest Neighbor algorithm with k=1, to find the single nearest neighbor. I don’t see this done too often, but for this use case it was a perfect solution. A common rule of thumb is to set k to the square root of the number of elements in your dataset, but that’s for a typical use case. Since it is pretty uncommon to use just one, I decided to prove to myself that the results were valid. First I built my own algorithm, which gave me the same result as k=1, and then I plotted a range of values of k against their MSEs. You can see here that, for this situation, k=1 would do the trick.
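A minimal sketch of the exemplar lookup, assuming scikit-learn’s `NearestNeighbors` as the k=1 search and a plain NumPy distance calculation as the hand-rolled check. The song titles, feature values, and the “predicted” vector are all invented for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MinMaxScaler

# Invented songs: tempo (BPM), loudness (dB), danceability (0-100).
titles = ["Song A", "Song B", "Song C", "Song D"]
features = np.array([
    [95.0, -7.0, 60.0],
    [120.0, -5.0, 75.0],
    [140.0, -3.0, 80.0],
    [80.0, -9.0, 45.0],
])

# Invented stand-in for a model's predicted audio features.
predicted = np.array([[110.0, -5.5, 70.0]])

# Scale songs and prediction with the same scaler so no feature dominates the distance.
scaler = MinMaxScaler().fit(features)
scaled_songs = scaler.transform(features)
scaled_target = scaler.transform(predicted)

# k=1: return only the single closest song (Euclidean distance by default).
nn = NearestNeighbors(n_neighbors=1).fit(scaled_songs)
distances, indices = nn.kneighbors(scaled_target)
exemplar = titles[indices[0, 0]]

# Hand-rolled check: compute every distance directly and take the smallest.
brute_index = int(np.argmin(np.linalg.norm(scaled_songs - scaled_target, axis=1)))
```

With these made-up numbers both approaches pick the same row, "Song B".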

Looking back into the dataset I finally found a prediction. Based off of linear regression, a hit song in 2020 would sound closest to… Close by Nick Jonas!

Part 4 Random Forest Regression

Next up I wanted to try a different type of regression algorithm. I found out that many regression algorithms don’t work in a multi-target situation. I wanted to try XGBoost, but it wasn’t designed to be used this way. There are some interesting workarounds, but in the end I decided to try the Random Forest Regressor. This is a bit different from linear regression because it uses decision trees as its basis, a totally different approach to regression. Perfect! You can see below that many algorithms don’t work with a multi-target regression.

Now for Random Forests you don’t necessarily need to scale your data, but for the K-Nearest Neighbor step that would come next, you really should. So I used scaled data again and began to tune my hyperparameters. For this regression I used GridSearchCV to find the best hyperparameters.
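As a rough sketch of that tuning step: `RandomForestRegressor` handles several targets natively, and `GridSearchCV` tries every combination in a parameter grid. The grid values, the two invented features, and the shuffled 3-fold split below are assumptions for illustration, not the settings from my notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)

# Placeholder data: 20 songs per year for 2010-2019, two invented features.
years = np.repeat(np.arange(2010, 2020), 20)
X = years.reshape(-1, 1).astype(float)
trend = np.column_stack([
    120 - 0.5 * (years - 2010),  # tempo drifting down
    60 + 1.0 * (years - 2010),   # danceability drifting up
])
y = trend + rng.normal(scale=2.0, size=trend.shape)

# Hypothetical grid; real searches usually cover more values.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 4]}

# Shuffle the folds because the rows are ordered by year.
cv = KFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
    cv=cv,
)
search.fit(X, y)

best_model = search.best_estimator_
prediction_2020 = best_model.predict(np.array([[2020.0]]))
```

One thing worth knowing about this setup: a tree ensemble cannot extrapolate past the years it has seen, so with the year as the only input the 2020 row lands in the same leaves as 2019.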

Next the same K-Nearest Neighbor approach was used, with k=1, to find an exemplar song in the dataset. And the results are in! So according to the Random Forest Regressor, and based off of the trends in the years, if you want to write a hit song in 2020, it would sound just like… Keeping Your Head Up by Birdy!

Part 5 Deep Learning Based Off of the Year

Next I wanted to see if Deep Learning could provide any insight into how to write a hit song in 2020. It may even be overkill for this situation, but we needed to break the tie. I used Keras to set up a neural network which would work in a similar way to the regressors before. It starts off with 128 neurons and ends up with just 8 audio features. You can see below that it trained very quickly and produced an MSE only slightly better than the Random Forest model.

Then, following the same procedure, I placed these predicted features into the K-Nearest Neighbor algorithm to find an exemplar track from the dataset. And well, what do you know… Keeping Your Head Up by Birdy!

Part 6 Deep Learning Based Off of Spotify’s Popularity Score

The tie seemed to have been broken, but I couldn’t resist trying to predict what a hit song in 2020 would sound like by taking another route – this time using Spotify’s Popularity Score. Same approach and neural network as before, except this time the algorithm would be trained on the Popularity Score rather than the year. This would better capture the features of a song that could become popular, right now!

The Deep Learning framework performed similarly to before, and the K-Nearest Neighbor gave me a surprising result…

Again… Keeping Your Head Up by Birdy! 

Unbelievable. Now that I have justified these predictions to myself, I have a better understanding of what a hit song in 2020 will sound like. While she only made it to a Popularity score of 52, it seems that maybe she was before her time, like many great artists.

Part 7 K-Means Clustering Playlists

After I knew what a hit song would sound like, I realized that these are all hit songs! The top 50 from each year this decade. What better way to get some inspiration for writing a hit song than to listen to the hit songs themselves? I decided to make playlists from the dataset and let the K-Means algorithm choose the songs based off of their audio features. So the question became, what’s the best number of playlists to make? To explore this I looked at the elbow method. When that didn’t give me a concrete number, I ended up using the silhouette score method. You can see below how I ended up with 8 unique playlists from this dataset.
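Here is a minimal sketch of picking the number of playlists by silhouette score with scikit-learn. The data is synthetic, generated with `make_blobs` purely as a placeholder for the scaled audio features, and the range of k values is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for 300 songs with 5 audio features each.
features, _ = make_blobs(
    n_samples=300, centers=4, n_features=5, cluster_std=0.5, random_state=0
)
scaled = MinMaxScaler().fit_transform(features)

inertias = {}  # within-cluster sum of squares, the quantity plotted for the elbow method
scores = {}    # mean silhouette score, between -1 and 1 (higher is better)
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertias[k] = km.inertia_
    scores[k] = silhouette_score(scaled, km.labels_)

# Pick the k with the highest silhouette score, then assign each song a playlist id.
best_k = max(scores, key=scores.get)
playlists = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(scaled)
```

Plotting `inertias` against k gives the elbow curve, while `scores` gives a single number per k to compare when the elbow is ambiguous.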


It was a great joy for me to revisit an older project of mine, to enhance the visualizations and improve my algorithms, and also just to be working with music data in general! I had fun trying to give everything a Spotify feel, and was ecstatic to have my results corroborate each other across different methods and algorithms. I think in the future I will connect to the Spotify API and actually generate these playlists, which isn’t hard to do at all. It was a cool experience to work with a multi-target regression and get a feel for why some algorithms aren’t able to work in these situations. Using a K-Nearest Neighbor algorithm with k=1 was kind of strange, and I learned a lot trying to convince myself it was the best method to use. But like they say, “the right tool for the right job.” Creating my own algorithm to test this out was a unique experience for me, but if anything I learned that I need to trust myself more. If you are interested in learning more about the k=1 Nearest Neighbor, I found some links that I’ll post below, and if you are interested in seeing some more visualizations, or even just checking out the playlists that were generated, please look into my notebook on GitHub, also in the links below. Now if you’d please excuse me, I have a hit song I need to write!


Dataset on Kaggle

Sam Brady’s GitHub Page

A Simple Introduction to K-Nearest Neighbors Algorithm (Medium)

The 1-nearest neighbor classifier (Wikipedia)

Machine Learning from University of Washington on the 1-Nearest Neighbor Classifier (Coursera)
