One Million Spotify Podcasts Analyzed with the Spotify API

Visualizing, analyzing, and pulling insights from over one million Spotify podcasts.

This is a follow-up to my last article, where I showed how to access the Spotify API specifically for podcast data and how to amass a dataset of podcast information related to a specific search term such as “data science”. Here I will show how you can join several of these datasets together, which lets you look for trends across a very large dataset composed of several search topics.

Please have a look at my previous article here to learn how you can start to create your own podcast datasets tailored to your specific interests.

Spotify Podcast Data

For this project I created datasets (as csv files) for podcasts related to a broad range of my interests. The topics I searched were data science, business, climate, education, health, movies, music, sports, true crime, relationships, food, and travel. The first step was to concatenate the csv files into one pandas dataframe and add a separate column designating the search topic each file came from.
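As a rough sketch of that concatenation step (not the exact code from my notebook), something like the following would work. The helper name `combine_topic_frames`, the column names, and the tiny in-memory csv samples are all made up for illustration; in practice you would read your own files from disk.

```python
import io

import pandas as pd


def combine_topic_frames(frames_by_topic):
    """Stack per-topic dataframes and record the search topic of each row."""
    labeled = [df.assign(topic=topic) for topic, df in frames_by_topic.items()]
    return pd.concat(labeled, ignore_index=True)


# Tiny in-memory stand-ins for the real csv files (hypothetical contents).
health_csv = io.StringIO("show_name,episode_id\nBusiness and Biceps,ep1\n")
business_csv = io.StringIO(
    "show_name,episode_id\nBusiness and Biceps,ep1\nSome Show,ep2\n"
)

podcasts = combine_topic_frames(
    {
        "health": pd.read_csv(health_csv),
        "business": pd.read_csv(business_csv),
    }
)
print(podcasts)
```

Tagging each frame before concatenating keeps the topic label attached to every row, which is what makes the per-topic comparisons later on possible.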

Together these datasets hold slightly over one million (1,072,763) podcast episodes, although some of them are duplicates. The features are the show name, episode id, length (in milliseconds), date, episode name, episode description, and related search topic. There are several more relevant pieces of metadata you can pull from the Spotify API, but for my purposes these were enough. If you are interested in other features, you can take the code from my previous post and change it to include them.

Many of the search topics are closely related, such as music and movies, so the same episode can turn up under more than one topic. Looking at duplicate episode ids shows that roughly 11% of the episodes (117,731) in my dataframes are duplicates. A look at some of the corresponding show names makes it obvious why: “Business and Biceps” is a great example of a podcast that would come up for the “health” as well as the “business” search topic.
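One way to count those repeats, assuming the episode id lives in a column called `episode_id` (a name I am using here for illustration), is to let pandas flag every occurrence after the first. This is a minimal sketch on toy data, and it may not match exactly how the figure above was computed.

```python
import pandas as pd

# Toy data: episode "a" was returned by two different search topics.
episodes = pd.DataFrame(
    {
        "episode_id": ["a", "b", "a", "c"],
        "topic": ["health", "health", "business", "business"],
    }
)

# True for every occurrence of an id after its first appearance.
is_repeat = episodes.duplicated(subset="episode_id", keep="first")

n_repeats = int(is_repeat.sum())
share_repeats = n_repeats / len(episodes)

# Keep one row per episode; note this keeps only the first topic label.
unique_episodes = episodes.drop_duplicates(subset="episode_id", keep="first")

print(n_repeats, share_repeats, len(unique_episodes))
```

Whether to drop the repeats depends on the question: for per-topic charts it can make sense to keep them, while for overall counts they would inflate the totals.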

The length of each episode is given in milliseconds, which is probably great for pinpointing the exact moment a user is listening to. For us, though, I think minutes are a better unit. First I converted the length to minutes and then visualized the distribution of episode lengths in the dataframe. A few very long episodes skew the distribution, so I zoomed in on the bulk of it to get a better visualization, although cutting off the tail this way may be misleading.
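The conversion itself is a single division, since there are 60,000 milliseconds in a minute. In this sketch the column names `length_ms` and `length_min` and the 180-minute cutoff used for zooming in are assumptions chosen for illustration, not values from my notebook.

```python
import pandas as pd

MS_PER_MINUTE = 60_000

# Toy lengths in milliseconds, including one very long outlier.
episodes = pd.DataFrame({"length_ms": [300_000, 1_800_000, 2_700_000, 36_000_000]})
episodes["length_min"] = episodes["length_ms"] / MS_PER_MINUTE

# "Zooming in": keep only episodes at or below a chosen cutoff before plotting.
cutoff_min = 180  # hypothetical cutoff
bulk = episodes[episodes["length_min"] <= cutoff_min]

print(episodes["length_min"].tolist())
print(len(bulk), "of", len(episodes), "episodes fall inside the zoomed range")
```

Reporting how many rows fall outside the cutoff alongside the plot is one way to keep the zoomed view from misleading the reader.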

Hmm… You can see a few local maxima here. I wondered whether visualizing the distribution by topic would reveal any insights. Even though I would typically advise against a graph like the one below (since it’s clearly pretty cluttered), I feel like the trends do present themselves here. Rather than the length of each episode, here is the distribution of each show’s average episode length, colored by topic. You can now see that there are many shows whose episodes average 10 mins or less, around 30 mins, or around 45 mins. Surprisingly, you can also see that the movies category stands out with a peak at around 60 mins on average.
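The per-show averages can be computed with a groupby. This is a sketch under the assumption that the dataframe has `show_name`, `topic`, and `length_min` columns (illustrative names), and the shows and lengths below are invented sample data.

```python
import pandas as pd

episodes = pd.DataFrame(
    {
        "show_name": ["Show A", "Show A", "Show B", "Show B"],
        "topic": ["movies", "movies", "health", "health"],
        "length_min": [50.0, 70.0, 8.0, 12.0],
    }
)

# Average episode length per show, keeping the topic for coloring the plot.
avg_length = (
    episodes.groupby(["topic", "show_name"], as_index=False)["length_min"]
    .mean()
    .rename(columns={"length_min": "avg_length_min"})
    .sort_values("avg_length_min", ascending=False)
    .reset_index(drop=True)
)

print(avg_length)
```

Sorting in descending order means the same table can feed both the distribution plot and a "longest on average" bar chart.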

Since the notebook at this point is set up to look at average episode lengths, it makes sense to look for the podcasts with the longest episodes on average. With a bar chart it is easy to color each bar by its search topic. Although it does not hold the very top spot, the movies category takes many of the top spots. Music also has many podcasts where the average length of an episode is around 3 or 4 hours… I wonder if these are playlists of songs.

With the format of the bar chart already set up it is easy enough to do a quick count of the number of episodes as well. Hans and Scotty G. sure do have a lot to talk about I guess. There are several education and sports podcasts high up in the ranks as well. Here are the podcasts with more than 1000 episodes.
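Counting episodes per show follows the same groupby pattern. The helper name `shows_with_many_episodes` and the column names are hypothetical, and the sketch counts unique episode ids so that an episode returned by two search topics is not counted twice.

```python
import pandas as pd


def shows_with_many_episodes(episodes, min_episodes):
    """Return shows with more than `min_episodes` unique episodes, largest first."""
    counts = episodes.groupby("show_name")["episode_id"].nunique()
    return counts[counts > min_episodes].sort_values(ascending=False)


# Toy data; with the real dataframe you would pass min_episodes=1000.
episodes = pd.DataFrame(
    {
        "show_name": ["Show A", "Show A", "Show A", "Show B"],
        "episode_id": ["a1", "a2", "a3", "b1"],
    }
)

busy_shows = shows_with_many_episodes(episodes, min_episodes=2)
print(busy_shows)
```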

After looking into the longest podcasts by length, I began to wonder about the timespan of these shows. To play around with dates, I like Python’s datetime module and its date class. You can use built-in functions like max() and min() to easily find the latest and earliest dates, and these functions also work on dates stored as strings, as long as the strings are in a year-first format such as YYYY-MM-DD, which sorts in chronological order.
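Here is a small illustration of why the string format matters when using max() and min(). The date values are made-up examples; the point is only that year-first strings compare correctly as text, while month-first strings need to be parsed into date objects first.

```python
from datetime import date, datetime

# Year-first (ISO) strings sort chronologically, so max()/min() work directly.
iso_strings = ["2020-05-24", "2017-03-01", "2020-06-02"]
latest_iso = max(iso_strings)
earliest_iso = min(iso_strings)

# Month-first strings do not: text comparison looks at the month digits first.
us_strings = ["05-24-2020", "12-31-2019"]
misleading_latest = max(us_strings)

# Parsing into date objects gives the correct answer for any format.
parsed = [datetime.strptime(s, "%m-%d-%Y").date() for s in us_strings]
true_latest = max(parsed)
span_days = (max(parsed) - min(parsed)).days

print(latest_iso, earliest_iso, misleading_latest, true_latest, span_days)
print(date.fromisoformat(latest_iso))  # ISO strings parse without a format string
```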

After looking at the maximum date, you can see there was clearly a user error here.

After fixing that error, another interesting revelation presented itself. I pulled the data from the Spotify API on May 24th, but you can see here that the latest date was June 2nd. Apparently there are podcasts in their database that are scheduled to be released in the future. Further investigation would be needed to confirm this, however.

After making a bar chart to visualize the longest-running podcasts, something still wasn’t sitting right with me. How could these podcasts have been going on for 50 years? To look into this further, I pulled up the Spotify app, searched for one of these shows, and scrolled to the bottom of the episode list to find the culprit. It seems that although the first episode was posted in 1970, there was no follow-up until 2017. A few things could be going on here… but a reader brought to my attention that Unix time starts counting on that day (the Unix epoch, January 1st, 1970), so this could simply be an error in the dataset. It may be wise to eliminate any episodes with dates that early or to impute a more reasonable date for them.
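If you go the elimination route, a date filter is enough. This is only a sketch: the `date` column name, the sample rows, and the cutoff of January 1st, 1971 are my assumptions for illustration, and you may prefer a different threshold for your own data.

```python
import pandas as pd

episodes = pd.DataFrame(
    {
        "show_name": ["Show A", "Show A", "Show B"],
        "date": ["1970-01-01", "2017-04-03", "2020-05-20"],
    }
)
episodes["date"] = pd.to_datetime(episodes["date"])

# Treat anything before this cutoff as a likely placeholder date (assumption).
cutoff = pd.Timestamp("1971-01-01")
suspect = episodes["date"] < cutoff

cleaned = episodes.loc[~suspect].reset_index(drop=True)

print(int(suspect.sum()), "suspect rows removed")
print(cleaned)
```

Keeping the boolean mask around, rather than dropping rows immediately, makes it easy to inspect which shows were affected before deciding what to do with them.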

Before making a timeline of episodes I wanted to look closer at the beginnings of podcasts. There are a few sporadic episodes until about 2001 and then they seem to start becoming more regular, or at least a few episodes every month. This seemed like a safe place, or date, to start graphing the timeline.

Looking at the timeline of episodes per month, it is easy to see a clear trend: podcasts are taking off! They seem to be growing exponentially, and with Spotify’s tools like Anchor and Soundtrap pretty much anyone can record a podcast and upload it to their app. There seems to be no slowing down here.
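A monthly timeline like this can be built by grouping on the month of each episode's date. The sketch below assumes a datetime column named `date` (an illustrative name), uses invented sample dates, and starts the timeline in 2001 as described above.

```python
import pandas as pd

episodes = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["1999-07-04", "2020-01-05", "2020-01-20", "2020-02-11"]
        )
    }
)

# Start the timeline where episodes become regular.
start = pd.Timestamp("2001-01-01")
recent = episodes[episodes["date"] >= start]

# Number of episodes in each calendar month.
episodes_per_month = recent.groupby(recent["date"].dt.to_period("M")).size()

print(episodes_per_month)
# episodes_per_month.plot() would draw the timeline if matplotlib is installed.
```

Months with no episodes simply do not appear in the result, so for sparse data you may want to reindex over a full range of months before plotting.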

Looking closer at just the number of episodes since 2020, you can clearly see a pattern. There’s something going on here. Let’s dig deeper!

Breaking each date up into a weekday is very easy with pandas datetimes: you can use the .day_name() method to assign each day’s corresponding weekday name. It didn’t make much sense to average over the whole history of podcasts since they are growing so rapidly, but the averages for 2020 give a pretty good idea of how many episodes are being put out each weekday this year. We can now see that not too many episodes are being put out on the weekends, relatively speaking of course.
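One possible way to get those weekday averages is to count episodes per calendar day first and then average the daily counts by weekday name. The `date` column name and the sample dates are assumptions for illustration, and my notebook may have computed the averages differently.

```python
import pandas as pd

episodes = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2020-05-04", "2020-05-04",  # a Monday, 2 episodes
                "2020-05-11", "2020-05-11", "2020-05-11", "2020-05-11",  # a Monday, 4
                "2020-05-09",  # a Saturday, 1 episode
                "2019-12-30",  # outside 2020, ignored below
            ]
        )
    }
)

in_2020 = episodes[episodes["date"].dt.year == 2020]

# Episodes released on each calendar day.
per_day = (
    in_2020.groupby(in_2020["date"].dt.normalize())
    .size()
    .rename("n_episodes")
    .reset_index()
)
per_day["weekday"] = per_day["date"].dt.day_name()

# Average number of episodes per day, by weekday.
avg_per_weekday = per_day.groupby("weekday")["n_episodes"].mean()

print(avg_per_weekday)
```

Days with no episodes at all are not part of the average here, which matters little for a dataset this large but would bias small samples upward.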

I started to wonder what kinds of insights I could draw by looking for words in each episode’s description. The possibilities are endless here and can be tailored to your specific use case. I chose to stick with the broad idea of looking at how the COVID-19 global pandemic has impacted the podcast episodes on Spotify. There are two dates we can look at here: December 31st, 2019, when officials in Wuhan became aware of the virus, and mid-February, when it began to spread in the US. But you can see below that some of the terms were present in episode descriptions starting in 2012… well, they could have been talking about “corona” beer, I suppose. Further investigation would be needed here.
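A simple version of this keyword search can be done with a case-insensitive regular expression over the descriptions. The term list, the column names, and the sample descriptions below are illustrative assumptions rather than the exact ones I used, and the 2012 row shows how a beer mention slips through.

```python
import re

import pandas as pd

# Hypothetical search terms; extend or replace for your own use case.
TERMS = ["corona", "covid", "quarantine"]
pattern = r"\b(?:" + "|".join(re.escape(term) for term in TERMS) + r")"

episodes = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2012-07-01", "2020-03-20", "2020-03-25", "2020-03-28"]
        ),
        "description": [
            "Enjoying a Corona on the beach",
            "Life in quarantine, week one",
            "COVID-19 and remote work",
            None,  # missing descriptions are treated as non-matches
        ],
    }
)

mentions = episodes["description"].str.contains(
    pattern, case=False, na=False, regex=True
)

# Matching episodes per month, ready for a timeline plot.
hits = episodes[mentions]
mentions_per_month = hits.groupby(hits["date"].dt.to_period("M")).size()

print(mentions_per_month)
```

Matching on word beginnings catches "coronavirus" as well as "corona", at the cost of false positives like the one above, so restricting the date range or adding more specific terms is worth considering.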

Now all together for scale. You can see there was about a three week lag before podcasts started to talk about being in quarantine or remote work/learning as a solution to the pandemic. Reality soon kicked in.

You might also be noticing by now the steep drop-off on the right-hand side of these graphs. That is because episodes are put into their database before the actual release date. These content makers turned their homework in early! There are only a few of these “soon to be released” episodes, so the graphs appear to dip.

Conclusion

This was just a sample of some of the insights you can gain from looking through Spotify podcast data and comparing different topics of podcasts. There are a few things to keep in mind before moving on. The Spotify API caps the offset (something like a page number) at 2000 when pulling data. That is essentially like reading a book but only being allowed to read up to page 2000. There are likely many, many more podcast episodes and shows that did not make the cut. Their library is DEEP! Also, this analysis represents just a small fraction of the possible topics you could search for. They were tailored to my interests, but you could get entirely different insights by creating dataframes of topics you are interested in. I had fun playing around with this data and I would be super curious to see what others could do with it. If you see any cool projects with Spotify podcast data, please send them my way!

Thanks for reading! Feel free to share …and if you want to see the notebook and code please just head on over to my GitHub!

Links

Sam Brady’s LinkedIn ← —Seeking an exciting new employment opportunity!

Sam Brady’s GitHub Repo

Sam Brady’s Personal Project Blog

Spotify API Developer’s Page

Spotify Podcast Dataset and TREC Challenge 2020

Spotify Podcast Endpoints

spotipy Python Package

markovify Python Package

Cyberpunk Style with Matplotlib

HTML Color Picker

Get in touch at: mr.sam.tritto@gmail.com