Spotify Podcast Data with the Spotify API

A tutorial on how to access the Spotify API specifically for podcast data, and what you can potentially do with it! For this project I will show how to gather podcast data for every show and episode related to a search term such as “data science”.

As a musician, I have made the Spotify API my go-to place to find and play around with music data. It’s awesome! Lately Spotify has also been putting a lot of effort into its enormous podcast library. True crime lovers rejoice: there seem to be years’ worth of entertaining and informative podcasts to binge on. Spotify recently launched a contest using data from its podcast library, which you can find using the link below.

Podcast Dataset and TREC Challenge 2020

In this challenge, a dataset will be provided consisting of 100,000 episodes from different podcast shows on Spotify. Participants will be asked to complete 2 tasks that will focus on understanding podcast content, and enhancing the search functionality within podcasts.

I wanted to see what kind of generic podcast data I could obtain from their API. I have previously used the spotipy package for Python to access the API, but I hadn’t seen anyone use it for podcast data yet, so I wanted to explore this further. Since the Spotify API is queried with GET requests, the idea was to let the spotipy package handle the handshake and credentials, then write my own GET requests. After creating a developer’s account you can read Spotify’s Docs to gain an understanding of the API’s capabilities.

Spotify Developer’s Docs

Getting the Data

We will be creating a dataframe of every episode related to a search term such as “data science”, but feel free to change the term to something else you might be interested in.

The first step is creating a developer’s account and obtaining the relevant credentials (a client ID and a client secret). The spotipy package then takes these credentials and produces a “token”, which you can use to make the GET requests.
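Here is a minimal sketch of that step, not my exact notebook code. The function names `get_token` and `auth_headers` are my own for illustration, and I am assuming spotipy's `SpotifyClientCredentials` manager with `get_access_token(as_dict=False)`; its behavior has shifted between spotipy versions, so check it against the one you have installed.

```python
def get_token(client_id: str, client_secret: str) -> str:
    """Exchange app credentials for a bearer token via spotipy."""
    # Imported here so the helper below still works without spotipy installed.
    from spotipy.oauth2 import SpotifyClientCredentials

    manager = SpotifyClientCredentials(client_id=client_id,
                                       client_secret=client_secret)
    # as_dict=False asks for the bare token string instead of the token-info dict
    # (assumption: supported by your spotipy version).
    return manager.get_access_token(as_dict=False)


def auth_headers(token: str) -> dict:
    """Build the header that each GET request to the API needs to carry."""
    return {"Authorization": f"Bearer {token}"}
```

With real credentials this would be used as `headers = auth_headers(get_token("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET"))`, and the resulting dictionary is then passed along with every request.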

Next we need to choose one of the many available endpoints from the Spotify API. They recently added several new endpoints to accommodate podcasts. You can read about them here:

Search, browse, and follow podcasts using the new Podcast API

To perform the query there are a few things we need to keep in mind. One is that Spotify returns at most 50 results per request. To get more than 50 shows we can run a while loop that keeps requesting pages until all of the shows have been collected. The variable we need to increment is the offset, which tells the query which result to start from. It is set to 0 for the first request, which returns the first 50 shows; the next request uses an offset of 50 to grab the following 50 (starting with the 51st show), and so on until we have them all.

The second thing we need to keep in mind is the total number of shows related to a search. The JSON response provides this total, so we can simply divide it by 50 and round up to find out how many requests we will need. And before I forget, we also need to set the type of search to “show”, since we are looking for all the related shows.
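The loop described above can be sketched roughly like this. It is an illustration under a few assumptions rather than the code from my notebook: `make_fetch` and `search_shows` are names I made up, I use the standard library instead of a third-party HTTP package, and I pass a `market` because, as far as I can tell, show results can come back empty without one when using app-only credentials.

```python
import json
import math
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.spotify.com/v1/search"
PAGE_SIZE = 50  # the most results Spotify returns per request


def make_fetch(token):
    """Return a function that GETs a URL with query params and decodes the JSON."""
    def fetch(url, params):
        request = urllib.request.Request(
            url + "?" + urllib.parse.urlencode(params),
            headers={"Authorization": f"Bearer {token}"},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch


def search_shows(term, fetch, market="US"):
    """Collect id, name and description for every show matching `term`."""
    shows, offset, pages = [], 0, 1  # `pages` is corrected after the first response
    while offset < pages * PAGE_SIZE:
        params = {"q": term, "type": "show", "limit": PAGE_SIZE,
                  "offset": offset, "market": market}
        page = fetch(SEARCH_URL, params)["shows"]
        # Total matches divided by the page size, rounded up, is the request count.
        pages = math.ceil(page["total"] / PAGE_SIZE)
        for item in page["items"]:
            if item:  # defensively skip empty (null) entries
                shows.append({"show_id": item["id"],
                              "name": item["name"],
                              "description": item["description"]})
        offset += PAGE_SIZE
    return shows
```

Calling `search_shows("data science", make_fetch(token))` returns a list of dictionaries, which `pandas.DataFrame` accepts directly. Passing the fetch function in as an argument is a design choice that lets you try the paging logic against canned responses before spending real API calls.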

All of the relevant information is then appended to the empty lists. The only thing left to do is to create a dataframe from the full lists. There is plenty more information you could grab from the metadata, but we are really just after the show IDs here. We will loop through these later when we are looking for individual episodes. We also grab the show name as well as the description, because it was easy enough to do. And as you can see below we have a total of 279 podcast shows related to the term “data science”.

We will perform one more query, which is essentially the same except for the type variable and the endpoint. This time we are looking for each episode, and to do this we can loop through the show IDs we grabbed previously.

If there are any errors, the code will simply print the show_id of the show that caused the error then just keep running. If you are curious you can dig a little deeper to see why.
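A rough sketch of that episode loop follows, again under assumptions. `fetch` is a hypothetical callable that takes a URL and a dictionary of query parameters and returns the decoded JSON, the function names are mine, and the extra `failed` list is my own addition so the problem shows can be revisited later. The fields I keep (`duration_ms`, `release_date`, and so on) are the ones I found most interesting; the episode objects contain more.

```python
EPISODES_URL = "https://api.spotify.com/v1/shows/{show_id}/episodes"
PAGE_SIZE = 50  # the most results Spotify returns per request


def episodes_for_show(show_id, fetch, market="US"):
    """Page through one show's episodes and return a list of row dicts."""
    rows, offset = [], 0
    while True:
        page = fetch(EPISODES_URL.format(show_id=show_id),
                     {"limit": PAGE_SIZE, "offset": offset, "market": market})
        for episode in page["items"]:
            if not episode:  # defensively skip empty (null) entries
                continue
            rows.append({"show_id": show_id,
                         "episode_id": episode["id"],
                         "name": episode["name"],
                         "description": episode["description"],
                         "duration_ms": episode["duration_ms"],
                         "release_date": episode["release_date"]})
        offset += PAGE_SIZE
        if offset >= page["total"]:
            return rows


def episodes_for_shows(show_ids, fetch):
    """Gather episodes for many shows, printing and skipping any show that fails."""
    episodes, failed = [], []
    for show_id in show_ids:
        try:
            episodes.extend(episodes_for_show(show_id, fetch))
        except Exception:  # deliberately broad: report the show and keep going
            print(show_id)
            failed.append(show_id)
    return episodes, failed
```

Because a show's rows are only added once all of its pages have loaded, a show that fails halfway through contributes nothing rather than a partial episode list.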

Again we create a dataframe from the full lists, and after a little cleaning of the data we can see that there are 8474 episodes from shows that relate to the term “data science”. There is more metadata available, but the most interesting features to me were length, date, and description.
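The cleaning itself depends on what you want to plot, so treat the following as one possible sketch rather than the exact steps I took. It assumes each episode row carries the API's `duration_ms` and `release_date` fields; the `clean_episode` name and the `minutes` and `released` keys are hypothetical. Release dates are not always full dates (some are only a year, or a year and month), so missing parts are padded with 1.

```python
from datetime import date


def clean_episode(row):
    """Return a copy of the row with length in minutes and a parsed release date."""
    cleaned = dict(row)
    # Milliseconds to minutes, rounded to one decimal place.
    cleaned["minutes"] = round(row["duration_ms"] / 60_000, 1)
    # "YYYY", "YYYY-MM" or "YYYY-MM-DD": pad the missing month/day with 1.
    parts = [int(part) for part in row["release_date"].split("-")]
    parts += [1] * (3 - len(parts))
    cleaned["released"] = date(*parts)
    return cleaned
```

Applying it to every row is a one-liner: `cleaned = [clean_episode(row) for row in episodes]`.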

Data Visualizations

So what can we do with our data? Well, to start we can do some visualizations to explore the trends. Let’s see what we find…

What podcasts have the most episodes?

On average how long are each show’s episodes?

Which podcasts have the longest episodes?

How many more podcasts are being produced each year?

How many more podcasts are being produced each month?

Is there a trend for 2020? Clearly something is going on here…

Which days of the week are new podcasts being made available?
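Each of the questions above boils down to grouping the episodes and counting or averaging. Here is a small sketch of those aggregations using only the standard library; it assumes every episode row has a show name under `show`, a length under `minutes`, and a `datetime.date` under `released`, all of which are key names I picked for illustration.

```python
from collections import Counter, defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def summarize(episodes):
    """Compute the counts and averages behind the charts."""
    minutes_by_show = defaultdict(list)
    for episode in episodes:
        minutes_by_show[episode["show"]].append(episode["minutes"])
    return {
        "episodes_per_show": Counter(ep["show"] for ep in episodes),
        "avg_minutes_per_show": {show: mean(lengths)
                                 for show, lengths in minutes_by_show.items()},
        "episodes_per_year": Counter(ep["released"].year for ep in episodes),
        # date.weekday() is 0 for Monday through 6 for Sunday.
        "episodes_per_weekday": Counter(WEEKDAYS[ep["released"].weekday()]
                                        for ep in episodes),
    }


sample = [
    {"show": "A", "minutes": 30.0, "released": date(2020, 3, 9)},
    {"show": "A", "minutes": 40.0, "released": date(2020, 3, 13)},
    {"show": "B", "minutes": 85.0, "released": date(2019, 6, 1)},
]
stats = summarize(sample)
```

`stats["episodes_per_show"].most_common(10)` gives the ranking for the first chart, and the other entries can be handed to whichever plotting library you prefer.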

Very exciting. We can see that most podcasts average around 30 minutes per episode, with a small handful of exceptions around 85 and 140 minutes. There has been a steady increase in the total number of shows since 2013, with a few spikes that could be investigated further. The same trend shows up at the yearly level, and this year seems to be on track to continue it. Looking closer at the daily podcast count for 2020, we can see many peaks and valleys. The weekday averages show that not many podcasts are put out over the weekend. Monday is the most common release day, Friday is slightly behind, and releases are spread across the days in between.

Wordclouds

Since we have a description of each episode, let’s see if we can generate any insights from looking at a wordcloud of the most common words used in their descriptions.
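A wordcloud is really just a picture of word counts, so the counting step can be sketched on its own. This is an illustration, not my notebook code: the `word_frequencies` name and the tiny stopword list are my own assumptions, and a real run would want a much fuller stopword list.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "on",
             "for", "with", "is", "we", "this", "that", "it"}


def word_frequencies(descriptions, top=100):
    """Count the most common words across all episode descriptions."""
    words = []
    for text in descriptions:
        for word in re.findall(r"[a-z']+", text.lower()):
            # Drop stopwords and very short tokens such as stray apostrophes.
            if word not in STOPWORDS and len(word) > 2:
                words.append(word)
    return dict(Counter(words).most_common(top))
```

If you have the wordcloud package installed, a mapping like this should be accepted by `WordCloud().generate_from_frequencies(...)`, which takes care of the drawing.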

And how about a prettier version…

Now that’s something to hang on the office wall!

Text Generation

The only other thing I can think of here is to try to generate a brand new fictitious episode description based on all of the existing descriptions. You could certainly dive deep into this topic, perhaps using GPT-2 or some other advanced model, but for our purposes I decided to use markovify, a simple and easy-to-use Python package that generates text with Markov chains. I used it to generate 10 fictitious episode descriptions.
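To give a feel for the idea behind a Markov chain text generator, here is a toy word-level version written from scratch. To be clear, this is not markovify and not how its internals are organized; it is a simplified stand-in with names I made up (`build_chain`, `generate`). With markovify itself, the rough equivalent is building a `markovify.Text` model from the descriptions and calling `make_sentence()` on it.

```python
import random
from collections import defaultdict


def build_chain(texts):
    """Map each word to the list of words that followed it in the corpus."""
    chain = defaultdict(list)
    starts = []
    for text in texts:
        words = text.split()
        if not words:
            continue
        starts.append(words[0])
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain, starts


def generate(chain, starts, rng, max_words=30):
    """Walk the chain from a random opening word until a dead end or the cap."""
    word = rng.choice(starts)
    output = [word]
    while len(output) < max_words and chain.get(word):
        # Words that followed more often appear more often in the list,
        # so a uniform pick already respects the observed frequencies.
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)
```

Usage would look like `chain, starts = build_chain(descriptions)` followed by `generate(chain, starts, random.Random())`. Every adjacent pair of words in the result was seen somewhere in the real descriptions, which is why the output sounds plausible even when it makes no sense.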

Conclusion

Spotify again does such a great job making their data freely available and easy to work with. It was super fun and exciting to play around with this data and look for trends. With the ever-growing library of podcasts there is certainly much more exploring to do. I am wondering what kinds of topics I can search next and how to create value from this data. One really exciting feature I haven’t gotten to play around with yet is the episode resume point, which tells you how far into an episode the user has listened. There seem to be a lot of possibilities there.

Thanks for reading! Feel free to share, and if you want to see the notebook and code please just head on over to my GitHub!

Links

Sam Brady’s GitHub Repo

Spotify API Developer’s Page

Spotify Podcast Dataset and TREC Challenge 2020

Spotify Podcast Endpoints

spotipy Python Package

markovify Python Package