MLOps on GCP: Upcoming Local Shows Playlist (DataOps)
This will be Part 1 of a tutorial on how to create a simple Flask web app, which will ultimately help a user create a playlist on their Spotify account containing the most popular songs from artists that will be playing in their area in the upcoming months. Part 1 will set up a simple ETL process on GCP, focusing on pulling data from the Spotify and SeatGeek APIs, combining the data, and then uploading and automating the process using App Engine, Cloud Scheduler, Cloud Storage, and Secret Manager.
Please find the code in my GitHub repo: https://github.com/sam-tritto/upcoming-local-shows-playlist
OUTLINE
Obtaining a Spotify API Key
Obtaining a SeatGeek API Key
Creating a Project on GCP
Storing Credentials in the Secret Manager
Setting up a Google Storage Bucket
Creating the Other Components
Building the Python Script
Setting Up an App Engine Service Account & Granting IAM Permissions
Setting Up Google Cloud SDK
Deploying an App in App Engine
Automating with Cloud Scheduler
Viewing the Logs
Obtaining a Spotify API Key
The first step for this project is setting up a developer account with Spotify, creating an App, and obtaining the credentials. I'll keep things at a high level for brevity, but you can sign up and follow their documentation here: https://developer.spotify.com
Once you have an account, you'll need to copy down your client_id and client_secret, which we will later upload and store on GCP.
Obtaining a SeatGeek API Key
Next you'll need to follow a similar process for SeatGeek by again setting up an account, creating an App, and obtaining the credentials. You can sign up and follow their documentation here: https://platform.seatgeek.com
Once you have an account, you'll need to copy down your client_id and client_secret, which we will also upload and store on GCP later.
Creating a Project on GCP
The final account you'll need to set up will be on GCP. This step is a little more involved, as you'll need to not only create an account but also set up a billing account. You can find the GCP console here: https://console.cloud.google.com
Once you create a project, navigate to the Dashboard and on the right-hand side you should see the Billing tab. Click there to set up a billing account. If it's your first time using GCP, you'll be blessed with a $300 credit to explore their products. If it's not your first time, create a new gmail account and get those credits! I'm kidding. This project will be relatively cheap to run and will use GCP's free tier products where available. As a rough estimate, the whole weekly ETL process costs me around $0.10.
Storing Credentials in the Secret Manager
Now, rather than hard code your credentials into the python script, we can use GCP's Secret Manager and then programmatically access them in a secure manner. Navigate to the Secret Manager and click on + Create Secret. You can create a unique name for each of the credentials and enter their values (no parentheses). You'll need at least the 4 below.
Setting up a Google Storage Bucket
Now we'll need to create a bucket to hold all of our files. Aside from the data we'll pull from SeatGeek and Spotify, GCP will generate some metadata from App Engine and place those files here. Navigate to Cloud Storage and click Create. Give it a unique name and choose the default options. The code that I've written will read the data from this bucket, append new data, and then overwrite the data in this same bucket. Since it first reads the data from the bucket, I suggest uploading a blank .csv file now.
Creating the Other Components
We'll utilize a few helper files for our processes. I've put the 4 below into a folder named etl since I'll also be using a Flask app and a similar process for the web app ( and the files will have the same names ).
First I'll create a requirements.txt file to pin the packages and versions needed for the python script.
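As a rough sketch, it might look something like this (the package list and version pins are assumptions; pin whatever you actually develop against):

```
# requirements.txt (illustrative sketch; pin the versions you actually use)
Flask==2.2.5
gunicorn==21.2.0
spotipy==2.23.0
thefuzz==0.20.0
pandas==2.0.3
requests==2.31.0
google-cloud-storage==2.14.0
google-cloud-secret-manager==2.16.0
```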
For the ETL app we'll create an app.yaml file with some simple instructions. If you've never made a YAML file before, you can just open a simple text editor and save it with the .yaml extension.
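A minimal sketch of what that might contain (the runtime, entrypoint, and instance class here are assumptions; match them to your own setup):

```yaml
# app.yaml (illustrative sketch)
runtime: python39                        # assumption: use the Python runtime you developed against
instance_class: F1                       # smallest instance class, keeps the app near the free tier
entrypoint: gunicorn -b :$PORT main:app  # assumption: a main.py exposing a Flask app object
```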
And finally, another YAML file for the Cloud Scheduler named cron.yaml. This is a cron job, or a set of instructions on how and when to schedule the app to run.
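Something along these lines (the URL and schedule are assumptions; point it at whichever route in your app kicks off the ETL and pick whatever cadence you like):

```yaml
# cron.yaml (illustrative sketch)
cron:
- description: "weekly upcoming local shows ETL"
  url: /                        # assumption: the route that triggers the ETL
  schedule: every monday 09:00
  timezone: America/New_York
```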
Building the Python Script
Now for the fun stuff. The python script will consist of several parts and perform the bulk of the heavy lifting for this part of the project. All of the steps will be contained inside a main() function, along with several subfunctions. I'll go over those subfunctions first and then go through the rest of the main function where they're used.
search_spotify_tracks()
This function takes an artist's name as a string and then searches Spotify for artists with a similar name. I'm returning 5 matches, and will take the closest matching one later on. This function is wrapped in some rate-limiting logic to handle the case where we hit Spotify's rate limit. I told it to wait 1 minute and then try again, and also to print out helpful statements which I'll be able to see in the logs.
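A rough sketch of that retry logic with spotipy (the function and variable names are mine, not necessarily what's in the repo):

```python
import time

import spotipy


def search_spotify_artists(sp: spotipy.Spotify, artist_name: str, limit: int = 5) -> list:
    """Search Spotify for artists with a similar name, waiting out rate limits."""
    while True:
        try:
            results = sp.search(q=f"artist:{artist_name}", type="artist", limit=limit)
            return results.get("artists", {}).get("items", [])
        except spotipy.SpotifyException as e:
            if e.http_status == 429:  # hit the rate limit -- wait a minute and retry
                print(f"Rate limited while searching for {artist_name}, sleeping 60s...")
                time.sleep(60)
            else:
                print(f"Spotify error for {artist_name}: {e}")
                return []
```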
The next part of this function goes through the results of the API request, first checking to see if there are in fact results to comb through. Next, it uses the partial_ratio() function from the fuzz package (formerly fuzzywuzzy), which performs partial fuzzy string matching between artist names. Basically it's trying to match the artist name from the SeatGeek API with the artist name from the Spotify API. It might be worth mentioning that the SeatGeek API actually contains the Spotify URI for each artist, so that might be a good candidate to use rather than string matching. Here I've opted to do the string matching, as I was unsure of the coverage of the Spotify URIs and I'd also like to account for instances where bands are represented differently on each platform. For instance, one API might label an artist as Bruce Springsteen while the other labels them as Bruce Springsteen & the E-Street Band.

Now what's really cool is that the fuzz package has actually been taken over and is maintained by SeatGeek, which is perfect as they understand the need for this type of matching best. Its partial_ratio() function performs fuzzy string matching using Levenshtein distance on sub-strings, so if the artist's name appears somewhere inside the larger string it will receive a higher score. The function returns a score from 0 to 100.
You can find more on the fuzz package on their GitHub repo here: https://github.com/seatgeek/thefuzz
I'll then use that string matching on each of the 5 closest matches returned from Spotify, and only accept the artist if their fuzz score is greater than 80. After that I just extract the relevant info from the results. Most importantly, I'll use the returned Spotify artist_id next to search for each artist's top 10 tracks.
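As an illustration of that scoring step (the helper name and structure here are mine):

```python
from thefuzz import fuzz


def best_artist_match(seatgeek_name: str, spotify_artists: list, threshold: int = 80):
    """Return the Spotify artist result that best matches the SeatGeek name,
    or None if nothing clears the fuzz score threshold."""
    best_artist, best_score = None, 0
    for artist in spotify_artists:
        score = fuzz.partial_ratio(seatgeek_name.lower(), artist["name"].lower())
        if score > best_score:
            best_artist, best_score = artist, score
    return best_artist if best_score > threshold else None


# e.g. "Bruce Springsteen" vs "Bruce Springsteen & The E Street Band" scores well
# above 80, so the two listings get matched.
```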
I'll use the returned artist_id here with the artist_top_tracks() function from spotipy, which hits the Spotify endpoint that returns the top 10 tracks for each artist. After a quick check to see if the results returned are valid, I'll simply iterate through the top 10 tracks, passing them into a subfunction, get_track_audio_features(), which I'll go over next. The data returned from this function contains audio features for each track as a dictionary. From there I'll add or update the artist info from before into that same dictionary. Then I can append this dictionary to a list. If there are no tracks returned, I'll print the artist's name into the logs.
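The top-tracks call itself is essentially a one-liner with spotipy; roughly (the wrapper name is illustrative):

```python
import spotipy


def get_artist_top_tracks(sp: spotipy.Spotify, artist_id: str) -> list:
    """Fetch an artist's top tracks from Spotify (defaults to the US market)."""
    results = sp.artist_top_tracks(artist_id)
    return results.get("tracks", [])
```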
get_track_audio_features()
Short and sweet, this function calls the audio_features() function from spotipy, hitting the Spotify endpoint that returns the audio features for each track.
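Roughly (again, the wrapper name is illustrative):

```python
import spotipy


def get_track_audio_features(sp: spotipy.Spotify, track_id: str) -> dict:
    """Return the audio features (danceability, energy, tempo, etc.) for one track."""
    features = sp.audio_features([track_id])
    return features[0] if features and features[0] else {}
```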
get_secret_from_secret_manager()
This will be an important one to get right. Since I've stored the secrets in GCP's Secret Manager, I'll need a way to safely and programmatically access them here in the script. Setting up and using the SecretManagerServiceClient() is pretty easy, but it expects the name of the secret to be passed in a very specific format (the full resource path), not simply the name of the secret. This function will access the secrets (after we've granted the permissions) and return them as a string.
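Under the hood it looks roughly like this, where the project and secret IDs are whatever you chose when creating the secrets:

```python
from google.cloud import secretmanager


def get_secret_from_secret_manager(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Fetch a secret from GCP Secret Manager and return its value as a string."""
    client = secretmanager.SecretManagerServiceClient()
    # The client expects the full resource path, not just the secret's name
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")
```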
Accessing the Secrets
Those were all of my helper functions; now we can look at the other parts of the script. First, I'll access those secrets and use them to build the credentials for each API. Once that handshake is taken care of, we can call any function from either API.
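For the Spotify side, spotipy's client-credentials flow is enough since we only read public data. Here's a sketch, assuming the secret IDs below match what you created in Secret Manager (and using the helper sketched above):

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

PROJECT_ID = "your-gcp-project-id"  # assumption: replace with your project ID

# Assumption: these secret IDs match what you created in Secret Manager
spotify_client_id = get_secret_from_secret_manager(PROJECT_ID, "SPOTIFY_CLIENT_ID")
spotify_client_secret = get_secret_from_secret_manager(PROJECT_ID, "SPOTIFY_CLIENT_SECRET")
seatgeek_client_id = get_secret_from_secret_manager(PROJECT_ID, "SEATGEEK_CLIENT_ID")
seatgeek_client_secret = get_secret_from_secret_manager(PROJECT_ID, "SEATGEEK_CLIENT_SECRET")

# Client-credentials flow: no user login needed since we only read public data
sp = spotipy.Spotify(
    auth_manager=SpotifyClientCredentials(
        client_id=spotify_client_id,
        client_secret=spotify_client_secret,
    )
)
```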
Getting Data from my Google Cloud Storage Bucket
Now we can use the Client() function from GCP's Cloud Storage library to retrieve the csv that is stored there. It takes a few steps to get from the bucket, to the blob, to a csv, and finally to a pandas dataframe.
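A minimal sketch of that read, using a hypothetical bucket and file name:

```python
import io

import pandas as pd
from google.cloud import storage


def read_csv_from_bucket(bucket_name: str, blob_name: str) -> pd.DataFrame:
    """Download a CSV from Cloud Storage and load it into a DataFrame."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    return pd.read_csv(io.BytesIO(blob.download_as_bytes()))


existing_df = read_csv_from_bucket("upcoming-shows-bucket", "shows.csv")  # hypothetical names
```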
Pulling Data from the SeatGeek API
SeatGeek has a few different ways you can search events. The easiest and most comprehensive approach here is to iterate through a list of abbreviated state codes, pulling all future concert dates for each state one at a time. There are only 1001 cities in their database, but the results will contain concerts in nearby cities.
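Something along these lines with the requests library (the query parameters are based on SeatGeek's events endpoint; double-check the names against their current API reference):

```python
import requests

STATES = ["NY", "NJ", "CT"]  # ...whichever state codes you want to cover


def get_seatgeek_concerts(client_id: str, client_secret: str) -> list:
    """Pull upcoming concert events from SeatGeek, one state at a time."""
    events = []
    for state in STATES:
        params = {
            "client_id": client_id,
            "client_secret": client_secret,
            "type": "concert",
            "venue.state": state,
            "per_page": 100,
        }
        resp = requests.get("https://api.seatgeek.com/2/events", params=params)
        resp.raise_for_status()
        events.extend(resp.json().get("events", []))
    return events
```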
Expanding the Data for Multiple Artists per Show
Now, each event returned has the main headliner as well as any supporting acts. I'll expand the data to include a separate row for each supporting artist as well.
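One way to do that expansion, where the field names are assumptions based on SeatGeek's event payload:

```python
import pandas as pd

rows = []
for event in events:  # `events` as returned by get_seatgeek_concerts() above
    for performer in event.get("performers", []):
        rows.append({
            "id": event["id"],  # SeatGeek event id, used later to spot old records
            "event_title": event.get("title"),
            "datetime_local": event.get("datetime_local"),
            "venue": event.get("venue", {}).get("name"),
            "state": event.get("venue", {}).get("state"),
            "artist": performer.get("name"),
        })

events_df = pd.DataFrame(rows)
```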
Checking for Old Records
Now, we don't need to bother the Spotify API for concert and audio track data if we already have it in our existing csv. So before we go making the API calls, I'll subset the dataframe into two: shows we already have concert and audio track data for, and those we don't. It's common for touring artists to keep adding tour dates. We can check by using the id column from SeatGeek.
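A sketch of that split, keyed on SeatGeek's id column (column names are assumptions, using the dataframes from the sketches above):

```python
known_ids = set(existing_df["id"])

already_have = events_df[events_df["id"].isin(known_ids)]   # reuse stored data
needs_lookup = events_df[~events_df["id"].isin(known_ids)]  # hit the Spotify API
```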
Pulling Data from the Spotify API
We've already filtered out the concerts we have data for; now we'll also check whether we already have audio track data for each artist in our existing dataframe. If we do, we'll simply grab the data from our old records. If we don't, we'll use the search_spotify_tracks() function to pull that data from the API for us. If we can't find any data, we'll print an error message to the logs.
Combine Records
Now we can combine all the data we have, old and new. Then drop duplicate records. Finally we will subset the data to be only future concert dates. This will help us stay in the Free Tier on GCP, but you could easily omit this to build a historical data set.
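Roughly, with pandas (the datetime column name is an assumption):

```python
import pandas as pd

# existing_df: what we read from the bucket; new_df: the freshly pulled rows from above
combined_df = pd.concat([existing_df, new_df], ignore_index=True).drop_duplicates()

# Keep only future show dates so the file stays small (and cheap to store and process)
combined_df["datetime_local"] = pd.to_datetime(combined_df["datetime_local"])
combined_df = combined_df[combined_df["datetime_local"] >= pd.Timestamp.now()]
```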
Uploading to GCP Storage Bucket
The last step in our python script will be to upload and overwrite the existing csv stored in Cloud Storage. It's similar to how we read it in.
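Mirroring the read, a sketch of the overwrite (bucket and file names are hypothetical):

```python
import pandas as pd
from google.cloud import storage


def upload_csv_to_bucket(df: pd.DataFrame, bucket_name: str, blob_name: str) -> None:
    """Overwrite the CSV in Cloud Storage with the refreshed DataFrame."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")


upload_csv_to_bucket(combined_df, "upcoming-shows-bucket", "shows.csv")  # hypothetical names
```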
Setting Up an App Engine Service Account & Granting IAM Permissions
Next, we'll need to create an App Engine instance even though we're not ready to deploy anything just yet. The reason for this is that when you create an App Engine app, you also create a service account which we'll need to grant permissions to. Navigate to GCP's App Engine page and click CREATE APPLICATION. Choose a location for your instance. Then, inside Identity and API Access, choose NEW SERVICE ACCOUNT.
Enter a name and copy down the associated email address for the service account.
In the next step you'll be asked if you'd like to grant IAM permissions. It makes sense to do this now, but you can always navigate to GCP's IAM & Admin page and add these later on. As a matter of fact, after adding these, you should navigate there and double check that the permissions have been granted. The roles you'll need to grant are the following:
Secret Manager: Secret Manager Viewer
Secret Manager: Secret Manager Secret Accessor
Cloud Storage: Storage Object Admin
Cloud Scheduler: Cloud Scheduler Job Runner
Setting Up Google Cloud SDK
Once you grant the permissions and GCP spins up your App Engine instance, you'll be asked to install the Google Cloud SDK. Click that button and begin the install process. I also like to add it to my PATH so that the commands are shorter and I won't have to type out the full path to where the SDK is installed every time I run a command.
You can find and follow the documentation here: https://cloud.google.com/sdk/docs/install-sdk
Choose your OS
Download and extract the file to your root directory
Start up gcloud in the terminal: gcloud init
Login with your password
Copy and Paste the code from your pop up web browser into the terminal
Choose the project id related to this project
Add gcloud to your path (edit this to be the path where you extracted the SDK during install; I'm on a Mac, so the path will look different if you're on Windows): export PATH="/Users/<USER>/google-cloud-sdk/bin:$PATH"
Change directories into the etl folder with all of the files: cd etl
Deploying the App in App Engine
Now all the components are in place and all that's left to do is push the shiny red button.
Here's where all of the magic happens, in the terminal type: gcloud app deploy
Then choose: Y
That's it. The process should spin for a minute while it uploads the new versions of your files into your Cloud Storage bucket. If you ever need to make changes to the files, simply rerun this command and your app will get updated in GCP. Another alternative would be to use something like GitHub Actions here to automate the build and deploy steps every time you push changes to your GitHub repo - pretty cool stuff.
Automating with Cloud Scheduler
The final step in our ETL process will be automating the job to run weekly. We can do this easily with Cloud Scheduler, also from the command line, since we've created the YAML file in advance with the instructions for when we'd like the job to run.
In the same terminal type: gcloud app deploy cron.yaml
Then choose: Y
You should get a similar message as before. Now you can navigate to the Cloud Scheduler page on GCP. There is an APP ENGINE CRON JOBS tab; click it.
On the far right you'll see three dots which will allow you to force-run your app. A good practice is to start with a simple, less intensive test script before switching over to the production script that reads and processes all of the data. To get this up and running, I used a sample csv file with only 10 rows of data before switching over.
Viewing the Logs
If the process ran successfully you can navigate to GCP's Logs Explorer page to view the process in real time. You'll be able to see warnings, errors, as well as any print statements you placed in your python file. Here's a sample output from the last run:
Of course, you should also navigate to Cloud Storage now and inspect the csv file for any unintended issues that may have come up. But that's it. The process should run automatically now, and we'll be updating our data with the most recent upcoming local show data each week. Part 2 of this project will create a simple web interface, again with Flask and App Engine, where a user can interact with this data set to create a playlist of upcoming local shows on their Spotify account. Stay tuned!