How to Write a Best Selling Book in 2020 with the New York Times API


In this article I will show you how to obtain and create a dataset from the New York Times API containing books from one of their many weekly “Bestsellers” lists, drawn from over a decade’s worth of publications. In the dataset you will find the descriptions of all of the books that made it onto the bestsellers lists, as well as their titles and authors’ names. After some human data analysis, I will use a Machine Learning technique for text generation (an ensemble of Markov Chains) to create a description of what a new bestselling book might look like in 2020.

The New York Times does not promote or endorse me or any third party, or the causes, ideas, Web sites, products or services mentioned here.

Step 1 – Create a New York Times Developer Account and Obtain an API Key

First things first: creating a New York Times Developer Account will allow you to pull loads of data right from their website. You can access news articles and much more, but today we are after data about books! Follow the link below to sign up.

Here is an image of the available lists that are updated WEEKLY. There are many more that are updated MONTHLY, but I have set up the code to extract a WEEKLY dataset. You will need the dates in these columns later on. For this tutorial we will focus on the Combined Print & E-book Fiction list as well as the Young Adult Hardcover Fiction list.
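If you would rather query the available lists than read them off an image, the Books API also exposes a /lists/names.json overview endpoint that reports each list’s name and update frequency. Here’s a minimal sketch (the API_KEY placeholder is yours to fill in with the key from Step 1):

```python
import requests

API_KEY = "YOUR_NYT_API_KEY"  # replace with your developer key

# The Books API overview endpoint lists every bestseller list along
# with its update frequency (WEEKLY or MONTHLY).
names_url = "https://api.nytimes.com/svc/books/v3/lists/names.json"

def fetch_list_names(api_key):
    """Return [(list_name, updated), ...] for every available list."""
    resp = requests.get(names_url, params={"api-key": api_key})
    resp.raise_for_status()
    return [(e["list_name"], e["updated"]) for e in resp.json()["results"]]

# Example (requires a valid key):
# for name, freq in fetch_list_names(API_KEY):
#     print(name, "-", freq)
```

The "updated" field is how you can tell the weekly lists apart from the monthly ones before committing to a pull.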

Step 2 – Make the API Call and Generate a DataFrame Full of Book Descriptions

Here’s a snippet of code I wrote to make calls to the API in order to create a custom dataframe full of bestseller info. There’s a lot more inside my notebook on my GitHub page, including a second notebook where I tackle a different bestseller list… the link is at the bottom!

# api call: create and append dataframe
# (assumes requests, pandas as pd, and time are imported above, and that
#  list_name, api_key, and chunks of 10 dates are already defined)

api_url_start = ''                                          # generic api base url (add your own)

api_url_end = ('/' + list_name + '.json?api-key=')          # add in list name

genre_df = pd.DataFrame()                                   # create empty dataframe

for chunk in chunks:                                        # pull 10 dates at a time from total dates

    for date in chunk:                                      # pull one date at a time from the 10

        url = (api_url_start + date + api_url_end)          # put together api url parts

        response = requests.get(url + api_key)              # pull from api with key

        genre = response.json()                             # parse the json response

        genre = pd.DataFrame.from_dict(genre)               # create dataframe from json

        books_df = pd.DataFrame.from_dict(genre.iloc[1, 4]) # select only book info from dataframe

        books_df['bestsellers-date'] = genre.iloc[9, 4]     # create and join column for bestsellers dates

        genre_df = genre_df.append(books_df, ignore_index=True, sort=True)  # append the 15 books

    time.sleep(70)      # wait 70 seconds to not max out the api call limit, then repeat for the next 10 dates

# select only relevant columns plus bestsellers date (you may wish to add or delete some)
genre_df = genre_df[['bestsellers-date', 'author', 'description', 'price', 'publisher',
                     'rank', 'rank_last_week', 'title', 'weeks_on_list']]

genre_df                                                    # show dataframe
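The loop assumes a chunks variable holding groups of 10 publication dates. Since the list is published weekly, one way to build it (a sketch; the start and end dates here are placeholders, not the notebook’s actual values) is a weekly pandas date_range split into tens:

```python
import pandas as pd

# weekly publication dates for the list (placeholder start/end dates)
dates = pd.date_range(start="2011-02-13", end="2020-01-05", freq="7D")
dates = [d.strftime("%Y-%m-%d") for d in dates]

# split the dates into chunks of 10 so the loop can pause between batches
chunks = [dates[i:i + 10] for i in range(0, len(dates), 10)]

print(len(dates), "dates in", len(chunks), "chunks")
```

Pausing between chunks of 10 keeps you well under the API’s rate limit while still letting the pull finish in a reasonable time.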

Once the pull is completed you should end up with a dataframe that contains information on the bestsellers list books. In order to get ideas for a book, I wanted to read all of the bestseller descriptions and then either come up with my own or have a text generator come up with some for me! I decided the Combined Print and E-Book Fiction list would be the best place to start. Here’s a sample of the dataframe below… 8760 rows and 9 columns, not bad. 

Step 3 – Some EDA and Visualizations

The price column was all null, which was a bit disappointing, but that didn’t stop me. I was curious how many books had been on the list for at least 52 weeks, and also which authors were the most prominent.
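Both questions reduce to a filter and a value count on the dataframe. Here’s a small sketch of that EDA; the toy frame below stands in for the genre_df from Step 2 (the titles are real bestsellers, but the weeks_on_list numbers are made up for illustration):

```python
import pandas as pd

# toy stand-in for genre_df so the snippet is self-contained
genre_df = pd.DataFrame({
    "title": ["WHERE THE CRAWDADS SING", "BECOMING", "EDUCATED", "IT"],
    "author": ["Delia Owens", "Michelle Obama", "Tara Westover", "Stephen King"],
    "weeks_on_list": [60, 55, 48, 20],
})

# books that lasted at least a full year on the list
year_long = genre_df[genre_df["weeks_on_list"] >= 52]["title"].unique()

# most prominent authors by number of appearances
top_authors = genre_df["author"].value_counts().head(10)

print(year_long)
print(top_authors)
```

On the real data, where every book appears once per week it charted, value_counts on the author column directly ranks authors by total weeks on the list.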

Step 4 – Using Wordclouds to Visualize the Descriptions and Generate One of My Own

The best way I could think of to see all of the descriptions at once was to create a wordcloud. Usually they aren’t that interesting, but here they were an easy way to create value from this dataset: they let me quickly visualize the most common descriptive words and then brainstorm my own idea for a book. How’s that for inspiration? So I created a quick wordcloud and then removed some of the less descriptive words; words like is, the, and, and so on. Here’s the result.
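Before rendering the image, it helps to check which words will dominate it. The idea can be sketched as a quick frequency count with common stop words removed (the two description strings below are toy stand-ins; in the notebook this text comes from the description column of the dataframe):

```python
import re
from collections import Counter

# toy stand-in for " ".join(genre_df["description"].dropna())
descriptions = [
    "A detective investigates a mysterious death in New York.",
    "A woman discovers a secret life and a mysterious sister.",
]

# low-value words to drop before counting
stopwords = {"a", "an", "and", "the", "is", "in", "of", "her", "his"}

words = re.findall(r"[a-z']+", " ".join(descriptions).lower())
counts = Counter(w for w in words if w not in stopwords)

print(counts.most_common(5))
```

The same stop-word set can be passed to the wordcloud package via WordCloud(stopwords=...) when generating the actual image, and a mask image gives the cloud its book shape.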

On their own, wordclouds are… well, not that interesting. So I tried to make this one a bit more appealing by putting it into the shape of a book (which is a bit hard to tell, I know).

So I stared at this book for a few minutes and then tried to generate my own best selling book idea. 

A New York woman and a detective fall in love on a mission to investigate the death and disappearance of her husband who may have had a secret life with her mysterious sister and they discover that she is a serial killer... or a vampire.

Not too bad, I definitely could see the appeal here. 

Step 5 – Using Machine Learning to Create a New Bestseller Book Description

OK… that was fun, but let’s give Machine Learning a try at this. After experimenting with several methods, like LSTM Recurrent Neural Networks, I decided that Markov Chains would be a better approach. The LSTMs took forever, maybe something like 20 hours to run on my MacBook Pro, and almost fried my board (not really, but it did get super hot!). While some of the predictions were pretty good, eventually the words began to turn into gibberish. The LSTMs I was using worked by separating each individual character and then predicting the next probable character in a sequence of strings. The Markov Chain works similarly, but the model I ended up using looked at the probability of each word following the preceding word. Additionally, I was able to find a package called markovify that handled a lot of the coding and still allowed me to play with some parameters; it gave me some really interesting results. Follow the links for more info.

For this experiment I used an ensemble of 5 Markov Chains, each of equal weight, to build my model. Playing around with the state_size parameter, I noticed that 3 gave the best results: it looks at 3 words at a time, and many of the descriptive phrases in the descriptions work well with this. Lastly, the max_overlap_ratio parameter let me limit how much of a generated sentence could overlap verbatim with the descriptions it was generated from. Here are ten generated descriptions of bestselling books.
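To see why a word-level chain produces more coherent text than a character-level LSTM on a small dataset, here is a minimal from-scratch sketch of the idea (the notebook itself uses markovify, which adds sentence detection, model combination via markovify.combine, and the max_overlap_ratio check; the corpus and state size of 1 here are simplifications for brevity):

```python
import random
from collections import defaultdict

# tiny toy corpus standing in for the bestseller descriptions
corpus = (
    "a detective falls in love on a mission . "
    "a woman discovers a secret life . "
    "a detective discovers a mysterious sister ."
)

# map each word to the words observed to follow it;
# duplicates in the lists encode the transition probabilities
transitions = defaultdict(list)
words = corpus.split()
for current, nxt in zip(words, words[1:]):
    transitions[current].append(nxt)

def generate(start, length, seed=0):
    """Walk the chain from `start` for up to `length` steps."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        followers = transitions.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

print(generate("a", 8))
```

With markovify, an ensemble of five equally weighted chains corresponds roughly to markovify.combine([m1, m2, m3, m4, m5], [1, 1, 1, 1, 1]), built from models created with markovify.Text(text, state_size=3).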

Some of these were very strange. 


It was really fun to play around with the text generators to try to get them to come up with coherent sentences. The hardest part was trying to get them to come up with original story lines. The dataset here was pretty small for something like a text generator, so I wasn’t expecting too much. There are so many more things that could be done to get better machine-generated sentences; I could have trained a model on social media data, Reddit data, or generic scraped web data. But honestly, this is one of those cases where it was just so much easier for me to look at the wordcloud and write a description of my own, plus I really loved the creativity of that. In the end I had a blast, and I think I may be onto something with the description I generated… now let me go find a pen!

Thanks for reading. Check the links below for more information, and if you’re interested in the code you can always go to my notebook on GitHub. I also repeated the same process for a different bestseller list, Young Adult Hardcover Fiction, in case you are interested in that… here’s the wordcloud from that part of the project!

Get in touch at: