Twitter's 50 Trending Topics Sentiment Analysis with GCP

 

Training, Automating, and Deploying a Deep Neural Network on the Google Cloud Platform... (for free!)

There are already so many tutorials on sentiment analysis using Twitter, so before I start, let me explain how this one is different from the rest.


Recently I have been playing around with GCP (the Google Cloud Platform) and I thought it might be the perfect time to apply what I learned in the last specialization I took from DeepLearning.AI, Natural Language Processing, where we built a deep neural network for sentiment analysis using Trax. The idea was to pull the most popular tweets and perform sentiment analysis on them.

Twitter is a great playground for sentiment models, as tweets aren't always what they seem. There is a lot of sarcasm and shade, and people use the platform to express some very real and complex emotions. As you can see below, a deep neural network built in Trax can successfully predict even difficult tweets!


2. Tools and Libraries

Google Colab Notebook – to host the code and to train, test, and evaluate the model.

Tweepy – Python package to access the Twitter API and pull tweets.

Twitter's API – sign up to apply for access and key codes.

Emojis – Python package to convert emojis into text and vice versa.

Trax – to build the deep neural network.

Google Cloud Compute Engine – a virtual machine in the cloud that hosts our script.

Google Cloud Storage – a bucket in the cloud to store our data.

Google Cloud Secret Manager – a place in the cloud to store our secrets and keys.

Google Data Studio – an interactive dashboard studio.

3. Setup on Google Colab and Twitter API

Here is an excellent tutorial on how to set up your Colab notebook to access the Twitter API… Twitter data collection tutorial using Python. It will also walk you through getting your Twitter API codes. Once you are approved for Twitter API access and receive your codes, you should create a json file to store your secret codes in, rather than hosting them in the cloud or on GitHub. You can do this in your text editor (make sure to save it as a json file) and it should look something like this:
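The original screenshot of the file isn't shown here, but a minimal secrets.json would look roughly like this, using the same key names the code reads later (the values are placeholders for your own codes):

 {
     "api_key": "XXXXXXXXXXXXXXXX",
     "api_secret": "XXXXXXXXXXXXXXXX",
     "access_token": "XXXXXXXXXXXXXXXX",
     "access_token_secret": "XXXXXXXXXXXXXXXX",
     "bearer_token": "XXXXXXXXXXXXXXXX"
 }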


This is only temporary while we train the model. Eventually we will use Google Cloud Platform’s Secret Manager to store our secrets.

Here is a link to my Google Colab Notebook where I will train and test my code on the free GPU… Google Colab Notebook – Train and Test

Or you can just go to my GitHub Repo for this project… GitHub Repo – Twitter Sentiment


 

4. Getting the Training Data

Garbage in, garbage out. Once the notebook is set up and the libraries are installed, you can begin importing data. Here I combine two different datasets to get 1.61 million tweets, half of which are classified as positive and half as negative. For a neural network it is important that the training data be balanced, and the size of the training set makes sure that our corpus of words is comprehensive enough to handle new tweets the model hasn't yet seen. Since the dataset is quite large I went with a 90/10 split, where 90% of the data was used for training and the remaining 10% for testing and evaluation. I played around with these numbers and found this split to be the best. A quick note… the Sentiment140 data labels negative and positive tweets as 0 and 4, so you should write a quick mapping function to change the 4s to 1s for the positive data.


 fmap = {4: 1, 0: 0}
 sentiment140_tweets['target'] = sentiment140_tweets['target'].map(fmap)
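If you want a concrete starting point, here is a rough sketch of how the Sentiment140 CSV can be loaded, relabeled, and split 90/10. The file name and column layout are assumptions based on the standard Sentiment140 download, so adjust them to match your copy:

 import pandas as pd

 # Assumed file name and column layout of the standard Sentiment140 download
 cols = ['target', 'id', 'date', 'flag', 'user', 'text']
 sentiment140_tweets = pd.read_csv('training.1600000.processed.noemoticon.csv',
                                   encoding='latin-1', names=cols)

 # Map the 0/4 labels to 0/1
 fmap = {4: 1, 0: 0}
 sentiment140_tweets['target'] = sentiment140_tweets['target'].map(fmap)
 # (the 10,000 NLTK twitter_samples tweets can be appended here the same way)

 # Shuffle, then split 90/10 into training and testing/evaluation sets
 shuffled = sentiment140_tweets.sample(frac=1, random_state=42).reset_index(drop=True)
 split = int(0.9 * len(shuffled))
 train_x, train_y = shuffled['text'][:split].tolist(), shuffled['target'][:split].tolist()
 val_x, val_y = shuffled['text'][split:].tolist(), shuffled['target'][split:].tolist()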

 You can find more about the datasets here:

NLTK twitter samples (#41) 10,000 tweets

Sentiment140 1.6 million tweets

5. Pre-Processing the Tweets

Next up is to process all of the tweets in a way that removes all of the extraneous parts. Below is a function you can work with; you can also find a good variety with a web search if this doesn't meet your particular needs. Normally I would remove the stopwords (the most common words: "a", "the", "and", etc.), but I found I got higher model performance with the stopwords included. Additionally, I use the Emojis package here to convert the emojis into text descriptions and add these to the training corpus of words. So for instance the 😁 emoji would be converted to a text description such as "grin", and we capture the sentiment of that word, which is quite positive. The words are then tokenized and stemmed so that only the root of each word remains, e.g. "running" becomes "run". Here's the pre-processing function:


 import re
 import string
 import emojis
 from nltk.tokenize import TweetTokenizer
 from nltk.stem import PorterStemmer

 def process_tweet(tweet):

     # remove old style retweet text "RT"
     new_tweet = re.sub(r'^RT[\s]+', '', tweet)

     # decode emojis to text descriptions
     new_tweet = emojis.decode(new_tweet)

     # remove hyperlinks
     new_tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', new_tweet)
     new_tweet = re.sub(r'http\S+', '', new_tweet)

     # remove hashtag symbols
     new_tweet = re.sub(r'#', '', new_tweet)

     # remove underscores
     new_tweet = re.sub(r'_', '', new_tweet)

     # remove all numbers
     new_tweet = re.sub(r'[0-9]', '', new_tweet)

     # remove usernames
     new_tweet = re.sub(r'@[^\s]+', '', new_tweet)

     # remove punctuation even in the middle of a string "in.the.middle"
     new_tweet = re.sub(r'[^\w\s]', ' ', new_tweet)

     # instantiate tokenizer class
     tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

     # tokenize the tweet
     tweet_tokens = tokenizer.tokenize(new_tweet)

     # drop any remaining punctuation tokens
     tweets_clean = []
     for word in tweet_tokens:
         if word not in string.punctuation:
             tweets_clean.append(word)

     # instantiate stemming class
     stemmer = PorterStemmer()

     # stem each word down to its root
     tweets_stem = []
     for word in tweets_clean:
         stem_word = stemmer.stem(word)
         tweets_stem.append(stem_word)

     return tweets_stem

 And here's an example of a tweet before and after processing:
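The original screenshot isn't reproduced here, but a quick illustrative call shows the general idea (the sample tweet and the exact output tokens are just an example; results depend on your tokenizer and stemmer versions):

 sample = "RT @friend: Loving the new update!!! 😁 https://t.co/xyz #excited"
 print(process_tweet(sample))
 # roughly: ['love', 'the', 'new', 'updat', 'grin', 'excit']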


 6. Building the Vocabulary and a Tweet to Tensor Function

After each word is processed, the next step is to create a vocabulary dictionary where each word gets a unique number identifier. This makes it possible to turn a tweet into a tensor of numbers. Make sure to include in the vocabulary "__PAD__" for the padding of a tensor, "__</e>__" to mark the end of a line, and "__UNK__" for words that are unknown to the training data. Also, it will be very important to save the vocabulary as a json file to use later in the cloud.


 import json

 # start with the pad, end-of-line, and unknown tokens
 Vocab = {'__PAD__': 0, '__</e>__': 1, '__UNK__': 2}

 # note that we build the vocab using only the training data
 for tweet in train_x:
     processed_tweet = process_tweet(tweet)
     for word in processed_tweet:
         if word not in Vocab:
             Vocab[word] = len(Vocab)

 # save to a json file
 json.dump(Vocab, open("Vocab.json", 'w'))

 And once this is complete we can transform each tweet into a tensor of numbers.

 def tweet_to_tensor(tweet, vocab_dict, unk_token='__UNK__', verbose=False):
     '''
     Input:
         tweet - A string containing a tweet
         vocab_dict - The words dictionary
         unk_token - The special string for unknown tokens
         verbose - Print info during runtime
     Output:
         tensor_l - A python list with the integer IDs of each word in the tweet
     '''

     # Process the tweet into a list of cleaned, stemmed words
     word_l = process_tweet(tweet)

     if verbose:
         print("List of words from the processed tweet:")
         print(word_l)

     # Initialize the list that will contain the unique integer IDs of each word
     tensor_l = []

     # Get the unique integer ID of the __UNK__ token
     unk_ID = vocab_dict[unk_token]

     if verbose:
         print(f"The unique integer ID for the unk_token is {unk_ID}")

     # For each word, look up its unique integer ID;
     # if the word isn't in the vocab dictionary, use the ID for __UNK__ instead.
     for word in word_l:
         word_ID = vocab_dict.get(word, unk_ID)
         tensor_l.append(word_ID)

     return tensor_l

 Which results in something like this:
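Again, the original output isn't shown, but a hypothetical call looks like this (the integer IDs depend entirely on your Vocab, so the values below are made up):

 print(tweet_to_tensor("Loving the new update 😁", vocab_dict=Vocab))
 # e.g. [287, 4, 112, 1391, 960]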

 7. Building a Batch Generator

Now this is quite a large function, and I encourage you to look at the code in my Colab notebook if you are curious to see it. Essentially it creates many subsets of the training data in order to feed the model chunks at a time. Padding is added to the end of each tensor so that all of the tensors in a chunk are the same length. It also yields a weight of 1 for each example, which is used later when computing the weighted accuracy.
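For reference, here is a minimal sketch of the idea (not the notebook's exact implementation): it slices the training data into batches, converts each tweet to a tensor, pads every tensor in the batch to a common length with the __PAD__ ID, and yields a weight of 1 for each example.

 import random
 import numpy as np

 def data_generator(tweets, labels, batch_size, vocab_dict, shuffle=True):
     indices = list(range(len(tweets)))
     while True:
         if shuffle:
             random.shuffle(indices)
         # step through the data one batch at a time, dropping any incomplete final batch
         for start in range(0, len(indices) - batch_size + 1, batch_size):
             batch_idx = indices[start:start + batch_size]
             tensors = [tweet_to_tensor(tweets[i], vocab_dict) for i in batch_idx]
             # pad every tensor to the length of the longest tensor in the batch
             max_len = max(len(t) for t in tensors)
             pad_id = vocab_dict['__PAD__']
             inputs = np.array([t + [pad_id] * (max_len - len(t)) for t in tensors])
             targets = np.array([labels[i] for i in batch_idx])
             example_weights = np.ones(batch_size)  # every example starts with a weight of 1
             yield inputs, targets, example_weights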

8. Building the Trax Deep Neural Network Model

Trax is fairly straightforward to build with. If you are a beginner I would strongly encourage you to build the model with Keras instead, for two reasons… 1) it is more user friendly when coding the model, and more importantly 2) there are far more tutorials to help you with Keras than with Trax. You will still be able to follow along with this tutorial if you use Keras. That being said, this model is a classifier with an Embedding layer, a Mean layer, a Dense layer, and a LogSoftmax layer, all tied together with Trax's Serial combinator.


 from trax import layers as tl

 def classifier(vocab_size=len(Vocab), embedding_dim=256, output_dim=2, mode='train'):

     # create the embedding layer
     embed_layer = tl.Embedding(
         vocab_size=vocab_size,      # size of the vocabulary
         d_feature=embedding_dim)    # embedding dimension

     # create a mean layer to compute an "average" word embedding for the tweet
     mean_layer = tl.Mean(axis=1)

     # create a dense layer, one unit for each output class
     dense_output_layer = tl.Dense(n_units=output_dim)

     # create the log softmax layer (no parameters needed)
     log_softmax_layer = tl.LogSoftmax()

     # use tl.Serial to combine all layers into a classifier
     # of type trax.layers.combinators.Serial
     model = tl.Serial(
         embed_layer,          # embedding layer
         mean_layer,           # mean layer
         dense_output_layer,   # dense output layer
         log_softmax_layer     # log softmax layer
     )

     # return the model
     return model

9. Training the Model

Now that the architecture is in place and all of the tweets have been converted into machine readable form (padded tensors), it is time to train the model. I chose to work with a batch size of 16 tweets, 100 steps per checkpoint, and an Adam learning rate of 0.0001. Feel free to play around with these, but for me they were the best values. It will be important to save the model to your drive, as you will need to upload it to the cloud later.


 import random as rnd
 import trax
 from trax.supervised import training

 batch_size = 16
 rnd.seed(42)

 train_task = training.TrainTask(
     labeled_data=train_generator(batch_size=batch_size, shuffle=True),
     loss_layer=tl.CrossEntropyLoss(),
     optimizer=trax.optimizers.Adam(0.0001),
     n_steps_per_checkpoint=100,
 )

 eval_task = training.EvalTask(
     labeled_data=val_generator(batch_size=batch_size, shuffle=True),
     metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
 )

 model = classifier()

 

And next you should create a function which creates a training loop to train, evaluate, and save/update the model.

 import os

 output_dir = '~/content/model_adam0001_90562_9010_/'
 output_dir_expand = os.path.expanduser(output_dir)   # expand '~' into the full home path for the call below

 def train_model(classifier, train_task, eval_task, n_steps, output_dir):
     '''
     Input:
         classifier - the model you are building
         train_task - Training task
         eval_task - Evaluation task
         n_steps - the number of training steps to run
         output_dir - folder to save your files
     Output:
         training_loop - a trax training Loop, which holds the trained model
     '''

     training_loop = training.Loop(
                                 classifier,              # the learning model
                                 train_task,              # the training task
                                 eval_tasks=eval_task,    # the evaluation task
                                 output_dir=output_dir)   # the output directory

     training_loop.run(n_steps=n_steps)

     # return the training_loop, since it has the model
     return training_loop



 And call the training loop to run the model. Make sure at this point that you have turned on Google Colab's free GPU to speed up the process, otherwise it might take days to train. I chose 90,562 as my n_steps so that the model sees one full epoch of training data:

 batches per epoch = training set size / batch size = (1,610,000 × 0.9) / 16 ≈ 90,562

 training_loop = train_model(model, train_task, eval_task, 90562, output_dir_expand)

 Now is a good time to go take a coffee break!

10. Building an Accuracy Function

Now that the model is trained and saved, it is time to test it, and it would also be nice to know the accuracy. We can build a function to compute the accuracy first, and then build a function that uses it to test the model. Here's the function to calculate the accuracy.


 def compute_accuracy(preds, y, y_weights):
     """
     Input:
         preds: a tensor of shape (dim_batch, output_dim)
         y: a tensor of shape (dim_batch,) with the true labels
         y_weights: an np.ndarray with a weight for each example
     Output:
         accuracy: a float between 0-1
         weighted_num_correct (np.float32): Sum of the weighted correct predictions
         sum_weights (np.float32): Sum of the weights
     """

     # True if the probability of positive sentiment is greater than
     # the probability of negative sentiment, else False
     is_pos = preds[:, 1] > preds[:, 0]

     # convert the array of booleans into an array of np.int32
     is_pos_int = np.array(is_pos, dtype=np.int32)

     # compare the predictions (as int32) with the targets (labels)
     correct = is_pos_int == y

     # count the sum of the weights
     sum_weights = np.sum(y_weights)

     # convert the array of correct predictions (boolean) into an array of np.float32
     correct_float = np.array(correct, dtype=np.float32)

     # multiply each prediction by its corresponding weight
     weighted_correct_float = correct_float * y_weights

     # sum up the weighted correct predictions (the numerator)
     weighted_num_correct = np.sum(weighted_correct_float)

     # divide the weighted correct predictions by the sum of the weights
     accuracy = weighted_num_correct / sum_weights

     return accuracy, weighted_num_correct, sum_weights

 11. Testing the Model

And here’s a function to test our model.

 def test_model(generator, model):
     '''
     Input:
         generator: an iterator instance that provides batches of inputs and targets
         model: a model instance
     Output:
         accuracy: float corresponding to the accuracy
     '''

     accuracy = 0.
     total_num_correct = 0
     total_num_pred = 0

     for batch in generator:

         # retrieve the inputs, the targets (actual labels), and the example weights from the batch
         inputs = batch[0]
         targets = batch[1]
         example_weight = batch[2]

         # make predictions using the inputs
         pred = model(inputs)

         # calculate accuracy for the batch by comparing its predictions and targets
         batch_accuracy, batch_num_correct, batch_num_pred = compute_accuracy(
             preds=pred, y=targets, y_weights=example_weight)

         # update the running totals of correct predictions and predictions made
         total_num_correct += batch_num_correct
         total_num_pred += batch_num_pred

     # calculate accuracy over all examples
     accuracy = total_num_correct / total_num_pred

     return accuracy

 And the results are in… 78.52% accuracy. Not great, but not terrible either! Considering how difficult tweets are to predict, how much sarcasm there is, and how many slang terms get thrown around, I am pretty happy with this accuracy. The only way to know for sure is to test it on live Twitter data and evaluate it with my own two eyes.

 12. Authorize Tweepy to Access the Twitter API

Before we can access Twitter data, we first need to authorize Tweepy using the variables we saved into a json file earlier. You will want to set wait_on_rate_limit=True so that if you hit your rate limit the script resumes after a few minutes instead of erroring out. Eventually we will do this using the Google Cloud Platform's Secret Manager, but for now we are reading a local json file.


 import json
 import tweepy

 # Load Twitter API secrets from an external JSON file
 secrets = json.load(open(r'XXXXXXX(the path to your json file)XXXXXXXX/secrets.json'))

 access_token = secrets['access_token']
 access_token_secret = secrets['access_token_secret']
 api_key = secrets['api_key']
 api_secret = secrets['api_secret']
 bearer_token = secrets['bearer_token']

 # authorize the API handshake
 auth = tweepy.OAuthHandler(api_key, api_secret)
 auth.set_access_token(access_token, access_token_secret)
 api = tweepy.API(auth, wait_on_rate_limit=True)

 13. Pull the 50 Trending Topics for the USA

Twitter regularly updates a list of 50 trending topics for many locations around the world. We will use this feature to pull the topics specifically for the USA, which requires the USA's WOEID (a WOEID, or Where On Earth IDentifier, is a unique 32-bit reference identifier, originally defined by GeoPlanet and now assigned by Yahoo!, that identifies any feature on Earth). The function below returns a list of the 50 trending topics for the USA, which we can rerun every hour to get the freshest topics.

 def grab_trending_topics(country_id):

     # grab the trending topics for the given WOEID
     trending_topics = api.trends_place(country_id)

     # pull out just the topic names
     topic_list = []
     for k in range(len(trending_topics[0]['trends'])):
         topic = trending_topics[0]['trends'][k]['name']
         topic_list.append(topic)

     return topic_list
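A call for the USA then looks like this (23424977 is the WOEID for the United States):

 usa_woeid = 23424977                          # WOEID for the United States
 topic_list = grab_trending_topics(usa_woeid)  # list of the 50 trending topic names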

14. Pull the Most Popular Tweets for Each Trending Topic

Now it's time to actually go and search for tweets. Here I loop through every trending topic in the topic list created earlier and search for "popular" tweets that contain those topics. There are many ways to customize this type of Twitter search and I had a good time playing around with all of the parameters. For instance, you could search for "recent" tweets instead, or restrict the search to a specific latitude, longitude, and radius. I would encourage you to read the docs and come up with something tailored to your project. You will most likely want "extended" tweets, which include the full text rather than the old 140-character truncation.

For every tweet I pull out the info that is relevant to me, but there is more info available for each tweet, so go read the docs to make sure you're getting everything relevant to you! At this point I also pull a list of all hashtags used and all emojis as well. All of this data is appended to a dataframe.

 

 import datetime
 import pandas as pd

 def grab_popular_tweets(topic_list, max_tweets):

     # set up the columns for the dataframes
     columns = ['pulled_at', 'created_at', 'username', 'user_location', 'region', 'search_type',
                'trending_topic', 'retweetcount', 'favorites', 'text', 'hashtags', 'emojis']

     # create an empty dataframe to collect everything
     tweets_data_grab = pd.DataFrame(columns=columns)

     # loop through each trending topic
     for topic in topic_list:

         # grab tweets with Cursor: search for the trending topic, in English,
         # result type "popular" (could also be "recent"), extended mode for
         # full-length tweets, stopping after max_tweets tweets
         tweets = tweepy.Cursor(api.search, q=topic,
                                lang="en", result_type='popular',
                                tweet_mode='extended').items(max_tweets)

         # create a list of the grabbed tweets
         tweet_list = [tweet for tweet in tweets]

         # dataframe to hold the current top tweets for this trending topic
         tweets_topic = pd.DataFrame(columns=columns)

         # loop through each tweet that was grabbed
         for tweet in tweet_list:

             username = tweet.user.screen_name                                   # username
             user_location = tweet.user.location                                 # location of the user
             retweetcount = tweet.retweet_count                                  # retweet count
             favorites = tweet.favorite_count                                    # favorite count
             hashtags = [h['text'].lower() for h in tweet.entities['hashtags']]  # hashtags
             search_type = 'popular'                                             # search type
             region = "USA"                                                      # trending tweets in the USA
             created_at = tweet.created_at                                       # time the tweet was created
             pulled_at = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")   # time the tweet was pulled

             try:
                 text = tweet.retweeted_status.full_text    # full text if it's a retweet
             except AttributeError:
                 text = tweet.full_text                     # full text if it's a regular tweet

             emoji = list(emojis.get(text))                 # get the emojis in the text

             # store the current tweet's data as a row
             curr_tweet = [pulled_at, created_at, username, user_location, region,
                           search_type, topic, retweetcount, favorites, text, hashtags, emoji]

             tweets_topic.loc[len(tweets_topic)] = curr_tweet

         # sort by retweets and favorites, highest first
         tweets_topic.sort_values(by=['retweetcount', 'favorites'], inplace=True, ascending=False)

         # concatenate this topic's tweets onto the final dataframe
         tweets_data_grab = pd.concat([tweets_data_grab, tweets_topic], ignore_index=True, sort=False)

     return tweets_data_grab

 15. Building a Function to Predict on New Tweets

Now we will need a function that can take new tweets and evaluate them using our model. After the model assigns both a negative and a positive probability to each tweet, this function compares them to see which is greater and then assigns the tweet the appropriate sentiment label. Some tweets may throw an error (for instance, if there is no text, only a hyperlink); this function catches those errors and assigns no sentiment.


 def predict(sentence):

     # convert the tweet to a tensor of integer IDs
     inputs = np.array(tweet_to_tensor(sentence, vocab_dict=Vocab))

     # batch size 1: add a batch dimension so the input works with the model
     inputs = inputs[None, :]

     try:
         # predict with the model
         preds_probs = model(inputs)

         # turn the probabilities into a category
         preds = int(preds_probs[0, 1] > preds_probs[0, 0])

         sentiment = "negative"
         if preds == 1:
             sentiment = 'positive'

     except:
         # e.g. the tweet had no usable text (only a hyperlink)
         return 'N/A', -0.0, -0.0

     return sentiment, round(float(preds_probs[0, 0]), 4), round(float(preds_probs[0, 1]), 4)

 16. Building a Function to Add the Sentiment to New Tweets

Now that we can find the sentiment of each tweet, we can add these values to the existing dataframe of tweet data with the function below, which calls the predict function we just created. After this we can save the dataframe to a csv!


 def add_sentiment(tweets_data):

     for i in range(len(tweets_data)):
         tweets_data.loc[i, 'sentiment'], tweets_data.loc[i, 'neg_prob'], tweets_data.loc[i, 'pos_prob'] = predict(tweets_data['text'].iloc[i])

     return tweets_data

 Eventually we will put all of these functions together and call them in the cloud inside a Python script (a .py file); more on this later. Now we should get all set up on the Google Cloud Platform!
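Before we do, here is a preview of the glue code at the heart of that script (a sketch only; the max_tweets value is arbitrary, and the real twitter_sentiment.py also handles the Secret Manager authorization and the Cloud Storage upload described in the sections below):

 # pull the 50 trending topics for the USA, then the popular tweets for each topic
 topic_list = grab_trending_topics(23424977)            # 23424977 = WOEID for the USA
 tweets_data = grab_popular_tweets(topic_list, max_tweets=10)

 # score each tweet with the Trax model and write everything out to a csv
 tweets_data = add_sentiment(tweets_data)
 tweets_data.to_csv('tweets_data.csv', index=False)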

17. Signing up for the Google Cloud Platform

Not only does Google offer their GPUs for free (up to a limit) with Colab, they also have an entire Free Tier on their Cloud Platform and offer a $300 credit for 90 days upon signing up. This will allow us to run our model for free for several months! …Thanks Google!! The first step is to sign up for an account. If you want to go ahead and enable billing you can do so now, or you can wait until you get to some of the next steps, whose tutorials will walk you through it. Here's a link to get you started… Google Cloud Platform Free Tier

18. Creating a Project

The first thing we need to do is create a project. This will allow us to link all of the resources and monitor everything in one place. There are several ways to do this programmatically, but I would suggest using the console. Here's a link to Google's helpful tutorial… Creating and Managing Projects

19. Setting Up a Service Account

A service account is an identity that an instance or an application can use to run API requests on your behalf. When you create a new Cloud project, Google Cloud automatically creates one Compute Engine service account and one App Engine service account under that project. But to make our lives easier later we will follow the best practice to create a new service account, which will hold all of the necessary permissions. You can set up the permissions when you create the service account or in the next step below. Follow this tutorial to get started… Creating and Managing a Service Account. And for some further reading… Creating and Enabling Service Accounts for Instances

It is highly recommended that you use the Google Cloud Client Libraries for your application. Since we will be using the client libraries for Google Cloud Storage and for Secret Manager, the script we write will automatically find the API's secret keys and the storage bucket. The client libraries use a mechanism called Application Default Credentials (ADC) to automatically find your service account credentials. You can read more about it here… Authenticating as a Service Account

20. Setting Up the IAM Permissions

We had the opportunity to add permissions when we were creating the service account, but here we will navigate to the IAM Permissions page and set them up there. Always make sure to grant only the minimum permissions a project needs. There are tutorials for this, and you will also be walked through the process in the Secret Manager tutorial coming up next; here's the link… Configuring the Secret Manager. Below you can see my new service account "twitter-test"; this is the service account we will be granting permissions to. In order to access the storage buckets I added the Storage Admin role, and to access the secrets and keys I added the Secret Manager Secret Accessor role to my new service account.


 21. Setting Up the Secret Manager

In order to access the Twitter API without compromising your keys and secrets, you can use Google's Secret Manager. First complete the tutorial… Configuring the Secret Manager for this project, and then look through the Quickstart to get an overview of how to create and access secrets, or for a more in-depth look… Creating Secrets and Versions. A secret contains one or more secret versions, along with metadata such as labels and replication information. The actual contents of a secret are stored in a secret version, and when you access a secret in code later you will be accessing a secret version. When you create a secret, a first version is automatically created for you. If you ever need to change, update, delete, or change access to a secret, you can follow the tutorial… Managing Secrets.

For this project we will need to create a few secrets using the Web UI and then put code to access them in our python script. To start, I added the four tokens, secrets, and keys for accessing the Twitter API and named them accordingly. In case anybody is wondering, you should enter the codes without any quotation marks (" " or ' ').


 from google.cloud import secretmanager

 def access_secret_version(project_id, secret_id, version_id):
     """
     Access the payload for the given secret version if one exists. The version
     can be a version number as a string (e.g. "5") or an alias (e.g. "latest").
     """

     # Create the Secret Manager client.
     client = secretmanager.SecretManagerServiceClient()

     # Build the resource name of the secret version.
     name = f"projects/{project_id}/secrets/{secret_id}/versions/{version_id}"

     # Access the secret version.
     response = client.access_secret_version(request={"name": name})

     # Decode the payload to a UTF-8 string.
     payload = response.payload.data.decode("UTF-8")

     return payload

 And here are the function calls and Twitter API authorization later in the python script.

 project_id = 'twitter-test-298418'
 version_id = 'latest'

 access_token = access_secret_version(project_id, 'access_token', version_id)
 access_token_secret = access_secret_version(project_id, 'access_token_secret', version_id)
 api_key = access_secret_version(project_id, 'api_key', version_id)
 api_secret = access_secret_version(project_id, 'api_secret', version_id)

 # authorize the API handshake
 auth = tweepy.OAuthHandler(api_key, api_secret)
 auth.set_access_token(access_token, access_token_secret)
 api = tweepy.API(auth, wait_on_rate_limit=True)

 22. Setting Up a Linux Virtual Machine Instance on Google Cloud Compute Engine

Next up is to create and start a VM instance in the cloud. This is where we will store our files, write and run our code, and automate the creation of a csv file every hour. The first step is to enable the Compute Engine API, then create a VM instance. You can follow this tutorial exactly and just go along with all of the standard choices… Quickstart Using a Linux VM. The only thing you need to make sure you do here is choose your new service account rather than the one Google provides, see the picture below. For the other options I chose the defaults, but you can always choose a smaller CPU depending on your project. Make sure to enable pop-up windows before you click on the SSH button!


 We will come back to the VM in a little while, but first we should set up the other components. 

23. Creating a Storage Bucket

Next up should be to create the storage bucket where you wish to store your csv file full of tweet data. You can do this easily with a Google Cloud Storage Bucket. You can follow this tutorial here to guide you through the quick and easy process… Creating Storage Buckets. I used all of the standard options which are pre-selected. You will need to take note of the bucket name here to be used in your script later.

24. Installing Google Cloud Storage Libraries in the VM

Since we will be installing and using the client libraries in our VM, we will not need to download a service account key json file here. Remember, by doing it this way we are able to automatically access the secrets and the storage bucket. We will only need to run the "Installing the client library" snippets in our VM shortly. You can read the tutorials here, but we will only be using the small bits of code seen below… Google Storage Client Libraries and Secret Manager Client Libraries

Now we can navigate back to the VM instance. We will start by making sure the VM has all of the necessary libraries to work with Google Cloud. You can check whether python3, pip, etc. are already installed by typing "python3 -V", which will print the current version if there is one. For me, these are the commands I needed to run to get started with the Google Cloud client libraries.

sudo apt-get install python3-pip

pip3 install --upgrade pip

pip3 install --upgrade google-cloud-storage

pip3 install --upgrade google-cloud-secret-manager

You will need to add this chunk of code at the end of your python script in order to send the csv data to the storage bucket.


 from google.cloud import storage

 # Instantiate a client
 storage_client = storage.Client()

 # The name of the bucket
 bucket_name = "twitter-test-bucket"

 # Get the bucket
 bucket = storage_client.get_bucket(bucket_name)

 # Create a blob and upload the csv file from the VM's home directory
 data = bucket.blob("tweets_data.csv")
 data.upload_from_filename(r"/home/mr_sam_j_brady/tweets_data.csv")

 25. Uploading Files to the VM

 Back on the VM… The gear icon in the top right corner allows you to upload files, and we will start by uploading the Vocab.json file we created before. Next we should upload all of the files saved to the output directory during the training and evaluation of the model. If you used Keras instead of Trax, don't worry, just make sure to upload all of the files saved when training the model, including any separate folders. Here is a link to a Keras tutorial where you can see that in more detail… Keras tutorial. For us, that means uploading a pickle file, a train file into a train folder, an eval file into an eval folder, and a config.gin file. All of those were saved to the output directory when training the model, so you can look there to find them. Here's a screenshot of my uploaded files in the VM using the ls -lh command.


 I will try to be specific about the commands I used; if you need a refresher on Linux commands, here is a Basic Linux Commands Cheat Sheet.

26. Installing Python Libraries and Dependencies to the VM

In order for the python script to run, we first have to install all of the necessary libraries, packages, and dependencies used in the script. The best way to do this is to create a requirements.txt file in your local text editor and upload it to the VM; a sample is sketched below. Here's a brief tutorial… How to Install Python Packages
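Your exact list will depend on your script, but based on the libraries used in this project a starting point might look something like this (versions unpinned here; pin specific versions if you run into conflicts):

 tweepy
 emojis
 nltk
 numpy
 pandas
 trax
 google-cloud-storage
 google-cloud-secret-manager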


 Then run the following command in the VM instance to install all of the packages in the cloud space for your script to use.

pip3 install -r requirements.txt --no-cache-dir

If an error comes up when you are trying to run the script, you might be able to fix it by changing to specific versions, or perhaps you may need to add some other libraries here.

27. Creating a Python Script with Nano Text Editor

Now that all of the files are uploaded and all of the dependencies are installed, it is time to actually write/upload the python script with all of the functions necessary to pull the tweets from Twitter, perform sentiment analysis on them, and create a csv file full of data in the storage bucket. To do this, it is easiest to use the built-in Nano text editor in the VM instance. All you have to do to open the editor is type "nano" followed by the name you would like to give your script, ending in ".py". Rather than typing everything out in the VM, it will most likely be easiest to copy and paste from a file on your local computer, a Jupyter / Colab notebook, or a GitHub repo. Here's the link to my Colab notebook which contains the python script, you can copy the code right from there… Google Colab Notebook Python Script (twitter_sentiment.py)

nano twitter_sentiment.py

To exit the editor and save the file press Control + X and then Y, then Enter

Now that your script is in place and everything is set up to execute it, this is a good time to test everything out. You can run the script by typing…

python3 twitter_sentiment.py

If there are no errors, we can move on to automating the process. 

28. Automate the Script with a Cron Job and TMUX

If all goes well, the next step is to use a cron job to schedule when you would like the script to run (every day, hour, week, etc.) and then use tmux to keep things running even after you close the VM window. Here's Google's tutorial on cron jobs… Configuring Cron Job Schedules, where you can see in more detail how this works. To edit the cron table in the VM you can type:

crontab -e and then press Enter, then 1, then Enter again

You will be taken to a text editor; navigate to the bottom, directly below the line "m h dom mon dow". This is where you write your custom schedule. To run our script every hour on the 0th minute we type…

0 * * * * python3 twitter_sentiment.py


 Then type Control + X, then Y, then Enter

To install TMUX you can type:

sudo apt-get install tmux and then press Enter, then Y
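The cron job will fire on its own either way, but tmux is handy if you also want to run the script manually and keep that session alive after you disconnect. A typical workflow (the session name is just an example) looks like this:

tmux new -s twitter and then run your script inside the session

Control + B, then D to detach from the session (whatever is running keeps going)

tmux attach -t twitter to re-attach later and check on things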

Now we should be able to close the SSH window and our script will still run!

OK, all of the hard work is done! You can sit back and relax now that your python script is automated and your ML model is running in the cloud. Head on over to your storage bucket and wait for your csv file to arrive. You can even monitor the data collection to make sure there are no hiccups; just look in the VM instance's Monitoring tab, check it out – every hour!!




 29. Creating a Dashboard in Google Data Studio

There are so many things you can do with your new data file, and many options to visualize the data. Originally I had intended to send the data over to Tableau or some other visualization tool using BigQuery and Cloud Functions. However, I happened to stumble across an article that explained how you can connect your storage bucket directly to Google Data Studio. It sounds too good to be true, but it isn't. You can even set the data freshness to as often as every hour. Here's a link to Google's tutorial… Connect to Google Cloud Storage


 I won’t walk you through creating the dashboard, but please have a look at mine and share if you find it interesting!

30. Outro and Next Steps

Hopefully you can think of some applications for this that may benefit you or even entertain you. A few things that come to my mind: alter the Tweepy search to pull Twitter data about a specific topic of interest, perhaps a product; narrow the search down to specific locations and radii, perhaps for multiple cities; or change the search from "popular" to "recent"… so many possibilities.

For me, I think the next step will be visualizing daily Twitter data – creating a dashboard and storage bucket that keeps Twitter data for a 24 hour period. Maybe weekly! All I know is that I still have plenty of Google Cloud credit to run through for the next 90 days! Thanks Google!!

Some Helpful Links

My GitHub Repo for This Project

Google Colab Notebook – Train and Test 

Google Colab Notebook – Python Script (twitter_sentiment.py)

How to process and visualize financial data on Google Cloud with Big Query & Data Studio

How to automate financial data collection with Python using APIs and Google Cloud

Twitter data collection tutorial using Python

Automate reports in Google Data Studio Based on Data from Google BigQuery


Get in touch at:       mr.sam.tritto@gmail.com