Talk to Your BigQuery Data with GCP (VertexAI & PaLM 2 LLM) and LangChain
Here's a short tutorial on how you can set up an LLM on GCP to talk to your BigQuery data through VertexAI using the PaLM 2 LLM... otherwise known as Table Q&A.
I'll store a sample HR Attrition dataset from IBM in BigQuery and then set up an LLM so we can chat with the data. We'll be able to ask it simple questions and validate its answers, all in only a few lines of code.
Please find the code in my GitHub repo: https://github.com/sam-tritto/bigquery-qa/
OUTLINE
The HR Attrition Dataset
Create a GCP Project
Upload Data to BigQuery
Setting up the IAM Permissions
Install and Import Packages
Set the Variables
VertexAI PaLM 2
LangChain Agents
Prompt Engineering and Parameters
Q & A
The HR Attrition Dataset
The dataset I'll be using has been put together by IBM and focuses on employee attrition. I feel like I've seen a few different versions of this dataset throughout the years, and I typically navigate to Kaggle to find a good one. The data is both categorical and numerical, and it's imbalanced, as is the nature of attrition data. Here's the link to Kaggle for the dataset I'll be using. And below you can see a sample.
Create a Project on GCP
To get started you'll need to create a project on GCP and also set up a billing account. You can find the GCP console here: https://console.cloud.google.com
Once you create a project, navigate to the Dashboard and on the right hand side you should see the Billing tab. Click there to set up a billing account. If it's your first time using GCP then you'll be blessed with a $300 credit to explore their products. If it's not your first time, create a new gmail account and get those credits! I'm kidding. This project will be relatively cheap to run and you'll have a hard time breaking $1 for an hour or so of playing around with this.
Upload Data to BigQuery
Now, you might already have some data being stored in BigQuery, but since this is a tutorial let's assume you don't and upload some data. You can search GCP for BigQuery and enable the API. Once enabled you can click the + ADD icon and then choose how you'd like to upload the data. Since I have the data stored as a csv I'll choose that option. BigQuery follows the {PROJECT}.{DATASET}.{TABLE} naming convention, and you'll need to reference these later, so choose them wisely. You can create those variables as seen below.
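A minimal sketch of those variables, with hypothetical names you'd swap for your own:

```python
# Hypothetical names -- replace with your own project, dataset, and table
PROJECT = "my-gcp-project"       # GCP project ID
DATASET = "hr_data"              # BigQuery dataset
TABLE = "employee_attrition"     # BigQuery table holding the IBM HR data
```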
Once you've uploaded the data you can explore it in the SQL UI as seen below. You can navigate through the project's dataset hierarchy on the left and then query the data as seen on the right. Don't forget to set a LIMIT, especially if you have large data.
Setting up the IAM Permissions
Next you'll need to grant the proper permissions to a service account. You can start by searching or navigating over to the IAM & Admin page and enabling the API. Once here you can navigate down to the Service Accounts page. You'll need to grant the service account a few permissions, namely:
BigQuery User
BigQuery Job User
BigQuery Data Viewer
After that you'll need to generate a key for your service account. On the main Service Accounts page you can click the three dots to the right of your new service account under Actions, and then choose Manage Keys. Now typically I'd want to work with Google's Secret Manager, store the keys in the cloud, and reference them through code after authentication, but since this is just a quick tutorial I'll download the key as a json file and store it locally. Not the safest, but I'll likely delete this project after a few weeks anyway. Once downloaded you'll need to reference the file path in code later on, so make note of where you store it.
Install and Import Packages
GCP should all be set up now and we can move on to the code. I'll be working in a Jupyter notebook in VS Code with a custom environment. You'll need the following packages installed, which you can install inline with one pip command.
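Something along these lines should cover it; the package list is my best guess at what this setup needs (the sqlalchemy-bigquery driver is what lets LangChain talk to BigQuery through a bigquery:// URI), so adjust for your environment:

```
%pip install google-cloud-aiplatform google-cloud-bigquery langchain sqlalchemy sqlalchemy-bigquery
```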
And here's all of my imports.
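Roughly, the imports look like this; exact module paths can shift between LangChain and Vertex AI SDK releases, so treat this as a sketch:

```python
import os

# Vertex AI SDK for calling PaLM 2 directly
import vertexai
from vertexai.language_models import TextGenerationModel

# LangChain pieces for the SQL agent
from langchain.llms import VertexAI
from langchain.sql_database import SQLDatabase
from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
```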
Set the Variables
Now we'll need to set some global variables to reference later on. You'll need the filepath to your secret key json file as well as your BigQuery dataset details.
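A sketch of those variables, assuming the key file sits next to the notebook (the filename here is hypothetical) and reusing the PROJECT and DATASET names from earlier:

```python
# Point Google's client libraries at the downloaded service account key
KEY_PATH = "bigquery-qa-service-account.json"   # hypothetical filename
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = KEY_PATH

# SQLAlchemy connection string for BigQuery, built from the earlier variables
sqlalchemy_url = f"bigquery://{PROJECT}/{DATASET}"
```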
VertexAI PaLM 2
The PaLM 2 model is extremely simple to work with in Python. You can simply initialize the model and ask it any question, as seen below. The model name is 'text-bison@001'. To test it out you can pass in any question as a string, similar to how you'd interact with ChatGPT, for instance.
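Here's a rough sketch of that using the Vertex AI SDK; the region and the test prompt are just placeholders:

```python
# Initialize Vertex AI in your project/region, then load the PaLM 2 text model
vertexai.init(project=PROJECT, location="us-central1")
palm2 = TextGenerationModel.from_pretrained("text-bison@001")

# Ask it anything, much like you would with ChatGPT
response = palm2.predict("What are some common reasons employees leave a company?")
print(response.text)
```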
LangChain Agents
LangChain will act as the middleman between us, our LLM, and our GCP data. Basically it uses Agents to ask the LLM follow-up questions in order to complete the task. For instance, if we ask the LLM "How many employees are there?", the Agent will work out what tools or other questions the LLM might need in order to answer our original question. It might then ask the LLM a follow-up question such as "What datasets are available?" or "What column names are available in the dataset?". The Agent queries the schema in the background to find this information, then passes through your original question. LangChain has a variety of Agents that you can utilize for different tasks, and in our case it has one for talking to our SQL dataset. LangChain also makes it extremely easy to swap out different LLMs, so by changing one line of code you can use PaLM 2, ChatGPT, GPT-4, Llama 2, or anything else that might come about.
Below you can see how to connect LangChain and BigQuery in just 4 lines of code. If you set verbose=True then you can see all of the questions the Agent had to ask in order to help answer your question. This is helpful in debugging, and you can follow its train of thought easily.
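A sketch of those lines, assuming the credentials and sqlalchemy_url set up above (parameter names may vary slightly between LangChain versions):

```python
# Wrap the BigQuery dataset, the PaLM 2 model, and the SQL toolkit in a LangChain agent
db = SQLDatabase.from_uri(sqlalchemy_url)
llm = VertexAI(model_name="text-bison@001", temperature=0)
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent_executor = create_sql_agent(llm=llm, toolkit=toolkit, verbose=True)
```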
Let's see a sample output. The chain works by allowing the Agent to have Observations and Thoughts until the chain is finished. You can even see the SQL query the Agent came up with in order to answer the question.
Prompt Engineering & Parameters
There are a few notable parameters in the create_sql_agent() function we can tune to get our desired results. First we can take a look at the prefix and format_instructions parameters, which are the appropriate place to do some prompt engineering. You can see the defaults of these parameters below (first prefix, then format_instructions). There's also an optional suffix parameter.
Here's my quick attempt at prompt engineering for Table Q&A. I've created a data dictionary and also a description of the table and its general purpose. We can simply concatenate this to the prefix parameter and pass that in.
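A hedged sketch of what that could look like; the SQL_PREFIX import path depends on your LangChain version, and the column descriptions here are my own summary of this dataset, so adjust both for your setup:

```python
# Default prefix that ships with LangChain's SQL agent (import path may differ by version)
from langchain.agents.agent_toolkits.sql.prompt import SQL_PREFIX

# A small data dictionary / table description to append to the default prefix
TABLE_CONTEXT = """
The table employee_attrition has one row per employee from IBM's HR attrition dataset.
Useful columns:
- Attrition: 'Yes' if the employee has left the company, 'No' otherwise
- Department: the employee's department, e.g. 'Sales'
- MonthlyIncome: the employee's monthly income
- DistanceFromHome: how far the employee lives from the office
"""

agent_executor = create_sql_agent(
    llm=llm,
    toolkit=toolkit,
    prefix=SQL_PREFIX + TABLE_CONTEXT,  # default prefix plus our extra context
    verbose=True,
)
```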
Another notable parameter is top_k, which limits the number of rows returned by a query. By default it is set to 10, so this is one you're probably going to want to increase to be larger than your dataset (after testing, of course).
Other than the create_sql_agent() function, you might want to play with parameters of the LLM itself, such as temperature, which you can set when initializing the model. The temperature parameter controls the randomness of the output, where 1 is very creative. For us, then, it makes sense to set this to 0 since we are retrieving data.
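Putting those together might look something like this, reusing the db connection from above; the top_k value is just an example, so pick one larger than your table's row count:

```python
# temperature=0 keeps the model deterministic, which is what we want for data retrieval
llm = VertexAI(model_name="text-bison@001", temperature=0)

agent_executor = create_sql_agent(
    llm=llm,
    toolkit=SQLDatabaseToolkit(db=db, llm=llm),
    top_k=2000,      # default is 10; raise it above the number of rows in the table
    verbose=True,
)
```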
Q&A
Now we can test our LLM against our dataset. With one line of code we can pass a question and view the response. Starting off with an easy one, we can see that the model handles categorical data well, correctly finding that there are 3 different Departments.
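The call pattern is just this (the question wording here is illustrative):

```python
# Ask a natural-language question; the agent writes and runs the SQL behind the scenes
agent_executor.run("How many different departments are there?")
```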
Ok, now for something on the numerical side. Can it handle subsets and aggregations? Yes! It was able to subset the Sales department and find the average of the column "MonthlyIncome", but notice I didn't specify the column name - I simply asked for the "monthly income". That's the power of the LangChain Agents. They inferred the correct column by first querying the schema.
How about maximums? Yup, nailed it.
Now something a little more complicated. This is a two step problem where it first needs to calculate the department with the highest attrition and then subset by this department and aggregate the distance from home. And as you can see the LLM got it correct - very impressive considering that the Attrition column has a boolean datatype and contains Yes and No data.
How about a more complicated subset and a percentage? Oh no! As you can see it didn't interpret the correct numerator and ignored my request to consider the employees who "have attrited". Maybe this is due to the past tense of the word? Not sure, but I was hoping to push this LLM a little harder before it broke.
Not a model I would put into production, but since this was fairly out of the box there is still plenty of room to fine-tune the model and even work on some prompt engineering. Perhaps by passing in more metadata, context, and thoughtful prompts we could build a more robust solution. Still, we are leaps beyond where this sort of technology was 2 years ago. Next I'd love to see how this can perform with multiple tables, and I'll continue to play around with sharpening this model.