GitHub Repo:
You can find this notebook in my GitHub Repo here: https://github.com/sam-tritto/resume-parser
For this project we will leverage the power of Google's Gemini model via the LangExtract library to transform unstructured resume PDFs into clean, structured data. The tool also incorporates a crucial ATS fraud detection step to identify hidden white text.
In short, this tool acts as an intelligent data extraction engine for candidate screening and HR automation:
PDF to Structured Data: It reads resume PDFs and intelligently extracts key entities (Name, Experience, Skills, Education, etc.).
ATS Fraud Detection: It scans for text deliberately hidden by making the font color white, a common tactic for boosting keyword density in ATS systems.
Persistent Storage: It handles file management by allowing you to select and process files directly from your secure Google Cloud Storage (GCS) bucket.
LangExtract, our main library here, is the core component that turns the raw text extracted from the PDF into the clean, categorized data summary needed for downstream processing and storage.
Reliable Structured Output: You define the desired output structure (e.g., a list of entities with attributes) using natural language prompts and one or more "few-shot" examples. LangExtract then guides the LLM to strictly adhere to this format, often using controlled generation to enforce JSON schemas.
Precise Source Grounding (Traceability): This is a standout feature. Every piece of data extracted is mapped to its exact location (character offset or span) in the original text. This allows for full auditability and the creation of interactive visualizations that highlight the extracted entities in context.
Import Libraries
We'll need a few standard Python libraries for dealing with PDFs, as well as some from the Google suite. There's a requirements file in my repo if you want the specific versions, which you can then pip install:
pip install -r requirements.txt
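For reference, the import block looks roughly like this; the exact set (and versions) comes from the requirements file in the repo, so treat this as an approximation.

import os
import webbrowser
from datetime import timedelta

import requests
import fitz  # PyMuPDF, for reading PDFs and inspecting text spans
import pandas as pd
import langextract as lx
from dotenv import load_dotenv
from google.cloud import storage, bigquery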
Personally Identifiable Information (P.I.I.)
Resumes contain a lot of sensitive and private information, a.k.a. Personally Identifiable Information (P.I.I.). So we are going to treat them a little differently than we would normal files. Rather than store them on our computers, we will keep them secure in Google Cloud Storage (GCS) buckets and access them from our app with a Signed URL. This creates a time-limited URL that grants temporary access to the object in the bucket. LangExtract accepts URLs, so we can point it at the resume's GCS Signed URL without ever having to download the document onto our drive. Another secure option would be to download the files into a temporary folder and clean them up once we're done extracting.
You'll need to navigate to GCP, create a storage bucket, and populate it with some resumes. Our app will also have the ability to upload resumes into this bucket for safekeeping.
Next, I'll go through step by step how to get the JSON key file needed to create a Signed URL.
How to Get a Service Account Key File
You will do this in the Google Cloud Console (the website where you manage your project).
Step 1: Navigate to Service Accounts
Go to the Google Cloud Console and select your project.
In the search bar at the top, type Service Accounts and click the result under IAM & Admin.
Click the blue "+ CREATE SERVICE ACCOUNT" button at the top.
Step 2: Configure the Service Account
Service account name: Give it a helpful name, like gcs-signer or signed-url-generator. (The Service account ID will be generated automatically.)
Click CREATE AND CONTINUE.
Step 3: Grant Permissions (Roles)
This step decides what the service account (our "robot" user) is allowed to do. To generate a signed URL for a file in Google Cloud Storage (GCS), it needs permission to access that storage.
In the "Select a role" box, search for and select the following roles:
Storage Object Viewer (Allows it to view the files)
Storage Object Admin (A good role if you also need it to upload/manage files, but Viewer is often enough to create a signed URL for reading.)
Click CONTINUE. (You can skip the third step, "Grant users access to this service account.")
Click DONE to create the Service Account.
Step 4: Create and Download the Key File
Now that the Service Account exists, you need to generate the private key (the JSON file).
On the main Service Accounts list page, find the account you just created and click on its name.
Click the KEYS tab at the top.
Click the ADD KEY dropdown and select Create new key.
Choose JSON as the Key type (this is required for the Python script).
Click CREATE.
Your browser will automatically download a file named something like gcs-signer-abcd1234.json. This is your private key! Keep it secure and don't push it to GitHub.
Env & Global Variables
Since we will be working with some API keys, I'll store them in a hidden ".env" file. You can find an example of this file in my repo called ".env.example". Use the structure found in that file, update it with your own API keys and values, then rename the file to ".env".
You should add this file to your ".gitignore" so you don't accidentally push it to your repo. Two entries in ".gitignore" will cover the ".env" file as well as the Service Account JSON key and the other JSON files LangExtract creates later.
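Something along these lines should work (the exact patterns are my assumption, so check the ".gitignore" in the repo):

.env
*.json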
Then load them into the environment.
This is the standard, simple way to load environment variables from a .env file into os.environ.
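It looks something like this; the variable names below are illustrative, so match them to whatever keys you defined in ".env.example".

import os
from dotenv import load_dotenv

load_dotenv()  # read the key/value pairs from .env into os.environ

# Illustrative names; use the keys you actually defined in your .env file.
GCS_BUCKET_NAME = os.getenv("GCS_BUCKET_NAME")
SERVICE_ACCOUNT_KEY_PATH = os.getenv("SERVICE_ACCOUNT_KEY_PATH")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")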
Generate a Signed URL
A function to generate the signed URL when needed.
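Here's a minimal sketch of that function using the google-cloud-storage client and the service account key from earlier (the names and the 15-minute expiry are my own choices, not necessarily what the repo uses):

from datetime import timedelta
from google.cloud import storage

def generate_signed_url(bucket_name: str, blob_name: str, key_path: str, minutes: int = 15) -> str:
    """Create a short-lived V4 signed URL for reading a single object in GCS."""
    client = storage.Client.from_service_account_json(key_path)
    blob = client.bucket(bucket_name).blob(blob_name)
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=minutes),
        method="GET",
    )

signed_url = generate_signed_url(GCS_BUCKET_NAME, "resume.pdf", SERVICE_ACCOUNT_KEY_PATH)  # "resume.pdf" is a placeholder object name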
Downloading the Text Binary into Memory
Even though LangExtract can work with URLs directly, in order to detect any "white text" we'll need to download the document ourselves. We'll download it into memory so we don't have to store any PII data on our computers, which works well since resumes are small documents. This function will point at the same GCS Signed URL.
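A rough version of that function is below. I'm assuming requests for the HTTP call and PyMuPDF (fitz) for reading the PDF; the repo may use different libraries.

import requests
import fitz  # PyMuPDF

def download_pdf_to_memory(signed_url: str) -> bytes:
    """Fetch the PDF over the signed URL; nothing is ever written to disk."""
    response = requests.get(signed_url, timeout=30)
    response.raise_for_status()
    return response.content

def extract_text_from_bytes(pdf_bytes: bytes) -> str:
    """Pull the plain text out of the in-memory PDF, page by page."""
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    return "\n".join(page.get_text() for page in doc)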
Here we can create the prompt describing which entities LangExtract should look for in the documents. Nothing too complex here, but we can specify the nested structure or list the attributes of each entity, which will help the model learn.
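An illustrative prompt description (the wording below is mine, not the exact prompt from the repo):

import textwrap

prompt_description = textwrap.dedent("""\
    Extract the candidate's name, contact info, work experience, education,
    skills, and certifications from the resume text.
    Use the exact wording from the resume for each extraction.
    For work experience, include the company, job title, and dates as attributes.
    For education, include the school, degree, and graduation year as attributes.""")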
Examples
LangExtract will require a few good examples for parsing. You specify the type of entity (extraction_class) and then the subtext (extraction_text) referencing the larger example text. Note the nested examples with attributes.
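A pared-down example using LangExtract's data classes; the classes and attributes here are illustrative, and the examples in the repo are more complete.

import langextract as lx

example_text = (
    "Jane Doe\n"
    "Data Scientist at Acme Corp, 2019-2023\n"
    "M.S. Statistics, State University, 2019\n"
    "Skills: Python, SQL, BigQuery"
)

examples = [
    lx.data.ExampleData(
        text=example_text,
        extractions=[
            lx.data.Extraction(
                extraction_class="name",
                extraction_text="Jane Doe",
            ),
            lx.data.Extraction(
                extraction_class="experience",
                extraction_text="Data Scientist at Acme Corp, 2019-2023",
                attributes={"title": "Data Scientist", "company": "Acme Corp", "dates": "2019-2023"},
            ),
            lx.data.Extraction(
                extraction_class="education",
                extraction_text="M.S. Statistics, State University, 2019",
                attributes={"degree": "M.S. Statistics", "school": "State University", "year": "2019"},
            ),
            lx.data.Extraction(
                extraction_class="skill",
                extraction_text="Python",
            ),
        ],
    )
]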
Now we can call LangExtract to parse the documents. You can play with the parameters to trade off speed and accuracy, but be careful with extraction_passes; that one will cost you!
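The call itself looks roughly like this, reusing the helpers sketched above (the model id and parameter values are my own placeholders, not a recommendation):

resume_text = extract_text_from_bytes(download_pdf_to_memory(signed_url))

result = lx.extract(
    text_or_documents=resume_text,        # or pass the signed URL itself and let LangExtract fetch it
    prompt_description=prompt_description,
    examples=examples,
    model_id="gemini-2.5-flash",          # assumed model id; use whichever Gemini model you have access to
    extraction_passes=1,                  # extra passes can improve recall but multiply the cost
    max_workers=10,                       # parallel workers for chunked documents
)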
Before we put the visualization inside of a Streamlit app, we can open it up in a browser window during testing.
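Roughly, following the pattern from LangExtract's docs: save the annotated result to JSONL, render the interactive HTML, and pop it open locally.

import os
import webbrowser

lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")

html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w") as f:
    # lx.visualize returns an IPython HTML object in notebooks, a plain string otherwise
    f.write(html_content.data if hasattr(html_content, "data") else html_content)

webbrowser.open("file://" + os.path.abspath("visualization.html"))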
You can see the extracted entities color-coded here, and you can scroll or play through the HTML file. The attributes key in the JSON contains the nested information for each entity.
If you're not familiar with "white text", it is essentially hidden text that is intended to be captured by the Applicant Tracking System (ATS) but not seen by the reader. Some people insert hundreds of skills, fake credentials, or even hidden or malicious prompts. We will scan the document and record a binary flag, the page number the text is found on, the bounding box of the text, and the extracted words of any text that is white.
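Here's a sketch of that detection using PyMuPDF's span-level metadata; treating a pure white fill color as suspicious is my simplification of the repo's logic (near-white colors could be checked too).

import fitz  # PyMuPDF

WHITE = 0xFFFFFF  # sRGB integer PyMuPDF reports for pure white text

def detect_white_text(pdf_bytes: bytes) -> dict:
    """Flag any text span whose fill color is white, with its page and bounding box."""
    findings = []
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    for page_number, page in enumerate(doc, start=1):
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):   # image blocks have no "lines" key
                for span in line["spans"]:
                    if span["color"] == WHITE and span["text"].strip():
                        findings.append({
                            "page": page_number,
                            "bbox": span["bbox"],
                            "text": span["text"],
                        })
    return {"white_text_found": bool(findings), "details": findings}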
For the record, I would never ever do this. But for this project we're going to need to test out our detection algorithm. Here's a sample from my Resume where I've inserted some sneaky white text. This text won't be seen by LangExtract, but will be picked up by our detection algorithm.
At this point we have pretty much all we need to create a simple app. There are a few other helper functions that I haven't gone over here but that work in the backend of the app. If you're interested, you can navigate to the app.py file in my repo to see how the Streamlit app comes together.
A really big part of working with Streamlit apps is session state and caching data. The script reruns from top to bottom each time a user interacts with the app, so we can store variables in the session state so that they persist between reruns, which makes for a much nicer user experience. You can even specify an expiry time for cached data.
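As a small illustration (the keys and ttl below are mine, not the app's exact ones): st.session_state keeps values across reruns, and the ttl argument of st.cache_data controls how long a cached result lives. The cached function here reuses the generate_signed_url helper sketched earlier.

import streamlit as st

# Survives reruns of the script for the lifetime of the user's session.
if "extraction_result" not in st.session_state:
    st.session_state["extraction_result"] = None

# Cache expensive calls; ttl (seconds) sets how long the cached value is reused.
@st.cache_data(ttl=900)
def cached_signed_url(bucket_name: str, blob_name: str) -> str:
    return generate_signed_url(bucket_name, blob_name, SERVICE_ACCOUNT_KEY_PATH)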
Upload or Choose a Document From GCS
You can choose either to upload your own resume to the GCS bucket or to parse a document straight from GCS, without ever downloading it onto your local machine.
The ATS White Text Detector runs automatically and generates a warning if any white text is found, as seen below. Busted!
Then simply hit the big red button to extract all of the entities.
Now that we've successfully parsed the data, we should save it somewhere for later use. The extracted data currently lives as a structured JSON object (or JSONL file), but we can flatten it into a clean pandas DataFrame and then load it into BigQuery. Below you can see the extracted entities and attributes being captured. I've even added the hidden white text data and flags into this DataFrame, even though that was a separate process from the LangExtract extraction.
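A rough sketch of that last hop: flatten the LangExtract result into one row per entity, then load the DataFrame into BigQuery (the project/dataset/table name is a placeholder).

import pandas as pd
from google.cloud import bigquery

# One row per extracted entity; attributes are kept as a JSON-style string for simplicity.
rows = [
    {
        "extraction_class": e.extraction_class,
        "extraction_text": e.extraction_text,
        "attributes": str(e.attributes),
    }
    for e in result.extractions
]
df = pd.DataFrame(rows)

client = bigquery.Client()
job = client.load_table_from_dataframe(df, "my-project.resumes.parsed_entities")
job.result()  # wait for the load job to finish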