GitHub Repo:
You can find this notebook in my GitHub Repo here: https://github.com/sam-tritto/ufo-sightings
OUTLINE
Introduction
UFO Sightings Dataset
BigQuery Public Datasets
Data Cleaning
BigQuery Geo Functions
Within County
Within Zip Code
Distance Between
Clustering Chevron Sightings
For residents of the Hudson Valley, NY (where I'm from), the 1980s weren't just about big hair and synth-pop; they were also a period marked by an extraordinary wave of UFO sightings, particularly those of a distinctive chevron or V-shaped craft. Whether you're a believer or not, these mysterious aerial phenomena captured the public imagination and sparked countless conversations for decades to come.
In this tutorial, we'll delve into the fascinating world of geospatial analysis using Google BigQuery's powerful set of geographic functions and its rich collection of public datasets. We'll leverage the extensive UFO Sightings dataset from Kaggle to explore these enigmatic events in BigQuery, focusing specifically on the reported sightings in the Hudson Valley during that iconic time. You'll learn how to perform proximity analysis at scale as we explore these anomalies. Get ready to unlock the secrets of the sky (or at least, the data behind them) with the power of BigQuery!
This dataset, which can be found on Kaggle, contains over 80,000 reports of UFO sightings over the last century up until the year 2014. It contains a few categorical columns and the date of the reported sighting. The data although it has been cleaned is still pretty messy which we wll have to tackle first before we get into the geospatial analysis.
Find it here: https://www.kaggle.com/datasets/NUFORC/ufo-sightings/data
Like many datasets that contain geo information, you'll find that this data only contains some geo components such as the City, State, Latitude and Longitude, but not all. What if we are interested in the County or Zip Code? Not to fear, BigQuery hosts some great geo datasets from the US Census Bureau each at a different geo hierarchy found in their public data set project bigquery-public-data and in their geo_us_boundaries dataset, seen below. They have MSAs and CBSAs to match up to Census Bureau data, Zip Codes, Counties, etc. Each one has various codes which can make joining them together very easy. There are latitudes, longitues, names, and even boundary points for the geographies. The only thing to understand here is that they aren't updated that often. The last update was in 2022, so newer zip codes for instance might not be available.
You can find more about them here:
https://console.cloud.google.com/marketplace/product/united-states-census-bureau/us-geographic-boundaries?hl=en&inv=1&invt=Ab0Hvw&project=gemini-sandbox-446215
I'm not going to run through all the code to clean the data. There wasn't anything too complex I had to do to get it into a useable form; datatypes, column names, nulls, etc. I did however filter this to the US only which is about 70,000 sightings, since the BigQuery geo data is for the US only. You can follow along in my GitHub repo if you are interested.
GitHub Repo: https://github.com/sam-tritto/ufo-sightings/blob/main/UFO_Sightings.ipynb
Now we can explore Google's fantastic suite of geo functions for BigQuery. They have everything from functions that measure distance, boundaries, intersections, to even clustering. I'm just going to stick to some of the most common/useful ones for this project but encourage you to read through them all here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions
We can take a quick look at some that I'll use later on straight from their documentation.
Here is the ST_GEOPOINT() function which we will use to build a geography point from a latitude and longitude.
Here is the ST_WITHIN() function we will use to determine if a coordinate point is located within a boundary.
Here is the ST_DISTANCE() function we will use to determine the distance in meters between geographies.
And here is the ST_CLUSTERDBSCAN() function which we will use to cluster the geographies with the DBSCAN algorithm. It's really cool to see BigQuery offering some ML options specifically for geographies.
The first thing we will explore is fltering for all UFO sighting reports in Dutchess County. This is in the heart of the Hudson Valley and will allow us to use the ST_WITHIN() function. It would be nice to be able to simply join the two tables together, but our UFO sightings data only has city and state. Also the public county geography table does not have a record for each city in each county. With third party data, or any data for that matter, there is no garuntee that the spellings, case, spaces match between textual columns. It will be a much safer bet to use the latitude and longitude of each UFO sighting report and see if it is contained in the geo boundary for Dutchess County.
Utilizing a cte can help both with readability and optimization. Since we already know we'd like to focus on Dutchess County we can pre filter that data ahead of the join which will greatly reduce time and cost. We could have also used another cte for the UFO data if there were any filtering or manipulations we would have needed to do there, such as year or type of craft. Filtering in the WHERE clause of the join is always an option too, but since the JOIN clause comes before the WHERE clause in the SQL order of operations we really should pre filter our data. You'll notice this is a CROSS JOIN between the two tables so thiking about filtering and optimizations is crucial since the number of records can spin out of control pretty fast.
The ST_WTHIN() function takes a geo point coordinate and a geo boundary as inputs. The us boundary table already contains a MULTIPOLYGON column named county_geom, this is what ST_WITHIN() accepts. The UFO sightings data however only has coordinate points, so we will us ST_GEOPOINT() to convert this into a useable geography. ST_WITHIN() then returns a boolean True or False for each record so when used in the WHERE clause, becomes an efficient filter.
Looking at the results we get all of the UFO Sightings from Dutchess County, 78 in total.
Here's a similar query but using the zip code boundary table. Zip Codes are interesting since they can span multiple cities. Since our UFO dataset only has City and State this query might have returned multiple cities for this zip code, but it did not for our select zip code.
There were only 5 reported cases in this dataset, and no chevrons of V-shaped craft.
Now we can try out a different function to find the distance between Dutchess County and any other County where a chevron or V-shaped craft was seen. We'll use the ST_DISTANCE() function which returns the distance in meters which we can convert into miles. This isn't street driving distance, but rather "as the bird flies" distance or the distance on top of a sphere which is perfect for this UFO case study. The Google Maps API is a great choice if you need driving distance, but it is costly. There's also geopy a free and open sourced alternative if you are comfortable with python.
I'll start building some CTEs to split up the query. First I'll isolate Dutchess County so we can measure the distance from here. You'll notice that while we've previously used the geo border of the County now we are relying on the County's internal coordinate points which are also available in these boundary datasets. These latitudes and longitudes represent the central most point within each County's border.
Next we'll reuse the same ST_WITHIN() query we used before but with a few new conditions. We'll need to filter out Dutchess County since we don't want to find the distance between it and itself. Also we'll look only for other chevron shaped craft.
Now we can put these pieces together and use the ST_DISTANCE() function to find the distance between Dutchess County's internal coordinates and all other counties coordinates by using a Cross Join. The function returns the distance in meters so I'll convert them with simple division into miles.
As expected there were several other reports in other counties throughout the Hudson Valley around that time. More interestingly however, there was still a chevron shaped craft seen close by as recently as 2009.
Now we can explore some ML in BigQuery with the DBSCAN algorithm and the ST_CLUSTERDBSCAN() function. We'll also use plotly, an interactive python visualization library, to show these clusters.
DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, is a powerful clustering algorithm that groups together points that are closely packed together, marking as outliers points that lie alone in low-density regions. Unlike algorithms like K-Means, you don't need to specify the number of clusters beforehand, and it can find arbitrarily shaped clusters.
DBSCAN defines clusters based on the concept of "density." A cluster is a region of high data point density, separated by regions of lower density. There are just two parameters...
Two Key Parameters:
epsilon: This is the maximum distance between two points for them to be considered "neighbors." Imagine drawing a circle of radius epsilon around a point; any other points within this circle are its neighbors.
min_geographies: This is the minimum number of points required within the epsilon radius for a point to be considered a "core point."
Point Types: DBSCAN classifies each data point into one of three types:
Core Point: A point is a core point if it has at least min_geographies (including itself) within its epsilon neighborhood. These are the "heart" of a dense region.
Border Point: A point is a border point if it has fewer than min_geographies within its epsilon neighborhood, but it is within the epsilon neighborhood of a core point. These points are on the "edge" of a cluster.
Noise Point (Outlier): A point is a noise point if it is neither a core point nor a border point. These points are considered outliers or noise and don't belong to any cluster.
First, we'll need to query the data in a way that will be easy to visualize. Starting similarly as before, a CTE with the geo information for each County. We will need the internal lat and long to visualize later in a bubble chart map aside from being used in BigQuery to map to UFO sightings.
Now mapping the UFO sightings data to cases to each County using the ST_WITHIN() function, same as before, filtering for 'chevron' type UFOs.
Finally Clustering with the ST_CLUSTERDBSCAN() function. I'm specifying epsilon as 40 miles and that the min_geographies to be 3 and pulling all the columns I'll need to play nice with plotly.
Now I'll save the data locally to being into VS Code where I'm working locally in python. There are other options to visualize the data from BigQuery such as using Looker Studio or Google Colab.
The first thing you might notice is that the first handful of records have a null cluster number. These records weren't able to be assigned a cluster based on our parameters. You might consider them anomalous points. Here I will call them what they are, "Noise". But before I do that I'll first create a "County, State" column since there can be Counties named similarly in different States and that will mess up our aggregations.
Now we can aggregate without fear by "County, State". Here I'm doing multiple aggregations at once, but really I'm just doing some tricks to bring some columns along for the aggregation ride. Really we're interested in creating the number of sightings in each County.
Throwing this data in plotly, we can create a Bubble Map that is colored for each Cluster and sized for the number of sightings.
I'm not going to spend too much time tuning the DBSCAN paramters, but the fit here is pretty good already. Notice the Counties being classified as Noise and how they seem pretty disjoint from any of the Clusters. We can clearly see that there are certain hot spots around the country for these chevron shaped UFOs, so interesting. What's really cool here is that the DBSCAN algorithm can typically take some time in python, but BigQuery's geo function was able to jam it out in seconds. This could scale easily with big data.