GitHub Repo:
You can find this notebook in my GitHub Repo here:
OUTLINE
Introduction
The Wine Quality Dataset
EDA
Parameters
Pre-Processing
Univariate Outliers - Skew Conditional Statistical Ensemble
Multivariate Anomalies - Machine Learning Model Voting Ensemble
Post-Processing
Visualization & Investigation
Box-and-Swarm Plots
Ever wonder if there are "sour grapes" in your data that could mess up your analysis and insights? In this guide, we're going to dive into the popular Wine Quality Dataset and learn how to find those unusual data points – what we call outliers and anomalies.
Think of it like this:
Outliers are like a single grape that's way too sour or too sweet compared to all the others in a bunch. It's odd in just one way.
Anomalies are more like a whole cluster of grapes that, while maybe looking okay individually, have a strange combination of traits (like being tiny, overly ripe, and also a weird color) that makes them clearly different from typical grapes. They're odd in multiple ways at once.
Finding these "odd ones out" is super important because they can easily trick our analyses and make our results unreliable. If we don't spot them, we might draw the wrong conclusions about what makes a good wine!
This tutorial will show you a practical, two-part strategy to catch both kinds of weird data:
Part 1: Spotting Simple Outliers
First, we'll look at each wine feature (like its acidity or sugar level) one by one. We know that data behaves differently – some data is Normal (like a bell curve), some is a bit lopsided, and some is really uneven. So, instead of using a single tool, we'll use a smart system that picks the best method for each situation:
We'll use Z-scores for data that looks "Normal".
We'll use Modified Z-scores for data that's a bit lopsided or moderately skewed.
And we'll use the Interquartile Range (IQR) method for data that's very skewed or uneven. This way, we make sure we're using the right tool for the right job, leading to much more accurate outlier detection.
Part 2: Finding Complex Anomalies
Next, we'll get fancy and look at how wines compare across all their features at once. This is where the truly complex "anomalies" hide. We'll use a cool Python library called PyOD (Python Outlier Detection) and unleash a set of powerful models that are great at sniffing out these complex deviations, each one bringing something unique to the table:
ECOD (Empirical Cumulative Distribution Outlier Detection): A fast and robust method that doesn't require explicit model training.
Isolation Forest (iForest): An ensemble tree-based method particularly effective for high-dimensional data.
Local Outlier Factor (LOF): A density-based technique that identifies outliers based on their local deviation from their neighbors.
AutoEncoder: A neural network-based approach that learns a compressed representation of normal data and flags points with high reconstruction errors as anomalies.
One-Class SVM (OCSVM): A machine learning model that learns a boundary enclosing normal data points, classifying anything outside as an anomaly.
By the time you finish this tutorial, you'll not only know how to use these awesome techniques in Python but also understand why and when to use them. You'll be ready to tackle strange data in your own projects, making your data cleaner, your analyses stronger, and your conclusions much more reliable.
Ready to pour over some data and uncover its hidden secrets? Let's get started!
This dataset, which can be found on UC Irvine's Machine Learning Repository, contains just under 6,500 records and 13 columns of strictly numerical data relating to the characteristics of both red and white wine (see [Cortez et al., 2009], http://www3.dsi.uminho.pt/pcortez/wine/). The data is numeric and contains no missing values. It comes in two separate files, one for each kind of wine, but I will combine them and create a column for the wine type.
Find it here: https://archive.ics.uci.edu/dataset/186/wine+quality
Before I flag outliers, I like to start by looking at the distribution of the data. The klib package has a nice function to plot each metric's distribution along with some summary stats, including skew. The outlier ensemble we use later will be conditional on skew, so here are three examples of data with varying skew. Each will use a different algorithm to identify outliers.
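The klib call referenced here is `klib.dist_plot(df)`. As a minimal sketch of how skew might be bucketed into the three cases below – the cutoffs of 0.5 and 1.0 are the conventional rule of thumb, not values taken from the original notebook:

```python
import pandas as pd

def skew_bucket(series: pd.Series) -> str:
    """Map a feature's skew to the detector we'll use for it."""
    s = abs(series.skew())
    if s < 0.5:
        return "normal"      # -> classic z-score
    if s < 1.0:
        return "moderate"    # -> modified z-score
    return "high"            # -> IQR

df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0, 10.2, 10.4, 10.6, 10.8, 11.2],    # symmetric
    "residual sugar": [1.9, 2.6, 2.3, 1.9, 1.8, 20.7, 1.6, 45.0], # long right tail
})
buckets = {col: skew_bucket(df[col]) for col in df.columns}
```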
Normal (ish) data:
Moderately skewed data:
Highly skewed data:
Before we start modeling, we can make some decisions that will determine how liberal or conservative our models should be when labeling the outliers and anomalies.
std_threshold: This is the number of standard deviations above or below the mean needed to be classified as a z-score outlier.
mod_std_threshold: This is the number of "standard deviations" above or below the median needed to be classified as a modified z-score outlier. Since modified z-scores only approximate true z-scores, we make a slightly liberal adjustment to this threshold.
iqr_threshold: If a data point falls more than this many IQRs below the first quartile or above the third quartile, then it is classified as an outlier.
contamination: This is a required parameter for PyOD: the expected proportion of outliers in your data. This is often not known in advance, so we will instead rely on the models' predicted probabilities; however, we still need a value for the function to run.
anomaly_prob_threshold: As I stated, we will use the predicted probabilities from the ML models rather than contamination. To do this we will need to set a probability threshold to classify an anomaly.
anomaly_voting_threshold: Since this is an ensemble of 5 ML algorithms, the models will vote in order to classify a record as an anomaly. A record needs this many votes to be flagged.
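A hypothetical parameter block along these lines – the specific values are illustrative starting points, not the article's actual settings:

```python
# Illustrative values; tune these to taste.
params = {
    "std_threshold": 3.0,           # z-score cutoff for ~normal features
    "mod_std_threshold": 3.5,       # modified z-score cutoff (slightly looser)
    "iqr_threshold": 1.5,           # multiplier on the IQR fences
    "contamination": 0.05,          # required by PyOD even if unused downstream
    "anomaly_prob_threshold": 0.9,  # predicted-probability cutoff per model
    "anomaly_voting_threshold": 3,  # votes needed out of the 5 models
}
```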
Since we are going to run some ML algorithms, they will benefit from standardized data. It's a quick but necessary step; let's just not forget to invert this transformation later.
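A minimal sketch with scikit-learn's StandardScaler, which conveniently keeps the fitted means and scales around for the inverse transform later:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few toy rows of (fixed acidity, volatile acidity)
X = np.array([[7.4, 0.70], [7.8, 0.88], [6.7, 0.58], [7.2, 0.65]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)               # each column: mean 0, std 1
# ... fit the outlier/anomaly models on X_scaled ...
X_restored = scaler.inverse_transform(X_scaled)  # back to original units
```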
Now we can start labeling some outliers. Since this dataset has only one categorical variable, wine type, and the numerical features are likely highly dependent on this category, it makes sense to first group by the wine type variable and then classify outliers. The cleanest way to do this is to create a function and then use pandas .groupby().apply().
We are going to give the record as a whole an outlier label as well as each individual column. This way we can later identify which feature or features were responsible for the record being an outlier. We will also check that a feature has more than two unique values before we start labeling outliers; it's a quick and dirty way to make sure that binary columns don't get labeled. First I'll start by labeling z-score and modified z-score outliers.
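A simplified sketch of that pattern, using only a plain z-score and a hypothetical label_outliers helper (the real function also dispatches to modified z-scores and the IQR method):

```python
import pandas as pd

def label_outliers(group: pd.DataFrame) -> pd.DataFrame:
    group = group.copy()
    for col in group.select_dtypes("number").columns:
        if group[col].nunique() <= 2:   # skip binary/constant columns
            continue
        z = (group[col] - group[col].mean()) / group[col].std()
        group[f"{col}_outlier"] = (z.abs() > 3).astype(int)
    return group

df = pd.DataFrame({
    "type": ["red"] * 12 + ["white"] * 3,
    "fixed acidity": [7.4, 7.8, 7.6, 7.5, 7.3, 7.7, 7.45, 7.55,
                      7.65, 7.35, 7.5, 15.0,   # red, one extreme value
                      6.8, 7.0, 6.9],          # white, all typical
})
flagged = df.groupby("type", group_keys=False).apply(label_outliers)
```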
Z-Score Outlier: This simply measures how far a data point is from the average (mean) of its dataset, expressed in terms of standard deviations. A positive Z-score means the point is above the average, a negative one means it's below, and the larger the absolute Z-score, the more unusual the data point. Typically, data points with an absolute Z-score greater than 3 are considered outliers, as they are very rare in normally distributed data. About 99.7% of all data points fall within 3 standard deviations of the mean.
Modified Z-Score Outlier: Modified Z-scores offer a more robust way to identify outliers, especially in moderately skewed data, by replacing the mean and standard deviation with measures less influenced by extreme values. Specifically, this method calculates how far a data point is from the median (a more robust measure of central tendency than the mean) and scales this difference by the Median Absolute Deviation (MAD) (a more robust measure of spread than the standard deviation), using a scaling factor of 0.6745 to make it comparable to a standard Z-score for normally distributed data. Data points with an absolute modified Z-score exceeding a threshold, which in our case is adjusted to (std_threshold + 0.5), are then flagged as outliers.
IQR Outlier: For highly skewed data, the Interquartile Range (IQR) method flags outliers as points falling beyond a custom threshold. However, if the IQR is 0 (meaning at least 50% of the data points are identical), this method becomes ineffective. In such extreme cases, this approach intelligently switches to a log-transformed modified Z-score: by adding a shift to handle zero or negative values and then applying np.log1p, the data's skewness is reduced, allowing the robust modified Z-score (calculated using the median and Median Absolute Deviation from the log-transformed values) to effectively identify outliers, again using our adjusted threshold of (std_threshold + 0.5).
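Putting the three methods together, a self-contained sketch of the skew-conditional dispatch might look like this. The skew cutoffs of 0.5 and 1.0 are conventional assumptions, and flag_outliers is a hypothetical name, not the article's actual function:

```python
import numpy as np
import pandas as pd

def flag_outliers(x: pd.Series, std_threshold=3.0, iqr_threshold=1.5) -> pd.Series:
    """Pick a detector based on skew: z-score, modified z-score, or IQR."""
    skew = abs(x.skew())
    if skew < 0.5:                      # ~normal: classic z-score
        z = (x - x.mean()) / x.std()
        return z.abs() > std_threshold
    if skew < 1.0:                      # moderate skew: modified z-score
        mad = (x - x.median()).abs().median()
        mod_z = 0.6745 * (x - x.median()) / mad
        return mod_z.abs() > (std_threshold + 0.5)
    q1, q3 = x.quantile([0.25, 0.75])   # high skew: IQR fences
    iqr = q3 - q1
    if iqr == 0:                        # degenerate case: log-transform, then
        shifted = np.log1p(x - x.min()) # fall back to the modified z-score
        mad = (shifted - shifted.median()).abs().median()
        if mad == 0:
            return pd.Series(False, index=x.index)
        mod_z = 0.6745 * (shifted - shifted.median()) / mad
        return mod_z.abs() > (std_threshold + 0.5)
    lo, hi = q1 - iqr_threshold * iqr, q3 + iqr_threshold * iqr
    return (x < lo) | (x > hi)
```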
Each feature will have a binary outlier flag column and finally we can also count these labels for all of the features to create a column named total outliers. We'll also create an outlier score column here as a percentage of the number of outlier flags from the total number of features.
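Counting the per-feature flags into those two summary columns could be as simple as (toy flag columns shown):

```python
import pandas as pd

df = pd.DataFrame({
    "pH_outlier":      [0, 1, 0],
    "alcohol_outlier": [0, 1, 1],
})
flag_cols = [c for c in df.columns if c.endswith("_outlier")]
df["total_outliers"] = df[flag_cols].sum(axis=1)
df["outlier_score"] = df["total_outliers"] / len(flag_cols)
```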
Now that we have assigned univariate outlier labels to each feature with our statistical framework, we can look at the features all at once and assign multivariate anomaly flags with an ensemble of machine learning models from PyOD. Some of these models require a bit of extra care, especially since we are grouping by a category. Basically, we'll need to make sure that there are enough members in each group before assigning any labels or flags. We'll use a parameter to set the minimum limit per group.
If there are enough members per group then we can fit some machine learning models, starting with ECOD (my favorite).
ECOD (Empirical-Cumulative-distribution-based Outlier Detection), is an unsupervised outlier detection algorithm designed to be simple, efficient, and interpretable. It operates on the principle that outliers are "rare events" found in the tails of a data distribution. For each feature in your dataset, ECOD estimates its empirical cumulative distribution function non-parametrically. It then uses these estimated distributions to calculate a "tail probability" for each data point per dimension, essentially assessing how extreme a point is in each individual feature. Finally, it aggregates these tail probabilities across all dimensions (typically by summing their logarithms, assuming independence) to compute a single outlier score, with higher scores indicating a higher likelihood of being an anomaly. A key advantage of ECOD is that it's largely parameter-free, reducing the need for extensive hyperparameter tuning.
LOF (Local Outlier Factor), is a density-based unsupervised outlier detection algorithm that identifies outliers by measuring how isolated a data point is with respect to its local neighborhood. Unlike global methods that look at the entire dataset's density, LOF calculates the "local reachability density" for each point, which is essentially the inverse of the average distance to its k-nearest neighbors. It then compares this local density to the local densities of its neighbors. If a data point's local density is significantly lower than the average local density of its neighbors (resulting in an LOF score notably greater than 1), it indicates that the point resides in a sparser region than its surroundings, thus classifying it as a local outlier. This "local" comparison makes LOF particularly effective at identifying outliers in datasets where clusters of varying densities exist.
iForest (Isolation Forest) is an unsupervised anomaly detection algorithm that works on the principle that outliers are "few and different," making them easier to isolate than normal observations. It constructs an ensemble of isolation trees, where each tree is built by recursively partitioning the data using random feature selections and random split points until individual data points are isolated. Anomalies, being less frequent and having distinct characteristics, typically require fewer random partitions (i.e., shorter paths from the root to the leaf node in the tree) to be isolated compared to normal data points. The anomaly score for an instance is then derived from its average path length across all trees in the forest: instances with shorter average path lengths are assigned higher anomaly scores, indicating a greater likelihood of being an outlier.
OCSVM (One-Class Support Vector Machine) is an unsupervised anomaly detection algorithm that works by learning a boundary around the "normal" data points in the feature space. Unlike traditional SVMs that separate two classes, OCSVM focuses on modeling the distribution of a single class (the normal data) by finding a hyperplane that best separates these normal observations from the origin (or from the empty space around them). Data points that fall outside this learned boundary or lie on the "outlier side" of the hyperplane are considered anomalies. The core idea is to find a small region that encompasses most of the normal data, with any points outside this region being flagged as outliers.
AutoEncoder is a type of neural network designed for unsupervised learning, specifically to learn a compressed, lower-dimensional representation of its input data. It achieves this by having an "encoder" part that maps the input to a reduced dimension (the latent space) and a "decoder" part that attempts to reconstruct the original input from this compressed representation. When used for anomaly detection, the AutoEncoder is typically trained extensively only on normal, non-anomalous data. The underlying assumption is that if the model has only seen and learned to reconstruct normal patterns, it will perform poorly when attempting to reconstruct an anomalous data point. Therefore, data points with a significantly high "reconstruction error" (the difference between the original input and its reconstructed output) are flagged as anomalies, as they deviate from the patterns the AutoEncoder was trained to understand. For this algorithm the number of features must be larger than the number of neurons in the hidden layer of the neural network so we'll make some adjustments if that's not the case.
And finally, with all of our ensemble models fit and classified, we can count up the flags. We will also create an anomaly score: the number of anomaly votes divided by the number of ML algorithms. Since we have predicted probabilities, we can also calculate the average predicted probability for each record. These are all great ways to filter records later on; we're just giving ourselves some options to investigate these anomalies.
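With each model's probabilities in hand, the voting and scoring step is plain pandas. The probability values below are made up for illustration:

```python
import pandas as pd

# Hypothetical per-model outlier probabilities for 4 records
probs = pd.DataFrame({
    "ecod":    [0.10, 0.95, 0.92, 0.40],
    "iforest": [0.05, 0.91, 0.30, 0.93],
    "lof":     [0.20, 0.97, 0.94, 0.91],
    "ocsvm":   [0.15, 0.99, 0.96, 0.10],
    "autoenc": [0.08, 0.90, 0.93, 0.95],
})
anomaly_prob_threshold = 0.9
anomaly_voting_threshold = 3

votes = (probs >= anomaly_prob_threshold).sum(axis=1)  # flags per record
anomaly_flag = (votes >= anomaly_voting_threshold).astype(int)
anomaly_score = votes / probs.shape[1]                 # share of models voting
avg_prob = probs.mean(axis=1)                          # mean predicted probability
```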
Similar to the outlier function, we will group by this anomaly function as well, then classify anomalies per group.
Just adding one last flag before we wrap up. A flag to show whether each record had an outlier flag, anomaly flag, both flags, or neither. Then inverse scaling back the metrics to their original scale.
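One way to build that combined flag is np.select; the two input flag columns here are toy stand-ins, and the inverse scaling would reuse the scaler fitted during pre-processing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"outlier_flag": [0, 1, 0, 1],
                   "anomaly_flag": [0, 0, 1, 1]})
conditions = [
    (df["outlier_flag"] == 1) & (df["anomaly_flag"] == 1),
    df["outlier_flag"] == 1,
    df["anomaly_flag"] == 1,
]
choices = ["both", "outlier", "anomaly"]
df["outlier_or_anomaly_flag"] = np.select(conditions, choices, default="inlier")
```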
Most of the time the preferred action with outliers and anomalies is to investigate them, not drop them. You'll learn a lot about the data generating process and might reveal some issues upstream. Now that we have everything flagged a good first step is to understand at a high level how many outliers and anomalies were flagged.
Taking this a step further, we can utilize a Venn diagram with the venny4py library. A three-circle diagram might be overkill here because a data point cannot be an Inlier and an Outlier at the same time. You can adjust this to be 2-4 circles simply by commenting out part of that dictionary or adding a line if needed.
I'm curious which record was an anomaly but not flagged as an outlier. This is a sneaky one! And we get to reap the benefits of adding that outlier_or_anomaly_flag column earlier.
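With that column in place, pulling out the anomaly-only records is a one-line filter (toy data shown):

```python
import pandas as pd

df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 6.7],
    "outlier_or_anomaly_flag": ["inlier", "anomaly", "both"],
})
# Records flagged by the ML ensemble but not by any univariate check
sneaky = df[df["outlier_or_anomaly_flag"] == "anomaly"]
```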
That was a sufficient visualization for a high level overview of our flags. Next we can try another approach toward visualization to help us take a more granular look at the outlier flags for each feature individually: a box-and-swarm plot. This sounds fancy but it's just a swarm plot with a box-and-whisker plot in the background. We'll use colors here to help differentiate the inliers from the outliers. You can also add in another color for anomalies if you are ever interested in seeing those here as well. Since the data was grouped by wine type and then flagged, we'll visualize these one group at a time. Another thing that can be helpful is to sample the inliers, since we don't need to see each one and we don't want to wait around while 10,000 data points get plotted. When there are too many, they can flood the chart and make it hard to read.
The inner loop here creates a few different plots for outliers or inliers and also the box and whisker. The rest is just formatting.
Here's a sample from the output. Play around with the size of the dots if you want, also maybe add in the anomalies.
I find these most helpful for adjusting my model parameters (std_threshold, etc.). Now that we can see what's being classified and where it lies relative to the median and the rest of the data, we can make a human-in-the-loop style adjustment with a simple where() function.
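For example, if the plot suggests a flag was a touch too aggressive for one feature, a manual override might look like this (values and cutoff are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"alcohol": [9.5, 13.9, 14.8],
                   "alcohol_outlier": [0, 1, 1]})
# After eyeballing the plot, decide 13.9 is acceptable: only flag above 14
df["alcohol_outlier"] = np.where(df["alcohol"] > 14, 1, 0)
```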
That's it. Many think that outliers are supposed to be objective, but I find them to be a really personal thing. Have fun with this template, make it your own, and tune it to your taste.