
Deploying our models, whether in Excel or in code, is rarely a perfect exercise.
Beyond the bugs that creep in and devour time, models can often be just plain wrong. A prediction never comes true. Our model is dismissed or ignored on the next go-around. We, the engineer, analyst, data owner, or blogger, are left with only questions. What went wrong?
Several issues could explain the misfire. Bias may be present, or the data may simply not be “good” enough. But one often-overlooked cause of a misfiring analysis is data leakage.
What is leaky data? How do we fix it?
Let’s start with a definition of leaky data. Data leakage is also called “target leakage”: the unintentional introduction of information about the target that shouldn’t yet be available. This “contamination” can lead to misleading analysis, over-optimistic expectations, and ultimately disappointment when the model hits reality.
There are five common causes of data leakage. We’ll review each one and consider what to do about it.
1. The output is a function of an input
Suppose the target is a direct function of one of the inputs. In that case, the analysis isn’t really doing anything special, and, crucially, that input should either become the target of your analysis or be removed entirely.
Here’s a quick example. As analysts, we’ve been hired to predict next month’s sales for a dog food company that sells only one type of dog food. The Excel sheet we’re handed contains several columns, including items sold and sale price. We perform some fancy modelling, and the end result is a perfect prediction. How? The next month’s sales are a function of the number of dog food bags sold and the sale price.
In this simple example, it would be more useful to predict the quantity sold (the demand) rather than total monthly sales, since sales is a direct function of that feature. Without adjustment, the sales target leaks straight out of the quantity sold and the sale price.
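A quick sanity check can catch this early. Here is a minimal sketch, assuming hypothetical column names (units_sold, sale_price, monthly_sales) and a hypothetical file name; it simply tests whether the target can be reconstructed exactly from the inputs.

```python
import numpy as np
import pandas as pd

# Hypothetical columns for the dog food example: "units_sold", "sale_price",
# and the target "monthly_sales". Adjust to match your own sheet.
df = pd.read_excel("dog_food_sales.xlsx")

# If the target can be rebuilt exactly from the inputs, the model is only
# rediscovering arithmetic, not predicting anything.
reconstructed = df["units_sold"] * df["sale_price"]
if np.allclose(reconstructed, df["monthly_sales"]):
    print("Target is a direct function of the inputs;")
    print("predict demand (units_sold) instead, or drop the leaky columns.")
```

If the check fires, reframe the problem around demand or remove the offending columns before modelling.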
The other form a function can take is a direct duplicate. Occasionally the data you train on and the data you test on share duplicate records. This often occurs with natural-language machine learning models where, for instance, a spammer has posted an identical message many times. Reviewing and removing duplicates is a useful preprocessing step to counter this potential data leakage. Strictly speaking, this is training-test leakage rather than target leakage, but it is just as damaging at deployment.
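As a rough sketch of that preprocessing step, assuming a hypothetical dataset with a message column, deduplicating before the split keeps identical rows from straddling the training and test sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("messages.csv")  # hypothetical spam/ham dataset

# Drop exact duplicates (e.g., a spammer's identical posts) before splitting,
# so the same row cannot land in both the training and the test set.
deduped = df.drop_duplicates(subset=["message"])

train_df, test_df = train_test_split(deduped, test_size=0.2, random_state=42)
```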
2. The output is hidden within an input
Sometimes the answer is right in front of your eyes, like a data version of Where’s Waldo. That’s the case when the target is hidden within one of the features of your dataset.
Imagine that we’re employed as data scientists for a cellphone company. We’ve received a large batch of customer data and are asked to cluster the customers by phone plan, the goal being to determine which customers would be best served by switching to which new plan. Inside the dataset is a feature named “Group,” which contains a four-digit sequence that repeats for each specific phone plan.
If this data is factual rather than a prediction from another model, including Group in the analysis leaks the ideal phone plan cluster. Rather than learning weights for the other features, the model will simply latch onto the four-digit sequence in the Group column. The data hid the correct answer in plain sight, forcing us to find Waldo.
The exception to this leakage is when another model or analysis produced the Group input. In that case, we’re stacking models, which can potentially lead to a more robust analysis. More on that some other time.
In this example, removing the leaky source, the Group column, is an easy fix.
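Here is a minimal sketch of the check and the fix, assuming a hypothetical customers file with a Group column and a plan label: if each Group value maps to exactly one plan, the column is leaking the answer and can be dropped.

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical file

# If every Group value corresponds to exactly one plan, Group is the answer
# in disguise and should not be fed to the model.
plans_per_group = customers.groupby("Group")["plan"].nunique()
if (plans_per_group == 1).all():
    customers = customers.drop(columns=["Group"])
```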
3. The output is from the future
Much like Marty, Doc, or Biff, data can sometimes travel back from the future. This is a critical form of data leakage that is often difficult to catch without a solid knowledge of the underlying business processes and methods. Business analysts, take note: you are best suited to recognize this leak and help patch it.
The famous example here is analyzing the likelihood that a borrower will repay a loan. The bank has hired us to look at data that includes age, gender, education, salary, marital status, and more. One column is the number of late-payment reminders sent to the borrower.
We complete the analysis and deploy the model. While it performed great on the data we had, it flops in the field.
Why? The late-payment reminders leaked information about the likelihood of defaulting on the loan. On the ground, after deployment, the number of reminders was always zero: the financial officer used the model before any late-payment reminders had been sent.
Domain experts must weigh each feature for relevance and for whether it will actually be available in the deployment environment. Otherwise, a feature may come back from the future to wreak havoc on your analysis.
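One lightweight guard is to keep an explicit list of the features known to exist at decision time and train only on those. The sketch below uses hypothetical column names for the loan example; the point is that late_payment_reminders never reaches the model.

```python
import pandas as pd

loans = pd.read_csv("loans.csv")  # hypothetical file

# Only features that exist at the moment the lending decision is made.
available_at_decision_time = ["age", "education", "salary", "marital_status"]

X = loans[available_at_decision_time]
y = loans["repaid"]
# "late_payment_reminders" is deliberately excluded: it only exists after the
# loan has been issued, i.e., after the prediction would already have been made.
```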
4. Related inputs are scattered during partitioning
One of the principles of building a reliable analysis or model is to divide your available data into training, validation, and test sets. Each has a different use case, which we’ll dive into in a later blog post. For now, think of the training data as what we use to build the model, the validation data as the check on how we’re doing as we iterate, and the test set as the final pre-deployment check. Generally speaking, data is randomly sorted (partitioned) into these groups.
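For reference, here is a minimal sketch of that random three-way partition with scikit-learn, assuming a hypothetical dataset; the 60/20/20 proportions are just an illustrative choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")  # hypothetical dataset

# Carve off the test set first, then split the remainder into train/validation.
train_val, test = train_test_split(df, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)
# 0.25 of the remaining 80% yields a 60/20/20 train/validation/test split.
```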
A problem arises with grouped information, when rows are related to each other. An example is medical-diagnosis data that includes features about the patient. If one patient’s records are divided randomly between the training, validation, and test sets, a machine learning model might learn to recognize the patient rather than the disease. Otherwise unimportant features become red herrings and, worst of all, the predictions lose accuracy.
This kind of leakage mostly affects machine learning models, where the analysis happens at arm’s length from the human. However, it’s also worth considering for linear or logistic regression performed in Excel or other more human-controlled data analysis.
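The usual fix is to partition by group rather than by row. As a sketch, scikit-learn’s GroupShuffleSplit (GroupKFold works similarly) keeps every record for a given patient on the same side of the split; the patient_id and disease column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

records = pd.read_csv("diagnoses.csv")  # hypothetical file
X = records.drop(columns=["disease", "patient_id"])
y = records["disease"]

# All rows sharing a patient_id land entirely in train or entirely in test,
# so the model cannot memorize patients instead of learning the disease.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=records["patient_id"]))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
```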
5. Imputation of inputs
Data imputation is a well-known technique for managing missing values or extreme outliers. Rather than distort the analysis or remove the example entirely, the missing value is imputed, that is, filled in with a fabricated value. For instance, if the data holds monthly flow rates for a lake and July 2018 is missing, it can be imputed with either the average flow rate across all months or the average flow rate of the other Julys in the dataset.
However, imputation itself can sometimes cause data leakage. When we fill a missing value with an average, the best practice is to compute that average on the training set only. Missing values in the validation and test sets should likewise be imputed with means computed from their own partitions. This keeps the validation and test sets from leaking into the imputation of the training data.
In our July 2018 example, the mean of the training data’s monthly flow rates would be used rather than the mean of all monthly flow rates. If this were a time-series analysis, the Julys before 2018 could be used to calculate the replacement value.
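A minimal pandas sketch of the time-series variant, assuming a hypothetical file with a datetime month column and a flow_rate column: the missing July 2018 value is filled using only the Julys that came before it.

```python
import pandas as pd

# Hypothetical monthly lake flow rates with a datetime "month" column.
flows = pd.read_csv("lake_flows.csv", parse_dates=["month"], index_col="month")

# Impute July 2018 using only Julys that came before it, so nothing from the
# future (or from a held-out partition) leaks into the filled-in value.
past_julys = flows[(flows.index.month == 7) & (flows.index.year < 2018)]
july_2018 = (flows.index.year == 2018) & (flows.index.month == 7)
flows.loc[july_2018, "flow_rate"] = past_julys["flow_rate"].mean()
```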
This is Part 2 of a series on data literacy.
