
Deploying our models, whether in Excel or in code, is rarely a perfect exercise.
Beyond the bugs that creep in and devour time, models can often be just plain wrong. A prediction never comes true. Our model is dismissed or ignored on the next go-around. We, the engineer, analyst, data owner, or blogger, are left with only questions. What went wrong?
Several issues could explain the misfire. Bias may be present, or the data may simply not be “good” enough. But one often-overlooked cause of a misfiring analysis is data leakage.
What is leaky data? How do we fix it?
Let’s start with a definition of leaky data. Data leakage is also called “target leakage”: the unintentional introduction of information about the target that shouldn’t yet be available. This “contamination” can lead to misleading analysis, over-optimistic expectations, and ultimately disappointment when the model hits reality.
There are five common causes of data leakage. We’ll review each one and consider what to do about it.
1. The output is a function of an input
Suppose the target is a direct function of one of the inputs. In that case, the analysis isn’t really doing anything special, and, crucially, that input should either become the target of your analysis or be removed entirely.
Here’s a quick example. As analysts, we’ve been hired to predict next month’s sales for a dog food company that sells only one type of dog food. The Excel sheet we’re handed contains several columns, including items sold and sale price. We perform some fancy modelling, and the end result is a perfect prediction. How? The next month’s sales are a function of the number of dog food bags sold and the sale price.
In this simple example, it would be more useful to predict the quantity sold (the demand) rather than total monthly sales, since sales is a direct function of that feature. Without adjustment, the sales target leaks straight out of the quantity sold and the sale price.
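A quick sanity check can catch this early. Here is a minimal sketch, assuming hypothetical column names (units_sold, sale_price, monthly_sales) and a hypothetical file name; it simply tests whether the target can be reconstructed exactly from the inputs.

```python
import numpy as np
import pandas as pd

# Hypothetical columns for the dog food example: "units_sold", "sale_price",
# and the target "monthly_sales". Adjust to match your own sheet.
df = pd.read_excel("dog_food_sales.xlsx")

# If the target can be rebuilt exactly from the inputs, the model is only
# rediscovering arithmetic, not predicting anything.
reconstructed = df["units_sold"] * df["sale_price"]
if np.allclose(reconstructed, df["monthly_sales"]):
    print("Target is a direct function of the inputs;")
    print("predict demand (units_sold) instead, or drop the leaky columns.")
```

If the check fires, reframe the problem around demand or remove the offending columns before modelling.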
The other form a function can take is a direct duplicate. Occasionally the data you train on and the data you test on share duplicate records. This often occurs with natural-language machine learning models where, for instance, a spammer has posted an identical message many times. Reviewing and removing duplicates is a useful preprocessing step to counter this potential data leakage. Strictly speaking, this is training-test leakage rather than target leakage, but it is just as damaging at deployment.
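As a rough sketch of that preprocessing step, assuming a hypothetical dataset with a message column, deduplicating before the split keeps identical rows from straddling the training and test sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("messages.csv")  # hypothetical spam/ham dataset

# Drop exact duplicates (e.g., a spammer's identical posts) before splitting,
# so the same row cannot land in both the training and the test set.
deduped = df.drop_duplicates(subset=["message"])

train_df, test_df = train_test_split(deduped, test_size=0.2, random_state=42)
```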
2. The output is hidden within an input
Sometimes the answer is right in front of your eyes, like a data version of Where’s Waldo. That’s the case when the target is hidden within one of the features of your dataset.
Imagine that we’re employed as data scientists for a cellphone company. We’ve received a large batch of customer data and are asked to cluster the customers by phone plan, the goal being to determine which customers would be best served by switching to which new plan. Inside the dataset is a feature named “Group,” which contains a four-digit sequence that repeats for each specific phone plan.
If this data is factual rather than a prediction from another model, including Group in the analysis leaks the ideal phone plan cluster. Rather than learning weights for the other features, the model will simply latch onto the four-digit sequence in the Group column. The data hid the correct answer in plain sight, forcing us to find Waldo.
The exception to this leakage is when another model or analysis produced the Group input. In that case, we’re stacking models, which can potentially lead to a more robust analysis. More on that some other time.
In this example, removing the leaky source, the Group column, is an easy fix.
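Here is a minimal sketch of the check and the fix, assuming a hypothetical customers file with a Group column and a plan label: if each Group value maps to exactly one plan, the column is leaking the answer and can be dropped.

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical file

# If every Group value corresponds to exactly one plan, Group is the answer
# in disguise and should not be fed to the model.
plans_per_group = customers.groupby("Group")["plan"].nunique()
if (plans_per_group == 1).all():
    customers = customers.drop(columns=["Group"])
```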
3. The output is from the future
Much like Marty, Doc, or Biff, data can sometimes travel back from the future. This is a critical form of data leakage that is often difficult to catch without a solid knowledge of the underlying business processes and methods. Business analysts, take note: you are best suited to recognize this leak and help patch it.
The famous example here is analyzing the likelihood that a borrower will repay a loan. The bank has hired us to look at data that includes age, gender, education, salary, marital status, and more. One column is the number of late-payment reminders sent to the borrower.
We complete the analysis and deploy the model. While it performed great on the data we had, it flops in the field.
Why? The late-payment reminders leaked information about the likelihood of defaulting on the loan. On the ground, after deployment, the number of reminders was always zero: the financial officer used the model before any late-payment reminders had been sent.
Domain experts must weigh each feature for relevance and for whether it will actually be available in the deployment environment. Otherwise, a feature may come back from the future to wreak havoc on your analysis.
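One lightweight guard is to keep an explicit list of the features known to exist at decision time and train only on those. The sketch below uses hypothetical column names for the loan example; the point is that late_payment_reminders never reaches the model.

```python
import pandas as pd

loans = pd.read_csv("loans.csv")  # hypothetical file

# Only features that exist at the moment the lending decision is made.
available_at_decision_time = ["age", "education", "salary", "marital_status"]

X = loans[available_at_decision_time]
y = loans["repaid"]
# "late_payment_reminders" is deliberately excluded: it only exists after the
# loan has been issued, i.e., after the prediction would already have been made.
```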
4. Related inputs are scattered during partitioning
One of the principles of building a reliable analysis or model is to divide your available data into training, validation, and test sets. Each has a different use case, which we’ll dive into in a later blog post. For now, think of the training data as what we use to build the model, the validation data as the check on how we’re doing as we iterate, and the test set as the final pre-deployment check. Generally speaking, data is randomly sorted (partitioned) into these groups.
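For reference, here is a minimal sketch of that random three-way partition with scikit-learn, assuming a hypothetical dataset; the 60/20/20 proportions are just an illustrative choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")  # hypothetical dataset

# Carve off the test set first, then split the remainder into train/validation.
train_val, test = train_test_split(df, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.25, random_state=42)
# 0.25 of the remaining 80% yields a 60/20/20 train/validation/test split.
```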
A problem arises with grouped information, when rows are related to each other. An example is medical-diagnosis data that includes features about the patient. If one patient’s records are divided randomly between the training, validation, and test sets, a machine learning model might learn to recognize the patient rather than the disease. Otherwise unimportant features become red herrings and, worst of all, the predictions lose accuracy.
This kind of leakage mostly affects machine learning models, where the analysis happens at arm’s length from the human. However, it’s also worth considering for linear or logistic regression performed in Excel or other more human-controlled data analysis.
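The usual fix is to partition by group rather than by row. As a sketch, scikit-learn’s GroupShuffleSplit (GroupKFold works similarly) keeps every record for a given patient on the same side of the split; the patient_id and disease column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

records = pd.read_csv("diagnoses.csv")  # hypothetical file
X = records.drop(columns=["disease", "patient_id"])
y = records["disease"]

# All rows sharing a patient_id land entirely in train or entirely in test,
# so the model cannot memorize patients instead of learning the disease.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=records["patient_id"]))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
```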
5. Imputation of inputs
Data imputation is a well-known technique for managing missing values or extreme outliers. Rather than distort the analysis or remove the example entirely, the missing value is imputed, that is, filled in with a fabricated value. For instance, if the data holds monthly flow rates for a lake and July 2018 is missing, it can be imputed with either the average flow rate across all months or the average flow rate of the other Julys in the dataset.
However, imputation itself can sometimes cause data leakage. When we fill a missing value with an average, the best practice is to compute that average on the training set only. Missing values in the validation and test sets should likewise be imputed with means computed from their own partitions. This keeps the validation and test sets from leaking into the imputation of the training data.
In our July 2018 example, the mean of the training data’s monthly flow rates would be used rather than the mean of all monthly flow rates. If this were a time-series analysis, the Julys before 2018 could be used to calculate the replacement value.
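A minimal pandas sketch of the time-series variant, assuming a hypothetical file with a datetime month column and a flow_rate column: the missing July 2018 value is filled using only the Julys that came before it.

```python
import pandas as pd

# Hypothetical monthly lake flow rates with a datetime "month" column.
flows = pd.read_csv("lake_flows.csv", parse_dates=["month"], index_col="month")

# Impute July 2018 using only Julys that came before it, so nothing from the
# future (or from a held-out partition) leaks into the filled-in value.
past_julys = flows[(flows.index.month == 7) & (flows.index.year < 2018)]
july_2018 = (flows.index.year == 2018) & (flows.index.month == 7)
flows.loc[july_2018, "flow_rate"] = past_julys["flow_rate"].mean()
```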
This is Part 2 of a series on data literacy.
