Patching the leaks in your data

Deploying our models, whether in Excel or in code, is rarely a perfect exercise.

Beyond the bugs that creep in and devour time, models can often be just plain wrong. A prediction never comes true. Our model is dismissed or ignored on the next go-around. We — the engineer, analyst, data owner, or blogger — are left with only questions. What went wrong?

Several issues could explain the misalignment of our analysis. Bias may be present. Or the data is simply not “good” enough. But one often-overlooked cause of a misfiring analysis is data leakage.

What is leaky data? How do we fix it?

Let’s start with a definition of leaky data. Data leakage, also called “target leakage,” refers to the unintentional introduction of information about the target that shouldn’t yet be available. This “contamination” can lead to misleading analysis, over-optimistic expectations, and ultimately disappointment when the model hits reality.

There are five common causes of data leakage. We’ll review each one and consider what to do about it.

1. The output is a function of an input

Suppose the target is a direct function of one of the inputs. In that case, the analysis isn’t really predicting anything; it’s just recomputing a formula. Crucially, that input should either become the target of your analysis or be removed entirely.

Here’s a quick example. As analysts, we’ve been hired to predict next month’s sales for a company that sells only one type of dog food. The Excel sheet we’re handed includes several columns, including items sold and sale price. We perform some fancy modelling, and the end result is a perfect prediction. How? In our data, each month’s sales are simply the number of dog food bags sold multiplied by the sale price, and both of those columns are sitting right in the sheet.

In this simple example, it would be more useful to predict demand (the quantity sold) than total monthly sales, since sales are a direct function of that quantity and the price. Without that adjustment, the sales “prediction” simply leaks from the demand and sale price columns.
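
To make this concrete, here is a minimal sketch in Python, with invented column names and numbers. Because the target is computed directly from two of the inputs, even a plain linear model scores near-perfectly on it; that score says nothing about real forecasting skill.

```python
# Sketch of the dog food example: "monthly_sales" is just quantity * price,
# so a linear model fits it almost exactly. A near-perfect in-sample score is
# a symptom of leakage, not of skill. All numbers here are invented.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "bags_sold": rng.integers(100, 1000, size=24),    # monthly quantity
    "sale_price": rng.uniform(18.0, 22.0, size=24),   # price per bag
})
df["monthly_sales"] = df["bags_sold"] * df["sale_price"]  # target is a pure function of the inputs

X = df[["bags_sold", "sale_price"]]
model = LinearRegression().fit(X, df["monthly_sales"])
print(f"R^2 on the leaky target: {model.score(X, df['monthly_sales']):.3f}")
# ~0.99+: the model has "learned" arithmetic, not next month's demand.
```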

The other form a function can take is a direct duplicate. Occasionally the data you train on and the data you test on share duplicate rows. This often occurs with natural language machine learning models where, for instance, a spammer has posted identical messages many times. Reviewing duplicates is a useful preprocessing step to counter this potential data leakage. Strictly speaking, this is train-test leakage rather than target leakage, but it is just as damaging at deployment.
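
A quick way to guard against that, sketched below with made-up messages, is to drop exact duplicates before splitting the data so the same row can’t land in both the training and test sets.

```python
# Sketch: de-duplicate before splitting so identical rows (e.g. a spammer's
# repeated message) can't appear on both sides of the split.
import pandas as pd
from sklearn.model_selection import train_test_split

messages = pd.DataFrame({
    "text": ["win a free phone", "meeting at 3pm", "win a free phone",
             "lunch tomorrow?", "win a free phone"],
    "is_spam": [1, 0, 1, 0, 1],
})

deduped = messages.drop_duplicates(subset="text")  # remove exact repeats first
train, test = train_test_split(deduped, test_size=0.33, random_state=0)
print(len(messages), "rows ->", len(deduped), "unique rows")
```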

2. The output is hidden within an input

Sometimes the answer is right in front of your eyes, like a data version of Where’s Waldo. That’s the case when the target is hidden within one of the features of your dataset.

Imagine that we’re employed as data scientists for a cellphone company. We’ve received a large quantity of customer data and are asked to cluster customers by phone plan — the goal being to determine which customers would be best served by switching to which new plan. Inside the dataset is a feature named “Group”; it contains a four-digit sequence that repeats for each specific phone plan.

If this data is factual and not a prediction from another model, including Group in the analysis leaks the ideal phone-plan cluster. Rather than learning weights for the other features, the model will simply rely on the four-digit sequence in the Group column. The data hid the correct answer inside a feature, leaving us to find Waldo.

The exception to this leakage is when another model or analysis produced the Group input. In that case, we’re stacking models, which can potentially lead to a more robust analysis. More on that some other time.

In this example, removing the leaky target source — the Group input — is an easy fix.
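
One way to catch this kind of hidden target, sketched below with invented customer data, is to check whether a suspicious column maps one-to-one onto the target before modelling; if it does, drop it.

```python
# Sketch: check whether a feature secretly encodes the target. If every value
# of "Group" maps to exactly one phone plan, the column leaks the answer and
# should be dropped before modelling. The data here is illustrative.
import pandas as pd

customers = pd.DataFrame({
    "Group":   ["4821", "4821", "7730", "7730", "1194"],
    "plan":    ["Unlimited", "Unlimited", "Basic", "Basic", "Family"],
    "minutes": [980, 1200, 150, 220, 640],
})

plans_per_group = customers.groupby("Group")["plan"].nunique()
if (plans_per_group == 1).all():
    print("'Group' maps one-to-one onto the target -- dropping it.")
    features = customers.drop(columns=["Group", "plan"])
```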

3. The output is from the future

Much like Marty, Doc or Biff, sometimes data can travel back from the future. This is a critical form of data leakage that is often difficult to catch without solid knowledge of the underlying business processes and methods. Business analysts take note — you are often best placed to recognize and help patch this kind of leak.

In this case, the classic example is predicting the likelihood that a borrower will repay a loan. The bank has hired us to look at data that includes age, gender, education, salary, marital status and more. One column is the number of late-payment reminders sent to the borrower.

The analysis is completed, and we deploy the model. While it performed great on the data we had, it flops in the field.

Why? The late-payment reminders leaked information about the likelihood of defaulting on the loan. On the ground, after deployment, the reminder count was always zero: the financial officer used the model before any late-payment reminders had been sent.

Domain experts must review the features used, both for relevance and for whether they will actually be available in the deployment environment. Otherwise, a feature may come back from the future to wreak havoc on your analysis.
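
One lightweight practice is to keep an explicit list of which columns are actually known at prediction time and screen features against it. The sketch below is hypothetical (the column names and the availability map are invented), but it shows the shape of the check a domain expert could help fill in.

```python
# Sketch: screen features for availability at prediction time. The availability
# map is something a domain expert would fill in; names here are hypothetical.
loan_columns = ["age", "education", "salary", "marital_status",
                "late_payment_reminders"]

# Known only *after* the loan is issued -> not available when the model runs.
available_at_prediction = {
    "age": True,
    "education": True,
    "salary": True,
    "marital_status": True,
    "late_payment_reminders": False,
}

usable = [c for c in loan_columns if available_at_prediction.get(c, False)]
dropped = [c for c in loan_columns if c not in usable]
print("train on:", usable)
print("dropped (future information):", dropped)
```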

4. Related inputs are scattered during partitioning

One of the principles of building a reliable analysis or model is to divide your available data into training, validation and test sets. Each has a different use case, which we’ll dive into in a later blog post. For now, think of the training data as what we use to build the model, the validation data as a check on how we’re doing as we iterate, and the test set as a final check before deployment. Generally speaking, data is randomly sorted (partitioned) into these groups.

A problem arises with grouped information, when inputs are related to each other. An example is medical diagnosis data that includes features about the patient. If one patient’s records are divided randomly between the training, validation and test sets, a machine learning model might learn to recognize the patient rather than the disease. Otherwise-unimportant features can then become red herrings and, worst of all, produce inaccurate predictions.
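
The standard remedy is a group-aware split, so that all of a patient’s records land on the same side of the partition. Here is a minimal sketch using scikit-learn’s GroupKFold with made-up patient IDs.

```python
# Sketch: a group-aware split keeps every record from the same patient in one
# fold, so the model can't memorize patient identity. Data here is invented.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(8, 3)                          # e.g. lab results
y = np.array([0, 1, 0, 1, 1, 0, 1, 0])            # diagnosis
patient_id = np.array([1, 1, 2, 2, 3, 3, 4, 4])   # two records per patient

for train_idx, test_idx in GroupKFold(n_splits=2).split(X, y, groups=patient_id):
    # No patient appears on both sides of the split.
    assert set(patient_id[train_idx]).isdisjoint(patient_id[test_idx])
    print("train patients:", sorted(set(patient_id[train_idx])),
          "| test patients:", sorted(set(patient_id[test_idx])))
```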

This kind of leakage is mostly confined to machine learning models, where the analysis is performed at arm’s length from the human. However, it’s also worth considering for linear or logistic regression performed in Excel or other more hands-on data analysis.

5. Imputation of inputs

Data imputation is a well-known technique for managing missing values or extreme outliers. Rather than distort the analysis or remove the example entirely, the missing value is imputed with a substitute. For instance, if the data holds monthly flow rates for a lake and July 2018 is missing, it can be imputed with either the average flow rate across all months or the average flow rate of the other Julys in the dataset.

However, data imputation can itself cause data leakage. When we fill a missing value with an average, the best practice is to calculate that average on the training set only and then reuse it to fill the validation and test sets. Computing the average over the full dataset lets information from the validation and test sets leak into the training data.

In our July 2018 example, the mean monthly flow rate from the training data would be used rather than the mean of all monthly flow rates. If it were a time-series analysis, only the Julys before 2018 might be used to calculate the replacement value.
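
In code, the safest pattern is to fit the imputer on the training split only and reuse its statistics everywhere else. Here is a minimal sketch with scikit-learn’s SimpleImputer and invented flow-rate numbers.

```python
# Sketch: leakage-free imputation. The mean is learned from the training rows
# only and reused on the held-out data. Flow-rate values are invented.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

flow = np.array([[3.1], [2.8], [np.nan], [3.4], [2.9], [np.nan], [3.0], [3.2]])
train, test = train_test_split(flow, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean").fit(train)  # mean from training data only
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)                # same training mean reused here
print("training mean used for all fills:", imputer.statistics_)
```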


This is Part 2 of a series on data literacy.


How will new technology impact the employment model?

As an undergraduate studying Employment Relations at the University of Toronto, I learned about the New Employment Model: a theory of how employment has evolved in the one hundred and fifty years since the Industrial Revolution.

The theory proposed that while employees in the old, industrial model served a lifetime at one company, the new employment model is made up of short stints at many companies. It argued that since employees wanted to switch jobs many times over a career, organizations shouldn’t fight it — after all, they benefit from talent arriving from other organizations.

The result is the now-ubiquitous adage embodied in this cartoon.

But where does the path of this new employment model lead? Is the model itself even “new”, or an evolution of the original employment model? And, presuming that employment is evolving with technological changes, what does the future look like for companies and their employees?

From skill development to skill shifting

In the current, “new” employment model, organizations train their people based on the roles that they have. A software engineer is trained to become a better software engineer. An accountant is trained on new accounting practices. A policy advisor attends a conference with other policy advisors to learn from them.


As automation and technology continue to disrupt, the need for skill development will give way to skill shifting. No longer will it be sufficient to “stay in our lane”; organizations will instead focus on targeted skill development and on shifting our skills to critical areas. No longer will ending our education in our twenties be enough; employees will increasingly need to become lifelong learners. Companies will be a major source of that learning — inside and outside of the current role. What it means to be an accountant, a software engineer, or a policy advisor will evolve with increasing frequency.

This practice is already growing. The reason? Technology is disrupting the way we work in new and profound ways. Machine learning and automation promise to make many tasks redundant, meaning employees will need to transition from those tasks to new ones. So while we might remain a business analyst, what that means in terms of skills will constantly evolve and shift depending on the technology implemented.

From shared space to distributed space

With the COVID-19 pandemic upending regular workspaces, the shift to a distributed workforce that no longer shares a central location has received plenty of spotlight. Zoom, Slack, Trello, and countless other platforms have become household names around the globe. The future of distributed spaces seems closer than ever.

Yet, it’s important to consider the impacts that distributed work has on employees and the employment model. What does it mean for us — the employee — when every company is remote and virtual? What does it mean for the company?

For one, it will amplify the shifting loyalties the new employment model describes. Never co-locating means the ties holding an employee to a certain job — comfort, friends, commutes — are easily transferable. Comfort is in your home office and will follow you between jobs. Friends have more ways to keep in touch than ever before. And no commute is shorter than working from home.

It also means more free time to spend on other pursuits and, crucially, on skills and learning. The hours once spent on a train or bus can now be used to learn new things. For many that will mean hobbies, but for some it will mean employable skills to add to their repertoire.

From accreditations to showcases

As skills and tasks evolve and shift with increasing frequency, static accreditations will become worryingly out of date. This is a growing trend already seen in workplaces around the globe. An MBA received in 2005 is quite different from an MBA received in 2020, and the gap will continue to widen over ever-shorter timeframes.


The ability to showcase skills, however, will remain critical. With the growth of online portfolio platforms, it is becoming easier and easier to share our knowledge and abilities with open audiences. Employers increasingly view projects and portfolios as valid ways of demonstrating experience, not to mention gamified competition communities like Kaggle. It is not a far-fetched prediction to see the 20th century’s paper resume ending its reign.

Employees will increasingly want to be part of projects that add to their skills or validate their abilities. Companies will service those projects with increasing openness and flexibility.

From closed doors to open networks

The scale and flexibility of future-oriented companies require that they begin to open what were previously closed boundaries. No longer limited to their own staff, companies will also engage networks of collaborators who can assist with projects — the same projects those collaborators can, in turn, showcase. This rise can already be witnessed on a number of “gig economy” platforms featuring professionals plying their craft.

Given the benefits to companies — cheaper, more flexible labour — this isn’t a practice likely to subside. And it does offer some benefit to the individuals involved, adding to their showcases. However, there are plenty of challenges on the horizon, including regulatory debate over the definition of an employee to protect vulnerable “gig economy” participants.

Where are we heading?

It’s a fool’s errand to predict the future, but the trends of the late 20th and early 21st centuries seem poised to continue. Increasing turnover, more independent employees, and lifelong learning are already staples of today’s employment model. The future promises to continue those trends and to speed up the evolution of work. The next employment model may last only a decade (or less) before becoming archaic itself, but it may well be defined by employee reskilling, distributed workforces, networks of contributors, and the death of the resume.



What is “good” data?

This is the start of a series of blog posts about the role of data in the future of the workplace, introducing the fundamentals of data engineering and data science for everyone.


The growth of data in the workplace means that, sooner or later and almost regardless of profession, a manager will ask every employee to gather data and draw insights. Easy enough, right? Data is everywhere; it shouldn’t be a problem to find what you need.

Not so fast, a data engineer will quickly tell us — are we sure our data is “good”?

Swap “good” for “clean,” “usable,” “learnable,” or a dozen other keywords, and it’s a familiar story for anyone beginning their data journey. We have a tremendous new analysis in mind, but — hold up — the data might not be “good” enough.

The bad news? They’re right. Data should allow us to draw meaningful conclusions, and “bad” data means our insights and analysis may be not just skewed but outright incorrect. And unfortunately, good data is hard to come by.

What is “good” data for analysis?

Informative

What does it mean for data to be informative? Isn’t all data naturally informative, since by its nature it describes something? Being informative isn’t just about the data demonstrating something; it means there is enough explanatory data to produce meaningful results.

For example, if I were a data scientist and wanted to deploy some machine learning to predict whether you would read this blog post, I would need data on both this article and the previous articles you have read. If I only had a list of names and locations, the model would learn to predict, based on location, whether someone would read it — not whether you in particular would.

Another more common example is the business analyst considering why sales of lipstick spiked last quarter. If there is no correlation in the data (negative or positive), presumably the data isn’t informative enough and is missing some determining feature.

Wide enough coverage

The data you use should cover the task at hand. If you want to rank baseball players, you can’t only have the current MLB players’ names, heights and jersey numbers — at least not to create a ranking based on their play. You wouldn’t have enough data coverage of what you want to learn.

In another example, if you want to build a classifier to cluster or group employees into performance brackets, you need examples (hopefully many) within each group you want to assign. That means your organization would need some top performers, above-average performers, medium performers and low performers.

The coverage problem becomes trickier and trickier as the number of classes grows. A web page classifier assigning one of potentially thousands of topics would need examples for every single topic.
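
A simple habit that helps here is counting the examples per class before training anything. The sketch below uses invented labels, but the check is the same for two classes or two thousand.

```python
# Sketch: a quick coverage check. Count examples per class and flag anything
# missing or very thin before training a classifier. Labels are illustrative.
import pandas as pd

labels = pd.Series(["top", "above_average", "medium", "medium", "low",
                    "above_average", "medium", "top"])
expected = {"top", "above_average", "medium", "low"}

counts = labels.value_counts()
print(counts)
missing = expected - set(counts.index)
thin = counts[counts < 2]
print("missing classes:", missing or "none")
print("classes with fewer than 2 examples:", list(thin.index) or "none")
```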

Real inputs

The data should also reflect the reality of the inputs. That isn’t a question of the data’s validity — presumably, the data is based in reality and wasn’t generated randomly. Instead, it asks whether, when you apply your model or analysis to the next problem, the data includes all the real inputs it will face.

The classic example of real inputs is “cat, dog, raccoon.” If you train a classifier to predict whether an image is a cat or a dog, it will perform very well on pictures of cats and dogs. But feed it a picture of a raccoon, and it will still try to decide whether the raccoon is a cat or a dog.
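
One common mitigation, sketched below with toy numbers rather than real images, is to refuse to choose when the classifier isn’t confident enough, returning “unknown” instead of forcing a cat-or-dog answer. The 0.8 threshold is an arbitrary illustration.

```python
# Sketch: decline to classify low-confidence inputs instead of forcing a
# prediction onto something the model never saw. Features are toy numbers.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y_train = np.array(["cat", "cat", "dog", "dog"])
clf = LogisticRegression().fit(X_train, y_train)

raccoon = np.array([[0.5, 0.5]])  # looks like neither class
probs = clf.predict_proba(raccoon)[0]
label = clf.classes_[probs.argmax()] if probs.max() > 0.8 else "unknown"
print(dict(zip(clf.classes_, probs.round(2))), "->", label)
```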

In business, this often looks like forcing an analysis from one industry or business model onto another. An analysis trained on data from a wholesale food manufacturer will perform poorly on a SaaS business: the analyst never built the initial model on the real inputs it now faces, such as customer churn or monthly revenue.

Unbiased

Data bias is a well-documented phenomenon that deserves its own post to address properly. There is a wide variety of biases that can be present in data that will skew the results. Some of the major ones to look out for: selection bias, sampling bias, stereotype bias, experimenter bias.

Unlinked

Keep the data from being a result of the model itself. In other words, avoid the circular-reference error known to bedevil Excel users. If the model outputs a new data point, that data point shouldn’t feed back into the model’s analysis.

For example, if you’re determining which emails are important and you star or otherwise highlight those messages, you should not use clicks as a signal of importance. Presumably, the starred messages will get more clicks precisely because they were highlighted.

Consistently labelled

The label on your data is a defining feature. For instance, given a row of customer data, a business analyst might look at it and say, “high-priority customer.” A more common example is labellers who look at pictures (say of cats, dogs and now raccoons) and assign a label depending on what they see (“cat” for a picture of a cat playing the guitar).

Issues arise when multiple people assign the labels and there are discrepancies in how they do it. With many labellers, the problem becomes pronounced, and the label might differ in grey areas depending on who is labelling the data.
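
If multiple people are labelling, it is worth measuring how often they agree. Here is a minimal sketch using Cohen’s kappa (via scikit-learn) with invented labels; low agreement is a warning that grey-area examples are being labelled inconsistently.

```python
# Sketch: measure label consistency between two annotators with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["cat", "dog", "raccoon", "cat", "dog", "cat"]
annotator_b = ["cat", "dog", "cat",     "cat", "dog", "dog"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, ~0 = chance level
```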

Another issue would be data that evolves with time and will, eventually, shed its old label. A hockey player might have a “rookie” label at the beginning of their career, but they would lose that label after several seasons.

And finally, inconsistent labels can come from not understanding the motives of the individuals generating the data. Suppose a label of importance is assigned based on how fast we respond to an email message. That label may incorrectly mark a note as unimportant just because we were away from the computer for a moment longer.

Big (enough)

Finally, the data should be big … enough. Depending on the problem we’re solving, whether that’s a unique business proposal or a machine learning model for natural language processing, we need enough data to push the result from case-specific to generalized.

The amount varies heavily depending on the use. For some analyses, a few hundred data points are enough to generalize. In other cases, a model requires tens or hundreds of thousands of examples to generalize across problems.
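
One way to judge “enough,” sketched below on a synthetic dataset, is a learning curve: train on progressively larger slices of the data and watch whether the validation score is still improving.

```python
# Sketch: a learning curve to check whether more data would still help.
# Uses scikit-learn's learning_curve on a synthetic classification dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:4d} training examples -> validation accuracy {score:.3f}")
# If the last few scores are still climbing, the data probably isn't big enough yet.
```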


