Manage the data: How missing data biases data-driven decisions

October 26, 2020

This is the sixth post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

In this post, I’ll cover the following:

What is data management?
How does data go missing, featuring examples of disappearing data
What you can do about missing data

How you manage data in order to prepare it for analysis can cause data to go missing and decisions based on the resulting analysis to be biased. With so many ways for data to go missing, there’s just as many chances to address the potential bias that results from missing data at this stage.

What is data management? #

Data management, for the purposes of this post, covers all the steps you take to prepare data after it’s been collected. That includes all the steps you take to answer the following questions:

How do you extract the data from the data source?
What transformations happen to the data to make it easier to analyze?
How is it loaded into the analysis tool?
Is the data normalized against a common information model?
How is the data structured (or not) for analysis?
What retention periods are in place for different types of data?
Who has access to the data?
How do people access the data?
For what use cases are people permitted to access the data?
How is information stored and shared about the data sources?
What information is stored or shared about the data sources?
What upstream and downstream dependencies feed into the data pipeline?

How you answer these questions (if you even consider them at all) can cause data to go missing when you’re managing data.

How does data go missing? #

Data can go missing at this stage in many ways. With so many moving parts from various tooling and transformation steps being taken to prepare data for analysis and make it easier to work with, a lot can go wrong. For example, if you neglect to monitor your dependencies, a configuration change in one system can cause data to go missing from your analysis process.

Disappearing data: missing docs site metrics #

It was just an average Wednesday when my coworker messaged me asking for help with her documentation website metrics search—she thought she had a working search, but it wasn’t showing the results she expected. It was showing her that no one was reading any of her documentation, which I knew couldn’t be true.

As I dug deeper, I realized the problem wasn’t the search syntax, but the indexed data itself. We were missing data!

I reported it to our internal teams, and after some investigation they realized that a configuration change on the docs site had resulted in data being routed to a different index. A configuration change that they thought wouldn’t affect anything ended up causing data to go missing for nearly a week because we weren’t monitoring dependencies crucial to our data management system.

Thankfully, the data was only misrouted and not dropped entirely, but it was a good lesson in how easily data can go missing at this management stage. If you identify the sources you expect to be reporting data, then you can monitor for changes in the data flow. You can also document those sources as dependencies, and ensure that configuration changes include additional testing to ensure the continued fidelity of your data collection and management process.

Disappearing data: data retention settings slip-up #

Another way data can go missing is if you neglect to manage or be aware of default tool constraints that might affect your data.

In this example, I was uploading my music data to the Splunk platform for the first time. I was so excited to analyze the 10 years of historical data. I uploaded the file, set up the field extractions, and got to searching my data. I wrote an all time search to see how my music listening habits had shifted year over year in the past decade—but only 3 years of results were returned. What?!

In my haste to start analyzing my data, I’d completely ignored a warning message about a seemingly-irrelevant setting called max_days_ago. It turns out, this setting is set by default to drop any data older than 3 years. The Splunk platform recognized that I had data in my dataset older than 3 years, but I didn’t heed the warning and didn’t update the default setting to match my data. I ended up having to delete the data I’d uploaded, fix my configuration settings, and upload the data again—without any of it being dropped this time!

This experience taught me to pay attention to how I configure a tool to manage my data to make sure data doesn’t go missing. This happened to me while using the Splunk platform, but it can happen with whatever tool you’re using to manage, transform, and process your data.

As reported by Alex Hern in the Guardian,

“A million-row limit on Microsoft’s Excel spreadsheet software may have led to Public Health England misplacing nearly 16,000 Covid test results”. This happened because of a mismatch in formats and a misunderstanding of the data limitations imposed by the file formats used by labs to report case data, as well as of the software (Microsoft Excel) used to manage the case data. Hern continues, pointing out that “while CSV files can be any size, Microsoft Excel files can only be 1,048,576 rows long – or, in older versions which PHE may have still been using, a mere 65,536. When a CSV file longer than that is opened, the bottom rows get cut off and are no longer displayed. That means that, once the lab had performed more than a million tests, it was only a matter of time before its reports failed to be read by PHE.”

This limitation in Microsoft Excel isn’t the only way that tool limitations and settings can cause data to go missing at the data management stage.

Data transformation: Microsoft wants genes to be dates #

If you’re not using Splunk for your data management and analysis, you might be using Microsoft Excel. It turns out that Microsoft Excel, despite (or perhaps due to) its popularity, can also cause data to go missing due to configuration settings. In the case of some genetics researchers, it turned out that Microsoft Excel was transforming their data incorrectly. The software was transforming certain gene names, such as MAR1 and DEC1, into dates of March 1 and December 1, causing data to go missing from the analysis.

Clearly, if you’re doing genetics research, this is a problem. Your data has been changed, and this missing data will bias any research based on this dataset, because certain genes are now dates!

To handle cases where a tool is improperly transforming data, you have 3 options:

Change the tool that you’re using,
Modify the configuration settings of the tool so that it doesn’t modify your data,
Or modify the data itself.

The genetics researchers ended up deciding to modify the data itself. The HUGO Gene Nomenclature Committee officially renamed 27 genes to accommodate this data transformation error in Microsoft Excel. Thanks to this decision, these researchers have one fewer configuration setting to worry about when helping to ensure vital data doesn’t go missing during the data analysis process.

What can you do about missing data? #

These examples illustrate common ways that data can go missing at the management stage, but they’re not the only ways. What can you do when data goes missing?

Carefully set configurations #

The configuration settings that you use to manage data that you’ve collected can result in events and data points being dropped.

For example, if you incorrectly configure data source collection, you might lose events or parts of events. Even worse, data can go missing if you incorrectly record events due to incorrect line breaking, truncation, time zone, timestamp recognition, or retention settings. Data can go missing inconsistently if all of the nodes of your data management system don’t have identical configurations.

You might cause some data to go missing intentionally. You might choose to drop INFO level log messages and collect only the ERROR messages in an attempt to track just the signal from the noise of log messages, or you might choose to drop all events older than 3 months from all data sources to save money on storage. These choices, if inadequately communicated or documented, can lead to false assumptions or incorrect analyses being performed on the data.

If you don’t keep track of configuration changes and updates, a data source format could change before you update the configurations to manage the new format, causing data to get dropped, misrouted, or otherwise go missing from the process.

If your data analyst is communicating their use cases and questions to you, you can better understand data retention settings according to those use cases, and review the current policies across your datasets and see how they compare for complementary data types.

You can also identify complementary data sources that might help the analyst answer the questions they want to answer, and plan how and when to bring in those data sources to improve the data analysis.

You need to manage dataset transformations just as closely as you do the configurations that manage the data.

Communicate dataset transformations #

The steps you take to transform data can also lead to missing data. If you don’t normalize fields, or if your field normalizations are inconsistently applied across the data or across the data analysts, data can appear to be missing even if it is there. If some data has a field name of http_referrer and the same fields in other data sources are consistently http_referer, the data with http_referrer might appear to be missing for some data analysts when they start the data analysis process.

Normalization can also help you identify where fields might be missing across similar datasets, such as cases where an ID is present in one type of data but not another, making it difficult to trace a request across multiple services.

If the data analyst doesn’t know or remember which field name exists in one dataset, and whether or not it’s the same field as another dataset, data can go missing at the analysis stage—as we saw with my examples of the “rating” field missing from some events and the info field not having a value that I expected in the data analysis post from this series, Analyze the data: How missing data biases data-driven decisions.

In the same vein, if you use vague field names to describe the data that you’ve collected, or dataset names that ambitiously describe the data that you want to be collecting—instead of what you’re actually collecting—data can go missing. Shortcuts like “future-proofing” dataset names can be misleading to data analysts that want to easily and quickly understand what data they’re working with.

The data doesn’t go missing immediately, but you’re effectively causing it to go missing when data analysis begins if data analysts can’t correctly decipher what data they’re working with.

Educate and incorporate data analysis into existing processes #

Another way data can go missing is painfully human. If the people that you expect to analyze the data and use it in their decision-making process don’t know how to use the tool that the data is stored in, well, that data goes missing from the process. Tristan Handy in the dbt blog post Analytics engineering for everyone discusses this problem in depth.

It’s important to not just train people on the tool that the data is stored in, but also make sure that the tool and the data in it are considered as part of the decision-making process. Evangelize what data is available in the tool, and make it easy to interact with the tool and the data. This is a case where a lack of confidence and knowledge can cause data to go missing.

Data gaps aren’t always caused by a lack of data—they can also be caused by knowledge gaps and tooling gaps if people aren’t confident or trained to use the systems with the data in them.

Monitor data strategically #

Everyone wants to avoid missing data, but you can’t monitor what you can’t define. So in order to monitor data to prevent it from going missing, you must define what data you expect to see, both from which sources or at which ingestion volumes.

If you don’t have a way of defining those expectations, then you can’t alert on what’s missing. Start by identifying what you expect, and then quantify what’s missing based on those expectations. For guidance on how to do this in Splunk, see Duane Waddle’s blog post Proving a Negative, as well as the apps TrackMe or Meta Woot!.

Plan changes to the data management system carefully #

It’s also crucial to review changes to the configurations that you use to manage data sources, especially changes to data structures or normalization in data sources. Make sure that you consistently deploy these changes as well, to reduce the chance that some sources collect different data in different ways from other sources for the same data.

Be careful to note downstream and upstream dependencies for your data management system, such as other tools, permissions settings, or network configurations, before making changes, such as an upgrade or a software change.

The simplest way for data to go missing from a data analysis process is when it’s being collected. The next post in the series discusses how data can go missing at the collection stage: Collect the data: How missing data biases data-driven decisions.