Manage the data: How missing data biases data-driven decisions

This is the sixth post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

In this post, I’ll cover the following:

  • What is data management?
  • How does data go missing, featuring examples of disappearing data
  • What you can do about missing data

How you manage data in order to prepare it for analysis can cause data to go missing and decisions based on the resulting analysis to be biased. With so many ways for data to go missing, there’s just as many chances to address the potential bias that results from missing data at this stage.

What is data management?

Data management, for the purposes of this post, covers all the steps you take to prepare data after it’s been collected. That includes all the steps you take to answer the following questions:

  • How do you extract the data from the data source?
  • What transformations happen to the data to make it easier to analyze?
  • How is it loaded into the analysis tool?
  • Is the data normalized against a common information model?
  • How is the data structured (or not) for analysis?
  • What retention periods are in place for different types of data?
  • Who has access to the data? 
  • How do people access the data?
  • For what use cases are people permitted to access the data?
  • How is information stored and shared about the data sources? 
  • What information is stored or shared about the data sources?
  • What upstream and downstream dependencies feed into the data pipeline? 

How you answer these questions (if you even consider them at all) can cause data to go missing when you’re managing data. 

How does data go missing? 

Data can go missing at this stage in many ways. With so many moving parts from various tooling and transformation steps being taken to prepare data for analysis and make it easier to work with, a lot can go wrong. For example, if you neglect to monitor your dependencies, a configuration change in one system can cause data to go missing from your analysis process. 

Disappearing data: missing docs site metrics 

It was just an average Wednesday when my coworker messaged me asking for help with her documentation website metrics search—she thought she had a working search, but it wasn’t showing the results she expected. It was showing her that no one was reading any of her documentation, which I knew couldn’t be true.

As I dug deeper, I realized the problem wasn’t the search syntax, but the indexed data itself. We were missing data! 

I reported it to our internal teams, and after some investigation they realized that a configuration change on the docs site had resulted in data being routed to a different index. A configuration change that they thought wouldn’t affect anything ended up causing data to go missing for nearly a week because we weren’t monitoring dependencies crucial to our data management system. 

Thankfully, the data was only misrouted and not dropped entirely, but it was a good lesson in how easily data can go missing at this management stage. If you identify the sources you expect to be reporting data, then you can monitor for changes in the data flow. You can also document those sources as dependencies, and ensure that configuration changes include additional testing to ensure the continued fidelity of your data collection and management process. 

Disappearing data: data retention settings slip-up 

Another way data can go missing is if you neglect to manage or be aware of default tool constraints that might affect your data. 

In this example, I was uploading my music data to the Splunk platform for the first time. I was so excited to analyze the 10 years of historical data. I uploaded the file, set up the field extractions, and got to searching my data. I wrote an all time search to see how my music listening habits had shifted year over year in the past decade—but only 3 years of results were returned. What?!

In my haste to start analyzing my data, I’d completely ignored a warning message about a seemingly-irrelevant setting called “max_days_ago”. It turns out, this setting is set by default to drop any data older than 3 years. The Splunk platform recognized that I had data in my dataset older than 3 years, but I didn’t heed the warning and didn’t update the default setting to match my data. I ended up having to delete the data I’d uploaded, fix my configuration settings, and upload the data again—without any of it being dropped this time! 

This experience taught me to pay attention to how I configure a tool to manage my data to make sure data doesn’t go missing. This happened to me while using the Splunk platform, but it can happen with whatever tool you’re using to manage, transform, and process your data.

As reported by Alex Hern in the Guardian

“A million-row limit on Microsoft’s Excel spreadsheet software may have led to Public Health England misplacing nearly 16,000 Covid test results”. This happened because of a mismatch in formats and a misunderstanding of the data limitations imposed by the file formats used by labs to report case data, as well as of the software (Microsoft Excel) used to manage the case data. Hern continues, pointing out that “while CSV files can be any size, Microsoft Excel files can only be 1,048,576 rows long – or, in older versions which PHE may have still been using, a mere 65,536. When a CSV file longer than that is opened, the bottom rows get cut off and are no longer displayed. That means that, once the lab had performed more than a million tests, it was only a matter of time before its reports failed to be read by PHE.” 

This limitation in Microsoft Excel isn’t the only way that tool limitations and settings can cause data to go missing at the data management stage. 

Data transformation: Microsoft wants genes to be dates 

If you’re not using Splunk for your data management and analysis, you might be using Microsoft Excel. It turns out that Microsoft Excel, despite (or perhaps due to) its popularity, can also cause data to go missing due to configuration settings. In the case of some genetics researchers, it turned out that Microsoft Excel was transforming their data incorrectly. The software was transforming certain gene names, such as MAR1 and DEC1, into dates of March 1 and December 1, causing data to go missing from the analysis. 

Clearly, if you’re doing genetics research, this is a problem. Your data has been changed, and this missing data will bias any research based on this dataset, because certain genes are now dates! 

To handle cases where a tool is improperly transforming data, you have 3 options:

  • Change the tool that you’re using,
  • Modify the configuration settings of the tool so that it doesn’t modify your data,
  • Or modify the data itself.

The genetics researchers ended up deciding to modify the data itself. The HUGO Gene Nomenclature Committee officially renamed 27 genes to accommodate this data transformation error in Microsoft Excel. Thanks to this decision, these researchers have one fewer configuration setting to worry about when helping to ensure vital data doesn’t go missing during the data analysis process. 

What can you do about missing data? 

These examples illustrate common ways that data can go missing at the management stage, but they’re not the only ways. What can you do when data goes missing?

Carefully set configurations 

The configuration settings that you use to manage data that you’ve collected can result in events and data points being dropped.

For example, if you incorrectly configure data source collection, you might lose events or parts of events. Even worse, data can go missing if you incorrectly record events due to incorrect line breaking, truncation, time zone, timestamp recognition, or retention settings. Data can go missing inconsistently if all of the nodes of your data management system don’t have identical configurations. 

You might cause some data to go missing intentionally. You might choose to drop INFO level log messages and collect only the ERROR messages in an attempt to track just the signal from the noise of log messages, or you might choose to drop all events older than 3 months from all data sources to save money on storage. These choices, if inadequately communicated or documented, can lead to false assumptions or incorrect analyses being performed on the data. 

If you don’t keep track of configuration changes and updates, a data source format could change before you update the configurations to manage the new format, causing data to get dropped, misrouted, or otherwise go missing from the process. 

If your data analyst is communicating their use cases and questions to you, you can better understand data retention settings according to those use cases, and review the current policies across your datasets and see how they compare for complementary data types. 

You can also identify complementary data sources that might help the analyst answer the questions they want to answer, and plan how and when to bring in those data sources to improve the data analysis. 

You need to manage dataset transformations just as closely as you do the configurations that manage the data. 

Communicate dataset transformations

The steps you take to transform data can also lead to missing data. If you don’t normalize fields, or if your field normalizations are inconsistently applied across the data or across the data analysts, data can appear to be missing even if it is there. If some data has a field name of “http_referrer” and the same fields in other data sources are consistently “http_referer”, the data with “http_referrer” data might appear to be missing for some data analysts when they start the data analysis process. 

Normalization can also help you identify where fields might be missing across similar datasets, such as cases where an ID is present in one type of data but not another, making it difficult to trace a request across multiple services. 

If the data analyst doesn’t know or remember which field name exists in one dataset, and whether or not it’s the same field as another dataset, data can go missing at the analysis stage—as we saw with my examples of the “rating” field missing from some events and the info field not having a value that I expected in the data analysis post from this series, Analyze the data: How missing data biases data-driven decisions

In the same vein, if you use vague field names to describe the data that you’ve collected, or dataset names that ambitiously describe the data that you want to be collecting—instead of what you’re actually collecting—data can go missing. Shortcuts like “future-proofing” dataset names can be misleading to data analysts that want to easily and quickly understand what data they’re working with. 

The data doesn’t go missing immediately, but you’re effectively causing it to go missing when data analysis begins if data analysts can’t correctly decipher what data they’re working with. 

Educate and incorporate data analysis into existing processes

Another way data can go missing is painfully human. If the people that you expect to analyze the data and use it in their decision-making process don’t know how to use the tool that the data is stored in, well, that data goes missing from the process. Tristan Handy in the dbt blog post Analytics engineering for everyone discusses this problem in depth. 

It’s important to not just train people on the tool that the data is stored in, but also make sure that the tool and the data in it are considered as part of the decision-making process. Evangelize what data is available in the tool, and make it easy to interact with the tool and the data. This is a case where a lack of confidence and knowledge can cause data to go missing. 

Data gaps aren’t always caused by a lack of data—they can also be caused by knowledge gaps and tooling gaps if people aren’t confident or trained to use the systems with the data in them. 

Monitor data strategically

Everyone wants to avoid missing data, but you can’t monitor what you can’t define. So in order to monitor data to prevent it from going missing, you must define what data you expect to see, both from which sources or at which ingestion volumes. 

If you don’t have a way of defining those expectations, then you can’t alert on what’s missing. Start by identifying what you expect, and then quantify what’s missing based on those expectations. For guidance on how to do this in Splunk, see Duane Waddle’s blog post Proving a Negative, as well as the apps TrackMe or Meta Woot!

Plan changes to the data management system carefully

It’s also crucial to review changes to the configurations that you use to manage data sources, especially changes to data structures or normalization in data sources. Make sure that you consistently deploy these changes as well, to reduce the chance that some sources collect different data in different ways from other sources for the same data. 

Be careful to note downstream and upstream dependencies for your data management system, such as other tools, permissions settings, or network configurations, before making changes, such as an upgrade or a software change.

The simplest way for data to go missing from a data analysis process is when it’s being collected. The next post in the series discusses how data can go missing at the collection stage: Collect the data: How missing data biases data-driven decisions.

Analyze the data: How missing data biases data-driven decisions

This is the fifth post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

When you do data analysis, you’re searching and analyzing your data so that you can answer a specific question with the data so that you can make a decision at the end of the process. Unfortunately, there are many ways that data can go missing while you analyze it. 

In this post, I’ll cover the following

  • Why disaggregation matters when performing data analysis
  • How data goes missing when analyzing data
  • What to do about missing data when analyzing data
  • How simple mistakes can cause data to go missing 

Declining bird populations in North America

Any aggregation that you do in your data analysis can cause data to go missing. I recently finished listening to the podcast series The Last Archive, by Jill Lepore, and episode 9 talked about declining bird populations in North America. I realized that it’s an excellent example of why disaggregating data is vital to making sure that data doesn’t go missing during the analysis process. 

As reported in Science Magazine, scientists have concluded that since 1970, the continent of North America has lost 3 billion birds—nearly 30% of the surveyed total. But what does that mean for specific bird populations? Are some more affected than others?

It’s easy for me to look around San Francisco and think, well clearly that many birds can’t be missing—I’m still surrounded by pigeons and crows, and ducks and geese are everywhere when I go to the park! 

If you don’t disaggregate your results, there’s no way to determine which bird breeds are affected and where. You might assume that 30% of all bird breeds were equally affected across habitats and types. Thankfully, these scientists did disaggregate their results, so we can identify variations that otherwise would be missing from the analysis. 

Screenshot of bar chart from Science magazine, showing bird decline by habitat in percentage. Wetlands gained more than 10%, all other populations declined. Grasslands declined more than 50%.

Screenshot of a bar chart from Science magazine showing decline by type of bird, relevant statistics duplicated in text.

In the case of this study, we can see that some bird populations—Old world sparrows, Larks, and Starlings—are more affected than other types of birds, while others—Raptors, Ducks and geese, and Vireos—have flourished in the past 50 years.

Because the data is disaggregated, you can uncover data that would otherwise be missing from the analysis—how the different types of birds have actually been differently affected, due to habitat loss in grasslands, or cases where restoration and rehabilitation efforts have been effective, such as the resurgence in the population of raptors. 

Without an understanding of which specific bird populations are affected, and where they live, you can’t take as effective action to help bird populations recover, because you’re missing too much data due to an overgeneralized aggregate. Any decisions you took based on the overgeneralized aggregate would be biased and ultimately incorrect. 

In the case of this study, we know that targeted bird population restoration is perhaps most needed in grasslands habitats, like the Midwest where I grew up. 

Unfortunately, the study only covers 76% of all bird breeds, so my city-dwelling self, will just continue to wonder how the bird population has changed since 1970 for pigeons, doves, crows, and others. 

How does data go missing?

An easy way for data to go missing is for incomplete data to be returned when you’re analyzing the data. Many of these examples are Splunk-specific, but are limitations shared by most data analysis tools. 

Truncated results

In Splunk, search results could be truncated for a variety of reasons. Truncated results have visible error messages if you’re running ad hoc searches, but you might not see error messages if the searches are producing scheduled reports. 

If the indexers where the data is stored are unreachable, or your search times out before completing, the results could be incomplete. If you’re using a subsearch, you might hit the default event limit of 10K events, or the timeout limit of 60 seconds and have incomplete subsearch results. If you’re using the join search command, you might hit the 50K row limit that Nick Mealy discusses in his brilliant .conf talk, Master Joining Datasets Without Using Join

If you’re searching an external service from Splunk, for example by using ldapsearch or a custom command to search an external database, you might not get complete results if that service is having problems or if you don’t have access to some data that you’re searching for.

It’s surprisingly easy for data to go missing when you’re correlating datasets. 

Missing and “missing” fields across datasets

If you’re trying to compare datasets and some of the datasets are missing fields, you might accidentally miss data. Without the same field across multiple types of data, it can be difficult to perform valuable or accurate data correlations.

In this containerized cloud-native world, tracing outages across systems can be complex if you don’t have matching identifiers in every dataset that you’re using. As another example, it can be difficult to identify suspicious user session activity without the ability to correlate session identifiers with active users logged into a specific host. 

Sometimes the fields aren’t actually missing, they’re just named differently. Because the data isn’t normalized, or the fields don’t match your naming expectations, they’re effectively missing from your analysis because you can’t find them. 

Missing fields in a dataset that you want to include in your analysis

Sometimes data is missing from specific events within a dataset. For example, I wanted to determine the average rating that I gave songs in my iTunes library. However, iTunes XML files store tracks with no rating (or a zero star rating for my purposes) without a rating field at all in the events. 

Calculating an average with that data missing gives me an average 3 star rating for all the tracks in my iTunes library. 

Screenshot of a Splunk search and visualization, showing a single value result of "3 stars". Splunk search is: `itunes` | search track_name=* | stats count as track_count by rating | stats avg(rating) as average_rating | replace "60" WITH "3 stars" IN average_rating

But if I add the zero-star rated tracks back in, representing those as “tracks I have decided aren’t worthy of a rating” rather than “tracks that have not yet been rated”, the average changes to 2.5 stars per track. 

Screenshot of splunk search and visualization showing "2.5 stars". Splunk search is: `itunes` | search track_name=* | fillnull rating value="0" | stats count as track_count by rating | stats avg(rating) as average_rating | replace "50" WITH "2.5 stars" IN average_rating

If I’d used the results that were missing data, I’d have a biased interpretation of my average song rating in my iTunes library. 

Mismatched numeric units and timezones

You might also have data that goes missing if you’re improperly analyzing data because you don’t know or misinterpret the units that the data is stored in. 

Keep track of whether or not a field contains a time in minutes or seconds and if your network traffic logs are in bytes or megabytes. Data can vanish from your analysis if you improperly compare dissimilar units!

If you convert a time to a more human-readable format and make incorrect assumptions about the time format, such as the time zone that it’s in, you can cause data to go missing from the proper time period. 

Even without transforming your data, if you’re comparing different datasets that store time data with different time zones, data can go missing. You might think that you’re comparing a four hour window from last Monday while debugging an issue across several datasets

It’s also important to note that how you choose to aggregate your data can hide data that is missing from your dataset, or hide data in a way that causes it to go missing. 

Consider the granularity of your aggregations

When you perform your data analysis, consider what time ranges you’re using to bin or aggregate your data, the time ranges that you use to search your data across, for what data points, which fields you use to disaggregate your data, and how that might affect your results. 

For example, it’s important to keep track of how you aggregate events across time periods in your analysis. If the timespans that you use when aggregating your data don’t align with your use case and the question you’re trying to answer, data can go missing. 

In this example, I was trying to convert an existing search that showed me how much music I was listening to by type per year, into a search that would show me how much music I was listening to by weekday by year. Let’s take a look:

Screenshot of very detailed Splunk search and visualization showing time spent listening by weekday and year

This was my initial attempt at modifying the data, and there’s a lot missing. There’s no results at all for Tuesdays in 2019, and the counts for Sundays in 2017, Mondays in 2018, and Thursdays in 2020 were laughably low. What did I do wrong?

It turned out that the time spans I was using in the original search to aggregate my data were too broad for the results I was trying to get. I was doing a timechart with a span of 1 month to start, and then trying to get data down to the a granularity of weekday. That wasn’t going to work! 

I’d caused data to go missing because I didn’t align the question I was trying to answer with the time span aggregation in my analysis. 

Thankfully, it was a quick fix. I updated the initial time grouping to match my desired granularity of one day, and I was no longer missing data from my results! 

Screenshot showing revised results for time spent listening by weekday and year

This is a case where an overly broad timespan aggregation, combined with a lack of consideration for my use case, caused data to go missing. 

You make an error in your analysis process 

How many of you have written a Splunk search with a conditional eval function and totally messed it up? 

Screenshot of a Splunk search. Search: |inputlookup append=t concerthistoryparse.csv | eval show_length=case(info == "festival", "28800", info == "dj set", "14400", info == "concert", "10800")

In this case, I wrote a search to calculate the duration of music listening activities—specifically, calculating an estimated amount of time spent at a music festival, DJ set, or concert—in order to compare how much time I spent listening to music before I was sheltering in place with after. 

I used a conditional “info == concert” to apply an estimated concert length of 3 hours, but no field-value pair of info=concert existed in my data. In fact, concerts had no info field at all. It wasn’t until I’d finished my search, combining 2 other datasets, that I realized a concert I’d attended in March was missing from the results. 

Screenshot of full Splunk search and visualization showing time spent listening to music, going to shows, and listening to livestreams by month. March shows only livestreams and listening activity.

In order to prevent this data from going missing, I had to break my search down into smaller pieces, validating the accuracy of the results at each step against my expectations and knowledge of what the data should reflect. Eventually I realized that I’d caused data to go missing in my analysis process by making the assumption that an info=concert field existed in my data. 

Same screenshot as previous image, but with a corrected search and March now shows one show and livestream and listening activity.

Ironically, this graph is still missing data because Last.fm failed to scrobble tracks from Spotify for several time periods in July and August. 

What can you do about missing data?

If you have missing data in your analysis process, what can you do? 

Optimize your searches and environment

If you’re processing large volumes of events, you can consider expanding limits for data, such as those for subsearches or timeouts for results. Default limits are the most common settings, and aren’t always something to adjust, but occasionally it makes sense to adjust the limits to suit your use cases if your architecture can handle it. 

More practically, make sure that you’re running efficient and precise analyses to make the most use of your data analysis environment. If you’re using Splunk, I again recommend Nick Mealy’s excellent .conf talk, Master Joining Datasets Without Using Join for guidance on optimizing your searches, as well as Martin Müller’s talk on the Splunk Job Inspector, featuring Clara Merriman for the .conf20 edition. 

Normalize your data

You can normalize your data to make it easier to use, especially if you’re correlating common fields across many datasets. Use field names that are as accurate and descriptive as possible, but also names that are consistent across datasets. 

It’s a best practice to follow a common information model, but also ensure that you work closely with data administrators and data analysts (if you aren’t performing both roles) to make sure that analyst use cases and expectations align with administrator practices and implementations. 

If you’re using Splunk, start with the Splunk Common Information Model (CIM) or write your own data model to help normalize fields. Keep in mind too, that you don’t have to use accelerated data models to use a data model when naming fields.  

Enrich your datasets 

You can also explore different ways to enrich your datasets earlier in the analysis process to help make sure data doesn’t go missing across datasets. If you perform data enrichment on the data as it streams in, using a tool like Cribl LogStream or Splunk Data Stream Processor, you can add this missing data back to the events and make it easier to accurately and usefully correlate data across datasets. 

Consistently validate your results

Do what you can to consistently and constantly ensure the validity of your search results. Assess the validity against your expectations, but ultimately against itself and against context. If you don’t validate your results, you might lose data that is hidden inside of an aggregate or misinterpreted due to missing context. 

  • Check different time ranges. What seems to be a gap or a pattern in data could just be seasonality or noise. Consider separating weekends from weekdays, or business hours from other hours, depending on the type of data that you’re aggregating. 
  • Use different types of averages, and consider whether the span of values in your results are well-represented by an average. An average of a wide range of values can inaccurately reflect the range of the values, effectively hiding the minimum and the maximum values. 
  • Disaggregate your data to identify where you might be missing more data, and thus have bias in your results. Heather Krause and Shena Ashley discuss disaggregation in an interview for If Data Could Talk by Tableau Software on The Ethics of Visualizing Data on Race, making the point that disaggregated data does not imply causation, it merely describes the data.
  • Compare like with like. Validate time zones, units, and the data in similarly-named fields using a data dictionary or by working with the stewards of the datasets to make sure that you’re using the data correctly. 

To help ensure you’re properly accounting for missing data that you might cause while analyzing data, work with your data administrator and consult with a statistics expert if you’re not sure you’re properly analyzing or aggregating data. To help learn more, I recommend Ben Jones’ book Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations.

The next post in this series covers the different ways that data can go missing at the data management stage—what happens to prepare data for analysis?  Manage the data: How missing data biases data-driven decisions