Collect the data: How missing data biases data-driven decisions

This is the seventh post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

When you’re gathering the data you need and creating datasets that don’t exist yet, you’re in the midst of the data collection stage. Data can easily go missing when you’re collecting it!

In this post, I’ll cover how data goes missing at the collection stage and what you can do about missing data.

How does data go missing?

There are many reasons why data might be missing from your analysis process. At the collection stage, data goes missing because the data doesn’t exist, because the data exists but you can’t use it, or because the data exists but the events in the dataset are missing information.

The dataset doesn’t exist

Frequently, data goes missing because the data itself does not exist and you need to create it. It’s difficult and often impractical to create a comprehensive dataset, so data can easily go missing at this stage. If data must go missing, do what you can to make sure it goes missing consistently, for example by collecting representative data.

In some cases, though, you do need comprehensive data. For example, if you need to create a dataset of all the servers in your organization for compliance reasons, you might discover that no single dataset of servers exists, and that compiling one is a challenge. You can start with just the servers that your team administers, but that’s an incomplete list.

Some servers are grant-owned and administered entirely by a separate team. Perhaps some servers are lurking under the desks of your colleagues, connected to the network but not centrally managed. You can try to use network scans to come up with a list, but then you capture only the servers connected to the network at that particular time. Airgapped servers, or servers that aren’t turned on 24/7, won’t be captured by such an audit. It’s important to continually consider whether you really need comprehensive data, or just data that comprehensively addresses your use case.
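To make this concrete, here’s a minimal sketch of merging several partial server inventories and seeing where they disagree. The source names and server names are hypothetical; the point is that each source covers only part of the fleet, and even the union isn’t guaranteed to be comprehensive.

```python
# Merge partial server inventories and track which source(s) know about each host.
# Source names and server names are hypothetical examples.

team_inventory = {"web-01", "web-02", "db-01"}  # servers your team administers
network_scan = {"web-01", "db-01", "lab-17"}    # hosts that answered a network scan
grant_owned = {"hpc-01", "hpc-02"}              # servers run by a separate team

sources = {
    "team_inventory": team_inventory,
    "network_scan": network_scan,
    "grant_owned": grant_owned,
}

# Build a union of all known servers, remembering which sources reported each one.
coverage: dict[str, list[str]] = {}
for source_name, servers in sources.items():
    for server in servers:
        coverage.setdefault(server, []).append(source_name)

for server, seen_by in sorted(coverage.items()):
    print(f"{server}: reported by {', '.join(seen_by)}")

# Servers seen by only one source deserve scrutiny: they hint at how many
# machines (airgapped, powered off, under a desk) no source sees at all.
single_source = [s for s, seen in coverage.items() if len(seen) == 1]
print("Single-source servers:", single_source)
```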

The data exists, but…

There’s also a chance that the data exists but isn’t machine-readable. If the data is provided only as PDFs, as many FOIA request responses are, it becomes more difficult to include in data analysis. The data might also be available only as paper documents, as is the case with gun registration records. As Jeanne Marie Laskas reports for GQ in Inside The Federal Bureau Of Way Too Many Guns, keeping records only on paper prevents large-scale analysis of the information, effectively causing it to go missing from the entire data analysis process.
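If you’re stuck with PDFs, you can sometimes recover machine-readable text from them. Here’s a minimal sketch using the pdfplumber library (one of several options); the file name is a placeholder, and scanned image-only PDFs would additionally need OCR, which this sketch doesn’t attempt.

```python
# Extract text from a text-based PDF so it can enter the analysis pipeline.
# Requires: pip install pdfplumber. "records.pdf" is a hypothetical file name.
import pdfplumber

pages_text = []
with pdfplumber.open("records.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text() or ""  # extract_text() can return None for image-only pages
        pages_text.append(text)

full_text = "\n".join(pages_text)
print(full_text[:500])  # inspect the first few hundred characters

# Caveat: scanned PDFs contain images, not text; those need OCR (e.g., Tesseract)
# before any of this works, and extraction quality varies by document layout.
```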

It’s possible that the data exists, but isn’t on the network—perhaps because it is housed on an airgapped device, or perhaps stored on servers subject to different compliance regulations than the infrastructure of your data analysis software. In this case, the data exists but it is missing from your analysis process because it isn’t available to you due to technical limitations.

Another common case is that the data exists, but you can’t have it. If you’ve made an enemy in another department, they might not share the data with you simply because they don’t want to. More likely, though, access to the data is controlled by legal or compliance concerns, so you can’t access it for your desired purposes, or you can’t analyze it with your data analysis tool for compliance reasons.

For example, most doctors’ offices and hospitals in the United States use electronic health record systems to store their patients’ medical records. However, scientific researchers are not permitted to access detailed electronic health records, even though the records exist in large databases and the data is machine-readable, because the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule regulates how protected health information (PHI) can be accessed and used.

Perhaps the data exists, but it’s available only to people who pay for access. This is the case for many music metadata datasets, like those from Nielsen, much to my chagrin. The effort it takes to create quality datasets is often monetized. This also happens with scientific research, which is often available only to those with access to the journals that publish the results. The datasets behind the research are also often closely guarded, because a single dataset is time-consuming to create and can lead to multiple publications.

There’s also a chance that the data exists but isn’t made available outside of the company that collects it. A common circumstance for this is public API endpoints for cloud services. Spotify collects far more data than it makes available via its API, as do companies like Zoom and Google. You might hope to collect various types of data from these companies, but if the API endpoints don’t expose the data, you don’t have many options.

And of course, in some cases the data exists, but it’s inconsistent. Maybe you’re trying to collect equivalent data from servers or endpoints with different operating systems, but you can’t get the same details due to logging limitations. A common example is trying to collect the same level of detail from computers running macOS and computers running Windows. You can also see inconsistencies if different log levels are set on different servers for the same software. Inconsistent data like this causes data to go missing within events and makes it harder to compare like with like.
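One way to keep inconsistent events comparable is to normalize them into a common schema and mark absent fields explicitly, rather than letting them silently vanish. A minimal sketch, with hypothetical field names:

```python
# Normalize log events from different operating systems into one schema,
# recording explicitly which fields each source couldn't provide.
# Field names and values here are hypothetical examples.

COMMON_SCHEMA = ["host", "os", "timestamp", "cpu_percent", "gpu_percent"]

def normalize(event: dict) -> dict:
    """Map a raw event onto the common schema; missing fields become None."""
    return {field: event.get(field) for field in COMMON_SCHEMA}

raw_events = [
    {"host": "mac-042", "os": "macOS", "timestamp": "2020-06-01T12:00:00Z",
     "cpu_percent": 41.5},  # this logger doesn't report GPU usage
    {"host": "win-117", "os": "Windows", "timestamp": "2020-06-01T12:00:00Z",
     "cpu_percent": 38.0, "gpu_percent": 12.0},
]

events = [normalize(e) for e in raw_events]

# Knowing *which* fields are missing per OS tells you which comparisons are safe.
for e in events:
    missing = [f for f, v in e.items() if v is None]
    print(e["host"], "missing:", missing or "nothing")
```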

14-page forms lead to incomplete data collection in a pandemic

Data can easily go missing if it’s just too difficult to collect. An example from Illinois, reported by WBEZ reporter Kristen Schorsch in Illinois' Incomplete COVID-19 Data May Hinder Response, is that “the Illinois Department of Public Health issued a 14-page form that it has asked hospitals to fill out when they identify a patient with COVID-19. But faced with a cumbersome process in the midst of a pandemic, many hospitals aren’t completely filling out the forms.”

It’s likely that as a result of the length of the form, data isn’t consistently collected for all patients from all hospitals—which can certainly bias any decisions that the Illinois Department of Public Health might make, given that they have incomplete data.

In fact, as Schorsch reports, public health workers “told WBEZ that [the missing data] makes it harder for them to understand where to fight for more resources, like N95 masks that provide the highest level of protection against COVID-19, and help each other plan for how to make their clinics safer as they welcome back patients to the office.”

In cases like this one, where data is going missing because it’s too difficult to collect, you can refocus your data collection on the most crucial data points for what you need to know, rather than on collecting the most complete set of data points.

What can you do about missing data?

Most crucially, identify the missing data. If you know that you need a certain type of data to answer the questions in your data analysis, you must first recognize when that data is missing from your analysis process.

After you identify the missing data, you can determine whether or not it matters. If the data that you do have is representative of the population that you’re making decisions about, and you don’t need comprehensive data to make those decisions, a representative sample of the data is likely sufficient.
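In practice, this can be as simple as measuring how much of each field is missing and comparing your sample’s makeup against known population proportions. A sketch using pandas, with made-up column names and reference numbers:

```python
# Quantify missingness per column and check the sample against the population.
# The column names, values, and reference proportions are all hypothetical.
import pandas as pd

df = pd.DataFrame({
    "region":   ["north", "north", "south", None,   "north"],
    "response": [4,        5,       3,       4,      None],
})

# Step 1: identify the missing data.
print(df.isna().mean())  # fraction missing per column

# Step 2: decide whether it matters. Compare the sample's composition
# to the (assumed known) population composition.
population_share = {"north": 0.5, "south": 0.5}  # hypothetical ground truth
sample_share = df["region"].value_counts(normalize=True).to_dict()

for region, expected in population_share.items():
    observed = sample_share.get(region, 0.0)
    print(f"{region}: sample {observed:.0%} vs population {expected:.0%}")
# Large gaps suggest the missing data is *not* random, and may bias decisions.
```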

Communicate your use cases

Another important thing you can do is communicate your use cases to the people collecting the data.

Communicating the use case for data collection is most helpful when communicating that information leads to additional data gathering. It’s riskier when it might cause potential data sources to be excluded.

For example, if you’re using a survey to collect information about a population’s preferences—let’s say, the design of a sneaker—and you disclose that information upfront, you might only get people with strong opinions about sneaker design responding to your survey. That can be great if you want to survey only that population, but if you want a more mainstream opinion, you might miss those responses because the use case you disclosed wasn’t interesting to them. In that context, you need to evaluate the missing data for its relevance to your analysis.

Build trust when collecting sensitive data

Data collection is a trust exercise. If the population that you’re collecting data about doesn’t understand why you’re collecting the data, doesn’t trust that you will protect it and use it as you say you will, or believes that you will use the data against them, you might end up with missing data.

Nowhere is this more apparent than with the U.S. Census. Performed every 10 years, the census produces data that is used to determine political representation, distribute federal resources, and much more. Because of how the data from the census survey is used, a representative sample isn’t enough—it must be as complete a survey as possible.

Screenshot of the Census page How We Protect Your Information.

The Census Bureau understands that mistrust is a common reason why people might not respond to the census survey. Because of that, the U.S. Census Bureau hires pollsters who are part of groups that might be less inclined to respond to the census, and also provides clear and easy-to-find details on its website (see How We Protect Your Information on census.gov) about the measures in place to protect the data collected in the census survey. Those details are even clear in the marketing campaigns urging you to respond to the census! The census also faces other challenges in ensuring the survey is as complete as possible.

This year, the U.S. Census also faced time limits for collecting and counting surveys, in addition to delays already imposed by the COVID-19 pandemic. The New York Times has additional details about those challenges: The Census, the Supreme Court and Why the Count Is Stopping Early.

Address instances of mistrust with data stewardship

As Jill Lepore discusses in episode 4, Unheard, of her podcast The Last Archive, mistrust can also affect the accuracy of the data being collected, as in the case of formerly enslaved people being interviewed by descendants of their former owners, or by their white neighbors, for records collected by the Works Progress Administration. Surely data is missing from those accounts of slavery due to mistrust of the people doing the data collection, or at the least because those collecting the stories perhaps did not deserve to hear the true lived experiences of the formerly enslaved people.

If you and your team are not good data stewards, if you don’t do a good job of protecting the data that you’ve collected or managing who has access to it, people are less likely to trust you with more data, and thus it’s likely that the datasets you collect will be missing data. Because of that, it’s important to practice good data stewardship. Use datasheets for datasets, or a data biography, to record when data was collected, for what purpose, by whom, by what means, and more. You can then review those records to understand whether data is missing, or even to remember what data might be intentionally missing.
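A data biography doesn’t need heavyweight tooling; even a small structured record kept alongside the dataset helps. Here’s a minimal sketch covering the fields mentioned above (when, why, by whom, by what means), with entirely hypothetical values:

```python
# A minimal "data biography" recorded alongside a dataset.
# All names and values here are hypothetical examples.
import json

data_biography = {
    "dataset": "server_inventory_2020",
    "collected_on": "2020-06-15",
    "collected_by": "infrastructure team",
    "collection_method": "network scan merged with team-maintained lists",
    "purpose": "compliance audit of managed servers",
    "known_gaps": [
        "airgapped servers not captured by the scan",
        "servers powered off during the scan window",
    ],
    "intentionally_excluded": ["lab machines pending a later collection pass"],
}

# Store it next to the data so future analysts can check what's missing and why.
with open("server_inventory_2020.biography.json", "w") as f:
    json.dump(data_biography, f, indent=2)
```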

In some cases, data can be intentionally masked, excluded, or left to be collected at a later date. If you keep track of these details about the dataset during the data collection process, it’s easier to be informed about the data that you’re using to answer questions, and thus to use it safely, equitably, and knowledgeably.

Collect what’s missing, maybe

If possible and if necessary, collect the data that is missing. You can create a new dataset if one does not already exist, as journalists and organizations such as Campaign Zero have been doing for police brutality in the United States. Some data collection that you perform might instead supplement existing datasets, such as adding introspection details to a log file to help you answer a new question with an existing data source.
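For the log file case, supplementing an existing data source can be as small as emitting one extra field per event. A sketch using Python’s standard logging module; the field name and the question it answers are hypothetical:

```python
# Add extra introspection detail to existing log events so a new question
# (here, hypothetically: "which cache tier served each request?") becomes answerable.
import logging

logging.basicConfig(
    format="%(asctime)s %(message)s cache_tier=%(cache_tier)s",
    level=logging.INFO,
)
logger = logging.getLogger("requests")

# The `extra` dict injects new fields into the log record without
# touching the rest of the logging pipeline.
logger.info("served /api/v1/items", extra={"cache_tier": "memory"})
logger.info("served /api/v1/users", extra={"cache_tier": "disk"})
```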

If there are cases where you do need to collect additional data, you might not be able to do so at the moment. In those cases, you can build a roadmap or a business case to collect the data that is missing, making it clear how it can help reduce uncertainty for your decision. That last point is key, because collecting more data isn’t always the best solution for missing data.

Sometimes, it isn’t possible to collect more data. Perhaps you’re trying to gather historical data, but everyone from that period has died and few or no primary sources remain. Or the data has been destroyed, whether accidentally in a fire or intentionally, as with the records the Stasi destroyed after the fall of the Berlin Wall.

Consider whether you need complete data

Also consider whether or not more data will actually help address the problem that you’re attempting to solve. You can be missing data, and yet still not need to collect more data in order to make your decision. As Douglas Hubbard points out in his book, How to Measure Anything, data analysis is about reducing uncertainty about what the most likely answer to a question is. If collecting more data doesn’t reduce your uncertainty, then it isn’t necessary.
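Hubbard formalizes this with the expected value of information: new data is worth collecting only if it could plausibly change your decision. A toy sketch of that idea, with entirely hypothetical payoffs and probabilities:

```python
# Toy expected-value-of-perfect-information (EVPI) calculation, in the spirit
# of Hubbard's "How to Measure Anything". All numbers are hypothetical;
# the point is the comparison, not the values.

# Two actions, two possible states of the world, and current beliefs P(state).
p_state = {"demand_high": 0.6, "demand_low": 0.4}
payoff = {
    ("launch", "demand_high"): 100, ("launch", "demand_low"): -50,
    ("hold",   "demand_high"):   0, ("hold",   "demand_low"):   0,
}
actions = ["launch", "hold"]

# Best expected payoff using only current information.
ev_now = max(sum(p * payoff[(a, s)] for s, p in p_state.items()) for a in actions)

# Expected payoff if we could learn the true state before deciding.
ev_perfect = sum(p * max(payoff[(a, s)] for a in actions) for s, p in p_state.items())

evpi = ev_perfect - ev_now
print(f"EV now: {ev_now}, EV with perfect info: {ev_perfect}, EVPI: {evpi}")
# EVPI bounds what any new data could be worth. If it's near zero,
# collecting more data won't improve the decision.
```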

Nani Jansen Reventlow of the Digital Freedom Fund makes this point clear in her op-ed on Al Jazeera, Data collection is not the solution for Europe’s racism problem. In that case, collecting more data, even though it could be argued that the data is missing, doesn’t actually reduce uncertainty about the likely solution for racism. Being able to quantify the effects or harms of racism on a region does not solve the problem; only the drive to solve the problem can do that.

Avoid cases where you continue to collect data, especially at the expense of an already-marginalized population, in an attempt to prove what is already made clear by the existing information available to you.

You might think that data collection is the first stage of a data analysis process, but in fact, it’s the second. The next and last post in this series covers defining the question that guides your data analysis, and how to take action to reduce bias in your data-driven decisions: Define the question: How missing data biases data-driven decisions.