Decide with the data: How missing data biases data-driven decisions

October 26, 2020

This is the second post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

Any decision based solely on the results of data analysis is missing data: the non-quantitative kind. But data can also go missing from data-driven decisions as a result of the analysis process. Exhaustive data analysis and universal data collection might seem like the best ways to prevent missing data, but neither is realistic or feasible, and often neither is even possible. So what can you do about the possible bias introduced by missing data?

In this post, I’ll cover the following:

What to do if you must make a decision with missing data
How data goes missing at the decision stage
What to do about missing data at the decision stage
How much missing data matters
How much to care about missing data before making your decision

Missing data in decisions from the Oregon Health Authority #

In the midst of a global pandemic, we’ve all been struggling to evaluate how safe it is to resume pre-pandemic activities like going to the gym, going out to bars, sending our kids back to school, or eating at restaurants. In the United States, state governors are the ones tasked with making decisions about what to reopen and what to keep closed, and most are taking a data-driven approach.

In Oregon, decisions about what to reopen and when are based on incomplete data. As reported by Erin Ross for Oregon Public Broadcasting, the Oregon Health Authority is not collecting or analyzing data about whether or not restaurants or bars are contributing to COVID-19 case rates.

The contact tracers interviewing people who’ve tested positive for SARS-CoV-2 are not asking about visits to bars and restaurants, and even if the information is shared, the data isn’t being analyzed in a way that might allow officials to identify the effect of bars and restaurants on coronavirus case rates. Although this data is missing, officials are making decisions about whether or not bars and restaurants should remain open for indoor operations.

Oregon and the Oregon Health Authority aren’t alone in this limitation. We’re in the midst of a pandemic, and everyone is doing the best they can with the limited resources and information that they have. Complete data isn’t always possible, especially when real-time data matters. So what can Oregon (and you) do to make sure that missing data doesn’t negatively affect the decision being made?

If circumstances allow, it’s best to narrow the scope of your decision. Limit your decision to those groups and situations about which you have complete or representative data. If you can’t limit your decision, as is the case with this pandemic, you can still make a decision with incomplete data.

Acknowledge that your decision is based on limited data, identify the gaps in your knowledge, and make plans to address those gaps as soon as possible. You can address the missing data by collecting more data, analyzing your existing data differently, or by reexamining the relevance of various data points to the decision that you’re making. This is one key example of how missing data can affect a decision-making process. How else can data go missing when making decisions?

How can data go missing? #

Data-driven decisions are especially vulnerable to the effects of missing data. Because data-driven decisions are based on the results of data analysis, the effects of missing data in the earlier data analysis stages are compounded. Data can go missing at the decision stage in several ways:

The reports that you reviewed before making your decision included the prettiest graphs instead of the most useful visualizations to help you make your decision.
The visualizations in the report were different from the ones in the last report, making it difficult for you to compare the new results with the results in the previous report.
The data being collected doesn’t include the necessary details for your decision. This is what is happening in Oregon, where the collected data doesn’t include all the details that are relevant when making decisions about what businesses and organizations to reopen.
The data analysis that was performed doesn’t actually answer the question that you’re asking. If you need to know “how soon can we reopen indoor dining and bars despite the pandemic”, and the data analysis being performed can only tell you “what are the current infection rates, by county, based on the results of tests administered 5 days ago”, the decisions that you’re making might be based on incomplete data.

What can you do about missing data? #

Identify whether the missing data matters to your decision. If it does, you must acknowledge that the data is missing when you make your decision. If you don’t have data about a group, intentionally exclude that group from your conclusions or decision-making process. Constrain your decision according to what you know, and acknowledge the limitations of the analysis.
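
One way to make that constraint explicit is to check how much data you actually have for each group before drawing conclusions about it. The following is a minimal sketch, not a prescribed method: the function name, the group labels, the made-up records, and the `min_count` threshold are all assumptions for illustration, and the right threshold depends on your own analysis.

```python
from collections import Counter


def split_groups_by_coverage(records, all_groups, min_count=30):
    """Separate groups with enough observations from groups to exclude.

    records: iterable of (group, value) pairs; value is None when missing.
    all_groups: every group the decision might apply to.
    min_count: hypothetical minimum number of non-missing observations.
    """
    # Count only the observations that actually carry a value.
    counts = Counter(group for group, value in records if value is not None)
    covered = sorted(g for g in all_groups if counts[g] >= min_count)
    excluded = sorted(g for g in all_groups if counts[g] < min_count)
    return covered, excluded


# Made-up records for illustration only.
records = (
    [("group_a", 0.1)] * 40      # plenty of data
    + [("group_b", None)] * 50   # collected, but the value is missing
    + [("group_c", 0.3)] * 5     # too few observations
)
covered, excluded = split_groups_by_coverage(
    records, ["group_a", "group_b", "group_c"]
)
print("Decide for:", covered)
print("Exclude and acknowledge:", excluded)
```

The excluded list is as useful as the covered one: it names the gaps to acknowledge when you communicate the decision.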

If you want to make broader decisions, you must address the missing data throughout the rest of the process. If you aren’t able to immediately collect missing data, you can attempt to supplement the data with a dedicated survey aimed at gathering that information before your decision. You can also investigate whether the data is already available in a different format or context. In Oregon, for example, the Health Authority might have some information about indoor restaurant and bar attendance in the content of contact tracing interviews and just isn’t analyzing it systematically. If the data is representative even if it isn’t comprehensive, you can still use it to supplement your decision.

To make sure you’re making an informed decision, ask questions about the data analysis process that led to the results you’re reviewing. Discuss whether or not data could be missing from the results of the analysis presented to you, and why. Ask yourself: does the missing data affect the decision that I’m making? Does it affect the results of the analysis presented to me? Evaluate how much missing data matters to your decision-making process.

You don’t always need more data #

You will always be missing some data. It’s important to identify when the data that is missing is actually relevant to your analysis process, and when it won’t change the outcome. Acknowledge when additional data won’t change your conclusions.

You don’t need all possible data in existence to support a decision. As Douglas Hubbard points out in his book How to Measure Anything, the goal of a data analysis process is to reduce your uncertainty about the right approach to take.

If additional data, or more detailed analysis, won’t further reduce any uncertainty, then it’s likely unnecessary. The more clearly you constrain your decisions, and the questions you use to guide your data analysis, the more easily you can balance reducing data gaps and making a decision with the data and analysis results you have.
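
As a rough illustration of that idea, you can ask whether your remaining uncertainty even overlaps the point at which your decision would change. The sketch below is a simplification of my own and not Hubbard’s method: it assumes a numeric measurement, a single decision threshold, and a normal approximation for the confidence interval. The function name and the sample values are made up.

```python
import math
import statistics


def more_data_could_change_decision(samples, threshold, z=1.96):
    """Return True if the decision threshold lies inside the approximate
    95% confidence interval of the mean, meaning more data could still
    tip the decision either way.
    """
    mean = statistics.mean(samples)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    low, high = mean - z * sem, mean + z * sem
    return low <= threshold <= high


# Made-up measurements for illustration only.
samples = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]

# Threshold far from the estimate: more data is unlikely to matter.
print(more_data_could_change_decision(samples, threshold=3.0))
# Threshold inside the interval: more data could change the decision.
print(more_data_could_change_decision(samples, threshold=5.05))
```

If the function returns False, collecting more of the same data mostly spends time without reducing the uncertainty that matters to the decision.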

USCIS doesn’t allow missing data, even if it doesn’t affect the decision #

Sometimes, missing data doesn’t affect the decision that you’re making. This is why you must understand the decision you’re making, and how important comprehensive data is to that decision. If some missing data is irrelevant to the decision, make sure your policies acknowledge that reality.

In the case of the U.S. Citizenship and Immigration Services (USCIS), their policies don’t seem to recognize that some kinds of missing data for citizenship applications are irrelevant.

In her Washington Post Opinions column, “The Trump administration’s no-blanks policy is the latest Kafkaesque plan designed to curb immigration,” Catherine Rampell describes the “no blanks” policy applied to immigration applications and, now, to third-party documents included with those applications.

“Last fall, U.S. Citizenship and Immigration Services introduced perhaps its most arbitrary, absurd modification yet to the immigration system: It began rejecting applications unless every single field was filled in, even those that obviously did not pertain to the applicant. “Middle name” field left blank because the applicant does not have a middle name? Sorry, your application gets rejected. No apartment number because you live in a house? You’re rejected, too. No address given for your parents because they’re dead? No siblings named because you’re an only child? No work history dates because you’re an 8-year-old kid? All real cases, all rejected.”

In this example, missing data is deemed a problem for making a decision about the citizenship application for a person—even when the data that is missing is supposed to be missing because it doesn’t exist. When asked for comment,

“a USCIS spokesperson emailed, ‘Complete applications are necessary for our adjudicators to preserve the integrity of our immigration system and ensure they are able to confirm identities, as well as an applicant’s immigration and criminal history, to determine the applicant’s eligibility.’”

Missing data alone is not enough to affect your decision. Only missing data that would change the outcome of your decision matters. A lack of data is not itself a problem. The problem arises when data that is relevant to your decision is missing. That’s how bias gets introduced to a data-driven decision.
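
To make the distinction concrete, here is a minimal sketch that treats “not applicable” differently from “missing” and flags only the blanks that are relevant to a decision. It is purely illustrative and does not describe how USCIS or any real system processes applications: the `NOT_APPLICABLE` marker, the field names, and the example record are all hypothetical.

```python
# Hypothetical marker for fields that are blank because no value exists.
NOT_APPLICABLE = "N/A"


def relevant_missing_fields(application, decision_fields):
    """Return the fields that are both relevant to the decision and
    genuinely missing (None), ignoring fields marked not applicable.
    """
    return [
        field
        for field in decision_fields
        if application.get(field) is None
    ]


# Made-up record for illustration only.
application = {
    "family_name": "Doe",
    "middle_name": NOT_APPLICABLE,       # the person has no middle name
    "apartment_number": NOT_APPLICABLE,  # the person lives in a house
    "date_of_birth": None,               # genuinely missing
    "favorite_color": None,              # missing, but irrelevant
}
decision_fields = ["family_name", "middle_name", "date_of_birth"]

print(relevant_missing_fields(application, decision_fields))
```

Only the missing date of birth is flagged. The middle name is blank because it doesn’t exist, and the favorite color is missing but has no bearing on the decision.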

In the next post in this series, I’ll explore some ways that data can go missing when the results of data analysis are communicated: Communicate the data: How missing data biases data-driven decisions.