Collect the data: How missing data biases data-driven decisions

This is the seventh post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

When you’re gathering the data you need and creating datasets that don’t exist yet, you’re in the midst of the data collection stage. Data can easily go missing when you’re collecting it! 

In this post, I’ll cover the following: 

  • How data goes missing at the data collection stage 
  • What to do about missing data at the collection stage

How does data go missing?

There are many reasons why data might be missing from your analysis process. Data goes missing at the collection stage because the data doesn’t exist, because the data exists but you can’t use it, or because the data exists but individual events in the dataset are missing information.

The dataset doesn’t exist 

Frequently data goes missing because the data itself does not exist yet, and you need to create it. Creating a truly comprehensive dataset is difficult and often impractical, so data can easily go missing at this stage. It’s important to do what you can to make sure that any data that does go missing goes missing consistently, if possible by collecting representative data.

In some cases, though, you do need comprehensive data. For example, if you need to create a dataset of all the servers in your organization for compliance reasons, you might discover that there is no single dataset of servers, and that compiling one is a challenge. You can start with just the servers that your team administers, but that’s an incomplete list.

Some servers are grant-owned and fully administered by a separate team entirely. Perhaps some servers are lurking under the desks of some colleagues, connected to the network but not centrally managed. You can try to use network scans to come up with a list, but then you gather only those servers connected to the network at that particular time. Airgapped servers or servers that aren’t turned on 24/7 won’t be captured by such an audit. It’s important to continually consider if you really need comprehensive data, or just data that comprehensively addresses your use case. 
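To make that concrete, here’s a minimal sketch (in Python, with hypothetical hostnames and source names) of compiling a server inventory from several partial sources while keeping track of which source saw which machine:

```python
# Minimal sketch: merge partial server inventories and track which source saw
# which machine. Hostnames and source names are hypothetical.
team_inventory = {"app-01", "app-02", "db-01"}
network_scan = {"app-01", "db-01", "lab-07"}   # only servers online during the scan
grant_owned = {"research-01"}                  # administered by another team

sources = [
    ("team inventory", team_inventory),
    ("network scan", network_scan),
    ("grant-owned list", grant_owned),
]

combined = team_inventory | network_scan | grant_owned
for host in sorted(combined):
    seen_by = [name for name, members in sources if host in members]
    print(f"{host}: seen by {', '.join(seen_by)}")

# Airgapped or powered-off servers appear in no source at all, so the combined
# list is still incomplete; tracking provenance at least makes the gaps visible.
```

Tracking provenance this way doesn’t make the inventory complete, but it makes the gaps visible: any server that appears in only one source, or in none, is a candidate for follow-up.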

The data exists, but… 

There’s also a chance that the data exists, but isn’t machine-readable. If the data is provided only as PDFs, as the responses to many FOIA requests are, then it becomes more difficult to include in data analysis. There’s also a chance that the data is available only as paper documents, as is the case with gun registration records. As Jeanne Marie Laskas reports for GQ in Inside The Federal Bureau Of Way Too Many Guns, keeping records only on paper prevents large-scale analysis of the information, effectively causing it to go missing from the entire data analysis process.

It’s possible that the data exists, but isn’t on the network—perhaps because it is housed on an airgapped device, or perhaps stored on servers subject to different compliance regulations than the infrastructure of your data analysis software. In this case, the data exists but it is missing from your analysis process because it isn’t available to you due to technical limitations. 

Another common case is that the data exists, but you can’t have it. If you’ve made an enemy in another department, they might not share the data with you simply because they don’t want to. More likely, though, access to the data is restricted by legal or compliance concerns, so you aren’t able to access it for your desired purposes, or you can’t analyze it with the tool that you’re using for data analysis.

For example, most doctors’ offices and hospitals in the United States use electronic health records systems to store the medical records of thousands of Americans. However, scientific researchers are not permitted to freely access detailed electronic health records of patients, even though the records exist in large databases and are machine-readable, because the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule regulates how protected health information (PHI) can be accessed and used.

Perhaps the data exists, but is available only to people who pay for access. This is the case for many music metadata datasets, like those from Nielsen, much to my chagrin. The effort it takes to create quality datasets is often monetized. This also happens with scientific research, which is often available only to those with access to the journals that publish the results. The datasets behind the research are also often closely guarded, because a single dataset is time-consuming to create and can lead to multiple publications.

There’s also a chance the data exists, but it isn’t made available outside of the company that collects it. A common circumstance for this is the public API endpoints of cloud services. Spotify collects far more data than it makes available via its API, and so do companies like Zoom and Google. You might hope to collect various types of data from these companies, but if the API endpoints don’t expose the data, you don’t have many options.

And of course, in some cases the data exists, but it’s inconsistent. Maybe you’re trying to collect equivalent data from servers or endpoints with different operating systems, but you can’t get the same details due to logging limitations. A common example is trying to collect the same level of detail from computers running macOS and computers running Windows. You can also see inconsistencies if different log levels are set on different servers for the same software. Inconsistent data like this causes data to go missing within events and makes it more difficult to compare like with like.
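As a rough illustration, here’s a minimal sketch (the field names are hypothetical, not tied to any particular logging product) of normalizing events from different sources into a shared schema so that missing fields are recorded explicitly rather than silently dropped:

```python
# Minimal sketch: normalize events from different sources into a shared schema,
# recording fields a source doesn't provide as explicitly missing (None).
# All field names are hypothetical.
COMMON_FIELDS = ["host", "cpu_percent", "logged_in_user"]

def normalize(event: dict, source_os: str) -> dict:
    normalized = {"os": source_os}
    for field in COMMON_FIELDS:
        normalized[field] = event.get(field)  # None marks a missing value
    return normalized

windows_event = {"host": "win-01", "cpu_percent": 12.5, "logged_in_user": "alice"}
macos_event = {"host": "mac-01", "cpu_percent": 7.0}  # this source never logs the user

events = [normalize(windows_event, "windows"), normalize(macos_event, "macos")]
missing = {f: sum(e[f] is None for e in events) for f in COMMON_FIELDS}
print(missing)  # shows which fields go missing, and how often
```

Recording the missing values explicitly at least lets you count how often each field goes missing, instead of discovering the gap later in the analysis.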

14-page forms lead to incomplete data collection in a pandemic 

Data can easily go missing if it’s just too difficult to collect. An example from Illinois, reported by WBEZ reporter Kristen Schorsch in Illinois’ Incomplete COVID-19 Data May Hinder Response, is that “the Illinois Department of Public Health issued a 14-page form that it has asked hospitals to fill out when they identify a patient with COVID-19. But faced with a cumbersome process in the midst of a pandemic, many hospitals aren’t completely filling out the forms.”

It’s likely that as a result of the length of the form, data isn’t consistently collected for all patients from all hospitals—which can certainly bias any decisions that the Illinois Department of Public Health might make, given that they have incomplete data. 

In fact, as Schorsch reports, public health workers “told WBEZ that [the lack of data] makes it harder for them to understand where to fight for more resources, like N95 masks that provide the highest level of protection against COVID-19, and help each other plan for how to make their clinics safer as they welcome back patients to the office.”

In this case, where data is going missing because it’s too difficult to collect, you can refocus your data collection on the most crucial data points for what you need to know, rather than the most complete data points.

What can you do about missing data? 

Most crucially, identify the missing data. If you know that you need a certain type of data to answer the questions that you want to answer in your data analysis, you also need to know whether that data is missing from your analysis process.

After you identify the missing data, you can determine whether or not it matters. If the data that you do have is representative of the population that you’re making decisions about, and you don’t need comprehensive data to make those decisions, a representative sample of the data is likely sufficient. 

Communicate your use cases

Another important thing you can do is to communicate your use cases to the people collecting the data. For example, 

  • If software developers have a better understanding of how telemetry or log data is being used for analysis, they might write more detailed or more useful logging messages and add new fields to the telemetry data collection (see the sketch after this list).
  • If you share a business case with cloud service providers to provide additional data types or fields via their API endpoints, you might get better data to help you perform less biased and more useful data analyses. In return, those cloud providers are likely to retain you as a customer. 
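To illustrate the first bullet, here’s a minimal sketch of the kind of change a developer might make once they know how the logs will be analyzed. The event and field names are invented for the example:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("export")

# Before: a message that's hard to use in any downstream analysis.
log.info("export failed")

# After: the same event, with the fields an analyst actually needs.
# The event and field names here are invented for the example.
log.info(json.dumps({
    "event": "export_failed",
    "tenant_id": "t-1234",
    "export_format": "pdf",
    "error_code": "TIMEOUT",
    "duration_ms": 30012,
}))
```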

Communicating the use case for data collection is most helpful when doing so leads to additional data gathering. It’s riskier when it might cause potential data sources to be excluded.

For example, if you’re using a survey to collect information about a population’s preferences—let’s say, the design of a sneaker—and you disclose that information upfront, you might only get people with strong opinions about sneaker design responding to your survey. That can be great if you want to survey only that population, but if you want a more mainstream opinion, you might miss those responses because the use case you disclosed wasn’t interesting to them. In that context, you need to evaluate the missing data for its relevance to your analysis. 

Build trust when collecting sensitive data 

Data collection is a trust exercise. If the population that you’re collecting data about doesn’t understand why you’re collecting the data, doesn’t trust that you will protect it and use it as you say you will, or believes that you will use the data against them, you might end up with missing data.

Nowhere is this more apparent than with the U.S. Census. Performed every 10 years, the census produces data that is used to determine political representation, distribute federal resources, and much more. Because of how the data from the census survey is used, a representative sample isn’t enough—it must be as complete a survey as possible.

Screenshot of the Census page How We Protect Your Information.

The Census Bureau understands that mistrust is a common reason why people might not respond to the census survey. Because of that, the U.S. Census Bureau hires pollsters who are part of groups that might be less inclined to respond to the census, and also provides clear and easy-to-find details on its website (see How We Protect Your Information on census.gov) about the measures in place to protect the data collected in the census survey. Those details are even clear in the marketing campaigns urging you to respond to the census! The census also faces other challenges in making the survey as complete as possible.

This year, the U.S. Census also faced time limits for collecting and counting survey responses, on top of delays already imposed by the COVID-19 pandemic. The New York Times has additional details about those challenges: The Census, the Supreme Court and Why the Count Is Stopping Early.

Address instances of mistrust with data stewardship

As Jill Lepore discusses in episode 4, Unheard, of her podcast The Last Archive, mistrust can also affect the accuracy of the data being collected, as in the case of formerly enslaved people being interviewed by descendants of their former owners, or by their white neighbors, for records collected by the Works Progress Administration. Surely, data is missing from those accounts of slavery due to mistrust of the people doing the data collection, or at the least because those collecting the stories perhaps did not deserve to hear the true lived experiences of the formerly enslaved people.

If you and your team are not good data stewards, if you don’t do a good job of protecting the data that you’ve collected or managing who has access to it, people are less likely to trust you with more data—and thus it’s likely that the datasets you collect will be missing data. Because of that, it’s important to practice good data stewardship. Use datasheets for datasets, or a data biography, to record when data was collected, for what purpose, by whom, by what means, and more. You can then review those records to understand whether data is missing, or even to remember what data might be intentionally missing.

In some cases, data can be intentionally masked, excluded, or left to collect at a later date. If you keep track of these details about the dataset during the data collection process, it’s easier to be informed about the data that you’re using to answer questions and thus use it safely, equitably, and knowledgeably. 
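What that record keeping looks like will vary by team. Here’s a minimal sketch of a data biography as a structured record; the fields are illustrative, not a formal standard:

```python
from dataclasses import dataclass, field

# Minimal sketch of a data biography as a structured record. The fields are
# illustrative, not a formal standard.
@dataclass
class DataBiography:
    dataset_name: str
    collected_when: str                 # e.g. "2020-06-01 to 2020-06-30"
    collected_by: str                   # team, vendor, or instrument
    collection_method: str              # survey, API export, network scan, ...
    purpose: str                        # why the data was collected
    known_gaps: list = field(default_factory=list)              # data known to be missing
    intentional_exclusions: list = field(default_factory=list)  # masked or deferred data

servers = DataBiography(
    dataset_name="server-inventory",
    collected_when="2020-06-01 to 2020-06-30",
    collected_by="IT operations",
    collection_method="network scan plus manual additions",
    purpose="compliance audit",
    known_gaps=["airgapped servers", "servers powered off during the scan"],
    intentional_exclusions=["grant-owned servers managed by another team"],
)
print(servers.known_gaps)
```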

Collect what’s missing, maybe

If possible and if necessary, collect the data that is missing. You can create a new dataset if one does not already exist, as journalists and organizations such as Campaign Zero have been doing by compiling data about police brutality in the United States. Some data collection that you perform might supplement existing datasets, such as adding additional introspection details to a log file to help you answer a new question with an existing data source.

In cases where you do need to collect additional data, you might not be able to do so right away. In those cases, you can build a roadmap or a business case to collect the data that is missing, making it clear how it can help reduce uncertainty for your decision. That last point is key, because collecting more data isn’t always the best solution for missing data.

Sometimes it isn’t possible to collect more data at all. For instance, you might be trying to gather historical data, but everyone from that period has died and few or no primary sources remain. Or the data might have been destroyed, whether accidentally in a fire or deliberately, as the Stasi did with their files after the fall of the Berlin Wall.

Consider whether you need complete data

Also consider whether or not more data will actually help address the problem that you’re attempting to solve. You can be missing data, and yet still not need to collect more data in order to make your decision. As Douglas Hubbard points out in his book, How to Measure Anything, data analysis is about reducing uncertainty about what the most likely answer to a question is. If collecting more data doesn’t reduce your uncertainty, then it isn’t necessary. 

Nani Jansen Reventlow of the Digital Freedom Fund makes this point clear in her op-ed on Al Jazeera, Data collection is not the solution for Europe’s racism problem. In that case, collecting more data, even though it could be argued that the data is missing, doesn’t actually reduce uncertainty about the likely solution to racism. Being able to quantify the effects or harms of racism on a region does not solve the problem—only the drive to solve the problem can do that.

Avoid cases where you continue to collect data, especially at the expense of an already-marginalized population, in an attempt to prove what is already made clear by the existing information available to you. 

You might think that data collection is the first stage of a data analysis process, but in fact, it’s the second. The next and last post in this series covers defining the question that guides your data analysis, and how to take action to reduce bias in your data-driven decisions: Define the question: How missing data biases data-driven decisions

The Concepts Behind the Book: How to Measure Anything

I just finished reading How to Measure Anything: Finding the Value of Intangibles in Business by Douglas Hubbard. It discusses fascinating concepts about measurement and observability, but they are tendrils that you must follow among mentions of Excel, statistical formulas, and somewhat dry consulting anecdotes. For those of you who might want to focus mainly on the concepts rather than the literal statistics and formulas behind implementing his framework, I wanted to share the concepts that resonated with me. If you want a more thorough summary, I recommend the one on Less Wrong, also titled How to Measure Anything.

The premise of the book is that people undertake many business decisions and large projects with the idea that the success of those decisions or projects can’t be measured, and thus it isn’t measured. It seems like a large waste of money and effort to undertake projects whose success you can’t measure, so he developed a consulting business and a framework, Applied Information Economics (AIE), to prove that you can measure such things.

Near the end of his book on page 267, he summarizes his philosophy as six main points:

1. If it’s really that important, it’s something you can define. If it’s something you think exists at all, then it’s something that you’ve already observed somehow.

2. If it’s something important and something uncertain, then you have a cost of being wrong and a chance of being wrong.

3. You can quantify your current uncertainty with calibrated estimates.

4. You can compute the value of additional information by knowing the “threshold” of the measurement where it begins to make a difference compared to your existing uncertainty.

5. Once you know what it’s worth to measure something, you can put the measurement effort in context and decide on the effort it should take.

6. Knowing just a few methods for random sampling, controlled experiments, or even just improving on the judgment of experts can lead to a significant reduction in uncertainty.

To restate those points:

  1. Define what you want to know. Consider ways that you or others have measured similar problems. What you want to know might be easier to see than you thought.
  2. It’s valuable to measure things that you aren’t certain about if they are important to be certain about (a rough back-of-the-envelope sketch of this trade-off follows this list).
  3. Make estimates about what you think will happen, and calibrate those estimates to understand just how uncertain you are about outcomes.
  4. Determine a level of certainty that will help you feel more confident about a decision. Additionally, determine how much information will be needed to get you there.
  5. Determine how much effort it might take to gather that information.
  6. Understand that it probably takes less effort than you think to reduce uncertainty.
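As promised above, here’s a rough back-of-the-envelope sketch of points 2 and 4: the value of a measurement comes from how much uncertainty it can remove relative to the cost of being wrong. The numbers are invented for illustration, and this is a simplified reading of Hubbard’s framework rather than his actual formulas:

```python
# Invented numbers: deciding whether a measurement is worth the effort.
chance_of_being_wrong = 0.40      # from a calibrated estimate of the decision
cost_of_being_wrong = 250_000     # what it costs if the decision turns out badly

# Expected cost of acting on current knowledge alone.
expected_loss = chance_of_being_wrong * cost_of_being_wrong
print(f"expected cost of being wrong: ${expected_loss:,.0f}")   # $100,000

# A measurement that costs $5,000 and meaningfully shrinks that 40% chance is
# cheap by comparison; one that costs more than the uncertainty it removes isn't.
cost_of_measurement = 5_000
print("worth measuring?", cost_of_measurement < expected_loss)
```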

The crux of the book revolves around restating measurement from “answer a specific question” to “reduce uncertainty based on what you know today”.

Measure to reduce uncertainty

Before reading this book, I thought about data analysis as a way to find an answer to a question. I’d go in with a question, I’d find data, and thanks to that data, I’d magically know the answer. However, that approach only works with narrowly defined questions and perfect data. If I want to know “How many views did a specific documentation topic get last week?”, I can answer that straightforwardly with website metrics.

However, if I want to know “Was the guidance about how to perform a task more useful after I rewrote it?”, there’s seemingly no way to know the answer to that question. Or so I thought.

Hubbard’s book makes the crucial distinction that data doesn’t need to exist to directly answer that question. It merely needs to make you more certain of the likely answer. You can make a guess about whether or not it was useful, carefully calibrating your guess based on your knowledge of similar scenarios, and then perform data analysis or measurement to improve the accuracy of your guess. If you’re not very certain of the answer, it doesn’t take much data or measurement to make you more certain, and thus increase your confidence in an outcome. However, the more certain you are, the more measurement you need to perform to increase your certainty.

Start by decomposing the problem

If you think what you want to measure isn’t measurable, Hubbard encourages you to think again and decompose the problem. To use my example, I want to measure whether or not a documentation topic was more useful after I rewrote it. As Hubbard points out in the first point on his list, the problem is likely more observable than I might think at first.

“Decompose the measurement so that it can be estimated from other measurements. Some of these elements may be easier to measure and sometimes the decomposition itself will have reduced uncertainty.”

I can decompose the question that I’m trying to answer and consider how I might measure the usefulness of a topic. Maybe something is more useful if it is viewed more often, if people share the link to the topic more frequently, or if there are qualitative comments in surveys or forums that refer to it. I can think about how I might tell someone that a topic is useful, and what attributes of the topic I might point to. Does it come up first when you search for a specific customer question? Maybe, then, search rankings for relevant keywords are an observable metric that could help me measure the utility of a topic.
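To keep myself honest about what I’d actually look at, I could write the decomposition down as a handful of observable signals. This is only a sketch; the metric names and thresholds are made up for the example:

```python
# Minimal sketch: decompose "is the rewritten topic more useful?" into
# observable signals. The metric names and thresholds are made up.
signals = {
    "weekly_page_views_change_pct": 20.0,   # vs. the weeks before the rewrite
    "search_rank_for_target_keyword": 3,    # position in search results
    "link_shares_last_30_days": 12,
}

checks = {
    "more page views": signals["weekly_page_views_change_pct"] >= 10.0,
    "ranks on the first page": signals["search_rank_for_target_keyword"] <= 10,
    "shared at all": signals["link_shares_last_30_days"] > 0,
}

print(checks)
print("signals pointing to 'more useful':", sum(checks.values()), "of", len(checks))
```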

You can also perform extra research to think of ways to measure something.

“Consider your findings from secondary research: Look at how others measured similar issues. Even if their specific findings don’t relate to your measurement problem, is there anything you can salvage from the methods they used?”

Is it business critical to measure this?

Before I invest a lot of time and energy performing measurements, I want to make sure (to Hubbard’s second point in his list) that the question I am attempting to answer, what I am trying to measure, is important enough to merit measurement. This is also tied to points four, five, and six: does the importance of the knowledge outweigh the difficulty of the measurement? It often does, especially because (to his sixth point), the measurement is often easier to obtain than it might seem at first.

Estimate what you think you’ll measure

To Hubbard’s third point, a calibrated estimate is important when you do a measurement. I need to be able to estimate what “success” might look like, and what reasonable bounds of success I might expect.

Make estimates about what you think will happen, and calibrate those estimates to understand just how uncertain you are about outcomes.

To continue with my question about a rewritten topic’s usefulness, let’s say that I’ve determined that added page views, elevated search rankings, and link shares on social media will mean the project is a success. I’d then want to estimate what number of each of those measurements might be meaningful.

To use page views as an example for estimation, if page views increase by 1%, it might not be meaningful. But maybe 5% is a meaningful increase? I can use that as a lower bound for my estimate. I can also think about a likely upper bound. A 1000% increase would be unreasonable, but maybe I could hope that page views would double, a 100% increase! I can use that as an upper bound. By considering and dismissing the 1% and 1000% numbers, I’m also doing some calibration of my estimates—essentially gut-checking them against my expertise and existing knowledge. The summary of How to Measure Anything that I linked in the first paragraph addresses calibration of estimates in more detail, as does the book itself!

After I’ve settled on a range of measurement outcomes, I can assess how confident I am that the outcome will fall in that range. Hubbard calls this a confidence interval. I might be only 60% certain that page views will increase by at least 5% but by no more than 100%. That gives me a lot of uncertainty to reduce when I start measuring page views.

One way to start reducing my uncertainty about these percentage increases might be to look at the past page views of this topic, to try to understand what regular fluctuation in page views might be over time. I can look at the past 3 months, week by week, and might discover that 5% is too low to be meaningful, and a more reasonable signifier of success would be a 10% or higher increase in page views.
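That historical check is simple enough to sketch out. Here’s what it might look like with made-up weekly page view counts; the point is just to see how big ordinary week-over-week swings are before treating any threshold as a signal of success:

```python
# Made-up weekly page view counts for the topic over the past 3 months.
weekly_views = [950, 1010, 980, 1040, 990, 1005, 970, 1020, 995, 1030, 985, 1015]

# Week-over-week percentage changes.
pct_changes = [
    (curr - prev) / prev * 100
    for prev, curr in zip(weekly_views, weekly_views[1:])
]
largest_swing = max(abs(change) for change in pct_changes)

print("week-over-week changes:", [round(c, 1) for c in pct_changes])
print(f"largest ordinary swing: {largest_swing:.1f}%")

# If ordinary swings already reach roughly 6%, a 5% "success" threshold is too
# low to mean anything; a 10% or greater increase stands out from normal noise.
```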

Estimating gives me a number that I am attempting to reduce uncertainty about, and performing that initial historical measurement can already help me reduce some of it. Now I can be 100% certain that a successful change to the topic should show more than a 5% week-over-week increase in page views, and maybe 80% certain that a successful change would show a 10% or greater increase.

When doing this, keep in mind another point of Hubbard’s:

“a persistent misconception is that unless a measurement meets an arbitrary standard … it has no value … what really makes a measurement of high value is a lot of uncertainty combined with a high cost of being wrong.”

If you’re choosing to undertake a large-scale project that will cost quite a bit if you get it wrong, you likely want to know in advance how to measure the success of that project. This point also underscores his continued emphasis on reducing uncertainty.

For my (admittedly mild) example, it isn’t valuable for me to declare that I can’t learn anything from page view data unless 3 months have passed. I can likely reduce uncertainty enough with two weeks of data to learn something valuable, especially if my confidence level is relatively low (in this example, in the 40-70% range).

Measure just enough, not a lot

Hubbard talks about the notion of a Rule of Five:

There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.

Knowing the median value of a population can go a long way toward reducing uncertainty. Even if you can only get a seemingly tiny sample of data, this rule of five makes it clear that even that small sample can be incredibly valuable for reducing uncertainty about a likely value. You don’t have to know all of something to know something important about it.
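If the 93.75% figure seems surprising, it falls out of a simple argument: a random draw lands below the median half the time, so the median escapes the sample’s range only if all five draws land on the same side of it. Here’s a short sketch that computes that probability and sanity-checks it with a simulation on an arbitrary, made-up population:

```python
import random

# The median escapes the range of five random draws only if all five land below
# it or all five land above it. Each of those happens with probability (1/2)**5.
analytic = 1 - 2 * (0.5 ** 5)
print(f"analytic probability: {analytic:.4f}")   # 0.9375

# Sanity check with a simulation over an arbitrary, made-up population.
random.seed(0)
population = [random.gauss(100, 15) for _ in range(100_001)]
median = sorted(population)[len(population) // 2]

trials = 100_000
hits = sum(
    min(sample) <= median <= max(sample)
    for sample in (random.sample(population, 5) for _ in range(trials))
)
print(f"simulated probability: {hits / trials:.4f}")  # close to 0.9375
```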

Do something with what you’ve learned

After you perform measurements or do some data analysis and reduce your uncertainty, it’s time to do something with what you’ve learned. Given my example, maybe my rewrite increased page views of the topic by 20%, an increase I’m now fairly certain is meaningful, and the topic is now higher in the search results. I’ve now sufficiently reduced my uncertainty about whether or not the changes made this topic more useful, and I can rewrite similar topics to use a similar content pattern with confidence. Or at least, more confidence than I had before.

Overall summary

My super abbreviated summary of the book would then be to do the following:

  1. Start by decomposing the problem
  2. Ask: is it business critical to measure this?
  3. Estimate what you think you’ll measure
  4. Measure just enough, not a lot
  5. Do something with what you’ve learned

I recommend the book (with judicious skimming), especially if you need some conceptual discussion to help you unravel how best to measure a specific problem. As I read the book, I took numerous notes about how I might be able to measure something like support case deflection with documentation, or how to prioritize new features for product development (or documentation). I also considered how customers might better be able to identify valuable data sources for measuring security posture or other events in their data if they followed many of the practices outlined in this book.