Collect the data: How missing data biases data-driven decisions

This is the seventh post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

When you’re gathering the data you need and creating datasets that don’t exist yet, you’re in the midst of the data collection stage. Data can easily go missing when you’re collecting it! 

In this post, I’ll cover the following: 

  • How data goes missing at the data collection stage 
  • What to do about missing data at the collection stage

How does data go missing?

There are many reasons why data might be missing from your analysis process. At the collection stage, data goes missing for one of three reasons: the data doesn’t exist, the data exists but you can’t use it, or the data exists but the events in the dataset are missing information. 

The dataset doesn’t exist 

Frequently, data goes missing because the data itself does not exist, and you need to create it. Creating a comprehensive dataset is difficult and often impractical, so data can easily go missing at this stage. If you can’t collect everything, do what you can to make sure that data goes missing consistently, by collecting representative data. 

In some cases, though, you do need comprehensive data. For example, if you need to create a dataset of all the servers in your organization for compliance reasons, you might discover that there is no one dataset of servers, and that efforts to compile one are a challenge. You can start with just the servers that your team administers, but that’s an incomplete list. 

Some servers are grant-owned and fully administered by a separate team entirely. Perhaps some servers are lurking under the desks of some colleagues, connected to the network but not centrally managed. You can try to use network scans to come up with a list, but then you gather only those servers connected to the network at that particular time. Airgapped servers or servers that aren’t turned on 24/7 won’t be captured by such an audit. It’s important to continually consider if you really need comprehensive data, or just data that comprehensively addresses your use case. 
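One way to see how far apart partial inventories are is to compare each one against the union of all of them. The following is a minimal Python sketch, not a definitive audit tool. The source names and hostnames are hypothetical, and a server that is absent from every source (an air-gapped machine that nobody has listed, for example) still won’t show up.

```python
def inventory_gaps(sources):
    """For each source, list the servers known to other sources but absent from it."""
    # Union of every hostname seen in any source.
    all_servers = set().union(*sources.values())
    return {
        name: sorted(all_servers - set(hosts))
        for name, hosts in sources.items()
    }


# Hypothetical inventories; the hostnames are made up for illustration.
sources = {
    "team_list": {"web-01", "db-01"},
    "network_scan": {"web-01", "db-01", "desk-pc-07"},
    "grant_team": {"hpc-01"},
}

gaps = inventory_gaps(sources)
for name, missing in gaps.items():
    print(f"{name} is missing: {missing}")
```

A report like this only shows where the sources disagree with each other. It can’t tell you whether their union is comprehensive.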

The data exists, but… 

There’s also a chance that the data exists, but isn’t machine-readable. If the data is provided only in PDFs, as the responses to many FOIA requests are, then it becomes more difficult to include the data in data analysis. There’s also a chance that the data is available only as paper documents, as is the case with gun registration records. As Jeanne Marie Laskas reports for GQ in Inside The Federal Bureau Of Way Too Many Guns, having records only on paper prevents large-scale data analysis on the information, thus causing it to effectively go missing from the entire process of data analysis. 

It’s possible that the data exists, but isn’t on the network—perhaps because it is housed on an airgapped device, or perhaps stored on servers subject to different compliance regulations than the infrastructure of your data analysis software. In this case, the data exists but it is missing from your analysis process because it isn’t available to you due to technical limitations. 

Another common case is that the data exists, but you can’t have it. If you’ve made an enemy in another department, they might not share the data with you because they don’t want to. It’s more likely that access to the data is controlled by legal or compliance concerns, so you aren’t able to access the data for your desired purposes, or perhaps you can’t analyze it on the tool that you’re using for data analysis due to compliance reasons. 

For example, most doctors’ offices and hospitals in the United States use electronic health record systems to store the medical records of thousands of Americans. However, scientific researchers are generally not permitted to freely access the detailed electronic health records of patients, even though the records exist in large databases and the data is machine-readable, because the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule regulates how protected health information (PHI) can be accessed and used. 

Perhaps the data exists, but is only available to people who pay for access. This is the case for many music metadata datasets like those from Nielsen, much to my chagrin. The effort it takes to create quality datasets is often commoditized. This also happens with scientific research, which is often only available to those with access to the scientific journals that publish the results of the research. The datasets behind the research are also often closely guarded, as one dataset is time-consuming to create and can lead to multiple publications. 

There’s also a chance the data exists, but it isn’t made available outside of the company. A common example is the public API endpoints for cloud services. Spotify collects far more data than it makes available via the API, as do companies like Zoom and Google. You might hope to collect various types of data from these companies, but if the API endpoints don’t make the data available, you don’t have many options.

And of course, in some cases the data exists, but it’s inconsistent. Maybe you’re trying to collect equivalent data from servers or endpoints with different operating systems, but you can’t get the same details due to logging limitations. A common example is trying to collect the same level of detail from computers running macOS and computers running Windows. You can also see inconsistencies if different log levels are set on different servers for the same software. This inconsistency causes data to go missing within events and makes it more difficult to compare like with like. 
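To find out where fields go missing within events, you can measure how often each field is actually populated for each source. This is a rough Python sketch under stated assumptions: the events are already parsed into dictionaries, the `os` key identifies the source, and the field names (`user`, `process`, `parent_pid`) are hypothetical rather than taken from any real logging format.

```python
from collections import defaultdict


def field_coverage(events, source_key="os"):
    """Return, per source, the fraction of events in which each field has a value."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    fields = set()
    for event in events:
        source = event.get(source_key, "unknown")
        totals[source] += 1
        for name, value in event.items():
            if name == source_key:
                continue
            fields.add(name)
            # A field counts as present only if it carries a value.
            if value is not None:
                counts[source][name] += 1
    return {
        source: {name: counts[source][name] / totals[source] for name in sorted(fields)}
        for source in totals
    }


# Hypothetical events for illustration only.
events = [
    {"os": "macos", "user": "a", "process": "zsh", "parent_pid": None},
    {"os": "macos", "user": "b", "process": "ssh"},
    {"os": "windows", "user": "c", "process": "cmd.exe", "parent_pid": 4},
    {"os": "windows", "user": "d", "process": "powershell.exe", "parent_pid": 8},
]

coverage = field_coverage(events)
for source, per_field in coverage.items():
    print(source, per_field)
```

In this made-up sample, `parent_pid` is populated for every Windows event and for no macOS event, so any analysis that relies on that field would silently leave out the macOS machines.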

14-page forms lead to incomplete data collection in a pandemic 

Data can easily go missing if it’s just too difficult to collect. An example from Illinois, reported by WBEZ reporter Kristen Schorsch in Illinois’ Incomplete COVID-19 Data May Hinder Response, is that “the Illinois Department of Public Health issued a 14-page form that it has asked hospitals to fill out when they identify a patient with COVID-19. But faced with a cumbersome process in the midst of a pandemic, many hospitals aren’t completely filling out the forms.”

It’s likely that as a result of the length of the form, data isn’t consistently collected for all patients from all hospitals—which can certainly bias any decisions that the Illinois Department of Public Health might make, given that they have incomplete data. 

In fact, as Schorsch reports, the missing data has real consequences: public health workers “told WBEZ that makes it harder for them to understand where to fight for more resources, like N95 masks that provide the highest level of protection against COVID-19, and help each other plan for how to make their clinics safer as they welcome back patients to the office.” 

In this case, where data is going missing because it’s too difficult to collect, you can refocus your data collection on the most crucial data points for what you need to know, rather than the most complete data points.

What can you do about missing data? 

Most crucially, identify the missing data. If you know that you need a certain type of data to answer the questions that you want to answer in your data analysis, you must know that it is missing from your analysis process. 

After you identify the missing data, you can determine whether or not it matters. If the data that you do have is representative of the population that you’re making decisions about, and you don’t need comprehensive data to make those decisions, a representative sample of the data is likely sufficient. 
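One rough way to check whether the data you have is representative is to compare each group’s share of your sample with its share of the population you’re making decisions about. The Python sketch below assumes that you already know the population shares from some trusted source. The group names and numbers are invented for illustration.

```python
def share_gaps(sample_counts, population_shares):
    """Return sample share minus population share for each group.

    Positive values mean the group is over-represented in the sample,
    negative values mean it is under-represented.
    """
    total = sum(sample_counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    return {
        group: sample_counts.get(group, 0) / total - expected
        for group, expected in population_shares.items()
    }


# Hypothetical numbers for illustration only.
sample_counts = {"urban": 80, "rural": 20}
population_shares = {"urban": 0.6, "rural": 0.4}

gaps = share_gaps(sample_counts, population_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.0%}")
```

Here the rural group makes up 20 percent of the sample but 40 percent of the population, which suggests that rural data is going missing. Whether a gap of that size matters depends on the decision you’re making.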

Communicate your use cases

Another important thing you can do is to communicate your use cases to the people collecting the data. For example, 

  • If software developers have a better understanding of how telemetry or log data is being used for analysis, they might write more detailed or more useful logging messages and add new fields to the telemetry data collection. 
  • If you share a business case with cloud service providers to provide additional data types or fields via their API endpoints, you might get better data to help you perform less biased and more useful data analyses. In return, those cloud providers are likely to retain you as a customer. 

Communicating the use case for data collection is most helpful when communicating that information leads to additional data gathering. It’s riskier when it might cause potential data sources to be excluded. 

For example, if you’re using a survey to collect information about a population’s preferences—let’s say, the design of a sneaker—and you disclose that information upfront, you might only get people with strong opinions about sneaker design responding to your survey. That can be great if you want to survey only that population, but if you want a more mainstream opinion, you might miss those responses because the use case you disclosed wasn’t interesting to them. In that context, you need to evaluate the missing data for its relevance to your analysis. 

Build trust when collecting sensitive data 

Data collection is a trust exercise. If the population that you’re collecting data about does not understand why you’re collecting the data, does not trust that you will protect it and use it as you say you will, or believes that you will use the data against them, you might end up with missing data. 

Nowhere is this more apparent than with the U.S. Census. Performed every 10 years, the data from the census is used to determine political representation, distribute federal resources, and much more. Because of how the data from the census survey is used, a representative sample isn’t enough—it must be as complete a survey as possible. 

Screenshot of the Census page How We Protect Your Information.

The Census Bureau understands that mistrust is a common reason why people might not respond to the census survey. Because of that, the U.S. Census Bureau hires pollsters who are part of groups that might be less inclined to respond to the census, and also provides clear and easy-to-find details on its website (see How We Protect Your Information on census.gov) about the measures in place to protect the data collected in the census survey. Those details are even clear in the marketing campaigns urging you to respond to the census! The census also faces other challenges in making sure that the survey is as complete as possible.

This year, the U.S. Census also faced time limits for completing the collecting and counting of surveys, in addition to delays already imposed by the COVID-19 pandemic. The New York Times has additional details about those challenges: The Census, the Supreme Court and Why the Count Is Stopping Early.  

Address instances of mistrust with data stewardship

As Jill Lepore discusses in episode 4, Unheard, of her podcast The Last Archive, mistrust can also affect the accuracy of the data being collected, such as in the case of formerly enslaved people being interviewed by descendants of their former owners, or their current white neighbors, for records collected by the Works Progress Administration. Surely, data is missing from those accounts of slavery due to mistrust of the people doing the data collection, or at the least, because those collecting the stories perhaps do not deserve to hear the true lived experiences of the formerly enslaved people. 

If you and your team are not good data stewards, meaning that you don’t do a good job of protecting the data that you’ve collected or managing who has access to it, people are less likely to trust you with more data, and the datasets you collect are then likely to be missing data. Because of that, it’s important to practice good data stewardship. Use datasheets for datasets or a data biography to record when data was collected, for what purpose, by whom or by what means, and more. You can then review those records to understand whether data is missing, or even to remember what data might be intentionally missing. 

In some cases, data can be intentionally masked, excluded, or left to collect at a later date. If you keep track of these details about the dataset during the data collection process, it’s easier to be informed about the data that you’re using to answer questions and thus use it safely, equitably, and knowledgeably. 
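A data biography doesn’t need special tooling. As a minimal sketch, you could keep one small structured record alongside each dataset. The `DataBiography` class and every field name and value below are hypothetical, chosen to mirror the details described above (when, why, by whom, by what means, and what is known to be missing). They don’t follow any particular datasheet standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DataBiography:
    """A lightweight record of where a dataset came from and what it lacks."""

    name: str
    collected_on: date
    purpose: str
    collected_by: str
    method: str
    known_gaps: List[str] = field(default_factory=list)
    intentional_exclusions: List[str] = field(default_factory=list)

    def missing_data_notes(self) -> List[str]:
        """List everything known to be missing, accidental or deliberate."""
        notes = [f"gap: {gap}" for gap in self.known_gaps]
        notes += [f"excluded on purpose: {item}" for item in self.intentional_exclusions]
        return notes


# Hypothetical example values.
bio = DataBiography(
    name="server_inventory",
    collected_on=date(2020, 10, 1),
    purpose="compliance audit",
    collected_by="infrastructure team",
    method="network scan plus team-maintained list",
    known_gaps=["air-gapped servers", "servers powered off during the scan"],
    intentional_exclusions=["owner names (masked)"],
)

for note in bio.missing_data_notes():
    print(note)
```

Keeping deliberate exclusions separate from accidental gaps makes it easier to remember later which missing data was a choice and which was a limitation.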

Collect what’s missing, maybe

If possible and if necessary, collect the data that is missing. You can create a new dataset if one does not already exist, such as those that journalists and organizations such as Campaign Zero have been compiling about police brutality in the United States. Some data collection that you perform might supplement existing datasets, such as adding introspection details to a log file to help you answer a new question with an existing data source. 

If there are cases where you do need to collect additional data, you might not be able to do so at the moment. In those cases, you can build a roadmap or a business case to collect the data that is missing, making it clear how it can help reduce uncertainty for your decision. That last point is key, because collecting more data isn’t always the best solution for missing data. 

Sometimes, it isn’t possible to collect more data. For instance, you might be trying to gather historical data when everyone from that period has died and few or no primary sources remain. In other cases, the data has been destroyed, whether accidentally, such as in a fire, or intentionally, as the Stasi did after the fall of the Berlin Wall.

Consider whether you need complete data

Also consider whether or not more data will actually help address the problem that you’re attempting to solve. You can be missing data, and yet still not need to collect more data in order to make your decision. As Douglas Hubbard points out in his book, How to Measure Anything, data analysis is about reducing uncertainty about what the most likely answer to a question is. If collecting more data doesn’t reduce your uncertainty, then it isn’t necessary. 

Nani Jansen Reventlow of the Digital Freedom Fund makes this point clear in her Op-Ed on Al Jazeera, Data collection is not the solution for Europe’s racism problem. In that case, collecting more data, even though it could be argued that the data is missing, doesn’t actually reduce uncertainty about what the likely solution for racism is. Being able to quantify the effect or harms of racism on a region does not solve the problem—the drive to solve the problem is the only thing that can solve that problem. 

Avoid cases where you continue to collect data, especially at the expense of an already-marginalized population, in an attempt to prove what is already made clear by the existing information available to you. 

You might think that data collection is the first stage of a data analysis process, but in fact, it’s the second. The next and last post in this series covers defining the question that guides your data analysis, and how to take action to reduce bias in your data-driven decisions: Define the question: How missing data biases data-driven decisions

Finding Myself on the Wall

How climbing teaches me to manage my fear and love myself.

Sometimes I find myself on the wall doing something I never thought possible: holding onto something that doesn’t seem to have a place to hold, or reaching something that looks out of reach. Other times it’s like I’m waking up to find myself trapped in what seems to be an inescapable spot: no holds above me, or nowhere to put my feet to push myself higher. In these cases, the problem is clear. The solution isn’t.

In climbing, the problem can be on the wall, or it can be with my confidence, or my fear. Being able to consistently test solutions, push through challenges, and conquer the problem is what makes climbing a perfect mental and physical outlet for me.

For me, climbing is all about managing fear and trusting myself. I have to manage my natural instincts of being afraid of heights and of falling. I also have to learn to trust my abilities and skills while respecting myself and my boundaries in order to avoid getting hurt or endangering myself or others.

In addition, the different types of climbing require different levels of this fear management and self-trust. I first learned top-rope climbing, and as I got better I got more comfortable. Then I learned bouldering, and got more comfortable there, so I learned how to lead climb. Throughout this process, I’ve built my physical strength and climbing technique, but also my self-confidence and my ability to manage fear.

  • Top-roping is the most comfortable form of climbing for me. I can see the rope, and I can sometimes see the anchor keeping the rope secure. I can feel the taut lack of slack in my rope, and lean back from the wall to test it. I can rest at any time as well, so there is time to slow down and take breaks. All of this physical security reinforces a psychological sense of security, which can help me do more challenging moves and climb higher than I might otherwise feel comfortable climbing.
  • Bouldering requires me to stomach my fear and muster my self-confidence to take me to the top of a wall, or over the top of a wall, without a rope. Bouldering routes are typically anywhere from 10 to 20 feet high in a gym, and on some incredible outdoor routes, 40 or more feet high. Without the physical security of a rope or an anchor, I have to know my physical and psychological strengths and limits before I start. This forces me to scope out the route before I start climbing, and prepare myself to jump or fall to the ground if I feel uncomfortable. Bouldering forces me to get used to this discomfort and either overcome it or recognize when it is valid and listen to it.
  • Lead climbing takes the height of top-roping and combines it with the mental aspects of bouldering. No longer do I have the visible anchor or a taut rope to help me feel safe—it’s just me and the wall. I’m conquering the problem while also taking all the necessary steps to keep myself safe: clip properly, climb safely around the rope, and rest when I can. There’s little to no room for fear.

Each type of climbing removes an element of physical security and further challenges my psychological security as I progress. In this way, I’ve been forced to progressively confront and challenge my limits at the same time that I learn to respect and recognize them.

The dangers of climbing are real. It’s an extreme sport. Though it doesn’t always feel dangerous in a gym, any time that you are high up in the air relying on humans and equipment, something can fail and you can die. It’s also easy to get injured due to bad technique: over-gripping holds, inadequately engaging muscles, straining hand muscles and tendons on hard-to-grip holds. If anything, these risks force me to prioritize muscle recovery and rest days, allowing me to recognize that just as physical self care is important, so too is psychological self care.

Despite these risks, climbing lets me get more in tune with myself than anything else that I’ve tried. It’s a wall of problems, but each one is recognizable and each one is solvable, and I can try them again and again. I can learn by watching someone else solve a problem, but I can’t solve it the same way because we have different skill sets, physical strength, and body types. I still have to solve the problem myself in my own way.

Climbing with other people has also been key to my mental strength. Climbing partners are vital to my safety, but also to my confidence level. They can encourage me to try new routes, and give me beta when I start to falter on a route. Beta, typically defined as information about a route, can also involve encouragement. Everything from the tactical “there’s a foothold by your right knee” to the encouraging “you can reach it!” to the calming “don’t look, just feel” is great beta that has helped me succeed. (I’ve named that last type Yoga Beta). Even so, sometimes the best beta is silence so that I can focus on the problem.

Climbing as a method for teaching myself that I can succeed and iterating my way through problem-solving helps me overcome my fear of failure. I’m learning to trust myself to get through each move, and find something to (physically, psychologically) support myself along the way. I have to trust myself, and the rock, every step of the way.

Torture, Ownership, and Privacy

The Senate Intelligence Committee released hundreds of pages (soon available as a book) detailing acts of torture committed by the CIA.


Software, Sharing, and Music

Here’s what was important this week…

Software is everywhere lately. My boyfriend asked me what I thought the next big website would be (after the success of Google, Myspace, Facebook, Twitter, etc.), and I realized it’s just as likely (if not more likely) to be a software application rather than a website. Paul Ford took some time to enshrine some works of software in a “software canon”: Microsoft Office, Photoshop, Pac-Man, the Unix operating system, and Emacs (which I’d never heard of until this essay came out).

Software has had a noticeable effect on our day-to-day lives (especially for those of us with smartphones), but it’s also had a huge impact on music and the way it’s created, recorded, and produced. Fact Magazine went through 14 works of software that shaped modern music (electronic music started way earlier than I thought). One of those software applications is Auto-Tune, and the Sounding Out! blog happened to post about the history of Auto-Tune.

 
