Define the question: How missing data biases data-driven decisions

This is the eighth and final post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

In this post, I’ll cover the following: 

  • Define the question you want to answer for your data analysis process
  • How does data go missing when you’re defining your question?
  • What can you do about missing data when defining your question?

This post also concludes this blog post series about missing data, featuring specific actions you can take to reduce bias resulting from missing data in your end-to-end data-driven decision-making process.

Define the question 

When you start a data analysis process, you always want to start by deciding what questions you want to answer. Before you make a decision, you need to decide what you want to know.

If you start with the data instead of with a question, you’re sure to be missing data that could help you make a decision, because you’re starting with what you have instead of what you want to know.

Start by carefully defining what you want to know, and then determine what data you need to answer that question. What aggregations and analyses might you perform, and what tools do you need access to in order to perform your analysis? 

If you’re not sure how to answer the question, or what questions to ask to make the decisions that you want to make, you can explore best practices guidance and talk to experts in your field. For example, I gave a presentation about how to define questions when trying to prioritize documentation using data (watch on YouTube). If you are trying to monitor and make decisions about software that you’re hosting and managing, you can dig into the RED method for infrastructure monitoring or the USE method.

It’s also crucial to consider whether you can answer that question adequately, safely, and ethically with the data you have access to.

How does data go missing? 

Data can go missing at this stage if it isn’t there at all—if the data you need to answer a question does not exist. There’s also the possibility that the data you want to use to answer a question is incomplete, or you have some, but not all of the data that you need to answer the question.

It’s also possible that the data exists, but you can’t use it: you either don’t have access to it, or you aren’t permitted to use the data that has already been collected to answer your particular question.

It’s also possible that the data that you do have is not accurate, in which case the data might exist to help answer your question, but it’s unusable, so it’s effectively missing. Perhaps the data is outdated, or the way it was collected means you can’t trust it. 

Who funded the data collection, who performed it, and when and why it was performed can tell you a lot about whether or not you can use a dataset to answer your particular set of questions.

For example, if you are trying to answer the question “What is the effect of slavery on the United States?”, you could review economic reports, the records from plantations about how humans were bought and sold, and stop there. But you might be better off considering who created those datasets, who is missing from them, whether or not they are useful to answer your question, and which datasets might be missing entirely because they were never created or because the records that did exist were destroyed. You might also want to consider whether or not it’s ethical to use data to answer specific questions about the lived experiences of people.

Or, for another grim example, if you want to understand how American attitudes towards Muslims changed after 9/11, you could (if you’re Paul Krugman) look at hate crime data and stop there. Or, as Jameel Jaffer points out in a Twitter thread, you could consider whether or not hate crime data is enough to represent the experience of Muslims after 9/11, considering that “most of the “anti-Muslim sentiment and violence” was *officially sanctioned*” and therefore, all of that is missing from an analysis that focuses solely on hate crime data. Jaffer continues by pointing out that,

“For example, hundreds of Muslim men were rounded up in New York and New Jersey in the weeks after 9/11. They were imprisoned without charge and often subject to abuse in custody because of their religion. None of this would register in any hate crimes database.” 

Data can also go missing if the dataset that you choose to use to answer your question is incomplete.

Incomplete dataset by relying only on digitized archival films

As Rick Prelinger laments in a tweet—if part of a dataset is digitized, often that portion of the dataset is used for data analysis (or research, as the case may be), with the non-digitized portion ignored entirely. 

Screenshot of a tweet by Rick Prelinger @footage "20 years ago we began putting archival film online. Today I can't convince my students that most #archival footage is still NOT online. Unintended consequence of our work: the same images are repeatedly downloaded and used, and many important images remain unused and unseen." sent 12:15 PM Pacific time May 27, 2020

For example, if I wanted to answer the question “What are common themes in American television advertising in the 1950s?”, I might turn to the Prelinger Archives, because they make so much digitized archival film footage available. But just because it’s easily accessible doesn’t make it complete. Just because it’s there doesn’t make it the best dataset to answer your question.

It’s possible that the Prelinger Archives don’t have enough film footage for me to answer such a broad question. In this case, I can supplement the dataset available to me with information that is harder to find, such as by tracking down those non-digitized films. I can also choose to refine my question to focus on a specific type of film, year, or advertising agency that is more comprehensively featured in the archive, narrowing the scope of my analysis to focus on the data that I have available. I could even choose a different dataset entirely, if I find one that more comprehensively and accurately answers my question.

Possibly the most common way that data can go missing when trying to answer a question is that the data you have, or even all of the data available to you, doesn’t accurately proxy what you want to know. 

Inaccurate proxy to answer a question leads to missing data

If you identify data points that inaccurately proxy the question that you’re trying to answer, you can end up with missing data. For example, if you want to answer the question, “How did residents of New York City behave before, during, and after Hurricane Sandy?”, you might look at geotagged social media posts. 

Kate Crawford discusses a study by Nir Grinberg, Mor Naaman, Blake Shaw, and Gilad Lotan, Extracting Diurnal Patterns of Real World Activity from Social Media, in the context of this question in her excellent 2013 article for Harvard Business Review, The Hidden Biases in Big Data.

As she puts it,

“consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. A fascinating study combining Sandy-related Twitter and Foursquare data produced some expected findings (grocery shopping peaks the night before the storm) and some surprising ones (nightlife picked up the day after — presumably when cabin fever strikes). But these data don’t represent the whole picture.” 

That’s because the users of social media, especially those who use Twitter and Foursquare and share location data with those tools, represent only a specific slice of the population affected by Hurricane Sandy. And that specific slice is not a representative or comprehensive slice of New York City residents. Indeed, as Crawford makes very clear, “there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate.”

The dataset of geotagged social media posts only represents some residents of New York City, and not in a representative way, so it’s an inaccurate proxy for the experience of all New York City residents. This means data is missing at the question stage of the data analysis process. You want to answer a question about the experience of all New York City residents, but you only have data about the experience of New York City residents who shared geotagged posts on social media during a specific period of time.

The risk is clear—if you don’t identify the gaps in this dataset, you might draw false conclusions. Crawford is careful to point this out clearly, identifying that “The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster.”

When you identify the gaps in the dataset, you can understand what limitations exist in the dataset, and thus how you might draw false and biased conclusions. You can also identify new datasets to examine or groups to interview to gather additional data to identify the root cause of the missing data (as discussed in my post on data gaps in data collection). 

The gaps in who is using Twitter, and who is choosing to use Twitter during a natural disaster, are one way that Twitter data can inaccurately proxy a population that you want to research and thus cause data to go missing. Another way that it can cause data to go missing is by inaccurately representing human behavior in general because interactions with the platform itself are not neutral. 

As Angela Xiao Wu points out in her blog post, How Not to Know Ourselves, based on a research paper she wrote with Harsh Taneja:

“platform log data are not “unobtrusive” recordings of human behavior out in the wild. Rather, their measurement conditions determine that they are accounts of putative user activity — “putative” in a sense that platforms are often incentivized to keep bots and other fake accounts around, because, from their standpoint, it’s always a numbers game with investors, marketers, and the actual, oft-insecure users.” 

Put another way, you can’t interpret social media interactions as neutral reflections of user behavior due to the mechanisms a social media platform uses to encourage user activity. The authors also point out that it’s difficult to identify if social media interactions reflect the behavior of real people at all, given the number of bot and fake accounts that proliferate on such sites. 

Using a dataset that inaccurately proxies the question that you’re trying to answer is just one way for data to go missing at this stage. What can you do to prevent data from going missing as you’re devising the questions you want to ask of the data? 

What can you do about missing data?

Most importantly, redefine your questions so that you can use data to answer them! If you refine the questions that you’re trying to ask into something that can be quantified, it’s easier to ask the question and get a valid, unbiased, data-driven result. 

Rather than try to understand the experience of all residents of New York City before, during, and after Hurricane Sandy, you can constrain your efforts to understand how social media use was affected by Hurricane Sandy, or how users that share their locations on social media altered their behavior before, during, and after the hurricane.

As another example, you might shift from trying to understand “How useful is my documentation?” to instead asking a question that is based on the data that you have: “How many people view my content?”. You can also try making a broad question more specific. Instead of asking “Is our website accessible?”, ask, “Does our website meet the AA standard of web content accessibility guidelines?”

Douglas Hubbard’s book, How to Measure Anything, provides excellent guidance about how to refine and devise a question that you can use data analysis to answer. He also makes the crucial point that sometimes it’s not worth it to use data to answer a question. If you are fairly certain that you already know the answer to a question, and performing the data analysis (let alone performing it well) would take a lot of time and resources, it’s perhaps not worth attempting to answer the question with data at all!

You can also choose to use a different data source. If the data that you have access to in order to answer your question is incomplete, inadequate, inaccurate, or otherwise missing data, choose a different data source. This might lead you to change your dataset choice from readily-available digitized content to microfiche research at a library across the globe in order to perform a more complete and accurate data analysis.

And of course, if a different data source doesn’t exist, you can create a new data source with the information you need. Collaborate with stakeholders within your organization, make a business case to a third-party system that you want to gather data from, or use Freedom of Information Act (FOIA) requests to gather data that exists but is not easily accessible to create a dataset.

I also want to take care to acknowledge that choosing to use or create a different dataset can often require immense privilege: monetary privilege to fund added data collection, a trip across the globe, or a more complex survey methodology; privilege of access, to know others who are doing similar research and are willing to share data with you; and privilege of time to perform the added data collection and analysis that might be necessary to prevent missing data.

If the data exists but you don’t have permission to use it, you might devise a research plan to request access to sensitive data, or work to gain the consent of those in the dataset that you want to use to allow you to use the data to answer the question that you want to answer. This is another case where communicating the use case of the data can help you gather it—if you share the questions that you’re trying to answer with the people that you’re trying to collect data from, they may be more inclined to share it with you. 

Take action to reduce bias in your data-driven decisions from missing data 

If you’re a data decision-maker, you want to take these steps to take action:

  1. Define the questions being answered with data. 
  2. Identify missing data in the analysis process.
  3. Ask questions of the data analysis before making decisions.

If you carefully define the questions guiding the data analysis process, clearly communicating your use cases to the data analysts that you’re working with, you can prevent data from going missing at the very start. 

Work with your teams and identify where data might go missing in the analysis process, and do what you can to address a leaky analysis pipeline. 

Finally, ask questions of the data analysis results before making decisions. Dig deeper into what is communicated to you, seek to understand what might be missing from the reports, visualizations, and analysis results being presented, and whether or not that missing data is relevant to your decision. 

If you work with data as a data analyst, engineer, admin, or communicator, you can take these steps to take action:

  1. Steward and normalize data.
  2. Analyze data at multiple levels of aggregation and time spans.
  3. Add context to reports and communicate missing data.

Responsibly steward data as you collect and manage it, and normalize it when you prepare it for analysis to make it easier to use. 

If you analyze data at multiple levels of aggregation and time spans, you can determine which level allows you to communicate the most useful information with the least amount of data going missing, hidden by overgeneralized aggregations or overlarge time spans, or hidden in the noise of overly-detailed time spans or too many split-bys. 

Add context to the reports that you produce, providing details about the data analysis process and the dataset used, acknowledging what’s missing and what’s represented. Communicate missing data with detailed and focused visualizations, keeping visualizations consistent for regularly-communicated reports. 

I hope that no matter your role in the data analysis process, this blog post series helps you reduce missing data and make smarter, more accurate, and less biased data-driven decisions.

Analyze the data: How missing data biases data-driven decisions

This is the fifth post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

When you do data analysis, you’re searching and analyzing your data to answer a specific question so that you can make a decision at the end of the process. Unfortunately, there are many ways that data can go missing while you analyze it.

In this post, I’ll cover the following:

  • Why disaggregation matters when performing data analysis
  • How data goes missing when analyzing data
  • What to do about missing data when analyzing data
  • How simple mistakes can cause data to go missing 

Declining bird populations in North America

Any aggregation that you do in your data analysis can cause data to go missing. I recently finished listening to the podcast series The Last Archive, by Jill Lepore, and episode 9 talked about declining bird populations in North America. I realized that it’s an excellent example of why disaggregating data is vital to making sure that data doesn’t go missing during the analysis process. 

As reported in Science Magazine, scientists have concluded that since 1970, the continent of North America has lost 3 billion birds—nearly 30% of the surveyed total. But what does that mean for specific bird populations? Are some more affected than others?

It’s easy for me to look around San Francisco and think, well clearly that many birds can’t be missing—I’m still surrounded by pigeons and crows, and ducks and geese are everywhere when I go to the park! 

If you don’t disaggregate your results, there’s no way to determine which bird breeds are affected and where. You might assume that the 30% decline applies equally across all habitats and types of birds. Thankfully, these scientists did disaggregate their results, so we can identify variations that otherwise would be missing from the analysis.

Screenshot of bar chart from Science magazine, showing bird decline by habitat in percentage. Wetlands gained more than 10%, all other populations declined. Grasslands declined more than 50%.

Screenshot of a bar chart from Science magazine showing decline by type of bird, relevant statistics duplicated in text.

In the case of this study, we can see that some bird populations—Old world sparrows, Larks, and Starlings—are more affected than other types of birds, while others—Raptors, Ducks and geese, and Vireos—have flourished in the past 50 years.

Because the data is disaggregated, you can uncover data that would otherwise be missing from the analysis—how the different types of birds have actually been differently affected, due to habitat loss in grasslands, or cases where restoration and rehabilitation efforts have been effective, such as the resurgence in the population of raptors. 

Without an understanding of which specific bird populations are affected, and where they live, you can’t take as effective action to help bird populations recover, because you’re missing too much data due to an overgeneralized aggregate. Any decisions you took based on the overgeneralized aggregate would be biased and ultimately incorrect. 
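
To make the effect of an overgeneralized aggregate concrete, here is a minimal Python sketch. The counts are hypothetical and chosen only for illustration; they are not the figures from the Science study. It compares one overall percent change with the change per habitat.

```python
# Hypothetical survey counts, NOT the figures from the Science study.
# Each row: (habitat, count_in_1970, count_today)
surveys = [
    ("grassland", 1000, 450),
    ("forest", 1000, 800),
    ("wetland", 500, 560),
]


def percent_change(before, after):
    """Return the percent change from before to after."""
    return (after - before) / before * 100


# Aggregate view: a single number for every habitat combined.
total_before = sum(before for _, before, _ in surveys)
total_after = sum(after for _, _, after in surveys)
overall = percent_change(total_before, total_after)

# Disaggregated view: one number per habitat.
by_habitat = {
    habitat: percent_change(before, after)
    for habitat, before, after in surveys
}

print(f"overall: {overall:.1f}%")
for habitat, change in by_habitat.items():
    print(f"{habitat}: {change:.1f}%")
```

With these made-up numbers, the overall figure is a decline of about 28%, which hides both a much steeper decline in one habitat and a gain in another.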

In the case of this study, we know that targeted bird population restoration is perhaps most needed in grasslands habitats, like the Midwest where I grew up. 

Unfortunately, the study only covers 76% of all bird breeds, so my city-dwelling self will just continue to wonder how the bird population has changed since 1970 for pigeons, doves, crows, and others.

How does data go missing?

An easy way for data to go missing is for incomplete data to be returned when you’re analyzing the data. Many of these examples are Splunk-specific, but most data analysis tools share similar limitations.

Truncated results

In Splunk, search results could be truncated for a variety of reasons. Truncated results have visible error messages if you’re running ad hoc searches, but you might not see error messages if the searches are producing scheduled reports. 

If the indexers where the data is stored are unreachable, or your search times out before completing, the results could be incomplete. If you’re using a subsearch, you might hit the default event limit of 10K events, or the timeout limit of 60 seconds and have incomplete subsearch results. If you’re using the join search command, you might hit the 50K row limit that Nick Mealy discusses in his brilliant .conf talk, Master Joining Datasets Without Using Join.

If you’re searching an external service from Splunk, for example by using ldapsearch or a custom command to search an external database, you might not get complete results if that service is having problems or if you don’t have access to some data that you’re searching for.
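
One generic way to catch silent truncation in a scheduled job is to compare the number of rows you received with the limit you know applies. The sketch below is plain Python, not a Splunk API; the function name and the limit constant are assumptions for illustration, with the limit set to the 10K subsearch default mentioned above.

```python
# Assumed limit for illustration, matching the 10K default mentioned above.
SUBSEARCH_LIMIT = 10_000


def check_truncation(rows, limit=SUBSEARCH_LIMIT):
    """Flag a result set whose size reaches the limit.

    Reaching the limit exactly does not prove that rows were dropped,
    but it is a strong hint that the results deserve a second look.
    """
    row_count = len(rows)
    return {
        "row_count": row_count,
        "limit": limit,
        "possibly_truncated": row_count >= limit,
    }


small_result = check_truncation(["event"] * 42)
full_result = check_truncation(["event"] * 10_000)

print(small_result["possibly_truncated"])
print(full_result["possibly_truncated"])
```

A check like this is only a heuristic. It helps most when it runs alongside a scheduled report, where no one is watching for error messages.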

It’s surprisingly easy for data to go missing when you’re correlating datasets. 

Missing and “missing” fields across datasets

If you’re trying to compare datasets and some of the datasets are missing fields, you might accidentally miss data. Without the same field across multiple types of data, it can be difficult to perform valuable or accurate data correlations.

In this containerized cloud-native world, tracing outages across systems can be complex if you don’t have matching identifiers in every dataset that you’re using. As another example, it can be difficult to identify suspicious user session activity without the ability to correlate session identifiers with active users logged into a specific host. 

Sometimes the fields aren’t actually missing, they’re just named differently. Because the data isn’t normalized, or the fields don’t match your naming expectations, they’re effectively missing from your analysis because you can’t find them. 
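
As a rough sketch of what normalizing differently named fields can look like, the Python below renames known aliases to one common field name before correlating events. The alias table, field names, and events are all hypothetical.

```python
# Hypothetical aliases: each name on the left is renamed to the name on the right.
FIELD_ALIASES = {
    "src": "src_ip",
    "source_address": "src_ip",
    "clientip": "src_ip",
    "user_name": "user",
    "uid": "user",
}


def normalize_event(event, aliases=FIELD_ALIASES):
    """Return a copy of the event with known aliases renamed to common names."""
    return {aliases.get(field, field): value for field, value in event.items()}


# Two hypothetical events that describe the same address with different field names.
web_event = {"clientip": "10.0.0.5", "uid": "alice", "status": 200}
firewall_event = {"src": "10.0.0.5", "action": "allowed"}

# Before normalizing, neither event has a src_ip field to correlate on.
raw_matches = [e for e in (web_event, firewall_event) if "src_ip" in e]

# After normalizing, both events expose src_ip.
normalized = [normalize_event(e) for e in (web_event, firewall_event)]
shared = [e for e in normalized if e.get("src_ip") == "10.0.0.5"]

print(len(raw_matches), len(shared))
```

The field was never missing from either event. It only looked that way until both used the same name.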

Missing fields in a dataset that you want to include in your analysis

Sometimes data is missing from specific events within a dataset. For example, I wanted to determine the average rating that I gave songs in my iTunes library. However, iTunes XML files store tracks with no rating (or a zero star rating for my purposes) without a rating field at all in the events. 

Calculating an average with that data missing gives me an average 3 star rating for all the tracks in my iTunes library. 

Screenshot of a Splunk search and visualization, showing a single value result of "3 stars". Splunk search is: `itunes` | search track_name=* | stats count as track_count by rating | stats avg(rating) as average_rating | replace "60" WITH "3 stars" IN average_rating

But if I add the zero-star rated tracks back in, representing those as “tracks I have decided aren’t worthy of a rating” rather than “tracks that have not yet been rated”, the average changes to 2.5 stars per track. 

Screenshot of splunk search and visualization showing "2.5 stars". Splunk search is: `itunes` | search track_name=* | fillnull rating value="0" | stats count as track_count by rating | stats avg(rating) as average_rating | replace "50" WITH "2.5 stars" IN average_rating

If I’d used the results that were missing data, I’d have a biased interpretation of my average song rating in my iTunes library. 
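
Here is a small Python sketch of the same idea, using hypothetical tracks. It computes a per-track mean and is not a translation of the Splunk searches in the screenshots; the 0 to 100 rating scale with 20 points per star is an assumption inferred from those searches.

```python
# Hypothetical tracks. A missing "rating" key mirrors the unrated tracks
# described above. Assumed scale: 0-100, where 20 points equal one star.
tracks = [
    {"name": "Track A", "rating": 100},
    {"name": "Track B", "rating": 60},
    {"name": "Track C"},  # no rating field at all
    {"name": "Track D", "rating": 80},
    {"name": "Track E"},  # no rating field at all
]


def average_stars(tracks, treat_missing_as_zero):
    """Average star rating, optionally counting unrated tracks as zero stars."""
    ratings = []
    for track in tracks:
        if "rating" in track:
            ratings.append(track["rating"])
        elif treat_missing_as_zero:
            ratings.append(0)
    return sum(ratings) / len(ratings) / 20


rated_only = average_stars(tracks, treat_missing_as_zero=False)
with_zeros = average_stars(tracks, treat_missing_as_zero=True)

print(rated_only, with_zeros)
```

Neither number is wrong on its own. They answer different questions, so the choice of how to treat the missing field has to follow from the question being asked.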

Mismatched numeric units and timezones

You might also have data that goes missing if you’re improperly analyzing data because you don’t know or misinterpret the units that the data is stored in. 

Keep track of whether or not a field contains a time in minutes or seconds and if your network traffic logs are in bytes or megabytes. Data can vanish from your analysis if you improperly compare dissimilar units!
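
A minimal sketch of the unit problem, with hypothetical log records: comparing the raw numbers picks the wrong source, and converting to a common unit first fixes it. The record layout and the decimal megabyte conversion are assumptions.

```python
# Assumes decimal megabytes; use 1_048_576 if your logs report mebibytes.
BYTES_PER_MEGABYTE = 1_000_000

TO_BYTES = {"bytes": 1, "megabytes": BYTES_PER_MEGABYTE}

# Hypothetical records from two sources that report traffic in different units.
records = [
    {"source": "firewall", "transferred": 2_500_000, "unit": "bytes"},
    {"source": "proxy", "transferred": 3, "unit": "megabytes"},
]


def in_bytes(record):
    """Convert a record's transferred amount to bytes."""
    return record["transferred"] * TO_BYTES[record["unit"]]


# Naive comparison: treats 2,500,000 bytes as larger than 3 megabytes.
naive_largest = max(records, key=lambda r: r["transferred"])["source"]

# Comparison after converting everything to the same unit.
correct_largest = max(records, key=in_bytes)["source"]

print(naive_largest, correct_largest)
```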

If you convert a time to a more human-readable format and make incorrect assumptions about the time format, such as the time zone that it’s in, you can cause data to go missing from the proper time period. 

Even without transforming your data, if you’re comparing different datasets that store time data with different time zones, data can go missing. You might think that you’re comparing a four hour window from last Monday while debugging an issue across several datasets, when the datasets actually cover different windows because their timestamps are in different time zones.
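
The time zone mismatch can be sketched with Python's standard datetime module. The dates, the event, and the fixed UTC-8 offset are hypothetical, and the fixed offset ignores daylight saving time to keep the example short.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Fixed offset for illustration only; it ignores daylight saving time.
PACIFIC = timezone(timedelta(hours=-8))

# Window of interest: 09:00 to 13:00 Pacific on a hypothetical Monday.
window_start = datetime(2021, 1, 11, 9, 0, tzinfo=PACIFIC)
window_end = window_start + timedelta(hours=4)

# An event that a second dataset recorded in UTC (10:30 Pacific).
event_utc = datetime(2021, 1, 11, 18, 30, tzinfo=UTC)

# Mistake: read the UTC clock time as if it were already Pacific time.
misread = event_utc.replace(tzinfo=PACIFIC)
in_window_misread = window_start <= misread < window_end

# Fix: compare timezone-aware values and let Python do the conversion.
in_window_correct = window_start <= event_utc < window_end

print(in_window_misread, in_window_correct)
```

With the misread timestamp, the event falls outside the window and silently drops out of the comparison, even though it happened in the middle of it.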

It’s also important to note that how you choose to aggregate your data can hide data that is missing from your dataset, or hide data in a way that causes it to go missing. 

Consider the granularity of your aggregations

When you perform your data analysis, consider what time ranges you’re using to bin or aggregate your data, the time ranges that you use to search your data across, for what data points, which fields you use to disaggregate your data, and how that might affect your results. 

For example, it’s important to keep track of how you aggregate events across time periods in your analysis. If the timespans that you use when aggregating your data don’t align with your use case and the question you’re trying to answer, data can go missing. 

In this example, I was trying to convert an existing search that showed me how much music I was listening to by type per year, into a search that would show me how much music I was listening to by weekday by year. Let’s take a look:

Screenshot of very detailed Splunk search and visualization showing time spent listening by weekday and year

This was my initial attempt at modifying the data, and there’s a lot missing. There are no results at all for Tuesdays in 2019, and the counts for Sundays in 2017, Mondays in 2018, and Thursdays in 2020 were laughably low. What did I do wrong?

It turned out that the time spans I was using in the original search to aggregate my data were too broad for the results I was trying to get. I was doing a timechart with a span of 1 month to start, and then trying to get data down to a granularity of weekday. That wasn’t going to work!

I’d caused data to go missing because I didn’t align the question I was trying to answer with the time span aggregation in my analysis. 

Thankfully, it was a quick fix. I updated the initial time grouping to match my desired granularity of one day, and I was no longer missing data from my results! 

Screenshot showing revised results for time spent listening by weekday and year

This is a case where an overly broad timespan aggregation, combined with a lack of consideration for my use case, caused data to go missing. 
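
The same mistake can be reproduced in a few lines of Python. This is a simplified sketch with hypothetical events, not a translation of the Splunk search: binning by month first collapses every event onto the first day of the month, so a later split by weekday only ever sees that one day.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical listening events: one per day for the first two weeks of March 2020.
events = [date(2020, 3, day) for day in range(1, 15)]

# Too broad: bin each event to the first day of its month, then split by weekday.
monthly_bins = [event.replace(day=1) for event in events]
weekdays_from_months = Counter(WEEKDAYS[b.weekday()] for b in monthly_bins)

# Matching granularity: keep one-day bins, then split by weekday.
weekdays_from_days = Counter(WEEKDAYS[event.weekday()] for event in events)

print(dict(weekdays_from_months))
print(dict(weekdays_from_days))
```

In the monthly version, every event is attributed to the weekday that the month happened to start on, and the other six weekdays appear to have no data at all.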

You make an error in your analysis process 

How many of you have written a Splunk search with a conditional eval function and totally messed it up? 

Screenshot of a Splunk search. Search: |inputlookup append=t concerthistoryparse.csv | eval show_length=case(info == "festival", "28800", info == "dj set", "14400", info == "concert", "10800")

In this case, I wrote a search to calculate the duration of music listening activities (specifically, an estimated amount of time spent at a music festival, DJ set, or concert) in order to compare how much time I spent listening to music before and after I started sheltering in place.

I used a conditional “info == concert” to apply an estimated concert length of 3 hours, but no field-value pair of info=concert existed in my data. In fact, concerts had no info field at all. It wasn’t until I’d finished my search, combining 2 other datasets, that I realized a concert I’d attended in March was missing from the results. 

Screenshot of full Splunk search and visualization showing time spent listening to music, going to shows, and listening to livestreams by month. March shows only livestreams and listening activity.

In order to prevent this data from going missing, I had to break my search down into smaller pieces, validating the accuracy of the results at each step against my expectations and knowledge of what the data should reflect. Eventually I realized that I’d caused data to go missing in my analysis process by making the assumption that an info=concert field existed in my data. 
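
A rough Python analogue of that validation step is below. The durations come from the search shown earlier, while the events and names are hypothetical. The lookup returns nothing when no condition matches, and counting those unmatched events exposes the bad assumption before it reaches the final chart.

```python
# Durations in seconds, taken from the search shown earlier.
SHOW_LENGTH_SECONDS = {"festival": 28800, "dj set": 14400, "concert": 10800}


def show_length(event):
    """Mimic a conditional with no default: return None when nothing matches."""
    return SHOW_LENGTH_SECONDS.get(event.get("info"))


# Hypothetical events. The last one has no info field, like the concerts above.
events = [
    {"show": "Festival X", "info": "festival"},
    {"show": "DJ Y", "info": "dj set"},
    {"show": "Band Z"},
]

lengths = [(event["show"], show_length(event)) for event in events]

# Validation step: list every event that fell through the conditional.
unmatched = [show for show, length in lengths if length is None]

# Without the check above, this total silently leaves out the unmatched show.
total_seconds = sum(length for _, length in lengths if length is not None)

print(unmatched, total_seconds)
```

Checking for unmatched events at each step is cheaper than noticing a missing concert only after combining several datasets.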

Same screenshot as previous image, but with a corrected search and March now shows one show and livestream and listening activity.

Ironically, this graph is still missing data because Last.fm failed to scrobble tracks from Spotify for several time periods in July and August. 

What can you do about missing data?

If you have missing data in your analysis process, what can you do? 

Optimize your searches and environment

If you’re processing large volumes of events, you can consider expanding limits for data, such as those for subsearches or timeouts for results. Default limits are the most common settings, and aren’t always something to adjust, but occasionally it makes sense to adjust the limits to suit your use cases if your architecture can handle it. 

More practically, make sure that you’re running efficient and precise analyses to make the most use of your data analysis environment. If you’re using Splunk, I again recommend Nick Mealy’s excellent .conf talk, Master Joining Datasets Without Using Join, for guidance on optimizing your searches, as well as Martin Müller’s talk on the Splunk Job Inspector, featuring Clara Merriman for the .conf20 edition.

Normalize your data

You can normalize your data to make it easier to use, especially if you’re correlating common fields across many datasets. Use field names that are as accurate and descriptive as possible, but also names that are consistent across datasets. 

It’s a best practice to follow a common information model, but also ensure that you work closely with data administrators and data analysts (if you aren’t performing both roles) to make sure that analyst use cases and expectations align with administrator practices and implementations. 

If you’re using Splunk, start with the Splunk Common Information Model (CIM) or write your own data model to help normalize fields. Keep in mind, too, that you don’t have to use accelerated data models to use a data model when naming fields.

Enrich your datasets 

You can also explore different ways to enrich your datasets earlier in the analysis process to help make sure data doesn’t go missing across datasets. If you perform data enrichment on the data as it streams in, using a tool like Cribl LogStream or Splunk Data Stream Processor, you can add this missing data back to the events and make it easier to accurately and usefully correlate data across datasets. 
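
Enrichment can be as simple as adding a field from a lookup table as each event arrives. The Python below is a generic sketch and does not show how Cribl LogStream or Splunk Data Stream Processor work; the lookup table, hosts, and field names are hypothetical.

```python
# Hypothetical lookup table that maps hosts to the team that owns them.
HOST_OWNERS = {"web-01": "storefront", "db-01": "platform"}


def enrich(event, lookup=HOST_OWNERS):
    """Return a copy of the event with a team field added from the lookup."""
    enriched = dict(event)
    # Mark unknown hosts explicitly so the gap stays visible in later analysis.
    enriched["team"] = lookup.get(event.get("host"), "unknown")
    return enriched


incoming = [
    {"host": "web-01", "status": 500},
    {"host": "cache-07", "status": 200},
]

enriched_events = [enrich(event) for event in incoming]
teams = [event["team"] for event in enriched_events]

print(teams)
```

Labeling hosts that are not in the lookup as "unknown", rather than leaving the field off, keeps those events from quietly dropping out of any later grouping by team.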

Consistently validate your results

Do what you can to consistently and constantly ensure the validity of your search results. Assess the validity against your expectations, but ultimately against itself and against context. If you don’t validate your results, you might lose data that is hidden inside of an aggregate or misinterpreted due to missing context. 

  • Check different time ranges. What seems to be a gap or a pattern in data could just be seasonality or noise. Consider separating weekends from weekdays, or business hours from other hours, depending on the type of data that you’re aggregating. 
  • Use different types of averages, and consider whether the span of values in your results is well represented by an average. An average of a wide range of values can inaccurately reflect the range of the values, effectively hiding the minimum and the maximum values. 
  • Disaggregate your data to identify where you might be missing more data, and thus have bias in your results. Heather Krause and Shena Ashley discuss disaggregation in an interview for If Data Could Talk by Tableau Software on The Ethics of Visualizing Data on Race, making the point that disaggregated data does not imply causation, it merely describes the data.
  • Compare like with like. Validate time zones, units, and the data in similarly-named fields using a data dictionary or by working with the stewards of the datasets to make sure that you’re using the data correctly. 
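Two of the checks above, comparing averages and disaggregating, can be sketched in a few lines of Python. The response times and region names are made up for illustration.

```python
from statistics import mean, median
from collections import defaultdict

# Made-up response times in milliseconds, tagged with a made-up region.
rows = [
    ("east", 110), ("east", 120), ("east", 130), ("east", 125),
    ("west", 115), ("west", 2400),
]

values = [ms for _, ms in rows]
summary = {
    "mean": mean(values),
    "median": median(values),
    "min": min(values),
    "max": max(values),
}

# Disaggregate by region and keep the count, so thin groups are easy to spot.
by_region = defaultdict(list)
for region, ms in rows:
    by_region[region].append(ms)

disaggregated = {
    region: {"count": len(v), "median": median(v), "max": max(v)}
    for region, v in by_region.items()
}
```

In this made-up data the mean is far from every individual value, while the median sits near the typical ones, and the per-region counts show that one region has only two data points behind its numbers.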

To help ensure you’re properly accounting for missing data that you might cause while analyzing data, work with your data administrator and consult with a statistics expert if you’re not sure you’re properly analyzing or aggregating data. To help learn more, I recommend Ben Jones’ book Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations.

The next post in this series covers the different ways that data can go missing at the data management stage—what happens to prepare data for analysis?  Manage the data: How missing data biases data-driven decisions

Visualize the data: How missing data biases data-driven decisions

This is the fourth post in a series about how missing data can bias data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

Visualizing data is crucial to communicate the results of a data analysis process. Whether you use a chart, a table, a list of raw data, or a three-dimensional graph that you can interact with in virtual reality—your visualization choice can cause data to go missing. Any time you visualize the results of data analysis, you make intentional decisions about what to visualize and what not to visualize. How can you make sure that data that goes missing at this stage doesn’t bias data-driven decisions?

In this post, I’ll cover the following: 

  • How the usage of your data visualization can cause data to go missing
  • How data goes missing in data visualizations
  • Why accessibility matters for data visualizations
  • How a lack of labels and scale can mislead and misinform 
  • What to do about missing data in data visualizations

How people use the Georgia Department of Public Health COVID-19 daily report 

When creating a data visualization, it’s important to consider how it will be used. For example, the state of Georgia provides a Department of Public Health Daily COVID-19 reporting page to help communicate the relative case rate for each county in the state. 

In the midst of this global pandemic, I’m taking extra precautions before deciding to go hiking or climbing outside. Part of that risk calculation involves checking the relative case rate in my region — are cases going up, down, or staying the same? 

If you wanted to check that case rate in Georgia in July, you might struggle to make an unbiased decision about your safety because of the format of a data visualization in that report.

As Andisheh Nouraee illustrates in a now-deleted Twitter thread, the Georgia Department of Public Health’s COVID-19 Daily Status Report provided a heat map in July that visualized the number of cases across counties in Georgia in such a way that it effectively hid a 49% increase in cases across 15 days.

July 2nd heat map visualization of Georgia counties showing cases per 100K residents, with bins covering ranges from 1-620, 621-1070, 1071 - 1622, 1623 - 2960, with the red bins covering a range from 2961 - 4661.
Image from July 2nd, shared by Andisheh Nouraee, my screenshot of that image
July 17th heat map visualization showing cases per 100K residents in Georgia, with three counties colored red. Bins represent none, 1-949 cases, 950 - 1555 cases, 1556-2336 cases, 2337 - 3768 cases, and the red bins represent 3769-5165 cases.
Image from July 17th, shared by Andisheh Nouraee, my screenshot of that image

You might think that these visualizations aren’t missing data at all—the values of the gradient bins are clearly labeled, and the map clearly shows how many cases exist for every 100K residents.

However, the missing data isn’t in the visualization itself, but in how it’s used. This heat map is provided to help people understand the relative case rate. If I were checking this graph every week or so, I would probably think that the case rate has stayed the same over that time period. 

Instead, because the visualization uses auto-adjusting gradient bins, the red counties in the visualization from July 2nd cover a range from 2961 to 4661, while the same color counties on July 17th now have case rates of 3769–5165 cases per 100K residents. The bin ranges differ enough that the bins can’t be compared with each other over time. 
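To see the effect of the shifting bins, here is a small Python sketch that uses the bin edges described in the two screenshots. The county rate is hypothetical, and the function is my own illustration, not how the dashboard was built.

```python
from bisect import bisect_left

# Upper edges of the five bins in each screenshot (cases per 100K residents).
JULY_2_UPPER_EDGES = [620, 1070, 1622, 2960, 4661]
JULY_17_UPPER_EDGES = [949, 1555, 2336, 3768, 5165]

def bin_index(rate: int, upper_edges: list) -> int:
    """Return the 0-based bin for a rate; the last bin (4) is the red one."""
    return bisect_left(upper_edges, rate)

rate = 3000  # a hypothetical county whose rate did not change
july_2_bin = bin_index(rate, JULY_2_UPPER_EDGES)    # red bin under the July 2nd legend
july_17_bin = bin_index(rate, JULY_17_UPPER_EDGES)  # one bin lower under the July 17th legend
```

The same rate lands in the red bin on July 2nd and one bin lower on July 17th. A county could even climb from 2961 to 3768 cases per 100K and still appear to improve, because the color it gets depends on the legend for that day rather than on a fixed scale.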

As reported by Keren Landman for the Atlanta Magazine, the Department of Public Health didn’t have direct control over the data on the dashboard anyway, making it harder to make updates or communicate the data more intentionally.

Thankfully, the site now uses a visualization with a consistent gradient scale, rather than auto-adjusting bins.

Screenshot of heat map of counties with cases per 100K residents in Georgia with Union County highlighted showing the confirmed cases from the past 2 weeks and total, and other data points that are irrelevant to this post.

In this example, the combination of the visualization choice and the use of that visualization by the visitors of this website caused data to go missing, possibly resulting in biased decisions about whether it’s safe to go for a hike in the community. 

How does data go missing? 

This example from the Georgia Department of Public Health describes one way that data can go missing, but there are many more. 

Data can go missing from your visualization in a number of ways:

  • If the data exists, but is not represented in the visualization, data is missing.
  • If data points and fluctuations are smoothed over, or connected across gaps, data is missing. 
  • If outliers and other values are excluded from the visualization, data is missing.
  • If people can’t see or interact with the visualization, data is missing.
  • If a limited number of results are being visualized, but the label and title of the visualization don’t make that clear, data is missing.

Accessible data visualizations prevent data from going missing 

Accessible visualizations are crucial for avoiding missing data because data can go missing if people can’t see or interact with it. 

Lisa Charlotte Rost wrote an excellent series for Datawrapper’s blog about colorblindness and data visualizations that I highly recommend for considering color vision accessibility for data visualization: How your colorblind and colorweak readers see your colors, What to consider when visualizing data for colorblind readers, and What’s it like to be colorblind.

You can also go further to consider how to make it easier for folks with low or no vision to interact with your data visualizations. Data visualization artist Mona Chalabi has been experimenting with ways to make her data visualization projects more accessible, including making a tactile version of a data visualization piece, and an interactive piece that uses touch and sound to communicate information, created in collaboration with sound artist Emmy the Great.

At a more basic level, consider how your visualizations look at high zoom levels and how they sound when read aloud by a screen reader. If a visualization is unintelligible at high zoom levels or if portions aren’t read aloud by a screen reader, those are cases where data has gone missing from your visualization. Any decision that someone with low or no vision wants to make based on data visualizations is biased to include only the data visualizations that they can interact with successfully. 

Beyond vision considerations, you want to consider cognitive processing accessibility to prevent missing data. If you overload a visualization with lots of overlays, rely on legends to communicate meaning in your data, or have a lot of text in your visualization, folks with ADHD or dyslexia might struggle to process your visualization. 

Any data that people can’t understand in your visualization is missing data. For more, I recommend the blog post by Sarah L. Fossheim, An intro to designing accessible data visualizations.  

Map with caution and label prodigiously: Beirut explosion map

Data can go missing if you fail to visualize it clearly or correctly. When I found out about the explosion in Beirut, after I made sure that my friends and their family were safe, I wanted to better understand what had happened. 

Screenshot of a map of the Beirut explosion with labels pointing to different overlapping circles, saying "blast site", "widespread destruction", "heavy damage", "damage reported" and "windows blown out up to 15 miles away".
Image shared by Joanna Merson, my screenshot of the image

I haven’t had the privilege to visit Beirut before, so the maps of the explosion radius weren’t as easy for me to personally relate to. Thankfully, people started sharing maps about what the same explosion might look like if it occurred in New York City or London.  

Screenshot of a google maps visualization with 3 overlapping circles centered over New York. No labels.
Image shared by Joanna Merson, my screenshot of the image

This map attempts to show the scale of the same explosion in New York City, but it’s missing a lot of data. I’m not an expert in map visualizations, but thankfully cartographer Joanna Merson tweeted a correction to this map and unpacked just how much data is missing from this visualization. 

There are no labels on this map, so you don’t know the scale of the circles, or what distance each blast radius is supposed to represent. You don’t know what the epicenter of the blast is because it isn’t labeled, and perhaps most egregiously, the map projection used is incorrect. 

Joanna Merson created an alternate visualization, with all the missing data added back in. 

Screenshot of a map visualization by Joanna Merson made on August 5 2020, with a basemap from esri World Imagery using a scale 1:200,000 and the Azimuthal Equidistant projection. Map is centered over New York City with an epicenter of Manhattan labeled and circle radii of 1km, 5km, and 10km clearly labeled.
Image by Joanna Merson, my screenshot of the image.

Her visualization carefully labels the epicenter of the blast, as well as the radii of each of the circles that represent different effects from the blast. She’s also careful to share the map projection that she used—one that has the same distance for every point along that circle. It turns out that the projection used by Google Maps is not the right projection to show distance with an overlaid circle. Without the scale or an accurate projection in use, data goes missing (and gets added) as unaffected areas are misleadingly shown as affected by the blast. 
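For a rough sense of why projection matters, here is a Python sketch under a big assumption: that the basemap uses a spherical Mercator projection, where the stretch at a given latitude is 1 / cos(latitude). The city latitudes are approximate, and this illustrates one way a projection can distort distance, not a reconstruction of what went wrong in the shared map.

```python
import math

def mercator_scale_factor(latitude_deg: float) -> float:
    """How much a spherical Mercator map stretches ground distances at a latitude."""
    return 1.0 / math.cos(math.radians(latitude_deg))

# Approximate city latitudes, used only for illustration.
latitudes = {"Beirut": 33.9, "New York City": 40.7, "London": 51.5}
scale = {city: mercator_scale_factor(lat) for city, lat in latitudes.items()}

# A circle copied at the same size in map units covers less ground where the stretch is larger.
relative_ground_radius = {city: scale["Beirut"] / k for city, k in scale.items()}
```

Under that assumption, a circle sized for Beirut and pasted over a city farther north represents a shorter ground distance than intended, and the stretch also varies from the southern to the northern edge of the same circle.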

How many of you have made a geospatial visualization without knowing anything about map projections and how they might affect your visualization? 

Joanna Merson further points out in her thread on Twitter that maps like this with an overlaid radius to show distance can be inaccurate because they don’t take into account the effect of topography. Data goes missing because topography isn’t represented or considered by the visualization overlaid on the map. 

It’s impractical to model everything perfectly in every map visualization. Depending on how you’re using the map, this missing data might not actually matter. If you communicate what your visualization is intended to represent when you share it, you can convey the missing data and also assert its irrelevance to your point. All maps, after all, must make decisions about what data to include based on the usage of the map. Your map-based data visualizations are no different! 

It can be easy to cut corners and make a simple visualization to communicate the results of data analysis quickly. It can be tedious to add a scale, a legend, and labels to your visualization. But you must consider how your visualization might be used after you make it—and how it might be misused.

Will a visualization that you create end up in a blog post like this one, or a Twitter thread unpacking your mistakes? 

What can you do about missing data?

To prevent or mitigate missing data in a data visualization, you have several options. Nathan Yau of Flowing Data has a very complete guide for Visualizing Incomplete and Missing Data that I highly recommend in addition to the points that I’m sharing here. 

Visualize what’s missing

One important way to mitigate missing data in a data visualization is to devise a way to show the data that is there alongside the data that isn’t. Make the gaps apparent and visualize missing data, such as by avoiding connecting the dots between missing values in a line chart.
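One way to avoid connecting the dots across gaps is to split the series into runs of present values and draw each run as its own line. Here is a minimal Python sketch with made-up daily counts, where None marks a day with no data.

```python
from itertools import groupby

# Made-up daily counts; None marks a day with no data, which is different from a count of zero.
daily_counts = [12, 15, None, None, 9, 11, 0, None, 14]

def contiguous_segments(values):
    """Split a series into runs of present values, keeping each run's starting position."""
    segments = []
    position = 0
    for is_missing, run in groupby(values, key=lambda v: v is None):
        run = list(run)
        if not is_missing:
            segments.append((position, run))
        position += len(run)
    return segments

segments = contiguous_segments(daily_counts)
missing_days = [i for i, v in enumerate(daily_counts) if v is None]
```

Each segment can be plotted separately, and the list of missing days can drive a shaded band or an annotation. Note that the real zero on day 6 stays in its segment, because a measured zero is data and a missing value is not.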

In cases where your data has gaps, you can add annotations or labels to acknowledge and explain any inconsistencies or perceived gaps in the data. In some cases, data can appear to be missing, but is actually a gap in the data due to seasonal fluctuations or other reasons. It’s important to thoroughly understand your data to identify the difference. 

If you visualize the gaps in your data, you have the opportunity to discuss what can be causing the gaps. Gaps in data can reflect reality, or flaws in your analysis process. Either way, visualizing the gaps in your data is just as valuable as visualizing the data that you do have. Don’t hide or ignore missing data.

Carefully consider time spans

Be intentional about the span that you choose for time-based visualizations. You can unintentionally hide fluctuations in the data if you choose an overly-broad span for your visualization, causing data to go missing by flattening it. 

If you choose an overly-short time span for your visualization, however, the meaning of the data and what you’re trying to communicate can go missing with all the noise of the individual data points. Consider what you’re trying to communicate with the data visualization, and choose a time span accordingly.
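Here is a small Python sketch of how the span changes what you can see. The hourly error counts are made up, with a single spike on the second day.

```python
from statistics import mean

# Made-up hourly error counts for two days; one hour on day 2 has a spike.
hourly = [2] * 24 + [2] * 10 + [98] + [2] * 13

def bucket_means(values, span):
    """Average consecutive values in buckets of `span` points."""
    return [mean(values[i:i + span]) for i in range(0, len(values), span)]

daily = bucket_means(hourly, 24)      # broad span: the spike is flattened
six_hourly = bucket_means(hourly, 6)  # narrower span: the spike stands out
```

With a daily span, the second day averages 6 errors per hour, which hides an hour with 98 errors. The six-hour span still smooths the data, but one bucket clearly stands apart from the rest.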

Write clearly 

Another way to address missing data is to write good labels and titles for visualizations. It’s crucial to explain exactly what is present in a visualization—an important component of communicating results. If you’re intentional and precise about your labels and titles, you can prevent data from going missing. 

If the data analysis contains the results for the top 10 cities by population density, but your title only says “Top Cities”, data has gone missing from your visualization!

You can test out the usefulness of your labels and titles by considering the following: If someone screenshots your visualization and puts it in a different presentation, or tweets it without the additional context that might be in the full report, how much data would be missing from the visualization? How completely does the visualization communicate the results of data analysis if it’s viewed out of context?

Validate your scale

Make sure any visualization that you create includes a scale. It’s really easy for data to go missing if the scale of the data itself is missing. 

Also validate that the scale on your visualization is accurate and relevant. If you’re visualizing percentages, make sure the scale goes from 0-100. If you’re visualizing logarithmic data, make sure your scale reflects that correctly. 

Consider the use

Consider how your visualization will be used, and design your visualizations accordingly. What decisions are people trying to make based on your visualization? What questions are you trying to answer when you make it? 

Automatically-adjusting gradient bins in a heat map can be an excellent design choice, but as we saw in Georgia, they don’t make sense for communicating relative change over time. 

Choose the right chart for the data

It’s also important to choose the right chart to visualize your data. I’m not a visualization expert, so check out this data tutorial from Chartio, How to Choose the Right Data Visualization as well as these tutorials of different chart types on Flowing Data: Chart Types.  

I do want to recommend that if you’re visualizing multiple aggregations in one visualization in the Splunk platform, consider the Trellis layout to create different charts to help compare across the aggregates. 

Always try various types of visualizations for your data to determine which one shows the results of your analysis in the clearest way.

One of the best ways to make sure your data visualization isn’t missing data is to make sure that the data analysis is sound. The next post in this series addresses how data can go missing while you analyze it: Analyze the data: How missing data biases data-driven decisions.

Communicate the data: How missing data biases data-driven decisions

This is the third post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

Communicating the results of a data analysis process is crucial to making a data-driven decision. You might review results communicated to you in many ways:

  • A slide deck presented to you
  • An automatically-generated report emailed to you regularly
  • A white paper produced by an expert analysis firm that you review 
  • A dashboard full of curated visualizations
  • A marketing campaign

If you’re a data communicator, what you choose to communicate (and how you do so) can cause data to go missing and possibly bias data-driven decisions made as a result.

In this post, I’ll cover the following:

  • A marketing campaign that misrepresents the results of a data analysis process
  • A renowned white paper produced by leaders in the security space
  • How data goes missing at the communication stage
  • What you can do about missing data at the communication stage

Spotify Wrapped: Your “year” in music 

If you listen to music on Spotify, you might have joined the hordes of people who viewed and shared their Spotify Wrapped playlists and images like these last year. 

Screenshot of Spotify Wrapped graphic showing "we spend some serious time together"

Spotify Wrapped is a marketing campaign that purports to communicate your year in music, based on the results of some behind-the-scenes data analysis that Spotify performs. While it is a marketing campaign, it’s still a way that data analysis results are being communicated to others, and thus relevant to this discussion of missing data in communications. 

Screenshot of Spotify Wrapped campaign showing "You were genre-fluid" Screenshot of Spotify Wrapped campaign showing that I discovered 1503 new artists.

In this case you can see that my top artists, top songs, minutes listened, and top genre are shared with me, along with the number of artists that I discovered. It’s impressive, and exciting to see my year in music summed up in such a way! 

But after I dug deeper into the data, the gaps in the communication and data analysis became apparent. What’s presented as my year in music is actually more like my “10 months in music”, because the data only represents the period from January 1st to October 31st of 2019. Two full months of music listening behavior are completely missing from my “year” in music. 

Unfortunately, these details about the dataset are missing from the Spotify Wrapped campaign report itself, so I had to search through additional resources to find out more information. According to a now-archived FAQ on Spotify for Artists (thank goodness for the Wayback Machine), the data represented in the campaign covers the dates from January 1st, 2019 – October 31st, 2019. I also read two key blog posts, Spotify Wrapped 2019 Reveals Your Streaming Trends, from 2010 to Now announcing the campaign and Unwrapping Wrapped 2019: Spotify VP of Engineering Tyson Singer Explains digging into the data analysis behind the campaign, to learn what I could about the size of the dataset, or how many data points might be necessary to draw the conclusions shared in the report. 

Screenshot of Spotify FAQ question "Why are my 2019 Artist Wrapped stats different from the stats I see in Spotify for Artists?"

Screenshot of Spotify FAQ answering the question "How do I get my 2019 Artist Wrapped?" with the first line of the answer being "If you have music on Spotify with at least 3 listeners before October 31st 2019, you get a 2019 Wrapped!"

This is a case where data is going missing from the communication because the data is presented as though it represents an entire time period, when in fact it only represents a subset of the relevant time period. Not only that, but it’s unclear what other data points actually represent. The number of minutes spent listening to music in those 10 months could be calculated by adding up the actual amount of time I spent listening to songs on the service, but could also be an approximate metric calculated from the number of streams of tracks in the service. It’s not possible to find out how this metric is being calculated. If it is an approximate metric based on the number of streams, that’s also a case of uncommunicated missing data (or a misleading metric), because according to the Spotify for Artists FAQ, Spotify counts a song as streamed if you’ve listened to at least 30 seconds of a track. 
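A quick Python sketch puts numbers on both gaps. The dates come from the archived FAQ mentioned above; the stream count is hypothetical, and the minutes calculation assumes (without confirmation from Spotify) that the metric is derived from stream counts.

```python
from datetime import date

# Dates from the archived FAQ cited above.
covered_start = date(2019, 1, 1)
covered_end = date(2019, 10, 31)
year_start = date(2019, 1, 1)
year_end = date(2019, 12, 31)

covered_days = (covered_end - covered_start).days + 1
year_days = (year_end - year_start).days + 1
coverage = covered_days / year_days

# If minutes were derived from stream counts, the 30-second rule would only give a lower bound.
streams = 10_000  # hypothetical stream count
minimum_minutes = streams * 30 / 60
```

The covered period is 304 of 365 days, or about 83% of the year. And if a stream only guarantees 30 seconds of listening, a count of streams can only tell you the minimum time spent listening, not the actual time.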

It’s likely that other data is missing in this incomplete communication, but another data communication is a great example of how to do it right. 

Verizon DBIR: A star of data communication 

The Verizon Data Breach Investigations Report (DBIR) is an annual report put out by Verizon about, well, data breach investigations. Before the report shares any results, it includes a readable summary about the dataset used, how the analysis is performed, as well as what is missing from the data. Only after the limitations of the dataset and the resulting analysis are communicated does the report share the actual results of the analysis. 

Screenshot of Verizon DBIR results and analysis section of the 2020 DBIR

And those results are well-presented, featuring confidence intervals, detailed titles and labels, as well as clear scales to visualize the data. I talk more about how to prevent data from going missing in visualizations in the next post of the series, Visualize the data: How missing data biases data-driven decisions.

Screenshot of 2 visualizations from the Verizon 2020 DBIR

How does data go missing?

These examples make it clear that data can easily go missing when you’re choosing how to communicate the results of data analysis. 

  • You include only some visualizations in the report — the prettiest ones, or the ones with “enough” data
  • You include different visualizations and metrics than the ones in the last report, without an explanation about why they changed, or what changed. For example, this happened in Spain in late May 2020, as reported by the Financial Times: Flawed data casts cloud over Spain’s lockdown strategy.
  • You choose not to share or discuss the confidence intervals for your findings. 
  • You neglect to include details about the dataset, such as the size of the dataset or the representativeness of the dataset.
  • You discuss the available data as though it represents the entire problem, rather than a subset of the problem. 

For example, the Spotify Wrapped campaign shares the data that it makes available as though it represents an entire year, but instead only reflects 10 months, and doesn’t include any details about the dataset beyond what you can assume—it’s based on Spotify’s data. This missing data doesn’t make the conclusions drawn in the campaign inaccurate, but it is additional context that can affect how you interpret the findings and make decisions based on the data, such as which artist’s album you might buy yourself to celebrate the new year. 

What can you do about missing data?

To mitigate missing data in your communications, it’s vital to be precise and consistent. Communicate exactly what is and is not covered in your report. If it makes sense, you could even share the different levels of disaggregation used throughout the data analysis process that led to the results discussed in the final communication. 
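One way to be precise is to report the sample size and a confidence interval next to every percentage. The Python sketch below uses the normal approximation for a proportion, which is one of several possible methods and my choice for illustration; the sample numbers are made up.

```python
import math

def proportion_with_interval(successes: int, total: int, z: float = 1.96):
    """Return (proportion, low, high) using the normal approximation, clipped to [0, 1]."""
    p = successes / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# The same 40% finding from a small and a large made-up sample.
small = proportion_with_interval(8, 20)
large = proportion_with_interval(800, 2000)

def describe(label, result, total):
    """Format a finding with its interval and sample size for a report."""
    p, low, high = result
    return f"{label}: {p:.0%} (95% CI {low:.0%} to {high:.0%}, n={total})"

report_lines = [describe("Small sample", small, 20), describe("Large sample", large, 2000)]
```

Both samples produce the same headline percentage, but the interval for the small sample is many times wider. If the report only says "40%", that difference goes missing.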

Consistency in the visualizations that you choose to include, as well as the data spans covered in your report, can make it easier to identify when something has changed since the last report. Changes can be due to missing data, but can also be due to something else that might not be identified if the visualizations in the report can’t be compared with each other.
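If your reports are generated from a definition of some kind, you can check for consistency before you publish. This Python sketch compares two hypothetical report definitions; the metric names and time spans are invented for the example.

```python
# Hypothetical metric names and time spans from two editions of a report.
previous_report = {"span_days": 30, "metrics": {"active_users", "error_rate", "p95_latency"}}
current_report = {"span_days": 7, "metrics": {"active_users", "error_rate", "signups"}}

def report_changes(previous: dict, current: dict) -> dict:
    """List what was dropped, added, or re-scoped so the changes can be called out."""
    return {
        "dropped_metrics": sorted(previous["metrics"] - current["metrics"]),
        "added_metrics": sorted(current["metrics"] - previous["metrics"]),
        "span_changed": previous["span_days"] != current["span_days"],
    }

changes = report_changes(previous_report, current_report)
```

Anything that shows up in the result is a candidate for a short note in the report explaining what changed and why.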

If you do change the format of the report, consider continuing to provide the old format alongside the new format. Alternately, highlight what is different from the last report, why you made changes, and clearly discuss whether comparisons can or cannot be made with older formats of the report to avoid unintentional errors. 

If you know data is missing, you can discuss it in the report, share why the missing data does or doesn’t matter, and possibly choose to communicate a timeline or a business case for addressing the missing data. For example, there might be a valid business reason why data is missing, or why some visualizations have changed from the previous report.

The 2020 edition of Spotify Wrapped will almost certainly include different types of data points due to the effects of the global pandemic on people’s listening habits. Adding context to why data is missing, or why the report has changed, can add confidence in the face of missing data—people now understand why data is missing or has changed. 

Often when you’re communicating data, you’re including detailed visualizations of the results of data analysis processes. The next post in this series covers how data can go missing from visualizations, and what to do about it: Visualize the data: How missing data biases data-driven decisions.

Decide with the data: How missing data biases data-driven decisions

This is the second post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

Any decision based solely on the results of data analysis is missing data—the non-quantitative kind. But data can also go missing from data-driven decisions as a result of the analysis process. Exhaustive data analysis and universal data collection might seem like the best way to prevent missing data, but it’s not realistic, feasible, or possible. So what can you do about the possible bias introduced by missing data? 

In this post, I’ll cover the following:

  • What to do if you must make a decision with missing data
  • How data goes missing at the decision stage
  • What to do about missing data at the decision stage
  • How much missing data matters
  • How much to care about missing data before making your decision

Missing data in decisions from the Oregon Health Authority

In the midst of a global pandemic, we’ve all been struggling to evaluate how safe it is to resume pre-pandemic activities like going to the gym, going out to bars, sending our kids back to school, or eating at restaurants. In the United States, state governors are the ones tasked with making decisions about what to reopen and what to keep closed, and most are taking a data-driven approach. 

In Oregon, the decisions that officials are making about what to reopen and when are based on incomplete data. As reported by Erin Ross for Oregon Public Broadcasting, the Oregon Health Authority is not collecting or analyzing data about whether or not restaurants or bars are contributing to COVID-19 case rates.

The contact tracers interviewing people who’ve tested positive for SARS-CoV-2 are not asking, and even if the information is shared, the data isn’t being analyzed in a way that might allow officials to identify the effect of bars and restaurants on coronavirus case rates. Although this data is missing, officials are making decisions about whether or not bars and restaurants should remain open for indoor operations. 

Oregon and the Oregon Health Authority aren’t alone in this limitation. We’re in the midst of a pandemic, and everyone is doing the best they can with the limited resources and information that they have. Complete data isn’t always possible, especially when real-time data matters. So what can Oregon (and you) do to make sure that missing data doesn’t negatively affect the decision being made? 

If circumstances allow, it’s best to try to narrow the scope of your decision. Limit your decision to those groups and situations about which you have complete or representative data. If you can’t limit your decision, such as is the case with this pandemic, you can still make a decision with incomplete data. 

Acknowledge that your decision is based on limited data, identify the gaps in your knowledge, and make plans to address those gaps as soon as possible. You can address the missing data by collecting more data, analyzing your existing data differently, or by reexamining the relevance of various data points to the decision that you’re making. 

This is one key example of how missing data can affect a decision-making process. How else can data go missing when making decisions? 

How can data go missing?

Data-driven decisions are especially vulnerable to the effects of missing data. Because data-driven decisions are based on the results of data analysis, the effects of missing data in the earlier data analysis stages are compounded. For example:

  • The reports that you reviewed before making your decision included the prettiest graphs instead of the most useful visualizations to help you make your decision. 
  • The visualizations in the report were different from the ones in the last report, making it difficult for you to compare the new results with the results in the previous report. 
  • The data being collected doesn’t include the necessary details for your decision. This is what is happening in Oregon, where the collected data doesn’t include all the details that are relevant when making decisions about what businesses and organizations to reopen. 
  • The data analysis that was performed doesn’t actually answer the question that you’re asking. If you need to know “how soon can we reopen indoor dining and bars despite the pandemic”, and the data analysis being performed can only tell you “what are the current infection rates, by county, based on the results of tests administered 5 days ago”, the decisions that you’re making might be based on incomplete data.

What can you do about missing data?

Identify if the missing data matters to your decision. If it does, you must acknowledge that the data is missing when you make your decision. If you don’t have data about a group, intentionally exclude that group from your conclusions or decision-making process. Constrain your decision according to what you know, and acknowledge the limitations of the analysis.

If you want to make broader decisions, you must address the missing data throughout the rest of the process! If you aren’t able to immediately collect missing data, you can attempt to supplement the data with a dedicated survey aimed at gathering that information before your decision. You can also investigate to find out if the data is already available in a different format or context. For example, in Oregon, the Health Authority might have some information about indoor restaurant and bar attendance in the content of contact tracing interviews and just isn’t analyzing it systematically. If the data is representative even if it isn’t comprehensive, you can still use it to supplement your decision. 

To make sure you’re making an informed decision, ask questions about the data analysis process that led to the results you’re reviewing. Discuss whether or not data could be missing from the results of the analysis presented to you, and why. Ask yourself: does the missing data affect the decision that I’m making? Does it affect the results of the analysis presented to me? Evaluate how much missing data matters to your decision-making process. 

You don’t always need more data

You will always be missing some data. It’s important to identify when the data that is missing is actually relevant to your analysis process, and when it won’t change the outcome. Acknowledge when additional data won’t change your conclusions. 

You don’t need all possible data in existence to support a decision. As Douglas Hubbard points out in his book How to Measure Anything, the goal of a data analysis process is to reduce your uncertainty about the right approach to take. 

If additional data, or more detailed analysis, won’t further reduce any uncertainty, then it’s likely unnecessary. The more clearly you constrain your decisions, and the questions you use to guide your data analysis, the more easily you can balance reducing data gaps and making a decision with the data and analysis results you have. 
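
One way to check whether missing data could change your outcome is to try the extremes. The sketch below is a minimal Python illustration, not a method from Hubbard’s book. It assumes a hypothetical decision rule (act when an average meets a cutoff) and made-up survey scores, fills the missing values with the lowest and highest plausible values, and reports whether the decision stays the same either way.

```python
from statistics import mean


def decision_is_robust(known, n_missing, low, high, cutoff):
    """Return True if no plausible values for the missing records could flip the decision.

    Hypothetical decision rule: act when the average is at or above `cutoff`.
    `low` and `high` are the smallest and largest values a missing record could take.
    """
    worst_case = mean(list(known) + [low] * n_missing)
    best_case = mean(list(known) + [high] * n_missing)
    # The decision is robust when both extremes land on the same side of the cutoff.
    return (worst_case >= cutoff) == (best_case >= cutoff)


# Made-up satisfaction scores on a 1-5 scale, with 2 responses missing.
scores = [4, 5, 4, 4, 5]
print(decision_is_robust(scores, n_missing=2, low=1, high=5, cutoff=3.0))
print(decision_is_robust(scores, n_missing=2, low=1, high=5, cutoff=4.0))
```

If the decision holds at both extremes, collecting the missing responses won’t change it. If it flips, the gap matters and is worth addressing before you decide.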

USCIS doesn’t allow missing data, even if it doesn’t affect the decision

Sometimes, missing data doesn’t affect the decision that you’re making. This is why you must understand the decision you’re making, and how important comprehensive data is to that decision. If some missing data is irrelevant to the decision, you want to make sure your policies acknowledge that reality.

In the case of the U.S. Citizenship and Immigration Services (USCIS), their policies don’t seem to recognize that some kinds of missing data for citizenship applications are irrelevant. 

In her Washington Post Opinions column, The Trump administration’s no-blanks policy is the latest Kafkaesque plan designed to curb immigration, Catherine Rampell describes the “no blanks” policy applied to immigration applications and, now, to third-party documents included with the applications.

“Last fall, U.S. Citizenship and Immigration Services introduced perhaps its most arbitrary, absurd modification yet to the immigration system: It began rejecting applications unless every single field was filled in, even those that obviously did not pertain to the applicant.

“Middle name” field left blank because the applicant does not have a middle name? Sorry, your application gets rejected. No apartment number because you live in a house? You’re rejected, too.

No address given for your parents because they’re dead? No siblings named because you’re an only child? No work history dates because you’re an 8-year-old kid?

All real cases, all rejected.”

In this example, missing data is deemed a problem for making a decision about the citizenship application for a person—even when the data that is missing is supposed to be missing because it doesn’t exist. When asked for comment,

“a USCIS spokesperson emailed, “Complete applications are necessary for our adjudicators to preserve the integrity of our immigration system and ensure they are able to confirm identities, as well as an applicant’s immigration and criminal history, to determine the applicant’s eligibility.””

Missing data alone is not enough to affect your decision. What matters is missing data that would change the outcome. A lack of data is not itself a problem; the problem is when data that is relevant to your decision is missing. That’s how bias gets introduced to a data-driven decision.

In the next post in this series, I’ll explore some ways that data can go missing when the results of data analysis are communicated: Communicate the data: How missing data biases data-driven decisions

What’s missing? Reduce bias by addressing data gaps in your analysis process

We live in an uncertain world. Facing an ongoing global pandemic, worsening climate change, persistent threats to human rights, and the more mundane uncertainties of our day-to-day lives, we try to use data to make sense of it all. Relying on data to guide decisions can feel safe. 

But you might not be able to trust your data-driven decisions if data is missing from your data analysis process. If you can identify and address gaps in your data analysis process, you can reduce the bias introduced by missing data in your data-driven decisions, regaining your confidence and certainty while ensuring you limit possible harm. 

This is post 1 of 8 in a series about how missing data can negatively affect a data analysis process. 

In this post, I’ll cover:

  • What is missing data? 
  • Why missing data matters
  • What’s missing from all data-driven decisions
  • The stages of a data analysis process

I hope this series inspires you and prepares you to take action to address bias introduced by missing data in your own data analysis processes. At the very least, I hope you gain a new perspective when evaluating your data-driven decisions, the success of data analysis processes, and how you frame a data analysis process from the start.

What is missing data?

Data can go missing in many ways. If you’re not collecting data, or don’t have access to some kinds of data, or if you can’t use existing data for a specific data analysis process—that data is missing from your analysis process. 

Other data might not be accessible to you for other reasons. Throughout this series, I’ll use the term “missing data” to refer to data that does not exist, data that you do not have access to, and data that is obscured by your analysis process (effectively missing, even if not literally gone).

Why missing data matters

Missing data matters because it can easily introduce bias into the results of a data analysis process. Biased data analysis is often framed in the context of machine learning models, training datasets, or inscrutable and biased algorithms leading to biased decisions. 

But you can draw biased and inaccurate conclusions from any data analysis process, regardless of whether machine learning or artificial intelligence is involved. As Meg Miller makes clear in her essay Finding the Blank Spots in Data for Eye on Design, “Artists and designers are working to address a major problem for marginalized communities in the data economy: “If the data does not exist, you do not exist.””. And that’s just one part of why missing data matters. 

You can identify the possible biases in your decisions if you can identify the gaps in your data and data analysis process. And if you can recognize those possible biases, you can do something to mitigate them. But first we need to acknowledge what’s missing from every data-driven decision. 

What’s missing from all data-driven decisions?

It feels safe to make a data-driven decision. You’ve performed a data analysis process and have a list of results matched up with objectives that you want to achieve. It’s easy to equate data with neutral facts. But we can’t actually use data for every decision, and we can’t rely only on data for a decision-making process. Data can’t capture the entirety of an experience—it’s inherently incomplete.

Data only represents what can be quantified. Howard Zinn writes about the incompleteness of data in representing the horrors of slavery in A People’s History of the United States:

“Economists or cliometricians (statistical historians) have tried to assess slavery by estimating how much money was spent on slaves for food and medical care. But can this describe the reality of slavery as it was to a human being who lived inside it? Are the conditions of slavery as important as the existence of slavery?”

“But can statistics record what it meant for families to be torn apart, when a master, for profit, sold a husband or a wife, a son or a daughter?” 

(pg 172, emphasis original). 

Statistical historians and others can attempt to quantify the effects of slavery based on the records available to them, but parts of that story can never be quantified. The parts that can’t be quantified must be told, and must be considered when creating a historical record and, of course, when deciding whose story gets told and how.

What data is available, and from whom, represents an implicit value and power structure in society as well. If data has been collected about something, and made available to others, then that information must be important—whether to an organization, a society, a government, or a world—and the keepers of the data had the privilege and the power to maintain it and make it available after it was collected. 

This power structure, this value structure, and the limitations of data alone when making decisions are crucial to consider in this era of seemingly-objective data-driven decisions. Because data alone isn’t enough to capture the reality of a situation, it isn’t enough to drive the decisions you make in our uncertain world. And that’s only the beginning of how missing data can affect decisions.

Data can go missing at any stage of the data analysis process

It’s easy to consider missing data as solely a data collection problem: if the dataset existed, or new data was collected, no data would be missing and we could make better data-driven decisions. In fact, avoiding missing data when you’re collecting it is just one way to reduce bias in your data-driven decisions, and it’s far from the only way.

Data can go missing at any stage of the data analysis process and bias your resulting decisions. Each post in this series addresses a different stage of the process. 

  1. 🗣 Make a decision based on the results of the data analysis. Decide with the data: How missing data biases data-driven decisions.
  2. 📋 Communicate the results of the data analysis. Communicate the data: How missing data biases data-driven decisions
  3. 📊 Visualize the data to represent the answers to your questions. Visualize the data: How missing data biases data-driven decisions.
  4. 🔎 Analyze the data to answer your questions. Analyze the data: How missing data biases data-driven decisions.
  5. 🗂 Manage the data that you’ve collected to make it easier to analyze. Manage the data: How missing data biases data-driven decisions
  6. 🗄 Collect the data you need to answer the questions you’ve defined. Collect the data: How missing data biases data-driven decisions.
  7. 🙋🏻‍♀️ Define the question that you want to ask the data. Define the question: How missing data biases data-driven decisions

In each post, I’ll discuss real world examples of how data can go missing, and what you can do about it!

Making Concert Decisions with Splunk

The annual Noise Pop music festival starts this week, and I purchased a badge this year, which means I get to go to any show that’s a part of the festival without buying a dedicated ticket.

That means I have a lot of choices to make this week! I decided to use data to assess (and validate) some of the harder choices I needed to make, so I built a dashboard, “Who Should I See?” to help me out.

First off, the Wednesday night show. Albert Hammond, Jr. of the Strokes is playing, but more people are talking about the Baths show the same night. Maybe I should go see Baths instead?

Screen capture showing two inputs, one with Baths and one with Albert Hammond, Jr, resulting in count of listens compared for each artist (6 vs 39) and listens over time for each artist. Baths has 1 listen before 2012, and 1 listen each year for 2016 until this year. Albert Hammond, Jr has 8 listens before 2010, and a consistent yet reducing number over time, with 5 in 2011 and 4 in 2015, but just a couple since then.

If I’m making my decisions purely based on listen count, it’s clear that I’m making the right choice to see Albert Hammond, Jr. It is telling, though, that I’ve listened to Baths more recently than him, which might have contributed to my indecision.

The other night I’m having a tough time deciding about is Saturday night. Beirut is playing, but across the Bay in Oakland. Two other interesting artists are playing closer to home, Bob Mould and River Whyless. I wouldn’t normally care about this so much, but I know my Friday night shows will keep me busy and leave me pretty tired. So which artist should I go see?

3 inputs on a dashboard this time, Beirut, Bob Mould, and River Whyless are the three artists being compared. Beirut has 44 listens, Bob Mould has 21, River Whyless has 3. Beirut has frequent listens over time, peaking at 6 before 2010, but with peaks at 5 in 2011 and 2019. Bob Mould has 6 listens pre-2009, but only 3 in 2010 and after that, 1 a year at most. River Whyless has 1 listen in April, and 2 in December of 2018.

It’s pretty clear that I’m making the right choice to go see Beirut, especially given my recent renewed interest thanks to their new album.

I also wanted to be able to consider if I should see a band at all! This isn’t as relevant this week thanks to the Noise Pop badge, but the dashboard currently evaluates whether the number of listens I have for an artist exceeds a threshold, which I calculate based on the total number of listens for all artists that I’ve seen live in concert. If it does, I return advice to “Go to the concert!” but if it doesn’t, I recommend “Only if it’s cheap, yo.”
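
As a rough illustration of that logic outside of Splunk, here is a minimal Python sketch. It is not the search that powers the dashboard. It assumes the threshold is the average listen count across artists I’ve seen live, and the artist names and listen counts are placeholders I made up for the example.

```python
from statistics import mean


def concert_advice(artist_listens, seen_live_listens):
    """Compare one artist's listen count against a threshold.

    Assumption: the threshold is the average listen count across all
    artists already seen live (`seen_live_listens` maps artist -> count).
    """
    threshold = mean(seen_live_listens.values())
    if artist_listens > threshold:
        return "Go to the concert!"
    return "Only if it's cheap, yo."


# Placeholder data: listen counts for artists already seen live.
seen_live = {"Artist A": 40, "Artist B": 10, "Artist C": 25}

print(concert_advice(30, seen_live))
print(concert_advice(4, seen_live))
```

With these placeholder numbers the threshold works out to 25 listens, so an artist with 30 listens clears it and an artist with 4 does not.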

Because I don’t need to make this decision for Noise Pop artists, I picked a few that I’ve been wanting to see lately: Lane 8, Luttrell, and The Rapture.

4 dashboard panels, 3 of which ask "Should I go see (artist) at all?" one for each artist, Lane 8, Luttrell, and The Rapture. Lane 8 and Luttrell both say "Only go if it's cheap, yo." and The Rapture says "Go to the concert!". The fourth panel shows frequent listening for The Rapture, especially from 2008-2012, with a recent peak in 2018. Lane 8 spikes at the end of the graph, and Luttrell is a small blip at the end of the graph.

While my interest in Lane 8 has spiked recently, there still aren’t enough cumulative listens to put them over the threshold. Same for Luttrell. However, The Rapture has enough to put me over the threshold (likely due to the fact that I’ve been listening to them for over 10 years), so I should go to the concert! I’m going to see The Rapture in May, so I am gleefully obeying my eval statement!

On a more digressive note, it’s clear to me that this evaluation needs some refinement to actually reflect my true concert-going sentiments. Currently, the threshold averages all the listens for all artists that I’ve seen live. It doesn’t restrict that average to consider only the listens that occur before seeing an artist live, which might make it more accurate. That calculation would also be fairly complex, given that it would need to account for artists that I’ve seen multiple times.
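
Here is one way that refinement could look, as a hedged Python sketch rather than anything I’ve built. It assumes listens are available as (artist, date) pairs and concerts as a mapping of artist to concert dates, both hypothetical structures. It handles artists I’ve seen multiple times by counting only the listens before the first concert, which is just one possible choice.

```python
from datetime import date
from statistics import mean


def pre_concert_threshold(listens, concerts):
    """Average the number of listens that happened before first seeing each artist live.

    `listens` is a list of (artist, listen_date) pairs.
    `concerts` maps artist -> list of concert dates; only the earliest date is used.
    """
    counts = []
    for artist, concert_dates in concerts.items():
        first_seen = min(concert_dates)
        counts.append(
            sum(1 for name, listened_on in listens
                if name == artist and listened_on < first_seen)
        )
    return mean(counts)


# Placeholder data for illustration only.
listens = [
    ("Artist A", date(2015, 1, 1)),
    ("Artist A", date(2016, 6, 1)),
    ("Artist A", date(2019, 1, 1)),  # after the first concert, so not counted
    ("Artist B", date(2017, 3, 1)),
]
concerts = {
    "Artist A": [date(2017, 1, 1), date(2018, 5, 5)],
    "Artist B": [date(2018, 1, 1)],
}

print(pre_concert_threshold(listens, concerts))
```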

However, number of listens over time doesn’t alone reflect interest in going to a concert. It might be useful to also consider time spent listening, beyond count of listens for an artist. This is especially relevant when considering electronic music, or DJ sets, because I might only have 4 listen counts for an artist, but if that comprises 8 hours of DJ sets by that artist that I’ve listened to, that is a pretty strong signal that I would likely enjoy seeing that artist perform live.
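
To make that difference concrete, here is a small Python sketch that totals listening time per artist instead of counting listens. The data is made up, and it assumes each listen comes with a track duration in seconds, which is metadata I would still need to collect.

```python
from collections import defaultdict


def hours_listened(listens):
    """Total listening time per artist, in hours.

    `listens` is a list of (artist, duration_in_seconds) pairs.
    """
    totals = defaultdict(float)
    for artist, seconds in listens:
        totals[artist] += seconds
    return {artist: seconds / 3600 for artist, seconds in totals.items()}


# Placeholder data: 4 two-hour DJ sets versus 20 three-minute songs.
listens = [("Some DJ", 7200)] * 4 + [("Some Band", 180)] * 20

print(hours_listened(listens))
```

In this example the DJ has only 4 listens but 8 hours of listening time, while the band has 20 listens that add up to 1 hour.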

I thought that I’d need to get direct access to the MusicBrainz database in order to get metadata like that, but it turns out that the Last.fm API makes some available through their track.getInfo endpoint, so I just found a new project! In the meantime I am able to at least calculate duration for tracks that exist in my iTunes library.

I now have a new avenue to explore with this project, collecting that data and refining this calculation. Reach out on Twitter to let me know what you might consider adding to this calculation to craft a data-driven concert-going decision-making dashboard.

If you’re interested in this app, it is open sourced and available on Splunkbase. I’ll commit the new dashboard to the app repo soon!