Visualize the data: How missing data biases data-driven decisions

This is the fourth post in a series about how missing data can bias data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

Visualizing data is crucial to communicate the results of a data analysis process. Whether you use a chart, a table, a list of raw data, or a three-dimensional graph that you can interact with in virtual reality—your visualization choice can cause data to go missing. Any time you visualize the results of data analysis, you make intentional decisions about what to visualize and what not to visualize. How can you make sure that data that goes missing at this stage doesn’t bias data-driven decisions?

In this post, I’ll cover the following: 

  • How the usage of your data visualization can cause data to go missing
  • How data goes missing in data visualizations
  • Why accessibility matters for data visualizations
  • How a lack of labels and scale can mislead and misinform 
  • What to do about missing data in data visualizations

How people use the Georgia Department of Public Health COVID-19 daily report 

When creating a data visualization, it’s important to consider how it will be used. For example, the state of Georgia provides a Department of Public Health Daily COVID-19 reporting page to help communicate the relative case rate for each county in the state. 

In the midst of this global pandemic, I’m taking extra precautions before deciding to go hiking or climbing outside. Part of that risk calculation involves checking the relative case rate in my region — are cases going up, down, or staying the same? 

If you wanted to check that case rate in Georgia in July, you might struggle to make an unbiased decision about your safety because of the format of a data visualization in that report.

As Andisheh Nouraee illustrates in a now-deleted Twitter thread, the Georgia Department of Public Health on the COVID-19 Daily Status Report provided a heat map in July that visualized the number of cases across counties in Georgia in such a way that it effectively hid a 49% increase in cases across 15 days.

July 2nd heat map visualization of Georgia counties showing cases per 100K residents, with bins covering ranges from 1-620, 621-1070, 1071 - 1622, 1623 - 2960, with the red bins covering a range from 2961 - 4661.
Image from July 2nd, shared by Andisheh Nouraee, my screenshot of that image
July 17th heat map visualization showing cases per 100K residents in Georgia, with three counties colored red. Bins represent none, 1-949 cases, 950 - 1555 cases, 1556-2336 cases, 2337 - 3768 cases, and the red bins represent 3769-5165 cases.
Image from July 17th, shared by Andisheh Nouraee, my screenshot of that image

You might think that these visualizations aren’t missing data at all—the values of the gradient bins are clearly labeled, and the map clearly shows how many cases exist for every 100K residents.

However, the missing data isn’t in the visualization itself, but in how it’s used. This heat map is provided to help people understand the relative case rate. If I were checking this graph every week or so, I would probably think that the case rate has stayed the same over that time period. 

Instead, because the visualization uses auto-adjusting gradient bins, the red counties in the visualization from July 2nd cover a range from 2961 to 4661, while the same color counties on July 17th now have case rates of 3769–5165 cases per 100K residents. The relative size of the bins is different enough to where the bins can’t be compared with each other over time. 

As reported by Keren Landman for the Atlanta Magazine, the Department of Public Health didn’t have direct control over the data on the dashboard anyway, making it harder to make updates or communicate the data more intentionally.

Thankfully, the site now uses a visualization with a consistent gradient scale, rather than auto-adjusting bins.

Screenshot of heat map of counties with cases per 100K residents in Georgia with Union County highlighted showing the confirmed cases from the past 2 weeks and total, and other data points that are irrelevant to this post.

In this example, the combination of the visualization choice and the use of that visualization by the visitors of this website caused data to go missing and possibly resulting in biased decisions about whether it’s safe to go for a hike in the community. 

How does data go missing? 

This example from the Georgia Department of Health describes one way that data can go missing, but there are many more. 

Data can go missing from your visualization in a number of ways:

  • If the data exists, but is not represented in the visualization, data is missing
  • If data points and fluctuations are smoothed over, or connected across gaps, data is missing. 
  • If outliers and other values are excluded from the visualization, data is missing.
  • If people can’t see or interact with the visualization, data is missing.
  • If a limited number of results are being visualized, but the label and title of the visualization don’t make that clear, data is missing.

Accessible data visualizations prevent data from going missing 

Accessible visualizations are crucial for avoiding missing data because data can go missing if people can’t see or interact with it. 

Lisa Charlotte Rost wrote an excellent series for Data Wrapper’s blog about colorblindness and data visualizations that I highly recommend for considering color vision accessibility for data visualization: How your colorblind and colorweak readers see your colors, What to consider when visualizing data for colorblind readers, and What’s it like to be colorblind

You can also go further to consider how to make it easier for folks with low or no vision to interact with your data visualizations. Data visualization artist Mona Chalabi has been experimenting with ways to make her data visualization projects more accessible, including making a tactile version of a data visualization piece, and an interactive piece that uses touch and sound to communicate information, created in collaboration with sound artist Emmy the Great.

At a more basic level, consider how your visualizations look at high zoom levels and how they sound when read aloud by a screen reader. If a visualization is unintelligible at high zoom levels or if portions aren’t read aloud by a screen reader, those are cases where data has gone missing from your visualization. Any decisions that someone with low or no vision wants to make based on a data visualization is biased to include only the data visualizations that they can interact with successfully. 

Beyond vision considerations, you want to consider cognitive processing accessibility to prevent missing data. If you overload a visualization with lots of overlays, rely on legends to communicate meaning in your data, or have a lot of text in your visualization, folks with ADHD or dyslexia might struggle to process your visualization. 

Any data that people can’t understand in your visualization is missing data. For more, I recommend the blog post by Sarah L. Fossheim, An intro to designing accessible data visualizations.  

Map with caution and label prodigiously: Beirut explosion map

Data can go missing if you fail to visualize it clearly or correctly. When I found out about the explosion in Beirut, after I made sure that my friends and their family were safe, I wanted to better understand what had happened. 

Screenshot of a map of the beirut explosion with labels pointing to different overlapping circles, saying "blast site", "widespread destruction", "heavy damage", "damage reported" and "windows blown out up to 15 miles away".
Image shared by Joanna Merson, my screenshot of the image

I haven’t had the privilege to visit Beirut before, so the maps of the explosion radius weren’t as easy for me to personally relate to. Thankfully, people started sharing maps about what the same explosion might look like if it occurred in New York City or London.  

Screenshot of a google maps visualization with 3 overlapping circles centered over New York. No labels.
Image shared by Joanna Merson, my screenshot of the image

This map attempts to show the scale of the same explosion in New York City, but it’s missing a lot of data. I’m not an expert in map visualizations, but thankfully cartographer Joanna Merson tweeted a correction to this map and unpacked just how much data is missing from this visualization. 

There’s no labels on this map, so you don’t know the scale of the circles, or what distance each blast radius is supposed to represent. You don’t know what the epicenter of the blast is because it isn’t labeled, and perhaps most egregiously, the map projection used is incorrect. 

Joanna Merson created an alternate visualization, with all the missing data added back in. 

Screenshot of a map visualization by Joanna Merson made on August 5 2020, with a basemap from esri World Imagery using a scale 1:200,000 and the Azimuthal Equidistant projection. Map is centered over New York City with an epicenter of Manhattan labeled and circle radii of 1km, 5km, and 10km clearly labeled.
Image by Joanna Merson, my screenshot of the image.

Her visualization carefully labels the epicenter of the blast, as well as the radii of each of the circles that represent different effects from the blast. She’s also careful to share the map projection that she used—one that has the same distance for every point along that circle. It turns out that the projection used by Google Maps is not the right projection to show distance with an overlaid circle. Without the scale or an accurate projection in use, data goes missing (and gets added) as unaffected areas are misleadingly shown as affected by the blast. 

How many of you are guilty of making a geospatial visualization, but don’t know anything about map projections and how they might affect your visualization? 

Joanna Merson further points out in her thread on Twitter that maps like this with an overlaid radius to show distance can be inaccurate because they don’t take into account the effect of topography. Data goes missing because topography isn’t represented or considered by the visualization overlaid on the map. 

It’s impractical to model everything perfectly in every map visualization. Depending on how you’re using the map, this missing data might not actually matter. If you communicate what your visualization is intended to represent when you share it, you can convey the missing data and also assert its irrelevance to your point. All maps, after all, must make decisions about what data to include based on the usage of the map. Your map-based data visualizations are no different! 

It can be easy to cut corners and make a simple visualization to communicate the results of data analysis quickly. It can be tedious to add a scale, a legend, and labels to your visualization. But you must consider how your visualization might be used after you make it—and how it might be misused.

Will a visualization that you create end up in a blog post like this one, or a Twitter thread unpacking your mistakes? 

What can you do about missing data?

To prevent or mitigate missing data in a data visualization, you have several options. Nathan Yau of Flowing Data has a very complete guide for Visualizing Incomplete and Missing Data that I highly recommend in addition to the points that I’m sharing here. 

Visualize what’s missing

One important way to mitigate missing data in a data visualization is to devise a way to show the data that is there alongside the data that isn’t. Make the gaps apparent and visualize missing data, such as by avoiding connecting the dots between missing values in a line chart.

In cases where your data has gaps, you can add annotations or labels to acknowledge and explain any inconsistencies or perceived gaps in the data. In some cases, data can appear to be missing, but is actually a gap in the data due to seasonal fluctuations or other reasons. It’s important to thoroughly understand your data to identify the difference. 

If you visualize the gaps in your data, you have the opportunity to discuss what can be causing the gaps. Gaps in data can reflect reality, or flaws in your analysis process. Either way, visualizing the gaps in your data is just as valuable as visualizing the data that you do have. Don’t hide or ignore missing data.

Carefully consider time spans

Be intentional about the span that you choose for time-based visualizations. You can unintentionally hide fluctuations in the data if you choose an overly-broad span for your visualization, causing data to go missing by flattening it. 

If you choose an overly-short time span for your visualization, however, the meaning of the data and what you’re trying to communicate can go missing with all the noise of the individual data points. Consider what you’re trying to communicate with the data visualization, and choose a time span accordingly.

Write clearly 

Another way to address missing data is to write good labels and titles for visualizations. It’s crucial to explain exactly what is present in a visualization—an important component of communicating results. If you’re intentional and precise about your labels and titles, you can prevent data from going missing. 

If the data analysis contains the results for the top 10 cities by population density, but your title only says “Top Cities”, data has gone missing from your visualization!

You can test out the usefulness of your labels and titles by considering the following: If someone screenshots your visualization and puts it in a different presentation, or tweets it without the additional context that might be in the full report, how much data would be missing from the visualization? How completely does the visualization communicate the results of data analysis if it’s viewed out of context?

Validate your scale

Make sure any visualization that you create has a scale and that it’s included. It’s really easy for data to go missing if the scale of the data itself is missing. 

Also validate that the scale on your visualization is accurate and relevant. If you’re visualizing percentages, make sure the scale goes from 0-100. If you’re visualizing logarithmic data, make sure your scale reflects that correctly. 

Consider the use

Consider how your visualization will be used, and design your visualizations accordingly. What decisions are people trying to make based on your visualization? What questions are you trying to answer when you make it? 

Automatically-adjusting gradient bins in a heat map can be an excellent design choice, but as we saw in Georgia, they don’t make sense to communicate relative change over time. 

Choose the right chart for the data

It’s also important to choose the right chart to visualize your data. I’m not a visualization expert, so check out this data tutorial from Chartio, How to Choose the Right Data Visualization as well as these tutorials of different chart types on Flowing Data: Chart Types.  

I do want to recommend that if you’re visualizing multiple aggregations in one visualization in the Splunk platform, consider the Trellis layout to create different charts to help compare across the aggregates. 

Always try various types of visualizations for your data to determine which one shows the results of your analysis in the clearest way.

One of the best ways to make sure your data visualization isn’t missing data is to make sure that the data analysis is sound. The next post in this series addresses how data can go missing while you analyze it: Analyze the data: How missing data biases data-driven decisions.

Communicate the data: How missing data biases data-driven decisions

This is the third post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

Communicating the results of a data analysis process is crucial to making a data-driven decision. You might review results communicated to you in many ways:

  • A slide deck presented to you
  • An automatically-generated report emailed to you regularly
  • A white paper produced by an expert analysis firm that you review 
  • A dashboard full of curated visualizations
  • A marketing campaign

If you’re a data communicator, what you choose to communicate (and how you do so) can cause data to go missing and possibly bias data-driven decisions made as a result.

In this post, I’ll cover the following:

  • A marketing campaign that misrepresents the results of a data analysis process
  • A renowned white paper produced by leaders in the security space
  • How data goes missing at the communication stage
  • What you can do about missing data at the communication stage

Spotify Wrapped: Your “year” in music 

If you listen to music on Spotify, you might have joined in the hordes of people that viewed and shared their Spotify Wrapped playlists and images like these last year. 

Screenshot of Spotify Wrapped graphic showing "we spend some serious time together"

Spotify Wrapped is a marketing campaign that purports to communicate your year in music, based on the results of some behind-the-scenes data analysis that Spotify performs. While it is a marketing campaign, it’s still a way that data analysis results are being communicated to others, and thus relevant to this discussion of missing data in communications. 

Screenshot of Spotify Wrapped campaign showing "You were genre-fluid" Screenshot of Spotify Wrapped campaign showing that I discovered 1503 new artists.

In this case you can see that my top artists, top songs, minutes listened, and top genre are  shared with me, along with the number of artists that I discovered, and even the number of minutes that I spent listening to music over the last year. It’s impressive, and exciting to see my year in music summed up in such a way! 

But after I dug deeper into the data, the gaps in the communication and data analysis became apparent. What’s presented as my year in music is actually more like my “10 months in music”, because the data only represents the period from January 1st to October 31st of 2019. Two full months of music listening behavior are completely missing from my “year” in music. 

Unfortunately, these details about the dataset are missing from the Spotify Wrapped campaign report itself, so I had to search through additional resources to find out more information. According to a now-archived FAQ on Spotify for Artists (thank goodness for the Wayback Machine), the data represented in the campaign covers the dates from January 1st, 2019 – October 31st, 2019. I also read two key blog posts, Spotify Wrapped 2019 Reveals Your Streaming Trends, from 2010 to Now announcing the campaign and Unwrapping Wrapped 2019: Spotify VP of Engineering Tyson Singer Explains digging into the data analysis behind the campaign, to learn what I could about the size of the dataset, or how many data points might be necessary to draw the conclusions shared in the report. 

Screenshot of Spotify FAQ question "Why are my 2019 Artist Wrapped stats different from the stats I see in Spotify for Artists?"

Screenshot of Spotify FAQ answering the question "How do I get my 2019 Artist Wrapped?" with the first line of the answer being "If you have music on Spotify with at least 3 listeners before October 31st 2019, you get a 2019 Wrapped!"

This is a case where data is going missing from the communication because the data is presented as though it represents an entire time period, when in fact it only represents a subset of the relevant time period. Not only that, but it’s unclear what other data points actually represent. The number of minutes spent listening to music in those 10 months could be calculated by adding up the actual amount of time I spent listening to songs on the service, but could also be an approximate metric calculated from the number of streams of tracks in the service. It’s not possible to find out how this metric is being calculated. If it is an approximate metric based on the number of streams, that’s also a case of uncommunicated missing data (or a misleading metric), because according to the Spotify for Artists FAQ, Spotify counts a song as streamed if you’ve listened to at least 30 seconds of a track. 

It’s likely that other data is missing in this incomplete communication, but another data communication is a great example of how to do it right. 

Verizon DBIR: A star of data communication 

The Verizon Data Breach Investigations Report (DBIR) is an annual report put out by Verizon about, well, data breach investigations. Before the report shares any results, it includes a readable summary about the dataset used, how the analysis is performed, as well as what is missing from the data. Only after the limitations of the dataset and the resulting analysis is communicated, are the actual results of the analysis shared. 

Screenshot of Verizon DBIR results and analysis section of the 2020 DBIRAnd those results are well-presented, featuring confidence intervals, detailed titles and labels, as well as clear scales to visualize the data. I talk more about how to prevent data from going missing in visualizations in the next post of the series, Visualize the data: How missing data biases data-driven decisions.

Screenshot of 2 visualizations from the Verizon 2020 DBIR

How does data go missing?

These examples make it clear that data can easily go missing when you’re choosing how to communicate the results of data analysis. 

  • You include only some visualizations in the report — the prettiest ones, or the ones with “enough” data
  • You include different visualizations and metrics than the ones in the last report, without an explanation about why they changed, or what changed. For example, what happened in Spain in late May, 2020 as reported by the Financial Times: Flawed data casts cloud over Spain’s lockdown strategy.
  • You choose not to share or discuss the confidence intervals for your findings. 
  • You neglect to include details about the dataset, such as the size of the dataset or the representativeness of the dataset.
  • You discuss the available data as though it represents the entire problem, rather than a subset of the problem. 

For example, the Spotify Wrapped campaign shares the data that it makes available as though it represents an entire year, but instead only reflects 10 months, and doesn’t include any details about the dataset beyond what you can assume—it’s based on Spotify’s data. This missing data doesn’t make the conclusions drawn in the campaign inaccurate, but it is additional context that can affect how you interpret the findings and make decisions based on the data, such as which artist’s album you might buy yourself to celebrate the new year. 

What can you do about missing data?

To mitigate missing data in your communications, it’s vital to be precise and consistent. Communicate exactly what is and is not covered in your report. If it makes sense, you could even share the different levels of disaggregation used throughout the data analysis process that led to the results discussed in the final communication. 

Consistency in the visualizations that you choose to include, as well as the data spans covered in your report, can make it easier to identify when something has changed since the last report. Changes can be due to missing data, but can also be due to something else that might not be identified if the visualizations in the report can’t be compared with each other.

If you do change the format of the report, consider continuing to provide the old format alongside the new format. Alternately, highlight what is different from the last report, why you made changes, and clearly discuss whether comparisons can or cannot be made with older formats of the report to avoid unintentional errors. 

If you know data is missing, you can discuss it in the report, share why or why not the missing data matters, and possibly choose to communicate a timeline or a business case for addressing the missing data. For example, there might be a valid business reason why data is missing, or why some visualizations have changed from the previous report.

The 2020 edition of Spotify Wrapped will almost certainly include different types of data points due to the effects of the global pandemic on people’s listening habits. Adding context to why data is missing, or why the report has changed, can add confidence in the face of missing data—people now understand why data is missing or has changed. 

Often when you’re communicating data, you’re including detailed visualizations of the results of data analysis processes. The next post in this series covers how data can go missing from visualizations, and what to do about it: Visualize the data: How missing data biases data-driven decisions.

Decide with the data: How missing data biases data-driven decisions

This is the second post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

Any decision based solely on the results of data analysis is missing data—the non-quantitative kind. But data can also go missing from data-driven decisions as a result of the analysis process. Exhaustive data analysis and universal data collection might seem like the best way to prevent missing data, but it’s not realistic, feasible, or possible. So what can you do about the possible bias introduced by missing data? 

In this post, I’ll cover the following:

  • What to do if you must make a decision with missing data
  • How data goes missing at the decision stage
  • What to do about missing data at the decision stage
  • How much missing data matters
  • How much to care about missing data before making your decision

Missing data in decisions from the Oregon Health Authority

In the midst of a global pandemic, we’ve all been struggling to evaluate how safe it is to resume pre-pandemic activities like going to the gym, going out to bars, sending our kids back to school, or eating at restaurants. In the United States, state governors are the ones tasked with making decisions about what to reopen and what to keep closed, and most are taking a data-driven approach. 

In Oregon, the decisions they’re making about what to reopen and when are based on incomplete data. As reported by Erin Ross for Oregon Public Broadcasting, the Oregon Health Authority is not collecting or analyzing data about whether or not restaurants or bars are contributing to COVID-19 case rates.

The contact tracers interviewing people who’ve tested positive for SARS COV-2 are not asking, and even if the information is shared, the data isn’t being analyzed in a way that might allow officials to identify the effect of bars and restaurants on coronavirus case rates. Although this data is missing, officials are making decisions about whether or not bars and restaurants should remain open for indoor operations. 

Oregon and the Oregon Health Authority aren’t alone in this limitation. We’re in the midst of a pandemic, and everyone is doing the best they can with the limited resources and information that they have. Complete data isn’t always possible, especially when real-time data matters. So what can Oregon (and you) do to make sure that missing data doesn’t negatively affect the decision being made? 

If circumstances allow, it’s best to try to narrow the scope of your decision. Limit your decision to those groups and situations about which you have complete or representative data. If you can’t limit your decision, such as is the case with this pandemic, you can still make a decision with incomplete data. 

Acknowledge that your decision is based on limited data, identify the gaps in your knowledge, and make plans to address those gaps as soon as possible. You can address the missing data by collecting more data, analyzing your existing data differently, or by reexamining the relevance of various data points to the decision that you’re making. 

This is one key example of how missing data can affect a decision-making process. How else can data go missing when making decisions? 

How can data go missing?

Data-driven decisions are especially vulnerable to the effects of missing data. Because data-driven are based on the results of data analysis, the effects of missing data in the earlier data analysis stages are compounded. 

  • The reports that you reviewed before making your decision included the prettiest graphs instead of the most useful visualizations to help you make your decision. 
  • The visualizations in the report were different from the ones in the last report, making it difficult for you to compare the new results with the results in the previous report. 
  • The data being collected doesn’t include the necessary details for your decision. This is what is happening in Oregon, where the collected data doesn’t include all the details that are relevant when making decisions about what businesses and organizations to reopen. 
  • The data analysis that was performed doesn’t actually answer the question that you’re asking. If you need to know “how soon can we reopen indoor dining and bars despite the pandemic”, and the data analysis being performed can only tell you “what are the current infection rates, by county, based on the results of tests administered 5 days ago”, the decisions that you’re making might be based on incomplete data.

What can you do about missing data?

Identify if the missing data matters to your decision. If it does, you must acknowledge that the data is missing when you make your decision. If you don’t have data about a group, intentionally exclude that group from your conclusions or decision-making process. Constrain your decision according to what you know, and acknowledge the limitations of the analysis.

If you want to make broader decisions, you must address the missing data throughout the rest of the process! If you aren’t able to immediately collect missing data, you can attempt to supplement the data with a dedicated survey aimed at gathering that information before your decision. You can also investigate to find out if the data is already available in a different format or context—for example in Oregon, where the Health Authority might have some information about indoor restaurant and bar attendance in the content of contact tracing interviews and just aren’t analyzing it systematically. If the data is representative even if it isn’t comprehensive, you can still use it to supplement your decision. 

To make sure you’re making an informed decision, ask questions about the data analysis process that led to the results you’re reviewing. Discuss whether or not data could be missing from the results of the analysis presented to you, and why. Ask yourself: does the missing data affect the decision that I’m making? Does it affect the results of the analysis presented to me? Evaluate how much missing data matters to your decision-making process. 

You don’t always need more data

You will always be missing some data. It’s important to identify when the data that is missing is actually relevant to your analysis process, and when it won’t change the outcome. Acknowledge when additional data won’t change your conclusions. 

You don’t need all possible data in existence to support a decision. As Douglas Hubbard points out in his book How to Measure Anything, the goal of a data analysis process is to reduce your uncertainty about the right approach to take. 

If additional data, or more detailed analysis, won’t further reduce any uncertainty, then it’s likely unnecessary. The more clearly you constrain your decisions, and the questions you use to guide your data analysis, the more easily you can balance reducing data gaps and making a decision with the data and analysis results you have. 

USCIS doesn’t allow missing data, even if it doesn’t affect the decision

Sometimes, missing data doesn’t affect the decision that you’re making. This is why you must understand the decision you’re making, and how important comprehensive data is to the decision. If that’s true, you want to make sure your policies acknowledge that reality. 

In the case of the U.S. Citizenship and Immigration Services (USCIS), their policies don’t seem to recognize that some kinds of missing data for citizenship applications are irrelevant. 

In an Opinions column by Washington Post reporter Catherine Rampell, The Trump administration’s no-blanks policy is the latest Kafkaesque plan designed to curb immigration, she describes the “no blanks” policy applied to immigration applications, and now, to third-party documents included with the applications. 

“Last fall, U.S. Citizenship and Immigration Services introduced perhaps its most arbitrary, absurd modification yet to the immigration system: It began rejecting applications unless every single field was filled in, even those that obviously did not pertain to the applicant.

“Middle name” field left blank because the applicant does not have a middle name? Sorry, your application gets rejected. No apartment number because you live in a house? You’re rejected, too.

No address given for your parents because they’re dead? No siblings named because you’re an only child? No work history dates because you’re an 8-year-old kid?

All real cases, all rejected.”

In this example, missing data is deemed a problem for making a decision about the citizenship application for a person—even when the data that is missing is supposed to be missing because it doesn’t exist. When asked for comment,

“a USCIS spokesperson emailed, “Complete applications are necessary for our adjudicators to preserve the integrity of our immigration system and ensure they are able to confirm identities, as well as an applicant’s immigration and criminal history, to determine the applicant’s eligibility.””

Missing data alone is not enough to affect your decision—only missing data that affects the results of your decision. A lack of data is not itself a problem—the problem is when that is relevant to your decision is missing. That’s how bias gets introduced to a data-driven decision.

In the next post in this series, I’ll explore some ways that data can go missing when the results of data analysis are communicated: Communicate the data: How missing data biases data-driven decisions

What’s missing? Reduce bias by addressing data gaps in your analysis process

We live in an uncertain world. Facing an ongoing global pandemic, worsening climate change, persistent threats to human rights, and the more mundane uncertainties of our day-to-day lives, we try to use data to make sense of it all. Relying on data to guide decisions can feel safe. 

But you might not be able to trust your data-driven decisions if data is missing from your data analysis process. If you can identify and address gaps in your data analysis process, you can reduce the bias introduced by missing data in your data-driven decisions, regaining your confidence and certainty while ensuring you limit possible harm. 

This is post 1 of 8 in a series about how missing data can negatively affect a data analysis process. 

In this post, I’ll cover:

  • What is missing data? 
  • Why missing data matters
  • What’s missing from all data-driven decisions
  • The stages of a data analysis process

I hope this series inspires you and prepares you to take action to address bias introduced by missing data in your own data analysis processes. At the very least, I hope you gain a new perspective when evaluating your data-driven decisions, the success of data analysis processes, and how you frame a data analysis process from the start.

What is missing data?

Data can go missing in many ways. If you’re not collecting data, or don’t have access to some kinds of data, or if you can’t use existing data for a specific data analysis process—that data is missing from your analysis process. 

Other data might not be accessible to you for other reasons. Throughout this series, I’ll use the term “missing data” to refer to both data that does not exist, data that you do not have access to, and data that is obscured by your analysis process—effectively missing, even if not literally gone. 

Why missing data matters

Missing data matters because it can easily introduce bias into the results of a data analysis process. Biased data analysis is often framed in the context of machine learning models, training datasets, or inscrutable and biased algorithms leading to biased decisions. 

But you can draw biased and inaccurate conclusions from any data analysis process, regardless of whether machine learning or artificial intelligence is involved. As Meg Miller makes clear in her essay Finding the Blank Spots in Data for Eye on Design, “Artists and designers are working to address a major problem for marginalized communities in the data economy: “If the data does not exist, you do not exist.””. And that’s just one part of why missing data matters. 

You can identify the possible biases in your decisions if you can identify the gaps in your data and data analysis process. And if you can recognize those possible biases, you can do something to mitigate them. But first we need to acknowledge what’s missing from every data-driven decision. 

What’s missing from all data-driven decisions?

It feels safe to make a data-driven decision. You’ve performed a data analysis process and have a list of results matched up with objectives that you want to achieve. It’s easy to equate data with neutral facts. But we can’t actually use data for every decision, and we can’t rely only on data for a decision-making process. Data can’t capture the entirety of an experience—it’s inherently incomplete.

Data only represents what can be quantified. Howard Zinn writes about the incompleteness of data in representing the horrors of slavery in A People’s History of the United States:

“Economists or cliometricians (statistical historians) have tried to assess slavery by estimating how much money was spent on slaves for food and medical care. But can this describe the reality of slavery as it as to a human being who lived in side it? Are the conditions of slavery as important as the existence of slavery?” 

“But can statistics record what it meant for families to be torn apart, when a master, for profit, sold a husband or a wife, a son or a daughter?” 

(pg 172, emphasis original). 

Statistical historians and others can attempt to quantify the effects of slavery based on the records available to them, but parts of that story can never be quantified. The parts that can’t be quantified must be told, and must be considered when creating a historical record and of course, in deciding whose story gets told and how.

What data is available, and from whom, represents an implicit value and power structure in society as well. If data has been collected about something, and made available to others, then that information must be important—whether to an organization, a society, a government, or a world—and the keepers of the data had the privilege and the power to maintain it and make it available after it was collected. 

This power structure, this value structure, and the limitations of data alone when making decisions are crucial to consider in this era of seemingly-objective data-driven decisions. Because data alone isn’t enough to capture the reality of a situation, it isn’t enough to drive the decisions you make in our uncertain world. And that’s only the beginning of how missing data can affect decisions.

Data can go missing at any stage of the data analysis process

It’s easy to consider missing data as solely a data collection problem—if the dataset existed, or new data was collected, no data would be missing and so we can make better data-driven decisions. In fact, avoiding missing data when you’re collecting it is just one way to reduce bias in your data-driven decisions—it’s far from the only way.

Data can go missing at any stage of the data analysis process and bias your resulting decisions. Each post in this series addresses a different stage of the process. 

  1. 🗣 Make a decision based on the results of the data analysis. Decide with the data: How missing data biases data-driven decisions.
  2. 📋 Communicate the results of the data analysis. Communicate the data: How missing data biases data-driven decisions
  3. 📊 Visualize the data to represent the answers to your questions. Visualize the data: How missing data biases data-driven decisions.
  4. 🔎 Analyze the data to answer your questions. Analyze the data: How missing data biases data-driven decisions.
  5. 🗂 Manage the data that you’ve collected to make it easier to analyze. Manage the data: How missing data biases data-driven decisions
  6. 🗄 Collect the data you need to answer the questions you’ve defined. Collect the data: How missing data biases data-driven decisions.
  7. 🙋🏻‍♀️ Define the question that you want to ask the data. Define the question: How missing data biases data-driven decisions

In each post, I’ll discuss real world examples of how data can go missing, and what you can do about it!

Problems with Indexing Datasets like Web Pages

Google has created a dataset search for researchers or the average person looking for datasets. On the one hand, this is a cool idea. Datasets are hard to find in cases, and this ostensibly makes the datasets and accompanying research easier to find.
In my opinion this dataset search is problematic for two main reasons.

1. Positioning Google as a one-stop-shop for research is risky.

There’s consistent evidence that many people (especially college students who don’t work with their library) start and end their research with Google, rather than using scholarly databases, limiting the potential quality of their research. (There’s also something to be said here about the limiting of access to quality research behind exploitative and exclusionary paywalls, but that’s for another discussion).
Google’s business goal of being the first and last stop for information hunts makes sense for them as a company. But such a goal doesn’t necessarily improve academic research, or the knowledge that people derive based on information returned from search results.

2. Datasets without datasheets easily lead to bias.

The dataset search is clearly focused on indexing and making more available as many datasets as possible. The cost of that is continuing sloppy data analysis and research due to the lack of standardized Datasheets for Datasets (for example) that fully expose the contents and limitations of datasets.
The existing information about these datasets is constructed based on the schema defined by the dataset author, or perhaps more specifically, the site hosting the dataset. It’s encouraging that datasets have dates associated with them, but I’m curious where the description for the datasets are coming from.
Only the description and the name fields for the dataset are required before a dataset appears in the search. As such, the dataset search has limitations. Is the description for a given dataset any higher quality than the Knowledge Panels that show up in some Google search results? How can we as users independently validate the accuracy of the dataset schema information?
The quality of and details provided in the description field vary widely across various datasets (I did a cursory scan of datasets resulting from a keyword search for “cheese”) indicating that having a plain text required field doesn’t do much to assure quality and valuable information.
When datasets are easier to find, that can lead to better data insights for data analysts. However, it can just as easily lead to off-base analyses if someone misuses data that they found based on a keyword search, either intentionally or, more likely, because they don’t fully understand the limitations of a dataset.
Some vital limitations to understand when selecting one for use in data analysis are things like:
  • What does the data cover?
  • Who collected the data?
  • For what purpose was the data collected?
  • What features exist in the data?
  • Which fields were collected and which were derived?
  • If fields were derived, how were they derived?
  • What assumptions were made when collecting the data?

Without these valuable limitations being made as visible as the datasets themselves, I struggle to feel overly encouraged by this dataset search in its current form.

Ultimately, making information more easily accessible while removing or obscuring indicators that can help researchers assess the quality of the information is risky and creates new burdens for researchers.

Unbiased data analysis with the data-to-everything platform: unpacking the Splunk rebrand in an era of ethical data concerns

Splunk software provides powerful data collection, analysis, and reporting functionality. The new slogan, “data is for doing”, alongside taglines like “the data-to-everything platform” and “turn data into answers” want to bring the company to the forefront of data powerhouses, where it rightly belongs (I’m biased, I work for Splunk).

There is nuance in those phrases that can’t be adequately expressed in marketing materials, but that are crucial for doing ethical and unbiased data analysis, helping you find ultimately better answers with your data and do even better things with it.

Start with the question

If you start attempting to analyze data without an understanding of a question you’re trying to answer, you’re going to have a bad time. This is something I really appreciate about moving away from the slogan “listen to your data” (even though I love a good music pun). Listening to your data implies that you should start with the data, when in fact you should start with what you want to know and why you want to know it. You start with a question.

Data analysis starts with a question, and because I’m me, I want to answer a fairly complex question: what kind of music do I like to listen to? This overall question, also called an objective function in data science, can direct my data analysis. But first, I want to evaluate my question. If I’m going to turn my data into doing, I want to consider the ethics and the bias of my question.

Consider what you want to know, and why you want to know it so that you can consider the ethics of the question. 

  • Is this question ethical to ask? 
  • Is it ethical to use data to answer it? 
  • Could you ask a different question that would be more ethical and still help you find useful, actionable answers? 
  • Does my question contain inherent bias? 
  • How might the biases in my question affect the results of my data analysis? 

Questions like “How can we identify fans of this artist so that we can charge them more money for tickets?” or “What’s the highest fee that we can add to tickets where people will still buy the tickets?” could be good for business, or help increase profits, but they’re unethical. You’d be using data to take actions that are unfair, unequal, and unethical. Just because Splunk software can help you bring data to everything doesn’t mean that you should. 

Break down the question into answerable pieces

If my question is something that I’ve considered ethical to use data to help answer, then it’s time to consider how I’ll perform my data analysis. I want to be sure I consider the following about my question, before I try to answer it:

  • Is this question small enough to answer with data?
  • What data do I need to help me answer this question?
  • How much data do I need to help me answer this question?

I can turn data into answers, but I have to be careful about the answers that I look for. If I don’t consider the small questions that make up the big question, I might end up with biased answers. (For more on this, see my .conf17 talk with Celeste Tretto).

So if I consider “What kind of music do I like to listen to?”, I might recognize right away that the question is too broad. There are many things that could change the answer to that question. I’ll want to consider how my subjective preferences (what I like listening to) might change depending on what I’m doing at the time: commuting, working out, writing technical documentation, or hanging out on the couch. I need to break the question down further. 

A list of questions that might help me answer my overall question could be: 

  • What music do I listen to while I’m working? When am I usually working?
  • What music do I listen to while I’m commuting? When am I usually commuting?
  • What music do I listen to when I’m relaxing? When am I usually relaxing?
  • What are some characteristics of the music that I listen to?
  • What music do I listen to more frequently than other music?
  • What music have I purchased or added to a library? 
  • What information about my music taste isn’t captured in data?
  • Do I like all the music that I listen to?

As I’m breaking down the larger question of “What kind of music do I like to listen to?”, the most important question I can ask is “What kind of music do I think I like to listen to?”. This question matters because data analysis isn’t as simple as turning data into answers. That can make for catchy marketing, but the nuance here lies in using the data you have to reduce uncertainty about what you think the answer might be. The book How to Measure Anything by Douglas Hubbard covers this concept of data analysis as uncertainty reduction in great detail, but essentially the crux is that for a sufficiently valuable and complex question, there is no single objective answer (or else we would’ve found it already!). 

So I must consider, right at the start, what I think the answer (or answers) to my overall question might be. Since I want to know what kind of music I like, I therefore want to ask myself what kind of music I think I might like. Because “liking” and “kind of music” are subjective characteristics, there can be no single true answer that is objective truth. Very few, if any, complex questions have objectively true answers, especially those that can be found in data. 

So I can’t turn data into answers for my overall question, “What kind of music do I like?” but I can turn it into answers for more simple questions that are rooted in fact. The questions I listed earlier are much easier to answer with data, with relative certainty, because I broke up the complex, somewhat subjective question into many objective questions. 

Consider the data you have

After you have your questions, look for the answers! Consider the data that you have, and whether or not it is sufficient and appropriate to answer the questions. 

The flexibility of Splunk software means that you don’t have to consider the questions you’ll ask of the data before you ingest it. Structured or unstructured, you can ask questions of your data, but you might have to work harder to fully understand the context of the data to accurately interpret it. 

Before you analyze and interpret the data, you’ll want to gather context about the data, like:

  • Is the dataset complete? If not, what data is missing?
  • Is the data correct? If not, in what ways could it be biased or inaccurate?
  • Is the data similar to other datasets you’re using? If not, how is it different?

This additional metadata (data about your datasets) can provide crucial context necessary to accurately analyze and interpret data in an unbiased way. For example, if I know there is data missing in my analysis, I need to consider how to account for that missing data. I can add additional (relevant and useful) data, or I can acknowledge how the missing data might or might not affect the answers I get.

After gathering context about your datasets, you’ll also want to consider if the data is appropriate to answer the question(s) that you want to answer. 

In my case, I’ll want to assess the following aspects of the datasets: 

  • Is using the audio features API data from Spotify the best way to identify characteristics in music I listen to? 
  • Could another dataset be better? 
  • Should I make my own dataset? 
  • Does the data available to me align with what matters for my data analysis? 

You can see a small way that the journalist Matt Daniels of The Pudding considered the data relevant to answer the question “How popular is male falsetto?” for the Vox YouTube series Earworm starting at 1:45 in this clip. For about 90 seconds, Matt and the host of the show, Estelle Caswell, discuss the process of selecting the right data to answer their question, including discussing the size of the dataset (eventually choosing a smaller, but more relevant, dataset) to answer their question. 

Is more data always better? 

Data is valuable when it’s in context and applied with consideration for the problem that I’m trying to solve. Collecting data about my schedule may seem overly-intrusive or irrelevant, but if it’s applied to a broader question of “what kind of music do I like to listen to?” it can add valuable insights and possibly shift the possible overall answer, because I’ve applied that additional data with consideration for the question that I’m trying to answer.

Splunk published a white paper to accompany the rebranding, and it contains some excellent points. One of them that I want to explore further is the question:

“how complete, how smart, are these decisions if you’re ignoring vast swaths of your data?” 

On the one hand, having more data available can be valuable. I am able to get a more valuable answer to “what kind of music do I like” because I’m able to consider additional, seemingly irrelevant data about how I spend my time while I’m listening to music. However, there are many times when you want to ignore vast swaths of your data. 

The most important aspect to consider when adding data to your analysis is not quantity, but quality. Rather than focusing on how much data you might be ignoring, I’d suggest instead focusing on which data you might be ignoring, for which questions, and affecting which answers. You might have a lot of ignored data, but put your focus on the small amount of data that can make a big difference in the answers you find in the data.

As the academics in “I got more data, my model is more refined, but my estimator is getting worse! Am I just dumb?” make clear with their crucial finding:

“More data lead to better conclusions only when we know how to take advantage of their information. In other words, size does matter, but only if it is used appropriately.”

The most important aspect of adding data to an analysis is exactly as the academics point out: it’s only more helpful if you know what to do with it. If you aren’t sure how to use additional data you have access to, it can distract you from what you’re trying to answer, or even make it harder to find useful answers because of the scale of the data you’re attempting to analyze. 

Douglas Hubbard in the book How to Measure Anything makes the case that doing data analysis is not about gathering the most data possible to produce the best answer possible. Instead, it’s about measuring to reduce uncertainty in the possible answers and measuring only what you need to know to make a better decision (based on the results of your data analysis). As a result, such a focused analysis often doesn’t require large amounts of data — rough calculations and small samples of data are often enough. More data might lead to greater precision in your answer, but it’s a tradeoff between time, effort, cost, and precision. (I also blogged about the high-level concepts in the book).

If I want to answer my question “What kind of music do I like to listen to?” I don’t need the listening data of every user on the Last.fm service, nor do I need metadata for songs I’ve never heard to help me identify song characteristics I might like. Because I want to answer a specific question, it’s important that I identify the specific data that I need to answer it—restricted by affected user, existence in another dataset, time range, type, or whatever else.

If you want more evidence, the notion that more data is always better is also neatly upended by the Nielsen-Norman Group in Why You Only Need to Test with 5 Users and the follow-up How Many Test Users in a Usability Study?.

Keep context alongside the data

Indeed, the white paper talks about bringing people to a world where they can take action without worrying about where their data is, or where it comes from. But it’s important to still consider where the data comes from, even if you aren’t having to worry about it because you use Splunk software. It’s relevant to data analysis to keep context about the data alongside the data.

For example, it’s important for me to keep track of the fact that the song characteristics I might use to identify the type of music I like come from a dataset crafted by Spotify, or that my listening behavior is tracked by the service Last.fm. Last.fm can only track certain types of listening behavior on certain devices, and Spotify has their own biases in creating a set of audio characteristics.

If I lose track of this seemingly-mundane context when analyzing my data, I can potentially incorrectly interpret my data and/or draw inaccurate conclusions about what kind of music I like to listen to, based purely on the limitations of the data available to me. If I don’t know where my data is coming from, or what it represents, then it’s easy to find biased answers to questions, even though I’m using data to answer them.

If you have more data than you need, this also makes keeping context close to your data more difficult. The more data, the more room for error when trying to track contextual meaning. Splunk software includes metadata fields for data that can help you keep some context with the data, such as where it came from, but other types of context you’d need to track yourself.

More data can not only complicate your analysis, but it can also create security and privacy concerns if you keep a lot of data around and for longer than you need it. If I want to know what kind of music I like to listen to, I might be comfortable doing data analysis to answer that question, identifying the characteristics of music that I like, and then removing all of the raw data that led me to that conclusion out of privacy or security concerns. Or I could drop the metadata for all songs that I’ve ever listened to, and keep only the metadata for some songs. I’d want to consider, again, how much data I really need to keep around. 

Turn data into answers—mostly

So I’ve broken down my overall question into smaller, more answerable questions, I’ve considered the data I have, and I’ve kept the context alongside the data I have. Now I can finally turn it into answers, just like I was promised!

It turns out I can take a corpus of my personal listening data and combine it with a dataset of my personal music libraries to weight the songs in the listening dataset. I can also assess the frequency of listens to further weight the songs in my analysis and formulate a ranking of songs in order of how much I like them. I’d probably also want to split that ranking by what I was doing while I was listening to the music, to eliminate outliers from the dataset that might bias the results. All the small questions that feed into the overall question are coming to life.

After I have that ranking, I could use additional metadata from another source, such as the Spotify audio features API, to identify the characteristics of the top-ranked songs, and ostensibly then be able to answer my overall question: what kind of music do I like to listen to?

By following all these steps, I turned my data into answers! And now I can turn my data into doing, by taking action on those characteristics. I can of course seek out new music based on those characteristics, but I can also book the ideal DJs for my birthday party, create or join a community of music lovers with similar taste in music, or even delete any music from my library that doesn’t match those characteristics. Maybe the only action I would take is self-reflection, and see if what the data has “told” me is in line with what I think is true about myself.

It is possible to turn data into answers, and turn data into doing, with caution and attention to all the ways that bias can be introduced into the data analysis process. But there’s still one more way that data analysis could result in biased outcomes: communicating results. 

Carefully communicate data findings

After I find the answers in my data, I need to carefully communicate them to avoid bias. If I want to tell all my friends that I figured out what kind of music I like to listen to, I want to make sure that I’m telling them that carefully so that they can take the appropriate and ethical action in response to what I tell them. 

I’ll want to present the answers in context. I need to describe the findings with the relevant qualifiers: I like music with these specific characteristics, and when I say I like this music I mean this is the kind of music that I listen to while doing things I enjoy, like working out, writing, or sitting on my couch. 

I also need to make clear what kind of action might be appropriate or ethical to take in reaction to this information. Maybe I want to find more music that has these characteristics, or I’d like to expand my taste, or I want to see some live shows and DJ sets that would feature music that has these characteristics. Actions that support those ends would be appropriate, but can also risk being unethical. What if someone learns of these characteristics, and chooses to then charge me more money than other people (whose taste in music is unknown) to see specific DJ sets or concerts featuring music with those characteristics? 

Data, per the white paper, “must be brought not only to every action and decision, but to every department.” Because of that, it’s important to consider how that happens. Share relevant parts of the process that led to the answers you found from the data. Communicate the results in a way that can be easily understood by your audience. This Medium post by Cecelia Shao, a product manager at Comet.ml, covers important points about how to communicate the results of data analysis. 

Use data for good

I wanted to talk through the data analysis process in the context of the rebranded slogans and marketing content so that I could unpack additional nuance that marketing content can’t convey. I know how easy it is to introduce bias into data analysis, and how easily data analysis can be applied to unethical questions, or used to take unethical actions.

As the white paper aptly points out, the value of data is not merely in having it, but in how you use it to create positive outcomes. You need to be sure you’re using data safely and intelligently, because with great access to data comes great responsibility. 

Go forth and use the data-to-everything platform to turn data into doing…the right thing. 

Disclosure: I work for Splunk. Thanks to my colleagues Chris Gales, Erica Chen, and Richard Brewer-Hay for the feedback on drafts of this post. While colleagues reviewed this post and provided feedback, the content is my own and represents my own views rather than those of Splunk the company.