We live in an uncertain world. Facing an ongoing global pandemic, worsening climate change, persistent threats to human rights, and the more mundane uncertainties of our day-to-day lives, we try to use data to make sense of it all. Relying on data to guide decisions can feel safe.
But you might not be able to trust your data-driven decisions if data is missing from your analysis. If you can identify and address gaps in your data analysis process, you can reduce the bias introduced by missing data, regaining your confidence and certainty while limiting possible harm.
This is post 1 of 8 in a series about how missing data can negatively affect a data analysis process.
In this post, I’ll cover:
What is missing data?
Why missing data matters
What’s missing from all data-driven decisions
The stages of a data analysis process
I hope this series inspires you and prepares you to take action to address bias introduced by missing data in your own data analysis processes. At the very least, I hope you gain a new perspective when evaluating your data-driven decisions, the success of data analysis processes, and how you frame a data analysis process from the start.
What is missing data?
Data can go missing in many ways. If you’re not collecting data, or don’t have access to some kinds of data, or if you can’t use existing data for a specific data analysis process—that data is missing from your analysis process.
Other data might not be accessible to you for other reasons. Throughout this series, I’ll use the term “missing data” to refer to data that does not exist, data that you do not have access to, and data that is obscured by your analysis process—effectively missing, even if not literally gone.
Why missing data matters
Missing data matters because it can easily introduce bias into the results of a data analysis process. Biased data analysis is often framed in the context of machine learning models, training datasets, or inscrutable and biased algorithms leading to biased decisions.
But you can draw biased and inaccurate conclusions from any data analysis process, regardless of whether machine learning or artificial intelligence is involved. As Meg Miller makes clear in her essay Finding the Blank Spots in Data for Eye on Design, artists and designers are working to address a major problem for marginalized communities in the data economy: “If the data does not exist, you do not exist.” And that’s just one part of why missing data matters.
You can identify the possible biases in your decisions if you can identify the gaps in your data and data analysis process. And if you can recognize those possible biases, you can do something to mitigate them. But first we need to acknowledge what’s missing from every data-driven decision.
What’s missing from all data-driven decisions?
It feels safe to make a data-driven decision. You’ve performed a data analysis process and have a list of results matched up with objectives that you want to achieve. It’s easy to equate data with neutral facts. But we can’t actually use data for every decision, and we can’t rely only on data for a decision-making process. Data can’t capture the entirety of an experience—it’s inherently incomplete.
Data only represents what can be quantified. Howard Zinn writes about the incompleteness of data in representing the horrors of slavery in A People’s History of the United States:
“Economists or cliometricians (statistical historians) have tried to assess slavery by estimating how much money was spent on slaves for food and medical care. But can this describe the reality of slavery as it was to a human being who lived inside it? Are the conditions of slavery as important as the existence of slavery?”
“But can statistics record what it meant for families to be torn apart, when a master, for profit, sold a husband or a wife, a son or a daughter?”
(p. 172, emphasis in original).
Statistical historians and others can attempt to quantify the effects of slavery based on the records available to them, but parts of that story can never be quantified. The parts that can’t be quantified must be told, and must be considered when creating a historical record and, of course, when deciding whose story gets told and how.
What data is available, and from whom, represents an implicit value and power structure in society as well. If data has been collected about something, and made available to others, then that information must be important—whether to an organization, a society, a government, or a world—and the keepers of the data had the privilege and the power to maintain it and make it available after it was collected.
This power structure, this value structure, and the limitations of data alone when making decisions are crucial to consider in this era of seemingly-objective data-driven decisions. Because data alone isn’t enough to capture the reality of a situation, it isn’t enough to drive the decisions you make in our uncertain world. And that’s only the beginning of how missing data can affect decisions.
Data can go missing at any stage of the data analysis process
It’s easy to consider missing data as solely a data collection problem—if the dataset existed, or new data was collected, no data would be missing and so we can make better data-driven decisions. In fact, avoiding missing data when you’re collecting it is just one way to reduce bias in your data-driven decisions—it’s far from the only way.
Data can go missing at any stage of the data analysis process and bias your resulting decisions. Each post in this series addresses a different stage of the process.
As a professional writer, I frequently get asked, “as a ______, how can I get better at writing?” I’ve never had a good list of resources to point people to, so I finally decided to write one. I’ve worked hard to become a good writer, and I’ve had the privilege of many good teachers along the way.
If you’re not really sure why your writing isn’t as good as you want it to be, that’s okay. In this blog post, I’ve identified the strategies that I use to write well. I hope they’re useful to you.
Where to start
Read and write more frequently. You can’t get better without good examples or practice. If you want to get better at writing, you need to read more and you need to write more.
Identify what you’re trying to improve. Maybe you struggle with grammar, or in clearly communicating your ideas. Maybe it takes too many words for you to get your point across, or you can’t quite connect with the people reading your writing.
Write accurate content by improving your grammar and word choice
Use a tool like Grammarly, or enable grammar checking in whatever tool you use to write, if it’s available. If you don’t want a mysterious AI reading your writing, you can use other resources to improve specific aspects of your grammar.
Review easily-confused words, often homonyms, to identify words that you might be consistently misusing. This Homonyms Main List, from a Russian website for English learners, is fairly exhaustive.
Look words up in the dictionary. If you’re not sure you’ve used a word correctly, or used the correct spelling of a word, look it up. I do this almost every day. I’m partial to the Merriam-Webster dictionary.
I still struggle with the following (more pedantic) grammar rules:
When do I need to use a hyphen to connect two words? See Hyphen Use, on the Purdue Online Writing Lab website.
Did I split an infinitive? What is a split infinitive, anyway? See Infinitives, on the Purdue Online Writing Lab website.
Does my relative pronoun actually clearly refer to something or do I have a vague “that” or “it”? See Pronouns in the Splunk Style Guide.
Write helpful content by defining outcomes before you start
Before you start writing something, whether it’s a slide deck, an engineering-requirements document, an email, or a blog post like this one, consider what you want someone to do after reading what you wrote.
In instructional design, these are often called learning objectives or learning outcomes. Defining outcomes can help you write something useful and focused. Sometimes when you’re writing, other extraneous ideas come to mind. They can be valuable ideas, but if they distract from your defined outcomes, you might want to remove them from your main content.
Some example outcomes are:
After reading this blog post, you can confidently draft a clear document with defined outcomes.
After reading this engineering requirements document, my colleague can provide accurate and helpful architecture feedback on the design.
After reading the release notes, I can convince my boss that the new features are worth an immediate upgrade.
I also want to note that if you write an outcome focused on someone understanding something, rewrite it. It’s tough to measure understanding. It’s easier to measure action. For that reason, I try to write outcomes with action-oriented verbs. For more about writing good learning objectives, see the Learning Objectives chapter in The Product is Docs.
Write focused content by identifying your audience
Who will be reading your writing? What do they know? Who are they? What assumptions can you make about them?
If you can’t answer these questions about the people reading your writing, you won’t be able to clearly communicate your ideas to them. You don’t have to be able to answer these questions with 100% certainty, but make the attempt.
If you recognize that you’re writing something for multiple audiences, consider breaking up the content into specific sections for each audience. For example, architects might care about different content than a UI engineer, and a product manager might care about different details than a backend engineer.
If you identify the different needs of your varying audiences, you can write more consistently for each specific audience, rather than trying to address all of them all the time. For more on identifying your audience, see the Audience chapter of The Product is Docs.
Write findable content by considering how people get to it
How people get to your content can influence how you write it. If people use search, an intranet, or direct links to find your content, you might make different decisions about how to structure it.
I always assume that people are finding my content by searching the web. They’ve typed a specific search query, found my content as a result, and opened it hoping that it’s the right content for them.
Consider what people are searching for that can be answered by your content, and write a title accordingly. Spend time on the first few sentences of your content to make sure that they further clarify what your content addresses.
For example, I titled this blog post “How can I get better at writing?” because I expect that’s what a lot of people might type into their preferred search engine out of desperation. I could call it “7 quick tips to improve your writing”, but that’s not how most people type search queries (in my opinion).
Mark Baker’s book, Every Page is Page One, covers a lot of information related to this concept. He uses the term “information scent” to describe the signals that indicate to a person that they’ve found the right content to answer their question, and “information foraging” to describe the process of looking for the right information.
Write readable content by considering the structure
People aren’t excited to read technical content or technical documentation. No one rejoices when they get an email. I get paid to write technical documentation and I still avoid reading it if I can. Because people don’t want to read your content, structure it intentionally.
Write for skimming. Bullet points are often better than paragraphs. Tables are often better than paragraphs.
Put information where it needs to be. If you’re writing a series of steps, make sure the steps are actually in the right order. For example, if something needs to be done before all the steps can succeed, put it before the set of steps as a prerequisite.
You also want to consider the desired outcomes of your content and your audience when you structure your content. It can make sense to focus on one audience in one piece of content, or one desired outcome in one piece of content. Don’t try to do too much in one piece of writing.
Nielsen Norman Group has an incredible set of research and recommendations about how people read and how you can structure your content. I recommend the following articles:
Write clear content by intentionally choosing your words
You want to make your content easy to find and easy to understand. To do this, you need to be consistent and intentional about the words that you use.
Use consistent terminology. This isn’t the time to write beautiful prose that uses different words to mean the same thing. Don’t overload terms by using the same term for multiple things, and don’t use multiple terms to refer to one thing. Use the same terms and use them consistently.
If something is a JSON object, call it that. Don’t call it a JSON object sometimes, a JSON setting at other times, and a JSON blob at still others. Pick one term and use it consistently. You might have to pick an imperfect term and live with it. It happens! There are only so many words to choose from.
Be intentional about the words you use. Consider the words that your readers use to describe what you’re writing about, and use the same words if you can. Even if those words don’t match up completely with the feature names in use by your product.
If all of your software’s users refer to “dark mode” instead of “dark theme”, you might need to use both terms in your content so that people can find it. For some internal documentation, you might need to make a mapping of internal names that people use for something with the external names used in the product.
If you’re not sure what term to use, find out what terms your readers are already using. If you have access to search query logs of your website search, review those for patterns. If you don’t already have readers or users for your product, you can do some competitive analysis to understand what terms are in common usage in the market.
Write trustworthy content by thinking about the future
Errors in content, especially technical documentation, lead to mistrust. When you write a piece of content, consider the future of the content.
The future of the content depends on the purpose and type of content that you’re writing. This list contains some common expectations that readers might have about various content types:
A blog post has a date stamp and isn’t kept continually updated.
Technical documentation always matches the product version that it references.
Architecture documents reflect the current state of the microservice architecture.
An email gets the point across and can’t be edited after you send it.
You must consider the future and maintenance of any content that you write if your readers expect it to be kept up-to-date. To figure out how difficult maintaining your content will be, you can ask yourself these questions:
How frequently does the thing I’m writing about change?
How reliable does my content need to be?
How quickly does my content need to be accurate (e.g., after a product release)?
By answering these questions, you can then make decisions about how you write your content.
What level of detail will you include in your content?
Will you focus your efforts on accuracy, speed, or content coverage?
Do you want to include high-fidelity screenshots, gifs, or complex diagrams?
Do you want to automate any part of your content creation?
Who will review your content? How quickly and thoroughly will they review it?
I hope that after reading this blog post you feel empowered to write more accurate, helpful, focused, findable, readable, clear, trustworthy content. This is an overview of strategies. If you want to dig deeper into a specific way to improve your writing, check out the books and articles linked throughout this post.
If you have something you think I missed, you can find me on Twitter @smorewithface.
My top 5 artists are nearly the same, but much more influenced by music that I’ve purchased. The overall list instead looks like:
For the second year in a row, Tourist is my top artist! Kidnap still makes it into the top 10, as my 7th most-listened-to artist so far of 2020.
Disclosure, somewhat hilariously, doesn’t even break the top 10 artists if I rely on Last.fm data instead of only Spotify. What’s going on there? Turns out Disclosure is my 11th-most-listened-to artist, with 97 total listens so far this year. If I dig a little deeper into Know Your Worth, the Disclosure song that Spotify says I’ve listened to the most in 2020, I can see exactly why this is happening.
Disclosure’s latest album, ENERGY, includes a number of collaborations. Disclosure is the main artist for most of these tracks, but in some cases (like with Know Your Worth, which came out as a single February 4, 2020) the artist can be inconsistently stored by different services.
As a result, the Last.fm data has a number of different entries for the same track, with differently-listed artists for each one. Last.fm stores only one artist per track, whereas Spotify stores an array of artists for each track. This data structure decision means that Disclosure should have had about 127 total listens, and been my 7th-most-listened-to artist of 2020, instead of 11th.
This truncated screenshot shows some examples of the permutations of data that exist in my Last.fm data collection, with a total listen count of 127 for Disclosure during 2020.
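Normalizing those inconsistent artist fields is the kind of cleanup that recovers the missing listens. Here’s a minimal sketch of one way to do it, with made-up scrobble rows (and made-up separator conventions) standing in for my real Last.fm export:

```python
from collections import Counter

# Hypothetical Last.fm-style scrobbles: one artist string per track,
# so collaborations get split across inconsistent artist values.
scrobbles = [
    ("Know Your Worth", "Disclosure"),
    ("Know Your Worth", "Disclosure & Khalid"),
    ("Know Your Worth", "Disclosure, Khalid"),
    ("Energy", "Disclosure"),
]

def canonical_artist(artist: str) -> str:
    """Keep only the primary artist by splitting on common separators."""
    for sep in (" & ", ", ", " feat. ", " ft. "):
        artist = artist.split(sep)[0]
    return artist

# Collapse the permutations down to one canonical artist before counting.
counts = Counter(canonical_artist(artist) for _, artist in scrobbles)
print(counts["Disclosure"])  # prints 4: all four scrobbles collapse to one artist
```

This is a blunt instrument (it assumes the primary artist is always listed first), but it’s enough to merge the “Disclosure”, “Disclosure & Khalid”, and similar variants into a single listen count.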
I had a sneaking suspicion that my Booka Shade listening habits were primarily concentrated on a few songs from an EP that they put out this year, so I dug into how many tracks my total listens for the year were spread across.
Instead, it turns out that my listens to Booka Shade are actually the most distributed across tracks of all of my top 10 artists. Sjowgren is also an outlier here: they’ve never released an album, so they have only 15 songs in their overall discography, yet they still made my top 10 artists by listens.
Returning to my comparison between Spotify and Last.fm data, Amtrac and Lane 8 are in both top 5 lists. This is somewhat expected, because if I look at the top 10 list for artists that I’ve most consistently listened to—artists that I’ve listened to at least once in each month of 2020—both Amtrac and Lane 8 place high in that list.
Given that only 2 days of December have happened as I write this, it’s unsurprising that only one artist has been listened to in every month of 2020 so far.
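That “at least once in each month” cut is straightforward to compute from scrobble data. Here’s a minimal sketch, using invented (artist, month) pairs rather than my real Last.fm export:

```python
from collections import defaultdict

# Hypothetical scrobbles as (artist, month) pairs; real data would come
# from a Last.fm export with full timestamps.
scrobbles = [
    ("Lane 8", 1), ("Lane 8", 2), ("Lane 8", 3),
    ("Amtrac", 1), ("Amtrac", 3),
    ("Bicep", 2),
]

# Record which months each artist appeared in.
months_seen = defaultdict(set)
for artist, month in scrobbles:
    months_seen[artist].add(month)

# An artist is "consistent" if they appear in every month observed so far.
months_elapsed = 3
consistent = sorted(a for a, m in months_seen.items() if len(m) == months_elapsed)
print(consistent)  # prints ['Lane 8']
```

As the year goes on, the bar gets higher: an artist must keep showing up every single month to stay on the list, which is why so few artists qualify by December.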
Top Songs of 2020
Enough about the artists—what about the songs?
According to Spotify, my Top 5 songs of the year are:
Apricots by Bicep
Atlas by Bicep
Idontknow by Jamie xx
Cappadocia by Ben Böhmer feat. Romain Garcia
Know Your Worth by Disclosure feat. Khalid
That pretty closely matches my top 5 list according to Last.fm, with some notable exceptions.
My top 5 tracks according to Last.fm are:
Apricots by Bicep (38 listens)
Atlas by Bicep (32 listens)
Idontknow by Jamie xx (22 listens)
White Ferrari (Greene Edit) by Jacques Greene (21 listens)
That Home Extended by The Cinematic Orchestra (20 listens)
The first 3 tracks match, though of course Spotify has an incomplete representation of those listens—I have 29 streams of Apricots according to Spotify.
However, since I bought the track almost as soon as it came out, I also have another 9 listens that happened off of Spotify. There were also some mysterious things happening with Spotify and Last.fm connections around that time, so it’s possible some listens are missing beyond these numbers.
What’s up with the 4th track on the list, though? Where is that in Spotify’s data? It’s actually a bootleg remix of the Frank Ocean song White Ferrari that Jacques Greene shared on SoundCloud and as a free download earlier this year, so it isn’t anywhere on Spotify. It did, however, make it onto my top tracks of 2020 on SoundCloud:
And this is another spot where metadata intrudes and leads to some inconsistent counts. If I look at all the permutations of White Ferrari and Jacques Greene in my data for 2020, the total number of listens should actually be a bit higher, at 23 total listens:
This would actually make it my 3rd-most popular song of 2020 so far, and I’m listening to it as I write this paragraph, so let’s go ahead and call that total number 24 listens.
The 5th-most popular song and 7th-most popular song of 2020 make the case that I haven’t been sleeping very well this year (though I recall these tracks also showed up in 2019 as well…), because those 2 tracks comprise my “Insomnia” playlist that I use to help me fall asleep on nights when I’ve been, perhaps, staying up too late doing data analysis like this.
You can see the influence of consistent listening habits with top artist behaviors when you look at the top 10 songs that I’ve consistently listened to throughout 2020, with 2 songs by Kidnap, one by Bicep, and another by Amtrac.
To me, though, this table mostly underscores how much music discovery happened this year. I didn’t return to the same songs month after month during 2020. Likely as a result of all the DJ sets I’ve been streaming (as I mentioned in my post about Listening to Music while Sheltering in Place), this has been quite a year for music discovery and breadth of listening habits.
My top 10 songs of 2020 had a total of 222 listens across them. However, I have a total of 14,336 listens for the entire year, spread across 8,118 unique songs in total.
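Using the numbers above, the concentration of my listening works out like this:

```python
# Figures from my 2020 listening data, as stated above.
top_10_listens = 222
total_listens = 14336
unique_songs = 8118

# Share of all listens captured by the top 10 songs.
top_share = top_10_listens / total_listens
# Average listens per unique song across the whole year.
avg_listens_per_song = total_listens / unique_songs

print(f"{top_share:.1%}")             # prints 1.5%
print(f"{avg_listens_per_song:.1f}")  # prints 1.8
```

The top 10 songs account for only about 1.5% of all listens, and the average song got fewer than 2 plays, which is about as distributed as listening behavior gets.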
Even with possible metadata issues, that’s still quite the distribution of behavior. Let’s dig a bit deeper into artist discovery this year.
Artist Discovery in 2020
In my post earlier this year about my listening behavior while sheltering in place, I discovered that my artist discovery numbers in 2020 seemed to be way up compared with 2018 and 2019, but weren’t actually that far off from 2017 numbers.
What I see when comparing my 2020 artist discovery statistics from my Last.fm data and my Spotify data is even more interesting. In contrast to what seemed to be true in last year’s post, Wrapping up the year and the decade in music: Spotify vs my data (for what it’s worth, last year’s number should have been 1,074 artists discovered instead of 2,857—data analysis is difficult), Spotify’s number is much higher than the one I calculated this year.
According to Spotify, I discovered 2,051 new artists, whereas my Last.fm data claims that I only discovered 1,497 artists this year.
Similarly, Spotify claims that I listened to 4,179 artists this year, whereas my Last.fm data indicates that I listened to 3,715 artists.
Again, this comes down to data structures and how the artist metadata is stored for each service. I wrote about the importance of quality metadata for digital streaming providers earlier this year in Why the quality of audio analysis metadatasets matters for music, but it’s also apparent that the data structures for those metadatasets are just as important for crafting data insights of varying value.
Because Spotify stores all artists that contributed to a track as an array, I can listen to a track with 4 contributing artists, 1 of which I’ve listened to before, and according to Spotify I’ve now discovered 3 artists and listened to 4. According to Last.fm, I’ll have listened to either 1 artist that I’ve already heard before, or a new artist, possibly called “Luciano & David Morales”.
Spotify would store those artists separately, as Luciano and David Morales, allowing a more accurate count of listens for the Luciano artist. Similarly, my artist discovery data includes some flawed data, such as YouTube videos that got incorrectly recorded.
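Here’s a toy illustration of how the two counting schemes diverge for a single collaborative track. The track name and the two extra artists are hypothetical; only Luciano and David Morales come from the example above:

```python
# Spotify-style record: an array of contributing artists per track.
spotify_track = {
    "name": "Hypothetical Collab",
    "artists": ["Luciano", "David Morales", "Artist Three", "Artist Four"],
}
# Last.fm-style record: a single artist string per track.
lastfm_track = {"name": "Hypothetical Collab", "artist": "Luciano & David Morales"}

already_heard = {"Luciano"}

# Spotify-style counting: every artist in the array is "listened to",
# and each one not seen before counts as a discovery.
spotify_listened = set(spotify_track["artists"])
spotify_discovered = spotify_listened - already_heard

# Last.fm-style counting: the combined string is one artist, and since
# it doesn't match anything heard before, it looks like a new discovery.
lastfm_listened = {lastfm_track["artist"]}
lastfm_discovered = lastfm_listened - already_heard

print(len(spotify_listened), len(spotify_discovered))  # prints 4 3
print(len(lastfm_listened), len(lastfm_discovered))    # prints 1 1
```

One stream of one track produces 4 artists listened and 3 discovered under one scheme, and 1 and 1 under the other, which is plenty to explain gaps of hundreds of artists across a year of listening.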
This becomes clear in my top 20 artist discoveries of 2020 chart, where BTS and Big Hit Labels are listed separately, although they are both indicative of one of my best friends joining BTS ARMY this year and sharing her enthusiasm with me.
Ultimately I’m grateful that the top 20 artists of 2020 are all artists that I discovered during the pandemic and have excellent songs that I love and continue to listen to. Many of the sparklines that represent my listening activity for these artists throughout the year have spikes, but mostly my listening patterns indicate that I’ve been returning to these artists and their songs multiple times after first discovery. Some notable favorites on this list are KC Lights’ track Girl and Dennis Cruz’s track El Sueño, plus the entire Fennec album Free Us Of This Feeling.
Genre Discovery in 2020
The most-commented-on data insight from #wrapped2020 is probably the genre discovery slide.
According to Spotify, I listened to 801 genres this year, including 294 new ones. I’m not even sure I could name 30 genres, let alone 300 or 800. Where are these numbers coming from?
It turns out that, much like storing artist data as an array for each song, Spotify stores genre data as an array for each artist. This means that each artist can be assigned multiple genres, effectively inflating the number of genres that you’ve listened to in 2020.
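A quick sketch of how per-artist genre arrays inflate the count. The artist-to-genre mapping here is invented, shaped roughly like Spotify’s metadata:

```python
# Hypothetical artist-to-genres mapping, where each artist carries
# an array of genres rather than a single genre label.
artist_genres = {
    "Artist A": ["melodic house", "deep house", "edm"],
    "Artist B": ["electronica", "uk dance", "edm"],
    "Artist C": ["indie folk", "chamber pop"],
}

# Listening to just these 3 artists "covers" every genre in their arrays.
genres_listened = set()
for genres in artist_genres.values():
    genres_listened.update(genres)

print(len(genres_listened))  # prints 7: seven genres from only three artists
```

Scale that up to a few thousand artists, each tagged with several genres, and 801 genres for one year of listening stops looking so surprising.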
I could start discussing the possible meaninglessness of genre as a descriptive tool, the lack of validation possible for such a signifier, and the lack of clarity about how these genres were defined and assigned to specific artists, but that’s best left for another blog post.
Instead, let’s look at what little genre data I do have available to me more generally.
According to Spotify, my top genres were:
All of these make sense to me, except for Organic House, because I don’t know what makes house music organic, unless it’s also grass-fed, locally-sourced, and free range. Perhaps Blond:ish is organic house.
I don’t have any genre data from Last.fm, since the service only stores user-defined tags for each artist, and those are not included in the data that I collect from Last.fm today. Instead, I have the genres assigned by iTunes for the tracks that I’ve purchased from the iTunes store.
The top 8 genres of music that I added to my iTunes library in 2020 by purchasing tracks from the iTunes store are:
Dance (124 songs)
Electronic (121 songs)
House (78 songs)
Pop (37 songs)
Alternative (27 songs)
Electronica (12 songs)
Deep House (10 songs)
Melodic House & Techno (9 songs)
Clearly, this is a very selective sample, and is only tied to select purchasing habits, which are roughly correlated to my listening habits.
I shared all of this genre data to essentially look at it and go “wow, that wasn’t very insightful at all”. Let’s move on.
Time Spent Listening to Music in 2020
The last metric I want to unpack from Spotify’s #wrapped2020 campaign is the minutes listened data insight. According to Spotify, I spent 59,038 minutes listening to music this year.
According to my own calculations, I spent roughly 81,134 minutes listening to music in 2020.
Let’s talk about how both of these metrics are super flawed!
Spotify counts a song as streamed after you listen to it for more than 30 seconds (per their Spotify for Artists FAQ), so it’s logical to assume that this minutes listened metric likely comes from a calculation of “number of streams for a track” x “length of track”, rounded and converted to minutes. It could even result from a different type of calculation, “number of total streams” x “average length of track in Spotify library”, but I have no way of knowing if either of these is accurate besides tweeting at Spotify and hoping they’ll pay attention to me.
Unfortunately for all of us, but mostly me, my own minutes listened metric is just as lazily calculated. I don’t have track length data for all the tracks that I listen to and I don’t know at what point Last.fm counts a track as being worthy of a scrobble. I do have a list of how much time I spent listening to livestreamed DJ sets online, and I do have some excellent estimation skills. I calculated my number of 81,134 minutes so far in 2020 by calculating and assuming the following:
An average track length of 4 minutes
An average concert length of 3 hours
An average DJ set length of 4 hours
An average festival length of 8 hours
Using those averages and estimates, I calculated the total amount of time I spent listening to music across Last.fm listening habits, concerts and DJ sets attended (no festivals this year), and livestreams that I watched online, thus arriving at 81,134 minutes. That doesn’t count any DJ sets that I listened to on SoundCloud, and certainly the combination of a 4 minute track length estimate with the uncertainty of what qualifies a track as being scrobbled makes this data insight somewhat meaningless.
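The estimate above can be reconstructed as a simple sum. The per-category averages come from the list above; the event counts are placeholders, since my real counts live in my own listening logs:

```python
# Averages from the list above, converted to minutes.
AVG_TRACK_MIN = 4
AVG_CONCERT_MIN = 3 * 60
AVG_DJ_SET_MIN = 4 * 60

scrobbled_tracks = 14336   # total Last.fm listens for the year
concerts_attended = 2      # placeholder count
dj_sets_watched = 100      # placeholder count of livestreamed sets

# Total estimated minutes: tracks plus concerts plus DJ livestreams.
estimated_minutes = (
    scrobbled_tracks * AVG_TRACK_MIN
    + concerts_attended * AVG_CONCERT_MIN
    + dj_sets_watched * AVG_DJ_SET_MIN
)
print(estimated_minutes)
```

The scrobbled tracks alone contribute over 57,000 minutes, so the 4-minute average track length dominates the estimate, and any error in that assumption swings the total by thousands of minutes.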
Regardless, let’s compare this estimated time spent listening in minutes against the total number of minutes in a year.
Beautiful. I still remembered to sleep this year. No matter which dataset I use, however, it’s clear that I’ve listened to more music in 2020 than in 2019. Spotify’s metric for this same time period in 2019 was 35,496 minutes. The less-flawed but less-complete metric I used last year, calculated using the track length stored in iTunes multiplied by the number of listens for that track, indicated that I spent 14,296 minutes listening to music in 2019.
As one final Spotify examination, let’s dig into the Spotify Top 100 playlist.
Top 100 Songs of 2020 Playlist
Alongside the fancy graphics and data insights in the #wrapped2020 campaign, Spotify also creates a 100 song playlist, likely (but not definitively) the top 100 songs of the time period between January 1st, 2020 and October 31st, 2020.
I found my playlist this year to be relatively accurate, perhaps because I spent more time listening to Spotify than I might have in previous years, or perhaps they made some internal data improvements, or both! I often spend more time listening to SoundCloud if I’m traveling a lot, listening to offline DJ sets on plane flights; or listening to Apple Music on my iPhone, with songs that I’ve added from my iTunes library. Without much time spent commuting or traveling this year, it’s likely that my listening habits remained fairly consolidated.
Similarly to what I discovered about my top 10 tracks, I had relatively distributed music interests this year. The 811 total listens for all 100 songs in my Spotify playlist represent just under 6% of my total listens in 2020 so far.
Despite my overall listening habits being relatively distributed across lots of artists and songs, the Top Songs playlist is somewhat more consolidated, with 69 artists performing the 100 songs on the playlist. Nice.
It’s clear that I spent most of this year exploring and discovering new artists, given that 83 of my top songs of 2020 according to Spotify were songs that I discovered in 2020.
Thanks for coming on this journey through my music data with me. I’ll be back at the actual end of the year to dive deeper into my top 10 artists of the year, top 10 consistent artists of the year, my music purchasing activity, as well as some more livestream and concert statistics to round out my 2020 year in music.
Define the question you want to answer for your data analysis process
How does data go missing when you’re defining your question?
What can you do about missing data when defining your question?
This post also concludes this blog post series about missing data, featuring specific actions you can take to reduce bias resulting from missing data in your end-to-end data-driven decision-making process.
Define the question
When you start a data analysis process, you always want to start by deciding what questions you want to answer. Before you make a decision, you need to decide what you want to know.
If you start with the data instead of with a question, you’re sure to be missing data that could help you make a decision, because you’re starting with what you have instead of what you want to know.
Start by carefully defining what you want to know, and then determine what data you need to answer that question. What aggregations and analyses might you perform, and what tools do you need access to in order to perform your analysis?
It’s also crucial to consider whether you can answer that question adequately, safely, and ethically with the data you have access to.
How does data go missing?
Data can go missing at this stage if it isn’t there at all—if the data you need to answer a question does not exist. There’s also the possibility that the data you want to use to answer a question is incomplete: you have some, but not all, of the data that you need to answer the question.
It’s also possible that the data exists, but you can’t have it—you either don’t have it, or you aren’t permitted to use the data that has already been collected to answer your particular question.
It’s also possible that the data that you do have is not accurate, in which case the data might exist to help answer your question, but it’s unusable, so it’s effectively missing. Perhaps the data is outdated, or the way it was collected means you can’t trust it.
Who funded the data collection, who performed it, and when and why it was performed can tell you a lot about whether or not you can use a dataset to answer your particular set of questions.
For example, if you are trying to answer the question “What is the effect of slavery on the United States?”, you could review economic reports and the records from plantations about how humans were bought and sold, and stop there. But you might be better off considering who created those datasets, who is missing from them, whether or not they are useful for answering your question, and which datasets might be missing entirely because they were never created or because the records that did exist were destroyed. You might also want to consider whether or not it’s ethical to use data to answer specific questions about the lived experiences of people.
“For example, hundreds of Muslim men were rounded up in New York and New Jersey in the weeks after 9/11. They were imprisoned without charge and often subject to abuse in custody because of their religion. None of this would register in any hate crimes database.”
Data can also go missing if the dataset that you choose to use to answer your question is incomplete.
Incomplete dataset by relying only on digitized archival films
As Rick Prelinger laments in a tweet—if part of a dataset is digitized, often that portion of the dataset is used for data analysis (or research, as the case may be), with the non-digitized portion ignored entirely.
For example, if I wanted to answer the question “What are common themes in American television advertising in the 1950s?”, I might turn to the Prelinger Archives, because they make so much digitized archival film footage available. But just because it’s easily accessible doesn’t make it complete. Just because it’s there doesn’t make it the best dataset to answer your question.
It’s possible that the Prelinger Archives don’t have enough film footage for me to answer such a broad question. In this case, I can supplement the dataset available to me with information that is harder to find, such as by tracking down those non-digitized films. I can also choose to refine my question to focus on a specific type of film, year, or advertising agency that is more comprehensively featured in the archive, narrowing the scope of my analysis to focus on the data that I have available. I could even choose a different dataset entirely, if I find one that more comprehensively and accurately answers my question.
Possibly the most common way that data can go missing when trying to answer a question is that the data you have, or even all of the data available to you, doesn’t accurately proxy what you want to know.
Inaccurate proxy to answer a question leads to missing data
If you identify data points that inaccurately proxy the question that you’re trying to answer, you can end up with missing data. For example, if you want to answer the question, “How did residents of New York City behave before, during, and after Hurricane Sandy?”, you might look at geotagged social media posts.
“consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. A fascinating study combining Sandy-related Twitter and Foursquare data produced some expected findings (grocery shopping peaks the night before the storm) and some surprising ones (nightlife picked up the day after — presumably when cabin fever strikes). But these data don’t represent the whole picture.”
The users of social media, especially those who use Twitter and Foursquare and share location data with those tools, represent only a specific slice of the population affected by Hurricane Sandy, and that slice is not a representative or comprehensive sample of New York City residents. Indeed, as Crawford makes very clear, “there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate.”
The dataset of geotagged social media posts represents only some residents of New York City, and not in a representative way, so it’s an inaccurate proxy for the experience of all New York City residents. This means data is missing from the question stage of the data analysis process. You want to answer a question about the experience of all New York City residents, but you only have data about the residents who shared geotagged posts on social media during a specific period of time.
The risk is clear—if you don’t identify the gaps in this dataset, you might draw false conclusions. Crawford is careful to point this out clearly, identifying that “The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster.”
When you identify the gaps in the dataset, you can understand what limitations exist in the dataset, and thus how you might draw false and biased conclusions. You can also identify new datasets to examine or groups to interview to gather additional data to identify the root cause of the missing data (as discussed in my post on data gaps in data collection).
The gaps in who is using Twitter, and who is choosing to use Twitter during a natural disaster, are one way that Twitter data can inaccurately proxy a population that you want to research and thus cause data to go missing. Another way that it can cause data to go missing is by inaccurately representing human behavior in general because interactions with the platform itself are not neutral.
“platform log data are not “unobtrusive” recordings of human behavior out in the wild. Rather, their measurement conditions determine that they are accounts of putative user activity — “putative” in a sense that platforms are often incentivized to keep bots and other fake accounts around, because, from their standpoint, it’s always a numbers game with investors, marketers, and the actual, oft-insecure users.”
Put another way, you can’t interpret social media interactions as neutral reflections of user behavior due to the mechanisms a social media platform uses to encourage user activity. The authors also point out that it’s difficult to identify if social media interactions reflect the behavior of real people at all, given the number of bot and fake accounts that proliferate on such sites.
Using a dataset that inaccurately proxies the question that you’re trying to answer is just one way for data to go missing at this stage. What can you do to prevent data from going missing as you’re devising the questions you want to ask of the data?
What can you do about missing data?
Most importantly, redefine your questions so that you can use data to answer them! If you refine the questions that you’re trying to ask into something that can be quantified, it’s easier to ask the question and get a valid, unbiased, data-driven result.
Rather than try to understand the experience of all residents of New York City before, during, and after Hurricane Sandy, you can constrain your efforts to understand how social media use was affected by Hurricane Sandy, or how users that share their locations on social media altered their behavior before, during, and after the hurricane.
As another example, you might shift from trying to understand “How useful is my documentation?” to asking a question based on the data that you have: “How many people view my content?”. You can also make a broad question more specific. Instead of asking “Is our website accessible?”, ask “Does our website meet the AA standard of the Web Content Accessibility Guidelines?”
Douglas Hubbard’s book, How to Measure Anything, provides excellent guidance about how to refine and devise a question that you can use data analysis to answer. He also makes the crucial point that sometimes it’s not worth it to use data to answer a question. If you are fairly certain that you already know the answer to a question, and the amount of effort it would take to perform data analysis (let alone perform it well) will take a lot of time and resources, it’s perhaps not worth attempting to answer the question with data at all!
You can also choose to use a different data source. If the data that you have access to in order to answer your question is incomplete, inadequate, inaccurate, or otherwise missing data, choose a different data source. This might lead you to change your dataset choice from readily-available digitized content to microfiche research at a library across the globe in order to perform a more complete and accurate data analysis.
And of course, if a different data source doesn’t exist, you can create a new data source with the information you need. Collaborate with stakeholders within your organization, make a business case to a third-party system that you want to gather data from, or use Freedom of Information Act (FOIA) requests to create a dataset from data that exists but is not easily accessible.
I also want to take care to acknowledge that choosing to use or create a different dataset can often require immense privilege: monetary privilege, to fund added data collection, a trip across the globe, or a more complex survey methodology; privilege of access, to know others doing similar research who are willing to share data with you; and privilege of time, to perform the added data collection and analysis that might be necessary to prevent missing data.
If the data exists but you don’t have permission to use it, you might devise a research plan to request access to sensitive data, or work to gain the consent of those in the dataset that you want to use to allow you to use the data to answer the question that you want to answer. This is another case where communicating the use case of the data can help you gather it—if you share the questions that you’re trying to answer with the people that you’re trying to collect data from, they may be more inclined to share it with you.
Take action to reduce bias in your data-driven decisions from missing data
If you’re a data decision-maker, take these steps:
Define the questions being answered with data.
Identify missing data in the analysis process.
Ask questions of the data analysis before making decisions.
If you carefully define the questions guiding the data analysis process, clearly communicating your use cases to the data analysts that you’re working with, you can prevent data from going missing at the very start.
Work with your teams and identify where data might go missing in the analysis process, and do what you can to address a leaky analysis pipeline.
Finally, ask questions of the data analysis results before making decisions. Dig deeper into what is communicated to you, seek to understand what might be missing from the reports, visualizations, and analysis results being presented, and whether or not that missing data is relevant to your decision.
If you work with data as a data analyst, engineer, admin, or communicator, take these steps:
Steward and normalize data.
Analyze data at multiple levels of aggregation and time spans.
Add context to reports and communicate missing data.
Responsibly steward data as you collect and manage it, and normalize it when you prepare it for analysis to make it easier to use.
If you analyze data at multiple levels of aggregation and time spans, you can determine which level allows you to communicate the most useful information with the least amount of data going missing, hidden by overgeneralized aggregations or overlarge time spans, or hidden in the noise of overly-detailed time spans or too many split-bys.
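The idea of checking multiple granularities can be sketched in a few lines of Python. The page-view counts here are invented for illustration: a weekly total looks healthy, while the daily breakdown exposes a day of missing data.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical page-view events: one timestamp per view.
# Note the gap on 2020-03-04, when collection silently failed.
views = [date(2020, 3, d) for d in (1, 1, 2, 3, 5, 6, 7, 7)]

daily = Counter(views)
week_total = sum(daily.values())

# The weekly aggregate looks healthy...
print(week_total)  # 8 views for the week

# ...but the daily view reveals a day with no data at all.
missing_days = [date(2020, 3, 1) + timedelta(days=i)
                for i in range(7)
                if daily[date(2020, 3, 1) + timedelta(days=i)] == 0]
print(missing_days)  # [datetime.date(2020, 3, 4)]
```

The same tension applies in the other direction: per-minute counts might be so noisy that the overall trend disappears, which is why it’s worth looking at more than one level.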
Add context to the reports that you produce, providing details about the data analysis process and the dataset used, acknowledging what’s missing and what’s represented. Communicate missing data with detailed and focused visualizations, keeping visualizations consistent for regularly-communicated reports.
I hope that no matter your role in the data analysis process, this blog post series helps you reduce missing data and make smarter, more accurate, and less biased data-driven decisions.
When you’re gathering the data you need and creating datasets that don’t exist yet, you’re in the midst of the data collection stage. Data can easily go missing when you’re collecting it!
In this post, I’ll cover the following:
How data goes missing at the data collection stage
What to do about missing data at the collection stage
How does data go missing?
There are many reasons why data might be missing from your analysis process. Data goes missing at the collection stage because the data doesn’t exist, or the data exists but you can’t use it for whatever reason, or the data exists but the events in the dataset are missing information.
The dataset doesn’t exist
Frequently data goes missing because the data itself does not exist, and you need to create it. Creating a truly comprehensive dataset is difficult and often impractical, so data can easily go missing at this stage. When you can’t collect everything, do what you can to make sure data goes missing consistently rather than selectively, by collecting representative data.
In some cases, though, you do need comprehensive data. For example, if you need to create a dataset of all the servers in your organization for compliance reasons, you might discover that there is no one dataset of servers, and that efforts to compile one are a challenge. You can start with just the servers that your team administers, but that’s an incomplete list.
Some servers are grant-owned and fully administered by a separate team entirely. Perhaps some servers are lurking under the desks of some colleagues, connected to the network but not centrally managed. You can try to use network scans to come up with a list, but then you gather only those servers connected to the network at that particular time. Airgapped servers or servers that aren’t turned on 24/7 won’t be captured by such an audit. It’s important to continually consider if you really need comprehensive data, or just data that comprehensively addresses your use case.
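A rough sketch of reconciling those partial inventories, with hypothetical hostnames, shows how each source misses servers that another one catches:

```python
# Each inventory is partial; no single source is comprehensive.
team_cmdb = {"web-01", "web-02", "db-01"}       # servers your team administers
network_scan = {"web-01", "web-02", "app-07"}   # only hosts online during the scan
grant_owned = {"hpc-lab-01"}                    # administered by another team

# Hosts seen on the network but absent from any managed inventory
unmanaged = network_scan - (team_cmdb | grant_owned)
print(sorted(unmanaged))  # ['app-07']

# Managed hosts the scan missed (powered off, airgapped, ...)
not_scanned = (team_cmdb | grant_owned) - network_scan
print(sorted(not_scanned))  # ['db-01', 'hpc-lab-01']
```

Comparing the sources against each other at least tells you where the gaps are, even if it can’t find servers that appear in no inventory at all.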
The data exists, but…
There’s also a chance that the data exists, but isn’t machine-readable. If the data is provided only in PDFs, as the results of many FOIA requests are, it becomes more difficult to include in data analysis. The data might also be available only as paper documents, as is the case with gun registration records. As Jeanne Marie Laskas reports for GQ in Inside The Federal Bureau Of Way Too Many Guns, keeping records only on paper prevents large-scale data analysis, effectively causing the data to go missing from the entire process of data analysis.
It’s possible that the data exists, but isn’t on the network—perhaps because it is housed on an airgapped device, or perhaps stored on servers subject to different compliance regulations than the infrastructure of your data analysis software. In this case, the data exists but it is missing from your analysis process because it isn’t available to you due to technical limitations.
Another common case is that the data exists, but you can’t have it. If you’ve made an enemy in another department, they might not share the data with you because they don’t want to. It’s more likely that access to the data is controlled by legal or compliance concerns, so you aren’t able to access the data for your desired purposes, or perhaps you can’t analyze it on the tool that you’re using for data analysis due to compliance reasons.
For example, most doctors’ offices and hospitals in the United States use electronic health record systems to store the medical records of thousands of Americans. However, scientific researchers are not permitted to access patients’ detailed electronic health records, even though the records exist in large machine-readable databases, because the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule regulates how protected health information (PHI) can be accessed and used.
Perhaps the data exists, but is only available to people who pay for access. This is the case for many music metadata datasets like those from Nielsen, much to my chagrin. The effort it takes to create quality datasets is often commoditized. This also happens with scientific research, which is often only available to those with access to scientific journals that publish the results of the research. The datasets that produce the research are also often closely-guarded, as one dataset is time-consuming to create and can lead to multiple publications.
There’s also a chance the data exists, but it isn’t made available outside of the company. A common circumstance for this is public API endpoints for cloud services. Spotify collects far more data than it makes available via its API, as do companies like Zoom and Google. You might hope to collect various types of data from these companies, but if the API endpoints don’t make the data available, you don’t have many options.
And of course, in some cases the data exists, but it’s inconsistent. Maybe you’re trying to collect equivalent data from servers or endpoints with different operating systems, but you can’t get the same details due to logging limitations. A common example is trying to collect the same level of detail from computers running macOS and computers running Windows. You can also see inconsistencies if different log levels are set on different servers for the same software. This inconsistent data causes data to go missing within events and makes it harder to compare like with like.
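One way to make that inconsistency visible is to normalize every event against a fixed field list, so gaps become explicit `None`s instead of silently absent keys. A minimal Python sketch, with made-up events and field names:

```python
# Two hypothetical sources that log different fields for the same activity.
FIELDS = ["host", "user", "duration_ms"]

windows_event = {"host": "win-07", "user": "ada"}                 # no duration logged
macos_event = {"host": "mac-02", "user": "lin", "duration_ms": 41}

def normalize(event):
    # Missing fields become None so downstream analysis can see the gap.
    return {field: event.get(field) for field in FIELDS}

rows = [normalize(e) for e in (windows_event, macos_event)]

# Hosts reporting incomplete events, i.e. where data is missing within events.
incomplete = [r["host"] for r in rows if None in r.values()]
print(incomplete)  # ['win-07']
```

Explicit `None`s don’t recover the missing values, but they let you count how much of each comparison rests on incomplete events.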
14-page forms lead to incomplete data collection in a pandemic
Data can easily go missing if it’s just too difficult to collect. An example from Illinois, reported by WBEZ reporter Kristen Schorsch in Illinois’ Incomplete COVID-19 Data May Hinder Response, is that “the Illinois Department of Public Health issued a 14-page form that it has asked hospitals to fill out when they identify a patient with COVID-19. But faced with a cumbersome process in the midst of a pandemic, many hospitals aren’t completely filling out the forms.”
It’s likely that as a result of the length of the form, data isn’t consistently collected for all patients from all hospitals—which can certainly bias any decisions that the Illinois Department of Public Health might make, given that they have incomplete data.
In fact, as Schorsch reports, without that data, public health workers “told WBEZ that makes it harder for them to understand where to fight for more resources, like N95 masks that provide the highest level of protection against COVID-19, and help each other plan for how to make their clinics safer as they welcome back patients to the office.”
In this case, where data is going missing because it’s too difficult to collect, you can refocus your data collection on the most crucial data points for what you need to know, rather than the most complete data points.
What can you do about missing data?
Most crucially, identify the missing data. If you know what data you need to answer the questions in your data analysis, you can recognize when that data is missing from your analysis process.
After you identify the missing data, you can determine whether or not it matters. If the data that you do have is representative of the population that you’re making decisions about, and you don’t need comprehensive data to make those decisions, a representative sample of the data is likely sufficient.
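One rough way to judge whether the data you have is representative is to compare the sample’s composition against a known baseline for the population. A sketch with invented borough proportions (the threshold is arbitrary, chosen only for illustration):

```python
# Hypothetical share of each group in the population vs. in your sample.
population = {"manhattan": 0.19, "brooklyn": 0.31, "queens": 0.27,
              "bronx": 0.17, "staten_island": 0.06}
sample =     {"manhattan": 0.48, "brooklyn": 0.30, "queens": 0.12,
              "bronx": 0.05, "staten_island": 0.05}

# Flag groups whose share in the sample diverges badly from the population.
THRESHOLD = 0.10
skewed = {group for group in population
          if abs(sample.get(group, 0) - population[group]) > THRESHOLD}
print(sorted(skewed))  # ['bronx', 'manhattan', 'queens']
```

If the skewed set is empty for the groups your decision depends on, a representative sample may well be sufficient; if not, you know exactly which groups the missing data affects.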
Communicate your use cases
Another important thing you can do is to communicate your use cases to the people collecting the data. For example,
If software developers have a better understanding of how telemetry or log data is being used for analysis, they might write more detailed or more useful logging messages and add new fields to the telemetry data collection.
If you share a business case with cloud service providers to provide additional data types or fields via their API endpoints, you might get better data to help you perform less biased and more useful data analyses. In return, those cloud providers are likely to retain you as a customer.
Communicating the use case for data collection is most helpful when communicating that information leads to additional data gathering. It’s riskier when it might cause potential data sources to be excluded.
For example, if you’re using a survey to collect information about a population’s preferences—let’s say, the design of a sneaker—and you disclose that information upfront, you might only get people with strong opinions about sneaker design responding to your survey. That can be great if you want to survey only that population, but if you want a more mainstream opinion, you might miss those responses because the use case you disclosed wasn’t interesting to them. In that context, you need to evaluate the missing data for its relevance to your analysis.
Build trust when collecting sensitive data
Data collection is a trust exercise. If the population that you’re collecting data about doesn’t understand why you’re collecting the data, doesn’t trust that you will protect it and use it as you say you will, or believes that you will use the data against them, you might end up with missing data.
Nowhere is this more apparent than with the U.S. Census. Performed every 10 years, the data from the census is used to determine political representation, distribute federal resources, and much more. Because of how the data from the census survey is used, a representative sample isn’t enough—it must be as complete a survey as possible.
The Census Bureau understands that mistrust is a common reason why people might not respond to the census survey. Because of that, the U.S. Census Bureau hires pollsters who are part of groups that might be less inclined to respond to the census, and also provides clear and easy-to-find details on its website (see How We Protect Your Information on census.gov) about the measures in place to protect the data collected in the census survey. Those details are even clear in the marketing campaigns urging you to respond to the census! The census also faces other challenges in ensuring the survey is as complete as possible.
Address instances of mistrust with data stewardship
As Jill Lepore discusses in episode 4, Unheard, of her podcast The Last Archive, mistrust can also affect the accuracy of the data being collected, such as in the case of formerly enslaved people being interviewed by descendants of their former owners, or their current white neighbors, for records collected by the Works Progress Administration. Surely, data is missing from those accounts of slavery due to mistrust of the people doing the data collection, or at the least, because those collecting the stories perhaps did not deserve to hear the true lived experiences of the formerly enslaved people.
If you and your team are not good data stewards, if you don’t do a good job of protecting data that you’ve collected or managing who has access to that data, people are less likely to trust you with more data—and thus it’s likely that datasets you collect will be missing data. Because of that, it’s important to practice good data stewardship. Use datasheets for datasets, or a data biography to record when data was collected, for what purpose, by whom or what means, and more. You can then review those to understand whether data is missing, or even to remember what data might be intentionally missing.
In some cases, data can be intentionally masked, excluded, or left to collect at a later date. If you keep track of these details about the dataset during the data collection process, it’s easier to be informed about the data that you’re using to answer questions and thus use it safely, equitably, and knowledgeably.
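One lightweight way to keep track of those details is a small structured record alongside each dataset. This sketch is loosely inspired by datasheets for datasets and data biographies; the fields and values are illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class DataBiography:
    """A minimal record of how and why a dataset was collected."""
    name: str
    collected_by: str
    collected_when: str
    purpose: str
    known_gaps: list = field(default_factory=list)  # intentionally or accidentally missing data

bio = DataBiography(
    name="docs-site-metrics",
    collected_by="web analytics pipeline",
    collected_when="2020-01 through 2020-06",
    purpose="measure documentation usage",
    known_gaps=["users with tracking blockers", "offline PDF readers"],
)
print(bio.known_gaps)
```

Reviewing the `known_gaps` field before an analysis is a quick way to remember what is intentionally missing from a dataset you collected months ago.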
Collect what’s missing, maybe
If possible and if necessary, collect the data that is missing. You can create a new dataset if one does not already exist, such as those that journalists and organizations such as Campaign Zero have been compiling about police brutality in the United States. Some data collection that you perform might supplement existing datasets, such as adding additional introspection details to a log file to help you answer a new question for an existing data source.
If there are cases where you do need to collect additional data, you might not be able to do so at the moment. In those cases, you can build a roadmap or a business case to collect the data that is missing, making it clear how it can help reduce uncertainty for your decision. That last point is key, because collecting more data isn’t always the best solution for missing data.
Sometimes, it isn’t possible to collect more data. For instance, you might be trying to gather historical data, but everyone from that period has died and few or no primary sources remain. Or the data might have been destroyed, whether accidentally in a fire or intentionally, as the Stasi destroyed records after the fall of the Berlin Wall.
Consider whether you need complete data
Also consider whether or not more data will actually help address the problem that you’re attempting to solve. You can be missing data, and yet still not need to collect more data in order to make your decision. As Douglas Hubbard points out in his book, How to Measure Anything, data analysis is about reducing uncertainty about what the most likely answer to a question is. If collecting more data doesn’t reduce your uncertainty, then it isn’t necessary.
Nani Jansen Reventlow of the Digital Freedom Fund makes this point clear in her Op-Ed on Al Jazeera, Data collection is not the solution for Europe’s racism problem. In that case, collecting more data, even though it could be argued that the data is missing, doesn’t actually reduce uncertainty about the likely solution for racism. Being able to quantify the effects or harms of racism on a region does not solve the problem; only the drive to solve the problem can.
Avoid cases where you continue to collect data, especially at the expense of an already-marginalized population, in an attempt to prove what is already made clear by the existing information available to you.
You might think that data collection is the first stage of a data analysis process, but in fact, it’s the second. The next and last post in this series covers defining the question that guides your data analysis, and how to take action to reduce bias in your data-driven decisions: Define the question: How missing data biases data-driven decisions.
In this post, I’ll cover the following:
How does data go missing, featuring examples of disappearing data
What you can do about missing data
How you manage data in order to prepare it for analysis can cause data to go missing and decisions based on the resulting analysis to be biased. With so many ways for data to go missing, there’s just as many chances to address the potential bias that results from missing data at this stage.
What is data management?
Data management, for the purposes of this post, covers all the steps you take to prepare data after it’s been collected. That includes all the steps you take to answer the following questions:
How do you extract the data from the data source?
What transformations happen to the data to make it easier to analyze?
How is it loaded into the analysis tool?
Is the data normalized against a common information model?
How is the data structured (or not) for analysis?
What retention periods are in place for different types of data?
Who has access to the data?
How do people access the data?
For what use cases are people permitted to access the data?
How is information stored and shared about the data sources?
What information is stored or shared about the data sources?
What upstream and downstream dependencies feed into the data pipeline?
How you answer these questions (if you even consider them at all) can cause data to go missing when you’re managing data.
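As one example of the normalization question above, mapping each source’s field names onto a shared model makes it possible to ask the same question of every source. The field names and sources here are hypothetical:

```python
# A tiny common information model: the fields every source should expose.
COMMON_MODEL = {"src_ip", "user", "action"}

# Per-source mappings from native field names to the common model.
FIELD_MAPS = {
    "firewall": {"source_address": "src_ip", "username": "user", "verb": "action"},
    "webapp": {"client_ip": "src_ip", "login": "user", "event": "action"},
}

def to_common_model(source, event):
    """Rename a raw event's fields to the common model, dropping unknown fields."""
    mapping = FIELD_MAPS[source]
    return {mapping[k]: v for k, v in event.items() if k in mapping}

row = to_common_model("webapp", {"client_ip": "10.0.0.5", "login": "ada", "event": "view"})
print(row)  # {'src_ip': '10.0.0.5', 'user': 'ada', 'action': 'view'}
```

Note that unmapped fields are silently dropped here, which is itself a way for data to go missing; a production pipeline would want to log what it discards.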
How does data go missing?
Data can go missing at this stage in many ways. With so many moving parts from various tooling and transformation steps being taken to prepare data for analysis and make it easier to work with, a lot can go wrong. For example, if you neglect to monitor your dependencies, a configuration change in one system can cause data to go missing from your analysis process.
Disappearing data: missing docs site metrics
It was just an average Wednesday when my coworker messaged me asking for help with her documentation website metrics search—she thought she had a working search, but it wasn’t showing the results she expected. It was showing her that no one was reading any of her documentation, which I knew couldn’t be true.
As I dug deeper, I realized the problem wasn’t the search syntax, but the indexed data itself. We were missing data!
I reported it to our internal teams, and after some investigation they realized that a configuration change on the docs site had resulted in data being routed to a different index. A configuration change that they thought wouldn’t affect anything ended up causing data to go missing for nearly a week because we weren’t monitoring dependencies crucial to our data management system.
Thankfully, the data was only misrouted and not dropped entirely, but it was a good lesson in how easily data can go missing at this management stage. If you identify the sources you expect to be reporting data, then you can monitor for changes in the data flow. You can also document those sources as dependencies, and ensure that configuration changes include additional testing to ensure the continued fidelity of your data collection and management process.
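A monitor like that can be very simple: track when each expected source last reported, and flag any source that has gone quiet. A Python sketch with invented sources and timestamps:

```python
from datetime import datetime, timedelta

# Sources you expect to be reporting data into your index.
EXPECTED_SOURCES = {"docs-site", "app-logs", "cdn"}

# Most recent event timestamp observed per source (hypothetical values).
last_seen = {
    "docs-site": datetime(2020, 6, 10, 9, 0),   # silent for a week
    "app-logs": datetime(2020, 6, 17, 8, 55),
    "cdn": datetime(2020, 6, 17, 8, 58),
}

def stale_sources(now, max_lag=timedelta(hours=1)):
    """Return expected sources that haven't reported within max_lag."""
    return sorted(
        source for source in EXPECTED_SOURCES
        if now - last_seen.get(source, datetime.min) > max_lag
    )

print(stale_sources(datetime(2020, 6, 17, 9, 0)))  # ['docs-site']
```

In the misrouted-index incident above, a check like this would have caught the missing docs-site data within an hour instead of a week.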
Disappearing data: data retention settings slip-up
Another way data can go missing is if you neglect to manage or be aware of default tool constraints that might affect your data.
In this example, I was uploading my music data to the Splunk platform for the first time. I was so excited to analyze the 10 years of historical data. I uploaded the file, set up the field extractions, and got to searching my data. I wrote an all-time search to see how my music listening habits had shifted year over year in the past decade—but only 3 years of results were returned. What?!
In my haste to start analyzing my data, I’d completely ignored a warning message about a seemingly-irrelevant setting called “max_days_ago”. It turns out, this setting is set by default to drop any data older than 3 years. The Splunk platform recognized that I had data in my dataset older than 3 years, but I didn’t heed the warning and didn’t update the default setting to match my data. I ended up having to delete the data I’d uploaded, fix my configuration settings, and upload the data again—without any of it being dropped this time!
This experience taught me to pay attention to how I configure a tool to manage my data to make sure data doesn’t go missing. This happened to me while using the Splunk platform, but it can happen with whatever tool you’re using to manage, transform, and process your data.
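In the Splunk platform, that particular guardrail lives in props.conf. This stanza is a sketch only: the sourcetype name is made up, and the value is an illustrative ten-year window rather than a recommended setting:

```ini
# props.conf -- hypothetical sourcetype for a decade of listening history
[music:listens]
# Accept events with timestamps up to ~10 years old instead of the default cutoff
MAX_DAYS_AGO = 3700
```

Whatever tool you use, the lesson is the same: find the settings that silently drop or reshape data, and set them to match the data you actually have.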
As Alex Hern reported in The Guardian, “A million-row limit on Microsoft’s Excel spreadsheet software may have led to Public Health England misplacing nearly 16,000 Covid test results”. This happened because of a mismatch in formats and a misunderstanding of the data limitations imposed both by the file formats that labs used to report case data and by the software (Microsoft Excel) used to manage that case data. Hern continues, pointing out that “while CSV files can be any size, Microsoft Excel files can only be 1,048,576 rows long – or, in older versions which PHE may have still been using, a mere 65,536. When a CSV file longer than that is opened, the bottom rows get cut off and are no longer displayed. That means that, once the lab had performed more than a million tests, it was only a matter of time before its reports failed to be read by PHE.”
This limitation in Microsoft Excel isn’t the only way that tool limitations and settings can cause data to go missing at the data management stage.
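A lightweight guard against this class of problem is to count rows before handing a CSV to row-limited software. This sketch flags files that Excel would silently truncate (the function name is mine, not part of any standard tooling):

```python
import csv

EXCEL_ROW_LIMIT = 1_048_576  # maximum rows in a modern .xlsx worksheet

def rows_lost_in_excel(path):
    """Return how many rows would be cut off if this CSV were opened in Excel."""
    with open(path, newline="") as f:
        row_count = sum(1 for _ in csv.reader(f))
    return max(0, row_count - EXCEL_ROW_LIMIT)
```

Running a check like this as part of a data pipeline turns a silent truncation into a visible alert: a nonzero result means data will go missing if the file is handled in Excel.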
Data transformation: Microsoft wants genes to be dates
By default, Microsoft Excel automatically converts certain gene symbols, such as SEPT2 and MARCH1, into dates. Clearly, if you’re doing genetics research, this is a problem. Your data has been changed, and this missing data will bias any research based on this dataset, because certain genes are now dates!
To handle cases where a tool is improperly transforming data, you have 3 options:
Change the tool that you’re using,
Modify the configuration settings of the tool so that it doesn’t modify your data,
Or modify the data itself.
The genetics researchers ended up deciding to modify the data itself. The HUGO Gene Nomenclature Committee officially renamed 27 genes to accommodate this data transformation error in Microsoft Excel. Thanks to this decision, these researchers have one fewer configuration setting to worry about when helping to ensure vital data doesn’t go missing during the data analysis process.
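One lightweight defense against this kind of silent transformation, sketched here in Python, is to scan a dataset for gene symbols that spreadsheet software is known to reinterpret as dates before anyone opens the file in Excel (the list of risky prefixes below is an illustrative subset, not an exhaustive one):

```python
import re

# Gene symbol prefixes that Excel historically converts to dates,
# e.g. SEPT2 -> "2-Sep", MARCH1 -> "1-Mar" (illustrative subset)
DATE_LIKE = re.compile(r"^(SEPT|MARCH|DEC|OCT|NOV|FEB)\d+$")

def risky_gene_symbols(symbols):
    """Return the symbols that a spreadsheet might silently turn into dates."""
    return [s for s in symbols if DATE_LIKE.match(s)]
```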
What can you do about missing data?
These examples illustrate common ways that data can go missing at the management stage, but they’re not the only ways. What can you do when data goes missing?
Carefully set configurations
The configuration settings that you use to manage data that you’ve collected can result in events and data points being dropped.
For example, if you incorrectly configure data source collection, you might lose events or parts of events. Worse, data can go missing if events are recorded incorrectly due to faulty line breaking, truncation, time zone, timestamp recognition, or retention settings. Data can go missing inconsistently if the nodes of your data management system don’t all have identical configurations.
You might cause some data to go missing intentionally. You might choose to drop INFO level log messages and collect only the ERROR messages in an attempt to track just the signal from the noise of log messages, or you might choose to drop all events older than 3 months from all data sources to save money on storage. These choices, if inadequately communicated or documented, can lead to false assumptions or incorrect analyses being performed on the data.
If you don’t keep track of configuration changes and updates, a data source format could change before you update the configurations to manage the new format, causing data to get dropped, misrouted, or otherwise go missing from the process.
If your data analysts communicate their use cases and questions to you, you can tailor data retention settings to those use cases, review the current policies across your datasets, and see how they compare for complementary data types.
You can also identify complementary data sources that might help the analyst answer the questions they want to answer, and plan how and when to bring in those data sources to improve the data analysis.
You need to manage dataset transformations just as closely as you do the configurations that manage the data.
Communicate dataset transformations
The steps you take to transform data can also lead to missing data. If you don’t normalize fields, or if your field normalizations are inconsistently applied across the data or across the data analysts, data can appear to be missing even if it is there. If some data has a field name of “http_referrer” and the same fields in other data sources are consistently “http_referer”, the data with “http_referrer” data might appear to be missing for some data analysts when they start the data analysis process.
Normalization can also help you identify where fields might be missing across similar datasets, such as cases where an ID is present in one type of data but not another, making it difficult to trace a request across multiple services.
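A minimal sketch of that kind of field-name normalization in Python (the alias map is hypothetical) renames known variant field names to one canonical form before analysis, so the “missing” field reappears:

```python
# Map variant field names to the canonical name used in analysis
FIELD_ALIASES = {
    "http_referrer": "http_referer",  # double-r spelling variant
}

def normalize_event(event):
    """Return a copy of the event with variant field names renamed."""
    return {FIELD_ALIASES.get(key, key): value for key, value in event.items()}
```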
If the data analyst doesn’t know or remember which field name exists in one dataset, and whether or not it’s the same field as another dataset, data can go missing at the analysis stage—as we saw with my examples of the “rating” field missing from some events and the info field not having a value that I expected in the data analysis post from this series, Analyze the data: How missing data biases data-driven decisions.
In the same vein, if you use vague field names to describe the data that you’ve collected, or dataset names that ambitiously describe the data that you want to be collecting—instead of what you’re actually collecting—data can go missing. Shortcuts like “future-proofing” dataset names can be misleading to data analysts who want to easily and quickly understand what data they’re working with.
The data doesn’t go missing immediately, but you’re effectively causing it to go missing when data analysis begins if data analysts can’t correctly decipher what data they’re working with.
Educate and incorporate data analysis into existing processes
Another way data can go missing is painfully human. If the people that you expect to analyze the data and use it in their decision-making process don’t know how to use the tool that the data is stored in, well, that data goes missing from the process. Tristan Handy in the dbt blog post Analytics engineering for everyone discusses this problem in depth.
It’s important to not just train people on the tool that the data is stored in, but also make sure that the tool and the data in it are considered as part of the decision-making process. Evangelize what data is available in the tool, and make it easy to interact with the tool and the data. This is a case where a lack of confidence and knowledge can cause data to go missing.
Data gaps aren’t always caused by a lack of data—they can also be caused by knowledge gaps and tooling gaps if people aren’t confident or trained to use the systems with the data in them.
Monitor data strategically
Everyone wants to avoid missing data, but you can’t monitor what you can’t define. In order to monitor data to prevent it from going missing, you must define what data you expect to see: which sources it should come from, and at what ingestion volumes.
If you don’t have a way of defining those expectations, then you can’t alert on what’s missing. Start by identifying what you expect, and then quantify what’s missing based on those expectations. For guidance on how to do this in Splunk, see Duane Waddle’s blog post Proving a Negative, as well as the apps TrackMe or Meta Woot!.
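The “define expectations, then compare” approach can be sketched in a few lines of Python (the source names here are hypothetical): keep an explicit list of the sources you expect, and report which ones have gone quiet.

```python
# Sources we expect to be reporting data, defined up front
EXPECTED_SOURCES = {"docs_site", "app_logs", "payment_api"}

def missing_sources(observed_events):
    """Compare the sources seen in recent events against expectations."""
    observed = {event["source"] for event in observed_events}
    return sorted(EXPECTED_SOURCES - observed)
```

A scheduled check like this would have caught the misrouted docs-site data within hours instead of a week.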
Plan changes to the data management system carefully
It’s also crucial to review changes to the configurations that you use to manage data sources, especially changes to data structures or normalization in data sources. Make sure that you deploy these changes consistently as well, to reduce the chance that different sources collect the same data in different ways.
Be careful to note downstream and upstream dependencies for your data management system, such as other tools, permissions settings, or network configurations, before making changes, such as an upgrade or a software change.
When you do data analysis, you’re searching and analyzing your data to answer a specific question so that you can make a decision at the end of the process. Unfortunately, there are many ways that data can go missing while you analyze it.
In this post, I’ll cover the following:
Why disaggregation matters when performing data analysis
How data goes missing when analyzing data
What to do about missing data when analyzing data
How simple mistakes can cause data to go missing
Declining bird populations in North America
Any aggregation that you do in your data analysis can cause data to go missing. I recently finished listening to the podcast series The Last Archive, by Jill Lepore, and episode 9 talked about declining bird populations in North America. I realized that it’s an excellent example of why disaggregating data is vital to making sure that data doesn’t go missing during the analysis process.
As reported in Science Magazine, scientists have concluded that since 1970, the continent of North America has lost 3 billion birds—nearly 30% of the surveyed total. But what does that mean for specific bird populations? Are some more affected than others?
It’s easy for me to look around San Francisco and think, well clearly that many birds can’t be missing—I’m still surrounded by pigeons and crows, and ducks and geese are everywhere when I go to the park!
If you don’t disaggregate your results, there’s no way to determine which bird species are affected and where. You might assume that 30% of all bird species were equally affected across habitats and types. Thankfully, these scientists did disaggregate their results, so we can identify variations that otherwise would be missing from the analysis.
In the case of this study, we can see that some bird populations—Old world sparrows, Larks, and Starlings—are more affected than other types of birds, while others—Raptors, Ducks and geese, and Vireos—have flourished in the past 50 years.
Because the data is disaggregated, you can uncover data that would otherwise be missing from the analysis—how the different types of birds have actually been differently affected, due to habitat loss in grasslands, or cases where restoration and rehabilitation efforts have been effective, such as the resurgence in the population of raptors.
Without an understanding of which specific bird populations are affected, and where they live, you can’t take as effective action to help bird populations recover, because you’re missing too much data due to an overgeneralized aggregate. Any decisions you took based on the overgeneralized aggregate would be biased and ultimately incorrect.
In the case of this study, we know that targeted bird population restoration is perhaps most needed in grasslands habitats, like the Midwest where I grew up.
Unfortunately, the study only covers 76% of all bird species, so my city-dwelling self will just continue to wonder how the bird population has changed since 1970 for pigeons, doves, crows, and others.
How does data go missing?
An easy way for data to go missing is for incomplete data to be returned when you’re analyzing the data. Many of these examples are Splunk-specific, but the underlying limitations are shared by most data analysis tools.
In Splunk, search results could be truncated for a variety of reasons. Truncated results have visible error messages if you’re running ad hoc searches, but you might not see error messages if the searches are producing scheduled reports.
If the indexers where the data is stored are unreachable, or your search times out before completing, the results could be incomplete. If you’re using a subsearch, you might hit the default event limit of 10K events, or the timeout limit of 60 seconds and have incomplete subsearch results. If you’re using the join search command, you might hit the 50K row limit that Nick Mealy discusses in his brilliant .conf talk, Master Joining Datasets Without Using Join.
If you’re searching an external service from Splunk, for example by using ldapsearch or a custom command to search an external database, you might not get complete results if that service is having problems or if you don’t have access to some data that you’re searching for.
It’s surprisingly easy for data to go missing when you’re correlating datasets.
Missing and “missing” fields across datasets
If you’re trying to compare datasets and some of the datasets are missing fields, you might accidentally miss data. Without the same field across multiple types of data, it can be difficult to perform valuable or accurate data correlations.
In this containerized cloud-native world, tracing outages across systems can be complex if you don’t have matching identifiers in every dataset that you’re using. As another example, it can be difficult to identify suspicious user session activity without the ability to correlate session identifiers with active users logged into a specific host.
Sometimes the fields aren’t actually missing, they’re just named differently. Because the data isn’t normalized, or the fields don’t match your naming expectations, they’re effectively missing from your analysis because you can’t find them.
Missing fields in a dataset that you want to include in your analysis
Sometimes data is missing from specific events within a dataset. For example, I wanted to determine the average rating that I gave songs in my iTunes library. However, iTunes XML files store tracks with no rating (or a zero star rating for my purposes) without a rating field at all in the events.
Calculating an average with that data missing gives me an average 3 star rating for all the tracks in my iTunes library.
But if I add the zero-star rated tracks back in, representing those as “tracks I have decided aren’t worthy of a rating” rather than “tracks that have not yet been rated”, the average changes to 2.5 stars per track.
If I’d used the results that were missing data, I’d have a biased interpretation of my average song rating in my iTunes library.
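The effect is easy to reproduce in a few lines of Python. In this sketch (the track names and ratings are illustrative, not my actual library), unrated tracks drop out of the naive average entirely but count as zero stars in the corrected one:

```python
# Each track is a dict; unrated tracks have no "rating" key at all,
# mirroring how the iTunes XML omits the field entirely.
tracks = [
    {"name": "a", "rating": 4},
    {"name": "b", "rating": 2},
    {"name": "c", "rating": 3},
    {"name": "d", "rating": 3},
    {"name": "e"},  # unrated: silently excluded from a naive average
    {"name": "f"},  # unrated
]

# Naive average: ignores tracks with no rating field
rated = [t["rating"] for t in tracks if "rating" in t]
naive_avg = sum(rated) / len(rated)

# Corrected average: treat "no rating" as a deliberate zero-star rating
all_ratings = [t.get("rating", 0) for t in tracks]
true_avg = sum(all_ratings) / len(all_ratings)
```

With these illustrative numbers, the naive average is 3.0 stars while the corrected average is 2.0: the same kind of gap as in my real library.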
Mismatched numeric units and timezones
You might also have data that goes missing if you’re improperly analyzing data because you don’t know or misinterpret the units that the data is stored in.
Keep track of whether or not a field contains a time in minutes or seconds and if your network traffic logs are in bytes or megabytes. Data can vanish from your analysis if you improperly compare dissimilar units!
If you convert a time to a more human-readable format and make incorrect assumptions about the time format, such as the time zone that it’s in, you can cause data to go missing from the proper time period.
Even without transforming your data, if you’re comparing different datasets that store time data with different time zones, data can go missing. You might think that you’re comparing a four hour window from last Monday across several datasets while debugging an issue, but actually be looking at a different four hour window in each dataset because the timestamps are stored in different time zones.
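The safe pattern can be sketched with Python’s standard library: parse timestamps with their offsets and compare instants, not wall-clock strings (the timestamps here are illustrative):

```python
from datetime import datetime

# The same instant, recorded by two datasets in different time zones
us_eastern = datetime.fromisoformat("2020-07-06T09:00:00-04:00")
utc = datetime.fromisoformat("2020-07-06T13:00:00+00:00")

# Compared as offset-aware datetimes, these are the same moment;
# compared as naive wall-clock strings, they'd look four hours apart.
same_instant = us_eastern == utc
```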
It’s also important to note that how you choose to aggregate your data can hide data that is missing from your dataset, or hide data in a way that causes it to go missing.
Consider the granularity of your aggregations
When you perform your data analysis, consider the time ranges you use to bin or aggregate your data, the time ranges you search across, which data points you include, which fields you use to disaggregate your data, and how those choices might affect your results.
For example, it’s important to keep track of how you aggregate events across time periods in your analysis. If the timespans that you use when aggregating your data don’t align with your use case and the question you’re trying to answer, data can go missing.
In this example, I was trying to convert an existing search that showed me how much music I was listening to by type per year, into a search that would show me how much music I was listening to by weekday by year. Let’s take a look:
This was my initial attempt at modifying the data, and there’s a lot missing. There are no results at all for Tuesdays in 2019, and the counts for Sundays in 2017, Mondays in 2018, and Thursdays in 2020 were laughably low. What did I do wrong?
It turned out that the time spans I was using in the original search to aggregate my data were too broad for the results I was trying to get. I was doing a timechart with a span of 1 month to start, and then trying to get data down to a granularity of weekday. That wasn’t going to work!
I’d caused data to go missing because I didn’t align the question I was trying to answer with the time span aggregation in my analysis.
Thankfully, it was a quick fix. I updated the initial time grouping to match my desired granularity of one day, and I was no longer missing data from my results!
This is a case where an overly broad timespan aggregation, combined with a lack of consideration for my use case, caused data to go missing.
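The underlying mistake can be sketched without any Splunk at all: once events are pre-aggregated into monthly bins, the weekday information is simply gone, so any weekday breakdown has to come from the unaggregated events (the dates here are an illustrative stand-in for real listening data):

```python
from collections import Counter
from datetime import date, timedelta

# One listening event per day for all of 2019 (illustrative data)
days = [date(2019, 1, 1) + timedelta(days=i) for i in range(365)]

# Aggregating by month first destroys the weekday detail:
# 12 bins remain, with no way to recover weekdays from them.
by_month = Counter(d.month for d in days)

# Weekday counts must instead come from the daily, unaggregated events.
by_weekday = Counter(d.strftime("%A") for d in days)
```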
You make an error in your analysis process
How many of you have written a Splunk search with a conditional eval function and totally messed it up?
In this case, I wrote a search to calculate the duration of music listening activities—specifically, calculating an estimated amount of time spent at a music festival, DJ set, or concert—in order to compare how much time I spent listening to music before I was sheltering in place with after.
I used a conditional “info == concert” to apply an estimated concert length of 3 hours, but no field-value pair of info=concert existed in my data. In fact, concerts had no info field at all. It wasn’t until I’d finished my search, combining 2 other datasets, that I realized a concert I’d attended in March was missing from the results.
In order to prevent this data from going missing, I had to break my search down into smaller pieces, validating the accuracy of the results at each step against my expectations and knowledge of what the data should reflect. Eventually I realized that I’d caused data to go missing in my analysis process by making the assumption that an info=concert field existed in my data.
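In Python terms, the bug was conditioning on a field that doesn’t exist in some events. The defensive sketch below (the field names mirror my example, but the duration estimates are purely illustrative) handles events with no info field explicitly instead of silently dropping them:

```python
def listening_duration_hours(event):
    """Estimate listening duration without losing events that lack an info field."""
    info = event.get("info")  # None when the field is absent, rather than an error
    if info == "festival":
        return 8  # illustrative estimate
    if info == "dj_set":
        return 2  # illustrative estimate
    # Concert events had no info field at all, so treat a missing info
    # field as a concert instead of silently matching nothing.
    return 3
```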
Ironically, this graph is still missing data because Last.fm failed to scrobble tracks from Spotify for several time periods in July and August.
What can you do about missing data?
If you have missing data in your analysis process, what can you do?
Optimize your searches and environment
If you’re processing large volumes of events, you can consider expanding limits for data, such as those for subsearches or timeouts for results. The default limits suit most environments and aren’t always something you should adjust, but occasionally it makes sense to raise them to suit your use cases if your architecture can handle it.
You can normalize your data to make it easier to use, especially if you’re correlating common fields across many datasets. Use field names that are as accurate and descriptive as possible, but also names that are consistent across datasets.
It’s a best practice to follow a common information model, but also ensure that you work closely with data administrators and data analysts (if you aren’t performing both roles) to make sure that analyst use cases and expectations align with administrator practices and implementations.
If you’re using Splunk, start with the Splunk Common Information Model (CIM) or write your own data model to help normalize fields. Keep in mind too, that you don’t have to use accelerated data models to use a data model when naming fields.
Enrich your datasets
You can also explore different ways to enrich your datasets earlier in the analysis process to help make sure data doesn’t go missing across datasets. If you perform data enrichment on the data as it streams in, using a tool like Cribl LogStream or Splunk Data Stream Processor, you can add this missing data back to the events and make it easier to accurately and usefully correlate data across datasets.
Consistently validate your results
Do what you can to consistently and constantly ensure the validity of your search results. Assess results against your expectations, but also against the rest of the data and its context. If you don’t validate your results, you might lose data that is hidden inside of an aggregate or misinterpreted due to missing context.
Check different time ranges. What seems to be a gap or a pattern in data could just be seasonality or noise. Consider separating weekends from weekdays, or business hours from other hours, depending on the type of data that you’re aggregating.
Use different types of averages, and consider whether the span of values in your results are well-represented by an average. An average of a wide range of values can inaccurately reflect the range of the values, effectively hiding the minimum and the maximum values.
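A quick sketch of why a single average can misrepresent a wide range of values (the numbers are illustrative response times with one extreme outlier):

```python
from statistics import mean, median

response_times = [10, 12, 11, 13, 950]  # one extreme outlier

summary = {
    "mean": mean(response_times),      # dragged far above every typical value
    "median": median(response_times),  # much closer to typical behavior
    "min": min(response_times),
    "max": max(response_times),        # reporting the range keeps the outlier visible
}
```

Here the mean is 199.2, a value that matches no actual observation, while the median is 12; reporting the min and max alongside any average keeps the hidden extremes from going missing.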
Compare like with like. Validate time zones, units, and the data in similarly-named fields using a data dictionary or by working with the stewards of the datasets to make sure that you’re using the data correctly.
Visualizing data is crucial to communicate the results of a data analysis process. Whether you use a chart, a table, a list of raw data, or a three-dimensional graph that you can interact with in virtual reality—your visualization choice can cause data to go missing. Any time you visualize the results of data analysis, you make intentional decisions about what to visualize and what not to visualize. How can you make sure that data that goes missing at this stage doesn’t bias data-driven decisions?
In this post, I’ll cover the following:
How the usage of your data visualization can cause data to go missing
How data goes missing in data visualizations
Why accessibility matters for data visualizations
How a lack of labels and scale can mislead and misinform
What to do about missing data in data visualizations
How people use the Georgia Department of Public Health COVID-19 daily report
In the midst of this global pandemic, I’m taking extra precautions before deciding to go hiking or climbing outside. Part of that risk calculation involves checking the relative case rate in my region — are cases going up, down, or staying the same?
If you wanted to check that case rate in Georgia in July, you might struggle to make an unbiased decision about your safety because of the format of a data visualization in that report.
As Andisheh Nouraee illustrates in a now-deleted Twitter thread, the Georgia Department of Public Health on the COVID-19 Daily Status Report provided a heat map in July that visualized the number of cases across counties in Georgia in such a way that it effectively hid a 49% increase in cases across 15 days.
You might think that these visualizations aren’t missing data at all—the values of the gradient bins are clearly labeled, and the map clearly shows how many cases exist for every 100K residents.
However, the missing data isn’t in the visualization itself, but in how it’s used. This heat map is provided to help people understand the relative case rate. If I were checking this graph every week or so, I would probably think that the case rate has stayed the same over that time period.
Instead, because the visualization uses auto-adjusting gradient bins, the red counties in the visualization from July 2nd cover a range from 2961 to 4661, while the same color counties on July 17th now have case rates of 3769–5165 cases per 100K residents. The relative size of the bins is different enough that the bins can’t be compared with each other over time.
Thankfully, the site now uses a visualization with a consistent gradient scale, rather than auto-adjusting bins.
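The pitfall is easy to demonstrate in code: with auto-adjusting bins, the same case rate lands in a different color bin on different days, while fixed bin edges keep the colors comparable over time (the numbers below are loosely based on the ranges in the Georgia maps, and the binning scheme is a simplified sketch):

```python
def bin_index(value, low, high, bins=5):
    """Assign a value to one of `bins` equal-width bins between low and high."""
    width = (high - low) / bins
    return min(int((value - low) / width), bins - 1)

# Auto-adjusting bins: edges recomputed from each day's own maximum,
# so a county with the same rate gets a different (cooler) color later.
july_2 = bin_index(4000, 0, 4661)
july_17 = bin_index(4000, 0, 5165)

# Fixed scale: the same rate always maps to the same bin over time.
fixed_2 = bin_index(4000, 0, 6000)
fixed_17 = bin_index(4000, 0, 6000)
```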
In this example, the combination of the visualization choice and how visitors of the website used that visualization caused data to go missing, possibly resulting in biased decisions about whether it’s safe to go for a hike in the community.
How does data go missing?
This example from the Georgia Department of Health describes one way that data can go missing, but there are many more.
Data can go missing from your visualization in a number of ways:
If the data exists, but is not represented in the visualization, data is missing.
If data points and fluctuations are smoothed over, or connected across gaps, data is missing.
If outliers and other values are excluded from the visualization, data is missing.
If people can’t see or interact with the visualization, data is missing.
If a limited number of results are being visualized, but the label and title of the visualization don’t make that clear, data is missing.
Accessible data visualizations prevent data from going missing
Accessible visualizations are crucial for avoiding missing data because data can go missing if people can’t see or interact with it.
At a more basic level, consider how your visualizations look at high zoom levels and how they sound when read aloud by a screen reader. If a visualization is unintelligible at high zoom levels or if portions aren’t read aloud by a screen reader, those are cases where data has gone missing from your visualization. Any decisions that someone with low or no vision wants to make based on data visualizations are biased to include only the data visualizations that they can interact with successfully.
Beyond vision considerations, you want to consider cognitive processing accessibility to prevent missing data. If you overload a visualization with lots of overlays, rely on legends to communicate meaning in your data, or have a lot of text in your visualization, folks with ADHD or dyslexia might struggle to process your visualization.
Map with caution and label prodigiously: Beirut explosion map
Data can go missing if you fail to visualize it clearly or correctly. When I found out about the explosion in Beirut, after I made sure that my friends and their family were safe, I wanted to better understand what had happened.
I haven’t had the privilege to visit Beirut before, so the maps of the explosion radius weren’t as easy for me to personally relate to. Thankfully, people started sharing maps about what the same explosion might look like if it occurred in New York City or London.
This map attempts to show the scale of the same explosion in New York City, but it’s missing a lot of data. I’m not an expert in map visualizations, but thankfully cartographer Joanna Merson tweeted a correction to this map and unpacked just how much data is missing from this visualization.
There are no labels on this map, so you don’t know the scale of the circles, or what distance each blast radius is supposed to represent. You don’t know where the epicenter of the blast is because it isn’t labeled, and perhaps most egregiously, the map projection used is incorrect.
Joanna Merson created an alternate visualization, with all the missing data added back in.
Her visualization carefully labels the epicenter of the blast, as well as the radii of each of the circles that represent different effects from the blast. She’s also careful to share the map projection that she used—one that has the same distance for every point along that circle. It turns out that the projection used by Google Maps is not the right projection to show distance with an overlaid circle. Without the scale or an accurate projection in use, data goes missing (and gets added) as unaffected areas are misleadingly shown as affected by the blast.
How many of you are guilty of making a geospatial visualization, but don’t know anything about map projections and how they might affect your visualization?
Joanna Merson further points out in her thread on Twitter that maps like this with an overlaid radius to show distance can be inaccurate because they don’t take into account the effect of topography. Data goes missing because topography isn’t represented or considered by the visualization overlaid on the map.
It’s impractical to model everything perfectly in every map visualization. Depending on how you’re using the map, this missing data might not actually matter. If you communicate what your visualization is intended to represent when you share it, you can convey the missing data and also assert its irrelevance to your point. All maps, after all, must make decisions about what data to include based on the usage of the map. Your map-based data visualizations are no different!
It can be easy to cut corners and make a simple visualization to communicate the results of data analysis quickly. It can be tedious to add a scale, a legend, and labels to your visualization. But you must consider how your visualization might be used after you make it—and how it might be misused.
Will a visualization that you create end up in a blog post like this one, or a Twitter thread unpacking your mistakes?
What can you do about missing data?
To prevent or mitigate missing data in a data visualization, you have several options. Nathan Yau of Flowing Data has a very complete guide for Visualizing Incomplete and Missing Data that I highly recommend in addition to the points that I’m sharing here.
Visualize what’s missing
One important way to mitigate missing data in a data visualization is to devise a way to show the data that is there alongside the data that isn’t. Make the gaps apparent and visualize missing data, such as by avoiding connecting the dots between missing values in a line chart.
In cases where your data has gaps, you can add annotations or labels to acknowledge and explain any inconsistencies or perceived gaps in the data. In some cases, data can appear to be missing, but is actually a gap in the data due to seasonal fluctuations or other reasons. It’s important to thoroughly understand your data to identify the difference.
If you visualize the gaps in your data, you have the opportunity to discuss what can be causing the gaps. Gaps in data can reflect reality, or flaws in your analysis process. Either way, visualizing the gaps in your data is just as valuable as visualizing the data that you do have. Don’t hide or ignore missing data.
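One concrete way to avoid connecting the dots across a gap is to reindex your series onto the full time range before plotting, inserting None where data is missing; most plotting libraries break the line at None/NaN values, making the gap visible (the dates and values below are illustrative):

```python
from datetime import date, timedelta

# Observations with a missing day (July 3rd has no data point)
observed = {date(2020, 7, 1): 10, date(2020, 7, 2): 12, date(2020, 7, 4): 11}

# Reindex onto the full date range, inserting None where data is missing,
# so a line chart shows a visible break instead of interpolating across it.
start, end = date(2020, 7, 1), date(2020, 7, 4)
full_range = [start + timedelta(days=i) for i in range((end - start).days + 1)]
series = [(d, observed.get(d)) for d in full_range]
```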
Carefully consider time spans
Be intentional about the span that you choose for time-based visualizations. You can unintentionally hide fluctuations in the data if you choose an overly-broad span for your visualization, causing data to go missing by flattening it.
If you choose an overly-short time span for your visualization, however, the meaning of the data and what you’re trying to communicate can go missing with all the noise of the individual data points. Consider what you’re trying to communicate with the data visualization, and choose a time span accordingly.
Write clear labels and titles
Another way to address missing data is to write good labels and titles for visualizations. It’s crucial to explain exactly what is present in a visualization—an important component of communicating results. If you’re intentional and precise about your labels and titles, you can prevent data from going missing.
If the data analysis contains the results for the top 10 cities by population density, but your title only says “Top Cities”, data has gone missing from your visualization!
You can test out the usefulness of your labels and titles by considering the following: If someone screenshots your visualization and puts it in a different presentation, or tweets it without the additional context that might be in the full report, how much data would be missing from the visualization? How completely does the visualization communicate the results of data analysis if it’s viewed out of context?
Validate your scale
Make sure any visualization that you create includes a scale. It’s really easy for data to go missing if the scale of the data itself is missing.
Also validate that the scale on your visualization is accurate and relevant. If you’re visualizing percentages, make sure the scale goes from 0-100. If you’re visualizing logarithmic data, make sure your scale reflects that correctly.
Consider the use
Consider how your visualization will be used, and design your visualizations accordingly. What decisions are people trying to make based on your visualization? What questions are you trying to answer when you make it?
Automatically-adjusting gradient bins in a heat map can be an excellent design choice, but as we saw in Georgia, they don’t make sense to communicate relative change over time.
Choose the right chart for the data
It’s also important to choose the right chart to visualize your data. I’m not a visualization expert, so check out this data tutorial from Chartio, How to Choose the Right Data Visualization, as well as these tutorials of different chart types on Flowing Data: Chart Types.
If you’re visualizing multiple aggregations in one visualization in the Splunk platform, consider the Trellis layout, which creates separate charts that make it easier to compare across the aggregates.
Always try various types of visualizations for your data to determine which one shows the results of your analysis in the clearest way.
Communicating the results of a data analysis process is crucial to making a data-driven decision. You might review results communicated to you in many ways:
A slide deck presented to you
An automatically-generated report emailed to you regularly
A white paper produced by an expert analysis firm that you review
A dashboard full of curated visualizations
A marketing campaign
If you’re a data communicator, what you choose to communicate (and how you do so) can cause data to go missing and possibly bias data-driven decisions made as a result.
In this post, I’ll cover the following:
A marketing campaign that misrepresents the results of a data analysis process
A renowned white paper produced by leaders in the security space
How data goes missing at the communication stage
What you can do about missing data at the communication stage
Spotify Wrapped: Your “year” in music
If you listen to music on Spotify, you might have joined the hordes of people who viewed and shared their Spotify Wrapped playlists and images like these last year.
Spotify Wrapped is a marketing campaign that purports to communicate your year in music, based on the results of some behind-the-scenes data analysis that Spotify performs. While it is a marketing campaign, it’s still a way that data analysis results are being communicated to others, and thus relevant to this discussion of missing data in communications.
In this case, you can see that my top artists, top songs, and top genre are shared with me, along with the number of artists I discovered and the number of minutes I spent listening to music over the last year. It’s impressive and exciting to see my year in music summed up in such a way!
But after I dug deeper into the data, the gaps in the communication and data analysis became apparent. What’s presented as my year in music is actually more like my “10 months in music”, because the data only represents the period from January 1st to October 31st of 2019. Two full months of music listening behavior are completely missing from my “year” in music.
This is a case where data is going missing from the communication because the data is presented as though it represents an entire time period, when in fact it only represents a subset of the relevant time period. Not only that, but it’s unclear what other data points actually represent. The number of minutes spent listening to music in those 10 months could be calculated by adding up the actual amount of time I spent listening to songs on the service, but could also be an approximate metric calculated from the number of streams of tracks in the service. It’s not possible to find out how this metric is being calculated. If it is an approximate metric based on the number of streams, that’s also a case of uncommunicated missing data (or a misleading metric), because according to the Spotify for Artists FAQ, Spotify counts a song as streamed if you’ve listened to at least 30 seconds of a track.
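To see how much the two calculation methods could diverge, here’s a toy example with made-up numbers; the 30-second stream threshold is the only detail taken from Spotify’s documentation, and everything else is hypothetical.

```python
# Hypothetical illustration of why "minutes listened" is ambiguous.
# Suppose three plays of a 4-minute (240 s) track: one full listen,
# one stopped at 90 s, one stopped at 45 s. All three exceed the
# 30-second threshold, so all three count as "streams."
actual_seconds = 240 + 90 + 45           # 375 s actually listened
streams = 3
track_length = 240
approx_seconds = streams * track_length  # 720 s if estimated from stream counts

print(actual_seconds / 60, approx_seconds / 60)  # 6.25 vs 12.0 minutes
```

With these made-up numbers, the stream-based estimate nearly doubles the actual listening time, which is exactly why knowing how a metric is calculated matters when interpreting it.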
Other data is likely missing from this incomplete communication as well. By contrast, another well-known data communication is a great example of how to do it right.
Verizon DBIR: A star of data communication
The Verizon Data Breach Investigations Report (DBIR) is an annual report put out by Verizon about, well, data breach investigations. Before the report shares any results, it includes a readable summary of the dataset used, how the analysis was performed, and what is missing from the data. Only after the limitations of the dataset and the resulting analysis are communicated does the report share the actual results of the analysis.
These examples make it clear that data can easily go missing when you’re choosing how to communicate the results of data analysis. For example:
You include only some visualizations in the report — the prettiest ones, or the ones with “enough” data.
You include different visualizations and metrics than the ones in the last report, without an explanation about why they changed, or what changed. For example, what happened in Spain in late May, 2020 as reported by the Financial Times: Flawed data casts cloud over Spain’s lockdown strategy.
You choose not to share or discuss the confidence intervals for your findings.
You neglect to include details about the dataset, such as the size of the dataset or the representativeness of the dataset.
You discuss the available data as though it represents the entire problem, rather than a subset of the problem.
For example, the Spotify Wrapped campaign shares the data that it makes available as though it represents an entire year, but instead only reflects 10 months, and doesn’t include any details about the dataset beyond what you can assume—it’s based on Spotify’s data. This missing data doesn’t make the conclusions drawn in the campaign inaccurate, but it is additional context that can affect how you interpret the findings and make decisions based on the data, such as which artist’s album you might buy yourself to celebrate the new year.
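On the confidence-interval point: even a rough interval communicates how precise a finding is. Here’s a minimal sketch using only Python’s standard library, with hypothetical survey responses:

```python
import math
import statistics

# Hypothetical survey responses: report the interval, not just the mean
responses = [3.9, 4.2, 4.8, 3.5, 4.1, 4.6, 3.8, 4.4, 4.0, 4.3]
n = len(responses)
mean = statistics.mean(responses)
sem = statistics.stdev(responses) / math.sqrt(n)

# 2.262 is the t critical value for 95% confidence with 9 degrees
# of freedom; a stats library would look this up for you.
margin = 2.262 * sem
print(f"mean {mean:.2f}, 95% CI ({mean - margin:.2f}, {mean + margin:.2f})")
```

Reporting “4.16 (95% CI 3.88–4.44)” instead of just “4.16” tells readers how much the finding could shift with different samples, which is exactly the context that goes missing when intervals are omitted.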
What can you do about missing data?
To mitigate missing data in your communications, it’s vital to be precise and consistent. Communicate exactly what is and is not covered in your report. If it makes sense, you could even share the different levels of disaggregation used throughout the data analysis process that led to the results discussed in the final communication.
Consistency in the visualizations that you choose to include, as well as the data spans covered in your report, can make it easier to identify when something has changed since the last report. Changes can be due to missing data, but can also be due to something else that might not be identified if the visualizations in the report can’t be compared with each other.
If you do change the format of the report, consider continuing to provide the old format alongside the new format. Alternately, highlight what is different from the last report, why you made changes, and clearly discuss whether comparisons can or cannot be made with older formats of the report to avoid unintentional errors.
If you know data is missing, you can discuss it in the report, explain why the missing data does or doesn’t matter, and possibly communicate a timeline or a business case for addressing it. For example, there might be a valid business reason why data is missing, or why some visualizations have changed from the previous report.
The 2020 edition of Spotify Wrapped will almost certainly include different types of data points due to the effects of the global pandemic on people’s listening habits. Adding context to why data is missing, or why the report has changed, can add confidence in the face of missing data—people now understand why data is missing or has changed.
Any decision based solely on the results of data analysis is missing data—the non-quantitative kind. But data can also go missing from data-driven decisions as a result of the analysis process. Exhaustive data analysis and universal data collection might seem like the best way to prevent missing data, but it’s not realistic, feasible, or possible. So what can you do about the possible bias introduced by missing data?
In this post, I’ll cover the following:
What to do if you must make a decision with missing data
How data goes missing at the decision stage
What to do about missing data at the decision stage
How much missing data matters
How much to care about missing data before making your decision
Missing data in decisions from the Oregon Health Authority
In the midst of a global pandemic, we’ve all been struggling to evaluate how safe it is to resume pre-pandemic activities like going to the gym, going out to bars, sending our kids back to school, or eating at restaurants. In the United States, state governors are the ones tasked with making decisions about what to reopen and what to keep closed, and most are taking a data-driven approach.
In Oregon, the decisions being made about what to reopen and when are based on incomplete data. As reported by Erin Ross for Oregon Public Broadcasting, the Oregon Health Authority is not collecting or analyzing data about whether or not restaurants and bars are contributing to COVID-19 case rates.
The contact tracers interviewing people who’ve tested positive for SARS-CoV-2 are not asking, and even if the information is shared, the data isn’t being analyzed in a way that might allow officials to identify the effect of bars and restaurants on coronavirus case rates. Although this data is missing, officials are making decisions about whether or not bars and restaurants should remain open for indoor operations.
Oregon and the Oregon Health Authority aren’t alone in this limitation. We’re in the midst of a pandemic, and everyone is doing the best they can with the limited resources and information that they have. Complete data isn’t always possible, especially when real-time data matters. So what can Oregon (and you) do to make sure that missing data doesn’t negatively affect the decision being made?
If circumstances allow, it’s best to narrow the scope of your decision. Limit your decision to those groups and situations about which you have complete or representative data. If you can’t limit your decision, as is the case with this pandemic, you can still make a decision with incomplete data.
Acknowledge that your decision is based on limited data, identify the gaps in your knowledge, and make plans to address those gaps as soon as possible. You can address the missing data by collecting more data, analyzing your existing data differently, or by reexamining the relevance of various data points to the decision that you’re making.
This is one key example of how missing data can affect a decision-making process. How else can data go missing when making decisions?
How can data go missing?
Data-driven decisions are especially vulnerable to the effects of missing data. Because data-driven decisions are based on the results of data analysis, the effects of missing data in the earlier stages of the analysis process are compounded. For example:
The reports that you reviewed before making your decision included the prettiest graphs instead of the most useful visualizations to help you make your decision.
The visualizations in the report were different from the ones in the last report, making it difficult for you to compare the new results with the results in the previous report.
The data being collected doesn’t include the necessary details for your decision. This is what is happening in Oregon, where the collected data doesn’t include all the details that are relevant when making decisions about what businesses and organizations to reopen.
The data analysis that was performed doesn’t actually answer the question that you’re asking. If you need to know “how soon can we reopen indoor dining and bars despite the pandemic”, and the data analysis being performed can only tell you “what are the current infection rates, by county, based on the results of tests administered 5 days ago”, the decisions that you’re making might be based on incomplete data.
What can you do about missing data?
Identify if the missing data matters to your decision. If it does, you must acknowledge that the data is missing when you make your decision. If you don’t have data about a group, intentionally exclude that group from your conclusions or decision-making process. Constrain your decision according to what you know, and acknowledge the limitations of the analysis.
If you want to make broader decisions, you must address the missing data throughout the rest of the process! If you aren’t able to immediately collect missing data, you can attempt to supplement the data with a dedicated survey aimed at gathering that information before your decision. You can also investigate whether the data is already available in a different format or context—for example in Oregon, where the Health Authority might already have some information about indoor restaurant and bar attendance in the content of its contact tracing interviews but just isn’t analyzing it systematically. If the data is representative, even if it isn’t comprehensive, you can still use it to supplement your decision.
To make sure you’re making an informed decision, ask questions about the data analysis process that led to the results you’re reviewing. Discuss whether or not data could be missing from the results of the analysis presented to you, and why. Ask yourself: does the missing data affect the decision that I’m making? Does it affect the results of the analysis presented to me? Evaluate how much missing data matters to your decision-making process.
You don’t always need more data
You will always be missing some data. It’s important to identify when the data that is missing is actually relevant to your analysis process, and when it won’t change the outcome. Acknowledge when additional data won’t change your conclusions.
You don’t need all possible data in existence to support a decision. As Douglas Hubbard points out in his book How to Measure Anything, the goal of a data analysis process is to reduce your uncertainty about the right approach to take.
If additional data, or more detailed analysis, won’t further reduce any uncertainty, then it’s likely unnecessary. The more clearly you constrain your decisions, and the questions you use to guide your data analysis, the more easily you can balance reducing data gaps and making a decision with the data and analysis results you have.
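Hubbard’s point can be made concrete with a back-of-the-envelope calculation: for a sampled proportion, the margin of error shrinks with the square root of the sample size, so each additional batch of data reduces uncertainty less than the last. This sketch assumes a worst-case 50% proportion and a 95% confidence level:

```python
import math

# Margin of error for a sampled proportion: z * sqrt(p * (1 - p) / n).
# p=0.5 is the worst case; z=1.96 corresponds to 95% confidence.
def margin_of_error(n, p=0.5, z=1.96):
    return z * math.sqrt(p * (1 - p) / n)

for n in (100, 1_000, 10_000, 100_000):
    print(n, round(margin_of_error(n), 4))
```

Going from 100 to 1,000 samples cuts the margin from about ±9.8% to about ±3.1%, but the next hundredfold increase only buys another decimal place. If a ±3% margin already answers your question, collecting more data won’t further reduce the uncertainty that matters to the decision.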
USCIS doesn’t allow missing data, even if it doesn’t affect the decision
Sometimes, missing data doesn’t affect the decision that you’re making. This is why you must understand the decision you’re making and how important comprehensive data is to it. When missing data is irrelevant to your decision, make sure your policies acknowledge that reality.
In the case of the U.S. Citizenship and Immigration Services (USCIS), their policies don’t seem to recognize that some kinds of missing data for citizenship applications are irrelevant.
“Last fall, U.S. Citizenship and Immigration Services introduced perhaps its most arbitrary, absurd modification yet to the immigration system: It began rejecting applications unless every single field was filled in, even those that obviously did not pertain to the applicant.
“Middle name” field left blank because the applicant does not have a middle name? Sorry, your application gets rejected. No apartment number because you live in a house? You’re rejected, too.
No address given for your parents because they’re dead? No siblings named because you’re an only child? No work history dates because you’re an 8-year-old kid?
All real cases, all rejected.”
In this example, missing data is deemed a problem for making a decision about a person’s citizenship application—even when the data is missing because it doesn’t exist. When asked for comment,
“a USCIS spokesperson emailed, “Complete applications are necessary for our adjudicators to preserve the integrity of our immigration system and ensure they are able to confirm identities, as well as an applicant’s immigration and criminal history, to determine the applicant’s eligibility.””
Missing data alone is not enough to affect your decision—only missing data that affects the results of your analysis is. A lack of data is not itself a problem—the problem is when data that is relevant to your decision is missing. That’s how bias gets introduced into a data-driven decision.