Wrapping up 2020: Spotify, SoundCloud, and Last.fm data

Another year, another Spotify Wrapped campaign, another effort to analyze the music data that I collect and compare it with what Spotify produces. This year I have Last.fm listening habit data, concert attendance and ticket purchase data, livestream view activity data, my SoundCloud 2020 Playback playlist, and the tracks on my Spotify top 100 songs of 2020 playlist.

Screenshot of Spotify Wrapped header image, top artists of disclosure, lane 8, kidnap, tourist, and amtrac, top songs of apricots, atlas, idontknow, cappadocia, know your worth, minutes listened of 59,038 and top genre of house.

It’s always important to point out that the Spotify Wrapped campaign only covers the time period from January 1st, 2020 to October 31st, 2020. I discuss the effects of this misleading time period in Communicate the data: How missing data biases data-driven decisions. Of course, because I’m writing this post on December 2nd, nearly the entire month of December is missing from my own analyses as well. I’ll follow up (on Twitter) about any data insights that change over the next few weeks.

Top Artists of the Year

screenshot of spotify wrapped top artists, content duplicated in surrounding text.

Spotify says my Top 5 artists of the year are: 

  1. Disclosure
  2. Lane 8 
  3. Kidnap
  4. Tourist 
  5. Amtrac

My own data shows a slightly different ranking.

Screenshot of Splunk table showing top 10 artists in order: tourist with 156 listens, amtrac with 155 listens, booka shade with 147 listens, jacques greene with 134 listens, lane 8 with 129 listens, bicep with 128 listens, kidnap with 114 listens, ben böhmer with 111 listens, cold war kids with 110 listens, and sjowgren with 99 listens

My top 5 artists are nearly the same, but much more influenced by music that I’ve purchased. The overall list instead looks like:

  1. Tourist
  2. Amtrac
  3. Booka Shade
  4. Jacques Greene
  5. Lane 8

For the second year in a row, Tourist is my top artist! Kidnap still makes it into the top 10, as my 7th most-listened-to artist so far in 2020.

Disclosure, somewhat hilariously, doesn’t even break the top 10 artists if I rely on Last.fm data instead of only Spotify. What’s going on there? It turns out Disclosure is my 11th-most-listened-to artist, with 97 total listens so far this year. If I dig a little deeper into Know Your Worth, the Disclosure song that Spotify says I’ve listened to the most in 2020, I can see exactly why this is happening.

Screenshot showing the track_name Know Your Worth listed 5 times, with different artist permutations each time, Khalid, Disclosure & Khalid, Disclosure & Blick Bassy, Khalid & Disclosure, and Khalid, with total listens of 20 for all permutations.

Disclosure’s latest album, ENERGY, includes a number of collaborations. Disclosure is the main artist for most of these tracks, but in some cases (like with Know Your Worth, which came out as a single February 4, 2020) the artist can be inconsistently stored by different services.

As a result, the Last.fm data has a number of different entries for the same track, with differently-listed artists for each one. Last.fm stores only one artist per track, whereas Spotify stores an array of artists for each track. This data structure decision means that Disclosure should have had about 127 total listens, and been my 7th-most-listened-to artist of 2020, instead of 11th. 

This truncated screenshot shows some examples of the permutations of data that exist in my Last.fm data collection, with a total listen count of 127 for Disclosure during 2020. 

Screenshot showing additional permutations of Disclosure artist data, such as Disclosure & slowthai, Disclosure & Common, and Disclosure & Channel Tres.
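To collapse these permutations in my own analysis, I can split the Last.fm artist string on the separators that show up in my data and credit each listed artist. Here’s a minimal Python sketch of that approach (the scrobble records below are sample stand-ins for my real Last.fm data, and the naive split would mangle any artist whose actual name contains an ampersand or comma):

    from collections import Counter

    # Sample scrobbles mirroring the permutations above; the real records
    # come from my Last.fm data collection.
    scrobbles = [
        {"artist": "Khalid", "track": "Know Your Worth"},
        {"artist": "Disclosure & Khalid", "track": "Know Your Worth"},
        {"artist": "Disclosure & Blick Bassy", "track": "Know Your Worth"},
        {"artist": "Khalid & Disclosure", "track": "Know Your Worth"},
    ]

    def credited_artists(artist_field):
        """Split a single Last.fm artist string into individual artist credits."""
        parts = artist_field.replace(" & ", ",").split(",")
        return [name.strip() for name in parts if name.strip()]

    listens_per_artist = Counter()
    for scrobble in scrobbles:
        for artist in credited_artists(scrobble["artist"]):
            listens_per_artist[artist] += 1

    print(listens_per_artist.most_common())
    # With the sample records above, Disclosure gets credit for 3 of the 4 listens.

With this kind of normalization applied across the whole year, Disclosure’s count lands closer to the 127 total listens mentioned above.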

I had a sneaking suspicion that my Booka Shade listens were primarily concentrated on a few songs from an EP that they put out this year, so I dug into how many tracks each artist’s total listens for the year were spread across.

Table showing top 10 artists and total listens, with total tracks for each artist as well. Tourist has 62 tracks for 159 listens, Amtrac has 59 tracks for 155 listens, Booka Shade has 64 tracks for 147 listens, Jacques Greene has 33 tracks for 134 listens, Lane 8 has 60 tracks for 129 listens, Bicep has 46 tracks for 128 listens, Kidnap has 35 tracks for 114 listens, Ben Böhmer has 51 tracks for 111 listens, Cold War Kids has 53 tracks for 110 listens, and sjowgren has 15 tracks for 99 listens.

Instead, it turns out that my listens to Booka Shade are actually the most distributed across tracks of all of my top 10 artists. Sjowgren is also an outlier here, because they’ve never released an album, so they only have 15 songs in their overall discography yet still made my top 10 artists by listens.

Returning to my comparison between Spotify and Last.fm data, Amtrac and Lane 8 are in both top 5 lists. This is somewhat expected, because if I look at the top 10 list for artists that I’ve most consistently listened to (artists that I’ve listened to at least once in each month of 2020), both Amtrac and Lane 8 place high in that list.

Screenshot of a table showing top 10 consistently listened to artists, with Lane 8 being listened to at least once in all 12 months of 2020, Amtrac 11 months, Caribou 11 months, Disclosure 11 months, Elderbrook 11 months, Kidnap 11 months, Kölsch 11 months, Tourist 11 months, Ben Böhmer 10 months, and CamelPhat for 10 months.

Given that only 2 days of December have happened as I write this, it’s unsurprising that I’ve only listened to one artist in every month of 2020. 
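For reference, the consistency number behind that table is just a count of distinct months per artist. My own version of this aggregation lives in Splunk searches over my Last.fm data, but here’s a minimal Python sketch of the same idea, assuming each scrobble carries a timestamp and a single artist name (the sample records are hypothetical):

    from collections import defaultdict
    from datetime import datetime

    # Hypothetical (timestamp, artist) scrobbles standing in for my Last.fm history.
    scrobbles = [
        ("2020-01-03T08:15:00", "Lane 8"),
        ("2020-02-14T21:40:00", "Lane 8"),
        ("2020-02-20T10:05:00", "Amtrac"),
        ("2020-03-01T18:30:00", "Lane 8"),
    ]

    # Collect the distinct (year, month) pairs in which each artist was played.
    months_per_artist = defaultdict(set)
    for timestamp, artist in scrobbles:
        played_at = datetime.fromisoformat(timestamp)
        months_per_artist[artist].add((played_at.year, played_at.month))

    # Rank artists by how many distinct months they appear in.
    consistency = sorted(
        months_per_artist.items(), key=lambda item: len(item[1]), reverse=True
    )
    for artist, months in consistency:
        print(artist, len(months))  # e.g. "Lane 8 3" then "Amtrac 1"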

Top Songs of 2020

Enough about the artists—what about the songs? 

Screenshot of top 5 songs from spotify wrapped, duplicated in surrounding text.

According to Spotify, my Top 5 songs of the year are:

  1. Apricots by Bicep
  2. Atlas by Bicep
  3. Idontknow by Jamie xx
  4. Cappadocia by Ben Böhmer feat. Romain Garcia
  5. Know Your Worth by Disclosure feat. Khalid

That pretty closely matches my top 5 list according to Last.fm, with some notable exceptions.

Screenshot of Splunk table with top 10 songs of last.fm data, Apricots by Bicep with 38 listens, Atlas by Bicep with 32 listens, Idontknow by Jamie xx with 22 listens, White Ferrari (Greene Edit) by Jacques Greene with 21 listens, That Home Extended by The Cinematic Orchestra with 20 listens, Lalala by Y2K and bbno$ with 19 listens, Trish's Song by Hey Rosetta! with 18 listens, Wonderful by Burna Boy with 18 listens, Somewhere feat. Octavian by the Blaze with 17 listens, and Yes, I Know by Daphni with 17 listens.

My top 5 tracks according to Last.fm are:

  1. Apricots by Bicep (38 listens)
  2. Atlas by Bicep (32 listens)
  3. Idontknow by Jamie xx (22 listens)
  4. White Ferrari (Greene Edit) by Jacques Greene (21 listens)
  5. That Home Extended by The Cinematic Orchestra (20 listens)

The first 3 tracks match, though of course Spotify has an incomplete representation of those listens—I have 29 streams of Apricots according to Spotify.

However, since I bought the track almost as soon as it came out, I also have another 9 listens that have happened off of Spotify. There were also some mysterious things happening with the Spotify and Last.fm connection around that time, so it’s possible some listens are missing beyond these numbers.

What’s up with the 4th track on the list, though? Where is that in Spotify’s data? It’s actually a bootleg remix of the Frank Ocean song White Ferrari that Jacques Greene shared on SoundCloud and as a free download earlier this year, so it isn’t anywhere on Spotify. It did, however, make it onto my top tracks of 2020 on SoundCloud:

Screenshot of top 13 tracks in SoundCloud, with Jacques Greene - White Ferrari (JG Edit) listed as the 11th track.

And this is another spot where metadata intrudes and leads to some inconsistent counts. If I look at all the permutations of White Ferrari and Jacques Greene in my data for 2020, the total number of listens should actually be a bit higher, at 23 total listens:

Screenshot of Splunk table showing the two permutations of the Jacques Greene remix, with 21 listens for the Greene Edit version and 2 listens for the JG Edit version, for a total of 23.

This would actually make it my 3rd-most popular song of 2020 so far, and I’m listening to it as I write this paragraph, so let’s go ahead and call that total number 24 listens. 

The 5th-most popular and 7th-most popular songs of 2020 make the case that I haven’t been sleeping very well this year (though I recall these tracks showed up in 2019 too…), because those 2 tracks make up my “Insomnia” playlist, which I use to help me fall asleep on nights when I’ve been, perhaps, staying up too late doing data analysis like this.

You can see how consistent listening habits influence the top artist results when you look at the top 10 songs that I’ve consistently listened to throughout 2020: 2 songs by Kidnap, one by Bicep, and another by Amtrac.

Table of tracks listened to consistently in 2020, Never Come Back by Caribou listened to at least once in 8 months of 2020, Start Again by Kidnap with 8 months, Accountable by Amtrac with 7 months, Atlas by Bicep with 7 months, Calling out by Sophie Lloyd with 7 months, Made to Stray by Mount Kimbie for 7 months, Moments (Ben Böhmer Remix) by Kidnap with 7 months, Somewhere feat. Octavian by the Blaze with 7 months, The Promise by David Spinelli with 7 months, and Without You My Life Would Be Boring by The Knife with 7 months.

To me, though, this table mostly underscores how much music discovery this year involved. I didn’t return to the same songs month after month during 2020. Likely as a result of all the DJ sets I’ve been streaming (as I mentioned in my post about Listening to Music while Sheltering in Place), this has been quite a year for music discovery and breadth of listening habits.

My top 10 songs of 2020 had a total of 222 listens across them. However, I have a total of 14,336 listens for the entire year, spread across 8,118 unique songs.

Screenshot, content duplicated in surrounding text.

Even with possible metadata issues, that’s still quite the distribution of behavior. Let’s dig a bit deeper into artist discovery this year. 

Artist Discovery in 2020

In my post earlier this year about my listening behavior while sheltering in place, I discovered that my artist discovery numbers in 2020 seemed to be way up compared with 2018 and 2019, but weren’t actually that far off from 2017 numbers. 

What I see when comparing my 2020 artist discovery statistics from my Last.fm data and my Spotify data is even more interesting. In contrast to what seemed to be true in last year’s post, Wrapping up the year and the decade in music: Spotify vs my data (for what it’s worth, last year’s number should have been 1,074 artists discovered instead of 2,857; data analysis is difficult), Spotify’s number is much higher than the one I calculated this year.

Screenshot, content duplicated in surrounding text.

According to Spotify, I discovered 2,051 new artists, whereas my Last.fm data claims that I only discovered 1,497 artists this year. 

Screenshot, content duplicated in surrounding text.

Similarly, Spotify claims that I listened to 4,179 artists this year, whereas my Last.fm data indicates that I listened to 3,715 artists. 

Screenshot, content duplicated in surrounding text.

Again, this comes down to data structures and how the artist metadata is stored by each service. I wrote about the importance of quality metadata for digital streaming providers earlier this year in Why the quality of audio analysis metadatasets matters for music, but it’s also apparent that the data structures for those metadatasets are just as important to the value of the data insights you can craft from them.

Because Spotify stores all of the artists that contributed to a track as an array, I can listen to a track with 4 contributing artists, only 1 of which I’ve listened to before, and according to Spotify I’ve now discovered 3 artists and listened to 4. According to Last.fm, meanwhile, I’ve either listened to 1 artist that I’ve already heard before, or to a new artist, possibly called “Luciano & David Morales”.

Screenshot of two artist names, Luciano, and Luciano & David Morales.
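To make the difference in counting concrete, here’s a small Python sketch of how the same play gets counted under each storage model (the play records and the "previously heard" set are hypothetical; only the artist names come from the screenshot above):

    # The same hypothetical play, represented the way each service stores artists.
    spotify_play = {"artists": ["Luciano", "David Morales"]}
    lastfm_play = {"artist": "Luciano & David Morales"}

    previously_heard = {"Luciano"}

    # Spotify-style: every name in the array is its own artist.
    spotify_new = [a for a in spotify_play["artists"] if a not in previously_heard]

    # Last.fm-style: the whole string is one artist, so the combined credit
    # looks brand new even though I've heard Luciano plenty of times before.
    lastfm_new = [] if lastfm_play["artist"] in previously_heard else [lastfm_play["artist"]]

    print(spotify_new)  # ['David Morales'] -> 1 newly discovered artist
    print(lastfm_new)   # ['Luciano & David Morales'] -> counted as a "new" artist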

Spotify would store that second artist as an array of Luciano and David Morales, allowing a more accurate count of listens for Luciano. Similarly, my artist discovery data includes some flawed entries, such as YouTube videos that got incorrectly recorded.

Screenshot of 3 artist names in my data, Billie Joe Armstrong of Green Day, Billy Joel and Jimmy Fallon Form 2, and Biosphere.
The Billy Joel and Jimmy Fallon duet of The Lion Sleeps Tonight never gets old, but it appears the original video is no longer on YouTube so I’m not going to link it.
Screenshot of two artist names in my data, &lez and 'Coming of age ceremony' Dance cover by Jimin and Jung Kook.

The flawed metadata also shows up in my top 20 artist discoveries of 2020, where BTS and Big Hit Labels are listed separately, although they are both indicative of one of my best friends joining BTS ARMY this year and sharing her enthusiasm with me.

Giant table of top 20 artists discovered in 2020, ordered by total listens, with the first_discovered date last:
Re.You, 85 listens, July 12, 2020
Elliot Adamson, 75 listens, April 15, 2020
Fennec, 53 listens, March 24, 2020
Southern Shores, 52 listens, November 19, 2020
Eelke Kleijn, 45 listens, August 10, 2020
Christian Löffler, 43 listens, April 2, 2020
Icarus, 35 listens, April 2, 2020
Monkey Safari, 35 listens, April 15, 2020
Black Motion, 34 listens, April 30, 2020
BTS, 31 listens, September 29, 2020
Bronson, 31 listens, May 9, 2020
Love Regenerator, 30 listens, March 30, 2020
Eltonnick, 29 listens, April 27, 2020
Jerro, 27 listens, April 29, 2020
Theo Kottis, 27 listens, June 16, 2020
Dennis Cruz, 26 listens, June 22, 2020
Da Capo, 25 listens, May 10, 2020
Big Hit Labels, 21 listens, June 30, 2020
HYENAH, 20 listens, June 4, 2020
KC Lights, 20 listens, September 22, 2020

Ultimately I’m grateful that the top 20 artists of 2020 are all artists that I discovered during the pandemic and have excellent songs that I love and continue to listen to. Many of the sparklines that represent my listening activity for these artists throughout the year have spikes, but mostly my listening patterns indicate that I’ve been returning to these artists and their songs multiple times after first discovery. Some notable favorites on this list are KC Lights’ track Girl and Dennis Cruz’s track El Sueño, plus the entire Fennec album Free Us Of This Feeling.

Genre Discovery in 2020

The most-commented-on data insight from #wrapped2020 is probably the genre discovery slide.

According to Spotify, I listened to 801 genres this year, including 294 new ones. I’m not even sure I could name 30 genres, let alone 300 or 800. Where are these numbers coming from? 

It turns out that, much like storing artist data as an array for each song, Spotify stores genre data as an array for each artist. This means that each artist can be assigned multiple genres, which inflates the number of genres that you’ve listened to in 2020.

For example, if I use Spotify’s API developer console to retrieve the artist information for Tourist, with a Spotify ID of 2ABBMkcUeM9hdpimo86mo6, it turns out that he has 6 total genres associated with him in Spotify’s database: chillwave, electronica, indie soul, shimmer pop, tropical house, and vapor soul. 

Screenshot of JSON response from Spotify API call, content duplicated in surrounding text.
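If you want to reproduce that lookup yourself, the Get Artist endpoint of the Spotify Web API returns the genres array directly. A minimal Python sketch, assuming you’ve already grabbed an OAuth access token (for example, from the developer console):

    import requests

    ACCESS_TOKEN = "..."  # paste a token from the Spotify developer console
    ARTIST_ID = "2ABBMkcUeM9hdpimo86mo6"  # Tourist

    response = requests.get(
        f"https://api.spotify.com/v1/artists/{ARTIST_ID}",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        timeout=10,
    )
    response.raise_for_status()
    artist = response.json()

    # Prints the same genre array shown in the response above.
    print(artist["name"], artist["genres"])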

I could start discussing the possible meaninglessness of genres as a descriptive tool, the lack of validation possible for such a signifier, and the lack of clarity about how these genres were defined and assigned to specific artists, but that’s best left for another blog post.

Instead, let’s look at what little genre data I do have available to me more generally. 

Screenshot, content duplicated in surrounding text.

According to Spotify, my top genres were:

  1. House
  2. Electronica
  3. Pop
  4. Afro House
  5. Organic House

All of these make sense to me, except for Organic House, because I don’t know what makes house music organic, unless it’s also grass-fed, locally-sourced, and free range. Perhaps Blond:ish is organic house. 

I don’t have any genre data from Last.fm, since the service only stores user-defined tags for each artist, and those are not included in the data that I collect from Last.fm today. Instead, I have the genres assigned by iTunes for the tracks that I’ve purchased from the iTunes store. 

The top 8 genres of music that I added to my iTunes library in 2020 by purchasing tracks from the iTunes store are:

  1. Dance (124 songs)
  2. Electronic (121 songs)
  3. House (78 songs)
  4. Pop (37 songs)
  5. Alternative (27 songs)
  6. Electronica (12 songs)
  7. Deep House (10 songs)
  8. Melodic House & Techno (9 songs)

Screenshot, content duplicated in surrounding text.

Clearly, this is a very selective sample, tied only to my purchasing habits, which are roughly correlated with my listening habits.

I shared all of this genre data to essentially look at it and go “wow, that wasn’t very insightful at all”. Let’s move on. 

Time Spent Listening to Music in 2020

The last metric I want to unpack from Spotify’s #wrapped2020 campaign is the minutes listened data insight. According to Spotify, I spent 59,038 minutes listening to music this year. 

Screenshot, relevant content duplicated in surrounding text.

According to my own calculations, I spent roughly 81,134 minutes listening to music in 2020.

Let’s talk about how both of these metrics are super flawed!

Spotify counts a song as streamed after you listen to it for more than 30 seconds (per their Spotify for Artists FAQ), so it’s logical to assume that this minutes listened metric likely comes from a calculation of “number of streams for a track” x “length of track”, rounded and converted to minutes. It could even result from a different type of calculation, “number of total streams” x “average length of track in Spotify library”, but I have no way of knowing whether either of these is accurate besides tweeting at Spotify and hoping they’ll pay attention to me.

Unfortunately for all of us, but mostly me, my own minutes listened metric is just as lazily calculated. I don’t have track length data for all the tracks that I listen to, and I don’t know at what point Last.fm counts a track as worthy of a scrobble. I do have a list of how much time I spent listening to livestreamed DJ sets online, and I do have some excellent estimation skills. I arrived at my number of 81,134 minutes so far in 2020 by assuming the following:

  • An average track length of 4 minutes
  • An average concert length of 3 hours
  • An average DJ set length of 4 hours
  • An average festival length of 8 hours

Using those averages and estimates, I calculated the total amount of time I spent listening to music across my Last.fm listening habits, concerts and DJ sets attended (no festivals this year), and livestreams that I watched online, arriving at 81,134 minutes. That doesn’t count any DJ sets that I listened to on SoundCloud, and certainly the combination of a 4-minute track length estimate with the uncertainty of what qualifies a track as being scrobbled makes this data insight somewhat meaningless.
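For the curious, the back-of-the-envelope math looks roughly like this in Python. The per-event averages are the assumptions listed above; the listen count comes from my Last.fm data, while the other inputs below are placeholders for illustration, so the output won’t reproduce the 81,134 figure exactly:

    # Estimation constants, in minutes, from the assumptions above.
    MINUTES_PER_TRACK = 4
    MINUTES_PER_CONCERT = 3 * 60
    MINUTES_PER_DJ_SET = 4 * 60
    MINUTES_PER_FESTIVAL = 8 * 60

    def estimate_minutes_listened(tracks, concerts, dj_sets, festivals, livestream_minutes):
        """Estimate total listening time from event counts plus tracked livestream minutes."""
        return (
            tracks * MINUTES_PER_TRACK
            + concerts * MINUTES_PER_CONCERT
            + dj_sets * MINUTES_PER_DJ_SET
            + festivals * MINUTES_PER_FESTIVAL
            + livestream_minutes
        )

    # Placeholder inputs, purely for illustration.
    print(estimate_minutes_listened(
        tracks=14_336,           # my Last.fm listen count for the year so far
        concerts=2,
        dj_sets=3,
        festivals=0,             # no festivals this year
        livestream_minutes=20_000,
    ))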

Regardless, let’s compare this estimated time spent listening in minutes against the total number of minutes in a year.

Total minutes listened (81,134) as a gauge compared with total minutes in a year (525,600)

Beautiful. I still remembered to sleep this year. No matter which dataset I use, however, it’s clear that I’ve listened to more music in 2020 than in 2019. Spotify’s metric for this same time period in 2019 was 35,496 minutes. The less-flawed but less-complete metric I used last year, calculated using the track length stored in iTunes multiplied by the number of listens for that track, indicated that I spent 14,296 minutes listening to music in 2019. 

As one final Spotify examination, let’s dig into the Spotify Top 100 playlist.

Top 100 Songs of 2020 Playlist

Alongside the fancy graphics and data insights in the #wrapped2020 campaign, Spotify also creates a 100-song playlist, likely (but not definitively) the top 100 songs of the time period between January 1st, 2020 and October 31st, 2020.

I found my playlist this year to be relatively accurate, perhaps because I spent more time listening to Spotify than I might have in previous years, or perhaps they made some internal data improvements, or both! When I’m traveling a lot, I often spend more time listening to SoundCloud (offline DJ sets on plane flights) or to Apple Music on my iPhone, with songs that I’ve added from my iTunes library. Without much time spent commuting or traveling this year, it’s likely that my listening habits remained fairly consolidated.

Screenshot, content duplicated in surrounding text.

Similar to what I discovered about my top 10 tracks, I had relatively distributed music interests this year. The 811 total listens for all 100 songs in my Spotify playlist represent only about 6% of my 14,336 total listens in 2020 so far.

Screenshot, content duplicated in surrounding text.

Despite my overall listening habits being relatively distributed across lots of artists and songs, the Top Songs playlist is somewhat more consolidated, with 69 artists performing the 100 songs on the playlist. Nice. 

Screenshot, content duplicated in surrounding text.

It’s clear that I spent most of this year exploring and discovering new artists, given that 83 of my top songs of 2020 according to Spotify were songs that I discovered in 2020. 

Thanks for coming on this journey through my music data with me. I’ll be back at the actual end of the year to dive deeper into my top 10 artists of the year, my top 10 consistent artists of the year, my music purchasing activity, and some more livestream and concert statistics to round out my 2020 year in music.

Define the question: How missing data biases data-driven decisions

This is the eighth and final post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

In this post, I’ll cover the following: 

  • Define the question you want to answer for your data analysis process
  • How does data go missing when you’re defining your question?
  • What can you do about missing data when defining your question?

This post also concludes this blog post series about missing data, featuring specific actions you can take to reduce bias resulting from missing data in your end-to-end data-driven decision-making process.

Define the question 

When you start a data analysis process, you always want to start by deciding what questions you want to answer. Before you make a decision, you need to decide what you want to know.

If you start with the data instead of with a question, you’re sure to be missing data that could help you make a decision, because you’re starting with what you have instead of what you want to know.

Start by carefully defining what you want to know, and then determine what data you need to answer that question. What aggregations and analyses might you perform, and what tools do you need access to in order to perform your analysis? 

If you’re not sure how to answer the question, or what questions to ask to make the decisions that you want to make, you can explore best practices guidance and talk to experts in your field. For example, I gave a presentation about how to define questions when trying to prioritize documentation using data (watch on YouTube). If you are trying to monitor and make decisions about software that you’re hosting and managing, you can dig into the RED method for infrastructure monitoring or the USE method.

It’s also crucial to consider whether you can answer that question adequately, safely, and ethically with the data you have access to.

How does data go missing? 

Data can go missing at this stage if it isn’t there at all—if the data you need to answer a question does not exist. There’s also the possibility that the data you want to use to answer a question is incomplete, or you have some, but not all of the data that you need to answer the question.

It’s also possible that the data exists, but you can’t have it: either you don’t have access to it, or you aren’t permitted to use the data that has already been collected to answer your particular question.

It’s also possible that the data that you do have is not accurate, in which case the data might exist to help answer your question, but it’s unusable, so it’s effectively missing. Perhaps the data is outdated, or the way it was collected means you can’t trust it. 

Who funded the data collection, who performed it, and when and why it was performed can all tell you a lot about whether or not you can use a dataset to answer your particular set of questions.

For example, if you are trying to answer the question “What is the effect of slavery on the United States?”, you could review economic reports and the records from plantations about how humans were bought and sold, and stop there. But you might be better off considering who created those datasets, who is missing from them, whether or not they are useful for answering your question, and which datasets might be missing entirely because they were never created or because the records that did exist were destroyed. You might also want to consider whether or not it’s ethical to use data to answer specific questions about the lived experiences of people.

Or, for another grim example, if you want to understand how American attitudes towards Muslims changed after 9/11, you could (if you’re Paul Krugman) look at hate crime data and stop there. Or, as Jameel Jaffer points out in a Twitter thread, you could consider whether or not hate crime data is enough to represent the experience of Muslims after 9/11, considering that “most of the ‘anti-Muslim sentiment and violence’ was *officially sanctioned*” and that, therefore, all of that is missing from an analysis that focuses solely on hate crime data. Jaffer continues by pointing out that,

“For example, hundreds of Muslim men were rounded up in New York and New Jersey in the weeks after 9/11. They were imprisoned without charge and often subject to abuse in custody because of their religion. None of this would register in any hate crimes database.” 

Data can also go missing if the dataset that you choose to use to answer your question is incomplete.

Incomplete dataset by relying only on digitized archival films

As Rick Prelinger laments in a tweet—if part of a dataset is digitized, often that portion of the dataset is used for data analysis (or research, as the case may be), with the non-digitized portion ignored entirely. 

Screenshot of a tweet by Rick Prelinger @footage "20 years ago we began putting archival film online. Today I can't convince my students that most #archival footage is still NOT online. Unintended consequence of our work: the same images are repeatedly downloaded and used, and many important images remain unused and unseen." sent 12:15 PM Pacific time May 27, 2020

For example, if I wanted to answer the question “What are common themes in American television advertising in the 1950s?”, I might turn to the Prelinger Archives, because they make so much digitized archival film footage available. But just because it’s easily accessible doesn’t make it complete. Just because it’s there doesn’t make it the best dataset to answer your question.

It’s possible that the Prelinger Archives don’t have enough film footage for me to answer such a broad question. In this case, I can supplement the dataset available to me with information that is harder to find, such as by tracking down those non-digitized films. I can also choose to refine my question to focus on a specific type of film, year, or advertising agency that is more comprehensively featured in the archive, narrowing the scope of my analysis to focus on the data that I have available. I could even choose a different dataset entirely, if I find one that more comprehensively and accurately answers my question.

Possibly the most common way that data can go missing when trying to answer a question is that the data you have, or even all of the data available to you, doesn’t accurately proxy what you want to know. 

Inaccurate proxy to answer a question leads to missing data

If you identify data points that inaccurately proxy the question that you’re trying to answer, you can end up with missing data. For example, if you want to answer the question, “How did residents of New York City behave before, during, and after Hurricane Sandy?”, you might look at geotagged social media posts. 

Kate Crawford discusses a study by Nir Grinberg, Mor Naaman, Blake Shaw, and Gilad Lotan, Extracting Diurnal Patterns of Real World Activity from Social Media, in the context of this question in her excellent 2013 article for Harvard Business Review, The Hidden Biases in Big Data.

As she puts it,

“consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. A fascinating study combining Sandy-related Twitter and Foursquare data produced some expected findings (grocery shopping peaks the night before the storm) and some surprising ones (nightlife picked up the day after — presumably when cabin fever strikes). But these data don’t represent the whole picture.” 

That’s because the users of social media, especially those that use Twitter and Foursquare and share location data with those tools, represent only a specific slice of the population affected by Hurricane Sandy, and that slice is not a representative or comprehensive sample of New York City residents. Indeed, as Crawford makes very clear, “there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate.”

The dataset of geotagged social media posts only represents some residents of New York City, and not in a representative way, so it’s an inaccurate proxy for the experience of all New York City residents. This means data is missing from the question stage of the data analysis step. You want to answer a question about the experience of all New York City residents, but you only have data about the experience of New York City residents that shared geotagged posts on social media during a specific period of time. 

The risk is clear—if you don’t identify the gaps in this dataset, you might draw false conclusions. Crawford is careful to point this out clearly, identifying that “The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster.”

When you identify the gaps in the dataset, you can understand what limitations exist in the dataset, and thus how you might draw false and biased conclusions. You can also identify new datasets to examine or groups to interview to gather additional data to identify the root cause of the missing data (as discussed in my post on data gaps in data collection). 

The gaps in who is using Twitter, and who is choosing to use Twitter during a natural disaster, are one way that Twitter data can inaccurately proxy a population that you want to research and thus cause data to go missing. Another way that it can cause data to go missing is by inaccurately representing human behavior in general because interactions with the platform itself are not neutral. 

As Angela Xiao Wu points out in her blog post, How Not to Know Ourselves, based on a research paper she wrote with Harsh Taneja:

“platform log data are not “unobtrusive” recordings of human behavior out in the wild. Rather, their measurement conditions determine that they are accounts of putative user activity — “putative” in a sense that platforms are often incentivized to keep bots and other fake accounts around, because, from their standpoint, it’s always a numbers game with investors, marketers, and the actual, oft-insecure users.” 

Put another way, you can’t interpret social media interactions as neutral reflections of user behavior due to the mechanisms a social media platform uses to encourage user activity. The authors also point out that it’s difficult to identify if social media interactions reflect the behavior of real people at all, given the number of bot and fake accounts that proliferate on such sites. 

Using a dataset that inaccurately proxies the question that you’re trying to answer is just one way for data to go missing at this stage. What can you do to prevent data from going missing as you’re devising the questions you want to ask of the data? 

What can you do about missing data?

Most importantly, redefine your questions so that you can use data to answer them! If you refine the questions that you’re trying to ask into something that can be quantified, it’s easier to ask the question and get a valid, unbiased, data-driven result. 

Rather than try to understand the experience of all residents of New York City before, during, and after Hurricane Sandy, you can constrain your efforts to understand how social media use was affected by Hurricane Sandy, or how users that share their locations on social media altered their behavior before, during, and after the hurricane.

As another example, you might shift from trying to understand “How useful is my documentation?” to asking a question that is based on the data that you have: “How many people view my content?”. You can also try making a broad question more specific. Instead of asking “Is our website accessible?”, ask “Does our website meet the AA standard of the Web Content Accessibility Guidelines?”

Douglas Hubbard’s book, How to Measure Anything, provides excellent guidance about how to refine and devise a question that you can use data analysis to answer. He also makes the crucial point that sometimes it’s not worth it to use data to answer a question. If you are fairly certain that you already know the answer, and performing the data analysis (let alone performing it well) would take a lot of time and resources, it’s perhaps not worth attempting to answer the question with data at all!

You can also choose to use a different data source. If the data that you have access to in order to answer your question is incomplete, inadequate, inaccurate, or otherwise missing data, choose a different data source. This might lead you to change your dataset choice from readily-available digitized content to microfiche research at a library across the globe in order to perform a more complete and accurate data analysis.

And of course, if a different data source doesn’t exist, you can create a new data source with the information you need. Collaborate with stakeholders within your organization, make a business case to a third party whose system you want to gather data from, or use Freedom of Information Act (FOIA) requests to create a dataset from data that exists but is not easily accessible.

I also want to take care to acknowledge that choosing to use or create a different dataset can often require immense privilege: monetary privilege to fund added data collection, a trip across the globe, or a more complex survey methodology; privilege of access, to others doing similar research who are willing to share data with you; and privilege of time to perform the added data collection and analysis that might be necessary to prevent missing data.

If the data exists but you don’t have permission to use it, you might devise a research plan to request access to sensitive data, or work to gain the consent of those in the dataset that you want to use to allow you to use the data to answer the question that you want to answer. This is another case where communicating the use case of the data can help you gather it—if you share the questions that you’re trying to answer with the people that you’re trying to collect data from, they may be more inclined to share it with you. 

Take action to reduce bias in your data-driven decisions from missing data 

If you’re a data decision-maker, you can take these steps:

  1. Define the questions being answered with data. 
  2. Identify missing data in the analysis process.
  3. Ask questions of the data analysis before making decisions.

If you carefully define the questions guiding the data analysis process, clearly communicating your use cases to the data analysts that you’re working with, you can prevent data from going missing at the very start. 

Work with your teams and identify where data might go missing in the analysis process, and do what you can to address a leaky analysis pipeline. 

Finally, ask questions of the data analysis results before making decisions. Dig deeper into what is communicated to you, seek to understand what might be missing from the reports, visualizations, and analysis results being presented, and whether or not that missing data is relevant to your decision. 

If you work with data as a data analyst, engineer, admin, or communicator, you can take these steps:

  1. Steward and normalize data.
  2. Analyze data at multiple levels of aggregation and time spans.
  3. Add context to reports and communicate missing data.

Responsibly steward data as you collect and manage it, and normalize it when you prepare it for analysis to make it easier to use. 

If you analyze data at multiple levels of aggregation and time spans, you can determine which level allows you to communicate the most useful information with the least amount of data going missing, hidden by overgeneralized aggregations or overlarge time spans, or hidden in the noise of overly-detailed time spans or too many split-bys. 

Add context to the reports that you produce, providing details about the data analysis process and the dataset used, acknowledging what’s missing and what’s represented. Communicate missing data with detailed and focused visualizations, keeping visualizations consistent for regularly-communicated reports. 

I hope that no matter your role in the data analysis process, this blog post series helps you reduce missing data and make smarter, more accurate, and less biased data-driven decisions.

Collect the data: How missing data biases data-driven decisions

This is the seventh post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

When you’re gathering the data you need and creating datasets that don’t exist yet, you’re in the midst of the data collection stage. Data can easily go missing when you’re collecting it! 

In this post, I’ll cover the following: 

  • How data goes missing at the data collection stage 
  • What to do about missing data at the collection stage

How does data go missing?

There are many reasons why data might be missing from your analysis process. Data goes missing at the collection stage because the data doesn’t exist, or the data exists but you can’t use it for whatever reason, or the data exists but the events in the dataset are missing information. 

The dataset doesn’t exist 

Frequently data goes missing because the data itself does not exist, and you need to create it. It’s very difficult and impractical to create a comprehensive dataset, so data can easily go missing at this stage. It’s important to do what you can to make sure that any data that does go missing goes missing consistently, if possible, by collecting representative data.

In some cases, though, you do need comprehensive data. For example, if you need to create a dataset of all the servers in your organization for compliance reasons, you might discover that there is no one dataset of servers, and that efforts to compile one are a challenge. You can start with just the servers that your team administers, but that’s an incomplete list. 

Some servers are grant-owned and fully administered by a separate team entirely. Perhaps some servers are lurking under the desks of some colleagues, connected to the network but not centrally managed. You can try to use network scans to come up with a list, but then you gather only those servers connected to the network at that particular time. Airgapped servers or servers that aren’t turned on 24/7 won’t be captured by such an audit. It’s important to continually consider if you really need comprehensive data, or just data that comprehensively addresses your use case. 

The data exists, but… 

There’s also a chance that the data exists, but isn’t machine-readable. If the data is provided only in PDFs, as the responses to many FOIA requests are, then it becomes more difficult to include that data in data analysis. There’s also a chance that the data is available only as paper documents, as is the case with gun registration records. As Jeanne Marie Laskas reports for GQ in Inside The Federal Bureau Of Way Too Many Guns, having records only on paper prevents large-scale data analysis on the information, thus causing it to effectively go missing from the entire process of data analysis.

It’s possible that the data exists, but isn’t on the network—perhaps because it is housed on an airgapped device, or perhaps stored on servers subject to different compliance regulations than the infrastructure of your data analysis software. In this case, the data exists but it is missing from your analysis process because it isn’t available to you due to technical limitations. 

Another common case is that the data exists, but you can’t have it. If you’ve made an enemy in another department, they might not share the data with you because they don’t want to. It’s more likely that access to the data is controlled by legal or compliance concerns, so you aren’t able to access the data for your desired purposes, or perhaps you can’t analyze it on the tool that you’re using for data analysis due to compliance reasons. 

For example, most doctors’ offices and hospitals in the United States use electronic health records systems to store the medical records of thousands of Americans. However, scientific researchers are not permitted to access detailed electronic health records of patients, though they exist in large databases and the data is machine-readable, because the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule regulates how protected health information (PHI) can be accessed and used.

Perhaps the data exists, but is only available to people who pay for access. This is the case for many music metadata datasets like those from Nielsen, much to my chagrin. The effort it takes to create quality datasets is often commoditized. This also happens with scientific research, which is often only available to those with access to the scientific journals that publish the results of the research. The datasets behind the research are also often closely guarded, as one dataset is time-consuming to create and can lead to multiple publications.

There’s also a chance the data exists, but it isn’t made available outside of the company. A common circumstance for this is public API endpoints for cloud services. Spotify collects far more data than they make available via the API, and so do companies like Zoom or Google. You might hope to collect various types of data from these companies, but if the API endpoints don’t make the data available, you don’t have many options.

And of course, in some cases the data exists, but it’s inconsistent. Maybe you’re trying to collect equivalent data from servers or endpoints with different operating systems, but you can’t get the same details due to logging limitations. A common example is trying to collect the same level of detail from computers running macOS and computers running Windows. You can also see inconsistencies if different log levels are set on different servers for the same software. This inconsistent data causes data to go missing within events and makes it more difficult to compare like with like.

14-page forms lead to incomplete data collection in a pandemic 

Data can easily go missing if it’s just too difficult to collect. An example from Illinois, reported by WBEZ reporter Kristen Schorsch in Illinois’ Incomplete COVID-19 Data May Hinder Response, is that “the Illinois Department of Public Health issued a 14-page form that it has asked hospitals to fill out when they identify a patient with COVID-19. But faced with a cumbersome process in the midst of a pandemic, many hospitals aren’t completely filling out the forms.”

It’s likely that as a result of the length of the form, data isn’t consistently collected for all patients from all hospitals—which can certainly bias any decisions that the Illinois Department of Public Health might make, given that they have incomplete data. 

In fact, as Schorsch reports, without that data, public health workers “told WBEZ that makes it harder for them to understand where to fight for more resources, like N95 masks that provide the highest level of protection against COVID-19, and help each other plan for how to make their clinics safer as they welcome back patients to the office.” 

In this case, where data is going missing because it’s too difficult to collect, you can refocus your data collection on the most crucial data points for what you need to know, rather than the most complete data points.

What can you do about missing data? 

Most crucially, identify the missing data. If you know that you need a certain type of data to answer the questions that you want to answer in your data analysis, you must know that it is missing from your analysis process. 

After you identify the missing data, you can determine whether or not it matters. If the data that you do have is representative of the population that you’re making decisions about, and you don’t need comprehensive data to make those decisions, a representative sample of the data is likely sufficient. 

Communicate your use cases

Another important thing you can do is to communicate your use cases to the people collecting the data. For example, 

  • If software developers have a better understanding of how telemetry or log data is being used for analysis, they might write more detailed or more useful logging messages and add new fields to the telemetry data collection. 
  • If you share a business case with cloud service providers to provide additional data types or fields via their API endpoints, you might get better data to help you perform less biased and more useful data analyses. In return, those cloud providers are likely to retain you as a customer. 

Communicating the use case for data collection is most helpful when communicating that information leads to additional data gathering. It’s riskier when it might cause potential data sources to be excluded. 

For example, if you’re using a survey to collect information about a population’s preferences—let’s say, the design of a sneaker—and you disclose that information upfront, you might only get people with strong opinions about sneaker design responding to your survey. That can be great if you want to survey only that population, but if you want a more mainstream opinion, you might miss those responses because the use case you disclosed wasn’t interesting to them. In that context, you need to evaluate the missing data for its relevance to your analysis. 

Build trust when collecting sensitive data 

Data collection is a trust exercise. If the population that you’re collecting data about does not understand why you’re collecting the data, does not trust that you will protect it and use it as you say you will, or believes that you will use the data against them, you might end up with missing data.

Nowhere is this more apparent than with the U.S. Census. Performed every 10 years, the data from the census is used to determine political representation, distribute federal resources, and much more. Because of how the data from the census survey is used, a representative sample isn’t enough—it must be as complete a survey as possible. 

Screenshot of the Census page How We Protect Your Information.

The Census Bureau understands that mistrust is a common reason why people might not respond to the census survey. Because of that, the U.S. Census Bureau hires pollsters that are part of groups that might be less inclined to respond to the census, and also provides clear and easy-to-find details on its website (see How We Protect Your Information on census.gov) about the measures in place to protect the data collected in the census survey. Those details are even clear in the marketing campaigns urging you to respond to the census! The census also faces other challenges in ensuring the survey is as complete as possible.

This year, the U.S. Census also faced time limits for completing the collection and counting of surveys, in addition to delays already imposed by the COVID-19 pandemic. The New York Times has additional details about those challenges: The Census, the Supreme Court and Why the Count Is Stopping Early.

Address instances of mistrust with data stewardship

As Jill Lepore discusses in episode 4, Unheard, of her podcast The Last Archive, mistrust can also affect the accuracy of the data being collected, such as in the case of formerly enslaved people being interviewed by descendants of their former owners, or by their current white neighbors, for records collected by the Works Progress Administration. Surely, data is missing from those accounts of slavery due to mistrust of the people doing the data collection, or at the least, because those collecting the stories perhaps do not deserve to hear the true lived experiences of the formerly enslaved people.

If you and your team are not good data stewards, if you don’t do a good job of protecting data that you’ve collected or managing who has access to that data, people are less likely to trust you with more data—and thus it’s likely that datasets you collect will be missing data. Because of that, it’s important to practice good data stewardship. Use datasheets for datasets, or a data biography to record when data was collected, for what purpose, by whom or what means, and more. You can then review those to understand whether data is missing, or even to remember what data might be intentionally missing. 

In some cases, data can be intentionally masked, excluded, or left to collect at a later date. If you keep track of these details about the dataset during the data collection process, it’s easier to be informed about the data that you’re using to answer questions and thus use it safely, equitably, and knowledgeably. 

Collect what’s missing, maybe

If possible and if necessary, collect the data that is missing. You can create a new dataset if one does not already exist, such as the datasets that journalists and organizations like Campaign Zero have been compiling about police brutality in the United States. Some data collection that you perform might supplement existing datasets, such as adding additional introspection details to a log file to help you answer a new question for an existing data source.

If there are cases where you do need to collect additional data, you might not be able to do so at the moment. In those cases, you can build a roadmap or a business case to collect the data that is missing, making it clear how it can help reduce uncertainty for your decision. That last point is key, because collecting more data isn’t always the best solution for missing data. 

Sometimes, it isn’t possible to collect more data. For instance, you might be trying to gather historical data, but everyone from that period has died and very few or no primary sources remain. Or the data might have been destroyed, whether in a fire or intentionally, as the Stasi did with their records after the fall of the Berlin Wall.

Consider whether you need complete data

Also consider whether or not more data will actually help address the problem that you’re attempting to solve. You can be missing data, and yet still not need to collect more data in order to make your decision. As Douglas Hubbard points out in his book, How to Measure Anything, data analysis is about reducing uncertainty about what the most likely answer to a question is. If collecting more data doesn’t reduce your uncertainty, then it isn’t necessary. 

Nani Jansen Reventlow of the Digital Freedom Fund makes this point clear in her Op-Ed on Al Jazeera, Data collection is not the solution for Europe’s racism problem. In that case, collecting more data, even though it could be argued that the data is missing, doesn’t actually reduce uncertainty about what the likely solution for racism is. Being able to quantify the effect or harms of racism on a region does not solve the problem—the drive to solve the problem is the only thing that can solve that problem. 

Avoid cases where you continue to collect data, especially at the expense of an already-marginalized population, in an attempt to prove what is already made clear by the existing information available to you. 

You might think that data collection is the first stage of a data analysis process, but in fact, it’s the second. The next and last post in this series covers defining the question that guides your data analysis, and how to take action to reduce bias in your data-driven decisions: Define the question: How missing data biases data-driven decisions.

Manage the data: How missing data biases data-driven decisions

This is the sixth post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

In this post, I’ll cover the following:

  • What is data management?
  • How does data go missing, featuring examples of disappearing data
  • What you can do about missing data

How you manage data in order to prepare it for analysis can cause data to go missing and decisions based on the resulting analysis to be biased. With so many ways for data to go missing, there are just as many chances to address the potential bias that results from missing data at this stage.

What is data management?

Data management, for the purposes of this post, covers all the steps you take to prepare data after it’s been collected. That includes all the steps you take to answer the following questions:

  • How do you extract the data from the data source?
  • What transformations happen to the data to make it easier to analyze?
  • How is it loaded into the analysis tool?
  • Is the data normalized against a common information model?
  • How is the data structured (or not) for analysis?
  • What retention periods are in place for different types of data?
  • Who has access to the data? 
  • How do people access the data?
  • For what use cases are people permitted to access the data?
  • How is information stored and shared about the data sources? 
  • What information is stored or shared about the data sources?
  • What upstream and downstream dependencies feed into the data pipeline? 

How you answer these questions (if you even consider them at all) can cause data to go missing when you’re managing data. 

How does data go missing? 

Data can go missing at this stage in many ways. With so many moving parts across the tools and transformation steps used to prepare data for analysis and make it easier to work with, a lot can go wrong. For example, if you neglect to monitor your dependencies, a configuration change in one system can cause data to go missing from your analysis process.

Disappearing data: missing docs site metrics 

It was just an average Wednesday when my coworker messaged me asking for help with her documentation website metrics search—she thought she had a working search, but it wasn’t showing the results she expected. It was showing her that no one was reading any of her documentation, which I knew couldn’t be true.

As I dug deeper, I realized the problem wasn’t the search syntax, but the indexed data itself. We were missing data! 

I reported it to our internal teams, and after some investigation they realized that a configuration change on the docs site had resulted in data being routed to a different index. A configuration change that they thought wouldn’t affect anything ended up causing data to go missing for nearly a week because we weren’t monitoring dependencies crucial to our data management system. 

Thankfully, the data was only misrouted and not dropped entirely, but it was a good lesson in how easily data can go missing at this management stage. If you identify the sources you expect to be reporting data, then you can monitor for changes in the data flow. You can also document those sources as dependencies, and ensure that configuration changes include additional testing to ensure the continued fidelity of your data collection and management process. 
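A deliberately simplified sketch of that kind of monitoring, in Python: compare the sources you expect to report data (the source names here are hypothetical) against the sources actually seen recently, and alert on anything that has gone quiet. In practice, the list of recently seen sources would come from a scheduled search in your analysis tool.

    # Sources we expect to report data, documented as dependencies.
    EXPECTED_SOURCES = {"docs-site-web", "docs-site-api", "download-portal"}

    def find_silent_sources(recently_seen_sources):
        """Return the expected sources that have not reported any data recently."""
        return EXPECTED_SOURCES - set(recently_seen_sources)

    # Hypothetical result of a scheduled "sources seen in the last hour" search.
    silent = find_silent_sources(["docs-site-api", "download-portal"])
    if silent:
        print(f"No recent data from: {sorted(silent)}")  # alert, file a ticket, etc.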

Disappearing data: data retention settings slip-up 

Another way data can go missing is if you neglect to manage or be aware of default tool constraints that might affect your data. 

In this example, I was uploading my music data to the Splunk platform for the first time. I was so excited to analyze the 10 years of historical data. I uploaded the file, set up the field extractions, and got to searching my data. I wrote an all-time search to see how my music listening habits had shifted year over year in the past decade—but only 3 years of results were returned. What?!

In my haste to start analyzing my data, I’d completely ignored a warning message about a seemingly-irrelevant setting called “max_days_ago”. It turns out that this setting is set by default to drop any data older than 3 years. The Splunk platform recognized that I had data in my dataset older than 3 years, but I didn’t heed the warning and didn’t update the default setting to match my data. I ended up having to delete the data I’d uploaded, fix my configuration settings, and upload the data again—without any of it being dropped this time! 

This experience taught me to pay attention to how I configure a tool to manage my data to make sure data doesn’t go missing. This happened to me while using the Splunk platform, but it can happen with whatever tool you’re using to manage, transform, and process your data.
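
As an illustration, here's the kind of pre-upload check I wish I'd run, sketched in Python under some assumptions: the file path, the timestamp column, and the 3-year cutoff are placeholders for whatever your tool's retention or ingestion window actually is.

```python
# Minimal sketch: warn if a dataset contains events older than a tool's
# retention or ingestion window, before you upload it. The file path,
# column name, and 3-year cutoff are illustrative assumptions.
import csv
from datetime import datetime, timedelta

RETENTION_WINDOW = timedelta(days=3 * 365)

def check_oldest_event(path, timestamp_field="listen_time"):
    with open(path, newline="") as f:
        timestamps = [datetime.fromisoformat(row[timestamp_field])
                      for row in csv.DictReader(f)]
    oldest = min(timestamps)
    cutoff = datetime.now() - RETENTION_WINDOW
    if oldest < cutoff:
        print(f"Warning: oldest event ({oldest:%Y-%m-%d}) predates the "
              f"retention window ({cutoff:%Y-%m-%d}); update the tool's "
              f"settings first or expect dropped data.")

# check_oldest_event("lastfm_history.csv")
```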

As reported by Alex Hern in the Guardian:

“A million-row limit on Microsoft’s Excel spreadsheet software may have led to Public Health England misplacing nearly 16,000 Covid test results”. This happened because of a mismatch in formats and a misunderstanding of the data limitations imposed by the file formats used by labs to report case data, as well as of the software (Microsoft Excel) used to manage the case data. Hern continues, pointing out that “while CSV files can be any size, Microsoft Excel files can only be 1,048,576 rows long – or, in older versions which PHE may have still been using, a mere 65,536. When a CSV file longer than that is opened, the bottom rows get cut off and are no longer displayed. That means that, once the lab had performed more than a million tests, it was only a matter of time before its reports failed to be read by PHE.” 

This limitation in Microsoft Excel isn’t the only way that tool limitations and settings can cause data to go missing at the data management stage. 

Data transformation: Microsoft wants genes to be dates 

If you’re not using Splunk for your data management and analysis, you might be using Microsoft Excel. Despite (or perhaps because of) its popularity, Microsoft Excel can also cause data to go missing due to its default settings. Some genetics researchers discovered that Excel was transforming their data incorrectly: certain gene names, such as MAR1 and DEC1, were converted into the dates March 1 and December 1, causing data to go missing from the analysis. 

Clearly, if you’re doing genetics research, this is a problem. Your data has been changed, and this missing data will bias any research based on this dataset, because certain genes are now dates! 

To handle cases where a tool is improperly transforming data, you have 3 options:

  • Change the tool that you’re using,
  • Modify the configuration settings of the tool so that it doesn’t modify your data,
  • Or modify the data itself.

The genetics researchers ended up deciding to modify the data itself. The HUGO Gene Nomenclature Committee officially renamed 27 genes to accommodate this data transformation error in Microsoft Excel. Thanks to this decision, these researchers have one fewer configuration setting to worry about when helping to ensure vital data doesn’t go missing during the data analysis process. 
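
If you go with the second option instead (configuring the tool so it doesn't touch your values), the same idea applies outside of Excel. Here's a minimal pandas sketch, using a made-up CSV snippet, that reads every column as plain text so nothing gets silently reinterpreted as a date:

```python
# Minimal sketch: read every value as text so the tool can't reinterpret
# gene names like MAR1 or DEC1 as dates. The CSV snippet is made up.
import io
import pandas as pd

raw = io.StringIO("gene,read_count\nMAR1,42\nDEC1,17\nSEPT2,8\n")

genes = pd.read_csv(raw, dtype=str)                        # keep everything as strings
genes["read_count"] = pd.to_numeric(genes["read_count"])   # convert only known numeric columns

print(genes)
print(genes.dtypes)
```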

What can you do about missing data? 

These examples illustrate common ways that data can go missing at the management stage, but they’re not the only ways. What can you do when data goes missing?

Carefully set configurations 

The configuration settings that you use to manage data that you’ve collected can result in events and data points being dropped.

For example, if you incorrectly configure data source collection, you might lose events or parts of events. Even worse, data can go missing if events are recorded incorrectly due to faulty line breaking, truncation, time zone, timestamp recognition, or retention settings. Data can go missing inconsistently if the nodes of your data management system don’t all have identical configurations. 

You might cause some data to go missing intentionally. You might choose to drop INFO level log messages and collect only the ERROR messages in an attempt to separate the signal from the noise of log messages, or you might choose to drop all events older than 3 months from all data sources to save money on storage. These choices, if inadequately communicated or documented, can lead to false assumptions or incorrect analyses being performed on the data. 

If you don’t keep track of configuration changes and updates, a data source format could change before you update the configurations to manage the new format, causing data to get dropped, misrouted, or otherwise go missing from the process. 

If your data analysts communicate their use cases and questions to you, you can set data retention policies according to those use cases, review the current policies across your datasets, and see how they compare for complementary data types. 

You can also identify complementary data sources that might help the analyst answer the questions they want to answer, and plan how and when to bring in those data sources to improve the data analysis. 

You need to manage dataset transformations just as closely as you do the configurations that manage the data. 

Communicate dataset transformations

The steps you take to transform data can also lead to missing data. If you don’t normalize fields, or if your field normalizations are inconsistently applied across the data or across the data analysts, data can appear to be missing even if it is there. If some data has a field name of “http_referrer” and the same fields in other data sources are consistently “http_referer”, the “http_referrer” data might appear to be missing for some data analysts when they start the data analysis process. 
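
One way to guard against this is to normalize field names as part of the management stage. Here's a minimal Python sketch, with a hypothetical hand-maintained alias map and made-up events, of what that normalization could look like:

```python
# Minimal sketch: rename known field-name aliases to one canonical name.
# The alias map and the sample events are hypothetical.
FIELD_ALIASES = {
    "http_referrer": "http_referer",
    "referrer": "http_referer",
    "useragent": "http_user_agent",
}

def normalize_event(event):
    return {FIELD_ALIASES.get(field, field): value for field, value in event.items()}

events = [
    {"http_referrer": "https://example.com", "status": "200"},
    {"http_referer": "https://example.org", "status": "404"},
]
print([normalize_event(event) for event in events])
```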

Normalization can also help you identify where fields might be missing across similar datasets, such as cases where an ID is present in one type of data but not another, making it difficult to trace a request across multiple services. 

If the data analyst doesn’t know or remember which field name exists in one dataset, and whether or not it’s the same field as another dataset, data can go missing at the analysis stage—as we saw with my examples of the “rating” field missing from some events and the info field not having the value I expected, in the data analysis post from this series, Analyze the data: How missing data biases data-driven decisions.

In the same vein, if you use vague field names to describe the data that you’ve collected, or dataset names that ambitiously describe the data that you want to be collecting—instead of what you’re actually collecting—data can go missing. Shortcuts like “future-proofing” dataset names can be misleading to data analysts who want to easily and quickly understand what data they’re working with. 

The data doesn’t go missing immediately, but you’re effectively causing it to go missing when data analysis begins if data analysts can’t correctly decipher what data they’re working with. 

Educate and incorporate data analysis into existing processes

Another way data can go missing is painfully human. If the people that you expect to analyze the data and use it in their decision-making process don’t know how to use the tool that the data is stored in, well, that data goes missing from the process. Tristan Handy in the dbt blog post Analytics engineering for everyone discusses this problem in depth. 

It’s important to not just train people on the tool that the data is stored in, but also make sure that the tool and the data in it are considered as part of the decision-making process. Evangelize what data is available in the tool, and make it easy to interact with the tool and the data. This is a case where a lack of confidence and knowledge can cause data to go missing. 

Data gaps aren’t always caused by a lack of data—they can also be caused by knowledge gaps and tooling gaps if people aren’t confident or trained to use the systems with the data in them. 

Monitor data strategically

Everyone wants to avoid missing data, but you can’t monitor what you can’t define. So in order to monitor data to prevent it from going missing, you must define what data you expect to see: which sources it should come from and at what ingestion volumes. 

If you don’t have a way of defining those expectations, then you can’t alert on what’s missing. Start by identifying what you expect, and then quantify what’s missing based on those expectations. For guidance on how to do this in Splunk, see Duane Waddle’s blog post Proving a Negative, as well as the apps TrackMe and Meta Woot!
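
Those apps handle this properly in Splunk; as a tool-agnostic illustration, here's a hedged Python sketch that compares today's event counts per source against a trailing baseline and flags anything that falls too far below it. All of the numbers are made up:

```python
# Minimal sketch: quantify what's missing by comparing today's event counts
# against each source's trailing baseline. All numbers are made up; a real
# setup would pull them from your data platform.
baseline = {                # average daily events over the last 30 days
    "web_access": 120_000,
    "app_logs": 45_000,
    "ticket_purchases": 800,
}
today = {"web_access": 118_500, "app_logs": 3_200, "ticket_purchases": 790}

THRESHOLD = 0.5             # alert if a source drops below 50% of its baseline

for source, expected in baseline.items():
    observed = today.get(source, 0)
    if observed < expected * THRESHOLD:
        shortfall = 1 - observed / expected
        print(f"{source}: {observed} events today, {shortfall:.0%} below baseline")
```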

Plan changes to the data management system carefully

It’s also crucial to review changes to the configurations that you use to manage data sources, especially changes to data structures or normalization in data sources. Make sure that you consistently deploy these changes as well, to reduce the chance that some sources collect different data in different ways from other sources for the same data. 

Be careful to note downstream and upstream dependencies for your data management system, such as other tools, permissions settings, or network configurations, before making changes such as an upgrade or a software replacement.

The simplest place for data to go missing from a data analysis process is while it’s being collected. The next post in the series discusses how data can go missing at the collection stage: Collect the data: How missing data biases data-driven decisions.

Visualize the data: How missing data biases data-driven decisions

This is the fourth post in a series about how missing data can bias data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

Visualizing data is crucial to communicate the results of a data analysis process. Whether you use a chart, a table, a list of raw data, or a three-dimensional graph that you can interact with in virtual reality—your visualization choice can cause data to go missing. Any time you visualize the results of data analysis, you make intentional decisions about what to visualize and what not to visualize. How can you make sure that data that goes missing at this stage doesn’t bias data-driven decisions?

In this post, I’ll cover the following: 

  • How the usage of your data visualization can cause data to go missing
  • How data goes missing in data visualizations
  • Why accessibility matters for data visualizations
  • How a lack of labels and scale can mislead and misinform 
  • What to do about missing data in data visualizations

How people use the Georgia Department of Public Health COVID-19 daily report 

When creating a data visualization, it’s important to consider how it will be used. For example, the state of Georgia provides a Department of Public Health Daily COVID-19 reporting page to help communicate the relative case rate for each county in the state. 

In the midst of this global pandemic, I’m taking extra precautions before deciding to go hiking or climbing outside. Part of that risk calculation involves checking the relative case rate in my region — are cases going up, down, or staying the same? 

If you wanted to check that case rate in Georgia in July, you might struggle to make an unbiased decision about your safety because of the format of a data visualization in that report.

As Andisheh Nouraee illustrates in a now-deleted Twitter thread, the Georgia Department of Public Health’s COVID-19 Daily Status Report provided a heat map in July that visualized the number of cases across counties in Georgia in such a way that it effectively hid a 49% increase in cases across 15 days.

July 2nd heat map visualization of Georgia counties showing cases per 100K residents, with bins covering ranges from 1-620, 621-1070, 1071-1622, and 1623-2960, and the red bins covering a range from 2961-4661.
Image from July 2nd, shared by Andisheh Nouraee, my screenshot of that image
July 17th heat map visualization showing cases per 100K residents in Georgia, with three counties colored red. Bins represent none, 1-949, 950-1555, 1556-2336, and 2337-3768 cases, and the red bins represent 3769-5165 cases.
Image from July 17th, shared by Andisheh Nouraee, my screenshot of that image

You might think that these visualizations aren’t missing data at all—the values of the gradient bins are clearly labeled, and the map clearly shows how many cases exist for every 100K residents.

However, the missing data isn’t in the visualization itself, but in how it’s used. This heat map is provided to help people understand the relative case rate. If I were checking this graph every week or so, I would probably think that the case rate has stayed the same over that time period. 

Instead, because the visualization uses auto-adjusting gradient bins, the red counties in the visualization from July 2nd cover a range from 2961 to 4661, while the same color counties on July 17th have case rates of 3769–5165 cases per 100K residents. The bin ranges differ enough that the bins can’t be compared with each other over time. 

As reported by Keren Landman for the Atlanta Magazine, the Department of Public Health didn’t have direct control over the data on the dashboard anyway, making it harder to make updates or communicate the data more intentionally.

Thankfully, the site now uses a visualization with a consistent gradient scale, rather than auto-adjusting bins.

Screenshot of heat map of counties with cases per 100K residents in Georgia with Union County highlighted showing the confirmed cases from the past 2 weeks and total, and other data points that are irrelevant to this post.

In this example, the combination of the visualization choice and the use of that visualization by the visitors of this website caused data to go missing, possibly resulting in biased decisions about whether it’s safe to go for a hike in the community. 
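
To make the mechanics of that concrete, here's a hedged sketch with made-up case rates showing the difference between auto-adjusting (quantile) bins and a fixed scale. With auto-adjusting bins, the same counties keep the same colors even as their rates climb, which is exactly the problem with the July maps:

```python
# Minimal sketch: auto-adjusting (quantile) bins vs. a fixed scale.
# The case rates are made up for illustration.
import numpy as np

july_2 = np.array([300, 800, 1500, 2400, 4600])    # cases per 100K, 5 counties
july_17 = np.array([450, 1100, 2000, 3100, 5100])  # same counties, 15 days later

def quantile_bins(values, n_bins=5):
    """Recompute bin edges from the data itself, like an auto-adjusting legend."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
    return np.digitize(values, edges[1:-1])

FIXED_EDGES = [1000, 2000, 3000, 4000]  # one scale reused for every snapshot

print("auto bins, July 2: ", quantile_bins(july_2))
print("auto bins, July 17:", quantile_bins(july_17))              # identical colors...
print("fixed bins, July 2: ", np.digitize(july_2, FIXED_EDGES))
print("fixed bins, July 17:", np.digitize(july_17, FIXED_EDGES))  # ...even though rates rose
```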

How does data go missing? 

This example from the Georgia Department of Health describes one way that data can go missing, but there are many more. 

Data can go missing from your visualization in a number of ways:

  • If the data exists, but is not represented in the visualization, data is missing.
  • If data points and fluctuations are smoothed over, or connected across gaps, data is missing. 
  • If outliers and other values are excluded from the visualization, data is missing.
  • If people can’t see or interact with the visualization, data is missing.
  • If a limited number of results are being visualized, but the label and title of the visualization don’t make that clear, data is missing.

Accessible data visualizations prevent data from going missing 

Accessible visualizations are crucial for avoiding missing data because data can go missing if people can’t see or interact with it. 

Lisa Charlotte Rost wrote an excellent series for Data Wrapper’s blog about colorblindness and data visualizations that I highly recommend for considering color vision accessibility for data visualization: How your colorblind and colorweak readers see your colors, What to consider when visualizing data for colorblind readers, and What’s it like to be colorblind

You can also go further to consider how to make it easier for folks with low or no vision to interact with your data visualizations. Data visualization artist Mona Chalabi has been experimenting with ways to make her data visualization projects more accessible, including making a tactile version of a data visualization piece, and an interactive piece that uses touch and sound to communicate information, created in collaboration with sound artist Emmy the Great.

At a more basic level, consider how your visualizations look at high zoom levels and how they sound when read aloud by a screen reader. If a visualization is unintelligible at high zoom levels or if portions aren’t read aloud by a screen reader, those are cases where data has gone missing from your visualization. Any decision that someone with low or no vision wants to make based on a data visualization is biased to include only the data visualizations that they can interact with successfully. 

Beyond vision considerations, you want to consider cognitive processing accessibility to prevent missing data. If you overload a visualization with lots of overlays, rely on legends to communicate meaning in your data, or have a lot of text in your visualization, folks with ADHD or dyslexia might struggle to process your visualization. 

Any data that people can’t understand in your visualization is missing data. For more, I recommend the blog post by Sarah L. Fossheim, An intro to designing accessible data visualizations.  

Map with caution and label prodigiously: Beirut explosion map

Data can go missing if you fail to visualize it clearly or correctly. When I found out about the explosion in Beirut, after I made sure that my friends and their family were safe, I wanted to better understand what had happened. 

Screenshot of a map of the beirut explosion with labels pointing to different overlapping circles, saying "blast site", "widespread destruction", "heavy damage", "damage reported" and "windows blown out up to 15 miles away".
Image shared by Joanna Merson, my screenshot of the image

I haven’t had the privilege to visit Beirut before, so the maps of the explosion radius weren’t as easy for me to personally relate to. Thankfully, people started sharing maps about what the same explosion might look like if it occurred in New York City or London.  

Screenshot of a google maps visualization with 3 overlapping circles centered over New York. No labels.
Image shared by Joanna Merson, my screenshot of the image

This map attempts to show the scale of the same explosion in New York City, but it’s missing a lot of data. I’m not an expert in map visualizations, but thankfully cartographer Joanna Merson tweeted a correction to this map and unpacked just how much data is missing from this visualization. 

There are no labels on this map, so you don’t know the scale of the circles or what distance each blast radius is supposed to represent. You don’t know where the epicenter of the blast is because it isn’t labeled, and, perhaps most egregiously, the map projection used is incorrect. 

Joanna Merson created an alternate visualization, with all the missing data added back in. 

Screenshot of a map visualization by Joanna Merson made on August 5 2020, with a basemap from esri World Imagery using a scale 1:200,000 and the Azimuthal Equidistant projection. Map is centered over New York City with an epicenter of Manhattan labeled and circle radii of 1km, 5km, and 10km clearly labeled.
Image by Joanna Merson, my screenshot of the image.

Her visualization carefully labels the epicenter of the blast, as well as the radii of each of the circles that represent different effects from the blast. She’s also careful to share the map projection that she used—one that has the same distance for every point along that circle. It turns out that the projection used by Google Maps is not the right projection to show distance with an overlaid circle. Without the scale or an accurate projection in use, data goes missing (and gets added) as unaffected areas are misleadingly shown as affected by the blast. 
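
If you want to try this yourself, here's a hedged sketch using pyproj (assuming you have it installed) that builds a true 10 km ring by working in an azimuthal equidistant projection centered on the point of interest, rather than drawing a circle directly on a web-map projection. The center coordinates are just an illustrative point in lower Manhattan:

```python
# Minimal sketch: build a distance-accurate 10 km ring around a point by
# working in an azimuthal equidistant projection centered on that point.
# The center coordinates are illustrative, not an exact epicenter.
import math
from pyproj import CRS, Transformer

center_lon, center_lat = -74.006, 40.713
radius_m = 10_000

aeqd = CRS.from_proj4(
    f"+proj=aeqd +lat_0={center_lat} +lon_0={center_lon} +datum=WGS84 +units=m"
)
to_lonlat = Transformer.from_crs(aeqd, "EPSG:4326", always_xy=True)

# In this projection the center is (0, 0) and distances from it are true,
# so a plain circle of radius 10,000 m really is a 10 km ring.
ring = [
    to_lonlat.transform(radius_m * math.cos(a), radius_m * math.sin(a))
    for a in (math.radians(d) for d in range(0, 360, 10))
]
print(ring[:3])  # (longitude, latitude) pairs you can hand to any mapping library
```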

How many of you are guilty of making a geospatial visualization, but don’t know anything about map projections and how they might affect your visualization? 

Joanna Merson further points out in her thread on Twitter that maps like this with an overlaid radius to show distance can be inaccurate because they don’t take into account the effect of topography. Data goes missing because topography isn’t represented or considered by the visualization overlaid on the map. 

It’s impractical to model everything perfectly in every map visualization. Depending on how you’re using the map, this missing data might not actually matter. If you communicate what your visualization is intended to represent when you share it, you can convey the missing data and also assert its irrelevance to your point. All maps, after all, must make decisions about what data to include based on the usage of the map. Your map-based data visualizations are no different! 

It can be easy to cut corners and make a simple visualization to communicate the results of data analysis quickly. It can be tedious to add a scale, a legend, and labels to your visualization. But you must consider how your visualization might be used after you make it—and how it might be misused.

Will a visualization that you create end up in a blog post like this one, or a Twitter thread unpacking your mistakes? 

What can you do about missing data?

To prevent or mitigate missing data in a data visualization, you have several options. Nathan Yau of Flowing Data has a very complete guide for Visualizing Incomplete and Missing Data that I highly recommend in addition to the points that I’m sharing here. 

Visualize what’s missing

One important way to mitigate missing data in a data visualization is to devise a way to show the data that is there alongside the data that isn’t. Make the gaps apparent and visualize missing data, such as by avoiding connecting the dots between missing values in a line chart.

In cases where your data has gaps, you can add annotations or labels to acknowledge and explain any inconsistencies or perceived gaps in the data. In some cases, data can appear to be missing, but is actually a gap in the data due to seasonal fluctuations or other reasons. It’s important to thoroughly understand your data to identify the difference. 

If you visualize the gaps in your data, you have the opportunity to discuss what can be causing the gaps. Gaps in data can reflect reality, or flaws in your analysis process. Either way, visualizing the gaps in your data is just as valuable as visualizing the data that you do have. Don’t hide or ignore missing data.
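
As a small illustration of not connecting the dots across missing values, here's a hedged matplotlib sketch with made-up daily listen counts. Leaving the missing days as NaN breaks the line, so the gap stays visible instead of being papered over:

```python
# Minimal sketch: leave missing values as NaN so the gap shows in the chart
# instead of being bridged by a line. The daily counts are made up.
import numpy as np
import matplotlib.pyplot as plt

days = np.arange(1, 15)
listens = np.array([42, 51, 47, 60, np.nan, np.nan, np.nan,
                    55, 58, 49, 62, 70, 66, 64])

fig, ax = plt.subplots()
ax.plot(days, listens, marker="o")               # NaNs break the line at the gap
ax.axvspan(4.5, 7.5, alpha=0.15, color="gray")   # optionally annotate the gap itself
ax.set_xlabel("Day")
ax.set_ylabel("Tracks listened")
ax.set_title("Daily listens (collection outage on days 5-7)")
plt.show()
```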

Carefully consider time spans

Be intentional about the span that you choose for time-based visualizations. You can unintentionally hide fluctuations in the data if you choose an overly-broad span for your visualization, causing data to go missing by flattening it. 

If you choose an overly-short time span for your visualization, however, the meaning of the data and what you’re trying to communicate can go missing with all the noise of the individual data points. Consider what you’re trying to communicate with the data visualization, and choose a time span accordingly.
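
Here's a hedged pandas sketch of that trade-off, with randomly generated listening timestamps: the same events aggregated per day show the fluctuations and gaps, while a monthly span flattens everything into a few bars.

```python
# Minimal sketch: the same events aggregated at two different time spans.
# The timestamps are randomly generated for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
timestamps = pd.to_datetime("2020-03-01") + pd.to_timedelta(
    rng.integers(0, 90 * 24 * 60, size=2_000), unit="min"
)
listens = pd.Series(1, index=timestamps).sort_index()

daily = listens.resample("D").sum()     # shows day-to-day fluctuations and gaps
monthly = listens.resample("MS").sum()  # flattens all of that into three bars

print(daily.head())
print(monthly)
```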

Write clearly 

Another way to address missing data is to write good labels and titles for visualizations. It’s crucial to explain exactly what is present in a visualization—an important component of communicating results. If you’re intentional and precise about your labels and titles, you can prevent data from going missing. 

If the data analysis contains the results for the top 10 cities by population density, but your title only says “Top Cities”, data has gone missing from your visualization!

You can test out the usefulness of your labels and titles by considering the following: If someone screenshots your visualization and puts it in a different presentation, or tweets it without the additional context that might be in the full report, how much data would be missing from the visualization? How completely does the visualization communicate the results of data analysis if it’s viewed out of context?

Validate your scale

Make sure any visualization that you create has a scale, and that the scale is actually shown. It’s really easy for data to go missing if the scale of the data itself is missing. 

Also validate that the scale on your visualization is accurate and relevant. If you’re visualizing percentages, make sure the scale goes from 0-100. If you’re visualizing logarithmic data, make sure your scale reflects that correctly. 
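
Two small habits catch most of these issues. Here's a hedged matplotlib sketch with hypothetical values: pin a percentage axis to 0-100, and switch to a log scale (and say so in the label) when the data spans orders of magnitude.

```python
# Minimal sketch: pin a percentage axis to 0-100 and use a labeled log scale
# for data that spans orders of magnitude. The values are hypothetical.
import matplotlib.pyplot as plt

categories = ["North", "South", "East", "West"]
completion_pct = [62, 58, 71, 65]
daily_requests = [120, 1_400, 18_000, 240_000]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))

ax1.bar(categories, completion_pct)
ax1.set_ylim(0, 100)                   # percentages shown against the full 0-100 range
ax1.set_ylabel("Completion (%)")

ax2.bar(categories, daily_requests)
ax2.set_yscale("log")                  # order-of-magnitude data gets a log scale
ax2.set_ylabel("Requests per day (log scale)")

plt.tight_layout()
plt.show()
```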

Consider the use

Consider how your visualization will be used, and design your visualizations accordingly. What decisions are people trying to make based on your visualization? What questions are you trying to answer when you make it? 

Automatically-adjusting gradient bins in a heat map can be an excellent design choice, but as we saw in Georgia, they don’t make sense for communicating relative change over time. 

Choose the right chart for the data

It’s also important to choose the right chart to visualize your data. I’m not a visualization expert, so check out this data tutorial from Chartio, How to Choose the Right Data Visualization as well as these tutorials of different chart types on Flowing Data: Chart Types.  

If you’re visualizing multiple aggregations in one visualization in the Splunk platform, I do recommend the Trellis layout, which creates separate charts to help you compare across the aggregates. 

Always try various types of visualizations for your data to determine which one shows the results of your analysis in the clearest way.

One of the best ways to make sure your data visualization isn’t missing data is to make sure that the data analysis is sound. The next post in this series addresses how data can go missing while you analyze it: Analyze the data: How missing data biases data-driven decisions.

Communicate the data: How missing data biases data-driven decisions

This is the third post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

Communicating the results of a data analysis process is crucial to making a data-driven decision. You might review results communicated to you in many ways:

  • A slide deck presented to you
  • An automatically-generated report emailed to you regularly
  • A white paper produced by an expert analysis firm that you review 
  • A dashboard full of curated visualizations
  • A marketing campaign

If you’re a data communicator, what you choose to communicate (and how you do so) can cause data to go missing and possibly bias data-driven decisions made as a result.

In this post, I’ll cover the following:

  • A marketing campaign that misrepresents the results of a data analysis process
  • A renowned white paper produced by leaders in the security space
  • How data goes missing at the communication stage
  • What you can do about missing data at the communication stage

Spotify Wrapped: Your “year” in music 

If you listen to music on Spotify, you might have joined in the hordes of people that viewed and shared their Spotify Wrapped playlists and images like these last year. 

Screenshot of Spotify Wrapped graphic showing "we spend some serious time together"

Spotify Wrapped is a marketing campaign that purports to communicate your year in music, based on the results of some behind-the-scenes data analysis that Spotify performs. While it is a marketing campaign, it’s still a way that data analysis results are being communicated to others, and thus relevant to this discussion of missing data in communications. 

Screenshot of Spotify Wrapped campaign showing "You were genre-fluid" Screenshot of Spotify Wrapped campaign showing that I discovered 1503 new artists.

In this case you can see that my top artists, top songs, minutes listened, and top genre are shared with me, along with the number of artists that I discovered, and even the number of minutes that I spent listening to music over the last year. It’s impressive, and exciting to see my year in music summed up in such a way! 

But after I dug deeper into the data, the gaps in the communication and data analysis became apparent. What’s presented as my year in music is actually more like my “10 months in music”, because the data only represents the period from January 1st to October 31st of 2019. Two full months of music listening behavior are completely missing from my “year” in music. 

Unfortunately, these details about the dataset are missing from the Spotify Wrapped campaign report itself, so I had to search through additional resources to find out more information. According to a now-archived FAQ on Spotify for Artists (thank goodness for the Wayback Machine), the data represented in the campaign covers the dates from January 1st, 2019 – October 31st, 2019. I also read two key blog posts, Spotify Wrapped 2019 Reveals Your Streaming Trends, from 2010 to Now announcing the campaign and Unwrapping Wrapped 2019: Spotify VP of Engineering Tyson Singer Explains digging into the data analysis behind the campaign, to learn what I could about the size of the dataset, or how many data points might be necessary to draw the conclusions shared in the report. 

Screenshot of Spotify FAQ question "Why are my 2019 Artist Wrapped stats different from the stats I see in Spotify for Artists?"

Screenshot of Spotify FAQ answering the question "How do I get my 2019 Artist Wrapped?" with the first line of the answer being "If you have music on Spotify with at least 3 listeners before October 31st 2019, you get a 2019 Wrapped!"

This is a case where data is going missing from the communication because the data is presented as though it represents an entire time period, when in fact it only represents a subset of the relevant time period. Not only that, but it’s unclear what other data points actually represent. The number of minutes spent listening to music in those 10 months could be calculated by adding up the actual amount of time I spent listening to songs on the service, but could also be an approximate metric calculated from the number of streams of tracks in the service. It’s not possible to find out how this metric is being calculated. If it is an approximate metric based on the number of streams, that’s also a case of uncommunicated missing data (or a misleading metric), because according to the Spotify for Artists FAQ, Spotify counts a song as streamed if you’ve listened to at least 30 seconds of a track. 

It’s likely that other data is missing in this incomplete communication, but another data communication is a great example of how to do it right. 

Verizon DBIR: A star of data communication 

The Verizon Data Breach Investigations Report (DBIR) is an annual report put out by Verizon about, well, data breach investigations. Before the report shares any results, it includes a readable summary of the dataset used, how the analysis is performed, and what is missing from the data. Only after the limitations of the dataset and the resulting analysis are communicated are the actual results of the analysis shared. 

Screenshot of Verizon DBIR results and analysis section of the 2020 DBIR

And those results are well-presented, featuring confidence intervals, detailed titles and labels, as well as clear scales to visualize the data. I talk more about how to prevent data from going missing in visualizations in the next post of the series, Visualize the data: How missing data biases data-driven decisions.

Screenshot of 2 visualizations from the Verizon 2020 DBIR

How does data go missing?

These examples make it clear that data can easily go missing when you’re choosing how to communicate the results of data analysis. 

  • You include only some visualizations in the report — the prettiest ones, or the ones with “enough” data
  • You include different visualizations and metrics than the ones in the last report, without an explanation about why they changed, or what changed. For example, what happened in Spain in late May, 2020 as reported by the Financial Times: Flawed data casts cloud over Spain’s lockdown strategy.
  • You choose not to share or discuss the confidence intervals for your findings. 
  • You neglect to include details about the dataset, such as the size of the dataset or the representativeness of the dataset.
  • You discuss the available data as though it represents the entire problem, rather than a subset of the problem. 

For example, the Spotify Wrapped campaign shares its data as though it represents an entire year, when it only reflects 10 months, and it doesn’t include any details about the dataset beyond what you can assume—it’s based on Spotify’s data. This missing data doesn’t make the conclusions drawn in the campaign inaccurate, but it is additional context that can affect how you interpret the findings and make decisions based on the data, such as which artist’s album you might buy yourself to celebrate the new year. 

What can you do about missing data?

To mitigate missing data in your communications, it’s vital to be precise and consistent. Communicate exactly what is and is not covered in your report. If it makes sense, you could even share the different levels of disaggregation used throughout the data analysis process that led to the results discussed in the final communication. 

Consistency in the visualizations that you choose to include, as well as the data spans covered in your report, can make it easier to identify when something has changed since the last report. Changes can be due to missing data, but can also be due to something else that might not be identified if the visualizations in the report can’t be compared with each other.

If you do change the format of the report, consider continuing to provide the old format alongside the new format. Alternately, highlight what is different from the last report, why you made changes, and clearly discuss whether comparisons can or cannot be made with older formats of the report to avoid unintentional errors. 

If you know data is missing, you can discuss it in the report, share why the missing data does or doesn’t matter, and possibly choose to communicate a timeline or a business case for addressing the missing data. For example, there might be a valid business reason why data is missing, or why some visualizations have changed from the previous report. 

The 2020 edition of Spotify Wrapped will almost certainly include different types of data points due to the effects of the global pandemic on people’s listening habits. Adding context to why data is missing, or why the report has changed, can add confidence in the face of missing data—people now understand why data is missing or has changed. 

Often when you’re communicating data, you’re including detailed visualizations of the results of data analysis processes. The next post in this series covers how data can go missing from visualizations, and what to do about it: Visualize the data: How missing data biases data-driven decisions.

Decide with the data: How missing data biases data-driven decisions

This is the second post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

Any decision based solely on the results of data analysis is missing data—the non-quantitative kind. But data can also go missing from data-driven decisions as a result of the analysis process. Exhaustive data analysis and universal data collection might seem like the best way to prevent missing data, but it’s not realistic, feasible, or possible. So what can you do about the possible bias introduced by missing data? 

In this post, I’ll cover the following:

  • What to do if you must make a decision with missing data
  • How data goes missing at the decision stage
  • What to do about missing data at the decision stage
  • How much missing data matters
  • How much to care about missing data before making your decision

Missing data in decisions from the Oregon Health Authority

In the midst of a global pandemic, we’ve all been struggling to evaluate how safe it is to resume pre-pandemic activities like going to the gym, going out to bars, sending our kids back to school, or eating at restaurants. In the United States, state governors are the ones tasked with making decisions about what to reopen and what to keep closed, and most are taking a data-driven approach. 

In Oregon, the decisions the state is making about what to reopen and when are based on incomplete data. As reported by Erin Ross for Oregon Public Broadcasting, the Oregon Health Authority is not collecting or analyzing data about whether or not restaurants or bars are contributing to COVID-19 case rates.

The contact tracers interviewing people who’ve tested positive for SARS-CoV-2 are not asking about bar and restaurant visits, and even if the information is shared, the data isn’t being analyzed in a way that might allow officials to identify the effect of bars and restaurants on coronavirus case rates. Although this data is missing, officials are making decisions about whether or not bars and restaurants should remain open for indoor operations. 

Oregon and the Oregon Health Authority aren’t alone in this limitation. We’re in the midst of a pandemic, and everyone is doing the best they can with the limited resources and information that they have. Complete data isn’t always possible, especially when real-time data matters. So what can Oregon (and you) do to make sure that missing data doesn’t negatively affect the decision being made? 

If circumstances allow, it’s best to try to narrow the scope of your decision. Limit your decision to those groups and situations about which you have complete or representative data. If you can’t limit your decision, such as is the case with this pandemic, you can still make a decision with incomplete data. 

Acknowledge that your decision is based on limited data, identify the gaps in your knowledge, and make plans to address those gaps as soon as possible. You can address the missing data by collecting more data, analyzing your existing data differently, or by reexamining the relevance of various data points to the decision that you’re making. 

This is one key example of how missing data can affect a decision-making process. How else can data go missing when making decisions? 

How can data go missing?

Data-driven decisions are especially vulnerable to the effects of missing data. Because data-driven decisions are based on the results of data analysis, the effects of missing data in the earlier data analysis stages are compounded. For example:

  • The reports that you reviewed before making your decision included the prettiest graphs instead of the most useful visualizations to help you make your decision. 
  • The visualizations in the report were different from the ones in the last report, making it difficult for you to compare the new results with the results in the previous report. 
  • The data being collected doesn’t include the necessary details for your decision. This is what is happening in Oregon, where the collected data doesn’t include all the details that are relevant when making decisions about what businesses and organizations to reopen. 
  • The data analysis that was performed doesn’t actually answer the question that you’re asking. If you need to know “how soon can we reopen indoor dining and bars despite the pandemic”, and the data analysis being performed can only tell you “what are the current infection rates, by county, based on the results of tests administered 5 days ago”, the decisions that you’re making might be based on incomplete data.

What can you do about missing data?

Identify if the missing data matters to your decision. If it does, you must acknowledge that the data is missing when you make your decision. If you don’t have data about a group, intentionally exclude that group from your conclusions or decision-making process. Constrain your decision according to what you know, and acknowledge the limitations of the analysis.

If you want to make broader decisions, you must address the missing data throughout the rest of the process! If you aren’t able to immediately collect missing data, you can attempt to supplement the data with a dedicated survey aimed at gathering that information before your decision. You can also investigate to find out if the data is already available in a different format or context—for example in Oregon, where the Health Authority might already have some information about indoor restaurant and bar attendance in the contact tracing interviews but just isn’t analyzing it systematically. If the data is representative even if it isn’t comprehensive, you can still use it to supplement your decision. 

To make sure you’re making an informed decision, ask questions about the data analysis process that led to the results you’re reviewing. Discuss whether or not data could be missing from the results of the analysis presented to you, and why. Ask yourself: does the missing data affect the decision that I’m making? Does it affect the results of the analysis presented to me? Evaluate how much missing data matters to your decision-making process. 

You don’t always need more data

You will always be missing some data. It’s important to identify when the data that is missing is actually relevant to your analysis process, and when it won’t change the outcome. Acknowledge when additional data won’t change your conclusions. 

You don’t need all possible data in existence to support a decision. As Douglas Hubbard points out in his book How to Measure Anything, the goal of a data analysis process is to reduce your uncertainty about the right approach to take. 

If additional data, or more detailed analysis, won’t further reduce any uncertainty, then it’s likely unnecessary. The more clearly you constrain your decisions, and the questions you use to guide your data analysis, the more easily you can balance reducing data gaps and making a decision with the data and analysis results you have. 

USCIS doesn’t allow missing data, even if it doesn’t affect the decision

Sometimes, missing data doesn’t affect the decision that you’re making. This is why you must understand the decision you’re making, and how important comprehensive data is to that decision. If the missing data truly doesn’t matter, you want to make sure your policies acknowledge that reality. 

In the case of the U.S. Citizenship and Immigration Services (USCIS), their policies don’t seem to recognize that some kinds of missing data for citizenship applications are irrelevant. 

In an Opinions column by Washington Post columnist Catherine Rampell, The Trump administration’s no-blanks policy is the latest Kafkaesque plan designed to curb immigration, she describes the “no blanks” policy applied to immigration applications, and now, to third-party documents included with the applications. 

“Last fall, U.S. Citizenship and Immigration Services introduced perhaps its most arbitrary, absurd modification yet to the immigration system: It began rejecting applications unless every single field was filled in, even those that obviously did not pertain to the applicant.

“Middle name” field left blank because the applicant does not have a middle name? Sorry, your application gets rejected. No apartment number because you live in a house? You’re rejected, too.

No address given for your parents because they’re dead? No siblings named because you’re an only child? No work history dates because you’re an 8-year-old kid?

All real cases, all rejected.”

In this example, missing data is deemed a problem for making a decision about the citizenship application for a person—even when the data that is missing is supposed to be missing because it doesn’t exist. When asked for comment,

“a USCIS spokesperson emailed, “Complete applications are necessary for our adjudicators to preserve the integrity of our immigration system and ensure they are able to confirm identities, as well as an applicant’s immigration and criminal history, to determine the applicant’s eligibility.””

Missing data alone is not enough to affect your decision—only missing data that affects the results of your decision. A lack of data is not itself a problem—the problem is when data that is relevant to your decision is missing. That’s how bias gets introduced to a data-driven decision.

In the next post in this series, I’ll explore some ways that data can go missing when the results of data analysis are communicated: Communicate the data: How missing data biases data-driven decisions

What’s missing? Reduce bias by addressing data gaps in your analysis process

We live in an uncertain world. Facing an ongoing global pandemic, worsening climate change, persistent threats to human rights, and the more mundane uncertainties of our day-to-day lives, we try to use data to make sense of it all. Relying on data to guide decisions can feel safe. 

But you might not be able to trust your data-driven decisions if data is missing from your data analysis process. If you can identify and address gaps in your data analysis process, you can reduce the bias introduced by missing data in your data-driven decisions, regaining your confidence and certainty while ensuring you limit possible harm. 

This is post 1 of 8 in a series about how missing data can negatively affect a data analysis process. 

In this post, I’ll cover:

  • What is missing data? 
  • Why missing data matters
  • What’s missing from all data-driven decisions
  • The stages of a data analysis process

I hope this series inspires you and prepares you to take action to address bias introduced by missing data in your own data analysis processes. At the very least, I hope you gain a new perspective when evaluating your data-driven decisions, the success of data analysis processes, and how you frame a data analysis process from the start.

What is missing data?

Data can go missing in many ways. If you’re not collecting data, or don’t have access to some kinds of data, or if you can’t use existing data for a specific data analysis process—that data is missing from your analysis process. 

Other data might not be accessible to you for other reasons. Throughout this series, I’ll use the term “missing data” to refer to data that does not exist, data that you do not have access to, and data that is obscured by your analysis process—effectively missing, even if not literally gone. 

Why missing data matters

Missing data matters because it can easily introduce bias into the results of a data analysis process. Biased data analysis is often framed in the context of machine learning models, training datasets, or inscrutable and biased algorithms leading to biased decisions. 

But you can draw biased and inaccurate conclusions from any data analysis process, regardless of whether machine learning or artificial intelligence is involved. As Meg Miller makes clear in her essay Finding the Blank Spots in Data for Eye on Design, “Artists and designers are working to address a major problem for marginalized communities in the data economy: ‘If the data does not exist, you do not exist.’” And that’s just one part of why missing data matters. 

You can identify the possible biases in your decisions if you can identify the gaps in your data and data analysis process. And if you can recognize those possible biases, you can do something to mitigate them. But first we need to acknowledge what’s missing from every data-driven decision. 

What’s missing from all data-driven decisions?

It feels safe to make a data-driven decision. You’ve performed a data analysis process and have a list of results matched up with objectives that you want to achieve. It’s easy to equate data with neutral facts. But we can’t actually use data for every decision, and we can’t rely only on data for a decision-making process. Data can’t capture the entirety of an experience—it’s inherently incomplete.

Data only represents what can be quantified. Howard Zinn writes about the incompleteness of data in representing the horrors of slavery in A People’s History of the United States:

“Economists or cliometricians (statistical historians) have tried to assess slavery by estimating how much money was spent on slaves for food and medical care. But can this describe the reality of slavery as it was to a human being who lived inside it? Are the conditions of slavery as important as the existence of slavery?” 

“But can statistics record what it meant for families to be torn apart, when a master, for profit, sold a husband or a wife, a son or a daughter?” 

(pg 172, emphasis original). 

Statistical historians and others can attempt to quantify the effects of slavery based on the records available to them, but parts of that story can never be quantified. The parts that can’t be quantified must be told, and must be considered when creating a historical record and, of course, in deciding whose story gets told and how.

What data is available, and from whom, represents an implicit value and power structure in society as well. If data has been collected about something, and made available to others, then that information must be important—whether to an organization, a society, a government, or a world—and the keepers of the data had the privilege and the power to maintain it and make it available after it was collected. 

This power structure, this value structure, and the limitations of data alone when making decisions are crucial to consider in this era of seemingly-objective data-driven decisions. Because data alone isn’t enough to capture the reality of a situation, it isn’t enough to drive the decisions you make in our uncertain world. And that’s only the beginning of how missing data can affect decisions.

Data can go missing at any stage of the data analysis process

It’s easy to consider missing data as solely a data collection problem—if the dataset existed, or new data was collected, no data would be missing and so we can make better data-driven decisions. In fact, avoiding missing data when you’re collecting it is just one way to reduce bias in your data-driven decisions—it’s far from the only way.

Data can go missing at any stage of the data analysis process and bias your resulting decisions. Each post in this series addresses a different stage of the process. 

  1. 🗣 Make a decision based on the results of the data analysis. Decide with the data: How missing data biases data-driven decisions.
  2. 📋 Communicate the results of the data analysis. Communicate the data: How missing data biases data-driven decisions
  3. 📊 Visualize the data to represent the answers to your questions. Visualize the data: How missing data biases data-driven decisions.
  4. 🔎 Analyze the data to answer your questions. Analyze the data: How missing data biases data-driven decisions.
  5. 🗂 Manage the data that you’ve collected to make it easier to analyze. Manage the data: How missing data biases data-driven decisions
  6. 🗄 Collect the data you need to answer the questions you’ve defined. Collect the data: How missing data biases data-driven decisions.
  7. 🙋🏻‍♀️ Define the question that you want to ask the data. Define the question: How missing data biases data-driven decisions

In each post, I’ll discuss real world examples of how data can go missing, and what you can do about it!

Listening to Music while Sheltering in Place

The world is, to varying degrees, sheltering-in-place during this global coronavirus pandemic. Starting in March, the pandemic started to affect me personally: 

  • I started working from home on March 6th. 
  • Governor Gavin Newsom announced on March 11 that any gatherings over 250 people were strongly discouraged, effectively cancelling all concerts for the month of March. 
  • On March 16th, the mayor of San Francisco, along with several other counties in the area, announced a shelter-in-place order. 

Ever since then, I’ve been at home. Given all these changes in my life, I was curious what new patterns I might see in my music listening habits. 

With large gatherings prohibited, I went to my last concert on March 7th. With gatherings increasingly cancelled nationwide, and touring musicians postponing and cancelling events, Beatport hosted the first livestream festival, “ReConnect. A Global Music Series”, on March 27th. Many more followed. 

Industry-wide studies and data analysis have attempted to unpack various trends in the pandemic’s influence on the music industry. Analytics startup Chartmetric is digging into genre-based listening and geographical listening habits, and Billboard and Nielsen are conducting a periodic entertainment tracker survey.

Because I’m me, and I have so much data about my music listening patterns, I wanted to explore what trends might be emerging in my personal habits. I analyzed the months of March, April, and May during 2020, and in some cases compared that period against the same period in 2019, 2018, and 2017. The screenshots of data visualizations in this blog post represent data points through May 15th, so it is an incomplete analysis and comparison, given that May 2020 is not yet complete. 

Looking at my listening habits during this time period, with key dates highlighted, it’s clear that the very beginning of the crisis didn’t have much of an effect on my listening behavior. However, after the shelter-in-place order, the amount of time I spent listening to music increased. After that increase it’s remained fairly steady.

Screenshot of an area chart depicting listening duration ranging from 100 minutes, with a couple of spikes of 500 minutes but hovering around a max of 250 minutes per day for much of January and February, then starting in March a new range of about 250 to 450 minutes per day, with a couple of outliers of nearly 700 minutes of listening activity and a couple of outliers with only about 90 minutes of listening activity.

Key dates such as the first case in the United States, the first case in California, and the first case in the Bay Area are highlighted along with other pandemic-relevant dates.

Listening behavior during March, April, and May over time

When I started my analysis, I looked at my basic listening count from traditional music listening sources. I use Last.fm to scrobble my listening behavior in iTunes, Spotify, and the web from sites like YouTube, SoundCloud, Bandcamp, Hype Machine, and more. 

Chart depicting 2700 total listens for 2017, 2000 total listens for 2018, and 2300 total listens for 2019 during March, April, and May, compared to 3000 total listens in that same period in 2020.

If you just look at 2018 to 2020, it seems like my listening habits are trending upward, maybe with a culmination in 2020. But comparing against 2017, it isn’t much of a difference. I listened to 25% fewer tracks in 2018 compared with 2017, 19% more tracks in 2019 compared with 2018, and 25% more tracks in 2020 compared with 2019. 

Chart depicting total weekday listens during March, April, and May during 2017, 2018, 2019, and 2020 with total weekend listens during the same time. 2017 shows roughly 2400 weekday listens and 200ish weekend listens, 2000 weekday listens vs 100 weekend listens for 2018, 2100 weekday listens vs 300 weekend listens in 2019, and 2500 weekday listens vs 200 weekend listens in 2020

If I break that down by when I was listening, comparing my weekend and weekday listening habits from the previous 3 years to now, there’s still perhaps a bit of an increase, but nothing much. 

With just the data points from Last.fm, there aren’t really any notable patterns. But the number of tracks listened to on Spotify, SoundCloud, YouTube, or iTunes provides an incomplete perspective of my listening habits. If I expand the data I’m analyzing to include other types of listening—concerts attended and livestreams watched—and change the data point that I’m analyzing to the amount of time that I spend listening, instead of the number of tracks that I’ve listened to, it gets a bit more interesting. 

Chart shows roughly 12000 minutes spent listening in 2017, 10000 in 2018, 12000 in 2019, and 22000 in 2020

While the number of tracks I listened to from 2019 to 2020 increased only 25%, the amount of time I spent listening to music increased by 74%, a full 150 hours more than the previous year during this time period. And May isn’t even over yet! 

It’s worth briefly noting that I’m estimating, rather than directly calculating, the amount of time spent listening to music tracks and attending live music events. To make this calculation, I’m using an estimate of 3 hours for each concert attended, 4 hours for each DJ set attended, 8 hours for each festival attended, and an estimate of 4 minutes for each track listened to, based on the average of all the tracks I’ve purchased over the past two years. Livestreamed sets are easier to track, but some of those are estimates as well because I didn’t start keeping track until the end of April.
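
Here's roughly how that estimate comes together, as a hedged Python sketch. The per-event durations are the assumptions described above, and the activity counts are placeholders rather than my actual totals:

```python
# Minimal sketch of the listening-time estimate described above. The durations
# per event type follow the assumptions in the text; the counts are placeholders.
MINUTES_PER_EVENT = {
    "track": 4,         # average length of the tracks I've purchased
    "concert": 3 * 60,
    "dj_set": 4 * 60,
    "festival": 8 * 60,
}

def estimate_minutes(activity, livestream_minutes=0):
    """Sum estimated minutes per activity type, plus directly-tracked livestream time."""
    estimated = sum(MINUTES_PER_EVENT[kind] * count for kind, count in activity.items())
    return estimated + livestream_minutes

print(estimate_minutes({"track": 3000, "concert": 1, "dj_set": 0, "festival": 0},
                       livestream_minutes=9000))
```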

I spent an extra 150 hours listening to music during this period, but when was I spending that time? If I break down the amount of time I spent listening by weekend compared with weekdays, it's obvious:

Chart depicts 10000 weekday minutes and 5000 weekend minutes spent listening in 2017, 9500 weekday minutes and 4500 weekend minutes in 2018, 14000 weekday minutes and 2000 weekend minutes in 2019, and 12000 weekday minutes and 13000 weekend minutes in 2020

Before shelter-in-place, I'd spend most of my weekends outside, hanging out with friends, or attending concerts, DJ sets, and the occasional day party. Now that I'm spending my weekends largely inside and at home, and with the number of livestreamed festivals available, I'm spending much more of that time listening to music.

I was curious if perhaps working from home might reveal new weekday listening habits too, but the pattern remains fairly consistent. I also haven’t worked from home for an extended period before, so I don’t have a baseline to compare it with. 

It's clear that weekends are when I'm doing most of my new listening, and that this new listening likely isn't coming from my traditional listening habits. If I split the amount of time I spend listening to music by the type of listening I'm doing, the source of the added time becomes obvious.

Chart depicts 11000 minutes of track listens and 1000 minutes spent at concerts in 2017, 8000 minutes spent listening to music tracks and 2000 minutes spent at concerts in 2018, 10000 minutes spent listening to music tracks and 3000 minutes spent at concerts in 2019, and 12000 minutes spent listening to music tracks and 9000 minutes listening to livestreams, with a sliver of 120 minutes spent at a single concert in 2020.

Hello, livestreams. If you look closely you can also spy the sliver of a concert that I attended on March 7th.
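The by-type split itself is just another aggregation. A rough sketch, assuming the estimated minutes are already in a DataFrame with year, type, and minutes columns (tracks, concerts, livestreams, and so on):

```python
# Sketch: total estimated minutes by year and listening type.
import pandas as pd

def minutes_by_type(listening: pd.DataFrame) -> pd.DataFrame:
    return (
        listening.groupby(["year", "type"])["minutes"]
        .sum()
        .unstack(fill_value=0)  # one column per listening type
    )
```

Pivoting the type out into columns makes it easy to chart each year as a stacked bar, which is roughly what the chart above shows.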

Livestreams dominate, and so does Shazam

The livestreams I've been watching have primarily been DJ sets. Ordinarily, when I'm at a DJ set, I spend a good amount of time Shazamming the tracks I'm hearing. I want to identify the tracks that I'm enjoying so much on the dancefloor so I can track them down, buy them, and dig into those artists' back catalogs.

So I requested my Shazam data to see what’s happening now that I’m home, with unlimited, shameless, and convenient access to Shazam.

For the time period that I have Shazam data for, the ratio of Shazam activity to the number of livestreams watched is fairly consistent, at roughly 10 successful Shazams per livestream.

Chart details are largely duplicated in the surrounding text, but of note is a spike of 6 livestreams with only 30 or so songs Shazammed, while the next few weeks show a fairly tight correspondence between Shazam activity and the number of livestreams.
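The per-livestream ratio is simple to compute. A sketch, assuming hypothetical weekly exports of Shazam tags and livestreams watched, each with a "week" column:

```python
# Sketch: rough Shazams-per-livestream ratio by week.
import pandas as pd

def shazams_per_livestream(shazams: pd.DataFrame, livestreams: pd.DataFrame) -> pd.Series:
    tags = shazams.groupby("week").size()       # Shazam tags per week
    streams = livestreams.groupby("week").size() # livestreams watched per week
    return (tags / streams).dropna().round(1)
```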

Given that correspondence in the Shazam data, as well as my continued focus on watching DJ sets, I wanted to explore my artist discovery statistics as well. Because it seemed like my overall listening activity hadn't shifted much, I was betting that my artist discovery numbers had been increasing during this time. If I look at just the past few years, there does seem to be an increase during this time period.

Chart depicts 260ish artists discovered in March, April, and May of 2018, 280 discovered in 2019, and 360 discovered in 2020. A second chart shows the same data but adds 2017, with 390 artists discovered.

However, after I add 2017 to the list as well, the pattern doesn't seem like much of a pattern at all. Perhaps by the end of May there will be a clearer correlation or an outsized increase. But at least for now, the additional livestreams I've been watching don't seem to be producing an equivalently high number of artist discoveries, even though discoveries are elevated compared with the last two years.

It could also be that the artists I'm discovering in the livestreams haven't yet had a substantial effect on my non-livestream listening patterns, even if there are 91 hours of music (and counting) in my quarandjed playlist, where I store the tracks that catch my ear during a quarantine DJ set. Adding music to a playlist, of course, is not the same thing as listening to it.
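One reasonable way to count "discovered" artists is to look for artists whose first-ever scrobble in the data falls inside the March-through-May window; that's what this sketch assumes, again using a scrobbles DataFrame with artist and timestamp columns:

```python
# Sketch: count artists whose first-ever scrobble lands in March-May of a given year.
import pandas as pd

def discoveries(scrobbles: pd.DataFrame, year: int) -> int:
    ts = pd.to_datetime(scrobbles["timestamp"], unit="s")
    # Earliest scrobble per artist across the whole dataset
    first_listen = ts.groupby(scrobbles["artist"]).min()
    in_window = (
        (first_listen.dt.year == year)
        & first_listen.dt.month.isin([3, 4, 5])
    )
    return int(in_window.sum())
```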

Livestreaming as concert replacement?

Shelter-in-place brought with it a slew of event cancellations and postponements, and my live events calendar was severely affected. As of now, 15 concerts have been affected in the following ways:

Chart depicts 6 concerts cancelled and 9 postponed

The amount of time that I spend at concerts compared with watching livestreams is also starkly different.

Chart depicts 1000 minutes spent at concerts in 2017, 2000 minutes at concerts in 2018, 2500 minutes at concerts in 2019, and 8000 minutes spent watching livestreams, with a topper of 120 minutes at a concert in 2020

I've spent 151 hours (and counting) watching livestreams, the rough equivalent of 50 concerts at 3 hours apiece, which matches my entire concert attendance from last year. That volume is almost certainly possible because I'm often listening to livestreams in the background, rather than actively watching them happen.

Concerts require dedication—a period of time where you can’t really do anything else, a monetary investment, and travel to and from the show. Livestreams don’t have any of that, save a voluntary donation. That makes it easier to turn on a stream while I’m doing other things. While listening to a livestream, I often avoid engaging with the streaming experience. Unless the chat is a cozy few hundred folks at most, it’s a tire fire of trolls and not a pleasant experience. That, coupled with the fact that sitting on my couch watching a screen is inherently less engaging than standing in a club with music and people surrounding me, means that I’m often multitasking while livestreams are happening.

The attraction for me is that these streams are live: they're an event to tune into, and if you don't, you might miss it. Because it's live, you have the opportunity to create a shared collective experience. The chatrooms that accompany live video streams on YouTube, Twitch, and especially Facebook's Watch Party feature for Facebook Live videos are what foster this shared experience. For me, it's about that experience, so much so that I started a chat thread for Jamie xx's 2020 Essential Mix so that my friends and I could experience and react to the set live. This personal experience runs contrary to the conclusion drawn in the Hypebot article Our Music Consumption Habits Are Changing, But Will They Remain That Way? by Bobby Owsinski: "Given the choice, people would rather watch something than just listen." Given the choice, I'd rather have a shared collective experience with music than sit alone on my couch and just listen to it.

Of course, with shelter-in-place, I haven’t been given a choice between attending concerts and watching livestreamed shows. It’s clear that without a choice, I’ll take whatever approximation of live music I can find.