Communicate the data: How missing data biases data-driven decisions

This is the third post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

Communicating the results of a data analysis process is crucial to making a data-driven decision. You might review results communicated to you in many ways:

  • A slide deck presented to you
  • An automatically-generated report emailed to you regularly
  • A white paper produced by an expert analysis firm that you review 
  • A dashboard full of curated visualizations
  • A marketing campaign

If you’re a data communicator, what you choose to communicate (and how you do so) can cause data to go missing and possibly bias data-driven decisions made as a result.

In this post, I’ll cover the following:

  • A marketing campaign that misrepresents the results of a data analysis process
  • A renowned white paper produced by leaders in the security space
  • How data goes missing at the communication stage
  • What you can do about missing data at the communication stage

Spotify Wrapped: Your “year” in music 

If you listen to music on Spotify, you might have joined the hordes of people who viewed and shared their Spotify Wrapped playlists and images like these last year. 

Screenshot of Spotify Wrapped graphic showing "we spend some serious time together"

Spotify Wrapped is a marketing campaign that purports to communicate your year in music, based on the results of some behind-the-scenes data analysis that Spotify performs. While it is a marketing campaign, it’s still a way that data analysis results are being communicated to others, and thus relevant to this discussion of missing data in communications. 

Screenshot of Spotify Wrapped campaign showing "You were genre-fluid"

Screenshot of Spotify Wrapped campaign showing that I discovered 1503 new artists.

In this case, you can see that my top artists, top songs, and top genre are shared with me, along with the number of artists I discovered, and even the number of minutes I spent listening to music over the last year. It’s impressive, and exciting to see my year in music summed up in such a way! 

But after I dug deeper into the data, the gaps in the communication and data analysis became apparent. What’s presented as my year in music is actually more like my “10 months in music”, because the data only represents the period from January 1st to October 31st of 2019. Two full months of music listening behavior are completely missing from my “year” in music. 

Unfortunately, these details about the dataset are missing from the Spotify Wrapped campaign report itself, so I had to search through additional resources to find out more. According to a now-archived FAQ on Spotify for Artists (thank goodness for the Wayback Machine), the data represented in the campaign covers January 1st through October 31st, 2019. I also read two key blog posts, Spotify Wrapped 2019 Reveals Your Streaming Trends, from 2010 to Now (announcing the campaign) and Unwrapping Wrapped 2019: Spotify VP of Engineering Tyson Singer Explains (digging into the data analysis behind the campaign), to learn what I could about the size of the dataset, or how many data points might be necessary to draw the conclusions shared in the report. 

Screenshot of Spotify FAQ question "Why are my 2019 Artist Wrapped stats different from the stats I see in Spotify for Artists?"

Screenshot of Spotify FAQ answering the question "How do I get my 2019 Artist Wrapped?" with the first line of the answer being "If you have music on Spotify with at least 3 listeners before October 31st 2019, you get a 2019 Wrapped!"

This is a case where data is going missing from the communication because the data is presented as though it represents an entire time period, when in fact it only represents a subset of the relevant time period. Not only that, but it’s unclear what other data points actually represent. The number of minutes spent listening to music in those 10 months could be calculated by adding up the actual amount of time I spent listening to songs on the service, but could also be an approximate metric calculated from the number of streams of tracks in the service. It’s not possible to find out how this metric is being calculated. If it is an approximate metric based on the number of streams, that’s also a case of uncommunicated missing data (or a misleading metric), because according to the Spotify for Artists FAQ, Spotify counts a song as streamed if you’ve listened to at least 30 seconds of a track. 
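
To make the ambiguity concrete, here’s a minimal sketch (in Python, with invented numbers) of the two ways a “minutes listened” metric could plausibly be computed. The 30-second threshold comes from the Spotify for Artists FAQ; the per-track playback times and the 4-minute average track length are assumptions for illustration only.

```python
# Hypothetical example: the same listening history, two different "minutes listened" metrics.
# Playback times and the 4-minute average are invented for illustration.

plays = [
    ("Track A", 245),  # seconds actually played
    ("Track B", 31),   # skipped just past the 30-second mark, but still "counts" as a stream
    ("Track C", 110),
]

AVG_TRACK_MINUTES = 4  # assumed average track length

# Exact metric: sum the time actually spent listening
exact_minutes = sum(seconds for _, seconds in plays) / 60

# Approximate metric: count qualifying streams (>= 30 seconds) and multiply by an average length
streams = sum(1 for _, seconds in plays if seconds >= 30)
approx_minutes = streams * AVG_TRACK_MINUTES

print(f"exact: {exact_minutes:.1f} minutes, approximate: {approx_minutes} minutes")
# exact: 6.4 minutes, approximate: 12 minutes
```

The same listening history produces two very different “minutes listened” numbers, which is exactly why it matters that the campaign doesn’t say which kind of calculation it uses.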

It’s likely that other data is missing from this incomplete communication as well, but a different report is a great example of how to do it right. 

Verizon DBIR: A star of data communication 

The Verizon Data Breach Investigations Report (DBIR) is an annual report put out by Verizon about, well, data breach investigations. Before the report shares any results, it includes a readable summary of the dataset used, how the analysis is performed, and what is missing from the data. Only after the limitations of the dataset and the resulting analysis are communicated are the actual results of the analysis shared. 

Screenshot of Verizon DBIR results and analysis section of the 2020 DBIR

And those results are well-presented, featuring confidence intervals, detailed titles and labels, as well as clear scales to visualize the data. I talk more about how to prevent data from going missing in visualizations in the next post of the series, Visualize the data: How missing data biases data-driven decisions.

Screenshot of 2 visualizations from the Verizon 2020 DBIR

How does data go missing?

These examples make it clear that data can easily go missing when you’re choosing how to communicate the results of data analysis: 

  • You include only some visualizations in the report — the prettiest ones, or the ones with “enough” data
  • You include different visualizations and metrics than the ones in the last report, without an explanation about why they changed, or what changed. For example, what happened in Spain in late May, 2020 as reported by the Financial Times: Flawed data casts cloud over Spain’s lockdown strategy.
  • You choose not to share or discuss the confidence intervals for your findings. 
  • You neglect to include details about the dataset, such as the size of the dataset or the representativeness of the dataset.
  • You discuss the available data as though it represents the entire problem, rather than a subset of the problem. 

For example, the Spotify Wrapped campaign presents its data as though it represents an entire year, when it reflects only 10 months, and it doesn’t include any details about the dataset beyond what you can assume—it’s based on Spotify’s data. This missing data doesn’t make the conclusions drawn in the campaign inaccurate, but it is additional context that can affect how you interpret the findings and make decisions based on the data, such as which artist’s album you might buy yourself to celebrate the new year. 

What can you do about missing data?

To mitigate missing data in your communications, it’s vital to be precise and consistent. Communicate exactly what is and is not covered in your report. If it makes sense, you could even share the different levels of disaggregation used throughout the data analysis process that led to the results discussed in the final communication. 

Consistency in the visualizations that you choose to include, as well as the data spans covered in your report, can make it easier to identify when something has changed since the last report. Changes can be due to missing data, but can also be due to something else that might not be identified if the visualizations in the report can’t be compared with each other.

If you do change the format of the report, consider continuing to provide the old format alongside the new one. Alternatively, highlight what is different from the last report, why you made changes, and clearly discuss whether comparisons can or cannot be made with older formats of the report to avoid unintentional errors. 

If you know data is missing, you can discuss it in the report, share why the missing data does or doesn’t matter, and possibly communicate a timeline or a business case for addressing it. For example, there might be a valid business reason why data is missing, or why some visualizations have changed from the previous report.

The 2020 edition of Spotify Wrapped will almost certainly include different types of data points due to the effects of the global pandemic on people’s listening habits. Adding context to why data is missing, or why the report has changed, can add confidence in the face of missing data—people now understand why data is missing or has changed. 

Often when you’re communicating data, you’re including detailed visualizations of the results of data analysis processes. The next post in this series covers how data can go missing from visualizations, and what to do about it: Visualize the data: How missing data biases data-driven decisions.

Decide with the data: How missing data biases data-driven decisions

This is the second post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

Any decision based solely on the results of data analysis is missing data—the non-quantitative kind. But data can also go missing from data-driven decisions as a result of the analysis process. Exhaustive data analysis and universal data collection might seem like the best way to prevent missing data, but it’s not realistic, feasible, or possible. So what can you do about the possible bias introduced by missing data? 

In this post, I’ll cover the following:

  • What to do if you must make a decision with missing data
  • How data goes missing at the decision stage
  • What to do about missing data at the decision stage
  • How much missing data matters
  • How much to care about missing data before making your decision

Missing data in decisions from the Oregon Health Authority

In the midst of a global pandemic, we’ve all been struggling to evaluate how safe it is to resume pre-pandemic activities like going to the gym, going out to bars, sending our kids back to school, or eating at restaurants. In the United States, state governors are the ones tasked with making decisions about what to reopen and what to keep closed, and most are taking a data-driven approach. 

In Oregon, decisions about what to reopen and when are based on incomplete data. As reported by Erin Ross for Oregon Public Broadcasting, the Oregon Health Authority is not collecting or analyzing data about whether or not restaurants and bars are contributing to COVID-19 case rates.

The contact tracers interviewing people who’ve tested positive for SARS-CoV-2 are not asking, and even if the information is shared, the data isn’t being analyzed in a way that might allow officials to identify the effect of bars and restaurants on coronavirus case rates. Although this data is missing, officials are making decisions about whether or not bars and restaurants should remain open for indoor operations. 

Oregon and the Oregon Health Authority aren’t alone in this limitation. We’re in the midst of a pandemic, and everyone is doing the best they can with the limited resources and information that they have. Complete data isn’t always possible, especially when real-time data matters. So what can Oregon (and you) do to make sure that missing data doesn’t negatively affect the decision being made? 

If circumstances allow, it’s best to try to narrow the scope of your decision. Limit your decision to those groups and situations about which you have complete or representative data. If you can’t limit your decision, such as is the case with this pandemic, you can still make a decision with incomplete data. 

Acknowledge that your decision is based on limited data, identify the gaps in your knowledge, and make plans to address those gaps as soon as possible. You can address the missing data by collecting more data, analyzing your existing data differently, or by reexamining the relevance of various data points to the decision that you’re making. 

This is one key example of how missing data can affect a decision-making process. How else can data go missing when making decisions? 

How can data go missing?

Data-driven decisions are especially vulnerable to the effects of missing data. Because data-driven decisions are based on the results of data analysis, the effects of missing data in the earlier data analysis stages are compounded. 

  • The reports that you reviewed before making your decision included the prettiest graphs instead of the most useful visualizations to help you make your decision. 
  • The visualizations in the report were different from the ones in the last report, making it difficult for you to compare the new results with the results in the previous report. 
  • The data being collected doesn’t include the necessary details for your decision. This is what is happening in Oregon, where the collected data doesn’t include all the details that are relevant when making decisions about what businesses and organizations to reopen. 
  • The data analysis that was performed doesn’t actually answer the question that you’re asking. If you need to know “how soon can we reopen indoor dining and bars despite the pandemic”, and the data analysis being performed can only tell you “what are the current infection rates, by county, based on the results of tests administered 5 days ago”, the decisions that you’re making might be based on incomplete data.

What can you do about missing data?

Identify if the missing data matters to your decision. If it does, you must acknowledge that the data is missing when you make your decision. If you don’t have data about a group, intentionally exclude that group from your conclusions or decision-making process. Constrain your decision according to what you know, and acknowledge the limitations of the analysis.

If you want to make broader decisions, you must address the missing data throughout the rest of the process! If you aren’t able to immediately collect missing data, you can attempt to supplement the data with a dedicated survey aimed at gathering that information before your decision. You can also investigate whether the data is already available in a different format or context—for example in Oregon, where the Health Authority might already have some information about indoor restaurant and bar attendance in the contents of contact tracing interviews and just isn’t analyzing it systematically. If the data is representative even if it isn’t comprehensive, you can still use it to supplement your decision. 

To make sure you’re making an informed decision, ask questions about the data analysis process that led to the results you’re reviewing. Discuss whether or not data could be missing from the results of the analysis presented to you, and why. Ask yourself: does the missing data affect the decision that I’m making? Does it affect the results of the analysis presented to me? Evaluate how much missing data matters to your decision-making process. 

You don’t always need more data

You will always be missing some data. It’s important to identify when the data that is missing is actually relevant to your analysis process, and when it won’t change the outcome. Acknowledge when additional data won’t change your conclusions. 

You don’t need all possible data in existence to support a decision. As Douglas Hubbard points out in his book How to Measure Anything, the goal of a data analysis process is to reduce your uncertainty about the right approach to take. 

If additional data, or more detailed analysis, won’t further reduce any uncertainty, then it’s likely unnecessary. The more clearly you constrain your decisions, and the questions you use to guide your data analysis, the more easily you can balance reducing data gaps and making a decision with the data and analysis results you have. 

USCIS doesn’t allow missing data, even if it doesn’t affect the decision

Sometimes, missing data doesn’t affect the decision that you’re making. If that’s true, you want to make sure your policies acknowledge that reality, which is why you must understand the decision you’re making and how important comprehensive data is to it. 

In the case of the U.S. Citizenship and Immigration Services (USCIS), the agency’s policies don’t seem to recognize that some kinds of missing data on citizenship applications are irrelevant. 

In an Opinions column by Washington Post reporter Catherine Rampell, The Trump administration’s no-blanks policy is the latest Kafkaesque plan designed to curb immigration, she describes the “no blanks” policy applied to immigration applications, and now, to third-party documents included with the applications. 

“Last fall, U.S. Citizenship and Immigration Services introduced perhaps its most arbitrary, absurd modification yet to the immigration system: It began rejecting applications unless every single field was filled in, even those that obviously did not pertain to the applicant.

“Middle name” field left blank because the applicant does not have a middle name? Sorry, your application gets rejected. No apartment number because you live in a house? You’re rejected, too.

No address given for your parents because they’re dead? No siblings named because you’re an only child? No work history dates because you’re an 8-year-old kid?

All real cases, all rejected.”

In this example, missing data is deemed a problem for making a decision about the citizenship application for a person—even when the data that is missing is supposed to be missing because it doesn’t exist. When asked for comment,

“a USCIS spokesperson emailed, “Complete applications are necessary for our adjudicators to preserve the integrity of our immigration system and ensure they are able to confirm identities, as well as an applicant’s immigration and criminal history, to determine the applicant’s eligibility.””

Missing data alone is not enough to affect your decision—only missing data that affects the results of your decision. A lack of data is not itself a problem—the problem is when data that is relevant to your decision is missing. That’s how bias gets introduced into a data-driven decision.

In the next post in this series, I’ll explore some ways that data can go missing when the results of data analysis are communicated: Communicate the data: How missing data biases data-driven decisions

What’s missing? Reduce bias by addressing data gaps in your analysis process

We live in an uncertain world. Facing an ongoing global pandemic, worsening climate change, persistent threats to human rights, and the more mundane uncertainties of our day-to-day lives, we try to use data to make sense of it all. Relying on data to guide decisions can feel safe. 

But you might not be able to trust your data-driven decisions if data is missing from your data analysis process. If you can identify and address gaps in your data analysis process, you can reduce the bias introduced by missing data in your data-driven decisions, regaining your confidence and certainty while ensuring you limit possible harm. 

This is post 1 of 8 in a series about how missing data can negatively affect a data analysis process. 

In this post, I’ll cover:

  • What is missing data? 
  • Why missing data matters
  • What’s missing from all data-driven decisions
  • The stages of a data analysis process

I hope this series inspires you and prepares you to take action to address bias introduced by missing data in your own data analysis processes. At the very least, I hope you gain a new perspective when evaluating your data-driven decisions, the success of data analysis processes, and how you frame a data analysis process from the start.

What is missing data?

Data can go missing in many ways. If you’re not collecting data, or don’t have access to some kinds of data, or if you can’t use existing data for a specific data analysis process—that data is missing from your analysis process. 

Other data might not be accessible to you for other reasons. Throughout this series, I’ll use the term “missing data” to refer to data that does not exist, data that you do not have access to, and data that is obscured by your analysis process—effectively missing, even if not literally gone. 

Why missing data matters

Missing data matters because it can easily introduce bias into the results of a data analysis process. Biased data analysis is often framed in the context of machine learning models, training datasets, or inscrutable and biased algorithms leading to biased decisions. 

But you can draw biased and inaccurate conclusions from any data analysis process, regardless of whether machine learning or artificial intelligence is involved. As Meg Miller makes clear in her essay Finding the Blank Spots in Data for Eye on Design, “Artists and designers are working to address a major problem for marginalized communities in the data economy: ‘If the data does not exist, you do not exist.’” And that’s just one part of why missing data matters. 

You can identify the possible biases in your decisions if you can identify the gaps in your data and data analysis process. And if you can recognize those possible biases, you can do something to mitigate them. But first we need to acknowledge what’s missing from every data-driven decision. 

What’s missing from all data-driven decisions?

It feels safe to make a data-driven decision. You’ve performed a data analysis process and have a list of results matched up with objectives that you want to achieve. It’s easy to equate data with neutral facts. But we can’t actually use data for every decision, and we can’t rely only on data for a decision-making process. Data can’t capture the entirety of an experience—it’s inherently incomplete.

Data only represents what can be quantified. Howard Zinn writes about the incompleteness of data in representing the horrors of slavery in A People’s History of the United States:

“Economists or cliometricians (statistical historians) have tried to assess slavery by estimating how much money was spent on slaves for food and medical care. But can this describe the reality of slavery as it was to a human being who lived inside it? Are the conditions of slavery as important as the existence of slavery?” 

“But can statistics record what it meant for families to be torn apart, when a master, for profit, sold a husband or a wife, a son or a daughter?” 

(pg 172, emphasis original). 

Statistical historians and others can attempt to quantify the effects of slavery based on the records available to them, but parts of that story can never be quantified. The parts that can’t be quantified must be told, and must be considered when creating a historical record and, of course, in deciding whose story gets told and how.

What data is available, and from whom, represents an implicit value and power structure in society as well. If data has been collected about something, and made available to others, then that information must be important—whether to an organization, a society, a government, or a world—and the keepers of the data had the privilege and the power to maintain it and make it available after it was collected. 

This power structure, this value structure, and the limitations of data alone when making decisions are crucial to consider in this era of seemingly-objective data-driven decisions. Because data alone isn’t enough to capture the reality of a situation, it isn’t enough to drive the decisions you make in our uncertain world. And that’s only the beginning of how missing data can affect decisions.

Data can go missing at any stage of the data analysis process

It’s easy to consider missing data as solely a data collection problem—if the dataset existed, or new data were collected, no data would be missing and we could make better data-driven decisions. In fact, avoiding missing data when you’re collecting it is just one way to reduce bias in your data-driven decisions—it’s far from the only way.

Data can go missing at any stage of the data analysis process and bias your resulting decisions. Each post in this series addresses a different stage of the process. 

  1. 🗣 Make a decision based on the results of the data analysis. Decide with the data: How missing data biases data-driven decisions.
  2. 📋 Communicate the results of the data analysis. Communicate the data: How missing data biases data-driven decisions
  3. 📊 Visualize the data to represent the answers to your questions. Visualize the data: How missing data biases data-driven decisions.
  4. 🔎 Analyze the data to answer your questions. Analyze the data: How missing data biases data-driven decisions.
  5. 🗂 Manage the data that you’ve collected to make it easier to analyze. Manage the data: How missing data biases data-driven decisions
  6. 🗄 Collect the data you need to answer the questions you’ve defined. Collect the data: How missing data biases data-driven decisions.
  7. 🙋🏻‍♀️ Define the question that you want to ask the data. Define the question: How missing data biases data-driven decisions

In each post, I’ll discuss real world examples of how data can go missing, and what you can do about it!

How do you make large scale harm visible on the individual level?

Teams that build security and privacy tools like Brave Browser, Tor Browser, Signal, Telegram, and others focus on usability and feature parity of these tools in an effort to more effectively acquire users from Google Chrome, iMessage, Google Hangouts, WhatsApp, and others. 

Do people fail to adopt these more secure and private tools because they aren’t as usable as what they’re already using, or because it requires too much effort to switch?

I mean, of course it’s both. You need to make the effort to switch, and in order to switch you need viable alternatives to switch to. And that’s where the usability and feature parity of Brave Browser and Signal compared with Google Chrome and WhatsApp come in. 

But if we’re living in a world where feature parity and usability are a foregone conclusion, and we are, then what? What needs to happen to drive a large-scale shift away from data-consuming and privacy-invading tools and toward those that don’t collect data and aggressively encrypt our messages? 

To me, that’s where it becomes clear that the amorphous effects of widespread data collection—though well-chronicled in blog posts, books, and shows like The Social Dilemma— don’t often lead to real change unless a personal threat is felt. 

Marginalized and surveilled communities adopt tools like Signal or FireChat in order to protect their privacy and security, because their privacy and security are actively under threat. For others, their privacy and security are still under threat, but indirectly. Lacking a single (or a series of) clear events tied to direct personal harm, people don’t often abandon a platform. 

If I don’t see how using Google Chrome, YouTube, Facebook, Instagram, Twitter, and other sites and tools causes direct harm to me, I have little incentive to make a change, despite the evidence of aggregate harm to society—amplified societal divisions, active disinformation campaigns, and more. 

Essays that expose the “dark side” of social media and algorithms make an attempt to identify distinct personal harms caused by these systems. Essays like James Bridle’s piece on YouTube, Something is wrong on the internet (2017), Adrian Chen’s piece about what social media content moderators experience, The Laborers Who Keep Dick Pics and Beheadings Out of Your Facebook Feed (2014), or Casey Newton’s about the same, The secret lives of Facebook moderators in America (2019), gain widespread attention for the problems they expose, but don’t necessarily lead to people abandoning the platforms, nor lead the platforms themselves to take action. 

These theorists and journalists are making a serious attempt to make large-scale harm caused by these platforms visible on an individual level, but nothing is changing. Is it the fault of the individual, or the platform?

Spoilers, it’s always “both”. And here we can draw an analogy to climate change too. As with climate change, the effects resulting from these platforms and companies are so amorphous that it’s possible to point to alternate explanations—for a time. Dramatically worsening wildfires in the Western United States are blamed on poor fire policy; worsening tropical storms are blamed on weaker wind patterns (or stronger ones? I don’t study wind). 

One could argue that perhaps climate change is the result of mechanization and industrialization in general, and it would be happening without the companies currently contributing to it. Perhaps the dark side of the internet is just the dark side of reality, and nothing worse than would exist without these platforms and companies contributing. 

The truth is, it’s both. We live in a “yes, and” world. Climate change is causing, contributing to, and intensifying the effects of wildfires and the strength and frequency of tropical storms and hurricanes. Platform algorithms are causing, contributing to, and intensifying the effects of misinformation campaigns and violence on social media and the internet. 

And much like the companies that contributed to climate change knew what was happening, as reported in The Guardian: Shell and Exxon’s secret 1980s climate change warnings, Facebook, Google, and others know that their algorithms are actively contributing to societal harm—but the companies aren’t doing enough about it. 

So what’s next? 

  • Do we continue to attempt to make the individual feel the pain of the community in an effort to cause individual change? 
  • Do we use laws and policy to constrain the use of algorithms for specific purposes, in an effort to regulate the effects away?
  • Do we build alternate tools with the same functionality and take users away from the harm-causing tools? 
  • Do we use our power as laborers to strike against the harm caused by the tools that we build? 

With climate change (and too with data security and privacy), we’re already taking all of these approaches. What else might be out there? What else can we do to lead to change? 

Repersonalizing Digital Communications: Against Standardizing and Interfering Mediations

Back in 2013 I wrote a blog post reacting to Cristina Vanko’s project to handwrite her text messages for one week. At the time, I focused on how Cristina introduced slowness into a digital communication that often operates as a conversation due to the immediacy and frequency of responses. Since 2013, texting has grown more popular and instant messaging has woven its way into our work environments as well. Reinvoking that slowness stays relevant, and careful notification settings can help recapture it. 

What I want to focus on is the way that her project repersonalizes the digital medium of communication, adding her handwriting and therefore more of her personality into the messages that she sends. I thought of this project again while watching a talk from Jonathan Zong for the Before and Beyond Typography Online Conference. In his talk, he points out that “writing is a form of identity representation”, with handwriting being “highly individualized and expressive”, while “in contrast, digital writing makes everyone’s writing look the same. People’s communications are filtered through the standardized letterforms of a font.” 

The project he discusses in part of that talk, Biometric Sans, “elongates letterforms in response to the typing speed of the individual”, thus providing another way to reembody personality in digitally-mediated communications. He describes the font as “a gesture toward the reembodiment of typography, the reintroduction of the hand in digital writing.” It’s an explicit repersonalization of a digitally-mediated communication, in much the same way Cristina Vanko chose to handwrite her text messages. Both projects seek to repersonalize, and thereby rehumanize, the somewhat coldly standardized digital communication formats that we rely on. 

Without resorting to larger projects, we find other ways to repersonalize our digital communications: sharing stickers (I’m rather fond of Rejoinders), crafting new expressions (lol) and words, and even sending voice responses (at times accidentally) in text messages. In this way we can poke at the boundaries of the digital communication methods sanitized by standardized fonts for all users.

While Jonathan stayed rather focused on the typographic mediation of digital communication due to the topic of the conference, I want to expand this notion of repersonalizing digital communication methods. Fonts are not the only mechanism by which digital communications can be mediated and standardized—the tools that we use to create the text displayed by the fonts do just as much (if not more). 

The tools that mediate and standardize our text in other ways are, of course, automatic correction, predictive text, and the software keyboards themselves.

Apple is frustratingly subtle about automatic correction (autocorrect), oftentimes changing a perfectly legitimate word that you’ve typed into a word with a completely different meaning. It’s likely that autocorrect is attempting to “accelerate” your communications by guessing what you’re trying to type. This guess, mediating your input to alter the output, often interferes with your desired meaning. When this interfering mediation fails (which is often), you’re instead slowed down, forced to identify that your intended input has been unintentionally transformed, fix it, perhaps fix it again, and only then send your message.

Google, meanwhile, more often preemptively mediates your text. Predictive text in Google Mail “helps” you by suggesting commonly-typed words or responses.

Screenshot of Google Mail draft, with the text Here are some suggestions about what I might be typing next.  Do you want to go to the store? Maybe to the movies? What about to the mall?  What do you listen to? Sofi Tukker? What other DJs do you have? Where "have?" is a predictive suggestion and not actually typed.

This is another form of interference (in my mind), distracting you from what you’re actually trying to communicate and instead inserting you into a conflict with the software, fighting a standardized communication suggestion while you seek to express your point (and your personality) with a clear communication. Often, it can be distractingly bland or comical.

Screenshot of Google Mail smart responses, showing one that says "Thank you, I will do that.", another that says "Thank you!", and a third that says "Will do, thank you!"

In Google Mail, this focus on standardized predictive responses also further perpetuates the notion of email as a “task to be completed” rather than an opportunity to interact, communicate, or share something of yourself with someone else. 

Software keyboards themselves also serve to mediate and effectively standardize digital communications. Personally, I dislike software keyboards because I’m unable to touch-type on them (frustrated, I tweeted about this in January). Lacking any hardware feedback or orientation, I frequently have to stare at the keyboard while I’m typing. I’m less able to focus on what I’m trying to say because I’m busy focusing on how to literally type it. This forced slowness, introducing a maximum speed at which you can communicate your thoughts, effectively forces you to rely on software-enabled shortcuts such as autocorrect, predictive text, or actual programmed shortcuts (such as replacing “omw” with “On my way!”), rather than being able to write or type at the speed of your thoughts (or close to it). Because of this limitation, I often choose to write out more abstract considerations or ideas longhand, or reluctantly open my computer, so that I have the privilege of a direct input-to-output translation without extensive software mediation. 

In a talk last June at the SF Public Library, Tom Mullaney discussed the mediation of software keyboards in depth, pointing out that software keyboards (or IMEs, as he referred to them) do not serve as mechanical interpreters of what we type, but rather use input methods to transcribe text, and that those input methods can adapt to be more efficient. He used the term “hypography” to describe the practice of writing when your input does not directly match the output: when you use a programmed shortcut like omw, when you type a character that isn’t represented on a key, such as ö, or when you’re typing in a language that uses a non-Latin alphabet and enter a specific sequence of keystrokes to represent a fully-formed character in written text. Your input maps to an output, rather than the output matching the input. 

These inputs are often standardized, allowing you to learn the shortcuts over time and serving the purpose of accelerating your communications. But autocorrect and predictive text are constantly iterating—introducing new words or phrases that interferingly mediate, changing a slip up into a skip up, encouraging you to respond to an email with a bland “Great, thanks!”, or attempting to anticipate the entire rest of your sentence after you’ve written only a few words. Because I also have a German keyboard configured, my predictive text will occasionally “correct” an English typo into a German word, or overcapitalize generic English nouns by mistakenly applying German language rules. 

All of these interfering and distracting mediations that accelerate and decelerate our digital communications, alongside our ongoing efforts to repersonalize those communications, have me wondering: What do we lose when our digital communications are accelerated by expectations of instantaneous responses? What do we lose when they’re decelerated by the interfering mediations of autocorrect? What do we lose when our communications are standardized by fonts, predictive text, and suggested responses?

Listening to Music while Sheltering in Place

The world is, to varying degrees, sheltering in place during this global coronavirus pandemic. In March, the pandemic started to affect me personally: 

  • I started working from home on March 6th. 
  • Governor Gavin Newsom announced on March 11 that any gatherings over 250 people were strongly discouraged, effectively cancelling all concerts for the month of March. 
  • On March 16th, the mayor of San Francisco, along with several other counties in the area, announced a shelter-in-place order. 

Ever since then, I’ve been at home. Given all these changes in my life, I was curious what new patterns I might see in my music listening habits. 

With large gatherings prohibited, I went to my last concert on March 7th. With gatherings increasingly cancelled nationwide, and touring musicians postponing and cancelling events, Beatport hosted the first livestream festival, “ReConnect. A Global Music Series”, on March 27th. Many more followed. 

Industry-wide studies and data analysis have attempted to unpack various trends in the pandemic’s influence on the music industry. Analytics startup Chartmetric is digging into genre-based listening and geographical listening habits, and Billboard and Nielsen are conducting a periodic entertainment tracker survey.

Because I’m me, and I have so much data about my music listening patterns, I wanted to explore what trends might be emerging in my personal habits. I analyzed the months of March, April, and May 2020, and in some cases compared that period against the same period in 2019, 2018, and 2017. The screenshots of data visualizations in this blog post represent data through May 15th, so the analysis and comparison are incomplete, given that May 2020 is not yet over. 

Looking at my listening habits during this time period, with key dates highlighted, it’s clear that the very beginning of the crisis didn’t have much of an effect on my listening behavior. However, after the shelter-in-place order, the amount of time I spent listening to music increased. After that increase it’s remained fairly steady.

Screenshot of an area chart depicting listening duration ranging from 100 minutes, with a couple spikes of 500 minutes, but hovering around a max of 250 minutes per day for much of January and February, then starting in March a new range of about 250 to 450 minutes per day, with a couple outliers of nearly 700 minutes of listening activity and a couple outliers with only about 90 minutes of listening activity.

Key dates such as the first case in the United States, the first case in California, and the first case in the Bay Area are highlighted along with other pandemic-relevant dates.

Listening behavior during March, April, and May over time

When I started my analysis, I looked at my basic listening count from traditional music listening sources. I use Last.fm to scrobble my listening behavior in iTunes, Spotify, and the web from sites like YouTube, SoundCloud, Bandcamp, Hype Machine, and more. 
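
As a side note, here’s roughly the kind of counting behind these comparisons: a minimal sketch that tallies March, April, and May scrobbles per year, assuming a CSV export of Last.fm scrobbles with an ISO-formatted timestamp column. The file name and column name are hypothetical.

```python
# Minimal sketch: count March-May scrobbles per year from a hypothetical Last.fm CSV export.
import csv
from collections import Counter
from datetime import datetime

listens_per_year = Counter()

with open("lastfm_scrobbles.csv", newline="") as f:
    for row in csv.DictReader(f):
        played_at = datetime.fromisoformat(row["timestamp"])  # e.g. "2020-03-16T19:04:00"
        if played_at.month in (3, 4, 5):  # March, April, May only
            listens_per_year[played_at.year] += 1

for year in sorted(listens_per_year):
    print(year, listens_per_year[year])
```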

Chart depicting 2700 total listens for 2017, 2000 total listens for 2018, and 2300 total listens for 2019 during March, April, and May, compared to 3000 total listens in that same period in 2020.

If you just look at 2018 to 2020, it seems like my listening habits are trending upward, maybe with a culmination in 2020. But comparing against 2017, it isn’t much of a difference. I listened to 25% fewer tracks in 2018 compared with 2017, 19% more tracks in 2019 compared with 2018, and 25% more tracks in 2020 compared with 2019. 

Chart depicting total weekday listens versus total weekend listens during March, April, and May of 2017, 2018, 2019, and 2020: roughly 2400 weekday listens vs 200 weekend listens in 2017, 2000 vs 100 in 2018, 2100 vs 300 in 2019, and 2500 vs 200 in 2020.

If I break that down by when I was listening by comparing my weekend and weekday listening habits from the previous 3 years to now, there’s still perhaps a bit of an increase, but nothing much. 

With just the data points from Last.fm, there aren’t really any notable patterns. But the number of tracks listened to on Spotify, SoundCloud, YouTube, or iTunes provides an incomplete perspective of my listening habits. If I expand the data I’m analyzing to include other types of listening—concerts attended and livestreams watched—and change the data point that I’m analyzing to the amount of time I spend listening, instead of the number of tracks I’ve listened to, it gets a bit more interesting. 

Chart shows roughly 12000 minutes spent listening in 2017, 10000 in 2018, 12000 in 2019, and 22000 in 2020

While the number of tracks I listened to from 2019 to 2020 increased only 25%, the amount of time I spent listening to music increased by 74%, a full 150 hours more than the previous year during this time period. And May isn’t even over yet! 

It’s worth briefly noting that I’m estimating, rather than directly calculating, the amount of time spent listening to music tracks and attending live music events. To make this calculation, I’m using an estimate of 3 hours for each concert attended, 4 hours for each DJ set attended, 8 hours for each festival attended, and an estimate of 4 minutes for each track listened to, based on the average of all the tracks I’ve purchased over the past two years. Livestreamed sets are easier to track, but some of those are estimates as well because I didn’t start keeping track until the end of April.
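
To show how that estimate works in practice, here’s a small sketch using the per-event durations described above; the event counts themselves are made up for illustration.

```python
# Sketch of the duration estimate: per-event durations come from the text,
# the event counts below are invented for illustration.
MINUTES_PER_EVENT = {
    "concert": 3 * 60,   # estimated 3 hours per concert
    "dj_set": 4 * 60,    # estimated 4 hours per DJ set
    "festival": 8 * 60,  # estimated 8 hours per festival
    "track": 4,          # estimated 4 minutes per track listened to
}

# Hypothetical tallies for one March-May period
event_counts = {"concert": 1, "dj_set": 0, "festival": 0, "track": 3000}

estimated_minutes = sum(
    count * MINUTES_PER_EVENT[kind] for kind, count in event_counts.items()
)
print(f"~{estimated_minutes} minutes (~{estimated_minutes / 60:.0f} hours)")
# ~12180 minutes (~203 hours)
```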

I spent an extra 150 hours listening to music this year during this time—but when was I spending this time listening? If I break down the amount of time I spent listening by weekend compared with weekdays, it’s obvious:

Chart depicts 10000 weekday minutes and 5000 weekend minutes spent listening in 2017, 9500 weekday minutes and 4500 weekend minutes in 2018, 14000 weekday minutes and 2000 weekend minutes in 2019, and 12000 weekday minutes and 13000 weekend minutes in 2020

Before shelter-in-place, I’d spend most of my weekends outside, hanging out with friends, or attending concerts, DJ sets, and the occasional day party. Now that I’m spending my weekends largely inside and at home, coupled with the number of livestreaming festivals, I’m spending much more of that time listening to music. 

I was curious if perhaps working from home might reveal new weekday listening habits too, but the pattern remains fairly consistent. I also haven’t worked from home for an extended period before, so I don’t have a baseline to compare it with. 

It’s clear that weekends are when I’m doing most of my new listening, and that this new listening likely isn’t coming from my traditional listening habits. If I split the amount of time that I spend listening to music by the type of listening that I’m doing, the source of the added time spent listening is clear.

Depicts 11000 minutes of track listens and 1000 minutes of time spent at concerts in 2017, 8000 minutes spent listening to music tracks and 2000 minutes spent at concerts in 2018, 10000 minutes spent listening to music tracks and 3000 minutes spent at concerts in 2019, and 12000 minutes spent listening to music tracks and 9000 minutes listening to livestreams, with a sliver of 120 minutes spent at a single concert in 2020

Hello, livestreams. If you look closely you can also spy the sliver of a concert that I attended on March 7th.

Livestreams dominate, and so does Shazam

The livestreams I’ve been watching have primarily been DJ sets. Ordinarily, when I’m at a DJ set, I spend a good amount of time Shazamming the tracks I’m hearing. I want to identify the tracks that I’m enjoying so much on the dancefloor so I can track them down, buy them, and dig into the back catalogs of those artists. 

So I requested my Shazam data to see what’s happening now that I’m home, with unlimited, shameless, and convenient access to Shazam.

For the time period that I have Shazam data for, the ratio of Shazam activity to the number of livestreams watched is fairly consistent, at roughly 10 successful Shazams per livestream. 

Chart details largely duplicated in surrounding text, but of note is a spike of 6 livestreams with only 30 or so songs shazammed, while the next few weeks show a fairly tight interlock of shazam activity with number of livestreams

Given the consistency of the Shazam data, as well as the continued focus on watching DJ sets, I wanted to explore my artist discovery statistics as well. Especially since my track counts hadn’t shifted much, I was betting that my artist discovery statistics had been increasing during this time. If I look at just the past few years, there seems to be a direct increase during this time period. 

Chart depicts roughly 260 artists discovered in March, April, and May of 2018, 280 discovered in 2019, and 360 discovered in 2020. A second chart shows the same data but adds 2017, with 390 artists discovered.

However, after I add 2017 into the list as well, the pattern doesn’t seem like much of a pattern at all. Perhaps by the end of May, there will be a correlation or an outsized increase. But at least for now, the added number of livestreams I’ve been watching doesn’t seem to be producing an equivalently high number of artist discoveries, even though discoveries are elevated compared with the last two years. 

It could also be that the artists I’m discovering in the livestreams haven’t yet had a substantial effect on my non-livestream listening patterns, even if there are 91 hours of music (and counting) in my quarandjed playlist, where I store the tracks that catch my ear in a quarantine DJ set. Adding music to a playlist, of course, is not the same thing as listening to it. 

Livestreaming as concert replacement?

Shelter-in-place brought with it a slew of event cancellations and postponements. My live events calendar was severely affected. As of now, 15 concerts were affected in the following ways:

Chart depicts 6 concerts cancelled and 9 postponed

The amount of time that I spend at concerts compared with watching livestreams is also starkly different.

Chart depicts 1000 minutes spent at concerts in 2017, 2000 minutes at concerts in 2018, 2500 minutes at concerts in 2019, and 8000 minutes spent watching livestreams, with a topper of 120 minutes at a concert in 2020

I’ve spent 151 hours (and counting) watching livestreams, the rough equivalent of 50 concerts—my entire concert attendance of last year. This is almost certainly because I’m often listening to livestreams, rather than watching them happen.

Concerts require dedication—a period of time where you can’t really do anything else, a monetary investment, and travel to and from the show. Livestreams don’t have any of that, save a voluntary donation. That makes it easier to turn on a stream while I’m doing other things. While listening to a livestream, I often avoid engaging with the streaming experience. Unless the chat is a cozy few hundred folks at most, it’s a tire fire of trolls and not a pleasant experience. That, coupled with the fact that sitting on my couch watching a screen is inherently less engaging than standing in a club with music and people surrounding me, means that I’m often multitasking while livestreams are happening.

The attraction for me is that these streams are live, they’re an event to tune into, and if you don’t, you might miss it. Because it’s live, you have the opportunity to create a shared collective experience. The chatrooms that accompany live video streams on YouTube, Twitch, and especially Facebook’s Watch Party feature for Facebook Live videos, are what foster this shared experience. For me, it’s about that experience, so much so that I started a chat thread for Jamie xx’s 2020 Essential Mix so that my friends and I could experience and react to the set live. This personal experience is contrary to the conclusion drawn in the Hypebot article Our Music Consumption Habits Are Changing, But Will They Remain That Way? by Bobby Owsinski: “Given the choice, people would rather watch something than just listen.” Given the choice, I’d rather have a shared collective experience with music than just sit alone on my couch and listen to it. 

Of course, with shelter-in-place, I haven’t been given a choice between attending concerts and watching livestreamed shows. It’s clear that without a choice, I’ll take whatever approximation of live music I can find.


What it takes to get to a concert

Ticket buying in the modern era is pretty brutal. You find out your favorite artist is coming to town, and with any luck, you discover this before the tickets go on sale. Then you start planning to get tickets. You set up a calendar reminder with a link to the site, and then you get ready. If there are presales, you ask friends or you check emails — if you’re a dedicated concertgoer, you probably get emails from the promoters, venues, and maybe even your favorite artists’ fan clubs — tracking down the codes.

Then you get ready, mouse pointer cued up at 9:59, waiting until tickets go on sale. The time flips, it’s 10:00 AM and you click! Prepared to quickly select 2, best available (or GA floor, because who wants a balcony seat), and add to cart. But wait! You see the dreaded message. You’re in a queue. Now all you can do is desperately stare at the webpage, hoping nothing changes. What if a browser extension interferes? What if your browser freezes up? Finally, you’re out of the queue. You go to select your tickets, but wait. GA is all gone. All that’s left is the seated Loge. For a band that you dance to. Or worse, it’s already sold out. All that time, all that anxiety, all that preparation, only to get shut down. 

And that’s just the presale. You’ll do the whole thing over again at the next presale, or during the general onsale, hoping that the artist and the venue were strategic enough to set some tickets aside for each sale. If it comes down to it, you might have to show up to the venue an hour early (or more) before the show starts to get one of the limited tickets available at the door. 

That’s everything a dedicated concertgoer goes through to get concert tickets. Thing is, according to Kaitlyn Tiffany in The Atlantic, that’s also what modern ticket scalpers do.

This week was a brutal one for ticket sales for me and my friends. A show at a 2000+ capacity venue sold out within a few minutes during the presale, and a second show added later also sold out within minutes. The Format announced their first live dates in years, playing 2 shows each in NYC and Chicago, and 1 show in Phoenix. The presale tickets for all the shows sold out within a minute, or, in the case of Phoenix, were plagued by ticket website issues but still managed to sell out by the end of the day. By the time the general ticket sales happened, they’d announced an additional show in each city. The general ticket sales also sold out within minutes, and Phoenix ended up with a third show before the day was up. 

How does it happen? And why do we put ourselves through this?! 

It’s important to note that buying concert tickets at all is a privilege. Some people (like me) make it a lifestyle to go to concerts and DJ sets. Others save their money and spend big to get great seats to see favorite artists in arena shows. But it takes money, time, and a bit of luck (or planning) to get tickets and get to a show. 

Whether or not you manage to get tickets to a show depends on several factors: 

  • Did you hear about the show before the tickets went on sale?
  • Did you have enough money at the time tickets went on sale (and in general) to afford the tickets?
  • Is your work schedule stable enough to know that you can go to the show if you buy tickets immediately when they go on sale?

If any one of these factors doesn’t work out, then you don’t have tickets to the show. Whether or not you get the opportunity to see an artist perform in concert at all is up to a whole other set of factors, subject to the careful strategies of the music industry combined with the artistic whims of the performers. 

If an artist doesn’t have a big enough fanbase in your city, and if your city isn’t geographically convenient or doesn’t have available music venues, the artist probably won’t stop there. Even if they stop, the venue size can play a crucial role in whether or not you’ll get tickets to the show—will they be available, and will you even want them? 

Artists, especially after they’ve “gotten big”, can crave smaller, more intimate shows. But those are the shows that tend to sell out in a minute—especially if the fanbase in a certain city is larger than anticipated, or if the artist is only playing a limited number of shows and ends up drawing people from outside the ordinary reach of a venue.

Other times, artists can analyze the size of their fanbase in a city and then choose a venue—without considering whether the venue size is appropriate for their type of music. Bon Iver played 20,000+ seat arenas on their last tour, even though they’re famous for their intimate music and have YouTube videos with hundreds of thousands of views of Justin Vernon playing to just 1 fan. Even if an artist’s fanbase is large enough to fill an arena, the fans still might not want to buy tickets to see them in an arena. 

Beyond those considerations, artists can’t always play the venues they want to play due to promoter restrictions or other industry partnerships, sometimes leading to uncharacteristic bookings at oddly-sized or oddly-shaped venues: DJs playing a concert hall, rock bands in a semi-seated venue, or possibly even skipping a city entirely. 

The venue an artist chooses (or is forced to choose) can be a key factor when you’re deciding if you want to get tickets. But the artist (and their tour manager, and others) have still more to do before this concert happens. 

The ticket prices have to be set. Surely venues and promoters have set costs and prices that end up as effective ticket minimums for many shows, but artists certainly have a level of influence as well. High-profile artists like Taylor Swift, especially, have chosen on past tours to make affordable tickets available to their fans.

And therein lies the rub: artists can price tickets high, knowing they can charge a premium and still sell out their show (or nearly sell it out). But they can also price affordably, hoping that legitimate fans will be able to snap up tickets when they go on sale, rather than delaying their purchase and being forced to buy from scalpers.

OK so we’re still trying to buy these concert tickets. You’ve heard about the show, the artist has booked the venue and priced their tickets, you’ve got the money, you’ve got the time, you are ready at 10am on a Friday (or a Wednesday or a Thursday for those sweet sweet presale tickets). Where are you buying your tickets?

Ticket sites range from the homegrown (see: Bottom of the Hill) and the new kids on the block (Big Neon, Tixr) to the budding behemoths (Eventbrite, AXS, Etix) and the (despised) old guard (TicketWeb/Ticketmaster/LiveNation). If you’re rushing to buy tickets online, you also need to prepare for the site experience.

If it isn’t a site you’ve used before, consider whether it requires an account to buy tickets. If it does, you have to make one and make sure you’re signed in before you try to buy the tickets. You also want to consider whether the show is big enough that you’ll end up in a queue to buy tickets, and whether the site is reliable enough to handle the load of a lot of people buying tickets at once without crashing or throwing an error.

Beyond site reliability, you have to consider your personal threshold for every ticket-buyer’s worst nightmare: fees. Almost every ticket purchase includes fees. How high do the fees need to be before you abandon your ticket purchase entirely? 

Of course, the irony of paying ticket fees is that most fans (myself included) dislike them because for so long they’ve been hidden: last-minute additions to your total, spiking the cost of $35 tickets to $60 at times. But it can be argued that transparently disclosed fees are acceptable, and even necessary, to provide a resilient, secure, reliable ticketing site, as well as to pay the promoters working hard to make sure your favorite band actually stops in your city.

Artists, promoters, venues, and ticketing sites do a lot to try to prevent ticket scalpers from bombing the market, selling out a show in minutes only to relist the tickets minutes later at unbelievable prices. Their countermeasures include innovations in ticket technology, new marketplaces, and just plain making it harder to get tickets.

What makes a ticket purchaser legitimate? Probably some degree of purchasing tickets in a specific geographic region and in clusters of genres, likely combined with some fraud analysis. It makes me wonder how suspicious my own ticket-purchasing habits must look to the algorithms at times. As long as we’re attempting to define what a legitimate ticket purchaser looks like, we can also consider who deserves the presale codes for shows.
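
As a purely speculative aside, here is roughly what such a heuristic could look like. Nothing here reflects how any real ticketing site actually works; the signals, weights, and example values are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class PurchaseSignals:
    """Invented signals a ticketing site *might* look at; purely illustrative."""
    buyer_distance_km: float   # distance from the buyer's billing address to the venue
    genre_overlap: float       # 0-1: overlap between this show and past purchases
    purchases_last_hour: int   # burst buying is a classic scalper tell
    account_age_days: int

def legitimacy_score(s: PurchaseSignals) -> float:
    """Toy weighted score; real fraud analysis is far more involved."""
    score = 0.0
    score += 1.0 if s.buyer_distance_km < 100 else -0.5  # local buyers look more legitimate
    score += s.genre_overlap                             # fans tend to buy within familiar genres
    score -= 0.25 * max(0, s.purchases_last_hour - 2)    # penalize burst purchasing
    score += min(s.account_age_days / 365, 1.0)          # older accounts look less bot-like
    return score

# A touring superfan who buys tickets in several cities and across genres
# might score surprisingly low under a heuristic like this.
print(legitimacy_score(PurchaseSignals(
    buyer_distance_km=850, genre_overlap=0.3,
    purchases_last_hour=4, account_age_days=2000,
)))
```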

There’s a notion that only “real fans” deserve first access to presale codes and tickets. But how do you verify and validate true fans? You could use specific digital consumption patterns, such as those that are probably used to give out Spotify presale codes, but those capture only the listening habits that are directly observable in digital data. Artists want people to buy tickets to their shows; that’s why presale codes are often straightforward to track down.

Most often, getting tickets to a show is a matter of knowing the right people at the right time, people who might have information you don’t. Songkick is there to fill in the gaps, alongside emails and texts from promoters and venues. But ultimately, nothing beats having a community of fans. And that was the thing that fascinated me about the article in The Atlantic about modern ticket scalpers: my friends and I use many of the same tactics to buy tickets. It’s a privilege and a challenge to get the tickets we want, but we love going to concerts. And often, it feels like the only way we can help artists make money these days.

Problems with Indexing Datasets like Web Pages

Google has created a dataset search for researchers, or for the average person looking for datasets. On the one hand, this is a cool idea. Datasets are often hard to find, and this ostensibly makes datasets and the accompanying research easier to find.

In my opinion, this dataset search is problematic for two main reasons.

1. Positioning Google as a one-stop-shop for research is risky.

There’s consistent evidence that many people (especially college students who don’t work with their library) start and end their research with Google rather than using scholarly databases, limiting the potential quality of their research. (There’s also something to be said here about how access to quality research is limited by exploitative and exclusionary paywalls, but that’s a discussion for another time.)

Google’s business goal of being the first and last stop for information hunts makes sense for them as a company. But such a goal doesn’t necessarily improve academic research, or the knowledge that people derive from the information returned in search results.

2. Datasets without datasheets easily lead to bias.

The dataset search is clearly focused on indexing as many datasets as possible and making them more widely available. The cost of that is continued sloppy data analysis and research, due to the lack of standardized documentation, such as Datasheets for Datasets, that fully exposes the contents and limitations of a dataset.

The existing information about these datasets is constructed from the schema defined by the dataset author, or, more specifically, by the site hosting the dataset. It’s encouraging that datasets have dates associated with them, but I’m curious where the descriptions for the datasets come from.

Only the description and the name fields are required before a dataset appears in the search. As such, the dataset search has limitations. Is the description for a given dataset any higher quality than the Knowledge Panels that show up in some Google search results? How can we as users independently validate the accuracy of the dataset schema information?

The quality of and detail provided in the description field vary widely across datasets (I did a cursory scan of the results of a keyword search for “cheese”), indicating that a required plain-text field doesn’t do much to assure quality, valuable information.
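
That markup comes from schema.org’s Dataset vocabulary embedded in the hosting page, often as JSON-LD. As a rough sketch (written in Python, with an entirely made-up dataset), this is about all a publisher has to provide for a dataset to be eligible to appear in the search, which is exactly why listing quality varies so widely:

```python
import json

# A minimal schema.org Dataset record of the kind a hosting site might embed
# as JSON-LD. The dataset itself is invented; only "name" and "description"
# are required fields, and everything else is optional.
minimal_dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Regional cheese production",         # hypothetical dataset
    "description": "Cheese production figures.",  # thin, but technically sufficient
    # Optional properties many publishers simply omit:
    # "temporalCoverage": "2010-01-01/2018-12-31",
    # "creator": {"@type": "Organization", "name": "Example Dairy Board"},
    # "license": "https://creativecommons.org/licenses/by/4.0/",
}

print(json.dumps(minimal_dataset, indent=2))
```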
When datasets are easier to find, that can lead to better data insights for data analysts. However, it can just as easily lead to off-base analyses if someone misuses data that they found based on a keyword search, either intentionally or, more likely, because they don’t fully understand the limitations of a dataset.
Some vital limitations to understand when selecting a dataset for use in data analysis are things like the following (a sketch of how these could travel alongside a dataset appears after the list):
  • What does the data cover?
  • Who collected the data?
  • For what purpose was the data collected?
  • What features exist in the data?
  • Which fields were collected and which were derived?
  • If fields were derived, how were they derived?
  • What assumptions were made when collecting the data?
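
As that sketch, here is a minimal, hypothetical datasheet-style record in Python. The field names simply mirror the questions above rather than any established standard, and every example value is invented.

```python
from dataclasses import dataclass, field

@dataclass
class DatasheetSummary:
    """A minimal, hypothetical datasheet record mirroring the questions above."""
    coverage: str                    # what the data covers (population, time range)
    collected_by: str                # who collected the data
    purpose: str                     # why the data was collected
    features: list[str]              # features present in the data
    derived_fields: dict[str, str] = field(default_factory=dict)  # field -> how it was derived
    assumptions: list[str] = field(default_factory=list)          # assumptions made during collection

# Example for an imaginary air-quality dataset; every value is made up.
air_quality = DatasheetSummary(
    coverage="Hourly readings, 2015-2019, from 12 sensors in one city",
    collected_by="A hypothetical municipal environment office",
    purpose="Regulatory compliance reporting",
    features=["timestamp", "sensor_id", "pm25", "aqi"],
    derived_fields={"aqi": "Computed from pm25 using the local AQI formula"},
    assumptions=["Sensors are calibrated quarterly", "Gaps under 2 hours are interpolated"],
)
print(air_quality.coverage)
```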

Without these limitations being made as visible as the datasets themselves, I struggle to feel encouraged by this dataset search in its current form.

Ultimately, making information more easily accessible while removing or obscuring indicators that can help researchers assess the quality of the information is risky and creates new burdens for researchers.