Top Business / Management / Leadership Books by and/or about Womxn

I’ve been listening to the Farnam Street podcast, The Knowledge Project, recently and enjoying the guests who have talked about The Personal MBA or Relationships vs Transactions. But I noticed a pattern: the guests were largely telling stories about men and mentioning books by men, and I didn’t see myself in these conversations. When I dug deeper into the recommended reading, I found more of the same.

I’m not trying to pick on Farnam Street, but the institutional blindness of promoting the slogan “Our Content Helps You Succeed In Work and Life” without examining who is behind that “You” is real. So I dashed off a quick tweet about my frustration, and gosh did the Twitterverse deliver.

You can read the full replies to that tweet if you want to see all the attributed recommendations, but I’ve gathered them here in a loose structure. If you want the unstructured list, check out this published Google doc compilation I created.

Managing and Leading People

Leading an Organization

Founding and Building a Business

Working Better 

Work more efficiently or productively

Growing Yourself (At Work, Maybe)

Widen Your Perspective

I also recommend using the Library Extension to automatically search your local library catalog for these books.

Thanks to the recommenders

Many thanks to Better Allies, Kim Moir, David Ryan, Alice MacGillivray, Jillian Kozyra, Margaret Fero, Laura Glu, Liz Wiseman, Linda van der Pal, Sophie Weston, Mariposa Leadership, Richard Hughes-Jones, Katherine Collins, Arie Goldshlager, Michele Zanini, Bob Sutton, James Addison, Davis Liu, MD, Suva Chattopadhyay, Anna-Lisa Leefers, Neil Hodgson, Mindy Howard, leverup, Jo Miller, and Jeff Tetz for recommending these books, and to everyone who retweeted my request as well.

Inc. published a similar list that you likely want to check out as well: 60 Great Business and Leadership Books, All Written by Women. Thanks to Shantha R. Mohan, Ph.D., DTM for the pointer!

One last thought

I asked for recommendations of business, management, and leadership books by and about women, and I got so many more than I expected! As I dug through the list of recommendations, I noticed a new pattern: most of the authors look like me, a white cis woman.

Many of the authors have degrees and/or positions at Ivy League universities. Some of these books seem to espouse a kind of “Lean In feminism”, where if you work hard enough in the existing system, or change yourself to work with the system, you’ll succeed. That doesn’t work for everyone, and can even work against people.

There’s an innate bias to who gets published, and it’s worth considering whose voices we might not be listening to in the room, who doesn’t feel comfortable enough to talk in the room, and who isn’t even in the room. (In this case, the room is a list of crowdsourced book recommendations). 

Even with this whole list of book recommendations in hand, you might not need a book.

As Don Jones (yet another dude) interviewed on the Tech Lead Journal podcast put it, “Define what success means to you” and go after it. And bring others up with you.

And while we’re at it, let’s build a new system where everyone is empowered and supported to find their own success—beyond mere survival. 

From Nothing to Something with Minimum Viable Documentation

More and more startups and enterprises are recognizing the importance of high-quality product documentation, but it’s tough to know where to start. I’ve taken a few enterprise software products from “nothing to something” documentation, and this is the framework I’ve built for myself to create MVD—minimum viable documentation.

[Diagram: a dotted-line circle with an arrow toward a pink shaded box labeled “MVD”, representing going from nothing to minimum viable documentation.]

If you’re a technical writer trying to find your footing, or someone who cares about adding user documentation for your software but has no idea where to start, this is the guide for you.

What is minimum viable documentation?

If documentation is a product (and it is), minimum viable documentation is the bare minimum documentation that is useful and helpful to customers.

Something good is better than something chaotic and unhelpful, and it’s much better than no documentation at all. It’s also easier to focus on getting to minimum viable documentation rather than trying to reach full-featured documentation as soon as possible, because you’re a human with a life that is not your job.

[Venn diagram: overlapping circles of Helpful, Useful, and Quick intersecting to form MVD.]

You might be working with a fully-functional software product that has no useful documentation. In that case, getting to full-featured documentation isn’t your primary goal—getting to minimum useful documentation is. So let’s get started.

Define minimum viable documentation for your product

Before you can write MVD, you need to define what it is for your product. MVD differs depending on your market, customer base, product type, pricing structure, and more.

I recommend you do the following to define what MVD looks like for your product. 

1. Talk to your colleagues

Your goal with these conversations is to get a good understanding of who the target user is for your product and the goals they want to accomplish with your product.

[Diagram: five circles, one each representing PM, UX, Docs, Engineering, and Marketing.]

If you have product management, start with them. Find out as much as you can about why the product is being built, who it’s for, and how the product is being positioned in the market. 

Also talk to engineering management or senior engineering subject matter experts (SMEs). What user problems is the software trying to solve? What level of expertise do the engineers assume the user has?

If you’re lucky enough to have a sales or marketing team, talk to them. Because of their efforts defining the customer journey, they can help you understand who the audience is and what the key success workflows look like. Who is the product targeting? Why do they want to use this product? What problems are they trying to solve?

Talk to the user experience designers to get an understanding of the user personas they’re designing for and what workflows they think have the most friction. You can also get a sense for how the team approaches their role, whether they’re more focused on designing friction-free workflows or pixel-perfect screens. 

After talking to PM, EM, UX, and marketing, you can do the following:

  • Identify what level of expertise a typical product user has, both with the domain and with the product. This functions as your audience definition.
  • Write down the main goals of a user before and after they start using your product. What motivates the user?
  • Map out the key workflows that a user is going to perform in the product. What tasks are the user trying to accomplish?

2. Perform a documentation competitor review

It’s always a good idea to know what your competitors are doing! If you’re not sure what products your product is competing with, ask your sales or marketing people for a list or do some research on your own. 

Pick 3-5 companies to focus on, such as your strongest competitors in terms of funding, usage, or closeness to what your product does. 

You also want to make sure you’re not benchmarking off useless garbage when you perform your competitor review. In addition to the 3-5 competitors you identify, pick a couple of industry leaders or companies that your colleagues mention as having good documentation, such as Stripe API docs, Microsoft docs, or even IBM docs, and include one or two of them in your competitor analysis.

The advantage of choosing the documentation of a couple larger products to review is that they tend to have established documentation teams and offerings in a variety of markets. This makes it easier to find a product that is well-documented and at least somewhat adjacent to what your product does.

[Diagram: different colored shaded rectangles representing competitor documentation, with an arrow pointing to MVD.]

The goal of your competitor analysis is to identify how the companies provide documentation about their product(s). Pay attention to the following:

  • How is the documentation structured?
    • By feature?
    • By use case?
    • By persona?
  • What documentation is provided?
    • What workflows are covered? 
    • What seems left out?
  • What type of documentation is it?
    • Lots of conceptual information about how the product works?
    • Heavy reference information but light on the how-to tasks?
  • Who is the documentation targeting?
    • Look for introductory content or tutorials
    • Is there advanced developer content? 
  • How is the documentation site built?
    • Use the inspect element option in your web browser, or a site that identifies web technologies, to figure out what the documentation site is built with.
  • Anything else interesting?
    • Does a company have an interesting way of differentiating beta functionality?
    • Are code samples hidden behind toggle-to-expand options?
    • Is there a plethora of gifs, videos, or other multimedia in the documentation?
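To keep your notes consistent across competitors, you can capture the questions above in a simple template. Here’s a minimal sketch in Python; the field names and example values are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class CompetitorDocsReview:
    """Notes from reviewing one competitor's documentation.

    The fields mirror the review questions above.
    """
    company: str
    structure: str                                        # by feature, use case, or persona?
    workflows_covered: list[str] = field(default_factory=list)
    gaps: list[str] = field(default_factory=list)         # what seems left out?
    doc_types: list[str] = field(default_factory=list)    # conceptual, reference, how-to
    audience: str = ""                                    # introductory vs. advanced content
    site_tech: str = ""                                   # what the docs site is built with
    notes: str = ""                                       # anything else interesting

# A made-up example entry
review = CompetitorDocsReview(
    company="ExampleCo",
    structure="by use case",
    workflows_covered=["install", "first pipeline"],
    gaps=["no troubleshooting section"],
    doc_types=["how-to", "reference"],
    audience="introductory tutorials only",
    site_tech="static site generator",
    notes="beta features flagged with a banner",
)
print(review.company, "-", review.structure)
```

Filling out one of these per competitor makes it easy to compare findings side by side during your demo.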

Document your findings, of course, and feel free to share your findings with your team during a demo. After all, they might be wondering what you’re doing if you haven’t started writing yet. 

3. Assess the current state of your documentation

If there is any sort of documentation for your product, you want to know what it is. It might be a sad README and some code comments, or it might be detailed, multilayered documentation without much organization or clear goals.

[Diagram: a transparent dotted circle and circles for customers, marketing, UX, engineering, and PM, plus a dotted rectangle labeled “existing content?”, all pointing to an MVD square to represent the pre-planning process.]

To get a sense of the current state, I recommend doing the following:

  • Audit the existing content. Identify which topics are covered in the documentation already, and where. Make a list, and also keep track of what topics seem to have a lot of detail, and what you suspect might be outdated. This is a cursory audit, not an in-depth one that you might perform if you were migrating content.
  • Look at the documentation analytics. If you have analytics for the documentation site, take note of which pages are most frequently viewed, which pages might be serving as entry points, and how much time people spend on various pages. 
  • Talk to your team and get their thoughts on the current documentation. Who has been writing it so far? Are they attached to any topics in particular? Do they share specific topics with customers regularly?
  • Interview customers of the product and documentation to see what they want to see or find most useful today.
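For the cursory audit step, even a quick script can generate the topic list for you. Here’s a minimal sketch, assuming your docs are Markdown files in a single directory; adjust the path and file extension to your setup:

```python
# Cursory content audit: list existing topics with a rough word count,
# so you can see what's covered and what looks thin or bloated.
from pathlib import Path

def audit_docs(docs_dir: str) -> list[tuple[str, int]]:
    root = Path(docs_dir)
    if not root.is_dir():
        return []
    results = []
    for path in sorted(root.rglob("*.md")):
        words = len(path.read_text(encoding="utf-8").split())
        results.append((str(path.relative_to(root)), words))
    # Longest topics first: likely candidates for heavy detail or outdated sprawl
    return sorted(results, key=lambda item: item[1], reverse=True)

for topic, words in audit_docs("docs"):
    print(f"{words:6d}  {topic}")
```

The output won’t tell you what’s outdated, but it gives you a topic inventory to annotate as you read.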

Depending on the quality of the existing documentation, these steps might not be that helpful in informing your approach, but they help you set benchmarks for documentation growth and quality, plus identify links you likely don’t want to break.

If you don’t have any documentation, still talk to your team and customers. If you can’t talk to customers for some reason, you can look for discussions about the product on social media like Reddit, Twitter, or Hacker News to identify themes that people ask questions about or really enjoy about your product.

A brief note about terminology: As you review competitor and existing documentation and interview internal and external folks, you might find that your product has some inconsistent terminology. At this stage, you might want to delay the writing process while you create a definitive list of terms to use for the product. This type of work can take more time upfront but it’s easier to create consistency from the beginning than to apply it after the fact. 
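Once you have that definitive term list, you can enforce it mechanically. Here’s a minimal sketch of a terminology check; the term map is a made-up example, so build yours from the list you create for your product:

```python
# Minimal terminology check: flag discouraged variants so the docs use
# one preferred term consistently. The PREFERRED map is illustrative.
import re

PREFERRED = {
    "JSON blob": "JSON object",
    "JSON setting": "JSON object",
    "dark theme": "dark mode",
}

def find_term_issues(text: str) -> list[tuple[str, str]]:
    """Return (found variant, preferred term) pairs for a chunk of text."""
    issues = []
    for variant, preferred in PREFERRED.items():
        if re.search(re.escape(variant), text, flags=re.IGNORECASE):
            issues.append((variant, preferred))
    return issues

sample = "Save the JSON blob, then enable dark theme."
for variant, preferred in find_term_issues(sample):
    print(f'Found "{variant}"; prefer "{preferred}"')
```

Running something like this over your docs before each release keeps the vocabulary from drifting as more people contribute.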

Define the structure of your documentation

Before you start writing, you want to create a structure or a framework to place your topics into. 

[Diagram: the empty circle pointing to shaded rectangles structured in a hierarchy of three chapters, one with 3 topics below it, one with 2, and another with 4, all pointing to the MVD shaded square.]

The structure for your MVD is directly informed by the work you did to define what MVD looks like for your product, plus some information-architecture-specific research. 

  • Revisit your conversations with colleagues. What workflows and functionality might be important to highlight? Who is buying your product? Who is using your product?
  • Refer to your competitor review notes. How did your competitors and benchmark docs structure their documentation?
  • Research information architecture best practices. Refer to some key articles from the Nielsen Norman Group, as well as the book How to Make Sense of Any Mess by Abby Covert, and the associated worksheets.

After this research, draft up some chapter headings and possible topic titles to start with, then get feedback from your UX, PM, Engineering, and Sales and Marketing folks. How accurate, relevant, or helpful does the new structure seem? Have you made any assumptions that don’t make sense for the customer base?
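The draft structure can live as plain data that’s easy to circulate for feedback before it lands in a docs tool. Here’s a minimal sketch; the chapter and topic titles are placeholders for your own product:

```python
# A draft information architecture as plain data: easy to share for
# feedback and to port into whatever docs tool you end up using.
draft_ia = {
    "Get started": [
        "What is Best Product Ever?",
        "Install Best Product Ever",
        "Set up Best Product Ever",
    ],
    "Key workflows": [
        "Accomplish Straightforward Task",
        "Configure the Weird Setting You Must Touch",
    ],
    "Reference": [
        "Settings reference",
    ],
}

# Print an outline to paste into a feedback doc
for chapter, topics in draft_ia.items():
    print(chapter)
    for topic in topics:
        print(f"  - {topic}")
```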

Expect this information architecture to change as you write the MVD, and especially as you develop full-featured documentation. This is the nature of a minimum viable product! Put a task in your backlog to refine the structure after you finish MVD and approach full-featured documentation, so that you can iterate without confusing your customers with frequent changes, and plan so that you don’t break any links.

After you design the initial information architecture, you can start writing. 

Start writing minimum viable documentation

So you know what minimum viable documentation might look like for your product, but how do you get there? MVD is all about creating useful content for your users, so start with the entry content!

[Venn diagram: three circles labeled identify key information, describe the path to success, and clarify complexity, with MVD at the intersection of all three.]

Focus on key information for customers

As with any “minimum viable” approach, you’re trying to get a basic functional framework down before you start improving it. As you lay that framework, be mindful of scope creep.

Think back to the key workflows that you mapped out earlier. Broadly cover the top few workflows and then flesh out details as you get more comfortable with the product and understand the user goals better. Why go broad instead of going deep into a specific workflow? You’re still learning what the customer finds useful, and what level of detail they might want or need about a specific workflow. 

If you spend a lot of time writing a highly detailed workflow that you thought was important and it turns out it’s actually pretty intuitive for customers—that’s time that you could have used to write about something that was really confusing and holding back customers from succeeding with your product. 

It’s likely that you’ll encounter cases and situations that you want to write more about. That’s great! Write them down and put them in a backlog to address later. For now, you want to stay focused on these minimal workflows to build out the minimum viable documentation for your product. You can get fancy with use cases and in-depth examples later. 

Identify the simplest path to success

Within those broad key workflows, start with the simplest path to success, the “happy path” that most of your customers will take. 

That might involve writing a series of topics like:

  • “Get started using Best Product Ever”
  • “Install Best Product Ever”
  • “Set up Best Product Ever”
  • “Accomplish Straightforward Task in Best Product Ever”

Get those written, reviewed, and published and start helping people use your product that much sooner. 

Clarify any complexity

After you write the documentation to support the simple path to success, what do you write next? Documentation that unravels where complexity lurks in your product. 

Depending on your product familiarity, you might need to take more time to research and lean on technical subject matter experts (SMEs) a bit more to write this, but it’s worth it. This documentation content might be topics like:

  • “Configure the Weird Setting You Must Touch”
  • “All About This Task That Everyone Wants to Do but No One Can Find”

You don’t want to get bogged down in documenting around product complexity here. Stay focused on the complex aspects of the key customer workflows, and the crucial information customers need. What might confuse someone if you left it out? What assumptions have you been making about the user that need to be made explicit? 

This is often the step when I remember to write things like software requirements, role-based restrictions to functionality, or other crucial cases that are often assumed when developers write their own documentation.

Get feedback and iterate

I assume you’re focusing on minimum viable documentation because you have more work than you have time to complete it. That’s why it’s important to iterate. Yes, I just harped on the importance of prioritization and focus—and it’s essential to make sure that what you prioritize and focus on is still important. 

[Diagram: an MVD shaded rectangle with an arrow pointing across to circles with PM, engineering, and customers, then another arrow pointing back to MVD, emphasizing the feedback loop for your MVD.]

Check in with product management and engineering management regularly (I’d recommend weekly if your release cadence is every few months or slower) about what you’re prioritizing and why.

This check-in is mostly about getting signoff and validation, not direction—but don’t ignore the direction that PM and EM can offer you! If there are important releases coming up that will affect one of the key workflows on your list, you might want to document that workflow sooner, or in more detail than you might otherwise for MVD. 

Use these conversations as a way of discovering what customers are paying attention to, and what your PM and engineers are paying attention to as well. 

As you send your documentation out for technical review, you might also get feedback that you can use to improve your approach to MVD. With any luck, much of the feedback will duplicate what you have planned—and that’s helpful validation for your approach.

You might get so much feedback that you have to dump a lot of ideas into “plans to write this later” and a backlog that feels like it’s spiraling out of control, but if you stay focused on your scope, you’ll get to that backlog sooner and with a more comprehensive understanding of your documentation and your customers.

If the direction and feedback you get from your team is pretty far removed from your approach to MVD, it’s helpful to discuss why that might be and treat it as prioritization guidance for your future plans. Maybe you misunderstood a key target customer, or the purpose of the product in the market. You might discover you need to realign your understanding and vision of the documentation with that of your team. 

What’s next after MVD?

When do you know that you’ve reached minimum viable documentation? It’s somewhat of a fuzzy line. When you notice that you’re writing documentation by adding to existing topics, or writing net new example content, or documenting new features instead of existing features — you’ve moved past MVD and into shaping full-featured documentation. 

As you start shifting into that mode, you’re no longer focused on creating the skeleton structure to build off of, but filling in the details and settling into the usual work of modern technical writing.

[Diagram: a shaded MVD box pointing to boxes for the headers that follow (work through backlog, improve product, create examples, collect feedback, review analytics), all pointing to a filled-in square labeled full-featured documentation.]

1. Go through the backlog

Start going through your backlog of ideas. Revisit those ideas and group similar ones together, adding audience definitions, acceptance criteria, and learning objectives where you can. Note who the technical SMEs are and whether any upcoming releases are relevant for some of the tasks. 

Ideally, you’re storing this backlog in the same spot as your engineering backlog so that your work is visible to the engineering team. 

Work with PM or EM to prioritize those tasks and start working through them. As any writer on a fast-paced development team knows, product development often happens faster than you can write about it, so you’ll never run out of tasks in your backlog.

2. Suggest product improvements

As you went through a flurry of documentation writing to produce MVD, you likely identified some parts of the product that might need to be improved. Again, work with your PM and engineering teams to discuss possible product improvements. 

You can also suggest product improvements that directly involve the docs, such as reviewing UI text in the product or auditing product pages to identify opportunities for in-app documentation or context-sensitive help. This is a great opportunity to partner with the UX team as well.

Partner with your engineering and UX teams to make suggestions and build those relationships based on your newfound product and customer expertise. 

3. Write use cases and examples

To create more useful content for your customers, you probably want to flesh out specific example scenarios for using your product. You might have written some already as quick start use cases for getting started with your product, but you likely want to write more for the next stage of customer product understanding.

You can use example content to describe customization options for the product, or highlight domain-specific use cases for a market that your customer might be trying to break into. 

4. Ask for feedback

You put all this effort into creating minimum viable documentation, but how viable is it really? 

Ask your technical SMEs, sales and marketing teams, customers, really anyone that might interact with the documentation internally or externally if they have feedback on your documentation improvements and information architecture. 

You could perform some tree testing with the MVD structure to see if there are some improvements you can make to the information architecture as you flesh out the documentation, or just have short conversations with stakeholders. 

Use the feedback you get to help shape priorities for your backlog. However, don’t treat all feedback you get as tasks that you must perform—if someone asked for it, it must be important, right?

Instead, validate feedback against your target audience definitions and user goals. Sometimes you’ll get feedback relevant only to a specific edge case that doesn’t make sense to document in the official documentation, or feedback related to a product bug that isn’t something necessarily appropriate to address in the documentation. 

5. Review analytics

Review documentation site analytics. Analytics are an imperfect source of feedback, but as long as you established a prior benchmark, check to see if the entry-level pages that you created or updated are the most popular pages. 

  • Are the pageviews higher, or at least somewhat proportional to the user base of your product? 
  • Are there any surprising outlier pages that have a lot of views that you might want to focus on? 
  • What search terms are popular? 

You can use these findings to inform your plans and priorities.
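If your analytics tool offers a data export, even a rough script makes the benchmark comparison easier. Here’s a minimal sketch that assumes a CSV export with hypothetical page and pageviews columns; real exports vary by tool, so adjust the column names:

```python
# Rough pass over an analytics export to surface the most-viewed pages.
# Assumes a CSV with "page" and "pageviews" columns (illustrative).
import csv
from io import StringIO

def top_pages(csv_text: str, n: int = 5) -> list[tuple[str, int]]:
    rows = csv.DictReader(StringIO(csv_text))
    counts = [(row["page"], int(row["pageviews"])) for row in rows]
    return sorted(counts, key=lambda item: item[1], reverse=True)[:n]

export = """page,pageviews
/docs/get-started,1200
/docs/install,950
/docs/weird-setting,400
"""

for page, views in top_pages(export):
    print(f"{views:6d}  {page}")
```

Comparing this ranking against the entry-level pages you expect to be popular is a quick sanity check on whether your MVD is landing.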

Get from nothing to something with MVD

It can be intimidating to create a set of documentation for a product from scratch, but I hope this post outlines a basic approach that can help.

[Diagram: an empty circle with a dotted-line border and an arrow pointing to a pink shaded square labeled MVD, which points to a larger pink filled-in square labeled full-featured documentation.]

Start by defining what MVD looks like for your product by talking to colleagues, performing a competitor review, and assessing the current state of documentation. Then do some additional research and define the initial structure of your documentation. 

After you’ve laid the groundwork, start writing. Focus on key information for customers and identify the simplest path to success. Clarify any product and task complexity, and seek out feedback. Regularly make changes to what you’ve written as you learn more about the product and your customers. 

As you evolve beyond MVD to full-featured documentation, work through your backlog, suggest product improvements, write use cases and examples, and continue asking for feedback. You can also review site analytics to get a sense of how far you’ve come and what you might want to focus on next. 

Whether you’re a professional technical writer, a committed startup founder, a generous open source contributor, or someone else, I hope you can use this framework to document your software product.

I tried my best to create a minimum viable blog post to describe this minimum viable documentation framework. As such, I might not have gone into much depth about how to perform a competitor review, get buy-in for terminology proposals, or how to handle the full range of feedback you might receive on your documentation. 

If you have feedback or questions for me, or want to see more details about a specific topic, don’t hesitate to reach out on Twitter @smorewithface.

How can I get better at writing?

As a professional writer, I frequently get asked, “as a ______, how can I get better at writing?” I’ve never had a good list of resources to point people to, so I finally decided to write one. I’ve worked hard to become a good writer, and I’ve had the privilege of many good teachers along the way.

If you’re not really sure why your writing isn’t as good as you want it to be, that’s okay. In this blog post, I’ve identified the strategies that I use to write well. I hope they’re useful to you. 

Where to start

Read and write more frequently. You can’t get better without good examples or practice. If you want to get better at writing, you need to read more and you need to write more.

Identify what you’re trying to improve. Maybe you struggle with grammar, or with clearly communicating your ideas. Maybe it takes too many words for you to get your point across, or you can’t quite connect with the people reading your writing.

Write accurate content by improving your grammar and word choice

Use a tool like Grammarly, or enable grammar checking in whatever tool you use to write, if it’s available. If you don’t want a mysterious AI reading your writing, you can use other resources to improve specific aspects of your grammar.

I still struggle with the following (more pedantic) grammar rules: 

  • When do I need to use a hyphen to connect two words? See Hyphen Use, on the Purdue Online Writing Lab website. 
  • Did I split an infinitive? What is a split infinitive, anyway? See Infinitives, on the Purdue Online Writing Lab website. 
  • Does my relative pronoun actually clearly refer to something or do I have a vague “that” or “it”? See Pronouns in the Splunk Style Guide.

The somewhat silly yet practical book The Curious Case of the Misplaced Modifier by Bonnie Trenga might also be a useful read.

Write helpful content by defining outcomes before you start

Before you start writing something, whether it’s a slide deck, an engineering-requirements document, an email, or a blog post like this one, consider what you want someone to do after reading what you wrote. 

Often called learning objectives or learning outcomes in instructional design, defining outcomes can help you write something useful and focused. Sometimes when you’re writing something, other extraneous ideas come to mind. They can be valuable ideas, but if they distract from your defined outcomes, you might want to remove them from your main content.

Some example outcomes are:

  • After reading this blog post, you can confidently draft a clear document with defined outcomes.
  • After reading this engineering requirements document, my colleague can provide accurate and helpful architecture feedback on the design. 
  • After reading the release notes, I can convince my boss that the new features are worth an immediate upgrade. 

I also want to note that if you write an outcome focused on someone understanding something, rewrite it. It’s tough to measure understanding. It’s easier to measure action. For that reason, I try to write outcomes with action-oriented verbs. For more about writing good learning objectives, see the Learning Objectives chapter in The Product is Docs.

Write focused content by identifying your audience

Who will be reading your writing? What do they know? Who are they? What assumptions can you make about them? 

If you can’t answer these questions about the people reading your writing, you won’t be able to clearly communicate your ideas to them. You don’t have to be able to answer these questions with 100% certainty, but make the attempt. 

If you recognize that you’re writing something for multiple audiences, consider breaking up the content into specific sections for each audience. For example, architects might care about different content than a UI engineer, and a product manager might care about different details than a backend engineer.

If you identify the different needs of your varying audiences, you can write more consistently for each specific audience, rather than trying to address all of them all the time. For more on identifying your audience, see the Audience chapter of The Product is Docs.

Write findable content by considering how people get to it

How people get to your content can influence how you write it. If people use search, an intranet, or direct links to find your content, you might make different decisions about how to structure it. 

I always assume that people are finding my content by searching the web. They’ve typed a specific search query, found my content as a result, and open it with the hopes that it is the right content for them. 

Consider what people are searching for that can be answered by your content, and write a title accordingly. Spend time on the first few sentences of your content to make sure that they further clarify what your content addresses. 

For example, I titled this blog post “How can I get better at writing?” because I expect that’s what a lot of people might type into their preferred search engine out of desperation. I could call it “7 quick tips to improve your writing”, but that’s not how most people type search queries (in my opinion).  

Mark Baker’s book, Every Page is Page One, covers a lot of information related to this concept. He uses the term “information scent” to describe the signals that indicate to a person that they’ve found the right content to answer their question, and “information foraging” to describe the process of looking for the right information.

Write readable content by considering the structure

People aren’t excited to read technical content or technical documentation. No one rejoices when they get an email. I get paid to write technical documentation and I still avoid reading it if I can. Because people don’t want to read your content, structure it intentionally. 

Write for skimming. Bullet points are often better than paragraphs. Tables are often better than paragraphs. 

Put information where it needs to be. If you’re writing a series of steps, make sure the steps are actually in the right order. For example, if something needs to be done before all the steps can succeed, put it before the set of steps as a prerequisite.

Also consider your audience and the desired outcomes when you structure your content. It can make sense to focus on one audience, or one desired outcome, in each piece of content. Don't try to do too much in one piece of writing.

Nielsen Norman Group has an incredible set of research and recommendations about how people read and how you can structure your content. I recommend the following articles:

Write clear content by intentionally choosing your words

You want to make your content easy to find and easy to understand. To do this, you need to be consistent and intentional about the words that you use.

Use consistent terminology. This isn’t the time to write beautiful prose that uses different words to mean the same thing. Don’t overload terms by using the same term for multiple things, and don’t use multiple terms to refer to one thing. Use the same terms and use them consistently. 

If something is a JSON object, call it that. Don’t call it a JSON object sometimes, a JSON setting other times, or a JSON blob other times. Pick one term and use it consistently. You might have to pick an imperfect term and live with it. It happens! There are only so many words to choose from. 

Be intentional about the words you use. Consider the words that your readers use to describe what you're writing about, and use the same words if you can, even if those words don't match up completely with the feature names in your product.

If all of your software’s users refer to “dark mode” instead of “dark theme”, you might need to use both terms in your content so that people can find it. For some internal documentation, you might need to make a mapping of internal names that people use for something with the external names used in the product. 

If you’re not sure what term to use, find out what terms your readers are already using. If you have access to search query logs of your website search, review those for patterns. If you don’t already have readers or users for your product, you can do some competitive analysis to understand what terms are in common usage in the market. 

You can also check the dictionary or use a tool like Google Books Ngram Viewer or Google Trends to identify common terms for what you’re attempting to describe. 
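As a toy illustration of reviewing search query logs for patterns, a few lines of Python can surface the most common words. The queries below are invented examples, not real log data:

```python
from collections import Counter

def top_terms(queries, n=10):
    """Count the most common words across a list of search queries."""
    counts = Counter()
    for query in queries:
        counts.update(query.lower().split())
    return counts.most_common(n)

# Invented sample queries; real input would come from your site search logs
queries = [
    "enable dark mode",
    "dark mode settings",
    "turn on dark theme",
]
print(top_terms(queries, 3))  # "dark" tops the list with 3 occurrences
```

Even this crude word count makes it obvious that these hypothetical readers say "dark" far more often than any product-specific feature name.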

Nielsen Norman Group again has some excellent resources on clear writing:

Write trustworthy content by thinking about the future

Errors in content, especially technical documentation, lead to mistrust. When you write a piece of content, consider the future of the content. 

The future of the content depends on the purpose and type of content that you’re writing. This list contains some common expectations that readers might have about various content types:

  • A blog post has a date stamp and isn’t kept continually updated.
  • Technical documentation always matches the product version that it references.
  • Architecture documents reflect the current state of the microservice architecture.
  • An email gets the point across and can’t be edited after you send it.

You must consider the future and maintenance of any content that you write if your readers expect it to be kept up-to-date. To figure out how difficult maintaining your content will be, you can ask yourself these questions:

  • How frequently does the thing I’m writing about change?
  • How reliable does my content need to be?
  • How quickly does my content need to be accurate (e.g., after a product release)?

By answering these questions, you can then make decisions about how you write your content. 

  • What level of detail will you include in your content?
  • Will you focus your efforts on accuracy, speed, or content coverage?
  • Do you want to include high-fidelity screenshots, gifs, or complex diagrams?
  • Do you want to automate any part of your content creation?
  • Who will review your content? How quickly and thoroughly will they review it?

For more on maintaining content and making decisions about your documentation, see the Documentation Decisions chapter in the book The Product is Docs (which I contributed to). 

Feel empowered to write better content

I hope that after reading this blog post you feel empowered to write more accurate, helpful, focused, findable, readable, clear, trustworthy content. This is an overview of strategies. If you want to dig deeper into a specific way to improve your writing, check out the books and articles linked throughout this post.

If you have something you think I missed, you can find me on Twitter @smorewithface.

Wrapping up 2020: Spotify, SoundCloud, and data

Another year, another Spotify Wrapped campaign, another effort to analyze the music data that I collect and compare it to what Spotify produces. This year I have Last.fm listening habit data, concert attendance and ticket purchase data, livestream view activity data, my SoundCloud 2020 Playback playlist, and the tracks on my Spotify top 100 songs of 2020 playlist.

Screenshot of Spotify Wrapped header image, top artists of disclosure, lane 8, kidnap, tourist, and amtrac, top songs of apricots, atlas, idontknow, cappadocia, know your worth, minutes listened of 59,038 and top genre of house.

It's always important to point out that the data in the Spotify Wrapped campaign only covers the time period from January 1st, 2020 to October 31st, 2020. I discuss the effects of this misleading time period in Communicate the data: How missing data biases data-driven decisions. Of course, writing this post on December 2nd, nearly the entire month of December is missing from my own analyses as well. I'll follow up (on Twitter) about any data insights that change over the next few weeks.

Top Artists of the Year

screenshot of spotify wrapped top artists, content duplicated in surrounding text.

Spotify says my Top 5 artists of the year are: 

  1. Disclosure
  2. Lane 8 
  3. Kidnap
  4. Tourist 
  5. Amtrac

My own data shows some slight permutations.

Screenshot of Splunk table showing top 10 artists in order: tourist with 156 listens, amtrac with 155 listens, booka shade with 147 listens, jacques greene with 134 listens, lane 8 with 129 listens, bicep with 128 listens, kidnap with 114 listens, ben böhmer with 111 listens, cold war kids with 110 listens, and sjowgren with 99 listens

My top 5 artists are nearly the same, but much more influenced by music that I’ve purchased. The overall list instead looks like:

  1. Tourist
  2. Amtrac
  3. Booka Shade
  4. Jacques Greene
  5. Lane 8

For the second year in a row, Tourist is my top artist! Kidnap still makes it into the top 10, as my 7th most-listened-to artist so far of 2020. 

Disclosure, somewhat hilariously, doesn't even break the top 10 artists if I rely on Last.fm data instead of only Spotify. What's going on there? Turns out Disclosure is my 11th-most-listened-to artist, with 97 total listens so far this year. If I dig a little deeper into Know Your Worth, the Disclosure song Spotify says I've listened to the most in 2020, I can see exactly why this is happening.

Screenshot showing the track_name Know Your Worth listed 5 times, with different artist permutations each time, Khalid, Disclosure & Khalid, Disclosure & Blick Bassy, Khalid & Disclosure, and Khalid, with total listens of 20 for all permutations.

Disclosure’s latest album, ENERGY, includes a number of collaborations. Disclosure is the main artist for most of these tracks, but in some cases (like with Know Your Worth, which came out as a single February 4, 2020) the artist can be inconsistently stored by different services.

As a result, the Last.fm data has a number of different entries for the same track, with differently-listed artists for each one. Last.fm stores only one artist per track, whereas Spotify stores an array of artists for each track. This data structure decision means that Disclosure should have had about 127 total listens, and been my 7th-most-listened-to artist of 2020, instead of 11th.

This truncated screenshot shows some examples of the permutations of data that exist in my data collection, with a total listen count of 127 for Disclosure during 2020. 

Screenshot showing additional permutations of Disclosure artist data, such as Disclosure & slowthai, Disclosure & Common, and Disclosure & Channel Tres.
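The difference between single-string artist storage and artist-array storage can be sketched in a few lines of Python. The rows below are illustrative, not my actual data:

```python
from collections import Counter

# Hypothetical listen records for one track. One style stores a single
# artist string per track; the other stores an array of contributing artists.
single_string_listens = [
    {"track": "Know Your Worth", "artist": "Disclosure & Khalid"},
    {"track": "Know Your Worth", "artist": "Khalid & Disclosure"},
    {"track": "Know Your Worth", "artist": "Khalid"},
]
array_listens = [
    {"track": "Know Your Worth", "artists": ["Disclosure", "Khalid"]},
    {"track": "Know Your Worth", "artists": ["Disclosure", "Khalid"]},
    {"track": "Know Your Worth", "artists": ["Disclosure", "Khalid"]},
]

# Single-string storage scatters one artist's plays across permutations...
string_counts = Counter(listen["artist"] for listen in single_string_listens)
# ...while array storage credits every contributing artist on each play.
array_counts = Counter(
    artist for listen in array_listens for artist in listen["artists"]
)

print(string_counts["Disclosure & Khalid"])  # 1 (one of three scattered entries)
print(array_counts["Disclosure"])            # 3
```

With single-string storage, Disclosure never accumulates a full count; with array storage, every collaboration still counts toward Disclosure's total.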

I had a sneaking suspicion that my Booka Shade listening habits were primarily concentrated on a few songs from an EP that they put out this year, so I dug into how many tracks my total listens for the year were spread across.

Table showing top 10 artists and total listens, with total tracks for each artist as well. Tourist has 62 tracks for 159 listens, Amtrac has 59 tracks for 155 listens, Booka Shade has 64 tracks for 147 listens, Jacques Greene has 33 tracks for 134 listens, Lane 8 has 60 tracks for 129 listens, Bicep has 46 tracks for 128 listens, Kidnap has 35 tracks for 114 listens, Ben Böhmer has 51 tracks for 111 listens, Cold War Kids has 53 tracks for 110 listens, and sjowgren has 15 tracks for 99 listens.

Instead, it turns out that my listens to Booka Shade are actually the most distributed across tracks of all of my top 10 artists. Sjowgren is also an outlier here: they've never released an album, so they have only 15 songs in their overall discography, yet they still made the top 10 in artist listens.
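The query behind a table like this boils down to counting distinct tracks per artist. My actual analysis runs in Splunk, but a Python sketch with a made-up listen log looks like this:

```python
from collections import defaultdict

def distinct_tracks(listens):
    """Map each artist to the number of distinct tracks behind their listens."""
    tracks = defaultdict(set)
    for listen in listens:
        tracks[listen["artist"]].add(listen["track"])
    return {artist: len(titles) for artist, titles in tracks.items()}

# Hypothetical listen log: repeated listens to one track only count once here
listens = [
    {"artist": "Booka Shade", "track": "Body Language"},
    {"artist": "Booka Shade", "track": "In White Rooms"},
    {"artist": "sjowgren", "track": "seventeen"},
    {"artist": "sjowgren", "track": "seventeen"},
]
print(distinct_tracks(listens))  # {'Booka Shade': 2, 'sjowgren': 1}
```

Comparing this distinct-track count against total listens per artist is what reveals whether an artist's plays are concentrated on a few songs or spread across a discography.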

Returning to my comparison between Spotify and Last.fm data, Amtrac and Lane 8 are in both top 5 lists. This is somewhat expected, because if I look at the top 10 list of artists that I've most consistently listened to—artists that I've listened to at least once in each month of 2020—both Amtrac and Lane 8 place high in that list.

Screenshot of a table showing top 10 consistently listened to artists, with Lane 8 being listened to at least once in all 12 months of 2020, Amtrac 11 months, Caribou 11 months, Disclosure 11 months, Elderbrook 11 months, Kidnap 11 months, Kölsch 11 months, Tourist 11 months, Ben Böhmer 10 months, and CamelPhat for 10 months.

Given that only 2 days of December have happened as I write this, it’s unsurprising that I’ve only listened to one artist in every month of 2020. 

Top Songs of 2020

Enough about the artists—what about the songs? 

Screenshot of top 5 songs from spotify wrapped, duplicated in surrounding text.

According to Spotify, my Top 5 songs of the year are:

  1. Apricots by Bicep
  2. Atlas by Bicep
  3. Idontknow by Jamie xx
  4. Cappadocia by Ben Böhmer feat. Romain Garcia
  5. Know Your Worth by Disclosure feat. Khalid

That pretty closely matches my top 5 list according to Last.fm, with some notable exceptions.

Screenshot of Splunk table with top 10 songs of data, Apricots by Bicep with 38 listens, Atlas by Bicep with 32 listens, Idontknow by Jamie xx with 22 listens, White Ferrari (Greene Edit) by Jacques Greene with 21 listens, That Home Extended by The Cinematic Orchestra with 20 listens, Lalala by Y2K and bbno$ with 19 listens, Trish's Song by Hey Rosetta! with 18 listens, Wonderful by Burna Boy with 18 listens, Somewhere feat. Octavian by the Blaze with 17 listens, and Yes, I Know by Daphni with 17 listens.

My top 5 tracks according to Last.fm are:

  1. Apricots by Bicep (38 listens)
  2. Atlas by Bicep (32 listens)
  3. Idontknow by Jamie xx (22 listens)
  4. White Ferrari (Greene Edit) by Jacques Greene (21 listens)
  5. That Home Extended by The Cinematic Orchestra (20 listens)

The first 3 tracks match, though of course Spotify has an incomplete representation of those listens—I have 29 streams of Apricots according to Spotify.

However, since I bought the track almost as soon as it came out, I also have another 9 listens that happened off of Spotify. There were also some mysterious things happening with Spotify and Last.fm connections around that time, so it's possible some listens are missing beyond these numbers.

What’s up with the 4th track on the list, though? Where is that in Spotify’s data? It’s actually a bootleg remix of the Frank Ocean song White Ferrari that Jacques Greene shared on SoundCloud and as a free download earlier this year, so it isn’t anywhere on Spotify. It did, however, make it onto my top tracks of 2020 on SoundCloud:

Screenshot of top 13 tracks in SoundCloud, with Jacques Greene - White Ferrari (JG Edit) listed as the 11th track.

And again, this is a spot where metadata intrudes and leads to some inconsistent counts. If I look at all the permutations of White Ferrari and Jacques Greene in my data for 2020, the total number of listens should actually be a bit higher, at 23 total listens:

Screenshot of Splunk table showing the two permutations of the Jacques Greene remix, with 21 listens for the Greene Edit version and 2 listens for the JG Edit version, for a total of 23.

This would actually make it my 3rd-most-popular song of 2020 so far, and I'm listening to it as I write this paragraph, so let's go ahead and call the total 24 listens.

The 5th-most-popular song and 7th-most-popular song of 2020 make the case that I haven't been sleeping very well this year (though I recall these tracks showed up in 2019 too), because those 2 tracks comprise my "Insomnia" playlist that I use to help me fall asleep on nights when I've been, perhaps, staying up too late doing data analysis like this.

You can see the influence of consistent listening habits on top artist rankings when you look at the top 10 songs that I've consistently listened to throughout 2020: 2 songs by Kidnap, one by Bicep, and another by Amtrac.

Table of tracks listened to consistently in 2020, Never Come Back by Caribou listened to at least once in 8 months of 2020, Start Again by Kidnap with 8 months, Accountable by Amtrac with 7 months, Atlas by Bicep with 7 months, Calling out by Sophie Lloyd with 7 months, Made to Stray by Mount Kimbie for 7 months, Moments (Ben Böhmer Remix) by Kidnap with 7 months, Somewhere feat. Octavian by the Blaze with 7 months, The Promise by David Spinelli with 7 months, and Without You My Life Would Be Boring by The Knife with 7 months.

To me, though, this table mostly underscores how much music discovery this year involved: I didn't return to the same songs month after month during 2020. Likely as a result of all the DJ sets I've been streaming (as I mentioned in my post about Listening to Music while Sheltering in Place), this has been quite a year for music discovery and breadth of listening habits.

My top 10 songs of 2020 had a total of 222 listens across them. However, I have a total of 14,336 listens for the entire year, spread across 8,118 unique songs in total.


Even with possible metadata issues, that’s still quite the distribution of behavior. Let’s dig a bit deeper into artist discovery this year. 

Artist Discovery in 2020

In my post earlier this year about my listening behavior while sheltering in place, I discovered that my artist discovery numbers in 2020 seemed to be way up compared with 2018 and 2019, but weren’t actually that far off from 2017 numbers. 

What I see when comparing my 2020 artist discovery statistics from my Last.fm data and my Spotify data is even more interesting. In contrast to what seemed to be true in last year's post, Wrapping up the year and the decade in music: Spotify vs my data (for what it's worth, last year's number should have been 1,074 artists discovered, not 2,857; data analysis is difficult), Spotify's number is much higher than the number I calculated this year.


According to Spotify, I discovered 2,051 new artists, whereas my data claims that I only discovered 1,497 artists this year. 


Similarly, Spotify claims that I listened to 4,179 artists this year, whereas my data indicates that I listened to 3,715 artists. 


Again, this comes down to data structures and how the artist metadata is stored for each service. I wrote about the importance of quality metadata for digital streaming providers earlier this year in Why the quality of audio analysis metadatasets matters for music, but it’s also apparent that the data structures for those metadatasets are just as important for crafting data insights of varying value. 

Because Spotify stores all artists that contributed to a track as an array, I can listen to a track with 4 contributing artists, 1 of which I've listened to before, and according to Spotify, I've now discovered 3 artists and listened to 4. According to Last.fm, I've either listened to 1 artist that I've already heard before, or to a new artist, possibly called "Luciano & David Morales".

Screenshot of two artist names, Luciano, and Luciano & David Morales.

Spotify would store the second artist as Luciano, David Morales, thus allowing a more accurate count of listens for the artist Luciano. Similarly, my artist discovery data includes some flawed entries, such as YouTube videos that were incorrectly recorded.

Screenshot of 3 artist names in my data, Billie Joe Armstrong of Green Day, Billy Joel and Jimmy Fallon Form 2, and Biosphere.
The Billy Joel and Jimmy Fallon duet of The Lion Sleeps Tonight never gets old, but it appears the original video is no longer on YouTube so I’m not going to link it.
Screenshot of two artist names in my data, &lez and 'Coming of age ceremony' Dance cover by Jimin and Jung Kook.

This becomes clear in my top 20 artist discoveries of 2020 chart, where BTS and Big Hit Labels are listed separately, although they are both indicative of one of my best friends joining BTS ARMY this year and sharing her enthusiasm with me. 

Giant table of top 20 artists discovered in 2020, in order with first_discovered date last:
Re.You with 85 listens starting July 12, 2020
Elliot Adamson, 75 listens, April 15 2020
Fennec, 53 listens, March 24 2020
Southern Shores, 52 listens, November 19 2020
Eelke Kleijn, 45 listens, August 10 2020
Christian Löffler, 43 listens, April 2 2020
Icarus, 35 listens, April 2 2020
Monkey Safari, 35 listens, April 15 2020
Black Motion, 34 listens, April 30 2020
BTS, 31 listens, September 29 2020
Bronson, 31 listens, May 9 2020
Love Regenerator, 30 listens, March 30 2020
Eltonnick, 29 listens, April 27 2020
Jerro, 27 listens, April 29 2020
Theo Kottis, 27 listens, June 16 2020
Dennis Cruz, 26 listens, June 22 2020
Da Capo, 25 listens, May 10 2020
Big Hit Labels, 21 listens, June 30 2020
HYENAH, 20 listens, June 4 2020
KC Lights, 20 listens, September 22 2020

Ultimately I’m grateful that the top 20 artists of 2020 are all artists that I discovered during the pandemic and have excellent songs that I love and continue to listen to. Many of the sparklines that represent my listening activity for these artists throughout the year have spikes, but mostly my listening patterns indicate that I’ve been returning to these artists and their songs multiple times after first discovery. Some notable favorites on this list are KC Lights’ track Girl and Dennis Cruz’s track El Sueño, plus the entire Fennec album Free Us Of This Feeling.

Genre Discovery in 2020

The most-commented-on data insight from #wrapped2020 is probably the genre discovery slide.

According to Spotify, I listened to 801 genres this year, including 294 new ones. I’m not even sure I could name 30 genres, let alone 300 or 800. Where are these numbers coming from? 

It turns out that, much like storing artist data as an array for each song, Spotify stores genre data as an array for each artist. This means that each artist can be assigned multiple genres, thus successfully inflating the number of genres that you’ve listened to in 2020. 

For example, if I use Spotify’s API developer console to retrieve the artist information for Tourist, with a Spotify ID of 2ABBMkcUeM9hdpimo86mo6, it turns out that he has 6 total genres associated with him in Spotify’s database: chillwave, electronica, indie soul, shimmer pop, tropical house, and vapor soul. 

Screenshot of JSON response from Spotify API call, content duplicated in surrounding text.
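A minimal sketch of why per-artist genre arrays inflate the total: Tourist's genre list comes from the API response above, while Bicep's list here is an assumed example, not a real API response:

```python
# Per-artist genre arrays, in the shape Spotify's /v1/artists endpoint returns.
# Tourist's genres are from the response above; Bicep's are illustrative only.
artist_genres = {
    "Tourist": ["chillwave", "electronica", "indie soul", "shimmer pop",
                "tropical house", "vapor soul"],
    "Bicep": ["electronica", "house"],
}

# Counting the union of every artist's genre array means that listening to
# just these two artists already "covers" seven genres.
genres_heard = set()
for genres in artist_genres.values():
    genres_heard.update(genres)

print(len(genres_heard))  # 7
```

Scale that up to thousands of artists, each tagged with several genres, and a "you listened to 801 genres" headline number starts to look inevitable.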

I could start discussing the possible meaninglessness of genres as a descriptive tool, the impossibility of validating such a signifier, and the lack of clarity about how these genres were defined and assigned to specific artists, but that's best left for another blog post.

Instead, let’s look at what little genre data I do have available to me more generally. 


According to Spotify, my top genres were:

  1. House
  2. Electronica
  3. Pop
  4. Afro House
  5. Organic House

All of these make sense to me, except for Organic House, because I don’t know what makes house music organic, unless it’s also grass-fed, locally-sourced, and free range. Perhaps Blond:ish is organic house. 

I don't have any genre data from Last.fm, since the service only stores user-defined tags for each artist, and those aren't included in the data that I collect from Last.fm today. Instead, I have the genres assigned by iTunes for the tracks that I've purchased from the iTunes store.

The top 8 genres of music that I added to my iTunes library in 2020 by purchasing tracks from the iTunes store are:

  1. Dance (124 songs)
  2. Electronic (121 songs)
  3. House (78 songs)
  4. Pop (37 songs)
  5. Alternative (27 songs)
  6. Electronica (12 songs)
  7. Deep House (10 songs)
  8. Melodic House & Techno (9 songs)

Clearly, this is a very selective sample, tied only to my purchasing habits, which are roughly correlated with my listening habits.

I shared all of this genre data to essentially look at it and go “wow, that wasn’t very insightful at all”. Let’s move on. 

Time Spent Listening to Music in 2020

The last metric I want to unpack from Spotify’s #wrapped2020 campaign is the minutes listened data insight. According to Spotify, I spent 59,038 minutes listening to music this year. 


According to my own calculations, I spent roughly 81,134 minutes listening to music in 2020.

Let’s talk about how both of these metrics are super flawed!

Spotify counts a song as streamed after you listen to it for more than 30 seconds (per their Spotify for Artists FAQ), so it's logical to assume that this minutes listened metric comes from a calculation of "number of streams for a track" x "length of track", rounded and converted to minutes. It could even result from a different type of calculation, "number of total streams" x "average length of track in Spotify library", but I have no way of knowing if either of these is accurate besides tweeting at Spotify and hoping they'll pay attention to me.

Unfortunately for all of us, but mostly me, my own minutes listened metric is just as lazily calculated. I don't have track length data for all the tracks that I listen to, and I don't know at what point Last.fm counts a track as being worthy of a scrobble. I do have a list of how much time I spent listening to livestreamed DJ sets online, and I do have some excellent estimation skills. I calculated my number of 81,134 minutes so far in 2020 by assuming the following:

  • An average track length of 4 minutes
  • An average concert length of 3 hours
  • An average DJ set length of 4 hours
  • An average festival length of 8 hours

Using those averages and estimates, I calculated the total amount of time I spent listening to music across listening habits, concerts and DJ sets attended (no festivals this year), and livestreams that I watched online, thus arriving at 81,134 minutes. That doesn’t count any DJ sets that I listened to on SoundCloud, and certainly the combination of a 4 minute track length estimate with the uncertainty of what qualifies a track as being scrobbled makes this data insight somewhat meaningless.
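The estimate described above can be sketched with the stated averages. The event counts in the example call are placeholders, not my real 2020 numbers:

```python
# Average lengths from the list above, in minutes
AVG_TRACK_MIN = 4
AVG_CONCERT_MIN = 3 * 60
AVG_DJ_SET_MIN = 4 * 60

def estimate_minutes(track_listens, concerts, dj_sets):
    """Estimate total listening minutes from event counts and average lengths."""
    return (track_listens * AVG_TRACK_MIN
            + concerts * AVG_CONCERT_MIN
            + dj_sets * AVG_DJ_SET_MIN)

# Placeholder counts for illustration (the listen total is from this post,
# the concert and DJ set counts are made up)
print(estimate_minutes(track_listens=14_336, concerts=2, dj_sets=25))
```

The fragility of the estimate is obvious from the code: nudging the average track length from 4 to 3.5 minutes swings the result by thousands of minutes.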

Regardless, let’s compare this estimated time spent listening in minutes against the total number of minutes in a year.

Total minutes listened (81,134) as a gauge compared with total minutes in a year (525,600)

Beautiful. I still remembered to sleep this year. No matter which dataset I use, however, it’s clear that I’ve listened to more music in 2020 than in 2019. Spotify’s metric for this same time period in 2019 was 35,496 minutes. The less-flawed but less-complete metric I used last year, calculated using the track length stored in iTunes multiplied by the number of listens for that track, indicated that I spent 14,296 minutes listening to music in 2019. 

As one final Spotify examination, let’s dig into the Spotify Top 100 playlist.

Top 100 Songs of 2020 Playlist

Alongside the fancy graphics and data insights in the #wrapped2020 campaign, Spotify also creates a 100 song playlist, likely (but not definitively) the top 100 songs of the time period between January 1st, 2020 and October 31st, 2020. 

I found my playlist this year to be relatively accurate, perhaps because I spent more time listening to Spotify than I might have in previous years, or perhaps they made some internal data improvements, or both! I often spend more time listening to SoundCloud when I'm traveling a lot, playing offline DJ sets on plane flights, or listening to Apple Music on my iPhone with songs that I've added from my iTunes library. Without much time spent commuting or traveling this year, it's likely that my listening habits remained fairly consolidated.


Similarly to what I discovered about my top 10 tracks, I had relatively distributed music interests this year. The 811 total listens for all 100 songs in my Spotify playlist represent only about 5.7% of my total listens in 2020 so far.


Despite my overall listening habits being relatively distributed across lots of artists and songs, the Top Songs playlist is somewhat more consolidated, with 69 artists performing the 100 songs on the playlist. Nice. 


It’s clear that I spent most of this year exploring and discovering new artists, given that 83 of my top songs of 2020 according to Spotify were songs that I discovered in 2020. 

Thanks for coming on this journey through my music data with me. I’ll be back at the actual end of the year to dive deeper into my top 10 artists of the year, top 10 consistent artists of the year, my music purchasing activity, as well as some more livestream and concert statistics to round out my 2020 year in music. 

Define the question: How missing data biases data-driven decisions

This is the eighth and final post in a series about how missing data biases data-driven decisions. Start at the beginning: What's missing? Reduce bias by addressing data gaps in your analysis process.

In this post, I’ll cover the following: 

  • Define the question you want to answer for your data analysis process
  • How does data go missing when you’re defining your question?
  • What can you do about missing data when defining your question?

This post also concludes this blog post series about missing data, featuring specific actions you can take to reduce bias resulting from missing data in your end-to-end data-driven decision-making process.

Define the question 

When you start a data analysis process, you always want to start by deciding what questions you want to answer. Before you make a decision, you need to decide what you want to know.

If you start with the data instead of with a question, you're sure to be missing data that could help you make a decision, because you're starting with what you have instead of what you want to know.

Start by carefully defining what you want to know, and then determine what data you need to answer that question. What aggregations and analyses might you perform, and what tools do you need access to in order to perform your analysis? 

If you're not sure how to answer the question, or what questions to ask to make the decisions that you want to make, you can explore best practices guidance and talk to experts in your field. For example, I gave a presentation about how to define questions when trying to prioritize documentation using data (watch on YouTube). If you are trying to monitor and make decisions about software that you're hosting and managing, you can dig into the RED method or the USE method for infrastructure monitoring.

It’s also crucial to consider whether you can answer that question adequately, safely, and ethically with the data you have access to.

How does data go missing? 

Data can go missing at this stage if it isn't there at all—if the data you need to answer a question does not exist. There's also the possibility that the data you want to use is incomplete, or that you have some, but not all, of the data that you need to answer the question.

It's also possible that the data exists, but you can't have it: either you can't access it, or you aren't permitted to use the data that has already been collected to answer your particular question.

It’s also possible that the data that you do have is not accurate, in which case the data might exist to help answer your question, but it’s unusable, so it’s effectively missing. Perhaps the data is outdated, or the way it was collected means you can’t trust it. 

Who funded the data collection, who performed it, and when and why it was performed can tell you a lot about whether or not you can use a dataset to answer your particular set of questions.

For example, if you are trying to answer the question "What is the effect of slavery on the United States?", you could review economic reports and the records from plantations about how humans were bought and sold, and stop there. But you might be better off considering who created those datasets, who is missing from them, whether or not they are useful to answer your question, and which datasets might be missing entirely because they were never created or because the records that did exist were destroyed. You might also want to consider whether or not it's ethical to use data to answer specific questions about the lived experiences of people.

Or, for another grim example, if you want to understand how American attitudes toward Muslims changed after 9/11, you could (if you're Paul Krugman) look at hate crime data and stop there. Or, as Jameel Jaffer points out in a Twitter thread, you could consider whether hate crime data is enough to represent the experience of Muslims after 9/11, given that "most of the 'anti-Muslim sentiment and violence' was *officially sanctioned*" and therefore missing from an analysis that focuses solely on hate crime data. Jaffer continues by pointing out that,

“For example, hundreds of Muslim men were rounded up in New York and New Jersey in the weeks after 9/11. They were imprisoned without charge and often subject to abuse in custody because of their religion. None of this would register in any hate crimes database.” 

Data can also go missing if the dataset that you choose to use to answer your question is incomplete.

Incomplete dataset by relying only on digitized archival films

As Rick Prelinger laments in a tweet—if part of a dataset is digitized, often that portion of the dataset is used for data analysis (or research, as the case may be), with the non-digitized portion ignored entirely. 

Screenshot of a tweet by Rick Prelinger @footage "20 years ago we began putting archival film online. Today I can't convince my students that most #archival footage is still NOT online. Unintended consequence of our work: the same images are repeatedly downloaded and used, and many important images remain unused and unseen." sent 12:15 PM Pacific time May 27, 2020

For example, if I wanted to answer the question "What are common themes in American television advertising in the 1950s?", I might turn to the Prelinger Archives, because they make so much digitized archival film footage available. But just because a dataset is easily accessible doesn't make it complete, and just because it's there doesn't make it the best dataset to answer your question.

It’s possible that the Prelinger Archives don’t have enough film footage for me to answer such a broad question. In this case, I can supplement the dataset available to me with information that is harder to find, such as by tracking down those non-digitized films. I can also choose to refine my question to focus on a specific type of film, year, or advertising agency that is more comprehensively featured in the archive, narrowing the scope of my analysis to focus on the data that I have available. I could even choose a different dataset entirely, if I find one that more comprehensively and accurately answers my question.

Possibly the most common way that data can go missing when trying to answer a question is that the data you have, or even all of the data available to you, doesn’t accurately proxy what you want to know. 

Inaccurate proxy to answer a question leads to missing data

If you identify data points that inaccurately proxy the question that you’re trying to answer, you can end up with missing data. For example, if you want to answer the question, “How did residents of New York City behave before, during, and after Hurricane Sandy?”, you might look at geotagged social media posts. 

Kate Crawford discusses a study by Nir Grinberg, Mor Naaman, Blake Shaw, and Gilad Lotan, Extracting Diurnal Patterns of Real World Activity from Social Media, in the context of this question in her excellent 2013 article for Harvard Business Review, The Hidden Biases in Big Data.

As she puts it,

“consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. A fascinating study combining Sandy-related Twitter and Foursquare data produced some expected findings (grocery shopping peaks the night before the storm) and some surprising ones (nightlife picked up the day after — presumably when cabin fever strikes). But these data don’t represent the whole picture.” 

That’s because the users of social media, especially those that use Twitter and Foursquare and share location data with those tools, represent only a specific slice of the population affected by Hurricane Sandy, and that slice is not a representative or comprehensive sample of New York City residents. Indeed, as Crawford makes very clear, “there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate.”

The dataset of geotagged social media posts represents only some residents of New York City, and not in a representative way, so it’s an inaccurate proxy for the experience of all New York City residents. This means data is missing from the question stage of the data analysis process. You want to answer a question about the experience of all New York City residents, but you only have data about the residents who shared geotagged posts on social media during a specific period of time. 

The risk is clear—if you don’t identify the gaps in this dataset, you might draw false conclusions. Crawford is careful to point this out clearly, identifying that “The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster.”

When you identify the gaps in the dataset, you can understand what limitations exist in the dataset, and thus how you might draw false and biased conclusions. You can also identify new datasets to examine or groups to interview to gather additional data to identify the root cause of the missing data (as discussed in my post on data gaps in data collection). 

The gaps in who uses Twitter, and who chooses to use Twitter during a natural disaster, are one way that Twitter data can inaccurately proxy a population that you want to research and thus cause data to go missing. Another way data can go missing is that the platform inaccurately represents human behavior in general, because interactions with the platform itself are not neutral. 

As Angela Xiao Wu points out in her blog post, How Not to Know Ourselves, based on a research paper she wrote with Harsh Taneja:

“platform log data are not ‘unobtrusive’ recordings of human behavior out in the wild. Rather, their measurement conditions determine that they are accounts of putative user activity — ‘putative’ in a sense that platforms are often incentivized to keep bots and other fake accounts around, because, from their standpoint, it’s always a numbers game with investors, marketers, and the actual, oft-insecure users.” 

Put another way, you can’t interpret social media interactions as neutral reflections of user behavior due to the mechanisms a social media platform uses to encourage user activity. The authors also point out that it’s difficult to identify if social media interactions reflect the behavior of real people at all, given the number of bot and fake accounts that proliferate on such sites. 

Using a dataset that inaccurately proxies the question that you’re trying to answer is just one way for data to go missing at this stage. What can you do to prevent data from going missing as you’re devising the questions you want to ask of the data? 

What can you do about missing data?

Most importantly, redefine your questions so that you can use data to answer them! If you refine the questions that you’re trying to ask into something that can be quantified, it’s easier to ask the question and get a valid, unbiased, data-driven result. 

Rather than try to understand the experience of all residents of New York City before, during, and after Hurricane Sandy, you can constrain your efforts to understand how social media use was affected by Hurricane Sandy, or how users that share their locations on social media altered their behavior before, during, and after the hurricane.

As another example, you might shift from trying to understand “How useful is my documentation?” to asking a question based on the data that you have: “How many people view my content?”. You can also try making a broad question more specific. Instead of asking “Is our website accessible?”, ask “Does our website meet the AA standard of the Web Content Accessibility Guidelines?” 

Douglas Hubbard’s book, How to Measure Anything, provides excellent guidance about how to refine and devise a question that you can use data analysis to answer. He also makes the crucial point that sometimes it’s not worth it to use data to answer a question. If you are fairly certain that you already know the answer, and performing the data analysis (let alone performing it well) would take a lot of time and resources, it’s perhaps not worth attempting to answer the question with data at all! 

You can also choose to use a different data source. If the data that you have access to in order to answer your question is incomplete, inadequate, inaccurate, or otherwise missing data, choose a different data source. This might lead you to change your dataset choice from readily-available digitized content to microfiche research at a library across the globe in order to perform a more complete and accurate data analysis.

And of course, if a different data source doesn’t exist, you can create a new data source with the information you need. Collaborate with stakeholders within your organization, make a business case to a third-party provider whose system you want to gather data from, or use Freedom of Information Act (FOIA) requests to compile data that exists but is not easily accessible into a dataset. 

I also want to take care to acknowledge that choosing to use or create a different dataset can often require immense privilege—monetary privilege to fund added data collection, a trip across the globe, or a more complex survey methodology; privilege of access, to know others doing similar research who are willing to share data with you; and privilege of time, to perform the added data collection and analysis that might be necessary to prevent missing data.

If the data exists but you don’t have permission to use it, you might devise a research plan to request access to sensitive data, or work to gain the consent of those in the dataset that you want to use to allow you to use the data to answer the question that you want to answer. This is another case where communicating the use case of the data can help you gather it—if you share the questions that you’re trying to answer with the people that you’re trying to collect data from, they may be more inclined to share it with you. 

Take action to reduce bias in your data-driven decisions from missing data 

If you’re a data decision-maker, you can take these steps:

  1. Define the questions being answered with data. 
  2. Identify missing data in the analysis process.
  3. Ask questions of the data analysis before making decisions.

If you carefully define the questions guiding the data analysis process, clearly communicating your use cases to the data analysts that you’re working with, you can prevent data from going missing at the very start. 

Work with your teams and identify where data might go missing in the analysis process, and do what you can to address a leaky analysis pipeline. 

Finally, ask questions of the data analysis results before making decisions. Dig deeper into what is communicated to you, seek to understand what might be missing from the reports, visualizations, and analysis results being presented, and whether or not that missing data is relevant to your decision. 

If you work with data as a data analyst, engineer, admin, or communicator, you can take these steps:

  1. Steward and normalize data.
  2. Analyze data at multiple levels of aggregation and time spans.
  3. Add context to reports and communicate missing data.

Responsibly steward data as you collect and manage it, and normalize it when you prepare it for analysis to make it easier to use. 

If you analyze data at multiple levels of aggregation and time spans, you can determine which level allows you to communicate the most useful information with the least amount of data going missing, hidden by overgeneralized aggregations or overlarge time spans, or hidden in the noise of overly-detailed time spans or too many split-bys. 
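The comparison above is quick to sketch: summarize the same events at several granularities and see what each level hides. Here's a minimal Python sketch with invented page-view events (the dates and page names are made up for illustration):

```python
from collections import Counter
from datetime import datetime

# Hypothetical page-view events: (timestamp, page)
events = [
    (datetime(2023, 5, 1, 9), "install"),
    (datetime(2023, 5, 1, 14), "install"),
    (datetime(2023, 5, 2, 10), "upgrade"),
    (datetime(2023, 5, 15, 11), "install"),
    (datetime(2023, 6, 3, 16), "upgrade"),
]

def aggregate(events, key):
    """Count events at the granularity produced by `key`."""
    return Counter(key(ts) for ts, _page in events)

# Daily counts keep detail; monthly counts smooth it away.
daily = aggregate(events, lambda ts: ts.strftime("%Y-%m-%d"))
monthly = aggregate(events, lambda ts: ts.strftime("%Y-%m"))

print(dict(daily))    # four distinct days, two views on May 1
print(dict(monthly))  # {'2023-05': 4, '2023-06': 1}
```

The monthly view is easier to communicate, but it hides that activity clustered at the start of May; producing both lets you pick the level that loses the least.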

Add context to the reports that you produce, providing details about the data analysis process and the dataset used, acknowledging what’s missing and what’s represented. Communicate missing data with detailed and focused visualizations, keeping visualizations consistent for regularly-communicated reports. 

I hope that no matter your role in the data analysis process, this blog post series helps you reduce missing data and make smarter, more accurate, and less biased data-driven decisions.

Collect the data: How missing data biases data-driven decisions

This is the seventh post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

When you’re gathering the data you need and creating datasets that don’t exist yet, you’re in the midst of the data collection stage. Data can easily go missing when you’re collecting it! 

In this post, I’ll cover the following: 

  • How data goes missing at the data collection stage 
  • What to do about missing data at the collection stage

How does data go missing?

There are many reasons why data might be missing from your analysis process. Data goes missing at the collection stage because the data doesn’t exist, because the data exists but you can’t use it for some reason, or because the data exists but the events in the dataset are missing information. 

The dataset doesn’t exist 

Frequently data goes missing because the data itself does not exist, and you need to create it. It’s very difficult, and often impractical, to create a comprehensive dataset, so data can easily go missing at this stage. It’s important to do what you can to make sure that any data that does go missing does so consistently, if possible, by collecting representative data. 

In some cases, though, you do need comprehensive data. For example, if you need to create a dataset of all the servers in your organization for compliance reasons, you might discover that there is no one dataset of servers, and that efforts to compile one are a challenge. You can start with just the servers that your team administers, but that’s an incomplete list. 

Some servers are grant-owned and fully administered by a separate team entirely. Perhaps some servers are lurking under the desks of some colleagues, connected to the network but not centrally managed. You can try to use network scans to come up with a list, but then you gather only those servers connected to the network at that particular time. Airgapped servers or servers that aren’t turned on 24/7 won’t be captured by such an audit. It’s important to continually consider if you really need comprehensive data, or just data that comprehensively addresses your use case. 

The data exists, but… 

There’s also a chance that the data exists, but isn’t machine-readable. If the data is provided only in PDFs, as the results of many FOIA requests are, then it becomes more difficult to include the data in data analysis. There’s also a chance that the data is available only as paper documents, as is the case with gun registration records. As Jeanne Marie Laskas reports for GQ in Inside The Federal Bureau Of Way Too Many Guns, having records only on paper prevents large-scale data analysis of the information, effectively causing it to go missing from the entire process of data analysis. 

It’s possible that the data exists, but isn’t on the network—perhaps because it is housed on an airgapped device, or perhaps stored on servers subject to different compliance regulations than the infrastructure of your data analysis software. In this case, the data exists but it is missing from your analysis process because it isn’t available to you due to technical limitations. 

Another common case is that the data exists, but you can’t have it. If you’ve made an enemy in another department, they might not share the data with you because they don’t want to. It’s more likely that access to the data is controlled by legal or compliance concerns, so you aren’t able to access the data for your desired purposes, or perhaps you can’t analyze it on the tool that you’re using for data analysis due to compliance reasons. 

For example, most doctors’ offices and hospitals in the United States use electronic health records systems to store the medical records of thousands of Americans. However, scientific researchers are not permitted to access detailed electronic health records of patients, though they exist in large databases and the data is machine-readable, because the Health Insurance Portability and Accountability Act (HIPAA) privacy rule regulates how protected health information (PHI) can be accessed and used. 

Perhaps the data exists, but is only available to people who pay for access. This is the case for many music metadata datasets like those from Nielsen, much to my chagrin. The effort it takes to create quality datasets is often commoditized. This also happens with scientific research, which is often only available to those with access to scientific journals that publish the results of the research. The datasets that produce the research are also often closely-guarded, as one dataset is time-consuming to create and can lead to multiple publications. 

There’s also a chance the data exists, but isn’t made available outside of the company. A common circumstance for this is public API endpoints for cloud services. Spotify collects far more data than it makes available via its API, as do companies like Zoom and Google. You might hope to collect various types of data from these companies, but if the API endpoints don’t make the data available, you don’t have many options.

And of course, in some cases the data exists, but it’s inconsistent. Maybe you’re trying to collect equivalent data from servers or endpoints with different operating systems, but you can’t get the same details due to logging limitations. A common example is trying to collect the same level of detail from computers running macOS and computers running Windows. You can also see inconsistencies if different log levels are set on different servers for the same software. This inconsistent data causes data to go missing within events and makes it more difficult to compare like with like. 
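One way to cope with that inconsistency downstream is to normalize each source's events into a common schema, explicitly marking the fields a source can't provide rather than letting them silently vanish. A sketch (all of the field names here are invented for illustration, not taken from real OS logging):

```python
# A common schema for process events gathered from two hypothetical sources.
COMMON_FIELDS = ["host", "user", "process", "integrity_level"]

def normalize(event: dict, source: str) -> dict:
    """Map a source-specific event into the common schema."""
    field_map = {
        "windows": {"host": "ComputerName", "user": "AccountName",
                    "process": "NewProcessName", "integrity_level": "MandatoryLabel"},
        "macos":   {"host": "hostname", "user": "uid",
                    "process": "proc_path"},  # no integrity-level equivalent
    }[source]
    # Fields the source can't provide become None, so the gap is visible.
    return {f: event.get(field_map[f]) if f in field_map else None
            for f in COMMON_FIELDS}

win = normalize({"ComputerName": "w1", "AccountName": "alice",
                 "NewProcessName": r"C:\tools\x.exe", "MandatoryLabel": "High"},
                "windows")
mac = normalize({"hostname": "m1", "uid": "502", "proc_path": "/bin/ls"}, "macos")
print(mac["integrity_level"])  # None — the missing field is explicit, not hidden
```

Recording the gap as an explicit `None` (rather than omitting the field) makes it much easier to notice later that you're not comparing like with like.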

14-page forms lead to incomplete data collection in a pandemic 

Data can easily go missing if it’s just too difficult to collect. An example from Illinois, reported by WBEZ reporter Kristen Schorsch in Illinois’ Incomplete COVID-19 Data May Hinder Response, is that “the Illinois Department of Public Health issued a 14-page form that it has asked hospitals to fill out when they identify a patient with COVID-19. But faced with a cumbersome process in the midst of a pandemic, many hospitals aren’t completely filling out the forms.”

It’s likely that as a result of the length of the form, data isn’t consistently collected for all patients from all hospitals—which can certainly bias any decisions that the Illinois Department of Public Health might make, given that they have incomplete data. 

In fact, as Schorsch reports, public health workers “told WBEZ that [the missing data] makes it harder for them to understand where to fight for more resources, like N95 masks that provide the highest level of protection against COVID-19, and help each other plan for how to make their clinics safer as they welcome back patients to the office.” 

In this case, where data is going missing because it’s too difficult to collect, you can refocus your data collection on the most crucial data points for what you need to know, rather than the most complete data points.

What can you do about missing data? 

Most crucially, identify the missing data. If you know that you need a certain type of data to answer the questions in your data analysis, you need to be able to recognize when that data is missing from your analysis process. 

After you identify the missing data, you can determine whether or not it matters. If the data that you do have is representative of the population that you’re making decisions about, and you don’t need comprehensive data to make those decisions, a representative sample of the data is likely sufficient. 

Communicate your use cases

Another important thing you can do is to communicate your use cases to the people collecting the data. For example, 

  • If software developers have a better understanding of how telemetry or log data is being used for analysis, they might write more detailed or more useful logging messages and add new fields to the telemetry data collection. 
  • If you share a business case with cloud service providers to provide additional data types or fields via their API endpoints, you might get better data to help you perform less biased and more useful data analyses. In return, those cloud providers are likely to retain you as a customer. 

Communicating the use case for data collection is most helpful when communicating that information leads to additional data gathering. It’s riskier when it might cause potential data sources to be excluded. 

For example, if you’re using a survey to collect information about a population’s preferences—let’s say, the design of a sneaker—and you disclose that information upfront, you might only get people with strong opinions about sneaker design responding to your survey. That can be great if you want to survey only that population, but if you want a more mainstream opinion, you might miss those responses because the use case you disclosed wasn’t interesting to them. In that context, you need to evaluate the missing data for its relevance to your analysis. 

Build trust when collecting sensitive data 

Data collection is a trust exercise. If the population that you’re collecting data about doesn’t understand why you’re collecting the data, doesn’t trust that you will protect it and use it as you say you will, or believes that you will use the data against them, you might end up with missing data.

Nowhere is this more apparent than with the U.S. Census. Performed every 10 years, the data from the census is used to determine political representation, distribute federal resources, and much more. Because of how the data from the census survey is used, a representative sample isn’t enough—it must be as complete a survey as possible. 

Screenshot of the Census page How We Protect Your Information.

The Census Bureau understands that mistrust is a common reason why people might not respond to the census survey. Because of that, the U.S. Census Bureau hires pollsters that are part of groups that might be less inclined to respond to the census, and also provides clear and easy-to-find details on its website (see How We Protect Your Information) about the measures in place to protect the data collected in the census survey. Those details are even clear in the marketing campaigns urging you to respond to the census! The census survey also faces other challenges when ensuring the comprehensive survey is as complete as possible.

This year, the U.S. Census also faced time limits for completing the collection and counting of surveys, in addition to delays already imposed by the COVID-19 pandemic. The New York Times has additional details about those challenges: The Census, the Supreme Court and Why the Count Is Stopping Early.

Address instances of mistrust with data stewardship

As Jill Lepore discusses in episode 4, Unheard, of her podcast The Last Archive, mistrust can also affect the accuracy of the data being collected, as in the case of formerly enslaved people being interviewed by descendants of their former owners, or by their current white neighbors, for records collected by the Works Progress Administration. Surely, data is missing from those accounts of slavery due to mistrust of the people doing the data collection, or at the least, because those collecting the stories perhaps did not deserve to hear the true lived experiences of the formerly enslaved people. 

If you and your team are not good data stewards, if you don’t do a good job of protecting data that you’ve collected or managing who has access to that data, people are less likely to trust you with more data—and thus it’s likely that datasets you collect will be missing data. Because of that, it’s important to practice good data stewardship. Use datasheets for datasets, or a data biography to record when data was collected, for what purpose, by whom or what means, and more. You can then review those to understand whether data is missing, or even to remember what data might be intentionally missing. 

In some cases, data can be intentionally masked, excluded, or left to collect at a later date. If you keep track of these details about the dataset during the data collection process, it’s easier to be informed about the data that you’re using to answer questions and thus use it safely, equitably, and knowledgeably. 
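A data biography doesn't require special tooling; even a small structured record stored next to the dataset it describes can capture the when, who, why, and what's-missing. A sketch (the dataset name and field values are invented for illustration):

```python
import json

# A minimal data biography, kept alongside the dataset it describes.
biography = {
    "dataset": "docs-site-pageviews",          # hypothetical dataset name
    "collected_by": "web analytics pipeline",
    "collected_when": "2020-01-01/2020-06-30",
    "purpose": "measure documentation usage",
    "known_gaps": [
        "users with tracking blockers are not counted",
        "IP addresses masked before storage (intentional)",
    ],
}

with open("docs-site-pageviews.biography.json", "w") as f:
    json.dump(biography, f, indent=2)
```

The `known_gaps` field is the important one here: it's where intentionally masked or deferred data gets recorded so that, months later, you can tell deliberate absence from accidental loss.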

Collect what’s missing, maybe

If possible and if necessary, collect the data that is missing. You can create a new dataset if one does not already exist, such as those that journalists and organizations such as Campaign Zero have been compiling about police brutality in the United States. Some data collection that you perform might supplement existing datasets, such as adding additional introspection details to a log file to help you answer a new question for an existing data source. 

If there are cases where you do need to collect additional data, you might not be able to do so at the moment. In those cases, you can build a roadmap or a business case to collect the data that is missing, making it clear how it can help reduce uncertainty for your decision. That last point is key, because collecting more data isn’t always the best solution for missing data. 

Sometimes, it isn’t possible to collect more data: for instance, if you’re trying to gather historical data, but everyone from that period has died and few or no primary sources remain, or if the data has been destroyed, whether in a fire or intentionally, as the Stasi destroyed their records after the fall of the Berlin Wall.

Consider whether you need complete data

Also consider whether or not more data will actually help address the problem that you’re attempting to solve. You can be missing data, and yet still not need to collect more data in order to make your decision. As Douglas Hubbard points out in his book, How to Measure Anything, data analysis is about reducing uncertainty about what the most likely answer to a question is. If collecting more data doesn’t reduce your uncertainty, then it isn’t necessary. 

Nani Jansen Reventlow of the Digital Freedom Fund makes this point clear in her op-ed for Al Jazeera, Data collection is not the solution for Europe’s racism problem. In that case, collecting more data, even though it could be argued that the data is missing, doesn’t actually reduce uncertainty about the likely solution for racism. Being able to quantify the effects or harms of racism on a region does not solve the problem—only the drive to solve the problem can do that. 

Avoid cases where you continue to collect data, especially at the expense of an already-marginalized population, in an attempt to prove what is already made clear by the existing information available to you. 

You might think that data collection is the first stage of a data analysis process, but in fact, it’s the second. The next and last post in this series covers defining the question that guides your data analysis, and how to take action to reduce bias in your data-driven decisions: Define the question: How missing data biases data-driven decisions

Manage the data: How missing data biases data-driven decisions

This is the sixth post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

In this post, I’ll cover the following:

  • What is data management?
  • How does data go missing, featuring examples of disappearing data
  • What you can do about missing data

How you manage data in order to prepare it for analysis can cause data to go missing, and decisions based on the resulting analysis to be biased. With so many ways for data to go missing, there are just as many chances to address the potential bias that results from missing data at this stage.

What is data management?

Data management, for the purposes of this post, covers all the steps you take to prepare data after it’s been collected. That includes all the steps you take to answer the following questions:

  • How do you extract the data from the data source?
  • What transformations happen to the data to make it easier to analyze?
  • How is it loaded into the analysis tool?
  • Is the data normalized against a common information model?
  • How is the data structured (or not) for analysis?
  • What retention periods are in place for different types of data?
  • Who has access to the data? 
  • How do people access the data?
  • For what use cases are people permitted to access the data?
  • How is information stored and shared about the data sources? 
  • What information is stored or shared about the data sources?
  • What upstream and downstream dependencies feed into the data pipeline? 

How you answer these questions (if you even consider them at all) can cause data to go missing when you’re managing data. 

How does data go missing? 

Data can go missing at this stage in many ways. With so many moving parts from various tooling and transformation steps being taken to prepare data for analysis and make it easier to work with, a lot can go wrong. For example, if you neglect to monitor your dependencies, a configuration change in one system can cause data to go missing from your analysis process. 

Disappearing data: missing docs site metrics 

It was just an average Wednesday when my coworker messaged me asking for help with her documentation website metrics search—she thought she had a working search, but it wasn’t showing the results she expected. It was showing her that no one was reading any of her documentation, which I knew couldn’t be true.

As I dug deeper, I realized the problem wasn’t the search syntax, but the indexed data itself. We were missing data! 

I reported it to our internal teams, and after some investigation they realized that a configuration change on the docs site had resulted in data being routed to a different index. A configuration change that they thought wouldn’t affect anything ended up causing data to go missing for nearly a week because we weren’t monitoring dependencies crucial to our data management system. 

Thankfully, the data was only misrouted and not dropped entirely, but it was a good lesson in how easily data can go missing at this management stage. If you identify the sources you expect to be reporting data, then you can monitor for changes in the data flow. You can also document those sources as dependencies, and ensure that configuration changes include additional testing to ensure the continued fidelity of your data collection and management process. 
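A lightweight version of that monitoring is to maintain an explicit list of the sources you expect to report, and alert on the difference each window. A sketch (the source names are invented):

```python
# Sources we expect to report data, recorded as an explicit dependency list.
EXPECTED_SOURCES = {"docs-site", "blog", "support-portal"}

def missing_sources(events: list) -> set:
    """Return expected sources that sent no events in this window."""
    seen = {e["source"] for e in events}
    return EXPECTED_SOURCES - seen

# One monitoring window's worth of events: the support portal went quiet.
window = [{"source": "docs-site"}, {"source": "blog"}]
print(missing_sources(window))  # {'support-portal'} — time to investigate
```

A check like this wouldn't tell you *why* the docs-site data was misrouted, but it would have turned a week of silently missing data into a same-day alert.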

Disappearing data: data retention settings slip-up 

Another way data can go missing is if you neglect to manage or be aware of default tool constraints that might affect your data. 

In this example, I was uploading my music data to the Splunk platform for the first time. I was so excited to analyze the 10 years of historical data. I uploaded the file, set up the field extractions, and got to searching my data. I wrote an all-time search to see how my music listening habits had shifted year over year in the past decade—but only 3 years of results were returned. What?!

In my haste to start analyzing my data, I’d completely ignored a warning message about a seemingly irrelevant setting called “max_days_ago”. It turns out, this setting is set by default to drop any data older than 3 years. The Splunk platform recognized that I had data in my dataset older than 3 years, but I didn’t heed the warning and didn’t update the default setting to match my data. I ended up having to delete the data I’d uploaded, fix my configuration settings, and upload the data again—without any of it being dropped this time! 
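For the curious, age limits like this live in Splunk's props.conf. A sketch of the fix (the sourcetype stanza name is hypothetical, and the exact default and appropriate value depend on your Splunk version and your data):

```ini
# props.conf for a hypothetical music-history sourcetype
[music_history]
# Accept events up to ~11 years old instead of dropping older ones at index time.
MAX_DAYS_AGO = 4000
```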

This experience taught me to pay attention to how I configure a tool to manage my data to make sure data doesn’t go missing. This happened to me while using the Splunk platform, but it can happen with whatever tool you’re using to manage, transform, and process your data.

As reported by Alex Hern in the Guardian:

“A million-row limit on Microsoft’s Excel spreadsheet software may have led to Public Health England misplacing nearly 16,000 Covid test results”. This happened because of a mismatch in formats and a misunderstanding of the data limitations imposed by the file formats used by labs to report case data, as well as of the software (Microsoft Excel) used to manage the case data. Hern continues, pointing out that “while CSV files can be any size, Microsoft Excel files can only be 1,048,576 rows long – or, in older versions which PHE may have still been using, a mere 65,536. When a CSV file longer than that is opened, the bottom rows get cut off and are no longer displayed. That means that, once the lab had performed more than a million tests, it was only a matter of time before its reports failed to be read by PHE.” 
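A cheap guard against that failure mode is to check a CSV's row count against the tool's limit before anyone opens it in Excel. A sketch in Python (the row limits are Excel's documented maximums):

```python
import csv

EXCEL_MAX_ROWS = 1_048_576  # modern .xlsx limit; older .xls files cap at 65,536

def fits_in_excel(path: str) -> bool:
    """Return False if opening this CSV in Excel could silently cut off rows."""
    with open(path, newline="") as f:
        row_count = sum(1 for _ in csv.reader(f))
    return row_count <= EXCEL_MAX_ROWS
```

A check this small, run wherever files are handed between systems, turns a silent truncation into a loud failure.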

This limitation in Microsoft Excel isn’t the only way that tool limitations and settings can cause data to go missing at the data management stage. 

Data transformation: Microsoft wants genes to be dates 

If you’re not using Splunk for your data management and analysis, you might be using Microsoft Excel. It turns out that Microsoft Excel, despite (or perhaps because of) its popularity, can also cause data to go missing due to configuration settings. In the case of some genetics researchers, Microsoft Excel was transforming their data incorrectly: the software converted certain gene names, such as MAR1 and DEC1, into the dates March 1 and December 1, causing data to go missing from the analysis. 

Clearly, if you’re doing genetics research, this is a problem. Your data has been changed, and this missing data will bias any research based on this dataset, because certain genes are now dates! 

To handle cases where a tool is improperly transforming data, you have 3 options:

  • Change the tool that you’re using,
  • Modify the configuration settings of the tool so that it doesn’t modify your data,
  • Or modify the data itself.

The genetics researchers ended up deciding to modify the data itself. The HUGO Gene Nomenclature Committee officially renamed 27 genes to accommodate this data transformation error in Microsoft Excel. Thanks to this decision, these researchers have one fewer configuration setting to worry about when helping to ensure vital data doesn’t go missing during the data analysis process. 
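If renaming your data isn’t an option, you can screen for risky values before they ever reach a spreadsheet. A rough Python sketch, using an illustrative list of symbols and a simplified month-pattern check:

```python
import re

# Month-like gene symbols (e.g. MAR1, DEC1, SEPT9) that spreadsheet
# software may silently convert to dates when the file is opened.
MONTH_PREFIXES = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                  "JUL", "AUG", "SEP", "SEPT", "OCT", "NOV", "DEC")

def at_risk_of_date_coercion(symbol: str) -> bool:
    """Return True if a gene symbol looks like a month name plus a day number."""
    match = re.fullmatch(r"([A-Z]+)(\d{1,2})", symbol.upper())
    return bool(match) and match.group(1) in MONTH_PREFIXES

symbols = ["MAR1", "DEC1", "SEPT9", "BRCA1", "TP53"]
flagged = [s for s in symbols if at_risk_of_date_coercion(s)]
# flagged == ["MAR1", "DEC1", "SEPT9"]
```

A screen like this won’t fix a tool that coerces values, but it tells you which records to double-check after any import or export.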

What can you do about missing data? 

These examples illustrate common ways that data can go missing at the management stage, but they’re not the only ways. What can you do when data goes missing?

Carefully set configurations 

The configuration settings that you use to manage data that you’ve collected can result in events and data points being dropped.

For example, if you incorrectly configure data source collection, you might lose events or parts of events. Even worse, data can go missing if events are recorded incorrectly due to faulty line breaking, truncation, time zone, timestamp recognition, or retention settings. Data can go missing inconsistently if the nodes of your data management system don’t all have identical configurations. 

You might cause some data to go missing intentionally. You might choose to drop INFO level log messages and collect only the ERROR messages in an attempt to track just the signal from the noise of log messages, or you might choose to drop all events older than 3 months from all data sources to save money on storage. These choices, if inadequately communicated or documented, can lead to false assumptions or incorrect analyses being performed on the data. 

If you don’t keep track of configuration changes and updates, a data source format could change before you update the configurations to manage the new format, causing data to get dropped, misrouted, or otherwise go missing from the process. 

If your data analysts communicate their use cases and questions to you, you can tailor data retention settings to those use cases, review the current policies across your datasets, and check that complementary data types have comparable retention. 

You can also identify complementary data sources that might help the analyst answer the questions they want to answer, and plan how and when to bring in those data sources to improve the data analysis. 

You need to manage dataset transformations just as closely as you do the configurations that manage the data. 

Communicate dataset transformations

The steps you take to transform data can also lead to missing data. If you don’t normalize fields, or if your field normalizations are applied inconsistently across the data or across the data analysts, data can appear to be missing even when it is there. If some data has a field name of “http_referrer” while the same field in other data sources is consistently “http_referer”, the “http_referrer” data might appear to be missing to some data analysts when they start the data analysis process. 

Normalization can also help you identify where fields might be missing across similar datasets, such as cases where an ID is present in one type of data but not another, making it difficult to trace a request across multiple services. 
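One lightweight way to apply such normalizations is an alias map applied to every event before analysis. A minimal Python sketch; the field names and the alias table are illustrative:

```python
# Map known aliases of a field to one canonical name before analysis.
FIELD_ALIASES = {
    "http_referrer": "http_referer",  # the double-r spelling seen in some sources
}

def normalize_event(event: dict) -> dict:
    """Rename aliased keys so every dataset exposes the same field names."""
    return {FIELD_ALIASES.get(key, key): value for key, value in event.items()}

event = {"http_referrer": "example.com/page", "status": 200}
normalized = normalize_event(event)
# normalized == {"http_referer": "example.com/page", "status": 200}
```

Keeping the alias table in one shared place means every analyst searches the same field name, instead of each person rediscovering the mismatch.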

If the data analyst doesn’t know or remember which field name exists in one dataset, or whether it’s the same field as in another dataset, data can go missing at the analysis stage—as we saw with my examples of the “rating” field missing from some events and the info field not having a value that I expected, in the data analysis post from this series, Analyze the data: How missing data biases data-driven decisions.

In the same vein, if you use vague field names to describe the data that you’ve collected, or dataset names that ambitiously describe the data that you want to be collecting—instead of what you’re actually collecting—data can go missing. Shortcuts like “future-proofing” dataset names can be misleading to data analysts that want to easily and quickly understand what data they’re working with. 

The data doesn’t go missing immediately, but you’re effectively causing it to go missing when data analysis begins if data analysts can’t correctly decipher what data they’re working with. 

Educate and incorporate data analysis into existing processes

Another way data can go missing is painfully human. If the people that you expect to analyze the data and use it in their decision-making process don’t know how to use the tool that the data is stored in, well, that data goes missing from the process. Tristan Handy discusses this problem in depth in the dbt blog post Analytics engineering for everyone. 

It’s important to not just train people on the tool that the data is stored in, but also make sure that the tool and the data in it are considered as part of the decision-making process. Evangelize what data is available in the tool, and make it easy to interact with the tool and the data. This is a case where a lack of confidence and knowledge can cause data to go missing. 

Data gaps aren’t always caused by a lack of data—they can also be caused by knowledge gaps and tooling gaps if people aren’t confident or trained to use the systems with the data in them. 

Monitor data strategically

Everyone wants to avoid missing data, but you can’t monitor what you can’t define. To monitor data and prevent it from going missing, you must define what data you expect to see: which sources it should come from and at what ingestion volumes. 

If you don’t have a way of defining those expectations, then you can’t alert on what’s missing. Start by identifying what you expect, and then quantify what’s missing based on those expectations. For guidance on how to do this in Splunk, see Duane Waddle’s blog post Proving a Negative, as well as the apps TrackMe or Meta Woot!
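Once expectations are written down, the check itself can be simple. A Python sketch with hypothetical source names; in practice the observed counts would come from your platform’s ingestion metrics:

```python
# Compare a declared expectation against what actually arrived.
expected_sources = {"web_logs", "app_logs", "db_logs"}   # hypothetical names
observed_counts = {"web_logs": 10_432, "app_logs": 0}    # from ingestion metrics

# Sources that sent nothing (or never appeared) are the alert candidates.
missing_sources = sorted(
    source for source in expected_sources
    if observed_counts.get(source, 0) == 0
)
# missing_sources == ["app_logs", "db_logs"]
```

The apps mentioned above automate this kind of expectation-versus-observation check for Splunk at scale.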

Plan changes to the data management system carefully

It’s also crucial to review changes to the configurations that you use to manage data sources, especially changes to data structures or normalization in data sources. Make sure that you consistently deploy these changes as well, to reduce the chance that some sources collect different data in different ways from other sources for the same data. 

Be careful to note downstream and upstream dependencies for your data management system, such as other tools, permissions settings, or network configurations, before making changes, such as an upgrade or a software change.

The simplest way for data to go missing from a data analysis process is when it’s being collected. The next post in the series discusses how data can go missing at the collection stage: Collect the data: How missing data biases data-driven decisions.

Analyze the data: How missing data biases data-driven decisions

This is the fifth post in a series about how missing data biases data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

When you do data analysis, you’re searching and analyzing your data to answer a specific question, so that you can make a decision at the end of the process. Unfortunately, there are many ways that data can go missing while you analyze it. 

In this post, I’ll cover the following:

  • Why disaggregation matters when performing data analysis
  • How data goes missing when analyzing data
  • What to do about missing data when analyzing data
  • How simple mistakes can cause data to go missing 

Declining bird populations in North America

Any aggregation that you do in your data analysis can cause data to go missing. I recently finished listening to the podcast series The Last Archive, by Jill Lepore, and episode 9 talked about declining bird populations in North America. I realized that it’s an excellent example of why disaggregating data is vital to making sure that data doesn’t go missing during the analysis process. 

As reported in Science Magazine, scientists have concluded that since 1970, the continent of North America has lost 3 billion birds—nearly 30% of the surveyed total. But what does that mean for specific bird populations? Are some more affected than others?

It’s easy for me to look around San Francisco and think, well clearly that many birds can’t be missing—I’m still surrounded by pigeons and crows, and ducks and geese are everywhere when I go to the park! 

If you don’t disaggregate your results, there’s no way to determine which bird species are affected and where. You might assume that the 30% decline applied equally across all species and habitats. Thankfully, these scientists did disaggregate their results, so we can identify variations that otherwise would be missing from the analysis. 

Screenshot of bar chart from Science magazine, showing bird decline by habitat in percentage. Wetlands gained more than 10%, all other populations declined. Grasslands declined more than 50%.

Screenshot of a bar chart from Science magazine showing decline by type of bird, relevant statistics duplicated in text.

In the case of this study, we can see that some bird populations—Old world sparrows, Larks, and Starlings—are more affected than other types of birds, while others—Raptors, Ducks and geese, and Vireos—have flourished in the past 50 years.

Because the data is disaggregated, you can uncover data that would otherwise be missing from the analysis—how the different types of birds have actually been differently affected, due to habitat loss in grasslands, or cases where restoration and rehabilitation efforts have been effective, such as the resurgence in the population of raptors. 

Without an understanding of which specific bird populations are affected, and where they live, you can’t take effective action to help bird populations recover, because you’re missing too much data due to an overgeneralized aggregate. Any decisions you made based on the overgeneralized aggregate would be biased and ultimately incorrect. 

In the case of this study, we know that targeted bird population restoration is perhaps most needed in grasslands habitats, like the Midwest where I grew up. 

Unfortunately, the study only covers 76% of all bird species, so my city-dwelling self will just continue to wonder how the bird population has changed since 1970 for pigeons, doves, crows, and others. 
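The core disaggregation lesson can be sketched with toy numbers (invented, not the study’s data): the overall aggregate shows one decline, while the per-habitat view reveals opposite trends.

```python
# Invented counts for two habitats at two points in time.
pop_1970 = {"grassland": 800, "wetland": 200}
pop_2020 = {"grassland": 380, "wetland": 230}

# The single aggregate: a 39% decline overall.
overall_change = (sum(pop_2020.values()) - sum(pop_1970.values())) / sum(pop_1970.values())
# overall_change == -0.39

# Disaggregated: grassland collapsed while wetland grew.
by_habitat = {h: (pop_2020[h] - pop_1970[h]) / pop_1970[h] for h in pop_1970}
# by_habitat == {"grassland": -0.525, "wetland": 0.15}
```

Neither subgroup trend is visible in the single number, which is exactly how decisions based on the aggregate alone go wrong.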

How does data go missing?

An easy way for data to go missing is for incomplete data to be returned when you’re analyzing the data. Many of these examples are Splunk-specific, but the limitations are shared by most data analysis tools. 

Truncated results

In Splunk, search results could be truncated for a variety of reasons. Truncated results have visible error messages if you’re running ad hoc searches, but you might not see error messages if the searches are producing scheduled reports. 

If the indexers where the data is stored are unreachable, or your search times out before completing, the results could be incomplete. If you’re using a subsearch, you might hit the default event limit of 10K events, or the timeout limit of 60 seconds, and have incomplete subsearch results. If you’re using the join search command, you might hit the 50K row limit that Nick Mealy discusses in his brilliant .conf talk, Master Joining Datasets Without Using Join.

If you’re searching an external service from Splunk, for example by using ldapsearch or a custom command to search an external database, you might not get complete results if that service is having problems or if you don’t have access to some data that you’re searching for.

It’s surprisingly easy for data to go missing when you’re correlating datasets. 

Missing and “missing” fields across datasets

If you’re trying to compare datasets and some of the datasets are missing fields, you might accidentally miss data. Without the same field across multiple types of data, it can be difficult to perform valuable or accurate data correlations.

In this containerized cloud-native world, tracing outages across systems can be complex if you don’t have matching identifiers in every dataset that you’re using. As another example, it can be difficult to identify suspicious user session activity without the ability to correlate session identifiers with active users logged into a specific host. 

Sometimes the fields aren’t actually missing; they’re just named differently. Because the data isn’t normalized, or the fields don’t match your naming expectations, they’re effectively missing from your analysis because you can’t find them. 

Missing fields in a dataset that you want to include in your analysis

Sometimes data is missing from specific events within a dataset. For example, I wanted to determine the average rating that I gave songs in my iTunes library. However, for tracks with no rating (a zero-star rating, for my purposes), iTunes XML files omit the rating field from the events entirely. 

Calculating an average with that data missing gives me an average rating of 3 stars across all the tracks in my iTunes library. 

Screenshot of a Splunk search and visualization, showing a single value result of "3 stars". Splunk search is: `itunes` | search track_name=* | stats count as track_count by rating | stats avg(rating) as average_rating | replace "60" WITH "3 stars" IN average_rating

But if I add the zero-star rated tracks back in, representing those as “tracks I have decided aren’t worthy of a rating” rather than “tracks that have not yet been rated”, the average changes to 2.5 stars per track. 

Screenshot of splunk search and visualization showing "2.5 stars". Splunk search is: `itunes` | search track_name=* | fillnull rating value="0" | stats count as track_count by rating | stats avg(rating) as average_rating | replace "50" WITH "2.5 stars" IN average_rating

If I’d used the results that were missing data, I’d have a biased interpretation of my average song rating in my iTunes library. 
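The same bias is easy to reproduce outside Splunk. A Python sketch with invented ratings, mirroring the “missing field versus zero fill” difference above:

```python
# Unrated tracks have no rating field at all (None here), not a stored zero.
# Ratings are out of 100, i.e. 20 points per star.
tracks = [100, 80, 60, 40, 20, None]

# Averaging only the events that have the field inflates the result.
rated = [r for r in tracks if r is not None]
avg_ignoring_missing = sum(rated) / len(rated)        # 60.0, i.e. 3 stars

# fillnull-style: treat the missing field as an explicit zero rating.
filled = [r if r is not None else 0 for r in tracks]
avg_with_zero_fill = sum(filled) / len(filled)        # 50.0, i.e. 2.5 stars
```

Which average is "correct" depends on what a missing rating means, which is exactly why that interpretation has to be a deliberate choice rather than a tool default.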

Mismatched numeric units and timezones

You might also have data that goes missing if you’re improperly analyzing data because you don’t know or misinterpret the units that the data is stored in. 

Keep track of whether or not a field contains a time in minutes or seconds and if your network traffic logs are in bytes or megabytes. Data can vanish from your analysis if you improperly compare dissimilar units!

If you convert a time to a more human-readable format and make incorrect assumptions about the time format, such as the time zone that it’s in, you can cause data to go missing from the proper time period. 

Even without transforming your data, if you’re comparing different datasets that store time data with different time zones, data can go missing. You might think that you’re comparing the same four-hour window from last Monday across several datasets while debugging an issue, when you’re actually comparing different windows of time because each dataset stores timestamps in a different time zone.
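Here’s a Python sketch of that trap, with invented timestamps and offsets: naive timestamps from two zones look like different events, while zone-aware comparison shows they record the same instant.

```python
from datetime import datetime, timezone, timedelta

# Two datasets record the same instant: one in UTC, one in a UTC-7
# local time, both stored as naive timestamps (a common trap).
utc_log = datetime(2020, 8, 3, 14, 0)    # 14:00, recorded in UTC
local_log = datetime(2020, 8, 3, 7, 0)   # 07:00, recorded in UTC-7

naive_match = utc_log == local_log
# False: compared naively, the event seems missing from one dataset

# Attach the correct zones, then compare in a common frame:
utc_aware = utc_log.replace(tzinfo=timezone.utc)
local_aware = local_log.replace(tzinfo=timezone(timedelta(hours=-7)))
real_match = utc_aware == local_aware
# True: they are the same instant
```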

It’s also important to note that how you choose to aggregate your data can hide data that is missing from your dataset, or hide data in a way that causes it to go missing. 

Consider the granularity of your aggregations

When you perform your data analysis, consider the time spans you use to bin or aggregate your data, the time ranges you search across, which data points you include, and which fields you use to disaggregate your data, as well as how each of those choices might affect your results. 

For example, it’s important to keep track of how you aggregate events across time periods in your analysis. If the timespans that you use when aggregating your data don’t align with your use case and the question you’re trying to answer, data can go missing. 

In this example, I was trying to convert an existing search that showed me how much music I was listening to by type per year, into a search that would show me how much music I was listening to by weekday by year. Let’s take a look:

Screenshot of very detailed Splunk search and visualization showing time spent listening by weekday and year

This was my initial attempt at modifying the search, and there’s a lot missing. There are no results at all for Tuesdays in 2019, and the counts for Sundays in 2017, Mondays in 2018, and Thursdays in 2020 were laughably low. What did I do wrong?

It turned out that the time spans I was using in the original search to aggregate my data were too broad for the results I was trying to get. I was starting with a timechart with a span of 1 month, and then trying to get data down to a granularity of weekday. That wasn’t going to work! 

I’d caused data to go missing because I didn’t align the question I was trying to answer with the time span aggregation in my analysis. 

Thankfully, it was a quick fix. I updated the initial time grouping to match my desired granularity of one day, and I was no longer missing data from my results! 

Screenshot showing revised results for time spent listening by weekday and year

This is a case where an overly broad timespan aggregation, combined with a lack of consideration for my use case, caused data to go missing. 
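The underlying mistake, pre-aggregating above the granularity the question needs, can be sketched in Python with invented listening events:

```python
from collections import Counter
from datetime import date

# Hypothetical listening events, one per track played.
events = [date(2019, 1, 7), date(2019, 1, 8),
          date(2019, 1, 14), date(2019, 1, 15)]

# Too broad: a monthly bucket has no weekday anymore, so per-weekday
# counts can't be recovered from it.
monthly = Counter(d.replace(day=1) for d in events)
# Counter({date(2019, 1, 1): 4})

# Aggregate at (or below) the granularity the question needs:
by_weekday = Counter(d.strftime("%A") for d in events)
# Counter({"Monday": 2, "Tuesday": 2})
```

The general rule: pick the finest granularity your question requires first, then roll up, never the other way around.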

You make an error in your analysis process 

How many of you have written a Splunk search with a conditional eval function and totally messed it up? 

Screenshot of a Splunk search. Search: |inputlookup append=t concerthistoryparse.csv | eval show_length=case(info == "festival", "28800", info == "dj set", "14400", info == "concert", "10800")

In this case, I wrote a search to calculate the duration of music listening activities—specifically, calculating an estimated amount of time spent at a music festival, DJ set, or concert—in order to compare how much time I spent listening to music before I was sheltering in place with after. 

I used a conditional “info == concert” to apply an estimated concert length of 3 hours, but no field-value pair of info=concert existed in my data. In fact, concerts had no info field at all. It wasn’t until I’d finished my search, combining 2 other datasets, that I realized a concert I’d attended in March was missing from the results. 

Screenshot of full Splunk search and visualization showing time spent listening to music, going to shows, and listening to livestreams by month. March shows only livestreams and listening activity.

In order to prevent this data from going missing, I had to break my search down into smaller pieces, validating the accuracy of the results at each step against my expectations and knowledge of what the data should reflect. Eventually I realized that I’d caused data to go missing in my analysis process by making the assumption that an info=concert field existed in my data. 

Same screenshot as previous image, but with a corrected search and March now shows one show and livestream and listening activity.

Ironically, this graph is still missing data, because tracks from Spotify failed to scrobble for several time periods in July and August. 
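The shape of that bug, and the fix, can be sketched in Python; the durations mirror the estimates in my search, but the event structure is simplified and hypothetical:

```python
# Estimated durations in seconds, keyed by the "info" field's value.
SHOW_LENGTHS = {"festival": 28800, "dj set": 14400, "concert": 10800}

# Second event is a concert, but it has no "info" field at all.
events = [{"info": "festival"}, {"artist": "Some Band"}]

# The buggy conditional: events that match no case fall through to None
# and silently vanish from any sum or average downstream.
lengths = [SHOW_LENGTHS.get(e.get("info")) for e in events]
# [28800, None]

# The fix: give fall-through events an explicit default (here, the
# 3-hour concert estimate) instead of dropping them.
lengths_fixed = [SHOW_LENGTHS.get(e.get("info"), 10800) for e in events]
# [28800, 10800]
```

Whether the right default is a concert-length estimate or a loud error depends on the data, but either beats a silent None.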

What can you do about missing data?

If you have missing data in your analysis process, what can you do? 

Optimize your searches and environment

If you’re processing large volumes of events, consider expanding limits for data, such as those for subsearches or timeouts for results. The default limits fit most environments and aren’t always something to adjust, but occasionally it makes sense to raise them to suit your use cases, if your architecture can handle it. 

More practically, make sure that you’re running efficient and precise analyses to make the most use of your data analysis environment. If you’re using Splunk, I again recommend Nick Mealy’s excellent .conf talk, Master Joining Datasets Without Using Join for guidance on optimizing your searches, as well as Martin Müller’s talk on the Splunk Job Inspector, featuring Clara Merriman for the .conf20 edition. 

Normalize your data

You can normalize your data to make it easier to use, especially if you’re correlating common fields across many datasets. Use field names that are as accurate and descriptive as possible, but also names that are consistent across datasets. 

It’s a best practice to follow a common information model, but also ensure that you work closely with data administrators and data analysts (if you aren’t performing both roles) to make sure that analyst use cases and expectations align with administrator practices and implementations. 

If you’re using Splunk, start with the Splunk Common Information Model (CIM) or write your own data model to help normalize fields. Keep in mind, too, that you don’t have to use accelerated data models to use a data model when naming fields. 

Enrich your datasets 

You can also explore different ways to enrich your datasets earlier in the analysis process to help make sure data doesn’t go missing across datasets. If you perform data enrichment on the data as it streams in, using a tool like Cribl LogStream or Splunk Data Stream Processor, you can add this missing data back to the events and make it easier to accurately and usefully correlate data across datasets. 

Consistently validate your results

Do what you can to consistently and constantly ensure the validity of your search results. Assess the results against your expectations, but also against themselves and their context. If you don’t validate your results, you might lose data that is hidden inside an aggregate or misinterpreted due to missing context. 

  • Check different time ranges. What seems to be a gap or a pattern in data could just be seasonality or noise. Consider separating weekends from weekdays, or business hours from other hours, depending on the type of data that you’re aggregating. 
  • Use different types of averages, and consider whether the span of values in your results are well-represented by an average. An average of a wide range of values can inaccurately reflect the range of the values, effectively hiding the minimum and the maximum values. 
  • Disaggregate your data to identify where you might be missing more data, and thus have bias in your results. Heather Krause and Shena Ashley discuss disaggregation in an interview for If Data Could Talk by Tableau Software on The Ethics of Visualizing Data on Race, making the point that disaggregated data does not imply causation; it merely describes the data.
  • Compare like with like. Validate time zones, units, and the data in similarly-named fields using a data dictionary or by working with the stewards of the datasets to make sure that you’re using the data correctly. 

To help ensure you’re properly accounting for missing data that you might cause while analyzing data, work with your data administrator and consult with a statistics expert if you’re not sure you’re properly analyzing or aggregating data. To help learn more, I recommend Ben Jones’ book Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations.

The next post in this series covers the different ways that data can go missing at the data management stage—what happens to prepare data for analysis?  Manage the data: How missing data biases data-driven decisions

Visualize the data: How missing data biases data-driven decisions

This is the fourth post in a series about how missing data can bias data-driven decisions. Start at the beginning: What’s missing? Reduce bias by addressing data gaps in your analysis process.

Visualizing data is crucial to communicate the results of a data analysis process. Whether you use a chart, a table, a list of raw data, or a three-dimensional graph that you can interact with in virtual reality—your visualization choice can cause data to go missing. Any time you visualize the results of data analysis, you make intentional decisions about what to visualize and what not to visualize. How can you make sure that data that goes missing at this stage doesn’t bias data-driven decisions?

In this post, I’ll cover the following: 

  • How the usage of your data visualization can cause data to go missing
  • How data goes missing in data visualizations
  • Why accessibility matters for data visualizations
  • How a lack of labels and scale can mislead and misinform 
  • What to do about missing data in data visualizations

How people use the Georgia Department of Public Health COVID-19 daily report 

When creating a data visualization, it’s important to consider how it will be used. For example, the state of Georgia provides a Department of Public Health Daily COVID-19 reporting page to help communicate the relative case rate for each county in the state. 

In the midst of this global pandemic, I’m taking extra precautions before deciding to go hiking or climbing outside. Part of that risk calculation involves checking the relative case rate in my region — are cases going up, down, or staying the same? 

If you wanted to check that case rate in Georgia in July, you might struggle to make an unbiased decision about your safety because of the format of a data visualization in that report.

As Andisheh Nouraee illustrates in a now-deleted Twitter thread, the Georgia Department of Public Health’s COVID-19 Daily Status Report provided a heat map in July that visualized the number of cases across counties in Georgia in such a way that it effectively hid a 49% increase in cases across 15 days.

July 2nd heat map visualization of Georgia counties showing cases per 100K residents, with bins covering ranges from 1-620, 621-1070, 1071 - 1622, 1623 - 2960, with the red bins covering a range from 2961 - 4661.
Image from July 2nd, shared by Andisheh Nouraee, my screenshot of that image
July 17th heat map visualization showing cases per 100K residents in Georgia, with three counties colored red. Bins represent none, 1-949 cases, 950 - 1555 cases, 1556-2336 cases, 2337 - 3768 cases, and the red bins represent 3769-5165 cases.
Image from July 17th, shared by Andisheh Nouraee, my screenshot of that image

You might think that these visualizations aren’t missing data at all—the values of the gradient bins are clearly labeled, and the map clearly shows how many cases exist for every 100K residents.

However, the missing data isn’t in the visualization itself, but in how it’s used. This heat map is provided to help people understand the relative case rate. If I were checking this graph every week or so, I would probably think that the case rate has stayed the same over that time period. 

Instead, because the visualization uses auto-adjusting gradient bins, the red counties in the visualization from July 2nd cover a range from 2961 to 4661, while the same color counties on July 17th now have case rates of 3769–5165 cases per 100K residents. The relative size of the bins is different enough that the bins can’t be compared with each other over time. 

As reported by Keren Landman for the Atlanta Magazine, the Department of Public Health didn’t have direct control over the data on the dashboard anyway, making it harder to make updates or communicate the data more intentionally.

Thankfully, the site now uses a visualization with a consistent gradient scale, rather than auto-adjusting bins.

Screenshot of heat map of counties with cases per 100K residents in Georgia with Union County highlighted showing the confirmed cases from the past 2 weeks and total, and other data points that are irrelevant to this post.

In this example, the combination of the visualization choice and how visitors to the website used that visualization caused data to go missing, possibly resulting in biased decisions about whether it’s safe to go for a hike in the community. 
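The mechanism is easy to demonstrate with invented county values: auto-adjusting bins recompute the “red” cutoff from each date’s own data, so the map can look unchanged even when every rate rises.

```python
def in_top_bin(values, n_bins=5):
    """Auto-adjusting equal-width bins: the top bin tracks each date's max."""
    lo, hi = min(values), max(values)
    cutoff = lo + (hi - lo) * (n_bins - 1) / n_bins
    return [v >= cutoff for v in values]

# Invented cases-per-100K values for four counties on two dates.
july_2 = [500, 1500, 3000, 4661]
july_17 = [600, 1800, 3600, 5165]   # every county got worse

auto_same = in_top_bin(july_2) == in_top_bin(july_17)
# True: the same counties are "red" on both maps, hiding the increase

# A fixed threshold (hypothetical: 3500 per 100K) keeps the scale still,
# so the later date correctly shows more counties in the top bin.
fixed_red_july_2 = sum(v >= 3500 for v in july_2)    # 1 county
fixed_red_july_17 = sum(v >= 3500 for v in july_17)  # 2 counties
```

A consistent gradient scale is the map equivalent of the fixed threshold: the color encoding stops moving with the data, so changes over time stay visible.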

How does data go missing? 

This example from the Georgia Department of Health describes one way that data can go missing, but there are many more. 

Data can go missing from your visualization in a number of ways:

  • If the data exists but is not represented in the visualization, data is missing.
  • If data points and fluctuations are smoothed over, or connected across gaps, data is missing. 
  • If outliers and other values are excluded from the visualization, data is missing.
  • If people can’t see or interact with the visualization, data is missing.
  • If a limited number of results are being visualized, but the label and title of the visualization don’t make that clear, data is missing.

Accessible data visualizations prevent data from going missing 

Accessible visualizations are crucial for avoiding missing data because data can go missing if people can’t see or interact with it. 

Lisa Charlotte Rost wrote an excellent series for the Datawrapper blog about colorblindness and data visualizations that I highly recommend for considering color vision accessibility: How your colorblind and colorweak readers see your colors, What to consider when visualizing data for colorblind readers, and What’s it like to be colorblind. 

You can also go further to consider how to make it easier for folks with low or no vision to interact with your data visualizations. Data visualization artist Mona Chalabi has been experimenting with ways to make her data visualization projects more accessible, including making a tactile version of a data visualization piece, and an interactive piece that uses touch and sound to communicate information, created in collaboration with sound artist Emmy the Great.

At a more basic level, consider how your visualizations look at high zoom levels and how they sound when read aloud by a screen reader. If a visualization is unintelligible at high zoom levels, or if portions aren’t read aloud by a screen reader, data has gone missing from your visualization. Any decisions that someone with low or no vision makes based on data visualizations are biased toward only the visualizations that they can interact with successfully. 

Beyond vision considerations, you want to consider cognitive processing accessibility to prevent missing data. If you overload a visualization with lots of overlays, rely on legends to communicate meaning in your data, or have a lot of text in your visualization, folks with ADHD or dyslexia might struggle to process your visualization. 

Any data that people can’t understand in your visualization is missing data. For more, I recommend the blog post by Sarah L. Fossheim, An intro to designing accessible data visualizations.  

Map with caution and label prodigiously: Beirut explosion map

Data can go missing if you fail to visualize it clearly or correctly. When I found out about the explosion in Beirut, after I made sure that my friends and their family were safe, I wanted to better understand what had happened. 

Screenshot of a map of the Beirut explosion with labels pointing to different overlapping circles, saying "blast site", "widespread destruction", "heavy damage", "damage reported" and "windows blown out up to 15 miles away".
Image shared by Joanna Merson, my screenshot of the image

I haven’t had the privilege of visiting Beirut before, so the maps of the explosion radius weren’t easy for me to personally relate to. Thankfully, people started sharing maps of what the same explosion might look like if it had occurred in New York City or London. 

Screenshot of a google maps visualization with 3 overlapping circles centered over New York. No labels.
Image shared by Joanna Merson, my screenshot of the image

This map attempts to show the scale of the same explosion in New York City, but it’s missing a lot of data. I’m not an expert in map visualizations, but thankfully cartographer Joanna Merson tweeted a correction to this map and unpacked just how much data is missing from this visualization. 

There are no labels on this map, so you don’t know the scale of the circles or what distance each blast radius is supposed to represent. You don’t know where the epicenter of the blast is because it isn’t labeled. Perhaps most egregiously, the map projection used is incorrect. 

Joanna Merson created an alternate visualization, with all the missing data added back in. 

Screenshot of a map visualization by Joanna Merson made on August 5 2020, with a basemap from esri World Imagery using a scale 1:200,000 and the Azimuthal Equidistant projection. Map is centered over New York City with an epicenter of Manhattan labeled and circle radii of 1km, 5km, and 10km clearly labeled.
Image by Joanna Merson, my screenshot of the image.

Her visualization carefully labels the epicenter of the blast, as well as the radii of each of the circles that represent different effects from the blast. She’s also careful to share the map projection that she used—an azimuthal equidistant projection, which preserves distance from the center point, so every point along a circle is genuinely the same distance from the epicenter. It turns out that the projection used by Google Maps is not the right projection for showing distance with an overlaid circle. Without a scale or an accurate projection, data goes missing (and gets added) as unaffected areas are misleadingly shown as affected by the blast. 

How many of you are guilty of making a geospatial visualization, but don’t know anything about map projections and how they might affect your visualization? 
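As a rough back-of-the-envelope illustration of why projections matter (my own sketch, not Joanna Merson’s math): Web Mercator, the projection used by Google Maps, stretches distances by a factor of roughly 1/cos(latitude). A circle sized to represent a given distance at the equator covers a noticeably different ground distance when drawn at New York City’s latitude:

```python
import math

def mercator_distance_scale(latitude_deg: float) -> float:
    """Approximate local distance distortion of Web Mercator at a latitude.

    On a Web Mercator map, a shape with a fixed on-screen size represents
    1/scale as much ground distance as the same shape at the equator.
    """
    return 1.0 / math.cos(math.radians(latitude_deg))

# A circle sized to represent 10 km at the equator, pasted unchanged
# onto New York City (latitude ~40.7° N), covers less actual ground:
scale_nyc = mercator_distance_scale(40.7)
ground_km = 10 / scale_nyc
print(f"distortion factor at NYC: {scale_nyc:.2f}")   # ~1.32
print(f"ground distance covered:  {ground_km:.1f} km")  # ~7.6 km
```

That roughly 30% distortion is exactly the kind of silently missing (and added) data that an unlabeled, wrongly projected blast-radius map produces.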

Joanna Merson further points out in her thread on Twitter that maps like this with an overlaid radius to show distance can be inaccurate because they don’t take into account the effect of topography. Data goes missing because topography isn’t represented or considered by the visualization overlaid on the map. 

It’s impractical to model everything perfectly in every map visualization. Depending on how you’re using the map, this missing data might not actually matter. If you communicate what your visualization is intended to represent when you share it, you can convey the missing data and also assert its irrelevance to your point. All maps, after all, must make decisions about what data to include based on the usage of the map. Your map-based data visualizations are no different! 

It can be easy to cut corners and make a simple visualization to communicate the results of data analysis quickly. It can be tedious to add a scale, a legend, and labels to your visualization. But you must consider how your visualization might be used after you make it—and how it might be misused.

Will a visualization that you create end up in a blog post like this one, or a Twitter thread unpacking your mistakes? 

What can you do about missing data?

To prevent or mitigate missing data in a data visualization, you have several options. Nathan Yau of Flowing Data has a thorough guide, Visualizing Incomplete and Missing Data, that I highly recommend in addition to the points that I’m sharing here. 

Visualize what’s missing

One important way to mitigate missing data in a data visualization is to devise a way to show the data that is there alongside the data that isn’t. Make the gaps apparent and visualize missing data, such as by avoiding connecting the dots between missing values in a line chart.

In cases where your data has gaps, you can add annotations or labels to acknowledge and explain any inconsistencies or perceived gaps. In some cases, data can appear to be missing when the gap is genuine, caused by seasonal fluctuations or other real-world factors. It’s important to understand your data thoroughly enough to tell the difference. 

If you visualize the gaps in your data, you have the opportunity to discuss what might be causing them. Gaps can reflect reality, or flaws in your analysis process. Either way, visualizing the gaps in your data is just as valuable as visualizing the data that you do have. Don’t hide or ignore missing data.
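For line charts specifically, one lightweight way to make gaps visible is to insert explicit missing-value markers so that a plotting library breaks the line instead of silently interpolating across the gap. A minimal sketch (the helper function is hypothetical, not from any particular library):

```python
from datetime import date, timedelta

def with_explicit_gaps(points, expected_step=timedelta(days=1)):
    """Insert None wherever a timestamp is missing, so a plotting library
    breaks the line at the gap instead of connecting the dots across it.

    points: list of (date, value) tuples, sorted by date.
    Returns parallel lists of dates and values, with None marking gaps.
    """
    dates, values = [], []
    for d, v in points:
        if dates and d - dates[-1] > expected_step:
            # Fill the gap with explicit missing markers, one per step.
            cursor = dates[-1] + expected_step
            while cursor < d:
                dates.append(cursor)
                values.append(None)  # most plotting libraries break the line here
                cursor += expected_step
        dates.append(d)
        values.append(v)
    return dates, values

# Two days of data are missing between Jan 2 and Jan 5:
observed = [(date(2020, 1, 1), 12), (date(2020, 1, 2), 15), (date(2020, 1, 5), 9)]
dates, values = with_explicit_gaps(observed)
print(values)  # [12, 15, None, None, 9]
```

Passing these parallel lists to a charting library renders a visible break rather than a misleading straight line through days that have no data.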

Carefully consider time spans

Be intentional about the span that you choose for time-based visualizations. You can unintentionally hide fluctuations in the data if you choose an overly broad span for your visualization, causing data to go missing by flattening it. 

If you choose an overly short time span, however, the meaning of the data and what you’re trying to communicate can go missing in the noise of the individual data points. Consider what you’re trying to communicate with the data visualization, and choose a time span accordingly.
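A tiny, contrived example of the flattening effect (made-up numbers, purely to illustrate): a one-hour spike that dominates an hourly chart disappears almost entirely when the same data is averaged to a daily grain.

```python
# Hourly measurements with a dramatic one-hour spike at hour 23:
hourly = [10] * 23 + [100]

# The same day summarized at a daily grain:
daily_mean = sum(hourly) / len(hourly)

print(max(hourly))  # 100  -- the spike is obvious at an hourly span
print(daily_mean)   # 13.75 -- the daily average flattens it away
```

Neither grain is wrong on its own; the right span depends on whether the spike or the overall trend is what your audience needs to see.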

Write clearly 

Another way to address missing data is to write good labels and titles for visualizations. It’s crucial to explain exactly what is present in a visualization—an important component of communicating results. If you’re intentional and precise about your labels and titles, you can prevent data from going missing. 

If the data analysis contains the results for the top 10 cities by population density, but your title only says “Top Cities”, data has gone missing from your visualization!

You can test out the usefulness of your labels and titles by considering the following: If someone screenshots your visualization and puts it in a different presentation, or tweets it without the additional context that might be in the full report, how much data would be missing from the visualization? How completely does the visualization communicate the results of data analysis if it’s viewed out of context?

Validate your scale

Make sure any visualization that you create has a scale, and that the scale is actually displayed. It’s easy for data to go missing if the scale of the data itself is missing. 

Also validate that the scale on your visualization is accurate and relevant. If you’re visualizing percentages, make sure the scale goes from 0 to 100. If you’re visualizing logarithmic data, make sure your scale reflects that correctly. 
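If you build charts programmatically, you can even automate this check. A minimal sketch (a hypothetical helper of my own, not a real library function), assuming percentage data:

```python
def validate_percentage_axis(values, axis_min, axis_max):
    """Sanity-check a chart axis for percentage data.

    Returns a list of warnings; an empty list means the scale looks fine.
    """
    warnings = []
    if axis_min > 0 or axis_max < 100:
        warnings.append(
            f"axis [{axis_min}, {axis_max}] does not span 0-100; "
            "relative differences will look exaggerated"
        )
    out_of_range = [v for v in values if not 0 <= v <= 100]
    if out_of_range:
        warnings.append(f"values outside 0-100: {out_of_range}")
    return warnings

# A truncated axis makes a 2-point difference look dramatic:
print(validate_percentage_axis([96, 98], axis_min=95, axis_max=100))
```

Running a check like this before publishing catches the classic truncated-axis mistake, where a small difference is visually inflated because the scale itself went missing.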

Consider the use

Consider how your visualization will be used, and design your visualizations accordingly. What decisions are people trying to make based on your visualization? What questions are you trying to answer when you make it? 

Automatically-adjusting gradient bins in a heat map can be an excellent design choice, but as we saw in Georgia, they don’t make sense for communicating relative change over time. 

Choose the right chart for the data

It’s also important to choose the right chart to visualize your data. I’m not a visualization expert, so check out this data tutorial from Chartio, How to Choose the Right Data Visualization, as well as these tutorials of different chart types on Flowing Data: Chart Types.  

If you’re visualizing multiple aggregations in one visualization in the Splunk platform, I recommend the Trellis layout, which splits the aggregates into separate charts so that you can compare across them. 

Always try various types of visualizations for your data to determine which one shows the results of your analysis in the clearest way.

One of the best ways to make sure your data visualization isn’t missing data is to make sure that the data analysis is sound. The next post in this series addresses how data can go missing while you analyze it: Analyze the data: How missing data biases data-driven decisions.