Just Add Data: Using data to prioritize your documentation

This is a blog post adaptation of a talk I gave at Write the Docs Portland on May 21, 2019. The talk was livestreamed and recorded, and you can view the recording on YouTube: Just Add Data: Make it easier to prioritize your documentation - Sarah Moir

Prioritizing documentation is hard. How do you decide what to work on if there isn’t a deadline looming? How do you decide what not to work on when your list of work just keeps growing? How do you identify what new content you might want to add to your documentation?

By adding data to the process, it’s possible to prioritize your documentation tasks with confidence!

Prioritizing without data #

Prioritizing a backlog without data can involve asking yourself questions like: What will take the least amount of time? What did someone most recently request? If I’m doing this, I might ask my product manager what to work on, or do whatever task seems easiest at the time. I might even focus on whichever task I can complete without talking to other people, because I’m tired.

Based on the answers to those questions, I’ll end up with a prioritized backlog, but I’ll lack confidence that what I’ve chosen to work on will actually bring the most value to customers and the documentation. Choosing what not to work on is even harder: it’s a challenge to keep ignoring an item in the backlog that doesn’t fit with what I think I need to be working on, without some sort of “proof” that it’s okay to ignore. To make this process easier, I add data.

Why prioritize with data? #

Using data to prioritize a documentation backlog can help give you more confidence in your decisions and help you justify why you’re not working on something. It can challenge your assumptions about what you should be working on, or validate them. Adding data can help improve your overall understanding of how customers are using your product and the documentation, leading to benefits beyond the backlog.

Data types for prioritization #

What kinds of data am I talking about? All kinds of data! If you skim the following list, you’ll notice that this data goes beyond quantitative sources. When I talk about data, I’m including all kinds of information: qualitative comments, usage metrics, metadata, website access logs, survey results, database records; all of these and more fit in with my definition of data. Here’s the full list:

Some of these data types are more relevant to different types of organizations and documentation installations. For example, open source projects might have more useful issue tags, or organizations that use DITA will have easier access to topic type information.

This list of data types demonstrates the range of information that can help you prioritize documentation, but you don’t need to do large-scale collection or implementation work to get valuable data worth incorporating into your prioritization process.

I’ll cover a couple of these data types in more detail here, but I talk about all of them in another post: Detailed data types you can use for documentation prioritization.

Product usage data #

You can use usage data for products (also called telemetry) to find out where people are spending their time. What features or functionality are they using? Even if they’ve purchased or installed the product, are they actually using it?

Some examples of product usage data include:

In addition to data about how people are interacting with the product, you can also gather product usage data without actual introspection into how people are using it. If you have information about how many people have downloaded a product or are logging in to a service:

I mostly talk about using data to prioritize the more ambiguous parts of a backlog that aren’t tied to a release, but product usage data in particular can also help you prioritize release-focused documentation. If your product is in beta and you want more data to help you prioritize your overall documentation backlog, you can use product usage data to understand where people are spending their time, then draw conclusions about what to spend more or less time on, or what level of detail to include in the documentation, to achieve your documentation goals for the release.
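
For a concrete starting point, here’s a minimal sketch in Python of what this kind of analysis might look like, assuming you can export telemetry events to a CSV file. The file name and the `user_id` and `feature` columns are hypothetical; your schema and export path will differ.

```python
import pandas as pd

# Hypothetical telemetry export: one row per event, with a user
# identifier and the feature they used.
events = pd.read_csv("telemetry_events.csv")  # columns: user_id, feature

# How many distinct users touched each feature?
feature_reach = (
    events.groupby("feature")["user_id"]
    .nunique()
    .sort_values(ascending=False)
)
print(feature_reach.head(10))

# Of the people who downloaded or installed the product, how many show
# any activity at all? (total_users might come from download counts or
# license records -- a hypothetical figure from another data source.)
total_users = 5000
active_users = events["user_id"].nunique()
print(f"{active_users / total_users:.1%} of known users show activity")
```

Features near the top of that list are candidates for more detailed documentation; features near the bottom might need less coverage, or might signal a discoverability problem worth investigating.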

Site metrics #

Site metrics like page views, session data, HTTP referer data, and link clicks can help you understand where people are coming to your docs from, how long they’re staying on the page, how many readers there are, and what they’re doing after they get to a topic. Here are some example site metrics:

You can also use this data to better understand how people interact with your documentation, like whether they’re using a version switcher on your page or expanding and collapsing information hidden on the page.

You can also split this data by IP address to find groups of topics that specific readers cluster around, giving you a better sense of how people use the documentation.
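
As a sketch of that last idea, assuming you can export page views with the visitor’s IP address and the topic visited (both column names here are hypothetical), you could look for topic combinations that recur across readers. Keep in mind that an IP address is only a rough proxy for a reader.

```python
import pandas as pd

# Hypothetical access-log export: one row per page view.
views = pd.read_csv("page_views.csv")  # columns: ip, topic

# Collect the set of topics each IP address visited.
topics_per_ip = views.groupby("ip")["topic"].apply(
    lambda topics: tuple(sorted(set(topics)))
)

# Count how often each combination of topics recurs, to surface
# clusters of topics that the same readers tend to visit together.
topic_clusters = topics_per_ip.value_counts()
print(topic_clusters.head(10))
```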

Identify questions based on your backlog #

The process of adding data to your documentation prioritization strategy is all about making do with what you have to answer what you want to know. What you want to know depends on your backlog.

Data analysis works best when it’s focused on a goal. You don’t want to collect a lot of data and then just stare at it, or get stressed about all the “insights” you could be gathering when you’re not really sure what to do with the information. If you identify the questions you want to answer in advance, you can focus your data collection and analysis where it’s most valuable.

Some example questions that you might identify based on your task list:

Tie questions to data types #

After you’ve identified questions relevant to your task list, you can tie those questions to data types that can help you answer the questions.

For example, the question: What are people looking for and not finding?

To answer this, you can look at where people are searching for information: the keywords they’re typing into search engines, common questions posted on forums, or the topics of support cases filed by customers.

For example, I looked at some data and identified specific search terms people were using on the documentation site that were routing customers to a company-managed forum site. I can use that data to identify cases where people are looking for documentation about something but aren’t finding answers in the documentation.
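
If your search tool can export queries alongside the result a searcher clicked, a short script can surface those gaps. This sketch assumes a CSV export with hypothetical `query` and `clicked_url` columns and a hypothetical forum domain; adjust for whatever your search analytics actually provide.

```python
import pandas as pd

# Hypothetical site-search export: one row per search.
searches = pd.read_csv("site_search.csv")  # columns: query, clicked_url

# Flag searches that ended on the community forum instead of the docs.
ended_on_forum = searches["clicked_url"].str.contains(
    "forums.example.com", regex=False, na=False
)

# The most common queries that route readers away from the docs are
# candidates for new or improved documentation topics.
gaps = searches.loc[ended_on_forum, "query"].str.lower().value_counts()
print(gaps.head(20))
```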

Another example question: What do people want more help with?

This could be answered by looking at the topics of support cases again, but also at the types of questions asked in training courses and the unanswered questions on forums.

As a final example: What market groups are we targeting that don’t see their use cases represented?

To answer this, you could look at data about sales leads, questions from the field that contain specific use cases for various market verticals, and questions asked in training courses.

Find questions from data #

If you don’t have much of a task list to work with, or if you aren’t able to get access to data that can help you answer your questions, you can still make use of the data that is available to you and draw valuable insights from it.

You can identify interest in content that you weren’t aware of, and make plans to write new content, or modify existing content, to address that interest. Maybe there are a bunch of forum threads about how to do something, but nothing authoritative in the documentation. That information hasn’t made it to the docs writers through other channels, but because you’re looking at the available data, you can see that it’s important.

Even if you have no data specifically relevant to the documentation or customer questions, you can still find ways to identify documentation work to add to a task list. You could create datasets by performing text analysis on all or some of your documentation topics to identify complexity issues or topics that don’t adhere to a style guide. You could use customer satisfaction surveys to identify places where documentation architecture or linking strategies could be improved.
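
As one crude example of that kind of text analysis, this sketch flags topics with a high average sentence length, a very rough proxy for complexity. It assumes Markdown source files under a hypothetical `docs/` directory, and the threshold of 25 words per sentence is arbitrary.

```python
import re
from pathlib import Path

# Flag docs topics whose sentences run long, as a rough complexity signal.
for path in sorted(Path("docs").glob("**/*.md")):
    text = path.read_text(encoding="utf-8")
    sentences = [s for s in re.split(r"[.!?]+\s", text) if s.strip()]
    if not sentences:
        continue
    avg_words = len(text.split()) / len(sentences)
    if avg_words > 25:  # arbitrary threshold; tune for your style guide
        print(f"{path}: {avg_words:.1f} words per sentence on average")
```

A real style audit would look at more signals (passive voice, terminology, readability scores), but even a simple check like this can seed a task list.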

Working with the data #

Now you hopefully have a better understanding of the different types of data available to you, and how you can identify valuable data sources based on the questions you want to answer. But how much data do you need to collect? And how do you get the data? Most importantly, how do you analyze it to answer the questions you want to answer?

How much data? #

How much data do you need to collect? You don’t need to collect data forever. You don’t need ALL the data. You just need enough data to point you in a direction and reduce uncertainty.

You can use a small sample of users, or a small sample of time, so long as it helps you answer your question and reduce uncertainty about what the answer could be. Collecting larger amounts of data doesn’t mean that you reduce uncertainty by an equally large degree. The amount of data you collect doesn’t correlate directly to what you’re able to learn from it. However, if the question you’re trying to answer with data concerns all the documentation users over a long period of time, you will be collecting more data than if you just want to know what a specific subset of readers found interesting on a Friday afternoon.

Try for representative samples that are relevant for the questions you’re trying to answer. If you can’t get representative data, try for a random sample. If you can’t get representative or random samples, acknowledge the bias that is inherent in the data you’re using. Add context to the data wherever possible, especially about who the data represents and why the data is still valuable if it isn’t representative.
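
If your data lives in something like a CSV export, drawing a random sample is a one-liner with pandas. The file name here is hypothetical, and the fixed `random_state` only makes the sample reproducible; it doesn’t make it representative.

```python
import pandas as pd

# A quarter of page views might be a large dataset; a random sample is
# often enough to answer a directional question.
views = pd.read_csv("page_views.csv")
sample = views.sample(n=1000, random_state=42)

# Record the context alongside the result (period, population, how the
# sample was drawn) so later readers can judge its bias.
print(f"Sampled {len(sample)} of {len(views)} page views from Q2 logs")
```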

You might find that collecting a small amount of data leaves you with more questions than answers, and that’s okay too. It’s an opportunity to continue exploring and learning more about your customers and your documentation tasks. But how do you even get any data at all?

How do you get the data? #

You’ll either be collecting your own data, or asking others for the data you need.

If it’s data about the documentation site or its content, you might own that data yourself, and already have access to it. If it’s other types of data, like sales leads or user research data, it’s time to talk to the departments or people that manage those areas.

The teams managing different datasets will vary at your organization, and in many cases might even be you. They may be reluctant to share data. With that in mind, remember that when you collect data, you don’t need persistent access to all the data you want. Focus on getting access to some data that’s useful for answering your questions. You can then use that data to make your work more efficient and informed, and hopefully communicate that value to get more access to data in the future if you want it.

What to use for data analysis? #

What do you use to analyze that data after you get it? How do you transform data into a report of useful information?

Some tools might already have analytics and reporting built in, like Google Analytics. That can certainly make it easier to analyze the data!

For other types of data that you need to analyze yourself, use the tools available to you. Think about what you already know how to use, or have access to:

You don’t have to spend a long time learning a new tool to analyze data for these purposes. If you continue incorporating data analysis into your work, learning a new tool might make sense, but it isn’t necessary to get started.

Tools aren’t magic #

It’s also important to note that tools aren’t magic. Some degree of data analysis will involve manually collecting, categorizing, or cleaning the data. If your organization doesn’t have strict topic types, you might need to perform manual topic-typing. If you want to analyze some information but the data isn’t in a machine-readable format, you might have to sit at your desk copying and pasting for hours.

Depending on your skills, the current state of the data that you want to analyze, and the tools available to you, the amount of time it takes to analyze data and get results can vary widely. I have spent 3 days manually processing data in Excel, and I’ve spent 2 hours creating searches in mostly-clean datasets in Splunk to get answers to various questions. Keep that in mind when you’re analyzing data.

How to perform data analysis #

When you analyze data, what are you actually looking at?

Top, rare, outlying values #

Find out which values are most common and which are least common. You can establish these by counting the instances of each value.

Also look for values that differ from the others by a large margin. You can use the standard deviation to identify these outliers.
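
Here’s a minimal sketch of both steps using only the Python standard library. The visit list is toy data standing in for whatever per-view records you actually have; with a dataset this small the outlier check finds nothing, so tune the threshold against real data.

```python
from collections import Counter
import statistics

# Toy data: one entry per page view, named by topic.
visits = ["install", "upgrade", "install", "cron-syntax", "install", "upgrade"]

counts = Counter(visits)
print(counts.most_common(3))      # most common values
print(counts.most_common()[-3:])  # least common values

# Flag topics whose view counts sit far from the mean, using the
# standard deviation as a rough outlier threshold.
values = list(counts.values())
mean = statistics.mean(values)
stdev = statistics.stdev(values)
outliers = {t: c for t, c in counts.items() if abs(c - mean) > 2 * stdev}
print(outliers)
```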

Patterns and clusters across data #

You can also look for patterns and clusters in your data.

If you’re working with qualitative data, you might need to categorize, or code, the data so that you can sort it and look for patterns in the results. You can identify these patterns by counting instances of categories, or by looking at clusters of behavior. An example of a cluster of behavior: if you look at documentation topic visits over time, you might identify a spike in visits at a particular time.
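
Coding qualitative data is usually manual work, but once it’s coded, the counting is easy. In this sketch the forum questions and category labels are invented stand-ins for whatever you’ve coded by hand.

```python
from collections import Counter

# Hypothetical forum questions, each manually coded with a category.
coded_questions = [
    ("How do I rotate my API keys?", "security"),
    ("Why does the upgrade fail on step 3?", "upgrade"),
    ("Can I schedule reports with cron?", "scheduling"),
    ("Upgrade hangs at 90%", "upgrade"),
]

# Counting instances of each category surfaces the patterns.
category_counts = Counter(category for _, category in coded_questions)
print(category_counts.most_common())
```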

Split by different features #

You also want to segment data by different features; that is, you can better understand the most common values if you split them by other types of information. For example, you can look at the most commonly visited topics in your documentation set over the last three months, or you can look at those same topics on a week-to-week basis. That additional split can help you understand how those values change over time. If you identify a spike in a particular topic or category of topics, you can then interpret the data: maybe a new product release led to a spike of interest in the release notes topic that wasn’t easy to identify until you split the results by week. This is also a good opportunity to point out to a product team that people really do read your documentation!

That’s an example of splitting by time, but you can split by any other field available in your data. With the same data type, looking at the most common topics by product, by IP address, or by other factors can lead to valuable insights, as the sketch below shows.
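
This sketch shows the weekly split with pandas, assuming a page-view export with hypothetical `timestamp` and `topic` columns.

```python
import pandas as pd

# Hypothetical page-view export: one row per view.
views = pd.read_csv("page_views.csv", parse_dates=["timestamp"])

# Top topics over the whole period...
print(views["topic"].value_counts().head(5))

# ...and the same counts split week by week, which can reveal a spike
# (say, release notes around a product release) that the overall
# numbers hide.
weekly = views.groupby([pd.Grouper(key="timestamp", freq="W"), "topic"]).size()
print(weekly.sort_values(ascending=False).head(10))
```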

Combine data types #

You can combine different types of data to understand approximately how many people are using the product versus how many of them are using the documentation. Comparing sales leads, product usage data, and existing page views could help you approximate the number of potential and existing customers alongside the number of distinct documentation readers.

Make sure that when you combine data across datasets, you keep track of units and time ranges, and compare like data with like data. For example, be careful not to mix data that refers to potential customers with data that refers to existing customers; if you lose that context, the results can be misleading.
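
As a sketch, suppose you have two hypothetical monthly summaries, one of active product users and one of distinct documentation readers. Keeping both figures in the same unit (people per month) and joining on the same time range keeps the comparison honest.

```python
import pandas as pd

# Hypothetical monthly summaries from two different data owners.
product = pd.read_csv("product_usage.csv")  # columns: month, active_users
docs = pd.read_csv("docs_readers.csv")      # columns: month, distinct_readers

# Inner join on month so both figures cover the same time range.
combined = product.merge(docs, on="month", how="inner")
combined["readers_per_user"] = (
    combined["distinct_readers"] / combined["active_users"]
)
print(combined)
```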

Interpreting results #

When you interpret the results of your data analysis, make sure that you are adding context to the data. Especially when dealing with outlier data, but even when reviewing data like rarely-viewed or frequently-viewed topics, keep in mind additional context that could explain results.

Add context from expertise #

Use your expertise and knowledge of the documentation to add context. For example, topics concerning a specific functionality are likely to be more popular at a specific time if that functionality was recently changed.

Pursue alternate explanations #

Whenever you’re interpreting data, you want to make sure that you’re gut-checking it against what you already know. So if a relatively mundane topic has wildly out-of-the-ordinary page views, there are likely alternate explanations for that interest. Maybe your topic ended up being a great resource about cron syntax in general, even for people that don’t use your product.

Draw realistic conclusions #

Draw realistic conclusions based on the data available to you. You might not be able to get access to or combine specific datasets due to privacy concerns. If you carefully identify what problems you’re trying to solve, and select only the data sources that can help you solve those problems, you can reduce the potential that you’ll introduce bias into your data analysis, and improve the conclusions that you’re able to draw.

Don’t trust data blindly #

Don’t trust the data blindly. When you review data that seems out of the ordinary, or that looks like an outlier, examine the different reasons why the data could look that way. Who does the data represent? What does it represent? Make sure that you’re interpreting data in context, so that you understand exactly what it represents. It can be tempting to ignore data that doesn’t match your biases or expectations, but that’s exactly the data worth a closer look.

Above all, remember to use data to complement your research and writing, and validate or challenge assumptions about your audience.

Your turn to add data #

  1. Identify the questions you’re trying to answer
  2. Use the data available to you
  3. Use the tools available to you
  4. Analyze and interpret the data
  5. Take action and prioritize accordingly

Additional resources #