Detailed data types you can use for documentation prioritization

Data analysis is a valuable way to learn more about what documentation tasks to prioritize above others. My post (and talk), Just Add Data, presented at Write the Docs Portland in 2019, talk about this broadly. In this post I want to cover in detail a number of different data types that can lead to valuable insights for prioritization.

This list of data types is long, but I promise each one contains value for a technical writer. These types of data might come from your own collection, a user research organization, the business development department, marketing organization, or product management organization:

  • User research reports
  • Support cases
  • Forum threads and questions
  • Product usage metrics
  • Search strings
  • Tags on bugs or issues
  • Education/training course content and questions
  • Customer satisfaction survey

More documentation-specific data types:

  • Documentation feedback
  • Site metrics
  • Text analysis metrics
  • Download/last accessed numbers
  • Topic type metrics
  • Topic metadata
  • Contribution data
  • Social media analytics

Many of these data types are best used in combination with others.

User research reports

User research reports can contain a lot of valuable data that you can use for documentation. 

  • Types of customers being interviewed
  • Customer use cases and problems
  • Types of studies being performed

This can give you insight into both what the company finds valuable to study (so some insight into internal priorities) but also direct customer feedback about things that are confusing or the ways that they use the product. The types of customers that are interviewed can provide valuable audience or persona-targeting information, allowing you to better calibrate the information in your documentation. See How to use data in user research when you have no web analytics on the Gov.UK site for more details about what you can do with user research data.

Support cases

Support cases can help you better understand customer problems. Specific metrics include:

  • Number of cases
  • Frequency of cases
  • Categories of questions
  • Customer environments and licenses

With these you can compile metrics about specific customer problems, the frequency of problems, and the types of customers and customer environments that are encountering specific problems, allowing you to better understand target customers, or customers that might be using your documentation more than others. Support cases are also rich data for common customer problems, providing a good way to gather new use cases and subjects for topics. 

Forum threads and questions

These can be internal forums (like Splunk Answers for Splunk) or external ones, like Reddit or StackOverflow.

  • Common questions
  • Common categories
  • Frequently unanswered questions
  • Post titles

If you’re trying to understand what people are struggling with, or get a better sense of how people are using specific functionality, forum threads can help you understand. The types of questions that people ask and how they phrase them can also help make it clear what kinds of configuration combinations might make specific functions harder for customers. Based on the question types and frequencies that you see, you might be able to fine-tune existing documentation to make it more user-centric and easily findable, or supplement content with additional specific examples. 

Product usage metrics

Some examples of product usage metrics are as follows:

  • Time in product
  • Intra-product clicks
  • Types of data ingested
  • Types of content created
  • Amount of content created

Even if you don’t have specific usage data introspecting the product, you can gather metrics about how people are interacting with the purchase and activation process, and extrapolate accordingly.

  • Number of downloads and installs
  • License activations and types
  • Daily and monthly active users

You can use this type of data to better understand how people are spending their time in your product, and what features or functionality they’re using. Even if a customer has purchased or installed the product, it’s even more valuable to find out if they’re actually using it, and if so, how.

If your product is only in beta, and you want more data to help you prioritize an overall documentation backlog, such as topics that are tied to a specific release, you can use some product usage data to understand where people are spending more of their time, and draw conclusions about what to prioritize based on that.

Maybe the under-utilized features could use more documentation, or more targeted documentation. Maybe the features themselves need work.  Be careful not to draw overly-simplistic conclusions about the data that you see from product usage metrics. Keep context in mind at all times. 

Search strings

You can gather search strings from HTTP referer data from web searches performed on external search sites such as Google or DuckDuckGo, or from internal search services. It’s pretty unlikely that you’ll be able to gather search strings from external sites given the widespread implementation of HTTPS, but internal search services can be vital and valuable data sources for this.

Look at specific search strings to find out what people are looking for, and what people are searching that’s landing them on specific documentation pages. Maybe they’re searching for something and landing on the wrong page, and you can update your topic titles to help.

JIRA or issue data

You can use metrics from your issue tracking services to better understand product quality, as well as customer confusion.

  • Number of issues/bugs
  • Categories/tags/components of issues/bugs
  • Frequency of different types of issues being created/closed

Issue tags or bug components can help you identify categories of the product where there are lots of problems or perhaps customer confusion. This is especially useful data if you’re an open source product and want to get a good understanding of where there are issues that might need more decision support or guidance in the documentation. 

Training courses

If you have an education department, or produce training courses about your product, these are quite useful to gather data from. Some examples of data you might find useful:

  • Questions asked by customers
  • Questions asked by course developers
  • Use cases covered by content in courses
  • Enrollment in courses
  • Categories of courses offered

Also useful to correlate this with other data to help identify verticals of customers interested in different topics. Because education and training courses cover more hands-on material, it can be an excellent source of use case examples, as well as occasions where decision support and guidance is needed. 

Customer surveys

Customer surveys especially cover surveys like satisfaction surveys and sentiment analysis surveys. By reviewing the qualitative statements and types of questions asked in the surveys, you can gain valuable insights and information like:

  • What do people think about the product?
  • What do people want more help with?
  • How do people think about the product?
  • How do people feel about the product?
  • What does the company want to know from customers? 
  • What are the company priorities?

This can also help you think about how the documentation you write has a real effect on peoples’ interactions with the product, and can shift sentiment in one way or another.

Documentation feedback

Direct feedback on your documentation is a vital source of data if you can get it. 

  • Qualitative comments about the documentation
  • Usefulness votes (yes/no)
  • Ratings

Even if you don’t have a direct feedback mechanism on your website, you can collect documentation feedback from internal and external customers by paying attention in conversations with people and even asking them directly if they have any documentation feedback. Qualitative comments and direct feedback can be vital for making improvements to specific areas. 

Site metrics

If your documentation is on a website, you can use web access logs to gather important site metrics, such as the following:

  • Page views
  • Session data like time on page
  • Referer data
  • Link clicks
  • Button clicks
  • Bounce rate
  • Client IP

Site metrics like page views, session data, referer data, and link clicks can help you understand where people are coming to your docs from, how long they are staying on the page, how many readers there are, and where they’re going after they get to a topic. You can also use this data to understand better how people interact with your documentation. Are readers using a version switcher on your page? Are they expanding or collapsing information sections on the page to learn more? Maybe readers are using a table of contents to skip to specific parts of specific topics.  

You can split this data by IP address to understand groups of topics that specific users are clustering around, to better understand how people use the documentation.

Text analysis metrics

Data about the actual text on your documentation site is also useful to help understand the complexity of the documentation on your site.

  • Flesch-Kincaid readability score
  • Inclusivity level
  • Length of sentences and headers
  • Style linter

You can assess the readability or usability of the documentation, or even the grade level score for the content to understand how consistent your documentation is. Identify the length of sentences and headers to see if they match best practices in the industry for writing on the web. You can even scan content against a style linter to identify inconsistencies of documentation topics against a style guide.

Download metrics

If you don’t have site metrics for your documentation site, because the documentation is published only via PDF or another medium, you can still use metrics from that. 

  • Download numbers 
  • Download dates and times
  • Download categories and types

You can use these metrics to gather interest about what people want to be reading offline, or how frequently people are accessing your documentation. You can also correlate this data with product usage data and release cycles to determine how frequently people access the documentation compared with release dates, and the number of people accessing the documentation compared with the number of people using a product or service.

Topic type metrics

If you use strict topic typing at your documentation organization, you can use topic type metrics as an additional metadata layer for documentation data analysis. Even if you don’t, you can manually categorize organize your documentation by type to gather this data.

  • What are the topic types?
  • How many topic types are there?
  • How many topics are there of each type?

Understanding topic types can help you understand how reader interaction patterns can vary for your documentation by type, or whether your developer documentation has predominantly different types of documentation compared with your user documentation, and better understand what types of documentation are written for which audiences.

Topic metadata

Metadata about documentation topics is also incredibly valuable as a correlation data source. You can correlate topic metadata like the following information:

  • What are the titles?
  • Average length of a topic?
  • Last updated and creation dates
  • Versions that different topics apply for

You can correlate it with site metrics, to see if longer topics are viewed less-frequently than shorter topics, or identify outliers in those data points. You can also manually analyze the topic titles to identify if there are patterns (good or bad) that exist.

Contribution data

If you have information about who is writing documentation, and when, you can use these types of data:

  • Last updated dates
  • Authors/contributors
  • Amount of information added or removed

Contribution data can tell you how frequently specific topics were updated to add new information, and by whom, and how much information was added or removed. You can identify frequency patterns, clusters over time, as well as consistent contributors.

It’s useful to split this data by other features, or correlate it with other metrics, especially site metrics. You can then identify things like:

  • Last updated dates by topic
  • Last updated dates by product
  • Last updated dates over time

to see if there are correlations between updates and page views. Perhaps more frequently updated content is viewed more often.

Social media analytics

  • Social media referers
  • Link clicks from social media sites

If you publicize your documentation using social media, you can track the interest in the documentation from those sites. If you’re curious about social media referers leading people to your documentation, and see whether or not people are getting to your documentation in that way. Maybe your support team is responding to people on twitter with links to your documentation, and you want to better understand how frequently that happens and how frequently people click through those links to the documentation…

You can also identify whether or not, and how, people are sharing your documentation on social media by using data crawled or retrieved from those sites’ APIs, and looking for instances of links to your documentation. This can help you get a better sense of how people are using your documentation, how they’re talking about it, how they feel about it, and whether or not you have an organic community out there on the web sharing your documentation. 

Beyond documentation data

I hope that this detail has given you a better understanding of different types of data, beyond documentation data, that are available to you as a technical writer to draw valuable conclusions from. By analyzing these types of data, you are prepared for prioritizing your documentation task list, but also better able to understand the customers of your product and documentation. Even if only some of these are available to you, I hope they are useful. Be sure to read Just Add Data: Using data to prioritize your documentation for the full explanation of how to use data in this way. 

Just Add Data: Using data to prioritize your documentation

This is a blog post adaptation of a talk I gave at Write the Docs Portland on May 21, 2019. The talk was livestreamed and recorded, and you can view the recording on YouTube: Just Add Data: Make it easier to prioritize your documentation – Sarah Moir

Prioritizing documentation is hard. How do you decide what to work on if there isn’t a deadline looming? How do you decide what not to work on when your list of work just keeps growing? How do you identify what new content you might want to add to your documentation?

By adding data to the process, it’s possible to prioritize your documentation tasks with confidence!

Prioritizing without data

Prioritizing a backlog without data can involve asking yourself some questions, like what will take the least amount of time? Or, what did someone most recently request? If I’m doing this, I might ask my product manager what to work on, or do whatever task seems easiest at the time. I might even focus on whichever task I can complete without talking to other people, because I’m tired. 

Based on the answers to those questions, I’ll end up with a prioritized backlog, but lack confidence that what I’ve chosen to work on will actually bring the most value to customers and the documentation. Especially if I’m choosing not to do work, it can be a challenge to keep ignoring an item in the backlog because it doesn’t fit with what I think I need to be working on, especially without some sort of “proof” that it’s okay to ignore. To make this process easier, I add data.

Why prioritize with data?

Using data to prioritize a documentation backlog can help give you more confidence in your decisions and help you justify why you’re not working on something. It can challenge your assumptions about what you should be working on, or validate them. Adding data can help improve your overall understanding of how customers are using your product and the documentation, leading to benefits beyond the backlog.

Data types for prioritization

What kinds of data am I talking about? All kinds of data! If you skim the following list, you’ll notice that this data goes beyond quantitative sources. When I talk about data, I’m including all kinds of information: qualitative comments, usage metrics, metadata, website access logs, survey results, database records, all of these and more fit in with my definition of data. Here’s the full list

  • User research reports
  • Support cases
  • Forum threads and questions
  • Product usage metrics
  • Search strings
  • Tags on bugs or issues
  • Education/training course content and questions
  • Customer satisfaction survey
  • Documentation feedback
  • Site metrics
  • Text analysis metrics
  • Download/last accessed numbers
  • Topic type metrics
  • Topic metadata
  • Contribution data
  • Social media analytics

Some of these data types are more relevant to different types of organizations and documentation installations. For example, open source projects might have more useful issue tags, or organizations that use DITA will have easier access to topic type information.

This list of data types is to demonstrate the different types of information that can help you prioritize documentation, but I don’t want you to think that you need to do large-scale collections or implementations to get any valuable data worth incorporating into your prioritization process.

I’ll cover a couple of these data types in more detail here, but I talk about all of them in another post: Detailed data types you can use for documentation prioritization.

Product usage data

You can use usage data for products (also called telemetry) to find out where people are spending their time. What features or functionality are they using? Even if they’ve purchased or installed the product, are they actually using it?

Some examples of product usage data include:

  • Time in product
  • Intra-product clicks
  • Types of data ingested
  • Types of content created (e.g., dashboards, playlists)
  • Amount of content created (e.g., dashboards, playlists)

In addition to data about how people are interacting with the product, you can also gather product usage data without actual introspection into how people are using it. If you have information about how many people have downloaded a product or are logging in to a service:

  • Number of downloads and installs
  • License activations and types
  • Daily and monthly active users

I mostly talk about using data to help you prioritize the more ambiguous parts of a backlog that might not be tied to a release, but especially with the help of product usage data, you can better-prioritize release-focused documentation as well. If your product is in beta, and you want more data to help you prioritize your overall documentation backlog, you can use some product usage data to understand where people are spending more of their time, and draw conclusions about what to spend more time on or less time on, or what level of detail to include in the documentation, to achieve your overall documentation goals for the release. 

Site metrics

Site metrics like page views, session data, HTTP referer data, and link clicks can help you understand where people are coming to your docs from, how long they’re staying on the page, how many readers there are, and what they’re doing after they get to a topic. Here are some example site metrics:

  • Page views
  • Session data like time on page
  • Referer data
  • Link clicks
  • Button clicks
  • Bounce rate
  • Client IP

You can also use this data to understand better how people interact with your documentation, like whether they’re using a version switcher on your page or expanding/collapsing more information hidden on the page. 

You can also split this data by IP address to understand groups of topics that specific users are clustering around, to better understand how people use the documentation.

Identify questions based on your backlog

The process of adding data to your documentation prioritization strategy is all about making do with what you have to answer what you want to know. What you want to know depends on your backlog.

Data analysis is focused on a goal. You don’t want to collect a lot of data and then just stare at it, or get stressed out by the amount of “insights” that you could be gathering but meanwhile you’re not really sure what to do with the information. If you consider questions that you want to answer in advance, you can focus your data collection and analysis in a more valuable way. 

Some example questions that you might identify based on your task list:

  • What are people looking for? Are they finding what they’re looking for?
  • Are people looking for information about <thing I’ve been told to document>?
  • What do people want more help with?
  • What people are we targeting that don’t see their use cases represented?

Tie questions to data types

After you’ve identified questions relevant to your task list, you can tie those questions to data types that can help you answer the questions.

For example, the question: What are people looking for and not finding?

To answer this, you can look where people are looking for information, namely search keywords that they’re typing into search engines, common questions being posted on forums, or the topics of support cases filed by customers.

For example, I looked at some data and was able to identify specific search terms people are using on the documentation site that are routing customers to a company-managed forum site.  I can then use that data to identify cases where people are looking for documentation about something, but are not finding the answers in the documentation.

Another example question: What do people want more help with? 

This could be answered by looking at the topics of support cases again, but also the types of questions being asked in training courses, as well as unanswered questions on forums. 

As a final example: What market groups are we targeting that don’t see their use cases represented?

To answer this, you could look at data about sales leads, questions being asked by the field that contain specific use cases for various market verticals, as well as questions being asked in training courses.

Find questions from data

If you don’t have much of a task list to work with, or if you aren’t able to get access to data that can help you answer your questions, you can still make use of the data that is available to you and draw valuable insights from it.

You can identify interest in content that you maybe weren’t aware of, and make plans to write more to address that interest, or modify existing content to address that interest. Maybe there are a bunch of forum threads about how to do something, but nothing authoritative in the documentation. That information hasn’t made it to the docs writers in any way, but because you’re looking at the available data, you’re able to see that it’s important.

Even if you have no data specifically relevant to the documentation or customer questions, you can still find ways to identify documentation work to add to a task list. You could create datasets by performing text analysis on all or specific documentation topics, and identify complexity issues, or topics that don’t adhere to a style guide. You could use customer satisfaction surveys to identify places where documentation architecture or linking strategies could be improved.

Working with the data

Now you hopefully have a better understanding of different types of data available to you, and how you can identify valuable data sources based on your questions that you want to answer. But how much data do you need to collect? And how do you get the data? Most importantly, how do you analyze it to answer the questions you want to answer?

How much data?

How much data do you need to collect? You don’t need to collect data forever. You don’t need ALL the data. You just need enough data to point you in a direction and reduce uncertainty.

You can use a small sample of users, or a small sample of time, so long as it helps you answer your question and reduce uncertainty about what the answer could be. Collecting larger amounts of data doesn’t mean that you reduce uncertainty by an equally large degree. The amount of data you collect doesn’t correlate directly to what you’re able to learn from it. However, if the question you’re trying to answer with data concerns all the documentation users over a long period of time, you will be collecting more data than if you just want to know what a specific subset of readers found interesting on a Friday afternoon.

Try for representative samples that are relevant for the questions you’re trying to answer. If you can’t get representative data, try for a random sample. If you can’t get representative or random samples, acknowledge the bias that is inherent in the data you’re using. Add context to the data wherever possible, especially about who the data represents and why the data is still valuable if it isn’t representative.

You might find that collecting a small amount of data leaves you with more questions than answers, and that’s okay too. It’s an opportunity to continue exploring and learning more about your customers and your documentation tasks. But how do you even get any data at all?

How do you get the data?

You’ll either be collecting your own data, or asking others for the data you need.

If it’s data about the documentation site or its content, you might own that data yourself, and already have access to it. If it’s other types of data, like sales leads or user research data, it’s time to talk to the departments or people that manage those areas.

  • A business development department might have reporting on internal tools like sales leads or support cases.
  • Product managers can share direct customer data and product usage data if you don’t have direct access.
  • Project managers can share data related to internal development processes.

The teams managing different datasets will vary at your organization, and might even be you in many cases. They may be reluctant to share data. With that in mind, remember that when you collect data, you don’t need to get persistent access to all the data you want. Focus on getting some access to some data that is useful to answer your questions. After that, you can use that data to make your work more efficient and informed, and then hopefully communicate that value and get more access to data in the future if you want.

What to use for data analysis?

What do you use to analyze that data after you get it? How do you transform data into a report of useful information?

Some tools might already have analytics and reporting built in, like Google Analytics. That can certainly make it easier to analyze the data!

For other types of data that you need to analyze yourself, use the tools available to you. Think about what already know how to use, or have access to:

  • Know how to use Excel? Perfect! Get started collecting and processing data in spreadsheets and with macros.
  • Know how to write scripts in R/Python to analyze data? Great! You can write scripts to collect, process, and visualize this data.
  • Is your organization using a tool like Splunk, ElasticSearch, Tableau, etc.? Good news! You are really ready for data analysis.

You don’t have to spend a long time learning a new tool to analyze data for these purposes. If you continue incorporating data analysis into your work, it might make sense, but it isn’t necessary to get started.

Tools aren’t magic

It’s also important to note that tools aren’t magic. Some degree of data analysis will involve manual collecting, categorizing, or cleaning of the data. If your organization doesn’t have strict topic types, you might need to perform manual topic-typing. If you want to analyze some information but the data isn’t in a machine-readable format, you might have to sit at your desk copy pasting for hours.

Depending on your skills, the current state of the data that you want to analyze, and the tools available to you, the amount of time it takes to analyze data and get results can vary widely. I have spent 3 days manually processing data in Excel, and I’ve spent 2 hours creating searches in mostly-clean datasets in Splunk to get answers to various questions. Keep that in mind when you’re analyzing data.

How to perform data analysis

When you analyze data, what are you actually looking at? 

Top, rare, outlying values

Find out what values are most common, and which values are least common. Those can be established by counting the various instances of values.

Look for values that are different from the others by a large margin. You can use standard deviation as a function to achieve this.

Patterns and clusters across data

You can also look for patterns and clusters in your data.

If you’re working with qualitative data, you might need to categorize, or code, the data so that you can sort it and look for patterns in the results. You can identify these patterns by counting instances of categories, or looking at clusters of behavior. An example of a cluster of behavior is if you look at documentation topic visits over time, and you identify a spike in visits at a particular time.

Split by different features

You also want to segment data by different features. Meaning, you can better understand the most common values if you split them by other types of information. For example, you can look at the most commonly visited topics in your documentation set over the last 3 months, or you can look at the most commonly visited topics in your documentation over the last 3 months, but on a week-to-week basis. That additional split can help you understand how those values are changing over time. If you identify a spike in a particular topic or category of topics, you can then interpret the data. Maybe a new product release led to a spike of interest in the release notes topic that wasn’t easily identified until you split the results by week. This is also a good opportunity to point out to a product team that people really do read your documentation!

That’s an example of splitting by time, but you can split by any other field available to you in your data. To use the same data type, looking at the most common topics by product, by IP address, or other factors, can help lead to valuable insights.

Combine data types

You can combine different types of data to understand approximately how many people are using the product vs how many of them are using the documentation. Comparing sales leads, product usage data, and existing page views could help you approximate the number of potential, and existing customers, alongside the number of distinct documentation readers.

Make sure that when you combine data across datasets, you keep track of units and time ranges, and make sure that you compare like data with like data. For example, be careful not to use data that refers to potential customers with data that refers to existing customers, because that could lead to misleading results if you don’t keep context with the data.

Interpreting results

When you interpret the results of your data analysis, make sure that you are adding context to the data. Especially when dealing with outlier data, but even when reviewing data like rarely-viewed or frequently-viewed topics, keep in mind additional context that could explain results.

Add context from expertise

Use your expertise and knowledge of the documentation to add context. For example, topics concerning a specific functionality are likely to be more popular at a specific time if that functionality was recently changed.

Pursue alternate explanations

Whenever you’re interpreting data, you want to make sure that you’re gut-checking it against what you already know. So if a relatively mundane topic has wildly out-of-the-ordinary page views, there are likely alternate explanations for that interest. Maybe your topic ended up being a great resource about cron syntax in general, even for people that don’t use your product.

Draw realistic conclusions

Draw realistic conclusions based on the data available to you. You might not be able to get access to or combine specific datasets due to privacy concerns. If you carefully identify what problems you’re trying to solve, and select only the data sources that can help you solve those problems, you can reduce the potential that you’ll introduce bias into your data analysis, and improve the conclusions that you’re able to draw.

Don’t trust data blindly

Don’t trust the data blindly. When reviewing data that seems out of the ordinary or like outliers, examine the different reasons why the data could be like that. Who does the data represent? What does it represent? Make sure that you’re interpreting data in context, so that you’re able to understand exactly what it represents. It can be tempting to ignore data that doesn’t match your biases or expectations.

Above all, remember to use data to complement your research and writing, and validate or challenge assumptions about your audience.

Your turn to add data

  1. Identify the questions you’re trying to answer
  2. Use the data available to you
  3. Use the tools available to you
  4. Analyze and interpret the data
  5. Take action and prioritize accordingly

Additional resources