Measuring data (and documentation) quality is hard

Gwen Windflower asks "Are you actually measuring data quality?":

Data is a woven net thrown over the world, capturing it in a grid of variable resolution. We want to capture as much detail as we possibly can, but it’s then crucial to make judicious decisions about how we translate what we capture into a useful map.

Documentation involves a discernment process similar to data analysis and data engineering: given a large set of inputs, you have to identify how to turn them into a useful output and where to start. Defining priorities and classifying information is vital.

When it comes to quality in particular, and how to measure it, Gwen addresses the common practices of data analysts:

Typically though, when it comes to quality, we proceed to measure things like the internal consistency of multiple printings of our map, the resolution of the ink, and how quickly we print new maps in response to changes in the terrain. Two related maps stay internally consistent? Quality. Granular resolution of detail? High quality. Map updates in near-realtime? The highest quality. We often assume that making these kinds of measures higher in our output is an unalloyed good.

Again, this is common practice in documentation. It's tempting to align documentation quality measures with coverage (parity with product functionality), freshness (how quickly pages are updated), or with measures of the writing itself, such as sentence length, page length, or word complexity.
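Part of the appeal of these measures is that they're trivial to automate. As a rough sketch (assuming a Markdown docs tree, and using crude heuristics rather than standard readability scoring), all three can be computed in a few lines of Python:

```python
import re
from pathlib import Path

def doc_metrics(docs_dir: str) -> dict:
    """Compute self-referential quality metrics for a Markdown docs tree.

    The heuristics (regex sentence splitting, word length as a
    complexity proxy) are illustrative only.
    """
    sentence_lengths: list[int] = []
    word_lengths: list[int] = []
    page_lengths: list[int] = []
    for page in Path(docs_dir).glob("**/*.md"):
        text = page.read_text(encoding="utf-8")
        words = text.split()
        page_lengths.append(len(words))
        word_lengths.extend(len(w) for w in words)
        # Crude sentence split on terminal punctuation followed by whitespace.
        sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
        sentence_lengths.extend(len(s.split()) for s in sentences)

    def avg(values: list[int]) -> float:
        return sum(values) / len(values) if values else 0.0

    return {
        "pages": len(page_lengths),
        "avg_sentence_length": avg(sentence_lengths),
        "avg_page_length": avg(page_lengths),
        "avg_word_length": avg(word_lengths),  # crude word-complexity proxy
    }
```

Every number this produces is easy to trend on a dashboard, which is exactly what makes these measures so tempting.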

But as Gwen points out, measuring quality in these ways is flawed:

There’s a problem with these self-referential, almost tautological measures though: none of them tells us if a map is fulfilling its purpose.

Maps should be measured on how well they get you where you want to go. To do that maps need to be as detailed and accurate as necessary and no more.

If you want to measure the quality of your output, you also need to assess its usefulness.

To measure quality, we need to look beyond the data (or the documentation) itself and look at its function. Gwen makes this clear, continuing with the map metaphor:

to know the proper level of detail, pace, and presence for our maps, we need to know where we want to go.

If you want to hike through a national park, you want a trail map, not a road map. Similarly, you need to understand why you're creating, refining, or analyzing data so that you can produce something that's high quality for that purpose.

The payroll department might have an objective to improve payroll quality, measured by reducing paycheck errors across the company from 5% to 1%. If you're managing their data pipeline, knowing what the data is used for and what the quality issues are can make a big difference in how you approach your work. You might start by improving the accuracy and reliability of the data before you worry about driving its latency down to near real-time.
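To make that concrete, here's a minimal sketch of a purpose-driven quality check; the record shape, field names, and tolerance are all hypothetical. It recomputes each paycheck from source-of-truth inputs and reports the same error rate the payroll department is targeting:

```python
from dataclasses import dataclass

@dataclass
class Paycheck:
    employee_id: str
    gross_pay: float       # what the pipeline produced
    expected_gross: float  # recomputed from source-of-truth hours and rates

def paycheck_error_rate(paychecks: list[Paycheck], tolerance: float = 0.01) -> float:
    """Share of paychecks whose gross pay deviates from the expected amount.

    This tracks the payroll team's actual objective (errors from 5% down
    to 1%) rather than pipeline internals like freshness or latency.
    """
    if not paychecks:
        return 0.0
    errors = sum(
        1 for p in paychecks
        if abs(p.gross_pay - p.expected_gross) > tolerance
    )
    return errors / len(paychecks)
```

A metric like this ties the pipeline work directly to the payroll team's 5%-to-1% objective; shaving latency only becomes the priority once the error rate is under control.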

Similarly, documentation teams are supporting teams. We do our own work, but it's of little use if it doesn't help people meet their objectives. To write high-quality documentation, we have to consider the goals of our audience and what they want to accomplish with the product, so that what we write is relevant and useful.

As supporting teams, data teams face a challenge that Gwen identifies:

There is constant discussion of how hard it is to measure the impact of data teams, and that’s directly because of a lack of intentionality with our data.

That applies to tech writing teams too. The impact of a data team, like the impact of a documentation team, can be measured by identifying the objectives those teams help others meet. Who does our work support, and did we do our work with that audience's goals in mind?