Documenting machine learning models

Products use machine learning and “artificial intelligence” to do things like recommend a song to listen to, offer a quick response to an email, organize search results, provide a transcript of a meeting, and more. Some products rely on ordinary data analysis to construct insights about things like your business performance in a market, or the conversion rates for your online shopping site.

Unfortunately, the companies providing these products often gloss over the technical details of how these systems work—making it seem like those tools are magic, omniscient, or just plain inscrutable. Algorithms are everywhere, yet it’s just as common for the logic programmed into those systems to be hand-waved away.

As integration of machine learning, artificial intelligence, algorithms, and data analysis becomes standard and expected in software products, internal and external documentation for those systems must also become standard.

It’s perhaps not surprising that, as a technical writer, I think documentation is important.

Writing documentation is a way to communicate information about your product—and that information in turn lets others use and understand your product.

Commenting code, providing READMEs, writing how-to guides — all of these forms of documentation help people understand and interpret your code, evaluate your project, and use your product.

The normalization of data-driven software, where machine learning models drive key product functionality, means that folks in operations, procurement, legal, and other departments need to understand the components of the product, how it might interact with their systems and users, and what risks the business might face as a result.

That makes it extremely important to document each component of a data-driven system, from the datasets involved in data analysis and machine learning model training, to the machine learning models and model results, to the systems in which the machine learning models exist.

[Diagram: a database table labeled “data” points to a rosy pink rectangle labeled “model”, which points to a pink shaded rounded rectangle labeled “results”; a curly brace groups all three as “system”.]

When evaluating software, product documentation is expected. It would require an immense amount of trust, and likely some promises, to buy a software product that lacks documentation, or even to start using a free one. It’s also much faster and easier to start using software if it’s well-documented.

But machine learning models and datasets available on the web are often not well-documented.

Sometimes that can happen because you encounter a machine learning model without realizing it. You’re listening to music and a model is queuing up the next recommended track in Spotify’s autoplay, or you’re browsing the web and you read an article generated by a large language model. Other times, you’re more aware that you’re interacting with a model, such as when you feed a prompt into Midjourney or Stable Diffusion, or ask ChatGPT a question.

Much like Apple might want you to feel when using their products, artificial intelligence aficionados want you to be able to use their trained models without the friction of documentation about how they work.

But documentation can be vital to make AI-based products more useful. For tools based on generative models, like Midjourney or ChatGPT, understanding what led to the output that you see can help you tweak your input to produce more useful results.

This holds true for data analysis and other machine learning implementations as well, such as a data analysis project inside an organization or an internally deployed machine learning model supporting business processes. In those cases too, documentation provides a lot of support.

Randy Au makes the case for documenting a data analysis process in his post Data Science Practice 101: Always Leave An Analysis Paper Trail:

Despite how much busy work it sounds like right now, you need to leave a paper trail that can clearly be traced all the way to raw data. Inevitably, someone will want you to re-run an analysis done 6 months ago so that they can update a report. Or someone will reach out and ask you if your specific definition of “active user” happened to include people who wore green hats. Unless you have a perfect memory, you won’t remember the details and have to go search for the answer.

Analysis deliverables are often separated from the things used to generate it. Results are sent out in slide decks, a dashboard on a TV screen, or a chart pasted into an email, a single slide in a joint presentation for executives, or just random CSV dumps floating around in a file structure somewhere. The coupling between deliverable and source is non-existent unless we deliberately do something about it.

Without a clear definition of how an analysis was performed, how data points were defined, or where a particular chart came from, the results lose value, credibility, and reproducibility. It helps the future version of you, as well as folks that you collaborate with, if you document the context of a project and store that documentation alongside the output of the project.

Similarly, without a clear sense of how a model was trained or why it might be producing specific results, the results lose value and credibility. In an era where generative machine learning models output fabricated academic references when you ask them for citations about a topic, documentation about the systems becomes even more valuable to help you judge which machine learning model output you can trust.

If you have clear documentation about how the model was trained and on which datasets, you can more accurately assess whether the outcomes delivered by your model are valid, or merely the result of testing on your training data.
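For instance, if the documentation records which split of the data a model was evaluated on, it’s easy to verify that the evaluation used held-out data rather than the training set. Here’s a rough sketch of what that record might look like with scikit-learn; the dataset, file name, and fields are illustrative placeholders rather than any standard:

```python
import json

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hold out a test split so the model is never evaluated on its training data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Record how the model was trained and evaluated, and store it with the results.
evaluation_record = {
    "training_rows": len(X_train),
    "test_rows": len(X_test),
    "split": "80/20 train/test split, random_state=42",
    "test_accuracy": model.score(X_test, y_test),
}
with open("evaluation_record.json", "w") as f:
    json.dump(evaluation_record, f, indent=2)
```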

As Emily Bender points out in the Radical AI podcast episode, The Limitations of ChatGPT with Emily M. Bender and Casey Fiesler:

If you don’t know what’s in the training data, then you’re not positioned to decide if you can safely deploy the thing.

Especially if you’re using a machine learning model in a commercial context, you want to be able to identify possible harm that could arise from its output, such as violent, misogynistic, racist, or otherwise dangerous content. Without information about the training datasets, model training practices, and overall system context for a model, you can’t properly evaluate it.

Provide helpful documentation

What does it mean to provide helpful documentation for the components of machine learning and data analysis practices?

As with any documentation, you need to consider the following:

Who is the audience?

The audience of the documentation depends on the context of the analysis or the model. The documentation for a dataset distributed for free on Kaggle has a different audience than the documentation for a machine learning model deployed internally at a large financial institution to detect fraudulent credit card transactions. As such, the content of your documentation changes accordingly, including more or less detail about specific aspects of the data.

What purpose does the documentation serve?

Similarly, the purpose of the documentation is different for a publicly available dataset when compared to an internally deployed machine learning model, and is different still from a product that provides a chat interface for a large language model. The documentation for a machine learning model used by a financial services company needs to exist for regulatory and auditing purposes, in addition to the typical purpose of remembering how the model works.

How do you provide the documentation?

How you provide the documentation also differs depending on what you need to document. You might be able to add inline comments to a dataset, but then you don’t have a good way to provide an overview of the dataset itself. A machine learning model offers no simple way to include documentation as part of the training output. So far, most standards involve providing a PDF with information, but others, such as Hugging Face Dataset Cards, use a YAML-formatted specification file published alongside the dataset on the Hugging Face Hub.
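As a concrete illustration, a Hugging Face-style dataset card is a README.md that starts with a YAML metadata block. Here’s a minimal sketch of generating one in Python; the metadata values are placeholders, and the exact set of supported fields is something to confirm against the Hugging Face documentation:

```python
import yaml  # PyYAML

# Metadata fields commonly found in the YAML block of a Dataset Card.
metadata = {
    "pretty_name": "Example Support Tickets",
    "license": "cc-by-4.0",
    "language": ["en"],
    "task_categories": ["text-classification"],
}

body = """# Dataset Card for Example Support Tickets

## Dataset Description

Describe what the data contains, how it was collected, and known limitations.
"""

# Dataset cards are published as README.md alongside the dataset on the Hub.
with open("README.md", "w") as f:
    f.write("---\n")
    f.write(yaml.safe_dump(metadata, sort_keys=False))
    f.write("---\n\n")
    f.write(body)
```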

What do you put in the documentation?

What the documentation should contain depends on which component of the machine learning process you’re documenting, and there are a number of standards proposed by academics and prominent players in the data industry.

How to document machine learning components

If you’re deploying machine learning models in your business operations, disseminating the results of data analysis, or integrating machine learning into your product, you need to write documentation. What you write depends on the part of the system that you choose to document.

There are several perspectives to consider:

Documenting the datasets

When you focus on documenting the datasets, you want to capture things like:

For labeled datasets, you want to consider additional components:

Standards for documenting datasets:

Documenting the models

In addition to documenting the datasets used for data analysis or machine learning training, you also need to document the models trained on the datasets.

When documenting a machine learning model, you want to capture things like the following:

For versioned machine learning models, it’s also helpful to include context about what is different between one version of the model and the previous, such as:

Standards for documenting machine learning models:
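Separately from any formal standard, a lightweight way to capture that version-to-version context is a small, machine-readable record saved next to each model artifact. The fields in this sketch are illustrative assumptions, not a published specification:

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelVersionRecord:
    """Context about what changed between one model version and the previous."""

    version: str
    previous_version: str
    training_data_version: str
    hyperparameter_changes: dict = field(default_factory=dict)
    metric_changes: dict = field(default_factory=dict)
    notes: str = ""


record = ModelVersionRecord(
    version="2.1.0",
    previous_version="2.0.3",
    training_data_version="transactions-2023-06",
    hyperparameter_changes={"n_estimators": "100 -> 200"},
    metric_changes={"validation_auc": "0.91 -> 0.93"},
    notes="Retrained after adding June transactions; no feature changes.",
)

# Store the record alongside the serialized model so the context travels with it.
with open("model_v2.1.0.card.json", "w") as f:
    json.dump(asdict(record), f, indent=2)
```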

Documenting the model results

It’s important to document the results produced by a model in specific scenarios, which can help you debug and retrain the model as needed. Documentation about model results is often referred to as “explainable AI”, as the goal is to explain the outcomes produced by artificial intelligence (one or more machine learning models, or algorithms).
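As one illustration of explaining model results, you can measure which input features most influence a trained model’s predictions and keep that summary with the rest of the model documentation. Here’s a sketch using scikit-learn’s permutation importance; the model and dataset are stand-ins:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure how much the score drops.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)

# Rank the most influential features to include in the model results documentation.
ranked = sorted(
    zip(data.feature_names, result.importances_mean), key=lambda pair: -pair[1]
)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```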

When you document the model results, you want to collect the following information:

Standards for documenting machine learning model results:

Documenting the system

It’s important to document the entire system in which a machine learning model operates. A machine learning model is never implemented as a discrete object. Models must be kept updated to avoid drift and thus are implemented as part of a larger system that can include data quality tools, a testing framework, a build and deploy framework, and even other models, such as in the context of an ensemble model.

Documentation about the entire system offers helpful guidance to folks trying to understand a system so that they can maintain, update, debug, and audit the system, to name a few common tasks.

As such, machine learning system documentation needs to include the following:

Standards for documenting machine learning systems:
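Apart from any formal standard, one lightweight way to keep that system-level context together is a short manifest stored in version control next to the deployment configuration. Every component name in this sketch is an illustrative placeholder:

```python
# A hypothetical system manifest for an internal fraud detection system; store it
# in version control next to the deployment configuration and keep it updated.
FRAUD_DETECTION_SYSTEM = {
    "models": ["fraud-classifier v2.1.0", "rules-fallback v1.4"],
    "data_sources": ["transactions warehouse table", "merchant reference data"],
    "data_quality_checks": "nightly validation suite on incoming transactions",
    "retraining": "monthly, triggered by the build and deploy pipeline",
    "monitoring": "drift dashboard plus alerts on score distribution shifts",
    "owners": ["payments-ml team"],
}
```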

Actually write the documentation

Making sure the documentation actually gets written is the most important aspect of documenting machine learning components and systems.

You can automate portions of the documentation, identify points in the model development process where it would be prudent to update part of whichever template you choose, or do whatever else works for your team and your processes. You can also define robust accountability mechanisms like checklists.
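For example, a small check in continuous integration can serve as one of those accountability mechanisms by failing the build when a model’s documentation is missing or incomplete. This is a rough sketch that assumes a hypothetical repository layout where each directory under models/ contains a MODEL_CARD.md with a few required sections:

```python
import sys
from pathlib import Path

# Hypothetical required sections; adjust to whichever template your team uses.
REQUIRED_SECTIONS = ["## Training data", "## Intended use", "## Limitations"]


def check_model_cards(models_dir: str = "models") -> list[str]:
    """Return a list of documentation problems found under models_dir."""
    problems = []
    root = Path(models_dir)
    if not root.exists():
        return problems
    for model_dir in sorted(root.iterdir()):
        if not model_dir.is_dir():
            continue
        card = model_dir / "MODEL_CARD.md"
        if not card.exists():
            problems.append(f"{model_dir.name}: missing MODEL_CARD.md")
            continue
        text = card.read_text()
        for section in REQUIRED_SECTIONS:
            if section not in text:
                problems.append(f"{model_dir.name}: missing '{section}' section")
    return problems


if __name__ == "__main__":
    issues = check_model_cards()
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```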

If you lack processes and accountability, it’s easy to skip writing documentation. All the standards in the world for documenting data-driven systems don’t matter if the documentation never gets written.

Most fields struggle to incorporate documentation into their processes and to build accountability for making sure it gets written, but fast-moving fields like data science and machine learning, which are still formalizing their practices, struggle even more.

No matter which method you choose for documenting data-driven systems, you must include writing the documentation in your existing workflows.