Are AI chatbots docs?

Robin Sloan asks: “Is the doc bot docs, or not?”

He relates a recent experience with an AI chatbot that answered his question incorrectly, despite drawing on the authoritative context of the official Shopify documentation.

Reacting, Sloan considers:

I suppose there are domains in which just taking a guess is okay; is the official documentation one of them?

I vote no, and I think a freestyling doc bot undermines the effort and care of the folks […] who work […] to produce documentation that is thorough and accurate.

As a writer of official documentation, I agree.

Most documentation sites are part of the official service agreement for a product, often serving as the warranty for how the product is supposed to work and which uses the company supports. Meanwhile, most LLM technology is prone to hallucinations (not to mention prompt injection), even when constrained by a system prompt.

If you provide an AI-powered chatbot on your documentation site, you must consider how to make it a productive and accurate experience for your customers…

Imagine providing a more powerful search for your documentation, but instead of returning no results, the search summarizes a best guess. Sometimes that works, but sometimes it means serving up irrelevant information, and the nature of generated text is that it seems far more authoritative than a clearly irrelevant link in a list.
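One way to blunt that failure mode is to have the bot decline rather than guess when nothing relevant comes back from your docs. Here's a minimal sketch in Python, assuming a hypothetical retriever that returns scored documentation chunks and a generic LLM client; the names and threshold are illustrative, not any particular vendor's API:

```python
# A minimal "refuse rather than guess" guard. The retriever and llm objects
# are hypothetical stand-ins, not any specific chatbot product's API.

SIMILARITY_FLOOR = 0.75  # below this, treat the docs as having no answer

def answer_or_decline(question: str, retriever, llm) -> str:
    # Assumed shape: retriever.search returns (chunk_text, similarity) pairs.
    results = retriever.search(question, top_k=5)
    relevant = [(chunk, score) for chunk, score in results
                if score >= SIMILARITY_FLOOR]
    if not relevant:
        # Prefer an honest non-answer over a fluent best guess.
        return ("I couldn't find this in the official documentation. "
                "Try rephrasing, or browse the docs directly.")
    context = "\n\n".join(chunk for chunk, _ in relevant)
    prompt = (
        "Answer using ONLY the documentation excerpts below. "
        "If they don't answer the question, say so.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}"
    )
    return llm.complete(prompt)
```

An honest “I couldn't find this” keeps the bot's failure mode closer to a search with no results than to a confident wrong answer.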

[Illustration: an anxious-looking angular robot next to an empty input box titled “Ask a question”.]

It’s easy to add a bot to your docs site as an experiment, but it’s important to be intentional about it and to follow up after deployment. Before and after you add an AI-powered chatbot to your documentation site, consider the following:

Add disclaimers

If the content output by your AI-powered chatbot might be inaccurate, consider indicating that your bot is in beta, or otherwise not officially supported.

If the service or license agreement that customers sign for your product includes your documentation as an official guarantee of product functionality, it’s a good idea to check with your legal team (if you have one) to ensure that the responses from the chatbot you deploy aren’t subject to the same guarantee.

You might want to add an information disclosure as well, letting people know whether and how the information they provide to the chatbot is collected and used.

Strategically supplement responses

Presumably, if you’re providing an AI-powered chatbot on your documentation site, it’s configured to ground and source its responses in the official documentation using retrieval-augmented generation (RAG).

When you identify the sources to use, you might want to consider whether those sources should include more than just the official product documentation. If you have a community forum site full of customer context (but less up-to-date or accurate information), or a technical blog full of use cases (and similarly potentially outdated), you might want to make those available to the chatbot.

Ideally, you’d have some way of adjusting the weight given to each source (though the controls might be limited to temperature or other vaguely named settings). If you draw from multiple sources, you want to be able to indicate which ones are more authoritative than others, or at least to prioritize more recently updated content on the expectation that it’s more accurate, as in the sketch below.
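What that might look like, roughly: re-rank retrieved chunks by a blend of similarity, source authority, and recency. The weights and data shapes here are assumptions for illustration, not the configuration surface of any real RAG framework:

```python
# Re-rank retrieved chunks by source authority and recency.
from datetime import datetime, timezone

SOURCE_AUTHORITY = {
    "official_docs": 1.0,    # contractual source of truth
    "tech_blog": 0.6,        # useful use cases, possibly stale
    "community_forum": 0.4,  # rich customer context, least vetted
}

def rerank(retrieved: list[dict]) -> list[dict]:
    """Assumed item shape: {"text": str, "source": str,
    "similarity": float, "updated": timezone-aware datetime}.
    Returns items sorted by a blended score, best first."""
    now = datetime.now(timezone.utc)

    def score(item: dict) -> float:
        authority = SOURCE_AUTHORITY.get(item["source"], 0.3)
        age_days = (now - item["updated"]).days
        recency = 1.0 / (1.0 + age_days / 365)  # decay roughly by age in years
        return item["similarity"] * authority * recency

    return sorted(retrieved, key=score, reverse=True)
```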

Test outputs

Set up automated and manual testing of the outputs for accuracy and consistency. The best way to do this is to build a “ground truth” dataset of questions and answers. After you have a canonical dataset of common questions and accurate answers, you can evaluate the output of the AI chatbot against that dataset.
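A minimal sketch of that evaluation loop, assuming hypothetical `ask_bot` and `similar` helpers standing in for your chatbot call and whatever correctness check you adopt (string match, embedding similarity, or an LLM-as-judge rubric):

```python
# Evaluate a doc bot against a ground-truth dataset of Q&A pairs.
import json

def evaluate(golden_path: str, ask_bot, similar,
             threshold: float = 0.8) -> float:
    with open(golden_path) as f:
        golden = json.load(f)  # [{"question": ..., "answer": ...}, ...]
    passed = 0
    failures = []
    for case in golden:
        response = ask_bot(case["question"])
        if similar(response, case["answer"]) >= threshold:
            passed += 1
        else:
            failures.append(case["question"])
    for question in failures:
        print(f"FAIL: {question}")
    return passed / len(golden)  # overall pass rate
```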

Perform evaluations generally (is the output correct?) and over time (is the output still correct?) to ensure that the quality of responses doesn’t drift.
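One lightweight way to watch for drift is to log every evaluation run's score and flag a run that falls well below the recent average. A sketch, with an illustrative file format and threshold:

```python
# Append each run's score to a history log and warn on a sharp drop.
import csv
from datetime import date

def record_and_check(score: float, log_path: str = "eval_history.csv",
                     window: int = 10, max_drop: float = 0.05) -> bool:
    history = []
    try:
        with open(log_path) as f:
            history = [float(row[1]) for row in csv.reader(f)]
    except FileNotFoundError:
        pass  # first run, no history yet
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), score])
    recent = history[-window:]
    if recent:
        avg = sum(recent) / len(recent)
        if score < avg - max_drop:
            print(f"Drift warning: pass rate {score:.2f} "
                  f"vs recent average {avg:.2f}")
            return False
    return True
```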

If you’re new to evaluating the output of LLMs, many chatbot providers like Kapa offer resources that outline these tactics, such as Kapa’s doc Conversation review best practices. Intercom seems to provide a set of automated performance metrics for their Fin AI chatbot, which, if you trust them, you can use to review the quality of responses.

If you’re not using one of these providers (or even if you are), the field of data labeling, especially for NLP data, has a wealth of resources available. Microsoft provides a detailed list of metrics for evaluating LLM-generated content.

You can also automate testing of outputs with a tool like Evidently, incorporating testing into your CI/CD pipeline.
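For example, a pytest-style gate can fail a build when the pass rate drops. This sketch reuses the hypothetical `evaluate`, `ask_bot`, and `similar` helpers from above (imported from an assumed module) rather than any specific tool's API, and the 0.9 bar is an illustrative choice, not a recommendation:

```python
# test_docbot.py: a CI accuracy gate, runnable with pytest.
from docbot_eval import evaluate, ask_bot, similar  # hypothetical module

MIN_PASS_RATE = 0.9

def test_docbot_accuracy():
    pass_rate = evaluate("golden_qa.json", ask_bot, similar)
    assert pass_rate >= MIN_PASS_RATE, (
        f"Doc bot pass rate {pass_rate:.2%} fell below {MIN_PASS_RATE:.0%}"
    )
```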

Monitor input and output

Beyond testing the quality and accuracy of the chatbot responses, you also want to monitor the questions being asked and the responses offered. The data from AI chatbot interactions can be used almost like a combination of search term data and community forum content: you can use it to identify missing content, misaligned mental models, inconsistent terminology, customer use cases, and more.

The monitoring process isn’t just about improving the accuracy of your content, but also its relevancy to customers.

You can then take action in your regular documentation process to address anything that you find.
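As a starting point, you might mine the interaction logs for questions the bot declined or answered with low confidence; those clusters point straight at content gaps. A sketch, with an assumed log schema:

```python
# Surface the most common questions the bot couldn't answer well.
import json
from collections import Counter

def content_gaps(log_path: str, top_n: int = 20) -> list[tuple[str, int]]:
    gaps = Counter()
    with open(log_path) as f:
        for line in f:  # assumed: one JSON interaction per line
            if not line.strip():
                continue
            event = json.loads(line)
            if event.get("declined") or event.get("confidence", 1.0) < 0.5:
                gaps[event["question"].strip().lower()] += 1
    return gaps.most_common(top_n)

# Frequent unanswerable questions point at missing docs, inconsistent
# terminology, or mental models your content doesn't yet address.
for question, count in content_gaps("chat_log.jsonl"):
    print(f"{count:4d}  {question}")
```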

Chatbots aren’t docs, but you can use the interactions with them to make your docs better.