Unbiased data analysis with the data-to-everything platform: unpacking the Splunk rebrand in an era of ethical data concerns

Splunk software provides powerful data collection, analysis, and reporting functionality. The new slogan, “data is for doing”, alongside taglines like “the data-to-everything platform” and “turn data into answers”, aims to bring the company to the forefront of data powerhouses, where it rightly belongs (I’m biased; I work for Splunk).

There is nuance in those phrases that can’t be adequately expressed in marketing materials, but that is crucial for doing ethical and unbiased data analysis. That nuance helps you find ultimately better answers with your data and do even better things with it.

Start with the question

If you start attempting to analyze data without an understanding of a question you’re trying to answer, you’re going to have a bad time. This is something I really appreciate about moving away from the slogan “listen to your data” (even though I love a good music pun). Listening to your data implies that you should start with the data, when in fact you should start with what you want to know and why you want to know it. You start with a question.

Data analysis starts with a question, and because I’m me, I want to answer a fairly complex question: what kind of music do I like to listen to? This overall question, loosely analogous to an objective function in data science, can direct my data analysis. But first, I want to evaluate my question. If I’m going to turn my data into doing, I want to consider the ethics and the bias of my question.

Consider what you want to know, and why you want to know it, so that you can evaluate the ethics of the question:

  • Is this question ethical to ask? 
  • Is it ethical to use data to answer it? 
  • Could you ask a different question that would be more ethical and still help you find useful, actionable answers? 
  • Does the question contain inherent bias? 
  • How might the biases in the question affect the results of your data analysis? 

Questions like “How can we identify fans of this artist so that we can charge them more money for tickets?” or “What’s the highest fee that we can add to tickets where people will still buy the tickets?” could be good for business, or help increase profits, but they’re unethical. You’d be using data to take actions that are unfair, unequal, and unethical. Just because Splunk software can help you bring data to everything doesn’t mean that you should. 

Break down the question into answerable pieces

If I’ve decided that it’s ethical to use data to help answer my question, then it’s time to consider how I’ll perform my data analysis. I want to consider the following about my question before I try to answer it:

  • Is this question small enough to answer with data?
  • What data do I need to help me answer this question?
  • How much data do I need to help me answer this question?

I can turn data into answers, but I have to be careful about the answers that I look for. If I don’t consider the small questions that make up the big question, I might end up with biased answers. (For more on this, see my .conf17 talk with Celeste Tretto).

So if I consider “What kind of music do I like to listen to?”, I might recognize right away that the question is too broad. There are many things that could change the answer to that question. I’ll want to consider how my subjective preferences (what I like listening to) might change depending on what I’m doing at the time: commuting, working out, writing technical documentation, or hanging out on the couch. I need to break the question down further. 

A list of questions that might help me answer my overall question could be: 

  • What music do I listen to while I’m working? When am I usually working?
  • What music do I listen to while I’m commuting? When am I usually commuting?
  • What music do I listen to when I’m relaxing? When am I usually relaxing?
  • What are some characteristics of the music that I listen to?
  • What music do I listen to more frequently than other music?
  • What music have I purchased or added to a library? 
  • What information about my music taste isn’t captured in data?
  • Do I like all the music that I listen to?

As I’m breaking down the larger question of “What kind of music do I like to listen to?”, the most important question I can ask is “What kind of music do I think I like to listen to?”. This question matters because data analysis isn’t as simple as turning data into answers. That can make for catchy marketing, but the nuance lies in using the data you have to reduce uncertainty about what you think the answer might be. The book How to Measure Anything by Douglas Hubbard covers this concept of data analysis as uncertainty reduction in great detail, but the crux is that for a sufficiently valuable and complex question, there is no single objective answer (or else we would’ve found it already!). 

So I must consider, right at the start, what I think the answer (or answers) to my overall question might be. Since I want to know what kind of music I like, I want to ask myself what kind of music I think I might like. Because “liking” and “kind of music” are subjective characteristics, there can be no single answer that is objective truth. Very few, if any, complex questions have objectively true answers, especially those that can be found in data. 

So I can’t turn data into answers for my overall question, “What kind of music do I like?”, but I can turn it into answers for simpler questions that are rooted in fact. The questions I listed earlier are much easier to answer with data, with relative certainty, because I broke the complex, somewhat subjective question into many objective questions. 

Consider the data you have

After you have your questions, look for the answers! Consider the data that you have, and whether or not it is sufficient and appropriate to answer the questions. 

The flexibility of Splunk software means that you don’t have to consider the questions you’ll ask of the data before you ingest it. Structured or unstructured, you can ask questions of your data, but you might have to work harder to fully understand the context of the data to accurately interpret it. 

Before you analyze and interpret the data, you’ll want to gather context about the data, like:

  • Is the dataset complete? If not, what data is missing?
  • Is the data correct? If not, in what ways could it be biased or inaccurate?
  • Is the data similar to other datasets you’re using? If not, how is it different?

This additional metadata (data about your datasets) can provide crucial context necessary to accurately analyze and interpret data in an unbiased way. For example, if I know there is data missing in my analysis, I need to consider how to account for that missing data. I can add additional (relevant and useful) data, or I can acknowledge how the missing data might or might not affect the answers I get.
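As a sketch of one such context check, here’s how I might look for gaps in a listening history before trusting it. The data shapes and field names here are hypothetical, not from any real export format:

```python
from datetime import date, timedelta

def find_missing_days(listen_dates, start, end):
    """Return days in [start, end] with no recorded listens.

    listen_dates: an iterable of datetime.date objects, one per listen.
    A gap might mean I didn't listen to music that day, or that the
    tracking service (e.g. Last.fm) failed to record my listening --
    the metadata tells me where to look, not what the answer is.
    """
    observed = set(listen_dates)
    day, missing = start, []
    while day <= end:
        if day not in observed:
            missing.append(day)
        day += timedelta(days=1)
    return missing

# Example: listens on two days over a four-day window.
listens = [date(2019, 11, 1), date(2019, 11, 1), date(2019, 11, 3)]
gaps = find_missing_days(listens, date(2019, 11, 1), date(2019, 11, 4))
# gaps -> [date(2019, 11, 2), date(2019, 11, 4)]
```

Once the gaps are known, I can decide whether to backfill relevant data or simply note how the missing days might skew my answers.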

After gathering context about your datasets, you’ll also want to consider if the data is appropriate to answer the question(s) that you want to answer. 

In my case, I’ll want to assess the following aspects of the datasets: 

  • Is using the audio features API data from Spotify the best way to identify characteristics in music I listen to? 
  • Could another dataset be better? 
  • Should I make my own dataset? 
  • Does the data available to me align with what matters for my data analysis? 

You can see one small way that the journalist Matt Daniels of The Pudding considered which data was relevant to answer the question “How popular is male falsetto?” for the Vox YouTube series Earworm, starting at 1:45 in this clip. For about 90 seconds, Matt and the host of the show, Estelle Caswell, discuss the process of selecting the right data to answer their question, eventually choosing a smaller, but more relevant, dataset. 

Is more data always better? 

Data is valuable when it’s in context and applied with consideration for the problem that I’m trying to solve. Collecting data about my schedule may seem overly intrusive or irrelevant, but applied to the broader question of “what kind of music do I like to listen to?”, it can add valuable insights and possibly shift the overall answer, because I’ve applied that additional data with consideration for the question I’m trying to answer.

Splunk published a white paper to accompany the rebranding, and it contains some excellent points. One of them that I want to explore further is the question:

“how complete, how smart, are these decisions if you’re ignoring vast swaths of your data?” 

On the one hand, having more data available can be valuable. I can get a more valuable answer to “what kind of music do I like?” because I can consider additional, seemingly irrelevant data about how I spend my time while listening to music. However, there are many times when you want to ignore vast swaths of your data. 

The most important aspect to consider when adding data to your analysis is not quantity, but quality. Rather than focusing on how much data you might be ignoring, I’d suggest instead focusing on which data you might be ignoring, for which questions, and affecting which answers. You might have a lot of ignored data, but put your focus on the small amount of data that can make a big difference in the answers you find in the data.

As the academics in “I got more data, my model is more refined, but my estimator is getting worse! Am I just dumb?” make clear with their crucial finding:

“More data lead to better conclusions only when we know how to take advantage of their information. In other words, size does matter, but only if it is used appropriately.”

The most important aspect of adding data to an analysis is exactly as the academics point out: it’s only more helpful if you know what to do with it. If you aren’t sure how to use additional data you have access to, it can distract you from what you’re trying to answer, or even make it harder to find useful answers because of the scale of the data you’re attempting to analyze. 

Douglas Hubbard in the book How to Measure Anything makes the case that doing data analysis is not about gathering the most data possible to produce the best answer possible. Instead, it’s about measuring to reduce uncertainty in the possible answers and measuring only what you need to know to make a better decision (based on the results of your data analysis). As a result, such a focused analysis often doesn’t require large amounts of data — rough calculations and small samples of data are often enough. More data might lead to greater precision in your answer, but it’s a tradeoff between time, effort, cost, and precision. (I also blogged about the high-level concepts in the book).
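Hubbard’s point has a simple statistical footing: the uncertainty of a sample mean shrinks only with the square root of the sample size, so each additional order of magnitude of data buys the same modest gain in precision. A minimal illustration:

```python
import math

def standard_error(sample_std, n):
    """Standard error of a sample mean: uncertainty shrinks as 1/sqrt(n)."""
    return sample_std / math.sqrt(n)

# Going from 10 to 100 observations cuts uncertainty by about 3x;
# going from 100 to 1,000 costs 10x the data for the same ~3x gain.
for n in (10, 100, 1000):
    print(n, round(standard_error(15.0, n), 2))
# prints:
# 10 4.74
# 100 1.5
# 1000 0.47
```

This is why a small sample is often enough to reduce uncertainty to a decision-useful level, and why piling on more data yields diminishing returns.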

If I want to answer my question “What kind of music do I like to listen to?” I don’t need the listening data of every user on the Last.fm service, nor do I need metadata for songs I’ve never heard to help me identify song characteristics I might like. Because I want to answer a specific question, it’s important that I identify the specific data that I need to answer it—restricted by affected user, existence in another dataset, time range, type, or whatever else.

If you want more evidence, the notion that more data is always better is also neatly upended by the Nielsen Norman Group in Why You Only Need to Test with 5 Users and the follow-up How Many Test Users in a Usability Study?.

Keep context alongside the data

Indeed, the white paper talks about bringing people to a world where they can take action without worrying about where their data is, or where it comes from. But it’s important to still consider where the data comes from, even if you don’t have to worry about it because you use Splunk software. It’s relevant to data analysis to keep context about the data alongside the data.

For example, it’s important for me to keep track of the fact that the song characteristics I might use to identify the type of music I like come from a dataset crafted by Spotify, or that my listening behavior is tracked by the service Last.fm. Last.fm can only track certain types of listening behavior on certain devices, and Spotify has its own biases in creating a set of audio characteristics.

If I lose track of this seemingly mundane context when analyzing my data, I might misinterpret my data or draw inaccurate conclusions about what kind of music I like to listen to, based purely on the limitations of the data available to me. If I don’t know where my data comes from, or what it represents, it’s easy to find biased answers to questions, even though I’m using data to answer them.

Having more data than you need also makes it more difficult to keep context close to your data: the more data, the more room for error when trying to track contextual meaning. Splunk software includes metadata fields that can help you keep some context with the data, such as where it came from, but other types of context you’d need to track yourself.
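One lightweight way to track that context yourself is to carry provenance metadata in the same structure as the data, so the caveats travel with the records. A sketch, with field names of my own invention:

```python
from dataclasses import dataclass, field

@dataclass
class TrackedDataset:
    records: list                  # the raw data itself
    source: str                    # where the data came from
    collected_by: str              # which service produced it
    known_gaps: list = field(default_factory=list)  # documented limitations

listens = TrackedDataset(
    records=[{"track": "Midnight City", "plays": 42}],
    source="last.fm export",
    collected_by="Last.fm scrobbler",
    known_gaps=["listens on devices without the scrobbler installed"],
)

# The limitation travels with the data, so any later analysis can surface it.
print(listens.known_gaps[0])
```

Because the caveats are attached to the dataset object rather than living in a separate document, any analysis that touches the data can report its limitations alongside its answers.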

More data can not only complicate your analysis, but also create security and privacy concerns if you keep a lot of data around for longer than you need it. If I want to know what kind of music I like to listen to, I might be comfortable doing data analysis to answer that question, identifying the characteristics of music that I like, and then removing all of the raw data that led me to that conclusion out of privacy or security concerns. Or I could drop the metadata for all but some of the songs I’ve ever listened to. I’d want to consider, again, how much data I really need to keep around. 

Turn data into answers—mostly

So I’ve broken down my overall question into smaller, more answerable questions, I’ve considered the data I have, and I’ve kept the context alongside the data I have. Now I can finally turn it into answers, just like I was promised!

It turns out I can take a corpus of my personal listening data and combine it with a dataset of my personal music libraries to weight the songs in the listening dataset. I can also assess the frequency of listens to further weight the songs in my analysis and formulate a ranking of songs in order of how much I like them. I’d probably also want to split that ranking by what I was doing while I was listening to the music, to eliminate outliers from the dataset that might bias the results. All the small questions that feed into the overall question are coming to life.
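That weighting scheme can be sketched in a few lines of Python. All of the names, weights, and example data here are hypothetical; this is an illustration of the approach, not my actual analysis:

```python
from collections import Counter

def rank_songs(listens, library, activity=None):
    """Rank tracks by listen count, weighting tracks in my library higher.

    listens:  list of (track, activity) pairs from a listening history.
    library:  set of tracks I've purchased or added to a library.
    activity: optionally restrict to listens during one activity, to keep
              e.g. workout-only tracks from skewing overall results.
    """
    counts = Counter(
        track for track, act in listens
        if activity is None or act == activity
    )
    # Library membership signals that I chose the track deliberately,
    # so weight those listens more heavily (the 2.0 weight is arbitrary).
    scores = {
        track: count * (2.0 if track in library else 1.0)
        for track, count in counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

listens = [("Song A", "working"), ("Song A", "working"),
           ("Song B", "working"), ("Song B", "working"),
           ("Song B", "commuting"), ("Song C", "relaxing")]
print(rank_songs(listens, library={"Song A"}, activity="working"))
# -> ['Song A', 'Song B'] (two library-weighted listens beat two plain ones)
```

Splitting by activity, as in the last call, is what lets me answer the smaller questions (“what do I listen to while working?”) separately before combining them into the overall answer.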

After I have that ranking, I could use additional metadata from another source, such as the Spotify audio features API, to identify the characteristics of the top-ranked songs, and ostensibly then be able to answer my overall question: what kind of music do I like to listen to?
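A sketch of that last step, assuming the audio features have already been fetched: the feature names below mirror fields from Spotify’s audio features API, but the values and helper are invented for illustration:

```python
def summarize_features(top_tracks, features):
    """Average each audio feature across the top-ranked tracks.

    features: dict mapping track -> dict of numeric audio features,
    as might be retrieved from Spotify's audio features endpoint.
    """
    keys = {k for t in top_tracks for k in features[t]}
    return {
        k: sum(features[t][k] for t in top_tracks) / len(top_tracks)
        for k in sorted(keys)
    }

# Invented example values, not real Spotify data.
features = {
    "Song A": {"danceability": 0.75, "energy": 0.5},
    "Song B": {"danceability": 0.25, "energy": 1.0},
}
print(summarize_features(["Song A", "Song B"], features))
# -> {'danceability': 0.5, 'energy': 0.75}
```

The resulting averages are the “characteristics” I’d then report as my answer, with all the caveats about where the features came from kept close at hand.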

By following all these steps, I turned my data into answers! And now I can turn my data into doing, by taking action on those characteristics. I can of course seek out new music based on those characteristics, but I can also book the ideal DJs for my birthday party, create or join a community of music lovers with similar taste in music, or even delete any music from my library that doesn’t match those characteristics. Maybe the only action I would take is self-reflection, and see if what the data has “told” me is in line with what I think is true about myself.

It is possible to turn data into answers, and turn data into doing, with caution and attention to all the ways that bias can be introduced into the data analysis process. But there’s still one more way that data analysis could result in biased outcomes: communicating results. 

Carefully communicate data findings

After I find the answers in my data, I need to communicate them carefully to avoid introducing bias. If I want to tell all my friends that I figured out what kind of music I like to listen to, I want to present that finding carefully so that they can take appropriate and ethical action in response to what I tell them. 

I’ll want to present the answers in context. I need to describe the findings with the relevant qualifiers: I like music with these specific characteristics, and when I say I like this music I mean this is the kind of music that I listen to while doing things I enjoy, like working out, writing, or sitting on my couch. 

I also need to make clear what kind of action might be appropriate or ethical to take in reaction to this information. Maybe I want to find more music that has these characteristics, or I’d like to expand my taste, or I want to see some live shows and DJ sets that would feature music that has these characteristics. Actions that support those ends would be appropriate, but can also risk being unethical. What if someone learns of these characteristics, and chooses to then charge me more money than other people (whose taste in music is unknown) to see specific DJ sets or concerts featuring music with those characteristics? 

Data, per the white paper, “must be brought not only to every action and decision, but to every department.” Because of that, it’s important to consider how that happens. Share relevant parts of the process that led to the answers you found from the data. Communicate the results in a way that can be easily understood by your audience. This Medium post by Cecelia Shao, a product manager at Comet.ml, covers important points about how to communicate the results of data analysis. 

Use data for good

I wanted to talk through the data analysis process in the context of the rebranded slogans and marketing content so that I could unpack additional nuance that marketing content can’t convey. I know how easy it is to introduce bias into data analysis, and how easily data analysis can be applied to unethical questions, or used to take unethical actions.

As the white paper aptly points out, the value of data is not merely in having it, but in how you use it to create positive outcomes. You need to be sure you’re using data safely and intelligently, because with great access to data comes great responsibility. 

Go forth and use the data-to-everything platform to turn data into doing…the right thing. 

Disclosure: I work for Splunk. Thanks to my colleagues Chris Gales, Erica Chen, and Richard Brewer-Hay for the feedback on drafts of this post. While colleagues reviewed this post and provided feedback, the content is my own and represents my own views rather than those of Splunk the company. 

Streaming, the cloud, and music interactions: are libraries a thing of the past?

Several years ago I wrote about fragmented music libraries and music discovery. In light of the overwhelming popularity of Spotify and the dominance of streaming music (Spotify, Apple Music, Amazon Music, Tidal, and others), I’m curious if music libraries even exist anymore. Or, if they exist today, will they continue to exist? 

My guess is that the only people still maintaining music libraries are DJs, fervent music fans (like myself), or people who aren’t using streaming music at all (due to age, lack of interest, or lack of availability due to markets or internet speeds). 

I was chatting with a friend of mine who has a collection of vinyl records, but she only ever listens to vinyl if she’s relaxing on the weekend. Oftentimes she’s just asking Alexa to play some music, without much attention to where that music is coming from. With Amazon Music bundled into Amazon Prime for many members, people can be totally unaware that they’re using a streaming service at all. I’d hazard that this interaction pattern is true for most people, especially those who never enjoyed maintaining a music library but instead collected CDs and records because that was the only way to listen to music at all. 

Even my own habits are changing, perhaps equally due to time constraints as due to current music technology services. I used to carefully curate playlists for sharing with others, listening in the car, mix CDs, and for radio shows. These days I make playlists for many of those same purposes on Spotify, but the songs in my “actual” music library (iTunes) aren’t categorized into playlists at all anymore, and I give the playlists I make on my iPhone random names like “Aaa yay” to make the playlists easier to find, rather than to describe the contents. 

I’m limited by storage size in terms of what I can add to my iPhone, just like I was with my iPod, and that limitation shapes my experience of the music. Since I’m limited to a smaller catalogue, I’m able to sit with the music more and create more distinct memories. There are still songs that remind me of being in Berlin in 2011, limited to the songs that I added to my iPod before I left the United States because the internet I had access to in Germany was too slow to download new music and add it to my iPod. 

Nowadays, I am less motivated to carefully manage my iTunes library because it’s only on one device, whereas I can access my Spotify library across multiple devices. Spotify is the library I find myself carefully creating folders of playlists for, organizing and sorting tracks and playlists. A primary reason for Spotify’s success with my listening habits is its social and collaborative nature. It’s easy to share tracks with others, make a playlist of a DJ set I attended to share with others, contribute to a weekly collaborative playlist with a community of fellow music-lovers, or follow playlists created by artists and DJs I love. My local library can give me a lot, but it can’t give me that community interaction.

Indeed, in 2015 that’s something I identified as lacking. I felt that it was harder to feel part of a music culture, writing:

“It’s harder than it used to be to feel connected with music. It’s not a stream or a subculture one is tapped into anymore, because it’s so distributed on the web. There’s so much music, and it lives in so many different services, that the music culture has imploded a bit.”

I feel completely differently these days, thanks to a vibrant live music community in San Francisco. I loathe Facebook, but the groups I’m a part of on that site help me feel connected to a greater music scene and community, supplementing my connection to music and music discovery. Ironically, Facebook groups have also helped my music culture experience become more local. The music blogs that I used to tap into are now largely defunct, or have multiple functions (The Burning Ear also running Vinyl Me, Please, or All Things Go also providing news and an annual festival in DC). Instead, yet another way I discover new music is by paying attention to the artists and DJs that people in these Facebook groups are talking about and posting tracks and albums from. 

Despite the challenges of a local music library, I keep buying digital music partially because I made a promise to myself when I was younger that I’d do so when I could afford to, partially to support musicians and producers, and partially because I distrust that streaming services will stick around with all the music I might want to listen to. I’d rather “own” it, at least as best as I can when it’s a digital file that risks deletion and decomposition over time. 

Music discovery in the past was equal parts discovery and collection, with a hefty dose of listening after I collected new music.

A flowchart showing Discover -> Collect -> Listen in a triangle, with Listen connecting back to Discover.

I’d do the following when discovering new music:

  • Writing down song lyrics while listening to the radio or while working my retail job, then later looking up the tracks to check out albums from the library to rip to my family computer.
  • Following music blogs like The Burning Ear, All Things Go, Earmilk, Stereogum, Line of Best Fit, then downloading what I liked best from their site from MediaFire or MegaUpload to save to my own library.
  • Trolling through illicit LiveJournal communities or invite-only torrent sites to download discographies for artists I already liked, or might like.

Over time, those music blogs shifted to using SoundCloud, the online communities and torrent sites shuttered, and I started listening to more music on streaming sites instead. The loop stopped going from discovery to collection, and instead went from discovery to like and back to discovery again. 

Find a new track, listen, click the heart or the plus sign, and move on. Rarely do you remember to go back and listen to your fully-compiled list of saved tracks (or even if you do, trying to listen to the whole thing on shuffle will be limited by the web app, thanks SoundCloud). 

A flowchart showing a cycle from discover to like and back again using arrows.

This type of cycle is faster than the old cycle, and more focused on engagement with the service (rather than the music), less on collecting and more on consuming. In some ways, downloading music was like this too. When I accidentally deleted my entire music library in 2012, the tatters of my library that I was able to recover from my iPod were a scant representation of my full collection, but included in that library were discographies that I would likely never listen to. Now that it’s been years, there have been a few occasions where I go back and discover that an artist I listen to now is in that graveyard of deleted songs, but even knowing that, I’m not sure I would’ve gotten to it any sooner. I was always collecting more than I was listening to. 

Streaming music lets me collect in the same way, but without the personal risk. It just makes me dependent on a third-party entity that permits me to access the tracks that they store for me. I end up with lists of liked tracks across multiple different services, none of which I fully control. These days my music discovery is largely driven by 3 services: Spotify, Shazam, and SoundCloud. Spotify pushes algorithmic recommendations to me, Shazam enables me to discover what track the DJ is currently playing when I’m out at a DJ set, and SoundCloud lets me listen to recorded DJ sets and has excellent autoplay recommendations. In all of them I have lists of tracks that I may never revisit after saving them. Some of them I’ll never be able to revisit, because they’ve been deleted or the service has lost the rights to the track. 

In 2015 I lamented the fragmentation of music discovery, but looking back, my music discovery was always shared across services, devices, and methods—the central iTunes library was what tied the radio songs, the library CDs, the discography downloads, and the music blog tracks together. The real issue is that the primary music discovery modes of today are service-dependent, and each of those services provides their own constructs of a music library. I mentioned in 2015 that:

“my library is all over the place. iTunes is still the main home of my music—I can afford to buy new music when I want —but I frequent Spotify and SoundCloud to check out new music. I sync my iTunes library to Google Play Music too, so I can listen to it at work.” 

While this is still largely true, I largely consume Spotify when I’m at work, listen to SoundCloud sets or tracks from iTunes when I’m on-the-go with my phone, and listen to Spotify or iTunes when I’m on my personal laptop. That’s essentially 2.5 places that I keep a music library, and while I maintain a purchase pipeline of tracks from Spotify and SoundCloud into my iTunes library, it’s a fraction of my discoveries that make it into my collection for the long term. The days of a true central collection of my library are long since past. 

It seems a feat, with all these digital cloud music services streaming music into our ears, to maintain a local music library. Indeed, what’s the point of holding onto your local files when they become so difficult to access? iTunes is becoming the Apple Music app, with the Apple Music streaming service front and center. Spotify is, well, Spotify. And SoundCloud continues to flounder yet provides an essential service of underground music and DJ sets. Google Play Music exists, but offers only a web-based player (no desktop client) for accessing and listening to your local library after you’ve mirrored it to the cloud. Streaming is convenient. But streaming music lets others own your content for you, granting you subscription access to it at best, ruining the quality of your music listening experience at worst. 

A recent essay by Dave Holmes in Esquire talks about “The Deleted Years”, or the years that we stored music on iPods, but since Spotify and other streaming services, have largely moved on from. As he puts it, 

“From 2003 to 2012, music was disposable and nothing survived.”

Perhaps it’s more true that from 2012 onward, music is omnipresent and yet more disposable. It can disappear into the void of a streaming service, and we’ll never even know we saved it. At least an abandoned iPod gives us a tangible record of our past habits. 

As Vicki Boykis wrote about SoundCloud in 2017:

“I’m worried that, for internet music culture, what’s coming is the loss of a place that offered innumerable avenues for creativity, for enjoyment, for discovery of music that couldn’t and wouldn’t be created anywhere else. And, like everyone who has ever invested enough emotion in an online space long enough to make it their own, I’m wondering what’s next.”

I’ll be here, discovering, collecting, liking, and listening for what’s next.

Music streaming and sovereignty

As the music industry moves away from downloads and toward building streaming platforms, international sovereignty becomes more of a barrier to people listening to music and discussing it with others, because they don’t have access to the same music on the same platforms. As Sean Michaels pointed out in The Morning News several years ago:

“one of the undocumented glitches in the current internet is all its asymmetrical licensing rules. I can’t use Spotify in Canada (yet). Whenever I’m able to, there’s no guarantee that Spotify Canada’s music library will match Spotify America’s. Just as Netflix Canada is different than Netflix US, and YouTube won’t let me see Jon Stewart. As we move away from downloads and toward streaming, international sovereignty is going to become more and more of a barrier to common discussions of music.”

Location has always been a challenge to music access, but it’s important to keep in mind that the internet and music streaming have not been an equitable boon to music access: access is still controlled.

Defining my career values

If you’re thinking about changing careers, or want guidance in determining whether your career is right for you, I hope this post can help you! It’s all about how I defined my career values and reframed how I thought about my career and my future.

Why I needed to define my career values

A couple of years ago I was comfortable in my position at work. After four years in my career, surrounded by talk of the importance of having a growth mindset, I thought maybe I was too comfortable. As a technical writer, I was contributing to product management conversations and thinking intensely about customer needs, and realized I wanted to be even more involved in what we were choosing to build. I took a training course, and found a job on the product management team in my company that appealed to my interests. 11 months later, I went back to documentation, after realizing that that path better suited my career values. Throughout those 11 months and in the time since, I’ve worked to determine what I really want to get out of my career, and to make sure that what I am doing fits those values.

Ask myself some questions

I started by asking myself some questions, common ones that people recommend when you’re thinking about making a job change. I found that I was better able to answer these questions after I’d already made a job change, likely because I didn’t have that much work experience before making the career change. I asked myself the following questions:

  • What makes me excited to go into work?
  • What makes me dread going into work?
  • What helps me feel validated or appreciated at work?
    • Working with others? Contributing in meetings? Reporting project status on a regular basis?
  • What is my working style when working with others?
    • Do I prefer collaborative, consultative, or independent work?
  • What do I like producing when I’m at work?
    • Ideas, or tangible things? Concrete concepts or future-oriented concepts?
  • Do I prefer hands-on management, consultative management, or completely hands-off management?

After changing roles, I realized that many parts of my technical writing position, and the way that my team and my duties were structured, were very well suited to my working styles. However, since I hadn’t had much experience with other types of work, I hadn’t identified them as vital to my work. Switching positions forced me to reexamine what parts of a role were vital to my happiness at work, and in what way.

Find strategies that have worked for others

I found several strategies that worked for others by listening to some You 2.0 episodes from the Hidden Brain podcast.

The You 2.0: Dream Jobs and You 2.0: How to Build a Better Job episodes helped me realize there were options to transform a job I was already in by finding more enjoyable aspects within it. The dream jobs episode helped me consider whether I was looking for too much meaning and validation within my job, and whether I needed to separate those pursuits more. The how to build a better job episode helped me consider what I could shift within my day-to-day job in terms of focus or duties so that I could enjoy it more.

I also looked beyond work-specific strategies. The episode You 2.0: How Silicon Valley Can Help You Get Unstuck taught me perhaps my favorite tidbit: apply iterative methodologies to your life. Yes, it’s kitschy, but one example from the episode resonated deeply with me: the notion of creating multiple five-year plans. Whenever I’d previously considered how my future might look, it was easy to get stressed about the fact that I have one future available to me and ~ people ~ expect me to have a plan for it. But this philosophy helped me realize that I can have multiple plans for it, and test them out.

I put together several five-year plans to speculate about where I might spend my time and how my life might look if I kept pursuing product management roles, how it might look if I spent those years in tech writing, and what it might look like if I took a different role entirely, moved cities, or even moved countries. This helped me consider what types of futures excited me, and position my work priorities alongside my overall life priorities. They aren’t separate, and I wanted to be sure that I didn’t consider them separately. I also realized that this exercise wasn’t about making these plans and then choosing one of them, but rather choosing the elements of each plan that made sense to me and got me excited about the future. I plan to revisit this exercise and continue to evaluate the spectrum of futures available to me.

The episode You 2.0: Decide Already! interviewed Dan Gilbert, the author of the excellent book Stumbling on Happiness. Both the episode and the book helped me see that being anxious about the future, planning for it, and attempting to reduce uncertainty about it weren’t necessarily making me feel better about it—and might actually be making me feel worse. So despite creating some five-year plans, I need to allow room and flexibility in those plans, and welcome uncertainty in my work and life.

Take a values-centered approach

In my personal life I was working on developing and defining my personal values, using a card-sorting exercise similar to this one from the Urban Indian Health Institute. Defining my personal values, and understanding them as a way to assess whether or not my goals and day-to-day tasks were fulfilling or not, turned out to be vital. I attempted to apply a similar framework to my work goals and fulfillment as well, and identify one or more overarching themes that I could associate with my career.

Putting all the strategies together to define career values

After assessing the structure of work that I thrive and find validation in, I was better able to understand what I found fulfilling about a career, and what I could look for in future roles to find fulfillment and the right kind of comfort. A work environment with clear expectations and measurable, tangible results was vital. A team that I could collaborate with and draw support from, while also working semi-independently, was also important to me.

After creating multiple five-year plans, I was able to realize that a career path more similar to the one I had as a technical writer was more valuable to me than one that was closer to product management, where I’d be busier and spending more time and stress on work than on my personal life. In addition, by engaging with the technical writer community, I realized that the futures available to me with a technical writing career were broader, more varied, and more flexible than I’d previously realized. I didn’t need the power and recognition within a company that a product management position might offer me, because that power and recognition would also come with added responsibilities, time commitments, and stressful challenges.

I attempted to reverse-engineer my career values based on these experiences and my personal values exercise. I ultimately centered on a core career value of “Information Conveyance”. What this means to me is that if I spend my time at work learning and sharing information with others, I will likely feel fulfilled and be excited to go to work. Defining this as a career value allowed me to move past specific roles and titles, because multiple career paths can help me support this value. Right now I love technical writing, but other functions like communication strategy, developer advocacy, community management, and instructional design also align with this value and are available to me as potential career paths.

Detailed data types you can use for documentation prioritization

Data analysis is a valuable way to learn more about what documentation tasks to prioritize above others. My post (and talk) Just Add Data, presented at Write the Docs Portland in 2019, talks about this broadly. In this post I want to cover in detail a number of different data types that can lead to valuable insights for prioritization.

This list of data types is long, but I promise each one contains value for a technical writer. These types of data might come from your own collection, a user research organization, the business development department, marketing organization, or product management organization:

  • User research reports
  • Support cases
  • Forum threads and questions
  • Product usage metrics
  • Search strings
  • Tags on bugs or issues
  • Education/training course content and questions
  • Customer satisfaction surveys

More documentation-specific data types:

  • Documentation feedback
  • Site metrics
  • Text analysis metrics
  • Download/last accessed numbers
  • Topic type metrics
  • Topic metadata
  • Contribution data
  • Social media analytics

Many of these data types are best used in combination with others.

User research reports

User research reports can contain a lot of valuable data that you can use for documentation, such as:

  • Types of customers being interviewed
  • Customer use cases and problems
  • Types of studies being performed

This can give you insight into both what the company finds valuable to study (so some insight into internal priorities) but also direct customer feedback about things that are confusing or the ways that they use the product. The types of customers that are interviewed can provide valuable audience or persona-targeting information, allowing you to better calibrate the information in your documentation. See How to use data in user research when you have no web analytics on the Gov.UK site for more details about what you can do with user research data.

Support cases

Support cases can help you better understand customer problems. Specific metrics include:

  • Number of cases
  • Frequency of cases
  • Categories of questions
  • Customer environments and licenses

With these you can compile metrics about specific customer problems, the frequency of problems, and the types of customers and customer environments that are encountering specific problems, allowing you to better understand target customers, or customers that might be using your documentation more than others. Support cases are also rich data for common customer problems, providing a good way to gather new use cases and subjects for topics. 

Forum threads and questions

These can be internal forums (like Splunk Answers for Splunk) or external ones, like Reddit or Stack Overflow. Useful data points include:

  • Common questions
  • Common categories
  • Frequently unanswered questions
  • Post titles

If you’re trying to understand what people are struggling with, or get a better sense of how people are using specific functionality, forum threads can help you understand. The types of questions that people ask and how they phrase them can also help make it clear what kinds of configuration combinations might make specific functions harder for customers. Based on the question types and frequencies that you see, you might be able to fine-tune existing documentation to make it more user-centric and easily findable, or supplement content with additional specific examples. 

Product usage metrics

Some examples of product usage metrics are as follows:

  • Time in product
  • Intra-product clicks
  • Types of data ingested
  • Types of content created
  • Amount of content created

Even if you don’t have specific usage data introspecting the product, you can gather metrics about how people are interacting with the purchase and activation process, and extrapolate accordingly.

  • Number of downloads and installs
  • License activations and types
  • Daily and monthly active users

You can use this type of data to better understand how people are spending their time in your product, and what features or functionality they’re using. Even if a customer has purchased or installed the product, it’s even more valuable to find out if they’re actually using it, and if so, how.

If your product is only in beta, and you want more data to help you prioritize an overall documentation backlog, such as topics that are tied to a specific release, you can use some product usage data to understand where people are spending more of their time, and draw conclusions about what to prioritize based on that.

Maybe the under-utilized features could use more documentation, or more targeted documentation. Maybe the features themselves need work.  Be careful not to draw overly-simplistic conclusions about the data that you see from product usage metrics. Keep context in mind at all times. 

Search strings

You can gather search strings from HTTP referer data from web searches performed on external search sites such as Google or DuckDuckGo, or from internal search services. It’s pretty unlikely that you’ll be able to gather search strings from external sites given the widespread implementation of HTTPS, but internal search services can be vital and valuable data sources for this.

Look at specific search strings to find out what people are looking for, and what people are searching that’s landing them on specific documentation pages. Maybe they’re searching for something and landing on the wrong page, and you can update your topic titles to help.
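As a sketch of what this analysis can look like, the following Python snippet counts the most common queries in a CSV export from an internal search service. The file layout and the `query` column name are assumptions for illustration, not a real schema:

```python
import csv
import io
from collections import Counter

# Hypothetical export from an internal search service: one row per search,
# with a "query" column. The column name and contents are invented examples.
sample_export = """query
install on linux
upgrade license
install on linux
search syntax
install on linux
upgrade license
"""

def top_search_strings(csv_text, n=3):
    """Count the most common search strings in a CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row["query"].strip().lower() for row in reader)
    return counts.most_common(n)

top_queries = top_search_strings(sample_export)
```

Even a simple frequency count like this can surface the handful of queries that dominate your internal search traffic, which tells you what readers are trying to find.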

JIRA or issue data

You can use metrics from your issue tracking services to better understand product quality, as well as customer confusion.

  • Number of issues/bugs
  • Categories/tags/components of issues/bugs
  • Frequency of different types of issues being created/closed

Issue tags or bug components can help you identify categories of the product where there are lots of problems or perhaps customer confusion. This is especially useful data if you’re an open source product and want to get a good understanding of where there are issues that might need more decision support or guidance in the documentation. 

Training courses

If you have an education department, or produce training courses about your product, these are quite useful to gather data from. Some examples of data you might find useful:

  • Questions asked by customers
  • Questions asked by course developers
  • Use cases covered by content in courses
  • Enrollment in courses
  • Categories of courses offered

It’s also useful to correlate this data with other sources to help identify verticals of customers interested in different topics. Because education and training courses cover more hands-on material, they can be an excellent source of use case examples, as well as occasions where decision support and guidance is needed. 

Customer surveys

Customer surveys include satisfaction surveys and sentiment analysis surveys. By reviewing the qualitative statements and the types of questions asked in the surveys, you can gain valuable insights and answer questions like:

  • What do people think about the product?
  • What do people want more help with?
  • How do people think about the product?
  • How do people feel about the product?
  • What does the company want to know from customers? 
  • What are the company priorities?

This can also help you think about how the documentation you write has a real effect on peoples’ interactions with the product, and can shift sentiment in one way or another.

Documentation feedback

Direct feedback on your documentation is a vital source of data if you can get it. 

  • Qualitative comments about the documentation
  • Usefulness votes (yes/no)
  • Ratings

Even if you don’t have a direct feedback mechanism on your website, you can collect documentation feedback from internal and external customers by paying attention in conversations with people and even asking them directly if they have any documentation feedback. Qualitative comments and direct feedback can be vital for making improvements to specific areas. 

Site metrics

If your documentation is on a website, you can use web access logs to gather important site metrics, such as the following:

  • Page views
  • Session data like time on page
  • Referer data
  • Link clicks
  • Button clicks
  • Bounce rate
  • Client IP

Site metrics like page views, session data, referer data, and link clicks can help you understand where people are coming to your docs from, how long they are staying on the page, how many readers there are, and where they’re going after they get to a topic. You can also use this data to understand better how people interact with your documentation. Are readers using a version switcher on your page? Are they expanding or collapsing information sections on the page to learn more? Maybe readers are using a table of contents to skip to specific parts of specific topics.  

You can split this data by IP address to understand groups of topics that specific users are clustering around, to better understand how people use the documentation.
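A minimal sketch of that per-IP split, assuming you’ve already parsed (client IP, topic) pairs out of your access logs (the records here are invented examples):

```python
from collections import defaultdict

# Invented access-log records: (client_ip, topic) pairs. In practice you'd
# parse these out of your web access logs.
page_views = [
    ("203.0.113.5", "install-overview"),
    ("203.0.113.5", "install-linux"),
    ("203.0.113.5", "install-windows"),
    ("198.51.100.9", "search-tutorial"),
    ("198.51.100.9", "search-reference"),
]

def topics_by_client(views):
    """Group the topics each client IP visited, to spot topic clusters."""
    clusters = defaultdict(set)
    for ip, topic in views:
        clusters[ip].add(topic)
    return clusters

clusters = topics_by_client(page_views)
```

In this toy data, one reader clusters around installation topics while another clusters around search topics, which is exactly the kind of grouping that hints at distinct reader journeys.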

Text analysis metrics

Data about the actual text on your documentation site is also useful to help understand the complexity of the documentation on your site.

  • Flesch-Kincaid readability score
  • Inclusivity level
  • Length of sentences and headers
  • Style linter

You can assess the readability or usability of the documentation, or even the grade level score for the content to understand how consistent your documentation is. Identify the length of sentences and headers to see if they match best practices in the industry for writing on the web. You can even scan content against a style linter to identify inconsistencies of documentation topics against a style guide.
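The Flesch-Kincaid grade level mentioned above is a standard formula: 0.39 × (words ÷ sentences) + 11.8 × (syllables ÷ words) − 15.59. Here’s a rough Python sketch; the vowel-group syllable counter is a simplifying assumption, so treat the output as a relative signal across your own topics rather than an exact grade:

```python
import re

def fk_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59.
    Syllables are estimated by counting vowel groups (minimum 1 per word),
    which is a rough heuristic rather than a dictionary lookup."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

simple = fk_grade("The cat sat on the mat.")
complex_ = fk_grade(
    "Comprehensive documentation necessitates considerable organizational infrastructure."
)
```

Comparing scores across topics, rather than reading any single score as gospel, is where this kind of metric earns its keep.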

Download metrics

If you don’t have site metrics for your documentation site, because the documentation is published only via PDF or another medium, you can still use metrics from that. 

  • Download numbers 
  • Download dates and times
  • Download categories and types

You can use these metrics to gather interest about what people want to be reading offline, or how frequently people are accessing your documentation. You can also correlate this data with product usage data and release cycles to determine how frequently people access the documentation compared with release dates, and the number of people accessing the documentation compared with the number of people using a product or service.

Topic type metrics

If you use strict topic typing at your documentation organization, you can use topic type metrics as an additional metadata layer for documentation data analysis. Even if you don’t, you can manually categorize your documentation by type to gather this data.

  • What are the topic types?
  • How many topic types are there?
  • How many topics are there of each type?

Understanding topic types can help you understand how reader interaction patterns can vary for your documentation by type, or whether your developer documentation has predominantly different types of documentation compared with your user documentation, and better understand what types of documentation are written for which audiences.

Topic metadata

Metadata about documentation topics is also incredibly valuable as a correlation data source. You can correlate topic metadata like the following information:

  • Topic titles
  • Average topic length
  • Last updated and creation dates
  • Versions that different topics apply to

You can correlate it with site metrics, to see if longer topics are viewed less-frequently than shorter topics, or identify outliers in those data points. You can also manually analyze the topic titles to identify if there are patterns (good or bad) that exist.
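As a sketch of that length-versus-views correlation, here’s a hand-rolled Pearson coefficient over some invented topic records (the titles and numbers are illustrative assumptions, not real metrics):

```python
# Invented topic metadata joined with page views. In practice you'd pull word
# counts from your authoring tool and views from your site metrics.
topics = [
    {"title": "quick-start", "words": 400, "views": 900},
    {"title": "install-guide", "words": 1200, "views": 700},
    {"title": "full-reference", "words": 5000, "views": 150},
    {"title": "faq", "words": 600, "views": 800},
]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand to avoid dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson([t["words"] for t in topics], [t["views"] for t in topics])
# A strongly negative r suggests longer topics are viewed less often -- a
# prompt for a closer look at outliers, not a verdict on its own.
```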

Contribution data

If you have information about who is writing documentation, and when, you can use these types of data:

  • Last updated dates
  • Authors/contributors
  • Amount of information added or removed

Contribution data can tell you how frequently specific topics were updated to add new information, and by whom, and how much information was added or removed. You can identify frequency patterns, clusters over time, as well as consistent contributors.

It’s useful to split this data by other features, or correlate it with other metrics, especially site metrics. You can then identify things like:

  • Last updated dates by topic
  • Last updated dates by product
  • Last updated dates over time

to see if there are correlations between updates and page views. Perhaps more frequently updated content is viewed more often.
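A small sketch of splitting contribution data over time: bucketing last-updated dates by month to look for frequency patterns or clusters. The records are invented; in practice you’d export them from your wiki history or version control log:

```python
from collections import Counter
from datetime import date

# Invented contribution records: (topic, last_updated) pairs.
updates = [
    ("install-guide", date(2019, 1, 15)),
    ("install-guide", date(2019, 1, 28)),
    ("search-tutorial", date(2019, 3, 2)),
    ("install-guide", date(2019, 3, 9)),
]

def updates_per_month(records):
    """Bucket update events by (year, month) to see frequency patterns."""
    return Counter((d.year, d.month) for _, d in records)

by_month = updates_per_month(updates)
```

From here you could join the monthly buckets against page views for the same topics and months to check whether update frequency and readership move together.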

Social media analytics

  • Social media referers
  • Link clicks from social media sites

If you publicize your documentation using social media, you can track the interest in the documentation from those sites. You can look at social media referers to see whether people are getting to your documentation that way. Maybe your support team is responding to people on Twitter with links to your documentation, and you want to better understand how frequently that happens and how frequently people click through those links to the documentation.

You can also identify whether or not, and how, people are sharing your documentation on social media by using data crawled or retrieved from those sites’ APIs, and looking for instances of links to your documentation. This can help you get a better sense of how people are using your documentation, how they’re talking about it, how they feel about it, and whether or not you have an organic community out there on the web sharing your documentation. 

Beyond documentation data

I hope that this detail has given you a better understanding of different types of data, beyond documentation data, that are available to you as a technical writer to draw valuable conclusions from. By analyzing these types of data, you are not only prepared to prioritize your documentation task list, but also better able to understand the customers of your product and documentation. Even if only some of these are available to you, I hope they are useful. Be sure to read Just Add Data: Using data to prioritize your documentation for the full explanation of how to use data in this way. 

Just Add Data: Using data to prioritize your documentation

This is a blog post adaptation of a talk I gave at Write the Docs Portland on May 21, 2019. The talk was livestreamed and recorded, and you can view the recording on YouTube: Just Add Data: Make it easier to prioritize your documentation – Sarah Moir

Prioritizing documentation is hard. How do you decide what to work on if there isn’t a deadline looming? How do you decide what not to work on when your list of work just keeps growing? How do you identify what new content you might want to add to your documentation?

By adding data to the process, it’s possible to prioritize your documentation tasks with confidence!

Prioritizing without data

Prioritizing a backlog without data can involve asking yourself some questions, like what will take the least amount of time? Or, what did someone most recently request? If I’m doing this, I might ask my product manager what to work on, or do whatever task seems easiest at the time. I might even focus on whichever task I can complete without talking to other people, because I’m tired. 

Based on the answers to those questions, I’ll end up with a prioritized backlog, but lack confidence that what I’ve chosen to work on will actually bring the most value to customers and the documentation. Especially if I’m choosing not to do work, it can be a challenge to keep ignoring an item in the backlog because it doesn’t fit with what I think I need to be working on, especially without some sort of “proof” that it’s okay to ignore. To make this process easier, I add data.

Why prioritize with data?

Using data to prioritize a documentation backlog can help give you more confidence in your decisions and help you justify why you’re not working on something. It can challenge your assumptions about what you should be working on, or validate them. Adding data can help improve your overall understanding of how customers are using your product and the documentation, leading to benefits beyond the backlog.

Data types for prioritization

What kinds of data am I talking about? All kinds of data! If you skim the following list, you’ll notice that this data goes beyond quantitative sources. When I talk about data, I’m including all kinds of information: qualitative comments, usage metrics, metadata, website access logs, survey results, database records, all of these and more fit in with my definition of data. Here’s the full list:

  • User research reports
  • Support cases
  • Forum threads and questions
  • Product usage metrics
  • Search strings
  • Tags on bugs or issues
  • Education/training course content and questions
  • Customer satisfaction surveys
  • Documentation feedback
  • Site metrics
  • Text analysis metrics
  • Download/last accessed numbers
  • Topic type metrics
  • Topic metadata
  • Contribution data
  • Social media analytics

Some of these data types are more relevant to different types of organizations and documentation installations. For example, open source projects might have more useful issue tags, or organizations that use DITA will have easier access to topic type information.

This list of data types demonstrates the different types of information that can help you prioritize documentation, but I don’t want you to think that you need large-scale collection or implementation projects to get valuable data worth incorporating into your prioritization process.

I’ll cover a couple of these data types in more detail here, but I talk about all of them in another post: Detailed data types you can use for documentation prioritization.

Product usage data

You can use usage data for products (also called telemetry) to find out where people are spending their time. What features or functionality are they using? Even if they’ve purchased or installed the product, are they actually using it?

Some examples of product usage data include:

  • Time in product
  • Intra-product clicks
  • Types of data ingested
  • Types of content created (e.g., dashboards, playlists)
  • Amount of content created (e.g., dashboards, playlists)

In addition to data about how people are interacting with the product, you can also gather product usage data without actual introspection into how people are using it. If you have information about how many people have downloaded a product or are logging in to a service:

  • Number of downloads and installs
  • License activations and types
  • Daily and monthly active users

I mostly talk about using data to help you prioritize the more ambiguous parts of a backlog that might not be tied to a release, but especially with the help of product usage data, you can better prioritize release-focused documentation as well. If your product is in beta, and you want more data to help you prioritize your overall documentation backlog, you can use some product usage data to understand where people are spending more of their time, and draw conclusions about what to spend more time on or less time on, or what level of detail to include in the documentation, to achieve your overall documentation goals for the release. 

Site metrics

Site metrics like page views, session data, HTTP referer data, and link clicks can help you understand where people are coming to your docs from, how long they’re staying on the page, how many readers there are, and what they’re doing after they get to a topic. Here are some example site metrics:

  • Page views
  • Session data like time on page
  • Referer data
  • Link clicks
  • Button clicks
  • Bounce rate
  • Client IP

You can also use this data to understand better how people interact with your documentation, like whether they’re using a version switcher on your page or expanding/collapsing more information hidden on the page. 

You can also split this data by IP address to understand groups of topics that specific users are clustering around, to better understand how people use the documentation.

Identify questions based on your backlog

The process of adding data to your documentation prioritization strategy is all about making do with what you have to answer what you want to know. What you want to know depends on your backlog.

Data analysis is focused on a goal. You don’t want to collect a lot of data and then just stare at it, or get stressed out by the number of “insights” you could be gathering while not really being sure what to do with the information. If you consider questions that you want to answer in advance, you can focus your data collection and analysis in a more valuable way. 

Some example questions that you might identify based on your task list:

  • What are people looking for? Are they finding what they’re looking for?
  • Are people looking for information about <thing I’ve been told to document>?
  • What do people want more help with?
  • Which people are we targeting who don’t see their use cases represented?

Tie questions to data types

After you’ve identified questions relevant to your task list, you can tie those questions to data types that can help you answer the questions.

For example, the question: What are people looking for and not finding?

To answer this, you can look where people are looking for information, namely search keywords that they’re typing into search engines, common questions being posted on forums, or the topics of support cases filed by customers.

For example, I looked at some data and identified specific search terms people were using on the documentation site that routed customers to a company-managed forum site. I can then use that data to identify cases where people are looking for documentation about something, but are not finding the answers in the documentation.

Another example question: What do people want more help with? 

This could be answered by looking at the topics of support cases again, but also the types of questions being asked in training courses, as well as unanswered questions on forums. 

As a final example: What market groups are we targeting that don’t see their use cases represented?

To answer this, you could look at data about sales leads, questions being asked by the field that contain specific use cases for various market verticals, as well as questions being asked in training courses.

Find questions from data

If you don’t have much of a task list to work with, or if you aren’t able to get access to data that can help you answer your questions, you can still make use of the data that is available to you and draw valuable insights from it.

You can identify interest in content that you maybe weren’t aware of, and make plans to write more to address that interest, or modify existing content to address that interest. Maybe there are a bunch of forum threads about how to do something, but nothing authoritative in the documentation. That information hasn’t made it to the docs writers in any way, but because you’re looking at the available data, you’re able to see that it’s important.

Even if you have no data specifically relevant to the documentation or customer questions, you can still find ways to identify documentation work to add to a task list. You could create datasets by performing text analysis on all or specific documentation topics, and identify complexity issues, or topics that don’t adhere to a style guide. You could use customer satisfaction surveys to identify places where documentation architecture or linking strategies could be improved.

Working with the data

Now you hopefully have a better understanding of different types of data available to you, and how you can identify valuable data sources based on your questions that you want to answer. But how much data do you need to collect? And how do you get the data? Most importantly, how do you analyze it to answer the questions you want to answer?

How much data?

How much data do you need to collect? You don’t need to collect data forever. You don’t need ALL the data. You just need enough data to point you in a direction and reduce uncertainty.

You can use a small sample of users, or a small sample of time, so long as it helps you answer your question and reduce uncertainty about what the answer could be. Collecting larger amounts of data doesn’t mean that you reduce uncertainty by an equally large degree. The amount of data you collect doesn’t correlate directly to what you’re able to learn from it. However, if the question you’re trying to answer with data concerns all the documentation users over a long period of time, you will be collecting more data than if you just want to know what a specific subset of readers found interesting on a Friday afternoon.

Try for representative samples that are relevant for the questions you’re trying to answer. If you can’t get representative data, try for a random sample. If you can’t get representative or random samples, acknowledge the bias that is inherent in the data you’re using. Add context to the data wherever possible, especially about who the data represents and why the data is still valuable if it isn’t representative.
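If you’re working in Python, taking that random sample can be a single call; this sketch assumes your records are already loaded into a list:

```python
import random

# Invented stand-in for thousands of access-log records read from a file.
records = [f"view:topic-{i}" for i in range(10_000)]

# A fixed seed keeps the sample reproducible, which helps when you want to
# re-run or share the analysis later.
random.seed(42)
sample = random.sample(records, k=200)  # ~2% random sample, without replacement
```

Because `random.sample` draws without replacement, no record appears twice, and the sample size stays fixed regardless of how large the full dataset is.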

You might find that collecting a small amount of data leaves you with more questions than answers, and that’s okay too. It’s an opportunity to continue exploring and learning more about your customers and your documentation tasks. But how do you even get any data at all?

How do you get the data?

You’ll either be collecting your own data, or asking others for the data you need.

If it’s data about the documentation site or its content, you might own that data yourself, and already have access to it. If it’s other types of data, like sales leads or user research data, it’s time to talk to the departments or people that manage those areas.

  • A business development department might have reporting on internal tools like sales leads or support cases.
  • Product managers can share direct customer data and product usage data if you don’t have direct access.
  • Project managers can share data related to internal development processes.

The teams managing different datasets will vary at your organization, and might even be you in many cases. They may be reluctant to share data. With that in mind, remember that when you collect data, you don’t need to get persistent access to all the data you want. Focus on getting some access to some data that is useful to answer your questions. After that, you can use that data to make your work more efficient and informed, and then hopefully communicate that value and get more access to data in the future if you want.

What to use for data analysis?

What do you use to analyze that data after you get it? How do you transform data into a report of useful information?

Some tools might already have analytics and reporting built in, like Google Analytics. That can certainly make it easier to analyze the data!

For other types of data that you need to analyze yourself, use the tools available to you. Think about what you already know how to use, or have access to:

  • Know how to use Excel? Perfect! Get started collecting and processing data in spreadsheets and with macros.
  • Know how to write scripts in R/Python to analyze data? Great! You can write scripts to collect, process, and visualize this data.
  • Is your organization using a tool like Splunk, Elasticsearch, Tableau, etc.? Good news! You are really ready for data analysis.

You don’t have to spend a long time learning a new tool to analyze data for these purposes. If you continue incorporating data analysis into your work, it might make sense, but it isn’t necessary to get started.

Tools aren’t magic

It’s also important to note that tools aren’t magic. Some degree of data analysis will involve manually collecting, categorizing, or cleaning the data. If your organization doesn’t have strict topic types, you might need to perform manual topic-typing. If you want to analyze some information but the data isn’t in a machine-readable format, you might have to sit at your desk copy-pasting for hours.

Depending on your skills, the current state of the data that you want to analyze, and the tools available to you, the amount of time it takes to analyze data and get results can vary widely. I have spent 3 days manually processing data in Excel, and I’ve spent 2 hours creating searches in mostly-clean datasets in Splunk to get answers to various questions. Keep that in mind when you’re analyzing data.

How to perform data analysis

When you analyze data, what are you actually looking at? 

Top, rare, outlying values

Find out which values are most common and which are least common. You can establish these by counting the instances of each value.

Look for values that differ from the others by a large margin. You can use standard deviation to find these outliers, flagging values that fall more than a couple of standard deviations from the mean.
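Both steps can be sketched with Python’s standard library. The page-view numbers here are invented; the pattern is ranking values by count, then flagging anything more than two standard deviations from the mean:

```python
from statistics import mean, stdev

# Hypothetical weekly page views per topic.
views = {
    "install": 100, "configure": 95, "upgrade": 110, "search": 105,
    "alerts": 98, "dashboards": 102, "lookups": 97, "fields": 103,
    "tags": 101, "release-notes": 3000,
}

# Most and least common values: sort by count.
ranked = sorted(views.items(), key=lambda kv: kv[1], reverse=True)
print("top:", ranked[:2])
print("rare:", ranked[-2:])

# Outliers: values more than 2 standard deviations from the mean.
mu, sigma = mean(views.values()), stdev(views.values())
outliers = [t for t, v in views.items() if abs(v - mu) > 2 * sigma]
print("outliers:", outliers)
```

Note that a single extreme value inflates the standard deviation itself, so with very small datasets a fixed multiple of sigma can miss real outliers; eyeballing the ranked list is a good sanity check.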

Patterns and clusters across data

You can also look for patterns and clusters in your data.

If you’re working with qualitative data, you might need to categorize, or code, the data so that you can sort it and look for patterns in the results. You can identify these patterns by counting instances of categories, or looking at clusters of behavior. An example of a cluster of behavior is if you look at documentation topic visits over time, and you identify a spike in visits at a particular time.

Split by different features

You also want to segment data by different features: you can better understand the most common values if you split them by other dimensions of the data. For example, you can look at the most commonly visited topics in your documentation set over the last 3 months as a single total, or you can look at those same 3 months on a week-to-week basis. That additional split helps you understand how those values change over time. If you identify a spike in a particular topic or category of topics, you can then interpret the data. Maybe a new product release led to a spike of interest in the release notes topic that wasn’t easily identified until you split the results by week. This is also a good opportunity to point out to a product team that people really do read your documentation!

That’s an example of splitting by time, but you can split by any other field available to you in your data. To use the same data type, looking at the most common topics by product, by IP address, or other factors, can help lead to valuable insights.
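A minimal sketch of splitting by time, grouping invented topic-visit events by ISO week using only the standard library:

```python
from collections import defaultdict
from datetime import date

# Hypothetical topic-visit events: (visit date, topic).
visits = [
    (date(2019, 9, 2), "install"), (date(2019, 9, 3), "install"),
    (date(2019, 9, 10), "release-notes"), (date(2019, 9, 11), "release-notes"),
    (date(2019, 9, 12), "release-notes"), (date(2019, 9, 13), "install"),
]

# Split visit counts by topic, then by ISO week number.
by_topic_week = defaultdict(lambda: defaultdict(int))
for day, topic in visits:
    week = day.isocalendar()[1]  # ISO week number
    by_topic_week[topic][week] += 1

for topic, weeks in sorted(by_topic_week.items()):
    print(topic, dict(weeks))
```

The same grouping works for any field, not just time: swap the week key for product, region, or whatever dimension your data carries.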

Combine data types

You can combine different types of data to understand, approximately, how many people are using the product versus how many of them are using the documentation. Comparing sales leads, product usage data, and existing page views could help you approximate the number of potential and existing customers, alongside the number of distinct documentation readers.

Make sure that when you combine data across datasets, you keep track of units and time ranges, and compare like data with like data. For example, be careful not to mix data that refers to potential customers with data that refers to existing customers; doing so can produce misleading results if you lose that context.
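One way to keep that context straight is to label units explicitly when you join datasets. A sketch with invented monthly numbers:

```python
# Hypothetical monthly figures from two separate datasets.
product_usage = {"2019-09": 1200, "2019-10": 1350}   # distinct active users
docs_readers = {"2019-09": 400, "2019-10": 520}      # distinct doc visitors

# Combine on a shared key (month), keeping the unit of each column explicit.
combined = {
    month: {
        "active_users": product_usage[month],
        "doc_readers": docs_readers[month],
        "reader_ratio": round(docs_readers[month] / product_usage[month], 2),
    }
    for month in product_usage.keys() & docs_readers.keys()
}
print(combined["2019-10"])
```

Joining only on keys present in both datasets (the set intersection above) also guards against comparing months where one data source is missing.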

Interpreting results

When you interpret the results of your data analysis, make sure that you are adding context to the data. Especially when dealing with outlier data, but even when reviewing data like rarely-viewed or frequently-viewed topics, keep in mind additional context that could explain results.

Add context from expertise

Use your expertise and knowledge of the documentation to add context. For example, topics concerning a specific functionality are likely to be more popular at a specific time if that functionality was recently changed.

Pursue alternate explanations

Whenever you’re interpreting data, you want to make sure that you’re gut-checking it against what you already know. So if a relatively mundane topic has wildly out-of-the-ordinary page views, there are likely alternate explanations for that interest. Maybe your topic ended up being a great resource about cron syntax in general, even for people that don’t use your product.

Draw realistic conclusions

Draw realistic conclusions based on the data available to you. You might not be able to get access to or combine specific datasets due to privacy concerns. If you carefully identify what problems you’re trying to solve, and select only the data sources that can help you solve those problems, you can reduce the potential that you’ll introduce bias into your data analysis, and improve the conclusions that you’re able to draw.

Don’t trust data blindly

Don’t trust the data blindly. When reviewing data that seems out of the ordinary or like outliers, examine the different reasons why the data could be like that. Who does the data represent? What does it represent? Make sure that you’re interpreting data in context, so that you’re able to understand exactly what it represents. It can be tempting to ignore data that doesn’t match your biases or expectations, but resist that temptation.

Above all, remember to use data to complement your research and writing, and validate or challenge assumptions about your audience.

Your turn to add data

  1. Identify the questions you’re trying to answer
  2. Use the data available to you
  3. Use the tools available to you
  4. Analyze and interpret the data
  5. Take action and prioritize accordingly

Additional resources

The Concepts Behind the Book: How to Measure Anything

I just finished reading How to Measure Anything: Finding the Value of Intangibles in Business by Douglas Hubbard. It discusses fascinating concepts about measurement and observability, but they are tendrils that you must follow among mentions of Excel, statistical formulas, and somewhat dry consulting anecdotes. For those of you that might want to focus mainly on the concepts rather than the literal statistics and formulas behind implementing his framework, I wanted to share the concepts that resonated with me. If you want to read a more thorough summary, I recommend the summary on Less Wrong, also titled How to Measure Anything.

The premise of the book is that people undertake many business decisions and large projects with the idea that the success of those decisions or projects can’t be measured, and thus they aren’t measured. It’s a large waste of money and effort if you can’t measure the success of such projects and decisions, so he developed a consulting business and a framework, Applied Information Economics (AIE), to prove that you can measure such things.

Near the end of his book on page 267, he summarizes his philosophy as six main points:

1. If it’s really that important, it’s something you can define. If it’s something you think exists at all, then it’s something that you’ve already observed somehow.

2. If it’s something important and something uncertain, then you have a cost of being wrong and a chance of being wrong.

3. You can quantify your current uncertainty with calibrated estimates.

4. You can compute the value of additional information by knowing the “threshold” of the measurement where it begins to make a difference compared to your existing uncertainty.

5. Once you know what it’s worth to measure something, you can put the measurement effort in context and decide on the effort it should take.

6. Knowing just a few methods for random sampling, controlled experiments, or even just improving on the judgment of experts can lead to a significant reduction in uncertainty.

To restate those points:

  1. Define what you want to know. Consider ways that you or others have measured similar problems. What you want to know might be easier to see than you thought.
  2. It’s valuable to measure things that you aren’t certain about if they are important to be certain about.
  3. Make estimates about what you think will happen, and calibrate those estimates to understand just how uncertain you are about outcomes.
  4. Determine a level of certainty that will help you feel more confident about a decision. Additionally, determine how much information will be needed to get you there.
  5. Determine how much effort it might take to gather that information.
  6. Understand that it probably takes less effort than you think to reduce uncertainty.

The crux of the book revolves around restating measurement from “answer a specific question” to “reduce uncertainty based on what you know today”.

Measure to reduce uncertainty

Before reading this book, I thought about data analysis as a way to find an answer to a question. I’d go in with a question, I’d find data, and thanks to that data, I’d magically know the answer. However, that approach only works with specifically-defined questions and perfect data. If I want to know “how many views did a specific documentation topic get last week” I can answer that straightforwardly with website metrics.

However, if I want to know “Was the guidance about how to perform a task more useful after I rewrote it?”, there seemed to be no way to answer that question. Or so I thought.

Hubbard’s book makes the crucial distinction that data doesn’t need to exist to directly answer that question. It merely needs to make you more certain of the likely answer. You can make a guess about whether or not it was useful, carefully calibrating your guess based on your knowledge of similar scenarios, and then perform data analysis or measurement to improve the accuracy of your guess. If you’re not very certain of the answer, it doesn’t take much data or measurement to make you more certain, and thus increase your confidence in an outcome. However, the more certain you are, the more measurement you need to perform to increase your certainty.

Start by decomposing the problem

If you think what you want to measure isn’t measurable, Hubbard encourages you to think again, and decompose the problem. To use my example, and #1 on his list, I want to measure whether or not a documentation topic was more useful after I rewrote it. As he points out with his first point, the problem is likely more observable than I might think at first.

“Decompose the measurement so that it can be estimated from other measurements. Some of these elements may be easier to measure and sometimes the decomposition itself will have reduced uncertainty.”

I can decompose the question that I’m trying to answer, and consider how I might measure usefulness of a topic. Maybe something is more useful if it is viewed more often, or if people are sharing the link to the topic more frequently, or if there are qualitative comments in surveys or forums that refer to it. I can think about how I might tell someone that a topic is useful, what factors of the topic and information about it I might point to. Does it come up first when you search for a specific customer question? Maybe then search rankings for relevant keywords are an observable metric that could help me measure utility of a topic.

You can also perform extra research to think of ways to measure something.

“Consider your findings from secondary research: Look at how others measured similar issues. Even if their specific findings don’t relate to your measurement problem, is there anything you can salvage from the methods they used?”

Is it business critical to measure this?

Before I invest a lot of time and energy performing measurements, I want to make sure (to Hubbard’s second point in his list) that the question I am attempting to answer, what I am trying to measure, is important enough to merit measurement. This is also tied to points four, five, and six: does the importance of the knowledge outweigh the difficulty of the measurement? It often does, especially because (to his sixth point), the measurement is often easier to obtain than it might seem at first.

Estimate what you think you’ll measure

To Hubbard’s third point, a calibrated estimate is important when you do a measurement. I need to be able to estimate what “success” might look like, and what reasonable bounds of success I might expect are.

Make estimates about what you think will happen, and calibrate those estimates to understand just how uncertain you are about outcomes.

To continue with my question about a rewritten topic’s usefulness, let’s say that I’ve determined that added page views, elevated search rankings, and link shares on social media will mean the project is a success. I’d then want to estimate what number of each of those measurements might be meaningful.

To use page views as an example for estimation: if page views increase by 1%, it might not be meaningful. But maybe 5% is a meaningful increase? I can use that as a lower bound for my estimate. I can also think about a likely upper bound. A 1000% increase would be unreasonable, but maybe I could hope that page views would double, and I’d see a 100% increase in page views! I can use that as an upper bound. By considering and dismissing the 1% and 1000% numbers, I’m also doing some calibration of my estimates—essentially gut-checking them with my expertise and existing knowledge. The summary of How to Measure Anything that I linked in the first paragraph addresses calibration of estimates in more detail, as does the book itself!

After I’ve settled on a range of measurement outcomes, I can assess how confident I am that this might happen. Hubbard calls this a Confidence Interval. I might only be 60% certain that page views will increase by at least 5% but they won’t increase more than 100%. This gives me a lot of uncertainty to reduce when I start measuring page views.

One way to start reducing my uncertainty about these percentage increases might be to look at the past page views of this topic, to try to understand what regular fluctuation in page views might be over time. I can look at the past 3 months, week by week, and might discover that 5% is too low to be meaningful, and a more reasonable signifier of success would be a 10% or higher increase in page views.

Estimating gives me a number that I am attempting to reduce uncertainty about, and performing that initial historical measurement can already help me reduce some uncertainty. Now I can be 100% certain that a successful change to the topic should show more than a 5% increase in page views on a week-to-week basis, and maybe 80% certain that a successful change would show a 10% or greater increase.
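That historical gut check can be sketched in a few lines: compute week-over-week percent changes for past weeks and see how wide normal fluctuation is. The weekly numbers here are invented:

```python
from statistics import mean, stdev

# Hypothetical weekly page views for the topic over the past 12 weeks.
weekly_views = [200, 210, 195, 205, 198, 212, 207, 199, 204, 210, 201, 206]

# Week-over-week percent changes show the normal fluctuation band.
changes = [
    (curr - prev) / prev * 100
    for prev, curr in zip(weekly_views, weekly_views[1:])
]
print(f"typical change: {mean(changes):+.1f}% ± {stdev(changes):.1f}%")

# If normal noise is around ±5%, a 5% bump is weak evidence;
# something near 10% or more stands out from the noise.
meaningful = mean(changes) + 2 * stdev(changes)
print(f"week-over-week increase worth noticing: > {meaningful:.1f}%")
```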

When doing this, keep in mind another point of Hubbard’s:

“a persistent misconception is that unless a measurement meets an arbitrary standard… it has no value… what really makes a measurement of high value is a lot of uncertainty combined with a high cost of being wrong.”

If you’re choosing to undertake a large-scale project that will cost quite a bit if you get it wrong, you likely want to know in advance how to measure the success of that project. This point also underscores his continued emphasis on reducing uncertainty.

For my (admittedly mild) example, it isn’t valuable for me to declare that I can’t learn anything from page view data unless 3 months have passed. I can likely reduce uncertainty enough with two weeks of data to learn something valuable, especially if my uncertainty level is relatively low (in this example, in the 40-70% range).

Measure just enough, not a lot

Hubbard talks about the notion of a Rule of Five:

There is a 93.75% chance that the median of a population is between the smallest and largest values in any random sample of five from that population.

Knowing the median value of a population can go a long way in reducing uncertainty. Even if you can only get a seemingly-tiny sample of data, this rule of five makes it clear that even that small sample can be incredibly valuable for reducing uncertainty about a likely value. You don’t have to know all of something to know something important about it.
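The 93.75% figure follows from simple probability: each random draw lands below the median about half the time, so the chance that all five draws land on the same side of it is 2 × (1/2)^5 = 6.25%. A quick simulation confirms this without assuming anything about the population’s shape (the population here is deliberately skewed):

```python
import random
import statistics

random.seed(1)
# A skewed population, to show the rule doesn't depend on a normal distribution.
population = [random.lognormvariate(0, 1) for _ in range(100_001)]
true_median = statistics.median(population)

trials = 20_000
hits = 0
for _ in range(trials):
    sample = random.sample(population, 5)  # random sample of five
    if min(sample) <= true_median <= max(sample):
        hits += 1

print(f"median inside sample range: {hits / trials:.1%}")  # ≈ 93.75%
```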

Do something with what you’ve learned

After you perform measurements or do some data analysis and reduce your uncertainty, then it’s time to do something with what you’ve learned. Given my example, maybe my rewrite increased page views of the topic by 20%, something I’m now fairly certain is a significant degree, and it is now higher in the search results. I’ve now sufficiently reduced my uncertainty about whether or not the changes made this topic more useful, and I can now rewrite similar topics to use a similar content pattern with confidence. Or at least, more confidence than I had before.

Overall summary

My super abbreviated summary of the book would then be to do the following:

  1. Start by decomposing the problem
  2. Ask is it business critical to measure this?
  3. Estimate what you think you’ll measure
  4. Measure just enough, not a lot
  5. Do something with what you’ve learned

I recommend the book (with judicious skimming), especially if you need some conceptual discussion to help you unravel how best to measure a specific problem. As I read the book, I took numerous notes about how I might be able to measure something like support case deflection with documentation, or how to prioritize new features for product development (or documentation). I also considered how customers might better be able to identify valuable data sources for measuring security posture or other events in their data if they followed many of the practices outlined in this book.

Not sober curious, just sober

An article covering the “Sober Curious Movement” was published in the Chicago Tribune a few weeks ago. My brother shared it with me, and I’m still thinking about it. The article discusses a “sober curious” movement in America and interviews a number of people in Chicago that have chosen to quit drinking. Apparently because they quit drinking for different reasons than alcoholism or binge drinking, they are “sober curious” instead of simply “sober”. (It’s a book too).

My brother sent it to me because I don’t drink either, which can feel like an oddity in your twenties. I quit drinking at concerts when I was 22, after I went to a concert, had one drink, and ended up fainting in between the opener and the headliner. Thinking it was just a fluke of that night, I tried again at another show a few months later, and spent the headlining set sitting down in the back of the venue to avoid fainting a second time. After that I realized that it wasn’t worth it, and never drank at a concert again.

It took longer for me to quit drinking overall, and I’d make exceptions at times for special occasions when it just felt too awkward not to drink—weddings, parties, first dates—but after a while I decided to stop making the exceptions. It was part personal challenge, and part health-conscious decision. My body had never responded well to alcohol, what with lightheadedness or nausea following anything more than a couple drinks. By the time I was 22 I had a short list of “okay alcohols” and quantities, and by the time I was 26 I’d grown tired of bothering.

My life had shifted to involve fun activities beyond drinking, and my friends weren’t drinking-focused either. I’d be going to concerts or to the gym/the soccer pitch every other day, and drinking just didn’t fit anywhere. I’d spent time in college not drinking at various parties, where I knew I had to be fresh for studying the next day, so I knew I could still have fun without drinking. Choosing to quit drinking overall felt like a natural progression.

So where does that leave me now, and why am I still so peeved at that article? For one, it only quotes women. I like to see women quoted by journalists, but by only quoting women, the choice to be sober felt somewhat trivialized. In addition, the women’s comments were contextualized with talk of mindfulness and yoga, as though this is a choice being made by a particular type and class of woman, and no others. It also perpetuates the notion that having fun without drinking is some strange novelty. There are a lot of people out there that have fun without drinking. Indeed, as one of the women in the article points out—it’s a challenge to your confidence to go out and be all of yourself, without alcohol. But it can be that much more invigorating in that way. You get to challenge your social anxiety and actively build confidence, rather than relying on alcohol and wondering if you can talk to strangers without it at all.

I think there’s also harm in talking about sobriety distinct from alcoholism. A lot of quitting drinking is about realizing that you don’t like who you are when you drink (or after you drink). The recent essay about the Joe Beef restaurateurs makes this clear. Some people have the lifestyles, genetic predisposition, or experienced traumas that escalate their alcohol consumption to recognizable alcoholism. Others have a dependence on it that they dislike, even if others don’t see it as an issue (as one of the women interviewed in the Chicago Tribune article mentions). It’s more than okay to share that common understanding, rather than separate ourselves into different groups, “the sober curious” and “the fully sober due to addiction”. That’s harmful. Indeed, the sober curious meetup in Chicago also includes “young women in recovery”. They get it.

Often, quitting drinking feels like a social choice more than a personal choice. It feels that way largely due to the fact that there often aren’t that many sober social activities out there. It’s hard to stay out late with friends sober when the only places open late are bars. It’s harder to choose yourself over alcohol, when it can often mean isolating yourself from friends. So while I struggle with the rhetoric and the patronizing presentation of the “sober curious” movement, I absolutely support it as an overall societal direction. Here’s to more late-night diners, pastry places like Mission Pie, and sober pop-ups like Brillig Dry Bar that help us sober people stay up late and out with friends.

Planning and analyzing my concert attendance with Splunk

This past year I added some additional datasets to the Splunk environment I use to analyze my music: information about tickets that I’ve purchased, and information about upcoming concerts.

Ticket purchase analysis

I started keeping track of the tickets that I’ve purchased over the years, which gave me good insights about ticket fees associated with specific ticket sites and concert promoters.  

Based on the data that I’ve accumulated so far, Ticketmaster doesn’t have the highest fees for concert tickets. Instead, Live Nation does. This distinction is relatively meaningless when you realize they’ve been the same company since 2010.

However, the ticket site isn’t the strongest indicator of fees, so I decided to split the data further by promoter to identify if specific promoters had higher fees than others.

Based on that data, you can see that the one show I went to promoted by AT&T had fees of nearly 37%, and that shows promoted by Live Nation (through their evolution and purchase by Ticketmaster) also had fees around 26%. Shows promoted by independent venues have somewhat higher fees than others, hovering around 25% for 1015 Folsom and Mezzanine, but shows promoted by organizations whose only purpose is promotion tend to have slightly lower fees, such as Select Entertainment with 18%, Popscene with 16.67%, and KC Turner Presents with 15.57%.

I realized I might want to refine this, so I recalculated this data, limiting it to promoters from which I’ve bought at least two tickets.

It’s a much more even spread in this case, ranging from 25% to 11% in fees. However, you can see that the same patterns exist: for the shows I’ve bought tickets to, the independent venues average 22-25% in fees, while dedicated independent promoters add 16% or less, with corporate promoters like Another Planet, JAM, and Goldenvoice filling the middle of the data, ranging from 18% to 22%.
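That “at least two tickets” refinement is a group-filter-average pattern. A sketch with invented ticket data (not my actual purchases):

```python
from collections import defaultdict

# Hypothetical ticket purchases: (promoter, base price, fees).
tickets = [
    ("Live Nation", 35.00, 9.10), ("Live Nation", 50.00, 13.00),
    ("Popscene", 15.00, 2.50), ("Popscene", 18.00, 3.00),
    ("AT&T", 60.00, 22.00),  # only one purchase from this promoter
]

# Group the fee percentage of each ticket by promoter.
by_promoter = defaultdict(list)
for promoter, price, fees in tickets:
    by_promoter[promoter].append(fees / price * 100)

# Average fee percentage, limited to promoters with at least two purchases.
fee_pct = {
    p: round(sum(pcts) / len(pcts), 1)
    for p, pcts in by_promoter.items()
    if len(pcts) >= 2
}
print(fee_pct)
```

Filtering out single-purchase promoters keeps one unusual ticket from masquerading as a trend, which is exactly why the refined chart above looks more even.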

I also attempted to determine how I’m discovering concerts. This data is entirely reliant on my memory, with no other data to back it up, but it’s pretty fascinating to track.

It’s clear that Songkick has become a vital service in my concert-going planning, helping me discover 46 shows, and friends and email newsletters from venues helping me stay in the know as well for 19 and 14 shows respectively. Social media contributes as well, with a Facebook community (raptors) and Instagram making appearances with 10 and 2 discoveries respectively.

Concert data from Songkick

Because Songkick is so vital to my concert discovery, I wanted to amplify the information I get from the service. In addition to tracking artists on the site, I wanted to proactively gather information about artists coming to the SF Bay Area and compare that with my listening habits. To do this, I wrote a Songkick alert action in Python to run in Splunk.

Songkick does an excellent job for the artists that I’m already tracking, but there are some artists that I might have just recently discovered but am not yet tracking. To reduce the likelihood of missing fast-approaching concerts for these newly-discovered artists, I set up an alert to look for concerts for artists that I’ve discovered this year and have listened to at least 5 times.

To make sure I’m also catching other artists I care about, I use another alert to call the Songkick API for every artist that is above a calculated threshold. That threshold is based on the average listens for all artists that I’ve seen live, so this search helps me catch approaching concerts for my historical favorite artists.
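The search itself runs in Splunk, but the threshold logic can be sketched in plain Python. The artists and listen counts here are invented:

```python
from statistics import mean

# Hypothetical listen counts per artist, and which artists I've seen live.
listens = {"Max Cooper": 320, "LAUER": 45, "George Fitzgerald": 430,
           "Gerald Toto": 30, "CHVRCHES": 500}
seen_live = {"Max Cooper", "CHVRCHES"}

# Threshold: average listens across the artists I've already seen live.
threshold = mean(listens[a] for a in seen_live)

# Artists worth querying the Songkick API for upcoming concerts.
to_check = sorted(a for a, n in listens.items() if n >= threshold)
print(threshold, to_check)
```

In the real setup the threshold comes out of a saved search over the listening index rather than a hard-coded dict, but the calculation is the same shape.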

Also, to be honest, I did this largely so that I could learn how to write an alert action in Splunk software. Alert actions are essentially bits of custom Python code that you can dispatch with the results of a search in Splunk. The two alert examples I gave are both saved searches that run every day and update an index. I built a dashboard to visualize the results.

I wanted to use log data to confirm which artists were being sent to Songkick with my API request, even if no events were returned. To do this I added a logging statement in my Python code for the alert action, and then visualized the log statements (with the help of a lookup to match the artist_mbid with the artist name) to display the artists that had no upcoming concerts at all, or had no SF concerts.

For those artists without concerts in the San Francisco Bay Area, I wanted to know where they were going instead, so that I could identify possible travel locations for the future.

It seems like Paris is the place to be for several of these artists—there might be a festival that LAUER, Max Cooper, George Fitzgerald, and Gerald Toto are all playing at, or they just happen to all be visiting that city on their tours.

I’m planning to publish a more detailed blog post about the alert action code in the future on the Splunk blogs site, but until then I’ll be off looking up concert tickets to these upcoming shows….

Engaging with San Francisco history as a newcomer

I moved to San Francisco from the Midwest a few years ago, and I’d been missing a strong sense of history since then. I’ve been to a few events in an attempt to learn more about my new home, such as a Dolores Park history day, or a history-relevant event in the Mission as part of Litcrawl, but I struggled to absorb a history for the city that went beyond “gold rush, earthquake, tech boom, bust, boom”.

But last week I was at the library and saw an event that was being held in conjunction with the display of the Bay Model throughout SF public libraries and Take Part SF, called Vanished Waters. It was about Mission Bay history so I skipped my regular workout to attend. It was well worth it!

Vanished Waters is also a book, which provided the loose structure for the talk, given by Chris Carlsson, an expert on San Francisco history and an engaging speaker. He co-founded Shaping SF, which helps maintain a digital archive of the city’s past.

One of my favorite facts from the talk: in 1852, Market Street ended at what is now 3rd Street, in an 80-foot-tall sand dune.

SoMA was really hilly and marshy, but then some dude with a steam shovel was like “sup let me move that sand for you” and also “sup let me help you fill in this lot that you bought that is literally just water”. That’s my paraphrasing, but the actual details are to be read on Found SF.

The whole idea of filling in the San Francisco Bay is hard to imagine now that the water isn’t polluted, but back when it was a stinky, putrid mess full of garbage, filling it in was easier to see as a good idea. (That proposal is also why the SF Bay Model was built.)

However, my favorite part of the talk was when Carlsson discussed using this history to inform our present and future decisions. He pointed out that there is a lot of rhetoric in San Francisco about how to build more housing to manage the growth of the city, and what kinds of development are best suited to accommodating all of the people who move here.

However, there’s not much rhetoric (if any) about staging a managed retreat from climate change. San Francisco is a coastal city, built on top of marshland, sand dunes, or literal land fill. What happens when the sea level begins to rise, or more volatile weather patterns cause bigger storms and potential flooding?

I realized after this talk that some city dwellers love to judge southeastern coastal city residents that build or rebuild homes in the path of hurricanes or immediate climate change threats, and yet, New York City and San Francisco are both at high risk from sea level rise.

That’s not to mention the earthquake risk in San Francisco. Our current development plans are not necessarily smart, as this article in the New York Times points out.

It’s fascinating to explore what the city used to look like less than 200 years ago, and imagine what it might look like in 2057 in the face of climate change. I lost nearly an hour clicking around the maps on David Rumsey’s website (recommended by Carlsson).

Here’s a map of the 1857 coast, overlaid on modern San Francisco.

This map in 1869 of land lots makes it clear just how much of the land that was sold during that period wasn’t actually land. This essay in Collector’s Weekly covers how that land speculation happened and how it shapes modern real estate in the city.

This exploration all happened because of this talk and the display of the 1938 3D model of San Francisco in the San Francisco Public Libraries. If you want to help find the model a permanent home for display in the city, sign this petition. Just imagine making this map overlay of what San Francisco looked like in 1938 into a tactile experience.

The model is still on display in SF Public Library branches throughout the city, and you can stay engaged in city history through the San Francisco Department of Memory, the California Historical Society, Shaping SF, and Found SF.