Unbiased data analysis with the data-to-everything platform: unpacking the Splunk rebrand in an era of ethical data concerns

Splunk software provides powerful data collection, analysis, and reporting functionality. The new slogan, “data is for doing”, alongside taglines like “the data-to-everything platform” and “turn data into answers”, aims to bring the company to the forefront of data powerhouses, where it rightly belongs (I’m biased; I work for Splunk).

There is nuance in those phrases that can’t be adequately expressed in marketing materials, but that is crucial for doing ethical and unbiased data analysis, helping you find ultimately better answers with your data and do even better things with it.

Start with the question

If you start attempting to analyze data without an understanding of a question you’re trying to answer, you’re going to have a bad time. This is something I really appreciate about moving away from the slogan “listen to your data” (even though I love a good music pun). Listening to your data implies that you should start with the data, when in fact you should start with what you want to know and why you want to know it. You start with a question.

Data analysis starts with a question, and because I’m me, I want to answer a fairly complex question: what kind of music do I like to listen to? This overall question (loosely analogous to an objective function in data science) can direct my data analysis. But first, I want to evaluate my question. If I’m going to turn my data into doing, I want to consider the ethics and the bias of my question.

Consider what you want to know, and why you want to know it, so that you can weigh the ethics of the question:

  • Is this question ethical to ask?
  • Is it ethical to use data to answer it?
  • Could you ask a different question that would be more ethical and still help you find useful, actionable answers?
  • Does your question contain inherent bias?
  • How might the biases in your question affect the results of your data analysis?

Questions like “How can we identify fans of this artist so that we can charge them more money for tickets?” or “What’s the highest fee that we can add to tickets where people will still buy the tickets?” could be good for business, or help increase profits, but they’re unethical. You’d be using data to take actions that are unfair, unequal, and unethical. Just because Splunk software can help you bring data to everything doesn’t mean that you should. 

Break down the question into answerable pieces

If I’ve decided that my question is ethical to answer with data, then it’s time to consider how I’ll perform my data analysis. Before I try to answer my question, I want to be sure I consider the following:

  • Is this question small enough to answer with data?
  • What data do I need to help me answer this question?
  • How much data do I need to help me answer this question?

I can turn data into answers, but I have to be careful about the answers that I look for. If I don’t consider the small questions that make up the big question, I might end up with biased answers. (For more on this, see my .conf17 talk with Celeste Tretto).

So if I consider “What kind of music do I like to listen to?”, I might recognize right away that the question is too broad. There are many things that could change the answer to that question. I’ll want to consider how my subjective preferences (what I like listening to) might change depending on what I’m doing at the time: commuting, working out, writing technical documentation, or hanging out on the couch. I need to break the question down further. 

A list of questions that might help me answer my overall question could be: 

  • What music do I listen to while I’m working? When am I usually working?
  • What music do I listen to while I’m commuting? When am I usually commuting?
  • What music do I listen to when I’m relaxing? When am I usually relaxing?
  • What are some characteristics of the music that I listen to?
  • What music do I listen to more frequently than other music?
  • What music have I purchased or added to a library? 
  • What information about my music taste isn’t captured in data?
  • Do I like all the music that I listen to?

As I’m breaking down the larger question of “What kind of music do I like to listen to?”, the most important question I can ask is “What kind of music do I think I like to listen to?”. This question matters because data analysis isn’t as simple as turning data into answers. That makes for catchy marketing, but the nuance lies in using the data you have to reduce uncertainty about what you think the answer might be. The book How to Measure Anything by Douglas Hubbard covers this concept of data analysis as uncertainty reduction in great detail, but the crux is that for a sufficiently valuable and complex question, there is no single objective answer (or else we would’ve found it already!).

So I must consider, right at the start, what I think the answer (or answers) to my overall question might be. Since I want to know what kind of music I like, I want to ask myself what kind of music I think I might like. Because “liking” and “kind of music” are subjective characteristics, there can be no single, objectively true answer. Very few, if any, complex questions have objectively true answers, especially answers that can be found in data.

So I can’t turn data into answers for my overall question, “What kind of music do I like?”, but I can turn it into answers for simpler questions that are rooted in fact. The questions I listed earlier are much easier to answer with data, with relative certainty, because I broke the complex, somewhat subjective question into many objective questions.

Consider the data you have

After you have your questions, look for the answers! Consider the data that you have, and whether or not it is sufficient and appropriate to answer the questions. 

The flexibility of Splunk software means that you don’t have to consider the questions you’ll ask of the data before you ingest it. Structured or unstructured, you can ask questions of your data, but you might have to work harder to fully understand the context of the data to accurately interpret it. 

Before you analyze and interpret the data, you’ll want to gather context about the data, like:

  • Is the dataset complete? If not, what data is missing?
  • Is the data correct? If not, in what ways could it be biased or inaccurate?
  • Is the data similar to other datasets you’re using? If not, how is it different?

This additional metadata (data about your datasets) can provide crucial context necessary to accurately analyze and interpret data in an unbiased way. For example, if I know there is data missing in my analysis, I need to consider how to account for that missing data. I can add additional (relevant and useful) data, or I can acknowledge how the missing data might or might not affect the answers I get.

After gathering context about your datasets, you’ll also want to consider if the data is appropriate to answer the question(s) that you want to answer. 

In my case, I’ll want to assess the following aspects of the datasets: 

  • Is using the audio features API data from Spotify the best way to identify characteristics in music I listen to? 
  • Could another dataset be better? 
  • Should I make my own dataset? 
  • Does the data available to me align with what matters for my data analysis? 

You can see a small example of how the journalist Matt Daniels of The Pudding considered which data was relevant to answer the question “How popular is male falsetto?” for the Vox YouTube series Earworm, starting at 1:45 in this clip. For about 90 seconds, Matt and the host of the show, Estelle Caswell, discuss the process of selecting the right data, eventually choosing a smaller, but more relevant, dataset to answer their question.

Is more data always better? 

Data is valuable when it’s in context and applied with consideration for the problem that I’m trying to solve. Collecting data about my schedule may seem overly intrusive or irrelevant, but if it’s applied to the broader question of “What kind of music do I like to listen to?”, it can add valuable insights and possibly shift the overall answer, because I’ve applied that additional data with consideration for the question that I’m trying to answer.

Splunk published a white paper to accompany the rebranding, and it contains some excellent points. One of them that I want to explore further is the question:

“how complete, how smart, are these decisions if you’re ignoring vast swaths of your data?” 

On the one hand, having more data available can be valuable. I am able to get a more valuable answer to “what kind of music do I like” because I’m able to consider additional, seemingly irrelevant data about how I spend my time while I’m listening to music. However, there are many times when you want to ignore vast swaths of your data. 

The most important aspect to consider when adding data to your analysis is not quantity, but quality. Rather than focusing on how much data you might be ignoring, focus on which data you might be ignoring, for which questions, and affecting which answers. You might have a lot of ignored data, but put your focus on the small amount of data that can make a big difference in the answers you find.

As the academics in “I got more data, my model is more refined, but my estimator is getting worse! Am I just dumb?” make clear with their crucial finding:

“More data lead to better conclusions only when we know how to take advantage of their information. In other words, size does matter, but only if it is used appropriately.”

The most important aspect of adding data to an analysis is exactly as the academics point out: it’s only more helpful if you know what to do with it. If you aren’t sure how to use additional data you have access to, it can distract you from what you’re trying to answer, or even make it harder to find useful answers because of the scale of the data you’re attempting to analyze. 

Douglas Hubbard in the book How to Measure Anything makes the case that doing data analysis is not about gathering the most data possible to produce the best answer possible. Instead, it’s about measuring to reduce uncertainty in the possible answers and measuring only what you need to know to make a better decision (based on the results of your data analysis). As a result, such a focused analysis often doesn’t require large amounts of data — rough calculations and small samples of data are often enough. More data might lead to greater precision in your answer, but it’s a tradeoff between time, effort, cost, and precision. (I also blogged about the high-level concepts in the book).

If I want to answer my question “What kind of music do I like to listen to?” I don’t need the listening data of every user on the Last.fm service, nor do I need metadata for songs I’ve never heard to help me identify song characteristics I might like. Because I want to answer a specific question, it’s important that I identify the specific data that I need to answer it—restricted by affected user, existence in another dataset, time range, type, or whatever else.

If you want more evidence, the notion that more data is always better is also neatly upended by the Nielsen Norman Group in Why You Only Need to Test with 5 Users and the follow-up How Many Test Users in a Usability Study?.

Keep context alongside the data

Indeed, the white paper talks about bringing people to a world where they can take action without worrying about where their data is, or where it comes from. But it’s important to still consider where the data comes from, even if you don’t have to worry about it because you use Splunk software. It’s relevant to data analysis to keep context about the data alongside the data.

For example, it’s important for me to keep track of the fact that the song characteristics I might use to identify the type of music I like come from a dataset crafted by Spotify, or that my listening behavior is tracked by the service Last.fm. Last.fm can only track certain types of listening behavior on certain devices, and Spotify has its own biases in creating a set of audio characteristics.

If I lose track of this seemingly mundane context when analyzing my data, I can misinterpret my data or draw inaccurate conclusions about what kind of music I like to listen to, based purely on the limitations of the data available to me. If I don’t know where my data comes from, or what it represents, then it’s easy to find biased answers to questions, even though I’m using data to answer them.

Having more data than you need also makes it more difficult to keep context close to your data. The more data you have, the more room for error when trying to track contextual meaning. Splunk software includes metadata fields that can help you keep some context with the data, such as where it came from, but other types of context you’d need to track yourself.

More data can not only complicate your analysis, it can also create security and privacy concerns if you keep more data around, and for longer, than you need. If I want to know what kind of music I like to listen to, I might be comfortable doing data analysis to answer that question, identifying the characteristics of music that I like, and then removing all of the raw data that led me to that conclusion out of privacy or security concerns. Or I could drop the metadata for all songs that I’ve ever listened to, and keep only the metadata for some songs. I’d want to consider, again, how much data I really need to keep around.

Turn data into answers—mostly

So I’ve broken down my overall question into smaller, more answerable questions, I’ve considered the data I have, and I’ve kept the context alongside that data. Now I can finally turn the data into answers, just like I was promised!

It turns out I can take a corpus of my personal listening data and combine it with a dataset of my personal music libraries to weight the songs in the listening dataset. I can also assess listen frequency to further weight the songs in my analysis and formulate a ranking of songs in order of how much I like them. I’d probably also want to split that ranking by what I was doing while I was listening to the music, to eliminate outliers from the dataset that might bias the results. All the small questions that feed into the overall question are coming to life.
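
To make that concrete, here’s a rough sketch of that weighting logic in Python. The field names and weights here are hypothetical illustrations, not my actual implementation:

```python
from collections import Counter, defaultdict

# Hypothetical inputs: scrobbles is a list of (artist, track, activity) tuples
# from the listening dataset; library is a set of (artist, track) pairs that
# exist in my personal music libraries.
def rank_songs(scrobbles, library, library_weight=2.0):
    rankings = defaultdict(Counter)
    for artist, track, activity in scrobbles:
        # Songs I purchased or added to a library get extra weight per listen.
        weight = library_weight if (artist, track) in library else 1.0
        rankings[activity][(artist, track)] += weight
    # One ranking per activity (working, commuting, relaxing, ...) so that
    # context-specific outliers don't bias the overall answer.
    return {activity: counts.most_common() for activity, counts in rankings.items()}
```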

After I have that ranking, I could use additional metadata from another source, such as the Spotify audio features API, to identify the characteristics of the top-ranked songs, and ostensibly then be able to answer my overall question: what kind of music do I like to listen to?
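
That last step could be as simple as one authenticated API call per batch of top-ranked songs. A minimal sketch, assuming you already have Spotify track IDs for those songs and a valid OAuth access token (obtaining the token isn’t shown):

```python
import requests

def audio_features(track_ids, token):
    """Fetch Spotify audio features (danceability, energy, valence, tempo, ...)
    for up to 100 track IDs in a single request."""
    resp = requests.get(
        "https://api.spotify.com/v1/audio-features",
        params={"ids": ",".join(track_ids)},
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["audio_features"]
```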

By following all these steps, I turned my data into answers! And now I can turn my data into doing, by taking action on those characteristics. I can of course seek out new music based on those characteristics, but I can also book the ideal DJs for my birthday party, create or join a community of music lovers with similar taste, or even delete any music from my library that doesn’t match those characteristics. Maybe the only action I take is self-reflection, checking whether what the data has “told” me is in line with what I think is true about myself.

It is possible to turn data into answers, and turn data into doing, with caution and attention to all the ways that bias can be introduced into the data analysis process. But there’s still one more way that data analysis could result in biased outcomes: communicating results. 

Carefully communicate data findings

After I find the answers in my data, I need to communicate them carefully to avoid bias. If I want to tell all my friends that I figured out what kind of music I like to listen to, I want to tell them carefully, so that they can take the appropriate and ethical action in response to what I tell them.

I’ll want to present the answers in context. I need to describe the findings with the relevant qualifiers: I like music with these specific characteristics, and when I say I like this music I mean this is the kind of music that I listen to while doing things I enjoy, like working out, writing, or sitting on my couch. 

I also need to make clear what kind of action might be appropriate or ethical to take in reaction to this information. Maybe I want to find more music that has these characteristics, or I’d like to expand my taste, or I want to see some live shows and DJ sets that would feature music that has these characteristics. Actions that support those ends would be appropriate, but can also risk being unethical. What if someone learns of these characteristics, and chooses to then charge me more money than other people (whose taste in music is unknown) to see specific DJ sets or concerts featuring music with those characteristics? 

Data, per the white paper, “must be brought not only to every action and decision, but to every department.” Because of that, it’s important to consider how that happens. Share relevant parts of the process that led to the answers you found from the data. Communicate the results in a way that can be easily understood by your audience. This Medium post by Cecelia Shao, a product manager at Comet.ml, covers important points about how to communicate the results of data analysis. 

Use data for good

I wanted to talk through the data analysis process in the context of the rebranded slogans and marketing content so that I could unpack additional nuance that marketing content can’t convey. I know how easy it is to introduce bias into data analysis, and how easily data analysis can be applied to unethical questions, or used to take unethical actions.

As the white paper aptly points out, the value of data is not merely in having it, but in how you use it to create positive outcomes. You need to be sure you’re using data safely and intelligently, because with great access to data comes great responsibility. 

Go forth and use the data-to-everything platform to turn data into doing…the right thing. 

Disclosure: I work for Splunk. Thanks to my colleagues Chris Gales, Erica Chen, and Richard Brewer-Hay for the feedback on drafts of this post. While colleagues reviewed this post and provided feedback, the content is my own and represents my own views rather than those of Splunk the company. 

Planning and analyzing my concert attendance with Splunk

This past year I added some additional datasets to the Splunk environment I use to analyze my music: information about tickets that I’ve purchased, and information about upcoming concerts.

Ticket purchase analysis

I started keeping track of the tickets that I’ve purchased over the years, which gave me good insights into the ticket fees associated with specific ticket sites and concert promoters.

Based on the data that I’ve accumulated so far, Ticketmaster doesn’t have the highest fees for concert tickets. Instead, Live Nation does. This distinction is relatively meaningless when you realize they’ve been the same company since 2010.

However, the ticket site isn’t the strongest indicator of fees, so I decided to split the data further by promoter to identify if specific promoters had higher fees than others.

Based on that data, you can see that the one show I went to promoted by AT&T had fees of nearly 37%, and that shows promoted by Live Nation (through their evolution and purchase by Ticketmaster) also had high fees, around 26%. Shows promoted by independent venues have somewhat higher fees than others, hovering around 25% for 1015 Folsom and Mezzanine, while shows promoted by organizations whose only purpose is promotion tend to have slightly lower fees, such as Select Entertainment with 18%, Popscene with 16.67%, and KC Turner Presents with 15.57%.

I realized I might want to refine this, so I recalculated this data, limiting it to promoters from which I’ve bought at least two tickets.

It’s a much more even spread in this case, ranging from 25% down to 11% in fees. However, you can see that the same patterns exist: for the shows I’ve bought tickets to, the independent venues average 22-25% in fees, while dedicated independent promoters add 16% or less, with corporate promoters like Another Planet, JAM, and Goldenvoice filling the middle of the range at 18% to 22%.

I also attempted to determine how I’m discovering concerts. This data is entirely reliant on my memory, with no other data to back it up, but it’s pretty fascinating to track.

It’s clear that Songkick has become a vital service in my concert-going planning, helping me discover 46 shows, while friends (19 shows) and email newsletters from venues (14 shows) helped me stay in the know as well. Social media contributes too, with a Facebook community (raptors) accounting for 10 discoveries and Instagram for 2.

Concert data from Songkick

Because Songkick is so vital to my concert discovery, I wanted to amplify the information I get from the service. In addition to tracking artists on the site, I wanted to proactively gather information about artists coming to the SF Bay Area and compare that with my listening habits. To do this, I wrote a Songkick alert action in Python to run in Splunk.

Songkick does an excellent job for the artists that I’m already tracking, but there are some artists that I might have just recently discovered but am not yet tracking. To reduce the likelihood of missing fast-approaching concerts for these newly-discovered artists, I set up an alert to look for concerts for artists that I’ve discovered this year and have listened to at least 5 times.

To make sure I’m also catching other artists I care about, I use another alert to call the Songkick API for every artist that is above a calculated threshold. That threshold is based on the average listens for all artists that I’ve seen live, so this search helps me catch approaching concerts for my historical favorite artists.

To be honest, I also did this largely so that I could learn how to write an alert action in Splunk software. Alert actions are essentially bits of custom Python code that you can dispatch with the results of a search in Splunk. The two alert examples I gave are both saved searches that run every day and update an index. I built a dashboard to visualize the results.

I wanted to use log data to confirm which artists were being sent to Songkick with my API request, even if no events were returned. To do this I added a logging statement in my Python code for the alert action, and then visualized the log statements (with the help of a lookup to match the artist_mbid with the artist name) to display the artists that had no upcoming concerts at all, or had no SF concerts.
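
For the curious, a custom alert action boils down to a script that Splunk invokes with --execute, passing a JSON payload on stdin that includes the alert’s configuration and a path to a gzipped CSV of the search results. Below is a simplified sketch of the shape of such a script, not my production code; the artist_mbid field matches my lookup, and error handling and event writing are omitted:

```python
import csv
import gzip
import json
import sys

import requests

def execute():
    payload = json.load(sys.stdin)  # Splunk passes settings as JSON on stdin
    apikey = payload["configuration"].get("songkick_apikey")
    # The triggering search's results arrive as a gzipped CSV file.
    with gzip.open(payload["results_file"], "rt") as results:
        for row in csv.DictReader(results):
            mbid = row["artist_mbid"]
            # Messages on stderr land in splunkd.log, which is the log data
            # I searched to confirm which artists were queried.
            print(f"INFO querying Songkick for artist_mbid={mbid}", file=sys.stderr)
            resp = requests.get(
                f"https://api.songkick.com/api/3.0/artists/mbid:{mbid}/calendar.json",
                params={"apikey": apikey},
                timeout=10,
            )
            resp.raise_for_status()
            # ... write the returned events to an index here ...

if __name__ == "__main__" and "--execute" in sys.argv:
    execute()
```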

For those artists without concerts in the San Francisco Bay Area, I wanted to know where they were going instead, so that I could identify possible travel locations for the future.

It seems like Paris is the place to be for several of these artists—there might be a festival that LAUER, Max Cooper, George Fitzgerald, and Gerald Toto are all playing at, or they just happen to all be visiting that city on their tours.

I’m planning to publish a more detailed blog post about the alert action code in the future on the Splunk blogs site, but until then I’ll be off looking up concert tickets to these upcoming shows….

Making Concert Decisions with Splunk

The annual Noise Pop music festival starts this week, and I purchased a badge this year, which means I get to go to any show that’s a part of the festival without buying a dedicated ticket.

That means I have a lot of choices to make this week! I decided to use data to assess (and validate) some of the harder choices I needed to make, so I built a dashboard, “Who Should I See?” to help me out.

First off, the Wednesday night show. Albert Hammond, Jr. of the Strokes is playing, but more people are talking about the Baths show the same night. Maybe I should go see Baths instead?

Screen capture showing two inputs, one with Baths and one with Albert Hammond, Jr, resulting in count of listens compared for each artist (6 vs 39) and listens over time for each artist. Baths has 1 listen before 2012, and 1 listen each year for 2016 until this year. Albert Hammond, Jr has 8 listens before 2010, and a consistent yet reducing number over time, with 5 in 2011 and 4 in 2015, but just a couple since then.

If I’m making my decisions purely based on listen count, it’s clear that I’m making the right choice to see Albert Hammond, Jr. It is telling, though, that I’ve listened to Baths more recently than him, which might have contributed to my indecision.

The other night I’m having a tough time deciding about is Saturday night. Beirut is playing, but across the Bay in Oakland. Two other interesting artists are playing closer to home, Bob Mould and River Whyless. I wouldn’t normally care about this so much, but I know my Friday night shows will keep me busy and leave me pretty tired. So which artist should I go see?

3 inputs on a dashboard this time, Beirut, Bob Mould, and River Whyless are the three artists being compared. Beirut has 44 listens, Bob Mould has 21, River Whyless has 3. Beirut has frequent listens over time, peaking at 6 before 2010, but with peaks at 5 in 2011 and 2019. Bob Mould has 6 listens pre-2009, but only 3 in 2010 and after that, 1 a year at most. River Whyless has 1 listen in April, and 2 in December of 2018.

It’s pretty clear that I’m making the right choice to go see Beirut, especially given my recent renewed interest thanks to their new album.

I also wanted to be able to consider whether I should see a band at all! This isn’t as relevant this week thanks to the Noise Pop badge, but the dashboard panel evaluates whether the number of listens I have for an artist exceeds a threshold that I calculate based on the listens for all artists that I’ve seen live in concert. If an artist has more listens than the threshold, I return advice to “Go to the concert!” but if they don’t, I recommend “Only if it’s cheap, yo.”
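
The logic behind that advice is simple. In Python terms it amounts to something like the following (a sketch of the logic, not the dashboard’s actual SPL):

```python
from statistics import mean

def concert_advice(artist, listens, seen_live):
    """listens: dict mapping artist -> listen count; seen_live: set of
    artists I've seen in concert."""
    # Threshold = average listen count across all artists I've seen live.
    threshold = mean(listens[a] for a in seen_live if a in listens)
    if listens.get(artist, 0) > threshold:
        return "Go to the concert!"
    return "Only if it's cheap, yo."
```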

Because I don’t need to make this decision for Noise Pop artists, I picked a few that I’ve been wanting to see lately: Lane 8, Luttrell, and The Rapture.

4 dashboard panels, 3 of which ask "Should I go see (artist) at all?" one for each artist, Lane 8, Luttrell, and The Rapture. Lane 8 and Luttrell both say "Only go if it's cheap, yo." and The Rapture says "Go to the concert!". The fourth panel shows frequent listening for The Rapture, especially from 2008-2012, with a recent peak in 2018. Lane 8 spikes at the end of the graph, and Luttrell is a small blip at the end of the graph.

While my interest in Lane 8 has spiked recently, there still aren’t enough cumulative listens to put them over the threshold. Same for Luttrell. However, The Rapture has enough to clear the threshold (likely because I’ve been listening to them for over 10 years), so I should go to the concert! I’m going to see The Rapture in May, so I am gleefully obeying my eval statement!

On a more digressive note, it’s clear to me that this evaluation needs some refinement to actually reflect my true concert-going sentiments. Currently, the threshold averages all the listens for all artists that I’ve seen live. It doesn’t restrict that average to consider only the listens that occur before seeing an artist live, which might make it more accurate. That calculation would also be fairly complex, given that it would need to account for artists that I’ve seen multiple times.

However, the number of listens over time doesn’t alone reflect interest in going to a concert. It might be useful to also consider time spent listening, beyond the count of listens for an artist. This is especially relevant for electronic music and DJ sets: I might have only 4 listens for an artist, but if those listens comprise 8 hours of DJ sets, that is a pretty strong signal that I would likely enjoy seeing that artist perform live.

I thought that I’d need direct access to the MusicBrainz database to get metadata like that, but it turns out that the Last.fm API makes some of it available through the track.getInfo endpoint, so I just found a new project! In the meantime, I can at least calculate duration for tracks that exist in my iTunes library.
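
The endpoint returns a duration in milliseconds for many tracks, so a first pass at collecting it might look like the sketch below (rate limiting, and tracks with unknown durations that come back as 0, are left as an exercise):

```python
import requests

API_ROOT = "http://ws.audioscrobbler.com/2.0/"

def track_duration_ms(artist, track, api_key):
    """Look up a track's duration (in milliseconds) via Last.fm track.getInfo."""
    resp = requests.get(API_ROOT, params={
        "method": "track.getInfo",
        "artist": artist,
        "track": track,
        "api_key": api_key,
        "format": "json",
    }, timeout=10)
    resp.raise_for_status()
    # Last.fm returns 0 when it doesn't know the duration.
    return int(resp.json()["track"]["duration"])
```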

I now have a new avenue to explore with this project, collecting that data and refining this calculation. Reach out on Twitter to let me know what you might consider adding to this calculation to craft a data-driven concert-going decision-making dashboard.

If you’re interested in this app, it is open sourced and available on Splunkbase. I’ll commit the new dashboard to the app repo soon!

My 2018 Year in Music: Data Analysis and Insights

This past year has been pretty eventful in music for me. I’ve attended a couple new festivals, seen shows while traveling, and discovered plenty of new bands. I want to examine the data available to me and contrast it with my memories of the past year.

I’ve been using Splunk to analyze my music data for the past couple of years. You can learn more about what I’ve learned in my other posts, Reflecting on a Decade of Quantified Music Listening and Best of 2017: Newly-Discovered Music. I also wrote a blog post for the Splunk blog (I work there) about this: 10 Years of Listens: Analyzing My Music Data with Splunk.

Comparing Spotify’s Data with Mine

Spotify recently released its #2018wrapped campaign, sharing highlights from my year of listening data with me (and, in an ad campaign, aggregate data from all of its users). As someone who uses Spotify, but not as my exclusive source of music listening, I was curious to compare its results with the holistic dataset that I’ve compiled in Splunk.

Top Artists are Poolside, The Blaze, Justice, Born Ruffians, and Bob Moses. Top Songs are Beautiful Rain, For the Birds, Miss You, Faces, and Heaven. I listened for 30,473 minutes, and my top genre was Indie.

Spotify’s top artists for me were somewhat different from the results I found in the data that I gather from Last.fm and analyze with Splunk software. Spotify and my holistic listening data agree that I listened to Poolside more than anyone else, and that I was a big fan of Born Ruffians, but beyond that they differ. This is probably because I buy music, and when I’m mobile I switch my primary listening out of Spotify to song files stored on my phone.

Table showing my top artists and their listens, Poolside with 162 listens, The Vaccines with 136, Young Fathers with 124, Born Ruffians with 102 and Mumford and Sons with 99 listens.

In addition, my top 5 songs of the year were completely different from those listed in Spotify. My holistic top 5 songs of the year were all songs that I purchased. I don’t listen to music exclusively in Spotify, and my favorites go beyond what the service can recognize.

Table showing top songs and the corresponding artist and listen count for the song. Border Girl by Young Fathers with 35 was first, followed by Era by Hubert Kirchner with 32, Naive by the xx with 29, Sun (Viceroy Remix) by Two Door Cinema Club with 27 and There Will Be Time by Mumford & Sons with Baaba Maal also with 27 listens.

Spotify identified that I’ve listened to 30,473 minutes of music, but I can’t make a similarly reliable calculation with my existing data because I don’t have track length data for all the music that I’ve listened to. I can calculate the number of track listens so far this year and, based on that, make an approximation using the track length data that I do have from my iTunes library. The calculation I can make indicates that I’ve so far spent 21,577 minutes listening to 3,878 of the 10,301 total listens I’ve accumulated this year (numbers changing literally as this post is being written).

Screen capture showing total listens of 10,301 and total minutes listened to itunes library songs as 21,577 minutes.
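
The approximation itself is just a join between my listens and whatever durations I do have. Roughly, with hypothetical field names:

```python
def minutes_listened(listens, itunes_tracks):
    """listens: list of (artist, track) scrobbles; itunes_tracks: dict mapping
    (artist, track) -> track length in milliseconds from the iTunes library."""
    matched = [itunes_tracks[key] for key in listens if key in itunes_tracks]
    # Only listens that match a library track contribute to the estimate,
    # which is why 3,878 of my 10,301 listens account for the 21,577 minutes.
    return len(matched), sum(matched) / 1000 / 60
```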

I’m similarly lacking the data to determine my top genre of the year, but Indie is a pretty reliable label for my taste.

Other Insights from 2018

I was able to calculate my top 10 artists, songs, and albums of the year, and drill down on the top 10 artists to see additional data about them (if it existed) in my iTunes library: other tracks, the date added, the kind of file (helping me identify whether it was purchased), and the length of the track.

Screen capture displaying top 10 artists, top 10 songs, top 10 albums of the year, with the artist Hubert Kirchner selected in the top 10 song list, with additional metadata about songs by Hubert Kirchner listed in a table below the top 10 lists, showing 3 songs by Hubert Kirchner along with the album, genre, rating, date_added, Kind, and track_length for the songs. Other highlights described in text.

There are quite a few common threads across the top 10 artists, songs, and albums, with Poolside, Young Fathers, Gilligan Moss, The Vaccines, and Justice making consistent appearances. The top 10 songs display obsessions with particular songs that outweigh aggregate popularity for their albums, leading other albums to become the top albums of the year.

Interestingly, the Polo & Pan album makes my top 10 albums while they don’t make it to my top 10 artist or song lists. This is also true for the album Dancehall by The Blaze. I’m not much of an album listener usually, but I know I listened to those albums several times.

The top 10 song list is dominated by specific songs that caught my attention, and the top 10 artists neatly reflect both lists. The artists with a bit more of a back catalog also reveal themselves: Born Ruffians managed to crack the top 10 despite not having any songs or albums make those lists, and Hey Rosetta! makes the top artist and album lists despite having no top songs.

Screen capture that says Songs Purchased in 2018. 285 songs.

I purchased 285 songs this year, an increase of 157 over the year before. I think I just bought songs more quickly after first hearing them this year, and there are even some songs missing from this list because I bought them on Beatport or Bandcamp when they weren’t available in the iTunes Store. While I caved and got Spotify Premium this year, I still kept an old promise to myself to buy music (rather than acquire it without paying for it, from a library or questionable download mechanisms) now that I can afford it.

A Year of Concerts

Screen capture of 4 single value data points, followed by 2 bar charts. Single value data points are total spent on concerts attended in 2018 ($1835.04), total concerts in 2018 (48), artists seen in concert in 2018 (116 artists), and total spent on concert tickets in 2018 ($2109). The first bar chart shows the number of concerts attended per month, 2 in January, 3 in February, 2 in March, 6 in April, 4 in May, 2 in June, 3 in July, 8 in August, 4 in September, 6 in October, 5 in November, and 3 so far in December. The last bar chart is the number of artists seen by month: 5 in Jan, 10 in Feb, 3 in March, 14 in April, 8 in May, 3 in June, 8 in July, 18 in August, 9 in Sep, 22 in Oct, 10 in Nov, 6 in December.

I’ve been to a lot of concerts this year. 48, to be exact. I spent a lot of money on concert tickets, both for the shows I attended this year and for shows that went on sale during 2018 (but, at this point, might be happening in 2019). I often buy tickets for multiple people, so this number isn’t very precise for my own personal ticket usage.

I managed to go to at least 2 concerts every month. By the time the year is over, I’m on track to have gone to 51 different shows. Based on the statistics, there are some months where I went to more than 1 show per week, and others where I didn’t. Especially apparent are the months with festivals: February, August, and October all included festivals that I attended.

Many of those festivals brought me to new-to-me locations, with the Noise Pop Block Party and Golden Gate Park giving me new perspectives on familiar places, and Lollapalooza after-shows bringing me out to Schubas Tavern for the first time in Chicago.

Screen capture listing venues visited for the first time in 2018, with venue, city, state, and date listed. Notable ones mentioned in text, full list of venue names: Audio, The New Parish, San Francisco Belle, Schubas Tavern, Golden Gate Park, August Hall, Noise Pop Block Party, Bergerac, Great American Music Hall, Cafe du Nord, Swedish American Hall.

If you’re reading this wondering what San Francisco Belle is: it’s a boat. That’s one of several new venues that electronic music brought me to, with DJ sets on that boat as part of Goldroom and Gigamesh’s tour, plus a day party at Bergerac and a nighttime set at Audio at other times throughout the year.

Some of those new venue locations brought newly-discovered music to me as well.

Screen capture showing top 20 artists discovered in 2018, sorted by count of listens, featuring a sparkline to show how frequently I listened to the artist throughout the year, and a first_discovered date. List: Gilligan Moss, The Blaze, Polo & Pan, Hubert Kirchner, Keita Sano, Jude Woodhead, Ben Böhmer, Karizma, Luxxury, SuperParka, Chris Malinchak, Mumford & Sons and Baaba Maal, Jon Hopkins, Yon Yonson,  Brandyn Burnette and dwilly, Asgeir, The Heritage Orchestra Jules Buckley and Pete Tong, Confidence Man, Bomba Estereo, and Jenn Champion.

The 20th-most-popular artist I discovered this year was Jenn Champion, who opened for We Were Promised Jetpacks at their show at the Great American Music Hall. I started writing this assuming that I hadn’t heard Jenn Champion before that night, but apparently I first discovered them on July 9, while the show wasn’t until October 9.

As it turns out, I listened to what is now my favorite song by Jenn Champion that day in July, likely as part of a Spotify algorithm-driven playlist (judging by the listening neighbors around the same time) but it didn’t stick until I saw them play live months later. The vagaries of playlists that refresh once a week can mean fleeting discoveries that you don’t really absorb.

Screen capture showing Splunk search results of artist, track_name, and time from July 9th. Songs near Jenn Champion's song in time include Mcbaise - Le Paradis Du Cuir, Wolf Alice - Don't Delete the Kisses (Tourist Remix) and Champyons - Roaming in Paris.
Other songs I listened to that day in July

Because of how easily I can search for things in Splunk, I was also curious to see what other songs I heard when I first discovered Hubert Kirchner, a great house artist.

Songs listened to around the same time as I first heard Hubert Kirchner's song Era.... I listened to Dion's song Dream Lover and Deradoorian's song You Carry the Dead (Hidden Cat Remix), followed by Hubert Kirchner, then listened to Miguel's song Sure Thing, How to Dress Well with What You Wanted, then Rihanna with Love on the Brain, Selena Gomez with Bad Liar, and Descendents with I'm the One. I have no idea how I got into this mix of songs.

I have really no idea what playlist I was listening to that might have led to me making jumps from Sofi Tukker, to Tanlines, to Dion, to Deradoorian, then to Hubert Kirchner, Miguel, How to Dress Well, Rihanna, Selena Gomez, and Descendents. Given that August 24th was a Friday, my best guess is perhaps that it was a Release Radar playlist, or perhaps an epic shuffle session. 

Repeat of the earlier screen capture showing the top 20 artists discovered in 2018, sorted by count of listens.

For the top 20 bands I discovered in 2018, many of them I started listening to on Spotify, but not necessarily because of Spotify. Gilligan Moss was a discovery from a collaborative playlist shared with those that are also in a Facebook group about concert-going. I later saw them at one of the festivals I went to this year, and it even turned out that a friend knew one of the band members! Their status as my most-listened-to discovery of this year is very accurate.

Polo & Pan was a discovery from a friend, fully brought to life with a playlist built by Polo & Pan themselves and shared on Spotify. I spent some quality time sitting in a park listening to that playlist and just enjoying life. They were at the same festival as Gilligan Moss, playing the same day, making that day a standout of my concerts this year.

Karizma was a discovery from Jamie xx’s set at Outside Lands. I tracked down the song from the set with the help of several other people on the internet (not necessarily anyone I knew), and then the song from the set wasn’t even on Spotify (Spotify, however, did help me discover more of the artist’s back catalog, like my other favorite song ’Nuffin Else). Apparently I was far behind the curve in hearing the song at that set, since Work It Out came out in 2017 and was featured in a Chromebook ad, but it still made me lose my mind. (For the record, so did Take Me Higher, a song I did not manage to track down at all; so much thanks to the person who messaged me on Facebook ages later to send me the link!)

Similarly, Luxxury was a DJ I first spotted on a cruise that I went on because it featured other DJs I had heard of from college, Goldroom and Gigamesh, whom I’d discovered through remixes of songs I downloaded from mp3 blogs like The Burning Ear.

~ Finding Meaning in the Platforms ~

Many of these discoveries were deepened by Spotify, or had Spotify as a vector (through a collaborative playlist, an algorithmically-generated one, or the quick back-catalog access for a new artist), but they don’t rely on Spotify as a platform. I prefer to keep my music listening habits platform-adjacent.

Spotify, SoundCloud, iTunes, Beatport, and the other music platforms I use help make my music experiences possible. But it’s the artists making the music, performing live in venues that I have the privilege to live near and can afford to visit, who create what keeps my mind alive and energized.

The social platforms, too, mediate the music-related experiences I’ve had, whether it’s with the people I share music and concert experiences with in a Facebook group, the people I exchange tracks and banter with in Slack channels, or those of you reading this on yet another platform.

I like to listen to music that moves me, physically, or that arrests my mind and takes me somewhere. More now than ever I realize that musical enjoyment for me is an intense instantiation of the continuous tension-and-release pattern that exists in so many human art forms. The waves of neatness that clash and collide in a house music track, or the soaring crescendos of harmonies. 

It’s become clear to me over the years that I can’t separate my enjoyment of music from the platforms that bring me closer to it. Perhaps supporting the platforms in addition to the musical artists, performers, and venues, is just another element of contributing to a thriving music scene.

Reflecting on a decade of (quantified) music listening

I recently crossed the 10 year mark of using Last.fm to track what I listen to.

From the first tape I owned (Train’s Drops of Jupiter) to the first CD (Cat Stevens’ Classics) to the first album I discovered by roaming the stacks at the public library (The Most Serene Republic’s Underwater Cinematographer) to the college radio station that shaped my adolescent music taste (WONC) to the college radio station that shaped my college experience (WESN), to the shift from tapes to CDs (with a radio walkman all the while), to the radio in my car, to SoundCloud and MP3 music blogs, to Grooveshark and later Spotify, with Windows Media Player and later an iTunes music library keeping me company throughout… it’s been quite a journey.

Some, but not all, of that journey has been captured while using the service Last.fm for the last 10 years. Last.fm “scrobbles” what you listen to as you listen to it, keeping a record of your listening habits and behaviors. I decided to add all this data to Splunk, along with my iTunes library and a list of concerts I’ve attended over the years, to quantify my music listening, acquisition, and attendance habits. Let’s go.

What am I doing?

Before I get any data in, I have to know what questions I’m trying to answer; otherwise I won’t get the right data into Splunk (my data analysis system of choice, because I work there). Even if I get the right data into Splunk, I have to make sure that the right fields are there to do the analysis that I want. This helped me prioritize certain scripts over others to retrieve and clean my data (because I can’t code well enough to write my own).

I also made a list of the questions that I wanted to answer with my data, and coded the questions according to the types of data that I would need to answer them. Things like:

  • What percentage of the songs in iTunes have I listened to?
  • What is my artist distribution over time? Do I listen to more artists now? Different ones overall?
  • What is my listen count over time?
  • What genres are my favorite?
  • How have my top 10 artists shifted year over year?
  • How do my listening habits shift around a concert? Do I listen to that artist more, or not at all?
  • What songs did I listen to a lot a few years ago, but not since?
  • What personal one hit wonders do I have, where I listen to one song by an artist way more than any other of their songs?
  • What songs do I listen to that are in Spotify but not in iTunes (that I should buy, perhaps)?
  • How many listens does each service have? Do I have a service bias?
  • How many songs are in multiple services, implying that I’ve probably bought them?
  • What’s the lag between the date a song or album was released and my first listen?
  • What geographic locations are my favorite artists from?

As the list goes on, the questions get more complex and require an increasing number of data sources. So I prioritized what was simplest to start, and started getting data in.


Getting data in…

I knew I wanted as much music data as I could get into the system. However, SoundCloud isn’t providing developer API keys at the moment, and Spotify requires authentication, which is a little beyond my skills right now. MusicBrainz also has a lot of great data, but intense rate-limiting, so I knew I’d want a strategy before approaching that metadata-gathering data source. That left me with three initial data sources: my iTunes library, my own list of concerts I’ve gone to, and my Last.fm account data.

Last.fm provides an endpoint that allows you to get the recent tracks played by a user, which was exactly what I wanted to analyze. I started by building an add-on for Last.fm with the Splunk Add-on Builder to call this REST endpoint. It was hard. When I first tried to do this a year and a half ago, the Add-on Builder didn’t yet support checkpointing, so I could only pull in data if I was actively listening and Splunk was running. Because I had installed Splunk on a laptop rather than a server in ~ the cloud ~, I was pretty limited in the data I could pull in. I pretty much abandoned the process until checkpointing was supported.

After the Add-on Builder started supporting checkpointing, I set it up again, but ran into issues: everything from forgetting to specify the from date in my REST call to JSON path decision-making that limited the number of results I could pull back at a time. I deleted the data from the add-on sourcetype many times, triple-checking the results each time before continuing.

I used a Python script (thanks, Reddit) to pull my historical data from Last.fm into Splunk, and to fill the gap between this initial backfill and the time it took me to get the add-on working, I used an NPM module. When you don’t know how to code, you’re at the mercy of the tools other people have developed. Adding the backfill data to Splunk also meant I had to adjust the max_days_ago default in props.conf, because Splunk doesn’t expect data from 10+ years ago by default. 2 scripts in 2 languages and 1 add-on builder later, I had a working solution and my Last.fm data in Splunk.
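
Backfill scripts like that one mostly amount to walking the paginated user.getRecentTracks endpoint while remembering a checkpoint. A condensed sketch of the idea (not the Reddit script itself):

```python
import time

import requests

API_ROOT = "http://ws.audioscrobbler.com/2.0/"

def backfill(user, api_key, from_ts=0):
    """Yield scrobbles newer than from_ts (a Unix-timestamp checkpoint)."""
    page = 1
    while True:
        resp = requests.get(API_ROOT, params={
            "method": "user.getrecenttracks",
            "user": user,
            "api_key": api_key,
            "format": "json",
            "from": from_ts,   # the parameter I originally forgot to set
            "limit": 200,      # maximum page size for this endpoint
            "page": page,
        }, timeout=10)
        resp.raise_for_status()
        recent = resp.json()["recenttracks"]
        for track in recent["track"]:
            if "date" in track:  # "now playing" entries have no date yet
                yield track
        if page >= int(recent["@attr"]["totalPages"]):
            break
        page += 1
        time.sleep(0.25)  # stay polite with the API
```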

To get the iTunes data in, I used an iTunes-to-CSV script on GitHub (thanks, StackExchange) to convert the Library.xml file into CSV. This worked great, but again, it was in a language I don’t know (Ruby), so I was at the mercy of a kind developer posting scripts on GitHub and limited to whatever fields their script supported. This, again, only handled backfill.

I’m still trying to sort out the regex and determine whether it’s possible to parse the iTunes Library.xml file in its entirety and add it to Splunk without too much of a headache, and/or get it set up so that I can add newly-added library songs to Splunk ad hoc without converting the entries some other way. Work in progress, but I’m pretty close to getting that working, thanks to help from some regex gurus in the Splunk community.
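
One regex-free alternative worth noting: Library.xml is a plist under the hood, so Python’s standard plistlib can parse it directly. A sketch, assuming the standard iTunes library layout:

```python
import plistlib

def itunes_tracks(path="Library.xml"):
    """Yield one flat dict per track in an iTunes Library.xml file."""
    with open(path, "rb") as f:
        library = plistlib.load(f)
    # "Tracks" maps a track ID to a dict with fields like Name, Artist,
    # Album, Genre, Total Time (ms), Play Count, Date Added, and Kind.
    for track in library["Tracks"].values():
        yield {
            "track_name": track.get("Name"),
            "artist": track.get("Artist"),
            "track_length": track.get("Total Time"),
            "play_count": track.get("Play Count", 0),
            "date_added": track.get("Date Added"),
            "kind": track.get("Kind"),
        }
```

Each emitted dict could then be written out as CSV or JSON for Splunk to ingest as new songs get added.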

For the concert data, I added the data I had into the Lookup File Editor app and was up and running. Because of some column header choices I made for how to organize my data, and the fact that I chose to maintain a lookup rather than add the information as events, I was in for some more adventures in search, but this data format made it easy to add new concerts as I attend them.

Answer these questions…with data!

I built a lot of dashboard panels to answer the questions I mentioned earlier, along with some others. I was spurred on by my brother recommending a song for me to listen to. I was pretty sure I’d heard the song before, and decided to use data to verify it.

Screen image of a chart showing the earliest listens of tracks by the band VHS collection.

I’d first heard the song he recommended to me, Waiting on the Summer, in March. Hipster credibility: intact. Having this dashboard panel now lets me answer the question “When was the first time I listened to an artist, and which songs did I hear first?” I added a second panel later to compare the earliest listens with the play counts of songs by the artist. Sometimes the first song I’d heard by an artist was their most-listened song, but often not.
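
The search behind that panel reduces to “minimum timestamp per track for a given artist.” In Python terms the logic looks something like this (data shapes are hypothetical):

```python
def earliest_listens(scrobbles, artist):
    """scrobbles: iterable of (timestamp, artist, track) tuples, e.g. from
    the Last.fm data; returns (track, first_listen_time) pairs."""
    firsts = {}
    for ts, scrobble_artist, track in scrobbles:
        if scrobble_artist == artist and ts < firsts.get(track, float("inf")):
            firsts[track] = ts
    # Sort so the first song I ever heard by the artist comes out on top.
    return sorted(firsts.items(), key=lambda item: item[1])
```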

Another question I wanted to answer was “how many concerts have I been to, and what’s the distribution in my concert attendance?”

Screen image showing concerts attended over time, with peaks in 2010 and 2017.

It’s pretty fun to look at this chart. I went to a few concerts while I was in high school, but never more than one a month and rarely more than a few per year. The pace picked up while I was in college, especially while I was dating someone who liked going to concerts. A slowdown as I studied abroad and finished college, then it picks up for a year as I get settled in a new town. But after I get settled in a long-term relationship, my concert attendance drops off, to where I’m going to fewer shows than I did in high school. As soon as I’m single again, that shifts dramatically, and now I’m going to 1 or more shows a month. The personal stories and patterns revealed by the data are the fun part for me.

I answered some more questions, especially those that could be answered with fun graphs, such as: in which states is my concert attendance concentrated?

Screen image of a map of the contiguous united states, with Illinois highlighted in dark blue, indicating 40+ concerts attended in that state, California highlighted in a paler blue indicating 20ish shows attended there, followed by Michigan in paler blue, and finally Ohio, Wisconsin, and Missouri in very pale blue. The rest of the states are white, indicating no shows attended in those states.

It’s easy to tell where I’ve spent most of my life living so far, but again the personal details tell a bigger story. I lived in Michigan longer than I’ve lived in California so far, but I’ve spent more time single in California, thus attending more concerts.

Speaking of California, I also wanted to see what my most-listened-to songs were since moving to California. I used a trellis visualization to split the songs by artist, allowing me to identify artists that were more popular with me than others.

Screen image showing a "trellis" visualization of top songs since moving to California. Notable songs are Carly Rae Jepsen "Run Away With Me" and Ariana Grande "Into You" and CHVRCHES with their songs High Enough to Carry You Over and Clearest Blue and Leave a Trace.

I really liked the CHVRCHES album Every Open Eye, so I have three songs from that album in the list. I also spent some time with a four-song playlist featuring Adele’s song Send My Love (To Your New Lover), Ariana Grande’s Into You, Carly Rae Jepsen’s Run Away With Me, and Ingrid Michaelson’s song Hell No. Somehow two breakup songs and two love songs were the perfect juxtaposition for a great playlist. I liked it enough that all four songs are in this list (though only half of it is visible in this screenshot). That’s another secret behind the data.

I also wanted to do some more analytics on my concert data, and decided to figure out what my favorite venues were. I had some guesses, but wanted to see what the data said.

Screen image of most visited concert venues, with The Metro in Chicago taking the top spot with 6 visits, followed by First Midwest Bank Amphitheatre (5 visits), Fox Theater, Mezzanine, Regency Ballroom, The Greek Theatre, and The Independent with 3 visits each.

The Metro is my favorite venue in Chicago, so it’s no surprise that it came in first in the rankings (I also later corrected the data to use its proper name, “Metro”, so that I could drill down from the panel to a Google Maps search for the venue). First Midwest Bank Amphitheatre hosted Warped Tour, which I (apparently) attended 5 times over the years. Since moving to California it seems like I don’t have a favorite venue based on visits alone, but it’s really The Independent, followed by Bill Graham Civic Auditorium, which doesn’t even make this list. Number of visits doesn’t automatically equate to favorite.

But what does it MEAN?

I could do data analysis like that all day. But what else do I learn by just looking at the data itself?

I can tell that Last.fm didn’t handle the shift to mobile and portable devices very well. It thrives when all of your listening happens on your laptop, and it can grab the scrobbles from your iPod or other device when you plug it into your computer. But as soon as internet-connected devices got popular (and I started using them), overall scrobbled listens dropped. In addition to devices, the rise of streaming music on sites like Grooveshark and SoundCloud, which replaced the MediaFire-hosted and MegaUpload-hosted free music shared on music blogs, also meant trouble for my data integrity. Last.fm didn’t handle listens on the web then, and only handles them through a fragile extension now.

Two graphs depicting distinct song listens and distinct artist listens, respectively, with a peak and steady listens through 2008-2012, then it drops down to a trough in 2014 before coming up to half the amount of 2010 and rising slightly.

Distinct songs and artists listened to in Last.fm data.

But that’s not the whole story. I also got a job and started working in an environment where I couldn’t listen to music at work, and due to other circumstances I wasn’t listening to music at home much either. Given that the count plummets to near-zero, it’s possible there were also data issues at play. It’s imperfect, but still fascinating.

What else did I learn?

Screen image showing 5 dashboard panels. Clockwise, the upper left shows a trending indicator of concerts attended per month, displaying 1 for the month of December and a net decrease of 4 from the previous month. The next shows the overall number of concerts attended, 87 shows. The next shows the number of iTunes library songs with no listens: 4272. The second to last shows a pie chart showing that nearly 30% of the songs have 0 listens, 23% have 1 listen, and the rest are a variety of listen counts. The last indicator shows the total number of songs in my iTunes library, or 16202.

I have a lot of songs in my iTunes library. I haven’t listened to nearly 30% of them. I’ve listened to nearly 25% of them only once. That’s the majority of my music library. If I split that by rating, however, it would get a lot more interesting. Soon.

You can’t see the fallout from my own personal Music-ocalypse in this data, because the Library.xml file doesn’t know which songs don’t point to actual files, or at least my version of it doesn’t. I’ll need higher-fidelity data to determine the “actual” size of my library and perform more analyses.
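
If your Library.xml export does include each track’s Location field (a file:// URL; mine may not, hence the need for higher-fidelity data), one way to measure the damage would be to check each location against the filesystem:

```python
import os
import plistlib
from urllib.parse import unquote, urlparse

def missing_tracks(path="Library.xml"):
    """Return names of tracks whose Location no longer points to a real file."""
    with open(path, "rb") as f:
        tracks = plistlib.load(f)["Tracks"].values()
    missing = []
    for track in tracks:
        location = track.get("Location")  # a file:// URL, when present
        if location and not os.path.exists(unquote(urlparse(location).path)):
            missing.append(track.get("Name"))
    return missing
```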

I need more data in general, and more patience, to perform the analyses to answer the more complex questions I want to answer, like my listening habits of particular artists around a concert. As it is, this is a really exciting start.

If you want more details about the actual Splunking I did to do these analyses, I wrote a post for the official Splunk blog, published on January 4th: 10 Years of Listens: Analyzing My Music Data with Splunk.