Wrapping up 2020: Spotify, SoundCloud, and Last.fm data

Another year, another Spotify Wrapped campaign, another effort to analyze the music data that I collect and compare it to what Spotify produces. This year I have last.fm listening habit data, concert attendance and ticket purchase data, livestream view activity data, my SoundCloud 2020 Playback playlist, and the tracks on my Spotify top 100 songs of 2020 playlist.

Screenshot of Spotify Wrapped header image, top artists of disclosure, lane 8, kidnap, tourist, and amtrac, top songs of apricots, atlas, idontknow, cappadocia, know your worth, minutes listened of 59,038 and top genre of house.

It’s always important to point out that the Spotify Wrapped campaign only covers the time period from January 1st, 2020 to October 31st, 2020. I discuss the effects of this misleading time period in Communicate the data: How missing data biases data-driven decisions. Of course, because I’m writing this post on December 2nd, nearly the entire month of December is missing from my own analyses. I’ll follow up (on Twitter) about any data insights that change over the next few weeks.

Top Artists of the Year

screenshot of spotify wrapped top artists, content duplicated in surrounding text.

Spotify says my Top 5 artists of the year are: 

  1. Disclosure
  2. Lane 8 
  3. Kidnap
  4. Tourist 
  5. Amtrac

My own data shows some slight permutations.

Screenshot of Splunk table showing top 10 artists in order: tourist with 156 listens, amtrac with 155 listens, booka shade with 147 listens, jacques greene with 134 listens, lane 8 with 129 listens, bicep with 128 listens, kidnap with 114 listens, ben böhmer with 111 listens, cold war kids with 110 listens, and sjowgren with 99 listens

My top 5 artists are nearly the same, but much more influenced by music that I’ve purchased. The overall list instead looks like:

  1. Tourist
  2. Amtrac
  3. Booka Shade
  4. Jacques Greene
  5. Lane 8

For the second year in a row, Tourist is my top artist! Kidnap still makes it into the top 10, as my 7th-most-listened-to artist of 2020 so far.

Disclosure, somewhat hilariously, doesn’t even break the top 10 artists if I rely on Last.fm data instead of only Spotify. What’s going on there? It turns out Disclosure is my 11th-most-listened-to artist, with 97 total listens so far this year. If I dig a little bit deeper and look at Know Your Worth, the Disclosure song that Spotify says I’ve listened to the most in 2020, I can see exactly why this is happening.

Screenshot showing the track_name Know Your Worth listed 5 times, with different artist permutations each time, Khalid, Disclosure & Khalid, Disclosure & Blick Bassy, Khalid & Disclosure, and Khalid, with total listens of 20 for all permutations.

Disclosure’s latest album, ENERGY, includes a number of collaborations. Disclosure is the main artist for most of these tracks, but in some cases (like with Know Your Worth, which came out as a single February 4, 2020) the artist can be inconsistently stored by different services.

As a result, the Last.fm data has a number of different entries for the same track, with differently-listed artists for each one. Last.fm stores only one artist per track, whereas Spotify stores an array of artists for each track. This data structure decision means that Disclosure should have had about 127 total listens, and been my 7th-most-listened-to artist of 2020, instead of 11th. 

This truncated screenshot shows some examples of the permutations of data that exist in my Last.fm data collection, with a total listen count of 127 for Disclosure during 2020. 

Screenshot showing additional permutations of Disclosure artist data, such as Disclosure & slowthai, Disclosure & Common, and Disclosure & Channel Tres.
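The recount above can be sketched in Python. This is a minimal sketch, not the Splunk search behind the screenshots: the separator list and the function names `split_artists` and `listens_per_artist` are assumptions made for illustration, and the sample rows are made up rather than taken from the real data.

```python
import re
from collections import Counter

# Separators that show up in combined artist strings such as
# "Disclosure & Khalid". This list is an assumption, not exhaustive.
_SEPARATORS = re.compile(r"\s*(?:&|,|\bfeat\b\.?|\bft\b\.?)\s*", flags=re.IGNORECASE)


def split_artists(artist_field):
    """Split one combined artist string into individual artist names."""
    return [name.strip() for name in _SEPARATORS.split(artist_field) if name.strip()]


def listens_per_artist(scrobbles):
    """Count listens per individual artist from (artist_field, track) pairs."""
    counts = Counter()
    for artist_field, _track in scrobbles:
        # Credit each listed artist once per listen.
        for artist in set(split_artists(artist_field)):
            counts[artist] += 1
    return counts


# Illustrative rows only.
sample = [
    ("Disclosure & Khalid", "Know Your Worth"),
    ("Khalid & Disclosure", "Know Your Worth"),
    ("Khalid", "Know Your Worth"),
    ("Disclosure", "ENERGY"),
]
print(listens_per_artist(sample))
```

Splitting on "&" is a blunt instrument: an act whose name legitimately contains an ampersand or a comma would be split incorrectly, so a real version would likely need an allow-list of such names.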

I had a sneaking suspicion that my Booka Shade listening habits are primarily concentrated on a few songs from an EP that he put out this year, so I dug into how many tracks my total listens for the year were spread across.

Table showing top 10 artists and total listens, with total tracks for each artist as well. Tourist has 62 tracks for 159 listens, Amtrac has 59 tracks for 155 listens, Booka Shade has 64 tracks for 147 listens, Jacques Greene has 33 tracks for 134 listens, Lane 8 has 60 tracks for 129 listens, Bicep has 46 tracks for 128 listens, Kidnap has 35 tracks for 114 listens, Ben Böhmer has 51 tracks for 111 listens, Cold War Kids has 53 tracks for 110 listens, and sjowgren has 15 tracks for 99 listens.

Instead, it turns out that my listens to Booka Shade are actually the most distributed across tracks of all of my top 10 artists. Sjowgren is also an outlier here, because they’ve never released an album, so they only have 15 songs in their overall discography yet still made the top 10 artist listens. 
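To make the spread concrete, here is a small Python sketch that uses the track and listen counts from the table above. The function name `listens_per_track` is made up for this example. It looks at two different measures: the number of distinct tracks, and the average number of listens per track.

```python
# (artist, distinct tracks, total listens), copied from the table above.
top_artists = [
    ("Tourist", 62, 159),
    ("Amtrac", 59, 155),
    ("Booka Shade", 64, 147),
    ("Jacques Greene", 33, 134),
    ("Lane 8", 60, 129),
    ("Bicep", 46, 128),
    ("Kidnap", 35, 114),
    ("Ben Böhmer", 51, 111),
    ("Cold War Kids", 53, 110),
    ("sjowgren", 15, 99),
]


def listens_per_track(rows):
    """Average listens per distinct track for each artist."""
    return {artist: listens / tracks for artist, tracks, listens in rows}


ratios = listens_per_track(top_artists)
most_tracks = max(top_artists, key=lambda row: row[1])[0]
most_repeated = max(ratios, key=ratios.get)

print("Most distinct tracks:", most_tracks)
print("Most listens per track:", most_repeated, round(ratios[most_repeated], 1))
```

The two measures answer slightly different questions. Distinct tracks shows how wide the listening was, while listens per track shows how often the same songs came back, which is where sjowgren stands out.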

Returning to my comparison between Spotify and Last.fm data, Amtrac and Lane 8 are in both top 5 lists. This is somewhat expected: if I look at the top 10 list of artists that I’ve most consistently listened to (ranked by the number of months of 2020 in which I’ve listened to them at least once), both Amtrac and Lane 8 place high in that list.

Screenshot of a table showing top 10 consistently listened to artists, with Lane 8 being listened to at least once in all 12 months of 2020, Amtrac 11 months, Caribou 11 months, Disclosure 11 months, Elderbrook 11 months, Kidnap 11 months, Kölsch 11 months, Tourist 11 months, Ben Böhmer 10 months, and CamelPhat for 10 months.

Given that only 2 days of December have happened as I write this, it’s unsurprising that I’ve only listened to one artist in every month of 2020. 
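The consistency measure described above, the number of distinct months with at least one listen, can be sketched in Python as follows. This is a sketch under assumptions rather than the search used for the table: the input shape (pairs of artist name and play time) and the name `months_listened` are hypothetical, and the sample dates are made up.

```python
from collections import defaultdict
from datetime import datetime


def months_listened(scrobbles, year):
    """Map each artist to the number of distinct months of `year` with a listen."""
    months = defaultdict(set)
    for artist, played_at in scrobbles:
        if played_at.year == year:
            months[artist].add(played_at.month)
    return {artist: len(seen) for artist, seen in months.items()}


# Illustrative rows only.
sample = [
    ("Lane 8", datetime(2020, 1, 5)),
    ("Lane 8", datetime(2020, 1, 20)),  # same month, counted once
    ("Lane 8", datetime(2020, 2, 3)),
    ("Amtrac", datetime(2020, 2, 14)),
    ("Amtrac", datetime(2019, 12, 31)),  # outside the year, ignored
]
print(months_listened(sample, 2020))
```

Using a set of months per artist means that a month with 50 listens counts exactly as much as a month with 1, which is the point of a consistency ranking.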

Top Songs of 2020

Enough about the artists—what about the songs? 

Screenshot of top 5 songs from spotify wrapped, duplicated in surrounding text.

According to Spotify, my Top 5 songs of the year are:

  1. Apricots by Bicep
  2. Atlas by Bicep
  3. Idontknow by Jamie xx
  4. Cappadocia by Ben Böhmer feat. Romain Garcia
  5. Know Your Worth by Disclosure feat. Khalid

That pretty closely matches my top 5 list according to Last.fm, with some notable exceptions.

Screenshot of Splunk table with top 10 songs of last.fm data, Apricots by Bicep with 38 listens, Atlas by Bicep with 32 listens, Idontknow by Jamie xx with 22 listens, White Ferrari (Greene Edit) by Jacques Greene with 21 listens, That Home Extended by The Cinematic Orchestra with 20 listens, Lalala by Y2K and bbno$ with 19 listens, Trish's Song by Hey Rosetta! with 18 listens, Wonderful by Burna Boy with 18 listens, Somewhere feat. Octavian by the Blaze with 17 listens, and Yes, I Know by Daphni with 17 listens.

My top 5 tracks according to Last.fm are:

  1. Apricots by Bicep (38 listens)
  2. Atlas by Bicep (32 listens)
  3. Idontknow by Jamie xx (22 listens)
  4. White Ferrari (Greene Edit) by Jacques Greene (21 listens)
  5. That Home Extended by The Cinematic Orchestra (20 listens)

The first 3 tracks match, though of course Spotify has an incomplete representation of those listens—I have 29 streams of Apricots according to Spotify.

However, since I bought the track almost as soon as it came out, I also have another 9 listens that happened off of Spotify. There were also some mysterious things happening with the connection between Spotify and Last.fm around that time, so it’s possible some listens are missing beyond these numbers.

What’s up with the 4th track on the list, though? Where is that in Spotify’s data? It’s actually a bootleg remix of the Frank Ocean song White Ferrari that Jacques Greene shared on SoundCloud and as a free download earlier this year, so it isn’t anywhere on Spotify. It did, however, make it onto my top tracks of 2020 on SoundCloud:

Screenshot of top 13 tracks in SoundCloud, with Jacques Greene - White Ferrari (JG Edit) listed as the 11th track.

Once again, this is a spot where metadata intrudes and leads to some inconsistent counts. If I look at all the permutations of White Ferrari and Jacques Greene in my data for 2020, the total number of listens should actually be a bit higher, at 23 total listens:

Screenshot of Splunk table showing the two permutations of the Jacques Greene remix, with 21 listens for the Greene Edit version and 2 listens for the JG Edit version, for a total of 23.

This would actually make it my 3rd-most popular song of 2020 so far, and I’m listening to it as I write this paragraph, so let’s go ahead and call that total number 24 listens. 

The 5th-most popular song and 7th-most popular song of 2020 make the case that I haven’t been sleeping very well this year (though I recall these tracks also showed up in 2019…), because those 2 tracks make up my “Insomnia” playlist, which I use to help me fall asleep on nights when I’ve been, perhaps, staying up too late doing data analysis like this.

You can see the influence of consistent listening habits with top artist behaviors when you look at the top 10 songs that I’ve consistently listened to throughout 2020, with 2 songs by Kidnap, one by Bicep, and another by Amtrac.

Table of tracks listened to consistently in 2020, Never Come Back by Caribou listened to at least once in 8 months of 2020, Start Again by Kidnap with 8 months, Accountable by Amtrac with 7 months, Atlas by Bicep with 7 months, Calling out by Sophie Lloyd with 7 months, Made to Stray by Mount Kimbie for 7 months, Moments (Ben Böhmer Remix) by Kidnap with 7 months, Somewhere feat. Octavian by the Blaze with 7 months, The Promise by David Spinelli with 7 months, and Without You My Life Would Be Boring by The Knife with 7 months.

To me, though, this table mostly underscores how much music discovery this year involved. I didn’t return to the same songs month after month during 2020. Likely as a result of all the DJ sets I’ve been streaming (as I mentioned in my post about Listening to Music while Sheltering in Place) this has been quite a year for music discovery, and breadth of listening habits. 

My top 10 songs of 2020 had a total of 222 listens across them. However, I have a total of 14,336 listens for the entire year, spread across 8,118 unique songs in total.

Even with possible metadata issues, that’s still quite the distribution of behavior. Let’s dig a bit deeper into artist discovery this year. 

Artist Discovery in 2020

In my post earlier this year about my listening behavior while sheltering in place, I discovered that my artist discovery numbers in 2020 seemed to be way up compared with 2018 and 2019, but weren’t actually that far off from 2017 numbers. 

What I see when comparing my 2020 artist discovery statistics from my Last.fm data and my Spotify data is even more interesting. In contrast to what seemed to be true in last year’s post, Wrapping up the year and the decade in music: Spotify vs my data, Spotify’s number is much higher than the number I calculated this year. (For what it’s worth, last year’s number should have been 1074 artists discovered instead of 2857. Data analysis is difficult.)

According to Spotify, I discovered 2,051 new artists, whereas my Last.fm data claims that I only discovered 1,497 artists this year. 

Similarly, Spotify claims that I listened to 4,179 artists this year, whereas my Last.fm data indicates that I listened to 3,715 artists. 

Again, this comes down to data structures and how the artist metadata is stored for each service. I wrote about the importance of quality metadata for digital streaming providers earlier this year in Why the quality of audio analysis metadatasets matters for music, but it’s also apparent that the data structures for those metadatasets are just as important for crafting data insights of varying value. 

Because Spotify stores all artists that contributed to a track as an array, I can listen to a track with 4 contributing artists on it, 1 of which I’ve listened to before, and according to Spotify, I’ve now discovered 3 artists and listened to 4. According to Last.fm, I’ll have listened to either 1 artist that I’ve already heard before or 1 new artist, possibly called “Luciano & David Morales”.

Screenshot of two artist names, Luciano, and Luciano & David Morales.

Spotify would store the second artist as Luciano, David Morales, thus allowing a more accurate count of listens for the Luciano artist. Similarly, my artist discovery data includes some flawed data, such as YouTube videos that got incorrectly recorded.

Screenshot of 3 artist names in my data, Billie Joe Armstrong of Green Day, Billy Joel and Jimmy Fallon Form 2, and Biosphere.

The Billy Joel and Jimmy Fallon duet of The Lion Sleeps Tonight never gets old, but it appears the original video is no longer on YouTube so I’m not going to link it.

Screenshot of two artist names in my data, &lez and 'Coming of age ceremony' Dance cover by Jimin and Jung Kook.

This becomes clear in my top 20 artist discoveries of 2020 chart, where BTS and Big Hit Labels are listed separately, although they are both indicative of one of my best friends joining BTS ARMY this year and sharing her enthusiasm with me. 

Giant table of top 20 artists discovered in 2020, in order with first_discovered date last:
Re.You with 85 listens starting July 12, 2020
Elliot Adamson, 75 listens, April 15 2020
Fennec, 53 listens, March 24 2020
Southern Shores, 52 listens, November 19 2020
Eelke Kleijn, 45 listens, August 10 2020
Christian Löffler, 43 listens, April 2 2020
Icarus, 35 listens, April 2 2020
Monkey Safari, 35 listens, April 15 2020
Black Motion, 34 listens, April 30 2020
BTS, 31 listens, September 29 2020
Bronson, 31 listens, May 9 2020
Love Regenerator, 30 listens, March 30 2020
Eltonnick, 29 listens, April 27 2020
Jerro, 27 listens, April 29 2020
Theo Kottis, 27 listens, June 16 2020
Dennis Cruz, 26 listens, June 22 2020
Da Capo, 25 listens, May 10 2020
Big Hit Labels, 21 listens, June 30 2020
HYENAH, 20 listens, June 4 2020
KC Lights, 20 listens, September 22 2020

Ultimately I’m grateful that the top 20 artists of 2020 are all artists that I discovered during the pandemic and have excellent songs that I love and continue to listen to. Many of the sparklines that represent my listening activity for these artists throughout the year have spikes, but mostly my listening patterns indicate that I’ve been returning to these artists and their songs multiple times after first discovery. Some notable favorites on this list are KC Lights’ track Girl and Dennis Cruz’s track El Sueño, plus the entire Fennec album Free Us Of This Feeling.

Genre Discovery in 2020

The most-commented-on data insight from #wrapped2020 is probably the genre discovery slide.

According to Spotify, I listened to 801 genres this year, including 294 new ones. I’m not even sure I could name 30 genres, let alone 300 or 800. Where are these numbers coming from? 

It turns out that, much like storing artist data as an array for each song, Spotify stores genre data as an array for each artist. This means that each artist can be assigned multiple genres, thus successfully inflating the number of genres that you’ve listened to in 2020. 

For example, if I use Spotify’s API developer console to retrieve the artist information for Tourist, with a Spotify ID of 2ABBMkcUeM9hdpimo86mo6, it turns out that he has 6 total genres associated with him in Spotify’s database: chillwave, electronica, indie soul, shimmer pop, tropical house, and vapor soul. 

Screenshot of JSON response from Spotify API call, content duplicated in surrounding text.
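The inflation effect can be sketched in a few lines of Python. The `genres` array on an artist object matches the response described above, and the Tourist genres are the ones listed in the text. The second artist and the function name `distinct_genres` are hypothetical and only there for illustration.

```python
def distinct_genres(artists):
    """Collect every distinct genre attached to any artist in the list."""
    genres = set()
    for artist in artists:
        genres.update(artist.get("genres", []))
    return genres


tourist = {
    "name": "Tourist",
    "genres": [
        "chillwave",
        "electronica",
        "indie soul",
        "shimmer pop",
        "tropical house",
        "vapor soul",
    ],
}

# Hypothetical second artist, not real API output.
example_artist = {"name": "Example Artist", "genres": ["electronica", "house"]}

listened = [tourist, example_artist]
print(len(listened), "artists,", len(distinct_genres(listened)), "genres")
```

Two artists already produce 7 distinct genres here, so a few thousand artists reaching several hundred genres is not surprising under this kind of counting.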

I could start discussing the possible meaninglessness of genres as a descriptive tool, the lack of validation possible for such a signifier, and the lack of clarity about how these genres were defined and assigned to specific artists, but that’s best left for another blog post.

Instead, let’s look at what little genre data I do have available to me more generally. 

According to Spotify, my top genres were:

  1. House
  2. Electronica
  3. Pop
  4. Afro House
  5. Organic House

All of these make sense to me, except for Organic House, because I don’t know what makes house music organic, unless it’s also grass-fed, locally-sourced, and free range. Perhaps Blond:ish is organic house. 

I don’t have any genre data from Last.fm, since the service only stores user-defined tags for each artist, and those are not included in the data that I collect from Last.fm today. Instead, I have the genres assigned by iTunes for the tracks that I’ve purchased from the iTunes store. 

The top 8 genres of music that I added to my iTunes library in 2020 by purchasing tracks from the iTunes store are:

  1. Dance (124 songs)
  2. Electronic (121 songs)
  3. House (78 songs)
  4. Pop (37 songs)
  5. Alternative (27 songs)
  6. Electronica (12 songs)
  7. Deep House (10 songs)
  8. Melodic House & Techno (9 songs)

Clearly, this is a very selective sample, and is only tied to select purchasing habits, which are roughly correlated to my listening habits.

I shared all of this genre data to essentially look at it and go “wow, that wasn’t very insightful at all”. Let’s move on. 

Time Spent Listening to Music in 2020

The last metric I want to unpack from Spotify’s #wrapped2020 campaign is the minutes listened data insight. According to Spotify, I spent 59,038 minutes listening to music this year. 

According to my own calculations, I spent roughly 81,134 minutes listening to music in 2020.

Let’s talk about how both of these metrics are super flawed!

Spotify counts a song as streamed after you listen to it for more than 30 seconds (per their Spotify for Artists FAQ), so it’s logical to assume that this minutes listened metric likely comes from a calculation of “number of streams for a track” x “length of track”, then rounded and converted to minutes. It could even result from a different type of calculation, “number of total streams” x “average length of track in Spotify library”, but I have no way of knowing if either of these is accurate besides tweeting at Spotify and hoping they’ll pay attention to me.

Unfortunately for all of us, but mostly me, my own minutes listened metric is just as lazily calculated. I don’t have track length data for all the tracks that I listen to and I don’t know at what point Last.fm counts a track as being worthy of a scrobble. I do have a list of how much time I spent listening to livestreamed DJ sets online, and I do have some excellent estimation skills. I calculated my number of 81,134 minutes so far in 2020 by calculating and assuming the following:

  • An average track length of 4 minutes
  • An average concert length of 3 hours
  • An average DJ set length of 4 hours
  • An average festival length of 8 hours

Using those averages and estimates, I calculated the total amount of time I spent listening to music across Last.fm listening habits, concerts and DJ sets attended (no festivals this year), and livestreams that I watched online, thus arriving at 81,134 minutes. That doesn’t count any DJ sets that I listened to on SoundCloud, and certainly the combination of a 4 minute track length estimate with the uncertainty of what qualifies a track as being scrobbled makes this data insight somewhat meaningless.
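The estimate described above can be written out as a short Python sketch. The four averages come from the list above. The function name `estimate_minutes` is made up, and the post does not give the concert, DJ set, or livestream inputs, so only the track portion is computed here, using the 14,336 total listens quoted earlier in this post as an assumed input.

```python
# Averages from the list above, converted to minutes.
TRACK_MINUTES = 4
CONCERT_MINUTES = 3 * 60
DJ_SET_MINUTES = 4 * 60
FESTIVAL_MINUTES = 8 * 60


def estimate_minutes(listens, concerts=0, dj_sets=0, festivals=0, livestream_minutes=0):
    """Estimate total minutes of music from counts and recorded livestream time."""
    return (
        listens * TRACK_MINUTES
        + concerts * CONCERT_MINUTES
        + dj_sets * DJ_SET_MINUTES
        + festivals * FESTIVAL_MINUTES
        + livestream_minutes
    )


# Track listens only; event counts and livestream minutes are not in the post.
track_minutes = estimate_minutes(listens=14_336)
print(track_minutes)
```

Under that assumption, track listens alone would account for 57,344 minutes, with concerts, DJ sets, and livestreams making up the rest of the total. Changing the 4 minute average by even 30 seconds moves that figure by more than 7,000 minutes, which shows how sensitive the estimate is.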

Regardless, let’s compare this estimated time spent listening in minutes against the total number of minutes in a year.

Total minutes listened (81,134) as a gauge compared with total minutes in a year (525,600)

Beautiful. I still remembered to sleep this year. No matter which dataset I use, however, it’s clear that I’ve listened to more music in 2020 than in 2019. Spotify’s metric for this same time period in 2019 was 35,496 minutes. The less-flawed but less-complete metric I used last year, calculated using the track length stored in iTunes multiplied by the number of listens for that track, indicated that I spent 14,296 minutes listening to music in 2019. 

As one final Spotify examination, let’s dig into the Spotify Top 100 playlist.

Top 100 Songs of 2020 Playlist

Alongside the fancy graphics and data insights in the #wrapped2020 campaign, Spotify also creates a 100 song playlist, likely (but not definitively) the top 100 songs of the time period between January 1st, 2020 and October 31st, 2020. 

I found my playlist this year to be relatively accurate, perhaps because I spent more time listening to Spotify than I might have in previous years, or perhaps they made some internal data improvements, or both! I often spend more time listening to SoundCloud if I’m traveling a lot, listening to offline DJ sets on plane flights; or listening to Apple Music on my iPhone, with songs that I’ve added from my iTunes library. Without much time spent commuting or traveling this year, it’s likely that my listening habits remained fairly consolidated. 

Similarly to what I discovered about my top 10 tracks, I had relatively distributed music interests this year. The 811 total listens for all 100 songs in my Spotify playlist represent just under 6% (a fraction of about 0.06) of my 14,336 total listens in 2020 so far.

Despite my overall listening habits being relatively distributed across lots of artists and songs, the Top Songs playlist is somewhat more consolidated, with 69 artists performing the 100 songs on the playlist. Nice. 

It’s clear that I spent most of this year exploring and discovering new artists, given that 83 of my top songs of 2020 according to Spotify were songs that I discovered in 2020. 

Thanks for coming on this journey through my music data with me. I’ll be back at the actual end of the year to dive deeper into my top 10 artists of the year, top 10 consistent artists of the year, my music purchasing activity, as well as some more livestream and concert statistics to round out my 2020 year in music. 

Reflecting on a decade of (quantified) music listening

I recently crossed the 10 year mark of using Last.fm to track what I listen to.

From the first tape I owned (Train’s Drops of Jupiter) to the first CD (Cat Stevens Classics) to the first album I discovered by roaming the stacks at the public library (The Most Serene Republic Underwater Cinematographer) to the college radio station that shaped my adolescent music taste (WONC) to the college radio station that shaped my college experience (WESN), to the shift from tapes, to CDs, (and a radio walkman all the while), to the radio in my car, to SoundCloud and MP3 music blogs, to Grooveshark and later Spotify, with Windows Media Player and later an iTunes music library keeping me company throughout…. It’s been quite a journey.

Some, but not all, of that journey has been captured while using the service Last.fm for the last 10 years. Last.fm “scrobbles” what you listen to as you listen to it, keeping a record of your listening habits and behaviors. I decided to add all this data to Splunk, along with my iTunes library and a list of concerts I’ve attended over the years, to quantify my music listening, acquisition, and attendance habits. Let’s go.

What am I doing?

Before I get any data in, I have to know what questions I’m trying to answer, otherwise I won’t get the right data into Splunk (my data analysis system of choice, because I work there). Even if I get the right data into Splunk, I have to make sure that the right fields are there to do the analysis that I wanted. This helped me prioritize certain scripts over others to retrieve and clean my data (because I can’t code well enough to write my own).

I also made a list of the questions that I wanted to answer with my data, and coded the questions according to the types of data that I would need to answer the questions. Things like:

  • What percentage of the songs in iTunes have I listened to?
  • What is my artist distribution over time? Do I listen to more artists now? Different ones overall?
  • What is my listen count over time?
  • What genres are my favorite?
  • How have my top 10 artists shifted year over year?
  • How do my listening habits shift around a concert? Do I listen to that artist more, or not at all?
  • What songs did I listen to a lot a few years ago, but not since?
  • What personal one hit wonders do I have, where I listen to one song by an artist way more than any other of their songs?
  • What songs do I listen to that are in Spotify but not in iTunes (that I should buy, perhaps)?
  • How many listens does each service have? Do I have a service bias?
  • How many songs are in multiple services, implying that I’ve probably bought them?
  • What’s the lag between the date a song or album was released and my first listen?
  • What geographic locations are my favorite artists from?

As the list goes on, the questions get more complex and require an increasing number of data sources. So I prioritized what was simplest to start, and started getting data in.

 

Getting data in…

I knew I wanted as much music data as I could get into the system. However, SoundCloud isn’t providing developer API keys at the moment, and Spotify requires authentication, which is a little bit beyond my skills at the moment. MusicBrainz also has a lot of great data, but has intense rate-limiting so I knew I’d want a strategy to approach that metadata-gathering data source. I was left with three initial data sources: my iTunes library, my own list of concerts I’ve gone to, and my Last.fm account data.

Last.fm provides an endpoint that allows you to get the recent tracks played by a user, which was exactly what I wanted to analyze. I started by building an add-on for Last.fm with the Splunk Add-on Builder to call this REST endpoint. It was hard. When I first tried to do this a year and a half ago, the add-on builder didn’t yet support checkpointing, so I could only pull in data if I was actively listening and Splunk was on. Because I had installed Splunk on a laptop rather than a server in ~ the cloud ~, I was pretty limited in the data I could pull in. I pretty much abandoned the process until checkpointing was supported.

After the add-on builder started supporting checkpointing, I set it up again, but ran into issues. Everything from forgetting to specify the from date in my REST call to JSON path decision-making that meant I was limited in the number of results I could pull back at a time. I deleted the data from the add-on sourcetype many times, triple-checking the results each time before continuing.
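For readers who want to see the moving parts outside of the add-on builder, here is a minimal Python sketch of the same idea: build the recent-tracks request with a from timestamp, then keep only plays newer than a stored checkpoint. It is not the add-on’s actual configuration. The function names are made up, and the response field names (`recenttracks`, `track`, `date.uts`, `artist.#text`) are written from memory of the Last.fm API, so check them against the official documentation before relying on them.

```python
from urllib.parse import urlencode

API_ROOT = "https://ws.audioscrobbler.com/2.0/"


def recent_tracks_url(user, api_key, from_ts, page=1, limit=200):
    """Build a user.getRecentTracks URL for plays after `from_ts` (UNIX seconds, UTC)."""
    params = {
        "method": "user.getrecenttracks",
        "user": user,
        "api_key": api_key,
        "format": "json",
        "from": from_ts,
        "limit": limit,
        "page": page,
    }
    return f"{API_ROOT}?{urlencode(params)}"


def parse_page(payload, checkpoint):
    """Return (events, new_checkpoint) for one decoded JSON page."""
    tracks = payload["recenttracks"]["track"]
    if isinstance(tracks, dict):  # a single result may not be wrapped in a list
        tracks = [tracks]
    events = []
    newest = checkpoint
    for track in tracks:
        if "date" not in track:  # the currently playing track has no timestamp yet
            continue
        played_at = int(track["date"]["uts"])
        if played_at <= checkpoint:  # already ingested on an earlier run
            continue
        events.append(
            {"artist": track["artist"]["#text"], "track": track["name"], "played_at": played_at}
        )
        newest = max(newest, played_at)
    return events, newest
```

The checkpoint is just the newest timestamp seen so far. Saving it between runs and passing it as `from_ts` on the next call is what lets the collection resume after the laptop has been off.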

I used a Python script (thanks Reddit) to pull my historical data from Last.fm to add to Splunk, and to fill the gap between this initial backfill and the time it took me to get the add-on working, I used an npm module. When you don’t know how to code, you’re at the mercy of the tools other people have developed. Adding the backfill data to Splunk also meant I had to adjust the MAX_DAYS_AGO default in props.conf, because Splunk doesn’t necessarily expect data from 10+ years ago by default. 2 scripts in 2 languages and 1 add-on builder later, I had a working solution and my Last.fm data in Splunk.

To get the iTunes data in, I used an iTunes to CSV script on GitHub (thanks StackExchange) to convert the library.xml file into CSV. This worked great, but again, it was in a language I don’t know (Ruby) and so I was at the mercy of a kind developer posting scripts on GitHub again. I was limited to whatever fields their script supported. This again only did backfill.

I’m still trying to sort out the regex and determine if it’s possible to parse the iTunes Library.xml file in its entirety and add it to Splunk without too much of a headache, and/or get it set up so that I can ad-hoc add new songs added to the library to Splunk without converting the entries some other way. Work in progress, but I’m pretty close to getting that working thanks to help from some regex gurus in the Splunk community.
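One possible alternative to regex, which is not what this post uses, is to treat the library file as the property list that it is. The sketch below uses Python’s standard `plistlib` and `csv` modules. The function names are made up, and the field list is an assumption based on common keys in an iTunes library export (such as Name, Artist, Genre, and Play Count), so adjust it to whatever the file actually contains.

```python
import csv
import io
import plistlib

# Assumed track keys; a real library export may include many more.
FIELDS = ["Name", "Artist", "Album", "Genre", "Play Count", "Date Added"]


def library_to_rows(plist_bytes, fields=FIELDS):
    """Parse library XML bytes and return one dict per track."""
    library = plistlib.loads(plist_bytes)
    rows = []
    for track in library.get("Tracks", {}).values():
        # Missing keys become empty strings so every row has the same columns.
        rows.append({field: track.get(field, "") for field in fields})
    return rows


def rows_to_csv(rows, fields=FIELDS):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

With a real file, the bytes would come from opening the library file in binary mode and reading it. Because every key on a track is available in the parsed dictionary, the output is not limited to the fields that someone else’s script happened to support.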

For the concert data, I added the data I had into the Lookup File Editor app and was up and running. Because of some column header choices I made for how to organize my data, and the fact that I chose to maintain a lookup rather than add the information as events, I was up for some more adventures in search, but this data format made it easy to add new concerts as I attend them.

Answer these questions…with data!

I built a lot of dashboard panels. I wanted to answer the questions I mentioned earlier, along with some others. I was spurred on by my brother recommending a song to me to listen to. I was pretty sure I’d heard the song before, and decided to use data to verify it.

Screen image of a chart showing the earliest listens of tracks by the band VHS collection.

I’d first heard the song he recommended to me, Waiting on the Summer, in March. Hipster credibility: intact. Having this dashboard panel now lets me answer the questions “when was the first time I listened to an artist, and which songs did I hear first?”. I added a second panel later, to compare the earliest listens with the play counts of songs by the artist. Maybe the first song I’d heard by an artist was the most listened song, but often not.
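The logic behind that panel can be sketched in Python. This is an illustration rather than the Splunk search itself: the input shape (artist, track, play date) and the name `earliest_listens` are hypothetical, and the dates in the sample are made up, since the post only says the first listen was in March.

```python
from datetime import date


def earliest_listens(scrobbles, artist):
    """Return (track, first_played) pairs for `artist`, oldest first."""
    first_seen = {}
    for scrobble_artist, track, played_on in scrobbles:
        if scrobble_artist.casefold() != artist.casefold():
            continue
        if track not in first_seen or played_on < first_seen[track]:
            first_seen[track] = played_on
    return sorted(first_seen.items(), key=lambda item: item[1])


# Illustrative rows only; the dates are invented.
sample = [
    ("VHS Collection", "Waiting on the Summer", date(2017, 6, 2)),
    ("VHS Collection", "Waiting on the Summer", date(2017, 3, 14)),
    ("VHS Collection", "Another Track", date(2017, 4, 1)),
    ("Some Other Band", "Unrelated Song", date(2016, 1, 1)),
]
for track, first_played in earliest_listens(sample, "vhs collection"):
    print(first_played.isoformat(), track)
```

Pairing this with a per-track play count would give the second panel described above, where the first song heard can be compared with the most played one.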

Another question I wanted to answer was “how many concerts have I been to, and what’s the distribution in my concert attendance?”

Screen image showing concerts attended over time, with peaks in 2010 and 2017.

It’s pretty fun to look at this chart. I went to a few concerts while I was in high school, but never more than one a month and rarely more than a few per year. The pace picked up while I was in college, especially while I was dating someone that liked going to concerts. A slowdown as I studied abroad and finished college, then it picks up for a year as I get settled in a new town. But after I get settled in a long-term relationship, my concert attendance drops off, to where I’m going to fewer shows than I did in high school. As soon as I’m single again, that shifts dramatically and now I’m going to 1 or more shows a month. The personal stories and patterns revealed by the data are the fun part for me.

I answered some more questions, especially those that could be answered by fun graphs, such as: in which states have I attended the most concerts?

Screen image of a map of the contiguous united states, with Illinois highlighted in dark blue, indicating 40+ concerts attended in that state, California highlighted in a paler blue indicating 20ish shows attended there, followed by Michigan in paler blue, and finally Ohio, Wisconsin, and Missouri in very pale blue. The rest of the states are white, indicating no shows attended in those states.

It’s easy to tell where I’ve spent most of my life living so far, but again the personal details tell a bigger story. I spent more time in Michigan than I have lived in California so far, but I’ve spent more time single in California so far, thus attending more concerts.

Speaking of California, I also wanted to see what my most-listened-to songs were since moving to California. I used a trellis visualization to split the songs by artist, allowing me to identify artists that were more popular with me than others.

Screen image showing a "trellis" visualization of top songs since moving to California. Notable songs are Carly Rae Jepsen "Run Away With Me" and Ariana Grande "Into You" and CHVRCHES with their songs High Enough to Carry You Over and Clearest Blue and Leave a Trace.

I really liked the CHVRCHES album Every Open Eye, so I have three songs from that album. I also spent some time with a four song playlist featuring Adele’s song Send My Love (To Your New Lover), Ariana Grande’s Into You, Carly Rae Jepsen’s Run Away With Me, and Ingrid Michaelson’s song Hell No. Somehow two breakup songs and two love songs were the perfect juxtaposition for a great playlist. I liked it enough to where all four songs are in this list (though only half of it is visible in this screenshot). That’s another secret behind the data.

I also wanted to do some more analytics on my concert data, and decided to figure out what my favorite venues were. I had some guesses, but wanted to see what the data said.

Screen image of most visited concert venues, with The Metro in Chicago taking the top spot with 6 visits, followed by First Midwest Bank Amphitheatre (5 visits), Fox Theater, Mezzanine, Regency Ballroom, The Greek Theatre, and The Independent with 3 visits each.

The Metro is my favorite venue in Chicago, so it’s no surprise that it came in first in the rankings. (I later corrected the data to use its proper name, “Metro”, so that I could drill down from the panel to a Google Maps search for the venue.) First Midwest Bank Amphitheatre hosted Warped Tour, which I attended (apparently) 5 times over the years. Since moving to California, it seems like I don’t have a favorite venue based on visits alone, but it’s really The Independent, followed by Bill Graham Civic Auditorium, which doesn’t even make this list. Number of visits doesn’t automatically equate to favorite.
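The venue name correction matters for the ranking, because the same venue recorded under two spellings gets counted as two venues. Here is a minimal Python sketch of that idea, not the Splunk search I actually ran. The alias table, the function name, and the sample records are all hypothetical; the records are only sized to match the visit counts in the panel.

```python
from collections import Counter

# Hypothetical alias table: maps a raw venue string to its proper name.
VENUE_ALIASES = {"The Metro": "Metro"}


def count_venue_visits(venues):
    """Count visits per venue after normalizing known aliases."""
    return Counter(VENUE_ALIASES.get(name, name) for name in venues)


# Illustrative records only, not my real concert log.
visits = (
    ["The Metro"] * 4
    + ["Metro"] * 2
    + ["The Independent"] * 3
    + ["Fox Theater"] * 3
)

counts = count_venue_visits(visits)
for venue, n in counts.most_common():
    print(venue, n)
```

Without the alias step, the two spellings would split into separate rows of 4 and 2 visits and the top spot would look less decisive than it is.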

But what does it MEAN?

I could do data analysis like that all day. But what else do I learn by just looking at the data itself?

I can tell that Last.fm didn’t handle the shift to mobile and portable devices very well. It thrives when all of your listening happens on your laptop, and it can grab the scrobbles from your iPod or other device when you plug it into your computer. But as soon as internet-connected devices got popular (and I started using them), my scrobbled listens dropped overall. In addition to devices, the rise of streaming music on sites like Grooveshark and SoundCloud, which replaced the free MediaFire-hosted and MegaUpload-hosted music shared on music blogs, also meant trouble for my data integrity. Last.fm didn’t handle listens on the web then, and only handles them through a fragile extension now.

Two graphs depicting distinct song listens and distinct artist listens, respectively, with a peak and steady listens through 2008-2012, then it drops down to a trough in 2014 before coming up to half the amount of 2010 and rising slightly.

Distinct songs and artists listened to in Last.fm data.

But that’s not the whole story. I also got a job and started working in an environment where I couldn’t listen to music at work, so I wasn’t listening to music there, and I also wasn’t listening to music at home much due to other circumstances. Given that the count plummets to near-zero, it’s possible there were also data issues at play. It’s imperfect, but still fascinating.

What else did I learn?

Screen image showing 5 dashboard panels. Clockwise, the upper left shows a trending indicator of concerts attended per month, displaying 1 for the month of December and a net decrease of 4 from the previous month. The next shows the overall number of concerts attended, 87 shows. The next shows the number of iTunes library songs with no listens: 4272. The second to last shows a pie chart showing that nearly 30% of the songs have 0 listens, 23% have 1 listen, and the rest are a variety of listen counts. The last indicator shows the total number of songs in my iTunes library, or 16202.

I have a lot of songs in my iTunes library. I haven’t listened to nearly 30% of them. I’ve listened to nearly 25% of them only once. That’s the majority of my music library. If I split that by rating, however, it would get a lot more interesting. Soon.
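As a quick sanity check, the two single-value panels are enough to work out the never-played share directly. This is a small Python sketch using only the two counts from the dashboard (the variable names are mine), and it lands a little below the rounded pie chart slice.

```python
# Counts taken from the dashboard panels described above.
total_songs = 16202
never_played = 4272

# Share of the library with zero listens.
share = never_played / total_songs
print(f"{share:.1%}")  # prints 26.4%
```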

You can’t see the fallout from my own personal Music-ocalypse in this data, because the Library.xml file doesn’t know which songs don’t point to actual files, or at least my version of it doesn’t. I’ll need more high-fidelity data to determine the “actual” size of my library, and perform more analyses.

I need more data in general, and more patience, to perform the analyses to answer the more complex questions I want to answer, like my listening habits of particular artists around a concert. As it is, this is a really exciting start.

If you want more details about the actual Splunking I did for these analyses, I’ll be publishing a post on the official Splunk blog. That got posted on January 4th! Here it is: 10 Years of Listens: Analyzing My Music Data with Splunk.