
Why the quality of audio analysis metadatasets matters for music

March 16, 2020 / Sarah

I’ve been thinking for some time about the derived metadata that Spotify and other digital streaming services construct from the music on their platforms. Spotify’s current business revolves around providing online streaming access to music and podcasts, as well as related content like playlists, to users. 

Like any good SaaS business, their primary goal is to acquire and keep customers. As a digital streaming service business, the intertwined goal is to provide quality content to those customers. The best way to do both is to derive and collect metadata, not only about customer usage patterns but also about the content being delivered to the customers. The more you know about the content being delivered, the more you can create new distribution mechanisms for the content and make informed deals to acquire new content.

Creating metadatasets from the intellectual property of artists

Today, when labels and distributors provide music to digital streaming services (artists can’t provide it directly), they grant those services permission to make the music tracks available to users of the digital streaming services. Based on my review of the Spotify Terms and Conditions of Use, the Spotify for Artists Terms and Conditions, and the distribution agreement for a commonly-used distribution service, DistroKid, artists don’t grant explicit permission for what services do next: creating metadata about those tracks. One exception in the DistroKid distribution agreement: if artists sign up for an additional service, DistroLock, they are bound by an additional addendum granting the service permission to create an audio fingerprint that uniquely represents the track, so that it can be used for copyright enforcement and possibly to pay out royalties.

In his book Metadata, Jeffrey Pomerantz defines metadata as “a means by which the complexity of an object is represented in a simpler form.” In this case, streaming services like Spotify create different types of metadata to represent the complexity of music with various audio features, audio analysis statistics, and audio fingerprints. The services also gather “use metadata” about how customers use their services—at what point in a song a person hits skip, what devices they use to listen, their location when listening, and other data points.

Creating metadatasets is crucial to delivering content

Pandora has patents for the types of music metadata that they create, the metadata behind the Music Genome Project. Spotify also has patents to do the same (including a crucial one from their acquisition of the Echo Nest), as well as many that cover the various applications of those metadata.

These companies can use these metadatasets as marketing tools, as we’ve seen with the #SpotifyWrapped campaign; to correlate the music metadata with use metadata, such as to create new music marketing methods like contextual playlists; to select advertising that matches up sonically well with the tracks being listened to; and to provide these insights to artists and labels, making them more reliant on their service as a distribution and marketing mechanism.

Spotify currently provides a subset of the insights they derive from the combination of use metadata with music track metadata to artists with the Spotify for Artists service. The end user license agreement for the service makes it clear that it’s a free service and Spotify cannot be held responsible for the relative accuracy of the data available. Emphasis mine: 

Spotify for Artists is a free service that we are providing to you for use at our discretion. Spotify for Artists may provide you with the ability to view demographic data on your fans and usage data of your music. While we work hard to ensure the accuracy of the data, we do not guarantee that the Spotify for Artists Service or the data that we collect from the Service will be error-free or that mistakes, including mistakes in the data insights that we provide to you, will not happen from time to time.

Spotify for Artists Terms and Conditions

It’s likely that some labels have already negotiated access to various insights and metadata that Spotify creates and collects. 

Other valuable insights that can be derived from these metadatasets include: the types of music that people listen to in certain cities, which tracks are most popular in certain cities, what types of music people tend to listen to in different seasons, and even what types of music people of different ages, genders, education levels, and classes tend to listen to. 

These insights, provided to artists, labels, and distributors, guide marketing campaigns, tour planning, artist-specific investments, and even music production styles. The thing is, it’s tough to decipher exactly how these companies create the metadatasets that all these valuable insights rely on, and how (if at all) the accuracy of that metadata is validated.

How the metadatasets get made

In an episode of Vox Earworm, the journalist Matt Daniels of The Pudding and Estelle Caswell of Vox briefly discuss how the metadatasets of Spotify and Pandora were created, pointing out that Spotify has 35 million songs, with an algorithmically-generated metadataset, while Pandora has only 2 million songs, with 450 total attributes defined and applied to those songs by a combination of trained musicologists and algorithms. Their discussion starts at 1:45 in the episode and continues for about 90 seconds.

The features in the metadatasets have been defined by algorithms written by trained musicologists, amateur musicians, or even ordinary data scientists without musical training or expertise. The specific features collected by Spotify are publicly available from their audio features and audio analysis API endpoints, and both include metadata that objectively describes each track, such as duration, as well as more subjective features such as acousticness, liveness, valence, and instrumentalness.

The more detailed audio analysis API splits each track into sections and segments, and computes features and confidence levels for each of those sections and segments.
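To make the shape of that metadata concrete, here’s a minimal sketch in Python (using the requests library) of pulling the feature set for one track from the audio features endpoint. The access token and track ID are placeholders, error handling is omitted, and this is my illustration rather than Spotify’s own tooling:

```python
# Minimal sketch: fetch Spotify's derived audio features for one track.
# Assumes you already have an OAuth access token; both values below
# are placeholders.
import requests

ACCESS_TOKEN = "YOUR_OAUTH_TOKEN"
TRACK_ID = "SOME_TRACK_ID"

resp = requests.get(
    f"https://api.spotify.com/v1/audio-features/{TRACK_ID}",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
features = resp.json()

print(features["duration_ms"])   # objective: track length in milliseconds
print(features["acousticness"])  # subjective: 0.0-1.0 confidence score
print(features["liveness"])      # subjective: likelihood of a live audience
print(features["valence"])       # subjective: musical "positiveness"
```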

Spotify, building off the Echo Nest technology, relies on web scraping and algorithms to create these metadatasets. According to a patent filed by the Echo Nest in 2011, three different types of metadata are created:

  • Acoustic metadata, which is the “numerical or mathematical representation of the sound of a track”,
  • Cultural metadata, which “refers to text-based information describing listener’s reactions to a track or song”, and
  • Explicit metadata, which “refers to factual or explicit information relating to music”. 

The explicit metadata is information such as “track name”, “artist name”, or “composer”, while the acoustic metadata can be an acoustic fingerprint to represent the song, or can include features like “tempo, rhythm, beats, tatums, or structure, and spectral information such as melody, pitch, harmony, or timbre.” The cultural metadata is where the more subjective features come from, and it can come from a variety of different subjective sources: “expert opinion such as music reviews”, “listeners through Web sites, chat rooms, blogs, surveys, and the like”, as well as information “generated by a community of listeners and automatically retrieved from Internet sites, chat rooms, blogs, and the like.” The patent gives other examples such as “sales data, shared collections, lists of favorite songs, and any text information that may be used to describe, rank, or interpret music.” It can also build off of existing databases made available by companies like Gracenote, AllMusic (referenced as AMG, now RhythmOne, in the patent), and others.
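For illustration only, here’s one way to picture those three metadata types side by side for a single track. The field names and values below are my own invention, not a schema from the patent:

```python
# Invented illustration of the patent's three metadata types for one
# track; none of these field names come from the patent itself.
track_metadata = {
    "explicit": {  # factual information about the music
        "track_name": "Example Track",
        "artist_name": "Example Artist",
        "composer": "Example Composer",
    },
    "acoustic": {  # numerical representation of the sound
        "tempo_bpm": 120.0,
        "fingerprint": "a1b2c3d4",  # shortened stand-in for a real fingerprint
        "timbre": [0.42, -0.17, 0.88],
    },
    "cultural": {  # text-based listener reactions
        "review_terms": ["dreamy", "lo-fi", "danceable"],
        "listener_tags": ["late night", "study"],
    },
}
```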

Pandora doesn’t share an API for their Music Genome Project data, but they do mention that it contains 450 total attributes, or features in the data. I dug into their patents and it is clear that the number of features used varies depending on the type of music; the features given as examples in the patents include vocalist gender, distortion in electric guitar, type of background vocals, genre, era, syncopation, and whether a lead vocal is present in the song. Pandora uses a combination of musicologists and algorithms to assign values.

Representation in the metadatasets

We know a little about how Spotify and Pandora create their metadatasets. We know less about how representative those metadatasets are, both in terms of feature coverage and music coverage. 

We barely know which features are available for Pandora, and even with a decent idea of what Spotify has available, it’s possible that the features that exist in the metadatasets are incomplete. The features in the metadatasets could be limited to those that were the easiest to compute at the time, those deemed interesting by their creators, or even those highly correlated with profitable user behavior. It’s expensive to create, store, and apply new metadata features, so businesses must have a clear value proposition before developing new models or tasking more musicologists with the creation of a unique audio feature.

Based on the locations of Spotify, Pandora, and the companies informing their metadatasets, it’s likely that the datasets that these metadatasets and their features are built on aren’t representative of music worldwide but instead include bias toward music that is easily available in their geographic locations. 

The size of the datasets that underpin the metadata creation varies (Pandora has 2 million tracks, Spotify has 35 million), but the representativeness of the data sample is more important than the size. And that is a variable that we have almost no information about.

I haven’t done (and can’t do) the data analysis to determine the distribution of tracks in those giant datasets. Without that I can only speculate:

  • It’s possible that both of them have a disproportionate concentration of artists that create and record music in the United States and Western Europe. 
  • It’s almost certain that both of those datasets contain only music recorded in the digital or digital-adjacent eras. Music recorded in analog tape eras that haven’t been digitized can’t be represented in the datasets. 
  • It’s unlikely that the datasets include music by artists lacking the internet connection necessary to digitally distribute their music, even if it is digitized. 

We could learn more about the representativeness of the datasets used to create the metadatasets if we knew more about how the metadatasets themselves are validated. But again, that’s another area that lacks clarity. 

How the metadatasets get validated… or not

The uniqueness of their businesses is built on these metadatasets, but there don’t seem to be processes in place to validate, across the industry, the features developed and in use by Pandora and Spotify. There’s no central database of tracks that I know of, a “Tom’s Diner” of audio feature validation, that can be used to tune the accuracy of audio features that exist in multiple industry metadatasets. Instead, much like the lossy compression of an MP3, there is just a “close enough for our purposes” approximation of validation.

Pandora uses its musicologists to validate the features assigned to tracks by other musicologists and by algorithms, and uses a selection and ranking module to arrive at a “wisdom of the crowd of experts” result for the eventual list of features associated with a track. The accuracy of a feature is a relative score based on how many other experts associated that same feature with a track. 
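The patent doesn’t publish the actual scoring formula, but the gist of that relative score can be shown with a toy example in Python, where a feature’s score is simply the fraction of experts who associated it with the track:

```python
# Toy illustration of a "wisdom of the crowd of experts" score: a
# feature's relative accuracy is the number of experts who assigned
# it, divided by the total number of experts. Pandora's actual
# selection and ranking module is certainly more involved than this.
from collections import Counter

expert_labels = [
    {"syncopation", "lead vocal present", "minor key"},
    {"syncopation", "lead vocal present"},
    {"syncopation", "distorted electric guitar"},
]

feature_counts = Counter(f for labels in expert_labels for f in labels)
scores = {f: count / len(expert_labels) for f, count in feature_counts.items()}
print(scores)
# syncopation: 1.0, lead vocal present: ~0.67, the rest: ~0.33
```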

Spotify uses a prediction model to predict the subjective (and harder-to-compute) features such as liveness, valence, danceability, and presence of spoken word lyrics. In the patent filing, they disclose the validation methods used for the features predicted by that model: 

  • Comparing the results of the model to a “ground truth dataset” created from already-labeled data sourced in part from “crowdsourced online music datasets such as SOUNDCLOUD, LAST.FM, and the like” [sic]. 
  • Evaluating the percentage of true positives, false negatives, and true negatives returned by the model predictions for features with a binary value (true or false); a rough sketch of this check follows below.
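Here’s a hedged sketch of what that second check might look like; the patent doesn’t include code, so the structure and names here are mine:

```python
# Sketch: tally prediction outcomes for a binary feature (say,
# "contains spoken word lyrics") against ground-truth labels. The
# patent mentions true positives, false negatives, and true
# negatives; false positives are included here for completeness.
def validate_binary_feature(predictions: dict, ground_truth: dict) -> dict:
    """Both arguments map track_id -> bool; ground_truth is non-empty."""
    tally = {"true_pos": 0, "false_neg": 0, "true_neg": 0, "false_pos": 0}
    for track_id, actual in ground_truth.items():
        predicted = predictions.get(track_id, False)
        if actual and predicted:
            tally["true_pos"] += 1
        elif actual and not predicted:
            tally["false_neg"] += 1
        elif not actual and not predicted:
            tally["true_neg"] += 1
        else:
            tally["false_pos"] += 1
    total = sum(tally.values())
    return {outcome: count / total for outcome, count in tally.items()}
```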

The patent then describes taking appropriate steps to bolster training data and improve coverage of the datasets to produce more accurate results in response to the validation results. However, since this is a patent filing rather than a blog post describing their data science practices, we don’t know how often the prediction models and training datasets are updated, or what other methods are used to compile and validate the training datasets themselves. 

Lacking an objectively true value for many of these audio features, it’s difficult for services to reliably validate their metadatasets. In fact, rather than comparatively validating their metadatasets, many of the metadatasets are built on top of each other. The Spotify patent for the prediction model makes it clear that the “ground truth dataset” used for validation is partially sourced from other metadatasets. This Echo Nest patent that I discussed earlier makes it clear that different types of metadata can come from pre-existing metadatasets. 

Without large-scale understanding of metadata validity across these existing metadatasets, it’s likely that errors and biases in the metadata will proliferate as new metadatasets are created. Eventually, that lack of quality metadata can have a disproportionate effect on the artists creating the music that the metadata is derived from.

Why metadata quality matters 

Spotify and Pandora both rely extensively on these metadatasets to deliver valuable streaming services to customers and to create engaging content like playlists and stations for their listeners. Spotify has positioned itself as a valuable distribution and marketing mechanism for artists, to the point that they’ve devised a new scheme where artists and labels can pay for privileges like prominent playlist placement or spotlights in Spotify.  

Metadata underpins the business model of these companies, shaping our experience of music by directly affecting how music is distributed and consumed. But we don’t know how valid the metadata is, we don’t know if it’s biased, and we don’t know how much of a feedback loop is involved in its interpretation to create new distribution and consumption mechanisms. 

If these companies don’t do more to improve the quality of metadata, artists can lose revenue and miss out on distribution opportunities. Listeners can get bored by the sameness of playlists, or the inaccurate interpretations of their radio station requests, and stop using Spotify and Pandora to discover new music. Without representative and valid metadata, music loses. 

What went into writing this

I read a lot over the past few months that informed my thinking in this essay, or some of the points that I made, without being something I quoted or linked directly in the text. I’m also grateful for the conversations I had with my former colleague Jessica about this topic, and for the feedback that my former colleague Neal gave me on an earlier version of this post.

Spotify background

  • I read the Spotify API documentation for the audio features and audio analysis endpoints. 
    • Get Audio Analysis for a Track
    • Get Audio Features for a Track
  • The Spotify for Artists FAQ was informative, especially the following questions.
    • How do I get my music on Spotify?
    • My music is mixed up with another artist
    • What’s a unique link?
    • How does Fans Also Like work?
    • How often are my stats updated in Spotify for Artists?
    • How far back do my stats go?
    • How does Spotify process my audio files?
    • My track doesn’t sound as loud as other tracks on Spotify. Why?
  • Posts on the Spotify engineering blog, Spotify Labs. 
    • The Winding Road to Better Machine Learning Infrastructure Through Tensorflow Extended and Kubeflow
    • Views From The Cloud: A History of Spotify’s Journey to the Cloud, Part 1
    • Spotify’s Event Delivery – The Road to the Cloud (Part II)
    • Spotify’s Event Delivery – The Road to the Cloud (Part III)
    • Spotify’s Event Delivery – Life in the Cloud
    • Analytics at Spotify
    • Spotify Unwrapped: How we brought you a decade of data
    • Big Data Processing at Spotify: The Road to Scio (Part 1)
    • Big Data Processing at Spotify: The Road to Scio (Part 2)
    • Scio 0.7: a deep dive
  • Patents filed by Spotify or The Echo Nest, in an attempt to learn how they create music metadata:
    • US8280889B2 – Automatically acquiring acoustic information about music
    • US10089578B2 – Automatic prediction of acoustic attributes from an audio signal
  • A Hacker Noon article by Sophia Ciocca: Spotify’s Discover Weekly: How machine learning finds your new music
  • This Hypebot article by Bruce Houghton: Spotify’s Paid Promotion Tool Is Called Marquee and Artists, Indie Labels Can’t Afford To Use It

Pandora background

  • An article in the New York Times Magazine by Rob Walker: The Song Decoders at Pandora
  • A sponsored article in Forbes Insights by their Insights Team: Forbes Insights: How Pandora Knows What You Want To Hear Next
  • Two essays in the East Bay Express:
    • By Chris Parker: Personal Shoppers
    • By Kara Platoni: Pandora’s Box
  • Several Pandora patents in an attempt to learn about some of the features that they create and how they create them: 
    • US7003515B1 – Consumer item matching method and system
    • US20160253416A1 – https://patents.google.com/patent/US20160253416A1/en
    • US10088978B2 – Country-specific content recommendations in view of sparse country data
    • US8306976B2 – Methods and systems for utilizing contextual feedback to generate and modify playlists
    • US9729910B2 – Advertisement selection based on demographic information inferred from media item preferences
    • US20160379274A1 – Relating Acoustic Features to Musicological Features For Selecting Audio with Similar Musical Characteristics
    • US10129314B2 – Media feature determination for internet-based media streaming
    • US10387489B1 – Selecting songs with a desired tempo 

Other content

  • An article on Billboard by Emily White: Predicting What You Want To Hear: Music And Data Get It On
  • A few Penny Fractions email newsletter missives by David Turner: 
    • Penny Fractions: Does Your Data Mean Anything? Maybe Notsomuch.
    • Penny Fractions: Spotify’s Perfectly Calculated ‘Wrapped’ Campaigns
  • A Water & Music Patreon post written by Cherie Hu: Decoding 8tracks’ demise, and what it reveals about the state of music streaming
  • An essay on Music Business Worldwide by Cherie Hu: Spotify Needs To Make A Decision About Its Future, Based On Whether It Actually Believes Its Own Mission Statement
  • A Water & Music email newsletter missive written by Cherie Hu: Exclusive: Chartmetric’s inaugural six-month data report reveals hidden music trends beyond streaming
  • A podcast episode from Chartmetric’s podcast, How Music Charts: Global Music Marketing With Christine Osazuwa
  • A series of posts on the Chartmetric blog by Jason Joven: 
    • Music “Trigger Cities” in Latin America & South/Southeast Asia (Part 1)
    • Music “Trigger Cities”: Focus on Southeast Asia (Part 2)
    • Music “Trigger Cities”: Focus on Latin America (Part 3)
  • An essay in The Guardian by Siraj Datoo: How Shazam uses big data to predict music’s next big artists 
  • An article on Toptal’s engineering blog by Jovan Jovanovic: How does Shazam work? Music Recognition Algorithms, Fingerprinting, and Processing
  • A Medium post by Trey Cooper: How Shazam Works
  • An essay in The Atlantic by Derek Thompson: The Shazam Effect 
  • Derek Thompson’s book: Hit Makers: How to Succeed in an Age of Distraction
  • Abe Winter’s blog post: The coming IP war over facts derived from books
  • An article on Wired by Eliot Van Buskirk: 4 Ways One Big Database Would Help Music Fans, Industry 
  • An article on The Verge by Dani Deahl: Metadata is the biggest little problem plaguing the music industry
  • An article on MakeUseOf by Dave Parrack: Music Geeks Can Now Edit Spotify’s Metadata (as of 2018, but no longer possible). 
  • An article on Medium’s Cuepoint by Cherie Hu: How Has Streaming Affected our Identities as Music Collectors?
  • My own essay on avoiding biased data analysis: Unbiased data analysis with the data-to-everything platform: unpacking the Splunk rebrand in an era of ethical data concerns
  • This Newsweek article by Brian Moon: From Spotify to Shazam: How Big-Data Remade the Music Industry One Algorithm at a Time
  • An essay on Art Forum by Jace Clayton: Stream Logic: Jace Clayton on Carl Stone and close listening in the Spotify era 
  • An article on Music Week by Mark Sutherland: Tech it for granted: Why the music biz still needs instinct as well as data to succeed in the digital age
  • An article on Ars Technica by Cathleen O’Grady: Spotify data shows how music preferences change with latitude

Wrapping up the year and the decade in music: Spotify vs my data

December 5, 2019 / Sarah

Spotify’s 2019 Wrapped aims to give you an overview of your past year’s listening habits. It proclaims: these were your top 5 tracks and artists! You spent this much time listening to your favorite artist! 

Screenshot of the Spotify Wrapped results for 2019, showing the top artists, top tracks, minutes listened, and top genre (Indietronica) of the year.

This year (the last year of the decade) they also expanded to all of the 2010s, sharing the top artists and tracks for each year in the decade that you used Spotify.

Because I have my own data that combines Last.fm listening data, my iTunes music library, and concert-relevant activities, this is my comparison of Spotify’s data with my own listening habits (more exhaustively tracked).

I have Last.fm set up to monitor Spotify, but also the tracks I listen to in Google Chrome, in the Music app on my iPhone, and locally in iTunes on my personal laptop. Spotify, of course, sees only Spotify.

According to Spotify, my top 5 artists were:

  1. Tourist
  2. Manatee Commune
  3. Lane 8
  4. Amtrac
  5. SebastiAn

According to my own data, my top 5 artists were:

  1. Tourist
  2. Lane 8
  3. Benoit & Sergio
  4. The Vaccines
  5. Litany

Manatee Commune was just 4 listens behind Litany, with 76 total listens for the year so far. It’s an impressive showing from them, considering that I never ended up purchasing any tracks by them. I own full albums or several tracks by all the other artists in both of my top 5 lists, making it easier for me to rack up listens; while I’m mobile, I listen only to music that I own or to untrackable DJ sets on SoundCloud.

The track stats are where my data really starts to differ from Spotify’s… 

My top 5 tracks according to Spotify are:

  1. Manatee Commune – W/O
  2. Odesza – Just A Memory (Mild Minds Remix)
  3. Aloe Blacc – Brooklyn in the Summer
  4. Manatee Commune – What We’ve Got (feat. Flint Eastwood)
  5. Manatee Commune – Raspberry Puree

Wow that’s a lot of Manatee Commune! Let’s see how the listens of those tracks stack up:

  1. Manatee Commune – W/O | 10 listens
  2. Odesza – Just A Memory (Mild Minds Remix) | 7 listens
  3. Aloe Blacc – Brooklyn in the Summer | 8 listens
  4. Manatee Commune – What We’ve Got (feat. Flint Eastwood) | 11 listens
  5. Manatee Commune – Raspberry Puree | 10 listens

What were my “actual” top 5 songs of 2019?

I’ll cheat and go to my top 7, because my top 2 songs just prove that I struggled to sleep a lot and listened to my Insomnia playlist…

  1. Hey Rosetta! – Trish’s Song | 28 listens
  2. The Cinematic Orchestra – That Home Extended | 26 listens
  3. The Chemical Brothers – Got To Keep On | 25 listens
  4. Justin Jay – I’m Shy When I’m Around You | 23 listens
  5. Benoit & Sergio – The Way You Get | 22 listens
  6. Poolside – Everything Goes (Body Music Remix) | 21 listens
  7. Tourist – Elixir | 20 listens

Pretty stark difference in those lists and those numbers. Manatee Commune is nowhere in sight. A large reason for that is that my listening pattern with Manatee Commune almost perfectly lines up with seeing them live (indicated by the orange triangle and dotted line):

Area graph showing listens over time for Manatee Commune. Fewer than 5 listens show up for 2017 in June and August, then fewer than 5 listens in February 2018, then no data on the graph until June 2019 when there is a spike to 70 listens coinciding with the date I saw them in concert, June 21 2019. Following that date there are fewer than 5 listens through July, then another few listens in November, then the graph ends.

But enough about Manatee Commune. Let’s talk about the real star of 2019: Tourist! All of the data agrees that he was my top artist of 2019. I saw him twice in concert (and I’ll see him again in a couple weeks). 

I’ve listened to him sporadically since 2012, first listening to a track of his in December 2012, discovering a few tracks every few years following, until I saw him live in March this year. You can see what happened after that in this graph:

Area graph showing fewer than 5 listens for Tourist in February 2013, then another few blips of listens in the first half of 2015, then a gap until July 2017 with a blip of listens, then another blip in January 2018 and August 2018, then an increase to about 10 listens starting in January 2019, then a huge spike in March 2019 to 75 listens, coinciding with a concert date of March 3, then it drops off to a consistent 5-10 listens for the next few months until a concert date in mid-August, after which the spikes go up to 11, then 0, then 34, then 5 for subsequent months until November.

Interestingly enough, I went to that show in March 2019 to see Gilligan Moss, who made my top 10 artists last year and were my most popular newly-discovered artist of 2018. If I hadn’t discovered them last year, I probably wouldn’t have gone to that show at all, and this year would have been completely different. 

Spotify claims that I spent 8 hours listening to Tourist this year. My own data? Rough calculations estimate that I’ve spent 15 hours and 15 minutes listening to Tourist. To put that in perspective, I spent at least 238 hours listening to music this year. At least 6% of my total listening time was spent on this one artist. Nice.

[I calculated this by counting the listens for specific tracks in my Last.fm data, then looking up the lengths of those tracks in my iTunes data and multiplying the number of listens by the track lengths. Of course this means that I’m not even considering the tracks that aren’t in my library, since I’m missing that metadata.]
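In code, that back-of-the-envelope estimate looks roughly like the sketch below. The sample values are invented, and tracks missing from the iTunes library simply drop out of the total, which is exactly the undercount mentioned above:

```python
# Rough sketch of the minutes-listened estimate: listens per track
# (from Last.fm) multiplied by track length (from iTunes), summed.
# Sample values are invented for illustration.
lastfm_listens = {("Tourist", "Elixir"): 20, ("Tourist", "Wait"): 12}
itunes_minutes = {("Tourist", "Elixir"): 4.1}  # "Wait" has no length metadata

total_minutes = sum(
    count * itunes_minutes[track]
    for track, count in lastfm_listens.items()
    if track in itunes_minutes  # tracks not in my library aren't counted
)
print(round(total_minutes))  # 82
```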

According to Spotify, my favorite Tourist song was “Too Late – Continuous Mix”. If I consolidate the two similar tracks in my data (data consistency is hard), that’s also true overall—that track has 24 listens so far (which would actually make it #4 on my top tracks of the year). 

To move beyond Tourist, Spotify also told me that I discovered 1503 new artists, and that Plastic Plates were my favorite of those. Meanwhile, in my data, I see that I discovered 2857 artists this year (probably at least 100 of those are random YouTube videos that got mis-filed), with my top 5 discoveries being:

  1. Benoit & Sergio with 86 listens
  2. Kölsch with 55 listens
  3. warner case with 39 listens
  4. Parra for Cuva with 38 listens
  5. Lindstrøm with 28 listens

According to my data, I only listened to Plastic Plates 3 times this year after discovering them on January 2, 2019.

You can see more details about my new discoveries, along with a sparkline of my listening patterns for those artists throughout the year, in this table:

Table showing the top 20 artists discovered in 2019, led by the 5 artists mentioned in the surrounding text, including the first discovered dates. I'm sorry I can't embed the actual graph because it is too much data to describe in text in a useful way.

I spent 35,496 minutes listening to music this year, according to Spotify. Spotify’s data is much better than mine in this respect (100% coverage of metadata!) because my data tells me I only spent 14,296 minutes listening to music. In reality, it’s probably closer to the sum of those numbers. 

What else happened in 2019 that Spotify doesn’t know about?

My top 10 albums of the year:

My top 10 albums of 2019, Lane 8's Little by Little with 78 listens, Litany's 4 Track EP with 74 listens, Tourist's Live from Corsica Studios (DJ Mix) with 74 listens, O'Flynn's Aletheia with 56 listens, Manatee Commune's PDA with 51 listens, Tourist's Live from Corsica Studios (Continuous Mix)(Data is hard) with 50 listens, The Vaccines' What Did You Expect From the Vaccines? with 50 listens, Gilligan Moss's Ceremonial EP with 40 listens, a blank album by The Hood Internet which is actually their 5 songs from 1979-1983 with 39 listens, and Hey Rosetta's Second Sight with 34 listens.

This is where you can see data struggles once again. Those two Tourist albums are essentially the same album, just differently-named in Spotify vs iTunes, so that is actually my most-listened-to album of the year. The Hood Internet metadata was incomplete when they shared their 1979-1983 mashup tracks on SoundCloud, so the free downloads that I added to my iTunes library show up without an album. Which is actually technically correct.

In addition to those top 10 albums, here is an area graph showing the total listens of my top 10 artists over 2019:

Area graph showing the top 10 artists listening counts over time. There is a big spike in March corresponding to Tourist, another blob in June corresponding with Manatee Commune, and a good mix in September with Max Cooper, Bon Iver, Moon Boots, O'Flynn, and Tourist all taking up some time (which corresponds with concerts for all of those artists too)(but that data is not in this graph).

That looks somewhat exciting until you see those numbers stacked against all the other artists I listened to this year:

Area graph showing the same information as the previous graph, but this time with a big purple blob at the bottom of the graph that takes up between 500 and 1000 listens of the total graph. The Top 10 artists take up 50-100 listens on the same scale.

I added 344 new tracks (so far) to my iTunes library, and listened to 5,902 different tracks a total of 9,823 times (so far). I went to 49 concerts (with my 50th of the year lined up for tonight!) seeing a total of 136 artists (so far). Numbers!

My most-frequented venues of the year were 1015 Folsom and Audio, followed closely by the Fox Theater, Great American Music Hall, and The Fillmore. The artist I saw most frequently was Teh Raptor (DJ sets), soon to be tied by Tourist when I see him for the third time (in general and in 2019) in a couple weeks.

Spotify was also able to tell me some things that I can’t yet identify, namely that I listened to artists from 73 different countries. I’m hopeful that next year I’ll have additional metadata from the MusicBrainz database set up and correlating with my Splunk indexes.

Screenshot of Spotify’s Wrapped site showing that I’ve listened to artists from 73 countries and “when it comes to your music, borders disappear”

The 2010s: Best of the Decade

Because 2019 is the last year of the decade, Spotify also added some stats for the entire decade to their #wrapped feature.

My top 5 artists and tracks for the 2010s according to Spotify, described in the surrounding text; the top genre is Indie Pop

My top 5 artists of the 2010s according to Spotify are:

  1. Daughter
  2. Hey Rosetta!
  3. CHVRCHES
  4. Cold War Kids
  5. The Format

According to my data, these artists are my top 5 of the 2010s:

  1. Hey Rosetta! with 1162 listens
  2. Alkaline Trio with 803 listens
  3. Cold War Kids with 743 listens
  4. Manchester Orchestra with 721 listens
  5. CHVRCHES with 674 listens

Motion City Soundtrack just barely missed out on the top 5, with 673 total listens. The Format, meanwhile, are in 10th with 493 total listens. Daughter didn’t make my top 10, but are instead 24th with 314 total listens for the decade.

My top 5 songs of the decade according to Spotify are:

  1. Carly Rae Jepsen’s Run Away With Me
  2. Ariana Grande’s Into You
  3. Tame Impala’s The Less I Know The Better
  4. Adele’s Send My Love (To Your New Lover)
  5. Ingrid Michaelson’s Hell No

According to my own data:

  1. Hey Rosetta! – Trish’s Song | 114 listens
  2. Hey Rosetta! – Kintsukuroi | 102 listens
  3. Hey Rosetta! – The Simplest Thing | 95 listens
  4. Carly Rae Jepsen – Run Away With Me | 93 listens
  5. CHVRCHES – High Enough to Carry You Over | 90 listens

It’s probably then no surprise to learn that my top album of the decade is Hey Rosetta!’s Second Sight, which features the first 2 songs of my top 5 of the decade, and came out in 2014. My other top albums of the decade:

Top 10 albums of the decade for me, Second Sight by Hey Rosetta! with 611 listens, Every Open Eye by CHVRCHES with 452 listens, xx by the xx with 337 listens, Mean Everything to Nothing by Manchester Orchestra with 258 listens, Pershing by Someone Still Loves You Boris Yeltsin with 243 listens, Crimson by Alkaline Trio with 223 listens, Hey Rosetta!'s album Seeds with 204 listens, Tegan and Sara's album The Con with 202 listens and finally The Vaccine's album What Did you expect from the vaccines? with 193 listens

According to Spotify I’ve been using their service since 2011, but I think it’s actually more like 2013—and this is borne out in their data. They list my top tracks and artists for the decade only starting in 2013. Let’s compare!

Year | Spotify (listens) | My Data (listens)
2010 | – | Hello Saferide – The Quiz (56)
2011 | – | Jamie xx – Rolling in the Deep remix feat. Childish Gambino (49)
2012 | – | Smoking Popes – Can’t Find It (33)
2013 | Daughter – Winter (1) | Cold War Kids – Bulldozer (16)
2014 | Daughter – Youth (4) | Taylor Swift – Blank Space (31)
2015 | Hey Rosetta! – The Simplest Thing (53) | Hey Rosetta! – Kintsukuroi (55)
2016 | Carly Rae Jepsen – Run Away With Me (78) | Carly Rae Jepsen – Run Away With Me (78)
2017 | Gibbz – 24/7 (18) | Hey Rosetta! – Trish’s Song (26)
2017 | | Sjowgren – Now & Then (24)
2018 | Jude Woodhead – Beautiful Rain (25) | Young Fathers – Border Girl (35)
2019 | Manatee Commune – W/O (10) | Hey Rosetta! – Trish’s Song (28)
2019 | | The Chemical Brothers – Got To Keep On (25)

I added extra lines for the years when “Trish’s Song” took the top spot because that song is a lullaby and I listen to it accordingly—so perhaps I should consider the second place song as the “true” top song for that year.

Fun fact, my fifth-most-listened-to track of 2014 is The Riff-Off from Pitch Perfect. I watched it so many times on YouTube that y’all now have some idea how obsessed I was (am).

In 2016, Run Away With Me by Carly Rae Jepsen took the top track slot by 1 listen. Ariana Grande’s Into You trailed with 77. Those 2 tracks were on a playlist of only 4 tracks that I listened to a LOT that year. The other 2 tracks from the playlist were my 4th- and 5th-most-listened tracks of the year: Ingrid Michaelson’s Hell No with 52 listens and Adele’s Send My Love (To Your New Lover), also with 52 listens. 

My top artists for each year of the past decade are as follows, comparing Spotify’s data with my data. I gotta say, I wasn’t expecting to see Taylor Swift take the top spot for 2014.

Year | Spotify (listens) | My Data (listens)
2010 | – | Alkaline Trio (352)
2011 | – | Tegan and Sara (236)
2012 | – | Smoking Popes (241)
2013 | The Format (9) | Cold War Kids (57)
2014 | The Format (69) | Taylor Swift (108)
2015 | Hey Rosetta! (–) | Hey Rosetta! (591)
2016 | Jason Derulo (179) | Hey Rosetta! (332)
2017 | Cold War Kids (156) | The xx (166)
2018 | Poolside (–) | Poolside (162)
2019 | Tourist (–) | Tourist (251)

My music listening habits (and possibly also my data fidelity) dropped dramatically in the early/mid-2010s, which is why those numbers are so different compared with other years. 

It’s fun to see my overall trend in top artists for the decade. It’s almost like the 2013/2014 dropoff in music listening also coincided with a pivot in terms of what artists I was listening to.

Area graph of top 10 artists for the decade, with the first 3 years dominated by indie rock artists like Someone Still Loves You Boris Yeltsin, Tegan and Sara, Manchester Orchestra, Alkaline Trio, and The xx, then almost no data in 2014, followed by a rise in artists like The Format, Cold War Kids, Hey Rosetta!, and CHVRCHES, plus a resurgence of The xx in 2017. I guess most of those artists are still indie rock, but still.

I was also in college until 2012, but Manchester Orchestra and Alkaline Trio and Someone Still Loves You Boris Yeltsin almost totally drop out of the listening patterns after 2013, taken over by Hey Rosetta!, CHVRCHES, The Format, and The xx resurging in 2017.

In total I listened to 27,068 unique tracks 100,240 times since 2010, spending 223,350 minutes (at least) listening to music. I purchased a total of 772 songs from the iTunes store in the last decade, with nearly half of those purchases happening this year.

Screenshot with data described in the surrounding text

Speaking of total minutes spent listening, Spotify also shared minutes-listened data all the way back to 2015! I’ve already talked about why my data is so different from Spotify’s (~ incomplete metadata ~) but here’s how the numbers compare:

Year | Spotify (minutes) | My Data (minutes)
2015 | 11,834 | 20,462
2016 | 28,659 | 19,959
2017 | 26,137 | 13,919
2018 | 35,655 | 16,737
2019 | 35,496 | 14,296

It’s fun to review the artists that I’ve discovered in the past decade, with Hey Rosetta!, CHVRCHES, Mumford & Sons, Two Door Cinema Club, and Daughter taking the top 5 spots.

Top 20 artists discovered in the past decade, described in the surrounding text

I’ve attended 163 concerts so far in the last decade, seeing a total of 404 artists. The distribution of those concerts and artists over time is interesting to look at as well: spikes while I was in college, but not really taking off until I moved to San Francisco and joined a concert community group in the area.

Described in the surrounding text

I saw several artists multiple times throughout the decade, some as supporting acts (Future Feats, who I saw as an opening act 3 times, despite not enjoying their sets) and others as a combination of supporting and main acts (such as Smoking Popes).

Artists seen more than twice in the 2010s: Alkaline Trio 4 times; Cold War Kids, Future Feats, Geographer, Gilligan Moss, Goldroom, RAC, Smoking Popes, Teh Raptor, and The Faint 3 times each.

My most-visited venue of the decade was The Independent, which I’ve been to 14 times. I don’t think I made it to a single show there in 2019, but hopefully I’ll be back for a 15th visit soon.

This has been a lot of data. I shared a similar roundup last year around this time, My 2018 Year in Music: Data Analysis and Insights. It’s fascinating to look back at the entire decade and reflect on how my life has changed, how my music taste and listening habits have shifted (or not) over time, and see the influence of live music attendance in my listening patterns and popular artists. Whether I’m using Spotify, iTunes, the Music app on my phone, SoundCloud, YouTube, or seeing live music, I’m glad I have music in my life.

Data enrichment at ingest-time, not search time, with Cribl

March 6, 2019 / Sarah

Disclaimer: I’m a Splunk employee, and I’m not a Cribl customer, but I do know the founders (including the author of the blog post). I figured I’d write this exploration up here rather than as an exceedingly-long Twitter thread. My reactions are all to the content of the blog post, not actual use of the product.

If I’m reading this blog post from Cribl correctly, their product makes it easy to enrich events with metadata at ingest-time. This is relevant/exciting for me because when I’m ingesting music data for my side project, I’m only ever getting the initial slice that’s available from a specific REST endpoint or in a file.

I’ve been identifying and collecting additional data sources that I want to enrich my dataset with, but doing so requires extra calls to other endpoints in the same API, or other web services, which means I then need to figure out where I want to store all of that data.

It quickly turns into an architectural and conceptual headache that I delay handling, because I know I’d either be dumping a lot of data into lookups / the KV store, or having to seriously level up my Python skills and do data processing and enrichment in my code before sending it to Splunk Enterprise.

As a specific example, I use the Last.fm getRecentTracks endpoint to send my listening data to Splunk Enterprise, but to enrich that data with additional metadata like track duration, or album release date, I’d have to hit 2 additional endpoints (track.getInfo and album.getInfo, respectively).

Deciding when in the data processing pipeline to hit those endpoints, how to hit them, and where to store that information to enrich my events has been a struggle that I’ve been avoiding dealing with.
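For concreteness, here’s a rough sketch of what that enrichment chain looks like in Python with the requests library. The API key and username are placeholders, and pagination, rate limiting, and error handling are omitted:

```python
# Sketch: pull recent listens from Last.fm, then make one extra
# track.getInfo call per track to add the duration that
# user.getRecentTracks doesn't return. API key and user are
# placeholders.
import requests

API_ROOT = "http://ws.audioscrobbler.com/2.0/"
API_KEY = "YOUR_LASTFM_API_KEY"
USER = "your_username"

recent = requests.get(API_ROOT, params={
    "method": "user.getRecentTracks",
    "user": USER, "api_key": API_KEY, "format": "json",
}).json()["recenttracks"]["track"]

for track in recent:
    info = requests.get(API_ROOT, params={
        "method": "track.getInfo",
        "artist": track["artist"]["#text"], "track": track["name"],
        "api_key": API_KEY, "format": "json",
    }).json()
    # Enrich the event before sending it on to be indexed; Last.fm
    # reports duration in milliseconds.
    track["duration_ms"] = info.get("track", {}).get("duration")
```

One track.getInfo call per listen is exactly the kind of extra work that has to live somewhere: in my collection code, in a lookup, or in an ingest-time pipeline.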

There is an advantage to collecting the metadata once and storing it in a lookup or the KV store, because the metadata is relatively static: it’s straightforward to call an endpoint, collect the data, and store it somewhere for when I need it. I then have the added flexibility to enrich my events with extra data at search time when I want to, but not otherwise.

However, this means that I’m making conceptual decisions at multiple points: when collecting the data; when deciding what format to store it in, and where; and when enriching events at search time. It’s a lot of added complexity, but this type of enrichment doesn’t affect the size of my originally-indexed events, though the lookup data might end up being indexed separately instead.

But with Cribl’s solution, I’d be making that choice once. That does mean I lose potential flexibility about when and which events I can enrich with the data, but it also means that the conceptual decisions aren’t something I have to belabor. I can enrich my listening data at ingest-time with additional metadata about the album, artist, and track, then send it on to be indexed. Then when I’m searching and want to perform additional work with the metadata, it’s all right there with my events already.

This is a convenient, if imperfect, solution for my use case. But my use case is pretty basic: enrich events with static information that might be shared across many events. That’s a use case with a lot of potential solutions. I could use this approach if I didn’t care about reducing the amount of data that I indexed to the bare minimum, and focused instead on convenience and context for my data ingestion, allowing me to save time when searching my data.

This solution is much more exciting for use cases other than mine, where you’re enriching events with dynamic information that is relevant and true for specific events at index-time. The blog post includes an example of this, combining the web access logs with context from proxy logs, making the time-to-discovery for investigations that use web access logs shorter.

There is flexibility in combining data at search time, but there is complexity with that approach as well. Cribl shows that there is convenience in creating that context at index-time as well.
