Why the quality of audio analysis metadatasets matters for music

I’ve been thinking for some time about the derived metadata that Spotify and other digital streaming services construct from the music on their platforms. Spotify’s current business revolves around providing online streaming access to music and podcasts, as well as related content like playlists, to users.

Like any good SaaS business, their primary goal is to acquire and keep customers. As a digital streaming service business, the intertwined goal is to provide quality content to those customers. The best way to do both of those is to derive and collect metadata about customer usage patterns, but also about the content being delivered to the customers. The more you know about the content being delivered, the more you can create new distribution mechanisms for the content and make informed deals to acquire new content.

Creating metadatasets from the intellectual property of artists

Today, when labels and distributors provide music to digital streaming services (artists can’t provide it directly), they grant those services permission to make the music tracks available to their users. Based on my review of the Spotify Terms and Conditions of Use, the Spotify for Artists Terms and Conditions, and the distribution agreement for a commonly used distribution service, DistroKid, artists don’t grant explicit permission for what the services do next: create metadata about those tracks. One exception applies to the DistroKid distribution agreement. Artists who sign up for an additional service, DistroLock, are bound by an addendum granting the service permission to create an audio fingerprint that uniquely represents the track, so that it can be used for copyright enforcement and possibly to pay out royalties.

In his book Metadata, Jeffrey Pomerantz defines metadata as “a means by which the complexity of an object is represented in a simpler form.” In this case, streaming services like Spotify create different types of metadata to represent the complexity of music with various audio features, audio analysis statistics, and audio fingerprints. The services also gather “use metadata” about how customers use their services—at what point in a song a person hits skip, what devices they use to listen, their location when listening, and other data points.

Creating metadatasets is crucial to delivering content

Pandora has patents for the types of music metadata that they create, the metadata behind the “Music Genome Project”. Spotify also has patents to do the same (including a crucial one from their acquisition of the Echo Nest), as well as many that cover the various applications of that metadata.

These companies can use these metadatasets as marketing tools, as we’ve seen with the #SpotifyWrapped campaign; to correlate the music metadata with use metadata, such as to create new music marketing methods like contextual playlists; to select advertising that matches up sonically well with the tracks being listened to; and to provide these insights to artists and labels, making them more reliant on their service as a distribution and marketing mechanism.

Spotify currently provides a subset of the insights they derive from the combination of use metadata with music track metadata to artists with the Spotify for Artists service. The end user license agreement for the service makes it clear that it’s a free service and Spotify cannot be held responsible for the relative accuracy of the data available. Emphasis mine:

Spotify for Artists is a free service that we are providing to you for use at our discretion. Spotify for Artists may provide you with the ability to view demographic data on your fans and usage data of your music. While we work hard to ensure the accuracy of the data, we do not guarantee that the Spotify for Artists Service or the data that we collect from the Service will be error-free or that mistakes, including mistakes in the data insights that we provide to you, will not happen from time to time.

Spotify for Artists Terms and Conditions

It’s likely that some labels have already negotiated access to various insights and metadata that Spotify creates and collects.

Other valuable insights that can be derived from these metadatasets include: the types of music that people listen to in certain cities, which tracks are most popular in certain cities, what types of music people tend to listen to in different seasons, and even what types of music people of different ages, genders, education levels, and classes tend to listen to.

These insights, provided to artists, labels, and distributors, guide marketing campaigns, tour planning, artist-specific investments, and even music production styles. Thing is, it’s tough to decipher exactly how these companies create the metadatasets that all these valuable insights rely on, and how the accuracy of that metadata is (if at all) validated.

How the metadatasets get made

In an episode of Vox Earworm, the journalist Matt Daniels of The Pudding and Estelle Caswell of Vox briefly discuss how the metadatasets of Spotify and Pandora were created. They point out that Spotify has 35 million songs, but its metadataset is algorithmically generated. Pandora, meanwhile, has only 2 million songs, but its 450 total attributes were defined and applied to the songs by a combination of trained musicologists and algorithms. Their discussion starts at 1:45 in this episode and continues for about 90 seconds.

The features in the metadatasets have been defined by algorithms written by trained musicologists, amateur musicians, or even ordinary data scientists without musical training or expertise. The specific features collected by Spotify are publicly available in their audio features API and audio analysis API endpoints, and both include metadata that objectively describe each track, such as duration, as well as more subjective features such as acousticness, liveness, valence, and instrumentalness.
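To make the split between objective and subjective features concrete, here is a minimal Python sketch. The field names follow the audio features response as I understand it, but the sample values are invented, and the grouping into "objective" and "subjective" is my own reading of the features named above, not a classification that Spotify publishes.

```python
# Sample payload shaped like an audio features response.
# The values are invented for illustration, not taken from a real track.
sample_features = {
    "duration_ms": 215000,
    "tempo": 118.2,
    "acousticness": 0.12,
    "liveness": 0.09,
    "valence": 0.64,
    "instrumentalness": 0.0,
}

# Assumption: this grouping is mine. It only covers the features named
# in the essay; anything else (such as tempo) is left ungrouped.
OBJECTIVE = {"duration_ms"}
SUBJECTIVE = {"acousticness", "liveness", "valence", "instrumentalness"}


def split_features(features):
    """Separate objective properties from subjective, model-estimated ones."""
    objective = {k: v for k, v in features.items() if k in OBJECTIVE}
    subjective = {k: v for k, v in features.items() if k in SUBJECTIVE}
    return objective, subjective


objective, subjective = split_features(sample_features)
print("objective:", sorted(objective))
print("subjective:", sorted(subjective))
```

The point of the split is that only the first group can be checked against the audio file itself. The second group can only be checked against someone's judgment.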

The more detailed audio analysis API splits each track into sections and segments, and computes features and confidence levels for each of them.
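As a rough illustration of what those confidence levels make possible, the sketch below flags sections whose tempo estimate is uncertain. The payload mirrors the shape of the sections array as I understand it, but every number is invented, and the 0.5 threshold is a hypothetical cutoff I picked for the example.

```python
# Payload shaped like part of an audio analysis response.
# All numbers are invented for illustration.
sample_analysis = {
    "sections": [
        {"start": 0.0, "duration": 22.5, "tempo": 118.0, "tempo_confidence": 0.82},
        {"start": 22.5, "duration": 40.1, "tempo": 119.3, "tempo_confidence": 0.31},
        {"start": 62.6, "duration": 35.0, "tempo": 59.6, "tempo_confidence": 0.12},
    ]
}


def low_confidence_sections(analysis, threshold=0.5):
    """Return the sections whose tempo confidence falls below the threshold.

    The default threshold is a hypothetical value chosen for this example.
    """
    return [s for s in analysis["sections"] if s["tempo_confidence"] < threshold]


flagged = low_confidence_sections(sample_analysis)
for section in flagged:
    print(
        f"section at {section['start']}s: tempo {section['tempo']} "
        f"(confidence {section['tempo_confidence']})"
    )
```

A confidence score like this tells you how sure the algorithm is of its own estimate. It does not tell you whether the estimate matches what a listener would hear.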

Spotify, building off the Echo Nest technology, relies on web scraping and algorithms to create these metadatasets. According to a patent filed by the Echo Nest in 2011, three different types of metadata are created: explicit, acoustic, and cultural.

The explicit metadata is information such as “track name” or “artist name” or “composer”, while the acoustic metadata can be an acoustic fingerprint to represent the song, or can include features like “tempo, rhythm, beats, tatums, or structure, and spectral information such as melody, pitch, harmony, or timbre.” The cultural metadata is where the more subjective features come from, and it can come from a variety of different subjective sources: “expert opinion such as music reviews”, “listeners through Web sites, chat rooms, blogs, surveys, and the like”, as well as information “generated by a community of listeners and automatically retrieved from Internet sites, chat rooms, blogs, and the like.” The patent gives other examples such as “sales data, shared collections, lists of favorite songs, and any text information that may be used to describe, rank, or interpret music.” It can also build off of existing databases made available by companies like Gracenote, AllMusic (referenced as AMG, now RhythmOne, in the patent), and others.

Pandora doesn’t share an API for their Music Genome Project data, but they do mention that it contains 450 total attributes, or features in the data. I dug into their patents, and it is clear that the number of features used varies depending on the type of music. The features given as examples in the patents include vocalist gender, distortion in electric guitar, type of background vocals, genre, era, syncopation, and whether a lead vocal is present in the song. Pandora uses a combination of musicologists and algorithms to assign values.

Representation in the metadatasets, representation in the taco bell

We know a little about how Spotify and Pandora create their metadatasets. We know less about how representative those metadatasets are, both in terms of feature coverage and music coverage.

We barely know which features are available for Pandora, and even with a decent idea of what Spotify has available, it’s possible that the features in the metadatasets are incomplete. They could be limited to those that were the easiest to compute at the time, those that the creators deemed interesting, or even those that are highly correlated with profitable user behavior. It’s expensive to create, store, and apply new metadata features, so businesses must have a clear value proposition before developing new models or tasking more musicologists with the creation of a unique audio feature.

Based on the locations of Spotify, Pandora, and the companies informing their metadatasets, it’s likely that the datasets that these metadatasets and their features are built on aren’t representative of music worldwide but instead include bias toward music that is easily available in their geographic locations.

The size of the datasets that underpin the metadata creation varies (Pandora has 2 million tracks, Spotify has 35 million), but the representativeness of the data sample is more important than its size. And that is a variable that we have almost no information about.

I haven’t done (and can’t do) the data analysis to determine the distribution of tracks in those giant datasets. Without that I can only speculate:

We could learn more about the representativeness of the datasets used to create the metadatasets if we knew more about how the metadatasets themselves are validated. But again, that’s another area that lacks clarity.

How the metadatasets get validated… or not

The uniqueness of their businesses is built on these metadatasets, but it doesn’t seem like there are processes in place to validate the features developed and in use by Pandora and Spotify across the industry. There’s no central database of tracks that I know of, a “Tom’s Diner” of audio feature validation, that can be used to tune the accuracy of audio features that exist in multiple industry metadatasets. Instead, much like the lossy compression of an MP3, there is just the “close enough for our purposes” approximation for validation.

Pandora uses its musicologists to validate the features assigned to tracks by other musicologists and by algorithms, and uses a selection and ranking module to arrive at a “wisdom of the crowd of experts” result for the eventual list of features associated with a track. The accuracy of a feature is a relative score based on how many other experts associated that same feature with a track.
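To show what a relative, agreement-based score looks like in practice, here is a minimal sketch that scores each feature by the fraction of reviewers who assigned it to a track. This is my own simplification for illustration, not Pandora's selection and ranking module, and the reviewer labels are hypothetical.

```python
from collections import Counter


def agreement_scores(expert_labels):
    """Score each feature by the fraction of experts who assigned it.

    expert_labels is a list with one collection of feature names per expert.
    """
    # set() guards against one expert listing the same feature twice.
    counts = Counter(
        feature for labels in expert_labels for feature in set(labels)
    )
    total = len(expert_labels)
    return {feature: count / total for feature, count in counts.items()}


# Hypothetical labels from four reviewers for a single track.
labels = [
    {"syncopation", "distorted electric guitar"},
    {"syncopation", "female lead vocal"},
    {"syncopation", "distorted electric guitar"},
    {"syncopation"},
]

scores = agreement_scores(labels)
for feature, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{feature}: {score:.2f}")
```

A score like this measures consensus among the reviewers, not correctness. If every reviewer shares the same blind spot, the score stays high.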

Spotify uses a prediction model to predict the subjective (and harder-to-compute) features such as liveness, valence, danceability, and presence of spoken word lyrics. In the patent filing, they disclose the validation methods used for the features predicted by that model:

The patent then describes taking appropriate steps to bolster training data and improve coverage of the datasets to produce more accurate results in response to the validation results. However, since this is a patent filing rather than a blog post describing their data science practices, we don’t know how often the prediction models and training datasets are updated, or what other methods are used to compile and validate the training datasets themselves.

Lacking an objectively true value for many of these audio features, it’s difficult for services to reliably validate their metadatasets. In fact, rather than being validated against each other, many of the metadatasets are built on top of each other. The Spotify patent for the prediction model makes it clear that the “ground truth dataset” used for validation is partially sourced from other metadatasets. The Echo Nest patent that I discussed earlier makes it clear that different types of metadata can come from pre-existing metadatasets.

Without large-scale understanding of metadata validity across these existing metadatasets, it’s likely that errors and biases in the metadata can proliferate as new ones are created. Eventually, that lack of quality metadata can have a disproportionate effect on the artists creating the music that this metadata is derived from.

Why metadata quality matters

Spotify and Pandora both rely extensively on these metadatasets to deliver valuable streaming services to customers and to create engaging content like playlists and stations for their listeners. Spotify has positioned itself as a valuable distribution and marketing mechanism for artists, to the point that they’ve devised a new scheme where artists and labels can pay for privileges like prominent playlist placement or spotlights in Spotify.

Metadata underpins the business model of these companies, shaping our experience of music by directly affecting how music is distributed and consumed. But we don’t know how valid the metadata is, we don’t know if it’s biased, and we don’t know how much of a feedback loop is involved in its interpretation to create new distribution and consumption mechanisms.

If these companies don’t do more to improve the quality of metadata, artists can lose revenue and miss out on distribution opportunities. Listeners can get bored by the sameness of playlists, or the inaccurate interpretations of their radio station requests, and stop using Spotify and Pandora to discover new music. Without representative and valid metadata, music loses.

What went into writing this

I read a lot over the past few months that informed my thinking in this essay, or some of the points that I made, without being something I quoted or linked directly in the text. I am also grateful for the conversations I had with my former colleague Jessica about this topic, and for the feedback that my former colleague Neal gave me on an earlier version of this post.
