Reflecting on a decade of (quantified) music listening

I recently crossed the 10 year mark of using Last.fm to track what I listen to.

From the first tape I owned (Train’s Drops of Jupiter), to the first CD (Cat Stevens Classics), to the first album I discovered by roaming the stacks at the public library (The Most Serene Republic’s Underwater Cinematographer), to the college radio station that shaped my adolescent music taste (WONC), to the college radio station that shaped my college experience (WESN); from tapes to CDs (with a radio Walkman all the while), to the radio in my car, to SoundCloud and MP3 music blogs, to Grooveshark and later Spotify, with Windows Media Player and later an iTunes music library keeping me company throughout… it’s been quite a journey.

Some, but not all, of that journey has been captured while using the service Last.fm for the last 10 years. Last.fm “scrobbles” what you listen to as you listen to it, keeping a record of your listening habits and behaviors. I decided to add all this data to Splunk, along with my iTunes library and a list of concerts I’ve attended over the years, to quantify my music listening, acquisition, and attendance habits. Let’s go.

What am I doing?

Before I get any data in, I have to know what questions I’m trying to answer; otherwise I won’t get the right data into Splunk (my data analysis system of choice, because I work there). Even with the right data in Splunk, I have to make sure the right fields are there to support the analyses I want to do. Thinking this through helped me prioritize which scripts to track down to retrieve and clean my data (because I can’t code well enough to write my own).

I also made a list of the questions I wanted to answer with my data, and coded each question according to the types of data I’d need to answer it. Things like:

  • What percentage of the songs in iTunes have I listened to?
  • What is my artist distribution over time? Do I listen to more artists now? Different ones overall?
  • What is my listen count over time?
  • What genres are my favorite?
  • How have my top 10 artists shifted year over year?
  • How do my listening habits shift around a concert? Do I listen to that artist more, or not at all?
  • What songs did I listen to a lot a few years ago, but not since?
  • What personal one hit wonders do I have, where I listen to one song by an artist way more than any other of their songs?
  • What songs do I listen to that are in Spotify but not in iTunes (that I should buy, perhaps)?
  • How many listens does each service have? Do I have a service bias?
  • How many songs are in multiple services, implying that I’ve probably bought them?
  • What’s the lag between the date a song or album was released and my first listen?
  • What geographic locations are my favorite artists from?

As the list goes on, the questions get more complex and require an increasing number of data sources. So I prioritized what was simplest to start, and started getting data in.


Getting data in…

I knew I wanted as much music data in the system as I could get. However, SoundCloud isn’t providing developer API keys at the moment, and Spotify requires authentication, which is a bit beyond my skills for now. MusicBrainz also has a lot of great data, but rate-limits requests so aggressively that I knew I’d need a strategy before tackling that metadata source. That left me with three initial data sources: my iTunes library, my own list of concerts I’ve attended, and my Last.fm account data.

Last.fm provides an endpoint that allows you to get the recent tracks played by a user, which was exactly what I wanted to analyze. I started by building an add-on for Last.fm with the Splunk Add-on Builder to call this REST endpoint. It was hard. When I first tried to do this a year and a half ago, the add-on builder didn’t yet support checkpointing, so I could only pull in data if I was actively listening and Splunk was on. Because I had installed Splunk on a laptop rather than a server in ~ the cloud ~, I was pretty limited in the data I could pull in. I pretty much abandoned the process until checkpointing was supported.
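
For the curious, here’s a minimal sketch in Python of the kind of call the add-on makes against Last.fm’s user.getRecentTracks endpoint. The API key and username are placeholders; the checkpointing, error handling, and Splunk-side plumbing are what the Add-on Builder layers on top of this.

```python
import time
import requests

API_KEY = "YOUR_LASTFM_API_KEY"   # placeholder: get one from last.fm/api
USER = "your_username"            # placeholder
URL = "https://ws.audioscrobbler.com/2.0/"

def fetch_scrobbles(from_ts, to_ts=None):
    """Page through user.getRecentTracks, yielding one dict per scrobble."""
    page = 1
    while True:
        params = {
            "method": "user.getrecenttracks",
            "user": USER,
            "api_key": API_KEY,
            "format": "json",
            "from": from_ts,  # forget this and you re-pull your whole history
            "limit": 200,     # the API maximum per page
            "page": page,
        }
        if to_ts is not None:
            params["to"] = to_ts
        data = requests.get(URL, params=params).json()["recenttracks"]
        tracks = data["track"]
        if isinstance(tracks, dict):  # a single track comes back as a dict
            tracks = [tracks]
        for track in tracks:
            if "date" not in track:   # the "now playing" track has no timestamp
                continue
            yield {
                "artist": track["artist"]["#text"],
                "track": track["name"],
                "album": track["album"]["#text"],
                "uts": int(track["date"]["uts"]),
            }
        if page >= int(data["@attr"]["totalPages"]):
            break
        page += 1
        time.sleep(0.25)              # be polite to the API
```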

After the add-on builder started supporting checkpointing, I set it up again, but ran into issues: everything from forgetting to specify the from date in my REST call to JSON path decisions that limited the number of results I could pull back at a time. I deleted the data from the add-on sourcetype many times, triple-checking the results each time before continuing.

I used a Python script (thanks, Reddit) to pull my historical data from Last.fm for Splunk, and to fill the gap between this initial backfill and the time it took me to get the add-on working, I used an NPM module. When you don’t know how to code, you’re at the mercy of the tools other people have developed. Adding the backfill data to Splunk also meant adjusting the MAX_DAYS_AGO default in props.conf, because Splunk doesn’t expect data from 10+ years ago by default. Two scripts in two languages and one add-on builder later, I had a working solution and my Last.fm data in Splunk.
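
For reference, the props.conf change is small. MAX_DAYS_AGO defaults to 2000 days (roughly five and a half years), so older events get a current timestamp unless you raise it. The sourcetype name here is just an example:

```
[lastfm:scrobbles]
MAX_DAYS_AGO = 5000
```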

To get the iTunes data in, I used an iTunes-to-CSV script on GitHub (thanks, StackExchange) to convert the Library.xml file into CSV. This worked great, but again it was written in a language I don’t know (Ruby), so I was at the mercy of a kind developer posting scripts on GitHub, and limited to whatever fields their script supported. This, too, only handled backfill.

I’m still trying to sort out the regex and determine whether it’s possible to parse the iTunes Library.xml file in its entirety and add it to Splunk without too much of a headache, and/or set things up so that new songs added to the library can be added to Splunk ad hoc, without converting the entries some other way. It’s a work in progress, but I’m pretty close to getting it working thanks to help from some regex gurus in the Splunk community.
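
As an alternative to the regex route, Library.xml is an Apple plist, so Python’s standard-library plistlib can parse it directly. A rough sketch (the field names are real iTunes keys, though not every track has every one, and the output filename is made up):

```python
import csv
import plistlib

# Library.xml is an Apple plist, so it can be parsed directly
with open("Library.xml", "rb") as f:
    library = plistlib.load(f)

fields = ["Name", "Artist", "Album", "Genre", "Play Count", "Date Added"]

with open("itunes_tracks.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(fields)
    # library["Tracks"] is a dict keyed by track ID
    for track in library["Tracks"].values():
        # not every track has every field, so default to empty
        writer.writerow([track.get(field, "") for field in fields])
```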

For the concert data, I added what I had to the Lookup File Editor app and was up and running. Because of some column header choices I made when organizing the data, and because I chose to maintain a lookup rather than add the information as events, I was in for some more adventures in search, but this format made it easy to add new concerts as I attend them.
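
The lookup itself is nothing fancy: a CSV with one row per show, along these lines (the column names and rows here are illustrative, not my actual headers or entries):

```
date,artist,venue,city,state
2017-11-04,CHVRCHES,Bill Graham Civic Auditorium,San Francisco,CA
2008-08-02,Warped Tour,First Midwest Bank Amphitheatre,Tinley Park,IL
```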

Answer these questions…with data!

I built a lot of dashboard panels to answer the questions I mentioned earlier, along with some others. I was spurred on by my brother recommending a song to me. I was pretty sure I’d heard the song before, and decided to use data to verify it.

Screen image of a chart showing the earliest listens of tracks by the band VHS Collection.

I’d first heard the song he recommended, Waiting on the Summer, back in March. Hipster credibility: intact. This dashboard panel now lets me answer the question “when did I first listen to an artist, and which songs did I hear first?” I added a second panel later to compare the earliest listens with the play counts of songs by the artist. Sometimes the first song I’d heard by an artist was also my most-listened, but often not.
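
The underlying computation is simple: group every scrobble by track and keep the earliest timestamp, plus a play count. A rough equivalent in Python with pandas, assuming the scrobbles have been exported to a CSV with artist, track, and uts (Unix timestamp) columns:

```python
import pandas as pd

# Hypothetical export: one row per scrobble, columns artist, track, uts
scrobbles = pd.read_csv("scrobbles.csv")
scrobbles["played_at"] = pd.to_datetime(scrobbles["uts"], unit="s")

# Earliest listen and total plays per track, for one artist
vhs = scrobbles[scrobbles["artist"] == "VHS Collection"]
summary = (
    vhs.groupby("track")
       .agg(first_listen=("played_at", "min"), plays=("played_at", "count"))
       .sort_values("first_listen")
)
print(summary)
```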

Another question I wanted to answer was “how many concerts have I been to, and what’s the distribution in my concert attendance?”

Screen image showing concerts attended over time, with peaks in 2010 and 2017.

It’s pretty fun to look at this chart. I went to a few concerts while I was in high school, but never more than one a month and rarely more than a few per year. The pace picked up while I was in college, especially while I was dating someone who liked going to concerts. There’s a slowdown while I studied abroad and finished college, then a year-long pickup as I got settled in a new town. But after I settled into a long-term relationship, my concert attendance dropped off to fewer shows than I went to in high school. As soon as I was single again, that shifted dramatically, and now I’m going to a show or more a month. The personal stories and patterns revealed by the data are the fun part for me.

I answered some more questions, especially those that could be answered with fun graphs, such as: in which states is my concert attendance concentrated?

Screen image of a map of the contiguous United States, with Illinois highlighted in dark blue, indicating 40+ concerts attended in that state; California in a paler blue, indicating roughly 20 shows; followed by Michigan in paler blue, and finally Ohio, Wisconsin, and Missouri in very pale blue. The rest of the states are white, indicating no shows attended.

It’s easy to tell where I’ve spent most of my life so far, but again the personal details tell a bigger story. I lived in Michigan longer than I’ve lived in California, but I’ve spent more of my California time single, and thus attending more concerts.

Speaking of California, I also wanted to see what my most-listened-to songs were since moving to California. I used a trellis visualization to split the songs by artist, allowing me to identify artists that were more popular with me than others.

Screen image showing a trellis visualization of top songs since moving to California. Notable songs include Carly Rae Jepsen’s “Run Away With Me,” Ariana Grande’s “Into You,” and three CHVRCHES songs: “High Enough to Carry You Over,” “Clearest Blue,” and “Leave a Trace.”

I really liked the CHVRCHES album Every Open Eye, hence the three songs from that album. I also spent some time with a four-song playlist featuring Adele’s “Send My Love (To Your New Lover),” Ariana Grande’s “Into You,” Carly Rae Jepsen’s “Run Away With Me,” and Ingrid Michaelson’s “Hell No.” Somehow two breakup songs and two love songs were the perfect juxtaposition for a great playlist. I liked it enough that all four songs are in this list (though only half of it is visible in this screenshot). That’s another secret behind the data.

I also wanted to do some more analytics on my concert data, and decided to figure out what my favorite venues were. I had some guesses, but wanted to see what the data said.

Screen image of most-visited concert venues, with The Metro in Chicago taking the top spot with 6 visits, followed by First Midwest Bank Amphitheatre (5 visits), then Fox Theater, Mezzanine, Regency Ballroom, The Greek Theatre, and The Independent with 3 visits each.

The Metro is my favorite venue in Chicago, so it’s no surprise that it came in first in the rankings (I later corrected the data to use its proper name, “Metro,” so that I could drill down from the panel to a Google Maps search for the venue). First Midwest Bank Amphitheatre hosted Warped Tour, which I apparently attended five times over the years. Since moving to California it seems I don’t have a favorite venue based on visits alone, but it’s really The Independent, followed by Bill Graham Civic Auditorium, which doesn’t even make this list. Number of visits doesn’t automatically equate to favorite.

But what does it MEAN?

I could do data analysis like that all day. But what else do I learn by just looking at the data itself?

I can tell that Last.fm didn’t handle the shift to mobile and portable devices very well. It thrives when all of your listening happens on your laptop, and it can grab the scrobbles from your iPod or other device when you plug it into your computer. But as soon as internet-connected devices got popular (and I started using them), my overall scrobbled listens dropped. Beyond devices, the rise of streaming on sites like Grooveshark and SoundCloud, which replaced the MediaFire- and MegaUpload-hosted free music shared on music blogs, also meant trouble for my data integrity. Last.fm didn’t handle listens on the web then, and only handles them through a fragile browser extension now.

Two graphs depicting distinct song listens and distinct artist listens, respectively: steady listens with a peak through 2008–2012, a drop to a trough in 2014, then a recovery to about half the 2010 level, rising slightly.

Distinct songs and artists listened to in Last.fm data.

But that’s not the whole story. I also got a job in an environment where I couldn’t listen to music at work, and I wasn’t listening to music much at home either, due to other circumstances. Given that the count plummets to near zero, it’s possible there were also data issues at play. It’s imperfect, but still fascinating.

What else did I learn?

Screen image showing 5 dashboard panels. Clockwise from the upper left: a trending indicator of concerts attended per month, displaying 1 for the month of December and a net decrease of 4 from the previous month; the overall number of concerts attended, 87 shows; the number of iTunes library songs with no listens, 4,272; a pie chart showing that nearly 30% of the songs have 0 listens, 23% have 1 listen, and the rest a variety of listen counts; and the total number of songs in my iTunes library, 16,202.

I have a lot of songs in my iTunes library. I haven’t listened to nearly 30% of them, and I’ve listened to nearly 25% of them only once. Between them, that’s the majority of my music library. Split by rating, however, it would get a lot more interesting. Soon.

You can’t see the fallout from my own personal Music-ocalypse in this data, because the Library.xml file doesn’t know which songs no longer point to actual files, or at least my version of it doesn’t. I’ll need higher-fidelity data to determine the “actual” size of my library and perform more analyses.

I need more data in general, and more patience, to answer the more complex questions, like how my listening habits for a particular artist shift around a concert. As it is, this is a really exciting start.

If you want more details about the actual Splunking behind these analyses, I wrote a post for the official Splunk blog, published January 4th: 10 Years of Listens: Analyzing My Music Data with Splunk.

Data as a Gift: Implications for Product Design

The idea of data as a gift, and the act of sharing data as an exchange of a gift, has data ethics and privacy implications for product and service design.

Recent work by Kadija Ferryman and Nick Seaver in the last year addressed this concept more broadly and brought it to my attention. Ferryman, in her piece Reframing Data as a Gift, took the angle of data sharing in the context of health data and open data policies. Seaver, in his piece Return of the Gift, approached it from the angle of the gift economy and big data. Both make great points about data collection and ethics, especially as they relate to data security and privacy more generally.

Ferryman introduces the concept brilliantly:

What happens when we think about data as a gift? Well, first, we move away from thinking about data in the usual way, as a thing, as a repository of information and begin to think of it as an action. Second, we see that there is an obligation to give back, or reciprocate when data is given. And third, we can imagine that giving a lot of data has the potential to create tension.

When you frame the information we “voluntarily” share with services as a gift, the dynamics of the exchange shift. We can’t truly share data with digital services, because sharing implies that we retain ultimate ownership: you can take back something after you share it. You can’t do that with your personal data. Because you can’t take your data back after you share it, you can more accurately conceptualize the exchange of data with digital services as a gift: something you give, and which cannot be returned to you (at least not in its original form).

Data as a gift creates an expectation or obligation of a return, as Seaver makes clear. The problem is, when we share data on the internet, we don’t always know exactly what we’re giving and what we’re getting.

The gift exchange might be based on the expectation that your data is used to provide the service to you. And the more data, the better the service (you might expect). For this reason, it seems easier to share specific types of data with specific services. For example, it’s easier for me to answer questions about my communication or sexual preferences with a company if I think I’m going to get a boyfriend out of the exchange, and sharing that data might make it more likely.

But what happens if a company stops seeing (or doesn’t ever see) an exchange of data as a gift exchange, and starts using the data you gift it for whatever it wants in order to make a profit? By violating the terms of the gift exchange, the company violates the implicit social contract you made with the company when you gifted your data. This is where privacy comes in. Gifting information for one purpose and having it used for other unexpected purposes feels like a violation of privacy. Because it is.

A violation of the gift exchange of data is a privacy violation, but it feels like the norm now. It’s common for terms of service to inform you that after you gift your data to a service, it is no longer yours, and the company can do with it what it wants.

Products and services are designed so that you can’t pay for them even if you want to. You must share certain amounts of data, and if you don’t, the product doesn’t work. As Andrew Lewis put it, “If you are not paying for it, you’re not the customer; you’re the product being sold.” We didn’t end up here because we’re that dedicated to free things on the Internet. We were lured into gifting our data in exchange for specific, limited services, and the companies realized later that the data was the profitable part of the exchange.

Nick Seaver refers to this as “the obligation to give one’s data in exchange for the use of ‘free’ services,” and it is indeed an obligation. If you don’t want to enter into that type of exchange, you have very few ways to interact with the modern Internet. You’d likely also need a lot of money, in order to enter into paid transactions rather than gift exchanges with companies in return for services.

For those of us working in product or service development, we can use this perspective and consider the social contract of the exchange of data gifts.

  • Consider whether the service you offer is on par with the amount of data you ask people to gift to you.
    • Do I really need to share my Facebook likes with Tinder to get a superior match?
  • Consider whether the service you offer can deliver on the obligations and expectations created by the gift exchange.
    • Is your service rewarding enough and trustworthy enough to where I’ll save my credit card information?
  • Consider whether you can design your service to allow people to choose the data that they want to gift to you.
    • What is the minimum-possible data gift that a person could exchange with your service, and still feel as though their gift was reciprocated?
  • Consider the type of gift exchange that you design if you force people to gift you a specific type or amount of data.
    • Is that an expectation or obligation that you want to create?

When you view each piece of information that a person shares with you as a gift, it’s harder to misuse that information.


Note: Thanks to Clive Thompson for bringing Kadija Ferryman’s piece to my attention, and Nick Seaver for sharing his piece Return of the Gift with me on Twitter. 

Libraries, Digital Advertising, and the Machine Zone

Librarians are an underused, underpaid, and underestimated legion. One librarian in particular is frustrated by e-book lending: not just the fact that libraries have to maintain waitlists for access to a digital file, but also that the barriers to checking out an e-book are unnecessarily high. As she puts it,

“Teaching people about having technology serve them includes helping them learn to assess and evaluate risk for themselves.”

In her view,

“Information workers need to be willing to step up and be more honest about how technology really works and not silently carry water for bad systems. People trust us to tell them the truth.”

That seems like the least that library patrons can expect.

Continue reading

Torture, Ownership, and Privacy

The Senate Intelligence Committee released hundreds of pages (soon available as a book) detailing acts of torture committed by the CIA.

Continue reading

Quantified Health and Software Apps

I went on a bit of a Twitter rant last night about how MyFitnessPal doesn’t give me much helpful data.

While it’s called MyFitnessPal, it doesn’t feel much like a pal, and feels more like a diet app than a fitness app:

It’s like a friend congratulating you for eating a lot of whole wheat, but making a face because the egg you ate has a lot of cholesterol in it, even if it’s the only egg you’ve eaten that week.

Continue reading

Public Transit and Technology – Chicago Edition

The Chicago Tribune reports on a recent study completed by the OECD on Metropolitan Governance of Transport and Land Use in Chicago. As the Tribune describes:

“The Chicago area’s transportation is hamstrung by a proliferation of local governments, the “irrational organizational structure” of the Regional Transportation Authority and the service boards and an antiquated formula by which transit agencies are funded, the report found.”

When reached for comment, the various transit organizations had little to say:

“Spokesmen for the RTA, Metra and Pace said officials had not read the 20-page report and had no comment. As it has previously, the CTA said last week that it opposes transit agency consolidation, as does Emanuel.
A superagency would be an unnecessary bureaucracy unaccountable to commuters that would divert dollars from train and bus service, said [CTA] spokesman Brian Steele.”

Per the Tribune, the report points out that:

“‘The current state of transit ridership in Chicago is relatively depressing,’ concludes the report from the Organization for Economic Cooperation and Development, a Paris-based research agency whose backers include the world’s richest nations, among them the U.S.

The report found a lack of coordination among the four transit agencies and their four separate boards as well as insufficient accountability. Those issues intensify the economic impact of congestion on Chicago, estimated at over $6 billion in 2011 by the Texas Transportation Institute, the report said.”

Transit organizations in Chicago aren’t well-integrated, and leadership in Chicago opposes any integration or consolidation of those organizations. In the meantime, ridership is low and congestion (and its related economic impact) is high.

Contrast that with the recent article in Citylab about the importance of the smartphone in transportation.

“As more and more of the transport system falls into private hands and becomes fragmented, multi-modalism risks declining and cities will lose out on valuable data on where people want to go, how they travel, what’s slowing them down, and how the network is operating. A publicly-operated unified mobility app has enormous potential to eliminate barriers between modes, use existing infrastructure more efficiently, and bring the entire transport network to the smartphone.”

Privatized transportation systems, especially fragmented ones, mean that cities lose valuable opportunities to learn more about their riders, and thus lose opportunities to attune their systems to riders’ needs. Jason Prechtel, writing for Gapers Block, has closely followed the public-private partnerships that dominate Chicago public transportation. As the article continues:

“Better data about movement makes it easier for officials to site bike-share docks, or re-route buses to fit travel patterns, or add an extra train during rush-hour to meet demand. Instead of operating on a static schedule that forces users to adapt to it, a transportation network that’s monitored and adjusted in real-time can adapt to users. Just as the paved road launched a transportation revolution by enabling point-to-point travel via the car last century, networked technology can shift the paradigm again by making the user and infrastructure dynamic actors who respond to one another. This isn’t a trivial improvement—it’s a dramatic reimagining of how transportation systems operate.”

Transportation systems that make ample use of data across the network can reshape themselves to meet the needs of customers, thereby reducing congestion and increasing ridership.

“if U.S. cities can move past the fractured transportation landscape and embrace the challenge, their slow start isn’t necessarily a bad thing; it might even help officials avoid the mistakes of bad apps and refine the successes of good ones.”

Chicago has a long way to go before it can embrace and make use of technology across all of its public-private partnerships. Finding a way to integrate ridership data from Divvy with public transit usage stats from the Ventra-carded services Pace and CTA, as well as Metra, could lead to public transit innovation and cost savings alongside transit improvements. Maybe claims of creating a “smart city” with “big data” could lead to some movement, but without improved partnerships and governance across transit organizations, Chicago’s public transportation situation seems destined to fester.

9/16/14 Update:

Jason Prechtel wrote in an earlier Gapers Block column about the role of the RTA (Regional Transportation Authority), which oversees the CTA, Pace, and Metra, and which was responsible for uniting the three under one common payment system, Ventra. Prechtel on the RTA and Ventra:

“…both Gov. Quinn’s office and the SouthtownStar have called for finding ways to reduce waste and bureaucracy and eventually overhaul the entire regional transit system.

From this perspective, the need for a system like Ventra makes sense. Uniting transit fare payment under a single system is one major step towards merging the transit systems together under the RTA umbrella, and reducing overall transit costs and inter-agency squabbles.”

While that common payment system has been plagued with controversy and difficulties, perhaps the efforts of the RTA could lead to a unified transportation app for Chicagoans.

Metrication of the Self

A soon-to-emerge recurring theme…

Also referred to as “datafication” by the authors of Big Data: A Revolution That Will Transform How We Live, Work, and Think, metrication can be defined as a relatively recent trend: beginning to see all aspects of our lives as valuable data points and metrics against which to gauge our worth, success, and productivity. Spurred on by technological advances, the tools used to monitor others, such as plug-ins and cookies, also allow us to track ourselves.

Using metrics to evaluate people is not a new concept. From birth we’re monitored against percentile growth charts by pediatricians and our anxious parents; once we’re of schooling age, we’re monitored and tracked by the government and school districts using grades and standardized testing, reducing our school performance to “valuable” numbers and the odd coded comment like “works hard in class.” After graduation and/or college, the tracking could end, but the working world possesses its own set of metrics. At my own job we track all sorts of data related to customer satisfaction, in addition to how quickly and efficiently we serve our users. This is consistently relayed back to us as workers, with the implicit intent of improving those numbers. The higher the better.

Continue reading