Reflecting on a decade of (quantified) music listening

I recently crossed the 10 year mark of using Last.fm to track what I listen to.

From the first tape I owned (Train’s Drops of Jupiter) to the first CD (Cat Stevens Classics) to the first album I discovered by roaming the stacks at the public library (The Most Serene Republic Underwater Cinematographer) to the college radio station that shaped my adolescent music taste (WONC) to the college radio station that shaped my college experience (WESN), to the shift from tapes, to CDs, (and a radio walkman all the while), to the radio in my car, to SoundCloud and MP3 music blogs, to Grooveshark and later Spotify, with Windows Media Player and later an iTunes music library keeping me company throughout…. It’s been quite a journey.

Some, but not all, of that journey has been captured while using the service Last.fm for the last 10 years. Last.fm “scrobbles” what you listen to as you listen to it, keeping a record of your listening habits and behaviors. I decided to add all this data to Splunk, along with my iTunes library and a list of concerts I’ve attended over the years, to quantify my music listening, acquisition, and attendance habits. Let’s go.

What am I doing?

Before I get any data in, I have to know what questions I’m trying to answer, otherwise I won’t get the right data into Splunk (my data analysis system of choice, because I work there). Even if I get the right data into Splunk, I have to make sure that the right fields are there to do the analysis that I wanted. This helped me prioritize certain scripts over others to retrieve and clean my data (because I can’t code well enough to write my own).

I also made a list of the questions that I wanted to answer with my data, and coded the questions according to the types of data that I would need to answer the questions. Things like:

  • What percentage of the songs in iTunes have I listened to?
  • What is my artist distribution over time? Do I listen to more artists now? Different ones overall?
  • What is my listen count over time?
  • What genres are my favorite?
  • How have my top 10 artists shifted year over year?
  • How do my listening habits shift around a concert? Do I listen to that artist more, or not at all?
  • What songs did I listen to a lot a few years ago, but not since?
  • What personal one hit wonders do I have, where I listen to one song by an artist way more than any other of their songs?
  • What songs do I listen to that are in Spotify but not in iTunes (that I should buy, perhaps)?
  • How many listens does each service have? Do I have a service bias?
  • How many songs are in multiple services, implying that I’ve probably bought them?
  • What’s the lag between the date a song or album was released and my first listen?
  • What geographic locations are my favorite artists from?

As the list goes on, the questions get more complex and require an increasing number of data sources. So I prioritized what was simplest to start, and started getting data in.

 

Getting data in…

I knew I wanted as much music data as I could get into the system. However, SoundCloud isn’t providing developer API keys at the moment, and Spotify requires authentication, which is a little bit beyond my skills at the moment. MusicBrainz also has a lot of great data, but has intense rate-limiting so I knew I’d want a strategy to approach that metadata-gathering data source. I was left with three initial data sources: my iTunes library, my own list of concerts I’ve gone to, and my Last.fm account data.

Last.fm provides an endpoint that allows you to get the recent tracks played by a user, which was exactly what I wanted to analyze. I started by building an add-on for Last.fm with the Splunk Add-on Builder to call this REST endpoint. It was hard. When I first tried to do this a year and a half ago, the add-on builder didn’t yet support checkpointing, so I could only pull in data if I was actively listening and Splunk was on. Because I had installed Splunk on a laptop rather than a server in ~ the cloud ~, I was pretty limited in the data I could pull in. I pretty much abandoned the process until checkpointing was supported.

After the add-on builder started supporting checkpointing, I set it up again, but ran into issues. Everything from forgetting to specify the from date in my REST call to JSON path decision-making that meant I was limited in the number of results I could pull back at a time. I deleted the data from the add-on sourcetype many times, triple-checking the results each time before continuing.
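For anyone curious what the "from date plus checkpoint" idea looks like outside the add-on builder, here is a minimal Python sketch. It assumes the Last.fm user.getRecentTracks REST method and its JSON layout (a recenttracks object holding a track list, with a date.uts Unix timestamp on each finished listen). The function names and the checkpoint handling are made up for illustration, and this is not the code the add-on actually runs.

```python
from urllib.parse import urlencode

API_ROOT = "https://ws.audioscrobbler.com/2.0/"


def build_recent_tracks_url(user, api_key, from_ts=None, page=1, limit=200):
    """Build a user.getRecentTracks request URL.

    from_ts is the checkpoint: a Unix timestamp, so only listens after it
    are requested. Leaving it out asks for the full history again.
    """
    params = {
        "method": "user.getrecenttracks",
        "user": user,
        "api_key": api_key,
        "format": "json",
        "limit": limit,
        "page": page,
    }
    if from_ts is not None:
        params["from"] = from_ts
    return API_ROOT + "?" + urlencode(params)


def parse_page(payload):
    """Flatten one JSON page into simple dicts, skipping 'now playing'."""
    tracks = payload.get("recenttracks", {}).get("track", [])
    if isinstance(tracks, dict):  # defensive: handle a lone object as well as a list
        tracks = [tracks]
    listens = []
    for item in tracks:
        if "date" not in item:  # the currently playing track has no timestamp yet
            continue
        listens.append({
            "time": int(item["date"]["uts"]),
            "artist": item["artist"]["#text"],
            "track": item["name"],
            "album": item.get("album", {}).get("#text", ""),
        })
    return listens


def next_checkpoint(listens, previous=None):
    """Return the newest listen time seen so far, to reuse as the next from_ts."""
    newest = max((item["time"] for item in listens), default=None)
    if newest is None:
        return previous
    return newest if previous is None else max(previous, newest)
```

The fetching itself is left out on purpose: the part that bit me was remembering to pass the from value and saving the newest timestamp between runs, which is what next_checkpoint stands in for.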

I used a Python script (thanks Reddit) to pull my historical data from Last.fm to add to Splunk, and to fill the gap between this initial backfill and the time it took me to get the add-on working, I used an npm module. When you don’t know how to code, you’re at the mercy of the tools other people have developed. Adding the backfill data to Splunk also meant I had to adjust the MAX_DAYS_AGO default in props.conf, because Splunk doesn’t necessarily expect data from 10+ years ago by default. 2 scripts in 2 languages and 1 add-on builder later, I had a working solution and my Last.fm data in Splunk.

To get the iTunes data in, I used an iTunes to CSV script on GitHub (thanks StackExchange) to convert the Library.xml file into CSV. This worked great, but again, it was in a language I don’t know (Ruby) and so I was at the mercy of a kind developer posting scripts on GitHub again. I was limited to whatever fields their script supported. This again only did backfill.

I’m still trying to sort out the regex and determine if it’s possible to parse the iTunes Library.xml file in its entirety and add it to Splunk without too much of a headache, and/or get it set up so that I can ad-hoc add new songs added to the library to Splunk without converting the entries some other way. Work in progress, but I’m pretty close to getting that working thanks to help from some regex gurus in the Splunk community.
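A minimal sketch of one alternative to regex, in Python rather than in Splunk: the iTunes Library.xml file is an XML property list, so the standard library's plistlib can read it directly. This assumes the usual layout, with a top-level Tracks dictionary keyed by track ID. The FIELDS list and the function names are my own choices for illustration, not what the Ruby script or the eventual Splunk setup uses.

```python
import csv
import io
import plistlib

# Hypothetical field selection; any key present on the track entries works.
FIELDS = ["Name", "Artist", "Album", "Genre", "Play Count", "Date Added"]


def library_to_rows(xml_bytes, fields=FIELDS):
    """Parse an iTunes Library.xml property list into one dict per track."""
    library = plistlib.loads(xml_bytes)
    rows = []
    for track in library.get("Tracks", {}).values():
        row = {field: track.get(field, "") for field in fields}
        # Assumption: tracks without a Play Count key have never been played.
        if "Play Count" in fields and row["Play Count"] == "":
            row["Play Count"] = 0
        rows.append(row)
    return rows


def rows_to_csv(rows, fields=FIELDS):
    """Serialize rows to CSV text, ready to index or use as a lookup."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Reading the file with open(path, "rb").read() and passing the bytes to library_to_rows keeps every field available, so the CSV is no longer limited to whatever a third-party script chose to export.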

For the concert data, I added the data I had into the Lookup File Editor app and was up and running. Because of some column header choices I made for how to organize my data, and the fact that I chose to maintain a lookup rather than add the information as events, I was in for some more adventures in search, but this data format made it easy to add new concerts as I attend them.

Answer these questions…with data!

I built a lot of dashboard panels. I wanted to answer the questions I mentioned earlier, along with some others. I was spurred on by my brother recommending a song to me to listen to. I was pretty sure I’d heard the song before, and decided to use data to verify it.

Screen image of a chart showing the earliest listens of tracks by the band VHS collection.

I’d first heard the song he recommended to me, Waiting on the Summer, in March. Hipster credibility: intact. Having this dashboard panel now lets me answer the questions “when was the first time I listened to an artist, and which songs did I hear first?”. I added a second panel later, to compare the earliest listens with the play counts of songs by the artist. Maybe the first song I’d heard by an artist was the most listened song, but often not.
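The logic behind those two panels is simple enough to sketch in Python. This is an illustration only, not the Splunk search I used: it assumes each listen is a dictionary with artist, track, and time (Unix timestamp) keys, and the function name is made up.

```python
from collections import defaultdict


def first_listens(listens, artist):
    """Return (track, first_listen_time, play_count) tuples for one artist,
    ordered by the earliest listen."""
    first = {}
    counts = defaultdict(int)
    for item in listens:
        if item["artist"].lower() != artist.lower():
            continue
        track = item["track"]
        counts[track] += 1
        # Keep the smallest timestamp seen for each track.
        if track not in first or item["time"] < first[track]:
            first[track] = item["time"]
    return sorted(
        ((track, first[track], counts[track]) for track in first),
        key=lambda row: row[1],
    )
```

Putting the first-listen time and the play count side by side is what makes it easy to see whether the first song I heard by an artist is also the one I play the most.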

Another question I wanted to answer was “how many concerts have I been to, and what’s the distribution in my concert attendance?”

Screen image showing concerts attended over time, with peaks in 2010 and 2017.

It’s pretty fun to look at this chart. I went to a few concerts while I was in high school, but never more than one a month and rarely more than a few per year. The pace picked up while I was in college, especially while I was dating someone who liked going to concerts. A slowdown as I studied abroad and finished college, then it picks up for a year as I get settled in a new town. But after I get settled in a long-term relationship, my concert attendance drops off, to the point where I’m going to fewer shows than I did in high school. As soon as I’m single again, that shifts dramatically and now I’m going to one or more shows a month. The personal stories and patterns revealed by the data are the fun part for me.

I answered some more questions, especially those that could be answered by fun graphs, such as: in which states have I attended the most concerts?

Screen image of a map of the contiguous United States, with Illinois highlighted in dark blue, indicating 40+ concerts attended in that state, California highlighted in a paler blue indicating 20ish shows attended there, followed by Michigan in paler blue, and finally Ohio, Wisconsin, and Missouri in very pale blue. The rest of the states are white, indicating no shows attended in those states.

It’s easy to tell where I’ve spent most of my life living so far, but again the personal details tell a bigger story. I spent more time in Michigan than I have lived in California so far, but I’ve spent more time single in California so far, thus attending more concerts.

Speaking of California, I also wanted to see what my most-listened-to songs were since moving to California. I used a trellis visualization to split the songs by artist, allowing me to identify artists that were more popular with me than others.

Screen image showing a "trellis" visualization of top songs since moving to California. Notable songs are Carly Rae Jepsen "Run Away With Me" and Ariana Grande "Into You" and CHVRCHES with their songs High Enough to Carry You Over and Clearest Blue and Leave a Trace.

I really liked the CHVRCHES album Every Open Eye, so I have three songs from that album. I also spent some time with a four song playlist featuring Adele’s song Send My Love (To Your New Lover), Ariana Grande’s Into You, Carly Rae Jepsen’s Run Away With Me, and Ingrid Michaelson’s song Hell No. Somehow two breakup songs and two love songs were the perfect juxtaposition for a great playlist. I liked it enough that all four songs are in this list (though only half of it is visible in this screenshot). That’s another secret behind the data.

I also wanted to do some more analytics on my concert data, and decided to figure out what my favorite venues were. I had some guesses, but wanted to see what the data said.

Screen image of most visited concert venues, with The Metro in Chicago taking the top spot with 6 visits, followed by First Midwest Bank Amphitheatre (5 visits), Fox Theater, Mezzanine, Regency Ballroom, The Greek Theatre, and The Independent with 3 visits each.

The Metro is my favorite venue in Chicago, so it’s no surprise that it came in first in the rankings (I also later corrected the data to use its proper name, “Metro”, so that I could drill down from the panel to a Google Maps search for the venue). First Midwest Bank Amphitheatre hosted Warped Tour, which I attended (apparently) 5 times over the years. Since moving to California it seems like I don’t have a favorite venue based on visits alone, but it’s really The Independent, followed by Bill Graham Civic Auditorium, which doesn’t even make this list. Number of visits doesn’t automatically equate to favorite.

But what does it MEAN?

I could do data analysis like that all day. But what else do I learn by just looking at the data itself?

I can tell that Last.fm didn’t handle the shift to mobile and portable devices very well. It thrives when all of your listening happens on your laptop, and it can grab the scrobbles from your iPod or other device when you plug it into your computer. But as soon as internet-connected devices got popular (and I started using them), my overall scrobbled listens dropped. In addition to devices, the rise of streaming music on sites like Grooveshark and SoundCloud, which replaced the MediaFire-hosted and MegaUpload-hosted free music shared on music blogs, also meant trouble for my data integrity. Last.fm didn’t handle listens on the web then, and only handles them through a fragile extension now.

Two graphs depicting distinct song listens and distinct artist listens, respectively, with a peak and steady listens through 2008-2012, then it drops down to a trough in 2014 before coming up to half the amount of 2010 and rising slightly.

Distinct songs and artists listened to in Last.fm data.

But that’s not the whole story. I also got a job and started working in an environment where I couldn’t listen to music at work, so wasn’t listening to music there, and also wasn’t listening to music at home much either due to other circumstances. Given that the count plummets to near-zero, it’s possible there were also data issues at play. It’s imperfect, but still fascinating.

What else did I learn?

Screen image showing 5 dashboard panels. Clockwise, the upper left shows a trending indicator of concerts attended per month, displaying 1 for the month of December and a net decrease of 4 from the previous month. The next shows the overall number of concerts attended, 87 shows. The next shows the number of iTunes library songs with no listens: 4272. The second to last shows a pie chart showing that nearly 30% of the songs have 0 listens, 23% have 1 listen, and the rest are a variety of listen counts. The last indicator shows the total number of songs in my iTunes library, or 16202.

I have a lot of songs in my iTunes library. I haven’t listened to nearly 30% of them. I’ve listened to nearly 25% of them only once. That’s the majority of my music library. If I split that by rating, however, it would get a lot more interesting. Soon.

You can’t see the fallout from my own personal Music-ocalypse in this data, because the Library.xml file doesn’t know which songs don’t point to actual files, or at least my version of it doesn’t. I’ll need more high-fidelity data to determine the “actual” size of my library, and perform more analyses.

I need more data in general, and more patience, to perform the analyses to answer the more complex questions I want to answer, like my listening habits of particular artists around a concert. As it is, this is a really exciting start.

If you want more details about the actual Splunking I did to do these analyses, I’ll be writing a post on the official Splunk blog. That got posted on January 4th! Here it is: 10 Years of Listens: Analyzing My Music Data with Splunk.

Yoga Beta for Climbers

As a companion to Finding Myself on the Wall, sometimes what you need while climbing isn’t real beta or advice about what to do, but mental reinforcement. This beta can sound kind of like the mantras that someone might give you in the midst of a yoga class: yoga beta.

  • Do what feels right
  • Don’t forget to breathe
  • Don’t look, just feel
  • You are stronger than you think
  • Just let go 

 

Finding Myself on the Wall

How climbing teaches me to manage my fear and love myself.

Sometimes I find myself on the wall doing something I never thought possible: holding onto something that doesn’t seem to have a place to hold, or reaching something that looks out of reach. Other times it’s like I’m waking up to find myself trapped in what seems to be an inescapable spot: no holds above me, or nowhere to put my feet to push myself higher. In these cases, the problem is clear. The solution isn’t.

In climbing, the problem can be on the wall, or it can be with my confidence, or my fear. Being able to consistently test solutions, push through challenges, and conquer the problem is what makes climbing a perfect mental and physical outlet for me.

For me, climbing is all about managing fear and trusting myself. I have to manage my natural instincts of being afraid of heights and of falling. I also have to learn to trust my abilities and skills while respecting myself and my boundaries in order to avoid getting hurt or endangering myself or others.

In addition, the different types of climbing require different levels of this fear management and self-trust. I first learned top-rope climbing, and as I got better I got more comfortable. Then I learned bouldering, and got more comfortable there, so I learned how to lead climb. Throughout this process, I’ve built my physical strength and climbing technique, but also my self-confidence and my ability to manage fear.

  • Top-roping is the most comfortable form of climbing for me. I can see the rope, and I can sometimes see the anchor keeping the rope secure. I can feel the taut lack of slack in my rope, and lean back from the wall to test it. I can rest at any time as well, so there is time to slow down and take breaks. All of this physical security reinforces a psychological sense of security, which can help me do more challenging moves and climb higher than I might otherwise feel comfortable climbing.
  • Bouldering requires me to stomach my fear and muster my self-confidence to take me to the top of a wall, or over the top of a wall, without a rope. Bouldering routes are typically anywhere from 10-20 feet high in a gym, and in some incredible outdoor routes, 40 or more feet high. Without the physical security of a rope or an anchor, I have to know my physical and psychological strengths and limits before I start. This forces me to scope out the route before I start climbing, and prepare myself to jump or fall to the ground if I feel uncomfortable. Bouldering forces me to get used to this discomfort and either overcome it or recognize when it is valid and to listen to it.
  • Lead climbing takes the height of top-roping and combines it with the mental aspects of bouldering. No longer do I have the visible anchor or a taut rope to help me feel safe—it’s just me and the wall. I’m conquering the problem while also taking all the necessary steps to keep myself safe: clip properly, climb safely around the rope, and rest when I can. There’s little to no room for fear.

Each type of climbing removes an element of physical security and further challenges my psychological security as I progress. In this way, I’ve been forced to progressively confront and challenge my limits at the same time that I learn to respect and recognize them.

The dangers of climbing are real. It’s an extreme sport. Though it doesn’t always feel dangerous in a gym, any time that you are high up in the air relying on humans and equipment, something can fail and you can die. It’s also easy to get injured due to bad technique: over-gripping holds, inadequately engaging muscles, straining hand muscles and tendons on hard-to-grip holds. If anything, these risks force me to prioritize muscle recovery and rest days, allowing me to recognize that just as physical self care is important, so too is psychological self care.

Despite these risks, climbing lets me get more in tune with myself than anything else that I’ve tried. It’s a wall of problems, but each one is recognizable and each one is solvable, and I can try them again and again. I can learn by watching someone else solve it, but I can’t solve it the same way because we have different skill sets, physical strength, and body types. I still have to solve the problem myself in my own way.

Climbing with other people has also been key to my mental strength. Climbing partners are vital to my safety, but also to my confidence level. They can encourage me to try new routes, and give me beta when I start to falter on a route. Beta, typically defined as information about a route, can also involve encouragement. Everything from the tactical “there’s a foothold by your right knee” to the encouraging “you can reach it!” to the calming “don’t look, just feel” is great beta that has helped me succeed. (I’ve named that last type Yoga Beta). Even so, sometimes the best beta is silence so that I can focus on the problem.

Using climbing to teach myself that I can succeed, and to iterate my way through problem-solving, helps me overcome my fear of failure. I’m learning to trust myself to get through each move, and find something to (physically, psychologically) support myself along the way. I have to trust myself, and the rock, every step of the way.

Country-specific search results

It isn’t really possible to search the “global web” today. You can, however, try to use Google to search the web of another country by manually manipulating the ccTLD in the URL to divert your search to a different country service than the country you are located in.

But starting recently, that’s no longer possible. Betanews points out that Google makes it harder to search for results from other countries:

Google has announced that it will now always serve up results that are relevant to the country that you’re in, regardless of the country code top level domain names (ccTLD) you use.

What can you do instead? The official Google blog explains in Making search results more local and relevant:

If for some reason you don’t see the right country when you’re browsing, you can still go into settings and select the correct country service you want to receive. Typing the relevant ccTLD in your browser will no longer bring you to the various country services—this preference should be managed directly in settings.

This codifies your country preference, making it harder to switch across different experiences. In the past I’ve both searched in different languages and modified the ccTLD to attempt to locate different search results. Now my searches are limited to the information stored in the US-specific country service maintained by Google, unless I make a settings-level change to affect that.

Searching the global web for information gets a little bit harder. Perhaps market research is showing Google that our hunt for information is more valuable when it’s local (or more “relevant” at least). It’s another way that the web is mediated for our consumption.

Who gives you the Internet?

Iran and Russia are becoming Internet provider nexuses to other countries. Dyn Research wrote about shifts in 2013 that led states in the Persian Gulf to seek out additional Internet providers.

Sometimes, it takes a real disaster to create something genuinely new. March 2013 was a month of disasters in the Middle Eastern, South Asian, and East African Internet, with major submarine cable cuts affecting SMW3, SMW4, IMEWE, EIG, SEACOM, and TE-North.

One of the “genuinely new” Internet traffic paths that emerged in response is a counterintuitive terrestrial route, linking the ancient Indian Ocean trade empire of Oman with the Internet markets of Western Europe, by way of Iran, Azerbaijan, and the Russian Caucasus. As we’ll see, its effects are now being felt across the region, from Pakistan, to Gulf states like Bahrain and Oman, to Kenya.

More recently, Russia started providing a new Internet link to North Korea. As reported by 38North, Russia Provides New Internet Connection to North Korea:

The connection, from TransTeleCom, began appearing in Internet routing databases at 09:08 UTC on Sunday, or around 17:38 Pyongyang time on Sunday evening.

Before this additional route became available, China was the only provider of Internet access to the country’s sole ISP.

Until now, Internet users in North Korea and those outside accessing North Korean websites were all funneled along the same route connecting North Korean ISP Star JV and the global Internet: A China Unicom link that has been in operation since 2010.

This additional link makes the country’s access to the Internet less precarious and vulnerable to disconnection by attackers.

More than once the link has been the target of denial of service attacks. Most were claimed by the “Anonymous” hacking collective, but on at least one previous occasion, many wondered if US intelligence services had carried out the action.

IPv4 trafficking in Romania

Romania is selling IPv4 addresses to make money, but how did they get so many in the first place? ComputerWorld explores How Romania’s patchwork Internet helped spawn an IP address industry.

The roots of the Romanian IP address trade lie in the country’s peculiar Internet history. When commercial Internet service began in Romania around 2000, it was totally unplanned and unregulated. People started ISPs by pulling cables from one house to the next.

As the IPv6 transition and adoption is still ongoing, people are still seeking out IPv4 addresses where they can find them.

Where is the Internet decentralized?

Dyn Research interrogates the notion that the Internet is decentralized by looking at the actual state of infrastructure and routing resilience around the world. What did they find?

The key to the Internet’s survival is the Internet’s decentralization — and it’s not uniform across the world. In some countries, international access to data and telecommunications services is heavily regulated. There may be only one or two companies who hold official licenses to carry voice and Internet traffic to and from the outside world, and they are required by law to mediate access for everyone else.

Countries might not ensure maximum redundancy through decentralization for political reasons, such as a desire to control access to the Internet more easily, but also for monetary reasons:

Increased diversity at the international frontier often spells less money for the national incumbent provider (typically the old telephone company, often owned by the government itself). Without some strong legal prodding and guidance from the telecoms regulator, significant diversification in smaller markets with a strong incumbent can take a long, long time.

 

Unrepresented languages on the web

Per Al Jazeera, 95% of the world’s languages continue to be unrepresented online.

The real problem is a digital architecture that forces people to operate on the terms of another culture, unable to continue the development of their own.

The architecture of the web influences the languages and cultures interacting with it:

He rightly homes in on the invisible underpinnings that enable us to use a language online, such as input methods, OS support (on a range of devices, in countless applications), transliteration and translation and spell-checking tools. Just developing a Yiddish spell-checker, for instance, has required a stable input method for the modified Hebrew alphabet that Yiddish uses, the prior standardization of that alphabet (still contested), standardized spellings of most words (sometimes contested), technical ease in handling the Yiddish alphabet and a loaded dictionary.

It’s complex to reflect the world views and cultures of the world on the web.

Each language reflects a unique world-view and culture complex, mirroring the manner in which a speech community has resolved its problems in dealing with the world, and has formulated its thinking, its system of philosophy and understanding of the world around it. In this, each language is the means of expression of the intangible cultural heritage of people, and it remains a reflection of this culture for some time even after the culture which underlies it decays and crumbles, often under the impact of an intrusive, powerful, usually metropolitan, different culture.

What makes it such a challenge to both incorporate multiple languages on the web, and also to build out fleshed-out versions of those languages?

Language homogenization on the web

Motherboard says The Internet Is Killing Most Languages:

The great flat, globalized world of the internet operates pretty much as a monoculture, Kornai says. Only about 250 languages can be called well-established online, and another 140 are borderline. Of the 7,000 languages still alive, perhaps 2,500 will survive, in the classical sense, for another century, and many fewer will make it on to the internet.

Globalization of the world and the web could lead to homogenization of the languages in both places.

The adage “If it’s not on the web, it does not exist,” neatly encapsulates the loss of prestige. And as a generation of digital natives comes up, their online tongue is likely not to be their mother tongue—a loss of competence.

Languages on the web matter for self identity

Is homogenization of language on the web an instantiation of totalitarianism?

Boston Review on Herta Müller’s Language of Resistance:

Since language plays such an important part in the construction of the self, when the state subjects you to constant acts of linguistic aggression, whether you realize it or not, your sense of who you are and of your place in the world are seriously affected. Your language is not just something you use, but an essential part of what you are. For this reason any political disruption of the way language is normally used can in the long run cripple you mentally, socially, and existentially. When you are unable to think clearly you cannot act coherently. Such an outcome is precisely what a totalitarian system wants: a population perpetually caught in a state of civic paralysis.

What if the web is the state, in this context? What does it mean for self-identity, power, and a neutral web?