This course was an introduction to several data science methods and tools, such as MapReduce, R, Python, and sentiment analysis. My individual final project was to analyze the Million Song Dataset to find each user's most-listened-to song, the length of that song, and its year of release. I also analyzed the ten most played songs overall.
My project aimed to learn about each person’s most-listened-to song (which I call a hit song) and to see whether the duration of these hit songs shows a trend over the years. This information could be useful to songwriters who want to write hit songs that follow current trends. I also wanted to analyze the subset more deeply to determine which songs users listened to most.
Year: 2015
Duration: 3 months
I used two data sources. The first was The Echo Nest Taste Profile subset from https://labrosa.ee.columbia.edu/millionsong/tasteprofile, a text file of tab-separated fields: user_id, song_id, and play count.
From this dataset, I wanted to find each user’s most-listened-to song, that is, the song with the highest play count in their listening history. I assumed this was their favorite song, and I refer to it as a hit song.
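As a rough illustration of this step, here is a minimal Python sketch, not the code used in the project. It assumes each line holds user_id, song_id, and play count separated by tabs, and it keeps the first song seen when a user has a tie, which is an arbitrary choice. The function name `most_played_per_user` and the sample rows are made up for the example.

```python
import csv
import io

def most_played_per_user(lines):
    """Return {user_id: (song_id, play_count)} with the highest play count per user."""
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 3:
            continue  # skip malformed lines
        user_id, song_id, count = row[0], row[1], int(row[2])
        # On ties, the first song encountered is kept (arbitrary choice).
        if user_id not in best or count > best[user_id][1]:
            best[user_id] = (song_id, count)
    return best

# Tiny made-up sample in the same tab-separated layout as the subset.
sample = io.StringIO(
    "u1\tSOAAA\t3\n"
    "u1\tSOBBB\t7\n"
    "u2\tSOAAA\t1\n"
)
hits = most_played_per_user(sample)
print(hits)
```

For the real file, an open file handle could be passed in place of the `io.StringIO` sample, since the function only iterates over lines.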
The other source was track_metadata.db from the Million Song Dataset (https://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/track_metadata.db), a database that contains metadata for 1 million tracks with these fields:
track_id, title, song_id, release, artist_id, artist_mbid, artist_name, duration, artist_familiarity, artist_hotttnesss, year
From this list, I was most interested in the year and the duration of each hit song.
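A lookup of year and duration by song_id could be sketched as follows. This is an illustration under assumptions: it treats track_metadata.db as an SQLite file whose table is named `songs`, and it uses an in-memory stand-in with made-up rows so that it runs on its own. The helper name `lookup_year_duration` is hypothetical.

```python
import sqlite3

def lookup_year_duration(conn, song_ids):
    """Map song_id -> (year, duration in seconds) for the song IDs that are found."""
    result = {}
    for song_id in song_ids:
        # A song_id may match several tracks; LIMIT 1 just takes one of them.
        row = conn.execute(
            "SELECT year, duration FROM songs WHERE song_id = ? LIMIT 1",
            (song_id,),
        ).fetchone()
        if row is not None:
            result[song_id] = (row[0], row[1])
    return result

# In-memory stand-in for track_metadata.db (assumed table name: songs).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE songs (track_id TEXT, song_id TEXT, duration REAL, year INTEGER)"
)
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?, ?)",
    [("TR1", "SOAAA", 215.5, 2005), ("TR2", "SOBBB", 241.0, 2007)],
)
meta = lookup_year_duration(conn, ["SOBBB", "SOZZZ"])
print(meta)
```

Song IDs with no metadata row (here the made-up `SOZZZ`) are simply left out of the result.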
The release years of the most played songs spanned 1922 to 2010. I also noticed that the number of unique “most played songs” was unevenly distributed across years, quite possibly because The Echo Nest data mostly covers users from more recent years (the 2000s). I therefore created a chart of the number of unique “most played songs” per year. I found that 2007 had the largest number of unique most played songs, at 8322.
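The per-year count behind such a chart could be computed along these lines. This is a sketch with made-up data; it assumes a year of 0 (or a missing entry) means the release year is unknown, and the names `unique_hits_per_year`, `user_hits`, and `song_year` are introduced only for the example.

```python
from collections import defaultdict

def unique_hits_per_year(user_hits, song_year):
    """Count distinct hit songs per release year.

    user_hits: {user_id: song_id}, each user's most played song
    song_year: {song_id: year}
    """
    songs_by_year = defaultdict(set)
    for song_id in user_hits.values():
        year = song_year.get(song_id, 0)
        if year > 0:  # assumption: 0 or missing means the year is unknown
            songs_by_year[year].add(song_id)
    return {year: len(songs) for year, songs in sorted(songs_by_year.items())}

# Made-up sample: two users share the same hit song from 2007.
user_hits = {"u1": "SOBBB", "u2": "SOBBB", "u3": "SOCCC", "u4": "SODDD"}
song_year = {"SOBBB": 2007, "SOCCC": 2007, "SODDD": 0}
counts = unique_hits_per_year(user_hits, song_year)
print(counts)
```

Using a set per year means a song that is the hit song of many users is still counted once.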

I wanted to learn whether most played songs had a common length and whether there was a trend over the years. From 2003 onwards, users’ most played songs were on average about 240 seconds long (4 minutes). This suggests that songwriters aiming for a hit should target a length of around 4 minutes. Most played songs from before 1965 appear to be typically shorter (around 3 minutes) than songs are today.
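The yearly averages can be sketched as a simple group-and-mean over (year, duration) pairs. The durations below are invented to keep the example small and are not values from the dataset; `mean_duration_by_year` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

def mean_duration_by_year(hit_songs):
    """hit_songs: iterable of (year, duration_seconds), one per unique hit song."""
    durations = defaultdict(list)
    for year, seconds in hit_songs:
        durations[year].append(seconds)
    return {year: mean(values) for year, values in sorted(durations.items())}

# Made-up durations, not values from the dataset.
sample = [(1960, 170.0), (1960, 190.0), (2005, 230.0), (2005, 250.0)]
averages = mean_duration_by_year(sample)
for year, seconds in averages.items():
    # Show each average as minutes:seconds for easier reading.
    print(year, f"{int(seconds // 60)}:{int(seconds % 60):02d}")
```

Averaging over unique hit songs, rather than over users, keeps one very popular song from dominating a year; which of the two the project used is not stated, so this is only one possible reading.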

It seems that in The Echo Nest subset, the majority of users’ most-listened-to songs were released in the 2000s. This may be because most play count data was not collected until the 2000s. A user’s most-listened-to song was rarely from the 1980s or earlier (fewer than 500 unique most-listened-to songs per year in Fig 1).
As a comparison, I checked the Billboard Top Ten Songs of 2015. Most songs were about 4 minutes long, with a couple longer than 4 minutes and a couple around 3 minutes; the average length of these 10 songs is 3:53 (3 minutes 53 seconds). The duration of users’ most-listened-to songs from 2003 to 2010 in The Echo Nest set seems to match the average length of Billboard’s top ten hit songs for 2015.
Therefore, songwriters aiming for a hit should keep their songs to around 4 minutes or less. This matches the listening patterns of The Echo Nest users from 2003 to 2010 and also fits the top ten songs of the 2015 Billboard top 100 list.
Below is a table of the ten songs with the highest total play counts in the Echo Nest subset.
| Label | Total plays | Artist - Title |
| --- | --- | --- |
| A | 726885 | Dwight Yoakam - You're The One |
| B | 648239 | Bjork - Undo |
| C | 527893 | Kings of Leon - Revelry |
| D | 425463 | Harmonia - Sehr kosmisch |
| E | 389880 | Barry Tuckwell/Academy of St Martin-in-the-Fields/Sir Neville Marriner - Horn Concerto No. 4 in E flat K495: II. Romance (Andante cantabile) |
| F | 356533 | Florence + The Machine - Dog Days Are Over (Radio Edit) |
| G | 292642 | OneRepublic - Secrets |
| H | 274627 | Five Iron Frenzy - Canada |
| I | 268353 | Tub Ring - Invalid |
| J | 244730 | Sam Cooke - Ain't Misbehavin |
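A ranking like this can be produced by summing play counts per song across all users. The following is a minimal sketch with invented rows, assuming the data has already been parsed into (user_id, song_id, plays) tuples; `top_songs_by_plays` is a name made up for the example.

```python
from collections import Counter

def top_songs_by_plays(rows, n=10):
    """Sum play counts per song and return the n songs with the highest totals."""
    totals = Counter()
    for _user_id, song_id, plays in rows:
        totals[song_id] += plays
    return totals.most_common(n)

# Made-up rows in (user_id, song_id, plays) form.
rows = [
    ("u1", "SOAAA", 5),
    ("u2", "SOAAA", 2),
    ("u1", "SOBBB", 6),
    ("u3", "SOCCC", 1),
]
top = top_songs_by_plays(rows, n=2)
print(top)
```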
I found it suspicious that the list included songs I had never heard of, so I collected the number of YouTube views for each song. The discrepancies make me wonder how realistic and accurate the Echo Nest subset is. As you can see below, the song with the fewest YouTube views, only 76, ranked 5th by play count in the Echo Nest subset (song E). The 7th song in the list (G, OneRepublic - Secrets) had over 128 million views on YouTube but only 292642 plays in the Echo Nest subset. I doubt the accuracy and validity of using the Echo Nest Taste Profile as representative of popular listening behavior.

Below, I noticed that having the most total plays does not imply having the most unique listeners. Song “D”, Harmonia - Sehr kosmisch, appears to have the largest number of distinct listeners. However, I found only 12192 YouTube views for that song, compared with 425463 plays in the Echo Nest subset. This is another detail that seems somewhat unbelievable.
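The difference between total plays and distinct listeners can be shown with a small sketch. The rows are invented: one user plays one song many times, while another song has more listeners but fewer plays. The name `plays_and_listeners` is hypothetical.

```python
from collections import defaultdict

def plays_and_listeners(rows):
    """Return {song_id: (total_plays, distinct_listener_count)}."""
    plays = defaultdict(int)
    listeners = defaultdict(set)
    for user_id, song_id, count in rows:
        plays[song_id] += count
        listeners[song_id].add(user_id)
    return {song: (plays[song], len(listeners[song])) for song in plays}

# Made-up rows: SOAAA has one heavy listener, SOBBB has three light ones.
rows = [
    ("u1", "SOAAA", 50),
    ("u1", "SOBBB", 2),
    ("u2", "SOBBB", 3),
    ("u3", "SOBBB", 1),
]
stats = plays_and_listeners(rows)
most_played = max(stats, key=lambda s: stats[s][0])
most_listeners = max(stats, key=lambda s: stats[s][1])
print(most_played, most_listeners)
```

In this toy data the two rankings disagree, which is the same pattern described above: a few heavy listeners can push a song to the top by play count.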

Finally, I think the Echo Nest subset might be a good dataset for testing code, but it is not representative of most users’ listening behavior. I would be interested in knowing how this data was collected, from where, and over what time period.