The SPUD dataset contains 10,000 playlists created by Last.fm users. Spotify URI and song metadata are included for every song. Full listening history is also included for a number of users. The dataset is in SQLite database format and is available under the MIT open source license:
Download dataset
Read the paper
This example shows one user's listening behaviour over several years. The entropy of each feature is computed over a moving window of track plays, indicating how diverse the listening was with respect to that feature. Taking artist IDs as an example: throughout 2007-2010 the entropy is often low, with some sessions consisting of just one artist. In February 2010 the user began using online radio, listening to tracks irrespective of artist, which caused a distinct change in entropy. Note that they still occasionally returned to their old style of music retrieval.
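As a rough guide to reading the entropy values, the quantity computed for each window can be written out as below. This is a sketch that assumes infotheo's default empirical estimator, which reports entropy in nats; the symbols W, n_a and \hat{p}_a are introduced here for illustration and do not appear in the script.

```latex
% W: one window of consecutive track plays (20 plays in the script; windows at the ends may be shorter)
% n_a: number of plays in W by artist a
\hat{p}_a = \frac{n_a}{|W|}, \qquad
H(W) = -\sum_{a} \hat{p}_a \ln \hat{p}_a
% One artist in all 20 plays:  H(W) = 0
% 20 different artists:        H(W) = \ln 20 \approx 3.0 \text{ nats}
```

Under this reading, a window dominated by a single artist sits near zero, while radio-style listening across many artists pushes the value towards the upper bound set by the window size.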
library(RSQLite)
library(infotheo)
library(reshape)
library(ggplot2)
library(changepoint)
library(zoo)
library(RColorBrewer)
# Query the dataset
con <- dbConnect(SQLite(), "spud.sqlite")
history <- dbGetQuery(con, "SELECT tl.date, t.* FROM lastfmtracklistens as tl JOIN tracks as t ON tl.track=t.trackid WHERE tl.user=2275;")
# Sort listens by date
history <- history[with(history, order(date)),]
# We need discrete features for entropy functions
history[,'popularity'] <- discretize(history[,'popularity'],nbins=25)
history[,'duration'] <- discretize(history[,'duration'],nbins=25)
# Select a feature
f <- zoo(history$artist)
# POSIXct is a compact numeric time class and is safer than POSIXlt as a zoo index
time(f) <- as.POSIXct(history$date)
# Calculate the windowed entropy
H <- rollapply(f, 20, entropy, by=20, partial=TRUE)
Hdf <- fortify(H)
# Find a changepoint
cp <- Hdf[attributes(cpt.meanvar(Hdf$H))$cpts[1],]$Index
Hdf$aftercp <- Hdf$Index > cp
historyPlot <- ggplot(data=Hdf, aes(x=Index, y=H)) + geom_point(alpha=0.7, aes(colour=aftercp)) + ...
This example shows how to identify which music features are relevant to how users constructed the 10,000 playlists in the dataset. We retrieve music features for each song and calculate the mutual information between each feature and playlist membership.
library(infotheo)
library(RSQLite)
library(ggplot2)
# Calculate Adapted Mutual Information
# We normalise against feature entropy only, and not class (playlist)
# Refer to the paper for more details
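getAMI <- function(feature, class) {
  # Mutual information between playlist membership and the feature
  I <- mutinformation(class, feature)
  # Chance-level baseline: mutual information after shuffling the playlist labels
  n0 <- mutinformation(class[sample.int(length(class))], feature)
  # Alternative normalisation against both entropies (not used):
  #return((I - n0) / (max(entropy(class), entropy(feature)) - n0))
  return((I - n0) / (entropy(feature) - n0))
}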
# Query the dataset
con <- dbConnect(SQLite(), "spud.sqlite")
# Our entropy view joins playlists with music features
# These are proprietary and not currently included in the SPUD dataset
data <- dbGetQuery(con, "SELECT * FROM [Entropy];")
# Select the features we're interested in
features <- data[,6:(ncol(data)-4)]
class <- data$playlist
# Calculate AMI between features and class
fami <- apply(features,2,getAMI,class)
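Written out, the normalisation performed by getAMI above corresponds to the expression below. This is read directly from the script, not taken from the paper, so the paper's formal definition may differ in detail; the symbols F, C and \tilde{C} are notation introduced here.

```latex
% F: a discretised music feature
% C: playlist membership
% \tilde{C}: C with its labels randomly permuted (one shuffle per call)
\mathrm{AMI}(F; C) = \frac{I(C; F) - I(\tilde{C}; F)}{H(F) - I(\tilde{C}; F)}
```

The shuffled term estimates the mutual information expected by chance, so a feature unrelated to playlist membership scores near 0, and a feature fully determined by the playlist scores 1. Because the baseline comes from a single random permutation, repeated runs will give slightly different values.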