This blog post was inspired by this blog post #blogception.
I'm addicted to Spotify. What gets me somewhat excited for Monday morning is a little nugget called Discover Weekly. It's a playlist of recommended songs based on a user's preferences, which I'm guessing is based on play history. It's a machine learning-powered playlist generated by those magicians at Spotify. A software engineer explains how these song recommendations are made here.
My Discover Weekly playlist is hit or miss. Sometimes I find a few really good songs in there, but other times the majority of the songs are just "meh". So I decided to create a playlist of songs that I KNOW I like, and a playlist of songs that I KNOW I do not like. I'll combine these tracks into one playlist and use them as training data to feed into a machine learning algorithm. Once the algorithm is sufficiently trained, the hope is that it will be able to create me a filtered Discover Weekly playlist.
I use the Spotipy Python library to access the Spotify Web API and obtain data on song features.
Spotipy configuration¶
import spotipy
import spotipy.util as util
from config import client_id, client_secret, redirect_uri, username, good_playlist_id, bad_playlist_id
from dw_id import dw_playlist_id
import numpy as np
import pandas as pd
scope = 'playlist-modify-private playlist-modify-public playlist-read-private user-library-read'
token = util.prompt_for_user_token(username, scope, client_id=client_id, client_secret=client_secret, redirect_uri=redirect_uri)
if token:
    sp = spotipy.Spotify(auth=token)
else:
    print("Can't get token for", username)
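The `config` module imported above isn't shown in this post. As a rough sketch of one way to build it, the settings could be read from environment variables. The variable names and the `load_config` helper below are hypothetical, not what my actual config file contains.

```python
import os

# Hypothetical environment variable names; the real config.py is not shown in the post.
REQUIRED_KEYS = [
    "SPOTIFY_CLIENT_ID",
    "SPOTIFY_CLIENT_SECRET",
    "SPOTIFY_REDIRECT_URI",
    "SPOTIFY_USERNAME",
]


def load_config(env=os.environ):
    """Return the Spotify settings as a dict, failing early if any are missing."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("Missing settings: " + ", ".join(missing))
    # "SPOTIFY_CLIENT_ID" becomes "client_id", and so on.
    return {key.replace("SPOTIFY_", "").lower(): env[key] for key in REQUIRED_KEYS}
```

Keeping credentials out of the notebook this way means the notebook can be shared without leaking the client secret.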
Pull data for good and bad playlists¶
I created these methods to clean up the code that pulls the tracks from a playlist. A full explanation is in my previous post.
def get_playlist_tracks(username, playlist_id):
    # The API returns playlist tracks one page at a time, so keep following 'next'
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    return tracks

def tracks_to_df(username, playlist_id, data_array):
    tracks = get_playlist_tracks(username, playlist_id)
    # One row per track: id, name, first listed artist, popularity
    for i in range(len(tracks)):
        row = [tracks[i]['track']['id'],
               tracks[i]['track']['name'],
               tracks[i]['track']['artists'][0]['name'],
               tracks[i]['track']['popularity']]
        data_array.append(row)
    data_df = pd.DataFrame(data=data_array, columns=['id','name','artist','popularity'])
    return data_df
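The `while results['next']` loop in `get_playlist_tracks` is a common pagination pattern: keep asking for the next page until there isn't one. Here is a small offline sketch of the same idea. `FakeClient` and its hard-coded pages are made up for illustration and stand in for the real Spotipy client.

```python
class FakeClient:
    """Stand-in for a paginated API client (hypothetical, for illustration only)."""

    def __init__(self, pages):
        self._pages = pages

    def first_page(self):
        return self._page(0)

    def next(self, result):
        # Plays the role of sp.next(results): fetch the page that 'next' points to.
        return self._page(result["next"])

    def _page(self, index):
        has_more = index + 1 < len(self._pages)
        return {"items": list(self._pages[index]), "next": index + 1 if has_more else None}


def collect_all_items(client):
    """Follow 'next' links until the last page and return every item."""
    results = client.first_page()
    items = list(results["items"])
    while results["next"]:
        results = client.next(results)
        items.extend(results["items"])
    return items


client = FakeClient([["track_a", "track_b"], ["track_c", "track_d"], ["track_e"]])
all_items = collect_all_items(client)
print(all_items)
```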
Collect tracks into a dataframe with columns for id, name, artist and popularity.
data_good = []
df_good = tracks_to_df(username, good_playlist_id, data_good)
data_bad = []
df_bad = tracks_to_df(username, bad_playlist_id, data_bad)
Pull features¶
Below are two methods to help pull features from each track. Again, an explanation of the code clean up is in my previous post.
def chunks(mylist, chunk_size):
    # Step through mylist in strides of chunk_size,
    for i in range(0, len(mylist), chunk_size):
        # yielding one slice of at most chunk_size items at a time
        yield mylist[i:i+chunk_size]
def features_to_df(ids, data_array):
    # Get audio features for each batch of 50 ids and append the batch to data_array
    for ids_batch in chunks(ids, 50):
        features_temp = sp.audio_features(tracks=ids_batch)
        data_array.append(features_temp)
    columns = list(data_array[0][0].keys())
    columns.sort()
    # Convert each batch to a dataframe, then stack the batches into one dataframe
    # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
    batch_dfs = [pd.DataFrame(batch, columns=columns) for batch in data_array]
    df_features = pd.concat(batch_dfs, ignore_index=True)
    return df_features
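To see what `chunks` actually produces, here is a small standalone example with 120 made-up IDs and the same batch size of 50 used above.

```python
def chunks(mylist, chunk_size):
    """Yield successive slices of mylist, each at most chunk_size items long."""
    for i in range(0, len(mylist), chunk_size):
        yield mylist[i:i + chunk_size]


fake_ids = ["id_%d" % n for n in range(120)]  # placeholder IDs, not real tracks
batch_sizes = [len(batch) for batch in chunks(fake_ids, 50)]
print(batch_sizes)  # [50, 50, 20]
```

The last batch is simply whatever is left over, so no track is dropped when the list length isn't a multiple of the batch size.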
Save the track IDs into a list.
good_ids=df_good['id'].tolist()
bad_ids=df_bad['id'].tolist()
Append all of the track features (for good and bad tracks) into one dataframe called data.
good_features = []
df_features_good = features_to_df(good_ids, good_features)
bad_features = []
df_features_bad = features_to_df(bad_ids, bad_features)
df_features_good['target'] = 1
df_features_bad['target'] = 0
# pd.concat stacks the two dataframes (DataFrame.append was removed in pandas 2.0)
data = pd.concat([df_features_good, df_features_bad], ignore_index=True)
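The labelling step is easy to try on toy data. This sketch uses two tiny made-up feature tables (the values are invented, not from my playlists), tags each with a target, and stacks them with `pd.concat`.

```python
import pandas as pd

# Made-up feature values, only to show the shape of the result.
liked = pd.DataFrame({"danceability": [0.8, 0.7], "energy": [0.9, 0.6]})
disliked = pd.DataFrame({"danceability": [0.3], "energy": [0.2]})

liked["target"] = 1      # songs I like
disliked["target"] = 0   # songs I do not like

# ignore_index=True renumbers the rows 0..n-1 in the combined table.
toy_data = pd.concat([liked, disliked], ignore_index=True)
print(toy_data)
```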
Create test and training data¶
In a previous post I plotted the distribution of features of songs I like and dislike. The features that seemed to have the most variation from "like" to "dislike" were danceability, energy, tempo and valence. As such, I have created a subset of the full 11 features, called features_variation, for the training data set. Note that the list in the code below also includes key and duration_ms.
I chose a test size of 25%. I played around with this parameter a bit, trying 0.3 and 0.4, but the model prediction accuracy seemed to be best at 0.25. This results in a training sample of 378 songs and a test sample of 127 songs.
#Define the set of features that we want to look at
features_full = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "key", "duration_ms"]
features_variation = ["danceability", "energy", "tempo", "valence", "key", "duration_ms"]
features = features_variation
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data[features], data['target'], test_size = 0.25)
x_train.shape
x_test.shape
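The sample sizes quoted above can be checked by hand. This is a quick sketch, assuming 505 songs in total (378 + 127, inferred from the numbers above) and assuming the test set size is rounded up with the remainder going to training, which is how I understand scikit-learn sizes the split.

```python
import math

n_songs = 378 + 127          # total inferred from the sizes quoted in the post
test_size = 0.25

n_test = math.ceil(test_size * n_songs)   # test set size is rounded up
n_train = n_songs - n_test                # the remainder goes to training

print(n_train, n_test)  # 378 127
```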
Models¶
In this section I feed the training data into various classifiers, i.e. I train them to predict which songs I will like and dislike. I'm not yet familiar with all of these algorithms but the blog post I was following used all of them. 😂😂😂
1. Decision Tree Classifier¶
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(min_samples_split=100)
tree.fit(x_train, y_train)
tree_pred = tree.predict(x_test)
score = accuracy_score(y_test, tree_pred) * 100
print("Accuracy using Decision Tree: ", round(score, 1), "%")
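For reference, the accuracy used here is just the fraction of test songs whose predicted label matches the true label. A hand-rolled version on made-up labels (the `accuracy` helper is only for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return matches / len(y_true)


# Made-up labels: 1 = like, 0 = dislike.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

score = accuracy(y_true, y_pred) * 100
print("Accuracy:", round(score, 1), "%")  # Accuracy: 75.0 %
```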
2. K Neighbours Classifier¶
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(3)
knn.fit(x_train, y_train)
knn_pred = knn.predict(x_test)
score = accuracy_score(y_test, knn_pred) * 100
print("Accuracy using K Neighbours: ", round(score, 1), "%")
3. Multi-layer Perceptron¶
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier()
mlp.fit(x_train, y_train)
mlp_pred = mlp.predict(x_test)
score = accuracy_score(y_test, mlp_pred) * 100
print("Accuracy using Multi-layer Perceptron: ", round(score, 1), "%")
4. Random Forest Classifier¶
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
forest.fit(x_train, y_train)
forest_pred = forest.predict(x_test)
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, forest_pred) * 100
print("Accuracy using Random Forest: ", round(score, 1), "%")
5. AdaBoost Classifier¶
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(n_estimators=100)
ada.fit(x_train, y_train)
ada_pred = ada.predict(x_test)
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, ada_pred) * 100
print("Accuracy using AdaBoost: ", round(score, 1), "%")
6. Naive Bayes¶
from sklearn.naive_bayes import GaussianNB
gauss = GaussianNB()
gauss.fit(x_train, y_train)
gauss_pred = gauss.predict(x_test)
score = accuracy_score(y_test, gauss_pred)*100
print("Accuracy using Gaussian Naive Bayes: ", round(score, 1), "%")
7. K Means Clustering¶
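from sklearn.cluster import KMeans
# K-means is unsupervised: fit() ignores y_train and the cluster ids it predicts are arbitrary,
# so each cluster is mapped to the majority target of its training songs before scoring
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(x_train)
cluster_to_target = {c: int(y_train[k_means.labels_ == c].mean() >= 0.5) for c in range(3)}
predicted = [cluster_to_target[c] for c in k_means.predict(x_test)]
score = accuracy_score(y_test, predicted)*100
print("Accuracy using K Means: ", round(score, 1), "%")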
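K-means predicts cluster IDs rather than like/dislike labels, so scoring it with `accuracy_score` only makes sense once each cluster has been mapped to a label. Here is a sketch of one such mapping, a majority vote among the training songs in each cluster, on made-up values:

```python
from collections import Counter, defaultdict

# Made-up example: the cluster id assigned to each training song, and its true label.
train_clusters = [0, 0, 1, 1, 1, 2, 2, 2]
train_targets = [1, 1, 0, 0, 1, 0, 1, 1]

votes = defaultdict(Counter)
for cluster, target in zip(train_clusters, train_targets):
    votes[cluster][target] += 1

# Each cluster takes the most common label among its members.
cluster_to_target = {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

test_clusters = [2, 0, 1]
predicted = [cluster_to_target[c] for c in test_clusters]
print(cluster_to_target, predicted)
```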
8. Gradient Boosting Classifier¶
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=.1, max_depth=1, random_state=0)
gbc.fit(x_train, y_train)
predicted = gbc.predict(x_test)
score = accuracy_score(y_test, predicted)*100
print("Accuracy using Gradient Boosting: ", round(score, 1), "%")
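The cells above all repeat the same fit, predict and score steps. One possible way to compare models side by side is a loop over a dictionary of classifiers. This sketch uses synthetic data from `make_classification` rather than my playlists, and fixed `random_state` values so the split and scores are repeatable. The scores it prints say nothing about my music taste.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the song features; not real Spotify data.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(min_samples_split=100, random_state=0),
    "Random Forest": RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1, random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test)) * 100
    print("Accuracy using " + name + ": ", round(scores[name], 1), "%")
```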
Now apply predictions to my Discover Weekly playlist¶
data_dw = []
df_dw = tracks_to_df(username, dw_playlist_id, data_dw)
df_dw
dw_ids=df_dw['id'].tolist()
dw_features = []
data_discover_weekly = features_to_df(dw_ids, dw_features)
pred_gbc = gbc.predict(data_discover_weekly[features])
pred_tree = tree.predict(data_discover_weekly[features])
likedSongs = 0
i = 0
for prediction in pred_tree:
    if prediction == 1:
        print("Song " + str(likedSongs+1) + ": " + df_dw["name"][i] + ", By: " + df_dw["artist"][i])
        # add each song to a new playlist
        sp.user_playlist_add_tracks(username, '2RARDnZLQGVPo0sXScDA8g', [df_dw['id'][i]])
        likedSongs = likedSongs + 1
    i = i + 1
from IPython.display import IFrame
IFrame("https://open.spotify.com/embed/playlist/2RARDnZLQGVPo0sXScDA8g", width=600, height=380)
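The loop above makes one API call per liked song. An alternative is to collect the liked IDs first and add them in batches. I believe the Web API accepts up to 100 tracks per add request, but treat that limit as an assumption. This sketch only builds the batches from made-up predictions and never calls Spotify.

```python
# Made-up predictions and track IDs standing in for pred_tree and df_dw['id'].
predictions = [1, 0, 1, 1, 0]
track_ids = ["id_a", "id_b", "id_c", "id_d", "id_e"]

# Keep only the tracks predicted as "like".
liked_ids = [track_id for track_id, label in zip(track_ids, predictions) if label == 1]

BATCH_SIZE = 100  # assumed per-request limit
batches = [liked_ids[i:i + BATCH_SIZE] for i in range(0, len(liked_ids), BATCH_SIZE)]

print(liked_ids)     # ['id_a', 'id_c', 'id_d']
print(len(batches))  # 1
# Each batch could then be passed to sp.user_playlist_add_tracks(username, playlist_id, batch).
```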
SO WHAT'S THE VERDICT?!¶
I think this process was reasonably successful. The Decision Tree algorithm was able to cut the playlist down from 30 songs to 18, a 40% reduction! I think the playlist wasn't cut down further because my Discover Weekly playlist is actually fairly customized for me already. As you can see, it's all dance/pop type music, which is the majority of what I listen to and what my other playlists are made of. Spotify just knows me too well.
Other classification algorithms¶
To be honest, I don't know what these do. I haven't familiarized myself with all of the different types of classification algorithms...
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
qda = QuadraticDiscriminantAnalysis()
qda.fit(x_train, y_train)
qda_pred = qda.predict(x_test)
score = accuracy_score(y_test, qda_pred)*100
print("Accuracy using Quadratic Discriminant Analysis: ", round(score, 1), "%")
from sklearn.svm import SVC
svc_lin = SVC(kernel="linear", C=0.025)
svc_lin.fit(x_train, y_train)
svc_pred = svc_lin.predict(x_test)
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, svc_pred) * 100
print("Accuracy using Support Vector Machine: ", round(score, 1), "%")
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
gpc = GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True)
gpc.fit(x_train, y_train)
gpc_pred = gpc.predict(x_test)
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, gpc_pred) * 100
print("Accuracy using Gaussian Process: ", round(score, 1), "%")