This blog post was inspired by this blog post #blogception.

I'm addicted to Spotify. What gets me somewhat excited for Monday morning is a little nugget called Discover Weekly: a machine learning-powered playlist of recommended songs, generated by those magicians at Spotify from each user's preferences (play history, I'm guessing). A software engineer explains how these song recommendations are made here.

My Discover Weekly playlist is hit or miss. Sometimes I find a few really good songs in there, but other times the majority of the songs are just "meh". So I decided to create a playlist of songs that I KNOW I like, and a playlist of songs that I KNOW I do not like. I'll combine these tracks into one dataset and use them as training data for a machine learning algorithm. Once the algorithm is sufficiently trained, the hope is that it can produce a filtered Discover Weekly playlist for me.

I use the Spotipy Python library to access the Spotify Web API and obtain data on song features.

Spotipy configuration

In [1]:
import spotipy
import spotipy.util as util
from config import client_id, client_secret, redirect_uri, username, good_playlist_id, bad_playlist_id
from dw_id import dw_playlist_id
import numpy as np
import pandas as pd

# Scopes needed to read my playlists/library and write to playlists
scope = 'playlist-modify-private playlist-modify-public playlist-read-private user-library-read'
token = util.prompt_for_user_token(username, scope, client_id=client_id, client_secret=client_secret, redirect_uri=redirect_uri)
if token:
    sp = spotipy.Spotify(auth=token)
else:
    print("Can't get token for", username)

Pull data for good and bad playlists

I created these methods to clean up the code that pulls the tracks from a playlist. A full explanation is in my previous post.

In [2]:
def get_playlist_tracks(username, playlist_id):
    # Page through the playlist; the API returns tracks in batches,
    # with a 'next' link while more pages remain
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    return tracks

def tracks_to_df(username, playlist_id, data_array):
    tracks = get_playlist_tracks(username, playlist_id)
    for item in tracks:
        track = item['track']
        data_array.append([track['id'],
                           track['name'],
                           track['artists'][0]['name'],
                           track['popularity']])

    data_df = pd.DataFrame(data=data_array, columns=['id', 'name', 'artist', 'popularity'])
    return data_df

Collect tracks into a dataframe with columns for id, name, artist and popularity.

In [3]:
data_good = []
df_good = tracks_to_df(username, good_playlist_id, data_good)

data_bad = []
df_bad = tracks_to_df(username, bad_playlist_id, data_bad)

Pull features

Below are two methods to help pull the audio features for each track. Again, an explanation of the code cleanup is in my previous post.

In [4]:
def chunks(mylist, chunk_size):
    # Yield successive chunk_size-sized slices of mylist
    for i in range(0, len(mylist), chunk_size):
        yield mylist[i:i+chunk_size]
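
For example, splitting five ids into batches of two:

list(chunks(['a', 'b', 'c', 'd', 'e'], 2))
# [['a', 'b'], ['c', 'd'], ['e']]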
In [5]:
def features_to_df(ids, data_array):
    # Request audio features in batches of 50 to stay under the API's per-call limit
    for ids_batch in chunks(ids, 50):
        features_temp = sp.audio_features(tracks=ids_batch)
        data_array.append(features_temp)

    columns = sorted(data_array[0][0].keys())

    # Convert each batch of feature dicts to a dataframe and concatenate them
    df_features = pd.DataFrame(columns=columns)
    for batch in data_array:
        df_temp = pd.DataFrame(batch, columns=columns)
        df_features = df_features.append(df_temp, ignore_index=True)

    return df_features

Save the track IDs into a list.

In [6]:
good_ids = df_good['id'].tolist()
bad_ids = df_bad['id'].tolist()

Pull the audio features for the good and bad tracks, label each track (target = 1 for good, 0 for bad), and append everything into one dataframe called data.

In [7]:
good_features = []
df_features_good = features_to_df(good_ids, good_features)

bad_features = []
df_features_bad = features_to_df(bad_ids, bad_features)

df_features_good['target'] = 1
df_features_bad['target'] = 0

data = df_features_good.append(df_features_bad, ignore_index=True)

Create test and training data

In a previous post I plotted the distribution of features of songs I like and dislike. The features that seemed to vary the most between "like" and "dislike" were danceability, energy, tempo and valence. As such, I created a subset of the full 11 features, called features_variation (those four plus key and duration_ms), to use as the training feature set.
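
(For reference, here's a minimal sketch of how such a comparison could be done on the combined data dataframe built above; it isn't the plotting code from that post.)

import matplotlib.pyplot as plt

# Overlay "like" and "dislike" histograms for a few features
for feature in ["danceability", "energy", "tempo", "valence"]:
    data.loc[data['target'] == 1, feature].astype(float).plot(kind='hist', alpha=0.5, label='like')
    data.loc[data['target'] == 0, feature].astype(float).plot(kind='hist', alpha=0.5, label='dislike')
    plt.title(feature)
    plt.legend()
    plt.show()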

I chose a test size of 25%. I played around with this parameter a bit, trying 0.3 and 0.4, but the model prediction accuracy seemed to be best at 0.25 (a quick sketch of that comparison follows the shape checks below). This results in a training sample of 378 songs and a test sample of 127 songs.

In [8]:
#Define the set of features that we want to look at
features_full = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "key", "duration_ms"]
features_variation = ["danceability", "energy", "tempo", "valence", "key", "duration_ms"]
features = features_variation

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data[features], data['target'], test_size = 0.25)
In [9]:
x_train.shape
Out[9]:
(378, 6)
In [10]:
x_test.shape
Out[10]:
(127, 6)
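
Here's roughly the kind of loop I used to compare test sizes. This is a sketch rather than my original run, and since the splits aren't seeded the exact numbers will vary:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Refit a decision tree (same settings as below) for each candidate test size
for size in [0.25, 0.3, 0.4]:
    xtr, xte, ytr, yte = train_test_split(data[features], data['target'], test_size=size)
    clf = DecisionTreeClassifier(min_samples_split=100).fit(xtr, ytr)
    print(size, round(accuracy_score(yte, clf.predict(xte)) * 100, 1), "%")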

Models

In this section I feed the training data into various classifiers, i.e. train them to make predictions about which songs I will like and dislike. I'm not yet familiar with all of these algorithms, but the blog post I was following used all of them. 😂😂😂
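
One caveat before the scores: with roughly 500 songs, a single train/test split gives a fairly noisy accuracy estimate, so the rankings below can shift between runs. Cross-validation would give a steadier comparison; here's a minimal sketch (not something I ran for this post), using the same decision tree settings as below:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Average accuracy over 5 folds instead of a single split
scores = cross_val_score(DecisionTreeClassifier(min_samples_split=100),
                         data[features], data['target'], cv=5)
print("Mean CV accuracy:", round(scores.mean() * 100, 1), "%")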

1. Decision Tree Classifier

In [11]:
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(min_samples_split=100)
tree.fit(x_train, y_train)
tree_pred = tree.predict(x_test)
score = accuracy_score(y_test, tree_pred) * 100
print("Accuracy using Decision Tree: ", round(score, 1), "%")
Accuracy using Decision Tree:  72.4 %

2. K Neighbours Classifier

In [12]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
knn_pred = knn.predict(x_test)
score = accuracy_score(y_test, knn_pred) * 100
print("Accuracy using K Neighbours: ", round(score, 1), "%")
Accuracy using K Neighbours:  51.2 %

3. Multi-layer Perceptron

In [13]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier()
mlp.fit(x_train, y_train)
mlp_pred = mlp.predict(x_test)
score = accuracy_score(y_test, mlp_pred) * 100
print("Accuracy using Multi-layer Perceptron: ", round(score, 1), "%")
Accuracy using Multi-layer Perceptron:  52.8 %

4. Random Forest Classifier

In [14]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
forest.fit(x_train, y_train)
forest_pred = forest.predict(x_test)
score = accuracy_score(y_test, forest_pred) * 100
print("Accuracy using Random Forest: ", round(score, 1), "%")
Accuracy using Random Forest:  70.9 %

5. AdaBoost Classifier

In [15]:
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(n_estimators=100)
ada.fit(x_train, y_train)
ada_pred = ada.predict(x_test)
score = accuracy_score(y_test, ada_pred) * 100
print("Accuracy using AdaBoost: ", round(score, 1), "%")
Accuracy using AdaBoost:  77.2 %

6. Naive Bayes

In [16]:
from sklearn.naive_bayes import GaussianNB
gauss = GaussianNB()
gauss.fit(x_train, y_train)
gauss_pred = gauss.predict(x_test)
score = accuracy_score(y_test, gauss_pred)*100
print("Accuracy using Gaussian Naive Bayes: ", round(score, 1), "%")
Accuracy using Gaussian Naive Bayes:  67.7 %

7. K Means Clustering

A caveat before the score: k-means is an unsupervised algorithm, so fit() ignores the labels and predict() returns cluster indices (0, 1 or 2 with n_clusters=3) rather than like/dislike predictions. Any overlap with the target is coincidental, so the accuracy below isn't directly comparable to the other classifiers.

In [17]:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(x_train, y_train)
predicted= k_means.predict(x_test)
score = accuracy_score(y_test, predicted)*100
print("Accuracy using K Means: ", round(score, 1), "%")
Accuracy using K Means:  52.0 %

8. Gradient Boosting Classifier

In [18]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=.1, max_depth=1, random_state=0)
gbc.fit(x_train, y_train)
predicted = gbc.predict(x_test)
score = accuracy_score(y_test, predicted)*100
print("Accuracy using Gradient Boosting: ", round(score, 1), "%")
Accuracy using Gradient Boosting:  77.2 %
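
To line the scores up side by side, here's a compact comparison over the classifiers fitted above. It re-scores the same test split, so the numbers match those printed in each section (k-means is left out since its cluster labels aren't comparable):

models = {"Decision Tree": tree, "K Neighbours": knn, "MLP": mlp,
          "Random Forest": forest, "AdaBoost": ada,
          "Naive Bayes": gauss, "Gradient Boosting": gbc}
for name, model in models.items():
    acc = accuracy_score(y_test, model.predict(x_test)) * 100
    print(name + ":", round(acc, 1), "%")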

Now apply predictions to my Discover Weekly playlist

I pull the Discover Weekly tracks into a dataframe and generate predictions from both the Gradient Boosting and Decision Tree models; the filtering loop below uses the Decision Tree's predictions.

In [19]:
data_dw = []
df_dw = tracks_to_df(username, dw_playlist_id, data_dw)

df_dw
Out[19]:
id name artist popularity
0 5GukxVkcnm6wyuw17nYevK Done - R3hab Remix Nikki Vianna 46
1 0ap4E0W70EcUjXqItoM74l Walk Away - 3LAU Deep Mix 3LAU 39
2 3ZuLTogqYwaL7DLqAP43t3 Growing Pains - Justin Caruso Remix Alessia Cara 29
3 2bcTdyGjBUR8fknw2GeH0z Gone (feat. Marvin Brooks) - Flyboy Remix Maan On The Moon 43
4 1MqBckcnN45W32KSSHnylW Sometimes DallasK 59
5 7mAYdYyUrkUSArOdSrC7rR Drew Barrymore LU2VYK 47
6 0xARbGHzPT1o5t1sFlmyO2 Grip - Jay Pryor Remix Seeb 54
7 4MuYNxE0Dgw0PFXz9Aquw6 Trampoline - BKAYE Remix SHAED 53
8 1UBDqRniw09drFPk7hgzOF All That She Wants Jordan Jay 44
9 3SoHRFBuaJ11rD7uxxG5Uq Off My Back Thoreau 30
10 5XNltFLO0aM3PHfKYRkuH2 Bad Habits dzill 30
11 6xUy203RnyyOfbqf96Nven Selfish Dimitri Vegas & Like Mike 69
12 0McMlTPzi5QjtrQUOCffaZ What About Us WizG 29
13 4ABdTWafMCXfATpILRuZFW IDWK DVBBS 63
14 5ek8fux89OY3S3E8DVNJ4i All The Way Up Glazy 38
15 4apGmexRZUxpTL6f8z42Qt Congratulations Carda 29
16 69HVwrOSZdcFPwJUnuTN1n Always on My Mind Nick Martin 35
17 328QDttJ2uYhtFyFmsiuI6 Lay Me Down Timeflies 50
18 16uC4HSJUTNcqaAVbHgWwk No One Has To Know (Adam Kahati X Deerock Remix) GoldFish 6
19 2xFSwFeA7UNM4Tlu2nX9Vz Wish You Well (feat. Trove) - Club Mix Famba 33
20 5DHp41RoSuq0Lv8x9AnQRg Into My Bed Harpoon 36
21 2ZrMXdHe6RfVWv1dlN52as All U Need Dizaro 39
22 2lmyHaEaM1ATZyiFXjI3jg Stay Here Zaxx 33
23 3NSjJE5P1RNWOkDAaUSgra White Flag Noah Neiman 35
24 5pYVOAWWO774uGCouME1wU Love Thang (feat. Ookay) YDG 35
25 0mT29GxaF6xs61GuAd6End Wild Like The Wind Deorro 47
26 7MLZc2C7guizPRLIq6DspS Turn It Up (COE Remix) Mike Parr 42
27 6abIrYu5OWE3z3F4p8MlyO Creep On Me (feat. French Montana & DJ Snake) ... GASHI 38
28 78j7afPUzFV0kAn2qNd1jZ Treat Me Like A Lady (feat. Jeanne Naylor) Francis Mercier 30
29 1AJG3n8tWJut45jn1o2cEH Getting Closer - Watson Remix NEW CITY 31
In [20]:
dw_ids=df_dw['id'].tolist()
In [21]:
dw_features = []
data_discover_weekly = features_to_df(dw_ids, dw_features)
In [22]:
pred_gbc = gbc.predict(data_discover_weekly[features])
pred_tree = tree.predict(data_discover_weekly[features])
In [23]:
likedSongs = 0
for i, prediction in enumerate(pred_tree):
    if prediction == 1:
        print("Song " + str(likedSongs + 1) + ": " + df_dw["name"][i] + ", By: " + df_dw["artist"][i])
        # add each predicted "like" to a new playlist
        sp.user_playlist_add_tracks(username, '2RARDnZLQGVPo0sXScDA8g', [df_dw['id'][i]])
        likedSongs = likedSongs + 1
Song 1: Done - R3hab Remix, By: Nikki Vianna
Song 2: Walk Away - 3LAU Deep Mix, By: 3LAU
Song 3: Gone (feat. Marvin Brooks) - Flyboy Remix, By: Maan On The Moon
Song 4: Sometimes, By: DallasK
Song 5: Grip - Jay Pryor Remix, By: Seeb
Song 6: All That She Wants, By: Jordan Jay
Song 7: What About Us, By: WizG
Song 8: IDWK, By: DVBBS
Song 9: Congratulations, By: Carda
Song 10: Always on My Mind, By: Nick Martin
Song 11: Wish You Well (feat. Trove) - Club Mix, By: Famba
Song 12: Into My Bed, By: Harpoon
Song 13: All U Need, By: Dizaro
Song 14: Stay Here, By: Zaxx
Song 15: Wild Like The Wind, By: Deorro
Song 16: Turn It Up (COE Remix), By: Mike Parr
Song 17: Treat Me Like A Lady (feat. Jeanne Naylor), By: Francis Mercier
Song 18: Getting Closer - Watson Remix, By: NEW CITY
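
(Side note: user_playlist_add_tracks takes a list of track ids, up to 100 per request, so the loop above could also collect the predicted likes first and add them in a single call:)

liked_ids = [df_dw['id'][i] for i, p in enumerate(pred_tree) if p == 1]
sp.user_playlist_add_tracks(username, '2RARDnZLQGVPo0sXScDA8g', liked_ids)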
In [24]:
from IPython.display import IFrame
IFrame("https://open.spotify.com/embed/playlist/2RARDnZLQGVPo0sXScDA8g", width=600, height=380)
Out[24]:
(embedded Spotify player showing the filtered playlist)

SO WHAT'S THE VERDICT?!

I think this process was reasonably successful. The Decision Tree algorithm cut the playlist down from 30 songs to 18, a 40% reduction! I suspect the playlist wasn't cut down further because my Discover Weekly playlist is already fairly well customized for me. As you can see, it's all dance/pop type music, which is the majority of what I listen to and what my other playlists are made of. Spotify just knows me too well.

Other classification algorithms

To be honest, I don't know what these do. I haven't familiarized myself with all of the different types of classification algorithms...

In [25]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
qda = QuadraticDiscriminantAnalysis()
qda.fit(x_train, y_train)
qda_pred = qda.predict(x_test)
score = accuracy_score(y_test, qda_pred)*100
print("Accuracy using Quadratic Discriminant Analysis: ", round(score, 1), "%")
Accuracy using Quadratic Discriminant Analysis:  65.4 %
In [26]:
from sklearn.svm import SVC
svc_lin = SVC(kernel="linear", C=0.025)
svc_lin.fit(x_train, y_train)
svc_pred = svc_lin.predict(x_test)
score = accuracy_score(y_test, svc_pred) * 100
print("Accuracy using Support Vector Machine: ", round(score, 1), "%")
Accuracy using Support Vector Machine:  62.2 %
In [27]:
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
gpc = GaussianProcessClassifier(1.0 * RBF(1.0), warm_start=True)
gpc.fit(x_train, y_train)
gpc_pred = gpc.predict(x_test)
score = accuracy_score(y_test, gpc_pred) * 100
print("Accuracy using Gaussian Process: ", round(score, 1), "%")
Accuracy using Gaussian Process:  47.2 %