Today I will look at how the features of songs I like differ from those of songs I dislike. Along the way to a nice, pretty graph of song features, this post shows how to extract features from songs contained in Spotify playlists. It makes use of the Spotipy Python library to access the Spotify Web API. What's nice about this library is that it contains easy helper methods to access different data from the API. For example, I use the user_playlist_tracks and audio_features methods.
Step 1: access the Spotify API
Because I am trying to get data from my own user account, the Spotipy methods I want to use require authentication. That means I need to generate an authorization token, which indicates that the user has granted my application (my code) permission to perform specific tasks. Basically, I need to send a request to myself to authenticate myself. This is all handled by Spotipy's util.prompt_for_user_token, which coordinates the authorization with the web browser (passes the client credentials, asks the user for access, returns the granted access via a token). Spotipy's documentation explains it a lot better than I can.
import spotipy
import spotipy.util as util
from config import client_id, client_secret, redirect_uri, username, good_playlist_id, bad_playlist_id
import numpy as np
import pandas as pd
scope = 'user-library-read playlist-read-private'
token = util.prompt_for_user_token(username, scope, client_id=client_id, client_secret=client_secret, redirect_uri=redirect_uri)
if token:
    sp = spotipy.Spotify(auth=token)
else:
    print("Can't get token for", username)
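The config module imported above is not shown in this post. A minimal sketch of what such a file might contain follows. Every value is a made-up placeholder, and reading the credentials from environment variables is only an assumption about how to keep them out of the notebook (the variable names follow Spotipy's own convention).

```python
# config.py (hypothetical sketch; all values are placeholders)
import os

# Credentials from the Spotify developer dashboard, read from environment
# variables so they never end up in the notebook itself.
client_id = os.environ.get("SPOTIPY_CLIENT_ID", "your-client-id")
client_secret = os.environ.get("SPOTIPY_CLIENT_SECRET", "your-client-secret")
redirect_uri = os.environ.get("SPOTIPY_REDIRECT_URI", "http://localhost:8888/callback")

# Account and playlist identifiers (placeholders, not real IDs)
username = "your-spotify-username"
good_playlist_id = "id-of-the-good-songs-playlist"
bad_playlist_id = "id-of-the-bad-songs-playlist"
```

The redirect URI has to match one registered for the application in the Spotify developer dashboard.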
Step 2: get list of track IDs from a playlist
In this case, I am grabbing track IDs from my "good songs" playlist and my "bad songs" playlist. The Spotify API only lets you grab 100 tracks at a time, so you have to use a while loop that keeps requesting pages for as long as results['next'] is set (it holds the URL of the next page, or None once there are no more pages).
def get_playlist_tracks(username, playlist_id):
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    return tracks
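To see how the paging loop behaves without calling the API, here is a small sketch with a stand-in client. FakeClient is entirely hypothetical: it mimics only the shape of the paged response (an 'items' list plus a 'next' field), and it stores an integer offset in 'next' where the real API returns a URL.

```python
class FakeClient:
    """Hypothetical stand-in for the Spotipy client, serving pages of items."""

    def __init__(self, n_items, page_size=100):
        self.items = [{"track": {"id": f"id{i}"}} for i in range(n_items)]
        self.page_size = page_size

    def _page(self, offset):
        end = offset + self.page_size
        # 'next' is truthy while more pages remain, otherwise None
        next_offset = end if end < len(self.items) else None
        return {"items": self.items[offset:end], "next": next_offset}

    def user_playlist_tracks(self, username, playlist_id):
        return self._page(0)

    def next(self, result):
        return self._page(result["next"])


def collect_all(client, username, playlist_id):
    # Same loop as get_playlist_tracks, with the client passed in explicitly
    results = client.user_playlist_tracks(username, playlist_id)
    tracks = list(results["items"])
    while results["next"]:
        results = client.next(results)
        tracks.extend(results["items"])
    return tracks


tracks = collect_all(FakeClient(250), "me", "some_playlist")
print(len(tracks))  # 250: three pages of 100, 100 and 50 items
```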
Here I convert the raw JSON data from the API request into a dataframe so I can make sure I have the good and bad playlists labelled correctly. There really was no reason to put it into a dataframe other than to see the data nicely in a table! I could have just collected the IDs into a list, because that is all I need to pull the features for each song.
def tracks_to_df(username, playlist_id, data_array):
    tracks = get_playlist_tracks(username, playlist_id)
    # one row per track: id, name, popularity
    for item in tracks:
        row = [item['track']['id'],
               item['track']['name'],
               item['track']['popularity']]
        data_array.append(row)
    data_df = pd.DataFrame(data=data_array, columns=['id', 'name', 'popularity'])
    return data_df
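One thing the function above does not guard against is a playlist item with no usable track ID. As far as I know, the API can return an item whose 'track' is empty (an unavailable track) or whose 'id' is None (a local file). A hedged sketch of a row builder that skips those follows. items_to_rows and the sample items are hypothetical, not part of the original notebook.

```python
def items_to_rows(items):
    """Build [id, name, popularity] rows, skipping items with no usable ID."""
    rows = []
    for item in items:
        track = item.get("track")
        if not track or not track.get("id"):
            continue  # nothing to look up features for
        rows.append([track["id"], track["name"], track.get("popularity")])
    return rows


# Made-up playlist items for illustration only
sample_items = [
    {"track": {"id": "abc", "name": "Song A", "popularity": 61}},
    {"track": None},
    {"track": {"id": None, "name": "Local file", "popularity": 0}},
    {"track": {"id": "def", "name": "Song B", "popularity": 12}},
]

rows = items_to_rows(sample_items)
print([row[0] for row in rows])  # ['abc', 'def']
```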
Pull good tracks
data_good = []
df_good = tracks_to_df(username, good_playlist_id, data_good)
df_good.head()
Pull bad tracks
data_bad = []
df_bad = tracks_to_df(username, bad_playlist_id, data_bad)
df_bad.head()
Save the track IDs
Because the audio_features method only lets you pull features for 50 tracks at a time, I wrote a helper to split the list of IDs into chunks of 50.
def chunks(mylist, chunk_size):
    # Step through mylist in strides of chunk_size...
    for i in range(0, len(mylist), chunk_size):
        # ...and yield the slice of up to chunk_size items starting at i
        yield mylist[i:i+chunk_size]
good_ids=df_good['id'].tolist()
bad_ids=df_bad['id'].tolist()
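As a quick check of what the chunking helper produces, here is a self-contained sketch run on made-up IDs (fake_ids is hypothetical; the real lists come from the dataframes above).

```python
def chunks(mylist, chunk_size):
    # Yield successive slices of at most chunk_size items
    for i in range(0, len(mylist), chunk_size):
        yield mylist[i:i + chunk_size]


fake_ids = [f"track_{n}" for n in range(120)]  # hypothetical IDs
batch_sizes = [len(batch) for batch in chunks(fake_ids, 50)]
print(batch_sizes)  # [50, 50, 20]
```

The last batch is simply shorter, so no IDs are dropped or padded.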
Step 3: get track features
def features_to_df(ids, data_array):
    # Fetch audio features in batches of 50 IDs and collect each batch in data_array
    for ids_batch in chunks(ids, 50):
        features_temp = sp.audio_features(tracks=ids_batch)
        data_array.append(features_temp)
    columns = sorted(data_array[0][0].keys())
    # Build one dataframe per batch, then concatenate them
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
    frames = [pd.DataFrame(batch, columns=columns) for batch in data_array]
    df_features = pd.concat(frames, ignore_index=True)
    return df_features
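The batches collected in data_array form a list of lists, and to my knowledge the API may return None in place of a feature dict when it has no audio analysis for a track. A minimal sketch of flattening the batches while dropping such entries follows. flatten_features and the sample batches are hypothetical.

```python
from itertools import chain


def flatten_features(batches):
    """Flatten a list of batches into one list of dicts, dropping None entries."""
    return [feat for feat in chain.from_iterable(batches) if feat is not None]


# Made-up batches for illustration only
sample_batches = [
    [{"id": "abc", "energy": 0.81}, None],
    [{"id": "def", "energy": 0.42}],
]

flat = flatten_features(sample_batches)
print([feat["id"] for feat in flat])  # ['abc', 'def']
```

A flat list of dicts like this can be passed straight to pd.DataFrame, which gives one row per track.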
Pull good features
good_features = []
df_features_good = features_to_df(good_ids, good_features)
df_features_good.head()
Pull bad features
bad_features = []
df_features_bad = features_to_df(bad_ids, bad_features)
df_features_bad.head()
Step 4: plot the distribution of features for good and bad tracks
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import warnings
warnings.filterwarnings('ignore')
sns.set_palette("tab10")
sns.set_style("ticks")
features = ['acousticness', 'danceability', 'energy',
            'instrumentalness', 'liveness', 'loudness',
            'speechiness', 'tempo', 'valence']
f, axes = plt.subplots(3, 3, figsize=(10,10))
sns.despine(fig=f)
# one subplot per feature: liked songs in the default colour, disliked songs in red
# (sns.distplot is deprecated; sns.kdeplot, seaborn >= 0.11, draws the same density curve as distplot(hist=False))
for ax, feature in zip(axes.flat, features):
    sns.kdeplot(x=df_features_good[feature], ax=ax, label='Like')
    sns.kdeplot(x=df_features_bad[feature], ax=ax, color='red', label='Dislike')
plt.subplots_adjust(hspace = 0.3)
# put legend on one subplot
axes[0,2].legend(['Like', 'Dislike'], bbox_to_anchor=(0.9, 1), loc=2, borderaxespad=0.)
%config InlineBackend.figure_format = 'svg'
Interpretation
I've plotted the features of the good and bad songs as overlaid distribution charts to see if there are any clear differences. Immediately I can see the most variation in danceability, energy, tempo and valence. This variation across my good and bad playlists is a good sign for when I feed these features into a classification algorithm to create my own Discover Weekly playlist. The songs I like tend to have higher danceability, energy and tempo, but lower valence. Spotify's API documentation gives some nice definitions of these features:
- Danceability: "...how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity."
- Energy: "...a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy."
- Tempo: pace of the track in beats per minute (BPM)
- Valence: "...musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric)..."
The other features do not seem as meaningful. There is virtually no variation in liveness or loudness across good and bad songs. This kind of makes sense, since most of the songs in both of these playlists are pop/electronic songs. There are only small differences in the acousticness and speechiness features. This also makes sense, because none of these songs have any "spoken word" attributes like podcasts would have (speechiness) and they don't have acoustic guitar (except for maybe some of the P!nk songs I threw into the "bad" category).
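Reading differences off overlaid curves is a judgment call, so a simple numeric check can back it up: compare the mean of each feature across the two playlists. The sketch below uses made-up values, NOT my real playlist data, and mean_gap is a hypothetical helper.

```python
from statistics import mean

# Hypothetical feature values for illustration only
liked = [
    {"danceability": 0.80, "valence": 0.30},
    {"danceability": 0.70, "valence": 0.40},
]
disliked = [
    {"danceability": 0.50, "valence": 0.60},
    {"danceability": 0.60, "valence": 0.70},
]


def mean_gap(liked_rows, disliked_rows, feature):
    """Mean of a feature over liked songs minus its mean over disliked songs."""
    liked_mean = mean(row[feature] for row in liked_rows)
    disliked_mean = mean(row[feature] for row in disliked_rows)
    return liked_mean - disliked_mean


gaps = {f: round(mean_gap(liked, disliked, f), 2) for f in ("danceability", "valence")}
print(gaps)  # positive gap: higher for liked songs; negative: lower
```

With the dataframes from this post, the equivalent would be df_features_good[cols].mean() - df_features_bad[cols].mean() for a list of feature columns cols. Keep in mind that tempo and loudness are on different scales from the other features, so raw gaps are not directly comparable across features.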
The F u t u r e
My next post will use the song features from my good and bad playlists to create a customized Discover Weekly playlist (or just filter the Discover Weekly playlist that Spotify already generates). Similar to this post, I will use different classification algorithms such as Decision Trees, K-Nearest Neighbors, Adaptive Boosting or Gradient Boosting to predict which songs I will like.