Clustering & Forecasting | Spotify Songs Audio Features

Using Spotify Web API with Python Spotipy, Scikit-Learn, Facebook Prophet and Plotly.


  1. Introduction
  2. Getting the data
  3. Clustering the audio features
  4. Forecasting the streams
  5. Conclusions


Introduction

As a kid I dreamed of being a great music producer and of playing drums and keyboards on stages around the globe. I’ve been playing instruments and making music for as long as I can recall. I played a drum kit made of cooking pots, with pot lids as cymbals, until I got my first real kit at twelve, and I have never stopped composing and playing since. I also studied drums privately for about 10 years, and I hold a music production diploma from Buenos Aires Music School.

Fortunately (or unfortunately, depending on the point of view), life had another plan and led me down a different path: Economics & Data Science. I have been developing as an AI Data Engineer, and I’ve worked as a Data Engineering Technical Lead for consulting services companies. Many times I felt frustrated at having postponed that dream because of a BA degree, family mandates or social prejudices, and I struggled against my deepest instincts and passions. But some time ago I found a way to combine my love for economics and my passion for music through data science. I guess I’m not as good a data scientist as I am a musician, but I keep on learning (it’s a vast world: technologies and techniques move forward faster than we can learn them).

Following on from that, here’s my first contribution. In this article we will explore the Spotify global weekly top 200 in order to forecast streams, based on their behavior between 2017 and 2020 and on clusters built from the songs’ audio features.

To do this, we will download the list of songs from the Spotify website; then we will connect to the Spotify API and get the attributes for those songs. Once we have the audio features, we will fit a clustering model to group songs that share similar attributes. Four clusters will emerge, which together comprise the most listened songs from the Spotify weekly ranking.

Next, we will take the historical streams between 2017 and 2020 for each cluster and forecast the 2021–2023 streams with the Facebook Prophet framework.

Lastly, the entire project was developed in Python using libraries such as spotipy to get the data, scikit-learn to perform the clustering, fbprophet to forecast the streams for each cluster, and plotly to visualize the data and the results. The Spotify variables we will use to get the data and build our model are the following:

  • TracksArtist
  • Streams
  • Audio Features Object

Getting the data

The first step is to get the global weekly top 200 songs from Spotify. There may be ways to write specific queries against the search function of the Spotify API and look for lists, but I did not find any way to make the API return a list of the most streamed songs (at least not a trustworthy one). Thus, I downloaded several CSV files with the historical data of the global weekly top 200 songs and wrote a Python script that appends all the files into a single dataset. Since this step is already done, you can directly download the dataset from my Kaggle channel following this link.
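
The original append script is not included in this article, so the following is only a minimal sketch of that step. The function name `append_weekly_charts` and the extra `source_file` column are assumptions made for illustration.

```python
import glob
import os

import pandas as pd


def append_weekly_charts(folder, pattern="*.csv"):
    """Concatenate every CSV in `folder` that matches `pattern` into one dataframe."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    if not paths:
        raise FileNotFoundError(f"No files matching {pattern!r} in {folder!r}")
    frames = []
    for path in paths:
        frame = pd.read_csv(path)
        # Keep the file name so each row can be traced back to its weekly file.
        frame["source_file"] = os.path.basename(path)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

Calling `append_weekly_charts("path/to/weekly_csvs")` would return a single dataframe with one row per chart entry; the folder path is a placeholder.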

We will also need the audio features for all the songs in the ranking. We access them by connecting to the Spotify API and, through the Python spotipy library, looking up the attributes by the track ids from the list. If you want to go straight to the point, you can download the final dataset from the following link.

Nevertheless, I think it is useful to share the code we use to get the song attributes and the audio features, in case someone wants to work with different data or explore the spotipy functions in more depth.

To explore the Spotify API we need a Spotify account; if you already have one, log in with it on the Spotify developer webpage. Once you are in, open the dashboard option in the top bar menu and click on create an app to get the app credentials, which give access to the data through queries. For further information about what can be done with the Spotify API, I recommend reading the documentation at Spotify Docs.

After creating the API app, we are going to make the connection with the following code.

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

cid = "YOUR_CLIENT_ID"  # Your Spotify Client ID
secret = "YOUR_CLIENT_SECRET"  # Your Spotify Client Secret
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
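
As a possible alternative to pasting the credentials into the script, they can be read from environment variables. The sketch below assumes the variables `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` (the names spotipy itself looks for); the helper names `load_spotify_credentials` and `build_client` are hypothetical.

```python
import os


def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables."""
    cid = env.get("SPOTIPY_CLIENT_ID")
    secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not cid or not secret:
        raise RuntimeError("Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET first")
    return cid, secret


def build_client(env=os.environ):
    """Create a spotipy client; spotipy is imported here so the helper above works without it."""
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    cid, secret = load_spotify_credentials(env)
    manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
    return spotipy.Spotify(client_credentials_manager=manager)
```

With both variables exported in the shell, `sp = build_client()` would give the same kind of client as the code above, without the secret ever appearing in the source file.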

We create a dataframe by reading the file shared previously (Global Weekly Spotify Top 200 2017–2020), put the track ids in a separate list, and then look up some attributes of those tracks using the tracks function of the spotipy library.

import os
import numpy as np
import pandas as pd

directory = "."  # Placeholder: your root path directory
top_200_tracks = pd.read_csv(os.path.join(directory, 'global_weekly_top_200_2017to2020.csv'))

We write the code that looks up the song attributes.

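artist_name = []
album_id = []
album_name = []
track_name = []
popularity = []
track_id = []
track_uri = []
date_release = []
track_ids_list = list(top_200_tracks['track_id'])
for i in range(0, len(track_ids_list), 50):
    data = track_ids_list[i:i + 50]
    tracks_df = sp.tracks(data)
    for idx, track in enumerate(tracks_df['tracks']):
        # Loop body missing in the source; reconstructed from the list names (assumption)
        if track is None:
            continue
        artist_name.append(track['artists'][0]['name'])
        album_id.append(track['album']['id'])
        album_name.append(track['album']['name'])
        track_name.append(track['name'])
        popularity.append(track['popularity'])
        track_id.append(track['id'])
        track_uri.append(track['uri'])
        date_release.append(track['album']['release_date'])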
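
The loop above sends the ids in slices of 50, and the audio features request later in this section uses slices of 100. Since a song that stays in the chart appears once per week, the id list contains many repeats. A small sketch of how the ids could be deduplicated and batched before querying; the helper names `unique_in_order` and `batches` and the sample ids are made up for illustration.

```python
def unique_in_order(items):
    """Drop repeated items while keeping the first occurrence of each."""
    return list(dict.fromkeys(items))


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical track ids: a song that stays in the chart appears once per week.
weekly_ids = ["a", "b", "a", "c", "b", "d", "e"]
unique_ids = unique_in_order(weekly_ids)
id_batches = list(batches(unique_ids, size=2))
print(id_batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Querying only unique ids reduces the number of API calls and avoids repeated rows in the attributes dataframe.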

Now, from the different lists of track attributes, we create a dataframe that combines them.

tracks_dataframe = pd.DataFrame({'track_id' : track_id, 'track_uri' : track_uri, 'artist_name' : artist_name, 'album_id': album_id, 'album_name': album_name, 'track_name' : track_name, 'popularity' : popularity, 'date_release':date_release})

The next step is to join both dataframes: the one with the global weekly top 200 and the one with the attributes.

data_df = tracks_dataframe.merge(top_200_tracks, how='inner', on='track_id')
data_df = data_df[['track_id', 'track_uri', 'artist_name', 'album_id', 'album_name', 'track_name', 'popularity', 'date_release', 'Position', 'Streams', 'URL', 'start_week', 'end_week']]
data_df = data_df.drop_duplicates()
data_df = data_df.reset_index(drop=True)
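
To see why drop_duplicates matters after this join, here is a toy illustration with made-up values. When the attributes dataframe has one row per chart appearance, the inner join multiplies the rows of every repeated track; deduplicating the attributes first is one possible way to keep the join small.

```python
import pandas as pd

# Toy stand-ins for the two dataframes; values are made up for illustration.
chart = pd.DataFrame({
    "track_id": ["a", "a", "b"],
    "start_week": ["2017-01-06", "2017-01-13", "2017-01-06"],
    "Streams": [100, 90, 80],
})
attributes = pd.DataFrame({
    "track_id": ["a", "a", "b"],      # one row per chart appearance, so "a" repeats
    "track_name": ["Song A", "Song A", "Song B"],
})

naive = attributes.merge(chart, how="inner", on="track_id")
deduplicated = attributes.drop_duplicates("track_id").merge(
    chart, how="inner", on="track_id", validate="one_to_many"
)

print(len(naive))         # 5 rows: the two "a" rows match each other twice
print(len(deduplicated))  # 3 rows: one per chart entry
```

The `validate="one_to_many"` argument makes pandas raise an error if the attributes side still contains a repeated track id.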

To get the audio attributes we repeat the same steps, changing the spotipy function to audio_features(), which looks up the audio attributes for our list of track ids.

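track_id = []
acousticness = []
analysis_url = []
danceability = []
duration_ms = []
energy = []
instrumentalness = []
key = []
liveness = []
loudness = []
mode = []
speechiness = []
tempo = []
time_signature = []
track_href = []
valence = []
feature_type = []
track_ids_list = list(data_df['track_id'])
for i in range(0, len(track_ids_list), 100):
    data = track_ids_list[i:i + 100]
    tracks_features = sp.audio_features(data)
    # Loop body missing in the source; reconstructed from the list names (assumption)
    for idx, track in enumerate(tracks_features):
        if track is None:
            continue
        track_id.append(track['id'])
        acousticness.append(track['acousticness'])
        analysis_url.append(track['analysis_url'])
        danceability.append(track['danceability'])
        duration_ms.append(track['duration_ms'])
        energy.append(track['energy'])
        instrumentalness.append(track['instrumentalness'])
        key.append(track['key'])
        liveness.append(track['liveness'])
        loudness.append(track['loudness'])
        mode.append(track['mode'])
        speechiness.append(track['speechiness'])
        tempo.append(track['tempo'])
        time_signature.append(track['time_signature'])
        track_href.append(track['track_href'])
        valence.append(track['valence'])
        feature_type.append(track['type'])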

Again, we create a dataframe merging the lists of audio features. Then, we join the dataframes.

tracks_features_df = pd.DataFrame({'track_id' : track_id, 'track_href':track_href, 'analysis_url' : analysis_url,  'acousticness' : acousticness, 'danceability' : danceability, 'duration_ms' : duration_ms, 'energy':energy, 'instrumentalness':instrumentalness, 'key':key, 'liveness':liveness, 'loudness':loudness, 'mode':mode, 'speechiness':speechiness, 'tempo':tempo, 'time_signature':time_signature, 'valence':valence})
data_featured = data_df.merge(tracks_features_df, how='inner', on='track_id')
data_featured = data_featured.drop_duplicates()
data_featured = data_featured.reset_index(drop=True)

Finally, we have all the data in a single dataframe, ready for any analysis we want. In the next section we are going to cluster the tracks by their audio features, and then we are going to forecast the streams for each group.

Clustering the audio features

First, we select the columns we want to cluster by, those that are relevant to the grouping criteria. In this case we select acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness and valence. If you want to understand what each feature means, I recommend reading the documentation in the Spotify Object Index Docs.

columns = data_featured.columns[np.r_[15:17, 18:20, 21:23, 24:25, 27]]
fit_data = data_featured[columns]
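
The positional indices above depend on the exact column order of the dataframe. A possible alternative is to select the same eight features by name, so the selection keeps working if a column is added or reordered. The constant `AUDIO_FEATURES` and the helper `select_audio_features` are names introduced here for illustration.

```python
import pandas as pd

AUDIO_FEATURES = [
    "acousticness", "danceability", "energy", "instrumentalness",
    "liveness", "loudness", "speechiness", "valence",
]


def select_audio_features(frame, names=AUDIO_FEATURES):
    """Return only the clustering columns, failing loudly if one is missing."""
    missing = [name for name in names if name not in frame.columns]
    if missing:
        raise KeyError(f"Missing audio feature columns: {missing}")
    return frame[list(names)]
```

Under this sketch, `fit_data = select_audio_features(data_featured)` would give the same eight columns as the index-based selection.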

In a further analysis, it could be worth assessing how the key (pitch class notation) and the mode (major or minor) in which the songs are composed relate to positiveness, acousticness and energy. One may assume that low energy is associated with minor modes and that the more positive songs were composed in major modes, but that is a presumption that should be confirmed or rejected.

Before applying a clustering method, I always prefer to start by normalizing the data. Displaying the dataset shows that the audio features have different value ranges. If we do not normalize them, the features will carry unbalanced weights in the clustering criteria: the audio features with wider ranges will contribute disproportionately to the cluster separation.

From the scikit-learn library we import the preprocessing module and use it to normalize the audio attributes. Since the result is a numpy array, we need to create a pandas dataframe and assign the original column names.

from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler()
features_normal = scaler.fit_transform(fit_data)
features_normal = pd.DataFrame(features_normal)
features_normal.columns = fit_data.columns
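
To make the normalization concrete, here is a small worked example on made-up values. Min-max scaling maps each column with (x - min) / (max - min), and the sketch checks that the manual formula agrees with MinMaxScaler.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values: loudness is in decibels, energy is already between 0 and 1.
toy = pd.DataFrame({
    "loudness": [-12.0, -6.0, -3.0],
    "energy": [0.2, 0.5, 0.8],
})

# Min-max scaling by hand: (x - min) / (max - min), column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(manual["loudness"].round(3).tolist())  # [0.0, 0.667, 1.0]
print(np.allclose(manual.values, scaled.values))  # True
```

After scaling, a 9 dB spread in loudness and a 0.6 spread in energy both cover the same 0 to 1 range, which is what balances their weight in the distance computation.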

Displaying the normalized data shows that the values for every audio attribute now share the same scale, between 0 and 1. Every audio feature will therefore carry a better balanced weight in the clustering.

Now that the audio features are normalized, we can check the correlations between them to learn more about their relationships and behavior. To do this, we build a pandas dataframe with the correlations and then plot the correlation matrix using figure_factory from the plotly library.
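
The plotting code for this step is not included in the article. The following is a minimal sketch of how it could look, assuming Pearson correlations and plotly's create_annotated_heatmap; the helper names, the colorscale and the title are illustrative choices, not the original ones.

```python
import pandas as pd


def correlation_table(features, decimals=2):
    """Pearson correlations between all feature columns, rounded for display."""
    return features.corr().round(decimals)


def plot_correlation_matrix(features):
    """Draw an annotated heatmap; plotly is imported here so the table helper works without it."""
    import plotly.figure_factory as ff

    corr = correlation_table(features)
    fig = ff.create_annotated_heatmap(
        z=corr.values,
        x=list(corr.columns),
        y=list(corr.index),
        annotation_text=corr.values.astype(str),
        colorscale="RdBu",
        showscale=True,
    )
    fig.update_layout(title_text="Audio feature correlations")
    return fig
```

Under these assumptions, `plot_correlation_matrix(features_normal).show()` would display the matrix for the normalized features.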


As can be seen in the correlation matrix, it is very reasonable that energy has a considerable negative correlation with acousticness, and that acousticness has a negative correlation with loudness. Conversely, energy has a highly positive correlation with loudness and valence, which also makes sense.

The main question that arises when clustering data is how many clusters we should group the observations into. The most common way to decide is the elbow method. We run KMeans from scikit-learn for a range of 1 to 10 clusters, calculate the inertia for each number of clusters, and then plot the number of clusters against the inertia, which is the sum of squared distances of the data points from their cluster’s center. The analysis consists of increasing the number of clusters and looking for a clear point where the decrease in inertia starts to level off; that point is where the elbow in the line chart is most evident.

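from sklearn.cluster import KMeans
import plotly.graph_objects as go

inertia = []
K = np.arange(1, 11)  # 1 to 10 clusters
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(features_normal)
    inertia.append(kmeanModel.inertia_)  # sum of squared distances to the closest center

# The plotting code is missing in the source; a minimal Plotly line chart is assumed here
fig = go.Figure(go.Scatter(x=K, y=inertia, mode='lines+markers'))
fig.update_layout(xaxis_title='Number of clusters', yaxis_title='Inertia')
fig.show()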

The chart shows a pseudo elbow between 3 and 5 clusters. For now we are going to choose 4, but you may replicate the whole process for each number of clusters and compare the differences.
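
To make the inertia measure less abstract, the sketch below recomputes it by hand on synthetic data (three artificial blobs, not the Spotify features) and compares it with the value scikit-learn reports. The random seed and blob positions are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 50 two-dimensional points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([center + rng.normal(scale=0.3, size=(50, 2)) for center in centers])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Inertia by hand: squared distance from every point to its assigned center, summed.
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = float(((points - assigned_centers) ** 2).sum())
print(np.isclose(manual_inertia, model.inertia_))  # True

# Inertia for 1 to 5 clusters: on this data the drop flattens after 3, the true number of blobs.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(points).inertia_
    for k in range(1, 6)
]
print([round(value, 1) for value in inertias])
```

On real audio features the groups overlap much more than these blobs, which is why the elbow in our chart is less sharp.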

So now we fit the audio features with KMeans for 4 clusters, and then we create a column in our main dataset that assigns a cluster label to every song based on its audio features.

from sklearn.cluster import KMeans

# Compute k-means clustering (the fitted model gets its own name so the KMeans class is not overwritten)
kmeans = KMeans(n_clusters=4).fit(features_normal)
data_featured['cluster'] = kmeans.labels_
tracks_clustered = data_featured.sort_values('start_week').reset_index(drop=True)

It is worth highlighting that each cluster will comprise songs from several different music genres, because the clustering is based on the songs’ audio attributes and not on genre. Thus we may find songs that share similar attributes across genres even though they were tagged with different genres.

Having said that, we are going to display a boxplot that groups the audio features by cluster label, to understand the result of the clustering.

  • Cluster 0: this cluster comprises low-acousticness songs with high danceability and energy, and it is the group with the highest level of musical positiveness. Inside this cluster we find songs like Fake Love by Drake, 24K Magic by Bruno Mars, Cold Water by Major Lazer or The Greatest by Sia.
  • Cluster 1: this cluster is pretty similar to the previous one in danceability, but it contains songs with higher acousticness and slightly lower energy (which is quite understandable given the negative correlation between those audio features). Songs like Closer by The Chainsmokers, Rockabye by Clean Bandit, Don’t Wanna Know by Maroon 5 or I Feel It Coming by The Weeknd can be found here.
  • Cluster 2: although danceability and energy are moderate here, this cluster has lower levels of acousticness and valence. It comprises songs like Starboy by The Weeknd, Let Me Love You by DJ Snake, Black Beatles by Rae Sremmurd and All I Want for Christmas Is You by Mariah Carey.
  • Cluster 3: this cluster contains the songs with the highest average acousticness, with low positiveness and medium levels of energy and danceability. Some of the tracks we find here are Say You Won’t Let Go by James Arthur, We Don’t Talk Anymore by Charlie Puth, It’s Beginning to Look a Lot like Christmas by Michael Bublé and Let It Snow! Let It Snow! Let It Snow! by Frank Sinatra.

It is worth highlighting that, as the boxplot shows once the outliers are left out of the analysis, the clusters have a similar audio configuration except for acousticness and valence, which are the audio features with the most variance across the clusters.
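
The comparison read from the boxplot can also be approximated numerically. The sketch below summarizes each cluster by the median of its audio features, which is less sensitive to outliers than the mean; the helper name `cluster_profile` and the `n_rows` column are introduced here for illustration.

```python
import pandas as pd


def cluster_profile(frame, features, label_column="cluster"):
    """Median of each audio feature per cluster, plus the number of rows in each cluster."""
    profile = frame.groupby(label_column)[features].median()
    profile["n_rows"] = frame.groupby(label_column).size()
    return profile
```

For example, `cluster_profile(tracks_clustered, ["acousticness", "valence", "energy", "danceability"])` would return one row per cluster, making the differences in acousticness and valence easy to compare side by side.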

Lastly, to give a high-level view of the historical series of streams for each cluster, we plot the sum of the song streams from January 2017 to January 2021. On average, cluster 2 had the highest level of streaming along the series, reaching a peak of approximately 946M streams in June 2018, with a strong decrease at the end of 2020. In contrast, cluster 3 has the lowest number of streams over the period, with its lowest values registered in the middle of 2017. Cluster 0 oscillates but has a growing trend, which lets it reach the streaming levels of cluster 2 at the end of 2020.

Forecasting the streams

We get to the last section, where we are going to forecast the stream trend for each cluster.

Regarding Spotify’s definition of a stream: a song stream is counted when someone listens for over 30 seconds.

First, we need to generate a dataframe with the structure that fbprophet takes as input: one column with dates (ds) and another with the variable to forecast (y). In this case we also add a column with the cluster label, but only as a reference to build our models.

forcast_data = tracks_clustered.groupby(['start_week','cluster'])['Streams'].sum().round(2).reset_index()
forcast_data.columns = ['ds','cluster','y']
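
Here is the same reshaping on a few made-up rows, to show what the resulting ds/y structure looks like. Converting ds to datetime is an extra step added in this sketch; the column values are invented for illustration.

```python
import pandas as pd

# Made-up weekly rows for two clusters.
toy = pd.DataFrame({
    "start_week": ["2017-01-06", "2017-01-06", "2017-01-06", "2017-01-13"],
    "cluster": [0, 0, 1, 0],
    "Streams": [100, 50, 70, 120],
})

prophet_input = (
    toy.groupby(["start_week", "cluster"])["Streams"].sum()
    .reset_index()
    .rename(columns={"start_week": "ds", "Streams": "y"})
)
prophet_input["ds"] = pd.to_datetime(prophet_input["ds"])

# One cluster at a time, keeping only the two columns the model needs.
cluster_0 = prophet_input.loc[prophet_input["cluster"] == 0, ["ds", "y"]]
print(cluster_0["y"].tolist())  # [150, 120]
```

Each cluster ends up with one row per week, where y is the sum of the streams of all its songs in that week.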

So we build a for loop that iterates over the clusters, filters the streams for each cluster and forecasts the following period; the horizon is specified in days through the periods argument of make_future_dataframe. Note that periods=365 with the default daily frequency extends each series by one year, so covering the whole 2021–2023 period requires a larger value. Lastly, we concatenate all the results in a single dataframe called fcst_all.

from fbprophet import Prophet

clusters = forcast_data.cluster.unique()
fcst_all = pd.DataFrame()
for cluster in sorted(clusters):
    print('Fitting cluster n° ' + str(cluster) + ' ...')
    cluster_data = forcast_data.loc[forcast_data.cluster == cluster]

    m = Prophet().fit(cluster_data)
    future = m.make_future_dataframe(periods=365)

    print('Forecasting data for cluster n° ' + str(cluster) + ' ...')

    forecast = m.predict(future)
    forecast['cluster'] = cluster
    fcst_all = pd.concat((fcst_all, forecast))
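
Since the input series is weekly, one possible variation is to forecast in weekly steps and derive the number of periods from the target end date instead of hard-coding it. This is only a sketch under stated assumptions: the helper names and the default end date of 2023-12-31 are choices made here, not part of the original code. Newer releases of the library are published as prophet rather than fbprophet, so the import may need adjusting.

```python
import pandas as pd


def weekly_periods_until(last_observed, horizon_end):
    """Number of 7-day steps after `last_observed` that fall on or before `horizon_end`."""
    last_observed = pd.Timestamp(last_observed)
    horizon_end = pd.Timestamp(horizon_end)
    if horizon_end <= last_observed:
        return 0
    return (horizon_end - last_observed).days // 7


def forecast_weekly(cluster_data, horizon_end="2023-12-31"):
    """Fit Prophet on a ds/y frame and predict weekly up to `horizon_end`.

    fbprophet is imported here so the helper above can be used without it.
    """
    from fbprophet import Prophet

    model = Prophet().fit(cluster_data)
    periods = weekly_periods_until(cluster_data["ds"].max(), horizon_end)
    # A 7-day step continues from the last observed date, whatever its weekday.
    future = model.make_future_dataframe(periods=periods, freq="7D")
    return model.predict(future)
```

For a series whose last week starts on 2021-01-01, `weekly_periods_until("2021-01-01", "2023-12-31")` gives 156 weekly steps.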

Once we have the forecast for every cluster, we are ready to plot the results with plotly and analyze the trends and predictions.

As can be seen in the chart, although cluster 2 has historically had the highest streaming numbers, its trend for 2021–2023 stabilizes below its maximum values. At the same time, cluster 0 displays the strongest upward trend, which suggests it will overtake the average streaming levels of cluster 2 in the same period. Cluster 3 keeps its historical growing trend along the forecast period, but at lower streaming volumes than the rest of the clusters. Cluster 1 shows a stable performance during 2021–2023, with some peaks along its oscillating path.


Conclusions

This post is a high-level approach to music market research, so it can be expanded into a deeper and more extensive investigation by building more complex models and adding attributes to the songs and audio features.

To summarize, we first downloaded the historical streams of the global weekly top songs for the period 2017–2020 and connected to the Spotify API through the Python spotipy library to get the attributes and audio features of those songs. Then we clustered the tracks based on their audio features using KMeans from scikit-learn, so we could get groups of songs that share a similar audio configuration. Lastly, we forecast the level of streams of each cluster for the period 2021–2023 using the fbprophet Python library.

Keeping an eye on the music market, and based on this analysis, we may conclude that songs composed with the same audio attributes as cluster 0 will gain more market share than the others. These are songs with low acousticness, high danceability and energy, and melodies and harmonies that convey a comfortable positiveness. Overall, cluster 0 and cluster 2 share most of these attributes, so it may not be a surprise that both clusters have the highest stream levels and take the biggest piece of the market share.

I’ve been developing as a BI professional applying DS tools and techniques, performing Data Structuring & Modeling and producing Data Analytics & Visualizations.