Welcome to the Spotify Rehydrator!

The Spotify Rehydrator was created to provide a simple way to generate full datasets of track features from user-owned Spotify data. It relies on the excellent Spotipy library and brings together a series of API calls in a convenient way that can manage data from multiple different people, as would be common in a research study. It can also be used by individuals who are curious to learn more about their own data! The idea of a rehydrator was inspired by similar work being done to enable sharing of Twitter datasets for research purposes.

Before you use the rehydrator, please read the Disclaimers to understand the limitations of the search strategy used.

User Guide

The Spotify Rehydrator primarily operates through the Rehydrator class. The required inputs for this class are an input folder, an output folder and a Client ID and Client Secret from the Spotify Developer Portal. These are used for authenticating the API calls. You can then call the run() method.

Note

To request developer credentials, go to Spotify’s developer portal. You will need to ‘create an app’, which will have credentials associated with it. Your app dashboard will give you access to your Client ID and Client Secret.

Install the Spotify Rehydrator using pip:

pip install spotifyrehydrator

Assuming you have set your Client ID and Client Secret as environment variables, you could run the Rehydrator like this:

import os
import pathlib  # needed for pathlib.Path below

from spotifyrehydrator import Rehydrator

Rehydrator(
    input_path=os.path.join(pathlib.Path(__file__).parent.absolute(), "input"),
    output_path=os.path.join(pathlib.Path(__file__).parent.absolute(), "output"),
    client_id=os.getenv("SPOTIFY_CLIENT_ID"),
    client_secret=os.getenv("SPOTIFY_CLIENT_SECRET"),
).run(return_all=True)

By default, the .run() method outputs the following columns: the Spotify track ID of the returned track, the name of the artist of the returned track, and the name of the returned track. These are joined with the searched artist and track, the person ID where relevant, and the time metadata from the original .json file. There are three optional arguments:

  • artist_info = True returns the popularity of the returned artist and a list of genres attributed to that artist, as provided by the Artists API endpoint.

  • audio_features = True returns a column for each of the audio features provided by the Tracks API.

  • return_all = True returns both of the above.

Be aware that extra arguments involve more API calls and so may take longer.

Expected formats

Streaming History JSON

This package is designed to work with the files named StreamingHistory.json that are sent to users as part of their data package if they request their own Spotify data. The file will contain up to the past year of the user’s listening data.

This data should be in one or more files with a list of JSON objects that look like this:

{
"endTime" : "2019-01-19 17:01",
"artistName" : "An Artist",
"trackName" : "A Track Name",
"msPlayed" : 19807
}
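
The following is a minimal sketch of reading one such file with the standard library and checking that every record carries the four keys shown above. The helper name load_streaming_history and the REQUIRED_KEYS constant are hypothetical and are not part of the package.

```python
import json
import os
import tempfile

# The four keys shown in the example record above.
REQUIRED_KEYS = {"endTime", "artistName", "trackName", "msPlayed"}


def load_streaming_history(path):
    """Hypothetical helper: read one file and check each record's keys."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    for index, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"record {index} is missing {sorted(missing)}")
    return records


# Usage: write a one-record sample file, then read it back.
sample = [
    {
        "endTime": "2019-01-19 17:01",
        "artistName": "An Artist",
        "trackName": "A Track Name",
        "msPlayed": 19807,
    }
]
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "StreamingHistory0.json")
    with open(sample_path, "w", encoding="utf-8") as handle:
        json.dump(sample, handle)
    loaded = load_streaming_history(sample_path)

print(len(loaded), loaded[0]["trackName"])
```

Checking the keys up front makes a malformed file fail early, before any API calls are spent on it.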

Input folder

The input folder should contain a series of Streaming History JSON files. If you have files belonging to multiple individuals then the package expects the unique identifier for each person to be the prefix, followed by an underscore. For example:

# input folder
person001_StreamingHistory0.json
person001_StreamingHistory1.json
person002_StreamingHistory0.json

This would result in two rehydrated files being saved to the output folder:

# output folder
person001-rehydrated.tsv
person002-rehydrated.tsv

You could also input several files without any underscores to represent individuals. These would all be combined and saved in one output file.
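
The naming convention above can be illustrated with a short sketch that groups file names by the prefix before the first underscore. This is not the package's own implementation; the function name group_by_person is hypothetical, and using None as the key for files without a prefix is an assumption made for illustration.

```python
from collections import defaultdict


def group_by_person(filenames):
    """Hypothetical helper: map each person ID (or None) to their files."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        prefix, separator, _ = name.partition("_")
        # No underscore means no person ID; those files share one group.
        person_id = prefix if separator else None
        groups[person_id].append(name)
    return dict(groups)


files = [
    "person001_StreamingHistory0.json",
    "person001_StreamingHistory1.json",
    "person002_StreamingHistory0.json",
]
grouped = group_by_person(files)
print(sorted(grouped))
```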

Useful information

  • If the output directory does not exist then it will be created.

  • Rehydration for one individual can take 15 minutes or more depending on how many songs there are.

  • If a file for the next individual’s data to be rehydrated already exists in the output directory then that person will be skipped. You will need to delete their file, or move it out of the output folder, for the rehydrator to process their data again.
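
To see in advance which individuals would be skipped, a check along the following lines may help. It is a sketch only: the function name needs_rehydrating is hypothetical, and the file name pattern is assumed from the person001-rehydrated.tsv example shown earlier.

```python
import os
import tempfile


def needs_rehydrating(output_path, person_id):
    """Hypothetical check: True when no output file exists for this person."""
    # Assumed pattern, taken from the example output folder listing.
    target = os.path.join(output_path, f"{person_id}-rehydrated.tsv")
    return not os.path.exists(target)


with tempfile.TemporaryDirectory() as output_folder:
    # Simulate a previous run that already saved person001's data.
    open(os.path.join(output_folder, "person001-rehydrated.tsv"), "w").close()
    todo = [
        person
        for person in ["person001", "person002"]
        if needs_rehydrating(output_folder, person)
    ]

print(todo)
```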

Disclaimers

  • Not all tracks can be retrieved from the API. In our experience about 5% of tracks cannot be found. These will have a value of NONE in the output files.

  • There is not a guaranteed match between the first returned item in a search and the track you want. Comparing msPlayed with the track length is a good way to test this since msPlayed should not exceed the track length.
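
The msPlayed check suggested above could be sketched as follows. The column name duration_ms is an assumption (it is the name Spotify uses for track length in its audio features), as is representing an unfound track with None; adjust both to match your output files. The function name flag_suspect_matches is hypothetical.

```python
def flag_suspect_matches(rows, duration_key="duration_ms"):
    """Return rows whose play time exceeds the matched track's length."""
    suspects = []
    for row in rows:
        duration = row.get(duration_key)
        # Skip rows with no duration, for example tracks that were not found.
        if duration is None:
            continue
        if row["msPlayed"] > duration:
            suspects.append(row)
    return suspects


rows = [
    {"trackName": "A Track Name", "msPlayed": 19807, "duration_ms": 210000},
    {"trackName": "Another Track", "msPlayed": 250000, "duration_ms": 180000},
    {"trackName": "Missing Track", "msPlayed": 1000, "duration_ms": None},
]
suspects = flag_suspect_matches(rows)
print([row["trackName"] for row in suspects])
```

A flagged row does not prove the match is wrong, and an unflagged row does not prove it is right; the check only catches matches that are impossible given the play time.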

Code Documentation

The main module for the spotifyrehydrator package contains three dataclasses.

Track operates on a single Track instance, starting from just a name and an artist, as would be provided in self-requested data. It is possible to use Track to get information about a single Track.

Tracks contains similar logic to Track, but makes use of the batch endpoints to save on API calls. It is therefore more efficient than Track for many calls, and its I/O is primarily Pandas DataFrame objects rather than dictionaries.
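
The idea behind using batch endpoints is to send many IDs per request instead of one. A minimal sketch of the chunking step is shown below; it does not reproduce how Tracks is implemented, the helper name batched is hypothetical, and the batch size of 50 is an assumption for illustration (check the limit documented for each endpoint).

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 120 placeholder IDs split into requests of at most 50 IDs each.
track_ids = [f"id-{n}" for n in range(120)]
batches = list(batched(track_ids, 50))
print([len(batch) for batch in batches])
```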

Rehydrator is mainly intended to rebuild multiple datasets in instances when you have many listening histories from multiple different users with additional metadata such as datetimes. The Rehydrator is the only class which will write files.

class utils.Rehydrator(input_path: str, output_path: str, client_id: str, client_secret: str, _person_ids: list = <function Rehydrator._person_ids>)

Class to iterate through input files, generate full datasets for each listening history and save the data to the output folder. Will create output folder if it does not exist.

input_path

path to the directory (folder) where the input json files are stored.

Type

str

output_path

path to the directory (folder) where the output .tsv files are saved.

Type

str

client_id

Spotify API client ID Credentials

Type

str

client_secret

Spotify API client secret Credentials

Type

str

_person_ids

A list of the unique person IDs identified from the input file names, or None.

Type

list or None

Example

>>> Rehydrator(input_path, output_path, client_id, client_secret).run()
rehydrate(person_id: Optional[str] = None, return_all: bool = False, audio_features: bool = False, artist_info: bool = False) pandas.core.frame.DataFrame

For a single person’s set of data, use the Tracks class to get all of the track IDs and features, then join these on the full listening history data. Save out the complete data, and return it too.

run(return_all: bool = False, audio_features: bool = False, artist_info: bool = False) None

Iterate through each person’s set of data by calling the ‘rehydrate’ method on each.

class utils.Track(name: str, artist: str, client_id: str, client_secret: str)

A class that searches for and returns a Spotify ID and other optional information for a track, given a trackName and an artistName.

name

The name of the track.

Type

str

artist

The name of the artist.

Type

str

client_id

Spotify API client ID Credentials

Type

str

client_secret

Spotify API client secret Credentials

Type

str

Example

get(return_all: bool = False, returned_artist: bool = False, returned_track: bool = False, artist_info: bool = False, audio_features: bool = False) dict

Calls search_results() to get the spotifyID, trying to remove apostrophes and dashes if an IndexError is raised. Returns a dictionary of objects, with spotifyID and then any other objects as defined in function call.

search_results(remove_char=None) dict

Searches the Spotify API for the track and artist and returns the whole results object.

Takes remove_char as a char to remove from the artist and track before searching if needed - this can improve results.
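
The retry behaviour described for get() and search_results() can be sketched as below: search unchanged first, then retry with apostrophes and then dashes removed whenever taking the first result raises an IndexError. This is an illustration under stated assumptions, not the package's code: find_track_id, fake_search and CATALOGUE are hypothetical, and an in-memory lookup stands in for the Spotify API.

```python
def find_track_id(name, artist, search, chars_to_try=(None, "'", "-")):
    """Try the search unchanged, then with each character removed."""
    for remove_char in chars_to_try:
        query_name, query_artist = name, artist
        if remove_char is not None:
            query_name = name.replace(remove_char, "")
            query_artist = artist.replace(remove_char, "")
        try:
            # Taking the first item raises IndexError when nothing matched.
            return search(query_name, query_artist)[0]
        except IndexError:
            continue
    return None


# Stand-in for an API search: a tiny in-memory catalogue.
CATALOGUE = {("Dont Stop", "An Artist"): ["id-123"]}


def fake_search(name, artist):
    return CATALOGUE.get((name, artist), [])


found = find_track_id("Don't Stop", "An Artist", fake_search)
not_found = find_track_id("Unknown", "Nobody", fake_search)
print(found, not_found)
```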

class utils.Tracks(data: pandas.core.frame.DataFrame, client_id: str, client_secret: str)

A class that takes a dataframe of listening events with artistName and trackName, and returns these with the trackID and audio features of each track as a dataframe.

data

A dataframe with two columns ‘artistName’ and ‘trackName’.

Type

pandas.core.frame.DataFrame

client_id

Spotify API client ID Credentials

Type

str

client_secret

Spotify API client secret Credentials

Type

str

_sp_auth

Spotipy OAuth object for API calls.

Example

>>> Tracks(data, client_id, client_secret).get(return_all=True)

This will return a pd.DataFrame with feature columns filled for each unique track in the original data.

Raises

KeyError – If the input data provided does not contain artistName and trackName columns.

get(return_all: bool = False, audio_features: bool = False, artist_info: bool = False) pandas.core.frame.DataFrame

Get the requested data for each track. Returns a dataframe of unique tracks.


Contributing

Contributions to the package are very welcome!

If you would like to add a new feature then
