Welcome to the Spotify Rehydrator!

The Spotify Rehydrator was created to provide a simple way to generate full datasets of track features from user-owned Spotify data. It relies on the excellent Spotipy library and brings together a series of API calls in a convenient way that can manage data from multiple different people, as would be common in a research study. It can also be used by individuals who are curious to learn more about their own data! The idea of a rehydrator was inspired by similar work being done to enable sharing of Twitter datasets for research purposes.

Before you use the rehydrator, please read the Disclaimers to understand the limitations of the search strategy used.

User Guide

The Spotify Rehydrator primarily operates through the Rehydrator class. The required inputs for this class are an input folder, an output folder and a Client ID and Client Secret from the Spotify Developer Portal. These are used for authenticating the API calls. You can then call the run() method.


To request developer credentials, go to Spotify’s developer portal. You will need to ‘create an app’, which will have credentials associated with it. Your app dashboard will give you access to your Client ID and Client Secret.

Install the Spotify Rehydrator using pip:

pip install spotifyrehydrator

Assuming you have set your Client ID and Client Secret as environment variables, this is an example of how you could run the Rehydrator:

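import os
import pathlib
from spotifyrehydrator import Rehydrator

Rehydrator(
    input_path=os.path.join(pathlib.Path(__file__).parent.absolute(), "input"),
    output_path=os.path.join(pathlib.Path(__file__).parent.absolute(), "output"),
    client_id=os.environ["SPOTIFY_CLIENT_ID"],  # assumed variable name
    client_secret=os.environ["SPOTIFY_CLIENT_SECRET"],  # assumed variable name
).run()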

The .run() method will by default return the following columns: the Spotify track ID of the returned track, the name of the artist of the returned track, and the name of the returned track. These are joined with the searched artist and track, the person ID where relevant, and the time metadata in the original .json file. There are then three optional arguments:

  • artist_info = True will return the popularity of the returned artist and a list of genres attributed to that artist, provided by the Artists API endpoint.

  • audio_features = True will return a column for each of the audio features provided by the Tracks API.

  • return_all = True will return both of the above.

Be aware that extra arguments involve more API calls and so may take longer.
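
As a minimal sketch of the optional arguments described above, the helper below requests audio features only. The wrapper name rehydrate_with_audio_features is hypothetical; the constructor and run() keyword names are taken from the class documentation later on this page.

```python
def rehydrate_with_audio_features(input_path, output_path, client_id, client_secret):
    """Hypothetical helper: run the Rehydrator requesting audio features only."""
    # Imported inside the function so this sketch can be loaded
    # even where the package is not installed.
    from spotifyrehydrator import Rehydrator

    Rehydrator(
        input_path=input_path,
        output_path=output_path,
        client_id=client_id,
        client_secret=client_secret,
    ).run(audio_features=True)  # artist_info keeps its default of False
```

Passing return_all=True instead would request both the artist information and the audio features, at the cost of more API calls.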

Expected formats

Streaming History JSON

This package is designed to work with the files named StreamingHistory.json that are sent to users as part of their data package if they request their own Spotify data. The file will contain up to the past year of the user’s listening data.

This data should be in one or more files, each containing a list of JSON objects that look like this:

{
"endTime" : "2019-01-19 17:01",
"artistName" : "An Artist",
"trackName" : "A Track Name",
"msPlayed" : 19807
}

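To check that a file matches this format before rehydrating it, something like the following sketch could be used. It is not part of the package: the function name parse_streaming_history and the sample data are hypothetical, and only the four keys shown above are assumed.

```python
import json

# Hypothetical sample mirroring the format above; real files come from Spotify.
sample = """
[
  {"endTime": "2019-01-19 17:01", "artistName": "An Artist",
   "trackName": "A Track Name", "msPlayed": 19807}
]
"""

REQUIRED_KEYS = {"endTime", "artistName", "trackName", "msPlayed"}


def parse_streaming_history(text):
    """Parse a streaming history JSON string and check each record's keys."""
    records = json.loads(text)
    for record in records:
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise KeyError(f"record is missing keys: {sorted(missing)}")
    return records


records = parse_streaming_history(sample)
print(records[0]["trackName"], records[0]["msPlayed"])
```
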
Input folder

The input folder should contain a series of Streaming History JSON files. If you have files belonging to multiple individuals then the package expects the unique identifier for each person to be the prefix, followed by an underscore. For example:

# input folder

This would result in two rehydrated files being saved to the output folder:

# output folder

You could also input several files without any underscores to represent individuals. These would all be combined and saved in one output file.
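
The naming convention can be illustrated with a short sketch. This is not the package's own code: the function group_by_person and the file names are hypothetical, and splitting on the first underscore is an assumption about how the prefix is read.

```python
from collections import defaultdict


def group_by_person(filenames):
    """Group file names by the prefix before the first underscore.

    Files without an underscore are collected under the key None,
    mirroring the single combined output described above.
    """
    groups = defaultdict(list)
    for name in filenames:
        person_id = name.split("_", 1)[0] if "_" in name else None
        groups[person_id].append(name)
    return dict(groups)


# Hypothetical file names, for illustration only.
files = [
    "person001_StreamingHistory0.json",
    "person001_StreamingHistory1.json",
    "person002_StreamingHistory0.json",
]
print(group_by_person(files))
```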

Useful information

  • If the output directory does not exist then it will be created.

  • Rehydration for one individual can take 15 minutes or more depending on how many songs there are.

  • If a file for the next individual’s data to be rehydrated already exists in the output directory then that person will be skipped. You will need to delete or remove their file from the output folder for the rehydrator to process their data.


  • Not all tracks can be retrieved from the API. In our experience about 5% of tracks cannot be found. These will have a value of NONE in the output files.

  • There is not a guaranteed match between the first returned item in a search and the track you want. Comparing msPlayed with the track length is a good way to test this since msPlayed should not exceed the track length.
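
The msPlayed check from the last point could look like the sketch below. It assumes the data was rehydrated with audio features and that the track length is available as duration_ms (the field name used in Spotify's audio features object); check the column name in your own output. The function name and the sample rows are hypothetical.

```python
def flag_suspect_matches(rows):
    """Return rows whose msPlayed exceeds the returned track's duration."""
    suspect = []
    for row in rows:
        duration = row.get("duration_ms")
        # Rows with no duration (track not found) cannot be checked this way.
        if duration is not None and row["msPlayed"] > duration:
            suspect.append(row)
    return suspect


# Hypothetical rehydrated rows, for illustration only.
rows = [
    {"trackName": "A Track Name", "msPlayed": 19807, "duration_ms": 210000},
    {"trackName": "Another Track", "msPlayed": 250000, "duration_ms": 180000},
    {"trackName": "Unmatched Track", "msPlayed": 1000, "duration_ms": None},
]
suspect = flag_suspect_matches(rows)
print([row["trackName"] for row in suspect])
```

A flagged row only shows that the returned track is shorter than the time played, so it is a prompt for manual review and not proof of a wrong match.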

Code Documentation

The main module for the spotifyrehydrator package contains three dataclasses.

Track operates on a single Track instance, starting from just a name and an artist, as would be provided in self-requested data. It is possible to use Track to get information about a single Track.

Tracks contains similar logic to Track, but makes use of the batch endpoints to save on API calls. It is therefore more efficient than Track for many calls, and its I/O is primarily pandas DataFrame objects rather than dictionaries.

Rehydrator is mainly intended to rebuild multiple datasets in instances when you have many listening histories from multiple different users with additional metadata such as datetimes. The Rehydrator is the only class which will write files.

class utils.Rehydrator(input_path: str, output_path: str, client_id: str, client_secret: str, _person_ids: list = <function Rehydrator._person_ids>)

Class to iterate through input files, generate full datasets for each listening history and save the data to the output folder. Will create output folder if it does not exist.


input_path – path to the directory (folder) where the input json files are stored.

output_path – path to the directory (folder) where the output .tsv files are saved.

client_id – Spotify API client ID credentials.

client_secret – Spotify API client secret credentials.

_person_ids – a list of each of the unique ‘people’ that files were identified for, or None (list or None).

>>> Rehydrator(input_path, output_path, client_id, client_secret).run()

rehydrate(person_id: Optional[str] = None, return_all: bool = False, audio_features: bool = False, artist_info: bool = False) → pandas.core.frame.DataFrame

For a single person’s set of data, use the Tracks class to get all of the track IDs and features, then join these on the full listening history data. Save out the complete data, and return it too.

run(return_all: bool = False, audio_features: bool = False, artist_info: bool = False) → None

Iterate through each person’s set of data by calling the ‘rehydrate’ method on each.

class utils.Track(name: str, artist: str, client_id: str, client_secret: str)

A class that searches for and returns a Spotify ID and other optional information for a track, given a trackName and an artistName.


name – The name of the track.

artist – The name of the artist.

client_id – Spotify API client ID credentials.

client_secret – Spotify API client secret credentials.




get(return_all: bool = False, returned_artist: bool = False, returned_track: bool = False, artist_info: bool = False, audio_features: bool = False) → dict

Calls search_results() to get the spotifyID, trying to remove apostrophes and dashes if an IndexError is raised. Returns a dictionary of objects, with spotifyID and then any other objects as defined in function call.


search_results() searches the Spotify API for the track and artist and returns the whole results object.

Takes remove_char as a char to remove from the artist and track before searching if needed - this can improve results.
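
The fallback described for get() above (retry the search with apostrophes, then dashes, removed when no result comes back) can be sketched as follows. This is an illustration of the idea, not the package's implementation: find_first_id and fake_search are hypothetical, and fake_search stands in for a real Spotify search call.

```python
def find_first_id(search, track, artist, chars_to_try=("'", "-")):
    """Return the first result's ID, retrying with characters removed.

    `search` is any callable taking (track, artist) and returning a list
    of result dicts; it stands in for a real Spotify search call.
    """
    for char in (None, *chars_to_try):
        cleaned_track = track.replace(char, "") if char else track
        cleaned_artist = artist.replace(char, "") if char else artist
        try:
            return search(cleaned_track, cleaned_artist)[0]["id"]
        except IndexError:
            continue  # no results for this spelling, try the next variant
    return None


def fake_search(track, artist):
    """Hypothetical search that only knows one apostrophe-free title."""
    catalogue = {("Dont Stop", "An Artist"): [{"id": "abc123"}]}
    return catalogue.get((track, artist), [])


print(find_first_id(fake_search, "Don't Stop", "An Artist"))  # prints abc123
```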

class utils.Tracks(data: pandas.core.frame.DataFrame, client_id: str, client_secret: str)

A class that takes a dataframe of listening events with artistName and trackName, and returns these with the trackID and audio features of each track as a dataframe.


data – A dataframe with two columns ‘artistName’ and ‘trackName’.

client_id – Spotify API client ID credentials.

client_secret – Spotify API client secret credentials.


Spotipy OAuth object for API calls.


>>> Tracks(data, client_id, client_secret).get(return_all=True)

This will return a pd.DataFrame with feature columns filled for each unique track in the original data.


KeyError – If the input data provided does not contain artistName and trackName columns.

get(return_all: bool = False, audio_features: bool = False, artist_info: bool = False) → pandas.core.frame.DataFrame

Get the requested data for each track. Returns a dataframe of unique tracks.
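
Working on unique tracks and sending IDs in batches is what saves API calls. The sketch below shows the general idea only and is not the package's code: unique_pairs and chunked are hypothetical helpers, and the batch size of 100 is an assumption based on the limit Spotify documents for its several-audio-features endpoint (other endpoints may differ).

```python
def unique_pairs(rows):
    """Return unique (artistName, trackName) pairs in first-seen order."""
    seen = dict.fromkeys((row["artistName"], row["trackName"]) for row in rows)
    return list(seen)


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical listening events: the same track played twice.
events = [
    {"artistName": "An Artist", "trackName": "A Track Name"},
    {"artistName": "An Artist", "trackName": "A Track Name"},
    {"artistName": "An Artist", "trackName": "Another Track"},
]
print(unique_pairs(events))

# Hypothetical track IDs split into batches of an assumed size of 100.
track_ids = [f"id{i}" for i in range(250)]
print([len(batch) for batch in chunked(track_ids, 100)])
```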



Contributions to the package are very welcome!

If you would like to add a new feature then
