Pyspark Spotify ETL

Description

This is my first Data Engineering project. It extracts the user's recently played tracks from Spotify's API, transforms the data, and loads it into PostgreSQL using a SQLAlchemy engine. The data is shown as a Spark DataFrame before loading, and the whole ETL job is scheduled with crontab. The access token never has to be replaced by hand, because the script starts with an HTTP POST to Spotify's token API that requests a fresh one on every run.
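The token refresh described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: it assumes Spotify's documented refresh-token flow (a form-encoded POST to https://accounts.spotify.com/api/token with HTTP Basic client credentials), uses only the standard library, and the function names `build_token_request` and `fetch_access_token` are hypothetical.

```python
import base64
import json
import urllib.parse
import urllib.request

# Spotify's token endpoint (Authorization Code flow, refresh step).
TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_token_request(client_id, client_secret, refresh_token):
    """Build the POST request that exchanges a refresh token for an access token."""
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    headers = {
        # Client credentials are sent as HTTP Basic auth.
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urllib.parse.urlencode(
        {"grant_type": "refresh_token", "refresh_token": refresh_token}
    ).encode("ascii")
    return urllib.request.Request(TOKEN_URL, data=body, headers=headers, method="POST")


def fetch_access_token(client_id, client_secret, refresh_token):
    """Send the request and return the short-lived access token (network call)."""
    request = build_token_request(client_id, client_secret, refresh_token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Calling `fetch_access_token(...)` at the top of the ETL script with your own client ID, client secret and refresh token would give each scheduled run a valid access token, which is why no hourly manual replacement is needed.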

The purpose of this project is to help those who, like me, want to become Data Engineers create their first project.

Essentials

Besides the packages in the Pipfile, the script imports these standard-library modules: sys, json, datetime.
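As an illustration of where json and datetime come in, the transform step could look something like the sketch below. It assumes the response shape of Spotify's recently-played endpoint (an `items` list whose entries carry a `track` object and a `played_at` timestamp). The function name and the output column names are hypothetical and may differ from the ones the project uses.

```python
import json
from datetime import datetime


def flatten_recently_played(payload):
    """Turn the nested API response into flat rows, one per played track."""
    rows = []
    for item in payload.get("items", []):
        track = item["track"]
        # played_at is an ISO 8601 UTC timestamp ending in "Z".
        played_at = datetime.fromisoformat(item["played_at"].replace("Z", "+00:00"))
        rows.append(
            {
                "song_name": track["name"],
                "artist_name": track["artists"][0]["name"],  # first listed artist
                "played_at": played_at.isoformat(),
                "played_date": played_at.date().isoformat(),
            }
        )
    return rows


# A hand-written sample in the same shape as the API response.
sample = json.loads(
    """
    {"items": [
        {"track": {"name": "Song A", "artists": [{"name": "Artist A"}]},
         "played_at": "2022-03-01T23:59:30.123Z"}
    ]}
    """
)
rows = flatten_recently_played(sample)
print(rows[0]["song_name"], rows[0]["played_date"])
```

Rows in this flat form are straightforward to hand to Spark for display and to SQLAlchemy for loading.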

ETL Execution

Install all the necessary libraries from the Pipfile.
Read the "Token_request_instructions" to get your own refresh token. If you'd rather skip that step, you can get a temporary access token from https://developer.spotify.com/console/get-recently-played/, but it has to be replaced every hour.
Add your PostgreSQL credentials to the engine variable. If you are using another RDBMS, see https://docs.sqlalchemy.org/en/14/core/engines.html for the matching connection URL.
Create the SQL database/table (optional).
Create a bash file. This file is where you set the paths to Spark, Python and your script. Cron runs jobs with a minimal environment and does not load your shell profile, so without this file you'll get a "ModuleNotFoundError" for each module you import inside your script. (Think of it as the ETL's own ~/.bash_profile.)
Create a new crontab, or use the existing one, if you want the job to run at midnight every day (schedule "0 0 * * *").
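For the engine step above, the connection URL might be assembled along these lines. This is only a sketch under assumptions: the `postgresql+psycopg2` driver, the helper name `build_postgres_url`, and the example credentials are placeholders, not values taken from the project. Percent-encoding the password matters when it contains characters such as "@" or "/".

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Return a SQLAlchemy-style PostgreSQL URL with user and password percent-encoded."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder credentials; replace them with your own.
url = build_postgres_url("etl_user", "p@ss/word", "localhost", 5432, "spotify")
print(url)

# With SQLAlchemy installed, the engine would then be created as:
#   from sqlalchemy import create_engine
#   engine = create_engine(url)
```

For another RDBMS only the scheme prefix and driver name change. The SQLAlchemy engines page linked above lists the supported forms.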

Extras

To verify that your scheduled job is working, you can temporarily change the crontab schedule to "* * * * *", which runs the job every minute.
See https://developer.spotify.com/documentation/general/guides/scopes/ for other Spotify scopes in case you don't want to use "recently played tracks".
Thank you, Karolina Sowinska, for your DE Beginners guide.
