The goal of this project is to design and implement an ETL data pipeline that ingests raw retail data, processes it, and stores it in a data warehouse for analysis and visualization. The pipeline is orchestrated using Apache Airflow, hosted locally with Docker Compose. AWS S3 serves as the data lake, while AWS Redshift functions as the data warehouse. Finally, the processed data is visualized using Looker Studio for insights and reporting.
!! You can view the dashboard here. !!
1. Ingest raw retail data and upload it to AWS S3 (data lake).
2. Download the raw data from AWS S3 for processing.
3. Transform the raw data into a suitable format for the data warehouse using Pandas and upload it back to AWS S3.
4. Load the processed data from AWS S3 into AWS Redshift (data warehouse).
5. Use data from the warehouse to create dashboards in Looker Studio for insights and reporting.
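These steps map naturally onto an Airflow DAG. The sketch below shows how such a pipeline can be wired together; the task names, function names, and schedule are illustrative assumptions, not the identifiers used in `dag/main_dag.py`.

```python
# A minimal sketch of the pipeline as an Airflow 2.x DAG.
# Task names, function names, and the schedule are assumptions for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_to_s3():
    """Placeholder: upload raw retail data to the S3 data lake."""


def transform_with_pandas():
    """Placeholder: download raw data, clean it with pandas, write it back to S3."""


def load_to_redshift():
    """Placeholder: COPY the processed files from S3 into Redshift."""


with DAG(
    dag_id="retail_elt_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_s3", python_callable=extract_to_s3)
    transform = PythonOperator(task_id="transform_with_pandas", python_callable=transform_with_pandas)
    load = PythonOperator(task_id="load_to_redshift", python_callable=load_to_redshift)

    extract >> transform >> load
```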
I use Looker Studio to create dashboards using data from the data warehouse.
!! You can view the dashboard here. !!
While developing this project, I connected Looker Studio directly to AWS Redshift. However, because of AWS Free Tier limits, the Redshift cluster cannot run continuously, so the dashboard now uses a CSV file exported from Redshift; it looks the same as it did when connected directly to Redshift.
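For reference, such an export can be produced with a short script. The sketch below assumes psycopg2 and pandas; the table name and output file name are only examples, and the connection values correspond to the Redshift credentials listed in the setup section below.

```python
# A minimal sketch of exporting a Redshift table to CSV for Looker Studio.
# The table name "retail_sales" and the output file name are assumptions.
import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host="your_redshift_host",
    port=5439,
    dbname="your_redshift_db",
    user="your_redshift_user",
    password="your_redshift_password",
)

df = pd.read_sql("SELECT * FROM retail_sales;", conn)  # pull the reporting table
df.to_csv("retail_sales_export.csv", index=False)      # file uploaded to Looker Studio
conn.close()
```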
- Cloud - Amazon Web Services (AWS)
- Containerization - Docker, Docker Compose
- Orchestration - Airflow
- Transformation - pandas
- Data Lake - AWS S3
- Data Warehouse - AWS Redshift
- Data Visualization - Looker Studio
- Language - Python
docker run --rm "debian:bookworm-slim" bash -c 'numfmt --to iec $(echo $(($(getconf _PHYS_PAGES) * $(getconf PAGE_SIZE))))'
git clone https://github.com/mikecerton/The-Retail-ELT-Pipeline-End-To-End-project.git
cd The-Retail-ELT-Pipeline-End-To-End-project
mkdir -p ./logs ./plugins ./config
AIRFLOW_UID=50000
bucket_name = your bucket_name
aws_access_key_id = your aws_access_key_id
aws_secret_access_key = your aws_secret_access_key
region_name = your region_name
redshift_host = your redshift_host
redshift_port = your redshift_port
redshift_db = your redshift_db
redshift_user = your redshift_user
redshift_password = your redshift_password
iam_role = your iam_role
docker-compose up airflow-init
docker-compose up
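The DAG code needs the credential keys listed above at runtime. A minimal sketch of loading them is shown below, assuming they are kept in an INI-style file; the file path and the prepended section header are assumptions, not the repository's actual layout.

```python
# A minimal sketch of reading the credential keys listed above with configparser.
# The path "config/credentials.conf" is an assumption; the key/value block shown
# above has no [section] header, so one is prepended before parsing.
from configparser import ConfigParser
from pathlib import Path


def load_credentials(path: str = "config/credentials.conf") -> dict:
    parser = ConfigParser()
    parser.read_string("[default]\n" + Path(path).read_text())
    return dict(parser["default"])


creds = load_credentials()
# creds["bucket_name"], creds["aws_access_key_id"], creds["redshift_host"], ...
# can then be passed to boto3 and psycopg2 by the DAG tasks.
```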
1. Full architecture
2. docker-compose.yaml
- Mainly the official Airflow Docker Compose setup, extended to install extra Python packages (pandas, etc.) via pip.
3. dag/main_dag.py
- Defines the pipeline's Airflow DAG.
4. dag/my_TL.py
- Contains functions to transform the data and load it into AWS Redshift (data warehouse).
5. dag/raw_data_function.py
- Contains functions to upload data to and download data from AWS S3 (data lake); a sketch of such helpers appears at the end of this section.
6. Pull data from Redshift to create a dashboard in Google Looker Studio.
!!!!! Note: This data only covers the period from 01/01/2021 to 31/12/2021.
!!!!! If you select a date range outside this period, the graphs will show NULL.
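For reference, the sketch below shows the kind of helpers that `dag/raw_data_function.py` and `dag/my_TL.py` provide, assuming boto3 and psycopg2; the function, bucket, table, and key names are illustrative, not the actual identifiers in the repository.

```python
# A minimal sketch of S3 and Redshift helpers like those in dag/raw_data_function.py
# and dag/my_TL.py. All names here are illustrative assumptions.
import boto3
import psycopg2


def upload_to_s3(local_path: str, bucket: str, key: str) -> None:
    """Upload a local file to the S3 data lake."""
    boto3.client("s3").upload_file(local_path, bucket, key)


def download_from_s3(bucket: str, key: str, local_path: str) -> None:
    """Download a raw file from the S3 data lake for transformation."""
    boto3.client("s3").download_file(bucket, key, local_path)


def copy_to_redshift(conn_params: dict, table: str, s3_uri: str, iam_role: str) -> None:
    """Load a processed CSV from S3 into Redshift with the COPY command."""
    copy_sql = f"""
        COPY {table}
        FROM '{s3_uri}'
        IAM_ROLE '{iam_role}'
        FORMAT AS CSV
        IGNOREHEADER 1;
    """
    with psycopg2.connect(**conn_params) as conn:
        with conn.cursor() as cur:
            cur.execute(copy_sql)
```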