The-Retail-ELT-Pipeline-End-To-End

Overview

 The goal of this project is to design and implement an ETL data pipeline that ingests raw retail data, processes it, and stores it in a data warehouse for analysis and visualization. The pipeline is orchestrated using Apache Airflow, hosted locally with Docker Compose. AWS S3 serves as the data lake, while AWS Redshift functions as the data warehouse. Finally, the processed data is visualized using Looker Studio for insights and reporting.
!! You can view the dashboard here. !!

Architecture

[Architecture diagram]

1. Upload raw data to AWS S3 (the data lake), which handles data from multiple sources.
2. Download the raw data from AWS S3 for processing.
3. Transform the raw data into a format suitable for the data warehouse using pandas, then upload it back to AWS S3 (a sketch of steps 2-4 follows this list).
4. Load the processed data from AWS S3 into AWS Redshift (the data warehouse).
5. Use the data in the warehouse to build dashboards in Looker Studio for insights and reporting.
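
To make steps 2-4 concrete, here is a minimal sketch of the download-transform-upload-load flow, assuming boto3 for S3 access, pandas for the transformation, and psycopg2 for issuing the Redshift COPY. The bucket, keys, table name, column names, and cleaning rules are illustrative placeholders, not the repository's actual code.

import io

import boto3
import pandas as pd
import psycopg2

# Step 2: download raw data from the S3 data lake (placeholder bucket/key).
s3 = boto3.client("s3", region_name="us-east-1")
raw_obj = s3.get_object(Bucket="my-retail-bucket", Key="raw/retail_raw.csv")
raw_df = pd.read_csv(io.BytesIO(raw_obj["Body"].read()))

# Step 3: transform with pandas, then stage the result back in S3.
clean_df = raw_df.dropna(subset=["invoice_id"]).copy()          # example cleaning rule
clean_df["order_date"] = pd.to_datetime(clean_df["order_date"]) # example type fix
csv_buffer = io.StringIO()
clean_df.to_csv(csv_buffer, index=False)
s3.put_object(Bucket="my-retail-bucket", Key="processed/retail_clean.csv",
              Body=csv_buffer.getvalue())

# Step 4: load the processed file into Redshift with COPY, using an IAM role.
conn = psycopg2.connect(host="<redshift_host>", port=5439, dbname="<redshift_db>",
                        user="<redshift_user>", password="<redshift_password>")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY retail_sales
        FROM 's3://my-retail-bucket/processed/retail_clean.csv'
        IAM_ROLE '<iam_role_arn>'
        FORMAT AS CSV
        IGNOREHEADER 1;
    """)
conn.close()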

Dashboard

[Dashboard screenshot]

I use Looker Studio to create dashboards using data from the data warehouse.

!! You can view the dashboard here. !!

A special note

While developing this project, I connected Looker Studio directly to AWS Redshift. However, because of AWS Free Tier limits, Redshift cannot run continuously. As a result, the dashboard now uses data from a CSV file exported from Redshift, but it looks the same as when it was connected directly to Redshift.
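
As an illustration of that workaround, a CSV export like the one behind the dashboard could be produced with a short script along these lines (the table name and query are placeholders, not the exact export used):

import pandas as pd
import psycopg2

# Pull the warehouse table into pandas and write it out as a CSV for Looker Studio.
conn = psycopg2.connect(host="<redshift_host>", port=5439, dbname="<redshift_db>",
                        user="<redshift_user>", password="<redshift_password>")
df = pd.read_sql("SELECT * FROM retail_sales;", conn)   # placeholder table name
df.to_csv("redshift_export.csv", index=False)
conn.close()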

Data Warehouse

[Data warehouse diagram]

DAG

[Airflow DAG graph]

Tools & Technologies

  • Cloud - Amazon Web Services (AWS)
  • Containerization - Docker, Docker Compose
  • Orchestration - Airflow
  • Transformation - pandas
  • Data Lake - AWS S3
  • Data Warehouse - AWS Redshift
  • Data Visualization - Looker Studio
  • Language - Python

Set up

1. Check that Docker has more than 4 GB of RAM available (required to run Airflow):

docker run --rm "debian:bookworm-slim" bash -c "numfmt --to iec $(echo $(($(getconf _PHYS_PAGES) * $(getconf PAGE_SIZE))))"

2. Clone this GitHub repository:

git clone https://github.com/mikecerton/The-Retail-ELT-Pipeline-End-To-End-project.git
cd The-Retail-ELT-Pipeline-End-To-End-project

3. Create the logs, plugins, and config directories:

mkdir logs plugins config

4. Put your credentials into a .env file like this:

AIRFLOW_UID=50000

bucket_name = your bucket_name
aws_access_key_id = your aws_access_key_id
aws_secret_access_key = your aws_secret_access_key
region_name = your region_name

redshift_host = your redshift_host
redshift_port = your redshift_port
redshift_db = your redshift_db
redshift_user = your redshift_user
redshift_password = your redshift_password
iam_role = your iam_role
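
AIRFLOW_UID is consumed by Docker Compose itself; the remaining keys are credentials used by the pipeline code. Below is a minimal sketch of how they might be read at runtime, assuming python-dotenv (the repository may load them differently).

import os

from dotenv import load_dotenv   # assumes the python-dotenv package is installed

load_dotenv()   # read key=value pairs from .env into the process environment

bucket_name = os.getenv("bucket_name")
aws_access_key_id = os.getenv("aws_access_key_id")
aws_secret_access_key = os.getenv("aws_secret_access_key")
region_name = os.getenv("region_name")

redshift = {
    "host": os.getenv("redshift_host"),
    "port": os.getenv("redshift_port"),
    "dbname": os.getenv("redshift_db"),
    "user": os.getenv("redshift_user"),
    "password": os.getenv("redshift_password"),
}
iam_role = os.getenv("iam_role")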

5. Initialize Airflow (this runs the database migrations and creates the first user account):

docker-compose up airflow-init

6. Start all services with Docker Compose:

docker-compose up

7. Open the Airflow UI (default login airflow / airflow in the official Airflow Compose setup) and activate the DAG at

http://localhost:8080

Project Structure

1. Full architecture
2. docker-compose.yaml
   - Mainly the standard Airflow Docker Compose setup, extended to pip install pandas and the other dependencies.
3. dag/main_dag.py
   - Defines the DAG (a simplified sketch of how the pieces fit together appears after the note below).
4. dag/my_TL.py
   - Contains functions to transform data and load it into AWS Redshift (data warehouse).
5. dag/raw_data_function.py
   - Contains functions to upload data to and download data from AWS S3 (data lake).
6. Pull data from Redshift to create a dashboard in Google Looker Studio.

!!!!! Note: This data only covers the period from 01/01/2021 to 31/12/2021.
!!!!! If you select a date range outside this period, the graphs will show NULL.
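
For orientation, here is a simplified sketch of how dag/main_dag.py might wire the helper functions above into a single DAG with PythonOperator tasks. The dag_id, task ids, schedule, and function names are illustrative placeholders rather than the repository's actual definitions.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder imports standing in for the repository's helper modules.
from raw_data_function import upload_raw_to_s3, download_raw_from_s3
from my_TL import transform_data, load_to_redshift

with DAG(
    dag_id="retail_pipeline",            # illustrative dag_id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    upload_raw = PythonOperator(task_id="upload_raw_to_s3",
                                python_callable=upload_raw_to_s3)
    download_raw = PythonOperator(task_id="download_raw_from_s3",
                                  python_callable=download_raw_from_s3)
    transform = PythonOperator(task_id="transform_data",
                               python_callable=transform_data)
    load = PythonOperator(task_id="load_to_redshift",
                          python_callable=load_to_redshift)

    # Mirrors the architecture: ingest -> stage -> transform -> load.
    upload_raw >> download_raw >> transform >> load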

Disclaimer
