using Pentaho Data Integration (PDI)/Kettle ⚙
.: 📄 Dataset taken from Kaggle :.
-
This repository contains:
- ETL file using Pentaho Data Integration (PDI).
- CSV file that has gone through ETL process.
-
This project will also:
- Clean and transform both data sets (application record and credit record),
- Merge, clean, and transform data sets into one data set (in CSV format).
- Perform ETL using PDI for both datasets.
- Create time dimension using PDI.
- Create fact table using PDI.
- The dataset description can be seen here.
- Importing application record csv.
- Sort data based on ID (in ascending order).
- Filtering duplicate ID.
- Replacing 'Y' with 1, and 'N' with 0.
- Adding 'Current_Date' column.
- Calculating applicant age and how long applicant have been working (in years).
- Filtering applicant that is less than 21 y.o.
- Filtering applicant with null/empty values
- Importing credit record csv.
- Sort data based on ID (in ascending order).
- Adding 'Current_Date' column.
- Calculating month loan payment.
- Creating copy of 'STATUS' column.
- Replace C, X, 0 with 'Good Debt' (C: loan for that month is already paid; X: no loan for that month; 0: loan is 1 to 29 days overdue).
- Replace 1, 2, 3, 4, 5 with 'Bad Debt' (1: loan is 30 to 59 days overdue; 2: loan is 60 to 89 days overdue; 3: loan is 90 to 119 days overdue; 4: loan is 120 to 149 days overdue; 5: loan is more than 150 days overdue).
- Creating 2 copies of 'STATUS2' column (Good_Debt and Bad_Debt).
- Good_Debt: Good Debt will be change to 1, while Bad Debt will be change to 0.
- Bad_Debt: Good Debt will be change to 0, while Bad Debt will be change to 1.
- Calculating total of Good Debt and Bad Debt from each applicant (similar to group by function in SQL).
- If the total of Good Debt is higher than Bad Debt, then an applicant status will be eligible (1).
- If the total of Bad Debt is higher than Good Debt, then an applicant status will be not eligible (0).
- Bad_Debt_CNT, Good_Debt_CNT, and STATUS will be merged based on applicant ID.
- Applicant with empty Bad_Debt_CNT, Good_Debt_CNT, and STATUS will be deleted.
- Select columns that will extracted.
- Exporting cleaned and transformed data set into CSV file.
👉 If you find this project useful, please ⭐ this repository 😆!
🎈 Check out my work using AutoML/PyCaret with this processed data set here!
👉 More about myself: here