This repository contains the ETL files from Pentaho Data Integration. The ETL process cleans applicant records with empty values and dirty data. The dataset is taken from https://www.kaggle.com/rikdifos/credit-card-approval-prediction

caesarmario/etl-credit-card-dataset-using-pentaho

💳 ETL Credit Card Data Set 💳

using Pentaho Data Integration (PDI)/Kettle ⚙

.: 📄 Dataset taken from Kaggle :.






🖋 About Project

  • This repository contains:

    • The ETL file built using Pentaho Data Integration (PDI).
    • The CSV file that has gone through the ETL process.
  • This project will also:

    • Clean and transform both data sets (application record and credit record),
    • Merge, clean, and transform the data sets into one data set (in CSV format).

📌 Objectives

  • Perform ETL using PDI for both datasets.
  • Create time dimension using PDI.
  • Create fact table using PDI.

🧾 Data Set Description

  • The dataset description can be seen here.

⚙ ETL Process

Main ETL Flow



👨‍💼 Application Record

▶ CSV file input

CSV file input - Application

  • Importing the application record CSV.

▶ Sort rows

Sort rows - Application

  • Sorting data by ID (in ascending order).

▶ Unique rows

Unique rows - Application

  • Filtering out duplicate IDs.
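The two steps above (Sort rows, then Unique rows) can be sketched in plain Python. The ID field comes from the Kaggle dataset; the sample rows and their values are hypothetical:

```python
# Hypothetical sample of application records with a duplicated ID.
rows = [
    {"ID": 5008806, "CODE_GENDER": "M"},
    {"ID": 5008804, "CODE_GENDER": "F"},
    {"ID": 5008806, "CODE_GENDER": "M"},  # duplicate ID
]

# Sort rows: order by ID ascending (PDI's "Sort rows" step).
rows.sort(key=lambda r: r["ID"])

# Unique rows: drop consecutive duplicates on the sorted key
# (PDI's "Unique rows" step expects sorted input, which is why
# sorting comes first in the transformation).
deduped = []
for r in rows:
    if not deduped or deduped[-1]["ID"] != r["ID"]:
        deduped.append(r)
```

This mirrors why the Sort rows step precedes Unique rows: deduplication on a sorted stream only needs to compare neighbouring rows.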

▶ Replace in string

Replace in string - Application

  • Replacing 'Y' with 1 and 'N' with 0.
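A minimal Python sketch of this replacement, assuming it targets the dataset's FLAG_OWN_CAR and FLAG_OWN_REALTY columns (the sample record is hypothetical):

```python
# Hypothetical applicant record with the Y/N flag columns.
record = {"FLAG_OWN_CAR": "Y", "FLAG_OWN_REALTY": "N"}

# Replace in string: map 'Y' -> '1' and 'N' -> '0' in each flag column,
# leaving any other value untouched.
mapping = {"Y": "1", "N": "0"}
for col in ("FLAG_OWN_CAR", "FLAG_OWN_REALTY"):
    record[col] = mapping.get(record[col], record[col])
```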

▶ Add constants

Add constants - Application

  • Adding 'Current_Date' column.

▶ Calculator

Calculator - Application

  • Calculating the applicant's age and how long the applicant has been working (in years).
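In the Kaggle dataset, DAYS_BIRTH and DAYS_EMPLOYED are negative counts of days relative to the application date, so the Calculator step's logic reduces to a sign flip and a division. The sample values and the integer division by 365 are assumptions of this sketch:

```python
# DAYS_BIRTH / DAYS_EMPLOYED are negative day counts in the source data.
record = {"DAYS_BIRTH": -12005, "DAYS_EMPLOYED": -4542}  # hypothetical values

# Calculator: convert the day counts into whole years.
AGE_YEARS = -record["DAYS_BIRTH"] // 365
YEARS_EMPLOYED = -record["DAYS_EMPLOYED"] // 365
```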

▶ Filter rows

Filter rows - Application

  • Filtering out applicants younger than 21 years old.
  • Filtering out applicants with null/empty values.
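Both filter conditions above can be sketched as a single pass over the records; the sample rows and the AGE_YEARS/OCCUPATION_TYPE fields used here are hypothetical:

```python
# Hypothetical records after the Calculator step.
rows = [
    {"ID": 1, "AGE_YEARS": 32, "OCCUPATION_TYPE": "Laborers"},
    {"ID": 2, "AGE_YEARS": 19, "OCCUPATION_TYPE": "Sales staff"},  # under 21
    {"ID": 3, "AGE_YEARS": 45, "OCCUPATION_TYPE": None},           # null value
]

# Filter rows: keep applicants at least 21 years old with no null fields.
kept = [
    r for r in rows
    if r["AGE_YEARS"] >= 21 and all(v is not None for v in r.values())
]
```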



💶 Credit Record

▶ CSV file input

CSV file input - Credit Record

  • Importing the credit record CSV.

▶ Sort rows 2

Sort rows - Credit Record

  • Sorting data by ID (in ascending order).

▶ Add constants 2

Add constants - Credit Record

  • Adding 'Current_Date' column.

▶ Calculator 2

Calculator - Credit Record

  • Calculating the month of the loan payment.
  • Creating a copy of the 'STATUS' column.

▶ Replace in string 2

Replace in string - Credit Record

  • Replacing C, X, and 0 with 'Good Debt' (C: the loan for that month is already paid; X: no loan for that month; 0: the loan is 1 to 29 days overdue).
  • Replacing 1, 2, 3, 4, and 5 with 'Bad Debt' (1: 30 to 59 days overdue; 2: 60 to 89 days overdue; 3: 90 to 119 days overdue; 4: 120 to 149 days overdue; 5: more than 150 days overdue).
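The STATUS mapping above amounts to a small lookup; a Python sketch (the `classify` helper is hypothetical, but the code-to-label mapping follows the bullets above):

```python
# STATUS codes from the credit record: C, X, 0 count as good debt;
# 1 through 5 count as bad debt (increasing overdue severity).
GOOD = {"C", "X", "0"}
BAD = {"1", "2", "3", "4", "5"}

def classify(status):
    """Map a raw STATUS code to its 'Good Debt' / 'Bad Debt' label."""
    if status in GOOD:
        return "Good Debt"
    if status in BAD:
        return "Bad Debt"
    return status  # leave any unknown code untouched
```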

▶ Calculator 3

Calculator2 - Credit Record

  • Creating two copies of the 'STATUS2' column (Good_Debt and Bad_Debt).

▶ Replace in string 3

Replace in string2 - Credit Record

  • Good_Debt: 'Good Debt' is changed to 1, while 'Bad Debt' is changed to 0.
  • Bad_Debt: 'Good Debt' is changed to 0, while 'Bad Debt' is changed to 1.
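The effect of the two replacements is a pair of complementary 0/1 indicator columns; a sketch with a hypothetical sample row:

```python
# STATUS2 holds the 'Good Debt' / 'Bad Debt' label from the previous step.
record = {"ID": 5001711, "STATUS2": "Good Debt"}  # hypothetical row

# The two copies of STATUS2 become complementary 0/1 indicators.
record["Good_Debt"] = 1 if record["STATUS2"] == "Good Debt" else 0
record["Bad_Debt"] = 1 if record["STATUS2"] == "Bad Debt" else 0
```

Encoding the label twice, once per polarity, is what lets the later Group by step count good and bad months with two plain sums.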

▶ Group by

Group by - Credit Record

  • Calculating the totals of Good Debt and Bad Debt for each applicant (similar to the GROUP BY function in SQL).
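A Python sketch of the aggregation, assuming one input row per applicant-month with the 0/1 indicators from the previous step (the sample rows are hypothetical; the *_CNT names match the output columns used later):

```python
# One row per (applicant, month); Good_Debt / Bad_Debt are 0/1 indicators.
rows = [
    {"ID": 1, "Good_Debt": 1, "Bad_Debt": 0},
    {"ID": 1, "Good_Debt": 0, "Bad_Debt": 1},
    {"ID": 1, "Good_Debt": 1, "Bad_Debt": 0},
    {"ID": 2, "Good_Debt": 0, "Bad_Debt": 1},
]

# Group by ID, summing the indicators — like
# SQL's: SELECT ID, SUM(Good_Debt), SUM(Bad_Debt) ... GROUP BY ID
totals = {}
for r in rows:
    t = totals.setdefault(r["ID"], {"Good_Debt_CNT": 0, "Bad_Debt_CNT": 0})
    t["Good_Debt_CNT"] += r["Good_Debt"]
    t["Bad_Debt_CNT"] += r["Bad_Debt"]
```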

▶ Modified JavaScript value

JavaScript - Credit Record

  • If an applicant's total Good Debt is higher than their Bad Debt, the applicant's status is eligible (1).
  • If an applicant's total Bad Debt is higher than their Good Debt, the applicant's status is not eligible (0).
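The decision rule above can be sketched as a one-line function. Note that the original step only specifies the two strict "higher than" cases; treating a tie as not eligible is an assumption of this sketch:

```python
def eligibility(good_cnt, bad_cnt):
    """Return 1 (eligible) when good-debt months outnumber bad-debt months.

    Ties are treated as not eligible (0) — an assumption, since the
    original transformation only describes the strict '>' cases.
    """
    return 1 if good_cnt > bad_cnt else 0
```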



📥 Output file

▶ Stream lookup

Stream lookup - Output file

  • Bad_Debt_CNT, Good_Debt_CNT, and STATUS are merged in based on the applicant ID.
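The Stream lookup behaves like a left join keyed on ID: every application row passes through, and rows with no matching credit record get empty lookup fields (which the next Filter rows step removes). A sketch with hypothetical sample data:

```python
# Main stream: cleaned application records (hypothetical sample).
applications = [{"ID": 1, "CODE_GENDER": "M"}, {"ID": 3, "CODE_GENDER": "F"}]

# Lookup stream: per-applicant debt totals and eligibility status.
credit = {1: {"Good_Debt_CNT": 2, "Bad_Debt_CNT": 1, "STATUS": 1}}

# Stream lookup: attach the lookup fields to each application by ID;
# unmatched applicants get empty (None) lookup fields.
empty = {"Good_Debt_CNT": None, "Bad_Debt_CNT": None, "STATUS": None}
merged = [{**app, **credit.get(app["ID"], empty)} for app in applications]
```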

▶ Filter rows

Filter rows - Output file

  • Applicants with empty Bad_Debt_CNT, Good_Debt_CNT, and STATUS values are deleted.

▶ Select values 2

Select values - Output file

  • Selecting the columns that will be extracted.

▶ Text file output

Text file output1 - Output file Text file output2 - Output file

  • Exporting the cleaned and transformed data set to a CSV file.



👀 Preview Output File

Preview Output File 1 Preview Output File 2



🙌 Support me!

👉 If you find this project useful, please ⭐ this repository 😆!

🎈 Check out my work using AutoML/PyCaret with this processed data set here!


👉 More about myself: here
