GitHub - keyoumao/Credit_Risk: Supervised Machine Learning and Credit Risk

About The Project

In 2019, more than 19 million Americans had at least one unsecured personal loan. That’s a record-breaking number! Personal lending is growing faster than credit card, auto, mortgage, and even student debt. With such incredible growth, FinTech firms are storming ahead of traditional loan processes. By using the latest machine learning techniques, these FinTech firms can continuously analyze large amounts of data and predict trends to optimize lending.

We used Python to build and evaluate several machine learning models to predict credit risk. Being able to predict credit risk with machine learning algorithms can help banks and financial institutions predict anomalies, reduce risk cases, monitor portfolios, and provide recommendations on what to do in cases of fraud.

Roadmap

Logistic Regression
Classification Model Validation
Support Vector Machines
Data Preprocessing in Machine Learning
Decision Trees
Ensemble Learning and Random Forests
Bagging and Boosting

Steps

The goals of this project are to:

Implement machine learning models.
Use resampling to attempt to address class imbalance.
Evaluate the performance of machine learning models using the scikit-learn library.
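To make the resampling goal concrete, here is a minimal pure-Python sketch of naive random oversampling: rows of the smaller class are duplicated at random until both classes have the same size. It only illustrates the idea and is not the implementation used by imbalanced-learn's RandomOverSampler; the function name random_oversample and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            # Append a copy of an existing row; no new information is created.
            X_out.append(list(rng.choice(rows)))
            y_out.append(label)
    return X_out, y_out


# Toy data (hypothetical): 2 high_risk loans versus 8 low_risk loans.
X = [[i] for i in range(10)]
y = ["high_risk"] * 2 + ["low_risk"] * 8

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have 8 rows
```

Because the added rows are exact copies, this kind of resampling should be applied to the training split only, so that duplicated rows never leak into the data used for evaluation.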

Tasks

Oversample the data using the RandomOverSampler and SMOTE algorithms.
Undersample the data using the cluster centroids algorithm.
Use a combination approach with the SMOTEENN algorithm.

For each of the above:
1. Train a logistic regression classifier (from scikit-learn) using the resampled data.
2. Calculate the balanced accuracy score using balanced_accuracy_score from sklearn.metrics.
3. Generate a confusion_matrix.
4. Print the classification report (classification_report_imbalanced from imblearn.metrics).
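The evaluation steps above can be sketched in plain Python to show what the reported numbers mean. This is a simplified stand-in for balanced_accuracy_score, confusion_matrix and classification_report_imbalanced, not their actual implementation. The function name binary_report and the toy labels are hypothetical, the iba column is omitted, and both classes are assumed to be present in y_true.

```python
from math import sqrt


def binary_report(y_true, y_pred, positive="high_risk"):
    """Compute confusion counts and the main report columns for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    recall = tp / (tp + fn)            # "rec": share of actual positives that were found
    specificity = tn / (tn + fp)       # "spe": share of actual negatives that were found
    precision = tp / (tp + fp) if tp + fp else 0.0   # "pre"
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "confusion": [[tp, fn], [fp, tn]],   # rows: actual class, columns: predicted class
        "pre": precision,
        "rec": recall,
        "spe": specificity,
        "f1": f1,
        "geo": sqrt(recall * specificity),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Toy labels (hypothetical): 4 high_risk and 6 low_risk loans.
y_true = ["high_risk"] * 4 + ["low_risk"] * 6
y_pred = ["high_risk"] * 3 + ["low_risk"] + ["high_risk"] * 2 + ["low_risk"] * 4

report = binary_report(y_true, y_pred)
for name, value in report.items():
    print(name, value)
```

For two classes, the balanced accuracy is the mean of the recall of each class, which is why it stays informative when one class is much larger than the other.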

Analysis

Oversampling

Naive Random Oversampling

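The balanced accuracy score for naive random oversampling is 0.65. The precision for high_risk is only 0.01 while the precision for low_risk is 1.00, so almost every loan the model flags as high_risk is actually low_risk. This reflects the heavy class imbalance in the evaluated data (101 high_risk versus 17,104 low_risk loans) rather than a model that separates the classes well. The recall (sensitivity) is 0.70 for high_risk and 0.60 for low_risk, neither of which is ideal. The classification report is given as follows.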

                  pre       rec       spe        f1       geo       iba       sup

  high_risk      0.01      0.70      0.60      0.02      0.65      0.42       101
   low_risk      1.00      0.60      0.70      0.75      0.65      0.42     17104

avg / total      0.99      0.60      0.70      0.74      0.65      0.42     17205
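The very low high_risk precision follows directly from the class sizes. The short calculation below rebuilds approximate confusion counts from the rounded recall and support values in the report above; because the inputs are rounded to two decimals, the counts are estimates only.

```python
# Values copied from the naive random oversampling report above.
n_high, n_low = 101, 17104
recall_high, recall_low = 0.70, 0.60

tp = recall_high * n_high        # high_risk loans correctly flagged as high_risk
fp = (1 - recall_low) * n_low    # low_risk loans wrongly flagged as high_risk

precision_high = tp / (tp + fp)
balanced_accuracy = (recall_high + recall_low) / 2

print(f"approx. true positives:  {tp:.0f}")
print(f"approx. false positives: {fp:.0f}")
print(f"precision (high_risk):   {precision_high:.2f}")
print(f"balanced accuracy:       {balanced_accuracy:.2f}")
```

Roughly 71 true positives are swamped by several thousand false positives, so a precision near 0.01 is expected even though 70% of the high_risk loans are caught.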

SMOTE Oversampling

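The balanced accuracy score for SMOTE oversampling is 0.63. The precision for high_risk is only 0.01 while the precision for low_risk is 1.00, so almost every loan flagged as high_risk is actually low_risk. The recall (sensitivity) is not ideal for either class: for high_risk it is 0.59, lower than with naive random oversampling (0.70), while for low_risk it is 0.66, slightly higher than with naive random oversampling (0.60). The classification report is given as follows.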

                  pre       rec       spe        f1       geo       iba       sup

  high_risk      0.01      0.59      0.66      0.02      0.62      0.39       101
   low_risk      1.00      0.66      0.59      0.79      0.62      0.39     17104

avg / total      0.99      0.66      0.59      0.79      0.62      0.39     17205
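Unlike naive random oversampling, SMOTE does not copy rows. It creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows that single step in plain Python; it is a simplified illustration, not imbalanced-learn's implementation, and the function name smote_like_sample, the value of k and the toy points are hypothetical. It assumes at least two minority points.

```python
import random


def smote_like_sample(minority, k=2, seed=0):
    """Create one synthetic point between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # Sort the other minority points by squared Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # position along the segment from base to neighbour, in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Toy minority-class points (hypothetical), two features each.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [5.0, 5.0]]

synthetic = smote_like_sample(minority)
print(synthetic)
```

The synthetic point always lies on the line segment between two real minority points, so it stays inside the region the minority class already occupies.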

To sum up, neither oversampling method performs well.

Undersampling

The balanced accuracy score for undersampling with cluster centroids is 0.63. The precision for high_risk is only 0.01 while the precision for low_risk is 1.00, so almost every loan flagged as high_risk is actually low_risk. The recall (sensitivity) is 0.65 for high_risk and 0.54 for low_risk, neither of which is good. The classification report is given as follows.

                  pre       rec       spe        f1       geo       iba       sup

  high_risk      0.01      0.65      0.54      0.02      0.60      0.36       101
   low_risk      1.00      0.54      0.65      0.70      0.60      0.35     17104

avg / total      0.99      0.54      0.65      0.70      0.60      0.35     17205
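The cluster centroids approach shrinks the majority class by replacing it with the centroids of k-means clusters, using as many clusters as there are minority rows. The sketch below illustrates this with a small hand-written k-means; it is not imbalanced-learn's ClusterCentroids implementation, and the function name kmeans_centroids, the deterministic starting points and the toy data are hypothetical.

```python
def kmeans_centroids(points, k, iterations=20):
    """Plain Lloyd's k-means; returns k centroids (start: evenly spaced input points)."""
    step = len(points) / k
    centroids = [list(points[int(i * step)]) for i in range(k)]
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members.
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


# Toy data (hypothetical): 2 minority rows and 6 majority rows in two loose groups.
minority = [[0.0, 0.0], [0.2, 0.1]]
majority = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [4.0, 4.0], [4.1, 3.9], [3.9, 4.1]]

# Replace the 6 majority rows with as many centroids as there are minority rows.
reduced_majority = kmeans_centroids(majority, k=len(minority))
print(reduced_majority)
```

The reduced majority class contains synthetic centroid rows rather than real loans, which balances the classes but discards much of the detail in the majority data.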

Combination (Over and Under) Sampling

The balanced accuracy score for combination sampling with SMOTEENN is 0.60. The precision for high_risk is only 0.01 while the precision for low_risk is 1.00, so almost every loan flagged as high_risk is actually low_risk. The recall (sensitivity) is 0.70 for high_risk and 0.60 for low_risk, neither of which is good. The classification report is given as follows.

                  pre       rec       spe        f1       geo       iba       sup

  high_risk      0.01      0.70      0.60      0.02      0.65      0.43       101
   low_risk      1.00      0.60      0.70      0.75      0.65      0.42     17104

avg / total      0.99      0.60      0.70      0.75      0.65      0.42     17205

Recommendations

Based on the analysis above, none of the resampling models is recommended. All of them have balanced accuracy scores below 0.7, the precision for high_risk is only 0.01, and the recall (sensitivity) is also poor. A more detailed model that better distinguishes the features needs to be established for better predictions.

Extension

BalancedRandomForestClassifier improved the balanced accuracy score a little, to 0.74, but the precision and sensitivity problems remain. EasyEnsembleClassifier is by far the best model, with a balanced accuracy score of 0.94. The sensitivity has improved as well. Its classification_report_imbalanced output is given as follows.

                  pre       rec       spe        f1       geo       iba       sup

  high_risk      0.10      0.92      0.95      0.18      0.94      0.87       101
   low_risk      1.00      0.95      0.92      0.97      0.94      0.88     17104

avg / total      0.99      0.95      0.92      0.97      0.94      0.88     17205
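Both ensemble classifiers build on the same idea: train many learners, each on a class-balanced subset obtained by randomly undersampling the majority class, and combine their predictions. The sketch below shows only the subset-drawing step in plain Python. It is a rough illustration under stated assumptions, not the imbalanced-learn implementation; the function name balanced_subsets, the number of subsets and the toy data are hypothetical, and the majority class is assumed to be at least as large as the minority class.

```python
import random
from collections import Counter


def balanced_subsets(X, y, n_subsets=5, minority="high_risk", seed=0):
    """Yield subsets that keep every minority row plus an equal-sized random draw of the other rows."""
    rng = random.Random(seed)
    minority_rows = [x for x, lab in zip(X, y) if lab == minority]
    majority_rows = [(x, lab) for x, lab in zip(X, y) if lab != minority]
    for _ in range(n_subsets):
        # Draw without replacement so each subset is balanced one-to-one.
        drawn = rng.sample(majority_rows, len(minority_rows))
        X_sub = minority_rows + [x for x, _ in drawn]
        y_sub = [minority] * len(minority_rows) + [lab for _, lab in drawn]
        yield X_sub, y_sub


# Toy data (hypothetical): 3 high_risk loans versus 12 low_risk loans.
X = [[i] for i in range(15)]
y = ["high_risk"] * 3 + ["low_risk"] * 12

subsets = list(balanced_subsets(X, y))
for X_sub, y_sub in subsets:
    print(Counter(y_sub), sorted(x[0] for x in X_sub))
```

Each subset sees a different slice of the majority class, so the ensemble as a whole uses far more of the majority data than a single undersampled model would.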

License

Distributed under the MIT License. See LICENSE for more information.

