Date of Award

2023

Publication Type

Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

CatBoost, Credit scoring, Decision trees, Machine learning, Tabular data, Tree-based methods

Rights

info:eu-repo/semantics/openAccess

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

The lending industry commonly relies on assessing borrowers’ repayment performance to make lending decisions, which safeguards lenders’ assets and maintains their profitability. With the rise of Artificial Intelligence, lenders have turned to Machine Learning (ML) algorithms to solve this problem.

The novelty of this study is applying tree-based ML methods to a large dataset and accurately predicting financial repayment performance without using any repayment history, which all of the reviewed literature relied on. Instead, the only attributes used were applicants’ demographics and psychographics. The study’s proprietary US-based dataset, whose owner does not wish to be disclosed, covers an anonymized population of about half a million beneficiaries with a well-balanced binary target distribution.

An Area Under the Receiver Operating Characteristic Curve (ROC-AUC) of 85% was achieved on the binary classification target using the CatBoost API. The study also experimented with a given tri-class target. Furthermore, this research used ML to identify which attributes contribute the most to the repayment prediction, and tested whether similar results can be achieved with fewer attributes, which would make the model more practical for the data owner to apply. For verification, the best model was applied to one of the biggest publicly available financial datasets. The original research on that dataset reported an accuracy of 82%; this study achieved 79% using 5-fold Cross-Validation (CV). This result was achieved with Tree-Based models with a complexity of O(log n), compared to O(2^n) in the original research, which is a significant efficiency enhancement.
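
As a reading aid for the ROC-AUC figure quoted above: the metric can be interpreted as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The following is a minimal sketch of that pairwise definition, not the study's implementation. The function name `roc_auc` and the toy labels and scores are hypothetical and are not drawn from the thesis dataset.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical toy data (assumption, not from the study): 1 = repaid, 0 = did not repay.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
auc = roc_auc(labels, scores)
print(f"ROC-AUC = {auc:.3f}")
```

In this toy example, 7 of the 9 positive/negative pairs are ordered correctly, so the value is 7/9. The quadratic pairwise loop is chosen for readability; library implementations typically use a sort-based computation instead.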
