Date of Award

9-27-2023

Publication Type

Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

Anomaly detection; Early warning system; Large datasets; Loan default; Unbalanced dataset

Supervisor

Ziad Kobti

Rights

info:eu-repo/semantics/openAccess

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Given the rise in loan defaults, especially since the COVID-19 pandemic, lenders need to predict whether a customer is likely to default on a loan in order to manage risk. This thesis proposes an early warning system architecture that uses anomaly detection, motivated by the unbalanced nature of real-world loan default data: most customers do not default on their loans, and only a tiny percentage do. We evaluate anomaly detection methods for their suitability in handling unbalanced datasets through a comparative study on four datasets that include both balanced and unbalanced data. We compare five approaches in each of three categories: supervised, unsupervised, and semi-supervised. The supervised algorithms are logistic regression, stochastic gradient descent (SGD), XGBoost, LightGBM, and CatBoost classifiers. The unsupervised anomaly detection methods are isolation forest, angle-based outlier detection (ABOD), outlier detection using empirical cumulative distribution functions (ECOD), copula-based outlier detection (COPOD), and a deep one-class classifier with autoencoder (DeepSVDD). The semi-supervised anomaly detection methods are improving supervised outlier detection with unsupervised representation learning (XGBOD), feature encoding with autoencoders for weakly-supervised anomaly detection (FeaWAD), deep semi-supervised anomaly detection (DeepSAD), pairwise relation prediction network (PReNet), and deep anomaly detection with deviation networks (DevNet). We compare them using standard evaluation metrics: accuracy, precision, recall, F1 score, training and prediction time, and area under the receiver operating characteristic (ROC) curve. The results show that anomaly detection methods perform significantly better on unbalanced loan default data and are more suitable for real-world applications. The results also show that supervised methods work better for balanced datasets and that, for peer-to-peer lending datasets, boosting approaches are expected to perform well.
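The following is a minimal sketch, not taken from the thesis, of why accuracy alone can mislead on unbalanced loan data and why precision, recall, and F1 score are reported alongside it. The portfolio size, default rate, and both sets of predictions are hypothetical numbers chosen for illustration, and `classification_metrics` is an illustrative helper rather than part of any library.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defaults correctly flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # repayments correctly passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed defaults

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical portfolio: 1,000 loans, of which 20 (2%) default.
y_true = [1] * 20 + [0] * 980

# Trivial baseline (assumption): predict "no default" for every customer.
always_repay = [0] * 1000

# Hypothetical detector (assumption): flags 15 of the 20 defaulters
# and raises 30 false alarms among the 980 non-defaulters.
detector = [1] * 15 + [0] * 5 + [1] * 30 + [0] * 950

baseline = classification_metrics(y_true, always_repay)
flagged = classification_metrics(y_true, detector)

for name, m in [("always_repay", baseline), ("detector", flagged)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

In this illustration the trivial baseline reaches higher accuracy than the detector while catching no defaults at all, which is the situation that recall and F1 score are meant to expose.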

Recommended Citation

Pirani, Rayhaan, "Anomaly Detection in Large Datasets: A Case Study in Loan Defaults" (2023). Electronic Theses and Dissertations. 9253.
https://scholar.uwindsor.ca/etd/9253

Included in

Computer Sciences Commons

Anomaly Detection in Large Datasets: A Case Study in Loan Defaults
