Machine Learning Approaches for Healthcare Analysis
Abstract
Machine learning (ML) is a branch of artificial intelligence that enables computers to discover hard-to-distinguish patterns in large or complex data sets and to learn from previous cases, using a range of statistical, probabilistic, data-processing, and optimization methods. ML now plays a vital role in many fields, such as finance, self-driving cars, image processing, medicine, and speech recognition. In healthcare, ML has been applied to the detection, prognosis, diagnosis, and treatment of diseases because of its capability to handle large data. Moreover, ML can predict disease by uncovering patterns in medical datasets. Machine learning and deep learning are better suited to analyzing medical datasets than traditional methods because these datasets are mostly large, complex, and heterogeneous, come from different sources, and therefore require more efficient computational techniques.
This dissertation presents several machine-learning techniques to tackle medical problems such as data imbalance, tumor-stage classification and upgrading prediction, and multi-omics integration. In the second chapter, we introduce a novel method to handle class imbalance, a common issue in bioinformatics datasets. In class-imbalanced data, the number of samples in each class is unequal. Since most data sets contrast usual and unusual cases, e.g., cancer versus normal or miRNAs versus other noncoding RNAs, the minority class, which has the fewest samples, is the class of interest that contains the unusual cases. Learning models based on standard classifiers, such as the support vector machine (SVM), random forest, and k-nearest neighbors (k-NN), are usually biased towards the majority class, which means that the classifier is likely to misclassify samples from the class of interest. Handling class-imbalanced datasets has therefore attracted growing research interest.
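As a small illustration of why majority-class bias matters, the sketch below scores a trivial predictor that always outputs the majority class. The toy labels, the function names, and the use of the geometric mean of sensitivity and specificity (G-mean) as a summary are assumptions made for illustration; they are not taken from the dissertation's experiments.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def summarize(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity, specificity, and their geometric mean."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": sqrt(sensitivity * specificity),
    }


# Hypothetical toy data: 95 majority samples (0) and 5 minority samples (1).
y_true = [0] * 95 + [1] * 5
# A predictor that ignores the minority class entirely.
always_majority = [0] * 100

report = summarize(y_true, always_majority)
print(report)
```

On this toy data the accuracy is high even though no minority sample is recovered, while the G-mean drops to zero. This is why a measure that accounts for both classes is more informative than accuracy alone for imbalanced data.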
To handle class-imbalanced data, we propose BCECSC-RF, which combines proper feature selection, a cost-sensitive classifier, and ensembling based on the random forest method. Random class-balanced ensembles are built individually, and each ensemble is then used as a training pool to classify the remaining out-of-bag samples. Samples in each ensemble are classified using a class-sensitive classifier incorporating random forest. Each sample is assigned the class that receives the most votes across all of its appearances in the formed ensembles. A set of performance measures, including a geometric measure, suggests that the model can improve the classification of minority-class samples.
In the third chapter, we introduce a novel study that uses clinical features to predict upgrading of the Gleason score on confirmatory magnetic resonance imaging-guided targeted biopsy (MRI-TB) of the prostate in candidates for active surveillance. MRI of the prostate is not accessible to many patients because of difficulty contacting patients and insurance denials, and African-American patients are disproportionately affected by barriers to prostate MRI during active surveillance [6,7]. Modeling clinical variables with advanced methods, such as machine learning, could allow us to manage patients in resource-limited environments with limited technological access. Upgrading to significant prostate cancer on MRI-TB was defined as upgrading to G 3+4 (definition 1, DF1) or 4+3 (definition 2, DF2). The AdaBoost model was highly predictive of upgrading under DF1 (AUC 0.952), while for DF2 the random forest model had a slightly lower but still excellent prediction performance (AUC 0.947).
In the fourth chapter, we introduce a method for integrating multi-omics data for biomedical applications, including disease prediction, disease subtyping, and biomarker prediction, among others.
Integrating multi-omics data provides richer insight than analyzing each omics type separately. Our method combines a gene similarity network (GSN) based on Uniform Manifold Approximation and Projection (UMAP) with convolutional neural networks (CNNs). The method uses UMAP to embed gene expression, DNA methylation, and copy number alteration (CNA) data into a lower dimension, creating two-dimensional RGB images. Gene expression is used as a reference to construct the GSN, and the other omics data are then integrated with the gene expression for better prediction. We used CNNs to predict the Gleason score levels of prostate cancer patients and the tumor stage of breast cancer patients. The results show that UMAP, as an embedding technique, integrates multi-omics maps into the prediction model better than the self-organizing map (SOM).
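The step of turning three omics layers into an RGB image can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: each gene is assumed to already have a two-dimensional coordinate (for example, from a UMAP embedding of gene expression), the three omics values of one sample are mapped to the red, green, and blue channels, and values are min-max scaled per channel. The function names, the grid size, and the toy numbers are hypothetical.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1]; constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def genes_to_rgb_image(coords, expression, methylation, cna, size=4):
    """Rasterize one sample's omics values onto a size x size RGB grid.

    coords: one (x, y) pair per gene, assumed to come from a 2D embedding.
    expression, methylation, cna: one value per gene for a single sample.
    """
    xs = min_max_scale([c[0] for c in coords])
    ys = min_max_scale([c[1] for c in coords])
    # Assumed channel mapping: R = expression, G = methylation, B = CNA.
    channels = [
        min_max_scale(expression),
        min_max_scale(methylation),
        min_max_scale(cna),
    ]
    image = [[[0.0, 0.0, 0.0] for _ in range(size)] for _ in range(size)]
    counts = [[0] * size for _ in range(size)]
    for i, (x, y) in enumerate(zip(xs, ys)):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        for ch in range(3):
            image[row][col][ch] += channels[ch][i]
        counts[row][col] += 1
    # Average the genes that fall on the same pixel.
    for r in range(size):
        for c in range(size):
            if counts[r][c]:
                image[r][c] = [v / counts[r][c] for v in image[r][c]]
    return image


# Hypothetical toy sample with four genes placed at the corners of the map.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
image = genes_to_rgb_image(
    coords,
    expression=[1.0, 2.0, 3.0, 4.0],
    methylation=[0.1, 0.9, 0.5, 0.5],
    cna=[-1.0, 0.0, 1.0, 0.0],
    size=2,
)
for row in image:
    print(row)
```

Because the gene layout is fixed across samples, every sample yields an image with the same spatial arrangement, which is what allows a CNN to be trained on the resulting images.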