Date of Award


Publication Type

Doctoral Thesis

Degree Name



Computer Science

First Advisor

Luis Rueda


Artificial intelligence, Bioinformatics, Cancer biomarkers, Machine learning, Pattern recognition




Identifying biomarkers that can be used to classify disease stages or predict when a disease becomes more aggressive is one of the most important applications of machine learning. Next-generation sequencing (NGS) is a state-of-the-art method that enables fast sequencing of DNA or RNA samples. The output is usually a very large file of DNA or RNA base pairs. The generated data can be analyzed to obtain, for example, gene expression levels, chromosome counts, mutations in genes, and levels of copy number variation or alteration in specific genes. NGS is leading the way in exploring the human genome, enabling the future of personalized medicine. In this thesis, we demonstrate how machine learning can be used extensively to identify genes that predict prostate cancer stages with very high accuracy, using gene expression. We have also been successful in predicting the location of prostate tumors based on gene expression.
Traditional biomarker identification approaches typically use machine learning techniques to identify a number of genes and macromolecules as biomarkers that can diagnose specific diseases or disease states with very high accuracy, using molecular measurements such as mutations, gene expression, and copy number variations. However, expert opinion and knowledge are required to validate such findings. We therefore also introduce a new machine learning model that incorporates a knowledge-assisted system to integrate the findings of the DisGeNET database, a framework that contains proven relationships among diseases and genes. The machine learning pipeline starts by reducing the number of features using a filter-based feature selection method. The DisGeNET database is then used to score each gene by its relation to the given cancer. Finally, a wrapper-based feature selection algorithm picks the set of genes with the highest classification accuracy. The method has been able to retrieve key genes from multiple data sets that classify with very high accuracy while being biologically relevant, with no human intervention needed. Initial results show a high area under the curve with a handful of genes that are already proven to be related to the relevant disease and state, based on the latest published medical findings. The proposed method provides biomarkers that can be verified in wet-lab environments and then further analyzed and studied for diagnostic purposes.
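The three pipeline stages described above (filter, knowledge-based scoring, wrapper) can be illustrated with a minimal sketch. This is not the thesis implementation. The gene names, expression values, and knowledge scores are invented for illustration, the `knowledge` dictionary is only a stand-in for scores that would be looked up in DisGeNET, and the filter criterion, the nearest-centroid classifier, and the way accuracy and knowledge score are combined (`weight`) are all assumptions.

```python
from statistics import mean, pvariance


def filter_score(values, labels):
    """Filter criterion (assumed): class-mean difference scaled by overall spread."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    spread = pvariance(values) ** 0.5 or 1.0  # avoid dividing by zero
    return abs(mean(a) - mean(b)) / spread


def loo_accuracy(expr, labels, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier on `genes`."""
    n, correct = len(labels), 0
    for i in range(n):
        dist = {}
        for cls in set(labels):
            train = [j for j in range(n) if j != i and labels[j] == cls]
            dist[cls] = sum(
                (expr[g][i] - mean(expr[g][j] for j in train)) ** 2 for g in genes
            )
        correct += int(min(dist, key=dist.get) == labels[i])
    return correct / n


def objective(expr, labels, knowledge, genes, weight):
    """Assumed combination: accuracy plus weighted mean knowledge score."""
    relevance = mean(knowledge.get(g, 0.0) for g in genes)
    return loo_accuracy(expr, labels, genes) + weight * relevance


def select_genes(expr, labels, knowledge, keep=3, weight=0.5, max_genes=3):
    # Stage 1: filter. Keep the `keep` genes with the highest filter score.
    pool = sorted(expr, key=lambda g: filter_score(expr[g], labels), reverse=True)[:keep]
    # Stages 2 and 3: greedy forward wrapper guided by the knowledge scores.
    selected, best = [], float("-inf")
    while pool and len(selected) < max_genes:
        score, gene = max(
            (objective(expr, labels, knowledge, selected + [g], weight), g) for g in pool
        )
        if score <= best:  # stop when adding a gene no longer helps
            break
        selected.append(gene)
        pool.remove(gene)
        best = score
    return selected


# Hypothetical toy data: gene -> expression value per sample.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
expr = {
    "GENE_A": [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],
    "GENE_B": [1.1, 0.8, 1.0, 1.3, 2.8, 3.1, 3.3, 3.0],
    "GENE_C": [2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0],
    "GENE_D": [1.5, 2.5, 2.5, 1.5, 1.5, 2.5, 2.5, 1.5],
}
# Hypothetical stand-in for DisGeNET gene-disease association scores.
knowledge = {"GENE_A": 0.9, "GENE_B": 0.1}

selected = select_genes(expr, labels, knowledge)
print(selected)
```

In this toy example both GENE_A and GENE_B separate the two classes perfectly, so the knowledge score breaks the tie in favor of GENE_A, and the search stops because adding a second gene does not raise the combined objective.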