Date of Award

2017

Publication Type

Master Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

breast cancer, gene-expression, microarray, network-based

Supervisor

Rueda, Luis

Supervisor

Ngom, Alioune

Rights

info:eu-repo/semantics/openAccess

Abstract

One of the key challenges of breast cancer research is to predict whether a patient diagnosed with a specific subtype or treated with a specific therapy will survive. Current studies find small subsets of gene biomarkers that can accurately predict the response to therapy. In these studies, the selected genes are not necessarily functionally related, and hence they may not correctly indicate the molecular mechanism behind breast cancer survivability. Also, several studies have shown that there is very low overlap between the biomarker subsets reported for the same cancer. To improve the robustness of classification performance and the stability of detected biomarkers, recent methods incorporate existing knowledge of relations between genes into the classifier by aggregating functionally related genes to produce discriminative gene subnetworks called network-biomarkers. In this paper, given a breast cancer dataset of patients with different subtypes treated with a given therapy drug, we devised a network-based machine learning approach that integrates a protein-protein interaction (PPI) network with gene expression data (1) to identify the network-biomarkers of breast cancer survivability (a) based on subtype and (b) based on therapy, and (2) to predict the survivability of breast cancer patients (a) based on subtype and (b) treated with a therapy drug. We used the concept of a seed gene to identify network-biomarkers at distances 2, 3 and 4 from the seed gene protein, and found that distances 3 and 4 give the best results for identifying the survivability of breast cancer patients based on subtype and therapy, respectively. To solve the class imbalance problem in some subtypes, we implemented ADASYN. We obtained the best classification performance using random forest, where the geometric mean, F1-measure and accuracy are respectively 0.867, 0.850 and 87.00% for the subtype-specific study, and 0.829, 0.807 and 83.77% for the therapy-specific study.
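
As a rough illustration of the seed-gene idea, the sketch below collects every gene within a fixed number of hops of a seed in a PPI graph using breadth-first search. It is not the thesis's actual procedure, which the abstract does not spell out: the assumption here is that "distance" means shortest-path hop count in an undirected PPI network, and the function name `genes_within_distance`, the edge list, and the gene labels are all hypothetical.

```python
from collections import deque


def genes_within_distance(ppi_edges, seed, max_dist):
    """Return {gene: hop distance} for genes at most max_dist hops from seed.

    ppi_edges is an iterable of (gene_a, gene_b) pairs, treated as undirected.
    Assumption: distance is the shortest-path hop count in the PPI graph.
    """
    # Build an undirected adjacency map from the edge list.
    adjacency = {}
    for a, b in ppi_edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Breadth-first search outward from the seed, stopping at max_dist.
    distances = {seed: 0}
    queue = deque([seed])
    while queue:
        gene = queue.popleft()
        if distances[gene] == max_dist:
            continue  # do not expand beyond the distance limit
        for neighbor in adjacency.get(gene, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[gene] + 1
                queue.append(neighbor)
    return distances


# Toy, made-up interaction list (labels are placeholders, not real genes).
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "F")]
subnetwork = genes_within_distance(toy_edges, seed="A", max_dist=3)
```

With `max_dist=3` the toy call keeps genes A, B, C, D and F and leaves out E, which is four hops from the seed. Under this reading, the candidate subnetworks for distances 2, 3 and 4 would come from repeating the call with each limit before scoring them with a classifier.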
