Date of Award

Fall 2021

Publication Type

Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

Cancer subtype classification, Convolutional neural networks, Precision medicine, RNA Seq

Supervisor

J. Chen

Supervisor

A. Biniaz

Rights

info:eu-repo/semantics/openAccess

Creative Commons License

Creative Commons Attribution 4.0 International License

Abstract

The introduction of genetic testing has profoundly enhanced the prospects of early disease detection and the development of precision medicine. Subtyping of critical diseases has proven to be an essential part of developing individualized therapies and has led to deeper insights into disease heterogeneity. Studies suggest that variants in particular genes have significant effects on certain types of immune system cells and are also involved in the risk of critical illnesses such as cancer. By analyzing a patient's genetic sequence, disease types and subtypes can be predicted. Recent research has shown that the prediction quality of convolutional neural networks (CNNs) on gene intensity features in this context can be improved when the input is structured into 2D images.
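
As an illustration of the idea of structuring an expression profile into an image, the following NumPy sketch reshapes an ordered gene-expression vector into a square 2D array. The gene ordering (e.g., by chromosome location), the grid size, and the zero padding are illustrative assumptions, not the exact construction used in this thesis.

import numpy as np

def expression_to_image(expression: np.ndarray) -> np.ndarray:
    """Reshape an ordered gene-expression vector into a square 2D array."""
    n_genes = expression.shape[0]
    side = int(np.ceil(np.sqrt(n_genes)))      # smallest square grid that fits
    padded = np.zeros(side * side, dtype=expression.dtype)
    padded[:n_genes] = expression               # pad the tail with zeros
    return padded.reshape(side, side)

# Example: a random profile standing in for one RNA-Seq sample.
profile = np.random.rand(20000)
image = expression_to_image(profile)
print(image.shape)                              # (142, 142)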

Constructed from chromosome locations or from transformations such as kernel PCA (kPCA) and t-SNE, these two-dimensional images express certain types of relationships among the intensity features. While this approach extends the success of convolutional neural networks to non-image data, obtaining a mapping of features onto images that precisely reflects the relationships among the features is hard, if not impossible. To this end, we propose an enhancement that provides the CNN training procedure with not only the structured image samples but also the corresponding samples of unstructured raw gene expression data in their original form. The former is fed into the convolutional layers of the network, while the latter is input only to the fully connected layers. The proposed method is applied to The Cancer Genome Atlas (TCGA) cancer-subtype dataset, using the median expression levels of all expressed genes from RNA-Seq data. According to the experiments, the proposed approach improves classification accuracy by 2.7% when applied to the state-of-the-art method with a 2D CNN architecture trained on images constructed from the chromosome locations of the genes. When built on top of the method with a 2D CNN architecture trained on images constructed with a t-SNE transformation, classification accuracy is enhanced by 4.7%. When the proposed approach is implemented on a 1D CNN model with data structured by the covariance between features, classification accuracy improves by 1%, and an increase of 3% is observed when it is implemented over a 1D CNN model trained on data ordered by chromosome location.
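
The dual-input design described above can be sketched as follows in PyTorch: a convolutional branch processes the structured 2D image, and the raw expression vector is concatenated with the flattened convolutional features before the fully connected layers. Layer sizes, image dimensions, and the number of subtype classes are illustrative assumptions, not the thesis's exact architecture.

import torch
import torch.nn as nn

class DualInputCNN(nn.Module):
    def __init__(self, image_size=128, n_genes=20000, n_subtypes=33):
        super().__init__()
        # Convolutional branch for the structured 2D image input.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        conv_features = 32 * (image_size // 4) ** 2
        # Fully connected layers receive both the flattened conv features
        # and the raw gene-expression vector.
        self.fc = nn.Sequential(
            nn.Linear(conv_features + n_genes, 256), nn.ReLU(),
            nn.Linear(256, n_subtypes),
        )

    def forward(self, image, raw_expression):
        x = self.conv(image)                        # (batch, 32, H/4, W/4)
        x = x.flatten(start_dim=1)                  # flatten conv features
        x = torch.cat([x, raw_expression], dim=1)   # append raw expression
        return self.fc(x)                           # subtype logits

# Example forward pass with random data standing in for TCGA samples.
model = DualInputCNN()
image = torch.randn(4, 1, 128, 128)                 # structured 2D image input
raw = torch.randn(4, 20000)                         # unstructured expression vector
logits = model(image, raw)
print(logits.shape)                                  # torch.Size([4, 33])

Feeding the raw vector only at the fully connected stage lets the network use information that the 2D mapping may not preserve, while the convolutional branch still exploits whatever local structure the image construction captures.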
