Date of Award


Publication Type

Master Thesis

Degree Name



Computer Science

First Advisor

Rueda, Luis

Second Advisor

Ngom, Alioune




Studying gene expression across different time intervals of breast cancer survival may provide insights into patient recovery. In this work, we propose a hierarchical clustering method that separates dissimilar groups of genes in time-series data, namely those with the furthest distances from the rest of the genes throughout different time intervals. The isolated outliers (genes that trend differently from other genes) can serve as potential biomarkers of breast cancer survivability. We partition the time axis (time points) into six-month bins, from months 1-6 up to months 337-342, and, for each gene, we average its expression level over all patients who appear in a survival bin. Gene expression levels across those time points are interpolated with cubic splines to create a trend profile for each gene. First, we universally align the gene expression profiles to minimize the total area between them. Then, we cluster them using a sliding-window approach and hierarchical clustering based on minimum vertical distances. To the best of our knowledge, this is the first time-series model built on the survival time of patients after treatment. With this approach, we identified 46 genes (including 24 oncogenes and 18 tumor suppressor genes) as potential biomarkers of breast cancer survivability.
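The binning and averaging step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`bin_index`, `bin_label`, `average_by_bin`), the input layout, and the toy gene names and values are all assumptions made for illustration. It assumes survival time is given in whole months starting at 1, so that months 1-6 fall in the first bin and months 337-342 in the last.

```python
from collections import defaultdict

BIN_MONTHS = 6  # bin length stated in the abstract


def bin_index(survival_month):
    """Map a 1-based survival month to a 0-based six-month bin (1-6 -> 0, 7-12 -> 1, ...)."""
    if survival_month < 1:
        raise ValueError("survival month must be >= 1")
    return (survival_month - 1) // BIN_MONTHS


def bin_label(index):
    """Month range covered by a bin, e.g. 0 -> '1-6'."""
    start = index * BIN_MONTHS + 1
    return f"{start}-{start + BIN_MONTHS - 1}"


def average_by_bin(patients):
    """Average each gene's expression over all patients that fall in the same bin.

    `patients` is a list of (survival_month, {gene: expression}) pairs
    (a hypothetical layout). Returns {gene: {bin_index: mean_expression}};
    bins with no patients are simply absent from the result.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for month, expression in patients:
        b = bin_index(month)
        for gene, value in expression.items():
            sums[gene][b] += value
            counts[gene][b] += 1
    return {
        gene: {b: sums[gene][b] / counts[gene][b] for b in sorted(sums[gene])}
        for gene in sums
    }


# Toy data, invented for illustration only.
toy_patients = [
    (3, {"GENE_A": 2.0, "GENE_B": 5.0}),
    (6, {"GENE_A": 4.0, "GENE_B": 7.0}),
    (7, {"GENE_A": 1.0, "GENE_B": 9.0}),
]
profiles = average_by_bin(toy_patients)
```

Each gene's per-bin means would then be the knots for the cubic spline interpolation mentioned above (for instance with `scipy.interpolate.CubicSpline`, if SciPy is the tool of choice). The alignment and sliding-window clustering steps are not sketched here because the abstract does not specify them in enough detail.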