Date of Award
3-10-2021
Publication Type
Master Thesis
Degree Name
M.Sc.
Department
Computer Science
Keywords
Machine Learning, Motif Discovery, Neural Network, Proteins, Short Linear Motif
Supervisor
Luis Rueda
Supervisor
Alioune Ngom
Rights
info:eu-repo/semantics/openAccess
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Abstract
With the increasing quantity of biological data, it is important to develop algorithms that can quickly find patterns in large databases of DNA, RNA and protein sequences. Previous research has been very successful at applying deep learning methods to motif detection and to the classification of biological sequences. There are, however, limitations to these approaches. Most are limited to finding motifs of a single length. In addition, most research has focused on DNA and RNA, both of which use a four-letter alphabet. Only a few studies have attempted to apply deep learning methods to the larger, twenty-letter alphabet of proteins. We present an enhanced deep learning model, called DeePSLiM, capable of detecting predictive short linear motifs (SLiMs) in protein sequences. The model is a shallow network that can be trained quickly on large amounts of data. The SLiMs are predictive because they can be used to classify the sequences into their respective families. The model reached scores of 94.5% on accuracy, precision, recall, F1-score and Matthews correlation coefficient, as well as 99.9% area under the receiver operating characteristic curve (AUROC).
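The idea of scanning a protein sequence for motifs of several lengths can be illustrated with a minimal sketch. It is not the DeePSLiM architecture, which the abstract does not specify. It assumes a one-hot encoding over the twenty standard amino acids and hand-built filters of different widths, each followed by a maximum over positions; a trained model would learn its filter weights instead. The names `one_hot`, `motif_filter` and `scan`, the example sequence and the example motifs are all hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a protein string as a list of 20-dimensional one-hot rows."""
    rows = []
    for residue in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        row[INDEX[residue]] = 1.0  # KeyError on non-standard residues
        rows.append(row)
    return rows


def motif_filter(motif):
    """Hypothetical hand-built filter: weight 1.0 for the motif residue
    at each position. A trained network would learn these weights."""
    return one_hot(motif)


def scan(encoded, filt):
    """Slide one filter along the sequence (a 1-D convolution) and
    max-pool over positions. Returns (best_score, best_start)."""
    width = len(filt)
    best_score, best_start = float("-inf"), -1
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + k][j] * filt[k][j]
            for k in range(width)
            for j in range(len(AMINO_ACIDS))
        )
        if score > best_score:  # strict, so the first maximum wins
            best_score, best_start = score, start
    return best_score, best_start


# Illustrative sequence and motifs of three different lengths (made up).
sequence = "MKTAYIAKQRPLVELG"
encoded = one_hot(sequence)
filters = {m: motif_filter(m) for m in ("KQR", "PLVEL", "WWWW")}
hits = {m: scan(encoded, f) for m, f in filters.items()}

for motif, (score, start) in hits.items():
    print(f"{motif}: score={score} at position {start}")
```

Because each filter has its own width, motifs of different lengths are scored side by side. The pooled scores could then feed a classifier that assigns the sequence to a family.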
Recommended Citation
Filip, Alexandru, "DeePSLiM: A Deep Learning Approach to Identify Predictive Short-linear Motifs for Protein Sequence Classification" (2021). Electronic Theses and Dissertations. 8553.
https://scholar.uwindsor.ca/etd/8553