Date of Award

3-10-2021

Publication Type

Master's Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

Machine Learning, Motif Discovery, Neural Network, Proteins, Short Linear Motif

Supervisor

Luis Rueda

Supervisor

Alioune Ngom

Rights

info:eu-repo/semantics/openAccess

Abstract

With the increasing quantity of biological data, it is important to develop algorithms that can quickly find patterns in large databases of DNA, RNA and protein sequences. Previous research has been very successful at applying deep learning methods to motif detection and to the classification of biological sequences. These approaches, however, have limitations. Most can only find motifs of a single length. In addition, most research has focused on DNA and RNA, both of which use a four-letter alphabet; only a few studies have attempted to apply deep learning methods to the larger, twenty-letter alphabet of proteins. We present an enhanced deep learning model, called DeePSLiM, capable of detecting predictive short linear motifs (SLiMs) in protein sequences. The model is a shallow network that can be trained quickly on large amounts of data. The SLiMs are predictive because they can be used to classify the sequences into their respective families. The model reached 94.5% on accuracy, precision, recall, F1-score and Matthews correlation coefficient, as well as 99.9% area under the receiver operating characteristic curve (AUROC).
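
To make the idea of scanning a twenty-letter alphabet for motifs of more than one length concrete, here is a minimal Python sketch. It is not the DeePSLiM architecture, which the abstract does not specify; it only assumes the common approach of one-hot encoding each residue and sliding position-specific filters of different widths along the sequence. The names `one_hot`, `scan` and `best_hits`, the example sequence, and the hand-built filters are all hypothetical. In a trained network the filter weights would be learned rather than written by hand.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode a protein sequence as a (length, 20) binary matrix."""
    encoded = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        if aa in INDEX:  # non-standard residues stay all-zero
            encoded[pos, INDEX[aa]] = 1.0
    return encoded


def scan(encoded, motif_filter):
    """Slide a (width, 20) filter along the sequence, one score per window."""
    width = motif_filter.shape[0]
    n_windows = encoded.shape[0] - width + 1
    if n_windows <= 0:  # sequence shorter than the filter
        return np.array([])
    return np.array(
        [np.sum(encoded[i:i + width] * motif_filter) for i in range(n_windows)]
    )


def best_hits(sequence, filters):
    """For each filter, return (best score, start position of best window)."""
    encoded = one_hot(sequence)
    hits = []
    for motif_filter in filters:
        scores = scan(encoded, motif_filter)
        if scores.size == 0:
            hits.append((0.0, -1))
        else:
            pos = int(np.argmax(scores))
            hits.append((float(scores[pos]), pos))
    return hits


# Hand-built filters of width 3 and width 4 (assumption: illustrative only).
filters = [one_hot("RGD"), one_hot("KDEL")]
print(best_hits("MARGDLLKDEL", filters))
```

Because each filter carries its own width, filters of several lengths can be applied to the same encoded sequence, and the per-filter maximum scores can then serve as features for a classifier that assigns the sequence to a family.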

Supplementary Material.pdf (2518 kB)
Supplementary Material
