Date of Award


Publication Type

Master Thesis

Degree Name



Computer Science

First Advisor

Ngom, Alioune

Second Advisor

Rueda, Luis


Keywords

Classification, Motifs, mRmR Feature selection, PPIs, Prediction of High-Throughput Protein-Protein Interactions, Short Linear Motifs




Wet-lab experimental methods for identifying protein-protein interactions (PPIs), a central problem in biology, are labor-intensive and costly, and usually suffer from high false-negative and false-positive rates [20]. Computational methods have therefore been used extensively as faster, less expensive, and more accurate alternatives [1]. Among the computational approaches for predicting PPIs, methods based on protein sequence information are the most common [16]. Although such methods need no knowledge of the proteins beyond the amino acids in their sequences, they have shown promise in predicting PPIs [16]. In essence, these methods look for patterns spread across the sequences of interacting and non-interacting proteins, take those patterns as features, and use them to predict PPIs. Motifs, which are patterns of amino acids shared by a group of sequences [33], have recently been used for this purpose. Several algorithms and tools exist for obtaining motifs from protein sequences. However, most of them are limited in the size of the datasets they can handle, and they also depend on datasets of previously discovered motifs. One of the most popular algorithms capable of handling large datasets is Multiple EM for Motif Elicitation (MEME). Nevertheless, even for a powerful tool like MEME, finding a large number of motifs in such datasets is infeasible in terms of running time. We propose a new method that uses MEME to extract a large number of motifs from a large dataset in a reasonable amount of time. We tested the method on a PPI dataset of 5000 protein sequence pairs (2500 positive and 2500 negative) to obtain 5000 motifs. We then used the extracted motifs as features to represent the PPI dataset. Finally, we classified the dataset with several well-known machine learning classifiers, including K-nearest neighbour (K-NN), Random Forest, and Support Vector Machine (SVM). The results show an accuracy above 93% and indicate that the proposed method for finding motifs in large datasets is effective and can be applied to the prediction of PPIs.
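
The last two steps of the abstract (representing protein pairs by motifs, then classifying them) can be illustrated with a minimal Python sketch. It is not the thesis implementation. The motif patterns, the toy sequences, the concatenated binary encoding, the Hamming distance, and the function names `motif_features` and `knn_predict` are all assumptions made for illustration. The motifs are written as regular expressions rather than taken from MEME output, and K-NN is chosen only because it is the simplest of the classifiers named above.

```python
import re
from collections import Counter


def motif_features(seq_a, seq_b, motifs):
    """Encode a protein pair as a binary vector (assumed encoding).

    The first len(motifs) entries mark which motifs occur in seq_a,
    the next len(motifs) entries mark which occur in seq_b.
    """
    compiled = [re.compile(m) for m in motifs]
    in_a = [1 if p.search(seq_a) else 0 for p in compiled]
    in_b = [1 if p.search(seq_b) else 0 for p in compiled]
    return in_a + in_b


def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training vectors closest to x."""
    ranked = sorted(zip(train_X, train_y), key=lambda t: hamming(t[0], x))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical motifs and invented toy sequences, not data from the thesis.
    motifs = ["P..P", "RGD"]
    pairs = [
        ("AAPLLPAA", "GGRGDGG", 1),  # labeled interacting (toy label)
        ("APAAPKK", "KRGDKPLLP", 1),
        ("AAAAAAAA", "GGGGGGG", 0),  # labeled non-interacting (toy label)
        ("KKRGDKK", "LLLLLLL", 0),
    ]
    X = [motif_features(a, b, motifs) for a, b, _ in pairs]
    y = [label for _, _, label in pairs]
    query = motif_features("MMPVVPMM", "TTRGDTT", motifs)
    print(knn_predict(X, y, query, k=3))
```

With 5000 motifs, the same encoding would give each pair a 10000-dimensional vector, which is one reason a feature selection step such as mRmR, listed in the keywords, is often applied before classification.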