Date of Award
6-14-2024
Publication Type
Thesis
Degree Name
M.Sc.
Department
Computer Science
Keywords
Artificial Intelligence; Convolutional Neural Network; Long Short-Term Memory; Natural Language Processing; Plagiarism Detection; Recurrent Neural Network
Supervisor
Dan Wu
Abstract
This thesis presents the development and evaluation of a plagiarism detection system for source code. The system combines Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and a Code Pre-trained Model (CodePTM) to detect several levels of plagiarism in C programming submissions. The goal is to improve detection accuracy across different types of plagiarism, from simple copy-paste to more sophisticated structural and semantic plagiarism that is difficult to detect with traditional methods. The methodology includes a preprocessing stage comprising tokenization, generation of Abstract Syntax Trees (ASTs), and transformation into vectors that capture lexical, structural, and semantic features of the code. These features are then analyzed in a multistage feature extraction process that uses CNNs to discern structural patterns, LSTMs to capture contextual dependencies, and the CodePTM to model deeper semantic relationships. Our results show that the integrated approach improves detection of a wide range of plagiarism, achieving high precision and recall across six defined levels of plagiarism. In particular, the system performs well on complex cases that involve significant modifications to the code's structure or semantics, which are traditionally the most challenging to detect. This research offers a more effective tool for academic settings to uphold integrity and originality in programming assignments, and it has potential applications in legal and professional domains where code originality is critical. Future work can extend the model to other programming languages and explore real-time detection, further broadening the applicability of this research.
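As a rough illustration of the first preprocessing steps named in the abstract (tokenization, then transformation into vectors), the sketch below splits a C statement into lexical tokens and maps each token to an integer id. It is not the thesis's implementation: the regular expression, the reduced keyword set, and the names `tokenize` and `encode` are assumptions made for this example, and a real pipeline would also need AST generation and a full C lexer.

```python
import re

# Hypothetical, heavily reduced keyword set; real C has many more keywords.
C_KEYWORDS = {"int", "char", "float", "double", "void",
              "if", "else", "for", "while", "return"}

# Simplified token pattern (an assumption for this sketch): numbers,
# identifiers, and a handful of operators and punctuation marks.
TOKEN_RE = re.compile(r"""
    (?P<number>\d+(?:\.\d+)?)
  | (?P<ident>[A-Za-z_]\w*)
  | (?P<op>==|!=|<=|>=|\+\+|--|&&|\|\||[-+*/%=<>!&|;,(){}\[\]])
""", re.VERBOSE)


def tokenize(source):
    """Split C source text into (kind, text) pairs.

    Whitespace and any character the pattern does not cover are skipped.
    """
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "ident" and text in C_KEYWORDS:
            kind = "keyword"
        tokens.append((kind, text))
    return tokens


def encode(tokens, vocab):
    """Map each token's text to an integer id, extending vocab on first sight."""
    return [vocab.setdefault(text, len(vocab)) for _, text in tokens]


vocab = {}
first = encode(tokenize("int a = b + 1;"), vocab)
second = encode(tokenize("int b = a + 1;"), vocab)
print(first)   # ids for the first statement
print(second)  # same ids, with the two identifiers swapped
```

Integer sequences of this kind are one common input form for sequence models such as LSTMs; how the thesis actually builds its lexical, structural, and semantic vectors is described in the full text, not here.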
Recommended Citation
Surendran, Sumisha, "Plagiarism Detection in Source Code using Machine Learning" (2024). Electronic Theses and Dissertations. 9498.
https://scholar.uwindsor.ca/etd/9498