Date of Award

6-14-2024

Publication Type

Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

Artificial Intelligence;Convolutional Neural Network;Long Short-Term Memory;Natural Language Processing;Plagiarism Detection;Recurrent Neural Network

Supervisor

Dan Wu

Abstract

This thesis presents the development and evaluation of a plagiarism detection system for source code. The system combines Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks, and a Code Pre-trained Model (CodePTM) to detect multiple levels of plagiarism in C programming submissions. The goal is to improve detection accuracy across different types of plagiarism, from simple copy-paste to sophisticated structural and semantic plagiarism that traditional methods struggle to detect. The methodology begins with a preprocessing stage comprising tokenization, generation of Abstract Syntax Trees (ASTs), and transformation into vectors that capture lexical, structural, and semantic features of the code. These features are then analyzed in a multistage feature extraction process that uses CNNs to discern structural patterns, LSTMs to capture contextual dependencies, and the CodePTM to model deeper semantic relationships. The results show that the integrated approach substantially improves detection across a wide range of plagiarism activities, achieving high precision and recall across six defined levels of plagiarism. In particular, the system performs well on complex cases involving significant modifications to the code's structure or semantics, which are traditionally the hardest to detect. The research offers a more effective tool for upholding integrity and originality in academic programming assignments, with potential applications in legal and professional domains where code originality is critical. Future work can extend the model to other programming languages and explore real-time detection.
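
As an illustration of the tokenization step in the preprocessing stage, the following is a minimal sketch, not the thesis's implementation. It assumes a simplified regular-expression lexer for C and a hypothetical normalization scheme in which every identifier becomes `ID` and every literal becomes `NUM`, `STR`, or `CHAR`; the names `tokenize`, `encode`, and `C_KEYWORDS` are invented for this example. It shows how a lexical feature vector can be made insensitive to renamed variables and edited comments.

```python
import re

# Assumption: a small subset of C keywords is enough for this illustration.
C_KEYWORDS = {
    "int", "char", "float", "double", "void", "long", "short", "unsigned",
    "if", "else", "for", "while", "do", "return", "break", "continue",
    "switch", "case", "default", "struct", "typedef", "const", "static",
}

# Simplified C lexer. Characters that match no alternative (for example '#')
# are silently skipped by finditer; a real lexer would handle them.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>//[^\n]*|/\*.*?\*/)
  | (?P<STR>"(?:\\.|[^"\\])*")
  | (?P<CHAR>'(?:\\.|[^'\\])*')
  | (?P<NUM>\d+(?:\.\d+)?)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<OP>==|!=|<=|>=|&&|\|\||\+\+|--|->|[-+*/%=<>!&|^~?:;,.(){}\[\]])
  | (?P<WS>\s+)
""", re.VERBOSE | re.DOTALL)


def tokenize(source):
    """Return a normalized token list for a C source string."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind in ("WS", "COMMENT"):
            continue  # whitespace and comments carry no lexical signal here
        if kind == "NAME":
            # Keywords are kept; all other names collapse to a placeholder.
            tokens.append(text if text in C_KEYWORDS else "ID")
        elif kind in ("NUM", "STR", "CHAR"):
            tokens.append(kind)  # literals collapse to their kind
        else:
            tokens.append(text)  # operators and punctuation are kept as-is
    return tokens


def encode(tokens, vocab):
    """Map tokens to integer IDs, growing the vocabulary as needed."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]


original = "int sum(int a, int b) { return a + b; }"
renamed = "int add(int x, int y) { /* renamed */ return x + y; }"

vocab = {}
vec_original = encode(tokenize(original), vocab)
vec_renamed = encode(tokenize(renamed), vocab)
print(vec_original == vec_renamed)
```

The two snippets differ only in identifier names and a comment, so their encoded sequences are identical. Edits that change structure or semantics would not be caught by this lexical view alone, which is the gap the AST-based and learned features described in the abstract are intended to cover.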
