Date of Award

9-20-2024

Publication Type

Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

Conversation Classification;Machine Learning;Natural Language Processing;Online Grooming Detection

Supervisor

Hossein Fani

Creative Commons License

Creative Commons Attribution 4.0 International License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Grooming minors for sexual exploitation has become an increasingly significant concern on online conversation platforms. To make the online experience safer for minors, researchers have proposed machine learning models that analyze explicit textual content to automate the detection of predatory conversations, enabling warnings to minors, parents, or law enforcement while preserving minors’ privacy. However, the proposed models often fall short in real-world applications because messages are short, noisy, and informal and, more importantly, because predatory conversations are sparsely distributed. In this research, we introduce Osprey, an open-source benchmark that facilitates standardized pipelines for online grooming detection. Osprey implements canonical neural models, vector representation learning, and novel features, including one-on-one interactions, message exchange patterns, and temporal signals. We extend Osprey to support backtranslation augmentation, a round-trip translation of original conversations via intermediary natural languages, which augments training datasets with additional predatory conversations. The modular design of Osprey allows further features to be incorporated as research needs evolve. In addition to this framework, we propose the use of recurrent models whose input is a sequence of feature vectors, each representing one message. This approach allows the model to incorporate not only the text embedding of a message but also other relevant information, such as the timestamp of the message, the number of participants, and the identity of the message sender. This formulation is particularly useful for the early detection of grooming conversations, as it enables the model to process messages incrementally, one at a time as they arrive. We evaluate the efficiency and effectiveness of our models using various metrics and present the results through comprehensive data visualization techniques.
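
The recurrent formulation described in the abstract can be illustrated with a minimal sketch. This is not Osprey's code or API: every name below (`Message`, `toy_text_embedding`, `message_features`, `TinyRecurrentScorer`, `score_incrementally`) is hypothetical, the text embedding is a toy hashed bag of words standing in for a learned one, and the recurrent cell is a plain Elman-style update with random, untrained weights, so the scores carry no meaning. The sketch only shows how each message could become one feature vector (text embedding, time gap, sender identity, participant count) and how a score could be produced after every message.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    timestamp: float  # seconds since the conversation started (assumed unit)
    text: str


def toy_text_embedding(text, dim=4):
    """Hypothetical stand-in for a learned embedding: normalized hashed bag of words."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[sum(ord(c) for c in token) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def message_features(msg, prev_timestamp, first_sender, n_participants, dim=4):
    """One feature vector per message: text embedding plus three extra signals."""
    gap = max(msg.timestamp - prev_timestamp, 0.0)
    return toy_text_embedding(msg.text, dim) + [
        math.log1p(gap),                               # temporal signal
        1.0 if msg.sender == first_sender else 0.0,    # sender identity
        float(n_participants),                         # participants seen so far
    ]


class TinyRecurrentScorer:
    """Elman-style recurrent cell with random (untrained) weights, for illustration."""

    def __init__(self, input_dim, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_dim)]
                     for _ in range(hidden_dim)]
        self.w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
                      for _ in range(hidden_dim)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
        self.hidden = [0.0] * hidden_dim

    def step(self, x):
        """Consume one message vector, update the hidden state, return a score in (0, 1)."""
        self.hidden = [
            math.tanh(sum(w * v for w, v in zip(row_in, x))
                      + sum(w * h for w, h in zip(row_rec, self.hidden)))
            for row_in, row_rec in zip(self.w_in, self.w_rec)
        ]
        logit = sum(w * h for w, h in zip(self.w_out, self.hidden))
        return 1.0 / (1.0 + math.exp(-logit))


def score_incrementally(messages, dim=4):
    """Return one score per message, each using only the messages seen so far."""
    scorer = TinyRecurrentScorer(input_dim=dim + 3)
    seen, scores = set(), []
    prev = messages[0].timestamp if messages else 0.0
    for msg in messages:
        seen.add(msg.sender)
        x = message_features(msg, prev, messages[0].sender, len(seen), dim)
        scores.append(scorer.step(x))
        prev = msg.timestamp
    return scores


conversation = [
    Message("a", 0.0, "hi there"),
    Message("b", 12.0, "hello who is this"),
    Message("a", 30.0, "just someone from the chat room"),
]
scores = score_incrementally(conversation)
```

Because the hidden state depends only on earlier messages, the scores for a conversation prefix match the first scores of the full conversation, which is the property that makes early detection possible in this formulation.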

Available for download on Friday, September 19, 2025
