Date of Award

12-19-2023

Publication Type

Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

Language Model;Natural Language Processing;Privacy;Unlearning

Supervisor

Sherif Saad

Rights

info:eu-repo/semantics/openAccess

Creative Commons License

Creative Commons Attribution 4.0 International License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Deep learning models have recently achieved remarkable progress in Natural Language Processing (NLP), specifically in classification, question-answering, and machine translation. However, NLP models face challenges related to security and privacy. On the security side, even small perturbations in the input can significantly impact a model's prediction. This highlights the importance of generating natural adversarial attacks to analyze the weaknesses of NLP models and bolster their robustness through adversarial training (AT). On the privacy side, Large Language Models (LLMs) are trained on vast amounts of data, which may include sensitive information; LLMs can memorize portions of their training data and reproduce them verbatim when prompted by adversaries, posing a risk to personal privacy. To address these limitations, we explore the potential of reinforcement learning (RL) based methods to tackle these issues and overcome the shortcomings of the existing literature. RL excels at achieving specific objectives guided by a reward function. To this end, we introduce an end-to-end framework that employs a proximal policy gradient, a reinforcement learning approach, to learn a self-learned policy directed by the chosen reward function, with the language model (LM) taking on the role of the policy learner. For adversarial attacks, we use a reward that combines the mutual implication score with the negative likelihood that the victim classifier assigns to the generated samples. This approach allows us to craft samples that confuse the classifier while preserving their semantic meaning. To address memorization, we use the negative of the BERTScore similarity to learn a "Dememorization Privacy Policy" that effectively mitigates the risks associated with memorization. Our findings indicate that our framework improves the performance of the vanilla classifier by 2% when generating adversarial attacks, and reduces LM memorization by 34% to mitigate privacy risks while maintaining the general LM performance.
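The two reward signals described above can be sketched roughly as follows. This is an illustrative Python sketch, not the thesis's implementation: the mutual_implication score, the victim_logits tensor, and the bert_scorer object (assumed to behave like bert_score.BERTScorer) are placeholder inputs supplied by the caller.

import torch.nn.functional as F

def adversarial_reward(mutual_implication, victim_logits, true_label):
    # Reward for the attack policy: keep the generated sample semantically
    # equivalent to the original (high mutual implication) while lowering
    # the likelihood the victim classifier assigns to the true label.
    true_prob = F.softmax(victim_logits, dim=-1)[true_label]
    return mutual_implication - true_prob.item()

def dememorization_reward(continuation, memorized_suffix, bert_scorer):
    # Privacy reward: the negative BERTScore similarity between the LM's
    # continuation and the memorized training suffix, pushing the policy
    # away from verbatim reproduction.
    _, _, f1 = bert_scorer.score([continuation], [memorized_suffix])
    return -f1.item()

In a proximal-policy-style training loop, such scalar rewards would be computed for each generated sample and used to update the LM policy.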
