Date of Award
12-19-2023
Publication Type
Thesis
Degree Name
M.Sc.
Department
Computer Science
Keywords
Language Model;Natural Language Processing;Privacy;Unlearning
Supervisor
Sherif Saad
Rights
info:eu-repo/semantics/openAccess
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Deep learning models have recently achieved remarkable progress in Natural Language Processing (NLP), particularly in classification, question answering, and machine translation. However, NLP models face challenges related to security and privacy. On the security side, even small perturbations of the input can significantly change a model's prediction, which highlights the importance of generating natural adversarial attacks to analyze the weaknesses of NLP models and to strengthen their robustness through adversarial training (AT). On the privacy side, Large Language Models (LLMs) are trained on vast amounts of data that may include sensitive information; LLMs can memorize portions of their training data and reproduce them verbatim when prompted by adversaries, posing a risk to personal privacy. To address these limitations, we investigate reinforcement learning (RL) based methods for tackling both problems and overcoming the shortcomings of the existing literature. RL excels at achieving specific objectives guided by a reward function. Accordingly, we introduce an end-to-end framework that employs a proximal policy gradient, a reinforcement learning approach, to learn a self-learned policy directed by the chosen reward function, with the language model (LM) taking on the role of the policy learner. For adversarial attacks, the reward combines the mutual implication score with the negative likelihood that the victim classifier assigns to the generated samples, which allows us to craft samples that confuse the classifier while preserving their semantic meaning. For memorization, we use a negative similarity function based on BERTScore to learn a "Dememorization Privacy Policy" that effectively mitigates the risks associated with memorization. Our findings indicate that the framework improves the performance of the vanilla classifier by 2% when generating adversarial attacks and reduces LM memorization by 34%, mitigating privacy risks while maintaining general LM performance.
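To make the reward design concrete, below is a minimal sketch, not taken from the thesis, of how the two reward signals described in the abstract could be computed before driving a PPO-style policy update. The function names adversarial_reward, dememorization_reward, and _placeholder_similarity are hypothetical, and a simple string-overlap ratio stands in for the learned mutual-implication and BERTScore metrics.

import difflib

def _placeholder_similarity(a: str, b: str) -> float:
    """Placeholder for a learned metric (mutual implication / BERTScore).
    A simple character-overlap ratio in [0, 1] so the sketch runs end to end."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def adversarial_reward(original: str, rewrite: str, victim_true_label_prob: float) -> float:
    """Reward for the attack policy: preserve meaning while confusing the victim.
    victim_true_label_prob is the victim classifier's probability of the correct
    label on the rewrite; subtracting it acts as the negative-likelihood term."""
    semantic_term = _placeholder_similarity(original, rewrite)  # mutual-implication stand-in
    return semantic_term - victim_true_label_prob

def dememorization_reward(generated: str, memorized_suffix: str) -> float:
    """Reward for the privacy policy: the negative similarity (BERTScore in the
    thesis) between the model's continuation and the memorized training text."""
    return -_placeholder_similarity(generated, memorized_suffix)

# Usage: each reward would score sampled generations and guide the LM policy update.
print(adversarial_reward("the film was great", "the movie was great", victim_true_label_prob=0.4))
print(dememorization_reward("call me at 555-0100", "call me at 555-0100"))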
Recommended Citation
Kassem, Aly, "Mitigating The Shortcomings of Language Models: Strategies For Handling Memorization & Adversarial Attacks" (2023). Electronic Theses and Dissertations. 9187.
https://scholar.uwindsor.ca/etd/9187