Date of Award

7-6-2023

Publication Type

Thesis

Degree Name

M.Sc.

Department

Computer Science

Supervisor

Jessica Chen

Rights

info:eu-repo/semantics/openAccess

Creative Commons License

Creative Commons Attribution 4.0 International License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

With the introduction of the attention mechanism, Bidirectional Encoder Representations from Transformers (BERT) has greatly advanced the study of sequence-to-sequence tasks in Natural Language Processing (NLP). When task-specific annotations are limited, NLP tasks are commonly performed by pre-training a model with the transformer architecture on large-scale general corpora and then fine-tuning it on domain-specific data. Instead of using shallow neural components for fine-tuning, additional transformer layers can be introduced into the architecture. Recent research shows that, by resolving certain initialization and optimization issues, these augmented transformer layers can lead to performance gains despite the limited size of the available data, particularly for well-structured data. Along this direction, we will perform comprehensive experiments on the DT-Fixup algorithm, which is designed to mitigate these issues. To further improve the performance of DT-Fixup, we propose to study the applicability of power normalization and of the Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization (MADGRAD) in this setting. This is motivated by recent literature showing that power normalization, which stems from the batch normalization widely adopted in computer vision, outperforms the layer normalization usually found in transformers. Within the family of AdaGrad adaptive gradient methods, MADGRAD is a new optimization technique that performs exceptionally well on deep learning optimization problems from a variety of fields, including classification and image-to-image tasks in vision as well as recurrent and bidirectionally-masked models in natural language processing. Even on problems where adaptive methods typically perform poorly, MADGRAD matches or outperforms both SGD and Adam in test-set performance on each of these tasks. This research will be performed on the ReClor and LogiQA datasets, selected according to their structure.
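The sketch below is not the thesis code; it is a minimal illustration of the setup the abstract describes: a pre-trained BERT encoder augmented with extra transformer layers and fine-tuned with the MADGRAD optimizer in place of Adam. The model name, number of added layers, number of labels, and learning-rate settings are illustrative assumptions, and DT-Fixup's specific initialization and learning-rate scaling are omitted.

import torch
import torch.nn as nn
from transformers import AutoModel   # Hugging Face Transformers (assumed available)
from madgrad import MADGRAD          # facebookresearch/madgrad package (assumed available)

class AugmentedBert(nn.Module):
    def __init__(self, extra_layers: int = 4, num_labels: int = 4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        hidden = self.encoder.config.hidden_size
        # Additional transformer layers stacked on the pre-trained encoder;
        # these are the layers DT-Fixup is meant to initialize properly.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.extra = nn.TransformerEncoder(layer, num_layers=extra_layers)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden_states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        # Mask padding positions (attention_mask == 0) in the added layers.
        hidden_states = self.extra(hidden_states, src_key_padding_mask=~attention_mask.bool())
        return self.classifier(hidden_states[:, 0])   # classify from the [CLS] position

model = AugmentedBert()
# MADGRAD substituted for Adam; learning rate and weight decay are placeholders.
optimizer = MADGRAD(model.parameters(), lr=1e-5, momentum=0.9, weight_decay=0.0)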
