Extending the Work of DT-Fixup: Examining the Effects of PowerNorm and MADGRAD Optimization on DT-Fixup Performance

Prem Shankar Mohan, University of Windsor

Abstract

With the introduction of the attention mechanism, Bidirectional Encoder Representations from Transformers (BERT) has greatly advanced the study of sequence-to-sequence tasks in Natural Language Processing (NLP). When task-specific annotations are limited, NLP tasks are commonly tackled by pre-training a transformer-based model on large-scale general corpora and then fine-tuning it on domain-specific data. Instead of using shallow neural components for fine-tuning, additional transformer layers can be introduced into the architecture. Recent research shows that, by resolving certain initialization and optimization issues, these added transformer layers can lead to performance gains despite the limited size of the available data, particularly for well-structured data. Along this direction, we will perform comprehensive experiments on the DT-Fixup algorithm, which is designed to mitigate these issues. To explore possible performance improvements over DT-Fixup, we propose to study the applicability of power normalization and the Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization (MADGRAD) in this setting. This is motivated by recent literature showing that power normalization, which stems from the batch normalization widely adopted in computer vision, outperforms the layer normalization usually found in transformers. MADGRAD is a new optimization technique in the AdaGrad family of adaptive gradient methods that performs exceptionally well on deep learning optimization problems across a variety of fields, including classification and image-to-image tasks in vision as well as recurrent and bidirectionally-masked models in natural language processing. Even on problems where adaptive methods typically perform poorly, MADGRAD matches or outperforms both SGD and Adam in test-set performance on each of these tasks. This research will be conducted on the ReClor and LogiQA datasets, selected for their structure.
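
Since MADGRAD is positioned as a drop-in replacement for SGD and Adam, the minimal sketch below illustrates how it could be swapped into a PyTorch fine-tuning loop. It assumes the open-source madgrad package released by Facebook Research; the model, data, and hyperparameters shown are illustrative placeholders, not the configuration used in this work.

    # Minimal sketch: using MADGRAD in place of Adam/SGD in a PyTorch
    # training loop. Assumes the `madgrad` package
    # (https://github.com/facebookresearch/madgrad); model, data, and
    # hyperparameters are placeholders only.
    import torch
    from torch import nn
    from madgrad import MADGRAD

    # Placeholder classification head standing in for a transformer-based model.
    model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 4))

    # Drop-in replacement for torch.optim.Adam or torch.optim.SGD.
    optimizer = MADGRAD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=0.0)
    loss_fn = nn.CrossEntropyLoss()

    # Dummy batch: 32 pooled encoder vectors with 4-way labels.
    inputs = torch.randn(32, 768)
    labels = torch.randint(0, 4, (32,))

    for step in range(10):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()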