Date of Award
GloVe, Word co-occurrence, Word embedding, Word2Vec
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
One of the trends in Natural Language Processing (NLP) is the use of word embedding. Its aim is to build a low dimensional vector representation of words from text corpora. Global Vectors for Word Representation (GloVe) and Sikp-Gram with Negative Sampling (SGNS) are two representative word embedding methods. Existing papers have different conclusions on the performance of these two methods. This thesis focuses on GloVe and studies its commonalities and differences with SGNS.
Word co-occurrence is the cornerstone of all word embedding algorithms. One difference between GloVe and SGNS is the definition of co-occurrence. The weight of co-occurring words tapers o↵ with the distance between them. GloVe and SGNS adopts different weighting schemes. In SGNS, weight decreases linearly with the distance. In GloVe, the weight decreases harmonically, giving less weight to the words in the center of the window. We propose GloVe-L (GloVe Linear), by changing the weighting scheme to the linear weighting. We find that GloVe-L outperforms GloVe consistently in word similarity tasks. The conclusion is supported by extensive experiments on 8 Word evaluation benchmarks on Wikipedia training corpus. The thesis also explores the impact of hyper-parameters on the result, including window size and xmax in GloVe. Another interesting observation is that Glove-L does not work well for word analogy tasks.
Lu, Quinlan, "Improved GloVe Word Embedding Using Linear Weighting Scheme for Word Similarity Tasks" (2021). Electronic Theses and Dissertations. 8845.