Date of Award

Summer 2021

Publication Type

Thesis

Degree Name

M.A.Sc.

Department

Computer Science

Keywords

GloVe, Word co-occurrence, Word embedding, Word2Vec

Supervisor

J. Lu

Supervisor

J. Chen

Rights

info:eu-repo/semantics/openAccess

Creative Commons License

Creative Commons Attribution 4.0 International License
This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

One of the trends in Natural Language Processing (NLP) is the use of word embedding, which aims to build low-dimensional vector representations of words from text corpora. Global Vectors for Word Representation (GloVe) and Skip-Gram with Negative Sampling (SGNS) are two representative word embedding methods. Existing papers reach different conclusions about the relative performance of these two methods. This thesis focuses on GloVe and studies its commonalities and differences with SGNS.

Word co-occurrence is the cornerstone of all word embedding algorithms. One difference between GloVe and SGNS is the definition of co-occurrence. The weight of co-occurring words tapers off with the distance between them, and GloVe and SGNS adopt different weighting schemes. In SGNS, the weight decreases linearly with the distance. In GloVe, the weight decreases harmonically, giving less weight to the words in the middle of the window. We propose GloVe-L (GloVe Linear), which changes GloVe's weighting scheme to linear weighting. We find that GloVe-L consistently outperforms GloVe in word similarity tasks. This conclusion is supported by extensive experiments on eight word evaluation benchmarks, using a Wikipedia training corpus. The thesis also explores the impact of hyper-parameters on the result, including the window size and xmax in GloVe. Another interesting observation is that GloVe-L does not work well for word analogy tasks.
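
The two weighting schemes described above can be sketched as follows. The abstract does not state the formulas, so this sketch assumes the commonly used forms: 1/d for the harmonic scheme and (L - d + 1)/L for the linear scheme, where d is the distance between two words and L is the window size. The function names and the toy sentence are illustrative and are not taken from the thesis.

```python
from collections import defaultdict


def harmonic_weight(distance: int) -> float:
    """Harmonic decay (assumed form 1/d), as attributed to GloVe."""
    return 1.0 / distance


def linear_weight(distance: int, window: int) -> float:
    """Linear decay (assumed form (L - d + 1) / L), as attributed to SGNS."""
    return (window - distance + 1) / window


def weighted_cooccurrence(tokens, window, scheme="harmonic"):
    """Accumulate symmetric, distance-weighted co-occurrence counts.

    Hypothetical helper for illustration only; it is not the thesis code.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i + distance
            if j >= len(tokens):
                break
            if scheme == "harmonic":
                weight = harmonic_weight(distance)
            else:
                weight = linear_weight(distance, window)
            # Count the pair in both directions (symmetric context).
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return counts


if __name__ == "__main__":
    window = 10
    # Print both weights side by side for every distance in the window.
    for d in range(1, window + 1):
        print(d, round(harmonic_weight(d), 3), round(linear_weight(d, window), 3))
```

Under these assumed forms, both schemes assign weight 1 to adjacent words and weight 1/L to words at the edge of the window, but the harmonic weight is smaller at every distance in between. Calling `weighted_cooccurrence(tokens, window, scheme="linear")` on a tokenized corpus would produce the kind of co-occurrence counts that a linear variant such as GloVe-L would train on, under the assumptions stated above.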
