Date of Award
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Document representation learning is crucial for downstream machine learning tasks such as document classification. Recent neural network approaches such as Doc2Vec and its variants are popular. Comparisons between these approaches and traditional representation methods such as TF-IDF have not been conclusive, for several reasons: Doc2Vec has many hyper-parameters, which causes its performance to fluctuate, and traditional methods still have room for improvement. More importantly, document length and data size affect the results. This thesis conducts a comparative study of these methods and proposes to improve TF-IDF weighting with mutual information (MI). We find that Doc2Vec works well only for short documents, and only when the data size (the number of documents) is large. For long documents and small data sizes, MI performs better. The experiments are conducted on 11 data sets covering a variety of combinations of document length and data size. In addition, we study the relationship between TF-IDF and MI weighting. We find that their correlation is high overall (the Pearson correlation coefficient is over 0.9 on all the data sets used in this thesis). For medium-frequency words, the MI weight is always smaller than the TF-IDF weight. For rare and popular words, however, MI diverges greatly from TF-IDF: the MI weight is higher than the TF-IDF weight for popular words but lower for rare words.
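The comparison between TF-IDF and MI weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas used in the thesis, so the definitions here are assumptions: TF-IDF is taken as raw term count times log(N / document frequency), and MI is taken as the pointwise mutual information between a word and a document, estimated from token counts. The function names `tfidf_and_pmi` and `pearson` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_and_pmi(docs):
    """Return parallel lists of TF-IDF and PMI weights, one per (word, document) pair.

    Assumed definitions (not taken from the thesis):
      tfidf(w, d) = count(w, d) * log(N / df(w))
      pmi(w, d)   = log( p(w, d) / (p(w) * p(d)) ), with probabilities from token counts
    """
    counts = [Counter(doc.split()) for doc in docs]
    n_docs = len(docs)
    total_tokens = sum(sum(c.values()) for c in counts)

    word_total = Counter()  # occurrences of each word across the corpus
    doc_freq = Counter()    # number of documents containing each word
    for c in counts:
        word_total.update(c)
        doc_freq.update(c.keys())

    tfidf, pmi = [], []
    for c in counts:
        doc_len = sum(c.values())
        for word, n in c.items():
            tfidf.append(n * math.log(n_docs / doc_freq[word]))
            # p(w,d) = n / T, p(w) = word_total / T, p(d) = doc_len / T
            pmi.append(math.log(n * total_tokens / (word_total[word] * doc_len)))
    return tfidf, pmi


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy corpus for illustration only; not one of the thesis data sets.
    corpus = [
        "the cat sat on the mat",
        "the dog chased the cat",
        "stock prices fell on monday",
    ]
    tfidf_weights, pmi_weights = tfidf_and_pmi(corpus)
    print(pearson(tfidf_weights, pmi_weights))
```

On a corpus this small the correlation value is not meaningful; the sketch only shows how the two weightings could be computed side by side and compared, under the assumed definitions.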
Tian, Ziyang, "A Comparative Study of Document Representation Methods" (2019). Electronic Theses and Dissertations. 8183.