Date of Award

1-1-2019

Publication Type

Master Thesis

Degree Name

M.Sc.

Department

Computer Science

Supervisor

Jianguo Lu

Rights

info:eu-repo/semantics/openAccess

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Abstract

Document representation learning is crucial for downstream machine learning tasks such as document classification. Recent neural network approaches such as Doc2Vec and its variants are popular. However, comparisons with traditional representation methods such as TF-IDF have been inconclusive, for several reasons: Doc2Vec has many hyper-parameters, which causes its performance to fluctuate; traditional methods have room for improvement; and, more importantly, document length and data size affect the results. This thesis conducts a comparative study of these methods and proposes improving TF-IDF weighting with mutual information (MI). We find that Doc2Vec works well only for short documents, and only when the data size (the number of documents) is large. For long documents and small data sizes, MI performs better. The experiments are conducted on 11 data sets covering a variety of combinations of document length and data size. In addition, we study the relationship between TF-IDF and MI weighting. We find that their correlation is high overall (the Pearson correlation coefficient is over 0.9 on all the data sets used in the thesis). For medium-frequency words, the MI weight is always smaller than the TF-IDF weight. For rare and popular words, however, MI diverges greatly from TF-IDF: the MI weight is higher than the TF-IDF weight for popular words but lower for rare words.

Recommended Citation

Tian, Ziyang, "A Comparative Study of Document Representation Methods" (2019). Electronic Theses and Dissertations. 8183.
https://scholar.uwindsor.ca/etd/8183
