Date of Award
Chi-Squared Statistic, Doc2vec, Document Embedding, Hybrid Mutual Information, Pointwise Mutual Information, Terminology Extraction
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Automated terminology extraction is a crucial task in natural language processing and ontology construction. Termhood can be inferred using linguistic and statistic techniques. This thesis focuses on the statistic methods. Inspired by feature selection techniques in documents classification, we experiment with a variety of metrics including PMI (point-wise mutual information), MI (mutual information), and Chi-squared. We find that PMI is in favour of identifying top keywords in a domain, but Chi-squared can recognize more keywords overall. Based on this observation, we propose a hybrid approach, called HMI, that combines the best of PMI and Chi-squared. HMI outperforms both PMI and Chi-squared. The result is verified by comparing overlapping between the extracted keywords and the author-identified keywords in arXiv data. When the corpora are computer science and physics papers, the top-100 hit rate can reach 0.96 for HMI. We also demonstrate that terminologies can improve documents embeddings. In this experiment, we treat machine-identified multi-word terminologies with one word. Then we use the transformed text as input for the document embedding. Compared with the representations learnt from unigrams only, we observe a performance improvement over 9.41% for F1 score in arXiv data on document classification tasks.
Kulkarni, Jayanth Prakash, "Multi-Word Terminology Extraction and Its Role in Document Embedding" (2021). Electronic Theses and Dissertations. 8563.