Date of Award

3-10-2021

Publication Type

Master Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

Chi-Squared Statistic, Doc2vec, Document Embedding, Hybrid Mutual Information, Pointwise Mutual Information, Terminology Extraction

Supervisor

Jianguo Lu

Rights

info:eu-repo/semantics/openAccess

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Abstract

Automated terminology extraction is a crucial task in natural language processing and ontology construction. Termhood can be inferred using linguistic and statistical techniques. This thesis focuses on the statistical methods. Inspired by feature selection techniques in document classification, we experiment with a variety of metrics, including PMI (pointwise mutual information), MI (mutual information), and Chi-squared. We find that PMI is better at identifying the top keywords in a domain, whereas Chi-squared recognizes more keywords overall. Based on this observation, we propose a hybrid approach, called HMI, that combines the best of PMI and Chi-squared. HMI outperforms both PMI and Chi-squared. The result is verified by measuring the overlap between the extracted keywords and the author-identified keywords in arXiv data. When the corpora are computer science and physics papers, the top-100 hit rate can reach 0.96 for HMI. We also demonstrate that terminologies can improve document embeddings. In this experiment, we treat each machine-identified multi-word terminology as a single word, and then use the transformed text as input for the document embedding. Compared with the representations learnt from unigrams only, we observe an improvement of over 9.41% in F1 score on document classification tasks in arXiv data.
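
As a rough illustration of the two base metrics named in the abstract, the sketch below computes PMI and Chi-squared for a candidate term from a 2x2 table of document counts (term present or absent, document inside or outside the domain), and shows one simple way to rewrite multi-word terms as single tokens before embedding. The function names, the count layout, and the underscore-joining convention are assumptions made for this example; the thesis may define the statistics over different units, and the HMI combination is not reproduced here because the abstract does not specify it.

```python
import math


def pmi(n11, n10, n01, n00):
    """Pointwise mutual information between 'term occurs' and 'doc is in domain'.

    n11: in-domain docs containing the term
    n10: out-of-domain docs containing the term
    n01: in-domain docs without the term
    n00: out-of-domain docs without the term
    """
    n = n11 + n10 + n01 + n00
    if n11 == 0:
        return float("-inf")  # term never occurs in the domain
    p_joint = n11 / n
    p_term = (n11 + n10) / n
    p_domain = (n11 + n01) / n
    return math.log2(p_joint / (p_term * p_domain))


def chi_squared(n11, n10, n01, n00):
    """Chi-squared statistic for the same 2x2 contingency table."""
    n = n11 + n10 + n01 + n00
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denominator == 0:
        return 0.0  # a row or column is empty, so the statistic is undefined
    return n * (n11 * n00 - n10 * n01) ** 2 / denominator


def merge_terms(text, terms):
    """Rewrite each multi-word term as one underscore-joined token.

    Longer terms are replaced first so that a short term does not
    split a longer term that contains it.
    """
    for term in sorted(terms, key=len, reverse=True):
        text = text.replace(term, term.replace(" ", "_"))
    return text


if __name__ == "__main__":
    # Hypothetical counts: 1000 docs, 100 in the domain, term appears in 40.
    print(pmi(30, 10, 70, 890))
    print(chi_squared(30, 10, 70, 890))
    print(merge_terms("we use mutual information here", ["mutual information"]))
```

Both scores are zero when the term is distributed independently of the domain and grow as the term concentrates inside it. PMI is a log ratio of probabilities, whereas Chi-squared scales with the number of documents. The plain string replacement in `merge_terms` is deliberately naive; a real pipeline would match on token boundaries.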

Recommended Citation

Kulkarni, Jayanth Prakash, "Multi-Word Terminology Extraction and Its Role in Document Embedding" (2021). Electronic Theses and Dissertations. 8563.
https://scholar.uwindsor.ca/etd/8563