Date of Award

10-30-2020

Publication Type

Master Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

Author Name Disambiguation, Co-training, doc2vec, multi-view learning

Supervisor

Jianguo Lu

Rights

info:eu-repo/semantics/openAccess

Abstract

In bibliometrics, author name ambiguity means that an author's name is not a reliable identifier for associating academic papers with their authors. Author name ambiguity has been a problem in bibliometrics and for service providers like Google Scholar, giving rise to a field of study called Author Name Disambiguation (AND). Author name ambiguity is often tackled using classification techniques, where labeled papers are provided and papers are assigned to the correct authors according to the paper text and paper citations. When applying classification methods to author name disambiguation, two issues stand out. The first is that a paper has multiple views (paper text and citation network). The second is the lack of training data: few papers are labeled. To cope with these two issues, we propose using the co-training algorithm for AND. The co-training algorithm uses two views to classify papers iteratively and adds the top selected papers to the training pool. We demonstrate that the co-training algorithm outperforms the baseline multi-view classification algorithm. We also experiment with the hyper-parameters of the co-training algorithm. The experiments are conducted on the PubMed dataset, where authors are labeled with ORCID. Papers are represented by two embeddings that are learned separately from the paper content and the paper citation network. The baseline classifiers for comparison are logistic regression and SVM.
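
The co-training loop described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. A nearest-centroid classifier stands in for logistic regression or SVM, hand-written 2-D vectors stand in for the content and citation embeddings, and the selection rule (rank unlabeled papers by distance margin, keep the top `top_k` per view per round) is an assumption about what "top selected papers" means. The names `co_train`, `rounds`, and `top_k` are hypothetical.

```python
import math
from collections import defaultdict


def fit_centroids(vectors, labels):
    """Return {author: mean vector} for one view (stand-in classifier)."""
    groups = defaultdict(list)
    for vec, lab in zip(vectors, labels):
        groups[lab].append(vec)
    return {lab: [sum(col) / len(vecs) for col in zip(*vecs)]
            for lab, vecs in groups.items()}


def predict_with_margin(centroids, vec):
    """Return (nearest author, margin to the second-nearest centroid)."""
    dists = sorted((math.dist(vec, c), lab) for lab, c in centroids.items())
    best_dist, best_label = dists[0]
    margin = dists[1][0] - best_dist if len(dists) > 1 else 0.0
    return best_label, margin


def co_train(view_a, view_b, labels, rounds=5, top_k=1):
    """Fill in pseudo-labels; labels[i] is an author id or None (unlabeled).

    In each round, each view trains on the current labeled pool, scores the
    unlabeled papers, and moves its top_k most confident ones into the pool.
    """
    labels = list(labels)  # work on a copy, leave the input untouched
    for _ in range(rounds):
        for view in (view_a, view_b):
            known = [i for i, lab in enumerate(labels) if lab is not None]
            unknown = [i for i, lab in enumerate(labels) if lab is None]
            if not unknown:
                return labels
            centroids = fit_centroids([view[i] for i in known],
                                      [labels[i] for i in known])
            scored = [(predict_with_margin(centroids, view[i]), i)
                      for i in unknown]
            # Largest margin first: the assumed notion of "top selected".
            scored.sort(key=lambda item: item[0][1], reverse=True)
            for (label, _margin), i in scored[:top_k]:
                labels[i] = label
    return labels


# Toy data: six papers, two candidate authors, two labeled papers.
# Papers 4 and 5 are ambiguous in view A; papers 2 and 3 in view B.
content_view = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.2),
                (0.9, 0.7), (0.5, 0.4), (0.5, 0.6)]
citation_view = [(0.0, 1.0), (1.0, 0.0), (0.5, 0.5),
                 (0.6, 0.4), (0.1, 0.9), (0.9, 0.1)]
seed_labels = ["A", "B", None, None, None, None]

result = co_train(content_view, citation_view, seed_labels)
print(result)
```

In this toy setup each view labels the papers it is confident about, and those pseudo-labels then help the other view with papers it finds ambiguous. That is the motivation for combining two views when labeled data is scarce.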
