academic papers, feature selection, feature weight normalization, language models, text classification
CC BY-NC-ND 4.0
The rapid growth of scholarly data has made it necessary to find efficient machine learning methods for categorizing the data automatically. This thesis aims to build a classifier that can automatically categorize Computer Science (CS) papers based on their text content. To find the best method for CS papers, we collect and prepare two large labeled data sets, CiteSeerX and arXiv, and experiment with different classification approaches, including Naive Bayes and Logistic Regression, as well as different feature selection schemes, language models, and feature weighting schemes. We found that, with a large training set, bi-gram modeling with normalized feature weights performs best on both data sets. Surprisingly, the arXiv data set can be classified with an F1 value of up to 0.95, while CiteSeerX reaches a lower F1 of 0.764. This is probably because the labeling of CiteSeerX is not as accurate as that of the arXiv data set.
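As a rough illustration of the kind of pipeline the abstract describes, the following minimal Python sketch combines bi-gram features, normalized feature weights, and a Naive Bayes classifier, and computes F1 for one class. It is not the thesis implementation: the function and class names, the choice of L2 normalization, the inclusion of unigrams alongside bi-grams, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def bigram_features(text):
    """Return L2-normalized counts of unigrams and bi-grams (assumed scheme)."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = Counter(grams)
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {gram: c / norm for gram, c in counts.items()}


class NaiveBayes:
    """Multinomial-style Naive Bayes over weighted features, add-one smoothing."""

    def fit(self, docs, labels):
        self.class_docs = Counter(labels)      # documents per class
        self.weights = defaultdict(Counter)    # class -> feature -> summed weight
        self.vocab = set()
        for doc, label in zip(docs, labels):
            for gram, w in bigram_features(doc).items():
                self.weights[label][gram] += w
                self.vocab.add(gram)
        self.totals = {c: sum(self.weights[c].values()) for c in self.class_docs}
        return self

    def predict(self, doc):
        feats = bigram_features(doc)
        n_docs = sum(self.class_docs.values())
        best, best_score = None, -math.inf
        for c in self.class_docs:
            score = math.log(self.class_docs[c] / n_docs)  # class prior
            denom = self.totals[c] + len(self.vocab)
            for gram, w in feats.items():
                if gram in self.vocab:  # features unseen in training are skipped
                    score += w * math.log((self.weights[c][gram] + 1.0) / denom)
            if score > best_score:
                best, best_score = c, score
        return best


def f1_score(true, pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(true, pred))
    fp = sum(t != positive and p == positive for t, p in zip(true, pred))
    fn = sum(t == positive and p != positive for t, p in zip(true, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data invented for this example; not drawn from CiteSeerX or arXiv.
train_docs = [
    "neural network training for image classification",
    "deep neural network language models",
    "protein folding in cell biology",
    "stock market price trends",
]
train_labels = ["cs", "cs", "other", "other"]
model = NaiveBayes().fit(train_docs, train_labels)
```

Calling model.predict on a new string returns the label with the highest smoothed log-probability. On real data, the normalization step is what keeps long papers from dominating the summed feature weights simply because they contain more tokens.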
Zhou, Tong, "Automated Identification of Computer Science Research Papers" (2016). Electronic Theses and Dissertations. 5776.