Date of Award

2016

Publication Type

Master Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

bad words, feature selection, machine learning, spammer, Twitter

Supervisor

Lu, Jianguo

Rights

info:eu-repo/semantics/openAccess

Abstract

A large number of Twitter accounts are suspended. Over a five-year period, about 14% of accounts were terminated for reasons not specified explicitly by the service provider. We collected about 120,000 suspended users, along with their tweets and social relations. This thesis studies these suspended users and compares them with normal users in terms of their tweets. We train classifiers to automatically predict whether a user will be suspended. Three different kinds of features are used. We experimented with the Naive Bayes method, including Bernoulli (BNB) and multinomial (MNB) variants, plus various feature selection mechanisms (mutual information, chi-square, and pointwise mutual information), and achieved F1 = 78%. To reduce the high dimensionality, our second approach uses word2vec and doc2vec to represent each user with a short, fixed-length vector, and achieved F1 = 73% using an SVM with an RBF kernel. Random forest works best with this approach, with F1 = 74%.
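
As an illustration of the first approach named in the abstract, the following is a minimal sketch of chi-square feature selection followed by a Bernoulli Naive Bayes classifier. It is not the thesis's implementation: the function names (`chi_square`, `select_features`, `train_bnb`, `predict_bnb`), the add-one smoothing, and the four-user toy data set are all assumptions made for this example, and each user is reduced to the set of words appearing in their tweets.

```python
import math

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: users in the class whose tweets contain the term
    n10: users outside the class whose tweets contain the term
    n01: users in the class without the term
    n00: users outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def select_features(docs, labels, k):
    """Return the k terms with the highest chi-square score (ties broken alphabetically)."""
    vocab = set().union(*docs)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    scores = {}
    for term in vocab:
        n11 = sum(1 for d, y in zip(docs, labels) if y == 1 and term in d)
        n10 = sum(1 for d, y in zip(docs, labels) if y == 0 and term in d)
        scores[term] = chi_square(n11, n10, n_pos - n11, n_neg - n10)
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]

def train_bnb(docs, labels, features):
    """Estimate class priors and per-term presence probabilities (add-one smoothing)."""
    model = {}
    for c in (0, 1):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        prior = len(class_docs) / len(docs)
        probs = {
            t: (sum(1 for d in class_docs if t in d) + 1) / (len(class_docs) + 2)
            for t in features
        }
        model[c] = (prior, probs)
    return model

def predict_bnb(model, doc):
    """Return the class with the highest log-posterior under the Bernoulli model."""
    best_class, best_score = None, -math.inf
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for term, p in probs.items():
            # Bernoulli NB scores absent terms too, via (1 - p).
            score += math.log(p if term in doc else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Invented toy data: label 1 = suspended user, 0 = normal user.
docs = [
    {"free", "followers", "click"},
    {"click", "free", "win"},
    {"lunch", "friends"},
    {"friends", "movie", "lunch"},
]
labels = [1, 1, 0, 0]

features = select_features(docs, labels, k=4)
model = train_bnb(docs, labels, features)
print(features)
print(predict_bnb(model, {"free", "click", "now"}))
```

Selecting only the top-scoring terms before training is what keeps the vocabulary manageable; the multinomial variant would use term counts instead of presence/absence, and mutual information or pointwise mutual information could replace `chi_square` as the scoring function.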
