Date of Award

2013

Degree Type

Thesis

Degree Name

M.Sc.

Department

Computer Science

First Advisor

Jianguo Lu

Keywords

Communication and the arts, Applied sciences, Deep web, Estimation, Random walk, Sampling, Search engine

Rights

CC BY-NC-ND 4.0

Abstract

The deep web is the part of World Wide Web that is hidden under form-like interfaces and can be accessed by queries only. Global properties of a deep web data source such as average degree, population size need to be estimated because the data in its entirety is not available. When a deep web data source is modelled as a document-term bipartite graph, the estimation can be performed by random walks on this graph. This thesis conducts comparative studies on various random walk sampling methods, including Simple Random Walk (SRW), Rejection Random Walk (RRW), Metropolis-Hastings Random Walk (MHRW) and uniform random sampling. Since random walks are conducted by queries in searchable interfaces, our study has focused on the overall sampling cost and the estimator performance in terms of bias, variance and RRMSE in this particular setting. From our experiments performed on Newsgroup data we find that MHRW results higher variance and RRMSE especially when the degree distribution follows the power law. On the other hand RRW performs worse in terms of query cost as it rejects too many samples. Compared to MHRW and RRW, SRW has low variance and RRMSE. Besides, SRW outperforms the real uniform random samples when the distribution follows the power law.

Share

COinS