"Discovering the size of a deep web data source by coverage" by Jie Liang

Date of Award

2009

Publication Type

Master Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

Computer Science.

Supervisor

Lu, Jianguo (School of Computer Science)

Rights

info:eu-repo/semantics/openAccess

Abstract

The deep web is the part of the web that can be accessed only through query interfaces. Discovering the size of a deep web data source has been an important and challenging problem ever since the web emerged. The size plays an important role in crawling and extracting data from a deep web data source. This thesis proposes a new coverage-based method for estimating that size. The method relies on constructing a query pool that covers most of the data source. We propose two approaches to constructing a query pool so that the document frequency variance is small and most of the documents are covered. Our experiments on four data collections show that a query pool built from a sample of the collection results in lower bias and variance. We compare the new method with three existing methods on corpora that we collected.
