Date of Award

2009

Degree Type

Thesis

Degree Name

M.Sc.

Department

Computer Science

First Advisor

Lu, Jianguo (School of Computer Science)

Keywords

Computer Science.

Rights

CC BY-NC-ND 4.0

Abstract

The deep web is the part of the web that can be accessed only through query interfaces. Estimating the size of a deep web data source has been an important and challenging problem since the web emerged, because the size plays an important role in crawling and extracting such a source. This thesis proposes a new coverage-based method for estimating that size. The method relies on constructing a query pool that covers most of the data source. We propose two approaches to constructing a query pool so that the document frequency variance is small and most of the documents are covered. Our experiments on four data collections show that a query pool built from a sample of the collection results in lower bias and variance. We also compare the new method with three existing methods on corpora that we collected.
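The two properties the abstract asks of a query pool, high coverage and low document frequency variance, can be illustrated on a toy corpus. The following is a minimal sketch and not the thesis's estimator: the function name `pool_stats`, the whitespace tokenization, and the reading of "document frequency" as the number of documents matching a single-term query are all assumptions made for illustration.

```python
from statistics import pvariance


def pool_stats(docs, pool):
    """Return (coverage, df_variance) for a query pool over a toy corpus.

    Hypothetical helper: assumes single-term queries and whitespace
    tokenization, and expects non-empty `docs` and `pool`.
    """
    doc_terms = [set(d.lower().split()) for d in docs]
    # Document frequency of each query: how many documents contain the term.
    dfs = [sum(1 for terms in doc_terms if q in terms) for q in pool]
    # A document is covered if at least one query in the pool matches it.
    covered = sum(1 for terms in doc_terms if any(q in terms for q in pool))
    coverage = covered / len(docs)
    # Population variance of the document frequencies across the pool.
    return coverage, pvariance(dfs)


docs = [
    "deep web crawling",
    "web size estimation",
    "query pool coverage",
    "size of a data source",
]
pool = ["web", "size", "pool"]
coverage, df_var = pool_stats(docs, pool)
print(f"coverage={coverage:.2f}, df variance={df_var:.3f}")
```

In this toy setting, a pool is preferable when `coverage` is close to 1 and `df_var` is small, which mirrors the two construction goals stated in the abstract.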
