Date of Award
2009
Publication Type
Master's Thesis
Degree Name
M.Sc.
Department
Computer Science
Keywords
Computer Science.
Supervisor
Lu, Jianguo (School of Computer Science)
Rights
info:eu-repo/semantics/openAccess
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Abstract
The deep web is the part of the web that can be accessed only through query interfaces. Discovering the size of a deep web data source has been an important and challenging problem ever since the web emerged. The size plays an important role in crawling and extracting a deep web data source. This thesis proposes a new coverage-based method for estimating the size. The method relies on constructing a query pool that covers most of the data source. We propose two approaches to constructing a query pool so that the variance of document frequencies is small and most of the documents are covered. Our experiments on four data collections show that a query pool built from a sample of the collection yields lower bias and variance. We also compare the new method with three existing methods on corpora that we collected.
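The abstract does not state the estimator itself, so the following is only a minimal sketch of the general idea of size estimation by coverage, not the method developed in the thesis. It assumes that documents are represented as sets of terms, that the coverage of a query pool is measured on a local sample of documents, and that the number of unique documents the pool retrieves from the data source is known. The function names `coverage` and `estimate_size` are hypothetical.

```python
def coverage(query_pool, sample_docs):
    """Fraction of sample documents matched by at least one query in the pool.

    Assumption: each document is a set of terms and each query is a single term.
    """
    if not sample_docs:
        raise ValueError("sample_docs must not be empty")
    pool = set(query_pool)
    # A document is covered if it shares at least one term with the pool.
    covered = sum(1 for doc in sample_docs if pool & set(doc))
    return covered / len(sample_docs)


def estimate_size(unique_docs_retrieved, pool_coverage):
    """Scale the number of unique retrieved documents by the pool's coverage.

    Assumption: the coverage measured on the sample also holds for the
    whole data source.
    """
    if not 0 < pool_coverage <= 1:
        raise ValueError("pool_coverage must be in (0, 1]")
    return unique_docs_retrieved / pool_coverage


# Illustrative toy data, not taken from the thesis.
sample = [{"deep", "web"}, {"query", "pool"}, {"size"}, {"crawl"}]
pool = ["web", "pool", "size"]
c = coverage(pool, sample)   # 3 of the 4 sample documents are covered
n = estimate_size(750, c)    # 750 unique documents retrieved by the pool
```

Under these assumptions, the higher the coverage of the pool, the less the estimate depends on extrapolation, which is consistent with the abstract's emphasis on a pool that covers most of the data source.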
Recommended Citation
Liang, Jie, "Discovering the size of a deep web data source by coverage" (2009). Electronic Theses and Dissertations. 326.
https://scholar.uwindsor.ca/etd/326