Query Selection in Deep Web Crawling

Yan Wang

Abstract

In many web sites, users need to type in keywords in a search Form in order to access the pages. These pages, called the deep web, are often of high value but usually not crawled by conventional search engines. This calls for deep web crawlers to retrieve the data so that they can be used, indexed, and searched upon in an integrated environment. Unlike the surface web crawling where pages are collected by following the hyperlinks embedded inside the pages, there are no hyperlinks in the deep web pages. Therefore, the main challenge of a deep web crawler is the selection of promising queries to be issued. This dissertation addresses the query selection problem in three parts: 1) Query selection in an omniscient setting where the global data of the deep web are available. In this case, query selection is mapped to the set-covering problem. A weighted greedy algorithm is presented to target the log-normally distributed data. 2) Sampling-based query selection when global data are not available. This thesis empirically shows that from a small sample of the documents we can learn the queries that can cover most of the documents with low cost. 3) Query selection for ranked deep web data sources. Most data sources rank the matched documents and return only the top k documents. This thesis shows that we need to use queries whose size is commensurate with k, and experiments with several query size estimation methods.