Date of Award


Publication Type

Master Thesis

Degree Name



Computer Science


Chen, Jessica




Deep web crawling is the process of collecting documents that are organized into a data source and can be retrieved only through a search interface. This is typically done by sending a series of queries to the search interface. Because selecting a suitable set of queries is difficult, the crawl can be implemented by stepwise refinement: documents are retrieved step by step, and in each step the query selection is adapted to the knowledge accumulated from the documents downloaded in previous steps. However, downloading the documents and learning from the resulting sample to improve the query selection takes considerable time and effort. Here we propose a cost-effective, data-driven method for choosing the steps in adaptive crawling of the deep web. Through an empirical study, we explore criteria for setting the step lengths that best balance the cost of updating the sample against the improved quality of the selected queries. Derived from four existing data sets typically used for deep web crawling, these criteria provide practical guidelines for cost-effective stepwise refinement in iterative document retrieval.
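As an illustration only, the stepwise refinement loop described in the abstract might be sketched as follows. This is not the method of the thesis: the in-memory corpus, the `search` function, the `stepwise_crawl` name and its parameters, and the heuristic of picking the terms that occur in the most downloaded documents are all assumptions made for this example. The sketch only shows where the step length enters: it sets how many queries are issued between two updates of the sample statistics.

```python
from collections import Counter


def search(corpus, term):
    """Toy search interface (hypothetical): ids of documents containing the term."""
    return {doc_id for doc_id, text in corpus.items() if term in text.split()}


def stepwise_crawl(corpus, seed, step_length, max_queries):
    """Crawl through the search interface, refreshing the sample every step_length queries.

    Returns (downloaded document ids, number of queries issued, number of sample updates).
    """
    downloaded = set(search(corpus, seed))
    issued = {seed}
    updates = 0
    while len(issued) < max_queries:
        # Sample update (the costly part): recompute in how many
        # downloaded documents each term occurs.
        freq = Counter()
        for doc_id in downloaded:
            freq.update(set(corpus[doc_id].split()))
        updates += 1
        # Assumed selection heuristic: most frequent unissued terms first,
        # ties broken alphabetically to keep the run deterministic.
        ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
        candidates = [term for term, _ in ranked if term not in issued]
        if not candidates:
            break
        # One step: issue up to step_length queries without relearning.
        budget = min(step_length, max_queries - len(issued))
        for term in candidates[:budget]:
            issued.add(term)
            downloaded |= search(corpus, term)
    return downloaded, len(issued), updates


corpus = {
    1: "deep web crawling",
    2: "web search interface",
    3: "query selection interface",
    4: "query cost sample",
    5: "sample update cost",
}
docs_short, _, updates_short = stepwise_crawl(corpus, "deep", step_length=1, max_queries=10)
docs_long, _, updates_long = stepwise_crawl(corpus, "deep", step_length=3, max_queries=10)
```

On this toy corpus both runs retrieve every document, but the run with the longer step performs fewer sample updates. On a real data source, longer steps also mean that queries are chosen from staler statistics, which is the trade-off the abstract refers to.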