Date of Award
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
An ever-increasing amount of information on the web today is available only through search interfaces: the users have to type in a set of keywords in a search form in order to access the pages from certain web sites. These pages are often referred to as the Hidden Web or the Deep Web. According to recent studies, the content provided by hidden web sites is often of very high quality and can be extremely valuable to many users. This calls for deep web crawlers to excavate the data so that they can be reused, indexed, and searched upon in an integrated environment.
Crawling deep web is the process of collecting data from search interfaces by issuing queries. It often requires the selection of an appropriate set of queries so that they can cover most of the documents in the data source with low cost. This can be modeled as a set covering problem which has been extensively studied in graph theory. The conventional set covering algorithms, however, do not work well when applied to deep web crawling due to various special features of this application domain. Typically, most set covering algorithms do not take into account the distribution of the elements being covered. For deep web crawling, the sizes of the documents and the document frequency of the queries follow the power law distribution.
A new GA-based algorithm is introduced in this thesis. It targets at deep web crawling of a database with this power law distribution. The experiment shows that it outperforms the straightforward greedy algorithm previously introduced to the literature.
Wang, Shaohua, "Crawling Deep Web using a GA-based set covering algorithm" (2010). Electronic Theses and Dissertations. 8109.