Date of Award

2012

Publication Type

Doctoral Thesis

Degree Name

Ph.D.

Department

Computer Science

Keywords

Computer science.

Supervisor

Chen, Jessica (Computer Science)

Rights

info:eu-repo/semantics/openAccess

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Abstract

In many web sites, users need to type in keywords in a search Form in order to access the pages. These pages, called the deep web, are often of high value but usually not crawled by conventional search engines. This calls for deep web crawlers to retrieve the data so that they can be used, indexed, and searched upon in an integrated environment. Unlike the surface web crawling where pages are collected by following the hyperlinks embedded inside the pages, there are no hyperlinks in the deep web pages. Therefore, the main challenge of a deep web crawler is the selection of promising queries to be issued. This dissertation addresses the query selection problem in three parts: 1) Query selection in an omniscient setting where the global data of the deep web are available. In this case, query selection is mapped to the set-covering problem. A weighted greedy algorithm is presented to target the log-normally distributed data. 2) Sampling-based query selection when global data are not available. This thesis empirically shows that from a small sample of the documents we can learn the queries that can cover most of the documents with low cost. 3) Query selection for ranked deep web data sources. Most data sources rank the matched documents and return only the top k documents. This thesis shows that we need to use queries whose size is commensurate with k, and experiments with several query size estimation methods.

Recommended Citation

WANG, YAN, "Query selection in Deep Web Crawling" (2012). Electronic Theses and Dissertations. 418.
https://scholar.uwindsor.ca/etd/418

Download

COinS

Scholarship at UWindsor

Electronic Theses and Dissertations

Query selection in Deep Web Crawling

Date of Award

Publication Type

Degree Name

Department

Keywords

Supervisor

Rights

Creative Commons License

Abstract

Recommended Citation

Search

Browse

Author Corner

Scholarship at UWindsor

Electronic Theses and Dissertations

Query selection in Deep Web Crawling

Author

Date of Award

Publication Type

Degree Name

Department

Keywords

Supervisor

Rights

Creative Commons License

Abstract

Recommended Citation

Share

Search

Browse

Author Corner