Date of Award

2011

Publication Type

Master Thesis

Degree Name

M.Sc.

Department

Computer Science

First Advisor

Lu, Jianguo (School of Computer Science)

Keywords

Computer Science.

Rights

info:eu-repo/semantics/openAccess

Creative Commons License

Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Abstract

Data of deep web in general is stored in a database that is only accessible via web query forms or through web service interfaces. One challenge of deep web crawling is how to select meaningful queries to acquire data. There is substantial research on the selection of queries, such as the approach based on the set covering problem where greedy algorithm or its variation is used. These methods are not extensively studied in the context of real web services, which may impose new challenges for deep web crawling. This thesis studies several query selection methods on Microsoft’s Bing web service, especially the impact of the ranking of the returns in real data sources. Our results show that for unranked data sources, weighted method performed a little better then un-weighted set covering algorithm. For ranked data sources, document frequent estimation is necessary to harvest data more efficiently.

Share

COinS