Document Type

Article

Publication Date

2010

Publication Title

Data and Knowledge Engineering

Volume

Issue

First Page

866

Keywords

Deep web, Ranked data source, Estimators, Capture–recapture

Last Page

879

Abstract

Many deep web data sources are ranked data sources, i.e., they rank the matched documents and return at most the top k results even when more than k documents match the query. When estimating the size of such a ranked deep web data source, it is well known that there is a ranking bias: traditional methods tend to underestimate the size when queries overflow (match more documents than the return limit). Numerous estimation methods have been proposed to overcome the ranking bias, for example by avoiding overflowing queries during the sampling process, or by adjusting the initial estimate using a fixed function.

We observe that the overflow rate has a direct impact on the accuracy of the estimate. Under certain conditions, the actual size is close to the estimate obtained by the unranked model multiplied by the overflow rate. Based on this result, this paper proposes a method that allows overflowing queries in the sampling process.
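
The relationship stated above (actual size close to the unranked estimate multiplied by the overflow rate) can be illustrated with a minimal sketch. The abstract gives neither the paper's estimator nor a definition of the overflow rate, so two things here are assumptions: the classic two-sample Lincoln-Petersen capture-recapture estimator stands in for the unranked model, and the overflow rate is taken to be the total number of matched documents divided by the total number of returned documents. All function and parameter names are hypothetical.

```python
def lincoln_petersen(n1, n2, m):
    """Classic two-sample capture-recapture estimate n1 * n2 / m.

    n1, n2: number of distinct documents in the first and second sample.
    m: number of documents that appear in both samples.
    Used here as a stand-in for the unranked model (an assumption).
    """
    if m == 0:
        raise ValueError("no recaptured documents; estimate is undefined")
    return n1 * n2 / m


def overflow_rate(matched_counts, k):
    """Assumed definition: total matched / total returned.

    matched_counts: per-query number of matching documents.
    k: return limit of the ranked data source.
    """
    returned = [min(count, k) for count in matched_counts]
    return sum(matched_counts) / sum(returned)


def corrected_estimate(n1, n2, m, matched_counts, k):
    """Unranked estimate scaled by the overflow rate."""
    return lincoln_petersen(n1, n2, m) * overflow_rate(matched_counts, k)


# Hypothetical numbers, for illustration only.
unranked = lincoln_petersen(200, 150, 30)
rate = overflow_rate([50, 300, 150], k=100)
corrected = corrected_estimate(200, 150, 30, [50, 300, 150], k=100)
```

With these made-up inputs, two of the three queries overflow the limit of 100, so the overflow rate exceeds 1 and the corrected estimate is larger than the unranked one, which matches the direction of the ranking bias described in the abstract.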

DOI

10.1016/j.datak.2010.03.007

Comments

NOTICE: this is the author’s version of a work that was accepted for publication in Data and Knowledge Engineering. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Data and Knowledge Engineering, 69 (8), 2010 (DOI: 10.1016/j.datak.2010.03.007).

Recommended Citation

Lu, Jianguo. (2010). Ranking Bias in Deep Web Size Estimation Using Capture Recapture Method. Data and Knowledge Engineering, 69 (8), 866-879.
https://scholar.uwindsor.ca/computersciencepub/2

Included in

Computer Sciences Commons


Ranking Bias in Deep Web Size Estimation Using Capture Recapture Method
