Computer Science Publications

Cleaning Web Pages for Effective Web Content Mining

J. Li
C. I. Ezeife, University of Windsor

Document Type

Conference Paper

Publication Date

9-4-2006

Publication Title

International Conference on Database and Expert Systems Applications

First Page

560

Last Page

571

Abstract

Classifying and mining noise-free web pages will improve the accuracy of search results as well as search speed, and may benefit web page organization applications (e.g., keyword-based search engines and taxonomic web page categorization applications). Noise on a web page is content that is irrelevant to the main content being mined, such as advertisements, navigation bars, and copyright notices. The few existing works on web page cleaning detect noise blocks whose contents match exactly, but are weak at detecting near-duplicate blocks, such as navigation bars. This paper proposes a system, WebPageCleaner, that eliminates noise blocks from web pages to improve the accuracy and efficiency of web content mining. A vision-based technique is employed to extract blocks from web pages. Relevant blocks are then identified as those with a high importance level, determined by analyzing physical features of the blocks: the block location, the percentage of web links in the block, and the level of similarity of the block's contents to other blocks. Important blocks are exported for web content mining using Naive Bayes text classification. Experiments show that WebPageCleaner leads to more accurate and efficient web page classification than comparable existing approaches.
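
The abstract names three block features (location, link percentage, and similarity to other blocks) but does not give the scoring formula. The following is a minimal sketch of how such features might be combined, not the paper's actual method. The `Block` type, the position weights, the product-form score, the use of Jaccard similarity, and the 0.5 threshold are all assumptions made for illustration; it also assumes the "other blocks" are drawn from other pages of the same site, and that blocks have already been extracted.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str        # visible text of the block
    link_words: int  # number of words that sit inside hyperlinks
    position: str    # "center", "top", "bottom", "left", or "right"


# Hypothetical weights: central blocks are assumed more likely to hold main content.
POSITION_WEIGHT = {"center": 1.0, "top": 0.3, "bottom": 0.2, "left": 0.3, "right": 0.3}


def tokens(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_ratio(block):
    """Fraction of the block's words that are link text, capped at 1.0."""
    n_words = len(block.text.split())
    return min(block.link_words / n_words, 1.0) if n_words else 0.0


def max_similarity(block, other_blocks):
    """Highest similarity to any other block; near 1.0 suggests a near duplicate."""
    block_tokens = tokens(block.text)
    return max((jaccard(block_tokens, tokens(o.text)) for o in other_blocks), default=0.0)


def importance(block, other_blocks):
    """Assumed score: high for central, link-poor blocks unlike any other block."""
    weight = POSITION_WEIGHT.get(block.position, 0.3)
    return weight * (1.0 - link_ratio(block)) * (1.0 - max_similarity(block, other_blocks))


def clean(page_blocks, other_blocks, threshold=0.5):
    """Keep only the blocks whose importance reaches the (assumed) threshold."""
    return [b for b in page_blocks if importance(b, other_blocks) >= threshold]


# Example: a navigation bar that nearly repeats on another page, and one content block.
nav = Block("Home News Sports Contact", link_words=4, position="top")
body = Block("WebPageCleaner removes noise blocks before classification", 0, "center")
other_page = [Block("Home News Sports Contact About", link_words=5, position="top")]

kept = clean([nav, body], other_page)
print([b.text for b in kept])
```

In this sketch the navigation bar is dropped even though its text differs slightly from the bar on the other page, which is the near-duplicate case the abstract highlights. The text of the kept blocks could then be passed to any Naive Bayes text classifier.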

DOI

10.1007/11827405_55

Recommended Citation

Li, J. and Ezeife, C. I. (2006). Cleaning Web Pages for Effective Web Content Mining. International Conference on Database and Expert Systems Applications, 560-571.
https://scholar.uwindsor.ca/computersciencepub/28
