Cleaning Web pages for effective Web content mining.
CC BY-NC-ND 4.0
Web pages usually contain many noisy blocks, such as advertisements, navigation bars, and copyright notices. These blocks can seriously degrade web content mining because their contents are irrelevant to the main content of the page, so eliminating them before mining is important for improving mining accuracy and efficiency. A few existing approaches detect noisy blocks whose contents are exactly the same, but they are weak at detecting near-duplicate blocks, such as navigation bars. This thesis proposes WebPageCleaner, a system that, given a collection of web pages from a web site, eliminates noisy blocks from those pages to improve the accuracy and efficiency of web content mining. WebPageCleaner detects noisy blocks with identical contents as well as those with near-duplicate contents. It is based on the observation that noisy blocks usually share common contents and appear frequently on a given web site. WebPageCleaner consists of three modules: block extraction, block importance retrieval, and cleaned file generation. A vision-based technique is employed to extract blocks from web pages. Each block is assigned an importance degree according to block features such as its position and the level of similarity of its contents to those of other blocks. Finally, a collection of cleaned files that retain the blocks with a high importance degree is generated and used for web content mining. The proposed technique is evaluated using Naive Bayes text classification. Experiments show that WebPageCleaner leads to more efficient and accurate web page classification results than existing approaches.

Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .L5. Source: Masters Abstracts International, Volume: 45-01, page: 0359. Thesis (M.Sc.)--University of Windsor (Canada), 2006.
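The abstract's central observation, that noisy blocks recur with identical or near-identical contents across the pages of a site, can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not name a similarity measure, so Jaccard similarity over word bigrams is an assumption, as are the function names (`shingles`, `jaccard`, `block_importance`, `clean`), the thresholds, and the sample pages. Block position and the vision-based extraction step are omitted, so the blocks are assumed to be already extracted as plain text.

```python
from typing import Dict, List, Set, Tuple


def shingles(text: str, k: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of k-word shingles of a block's text (assumed representation)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity of two shingle sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def block_importance(pages: Dict[str, List[str]],
                     threshold: float = 0.6) -> Dict[str, List[float]]:
    """Score each block as 1 minus the fraction of other pages that
    contain an exact or near-duplicate block (hypothetical scoring rule)."""
    sets = {url: [shingles(b) for b in blocks] for url, blocks in pages.items()}
    result: Dict[str, List[float]] = {}
    for url, block_sets in sets.items():
        others = [u for u in sets if u != url]
        scores = []
        for s in block_sets:
            hits = sum(
                1 for u in others
                if any(jaccard(s, t) >= threshold for t in sets[u])
            )
            scores.append(1.0 - hits / len(others) if others else 1.0)
        result[url] = scores
    return result


def clean(pages: Dict[str, List[str]], threshold: float = 0.6,
          min_importance: float = 0.5) -> Dict[str, List[str]]:
    """Keep only the blocks whose importance reaches min_importance."""
    scores = block_importance(pages, threshold)
    return {
        url: [b for b, sc in zip(blocks, scores[url]) if sc >= min_importance]
        for url, blocks in pages.items()
    }


# Illustrative data: page "b" has a near-duplicate (not identical) navigation bar.
pages = {
    "a": ["Home | News | Sports | Contact",
          "Local team wins the championship game"],
    "b": ["Home | News | Sports | About | Contact",
          "New library opens downtown next week"],
    "c": ["Home | News | Sports | Contact",
          "City council approves transit budget"],
}
cleaned = clean(pages)
```

In this sketch the navigation bars score low because a similar block appears on every other page, including the slightly different bar on page "b", while the article text is unique to each page and is retained. The cleaned blocks could then be passed to a text classifier such as Naive Bayes, which is how the thesis evaluates its approach.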
Li, Jing., "Cleaning Web pages for effective Web content mining." (2006). Electronic Theses and Dissertations. 1443.