Title

Comparative Mining of B2C Web Sites by Discovering Web Database Schemas

Document Type

Conference Paper

Publication Date

2016

Publication Title

Proceedings of the 20th International Database Engineering & Applications Symposium

First Page

183

Last Page

192

DOI

10.1145/2938503.2938522

Abstract

Discovering potentially useful and previously unknown information or knowledge from heterogeneous web content, for example answering a query such as "list all laptop prices from Walmart and Staples between 2013 and 2015, including make, type, screen size, CPU power, and year of make", requires the difficult tasks of finding the schema of web documents from different web pages, integrating the web content data, and building a virtual or physical data warehouse, all before web content can be extracted and mined from the database. Wrappers that extract target information from web pages can be manual, semi-supervised, or automatic systems. Automatic systems such as the WebOMiner system use data extraction techniques based on parsing the web page HTML source code into a document object model (DOM) tree and then traversing the DOM for pattern discovery. Limitations of these existing systems include their reliance on complicated matching techniques, such as tree matching and finite state automata, and their failure to yield accurate results for complex queries such as historical and derived queries.

This paper proposes building the WebOMiner S, which applies web structure and content mining approaches to the DOM-tree HTML code in order to simplify the web data extraction process of the WebOMiner system and make it more easily extendable. The WebOMiner system is based on non-deterministic finite state automata (NFA) that recognize and extract different web content types (e.g., text, image, links, and lists). The proposed WebOMiner S replaces the NFA of the WebOMiner with a frequent structure finder algorithm, which uses regular expression matching in the Java XPath parser and its methods (such as compile() and evaluate()) to dynamically discover the most frequent structure in the DOM tree, that is, the most frequently repeated block in the HTML code, represented by tags of the form <div class=" ">. This approach eliminates the need for any supervised training or for updating the wrapper for each new B2C web page, making the approach simpler, more easily extendable, and automated.
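
The idea of a frequent structure finder can be illustrated with a minimal Java sketch. This is not the authors' algorithm: it assumes the page is already well-formed XHTML, it uses only the standard javax.xml.xpath calls compile() and evaluate() (the regular expression step mentioned in the abstract is omitted), and it treats the class value shared by the most div elements as the "most frequent structure". The class name FrequentStructureSketch, the method mostFrequentDivClass, and the sample page are hypothetical and chosen only for illustration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; not the WebOMiner S implementation.
class FrequentStructureSketch {

    /**
     * Returns the class attribute value carried by the largest number of
     * div elements in the given well-formed XHTML, or null if no div has
     * a class attribute. Ties go to the value seen first in document order.
     */
    static String mostFrequentDivClass(String xhtml) {
        try {
            // Parse the markup into a DOM tree (assumes well-formed XHTML).
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Compile the XPath query once, then evaluate it against the DOM.
            XPath xpath = XPathFactory.newInstance().newXPath();
            XPathExpression divsWithClass = xpath.compile("//div[@class]");
            NodeList divs = (NodeList) divsWithClass.evaluate(dom, XPathConstants.NODESET);

            // Count how often each class value occurs.
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (int i = 0; i < divs.getLength(); i++) {
                String cls = ((Element) divs.item(i)).getAttribute("class");
                counts.merge(cls, 1, Integer::sum);
            }

            // Pick the class value with the highest count.
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                if (entry.getValue() > bestCount) {
                    best = entry.getKey();
                    bestCount = entry.getValue();
                }
            }
            return best;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up product listing page with one repeated block type.
        String page = "<html><body>"
                + "<div class=\"header\">Shop</div>"
                + "<div class=\"product\">Laptop A</div>"
                + "<div class=\"product\">Laptop B</div>"
                + "<div class=\"product\">Laptop C</div>"
                + "<div class=\"footer\">Contact</div>"
                + "</body></html>";
        System.out.println(mostFrequentDivClass(page)); // prints "product"
    }
}
```

Real B2C pages are rarely well-formed XML, so a practical version would first need an HTML-tolerant parser or a cleaning step before the XPath query can run; that step is outside this sketch.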