Computer Science Publications

Comparative Mining of B2C Web Sites by Discovering Web Database Schemas

C. I. Ezeife, University of Windsor
Bindu Peravali

Document Type

Conference Paper

Publication Date

2016

Publication Title

Proceedings of the 20th International Database Engineering & Applications Symposium

First Page

183

Last Page

192

Abstract

Discovering potentially useful and previously unknown information or knowledge from heterogeneous web content, for example answering a query such as "list all laptop prices from Walmart and Staples between 2013 and 2015 including make, type, screen size, CPU power, year of make", requires the difficult tasks of finding the schema of web documents from different web pages, integrating the web content data, and building a virtual or physical data warehouse before the content can be extracted and mined from the database. Wrappers that extract target information from web pages can be manual, semi-supervised or automatic systems. Automatic systems such as the WebOMiner system use data extraction techniques that parse the web page HTML source code into a document object model (DOM) tree and then traverse the DOM for pattern discovery. Limitations of these existing systems include their reliance on complicated matching techniques, such as tree matching and finite state automata, and inaccurate results for complex queries such as historical and derived queries.

This paper proposes building the WebOMiner S, which uses web structure and content mining approaches on the DOM-tree HTML code to simplify the web data extraction process of the WebOMiner system and make it more easily extendable. The WebOMiner system is based on non-deterministic finite state automata (NFA) to recognize and extract different types of web content (e.g., text, image, links, and lists). The proposed WebOMiner S replaces the NFA of the WebOMiner with a frequent structure finder algorithm, which uses regular expression matching in the Java XPath parser and its methods (such as compile() and evaluate()) to dynamically discover the most frequent structure in the DOM tree, that is, the most frequently repeated blocks in the HTML code, represented as <div class=" "> tags. This approach eliminates the need for any supervised training or for updating the wrapper for each new B2C web page, making the approach simpler, more easily extendable and automated.
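
The idea of finding the most frequently repeated div block with the Java XPath API can be illustrated with a minimal sketch. This is not the paper's frequent structure finder algorithm. It only assumes that the page is well-formed XHTML (so the standard XML parser can build the DOM) and that a "structure" is identified by the value of a div's class attribute. The class name FrequentStructureSketch, the method mostFrequentDivClass, and the sample page are hypothetical and chosen for illustration.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch only: counts <div class="..."> blocks by class value.
public class FrequentStructureSketch {

    // Returns the class value carried by the largest number of <div> elements,
    // or null when the page contains no <div> with a class attribute.
    // Assumption: the input is well-formed XHTML without a DOCTYPE.
    static String mostFrequentDivClass(String xhtml) throws Exception {
        // Parse the page source into a DOM tree.
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Select every <div> that has a class attribute, anywhere in the tree.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList divs = (NodeList) xpath.compile("//div[@class]")
                .evaluate(dom, XPathConstants.NODESET);

        // Tally occurrences per class value and track the current maximum.
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (int i = 0; i < divs.getLength(); i++) {
            String cls = ((Element) divs.item(i)).getAttribute("class");
            int n = counts.merge(cls, 1, Integer::sum);
            if (best == null || n > counts.get(best)) {
                best = cls;
            }
        }
        return best;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical product-list page with one repeated block type.
        String page = "<html><body>"
                + "<div class=\"nav\">Home</div>"
                + "<div class=\"product\"><span>Laptop A</span></div>"
                + "<div class=\"product\"><span>Laptop B</span></div>"
                + "<div class=\"product\"><span>Laptop C</span></div>"
                + "<div class=\"footer\">Contact</div>"
                + "</body></html>";
        System.out.println(mostFrequentDivClass(page)); // prints "product"
    }
}
```

Real B2C pages are rarely well-formed XML, so a practical version would first need an HTML-tolerant parser or a clean-up step. The sketch skips that to keep the compile() and evaluate() calls in focus.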

DOI

10.1145/2938503.2938522

Recommended Citation

Ezeife, C. I. and Peravali, Bindu. (2016). Comparative Mining of B2C Web Sites by Discovering Web Database Schemas. Proceedings of the 20th International Database Engineering & Applications Symposium, 183-192.
https://scholar.uwindsor.ca/computersciencepub/54
