Date of Award

2013

Publication Type

Master's Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

Information Technology, Computer science

Rights

info:eu-repo/semantics/openAccess

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Abstract

Web content usually contains different types of data embedded in different complex structures. Existing approaches for extracting data from the web are manual wrappers, supervised wrapper induction, and automatic data extraction. The WebOMiner system is an automatic extraction system that attempts to extract diverse heterogeneous web contents by modeling web sites as object-oriented schemas. The goal is to generate and integrate various web site object schemas for deeper comparative querying of historical and derived contents of Business-to-Customer (B2C) web sites such as BestBuy and Future Shop. The current WebOMiner system handles only one product list page (e.g., a computer page) of a B2C web site and still needs to generate and extract from more comprehensive web site object schemas (e.g., those of Computer, Laptop and Desktop products). It also does not yet handle historical aspects of data objects from different web pages. This thesis extends and advances the WebOMiner system to automatically generate a more comprehensive web site object schema, extract and mine structured web contents from different web pages based on similarity matching of object patterns, and store the extracted objects in a historical object-oriented data warehouse. The approaches used include similarity matching of DOM tree tag nodes to identify data blocks and data regions, which contain similar data objects, and automatic Non-Deterministic and Deterministic Finite Automata (NFA and DFA) for generating web site object schemas and extracting content. Experimental results show that our system is effective and able to extract and mine structured data tuples from different web sites with 79% recall and 100% precision. The average execution time of our system is 21.8 seconds.
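The idea of identifying data regions by similarity matching of DOM tree tag nodes can be illustrated with a minimal sketch. This is not the WebOMiner implementation: the function names, the sample page, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration. The sketch treats each child subtree as a sequence of tag names and groups adjacent siblings whose sequences are similar enough.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET


def tag_sequence(node):
    """Return the pre-order list of tag names in the subtree rooted at node."""
    return [el.tag for el in node.iter()]


def similarity(a, b):
    """Similarity ratio in [0, 1] between the tag sequences of two subtrees."""
    return SequenceMatcher(None, tag_sequence(a), tag_sequence(b)).ratio()


def data_region(parent, threshold=0.8):
    """Return the children of parent that resemble an adjacent sibling.

    The threshold is a hypothetical value, not one taken from the thesis.
    """
    children = list(parent)
    keep = set()
    for i in range(len(children) - 1):
        if similarity(children[i], children[i + 1]) >= threshold:
            keep.update((i, i + 1))
    return [children[i] for i in sorted(keep)]


# Hypothetical, well-formed product list page used only for this example.
PAGE = """
<div>
  <h1>Laptops</h1>
  <div><img/><a>Model A</a><span>$499</span></div>
  <div><img/><a>Model B</a><span>$599</span></div>
  <div><img/><a>Model C</a><span>$699</span><span>Sale</span></div>
  <p>Footer text</p>
</div>
"""

root = ET.fromstring(PAGE.strip())
blocks = data_region(root)
names = [block.find("a").text for block in blocks]
print(names)
```

Here the three product blocks are kept as one data region, while the heading and footer are dropped because their tag sequences do not resemble their neighbours. Real pages are rarely well-formed XML, so a tolerant HTML parser would be needed in practice.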

Recommended Citation

Alahmad, Yanal, "Comparative Mining of Multiple Web Data Source Contents with Object Oriented Model" (2013). Electronic Theses and Dissertations. 4730.
https://scholar.uwindsor.ca/etd/4730
