Date of Award
Information Technology, Computer science
CC BY-NC-ND 4.0
Web contents usually contain different types of data which are embedded under different complex structures. Existing approaches for extracting data contents from the web are manual wrappers, supervised wrapper induction, or automatic data extraction. The WebOminer system is an automatic extraction system that attempts to extract diverse heterogeneous web contents by modeling web sites as object oriented schemas. The goal is to generate and integrate various web site object schemas for deeper comparative querying of historical and derived contents of Business to Customer (B2C) such as BestBuy and Future Shop. The current WebOMiner system generates and extracts from only one product list page (e.g., computer page) of B2C web sites and still needs to generate and extract from a more comprehensive web site object schemas (e.g., those of Computer, Laptop and Desktop products). The current WebOMiner system does not yet handle historical aspects of data objects from different web pages. This thesis extends and advances the WebOMiner system to automatically generate a more comprehensive web site object schema, extract and mine structured web contents from different web pages based on objects' patterns similarity matching, and stores the extracted objects in historical object-oriented data warehouse. Approaches to be used include similarity matching of DOM tree tag nodes for identifying data blocks and data regions, automatic Non-Deterministic and Deterministic Finite Automata (NFA and DFA) for generating web site object schemas and content extraction, which contain similar data objects. Experimental results show that our system is effective and able to extract and mine structured data tuples from different web websites with 79% recall and 100% precision. The average execution time of our system is 21.8 seconds.
Alahmad, Yanal, "Comparative Mining of Multiple Web Data Source Contents with Object Oriented Model" (2013). Electronic Theses and Dissertations. 4730.