Date of Award

2012

Publication Type

Master Thesis

Degree Name

M.Sc.

Department

Computer Science

Keywords

Communication and the arts, Applied sciences, Nondeterministic finite automata, Regular expression, Web mining, Deterministic finite automata, DOM tree, Frequent pattern

Supervisor

Christie I. Ezeife

Rights

info:eu-repo/semantics/openAccess

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Abstract

Existing web content extraction systems use unsupervised, supervised, and semi-supervised approaches. The WebOMiner system is an automatic web content data extraction system that models a specific Business-to-Customer (B2C) web site, such as "bestbuy.com", using an object-oriented database schema. WebOMiner extracts different web page content types, such as product, list, and text, using a manually generated non-deterministic finite automaton (NFA). This thesis extends the automatic web content data extraction techniques proposed in WebOMiner to handle multiple web sites and to generate an integrated data warehouse automatically. We develop WebOMiner-2, which generates NFAs for domain-specific classes from regular expressions extracted from frequent patterns in web page DOM trees. Our algorithm also handles NFA epsilon (ε) transitions and converts the NFA to a deterministic finite automaton (DFA) to identify different content tuples from a list of tuples. Experimental results show that our system is highly effective, performing the content extraction task with 100% precision and 98.35% recall.
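
The conversion of an NFA with ε-transitions to a DFA mentioned in the abstract can be illustrated with the textbook subset construction. The following is a minimal sketch, not the thesis's implementation: the function names, the use of an empty string as the ε label, and the example "product tuple" pattern (image, title, optional brand, price) are all assumptions made for illustration.

```python
from collections import deque

EPSILON = ""  # assumed label for epsilon transitions in this sketch


def epsilon_closure(states, delta):
    """Return every NFA state reachable from `states` using only epsilon moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPSILON), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def nfa_to_dfa(start, accepting, delta, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    dfa_start = epsilon_closure({start}, delta)
    dfa_delta = {}
    seen = {dfa_start}
    queue = deque([dfa_start])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            moved = set()
            for s in current:
                moved.update(delta.get((s, symbol), ()))
            if not moved:
                continue  # no transition on this symbol; leave it undefined
            target = epsilon_closure(moved, delta)
            dfa_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfa_accepting = {q for q in seen if q & accepting}
    return dfa_start, dfa_accepting, dfa_delta


def accepts(dfa, symbols):
    """Run the DFA over a sequence of content-type labels."""
    state, accepting, delta = dfa[0], dfa[1], dfa[2]
    for sym in symbols:
        state = delta.get((state, sym))
        if state is None:
            return False
    return state in accepting


# Hypothetical product-tuple NFA: image, title, optional brand, price.
# The epsilon move from state 2 to state 3 makes "brand" optional.
nfa_delta = {
    (0, "image"): {1},
    (1, "title"): {2},
    (2, "brand"): {3},
    (2, EPSILON): {3},
    (3, "price"): {4},
}
dfa = nfa_to_dfa(0, {4}, nfa_delta, {"image", "title", "brand", "price"})
```

With this hypothetical pattern, `accepts(dfa, ["image", "title", "price"])` and `accepts(dfa, ["image", "title", "brand", "price"])` both succeed, while a sequence missing the title is rejected. Representing each DFA state as a frozenset keeps the states hashable, so they can be used directly as dictionary keys.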

Recommended Citation

Harun-Or-Rashid, Mohammad, "Mining Multiple Web Sources Using Non-Deterministic Finite State Automata" (2012). Electronic Theses and Dissertations. 4814.
https://scholar.uwindsor.ca/etd/4814
