Modeling web documents as objects for automatic web content extraction: Object-oriented web data model
ICEIS 2009 - 11th International Conference on Enterprise Information Systems
Traditionally, mining web page contents involves modeling those contents to discover the underlying knowledge. Existing data extraction proposals represent web data in a formal structure, such as a database schema specific to an application domain. Such models fail to capture the full diversity of web data, which can combine different types of content and can also be unstructured. With these proposals it is not possible to focus on a given content type, to work on data with different structures, or to mine data from different application domains, all of which are required to mine a given content type, or web documents from different domains, efficiently. Moreover, since web pages are designed to be understood by users, this paper treats the presentation of a web document, expressed through HTML tag attributes, as useful for efficient web content mining. Hence, this paper provides a general framework composed of an object-oriented web data model based on HTML tags, together with algorithms for extracting web content objects and web presentation objects from any given web document. Web objects are extracted from the HTML code of a web document for mining, regardless of the domain.
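To make the idea of content objects and presentation objects concrete, the following is a minimal sketch in Python and not the algorithms of the paper. The class names `ContentObject`, `PresentationObject`, and `WebObjectExtractor`, the function `extract_web_objects`, and the three content kinds ("text", "image", "link") are assumptions introduced only for illustration; the abstract does not define them. The sketch uses only the standard-library `html.parser` module.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not tracked as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


@dataclass
class ContentObject:
    """Hypothetical content object: one piece of content tied to a tag."""
    tag: str
    kind: str   # "text", "image", or "link" (illustrative categories only)
    value: str


@dataclass
class PresentationObject:
    """Hypothetical presentation object: the attributes carried by one tag."""
    tag: str
    attributes: dict


class WebObjectExtractor(HTMLParser):
    """Collects content and presentation objects while scanning HTML tags."""

    def __init__(self):
        super().__init__()
        self.content = []
        self.presentation = []
        self._open_tags = []  # stack of currently open (non-void) tags

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag not in VOID_TAGS:
            self._open_tags.append(tag)
        # Any tag that carries attributes yields a presentation object.
        if attributes:
            self.presentation.append(PresentationObject(tag, attributes))
        # Images and links yield content objects from their target attribute.
        if tag == "img" and attributes.get("src"):
            self.content.append(ContentObject(tag, "image", attributes["src"]))
        elif tag == "a" and attributes.get("href"):
            self.content.append(ContentObject(tag, "link", attributes["href"]))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        enclosing = self._open_tags[-1] if self._open_tags else "document"
        # Skip whitespace-only data and the bodies of script/style elements.
        if text and enclosing not in ("script", "style"):
            self.content.append(ContentObject(enclosing, "text", text))


def extract_web_objects(html):
    """Return (content objects, presentation objects) for an HTML string."""
    parser = WebObjectExtractor()
    parser.feed(html)
    parser.close()
    return parser.content, parser.presentation


if __name__ == "__main__":
    sample = ('<html><body><h1 style="color:red">Title</h1>'
              '<p>Hello <a href="/x">link</a></p>'
              '<img src="a.png" alt="pic"></body></html>')
    content, presentation = extract_web_objects(sample)
    for obj in content + presentation:
        print(obj)
```

Because the sketch depends only on tag names and attributes, it makes no assumption about the application domain of the page, which is the property the framework aims for. A fuller model would presumably distinguish more content types and relate each presentation object to the content it styles.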
Annoni, E. and Ezeife, C. I. (2009). Modeling web documents as objects for automatic web content extraction: Object-oriented web data model. ICEIS 2009 - 11th International Conference on Enterprise Information Systems, 91-100.