Date of Award

2016

Publication Type

Master Thesis

Degree Name

M.Sc.

Department

Computer Science

Supervisor

Ezeife, Christie

Rights

info:eu-repo/semantics/openAccess

Abstract

Mining frequent sequential patterns from multiple databases to discover more complex patterns from multiple data sources such as multiple E-Commerce (B2C) web sites for comparative, historical and derived analysis, poses the additional challenge of integrating mined patterns from multiple sources during various levels of mining. A few existing work on mining frequent patterns from multiple databases (MDB’s) are the ApproxMap algorithm and the TidFP algorithm. The ApproxMap algorithm breaks its input sequences (e.g., the 2-column sequence <(123)(45)>) into columns so it can find the collection of approximate frequent sequences of all the columns as the approximate sequence of the database. The same method is used to integrate frequent sequences from each MDB that must have the same table structure. The TidFP algorithm mines frequent item_sets from multiple sources of different table structures and related through foreign key attributes using transaction ids for integrating patterns through set operations (e.g., intersect, union) in order to answer global queries involving multiple sources. The limitations of existing work on multiple database sequential pattern mining such as the ApproxMap algorithm is that they are not able to mine frequent sequences to answer exact and historical queries from MDB’s of different structure; while the TidFp algorithm can only answer queries from MDB’s on item_sets but not for sequences. This thesis proposes the Transaction id frequent sequence pattern (TidFSeq) algorithm which uses the techniques of the TidFP algorithm for mining item sets on the problem of mining frequent sequences from diverse MDB’s. The challenges with mining frequent sequences from MDBs that is solved by this thesis are that the TidFSeq algorithm first computes the element (ie. Sequence item position id) in which each item in each sequence (ie. sequence id) occurs by replacing the tuple used in the TidFp with a tuple. For every item ‘i’ in the kth sequence of n-sequence of length ‘n’, the TidFSeq algorithm first transforms it into a tuple that specifies (a) it’s transaction id (Tid) and (b) the list of the kth sequence in this transaction that item ‘i’ occurs (called it’s position id list). Next the GSP-like candidate generation approach is used on our transformed sequences to generate frequent sequences with transacion ids which can be used to answer complex queries from MDB’s through set operations. The proposed TidFSeq algorithm, PrefixSpan algorithm and ApproxMap algorithm are compared with respect to the results obtained for a given query, processing speed and memory requirement. Experiments show that the proposed TidFSeq algorithm mines the exact frequent sequences (ie. 100% accuracy) from multiple sequence tables, when compared to the ApproxMap algorithm that has an accuracy of 79%. The TidFSeq algorithm has faster processing time for mining frequent sequences from multiple tables than the PrefixSpan and ApproxMap algorithms.

Share

COinS