Date of Award
2002
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Many organizations collect large amounts of data to support their business and decision-making processes. The data originate from a variety of sources that may have inherent data quality problems. These problems become more pronounced when heterogeneous data sources are integrated to build data warehouses. Data warehouses, which integrate huge amounts of data from a number of heterogeneous data sources, are used to support decision-making and on-line analytical processing. The integrated databases inherit the data quality problems that were present in the source databases, and also acquire new data quality problems from the integration process itself. The data in the integrated systems (especially data warehouses) need to be cleaned for reliable decision support querying. A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying "equivalent" records within the database. Most published research in de-duplication proposes techniques that rely heavily on domain knowledge. A few others propose solutions that are partially domain-independent. This thesis identifies two levels of domain-independence in de-duplication: domain-independence at the attribute level and domain-independence at the record level. The thesis then proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level. The thesis also proposes a technique for field weighting by data profiling, which, when used with the positional algorithm, achieves domain-independent de-duplication at the record level. Experiments show that the positional algorithm achieves more accurate de-duplication than the existing algorithms. Experiments also show that the data profiling technique for field weighting effectively assigns field weights for de-duplication purposes.
Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2002 .U34.
Source: Masters Abstracts International, Volume: 41-04, page: 1123. Adviser: Christie I. Ezeife. Thesis (M.Sc.)--University of Windsor (Canada), 2002.
Udechukwu, Ajumobi Okwuchukwu, "Domain-independent de-duplication in data warehouse cleaning." (2002). Electronic Theses and Dissertations. 1777.