Mining Knowledge for Instance Integration in Heterogeneous Databases

Data integration from multiple heterogeneous data sources has become a high-priority task in many large enterprises, to achieve competitive advantage and effective utilization of corporate resources. Success of the integration process is critically dependent on the availability of accurate semantic information on the data contents. Techniques for retrieving this information directly from the data are being used to complement human knowledge and to automate some of the data integration tasks. This research investigates the application of data-mining techniques to retrieve knowledge for instance integration in heterogeneous database systems. Identifying and integrating all the instances of data items that represent the same real-world entity is an important task, distinct from schema integration. Entity identification (EI) and attribute-value conflict resolution (AVCR) comprise the instance-integration task. When common key attributes are not available across different data sources, the rules for EI and the rules for AVCR are expressed as combinations of constraints on their attribute values. We have developed a method which allows users to provide examples of similar data items instead of specifying the instance-integration rules directly. A learning module is then used to extract comprehensive and precise rules from the examples, employing knowledge-discovery techniques. Well-known classification and clustering techniques are designed for applications which deal with a much smaller number of distinct items than is usually present in database environments. We use distance functions measuring similarity between attribute values to transform the instance-integration problem into a binary classification problem. A library of distance functions for commonly occurring attribute types has been developed.
Experimental evaluation on real-world business databases shows that this method can achieve complete accuracy in learning simple EI rules and can classify more than 85% of records accurately even with relatively complex rules. We have also developed an algorithm to compute record distances as a function of attribute distances, thereby making it possible to use heuristic clustering algorithms for EI. Such algorithms greatly improve the efficiency of the process with a minor trade-off in effectiveness.
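The transformation described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the report: the attribute names, the sample records, the two distance functions, the one-level threshold rule (a decision stump) standing in for the learning module, and the weighted average standing in for the record-distance algorithm. Each pair of records is mapped to a vector of per-attribute distances, and user-labelled pairs ("same entity" or "different entity") then serve as training data for a binary classifier.

```python
from difflib import SequenceMatcher


def string_distance(a, b):
    """Distance in [0, 1]; 0.0 means the strings match ignoring case."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_distance(a, b):
    """0.0 if the values are equal, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


# Hypothetical "library" of distance functions, one per attribute type.
DISTANCE_LIBRARY = {
    "name": string_distance,
    "city": string_distance,
    "zip": exact_distance,
}
ATTRIBUTES = list(DISTANCE_LIBRARY)


def distance_vector(rec_a, rec_b):
    """Map a pair of records to a vector of attribute distances."""
    return [DISTANCE_LIBRARY[attr](rec_a[attr], rec_b[attr]) for attr in ATTRIBUTES]


def learn_stump(vectors, labels):
    """Pick the (attribute index, threshold) for which the rule
    'distance <= threshold means same entity' makes the fewest training errors."""
    best = None
    for i in range(len(vectors[0])):
        for threshold in sorted({v[i] for v in vectors}):
            errors = sum((v[i] <= threshold) != label
                         for v, label in zip(vectors, labels))
            if best is None or errors < best[0]:
                best = (errors, i, threshold)
    return best[1], best[2]


def same_entity(rec_a, rec_b, rule):
    """Apply a learned (attribute index, threshold) rule to a record pair."""
    index, threshold = rule
    return distance_vector(rec_a, rec_b)[index] <= threshold


def record_distance(rec_a, rec_b, weights):
    """Assumed record distance: weighted mean of the attribute distances."""
    distances = distance_vector(rec_a, rec_b)
    return sum(w * d for w, d in zip(weights, distances)) / sum(weights)


# Made-up labelled examples: (record from source A, record from source B, same entity?)
examples = [
    ({"name": "Acme Corp.", "city": "Minneapolis", "zip": "55455"},
     {"name": "ACME Corporation", "city": "Minneapolis", "zip": "55455"}, True),
    ({"name": "Acme Corp.", "city": "Minneapolis", "zip": "55455"},
     {"name": "Apex Tools", "city": "St. Paul", "zip": "55101"}, False),
    ({"name": "North Star Bank", "city": "Duluth", "zip": "55802"},
     {"name": "Northstar Bank", "city": "Duluth", "zip": "55802"}, True),
    ({"name": "North Star Bank", "city": "Duluth", "zip": "55802"},
     {"name": "North Shore Books", "city": "Two Harbors", "zip": "55616"}, False),
]

vectors = [distance_vector(a, b) for a, b, _ in examples]
labels = [label for _, _, label in examples]
rule = learn_stump(vectors, labels)
```

A single-attribute threshold is far weaker than the combinations of constraints the abstract mentions; it is used here only to keep the sketch short. The `record_distance` helper shows one way a scalar distance between records could be derived from attribute distances so that a distance-based clustering algorithm could consume it.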

Collections

Computer Science & Engineering (CS&E) Technical Reports

Series/Report Number

Technical Report; 97-041

Suggested citation

Ganesh, M. (1997). Mining Knowledge for Instance Integration in Heterogeneous Databases. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215324.


University of Minnesota Twin Cities
