Ganesh, M.
2020-09-02
2020-09-02
1997
https://hdl.handle.net/11299/215324

Data integration from multiple heterogeneous data sources has become a high-priority task in many large enterprises seeking competitive advantage and effective utilization of corporate resources. The success of the integration process depends critically on the availability of accurate semantic information about the data contents. Techniques for retrieving this information directly from the data are being used to complement human knowledge and to automate some of the data integration tasks. This research investigates the application of data-mining techniques to retrieve knowledge for instance integration in heterogeneous database systems. Identifying and integrating all the instances of data items that represent the same real-world entity is an important task, distinct from schema integration. Entity identification (EI) and attribute-value conflict resolution (AVCR) comprise the instance-integration task. When common key attributes are not available across different data sources, the rules for EI and the rules for AVCR are expressed as combinations of constraints on their attribute values. We have developed a method that allows users to provide examples of similar data items instead of specifying the instance-integration rules directly. A learning module is then used to extract comprehensive and precise rules from the examples, employing knowledge-discovery techniques. Well-known classification and clustering techniques are designed for applications that deal with a much smaller number of distinct items than is usually present in database environments. We use distance functions measuring similarity between attribute values to transform the instance-integration problem into a binary classification problem. A library of distance functions for commonly occurring attribute types has been developed.
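The transformation into binary classification can be illustrated with a minimal sketch. It is not the report's distance-function library: the attribute names (`name`, `city`, `revenue`), the choice of `difflib.SequenceMatcher` for string distance, and the helper names are all assumptions made for illustration. Each pair of records becomes a vector of per-attribute distances, and a user-supplied example of "same entity" or "different entity" becomes the class label.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

def string_distance(a: str, b: str) -> float:
    """Case-insensitive string distance in [0, 1]; 0.0 means identical."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def numeric_distance(a: float, b: float) -> float:
    """Normalized absolute difference in [0, 1]; 0.0 means equal."""
    scale = abs(a) + abs(b)
    return 0.0 if scale == 0 else abs(a - b) / scale

# Hypothetical schema: one distance function per attribute type.
DISTANCES: Dict[str, Callable] = {
    "name": string_distance,
    "city": string_distance,
    "revenue": numeric_distance,
}

def distance_vector(r1: dict, r2: dict) -> List[float]:
    """Map a pair of records to one distance per attribute."""
    return [fn(r1[attr], r2[attr]) for attr, fn in DISTANCES.items()]

def to_training_set(
    labeled_pairs: List[Tuple[dict, dict, int]]
) -> List[Tuple[List[float], int]]:
    """Turn user-labeled record pairs (1 = same entity, 0 = different)
    into (feature vector, label) rows for any binary classifier."""
    return [(distance_vector(r1, r2), label) for r1, r2, label in labeled_pairs]
```

The resulting rows can be passed to any standard binary classifier; a rule learner would then express EI rules as constraints on the distance values, for example a small name distance combined with a small city distance.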
Experimental evaluation on real-world business databases shows that this method can achieve complete accuracy in learning simple EI rules and can classify more than 85% of records accurately even with relatively complex rules. We have also developed an algorithm to compute record distances as a function of attribute distances, thereby making it possible to use heuristic clustering algorithms for EI. Such algorithms greatly improve the efficiency of the process with a minor trade-off in effectiveness.

en-US
Mining Knowledge for Instance Integration in Heterogeneous Databases
Report