Mining Knowledge for Instance Integration in Heterogeneous Databases
1997
Type
Report
Abstract
Data integration from multiple heterogeneous data sources has become a high-priority
task in many large enterprises, to achieve competitive advantage and effective utilization
of corporate resources. Success of the integration process is critically dependent
on the availability of accurate semantic information on the data contents. Techniques
for retrieving this information, directly from the data, are being used to complement
human knowledge and to automate some of the data integration tasks. This research
investigates the application of data-mining techniques to retrieve knowledge
for instance-integration in heterogeneous database systems.
Identifying and integrating all the instances of data items that represent the same
real-world entity is an important task, distinct from schema integration. Entity Identification
(EI) and attribute-value conflict resolution (AVCR) comprise the instance-integration
task. When common key-attributes are not available across different data
sources, the rules for EI, and the rules for AVCR, are expressed as combinations of
constraints on their attribute values. We have developed a method which allows the
users to provide examples of similar data items instead of specifying the instance-integration
rules directly. A learning module is then used to extract comprehensive
and precise rules from the examples, employing knowledge-discovery techniques.
Well-known classification and clustering techniques are designed for applications
which deal with a much smaller number of distinct items than is usually present
in database environments. We use distance functions measuring similarity between
attribute values to transform the instance-integration problem into a binary classification
problem. A library of distance functions for commonly occurring attribute
types has been developed.
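
The transformation described above can be illustrated with a minimal sketch. It is
not the report's implementation: the attribute names, the choice of a normalized
edit distance, and the helper names (string_distance, numeric_distance,
distance_vector, build_training_set) are all assumptions made for illustration.
Each pair of records becomes a vector of per-attribute distances, and each
user-supplied example pair contributes one labelled vector (same entity or not)
to a binary classification training set.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def string_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; case and outer whitespace are ignored."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def numeric_distance(a: float, b: float) -> float:
    """Relative difference scaled to [0, 1] for same-sign values."""
    scale = max(abs(a), abs(b))
    return abs(a - b) / scale if scale else 0.0


# Hypothetical "library": one distance function per attribute (names are assumed).
DISTANCES: Dict[str, Callable] = {
    "name": string_distance,
    "city": string_distance,
    "revenue": numeric_distance,
}


def distance_vector(r1: Record, r2: Record) -> List[float]:
    """Map a record pair to a vector of per-attribute distances."""
    return [fn(r1[attr], r2[attr]) for attr, fn in DISTANCES.items()]


def build_training_set(
    examples: List[Tuple[Record, Record, bool]]
) -> List[Tuple[List[float], int]]:
    """Turn labelled example pairs into (feature vector, 0/1 label) rows."""
    return [(distance_vector(r1, r2), int(same)) for r1, r2, same in examples]
```

Any standard binary classifier could then be trained on the rows returned by
build_training_set; which learner the report uses is not stated in this abstract.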
Experimental evaluation on real-world business databases shows that this method
can achieve complete accuracy in learning simple EI rules and can classify more than
85% of records accurately even with relatively complex rules. We have also developed an
algorithm to compute record-distances as a function of attribute-distances, thereby
making it possible to use heuristic clustering algorithms for EI. Such algorithms
greatly improve the efficiency of the process with a minor trade-off in effectiveness.
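
As a rough illustration of computing record-distances from attribute-distances,
the sketch below combines per-attribute distances with a weighted average and
feeds the result to a greedy, leader-style clustering pass. The abstract does not
describe the report's actual algorithm, so the weighted average, the use of
difflib for string similarity, the threshold value, and the function names
(attribute_distance, record_distance, greedy_cluster) are assumptions swapped in
for illustration only.

```python
from difflib import SequenceMatcher
from typing import Dict, List

Record = Dict[str, str]


def attribute_distance(a: str, b: str) -> float:
    """1 minus difflib's similarity ratio, so identical strings give 0.0."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_distance(r1: Record, r2: Record, weights: Dict[str, float]) -> float:
    """Weighted average of attribute distances (an assumed combination rule)."""
    total = sum(weights.values())
    return sum(
        w * attribute_distance(r1[attr], r2[attr]) for attr, w in weights.items()
    ) / total


def greedy_cluster(
    records: List[Record], weights: Dict[str, float], threshold: float
) -> List[List[int]]:
    """Single pass: join the first cluster whose leader is close enough.

    Each cluster is a list of record indices; its first member is the leader.
    This avoids comparing every pair of records, at some cost in accuracy.
    """
    clusters: List[List[int]] = []
    for i, rec in enumerate(records):
        for cluster in clusters:
            leader = records[cluster[0]]
            if record_distance(rec, leader, weights) <= threshold:
                cluster.append(i)
                break
        else:
            # No existing leader is within the threshold: start a new cluster.
            clusters.append([i])
    return clusters


records = [
    {"name": "John Smith", "city": "Minneapolis"},
    {"name": "Jon Smith", "city": "Minneapolis"},
    {"name": "Alice Wong", "city": "Boston"},
]
weights = {"name": 1.0, "city": 1.0}
clusters = greedy_cluster(records, weights, threshold=0.2)
```

With these sample records the two "Smith" entries fall into one cluster and the
third record forms its own. Comparing each record only against cluster leaders is
what makes such a heuristic cheaper than exhaustive pairwise classification.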
Series/Report Number
Technical Report; 97-041
Suggested citation
Ganesh, M. (1997). Mining Knowledge for Instance Integration in Heterogeneous Databases. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215324.