Criterion Functions for Document Clustering: Experiments and Analysis
2001-11-29
Type
Report
Abstract
In recent years, we have witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing methods that can help users effectively navigate, summarize, and organize this information, with the ultimate goal of helping them find what they are looking for. Fast and high-quality document clustering algorithms play an important role toward this goal, as they have been shown both to provide an intuitive navigation/browsing mechanism by organizing large amounts of information into a small number of meaningful clusters and to greatly improve retrieval performance via cluster-driven dimensionality reduction, term-weighting, or query expansion. The ever-increasing importance of document clustering and the expanded range of its applications have led to the development of a number of novel algorithms with different complexity-quality trade-offs. Among them, a class of clustering algorithms with relatively low computational requirements treats the clustering problem as an optimization process that seeks to maximize or minimize a particular clustering criterion function defined over the entire clustering solution.
The focus of this paper is to evaluate the performance of different criterion functions for the problem of clustering documents. Our study involves a total of eight different criterion functions, three of which are introduced in this paper and five of which have been proposed in the past. Our evaluation consists of both a comprehensive experimental evaluation involving fifteen different datasets and an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that there is a set of criterion functions that consistently outperform the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis of the criterion functions shows that their relative performance depends on (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.
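To make the idea of a criterion function defined over an entire clustering solution more concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the report's method: the abstract does not define the eight functions, so the criterion used here (the sum of cosine similarities between each document and the centroid of its cluster) is only an example choice, and the names `normalize`, `criterion`, and `refine` are hypothetical. The greedy refinement loop is likewise just one simple way to optimize such a function.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def criterion(docs, labels, k):
    """Example criterion (an assumption, not taken from the report):
    sum over all documents of the cosine similarity between the
    document and the centroid of the cluster it is assigned to."""
    dim = len(docs[0])
    units = [normalize(d) for d in docs]
    sums = [[0.0] * dim for _ in range(k)]
    for u, c in zip(units, labels):
        for j, x in enumerate(u):
            sums[c][j] += x
    centroids = [normalize(s) for s in sums]
    return sum(
        sum(a * b for a, b in zip(u, centroids[c]))
        for u, c in zip(units, labels)
    )


def refine(docs, labels, k):
    """Greedy optimization: move each document to the cluster that most
    increases the criterion, and repeat until no single move helps."""
    labels = list(labels)
    best = criterion(docs, labels, k)
    improved = True
    while improved:
        improved = False
        for i in range(len(docs)):
            keep = labels[i]  # best cluster found so far for document i
            for c in range(k):
                if c == keep:
                    continue
                labels[i] = c
                score = criterion(docs, labels, k)
                if score > best + 1e-12:
                    best, keep, improved = score, c, True
            labels[i] = keep
    return labels, best


# Toy term-weight vectors: two documents about one topic, two about another.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
start = [0, 1, 0, 1]  # deliberately poor initial assignment
final_labels, final_score = refine(docs, start, k=2)
print(final_labels, round(final_score, 4))
```

Because the criterion is evaluated over the whole solution, every candidate move is judged by its effect on all clusters at once. A practical implementation would update the cluster sums incrementally rather than recomputing the criterion from scratch for each move.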
Series/Report Number
Technical Report; 01-040
Suggested citation
Zhao, Ying; Karypis, George. (2001). Criterion Functions for Document Clustering: Experiments and Analysis. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215490.