An approach to improving cluster labeling and evaluation
2014-02
Type
Thesis or Dissertation
Abstract
The clustering of a large document collection produces subsets of documents (typically overlapping) such that documents within a given cluster exhibit substantial similarity to one another. In this work, the final phase of the clustering process is to generate a label for each cluster, that is, a set of terms that represents the inherent meaning associated with the cluster. Although several methods exist for generating labels, little work has been done on methods that determine the quality of the labels. In other words, do the labels represent terms that a human might associate with a cluster? Do they enable the user to readily distinguish between clusters? Do they provide insight into the inherent meaning of the documents in the cluster? In this thesis, we focus on developing a tool that automatically assesses the quality of document cluster labels. Our objective is for the tool to be flexible, extensible, and reliable. It uses the Hungarian algorithm [16] to calculate the accuracy of the labels.
We analyze the performance of our evaluation tool using cluster labels generated by the labeling mechanism of SenseClusters [21], a comprehensive package that generates clusters using unsupervised learning. Label generation is based on the selection of the top five or ten bigrams as ranked by a measure of association. Since selecting features is a significant step in generating labels, we extend the labeling mechanism of SenseClusters by incorporating higher-order n-grams and tf-idf term weighting, and then analyze the quality of the labels produced by these additional methods. The experimental results indicate that trigram features produce better results than the traditional unigram or bigram features of SenseClusters. In addition, using tf-idf improves the quality of the terms in the labels over those produced by the similarity mechanism of SenseClusters.
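To illustrate the kind of computation the abstract refers to when it mentions the Hungarian algorithm, the following is a minimal sketch of scoring cluster labels against reference term sets through an optimal one-to-one assignment. It is not the thesis's implementation. The overlap-count score, the accuracy definition (matched terms divided by total label terms), the function names, and the sample data are all assumptions made for illustration. For brevity the assignment is found by brute-force enumeration, which returns the same optimum the Hungarian algorithm would but is only practical for a handful of clusters.

```python
from itertools import permutations


def overlap_matrix(cluster_labels, reference_terms):
    """score[i][j] = number of terms shared by label i and reference set j."""
    return [[len(set(label) & set(ref)) for ref in reference_terms]
            for label in cluster_labels]


def best_assignment(score):
    """Return the one-to-one mapping (cluster i -> reference perm[i]) with the
    highest total score, plus that total.

    Brute force over all permutations: a stand-in for the Hungarian algorithm,
    assuming a small square matrix.
    """
    n = len(score)
    best_total, best_perm = -1, None
    for perm in permutations(range(n)):
        total = sum(score[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(best_perm), best_total


def label_accuracy(cluster_labels, reference_terms):
    """Hypothetical accuracy: matched terms / total label terms."""
    score = overlap_matrix(cluster_labels, reference_terms)
    mapping, matched = best_assignment(score)
    total_terms = sum(len(set(label)) for label in cluster_labels)
    accuracy = matched / total_terms if total_terms else 0.0
    return mapping, accuracy


# Made-up labels and reference term sets, for illustration only.
cluster_labels = [
    ["stock", "market", "trade"],
    ["goal", "match", "league"],
    ["virus", "vaccine", "dose"],
]
reference_terms = [
    ["league", "goal", "team"],
    ["vaccine", "virus", "health"],
    ["market", "stock", "bank"],
]

mapping, accuracy = label_accuracy(cluster_labels, reference_terms)
print(mapping, round(accuracy, 3))
```

For larger numbers of clusters, a polynomial-time solver for the assignment problem would replace the enumeration in `best_assignment` without changing the rest of the sketch.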
Description
University of Minnesota M.S. thesis. February, 2014. Major: Computer science. Advisor: Donald B. Crouch. 1 computer file (PDF); vii, 52 pages.
Suggested citation
Jha, Anand Mohan. (2014). An approach to improving cluster labeling and evaluation. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/162841.