An approach to improving cluster labeling and evaluation

Clustering a large document collection produces subsets of documents (typically overlapping) such that the documents within a given cluster are substantially similar to one another. In this work, the final phase of the clustering process generates a label for each cluster, that is, a set of terms that represents the inherent meaning of the cluster. Although several methods exist for generating labels, little work has been done on methods that determine the quality of those labels. In other words, do the labels contain terms that a human might associate with a cluster? Do they enable the user to readily distinguish between clusters? Do they provide insight into the inherent meaning of the documents in the cluster? In this thesis, we focus on developing a tool that automatically assesses the quality of document cluster labels. Our objective is for the tool to be flexible, extensible, and reliable. It uses the Hungarian algorithm [16] to calculate the accuracy of the labels.

We analyze the performance of our evaluation tool using cluster labels generated by the labeling mechanism of SenseClusters [21], a comprehensive package that generates clusters using unsupervised learning. Label generation is based on selecting the top five or ten bigrams as ranked by a measure of association. Because feature selection is a significant step in generating labels, we extend the labeling mechanism of SenseClusters by incorporating higher-order n-grams and tf-idf term weighting, and then analyze the quality of the labels produced by these additional methods. The experimental results indicate that trigram features produce better results than the traditional unigram or bigram features of SenseClusters. Using tf-idf also improves the quality of the terms in the labels over those produced by the similarity mechanism of SenseClusters.
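The abstract states that the tool uses the Hungarian algorithm to calculate label accuracy but does not describe how the assignment is scored. The sketch below illustrates the general idea under several assumptions that are not taken from the thesis: each cluster label is compared against a gold-standard label by counting shared terms, clusters are matched one-to-one to gold labels so that total overlap is maximal, and accuracy is the matched overlap divided by the number of generated terms. For brevity it finds the optimal matching by exhaustive search rather than by the Hungarian algorithm itself; both return an optimal assignment, but exhaustive search is practical only for a handful of clusters. All function names and the sample data are hypothetical.

```python
from itertools import permutations


def overlap(label_terms, gold_terms):
    """Count the terms shared by a generated label and a gold-standard label."""
    return len(set(label_terms) & set(gold_terms))


def best_assignment(cluster_labels, gold_labels):
    """Return (mapping, accuracy) for the one-to-one matching with maximal overlap.

    Assumption: the number of clusters equals the number of gold labels.
    Exhaustive search stands in for the Hungarian algorithm here.
    """
    clusters = sorted(cluster_labels)
    golds = sorted(gold_labels)
    if len(clusters) != len(golds):
        raise ValueError("this sketch assumes equally many clusters and gold labels")

    best_score, best_perm = -1, None
    for perm in permutations(golds):
        # Total overlap when cluster i is matched to gold label perm[i].
        score = sum(
            overlap(cluster_labels[c], gold_labels[g])
            for c, g in zip(clusters, perm)
        )
        if score > best_score:
            best_score, best_perm = score, perm

    # Assumed accuracy definition: matched terms / all generated terms.
    total_terms = sum(len(set(terms)) for terms in cluster_labels.values())
    accuracy = best_score / total_terms if total_terms else 0.0
    return dict(zip(clusters, best_perm)), accuracy


# Hypothetical bigram labels for two clusters and two gold-standard topics.
cluster_labels = {
    "c0": ["stock market", "interest rate", "court ruling"],
    "c1": ["supreme court", "court ruling", "stock market"],
}
gold_labels = {
    "finance": ["stock market", "interest rate", "central bank"],
    "law": ["supreme court", "court ruling", "legal appeal"],
}

mapping, accuracy = best_assignment(cluster_labels, gold_labels)
print(mapping)
print(round(accuracy, 3))
```

With more than a few clusters, a polynomial-time solver for the assignment problem would replace the permutation loop; the scoring and accuracy definitions would stay the same.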

Keywords

cluster

labeling

Label evaluation

Description

University of Minnesota M.S. thesis. February, 2014. Major: Computer science. Advisor: Donald B. Crouch. 1 computer file (PDF); vii, 52 pages.

Collections

Master's Theses (Plan A and Professional Engineering Design Projects)

Suggested citation

Jha, Anand Mohan. (2014). An approach to improving cluster labeling and evaluation. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/162841.

