Centroid-Based Document Classification Algorithms: Analysis & Experimental Results
2000-03-06
Type
Report
Abstract
In recent years we have seen a tremendous growth in the volume of online text documents available on the Internet, digital libraries, news sources, and company-wide intranets. Automatic text categorization, which is the task of assigning text documents to pre-specified classes (topics or themes) of documents, is an important task that can help both in organizing as well as finding information in these huge resources. Text categorization presents unique challenges due to the large number of attributes present in the data set, the large number of training samples, and attribute dependencies. In this paper we focus on a simple linear-time centroid-based document classification algorithm that, despite its simplicity and robust performance, has not been extensively studied and analyzed. Our extensive experiments show that this centroid-based classifier consistently and substantially outperforms other algorithms such as Naive Bayesian, k-nearest-neighbors, and C4.5 on a wide range of datasets. Our analysis shows that the similarity measure used by the centroid-based scheme allows it to classify a new document based on how closely its behavior, as measured by the average similarity between the documents, matches the behavior of the documents belonging to different classes. This matching allows it to dynamically adjust for classes with different densities. Furthermore, our analysis also shows that the similarity measure of the centroid-based scheme accounts for dependencies between the terms in the different classes. We believe that this feature is the reason why it consistently outperforms other classifiers, which cannot take these dependencies into account.
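To make the idea in the abstract concrete, the following is a minimal sketch of a centroid-based classifier, not the authors' implementation. The abstract does not state the term weighting or the similarity function, so this sketch assumes unit-length raw term-frequency vectors and cosine similarity; the function names (`unit_vector`, `train_centroids`, `classify`) and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def unit_vector(tokens):
    """Map a token list to a unit-length term-frequency vector (a dict).

    Assumption: raw term frequencies; the report may use another weighting.
    """
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {term: c / norm for term, c in counts.items()}


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def train_centroids(labeled_docs):
    """Single pass over (label, tokens) pairs; returns {label: centroid}.

    Each centroid is the average of the unit vectors of its class.
    """
    sums = defaultdict(lambda: defaultdict(float))
    sizes = Counter()
    for label, tokens in labeled_docs:
        sizes[label] += 1
        for term, w in unit_vector(tokens).items():
            sums[label][term] += w
    return {
        label: {term: w / sizes[label] for term, w in vec.items()}
        for label, vec in sums.items()
    }


def classify(tokens, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    x = unit_vector(tokens)

    def cosine(c):
        norm = math.sqrt(dot(c, c))
        return dot(x, c) / norm if norm else 0.0

    return max(centroids, key=lambda label: cosine(centroids[label]))


# Hypothetical toy data, for illustration only.
training = [
    ("sports", "team wins match".split()),
    ("sports", "team loses final match".split()),
    ("finance", "stock market falls".split()),
    ("finance", "market rallies stock gains".split()),
]
centroids = train_centroids(training)
print(classify("the team played a match".split(), centroids))
print(classify("stock prices in the market".split(), centroids))
```

Under these assumptions, training is one pass over the documents and classification compares a document against one centroid per class. With unit-length document vectors, the dot product between a document and a centroid equals the document's average similarity to the documents of that class, and the squared length of the centroid equals the average pairwise similarity within the class, which is one way to read the abstract's remark about adjusting for classes of different densities.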
Suggested citation
Han, Euihong; Karypis, George. (2000). Centroid-Based Document Classification Algorithms: Analysis & Experimental Results. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215406.