Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval & Categorization

Published Date

2000-03-06

Type

Report

Abstract

In recent years, we have seen tremendous growth in the volume of online text documents available on the Internet, in digital libraries, from news sources, and on company-wide intranets. This has led to increased interest in developing methods that can efficiently retrieve relevant information. Retrieval techniques based on dimensionality reduction, such as latent semantic indexing (LSI), have been shown to improve the quality of the retrieved information by capturing the latent meaning of the words present in the documents. Unfortunately, LSI is computationally expensive and cannot be used in a supervised setting. In this paper we present a new fast dimensionality reduction algorithm, called concept indexing (CI), that is based on document clustering. CI computes a k-dimensional representation of a collection of documents by first clustering the documents into k groups, and then using the centroid vectors of the clusters to derive the axes of the reduced k-dimensional space. The low computational complexity of CI is achieved by using an almost linear time clustering algorithm. Furthermore, CI can be used to compute the dimensionality reduction in a supervised setting. Experimental results show that the dimensionality reduction computed by CI achieves retrieval performance comparable to that obtained using LSI, while requiring an order of magnitude less time. Moreover, the supervised dimensionality reduction computed by CI greatly improves the classification accuracy of existing classification algorithms such as C4.5 and kNN.
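
The two steps named in the abstract (cluster the documents into k groups, then use the cluster centroids as the axes of the reduced space) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the report: the function names are hypothetical, a plain spherical k-means stands in for the almost linear time clustering algorithm the abstract mentions, documents are projected by taking dot products with unit-length centroids, and the supervised case is approximated by treating class labels directly as the groups.

```python
import numpy as np


def normalize_rows(m):
    """Scale each row to unit Euclidean length (all-zero rows stay zero)."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1.0, norms)


def centroids_from_labels(docs, labels, k):
    """Mean vector of each of the k groups, normalized to unit length."""
    cents = np.zeros((k, docs.shape[1]))
    for j in range(k):
        members = docs[labels == j]
        if len(members) > 0:  # an empty group keeps a zero centroid
            cents[j] = members.mean(axis=0)
    return normalize_rows(cents)


def cluster_documents(docs, k, n_iter=20, seed=0):
    """Stand-in clustering step (assumption): spherical k-means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct documents chosen at random.
    cents = docs[rng.choice(len(docs), size=k, replace=False)]
    labels = np.zeros(len(docs), dtype=int)
    for _ in range(n_iter):
        # Assign each document to the centroid with highest cosine similarity.
        labels = np.argmax(docs @ cents.T, axis=1)
        cents = centroids_from_labels(docs, labels, k)
    return labels


def concept_index(docs, k, labels=None):
    """Return (reduced, axes): an n x k representation and the k centroid axes.

    docs   : n x d document-term matrix (rows are documents)
    labels : optional group ids in 0..k-1; if given, they replace the
             clustering step (a simple reading of the supervised setting)
    """
    docs = normalize_rows(np.asarray(docs, dtype=float))
    if labels is None:
        labels = cluster_documents(docs, k)
    else:
        labels = np.asarray(labels)
    axes = centroids_from_labels(docs, labels, k)
    # Coordinate j of a document is its similarity to centroid j.
    return docs @ axes.T, axes
```

Calling `concept_index(term_matrix, k=2)` on a small document-term matrix returns one k-dimensional row per document, and passing `labels=` skips clustering in favor of known classes. A query vector could be mapped into the same space by normalizing it and multiplying by `axes.T`, under the same assumptions.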

Suggested citation

Karypis, George; Han, Euihong. (2000). Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval & Categorization. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215405.
