Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval & Categorization
2000-03-06
Published Date
2000-03-06
Type
Report
Abstract
In recent years, we have seen a tremendous growth in the volume of online text documents available on the Internet, in digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing methods that can efficiently retrieve relevant information. Retrieval techniques based on dimensionality reduction, such as latent semantic indexing (LSI), have been shown to improve the quality of the information being retrieved by capturing the latent meaning of the words that are present in the documents. Unfortunately, LSI is computationally expensive and cannot be used in a supervised setting. In this paper we present a new fast dimensionality reduction algorithm, called concept indexing (CI), that is based on document clustering. CI computes a k-dimensional representation of a collection of documents by first clustering the documents into k groups, and then using the centroid vectors of the clusters to derive the axes of the reduced k-dimensional space. The low computational complexity of CI is achieved by using an almost linear time clustering algorithm. Furthermore, CI can be used to compute the dimensionality reduction in a supervised setting. Experimental results show that the dimensionality reduction computed by CI achieves retrieval performance comparable to that obtained using LSI, while requiring an order of magnitude less time. Moreover, the supervised dimensionality reduction computed by CI greatly improves the classification accuracies of existing classification algorithms such as C4.5 and kNN.
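The core idea in the abstract (cluster the documents into k groups, then use the cluster centroids as the axes of the reduced space) can be illustrated with a minimal Python sketch. This is not the report's implementation. The clustering step here is a toy spherical k-means standing in for the almost linear time clustering algorithm the abstract mentions, and projecting each document onto the centroids with a dot product is one plausible reading of "derive the axes", stated here as an assumption. All function names and the four-document data set are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def centroids_from_labels(docs, labels, k):
    """Sum the documents in each group and normalize the result.

    Normalizing the sum gives the same direction as normalizing the mean.
    """
    dim = len(docs[0])
    sums = [[0.0] * dim for _ in range(k)]
    for doc, label in zip(docs, labels):
        for j, x in enumerate(doc):
            sums[label][j] += x
    return [normalize(s) for s in sums]

def spherical_kmeans(docs, k, iterations=20):
    """Toy stand-in for the clustering step (assumption, not the report's method)."""
    # Naive seeding: pick k documents spread evenly through the collection.
    centroids = [list(docs[(i * len(docs)) // k]) for i in range(k)]
    labels = [0] * len(docs)
    for _ in range(iterations):
        # Assign every document to its most similar centroid.
        labels = [max(range(k), key=lambda c: dot(d, centroids[c])) for d in docs]
        updated = centroids_from_labels(docs, labels, k)
        # Keep the previous centroid if a cluster ended up empty.
        centroids = [u if any(u) else old for u, old in zip(updated, centroids)]
    return labels

def concept_index(docs, labels, k):
    """Return (centroids, reduced), where reduced[i][c] = docs[i] . centroids[c]."""
    centroids = centroids_from_labels(docs, labels, k)
    reduced = [[dot(d, c) for c in centroids] for d in docs]
    return centroids, reduced

# Hypothetical term-frequency vectors: 4 documents over a 4-term vocabulary.
raw = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 1]]
docs = [normalize(d) for d in raw]

# Unsupervised use: the groups come from clustering.
labels = spherical_kmeans(docs, k=2)
centroids, reduced = concept_index(docs, labels, k=2)

# Supervised use (assumption): the groups come from known class labels instead.
class_labels = [0, 0, 1, 1]
_, reduced_supervised = concept_index(docs, class_labels, k=2)

for row in reduced:
    print([round(x, 3) for x in row])
```

Each row of `reduced` is a k-dimensional vector whose entries measure how similar the document is to each cluster centroid. The same `concept_index` function serves both settings because only the source of the group labels changes.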
Series/Report Number
Technical Report; 00-016
Suggested citation
Karypis, George; Han, Euihong. (2000). Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval & Categorization. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/215405.