Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval & Categorization

Authors: Karypis, George; Han, Euihong
Dates: 2020-09-02; 2020-09-02; 2000-03-06
URI: https://hdl.handle.net/11299/215405
Language: en-US
Type: Report

The volume of online text documents available on the Internet, in digital libraries, from news sources, and on company-wide intranets has grown tremendously in recent years. This has led to increased interest in methods that can efficiently retrieve relevant information. Retrieval techniques based on dimensionality reduction, such as latent semantic indexing (LSI), have been shown to improve the quality of the retrieved information by capturing the latent meaning of the words present in the documents. Unfortunately, LSI is computationally expensive and cannot be used in a supervised setting. In this paper we present a new fast dimensionality reduction algorithm, called concept indexing (CI), that is based on document clustering. CI computes a k-dimensional representation of a collection of documents by first clustering the documents into k groups, and then using the centroid vectors of the clusters to derive the axes of the reduced k-dimensional space. The low computational complexity of CI is achieved by using an almost linear time clustering algorithm. Furthermore, CI can be used to compute the dimensionality reduction in a supervised setting. Experimental results show that the dimensionality reduction computed by CI achieves retrieval performance comparable to that obtained using LSI, while requiring an order of magnitude less time. Moreover, the supervised dimensionality reduction computed by CI greatly improved the classification accuracies of existing classification algorithms such as C4.5 and kNN.
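
The two steps described in the abstract (cluster the documents into k groups, then use the cluster centroids as the axes of the reduced space) can be sketched in a few lines of Python with NumPy. This is an illustrative sketch under stated assumptions, not the report's implementation: the function names `normalize_rows` and `concept_index` are hypothetical, a plain spherical k-means loop stands in for the almost linear time clustering algorithm the abstract mentions, documents are assumed to be non-negative term-weight vectors, and each document is mapped to its cosine similarities with the k centroids. In the supervised case, the sketch assumes each class label defines one group.

```python
import numpy as np


def normalize_rows(m):
    """Scale each row to unit length, leaving all-zero rows unchanged."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return m / norms


def concept_index(docs, k=None, labels=None, n_iter=20, seed=0):
    """Return (reduced, centroids) for a document-term matrix.

    Unsupervised (labels is None): group documents into k clusters with a
    simple spherical k-means loop (a stand-in, not the report's algorithm).
    Supervised (labels given): assume one group per class label; k is ignored.
    """
    docs = normalize_rows(np.asarray(docs, dtype=float))
    if labels is None:
        rng = np.random.default_rng(seed)
        # Start from k distinct documents chosen at random.
        centroids = docs[rng.choice(len(docs), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign each document to its most similar centroid (cosine).
            assign = np.argmax(docs @ centroids.T, axis=1)
            for j in range(k):
                members = docs[assign == j]
                if len(members):  # keep the old centroid if a cluster empties
                    centroids[j] = members.mean(axis=0)
            centroids = normalize_rows(centroids)
    else:
        labels = np.asarray(labels)
        centroids = normalize_rows(
            np.stack([docs[labels == c].mean(axis=0) for c in np.unique(labels)])
        )
    # Each document's new coordinates are its similarities to the centroids.
    return docs @ centroids.T, centroids


if __name__ == "__main__":
    # Toy corpus: four documents over four terms, two per topic.
    toy = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 1, 3], [0, 0, 3, 1]]
    reduced, _ = concept_index(toy, labels=[0, 0, 1, 1])
    print(reduced.round(3))
```

A query can be placed in the same reduced space by normalizing its term vector and multiplying it by `centroids.T`, after which retrieval or kNN classification proceeds on k-dimensional vectors instead of the full vocabulary.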