Browsing by Author "Han, Euihong"
Now showing 1 - 11 of 11
Item: A Hierarchical Clustering Algorithm Using Dynamic Modeling (1999-03-09)
Karypis, George; Han, Euihong; Kumar, Vipin

Clustering in data mining is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized. Existing clustering algorithms, such as K-means, PAM, CLARANS, DBSCAN, CURE, and ROCK, are designed to find clusters that fit some static model. These algorithms can break down if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of the clusters. Furthermore, most of these algorithms break down when the data consists of clusters of diverse shapes, densities, and sizes. In this paper, we present a novel hierarchical clustering algorithm called Chameleon that measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the inter-connectivity and closeness (proximity) between the two clusters are high relative to the internal inter-connectivity of the clusters and the closeness of items within the clusters. The merging process using the dynamic model presented in this paper facilitates the discovery of natural and homogeneous clusters. The methodology of dynamic modeling of clusters used in Chameleon is applicable to all types of data as long as a similarity matrix can be constructed. We demonstrate the effectiveness of Chameleon on a number of data sets that contain points in 2D space and contain clusters of different shapes, densities, sizes, noise, and artifacts.
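As a rough illustration of the dynamic merge criterion, here is a minimal sketch (a hypothetical simplification with invented function names; Chameleon's actual formulation normalizes inter-connectivity using min-cut bisections and treats inter-connectivity and closeness as separate criteria) that merges two clusters only when their average cross-cluster similarity is high relative to the clusters' average internal similarity:

```python
def avg_cross_similarity(sim, a, b):
    # Average similarity between items of cluster a and items of cluster b.
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

def avg_internal_similarity(sim, c):
    # Average similarity over distinct item pairs within cluster c.
    pairs = [(i, j) for i in c for j in c if i < j]
    return sum(sim[i][j] for i, j in pairs) / len(pairs)

def relative_closeness(sim, a, b):
    # Cross-cluster closeness normalized by the clusters' internal closeness.
    internal = 0.5 * (avg_internal_similarity(sim, a) +
                      avg_internal_similarity(sim, b))
    return avg_cross_similarity(sim, a, b) / internal

def should_merge(sim, a, b, threshold=0.8):
    # Merge only when the pair is nearly as close as the clusters are internally.
    return relative_closeness(sim, a, b) >= threshold
```

Because the decision is relative to the clusters' own internal statistics rather than to a fixed global threshold, tight and loose clusters are judged on their own terms, which is the essence of the dynamic model.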
Experimental results on these data sets show that Chameleon can discover natural clusters that many existing state-of-the-art clustering algorithms fail to find.

Item: Centroid-Based Document Classification Algorithms: Analysis & Experimental Results (2000-03-06)
Han, Euihong; Karypis, George

In recent years we have seen a tremendous growth in the volume of online text documents available on the Internet, in digital libraries, in news sources, and on company-wide intranets. Automatic text categorization, the task of assigning text documents to pre-specified classes (topics or themes) of documents, is an important task that can help in both organizing and finding information in these huge resources. Text categorization presents unique challenges due to the large number of attributes present in the data set, the large number of training samples, and attribute dependencies. In this paper we focus on a simple linear-time centroid-based document classification algorithm that, despite its simplicity and robust performance, has not been extensively studied and analyzed. Our extensive experiments show that this centroid-based classifier consistently and substantially outperforms other algorithms, such as Naive Bayesian, k-nearest-neighbors, and C4.5, on a wide range of data sets. Our analysis shows that the similarity measure used by the centroid-based scheme allows it to classify a new document based on how closely its behavior, as measured by the average similarity between the documents, matches the behavior of the documents belonging to different classes. This matching allows it to dynamically adjust for classes with different densities. Furthermore, our analysis also shows that the similarity measure of the centroid-based scheme accounts for dependencies between the terms in the different classes.
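A minimal sketch of a centroid-based scheme of this kind (hypothetical function names; documents as sparse term-frequency dicts, classified by cosine similarity to per-class mean vectors):

```python
import math
from collections import defaultdict

def unit(vec):
    # Normalize a sparse term vector to unit length.
    n = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / n for t, v in vec.items()} if n else vec

def train_centroids(docs, labels):
    # Centroid of a class = average of its documents' unit-length vectors.
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, y in zip(docs, labels):
        for t, v in unit(vec).items():
            sums[y][t] += v
        counts[y] += 1
    return {y: {t: v / counts[y] for t, v in s.items()} for y, s in sums.items()}

def classify(centroids, vec):
    # Assign the class whose centroid is most cosine-similar to the document.
    v = unit(vec)
    def cos(c):
        dot = sum(v.get(t, 0.0) * w for t, w in c.items())
        norm = math.sqrt(sum(w * w for w in c.values()))
        return dot / norm if norm else 0.0
    return max(centroids, key=lambda y: cos(centroids[y]))
```

Training is a single pass over the corpus and classification is one similarity computation per class, which is what makes the scheme linear-time.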
We believe that this feature is the reason why it consistently outperforms other classifiers, which cannot take these dependencies into account.

Item: Clustering Based on Association Rule Hypergraphs (1997)
Han, Euihong; Karypis, George; Kumar, Vipin; Mobasher, Bamshad

Traditional clustering algorithms used in data mining for transactional databases are mainly concerned with grouping transactions, but they do not generally provide an adequate mechanism for grouping items found within these transactions. Item clustering, on the other hand, can be useful in many data mining applications. We propose a new method for clustering related items in transactional databases that is based on partitioning an association rule hypergraph, where each association rule defines a hyperedge. We also discuss some of the applications of item clustering, such as the discovery of meta-rules among item clusters and the clustering of transactions. We evaluated our scheme experimentally on data from a number of domains and, wherever applicable, compared it with AutoClass. In our experiment with stock-market data, our clustering scheme is able to successfully group stocks that belong to the same industry group. In the experiment with congressional voting data, this method is quite effective in finding clusters of transactions that correspond to either Democrat or Republican voting patterns. We found clusters of segments of protein-coding sequences from a protein-coding database that share the same functionality and are thus very valuable to biologists for determining the functionality of new proteins. We also found clusters of related words in documents retrieved from the World Wide Web (a common and important application in information retrieval).
These experiments demonstrate that our approach holds promise in a wide range of domains, and is much faster than traditional clustering algorithms such as AutoClass.

Item: Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval & Categorization (2000-03-06)
Karypis, George; Han, Euihong

In recent years, we have seen a tremendous growth in the volume of online text documents available on the Internet, in digital libraries, in news sources, and on company-wide intranets. This has led to an increased interest in developing methods that can efficiently retrieve relevant information. Retrieval techniques based on dimensionality reduction, such as latent semantic indexing (LSI), have been shown to improve the quality of the information being retrieved by capturing the latent meaning of the words present in the documents. Unfortunately, LSI is computationally expensive and cannot be used in a supervised setting. In this paper we present a new fast dimensionality reduction algorithm, called concept indexing (CI), that is based on document clustering. CI computes a k-dimensional representation of a collection of documents by first clustering the documents into k groups and then using the centroid vectors of the clusters to derive the axes of the reduced k-dimensional space. The low computational complexity of CI is achieved by using an almost-linear-time clustering algorithm. Furthermore, CI can be used to compute the dimensionality reduction in a supervised setting. Experimental results show that the dimensionality reduction computed by CI achieves retrieval performance comparable to that obtained using LSI, while requiring an order of magnitude less time.
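The projection step can be sketched as follows (a hypothetical simplification using dense vectors and invented names; the groups would come from a fast clustering algorithm, or directly from the class labels in the supervised setting):

```python
def centroid(vectors):
    # Component-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def concept_index(docs, groups):
    # Use the k group centroids as the axes of the reduced space and
    # represent each document by its dot product with each axis.
    axes = [centroid([docs[i] for i in g]) for g in groups]
    return [[sum(x * a for x, a in zip(doc, axis)) for axis in axes]
            for doc in docs]
```

Each document ends up as a k-dimensional vector of affinities to the k "concepts", which is far cheaper to compute than the singular value decomposition underlying LSI.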
Moreover, the supervised dimensionality reduction computed by CI greatly improved the classification accuracies of existing classification algorithms such as C4.5 and kNN.

Item: Content-Based Methods for Predicting Web-Site Demographic Attributes (2010-09-17)
Kabbur, Santosh; Han, Euihong; Karypis, George

Demographic information plays an important role in gaining valuable insights about a web-site's user base and is used extensively to target online advertisements and promotions. This paper investigates machine-learning approaches for predicting the demographic attributes of web-sites using information derived from their content and their hyperlinked structure, without relying on any information obtained directly or indirectly from the web-site's users. Such methods are important because users are becoming increasingly concerned about sharing their personal and behavioral information on the Internet. Regression-based approaches are developed and studied that utilize different content-derived features, different ways of building the prediction models, and different ways of aggregating web-page-level predictions that take into account the web's hyperlinked structure. In addition, a matrix-approximation-based approach is developed for coupling the predictions of individual regression models into a model designed to predict the probability mass function of the attribute. Extensive experiments show that these methods are able to achieve an RMSE of 8--10% and provide insights on how best to train and apply such models.

Item: Efficient Parallel Algorithms for Mining Associations (2001-01-26)
Joshi, Mahesh; Han, Euihong; Karypis, George; Kumar, Vipin

The problem of mining hidden associations present in large amounts of data has seen widespread applications in many practical domains such as customer-oriented planning and marketing, telecommunication network monitoring, and the analysis of data from scientific experiments.
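To make the notion of associations concrete, here is a minimal level-wise frequent-itemset sketch in the style of Apriori (a hypothetical simplification with invented names; the chapter itself surveys far more sophisticated serial and parallel formulations):

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    # Level-wise search: a (k+1)-itemset can be frequent only if every
    # one of its k-item subsets is frequent (the Apriori property).
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(survivors)
        # Join surviving k-itemsets pairwise and prune candidates whose
        # k-item subsets are not all frequent.
        cur = list(survivors)
        level = list({a | b for a in cur for b in cur
                      if len(a | b) == len(a) + 1
                      and all(frozenset(s) in survivors
                              for s in combinations(a | b, len(a)))})
    return frequent
```

The pruning step is what keeps the combinatorial search tractable on a single processor; the parallel formulations discussed in the chapter distribute the candidate counting and generation across processors.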
The combinatorial complexity of the problem has fascinated many researchers. Many elegant techniques, such as Apriori, have been developed to solve the problem on single-processor machines. However, most available datasets are becoming enormous in size, and their high dimensionality results in a possibly large number of mined associations. This strongly motivates the need for efficient and scalable parallel algorithms, and the design of such algorithms is challenging. In this chapter, we give an evolutionary and comparative review of many existing representative serial and parallel algorithms for discovering two kinds of associations. The first part of the chapter is devoted to non-sequential associations, which utilize the relationships between events that happen together. The second part is devoted to the more general and potentially more useful sequential associations, which utilize the temporal or sequential relationships between events. It is shown that many existing algorithms actually belong to a few categories determined by their broader design strategies. Overall, the focus of the chapter is to serve as a comprehensive account of the challenges and issues involved in effective parallel formulations of algorithms for discovering associations, and of how various existing algorithms try to handle them.

Item: Multilevel Refinement for Hierarchical Clustering (1999-05-17)
Karypis, George; Han, Euihong; Kumar, Vipin

Hierarchical methods are a well-known clustering technique that can be potentially very useful for various data mining tasks. A hierarchical clustering scheme produces a sequence of clusterings in which each clustering is nested into the next clustering in the sequence. Since hierarchical clustering is a greedy algorithm based on local search, the merging decisions made early in the agglomerative process are not necessarily the right ones.
One possible solution to this problem is to refine a clustering produced by the agglomerative hierarchical algorithm to potentially correct the mistakes made early in the agglomerative process. The problem of refining a clustering has many similarities with that of refining a min-cut k-way partitioning of a graph. In this paper, we explore multilevel refinement schemes for refining and improving the clusterings produced by hierarchical agglomerative clustering. This algorithm combines traditional hierarchical clustering with multilevel refinement, which has been found to be very effective for computing min-cut k-way partitionings of graphs. We consider several clustering objective functions for the proposed refinement step and investigate their usefulness. Our experimental results demonstrate that this algorithm produces clustering solutions that are consistently and significantly better than those produced by hierarchical clustering algorithms alone. Furthermore, our algorithm has the additional advantage of being extremely fast, as it operates on a sparse similarity matrix. The amount of time required by our algorithm ranged from two seconds for a data set with 358 items to 80 seconds for a data set with 9133 items on a Pentium II PC.

Item: Parallel Algorithm Scalability Issues in PetaFLOPS Architectures (2001-01-26)
Grama, Ananth; Gupta, Anshul; Han, Euihong; Kumar, Vipin

The projected design space of petaFLOPS architectures entails the exploitation of very large degrees of concurrency, locality of data access, and tolerance to latency. This puts considerable pressure on the design of parallel algorithms capable of effectively utilizing increasing amounts of processing resources in a memory- and bandwidth-constrained environment.
This aspect of algorithm design, also referred to as scalability analysis, is a key component for guiding algorithm designers as well as hardware architects. By identifying bottlenecks to scalability and the machine parameters that influence them, scalability analysis provides insights into alleviating those bottlenecks in the context of the specific algorithm. In this paper, we motivate the need for, and benefits of, scalability analysis in the context of petaFLOPS systems. We overview various scalability metrics and study their suitability for petaFLOPS systems. We also present sample analyses of selected computational kernels from dense linear algebra, fast Fourier transforms, and data-intensive applications (association rule mining). The objective of this analysis is to demonstrate the analysis framework and its use in identifying desirable architectural features, as well as the ability of these selected kernels to scale to petaFLOPS systems.

Item: Parallel Algorithms in Data Mining (2001-01-26)
Joshi, Mahesh; Han, Euihong; Karypis, George; Kumar, Vipin

Recent times have seen an explosive growth in the availability of various kinds of data. This has resulted in an unprecedented opportunity to develop automated data-driven techniques for extracting useful knowledge. Data mining, an important step in this process of knowledge discovery, consists of methods that discover interesting, non-trivial, and useful patterns hidden in the data. To date, the primary driving force behind research in data mining has been the development of algorithms for data sets arising in various business, information retrieval, and financial applications. Due to the latest technological advances, very large data sets are becoming available in many scientific disciplines as well.
The rate of production of such data sets far outstrips the ability to analyze them manually. Data mining techniques hold great promise for developing new sets of tools that can be used to automatically analyze these massive data sets, and thus help engineers and scientists unravel the causal relationships in the underlying mechanisms of dynamic physical processes. The huge size of the available data sets and their high dimensionality make large-scale data mining applications computationally very demanding, to the extent that high-performance parallel computing is fast becoming an essential component of the solution. Moreover, the quality of the data mining results often depends directly on the amount of computing resources available. In fact, data mining applications are poised to become the dominant consumers of supercomputing in the near future. There is thus a need to develop effective parallel algorithms for various data mining techniques. However, designing such algorithms is challenging. In this paper, we describe the parallel formulations of two important data mining algorithms: discovery of association rules, and induction of decision trees for classification.

Item: Personalized Profile Based Search Interface With Ranked and Clustered Display (2001-06-01)
Kumar, Sachin; Uygar Oztekin, B.; Ertoz, Levent; Singhal, Saurabh; Han, Euihong; Kumar, Vipin

We have developed an experimental meta-search engine that takes the snippets from traditional search engines and presents them to the user in the form of clusters, indices, or a re-ranked list, optionally based on the user's profile. The system also allows the user to give positive or negative feedback on the documents, clusters, and indices. The architecture allows different algorithms for each of the features to be plugged in easily, i.e.,
various clustering, indexing, and relevance feedback algorithms, and profiling methods.

Item: Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification (1999-05-17)
Han, Euihong; Karypis, George; Kumar, Vipin

Text categorization is the task of deciding whether a document belongs to a set of prespecified classes of documents. Automatic classification schemes can greatly facilitate this process. Categorization of documents is challenging, as the number of discriminating words can be very large. Many existing algorithms simply would not work with this many features. k-nearest neighbor (k-NN) classification is an instance-based learning algorithm that has been shown to be very effective for a variety of problem domains, including documents. The key element of this scheme is the availability of a similarity measure capable of identifying the neighbors of a particular document. A major drawback of the similarity measure used in k-NN is that it uses all features in computing distances. In many document data sets, only a small fraction of the total vocabulary may be useful in categorizing documents. A possible approach to overcoming this problem is to learn weights for different features (or words, in document data sets). In this paper, we propose the Weight Adjusted k-Nearest Neighbor (WAKNN) classification algorithm, which is based on the k-NN classification paradigm. In WAKNN, the weights of features are learned using an iterative algorithm. In the weight adjustment step, the weight of each feature is perturbed in small steps to see if the change improves the classification objective function. The feature with the most improvement in the objective function is identified and the corresponding weight is updated. The feature weights are used in the similarity measure computation such that important features contribute more to the similarity measure.
Experiments on several real-life document data sets show the promise of WAKNN, as it outperforms state-of-the-art classification algorithms such as C4.5, RIPPER, Rainbow, PEBLS, and VSM.
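A weight-adjustment loop of this flavor can be sketched roughly as follows (a hypothetical simplification with invented names: leave-one-out 1-NN accuracy stands in for the paper's actual objective function, and weights are rescaled multiplicatively rather than perturbed in small steps):

```python
import math

def weighted_cosine(a, b, w):
    # Cosine similarity after scaling each feature by its weight.
    wa = [x * wi for x, wi in zip(a, w)]
    wb = [x * wi for x, wi in zip(b, w)]
    dot = sum(x * y for x, y in zip(wa, wb))
    na = math.sqrt(sum(x * x for x in wa))
    nb = math.sqrt(sum(x * x for x in wb))
    return dot / (na * nb) if na and nb else 0.0

def loo_accuracy(X, y, w):
    # Leave-one-out 1-NN accuracy: the stand-in classification objective.
    hits = 0
    for i in range(len(X)):
        nearest = max((j for j in range(len(X)) if j != i),
                      key=lambda j: weighted_cosine(X[i], X[j], w))
        hits += y[nearest] == y[i]
    return hits / len(X)

def adjust_weights(X, y, rounds=5, factor=2.0):
    # Each round, try rescaling every single feature weight up or down and
    # keep the one change that most improves the objective; stop early
    # when no change helps.
    w = [1.0] * len(X[0])
    for _ in range(rounds):
        best, best_acc = None, loo_accuracy(X, y, w)
        for f in range(len(w)):
            for m in (factor, 1.0 / factor):
                trial = list(w)
                trial[f] *= m
                acc = loo_accuracy(X, y, trial)
                if acc > best_acc:
                    best, best_acc = trial, acc
        if best is None:
            break
        w = best
    return w
```

Because a change is accepted only when the objective improves, the learned weights can never score worse than uniform weights under this objective; noisy features are driven toward small weights so they contribute little to the similarity measure.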