Browsing by Author "Han, Eui-Hong"
Now showing 1 - 3 of 3
- Results Per Page
- Sort Options
Item An Efficient, Scalable, Parallel Classifer for Data Mining(1997) Srivastava, Anurag; Singh, Vineet; Han, Eui-Hong; Kumar, VipinClassification is an important data mining problem. Recently, there has been significant interest in classification using training datasets that are large enough that they do not fit in main memory and need to be disk-resident. Although training data can be reduced by sampling, it has been shown that it can be advantageous to use the entire training dataset since that can increase accuracy. Most current algorithms are unsui:table for large disk-resident datasets because their space and time complexities (including I/0) are prohibitive. A recent algorithm called SPRINT promises to alleviate some of the data size restrictions. We present a new algorithm called SPEC that provides similar accuracy, reduces I/0, reduces memory requirements, and improves scalability (time and space) on both sequential and parallel computers. We provide some theoretical results as well as experimental results on the IBM SP2.Item Clustering in a High-Dimensional Space Using Hypergraph Models(1997) Han, Eui-Hong; Karypis, George; Kumar, Vipin; Mobasher, BamshadClustering of data in a large dimension space is of a great interest in many data mining applications. Most of the traditional algorithms such as K-means or AutoCJass fail to produce meaningful clusters in such data sets even when they are used with well known dimensionality reduction techniques such as Principal Component Analysis and Latent Semantic Indexing. In this paper, we propose a method for clustering of data in a high dimensional space based on a hypergraph model. The hypergraph model maps the relationship present in the original data in high dimensional space into a hypergraph. A hyperedge ;epresents a relationship (affinity) among subsets of data and the weight of the hyperedge reflects the strength of this affinity. A hypergraph partitioning algorithm is used to find a partitioning of the vertices such that the corresponding data items in each partition are highly related and the weight of the hyperedges cut by the partitioning is minimized. We present results of experiments on three different data sets: S&PSOO stock data for the period of 1994-1996, protein coding data, and Web document data. Wherever aplicable, we compared our results with those of AutoClass and K-means clustering algorithm on original data as well as on the reduced dimensionality data obtained via Principal Component Analysis or Latent Semantic Indexing scheme. These experiments demonstrate that our approach is applicable and effective in a wide range of domains. More specifically, our approach performed much better than traditional schemes [or high dimensional data sets in terms of quality of clusters and runtime. Our approach was also able to filter out noise data from the clusters very effectively without compromising the quaJity of the clusters.Item Min-Apriori: An Algorithm for Finding Association Rules in Data with Continuous Attributes(1997) Han, Eui-Hong; Karypis, George; Kumar, Vipin