Browsing by Author "Karypis, George"
Now showing 1 - 20 of 147
Item: A Comparison of Document Clustering Techniques (2000-05-23)
Steinbach, Michael; Karypis, George; Kumar, Vipin

This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. (For K-means we used a "standard" K-means algorithm and a variant of K-means, "bisecting" K-means.) Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. We propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document data.

Item: A Generalized Framework for Protein Sequence Annotation (2007-10-15)
Rangwala, Huzefa; Kauffman, Christopher; Karypis, George

Over the last decade several data mining techniques have been developed for determining structural and functional properties of individual protein residues using sequence and sequence-derived information. These protein residue annotation problems are often formulated as either classification or regression problems and solved using a common set of techniques. We develop a generalized protein sequence annotation toolkit (prosat) for solving classification or regression problems using support vector machines.
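The bisecting K-means variant compared in the Steinbach et al. abstract above works by repeatedly splitting one existing cluster in two with a 2-means step until the desired number of clusters is reached. A minimal sketch in Python with NumPy; the function names and the "always split the largest cluster" policy are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np

def two_means(points, n_iter=20, seed=0):
    """Split one cluster into two with a basic 2-means (Lloyd) loop."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=2, replace=False)]
    for _ in range(n_iter):
        # assign every point to the nearer of the two centers
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in (0, 1):
            if (labels == k).any():
                centers[k] = points[labels == k].mean(axis=0)
    return labels

def bisecting_kmeans(points, k, seed=0):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [np.arange(len(points))]       # index sets, one per cluster
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()              # split the largest cluster
        labels = two_means(points[biggest], seed=seed)
        clusters += [biggest[labels == 0], biggest[labels == 1]]
    out = np.empty(len(points), dtype=int)
    for ci, idx in enumerate(clusters):
        out[idx] = ci
    return out
```

Because each bisection only runs 2-means on one cluster, the cost stays linear in the number of documents, which is the scalability argument the abstract makes.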
The key characteristic of our method is its effective use of window-based information to capture the local environment of a protein sequence residue. This window information is used with several kernel functions available within our framework. We show the effectiveness of using the previously developed normalized second order exponential kernel function and experiment with local window-based information at different levels of granularity. We report empirical results on a diverse set of classification and regression problems: prediction of solvent accessibility, secondary structure, local structure alphabet, transmembrane helices, DNA-protein interaction sites, contact order, and regions of disorder are all explored. Our methods show either comparable or superior results to several state-of-the-art application-tuned prediction methods for these problems. prosat provides practitioners an efficient and easy-to-use tool for a wide variety of annotation problems. The results of some of these predictions can be used to assist in solving the overarching 3D structure prediction problem.

Item: A Hierarchical Clustering Algorithm Using Dynamic Modeling (1999-03-09)
Karypis, George; Han, Euihong; Kumar, Vipin

Clustering in data mining is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized. Existing clustering algorithms, such as K-means, PAM, CLARANS, DBSCAN, CURE, and ROCK, are designed to find clusters that fit some static model. These algorithms can break down if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of clusters. Furthermore, most of these algorithms break down when the data consists of clusters that are of diverse shapes, densities, and sizes.
In this paper, we present a novel hierarchical clustering algorithm called Chameleon that measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the inter-connectivity and closeness (proximity) between the two clusters are high relative to the internal inter-connectivity of the clusters and the closeness of items within the clusters. The merging process using the dynamic model presented in this paper facilitates the discovery of natural and homogeneous clusters. The methodology of dynamic modeling of clusters used in Chameleon is applicable to all types of data as long as a similarity matrix can be constructed. We demonstrate the effectiveness of Chameleon on a number of data sets that contain points in 2D space and contain clusters of different shapes, densities, sizes, noise, and artifacts. Experimental results on these data sets show that Chameleon can discover natural clusters that many existing state-of-the-art clustering algorithms fail to find.

Item: A Medium-Grained Algorithm for Distributed Sparse Tensor Factorization (2016-05-02)
Smith, Shaden; Karypis, George

Modeling multi-way data can be accomplished using tensors, which are data structures indexed along three or more dimensions. Tensors are increasingly used to analyze extremely large and sparse multi-way datasets in life sciences, engineering, and business. The canonical polyadic decomposition (CPD) is a popular tensor factorization for discovering latent features and is most commonly found via the method of alternating least squares (CPD-ALS). The computational time and memory required to compute the CPD limit the size and dimensionality of the tensors that can be solved on a typical workstation, making distributed solution approaches the only viable option.
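The CPD-ALS iteration named in the Smith and Karypis abstract above has a standard textbook form: each factor matrix is updated in turn by solving a linear least-squares problem against an unfolding of the tensor. A dense, single-node NumPy sketch of that textbook iteration (not the authors' distributed, sparse implementation; the function names are assumptions):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (I x R) and (J x R) -> (I*J x R)."""
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def cp_als(X, rank, n_iter=300, seed=0):
    """Rank-R canonical polyadic decomposition of a dense 3-way tensor
    via alternating least squares (the textbook CPD-ALS iteration)."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    X1 = X.reshape(I, -1)                      # mode-1 unfolding
    X2 = np.moveaxis(X, 1, 0).reshape(J, -1)   # mode-2 unfolding
    X3 = np.moveaxis(X, 2, 0).reshape(K, -1)   # mode-3 unfolding
    for _ in range(n_iter):
        # each update is a linear least-squares solve for one factor
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

def reconstruct(A, B, C):
    """Rebuild the tensor from its factor matrices."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)
```

The distributed algorithms the abstract compares differ in how the tensor nonzeros and the factor rows are assigned to nodes, not in this core iteration.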
Most methods for distributed-memory systems have focused on distributing the tensor in a coarse-grained, one-dimensional fashion that prohibitively requires the dense matrix factors to be fully replicated on each node. Recent work overcomes this limitation by using a fine-grained decomposition of the tensor nonzeros, at the cost of computationally expensive hypergraph partitioning. To that end, we present a medium-grained decomposition that avoids complete factor replication and communication, while eliminating the need for expensive pre-processing steps. We use a hybrid MPI+OpenMP implementation that exploits multi-core architectures with a low memory footprint. We theoretically analyze the scalability of the coarse-, medium-, and fine-grained decompositions and experimentally compare them across a variety of datasets. Experiments show that the medium-grained decomposition reduces communication volume by 36-90% compared to the coarse-grained decomposition, is 41-76x faster than a state-of-the-art MPI code, and is 1.5-5.0x faster than the fine-grained decomposition with 1024 cores.

Item: A New Algorithm for Multi-objective Graph Partitioning (1999-02-15)
Kumar, Vipin; Karypis, George; Schloegel, Kirk

Recently, a number of graph partitioning applications have emerged with additional requirements that the traditional graph partitioning model alone cannot effectively handle. One such class of problems is those in which multiple objectives, each of which can be modeled as a sum of weights of the edges of a graph, must be simultaneously optimized. This class of problems must be solved utilizing a multi-objective graph partitioning algorithm. In this paper, we describe a new algorithm for computing partitionings for multi-objective graphs. We explain how this scheme is able to handle the class of problems in which the objectives represent similar quantities, as well as the class of problems in which the objectives represent dissimilar quantities.
We show that our multi-objective graph partitioning algorithm is better able to compute partitionings based on a user-supplied preference vector than partitioning with respect to a single objective only. We also show that by modifying the input preference vector, the multi-objective graph partitioning algorithm is able to gracefully trade off increases in one or more objectives for decreases in other objectives. This gives the user fine-tuned control of the tradeoffs among the objectives.

Item: A Novel Regression Model Combining Instance Based Rule Mining With EM Algorithm (2013-04-01)
Jiang, Zhonghua; Karypis, George

In recent years, there have been increasing efforts to apply association rule mining to build Associative Classification (AC) models. However, the similar area that applies association rule mining to build Associative Regression (AR) models has not been well explored. In this work, we fill this gap by presenting a novel regression model based on association rules called AREM. AREM derives a set of regression rules by: (i) applying an instance based approach to mine itemsets which form the regression rules' left hand side, and (ii) developing a probabilistic model which determines, for each mined itemset, the corresponding rule's right hand side and the importance weight. To address the computational bottleneck of the traditional two-step approach for itemset mining, AREM utilizes an Instance-Based Itemset Miner (IBIMiner) algorithm that directly discovers the final set of itemsets. IBIMiner incorporates various methods to bound the quality of any future extensions of the itemset under consideration. These bounds are then used to prune the search space. In addition, AREM treats the regression rules' right hand sides and importance weights as parameters of a probabilistic model, which are learned in the expectation-maximization (EM) framework.
The extensive experimental evaluation shows that our bounding strategies allow IBIMiner to considerably reduce the runtime and that the EM optimization can dramatically improve the predictive performance. We also show that our model can perform better than some of the state-of-the-art regression models.

Item: A Parallel Hill-Climbing Refinement Algorithm for Graph Partitioning (2015-11-25)
LaSalle, Dominique; Karypis, George

Graph partitioning is an important step in distributing workloads on parallel compute systems, sparse matrix re-ordering, and VLSI circuit design. Producing high quality graph partitionings while effectively utilizing available CPU power is becoming increasingly challenging due to the rising number of cores per processor. This not only increases the amount of parallelism required of the partitioner, but also the degree of the partitionings it is to generate. In this work we present a new shared-memory parallel k-way method of refining an existing partitioning that can break out of local minima. Our method matches the quality of current high-quality serial refinement methods, and achieves speedups of 5.7-16.7x using 24 threads, while exhibiting only 0.52% higher edgecuts than when run serially. This is 6.3x faster than other parallel refinement methods.

Item: A Scalable Algorithm for Clustering Sequential Data (2001-08-16)
Guralnik, Valerie; Karypis, George

In recent years, we have seen an enormous growth in the amount of available commercial and scientific data. Data from domains such as protein sequences, retail transactions, intrusion detection, and web-logs have an inherent sequential nature. Clustering of such data sets is useful for various purposes. For example, clustering of sequences from commercial data sets may help marketers identify different customer groups based upon their purchasing patterns. Grouping protein sequences that share similar structure helps in identifying sequences with similar functionality.
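The refinement described in the LaSalle and Karypis abstract above is built on vertex moves that reduce the edgecut of an existing partitioning. A serial, two-way, purely greedy sketch of that move-based idea (the paper's method is parallel, k-way, and hill-climbing, i.e. able to accept temporarily worsening moves to escape local minima; the names and the simple balance rule here are assumptions):

```python
def edgecut(adj, part):
    """Count edges whose endpoints fall in different parts (each edge seen twice)."""
    return sum(part[u] != part[v] for u in adj for v in adj[u]) // 2

def greedy_refine(adj, part, max_imbalance=2):
    """Repeatedly move single vertices across the cut while the edgecut drops
    and the two sides stay within max_imbalance vertices of each other."""
    sizes = [len(part) - sum(part), sum(part)]
    improved = True
    while improved:
        improved = False
        for u in adj:
            here = part[u]
            external = sum(part[v] != here for v in adj[u])  # cut edges at u now
            internal = len(adj[u]) - external                # cut edges after a move
            gain = external - internal
            new_diff = abs((sizes[1 - here] + 1) - (sizes[here] - 1))
            if gain > 0 and new_diff <= max_imbalance:
                part[u] = 1 - here
                sizes[here] -= 1
                sizes[1 - here] += 1
                improved = True
    return part
```

On two triangles joined by a bridge edge with an alternating initial assignment, this pass drives the edgecut from 5 down to the optimal 1 while keeping the sides balanced; accepting only positive-gain moves is exactly what traps such refiners in local minima on harder instances.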
Over the years, many methods have been developed for clustering objects according to their similarity. However, these methods tend to have a computational complexity that is at least quadratic in the number of sequences, as they need to compute the pairwise similarity between all the sequences. In this paper we present an entirely different approach to sequence clustering that does not require an all-against-all analysis and uses a near-linear complexity $K$-means based clustering algorithm. Our experiments using data sets derived from sequences of purchasing transactions and protein sequences show that this approach is scalable and leads to reasonably good clusters.

Item: A Segment-based Approach To Clustering Multi-Topic Documents (2008-01-31)
Tagarelli, Andrea; Karypis, George

Document clustering has been recognized as a central problem in text data management, and it becomes particularly challenging when documents have multiple topics. In this paper we address the problem of multi-topic document clustering by leveraging the natural composition of documents in text segments, which bear one or more topics on their own. We propose a segment-based document clustering framework, which is designed to induce a classification of documents starting from the identification of cohesive groups of segment-based portions of the original documents. We empirically give evidence of the significance of our approach on different, large collections of multi-topic documents.

Item: A Unified Algorithm for Load-Balancing Adaptive Scientific Simulations (2000-05-22)
Schloegel, Kirk; Karypis, George; Kumar, Vipin

Adaptive scientific simulations require that periodic repartitioning occur dynamically throughout the course of the simulation. The computed repartitionings should minimize both the inter-processor communications incurred during the iterative mesh-based computation and the data redistribution costs required to balance the load.
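One way to realize the Guralnik and Karypis abstract's idea of clustering sequences without an all-against-all comparison is to embed each sequence as a fixed-length k-mer frequency vector; any linear-time K-means implementation can then cluster the vectors directly. The helper below is an illustrative assumption, not the paper's exact feature construction:

```python
import numpy as np
from itertools import product

def kmer_vector(seq, alphabet="ACGT", k=2):
    """Map a sequence to a length-normalized k-mer frequency vector.
    Distances between these fixed-length vectors stand in for sequence
    similarity, so no pairwise comparison of sequences is ever computed."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    v = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        v[index[seq[i:i + k]]] += 1.0
    return v / max(1, len(seq) - k + 1)
```

Building one vector per sequence and running K-means on them keeps the total cost near-linear in the number of sequences, in contrast to the quadratic all-pairs approaches the abstract criticizes.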
Recently developed schemes for computing repartitionings provide the user with only limited control of the tradeoffs among these objectives. This paper describes a new Unified Repartitioning Algorithm that can gracefully trade off one objective for the other depending upon a user-defined parameter describing the relative costs of these objectives. We show that the Unified Repartitioning Algorithm is able to minimize the precise overheads associated with repartitioning as well as or better than other repartitioning schemes for a variety of problems, regardless of the relative costs of performing inter-processor communication and data redistribution. Our experiments show that the Unified Repartitioning Algorithm is extremely fast and scalable to very large problems.

Item: A Universal Formulation of Sequential Patterns (1999-05-18)
Joshi, Mahesh; Karypis, George; Kumar, Vipin

This report outlines a more general formulation of sequential patterns, which unifies the generalized patterns proposed by Srikant and Agrawal [SA96] and the episode discovery approach taken by Mannila et al. [MTV97]. We show that just by varying the values of timing constraint parameters and counting methods, our formulation can be made identical to either one of these. Furthermore, our approach defines several other counting methods which could be suitable for various applications. The algorithm used to discover these universal sequential patterns is based on a modification of the GSP algorithm proposed in [SA96]. Some of these modifications are made to take care of the newly introduced timing constraints and pattern restrictions, whereas some modifications are made for performance reasons.
In the end, we present an application which illustrates the deficiencies of current approaches that can be overcome by the proposed universal formulation.

Item: Accelerating the Tucker Decomposition with Compressed Sparse Tensors (2017-09-04)
Smith, Shaden; Karypis, George

The Tucker decomposition is a higher-order analogue of the singular value decomposition and is a popular method of performing analysis on multi-way data (tensors). Computing the Tucker decomposition of a sparse tensor is demanding in terms of both memory and computational resources. The primary kernel of the factorization is a chain of tensor-matrix multiplications (TTMc). State-of-the-art algorithms accelerate the underlying computations by trading off memory to memoize the intermediate results of TTMc in order to reuse them across iterations. We present an algorithm based on a compressed data structure for sparse tensors and show that many computational redundancies during TTMc can be identified and pruned without the memory overheads of memoization. In addition, our algorithm can further reduce the number of operations by exploiting an additional amount of user-specified memory. We evaluate our algorithm on a collection of real-world and synthetic datasets and demonstrate up to 20.7x speedup while using 28.5x less memory than the state-of-the-art parallel algorithm.

Item: Accounting for Language Changes over Time in Document Similarity Search (2015-07-07)
Morsy, Sara; Karypis, George

Given a query document, ranking the documents in a collection based on how similar they are to the query is an essential task with extensive applications. For collections that contain documents whose creation dates span several decades, this task is further complicated by the fact that the language changes over time. For example, many terms add or lose one or more senses to meet people's evolving needs.
To address this problem, we present methods that take advantage of two types of information in order to account for the language change. The first is the citation network that often exists within the collection, which can be used to link related documents with significantly different creation dates (and hence different language use). The second is the changes in the usage frequency of terms that occur over time, which can indicate changes in their senses and uses. These methods utilize the above information while estimating the representation of both documents and terms within the context of non-probabilistic static and dynamic topic models. Our experiments on two real-world datasets that span more than 40 years show that our proposed methods improve the retrieval performance of existing models and that these improvements are statistically significant.

Item: Acyclic Subgraph based Descriptor Spaces for Chemical Compound Retrieval and Classification (2006-03-20)
Wale, Nikil; Karypis, George

In recent years the development of computational techniques that build models to correctly assign chemical compounds to various classes or to retrieve potential drug-like compounds has been an active area of research. These techniques are used extensively at various phases during the drug development process. Many of the best-performing techniques for these tasks utilize a descriptor-based representation of the compound that captures various aspects of the underlying molecular graph's topology. In this paper we introduce and describe algorithms for efficiently generating a new set of descriptors that are derived from all connected acyclic fragments present in the molecular graphs. In addition, we introduce an extension to existing vector-based kernel functions to take into account the length of the fragments present in the descriptors.
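Fragment-based descriptor spaces of the kind described in the Wale and Karypis abstract above are built by enumerating labeled fragments of a molecular graph. A sketch of the simplest, path-only case (the PF topology mentioned elsewhere in this listing); the full method enumerates all connected acyclic fragments (trees), and the naming and canonicalization here are illustrative assumptions:

```python
def path_fragments(adj, labels, max_len=3):
    """Enumerate the label strings of all simple paths with up to max_len edges.
    Each fragment becomes one dimension of a descriptor vector; a path and its
    reverse are the same fragment, so we keep the lexicographically smaller form."""
    found = set()

    def walk(path):
        frag = "-".join(labels[v] for v in path)
        rev = "-".join(labels[v] for v in reversed(path))
        found.add(min(frag, rev))
        if len(path) <= max_len:          # path still short enough to extend
            for nxt in adj[path[-1]]:
                if nxt not in path:       # simple paths only: no repeated atoms
                    walk(path + [nxt])

    for v in adj:
        walk([v])
    return found
```

A compound is then represented by which fragments it contains (or how often), and those vectors feed the SVM classification and ranked-retrieval experiments the abstract describes.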
We experimentally evaluate the performance of the new descriptors in the context of SVM-based classification and ranked retrieval on 28 classification and retrieval problems derived from 17 datasets. Our experiments show that for both the classification and retrieval tasks, these new descriptors consistently and statistically significantly outperform previously developed schemes based on the widely used fingerprint- and Maccs keys-based descriptors, as well as recently introduced descriptors obtained by mining and analyzing the structure of the molecular graphs.

Item: Affinity-based Structure-Activity-Relationship Models: Improving Structure-Activity-Relationship Models by Incorporating Activity Information from Related Targets (2009-05-29)
Ning, Xia; Rangwala, Huzefa; Karypis, George

Structure-activity-relationship (SAR) models are used to inform and guide the iterative optimization of chemical leads, and play a fundamental role in modern drug discovery. In this paper we present a new class of methods for building SAR models, referred to as affinity-based, that utilize activity information from different targets. These methods first identify a set of targets that are related to the target under consideration, and then they employ various machine-learning techniques that utilize activity information from these targets in order to build the desired SAR model. We developed different methods for identifying the set of related targets, which take into account the primary sequence of the targets or the structure of their ligands, and we also developed different machine learning techniques that were derived by using principles of semi-supervised learning, multi-task learning, and classifier ensembles. The comprehensive evaluation of these methods shows that they lead to considerable improvements over the standard SAR models that are based only on the ligands of the target under consideration.
On a set of 117 protein targets obtained from PubChem, these affinity-based methods achieve an ROC score that is on average 7.0%-7.2% higher than that achieved by the standard SAR models. Moreover, on a set of targets belonging to six protein families, the affinity-based methods outperform chemogenomics-based approaches by 4.33%.

Item: AFGEN2.0 (2008-06-24)
Karypis, George

AFGEN2.0 is a program that generates descriptor spaces for chemical compound(s). The descriptor space consists of graph fragments that can have three different types of topologies: paths (PF), acyclic subgraphs (AF), and arbitrary topology subgraphs (GF).

Item: Algorithms for Mining the Coevolving Relational Motifs in Dynamic Networks (2014-03-24)
Ahmed, Rezwan; Karypis, George

Computational methods and tools that can efficiently and effectively analyze the temporal changes in dynamic complex relational networks enable us to gain significant insights regarding the entity relations and their evolution. This paper introduces a new class of dynamic graph patterns, referred to as coevolving relational motifs (CRMs), which are designed to identify recurring sets of entities whose relations change in a consistent way over time. Coevolving relational motifs can provide evidence of the existence of, possibly unknown, coordination mechanisms by identifying the relational motifs that evolve in a similar and highly conserved fashion. We developed an algorithm to efficiently analyze the frequent relational changes between the entities of the dynamic networks and capture all frequent coevolutions as CRMs. Our algorithm follows a depth-first exploration of the frequent CRM lattice and incorporates canonical labeling for redundancy elimination. Experimental results based on multiple real-world dynamic networks show that the method is able to efficiently identify CRMs.
In addition, a qualitative analysis of the results shows that the discovered patterns can be used as features to characterize the dynamic network.

Item: An Analysis of Information Content Present in Protein-DNA Interactions (2007-09-11)
Kauffman, Christopher; Karypis, George

Understanding the role proteins play in regulating DNA replication is essential to forming a complete picture of how the genome manifests itself. In this work, we examine the feasibility of predicting the residues of a protein essential to binding by analyzing protein-DNA interactions from an information theoretic perspective. Through the lens of mutual information, we explore which properties of protein sequence and structure are most useful in determining binding residues, with a particular focus on sequence features. We find that the quantity of information carried in most features is small with respect to DNA-contacting residues, the bulk being provided by sequence features along with a select few structural features. Supplemental information for this article is available at http://www.cs.umn.edu/~kauffman/supplements/psb2008

Item: An Efficient Algorithm for Discovering Frequent Subgraphs (2002-06-25)
Kuramochi, Michihiro; Karypis, George

Over the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are being increasingly applied to non-traditional domains, existing frequent pattern discovery approaches cannot be used. This is because the transaction framework that is assumed by these algorithms cannot be used to effectively model the datasets in these domains. An alternate way of modeling the objects in these datasets is to represent them using graphs. Within that model, the problem of finding frequent patterns becomes that of discovering subgraphs that occur frequently over the entire set of graphs.
In this paper we present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph databases. We experimentally evaluate the performance of FSG using a variety of real and synthetic datasets. Our results show that despite the underlying complexity associated with frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in datasets containing over 100,000 graph transactions and scales linearly with respect to the size of the database.

Item: Analysis of Recommendation Algorithms for E-Commerce (2000-09-19)
Sarwar, Badrul; Karypis, George; Konstan, Joseph; Riedl, John T.

Recommender systems apply statistical and knowledge discovery techniques to the problem of making product recommendations during a live customer interaction, and they are achieving widespread success in E-Commerce today. In this paper, we investigate several techniques for analyzing large-scale purchase and preference data for the purpose of producing useful recommendations to customers. In particular, we apply a collection of algorithms such as traditional data mining, nearest-neighbor collaborative filtering, and dimensionality reduction on two different data sets. The first data set was derived from the web-purchasing transactions of a large E-commerce company, whereas the second data set was collected from the MovieLens movie recommendation site. For experimental purposes, we divide the recommendation generation process into three sub-processes: representation of input data, neighborhood formation, and recommendation generation. We devise different techniques for the different sub-processes and apply their combinations on our data sets to compare recommendation quality and performance.
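The three sub-processes named in the Sarwar et al. abstract above (input representation, neighborhood formation, recommendation generation) can be sketched with user-based nearest-neighbor collaborative filtering. The ratings matrix, function names, and choice of cosine similarity are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def cosine_sim(M):
    """Representation: rows of M are user rating vectors (0 = unrated);
    return the pairwise cosine similarity between users."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    U = M / norms
    return U @ U.T

def predict(M, user, item, k=2):
    """Neighborhood formation + recommendation generation: score an unrated
    item as the similarity-weighted average rating of the k most similar
    users who have rated it."""
    sims = cosine_sim(M)[user]
    raters = [u for u in range(len(M)) if u != user and M[u, item] > 0]
    raters.sort(key=lambda u: -sims[u])      # nearest neighbors first
    top = raters[:k]
    if not top:
        return 0.0
    w = sims[top]
    return float(w @ M[top, item] / w.sum()) if w.sum() > 0 else 0.0
```

Items with the highest predicted scores for a user become that user's recommendation list; varying the representation (e.g. dimensionality reduction before the similarity step) and the neighborhood rule is exactly the kind of combination the abstract's experiments compare.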