Browsing by Author "Steinbach, Michael"

Now showing 1 - 20 of 33

A Comparison of Document Clustering Techniques
(2000-05-23) Steinbach, Michael; Karypis, George; Kumar, Vipin
This paper presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and K-means. (For K-means we used a "standard" K-means algorithm and a variant of K-means, "bisecting" K-means.) Hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. In contrast, K-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. Sometimes K-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." However, our results indicate that the bisecting K-means technique is better than the standard K-means approach and as good or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. We propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document data.
A Computationally Efficient and Statistically Powerful Framework for Searching High-order Epistasis with Systematic Pruning and Gene-set Constraints
(2010-06-21) Fang, Gang; Haznadar, Majda; Wang, Wen; Steinbach, Michael; Van Ness, Brian; Kumar, Vipin
This paper has not yet been submitted.
A Data Mining Framework for Forest Fire Mapping
(2012-03-29) Karpatne, Anuj; Chen, Xi; Chamber, Yashu; Mithal, Varun; Lau, Michael; Steinhaeuser, Karsten; Boriah, Shyam; Steinbach, Michael; Kumar, Vipin
Forests are an important natural resource that support economic activity and play a significant role in regulating the climate and the carbon cycle, yet forest ecosystems are increasingly threatened by fires caused by a range of natural and anthropogenic factors. Mapping these fires, which can range in size from less than an acre to hundreds of thousands of acres, is an important task for supporting climate and carbon cycle studies as well as informing forest management. There are two primary approaches to fire mapping: field and aerial-based surveys, which are costly and limited in their extent; and remote sensing-based approaches, which are more cost-effective but pose several interesting methodological and algorithmic challenges. In this paper, we introduce a new framework for mapping forest fires based on satellite observations. Specifically, we develop spatio-temporal data mining methods for Moderate Resolution Imaging Spectroradiometer (MODIS) data to generate a history of forest fires. A systematic comparison with alternate approaches across diverse geographic regions demonstrates that our algorithmic paradigm is able to overcome some of the limitations in both data and methods employed by other prior efforts.
A pattern mining based integrative framework for biomarker discovery
(2012-02-10) Dey, Sanjoy; Atluri, Gowtham; Steinbach, Michael; MacDonald, Angus; Lim, Kelvin; Kumar, Vipin
Recent advancement in high throughput data collection technologies has resulted in the availability of diverse biomedical datasets that capture complementary information pertaining to the biological processes in an organism. Biomarkers that are discovered by integrating these datasets obtained from a case-control studies have the potential to elucidate the biological mechanisms behind complex human diseases. In this paper we define an interaction-type integrative biomarker as one whose features together can explain the disease, but not individually. In this paper, we propose a pattern mining based integrative framework (PAMIN) to discover an interaction-type integrative biomarkers from diverse case control datasets. PAMIN first finds patterns form individual datasets to capture the available information separately and then combines these patterns to find integrated patterns (IPs) consisting of variables from multiple datasets. We further use several interestingness measures to characterize the IPs into specific categories. Using synthetic data we compare the IPs found using our approach with those of CCA and discriminative-CCA (dCCA). Our results indicate that PAMIN can discover interaction type patterns that competing approaches like CCA and discriminative-CCA cannot find. Using real datasets we also show that PAMIN discovers a large number of statistically significant IPs than the competing approaches.
Association Analysis for Real-valued Data: Definitions and Application to Microarray Data
(2008-03-03) Pandey, Gaurav; Atluri, Gowtham; Steinbach, Michael; Myers, Chad L.; Kumar, Vipin
The discovery of biclusters, which denote groups of items that show coherent values across a subset of all the transactions in a data set, is an important type of analysis performed on real-valued data sets in several domains, such as biology. Several algorithms have been proposed to find different types of biclusters in such data sets. However, the search schemes used by these algorithms are unable to search the space of all possible biclusters exhaustively. Pattern mining algorithms in association analysis also essentially produce biclusters as their result, since the patterns consist of items that are supported by a subset of all the transactions. However, a major limitation of the numerous techniques developed in association analysis is that they are only able to analyze data sets that are constituted of binary and/or categorical variables, and their application to real-valued data sets often involves some lossy transformation such as discretization or binarization of the attributes. In this paper, we propose a novel association analysis framework for exhaustively and efficiently mining range support patterns from such a data set. On one hand, this framework reduces the loss of information incurred by binarization- and discretization-based approaches, and on the other, it enables the exhaustive discovery of coherent biclusters. We compared the performance of our framework with two standard biclustering algorithms through the evaluation of the functional coherence on patterns/biclusters derived from microarray data. These experiments show that the real-valued patterns discovered by our framework are better enriched by small biologically interesting functional classes. We also demonstrate the complementarity between our framework and the commonly used biclustering algorithm ISA, using specific examples of patterns that are found and functions that are covered by the former but not the latter. The source code and data sets used in this paper are available at http://www.cs.umn.edu/vk/gaurav/rap.
Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study
(2007-03-02) Pandey, Gaurav; Steinbach, Michael; Gupta, Rohit; Garg, Tushar; Kumar, Vipin
Protein interaction networks are one of the most promising types of biological data for the discovery of functional modules and the prediction of individual protein functions. However, it is known that these networks are both incomplete and inaccurate, i.e., they have spurious edges and lack biologically valid edges. One way to handle this problem is by transforming the original interaction graph into new graphs that remove spurious edges, add biologically valid ones, and assign reliability scores to the edges constituting the final network. We investigate currently existing methods, as well as propose robust association analysis-based method for this task. One promising method is based on the concept of h-confidence, which is a measure that can be used to extract groups of objects having high similarity with each other. Experimental evaluation on several protein interaction data sets show that hyperclique-based transformations enhance the performance of standard function prediction algorithms significantly, and thus have merit.
Characterizing Discriminative Patterns
(2011-02-18) Fang, Gang; Wang, Wen; Oatley, Benjamin; NessVan, Brian; Steinbach, Michael; Kumar, Vipin
Discriminative patterns are association patterns that occur with disproportionate frequency in some classes versus others, and have been studied under names such as emerging patterns and contrast sets. Such patterns have demonstrated considerable value for classification and subgroup discovery, but a detailed understanding of the types of interactions among items in a discriminative pattern is lacking. To address this issue, we propose to categorize discriminative patterns according to four types of item interaction: (i) driver-passenger, (ii) coherent, (iii) independent additive and (iv) synergistic beyond independent additive. The coherent, additive, and synergistic patterns are of practical importance, with the latter two representing a gain in the discriminative power of a pattern over its subsets. Synergistic patterns are most restrictive, but perhaps the most interesting since they capture a cooperative effect that is more than the sum of the effects of the individual items in the pattern. For domains such as biomedical and genetic research, differentiating among these types of patterns is critical since each yields very different biological interpretations. For general domains, the characterization provides a novel view of the nature of the discriminative patterns in a dataset, which yields insights beyond those provided by current approaches that focus mostly on pattern-based classification and subgroup discovery. This paper presents a comprehensive discussion that defines these four pattern types and investigates their properties and their relationship to one another. In addition, these ideas are explored for a variety of datasets (ten UCI datasets, one gene expression dataset and two genetic-variation datasets). The results demonstrate the existence, characteristics and statistical significance of the different types of patterns. They also illustrate how pattern characterization can provide novel insights into discriminative pattern mining and the discriminative structure of different datasets. Codes for pattern characterization and supplementary documents are available at http://vk.cs.umn.edu/CDP
Characterizing Pattern based Clustering
(2005-04-19) Xiong, Hui; Steinbach, Michael; Ruslim, Arifin; Kumar, Vipin
Recently, there has been considerable interest in using association patterns for clustering. Although several interesting algorithms have been developed, further investigation is needed to characterize (1) the benefits of using association patterns and (2) the most effective way of using them for clustering. To that end, we present a new clustering technique, bisecting K-means Clustering with pAttern Preservation (K-CAP), which exploits key properties of the hyperclique association pattern and bisecting k-means. Experimental results on document data show that, in terms of entropy, K-CAP can perform substantially better than the standard bisecting k-means algorithm when data sets contain clusters of widely different sizes--the typical situation. Furthermore, because hyperclique patterns can be found much more efficiently than other types of association patterns, K-CAP retains the appealing computational efficiency of bisecting k-means.
Computational Approaches for Protein Function Prediction: A Survey
(2006-10-31) Pandey, Gaurav; Kumar, Vipin; Steinbach, Michael
Proteins are the most essential and versatile macromolecules of life, and the knowledge of their functions is a crucial link in the development of new drugs, better crops, and even the development of synthetic biochemicals such as biofuels. Experimental procedures for protein function prediction are inherently low throughput and are thus unable to annotate a non-trivial fraction of proteins that are becoming available due to rapid advances in genome sequencing technology. This has motivated the development of computational techniques that utilize a variety of high-throughput experimental data for protein function prediction, such as protein and genome sequences, gene expression data, protein interaction networks and phylogenetic profiles. Indeed, in a short period of a decade, several hundred articles have been published on this topic. This survey aims to discuss this wide spectrum of approaches by categorizing them in terms of the data type they use for predicting function, and thus identify the trends and needs of this very important field. The survey is expected to be useful for computational biologists and bioinformaticians aiming to get an overview of the field of computational function prediction, and identify areas that can benefit from further research.
Construction and Functional Analysis of Human Genetic Interaction Networks with Genome-wide Association Data
(2011-01-18) Fang, Gang; Wang, Wen; Paunic, Vanja; Oatley, Benjamin; Haznadar, Majda; Steinbach, Michael; Van Ness, Brian; Myers, Chad L.; Kumar, Vipin
Motivation: Genetic interaction measures how different genes collectively contribute to a phenotype, and can reveal functional compensation and buffering between pathways under genetic perturbations. Recently, genome-wide investigation for genetic interactions has revealed genetic interaction networks that provide novel insights both when analyzed independently and when integrated with other functional genomic datasets. For higher eukaryotes such as human, the above reverse-genetics approaches are not straightforward since the phenotypes of interest for higher eukaryotes such as disease onset or survival, are difficult to study in a cell based assay. Results: In this paper, we propose a general framework for constructing and analyzing human genetic interaction networks from genome-wide single nucleotide polymorphism (SNP) datasets used for case-control studies on complex diseases. Specifically, we propose a general approach with three major steps: (1) estimating SNP-SNP genetic interactions, (2) identifying linkage disequilibrium (LD) blocks and mapping SNP-SNP interactions to LD block-block interactions, and (3) functional mapping for LD blocks. We performed two sets of functional analyses for each of the six case-control SNP datasets used in the paper, and demonstrated that (i) genes in LD blocks showing similar interaction profiles tend to be functionally related, and (ii) the network can be used to discover pairs of compensatory gene modules (between-pathway models) in their joint association with a disease phenotype. The proposed framework should provide novel insights beyond existing approaches that either ignore interactions between SNPs or model different SNP-SNP pairs with genetic interactions separately. Furthermore, our study provides evidence that some of the core properties of genetic interaction networks based on reverse genetics in model organisms like yeast are also present in genetic interactions revealed by natural variation in human populations. Availability: Supplementary material http://vk.cs.umn.edu/humanGI
Discovering Groups of Time Series with Similar Behavior in Multiple Small Intervals of Time
(2014-01-22) Atluri, Gowtham; Steinbach, Michael; Lim, Kelvin; MacDonald, Angus; Kumar, Vipin
The focus of this paper is to address the problem of discovering groups of time series that share similar behavior in multiple small intervals of time. This problem has two characteristics: i) There are exponentially many combinations of time series that needs to be explored to find these groups, ii) The groups of time series of interest need to have similar behavior only in some subsets of the time dimension. We present an Apriori based approach to address this problem. We evaluate it on a synthetic dataset and demonstrate that our approach can directly find all the short-living trends without finding spurious trends unlike other alternative approaches that find many spurious trends. We also demonstrate, using a neuroimaging dataset, that our approach can be used to discover significantly reproducible groups of shared trends when applied on independent sets of time series data. In addition, we demonstrate the utility of our approach on an S&P 500 stocks data set.
Discovering the Longest Set of Distinct Maximal Correlated Intervals in Time Series Data
(2014-10-01) Atluri, Gowtham; Steinbach, Michael; Lim, Kelvin; MacDonald, Angus; Kumar, Vipin
In this paper we focus on finding all maximal correlated intervals where a given pair of time series have correlation above a user provided threshold for all its subintervals and for none of its immediate subsuming intervals. Our objective then is to find a longest set of such maximal correlated intervals. We propose a two step solution to achieve this objective. In the first step an efficient bottom-up approach is proposed to discover maximal correlated intervals. In the second step we use a dynamic programming approach to select the longest non-overlapping set. We evaluate the efficiency of our approach on synthetic datasets and compare it with that of a bruteforce approach. Using neuroimaging data that contains activity time series from brain regions, we show the utility of our approach in studying transient nature of relationships between different brain regions.
Efficient Algorithms for Creating Product Catalogs
(2000-11-10) Steinbach, Michael; Karypis, George; Kumar, Vipin
For the purposes of this paper we define a catalog to be a promotional catalog, i.e., a collection of products (items) presented to a customer with the hope of encouraging a purchase. The single mailing problem addresses how to build a collection of catalogs and distribute them to customers (one per customer) so as to achieve an optimal outcome, e.g., the most profit. Each catalog is a subset of a given set of items and different catalogs may contain some of the same items, i.e., catalogs may overlap. A slightly more general, but important extension of the single mailing problem seeks the optimal set of catalogs when multiple mailings are allowed, i.e., multiple catalogs can be sent to each customer. Catalog creation has important applications for e-commerce and traditional brick-and-mortar retailers, especially when used with personalized recommender systems. The catalog creation problem is NP complete and some (relatively expensive) approximation algorithms have recently been developed. In this paper we describe more efficient techniques for building catalogs and show that these algorithms outperform one of the previously suggested approaches. (Indeed, the techniques previously suggested are not feasible for realistic numbers of customers and catalogs.) Some of our techniques directly use the objective function, e.g., maximize profit, to find a locally optimal solution in an approach based on gradient ascent. However, by combining such techniques with the clustering of similar customers, better results can sometimes be obtained. We also analyze the performance of our algorithms with respect to a theoretical bound and show that, in some cases, their performance is close to optimal.
Enhancing Data Analysis with Noise Removal
(2005-05-19) Xiong, Hui; Pandey, Gaurav; Steinbach, Michael; Kumar, Vipin
Removing objects that are noise is an important goal of data cleaning as noise hinders most types of data analysis. Most existing data cleaning methods focus on removing noise that is the result of low-level data errors that result from an imperfect data collection process, but data objects that are irrelevant or only weakly relevant can also significantly hinder data analysis. Thus, if the goal is to enhance the data analysis as much as possible, these objects should also be considered as noise, at least with respect to the underlying analysis. Consequently, there is a need for data cleaning techniques that remove both types of noise. Because data sets can contain large amount of noise, these techniques also need to be able to discard a potentially large fraction of the data. This paper explores four techniques intended for noise removal to enhance data analysis in the presence of high noise levels. Three of these methods are based on traditional outlier detection techniques: distance-based, clustering-based, and an approach based on the Local Outlier Factor (LOF) of an object. The other technique, which is a new method that we are proposing, is a hyperclique-based data cleaner (HCleaner). These techniques are evaluated in terms of their impact on the subsequent data analysis, specifically, clustering and association analysis. Our experimental results show that all of these methods can provide better clustering performance and higher quality association patterns as the amount of noise being removed increases, although HCleaner generally leads to better clustering performance and higher quality associations than the other three methods for binary data.
Finding Novel Multivariate Relationships in Time Series Data: Applications to Climate and Neuroscience
(2018-02-12) Agrawal, Saurabh; Steinbach, Michael; Boley, Daniel; Liess, Stefan; Chatterjee, Snigdhansu; Kumar, Vipin; Atluri, Gowtham
In many domains, there is significant interest in capturing novel relationships between time series that represent activities recorded at different nodes of a highly complex system. In this paper, we introduce multipoles, a novel class of linear relationships between more than two time series. A multipole is a set of time series that have strong linear dependence among themselves, with the requirement that each time series makes a significant contribution to the linear dependence. We demonstrate that most interesting multipoles can be identified as cliques of negative correlations in a correlation network. Such cliques are typically rare in a real-world correlation network, which allows us to find almost all multipoles efficiently using a clique-enumeration approach. Using our proposed framework, we demonstrate the utility of multipoles in discovering new physical phenomena in two scientific domains: climate science and neuroscience. In particular, we discovered several multipole relationships that are reproducible in multiple other independent datasets, and lead to novel domain insights.
HICAP: Hierarchical Clustering With Pattern Preservation
(2003-10-15) Xiong, Hui; Steinbach, Michael; Tan, Pang-ning; Kumar, Vipin
This paper describes a new approach for clustering---pattern preserving clustering---which produces more easily interpretable and usable clusters. This approach is motivated by the following observation: while there are usually strong patterns in the data---patterns that may be key for the analysis and description of the data---these patterns are often split among different clusters by current clustering approaches. This is, perhaps, not surprising, since clustering algorithms have no built in knowledge of these patterns and may often have goals that are in conflict with preserving patterns, e.g., minimize the distance of points to their nearest cluster centroid. Also, patterns are typically overlapping, i.e., may involve some of the same objects, and if the clustering algorithm produces disjoint clusters, then some patterns must be split when the objects are clustered. In this paper we describe a technique for pattern preserving clustering that first finds patterns composed of tightly connected groups of objects or attributes and then, starting from these patterns, performs agglomerative clustering using the Group Average (UPGMA) technique. We present the results of some experiments on document data that compare our approach, HIerarchical Clustering with pAttern Preservation (HICAP), to two other clustering techniques: bisecting K-means and traditional UPGMA. These results show that, despite the extra constraint of pattern preservation, HICAP has performance very much like traditional UPGMA with respect to the cluster evaluation criteria of entropy and F-measure. More importantly, we also illustrate how patterns, if preserved, can aid cluster interpretation.
Integration of Clinical and Genomic data: a Methodological Survey
(2013-02-20) Dey, Sanjoy; Gupta, Rohit; Steinbach, Michael; Kumar, Vipin
Human diseases are inherently complex and governed by the complicated interplay of several underlying factors. Clinical research focuses on behavioral, demographic and pathology information, whereas molecular genomics focuses on finding underlying genetic and genomic factors in genomic data collected on mRNA expression, proteomics, biological networks, and other microbiological features. However, each of these clinical and genomic datasets contains information only about one particular aspect of a complex disease, rather than covering all of the several complicated underlying risk factors. This has led to a new area of research that integrates both clinical and genomic data and aims to extract more information about diseases by considering not only all the various factors, but also the interactions among those factors, which cannot be captured by clinical and genomic studies that are performed independently of each other. Although initial efforts have already been made to develop such integrative modeling of the clinical and genomic data to shed light on the biological mechanism of the diseases, the research field is still in a rudimentary stage. In this review article, we survey the general issues, challenges and current work of clinicogenomic studies. We also summarize the current state of the field and discuss some possibilities for future work.
Integration of Differential Gene-combination Search and Gene Set Enrichment Analysis: A General Approach
(2009-12-30) Fang, Gang; Steinbach, Michael; Myers, Chad L.; Kumar, Vipin
Motivation: Gene Set Enrichment Analysis (GSEA) and its variations aim to discover collections of genes that show moderate but coordinated differences in expression. However, such techniques may be ineffective if many individual genes in a phenotype-related gene set have weak discriminative power. A potential solution is to search for combinations of genes that are highly differentiating even when individual genes are not. Although such techniques have been developed, these approaches have not been used with GSEA to any significant degree because of the large number of potential gene combinations and the heterogeneity of measures that assess the differentiation provided by gene groups of different sizes. Results: To integrate the search for differentiating gene combinations and GSEA, we propose a general framework with two key components: (A) a procedure that reduces the number of scores to be handled by GSEA to the number of genes by summarizing the scores of the gene combinations involving a particular gene in a single score, and (B) a procedure to integrate the heterogeneous scores from combinations of different sizes and from different gene combination measures by mapping the scores to p-values. Experiments on four gene expression data sets demonstrate that the integration of GSEA and gene combination search can enhance the power of traditional GSEA by discovering gene sets that include genes with weak individual differentiation but strong joint discriminative power. Also, gene sets discovered by the integrative framework share several common biological processes and improve the consistency of the results among three lung cancer data sets. Availability: Source code and datasets: http://vk.cs.umn.edu/ICG/. Contact: gangfang@cs.umn.edu
Land Cover Change Detection using Data Mining Techniques
(2008-03-14) Boriah, Shyam; Kumar, Vipin; Steinbach, Michael; Potter, Christopher; Klooster, Steven
The study of land cover change is an important problem in the Earth science domain because of its impacts on local climate, radiation balance, biogeochemistry, hydrology, and the diversity and abundance of terrestrial species. Data mining and knowledge discovery techniques can aid this effort by efficiently discovering patterns that capture complex interactions between ocean temperature, air pressure, surface meteorology, and terrestrial carbon flux. Most well-known change detection techniques from statistics, signal processing and control theory are not well-suited for the massive high-dimensional spatio-temporal data sets from Earth Science due to limitations such as high computational complexity and the inability to take advantage of seasonality and spatio-temporal autocorrelation inherent in Earth Science data. In our work, we seek to address these challenges with new change detection techniques that are based on data mining approaches. Specifically, in this paper we have performed a case study for a new change detection technique for the land cover change detection problem. We study land cover change in the state of California, focusing on the San Francisco Bay Area as well perform an extended study on the entire state. We also perform a comparative evaluation on forests in the entire state. These results demonstrate the utility of data mining techniques for the land cover change detection problem.
Language and Library Support for Climate Data Applications
(2009) Van Wyk, Eric; Kumar, Vipin; Steinbach, Michael; Boriah, Shyam; Choudhary, Alok

University Digital Conservancy

Browse by Author

Browsing by Author "Steinbach, Michael"