Department of Computer Science and Engineering

Persistent link for this community

https://hdl.handle.net/11299/93252


A pattern mining based integrative framework for biomarker discovery
(2012-02-10) Dey, Sanjoy; Atluri, Gowtham; Steinbach, Michael; MacDonald, Angus; Lim, Kelvin; Kumar, Vipin
Recent advances in high-throughput data collection technologies have resulted in the availability of diverse biomedical datasets that capture complementary information pertaining to the biological processes in an organism. Biomarkers that are discovered by integrating these datasets obtained from case-control studies have the potential to elucidate the biological mechanisms behind complex human diseases. We define an interaction-type integrative biomarker as one whose features together can explain the disease, but not individually. We then propose a pattern mining based integrative framework (PAMIN) to discover interaction-type integrative biomarkers from diverse case-control datasets. PAMIN first finds patterns from individual datasets to capture the available information separately and then combines these patterns to find integrated patterns (IPs) consisting of variables from multiple datasets. We further use several interestingness measures to characterize the IPs into specific categories. Using synthetic data, we compare the IPs found using our approach with those of CCA and discriminative-CCA (dCCA). Our results indicate that PAMIN can discover interaction-type patterns that competing approaches like CCA and dCCA cannot find. Using real datasets, we also show that PAMIN discovers a larger number of statistically significant IPs than the competing approaches.
Association Analysis for Real-valued Data: Definitions and Application to Microarray Data
(2008-03-03) Pandey, Gaurav; Atluri, Gowtham; Steinbach, Michael; Myers, Chad L.; Kumar, Vipin
The discovery of biclusters, which denote groups of items that show coherent values across a subset of all the transactions in a data set, is an important type of analysis performed on real-valued data sets in several domains, such as biology. Several algorithms have been proposed to find different types of biclusters in such data sets. However, the search schemes used by these algorithms are unable to search the space of all possible biclusters exhaustively. Pattern mining algorithms in association analysis also essentially produce biclusters as their result, since the patterns consist of items that are supported by a subset of all the transactions. However, a major limitation of the numerous techniques developed in association analysis is that they can only analyze data sets consisting of binary and/or categorical variables, and their application to real-valued data sets often involves some lossy transformation such as discretization or binarization of the attributes. In this paper, we propose a novel association analysis framework for exhaustively and efficiently mining range support patterns from real-valued data sets. On the one hand, this framework reduces the loss of information incurred by binarization- and discretization-based approaches, and on the other, it enables the exhaustive discovery of coherent biclusters. We compared the performance of our framework with two standard biclustering algorithms by evaluating the functional coherence of patterns/biclusters derived from microarray data. These experiments show that the real-valued patterns discovered by our framework are better enriched by small biologically interesting functional classes. We also demonstrate the complementarity between our framework and the commonly used biclustering algorithm ISA, using specific examples of patterns that are found and functions that are covered by the former but not the latter.
The source code and data sets used in this paper are available at http://www.cs.umn.edu/vk/gaurav/rap.
Discovering Groups of Time Series with Similar Behavior in Multiple Small Intervals of Time
(2014-01-22) Atluri, Gowtham; Steinbach, Michael; Lim, Kelvin; MacDonald, Angus; Kumar, Vipin
This paper addresses the problem of discovering groups of time series that share similar behavior in multiple small intervals of time. This problem has two characteristics: i) there are exponentially many combinations of time series that need to be explored to find these groups, and ii) the groups of time series of interest need to have similar behavior only in some subsets of the time dimension. We present an Apriori-based approach to address this problem. We evaluate it on a synthetic dataset and demonstrate that our approach can directly find all the short-lived trends without reporting spurious ones, unlike alternative approaches, which find many spurious trends. We also demonstrate, using a neuroimaging dataset, that our approach can discover significantly reproducible groups of shared trends when applied to independent sets of time series data. In addition, we demonstrate the utility of our approach on an S&P 500 stocks data set.
Discovering the Longest Set of Distinct Maximal Correlated Intervals in Time Series Data
(2014-10-01) Atluri, Gowtham; Steinbach, Michael; Lim, Kelvin; MacDonald, Angus; Kumar, Vipin
In this paper we focus on finding all maximal correlated intervals: intervals in which a given pair of time series has correlation above a user-provided threshold on every subinterval, and for which no immediately subsuming interval has the same property. Our objective then is to find a longest set of such maximal correlated intervals. We propose a two-step solution to achieve this objective. In the first step, an efficient bottom-up approach is proposed to discover maximal correlated intervals. In the second step, we use a dynamic programming approach to select the longest non-overlapping set. We evaluate the efficiency of our approach on synthetic datasets and compare it with that of a brute-force approach. Using neuroimaging data that contains activity time series from brain regions, we show the utility of our approach in studying the transient nature of relationships between different brain regions.
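The second step described above resembles the classic weighted interval scheduling problem. The following is a minimal sketch of that standard dynamic program, not the authors' implementation. It assumes that intervals are closed integer ranges `(start, end)`, that "longest" means maximum total length, and that two intervals overlap if they share any time point. The function name `longest_nonoverlapping` is hypothetical.

```python
from bisect import bisect_left


def longest_nonoverlapping(intervals):
    """Choose non-overlapping closed intervals (start, end) of maximum total length.

    Returns (total_length, chosen_intervals). Standard weighted interval
    scheduling; the weight of an interval is its length, end - start + 1.
    """
    ivs = sorted(intervals, key=lambda iv: iv[1])  # sort by end point
    ends = [end for _, end in ivs]
    # best[i] = best (total length, selection) using only the first i intervals
    best = [(0, [])]
    for i, (start, end) in enumerate(ivs):
        # p = number of earlier intervals that end strictly before `start`
        p = bisect_left(ends, start, 0, i)
        take = best[p][0] + (end - start + 1)
        if take > best[i][0]:
            best.append((take, best[p][1] + [(start, end)]))
        else:
            best.append(best[i])  # skipping this interval is at least as good
    return best[-1]


total, chosen = longest_nonoverlapping([(0, 4), (3, 9), (5, 7), (8, 12)])
print(total, chosen)
```

Sorting by end point lets a binary search find the compatible prefix for each interval, so the selection runs in O(n log n) time apart from the list copies kept here for readability.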
Enhancing the functional content of protein interaction networks
(2012-02-01) Pandey, Gaurav; Manocha, Sahil; Atluri, Gowtham; Kumar, Vipin
Protein interaction networks are a promising type of data for studying complex biological systems. However, despite the rich information embedded in these networks, they face important data quality challenges of noise and incompleteness that adversely affect the results obtained from their analysis. Here, we explore the use of the concept of common neighborhood similarity (CNS), which is a form of local structure in networks, to address these issues. Although several CNS measures have been proposed in the literature, an understanding of their relative efficacies for the analysis of interaction networks has been lacking. We follow the framework of graph transformation to convert the given interaction network into transformed networks, one for each of the CNS measures evaluated. The effectiveness of each measure is then estimated by comparing the quality of protein function predictions obtained from its corresponding transformed network with those from the original network. Using a large set of S. cerevisiae interactions, and a set of 136 GO terms, we find that several of the transformed networks produce more accurate predictions than those obtained from the original network. Notably, the HC.cont measure proposed here performs particularly well for this task. Further investigation reveals that the two major factors contributing to this improvement are the abilities of CNS measures, especially HC.cont, to prune out noisy edges and introduce new links between functionally related proteins.
Finding Novel Multivariate Relationships in Time Series Data: Applications to Climate and Neuroscience
(2018-02-12) Agrawal, Saurabh; Steinbach, Michael; Boley, Daniel; Liess, Stefan; Chatterjee, Snigdhansu; Kumar, Vipin; Atluri, Gowtham
In many domains, there is significant interest in capturing novel relationships between time series that represent activities recorded at different nodes of a highly complex system. In this paper, we introduce multipoles, a novel class of linear relationships between more than two time series. A multipole is a set of time series that have strong linear dependence among themselves, with the requirement that each time series makes a significant contribution to the linear dependence. We demonstrate that most interesting multipoles can be identified as cliques of negative correlations in a correlation network. Such cliques are typically rare in a real-world correlation network, which allows us to find almost all multipoles efficiently using a clique-enumeration approach. Using our proposed framework, we demonstrate the utility of multipoles in discovering new physical phenomena in two scientific domains: climate science and neuroscience. In particular, we discovered several multipole relationships that are reproducible in multiple other independent datasets, and lead to novel domain insights.
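To illustrate the idea of a clique of negative correlations, here is a small brute-force sketch. It is not the clique-enumeration method of the paper. The helper names `pearson` and `negative_cliques`, the threshold of -0.4, and the synthetic sinusoid data are all assumptions made for illustration. Three series are built so that they sum to zero (an exact linear dependence), which makes every pairwise correlation negative; a fourth, unrelated series is added as a distractor.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def negative_cliques(series, size, threshold):
    """Index tuples of `size` series whose pairwise correlations are all <= threshold.

    Brute force over all combinations; only suitable for small inputs.
    """
    n = len(series)
    corr = {(i, j): pearson(series[i], series[j])
            for i, j in combinations(range(n), 2)}
    return [group for group in combinations(range(n), size)
            if all(corr[pair] <= threshold for pair in combinations(group, 2))]


# Synthetic data (assumption): mutually orthogonal sinusoids u, v, w.
n = 120
u, v, w, other = ([math.sin(2 * math.pi * k * t / n) for t in range(n)]
                  for k in (1, 2, 3, 5))
mean = [(a + b + c) / 3 for a, b, c in zip(u, v, w)]
# x + y + z is zero at every time step, so the three are linearly dependent.
x = [a - m for a, m in zip(u, mean)]
y = [b - m for b, m in zip(v, mean)]
z = [c - m for c, m in zip(w, mean)]

print(negative_cliques([x, y, z, other], size=3, threshold=-0.4))
```

In this construction each pair among `x`, `y`, and `z` has correlation close to -0.5, while `other` is essentially uncorrelated with all three, so only the first three series form a negative clique.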
Tripoles: A New Class of Climate Teleconnections
(2015-12-11) Agrawal, Saurabh; Atluri, Gowtham; Liess, Stefan; Chatterjee, Snigdhansu; Kumar, Vipin
A teleconnection in climate represents a persistent, large-scale temporal connection in a given climate variable between two distant geographical regions. Teleconnections are known to impact and explain the variability in climate of many regions across the globe and have been a subject of interest to climatologists. Traditionally, climate teleconnections have been studied as a persistent relationship between a pair of geographical regions (e.g., the North Atlantic Oscillation (NAO) and the El Niño-Southern Oscillation (ENSO)). In this report, we define a new class of climate teleconnections, which we refer to as tripoles, that capture climatic relationships between three regions, in contrast to teleconnections that are traditionally defined using only two regions. We further provide a categorization of tripoles based on pairwise relationships between the three participating regions and propose a shared nearest neighbor (SNN) graph-based approach to find tripoles in a given spatio-temporal dataset.
Two-Dimensional Association Analysis For Finding Constant Value Biclusters In Real-Valued Data
(2009-07-07) Atluri, Gowtham; Bellay, Jeremy; Pandey, Gaurav; Myers, Chad L.; Kumar, Vipin
Biclustering is a commonly used type of analysis for real-valued data sets, and several algorithms have been proposed for finding different types of biclusters. However, no systematic approach has been proposed for exhaustively enumerating all (nearly) constant value biclusters in such data sets, which is the problem addressed in this paper. Using a monotonic range measure to capture the coherence of values in a block/submatrix of an input data matrix, we propose a two-step Apriori-based algorithm for discovering all nearly constant value biclusters, referred to as Range Constrained Blocks (RCBs). By systematic evaluation on an extensive genetic interaction data set, we show that the submatrices with similar values represent groups of genes that are more functionally related than the biclusters with diverse values. We also show that our approach can exhaustively find all the biclusters with a range less than a given threshold, while the other competing approaches cannot find all such biclusters.
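The reason a range measure supports Apriori-style search is that the range (maximum minus minimum) of a block can never decrease when rows or columns are added, so any block that already exceeds the threshold can be pruned along with all of its extensions. The sketch below illustrates this for a simplified one-dimensional case: a level-wise search over column sets with the row set held fixed. It is not the two-step algorithm of the paper; the function names and the toy matrix are assumptions made for illustration.

```python
def block_range(matrix, rows, cols):
    """Range (max - min) of the values in the block defined by rows x cols."""
    values = [matrix[r][c] for r in rows for c in cols]
    return max(values) - min(values)


def range_constrained_column_sets(matrix, rows, max_range):
    """Level-wise search for all column sets whose block range is <= max_range.

    Each level extends the surviving column sets by one higher-indexed column.
    Because the range never shrinks as columns are added, a set that fails the
    constraint is dropped and none of its supersets are ever generated.
    """
    n_cols = len(matrix[0])
    level = [(c,) for c in range(n_cols)
             if block_range(matrix, rows, (c,)) <= max_range]
    found = []
    while level:
        found.extend(level)
        next_level = []
        for cols in level:
            for c in range(cols[-1] + 1, n_cols):
                candidate = cols + (c,)
                if block_range(matrix, rows, candidate) <= max_range:
                    next_level.append(candidate)
        level = next_level
    return found


# Toy data (assumption): column 2 holds much larger values than the others.
matrix = [[10, 12, 50, 11],
          [9, 11, 48, 10]]
print(range_constrained_column_sets(matrix, rows=(0, 1), max_range=5))
```

On this toy matrix the search keeps every single column and the sets built from columns 0, 1, and 3, and never forms a multi-column set containing column 2.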