Browsing by Author "Atluri, Gowtham"

A pattern mining based integrative framework for biomarker discovery
(2012-02-10) Dey, Sanjoy; Atluri, Gowtham; Steinbach, Michael; MacDonald, Angus; Lim, Kelvin; Kumar, Vipin
Recent advances in high-throughput data collection technologies have resulted in the availability of diverse biomedical datasets that capture complementary information pertaining to the biological processes in an organism. Biomarkers that are discovered by integrating these datasets obtained from case-control studies have the potential to elucidate the biological mechanisms behind complex human diseases. In this paper we define an interaction-type integrative biomarker as one whose features together can explain the disease, but not individually. We propose a pattern mining based integrative framework (PAMIN) to discover interaction-type integrative biomarkers from diverse case-control datasets. PAMIN first finds patterns from individual datasets to capture the available information separately and then combines these patterns to find integrated patterns (IPs) consisting of variables from multiple datasets. We further use several interestingness measures to characterize the IPs into specific categories. Using synthetic data we compare the IPs found using our approach with those of CCA and discriminative-CCA (dCCA). Our results indicate that PAMIN can discover interaction-type patterns that competing approaches like CCA and dCCA cannot find. Using real datasets we also show that PAMIN discovers a larger number of statistically significant IPs than the competing approaches.
Association Analysis for Real-valued Data: Definitions and Application to Microarray Data
(2008-03-03) Pandey, Gaurav; Atluri, Gowtham; Steinbach, Michael; Myers, Chad L.; Kumar, Vipin
The discovery of biclusters, which denote groups of items that show coherent values across a subset of all the transactions in a data set, is an important type of analysis performed on real-valued data sets in several domains, such as biology. Several algorithms have been proposed to find different types of biclusters in such data sets. However, the search schemes used by these algorithms are unable to search the space of all possible biclusters exhaustively. Pattern mining algorithms in association analysis also essentially produce biclusters as their result, since the patterns consist of items that are supported by a subset of all the transactions. However, a major limitation of the numerous techniques developed in association analysis is that they are only able to analyze data sets that consist of binary and/or categorical variables, and their application to real-valued data sets often involves some lossy transformation such as discretization or binarization of the attributes. In this paper, we propose a novel association analysis framework for exhaustively and efficiently mining range support patterns from such a data set. On one hand, this framework reduces the loss of information incurred by binarization- and discretization-based approaches, and on the other, it enables the exhaustive discovery of coherent biclusters. We compared the performance of our framework with two standard biclustering algorithms through the evaluation of the functional coherence of patterns/biclusters derived from microarray data. These experiments show that the real-valued patterns discovered by our framework are better enriched by small biologically interesting functional classes. We also demonstrate the complementarity between our framework and the commonly used biclustering algorithm ISA, using specific examples of patterns that are found and functions that are covered by the former but not the latter.
The source code and data sets used in this paper are available at http://www.cs.umn.edu/vk/gaurav/rap.
Discovering Groups of Time Series with Similar Behavior in Multiple Small Intervals of Time
(2014-01-22) Atluri, Gowtham; Steinbach, Michael; Lim, Kelvin; MacDonald, Angus; Kumar, Vipin
The focus of this paper is to address the problem of discovering groups of time series that share similar behavior in multiple small intervals of time. This problem has two characteristics: i) there are exponentially many combinations of time series that need to be explored to find these groups, and ii) the groups of time series of interest need to have similar behavior only in some subsets of the time dimension. We present an Apriori-based approach to address this problem. We evaluate it on a synthetic dataset and demonstrate that our approach can directly find all the short-lived trends without finding spurious trends, unlike alternative approaches, which find many spurious trends. We also demonstrate, using a neuroimaging dataset, that our approach can be used to discover significantly reproducible groups of shared trends when applied to independent sets of time series data. In addition, we demonstrate the utility of our approach on an S&P 500 stocks data set.
Discovering the Longest Set of Distinct Maximal Correlated Intervals in Time Series Data
(2014-10-01) Atluri, Gowtham; Steinbach, Michael; Lim, Kelvin; MacDonald, Angus; Kumar, Vipin
In this paper we focus on finding all maximal correlated intervals where a given pair of time series have correlation above a user-provided threshold for all its subintervals and for none of its immediate subsuming intervals. Our objective then is to find a longest set of such maximal correlated intervals. We propose a two-step solution to achieve this objective. In the first step an efficient bottom-up approach is proposed to discover maximal correlated intervals. In the second step we use a dynamic programming approach to select the longest non-overlapping set. We evaluate the efficiency of our approach on synthetic datasets and compare it with that of a brute-force approach. Using neuroimaging data that contains activity time series from brain regions, we show the utility of our approach in studying the transient nature of relationships between different brain regions.
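The second step described above (selecting a longest non-overlapping set of intervals by dynamic programming) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the maximal correlated intervals are already given as closed integer index pairs (start, end), measures length as the number of time points, and uses the hypothetical function name longest_nonoverlapping.

```python
from bisect import bisect_right


def longest_nonoverlapping(intervals):
    """Choose non-overlapping closed intervals (start, end) with maximal total length.

    Assumption: endpoints are integer time indices and the length of
    (s, e) is e - s + 1 time points.
    """
    ivs = sorted(intervals, key=lambda iv: iv[1])  # order by end point
    ends = [e for _, e in ivs]
    # best[i] = (total length, chosen intervals) using only the first i intervals
    best = [(0, [])]
    for i, (s, e) in enumerate(ivs):
        # j = number of earlier intervals that end strictly before s
        j = bisect_right(ends, s - 1, 0, i)
        take_len = best[j][0] + (e - s + 1)
        if take_len > best[i][0]:
            best.append((take_len, best[j][1] + [(s, e)]))
        else:
            best.append(best[i])  # skipping this interval is at least as good
    return best[-1]


total, chosen = longest_nonoverlapping([(0, 4), (3, 9), (5, 6), (8, 12)])
print(total, chosen)
```

In this toy input the interval (3, 9) is the longest single interval, yet it is left out because combining the three intervals around it covers more time points in total.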
Enhancing the functional content of protein interaction networks
(2012-02-01) Pandey, Gaurav; Manocha, Sahil; Atluri, Gowtham; Kumar, Vipin
Protein interaction networks are a promising type of data for studying complex biological systems. However, despite the rich information embedded in these networks, they face important data quality challenges of noise and incompleteness that adversely affect the results obtained from their analysis. Here, we explore the use of the concept of common neighborhood similarity (CNS), which is a form of local structure in networks, to address these issues. Although several CNS measures have been proposed in the literature, an understanding of their relative efficacies for the analysis of interaction networks has been lacking. We follow the framework of graph transformation to convert the given interaction network into a transformed network corresponding to each of the CNS measures evaluated. The effectiveness of each measure is then estimated by comparing the quality of protein function predictions obtained from its corresponding transformed network with those from the original network. Using a large set of S. cerevisiae interactions, and a set of 136 GO terms, we find that several of the transformed networks produce more accurate predictions than those obtained from the original network. Notably, the HC.cont measure proposed here performs particularly well for this task. Further investigation reveals that the two major factors contributing to this improvement are the abilities of CNS measures, especially HC.cont, to prune out noisy edges and introduce new links between functionally related proteins.
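As a rough illustration of the graph transformation idea, the sketch below rescores every pair of proteins by a common neighborhood similarity and keeps only pairs above a cutoff. The Jaccard coefficient is used here purely as a stand-in CNS measure (it is not HC.cont, whose definition the abstract does not give), and the function names, toy edges, and threshold are assumptions made for the example.

```python
def jaccard_cns(adj, u, v):
    """Jaccard similarity of the neighborhoods of u and v (an illustrative CNS measure)."""
    nu, nv = adj[u] - {v}, adj[v] - {u}
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


def transform_network(edges, threshold):
    """Return {(u, v): score} for all node pairs whose CNS score reaches the threshold."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    nodes = sorted(adj)
    new_edges = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            score = jaccard_cns(adj, u, v)
            if score >= threshold:
                new_edges[(u, v)] = score
    return new_edges


# Hypothetical toy network, not real interaction data.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("B", "D"), ("C", "E")]
print(transform_network(edges, threshold=0.5))
```

In this toy network the original edge C-E is dropped because C and E share no neighbors, while the pair C-D, which was not an edge, is introduced because the two nodes share neighbors A and B. This mirrors the two effects named in the abstract: pruning edges and adding new links.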
Finding Novel Multivariate Relationships in Time Series Data: Applications to Climate and Neuroscience
(2018-02-12) Agrawal, Saurabh; Steinbach, Michael; Boley, Daniel; Liess, Stefan; Chatterjee, Snigdhansu; Kumar, Vipin; Atluri, Gowtham
In many domains, there is significant interest in capturing novel relationships between time series that represent activities recorded at different nodes of a highly complex system. In this paper, we introduce multipoles, a novel class of linear relationships between more than two time series. A multipole is a set of time series that have strong linear dependence among themselves, with the requirement that each time series makes a significant contribution to the linear dependence. We demonstrate that most interesting multipoles can be identified as cliques of negative correlations in a correlation network. Such cliques are typically rare in a real-world correlation network, which allows us to find almost all multipoles efficiently using a clique-enumeration approach. Using our proposed framework, we demonstrate the utility of multipoles in discovering new physical phenomena in two scientific domains: climate science and neuroscience. In particular, we discovered several multipole relationships that are reproducible in multiple other independent datasets, and lead to novel domain insights.
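The observation that candidate multipoles appear as cliques of negative correlations can be sketched as follows. This is only an illustration under stated assumptions: the correlation cutoff of -0.3, the minimum clique size, and the function names are invented for the example, a textbook Bron-Kerbosch enumeration stands in for whatever clique-enumeration method the authors use, and the paper's checks on linear dependence and per-series contribution are not reproduced.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def negative_cliques(series, threshold=-0.3, min_size=3):
    """Enumerate maximal cliques in the graph linking pairs with correlation <= threshold."""
    names = list(series)
    adj = {n: set() for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(series[a], series[b]) <= threshold:
                adj[a].add(b)
                adj[b].add(a)
    found = []

    def expand(r, p, x):
        # Bron-Kerbosch without pivoting: r is the current clique,
        # p the candidates that can extend it, x the already explored vertices.
        if not p and not x:
            if len(r) >= min_size:
                found.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(names), set())
    return found


# Hypothetical toy series: a, b, c are pairwise negatively correlated (r = -0.5).
series = {
    "a": [2, -1, -1] * 2,
    "b": [-1, 2, -1] * 2,
    "c": [-1, -1, 2] * 2,
    "d": [1, 1, -2] * 2,
}
print(negative_cliques(series))
```

Each clique returned is only a candidate set; whether it qualifies as a multipole would still have to be tested against the definition given in the paper.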
Mining dynamic relationships from spatio-temporal datasets: an application to brain fMRI data
(2014-05) Atluri, Gowtham
Spatio-temporal datasets are being widely collected in several domains such as climate science, neuroscience, sociology, and transportation. These data sets offer tremendous opportunities to address the imminent problems facing our society such as climate change, dementia, traffic congestion, and crime. One example of a spatio-temporal dataset, and the focus of this dissertation, is Functional Magnetic Resonance Imaging (fMRI) data. fMRI captures the activity at all locations in the brain at regular time intervals. Using this data one can investigate the processes in the brain that relate to human psychological functions such as cognition and decision making, or physiological functions such as sensory perception or motor skills. Above all, one can advance the diagnosis and treatment procedures for mental disorders. The focus of this thesis is to study dynamic relationships between brain regions using fMRI data. Existing work in neuroscience has predominantly treated the relationships among brain regions as stationary. There is growing evidence in this community that the relationships between brain regions are transient. In the time series data mining community transient relationships have been studied and are shown to be useful for various tasks such as clustering and classification of time series data. In this work we focused on discovering combinations of brain regions that exhibit high similarity in the activity time series in small intervals. We proposed an efficient approach that can discover all such combinations exhaustively. We demonstrated its effectiveness on synthetic and real world data sets. We applied our approach to fMRI data collected in different settings on different groups of people and studied the reliability and replicability of the combinations we discover. Reliability is the degree to which a combination that is discovered using fMRI scans from a population can be found again using a different set of scans on the same population. Replicability is the degree to which a combination discovered using scans from one set of subjects can be discovered again using scans from a different set of subjects. These two factors reflect the generality of the combinations we discover. Our results suggest that the combinations we discover are indeed reliable and replicable. This indicates the validity of the combinations and suggests that underlying neuronal principles drive them. We also investigated the utility of the combinations in studying differences between healthy and schizophrenia subjects. Existing work in estimating transient relationships among time series typically uses sliding time windows of a fixed length that are shifted from one end to the other using a fixed step size. This approach does not directly identify the intervals in which a pair of time series exhibit similarity. We proposed another computational approach to discover the time intervals where a given pair of time series are highly similar. Using synthetic datasets, we showed that our approach is both efficient and effective. Using this approach we provided a characterization of the transient nature of a relationship between time series and showed its utility in identifying task related transient connectivity in fMRI data that is collected while a subject is resting and while involved in a task. In summary, the computational approaches proposed in this thesis advance the state-of-the-art in time series data mining, while the extensive evaluations that are performed on multiple fMRI datasets demonstrate the validity of the findings and provide novel hypotheses that can be systematically studied to advance the state-of-the-art in neuroscience.
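For context, the fixed-length sliding-window baseline that the abstract contrasts with can be sketched in a few lines. This shows the existing practice, not the approach proposed in the thesis; the function names, window length, and step size are illustrative assumptions, and windows in which a series is constant are not handled.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def sliding_window_correlation(x, y, window, step):
    """Return (start_index, correlation) for each fixed-length window."""
    return [
        (s, pearson(x[s:s + window], y[s:s + window]))
        for s in range(0, len(x) - window + 1, step)
    ]


# Hypothetical toy series: the relationship flips sign halfway through.
x = [1, 2, 3, 4, 4, 3, 2, 1]
y = [1, 2, 3, 4, 1, 2, 3, 4]
print(sliding_window_correlation(x, y, window=4, step=4))
```

The output is a correlation value per window position, so the boundaries of an interval of high similarity can only be read off to the resolution of the chosen window and step, which is the limitation the abstract points out.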
Tripoles: A New Class of Climate Teleconnections
(2015-12-11) Agrawal, Saurabh; Atluri, Gowtham; Liess, Stefan; Chatterjee, Snigdhansu; Kumar, Vipin
Teleconnections in climate represent a persistent and large-scale temporal connection in a given climate variable between two distant geographical regions. They are known to impact and explain the variability in climate of many regions across the globe and have been a subject of interest to climatologists. Traditionally, climate teleconnections have been studied as a persistent relationship between a pair of geographical regions (e.g., the North Atlantic Oscillation (NAO) and the El Niño-Southern Oscillation (ENSO)). In this report, we define a new class of climate teleconnections, which we refer to as tripoles, that capture climatic relationships between three regions, in contrast to teleconnections that are traditionally defined using only two regions. We further provide a categorization of tripoles based on pairwise relationships between the three participating regions and propose a shared nearest neighbor (SNN) graph-based approach to find tripoles in a given spatio-temporal dataset.
Two-Dimensional Association Analysis For Finding Constant Value Biclusters In Real-Valued Data
(2009-07-07) Atluri, Gowtham; Bellay, Jeremy; Pandey, Gaurav; Myers, Chad L.; Kumar, Vipin
Biclustering is a commonly used type of analysis for real-valued data sets, and several algorithms have been proposed for finding different types of biclusters. However, no systematic approach has been proposed for exhaustively enumerating all (nearly) constant value biclusters in such data sets, which is the problem addressed in this paper. Using a monotonic range measure to capture the coherence of values in a block/submatrix of an input data matrix, we propose a two-step Apriori-based algorithm for discovering all nearly constant value biclusters, referred to as Range Constrained Blocks (RCBs). By systematic evaluation on an extensive genetic interaction data set, we show that the submatrices with similar values represent groups of genes that are more functionally related than the biclusters with diverse values. We also show that our approach can exhaustively find all the biclusters with a range less than a given threshold, while the other competing approaches cannot find all such biclusters.
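The role of a monotonic range measure in Apriori-style pruning can be shown with a simplified sketch. It assumes the range of a block is its maximum value minus its minimum value, and it searches only over column sets for a fixed set of rows, which is a one-dimensional simplification and not the two-step RCB algorithm of the paper. The function names and the toy matrix are assumptions made for the example.

```python
from itertools import combinations


def block_range(matrix, rows, cols):
    """Range (max minus min) of the values in the submatrix rows x cols."""
    values = [matrix[r][c] for r in rows for c in cols]
    return max(values) - min(values)


def column_sets_within_range(matrix, rows, delta):
    """Level-wise search for all column sets whose block range is at most delta.

    Adding a column can never shrink the range, so a candidate is only
    evaluated if every subset one column smaller already passed.
    """
    n_cols = len(matrix[0])
    level = [(c,) for c in range(n_cols) if block_range(matrix, rows, (c,)) <= delta]
    result = []
    while level:
        result.extend(level)
        survivors = set(level)
        next_level = set()
        for a in level:
            for c in range(a[-1] + 1, n_cols):
                cand = a + (c,)
                # Apriori pruning: skip candidates with a failing subset.
                if all(sub in survivors for sub in combinations(cand, len(a))):
                    if block_range(matrix, rows, cand) <= delta:
                        next_level.add(cand)
        level = sorted(next_level)
    return result


# Hypothetical toy matrix, not real genetic interaction data.
matrix = [
    [1.0, 1.2, 5.0, 1.1],
    [0.9, 1.1, 5.2, 1.0],
    [3.0, 0.1, 2.0, 9.0],
]
print(column_sets_within_range(matrix, rows=(0, 1), delta=0.5))
```

Because the range can only grow as a block grows, any column set containing a failing pair such as columns 0 and 2 is never evaluated, which is what makes exhaustive enumeration tractable in this style of search.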

University Digital Conservancy
