Browsing by Author "Pandey, Gaurav"

Association Analysis for Real-valued Data: Definitions and Application to Microarray Data (2008-03-03)
Pandey, Gaurav; Atluri, Gowtham; Steinbach, Michael; Myers, Chad L.; Kumar, Vipin
The discovery of biclusters, which denote groups of items that show coherent values across a subset of all the transactions in a data set, is an important type of analysis performed on real-valued data sets in several domains, such as biology. Several algorithms have been proposed to find different types of biclusters in such data sets. However, the search schemes used by these algorithms are unable to search the space of all possible biclusters exhaustively. Pattern mining algorithms in association analysis also essentially produce biclusters as their result, since the patterns consist of items that are supported by a subset of all the transactions. However, a major limitation of the numerous techniques developed in association analysis is that they are only able to analyze data sets that are constituted of binary and/or categorical variables, and their application to real-valued data sets often involves some lossy transformation such as discretization or binarization of the attributes. In this paper, we propose a novel association analysis framework for exhaustively and efficiently mining range support patterns from such a data set. On one hand, this framework reduces the loss of information incurred by binarization- and discretization-based approaches, and on the other, it enables the exhaustive discovery of coherent biclusters. We compared the performance of our framework with two standard biclustering algorithms through the evaluation of the functional coherence on patterns/biclusters derived from microarray data. These experiments show that the real-valued patterns discovered by our framework are better enriched by small biologically interesting functional classes.
We also demonstrate the complementarity between our framework and the commonly used biclustering algorithm ISA, using specific examples of patterns that are found and functions that are covered by the former but not the latter. The source code and data sets used in this paper are available at http://www.cs.umn.edu/vk/gaurav/rap.

Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study (2007-03-02)
Pandey, Gaurav; Steinbach, Michael; Gupta, Rohit; Garg, Tushar; Kumar, Vipin
Protein interaction networks are one of the most promising types of biological data for the discovery of functional modules and the prediction of individual protein functions. However, it is known that these networks are both incomplete and inaccurate, i.e., they have spurious edges and lack biologically valid edges. One way to handle this problem is by transforming the original interaction graph into new graphs that remove spurious edges, add biologically valid ones, and assign reliability scores to the edges constituting the final network. We investigate currently existing methods, as well as propose a robust association analysis-based method for this task. One promising method is based on the concept of h-confidence, which is a measure that can be used to extract groups of objects having high similarity with each other. Experimental evaluation on several protein interaction data sets shows that hyperclique-based transformations enhance the performance of standard function prediction algorithms significantly, and thus have merit.

Computational Approaches for Protein Function Prediction: A Survey (2006-10-31)
Pandey, Gaurav; Kumar, Vipin; Steinbach, Michael
Proteins are the most essential and versatile macromolecules of life, and the knowledge of their functions is a crucial link in the development of new drugs, better crops, and even the development of synthetic biochemicals such as biofuels.
Experimental procedures for protein function prediction are inherently low throughput and are thus unable to annotate a non-trivial fraction of proteins that are becoming available due to rapid advances in genome sequencing technology. This has motivated the development of computational techniques that utilize a variety of high-throughput experimental data for protein function prediction, such as protein and genome sequences, gene expression data, protein interaction networks and phylogenetic profiles. Indeed, in the short period of a decade, several hundred articles have been published on this topic. This survey aims to discuss this wide spectrum of approaches by categorizing them in terms of the data type they use for predicting function, and thus identify the trends and needs of this very important field. The survey is expected to be useful for computational biologists and bioinformaticians aiming to get an overview of the field of computational function prediction, and identify areas that can benefit from further research.

Data mining techniques for enhancing protein function prediction (2010-04)
Pandey, Gaurav
Proteins are the most essential and versatile macromolecules of life, and the knowledge of their functions is crucial for obtaining a basic understanding of the cellular processes operating in an organism as well as for important applications in biotechnology, such as the development of new drugs, better crops, and synthetic biochemicals such as biofuels. Recent revolutions in biotechnology have given us numerous high-throughput experimental technologies that generate very useful data, such as gene expression and protein interaction data, that provide high-resolution snapshots of complex cellular processes and a novel avenue to understand their underlying mechanisms.
In particular, several computational approaches based on the principle of Guilt by Association (GBA) have been proposed, in which the function(s) of a protein are inferred from those of other proteins that are "associated" with it in these data sets. In this thesis, we have developed several novel methods for improving the performance of these approaches by making use of the unutilized and under-utilized information in genomic data sets, as well as their associated knowledge bases. In particular, we have developed pre-processing methods for handling data quality issues with gene expression (microarray) data sets and protein interaction networks that aim to enhance the utility of these data sets for protein function prediction. We have also developed a method for incorporating the inter-relationships between functional classes, as captured by the ontologies in Gene Ontology, into classification-based protein function prediction algorithms, which enabled us to improve the quality of predictions made for several functional classes, particularly those with very few member proteins (rare classes). Finally, we have developed a novel association analysis-based biclustering algorithm to address two major challenges with traditional biclustering algorithms, namely an exhaustive search of all valid biclusters satisfying the definition specified by the algorithm, and the ability to search for small biclusters. This algorithm makes it possible to discover smaller biclusters that are more significantly enriched with specific GO terms than those produced by the traditional biclustering algorithms. Overall, the methods proposed in this thesis are expected to help uncover the functions of several unannotated proteins (or genes), as shown by specific examples cited in some of the chapters.
To conclude, we also suggest several opportunities for further progress on the very important problem of protein function prediction.

Enhancing Data Analysis with Noise Removal (2005-05-19)
Xiong, Hui; Pandey, Gaurav; Steinbach, Michael; Kumar, Vipin
Removing objects that are noise is an important goal of data cleaning as noise hinders most types of data analysis. Most existing data cleaning methods focus on removing noise that is the result of low-level data errors that result from an imperfect data collection process, but data objects that are irrelevant or only weakly relevant can also significantly hinder data analysis. Thus, if the goal is to enhance the data analysis as much as possible, these objects should also be considered as noise, at least with respect to the underlying analysis. Consequently, there is a need for data cleaning techniques that remove both types of noise. Because data sets can contain large amounts of noise, these techniques also need to be able to discard a potentially large fraction of the data. This paper explores four techniques intended for noise removal to enhance data analysis in the presence of high noise levels. Three of these methods are based on traditional outlier detection techniques: distance-based, clustering-based, and an approach based on the Local Outlier Factor (LOF) of an object. The other technique, which is a new method that we are proposing, is a hyperclique-based data cleaner (HCleaner). These techniques are evaluated in terms of their impact on the subsequent data analysis, specifically, clustering and association analysis.
Our experimental results show that all of these methods can provide better clustering performance and higher quality association patterns as the amount of noise being removed increases, although HCleaner generally leads to better clustering performance and higher quality associations than the other three methods for binary data.

Enhancing the functional content of protein interaction networks (2012-02-01)
Pandey, Gaurav; Manocha, Sahil; Atluri, Gowtham; Kumar, Vipin
Protein interaction networks are a promising type of data for studying complex biological systems. However, despite the rich information embedded in these networks, they face important data quality challenges of noise and incompleteness that adversely affect the results obtained from their analysis. Here, we explore the use of the concept of common neighborhood similarity (CNS), which is a form of local structure in networks, to address these issues. Although several CNS measures have been proposed in the literature, an understanding of their relative efficacies for the analysis of interaction networks has been lacking. We follow the framework of graph transformation to convert the given interaction network into a transformed network corresponding to each of a variety of CNS measures evaluated. The effectiveness of each measure is then estimated by comparing the quality of protein function predictions obtained from its corresponding transformed network with those from the original network. Using a large set of S. cerevisiae interactions, and a set of 136 GO terms, we find that several of the transformed networks produce more accurate predictions than those obtained from the original network. Notably, the HC.cont measure proposed here performs particularly well for this task.
Further investigation reveals that the two major factors contributing to this improvement are the abilities of CNS measures, especially HC.cont, to prune out noisy edges and introduce new links between functionally related proteins.

Incorporating Functional Inter-relationships into Protein Function Prediction Algorithms (2008-01-07)
Pandey, Gaurav; Myers, Chad L.; Kumar, Vipin
Functional classification schemes that serve as the basis for annotation efforts in several organisms (e.g. the Gene Ontology) are often the source of gold standard information for computational efforts at supervised gene function prediction. While successful function prediction algorithms have been developed, few previous efforts have utilized more than the gene-to-function class labels provided by such knowledge bases. For instance, the Gene Ontology not only captures gene annotations to a set of functional classes, but it also arranges these classes in a DAG-based hierarchy that captures rich inter-relationships between different classes. These inter-relationships present both opportunities, such as the potential for additional training examples for small classes from larger related classes, and challenges, such as a harder-to-learn distinction between similar GO terms, for standard classification-based approaches. In this paper, we propose to enhance the performance of classification-based protein function prediction algorithms by addressing these issues, using the same inter-relationships between functional classes. Using a standard measure for evaluating the semantic similarity between nodes in an ontology, we quantify and incorporate these inter-relationships into the k-nearest neighbor classifier. We present experiments on several large genomic data sets, each of which is used for the modeling and prediction of different sets of over a hundred classes from the GO Biological Process ontology.
The results show that this incorporation produces more accurate predictions for a large number of the functional classes considered, and also that the classes that benefit most from this approach are those containing the fewest members. In addition, we show how our proposed framework can be used for integrating information from the entire GO hierarchy for improving the accuracy of predictions made over a set of base classes.

Mining Low-Support Discriminative Patterns from Dense and High-Dimensional Data (2009-04-02)
Fang, Gang; Pandey, Gaurav; Wang, Wen; Gupta, Manish; Steinbach, Michael; Kumar, Vipin
Discriminative patterns can provide valuable insights into datasets with class labels that may not be available from the individual features or predictive models built using them. Most existing approaches work efficiently for sparse or low-dimensional datasets. However, for dense and high-dimensional datasets, they have to use high thresholds to produce the complete results within limited time, and thus, may miss interesting low-support patterns. In this paper, we address the necessity of trading off the completeness of discriminative pattern discovery with the efficient discovery of low-support discriminative patterns from such datasets. We propose a family of anti-monotonic measures named SupMaxK that organize the set of discriminative patterns into nested layers of subsets, which are progressively more complete in their coverage, but require increasingly more computation. In particular, the member of SupMaxK with K = 2, named SupMaxPair, is suitable for dense and high-dimensional datasets. Several experiments on a cancer gene expression dataset demonstrate that there are low-support patterns that can be discovered using SupMaxPair, but not by existing approaches, and that these patterns are statistically significant and biologically relevant. This illustrates the complementarity of SupMaxPair to existing approaches for discriminative pattern discovery.
The code and dataset for this paper are available at http://vk.cs.umn.edu/SMP/.

Subspace Differential Coexpression Analysis: Problem Definition and a General Approach (2009-07-17)
Fang, Gang; Kuang, Rui; Pandey, Gaurav; Steinbach, Michael; Myers, Chad L.; Kumar, Vipin
In this paper, we study methods to identify differential coexpression patterns in case-control gene expression data. A differential coexpression pattern consists of a set of genes that have substantially different levels of coherence of their expression profiles across the two sample-classes, i.e., highly coherent in one class, but not in the other. Biologically, a differential coexpression pattern may indicate the disruption of a regulatory mechanism possibly caused by dysregulation of pathways or mutations of transcription factors. A common feature of all the existing approaches for differential coexpression analysis is that the coexpression of a set of genes is measured on all the samples in each of the two classes, i.e., over the full-space of samples. Hence, these approaches may miss patterns that only cover a subset of samples in each class, i.e., subspace patterns, due to the heterogeneity of the subject population and disease causes. In this paper, we extend differential coexpression analysis by defining a subspace differential coexpression pattern, i.e., a set of genes that are coexpressed in a relatively large percent of samples in one class, but in a much smaller percent of samples in the other class. We propose a general approach based upon an association analysis framework that allows exhaustive yet efficient discovery of subspace differential coexpression patterns. This approach can be used to adapt a family of biclustering algorithms to obtain their corresponding differential versions that can directly discover differential coexpression patterns.
Using a recently developed biclustering algorithm as illustration, we perform experiments on cancer datasets which demonstrate the existence of subspace differential coexpression patterns. Permutation tests demonstrate the statistical significance for a large number of discovered subspace patterns, many of which cannot be discovered if they are measured over all the samples in each of the classes. Interestingly, in our experiments, some discovered subspace patterns have significant overlap with known cancer pathways, and some are enriched with the target gene sets of cancer-related microRNA and transcription factors. The source code and datasets used in this paper are available at http://vk.cs.umn.edu/SDC/.

Systematic Evaluation of Scaling Methods for Gene Expression Data (2007-06-08)
Pandey, Gaurav; Ramakrishnan, Lakshmi Naarayanan; Steinbach, Michael; Kumar, Vipin
Even after an experimentally prepared gene expression data set has been pre-processed to account for variations in the microarray technology, there may be inconsistencies between the scales of measurements in different conditions. This may happen for a variety of reasons, such as the accumulation of gene expression data prepared by different laboratories into a single data set. A variety of scaling and transformation methods have been used for addressing these scale differences in different studies on the analysis of gene expression data sets. However, a quantitative estimation of their relative performance has been lacking. In this paper, we report an extensive evaluation of scaling and transformation methods for their effectiveness, with respect to the important application of protein function prediction. We consider several such commonly used methods for gene expression data, such as z-score scaling, quantile normalization, diff transformation, and two scaling methods, sigmoid and double sigmoid, that have not been used previously in this domain to the best of our knowledge.
We show that the performance of these methods can vary significantly across different data sets. We also provide evidence that the two types of gene expression data, namely temporal and non-temporal, need different types of analyses in order to use them effectively for uncovering functional information.

Two-Dimensional Association Analysis For Finding Constant Value Biclusters In Real-Valued Data (2009-07-07)
Atluri, Gowtham; Bellay, Jeremy; Pandey, Gaurav; Myers, Chad L.; Kumar, Vipin
Biclustering is a commonly used type of analysis for real-valued data sets, and several algorithms have been proposed for finding different types of biclusters. However, no systematic approach has been proposed for exhaustively enumerating all (nearly) constant value biclusters in such data sets, which is the problem addressed in this paper. Using a monotonic range measure to capture the coherence of values in a block/submatrix of an input data matrix, we propose a two-step Apriori-based algorithm for discovering all nearly constant value biclusters, referred to as Range Constrained Blocks (RCBs). By systematic evaluation on an extensive genetic interaction data set, we show that the submatrices with similar values represent groups of genes that are more functionally related than the biclusters with diverse values. We also show that our approach can exhaustively find all the biclusters with a range less than a given threshold, while the other competing approaches cannot find all such biclusters.