Browsing by Author "Gupta, Rohit"
Now showing 1 - 5 of 5
Item A Novel Error-Tolerant Frequent Itemset Model for Binary and Real-Valued Data (2009-10-12)
Gupta, Rohit; Rao, Navneet; Kumar, Vipin
Frequent pattern mining has been successfully applied to a broad range of applications; however, it has two major drawbacks that limit its applicability in several domains. First, because the traditional 'exact' model of frequent pattern mining uses a strict definition of support, it limits the recovery of frequent itemset patterns in real-life data sets, where the patterns may be fragmented due to random noise or errors. Second, because traditional frequent pattern mining algorithms work only with binary or boolean attributes, they require real-valued attributes to be transformed into binary attributes, which often results in loss of information. Many real-life data sets are both noisy and real-valued, yet past approaches have addressed these issues independently, and there is no systematic approach that addresses both together. In this paper, we propose a novel Error-Tolerant Frequent Itemset (ETFI) model for binary as well as real-valued data. We also propose a bottom-up pattern mining algorithm to sequentially discover all ETFIs from both types of data sets. To illustrate the efficacy of our proposed ETFI approach, we use two real-valued S. cerevisiae microarray gene-expression data sets and evaluate the patterns obtained in terms of their functional coherence, as measured by GO-based functional enrichment analysis. Our results clearly demonstrate the importance of directly accounting for errors/noise in the data.
Finally, the statistical significance of the discovered ETFIs, as estimated using two randomization tests, reveals that the discovered ETFIs are indeed biologically meaningful and neither arise by random chance nor capture random structure in the data.

Item Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study (2007-03-02)
Pandey, Gaurav; Steinbach, Michael; Gupta, Rohit; Garg, Tushar; Kumar, Vipin
Protein interaction networks are one of the most promising types of biological data for the discovery of functional modules and the prediction of individual protein functions. However, it is known that these networks are both incomplete and inaccurate, i.e., they have spurious edges and lack biologically valid edges. One way to handle this problem is by transforming the original interaction graph into new graphs that remove spurious edges, add biologically valid ones, and assign reliability scores to the edges constituting the final network. We investigate existing methods and propose a robust association analysis-based method for this task. One promising method is based on the concept of h-confidence, a measure that can be used to extract groups of objects having high similarity with each other. Experimental evaluation on several protein interaction data sets shows that hyperclique-based transformations significantly enhance the performance of standard function prediction algorithms, and thus have merit.

Item Integration of Clinical and Genomic data: a Methodological Survey (2013-02-20)
Dey, Sanjoy; Gupta, Rohit; Steinbach, Michael; Kumar, Vipin
Human diseases are inherently complex and governed by the complicated interplay of several underlying factors.
Clinical research focuses on behavioral, demographic and pathology information, whereas molecular genomics focuses on finding underlying genetic and genomic factors in genomic data collected on mRNA expression, proteomics, biological networks, and other microbiological features. However, each of these clinical and genomic datasets contains information about only one particular aspect of a complex disease, rather than covering all of the underlying risk factors. This has led to a new area of research that integrates both clinical and genomic data and aims to extract more information about diseases by considering not only all the various factors, but also the interactions among those factors, which cannot be captured by clinical and genomic studies that are performed independently of each other. Although initial efforts have already been made to develop such integrative modeling of clinical and genomic data to shed light on the biological mechanisms of diseases, the research field is still at a rudimentary stage. In this review article, we survey the general issues, challenges and current work of clinicogenomic studies. We also summarize the current state of the field and discuss some possibilities for future work.

Item Integrative Biomarker Discovery for Breast Cancer Metastasis from Gene Expression and Protein Interaction Data Using Error-tolerant Pattern Mining (2009-11-24)
Gupta, Rohit; Agrawal, Smita; Rao, Navneet; Tian, Ze; Kuang, Rui; Kumar, Vipin
Biomarker discovery for complex diseases is a challenging problem. Most of the existing approaches identify individual genes as disease markers, thereby missing the interactions among genes. Moreover, often only a single biological data source is used to discover biomarkers. These factors account for the discovery of inconsistent biomarkers. In this paper, we propose a novel error-tolerant pattern mining approach for integrated analysis of gene expression and protein interaction data.
This integrated approach incorporates constraints from the protein interaction network and efficiently discovers all patterns (groups of genes) in a bottom-up fashion from the gene-expression data. We call these patterns active sub-network biomarkers. To illustrate the efficacy of our proposed approach, we used four breast cancer gene expression data sets and a human protein interaction network and showed that active sub-network biomarkers are more biologically plausible and that the genes discovered are more reproducible across studies. Finally, through pathway analysis, we also showed a substantial enrichment for known cancer genes and hence were able to generate relevant hypotheses for understanding the molecular mechanisms of breast cancer metastasis.

Item Quantitative Evaluation of Approximate Frequent Pattern Mining Algorithms (2009-02-27)
Gupta, Rohit; Fang, Gang; Field, Blayne; Steinbach, Michael; Kumar, Vipin
Traditional association mining algorithms use a strict definition of support that requires every item in a frequent itemset to occur in each supporting transaction. In real-life datasets, this limits the recovery of frequent itemset patterns, as they are fragmented due to random noise and other errors in the data. Hence, a number of methods have been proposed recently to discover approximate frequent itemsets in the presence of noise. These algorithms use a relaxed definition of support and additional parameters, such as row and column error thresholds, to allow some degree of "error" in the discovered patterns. Though these algorithms have been shown to be successful in finding approximate frequent itemsets, a systematic and quantitative approach to evaluate them has been lacking. In this paper, we propose a comprehensive evaluation framework to compare different approximate frequent pattern mining algorithms.
The key idea is to select the optimal parameters for each algorithm on a given dataset and use the itemsets generated with these optimal parameters to compare the algorithms. We also propose simple variations of some of the existing algorithms by introducing an additional post-processing step. Subsequently, we applied our proposed evaluation framework to a wide variety of synthetic datasets with varying amounts of noise, and to a real dataset, to compare the existing approximate pattern mining algorithms with our proposed variations. Source code and the datasets used in this study are made publicly available.
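
The difference between the strict and relaxed definitions of support mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' code or any specific algorithm from the paper: the function names, the toy transactions, and the way the row error threshold is applied (a transaction may miss at most a given fraction of the itemset's items) are assumptions made for illustration, and the column error threshold is left out for brevity.

```python
from typing import FrozenSet, List


def strict_support(transactions: List[FrozenSet[str]], itemset: FrozenSet[str]) -> int:
    """Count transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def relaxed_support(
    transactions: List[FrozenSet[str]],
    itemset: FrozenSet[str],
    row_error: float,
) -> int:
    """Count transactions missing at most a fraction `row_error` of the itemset.

    Assumption: the row error threshold is a fraction of the itemset size,
    rounded down to a whole number of tolerated missing items.
    """
    max_missing = int(row_error * len(itemset))
    return sum(1 for t in transactions if len(itemset - t) <= max_missing)


# Hypothetical toy data: five transactions over items a, b, c, d.
transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c", "d"}),
    frozenset({"b", "d"}),
    frozenset({"a", "b", "c", "d"}),
]
itemset = frozenset({"a", "b", "c"})

print(strict_support(transactions, itemset))         # exact matches only
print(relaxed_support(transactions, itemset, 0.34))  # tolerates one missing item
```

With a row error of 0, the relaxed count reduces to the strict one, which is a quick sanity check for any such relaxation.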