Browsing by Author "Fang, Gang"
Now showing 1 - 9 of 9
- Results Per Page
- Sort Options
Item A Computationally Efficient and Statistically Powerful Framework for Searching High-order Epistasis with Systematic Pruning and Gene-set Constraints(2010-06-21) Fang, Gang; Haznadar, Majda; Wang, Wen; Steinbach, Michael; Van Ness, Brian; Kumar, VipinThis paper has not yet been submitted.Item Characterizing Discriminative Patterns(2011-02-18) Fang, Gang; Wang, Wen; Oatley, Benjamin; NessVan, Brian; Steinbach, Michael; Kumar, VipinDiscriminative patterns are association patterns that occur with disproportionate frequency in some classes versus others, and have been studied under names such as emerging patterns and contrast sets. Such patterns have demonstrated considerable value for classification and subgroup discovery, but a detailed understanding of the types of interactions among items in a discriminative pattern is lacking. To address this issue, we propose to categorize discriminative patterns according to four types of item interaction: (i) driver-passenger, (ii) coherent, (iii) independent additive and (iv) synergistic beyond independent additive. The coherent, additive, and synergistic patterns are of practical importance, with the latter two representing a gain in the discriminative power of a pattern over its subsets. Synergistic patterns are most restrictive, but perhaps the most interesting since they capture a cooperative effect that is more than the sum of the effects of the individual items in the pattern. For domains such as biomedical and genetic research, differentiating among these types of patterns is critical since each yields very different biological interpretations. For general domains, the characterization provides a novel view of the nature of the discriminative patterns in a dataset, which yields insights beyond those provided by current approaches that focus mostly on pattern-based classification and subgroup discovery. This paper presents a comprehensive discussion that defines these four pattern types and investigates their properties and their relationship to one another. In addition, these ideas are explored for a variety of datasets (ten UCI datasets, one gene expression dataset and two genetic-variation datasets). The results demonstrate the existence, characteristics and statistical significance of the different types of patterns. They also illustrate how pattern characterization can provide novel insights into discriminative pattern mining and the discriminative structure of different datasets. Codes for pattern characterization and supplementary documents are available at http://vk.cs.umn.edu/CDPItem Construction and Functional Analysis of Human Genetic Interaction Networks with Genome-wide Association Data(2011-01-18) Fang, Gang; Wang, Wen; Paunic, Vanja; Oatley, Benjamin; Haznadar, Majda; Steinbach, Michael; Van Ness, Brian; Myers, Chad L.; Kumar, VipinMotivation: Genetic interaction measures how different genes collectively contribute to a phenotype, and can reveal functional compensation and buffering between pathways under genetic perturbations. Recently, genome-wide investigation for genetic interactions has revealed genetic interaction networks that provide novel insights both when analyzed independently and when integrated with other functional genomic datasets. For higher eukaryotes such as human, the above reverse-genetics approaches are not straightforward since the phenotypes of interest for higher eukaryotes such as disease onset or survival, are difficult to study in a cell based assay. Results: In this paper, we propose a general framework for constructing and analyzing human genetic interaction networks from genome-wide single nucleotide polymorphism (SNP) datasets used for case-control studies on complex diseases. Specifically, we propose a general approach with three major steps: (1) estimating SNP-SNP genetic interactions, (2) identifying linkage disequilibrium (LD) blocks and mapping SNP-SNP interactions to LD block-block interactions, and (3) functional mapping for LD blocks. We performed two sets of functional analyses for each of the six case-control SNP datasets used in the paper, and demonstrated that (i) genes in LD blocks showing similar interaction profiles tend to be functionally related, and (ii) the network can be used to discover pairs of compensatory gene modules (between-pathway models) in their joint association with a disease phenotype. The proposed framework should provide novel insights beyond existing approaches that either ignore interactions between SNPs or model different SNP-SNP pairs with genetic interactions separately. Furthermore, our study provides evidence that some of the core properties of genetic interaction networks based on reverse genetics in model organisms like yeast are also present in genetic interactions revealed by natural variation in human populations. Availability: Supplementary material http://vk.cs.umn.edu/humanGIItem Discovering combinatorial disease biomarkers(2012-08) Fang, GangMany diseases have a genetic component. Some, including many cancers, are caused by a change in the functioning of a gene or a group of genes in a person's cells. Disease-biomarker discovery seeks to find the association between diseases and a person's genetic or associated characteristics, such as genes, DNA mutations, methylations, non-coding RNAs, proteins, metabolic products, and biological pathways. These biomarkers, such as the mutations in the BRCA1 and BRCA2 genes that indicate a high risk of breast cancer, can help in understanding the mechanisms causing a disease, and can guide diagnosis, prognosis and treatment. With the recent availability of high-throughput "-omics" and next-generation sequencing data, biomarker discovery is shifting from hypothesis-driven analysis towards data-driven analysis, which enables the discovery of previously unsuspected genetic associations for a variety of diseases. However, for most diseases, there remains a substantial disparity between the disease risk explained by the discovered loci and the estimated total heritable disease risk based on familial aggregation, a problem that has been referred to as "missing heritability". While there are a number of possible explanations for missing heritability, genetic interactions between loci are one potential culprit. Genetic interactions generally refer to two or more genes whose contribution to a phenotype goes beyond the independent effects of the genes and are expected to play an important role in complex diseases. This thesis takes a data mining based approach, specifically discriminative pattern mining, and targets the computational discovery of combinatorial biomarkers associated with complex human diseases from a variety of large scale case control genomic datasets. It addresses several key challenges confronted by existing discriminative pattern mining algorithms: computational complexity, sample heterogeneity due to disease subtypes and lack of statistical power for most real datasets. It also proposes a novel concept to organize discriminative patterns into an interaction network that allows the discovery of high-level structural knowledge, in both global and local scales. Specifically, a general framework is proposed to detect pathway-pathway interaction pairs that are enriched for genetic level interactions from genome wide association datasets. Validations across independent real datasets not only demonstrate the reliability of the proposed framework but also lead to several interesting biological insights on several complex diseases such as breast cancer and Parkinson's disease. The data-mining algorithmic contributions in this thesis also hold promise for addressing generic challenges in other domains beyond biology.Item Integration of Differential Gene-combination Search and Gene Set Enrichment Analysis: A General Approach(2009-12-30) Fang, Gang; Steinbach, Michael; Myers, Chad L.; Kumar, VipinMotivation: Gene Set Enrichment Analysis (GSEA) and its variations aim to discover collections of genes that show moderate but coordinated differences in expression. However, such techniques may be ineffective if many individual genes in a phenotype-related gene set have weak discriminative power. A potential solution is to search for combinations of genes that are highly differentiating even when individual genes are not. Although such techniques have been developed, these approaches have not been used with GSEA to any significant degree because of the large number of potential gene combinations and the heterogeneity of measures that assess the differentiation provided by gene groups of different sizes. Results: To integrate the search for differentiating gene combinations and GSEA, we propose a general framework with two key components: (A) a procedure that reduces the number of scores to be handled by GSEA to the number of genes by summarizing the scores of the gene combinations involving a particular gene in a single score, and (B) a procedure to integrate the heterogeneous scores from combinations of different sizes and from different gene combination measures by mapping the scores to p-values. Experiments on four gene expression data sets demonstrate that the integration of GSEA and gene combination search can enhance the power of traditional GSEA by discovering gene sets that include genes with weak individual differentiation but strong joint discriminative power. Also, gene sets discovered by the integrative framework share several common biological processes and improve the consistency of the results among three lung cancer data sets. Availability: Source code and datasets: http://vk.cs.umn.edu/ICG/. Contact: gangfang@cs.umn.eduItem Mining Low-Support Discriminative Patterns from Dense and High-Dimensional Data(2009-04-02) Fang, Gang; Pandey, Gaurav; Wang, Wen; Gupta, Manish; Steinbach, Michael; Kumar, VipinDiscriminative patterns can provide valuable insights into datasets with class labels, that may not be available from the individual features or predictive models built using them. Most existing approaches work efficiently for sparse or low-dimensional datasets. However, for dense and high-dimensional datasets, they have to use high thresholds to produce the complete results within limited time, and thus, may miss interesting low-support patterns. In this paper, we address the necessity of trading off the completeness of discriminative pattern discovery with the efficient discovery of low-support discriminative patterns from such datasets. We propose a family of anti-monotonic measures named SupMaxK that organize the set of discriminative patterns into nested layers of subsets, which are progressively more complete in their coverage, but require increasingly more computation. In particular, the member of SupMaxK with K = 2, named SupMaxPair, is suitable for dense and high-dimensional datasets. Several experiments on a cancer gene expression dataset demonstrate that there are low-support patterns that can be discovered using SupMaxPair, but not by existing approaches, and that these patterns are statistically significant and biologically relevant. This illustrates the complementarity of SupMaxPair to existing approaches for discriminative pattern discovery. The codes and dataset for this paper are available at http://vk.cs.umn.edu/SMP/.Item Quantitative Evaluation of Approximate Frequent Pattern Mining Algorithms(2009-02-27) Gupta, Rohit; Fang, Gang; Field, Blayne; Steinbach, Michael; Kumar, VipinTraditional association mining algorithms use a strict definition of support that requires every item in a frequent itemset to occur in each supporting transaction. In real-life datasets, this limits the recovery of frequent itemset patterns as they are fragmented due to random noise and other errors in the data. Hence, a number of methods have been proposed recently to discover approximate frequent itemsets in the presence of noise. These algorithms use a relaxed definition of support and additional parameters, such as row and column error thresholds to allow some degree of "error" in the discovered patterns. Though these algorithms have been shown to be successful in finding the approximate frequent itemsets, a systematic and quantitative approach to evaluate them has been lacking. In this paper, we propose a comprehensive evaluation framework to compare different approximate frequent pattern mining algorithms. The key idea is to select the optimal parameters for each algorithm on a given dataset and use the itemsets generated with these optimal parameters in order to compare different algorithms. We also propose simple variations of some of the existing algorithms by introducing an additional post-processing step. Subsequently, we have applied our proposed evaluation framework to a wide variety of synthetic datasets with varying amounts of noise and a real dataset to compare existing and our proposed variations of the approximate pattern mining algorithms. Source code and the datasets used in this study are made publicly available.Item Subspace Differential Coexpression Analysis: Problem Definition and a General Approach(2009-07-17) Fang, Gang; Kuang, Rui; Pandey, Gaurav; Steinbach, Michael; Myers, Chad L.; Kumar, VipinIn this paper, we study methods to identify differential coexpression patterns in case-control gene expression data. A differential coexpression pattern consists of a set of genes that have substantially different levels of coherence of their expression profiles across the two sample-classes, i.e., highly coherent in one class, but not in the other. Biologically, a differential coexpression patterns may indicate the disruption of a regulatory mechanism possibly caused by disregulation of pathways or mutations of transcription factors. A common feature of all the existing approaches for differential coexpression analysis is that the coexpression of a set of genes is measured on all the samples in each of the two classes, i.e., over the full-space of samples. Hence, these approaches may miss patterns that only cover a subset of samples in each class, i.e., subspace patterns, due to the heterogeneity of the subject population and disease causes. In this paper, we extend differential coexpression analysis by defining a subspace differential coexpression pattern, i.e., a set of genes that are coexpressed in a relatively large percent of samples in one class, but in a much smaller percent of samples in the other class. We propose a general approach based upon association analysis framework that allows exhaustive yet efficient discovery of subspace differential coexpression patterns. This approach can be used to adapt a family of biclustering algorithms to obtain their corresponding differential versions that can directly discover differential coexpression patterns. Using a recently developed biclustering algorithm as illustration, we perform experiments on cancer datasets which demonstrates the existence of subspace differential coexpression patterns. Permutation tests demonstrate the statistical significance for a large number of discovered subspace patterns, many of which can not be discovered if they are measured over all the samples in each of the classes. Interestingly, in our experiments, some discovered subspace patterns have significant overlap with known cancer pathways, and some are enriched with the target gene sets of cancer-related microRNA and transcription factors. The source codes and datasets used in this paper are available at http://vk.cs.umn.edu/SDC/.Item The Nature and Limits of Discriminative Patterns(2012-12-17) Steinbach, Michael; Yu, Hayou; Fang, Gang; Paunic, Vanja; Kumar, VipinDiscriminative pattern mining seeks patterns that are more prevalent in one class than another and provide good classification accuracy for the objects in which the patterns occur. A number of approaches have been proposed for finding such patterns, which are also known under a variety of names, e.g., contrast sets and emerging patterns. However, fundamental questions about the nature and limits of such patterns remain unanswered. For instance, a discriminative pattern is only interesting if it provides better discriminative power than any of its subpatterns, but it is not obvious, for example, how much additional discriminative power can be provided by a pattern over and above the discriminative power of its subpatterns. Also, what do the patterns that provide the most additional discrimination look like? And, what is the relationship of different measures for discrimination (e.g., mutual information and DiffSup, the difference of the supports in the two classes). In previous work, we made an initial attempt at analyzing the first two questions. In this paper we present several new developments. Specifically, we present a more elegant and efficient formulation of the problem of determining the best discriminative pattern that can be obtained for a particular number of variables. We also explore for the first time the limits of patterns that go beyond the 'and' logic of traditional pattern mining, e.g., patterns based on the logic of 'or', 'n of k' or 'majority wins'. We show that the discriminative advantage of 'and' based patterns over their subpatterns is more limited than that of some of the other patterns, and hence, these patterns may represent a potential area for future development of discriminative pattern mining. Finally, we explore the relationship of various measures of discriminative pattern mining. We show that our results, although based on one of the measures (DiffSup) have implications for mutual information. More generally, we identify a potential avenue of exploration, which although challenging, may offer the opportunity for making a more definitive and general statement about a certain class of discriminative measures.