Browsing by Author "Paunic, Vanja"
Item Computational approaches to prediction and analysis of human leukocyte antigen genes (2014-03) Paunic, Vanja
The Human Leukocyte Antigen (HLA) gene system is the most polymorphic region of the human genome, containing some of the strongest associations with autoimmune, infectious, and inflammatory diseases. It plays a crucial role in hematopoietic stem cell transplantation, where patients and donors are matched with respect to their HLA genes to maximize the chances of a successful transplant. As such, HLA data is a highly valuable asset for clinicians and researchers for elucidating various disease-driving biological mechanisms. This thesis contains original research on the analysis of uncertainty in HLA data, exploration of the strong correlation structure in the region and prediction of HLA genes from widely available genetic markers. We start by describing a novel method for correlated multi-label, multi-class prediction, which aims to solve the problem of prediction of HLA genes from widely available Single Nucleotide Polymorphism (SNP) data. Direct typing of HLA genes for large studies is expensive due to their extreme genetic polymorphism. Therefore, obtaining the HLA genes by prediction, rather than genetic typing, would be highly time- and cost-effective. In this study we use a two-step approach, involving label (gene) independent classifiers and label dependencies in the form of HLA haplotype frequencies, to predict HLA genes from SNP data. In addition, we propose different ways of integrating label dependency information into the prediction process and evaluate their impact on the prediction performance. The results from experiments on real-world data sets show that adding label dependencies into the prediction of HLA genes increases prediction accuracy when compared against the gene-independent approach. Next, we aim to resolve and quantify the uncertainty that exists in HLA data sets.
Due to the high genetic polymorphism of HLA genes, their molecular typing often results in a set of uncertain or ambiguous assignments, rather than an exact allele assignment at each gene. We propose a novel information-theoretic measure to quantify uncertainty in HLA typing. In addition, we demonstrate that using the HLA gene dependencies, which reflect the strong correlation structure in the region, decreases the uncertainty in HLA data. In the fourth chapter of the thesis, we propose a novel approach for multi-label prediction from uncertain data in the context of SNP-based prediction of HLA genes using ambiguous HLA data in training. Most existing HLA data sets contain uncertainty and, as such, need to be imputed to exact data before being used for training prediction models. Existing approaches for prediction of HLA genes from SNP data do not accommodate learning from uncertain data and, as such, miss the potential for an increased sample size and consequently improvements in prediction performance. In this thesis, we propose a novel algorithm for SNP-based prediction of HLA genes that utilizes ambiguous HLA data for building the prediction model. Additionally, we measure the impact that the uncertainty in the training data has on the prediction accuracy, and evaluate it on a real-world data set. Our results show that the prediction from ambiguous HLA data generally performs better than the alternative approach, which first imputes the ambiguous data into high-resolution HLA alleles and then uses them to build the model. The work in this thesis is a step toward understanding the immense challenges in the analysis of the HLA gene system.
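The abstract does not state the form of its information-theoretic uncertainty measure, so the following is only a minimal sketch that uses plain Shannon entropy as a stand-in. It assumes an ambiguous typing result can be represented as a set of candidate alleles with non-negative weights (for example, population frequencies). The function name `typing_entropy`, the allele labels, and all weights are hypothetical and chosen purely for illustration.

```python
import math
from typing import Mapping


def typing_entropy(candidates: Mapping[str, float]) -> float:
    """Shannon entropy (in bits) of an ambiguous typing result.

    `candidates` maps each candidate allele to a non-negative weight.
    Weights are renormalized to sum to 1 before the entropy is computed.
    This is an illustrative stand-in, not the measure defined in the thesis.
    """
    total = sum(candidates.values())
    if total <= 0:
        raise ValueError("at least one candidate must have positive weight")
    entropy = 0.0
    for weight in candidates.values():
        if weight > 0:
            p = weight / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical toy inputs (not real allele names or frequencies).
ambiguous = {"allele_1": 0.25, "allele_2": 0.25, "allele_3": 0.25, "allele_4": 0.25}
# Same typing after some candidates are ruled out, e.g. by dependency information.
narrowed = {"allele_1": 0.5, "allele_2": 0.5}
# An exact (unambiguous) typing result.
exact = {"allele_1": 1.0}

for name, typing in (("ambiguous", ambiguous), ("narrowed", narrowed), ("exact", exact)):
    print(name, typing_entropy(typing))
```

Under this stand-in, an exact assignment has zero entropy, and removing candidates from a uniform ambiguous set lowers the entropy, which mirrors the abstract's qualitative claim that gene dependencies decrease uncertainty.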
In this thesis, we: i) define and solve a problem of prediction of HLA genes from widely available genetic markers using a correlated multi-label, multi-class approach, ii) define and validate a measure to quantify the uncertainty present in HLA data sets, and iii) propose a novel approach to correlated prediction from uncertain data in the context of prediction of HLA genes. We conclude the thesis by discussing future work to further the understanding of this important genetic region through novel computational algorithms.

Item Construction and Functional Analysis of Human Genetic Interaction Networks with Genome-wide Association Data (2011-01-18) Fang, Gang; Wang, Wen; Paunic, Vanja; Oatley, Benjamin; Haznadar, Majda; Steinbach, Michael; Van Ness, Brian; Myers, Chad L.; Kumar, Vipin
Motivation: Genetic interaction measures how different genes collectively contribute to a phenotype, and can reveal functional compensation and buffering between pathways under genetic perturbations. Recently, genome-wide investigation of genetic interactions has revealed genetic interaction networks that provide novel insights both when analyzed independently and when integrated with other functional genomic datasets. For higher eukaryotes such as humans, the above reverse-genetics approaches are not straightforward, since the phenotypes of interest, such as disease onset or survival, are difficult to study in a cell-based assay.
Results: In this paper, we propose a general framework for constructing and analyzing human genetic interaction networks from genome-wide single nucleotide polymorphism (SNP) datasets used for case-control studies on complex diseases. Specifically, we propose a general approach with three major steps: (1) estimating SNP-SNP genetic interactions, (2) identifying linkage disequilibrium (LD) blocks and mapping SNP-SNP interactions to LD block-block interactions, and (3) functional mapping for LD blocks.
We performed two sets of functional analyses for each of the six case-control SNP datasets used in the paper, and demonstrated that (i) genes in LD blocks showing similar interaction profiles tend to be functionally related, and (ii) the network can be used to discover pairs of compensatory gene modules (between-pathway models) in their joint association with a disease phenotype. The proposed framework should provide novel insights beyond existing approaches that either ignore interactions between SNPs or model different SNP-SNP pairs with genetic interactions separately. Furthermore, our study provides evidence that some of the core properties of genetic interaction networks based on reverse genetics in model organisms like yeast are also present in genetic interactions revealed by natural variation in human populations.
Availability: Supplementary material: http://vk.cs.umn.edu/humanGI

Item The Nature and Limits of Discriminative Patterns (2012-12-17) Steinbach, Michael; Yu, Hayou; Fang, Gang; Paunic, Vanja; Kumar, Vipin
Discriminative pattern mining seeks patterns that are more prevalent in one class than another and provide good classification accuracy for the objects in which the patterns occur. A number of approaches have been proposed for finding such patterns, which are also known under a variety of names, e.g., contrast sets and emerging patterns. However, fundamental questions about the nature and limits of such patterns remain unanswered. For instance, a discriminative pattern is only interesting if it provides better discriminative power than any of its subpatterns, but it is not obvious how much additional discriminative power can be provided by a pattern over and above the discriminative power of its subpatterns. Also, what do the patterns that provide the most additional discrimination look like?
And what is the relationship between different measures of discrimination (e.g., mutual information and DiffSup, the difference of the supports in the two classes)? In previous work, we made an initial attempt at analyzing the first two questions. In this paper we present several new developments. Specifically, we present a more elegant and efficient formulation of the problem of determining the best discriminative pattern that can be obtained for a particular number of variables. We also explore for the first time the limits of patterns that go beyond the 'and' logic of traditional pattern mining, e.g., patterns based on the logic of 'or', 'n of k' or 'majority wins'. We show that the discriminative advantage of 'and'-based patterns over their subpatterns is more limited than that of some of the other patterns, and hence, these patterns may represent a potential area for future development of discriminative pattern mining. Finally, we explore the relationship of various measures of discriminative pattern mining. We show that our results, although based on one of the measures (DiffSup), have implications for mutual information. More generally, we identify a potential avenue of exploration, which, although challenging, may offer the opportunity for making a more definitive and general statement about a certain class of discriminative measures.
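The two measures named in the abstract above can be illustrated with a minimal sketch. It assumes binary transactions (tuples of 0/1 item indicators) split into two classes, takes DiffSup as the absolute difference of a pattern's supports in the two classes, and computes mutual information between the pattern's presence and the class label. The function names, the 'and'/'or' pattern helpers, and the toy transactions are hypothetical and not taken from the paper.

```python
import math
from typing import Callable, Sequence, Tuple

Row = Tuple[int, ...]
Pattern = Callable[[Row], bool]


def support(pattern: Pattern, rows: Sequence[Row]) -> float:
    """Fraction of rows in which the pattern occurs (rows must be non-empty)."""
    return sum(1 for row in rows if pattern(row)) / len(rows)


def diff_sup(pattern: Pattern, class_a: Sequence[Row], class_b: Sequence[Row]) -> float:
    """DiffSup: absolute difference of the pattern's supports in the two classes."""
    return abs(support(pattern, class_a) - support(pattern, class_b))


def mutual_information(pattern: Pattern, class_a: Sequence[Row], class_b: Sequence[Row]) -> float:
    """Mutual information (bits) between pattern presence and class label."""
    n = len(class_a) + len(class_b)
    hits = {"a": sum(1 for r in class_a if pattern(r)),
            "b": sum(1 for r in class_b if pattern(r))}
    sizes = {"a": len(class_a), "b": len(class_b)}
    p_present = (hits["a"] + hits["b"]) / n
    mi = 0.0
    for label in ("a", "b"):
        p_label = sizes[label] / n
        cells = ((hits[label], p_present),
                 (sizes[label] - hits[label], 1.0 - p_present))
        for count, p_x in cells:
            if count == 0:
                continue  # empty cells contribute nothing
            p_joint = count / n
            mi += p_joint * math.log2(p_joint / (p_x * p_label))
    return mi


def and_pattern(row: Row) -> bool:
    """Traditional itemset logic: every item is present."""
    return all(row)


def or_pattern(row: Row) -> bool:
    """'or' logic: at least one item is present."""
    return any(row)


# Hypothetical toy transactions over two items.
class_a = [(1, 1), (1, 1), (1, 0), (0, 1)]
class_b = [(1, 0), (0, 1), (0, 0), (0, 0)]

for name, pattern in (("and", and_pattern), ("or", or_pattern)):
    print(name, diff_sup(pattern, class_a, class_b),
          mutual_information(pattern, class_a, class_b))
```

Representing patterns as predicates keeps the sketch open to the other logics the abstract mentions, such as 'n of k' or 'majority wins', which could be written as further functions of a row.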