Paunic, Vanja
2014-04-21
2014-04-21
2014-03
https://hdl.handle.net/11299/163019
University of Minnesota Ph.D. dissertation. March 2014. Major: Computer Science. Advisor: Vipin Kumar. 1 computer file (PDF); x, 103 pages.

The Human Leukocyte Antigen (HLA) gene system is the most polymorphic region of the human genome, containing some of the strongest associations with autoimmune, infectious, and inflammatory diseases. It plays a crucial role in hematopoietic stem cell transplantation, where patients and donors are matched with respect to their HLA genes to maximize the chances of a successful transplant. As such, HLA data is a highly valuable asset for clinicians and researchers for elucidating various disease-driving biological mechanisms. This thesis contains original research on the analysis of uncertainty in HLA data, exploration of the strong correlation structure in the region and prediction of HLA genes from widely available genetic markers. We start by describing a novel method for correlated multi-label, multi-class prediction, which aims to solve the problem of prediction of HLA genes from widely available Single Nucleotide Polymorphism (SNP) data. Direct typing of HLA genes for large studies is expensive due to their extreme genetic polymorphism. Therefore, obtaining the HLA genes by prediction, rather than genetic typing, would be highly time- and cost-effective. In this study we use a two-step approach, involving label (gene) independent classifiers and label dependencies in the form of HLA haplotype frequencies, to predict HLA genes from SNP data. In addition, we propose different ways of integrating label dependency information into the prediction process and evaluate their impact on the prediction performance. The results from experiments on real-world data sets show that adding label dependencies into the prediction of HLA genes increases prediction accuracy when compared against the gene-independent approach.
Next, we aim to resolve and quantify the uncertainty that exists in HLA data sets. Due to the high genetic polymorphism of HLA genes, their molecular typing often results in a set of uncertain or ambiguous assignments, rather than an exact allele assignment at each gene. We propose a novel, information-theoretic measure to quantify uncertainty in HLA typing. In addition, we demonstrate that using the HLA gene dependencies, which reflect the strong correlation structure in the region, decreases the uncertainty in HLA data. In the fourth chapter of the thesis, we propose a novel approach for multi-label prediction from uncertain data in the context of SNP-based prediction of HLA genes using ambiguous HLA data in training. Most existing HLA data sets contain uncertainty and, as such, need to be imputed to exact data before being used for training prediction models. Existing approaches for prediction of HLA genes from SNP data do not accommodate learning from uncertain data and, as such, miss the potential for an increased sample size and, consequently, improved prediction performance. In this thesis, we propose a novel algorithm for SNP-based prediction of HLA genes that utilizes ambiguous HLA data for building the prediction model. Additionally, we measure the impact that the uncertainty in the training data has on the prediction accuracy, and evaluate it on a real-world data set. Our results show that prediction from ambiguous HLA data generally performs better than the alternative approach, which first imputes the ambiguous data into high-resolution HLA alleles and then uses them to build the model. The work in this thesis is a step toward understanding the immense challenges in the analysis of the HLA gene system.
In this thesis, we: i) define and solve a problem of prediction of HLA genes from widely available genetic markers using a correlated multi-label, multi-class approach, ii) define and validate a measure to quantify the uncertainty present in HLA data sets, and iii) propose a novel approach to correlated prediction from uncertain data in the context of prediction of HLA genes. We conclude the thesis by discussing future work to further the understanding of this important genetic region through novel computational algorithms.

en-US
Computational approaches to prediction and analysis of human leukocyte antigen genes
Thesis or Dissertation