Two topics in association analysis of DNA sequencing data: population structure and multivariate traits

Zhang, Yiwei
2013-11-04
2013-11-04
2013-08
https://hdl.handle.net/11299/159590

University of Minnesota Ph.D. dissertation. August 2013. Major: Biostatistics. Advisor: Wei Pan. 1 computer file (PDF); vii, 151 pages, appendices A-C.

As next-generation sequencing technologies become mature and affordable, we now have access to massive data on single nucleotide variants (SNVs) with varying minor allele frequencies (MAFs). This creates new opportunities, as more information from the human genome is available. However, new challenges also arise, such as how to utilize SNVs with low MAFs. With current intensive efforts in association testing to detect genetic loci associated with common diseases and complex traits, two issues are of primary interest: reducing spurious findings and increasing power for true discoveries. In association testing, a major cause of the elevated level of false positives is the confounding effect of population structure, the so-called population stratification. As a remedy, one popular method is to add principal components (PCs) as covariates in a regression model, an approach named principal component regression (PCR). Yet it is not clear how PCR will work in testing rare variants (RVs, with MAF $< 0.01$), or with population stratification at a fine scale. More questions arise, such as what types and what sets of SNVs should be used to construct PCs, and whether there are better methods than principal component analysis (PCA) for constructing PCs. Utilizing the DNA sequencing data from the 1000 Genomes Project, we first investigate whether PCR is adequate in adjusting for population stratification while maintaining high power when testing low frequency variants (LFVs, with $0.01 \le$ MAF $< 0.05$) and RVs. Furthermore, we compare the performance of two dimension reduction methods, PCA and spectral dimension reduction (SDR), as well as twelve different types and sets of variants for constructing PCs.
The comparison is conducted with respect to controlling population stratification at a fine scale. On the other hand, linear mixed models (LMMs) have emerged as an alternative because of their superior performance in handling complex population structures. Herein, we examine the connections and differences between PCR and LMMs based on the formulation of probabilistic PCA, and propose a hybrid method combining the two. Its outstanding performance in addressing both population structure and environmental confounders is established by simulations using the Genetic Analysis Workshop (GAW) 18 data and the 1000 Genomes Project data. Lastly, we consider boosting power for association analysis of multivariate traits. A new class of tests, the sum of powered score (SPU) tests, and an adaptive SPU (aSPU) test are extended to the generalized estimating equations (GEE) framework. We apply the new methods and some existing methods to association testing on both common variants (CVs) and RVs with an HIV/AIDS dataset and the GAW 18 data.

en-US

Association testing
Multivariate traits
Next-generation sequencing
Population structure
Principal component

Thesis or Dissertation