Tian, Shulan
2019-03-13
2019-03-13
2016-12
https://hdl.handle.net/11299/202123
University of Minnesota Ph.D. dissertation. December 2016. Major: Biomedical Informatics and Computational Biology. Advisors: Susan Slager, Claudia Neuhauser. 1 computer file (PDF); iv, 170 pages.
Whole exome sequencing is widely used to identify disease-associated variants in both clinical and research settings. Accurate identification of genetic variants with this technology is essential, yet major challenges remain in highly divergent but medically important genomic regions. We developed an analytical workflow that enables sensitive and accurate variant discovery in highly divergent genomic regions from whole exome sequencing data. Our workflow combines mapping-based and de novo assembly-based approaches. The tools for each were selected and optimized through extensive evaluation of their performance across different coverage depths and divergence levels, the two key factors that profoundly affect variant detection. We used simulated exome reads for an initial assessment and then public exome data from the well-studied CEPH individual NA12878 for more focused evaluations. Our analysis revealed that, as expected, the 25 combinations of five mappers and five callers performed comparably in the non-HLA regions, which have approximately 0.1-0.4% divergence. However, they differed markedly in the HLA region, where different haplotypes can show up to 10-15% divergence. We also evaluated the effect of post-alignment processing and provide a practical guideline on applying local realignment and base quality score recalibration when designing analytical workflows. We translated our findings into a highly sensitive and computationally efficient workflow for mapping-based variant discovery. Through a two-tier mapping strategy, it excels in both sensitivity and speed, not only in regions of high divergence but also in regions of low divergence.
To use local phasing information and identify transmitted variants, we also developed a de novo assembly-based variant calling workflow for whole exome data. It performs well over a wide range of coverage depths and divergence levels. In fact, for SNP detection in the HLA region, it is far superior to all other existing methods on both simulated and multiple benchmarked exome datasets. Finally, we incorporated the mapping-based and de novo assembly-based approaches into a single pipeline, providing the flexibility to detect variants by running either or both methods. Our pipeline should be particularly useful for WES projects focusing on diseases that are associated with HLA or other highly divergent regions.
en
de novo assembly
highly divergent region
variant detection
Identification Of Genetic Variation In Highly Divergent Regions Using Whole Exome Sequencing
Thesis or Dissertation