Identification Of Genetic Variation In Highly Divergent Regions Using Whole Exome Sequencing


Thesis or Dissertation


Whole exome sequencing is widely used for identifying disease-associated variants in both clinical and research settings. Using this technology to accurately identify genetic variants is essential, yet major challenges remain in highly divergent but medically important genomic regions. We developed an analytical workflow enabling sensitive and accurate variant discovery for highly divergent genomic regions from whole exome sequencing data. Our workflow combines both mapping- and de novo assembly-based approaches, for which the tools were selected and optimized through extensive evaluation of their performance across different coverage depths and divergence levels, the two key factors profoundly impacting variant detection. We used simulated exome reads for an initial assessment and then public exome data from a well-studied CEPH individual, NA12878, for more focused evaluations. Our analysis revealed that the 25 combinations of five mappers and five callers performed comparably, as expected, in the non-HLA regions, which have approximately 0.1-0.4% divergence. However, they differed markedly in the HLA region, in which different haplotypes can show up to 10-15% divergence. We also evaluated the effect of post-alignment processing and provide a practical guideline regarding the application of local realignment and base quality score recalibration in designing analytical workflows. We translated our findings into a highly sensitive and computationally efficient workflow for mapping-based variant discovery. Through our two-tier mapping strategy, it excels in both sensitivity and speed, not only in regions of high divergence but also in regions of low divergence. To utilize the local phasing information and identify transmitted variants, we also developed a de novo assembly-based variant calling workflow for whole exome data. It performs well over a wide range of coverage depths and divergence levels. In fact, for SNP detection in the HLA region, it is far superior to all other existing methods on both simulated and multiple benchmarked exome datasets. Finally, we incorporated the mapping- and de novo assembly-based approaches into a single pipeline, providing the flexibility of variant detection by running either or both methods. Our pipeline should be particularly useful for WES projects focusing on diseases that are associated with HLA or other highly divergent regions.
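
The kind of comparison described in the abstract, scoring every mapper and caller combination against a benchmark call set, can be illustrated with a minimal sketch. This is not the dissertation's code: the tool names (`mapper_A`, `caller_X`, and so on), the helper `evaluate_calls`, and the toy variants are all hypothetical, and variants are assumed to be comparable as simple (chromosome, position, ref, alt) tuples.

```python
from itertools import product


def evaluate_calls(called, truth):
    """Return (sensitivity, precision) for two sets of variant keys.

    Each variant is a (chrom, pos, ref, alt) tuple; this exact-match
    comparison is an assumption made for illustration only.
    """
    true_pos = len(called & truth)
    sensitivity = true_pos / len(truth) if truth else 0.0
    precision = true_pos / len(called) if called else 0.0
    return sensitivity, precision


# Hypothetical tool labels; the abstract evaluates five mappers and
# five callers but does not name them here.
mappers = ["mapper_A", "mapper_B"]
callers = ["caller_X", "caller_Y"]

# Toy benchmark ("truth") variants, invented for the example.
v1 = ("chr6", 100, "A", "G")
v2 = ("chr6", 250, "C", "T")
v3 = ("chr6", 400, "G", "A")
false_call = ("chr6", 999, "T", "C")
truth = {v1, v2, v3}

# Toy call sets, one per mapper/caller combination.
call_sets = {
    ("mapper_A", "caller_X"): {v1, v2},
    ("mapper_A", "caller_Y"): {v1, v2, v3, false_call},
    ("mapper_B", "caller_X"): {v1},
    ("mapper_B", "caller_Y"): set(),
}

# Score every combination against the benchmark.
results = {
    combo: evaluate_calls(call_sets[combo], truth)
    for combo in product(mappers, callers)
}

for (mapper, caller), (sens, prec) in results.items():
    print(f"{mapper} + {caller}: sensitivity={sens:.2f}, precision={prec:.2f}")
```

In a real evaluation the call sets would come from VCF files restricted to the regions of interest (for example, HLA versus non-HLA), and variant representation would need normalization before comparison; both steps are omitted here.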


University of Minnesota Ph.D. dissertation. December 2016. Major: Biomedical Informatics and Computational Biology. Advisors: Susan Slager, Claudia Neuhauser. 1 computer file (PDF); iv, 170 pages.

Suggested citation

Tian, Shulan. (2016). Identification Of Genetic Variation In Highly Divergent Regions Using Whole Exome Sequencing. Retrieved from the University Digital Conservancy,
