Browsing by Subject "structural variation"
Defense-related gene families in the model legume, Medicago truncatula: computational analysis, pan-genome characterization, and structural variation (2015-06)
Zhou, Peng

Medicago truncatula is a model for investigating legume genetics and the evolution of legume-rhizobia symbiosis. Over the past two decades, two large gene families in M. truncatula, the nucleotide-binding site leucine-rich repeat (NBS-LRR) family and the nodule-specific cysteine-rich (NCR) family, have received considerable attention because of their involvement in disease resistance and nodulation, their large family size, and their high nucleotide and copy number diversity. While NBS-LRRs have been found in all plant species and are therefore relatively well characterized at the sequence level, members of the cysteine-rich protein (CRP) families, including NCRs, have generally been overlooked by popular similarity search tools and gene prediction techniques because of (a) their small size, (b) high sequence divergence among family members, and (c) the limited availability of expression evidence. In this thesis, I first developed a homology-based gene prediction program, the Small Peptide Alignment Detection Algorithm (SPADA), to accurately predict small peptides, including CRPs, at the genome level. Given a high-quality profile alignment, SPADA identifies and annotates nearly all family members in the tested genomes and outperforms all general-purpose gene prediction programs surveyed. SPADA found numerous mis-annotations in the current Arabidopsis and Medicago genome databases, most of them supported by RNA-Seq data. As a homology-based gene prediction tool, SPADA also works well on other classes of small secreted peptides in plants (e.g., self-incompatibility protein homologues) and on non-secreted peptides outside the plant kingdom.
I then comprehensively annotated the NBS-LRR and NCR gene families in the Medicago reference genome (version 4.0) and set out to characterize natural variation of these genes in diverse M. truncatula accessions. Previous studies that used whole-genome sequence data to identify sequence polymorphisms (SNPs and short insertions/deletions) relied on mapping short reads to a single reference genome. However, the limitations of read-mapping approaches have hindered variant detection, especially in repeat-rich and highly divergent regions. Studies of these large gene families have been hindered for the same reason: sequence similarity among family members is high, and so is divergence among accessions. In this work I constructed high-quality de novo assemblies for 15 M. truncatula accessions, which allowed me to detect novel genetic variation that would not have been found by mapping reads to a single reference. This analysis led to a within-species diversity estimate 70% higher than that of previous mapping-based resequencing efforts, despite a smaller sample size. These results demonstrate that de novo assembly-based comparison is both more accurate and more precise than mapping-based variant calling for exploring variation in repetitive and highly divergent regions. For the first time in plants, my results enable systematic identification and characterization of different types of structural variants (SVs) using a synteny-based approach. This analysis suggests that, depending on the divergence from the reference accession, 7% to 21% of the entire genome is involved in large structural changes, affecting 10% to 28% of all gene models. The results identify 64 Mbp of unique sequence segments absent from the reference, including 30 Mbp shared by at least two accessions and 34 Mbp of accession-specific sequence, thus expanding the Medicago reference space (389 Mbp) by 16%.
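The expansion figure quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the abstract; the constant and function names are illustrative and are not taken from the thesis or from any of its software.

```python
# Figures quoted in the abstract, in Mbp.
REFERENCE_MBP = 389        # size of the Medicago reference space
SHARED_NOVEL_MBP = 30      # absent from the reference, shared by at least 2 accessions
SPECIFIC_NOVEL_MBP = 34    # absent from the reference, found in a single accession


def expansion_percent(reference_mbp, *novel_mbp):
    """Return the summed novel sequence as a percentage of the reference size.

    Hypothetical helper written for this illustration only.
    """
    return 100.0 * sum(novel_mbp) / reference_mbp


total_novel = SHARED_NOVEL_MBP + SPECIFIC_NOVEL_MBP
pct = expansion_percent(REFERENCE_MBP, SHARED_NOVEL_MBP, SPECIFIC_NOVEL_MBP)
print(f"{total_novel} Mbp of novel sequence is about {pct:.0f}% of the reference")
```

The two categories sum to the 64 Mbp total, and 64 / 389 rounds to the reported 16%.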
Evidence-based annotation of the 15 de novo assemblies revealed that more than half of the reference gene models were structurally diverse (lower than 60% sequence similarity) in at least one other accession. Not surprisingly, the NBS-LRR gene family harbors by far the highest levels of nucleotide diversity, large-effect single-nucleotide changes, mean pairwise protein distance, and copy number variation (levels comparable with those of transposable elements), consistent with the rapidly evolving dynamics of disease resistance phenotypes. Characterization of deletion and tandem duplication events in the NBS-LRR and NCR gene families suggests accession-specific patterns of subfamily expansion and contraction. This work illustrates the value of multiple de novo assemblies and the strength of comparative genomics in exploring and characterizing novel genetic variation within a population, and it provides insight into the impact of SVs on genome architecture and on the large gene families underlying important traits.