-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset

Supplemental data from "Comparative Genomics Approaches Accurately Predict Deleterious Variants in Plants"

2. Author Information

  Principal Investigator Contact Information
    Name: Peter L. Morrell
    Institution: University of Minnesota
    Address: Department of Agronomy and Plant Genetics, 411 Borlaug Hall, 1991 Upper Buford Circle, Saint Paul, MN 55108
    Email: pmorrell@umn.edu

  Associate or Co-investigator Contact Information
    Name: Justin C. Fay
    Institution: University of Rochester
    Address: Department of Biology, 402 Hutchison Hall, University of Rochester, P.O. Box 270211, Rochester, NY 14627
    Email:

3. Date of data collection (single date, range, approximate date)

2017-10-01

4. Geographic location of data collection: NA

5. Information about funding sources that supported the collection of the data:

  National Science Foundation Plant Genome Program DBI-1339393
  US Department of Agriculture Biotechnology Risk Assessment Research Grants Program USDA BRAG 2015-06504
  University of Minnesota Doctoral Dissertation Fellowship

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data: Creative Commons BY https://creativecommons.org/licenses/by/3.0/us/

2. Links to publications that cite or use the data: https://doi.org/10.1534/g3.118.200563

3. Links to other publicly accessible locations of the data: https://doi.org/10.25387/g3.6998387

4. Links/relationships to ancillary data sets: NA

5. Was data derived from another source?
If yes, list source(s):

Derived from BLAST searches of gene sequences against Angiosperm genome databases from Phytozome (https://phytozome.jgi.doe.gov/) and Ensembl Plants (http://plants.ensembl.org/).

Mutants with phenotypic effects in Arabidopsis thaliana were found by searching the Arabidopsis Information Resource for genes whose alleles have phenotypic effects and differ by nucleotide substitutions, and by searching Google Scholar. These were merged with an independently maintained database of amino acid-altering mutations in UniProt/SwissProt.

6. Recommended citation for the data:

Comparative Genomics Approaches Accurately Predict Deleterious Variants in Plants
Thomas J. Y. Kono, Li Lei, Ching-Hua Shih, Paul J. Hoffman, Peter L. Morrell and Justin C. Fay
G3: GENES, GENOMES, GENETICS Early online August 23, 2018; https://doi.org/10.1534/g3.118.200563

---------------------
DATA & FILE OVERVIEW
---------------------

1. File List

  A. Filename: Table S2_new.csv
     Short description: A list of 2,617 amino acid-altering mutations in 960 A. thaliana genes. The approach by which each mutation was identified and the results of each deleterious mutation annotation tool are presented.

  B. Filename: multiple_alignment_seq.zip
     Short description: Multiple sequence alignment files (FASTA) for 1,975 genes.

2. Relationship between files: Predictions in Table S2_new.csv were generated using substitution rate values calculated from alignments in multiple_alignment_seq.zip.

3. Additional related data collected that was not included in the current data package: NA

4. Are there multiple versions of the dataset? no

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data:

We curated a set of amino acid-altering mutations with phenotypic impacts. Both morphological and biochemical phenotypes were represented, and mutations were in both single-copy and duplicated genes.
These mutations were obtained from two sources. We generated a manually curated set of 542 amino acid-altering mutations in 155 genes with phenotypic effects that are described in the literature. These mutations were found by searching the Arabidopsis Information Resource (http://www.arabidopsis.org) for genes with either dominant or recessive alleles that differ by nucleotide substitutions. We also identified mutations using a literature search in Google Scholar (http://scholar.google.com). For each variant, we recorded the amino acid substitution, its position, and a link to the published paper (Table S2). We excluded nonsense mutations because they often eliminate gene function entirely.

We identified a second set of 2,617 amino acid-altering mutations in 960 genes from the manually curated UniProt/Swiss-Prot database (http://www.uniprot.org/) (Boutet et al. 2016). The two sets were independently generated and had an overlap of 249 mutants. Using mutants with named alleles as a proxy for those with morphological vs. biochemical phenotypes, 65% of our manually curated set and 33% of the Swiss-Prot set had macroscopic phenotypes.

Duplicated genes were defined as those whose proteins had a significant BLASTP hit (E-value < 0.05) to another A. thaliana protein with 60% identity. By this criterion, 466 of 995 proteins were classified as duplicated.

We used BLAST searches of the A. thaliana gene sequences against 42 Angiosperm genomes, retaining the top hit from each species with a BLAST E-value threshold of 0.05. The homolog searches were restricted to Angiosperm genomes to avoid extensive saturation of synonymous sites. Protein alignments were generated with PASTA (Mirarab et al. 2015), and a likelihood ratio test (LRT) for constraint on each codon of interest was calculated using HyPhy (Pond et al. 2005). Sequences with ‘N’s or other ambiguous nucleotides were discarded prior to the likelihood ratio test.
The LRT differs from its original formulation (Chun and Fay 2009) in that: i) dS was estimated using all codons for each gene separately, ii) query sequences were optionally masked (the entire sequence changed to N = missing) in the likelihood calculation to avoid any reference bias, and iii) branches with dS greater than 3 were set to 3 to avoid spuriously high estimates of dS. Additionally, the original LRT used heuristics to eliminate sites with dN > dS, sites with the derived allele present in another species, or sites with fewer than 10 species in the alignment. Rather than eliminating sites, we used logistic regression to provide a single probability of being deleterious based on the LRT and these additional pieces of information.

2. Methods for processing the data: See filtering criteria and BLAST parameters above.

3. Instrument- or software-specific information needed to interpret the data: R statistical software, or similar.

4. Standards and calibration information, if appropriate: NA

5. Environmental/experimental conditions: NA

6. Describe any quality-assurance procedures performed on the data: NA

7. People involved with sample collection, processing, analysis and/or submission: Li Lei, Ching-Hua Shih, Emily Vonderharr, Thomas J. Y. Kono

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: Table S2_new.csv

1. Number of variables: 39

2. Number of cases/rows: 13,707

3. Missing data codes:

  NA: No information available

4. Variable List

  Name: Protein
  Description: A. thaliana protein ID

  Name: Position
  Description: Amino acid residue number of affected codon

  Name: WT
  Description: "wild-type" amino acid state

  Name: MUT
  Description: "mutant" amino acid state

  Name: Description
  Description: Gene name and annotation

  Name: Allele
  Description: Named allele code

  Name: Reference
  Description: Publication reporting mutant

  Name: GeneName
  Description: Gene symbol

  Name: Class
  Description: Curation source

  Name: Frequency
  Description: Minor allele frequency in a panel of A. thaliana accessions

  Name: Duplicated(1=yes; 0=no)
  Description: Gene is duplicated: 1=yes, 0=no

  Name: Training (1=True positive; 0=True negative)
  Description: Training status for a variant: 1=True positive, 0=True negative

  Name: SiftScore
  Description: Score from SIFT (Ng et al. 2003)

  Name: PolyPhen2
  Description: Score from PolyPhen2 (Adzhubei et al. 2013)

  Name: Provean
  Description: Score from PROVEAN (Choi et al. 2015)

  Name: MAPP
  Description: Score from MAPP (Stone and Sidow 2005)

  Name: Gerp++
  Description: Score from GERP++ (Davydov et al. 2010)

  Name: LRT-logistic
  Description: Score from likelihood ratio test (Chun and Fay 2009)

  Name: LRTm-logistic
  Description: Score from masked likelihood ratio test (Chun and Fay 2009)

  Name: Constraint
  Description: Substitution rate per bp in orthologue alignment

  Name: MaskedConstraint
  Description: Substitution rate in masked orthologue alignment

  Name: Log10(P-value LRT)
  Description: log10(P-value) from LRT

  Name: Log10(P-value LRTmasked)
  Description: log10(P-value) from masked LRT

  Name: Rn
  Description: Number of "reference" amino acids in alignment

  Name: An
  Description: Number of "alternate" amino acids in alignment

  Name: Sift.95.Spec
  Description: Predicted deleterious by SIFT at 95% Specificity: 1=Yes; 0=No

  Name: Sift.95.Sens
  Description: Predicted deleterious by SIFT at 95% Sensitivity: 1=Yes; 0=No

  Name: Poly.95.Spec
  Description: Predicted deleterious by PolyPhen2 at 95% Specificity: 1=Yes; 0=No

  Name: Poly.95.Sens
  Description: Predicted deleterious by PolyPhen2 at 95% Sensitivity: 1=Yes; 0=No

  Name: Prov.95.Spec
  Description: Predicted deleterious by PROVEAN at 95% Specificity: 1=Yes; 0=No

  Name: Prov.95.Sens
  Description: Predicted deleterious by PROVEAN at 95% Sensitivity: 1=Yes; 0=No

  Name: MAPP.95.Spec
  Description: Predicted deleterious by MAPP at 95% Specificity: 1=Yes; 0=No

  Name: MAPP.95.Sens
  Description: Predicted deleterious by MAPP at 95% Sensitivity: 1=Yes; 0=No

  Name: Gerp.95.Spec
  Description: Predicted deleterious by GERP++ at 95% Specificity: 1=Yes; 0=No

  Name: Gerp.95.Sens
  Description: Predicted deleterious by GERP++ at 95% Sensitivity: 1=Yes; 0=No

  Name: LRT.95.Spec
  Description: Predicted deleterious by LRT at 95% Specificity: 1=Yes; 0=No

  Name: LRT.95.Sens
  Description: Predicted deleterious by LRT at 95% Sensitivity: 1=Yes; 0=No

  Name: LRTm.95.Spec
  Description: Predicted deleterious by masked LRT at 95% Specificity: 1=Yes; 0=No

  Name: LRTm.95.Sens
  Description: Predicted deleterious by masked LRT at 95% Sensitivity: 1=Yes; 0=No
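The 0/1 prediction columns can be compared against the Training column to recover a tool's sensitivity and specificity. The following is a minimal Python sketch of that tabulation, not the authors' analysis code. It assumes the CSV header strings match the variable names listed above exactly and that values are written as the plain strings "0", "1", or "NA"; the function name `confusion_rates` and the sample rows (including the protein IDs P1 to P6) are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Assumption: the CSV header matches the variable name given in this README.
TRAINING = "Training (1=True positive; 0=True negative)"


def confusion_rates(rows, call_column, truth_column=TRAINING):
    """Return (sensitivity, specificity) for a 0/1 prediction column.

    Rows where either value is missing ("NA", empty, or anything other
    than "0"/"1") are skipped. A rate is None when it is undefined.
    """
    tp = fn = tn = fp = 0
    for row in rows:
        truth = (row.get(truth_column) or "").strip()
        call = (row.get(call_column) or "").strip()
        if truth not in ("0", "1") or call not in ("0", "1"):
            continue
        if truth == "1":
            if call == "1":
                tp += 1
            else:
                fn += 1
        else:
            if call == "1":
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Hypothetical rows, used only to exercise the function.
SAMPLE = """Protein,Training (1=True positive; 0=True negative),Sift.95.Spec
P1,1,1
P2,1,0
P3,0,0
P4,0,NA
P5,0,1
P6,1,1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
sens, spec = confusion_rates(rows, "Sift.95.Spec")
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

To run this on the real table, replace the in-memory sample with `open("Table S2_new.csv", newline="")` passed to `csv.DictReader`. If the file encodes the flags differently (for example as "1.0"), the membership test in `confusion_rates` would need to be adjusted.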
-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: multiple_alignment_seq.zip

1. Number of variables: NA

2. Number of cases/rows: 1,975 genes

Description: FASTA alignments of orthologous genes used for prediction in this study. The gene ID is coded as the prefix of the filename. The species name of origin for each orthologue is given as the FASTA sequence name. Amino acid sequences were aligned with PASTA, then back-translated to nucleotides, keeping codon structure intact.
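Because the alignments are codon-preserving nucleotide FASTA files, a quick sanity check is to confirm that every sequence in a file has the same length, that the length is a multiple of three, and to flag sequences containing 'N' or other ambiguous nucleotides (which the methods say were discarded before the LRT). The following Python sketch illustrates such a check under stated assumptions: gaps are written as "-", sequence names are unique within a file, and the helper names `parse_fasta` and `check_alignment` as well as the sample sequences are hypothetical.

```python
import re


def parse_fasta(text):
    """Parse FASTA text into a dict of {sequence name: sequence}.

    Sequences may span several lines. Assumes names are unique; a
    repeated name would overwrite the earlier record.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Assumption: unambiguous alignment characters are A, C, G, T and "-" for gaps.
UNAMBIGUOUS = re.compile(r"[ACGTacgt-]*")


def check_alignment(records):
    """Summarize basic properties of one codon alignment."""
    lengths = {len(seq) for seq in records.values()}
    return {
        "n_species": len(records),
        "equal_length": len(lengths) <= 1,
        "codon_aligned": all(length % 3 == 0 for length in lengths),
        "ambiguous": [n for n, s in records.items()
                      if not UNAMBIGUOUS.fullmatch(s)],
    }


# Made-up alignment, used only to exercise the functions.
SAMPLE = """>Species_one
ATGGCT---AAA
>Species_two
ATGGCNGCTAAA
>Species_three
ATGGCTGCT
AAA
"""

report = check_alignment(parse_fasta(SAMPLE))
print(report)
```

The files can be read straight from the archive with the standard `zipfile` module (`ZipFile.namelist()` and `ZipFile.read()`), decoding each member to text before passing it to `parse_fasta`. The internal folder layout and file extensions of the archive are not described here, so any filename handling should be checked against the actual contents.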