-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset

   Supplemental data from "The fate of deleterious variants in a barley genomic prediction population"

2. Author Information

   Principal Investigator Contact Information
      Name: Peter L. Morrell
      Institution: University of Minnesota
      Address: Department of Agronomy and Plant Genetics, 411 Borlaug Hall, 1991 Upper Buford Circle, Saint Paul, MN 55108
      Email: pmorrell@umn.edu

   Associate or Co-investigator Contact Information
      Name: Kevin P. Smith
      Institution: University of Minnesota
      Address: Department of Agronomy and Plant Genetics, 411 Borlaug Hall, 1991 Upper Buford Circle, Saint Paul, MN 55108
      Email: smith376@umn.edu

3. Date of data collection (single date, range, approximate date)

   Genotyping Data: 2006-2009
   Phenotyping Data: 2011-2014
   Resequencing Data: 2014

4. Geographic location of data collection: Minnesota, USA

5. Information about funding sources that supported the collection of the data:

   National Science Foundation Plant Genome Program DBI-1339393
   Minnesota Agricultural Experiment Station Variety Development Fund
   University of Minnesota Doctoral Dissertation Fellowship

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data: Creative Commons BY

2. Links to publications that cite or use the data:

3. Links to other publicly accessible locations of the data:

4. Links/relationships to ancillary data sets: NA

5. Was data derived from another source? If yes, list source(s): Yes

   Genotypes from Illumina assays were collected as part of the Coordinated Agricultural Project (CAP) for barley. Full datasets from the barley CAP are available at The Triticeae Toolbox (T3, https://triticeaetoolbox.org/barley/).

6. Recommended citation for the data:

   Morrell, Peter L; Smith, Kevin P; Vonderharr, Emily E; Kono, Thomas John Y; Fay, Justin C; Koenig, Daniel. (2019).
   Supporting data for The Fate of Deleterious Variants in a Barley Genomic Prediction Population. Retrieved from the Data Repository for the University of Minnesota, https://doi.org/10.13020/d6w990.

---------------------
DATA & FILE OVERVIEW
---------------------

1. File List

   A. Filename: 50x_Capture.bed.gz
      Short description: BED file that describes regions covered by the liquid-phase exome capture probeset designed by Mascher et al. (2013). Reads from ~241x coverage exome capture of Morex (SRA ERR271711) were aligned to the draft Morex assembly. Regions covered by at least 50 reads were considered to be covered by the exome capture probes. Intervals separated by 50 bp or less were merged into a single interval. The final interval size is 88,450,766 bp.

   B. Filename: ALCHEMY_Raw_Data_and_Calls.tar.bz2
      Short description: Raw intensities and genotype probabilities for a subset of the CAP lines that contain the founders of the experimental population. Probabilities were generated with ALCHEMY, with a prior on inbreeding of 0.99 for founders and 0.75 for progeny.

   C. Filename: Adjusted_Phenotypic_Data_800Lines.csv
      Short description: Spatially adjusted best linear unbiased estimates (BLUEs) for yield (kg/ha), DON concentration (ppm), and plant height (cm). Estimates are based on data from an augmented block design at five year-locations, and were spatially adjusted using a moving grid average.

   D. Filename: GP_AP.{bed/bim/fam}
      Short description: Genotypes for exome capture variants imputed onto progeny with AlphaPeel. Files are in PLINK format, and the bed/bim/fam files must be kept together.

   E. Filename: BOPA_cMMb.txt.gz
      Short description: LOWESS-smoothed recombination rate estimates across the Morex draft assembly. SNP markers used for the smoothing were the 2,663 markers with unambiguous physical positions based on BLAST searches of the probe sequences to the draft assembly.
      Smoothing was done in windows of 2% of the markers, and windows that were less than 3 Mb apart were collapsed.

   F. Filename: DON_LMM.assoc.txt.gz
      Short description: SNP-by-SNP regression coefficients and P-values for the linear mixed model association implemented in GEMMA. The trait used for association was DON concentration.

   G. Filename: ExomeCaptureTargets_per_Mb.txt.gz
      Short description: The number of exome capture targets in non-overlapping 1 Mb windows.

   H. Filename: GATK_Capture_WithID.vcf.gz
      Short description: The VCF of variants identified by exome capture resequencing of the 21 founder lines. Reads were aligned with BWA with tuned parameters, and genotypes were called with GATK HaplotypeCaller with a prior on "heterozygosity" set to 0.008.

   I. Filename: GP_BOPA_Physical.{bed/bim/fam}
      Short description: PLINK files that contain the 377 SNPs genotyped on the fixed SNP platform. Data were collected for all parental lines and all individuals in each cycle of breeding. Genotypes were called with ALCHEMY, physical positions were derived from BLAST searches against the draft assembly, and genetic positions were taken from the consensus map of Munoz et al. (2011). Missing physical or genetic positions were filled in by linear interpolation. SNPs with >20% missing data and Mendelian errors were excluded.

   J. Filename: GP_ExomeCap_Functional_Annotation.gz
      Short description: Functional annotation for each variant identified in exome capture resequencing of the founder lines. Functional annotation information includes the position, transcript ID (if coding), and deleterious predictions (if nonsynonymous).

   K. Filename: Height_LMM.assoc.txt.gz
      Short description: SNP-by-SNP regression coefficients and P-values for the linear mixed model association implemented in GEMMA. The trait used for association was plant height.

   L. Filename: Representative_Transcript_IDs.txt.gz
      Short description: Transcript IDs of the representative transcripts from each barley gene.
      The identifiers are Ensembl IDs.

   M. Filename: Representative_Transcripts.gtf.gz
      Short description: Position and strand annotation in a GTF for the representative transcripts.

   N. Filename: T3_Full_Pedigree.csv
      Short description: Pedigree information for each family in this population.

   O. Filename: Yield_LMM.assoc.txt.gz
      Short description: SNP-by-SNP regression coefficients and P-values for the linear mixed model association implemented in GEMMA. The trait used for association was grain yield.

2. Relationship between files:

   Genotypes in GP_BOPA_Physical.{bed/bim/fam} were derived from calls given by ALCHEMY in ALCHEMY_Raw_Data_and_Calls.tar.bz2.
   The pedigree given in T3_Full_Pedigree.csv and the exome capture variants in GATK_Capture_WithID.vcf.gz were used to generate the imputed genotypes in AlphaPeel_Imputed_Merged.{bed/bim/fam}.
   The variants in GATK_Capture_WithID.vcf.gz were used with Representative_Transcripts.gtf.gz to generate the information in GP_ExomeCap_Functional_Annotation.gz.
   Individuals with imputed genotypes in GP_AP.{bed/bim/fam} were subset to those that have phenotype information in Adjusted_Phenotypic_Data_800Lines.csv. The subset genotypes were used to calculate associations in Yield_LMM.assoc.txt.gz, DON_LMM.assoc.txt.gz, and Height_LMM.assoc.txt.gz.
   The intervals in 50x_Capture.bed.gz were used to speed up the variant calling for GATK_Capture_WithID.vcf.gz.

3. Additional related data collected that was not included in the current data package: NA

4. Are there multiple versions of the dataset? No
   If yes, list versions:
      Name of file that was updated:
         i. Why was the file updated?
         ii. When was the file updated?
      Name of file that was updated:
         i. Why was the file updated?
         ii. When was the file updated?

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data:

2. Methods for processing the data:

3. Instrument- or software-specific information needed to interpret the data:

   R statistical software, or similar.
   VCF handling programs, such as VCFtools or vcflib.

4. Standards and calibration information, if appropriate: NA

5. Environmental/experimental conditions:

   Phenotypic data were collected at Crookston, Morris, and Saint Paul field stations from 2011-2014. Yield and DON concentration trials were performed in separate nurseries.

6. Describe any quality-assurance procedures performed on the data:

   Genotypes were checked for Mendelian inheritance inconsistencies. Markers with high (>20%) missing data were removed.

7. People involved with sample collection, processing, analysis and/or submission:

   Thomas J. Y. Kono, Chaochih Liu, Emily E. Vonderharr, Kevin P. Smith, and Peter L. Morrell.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: 50x_Capture.bed.gz

1. Number of variables: NA
2. Number of cases/rows: 133,367
3. Missing data codes: None
4. Variable list: None. BED files list genomic intervals.

DATA-SPECIFIC INFORMATION FOR: ALCHEMY_Raw_Data_and_Calls.tar.bz2

1. Number of variables: NA
2. Number of cases/rows: NA
3. Missing data codes: NA=missing
4. Variable list: None.

DATA-SPECIFIC INFORMATION FOR: Adjusted_Phenotypic_Data_800Lines.csv

1. Number of variables: 9
2. Number of cases/rows: 888
3. Missing data codes: NA=missing
4. Variable list:
   Name: line
      Description: Individual ID
   Name: cycle
      Description: Breeding cycle
   Name: type
      Description: ran=Random panel; sel=Selected panel
   Name: cat
      Description: Combination of 'cycle' and 'type'
   Name: dataset
      Description: Dataset of origin for the phenotype data. Not used in this study.
   Name: prog
      Description: Breeding program. Not used in this study.
   Name: yld_kg
      Description: Yield BLUE, in kg/ha
   Name: DON
      Description: DON concentration, in ppm
   Name: height
      Description: Plant height, in cm

DATA-SPECIFIC INFORMATION FOR: GP_AP.{bed/bim/fam}

1. Number of variables: NA
2. Number of cases/rows: 5,264
3. Missing data codes: 0=missing for pedigrees, -9=missing for genotypes
4. Variable list: NA

DATA-SPECIFIC INFORMATION FOR: BOPA_cMMb.txt.gz

1. Number of variables: 4
2. Number of cases/rows: 2,553
3. Missing data codes: NA=missing
4. Variable list:
   Name: Chromosome
      Description: Chromosome identifier
   Name: LeftBP
      Description: Left coordinate of the marker window
   Name: RightBP
      Description: Right coordinate of the marker window
   Name: cMMb
      Description: LOWESS-smoothed value of recombination rate in the window

DATA-SPECIFIC INFORMATION FOR: DON_LMM.assoc.txt.gz, Yield_LMM.assoc.txt.gz, Height_LMM.assoc.txt.gz

1. Number of variables: 15
2. Number of cases/rows: 419,957
3. Missing data codes: NA=missing
4. Variable list:
   Name: chr
      Description: Chromosome
   Name: rs
      Description: SNP identifier
   Name: ps
      Description: Position
   Name: n_miss
      Description: Number of missing genotypes in panel
   Name: allele1
      Description: Minor allele
   Name: allele0
      Description: Major allele
   Name: af
      Description: Allele frequency
   Name: beta
      Description: Beta coefficient estimate in linear mixed model
   Name: se
      Description: Standard error of beta estimate
   Name: logl_H1
      Description: Log-likelihood of alternate hypothesis
   Name: l_remle
      Description: REML estimate of lambda (ratio of G/E variance components)
   Name: l_mle
      Description: Maximum likelihood estimate of lambda
   Name: p_wald
      Description: Wald test P-values
   Name: p_lrt
      Description: Likelihood ratio test P-values
   Name: p_score
      Description: Marginal z-score test P-values

DATA-SPECIFIC INFORMATION FOR: ExomeCaptureTargets_per_Mb.txt.gz

1. Number of variables: 4
2. Number of cases/rows: 4,839
3. Missing data codes: NA=missing
4. Variable list:
   Name: Chromosome
      Description: Chromosome
   Name: Start
      Description: Start (left) border of window
   Name: End
      Description: End (right) border of window
   Name: NExCap
      Description: Number of exome capture intervals in the window

DATA-SPECIFIC INFORMATION FOR: GATK_Capture_WithID.vcf.gz

1. Number of variables: NA
2. Number of cases/rows: 497,753
3. Missing data codes: .=missing, ./.=missing
4. Variable list: This file is a VCF. The specification is available here: https://samtools.github.io/hts-specs/VCFv4.2.pdf

DATA-SPECIFIC INFORMATION FOR: GP_BOPA_Physical.{bed/bim/fam}

1. Number of variables: NA
2. Number of cases/rows: 5,234
3. Missing data codes: 0=missing for pedigrees, -9=missing for genotypes
4. Variable list: NA

DATA-SPECIFIC INFORMATION FOR: GP_ExomeCap_Functional_Annotation.gz

1. Number of variables: 27
2. Number of cases/rows: 497,753
3. Missing data codes: NA=missing, -=missing
4. Variable list:
   Name: SNP_ID
      Description: SNP name
   Name: Chromosome
      Description: Chromosome
   Name: Position
      Description: BP position
   Name: Silent
      Description: yes=SNP does not alter protein sequence, no=SNP alters protein sequence
   Name: Transcript_ID
      Description: Ensembl identifier of transcript, if SNP is coding
   Name: Codon_Position
      Description: Position in a codon at which a SNP occurs, {1,2,3}
   Name: Ref_Base
      Description: Morex (reference) base
   Name: Alt_Base
      Description: Alternate (mutant) base
   Name: AA1
      Description: Morex (reference) amino acid state, if coding
   Name: AA2
      Description: Alternate (mutant) amino acid state, if coding
   Name: CDS_Pos
      Description: Nucleotide position in coding sequence of transcript, if coding
   Name: Reside_Num
      Description: Amino acid position in coding sequence of transcript, if coding
   Name: PROVEAN
      Description: PROVEAN prediction score, if Silent=no
   Name: PPH2
      Description: PolyPhen2 prediction, if Silent=no
   Name: AlignedPosition
      Description: Nucleotide position in coding sequence alignment of related species, if Silent=no
   Name: L0
      Description: Null likelihood for LRT of deleterious prediction, if Silent=no
   Name: L1
      Description: Alt likelihood for LRT of deleterious prediction, if Silent=no
   Name: Constraint
      Description: Phylogenetic constraint of affected codon, if Silent=no
   Name: Chisquared
      Description:
   Name: P-value
      Description: LRT P-value, if Silent=no
   Name: SeqCount
      Description: Number of sequences in orthologue alignment, if Silent=no
   Name: Alignment
      Description: Alignment string of amino acid residues, if Silent=no
   Name: ReferenceAA
      Description: Reference amino acid state, if Silent=no
   Name: MaskedConstraint
      Description: Phylogenetic constraint of affected codon with Morex masked, if Silent=no
   Name: MaskedP-value
      Description: P-value of LRT with Morex masked, if Silent=no
   Name: LogisticP_Unmasked
      Description: Logistic regression P-value, if Silent=no
   Name: LogisticP_Masked
      Description: Logistic regression P-value with Morex masked, if Silent=no

DATA-SPECIFIC INFORMATION FOR: Representative_Transcript_IDs.txt.gz

1. Number of variables: NA
2. Number of cases/rows: 39,734
3. Missing data codes: None
4. Variable list: NA

DATA-SPECIFIC INFORMATION FOR: Representative_Transcripts.gtf.gz

1. Number of variables: NA
2. Number of cases/rows: NA
3. Missing data codes: None
4. Variable list: This file is an Ensembl-formatted GTF. The specification is here: https://useast.ensembl.org/info/website/upload/gff.html

DATA-SPECIFIC INFORMATION FOR: T3_Full_Pedigree.csv

1. Number of variables: 3
2. Number of cases/rows: 221
3. Missing data codes: NA=missing
4. Variable list: Note: this file does not have a named header row.
   Name: column 1
      Description: Family ID
   Name: column 2
      Description: Maternal parent ID
   Name: column 3
      Description: Paternal parent ID
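
EXAMPLE: READING T3_Full_Pedigree.csv

Because T3_Full_Pedigree.csv has no named header row, column names have to be supplied when the file is read. The following is a minimal Python sketch of one way to do that, not the method used in the study. It assumes the file is comma-delimited (inferred only from the .csv extension) and that NA marks a missing parent, as listed above. The names PedigreeRow and read_pedigree are hypothetical and are not part of the dataset.

```python
import csv
from typing import NamedTuple, Optional


class PedigreeRow(NamedTuple):
    """One row of the pedigree file (hypothetical field names)."""
    family: str              # column 1: Family ID
    maternal: Optional[str]  # column 2: Maternal parent ID
    paternal: Optional[str]  # column 3: Paternal parent ID


def read_pedigree(path: str) -> list[PedigreeRow]:
    """Read a three-column, headerless pedigree CSV.

    Assumes comma-delimited text; the string 'NA' in a parent
    column is converted to None.
    """
    rows = []
    with open(path, newline="") as handle:
        for record in csv.reader(handle):
            if not record:
                continue  # skip blank lines
            # Raises ValueError if a row has fewer than three fields.
            family, maternal, paternal = (f.strip() for f in record[:3])
            rows.append(PedigreeRow(
                family,
                None if maternal == "NA" else maternal,
                None if paternal == "NA" else paternal,
            ))
    return rows
```

Calling read_pedigree("T3_Full_Pedigree.csv") should return one PedigreeRow per family, 221 rows according to the count listed above. Treating the first row as data rather than as a header is the point of the sketch: readers that assume a header would silently drop the first family.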