This readme.txt file was generated on 2023-10-09 by Rafael Della Coletta and updated on 2023-10-30 by Shannon Farrell.

Recommended citation for the data:
Della Coletta, Rafael; Fernandes, Samuel B; Monnahan, Patrick J; Mikel, Mark A; Bohn, Martin O; Lipka, Alexander E; Hirsch, Candice N. (2023). Datasets to test the importance of genetic architecture in marker selection decisions for genomic prediction. Retrieved from the Data Repository for the University of Minnesota, https://doi.org/10.13020/atq4-1b58.

-----------------
2023-10-30 UPDATE
-----------------

The analysis had to be re-run with different parameters. The updated files have the same data structure as the originals (number of columns, variable names, etc.), but they contain different values and some may have different numbers of rows.

Other changes include:
    - changing the tar.gz files to zip files for easier use;
    - removing the original supp_file9;
    - changing the original supp_file10 to supp_file9.

The original dataset is available at: https://conservancy.umn.edu/handle/11299/252793.1

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset: Datasets to test the importance of genetic architecture in marker selection decisions for genomic prediction

2. Author Information

    Author Contact: Candice N Hirsch (cnhirsch@umn.edu)

    Name: Rafael Della Coletta
    Institution: University of Minnesota
    Email: della028@umn.edu
    ORCID: 0000-0001-6988-9598

    Name: Samuel B Fernandes
    Institution: University of Arkansas
    Email: samuelbf@uark.edu
    ORCID: 0000-0001-8269-535X

    Name: Patrick J. Monnahan
    Institution: University of Minnesota
    Email: pmonnaha@umn.edu
    ORCID: 0000-0001-8269-535X

    Name: Mark A Mikel
    Institution: University of Illinois
    Email: mmikel@illinois.edu
    ORCID: -

    Name: Martin O Bohn
    Institution: University of Illinois
    Email: mbohn@illinois.edu
    ORCID: 0000-0003-2364-6229

    Name: Alexander E Lipka
    Institution: University of Illinois
    Email: alipka@illinois.edu
    ORCID: 0000-0003-1571-8528

    Name: Candice N Hirsch
    Institution: University of Minnesota
    Email: cnhirsch@umn.edu
    ORCID: 0000-0002-8833-3023

3. Date published or finalized for release: 2023-10-09

4. Information about funding sources that supported the collection of the data:
    United States Department of Agriculture (2018-67013-27571)
    National Science Foundation (IOS-1546727)
    Minnesota Agricultural Experiment Station

5. Overview of the data (abstract):
This dataset contains the input files to simulate traits for maize recombinant inbred lines (RILs) and run genomic prediction models with different marker types. Using real genotypic information from 333 maize recombinant inbred lines with single nucleotide polymorphism (SNP) and structural variant (SV) information projected from their seven sequenced parental lines, we simulated traits with different genetic architectures in multiple environments using the R package simplePHENOTYPES. We varied the heritability, the number of quantitative trait loci (QTLs), the type of causative variant (SNPs or SVs), and the variant effect sizes. Weather data from five locations in the U.S. Midwest in 2020 were used to generate a residual correlation matrix among environments. After performing a two-stage analysis with a multivariate GBLUP prediction model for each marker type and genetic architecture, we obtained prediction accuracies using two types of cross-validation (CV1 and CV2).
For instructions on how to perform this analysis, and for the analysis scripts, please see https://github.com/HirschLabUMN/genomic_prediction_svs

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data: CC0 1.0 Universal (http://creativecommons.org/publicdomain/zero/1.0/)

2. Links to publications that cite or use the data:
Della Coletta, R., Fernandes, S.B., Monnahan, P.J. et al. (2023). Importance of genetic architecture in marker selection decisions for genomic prediction. Theoretical and Applied Genetics 136, 220. https://doi.org/10.1007/s00122-023-04469-w

3. Was data derived from another source? If yes, list source(s): No

4. Terms of Use: Data Repository for the U of Minnesota (DRUM)
By using these files, users agree to the Terms of Use. https://conservancy.umn.edu/pages/drum/policies/#terms-of-use

---------------------
DATA & FILE OVERVIEW
---------------------

To open vcf.gz files on Unix/Linux systems, open Terminal and type the command:

    gunzip supp_file1.vcf.gz

To open vcf.gz files on Windows, extract them with 7-Zip and open the extracted file in a text editor (Notepad, Sublime, Atom, etc.).

To open the hmp.txt file (a very large text file), use a program capable of opening large files, such as LibreOffice.
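
As an alternative to decompressing the files on disk, the gzipped files can be read as a stream. The following is a minimal Python sketch, not part of the original analysis scripts; the function names preview_gz and count_vcf_records are hypothetical helpers introduced here for illustration, and the sketch assumes the files are plain text with VCF header lines starting with "#".

```python
import gzip
from itertools import islice


def preview_gz(path, n=5):
    """Return the first n lines of a gzip-compressed text file.

    The file is read as a stream, so it is never fully decompressed
    on disk or loaded into memory.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


def count_vcf_records(path):
    """Count data rows in a gzipped VCF.

    Assumption: header lines start with '#' (as in the VCF format)
    and blank lines are not records.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(
            1 for line in handle
            if line.strip() and not line.startswith("#")
        )


# Example usage (assumes the file is in the working directory):
#   preview_gz("supp_file1.vcf.gz", n=3)
#   count_vcf_records("supp_file1.vcf.gz")
```

The same preview_gz helper should also work for supp_file2.hmp.txt.gz, which may be more practical than opening the very large decompressed file in a desktop program.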

File List

    Filename: supp_file1.vcf.gz
    Short description: Raw structural variant calls of the maize parental lines in VCF format

    Filename: supp_file2.hmp.txt.gz
    Short description: Filtered genotypic data of recombinant inbred lines (RILs) in hapmap format with projected SNPs and SVs

    Filename: supp_file3.zip
    Short description: Files containing simulated trait values for each RIL across different genetic architectures

    Filename: supp_file4.zip
    Short description: Files containing ANOVA results for each simulated scenario

    Filename: supp_file5.zip
    Short description: Files containing all the marker datasets used for genomic prediction

    Filename: supp_file6.zip
    Short description: Files containing simulated trait values for each RIL across different genetic architectures to understand the relationship between LD and prediction accuracy

    Filename: supp_file7.zip
    Short description: Files containing all the marker datasets used for genomic prediction to understand the relationship between LD and prediction accuracy

    Filename: supp_file8.xlsx
    Short description: Genomic prediction accuracy of different marker types for each replicate of simulated traits where either SNPs or SVs were the causative variants

    Filename: supp_file9.xlsx
    Short description: Genomic prediction accuracy of markers with low (r2 < 0.5), moderate (0.5 < r2 < 0.9) and high (r2 > 0.9) linkage disequilibrium (LD) to a QTL for each replicate of simulated traits where both SNPs and SVs were the causative variants

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data:
The population of 333 F7 RILs, which has been previously described (Della Coletta et al. 2023), was generated from half diallel crosses of six maize inbred lines including B73, PHG39, PHG47, PH207, PHG35, and LH82.
The parental lines and the 333 F7 RILs were previously genotyped with a custom Illumina Infinium 20K SNP chip (available at https://hdl.handle.net/11299/250568). The parental lines have also been previously SNP genotyped using whole genome resequencing data (available at https://doi.org/10.1093/g3journal/jkab238), and structural variants were also called from this dataset. See Methods in https://link.springer.com/article/10.1007/s00122-023-04469-w for more information.

2. Methods for processing the data:
Structural variants were called using Lumpy v0.2.13 and SVtools v0.5.1. The ~3.1 million SNPs and ~10,000 SVs from the deep parental information were projected onto the 333 RILs with TASSEL v5.2.56, using the 20,000 SNP chip markers to define haplotype blocks. Traits were simulated using the R package simplePHENOTYPES v1.3. Genomic prediction models were run in two stages using ASReml-R v4.1. See Methods in https://link.springer.com/article/10.1007/s00122-023-04469-w for more information.

3. Instrument- or software-specific information needed to interpret the data:
All datasets are readable with a text editor or spreadsheet program (Notepad, Atom, Microsoft Excel, Google Sheets, etc.). For downstream analysis, please refer to https://github.com/HirschLabUMN/genomic_prediction_svs.

4. Environmental/experimental conditions:
Weather data from five locations in the U.S. Midwest (Iowa City, IA; Bloomington, IL; Champaign, IL; Janesville, WI; and Saint Paul, MN) from April 2020 to October 2020 were obtained using the R package EnvRtype v1.0.

5. Describe any quality-assurance procedures performed on the data:
The correct version of the SNP chip data was confirmed using custom scripts. SNPs with low quality, segregation distortion, or overlap with deletions in the genome were removed. A sliding window approach was used to correct sequencing errors. See Methods in https://link.springer.com/article/10.1007/s00122-023-04469-w for more information.

6. People involved with sample collection, processing, analysis and/or submission:
Rafael Della Coletta, Samuel B. Fernandes, Patrick J. Monnahan, Mark A. Mikel, Martin O. Bohn, Alexander E. Lipka, Candice N. Hirsch

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file1.vcf.gz
-----------------------------------------

1. Number of variables: 109

2. Number of cases/rows: 10004

3. Missing data codes: .

4. Variable List:

    A. Name: CHROM
       Description: Chromosome

    B. Name: POS
       Description: Position

    C. Name: ID
       Description: Identifier

    D. Name: REF
       Description: Reference base

    E. Name: ALT
       Description: Alternate base

    F. Name: QUAL
       Description: Quality (Phred scale)

    G. Name: FILTER
       Description: Filter status

    H. Name: INFO
       Description: Additional information encoded as a semicolon-separated series of short keys with optional values in the format <key>=<data>[,data]

    I. Name: FORMAT
       Description: Specifics of the data types and order (colon-separated)

    Remaining columns.
       Names: A188 to W606S
       Description: Marker genotypes of maize inbred lines

For more details about VCF format, see https://samtools.github.io/hts-specs/VCFv4.2.pdf

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file2.hmp.txt.gz
-----------------------------------------

1. Number of variables: 344

2. Number of cases/rows: 3131610

3. Missing data codes: NA for columns A to K; NN for remaining columns

4. Variable List

    A. Name: rs
       Description: Marker ID

    B. Name: alleles
       Description: Possible alleles for marker

    C. Name: chrom
       Description: Chromosome to which the marker was mapped

    D. Name: pos
       Description: Position of the marker on the chromosome

    E. Name: strand
       Description: Orientation of the marker in the DNA strand

    F. Name: assembly
       Description: Version of reference sequence assembly

    G. Name: center
       Description: Name of genotyping center that produced the genotypes

    H. Name: protLSID
       Description: ID for HapMap protocol

    I. Name: assayLSID
       Description: ID for HapMap assay used for genotyping

    J. Name: panel
       Description: ID for panel of individuals genotyped

    K. Name: QCcode
       Description: Quality control ID for all entries

    Remaining columns.
       Names: B73*LH82-B-B-10-1-1-B-B to PHG39*PHG47-B-B-9-1-1-B-B
       Description: Genotypes of projected SNPs and SVs for each maize hybrid

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file3.zip
-----------------------------------------

1. Number of variables: 7

2. Number of cases/rows: 999

3. Missing data codes: NA

4. Variable List

    A. Name:
       Description: Hybrid name

    B. Name: Trait_1
       Description: Simulated value for hybrid at environment 1

    C. Name: Trait_2
       Description: Simulated value for hybrid at environment 2

    D. Name: Trait_3
       Description: Simulated value for hybrid at environment 3

    E. Name: Trait_4
       Description: Simulated value for hybrid at environment 4

    F. Name: Trait_5
       Description: Simulated value for hybrid at environment 5

    G. Name: Rep
       Description: Replicate number

Simulated values for different genetic architectures are located in different folders.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file4.zip
-----------------------------------------

1. Number of variables: -

2. Number of cases/rows: -

3. Missing data codes: -

4. Variable List: -

These are plain-text files with details about ANOVA results. Results for different genetic architectures are located in different folders.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file5.zip
-----------------------------------------

1. Number of variables: 344

2. Number of cases/rows: 7892

3. Missing data codes: NA for columns A to K; NN for remaining columns

4. Variable List

    A. Name: rs
       Description: Marker ID

    B. Name: alleles
       Description: Possible alleles for marker

    C. Name: chrom
       Description: Chromosome to which the marker was mapped

    D. Name: pos
       Description: Position of the marker on the chromosome

    E. Name: strand
       Description: Orientation of the marker in the DNA strand

    F. Name: assembly
       Description: Version of reference sequence assembly

    G. Name: center
       Description: Name of genotyping center that produced the genotypes

    H. Name: protLSID
       Description: ID for HapMap protocol

    I. Name: assayLSID
       Description: ID for HapMap assay used for genotyping

    J. Name: panel
       Description: ID for panel of individuals genotyped

    K. Name: QCcode
       Description: Quality control ID for all entries

    Remaining columns.
       Names: B73*LH82-B-B-10-1-1-B-B to PHG39*PHG47-B-B-9-1-1-B-B
       Description: Predictor genotypes of maize hybrids

Different iterations of predictors are located in different folders. Five different sets of predictors ("all_markers", "snp_ld_markers", "snp_markers", "snp_not_ld_markers", "sv_markers") were generated in each iteration.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file6.zip
-----------------------------------------

1. Number of variables: 7

2. Number of cases/rows: 999

3. Missing data codes: NA

4. Variable List

    A. Name:
       Description: Hybrid name

    B. Name: Trait_1
       Description: Simulated value for hybrid at environment 1

    C. Name: Trait_2
       Description: Simulated value for hybrid at environment 2

    D. Name: Trait_3
       Description: Simulated value for hybrid at environment 3

    E. Name: Trait_4
       Description: Simulated value for hybrid at environment 4

    F. Name: Trait_5
       Description: Simulated value for hybrid at environment 5

    G. Name: Rep
       Description: Replicate number

Simulated values for different genetic architectures are located in different folders.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file7.zip
-----------------------------------------

1. Number of variables: 344

2. Number of cases/rows: 500

3. Missing data codes: NA for columns A to K; NN for remaining columns

4. Variable List

    A. Name: rs
       Description: Marker ID

    B. Name: alleles
       Description: Possible alleles for marker

    C. Name: chrom
       Description: Chromosome to which the marker was mapped

    D. Name: pos
       Description: Position of the marker on the chromosome

    E. Name: strand
       Description: Orientation of the marker in the DNA strand

    F. Name: assembly
       Description: Version of reference sequence assembly

    G. Name: center
       Description: Name of genotyping center that produced the genotypes

    H. Name: protLSID
       Description: ID for HapMap protocol

    I. Name: assayLSID
       Description: ID for HapMap assay used for genotyping

    J. Name: panel
       Description: ID for panel of individuals genotyped

    K. Name: QCcode
       Description: Quality control ID for all entries

    Remaining columns.
       Names: B73*LH82-B-B-10-1-1-B-B to PHG39*PHG47-B-B-9-1-1-B-B
       Description: Predictor genotypes of maize hybrids

Different iterations of predictors are located in different folders.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file8.zip
-----------------------------------------

1. Number of variables: 31

2. Number of cases/rows: 800

3. Missing data codes: NA

4. Variable List

    A. Name: Heritability
       Description: Trait heritability

    B. Name: QTL number
       Description: Number of causative variants

    C. Name: Causative variant type
       Description: Causative variant type

    D. Name: Predictor type
       Description: Predictor type

    E. Name: Simulated population number
       Description: Simulated population number

    F. Name: Prediction iteration number
       Description: Prediction iteration number

    G. Name: Cross-validation
       Description: Cross-validation strategy

    H. Name: Accuracy (average)
       Description: Accuracy (average)

    I. Name: Standard error (average)
       Description: Standard error (average)

    J. Name: Lower CI (average)
       Description: Lower confidence interval (average)

    K. Name: Upper CI (average)
       Description: Upper confidence interval (average)

    L. Name: Accuracy (environment 1)
       Description: Accuracy (environment 1)

    M. Name: Standard error (environment 1)
       Description: Standard error (environment 1)

    N. Name: Lower CI (environment 1)
       Description: Lower confidence interval (environment 1)

    O. Name: Upper CI (environment 1)
       Description: Upper confidence interval (environment 1)

    P. Name: Accuracy (environment 2)
       Description: Accuracy (environment 2)

    Q. Name: Standard error (environment 2)
       Description: Standard error (environment 2)

    R. Name: Lower CI (environment 2)
       Description: Lower confidence interval (environment 2)

    S. Name: Upper CI (environment 2)
       Description: Upper confidence interval (environment 2)

    T. Name: Accuracy (environment 3)
       Description: Accuracy (environment 3)

    U. Name: Standard error (environment 3)
       Description: Standard error (environment 3)

    V. Name: Lower CI (environment 3)
       Description: Lower confidence interval (environment 3)

    W. Name: Upper CI (environment 3)
       Description: Upper confidence interval (environment 3)

    X. Name: Accuracy (environment 4)
       Description: Accuracy (environment 4)

    Y. Name: Standard error (environment 4)
       Description: Standard error (environment 4)

    Z. Name: Lower CI (environment 4)
       Description: Lower confidence interval (environment 4)

    AA. Name: Upper CI (environment 4)
        Description: Upper confidence interval (environment 4)

    AB. Name: Accuracy (environment 5)
        Description: Accuracy (environment 5)

    AC. Name: Standard error (environment 5)
        Description: Standard error (environment 5)

    AD. Name: Lower CI (environment 5)
        Description: Lower confidence interval (environment 5)

    AE. Name: Upper CI (environment 5)
        Description: Upper confidence interval (environment 5)

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file9.zip
-----------------------------------------

1. Number of variables: 31

2. Number of cases/rows: 540

3. Missing data codes: NA

4. Variable List

    A. Name: Heritability
       Description: Trait heritability

    B. Name: QTL number
       Description: Number of causative variants

    C. Name: Causative variant type
       Description: Causative variant type

    D. Name: Predictor type
       Description: Predictor type

    E. Name: Simulated population number
       Description: Simulated population number

    F. Name: Prediction iteration number
       Description: Prediction iteration number

    G. Name: Cross-validation
       Description: Cross-validation strategy

    H. Name: Accuracy (average)
       Description: Accuracy (average)

    I. Name: Standard error (average)
       Description: Standard error (average)

    J. Name: Lower CI (average)
       Description: Lower confidence interval (average)

    K. Name: Upper CI (average)
       Description: Upper confidence interval (average)

    L. Name: Accuracy (environment 1)
       Description: Accuracy (environment 1)

    M. Name: Standard error (environment 1)
       Description: Standard error (environment 1)

    N. Name: Lower CI (environment 1)
       Description: Lower confidence interval (environment 1)

    O. Name: Upper CI (environment 1)
       Description: Upper confidence interval (environment 1)

    P. Name: Accuracy (environment 2)
       Description: Accuracy (environment 2)

    Q. Name: Standard error (environment 2)
       Description: Standard error (environment 2)

    R. Name: Lower CI (environment 2)
       Description: Lower confidence interval (environment 2)

    S. Name: Upper CI (environment 2)
       Description: Upper confidence interval (environment 2)

    T. Name: Accuracy (environment 3)
       Description: Accuracy (environment 3)

    U. Name: Standard error (environment 3)
       Description: Standard error (environment 3)

    V. Name: Lower CI (environment 3)
       Description: Lower confidence interval (environment 3)

    W. Name: Upper CI (environment 3)
       Description: Upper confidence interval (environment 3)

    X. Name: Accuracy (environment 4)
       Description: Accuracy (environment 4)

    Y. Name: Standard error (environment 4)
       Description: Standard error (environment 4)

    Z. Name: Lower CI (environment 4)
       Description: Lower confidence interval (environment 4)

    AA. Name: Upper CI (environment 4)
        Description: Upper confidence interval (environment 4)

    AB. Name: Accuracy (environment 5)
        Description: Accuracy (environment 5)

    AC. Name: Standard error (environment 5)
        Description: Standard error (environment 5)

    AD. Name: Lower CI (environment 5)
        Description: Lower confidence interval (environment 5)

    AE. Name: Upper CI (environment 5)
        Description: Upper confidence interval (environment 5)
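
The hapmap column layout and missing data codes documented above for supp_file2, supp_file5 and supp_file7 (11 marker description columns, A to K, followed by one genotype column per line; NA and NN as missing codes) can be applied when parsing a row. The following is a minimal Python sketch, not part of the original analysis scripts. It assumes the files are tab-delimited, and the names split_hapmap_row and missing_rate are hypothetical helpers introduced here for illustration.

```python
# Columns A to K (rs ... QCcode) describe the marker; the rest are genotypes.
HAPMAP_META_COLUMNS = 11


def split_hapmap_row(line):
    """Split one hapmap data row into marker metadata and genotype calls.

    Assumption: fields are tab-delimited. Missing codes follow this
    readme: 'NA' in columns A to K and 'NN' in the genotype columns.
    Missing values are returned as None.
    """
    fields = line.rstrip("\n").split("\t")
    meta = [None if value == "NA" else value
            for value in fields[:HAPMAP_META_COLUMNS]]
    genotypes = [None if value == "NN" else value
                 for value in fields[HAPMAP_META_COLUMNS:]]
    return meta, genotypes


def missing_rate(genotypes):
    """Return the fraction of missing genotype calls for one marker."""
    if not genotypes:
        return 0.0
    return sum(call is None for call in genotypes) / len(genotypes)


# Example usage with a made-up row (marker name and calls are illustrative):
#   row = "\t".join(["snp1", "A/T", "1", "100", "+"] + ["NA"] * 6
#                   + ["AA", "TT", "NN", "AA"])
#   meta, genotypes = split_hapmap_row(row)
#   missing_rate(genotypes)
```

With 344 variables per file, a row split this way should yield 11 metadata fields and 333 genotype calls, which can serve as a quick sanity check when reading the files.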