This readme.txt file was generated on 2023-02-27 by Rafael Della Coletta

Recommended citation for the data:

  Della Coletta, Rafael; Fernandes, Samuel B; Monnahan, Patrick J; Mikel, Mark A; Bohn, Martin O; Lipka, Alexander E; Hirsch, Candice N. (2023). Datasets to test the importance of genetic architecture in marker selection decisions for genomic prediction. Retrieved from the Data Repository for the University of Minnesota, https://doi.org/10.13020/atq4-1b58.

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset: Datasets to test the importance of genetic architecture in marker selection decisions for genomic prediction

2. Author Information

  Author Contact: Candice N Hirsch (cnhirsch@umn.edu)

  Name: Rafael Della Coletta
  Institution: University of Minnesota
  Email: della028@umn.edu
  ORCID: 0000-0001-6988-9598

  Name: Samuel B Fernandes
  Institution: University of Arkansas
  Email: samuelbf@uark.edu
  ORCID: 0000-0001-8269-535X

  Name: Patrick J. Monnahan
  Institution: University of Minnesota
  Email: pmonnaha@umn.edu
  ORCID: 0000-0001-8269-535X

  Name: Mark A Mikel
  Institution: University of Illinois
  Email: mmikel@illinois.edu
  ORCID: -

  Name: Martin O Bohn
  Institution: University of Illinois
  Email: mbohn@illinois.edu
  ORCID: 0000-0003-2364-6229

  Name: Alexander E Lipka
  Institution: University of Illinois
  Email: alipka@illinois.edu
  ORCID: 0000-0003-1571-8528

  Name: Candice N Hirsch
  Institution: University of Minnesota
  Email: cnhirsch@umn.edu
  ORCID: 0000-0002-8833-3023

3. Date published or finalized for release: 2023-02-27

4. Information about funding sources that supported the collection of the data:

  United States Department of Agriculture (2018-67013-27571)
  National Science Foundation (IOS-1546727)
  Minnesota Agricultural Experiment Station

5. Overview of the data (abstract):

  This dataset contains the input files to simulate traits for maize recombinant inbred lines (RILs) and run genomic prediction models with different marker types. Using real genotypic information from 333 maize recombinant inbred lines with single nucleotide polymorphism (SNP) and structural variant (SV) information projected from their seven sequenced parental lines, we simulated traits with different genetic architectures in multiple environments using the R package simplePHENOTYPES. We varied the heritability, the number of quantitative trait loci (QTLs), the type of causative variant (SNPs or SVs), and the variant effect sizes. Weather data from five locations in the U.S. Midwest in 2020 was used to generate a residual correlation matrix among environments. After performing a two-stage analysis with a multivariate GBLUP prediction model for each marker type and genetic architecture, we obtained prediction accuracies using two types of cross-validation (CV1 and CV2). For instructions on how to perform this analysis and for the analysis scripts, please see https://github.com/HirschLabUMN/genomic_prediction_svs

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data: CC0 1.0 Universal (http://creativecommons.org/publicdomain/zero/1.0/)

2. Links to publications that cite or use the data: Forthcoming.

3. Was data derived from another source? If yes, list source(s): No

4. Terms of Use: Data Repository for the U of Minnesota (DRUM). By using these files, users agree to the Terms of Use. https://conservancy.umn.edu/pages/drum/policies/#terms-of-use

---------------------
DATA & FILE OVERVIEW
---------------------

To open tar.gz files on Unix/Linux systems, open Terminal and type the command:

  tar -xvzf supp_file3.tar.gz

To open tar.gz files on Windows, open Command Prompt, find the directory path of the file and type the command cd [directory path]. Press return, then type the command:

  tar -xvzf supp_file3.tar.gz

These files can also be opened on Windows using 7-Zip, a free program. To install 7-Zip, go to http://www.7-zip.org/download.html.
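
For users who would rather script the extraction than use a terminal, the tar.gz step above can also be sketched in Python with the standard-library tarfile module. This is only an illustrative sketch and not part of the dataset's documented workflow; the function name extract_archive and the destination folder are assumptions made for this example.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return the member names.

    Hypothetical helper for illustration; it mirrors `tar -xvzf <file>`.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" opens a gzip-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names
```

For example, calling extract_archive("supp_file3.tar.gz", "supp_file3") would unpack that archive into a folder named supp_file3, assuming the archive is in the current working directory.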
To open vcf.gz files on Unix/Linux systems, open Terminal and type the command:

  gunzip supp_file1.vcf.gz

To open vcf.gz files on Windows, use 7-Zip and open the extracted file in a text editor (Notepad, Sublime, Atom, etc.).

To open the hmp.txt file (which is a very large .txt file), use a program capable of opening larger files, such as LibreOffice.

File List

  Filename: supp_file1.vcf.gz
  Short description: Raw structural variant calls of the maize parental lines in VCF format

  Filename: supp_file2.hmp.txt.gz
  Short description: Filtered genotypic data of recombinant inbred lines (RILs) in hapmap format with projected SNPs and SVs

  Filename: supp_file3.tar.gz
  Short description: Files containing simulated trait values for each RIL across different genetic architectures

  Filename: supp_file4.tar.gz
  Short description: Files containing ANOVA results for each simulated scenario

  Filename: supp_file5.tar.gz
  Short description: Files containing all the marker datasets used for genomic prediction

  Filename: supp_file6.tar.gz
  Short description: Files containing simulated trait values for each RIL across different genetic architectures to understand the relationship between LD and prediction accuracy

  Filename: supp_file7.tar.gz
  Short description: Files containing all the marker datasets used for genomic prediction to understand the relationship between LD and prediction accuracy

  Filename: supp_file8.xlsx
  Short description: Genomic prediction accuracy of different marker types for each replicate of simulated traits where either SNPs or SVs were the causative variants

  Filename: supp_file9.xlsx
  Short description: Genomic prediction accuracy of different marker types for each replicate of simulated traits where both SNPs and SVs were the causative variants

  Filename: supp_file10.xlsx
  Short description: Genomic prediction accuracy of markers with low (r2 < 0.5), moderate (0.5 < r2 < 0.9) and high (r2 > 0.9) linkage disequilibrium (LD) to a QTL for each replicate of simulated traits where both SNPs and SVs were the causative variants

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data: See Methods in the forthcoming publication.

2. Methods for processing the data: See Methods in the forthcoming publication.

3. Instrument- or software-specific information needed to interpret the data: All datasets are readable with a text editor or spreadsheet program (Notepad, Atom, Microsoft Excel, Google Sheets, etc.). For downstream analysis, please refer to https://github.com/HirschLabUMN/genomic_prediction_svs

4. Environmental/experimental conditions: See Methods in the forthcoming publication.

5. Describe any quality-assurance procedures performed on the data: See Methods in the forthcoming publication.

6. People involved with sample collection, processing, analysis and/or submission: Rafael Della Coletta, Samuel B. Fernandes, Patrick J. Monnahan, Mark A. Mikel, Martin O. Bohn, Alexander E. Lipka, Candice N. Hirsch

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file1.vcf.gz
-----------------------------------------

1. Number of variables:
2. Number of cases/rows: 10004
3. Missing data codes: .
4. Variable List:

  A. Name: CHROM
     Description: Chromosome
  B. Name: POS
     Description: Position
  C. Name: ID
     Description: Identifier
  D. Name: REF
     Description: Reference base
  E. Name: ALT
     Description: Alternate base
  F. Name: QUAL
     Description: Quality (Phred scale)
  G. Name: FILTER
     Description: Filter status
  H. Name: INFO
     Description: Additional information encoded as a semicolon-separated series of short keys with optional values in the format <key>=<data>[,data]
  I. Name: FORMAT
     Description: Specifics of the data types and order (colon-separated)
  Remaining columns.
     Names: A188 to W606S
     Description: Marker genotypes of maize inbred lines

For more details about VCF format, see https://samtools.github.io/hts-specs/VCFv4.2.pdf

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file2.hmp.txt.gz
-----------------------------------------

1. Number of variables: 344
2. Number of cases/rows: 9847162
3. Missing data codes: NA for columns A to K; NN for remaining columns
4. Variable List

  A. Name: rs
     Description: Marker ID
  B. Name: alleles
     Description: Possible alleles for marker
  C. Name: chrom
     Description: Chromosome to which the marker was mapped
  D. Name: pos
     Description: Position of the marker on the chromosome
  E. Name: strand
     Description: Orientation of the marker in the DNA strand
  F. Name: assembly
     Description: Version of reference sequence assembly
  G. Name: center
     Description: Name of genotyping center that produced the genotypes
  H. Name: protLSID
     Description: ID for HapMap protocol
  I. Name: assayLSID
     Description: ID for HapMap assay used for genotyping
  J. Name: panel
     Description: ID for panel of individuals genotyped
  K. Name: QCcode
     Description: Quality control ID for all entries
  Remaining columns.
     Names: B73*LH82-B-B-10-1-1-B-B to PHG39*PHG47-B-B-9-1-1-B-B
     Description: Genotypes of projected SNPs and SVs for each maize hybrid

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file3.tar.gz
-----------------------------------------

1. Number of variables: 7
2. Number of cases/rows: 1000
3. Missing data codes: NA
4. Variable List

  A. Name:
     Description: Hybrid name
  B. Name: Trait_1
     Description: Simulated value for hybrid at environment 1
  C. Name: Trait_2
     Description: Simulated value for hybrid at environment 2
  D. Name: Trait_3
     Description: Simulated value for hybrid at environment 3
  E. Name: Trait_4
     Description: Simulated value for hybrid at environment 4
  F. Name: Trait_5
     Description: Simulated value for hybrid at environment 5
  G. Name: Rep
     Description: Replicate number

Simulated values for different genetic architectures are located in different folders.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file4.tar.gz
-----------------------------------------

1. Number of variables: -
2. Number of cases/rows: -
3. Missing data codes: -
4. Variable List: -

This is a plain-text file with details about ANOVA results. Results for different genetic architectures are located in different folders.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file5.tar.gz
-----------------------------------------

1. Number of variables: 344
2. Number of cases/rows: 5476
3. Missing data codes: NA for columns A to K; NN for remaining columns
4. Variable List

  A. Name: rs
     Description: Marker ID
  B. Name: alleles
     Description: Possible alleles for marker
  C. Name: chrom
     Description: Chromosome to which the marker was mapped
  D. Name: pos
     Description: Position of the marker on the chromosome
  E. Name: strand
     Description: Orientation of the marker in the DNA strand
  F. Name: assembly
     Description: Version of reference sequence assembly
  G. Name: center
     Description: Name of genotyping center that produced the genotypes
  H. Name: protLSID
     Description: ID for HapMap protocol
  I. Name: assayLSID
     Description: ID for HapMap assay used for genotyping
  J. Name: panel
     Description: ID for panel of individuals genotyped
  K. Name: QCcode
     Description: Quality control ID for all entries
  Remaining columns.
     Names: B73*LH82-B-B-10-1-1-B-B to PHG39*PHG47-B-B-9-1-1-B-B
     Description: Predictor genotypes of maize hybrids

Different iterations of predictors are located in different folders. Five different sets of predictors ("all_markers", "snp_ld_markers", "snp_markers", "snp_not_ld_markers", "sv_markers") were generated in each iteration.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file6.tar.gz
-----------------------------------------

1. Number of variables: 7
2. Number of cases/rows: 1000
3. Missing data codes: NA
4. Variable List

  A. Name:
     Description: Hybrid name
  B. Name: Trait_1
     Description: Simulated value for hybrid at environment 1
  C. Name: Trait_2
     Description: Simulated value for hybrid at environment 2
  D. Name: Trait_3
     Description: Simulated value for hybrid at environment 3
  E. Name: Trait_4
     Description: Simulated value for hybrid at environment 4
  F. Name: Trait_5
     Description: Simulated value for hybrid at environment 5
  G. Name: Rep
     Description: Replicate number

Simulated values for different genetic architectures are located in different folders.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file7.tar.gz
-----------------------------------------

1. Number of variables: 344
2. Number of cases/rows: 500
3. Missing data codes: NA for columns A to K; NN for remaining columns
4. Variable List

  A. Name: rs
     Description: Marker ID
  B. Name: alleles
     Description: Possible alleles for marker
  C. Name: chrom
     Description: Chromosome to which the marker was mapped
  D. Name: pos
     Description: Position of the marker on the chromosome
  E. Name: strand
     Description: Orientation of the marker in the DNA strand
  F. Name: assembly
     Description: Version of reference sequence assembly
  G. Name: center
     Description: Name of genotyping center that produced the genotypes
  H. Name: protLSID
     Description: ID for HapMap protocol
  I. Name: assayLSID
     Description: ID for HapMap assay used for genotyping
  J. Name: panel
     Description: ID for panel of individuals genotyped
  K. Name: QCcode
     Description: Quality control ID for all entries
  Remaining columns.
     Names: B73*LH82-B-B-10-1-1-B-B to PHG39*PHG47-B-B-9-1-1-B-B
     Description: Predictor genotypes of maize hybrids

Different iterations of predictors are located in different folders.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file8.xlsx
-----------------------------------------

1. Number of variables: 31
2. Number of cases/rows: 1600
3. Missing data codes: NA
4. Variable List

  A. Name: Heritability
     Description: Trait heritability
  B. Name: QTL number
     Description: Number of causative variants
  C. Name: Causative variant type
     Description: Causative variant type
  D. Name: Predictor type
     Description: Predictor type
  E. Name: Simulated population number
     Description: Simulated population number
  F. Name: Prediction iteration number
     Description: Prediction iteration number
  G. Name: Cross-validation
     Description: Cross-validation strategy
  H. Name: Accuracy (average)
     Description: Accuracy (average)
  I. Name: Standard error (average)
     Description: Standard error (average)
  J. Name: Lower CI (average)
     Description: Lower confidence interval (average)
  K. Name: Upper CI (average)
     Description: Upper confidence interval (average)
  L. Name: Accuracy (environment 1)
     Description: Accuracy (environment 1)
  M. Name: Standard error (environment 1)
     Description: Standard error (environment 1)
  N. Name: Lower CI (environment 1)
     Description: Lower confidence interval (environment 1)
  O. Name: Upper CI (environment 1)
     Description: Upper confidence interval (environment 1)
  P. Name: Accuracy (environment 2)
     Description: Accuracy (environment 2)
  Q. Name: Standard error (environment 2)
     Description: Standard error (environment 2)
  R. Name: Lower CI (environment 2)
     Description: Lower confidence interval (environment 2)
  S. Name: Upper CI (environment 2)
     Description: Upper confidence interval (environment 2)
  T. Name: Accuracy (environment 3)
     Description: Accuracy (environment 3)
  U. Name: Standard error (environment 3)
     Description: Standard error (environment 3)
  V. Name: Lower CI (environment 3)
     Description: Lower confidence interval (environment 3)
  W. Name: Upper CI (environment 3)
     Description: Upper confidence interval (environment 3)
  X. Name: Accuracy (environment 4)
     Description: Accuracy (environment 4)
  Y. Name: Standard error (environment 4)
     Description: Standard error (environment 4)
  Z. Name: Lower CI (environment 4)
     Description: Lower confidence interval (environment 4)
  AA. Name: Upper CI (environment 4)
     Description: Upper confidence interval (environment 4)
  AB. Name: Accuracy (environment 5)
     Description: Accuracy (environment 5)
  AC. Name: Standard error (environment 5)
     Description: Standard error (environment 5)
  AD. Name: Lower CI (environment 5)
     Description: Lower confidence interval (environment 5)
  AE. Name: Upper CI (environment 5)
     Description: Upper confidence interval (environment 5)

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file9.xlsx
-----------------------------------------

1. Number of variables: 32
2. Number of cases/rows: 3200
3. Missing data codes: NA
4. Variable List

  A. Name: Heritability
     Description: Trait heritability
  B. Name: QTL number
     Description: Number of causative variants
  C. Name: Proportion SNPs and SVs
     Description: Proportion of SNP and SV causative variants
  D. Name: Relative effect sizes between SNPs and SVs
     Description: Relative effect sizes between SNP and SV causative variants
  E. Name: Predictor type
     Description: Predictor type
  F. Name: Simulated population number
     Description: Simulated population number
  G. Name: Prediction iteration number
     Description: Prediction iteration number
  H. Name: Cross-validation
     Description: Cross-validation strategy
  I. Name: Accuracy (average)
     Description: Accuracy (average)
  J. Name: Standard error (average)
     Description: Standard error (average)
  K. Name: Lower CI (average)
     Description: Lower confidence interval (average)
  L. Name: Upper CI (average)
     Description: Upper confidence interval (average)
  M. Name: Accuracy (environment 1)
     Description: Accuracy (environment 1)
  N. Name: Standard error (environment 1)
     Description: Standard error (environment 1)
  O. Name: Lower CI (environment 1)
     Description: Lower confidence interval (environment 1)
  P. Name: Upper CI (environment 1)
     Description: Upper confidence interval (environment 1)
  Q. Name: Accuracy (environment 2)
     Description: Accuracy (environment 2)
  R. Name: Standard error (environment 2)
     Description: Standard error (environment 2)
  S. Name: Lower CI (environment 2)
     Description: Lower confidence interval (environment 2)
  T. Name: Upper CI (environment 2)
     Description: Upper confidence interval (environment 2)
  U. Name: Accuracy (environment 3)
     Description: Accuracy (environment 3)
  V. Name: Standard error (environment 3)
     Description: Standard error (environment 3)
  W. Name: Lower CI (environment 3)
     Description: Lower confidence interval (environment 3)
  X. Name: Upper CI (environment 3)
     Description: Upper confidence interval (environment 3)
  Y. Name: Accuracy (environment 4)
     Description: Accuracy (environment 4)
  Z. Name: Standard error (environment 4)
     Description: Standard error (environment 4)
  AA. Name: Lower CI (environment 4)
     Description: Lower confidence interval (environment 4)
  AB. Name: Upper CI (environment 4)
     Description: Upper confidence interval (environment 4)
  AC. Name: Accuracy (environment 5)
     Description: Accuracy (environment 5)
  AD. Name: Standard error (environment 5)
     Description: Standard error (environment 5)
  AE. Name: Lower CI (environment 5)
     Description: Lower confidence interval (environment 5)
  AF. Name: Upper CI (environment 5)
     Description: Upper confidence interval (environment 5)

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file10.xlsx
-----------------------------------------

1. Number of variables: 31
2. Number of cases/rows: 1080
3. Missing data codes: NA
4. Variable List

  A. Name: Heritability
     Description: Trait heritability
  B. Name: QTL number
     Description: Number of causative variants
  C. Name: Causative variant type
     Description: Causative variant type
  D. Name: Predictor type
     Description: Predictor type
  E. Name: Simulated population number
     Description: Simulated population number
  F. Name: Prediction iteration number
     Description: Prediction iteration number
  G. Name: Cross-validation
     Description: Cross-validation strategy
  H. Name: Accuracy (average)
     Description: Accuracy (average)
  I. Name: Standard error (average)
     Description: Standard error (average)
  J. Name: Lower CI (average)
     Description: Lower confidence interval (average)
  K. Name: Upper CI (average)
     Description: Upper confidence interval (average)
  L. Name: Accuracy (environment 1)
     Description: Accuracy (environment 1)
  M. Name: Standard error (environment 1)
     Description: Standard error (environment 1)
  N. Name: Lower CI (environment 1)
     Description: Lower confidence interval (environment 1)
  O. Name: Upper CI (environment 1)
     Description: Upper confidence interval (environment 1)
  P. Name: Accuracy (environment 2)
     Description: Accuracy (environment 2)
  Q. Name: Standard error (environment 2)
     Description: Standard error (environment 2)
  R. Name: Lower CI (environment 2)
     Description: Lower confidence interval (environment 2)
  S. Name: Upper CI (environment 2)
     Description: Upper confidence interval (environment 2)
  T. Name: Accuracy (environment 3)
     Description: Accuracy (environment 3)
  U. Name: Standard error (environment 3)
     Description: Standard error (environment 3)
  V. Name: Lower CI (environment 3)
     Description: Lower confidence interval (environment 3)
  W. Name: Upper CI (environment 3)
     Description: Upper confidence interval (environment 3)
  X. Name: Accuracy (environment 4)
     Description: Accuracy (environment 4)
  Y. Name: Standard error (environment 4)
     Description: Standard error (environment 4)
  Z. Name: Lower CI (environment 4)
     Description: Lower confidence interval (environment 4)
  AA. Name: Upper CI (environment 4)
     Description: Upper confidence interval (environment 4)
  AB. Name: Accuracy (environment 5)
     Description: Accuracy (environment 5)
  AC. Name: Standard error (environment 5)
     Description: Standard error (environment 5)
  AD. Name: Lower CI (environment 5)
     Description: Lower confidence interval (environment 5)
  AE. Name: Upper CI (environment 5)
     Description: Upper confidence interval (environment 5)
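
The three accuracy spreadsheets (supp_file8.xlsx, supp_file9.xlsx and supp_file10.xlsx) share one column pattern: a block of descriptive columns followed by four statistics for the average and for each of the five environments. The Python sketch below rebuilds the expected header lists from the variable lists in this readme, which may help when checking a loaded sheet. It assumes the spreadsheet headers match the names listed here exactly; the function and constant names are hypothetical.

```python
# Descriptive columns of supp_file8.xlsx and supp_file10.xlsx (columns A to G).
SINGLE_TYPE_COLUMNS = [
    "Heritability", "QTL number", "Causative variant type", "Predictor type",
    "Simulated population number", "Prediction iteration number",
    "Cross-validation",
]

# Descriptive columns of supp_file9.xlsx (columns A to H).
MIXED_TYPE_COLUMNS = [
    "Heritability", "QTL number", "Proportion SNPs and SVs",
    "Relative effect sizes between SNPs and SVs", "Predictor type",
    "Simulated population number", "Prediction iteration number",
    "Cross-validation",
]

STATISTICS = ["Accuracy", "Standard error", "Lower CI", "Upper CI"]
SCOPES = ["average"] + [f"environment {i}" for i in range(1, 6)]


def accuracy_columns(descriptive_columns):
    """Return descriptive columns followed by every statistic for every scope."""
    stats = [f"{stat} ({scope})" for scope in SCOPES for stat in STATISTICS]
    return list(descriptive_columns) + stats


# 7 + 4 * 6 = 31 variables and 8 + 4 * 6 = 32 variables, as stated in this readme.
print(len(accuracy_columns(SINGLE_TYPE_COLUMNS)))
print(len(accuracy_columns(MIXED_TYPE_COLUMNS)))
```

Comparing such a list against the header row of a loaded sheet is one quick way to confirm that a file was read with the expected layout.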