This readme.txt file was generated on 2023-02-27 by Rafael Della Coletta
Recommended citation for the data:
Della Coletta, Rafael; Fernandes, Samuel B; Monnahan, Patrick J; Mikel, Mark A; Bohn, Martin O; Lipka, Alexander E; Hirsch, Candice N. (2023). Datasets to test the importance of genetic architecture in marker selection decisions for genomic prediction. Retrieved from the Data Repository for the University of Minnesota, https://doi.org/10.13020/atq4-1b58.

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset: Datasets to test the importance of genetic architecture in marker selection decisions for genomic prediction

2. Author Information

	Author Contact:  Candice N Hirsch (cnhirsch@umn.edu)

	Name:  Rafael Della Coletta
	Institution: University of Minnesota
	Email: della028@umn.edu
	ORCID: 0000-0001-6988-9598

	Name:  Samuel B Fernandes
	Institution: University of Arkansas
	Email: samuelbf@uark.edu
	ORCID: 0000-0001-8269-535X

	Name:  Patrick J Monnahan
	Institution: University of Minnesota
	Email: pmonnaha@umn.edu
	ORCID: 0000-0001-8269-535X

	Name:  Mark A Mikel
	Institution: University of Illinois
	Email: mmikel@illinois.edu
	ORCID: -

	Name:  Martin O Bohn
	Institution: University of Illinois
	Email: mbohn@illinois.edu
	ORCID: 0000-0003-2364-6229

	Name:  Alexander E Lipka
	Institution: University of Illinois
	Email: alipka@illinois.edu
	ORCID: 0000-0003-1571-8528

	Name:  Candice N Hirsch
	Institution: University of Minnesota
	Email: cnhirsch@umn.edu
	ORCID: 0000-0002-8833-3023

3. Date published or finalized for release: 2023-02-27

4. Information about funding sources that supported the collection of the data:
   United States Department of Agriculture (2018-67013-27571)
   National Science Foundation (IOS-1546727)
   Minnesota Agricultural Experiment Station

5. Overview of the data (abstract):
   This dataset contains the input files to simulate traits for maize recombinant inbred lines (RILs) and run genomic prediction models with different marker types. Using real genotypic information from 333 maize recombinant inbred lines with single nucleotide polymorphism (SNP) and structural variant (SV) information projected from their seven sequenced parental lines, we simulated traits with different genetic architectures in multiple environments using the R package simplePHENOTYPES. We varied the heritability, the number of quantitative trait loci (QTLs), the type of causative variant (SNPs or SVs), and the variant effect sizes. Weather data from five locations in the U.S. Midwest in 2020 was used to generate a residual correlation matrix among environments. After performing a two-stage analysis with a multivariate GBLUP prediction model for each marker type and genetic architecture, we obtained prediction accuracies using two types of cross-validation (CV1 and CV2). For instructions on how to perform this analysis and for the analysis scripts, please see https://github.com/HirschLabUMN/genomic_prediction_svs


--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data: CC0 1.0 Universal (http://creativecommons.org/publicdomain/zero/1.0/)

2. Links to publications that cite or use the data: Forthcoming.

3. Was data derived from another source?
	If yes, list source(s): No

4. Terms of Use: Data Repository for the U of Minnesota (DRUM). By using these files, users agree to the Terms of Use: https://conservancy.umn.edu/pages/drum/policies/#terms-of-use


---------------------
DATA & FILE OVERVIEW
---------------------

To open tar.gz files on Unix/Linux systems, open Terminal and type the command: tar -xvzf supp_file3.tar.gz

To open tar.gz files on Windows, open Command Prompt, find the directory path of the file and type the command cd [directory path]. Press return, then type the command: tar -xvzf supp_file3.tar.gz

These files can also be opened on Windows using 7-Zip, a free program. To install 7-Zip, go to http://www.7-zip.org/download.html.

To open vcf.gz files on Unix/Linux systems, open Terminal and type the command: gunzip supp_file1.vcf.gz 

To open vcf.gz files on Windows, use 7-Zip and open the extracted file in a text editor (Notepad, Sublime, Atom, etc.).

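To open the hmp.txt file, first decompress supp_file2.hmp.txt.gz in the same way as the vcf.gz file, then use a program capable of opening very large text files. With 9847162 rows, the file exceeds the default row limit of spreadsheet programs such as Microsoft Excel and LibreOffice Calc (1048576 rows), so a large-file text viewer or command-line tools are better suited.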
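
The files can also be inspected programmatically without fully decompressing them to disk. The following is a minimal sketch in Python using only the standard library; the function names peek_gzip_lines and extract_tar_gz are hypothetical helpers introduced here for illustration and are not part of the analysis scripts.

```python
import gzip
import tarfile
from itertools import islice


def peek_gzip_lines(path, n=5):
    """Return the first n lines of a gzip-compressed text file.

    The file is streamed, so a very large file is never loaded whole.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


def extract_tar_gz(path, dest="."):
    """Extract a .tar.gz archive into dest and return the member names."""
    with tarfile.open(path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names
```

For example, peek_gzip_lines("supp_file2.hmp.txt.gz", n=3) would return the header and the first two marker rows, and extract_tar_gz("supp_file3.tar.gz") would unpack the archive into the current directory.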


File List

	Filename: supp_file1.vcf.gz
	Short description: Raw structural variant calls of the maize parental lines in VCF format

	Filename: supp_file2.hmp.txt.gz
	Short description: Filtered genotypic data of recombinant inbred lines (RILs) in hapmap format with projected SNPs and SVs

	Filename: supp_file3.tar.gz
	Short description: Files containing simulated trait values for each RIL across different genetic architectures

	Filename: supp_file4.tar.gz
	Short description: Files containing ANOVA results for each simulated scenario

	Filename: supp_file5.tar.gz
	Short description: Files containing all the marker datasets used for genomic prediction

	Filename: supp_file6.tar.gz
	Short description: Files containing simulated trait values for each RIL across different genetic architectures to understand the relationship between LD and prediction accuracy

	Filename: supp_file7.tar.gz
	Short description: Files containing all the marker datasets used for genomic prediction to understand the relationship between LD and prediction accuracy

	Filename: supp_file8.xlsx
	Short description: Genomic prediction accuracy of different marker types for each replicate of simulated traits where either SNPs or SVs were the causative variants

	Filename: supp_file9.xlsx
	Short description: Genomic prediction accuracy of different marker types for each replicate of simulated traits where both SNPs and SVs were the causative variants

	Filename: supp_file10.xlsx
	Short description: Genomic prediction accuracy of markers with low (r2 < 0.5), moderate (0.5 < r2 < 0.9) and high (r2 > 0.9) linkage disequilibrium (LD) to a QTL for each replicate of simulated traits where both SNPs and SVs were the causative variants
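
The three LD classes used in supp_file10.xlsx can be expressed as a small function. This is only a sketch: the name ld_category is hypothetical, and because the thresholds above are written as strict inequalities, the assignment of values exactly equal to 0.5 or 0.9 to the moderate class is an assumption made here, not something stated by the dataset.

```python
def ld_category(r2):
    """Classify an r2 value into the low / moderate / high LD classes.

    Assumption: r2 values of exactly 0.5 or 0.9 are placed in "moderate",
    since the README does not say how boundary values were handled.
    """
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must be between 0 and 1")
    if r2 < 0.5:
        return "low"
    if r2 <= 0.9:
        return "moderate"
    return "high"
```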


--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data:
   See Methods in forthcoming publication.

2. Methods for processing the data:
   See Methods in forthcoming publication.

3. Instrument- or software-specific information needed to interpret the data:
   All datasets are readable with a text editor (Notepad, Atom, etc.) or a spreadsheet program (Microsoft Excel, Google Sheets, etc.). For downstream analysis, please refer to https://github.com/HirschLabUMN/genomic_prediction_svs

4. Environmental/experimental conditions:
   See Methods in forthcoming publication.

5. Describe any quality-assurance procedures performed on the data:
   See Methods in forthcoming publication.

6. People involved with sample collection, processing, analysis and/or submission:
   Rafael Della Coletta, Samuel B. Fernandes, Patrick J. Monnahan, Mark A. Mikel, Martin O. Bohn, Alexander E. Lipka, Candice N. Hirsch


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file1.vcf.gz
-----------------------------------------

1. Number of variables:

2. Number of cases/rows: 10004

3. Missing data codes: .

4. Variable List:

	A. Name: CHROM
		 Description: Chromosome

	B. Name: POS
		 Description: Position

	C. Name: ID
		 Description: Identifier

	D. Name: REF
		 Description: Reference base

	E. Name: ALT
		 Description: Alternate base

	F. Name: QUAL
		 Description: Quality (Phred scale)

	G. Name: FILTER
		 Description: Filter status

	H. Name: INFO
		 Description: Additional information encoded as a semicolon-separated series of short keys with optional values in the format <key>=<data>[,data]

	I. Name: FORMAT
		 Description: Specifics of the data types and order (colon-separated)

	Remaining columns. Names: A188 to W606S
		 Description: Marker genotypes of maize inbred lines

	For more details about VCF format, see https://samtools.github.io/hts-specs/VCFv4.2.pdf
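
As an illustration of the column layout described above, the following Python sketch splits one tab-delimited VCF data line into its nine fixed fields and the per-sample columns. The function names parse_info and parse_vcf_record are hypothetical, and the sketch assumes the sample names (A188 to W606S) have already been read from the header line.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
                 "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_info(info):
    """Turn a semicolon-separated INFO string into a dict.

    Keys without a value (flags) map to True; "." means no information.
    """
    fields = {}
    if info == ".":
        return fields
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_vcf_record(line, sample_names):
    """Split one VCF data line into (fixed fields, per-sample fields)."""
    cols = line.rstrip("\n").split("\t")
    fixed = dict(zip(FIXED_COLUMNS, cols[:9]))
    fixed["POS"] = int(fixed["POS"])
    fixed["INFO"] = parse_info(fixed["INFO"])
    # FORMAT lists the colon-separated keys used in every sample column
    format_keys = fixed["FORMAT"].split(":")
    samples = {
        name: dict(zip(format_keys, value.split(":")))
        for name, value in zip(sample_names, cols[9:])
    }
    return fixed, samples
```

Values are kept as strings (except POS) so that the missing data code "." is preserved as written in the file.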


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file2.hmp.txt.gz
-----------------------------------------

1. Number of variables: 344

2. Number of cases/rows: 9847162

3. Missing data codes: NA for columns A to K; NN for remaining columns

4. Variable List

	A. Name: rs
	   Description: Marker ID

	B. Name: alleles
	   Description: Possible alleles for marker

	C. Name: chrom
	   Description: Chromosome to which the marker was mapped

	D. Name: pos
	   Description: Position of the marker on the chromosome

	E. Name: strand
	   Description: Orientation of the marker in the DNA strand

	F. Name: assembly
	   Description: Version of reference sequence assembly

	G. Name: center
	   Description: Name of genotyping center that produced the genotypes

	H. Name: protLSID
	   Description: ID for HapMap protocol

	I. Name: assayLSID
	   Description: ID for HapMap assay used for genotyping

	J. Name: panel
	   Description: ID for panel of individuals genotyped

	K. Name: QCcode
	   Description: Quality control ID for all entries

	Remaining columns. Names: B73*LH82-B-B-10-1-1-B-B to PHG39*PHG47-B-B-9-1-1-B-B
	   Description: Genotypes of projected SNPs and SVs for each maize RIL
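
The layout above (11 metadata columns followed by one genotype column per line) can be read row by row as sketched below in Python. This assumes the file is tab-delimited, as is usual for hapmap files; the names split_hapmap_row and missing_rate are hypothetical helpers, not part of the analysis scripts.

```python
N_META_COLUMNS = 11  # columns A to K: rs ... QCcode


def split_hapmap_row(header_line, data_line):
    """Split one hapmap row into (marker metadata, genotype calls).

    Missing data codes are converted to None: "NA" in the metadata
    columns and "NN" in the genotype columns.
    """
    names = header_line.rstrip("\n").split("\t")
    values = data_line.rstrip("\n").split("\t")
    meta = {
        name: (None if value == "NA" else value)
        for name, value in zip(names[:N_META_COLUMNS], values[:N_META_COLUMNS])
    }
    calls = {
        name: (None if value == "NN" else value)
        for name, value in zip(names[N_META_COLUMNS:], values[N_META_COLUMNS:])
    }
    return meta, calls


def missing_rate(calls):
    """Fraction of genotype calls that are missing for one marker."""
    if not calls:
        return 0.0
    return sum(call is None for call in calls.values()) / len(calls)
```

Because the file has several million rows, processing it one line at a time in this way avoids loading the full genotype matrix into memory.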


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file3.tar.gz
-----------------------------------------

1. Number of variables: 7

2. Number of cases/rows: 1000

3. Missing data codes: NA

4. Variable List

	A. Name: <Trait>
	   Description: RIL name

	B. Name: Trait_1
	   Description: Simulated value for RIL at environment 1

	C. Name: Trait_2
	   Description: Simulated value for RIL at environment 2

	D. Name: Trait_3
	   Description: Simulated value for RIL at environment 3

	E. Name: Trait_4
	   Description: Simulated value for RIL at environment 4

	F. Name: Trait_5
	   Description: Simulated value for RIL at environment 5

	G. Name: Rep
	   Description: Replicate number

	Simulated values for different genetic architectures are located in different folders.
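
A simulated trait file with the seven columns listed above could be summarized per replicate and environment as in the Python sketch below. The tab delimiter is an assumption (adjust the delimiter argument if the files use another separator), and environment_means_by_rep is a hypothetical helper name.

```python
import csv
from collections import defaultdict
from statistics import mean

ENV_COLUMNS = ["Trait_1", "Trait_2", "Trait_3", "Trait_4", "Trait_5"]


def environment_means_by_rep(handle, delimiter="\t"):
    """Return {replicate: {environment column: mean simulated value}}.

    handle is an open text file (or any iterable of lines) whose first
    line is the header. Values coded as NA are skipped.
    """
    values = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(handle, delimiter=delimiter):
        for column in ENV_COLUMNS:
            if row[column] != "NA":
                values[row["Rep"]][column].append(float(row[column]))
    return {
        rep: {column: mean(vals) for column, vals in columns.items()}
        for rep, columns in values.items()
    }
```

Replicate numbers are kept as the strings read from the file, so the result is indexed as result["1"]["Trait_1"].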


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file4.tar.gz
-----------------------------------------

1. Number of variables: -

2. Number of cases/rows: -

3. Missing data codes: -

4. Variable List: -

	These are plain-text files with details about the ANOVA results. Results for different genetic architectures are located in different folders.


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file5.tar.gz
-----------------------------------------

1. Number of variables: 344

2. Number of cases/rows: 5476

3. Missing data codes: NA for columns A to K; NN for remaining columns

4. Variable List

	A. Name: rs
	   Description: Marker ID

	B. Name: alleles
	   Description: Possible alleles for marker

	C. Name: chrom
	   Description: Chromosome to which the marker was mapped

	D. Name: pos
	   Description: Position of the marker on the chromosome

	E. Name: strand
	   Description: Orientation of the marker in the DNA strand

	F. Name: assembly
	   Description: Version of reference sequence assembly

	G. Name: center
	   Description: Name of genotyping center that produced the genotypes

	H. Name: protLSID
	   Description: ID for HapMap protocol

	I. Name: assayLSID
	   Description: ID for HapMap assay used for genotyping

	J. Name: panel
	   Description: ID for panel of individuals genotyped

	K. Name: QCcode
	   Description: Quality control ID for all entries

	Remaining columns. Names: B73*LH82-B-B-10-1-1-B-B to PHG39*PHG47-B-B-9-1-1-B-B
	   Description: Predictor genotypes of maize RILs

	Different iterations of predictors are located in different folders. Five different sets of predictors ("all_markers", "snp_ld_markers", "snp_markers", "snp_not_ld_markers", "sv_markers") were generated in each iteration.


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file6.tar.gz
-----------------------------------------

1. Number of variables: 7

2. Number of cases/rows: 1000

3. Missing data codes: NA

4. Variable List

	A. Name: <Trait>
	   Description: RIL name

	B. Name: Trait_1
	   Description: Simulated value for RIL at environment 1

	C. Name: Trait_2
	   Description: Simulated value for RIL at environment 2

	D. Name: Trait_3
	   Description: Simulated value for RIL at environment 3

	E. Name: Trait_4
	   Description: Simulated value for RIL at environment 4

	F. Name: Trait_5
	   Description: Simulated value for RIL at environment 5

	G. Name: Rep
	   Description: Replicate number

	Simulated values for different genetic architectures are located in different folders.


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file7.tar.gz
-----------------------------------------

1. Number of variables: 344

2. Number of cases/rows: 500

3. Missing data codes: NA for columns A to K; NN for remaining columns

4. Variable List

	A. Name: rs
	   Description: Marker ID

	B. Name: alleles
	   Description: Possible alleles for marker

	C. Name: chrom
	   Description: Chromosome to which the marker was mapped

	D. Name: pos
	   Description: Position of the marker on the chromosome

	E. Name: strand
	   Description: Orientation of the marker in the DNA strand

	F. Name: assembly
	   Description: Version of reference sequence assembly

	G. Name: center
	   Description: Name of genotyping center that produced the genotypes

	H. Name: protLSID
	   Description: ID for HapMap protocol

	I. Name: assayLSID
	   Description: ID for HapMap assay used for genotyping

	J. Name: panel
	   Description: ID for panel of individuals genotyped

	K. Name: QCcode
	   Description: Quality control ID for all entries

	Remaining columns. Names: B73*LH82-B-B-10-1-1-B-B to PHG39*PHG47-B-B-9-1-1-B-B
	   Description: Predictor genotypes of maize RILs

	Different iterations of predictors are located in different folders.


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file8.xlsx
-----------------------------------------

1. Number of variables: 31

2. Number of cases/rows: 1600

3. Missing data codes: NA

4. Variable List

	A. Name: Heritability
	   Description: Trait heritability

	B. Name: QTL number
	   Description: Number of causative variants

	C. Name: Causative variant type
	   Description: Causative variant type

	D. Name: Predictor type
	   Description: Predictor type

	E. Name: Simulated population number
	   Description: Simulated population number

	F. Name: Prediction iteration number
	   Description: Prediction iteration number

	G. Name: Cross-validation
	   Description: Cross-validation strategy

 	H. Name: Accuracy (average)
 	   Description: Accuracy (average)

 	I. Name: Standard error (average)
 	   Description: Standard error (average)

 	J. Name: Lower CI (average)
 	   Description: Lower confidence interval (average)

 	K. Name: Upper CI (average)
 	   Description: Upper confidence interval (average)

 	L. Name: Accuracy (environment 1)
 	   Description: Accuracy (environment 1)

 	M. Name: Standard error (environment 1)
 	   Description: Standard error (environment 1)

 	N. Name: Lower CI (environment 1)
 	   Description: Lower confidence interval (environment 1)

	O. Name: Upper CI (environment 1)
		Description: Upper confidence interval (environment 1)

	P. Name: Accuracy (environment 2)
		Description: Accuracy (environment 2)

	Q. Name: Standard error (environment 2)
		Description: Standard error (environment 2)

	R. Name: Lower CI (environment 2)
		Description: Lower confidence interval (environment 2)

	S. Name: Upper CI (environment 2)
		Description: Upper confidence interval (environment 2)

	T. Name: Accuracy (environment 3)
		Description: Accuracy (environment 3)

	U. Name: Standard error (environment 3)
		Description: Standard error (environment 3)

	V. Name: Lower CI (environment 3)
 	   Description: Lower confidence interval (environment 3)

 	W. Name: Upper CI (environment 3)
 	   Description: Upper confidence interval (environment 3)

 	X. Name: Accuracy (environment 4)
 	   Description: Accuracy (environment 4)

	Y. Name: Standard error (environment 4)
		Description: Standard error (environment 4)

	Z. Name: Lower CI (environment 4)
		Description: Lower confidence interval (environment 4)

	AA. Name: Upper CI (environment 4)
		Description: Upper confidence interval (environment 4)

	AB. Name: Accuracy (environment 5)
		Description: Accuracy (environment 5)

	AC. Name: Standard error (environment 5)
		Description: Standard error (environment 5)

	AD. Name: Lower CI (environment 5)
		Description: Lower confidence interval (environment 5)

	AE. Name: Upper CI (environment 5)
		Description: Upper confidence interval (environment 5)
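
Once the spreadsheet rows have been loaded (for example as one dictionary per row keyed by the column names above; the loading step is left out here because it depends on which spreadsheet library is available), the average accuracy per scenario could be summarized as in this Python sketch. The helper name mean_accuracy is hypothetical, and the default grouping columns are only one possible choice.

```python
from collections import defaultdict
from statistics import mean


def mean_accuracy(rows,
                  group_by=("Causative variant type",
                            "Predictor type",
                            "Cross-validation"),
                  value="Accuracy (average)"):
    """Average the accuracy column within each combination of group_by.

    rows is an iterable of dicts keyed by column name. Rows whose value
    is missing (None or the code "NA") are skipped.
    """
    groups = defaultdict(list)
    for row in rows:
        accuracy = row[value]
        if accuracy is None or accuracy == "NA":
            continue
        key = tuple(row[column] for column in group_by)
        groups[key].append(float(accuracy))
    return {key: mean(vals) for key, vals in groups.items()}
```

Passing value="Accuracy (environment 1)" instead would summarize a single environment rather than the across-environment average.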


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file9.xlsx
-----------------------------------------

1. Number of variables: 32

2. Number of cases/rows: 3200

3. Missing data codes: NA

4. Variable List

	A. Name: Heritability
	   Description: Trait heritability

	B. Name: QTL number
	   Description: Number of causative variants

	C. Name: Proportion SNPs and SVs
	   Description: Proportion of SNP and SV causative variants

	D. Name: Relative effect sizes between SNPs and SVs
	   Description: Relative effect sizes between SNP and SV causative variants

	E. Name: Predictor type
	   Description: Predictor type

	F. Name: Simulated population number
	   Description: Simulated population number

	G. Name: Prediction iteration number
	   Description: Prediction iteration number

	H. Name: Cross-validation
	   Description: Cross-validation strategy

 	I. Name: Accuracy (average)
 	   Description: Accuracy (average)

 	J. Name: Standard error (average)
 	   Description: Standard error (average)

 	K. Name: Lower CI (average)
 	   Description: Lower confidence interval (average)

 	L. Name: Upper CI (average)
 	   Description: Upper confidence interval (average)

 	M. Name: Accuracy (environment 1)
 	   Description: Accuracy (environment 1)

 	N. Name: Standard error (environment 1)
 	   Description: Standard error (environment 1)

 	O. Name: Lower CI (environment 1)
 	   Description: Lower confidence interval (environment 1)

	P. Name: Upper CI (environment 1)
		Description: Upper confidence interval (environment 1)

	Q. Name: Accuracy (environment 2)
		Description: Accuracy (environment 2)

	R. Name: Standard error (environment 2)
		Description: Standard error (environment 2)

	S. Name: Lower CI (environment 2)
		Description: Lower confidence interval (environment 2)

	T. Name: Upper CI (environment 2)
		Description: Upper confidence interval (environment 2)

	U. Name: Accuracy (environment 3)
		Description: Accuracy (environment 3)

	V. Name: Standard error (environment 3)
		Description: Standard error (environment 3)

	W. Name: Lower CI (environment 3)
 	   Description: Lower confidence interval (environment 3)

 	X. Name: Upper CI (environment 3)
 	   Description: Upper confidence interval (environment 3)

 	Y. Name: Accuracy (environment 4)
 	   Description: Accuracy (environment 4)

	Z. Name: Standard error (environment 4)
		Description: Standard error (environment 4)

	AA. Name: Lower CI (environment 4)
		Description: Lower confidence interval (environment 4)

	AB. Name: Upper CI (environment 4)
		Description: Upper confidence interval (environment 4)

	AC. Name: Accuracy (environment 5)
		Description: Accuracy (environment 5)

	AD. Name: Standard error (environment 5)
		Description: Standard error (environment 5)

	AE. Name: Lower CI (environment 5)
		Description: Lower confidence interval (environment 5)

	AF. Name: Upper CI (environment 5)
		Description: Upper confidence interval (environment 5)


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file10.xlsx
-----------------------------------------

1. Number of variables: 31

2. Number of cases/rows: 1080

3. Missing data codes: NA

4. Variable List

	A. Name: Heritability
	   Description: Trait heritability

	B. Name: QTL number
	   Description: Number of causative variants

	C. Name: Causative variant type
	   Description: Causative variant type

	D. Name: Predictor type
	   Description: Predictor type

	E. Name: Simulated population number
	   Description: Simulated population number

	F. Name: Prediction iteration number
	   Description: Prediction iteration number

	G. Name: Cross-validation
	   Description: Cross-validation strategy

 	H. Name: Accuracy (average)
 	   Description: Accuracy (average)

 	I. Name: Standard error (average)
 	   Description: Standard error (average)

 	J. Name: Lower CI (average)
 	   Description: Lower confidence interval (average)

 	K. Name: Upper CI (average)
 	   Description: Upper confidence interval (average)

 	L. Name: Accuracy (environment 1)
 	   Description: Accuracy (environment 1)

 	M. Name: Standard error (environment 1)
 	   Description: Standard error (environment 1)

 	N. Name: Lower CI (environment 1)
 	   Description: Lower confidence interval (environment 1)

	O. Name: Upper CI (environment 1)
		Description: Upper confidence interval (environment 1)

	P. Name: Accuracy (environment 2)
		Description: Accuracy (environment 2)

	Q. Name: Standard error (environment 2)
		Description: Standard error (environment 2)

	R. Name: Lower CI (environment 2)
		Description: Lower confidence interval (environment 2)

	S. Name: Upper CI (environment 2)
		Description: Upper confidence interval (environment 2)

	T. Name: Accuracy (environment 3)
		Description: Accuracy (environment 3)

	U. Name: Standard error (environment 3)
		Description: Standard error (environment 3)

	V. Name: Lower CI (environment 3)
 	   Description: Lower confidence interval (environment 3)

 	W. Name: Upper CI (environment 3)
 	   Description: Upper confidence interval (environment 3)

 	X. Name: Accuracy (environment 4)
 	   Description: Accuracy (environment 4)

	Y. Name: Standard error (environment 4)
		Description: Standard error (environment 4)

	Z. Name: Lower CI (environment 4)
		Description: Lower confidence interval (environment 4)

	AA. Name: Upper CI (environment 4)
		Description: Upper confidence interval (environment 4)

	AB. Name: Accuracy (environment 5)
		Description: Accuracy (environment 5)

	AC. Name: Standard error (environment 5)
		Description: Standard error (environment 5)

	AD. Name: Lower CI (environment 5)
		Description: Lower confidence interval (environment 5)

	AE. Name: Upper CI (environment 5)
		Description: Upper confidence interval (environment 5)