This readme.txt file was generated on 2023-10-09 by Rafael Della Coletta and updated on 2023-10-30 by Shannon Farrell.

Recommended citation for the data:
Della Coletta, Rafael; Fernandes, Samuel B; Monnahan, Patrick J; Mikel, Mark A; Bohn, Martin O; Lipka, Alexander E; Hirsch, Candice N. (2023). Datasets to test the importance of genetic architecture in marker selection decisions for genomic prediction. Retrieved from the Data Repository for the University of Minnesota, https://doi.org/10.13020/atq4-1b58.


-----------------
2023-10-30 UPDATE
-----------------
The analysis was re-run with different parameters. The updated files keep the same data structure as the originals (number of columns, variable names, etc.), but their values differ and some files may have a different number of rows.

Other changes include: changing the tar.gz files to zip files for easier use; removing the original supp_file9; and changing the original supp_file10 to supp_file9.

The original dataset is available at: https://conservancy.umn.edu/handle/11299/252793.1

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset: Datasets to test the importance of genetic architecture in marker selection decisions for genomic prediction

2. Author Information

	Author Contact:  Candice N Hirsch (cnhirsch@umn.edu)

	Name:  Rafael Della Coletta
	Institution: University of Minnesota
	Email: della028@umn.edu
	ORCID: 0000-0001-6988-9598

	Name:  Samuel B Fernandes
	Institution: University of Arkansas
	Email: samuelbf@uark.edu
	ORCID: 0000-0001-8269-535X

	Name:  Patrick J. Monnahan
	Institution: University of Minnesota
	Email: pmonnaha@umn.edu
	ORCID: 0000-0001-8269-535X

	Name:  Mark A Mikel
	Institution: University of Illinois
	Email: mmikel@illinois.edu
	ORCID: -

	Name:  Martin O Bohn
	Institution: University of Illinois
	Email: mbohn@illinois.edu
	ORCID: 0000-0003-2364-6229

	Name:  Alexander E Lipka
	Institution: University of Illinois
	Email: alipka@illinois.edu
	ORCID: 0000-0003-1571-8528

	Name:  Candice N Hirsch
	Institution: University of Minnesota
	Email: cnhirsch@umn.edu
	ORCID: 0000-0002-8833-3023

3. Date published or finalized for release: 2023-10-09

4. Information about funding sources that supported the collection of the data:
   United States Department of Agriculture (2018-67013-27571)
   National Science Foundation (IOS-1546727)
   Minnesota Agricultural Experiment Station

5. Overview of the data (abstract):
   This dataset contains the input files to simulate traits for maize recombinant inbred lines (RILs) and run genomic prediction models with different marker types. Using real genotypic information from 333 maize RILs, with single nucleotide polymorphism (SNP) and structural variant (SV) information projected from their seven sequenced parental lines, we simulated traits with different genetic architectures in multiple environments using the R package simplePHENOTYPES. We varied the heritability, the number of quantitative trait loci (QTLs), the type of causative variant (SNPs or SVs), and the variant effect sizes. Weather data from five locations in the U.S. Midwest in 2020 were used to generate a residual correlation matrix among environments. After performing a two-stage analysis with a multivariate GBLUP prediction model for each marker type and genetic architecture, we obtained prediction accuracies using two types of cross-validation (CV1 and CV2). For instructions on how to perform this analysis and for the analysis scripts, please see https://github.com/HirschLabUMN/genomic_prediction_svs


--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data: CC0 1.0 Universal (http://creativecommons.org/publicdomain/zero/1.0/)

2. Links to publications that cite or use the data:
Della Coletta, R., Fernandes, S.B., Monnahan, P.J. et al. (2023). Importance of genetic architecture in marker selection decisions for genomic prediction. Theoretical and Applied Genetics 136, 220. 
https://doi.org/10.1007/s00122-023-04469-w

3. Was data derived from another source?
	If yes, list source(s): No

4. Terms of Use: Data Repository for the U of Minnesota (DRUM). By using these files, users agree to the Terms of Use: https://conservancy.umn.edu/pages/drum/policies/#terms-of-use


---------------------
DATA & FILE OVERVIEW
---------------------

To open vcf.gz files on Unix/Linux systems, open Terminal and type the command: gunzip supp_file1.vcf.gz

To open vcf.gz files on Windows, use 7-Zip to extract the file, then open the extracted file in a text editor (Notepad, Sublime, Atom, etc.).

To open the hmp.txt file (which is a very large .txt file), use a program capable of opening large files, such as LibreOffice.
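
As an alternative to extracting the files and opening them in an editor, the compressed files can be read line by line. The sketch below uses only the Python standard library. The helper name head_gz is hypothetical, and the path in the usage comment assumes the file was downloaded to the working directory.

```python
import gzip
from itertools import islice


def head_gz(path, n=5):
    """Return the first n lines of a gzip-compressed text file.

    The file is streamed, so it is neither decompressed to disk
    nor loaded into memory as a whole.
    """
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


# Example (assumes the file is in the working directory):
# for line in head_gz("supp_file2.hmp.txt.gz", n=3):
#     print(line[:200])  # print only the start of each very wide row
```

Streaming in this way avoids opening all rows of the hapmap file at once.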


File List

	Filename: supp_file1.vcf.gz
	Short description: Raw structural variant calls of the maize parental lines in VCF format

	Filename: supp_file2.hmp.txt.gz
	Short description: Filtered genotypic data of recombinant inbred lines (RILs) in hapmap format with projected SNPs and SVs

	Filename: supp_file3.zip
	Short description: Files containing simulated trait values for each RIL across different genetic architectures

	Filename: supp_file4.zip
	Short description: Files containing ANOVA results for each simulated scenario

	Filename: supp_file5.zip
	Short description: Files containing all the marker datasets used for genomic prediction

	Filename: supp_file6.zip
	Short description: Files containing simulated trait values for each RIL across different genetic architectures to understand the relationship between LD and prediction accuracy

	Filename: supp_file7.zip
	Short description: Files containing all the marker datasets used for genomic prediction to understand the relationship between LD and prediction accuracy

	Filename: supp_file8.xlsx
	Short description: Genomic prediction accuracy of different marker types for each replicate of simulated traits where either SNPs or SVs were the causative variants

	Filename: supp_file9.xlsx
	Short description: Genomic prediction accuracy of markers with low (r2 < 0.5), moderate (0.5 < r2 < 0.9) and high (r2 > 0.9) linkage disequilibrium (LD) to a QTL for each replicate of simulated traits where both SNPs and SVs were the causative variants
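
The three LD classes named for supp_file9 can be written as a small function. This is only a sketch: the function name ld_category is hypothetical, and because the description uses strict inequalities, the handling of values of exactly 0.5 or 0.9 is an assumption (here they fall in the moderate class).

```python
def ld_category(r2):
    """Classify a marker by its r2 to a QTL using the thresholds in this readme.

    Assumption: values of exactly 0.5 or 0.9 are treated as "moderate";
    the readme does not say how these boundary values were handled.
    """
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must be between 0 and 1")
    if r2 < 0.5:
        return "low"
    if r2 > 0.9:
        return "high"
    return "moderate"


for value in (0.2, 0.7, 0.95):
    print(value, ld_category(value))
```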


--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data:
   The population of 333 F7 RILs, which has been previously described (Della Coletta et al. 2023), was generated from half diallel crosses of six maize inbred lines including B73, PHG39, PHG47, PH207, PHG35, and LH82. The parental lines and the 333 F7 RILs were previously genotyped with a custom Illumina Infinium 20K SNP chip (available at https://hdl.handle.net/11299/250568). The parental lines have also been previously SNP genotyped using whole genome resequencing data (available at https://doi.org/10.1093/g3journal/jkab238), and structural variants were also called from this dataset. See Methods in https://link.springer.com/article/10.1007/s00122-023-04469-w for more information.

2. Methods for processing the data:
   Structural variants were called using Lumpy v0.2.13 and SVtools v0.5.1. The ~3.1 million SNPs and ~10,000 SVs from the deep parental information were projected onto the 333 RILs using the 20,000 SNP chip markers to define haplotype blocks using TASSEL v5.2.56. Traits were simulated using the R package simplePHENOTYPES v1.3. Genomic prediction models were run in two stages using ASReml-R v4.1. See Methods in https://link.springer.com/article/10.1007/s00122-023-04469-w for more information.

3. Instrument- or software-specific information needed to interpret the data:
   All datasets are readable with a text editor or spreadsheet program (Notepad, Atom, Microsoft Excel, Google Sheets, etc.). For downstream analysis, please refer to https://github.com/HirschLabUMN/genomic_prediction_svs.

4. Environmental/experimental conditions:
   Weather data from five locations in the U.S. Midwest (Iowa City, IA; Bloomington, IL; Champaign, IL; Janesville, WI; and Saint Paul, MN) from April 2020 to October 2020 were obtained using the R package EnvRtype v1.0.

5. Describe any quality-assurance procedures performed on the data:
   The correct version of the SNP chip data was confirmed using custom scripts. SNPs with low quality, with segregation distortion, or overlapping deletions in the genome were removed. A sliding window approach was used to correct sequencing errors. See Methods in https://link.springer.com/article/10.1007/s00122-023-04469-w for more information.

6. People involved with sample collection, processing, analysis and/or submission:
   Rafael Della Coletta, Samuel B. Fernandes, Patrick J. Monnahan, Mark A. Mikel, Martin O. Bohn, Alexander E. Lipka, Candice N. Hirsch


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file1.vcf.gz
-----------------------------------------

1. Number of variables: 109

2. Number of cases/rows: 10004

3. Missing data codes: .

4. Variable List:

	A. Name: CHROM
		 Description: Chromosome

	B. Name: POS
		 Description: Position

	C. Name: ID
		 Description: Identifier

	D. Name: REF
		 Description: Reference base

	E. Name: ALT
		 Description: Alternate base

	F. Name: QUAL
		 Description: Quality (Phred scale)

	G. Name: FILTER
		 Description: Filter status

	H. Name: INFO
		 Description: Additional information encoded as a semicolon-separated series of short keys with optional values in the format <key>=<data>[,data]

	I. Name: FORMAT
		 Description: Specifics of the data types and order (colon-separated)

	Remaining columns. Names: A188 to W606S
 		 Description: Marker genotypes of maize inbred lines

	For more details about VCF format, see https://samtools.github.io/hts-specs/VCFv4.2.pdf
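
	The INFO layout described above (semicolon-separated keys, optional comma-separated values) can be unpacked with a few lines of Python. This is a minimal sketch: the function name parse_info is hypothetical, and the example string is made up for illustration rather than copied from supp_file1.vcf.gz.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict.

    Keys with values map to a list of strings (values are comma-separated),
    keys without a value (flags) map to True, and "." means no information.
    """
    if info == ".":
        return {}
    parsed = {}
    for field in info.split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, sep, value = field.partition("=")
        parsed[key] = value.split(",") if sep else True
    return parsed


# Illustrative INFO string (not taken from the dataset).
example = "SVTYPE=DEL;END=12345;CIPOS=-10,10;IMPRECISE"
print(parse_info(example))
```

	Values are kept as strings because their types depend on the INFO definitions in the VCF header.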


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file2.hmp.txt.gz
-----------------------------------------

1. Number of variables: 344

2. Number of cases/rows: 3131610

3. Missing data codes: NA for columns A to K; NN for remaining columns

4. Variable List

	A. Name: rs
	   Description: Marker ID

	B. Name: alleles
	   Description: Possible alleles for marker

	C. Name: chrom
	   Description: Chromosome to which the marker was mapped

	D. Name: pos
	   Description: Respective position of this marker on chromosome

	E. Name: strand
	   Description: Orientation of the marker in the DNA strand

	F. Name: assembly
	   Description: Version of reference sequence assembly

	G. Name: center
	   Description: Name of genotyping center that produced the genotypes

	H. Name: protLSID
	   Description: ID for HapMap protocol

	I. Name: assayLSID
	   Description: ID for HapMap assay used for genotyping

	J. Name: panel
	   Description: ID for panel of individuals genotyped

	K. Name: QCcode
	   Description: Quality control ID for all entries

	Remaining columns. Names: B73*LH82-B-B-10-1-1-B-B to PHG39*PHG47-B-B-9-1-1-B-B
	   Description: Genotypes of projected SNPs and SVs for each maize RIL
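
	The column layout above (11 marker metadata columns, A to K, followed by one genotype column per line, which gives 344 - 11 = 333 genotype columns) can be handled as in the sketch below. The function names are hypothetical, the tab delimiter is an assumption based on the usual hapmap convention, and the example row is made up.

```python
META_NAMES = ["rs", "alleles", "chrom", "pos", "strand", "assembly",
              "center", "protLSID", "assayLSID", "panel", "QCcode"]


def split_hapmap_row(line, sep="\t"):
    """Split one hapmap data row into (metadata dict, genotype list).

    Assumption: fields are tab-delimited.
    """
    fields = line.rstrip("\n").split(sep)
    n_meta = len(META_NAMES)  # columns A to K
    meta = dict(zip(META_NAMES, fields[:n_meta]))
    return meta, fields[n_meta:]


def missing_rate(genotypes, missing_code="NN"):
    """Fraction of genotype calls equal to the missing-data code."""
    if not genotypes:
        return 0.0
    return sum(g == missing_code for g in genotypes) / len(genotypes)


# Made-up row with four genotype calls, one of them missing.
row = "m1\tA/T\t1\t100\t+\tNA\tNA\tNA\tNA\tNA\tNA\tAA\tTT\tNN\tAA"
meta, genotypes = split_hapmap_row(row)
print(meta["rs"], meta["chrom"], meta["pos"], missing_rate(genotypes))
```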


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file3.zip
-----------------------------------------

1. Number of variables: 7

2. Number of cases/rows: 999

3. Missing data codes: NA

4. Variable List

	A. Name: <Trait>
	   Description: RIL name

	B. Name: Trait_1
	   Description: Simulated value for the RIL in environment 1

	C. Name: Trait_2
	   Description: Simulated value for the RIL in environment 2

	D. Name: Trait_3
	   Description: Simulated value for the RIL in environment 3

	E. Name: Trait_4
	   Description: Simulated value for the RIL in environment 4

	F. Name: Trait_5
	   Description: Simulated value for the RIL in environment 5

	G. Name: Rep
	   Description: Replicate number

	Simulated values for different genetic architectures are located in different folders.
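
	For analyses that need one record per environment, the wide layout above (one Trait_<k> column per environment) can be reshaped to long format. The sketch below works on rows already read into dictionaries. The function name to_long and the output keys are hypothetical, and the delimiter mentioned in the usage comment is an assumption, since this readme does not state it.

```python
def to_long(rows, n_env=5):
    """Reshape wide records into one record per line, replicate and environment."""
    long_rows = []
    for row in rows:
        for env in range(1, n_env + 1):
            value = row[f"Trait_{env}"]
            long_rows.append({
                "line": row["<Trait>"],
                "rep": row["Rep"],
                "env": env,
                # "NA" is the missing-data code listed above
                "value": None if value == "NA" else float(value),
            })
    return long_rows


# Made-up record for illustration. With a real file, rows could come from
# csv.DictReader(handle, delimiter="\t"), assuming tab-delimited text.
rows = [{"<Trait>": "line_1", "Trait_1": "1.5", "Trait_2": "2.0",
         "Trait_3": "0.5", "Trait_4": "3.0", "Trait_5": "NA", "Rep": "1"}]
for record in to_long(rows):
    print(record)
```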


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file4.zip
-----------------------------------------

1. Number of variables: -

2. Number of cases/rows: -

3. Missing data codes: -

4. Variable List: -

	These are plain-text files with details about the ANOVA results. Results for different genetic architectures are located in different folders.


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file5.zip
-----------------------------------------

1. Number of variables: 344

2. Number of cases/rows: 7892

3. Missing data codes: NA for columns A to K; NN for remaining columns

4. Variable List

	A. Name: rs
	   Description: Marker ID

	B. Name: alleles
	   Description: Possible alleles for marker

	C. Name: chrom
	   Description: Chromosome to which the marker was mapped

	D. Name: pos
	   Description: Respective position of this marker on chromosome

	E. Name: strand
	   Description: Orientation of the marker in the DNA strand

	F. Name: assembly
	   Description: Version of reference sequence assembly

	G. Name: center
	   Description: Name of genotyping center that produced the genotypes

	H. Name: protLSID
	   Description: ID for HapMap protocol

	I. Name: assayLSID
	   Description: ID for HapMap assay used for genotyping

	J. Name: panel
	   Description: ID for panel of individuals genotyped

	K. Name: QCcode
	   Description: Quality control ID for all entries

	Remaining columns. Names: B73*LH82-B-B-10-1-1-B-B to PHG39*PHG47-B-B-9-1-1-B-B
	   Description: Predictor genotypes of maize RILs

	Different iterations of predictors are located in different folders. Five different sets of predictors ("all_markers", "snp_ld_markers", "snp_markers", "snp_not_ld_markers", "sv_markers") were generated in each iteration.


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file6.zip
-----------------------------------------

1. Number of variables: 7

2. Number of cases/rows: 999

3. Missing data codes: NA

4. Variable List

	A. Name: <Trait>
	   Description: RIL name

	B. Name: Trait_1
	   Description: Simulated value for the RIL in environment 1

	C. Name: Trait_2
	   Description: Simulated value for the RIL in environment 2

	D. Name: Trait_3
	   Description: Simulated value for the RIL in environment 3

	E. Name: Trait_4
	   Description: Simulated value for the RIL in environment 4

	F. Name: Trait_5
	   Description: Simulated value for the RIL in environment 5

	G. Name: Rep
	   Description: Replicate number

	Simulated values for different genetic architectures are located in different folders.


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file7.zip
-----------------------------------------

1. Number of variables: 344

2. Number of cases/rows: 500

3. Missing data codes: NA for columns A to K; NN for remaining columns

4. Variable List

	A. Name: rs
	   Description: Marker ID

	B. Name: alleles
	   Description: Possible alleles for marker

	C. Name: chrom
	   Description: Chromosome to which the marker was mapped

	D. Name: pos
	   Description: Respective position of this marker on chromosome

	E. Name: strand
	   Description: Orientation of the marker in the DNA strand

	F. Name: assembly
	   Description: Version of reference sequence assembly

	G. Name: center
	   Description: Name of genotyping center that produced the genotypes

	H. Name: protLSID
	   Description: ID for HapMap protocol

	I. Name: assayLSID
	   Description: ID for HapMap assay used for genotyping

	J. Name: panel
	   Description: ID for panel of individuals genotyped

	K. Name: QCcode
	   Description: Quality control ID for all entries

	Remaining columns. Names: B73*LH82-B-B-10-1-1-B-B to PHG39*PHG47-B-B-9-1-1-B-B
	   Description: Predictor genotypes of maize RILs

	Different iterations of predictors are located in different folders.


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file8.xlsx
-----------------------------------------

1. Number of variables: 31

2. Number of cases/rows: 800

3. Missing data codes: NA

4. Variable List

	A. Name: Heritability
	   Description: Trait heritability

	B. Name: QTL number
	   Description: Number of causative variants

	C. Name: Causative variant type
	   Description: Causative variant type

	D. Name: Predictor type
	   Description: Predictor type

	E. Name: Simulated population number
	   Description: Simulated population number

	F. Name: Prediction iteration number
	   Description: Prediction iteration number

	G. Name: Cross-validation
	   Description: Cross-validation strategy

 	H. Name: Accuracy (average)
 	   Description: Accuracy (average)

 	I. Name: Standard error (average)
 	   Description: Standard error (average)

 	J. Name: Lower CI (average)
 	   Description: Lower confidence interval (average)

 	K. Name: Upper CI (average)
 	   Description: Upper confidence interval (average)

 	L. Name: Accuracy (environment 1)
 	   Description: Accuracy (environment 1)

 	M. Name: Standard error (environment 1)
 	   Description: Standard error (environment 1)

 	N. Name: Lower CI (environment 1)
 	   Description: Lower confidence interval (environment 1)

	O. Name: Upper CI (environment 1)
		Description: Upper confidence interval (environment 1)

	P. Name: Accuracy (environment 2)
		Description: Accuracy (environment 2)

	Q. Name: Standard error (environment 2)
		Description: Standard error (environment 2)

	R. Name: Lower CI (environment 2)
		Description: Lower confidence interval (environment 2)

	S. Name: Upper CI (environment 2)
		Description: Upper confidence interval (environment 2)

	T. Name: Accuracy (environment 3)
		Description: Accuracy (environment 3)

	U. Name: Standard error (environment 3)
		Description: Standard error (environment 3)

	V. Name: Lower CI (environment 3)
 	   Description: Lower confidence interval (environment 3)

 	W. Name: Upper CI (environment 3)
 	   Description: Upper confidence interval (environment 3)

 	X. Name: Accuracy (environment 4)
 	   Description: Accuracy (environment 4)

	Y. Name: Standard error (environment 4)
		Description: Standard error (environment 4)

	Z. Name: Lower CI (environment 4)
		Description: Lower confidence interval (environment 4)

	AA. Name: Upper CI (environment 4)
		Description: Upper confidence interval (environment 4)

	AB. Name: Accuracy (environment 5)
		Description: Accuracy (environment 5)

	AC. Name: Standard error (environment 5)
		Description: Standard error (environment 5)

	AD. Name: Lower CI (environment 5)
		Description: Lower confidence interval (environment 5)

	AE. Name: Upper CI (environment 5)
		Description: Upper confidence interval (environment 5)
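
	The 31 columns listed above follow a regular pattern: seven descriptor columns, then four statistics (accuracy, standard error, lower CI, upper CI) for the average and for each of the five environments. The sketch below rebuilds the list from that pattern, which may help when checking a loaded table. It assumes the spreadsheet headers match the names in this readme exactly.

```python
META = ["Heritability", "QTL number", "Causative variant type",
        "Predictor type", "Simulated population number",
        "Prediction iteration number", "Cross-validation"]
STATS = ["Accuracy", "Standard error", "Lower CI", "Upper CI"]
SCOPES = ["average"] + [f"environment {k}" for k in range(1, 6)]

# Descriptor columns first, then the four statistics for each scope in order.
COLUMNS = META + [f"{stat} ({scope})" for scope in SCOPES for stat in STATS]

print(len(COLUMNS))   # 7 + 4 * 6 = 31
print(COLUMNS[7])     # column H: Accuracy (average)
print(COLUMNS[-1])    # column AE: Upper CI (environment 5)
```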


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: supp_file9.xlsx
-----------------------------------------

1. Number of variables: 31

2. Number of cases/rows: 540

3. Missing data codes: NA

4. Variable List

	A. Name: Heritability
	   Description: Trait heritability

	B. Name: QTL number
	   Description: Number of causative variants

	C. Name: Causative variant type
	   Description: Causative variant type

	D. Name: Predictor type
	   Description: Predictor type

	E. Name: Simulated population number
	   Description: Simulated population number

	F. Name: Prediction iteration number
	   Description: Prediction iteration number

	G. Name: Cross-validation
	   Description: Cross-validation strategy

 	H. Name: Accuracy (average)
 	   Description: Accuracy (average)

 	I. Name: Standard error (average)
 	   Description: Standard error (average)

 	J. Name: Lower CI (average)
 	   Description: Lower confidence interval (average)

 	K. Name: Upper CI (average)
 	   Description: Upper confidence interval (average)

 	L. Name: Accuracy (environment 1)
 	   Description: Accuracy (environment 1)

 	M. Name: Standard error (environment 1)
 	   Description: Standard error (environment 1)

 	N. Name: Lower CI (environment 1)
 	   Description: Lower confidence interval (environment 1)

	O. Name: Upper CI (environment 1)
		Description: Upper confidence interval (environment 1)

	P. Name: Accuracy (environment 2)
		Description: Accuracy (environment 2)

	Q. Name: Standard error (environment 2)
		Description: Standard error (environment 2)

	R. Name: Lower CI (environment 2)
		Description: Lower confidence interval (environment 2)

	S. Name: Upper CI (environment 2)
		Description: Upper confidence interval (environment 2)

	T. Name: Accuracy (environment 3)
		Description: Accuracy (environment 3)

	U. Name: Standard error (environment 3)
		Description: Standard error (environment 3)

	V. Name: Lower CI (environment 3)
 	   Description: Lower confidence interval (environment 3)

 	W. Name: Upper CI (environment 3)
 	   Description: Upper confidence interval (environment 3)

 	X. Name: Accuracy (environment 4)
 	   Description: Accuracy (environment 4)

	Y. Name: Standard error (environment 4)
		Description: Standard error (environment 4)

	Z. Name: Lower CI (environment 4)
		Description: Lower confidence interval (environment 4)

	AA. Name: Upper CI (environment 4)
		Description: Upper confidence interval (environment 4)

	AB. Name: Accuracy (environment 5)
		Description: Accuracy (environment 5)

	AC. Name: Standard error (environment 5)
		Description: Standard error (environment 5)

	AD. Name: Lower CI (environment 5)
		Description: Lower confidence interval (environment 5)

	AE. Name: Upper CI (environment 5)
		Description: Upper confidence interval (environment 5)