# SNP Genotyping Data for the Barley Population in "Registration of the S2MET Barley Mapping Population for Multi-Environment Genomewide Selection"

This folder contains marker data generated via genotyping-by-sequencing (GBS).

## Read alignment and SNP calling

Single-end reads (100-150 bp) were generated on an Illumina machine. Reads were first cleaned
and quality-checked using the FASTX-Toolkit and cutadapt. Reads were aligned to the physical barley reference
genome (v2) using Bowtie2. SNPs were called using GATK (v. 3.4-6) according to best practices. The scripts
used in the read alignment and SNP calling pipeline are available at the GitHub repository:
https://github.com/neyhartj/GBarleyS.

Variants were filtered according to the following criteria:

Parameter             | Filter
----------------------|--------
mapping quality (MQ)  | >= 40
genotype quality (GQ) | >= 40
read depth (DP)       | >= 5
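
The thresholds above can be illustrated with a minimal Python sketch. It is not the pipeline's code (the actual filtering was done with the scripts in the GBarleyS repository). The sketch assumes that MQ is read from the site-level INFO column, that GQ and DP are per-sample FORMAT fields, and that a genotype failing either per-sample threshold is set to missing; the function names and the example record are hypothetical.

```python
def parse_info(info_field):
    """Turn an INFO string such as 'MQ=60.0;DP=120' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True  # flag-style keys carry no value
    return info


def passes_site_filter(vcf_line, min_mq=40.0):
    """Return True if the site-level mapping quality (MQ) meets the threshold."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])
    return "MQ" in info and float(info["MQ"]) >= min_mq


def mask_genotypes(vcf_line, min_gq=40, min_dp=5):
    """Return one GT string per sample, with low-quality calls set to './.'."""
    fields = vcf_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")
    calls = []
    for sample in fields[9:]:
        values = dict(zip(keys, sample.split(":")))
        gq, dp = values.get("GQ", "."), values.get("DP", ".")
        ok = gq != "." and dp != "." and int(gq) >= min_gq and int(dp) >= min_dp
        calls.append(values.get("GT", "./.") if ok else "./.")
    return calls


# Hypothetical record with three samples (GT:DP:GQ).
record = "1H\t100\tS1\tA\tG\t50\t.\tMQ=60.0;DP=30\tGT:DP:GQ\t0/0:10:99\t1/1:3:99\t0/1:8:20"
print(passes_site_filter(record))
print(mask_genotypes(record))
```

In this example the second sample fails the depth threshold and the third fails the genotype quality threshold, so only the first call is retained.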

This filtering resulted in a VCF with ~250,000 variants. This VCF, in a compressed format, is 
included in this directory under the name `2row_GBS_final_snps_newpos_renamed.vcf.gz`. The file
should be decompressed prior to conversion to a HapMap file.
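
Running `gunzip` on the command line is the simplest way to decompress the file. For those who prefer to stay in Python, a minimal sketch using only the standard library follows; the helper name `decompress_gzip` is hypothetical, and the output name is chosen to match the input expected by the conversion command in the next section.

```python
import gzip
import shutil


def decompress_gzip(src_path, dest_path):
    """Stream-decompress a gzip file so it is never fully loaded into memory."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)


# Example call (run from the directory that holds the VCF):
# decompress_gzip("2row_GBS_final_snps_newpos_renamed.vcf.gz",
#                 "2row_GBS_final_snps_newpos_renamed.vcf")
```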


## Conversion of VCF to HapMap file

The VCF file was converted to a HapMap file format for use in the rrBLUP R package. This is a
tab-separated file where the first 4 columns describe the marker name, reference/alternate alleles,
the chromosome, and the physical position on that chromosome.
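
To make the layout concrete, here is a minimal Python sketch that reads such a file. It assumes a single header row and that every column after the fourth holds the calls for one entry; the function name, marker names, and entry names in the example are hypothetical.

```python
import csv
import io


def read_hapmap(handle):
    """Parse a 4-metadata-column HapMap file into entry names and marker records."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    entries = header[4:]  # assumed: columns 5 onward are entries
    markers = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        name, alleles, chrom, pos = row[:4]
        markers.append({
            "name": name,
            "alleles": alleles,
            "chrom": chrom,
            "pos": int(pos),
            "calls": dict(zip(entries, row[4:])),
        })
    return entries, markers


# Hypothetical two-marker, two-entry file.
demo = io.StringIO(
    "marker\talleles\tchrom\tpos\tentry_a\tentry_b\n"
    "snp_1\tA/G\t1H\t1500\t1\t-1\n"
    "snp_2\tC/T\t2H\t98000\t0\t1\n"
)
entries, markers = read_hapmap(demo)
print(entries)
print(markers[0]["name"], markers[0]["chrom"], markers[0]["pos"])
```

For a file on disk, pass an open handle instead, for example `open(path, newline="")`.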

To convert the VCF file to the HapMap file, the Python script `vcf2hapmap.py` was used. This script is
available from the GitHub repository https://github.com/neyhartj/bioinformatic-utils. It is executed as follows:

```
python vcf2hapmap.py -i 2row_GBS_final_snps_newpos_renamed.vcf -o 2row_GBS_final_snps_newpos_renamed -r
```

Executing this script results in the tab-separated HapMap file called `2row_GBS_final_snps_newpos_renamed_hmp.txt`.


## Filtering

Marker data filtering is performed by the `s2_marker_data_processing_use.R` script, which also contains
instructions for setting up directories and packages.

Markers and entries were filtered on a population basis:

| Parameter              | Training population filter | Cycle 1 filter |
|------------------------|----------------------------|----------------|
| minor allele frequency | >= 0.05                    | >= 0.02        |
| SNP missingness        | <= 0.80                    | <= 0.80        |
| SNP heterozygosity     | <= 0.08                    | <= 0.35        |
| entry missingness      | <= 0.80                    | <= 0.80        |
| entry heterozygosity   | NA                         | <= 0.35        |
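
The marker-level thresholds can be sketched in Python as below. This is an illustration, not the code in `s2_marker_data_processing_use.R`. It assumes calls are already coded as 1 (homozygous reference), 0 (heterozygous), and -1 (homozygous alternate), with `None` for missing data, and that allele frequency and heterozygosity are computed over non-missing calls only. The function and dictionary names are hypothetical.

```python
# Thresholds taken from the table above.
TRAINING = {"min_maf": 0.05, "max_missing": 0.80, "max_het": 0.08}
CYCLE1 = {"min_maf": 0.02, "max_missing": 0.80, "max_het": 0.35}


def marker_stats(calls):
    """Compute minor allele frequency, missingness, and heterozygosity for one marker.

    `calls` must be a non-empty list of 1, 0, -1, or None.
    """
    observed = [g for g in calls if g is not None]
    n = len(observed)
    missing = (len(calls) - n) / len(calls)
    if n == 0:
        return {"maf": 0.0, "missing": missing, "het": 0.0}
    # Each call carries (1 - g) copies of the alternate allele: 0, 1, or 2.
    alt_freq = sum(1 - g for g in observed) / (2 * n)
    return {
        "maf": min(alt_freq, 1 - alt_freq),
        "missing": missing,
        "het": observed.count(0) / n,
    }


def keep_marker(calls, thresholds):
    """Return True if the marker passes all three thresholds."""
    s = marker_stats(calls)
    return (s["maf"] >= thresholds["min_maf"]
            and s["missing"] <= thresholds["max_missing"]
            and s["het"] <= thresholds["max_het"])


calls = [1, 1, 1, 0, -1, None, 1, 1, 1, 1]
print(marker_stats(calls))
print(keep_marker(calls, TRAINING), keep_marker(calls, CYCLE1))
```

The entry-level missingness and heterozygosity filters follow the same logic, applied across the markers of one entry rather than across the entries at one marker.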



## Imputation

Imputation is performed by the `s2_marker_data_imputation_use.R` script, which also contains
instructions for setting up directories and packages.

Imputation was performed first using a custom R script that relies on R/qtl and the nature
of the pedigree structure (of the 175 training population lines, 31 were used as parents
for the cycle 1 lines). This R script is bundled into a package that is available on my
GitHub page (https://github.com/neyhartj/fsimpute).

The second step of imputation involved using fastPHASE on the training population lines.
Since these lines are highly inbred and (mostly) unrelated, fastPHASE is a good choice. The
`s2_marker_data_imputation_use.R` script includes a sample of the shell commands used to
execute fastPHASE.

Lastly, any genotypes still missing after the first two steps were imputed using
the EM algorithm in the rrBLUP R package.

The imputation R script described above will result in two tab-separated files that are
included in this directory and described below.


## Final entry and marker count

There are 1288 lines (175 training population lines and 1113 cycle 1 lines). The training population
lines have names that take on the form "##[A-Z][A-Z]-##" and the cycle 1 lines have names
that start with "2MS14". There are 6361 bi-allelic SNP markers.
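
The naming conventions make it easy to assign entries to a population programmatically. The Python sketch below assumes that each `#` in the pattern stands for exactly one digit; the function name and the example names are made up for illustration and are not taken from the data.

```python
import re

# Assumption: '#' in "##[A-Z][A-Z]-##" means a single digit.
TRAINING_PATTERN = re.compile(r"^\d{2}[A-Z]{2}-\d{2}$")


def classify_entry(name):
    """Return 'training', 'cycle1', or 'unknown' based on the entry name."""
    if TRAINING_PATTERN.match(name):
        return "training"
    if name.startswith("2MS14"):
        return "cycle1"
    return "unknown"


# Hypothetical names that follow the two conventions.
for name in ["01AB-23", "2MS14_0001-001", "check_line"]:
    print(name, classify_entry(name))
```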

Marker genotypes are encoded as {-1, 0, 1}, where -1 is homozygous for the alternate allele,
0 is heterozygous, and 1 is homozygous for the reference allele.
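
Put another way, the code equals the number of reference alleles minus one. The short Python sketch below applies that rule to VCF-style `GT` strings; it is only an illustration of the encoding, the function name is hypothetical, and returning `None` for missing or non-bi-allelic calls is an assumption.

```python
def encode_gt(gt):
    """Map a diploid VCF GT string to 1, 0, or -1; return None if not encodable."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        return None  # missing ('./.') or not a bi-allelic diploid call
    # Number of reference ('0') alleles minus one gives 1, 0, or -1.
    return alleles.count("0") - 1


for gt in ["0/0", "0/1", "1|0", "1/1", "./."]:
    print(gt, encode_gt(gt))
```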


## Files in the directory (R scripts are not included)

File | Description
-----|------------
`2row_GBS_final_snps_newpos_renamed.vcf.gz` | VCF file generated during the read mapping and variant calling pipeline. It is compressed using `gzip`.
`UMN_S2_pedigree.csv` | Comma-separated value file containing the entry and pedigree information for entries that were genotyped.
`GBS_marker_info.csv` | Comma-separated value file containing GBS marker names, allele nucleotides, chromosome, physical position (bp) and genetic position (cM). The genetic positions were determined by linear interpolation using a different genetic map and estimated local genetic/physical distance ratios.
`S2_final_discrete_genos_hmp.txt` | Tab-separated file containing unimputed marker genotypes for 6351 SNPs on 1289 barley individuals.
`S2_final_imputed_genos_hmp.txt` | The same data as in `S2_final_discrete_genos_hmp.txt` but with imputed, continuous marker genotypes.
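
The linear interpolation mentioned for `GBS_marker_info.csv` can be illustrated with the Python sketch below. It is not the code used to build that file. It assumes a reference map for one chromosome with strictly increasing physical positions, and it assumes that markers outside the range of the reference map are placed by extending the cM/bp ratio of the nearest interval. The function name and the reference values are hypothetical.

```python
from bisect import bisect_right


def interpolate_cm(pos_bp, ref_bp, ref_cm):
    """Estimate the genetic position (cM) of a marker at physical position pos_bp.

    ref_bp and ref_cm are parallel lists for one chromosome; ref_bp must be
    strictly increasing and contain at least two positions.
    """
    if len(ref_bp) < 2 or len(ref_bp) != len(ref_cm):
        raise ValueError("need two or more reference markers with matching cM values")
    # Index of the right-hand flanking marker, clamped so that positions
    # outside the map reuse the first or last interval.
    i = min(max(bisect_right(ref_bp, pos_bp), 1), len(ref_bp) - 1)
    x0, x1 = ref_bp[i - 1], ref_bp[i]
    y0, y1 = ref_cm[i - 1], ref_cm[i]
    ratio = (y1 - y0) / (x1 - x0)  # local genetic/physical distance ratio (cM per bp)
    return y0 + ratio * (pos_bp - x0)


# Hypothetical reference map: three markers on one chromosome.
ref_bp = [1_000_000, 3_000_000, 7_000_000]
ref_cm = [0.0, 2.0, 4.0]
for pos in [2_000_000, 5_000_000, 8_000_000]:
    print(pos, interpolate_cm(pos, ref_bp, ref_cm))
```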


## Software version information


### Read alignment and SNP calling

Software version information is available on the README for the pipeline, located in the
GitHub repository: https://github.com/neyhartj/GBarleyS


### Filtering

The `s2_marker_data_processing_use.R` script was executed using R version 3.5.3 (Microsoft R Open
with Intel Math Kernel Library). Session information is below:

```
> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] forcats_0.4.0        stringr_1.4.0        dplyr_0.8.0.1        purrr_0.3.2          readr_1.3.1          tidyr_0.8.3         
 [7] tibble_2.1.1         ggplot2_3.1.1        tidyverse_1.2.1      RevoUtils_11.0.3     RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       cellranger_1.1.0 pillar_1.3.1     compiler_3.5.3   plyr_1.8.4       tools_3.5.3      packrat_0.5.0   
 [8] jsonlite_1.6     lubridate_1.7.4  gtable_0.3.0     nlme_3.1-139     lattice_0.20-38  pkgconfig_2.0.2  rlang_0.3.4     
[15] cli_1.1.0        rstudioapi_0.10  haven_2.1.0      withr_2.1.2      xml2_1.2.0       httr_1.4.0       generics_0.0.2  
[22] hms_0.4.2        grid_3.5.3       tidyselect_0.2.5 glue_1.3.1       R6_2.4.0         readxl_1.3.1     modelr_0.1.4    
[29] magrittr_1.5     backports_1.1.4  scales_1.0.0     rvest_0.3.3      assertthat_0.2.1 colorspace_1.4-1 stringi_1.4.3   
[36] lazyeval_0.2.2   munsell_0.5.0    broom_0.5.2      crayon_1.3.4   

```


### Imputation

The `s2_marker_data_imputation_use.R` script was executed using R version 3.5.3 (Microsoft R Open
with Intel Math Kernel Library). Session information is below:


```
> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rrBLUP_4.6           qtl_1.44-9           forcats_0.4.0        stringr_1.4.0        dplyr_0.8.0.1        purrr_0.3.2         
 [7] readr_1.3.1          tidyr_0.8.3          tibble_2.1.1         ggplot2_3.1.1        tidyverse_1.2.1      fsimpute_0.0.0.9000 
[13] RevoUtils_11.0.3     RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       cellranger_1.1.0 pillar_1.3.1     compiler_3.5.3   plyr_1.8.4       tools_3.5.3      packrat_0.5.0   
 [8] lubridate_1.7.4  jsonlite_1.6     nlme_3.1-139     gtable_0.3.0     lattice_0.20-38  pkgconfig_2.0.2  rlang_0.3.4     
[15] cli_1.1.0        rstudioapi_0.10  parallel_3.5.3   haven_2.1.0      withr_2.1.2      xml2_1.2.0       httr_1.4.0      
[22] generics_0.0.2   hms_0.4.2        grid_3.5.3       tidyselect_0.2.5 glue_1.3.1       R6_2.4.0         readxl_1.3.1    
[29] modelr_0.1.4     magrittr_1.5     backports_1.1.4  scales_1.0.0     rvest_0.3.3      assertthat_0.2.1 colorspace_1.4-1
[36] stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0    broom_0.5.2      crayon_1.3.4

```