# SNP Genotyping Data for the Barley Population in "Registration of the S2MET Barley Mapping Population for Multi-Environment Genomewide Selection"

This folder contains marker data generated via genotyping-by-sequencing.

## Read alignment and SNP calling

Single-end reads of 100-150 bp were generated using an Illumina machine. Reads were first cleaned and quality-checked using the FastX toolkit and cutadapt, then aligned to the physical barley reference genome (v2) using Bowtie2. SNPs were called using GATK (v. 3.4-6) according to best practices. The scripts used in the read alignment and SNP calling pipeline are available at the GitHub repository: https://github.com/neyhartj/GBarleyS.

Variants were filtered according to the following criteria:

Parameter             | Filter
----------------------|-----------------------
mapping quality (MQ)  | >= 40
genotype quality (GQ) | >= 40
read depth (DP)       | >= 5

This filtering resulted in a VCF with ~250,000 variants. This VCF, in a compressed format, is included in this directory under the name `2row_GBS_final_snps_newpos_renamed.vcf.gz`. The file should be decompressed prior to conversion to a HapMap file.

## Conversion of VCF to HapMap file

The VCF file was converted to a HapMap file format for use in the rrBLUP R package. This is a tab-separated file where the first 4 columns describe the marker name, reference/alternate alleles, the chromosome, and the physical position on that chromosome.

To convert the VCF file to the HapMap file, the Python script `vcf2hapmap.py` was used. This script is available from the GitHub repository https://github.com/neyhartj/bioinformatic-utils. It is executed as follows:

```
python vcf2hapmap.py -i 2row_GBS_final_snps_newpos_renamed.vcf -o 2row_GBS_final_snps_newpos_renamed -r
```

Executing this script results in the tab-separated HapMap file called `2row_GBS_final_snps_newpos_renamed_hmp.txt`.

## Filtering

Marker data filtering is accomplished using the `s2_marker_data_processing_use.R` script.
Instructions for setting up directories and packages are contained within the script.

Markers and entries were filtered on a population basis:

|Parameter            | Filter (Training Population) | Filter (Cycle 1)   |
|---------------------|------------------------------|--------------------|
|minor allele freq    | >= 0.05                      | >= 0.02            |
|snp missingness      | <= 0.80                      | <= 0.80            |
|snp heterozygosity   | <= 0.08                      | <= 0.35            |
|entry missingness    | <= 0.80                      | <= 0.80            |
|entry heterozygosity | NA                           | <= 0.35            |

## Imputation

Imputation is accomplished using the `s2_marker_data_imputation_use.R` script. Instructions for setting up directories and packages are contained within the script.

Imputation was performed in three steps. The first step used a custom R script that relies on R/qtl and on the pedigree structure (of the 175 training population lines, 31 were used as parents for the cycle 1 lines). This R script is bundled into a package that is available on my GitHub page (https://github.com/neyhartj/fsimpute).

The second step used fastPHASE on the training population lines. Since these lines are highly inbred and (mostly) unrelated, fastPHASE is well suited to them. The `s2_marker_data_imputation_use.R` script includes a sample of the shell commands used to execute fastPHASE.

Lastly, any data points that were not imputed in the first two steps were imputed using the EM algorithm in the rrBLUP R package.

The imputation R script described above produces two tab-separated files that are included in this directory and described below.

## Final entry and marker count

There are 1288 lines (175 training population lines and 1113 cycle 1 lines). The training population lines have names that take on the form "##[A-Z][A-Z]-##" and the cycle 1 lines have names that start with "2MS14".

There are 6361 bi-allelic SNP markers.

Marker genotypes are encoded as {-1, 0, 1}, where -1 is homozygous for the alternate allele, 0 is heterozygous, and 1 is homozygous for the reference allele.
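The naming scheme, the {-1, 0, 1} coding, and the marker thresholds in the Filtering table can be illustrated with a small Python sketch. This is not the `s2_marker_data_processing_use.R` script. The function names are hypothetical, "#" in the name pattern is assumed to mean a single digit, missing calls are assumed to be represented as `None`, and the filters are assumed to be simple per-marker thresholds.

```python
import re

# Assumption: "#" in the naming scheme "##[A-Z][A-Z]-##" stands for one digit.
TP_NAME = re.compile(r"^\d{2}[A-Z]{2}-\d{2}$")

# Marker thresholds copied from the Filtering table in this README.
MARKER_FILTERS = {
    "training": {"min_maf": 0.05, "max_missing": 0.80, "max_het": 0.08},
    "cycle1": {"min_maf": 0.02, "max_missing": 0.80, "max_het": 0.35},
}


def classify_entry(name):
    """Assign an entry to a population based on its name (hypothetical helper)."""
    if name.startswith("2MS14"):
        return "cycle1"
    if TP_NAME.match(name):
        return "training"
    return "unknown"


def marker_stats(calls):
    """Summarize one marker; calls are -1, 0, 1, or None for a missing call."""
    if not calls:
        raise ValueError("calls must contain at least one entry")
    observed = [g for g in calls if g is not None]
    n = len(observed)
    missing = 1 - n / len(calls)
    if n == 0:
        return {"maf": None, "missing": missing, "het": None}
    # With the {-1, 0, 1} coding, g + 1 is the number of reference alleles (0, 1, 2).
    ref_freq = sum(g + 1 for g in observed) / (2 * n)
    maf = min(ref_freq, 1 - ref_freq)
    het = sum(1 for g in observed if g == 0) / n
    return {"maf": maf, "missing": missing, "het": het}


def passes_marker_filter(stats, population):
    """Apply the README thresholds for the given population to one marker."""
    if stats["maf"] is None:
        return False
    limits = MARKER_FILTERS[population]
    return (
        stats["maf"] >= limits["min_maf"]
        and stats["missing"] <= limits["max_missing"]
        and stats["het"] <= limits["max_het"]
    )


# Made-up calls for a single marker across ten entries.
example = marker_stats([1, 1, 1, 0, -1, None, 1, 1, 1, 1])
print(example, passes_marker_filter(example, "cycle1"))
```

Because the heterozygote is coded as 0 and the homozygotes as -1 and 1, adding 1 to each call gives the reference allele count directly, which keeps the allele frequency calculation to a single pass over the observed calls.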
## Files in the directory (R scripts are not included):

File                                      | Description
------------------------------------------|--------------------------------------------------------------------
2row_GBS_final_snps_newpos_renamed.vcf.gz | VCF file generated during the read mapping and variant calling pipeline. It is compressed using `gzip`.
UMN_S2_pedigree.csv                       | Comma-separated value file containing the entry and pedigree information for entries that were genotyped.
GBS_marker_info.csv                       | Comma-separated value file containing GBS marker names, allele nucleotides, chromosome, physical position (bp) and genetic position (cM). The genetic positions were determined by linear interpolation using a different genetic map and estimated local genetic/physical distance ratios.
S2_final_discrete_genos_hmp.txt           | Tab-separated file containing unimputed marker genotypes for 6351 SNPs on 1289 barley individuals.
S2_final_imputed_genos_hmp.txt            | The same data as in "S2_final_discrete_genos_hmp.txt" but with imputed, continuous marker genotypes.

## Software version information

### Read alignment and SNP calling

Software version information is available on the README for the pipeline, located in the GitHub repository: https://github.com/neyhartj/GBarleyS

### Filtering

The `s2_marker_data_processing_use.R` script was executed using R version 3.5.3 (Microsoft R Open with Intel Math Kernel Library).
Session information is below:

```
> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252  LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C  LC_TIME=English_United States.1252

attached base packages:
[1] stats  graphics  grDevices  utils  datasets  methods  base

other attached packages:
 [1] forcats_0.4.0  stringr_1.4.0  dplyr_0.8.0.1  purrr_0.3.2  readr_1.3.1  tidyr_0.8.3
 [7] tibble_2.1.1  ggplot2_3.1.1  tidyverse_1.2.1  RevoUtils_11.0.3  RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1  cellranger_1.1.0  pillar_1.3.1  compiler_3.5.3  plyr_1.8.4  tools_3.5.3  packrat_0.5.0
 [8] jsonlite_1.6  lubridate_1.7.4  gtable_0.3.0  nlme_3.1-139  lattice_0.20-38  pkgconfig_2.0.2  rlang_0.3.4
[15] cli_1.1.0  rstudioapi_0.10  haven_2.1.0  withr_2.1.2  xml2_1.2.0  httr_1.4.0  generics_0.0.2
[22] hms_0.4.2  grid_3.5.3  tidyselect_0.2.5  glue_1.3.1  R6_2.4.0  readxl_1.3.1  modelr_0.1.4
[29] magrittr_1.5  backports_1.1.4  scales_1.0.0  rvest_0.3.3  assertthat_0.2.1  colorspace_1.4-1  stringi_1.4.3
[36] lazyeval_0.2.2  munsell_0.5.0  broom_0.5.2  crayon_1.3.4
```

### Imputation

The `s2_marker_data_imputation_use.R` script was executed using R version 3.5.3 (Microsoft R Open with Intel Math Kernel Library).
Session information is below:

```
> sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252  LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C  LC_TIME=English_United States.1252

attached base packages:
[1] stats  graphics  grDevices  utils  datasets  methods  base

other attached packages:
 [1] rrBLUP_4.6  qtl_1.44-9  forcats_0.4.0  stringr_1.4.0  dplyr_0.8.0.1  purrr_0.3.2
 [7] readr_1.3.1  tidyr_0.8.3  tibble_2.1.1  ggplot2_3.1.1  tidyverse_1.2.1  fsimpute_0.0.0.9000
[13] RevoUtils_11.0.3  RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1  cellranger_1.1.0  pillar_1.3.1  compiler_3.5.3  plyr_1.8.4  tools_3.5.3  packrat_0.5.0
 [8] lubridate_1.7.4  jsonlite_1.6  nlme_3.1-139  gtable_0.3.0  lattice_0.20-38  pkgconfig_2.0.2  rlang_0.3.4
[15] cli_1.1.0  rstudioapi_0.10  parallel_3.5.3  haven_2.1.0  withr_2.1.2  xml2_1.2.0  httr_1.4.0
[22] generics_0.0.2  hms_0.4.2  grid_3.5.3  tidyselect_0.2.5  glue_1.3.1  R6_2.4.0  readxl_1.3.1
[29] modelr_0.1.4  magrittr_1.5  backports_1.1.4  scales_1.0.0  rvest_0.3.3  assertthat_0.2.1  colorspace_1.4-1
[36] stringi_1.4.3  lazyeval_0.2.2  munsell_0.5.0  broom_0.5.2  crayon_1.3.4
```