This readme.txt file was generated on <20191114> by

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset

VCF file of SNP data of 556 isolates of the wheat leaf rust fungus, Puccinia triticina, from 11 world-wide regions

2. Author Information

  Principal Investigator Contact Information
  Name: James Kolmer
  Institution: USDA-ARS Cereal Disease Laboratory; University of Minnesota
  Address:
  Email: JKolmer@umn.edu; jim.kolmer@usda.gov
  ORCID:

  Associate or Co-investigator Contact Information
  Name: Adam Herman
  Institution:
  Address:
  Email: aherman@umn.edu
  ORCID:

  Associate or Co-investigator Contact Information
  Name: Maria Ordonez
  Institution:
  Address:
  Email:
  ORCID:

  Associate or Co-investigator Contact Information
  Name: Silvia German
  Institution:
  Address:
  Email:
  ORCID:

  Associate or Co-investigator Contact Information
  Name: Alexy Morgounov
  Institution:
  Address:
  Email:
  ORCID:

  Associate or Co-investigator Contact Information
  Name: Zack Pretorius
  Institution:
  Address:
  Email:
  ORCID:

  Associate or Co-investigator Contact Information
  Name: Botma Visser
  Institution:
  Address:
  Email:
  ORCID:

  Associate or Co-investigator Contact Information
  Name: Yehoshoa Anikster
  Institution:
  Address:
  Email:
  ORCID:

  Associate or Co-investigator Contact Information
  Name: Maricelis Acevedo
  Institution:
  Address:
  Email:
  ORCID:

3. Date of data collection: 2018-08-02 to 2018-08-02

4. Geographic location of data collection (where was data collected?): 11 wheat-growing regions worldwide

5. Information about funding sources that supported the collection of the data: USDA ARS Cereal Disease Lab

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/

2.
Links to publications that cite or use the data:
Endemic and panglobal genetic groups, and divergence of host-associated forms, in world-wide collections of the wheat leaf rust fungus Puccinia triticina as determined by genotype by sequencing. Revisions submitted to Heredity.

3. Recommended citation for the data:
Kolmer, J. A.; Herman, A. C.; Ordonez, M. E.; German, S.; Morgounov, A.; Pretorius, Z.; Visser, B.; Anikster, Y.; Acevedo, M. (2019). Scripts and Files for Endemic and panglobal genetic groups, and divergence of host-associated forms, in world-wide collections of the wheat leaf rust fungus Puccinia triticina as determined by genotype by sequencing. Retrieved from the Data Repository for the University of Minnesota, https://doi.org/10.13020/0f8c-k469.

Manuscript submitted to Heredity - 2019.

---------------------
DATA & FILE OVERVIEW
---------------------

1. File List

  A. Filename: per_locus_fasta.tar.gz
     Short description: Tarball of 8,655 per-locus fasta files
     Size: 82.53 MB
     Format: fa

  B. Filename: per_locus_rad_pi.py
     Short description: Pairwise diversity calculator
     Size: 2.311 KB
     Format: text/x-python

  C. Filename: names4picalc.txt
     Short description: Names file for pairwise diversity calculator
     Size: 22.27 KB
     Format: Text file

  D. Filename: populations.snps.genome.coordinate.vcf
     Short description: VCF file of 556 Puccinia triticina isolates
     Size: 107.3 MB
     Format: VCF

  E. Filename: populations.snps.genome.coordinate.vcf.gz.tbi
     Short description: tabix index for VCF
     Size: 43.12 KB
     Format: Unknown

  F. Filename: GBS Virulence data-binary(2).xlsx
     Short description: Virulence data of Puccinia triticina isolates to 20 Thatcher lines of wheat
     Size: 5.39 KB
     Format: Microsoft Excel 2007

  G. Filename: Readme_208700_AH_JK.txt
     Short description: Readme file for dataset
     Size: 7.932 KB
     Format: Text

2.
Relationship between files:
  - GBS SNP calls in Variant Call Format
  - GBS Haplotype sequence in fasta format (tarballed directory)
  - Python script for calculating average number of pairwise differences using the fasta haplotypes as input
  - Names dependency file for python script
  - Virulence data of same isolates

3. Additional related data collected that was not included in the current data package:

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

We provide variant calls in VCF and fasta format. Fasta formatted data were used with the deposited python script to calculate the average number of pairwise differences between sequences.

-----------------------------------------
NOTE REGARDING .VCF FILES (from https://faculty.washington.edu/browning/intro-to-vcf.html)
-----------------------------------------

Variant Call Format (VCF) is a text file format for storing marker and genotype data. This short tutorial describes how Variant Call Format encodes data for single nucleotide variants.

Every VCF file has three parts in the following order:
  - Meta-information lines (lines beginning with "##").
  - One header line (line beginning with "#CHROM").
  - Data lines contain marker and genotype data (one variant per line). A data line is called a VCF record.

Each VCF record has the same number of tab-separated fields as the header line. The symbol "." is used to denote missing data.

The first nine columns of the header line and data lines describe the variants:

  CHROM   the chromosome.
  POS     the genome coordinate of the first base in the variant. Within a chromosome, VCF records are sorted in order of increasing position.
  ID      a semicolon-separated list of marker identifiers.
  REF     the reference allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC")
  ALT     the alternate allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC").
          If there is more than one alternate allele, the field should be a comma-separated list of alternate alleles.
  QUAL    probability that the ALT allele is incorrectly specified, expressed on the phred scale (-10 * log10(probability)).
  FILTER  either "PASS" or a semicolon-separated list of failed quality control filters.
  INFO    additional information (no white space, tabs, or semi-colons permitted).
  FORMAT  colon-separated list of data subfields reported for each sample.

-----------------------------------------
NOTE REGARDING .FASTQ FILES (from https://help.basespace.illumina.com/articles/descriptive/fastq-files/)
-----------------------------------------

FASTQ files are named with the sample name and the sample number, which is a numeric assignment based on the order in which the sample is listed in the sample sheet.

Example: Data\Intensities\BaseCalls\samplename_S1_L001_R1_001.fastq.gz

  A. samplename - The sample name provided in the sample sheet. If a sample name is not provided, the file name includes the sample ID, which is a required field in the sample sheet and must be unique.
  B. S1 - The sample number based on the order that samples are listed in the sample sheet, starting with 1. In this example, S1 indicates that this sample is the first sample listed in the sample sheet.
  C. L001 - The lane number.
  D. R1 - The read. In this example, R1 means Read 1. For a paired-end run, there is at least one file with R2 in the file name for Read 2.
  E. 001 - The last segment is always 001.

Each entry in a FASTQ file consists of four lines:

  A. Sequence identifier
  B. Sequence
  C. Quality score identifier line (consisting only of a +)
  D. Quality score

---------------------
DATA & FILE OVERVIEW
---------------------

1. File List

  A. Filename: GBS virulence data binary.xlsx
     Short description: Virulence data
     Size: 56.17 KB
     Format: Excel

DATA-SPECIFIC INFORMATION FOR: GBS virulence binary.xlsx

1. Number of variables: 20 lines of Thatcher wheat with single genes for leaf rust resistance

2.
Number of cases/rows: 556 isolates of Puccinia triticina

3. Missing data codes: no missing data

4. Variable list
     1 = virulent
     2 = avirulent
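
-----------------------------------------
EXAMPLE: READING VCF RECORDS WITH PYTHON (illustrative sketch, not part of the deposited scripts)
-----------------------------------------

The VCF layout described in the note on .VCF files (meta-information lines, one header line, then tab-separated records whose first nine columns are fixed) can be read with a few lines of standard-library Python. The sketch below is a minimal illustration under stated assumptions, not the method used to produce this dataset. The function name parse_vcf, the isolate names, the scaffold names, and the GT and DP subfields in the example text are all hypothetical; the subfields actually present in populations.snps.genome.coordinate.vcf are declared in that file's own FORMAT column.

```python
import io

# The nine fixed columns described in the VCF note above.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
                 "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf(handle):
    """Yield one dict per VCF record read from an open text handle.

    Hypothetical helper for illustration only.
    """
    sample_names = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Header line: every column after the fixed nine is a sample.
            sample_names = fields[len(FIXED_COLUMNS):]
            continue
        record = dict(zip(FIXED_COLUMNS, fields))
        record["POS"] = int(record["POS"])
        # More than one alternate allele is stored as a comma-separated list.
        record["ALT"] = record["ALT"].split(",")
        # FORMAT names the colon-separated subfields given for each sample.
        keys = record.get("FORMAT", "").split(":")
        record["samples"] = {
            name: dict(zip(keys, value.split(":")))
            for name, value in zip(sample_names, fields[len(FIXED_COLUMNS):])
        }
        yield record


# Tiny made-up VCF text; names and values are assumptions for illustration.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "isolate_A", "isolate_B"]),
    "\t".join(["scaffold_1", "101", ".", "A", "G", ".", "PASS",
               "NS=2", "GT:DP", "0/1:12", "./.:0"]),
    "\t".join(["scaffold_1", "250", ".", "C", "T,G", ".", "PASS",
               "NS=2", "GT:DP", "1/2:7", "0/0:9"]),
]) + "\n"

records = list(parse_vcf(io.StringIO(EXAMPLE_VCF)))
for rec in records:
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"],
          rec["samples"]["isolate_A"]["GT"])
```

To apply the same function to the deposited file, pass an open text handle instead of the in-memory example, for instance open("populations.snps.genome.coordinate.vcf"). Because the function is a generator, the 107.3 MB file is read one record at a time and is never loaded into memory whole.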