This codebook.txt file was generated on 2018-07-11 by wilsonkm.

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset

   Maize Mo17 SNPs

2. Author Information

   Principal Investigator Contact Information
      Name: Peng Zhou
      Institution: University of Minnesota
      Address:
      Email: zhoux379@umn.edu

3. Date of data collection: N/A

4. Geographic location of data collection: N/A

5. Information about funding sources that supported the collection of the data: N/A

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data: N/A

2. Links to publications that cite or use the data:

3. Links to other publicly accessible locations of the data: N/A

4. Links/relationships to ancillary data sets: N/A

5. Was data derived from another source? N/A

6. Recommended citation for the data:

   Zhou, Peng. (2018). Maize Mo17 SNPs. Retrieved from the Data Repository for the University of Minnesota, http://hdl.handle.net/11299/198135.

---------------------
DATA & FILE OVERVIEW
---------------------

File List

   A. Filename: Mo17.vcf
      Short description: In VCF format version 4.2.

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data:

Genome resequencing of Mo17 was done as part of the bioMAP project (REF). 477 million 100 bp paired-end reads were generated for Mo17, giving an average of 95x coverage. Reads were first trimmed with Trimmomatic (Bolger et al. 2014) and mapped to the maize B73 genome AGPv4 (Jiao et al. 2017) using BWA-MEM (Li and Durbin 2010). PCR duplicates were marked and removed using GATK (McKenna et al. 2010). Variants were called with GATK HaplotypeCaller and filtered with separate criteria for SNPs and InDels. A variant was retained only if it satisfied all of the conditions for its type:

   SNPs:   QD > 2, FS < 60, MQ > 40, MQRankSum > -12.5, ReadPosRankSum > -8, SOR < 4
   InDels: QD > 2, FS < 200, ReadPosRankSum > -20, SOR < 10
In addition, variants located in regions with unusually high coverage (DP > Mean + 2*SD) and heterozygous calls (GT == '0/1') were removed. The final variant file, containing 8.04 million variants of which 164 thousand are CDS variants, was deposited in DRUM (the Data Repository for the University of Minnesota).

2. Instrument- or software-specific information needed to interpret the data:

Software and tools used in the generation of these data include Trimmomatic, BWA-MEM, GATK MarkDuplicates, GATK HaplotypeCaller, and GATK VariantFiltration.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: Mo17.vcf
-----------------------------------------

The headers present within the file provide additional metadata. Meta-information lines begin with ##, and the column header line begins with a single #. Data lines contain genotype data with one variant per line. For more information on the VCF format, please see the available documentation for VCFv4.2 at https://github.com/samtools/hts-specs.

1a. Meta-information line(s): 7

   ##fileformat: gives the VCF format version number.
   ##GATKCommandLine.HaplotypeCaller: line added by GATK (Genome Analysis Toolkit) that records the parameters used to run the program that produced the VCF file.
   ##reference: gives the path of the reference .fasta file.
   ##INFO: describes the fields that appear in the INFO column.
   ##FORMAT: describes the genotype fields.
   ##FILTER: describes the filters that have been applied to the data.
   ##contig: describes the contigs referred to in the VCF file.

1b. Header(s): 1

   The header line names 8 fixed columns.

   #CHROM: chromosome
   POS: position
   ID: identifier
   REF: reference base(s)
   ALT: alternate base(s)
   QUAL: quality
   FILTER: filter status
   INFO: additional information

1c. Data line(s): 8,041,297

2. Total number of lines: 8,041,305

3. Missing data codes: N/A
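
The line structure described above (meta-information lines, one header line, and data lines with 8 fixed columns) can be checked with a short script. The following is a minimal Python sketch, not part of the original workflow. The function names (classify_line, count_lines, parse_fixed_columns) and the three example lines are hypothetical and are not taken from Mo17.vcf. It assumes tab-separated columns, as specified by VCFv4.2.

```python
from collections import Counter

# The 8 fixed column names listed in section 1b of this codebook.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def classify_line(line):
    """Return 'meta', 'header', or 'data' for one line of a VCF file."""
    if line.startswith("##"):
        return "meta"
    if line.startswith("#"):
        return "header"
    return "data"


def count_lines(lines):
    """Count meta-information, header, and data lines, skipping blank lines."""
    return Counter(classify_line(line) for line in lines if line.strip())


def parse_fixed_columns(line):
    """Split a data line on tabs and map the first 8 fields to their names."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])  # POS is an integer position
    return record


# Hypothetical three-line example, not taken from Mo17.vcf.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1234\t.\tA\tG\t50.0\tPASS\tDP=90\n",
]
counts = count_lines(example)
print(counts["meta"], counts["header"], counts["data"])
print(parse_fixed_columns(example[2])["ALT"])
```

To apply the same check to the deposited file, open Mo17.vcf as a text file and pass the file handle to count_lines. If the file matches this codebook, the three counts should correspond to the values given in 1a, 1b, and 1c.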