This readme.txt file was generated on <20200724> by

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset

Why do coastal seeds fail? Evidence of local adaptation of northern red oak (Quercus rubra) in Minnesota coastal forests - Genomics and Geospatial Data

2. Author Information

Maria Jose Gomez Quijano (gomez312@d.umn.edu)
University of Minnesota Duluth
Gross Lab
University of Minnesota Duluth
Etterson Lab

  Principal Investigator Contact Information
        Name: Briana Gross
        Institution: University of Minnesota Duluth
        Address: 252C - 1035 Kirby Drive Swenson Science Building, Duluth, MN 55812
        Email: blgross@d.umn.edu

  Associate or Co-investigator Contact Information
        Name: Julie Etterson
        Institution: University of Minnesota Duluth
        Address: 153B - 1035 Kirby Drive Swenson Science Building, Duluth, MN 55812
        Email: jetterso@d.umn.edu

3. Date of data collection: 2020 08 09

4. Geographic location of data collection (where was data collected?): University of Minnesota Duluth, Gross Lab

5. Information about funding sources that supported the collection of the data:

This data share was prepared by Maria Jose Gomez Quijano, Briana L. Gross and Julie R. Etterson using Federal funds under award NA17NOS4190062 from the Coastal Zone Management Act of 1972, as amended, administered by the Office for Coastal Management, National Oceanic and Atmospheric Administration (NOAA), U.S. Department of Commerce provided to the Minnesota Department of Natural Resources (DNR) for Minnesota’s Lake Superior Coastal Program. The statements, findings, conclusions, and recommendations are those of the author(s) and do not necessarily reflect the views of NOAA’s Office of Coastal Management, the U.S. Department of Commerce, or the Minnesota DNR.

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1.
Licenses/restrictions placed on the data: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication

2. Recommended citation for the data:

Gomez Quijano, Maria Jose; Gross, Briana L.; Etterson, Julie R. (2020). Why do coastal seeds fail? Evidence of local adaptation of northern red oak (Quercus rubra) in Minnesota coastal forests - Genomics and Geospatial Data. Retrieved from the Data Repository for the University of Minnesota, http://hdl.handle.net/11299/212844.

---------------------
DATA & FILE OVERVIEW
---------------------

1. File List

   A. Filename: Gross2_Project_012_Sample_Information.csv
      Short description: This file contains the sample information for the data files “Gross2_Project_012_Coastal.zip”, “Gross2_Project_012_Inland.zip”, “Gross2_Project_012_Interior.zip” and “Gross2_Project_012_Pin Oak.zip”. It contains the sample name, the raw number of reads generated by Illumina sequencing, the species name, the region where the sample was collected, and the name of the population each sample belongs to.

   B. Filename: Geospatial_data_Oak_project.csv
      Short description: This dataset contains the geospatial location of each of the 33 populations that were collected. All of the samples are from within the state of Minnesota. Each entry in the dataset refers to one population. For each population the following information is given: species name, MNDNR releve site number, name of the collection site, population name, region name, latitude and longitude. Populations were identified using the Minnesota Department of Natural Resources releve data. Coastal populations were located 0-10 miles from the shore of Lake Superior, inland populations 11-50 miles from the shore, and interior populations 51-100 miles from the shore.

   C.
Filename: Freebayes_variants.vcf
      Short description: This data file contains the single nucleotide polymorphisms (SNPs) for 412 Northern red oak (Quercus rubra) and Northern pin oak (Quercus ellipsoidalis) individuals. Genotyping by sequencing was performed by the University of Minnesota Genomics Center (UMGC), and SNPs/variants were called by the UMGC using the program Freebayes.

   D. Filename: ipyrad_variants.vcf
      Short description: This data file contains the single nucleotide polymorphisms (SNPs) for 412 Northern red oak (Quercus rubra) and Northern pin oak (Quercus ellipsoidalis) individuals. Genotyping by sequencing was performed by the University of Minnesota Genomics Center (UMGC), and SNPs/variants were called by the Gross Lab using the program ipyRAD.

   E. Filename: Gross2_Project_012_Coastal.zip
      Short description: This data file contains the .fastq files of raw Illumina reads for 114 Q. rubra individuals from 10 coastal populations. Genotyping by sequencing using the enzymes BamHI and NsiI was performed at the UMGC. The number of reads per sample can be found in the “Gross2_Project_012_Sample_information.csv” file; the population and region information can be found in the “Geospatial_data_Oak_project.csv” file.

   F. Filename: Gross2_Project_012_Inland.zip
      Short description: This data file contains the .fastq files of raw Illumina reads for 131 Q. rubra individuals from 10 inland populations. Genotyping by sequencing using the enzymes BamHI and NsiI was performed at the UMGC. The number of reads per sample can be found in the “Gross2_Project_012_Sample_information.csv” file; the population and region information can be found in the “Geospatial_data_Oak_project.csv” file.

   G. Filename: Gross2_Project_012_Interior.zip
      Short description: This data file contains the .fastq files of raw Illumina reads for 129 Q. rubra individuals from 10 interior populations.
      Genotyping by sequencing using the enzymes BamHI and NsiI was performed at the UMGC. The number of reads per sample can be found in the “Gross2_Project_012_Sample_information.csv” file; the population and region information can be found in the “Geospatial_data_Oak_project.csv” file.

   H. Filename: Gross2_Project_012_Pin Oak.zip
      Short description: This data file contains the .fastq files of raw Illumina reads for 38 Q. ellipsoidalis individuals from 3 populations. Genotyping by sequencing using the enzymes BamHI and NsiI was performed at the UMGC. The number of reads per sample can be found in the “Gross2_Project_012_Sample_information.csv” file; the population and region information can be found in the “Geospatial_data_Oak_project.csv” file.

   I. Filename: Final_Report_20200429.docx
      Short description: This file contains the final report presented to the funding agency, in which the results are summarized.

2. Relationship between files:

All files named “Gross2_Project_012_*” belong to a single dataset that was analyzed together; these files contain the raw reads as well as the collection information for each sample. The “.vcf” files are analysis outputs derived from the .fastq files in the “Gross2_Project_012_*” dataset, in which SNPs were called using the Freebayes and ipyRAD programs. The geospatial data provides information on the collection site of every sample in the “Gross2_Project_012_*” dataset. Populations were identified using the Minnesota Department of Natural Resources releve data. Coastal populations were located 0-10 miles from the shore of Lake Superior, inland populations 11-50 miles from the shore, and interior populations 51-100 miles from the shore. The final report contains a summary of the data analysis and results.

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1.
Description of methods used for collection/generation of data:

Populations were identified using the Minnesota Department of Natural Resources releve data. Coastal populations were located 0-10 miles from the shore of Lake Superior, inland populations 11-50 miles from the shore, and interior populations 51-100 miles from the shore.

Genomic DNA was extracted using a modified CTAB protocol.
- Sork Lab:Protocols. (2018). OpenWetWare. Available at: https://openwetware.org/mediawiki/index.php?title=Sork_Lab:Protocols&oldid=1037048.

2. Methods for processing the data:

SNP calling by the UMGC was done using Freebayes.
- Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012

SNP calling by the Gross Lab was done using ipyRAD.
- Eaton DAR & Overcast I. “ipyrad: Interactive assembly and analysis of RADseq datasets.” Bioinformatics (2020).

3. People involved with sample collection, processing, analysis and/or submission:

Maria Jose Gomez Quijano (gomez312@d.umn.edu)

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: Gross2_project_012_Sample_information.csv
-----------------------------------------

1. Number of variables: 5

2. Number of cases/rows: 412

3. Variable List

    A. Name: Sample Name
       Description: Sample ID, consisting of the two-letter population code and the sample number.

    B. Name: Total Reads
       Description: Number of raw reads generated by Illumina sequencing for each sample.

    C. Name: Species name
       Description: Species to which each sample belongs.

    D. Name: Region name
       Description: Region to which each sample belongs. Coastal = 0-10 miles from the shore of Lake Superior, Inland = 11-50 miles from the shore of Lake Superior, Interior = 51-100 miles from the shore of Lake Superior.

    E.
Name: Population name
       Description: Two-letter code of the population to which each individual belongs.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: Geospatial_data_Oak_project.csv
-----------------------------------------

1. Number of variables: 7

2. Number of cases/rows: 33

3. Missing data codes:
        Code/symbol : N/A
        Definition: no data available

4. Variable List

    A. Name: Species name
       Description: Species to which each sample belongs.

    B. Name: MNDNR releve site number
       Description: ID number of the site where each population was collected, taken from the Minnesota Department of Natural Resources releve data.

    C. Name: Collection site name
       Description: Name of the site where the samples were collected.

    D. Name: Population name
       Description: Two-letter code of the population to which each individual belongs.

    E. Name: Region name
       Description: Region to which each sample belongs. Coastal = 0-10 miles from the shore of Lake Superior, Inland = 11-50 miles from the shore of Lake Superior, Interior = 51-100 miles from the shore of Lake Superior.

    F. Name: Latitude
       Description: Latitude of the site where the samples were collected.

    G. Name: Longitude
       Description: Longitude of the site where the samples were collected.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: Freebayes_variants.vcf.zip
-----------------------------------------

1. Number of variables:

2. Number of cases/rows: 140,785 loci

3. Missing data codes:
        Code/symbol : 0   N/A
        Definition: missing data

4. Variable List

VCF variables are as follows:

##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO=

Source: https://samtools.github.io/hts-specs/VCFv4.1.pdf

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: Ipyrad_variants.vcf.zip
-----------------------------------------

1.
Number of variables:

2. Number of cases/rows: 225,954

3. Missing data codes:
        Code/symbol : 0   N/A
        Definition: missing data

4. Variable List

VCF variables are as follows:

##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO= ##INFO=

Source: https://samtools.github.io/hts-specs/VCFv4.1.pdf

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: Gross2_Project_012_XXXX.zip
-----------------------------------------

1. Number of cases:
        Gross2_Project_012_Coastal.zip     114 fastqs
        Gross2_Project_012_Inland.zip      131 fastqs
        Gross2_Project_012_Interior.zip    129 fastqs
        Gross2_Project_012_Pin Oak.zip      38 fastqs

2. Missing data codes:
        Code/symbol : -   N/A
        Definition: missing data

3. Variable List

    A. Name: Phred quality score
       Description: Quality score for each base of the sequence. For a description of each quality score, please visit: https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm

    B. Name: Sample information
       Description: Information for each sequencing run, with barcodes included in the header. For more information visit: https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html

    C. Name: Sequence
       Description: The sequence (the base calls: A, C, T, G and N).
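
As an illustration of how the three .fastq variables listed above fit together, the following Python sketch reads four-line FASTQ records and converts the quality string into integer Phred scores. It assumes the Phred+33 ASCII encoding and the standard four-line record layout; neither is stated in this readme, so both should be checked against the files. The function names and the example record are hypothetical and are not taken from the dataset.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 (Phred+33) is an assumption; confirm it for these files.
    """
    return [ord(ch) - offset for ch in quality_line.rstrip("\r\n")]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def read_fastq(lines):
    """Yield (header, sequence, scores) from an iterable of FASTQ lines.

    Assumes four lines per record: header, sequence, '+', quality.
    """
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it).strip()
        next(it)  # '+' separator line, ignored
        qual = next(it)
        yield header.strip(), seq, phred_scores(qual)


# Hypothetical record for illustration only; not taken from the dataset.
example = ["@sample_read_1", "ACGTN", "+", "II5+!"]
for header, seq, scores in read_fastq(example):
    print(header, seq, scores)
```

To apply the same functions to a real file, an open file handle (for example from open() or gzip.open() in text mode) can be passed to read_fastq in place of the example list.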