This readme.txt file was generated on 2020-07-24 by Maria Jose Gomez Quijano

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset
Why do coastal seeds fail? Evidence of local adaptation of northern red oak (Quercus rubra) in Minnesota coastal forests - Genomics and Geospatial Data

2. Author Information
Maria Jose Gomez Quijano (gomez312@d.umn.edu)
University of Minnesota Duluth Gross Lab
University of Minnesota Duluth Etterson Lab

  Principal Investigator Contact Information
        Name: Briana Gross
        Institution: University of Minnesota Duluth
        Address: 252C - 1035 Kirby Drive Swenson Science Building, Duluth, MN 55812
        Email: blgross@d.umn.edu

  Associate or Co-investigator Contact Information
        Name: Julie Etterson
        Institution: University of Minnesota Duluth
        Address: 153B - 1035 Kirby Drive Swenson Science Building, Duluth, MN 55812
        Email: jetterso@d.umn.edu


3. Date of data collection:  2020 08 09

4. Geographic location of data collection (where was data collected?):
University of Minnesota Duluth, Gross Lab 

5. Information about funding sources that supported the collection of the data:

This data share was prepared by Maria Jose Gomez Quijano, Briana L. Gross and Julie R. Etterson using Federal funds under award NA17NOS4190062 from the Coastal Zone Management Act of 1972, as amended, administered by the Office for Coastal Management, National Oceanic and Atmospheric Administration (NOAA), U.S. Department of Commerce provided to the Minnesota Department of Natural Resources (DNR) for Minnesota's Lake Superior Coastal Program. The statements, findings, conclusions, and recommendations are those of the author(s) and do not necessarily reflect the views of NOAA's Office for Coastal Management, the U.S. Department of Commerce, or the Minnesota DNR.


--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data:
CC0 1.0 Universal (CC0 1.0) Public Domain Dedication

2. Recommended citation for the data:
Gomez Quijano, Maria Jose; Gross, Briana L.; Etterson, Julie R. (2020). Why do coastal seeds fail? Evidence of local adaptation of northern red oak (Quercus rubra) in Minnesota coastal forests - Genomics and Geospatial Data. Retrieved from the Data Repository for the University of Minnesota, http://hdl.handle.net/11299/212844.


---------------------
DATA & FILE OVERVIEW
---------------------

1. File List
   A. Filename: Gross2_Project_012_Sample_Information.csv
      Short description:
This file contains the sample information for the data files "Gross2_Project_012_Coastal.zip", "Gross2_Project_012_Inland.zip", "Gross2_Project_012_Interior.zip" and "Gross2_Project_012_Pin Oak.zip". It lists the sample name, the raw number of reads generated by Illumina sequencing, the species name, the region where the sample was collected, and the name of the population each sample belongs to.

   B. Filename: Geospatial_data_Oak_project.csv
      Short description:
This dataset contains the geospatial location of each of the 33 populations that were collected. All of the populations are within the state of Minnesota. Each entry in the dataset refers to one population and provides the following information: species name, MNDNR releve site number, name of the collection site, population name, region name, latitude and longitude. Populations were identified using the Minnesota Department of Natural Resources releve data. Coastal populations were located between 0-10 miles from the coast of Lake Superior, inland populations between 11-50 miles from the shore, and interior populations between 51-100 miles from the shore of Lake Superior.

   C. Filename: Freebayes_variants.vcf
      Short description:
This data file contains the single nucleotide polymorphisms (SNPs) for 412 northern red oak (Quercus rubra) and northern pin oak (Quercus ellipsoidalis) individuals. Genotyping by sequencing was performed by the University of Minnesota Genomics Center (UMGC), and SNPs/variants were called by the UMGC using the program Freebayes.

   D. Filename: ipyrad_variants.vcf
      Short description:
This data file contains the single nucleotide polymorphisms (SNPs) for 412 northern red oak (Quercus rubra) and northern pin oak (Quercus ellipsoidalis) individuals. Genotyping by sequencing was performed by the University of Minnesota Genomics Center (UMGC), and SNPs/variants were called by the Gross Lab using the program ipyrad.

   E. Filename: Gross2_Project_012_Coastal.zip
      Short description:
This data file contains the .fastq files of raw Illumina reads for 114 Q. rubra individuals from 10 coastal populations. Genotyping by sequencing using the enzymes BamHI and NsiI was performed at the UMGC. The number of reads per sample can be found in the "Gross2_Project_012_Sample_Information.csv" file; the population and region information can be found in the "Geospatial_data_Oak_project.csv" file.

   F. Filename: Gross2_Project_012_Inland.zip
      Short description:
This data file contains the .fastq files of raw Illumina reads for 131 Q. rubra individuals from 10 inland populations. Genotyping by sequencing using the enzymes BamHI and NsiI was performed at the UMGC. The number of reads per sample can be found in the "Gross2_Project_012_Sample_Information.csv" file; the population and region information can be found in the "Geospatial_data_Oak_project.csv" file.

   G. Filename: Gross2_Project_012_Interior.zip
      Short description:
This data file contains the .fastq files of raw Illumina reads for 129 Q. rubra individuals from 10 interior populations. Genotyping by sequencing using the enzymes BamHI and NsiI was performed at the UMGC. The number of reads per sample can be found in the "Gross2_Project_012_Sample_Information.csv" file; the population and region information can be found in the "Geospatial_data_Oak_project.csv" file.

   H. Filename: Gross2_Project_012_Pin Oak.zip
      Short description:
This data file contains the .fastq files of raw Illumina reads for 38 Q. ellipsoidalis individuals from 3 populations. Genotyping by sequencing using the enzymes BamHI and NsiI was performed at the UMGC. The number of reads per sample can be found in the "Gross2_Project_012_Sample_Information.csv" file; the population and region information can be found in the "Geospatial_data_Oak_project.csv" file.

   I. Filename: Final_Report_20200429.docx
      Short description:
This file contains the final report presented to the funding agency, in which the results are summarized.

2. Relationship between files:

All the files named "Gross2_Project_012_*" are part of one dataset that was analyzed together. These files contain the raw reads as well as the collection information for each sample. The ".vcf" files are analysis outputs derived from the .fastq files in the "Gross2_Project_012_*" dataset, with SNPs called using the Freebayes and ipyrad programs. The geospatial data file contains information on the collection site of every sample in the "Gross2_Project_012_*" dataset. Populations were identified using the Minnesota Department of Natural Resources releve data. Coastal populations were located between 0-10 miles from the coast of Lake Superior, inland populations between 11-50 miles from the shore, and interior populations between 51-100 miles from the shore of Lake Superior. The final report summarizes the data analysis and results.


--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data:

Populations were identified using the Minnesota Department of Natural Resources releve data. Coastal populations were located between 0-10 miles from the coast of Lake Superior, inland populations were located between 11-50 miles from the shore, and interior populations between 51-100 miles from the shore of Lake Superior.
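
The region definitions above can be sketched as a small lookup. This is an illustration only, not the procedure the authors used: the function name and the choice to treat each upper bound as inclusive (so that 10.5 miles counts as Inland) are assumptions, since the readme does not say how distances between 10 and 11 or between 50 and 51 miles were handled.

```python
from typing import Optional

# Upper bounds in miles from the Lake Superior shore, taken from the region
# definitions in this readme. Treating each bound as inclusive is an assumption.
REGION_UPPER_BOUNDS = [("Coastal", 10.0), ("Inland", 50.0), ("Interior", 100.0)]


def classify_region(miles_from_shore: float) -> Optional[str]:
    """Return the region label for a distance, or None if it is out of range."""
    if miles_from_shore < 0:
        return None
    for region, upper in REGION_UPPER_BOUNDS:
        if miles_from_shore <= upper:
            return region
    return None


# Hypothetical distances, for illustration only
for miles in (4.0, 32.5, 75.0, 120.0):
    print(miles, classify_region(miles))
```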

Genomic DNA was extracted using a modified CTAB protocol - Sork Lab:Protocols. (2018). OpenWetWare. Available at: https://openwetware.org/mediawiki/index.php?title=Sork_Lab:Protocols&oldid=1037048.

2. Methods for processing the data:

SNP calling by the UMGC was done using Freebayes - Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN], 2012.

SNP calling by the Gross Lab was done using ipyrad - Eaton DAR & Overcast I. "ipyrad: Interactive assembly and analysis of RADseq datasets." Bioinformatics (2020).

3. People involved with sample collection, processing, analysis and/or submission:

Maria Jose Gomez Quijano (gomez312@d.umn.edu)



-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: Gross2_Project_012_Sample_Information.csv
-----------------------------------------

1. Number of variables: 5
2. Number of cases/rows: 412

3. Variable List

    A. Name: Sample Name
       Description: Sample ID, consisting of the two-letter population code and the sample number

    B. Name: Total Reads
       Description: Raw reads generated by Illumina sequencing per sample

    C. Name: Species name
       Description: Species each sample belongs to

    D. Name: Region name
       Description: Region each sample belongs to

       Coastal  = 0-10 miles from the coast of Lake Superior,
       Inland   = 11-50 miles from the shore of Lake Superior,
       Interior = 51-100 miles from the shore of Lake Superior.

    E. Name: Population name
       Description: Two-letter code of the population each individual belongs to
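
As a minimal sketch of how the five variables above might be used, the following sums the raw reads per region with Python's standard csv module. The column headers are assumed to match the variable names listed above exactly; the actual header strings in the file may differ. The rows in EXAMPLE_CSV are hypothetical and are not taken from the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows for illustration only; not real data from the dataset.
# Header names are assumed to match the variable list in this readme.
EXAMPLE_CSV = """Sample Name,Total Reads,Species name,Region name,Population name
AA_01,1000,Quercus rubra,Coastal,AA
AA_02,3000,Quercus rubra,Coastal,AA
BB_01,2000,Quercus rubra,Inland,BB
"""


def reads_per_region(handle):
    """Sum the Total Reads column for each value of Region name."""
    totals = defaultdict(int)
    for row in csv.DictReader(handle):
        # Strip thousands separators in case read counts are written as 1,000
        reads = int(row["Total Reads"].replace(",", ""))
        totals[row["Region name"].strip()] += reads
    return dict(totals)


totals = reads_per_region(io.StringIO(EXAMPLE_CSV))
print(totals)
```

To run the same function on the real file, pass it a handle opened with open("Gross2_Project_012_Sample_Information.csv", newline="") after checking the header row.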


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: Geospatial_data_Oak_project.csv
-----------------------------------------

1. Number of variables: 7
2. Number of cases/rows: 33


3. Missing data codes:
        Code/symbol: N/A    Definition: no data available

4. Variable List

    A. Name: Species name
       Description: Species each population belongs to

    B. Name: MNDNR releve site number
       Description: ID number of each population where samples were collected, taken from the Minnesota Department of Natural Resources releve data

    C. Name: Collection site name
       Description: Name of the site where samples were collected

    D. Name: Population name
       Description: Two-letter code of the population each individual belongs to

    E. Name: Region name
       Description: Region each population belongs to

       Coastal  = 0-10 miles from the coast of Lake Superior,
       Inland   = 11-50 miles from the shore of Lake Superior,
       Interior = 51-100 miles from the shore of Lake Superior.

    F. Name: Latitude
       Description: Latitude of the site where samples were collected

    G. Name: Longitude
       Description: Longitude of the site where samples were collected


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: Freebayes_variants.vcf.zip
-----------------------------------------

1. Number of variables:

2. Number of cases/rows:
140,785 loci

3. Missing data codes:
        Code/symbol: 0, N/A    Definition: missing data

4. Variable List

VCF variables are as follows:

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=DPB,Number=1,Type=Float,Description="Total read depth per bp at the locus; bases in reads overlapping / bases in haplotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=RO,Number=1,Type=Integer,Description="Count of full observations of the reference haplotype.">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Count of full observations of this alternate haplotype.">
##INFO=<ID=PRO,Number=1,Type=Float,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=PAO,Number=A,Type=Float,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=QR,Number=1,Type=Integer,Description="Reference allele quality sum in phred">
##INFO=<ID=QA,Number=A,Type=Integer,Description="Alternate allele quality sum in phred">
##INFO=<ID=PQR,Number=1,Type=Float,Description="Reference allele quality sum in phred for partial observations">
##INFO=<ID=PQA,Number=A,Type=Float,Description="Alternate allele quality sum in phred for partial observations">
##INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
##INFO=<ID=SAF,Number=A,Type=Integer,Description="Number of alternate observations on the forward strand">
##INFO=<ID=SAR,Number=A,Type=Integer,Description="Number of alternate observations on the reverse strand">

Source: https://samtools.github.io/hts-specs/VCFv4.1.pdf
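
In a VCF record, the INFO column stores these variables as semicolon-separated KEY=VALUE pairs, with comma-separated values for fields declared Number=A. The sketch below shows one way to split such a column into a dictionary. The function name and the example string are hypothetical and are not taken from the data files; values are left as strings rather than converted to the types declared in the header.

```python
def parse_info(info_field):
    """Split a VCF INFO column into a dict of key -> list of string values.

    Flag entries that carry no '=' are mapped to True.
    """
    parsed = {}
    if info_field == ".":  # '.' marks an empty INFO column
        return parsed
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        parsed[key] = value.split(",") if sep else True
    return parsed


# Hypothetical INFO string, for illustration only
example = "NS=3;DP=14;AC=2,1;AF=0.5,0.25"
info = parse_info(example)
print(info["DP"], info["AF"])
```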

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: Ipyrad_variants.vcf.zip
-----------------------------------------

1. Number of variables:

2. Number of cases/rows:
225,954

3. Missing data codes:
        Code/symbol: 0, N/A    Definition: missing data

4. Variable List

VCF variables are as follows:

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">
##INFO=<ID=DPB,Number=1,Type=Float,Description="Total read depth per bp at the locus; bases in reads overlapping / bases in haplotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1]">
##INFO=<ID=RO,Number=1,Type=Integer,Description="Count of full observations of the reference haplotype.">
##INFO=<ID=AO,Number=A,Type=Integer,Description="Count of full observations of this alternate haplotype.">
##INFO=<ID=PRO,Number=1,Type=Float,Description="Reference allele observation count, with partial observations recorded fractionally">
##INFO=<ID=PAO,Number=A,Type=Float,Description="Alternate allele observations, with partial observations recorded fractionally">
##INFO=<ID=QR,Number=1,Type=Integer,Description="Reference allele quality sum in phred">
##INFO=<ID=QA,Number=A,Type=Integer,Description="Alternate allele quality sum in phred">
##INFO=<ID=PQR,Number=1,Type=Float,Description="Reference allele quality sum in phred for partial observations">
##INFO=<ID=PQA,Number=A,Type=Float,Description="Alternate allele quality sum in phred for partial observations">
##INFO=<ID=SRF,Number=1,Type=Integer,Description="Number of reference observations on the forward strand">
##INFO=<ID=SRR,Number=1,Type=Integer,Description="Number of reference observations on the reverse strand">
##INFO=<ID=SAF,Number=A,Type=Integer,Description="Number of alternate observations on the forward strand">
##INFO=<ID=SAR,Number=A,Type=Integer,Description="Number of alternate observations on the reverse strand">

Source: https://samtools.github.io/hts-specs/VCFv4.1.pdf


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: Gross2_Project_012_XXXX.zip
-----------------------------------------

1. Number of cases:
Gross2_Project_012_Coastal.zip   114 fastqs
Gross2_Project_012_Inland.zip   131 fastqs
Gross2_Project_012_Interior.zip 129 fastqs
Gross2_Project_012_Pin Oak.zip   38 fastqs

2. Missing data codes:
        Code/symbol: -, N/A    Definition: missing data

3. Variable List

    A. Name: Phred quality score
       Description: Quality score for the sequence

Please visit: https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm for a description of each quality score.

    B. Name: Sample information
       Description: Information for each sequencing run, with barcodes included in the header

For more information visit: https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html

    C. Name: Sequence
       Description: The sequence (the base calls: A, C, T, G and N)
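
The three variables above correspond to the header, sequence, and quality lines of each four-line FASTQ record. The following is a minimal sketch of reading such records and decoding the quality string into Phred scores. It assumes the Phred+33 encoding used by recent Illumina software, which should be confirmed against the actual files; the function name and the example record are hypothetical.

```python
import io

PHRED_OFFSET = 33  # assumption: Phred+33 quality encoding


def read_fastq(handle):
    """Yield (header, sequence, qualities) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        # Convert each quality character to its numeric Phred score
        scores = [ord(char) - PHRED_OFFSET for char in quality]
        yield header[1:], sequence, scores


# Hypothetical record, for illustration only
EXAMPLE_FASTQ = "@read1 hypothetical header\nACGTN\n+\nIIII#\n"
records = list(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
print(records[0])
```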