This codebook.txt file was generated on 2018/07/27 by Lex Flagel

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset: Data from: The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference

2. Author Information

  Principal Investigator Contact Information
           Name: Lex Flagel
           Institution: Department of Plant and Microbial Biology, University of Minnesota
           Address:
           Email: flag0010@gmail.com

           Name: Yaniv Brandvain
           Institution: Department of Plant and Microbial Biology, University of Minnesota
           Email: ybrandva@umn.edu

           Name: Daniel Schrider
           Institution: Department of Genetics, University of North Carolina at Chapel Hill
           Email: drs@unc.edu


--------------------------
SHARING/ACCESS INFORMATION
-------------------------- 

1. Licenses/restrictions placed on the data:
  CC0 1.0 Universal (CC0 1.0) Public Domain Dedication

2. Links to publications that cite or use the data:

Flagel, Lex, Yaniv J Brandvain, and Daniel R Schrider. “The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference.” bioRxiv preprint. January 1, 2018. https://doi.org/10.1101/336073.

3. Computational environment: Mac using Python 2.7.13

4. A snapshot of the related GitHub folder at the time of data publication can be found here: https://github.com/flag0010/pop_gen_cnn/tree/7e26cd010f917d13b1603ae8f48bd0917628727f

---------------------
DATA & FILE OVERVIEW
---------------------
1. File List
   A. Filename: ld.data.npz
      Short description: This file contains simulated training and validation data sets for inference of the population-scaled recombination rate (rho) from phased chromosomes, as described in the paper above. Scripts related to this file are in the historical_recombination subfolder.


        
   B. Filename: autotet.ld.data.npz
      Short description: This file contains simulated training and validation data sets for inference of the population-scaled recombination rate (rho) from unphased autotetraploid data, as described in the paper above. Scripts related to this file are in the historical_recombination subfolder.


        
   C. Filename: big_sim.npz
      Short description: This file contains simulated training and validation data sets for inference of gene flow between populations, as described in the paper above. Scripts related to this file are in the introgression subfolder.
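
An .npz file is a zip archive with one .npy member per array, so the array names, shapes, and dtypes in the files listed above can be checked without loading the data into memory. The following is a minimal sketch using only the Python standard library. The function name peek_npz is hypothetical and is not part of the pop_gen_cnn scripts.

```python
import ast
import struct
import zipfile


def peek_npz(path):
    """Return {array_name: (shape, dtype_descr)} without reading array data.

    Hypothetical helper: it only parses the small header of each .npy member.
    """
    summary = {}
    with zipfile.ZipFile(path) as archive:
        for member in archive.namelist():
            if not member.endswith(".npy"):
                continue
            with archive.open(member) as fh:
                if fh.read(6) != b"\x93NUMPY":
                    raise ValueError("%s is not a .npy member" % member)
                major, _minor = struct.unpack("BB", fh.read(2))
                # Format version 1 stores the header length in 2 bytes,
                # later versions store it in 4 bytes (little-endian).
                if major == 1:
                    (header_len,) = struct.unpack("<H", fh.read(2))
                else:
                    (header_len,) = struct.unpack("<I", fh.read(4))
                header_text = fh.read(header_len).decode("latin1").strip()
                # The header is a Python dict literal with the keys
                # 'descr', 'fortran_order', and 'shape'.
                header = ast.literal_eval(header_text)
            summary[member[:-len(".npy")]] = (header["shape"], header["descr"])
    return summary
```

For example, peek_npz("ld.data.npz") should list the array names given for that file in the data-specific sections of this codebook, together with whatever shapes and dtypes the file actually stores.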



--------------------------
METHODOLOGICAL INFORMATION
--------------------------


1. Description of methods used for collection/generation of data: All data sets were generated through coalescent simulations. The specifics are given in the paper, and the related Python scripts are available at https://github.com/flag0010/pop_gen_cnn.


2. Describe any quality-assurance procedures performed on the data: Simulated data


3. People involved with sample collection, processing, analysis and/or submission: Lex Flagel, Yaniv Brandvain, Daniel Schrider



-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: autotet.ld.data.npz
-----------------------------------------
List of arrays

    A. Name: postrain
       Description: training data position-specific vector (normalized between 0 and 1) for each segregating site in each simulation
                    
    B. Name: ytest
       Description: validation data vector of length 2, with the true theta and rho values used in the coalescent simulation

    C. Name: ytrain
       Description: training data vector of length 2, with the true theta and rho values used in the coalescent simulation
    
    D. Name: xtest
       Description: validation data matrices of the simulated segregating sites converted to frequency of the "a" allele (see paper) with individuals on columns  
	
    E. Name: postest
       Description: validation data position-specific vector (normalized between 0 and 1) for each segregating site
 
    F. Name: xtrain
       Description: training data matrices of the simulated segregating sites converted to frequency of the "a" allele (see paper) with individuals on columns

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: big_sim.npz
-----------------------------------------
List of arrays

    A. Name: xtrain
       Description: training data matrices of the simulated segregating sites (binary) with individuals on columns
                       
    B. Name: xtest
       Description: validation data matrices of the simulated segregating sites (binary) with individuals on columns
    
    C. Name: ytest
       Description: validation data categorical value for no migration=0, 1->2 migration=1, 2->1 migration=2     
		
    D. Name: ytrain
       Description: training data categorical value for no migration=0, 1->2 migration=1, 2->1 migration=2
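
The label coding above can be turned into a quick class-balance check. The sketch below assumes that ytrain and ytest hold one integer class code per simulation; the codebook does not say whether the labels are stored this way or one-hot encoded, so a one-hot array would first need to be reduced to class codes. The names MIGRATION_CLASSES and class_counts are hypothetical.

```python
from collections import Counter

# Label coding as stated in this codebook for ytrain / ytest in big_sim.npz.
MIGRATION_CLASSES = {
    0: "no migration",
    1: "1->2 migration",
    2: "2->1 migration",
}


def class_counts(labels):
    """Count simulations per migration class, keyed by a readable name.

    Assumption: `labels` is a flat sequence of class codes (0, 1, or 2).
    """
    counts = Counter(int(label) for label in labels)
    unknown = set(counts) - set(MIGRATION_CLASSES)
    if unknown:
        raise ValueError("unexpected label values: %s" % sorted(unknown))
    return {name: counts.get(code, 0) for code, name in MIGRATION_CLASSES.items()}
```

Any flat sequence of codes works as input, including a one-dimensional NumPy array taken from the loaded file.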
	

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: ld.data.npz
-----------------------------------------
List of arrays               

    A. Name: postrain
       Description: training data position-specific vector (normalized between 0 and 1) for each segregating site in each simulation
                    
    B. Name: ytest
       Description: validation data vector of length 2, with the true theta and rho values used in the coalescent simulation

    C. Name: ytrain
       Description: training data vector of length 2, with the true theta and rho values used in the coalescent simulation

    D. Name: xtest
       Description: validation data matrices of the simulated segregating sites (binary) with individuals on columns

    E. Name: postest
       Description: validation data position-specific vector (normalized between 0 and 1) for each segregating site
 
    F. Name: xtrain
       Description: training data matrices of the simulated segregating sites (binary) with individuals on columns
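
The arrays in ld.data.npz and autotet.ld.data.npz share the same naming pattern: a prefix (x, y, or pos) followed by train or test, where the test arrays hold the validation data. The sketch below shows one way to pull out the three arrays for a split from any dict-like object. The function name get_split is hypothetical, and the length check assumes that the first axis of every array indexes simulations, which the codebook does not state explicitly.

```python
def get_split(arrays, split):
    """Return (x, y, pos) for split 'train' or 'test' from a dict-like object.

    Hypothetical helper. `arrays` can be any mapping from the array names in
    this codebook (xtrain, ytrain, postrain, xtest, ytest, postest) to data.
    """
    if split not in ("train", "test"):
        raise ValueError("split must be 'train' or 'test'")
    x, y, pos = (arrays[prefix + split] for prefix in ("x", "y", "pos"))
    # Assumption: the first axis of every array indexes simulations,
    # so all three should have the same length.
    if not (len(x) == len(y) == len(pos)):
        raise ValueError("arrays disagree on the number of simulations")
    return x, y, pos
```

With NumPy installed, numpy.load("ld.data.npz") returns a dict-like object that can be passed directly as the arrays argument. If any array was saved as an object array, numpy.load may additionally need allow_pickle=True; whether that applies to these files is not confirmed by this codebook.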