This codebook.txt file was generated on 2018/07/27 by Lex Flagel

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset: Data from: The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference

2. Author Information

   Principal Investigator Contact Information
      Name: Lex Flagel
      Institution: Department of Plant and Microbial Biology, University of Minnesota
      Address:
      Email: flag0010@gmail.com

      Name: Yaniv Brandvain
      Institution: Department of Plant and Microbial Biology, University of Minnesota
      Email: ybrandva@umn.edu

      Name: Daniel Schrider
      Institution: Department of Genetics, University of North Carolina at Chapel Hill
      Email: drs@unc.edu

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data: CC0 1.0 Universal (CC0 1.0) Public Domain Dedication

2. Links to publications that cite or use the data:
   Flagel, Lex, Yaniv J Brandvain, and Daniel R Schrider. “The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference.” bioRxiv preprint. January 1, 2018. https://doi.org/10.1101/336073.

3. Computational environment: Mac using Python 2.7.13

4. A snapshot of the related GitHub folder at the time of data publication can be found here:
   https://github.com/flag0010/pop_gen_cnn/tree/7e26cd010f917d13b1603ae8f48bd0917628727f

--------------------
DATA & FILE OVERVIEW
--------------------

1. File List

   A. Filename: ld.data.npz
      Short description: Simulated training and validation data sets for inference of the population-scaled recombination rate (a.k.a. rho) from phased chromosomes, as described in the paper above. Scripts related to this file are in the historical_recombination subfolder.

   B. Filename: autotet.ld.data.npz
      Short description: Simulated training and validation data sets for inference of the population-scaled recombination rate (a.k.a. rho) from unphased autotetraploid data, as described in the paper above. Scripts related to this file are in the historical_recombination subfolder.

   C. Filename: big_sim.npz
      Short description: Simulated training and validation data sets for inference of gene flow between populations, as described in the paper above. Scripts related to this file are in the introgression subfolder.

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data:
   All data sets were generated through coalescent simulations. The specifics are given in the paper, and the related Python scripts are available at https://github.com/flag0010/pop_gen_cnn.

2. Describe any quality-assurance procedures performed on the data: Simulated data

3. People involved with sample collection, processing, analysis and/or submission: Lex Flagel, Yaniv Brandvain, Daniel Schrider

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: autotet.ld.data.npz
-----------------------------------------

List of arrays

   A. Name: postrain
      Description: training data: position vector (normalized to the range 0-1) with one entry for each segregating site in each simulation

   B. Name: ytest
      Description: validation data: vector of length 2 with the true theta and rho values used in the coalescent simulation

   C. Name: ytrain
      Description: training data: vector of length 2 with the true theta and rho values used in the coalescent simulation

   D. Name: xtest
      Description: validation data: matrices of the simulated segregating sites, converted to the frequency of the "a" allele (see paper), with individuals on columns

   E. Name: postest
      Description: validation data: position vector (normalized to the range 0-1) with one entry for each segregating site

   F. Name: xtrain
      Description: training data: matrices of the simulated segregating sites, converted to the frequency of the "a" allele (see paper), with individuals on columns

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: big_sim.npz
-----------------------------------------

List of arrays

   A. Name: xtrain
      Description: training data: matrices of the simulated segregating sites (binary), with individuals on columns

   B. Name: xtest
      Description: validation data: matrices of the simulated segregating sites (binary), with individuals on columns

   C. Name: ytest
      Description: validation data: categorical value (0 = no migration, 1 = 1->2 migration, 2 = 2->1 migration)

   D. Name: ytrain
      Description: training data: categorical value (0 = no migration, 1 = 1->2 migration, 2 = 2->1 migration)

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: ld.data.npz
-----------------------------------------

List of arrays

   A. Name: postrain
      Description: training data: position vector (normalized to the range 0-1) with one entry for each segregating site in each simulation

   B. Name: ytest
      Description: validation data: vector of length 2 with the true theta and rho values used in the coalescent simulation

   C. Name: ytrain
      Description: training data: vector of length 2 with the true theta and rho values used in the coalescent simulation

   D. Name: xtest
      Description: validation data: matrices of the simulated segregating sites (binary), with individuals on columns

   E. Name: postest
      Description: validation data: position vector (normalized to the range 0-1) with one entry for each segregating site

   F. Name: xtrain
      Description: training data: matrices of the simulated segregating sites (binary), with individuals on columns
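
The three files described in this codebook are NumPy .npz archives, so their arrays can be inspected by name before any model code is run. The following is a minimal sketch, not part of the original scripts: the helper summarize_npz is a hypothetical name, and the demonstration archive is tiny synthetic data with made-up shapes, written to a temporary file so the example runs without the real downloads.

```python
import os
import tempfile

import numpy as np


def summarize_npz(path, allow_pickle=False):
    """Return {array_name: (shape, dtype string)} for each array in an .npz archive.

    summarize_npz is a hypothetical helper for illustration only.
    """
    summary = {}
    # np.load on an .npz file returns a lazy archive; each array is read on access.
    with np.load(path, allow_pickle=allow_pickle) as archive:
        for name in archive.files:
            arr = archive[name]
            summary[name] = (arr.shape, str(arr.dtype))
    return summary


# Synthetic stand-in archive (NOT the real data; shapes and values are invented).
rng = np.random.RandomState(0)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.npz")
np.savez(
    demo_path,
    xtrain=rng.randint(0, 2, size=(4, 6, 5)).astype(np.uint8),  # binary site matrices
    ytrain=rng.rand(4, 2),    # two target values per simulation
    postrain=rng.rand(4, 6),  # positions in the range 0-1
)

demo_summary = summarize_npz(demo_path)
for name in sorted(demo_summary):
    print(name, demo_summary[name])
```

To inspect the real data, pass the path of a downloaded file (for example ld.data.npz) in place of demo_path. Because the archives were written under Python 2.7, it is possible (an assumption, not something this codebook states) that some arrays are stored as pickled objects; in that case np.load may need allow_pickle=True, which should only be used on files from a trusted source.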