This codebook.txt file was generated on 2017-12-05 by wilsonkm

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset

Sequence-Dependent Persistence Length of Long DNA

2. Author Information

Principal Investigator Contact Information
Name: Hui-Min Chuang
Institution: University of Minnesota
Address: Department of Chemical Engineering and Materials Science, 421 Washington Ave SE, Minneapolis, Minnesota 55455
Email: chuan077@umn.edu

Associate or Co-investigator Contact Information
Name: Jeffrey G. Reifenberger
Institution: BioNano Genomics
Address: 9640 Towne Centre Drive, Suite 100, San Diego, California 92121
Email: jreifenberger@bionanogenomics.com

Associate or Co-investigator Contact Information
Name: Han Cao
Institution: BioNano Genomics
Address: 9640 Towne Centre Drive, Suite 100, San Diego, California 92121
Email: han@bionanogenomics.com

Associate or Co-investigator Contact Information
Name: Kevin D. Dorfman
Institution: University of Minnesota
Address: Department of Chemical Engineering and Materials Science, 421 Washington Ave SE, Minneapolis, Minnesota 55455
Email: dorfman@umn.edu

3. Date of data collection: 2014-08-18 to 2014-08-20

4. Geographic location of data collection: N/A

5. Information about funding sources that supported the collection of the data:
Sponsorship: National Institutes of Health under grant R01-HG006851

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data: N/A

2. Links to publications that cite or use the data:
https://doi.org/10.1534/genetics.115.183483
doi:10.1038/sdata.2016.25
https://doi.org/10.1063/1.4907552
DOI: 10.1039/c5an00343a

3. Links to other publicly accessible locations of the data:

4. Links/relationships to ancillary data sets:

5. Was data derived from another source? If yes, list source(s):

6. Recommended citation for the data:
Chuang, Hui-Min; Reifenberger, Jeffrey G.; Cao, Han; Dorfman, Kevin D. (2017). Sequence-Dependent Persistence Length of Long DNA. Retrieved from the Data Repository for the University of Minnesota, http://hdl.handle.net/11299/191753.

---------------------
DATA & FILE OVERVIEW
---------------------

1. File List

A. Filename: FIG2b.m
   Short description: Matlab code to generate figure 2 in the main text of the paper. The corresponding data are in the file "xlsx_data.xlsx" (sheet 4).

B. Filename: FIG3a.m
   Short description: Matlab code to generate figure 3a in the main text of the paper. The corresponding data are in the file "xlsx_data.xlsx" (sheet 5).

C. Filename: FIG3b_FIGS4_ANOVA.m
   Short description: Matlab code that applies analysis of variance (ANOVA) and Tukey’s minimum significant difference (MSD) test to our data. The data needed for ANOVA and MSD are in the files "modified_081814_microlayer_878_sampleID_C2_1_36_pillarblast_stats_labelPairs_allPairs.txt" and "xlsx_data.xlsx" (sheets 4, 5, and 6). The results can be used to generate figure 3b in the main text of the paper and figure 4 in the Supplemental Material.

D. Filename: FIG4_FIGS11ab.m
   Short description: Matlab code to generate figure 4 in the main text of the paper and figure 11 in the Supplemental Material. The data needed for the calculation are in the files "hg19_composition.xlsx" (sheet 1) and "xlsx_data.xlsx" (sheets 4, 5, and 6).

E. Filename: FIGS1.m
   Short description: Matlab code to generate figure 1 in the Supplemental Material. The corresponding data are in the file "xlsx_data.xlsx" (sheet 2).

F. Filename: FIGS2.m
   Short description: Matlab code to generate figure 2 in the Supplemental Material. The corresponding data are in the file "xlsx_data_wo_unknown.xlsx".

G. Filename: FIGS3.m
   Short description: Matlab code to generate figure 3 in the Supplemental Material. The data needed for the re-sample ANOVA are in the files "modified_081814_microlayer_878_sampleID_C2_1_36_pillarblast_stats_labelPairs_allPairs.txt" and "hg19_nicksite.xlsx".

H. Filename: FIGS5.m
   Short description: Matlab code to generate figure 5 in the Supplemental Material. The data needed for re-binning are in the file "modified_081814_microlayer_878_sampleID_C2_1_36_pillarblast_stats_labelPairs_allPairs.txt", and the results are saved in the file "revision_SI.xlsx" (sheets 7, 8, and 9).

I. Filename: FIGS6.m
   Short description: Matlab code to generate figure 6 in the Supplemental Material. The corresponding data are in the file "xlsx_data.xlsx" (sheet 7).

J. Filename: FIGS7.m
   Short description: Matlab code to generate figure 7 in the Supplemental Material. The corresponding data are in the file "xlsx_data.xlsx" (sheet 6).

K. Filename: FIGS8.m
   Short description: Matlab code to generate figure 8 in the Supplemental Material. The data needed for the calculation are in the files "xlsx_data.xlsx" (sheets 4, 5, and 6) and "revision_SI.xlsx" (sheets 1, 2, and 3).

L. Filename: FIGS9.m
   Short description: Matlab code to generate figure 9 in the Supplemental Material. The data needed for the calculation are in the file "modified_081814_microlayer_878_sampleID_C2_1_36_pillarblast_stats_labelPairs_allPairs.txt".

M. Filename: hg19_composition.xlsx
   Short description: This data file contains the fractions of the 10 dinucleotide steps at different DNA lengths and % GC contents from our experimental results. The results can be regenerated with the code saved in the folder "FIGS10".

N. Filename: hg19_nicksite.xlsx
   Short description: This data file gives the positions, in the 5' to 3' direction, of the DNA sequence "GCTCTTC" on all chromosomes. This sequence is recognized by the nicking enzyme Nt.BspQI.

O. Filename: modified_081814_microlayer_878_sampleID_C2_1_36_pillarblast_stats_labelPairs_allPairs.txt
   Short description: Experimental data set.

P. Filename: revision_SI.xlsx
   Short description: This data file contains the analytical results used to generate figures 5 and 8 in the Supplemental Material.

Q. Filename: xlsx_data_wo_unknown.xlsx
   Short description: This data file contains the experimental results used to generate figure 2 in the Supplemental Material.

R. Filename: xlsx_data.xlsx
   Short description: This data file contains the experimental and analytical results used to generate figures 2, 3, and 4 in the main text of the paper, and figures 1, 6, and 7 in the Supplemental Material.

2. Relationship between files:
All relationships between files are given in the description of each file.

3. Additional related data collected that was not included in the current data package:
The complete hg19 chromosome sequences are available online and can be analyzed to obtain the results provided in the file "hg19_nicksite.xlsx".

4. Are there multiple versions of the dataset? No

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data:
Please see Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevLett.119.227802 for additional information.

2. Methods for processing the data:
Please see Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevLett.119.227802 for additional information.

3. Instrument- or software-specific information needed to interpret the data:
Please see Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevLett.119.227802 for additional information.

4. Standards and calibration information, if appropriate:
Please see Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevLett.119.227802 for additional information.

5. Environmental/experimental conditions:
Please see Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevLett.119.227802 for additional information.

6. Describe any quality-assurance procedures performed on the data:
Please see Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevLett.119.227802 for additional information.

7. People involved with sample collection, processing, analysis and/or submission:
Hui-Min Chuang, Jeffrey G. Reifenberger, Han Cao, and Kevin D. Dorfman.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: hg19_composition.xlsx
-----------------------------------------

1. Number of variables: 12

2. Number of cases/rows:

3. Missing data codes:
Symbol: “NaN” No data
Symbol: “0” No data

4. Variable List

A. Name: length
   Description: sequence length of an analyzed DNA sequence

B. Name: AA/TT
   Description: dinucleotide step

C. Name: AC/GT
   Description: dinucleotide step

D. Name: AG/CT
   Description: dinucleotide step

E. Name: AT
   Description: dinucleotide step

F. Name: CA/TG
   Description: dinucleotide step

G. Name: CC/GG
   Description: dinucleotide step

H. Name: CG
   Description: dinucleotide step

I. Name: GA/TC
   Description: dinucleotide step

J. Name: GC
   Description: dinucleotide step

K. Name: TA
   Description: dinucleotide step

L. Name: GC content
   Description: guanine-cytosine content of an analyzed DNA sequence

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: hg19_nicksite.xlsx
-----------------------------------------

1. Number of variables: 25

2. Number of cases/rows:

3. Missing data codes:
Symbol: “NaN” No data
Symbol: “0” No data

4. Variable List

A. Name: chrmID1
   Description: Human chromosome 1

B. Name: chrmID2
   Description: Human chromosome 2

C. Name: chrmID3
   Description: Human chromosome 3

D. Name: chrmID4
   Description: Human chromosome 4

E. Name: chrmID5
   Description: Human chromosome 5

F. Name: chrmID6
   Description: Human chromosome 6

G. Name: chrmID7
   Description: Human chromosome 7

H. Name: chrmID8
   Description: Human chromosome 8

I. Name: chrmID9
   Description: Human chromosome 9

J. Name: chrmID10
   Description: Human chromosome 10

K. Name: chrmID11
   Description: Human chromosome 11

L. Name: chrmID12
   Description: Human chromosome 12

M. Name: chrmID13
   Description: Human chromosome 13

N. Name: chrmID14
   Description: Human chromosome 14

O. Name: chrmID15
   Description: Human chromosome 15

P. Name: chrmID16
   Description: Human chromosome 16

Q. Name: chrmID17
   Description: Human chromosome 17

R. Name: chrmID18
   Description: Human chromosome 18

S. Name: chrmID19
   Description: Human chromosome 19

T. Name: chrmID20
   Description: Human chromosome 20

U. Name: chrmID21
   Description: Human chromosome 21

V. Name: chrmID22
   Description: Human chromosome 22

W. Name: chrmIDX
   Description: Human chromosome X

X. Name: chrmIDY
   Description: Human chromosome Y

Y. Name: chrmID
   Description: Human chromosome

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: modified_081814_microlayer_878_sampleID_C2_1_36_pillarblast_stats_labelPairs_allPairs.txt
-----------------------------------------

1. Number of variables: 17

2. Number of cases/rows:

3. Missing data codes:
Symbol: “NaN” No data
Symbol: “0” No data

4. Variable List

A. Name: chrmID
   Description: the chromosome number of the alignment (23 is X, 24 is Y)

B. Name: NICKID1
   Description: nick site number for the 1st label

C. Name: NICKID2
   Description: nick site number for the 2nd label

D. Name: ExpectedDis
   Description: expected distance between the nick sites based on the reference

E. Name: GCcontent
   Description: GC content based on the hg19 reference

F. Name: numData
   Description: the number of pairs of labels on a molecule that aligned to that position in the reference

G. Name: meanPairDist(bp)
   Description: average distance in bp measured between the pairs, assuming 366 base pairs per pixel

H. Name: StdDevPairDist(bp)
   Description: standard deviation of the distance between the pair of labels in bp (again assuming 366 bp/pixel)

I. Name: meanScalePosition
   Description: the average scaled position at which the pair of labels is found in the molecule (-1 is the far left-hand side, +1 is the far right-hand side, and 0 is the middle of a molecule)

J. Name: StdDevScalePosition
   Description: the standard deviation of the position of the pair of labels within the molecule

K. Name: AvgBPP
   Description: the actual bp/pixel value based on the alignment to the reference. All measurements are made in pixels and then converted to bp assuming 366 bp/pixel. However, in every experiment there is always some variation depending on salt concentration, channel size, etc.

L. Name: StdDevBPP
   Description: standard deviation of the bp/pixel value

M. Name: skewBPP
   Description: skew of the bp/pixel value

N. Name: avgStretch
   Description: the average stretch measured between the labels, assuming 0.34 nm per bp

O. Name: StdDevStretch
   Description: the standard deviation of the stretch

P. Name: skewStretch
   Description: the skew of the stretch

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: xlsx_data_wo_unknown
-----------------------------------------

1. Number of variables: 3

2. Number of cases/rows:

3. Missing data codes:
Symbol: “NaN” No data
Symbol: “0” No data

4. Variable List

A. Name: avgStretch
   Description: the average stretch measured between the labels, assuming 0.34 nm per bp

B. Name: NumData
   Description: the number of pairs of labels on a molecule that aligned to that position in the reference

C. Name: expectedDist
   Description: expected distance between the nick sites based on the reference

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: revision_SI.xlsx
-----------------------------------------

1. Number of variables: 5

2. Number of cases/rows:

3. Missing data codes:
Symbol: “NaN” No data
Symbol: “0” No data

4. Variable List

A. Name: numofData_cut
   Description: the number of pairs of labels on a molecule that aligned to that position in the reference, with quality cuts for our results

B. Name: avgStretch_cut
   Description: the average stretch measured between the labels, assuming 0.34 nm per bp, with quality cuts for our results

C. Name: Std_cut
   Description: the standard deviation of the stretch, with quality cuts for our results

D. Name: numofData_# (# = 320 or 799)
   Description: the number of pairs of labels on a molecule that aligned to that position in the reference, with our experimental data binned by # bins

E. Name: avgStretch_# (# = 320 or 799)
   Description: the average stretch measured between the labels, assuming 0.34 nm per bp, with our experimental data binned by # bins

F. Name: Std_# (# = 320 or 799)
   Description: the standard deviation of the stretch, with our experimental data binned by # bins

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: xlsx_data
-----------------------------------------

1. Number of variables: 8

2. Number of cases/rows:

3. Missing data codes:
Symbol: “NaN” No data
Symbol: “0” No data

4. Variable List

A. Name: numData_unknown
   Description: the number of pairs of labels on a molecule that aligned to that position in the reference, with aligned sequences containing N-base regions

B. Name: avgStretch_unknown
   Description: the average stretch measured between the labels, assuming 0.34 nm per bp, with aligned sequences containing N-base regions

C. Name: StdDtretch_SST_unknown
   Description: the standard deviation of the stretch, with aligned sequences containing N-base regions

D. Name: numData_wounknown
   Description: the number of pairs of labels on a molecule that aligned to that position in the reference, without aligned sequences containing N-base regions

E. Name: avgStretch_wounknown
   Description: the average stretch measured between the labels, assuming 0.34 nm per bp, without aligned sequences containing N-base regions

F. Name: StdDtretch_SST_wounknown
   Description: the standard deviation of the stretch, without aligned sequences containing N-base regions

G. Name: numData_avgStretchbin_GC
   Description: the average stretch measured between the labels vs. the number of pairs of labels on a molecule, for different % GC content between a pair of nick sites, that aligned to that position in the reference

H. Name: numData_avgStretchbin_LE
   Description: the average stretch measured between the labels vs. the number of pairs of labels on a molecule, for different sequence lengths between a pair of nick sites, that aligned to that position in the reference
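
-----------------------------------------
ILLUSTRATIVE SKETCH OF THE CONVENTIONS USED ABOVE
-----------------------------------------

Several variable descriptions above rely on fixed conventions: chrmID values 23 and 24 stand for chromosomes X and Y, pixel distances are converted at a nominal 366 bp/pixel, lengths assume 0.34 nm per bp, and nick sites are occurrences of the sequence "GCTCTTC" read 5' to 3'. The short Python sketch below illustrates these conventions only. It is not part of the dataset and is not the authors' Matlab code. All function names are hypothetical, and two details are assumptions: positions are reported as 0-based offsets, and only the strand passed in is scanned.

```python
# Illustrative helpers for the conventions stated in this codebook.
# All names are hypothetical; none come from the dataset's Matlab scripts.

NOMINAL_BP_PER_PIXEL = 366  # nominal conversion stated for meanPairDist(bp)
NM_PER_BP = 0.34            # length per base pair assumed for avgStretch
NICK_MOTIF = "GCTCTTC"      # sequence listed for hg19_nicksite.xlsx


def chromosome_label(chrm_id: int) -> str:
    """Map a numeric chrmID (1-24) to a label; 23 is X and 24 is Y."""
    if not 1 <= chrm_id <= 24:
        raise ValueError(f"chrmID out of range: {chrm_id}")
    return {23: "X", 24: "Y"}.get(chrm_id, str(chrm_id))


def pixels_to_bp(pixels: float, bp_per_pixel: float = NOMINAL_BP_PER_PIXEL) -> float:
    """Convert a distance in pixels to base pairs.

    Pass a measured value such as AvgBPP instead of the nominal 366 bp/pixel
    to use the alignment-based conversion for a given label pair.
    """
    return pixels * bp_per_pixel


def contour_length_nm(n_bp: float) -> float:
    """Length in nm of n_bp base pairs at 0.34 nm per bp."""
    return n_bp * NM_PER_BP


def find_motif_positions(sequence: str, motif: str = NICK_MOTIF) -> list[int]:
    """Return start offsets of every occurrence of motif in sequence.

    Assumptions: offsets are 0-based, overlapping matches are kept, and
    only the given strand is scanned (no reverse complement).
    """
    sequence = sequence.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions
```

For example, chromosome_label(23) gives "X", and find_motif_positions applied to a chromosome sequence yields a list comparable in spirit to one column of hg19_nicksite.xlsx, subject to the indexing assumption stated above.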