This codebook.txt file was generated on 2017-12-05 by wilsonkm

-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset 
Sequence-Dependent Persistence Length of Long DNA

2. Author Information

  Principal Investigator Contact Information
        Name: Hui-Min Chuang
           Institution: University of Minnesota
           Address: Department of Chemical Engineering and Materials Science, 421 Washington Ave SE, Minneapolis, Minnesota 55455
           Email: chuan077@umn.edu

  Associate or Co-investigator Contact Information
        Name: Jeffrey G. Reifenberger
           Institution: BioNano Genomics 
           Address: 9640 Towne Centre Drive, Suite 100, San Diego, California 92121
           Email: jreifenberger@bionanogenomics.com

  Associate or Co-investigator Contact Information
        Name: Han Cao
           Institution: BioNano Genomics
           Address: 9640 Towne Centre Drive, Suite 100, San Diego, California 92121
           Email: han@bionanogenomics.com

  Associate or Co-investigator Contact Information
           Name: Kevin D. Dorfman
           Institution: University of Minnesota
           Address: Department of Chemical Engineering and Materials Science, 421 Washington Ave SE, Minneapolis, Minnesota 55455
           Email: dorfman@umn.edu

3. Date of data collection: 2014-08-18 to 2014-08-20


4. Geographic location of data collection: N/A


5. Information about funding sources that supported the collection of the data: National Institutes of Health, grant R01-HG006851

--------------------------
SHARING/ACCESS INFORMATION
-------------------------- 

1. Licenses/restrictions placed on the data:
N/A

2. Links to publications that cite or use the data:
https://doi.org/10.1534/genetics.115.183483
https://doi.org/10.1038/sdata.2016.25
https://doi.org/10.1063/1.4907552
https://doi.org/10.1039/c5an00343a

3. Links to other publicly accessible locations of the data:


4. Links/relationships to ancillary data sets:


5. Was data derived from another source?
           If yes, list source(s):

6. Recommended citation for the data:
Chuang, Hui-Min; Reifenberger, Jeffrey G.; Cao, Han; Dorfman, Kevin D. (2017). Sequence-Dependent Persistence Length of Long DNA. Retrieved from the Data Repository for the University of Minnesota, http://hdl.handle.net/11299/191753.

---------------------
DATA & FILE OVERVIEW
---------------------

1. File List
   A. Filename: FIG2b.m
      Short description: This is the Matlab code to generate figure 2 in the main text of the paper. The corresponding data can be found in the file "xlsx_data.xlsx"-sheet 4.       
        
   B. Filename: FIG3a.m
      Short description: This is the Matlab code to generate figure 3a in the main text of the paper. The corresponding data can be found in the file "xlsx_data.xlsx"-sheet 5.      

   C. Filename: FIG3b_FIGS4_ANOVA.m        
      Short description: This is the Matlab code to apply the analysis of variance (ANOVA) and Tukey’s minimum significant difference (MSD) test to our data. The data needed for ANOVA and MSD can be found in the file "modified_081814_microlayer_878_sampleID_C2_1_36_pillarblast_stats_labelPairs_allPairs.txt" and in "xlsx_data.xlsx"-sheets 4, 5 and 6. The results can be used to generate figure 3b in the main text of the paper and figure 4 in the Supplemental Material.

   D. Filename: FIG4_FIGS11ab.m
      Short description: This is the Matlab code to generate figure 4 in the main text of the paper and figure 11 in the Supplemental Material. The data needed for the calculation can be found in the file "hg19_composition.xlsx"-sheet 1 and in "xlsx_data.xlsx"-sheets 4, 5 and 6.
        
   E. Filename: FIGS1.m
      Short description: This is the Matlab code to generate figure 1 in the Supplemental Material. The corresponding data can be found in the file "xlsx_data.xlsx"-sheet 2.       

   F. Filename: FIGS2.m       
      Short description: This is the Matlab code to generate figure 2 in the Supplemental Material. The corresponding data can be found in the file "xlsx_data_wo_unknown.xlsx".

   G. Filename: FIGS3.m
      Short description: This is the Matlab code to generate figure 3 in the Supplemental Material. The data needed for the re-sample ANOVA can be found in the files "modified_081814_microlayer_878_sampleID_C2_1_36_pillarblast_stats_labelPairs_allPairs.txt" and "hg19_nicksite.xlsx".

   H. Filename: FIGS5.m       
      Short description: This is the Matlab code to generate figure 5 in the Supplemental Material. The data needed for re-binning can be found in the file "modified_081814_microlayer_878_sampleID_C2_1_36_pillarblast_stats_labelPairs_allPairs.txt", and the results were saved in the file "revision_SI.xlsx"-sheets 7, 8 and 9.

   I. Filename: FIGS6.m
      Short description: This is the Matlab code to generate figure 6 in the Supplemental Material. The corresponding data can be found in the file "xlsx_data.xlsx"-sheet 7.        

   J. Filename: FIGS7.m       
      Short description: This is the Matlab code to generate figure 7 in the Supplemental Material. The corresponding data can be found in the file "xlsx_data.xlsx"-sheet 6.

   K. Filename: FIGS8.m
      Short description: This is the Matlab code to generate figure 8 in the Supplemental Material. The data needed for the calculation can be found in the file "xlsx_data.xlsx"-sheets 4, 5 and 6, and in "revision_SI.xlsx"-sheets 1, 2 and 3.

   L. Filename: FIGS9.m       
      Short description: This is the Matlab code to generate figure 9 in the Supplemental Material. The corresponding data needed for calculation can be found in the file "modified_081814_microlayer_878_sampleID_C2_1_36_pillarblast_stats_labelPairs_allPairs.txt".

   M. Filename: hg19_composition.xlsx
      Short description: This data file contains the fractions of the 10 dinucleotide steps at the different DNA lengths and % GC contents found in our experimental results. The results can be regenerated by the code saved in the folder "FIGS10".

   N. Filename: hg19_nicksite.xlsx     
      Short description: This data file gives the positions of the DNA sequence "GCTCTTC" (read in the 5' to 3' direction) on every chromosome. This sequence is recognized by the nicking enzyme Nt.BspQI.

   O. Filename: modified_081814_microlayer_878_sampleID_C2_1_36_pillarblast_stats_labelPairs_allPairs.txt
      Short description: Experimental data set  

   P. Filename: revision_SI.xlsx     
      Short description: The data file contains the analytical results used to generate figures 5 and 8 in the Supplemental Material.

   Q. Filename: xlsx_data_wo_unknown.xlsx
      Short description: The data file contains the experimental results to generate figure 2 in the Supplemental Material.

   R. Filename: xlsx_data.xlsx
      Short description: The data file contains the experimental and analytical results used to generate figures 2, 3 and 4 in the main text of the paper, and figures 1, 6 and 7 in the Supplemental Material.
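
Script C above applies a one-way ANOVA before the Tukey test. For readers who do not have Matlab, the F statistic behind a one-way ANOVA can be sketched in a few lines of Python. This is a generic textbook calculation, not a translation of FIG3b_FIGS4_ANOVA.m; the function name and the toy numbers are illustrative assumptions, and how the authors grouped their stretch data is described only in the paper and its Supplemental Material.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers.

    Generic one-way ANOVA; hypothetical helper, not the authors' code.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy data only; these are not values from the dataset.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom indicates that at least one group mean differs; the pairwise comparison itself is the job of the Tukey test in the Matlab script.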

2. Relationship between files: All relationships between files have been provided in the description of each file.       


3. Additional related data collected that was not included in the current data package: The complete hg19 chromosome sequences are available online and can be analyzed to reproduce the results provided in the file "hg19_nicksite.xlsx".


4. Are there multiple versions of the dataset? No
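
As noted for "hg19_nicksite.xlsx", the nick-site positions can in principle be recomputed from the hg19 sequence by searching for the recognition sequence "GCTCTTC". The Python sketch below shows one way to do that for a sequence held in memory. The function names are hypothetical, the 1-based coordinate convention is an assumption, and this codebook does not state whether the deposited file also includes sites on the reverse strand (which appear as "GAAGAGC" on the forward strand), so compare against the file before relying on either choice.

```python
MOTIF = "GCTCTTC"  # recognition sequence named in this codebook


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N stays N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_motif(seq, motif=MOTIF):
    """Return the start positions of every match of motif in seq.

    Positions are 1-based here (an assumption, not confirmed by the codebook).
    """
    seq = seq.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = seq.find(motif, start + 1)  # allow overlapping matches
    return positions


# Toy sequence only; real input would be an hg19 chromosome sequence.
toy = "AAGCTCTTCAAGAAGAGCTT"
forward_sites = find_motif(toy)
reverse_sites = find_motif(toy, reverse_complement(MOTIF))
```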


--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data: 
Please see Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevLett.119.227802 for additional information.


2. Methods for processing the data:
Please see Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevLett.119.227802 for additional information.

3. Instrument- or software-specific information needed to interpret the data:
Please see Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevLett.119.227802 for additional information.

4. Standards and calibration information, if appropriate:
Please see Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevLett.119.227802 for additional information.

5. Environmental/experimental conditions:
Please see Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevLett.119.227802 for additional information.

6. Describe any quality-assurance procedures performed on the data:
Please see Supplemental Material at http://link.aps.org/supplemental/10.1103/PhysRevLett.119.227802 for additional information.

7. People involved with sample collection, processing, analysis and/or submission: Hui-Min Chuang, Jeffrey G. Reifenberger, Han Cao, and Kevin D. Dorfman.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: hg19_composition.xlsx
-----------------------------------------

1. Number of variables: 12


2. Number of cases/rows: 


3. Missing data codes:
        Symbol: “NaN”    No data
        Symbol: “0”      No data


4. Variable List  

    A. Name: length
       Description: sequence length of an analyzed DNA sequence

    B. Name: AA/TT
       Description: dinucleotide step

    C. Name: AC/GT
       Description: dinucleotide step

    D. Name: AG/CT
       Description: dinucleotide step

    E. Name: AT
       Description: dinucleotide step

    F. Name: CA/TG
       Description: dinucleotide step

    G. Name: CC/GG
       Description: dinucleotide step

    H. Name: CG
       Description: dinucleotide step

    I. Name: GA/TC
       Description: dinucleotide step

    J. Name: GC
       Description: dinucleotide step

    K. Name: TA
       Description: dinucleotide step

    L. Name: GC content
       Description: guanine-cytosine content of an analyzed DNA sequence
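
The ten step names in the Variable List above arise because a dinucleotide on one strand and its reverse complement on the other strand describe the same physical step (for example, AC and GT). The Python sketch below illustrates how such fractions and the GC content could be tallied for a single sequence. It is an illustration only, not the code in the "FIGS10" folder; skipping pairs that contain N and returning GC content as a fraction rather than a percentage are assumptions.

```python
from collections import Counter

# Map each of the 16 dinucleotides to one of the 10 unique steps,
# using the step names from the Variable List above.
STEP_OF = {
    "AA": "AA/TT", "TT": "AA/TT",
    "AC": "AC/GT", "GT": "AC/GT",
    "AG": "AG/CT", "CT": "AG/CT",
    "AT": "AT",
    "CA": "CA/TG", "TG": "CA/TG",
    "CC": "CC/GG", "GG": "CC/GG",
    "CG": "CG",
    "GA": "GA/TC", "TC": "GA/TC",
    "GC": "GC",
    "TA": "TA",
}


def step_fractions(seq):
    """Return the fraction of each of the 10 steps in seq."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - 1):
        step = STEP_OF.get(seq[i:i + 2])  # None if the pair contains N
        if step is not None:
            counts[step] += 1
    total = sum(counts.values())
    names = sorted(set(STEP_OF.values()))
    return {n: (counts[n] / total if total else float("nan")) for n in names}


def gc_content(seq):
    """Return the G+C fraction among the A/C/G/T bases of seq."""
    seq = seq.upper()
    known = sum(seq.count(base) for base in "ACGT")
    gc = seq.count("G") + seq.count("C")
    return gc / known if known else float("nan")


# Toy sequence only; not taken from the dataset.
fractions = step_fractions("AACGT")
```

Multiply the value returned by gc_content by 100 if a percentage is needed for comparison with the "GC content" column.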

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: hg19_nicksite.xlsx
-----------------------------------------

1. Number of variables: 25


2. Number of cases/rows: 


3. Missing data codes:
        Symbol: “NaN”    No data
        Symbol: “0”      No data


4. Variable List

    A. Name: chrmID1
       Description: Human chromosome 1
                    

    B. Name: chrmID2
       Description: Human chromosome 2
                    

    C. Name: chrmID3
       Description: Human chromosome 3
    
                
    D. Name: chrmID4
       Description: Human chromosome 4

                    
    E. Name: chrmID5
       Description: Human chromosome 5

                    
    F. Name: chrmID6
       Description: Human chromosome 6

                    
    G. Name: chrmID7
       Description: Human chromosome 7

                    
    H. Name: chrmID8
       Description: Human chromosome 8
                    
    I. Name: chrmID9
       Description: Human chromosome 9

                    
    J. Name: chrmID10
       Description: Human chromosome 10

                    
    K. Name: chrmID11
       Description: Human chromosome 11

                    
    L. Name: chrmID12
       Description: Human chromosome 12

                    
    M. Name: chrmID13
       Description: Human chromosome 13

                    
    N. Name: chrmID14
       Description: Human chromosome 14

                    
    O. Name: chrmID15
       Description: Human chromosome 15

                    
    P. Name: chrmID16
       Description: Human chromosome 16

                    
    Q. Name: chrmID17
       Description: Human chromosome 17

                    
    R. Name: chrmID18
       Description: Human chromosome 18

                    
    S. Name: chrmID19
       Description: Human chromosome 19

                    
    T. Name: chrmID20
       Description: Human chromosome 20

                    
    U. Name: chrmID21
       Description: Human chromosome 21

                    
    V. Name: chrmID22
       Description: Human chromosome 22

                    
    W. Name: chrmIDX
       Description: Human chromosome X

                    
    X. Name: chrmIDY
       Description: Human chromosome Y

                    
    Y. Name: chrmID
       Description: Human chromosome 
                    
-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: modified_081814_microlayer_878_sampleID_C2_1_36_pillarblast_stats_labelPairs_allPairs.txt
-----------------------------------------

1. Number of variables: 17


2. Number of cases/rows: 


3. Missing data codes:
        Symbol: “NaN”    No data
        Symbol: “0”      No data

4. Variable List           

    A. Name: chrmID
       Description: the alignment chromosome number (23 is X, 24 is Y)
                   

    B. Name: NICKID1
       Description: nick site number for the 1st label
                    

    C. Name: NICKID2
       Description: nick site number for the 2nd label

                
    D. Name: ExpectedDis
       Description: Expected distance between the nick sites based on the reference.

                    
    E. Name: GCcontent
       Description: GC content based on the hg19 reference 

                    
    F. Name: numData
       Description: the number of pairs of labels on a molecule that aligned to that position in the reference.

                    
    G. Name: meanPairDist(bp)
       Description: average distance in bp measured between the pairs assuming 366 base pairs per pixel.

                    
    H. Name: StdDevPairDist(bp)
       Description: standard deviation between the pair of labels in bp (again assuming 366 bp/pixel)


    I. Name: meanScalePosition
       Description: the average scaled position at which the pair of labels is found in the molecule (-1 is the far left-hand side, +1 is the far right-hand side, and 0 is the middle of the molecule).

                   
    J. Name: StdDevScalePosition
       Description: the standard deviation of the position of the pair of labels within the molecule.

                    
    K. Name: AvgBPP
       Description: the actual bp/pixel value obtained from the alignment to the reference. All measurements are made in pixels and then converted to bp assuming 366 bp/pixel; in every experiment, however, there is some variation depending on salt concentration, channel size, etc.
    
                
    L. Name: StdDevBPP
       Description: standard deviation of the bp/pixel value.

                    
    M. Name: skewBPP
       Description: skew of the bp/pixel value.

                    
    N. Name: avgStretch
       Description: the average stretch measured between the labels assuming 0.34nm per bp.

                    
    O. Name: StdDevStretch
       Description: the standard deviation of the stretch


    P. Name: skewStretch
       Description: the skew of the stretch.
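
Several variables above depend on fixed conventions stated in their descriptions: chrmID uses 23 for X and 24 for Y, distances in bp assume 366 bp/pixel, and stretch assumes 0.34 nm per bp. The Python sketch below collects these conventions as small helpers. All function names are hypothetical, and rescale_bp in particular is only one plausible way to combine meanPairDist(bp) with AvgBPP; it is an assumption, not a procedure documented in this codebook.

```python
BP_PER_PIXEL = 366.0  # nominal conversion stated in this codebook
NM_PER_BP = 0.34      # rise per base pair stated in this codebook


def chromosome_label(chrm_id):
    """Map the numeric chrmID to a label (23 is X, 24 is Y)."""
    if chrm_id == 23:
        return "X"
    if chrm_id == 24:
        return "Y"
    if 1 <= chrm_id <= 22:
        return str(chrm_id)
    raise ValueError(f"unexpected chrmID: {chrm_id}")


def pixels_to_bp(pixels, bp_per_pixel=BP_PER_PIXEL):
    """Convert a distance in pixels to base pairs."""
    return pixels * bp_per_pixel


def contour_length_nm(bp):
    """Contour length in nm for a given number of base pairs."""
    return bp * NM_PER_BP


def rescale_bp(dist_bp_nominal, avg_bpp):
    """Re-express a 366 bp/pixel distance using a measured bp/pixel value.

    Assumption: undo the nominal conversion, then apply AvgBPP.
    """
    return dist_bp_nominal / BP_PER_PIXEL * avg_bpp


# Toy values only; not taken from the dataset.
label = chromosome_label(23)
length_nm = contour_length_nm(pixels_to_bp(10))
```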


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: xlsx_data_wo_unknown.xlsx
-----------------------------------------

1. Number of variables: 3


2. Number of cases/rows: 


3. Missing data codes:
        Symbol: “NaN”    No data
        Symbol: “0”      No data


4. Variable List


    A. Name: avgStretch
       Description: the average stretch measured between the labels assuming 0.34nm per bp.
                   

    B. Name: NumData
       Description: the number of pairs of labels on a molecule that aligned to that position in the reference.
                    

    C. Name: expectedDist
       Description: Expected distance between the nick sites based on the reference.

-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: revision_SI.xlsx
-----------------------------------------

1. Number of variables: 5


2. Number of cases/rows: 


3. Missing data codes:
        Symbol: “NaN”    No data
        Symbol: “0”      No data


4. Variable List


    A. Name: numofData_cut
       Description: the number of pairs of labels on a molecule that aligned to that position in the reference, after the quality cuts used for our results.
                   

    B. Name: avgStretch_cut
       Description: the average stretch measured between the labels assuming 0.34nm per bp with quality cuts for our results.
                    

    C. Name: Std_cut
       Description: the standard deviation of the stretch with quality cuts for our results.
    
                
    D. Name: numofData_# (# = 320 or 799)
       Description: the number of pairs of labels on a molecule that aligned to that position in the reference, with our experimental data binned into # bins.

                    
    E. Name: avgStretch_# (# = 320 or 799)
       Description: the average stretch measured between the labels assuming 0.34nm per bp with our experimental data binned by # bins.


    F. Name: Std_# (# = 320 or 799)
       Description: the standard deviation of the stretch with our experimental data binned by # bins.


-----------------------------------------
DATA-SPECIFIC INFORMATION FOR: xlsx_data.xlsx
-----------------------------------------

1. Number of variables: 8


2. Number of cases/rows: 


3. Missing data codes:
        Symbol: “NaN”    No data
        Symbol: “0”      No data


4. Variable List

    A. Name: numData_unknown
       Description: the number of pairs of labels on a molecule that aligned to that position in the reference, including aligned sequences that contain N-base regions.
                   

    B. Name: avgStretch_unknown
       Description: the average stretch measured between the labels assuming 0.34nm per bp with aligned sequences containing N-base regions.
                    

    C. Name: StdDtretch_SST_unknown
       Description: the standard deviation of the stretch with aligned sequences containing N-base regions.
    
                
    D. Name: numData_wounknown
       Description: the number of pairs of labels on a molecule that aligned to that position in the reference, excluding aligned sequences that contain N-base regions.

                    
    E. Name: avgStretch_wounknown
       Description: the average stretch measured between the labels assuming 0.34nm per bp without aligned sequences containing N-base regions.

                    
    F. Name: StdDtretch_SST_wounknown
       Description: the standard deviation of the stretch without aligned sequences containing N-base regions.

    G. Name: numData_avgStretchbin_GC
       Description: the average stretch measured between the labels vs. the number of pairs of labels on a molecule that aligned to that position in the reference, for different % GC contents between a pair of nick sites.

                    
    H. Name: numData_avgStretchbin_LE
       Description: the average stretch measured between the labels vs. the number of pairs of labels on a molecule that aligned to that position in the reference, for different sequence lengths between a pair of nick sites.
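
Every file-specific section of this codebook lists “NaN” and “0” as missing-data codes. When reading exported cell values outside Matlab, a small helper along the following lines can convert those codes to an explicit missing value. This is a sketch under assumptions: the function name and the zero_is_missing switch are hypothetical, and the switch exists because a zero can be a legitimate value in some columns (for example, meanScalePosition uses 0 for the middle of a molecule), so check each column before masking zeros.

```python
import math


def parse_cell(text, zero_is_missing=True):
    """Return a float, or None when the cell holds a missing-data code.

    "NaN" is always treated as missing; "0" is treated as missing only
    when zero_is_missing is True (an assumption to review per column).
    """
    text = text.strip()
    if text == "":
        return None
    value = float(text)  # float() accepts the string "NaN"
    if math.isnan(value):
        return None
    if zero_is_missing and value == 0.0:
        return None
    return value


# Toy cells only; not taken from the dataset.
cleaned = [parse_cell(cell) for cell in ["0.85", "NaN", "0", " 0.91 "]]
```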