Enlarge Practical DNA Storage Capacity: The Challenge and The Methodology
2023-12
Type
Thesis or Dissertation
Abstract
As global data generation continues to expand, traditional data storage, especially archival storage, is facing severe challenges. Typical storage media such as tape, HDD, and SSD provide neither sufficient storage density nor sufficient durability. DNA is therefore being explored as a promising medium for archiving digital data. Despite pioneering efforts demonstrating the feasibility of DNA-based data storage, the realization of a large-scale DNA storage system remains in its infancy. The theoretical potential for ultrahigh storage density in DNA is hindered by practical limitations, including various system overheads. This thesis examines the practically achievable capacity of DNA storage, acknowledging the ongoing rapid development of DNA storage-related biotechnologies. The study systematically reviews system overheads and the corresponding capacities based on current technologies. Key factors influencing DNA storage capacity, such as DNA strand length, encoding density, parallel factor, and the number of usable primers, are explored. Strand length, encoding density, and parallel factor are all constrained by current DNA synthesis, sequencing, and PCR technology. However, the most critical challenge lies in primer-payload collisions, which significantly reduce the number of usable primers. As a result, the storage capacity of a DNA tube falls short of expectations (e.g., terabytes per DNA tube), remaining below three hundred gigabytes.
To address the primer-payload collision issue and increase both the number of usable primers and the storage capacity, this study proposes and evaluates solutions from three perspectives. The first approach post-processes DNA payloads to remove collisions. Because a primer-payload collision is a pair of almost identical subsequences (longer than 12 bases) shared between a primer and a payload, the payload is changed in two respects: content (DNA mapping) and length (DNA cutting).
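The collision definition above (a pair of almost identical subsequences longer than 12 bases) can be illustrated with a minimal sketch. It is not the detection method used in the thesis: the window length `k=13`, the Hamming-distance tolerance `max_mismatches=1` standing in for "almost identical", and the function names are all assumptions made for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_collisions(primer: str, payload: str, k: int = 13,
                    max_mismatches: int = 1) -> list[tuple[int, int]]:
    """Return (primer_offset, payload_offset) pairs of near-identical k-mers.

    Assumption: "almost identical" is modeled as a Hamming distance of at
    most `max_mismatches` between two windows of length k. The thesis may
    use a different similarity criterion.
    """
    hits = []
    for i in range(len(primer) - k + 1):
        window = primer[i:i + k]
        for j in range(len(payload) - k + 1):
            if hamming(window, payload[j:j + k]) <= max_mismatches:
                hits.append((i, j))
    return hits


# Hypothetical 20-base primer and a payload that embeds 13 of its bases.
primer = "ACGTACGTTGCAATGCCGTA"
payload = "TTTT" + primer[3:16] + "GGGG"
hits = find_collisions(primer, payload)
```

This brute-force scan costs time proportional to the product of the two sequence lengths for every primer-payload pair, so it only illustrates the definition; a k-mer index would be the natural choice at scale.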
DNA cutting generates DNA strands of variable lengths so that a collided subsequence is split into two parts, which removes the original collision. Four optional payload lengths are selected based on the assumption that the current maximum payload length is 200. The combination of the four lengths can cut most collisions as long as the distance between collisions is long enough. A heuristic algorithm is proposed to determine which collision to cut when multiple collisions are densely grouped. DNA mapping maps an original DNA sequence to a new sequence so that the original collision is removed. Three carefully designed mappings are introduced that always satisfy the homopolymer and GC-content bioconstraints. A combination of DNA cutting and mapping is further discussed to exploit the advantages of both methods and offset their disadvantages. The evaluation shows that combining the two methods increases storage capacity by 25%.
After exploring post-processing, we further investigate the potential of a collision-resistant encoding scheme. An analysis of existing encoding schemes and their corresponding storage capacities indicates that it pays to trade a small amount of encoding density for better collision resistance. Therefore, we first summarize the potential collision-resistant patterns. We then develop an encoding scheme (i.e., CAC) that encodes DNA sequences with these patterns. CAC encodes the current DNA sequence based on its observation of previous DNA sequences so that the current sequence contains 1) no homopolymers, 2) fewer complementary sequences, and 3) more balanced GC content. Evaluation shows that CAC can almost double the number of usable primers, leading to a 50% increase in storage capacity. Further evaluation shows that CAC is also comparable to existing encoding schemes in terms of encoding speed and error recoverability.
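The homopolymer and GC-content constraints mentioned above can be checked with a few lines of code. The sketch below is only a generic constraint checker, not the CAC encoder or the mappings from the thesis; the thresholds (`max_run=3`, GC fraction between 0.4 and 0.6) are assumed values, since the abstract does not state the limits used.

```python
from itertools import groupby


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base (0 for an empty sequence)."""
    return max((len(list(run)) for _, run in groupby(seq)), default=0)


def satisfies_constraints(seq: str, max_run: int = 3,
                          gc_min: float = 0.4, gc_max: float = 0.6) -> bool:
    """Check both bioconstraints; the default thresholds are assumptions."""
    return (longest_homopolymer(seq) <= max_run
            and gc_min <= gc_content(seq) <= gc_max)
```

Setting `max_run=1` corresponds to the stricter "no homopolymers" property that the abstract attributes to CAC.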
In addition, a collision-aware data allocation scheme is proposed that allocates data to different tubes when the data collide with different sets of primers. With this allocation as a pre-processing step, primers disabled in one tube due to collisions remain usable in other tubes, so the overall storage capacity increases. To better fit the goal of fewer common collided primers among tubes (in other words, a minimum number of overall collided primers), a special clustering criterion is proposed together with a hierarchical clustering procedure. The evaluation shows a 20% increase in overall storage capacity. We also examine the influence of different chunk sizes. Chunk size, as the clustering granularity, affects not only clustering quality but also clustering speed and the amount of sequencing per file retrieval. A smaller chunk size yields better clustering quality but slower clustering, and requires more sequencing to retrieve a file because the file is split into many small chunks. The evaluation shows that a 4KB chunk is a reasonable size to balance these factors.
Finally, several directions for future work are discussed. First, by summarizing and comparing the potential improvements of different DNA storage-related technologies, we identify a potential breakthrough in relaxing primer design rules to generate more primers. Second, several potential enhancements are noted to better utilize the solutions proposed in this thesis, including speeding up CAC and better cooperation between pre [...]. Last, DNA sequence-level data compression and data deduplication for DNA storage are pointed out as a possibility. By minimizing redundancy in both DNA sequences and digital data, the number of synthesized DNA strands can be reduced. This approach accelerates the sluggish DNA synthesis process and reduces the overall synthesis cost.
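The idea behind collision-aware allocation can be sketched as a simple agglomerative (hierarchical) grouping of chunks by the primers they collide with. The abstract does not specify the clustering criterion used in the thesis, so the merge rule below (repeatedly merge the two groups that share the most collided primers) is an assumed stand-in, and the function name, chunk names, and primer names are hypothetical. Tube capacity limits are ignored.

```python
from itertools import combinations


def allocate_chunks(chunk_collisions: dict[str, set[str]],
                    n_tubes: int) -> list[tuple[list[str], set[str]]]:
    """Group chunks into at most n_tubes tubes.

    Each tube is (chunk_names, collided_primers). Merging two groups that
    share s collided primers lowers the summed per-tube collision count by
    s, so the largest overlap is merged first (an assumed criterion).
    """
    tubes = [([name], set(primers))
             for name, primers in chunk_collisions.items()]
    while len(tubes) > max(n_tubes, 1):
        # Pick the pair of groups with the largest shared collision set.
        i, j = max(combinations(range(len(tubes)), 2),
                   key=lambda ij: len(tubes[ij[0]][1] & tubes[ij[1]][1]))
        names_j, primers_j = tubes.pop(j)  # j > i, so index i stays valid
        tubes[i] = (tubes[i][0] + names_j, tubes[i][1] | primers_j)
    return tubes


# Hypothetical chunks and the primers each one collides with.
example = {
    "c1": {"p1", "p2"},
    "c2": {"p1", "p2", "p3"},
    "c3": {"p7", "p8"},
    "c4": {"p8", "p9"},
}
tubes = allocate_chunks(example, n_tubes=2)
total_collided = sum(len(primers) for _, primers in tubes)
```

In this example the chunks that collide with overlapping primers end up in the same tube, so primers p7 to p9 stay usable in the first tube and p1 to p3 stay usable in the second.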
Description
University of Minnesota Ph.D. dissertation. 2024. Major: Computer Science. Advisor: David Du. 1 computer file (PDF); 123 pages.
Suggested citation
Wei, Yixun. (2023). Enlarge Practical DNA Storage Capacity: The Challenge and The Methodology. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/262891.