High-Performance and Cost-Effective Storage Systems for Supporting Big Data Applications
Authors
Cao, Zhichao
Published Date
2020-07
Type
Thesis or Dissertation
Abstract
Entering the 21st century, innovation worldwide is shifting from being IT-driven (information technology-driven) to being DT-driven (data technology-driven). With the rapid development of social media, e-business, the Internet of Things (IoT), and millions of applications, an extremely large amount of data is created every day: we are in a big data era. Storage systems are the keystone of data persistence, reliability, availability, and flexibility in this era. Because of the diverse applications that generate and consume such large volumes of data, research on storage systems for big data is a wide-open field. Fast but expensive storage devices can deliver high performance, but storing an extremely large volume of data on them incurs very high costs. Designing and developing high-performance, cost-effective storage systems for big data applications therefore has a significant social and economic impact. This thesis focuses on improving the performance and cost-effectiveness of several storage systems for big data applications.

First, file systems are widely used to store the data generated by big data applications. As data scale and performance requirements increase, designing and implementing a file system with good performance and high cost-effectiveness is urgent but challenging. We propose and develop a tier-aware file system with data deduplication to satisfy these requirements and address these challenges. A fast but expensive storage tier and a slow tier with much larger capacity are managed by a single file system. Hot files are migrated to the fast tier to ensure high performance, while cold files are migrated to the slow tier to lower storage costs. Moreover, data deduplication is applied to eliminate content-level redundancy in both tiers, which saves additional storage space and reduces the data migration overhead.

Second, because of the extremely large scale of the storage systems that support big data applications, hardware failures, system errors, software failures, and even natural disasters happen more frequently and can cause serious social and economic damage, so improving the performance of recovering data from backups is important to lower the losses. Two studies in this thesis improve the restore performance of deduplicated data in backup systems from different perspectives; the main objective of both is to reduce storage reads during restore. In the first study, two different cache designs are integrated to restore data chunks with different localities, and the memory boundary between the two caches is adjusted adaptively as data locality varies. Cache hit ratios are thereby improved, which lowers the number of required storage reads. In the second study, to further address data chunk fragmentation, which causes storage reads that caching alone cannot avoid, a look-back-window-assisted data chunk rewrite scheme stores fragmented data chunks together during deduplication. With a very small space overhead, the rewrite scheme converts the multiple storage reads of fragmented data chunks into a single storage read, reducing the reads that caching schemes cannot eliminate.
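To make the rewrite idea more concrete, the following Python sketch shows one possible, simplified form of a look-back-window-based rewrite decision made during deduplication. The container abstraction, the window size, and all names here are illustrative assumptions for this sketch, not the actual design or implementation evaluated in the dissertation.

```python
# Hypothetical sketch of a look-back-window-assisted rewrite decision during
# deduplication; container layout, window size, and names are assumptions.
from collections import deque

CONTAINER_SIZE = 4 * 1024 * 1024   # assumed container capacity (4 MiB)
LOOK_BACK_WINDOW = 8               # assumed number of recent containers considered "close"

class DedupWriter:
    def __init__(self, storage):
        self.storage = storage                         # object exposing append(container_id, data)
        self.fingerprint_index = {}                    # chunk fingerprint -> container id
        self.recent = deque(maxlen=LOOK_BACK_WINDOW)   # ids of recently sealed containers
        self.container_id = 0
        self.container_used = 0

    def _append(self, data):
        """Append a chunk to the open container, sealing and rotating it when full."""
        if self.container_used + len(data) > CONTAINER_SIZE:
            self.recent.append(self.container_id)
            self.container_id += 1
            self.container_used = 0
        self.storage.append(self.container_id, data)
        self.container_used += len(data)
        return self.container_id

    def write_chunk(self, fingerprint, data):
        """Store, deduplicate, or rewrite one chunk of the incoming backup stream."""
        owner = self.fingerprint_index.get(fingerprint)
        if owner is None:
            # Unique chunk: store it and index its fingerprint.
            self.fingerprint_index[fingerprint] = self._append(data)
            return "stored"
        if owner == self.container_id or owner in self.recent:
            # Duplicate stored in (or near) the containers this stream is already
            # writing: a restore reads those containers anyway, so a normal
            # deduplication reference is enough.
            return "deduplicated"
        # Duplicate pointing to an old, scattered container: rewriting it next to
        # the current stream lets a restore fetch it with the same sequential read
        # instead of an extra random container read, at a small space cost.
        self.fingerprint_index[fingerprint] = self._append(data)
        return "rewritten"
```

Even in this simplified form, the decision captures the tradeoff described above: a small amount of duplicate data is rewritten so that a later restore replaces several scattered container reads with one sequential read.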
Third, in big data infrastructures, compute and storage clusters are disaggregated to achieve high availability, flexibility, and cost-effectiveness. However, disaggregation also generates a huge amount of network traffic between the storage and compute clusters, which can lead to performance penalties. To investigate these issues, we conduct a comprehensive study of the performance of HBase, a widely used distributed key-value store for big data applications, in a compute-storage disaggregated infrastructure. To address the observed penalties, we then propose an in-storage-computing-based architecture that offloads some of the I/O-intensive modules from the compute clusters to the storage clusters, effectively reducing the network traffic. These observations and explorations can help other big data applications relieve similar performance penalties in the new infrastructure.

Finally, designing and optimizing storage systems for big data applications requires a deep understanding of real-world workloads, yet few workload characterization studies exist because collecting and analyzing real-world workloads in big data infrastructures is challenging. To bridge this gap, we select three large-scale big data applications at Facebook that use RocksDB as their persistent key-value storage engine, and we characterize, model, and benchmark their RocksDB key-value workloads. To the best of our knowledge, this is the first study to characterize the workloads of persistent key-value stores in real big data systems. We provide deep insights into the workload characteristics and the correlations between storage system behaviors and big data application queries, and we show methodologies and techniques for making better tradeoffs between performance and cost-effectiveness in storage systems that support big data applications. We also investigate the limitations of existing benchmarks and propose a new benchmark that better simulates both application queries and storage behaviors.
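As a rough illustration of the kind of key-value workload characterization described above, the Python sketch below computes a query mix, an average value size, and the share of traffic absorbed by the hottest keys from a simple CSV trace. The trace format, field names, and the 1% hot-key cutoff are assumptions made for this example and are not the tracing or modeling tools developed in the dissertation.

```python
# Hypothetical characterization of a key-value trace with rows of the form
# "operation,key,value_size" (e.g., "Put,user:42,118"); the format is assumed.
import csv
from collections import Counter

def characterize(trace_path):
    ops = Counter()           # mix of Get/Put/Delete/... operations
    value_sizes = []          # value sizes observed on writes
    key_accesses = Counter()  # per-key access counts (hot/cold skew)

    with open(trace_path, newline="") as f:
        for op, key, value_size in csv.reader(f):
            ops[op] += 1
            key_accesses[key] += 1
            if op == "Put":
                value_sizes.append(int(value_size))

    total = sum(ops.values())
    hottest = key_accesses.most_common(max(1, len(key_accesses) // 100))
    return {
        "query_mix": {op: count / total for op, count in ops.items()},
        "avg_value_size": sum(value_sizes) / len(value_sizes) if value_sizes else 0,
        "top_1pct_key_share": sum(c for _, c in hottest) / total,
    }
```

Statistics like the query mix, the value-size distribution, and the traffic share of the hottest keys are exactly the properties a synthetic benchmark must reproduce if it is to approximate both the application queries and the storage behaviors they trigger.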
Description
University of Minnesota Ph.D. dissertation. 2020. Major: Computer Science. Advisor: David H.C. Du. 1 computer file (PDF); 248 pages.
Suggested citation
Cao, Zhichao. (2020). High-Performance and Cost-Effective Storage Systems for Supporting Big Data Applications. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/216408.