Browsing by Subject "storage system"
Item: High-Performance and Cost-Effective Storage Systems for Supporting Big Data Applications (2020-07), Cao, Zhichao

Entering the 21st century, innovation is shifting from being IT-driven (information technology-driven) to DT-driven (data technology-driven). With the rapid growth of social media, e-business, the Internet of Things (IoT), and millions of applications, an extremely large amount of data is created every day: we are in a big data era. Storage systems are the keystone for achieving data persistence, reliability, availability, and flexibility in this era. Because of the diversity of applications that generate and consume such large volumes of data, research on storage systems for big data is a wide-open field. Fast but expensive storage devices can deliver high performance, yet storing an extremely large volume of data on fast devices alone is prohibitively costly. Designing and developing high-performance, cost-effective storage systems for big data applications therefore has significant social and economic impact. In this thesis, we focus on improving the performance and cost-effectiveness of different storage systems for big data applications.

First, file systems are widely used to store the data generated by big data applications. As data scale and performance requirements grow, designing and implementing a file system with good performance and high cost-effectiveness is urgent but challenging. We propose and develop a tier-aware file system with data deduplication to meet these requirements. A fast but expensive storage tier and a slow tier with much larger capacity are managed by a single file system. Hot files are migrated to the fast tier to ensure high performance, and cold files are migrated to the slow tier to lower storage costs. Moreover, data deduplication is applied to eliminate content-level redundancy in both tiers, saving additional storage space and reducing data migration overhead.

Second, due to the extremely large scale of storage systems supporting big data applications, hardware failures, system errors, software failures, and even natural disasters happen more frequently and can cause serious social and economic damage. Improving the performance of recovering data from backups is therefore important for limiting such losses. Two studies in this thesis focus on improving the restore performance of deduplicated data in backup systems, from different perspectives; the main objective of both is to reduce storage reads during restore. In the first study, two different cache designs are integrated to restore data chunks with different localities, and the memory boundary between the two caches is adaptively adjusted as data locality varies. This improves cache hit ratios and lowers the number of storage reads required. In the second study, to further address chunk fragmentation, which causes storage reads that caching cannot avoid, a look-back-window-assisted chunk rewrite scheme stores fragmented chunks together during deduplication. With very small space overhead, the rewrite scheme converts the multiple storage reads needed for fragmented chunks into a single storage read, reducing the reads that caching schemes alone cannot eliminate.
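To make the adaptive-boundary idea in the first restore study more concrete, here is a minimal Python sketch of two restore caches sharing one memory budget, where the split shifts toward whichever cache has recently produced more hits. The class name, the slot-based accounting, and the `read_container` callback are illustrative assumptions, not the thesis's actual algorithm or interfaces.

```python
from collections import OrderedDict

class AdaptiveRestoreCache:
    """Toy sketch: a fixed budget split between a chunk cache and a container
    cache; the boundary moves toward whichever cache has been serving more
    hits, standing in for the locality-driven adjustment in the abstract."""

    def __init__(self, total_slots=1000, step=10):
        self.total = total_slots
        self.chunk_limit = total_slots // 2      # boundary: slots given to the chunk cache
        self.step = step                         # how far the boundary moves per adjustment
        self.chunk_cache = OrderedDict()         # fingerprint -> chunk data (LRU order)
        self.container_cache = OrderedDict()     # container id -> {fingerprint: chunk} (LRU order)
        self.chunk_hits = 0
        self.container_hits = 0

    def _evict(self, cache, limit):
        while len(cache) > max(limit, 1):
            cache.popitem(last=False)            # drop the least recently used entry

    def adjust_boundary(self):
        # Grow the cache that is currently serving more hits, within bounds.
        if self.chunk_hits > self.container_hits:
            self.chunk_limit = min(self.total - self.step, self.chunk_limit + self.step)
        elif self.container_hits > self.chunk_hits:
            self.chunk_limit = max(self.step, self.chunk_limit - self.step)
        self.chunk_hits = self.container_hits = 0
        self._evict(self.chunk_cache, self.chunk_limit)
        self._evict(self.container_cache, self.total - self.chunk_limit)

    def get_chunk(self, fingerprint, container_id, read_container):
        if fingerprint in self.chunk_cache:
            self.chunk_cache.move_to_end(fingerprint)
            self.chunk_hits += 1
            return self.chunk_cache[fingerprint]
        if container_id in self.container_cache:
            self.container_cache.move_to_end(container_id)
            self.container_hits += 1
            return self.container_cache[container_id][fingerprint]
        # Miss: one storage read fetches the whole container.
        container = read_container(container_id)
        self.container_cache[container_id] = container
        self._evict(self.container_cache, self.total - self.chunk_limit)
        chunk = container[fingerprint]
        self.chunk_cache[fingerprint] = chunk
        self._evict(self.chunk_cache, self.chunk_limit)
        return chunk
```

In such a sketch, a restore loop would call `get_chunk` for each recipe entry and invoke `adjust_boundary` every few thousand chunks, so the split tracks how locality changes across the backup stream.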
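The look-back-window rewrite in the second restore study can likewise be sketched in a few lines: while deduplicating a chunk stream, a duplicate whose container has not been referenced within a recent window is treated as fragmented and rewritten into the current container. The function below is a simplified illustration under assumed data structures (`index` as fingerprint-to-container map, `containers` as a dict of dicts), not the thesis's actual scheme, which would also bound container sizes and the rewrite budget.

```python
from collections import deque
import hashlib

def dedup_with_lookback_rewrite(chunks, index, containers, window_size=64):
    """Deduplicate a stream of raw chunks; rewrite fragmented duplicates into
    the currently open container so a later restore needs one read instead of
    many scattered reads. Returns the restore recipe."""
    window = deque(maxlen=window_size)            # recently referenced container ids
    current = max(containers, default=0) + 1      # open a fresh container
    containers[current] = {}
    recipe = []                                   # (fingerprint, container id) pairs

    for data in chunks:
        fp = hashlib.sha1(data).hexdigest()
        home = index.get(fp)
        if home is not None and home in window:
            # Duplicate with good locality: just reference the existing copy.
            recipe.append((fp, home))
            window.append(home)
        else:
            # New chunk, or a fragmented duplicate: (re)write it into the
            # current container at a small space cost.
            containers[current][fp] = data
            index[fp] = current
            recipe.append((fp, current))
            window.append(current)
    return recipe

# Example: the repeated chunk is deduplicated in place because its container
# is still inside the look-back window.
idx, ctrs = {}, {}
recipe = dedup_with_lookback_rewrite([b"a" * 4096, b"b" * 4096, b"a" * 4096], idx, ctrs)
```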
Third, in big data infrastructure, compute and storage clusters are disaggregated to achieve high availability, flexibility, and cost-effectiveness. However, disaggregation also generates a huge amount of network traffic between the storage and compute clusters, which leads to potential performance penalties. To investigate these issues, we conduct a comprehensive study of the performance of HBase, a widely used distributed key-value store for big data applications, in a compute-storage disaggregated infrastructure. To address the resulting performance penalties, we then propose an in-storage computing based architecture that offloads some of the I/O-intensive modules from the compute clusters to the storage clusters, effectively reducing network traffic. These observations and explorations can help other big data applications relieve similar performance penalties in the new infrastructure.

Finally, designing and optimizing storage systems for big data applications requires a deep understanding of real-world workloads, yet workload characterization studies are limited because collecting and analyzing real-world workloads in big data infrastructure is challenging. To bridge this gap, we select three large-scale big data applications at Facebook that use RocksDB as their persistent key-value storage engine and characterize, model, and benchmark their RocksDB key-value workloads. To the best of our knowledge, this is the first study to characterize the workloads of persistent key-value stores in real big data systems. We provide deep insights into the workload characteristics and the correlations between storage system behaviors and big data application queries, and we show methodologies and techniques for making better tradeoffs between performance and cost-effectiveness in storage systems supporting big data applications. We also investigate the limitations of existing benchmarks and propose a new benchmark that better simulates both application queries and storage behaviors.
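As a rough illustration of the kind of skewed key popularity and query mix such a workload model and benchmark must capture, the Python sketch below generates a synthetic key-value trace with Zipf-like key popularity and a fixed operation mix. The skew value, the GET/PUT/SEEK ratios, and the key format are assumptions for illustration, not measured RocksDB numbers from the thesis.

```python
import random

def zipf_key_sampler(num_keys, skew=0.99):
    """Return a sampler that draws key names with Zipf-like popularity:
    a handful of hot keys receive most of the traffic."""
    weights = [1.0 / (rank ** skew) for rank in range(1, num_keys + 1)]
    keys = [f"key{kid:08d}" for kid in range(num_keys)]
    return lambda n: random.choices(keys, weights=weights, k=n)

def synthetic_kv_workload(num_keys=100_000, num_ops=1_000_000,
                          get_ratio=0.8, put_ratio=0.15):
    """Yield (operation, key) pairs with an assumed GET/PUT/SEEK mix."""
    sample = zipf_key_sampler(num_keys)
    for key in sample(num_ops):
        r = random.random()
        if r < get_ratio:
            yield ("GET", key)
        elif r < get_ratio + put_ratio:
            yield ("PUT", key)
        else:
            yield ("SEEK", key)

# Example: replay the first few synthetic operations against a key-value store.
for op, key in list(synthetic_kv_workload(num_keys=1000, num_ops=5)):
    print(op, key)
```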
Item: Improving Storage Performance with Non-Volatile Memory-based Caching Systems (2017-04), Fan, Ziqi

With the rapid development of new types of non-volatile memory (NVRAM), e.g., 3D XPoint, NVDIMM, and STT-MRAM, these technologies have been or will be integrated into computer systems to work together with traditional DRAM. Compared with DRAM, which loses data when power fails or the system crashes, NVRAM's non-volatile nature makes it a better candidate for caching. Meanwhile, storage performance must keep up with the rapidly growing amount of data generated around the world (the big data problem). Throughout my Ph.D. research, I have focused on building novel NVRAM-based caching systems that provide cost-effective ways to improve storage system performance. To show the benefits of such designs, I target four representative storage devices and systems: solid state drives (SSDs), hard disk drives (HDDs), disk arrays, and high-performance computing (HPC) parallel file systems (PFSs).

For SSDs, to mitigate their wear-out problem and extend their lifespan, we propose two NVRAM-based buffer cache policies that work together at different layers to maximally reduce SSD write traffic: a main memory buffer cache design named Hierarchical Adaptive Replacement Cache (H-ARC) and an internal SSD write buffer design named Write Traffic Reduction Buffer (WRB). H-ARC considers four factors (dirty, clean, recency, and frequency) to reduce write traffic and improve cache hit ratios on the host, while WRB further reduces block erasures and write traffic inside the SSD by effectively exploiting temporal and spatial locality.

For HDDs, to exploit their fast sequential access speed and improve I/O throughput, we propose a buffer cache policy named I/O-Cache that regroups and synchronizes long runs of consecutive dirty pages, taking advantage of both sequential HDD access and the non-volatility of NVRAM. In addition, the policy dynamically separates the cache into a dirty cache and a clean cache, according to workload characteristics, to decrease storage writes.

For disk arrays, although numerous cache policies have been proposed, most either target main memory buffer caches or manage NVRAM as a write buffer while separately managing DRAM as a read cache. To the best of our knowledge, cooperative hybrid volatile and non-volatile memory buffer cache policies specifically designed for storage systems using newer NVRAM technologies have not been well studied. Based on our study of storage server block I/O traces, we propose a novel cooperative HybrId NVRAM and DRAM Buffer cACHe polIcy for storage arrays, named Hibachi. Hibachi treats read cache hits and write cache hits differently to maximize cache hit rates and judiciously adjusts the clean and dirty cache sizes to capture workload tendencies. In addition, it converts random writes to sequential writes for high disk write throughput and further exploits storage server I/O workload characteristics to improve read performance.

For modern, complex HPC systems (e.g., supercomputers), the data generated during checkpointing is bursty and dominates HPC I/O traffic to such an extent that relying solely on PFSs slows down the whole system. To increase HPC checkpointing speed, we propose an NVRAM-based burst buffer coordination system for PFSs, named collaborative distributed burst buffer (CDBB). Inspired by our observations of HPC application execution patterns and experiments on HPC clusters, we design CDBB to coordinate all available burst buffers, based on their priorities and states, to help overburdened burst buffers and maximize resource utilization.
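To illustrate the dirty/clean/recency/frequency intuition behind a host-side cache like H-ARC, the sketch below keeps separate dirty and clean LRU lists plus ghost lists of recent evictions and uses ghost hits to nudge the boundary, so dirty pages tend to stay cached longer and writebacks to the SSD are deferred. This is a simplified illustration in that spirit, not the actual H-ARC algorithm; the class and counters are assumed names.

```python
from collections import OrderedDict

class DirtyCleanAdaptiveCache:
    """Simplified dirty/clean adaptive cache: ghost hits steer the target
    split between the two sides (ARC-style), and dirty pages are favored so
    fewer writebacks reach the SSD. Illustrative only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.target_dirty = capacity // 2          # adaptive boundary
        self.dirty = OrderedDict()                 # page -> data (LRU order)
        self.clean = OrderedDict()
        self.ghost_dirty = OrderedDict()           # recently evicted dirty pages (keys only)
        self.ghost_clean = OrderedDict()
        self.writebacks = 0                        # SSD writes caused by evictions

    def _trim_ghost(self, ghost):
        while len(ghost) > self.capacity:
            ghost.popitem(last=False)

    def _evict_one(self):
        # Evict from the side that exceeds its target; flushing a dirty page
        # costs one SSD write, so dirty pages are kept when possible.
        if len(self.dirty) > self.target_dirty or not self.clean:
            page, _ = self.dirty.popitem(last=False)
            self.writebacks += 1
            self.ghost_dirty[page] = None
            self._trim_ghost(self.ghost_dirty)
        else:
            page, _ = self.clean.popitem(last=False)
            self.ghost_clean[page] = None
            self._trim_ghost(self.ghost_clean)

    def access(self, page, data=None, is_write=False):
        """Caller supplies the written (or fetched) data; returns True on a hit."""
        for side in (self.dirty, self.clean):
            if page in side:
                value = side.pop(page)
                if is_write:
                    self.dirty[page] = data        # write hit: page becomes dirty
                elif side is self.dirty:
                    self.dirty[page] = value       # read hit on a dirty page
                else:
                    self.clean[page] = value       # read hit on a clean page
                return True
        # Ghost hits mean that side was evicted too aggressively: grow it.
        if page in self.ghost_dirty:
            self.target_dirty = min(self.capacity, self.target_dirty + 1)
            del self.ghost_dirty[page]
        elif page in self.ghost_clean:
            self.target_dirty = max(0, self.target_dirty - 1)
            del self.ghost_clean[page]
        # Miss: make room, then insert on the appropriate side.
        if len(self.dirty) + len(self.clean) >= self.capacity:
            self._evict_one()
        (self.dirty if is_write else self.clean)[page] = data
        return False
```

Replaying a block trace through `access` and comparing `writebacks` against a plain LRU is the kind of experiment such a policy is evaluated with.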
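The I/O-Cache idea of regrouping consecutive dirty pages can be boiled down to finding the longest run of consecutive page numbers among the dirty pages and flushing it as one sequential write, which an HDD serves far faster than the equivalent random writes. The helper below is only that core step, under the assumption that dirty pages are tracked as a set of page numbers; the real policy also decides when to synchronize and how to split dirty and clean space.

```python
def longest_dirty_run(dirty_pages):
    """Return (start_page, run_length) of the longest run of consecutive
    page numbers in `dirty_pages`, the candidate for one sequential flush."""
    pages = sorted(dirty_pages)
    best_start, best_len = None, 0
    run_start, run_len = None, 0
    prev = None
    for p in pages:
        if prev is not None and p == prev + 1:
            run_len += 1                 # run continues
        else:
            run_start, run_len = p, 1    # new run begins
        if run_len > best_len:
            best_start, best_len = run_start, run_len
        prev = p
    return best_start, best_len

# Example: pages 7..10 form the longest run, so one sequential write covers them.
start, length = longest_dirty_run({3, 7, 8, 9, 10, 42})
assert (start, length) == (7, 4)
```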
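Finally, the coordination step in a CDBB-like system reduces, at its simplest, to routing each checkpoint write to an available burst buffer chosen by priority and current state. The function below is a stand-in for that decision; the buffer fields (`busy`, `free`, `priority`) and the fallback behavior are assumptions for illustration, not CDBB's actual protocol.

```python
def pick_burst_buffer(buffers, data_size):
    """Pick a burst buffer for a checkpoint write of `data_size` bytes:
    prefer idle buffers with enough free space, ranked by priority and then
    by remaining capacity; return None to fall back to writing to the PFS."""
    candidates = [b for b in buffers if not b["busy"] and b["free"] >= data_size]
    if not candidates:
        return None
    return max(candidates, key=lambda b: (b["priority"], b["free"]))

# Example: the idle, higher-priority buffer with enough space is selected.
buffers = [
    {"name": "bb0", "busy": True,  "free": 8 << 30, "priority": 2},
    {"name": "bb1", "busy": False, "free": 4 << 30, "priority": 1},
    {"name": "bb2", "busy": False, "free": 2 << 30, "priority": 3},
]
assert pick_burst_buffer(buffers, 1 << 30)["name"] == "bb2"
```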