Hot and cold data identification: applications to storage devices and systems.

Park, Dongchul
2012-11-14; 2012-11-14; 2012-08
https://hdl.handle.net/11299/139066
University of Minnesota Ph.D. dissertation. August 2012. Major: Computer Science. Advisor: Professor David H.C. Du. 1 computer file (PDF); ix, 130 pages.

Hot data identification is critically important in storage systems: it strongly affects overall performance and has the potential to apply to many other fields. Nevertheless, it has received little investigation. In this dissertation, I propose two novel hot data identification schemes: (1) a multiple bloom filter-based scheme and (2) a sampling-based scheme. I then apply them to storage devices and systems, namely Solid State Drives (SSDs) and a data deduplication system. The multiple bloom filter-based scheme adopts multiple bloom filters and hash functions to efficiently capture finer-grained recency as well as frequency information by assigning a different recency coverage to each bloom filter. The sampling-based scheme employs a sampling mechanism that discards some cold items early, reducing runtime overhead and wasted memory. Both schemes identify hot data in storage precisely and efficiently while using fewer system resources. Based on these approaches, I choose two storage fields as applications: NAND flash-based SSD design and data deduplication systems. In SSD design in particular, hot data identification has a critical impact on performance (through garbage collection) and on life span (through wear leveling). To address these issues, I propose a new hybrid Flash Translation Layer (FTL) design; the FTL is a core part of an SSD. The proposed FTL, named CFTL, adapts to data access patterns with the help of the multiple bloom filter-based hot data identification algorithm. As the other application, I explore a data deduplication storage system.
Data deduplication (dedupe, for short) is a specialized data compression technique that has been widely adopted, especially in backup storage systems, to save both backup time and storage space. Unlike traditional dedupe research, which has focused mostly on improving write performance, I address read performance. In this part, I design a new read cache for dedupe storage in a backup application; it improves read performance by looking ahead at future references within a moving window, in combination with a hot data identification algorithm. This dissertation demonstrates the importance of hot data identification in storage and shows how it can be applied effectively to overcome existing limitations in each storage area.

en-US
Bloom filter; Data deduplication; Flash memory; Hot data; SSD; Storage
Thesis or Dissertation
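
The abstract's idea of giving each bloom filter a different recency coverage can be illustrated with a minimal Python sketch. This is not the dissertation's algorithm. The class names, the SHA-256-based hashing, the filter count, the decay interval, the linear recency weights, and the hot threshold are all assumptions chosen for illustration. Repeated writes to the same logical block address (LBA) are recorded in successive filters, which captures frequency. Periodically clearing the oldest filter and weighting newer filters more heavily captures recency.

```python
import hashlib


class BloomFilter:
    """A small bloom filter; sizes and hashing are illustrative assumptions."""

    def __init__(self, num_bits=2048, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by seeding SHA-256 differently.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

    def clear(self):
        self.bits = bytearray(self.num_bits)


class MultiBloomHotDataIdentifier:
    """Hypothetical hot data identifier built from several bloom filters."""

    def __init__(self, num_filters=4, decay_interval=512, hot_threshold=2.0):
        self.filters = [BloomFilter() for _ in range(num_filters)]
        self.decay_interval = decay_interval  # writes between decay steps
        self.hot_threshold = hot_threshold
        self.writes = 0
        self.newest = 0  # index of the filter covering the latest interval

    def record_write(self, lba):
        n = len(self.filters)
        # Record the write in the newest filter that does not yet hold the
        # LBA, so repeated writes spread across filters (frequency).
        for age in range(n):
            bloom = self.filters[(self.newest - age) % n]
            if lba not in bloom:
                bloom.add(lba)
                break
        self.writes += 1
        if self.writes % self.decay_interval == 0:
            # Decay: the oldest filter is cleared and becomes the newest.
            self.newest = (self.newest + 1) % n
            self.filters[self.newest].clear()

    def hotness(self, lba):
        n = len(self.filters)
        # Newer filters weigh more (assumed weights: 1, 1 - 1/n, ...).
        return sum(
            1.0 - age / n
            for age in range(n)
            if lba in self.filters[(self.newest - age) % n]
        )

    def is_hot(self, lba):
        return self.hotness(lba) >= self.hot_threshold


identifier = MultiBloomHotDataIdentifier()
for lba in [42, 42, 42, 7]:
    identifier.record_write(lba)
print(identifier.is_hot(42), identifier.is_hot(7))
```

Because bloom filters admit false positives, a cold block may occasionally score higher than it should. In this sketch, the filter size and hash count control that error rate.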