Hot and cold data identification: applications to storage devices and systems.

Thesis or Dissertation


Hot data identification is of paramount importance in storage systems: it has a great impact on their overall performance and has considerable potential for application in many other fields. However, it has received little investigation. In this dissertation, I propose two novel hot data identification schemes: (1) a multiple Bloom filter-based scheme and (2) a sampling-based scheme. I then apply them to storage devices and systems, specifically Solid State Drives (SSDs) and data deduplication systems. The multiple Bloom filter-based scheme adopts multiple Bloom filters and hash functions to efficiently capture finer-grained recency as well as frequency information by assigning a different recency coverage to each Bloom filter. The sampling-based scheme employs a sampling mechanism that discards some cold items early, reducing runtime overhead and wasted memory. Both schemes identify hot data in storage precisely and efficiently while using fewer system resources.

Based on these approaches, I choose two storage areas as applications: NAND flash-based SSD design and data deduplication systems. In SSD design in particular, hot data identification has a critical impact on performance (through garbage collection) and on life span (through wear leveling). To address these issues, I propose a new hybrid Flash Translation Layer (FTL), a core component of SSD design. The proposed FTL, named CFTL, adapts to data access patterns with the help of the multiple Bloom filter-based hot data identification algorithm.

As the other application, I explore a data deduplication storage system. Data deduplication (dedupe for short) is a specialized data compression technique that has been widely adopted, especially in backup storage systems, to save both backup time and storage space. Unlike traditional dedupe research, which has focused more on improving write performance, I address read performance. In this part, I design a new read cache for dedupe storage in a backup application; it improves read performance by looking ahead at future references within a moving window, in combination with a hot data identification algorithm. This dissertation demonstrates the importance of hot data identification in storage and shows how it can be applied effectively to overcome existing limitations in each storage area.
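
To make the multiple Bloom filter idea in the abstract concrete, the following is a minimal Python sketch. It is not the dissertation's implementation. The class names, the round-robin decay, the linear recency weights, and every parameter value (`num_bits`, `num_hashes`, `num_filters`, `decay_interval`, `hot_threshold`) are assumptions chosen for illustration. The sketch treats the number of filters that contain a block address as a rough frequency count and gives newer filters a larger weight to approximate recency.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over integer keys (e.g. logical block addresses)."""

    def __init__(self, num_bits=2048, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

    def clear(self):
        self.bits = bytearray(self.num_bits)


class MultiBloomHotDetector:
    """Hypothetical hot data detector built from several Bloom filters."""

    def __init__(self, num_filters=4, decay_interval=512, hot_threshold=2.0):
        self.filters = [BloomFilter() for _ in range(num_filters)]
        self.decay_interval = decay_interval  # writes between two decays
        self.hot_threshold = hot_threshold
        self.oldest = 0  # index of the filter that is cleared next
        self.writes = 0

    def _by_age(self):
        """Yield (filter, recency_weight) pairs from newest to oldest."""
        n = len(self.filters)
        for age in range(n):
            index = (self.oldest - 1 - age) % n
            yield self.filters[index], (n - age) / n  # assumed linear weights

    def record_write(self, lba):
        # Frequency: add the key to the newest filter that lacks it, so the
        # number of filters holding the key grows with repeated writes.
        for bloom, _ in self._by_age():
            if lba not in bloom:
                bloom.add(lba)
                break
        self.writes += 1
        # Recency: periodically erase the oldest filter, which then becomes
        # the newest one, so old accesses gradually stop counting.
        if self.writes % self.decay_interval == 0:
            self.filters[self.oldest].clear()
            self.oldest = (self.oldest + 1) % len(self.filters)

    def hotness(self, lba):
        return sum(weight for bloom, weight in self._by_age() if lba in bloom)

    def is_hot(self, lba):
        return self.hotness(lba) >= self.hot_threshold


detector = MultiBloomHotDetector()
for _ in range(3):
    detector.record_write(42)  # repeatedly written block
detector.record_write(7)       # block written once
print(detector.is_hot(42), detector.is_hot(7))
```

Because Bloom filters can report false positives, a sketch like this may occasionally overestimate hotness; it trades that small error for a fixed, small memory footprint instead of a per-block counter table.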


University of Minnesota Ph.D. dissertation. August 2012. Major: Computer Science. Advisor: Professor David H.C. Du. 1 computer file (PDF); ix, 130 pages.

Suggested citation

Park, Dongchul. (2012). Hot and cold data identification: applications to storage devices and systems. Retrieved from the University Digital Conservancy,
