Hot data identification is of paramount importance in storage systems because it
strongly affects their overall performance and has great potential to be applied in
many other fields. However, it has received little attention so far. In this
dissertation, I propose two novel hot data identification schemes: (1) a multiple Bloom
filter-based scheme and (2) a sampling-based scheme. I then apply them to storage
devices and systems, namely Solid State Drives (SSDs) and a data deduplication system.
In the multiple Bloom filter-based hot data identification scheme, I adopt multiple
Bloom filters and hash functions to efficiently capture finer-grained recency as well as
frequency information by assigning a different recency coverage to each Bloom filter.
The sampling-based scheme employs a sampling mechanism that discards some of the
cold items early, reducing runtime overhead and wasted memory space.
Both schemes identify hot data in storage precisely and efficiently while consuming
fewer system resources.
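
The multiple Bloom filter idea can be illustrated with a minimal sketch. It is not the
dissertation's exact algorithm: the class name MultiBloomHotDetector, all parameter
values, the round-robin insertion, the periodic erasure of one filter, and the linear
recency weights are assumptions chosen only to show how several filters with different
recency coverage can encode both frequency and recency.

```python
import hashlib


class MultiBloomHotDetector:
    """Illustrative sketch only; names, parameters and weights are hypothetical."""

    def __init__(self, num_filters=4, bits=2048, num_hashes=2,
                 decay_interval=512, hot_threshold=2.0):
        # One byte per "bit" keeps the sketch simple (not memory-optimal).
        self.filters = [bytearray(bits) for _ in range(num_filters)]
        self.bits = bits
        self.num_hashes = num_hashes
        self.decay_interval = decay_interval
        self.hot_threshold = hot_threshold
        self.next_insert = 0   # round-robin starting point for insertion
        self.oldest = 0        # filter that will be erased next
        self.requests = 0

    def _positions(self, key):
        # Derive num_hashes bit positions from one SHA-256 digest.
        digest = hashlib.sha256(str(key).encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits
                for i in range(self.num_hashes)]

    @staticmethod
    def _contains(bloom, positions):
        return all(bloom[p] for p in positions)

    def record(self, key):
        """Record one access: set the key in the first filter lacking it."""
        positions = self._positions(key)
        n = len(self.filters)
        for step in range(n):
            idx = (self.next_insert + step) % n
            if not self._contains(self.filters[idx], positions):
                for p in positions:
                    self.filters[idx][p] = 1
                break
        self.next_insert = (self.next_insert + 1) % n
        self.requests += 1
        # Periodically erase the oldest filter so stale accesses fade out.
        if self.requests % self.decay_interval == 0:
            self.filters[self.oldest] = bytearray(self.bits)
            self.oldest = (self.oldest + 1) % n

    def hotness(self, key):
        """Sum recency weights of all filters that contain the key."""
        positions = self._positions(key)
        n = len(self.filters)
        score = 0.0
        for age in range(n):
            # age 0 is the filter erased next (oldest coverage, lowest weight).
            idx = (self.oldest + age) % n
            if self._contains(self.filters[idx], positions):
                score += (age + 1) / n
        return score

    def is_hot(self, key):
        return self.hotness(key) >= self.hot_threshold


detector = MultiBloomHotDetector()
for _ in range(4):
    detector.record(42)   # frequently written block address
detector.record(7)        # block address written once
```

In this sketch, the number of filters containing a key approximates its access
frequency, while the per-filter weights approximate recency, so a key must be both
frequently and recently accessed to pass the threshold.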
Based on these approaches, I choose two storage fields as their applications: NAND
flash-based SSD design and data deduplication systems. In SSD design in particular, hot
data identification has a critical impact on performance (due to garbage collection)
as well as on life span (due to wear leveling). To address these issues in SSD design,
I propose a new hybrid Flash Translation Layer (FTL) design; the FTL is a core
component of an SSD. The proposed FTL (named CFTL) adapts to data access patterns
with the help of the multiple Bloom filter-based hot data identification algorithm.
As the other application, I explore a data deduplication storage system. Data
deduplication (dedupe for short) is a specialized data compression technique that has
been widely adopted, especially in backup storage systems, to save both backup time
and storage space. Unlike traditional dedupe research, which has focused more on
improving write performance, I address read performance. Specifically,
I design a read cache in dedupe storage for a backup application that improves read
performance by looking ahead at future references within a moving window, in
combination with a hot data identification algorithm.
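
The look-ahead idea can be sketched as a small cache simulation. This is only an
illustration under stated assumptions, not the dissertation's cache design: the
function name simulate_lookahead_cache, the eviction rule, and the use of a plain
access counter as a stand-in for the hot data identification algorithm are all
hypothetical.

```python
from collections import Counter


def simulate_lookahead_cache(refs, capacity, window):
    """Count cache hits for a reference sequence known in advance.

    On eviction, the victim is the cached item whose next reference inside
    the look-ahead window is farthest away (items not referenced in the
    window count as farthest). Ties go to the least frequently accessed
    item, a simple stand-in for a hot data identifier.
    """
    cache = set()
    seen = Counter()
    hits = 0
    for i, chunk in enumerate(refs):
        seen[chunk] += 1
        if chunk in cache:
            hits += 1
            continue
        if len(cache) >= capacity:
            upcoming = refs[i + 1:i + 1 + window]

            def next_use(c):
                return upcoming.index(c) if c in upcoming else window

            # sorted() makes remaining ties deterministic.
            victim = max(sorted(cache),
                         key=lambda c: (next_use(c), -seen[c]))
            cache.remove(victim)
        cache.add(chunk)
    return hits


refs = ["a", "b", "c", "a", "b", "d", "a", "b"]
hits_with_window = simulate_lookahead_cache(refs, capacity=2, window=4)
hits_without_window = simulate_lookahead_cache(refs, capacity=2, window=0)
```

Setting window to 0 disables the look-ahead, which makes it easy to compare hit counts
with and without knowledge of upcoming references on the same sequence.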
This dissertation addresses the importance of hot data identification in storage
systems and shows how it can be applied effectively to overcome the existing
limitations in each storage area.