Browsing by Subject "Data deduplication"
Now showing 1 - 3 of 3
Item An efficient data deduplication design with flash-memory based solid state drive (2012-01) Lu, Guanlin

Today, a predominant portion of Internet services (e.g., content delivery networks, online backup storage, news broadcasting, blog sharing, and social networks) are data centric. These services generate a significant amount of new data every day, and a large portion of it is redundant. Data deduplication is a prevailing technique used to identify and eliminate redundant data, so as to reduce the space requirement for both primary file systems and data backups. The variety of objectives in a deduplication system design is the primary interest of this dissertation. These objectives include maximizing the amount of redundant data removed and achieving high deduplication read/write throughput with minimum RAM overhead per chunk. To achieve the first objective, this dissertation proposes a novel chunking algorithm that breaks the input dataset into chunks with higher redundancy or larger sizes, so as to identify more duplicate data without producing a larger number of chunks than other chunking algorithms. To achieve high deduplication throughput while minimizing RAM overhead per chunk, this dissertation proposes a RAM-frugal chunk index design along with a chunk filter that screens out index lookups on nonexistent chunks; both ideas are sketched below. The index and filter designs both make efficient use of a very limited RAM space, with flash memory as persistent storage. In particular, the proposed chunk filter design can dynamically scale up to adapt to the growth of the dataset, and the proposed chunk index design achieves high-throughput, low-latency chunk lookup/insert operations with extremely low RAM overhead, at the sub-byte-per-chunk level.
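The abstract does not reproduce the dissertation's chunking algorithm, but the general shape of such a stage is content-defined chunking: a rolling hash is computed over a sliding window, and a chunk boundary is declared wherever the hash matches a bit pattern. A minimal Python sketch follows; the window size, mask, size bounds, and hash constants are all illustrative assumptions, not the dissertation's values.

```python
# Illustrative constants (assumptions, not the dissertation's parameters).
WINDOW = 48              # rolling-hash window in bytes
MASK = (1 << 13) - 1     # cut when the low 13 hash bits are zero (~8 KiB average)
MIN_SIZE = 2 * 1024      # lower bound on chunk size
MAX_SIZE = 64 * 1024     # upper bound on chunk size
PRIME = 153191
MOD = 1 << 32

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    start, h = 0, 0
    top = pow(PRIME, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # slide the window forward
        h = (h * PRIME + byte) % MOD
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            yield start, i + 1         # content-defined cut point
            start, h = i + 1, 0
    if start < len(data):
        yield start, len(data)         # trailing partial chunk
```

Because boundaries depend on content rather than fixed offsets, an insertion early in a file shifts only the chunks around the edit, so most of the remaining chunks still hash to previously stored duplicates.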
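The chunk filter described above plays the classic role of a Bloom filter: a compact in-RAM structure with no false negatives, so a negative answer safely skips the lookup in the on-flash chunk index. A minimal sketch under that assumption (the dissertation's filter additionally scales dynamically with the dataset, which this fixed-size version does not attempt):

```python
import hashlib

class ChunkFilter:
    """Fixed-size Bloom filter over chunk fingerprints."""

    def __init__(self, m_bits: int = 1 << 20, k: int = 4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, fingerprint: bytes):
        # Derive k bit positions from the fingerprint.
        for i in range(self.k):
            h = hashlib.sha1(bytes([i]) + fingerprint).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, fingerprint: bytes) -> None:
        for p in self._positions(fingerprint):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, fingerprint: bytes) -> bool:
        # False means "certainly absent": the index lookup can be skipped.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(fingerprint))
```

Only when might_contain returns True does the deduplicator pay for a chunk-index lookup; a False identifies the chunk as new, which is the common case for first-time data.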
Item Hot and cold data identification: applications to storage devices and systems (2012-08) Park, Dongchul

Hot data identification is an issue of paramount importance in storage systems: it has a great impact on their overall performance and holds considerable potential for application to many other fields, yet it has received comparatively little investigation. In this dissertation, I propose two novel hot data identification schemes: (1) a multiple bloom filter-based scheme and (2) a sampling-based scheme. I then apply them to storage devices and systems, namely Solid State Drives (SSDs) and a data deduplication system. The multiple bloom filter-based scheme adopts multiple bloom filters and hash functions to efficiently capture finer-grained recency as well as frequency information by assigning a different recency coverage to each bloom filter (a sketch of this scheme appears after the third item below). The sampling-based scheme employs a sampling mechanism that discards some cold items early, reducing runtime overhead and wasted memory space. Both schemes identify hot data in storage precisely and efficiently with fewer system resources. Based on these approaches, I choose two storage fields as applications: NAND flash-based SSD design and a data deduplication system. In SSD design in particular, hot data identification has a critical impact on performance (due to garbage collection) as well as life span (due to wear leveling). To address these issues, I propose a new hybrid Flash Translation Layer (FTL) design, the FTL being a core part of an SSD. The proposed FTL (named CFTL) adapts to data access patterns with the help of the multiple bloom filter-based hot data identification algorithm. As the other application, I explore a data deduplication storage system. Data deduplication (dedupe, for short) is a special data compression technique that has been widely adopted, especially in backup storage systems, to save both backup time and storage. Unlike traditional dedupe research, which has focused more on improving write performance, I address the read performance aspect: I design a new read cache in dedupe storage for a backup application that improves read performance by looking ahead at future references within a moving window, in combination with a hot data identification algorithm (a related sketch also appears below). This dissertation addresses the importance of hot data identification in storage areas and shows how it can be effectively applied to overcome the existing limitations in each storage venue.

Item Read performance enhancement in data deduplication for secondary storage (2013-05) Ganesan, Pradeep

Data deduplication, an efficient technique for eliminating redundant bytes in the data to be stored, is widely used in data backup and disaster recovery. The elimination is achieved by chunking the data and identifying duplicate chunks. Along with data reduction, it also delivers commendable backup and restore speeds. The backup process corresponds to the write path of a dedupe system, while the restore process defines its read path. With much emphasis and analysis devoted to expediting writes, reads remain the comparatively slower operation. This work proposes a method to improve read performance by investigating recently accessed chunks and their locality in the backup set (datastream). Based on this study of the distribution of chunks in the datastream, a few chunks are identified that should be accumulated and stored together to serve future read requests better (a sketch of this caching idea follows below). This identification and accumulation happen on cached chunks. A small degree of duplication of the deduplicated data is thereby reintroduced, but by later caching those chunks together during a restore of the same datastream, read performance is improved. Finally, read performance results obtained through experiments with trace datasets are presented and analyzed to evaluate the design.
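The second item's multiple bloom filter scheme can be sketched as follows: several Bloom filters are rotated over time so that each covers a different recency window, and an address that appears in many (and more recent) filters is classified as hot. The filter count, sizes, rotation period, and weighting below are illustrative assumptions, not the dissertation's parameters.

```python
import hashlib

class MultiBloomHotness:
    """V rotating Bloom filters; filter age encodes recency, membership
    count approximates frequency."""

    def __init__(self, num_filters=4, m_bits=1 << 16, k=2, decay_period=1024):
        self.V, self.m, self.k = num_filters, m_bits, k
        self.filters = [bytearray(m_bits // 8) for _ in range(num_filters)]
        self.current = 0                  # filter receiving new accesses
        self.decay_period = decay_period  # accesses between rotations
        self.accesses = 0

    def _positions(self, lba: int):
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{lba}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.m

    def record_access(self, lba: int):
        for p in self._positions(lba):
            self.filters[self.current][p // 8] |= 1 << (p % 8)
        self.accesses += 1
        if self.accesses % self.decay_period == 0:
            # Rotate: the oldest filter is cleared and becomes current,
            # which ages out stale accesses.
            self.current = (self.current + 1) % self.V
            self.filters[self.current] = bytearray(self.m // 8)

    def hotness(self, lba: int) -> int:
        # Weighted membership count: recent filters contribute more.
        score = 0
        for age in range(self.V):
            v = (self.current - age) % self.V
            if all(self.filters[v][p // 8] & (1 << (p % 8))
                   for p in self._positions(lba)):
                score += self.V - age
        return score
```

A block whose hotness score crosses a chosen threshold would be treated as hot, for example by steering it to a different flash region in the FTL.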
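The restore-side caching in the second and third items exploits the fact that a backup restore follows a known chunk sequence (the recipe), so the cache can look ahead at future references in a moving window instead of guessing. A minimal sketch of that idea, evicting the cached chunk whose next reference lies farthest in the window; the cache size, window size, and the fetch_chunk helper are all hypothetical.

```python
def restore_with_lookahead(recipe, fetch_chunk, cache_size=64, window=256):
    """Restore a datastream given its chunk-ID recipe, caching chunks and
    evicting the one referenced farthest ahead in the look-ahead window
    (chunks with no upcoming reference are evicted first)."""
    cache = {}   # chunk_id -> chunk bytes
    out = []
    for i, cid in enumerate(recipe):
        if cid not in cache:
            chunk = fetch_chunk(cid)  # hypothetical read from the dedupe store
            if len(cache) >= cache_size:
                horizon = recipe[i + 1 : i + 1 + window]
                victim = max(cache, key=lambda c: horizon.index(c)
                             if c in horizon else len(horizon) + 1)
                del cache[victim]
            cache[cid] = chunk
        out.append(cache[cid])
    return b"".join(out)
```

The third item's accumulation step goes further: chunks that the trace shows are read together are rewritten contiguously (a controlled amount of re-duplication) so that a later restore of the same datastream can fetch them in fewer I/Os.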