Browsing by Subject "Deduplication"
Now showing 1 - 2 of 2
Item HyperProtect: A Large-scale Intelligent Backup Storage System (2022-01), Qin, Yaobin

In the current big data era, huge financial losses are caused when data becomes unavailable at the original storage side. Protecting data from loss plays a critical role in ensuring business continuity. Businesses generally employ third-party backup services to save their data in remote storage and to retrieve the data within a tolerable time when the original data cannot be accessed. To use these backup services, backup users have to handle many kinds of configurations to ensure effective backup of their data. As the scale of backup systems and the volume of backup data continue to grow significantly, traditional backup systems have difficulty satisfying the increasing demands of backup users.

The rapid advance of machine learning and deep learning techniques has made them successful in many areas, such as image recognition, object detection, and natural language processing. Compared with other system environments, the backup environment is more consistent by nature: backup data contents do not change considerably, at least in the short run. Hence, we collected data from real backup systems and analyzed the backup behavior of backup users. Using machine learning techniques, we discovered that certain patterns and features can be generalized from the backup environment. We used them to guide the design of an intelligent agent called HyperProtect, which aims to improve the service level provided by backup systems.

To apply machine learning and deep learning techniques to enhance the service level of backup systems, we first improved the stability and predictability of the backup environment by proposing a novel dynamic backup scheduling scheme and a high-efficiency deduplication scheme. Backup scheduling and deduplication are important techniques in backup systems: scheduling determines which backup starts first and which storage is assigned to it, in order to improve backup efficiency, while deduplication removes redundancy from the backup data to save storage space. Beyond backup efficiency and storage overhead, we designed the scheduling and deduplication to maintain the stability and predictability of the backup environment.

With a more stable backup environment, we applied machine learning to improve the reliability and efficiency of the large-scale backup system. We analyzed data protection system reports written over two years and collected from 3,500 backup systems, and found that inadequate capacity is among the most frequent causes of backup failure. We characterized the backup data and used this information to design a backup storage capacity forecasting framework for better reliability of backup systems. From our observation of an enterprise backup system, a newly created client has no historical backups, so a prefetching algorithm has no reference basis for effective fingerprint prefetching. A study of the backup data revealed a backup content correlation between clients, and we propose a fingerprint prefetching algorithm that improves the deduplication rate and efficiency. Here, machine learning and statistical techniques are applied to discover backup patterns and generalize their features. The above efforts introduced machine learning for backup systems.
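As a rough illustration of the fingerprint-based deduplication this abstract refers to, the sketch below splits a backup stream into chunks, fingerprints each chunk, and stores only previously unseen chunks. It is not taken from the thesis; the fixed-size chunking, SHA-1 fingerprints, and in-memory index are simplifying assumptions.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking; production systems often use content-defined chunking


class DedupStore:
    """Toy fingerprint index that stores each unique chunk only once."""

    def __init__(self):
        self.index = {}          # fingerprint -> stored chunk
        self.logical_bytes = 0   # bytes the clients asked to back up
        self.stored_bytes = 0    # bytes actually kept after deduplication

    def write(self, data: bytes) -> list:
        """Deduplicate one backup stream and return its fingerprint recipe."""
        recipe = []
        for off in range(0, len(data), CHUNK_SIZE):
            chunk = data[off:off + CHUNK_SIZE]
            fp = hashlib.sha1(chunk).hexdigest()
            self.logical_bytes += len(chunk)
            if fp not in self.index:      # unseen content: store it
                self.index[fp] = chunk
                self.stored_bytes += len(chunk)
            recipe.append(fp)             # always record the reference
        return recipe

    def dedup_ratio(self) -> float:
        return self.logical_bytes / max(self.stored_bytes, 1)


store = DedupStore()
store.write(b"A" * 8192 + b"B" * 4096)    # first backup
store.write(b"A" * 8192 + b"C" * 4096)    # second backup shares 8 KB with the first
print(f"deduplication ratio: {store.dedup_ratio():.2f}x")   # 2.00x on this toy data
```

Real deduplication engines keep the fingerprint index on disk and rely on caching and prefetching to look it up efficiently, which is where the fingerprint prefetching mentioned in the abstract comes in.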
We also considered the other direction, namely, backup systems for machine learning. The advent of the Artificial Intelligence (AI) era has made it increasingly important to have an efficient backup system to protect training data from loss. Furthermore, maintaining a backup of the training data makes it possible to update or retrain the learned model as more data are collected. However, always making a complete copy of all collected daily training data for backup storage incurs a huge backup overhead, especially because the data typically contain highly redundant information that does not contribute to model learning. Deduplication is a common technique for reducing data redundancy in modern backup systems, but existing deduplication methods are ineffective for training data. Hence, we propose a novel deduplication strategy for the training data used to learn a deep neural network classifier.

Item Integrating flash memory into the storage hierarchy (2010-10), Debnath, Biplob Kumar

With the continually accelerating growth of data, the performance of storage systems is increasingly becoming a bottleneck to improving overall system performance. Many applications, such as transaction processing systems, weather forecasting, large-scale scientific simulations, and on-demand services, are limited by the performance of the underlying storage systems. The limited bandwidth, high power consumption, and low reliability of widely used magnetic disk-based storage systems impose a significant hurdle in scaling these applications to satisfy the increasing growth of data. These limitations and bottlenecks are especially acute for large-scale high-performance computing systems. Flash memory is an emerging storage technology that shows tremendous promise to compensate for the limitations of current storage devices. Flash memory's relatively high cost, however, combined with its slow write performance and limited number of erase cycles, requires new and innovative solutions to integrate flash memory-based storage devices into a high-performance storage hierarchy.

The first part of this thesis develops new algorithms, data structures, and storage architectures to address the fundamental issues that limit the use of flash-based storage devices in high-performance computing systems; the second part demonstrates two innovative applications of flash-based storage. In particular, the first part addresses a set of fundamental issues, including new write caching techniques, a sampling-based, RAM-space-efficient garbage collection scheme, and writing strategies for improving the performance of flash memory for write-intensive applications. This effort improves the fundamental understanding of flash memory, remedies the major limitations of flash-based storage devices, and extends the capability of flash memory to support many critical applications. The second part demonstrates how flash memory can be used to speed up server applications, including a Bloom filter and an online deduplication system, using flash-aware data structures and algorithms and showing innovative uses of flash-based storage.
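For context on the Bloom filter mentioned in the last abstract: it is a compact, probabilistic set-membership structure that an online deduplication system can consult before querying its full fingerprint index. The sketch below is a generic in-memory version with assumed parameters (bit-array size, four salted SHA-256 hashes); the flash-aware organization discussed in the thesis is not reproduced here.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: probabilistic membership test with no false negatives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: bytes):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# An online deduplication system can check the filter first: a negative answer
# means the fingerprint is definitely new, so the slower index lookup is skipped.
bf = BloomFilter()
bf.add(b"fingerprint-of-chunk-1")
print(bf.might_contain(b"fingerprint-of-chunk-1"))  # True
print(bf.might_contain(b"fingerprint-of-chunk-2"))  # False (with high probability)
```

A positive answer may be a false positive and still requires verification against the full index; only negative answers are guaranteed correct.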