Exploiting Trade-Offs In Memory, Storage, And Communication Performance And Accuracy In High-Capacity Computing Systems
(2019-08) Fan, Qianqian
Error-resilient applications are becoming more common in large-scale computing systems. These applications make it possible to balance cost and performance in new ways by trading off output quality against performance. To exploit this opportunity, this thesis introduces three approximation techniques that trade off the accuracy of memory, storage, and communication for gains in efficiency and performance in large-scale, high-capacity computing systems. First, we employ the notion of approximate memory, which exploits the idea that some memory errors are not only nonfatal, but can be leveraged to enhance power and performance with minimal loss in quality. The traditional approach for increasing yield in large memory arrays has been to eliminate all hard errors using repair mechanisms. However, the cost of these mechanisms can become prohibitive at higher error rates. Instead of completely repairing faulty memories, we introduce new approximate memory repair mechanisms that only partially repair both CMOS DRAMs and STT-MRAMs. By combining redundant repair with unequal protection, such as skewing the limited spare elements available for repairing faults towards the k most significant bits, and a hybrid bit-shuffling and redundant repair scheme, the new mechanisms maintain excellent output quality while substantially reducing the cost of the repair mechanism, particularly for increasingly important cluster faults. Second, we investigate the use of approximate storage, which is defined as cheaper, lower-reliability storage with higher error rates. In the past few years, ever-increasing amounts of image data have been generated by users globally, and these images are routinely stored in cold storage systems in compressed formats.
Since traditional JPEG-based schemes that use variable-length coding are extremely sensitive to errors, the direct use of approximate storage results in severe quality degradation. We propose an error-resilient adaptive-length coding (ALC) scheme that divides all symbols into two classes, based on their frequency of occurrence, where each class has a fixed-length codeword. This provides a balance between the reliability of fixed-length coding schemes, which have a high storage overhead, and the storage efficiency of Huffman coding schemes, which show high levels of error on low-reliability storage platforms. Further, we use data partitioning to determine which bits are stored in approximate or reliable storage to lower the overall cost of storage. We show that ALC can be used with general non-volatile storage, and can substantially reduce the total cost compared to traditional JPEG-based storage. Finally, approximate communication has emerged as a new opportunity for improving communication efficiency in parallel systems: transmitting partial or imprecise messages can significantly reduce communication time. Communication overheads in distributed systems constitute a large fraction of the total execution time, and limit the scalability of applications running on these systems. We propose a Discrete Cosine Transform (DCT)-based approximate communication scheme that takes advantage of the error resiliency of several widely used applications, and improves communication efficiency by substantially reducing message lengths. Our scheme is implemented in the Message Passing Interface (MPI) library. When evaluated on several representative MPI applications on a real cluster system, our approximate communication scheme reduces the total execution time without much loss in quality of the result, even accounting for the computational overhead required for DCT encoding.
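The idea of shortening a message with a DCT can be illustrated with a minimal Python sketch. This is not the thesis implementation: it does not use MPI, and the choice to drop all but the first `keep` low-frequency coefficients is an assumption made here for illustration, since the abstract does not say how message lengths are reduced. The names `dct`, `idct`, `encode`, `decode`, and `keep` are hypothetical.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a list of floats (direct O(n^2) form)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, v in enumerate(x)))
    return out

def idct(coeffs, n):
    """Inverse of dct(); coefficients beyond len(coeffs) are treated as zero."""
    out = []
    for i in range(n):
        out.append(sum(math.sqrt((1 if k == 0 else 2) / n) * c
                       * math.cos(math.pi * (i + 0.5) * k / n)
                       for k, c in enumerate(coeffs)))
    return out

def encode(message, keep):
    """Sender side (assumed policy): transmit only the first `keep` coefficients."""
    return dct(message)[:keep]

def decode(payload, n):
    """Receiver side: rebuild an approximation of the n-element message."""
    return idct(payload, n)

# A smooth 64-element message is sent as 8 coefficients instead of 64 values.
message = [math.sin(2 * math.pi * i / 64) for i in range(64)]
payload = encode(message, keep=8)
restored = decode(payload, len(message))
max_error = max(abs(a - b) for a, b in zip(message, restored))
```

In this sketch the payload is one eighth of the original length, and `max_error` measures the accuracy given up in exchange. Because the transform is orthonormal, keeping more coefficients never increases the squared reconstruction error.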
In summary, this thesis proposes a partially repaired memory scheme, an error-resilient ALC scheme, and a DCT-based approximate communication scheme, which together allow the system to maintain acceptable output quality while substantially reducing system cost.
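The two-class coding idea behind ALC, described in the abstract, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis scheme: the abstract says only that symbols are split into two frequency classes with one fixed codeword length per class, so the one-bit class flag, the class sizes, and the names `build_alc`, `alc_encode`, `alc_decode`, and `short_bits` are all hypothetical.

```python
from collections import Counter

def build_alc(symbols, short_bits=2):
    """Split symbols into a frequent and a rare class, one fixed length each.

    Assumed layout: a 1-bit class flag followed by a fixed-width index.
    """
    ranked = [s for s, _ in Counter(symbols).most_common()]
    frequent = ranked[:2 ** short_bits]
    rare = ranked[2 ** short_bits:]
    long_bits = max(1, (len(rare) - 1).bit_length())
    table = {s: "0" + format(i, f"0{short_bits}b") for i, s in enumerate(frequent)}
    table.update({s: "1" + format(i, f"0{long_bits}b") for i, s in enumerate(rare)})
    return table, short_bits, long_bits

def alc_encode(symbols, table):
    """Concatenate the codewords into one bit string."""
    return "".join(table[s] for s in symbols)

def alc_decode(bits, table, short_bits, long_bits):
    """Walk the bit string; the flag bit alone fixes each codeword's width."""
    inverse = {code: s for s, code in table.items()}
    out, pos = [], 0
    while pos < len(bits):
        width = 1 + (short_bits if bits[pos] == "0" else long_bits)
        out.append(inverse.get(bits[pos:pos + width]))  # None if code is unknown
        pos += width
    return out

data = list("aaaaaabbbbbccccdddefghijkl")
table, short_bits, long_bits = build_alc(data)
bits = alc_encode(data, table)
decoded = alc_decode(bits, table, short_bits, long_bits)
```

In this sketch, codeword boundaries depend only on the flag bits, so a flipped index bit corrupts a single symbol rather than shifting every later boundary, which is the failure mode that makes variable-length codes fragile.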