Browsing by Author "Zhai, Antonia"

Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective
(2008-06-13) Packirisamy, Venkatesan; Luo, Yangchun; Hung, Wei-Lung; Zhai, Antonia; Yew, Pen-Chung
The computer industry adopted multi-threaded and multi-core architectures as clock rate increases stalled in the early 2000s. However, because of the lack of compilers and other related software technologies, most general-purpose applications today still cannot take advantage of such architectures to improve their performance. Thread-level speculation (TLS) has been proposed as a way of using these multi-threaded architectures to parallelize general-purpose applications. Both simultaneous multithreading (SMT) and chip multiprocessors (CMP) have been extended to implement TLS. While the characteristics of SMT and CMP have been widely studied under multi-programmed and parallel workloads, their behavior under TLS workloads is not well understood. Because speculative threads may be rolled back and the degree of parallelism available in applications varies, TLS workloads exhibit unique characteristics that distinguish them from other workloads. In this paper, we present a detailed study of the performance, power consumption and thermal effects of these multithreaded architectures against those of a superscalar with equal chip area. A wide spectrum of design choices and tradeoffs is also studied using commonly used simulation techniques. We show that the SMT-based TLS architecture performs about 21% better than the best CMP-based configuration while suffering about 16% power overhead. In terms of the energy-delay-squared product, SMT-based TLS performs about 26% better than the best CMP-based TLS configuration and 11% better than the superscalar architecture. However, the SMT-based TLS configuration causes more thermal stress than the CMP-based TLS architectures.
Exploring Speculative Parallelism in SPEC2006
(2008-11-04) Packirisamy, Venkatesan; Zhai, Antonia; Yew, Pen-Chung
The computer industry adopted multi-threaded and multi-core architectures as clock rate increases stalled in the early 2000s. It was hoped that continuous improvement of single-program performance could be achieved through these architectures. However, traditional parallelizing compilers often fail to effectively parallelize general-purpose applications, which typically have complex control flow and excessive pointer usage. Recently, hardware techniques like Transactional Memory (TM) and Thread-Level Speculation (TLS) have been proposed to simplify the task of parallelization by using speculative threads. The potential of speculative parallelism in general-purpose applications like SPEC CPU 2000 has been well studied and has been shown to be moderately successful. Preliminary work that examined the potential parallelism in SPEC2006 deployed parallel threads with a restrictive TLS execution model and limited compiler support, and thus showed only limited performance potential. In this paper, we first analyze the cross-iteration dependence behavior of SPEC2006 benchmarks and show that more parallelism potential is available in SPEC2006 benchmarks than in SPEC2000. Further, we use a state-of-the-art profile-driven TLS compiler to identify loops that can be speculatively parallelized. Overall, we found an average speedup of 60% on four cores over what could be achieved by a traditional parallelizing compiler such as Intel's ICC compiler on such benchmarks. We also found that an additional 11% improvement could be obtained on selected benchmarks using 8 cores when we extend TLS to multiple loop levels as opposed to restricting TLS to a single loop level.
Issues and Support for Dynamic Register Allocation
(2006-06-21) Das, Abhinav; Fu, Rao; Zhai, Antonia; Hsu, Wei-Chung
Post-link and dynamic optimizations have become important for achieving program performance. This is because it is difficult to produce a single binary that fits all micro-architectures and provides good performance for all inputs. A major challenge in post-link and dynamic optimizations is the acquisition of registers for inserting optimization code with the main program. We show that it is difficult to achieve both correctness and transparency when only software schemes for acquiring registers are used. We then propose an architecture feature that builds upon existing hardware for stacked register allocation on the Itanium processor. The hardware impact of this feature is minimal, while simultaneously allowing post-link and dynamic optimization systems to obtain registers for optimization in a "safe" manner, thus preserving the transparency and improving the performance of these systems.
Performance and power comparison of Thread Level Speculation in SMT and CMP architectures
(2007-10-30) Packirisamy, Venkatesan; Zhai, Antonia; Hsu, Wei-Chung; Yew, Pen-Chung
As technology advances, microprocessors that support multiple threads of execution on a single chip are becoming increasingly common. Improving the performance of general-purpose applications by extracting parallel threads is extremely difficult, due to the complex control flow and ambiguous data dependences that are inherent to these applications. Thread-Level Speculation (TLS) enables speculative parallel execution of potentially dependent threads, and ensures correct execution by providing hardware support to detect data dependence violations and to recover from speculation failures. TLS can be supported on a variety of architectures, among them Chip MultiProcessors (CMP) and Simultaneous MultiThreading (SMT). While there have been numerous papers comparing the performance and power efficiency of SMT and CMP processors under various workloads, relatively little has been done to compare them in the context of TLS. While CMPs utilize smaller and more power-efficient cores, resource sharing and constructive interference between speculative and non-speculative threads can potentially make SMT more power efficient. Thus, this paper aims to fill this void by extending a CMP and an SMT processor to support TLS, and evaluating the performance and power efficiency of the resulting systems with speculative parallel threads extracted for the SPEC2000 benchmark suite. Because both SMT and CMP processors have a large variety of configurations, we choose to conduct our study on two architectures with equal die area and the same clock frequency. Our results show that an SMT processor that supports four speculative threads outperforms a CMP processor that supports the same number of threads, uses the same die area and operates at the same clock frequency by 23% while consuming only 8% more power on selected SPEC2000 benchmarks. In terms of energy-delay product, the same SMT processor is approximately 10% more efficient than the CMP processor.

University Digital Conservancy
