
University Digital Conservancy


Browsing by Author "Hsu, Wei-Chung"

Now showing 1 - 14 of 14
    A Compiler Framework for Recovery Code Generation in General Speculative Optimizations
    (2003-12-28) Lin, Jin; Hsu, Wei-Chung; Yew, Pen-Chung; Dz-ching Ju, Roy; Ngai, Tin-fook
    Data speculation is an effective way to overcome imprecise alias analysis for many compiler optimizations. A general framework that integrates both control and data speculation using alias profiling and/or compiler heuristic rules has been shown to improve SPEC2000 performance on Itanium systems [16]. However, speculative optimizations require check instructions and recovery code to ensure correct execution when speculation fails. How to generate check instructions and their associated recovery code efficiently and effectively is an issue that has yet to be well studied. It is also important to avoid inhibiting later optimizations once recovery code has been generated in earlier phases. This paper proposes a framework that uses an if-block structure to facilitate the generation of check instructions and recovery code for general speculative optimizations. It allows speculative and recovery code generated early in the compiler phases to be integrated effectively with subsequent compiler optimization phases. It also allows multi-level speculation for multi-level pointers and multi-level expression trees to be handled with no additional complexity. The proposed recovery code generation framework has been implemented in the Open Research Compiler (ORC).
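    The if-block idea above can be sketched abstractly. The following is a minimal illustration, not the ORC implementation; all op names (`load.s`, `check_failed`) are invented stand-ins for the Itanium-style speculative load and check: a load is hoisted above a possibly aliasing store, and an explicit if-block with recovery code is left at its original site.

    ```python
    def make_if_block(dest, addr):
        """Lower a speculation check into an explicit if-block: the
        fast path keeps the speculative value; the recovery branch
        re-executes the load."""
        return {"cond": f"check_failed({dest})",
                "then": [f"{dest} = load({addr})"]}   # recovery code

    def speculate_load(code, load_idx, hoist_to):
        """Hoist the load at load_idx up to position hoist_to and
        leave an if-block check at the original site."""
        _, dest, addr = code[load_idx]                # ("load", dest, addr)
        checked = code[:load_idx] + [make_if_block(dest, addr)] + code[load_idx + 1:]
        return checked[:hoist_to] + [("load.s", dest, addr)] + checked[hoist_to:]
    ```

    Because the recovery code sits in an ordinary if-block, later passes can treat it like any other control flow instead of special-casing it.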
    A General Compiler Framework for Data Speculation Using DSCM
    (2004-03-01) Dai, Xiaoru; Hsu, Wei-Chung; Yew, Pen-Chung
    Getting precise alias information in a language that allows pointers, such as C, is expensive. One reason is that alias analysis must generate conservative (safe) alias information: it assumes possible aliases whenever it cannot prove there are none. This conservative alias information can greatly limit compiler optimizations. In this paper, we present a general framework that allows speculative (unsafe) alias information to be used in several compiler optimizations, such as redundancy elimination, copy propagation, dead store elimination and code scheduling. Speculative alias information is not guaranteed to be correct, so run-time verification and recovery code are generated to guarantee the correctness of the optimizations that use it. The key idea of the framework is to use data speculative code motion (DSCM) to move the instruction that will be optimized away to the instruction that causes the optimization. During DSCM, verification and recovery code are generated. After DSCM, the optimization becomes non-speculative and the moved instruction can be optimized away.
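    The DSCM idea can be shown on a list-of-tuples IR. This is a hedged sketch with invented op names, not the paper's compiler code: the instruction to be optimized away (the "victim") is moved up next to the instruction that causes the optimization, and a verify-or-recover placeholder is left at its original site, after which an ordinary non-speculative pass could remove the now-adjacent redundant instruction.

    ```python
    def dscm(code, victim_idx, cause_idx):
        """Speculatively move code[victim_idx] to just after
        code[cause_idx]; the original site gets a verify/recover
        placeholder that re-executes the victim if an intervening
        store turned out to alias."""
        victim = code[victim_idx]
        rest = code[:victim_idx] + [("verify_or_recover", victim)] + code[victim_idx + 1:]
        return rest[:cause_idx + 1] + [victim] + rest[cause_idx + 1:]
    ```

    For redundancy elimination, moving the second of two loads of `p` above a possibly aliasing store makes the loads adjacent, so the second becomes trivially redundant.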
    COBRA: A Framework for Continuous Profiling and Binary Re-Adaptation
    (2008-05-09) Kim, Jinpyo; Hsu, Wei-Chung; Yew, Pen-Chung
    Dynamic optimizers have been shown to improve the performance and power efficiency of single-threaded applications. Multithreaded applications running on CMP, SMP and cc-NUMA systems also exhibit opportunities for dynamic binary optimization. Because existing dynamic optimizers are designed primarily for single-threaded programs, they lack efficient schemes for monitoring multiple threads and for supporting thread-specific or system-wide optimization based on the collective behavior of multiple threads. Monitoring and collecting profiles from multiple threads expose optimization opportunities not only for a single core but also for multi-core systems, including their interconnection networks and cache coherence protocol. Detecting global phases of multithreaded programs and determining appropriate optimizations by considering interactions between threads, such as coherence misses, are features that distinguish the dynamic binary optimizer presented in this thesis from prior dynamic optimizers for single-threaded programs. This thesis presents COBRA (Continuous Binary Re-Adaptation), a dynamic binary optimization framework for single-threaded and multithreaded applications. It includes components for collective monitoring and dynamic profiling, profile and trace management, code optimization and code deployment. The monitoring component collects hot branches and performance information from multiple working threads with the support of the OS and the hardware performance monitors, and sends the data to the dynamic profiler. The dynamic profiler accumulates performance-bottleneck profiles, such as cache miss information, along with hot branch traces. The optimizer generates new optimized binary traces and stores them in the code cache. The profiler and optimizer interact closely in order to optimize for more effective code layout and fewer data cache miss stalls.
The continuous profiling component monitors only the performance behavior of optimized binary traces and generates feedback to determine the efficiency of optimizations, guiding continuous re-optimization. It is currently implemented on Itanium 2 based CMP, SMP and cc-NUMA systems. This thesis also proposes a new phase detection scheme and hardware support, designed especially for dynamic optimization, that effectively identifies and accurately predicts program phases by exploiting program control flow information. The scheme applies not only to single-threaded programs but also, even more efficiently, to multithreaded programs. It identifies dynamic intervals: contiguous variable-length intervals aligned with dynamic code regions that show distinct single-threaded and parallel program phase behavior. Two efficient phase-aware runtime program monitoring schemes are implemented in the COBRA framework. The sampled Basic Block... [NOTE - Abstract continues in actual report]
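    The monitor → profile → optimize → deploy cycle with continuous feedback can be sketched as a toy loop. All names here are illustrative assumptions, not the actual COBRA API: `sample_stream` stands in for HPM-style samples, `still_efficient` for the feedback check on deployed traces.

    ```python
    def cobra_cycle(sample_stream, optimize, still_efficient, rounds=3):
        """Each round accumulates samples into a profile, finds the
        hottest trace, and either keeps its deployed optimized
        version (if feedback says it is still efficient) or
        (re-)optimizes it into the code cache."""
        code_cache = {}
        for r in range(rounds):
            profile = {}
            for trace, cost in sample_stream(r):      # HPM-style samples
                profile[trace] = profile.get(trace, 0) + cost
            hot = max(profile, key=profile.get)
            if hot in code_cache and still_efficient(code_cache[hot]):
                continue                              # feedback: keep deployed code
            code_cache[hot] = optimize(hot)           # deploy new binary trace
        return code_cache
    ```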
    Dynamic Code Region-based Program Phase Classification and Transition Prediction
    (2005-05-23) Kim, Jinpyo; Kodakara, Sreekumar V.; Hsu, Wei-Chung; Lilja, David J.; Yew, Pen-Chung
    Detecting and predicting a program's execution phases is crucial to dynamically adaptable systems and dynamic optimizations. Program execution phases have a strong connection to program control structures, in particular loops and procedure calls. Intuitively, a phase can be associated with dynamic code regions embedded in loops and procedures. This paper proposes off-line and on-line analysis techniques that can effectively identify and predict program phases by exploiting program control flow information. For off-line analysis, we introduce a dynamic interval analysis method that converts the complete program execution into an annotated tree with statistical information attached to each dynamic code region; it can efficiently identify dynamic code regions associated with program execution phases at different granularities. For on-line analysis, we propose new phase tracking hardware that can effectively classify program phases and predict the next execution phase. We have applied our dynamic interval analysis method to 10 SPEC CPU2000 benchmarks. We demonstrate that changes in program behavior correlate strongly with control transfer between dynamic code regions, and we found that a small number of dynamic code regions can represent the whole program execution with high code coverage. Our proposed on-line phase tracking hardware can effectively identify a stable phase at a given granularity and very accurately predict the next execution phase.
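    The core bookkeeping behind attributing execution to nested code regions can be sketched in a few lines. This is a simplified stand-in for the paper's annotated-tree construction, assuming an event trace of region entries/exits plus sampled "ticks" of execution; the event names are invented for illustration.

    ```python
    def profile_regions(trace):
        """trace mixes ('enter', region), ('exit',) and ('tick',)
        events; each tick is one sampled unit of execution charged
        to the current path of nested code regions."""
        counts, path = {}, ("root",)
        for ev in trace:
            if ev[0] == "enter":
                path = path + (ev[1],)
            elif ev[0] == "exit":
                path = path[:-1]
            else:                                   # "tick"
                counts[path] = counts.get(path, 0) + 1
        return counts

    def hottest_region(counts):
        """The dynamic code region path covering the most execution."""
        return max(counts, key=counts.get)
    ```

    Summing the per-path counts up the tree would give the statistics-annotated tree the abstract describes.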
    Dynamic Register Allocation for ADORE Runtime Optimization System
    (2004-12-01) Saxena, Aditya; Hsu, Wei-Chung
    Dynamic optimization can apply, at run time, several optimizations that are difficult to apply at compile time because accurate run-time information is unavailable then. A dynamic optimizer can use run-time characteristics to identify optimization opportunities; however, deploying the optimizations requires extra registers. In this report we explore the methodology of, and issues involved in, techniques for allocating registers dynamically, and present experimental results. The techniques explored include dead register detection, register spilling and use of the IA-64-specific alloc instruction under the ADORE framework.
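    Of the techniques listed, dead register detection is easy to sketch. The following is a simplified single-trace illustration (not ADORE's implementation): within a straight-line trace, a register whose first access is a definition is dead on entry, so injected optimization code could borrow it without spilling.

    ```python
    def dead_on_entry(trace):
        """trace: ('def'|'use', reg) pairs in execution order.
        Returns the registers whose first access is a definition,
        i.e. whose incoming value is never needed in this trace."""
        first = {}
        for kind, reg in trace:
            first.setdefault(reg, kind)   # remember the first access only
        return {r for r, k in first.items() if k == "def"}
    ```

    Registers that never appear in the trace are deliberately excluded: their values may be live across the trace, so borrowing them would still require a spill.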
    Implementation of Trace Optimization and Investigation of Advanced Load Optimization in ADORE
    (2005-07-20) Joshi, Sourabh S.; Hsu, Wei-Chung; Yew, Pen-Chung
    Dynamic optimizers endeavor to speed up program execution by performing runtime optimization based on knowledge of the program collected at runtime. Depending on the type of optimization applied, either the dynamic instruction count (IC) or the cycles per instruction (CPI) of the program can be targeted for improvement. This report presents two such sets of optimizations: the first targets dynamic instruction count and the second targets CPI through data-cache-based optimizations. The first set, trace optimizations, targets the program's instruction count and consists of copy and constant propagation, partial dead code elimination and store copy propagation. The implementation, optimization issues and performance of these optimizations are discussed in the former half of the report. The latter half concentrates on a new kind of dynamic optimization called advanced load optimization, which strives to improve data cache performance by targeting short-latency loads that cannot be covered by prefetch instructions. The motivation and implementation strategies for this optimization are presented, along with the use of dynamic instrumentation to gather additional runtime information and track optimizations. Trace optimizations are not found to be very beneficial in further speeding up already optimized SPEC code; however, as a proof of concept, in contrived cases where partial dead code does exist, these optimizations are shown to be promising. Advanced load optimization looks promising but needs more implementation work before the more advanced strategies can be tried.
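    One of the trace optimizations named above, copy propagation, can be sketched on a straight-line trace. This is a minimal illustration with an invented tuple IR, not the ADORE implementation: uses of copied registers are rewritten to their sources, after which a later pass could delete moves that became dead.

    ```python
    def copy_propagate(trace):
        """trace ops: ('mov', dst, src) or ('add', dst, a, b).
        Rewrites operands through known register copies, and
        invalidates copy facts when their registers are redefined."""
        env, out = {}, []
        for ins in trace:
            if ins[0] == "mov":
                _, dst, src = ins
                env = {k: v for k, v in env.items() if v != dst and k != dst}
                env[dst] = env.get(src, src)          # chase copy chains
                out.append(("mov", dst, env[dst]))
            else:
                kind, dst, a, b = ins
                out.append((kind, dst, env.get(a, a), env.get(b, b)))
                # dst is redefined: drop any copy facts involving it
                env = {k: v for k, v in env.items() if v != dst and k != dst}
        return out
    ```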
    Issues and Support for Dynamic Register Allocation
    (2006-06-21) Das, Abhinav; Fu, Rao; Zhai, Antonia; Hsu, Wei-Chung
    Post-link and dynamic optimizations have become important for achieving program performance, because it is difficult to produce a single binary that fits all micro-architectures and provides good performance for all inputs. A major challenge in post-link and dynamic optimization is acquiring registers for the optimization code inserted into the main program. We show that it is difficult to achieve both correctness and transparency when only software schemes for acquiring registers are used. We then propose an architectural feature that builds upon the existing hardware for stacked register allocation on the Itanium processor. The hardware impact of this feature is minimal, while it allows post-link and dynamic optimization systems to obtain registers in a "safe" manner, thus preserving the transparency and improving the performance of these systems.
    PASS: Program Structure Aware Stratified Sampling for Statistically Selecting Instruction Traces and Simulation Points
    (2005-12-30) Kodakara, Sreekumar V.; Kim, Jinpyo; Hsu, Wei-Chung; Lilja, David J.; Yew, Pen-Chung
    As modeled microarchitectures become more complex and the size of benchmark programs keeps increasing, simulating a complete program with various input sets is practically infeasible within a given time and computation resource budget. A common approach is to simulate only a subset of representative parts selected from the complete program execution. SimPoint [1,2] and SMARTS [10] have shown that accurate performance estimates can be achieved with a relatively small number of instructions. This paper proposes a novel method called Program structure Aware Stratified Sampling (PASS) for further reducing microarchitecture simulation time without losing accuracy or coverage. PASS has four major phases: building an Extended Calling Context Tree (ECCT), dynamic code region analysis, program behavior profiling, and stratified sampling. The ECCT is constructed via dynamic instrumentation to represent program calling context and repetitive behavior. Dynamic code region analysis identifies code regions with similar program phase behaviors. Program behavior profiling stores statistical information about program behaviors, such as the number of executed instructions, branch mispredictions, and cache misses associated with each code region. Based on the variability of each phase, we adaptively sample instances of instruction streams through stratified sampling. We applied PASS to 12 SPEC CPU2000 benchmark and input combinations and achieved an average 1.46% IPC error bound relative to measurements of native execution on an Itanium 2 machine, with much smaller sampled instruction streams.
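    The stratified-sampling phase can be illustrated with a simplified allocation rule. This is a hedged sketch, not the PASS implementation: the simulation budget is split across phases (strata) in proportion to each phase's size times its variability, in the spirit of Neyman allocation, so that stable phases get minimal samples and variable phases get more.

    ```python
    import random
    import statistics

    def stratified_sample(strata, budget, seed=0):
        """strata: {phase_id: [per-interval measurements]}.
        Returns a sample of intervals per phase, sized by the
        phase's size * standard deviation share of the budget."""
        rng = random.Random(seed)
        weight = {p: len(v) * (statistics.pstdev(v) if len(v) > 1 else 0.0)
                  for p, v in strata.items()}
        total = sum(weight.values()) or 1.0
        picks = {}
        for p, v in strata.items():
            n = max(1, round(budget * weight[p] / total))   # at least 1 per stratum
            picks[p] = rng.sample(v, min(n, len(v)))
        return picks
    ```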
    Performance and power comparison of Thread Level Speculation in SMT and CMP architectures
    (2007-10-30) Packirisamy, Venkatesan; Zhai, Antonia; Hsu, Wei-Chung; Yew, Pen-Chung
    As technology advances, microprocessors that support multiple threads of execution on a single chip are becoming increasingly common. Improving the performance of general-purpose applications by extracting parallel threads is extremely difficult, due to the complex control flow and ambiguous data dependences inherent to these applications. Thread-Level Speculation (TLS) enables speculative parallel execution of potentially dependent threads, and ensures correct execution by providing hardware support to detect data dependence violations and to recover from speculation failures. TLS can be supported on a variety of architectures, among them Chip MultiProcessors (CMP) and Simultaneous MultiThreading (SMT). While there have been numerous papers comparing the performance and power efficiency of SMT and CMP processors under various workloads, relatively little has been done to compare them in the context of TLS. While CMPs utilize smaller and more power-efficient cores, resource sharing and constructive interference between speculative and non-speculative threads can potentially make SMT more power efficient. This paper aims to fill this void by extending a CMP and an SMT processor to support TLS and evaluating the performance and power efficiency of the resulting systems on speculative parallel threads extracted from the SPEC2000 benchmark suite. Because both SMT and CMP processors admit a large variety of configurations, we chose to conduct our study on two architectures with equal die area and the same clock frequency. Our results show that an SMT processor supporting four speculative threads outperforms a CMP processor that supports the same number of threads, uses the same die area and operates at the same clock frequency by 23%, while consuming only 8% more power, on selected SPEC2000 benchmarks. In terms of energy-delay product, the same SMT processor is approximately 10% more efficient than the CMP processor.
    Performance of Runtime Optimization on BLAST
    (2004-10-15) Das, Abhinav; Lu, Jiwei; Chen, Howard; Kim, Jinpyo; Yew, Pen-Chung; Hsu, Wei-Chung; Chen, Dong-yuan
    Optimization of a real-world application, BLAST, is used to demonstrate the limitations of static and profile-guided optimizations and to highlight the potential of runtime optimization systems. We analyze the performance profile of this application to determine performance bottlenecks and evaluate the effect of aggressive compiler optimizations on BLAST. We find that applying common optimizations (e.g. O3) can degrade performance. Profile-guided optimizations do not show much improvement across the board, as current implementations do not address critical performance bottlenecks in BLAST; in some cases these optimizations lower performance significantly due to unexpected secondary effects of aggressive optimizations. We also apply runtime optimization to BLAST using the ADORE framework. ADORE speeds up some queries by as much as 58% using data cache prefetching. Branch mispredictions can also be significant for some input sets, so dynamic optimization techniques to improve branch prediction accuracy are described and examined for this application. We find that the primary limitation on applying runtime optimization to branch misprediction is the tight coupling between data and the dependent branch. With better hardware support for influencing branch prediction, a runtime optimizer could deploy optimizations to reduce branch misprediction stalls.
    PerfView: A Performance Monitoring and Visualization Tool for Intel Itanium Architecture
    (2004-07-27) Lingamneni, Ananth; Das, Abhinav; Hsu, Wei-Chung
    Application performance analysis on modern microprocessors has become extremely complex due to substantial instruction-level parallelism, complex processor pipelines and deep memory hierarchies. Performance analysts need a thorough understanding of the dynamic behavior of programs in order to identify and fix performance bottlenecks. To help in this process, modern processors provide hardware support in the form of performance registers that capture micro-architectural events at program runtime. However, the data provided by these hardware registers is very low level, and performance analysts must make an extensive effort to make sense of it. It is therefore extremely beneficial to use performance analysis tools that can assemble the various types of performance-related information available from the performance registers into a high-level summary. In this paper, we discuss PerfView, which is (1) a source-code-based visualization tool, (2) a tool that identifies and allows users to view performance-critical events based on execution paths, and (3) an interactive performance monitoring and debugging tool.
    Phase Locality Detection Using a Branch Trace Buffer for Efficient Profiling in Dynamic Optimization
    (2002-02-07) Hsu, Wei-Chung; Chen, Howard; Yew, Pen-Chung; Chen, Dong-yuan
    Efficient profiling is a major challenge for dynamic optimization because the profiling overhead contributes to the total execution time. In order to identify program hot spots for runtime optimization, the profiler in a dynamic optimizer must detect new execution phases and subsequent phase changes. Current profiling approaches used in prototype dynamic optimizers are interpretation or instrumentation based; they either have very high overhead or generate poor-quality profiles. We use the branch trace buffer and the hardware performance monitoring features provided in the IA-64 architecture to detect new execution phases and phase changes. The branch trace buffer records the last few branch instructions executed before an event-based interrupt is generated. Using the branch trace buffer, our profiler continuously samples execution paths leading to critical performance events, such as cache misses and pipeline stalls. A set of frequently executed traces and their respective performance characteristics within a time interval is considered an execution phase. Since such phases tend to repeat over time, a dynamic optimizer can exploit this phase locality to drive optimization. We check for new phases and phase changes at the end of each time interval. Although a new or changed phase can be a candidate for optimization, our phase detector delays invoking the optimizer until a relatively stable phase is detected. We report the effectiveness of various phase detection methods using SPECint2000 as the benchmark. Our results indicate that branch-trace-based phase detection is suitable for dynamic binary optimizers.
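    The end-of-interval phase check can be sketched as a set comparison of hot traces. This is a simplified illustration under stated assumptions, not the paper's profiler: `samples` stands in for branch-trace-buffer records (each a tuple of the last few branch targets before a performance event), and the `top_n`/`min_overlap` parameters are invented knobs.

    ```python
    from collections import Counter

    def phase_step(prev_hot, samples, top_n=4, min_overlap=0.5):
        """One interval's phase check: the hot set is the most
        frequent sampled traces; low overlap with the previous
        interval's hot set signals a phase change.
        Returns (hot_set, phase_changed)."""
        hot = {t for t, _ in Counter(samples).most_common(top_n)}
        if not prev_hot:
            return hot, True                          # first interval
        overlap = len(hot & prev_hot) / max(len(hot | prev_hot), 1)
        return hot, overlap < min_overlap
    ```

    A dynamic optimizer would invoke optimization only after several consecutive intervals report no change, matching the abstract's "relatively stable phase" rule.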
    Speculative Register Promotion Using Advanced Load Address Table (ALAT)
    (2002-09-19) Chen, Tong; Lin, Jin; Hsu, Wei-Chung; Yew, Pen-Chung
    Register promotion is intended to allocate more memory references to registers when aliases do not appear in a region of code. When compilers must assume possible aliases, speculative register promotion with hardware support can further improve register promotion. This paper addresses how to use the advanced load address table (ALAT) in speculative register promotion. A compiler algorithm for speculative data flow analysis is proposed, after which traditional partial redundancy elimination can be applied. The transformations based on the ALAT are discussed. Experiments with SPEC CPU2000 are conducted to reveal the potential of speculative register promotion, the precision of our alias speculation, and the efficiency of the ALAT. The results show that our algorithm is able to speculatively locate candidates with high precision. The execution time on Itanium machines of one benchmark, equake, can be reduced by 17% by speculative register promotion. The tradeoffs in the hardware support for speculative register promotion are analyzed.
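    The shape of the ALAT-based transformation can be sketched abstractly. This is a hedged model, not the paper's algorithm: the first load of a promoted address becomes an advanced load (`ld.a`) that enters the ALAT, later loads of the same address are replaced by uses of the register, and each intervening store is followed by a check (`chk.a`) that would reload on a detected conflict.

    ```python
    def promote_with_alat(code, addr, reg):
        """Replace repeated ('load', addr) ops with uses of reg;
        insert an ALAT-style check after every store so a
        conflicting store triggers a reload."""
        out, promoted = [], False
        for op in code:
            if op == ("load", addr):
                if promoted:
                    out.append(("use", reg))           # load eliminated
                else:
                    out.append(("ld.a", addr, reg))    # advanced load enters ALAT
                    promoted = True
            elif op[0] == "store":
                out.append(op)
                if promoted:
                    out.append(("chk.a", addr, reg))   # reload if store aliased
            else:
                out.append(op)
        return out
    ```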
    Using Large Input Sets with Hardware Performance Monitoring for Profile Based Compiler Optimizations
    (2004-09-17) Dalvi, Sagar; Hsu, Wei-Chung; Yew, Pen-Chung
    Traditional Profile Guided Optimization (PGO) uses program instrumentation with one or more small training input data sets to generate edge or value profiles to guide compiler optimizations. This approach has been effective in predicting branch directions for many applications. However, for optimizations that depend more heavily on the performance characteristics and accuracy of the profiles, it is not clear whether profiles generated with small input data sets can reliably predict program behavior under different input sets. We studied the frequent execution paths, IPC, and stall-cycle breakdown of the test, train and different reference input sets of the SPEC2000Int benchmarks. Our studies indicate that small input sets are less effective in predicting program behavior for larger input data sets. We propose using hardware performance monitor (HPM) sampling-based profiles to guide optimizations, because this approach can work with larger input sets and gather information on important performance events. As a proof of concept, we have implemented one type of HPM-sampling-based PBO: we use the dynamic call paths sampled by the HPM to automatically guide procedure inlining in the ORC Compiler. Our results show that this approach has much lower profiling overhead and offers significant performance gains.
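    Turning sampled dynamic call paths into inlining decisions can be sketched with a frequency threshold on call edges. This is an illustrative assumption, not the ORC heuristic: the function name and the 10% cutoff are invented for the example.

    ```python
    from collections import Counter

    def pick_inline_edges(call_path_samples, threshold=0.1):
        """call_path_samples: HPM-sampled call paths like
        ('main', 'parse', 'lex'). Returns the set of
        (caller, callee) edges whose share of all sampled call
        edges meets the threshold, as inlining candidates."""
        edges = Counter()
        for path in call_path_samples:
            edges.update(zip(path, path[1:]))   # consecutive frames = call edges
        total = sum(edges.values()) or 1
        return {e for e, c in edges.items() if c / total >= threshold}
    ```

    Because the samples come from full-size inputs at low overhead, the hot edges reflect production behavior rather than a small training run.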
