Browsing by Author "Kim, Jinpyo"
Item: COBRA: A Framework for Continuous Profiling and Binary Re-Adaptation (2008-05-09)
Authors: Kim, Jinpyo; Hsu, Wei-Chung; Yew, Pen-Chung
Abstract: Dynamic optimizers have been shown to improve the performance and power efficiency of single-threaded applications. Multithreaded applications running on CMP, SMP, and cc-NUMA systems also exhibit opportunities for dynamic binary optimization. Because existing dynamic optimizers are designed primarily for single-threaded programs, they lack efficient schemes for monitoring multiple threads and for supporting thread-specific or system-wide optimizations based on the collective behavior of those threads. Monitoring and collecting profiles from multiple threads expose optimization opportunities not only for a single core but also for multi-core systems, including their interconnection networks and cache coherence protocol. Detecting global phases of multithreaded programs and determining appropriate optimizations by considering interactions between threads, such as coherence misses, are features that distinguish the dynamic binary optimizer presented in this thesis from prior dynamic optimizers for single-threaded programs. This thesis presents COBRA (Continuous Binary Re-Adaptation), a dynamic binary optimization framework for single-threaded and multithreaded applications. It includes components for collective monitoring and dynamic profiling, profile and trace management, code optimization, and code deployment. The monitoring component collects hot branches and performance information from multiple working threads with the support of the OS and hardware performance monitors, and sends the data to the dynamic profiler. The dynamic profiler accumulates performance bottleneck profiles, such as cache miss information, along with hot branch traces. The optimizer generates new optimized binary traces and stores them in the code cache.
The profiler and optimizer interact closely with each other to achieve more effective code layout and fewer data cache miss stalls. The continuous profiling component monitors only the performance behavior of optimized binary traces and generates feedback to determine the efficiency of optimizations, guiding continuous re-optimization. The framework is currently implemented on Itanium 2 based CMP, SMP, and cc-NUMA systems. This thesis also proposes a new phase detection scheme, with hardware support designed especially for dynamic optimization, that effectively identifies and accurately predicts program phases by exploiting program control flow information. The scheme applies not only to single-threaded programs but also, even more efficiently, to multithreaded programs. The proposed phase detection scheme identifies dynamic intervals: contiguous, variable-length intervals aligned with dynamic code regions that show distinct single-threaded and parallel program phase behavior. Two efficient phase-aware runtime program monitoring schemes are implemented in the COBRA framework. The sampled Basic Block... [NOTE - Abstract continues in actual report]

Item: Dynamic Code Region-based Program Phase Classification and Transition Prediction (2005-05-23)
Authors: Kim, Jinpyo; Kodakara, Sreekumar V.; Hsu, Wei-Chung; Lilja, David J.; Yew, Pen-Chung
Abstract: Detecting and predicting a program's execution phases is crucial to dynamically adaptable systems and dynamic optimizations. Program execution phases have a strong connection to program control structures, in particular loops and procedure calls. Intuitively, a phase can be associated with dynamic code regions embedded in loops and procedures. This paper proposes off-line and on-line analysis techniques that can effectively identify and predict program phases by exploiting program control flow information.
For off-line analysis, we introduce a dynamic interval analysis method that converts a complete program execution into an annotated tree with statistical information attached to each dynamic code region. It can efficiently identify dynamic code regions associated with program execution phases at different granularities. For on-line analysis, we propose new phase tracking hardware that can effectively classify program phases and predict the next execution phase. We have applied our dynamic interval analysis method to 10 SPEC CPU2000 benchmarks. We demonstrate that changes in program behavior correlate strongly with control transfers between dynamic code regions, and we found that a small number of dynamic code regions can represent the whole program execution with high code coverage. Our proposed on-line phase tracking hardware can effectively identify a stable phase at a given granularity and predict the next execution phase with high accuracy.

Item: PASS: Program Structure Aware Stratified Sampling for Statistically Selecting Instruction Traces and Simulation Points (2005-12-30)
Authors: Kodakara, Sreekumar V.; Kim, Jinpyo; Hsu, Wei-Chung; Lilja, David J.; Yew, Pen-Chung
Abstract: As modeled microarchitectures become more complex and benchmark programs keep growing, simulating a complete program with various input sets is practically infeasible within a given time and computation resource budget. A common approach is to simulate only a subset of representative parts of the program selected from the complete program execution. SimPoint [1,2] and SMARTS [10] have shown that accurate performance estimates can be achieved with a relatively small number of instructions. This paper proposes a novel method called Program structure Aware Stratified Sampling (PASS) for further reducing microarchitecture simulation time without losing accuracy or coverage.
PASS has four major phases: building an Extended Calling Context Tree (ECCT), dynamic code region analysis, program behavior profiling, and stratified sampling. The ECCT is constructed via dynamic instrumentation to represent program calling context and repetitive behavior. Dynamic code region analysis identifies code regions with similar program phase behavior. Program behavior profiling stores statistics such as the number of executed instructions, branch mispredictions, and cache misses associated with each code region. Based on the variability of each phase, we adaptively sample instances of instruction streams through stratified sampling. We applied PASS to 12 SPEC CPU2000 benchmark and input combinations and achieved an average 1.46% IPC error bound relative to measurements of native execution on an Itanium 2 machine, using much smaller sampled instruction streams.

Item: Performance of Runtime Optimization on BLAST (2004-10-15)
Authors: Das, Abhinav; Lu, Jiwei; Chen, Howard; Kim, Jinpyo; Yew, Pen-Chung; Hsu, Wei-Chung; Chen, Dong-yuan
Abstract: Optimization of BLAST, a real-world application, is used to demonstrate the limitations of static and profile-guided optimizations and to highlight the potential of runtime optimization systems. We analyze the performance profile of this application to determine performance bottlenecks and evaluate the effect of aggressive compiler optimizations on BLAST. We find that applying common optimizations (e.g., O3) can degrade performance. Profile-guided optimizations do not show much improvement across the board, as current implementations do not address critical performance bottlenecks in BLAST. In some cases, these optimizations lower performance significantly due to unexpected secondary effects of aggressive optimizations. We also apply runtime optimization to BLAST using the ADORE framework. ADORE speeds up some queries by as much as 58% using data cache prefetching.
Branch mispredictions can also be significant for some input sets. Dynamic optimization techniques to improve branch prediction accuracy are described and examined for this application. We find that the primary limitation on applying runtime optimization to branch misprediction is the tight coupling between data and the branches that depend on it. With better hardware support for influencing branch prediction, a runtime optimizer could deploy optimizations to reduce branch misprediction stalls.
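The control-flow-based phase classification described in the COBRA and phase-classification abstracts above can be illustrated with a simplified software sketch. The actual work uses hardware phase-tracking tables and dynamic code regions on Itanium 2; the version below is only a minimal analogue, with all names and the threshold chosen for illustration, that groups sampled execution intervals by the similarity of their basic-block execution frequency vectors.

```python
def manhattan(a, b):
    """Manhattan distance between two normalized basic-block vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def normalize(bbv):
    """Scale a basic-block execution count vector to sum to 1."""
    total = sum(bbv) or 1
    return [x / total for x in bbv]

def classify_phases(intervals, threshold=0.5):
    """Assign each interval's basic-block vector a phase ID.

    An interval joins the first existing phase whose representative
    vector lies within `threshold` Manhattan distance; otherwise it
    starts a new phase. This is a software stand-in for a hardware
    phase-tracking table, not the scheme from the papers.
    """
    reps, phase_ids = [], []
    for bbv in intervals:
        v = normalize(bbv)
        for pid, rep in enumerate(reps):
            if manhattan(v, rep) < threshold:
                phase_ids.append(pid)
                break
        else:
            reps.append(v)
            phase_ids.append(len(reps) - 1)
    return phase_ids
```

For example, two intervals dominated by the same basic blocks land in one phase, while an interval exercising different code starts a new one: `classify_phases([[9, 1, 0], [8, 2, 0], [0, 1, 9]])` yields `[0, 0, 1]`.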
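PASS's first phase builds an Extended Calling Context Tree (ECCT). The paper's ECCT is built via dynamic instrumentation and also captures repetitive (loop) behavior; the sketch below is a hypothetical, much-reduced version that only merges repeated calls to the same callee from the same context and annotates each node with call and instruction counts.

```python
class ECCTNode:
    """One calling context: a function reached via a chain of callers."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}   # callee name -> ECCTNode
        self.calls = 0       # times this context was entered
        self.instructions = 0

def build_ecct(events):
    """Build a simplified calling-context tree from a trace of
    ('enter', func), ('exit', func), and ('count', n) events.

    Repeated calls to the same callee from the same context share one
    node, so the tree stays compact even for repetitive executions.
    """
    root = ECCTNode('<root>')
    cur = root
    for kind, val in events:
        if kind == 'enter':
            node = cur.children.get(val)
            if node is None:
                node = cur.children[val] = ECCTNode(val, cur)
            node.calls += 1
            cur = node
        elif kind == 'exit':
            cur = cur.parent
        else:  # 'count': attribute executed instructions to this context
            cur.instructions += val
    return root
```

In a trace where `a` calls `b` twice, both calls fold into a single `b` node under `a`, accumulating both calls' instruction counts.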
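PASS's final phase draws more samples from phases whose behavior varies more. A minimal sketch of such variability-weighted (Neyman-style) stratified sampling, with hypothetical names and plain per-interval metric values (e.g., IPC) standing in for profiled instruction streams:

```python
import random
import statistics

def stratified_sample(strata, total_samples, seed=0):
    """Pick measurement intervals from each phase (stratum).

    `strata` maps a phase ID to the list of per-interval metric values
    observed for that phase. Sample counts follow Neyman allocation,
    proportional to stratum size times standard deviation, so highly
    variable phases get more samples; each stratum gets at least one.
    """
    rng = random.Random(seed)
    weights = {}
    for pid, vals in strata.items():
        sd = statistics.pstdev(vals) if len(vals) > 1 else 0.0
        weights[pid] = len(vals) * sd
    total_w = sum(weights.values())
    picks = {}
    for pid, vals in strata.items():
        share = weights[pid] / total_w if total_w else 1 / len(strata)
        n = min(max(1, round(total_samples * share)), len(vals))
        picks[pid] = rng.sample(vals, n)
    return picks

def estimate_mean(strata, picks):
    """Population estimate: stratum sample means weighted by stratum size."""
    total = sum(len(v) for v in strata.values())
    return sum(len(strata[p]) * statistics.fmean(s)
               for p, s in picks.items()) / total
```

Weighting each stratum's sample mean by its size mirrors how a small set of sampled streams can stand in for the whole execution: stable phases need few samples, so the bulk of the budget goes where behavior actually varies.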