Browsing by Author "Yew, Pen-Chung"
Now showing 1 - 20 of 27
Item A Code Transformation Framework for Scientific Applications on Structured Grids (2011-09-29)
Lin, Pei-Hung; Jayaraj, Jagan; Woodward, Paul; Yew, Pen-Chung

The combination of expert-tuned code expression and aggressive compiler optimizations is known to deliver the best achievable performance for modern multicore processors. The development and maintenance of these optimized code expressions are never trivial. Tedious and error-prone processes greatly decrease a code developer's willingness to adopt manually-tuned optimizations. In this paper, we describe a pre-compilation framework that takes a code expression with much higher programmability and transforms it into an optimized expression that would be much more difficult to produce manually. The user-directed, source-to-source transformations we implement rely heavily on the knowledge of the domain expert. The transformed output, in its optimized format, together with the optimizations provided by an available compiler, is intended to deliver exceptionally high performance. Three computational fluid dynamics (CFD) applications are chosen to exemplify our strategy. The performance results show an average 7.3x speedup over straightforward compilation of the code expression given as input to our pre-compilation framework. Performance of 7.1 Gflop/s per core on the Intel Nehalem processor, 30% of the peak performance, is achieved using our strategy. The framework is seen to be successful for the CFD domain, and we expect that this approach can be extended to cover more scientific computation domains.

Item A Compiler Framework for Recovery Code Generation in General Speculative Optimizations (2003-12-28)
Lin, Jin; Hsu, Wei-Chung; Yew, Pen-Chung; Dz-ching Ju, Roy; Ngai, Tin-fook

Data speculation is an effective way to overcome imprecise alias analysis for many compiler optimizations.
A general framework that integrates both control and data speculation using alias profiling and/or compiler heuristic rules has been shown to improve SPEC2000 performance on Itanium systems [16]. However, speculative optimizations require check instructions and recovery code to ensure correct execution when speculation fails. How to generate check instructions and their associated recovery code efficiently and effectively is an issue yet to be well studied. It is also important to avoid inhibiting later optimizations after the recovery code is generated in the earlier phases. This paper proposes a framework that uses an if-block structure to facilitate the generation of check instructions and recovery code for general speculative optimizations. It allows speculative and recovery code generated early in the compiler phases to be integrated effectively with subsequent compiler optimization phases. It also allows multi-level speculation for multi-level pointers and multi-level expression trees to be handled with no additional complexity. The proposed recovery code generation framework has been implemented in the Open Research Compiler (ORC).

Item A General Compiler Framework for Data Speculation Using DSCM (2004-03-01)
Dai, Xiaoru; Hsu, Wei-Chung; Yew, Pen-Chung

Getting precise alias information in a language that allows pointers, such as C, is expensive. One reason is that alias analysis must generate conservative (safe) alias information: it assumes possible aliases whenever it cannot prove that there are no aliases. This conservative alias information may greatly affect compiler optimizations. In this paper, we present a general framework that allows speculative (unsafe) alias information to be used in several compiler optimizations, such as redundancy elimination, copy propagation, dead store elimination and code scheduling. Speculative alias information is not guaranteed to be correct.
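As a toy illustration of the check-and-recovery idea behind such speculative optimizations (a minimal sketch with invented names, not the DSCM or ORC implementation), a load speculatively hoisted above a possibly-aliasing store can be guarded by a run-time address check, with recovery code that redoes the load when the check fails:

```python
# Hypothetical sketch: data speculation with a run-time check and a recovery
# path. We speculate that locations p and q do not alias, hoist the load of
# mem[p] above the store to mem[q], and fall back to recovery code when the
# address check detects that speculation failed.

def speculative_add(mem, p, q, val):
    speculated = mem[p]        # speculative load, hoisted above the store
    mem[q] = val               # store through q may or may not alias p
    if p == q:                 # check "instruction": did speculation fail?
        speculated = mem[p]    # recovery code: redo the load
    return speculated + 1
```

In a real compiler, the check and the recovery block are emitted as instructions; the Python function only mimics their control flow.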
Run-time verification and recovery code are generated to guarantee the correctness of the optimizations that use speculative alias information. The key idea of the framework is to use data speculative code motion (DSCM) to move the instruction that will be optimized away to the instruction that causes the optimization. During DSCM, verification and recovery code are generated. After DSCM, the optimization becomes non-speculative and the moved instruction can be optimized away.

Item A Hierarchical Approach to Context-Sensitive Interprocedural Alias Analysis (1999-04-21)
Zheng, Bixia; Yew, Pen-Chung

In this paper, we present a hierarchical flow-sensitive alias analysis algorithm which parameterizes the context-sensitive level. Our approach groups the pointers in a program by their maximum possible dereference levels. It then orders the analysis of each pointer group by its pointer level, starting from the highest level down to the lowest level. During the analysis of each pointer group, a bottom-up traversal of the program call graph is followed by a top-down traversal, with the necessary interprocedural information propagated along the way. The interprocedural information is tagged with call-chains, which are program call graph paths, to achieve context sensitivity. We also provide empirical results to quantify how different context-sensitive levels affect the precision and the efficiency of the algorithm. Our studies show that (1) the precision improvement achieved by increasing the context-sensitive level of the analysis varies significantly depending on the programs analyzed; and (2) increasing the maximum call-chain length for a higher degree of context sensitivity may trigger the exponential complexity problem [15, 10, 23].
Thus, it is very desirable for an algorithm to allow users to select an appropriate context-sensitive level that works best for a particular program. By parameterizing the maximum call-chain length used in tagging the interprocedural information, our approach provides this type of flexibility.

Item A Non-blocking Directory Protocol for Large-Scale Multiprocessors (1999-04-08)
Kong, Jinseok; Yew, Pen-Chung; Lee, Gyungho

This paper presents a non-blocking directory-based cache coherence protocol to improve the performance of large-scale distributed shared-memory multiprocessors. In the proposed non-blocking directory protocol, all subsequent requests can proceed without being blocked at the directory. The critical path of a data access and the number of network transactions needed to complete a memory request (including the transactions needed to maintain coherence) are also reduced. To support the non-blocking directory protocol, the history of data accesses is maintained in a small history table at the directory. Using detailed simulations, we evaluate the performance of the protocol compared with other blocking directory protocols.

Item Access Region Locality for High-Bandwidth Processor Memory System Design (1999-02-15)
Cho, Sangyeun; Yew, Pen-Chung; Lee, Gyungho

This paper explores an important behavior of memory access instructions, called access region locality. Unlike traditional temporal and spatial data locality, which focuses on individual memory locations and how accesses to those locations are inter-related, access region locality concerns each static memory instruction and its range of access locations at run time. We consider a program's data, heap, and stack regions in this paper. Our experimental study using a set of SPEC95 benchmark programs shows that most memory reference instructions access a single region at run time.
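The region classification and history-based prediction described above can be sketched roughly as follows; the region boundaries and the `predict_and_record` helper are hypothetical, for illustration only:

```python
# Illustrative sketch: classify each memory access into the program's data,
# heap, or stack region by address range, and keep the per-instruction
# access-region history that a last-region predictor would rely on.

DATA_END, HEAP_END = 0x1000, 0x8000    # hypothetical region boundaries

def region_of(addr):
    if addr < DATA_END:
        return "data"
    return "heap" if addr < HEAP_END else "stack"

history = {}                           # static instruction (pc) -> last region seen

def predict_and_record(pc, addr):
    predicted = history.get(pc)        # predict the last observed region, if any
    actual = region_of(addr)
    history[pc] = actual
    return predicted, actual
```

A real implementation would also inspect the addressing mode (for example, stack-pointer-relative accesses) rather than relying on history alone.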
It is also shown that the access region of a memory instruction can be predicted accurately at run time by scrutinizing the addressing mode of the instruction and its past access region history. An important implication of access region locality is that two memory accesses to different regions are data independent. Utilizing this property, we evaluate a novel processor memory system and pipeline design which can provide high data cache bandwidth without increasing the critical path of the processor, by decoupling the memory instructions that access the program's stack at an early pipeline stage. Experimental results indicate that the proposed technique achieves comparable or better performance than a conventional memory design with a heavily multi-ported data cache, which can lead to much higher hardware complexity.

Item An Efficient Algorithm for the Run-time Parallelization of DOACROSS Loops (1997)
Chen, Ding-Kai; Torrellas, Josep; Yew, Pen-Chung

While loop parallelization usually relies on compile-time analysis of data dependences, for some loops the data dependences cannot be determined at compile time. An example is loops referencing arrays with subscripted subscripts. To parallelize these loops, it is necessary to use run-time analysis. In this paper, we present a new run-time algorithm to parallelize such loops. Our scheme handles any type of data dependence in the loop without requiring any special architectural support in the multiprocessor. Furthermore, compared to an older scheme of the same generality, our scheme significantly reduces the amount of processor communication required and increases the overlap among dependent iterations. We evaluate our algorithm with an extensive set of loops running on the 32-processor Cedar shared-memory multiprocessor. The execution times show speedups over the serial case that reach up to 13 with the full overhead of run-time analysis, and up to 26 if part of the work is reused across loop invocations.
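A much-simplified sketch of run-time dependence analysis in this spirit (not the paper's algorithm, which further reduces communication and overlaps dependent iterations) groups the iterations of a loop like `a[w[i]] = a[r[i]] + 1` into wavefronts that may execute in parallel:

```python
# Illustrative inspector sketch: for a loop whose subscript arrays w and r
# are only known at run time, assign each iteration to a wavefront such that
# iterations in the same wavefront touch disjoint array elements.

def build_wavefronts(w, r, n):
    last = {}                       # element -> wavefront index after its last touch
    wavefront = []
    for i in range(n):
        # conservatively order iteration i after any earlier iteration that
        # touched the same elements (read-read pairs included, for simplicity)
        stage = max(last.get(w[i], 0), last.get(r[i], 0))
        wavefront.append(stage)
        last[w[i]] = last[r[i]] = stage + 1
    return wavefront
```

Iterations assigned the same wavefront number can run concurrently; an executor would then launch the wavefronts in order.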
Moreover, our algorithm outperforms the older scheme of the same generality in nearly all cases, reaching a 37-fold speedup over the older scheme when the loop has many dependences.

Item COBRA: A Framework for Continuous Profiling and Binary Re-Adaptation (2008-05-09)
Kim, Jinpyo; Hsu, Wei-Chung; Yew, Pen-Chung

Dynamic optimizers have been shown to improve the performance and power efficiency of single-threaded applications. Multithreaded applications running on CMP, SMP and cc-NUMA systems also exhibit opportunities for dynamic binary optimization. Because they were designed primarily for single-threaded programs, existing dynamic optimizers lack efficient schemes for monitoring multiple threads and for applying thread-specific or system-wide optimizations based on the collective behavior of multiple threads. Monitoring and collecting profiles from multiple threads expose optimization opportunities not only for single-core systems, but also for multi-core systems that include interconnection networks and the cache coherence protocol. Detecting global phases of multithreaded programs and determining appropriate optimizations by considering interactions between threads, such as coherence misses, are features that distinguish the dynamic binary optimizer presented in this thesis from prior dynamic optimizers for single-threaded programs. This thesis presents COBRA (Continuous Binary Re-Adaptation), a dynamic binary optimization framework for single-threaded and multithreaded applications. It includes components for collective monitoring and dynamic profiling, profile and trace management, code optimization and code deployment. The monitoring component collects hot branches and performance information from multiple working threads with the support of the OS and the hardware performance monitors, and sends the data to the dynamic profiler. The dynamic profiler accumulates performance bottleneck profiles, such as cache miss information, along with hot branch traces.
The optimizer generates new optimized binary traces and stores them in the code cache. The profiler and optimizer interact closely with each other in order to optimize for more effective code layout and fewer data cache miss stalls. The continuous profiling component monitors only the performance behavior of optimized binary traces and generates feedback information to determine the efficiency of optimizations, guiding continuous re-optimization. It is currently implemented on Itanium 2 based CMP, SMP and cc-NUMA systems. This thesis also proposes a new phase detection scheme and hardware support, designed especially for dynamic optimizations, that effectively identify and accurately predict program phases by exploiting program control flow information. This scheme can be applied not only to single-threaded programs, but also, even more efficiently, to multithreaded programs. Our proposed phase detection scheme effectively identifies dynamic intervals, contiguous variable-length intervals aligned with dynamic code regions, that show distinct single and parallel program phase behavior. Two efficient phase-aware runtime program monitoring schemes are implemented in our COBRA framework. The sampled Basic Block... [NOTE - Abstract continues in actual report]

Item Compiler Analysis for Cache Coherence: Interprocedural Array Data-Flow Analysis and Its Impacts on Cache Performance (1997)
Choi, Lynn; Yew, Pen-Chung

In this paper, we present compiler algorithms for detecting references to stale data in shared-memory multiprocessors. The algorithm consists of two key analysis techniques: stale reference detection and locality preserving analysis. While stale reference detection finds the memory reference patterns that may violate cache coherence, locality preserving analysis minimizes the number of such stale references by analyzing both temporal and spatial reuses.
By computing the regions referenced by arrays inside loops, we extend the previous scalar algorithms [8] for more precise analysis. We develop a full interprocedural array data-flow algorithm, which performs both bottom-up side-effect analysis and top-down context analysis on the procedure call graph to further exploit locality across procedure boundaries. The interprocedural algorithm eliminates the cache invalidations at procedure boundaries that were assumed in the previous compiler algorithms [9]. We have fully implemented the algorithm in the Polaris parallelizing compiler [27]. Using execution-driven simulations on the Perfect Club benchmarks, we demonstrate how unnecessary cache misses can be eliminated by automatic stale reference detection. The algorithm can be used to implement cache coherence in shared-memory multiprocessors that do not have hardware directories, such as the Cray T3D [20].

Item Design & Implementation of a Framework for Integrating Front-End & Back-End Compilers (1999-04-20)
Yew, Pen-Chung; Li, Zhiyuan; Song, Yonghong; Tsai, Jenn-Yuan; Zheng, Bixia

We propose a new universal High-Level Information (HLI) format to effectively integrate front-end and back-end compilers by passing front-end information to the back-end compiler. Importing this information into an existing back-end leverages the state-of-the-art analysis and transformation capabilities of existing front-end compilers, allowing the back-end greater optimization potential than it has when relying only on locally-extracted information. A version of the HLI has been implemented in the SUIF parallelizing compiler and the GCC back-end compiler. Experimental results with the SPEC benchmarks show that HLI can provide GCC with substantially more accurate data dependence information than it can obtain on its own.
Our results show that the number of dependence edges in GCC can be reduced substantially for the benchmark programs studied, which provides greater flexibility to GCC's code scheduling pass, common subexpression elimination pass, loop invariant code removal pass and local register allocation pass. Even with GCC's optimizations limited to basic blocks, the use of HLI produces moderate speedups compared to using only GCC's dependence tests when the optimized programs are executed on a MIPS R10000 processor.

Item Dynamic Code Region-based Program Phase Classification and Transition Prediction (2005-05-23)
Kim, Jinpyo; Kodakara, Sreekumar V.; Hsu, Wei-Chung; Lilja, David J.; Yew, Pen-Chung

Detecting and predicting a program's execution phases is crucial to dynamically adaptable systems and dynamic optimizations. Program execution phases have a strong connection to program control structures, in particular loops and procedure calls. Intuitively, a phase can be associated with some dynamic code regions that are embedded in loops and procedures. This paper proposes off-line and on-line analysis techniques that can effectively identify and predict program phases by exploiting program control flow information. For off-line analysis, we introduce a dynamic interval analysis method that converts the complete program execution into an annotated tree with statistical information attached to each dynamic code region. It can efficiently identify dynamic code regions associated with program execution phases at different granularities. For on-line analysis, we propose new phase tracking hardware which can effectively classify program phases and predict the next execution phase. We have applied our dynamic interval analysis method to 10 SPEC CPU2000 benchmarks. We demonstrate that changes in program behavior have a strong correlation with control transfer between dynamic code regions.
We found that a small number of dynamic code regions can represent the whole program execution with high code coverage. Our proposed on-line phase tracking hardware can effectively identify a stable phase at a given granularity and very accurately predict the next execution phase.

Item Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective (2008-06-13)
Packirisamy, Venkatesan; Luo, Yangchun; Hung, Wei-Lung; Zhai, Antonia; Yew, Pen-Chung

The computer industry has adopted multi-threaded and multi-core architectures as the increase in clock rates stalled in the early 2000s. However, because of the lack of compilers and other related software technologies, most general-purpose applications today still cannot take advantage of such architectures to improve their performance. Thread-level speculation (TLS) has been proposed as a way of using these multi-threaded architectures to parallelize general-purpose applications. Both simultaneous multithreading (SMT) and chip multiprocessors (CMP) have been extended to implement TLS. While the characteristics of SMT and CMP have been widely studied under multi-programmed and parallel workloads, their behavior under TLS workloads is not well understood. Because of the speculative nature of its threads, which may be rolled back, and the variable degree of parallelism available in applications, a TLS workload exhibits unique characteristics that distinguish it from other workloads. In this paper, we present a detailed study of the performance, power consumption and thermal behavior of these multithreaded architectures against that of a superscalar with equal chip area. A wide spectrum of design choices and tradeoffs is also studied using commonly used simulation techniques. We show that the SMT-based TLS architecture performs about 21% better than the best CMP-based configuration, while suffering about 16% power overhead.
In terms of the energy-delay-squared product, SMT-based TLS performs about 26% better than the best CMP-based TLS configuration and 11% better than the superscalar architecture. However, the SMT-based TLS configuration causes more thermal stress than the CMP-based TLS architectures.

Item Enhancing Multiple-Path Speculative Execution with Predicate Window Shifting (1997)
Tsai, Jenn-Yuan; Yew, Pen-Chung

Speculative execution has long been used as an approach to exploit instruction-level parallelism across basic block boundaries. Most existing speculative execution techniques only support speculation along a single control path, and rely heavily on branch prediction to choose the right control path. In this paper, we propose an extended predicated execution mechanism, called predicate shifting, to support speculation along multiple control paths. The predicate shifting mechanism maintains a condition/predicate window for each basic block. With the condition/predicate window, instructions can be guarded by predicates related to current or future branch conditions. The predicate shifting mechanism can reduce the number of required tag bits by shifting conditions/predicates out of the condition/predicate window whenever they are no longer in use. To incorporate the predicate shifting mechanism into a VLIW processor, a new result-buffering structure, called the future buffer, is used to buffer uncommitted results and to evaluate predicates. The FIFO structure of the future buffer not only simplifies exception handling but also allows multiple uncommitted writes to the same register.
Experimental results show that the predicate shifting mechanism can use the predicate tag effectively, achieving a 24% performance improvement over the previous predicating mechanism [2] while using a small predicate tag.

Item Exploring Speculative Parallelism in SPEC2006 (2008-11-04)
Packirisamy, Venkatesan; Zhai, Antonia; Yew, Pen-Chung

The computer industry has adopted multi-threaded and multi-core architectures as the increase in clock rates stalled in the early 2000s. It was hoped that the continuous improvement of single-program performance could be achieved through these architectures. However, traditional parallelizing compilers often fail to effectively parallelize general-purpose applications, which typically have complex control flow and excessive pointer usage. Recently, hardware techniques like Transactional Memory (TM) and Thread-Level Speculation (TLS) have been proposed to simplify the task of parallelization by using speculative threads. The potential of speculative parallelism in general-purpose applications like SPEC CPU 2000 has been well studied and shown to be moderately successful. Preliminary work that examined the potential parallelism in SPEC2006 deployed parallel threads with a restrictive TLS execution model and limited compiler support, and thus showed only limited performance potential. In this paper, we first analyze the cross-iteration dependence behavior of SPEC 2006 benchmarks and show that more parallelism potential is available in SPEC 2006 benchmarks compared with SPEC2000. Further, we use a state-of-the-art profile-driven TLS compiler to identify loops that can be speculatively parallelized. Overall, we found an average speedup of 60% on four cores over what could be achieved by a traditional parallelizing compiler such as Intel's ICC compiler on such benchmarks.
We also found that an additional 11% improvement could be obtained on selected benchmarks using 8 cores when we extend TLS to multiple loop levels, as opposed to restricting TLS to a single loop level.

Item Hardware and Compiler-Directed Cache Coherence in Large-Scale Multiprocessors (1997)
Choi, Lynn; Yew, Pen-Chung

In this paper, we study a hardware-supported, compiler-directed (HSCD) cache coherence scheme, which can be implemented on a large-scale multiprocessor using off-the-shelf microprocessors, such as the Cray T3D. The scheme can be adapted to various cache organizations, including multi-word cache lines and byte-addressable architectures. Several system-related issues, including critical sections, inter-thread communication, and task migration, have also been addressed. The cost of the required hardware support is minimal and proportional to the cache size. The necessary compiler algorithms, including intra- and interprocedural array data-flow analysis, have been implemented in the Polaris parallelizing compiler [33]. From our simulation study using the Perfect Club benchmarks [5], we found that, in spite of the conservative analysis made by the compiler, the performance of the proposed HSCD scheme can be comparable to that of a full-map hardware directory scheme. Given its comparable performance and reduced hardware cost, the proposed scheme can be a viable alternative for large-scale multiprocessors such as the Cray T3D, which rely on users to maintain data coherence.

Item Implementation of Trace Optimization and Investigation of Advanced Load Optimization in ADORE (2005-07-20)
Joshi, Sourabh S.; Hsu, Wei-Chung; Yew, Pen-Chung

Dynamic optimizers endeavor to speed up program execution by performing runtime optimization based on knowledge of the program collected at runtime. Depending on the type of optimization applied, the dynamic instruction count (IC) or the cycles per instruction (CPI) of the program can be targeted for improvement.
This report presents two such sets of optimizations, the first targeting dynamic instruction count and the second targeting CPI through data-cache-based optimizations. The first set, trace optimizations, targets the instruction count of the program and consists of copy and constant propagation, partial dead code elimination and store copy propagation. The implementation, optimization issues and performance of these optimizations are discussed in the former half of the report. The latter half of the report concentrates on a new kind of dynamic optimization called advanced load optimization, which strives to improve data cache performance by targeting short-latency loads that cannot be covered by prefetch instructions. The motivation and implementation strategies for this optimization are presented. The use of dynamic instrumentation to gather additional runtime information and track optimizations is also presented in this context. Trace optimizations are not found to be very beneficial in further speeding up already-optimized SPEC code. However, as a proof of concept, these optimizations are shown to be promising in contrived cases where partial dead code does exist. The advanced load optimization looks promising, but needs more implementation work before the more advanced strategies can be tried.

Item Interprocedural Induction Variable Analysis (2002-02-28)
Tang, Peiyi; Yew, Pen-Chung

Induction variable analysis is an important part of the symbolic analysis in parallelizing compilers. Induction variables can be formed by for or DO loops within procedures, or by loops of recursive procedure calls. This paper presents an algorithm to find induction variables in the formal parameters of procedures that are caused by recursive procedure calls.
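A toy instance of this pattern (hypothetical code, not taken from the paper): the formal parameter `i` below is advanced by a constant stride at each recursive call, so it behaves as an induction variable whose values enumerate the array section the recursion touches:

```python
# Hypothetical example: the formal parameter i is an induction variable
# formed by a loop of recursive calls, advancing by stride 2 per invocation.

def visit(a, i, n, touched):
    if i >= n:
        return
    a[i] = i                        # the array section accessed this invocation
    touched.append(i)
    visit(a, i + 2, n, touched)     # recursive call: i advances by stride 2
```

Summarizing `i` as `i0 + 2*k` for call depth `k` lets the compiler describe the accessed array section without unrolling the recursion.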
Compile-time knowledge of induction variables in formal parameters is essential for summarizing the array sections used in data dependence testing and parallelization.

Item Minimizing the Directory Size for Large-scale DSM Multiprocessors (1999-01-09)
Kong, Jinseok; Yew, Pen-Chung; Lee, Gyungho

Directory-based cache coherence schemes are commonly used in large-scale distributed shared-memory (DSM) multiprocessors, but most of them rely on heuristics to avoid large hardware requirements. We propose using physical address mapping on the memory modules to significantly reduce the directory size needed. This approach allows the size of the directory to grow as O(cn log2 n), as in optimal pointer-based directory schemes [10], where n is the number of nodes in the system and c is the number of cache lines in each cache memory. Our simulations show that the proposed scheme can achieve performance equivalent to that of the heuristic pointer-based directory schemes, but with a smaller directory size and a simpler pointer management scheme.

Item PASS: Program Structure Aware Stratified Sampling for Statistically Selecting Instruction Traces and Simulation Points (2005-12-30)
Kodakara, Sreekumar V.; Kim, Jinpyo; Hsu, Wei-Chung; Lilja, David J.; Yew, Pen-Chung

As modeled microarchitectures become more complex and the size of benchmark programs keeps increasing, simulating a complete program with various input sets is practically infeasible within a given time and computation resource budget. A common approach is to simulate only a subset of representative parts of the program selected from the complete program execution. SimPoint [1,2] and SMARTS [10] have shown that accurate performance estimates can be achieved with a relatively small number of instructions. This paper proposes a novel method called Program structure Aware Stratified Sampling (PASS) for further reducing microarchitecture simulation time without losing accuracy and coverage.
PASS has four major phases: building an Extended Calling Context Tree (ECCT), dynamic code region analysis, program behavior profiling, and stratified sampling. The ECCT is constructed via dynamic instrumentation to represent program calling context and repetitive behavior. Dynamic code region analysis identifies code regions with similar program phase behavior. Program behavior profiling stores statistical information about program behavior, such as the number of executed instructions, branch mispredictions, and cache misses, associated with each code region. Based on the variability of each phase, we adaptively sample instances of instruction streams through stratified sampling. We applied PASS to 12 SPEC CPU2000 benchmark and input combinations and achieved an average 1.46% IPC error bound from measurements of native execution on an Itanium-2 machine, with much smaller sampled instruction streams.

Item Performance and power comparison of Thread Level Speculation in SMT and CMP architectures (2007-10-30)
Packirisamy, Venkatesan; Zhai, Antonia; Hsu, Wei-Chung; Yew, Pen-Chung

As technology advances, microprocessors that support multiple threads of execution on a single chip are becoming increasingly common. Improving the performance of general-purpose applications by extracting parallel threads is extremely difficult, due to the complex control flow and ambiguous data dependences that are inherent to these applications. Thread-Level Speculation (TLS) enables speculative parallel execution of potentially dependent threads, and ensures correct execution by providing hardware support to detect data dependence violations and to recover from speculation failures. TLS can be supported on a variety of architectures, among them Chip MultiProcessors (CMP) and Simultaneous MultiThreading (SMT).
While there have been numerous papers comparing the performance and power efficiency of SMT and CMP processors under various workloads, relatively little has been done to compare them in the context of TLS. While CMPs utilize smaller and more power-efficient cores, resource sharing and constructive interference between speculative and non-speculative threads can potentially make SMT more power efficient. This paper aims to fill this void by extending a CMP and an SMT processor to support TLS, and evaluating the performance and power efficiency of the resulting systems with speculative parallel threads extracted for the SPEC2000 benchmark suite. Both SMT and CMP processors have a large variety of configurations; we chose to conduct our study on two architectures with equal die area and the same clock frequency. Our results show that an SMT processor that supports four speculative threads outperforms a CMP processor that supports the same number of threads, uses the same die area and operates at the same clock frequency by 23%, while consuming only 8% more power, on selected SPEC2000 benchmarks. In terms of energy-delay product, the same SMT processor is approximately 10% more efficient than the CMP processor.
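For reference, the energy-delay (ED) and energy-delay-squared (ED^2) metrics quoted in several of the abstracts above can be computed directly; the configuration numbers below are purely illustrative, not taken from any of the listed papers:

```python
# Energy-delay metrics used when comparing processor configurations:
# ED weights energy and delay equally; ED^2 weights delay quadratically,
# favoring configurations that spend somewhat more power to finish sooner.

def ed_products(energy_joules, delay_seconds):
    ed = energy_joules * delay_seconds
    ed2 = energy_joules * delay_seconds ** 2
    return ed, ed2

# Hypothetical comparison: a faster but hungrier configuration versus a
# slower, lower-energy one.
fast = ed_products(energy_joules=12.0, delay_seconds=1.0)
slow = ed_products(energy_joules=10.0, delay_seconds=1.3)
```

Because delay enters ED^2 quadratically, the faster configuration above wins on ED^2 despite consuming more energy.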