Browsing by Author "Johnson, Donald E."
Exploring Fine-Grained Process Interaction in Multiprocessor Systems (1997)
Johnson, Donald E.

Several techniques have been used to improve the performance of process interaction in fine-grained multiprocessor systems. These existing techniques tend to have long memory latencies or synchronization times, or they require complex and expensive hardware. This thesis proposes that user-level hardware and special-purpose communications channels for different interaction domains can dramatically improve access performance at relatively modest hardware cost. The thesis characterizes some specific domains for which the hypothesis holds.

New lock and barrier mechanisms are presented that reduce both contention and latency to the minimum values obtainable with shared-bus communications, requiring at most two shared-bus transactions, with one transaction being typical. Distributed hardware locking queues and barrier flags reduce the latency for process continuation after obtaining a lock or reaching a barrier to near zero. Four additional interaction mechanisms are presented that use serial communication between processing elements (PEs) in a manner that eliminates inter-PE clocking delays. All of these new techniques increase scalability, are applicable to both new architectures and existing systems, and are less complex than other hardware solutions.

The optimum two-dimensional cluster size for N PEs is shown to be proportional to (Nl/D)^(1/2), where l and D are the mean inter-node times, including gate and time-of-flight delays, on the global and local loops, respectively. The access latency when optimally clustered is shown to be proportional to (NlD)^(1/2). Using conservative parameters when optimally clustered, the maximum numbers of PEs for an expected latency of one microsecond are 15621 for barriers, 61308 for locks, 37698 for shared data, and 14592 for shared registers.

All mechanisms are shown to have near-optimum performance if the configuration is near-optimum for any particular mechanism. Hierarchies beyond two levels are shown to have expected latencies proportional to the sum of all loop times.
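The clustering result can be illustrated with a small sketch. It assumes a deliberately simple cost model that is not taken from the thesis: a local loop of c PEs costs c * D, a global loop of N / c clusters costs (N / c) * l, and the latency is their sum. Under that assumption the minimum falls at c = sqrt(N * l / D), which is consistent with the abstract's dependence on Nl/D and with the statement that deeper hierarchies have latencies proportional to the sum of all loop times. The function names and the parameter values are hypothetical.

```python
import math


def loop_latency(c, n_pes, l_global, d_local):
    """Assumed model: local loop time plus global loop time.

    c        -- PEs per cluster (length of the local loop)
    n_pes    -- total number of PEs, N
    l_global -- mean inter-node time on the global loop, l
    d_local  -- mean inter-node time on the local loop, D
    """
    return c * d_local + (n_pes / c) * l_global


def optimal_cluster_size(n_pes, l_global, d_local):
    """Cluster size that minimizes loop_latency under the assumed model."""
    return math.sqrt(n_pes * l_global / d_local)


def optimal_latency(n_pes, l_global, d_local):
    """Value of loop_latency at the optimum cluster size."""
    return 2.0 * math.sqrt(n_pes * l_global * d_local)


def brute_force_cluster_size(n_pes, l_global, d_local):
    """Check the closed form by trying every integer cluster size."""
    return min(
        range(1, n_pes + 1),
        key=lambda c: loop_latency(c, n_pes, l_global, d_local),
    )


if __name__ == "__main__":
    # Hypothetical values in arbitrary time units, not the thesis's parameters.
    n, l, d = 10_000, 4.0, 1.0
    print("closed-form cluster size:", optimal_cluster_size(n, l, d))
    print("brute-force cluster size:", brute_force_cluster_size(n, l, d))
    print("latency at the optimum:  ", optimal_latency(n, l, d))
```

The brute-force search is included only as a cross-check on the closed form; the thesis's own analysis and parameters may differ from this simplified model.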