The coordination of multiple operations in shared memory multiprocessors often constitutes a substantial performance bottleneck. Process synchronization and scheduling are generally performed by software, and managed via shared memory. Execution of parallel programs on a shared-memory, speedup-oriented multiprocessor necessitates a means for synchronizing the activities of the individual processors. This necessity arises due to precedence constraints within algorithms: When one computation is dependent upon the result of other computations, it must not commence before they finish. In the general case, such constraints are projected onto an algorithm's parallel decomposition, and reflected as precedence relations among its execution threads.
Synchronization is only one aspect of a broad activity, which may be termed parallel operation coordination, whose other aspects are scheduling and work allocation. Scheduling is the selection of an execution order for the operations of a program, out of the space of execution orders which are feasible under the given architecture and precedence constraints, as described in the paper entitled "The Effect of Operation Scheduling on the Performance of a Data Flow Computer," M. Gransky et al, IEEE Trans. on Computers, Vol. C-36 No. 9, September 1987, pp. 1019-1029. While scheduling deals with the point of view of the tasks to be computed, work allocation deals with the point of view of the processors which carry out the tasks. Thus, the distinction between scheduling and allocation is not clear-cut, and some researchers use these terms interchangeably. The decisive questions may be posed as follows: the question "which ready-to-run piece of work should be executed first?" is a matter of scheduling policy, while questions of the sort "to which processor should a given piece of work be allocated?" or "how much work should be allocated at once to a given processor?" are considered to be a matter of allocation policy. Scheduling and allocation may be static, i.e. determined before program run-time, or dynamic, i.e. determined during run-time.
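By way of illustration only, the relation between precedence constraints and scheduling policy may be sketched as follows. The task names, the `precedence` graph, and the `run` helper are hypothetical and do not appear in the cited papers; the sketch executes tasks one at a time, and uses the `pick` argument as a stand-in for the scheduling policy that chooses among ready-to-run tasks.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical task graph: each key maps to the set of tasks it depends on.
precedence = {
    "c": {"a", "b"},  # c needs the results of a and b
    "d": {"c"},
    "e": {"a"},
}

def run(precedence, pick=min):
    """Execute tasks sequentially, never starting a task before its predecessors finish.

    `pick` plays the role of the scheduling policy: it selects which
    ready-to-run task is executed first.
    """
    sorter = TopologicalSorter(precedence)
    sorter.prepare()
    ready, order = set(), []
    while sorter.is_active():
        ready.update(sorter.get_ready())  # tasks whose predecessors are all done
        task = pick(ready)
        ready.remove(task)
        order.append(task)                # the task body would run here
        sorter.done(task)                 # may make further tasks ready
    return order
```

Calling `run(precedence)` and `run(precedence, pick=max)` yields two different execution orders, both of which respect the same precedence constraints; only the policy differs.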
In fully dynamic systems, these coordination activities are not an inherent part of the actual computation, but are rather designed to support it. Since they consume computational resources, they are considered overhead. Coordination efficiency, or synchronization efficiency, refers to the efficiency of the parallel operation coordination activity itself, excluding the indirect effects of scheduling policy.
The overall multiprocessor performance is influenced significantly by the efficiency of coordination, as described in the book entitled "High-Performance Computer Architecture", H. S. Stone, Addison-Wesley, 1987, and in the papers entitled "Execution of Parallel Loops on Parallel Processor Systems," C. D. Polychronopoulos et al, Proc. Int. Conf. on Parallel Processing, 1986, pp. 519-527; "A Technique for Reducing Synchronization Overhead in Large Scale Multiprocessors", Z. Li et al, Proc. of the 12th Symp. on Computer Architecture, 1985, pp. 284-291; "The Piecewise Data Flow Architecture: Architectural Concepts," J. E. Requa et al, IEEE Trans. on Computers, Vol. C-32 No. 5, May 1983, pp. 425-438; "A Case Study in the Application of a Tightly Coupled Multiprocessor to Scientific Computations," N. S. Ostlund et al, Parallel Computations, G. Rodrigue, editor, Academic Press, 1982, pp. 315-364; "Synchronized and Asynchronous Parallel Algorithms for Multiprocessors," H. T. Kung, Algorithms and Complexity, Academic Press, 1976, pp. 153-200; and "A Survey of Synchronization Methods for Parallel Computers," A. Dinning, IEEE Computer, Vol. 20 No. 19, January 1987, pp. 100-109.
Inefficiencies in these processes are manifested in overhead-activity and overhead-idling. The former is the activity which is required, once a task has been computed, to obtain a new piece of productive work, while the latter is due to contention for synchronization resources, which are system-global by nature.
Overhead-idling is principally caused by insufficient synchronization rate capability. As noted in the text by H. S. Stone supra, this capability (expressed in MSYPS, Millions of Synchronizations Per Second) constitutes an independent architectural measure; in particular, it is not necessarily proportional to the system's overall raw processing power, as expressed in MIPS and MFLOPS. Decomposing a given algorithm into ever finer granularity levels will yield an ever increasing demand for synchronization rate, and an ever bigger ratio of overhead-activity to productive computation. Thus, at some level of granularity, synchronization may become a bottleneck, thereby practically limiting the exploitable level of parallelism. Consequently, it is desirable to search for means to increase the synchronization rate capability and to reduce the coordination overhead activity of multiprocessor systems.
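The dependence of synchronization demand on granularity may be illustrated with a rough arithmetic sketch. It assumes, purely for illustration, that every processor performs exactly one synchronization each time it finishes a task of `grain_instructions` instructions; the function name and the figures of 64 processors at 10 MIPS each are hypothetical and are not taken from the cited text.

```python
def required_sync_rate_msyps(processors, mips_per_processor, grain_instructions):
    """Synchronization demand in MSYPS, assuming one synchronization per task.

    A processor running at `mips_per_processor` million instructions per second
    completes mips_per_processor / grain_instructions million tasks per second,
    and each completed task is assumed to cost one synchronization.
    """
    return processors * mips_per_processor / grain_instructions

# Hypothetical machine: 64 processors at 10 MIPS each.
for grain in (10_000, 1_000, 100):
    demand = required_sync_rate_msyps(64, 10.0, grain)
    print(f"grain {grain:>6} instructions -> {demand:.3f} MSYPS")
```

Under this assumption, each tenfold refinement of the grain raises the demanded synchronization rate tenfold while the raw processing power stays fixed, which is why the two measures are independent.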
Synchronization methods for multiprocessors were born out of mutual exclusion methods prevalent in multiprogrammed uniprocessors. Synchronization is still usually implemented around special synchronization data in main memory, as described in the paper entitled "Synchronization, Coherence, and Event Ordering in Multiprocessors," M. Dubois et al, IEEE Computer, Vol. 21 No. 2, February 1988, pp. 9-22. These synchronization data are either stand-alone (e.g. locks and semaphores), or attached to regular data objects (such as presence bits). A variety of synchronization primitives, such as Test & Set or Fetch & Add, serve to establish access to synchronization variables and to manipulate them, as described in the paper entitled "The NYU Ultracomputer--Designing an MIMD Shared Memory Parallel Processor," A. Gottlieb et al, IEEE Trans. on Computers, February 1983, pp. 175-89. The implementation of these primitives is based on some special hardware support, whether rudimentary or massive. Yet the essential levels of parallel operation coordination are implemented in software. Some examples of prominent commercial and research multiprocessors which fall within this framework are described in the following papers: "Cm*--A modular multi-microprocessor," R. J. Swan et al, AFIPS Conf. Proc., 1977 National Computer Conference, pp. 637-644; "Architecture and Applications of the HEP Multiprocessor Computer System," B. J. Smith, Real Time Signal Processing IV, Proceedings of SPIE, August 1981, pp. 241-248; "The IBM RP3 Introduction and Architecture," G. F. Pfister et al, Proc. Int. Conf. on Parallel Processing, August 1985, pp. 764-771; "Cedar", D. Gajski et al, Report No. UIUCDCS-R-83-1123, Department of Computer Science, University of Illinois, Urbana, February 1983, pp. 1-25; "Synchronization Scheme and its Applications for Large Multiprocessor Systems," C. Q. Zhu et al, Proc. 4th Int. Conf. on Distributed Computing Systems, 1984, pp. 486-493; and "The Butterfly Parallel Processor," W. Crowther et al, Newsletter of the Computer Architecture Technical Committee (IEEE Computer Society), September/December 1985, pp. 18-45. Within this framework, efforts aimed at improving synchronization efficiency have been directed as follows: development of enhanced hardware support for synchronization primitives (most notably the NYU Ultracomputer's combining network, as described in the paper by Gottlieb supra); development of more powerful synchronization primitives, as described in the paper by C. Q. Zhu et al supra, and the paper by J. R. Goodman entitled "Efficient Synchronization Primitives for Large-Scale Cache-Coherent Multiprocessors," Proc. of the Conf. on Architectural Support for Programming Languages and Operating Systems, ASPLOS-III, 1989, pp. 64-75; development of inherently asynchronous parallel algorithms, as described in the paper by H. T. Kung supra; and development of various techniques for synchronization minimization, as described in the paper by Z. Li et al supra, and in the paper entitled "Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers," C. D. Polychronopoulos et al, IEEE Trans. on Computers, Vol. C-36 No. 12, December 1987, pp. 1425-1439.
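The software framework described above, in which processors coordinate through a synchronization variable in shared memory, may be sketched as follows. This is an illustrative emulation only: Python exposes no hardware Fetch & Add, so the primitive's atomicity is imitated here with a `threading.Lock`, and the names `SharedCounter` and `self_schedule` are hypothetical. Each worker claims loop iterations dynamically by applying Fetch & Add to a shared counter.

```python
import threading

class SharedCounter:
    """Stand-in for a synchronization variable held in shared memory."""

    def __init__(self, value=0):
        self._value = value
        # Assumption: the lock emulates the atomicity that a hardware
        # Fetch & Add primitive would provide.
        self._lock = threading.Lock()

    def fetch_and_add(self, increment):
        """Atomically return the old value and add `increment` to it."""
        with self._lock:
            old = self._value
            self._value += increment
            return old

def self_schedule(num_iterations, num_workers):
    """Let each worker repeatedly claim the next loop index until none remain."""
    counter = SharedCounter()
    claimed = [[] for _ in range(num_workers)]

    def worker(worker_id):
        while True:
            i = counter.fetch_and_add(1)   # one synchronization per iteration
            if i >= num_iterations:
                break
            claimed[worker_id].append(i)   # the loop body would run here

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

Every iteration costs one access to the single shared counter, so all workers contend for the same system-global resource; this is the kind of contention that the efforts listed above seek to reduce.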
A recent survey of synchronization methods, contained in the paper by Dinning supra, describes in detail the synchronization mechanisms of seven machines. While giving a classification of prevalent synchronization methods, the paper confirms the central and basic role of protocols for synchronized access to shared data in all these methods (except in "puristic" message passing).
Synchronization mechanisms which exceed the framework described above, while promoting the role of hardware, have been proposed by various researchers. Some of these proposals are aimed at hardware implementations of barrier synchronization or synchronized wait, as described in the papers entitled "A Controllable MIMD Architecture," S. F. Lundstrom et al, Proceedings of the 1980 International Conference on Parallel Processing, pp. 19-27, and "The Fuzzy Barrier: A Mechanism for High Speed Synchronization of Processors," R. Gupta, Proc. of the Conf. on Architectural Support for Programming Languages and Operating Systems, ASPLOS-III, 1989, pp. 54-63. A more general hardware mechanism, which is aimed at arbitrary parallelism patterns and is based on routing of control tokens, but which is oriented towards essentially static work allocation, is proposed in the paper entitled "A Hardware Task Scheduling Mechanism for Real-Time Multi-Microprocessor Architecture," A. D. Hurt et al, Proceedings of the 1982 Real-Time Systems Symposium, pp. 113-123. A centralized synchronization/scheduling facility, targeted at arbitrary parallelism patterns and at dynamic allocation and scheduling, was argued for in the paper by D. Gajski supra, but no specific architecture was proposed.
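For readers unfamiliar with the term, the behavior of a barrier may be sketched in software as follows. The sketch uses the standard-library `threading.Barrier` and is not a model of the hardware mechanisms proposed in the cited papers; the function name `run_phases` and the logging scheme are hypothetical. No worker enters phase k+1 until every worker has finished phase k.

```python
import threading

def run_phases(num_workers, num_phases):
    """Run `num_phases` phases on `num_workers` threads, separated by barriers."""
    barrier = threading.Barrier(num_workers)
    log = []                      # (phase, worker_id) in order of execution
    log_lock = threading.Lock()   # protects the shared log

    def worker(worker_id):
        for phase in range(num_phases):
            with log_lock:
                log.append((phase, worker_id))  # the phase's work would run here
            barrier.wait()        # block until all workers reach this point

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

In the returned log, all entries of one phase precede all entries of the next, although the order of workers within a phase may vary from run to run.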
Therefore, it would be desirable to provide a global synchronization/scheduling unit which is capable of dynamic allocation and scheduling in a multiprocessor system.