In typical multi-processor computer systems and multi-core processors, software programs may be divided into function-specific tasks, or “threads”, and tasks within each thread may be performed by a different processing element. For the purposes of this disclosure, “processing element” may refer to a microprocessor, processor core, processing system, software routine, etc., in which instructions are executed to perform a function or functions associated with the instructions. In one prior art processing configuration, a “master” processing element may execute a multi-threaded software program and assign tasks within each thread to other processing elements (“slaves”). In such a “master-slave” multi-processing system, the master must detect when each of the slaves has completed its respective task before assigning another group of tasks to the slaves. A technique for communicating information between the master and slave processing elements to indicate the beginning and/or end of a set of tasks to be performed concurrently by the slaves is often referred to as “barrier synchronization”.
In general, access to registers between processing elements, such as two microprocessors, within a computer system requires intermediate steps, such as storing data in memory before storing the data to a particular register within a processor. Moreover, typical prior art communication between two processing elements may require that the processing elements communicate according to a specific protocol commensurate with the type of computer system of which they are a part. These prior art techniques of communicating between processing elements can require extra processing cycles, which may degrade processor and system performance. For example, in a point-to-point interconnect computer system with shared memory protocols, barrier synchronization using a single shared memory location between N processors can result in as many as 2N cache line transfers, which can translate into 2N² bus transactions.
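To make the growth of the quoted bounds concrete, the following minimal sketch tabulates them for a few values of N. It assumes the second figure is read as 2N squared and simply evaluates the two worst-case expressions stated above; the function name `barrier_traffic_bounds` and the dictionary keys are hypothetical and chosen only for illustration.

```python
def barrier_traffic_bounds(num_processors):
    """Evaluate the worst-case figures quoted in the text for N processors.

    Assumption: the bounds are taken exactly as stated (2N cache line
    transfers, 2N^2 bus transactions); no interconnect is modeled.
    """
    n = num_processors
    return {
        "cache_line_transfers": 2 * n,
        "bus_transactions": 2 * n ** 2,
    }


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        bounds = barrier_traffic_bounds(n)
        print(n, bounds["cache_line_transfers"], bounds["bus_transactions"])
```

Under these bounds, doubling the number of processors doubles the cache line transfers but quadruples the bus transactions.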
FIG. 1 illustrates a processing system (microprocessor or computer system) in which a prior art barrier synchronization technique is used. Particularly, in FIG. 1, the master processing element is executing a program having two threads and assigns a task within each thread to a respective slave processing element. In order for the master to perform barrier synchronization, it must first initialize a counter value stored in either the master, a slave, or some other memory structure, to a known value.
The master must then indicate to each slave that the barrier synchronization counter has been initialized and each slave must acknowledge in response. In some prior art examples, the barrier synchronization counter is stored in a cache line in one of the slaves or the master. In such an example, cache coherency protocols must be used to grant ownership of the cache line to the master and the slaves must use cache coherency protocols to modify the count to indicate when they each have completed their assigned task. When the count indicates that all slaves have completed their tasks, the master may then assign a new task to each of the slaves corresponding to the threads of the multi-threaded program.
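The counter-based scheme described above can be sketched in software as follows. This is an illustrative model only, not the hardware or cache coherency mechanism of FIG. 1: it assumes ordinary threads stand in for slave processing elements and a lock-protected integer stands in for the shared counter. The names `CounterBarrier`, `initialize`, `slave_done`, `wait_for_all`, and `run_rounds` are hypothetical.

```python
import threading


class CounterBarrier:
    """Shared-counter barrier: the master sets the count, each slave decrements it."""

    def __init__(self):
        self._cond = threading.Condition()
        self._count = 0

    def initialize(self, num_slaves):
        # Master sets the counter to a known value before assigning tasks.
        with self._cond:
            self._count = num_slaves

    def slave_done(self):
        # Each slave modifies the shared count when its task completes.
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def wait_for_all(self):
        # Master blocks until the count shows that every slave has finished.
        with self._cond:
            while self._count > 0:
                self._cond.wait()


def run_rounds(num_slaves, num_rounds):
    """Run several task rounds; report whether every slave finished each round."""
    barrier = CounterBarrier()
    results = []
    for _ in range(num_rounds):
        barrier.initialize(num_slaves)
        done = [False] * num_slaves

        def task(slave_id, done=done):
            done[slave_id] = True  # stand-in for the assigned task
            barrier.slave_done()

        workers = [threading.Thread(target=task, args=(i,)) for i in range(num_slaves)]
        for worker in workers:
            worker.start()
        barrier.wait_for_all()
        results.append(all(done))  # only reached after all slaves signalled
        for worker in workers:
            worker.join()
    return results
```

In this model every call to `slave_done` contends for the same shared counter, which mirrors the single shared location that each slave must update in the prior art example.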
The barrier synchronization technique used in the processing system of FIG. 1 requires numerous bus transactions between the slaves and the master due to the caching protocol used to initialize and update the barrier synchronization counter value. The traffic on the bus grows linearly in the example of FIG. 1 as the master processing element executes programs with a greater number of threads and more slave processing elements are added to perform tasks within each thread. Therefore, the prior art barrier synchronization technique illustrated in FIG. 1 can scale poorly with the number of threads executed in a multi-threaded program, as the additional inter-processing element bus traffic can have adverse effects on computing system performance.
FIG. 2 illustrates another processing system in which a prior art barrier synchronization technique may be performed. In particular, FIG. 2 illustrates a multi-processing element (“PE#”) system, in which a barrier synchronization count is stored in a barrier synchronization circuit. Each PE is logically connected (“hard wired”) to the barrier synchronization circuit, which keeps track of the count by associating a bit or bits with each PE via a fabric of logic gates (e.g., “AND” gates) through which the PEs can update their associated bit or bits after completing the concurrent task assigned to them. Once every PE has updated its associated bit or bits, the next task can be assigned to the PEs concurrently.
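The bit-per-PE arrangement described above can be modeled in a few lines. This is a software sketch only, assuming one bit per PE and a single AND across all bits; the actual gate fabric of FIG. 2 is not specified here, and the names `HardWiredBarrier`, `signal_done`, `all_done`, and `reset` are hypothetical.

```python
class HardWiredBarrier:
    """Model of a barrier circuit with one fixed bit per processing element."""

    def __init__(self, num_pes):
        # The number of PEs, and therefore the bit assignment, is fixed
        # at construction, mirroring the hard-wired connections.
        self.num_pes = num_pes
        self.full_mask = (1 << num_pes) - 1
        self.bits = 0

    def signal_done(self, pe_id):
        # A PE can only update the bit it is wired to.
        if not 0 <= pe_id < self.num_pes:
            raise ValueError(f"PE {pe_id} has no bit in this circuit")
        self.bits |= 1 << pe_id

    def all_done(self):
        # Equivalent to ANDing every PE's bit together.
        return (self.bits & self.full_mask) == self.full_mask

    def reset(self):
        # Clear all bits before the next group of concurrent tasks.
        self.bits = 0


if __name__ == "__main__":
    barrier = HardWiredBarrier(num_pes=3)
    barrier.signal_done(0)
    barrier.signal_done(1)
    print(barrier.all_done())  # False: PE 2 has not signalled yet
    barrier.signal_done(2)
    print(barrier.all_done())  # True: every bit is set
```

Because the width is fixed when the object is created, a PE identifier outside the original range is rejected rather than accommodated, which is the software analogue of a circuit with no spare wiring for an additional PE.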
One problem with the technique illustrated in FIG. 2 is that the bit or bits associated with each of the PEs are statically assigned and cannot be changed or reassigned to another PE if, for example, more processing elements are needed and/or added, or some are disabled and/or removed, due to a changing number of threads to be processed. As a result, unused hardware is wasted. Indeed, in order for the processing system of FIG. 2 to scale to a greater number of PEs, a new barrier synchronization circuit must be used that supports the number of threads to be executed. Furthermore, the processing system of FIG. 2 cannot reassign the bit or bits associated with one PE to another PE, due to the hard-wired circuitry associated with each PE and its respective barrier synchronization counter bit(s).
Therefore, system designers must anticipate a maximum number and configuration of threads that may be performed and design the barrier synchronization circuit accordingly. However, if fewer threads are used than the maximum number for which the circuit is designed, the extra circuitry is wasted and unnecessarily increases system cost. Conversely, if more threads are to be supported than the circuit can support, the circuit must be replaced with one that can support the increased number of threads, thereby incurring additional design costs. Moreover, the system illustrated in FIG. 2 may not combine the processing elements to handle a thread, for example, because the association of each PE with a particular barrier synchronization counter bit(s) may not be altered.