The present invention relates generally to parallel processors sharing a resource, and more specifically, the invention relates to a transaction of a parallel processing system having characteristics between those of a self-timed implementation and a fully-static implementation.
Parallel processing systems have multiple processors, each having functions similar to a central processing unit (CPU) of typical PC-type computers in widespread use today. The multiple processors are simultaneously active, calculating various intermediate results and exchanging the intermediate results among the several processors as necessary. The processors access a shared resource, such as a memory or a bus, to exchange the information. The term "transaction" refers to an access of the shared resource by one processor for the purpose of accomplishing communication with another processor. For certain tasks, the transactions must occur in a particular order to produce an accurate result. This is referred to as an inherent transaction order.
A difference between a parallel processing computer and the PC-type computer is the ability of the parallel processing computer to simultaneously execute different software components making up a single task. The term "software component" is taken to mean a program or set of instructions executable by a particular processor. Multiple PC-type computers can operate at the same time, but they typically do not operate in conjunction with each other to execute a single task.
It is precisely the concept of having the plural processors of a parallel processing computer simultaneously operate to solve a single task, with the processors exchanging necessary information, which sets the backdrop for the present invention. A problem with parallel processing arises when seeking to exchange needed information between the various processors. A first processor which calculates, for example, a series of intermediate sums used by a second processor in other calculations must properly exchange the numbers with the second processor. One way to do this exchange is to provide a bus connecting the two processors. A memory on this bus could store the data being interchanged. Often, the first processor must provide a first intermediate sum before the second processor can complete its first calculation. If the same memory location is used to transmit a second intermediate sum, then the first processor cannot provide the second intermediate sum prior to use of the first intermediate sum, as the first intermediate sum would be overwritten. Likewise, the second processor cannot read its memory location prior to the first processor writing the correct value, as an erroneous final calculation will result.
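The two hazards described above, overwriting an unread value and reading before the value is written, can be illustrated with a minimal sketch. The function name `replay`, the step letters "W" and "R", and the sample sums are hypothetical and chosen only for illustration; the sketch models a single shared memory location and makes no claim about any particular hardware.

```python
def replay(schedule, sums):
    """Replay 'W' (first processor writes) and 'R' (second processor reads)
    steps against one shared location; return the values the reader saw."""
    slot = None              # the single shared memory location
    pending = iter(sums)     # intermediate sums produced by the first processor
    seen = []
    for step in schedule:
        if step == "W":
            slot = next(pending)   # a write replaces whatever was in the slot
        else:
            seen.append(slot)      # a read returns whatever is in the slot now
    return seen


sums = [10, 20]
in_order = replay("WRWR", sums)      # each sum is read before the next write
overwritten = replay("WWRR", sums)   # second write destroys the first sum
early_read = replay("RWWR", sums)    # first read happens before any write
```

Only the "WRWR" ordering delivers both intermediate sums; the other two orderings lose or never see the first sum, which is why the transactions need an enforced order.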
Increasing the number of processors amplifies the difficulty in performing transactions between the parallel processors. Bus contention resolution is necessary if a bus connects processors to one another. Providing a shared memory facilitates exchanging variable amounts of data between the different processors. Thus, a particular processor writes its data needed by one or more other processors to a prespecified data location in the shared memory after gaining access to the bus. The other processor reads the prespecified data location for its data, again after gaining control of the bus. If many different sets of transactions are occurring between different groups of processors, bus contention can contribute to degraded efficiency of solving the common task.
A greater problem can be ensuring that all the different accesses occur in the proper order, with no processor reading its value from the shared resource prior to the correct processor writing the correct value. There are many different solutions to ordering the shared accesses, depending upon the type of information available at compilation time for the plurality of software components. Two types relevant here are fully-static ordering and self-timed ordering. Fully-static scheduling requires a compiler to establish an exact firing time for the software components, as well as their assignment to processors and order of execution. Self-timed scheduling requires that a compiler establish an order in which the software components fire on each processor. At run time, each processor waits for data to be available for the next software component in its ordered list, and then fires that software component.
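Self-timed firing as described above can be sketched as follows. This is an illustrative model only: the function `self_timed_run`, the processor names, and the component names A through D are assumptions, and time is modeled as simple rounds rather than real execution times.

```python
def self_timed_run(order, needs, max_rounds=100):
    """order: {processor: [component, ...]}, the per-processor order fixed
    at compile time.
    needs: {component: set of components whose output it consumes}.
    Returns the firing sequence observed at run time."""
    done = set()                          # components that have produced data
    position = {p: 0 for p in order}      # next list index for each processor
    fired = []
    for _ in range(max_rounds):
        progressed = False
        for p in sorted(order):
            i = position[p]
            if i == len(order[p]):
                continue                  # this processor has finished its list
            comp = order[p][i]
            if needs.get(comp, set()) <= done:   # all input data available?
                done.add(comp)
                fired.append((p, comp))
                position[p] += 1
                progressed = True
            # otherwise the processor simply waits for its next component
        if not progressed:
            break
    return fired


order = {"P1": ["A", "C"], "P2": ["B", "D"]}
needs = {"B": {"A"}, "C": {"B"}, "D": {"C"}}
sequence = self_timed_run(order, needs)
```

Only the order on each processor is fixed in advance; the moment at which a component fires depends on when its input data arrives.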
In fully-static ordering, the different processors can rely on an absolute time for their accesses to the shared resource. A programmer provides a compiler with precise information regarding a complete set of transactions for each processor such that the compiler can compute an execution time for each of the transactions. The compiler then divides the task and orders the transactions so that no contention for the bus exists and the order of accesses is correct. For example, assume a repetitive task having one minute cycles divided among two processors. The compiler could have the first processor write its value exactly twenty seconds into each cycle. The second processor could read its data exactly forty seconds into each cycle. The problem is that the compiler requires a priori knowledge about the complete set of transactions for each processor and each transaction's execution time. An occurrence of an unscheduled transaction, such as service of a random interrupt, or data dependent iteration or recursion which variably influences the execution time of scheduled transactions, can produce erroneous results. The possibility of producing erroneous results prohibits use of fully-static scheduling for tasks which have unschedulable transactions or which have variable execution times. By "unschedulable," we mean that a compiler has insufficient information at compile time to determine when a transaction should occur. An advantage of fully-static scheduling is that bus contention and ordering are built into the compiled program, making it unnecessary to use extra hardware or software to enforce the proper ordering or to resolve bus contention.
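The twenty-second write and forty-second read of the example above can be modeled with a short sketch. The names `WRITE_AT`, `READ_AT`, `static_cycle`, and the `write_delay` parameter are hypothetical; `write_delay` stands in for any unscheduled event, such as an interrupt, that pushes the first processor off its compiled schedule.

```python
WRITE_AT = 20   # seconds into the cycle: first processor writes
READ_AT = 40    # seconds into the cycle: second processor reads


def static_cycle(write_delay=0):
    """Run one cycle of the fixed schedule and return what the reader saw.
    write_delay models an unscheduled delay on the writing processor."""
    memory = "stale"                      # value left over from the last cycle
    events = [(WRITE_AT + write_delay, "write"), (READ_AT, "read")]
    seen = None
    for _, kind in sorted(events):        # process events in time order
        if kind == "write":
            memory = "fresh"
        else:
            seen = memory                 # the read fires at its fixed time
    return seen


on_schedule = static_cycle()              # write at 20 s, read at 40 s
after_interrupt = static_cycle(25)        # write slips to 45 s, read stays at 40 s
```

Because nothing checks whether the data is ready, the read at forty seconds silently returns the previous cycle's value once the write slips past it.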
Prior art self-timing architectures permit unscheduled events and execution time variations, but at a significant hardware or software overhead. In self-timing architectures, semaphores associated with each data set control access ordering for the plurality of processors. Before a processor can access a particular data set, a semaphore associated with the particular data set must indicate that the particular data set is ready for the transaction the processor wants to perform. For example, if the processor wants to read the value stored in a particular storage location, checking the semaphore could indicate whether the data was last read or written. If the semaphore indicates that it was last read, then the processor does not read the value, but waits until the semaphore indicates that some processor has updated the value stored in the shared memory. A problem associated with this method is the need for bus contention resolution hardware. For instance, in the example above, when the processor desiring to read the particular storage location finds the semaphore in the wrong condition, its access of the shared resource to test the semaphore can delay another processor from updating the particular storage location, degrading the efficiency of the transaction. Additionally, every time a processor is ready to access the shared resource, the processor needs to resolve any contention for the shared resource if another processor is also ready to access the shared resource. As the number of processors increases, resolving contention and ordering the accesses require corresponding increases in overhead to manage this semaphore/contention system. The various contending processors could constantly interfere with smooth and efficient execution of their common task. The system does, however, permit unscheduled tasks and timing variations.
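The semaphore check described above can be sketched with a single guarded storage location. The class `GuardedLocation`, its `full` flag, and the `bus_accesses` counter are hypothetical names introduced for illustration; the sketch assumes one flag per location that records whether the data was last written or last read.

```python
class GuardedLocation:
    """One shared storage location with an associated semaphore flag."""

    def __init__(self):
        self.value = None
        self.full = False        # True: last written; False: last read
        self.bus_accesses = 0    # every semaphore test costs a shared access

    def try_write(self, value):
        self.bus_accesses += 1
        if self.full:            # previous value not yet consumed
            return False
        self.value, self.full = value, True
        return True

    def try_read(self):
        self.bus_accesses += 1
        if not self.full:        # no new value since the last read
            return (False, None)
        self.full = False
        return (True, self.value)


loc = GuardedLocation()
first_read = loc.try_read()      # fails: nothing written yet
loc.try_write(7)                 # succeeds
second_write = loc.try_write(8)  # fails: 7 has not been read
second_read = loc.try_read()     # succeeds and returns 7
```

Two of the four accesses in this run do no useful work, yet each still occupies the shared resource, which is the overhead the paragraph above describes.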
FIG. 3 is a block diagram of a prior art self-timed architecture 10. The architecture implements a self-timing mechanism to control accesses of a plurality of processors P.sub.N to a single shared memory 12. Each processor P.sub.i has a local memory 14 which can store the processor's program or data. A gatekeeper 16.sub.i interposed between each processor P.sub.i and the shared memory 12 controls access to the shared memory 12 for each processor P.sub.i. Data and address lines couple each processor P.sub.i to its gatekeeper 16.sub.i. A bus arbiter 18 resolves contention for a common bus 20 coupling the various gatekeepers 16.sub.i to the shared memory 12 when multiple processors P.sub.i desire access to the shared memory 12 at the same time. The common bus 20 comprises data and address lines controlled by a particular gatekeeper 16.sub.i. The shared memory 12 has a plurality of storage locations particularly selectable in response to particular assertions of particular ones of the address lines, as is well known in the art. Each gatekeeper 16.sub.i asserts a request line to the arbiter 18 when the gatekeeper 16.sub.i desires to access the shared memory 12. When the arbiter 18 grants a particular gatekeeper 16.sub.i access to the common bus 20 so it can access the shared memory 12, the arbiter 18 asserts a grant signal to the gatekeeper 16.sub.i. In particular instances, the gatekeeper 16.sub.i asserts a wait signal to its processor P.sub.i, causing it to halt. As is well known in the art, the gatekeeper 16.sub.i functions could be designed into software executing on the various processors.
In operation, when a particular processor P.sub.1 of the architecture 10 desires to read a particular value from the shared memory 12, the processor P.sub.1 provides an address to its gatekeeper 16.sub.1. The gatekeeper 16.sub.1 asserts a wait signal to its particular processor P.sub.1 until it satisfies the read request. The gatekeeper 16.sub.1 requests access to the bus 20 by asserting its request signal to the arbiter 18. If multiple gatekeepers 16 are requesting access to the bus 20, then the arbiter 18 resolves the contention as is well known in the art. One simple example for contention resolution would be to grant access on a first-come first-served basis. Eventually, the arbiter 18 asserts the grant signal to the gatekeeper 16.sub.1, allowing the gatekeeper to check the semaphore associated with the data at the storage location it is to read for its processor P.sub.1. An incorrectly set semaphore results in the gatekeeper 16.sub.1 deasserting its request signal, indicating it released the bus 20. At some later time, the gatekeeper 16.sub.1 will try again. Eventually, the gatekeeper 16.sub.1 will access the particular storage location and find the semaphore correctly set. The gatekeeper 16.sub.1 will read the value and pass it on to its processor P.sub.1, whereupon the processor P.sub.1 can then continue.
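The request, grant, test, release, and retry sequence described above can be sketched with a first-come first-served arbiter. The function `simulate`, the gatekeeper labels "reader" and "writer", and the value 42 are hypothetical; the sketch assumes one storage location and that a gatekeeper which releases the bus re-asserts its request behind any request already waiting.

```python
from collections import deque


def simulate(arrival_order):
    """arrival_order: gatekeeper labels in the order their request lines
    were asserted. Returns (value delivered to the reader, event log)."""
    queue = deque(arrival_order)   # first-come first-served arbiter
    value, full = None, False      # storage location and its semaphore
    delivered, log = None, []
    while queue:
        gk = queue.popleft()       # arbiter grants the bus to the oldest request
        if gk == "writer":
            if full:               # old value unread: release bus, retry later
                queue.append(gk)
                log.append("writer: retry")
            else:
                value, full = 42, True
                log.append("writer: wrote")
        else:
            if full:               # semaphore correctly set: complete the read
                delivered, full = value, False
                log.append("reader: read")
            else:                  # semaphore wrong: release bus, retry later
                queue.append(gk)
                log.append("reader: retry")
    return delivered, log


reader_first = simulate(["reader", "writer"])
writer_first = simulate(["writer", "reader"])
```

When the reading gatekeeper wins the bus first, it spends one grant only to find the semaphore unset, and the writing gatekeeper has to wait behind that wasted access.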
For writes to the shared memory 12, the gatekeeper 16.sub.i need not halt the processor P.sub.i requesting the write. The processor P.sub.i can proceed with its execution until the next shared memory transaction, and the gatekeeper 16.sub.i can write the value in parallel when the semaphore permits and the gatekeeper 16.sub.i gains access.
To enhance speed, the self-timed architecture 10 of FIG. 3 implements many tasks in hardware which were often implemented in software. A problem associated with the self-timed architecture 10 is that the complex gatekeepers 16 associated with each processor and the arbiter 18 for bus 20 contention resolution are inefficient and add time delays and hardware costs. The gatekeeper 16 will add access delays relative to the shared memory 12 even when it is the only gatekeeper 16 requesting access to the shared memory 12. Further, as multiple gatekeepers 16 result in multiple accesses to the shared memory 12, some timing degradation occurs in resolving the bus 20 accesses for gatekeepers 16 associated with processors P whose data is not ready.
The present state of the art requires a solution to a problem of efficient ordering and bus contention resolution for multiple processes while allowing some execution time variations.