1. Technical Field
The present invention relates to improvements in data processing in a multiprocessor computer system.
2. Description of the Related Art
A block diagram of a conventional multiprocessor computer system, abstracted in its simplest terms, is shown in FIG. 1. The processors 101, 111, 121 interact with memory 102 via system bus 122. The program counter 103 belonging to processor 101 specifies a location in memory 102 from which an instruction is fetched. The instruction, formatted as shown in FIG. 2, is dispatched to the execution unit 104 in processor 101. The execution unit 104 performs the operation 201 specified by the instruction. The input values 202 to the operation 201 may be fetched from a location in memory 102 or from a location in register file 105 belonging to processor 101, as represented graphically in FIG. 2. The output values 203 from the operation 201 may be stored to a location in memory 102, to a location in register file 105, or to the program counter 103, as shown graphically in FIG. 2.
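The fetch and execute sequence described above can be illustrated with a minimal software sketch. The sketch is illustrative only and is not part of the system of FIG. 1: the class names, the tuple encoding of operand locations, and the two operations "add" and "mov" are assumptions introduced here for the example.

```python
from dataclasses import dataclass, field

# Hypothetical operand encoding: ("mem", address), ("reg", index), or ("pc",).
@dataclass
class Instruction:
    operation: str   # operation 201; only "add" and "mov" are modeled here
    inputs: list     # locations of input values 202 (memory or register file)
    output: tuple    # location of output value 203 (memory, register file, or PC)

@dataclass
class Processor:
    memory: dict                                               # stands in for memory 102
    registers: list = field(default_factory=lambda: [0] * 4)   # register file 105
    pc: int = 0                                                # program counter 103

    def read(self, loc):
        # Inputs come either from memory or from the register file.
        return self.memory[loc[1]] if loc[0] == "mem" else self.registers[loc[1]]

    def write(self, loc, value):
        if loc[0] == "mem":
            self.memory[loc[1]] = value
        elif loc[0] == "reg":
            self.registers[loc[1]] = value
        else:
            # ("pc",): the result becomes the next fetch address.
            self.pc = value

    def step(self):
        instr = self.memory[self.pc]   # fetch from the location named by the PC
        self.pc += 1
        values = [self.read(loc) for loc in instr.inputs]
        result = sum(values) if instr.operation == "add" else values[0]
        self.write(instr.output, result)

# Two instructions at addresses 0 and 1, data at addresses 100 and 101.
memory = {
    0: Instruction("add", [("mem", 100), ("mem", 101)], ("reg", 0)),
    1: Instruction("mov", [("reg", 0)], ("mem", 102)),
    100: 2,
    101: 3,
}
cpu = Processor(memory)
cpu.step()
cpu.step()
```

After the two steps, the sum of the values at addresses 100 and 101 has passed through a register and been stored at address 102.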
When the input values for an operation executed by one processor are results (i.e., output values) of an instruction executed by another processor within the shared memory multiprocessor environment, the processing of the operation becomes more complex. First, in order for the first processor (for example, processor 101) to obtain such results for use as input values, the second processor (for example, processor 111) must first store those output values to memory 102, so that processor 101 may then retrieve them from memory 102 as input values for the execution of its instruction. As will be appreciated, these prerequisite steps consume additional instructions and clock cycles to store and load the values passed from one processor to the other, thereby creating substantial inefficiencies and undesirable consumption of processor power. Second, the execution of instructions requiring the results of other executed instructions as inputs requires that the processors be synchronized to ensure that the first processor is indeed accessing the appropriate results in memory 102 and not some prior, outdated values. In the prior art, complicated data management procedures are required to ensure that memory coherency is maintained in the system.
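The store, synchronize, and load steps described above can be sketched in software as follows. This is a loose analogy only, assuming two threads in place of processors 111 and 101 and a dictionary in place of memory 102; the names `shared_memory`, `result_ready`, `producer`, and `consumer` are hypothetical and chosen for the example.

```python
import threading

shared_memory = {}                  # stands in for memory 102
result_ready = threading.Event()    # hypothetical synchronization flag

def producer():
    # Plays the role of processor 111.
    value = 6 * 7                       # result initially held locally
    shared_memory["result"] = value     # extra step: store the result to shared memory
    result_ready.set()                  # signal only after the store has completed

def consumer(out):
    # Plays the role of processor 101.
    result_ready.wait()                 # synchronize: avoid reading a stale or missing value
    operand = shared_memory["result"]   # extra step: load the result back from shared memory
    out.append(operand + 1)             # the operation that needed the input value

out = []
t_consumer = threading.Thread(target=consumer, args=(out,))
t_producer = threading.Thread(target=producer)
t_consumer.start()
t_producer.start()
t_producer.join()
t_consumer.join()
```

The consumer is started first on purpose: without the wait on `result_ready`, it could attempt to read the result before the producer has stored it.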
The inefficiencies of sharing data among instructions executing on different processors in a shared memory multiprocessor configuration, relative to sharing data among instructions executing on the same processor, have shaped the way in which algorithms are defined. For example, an algorithm written for a shared memory multiprocessor is carefully partitioned to minimize the performance degradation that results when data produced by an instruction stream executing on one of the processors is shared with an instruction stream executing on another of the processors. This data is typically shared via memory operations and synchronized via locking primitives. By contrast, an algorithm written for a single processor has no such constraints. Data produced by one instruction is shared with other instructions via a register file, which is a high-bandwidth, low-latency mechanism, and is synchronized via the linear sequence of instruction execution.
Since the lower-bandwidth, higher-latency, high-overhead parallelism afforded by shared memory multiprocessors is not suitable for exploiting the fine-grained instruction-level parallelism (ILP) inherent in many algorithms, processor architects have employed other approaches to construct systems which efficiently execute such algorithms. By employing pipelining techniques, multiple execution units, and sophisticated hardware and software instruction scheduling techniques, they have achieved the parallel execution, on a single processor, of instructions found within a single instruction stream (which share data via a register file), and have thereby provided a means to efficiently execute algorithms which exhibit ILP.
Unfortunately, two drawbacks limit the overall effectiveness of such approaches. The first drawback is an increase in processor complexity. When a simple, one-instruction-at-a-time, in-order execution processor is extended to execute several instructions at a time, possibly scheduling them for out-of-order execution, the number of circuits, the silicon area, the circuit complexity, the testing complexity, the development time, the risk, and hence the development cost all typically increase dramatically.
The second drawback is that not all algorithms are able to take advantage of the computational bandwidth afforded by single processors which are capable of executing multiple instructions in parallel. In other words, these algorithms tend to execute nearly as efficiently on simple, one-instruction-at-a-time processors as they do on complex, multiple-instruction-at-a-time processors. Furthermore, many such algorithms typically scale well when executed in multiprocessor environments.
Thus, in the past, the ideal execution environment for the first class of algorithms (those which exhibit ILP) has been a single, complex, multiple-instruction-at-a-time, expensive-to-develop processor, while the ideal execution environment for the second class of algorithms (those which do not) has been a shared memory or distributed memory multiprocessor configuration composed of several simple, one-instruction-at-a-time, inexpensive-to-develop processors.
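The difference between the two classes of algorithms can be illustrated with a simplified scheduling model. The model is an assumption made for this example and does not describe any particular processor: every instruction takes one cycle, each destination register is written only once, and an instruction may issue in any cycle in which its source values are available and an issue slot is free. The function name `cycles_needed` is hypothetical.

```python
def cycles_needed(instructions, issue_width):
    """Count cycles to execute a list of (dest_register, [source_registers]).

    Simplified model (an assumption, not a real processor): unit latency,
    unique destination registers, and greedy issue of up to issue_width
    instructions per cycle.
    """
    ready_at = {}    # register -> first cycle in which its value can be read
    issued = {}      # cycle -> number of instructions already issued in that cycle
    last_cycle = -1
    for dest, sources in instructions:
        # Earliest cycle in which every source value is available.
        cycle = max((ready_at.get(s, 0) for s in sources), default=0)
        # Move to a later cycle if all issue slots in this one are taken.
        while issued.get(cycle, 0) >= issue_width:
            cycle += 1
        issued[cycle] = issued.get(cycle, 0) + 1
        ready_at[dest] = cycle + 1
        last_cycle = max(last_cycle, cycle)
    return last_cycle + 1

# Eight mutually independent instructions: abundant ILP.
independent = [(f"r{i}", []) for i in range(8)]

# Eight instructions, each consuming the previous result: no ILP.
chain = [("r0", [])] + [(f"r{i}", [f"r{i - 1}"]) for i in range(1, 8)]

wide_gain = cycles_needed(independent, 1) / cycles_needed(independent, 4)
chain_gain = cycles_needed(chain, 1) / cycles_needed(chain, 4)
```

Under these assumptions, the independent stream finishes four times sooner with four issue slots per cycle, whereas the dependent chain takes the same number of cycles at either width, which is the behavior attributed above to the second class of algorithms.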