1. Field of the Invention.
The present invention generally relates to computer architectures that employ coprocessors closely coupled to a host processor and, in particular, to a coprocessor architecture that allows multiple coprocessors to be functionally configured to selectably implement a highly parallel single instruction, multiple data machine by the utilization of close inter-coprocessor coupling to obtain concurrency in operations on respective coprocessor data.
2. Description of the Related Art.
Computer coprocessors are utilized in a wide variety of computer systems, though typically in conjunction with micro, super-micro, mini and super-mini computer systems, to implement an equally varied number of fairly specific functions. The specific functions performed by these coprocessors are usually closely tailored or dedicated to a particular type of operation, such as servicing a particular end device, while presenting a very high level interface to the host processor of the computer system. The dedicated functions supported by coprocessors commonly include direct hardware supported numerical data computations, support of a specific high-level communications network and management of high speed data channels further in support of, perhaps, other dedicated coprocessors.
Conventionally, each coprocessor within a computer system is a largely separate entity particularly from the perspective of the host processor. That is, each high level function of the various coprocessors within a computer system must be separately initiated, and where required, specifically managed by the host processor itself. Consequently, the host processor is significantly burdened since it is required to effectively function as the primary communication conduit between coprocessors performing interdependent tasks or functions.
With the advent of coprocessors, an immediate and continuing desire has been to increase their respective performance in executing their corresponding dedicated functions. However, coprocessors are typically fully defined chip-level devices. Therefore, specific aspects of their design are rigorously set from the outset. Fixed bus widths and maximum processing speeds result in a hard limit on the performance of the coprocessor. This, in turn, leads to the necessary development of subsequent generations of coprocessors to provide incremental improvements over the previous generation coprocessors. However, the hard upper limit on the processing performance of each coprocessor remains fixed for its generation, whichever generation of development that may be.
A seemingly simple alternative to awaiting subsequent coprocessor generations is to utilize several coprocessors in parallel. There are, however, a number of rather fundamental difficulties in realizing any significant performance gain in this manner. Each coprocessor is an integral, functionally complete unit and, therefore, not readily adaptable to truly parallel operation. Consequently, execution of the function requested of paralleled coprocessors cannot be efficiently performed unless the function is conveniently partitionable into independent, parallel subfunctions equal to or fewer in number than the paralleled coprocessors.
Another problem with the simple paralleling of conventional coprocessors is that the nominally minimum initialization and any subsequent management functions performed by the host processor increase in at least direct proportion to the number of coprocessors utilized. Indeed, the burden on the host processor likely increases at a greater than linear rate, since the initialization and management functions required of the host processor grow in complexity as the number of coprocessors increases.
The foregoing problems are substantially compounded in any application where the paralleled coprocessors are performing processes that are in any way interdependent. For example, if a single end result is required from the function performed globally by conventional, paralleled coprocessors, the host processor is required to collect the respective interim process results from the paralleled coprocessors and distribute them globally at each processing step where a final result might be obtained. Further, if the execution timing of the paralleled coprocessors is in any way data dependent, the host processor is further burdened with the responsibility of ensuring that the paralleled coprocessors are synchronized for contributing properly corresponding interim results.
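The collect-and-distribute burden described above can be illustrated with a minimal sketch. The function name, the interim values and the choice of a maximum as the combining operation are all assumptions made for illustration only; the text does not specify how interim results are combined. The sketch simply counts the host transfers needed for one processing step.

```python
def host_mediated_step(interim_results, combine):
    """Model one processing step in which the host processor relays results.

    Hypothetical model: the host reads one interim result from each
    paralleled coprocessor, combines them, and writes the combined value
    back to every coprocessor. Each read and each write is one transfer.
    """
    transfers = 0

    # Collect phase: one host read per coprocessor.
    collected = []
    for result in interim_results:
        collected.append(result)
        transfers += 1

    combined = combine(collected)

    # Distribute phase: one host write per coprocessor.
    for _ in interim_results:
        transfers += 1

    return combined, transfers


# Assumed interim results from four paralleled coprocessors.
interim = [3, 8, 5, 1]
combined, transfers = host_mediated_step(interim, max)
print(combined, transfers)
```

Under this model the host performs two transfers per coprocessor at every step, so the host workload scales with both the number of coprocessors and the number of processing steps.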
Alternatively, the data dependent execution speed of the paralleled coprocessors might be masked by the enforced treatment of all data dependent execution operations as occurring at their worst case execution speed. Such an enforced lock-step execution would relieve the host processor of a substantial coprocessor management burden. However, the typically wide variance between actual and worst-case execution speed would result in a corresponding loss in the possible net performance gain obtained through the utilization of paralleled coprocessors participating in the performance of a single dedicated function.
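The cost of enforced worst-case lock-step execution can be made concrete with a small arithmetic sketch. All cycle counts below, including the worst-case figure, are hypothetical values chosen for illustration and do not come from any particular coprocessor. The comparison is against an idealized scheme in which each step lasts only as long as the slowest coprocessor actually requires.

```python
def lockstep_time(num_steps, worst_case_cycles):
    """Total cycles when every step is padded to the worst-case duration."""
    return num_steps * worst_case_cycles


def synchronized_time(per_coprocessor_cycles):
    """Total cycles when each step ends as soon as the slowest coprocessor
    has actually finished that step (idealized, no synchronization cost)."""
    return sum(max(step) for step in zip(*per_coprocessor_cycles))


# Hypothetical data dependent cycle counts: one row per coprocessor,
# one column per processing step.
actual_cycles = [
    [4, 9, 5, 6],
    [7, 5, 5, 12],
    [5, 6, 8, 4],
]
WORST_CASE = 16  # assumed worst-case cycles for any single step

lockstep_total = lockstep_time(len(actual_cycles[0]), WORST_CASE)
synchronized_total = synchronized_time(actual_cycles)
print(lockstep_total, synchronized_total)
```

With these assumed figures, lock-step execution takes 64 cycles against 36 for the idealized case, so nearly half of the elapsed time is spent waiting out worst-case padding.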