The present invention relates to computers, and more particularly, to high-speed, parallel-processing computers employing horizontal architectures.
Typical examples of computers are the IBM 360/370 Systems. In such systems, a series of general purpose registers (GPRs) are accessible to supply data to an arithmetic and logic unit (ALU). The output from the arithmetic and logic unit in turn supplies results from arithmetic and logic operations to one or more of the general purpose registers. In a similar manner, some 360/370 Systems include a floating point processor (FPP) and include corresponding floating point registers (FPRs). The floating point registers supply data to the floating point processor and, similarly, the results from the floating point processor are stored back into one or more of the floating point registers. The types of instructions which employ either the GPRs or the FPRs are register to register (RR) instructions. Frequently, in the operation of the GPRs and the FPRs for RR instructions, identical data is stored in two or more register locations. Accordingly, the operation of storing data into the GPRs is selectively to one or more locations. Similarly, the input to the ALU frequently is selectively from one or more of many locations storing the same data.
Horizontal processors have been proposed for a number of years. See for example, "SOME SCHEDULING TECHNIQUES AND AN EASILY SCHEDULABLE HORIZONTAL ARCHITECTURE FOR HIGH PERFORMANCE SCIENTIFIC COMPUTING" by B. R. Rau and C. D. Glaeser, IEEE Proceedings of the 14th Annual Microprogramming Workshop, October 1981, pp. 183-198 Advanced Processor Technology Group ESL, Inc., San Jose, California, and "EFFICIENT CODE GENERATION FOR HORIZONTAL ARCHITECTURES:COMPILER TECHNIQUES AND ARCHITECTURAL SUPPORT" BY B. Ramakrishna Rau, Christopher D. Glaeser and Raymond L. Picard, IEEE 9th Annual Symposium on Computer Architecture 1982, pp. 131-139.
Horizontal architectures have been developed to perform high speed scientific computations at a relatively modest cost. The simultaneous requirements of high performance and low cost lead to an architecture consisting of multiple pipelined processing elements (PEs), such as adders and multipliers, a memory (which for scheduling purposes may be viewed as yet another PE with two operations, a READ and WRITE), and an interconnect which ties them all together.
The interconnect allows the result of one operation to be directly routed to one of the inputs for another processing element where another operation is to be performed. With such an interconnect the required memory bandwidth is reduced since temporary values need not be written to and read from the memory. Another aspect typical of horizontal processors is that their program memories emit wide instructions which synchronously specify the actions of the multiple and possibly dissimilar processing elements. The program memory is sequenced by a sequencer that assumes sequential flow of control unless a branch is explicitly specified.
As a consequence of their simplicity, horizontal architectures are inexpensive when considering the potential performance obtainable. However, if this potential performance is to be realized, the multiple resources of a horizontal processor must be scheduled effectively. The scheduling task for conventional horizontal processors is quite complex and the construction of highly optimizing compilers for them is difficult and expensive.
The polycyclic architecture has been designed to support code generation by simplifying the task of scheduling the resources of horizontal processors. The advantages are 1) that the scheduler portion of the compiler will be easier to implement, 2) that the code generated will be of a higher quality, 3) that the compiler will execute fast, and 4) that the automatic generation of compilers will be facilitated.
The polycyclic architecture is a horizontal architecture that has unique interconnect and delay elements The interconnect element of a polycyclic processor has a dedicated delay element between every directly connected resource output and resource input. This delay element enables a datum to be delayed by an arbitrary amount of time in transit between the corresponding output and input.
The topology of the interconnect may be arbitrary. It is possible to design polycyclic processors with n resources in which the number of delay elements is 0(n), (a unior multi-bus structure), 0(nlogn), (e.g. delta networks) or 0(n*n), (a cross-bar). The trade-offs are between cost, interconnect bandwidth and interconnect latency. Thus, it is possible to design polycyclic processors lying in various cost-performance brackets.
In previously proposed polycyclic processors, the structure of an individual delay element consists of a register file, any location of which may be read by providing an explicit read address. Optionally, the value accessed can be deleted. This option is exercised on the last access to that value. The result is that every value with addresses greater than the address of the deleted value is simultaneously shifted down, in one machine cycle, to the location with the next lower address. Consequently, all values present in the delay element are compacted into the lowest locations. An incoming value is written into the lowest empty location which is always pointed to by the Write Pointer that is maintained by hardware. The Write Pointer is automatically incremented each time a value is written and is decremented each time one is deleted As a consequence of deletions, a value, during its residence in the delay element, drifts down to lower addresses, and is read from various locations before it is itself deleted.
A value's current position at each instant during execution must be known by the compiler so that the appropriate read address may be specified by the program when the value is to be read. Keeping track of this position is a tedious task which must be performed by a compiler during code-generation.
To illustrate the differences, two processors, a conventional horizontal processor and a polycyclic processor, are compared. A typical horizontal processor contains one adder and one multiplier, each with a pipeline stage time of one cycle and a latency of two cycles. It also contains two scratch-pad register files labeled A and B. The interconnect is assumed to consist of a delayless cross-bar with broadcast capabilities, that is, the value at any input port may be distributed to any number of the output ports simultaneously. Each scratch-pad is assumed to be capable of one read and one write per cycle. A read specified on one cycle causes the datum to be available at the output ports of the interconnect on the next cycle. If a read and a write with the same address are specified on the same scratch-pad on the same cycle, then the datum at the input of the scratch-pad during that cycle will be available at the output ports of the interconnect on the next cycle. In this manner, a delay of one or more cycles may be obtained in transmitting a value between the output of one processor and the input of another. The horizontal processor typically also contains other resources.
A typical polycyclic processor is similar to the horizontal processor except for the nature of the interconnect element and the absence of the two scratchpad register files. While the horizontal processor's interconnect is a crossbar, the polycyclic processor's interconnect is a crossbar with a delay element at each cross-point. The interconnect has two output ports (columns) and one input port (row) for each of the two processing elements. Each cross-point has a delay element which is capable of one read and one write each cycle.
In previously proposed processors, a processor can simultaneously distribute its output to any or all of the delay elements which are in the row of the interconnect corresponding to its output port. A processor can obtain its input directly. If a value is written into a delay element at the same time that an attempt is made to read from the delay element, the value is transmitted through the interconnect with no delay. Any delay may be obtained merely by leaving the datum in the delay element for a suitable length of time.
In the polycyclic processors proposed, elaborate controls were provided for determining which delay element in a row actually received data as a result of a processor operation and the shifting of data from element to element. This selection process and shifting causes the elements in a row to have different data at different times. Furthermore, the removal of data from the different delay elements requires an elaborate process for purging the data elements at appropriate times. The operations of selecting and purging data from the data elements is somewhat cumbersome and is not entirely satisfactory.
Although the polycyclic and horizontal processors which have previously been proposed are an improvement over previously known systems, there still is a need for still additional improvements which increase the performance and efficiency of processors.