Many computer programs are characterized by a large number of floating-point arithmetic operations. As a consequence, a large number of the instructions executed by the machine are floating-point instructions (floating add, floating subtract, floating multiply, floating compare, etc.). For such programs, the greater the number of floating-point arithmetic instructions that can be executed per cycle, the faster the machine operates. Arithmetic results produced by the floating-point arithmetic unit must be saved, in a register file, for instance, for later use and for eventual storage into memory. Although many designs for arithmetic units allow multiple arithmetic operations to be in execution at once, most allow only one actual result to be produced each cycle.
The IBM Model 360/91 is an example of a machine with multiple floating-point arithmetic units. As arithmetic results exit from one of the arithmetic units, they are placed on a putaway (result) bus, along which they travel to the register file and enter the predetermined register for which they are destined. In addition, the results travel along a special bypass bus which is connected to the waiting stations associated with each arithmetic unit. If any instruction in a waiting station is waiting for the newly produced result, the result is entered into one of the buffers in that waiting station. In this way, performance is increased because waiting instructions need not wait while the needed result is first gated into the register file and then gated from the register file into the waiting station. A complete description of this scheme may be found in "An Efficient Algorithm for Exploiting Multiple Arithmetic Units" by R. M. Tomasulo, IBM Journal, January 1967, pp. 25-33. Since there is only one putaway bus and one bypass bus in the Tomasulo scheme, only one floating-point result may be produced each cycle.
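The putaway and bypass paths described above can be illustrated with a minimal Python sketch. It is a simplification and not the 360/91 design itself: the names `WaitingStation` and `broadcast` are hypothetical, and pending operands are identified here by destination register number purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WaitingStation:
    """A waiting instruction with one operand that may still be pending."""
    op: str
    waiting_on: Optional[int]        # register whose result is awaited, or None
    operand: Optional[float] = None  # buffer filled from the bypass bus

    def ready(self) -> bool:
        return self.waiting_on is None


def broadcast(dest_reg: int, value: float,
              register_file: List[float],
              stations: List[WaitingStation]) -> None:
    """Deliver one result per cycle over the single putaway/bypass pair."""
    # Putaway bus: the result enters its destination register.
    register_file[dest_reg] = value
    # Bypass bus: any station waiting on this result captures it directly,
    # without a round trip through the register file.
    for station in stations:
        if station.waiting_on == dest_reg:
            station.operand = value
            station.waiting_on = None


register_file = [0.0] * 4
stations = [WaitingStation("add", waiting_on=2),
            WaitingStation("multiply", waiting_on=3)]
broadcast(2, 1.5, register_file, stations)
```

After the call, the first station holds the value 1.5 and is ready to issue, while the second still waits on register 3. Because `broadcast` carries a single result, the sketch can deliver only one result per cycle.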
U.S. Pat. No. 4,075,704 to O'Leary describes a floating point data processor for high speed operation, which includes two arithmetic units and two putaway busses. Both busses pass back to both arithmetic units. However, each bus enters only one side of each arithmetic unit.
Although the drawings in O'Leary illustrate multiple entries into each side of the arithmetic unit, these are controlled by the decoder (central control). Thus, O'Leary's scheme allows only a single data item, that is, a single result, to enter a given side of the pipeline during any cycle, and it requires central control.
U.S. Pat. No. 3,697,734 to Booth et al. sets forth a digital computer utilizing a plurality of parallel asynchronous arithmetic units, with a bussing scheme similar to that of O'Leary in that only one arithmetic result is produced each cycle.
An article by Sofa et al. in IBM Technical Disclosure Bulletin, Vol. 14, No. 10, March 1972, pp. 2930-2933, sets forth a floating-point arithmetic scheme in which a single arithmetic result is allowed onto the pipeline during each cycle.
In an article by Ramamoorthy in "Computing Surveys", Vol. 9, No. 1, March 1977, pp. 61-85, various schemes are set forth for performing floating-point arithmetic operations, with each of the schemes allowing only one arithmetic result to enter the pipeline during each cycle.
According to the present invention, a floating-point arithmetic unit is described which utilizes dual putaway/bypass busses that allow multiple arithmetic results to be produced and to enter the pipeline in each cycle of operation. That is, an add result may appear on one bus and a multiply result on the other bus during the same cycle of operation.
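The effect of the second bus on result throughput can be sketched in Python as follows. This is an idealized illustration under stated assumptions, not the claimed apparatus: the function name `cycles_to_drain` is hypothetical, each unit is assumed to have a result available every cycle until its work is exhausted, and each bus is assumed to carry one result per cycle.

```python
def cycles_to_drain(add_results: int, mul_results: int, buses: int) -> int:
    """Count the cycles needed to put away all pending add and multiply results.

    Assumption: each unit offers at most one result per cycle, and each
    putaway/bypass bus carries at most one result per cycle.
    """
    if buses < 1:
        raise ValueError("at least one putaway/bypass bus is required")
    cycles = 0
    while add_results > 0 or mul_results > 0:
        free = buses
        if add_results > 0 and free > 0:   # add result takes a bus
            add_results -= 1
            free -= 1
        if mul_results > 0 and free > 0:   # multiply result takes a bus, if any is left
            mul_results -= 1
            free -= 1
        cycles += 1
    return cycles


single_bus = cycles_to_drain(3, 3, buses=1)  # results are serialized
dual_bus = cycles_to_drain(3, 3, buses=2)    # add and multiply retire together
```

With three add results and three multiply results pending, the single-bus case takes six cycles and the dual-bus case takes three, since an add result and a multiply result can be put away in the same cycle.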