1. Field of the Invention
This invention relates to stored program computers, and particularly to an improved (k)-instructions-at-a-time pipelined processor for parallel execution of inherently serial instructions by a specially equipped secondary data flow facility which optimizes the instruction processing capability of a pipelined processor by emulating the result of a prerequisite instruction.
2. Description of the Prior Art
Computer processor designs have traditionally incorporated many refinements to achieve increased throughput. All processors accomplish the same basic result by following the sequence of steps:
(1) FETCH INSTRUCTION,
(2) DECODE INSTRUCTION,
(3) FETCH OPERANDS,
(4) EXECUTE INSTRUCTION,
(5) STORE RESULTS.
Many different approaches have been taken to accomplish the above steps at the greatest possible rate. A common approach is the "pipeline." Multiple instructions pass through the phases of the above steps in sequence and, where possible, their phases take place simultaneously, provided there are no conflicts in demand for hardware and no instruction dependencies. Processing instructions two at a time is an obvious goal, but accommodating hardware demand and instruction dependencies economically is difficult, since it can require expensive replication of hardware and complex control supervision.
A cycle is the period of time required to complete one phase of the pipeline. Commonly, pipelines have three actions:
(1) Staging;
(2) Execution;
(3) Putaway.
Each of these three actions may take one or more phases.
The theoretical limitation on performance for serial pipelined processors is the completion of one instruction per cycle, overlapping Staging, Execution and Putaway for a set of instructions. This is seldom achieved due to instruction interdependencies. The most important interdependencies are:
(1) address generation interlocks;
(2) data dependencies; and
(3) facility lockouts.
An address generation interlock is the inability to compute the address of an operand needed by an instruction until the completion of a previous instruction. A data dependency is the inability to obtain an operand until the completion of a previous instruction. A facility lockout is the inability to use one or more organs of a processor until the completion of a previous instruction which requires the use of the critical organ or organs.
n+(k-1)(n-1), where n is the number of inputs to the minimum facility, and k is the data flow facility number.
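The three interdependencies defined above can be illustrated with a minimal Python sketch. The `Instr` record, its field names, the `interlocks` function, and the sample instructions are all hypothetical and chosen only for illustration; they model each instruction as sets of registers and processor organs and are not taken from any of the cited designs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record (illustrative only)."""
    name: str
    writes: frozenset = frozenset()      # registers the instruction writes
    reads: frozenset = frozenset()       # registers read as data operands
    addr_regs: frozenset = frozenset()   # registers used to form an operand address
    facilities: frozenset = frozenset()  # processor organs the instruction occupies

def interlocks(prev, nxt):
    """Return the interdependencies that make nxt wait for prev to complete."""
    found = []
    if prev.writes & nxt.addr_regs:
        found.append("address generation interlock")
    if prev.writes & nxt.reads:
        found.append("data dependency")
    if prev.facilities & nxt.facilities:
        found.append("facility lockout")
    return found

# Assumed sample instructions: a load, a register add, and a store.
load = Instr("L R1,0(R2)", writes=frozenset({"R1"}),
             addr_regs=frozenset({"R2"}), facilities=frozenset({"storage"}))
add = Instr("AR R3,R1", writes=frozenset({"R3"}),
            reads=frozenset({"R3", "R1"}), facilities=frozenset({"adder"}))
store = Instr("ST R4,0(R3)", reads=frozenset({"R4"}),
              addr_regs=frozenset({"R3"}), facilities=frozenset({"storage"}))

load_then_add = interlocks(load, add)      # add needs R1, which load writes
add_then_store = interlocks(add, store)    # store forms its address from R3
load_then_store = interlocks(load, store)  # both occupy the storage organ
```

In this sketch a pair of instructions with an empty result could, in principle, overlap; any non-empty result is one of the conditions that keeps a serial pipeline below one instruction per cycle.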
The following representative Patents and Publications demonstrate the context of the prior art:
U.S. Pat. No. 3,689,895, Kitamura, MICRO-PROGRAM CONTROL SYSTEM, Sept. 5, 1972, shows a parallel pipeline architecture in which a single micro-program memory is time-division multiplexed among plural arithmetic units.
U.S. Pat. No. 3,787,673, Watson et al, PIPELINED HIGH SPEED ARITHMETIC UNIT, Jan. 22, 1974, shows an array of computational organs arrayed for individual access so as to have simultaneous execution of arithmetic steps within the arithmetic unit as well as simultaneous execution of instructions in the instruction processing pipeline.
U.S. Pat. No. 3,840,861, Amdahl et al, DATA PROCESSING SYSTEM HAVING AN INSTRUCTION PIPELINE FOR CONCURRENTLY PROCESSING A PLURALITY OF INSTRUCTIONS, Oct. 8, 1974, shows an architecture for a two-cycle, time-offset instruction pipeline to match instructions which use two storage access cycles per execution.
U.S. Pat. No. 3,928,857, Carter et al, INSTRUCTION FETCH APPARATUS WITH COMBINED LOOK-AHEAD AND LOOK-BEHIND CAPABILITY, Dec. 23, 1975, shows an instruction pipeline with a multi-word instruction buffer deployed in anticipation of programming loops.
U.S. Pat. No. 3,949,379, Ball, PIPELINE DATA PROCESSING APPARATUS WITH HIGH SPEED SLAVE STORE, Apr. 6, 1976, shows a pipeline processor with provision to hold an address until data becomes available to store in that address.
U.S. Pat. No. 3,969,702, Tessera, ELECTRONIC COMPUTER WITH INDEPENDENT FUNCTIONAL NETWORKS FOR SIMULTANEOUSLY CARRYING OUT DIFFERENT OPERATIONS ON THE SAME DATA, July 13, 1976, shows a computer architecture with a group of differing functional units arrayed along a bus.
U.S. Pat. No. 4,057,846, Cockerill et al, BUS STEERING STRUCTURE FOR LOW COST PIPELINED PROCESSOR SYSTEM, Nov. 8, 1977, shows housekeeping for steering data along unidirectional busses with overlap of input and output functions.
U.S. Pat. No. 4,062,058, Haynes, NEXT ADDRESS SUBPROCESSOR, Dec. 6, 1977, shows a method for processing a special class of programs wherein determination of the next instruction occurs simultaneously with execution of a preceding set of instructions without the delay inherent in performance of the intervening branch conditions. A special subprocessor reviews the registers in the main processor for branch conditions and obtains the next (branch) address while the main processor is finishing routine processing.
U.S. Pat. No. 3,932,845, Beriot, shows plural execution units having differing speeds, similar to Shimoi (U.S. Pat. No. 4,152,763, discussed below), with the difference that Beriot places the fast execution unit and the slow execution unit in parallel, and tries to fit in several short operations during the period taken by one long operation.
U.S. Pat. No. 4,085,450, Tulpule, PERFORMANCE INVARIENT EXECUTION UNIT FOR NON-COMMUNICATIVE INSTRUCTIONS, Apr. 18, 1978, shows a pipeline technique for multiplexing, to three execution units, instructions which are subjected to a mode change if the sequence fits a criterion. Tulpule does not disclose any provision for handling dependent instructions, but rather discloses a standard pipeline in which operands of one instruction are read in parallel with the execution of the previous instruction. Tulpule identifies certain sequences of instructions which can benefit from a mode change from "forward operations" to "reverse operations," a sort of factoring operation to simplify the processing by restating the instruction in a different mode, and implements procedures to convert from forward to reverse mode by manipulating addresses. Tulpule makes special provision for handling reverse register to register instructions by exchanging address pairs within a given execution unit.
U.S. Pat. No. 4,152,763, Shimoi, CONTROL SYSTEM FOR CENTRAL PROCESSING UNIT WITH PLURAL EXECUTION UNITS, May 1, 1979, shows plural small, fast, special purpose execution units for certain common instructions, with a backup shaped execution unit for other instructions. This is a parallel pipeline for the favored instructions, with serial backup for other instructions not favored. Shimoi does not deal with inherently sequential instructions.
U.S. Pat. No. 4,365,311, Fukunaga et al, CONTROL OF INSTRUCTION PIPELINE IN DATA PROCESSING SYSTEM, Dec. 21, 1982, shows an architecture for performing instruction processing by segments of instructions in parallel, with individual clocks which vary depending upon conditions.
Agerwala et al, ELIMINATING THE OVERHEAD OF FLOATING POINT LOAD AND STORE INSTRUCTIONS BY DECODING TWO INSTRUCTIONS PER CYCLE IN THE FLOATING POINT UNIT, IBM Technical Disclosure Bulletin, Vol. 25, No. 1, June 1982, pp. 126-129, shows a floating point arithmetic unit in which two instructions are decoded simultaneously and, during the short loops of floating point executions, data flows along separate paths simultaneously. The goal is to overlap loads and stores with arithmetic operations. There is no parallel execution of inherently serial instructions.
Hardin, VARIABLE I-FETCH, IBM Technical Disclosure Bulletin, Vol. 20, No. 7, December 1977, pp. 2547-2548, shows a technique for fetching the next instruction at a variable time depending on the availability of storage cycles.
Irwin, "A Pipelined Processing Unit for On-Line Division," The 5th Annual Symposium on Computer Architecture, Apr. 3-5, 1978, pp. 24-30, 78CH1284-9C 1979 IEEE, describes a procedure for designing a pipelined computer.
Irwin and Heller, "Online Pipeline Systems for Recursive Numeric Computations," The 7th Annual Symposium on Computer Architecture, May 6-8, 1980, pp. 292-299, CH1494-4/80/0000-0292 1979 IEEE, describes a pipeline system for recursive numeric computations such as are required in double precision division, and uses a multi-input redundant adder in a segment processing function to build up a full precision result.
Lang et al, "A Modeling Approach and Design Tool for Pipelined Central Processors," The 6th Annual Symposium on Computer Architecture, Apr. 23-25, 1979, pp. 122-129, CH1394-6/79/0000-0122 1979 IEEE, describes a procedure for designing and implementing a control unit for a pipelined computer.
Liptay et al, LOAD BYPASS FOR ADDRESS ARITHMETIC, IBM Technical Disclosure Bulletin, Vol. 20, No. 9, February 1978, pp. 3606-3607, shows a pipelined computer in which the operand address generation process may be dependent upon the results of a subsequent instruction that has been decoded but not yet executed. A bypass mechanism provides that data can be bypassed to the address adder, permitting the address generation cycle to occur a cycle earlier. Initiation of the bypass function occurs when the register to be loaded is the same as required in the subsequent address generation. At the same time the returning data is being sent to the addressed general register, it will also be sent directly to the address adder for use. This bypass technique overcomes a facilities interlock and permits parallel execution of certain instructions otherwise requiring queuing because of facility needs--but there is no parallel execution of inherently serial instructions.
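The bypass condition described for Liptay et al can be sketched as follows. This is a rough Python model only: the function name `address_with_bypass`, its parameters, and the dictionary used as a register file are assumptions made for illustration and are not details from the cited disclosure.

```python
def address_with_bypass(regs, load_target, returning_data, base_reg, displacement):
    """Model one address generation that coincides with returning load data.

    regs           -- dict acting as a hypothetical general register file
    load_target    -- register that the returning load data is destined for
    returning_data -- value arriving from storage for the load
    base_reg       -- register the following instruction uses to form its address
    displacement   -- offset added to the base by the address adder
    """
    # Bypass is initiated when the register to be loaded is the same one
    # required by the subsequent address generation.
    bypass = (load_target == base_reg)
    base = returning_data if bypass else regs[base_reg]
    # The returning data is still written to the addressed general register.
    regs[load_target] = returning_data
    return base + displacement, bypass

regs = {"R1": 0, "R2": 100}
bypassed = address_with_bypass(regs, "R1", 4096, "R1", 8)   # base comes from the returning data
ordinary = address_with_bypass(regs, "R1", 4096, "R2", 8)   # base comes from the register file
```

In the bypassed case the address adder does not wait for the register write to finish, which is how the address generation cycle can occur a cycle earlier.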
Owens et al, "On-Line Algorithms for the Design of Pipeline Architectures," The 6th Annual Symposium on Computer Architecture, Apr. 23-25, 1979, pp. 12-19, CH1394-6/79/0000-0012 1979 IEEE, describes a procedure for designing and implementing a control unit for a pipelined computer.
Patel, "Pipelines with Internal Buffers," The 5th Annual Symposium on Computer Architecture, Apr. 3-5, 1978, pp. 249-254, 78CH1284-9C 1979 IEEE, describes a pipelined computer with internal buffers and priority schemes to control queue lengths.
Pomerene et al, SEQUENTIAL I-FETCHING MECHANISMS, IBM Technical Disclosure Bulletin, Vol. 25, No. 1, June 1982, pp. 124-125, shows a two-cycle putaway technique which provides for a better overlap by sharing facilities between two operations which are not required simultaneously. If the putaway requires only one cycle, then the next sequential instruction fetch requires only the second putaway cycle while the store operation requires the first putaway cycle. This two-cycle putaway permits the minimization of conflicts of appropriate types, but does not resolve conflicts by parallel execution of inherently serial instructions.
Sofer et al, PARALLEL PIPELINE ORGANIZATION OF EXECUTION UNIT, IBM Technical Disclosure Bulletin, Vol. 14, No. 10, March 1972, pp. 2930-2933, uses a pre-shifter and a post-shifter with the main adder in the mainstream and has a multiplier in a bypass stream connecting at the input to the main adder. A time consuming multiply or divide operation can be carried out by the multiplier while other operations are passing through the mainstream. The mainstream execution loop requires only four cycles. Five cycles are required to complete the execution of mainstream instructions; results are available on the result bus one cycle earlier and may be "fast forwarded" for use as an operand in a subsequent instruction. This fast forward technique saves one cycle out of five.
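The one-cycle saving attributed to fast forwarding can be checked with a small arithmetic sketch in Python. The constant and function names are hypothetical, and the sketch assumes that cycles are counted from the issue of the producing instruction and that a dependent instruction can take its operand in the cycle the result becomes available.

```python
# Figures taken from the description above; the names are illustrative only.
COMPLETE_CYCLES = 5                       # cycles to complete a mainstream instruction
RESULT_BUS_CYCLES = COMPLETE_CYCLES - 1   # result reaches the result bus one cycle earlier

def operand_ready_cycle(issue_cycle, fast_forward):
    """Cycle at which a dependent instruction can obtain the producer's result."""
    latency = RESULT_BUS_CYCLES if fast_forward else COMPLETE_CYCLES
    return issue_cycle + latency

without_forwarding = operand_ready_cycle(0, fast_forward=False)
with_forwarding = operand_ready_cycle(0, fast_forward=True)
cycles_saved = without_forwarding - with_forwarding
```

Under these assumptions the dependent instruction obtains its operand after four cycles rather than five, matching the stated saving of one cycle out of five.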
This prior art establishes a context of pipelined computers, including parallel pipelined computers, but does not teach parallel execution in a parallel pipelined computer of inherently serial instructions.