This invention relates generally to the architecture of microprocessors, and, more specifically, to the structure and use of parallel instruction processing pipelines.
A multi-staged pipeline is commonly used in a single integrated circuit chip microprocessor. A different step of the processing of an instruction is accomplished at each stage of the pipeline. For example, one important stage generates an address of the location in memory where an operand that needs to be retrieved for processing is stored. This address is generated from the instruction and other data to which the instruction points, such as data stored in registers on the same chip. A next stage of the pipeline typically reads the memory at that address in order to fetch the operand and make it available for use within the pipeline. A subsequent stage typically executes the instruction with the operand and any other data pointed to by the instruction. The execution stage includes an arithmetic logic unit (ALU) that uses the operand and other data to perform either a calculation, such as addition, subtraction, multiplication, or division, or a logical combination according to what is specified by the instruction. The result is then, in a further stage, written back into either the memory or into one of the registers. As one instruction is moved along the pipeline, another is right behind it so that, in effect, a number of instructions equal to the number of stages in the pipeline are being simultaneously processed.
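By way of illustration only, the overlap of instructions in a single pipeline can be sketched in a few lines of Python. The four stage names below are taken from the description above; the function name `simulate_pipeline`, the instruction labels, and the assumption that exactly one instruction enters per clock cycle with no stalls are hypothetical simplifications and are not part of the described hardware.

```python
# Stage names follow the description above; everything else is an
# illustrative assumption (one instruction enters per cycle, no stalls).
STAGES = ("address_generate", "operand_fetch", "execute", "write_back")


def simulate_pipeline(instructions, stages=STAGES):
    """Return one dict per clock cycle mapping stage name -> instruction (or None)."""
    if not instructions:
        return []
    total_cycles = len(instructions) + len(stages) - 1
    timeline = []
    for cycle in range(total_cycles):
        occupancy = {}
        for depth, stage in enumerate(stages):
            # The instruction that entered at cycle `index` is `depth` stages deep now.
            index = cycle - depth
            in_range = 0 <= index < len(instructions)
            occupancy[stage] = instructions[index] if in_range else None
        timeline.append(occupancy)
    return timeline


if __name__ == "__main__":
    for cycle, occupancy in enumerate(simulate_pipeline(["i0", "i1", "i2", "i3", "i4"])):
        print(cycle, occupancy)
```

Once the pipeline has filled, every stage holds a different instruction, so the number of instructions in flight equals the number of stages.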
Two parallel multi-stage pipelines are also commonly used. Two instructions may potentially be processed in parallel as they move along the two pipelines. When some interdependency exists between two successive instructions, however, they often cannot be started along the pipeline at the same time. One such interdependency is where the second instruction requires for its execution the result of the execution of the first instruction. Each of the two pipelines has independent access to a data memory through one of two ports for reading operands from it and writing results of the instruction execution back into it. The memory accessed by the pipelines is generally on the integrated circuit chip as cache memory, which, in turn, accesses other semiconductor memory, a magnetic disk drive or other mass storage that is outside of the single microprocessor integrated circuit chip.
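The interdependency described above, where the second instruction requires the result of the first, can be illustrated with the following minimal sketch. The `Instr` record, its register names, and the function `can_issue_together` are hypothetical and chosen only for illustration; real issue logic checks further hazards that are not modeled here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, zero or more sources."""
    name: str
    dest: Optional[str]
    sources: Tuple[str, ...] = ()


def can_issue_together(first: Instr, second: Instr) -> bool:
    """True if `second` does not read the result that `first` produces."""
    return first.dest is None or first.dest not in second.sources


if __name__ == "__main__":
    add = Instr("add", dest="r1", sources=("r2", "r3"))
    sub = Instr("sub", dest="r4", sources=("r1", "r5"))  # reads r1, written by add
    orr = Instr("or", dest="r6", sources=("r7", "r8"))   # independent of add
    print(can_issue_together(add, sub))  # dependent pair
    print(can_issue_together(add, orr))  # independent pair
```

Only the read-after-write case named in the text is checked; the pair `add`/`sub` cannot start together under this rule, while `add`/`or` can.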
It continues to be a goal of processor design to increase the rate at which program instructions are processed. Therefore, it is the primary object of the present invention to provide an architecture for a pipelined microprocessor that makes possible an increased instruction processing throughput.
It is another object of the present invention to provide such a pipelined microprocessor that minimizes the additional amount of power consumed and integrated circuit space required to obtain a given increase in the rate of processing program instructions.
These and additional objects are accomplished by the various aspects of the present invention, wherein, briefly and generally, according to one such aspect, three or more parallel pipelines are provided without having to use more than two data memory ports to retrieve operands or store the results of the instruction processing. It is undesirable to use a memory with more than two ports, or to use two or more separate data memories, because of the added complexity, power consumption and space that such memories entail. It has been recognized, as part of the present invention, that since a significant proportion of the individual instructions of most programs do not need access to data memory in order to be executed, an extra pipeline without such access still results in a significant increase in processing speed without a disproportionate increase in the amount of circuitry or power consumption. In a specific implementation of this aspect of the invention, three instructions are processed in parallel in three pipelines at one time so long as one of those instructions does not need access to the data memory. The two ports of the data memory are made available to the two pipelines processing instructions that need access to the data memory, while the third pipeline processes an instruction that does not require such access.
A three pipeline architecture is preferred. If the three instructions queued for entry into the three pipelines at one time all need access to the data memory, then one of the instructions is held. In this case, the third pipeline is not fully utilized for at least one cycle, but this does not occur excessively because of the high proportion of instructions in most operating systems and programs that do not need access to the data memory. A fourth pipeline may further be added for use with a two port data memory if that proportion of instructions not needing data memory access is high enough to justify the added integrated circuit space and power consumed by the additional pipeline circuitry.
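The holding behavior just described can be sketched as follows. This is a simplified model under stated assumptions, not the issue logic of the invention: instructions are taken strictly in program order, the first instruction that cannot obtain a data memory port is the one held, and interdependencies between instructions are ignored. The names `issue_group` and `cycles_to_issue` and the `(name, needs_memory)` tuples are hypothetical.

```python
def issue_group(queue, n_pipelines=3, n_ports=2):
    """Issue up to `n_pipelines` instructions from the front of `queue`.

    `queue` is a list of (name, needs_memory) tuples. Issue stops at the first
    instruction that needs a data memory port when all ports are taken
    (assumed in-order hold). Returns (issued_names, remaining_queue).
    """
    issued = []
    ports_used = 0
    for name, needs_memory in queue[:n_pipelines]:
        if needs_memory:
            if ports_used == n_ports:
                break  # hold this instruction until the next cycle
            ports_used += 1
        issued.append(name)
    return issued, queue[len(issued):]


def cycles_to_issue(queue, n_pipelines=3, n_ports=2):
    """Count the cycles needed to issue every instruction in `queue`."""
    cycles = 0
    while queue:
        issued, queue = issue_group(queue, n_pipelines, n_ports)
        if not issued:
            raise ValueError("no instruction can issue with this configuration")
        cycles += 1
    return cycles


if __name__ == "__main__":
    program = [("load a", True), ("store b", True), ("load c", True), ("mov r1, r2", False)]
    print(issue_group(program))
    print(cycles_to_issue(program))
```

Passing `n_pipelines=4` models the optional fourth pipeline sharing the same two ports; whether it pays off depends on the instruction mix, as the text notes.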
According to another aspect of the present invention, the third pipeline is made simpler than the other two, since there is also a high enough proportion of instructions that do not need the complex, high performance pipeline stages normally supplied for processing the most complex instructions. A preferred form of the present invention includes two pipelines with stages having the normal full capability while at least some of the stages of the third pipeline are significantly simplified. In a specific implementation of this aspect of the present invention, the address generation stage of the third pipeline is made simpler than the address generation stage of the other two pipelines. The third address generation stage may, for example, be especially adapted to only calculate instruction addresses in response to jump instructions. The ALU of the execution stage of the third pipeline is also, in a specific implementation, made to be much simpler than the ALUs of the other two pipelines. The third ALU, for example, may be dedicated to executing move instructions. The simpler third pipeline stages minimize the extra integrated circuit space and power required of the third pipeline. Yet a significant increase in instruction processing throughput is achieved.
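As a rough illustration of this aspect, the sketch below lists which pipelines could accept a given instruction when the third pipeline handles only jump and move instructions and has no data memory port. The opcode mnemonics, the pipeline names, and the function `eligible_pipelines` are hypothetical placeholders; the actual division of capability is a design choice not fixed by this example.

```python
# Hypothetical opcode sets; the mnemonics are illustrative only.
FULL_OPS = frozenset({"add", "sub", "mul", "div", "and", "or", "load", "store", "mov", "jmp"})
SIMPLE_OPS = frozenset({"mov", "jmp"})  # third pipeline: moves and jump address calculation

PIPELINES = {
    "pipe1": {"ops": FULL_OPS, "memory_port": True},
    "pipe2": {"ops": FULL_OPS, "memory_port": True},
    "pipe3": {"ops": SIMPLE_OPS, "memory_port": False},
}


def eligible_pipelines(opcode, needs_memory):
    """Return the names of pipelines able to process the instruction."""
    return [
        name
        for name, pipe in PIPELINES.items()
        if opcode in pipe["ops"] and (pipe["memory_port"] or not needs_memory)
    ]


if __name__ == "__main__":
    print(eligible_pipelines("mov", needs_memory=False))  # register-to-register move
    print(eligible_pipelines("mul", needs_memory=False))  # needs a full ALU
    print(eligible_pipelines("mov", needs_memory=True))   # move involving data memory
```

A register-to-register move is accepted by all three pipelines in this model, whereas a multiply or any instruction touching data memory is limited to the two full pipelines.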
According to a further aspect of the present invention, individual ones of the multiple stages of each of the pipelines are interconnectable with each other between the pipelines in order to take advantage of a multiple pipelined architecture where the capability and functions performed by a given stage of one pipeline are different from those of the same stage of another pipeline. This allows the pipelines to be dynamically configured according to the need of each instruction. Stages capable of processing a given instruction are connected together without having to use stages with excessive capability in most cases. One instruction, for example, may require a full capability address generator but then only needs the simplest ALU, so the instruction is routed through these two stages. For another instruction, as another example, no address generator may be necessary but a full capability ALU may be required.
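The dynamic configuration described above can be sketched as a selection, for each stage type, of the least capable free stage that still meets the need of the instruction. The capability levels `NONE`, `SIMPLE` and `FULL`, the stage names, and the functions `pick_stage` and `route` are assumptions made for this illustration only; ties between equally capable stages are broken by name.

```python
NONE, SIMPLE, FULL = 0, 1, 2  # hypothetical capability levels

# One address generator and one ALU per pipeline; the third of each is simplified.
ADDRESS_GENERATORS = {"ag1": FULL, "ag2": FULL, "ag3": SIMPLE}
ALUS = {"alu1": FULL, "alu2": FULL, "alu3": SIMPLE}


def pick_stage(stages, required, busy):
    """Pick the least capable free stage meeting `required`, or None if not needed."""
    if required == NONE:
        return None
    candidates = [
        (capability, name)
        for name, capability in stages.items()
        if capability >= required and name not in busy
    ]
    if not candidates:
        raise LookupError("no free stage with sufficient capability")
    return min(candidates)[1]  # lowest capability first, then name order


def route(address_need, alu_need, busy=frozenset()):
    """Return the (address generator, ALU) pair an instruction is routed through."""
    return (
        pick_stage(ADDRESS_GENERATORS, address_need, busy),
        pick_stage(ALUS, alu_need, busy),
    )


if __name__ == "__main__":
    print(route(FULL, SIMPLE))  # full address generator, simplest ALU
    print(route(NONE, FULL))    # no address generator, full ALU
```

The two calls correspond to the two examples in the text: the first instruction passes through a full capability address generator and the simplest ALU, and the second bypasses address generation and uses a full capability ALU.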
The ideal operation which is sought to be achieved is to have three pipelines operating on three instructions all the time with no more circuitry (and thus no more space or power consumption) than is absolutely necessary to process each instruction. Each of the various aspects of the present invention contributes to moving closer to that ideal, the most improvement being obtained when all of these aspects of the present invention are implemented together.
Additional objects, advantages, and features of the present invention will become apparent from the following description of its preferred embodiments, which description should be taken in conjunction with the accompanying drawings.