1. Field of the Invention
The present invention relates to data processing systems, and, more particularly, to the providing of instruction fields to a processing unit. The instruction fields control the processing of the data fields applied to the processing unit. Any delay in receiving the instruction fields can impact the performance of the data processing system.
2. Description of the Related Art
Microprocessor units and systems that use microprocessor units have attained widespread use throughout many industries. A goal of any microprocessor system is to process information quickly. One technique that increases the speed with which the microprocessor system processes information is to provide the microprocessor system with an architecture that includes at least one local memory called a cache unit.
A cache unit is used by the microprocessor system to temporarily store instructions and/or data. A cache unit that stores both instructions and data is referred to as a unified cache; a cache unit that stores only instructions is an instruction cache unit and a cache unit that stores only data is a data cache unit. Providing a microprocessor architecture with a unified instruction and data cache or with an instruction cache and a data cache is a matter of design choice. Both data and instructions are represented by data signal groups or fields. In the following discussion, the relationship of the instruction cache unit with the processing unit will be emphasized.
Referring to FIG. 1, a microprocessing system 10 displaying the components that are important to the present discussion is shown. The microprocessing system 10 includes a processing unit 11 and an instruction cache unit 12. The processing unit 11 performs operations under control of instructions or instruction fields retrieved from the instruction cache unit 12. The processing unit 11 and the instruction cache unit 12 are coupled by a bus 13 over which the instruction fields are transferred to the processing unit 11. The processing unit 11 includes a program counter 111 that determines the instruction cache unit location that is currently being accessed, the accessed locations in the instruction cache unit storing instruction fields required by the processing unit 11. Thus, the program counter fields determine which instruction cache unit locations are to be accessed. The program counter fields are therefore addresses or address fields for the instruction cache unit 12. In the following discussion, the terms program counter number and program counter address field will be used interchangeably to mean location values in the program counter unit.
In order to increase the performance of microprocessor systems in the past, the clock cycle period, the basic unit of time for the operations performed by the microprocessor system, has been decreased. At some point, the individual processing operations could no longer be performed within a single clock cycle. In order to decrease the clock cycle period further, the technique of pipelining the microprocessor system and, in particular, the processing unit, was developed. In pipelining a microprocessor, an operation was divided into a plurality of sub-operations, each sub-operation requiring approximately the same amount of time. Because each sub-operation required less time, the clock cycle period could be further reduced, thereby increasing the performance. This increase in performance is accomplished at the expense of increased complexity of the microprocessor resulting from the partitioning of a single operation into a plurality of sub-operations. As a result of the pipelining procedure, a sequence of sub-operations can be completed at the shorter clock cycle period, even though the total operation itself requires a longer period of time to be completed.
Referring to FIG. 2A, an example of a five stage pipeline for the execution of an instruction by a processing unit is shown. As above, the interaction between the processing unit and the instruction cache unit is emphasized. During clock cycle 1, an access of the instruction cache unit (labeled IC in FIG. 2A) is performed. During clock cycle 2, the instruction field decode and register file read (RF) operations are executed. During clock cycle 3, the activity of the execution (EX) pipeline stage is performed. During clock cycle 4, the data cache access (DC) operation is executed. And during clock cycle 5, the update register file (WB) operation is executed. As is clear from FIG. 2A, each pipeline stage requires one clock cycle to accomplish the operations assigned thereto. These operations are actually sub-operations of an activity of the processing unit that was formerly performed in its entirety in one clock cycle. When the processor clock frequency goes up, the cycle time is reduced. Therefore, one stage of the pipeline activity can be completed during each of the reduced clock cycle periods. However, the total time to complete the activity of the pipeline is greater than the original time to execute the activity without the pipeline architecture.
Referring to FIG. 2B, the typical flow of instruction execution in a five stage pipeline, according to the prior art, is shown. In each clock cycle, the execution of another instruction is begun. At t (clock cycle)=1, instruction I1 begins execution in the IC pipeline stage. At t (clock cycle)=2, instruction I1 is being executed in the RF pipeline stage, while the next instruction I2 is being executed in the IC pipeline stage. At t (clock cycle)=3, instruction I1 is being executed in the EX pipeline stage, instruction I2 is being executed in the RF pipeline stage, and instruction I3 has begun execution in the IC pipeline stage. The progress of the instructions is illustrated in FIG. 2B until, at t (clock cycle)=5, instruction I1 is being executed in the last, or WB, pipeline stage. At t (clock cycle)=6, instruction I1 has completed execution and is no longer being executed in the processing unit. At t (clock cycle)=j, the instruction Ij is being executed in the first or IC pipeline stage, instruction Ij-1 is being executed in the RF pipeline stage, instruction Ij-2 is being executed in the EX pipeline stage, instruction Ij-3 is being executed in the DC pipeline stage and instruction Ij-4 is being executed in the WB pipeline stage.
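The stage occupancy described for FIG. 2B can be sketched in a few lines of Python. This is only an illustration of the text above, assuming one instruction enters the IC stage per clock cycle and no stalls occur; the function name pipeline_occupancy and the instruction labels are hypothetical and do not appear in the figures.

```python
# Stage names taken from the five stage pipeline of FIG. 2A.
STAGES = ["IC", "RF", "EX", "DC", "WB"]

def pipeline_occupancy(cycle, num_instructions):
    """Return {stage: instruction label or None} for a 1-based clock cycle.

    Assumes instruction Ik enters the IC stage at clock cycle k and
    advances one stage per clock cycle (no stalls).
    """
    occupancy = {}
    for depth, stage in enumerate(STAGES):
        index = cycle - depth  # instruction that entered IC `depth` cycles ago
        if 1 <= index <= num_instructions:
            occupancy[stage] = f"I{index}"
        else:
            occupancy[stage] = None
    return occupancy

if __name__ == "__main__":
    for t in range(1, 7):
        print(t, pipeline_occupancy(t, num_instructions=10))
```

At clock cycle j (for j of at least 5), the sketch places Ij in IC, Ij-1 in RF, Ij-2 in EX, Ij-3 in DC and Ij-4 in WB, matching the description above.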
As can be seen from FIG. 2A and FIG. 2B, the pipelined processor can complete the execution of an instruction every clock cycle. The clock cycle time is typically much shorter than the time to execute the instruction in a non-pipelined processor. However, this performance benefit carries a penalty, namely the (5 clock cycle) delay before the first instruction is completed and the completion of one instruction per clock cycle can begin. This delay is typically referred to as the (5 cycle) latency of the pipeline. The latency can be an obstacle to achieving the full execution performance of the pipelined processing unit.
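The effect of the latency on a finite instruction sequence can be illustrated with a short calculation. The sketch below assumes an ideal pipeline with no stalls, so that N instructions need (stages + N - 1) clock cycles; the function names are hypothetical.

```python
def total_cycles(num_instructions, stages=5):
    """Clock cycles needed to complete a run of instructions.

    Assumes an ideal pipeline: the first instruction completes after
    `stages` cycles (the latency) and one more completes every cycle after.
    """
    if num_instructions <= 0:
        return 0
    return stages + num_instructions - 1

def average_cycles_per_instruction(num_instructions, stages=5):
    """Average clock cycles per instruction over the whole run."""
    return total_cycles(num_instructions, stages) / num_instructions

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, total_cycles(n), average_cycles_per_instruction(n))
```

Under these assumptions, the average approaches one clock cycle per instruction only for long uninterrupted sequences, which is why the latency matters whenever the pipeline has to be refilled.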
The subdividing of the processing unit into pipeline stages can increase the performance of the processing unit. However, in each clock cycle, a plurality of operations are performed. For example, referring to FIG. 3, during the first pipeline (IC) stage, three separate sub-sub-operations are performed. First, the correct location in the instruction cache unit must be accessed and the instruction field stored therein transferred to the processing unit. Then the processing unit performs a decoding operation on a predecode subfield of the transferred instruction field. The predecode subfield is an instruction field component assisting in the determination of the next program counter (NPC) address. This program counter address identifies the location of the next instruction field. Thus, this activity must be completed before the beginning of the second (RF) clock cycle, because the next instruction field must be accessed and transferred during the second clock cycle as shown in FIG. 3.
As the clock cycle period is further decreased, problems in the foregoing pipelined operation become apparent. For example, the transfer of the instruction field can only be shortened by a limited amount. Any attempt to further reduce this time results in inaccuracies in the identification of the logic signals transferred on the bus 13 in FIG. 1. Similarly, decoding the predecode subfield of the transferred instruction field, even though only a partial decoding of the instruction field, requires a certain amount of time and, if the clock cycle becomes too short, this amount of time is insufficient to determine the next program counter address.
Referring to FIG. 4, one solution, according to the related art, to the amount of activity needed to be performed during the IC clock cycle is shown. As shown in FIG. 4, the IC pipeline stage shown in FIG. 2A is subdivided into two stages, labeled IC1 and IC2. The first, or IC1, sub-pipeline stage, in response to the address field from the program counter, transfers an instruction field to the processing unit. Because an entire clock cycle is devoted to this transfer, the transfer of the instruction field to the processing unit should be unambiguous. During the second, or IC2, sub-pipeline stage, the decoding of the predecode subfield of the instruction field and the generation of the next program counter field are completed. This solution ensures that the correct next program counter address is determined in a timely fashion. However, this solution has an unacceptable result, which is illustrated in FIG. 5. As can be seen in FIG. 5, each and every stage of the pipeline performs an operation during a clock cycle and has no activity to perform on the following clock cycle. In other words, the performance of the processing unit has been decreased by 50%.
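The 50% figure can be checked with a small occupancy model of the split pipeline. The sketch below is an interpretation of the description of FIG. 5, not a reproduction of it: it assumes six stages (IC1, IC2, RF, EX, DC, WB) and that a new instruction can enter IC1 only every second clock cycle, because the next program counter address is not known until IC2 has finished. The function names are hypothetical.

```python
# Assumed stage list once IC is subdivided into IC1 and IC2 (FIG. 4).
SPLIT_STAGES = ["IC1", "IC2", "RF", "EX", "DC", "WB"]

def split_pipeline_occupancy(cycle, num_instructions, issue_interval=2):
    """Return {stage: instruction label or None} for a 1-based clock cycle.

    Assumes instruction Ik enters IC1 at cycle 1 + issue_interval * (k - 1)
    and then advances one stage per clock cycle.
    """
    occupancy = {}
    for depth, stage in enumerate(SPLIT_STAGES):
        offset = cycle - 1 - depth  # cycles since I1 reached this stage
        index = offset // issue_interval
        if offset >= 0 and offset % issue_interval == 0 and index < num_instructions:
            occupancy[stage] = f"I{index + 1}"
        else:
            occupancy[stage] = None
    return occupancy

def utilization(stage, first_cycle, last_cycle, num_instructions):
    """Fraction of cycles in [first_cycle, last_cycle] in which `stage` is busy."""
    busy = sum(
        split_pipeline_occupancy(c, num_instructions)[stage] is not None
        for c in range(first_cycle, last_cycle + 1)
    )
    return busy / (last_cycle - first_cycle + 1)

if __name__ == "__main__":
    for t in range(1, 5):
        print(t, split_pipeline_occupancy(t, num_instructions=1000))
    print(utilization("IC1", 1, 100, num_instructions=1000))
```

Under these assumptions every stage is busy in only one of every two clock cycles, which corresponds to the alternating active and idle cycles described for FIG. 5.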
A need has been felt for apparatus and an associated technique for transferring the instruction fields from the instruction cache unit to the processing unit at a more rapid clock rate than previous pipelined processing units. The apparatus and associated technique should have the feature that voids in the pipelined execution of instructions would be minimized.
The aforementioned and other features are accomplished, according to the present invention, by dividing the instruction cache access and the next program counter field computation into two independent activities. However, both activities are performed during the same processor clock cycle. A speculative next program counter address is calculated by incrementing the current program counter address during a current instruction cache access. This speculative address takes advantage of the fact that program instructions are generally arranged in sequential order. The speculative program counter address is available for identifying the next sequential instruction cache access. The speculative program counter address is applied to the program counter and the corresponding instruction field is accessed during the next clock cycle. During the current instruction cache access, the processing unit begins decoding the predecode subfield of the instruction field that was retrieved from the instruction cache during the previous clock cycle. The decoding and other apparatus determine the actual (correct) next program counter address. Thus, during the clock period when the speculative next program counter address brought forward from the previous clock period is accessing an instruction field, the decoding and other apparatus are determining whether the speculative program counter address is the correct program counter address. After determination of the actual program counter address, this actual program counter address is compared to the next program counter address from the previous clock cycle, the next program counter address from the previous clock cycle determining the instruction cache field accessed during the current clock cycle. When the comparison is true, then the address of the instruction cache memory unit being accessed is correct and the procedure is allowed to continue.
In particular, the next program counter address from the previous clock cycle is now the current program counter field and this current program counter field is incremented to provide the next program counter address field for the next clock cycle. When the comparison is false, the current access to the instruction cache unit is in error and the instruction cache access is canceled. In addition, the actual program counter address becomes the next program counter address, thereby determining the access to the instruction cache unit during the next clock cycle.
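The compare-and-cancel behavior summarized above can be sketched as a cycle-by-cycle model. This is a simplified illustration under stated assumptions, not the claimed apparatus: addresses are small integers incremented by one, the predecode subfield is stood in for by a hypothetical branch_targets mapping from an instruction address to the address that actually follows it, and the access that follows a canceled access is assumed to need no check because its address is already the actual address. All names are hypothetical.

```python
def run_fetch(branch_targets, start_pc, cycles, step=1):
    """Simulate speculative instruction cache accesses for `cycles` clock cycles.

    branch_targets: hypothetical stand-in for the predecode subfield; maps an
        instruction address to its actual next address when that differs from
        the sequential one.
    Returns a list of (cycle, address accessed, "fetched" or "canceled").
    """
    trace = []
    pc = start_pc   # address accessed in the current clock cycle
    pending = None  # address fetched in the previous cycle, not yet decoded
    for cycle in range(1, cycles + 1):
        # Computed during the access, independently of any decoding.
        speculative_npc = pc + step
        if pending is None:
            access_ok = True  # nothing to check (first access or after a cancel)
        else:
            # Decode the previously fetched field to get the actual next
            # address, then compare it with the address being accessed now.
            actual_npc = branch_targets.get(pending, pending + step)
            access_ok = (actual_npc == pc)
        if access_ok:
            trace.append((cycle, pc, "fetched"))
            pending = pc
            pc = speculative_npc
        else:
            trace.append((cycle, pc, "canceled"))
            pending = None
            pc = actual_npc  # the actual address drives the next access
    return trace

if __name__ == "__main__":
    # Hypothetical program: the instruction at address 2 transfers control to 6.
    for entry in run_fetch({2: 6}, start_pc=0, cycles=6):
        print(entry)
```

In this model a sequential instruction stream never loses a cycle, and a change of program flow costs a single canceled access rather than an idle cycle after every instruction.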