The present invention relates to processor performance generally and to the instruction set architecture (ISA) of such processors, in particular.
A chip that includes either a general purpose processor or a Digital Signal Processor (DSP) contains other elements besides the processor. Among the most significant of these, in terms of chip area, are the memories. These memories may be program memories, data memories, RAM, ROM or any other type of memory. The die size of the chip influences its cost and should therefore be minimized. Thus, any reduction in memory size will reduce the cost of the chip.
In a chip using any kind of processor, there is usually a program or code memory as part of the chip die. The program includes the instructions required to be fetched by the processor. The nature of the program, meaning the encoding of the instructions itself, is defined as the instruction set architecture (ISA) of the processor.
One common way to reduce the area of the program memory is in the encoding of the instructions. If each instruction can be made to consume fewer bits, the program memory will be smaller. For example, a processor whose instructions are encoded in a 32-bit word is likely to consume more program space than another processor that uses 16-bit instruction words.
Pipelines are extensively used in processor architectures. A pipeline comprises the stages that the processor performs in order to complete an instruction, each stage taking one cycle to complete.
Reference is now made to FIG. 1 which illustrates an exemplary instruction pipeline which will be used for reference purposes throughout this patent application. The exemplary pipeline comprises six stages which each take up one clock cycle of the processor and add up to a total execution of an instruction. The exemplary pipeline comprises the following stages: Instruction Fetch (IF), Instruction Decode (ID), Address Generation (AG), Operand Fetch (OF), Execute (EX) and Flags and Conditions (FC). The architecture of the processor and its state machine together determine the pipeline and its depth.
Reference is now made to FIGS. 2A and 2B which illustrate a set of instructions which utilize the pipeline of FIG. 1. The instructions are denoted I1, I2, I3, I4 in FIG. 2A and denoted I1, I2, I3, I10 in FIG. 2B. Each instruction goes through the stages of the pipeline where each stage uses one clock cycle, as described hereinabove. As far as the flow of the code is concerned there are two types of instruction: sequential (seq) and non-sequential (non-seq). Examples of sequential instructions are “add” and “subtract”. Non-sequential instructions are instructions which break the pipeline. The major reason for using them is to branch to a different memory location than the following one. The reason that this might be necessary is, for example, to perform a loop which repeats itself, to determine a condition and decide what instruction to take next or to accept an interrupt or any other non-sequential event.
Thus in FIGS. 2A and 2B, instructions I1 and I2 are sequential instructions whereas instruction I3 is a non-sequential or branch instruction whose target, should a condition be met, is instruction I10 (FIG. 2B). If the condition is not met, the pipeline proceeds with the next sequential instruction, I4, which is shown in FIG. 2A.
The non-sequential instructions usually involve a penalty for breaking the pipeline. There are two main reasons for this penalty. The first is that the non-sequential instruction has to be decoded before the target instruction can be fetched. Thus, the execution of the target instruction cannot begin until after the decoding stage of the non-sequential instruction has been completed. Sometimes the address of the target instruction has to be calculated prior to fetching it, further delaying the beginning of execution of the target instruction.
Another penalty arises when the non-sequential instruction is conditional. In this case, the execution of the non-sequential instruction has to be halted until the condition is checked and a true/false indication is ready.
FIG. 2B shows the flow of a taken branch (i.e. a conditional branch in which the condition was met) in the pipeline where the target instruction is I10. In this case, the penalty of the branch instruction is 4 cycles because, as was mentioned above, the true/false condition is only known at a late stage. The true/false condition is known at the EX (or execute) stage of the I3 branch instruction and therefore the IF (instruction fetch) stage of target instruction I10 must wait for four cycles (cycles 4,5,6,7) in order to start. Thus, the branch instruction takes 5 cycles to execute.
When the condition is found to be false and the branch is not taken, this branch instruction will only take four cycles, as shown in FIG. 2A. This is illustrated by the I4 pipeline, I4 being the instruction following the branch instruction I3, as opposed to the target instruction I10 (FIG. 2B). This causes a penalty of three cycles (cycles 4, 5 and 6) over a single cycle instruction. The lower execution time in this case is due to the early instruction fetch mechanism, which starts fetching the next sequential instruction I4 in the cycle before the condition is known (cycle 7). The pipeline of this instruction is then halted if the assumed condition does not, in fact, occur.
The branch instruction may be, for example, the machine code “branch, new, neq”, where “branch” indicates a branch instruction, “new” indicates the target instruction and “neq” is the condition for taking this branch. “new” and “neq” each constitute a field in the branch instruction.
The penalty of a certain instruction is calculated by counting the number of cycles between its ID stage and the ID stage of the next instruction. In the case of a single-cycle instruction, there is no penalty since the ID stage of the next instruction immediately follows. Since in a non-sequential instruction this is not the case, the intermediate cycles are termed “wasted cycles”.
For a branch instruction, at least three cycles are wasted (cycles 4, 5, 6 in FIGS. 2A and 2B marked with “wasted IF”), irrespective of whether or not the branch is taken.
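The penalty definition above can be checked against the cycle numbers quoted for FIGS. 2A and 2B. The following is a minimal sketch only, not part of the described processor. The helper names `stage_cycle` and `penalty` are illustrative, and it is assumed that I1 is fetched in cycle 1, so that the branch I3 is fetched in cycle 3.

```python
# Pipeline stages of FIG. 1, in order; each stage takes one clock cycle.
STAGES = ["IF", "ID", "AG", "OF", "EX", "FC"]


def stage_cycle(fetch_cycle, stage):
    """Cycle in which `stage` runs for an instruction fetched in `fetch_cycle`."""
    return fetch_cycle + STAGES.index(stage)


def penalty(id_cycle, next_id_cycle):
    """Number of cycles strictly between two consecutive ID stages."""
    return next_id_cycle - id_cycle - 1


# Assumption: I1 is fetched in cycle 1, so the branch I3 is fetched in cycle 3.
BRANCH_FETCH = 3
branch_id = stage_cycle(BRANCH_FETCH, "ID")  # cycle 4
branch_ex = stage_cycle(BRANCH_FETCH, "EX")  # cycle 7, condition becomes known

# Not taken: I4 is fetched early in cycle 7, after wasted cycles 4, 5 and 6.
not_taken_penalty = penalty(branch_id, stage_cycle(7, "ID"))

# Taken: the fetch of I10 waits through cycles 4 to 7 and starts in cycle 8.
taken_penalty = penalty(branch_id, stage_cycle(branch_ex + 1, "ID"))

# Single-cycle instruction: the next ID stage follows immediately.
sequential_penalty = penalty(branch_id, branch_id + 1)

print(not_taken_penalty, taken_penalty, sequential_penalty)
```

Under these assumptions the sketch reproduces the figures given in the text: three wasted cycles when the branch is not taken, four when it is taken, and none for a single-cycle instruction.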
Reference is now made to FIGS. 3A and 3B which illustrate a common way to eliminate these wasted cycles by using delay-slot instructions. FIG. 3A illustrates the case where the branch instruction is not taken and FIG. 3B where the branch instruction is taken. Similar items to previous figures have similar reference numerals and will not be described further. Delay slot instructions are instructions which use the wasted cycles. They appear after the branch instruction, but will be executed before the branch instruction is executed. Any sequential instruction can be used as a delay slot instruction.
Thus, instructions DS1, DS2 and DS3 demonstrate three delay slot instructions added to the pipeline of FIGS. 2A and 2B. These delay slot instructions are executed in the spaces corresponding to the “wasted IF” cycle (cycles 4, 5, 6). The following code (written in machine code) demonstrates the concept of these instructions:
Using the suggested pipeline of FIG. 1 and the pipeline flow of the branch instructions, a maximum of three delay slot instructions can be supported. These will fit into the three wasted cycles (“wasted IF”) in the branch instruction as mentioned above in respect of FIGS. 2A and 2B. The three instructions marked with a → are the three delay slot instructions. The nop instruction stands for a “No Operation” which means that this instruction does nothing in that delay slot. The programmer must use a nop instruction when he cannot utilize the third delay slot as in the present example. In general, a nop instruction must be used to fill delay slots that cannot be utilized. Instruction DS3 shown on FIGS. 3A and 3B is the third delay slot instruction, which will be a nop instruction in this case.
Even though the delay-slot instructions are executed prior to the branch instruction, they do not affect the decision whether to take the branch or not. Only instructions that appear before the branch instruction may affect this decision (in the present example, the “comp” instruction and any other instructions that may appear before it).
Utilizing this method, the penalty for a branch instruction may reduce to 0 if all delay-slot instructions are used and the branch is not taken. This situation is shown in FIG. 3A which illustrates the scenario when the branch is not taken. FIG. 3B illustrates the case when the branch is taken. It should be noted that when the branch is taken, as shown by the target instruction TI (I10), there is still a “wasted IF” even if the available delay slots have been used.
The following table summarizes the possible penalties of a branch instruction, utilizing delay-slots:
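The penalties discussed above can also be tabulated programmatically. The sketch below is an illustration under stated assumptions. The end points come from the text (three or four cycles when no delay slot does useful work, zero or one cycle when all three do). The intermediate rows assume that each usefully filled delay slot recovers exactly one cycle. The function names are hypothetical.

```python
MAX_DELAY_SLOTS = 3         # wasted IF cycles 4, 5 and 6 in the pipeline of FIG. 1
BASE_PENALTY_NOT_TAKEN = 3  # cycles lost when the branch is not taken
BASE_PENALTY_TAKEN = 4      # cycles lost when the branch is taken


def branch_penalty(useful_slots, taken):
    """Penalty in cycles when `useful_slots` delay slots hold non-nop instructions.

    Assumption: each usefully filled slot recovers exactly one wasted cycle.
    """
    if not 0 <= useful_slots <= MAX_DELAY_SLOTS:
        raise ValueError("useful_slots must be between 0 and 3")
    base = BASE_PENALTY_TAKEN if taken else BASE_PENALTY_NOT_TAKEN
    return base - useful_slots


def penalty_table():
    """Rows of (useful delay slots, penalty if taken, penalty if not taken)."""
    return [
        (m, branch_penalty(m, True), branch_penalty(m, False))
        for m in range(MAX_DELAY_SLOTS + 1)
    ]


print("slots taken not-taken")
for row in penalty_table():
    print("{:>5} {:>5} {:>9}".format(*row))
```

Delay slots filled with nop instructions count as unused here, since they occupy the cycle without doing work.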
With reference to the processor architecture, reference is now made to FIG. 4 which illustrates the operations of the state machine of the processor when performing the pipelines of FIGS. 3A and 3B. The state machine manages the ID stage of different instructions while counting down the number of cycles left for the particular instruction currently being executed. Hence, for single cycle instructions there are no “cycles left” between the ID stages of consecutive instructions. However, for branch instructions, the number of “cycles left” corresponds to the delay between the ID stage of the branch instruction I3 (FIGS. 2A and 2B) and the ID stage of the target instruction (TI), I10 (FIG. 2B).
State ‘0’ is the ID stage in which all sequential instructions are executed. Since this state machine does not implement any other multi-cycle instructions except for the branch instruction, the transition from state ‘0’ to state ‘4’ (ID stage of the first delay slot) is enabled only due to a branch instruction as shown in FIGS. 3A-3B. State ‘3’ is the ID stage of the second delay slot (DS2 in FIGS. 3A-3B). During state ‘2’ (EX stage of the branch pipeline, I3, and ID stage of the third delay slot DS3), the true/false indication is ready, hence, a decision is made whether to take the branch (proceed to state ‘1’ which is the FC stage of the branch pipeline, I3 (FIG. 3B)) or not (proceed to state ‘0’, the ID stage of the next sequential instruction, I4 (FIG. 3A)).
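The transitions just described can be sketched as a small transition function. This is an illustrative model only and not the circuit of FIG. 4. The function names are hypothetical, and the transition from state 1 back to state 0 (taken here to be the ID stage of the target instruction) is an assumption, since the text does not state it explicitly.

```python
def next_state(state, is_branch=False, condition_true=False):
    """Return the next state of the branch state machine (illustrative sketch)."""
    if state == 0:
        # ID stage of any instruction; only a branch leaves state 0.
        return 4 if is_branch else 0
    if state == 4:  # ID stage of the first delay slot (DS1)
        return 3
    if state == 3:  # ID stage of the second delay slot (DS2)
        return 2
    if state == 2:  # EX of the branch, ID of DS3: the condition is now known
        return 1 if condition_true else 0
    if state == 1:  # FC stage of the taken branch
        return 0    # assumption: the ID stage of the target is state 0
    raise ValueError("unknown state: {}".format(state))


def run_branch(condition_true):
    """States visited from decoding a branch until the machine returns to state 0."""
    visited = [0]
    state = next_state(0, is_branch=True)
    while state != 0:
        visited.append(state)
        state = next_state(state, condition_true=condition_true)
    visited.append(0)
    return visited


print(run_branch(condition_true=True))
print(run_branch(condition_true=False))
```

In this model a taken branch passes through one more state than a branch that is not taken, which corresponds to the remaining “wasted IF” of FIG. 3B.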
The more delay slots a programmer fills with useful instructions (rather than with nop instructions), the smaller the penalty of the non-sequential instruction. However, fully utilizing the delay slots is a very complicated task which requires careful programming. As described hereinabove, it is sometimes impossible to utilize all the delay slots available (hence the nop instruction above) due to a number of factors, including inter-dependencies between instructions. Clearly, when not all delay slots can be utilized, the cost of the non-sequential instruction is an increase in code size, since the programmer is obliged to write nop instructions in the delay slots that cannot be utilized.
These nop instructions are part of the code, hence they are encoded into the program memory and consume code size.
An object of the present invention is to reduce the size of code used in a processor.
A further object of the present invention is to provide means for indicating in a condensed instruction the number of delay slots used.
There is thus provided in accordance with a preferred embodiment of the present invention a method for reducing the size of code for processors. The method is made up of a step which is to provide each non-sequential instruction i with an option defining a number M of delay slots to be used out of Ni delay slots available for that instruction. M can vary from 0 to Ni.
There is further provided a method for executing non-sequential instructions. The method is made up of receiving a field from a non-sequential instruction and executing the non-sequential instruction with delay slot instructions and no operation instructions. The field contains the number M of delay slots to be utilized out of N delay slots available for the instruction. While executing the non-sequential instruction M delay slot instructions and (N−M) no operation instructions are also executed.
Further, there is provided a non-sequential instruction having a plurality of fields one of which is a delay slot field. The delay slot field indicates the number, M out of N available delay slots to be utilized by a state machine for the instruction when the instruction is executed. Furthermore, the state machine performs a no operation instruction for the (N−M) non-utilized delay slots.
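As a rough illustration of this method, the sketch below expands one non-sequential instruction into the N operations issued in its delay slots: M instructions taken from the code that follows the instruction, then (N−M) no operations that are generated rather than stored. It is a sketch under assumptions and not the claimed implementation. The function names and the mnemonics in the usage lines are hypothetical, and the saving assumes that each stored nop would occupy one instruction word.

```python
NOP = "nop"


def expand_delay_slots(following, m, n):
    """Return the N operations issued in the delay slots of one instruction.

    following: instructions that appear after the non-sequential instruction
    m:         value of the delay slot field (slots actually utilized)
    n:         delay slots available for this instruction
    """
    if not 0 <= m <= n:
        raise ValueError("M must be between 0 and N")
    if len(following) < m:
        raise ValueError("fewer than M instructions follow the instruction")
    # M real delay slot instructions, then (N - M) generated no operations.
    return list(following[:m]) + [NOP] * (n - m)


def words_saved(m, n):
    """Instruction words no longer stored, assuming one word per nop."""
    if not 0 <= m <= n:
        raise ValueError("M must be between 0 and N")
    return n - m


# Hypothetical mnemonics: two useful delay slot instructions out of three.
print(expand_delay_slots(["move", "add", "sub"], m=2, n=3))
print(words_saved(m=2, n=3))
```

With M = 2 and N = 3, only the two useful instructions need to be present in program memory; the third slot is filled at execution time.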
There is further provided a state machine for executing sequential and non-sequential instructions. The non-sequential instructions have delay slots associated with them. The state machine is made up of a number of nodes representing states and a number of arcs connecting the nodes. The arcs and nodes are connected to form a first path and a second path. The first path represents the path where no delay slots are used and the second path represents the path where all available delay slots are used. Some of the arcs connect between the first and second paths.
Furthermore, there is provided a state machine for executing sequential and non-sequential instructions. The non-sequential instructions have delay slots associated with them. The state machine is made up of a delay slot path and a no operation path, both made up of nodes with arcs connecting between them. There are arcs between the nodes of the delay slot path and the nodes of the no operation path. The number of nodes in the no operation path is equivalent to the number of available delay slots. The path taken for a specific instruction along the delay slot path, the no operation path and the arcs depends on the number of delay slots which the specific instruction utilizes. Further, the no operation path ends at a decision node.