1. Field of the Invention
The present invention relates to the field of integrated circuit design, specifically to the use of a hardware description language (HDL) for implementing instructions in a pipelined central processing unit (CPU) or user-customizable microprocessor.
2. Description of Related Technology
RISC (or reduced instruction set computer) processors are well known in the computing arts. RISC processors generally have the fundamental characteristic of utilizing a substantially reduced instruction set as compared to non-RISC (commonly known as “CISC”) processors. Typically, RISC processor machine instructions are not all micro-coded, but rather may be executed immediately without decoding, thereby affording significant economies in terms of processing speed. This “streamlined” instruction handling capability furthermore allows greater simplicity in the design of the processor (as compared to non-RISC devices), thereby allowing smaller silicon and reduced cost of fabrication.
RISC processors are also typically characterized by (i) load/store memory architecture (i.e., only the load and store instructions have access to memory; other instructions operate via internal registers within the processor); (ii) unity of processor and compiler; and (iii) pipelining.
Pipelining is a technique for increasing the performance of processor by dividing the sequence of operations within the processor into segments which are effectively executed in parallel when possible. In the typical pipelined processor, the arithmetic units associated with processor arithmetic operations (such as ADD, MULTIPLY, DIVIDE, etc.) are usually “segmented”, so that a specific portion of the operation is performed in a given segment of the unit during any clock cycle. FIG. 1 illustrates a typical processor architecture having such segmented arithmetic units. Hence, these units can operate on the results of a different calculation at any given clock cycle. As an example, in the first clock cycle two numbers A and B are fed to the multiplier unit 10 and partially processed by the first segment 12 of the unit. In the second clock cycle, the partial results from multiplying A and B are passed to the second segment 14 while the first segment 12 receives two new numbers (say C and D) to start processing. The net result is that after an initial startup period, one multiplication operation is performed by the arithmetic unit 10 every clock cycle.
The depth of the pipeline may vary from one architecture to another. In the present context, the term “depth” refers to the number of discrete stages present in the pipeline. In general, a pipeline with more stages executes programs faster but may be more difficult to program if the pipeline effects are visible to the programmer. Most pipelined processors are either three stage (instruction fetch, decode, and execute) or four stages (such as instruction fetch, decode, operand fetch, and execute, or alternatively instruction fetch, decode/operand fetch, execute, and writeback), although more or less stages may be used.
When developing the instruction set of a pipelined processor, several different types of “hazards” must be considered. For example, so called “structural” or “resource contention” hazards arise from overlapping instructions competing for the same resources (such as busses, registers, or other functional units) which are typically resolved using one or more pipeline stalls. So-called “data” pipeline hazards occur in the case of read/write conflicts which may change the order of memory or register accesses. “Control” hazards are generally produced by branches or similar changes in program flow.
Interlocks are generally necessary with pipelined architectures to address many of these hazards. For example, consider the case where a following instruction (n+1) in an earlier pipeline stage needs the result of the instruction n from a later stage. A simple solution to the aforementioned problem is to delay the operand calculation in the instruction decoding phase by one or more clock cycles. A result of such delay, however is that the execution time of a given instruction on the processor is in part determined by the instructions surrounding it within the pipeline. This complicates optimization of the code for the processor, since it is often difficult for the programmer to spot interlock situations within the code.
“Scoreboarding” may be used in the processor to implement interlocks; in this approach, a bit is attached to each processor register to act as an indicator of the register content; specifically, whether (i) the contents of the register have been updated and are therefore ready for use, or (ii) the contents are undergoing modification such as being written to by another process. This scoreboard is also used in some architectures to generate interlocks which prevent instructions which are dependent upon the contents of the scoreboarded register from executing until the scoreboard indicates that the register is ready. This type of approach is referred to as “hardware” interlocking, since the interlock is invoked purely through examination of the scoreboard via hardware within the processor. Such interlocks generate “stalls” which preclude the data dependent instruction from executing (thereby stalling the pipeline) until the register is ready.
Alternatively, NOPs (no-operation opcodes) may be inserted in the code so as to delay the appropriate pipeline stage when desired. This later approach has been referred to as “software” interlocking, and has the disadvantage of increasing the code size and complexity of programs that employ instructions that require interlocking. Heavily software interlocked designs also tend not to be fully optimized in terms of their code structures.
Another important consideration in processor design is program branching or “jumps”. All processors support some type of branching instructions. Simply stated, branching refers to the condition where program flow is interrupted or altered. Other operations such as loop setup and subroutine call instructions also interrupt or alter program flow in a similar fashion. The term “jump delay slot” is often used to refer to the slot within a pipeline subsequent to a branching or jump instruction being decoded. The instruction after the branch (or load) is executed while awaiting completion of the branch/load instruction. Branching may be conditional (i.e., based on the truth or value of one or more parameters) or unconditional. It may also be absolute (e.g., based on an absolute memory address), or relative (e.g., based on relative addresses and independent of any particular memory address).
Branching can have a profound effect on pipelined systems. By the time a branch instruction is inserted and decoded by the processor's instruction decode stage (indicating that the processor must begin executing a different address), the next instruction word in the instruction sequence has been fetched and inserted into the pipeline. One solution to this problem is to purge the fetched instruction word and halt or stall further fetch operations until the branch instruction has been executed, as illustrated in FIG. 2. This approach, however, by necessity results in the execution of the branch instruction in several instruction cycles, this number typically being between one and the depth of the pipeline employed in the processor design. This result is deleterious to processor speed and efficiency, since other operations can not be conducted by the processor during this period.
Alternatively, a delayed branch approach may be employed. In this approach, the pipeline is not purged when a branch instruction reaches the decode stage, but rather subsequent instructions present in the earlier stages of the pipeline are executed normally before the branch is executed. Hence, the branch appears to be delayed by the number of instruction cycles necessary to execute all subsequent instructions in the pipeline at the time the branch instruction is decoded. This approach increases the efficiency of the pipeline as compared to multi-cycle branching described above, yet also complexity (and ease of understanding by the programmer) of the underlying code.
Based on the foregoing, processor designers and programmers must carefully weigh the tradeoffs associated with utilizing hardware or software interlocks as opposed to a non-interlock architecture. Furthermore, the interaction of branching instructions (and delayed or multi-cycle branching) in the instruction set with the selected interlock scheme must be considered. What is needed is an improved approach to pipeline interlocking which optimizes processor pipeline performance and provides attributes of both hardware and software interlocks, while providing the programmer with additional flexibility of coding. Furthermore, as more pipeline stages (and even multiple multi-stage pipelines) are added to processor designs, the benefits of enhanced interlock performance and code optimization within the processor increase manifold. Additionally, the ability to readily synthesize such improved pipelined processor designs in an application-specific manner, and using available synthesis tools, is of significant utility to the designer and programmer.