A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
This application is related to co-pending U.S. patent application Ser. No. 09/523,877 filed Mar. 13, 2000 and entitled "Method and Apparatus for Jump Delay Slot Control in a Pipelined Processor", U.S. patent application Ser. No. 09/524,179 filed Mar. 13, 2000 and entitled "Method and Apparatus for Processor Pipeline Segmentation and Re-Assembly", U.S. patent application Ser. No. 09/524,178 filed Mar. 13, 2000 and entitled "Method and Apparatus for Loose Register Encoding Within a Pipelined Processor", and U.S. patent application Ser. No. 09/418,663 filed Oct. 14, 1999, entitled "Method and Apparatus for Managing the Configuration and Functionality of a Semiconductor Design".
1. Field of the Invention
The present invention relates to the field of integrated circuit design, specifically to the use of a hardware description language (HDL) for implementing instructions in a pipelined central processing unit (CPU) or user-customizable microprocessor.
2. Description of Related Technology
RISC (or reduced instruction set computer) processors are well known in the computing arts. RISC processors generally have the fundamental characteristic of utilizing a substantially reduced instruction set as compared to non-RISC (commonly known as "CISC") processors. Typically, RISC processor machine instructions are not all micro-coded, but rather may be executed immediately without decoding, thereby affording significant economies in terms of processing speed. This "streamlined" instruction handling capability furthermore allows greater simplicity in the design of the processor (as compared to non-RISC devices), thereby allowing smaller silicon and reduced cost of fabrication.
RISC processors are also typically characterized by (i) load/store memory architecture (i.e., only the load and store instructions have access to memory; other instructions operate via internal registers within the processor); (ii) unity of processor and compiler; and (iii) pipelining.
Pipelining is a technique for increasing the performance of a processor by dividing the sequence of operations within the processor into segments which are effectively executed in parallel when possible. In the typical pipelined processor, the arithmetic units associated with processor arithmetic operations (such as ADD, MULTIPLY, DIVIDE, etc.) are usually "segmented", so that a specific portion of the operation is performed in a given segment of the unit during any clock cycle. FIG. 1 illustrates a typical processor architecture having such segmented arithmetic units. Hence, these units can operate on the results of a different calculation at any given clock cycle. As an example, in the first clock cycle two numbers A and B are fed to the multiplier unit 10 and partially processed by the first segment 12 of the unit. In the second clock cycle, the partial results from multiplying A and B are passed to the second segment 14 while the first segment 12 receives two new numbers (say C and D) to start processing. The net result is that after an initial startup period, one multiplication operation is performed by the arithmetic unit 10 every clock cycle.
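The throughput behavior described above can be illustrated with a minimal Python sketch. It is a behavioral model only, not a representation of the hardware of FIG. 1; the function name `run_segmented_unit`, the two-segment default, and the operand values are assumptions chosen for illustration.

```python
def run_segmented_unit(operand_pairs, num_segments=2):
    """Simulate a segmented arithmetic unit, one clock cycle per loop pass.

    Returns a list of (cycle, product) tuples, where cycle is the 1-based
    clock cycle in which the product leaves the final segment.
    """
    segments = [None] * num_segments   # segments[0] models the first segment
    pending = list(operand_pairs)
    completed = []
    cycle = 0
    while pending or any(s is not None for s in segments):
        cycle += 1
        # Advance every in-flight operation by one segment.
        for i in range(num_segments - 1, 0, -1):
            segments[i] = segments[i - 1]
        # The first segment accepts a new operand pair if one is waiting.
        segments[0] = pending.pop(0) if pending else None
        # An operation occupying the last segment finishes in this cycle.
        if segments[-1] is not None:
            a, b = segments[-1]
            completed.append((cycle, a * b))
            segments[-1] = None
    return completed


if __name__ == "__main__":
    # (A, B), (C, D), (E, F) enter on consecutive clock cycles.
    for cycle, product in run_segmented_unit([(2, 3), (4, 5), (6, 7)]):
        print(f"cycle {cycle}: product {product}")
```

With two segments, the first product emerges in cycle 2 (the startup period), and one further product emerges in each cycle thereafter.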
The depth of the pipeline may vary from one architecture to another. In the present context, the term "depth" refers to the number of discrete stages present in the pipeline. In general, a pipeline with more stages executes programs faster but may be more difficult to program if the pipeline effects are visible to the programmer. Most pipelined processors are either three-stage (instruction fetch, decode, and execute) or four-stage (such as instruction fetch, decode, operand fetch, and execute, or alternatively instruction fetch, decode/operand fetch, execute, and writeback), although more or fewer stages may be used.
When developing the instruction set of a pipelined processor, several different types of "hazards" must be considered. For example, so-called "structural" or "resource contention" hazards arise from overlapping instructions competing for the same resources (such as busses, registers, or other functional units); these hazards are typically resolved using one or more pipeline stalls. So-called "data" pipeline hazards occur in the case of read/write conflicts which may change the order of memory or register accesses. "Control" hazards are generally produced by branches or similar changes in program flow.
Interlocks are generally necessary with pipelined architectures to address many of these hazards. For example, consider the case where a following instruction (n+1) in an earlier pipeline stage needs the result of the instruction n from a later stage. A simple solution to the aforementioned problem is to delay the operand calculation in the instruction decoding phase by one or more clock cycles. A result of such delay, however, is that the execution time of a given instruction on the processor is in part determined by the instructions surrounding it within the pipeline. This complicates optimization of the code for the processor, since it is often difficult for the programmer to spot interlock situations within the code.
"Scoreboarding" may be used in the processor to implement interlocks; in this approach, a bit is attached to each processor register to act as an indicator of the register content; specifically, whether (i) the contents of the register have been updated and are therefore ready for use, or (ii) the contents are undergoing modification such as being written to by another process. This scoreboard is also used in some architectures to generate interlocks which prevent instructions which are dependent upon the contents of the scoreboarded register from executing until the scoreboard indicates that the register is ready. This type of approach is referred to as "hardware" interlocking, since the interlock is invoked purely through examination of the scoreboard via hardware within the processor. Such interlocks generate "stalls" which preclude the data dependent instruction from executing (thereby stalling the pipeline) until the register is ready.
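The stalling behavior of a scoreboard-based hardware interlock can be sketched in Python as follows. This is a simplified software model under stated assumptions, not a description of any particular processor: the function name `issue_with_scoreboard`, the `(dest, sources, latency)` tuple layout, the register names, and the latencies are all hypothetical. The per-register scoreboard bit is modelled as the condition `ready_at[r] > cycle`.

```python
def issue_with_scoreboard(program):
    """Return the 1-based issue cycle of each instruction in program.

    program is a list of (dest, sources, latency) tuples (a hypothetical
    format). Instructions issue in order, at most one per cycle.
    """
    ready_at = {}        # register -> first cycle in which its value is usable
    issue_cycles = []
    cycle = 1
    for dest, sources, latency in program:
        # Hardware interlock: stall while the scoreboard marks any source
        # register as still undergoing modification.
        while any(ready_at.get(reg, 0) > cycle for reg in sources):
            cycle += 1
        issue_cycles.append(cycle)
        # Mark the destination busy until the result has been written.
        ready_at[dest] = cycle + latency
        cycle += 1
    return issue_cycles


if __name__ == "__main__":
    program = [
        ("r1", ["r2", "r3"], 3),   # three-cycle operation writing r1
        ("r4", ["r1"], 1),         # depends on r1, so it must stall
        ("r5", ["r2"], 1),         # independent of r1
    ]
    print(issue_with_scoreboard(program))
```

In this model the second instruction is held until the scoreboard shows `r1` as ready, and every instruction behind it is delayed as well, which is the pipeline stall described above.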
Alternatively, NOPs (no-operation opcodes) may be inserted in the code so as to delay the appropriate pipeline stage when desired. This latter approach has been referred to as "software" interlocking, and has the disadvantage of increasing the code size and complexity of programs that employ instructions that require interlocking. Heavily software interlocked designs also tend not to be fully optimized in terms of their code structures.
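A minimal Python sketch of software interlocking is given below. It pads an instruction sequence with NOPs so that a pipeline having no hardware interlock never reads a register before it has been written. The function name `insert_nops`, the `(mnemonic, dest, sources, latency)` tuple layout, the mnemonics, and the latencies are assumptions for illustration only.

```python
NOP = ("nop", None, (), 1)   # hypothetical encoding of a no-operation


def insert_nops(program):
    """Return a copy of program padded with NOPs.

    program is a list of (mnemonic, dest, sources, latency) tuples. One
    instruction is assumed to issue per cycle, so the position of an
    instruction in the padded list is also its issue cycle.
    """
    ready_at = {}    # register -> first cycle in which its value is usable
    padded = []
    for mnemonic, dest, sources, latency in program:
        slot = len(padded) + 1           # cycle in which this one would issue
        # Emit NOPs until every source operand is available.
        while any(ready_at.get(reg, 0) > slot for reg in sources):
            padded.append(NOP)
            slot += 1
        padded.append((mnemonic, dest, sources, latency))
        if dest is not None:
            ready_at[dest] = slot + latency
    return padded


if __name__ == "__main__":
    program = [
        ("mul", "r1", ("r2", "r3"), 3),
        ("add", "r4", ("r1", "r5"), 1),   # needs the result of mul
    ]
    print([instr[0] for instr in insert_nops(program)])
```

For this input the two-instruction sequence grows to four instructions (`mul`, `nop`, `nop`, `add`), which illustrates the increase in code size noted above.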
Another important consideration in processor design is program branching or "jumps". All processors support some type of branching instructions. Simply stated, branching refers to the condition where program flow is interrupted or altered. Other operations such as loop setup and subroutine call instructions also interrupt or alter program flow in a similar fashion. The term "jump delay slot" is often used to refer to the slot within a pipeline subsequent to a branching or jump instruction being decoded. The instruction after the branch (or load) is executed while awaiting completion of the branch/load instruction. Branching may be conditional (i.e., based on the truth or value of one or more parameters) or unconditional. It may also be absolute (e.g., based on an absolute memory address), or relative (e.g., based on relative addresses and independent of any particular memory address).
Branching can have a profound effect on pipelined systems. By the time a branch instruction is inserted and decoded by the processor's instruction decode stage (indicating that the processor must begin executing a different address), the next instruction word in the instruction sequence has been fetched and inserted into the pipeline. One solution to this problem is to purge the fetched instruction word and halt or stall further fetch operations until the branch instruction has been executed, as illustrated in FIG. 2. This approach, however, by necessity results in the execution of the branch instruction in several instruction cycles, this number typically being between one and the depth of the pipeline employed in the processor design. This result is deleterious to processor speed and efficiency, since other operations cannot be conducted by the processor during this period.
Alternatively, a delayed branch approach may be employed. In this approach, the pipeline is not purged when a branch instruction reaches the decode stage, but rather subsequent instructions present in the earlier stages of the pipeline are executed normally before the branch is executed. Hence, the branch appears to be delayed by the number of instruction cycles necessary to execute all subsequent instructions in the pipeline at the time the branch instruction is decoded. This approach increases the efficiency of the pipeline as compared to the multi-cycle branching described above, yet it also increases the complexity of the underlying code and makes that code more difficult for the programmer to understand.
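The programmer-visible effect of a delayed branch can be sketched in Python as follows. The model records only the order in which instructions execute; it does not model pipeline stages. The function name `run_delayed_branch`, the `("op", name)` and `("jmp", target_index)` tuple formats, and the assumption that no branch is placed in a delay slot are all hypothetical simplifications.

```python
def run_delayed_branch(program, delay_slots=1, max_steps=100):
    """Return the names of the ordinary instructions in execution order.

    program is a list of ("op", name) or ("jmp", target_index) tuples.
    After a jump is decoded, the next delay_slots instructions still
    execute before control transfers to target_index.
    """
    executed = []
    pc = 0
    pending_target = None     # target of a jump awaiting its delay slots
    slots_left = 0
    steps = 0
    while pc < len(program) and steps < max_steps:   # max_steps guards loops
        steps += 1
        kind, arg = program[pc]
        pc += 1
        if kind == "jmp":
            pending_target, slots_left = arg, delay_slots
        else:
            executed.append(arg)
            if pending_target is not None:
                slots_left -= 1          # one delay slot has been consumed
        if pending_target is not None and slots_left == 0:
            pc, pending_target = pending_target, None
    return executed


if __name__ == "__main__":
    program = [
        ("op", "a"),
        ("jmp", 4),
        ("op", "b"),    # delay slot: executes even though the jump is taken
        ("op", "c"),    # skipped by the jump
        ("op", "d"),
    ]
    print(run_delayed_branch(program, delay_slots=1))
    print(run_delayed_branch(program, delay_slots=0))
```

With one delay slot, instruction `b` executes after the jump has been decoded; with no delay slot it is skipped. The programmer must account for this difference when ordering code around a branch.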
Based on the foregoing, processor designers and programmers must carefully weigh the tradeoffs associated with utilizing hardware or software interlocks as opposed to a non-interlock architecture. Furthermore, the interaction of branching instructions (and delayed or multi-cycle branching) in the instruction set with the selected interlock scheme must be considered. What is needed is an improved approach to pipeline interlocking which optimizes processor pipeline performance and provides attributes of both hardware and software interlocks, while providing the programmer with additional flexibility of coding. Furthermore, as more pipeline stages (and even multiple multi-stage pipelines) are added to processor designs, the benefits of enhanced interlock performance and code optimization within the processor increase manifold. Additionally, the ability to readily synthesize such improved pipelined processor designs in an application-specific manner, and using available synthesis tools, is of significant utility to the designer and programmer.
The present invention satisfies the aforementioned needs by providing an improved method and apparatus for executing instructions within a pipelined processor architecture.
In a first aspect of the invention, an improved method of controlling jumping within the CPU is disclosed. In a first embodiment, a pipeline interlock mode is provided whereby the relationship between an instruction which sets branch flags and a subsequent branch is detected; such instructions immediately preceding the branch within a predetermined increment are prevented from affecting the branch, thereby permitting a flag-setting instruction to be scheduled into the slot immediately preceding the branch. A NOP (no-operation) may be used to occupy the slot preceding the branch. In a second embodiment, the NOP is obviated through a mode having a code structure which requires that a branch at an earlier stage of the pipeline be delayed until the flag-setting instruction has moved out of a later stage of the pipeline, and the flags have been set.
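The timing relationship underlying the second embodiment can be illustrated with a simplified Python model. The sketch is not the disclosed implementation; it assumes a pipeline in which a branch is resolved in the decode stage, flags are written in the execute stage one cycle later, and the flags become valid once the flag-setting instruction has left execute. The function name `schedule_decode`, the `(name, sets_flags, is_branch)` tuple layout, and the mnemonics are hypothetical.

```python
def schedule_decode(program):
    """Return (name, cycle) pairs giving the cycle in which each
    instruction completes the decode stage.

    program is a list of (name, sets_flags, is_branch) tuples.
    """
    schedule = []
    flags_ready = 0      # first cycle in which the flags are valid
    cycle = 0
    for name, sets_flags, is_branch in program:
        cycle += 1                        # next available decode cycle
        if is_branch and cycle < flags_ready:
            # Interlock: hold the branch in decode until the flag-setting
            # instruction has moved out of the execute stage.
            cycle = flags_ready
        schedule.append((name, cycle))
        if sets_flags:
            # In execute during cycle + 1, so flags are valid at cycle + 2.
            flags_ready = cycle + 2
    return schedule


if __name__ == "__main__":
    adjacent = [("sub.f", True, False), ("bne", False, True)]
    separated = [("sub.f", True, False), ("mov", False, False),
                 ("bne", False, True)]
    print(schedule_decode(adjacent))
    print(schedule_decode(separated))
```

In this model a branch immediately following the flag-setting instruction is held for one cycle without a NOP appearing in the code, while a branch separated from the flag-setting instruction by one other instruction proceeds without delay.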
In a second aspect of the invention, an improved method of synthesizing the design of an integrated circuit incorporating the aforementioned jump delay slot method is disclosed. In one exemplary embodiment, the method comprises obtaining user input regarding the design configuration; creating customized HDL functional blocks based on the user's input and existing library of functions; determining the design hierarchy based on the user's input and the library and generating a hierarchy file, new library file, and makefile; running the makefile to create the structural HDL and scripts; running the generated scripts to create a makefile for the simulator and a synthesis script; and synthesizing the design based on the generated design and synthesis script.
In a third aspect of the invention, an improved computer program useful for synthesizing processor designs and embodying the aforementioned method is disclosed. In one exemplary embodiment, the computer program comprises an object code representation stored on the magnetic storage device of a microcomputer, and adapted to run on the central processing unit thereof. The computer program further comprises an interactive, menu-driven graphical user interface (GUI), thereby facilitating ease of use.
In a fourth aspect of the invention, an improved apparatus for running the aforementioned computer program used for synthesizing logic associated with pipelined processors is disclosed. In one exemplary embodiment, the system comprises a stand-alone microcomputer system having a display, central processing unit, data storage device(s), and input device.
In a fifth aspect of the invention, an improved processor architecture utilizing the foregoing jump interlock methodology and constrained/unconstrained synthesized logic is disclosed. In one exemplary embodiment, the processor comprises a reduced instruction set computer (RISC) having an at least three stage pipeline comprising instruction fetch, decode, and execute stages which are controlled in part by the aforementioned jump interlock methodology.