1. Field of the Invention
The present invention is directed to computer processors as well as systems and techniques for developing the same, and is more particularly directed to processors which have features configurable at the option of a user and related development systems and techniques.
2. Background of the Related Art
Prior art processors have generally been fairly rigid objects which are difficult to modify or extend. A limited degree of extensibility to processors and their supporting software tools, including the ability to add register-to-register computational instructions and simple state (but not register files), has been provided by systems such as those described in the above Killian et al. and Wilson et al. applications. This limited extensibility was a significant advance in the state of the art; many applications using these improvements see speedups or efficiency improvements of four times or better.
However, the limitations on extensibility of these prior art systems meant that other applications could not be adequately addressed. In particular, the need to use the existing core register file, with its fixed 32-bit width registers, generally prevents the use of these improvements in applications that require additional precision or replicated functional units where the combined width of the data operands exceeds 32 bits. In addition, the core register file often lacks sufficient read or write ports to implement certain instructions. For these reasons, there is a need in the art to support the addition of new register files that are configurable in width and in number of read and write ports.
With the addition of register files comes the need to transfer data between these files and memory. The core instruction set includes such load and store instructions for the core register file, but additional register files require additional load and store instructions. This is because one of the rationales for extensible register files is to allow them to be sized to required data types and bandwidths. In particular, the width of register file data may be wider than that supported by the rest of the instruction set. Therefore, it is not reasonable to load and store data by transferring the data to the registers provided by the core; it should be possible to load and store values from the new register file directly.
Further, although prior art systems support the addition of processor state, the quantity of that state is typically small. Consequently, there is a need in the art for a larger number of state bits to be easily added to the processor architecture. This state often needs to be context switched by the operating system. Once the quantity of state becomes large, new methods that minimize context switch time are desirable. Such methods have been implemented in prior art processors (e.g., the MIPS R2000 coprocessor enable bits). However, there is a need in the art to extend this further by generating the code sequences and logic automatically from the input specification to support real-time operating systems (RTOSes) and other software which need to know about new state and use it in a timely manner.
Further, prior art processors do not allow for sharing of logic between the core processor implementation and instruction extensions. With load and store instruction extensions, it is important that the data cache be shared between the core and the extensions. This is so that stores by newly-configured instructions are seen by loads by the core and vice versa to ensure cache coherency—separate caches would need special mechanisms to keep them consistent, a possible but undesirable solution. Also, the data cache is one of the larger circuits in the core processor, and sharing it promotes a reduction in the size of the core processor.
The addition of register files also makes it desirable to support allocation of high-level language variables to these registers. Prior art processors use the core register file to which prior art compilers already support allocation of user variables. Thus, compiler allocation is expected and should be supported for user-defined register files. To allocate variables to registers, a compiler supporting user-defined register files requires knowledge of how to spill, restore, and move such registers in order to implement conventional compiler functionality.
A related but more general limitation of prior art processor systems is the level of compiler support therefor. Often instructions are added to a processor to support new data types appropriate to the application (e.g., many DSP applications require processors implementing saturating arithmetic instead of the more conventional two's complement arithmetic usually supported by processors). Prior art systems allow instructions supporting new data types to be added, but it is necessary to map these new instructions to existing language data types when writing high-level language code that uses the extensions. In some cases an appropriate built-in data type may not exist.
For example, consider the saturating arithmetic example. As noted above, many DSP algorithms take advantage of arithmetic that saturates at the minimum value on underflow or maximum value on overflow of the number of bits used instead of wrapping, as in traditional two's complement systems. However, there is no C data type that has these semantics; the C language requires that

        int a;
        int b;
        int c=a+b;

have wrapping semantics. One could write

        int a;
        int b;
        int c=SATADD(a, b);

instead using built-in types with new intrinsic functions, but this is awkward and obscures the algorithm (the writer thinks of the SATADD function simply as +).
On the other hand, adding new data types allows the + operator to function differently with those types. C already applies it to different operations for integer addition and floating-point addition, so the extension is natural. Thus, using new data types, saturating addition might be coded as

        dsp16 a;
        dsp16 b;
        dsp16 c=a+b;

where dsp16 defines a saturating data type. Thus, the last line implies a saturating add because both of its operands are saturating data types.
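The saturating behavior discussed above can be sketched in standard C as follows. This is only an illustration under stated assumptions: the names sat_add16 and sat_add16_check are hypothetical, a 16-bit operand width is assumed from the dsp16 name, and the source does not say how SATADD or dsp16 is actually implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-bit saturating add (name and width are assumptions,
   not taken from the source). */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;  /* widen so the sum cannot overflow */
    if (sum > INT16_MAX) return INT16_MAX;  /* clamp to the maximum on overflow */
    if (sum < INT16_MIN) return INT16_MIN;  /* clamp to the minimum on underflow */
    return (int16_t)sum;
}

/* Small usage check: in-range sums are exact, out-of-range sums clamp. */
static void sat_add16_check(void)
{
    assert(sat_add16(100, 23) == 123);
    assert(sat_add16(30000, 30000) == INT16_MAX);
    assert(sat_add16(-30000, -30000) == INT16_MIN);
}
```

Because standard C has no operator overloading, the call sat_add16(a, b) still has to be written out explicitly here, which is the awkwardness that a compiler-supported saturating data type would remove.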
Most compilers schedule instructions to minimize pipeline stalls. However, with prior art systems there is no way the instruction specification may be used to extend the compiler's scheduling data structures. For example, suppose load instructions are pipelined with a two-cycle latency. If the result of a load is referenced by the instruction immediately following the load, there will be a one-cycle stall because the load is not finished. Thus, the sequence

        load r1, addr1
        store r1, addr2
        load r2, addr3
        store r2, addr4

will have two stall cycles. If the compiler rearranges this to

        load r1, addr1
        load r2, addr3
        store r1, addr2
        store r2, addr4

then the sequence executes with no stall cycles. This is a common optimization technique called instruction scheduling. Prior art instruction scheduling requires tables giving the pipe stages in which instructions use their inputs and produce their outputs, but does not make use of such information for newly-added instructions.
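The stall counts quoted for the two instruction orderings can be reproduced with a small model. The sketch below assumes an in-order pipeline that issues one instruction per cycle and stalls until a source register is ready; the Insn type, count_stalls, and the two arrays are hypothetical names introduced for illustration, and memory addresses are not modeled.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_REGS = 8, NO_REG = -1 };

typedef struct {
    int dest;     /* register written, or NO_REG */
    int src;      /* register read, or NO_REG */
    int latency;  /* cycles from issue until dest can be read */
} Insn;

/* Count stall cycles for an in-order, one-issue-per-cycle pipeline. */
static int count_stalls(const Insn *seq, size_t n)
{
    int ready[NUM_REGS] = {0};  /* earliest cycle each register can be read */
    int cycle = 0;              /* cycle at which the next instruction issues */
    int stalls = 0;

    for (size_t i = 0; i < n; i++) {
        assert(seq[i].src < NUM_REGS && seq[i].dest < NUM_REGS);
        if (seq[i].src != NO_REG && ready[seq[i].src] > cycle) {
            stalls += ready[seq[i].src] - cycle;  /* wait for the producer */
            cycle = ready[seq[i].src];
        }
        if (seq[i].dest != NO_REG)
            ready[seq[i].dest] = cycle + seq[i].latency;
        cycle++;
    }
    return stalls;
}

/* load r1; store r1; load r2; store r2 (loads have two-cycle latency) */
static const Insn naive_order[4] = {
    {1, NO_REG, 2}, {NO_REG, 1, 1}, {2, NO_REG, 2}, {NO_REG, 2, 1}
};

/* load r1; load r2; store r1; store r2 */
static const Insn scheduled_order[4] = {
    {1, NO_REG, 2}, {2, NO_REG, 2}, {NO_REG, 1, 1}, {NO_REG, 2, 1}
};
```

Under these assumptions, count_stalls(naive_order, 4) yields two stall cycles and count_stalls(scheduled_order, 4) yields none, matching the figures in the text. A real scheduler would obtain the latency column from per-instruction pipeline tables rather than from hand-written constants.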
Another limitation of the prior art is that the computation portion of added instructions must be implemented in a single cycle of the pipeline. Some computations, such as multiplication of large operands, have a logic delay longer than the typical RISC pipeline stage. The inclusion of such operations using prior art techniques would require that the processor clock rate be reduced to provide more time in which to complete the computation. It would therefore be desirable to support instructions where the computation is spread out over several pipeline stages. In addition to allowing the computation to be performed over multiple cycles, it could be useful to allow operands to be consumed and produced in different pipeline stages.
For example, a multiply/accumulate operation typically requires two cycles. In the first cycle, the multiplier produces the product in carry-save form; in the second cycle the carry-save product and the accumulator are reduced from three values to two values using a single level of carry-save-add, and then added in a carry-propagate-adder. So, the simplest declaration would be to say that multiply/accumulate instructions take two cycles from any source operand to the destination; however, then it would not be possible to do back-to-back multiply/accumulates into the same accumulator register, since there would be a one-cycle stall because of the two-cycle latency. In reality, however, the logic only requires one cycle from accumulator in to accumulator out, so a better approach is just to provide a more powerful description, such as

        D←A+B*C

being described as taking B and C in stage 1, taking A in stage 2, and producing D in stage 3. Thus, the latency from B or C to D is 3−1=2, and the latency from A to D is 3−2=1.
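The per-operand latency arithmetic above can be written out as a short sketch. The stage numbers come from the text; the function names operand_latency and stall_cycles, and the simple in-order single-issue stall model, are assumptions made for illustration.

```c
#include <assert.h>

/* Stage numbers for D <- A + B*C as given in the text: B and C are read in
   stage 1, A is read in stage 2, and D is produced in stage 3. */
enum { STAGE_USE_B = 1, STAGE_USE_C = 1, STAGE_USE_A = 2, STAGE_DEF_D = 3 };

/* Latency from a source operand to the result: def stage minus use stage. */
static int operand_latency(int def_stage, int use_stage)
{
    assert(def_stage >= use_stage);
    return def_stage - use_stage;
}

/* Stall cycles seen by a dependent instruction issued issue_distance cycles
   after its producer (hypothetical in-order, single-issue model). */
static int stall_cycles(int latency, int issue_distance)
{
    assert(issue_distance >= 1);
    return latency > issue_distance ? latency - issue_distance : 0;
}
```

With these definitions, the latency from B or C to D is 2 and from A to D is 1, so a back-to-back multiply/accumulate into the same accumulator (issue distance 1, latency 1) sees no stall, whereas the uniform two-cycle declaration would cost one stall cycle.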
With the addition of multi-cycle instructions, it also becomes necessary to generate interlock logic appropriate to the target pipeline for the added instructions. Single-cycle instructions need no such logic: with one-instruction-per-cycle issue, no latency-one instruction can produce a result that will cause an interlock on the next cycle, because the next instruction is always delayed by one cycle. In general, if instructions can be issued only every K cycles, the latency of those instructions is L cycles, and L≤K, then those instructions cannot cause interlocks on their destination operand (instructions can still interlock on their source operands if those operands were produced by a two-cycle instruction such as a load). If two-cycle newly-configured instructions are possible, following instructions must be able to interlock on the results of the newly-configured instructions.
Most instruction set architectures have multiple implementations for different processor architectures. Prior art systems combined the specification of the instruction semantics with the implementation logic for instructions. Separating the two would allow one set of reference semantics to be used with multiple implementations. Reference semantics are one component of instruction set documentation. It is traditional to describe instruction semantics in both English and a more precise notation. English is often ambiguous or error-prone but easier to read. Therefore, it provides the introduction, purpose and a loose definition of an instruction. The more formal definition gives a precise understanding of what the instruction does. One of the purposes of the reference semantics is to serve as this precise definition. Other components include the instruction word, assembler syntax, and text description. Prior art systems have sufficient information in the extension language to generate the instruction word and assembler syntax. With the addition of the reference semantics, only the text description is missing, and there is a need to include the specification of instruction descriptions that can be converted to formatted documentation to produce a conventional ISA description book.
Processor development techniques including the above features would render design verification methods of the prior art no longer valid due to their increased flexibility and power. In conjunction with the above features, therefore, there is a need to verify the correctness of many aspects of the generated processor, including:

        the correctness of the input reference instruction semantics;
        the correctness of the input implementation instruction semantics;
        the translation by the compiler of instruction semantics to the application programming language;
        the translation by the instruction semantics compiler to the Hardware Description Language (HDL);
        the translation by the instruction semantics compiler to the instruction set simulator programming language;
        the HDL generated by the instruction semantics compiler for the register files, interlock, bypass, core interface, and exceptions;
        any system function abstraction layers generated during the process, such as the Hardware Abstraction Layer (HAL) code generated by the instruction semantics compiler (see the aforementioned Songer et al. patent application for further details on the HAL); and
        the intrinsic and data type support in the programming language compiler.

The reference semantics are also used in some of the above.
Finally, all of the new hardware functionality must be supported by the instruction set.