Generally, the present invention relates to the design of integrated circuits. More specifically, the present invention relates to a system for the automatic generation of processor datapaths.
It is known in the art that an Instruction Set Architecture (ISA) describes the instructions, operations, register files, encoding/decoding logic, and/or field assignments of a processor that are made visible to the programmer. A processor can include any type of integrated circuit capable of implementing these types of instructions, operations, register files, encoding/decoding logic, and/or field assignments. A processor that implements an ISA must contain hardware logic that implements the behavior of each operation in the ISA. It is known in the art that an efficient processor implementation typically contains one or more datapaths. Each datapath contains hardware logic that implements the behavior of a subset of the ISA operations. Within a datapath, logic blocks that are required to implement the behavior of two or more operations are often shared, so that the logic area of the datapath is reduced.
A configurable processor, such as the Xtensa processor from Tensilica, Inc. of Santa Clara, Calif., for example, allows the designer to extend an existing ISA with new designer-defined operations. For an efficient implementation of the configurable processor, the behavior of each designer-defined operation will typically be implemented by a shared datapath. For the most efficient configurable processor implementation, it may be necessary to implement the behavior of one or more existing ISA operations and one or more designer-defined operations with the same datapath.
Determining the number of datapaths and the logic sharing within those datapaths for a set of operations typically requires that the designer manually perform a number of tasks that are difficult and time-consuming. Also, introducing a new operation into the ISA may require the designer to reconsider the existing datapaths and logic sharing, because the new operation may significantly change the manner in which the datapaths and logic sharing should be implemented to provide the most efficient hardware implementation. Thus, there is a need in the art for a system that, given one or more operations that have separately described behaviors, can automatically create one or more datapaths containing shared logic such that the hardware logic efficiently implements the behavior of those one or more operations.
For example, consider an ISA that contains an addition (ADD), a subtraction (SUB), and a multiplication (MUL) operation. It is known in the art that a processor can implement the behavior of the ADD and SUB operations by sharing a single hardware adder. Thus, one potential implementation of an ISA containing an ADD, SUB, and MUL operation requires the creation of two datapaths: one datapath implements the behavior of the ADD and SUB operations using shared adder logic, and the other datapath implements the behavior of the MUL operation using multiplication logic. It is also known in the art that a processor can implement the behavior of a MUL operation with hardware logic that performs a partial-products calculation followed by adder logic. Thus, another potential implementation of an ISA containing an ADD, SUB, and MUL operation requires the creation of a single datapath that implements the behavior of all three operations. The datapath contains the partial-products logic followed by adder logic. The adder logic is shared by all three operations.
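The single-datapath alternative can be illustrated with a small behavioral model. The following is only a sketch under stated assumptions: a 32-bit word width, a carry-save reduction of the partial products to two operands, and the names `shared_adder`, `partial_products`, `carry_save_reduce`, and `datapath` are all hypothetical and chosen for illustration. It models behavior, not gate-level structure.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit datapath width


def shared_adder(x, y, carry_in=0):
    """The one adder instance used by ADD, SUB, and MUL."""
    return (x + y + carry_in) & MASK


def partial_products(a, b):
    """Shifted copies of a, one per set bit of b (truncated to 32 bits)."""
    return [(a << i) & MASK for i in range(32) if (b >> i) & 1]


def carry_save_reduce(terms):
    """Compress the terms to two values (s, c) with s + c == sum(terms) mod 2**32."""
    s, c = 0, 0
    for t in terms:
        # 3:2 compressor: s + c + t == (s ^ c ^ t) + 2 * majority(s, c, t)
        s, c = s ^ c ^ t, (((s & c) | (s & t) | (c & t)) << 1) & MASK
    return s, c


def datapath(op, a, b):
    """Select the adder operands per operation, then use the shared adder."""
    if op == "ADD":
        x, y, cin = a, b, 0
    elif op == "SUB":
        x, y, cin = a, ~b & MASK, 1  # a - b == a + ~b + 1 in two's complement
    elif op == "MUL":
        x, y = carry_save_reduce(partial_products(a, b))
        cin = 0
    else:
        raise ValueError(f"unknown operation: {op}")
    return shared_adder(x, y, cin)
```

In this sketch, only the operand selection differs between the three operations; every result passes through the same `shared_adder` call, which corresponds to the shared adder logic described above.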
Creating a set of datapaths that efficiently implements the behavior of a set of operations typically requires that the designer manually identify common hardware resources that can be shared. A hardware resource is a block of logic that should be considered for sharing. The set of resources depends on the behavior of the operations. The designer must carefully choose resources to allow for maximum sharing. For example, if the operations are a 32-bit ADD and a 32-bit MUL, the designer could create a resource to represent 32-bit multiplier logic and another resource to represent 32-bit adder logic. With these resources, no hardware would be shared between the operations. On the other hand, the designer could create a resource to represent 32-bit multiply-partial-products logic and another resource to represent 32-bit adder logic. With these resources, the 32-bit adder logic can be shared between the operations, resulting in a more efficient implementation. Thus, to enable automatic generation of efficient processor datapaths, there is a need in the art for a system that can automatically determine the hardware resources required for a set of operation behaviors.
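The effect of the resource choice on sharing can be sketched as follows. This is an illustrative model only: the resource names (`add32`, `mul32`, `mul_pp32`) and the function `shared_resources` are hypothetical, and a resource is treated as shareable simply when more than one operation uses it.

```python
from collections import Counter


def shared_resources(decomposition):
    """Return the resources used by more than one operation.

    decomposition maps an operation name to the list of resources
    chosen to implement its behavior.
    """
    counts = Counter(
        res for resources in decomposition.values() for res in set(resources)
    )
    return sorted(res for res, n in counts.items() if n > 1)


# Coarse choice: the multiplier is a single monolithic resource.
coarse = {"ADD32": ["add32"], "MUL32": ["mul32"]}

# Finer choice: the multiplier is split into partial products plus an adder.
fine = {"ADD32": ["add32"], "MUL32": ["mul_pp32", "add32"]}
```

With the coarse decomposition no resource is common to both operations, while the finer decomposition exposes `add32` as a sharing candidate, matching the two cases described above.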
In a pipelined processor implementation, the pipeline stage to which each hardware resource is assigned influences the amount of logic sharing possible in the datapath. If the designer specifies a long clock period, then more logic can be placed into a single stage, resulting in more potential logic sharing. Assume for the ADD/SUB/MUL example from above that the designer manually specifies a clock period that is long enough to allow each operation's behavior to be implemented in a single pipeline stage. Then, the adder logic used to implement the ADD, SUB, and partial-products add for the MUL can be placed in stage one and shared by all three operations. However, consider the case where the designer chooses a shorter clock period that requires the MUL's partial-products logic to occupy stage one and the MUL's adder logic to occupy stage two. In this case, there are several possible implementations that trade off application performance versus hardware logic area. Two of the typical pipelined processor implementations for this example are described below.
In the first implementation, the MUL's adder logic is shared with the adder logic of the ADD and SUB operations, so that a single datapath implements all three operations. The datapath has a single copy of partial-products logic in stage one and a single copy of adder logic in stage two. Because the shared adder is in stage two, this implementation increases the latency of the ADD and SUB operations by one cycle and so may cause an increase in the number of cycles required to execute an application.
In the second implementation, the latency of the ADD and SUB operations is not increased. Thus, the adder logic of the ADD and SUB operations in stage one cannot be shared with the adder logic of the MUL operation in stage two. This implementation requires two datapaths, one for the ADD and SUB operations and one for the MUL operation. Thus, compared with the first implementation, this implementation requires an additional copy of the adder logic. In exchange for the additional logic, the ADD and SUB operations have a shorter latency than in the first implementation, which can potentially lead to a decrease in the number of cycles required to execute an application.
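The trade-off between the two implementations can be summarized with a small model. It is a sketch under simplifying assumptions that are not part of the description above: a resource placed in a given stage is counted as one hardware copy regardless of how many operations use it, and the latency of an operation is taken to be the last stage it occupies. The names `summarize`, `first`, and `second` are hypothetical.

```python
def summarize(schedule):
    """Return (number of adder copies, per-operation latency in stages).

    schedule maps an operation name to a list of (resource, stage) pairs.
    """
    # One hardware copy per distinct (resource, stage) placement.
    instances = {(res, stage) for uses in schedule.values() for res, stage in uses}
    adder_copies = sum(1 for res, _ in instances if res == "adder")
    latency = {op: max(stage for _, stage in uses) for op, uses in schedule.items()}
    return adder_copies, latency


# First implementation: one datapath, adder shared in stage two.
first = {
    "ADD": [("adder", 2)],
    "SUB": [("adder", 2)],
    "MUL": [("partial_products", 1), ("adder", 2)],
}

# Second implementation: two datapaths, ADD/SUB keep their stage-one adder.
second = {
    "ADD": [("adder", 1)],
    "SUB": [("adder", 1)],
    "MUL": [("partial_products", 1), ("adder", 2)],
}
```

Under these assumptions, `summarize(first)` reports one adder copy with a two-stage latency for every operation, while `summarize(second)` reports two adder copies with a one-stage latency for ADD and SUB.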
As this example shows, there is a need in the art for a system that can automatically assign hardware resources to pipeline stages so that shared datapath logic can be efficiently implemented, while observing designer-specified constraints such as target clock period and operation latency.
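One simple way to picture how the clock period drives stage assignment is a greedy packing of a chain of dependent resources into stages. The sketch below is not the method of any particular system; the function `assign_stages` and the delay values are hypothetical and are expressed in arbitrary time units.

```python
def assign_stages(chain, clock_period):
    """Greedily pack a chain of dependent resources into pipeline stages.

    chain is an ordered list of (resource name, delay) pairs; each resource
    consumes the output of the previous one. Returns a dict mapping each
    resource name to a 1-based stage number.
    """
    stage, used = 1, 0
    stages = {}
    for name, delay in chain:
        if delay > clock_period:
            raise ValueError(f"{name} does not fit in one clock period")
        if used + delay > clock_period:
            # Not enough time left in this stage: start the next one.
            stage += 1
            used = 0
        used += delay
        stages[name] = stage
    return stages


# Hypothetical delays for the MUL behavior.
mul_chain = [("partial_products", 6), ("adder", 4)]
```

With a clock period of 10 units both resources fit in stage one, whereas with a clock period of 7 units the adder is pushed into stage two, which corresponds to the shorter-clock-period case discussed above.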
Typically, when determining the hardware resources for the datapath(s) associated with one or more operations, the designer must manually weigh the timing and area characteristics of the logic represented by each resource. The area characteristics of a resource determine whether it is large enough to consider for sharing. The timing characteristics of a resource determine how sharing it will affect the latency of the operations that use the resource. Thus, there is a need in the art for a system that can automatically determine the timing and area characterization of hardware resources derived from operation behaviors.
Logic synthesis systems, such as those described in “Behavioral Synthesis: Digital System Design Using the Synopsys Behavioral Compiler” by David Knapp, and “The Synthesis Approach to Digital System Design” by P. Michel, U. Lauther, and P. Duzy, can potentially perform resource sharing of blocks of hardware logic. However, these logic synthesis systems do not operate on the behaviors of ISA operations for the specific purpose of producing datapaths in a pipelined processor implementation. Therefore, these systems are unable to exploit information about the processor pipeline context to produce more efficient hardware.
For example, in the context of ISA operation behaviors being implemented in a processor pipeline, the behaviors of an operation that performs addition through an ADD resource and an operation that performs subtraction through a SUBTRACT resource can be implemented through a shared ADD/SUBTRACT resource. Existing logic synthesis systems cannot share hardware resources across operations in this manner automatically because those systems do not exploit the knowledge that in the processor pipeline context the ADD and the SUBTRACT resources are never active in the same cycle.
Similarly, in the context of ISA operation behaviors being implemented in a processor pipeline, the implementation of an operation behavior can be changed by varying the number of pipeline stages required for its implementation or by sharing hardware resources across multiple stages of the implementation. These processor design optimizations alter the latency of the operation and create pipeline hazards that potentially affect the performance of an application using the operation, but do not change the functionality of the operation. Existing logic synthesis systems cannot automatically share resources across stages or automatically vary the number of pipeline stages in this manner because those systems do not exploit knowledge of the processor pipeline context.
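The pipeline hazards mentioned above can be illustrated with a minimal occupancy check. The sketch assumes, purely for illustration, that one operation issues per cycle, that an operation issued in cycle t occupies stage s in cycle t + s - 1, and that there is a single copy of the shared resource. The function `find_conflicts` is hypothetical.

```python
from collections import defaultdict


def find_conflicts(issue_order, resource_stage):
    """Return the cycles in which more than one operation needs the resource.

    issue_order lists the operations issued in cycles 0, 1, 2, ...
    resource_stage maps an operation to the 1-based stage in which it
    uses the single shared resource.
    """
    users = defaultdict(list)
    for issue_cycle, op in enumerate(issue_order):
        users[issue_cycle + resource_stage[op] - 1].append(op)
    return {cycle: ops for cycle, ops in users.items() if len(ops) > 1}


program = ["MUL", "ADD", "SUB"]

# All operations use the adder in stage two: never two users in one cycle.
same_stage = {"ADD": 2, "SUB": 2, "MUL": 2}

# One adder shared across stages: ADD/SUB in stage one, MUL in stage two.
across_stages = {"ADD": 1, "SUB": 1, "MUL": 2}
```

When every operation uses the resource in the same stage, no two operations need it in the same cycle. When the resource is shared across stages, the MUL issued in cycle 0 and the ADD issued in cycle 1 both need the adder in cycle 1, which is the kind of hazard a pipeline would have to resolve, for example by stalling.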
Therefore, to summarize, what is needed in the art is an automated datapath generation flow that allows the designer to produce one or more shared processor datapaths that implement the behaviors of a set of operations, such that designer-specified constraints, for example target clock period and operation latency, are satisfied.