1. Field of the Invention
The present invention is directed toward digital computing systems. More particularly, it is directed to the automatic specification of programmer-visible features in digital computing systems.
2. Background of the Related Art
An Instruction Set Architecture (ISA) describes the instructions, operations, and register files of a processor that are made visible to the programmer (Hennessy et al., Computer Architecture, A Quantitative Approach, 2nd ed., Ch. 2). The ISA for a general purpose processor is designed to provide a reasonable level of performance for a wide variety of applications. In contrast, the ISA for an embedded or application-specific processor is designed for a specific set of applications with the goal of improving the performance of those applications as much as possible, or to meet a minimum performance requirement. For example, the performance of an application that decodes a video stream must satisfy the minimum requirement that the video is decoded in real time. Thus, the ISA designed for a processor that executes the video stream decoding application must provide the minimum level of performance required by the application, without regard to the level of performance the ISA provides for other applications.
At the same time, it is desirable to minimize the cost of the processor hardware required to implement the ISA. Thus, the ISA designer must balance required application performance against the processor hardware cost to implement the ISA.
When designing an ISA, the designer can exploit several techniques, each of which has different tradeoffs between potential performance improvement and hardware cost. These techniques include Very Long Instruction Word (VLIW) (Hennessy et al., Computer Architecture, A Quantitative Approach, 2nd ed., Section 4.4, pp. 284-285), Vector Operations (Hennessy et al., Computer Architecture, A Quantitative Approach, 2nd ed., Appendix B), Fused Operations and Specialization.
A. VLIW
The VLIW technique allows a single instruction to contain multiple independent operations. A VLIW instruction is partitioned into a number of slots, and each slot may contain one operation. Describing a VLIW instruction simply requires specifying which operations can occur in each slot. For example, a two-slot VLIW instruction could allow a load operation in the first slot, and a multiply operation in the second slot.
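The slot description above can be sketched as a small data structure. This is an illustrative model only: the names `TWO_SLOT_FORMAT` and `fits_format` are hypothetical, and treating `None` as an empty slot is an assumption, not something the text specifies.

```python
# Hypothetical two-slot format: the set of operations allowed in each slot.
# Slot 0 allows a load, slot 1 allows a multiply (the example from the text).
TWO_SLOT_FORMAT = [{"load"}, {"mul"}]

def fits_format(fmt, ops):
    """Return True if `ops` (one entry per slot) is allowed by `fmt`.

    Assumption: `None` marks a slot that is left empty.
    """
    if len(ops) != len(fmt):
        return False
    return all(op is None or op in allowed for op, allowed in zip(ops, fmt))

print(fits_format(TWO_SLOT_FORMAT, ["load", "mul"]))  # True
print(fits_format(TWO_SLOT_FORMAT, ["mul", "load"]))  # False
```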
If the ISA is designed to use VLIW, a source language compiler (for example, a C or C++ compiler) can use software-pipelining and instruction scheduling techniques to pack multiple operations into a single VLIW instruction, which has the potential to significantly increase performance (Lam, “Software Pipelining: An Effective Scheduling Technique for VLIW Machines”, Proceedings of the SIGPLAN 1988 Conference on Programming Language Design and Implementation; Krishnamurthy, “A Brief Survey of Papers on Scheduling for Pipelined Processors”, SIGPLAN Notices, V25, #7, July 1990). However, designing an ISA to use VLIW increases hardware cost compared with an ISA that does not use VLIW. Because a VLIW instruction issues and executes multiple independent operations in parallel, the hardware must contain multiple parallel decoders. Also, if multiple operations in the same VLIW instruction access the same register file, that register file must contain enough ports to satisfy all the possible accesses. In addition, if the VLIW instruction allows multiple instances of an operation to appear in the instruction, the hardware required to implement that operation must be duplicated so that the multiple instances of the operation can execute in parallel.
Thus, the number of slots in each VLIW instruction, the number of register file ports required to satisfy the operations in each VLIW instruction, and the combinations of operations allowed in each VLIW instruction influence both the performance improvement provided by the instructions, and the hardware cost of the logic required to implement the instructions. For example, a two-slot VLIW instruction that allows an integer add operation to occur in both slots requires that the integer register file have at least four read ports (each add requires two integer registers as input), and at least two write ports (each add produces a result into one integer register), and requires two copies of the addition logic. The additional decode logic, register file ports, and addition logic significantly increase hardware cost compared with a non-VLIW instruction; however, the hardware cost may be justified if being able to execute two add instructions in parallel significantly increases application performance.
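The port arithmetic in the two-add example can be reproduced with a short calculation. The sketch below is a simplified model under stated assumptions: the table `OP_PORTS` and the function `vliw_requirements` are hypothetical names, every operation is assumed to access the same integer register file, and the port counts for `mul` and `load` are illustrative rather than taken from the text.

```python
from collections import Counter

# Hypothetical table: (register reads, register writes) per operation.
# Only the "add" entry comes from the text; the others are assumptions.
OP_PORTS = {
    "add": (2, 1),
    "mul": (2, 1),
    "load": (1, 1),  # assumes one address register in, one loaded value out
}

def vliw_requirements(slots):
    """Return (read_ports, write_ports, logic_copies) for one VLIW instruction.

    `slots` lists the operation occupying each slot.
    """
    read_ports = sum(OP_PORTS[op][0] for op in slots)
    write_ports = sum(OP_PORTS[op][1] for op in slots)
    # An operation appearing in k slots needs k copies of its logic.
    logic_copies = dict(Counter(slots))
    return read_ports, write_ports, logic_copies

print(vliw_requirements(["add", "add"]))  # (4, 2, {'add': 2})
```

A complete cost model would also take the maximum over every combination of operations the instruction allows; this sketch evaluates a single combination.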
To simplify the design of an ISA containing VLIW instructions, the prior art PICO system (Aditya et al., “Automatic Architectural Synthesis of VLIW and EPIC Processors”, Proc. International Symposium on System Synthesis, ISSS '99, San Jose, Calif., November 1999, pp. 107-113) automatically creates a VLIW ISA from an application by searching the space of VLIW processors and evaluating the cost and performance of each. Within the design space searched by PICO, a VLIW processor is characterized by the size and types of register files, the operations, and the allowed combinations of operations in an instruction. Using the results of the search, the designer can choose a VLIW ISA that meets the performance and hardware cost requirements.
B. Vector Operations
The Vector Operations technique increases data throughput by creating vector operations that operate on more than one data element at a time (vector operations are also referred to as SIMD operations). A vector operation is characterized by the operation it performs on each data element, and by the number of data elements that it operates on in parallel, i.e., the vector length.
For example, a four-wide vector integer add operation adds two input vectors, each containing four integers, and produces a single result vector containing four integers. If the ISA is designed to use vector operations, a source language compiler may be able to use automatic parallelization and vectorization techniques (Wolfe, Optimizing Supercompilers for Supercomputers, MIT Press, Cambridge, Mass., 1989; Zima, Hans et al., Supercompilers for Parallel and Vector Machines, ACM Press/Addison-Wesley, Reading, Mass. 1991) to significantly increase performance of one or more of the application's loops. However, as with VLIW, using vector operations increases hardware cost because the vector operations require logic that can perform operations on multiple data elements in parallel. Also, the vector operations require vector register files capable of holding the vectors of data elements.
For example, a four-wide vector integer add requires logic to perform four integer adds in parallel, and requires a vector register file capable of holding vectors of four integer values. Thus, designing an ISA to use vector operations requires that the designer determine a set of vector operations, the number of data elements operated on by the vector operations, and the number of registers in the vector register file(s) accessed by the vector operations, such that desired application performance is balanced against hardware cost. To simplify the design of an ISA that uses vector operations, there is need in the art for an automatic ISA generation system that can create vector operations that improve application performance while balancing hardware cost.
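As a minimal sketch of the four-wide vector integer add, the following models its result one element at a time. The name `vec_add4` and the 32-bit element width are assumptions made for illustration; hardware would perform the four adds in parallel rather than in a loop.

```python
VECTOR_LENGTH = 4    # data elements processed in parallel
MASK32 = 0xFFFFFFFF  # assumption: each element is a 32-bit integer

def vec_add4(a, b):
    """Add two four-element vectors element by element."""
    if len(a) != VECTOR_LENGTH or len(b) != VECTOR_LENGTH:
        raise ValueError("each operand must hold exactly four integers")
    # The comprehension only models the result of four parallel adders.
    return [(x + y) & MASK32 for x, y in zip(a, b)]

print(vec_add4([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```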
C. Fused Operations
Fused Operations is a technique that creates operations composed of several simple operations. The Fused Operations technique is similar in spirit to Fused Multiply-Add (Hennessy et al., Computer Architecture, A Quantitative Approach, 2nd ed., Section A.7), but unlike Fused Multiply-Add, the semantics of a fused operation is identical to the composition of the semantics of the simple operations.
Using the fused operation in place of the simple operations reduces code size and issue bandwidth, and may reduce register file port requirements. Also, the latency of the fused operation may be less than the combined latency of the simple operations. An example of a fused operation is the add-with-shift-by-1 operation present in Tensilica's Xtensa Architecture (Xtensa Instruction Set Architecture Reference Manual, Chapter 5, page 170). The add-with-shift-by-1 shifts a value left by one bit and then adds it to another value, and thus is a fused operation composed from a left shift operation and an add operation. One fused add-with-shift-by-1 operation replaces two simpler operations, and still executes in a single cycle.
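The relationship between the fused operation and its component operations can be sketched as follows. The function names are hypothetical and the 32-bit register width is an assumption; the point is only that the fused result equals the composition of the shift and the add.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers

def shift_left1(x):
    """Simple operation: shift left by one bit."""
    return (x << 1) & MASK32

def add(x, y):
    """Simple operation: integer add."""
    return (x + y) & MASK32

def add_shift1(x, y):
    """Fused operation: shift x left by one bit, then add y."""
    return ((x << 1) + y) & MASK32

# Two simple operations versus one fused operation, same result.
print(add(shift_left1(5), 7), add_shift1(5, 7))  # 17 17
```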
Using fused operations may increase the hardware cost if the fused operation requires additional logic or if the fused operation requires additional register file ports to access its operands. Determining the set of fused operations that together provide performance improvement across a set of applications, and balancing that performance improvement against the hardware cost to implement the fused operations, is a difficult task. Thus, to simplify the design of an ISA that uses fused operations, there is need in the art for an automatic ISA generation system that can create fused operations that improve application performance while balancing hardware cost.
D. Specialization
Specialization is a technique that creates an operation that always uses a smaller range of values for one or more of its operands than in the original operation. For example, a 16-bit multiply operation might be specialized into a multiply by a constant or it might be specialized into an 8-bit multiply if an application does not need the full generality of the original operation. Because the operation operates on a more limited input set, the logic required to implement the specialized operation is likely to be much simpler than the logic required for the original operation.
For example, a specialized multiply operation that always performs a multiply by three requires significantly less logic than a generic multiply operation. However, if the application(s) also require the generic version of the operation, adding the specialized operation increases hardware cost. A specialized operation can also increase performance, because a constant operand does not need to be loaded into a register before the operation executes.
For example, to perform a multiply by three with a generic multiply operation requires that the constant “3” be loaded into a register that is then input to the multiply (assuming the multiply reads two registers for input), while the specialized multiply-by-three operation does not require the register load.
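The difference between the two sequences can be illustrated with a toy register-machine model. Everything here is hypothetical: the operation names `movi`, `mul`, and `mul3`, the register names, and the 32-bit width are assumptions chosen for illustration, not part of any real ISA described in the text.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit registers

def run(program, regs):
    """Execute a list of (op, dst, *src) tuples and return the final registers."""
    regs = dict(regs)
    for op, dst, *src in program:
        if op == "movi":    # load an immediate constant into a register
            regs[dst] = src[0] & MASK32
        elif op == "mul":   # generic multiply: reads two registers
            regs[dst] = (regs[src[0]] * regs[src[1]]) & MASK32
        elif op == "mul3":  # specialized multiply by three: reads one register
            x = regs[src[0]]
            regs[dst] = ((x << 1) + x) & MASK32  # 3*x as a shift and an add
        else:
            raise ValueError(f"unknown operation: {op}")
    return regs

generic = [("movi", "r2", 3), ("mul", "r3", "r1", "r2")]
specialized = [("mul3", "r3", "r1")]

print(run(generic, {"r1": 7})["r3"], len(generic))          # 21 2
print(run(specialized, {"r1": 7})["r3"], len(specialized))  # 21 1
```

Both sequences produce the same value, but the specialized form needs neither the constant load nor the second register read.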
Determining the set of specialized operations that together provide performance improvement across a set of applications, and balancing that performance improvement against the hardware cost to implement the specialized operations, is a difficult task. Thus, to simplify the design of an ISA that uses specialized operations, there is need in the art for an automatic ISA generation system that can create specialized operations that improve application performance while balancing hardware cost.
To get the maximum performance improvement for a given hardware cost, or to minimize hardware cost for a given performance improvement, the designer must consider an ISA that can contain any combination of vector operations, fused operations, specialized operations, and operations that combine those techniques (e.g., a single operation that can perform four parallel multiply-by-three-accumulate computations on two vectors of four integers, producing a result vector of four integers). In addition, the designer must consider the use of VLIW to allow multiple independent operations to be issued and executed in parallel. Simultaneously considering all four techniques when designing an ISA such that application performance is balanced against hardware cost is extremely difficult. Thus, there is need in the art for an automatic ISA generation system that uses VLIW, vector operations, fused operations, and specialized operations to create an ISA that improves application performance while balancing hardware cost.
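A single operation combining the vector, fused, and specialized techniques might behave as in the sketch below. The text does not define the exact semantics of the multiply-by-three-accumulate example, so this assumes that one input vector is an accumulator and the other is multiplied by three, giving result[i] = acc[i] + 3 * x[i]. The name `vec_mul3_acc4` and the 32-bit element width are likewise assumptions.

```python
VECTOR_LENGTH = 4    # vector technique: four elements in parallel
MASK32 = 0xFFFFFFFF  # assumption: each element is a 32-bit integer

def vec_mul3_acc4(acc, x):
    """Return acc[i] + 3 * x[i] for each of the four elements.

    Vector: four lanes. Fused: multiply and accumulate in one operation.
    Specialized: the multiplier is fixed at three.
    """
    if len(acc) != VECTOR_LENGTH or len(x) != VECTOR_LENGTH:
        raise ValueError("each operand must hold exactly four integers")
    return [(a + (v << 1) + v) & MASK32 for a, v in zip(acc, x)]

print(vec_mul3_acc4([1, 2, 3, 4], [10, 20, 30, 40]))  # [31, 62, 93, 124]
```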