New application-focused system-on-chip platforms motivate new application-specific processors. Configurable and extensible processor architectures offer the efficiency of tuned logic solutions with the flexibility of standard high-level programming methodology. Automated extension of processor function units and the associated software environment (compilers, debuggers, simulators and real-time operating systems) satisfies these needs. At the same time, designing at the level of software and instruction set architecture significantly shortens the design cycle and reduces verification effort and risk.
U.S. Pat. No. 6,282,633, issued Aug. 28, 2001 and entitled, “High Data Density RISC Processor,” U.S. application Ser. No. 09/246,047, filed Feb. 5, 1999 and entitled “Automated Processor Generation System for Designing a Configurable Processor and Software,” U.S. application Ser. No. 09/322,735, filed May 28, 1999 and entitled “System for Adding Complex Instruction Extensions to a Microprocessor,” and U.S. application Ser. No. 09/506,502, filed Feb. 17, 2000 and entitled “Improved Automated Processor Generation System for Designing a Configurable Processor and Software,” all commonly owned by the present assignee and incorporated herein by reference, dramatically advanced the state of the art of microprocessor architecture and design.
More particularly, these previous patents and applications described in detail a high-performance RISC processor, as well as a system that is able to generate a customized version of such a high-performance RISC processor based on user specifications (e.g. number of interrupts, width of processor interface, size of instruction/data cache, inclusion of MAC or multiplier) and implementation goals (e.g. target ASIC technology, speed, gate count, power dissipation, and prioritization among these goals). The system generates a Register Transfer Level (RTL) representation of the processor, along with the software tools for the processor (compiler, linker, assembler, debugger, simulator, profiler, etc.), and the set of scripts to transform the RTL representation into a manufacturable geometric representation (GDS II format files). The system further includes recursive evaluation tools that allow for the addition of processor extensions to provide hardware support for commonly used functions in accordance with the application to achieve an ideal trade-off between software flexibility and hardware performance.
Generally, as shown in FIG. 1, the processor 102 generated by the system can include a configurable core 104 that is substantially the processor described in U.S. Pat. No. 6,282,633, and an optional set of application-specific processor extensions 106, which extensions may be described by Tensilica Instruction Extension (TIE) language instructions, and/or other high level hardware description language instructions, as detailed in the above-referenced applications. The processor and generation system of the above-referenced patents and applications are embodied in products that are commercially available from Tensilica, Inc. of Santa Clara, Calif.
According to one aspect, U.S. Pat. No. 6,282,633 achieves high performance computing by making efficient use of instruction encoding to reduce program storage. As described therein, processor architecture is a well known art. As such, most features in instruction set design are not new in themselves. However, when optimizing a system for different target applications, it is possible to combine features in a novel way that results in significant improvements in performance, code size, and power dissipation.
Instruction set design must balance many conflicting requirements. Most processor systems seek to reduce the total execution time of an application (or a mixture of applications). Total execution time can be computed from the following equation: T = IE * CPI * CP, where IE is the total dynamic instruction count required to execute the program, CPI is the average number of clocks per instruction, and CP is the time for each clock. Instruction set design seeks to reduce IE while at the same time allowing for efficient implementations that can reduce CPI and CP. Implementations that seek to minimize CPI and CP can be simply referred to as efficient.
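As an illustration only, the execution time relation above can be evaluated for two hypothetical design points. The function name and all numeric values below (instruction counts, CPI, and clock period) are assumptions chosen for the example and do not come from the referenced patents.

```python
def execution_time(ie, cpi, cp):
    """Return total execution time T = IE * CPI * CP, in seconds.

    ie  -- total dynamic instruction count
    cpi -- average number of clocks per instruction
    cp  -- time for each clock, in seconds
    """
    return ie * cpi * cp


# Hypothetical baseline: 1e9 dynamic instructions, CPI of 1.5, 5 ns clock.
baseline = execution_time(1_000_000_000, 1.5, 5e-9)

# Hypothetical tuned instruction set: 40% fewer dynamic instructions,
# with CPI and CP left unchanged.
tuned = execution_time(600_000_000, 1.5, 5e-9)

print(f"baseline: {baseline:.2f} s, tuned: {tuned:.2f} s")
```

Because the three factors multiply, a reduction in IE improves T proportionally only if the implementation does not pay for it with a higher CPI or a longer CP.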
One technique commonly used to reduce IE is to have each instruction operate on multiple pieces of data simultaneously, a technique often described as Single Instruction Multiple Data (SIMD) or vector processing. To allow efficient software implementations, data elements that do not depend on each other and that are computed in similar ways must be identified, as this allows those elements to be computed in parallel.
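The effect of SIMD on IE can be sketched with a simple software model that counts one "instruction" per add issued. This is a minimal sketch, not a description of any actual hardware: the function names and the 4-lane width are assumptions made for illustration.

```python
def scalar_add(a, b):
    """Add two equal-length lists, issuing one add 'instruction' per element."""
    out, issued = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        issued += 1
    return out, issued


def simd_add(a, b, lanes=4):
    """Add two equal-length lists, issuing one add 'instruction' per group
    of `lanes` elements. The elements in a group do not depend on each
    other, so the model treats them as computed in parallel."""
    out, issued = [], 0
    for i in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        issued += 1
    return out, issued


a = list(range(16))
b = list(range(16, 32))
scalar_result, scalar_count = scalar_add(a, b)
simd_result, simd_count = simd_add(a, b)
print(scalar_count, simd_count)  # prints "16 4"
```

Both functions produce the same result; only the number of issued instructions differs, which is the reduction in IE that SIMD aims for.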
Another technique used to increase performance is to allow the processor to execute more than one instruction per cycle, i.e., to exploit Instruction Level Parallelism (ILP). There are broadly two different ways to achieve this. One is to explicitly encode multiple independent operations in one instruction. The second is to dynamically determine which operations can be executed in parallel in the processor and schedule them accordingly (i.e., out-of-order execution).
Some machines have used both SIMD and ILP techniques. The IA64 architecture from Intel Corp. of Santa Clara, Calif., for example, can encode up to three operations per instruction. Furthermore, each of these operations can operate on a vector of data. In order to encode this information into an instruction, these architectures normally use very long instruction words (VLIW). Although VLIW provides certain advantages, one big disadvantage of known VLIW machines is wasted code storage space. For example, in most fixed length VLIW machines (such as the IA64), all instructions must be encoded using the same fixed long instruction word format, even if there are not enough operations to execute in a given cycle.
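The code storage cost of a fixed-length format can be estimated with a short calculation. The sketch below assumes a hypothetical machine with three operation slots per 128-bit instruction word, in which unused slots are filled with no-ops. The constants, the function name, and the sample schedule are illustrative assumptions and do not describe the exact encoding of any real architecture.

```python
SLOTS_PER_WORD = 3   # assumed number of operation slots per instruction word
WORD_BITS = 128      # assumed fixed instruction word width, in bits


def vliw_storage(ops_per_cycle):
    """Return (total_bytes, nop_slots, utilization) for a schedule given as
    the number of useful operations issued in each cycle."""
    if any(n < 0 or n > SLOTS_PER_WORD for n in ops_per_cycle):
        raise ValueError("each cycle must issue 0 to SLOTS_PER_WORD operations")
    words = len(ops_per_cycle)
    total_slots = words * SLOTS_PER_WORD
    used_slots = sum(ops_per_cycle)
    nop_slots = total_slots - used_slots          # slots holding no useful work
    utilization = used_slots / total_slots if total_slots else 0.0
    return words * WORD_BITS // 8, nop_slots, utilization


# Hypothetical schedule: only the first cycle fills all three slots.
total_bytes, nop_slots, utilization = vliw_storage([3, 1, 1, 2, 1, 1])
print(total_bytes, nop_slots, utilization)  # prints "96 9 0.5"
```

In this example half of the encoded slots carry no useful operation, yet every cycle still occupies a full instruction word in program storage.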
A rule of thumb often quoted in the processor architecture community is that 10% of the static instructions are responsible for 90% of the execution time. This is because applications, particularly those including digital signal processing (DSP) algorithms, often have loops (or kernels) that do most of the work of processing data. It is this 10% of the instructions that must be sped up to significantly reduce total execution time. The other 90% of the instructions, however, can be encoded more efficiently to reduce instruction count, and thus the storage required for the program.
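The consequence of this rule of thumb for total execution time can be estimated with Amdahl's law, which is a standard result and is not taken from the referenced patents. In the sketch below, the function name and the kernel speedup factors are assumptions chosen for illustration; the 0.9 fraction is the 90% of execution time quoted above.

```python
def overall_speedup(hot_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when the portion of execution time
    given by `hot_fraction` is accelerated by `kernel_speedup` and the
    remainder is left unchanged."""
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / kernel_speedup)


# Kernels account for 90% of execution time (the rule of thumb above).
for factor in (2, 4, 8):  # hypothetical kernel speedups
    print(f"kernel x{factor}: overall x{overall_speedup(0.9, factor):.2f}")
```

Under this model the overall speedup approaches, but never reaches, 10 as the kernels are accelerated further, because the remaining 10% of execution time is untouched.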
The processor of U.S. Pat. No. 6,282,633 can very efficiently encode programs and can be implemented efficiently so as to minimize CPI and CP. However, in some applications it is still not possible to encode enough operations into a single instruction. It thus remains desirable to extend the instruction set architecture even further to allow the 10% of the instructions that account for 90% of the execution time to express more parallelism, particularly in DSP applications.