1. Field of the Invention
The present invention relates to a processor, and more particularly to a high-performance hybrid processor with configurable execution units.
2. Background of the Related Art
Application-Specific Processors
Microprocessors are used in a broad array of electronic applications, because their programmability via software allows rapid development and modification of very complex tasks. They form the heart of both general-purpose computer systems and specialized electronic equipment ranging from network routers to cellular telephones. Advances in the speed and density of semiconductor technology enable the creation of microprocessors that are faster, smaller and lower in power than preceding designs. They are used in the billions today because they are both convenient to program to serve the task at hand and efficient enough to meet requirements. The traditional economics of integrated circuit design (the heavy engineering effort to design and verify a new microprocessor design, plus significant prototyping costs) encourages processors that can be used for many different tasks.
Most microprocessor designs are general-purpose. They include a fixed set of features (instruction set, memory systems and interfaces) that make the processor applicable to a wide range of different tasks. However, these generic processors are inadequate for many important tasks. In particular, the programmer must often use long sequences of generic instructions to compute the necessary results for a particular application task. This inefficiency may mean that the electronic system is not fast enough, or dissipates too much power. Thus, the generic microprocessor cannot easily be used in such circumstances.
The Attraction of Universal Platforms
The ideal solution is an application-specific processor, which shares with generic processors the capacity for easy programming from high-level languages, but which includes exactly the right set of instructions for a specific set of tasks. Application-specific instruction sets can reduce the number of instructions that must be executed, and the time for execution, by factors of up to several hundred, depending on the application and the instruction set. For any given application-specific instruction set, the improvements may be limited to a small set of applications. Therefore, it is important to reduce the cost and effort of developing and building a microprocessor tuned to a specific application.
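By way of illustration only, the following sketch models how a single application-specific instruction might replace a long sequence of generic instructions. The bit-counting task, the function names `popcount_generic` and `popcount_custom`, and the operation counts are assumptions chosen for the example; loop-control overhead is ignored, and the counts are not measurements of any particular processor.

```python
def popcount_generic(word):
    """Count set bits in a 32-bit word using only AND, ADD and SHIFT steps.

    Returns (bit_count, generic_operations_used).
    """
    count = 0
    ops = 0
    for _ in range(32):
        count += word & 1  # one AND and one ADD
        word >>= 1         # one SHIFT
        ops += 3
    return count, ops


def popcount_custom(word):
    """Model of one hypothetical application-specific instruction.

    Returns (bit_count, 1) because the whole task is a single instruction.
    """
    return bin(word & 0xFFFFFFFF).count("1"), 1


generic_result, generic_ops = popcount_generic(0xF0F0)
custom_result, custom_ops = popcount_custom(0xF0F0)
print(generic_result, generic_ops, custom_result, custom_ops)
```

In this simplified model both versions return the same count, while the generic version uses 96 operations against one for the specialized instruction. The actual ratio on real hardware would depend on the application and the instruction set.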
There are two essential components to the cost and effort. The first component is the engineering effort to discover, describe and design the new microprocessor and the associated software. Essential tasks may include the following:
Analysis of the target applications to find performance bottlenecks and target code sections for instruction set optimization
Definition of new instructions which reduce the number of processor clock cycles required for execution
Detailed design of logic for implementation of the enhanced processor including enhanced instruction decode and new instruction execution units.
Development of new software development tools, especially new assemblers, high-level language compilers and other software generators to allow the program to use the enhanced instructions and other features.
Any adaptation of runtime software, such as operating systems and software libraries that may be needed to manage the enhanced hardware resources (registers, memories and instructions) of the processor.
The second component of the effort for application-specific processor development is the creation of the integrated semiconductor circuit that implements the complete processor with its enhanced instruction set. This implementation must strike a balance between low prototyping cost and low volume manufacturing cost. A fully customized design, in which all logic gates, memories and wiring are optimized for the target processor definition, will typically achieve the smallest size, lowest power and lowest volume manufacturing cost, but the one-time costs for development and prototyping may be very high. The time to design and manufacture prototypes will typically be months. For low to moderate manufacturing volumes, the amortized cost of prototyping may be larger than direct manufacturing costs.
The first component of cost and effort is addressed by processor generation tools, such as those described by A. Wang, E. Killian, D. Maydan, C. Rowen, “Hardware/Software Instruction Set Configurability for System-on-Chip Processors”, Proceedings of Design Automation Conference, 2001, or R. Gonzalez, “Configurable and Extensible Processors Change System Design”, Proceedings of Hot Chips 11, 1999. These tools let designers of electronic systems rapidly discover, describe and validate new instruction sets, and generate complete hardware designs and corresponding software.
The second component is critically important for low volume designs, and solutions have been proposed. For example, the entire microprocessor can be implemented in a fast prototyping format, based on field-programmable gate array devices. Unfortunately, the complete flexibility of these devices imposes higher costs per electronic function and lower clock rates. Processors implemented in field-programmable logic are routinely more than five times slower and consume more than ten times as much silicon area as the identical processors implemented using a more customized standard-cell circuit implementation. These standard-cell processors may, in turn, be half the speed of equivalent processors implemented with carefully hand-tuned circuits. Therefore, it is attractive to consider a hybrid implementation, where a base processor, including common instructions, registers, memories and interfaces, is implemented using fast, dense circuits, and application-specific extensions are implemented by rapidly configuring a generic section of slow but flexible field-programmable or reconfigurable logic. Ideally, this would yield a standard design, perhaps implemented on a single integrated circuit, with the speed and volume cost characteristics of a more fully customized design, but with the low prototyping costs and effort of reconfigurable logic circuits.
Choices in Processor—FPGA Coupling
Researchers have described a number of different possible solutions for hybrids of microprocessors and reconfigurable logic for application-specific processing. Gilson, U.S. Pat. No. 5,361,373, outlines the combination of a processor circuit and separate standard field-programmable gate array (FPGA) devices to form a hybrid, but does not detail the communication between them, or describe a systematic method for developing configurations or programming the processor. K. Compton, S. Hauck, “Configurable Computing: A Survey of Systems and Software”, Technical Report, Northwestern University, Dept of ECE, 1999, present a survey of possible approaches to hybrids, categorized into four types. The four types are shown in their relationship to the processor 100 and the data memory 150, all together in FIG. 1, though no system is likely to contain more than one type of reconfigurable processing unit.
These four types are: (1) Reconfigurable function units 110 within the processor: function units that are directly controlled by processor instructions and have access to internal processor registers. The latency of operations is one or a handful of cycles. (2) Reconfigurable co-processors 120: function units that operate without constant control by the processor, but may have access to processor memory. The latency of operations is measured in hundreds of cycles or more. (3) Attached processing unit 130: function units that operate with very little processor supervision for long periods of time. The processing units 130 cannot access processor local memories, such as the data memory 150 illustrated in FIG. 1.
Communication between the general-purpose processor 100 and the reconfigurable processing unit 130 occurs on a bus 102 and may take tens of cycles. The latency of operations is typically much greater than for co-processors. (4) Standalone processing unit 140: function units with completely independent control that operate independently of any other processor. They are typically accessed over a network 106 and have very long latencies, since they need to go through a network interface 104 and data bus 102 in order to communicate with processor 100.
Of the four types described, the reconfigurable function units 110 within the processor 100 appear to achieve lower latency and higher data bandwidth than the other forms of hybridization. For applications with low data transfer rates, the type of hybridization will not have a significant effect. For applications that require much data to be exchanged with the processor, however, this organizational choice can have a dramatic impact. When the reconfigurable function unit is tightly coupled to the processor, the function unit and processor can exchange several operands per cycle (at least two source operands and one result operand) and the latency of transfer is just a fraction of one cycle. By contrast, the co-processor 120, attached processing unit 130, and standalone processing unit 140 arrangements require more than one cycle of latency for transfer and rarely can achieve even one operand per cycle.
A significant liability in placing a reconfigurable unit within a processor is the possible lack of parallelism between operations of the processor and the reconfigurable function units. The present invention focuses on fundamental improvements in such tightly coupled reconfigurable units that increase the operand bandwidth, reduce operand latency and maximize parallelism between the base processor and the function units and among the function units.
Simple Instruction Set Extensions
Some simple examples of tightly coupled reconfigurable function units have been described. R. Razdan, M. D. Smith, “A High-Performance Microarchitecture with Hardware-Programmable Function Units”, Proceedings of MICRO-27, November 1994, and U.S. Pat. Nos. 5,696,956, 5,819,064, and 6,035,123 have described a simple hybrid of a RISC base processor and a field-programmable logic array used to implement the combinatorial logic for additional simple RISC instructions. The field-programmable logic is based on n-input, 1-output look-up tables (LUTs) similar to those used in popular commercial FPGAs. Added instructions follow exactly the format and structure of the base RISC instructions. A fixed part of the instruction encoding is reserved for new instructions to be implemented in reconfigurable logic. One field of the instruction word constitutes an ID that corresponds to the logic for the implementation of one combinatorial logic function. Each added instruction has access to the same two source register operands as the other instructions. Each added instruction may create one result operand, and must produce its result in one processor cycle. This result is written into the base processor's register file and is the same width as the base processor's word width. When an extended instruction is being executed, no other instruction executes in parallel. Furthermore, the logic for each added instruction is distinct and is not shared with the logic of any other. This allows the configuration for each instruction to be loaded dynamically in response to program usage, so the field-programmable logic serves as a cache of commonly used extended instructions. On the other hand, this prevents sharing of logic between instructions and leads to higher logic costs for a group of instructions.
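By way of illustration only, the following sketch models the kind of extended-instruction scheme summarized above: a reserved part of the encoding marks an extended instruction, an ID field selects one configured combinatorial function, two source registers are read, and one word-wide result is written back. The field positions, the reserved opcode value `EXT_OPCODE`, the table `CONFIGURED`, and the two example functions are assumptions made for the example and are not taken from the cited references.

```python
WORD_MASK = 0xFFFFFFFF  # 32-bit base processor word (assumed width)
EXT_OPCODE = 0x3F       # assumed value of the reserved opcode field

# Hypothetical configured logic functions, keyed by the ID field.
# Each takes two source operands and yields exactly one result.
CONFIGURED = {
    1: lambda a, b: ~(a ^ b),    # bitwise XNOR
    2: lambda a, b: min(a, b),   # unsigned minimum
}


def encode(rs, rt, rd, func_id):
    """Build an extended instruction word under the assumed field layout."""
    return (EXT_OPCODE << 26) | (rs << 21) | (rt << 16) | (rd << 11) | func_id


def execute_extended(instr, regs):
    """Decode one extended instruction and write its single result to regs."""
    if (instr >> 26) & 0x3F != EXT_OPCODE:
        raise ValueError("not an extended instruction")
    rs = (instr >> 21) & 0x1F   # first source register
    rt = (instr >> 16) & 0x1F   # second source register
    rd = (instr >> 11) & 0x1F   # single destination register
    func_id = instr & 0x7FF     # ID selecting the configured function
    logic = CONFIGURED[func_id]
    # The result is truncated to the base processor's word width.
    regs[rd] = logic(regs[rs], regs[rt]) & WORD_MASK


regs = [0] * 32
regs[1] = 0x0000FFFF
regs[2] = 0x00FF00FF
execute_extended(encode(1, 2, 3, 1), regs)  # r3 = XNOR(r1, r2)
execute_extended(encode(1, 2, 4, 2), regs)  # r4 = min(r1, r2)
print(hex(regs[3]), hex(regs[4]))
```

The model makes the constraints visible: every entry in the table is limited to two source operands and one word-wide result, and each entry stands alone, so logic common to two functions is not shared.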
S. Hauck, T. W. Fry, M. H. Hosler, J. P. Kao, “The Chimaera Reconfigurable Functional Unit”, Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines, 1997, describe a hybrid design which attempts to further improve the operand bandwidth between the base processor's register file and the reconfigurable function units. A subset of the general-purpose RISC processor's registers is shadowed in the field-configurable unit, so that more than two can be used as source operands. This, however, dramatically increases the number of wires that must run from the register file into the field-configurable logic. Moreover, only one result operand can be created per cycle, its width is limited to the word width of the base RISC processor, and the unit is not pipelined, so the computation must complete in one cycle following the decoding of the extended instruction.
Limitations of Existing Inventions
Limitations of existing inventions that have been recognized by the present inventors can be summarized as follows:
1. Loose coupling of the processor and field-programmable logic dictates very long latency and low bandwidth. In addition, the operations of the field-programmable logic cannot be closely coordinated by the processor or controlled by software. This leads to lower performance, more difficult development and fewer opportunities to exploit joint operations by the two subsystems.
2. Tight coupling between a RISC processor and RISC-type extended instructions limits parallelism. In RISC processors, computation instructions (instructions that write new values into the register file based on values in the register file) generally execute in one cycle. Either a base instruction or an extended instruction can be issued and executed at one time in the pipeline, but not both. Furthermore, if there are multiple reconfigurable function units, only one can be executing an instruction in any cycle. There is no effective overlap in execution.
3. Tight coupling between a RISC processor and RISC-type extended instructions limits speed-up from new instructions. RISC-type instruction encoding typically allows just two source registers to be specified. Therefore, new instructions added via reconfigurable logic are limited to two source operands. The benefit of application-specific instruction set extension springs from replacement of a long sequence of generic instructions with a short sequence of special instructions. Unfortunately, few long sequences of generic instructions have just two source operands. Only in special situations can a single RISC instruction replace more than a handful of generic instructions. Providing for access to additional general registers, as in Hauck et al., can improve the potential speed-up, but these references are implicit in the definition of instructions, so the choice of registers is severely limited. This reduces the usefulness of these instructions with more than two source registers and makes generation of code from compilers very difficult.
4. The lack of function unit pipelining reduces the possibility of implementing a deep logic function in a single instruction. The most valuable possible instruction set extensions often involve combining several complex arithmetic and other operations into a single instruction. This logic may have a longer delay, particularly when implemented in slow field-programmable logic, than the normal cycle time of the processors. Either these instructions must be avoided, or the processor must be stalled in some fashion to wait for this result before issuing the next instruction. Either method reduces the potential speed-up of hybrid microprocessors.
5. All previously known methods for tightly-coupled reconfiguration provide only for new combinatorial functions on the general-purpose processor registers. No new state registers can be added. This limits the bandwidth of access to data and prevents updating of complex intermediate state within a long sequence of calculations. The existing limitation to just one result, with width equal to the base processor word, is particularly troublesome. If new state registers or complete register files could be added, the opportunities for large acceleration of applications would be greatly increased.
6. The existing restriction to simple RISC-type instruction formats limits flexibility and performance. When new formats can be added, then new acceleration opportunities emerge. Two important classes of instruction formats are missing from current field-configurable processors. First, instruction formats that specify more than two source operands, where source operands may come from either the general-purpose registers of the base processor, or additional registers and register files within the field-configurable function units. Second, instruction formats that specify more than one result. Multiple result specifiers are useful either for encoding complex operations or for encoding several simple operations in a single word.
7. Existing tightly-coupled field-programmable function units are intended as a means to directly implement fast substitutes for general-purpose RISC instructions. They are not well suited for implementation of data-parallel operations such as Single Instruction Multiple Data (SIMD) or vector operations. This limitation appears both in the use of source and result data-paths that are only as wide as the base processor's registers (typically 32 bits) and in the proposed methods for discovery and design of new instructions, which do not use vectorization techniques to discover cases where multiple iterations of a loop can be executed in parallel. Moreover, existing solutions are directed at accelerating the types of software algorithms typically found in general-purpose computer systems. These often involve wide integer data-types, especially 16 bit and 32 bit integers. Parallel operations involving large numbers of such wide operands are expensive in hardware. By contrast, in embedded applications such as signal and image processing, the real native data size is often quite small (10 bits or less), so the opportunities for parallel operations are much greater.
8. The bias of existing solutions also appears in the limited interface between the processor and other logic functions in the system. In practice, virtually all computer system data is passed through a centralized main memory, implemented in separate integrated circuits, before going into the processor. By contrast, in embedded electronic systems, the processor is often implemented in the same integrated circuit as memories, input/output interfaces and other specialized logic. To pass all data through a main memory would form a bottleneck. Direct interface of external logic to the processor would reduce latency and increase bandwidth for many operations.
9. The existing methods for development and use of application-specific instruction logic suppose that the logic for each added instruction is independent of the others. Each logic configuration can be loaded on demand, on the assumption that not all required extensions can fit in the available configurable logic array. This means that common logic cannot be easily shared, potentially leading to substantial duplication of logic functions when two instructions with overlapping implementations are both resident in the logic array.
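By way of illustration only, the data-parallel opportunity noted in item 7 above can be sketched as follows: when the native data size is small, several operands fit in one word-wide data path and can be processed together. The helper names `pack`, `unpack` and `packed_add`, the 8-bit lane width, and the sample values are assumptions made for the example. The loop models in software what dedicated hardware lanes would do at the same time.

```python
def pack(values, lane_bits):
    """Pack small unsigned values into one integer, lane 0 in the low bits."""
    mask = (1 << lane_bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * lane_bits)
    return word


def unpack(word, lane_bits, lanes):
    """Split a packed word back into its individual lane values."""
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(lanes)]


def packed_add(a, b, lane_bits, lanes):
    """Lane-wise modular add; carries never cross a lane boundary."""
    mask = (1 << lane_bits) - 1
    out = 0
    for i in range(lanes):
        shift = i * lane_bits
        lane_sum = (((a >> shift) & mask) + ((b >> shift) & mask)) & mask
        out |= lane_sum << shift
    return out


# Four 8-bit samples occupy the same 32 bits as one 32-bit integer.
a = pack([10, 200, 30, 255], 8)
b = pack([5, 100, 7, 1], 8)
print(unpack(packed_add(a, b, 8, 4), 8, 4))
```

With 8-bit lanes, a 32-bit data path carries four operands per word where a 32-bit integer type carries one. Each lane wraps around independently on overflow in this model; a wider data path or narrower lanes would increase the lane count accordingly.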
Overcoming these limitations would greatly improve the performance of hybrid processors.