As the demand for real-time processing power has grown, and as it has become increasingly difficult to raise processor clock rates, there has been a growing need for a programmable and reconfigurable microprocessor architecture, and a corresponding programming method, that are highly efficient and readily adaptable to a variety of software applications.
Often, three technologies are used in combination to provide adaptable, high efficiency processing solutions, namely application specific integrated circuits (ASIC), general purpose microprocessors (GPM), and field-programmable gate arrays (FPGA). ASIC's are typically designed for specific applications, and typically offer only very limited programmability. GPM's and FPGA's can both be adapted to different applications using programming languages at varying levels.
In particular, GPM's can typically be programmed using high-level software programming languages, whereby a user writes the code using a high-level language, after which a compiler is ultimately responsible for generating the machine code that runs on the GPM. This approach can be highly adaptable, and can reduce software development time, such that a GPM-based solution is typically the best approach for minimizing software development costs if it can meet the requirements of the application. However, the resulting solutions typically have lower hardware efficiency than an ASIC, causing GPM-based solutions to generally be more expensive and more power hungry than ASIC-based solutions.
FPGA's can only be programmed using a more primitive “register transfer language” (RTL) such as VHDL or Verilog, which requires more software development time. Also, FPGA's represent a trade-off wherein the hardware overhead is high in exchange for being adaptable to different applications. Accordingly, the power consumption and cost of a product based on an FPGA are usually much higher than those of a similar product that uses an ASIC. In general, therefore, there is always a trade-off between cost, power, and adaptability.
For the last several decades, as per the so-called “Moore's Law,” GPM clock frequencies have doubled approximately every eighteen months. Hence, if a compiled program did not meet a certain requirement using current technology (e.g. its cycle count exceeded what was required), it was only necessary to wait a few years until the processor clock frequency increased enough to meet the requirement. However, this trend of increasing clock frequencies has come to a virtual stop, due to power and light-speed limitations, such that application requirements that cannot be met using current GPM's cannot be addressed simply by waiting.
Over the years, engineers have tried to improve the hardware efficiency of GPM's using so-called “pipelined” processors that take advantage of application programs that have multiple, independent threads of equal lengths. These include “single instruction multiple data” (SIMD) processors for threads that follow the exact same instruction sequence, as well as “very long instruction word” (VLIW) processors for threads that follow different instruction sequences. However, if an application program has multiple threads with very different lengths, or multiple threads with data communication between them, then SIMD and VLIW architectures do not offer much advantage as compared to non-pipelined GPM's.
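The effect of unequal thread lengths can be illustrated with a minimal sketch. It assumes a simplified model that is not taken from any particular processor: each thread occupies one lane, all lanes advance in lockstep, and the lanes are held until the longest thread finishes. The function name `lane_utilization` and the example thread lengths are hypothetical.

```python
def lane_utilization(thread_lengths):
    """Fraction of lane-cycles doing useful work under the assumed lockstep model.

    thread_lengths: number of instructions in each thread, one thread per lane.
    Assumption: every lane stays reserved until the longest thread completes.
    """
    if not thread_lengths:
        raise ValueError("at least one thread is required")
    if any(length < 0 for length in thread_lengths):
        raise ValueError("thread lengths must be non-negative")
    longest = max(thread_lengths)
    if longest == 0:
        return 0.0
    useful_cycles = sum(thread_lengths)            # cycles in which a lane does work
    reserved_cycles = longest * len(thread_lengths)  # cycles the lanes are held
    return useful_cycles / reserved_cycles


# Four threads of equal length keep every lane busy.
equal = lane_utilization([100, 100, 100, 100])
# One long thread and three short ones leave most lane-cycles idle.
unequal = lane_utilization([100, 10, 10, 10])
```

Under this assumed model, equal-length threads give full utilization, while the unequal case uses only about a third of the reserved lane-cycles.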
Other approaches include using special purpose processors that optimize specific operations in an application, such as digital signal processors (DSP's), image processors, network processors, and graphics processors. For example, a DSP typically includes a multiply accumulator (MAC) that has a throughput of one operation per cycle. This can be very useful for signal processing applications, because multiply accumulate operations are very common in digital signal processing. However, if a DSP is used for network processing, it will be highly inefficient, since network processing does not typically require multiply operations. By contrast, a network processor typically does not include a special multiplier, but does include features that optimize the table lookup operation, since table lookup is the most common operation used in network processing. In the same way, a network processor would be highly inefficient if used for digital signal processing.
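The two dominant operations contrasted above can be sketched in software as follows. This is only an illustration of the kinds of operation involved, not of how any DSP or network processor implements them; the function names, the integer sample values, and the forwarding table contents are all hypothetical.

```python
def multiply_accumulate(samples, coefficients):
    """Sum of products, the inner loop of a typical digital filter.

    A DSP with a hardware MAC would perform each loop iteration as one operation.
    """
    if len(samples) != len(coefficients):
        raise ValueError("samples and coefficients must have the same length")
    accumulator = 0
    for sample, coefficient in zip(samples, coefficients):
        accumulator += sample * coefficient  # one multiply accumulate step
    return accumulator


def lookup_next_hop(forwarding_table, destination, default="drop"):
    """Table lookup, the dominant operation in the network processing example.

    No multiplication is involved, so a hardware MAC offers no benefit here.
    """
    return forwarding_table.get(destination, default)


# Hypothetical data for illustration only.
filter_output = multiply_accumulate([1, 2, 3], [4, 5, 6])
next_hop = lookup_next_hop({"10.0.0.1": "port1"}, "10.0.0.1")
```

The first function is dominated by multiplications and additions, while the second performs none, which is why hardware tuned for one workload tends to sit idle on the other.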
Another problem with current pipelined processors arises from limitations that are inherent in the design of the pipeline stages. A modern pipelined processor, operating at a very high clock rate, will typically include more than ten pipeline stages. This means that more than ten cycles are required to perform a branch, even though only a couple of pipeline stages are actually being utilized. For example, algorithms with continuous branching do not use most of the pipeline stages, leading to very low efficiency. Instead, a processor with very few pipeline stages (i.e., very simple hardware) has to be used for such algorithms to improve the efficiency.
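The cost of branching in a deep pipeline can be estimated with a rough model. The sketch below assumes, purely for illustration, that a non-branch instruction completes at one per cycle, that every branch costs as many cycles as there are pipeline stages, and that there is no branch prediction. The function names and the example figures are hypothetical.

```python
def cycles_per_instruction(pipeline_stages, branch_fraction):
    """Average cycles per instruction under the assumed model.

    pipeline_stages: number of stages; each branch is assumed to cost this many cycles.
    branch_fraction: fraction of instructions that are branches, from 0.0 to 1.0.
    """
    if pipeline_stages < 1:
        raise ValueError("pipeline_stages must be at least 1")
    if not 0.0 <= branch_fraction <= 1.0:
        raise ValueError("branch_fraction must be between 0.0 and 1.0")
    non_branch_cost = (1.0 - branch_fraction) * 1.0
    branch_cost = branch_fraction * pipeline_stages
    return non_branch_cost + branch_cost


def pipeline_efficiency(pipeline_stages, branch_fraction):
    """Achieved throughput relative to the ideal of one instruction per cycle."""
    return 1.0 / cycles_per_instruction(pipeline_stages, branch_fraction)


# Continuous branching on a ten-stage pipeline versus a two-stage pipeline.
deep = pipeline_efficiency(10, 1.0)
shallow = pipeline_efficiency(2, 1.0)
```

Under these assumptions, code that branches continuously achieves one tenth of the ideal throughput on the ten-stage pipeline and one half on the two-stage pipeline, which is consistent with the preference for a shallow pipeline in branch-heavy algorithms.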
Due to these limitations of current technology, modern-day handset “system on chip” (SoC) designs, for example, are forced to incorporate many of the technologies described above in combination to deliver a handset application. For example, a typical handset SoC might include a few “Advanced RISC Machine” (ARM) cores (big and small), an image processor, a graphics processor, a DSP, etc.
What is needed, therefore, is a parallel processor architecture and corresponding programming method that will provide very fast data processing with high energy efficiency, while also being highly programmable for use in multi-purpose devices and adaptable as new requirements and new applications arise.