As described in said copending patent application, prior art and existing processor architectures have not yet achieved powerful enough or flexible enough designs for real time universal multimedia applications, particularly in light of the growing and diverse program applications that are converging in one system, for example, in a single hand-held device such as a cellular handset.
Such existing processor architectures (e.g. MIPS, ARM, etc.) generally operate with a single instruction set; and the control information coded in such an instruction set drives all the functional circuit blocks in a single processor core.
Typical of such functional blocks are
    a sequencer that calculates the address for the next instruction fetch (e.g. +1 to fetch the instruction immediately following the current one, load x for a branch, etc.). The calculation of such an instruction address can be dependent on a condition flag.
    a computation unit that performs various arithmetic or logic operations on incoming data, such as the execution units described in said co-pending patent application.
    register files and their configurable connection to the inputs and outputs of the computation units.
    a memory bus that can be configured to send data to or receive data from a specific address in external memory.
The more flexible these functional blocks can be made, the better they can be used to execute any general purpose program. On the other hand, the more flexible these functional blocks are, the more bits are required to configure them for a specific operation.
Prior architectures and their strengths and limitations will now be reviewed as a background to the strategies of the instruction set design, control and communication herein used in the processor cores of the present invention.
In the traditional Von Neumann architecture, the compiled software program contains the instruction sequence to be executed as well as the data to be processed, and both are stored in memory together. The bandwidth between the memory and the CPU, however, limits the performance, as it sets an upper bound on how many bits of instruction and data can be sent to the processor each clock cycle. This is the famous Von Neumann “bottleneck” identified in the 1970s.
Later architectures, such as the Harvard and Super Harvard Architectures, separated the instruction and data memory, and added an instruction cache of faster internal memory inside the CPU to enable speculative loading of new pages (blocks of memory) of instructions from external memory, and swapping out of old pages. The goal was to fetch the next instruction from the faster cache memory instead of the main instruction memory. A speculation algorithm is used to determine what new pages to load and what old pages to swap out. While the performance is improved for a cache “hit” (i.e. finding the instruction in the cache), when there is a cache “miss” (i.e. not finding the instruction in the cache), the processor stalls for many cycles while waiting for the new page to be loaded. If the speculation algorithm is not efficient, the performance suffers. Such a design also comes at the price of the added hardware and complexity needed to implement an efficient speculation algorithm. Some modern processor architectures use data caches as well.
A different prior art approach, called RISC Processor and Pipeline, as described in said co-pending patent application, works on limiting the size of the single instruction. The Reduced Instruction Set Computer (RISC) defines the instruction set on the principle of the lowest common denominator of any general purpose program. The instruction set is simple or “reduced”, making the hardware required to execute it simple as well. The execution of a single instruction is then divided into pipeline stages in hardware, with equal or similar propagation delays and registers for buffering intermediate data results, and with the necessary control signals passed from one stage to the next. The processor then attempts to stack the execution of n instructions in parallel, with the preceding instruction executing one stage ahead. When the pipeline is filled, one instruction completes in every 1/n of the time it takes a single instruction to execute in hardware. This way, even though the instruction set is simpler and each instruction can perform only a limited operation, it executes much faster, as, for example, in the classic MIPS processor, a well-known 5-stage RISC. In such a MIPS design, the instruction set is kept simple and the hardware is reused. For example, one ALU block is used not only for data processing but also for computing the address for data memory access. A register file is used to store data pre- and post-ALU operation as well as to store part of the memory access address. This is possible because all instructions are kept relatively simple and require similar amounts of hardware processing. But even in this simple architecture, all the hardware cannot be utilized all the time. The MEM (memory access) stage, for example, is not utilized for any arithmetic or logic operation instruction.
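The throughput benefit of filling the pipeline can be illustrated with a minimal sketch. It assumes an idealized pipeline in which every stage takes exactly one cycle and no stalls or flushes occur; the function names are hypothetical and chosen only for this illustration.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction finishes all stages before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles for an ideal pipeline: fill it once, then complete one instruction per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    stages = 5  # e.g. a classic 5-stage pipeline
    for count in (1, 10, 1000):
        print(count, unpipelined_cycles(count, stages), pipelined_cycles(count, stages))
```

For a long instruction stream the ratio of the two counts approaches the number of stages n, which is the 1/n figure referred to above.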
It should be observed, moreover, that in the pipelined RISC design, all control signals for all pipeline stages are generated at the ID (Instruction decode) stage, and they have to be buffered and carried to their intended stages. Even in the simple 5-stage MIPS, there are thus still many control signals being buffered and sent along the pipeline stages.
Although, as also explained in said co-pending patent application, the RISC processor improves instruction throughput by utilizing a pipelined structure, there are limitations on such improvements. One such limitation lies in its ability to execute computation-intensive real-time signal processing programs. Without special instructions and special hardware for multiplication, or multiply-accumulation, these operations can take many cycles to execute. A 16-bit multiplication, for example, can take up to 16 cycles; and a 32-bit multiplication can take up to 32 cycles. Such performance is not, however, adequate for real-time computation-intensive algorithms. Another limitation is the constraint on filling the pipeline. If the choice of the next instruction is dependent on the computation result of the previous one (i.e. a branch instruction), it cannot be fetched one cycle after the previous one is fetched, at which time the result is not known. This prevents the pipeline from getting filled, which results in stalling. Instead of stalling, the instruction on one path of the branch can then, however, be speculatively fetched. When the result is available, the pipeline can then proceed normally, provided the correct branch has been fetched. Otherwise, the pipeline must be flushed to go back to the right branch. Such speculative execution thus only improves efficiency if the branch prediction has a high rate of accuracy, which is not always easy to achieve.
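The multi-cycle cost of multiplication without dedicated hardware, mentioned above, can be seen in a shift-and-add sketch. It assumes a simple model in which one multiplier bit is examined per cycle using only an adder and a shifter; the function name and the one-bit-per-cycle cost model are illustrative assumptions, not a description of any particular processor.

```python
from typing import Tuple


def shift_add_multiply(a: int, b: int, width: int = 16) -> Tuple[int, int]:
    """Multiply two unsigned width-bit integers, one multiplier bit per cycle.

    Returns (product, cycles). Models a datapath with an adder and a shifter
    but no hardware multiplier.
    """
    mask = (1 << width) - 1
    if not (0 <= a <= mask and 0 <= b <= mask):
        raise ValueError("operands must fit in the given width")
    product = 0
    cycles = 0
    for bit in range(width):
        if (b >> bit) & 1:
            product += a << bit  # conditional add of the shifted multiplicand
        cycles += 1              # one cycle per multiplier bit examined
    return product, cycles


if __name__ == "__main__":
    print(shift_add_multiply(1234, 5678, width=16))
    print(shift_add_multiply(70000, 80000, width=32))
```

Under this model the cycle count equals the operand width, consistent with the 16-cycle and 32-cycle upper bounds given in the text.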
As also mentioned in said co-pending patent application, the use of DSP can significantly improve the performance of algorithms with continuous multiply-accumulate or MAC operation (e.g. filtering, matrix multiplication) because a pipelined DSP with added special instructions and dedicated hardware achieves MAC operation throughput of a single cycle.
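As an illustration of why filtering is dominated by MAC operations, the following minimal sketch computes a direct-form FIR filter. Each pass through the inner loop body is one multiply-accumulate, which is the operation a DSP with dedicated hardware completes at a rate of one per cycle. The function name and the integer sample format are assumptions made for this example.

```python
from typing import List, Sequence


def fir_filter(samples: Sequence[int], coeffs: Sequence[int]) -> List[int]:
    """Direct-form FIR filter: y[n] = sum over k of coeffs[k] * samples[n - k]."""
    outputs = []
    for n in range(len(samples)):
        acc = 0  # accumulator register
        for k, coeff in enumerate(coeffs):
            if n - k >= 0:
                acc += coeff * samples[n - k]  # one multiply-accumulate (MAC)
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    # Two-tap filter with unit coefficients: each output is the sum of two neighbours.
    print(fir_filter([1, 2, 3, 4], [1, 1]))
```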
But for non-computation-intensive programs, the added single cycle MAC logic can be a significant overhead, since it is not used for other instructions. And for algorithms that are not mostly MAC-based (e.g. motion compensation in video decode which is, rather, addition based), the MAC logic also does not improve performance.
As today's real-time multimedia processing algorithms get much more complicated, moreover, increasingly more computation hardware must be added to the processor. To keep the throughput high, a pipelined structure is still used, but with more stages in order to have a reasonable propagation delay for each stage.
With more hardware to perform more computations in parallel, moreover, more control information (i.e. instruction) and more data must enter the processor pipeline every clock cycle to make use of the hardware blocks. The original before-discussed Von Neumann bottleneck challenge is then multiplied many times, since the clock rate has become much higher. In addition, there is more instruction and data that needs to get into the processor pipeline stages every clock cycle, so techniques such as instruction and data caches and branch prediction must still be used to improve performance.
With the different computation hardware used in parallel to process data, their capability has to be mapped to the user program. As opposed to RISC, the hardware is no longer the lowest common denominator of general purpose programs, and the most efficient mapping is not easy to achieve. Instruction set design accordingly starts to depart from the traditional RISC principle.
One way, however, to take advantage of multiple computation blocks executing in parallel is to duplicate the hardware units and use the same instruction to drive multiple sets of data calculation. This is called Single Instruction Multiple Data (SIMD) and it is an efficient use of control bits; but it is only practical for algorithms that have a lot of parallel identical calculations on different data sets.
It is more complicated, however, to map parallel computation to different hardware blocks. One approach is to use Fixed Length Instructions, with each instruction targeting one hardware block. A hardware instruction sequencing and dispatch block is capable of fetching and sequencing multiple instructions every clock cycle. There is an instruction decode block provided for each computation unit, an arrangement called the Superscalar Instruction Dispatch Architecture.
Still another prior approach is to use a Very Long Instruction Word (VLIW) to code for all possible combinations of parallel instructions. In this case, there only needs to be one instruction fetch module that can fetch one instruction at a time. But such a long instruction is very inefficient for simple operations (e.g. a control instruction without parallel computation).
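The SIMD idea of one instruction driving several identical computation units can be sketched as follows. This is a software model only: the opcode names, the lane representation as Python lists, and the function name are hypothetical and serve just to show that a single set of control bits is shared by all lanes.

```python
import operator
from typing import Callable, Dict, List, Sequence

# One opcode selects the same operation for every lane (hypothetical opcode names).
OPERATIONS: Dict[str, Callable[[int, int], int]] = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
}


def simd_execute(opcode: str, lanes_a: Sequence[int], lanes_b: Sequence[int]) -> List[int]:
    """Apply the single decoded operation to every pair of lane operands."""
    if len(lanes_a) != len(lanes_b):
        raise ValueError("both operands must have the same number of lanes")
    operation = OPERATIONS[opcode]  # decoded once, shared by all lanes
    return [operation(a, b) for a, b in zip(lanes_a, lanes_b)]


if __name__ == "__main__":
    print(simd_execute("add", [1, 2, 3, 4], [10, 20, 30, 40]))
```

The model also shows the restriction noted above: every lane must perform the same calculation, so the scheme only pays off when the data sets call for identical operations.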
The Resulting Complexity of Processor Design
While today's processors use the above-described techniques to improve performance, all still increase the hardware complexity and power consumption. Resort has accordingly been taken to the use of one or more layers of hierarchical data and instruction memory for caching with sophisticated page replacement algorithms. This results, however, in the need for complex instruction fetch logic to figure out where to fetch the next instruction from. Multiple sets of computation blocks are dedicated to special computation activators, such as Multiplication, Addition and Logic Operation, Shift and Rotate, which indeed are only fully utilized in a cycle if 1) the program can be sequenced to use all blocks in parallel, and 2) there is enough bandwidth to get the required control bits to the computation blocks. The use of branch prediction to keep the pipeline filled is, of course, subject to branch prediction errors, which may be more costly since the pipeline to be flushed is then deeper.
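The growing cost of branch prediction errors in deeper pipelines can be illustrated with a commonly used first-order model of average cycles per instruction. The model and all numeric values below (branch fraction, misprediction rate, flush penalties) are illustrative assumptions and are not taken from any measured processor.

```python
def effective_cpi(branch_fraction: float,
                  mispredict_rate: float,
                  flush_penalty_cycles: int,
                  base_cpi: float = 1.0) -> float:
    """First-order model: base CPI plus the average flush cost per instruction."""
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty_cycles


if __name__ == "__main__":
    # Assumed workload: 20% branches, 10% of them mispredicted.
    for penalty in (3, 10, 20):  # deeper pipelines flush more stages
        print(penalty, effective_cpi(0.20, 0.10, penalty))
```

With the same prediction accuracy, the average cost per instruction rises in proportion to the number of stages that must be flushed.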
All the above processor design and prior art schemes, including added hardware and complexity, have thus not achieved a processor powerful enough and flexible enough for real-time universal multimedia application.
A review of current System On Chip (SoC) designs in today's multimedia mobile handsets reveals the use of multiple processors, and also the supplemental use of multiple Application Specific Integrated Circuit (ASIC) blocks in them (discussed also in said co-pending application). The same is true of the current high-end set-top box SoC. These multiple processors often include a simple RISC for control functions, a traditional digital signal processor (DSP) for voice/audio processing, and VLIW multimedia processors for image and video processing, supplemented with ASIC blocks that handle algorithms that cannot be handled well by the prior programmable processors.
There is, however, a significant difference between resort to ASIC and a stand-alone programmable processor.
Today's processor has centralized instruction dispatch. All the logic blocks in the processor pipeline get their control signals sent through the pipeline from the instruction decode stage. For a coded instruction as long as 256 bits, for example, the decoded control signals can be numerous. These signals need to get to their intended block every cycle to maintain the throughput, resulting in a significant on-chip bandwidth requirement for the control signals. The instructions must also be sequenced to maximize computation hardware usage every clock cycle under the constraints of the data memory bandwidth, the size of the register file, and their possible connection to the computation unit, making efficient instruction sequencing a difficult task.
The most significant difference between ASIC and such a general purpose processor is that ASIC does not have programs or instructions. ASIC only has a data flow, and not an instruction or control flow. The input data flows through different functional blocks and buffer memory blocks towards the output. Data are processed by each functional block as they traverse through it, and without the overhead of instruction traffic, the clock rate can be kept low.
In accordance with the hereinafter detailed approach of the present invention, many of these inadequacies of existing and prior art programmable processors, with their centralized fetch and decode block strategy that determines the control for every other block in the system, every clock cycle, are successfully overcome.
In addition, there are a few popular algorithms and operations that existing and prior art general purpose processors have trouble in handling. One of them involves the implementation of a Variable Length Decoder (VLD) or Huffman Decoder. In general, Huffman coding uses fewer bits to code symbols that appear more frequently (e.g. letter “e” in the English language), and more bits to code symbols that appear less frequently (e.g. letter “x” in the English language). Decoding of such symbols in a bit stream is difficult in current processors because
    1. Frequent symbols are usually coded with far fewer bits than the fixed operand bits for a processor; and
    2. The location where a symbol starts depends on the processing result of the current symbol, making the next instruction dependent on computation results of current instructions all the time. No effective speculative instruction-fetch algorithm can indeed be implemented either, since the instruction-fetch is totally data dependent. This is very inefficient since it makes filling the pipeline almost impossible.
Another challenge for today's processor is the implementation of a Finite State Machine (FSM). The FSM is used to quickly derive a new state based on the previous stored state and new inputs. Output (or actions) is then derived from the new state, or from the new state and inputs. There are usually, however, very few bits of new inputs, and very few bits that represent the state, compared to the typical operand bit width. It is extremely difficult, therefore, to write FSM instruction sequences that can be easily pipelined in a processor for fast execution. But with limited gates and a few bits of registers, a very fast FSM can be implemented in digital ASIC. In fact, Huffman decoding of each symbol can be implemented with a few linked states, with each state corresponding to a specific bit pattern that has been read and the number of new bits to read to continue the decoding process.
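The linked-state view of Huffman decoding described above can be sketched as a small table-driven FSM. In this sketch each state is the bit pattern read so far and one new bit is consumed per step; the four-symbol code table is a hypothetical example, and the function names are likewise assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical prefix-free code: frequent symbols get shorter codes.
CODE_TABLE: Dict[str, str] = {"e": "0", "t": "10", "x": "110", "q": "111"}

Transition = Tuple[str, Optional[str]]  # (next state, emitted symbol or None)


def build_fsm(code_table: Dict[str, str]) -> Dict[Tuple[str, str], Transition]:
    """Build (state, input bit) -> (next state, output) from a prefix-free code."""
    transitions: Dict[Tuple[str, str], Transition] = {}
    for symbol, code in code_table.items():
        state = ""  # the empty pattern is the start state
        for index, bit in enumerate(code):
            if index == len(code) - 1:
                transitions[(state, bit)] = ("", symbol)  # symbol done, restart
            else:
                transitions[(state, bit)] = (state + bit, None)
                state += bit
    return transitions


def decode(bits: str, transitions: Dict[Tuple[str, str], Transition]) -> List[str]:
    """Run the FSM over the bit stream, one bit per state transition."""
    state = ""
    symbols: List[str] = []
    for bit in bits:
        state, emitted = transitions[(state, bit)]
        if emitted is not None:
            symbols.append(emitted)
    if state != "":
        raise ValueError("bit stream ended in the middle of a symbol")
    return symbols


if __name__ == "__main__":
    fsm = build_fsm(CODE_TABLE)
    print(decode("0101100111", fsm))
```

Note that the state and the input are each only a few bits wide, and that where the next symbol begins is known only after the current transition completes, which is the data dependence that makes this workload awkward for a pipelined processor.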
The present invention addresses these limitations by improving the logic circuit interface to memory banks.