1. Field of the Invention
This invention relates generally to the field of processors and, more particularly, to processors that have low power consumption, high performance and low die area, and that can be flexibly and scalably employed in multimedia and communications applications.
2. Description of the Prior Art
With the growing popularity of consumer gadgets, such as cell or mobile phones, digital cameras, iPOD® and personal digital assistants (PDAs), many new standards for communication with these gadgets have been adopted by the industry at large. Some of these standards include H.264, Moving Picture Experts Group 4 (MPEG-4), Ultra Wide Band (UWB), BLUETOOTH®, 2G/2.5G/3G/4G, Global Positioning System (GPS), Moving Picture Experts Group Layer-3 (MP3) and Security. iPOD® is a registered trademark owned by Apple Computers, Inc. BLUETOOTH® is a registered certification mark owned by Bluetooth SIG, Inc. However, an emerging problem is that the different standards dictating communications of and between different gadgets require tremendous development effort. One of the reasons for the foregoing problem is that no processor or sub-processor currently available in the marketplace is easily programmable for use by all digital devices while conforming to the various mandated standards. It is only a matter of time before this problem grows, as new trends in consumer electronics warrant the adoption of even more standards by the industry.
One emerging, if not already current, requirement of processors is low power consumption combined with the ability to execute enough code to process multiple applications. Current power consumption is on the order of sub-hundreds of milliwatts per application, whereas the goal is to remain in the sub-hundreds of milliwatts while executing multiple applications. Another requirement of processors is low cost. Because processors are so widely used in consumer products, a processor must be inexpensive to manufacture; otherwise, its use in most common consumer electronics is not practical.
To provide specific examples of current processor problems, several well-known classes of processors, each exhibiting its own problems, are briefly described below. These include reduced instruction set computing (RISC) chips, which are used in some consumer products; microprocessors, which are used in other consumer products; digital signal processors (DSPs), which are used in yet other consumer products; and application specific integrated circuits (ASICs), which are used in still other consumer products. For each class, a “Pros” section outlines the benefits and a “Cons” section outlines the disadvantages.
A. RISC/Super Scalar Processors
RISC and Super Scalar processors have been the most widely accepted architectural solution for all general purpose computing. They are often enhanced with application specific accelerators for solving certain specialized problems within the context of a general solution.
Examples include: ARM series, ARC series, StrongARM series, and MIPS series.
Pros:
Industry-wide acceptance has led to a more mature tool chain and wide software choices.
A robust programming model has resulted from a very efficient automatic code generator used to generate binaries from high-level languages like C.
Processors in this category are very good general purpose solutions.
Moore's Law can be effectively used for increasing performance.
Cons:
The general purpose nature of the architecture does not leverage common or specific characteristics of a set or sub-set of applications for better price, power and performance.
They consume moderate to high amounts of power with respect to the amount of computation provided.
Performance increases are mostly achieved at the expense of pipeline latency, which adversely affects several multimedia and communication algorithms.
A complicated hardware scheduler, sophisticated control mechanisms and significantly reduced restrictions, adopted to make automatic code generation for general algorithms more efficient, have made this category of solutions less area efficient.
B. Very Long Instruction Word (VLIW) and DSPs
VLIW architectures eliminated some of the inefficiencies found in RISC and Super Scalar architectures to create a fairly general solution in the digital signal processing space. Parallelism was significantly increased. The onus of scheduling was transferred from hardware to software to save area.
Examples include: TI 64xx, TI 55xx, StarCore SC140, ADI SHARC series.
Pros:
Restricting the solution to the signal processing space improved 3P (price, power and performance) in comparison with RISC and Super Scalar architectures.
VLIW architectures provide a higher level of parallelism relative to RISC and Super Scalar architectures.
An efficient tool chain and industry-wide acceptance were generated fairly rapidly.
Automatic code generation and programmability are showing significant improvements as more processors designed for signal processing fall into this category.
Cons:
Although problem solving capability is reduced to the digital signal processing space, that space is too broad for a general solution like a VLIW machine to have efficient 3P.
Control is both expensive and power-consuming, especially for primitive control code in many multimedia and communication applications.
Several power- and area-inefficient techniques were used to make automatic code generation easy. Strong reliance on these techniques by the software community is carrying this inefficiency forward from generation to generation.
VLIW architectures are not well suited for processing serial code.
C. Reconfigurable Computing
Several efforts in industry and academia over the last 10 years have focused on making a flexible solution with ASIC-like price, power and performance characteristics. Many have challenged existing and mature laws and design paradigms with little industry success. Most of the attempts have been in the direction of creating solutions based on coarser-grain Field-Programmable Gate Array (FPGA)-like architectures.
Pros:
Some designs that were restricted to a specific application, while providing the needed flexibility within that application, proved to be competitive in price, power and performance.
Research showed that such restricted yet flexible solutions can be created to address many application hotspots.
Cons:
Several designs in this space did not provide an efficient and easy programming solution and therefore were not widely accepted by a community adept in programming DSPs.
Automatic code generation from higher-level languages like C was either virtually impossible or highly inefficient for many of the designs.
The 3P advantage was lost when an attempt was made to combine heterogeneous applications using one type of interconnect and one level of granularity. The degree of utilization of the provided parallelism suffered heavily.
Reconfiguration overhead was significant in 3P for most designs.
In many cases, the external interface was complicated because the proprietary reconfigurable fabric did not match industry standard system design methodologies.
Reconfigurable machines are uni-processors and rely heavily on a tightly integrated RISC even for processing primitive control.
D. Array of Processors
Some recent approaches are focused on making reconfigurable systems better suited to process heterogeneous applications. Solutions in this direction connect multiple processors optimized for either one or a set of applications to create a processor array fabric.
Pros:
Different processors optimized for different sets of applications, when connected together using an efficient fabric, can help solve a wide range of problems.
A uniform scaling model allows a number of processors to be connected together as performance requirements increase.
Complex algorithms can be efficiently partitioned.
Cons:
Although performance requirements may be adequately answered, power and price inefficiencies are too high.
The programming model varies from processor to processor. This makes the job of the application developer much harder.
Uniform scaling of multiple processors is very expensive and power-consuming. It has also been shown to display some non-determinism that may be detrimental to the performance of the entire system.
The programming model at the system level suffers from the complexity of communicating data, code and control information without any shared memory resources, since shared memory is not uniformly scalable.
The extensive and repetitive glue logic required to connect different types of processors to a homogeneous network adds to the area inefficiencies, increases power and adds to the latency.
In light of the foregoing, there is a need for a low-power, inexpensive, efficient, high-performance, flexibly programmable, heterogeneous processor for allowing execution of one or more multimedia applications simultaneously.