1. Field of the Invention
The present invention relates to processors and processor organization for the execution of data-dominated programs, and in particular to combining the flexibility of software processors with the speed and cost efficiency of dedicated hardware processors. More specifically, the present invention relates to flexibly reusable customized processors and methods of using the same.
2. Description of the Related Technology
A representative application for data-dominated processing is (MPEG-based) video compression. Many custom hardware architectures for motion estimation and other MPEG subsystems have been proposed. Power management and reduction are becoming a major issue for such applications. Two examples of MPEG2 custom processors using comparable CMOS technologies are known and exhibit the following characteristics:
SGS: 4 chip set, 20 W@27 MHz, flexible centralized bus/memory organization, 64 Mbit external RAM.
Mitsubishi: 3 chip set, 8.5 W@8 MHz, heavily customized, distributed organization, 44 Mbit external DRAM.
The two differ widely in their data transfer and storage organization. As a result, the flexibility and ease of design of the shared bus system is traded off against a higher power budget. A relatively general model (template) abstracting mainly the data transfer and storage architecture for such HW solutions is illustrated in FIG. 1. The main memory architecture is separate from the processing capacity (data processors DP), which includes a custom memory management unit (MMU) and some local buffers to control the flow of data between the main processing capacity and the memory. The pros and cons of this approach are:
1. -: design is difficult (MMU, control, interfaces); design exploration support is currently very limited at system level (only "integration" tackled);
2. --: application-specific, so cannot be changed after the chip is processed;
3. -: power still too large due to rigid memory hierarchy and central bus architecture;
4. +: area acceptable due to logic synthesis and manually written structural VHDL;
5. ++: speed very good except for transfer overhead in data-dominated systems.
Although the power bottle-neck for custom processors can be alleviated to a significant extent, without an area or speed penalty, by a combination of global and aggressive system-level data-flow and loop transformations and a heavily partitioned custom memory organization, the same is not true for the current generation of programmable processor solutions.
Many such architectures have been proposed for video and image processing. Power management and power reduction for these processors are hardly tackled in the literature, but they are recognized as a growing problem in the industry (at least at the "customer" side). Several recent commercial multi-media oriented processors have been marketed or announced: TI-C80 and recently C60, Philips-TriMedia, Chromatic-Mpact, Nvidia NV1, NEC PIP-RAM. Several other super-scalar/VLIW processors have been announced with an extended instruction-set for multi-media applications: Intel (MMX), SGI/MIPS (MDMX), HP (MAX), DEC (MVI), Sun (VIS), AMD (MMX), IBM (Java). Also a few more dedicated domain-specific ASIP processors have been proposed, such as the MIPS MPEG2 engine, which includes a multi-RISC, several memories and a programmable network.
A relatively general abstraction of such processors is illustrated in FIG. 2. The main memory is accessed via data caches, with a separate memory and instruction cache for the program instructions. The pros and cons of this approach are:
1. ++: reasonable design time for the application engineer. Programming is relatively easy for the processor data-paths (arithmetic operations). The transfer and storage organization is, however, almost always left to hardware (cache controllers and MMU) to reduce the mapping complexity for the programmer/compiler.
2. +: full flexibility in applications, though the speed bottle-neck remains an issue in many cases.
3. --: large power due to expensive data transfers (long "distances" and fixed memory hierarchy), many of which are not really necessary. The latter are added, e.g., by the non-optimal run-time cache control, which typically involves overhead in duplicate data transfers from main memory for reused data with longer life-times, or by the speculative execution in many processors as decided in the branch prediction unit.
4. --: most area of the chip/board is taken up by memories and busses. Addressing and control complexity are also high.
5. --: speed still too low for demanding applications, so parallel processors have to be introduced. But parallel processors are very difficult to program efficiently due to data communication.
In order to introduce parallel processing (see FIG. 3), the current approach is almost solely focused on the arithmetic operations: several parallel data processors operate on a shared cache (multi-scalar) or several "full" processors operate on a shared main memory. The only reason to select such a "single virtual address space" option for the main memory is the simplicity of compiling the applications on the parallel architecture. This choice, however, leads to several drawbacks. The power overhead is even larger due to the physical memory organization with large memories and many global connections. Area is also negatively affected because of the complexity of an internal memory organization in which everything is shared. This area penalty arises both from the bus communication overhead needed to solve the performance bottle-neck and from the need to duplicate globally allocated data in the local (cache) memories of the processors. The speed bottle-neck is only partly solved by the introduced parallelism, because obtaining the required data is usually not feasible in parallel with the arithmetic operations due to the limited I/O and memory band-width. As a result, idle cycles (NOPs) are introduced in the data-paths.
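The effect of limited memory band-width on the data-paths can be illustrated with a minimal sketch. It is a simplified model under stated assumptions, not a description of any particular processor: every operation is assumed to take one cycle and to need a fixed number of words from memory, transfers and arithmetic are assumed to overlap perfectly, and the function name and parameters are hypothetical.

```python
import math


def datapath_utilization(num_ops, words_per_op, num_datapaths, words_per_cycle):
    """Return (total_cycles, nop_fraction) for a bandwidth-limited kernel.

    Assumptions (illustrative only): one cycle per operation, a fixed
    number of memory words per operation, and perfect overlap between
    data transfers and arithmetic.
    """
    # Cycles needed if only the arithmetic were the limit.
    compute_cycles = math.ceil(num_ops / num_datapaths)
    # Cycles needed if only the memory band-width were the limit.
    transfer_cycles = math.ceil(num_ops * words_per_op / words_per_cycle)
    # The slower of the two determines the execution time.
    total_cycles = max(compute_cycles, transfer_cycles)
    # Every data-path has one issue slot per cycle; unused slots are NOPs.
    issue_slots = total_cycles * num_datapaths
    nop_fraction = 1.0 - num_ops / issue_slots
    return total_cycles, nop_fraction


if __name__ == "__main__":
    for datapaths in (1, 2, 4, 8):
        cycles, nops = datapath_utilization(
            num_ops=1024, words_per_op=2,
            num_datapaths=datapaths, words_per_cycle=4)
        print(f"{datapaths} data-path(s): {cycles} cycles, "
              f"{nops:.0%} idle slots")
```

In this model, adding data-paths beyond the point where the memory band-width saturates no longer reduces the cycle count; it only raises the fraction of idle slots.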
Much work has been published in the past on cache coherence protocols, also for parallel processors. In the parallel compiler domain however, mostly load balancing and parallelisation issues for the arithmetic operations have been tackled because the general approach relies on the hardware caching mechanism. A few have addressed the data localization issue to obtain a better cache usage. The conventional programmable architectures and conventional (parallel) compilers will lead to major heat dissipation problems.
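Why data localization matters can be illustrated with a small sketch that counts main-memory reads for a sliding-window kernel (a hypothetical stand-in for, e.g., a filter or search window), once with every access going to main memory and once with a local buffer that holds the current window. The kernel, the buffer policy and all names are assumptions chosen for illustration; they are not taken from any of the cited processors or compilers.

```python
def main_memory_reads(num_samples, window, use_local_buffer):
    """Count main-memory reads for a 1-D sliding-window kernel.

    Without a local buffer, every window position fetches all of its
    samples from main memory. With a local buffer of `window` words,
    a sample is fetched only the first time it is needed.
    """
    reads = 0
    buffered = set()  # sample indices currently held in the local buffer
    for start in range(num_samples - window + 1):
        if not use_local_buffer:
            reads += window
            continue
        # Evict samples that have left the window, so the buffer never
        # holds more than `window` words.
        buffered = {i for i in buffered if i >= start}
        for i in range(start, start + window):
            if i not in buffered:
                buffered.add(i)
                reads += 1  # fetched from main memory exactly once
    return reads


if __name__ == "__main__":
    n, w = 64, 8
    print("no local buffer:", main_memory_reads(n, w, False))
    print("local buffer   :", main_memory_reads(n, w, True))
```

For 64 samples and a window of 8, the count drops from 456 reads to 64, since each sample then crosses the long path to main memory only once.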
If the state of the art on custom (HW) and programmable instruction-set (SW) processors for data-dominated multi-media applications is reviewed, it becomes clear that many of the architecture components are intended to solve the data transfer and storage issues. In both HW and SW, the main power (and largely also area) cost is situated in the memory units and the (bus) communication hardware. On the HW side, several system-level memory management related methodologies are being proposed which promise very large savings on power and also on area while still meeting the real-time constraints. Unfortunately, on the SW side these methodologies are not applicable as such, and even when modified, the programmability overhead in the data transfer and storage architecture will still lead to a (large) penalty paid in power consumption compared to the HW solution. The HW and SW solutions thus exhibit different characteristics in terms of power and speed efficiency versus processor architecture and design time impact.
The top-level view of a typical heterogeneous system architecture including both custom hardware (accelerator data path, dedicated processor and glue logic) and programmable processors (DSP and RISC cores, master control) in a target application domain is illustrated in FIG. 4. Architecture experiments have shown that 50-80% of the area cost in (application-specific) architectures for real-time multi-dimensional signal processing is due to memory units, i.e. single or multi-port RAMs, pointer-addressed memories, and register files. The power cost both for custom HW and for SW processors is even more heavily dominated by storage and transfers for complex data types. Hence, the organization of the global data transfer and storage, together with the related algorithmic transformations, forms the dominating factor in the system-level architecture design decisions. For such applications, the data transfer and storage architecture for SW processors, i.e. the available memory units and the (bus) communication hardware between them, together with their use for a given application, is power inefficient.