There is an increasing demand for microprocessor architectures adapted to the requirements of various multimedia processing tasks and algorithms. The quest for higher performance, however, must be balanced against the need to limit power consumption and code size growth.
Vectorial and/or SIMD (Single Instruction, Multiple Data) architectures are thus used in applications with massive data parallelism, while VLIW (Very Long Instruction Word) architectures are optimal for applications with high instruction parallelism.
The multi-dimensional microprocessor described in U.S. published patent application no. 2005/0283587 is exemplary of a microprocessor with SIMD/vectorial capabilities based on a VLIW machine. As mentioned in that application, an example of an architecture for digital media processing was introduced by Intel with their MXP5800/MXP5400 processor architecture. A multi-dimensional microprocessor architecture improves significantly over this more conventional architecture. For instance, in the MXP5800/MXP5400 architecture, processors require an external PC-based host processor for downloading microcode, register configuration, register initialization, and interrupt servicing. Conversely, in a multi-dimensional microprocessor architecture these tasks are allotted to one computational unit for each column.
Moreover, compared with a multi-dimensional microprocessor, the basic computational block in the MXP5800/MXP5400 processors is inevitably more complex. It includes five programming elements, each of which has its own registers and its own instruction memory. This entails significant area and large power consumption, particularly because no power management unit is used to power down inactive Processing Elements (PEs).
One of the key problems to address in these architectures, in order to take advantage of data parallelism, is the proper handling of access to the data. Optimizing access is a difficult task, yet a processor having high computational power requires optimized access to the data cache. Generally, this problem is addressed by one of two approaches: providing a single data cache shared by all clusters (i.e., a Shared Memory or SM), with an address space that is likewise shared; or equipping each cluster with a dedicated cache (i.e., a Distributed Memory or DM).
If each individual cluster is equipped with a cache of its own (DM), and is correspondingly allowed to address the data locally, the efficiency of access to the data is maximized. Each cluster accesses the data in its cache without interfering with any other accesses. Compilation of the computational section is, at least notionally, simplified, but controlling the program flow becomes more complex for the programmer, and problems arise in terms of cache coherence. For this reason, a much more complex memory architecture may be required at a higher level. There is only a single instruction flow: if one cluster accesses certain data, all the other clusters do the same.
Moreover, the DM approach is not optimal from the viewpoint of properly exploiting the cache memory. The clusters will not all be simultaneously active, and in those parts of the program where, e.g., a single cluster is active, a major portion of the memory will be unavailable. Another disadvantage is that the presence of separate caches makes it necessary to duplicate a large amount of data (constants, tables, etc.). The main processor may need to write or read data in the memory space reserved for other clusters. Additionally, an ad hoc data exchange mechanism will be required for initialization purposes or for communication between the clusters.
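The capacity argument above can be made concrete with a toy calculation. The following is only an illustrative sketch: the function names, the cluster count, the per-cluster cache size, and the size of the duplicated tables are all assumed values chosen for the example, not figures taken from any of the architectures cited here.

```python
# Toy model of usable cache capacity. All names and sizes are hypothetical
# illustration values, not parameters of any real architecture.

def dm_usable_capacity(num_clusters, cache_per_cluster, shared_data, active_clusters):
    """Return (usable_bytes, total_bytes) for a Distributed Memory (DM) layout.

    Every cluster cache holds its own copy of shared_data (constants, tables),
    and only the caches of the active clusters can hold working data.
    """
    if shared_data > cache_per_cluster:
        raise ValueError("shared data does not fit in one cluster cache")
    if not 0 <= active_clusters <= num_clusters:
        raise ValueError("active_clusters must be between 0 and num_clusters")
    total = num_clusters * cache_per_cluster
    usable = active_clusters * (cache_per_cluster - shared_data)
    return usable, total


def sm_usable_capacity(num_clusters, cache_per_cluster, shared_data):
    """Return (usable_bytes, total_bytes) when the same total capacity is
    organized as one Shared Memory (SM) cache: shared_data is stored once
    and the remainder is available to whichever clusters are active."""
    total = num_clusters * cache_per_cluster
    if shared_data > total:
        raise ValueError("shared data does not fit in the shared cache")
    return total - shared_data, total


# Assumed example: 4 clusters, 8 KiB per cluster, 2 KiB of tables,
# and a program phase in which only one cluster is active.
dm_usable, total = dm_usable_capacity(4, 8192, 2048, active_clusters=1)
sm_usable, _ = sm_usable_capacity(4, 8192, 2048)
print(f"DM: {dm_usable} of {total} bytes usable")
print(f"SM: {sm_usable} of {total} bytes usable")
```

Under these assumed numbers, the single active cluster in the DM layout can use only a small fraction of the total cache capacity, whereas the shared layout stores the tables once and leaves the rest available. The model deliberately ignores associativity, line size, and replacement policy.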
Furthermore, additional non-negligible traffic and a fairly complex cache architecture will be required to ensure the coherence of the data across a plurality of caches. Conversely, if a single centralized cache is adopted (SM), each cluster needs to be able to access its data via a single data path, which will inevitably become a system bottleneck. Moreover, while a single address space enables the programmer to see the data accessed by each individual cluster, thus permitting better control of the program flow, it also necessitates explicit access to each individual data item.
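The bottleneck of a single data path can likewise be illustrated with a rough cycle count. This is a minimal sketch under strong simplifying assumptions: every access is a cache hit taking one cycle, each port serves one access per cycle, and coherence traffic is ignored. The function names and the numbers used are hypothetical.

```python
# Rough cycle-count model for data accesses. Hypothetical names and values;
# assumes one-cycle hits and one access per port per cycle.

def cycles_shared(num_clusters, accesses_per_cluster, ports=1):
    """Cycles needed when all clusters share one cache (SM) with the given
    number of ports: accesses from all clusters are serialized over the ports."""
    if ports < 1:
        raise ValueError("ports must be at least 1")
    total_accesses = num_clusters * accesses_per_cluster
    return -(-total_accesses // ports)  # ceiling division


def cycles_distributed(num_clusters, accesses_per_cluster):
    """Cycles needed when each cluster has its own cache (DM): the clusters
    proceed in parallel, so the count does not depend on num_clusters."""
    return accesses_per_cluster


# Assumed example: 4 clusters, each performing 100 data accesses.
print(cycles_shared(4, 100))       # single data path
print(cycles_distributed(4, 100))  # one private cache per cluster
```

In this simplified model the shared cache with a single data path takes as many times longer as there are clusters, which is the scaling behavior that makes the single path a bottleneck. A real design would also have to account for misses and, in the distributed case, for the coherence traffic mentioned above.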