1. Field of the Invention
This invention relates to the field of processing system architectures suitable for signal and multimedia processing. More specifically, this invention relates to a system having a shared memory architecture that simultaneously provides high memory access bandwidth to multiple processing devices.
2. Description of the Related Art
Digital multimedia systems require a substantial digital signal processing capability. This requirement is shared by many other digital systems including image rendering systems, artificial vision systems, digital communication systems, and speech recognition systems. The typical architecture for such systems is shown in FIG. 1.
FIG. 1 shows a microcontroller bus 102 which couples a microcontroller 104 to a microcontroller memory 106. A digital signal processor (DSP) 108 is similarly coupled to a DSP memory 110 by a DSP bus 112. The two busses are coupled by a bus bridge 114.
This architecture is popular since the microcontroller 104 can assume the responsibility for system-level functions (such as controlling a user interface, initiating and terminating operation of various system modules, and coordinating data transfers), and the DSP 108 can assume the responsibility for computationally intensive tasks (such as various coding and compression algorithms, filtering operations, and data transforms). This division of labor eases system design and programming.
However, this architecture is inadequate for future generations of digital multimedia systems. Processing requirements are increasing as designers take advantage of compression algorithms and higher bandwidths to transmit more information. To keep pace, this architecture requires a more powerful DSP.
A more powerful DSP can be created in two ways. The clock speed can be increased, but this requires careful optimization and redesign of the DSP for every incremental improvement in semiconductor processing technology. Alternatively, the DSP can be provided with wider data paths, e.g. an 8-bit DSP could be replaced with a 32-bit DSP. However, the increases in the required area and power consumption are quadratic (i.e. to double the data path width, the area and power requirements increase by approximately a factor of four). This alternative is undesirable since power consumption is a perennial design constraint, particularly in view of the increasing popularity of portable devices.
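The quadratic relationship described above can be illustrated with a short calculation. The following is only a sketch under the stated approximation (area and power growing with the square of the data path width); the function name `relative_cost` is hypothetical and is not part of any described system.

```python
def relative_cost(old_width_bits: int, new_width_bits: int) -> float:
    """Approximate growth factor in area and power when the data path
    width changes, assuming the quadratic model stated in the text."""
    if old_width_bits <= 0 or new_width_bits <= 0:
        raise ValueError("data path widths must be positive")
    width_ratio = new_width_bits / old_width_bits
    # Quadratic assumption: cost scales with the square of the width ratio.
    return width_ratio ** 2


if __name__ == "__main__":
    print(relative_cost(8, 16))  # doubling the width
    print(relative_cost(8, 32))  # the 8-bit to 32-bit replacement
```

Under this model, doubling the width gives a factor of four, and replacing an 8-bit DSP with a 32-bit DSP gives a factor of roughly sixteen.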
Furthermore, larger data path widths are likely to be a poor "fit" for the data granularity, leading to inefficient use of the more powerful DSPs. For example, MPEG video compression operates on 8-bit blocks of video data. Even if multiple blocks were retrieved at a time, the DSP could only perform (at most) one 8-bit block operation per clock cycle. The rest of the data path width is unused for these operations.
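The mismatch between data path width and data granularity can be quantified as a utilization fraction. The sketch below assumes, as in the example above, that one 8-bit operation is performed per clock cycle; the function name `datapath_utilization` and the 32-bit width used in the usage lines are illustrative assumptions.

```python
def datapath_utilization(operand_bits: int, datapath_bits: int,
                         ops_per_cycle: int = 1) -> float:
    """Fraction of the data path width doing useful work per clock cycle.

    Assumes each operation occupies `operand_bits` of the data path and
    that `ops_per_cycle` such operations complete in one cycle.
    """
    if operand_bits <= 0 or datapath_bits <= 0 or ops_per_cycle <= 0:
        raise ValueError("all arguments must be positive")
    used_bits = operand_bits * ops_per_cycle
    if used_bits > datapath_bits:
        raise ValueError("operations exceed the available data path width")
    return used_bits / datapath_bits


if __name__ == "__main__":
    # One 8-bit operation per cycle on a hypothetical 32-bit data path.
    print(datapath_utilization(8, 32))
    # The same operation on an 8-bit data path uses the full width.
    print(datapath_utilization(8, 8))
```

With one 8-bit operation per cycle on a 32-bit data path, only a quarter of the width is in use, and the remaining three quarters still incur the area and power cost.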
To address these problems, this architecture may be modified by the addition of a dedicated hardware accelerator that is custom-designed to carry out specific algorithms quickly and efficiently. The hardware accelerator may be coupled to the DSP 108 and the DSP memory 110 via the DSP bus 112. The DSP 108 then performs the less demanding tasks of pre-processing and post-processing the data, and leaves to the hardware accelerator the processing steps that the DSP 108 would perform too inefficiently.
If the hardware accelerator includes its own memory buffer, then direct memory transfers may be used to move the data across the DSP bus 112. These transfers consume additional power, which is undesirable, but the alternative is to require that the DSP bus 112 provide a high-bandwidth connection between the hardware accelerator and the DSP memory 110. This alternative presents a memory contention issue that practically requires the DSP 108 to halt whenever the hardware accelerator is operating.
Various work-arounds may be designed, such as additional memories or additional busses, but these may be expected to result in complex, custom-designed interfaces between the hardware accelerator and the DSP. Moreover, many limitations or liabilities of the DSP 108 (such as insufficient address space) may be difficult for the hardware accelerator to overcome.
Accordingly, it is desirable to have an architecture that provides modularity and support for high-bandwidth memory connections for each processor element in the system. Such an architecture would preferably provide a shared memory to minimize power consumption, and yet be scalable so as to support multiple processor elements.