The growing complexity of multi-media applications continually increases the need for greater computational performance. In this regard, general purpose central processing units (CPUs) and digital signal processors (DSPs) have been developed that achieve parallel processing by means of media-accelerators which exploit data-level and task-level parallelism.
However, such media-accelerators are merely enhancements intended to cope with the limitations of traditional CPU architectures in achieving high performance. Consequently, these solutions result in high power dissipation per unit of operation. A potentially more successful approach exploits the full data-parallelism available in order to arrive at a power-efficient architecture. One such architecture is Xetal (for example, see “Smart Cameras: Architectural Challenges”, Proceedings of ACIVS 2002, Ghent, Belgium) which is based on the single-instruction multiple-data (SIMD) processing paradigm. This paradigm preserves the locality of data due to the massive parallelism and allows sharing of resources such as instruction and address decoders, both of which are important for reducing power consumption.
FIG. 1 is a block diagram illustrating a SIMD architecture. The architecture 1 includes a processing element array 10, which comprises a plurality of processing elements PE-0 . . . PE-N. The processing elements PE-0 to PE-N receive data from an input line memory 12, which itself receives data 3 via an input pre-processing unit 40. The SIMD architecture 1 also includes a working memory array 14, which is operably divided into memory portions. Each memory portion is associated with a particular one of the processing elements in the processing element array 10. The processing elements in the array 10 are able to transfer data to and from the working memory array 14, in order to process that data in accordance with instructions received by the processing elements. An output line memory unit 16 is provided for outputting data, via an output post-processing unit 50.
The array 10 is controlled by a global control processor 20 which operates in accordance with the program stored in program memory 30. The control processor 20 operates to supply instructions to the processing element array in accordance with the retrieved program.
The input line memory unit 12 provides serial-to-parallel conversion of incoming data, whilst the output line memory unit 16 provides parallel-to-serial conversion of outgoing data. In video processing applications, the output path can be provided with a serial processor (50) to extract statistical information from a predefined region of interest in a video frame. This information can be used for adaptive video processing such as auto-white balance and exposure-time control.
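The data flow through the architecture of FIG. 1 can be illustrated with a minimal Python sketch. It is a behavioural illustration only, not the Xetal implementation: the function names, the use of a Python callable as the broadcast instruction, and the dropping of an incomplete final line are all assumptions made for the example.

```python
from typing import Callable, Iterable, Iterator, List


def serial_to_parallel(stream: Iterable[int], num_pes: int) -> Iterator[List[int]]:
    """Model of the input line memory: gather num_pes serial samples into one line."""
    line: List[int] = []
    for sample in stream:
        line.append(sample)
        if len(line) == num_pes:
            yield line
            line = []
    # Assumption: an incomplete trailing line is simply dropped in this sketch.


def simd_step(instruction: Callable[[int], int], line: List[int]) -> List[int]:
    """Model of the PE array: every PE applies the same instruction to its own datum."""
    return [instruction(value) for value in line]


def parallel_to_serial(lines: Iterable[List[int]]) -> Iterator[int]:
    """Model of the output line memory: emit each processed line sample by sample."""
    for line in lines:
        yield from line


def run(stream: Iterable[int], num_pes: int, instruction: Callable[[int], int]) -> List[int]:
    """Chain the three stages; the control processor is reduced to one instruction."""
    processed = (simd_step(instruction, line)
                 for line in serial_to_parallel(stream, num_pes))
    return list(parallel_to_serial(processed))
```

For example, `run(range(8), 4, lambda x: 2 * x)` processes two lines of four samples each, with all four hypothetical PEs executing the same doubling instruction on each line.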
An important issue in SIMD architectures (and similar parallel processing machines) is the degree of inter-communication between the processing elements. The greater the number of communication channels, the more efficient the execution of certain signal processing algorithms. Algorithms such as filtering involve basic convolution operations over a range of neighbouring data elements and benefit from a processor-to-processor communication channel.
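As an illustration of why filtering benefits from communication between processing elements, the following Python sketch computes a 3-tap filter in which each PE combines its own data element with those of its left and right neighbours. The function name, the coefficient layout, and the clamping of accesses at the ends of the line are assumptions made for the example.

```python
from typing import List, Sequence


def filter_3tap(line: Sequence[int], coeffs: Sequence[int]) -> List[int]:
    """Each PE reads its left neighbour, its own element, and its right neighbour."""
    n = len(line)
    out: List[int] = []
    for pe in range(n):
        # Assumption: PEs at the ends of the array reuse their own element
        # in place of the missing neighbour (edge clamping).
        left = line[max(pe - 1, 0)]
        centre = line[pe]
        right = line[min(pe + 1, n - 1)]
        out.append(coeffs[0] * left + coeffs[1] * centre + coeffs[2] * right)
    return out
```

Here every output value depends on data held by two other PEs, so without a direct neighbour channel each PE would have to obtain those elements by some slower route.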
Assuming an interconnection level of N data elements per PE (for most image processing kernels N≧3), the PE requires N communication channels in order to have access to all N data elements with minimal latency. An N-to-1 switch (multiplexor) would then be needed to connect one of the N channels to the PE input. FIG. 2 shows logical communication paths of a PE accessing data from six neighbouring data points. It will be readily appreciated that this leads to a very complex network of interconnections between PEs and memory.
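The growth of the interconnection network can be made concrete with a small Python sketch. It models the N-to-1 multiplexor as a simple selection and counts channels under the simplifying assumption that every PE, including those at the ends of the array, has a full set of N channels. The function names and the example array size are hypothetical.

```python
from typing import Dict, Sequence


def select_input(channels: Sequence[int], select: int) -> int:
    """N-to-1 multiplexor: route exactly one of the N channels to the PE input."""
    if not 0 <= select < len(channels):
        raise ValueError("select must address one of the N channels")
    return channels[select]


def interconnect_cost(num_pes: int, n: int) -> Dict[str, int]:
    """Count channels and multiplexors for num_pes PEs with N channels each.

    Assumption: boundary PEs are not treated specially, so the channel
    count is an upper bound for a linear array.
    """
    return {
        "channels": num_pes * n,   # one channel per PE per accessible element
        "multiplexors": num_pes,   # one N-to-1 switch per PE
        "inputs_per_multiplexor": n,
    }
```

For a hypothetical array of 320 PEs with N = 6, as in the six-neighbour case of FIG. 2, `interconnect_cost(320, 6)` gives 1920 channels and 320 six-input multiplexors, which indicates how quickly the wiring grows with both the array size and N.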
Indeed, the greater the degree of communication and the higher the number of processing elements in the design, the more complex the physical design becomes, in terms of the design time needed to find an optimal interconnect topology with respect to silicon area and performance. Reducing the complexity of the interconnection network is therefore an important issue in SIMD architectures. Failure to address the issue successfully usually prevents massively parallel systems from being effective.
Accordingly, there is a need for a methodology that enables PE-to-PE communication and PE-to-memory communication in a manner that is cost-effective and practicable.