The present invention relates generally to improvements in parallel processing, and more particularly to methods and apparatus for efficient cosine transform implementations on the manifold array (“ManArray”) processing architecture.
Many video processing applications, such as the moving picture experts group (MPEG) decoding and encoding standards, use a discrete cosine transform (DCT) and its inverse, the inverse discrete cosine transform (IDCT), in their compression algorithms. The compression standards are typically complex and specify that a high data rate must be handled. For example, MPEG at Main Profile and Main Level specifies 720×576 picture elements (pels) per frame at 30 frames per second and up to 15 Mbits per second. The MPEG Main Profile at High Level specifies 1920×1152 pels per frame at 60 frames per second and up to 80 Mbits per second. Video processing is a time-constrained application with multiple complex, compute-intensive algorithms such as the two-dimensional (2D) 8×8 IDCT. The consequence is that processors with high clock rates, fixed-function application specific integrated circuits (ASICs), or combinations of fast processors and ASICs are typically used to meet the high processing load. Having efficient 2D 8×8 DCT and IDCT implementations is of great advantage in providing a low-cost solution.
Prior art approaches, such as Pechanek et al. U.S. Pat. No. 5,546,336, used a specialized folded memory array with embedded arithmetic elements to achieve high performance with 16 processing elements (PEs). The folded memory array and large number of processing elements do not map well to a low-cost regular silicon implementation. It will be shown in the present invention that high-performance cosine transforms can be achieved with one quarter of the processing elements, as compared to the 16-PE Mfast design, in a regular array structure without the need for a folded memory array. In addition, the unique instructions, indirect VLIW capability, and use of the ManArray network communication instructions allow a general programmable solution of very high performance.
To this end, the ManArray processor, adapted as further described herein, provides efficient software implementations of the IDCT using the ManArray indirect very long instruction word (iVLIW) architecture and a unique data placement that supports software pipelining between processor elements (PEs) in the 2×2 ManArray processor. For example, a two-dimensional (2D) 8×8 IDCT, used in many video compression algorithms such as MPEG, can be processed in 34 cycles on a 2×2 ManArray processor and meets IEEE Standard 1180-1990 for precision of the IDCT. The 2D 8×8 DCT algorithm, using the same distributed principles covered in the distributed 2D 8×8 IDCT, can be processed in 35 cycles on the same 2×2 ManArray processor. With this level of performance, the clock rate can be much lower than is typically used in MPEG processing chips, thereby lowering overall power usage.
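As a rough illustration of why the per-block cycle count matters, the following Python sketch estimates how many cycles per second the IDCT alone would consume at the frame sizes and rates cited above. It assumes 4:2:0 chroma sampling (six 8×8 blocks per 16×16 macroblock) and that every block of every frame is transformed; neither assumption is stated in the text, and the function names are hypothetical.

```python
# Back-of-envelope IDCT cycle budget. Assumptions (not from the text):
# 4:2:0 sampling, i.e. 4 luma + 2 chroma 8x8 blocks per 16x16 macroblock,
# and every block of every frame passes through the IDCT.

BLOCKS_PER_MACROBLOCK = 6  # assumed 4:2:0 sampling


def blocks_per_second(width, height, fps):
    """Number of 8x8 blocks per second for the given frame size and rate."""
    macroblocks = (width // 16) * (height // 16)
    return macroblocks * BLOCKS_PER_MACROBLOCK * fps


def idct_cycles_per_second(width, height, fps, cycles_per_block=34):
    """Cycles per second spent on the IDCT at a fixed cost per 8x8 block."""
    return blocks_per_second(width, height, fps) * cycles_per_block


if __name__ == "__main__":
    # Main Profile at Main Level figures quoted in the text.
    print(blocks_per_second(720, 576, 30))         # 291600
    print(idct_cycles_per_second(720, 576, 30))    # 9914400
    # Main Profile at High Level figures quoted in the text.
    print(idct_cycles_per_second(1920, 1152, 60))  # 105753600
```

Under these assumptions the 34-cycle IDCT accounts for roughly 10 million cycles per second at the Main Level figures and roughly 106 million at the High Level figures. This covers the IDCT only and says nothing about the rest of the decoding workload.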
An alternative software process for implementing the cosine transforms on the ManArray processor provides a scalable algorithm that can be executed on various arrays, such as a 1×1, a 1×2, a 2×2, a 2×3, a 2×4, and so on, allowing scalable performance. Among its other aspects, this new software process makes use of the scalable characteristics of the ManArray architecture, unique ManArray instructions, and a data placement optimized for the MPEG application. In addition, due to the symmetry of the algorithm, the number of VLIWs is minimized through reuse of VLIWs in the processing of both dimensions of the 2D computation.
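One way to see why the same code can serve both dimensions is the separability of the 2D transform: a 2D 8×8 IDCT can be computed by applying a single 1D 8-point IDCT to every row and then to every column. The sketch below shows that property in plain Python with a naive, unoptimized 1D routine and orthonormal scaling. It is not the ManArray algorithm or its data placement, and all function names are illustrative.

```python
import math

N = 8  # transform size


def idct_1d(coeffs):
    """Naive 8-point inverse DCT with orthonormal scaling."""
    out = []
    for x in range(N):
        total = 0.0
        for u in range(N):
            scale = math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
            angle = (2 * x + 1) * u * math.pi / (2 * N)
            total += scale * coeffs[u] * math.cos(angle)
        out.append(total)
    return out


def transpose(block):
    """Swap rows and columns of a square block."""
    return [list(col) for col in zip(*block)]


def idct_2d(block):
    """2D IDCT as two passes of the same 1D routine (rows, then columns)."""
    row_pass = [idct_1d(row) for row in block]
    col_pass = [idct_1d(col) for col in transpose(row_pass)]
    return transpose(col_pass)


if __name__ == "__main__":
    # A block holding only a DC coefficient decodes to a flat block.
    dc_only = [[0.0] * N for _ in range(N)]
    dc_only[0][0] = 8.0
    pels = idct_2d(dc_only)
    print(all(abs(v - 1.0) < 1e-9 for row in pels for v in row))  # True
```

Because `idct_2d` calls `idct_1d` for both passes, only one 1D kernel has to exist; the second pass differs only in how the data is presented to it.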
The present invention defines a collection of eight hardware ManArray instructions that use the ManArray iVLIW architecture and communications network to efficiently calculate the distributed two-dimensional 8×8 IDCT. In one aspect of the present invention, appropriate data distribution and software pipeline techniques are provided to achieve a 34-cycle distributed two-dimensional 8×8 IDCT on a 2×2 ManArray processor that meets IEEE Standard 1180-1990 for precision of the IDCT. In another aspect of the present invention, appropriate data distribution patterns are used in local processor element memory in conjunction with a scalable algorithm to effectively and efficiently reuse VLIW instructions in the processing of both dimensions of the two-dimensional algorithm.