The MPEG-2 video standard is an important video compression standard today. The MPEG-2 coding/decoding algorithm is found in applications such as digital video broadcast receivers, DVD players, cable TV, graphics/image processing cards, set top boxes, digital cameras, SDTV and HDTV. Because the MPEG-2 video standard defines different profiles and levels, every application has a specific ratio between performance and quality.
The growing demand for high-quality video has driven advances in MPEG encoding and decoding technology. The MPEG video standard (ISO/IEC 13818-2:1995, ITU-T Recommendation H.262) defines a process in which the bitstream is partitioned during encoding or decoding. The bitstream is partitioned between the channels such that the more critical parts of the bitstream (such as headers, motion vectors, and low frequency DCT coefficients) are transmitted in the channel with the better error performance, and less critical data (such as higher frequency DCT coefficients) is transmitted in the channel with the poorer error performance.
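The partitioning rule described above can be illustrated with a minimal sketch. It assumes two channels and a stream already parsed into labelled elements; the `Element` type, the `partition` function, and the kind names are hypothetical and are not taken from the standard, which defines the split at the syntax level.

```python
from dataclasses import dataclass

# Hypothetical labels for the "more critical" parts named in the text.
CRITICAL_KINDS = {"header", "motion_vector", "low_freq_dct"}


@dataclass(frozen=True)
class Element:
    kind: str       # illustrative label, not a syntax element of the standard
    payload: bytes


def partition(elements):
    """Split elements into (high_priority, low_priority) lists, keeping order."""
    high, low = [], []
    for element in elements:
        if element.kind in CRITICAL_KINDS:
            high.append(element)   # goes to the channel with better error performance
        else:
            low.append(element)    # goes to the channel with poorer error performance
    return high, low


stream = [
    Element("header", b"\x00"),
    Element("motion_vector", b"\x01"),
    Element("low_freq_dct", b"\x02"),
    Element("high_freq_dct", b"\x03"),
]
high, low = partition(stream)
```

In this sketch only the higher frequency DCT coefficients end up in the low-priority list, mirroring the assignment described in the paragraph above.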
The technological improvements come at the cost of increased complexity for real-time applications. The real challenge is to cope with the cost and time-to-market. A complete hardware implementation best meets the real-time requirement but is worst for time-to-market and development cost, as described in the paper “Development of a VLSI Chip for Real-Time MPEG-2 Video Decoder”, Proceedings of the 1995 International Conference on Image Processing, IEEE, 1995, by Eishi Morimatsu, Kiyoshi Sakai, and Koichi Yamashita.
The reverse is true for a complete implementation in software, as detailed in “Software MPEG-2 Video Decoder on a 200-MHz, Low Power Multimedia Microprocessor”, IEEE, 1998, pp. 3141-3144. This paper presents a low power, 32-bit RISC microprocessor with a 64-bit “single instruction multiple data” multimedia coprocessor, the V830R/AV, and its MPEG-2 video decoding performance. The coprocessor performs four 16-bit multimedia oriented operations every clock cycle, such as multiply-accumulate with symmetric rounding and saturation, and accelerates the computationally intensive procedures of video decoding; an 8×8 IDCT is performed in 201 clock cycles. The processor employs the concurrent Rambus DRAM interface and has facilities for controlling cache behavior explicitly in software to speed up the memory accesses necessary for motion compensation.
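To make the idea of four 16-bit operations per step concrete, the following is an illustrative sketch of a lane-wise multiply-accumulate with saturation. It is not the V830R/AV instruction set: the function names are hypothetical, saturation to the signed 16-bit range is an assumption, and the symmetric rounding mentioned above is omitted for brevity.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate16(value):
    """Clamp an integer to the signed 16-bit range instead of wrapping."""
    return max(INT16_MIN, min(INT16_MAX, value))


def simd_mac4(acc, a, b):
    """Compute acc + a * b independently on four lanes, saturating each lane.

    Hypothetical model of one SIMD step; a hardware unit would process the
    four lanes in parallel rather than in a loop.
    """
    if not (len(acc) == len(a) == len(b) == 4):
        raise ValueError("expected four lanes per operand")
    return [saturate16(c + x * y) for c, x, y in zip(acc, a, b)]


result = simd_mac4([0, 100, 32000, -32000], [2, 3, 200, 200], [5, -4, 10, -10])
# result == [10, 88, 32767, -32768]; the last two lanes saturate
```

Saturation matters in video decoding because a wrapped overflow would turn a slightly too bright sample into a very dark one, whereas clamping keeps the error small.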
The best trade-offs have been obtained with the hardware-software co-design approach, as described in “HW/SW codesign of the MPEG-2 Video Decoder”, Proc. International Parallel and Distributed Processing Symposium (IPDPS'03). In this paper, the VLD and IDCT algorithms are implemented directly in hardware with fast parallel architectures. The hardware part is described in VHDL/Verilog and implemented together with the RISC processor in a single Virtex 1600E FPGA device. The algorithms for decoding the coded bitstream, inverse quantization, and motion compensation run in software on a 32-bit RISC processor implemented in the FPGA, with Linux as the operating system. The partitioning is done at the frame level to achieve better timing properties and lower power consumption.
The software-based implementation has the advantage of being easy to adapt to future enhancements. In all cases, higher processing throughput is needed. The solution is either a very high-speed special purpose processor or a set of lower-speed processors working in parallel. A paper by Mikael Karlsson Rudberg and Lars Wanhammar, “An MPEG-2 Video Decoder DSP Architecture”, Proc. 1996 IEEE Digital Signal Processing Workshop, pp. 199-202, elaborates one such architecture.
Faster processors are expensive in terms of both cost and power consumption, whereas recent developments in System on Chip (SOC) technology make it possible to place multiple simple processors in a single system, making a parallel implementation the more economical solution.
Applications involving computations on large amounts of data (e.g., multimedia codecs) lead to excessive loading of the CPU. Through parallel computation the loading of each machine is reduced as the computational effort is distributed among the machines. This enhances the speed of computation by a factor that is a function of the number of processors used.
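The text does not state how the speed-up depends on the number of processors. One common model, used here purely as an illustration and not drawn from the source, is Amdahl's law, in which only a fraction of the work can be distributed. The function name and parameters below are hypothetical.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Ideal speed-up under Amdahl's law.

    parallel_fraction: share of the work that can be distributed (0.0 to 1.0).
    processors: number of processors sharing the distributable work.
    Communication and synchronization overheads are ignored.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Fully distributable work scales linearly; half-serial work does not.
full = amdahl_speedup(1.0, 4)   # 4.0
half = amdahl_speedup(0.5, 2)   # about 1.33
```

Under this model the achievable factor is bounded by the serial part of the workload, which is why how the decoding work is divided among processors matters as much as their number.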
There are primarily two paradigms of parallel computation: a shared memory system and a distributed memory system. Details of a distributed memory system can be obtained from “Fundamentals of Distributed Memory Computing”, Cornell Theory Center: Virtual Workshop Module.
The shared memory approach requires a multi-port memory to be used as the means of communication between the computing nodes for simultaneous access to operands. This approach makes parallel application development easier. However, implementation of multi-port memories is expensive in the present state of memory technology. On the other hand, the distributed memory approach avoids this complexity in hardware and thereby reduces the silicon cost significantly. However, the task of application development is made more difficult.
U.S. Pat. No. 6,678,331 discloses a circuit that includes a microprocessor, an MPEG decoder for decoding an image sequence, and a memory common to the microprocessor and the decoder. The circuit also includes a circuit for evaluating a decoder delay and a control circuit that grants the decoder memory access priority if the decoder delay is greater than a predetermined level and otherwise grants the microprocessor memory access priority.
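The priority rule summarized above can be sketched as a single decision function. This is only a software illustration of the stated rule, not the patented circuit; the function name, the string return values, and the use of abstract delay units are assumptions.

```python
def grant_memory_priority(decoder_delay, threshold):
    """Return which unit gets memory access priority.

    decoder_delay: evaluated decoder delay (hypothetical abstract units).
    threshold: the predetermined level from the description.
    The decoder wins only when its delay is strictly greater than the level.
    """
    return "decoder" if decoder_delay > threshold else "microprocessor"


late = grant_memory_priority(decoder_delay=12, threshold=10)     # "decoder"
on_time = grant_memory_priority(decoder_delay=10, threshold=10)  # "microprocessor"
```

The strict comparison follows the wording "greater than a predetermined level", so a delay exactly at the level leaves priority with the microprocessor.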