1. Field of the Invention
The present invention relates generally to video compression, and in particular, to a block matching processor for block matching motion estimation.
2. Description of the Related Art
Motion estimation, which exploits the temporal redundancy of an image sequence, is a crucial step in video compression. Among the diverse motion estimation techniques, the block matching algorithm (BMA) has been adopted in today's popular video coding standards because it combines computational simplicity with favorable performance. For details of the BMA, refer to D. Gall: ‘MPEG: A video compression standard for multimedia algorithm’, Comm. ACM, 1991, 4, pp. 47–58 [reference 1], P. Pirsch, N. Demassieux and W. Gehrke: ‘VLSI architectures for video compression’, Proceedings of the IEEE, 1995, 2, pp. 220–246 [reference 2], and V. Bhaskaran and K. Konstantinides: ‘Image and Video Compression Standards: Algorithms and Architectures’ (Kluwer Academic Publishers, 1990) 1st edn. [reference 3]. However, the enormous computational requirement of block matching has been a bottleneck in realizing a compact video encoding system. Hence, reducing the amount of motion estimation hardware is the primary issue in designing a cost-effective single-chip video encoder.
Regarding prediction performance, the BMA has a few drawbacks, which are mainly caused by employing a fixed block size. The assumption of stationarity within a block and the simplified translational motion model are often violated in real image sequences. These problems can be relieved by using a variable size block matching algorithm (VSBMA). For details of the VSBMA, see M. H. Chan, Y. B. Yu and A. G. Constantinides: ‘Variable size block matching motion compensation with applications to video coding’, IEE Proc., 1990, 8, pp. 205–212 [reference 4] and F. Dufaux and F. Moscheni: ‘Motion estimation techniques for digital TV: A review and a new contribution’, Proceedings of the IEEE, 1995, 6, pp. 858–876 [reference 5].
In the VSBMA, the choice of the block size has long been shown to be a compromise between several factors. The use of smaller blocks results in higher adaptivity, but the correlation among blocks cannot be exploited, which limits the achievable compression ratio. The use of larger blocks better exploits the correlation of the picture as a whole, but the stationary assumption within each block may then be violated, and the quality suffers as a result. To produce the best result, blocks of arbitrary size should be used in motion vector prediction according to the rate-distortion function of the video source. However, considering performance, computational efficiency and the additional bits needed for block size information, only a small number of block sizes is acceptable in practical video coding systems.
In addition, while the earlier video coding standards such as H.261 and MPEG-1 allow a single mode for motion vector prediction with a 16×16 macroblock, MPEG-2, which is prevalent today and has been adopted for digital TV worldwide, offers more choices. For this, see ISO/IEC JTC1/SC29/WG11 and ITU-TS SG 15 EG for ATM video coding: ‘MPEG-2 test model 5’, April 1993 [reference 6]. According to this document, a field prediction mode and special prediction modes are provided in addition to the frame prediction mode. In field pictures, a 16×16 macroblock is decomposed into two 16×8 blocks, where one corresponds to the odd field and the other to the even field. The special prediction modes, namely 16×8 motion compensation and dual-prime mode, also operate on a 16×8 block. Furthermore, the advanced prediction mode of H.263 and MPEG-4 allows motion vectors for 8×8 blocks to be used. See ISO/IEC-JTC1/SC29/WG11 N1908: ‘Coding of moving pictures and audio’, October 1997 [reference 7] and ITU-T Recommendation H.263: ‘Video coding for low bit rate communication’, December 1995 [reference 8].
Previous efforts on block matching processors have mainly focused on architectures with a fixed block size and a single prediction mode. For details, see [reference 2]; L. De Vos and M. Stegherr: ‘Parameterizable VLSI architectures for the full-search block-matching algorithms’, IEEE Trans. on Circuits Syst., 1989, 10, pp. 1309–1316 [reference 9]; S. Chang, J.-H. Hwang and C.-W. Jen: ‘Scalable array architecture design for full search block matching’, IEEE Trans. on CAS for Video Tech., 1995, 10, pp. 332–343 [reference 10]; D. M. Yang, M. T. Sun and L. Wu: ‘A family of VLSI designs for the motion compensation block-matching algorithm’, IEEE Trans. on Circuits Syst., 1989, 10, pp. 1317–1325 [reference 11]; and Y. Jehng, L. Chen and T. Chiueh: ‘An efficient and simple VLSI tree architecture for motion estimation algorithms’, IEEE Trans. on Signal Processing, 1993, 4, pp. 148–157 [reference 12].
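The block shapes named above (16×16 frame macroblock, two 16×8 field blocks, four 8×8 blocks) can be illustrated with a minimal Python sketch. The function names are hypothetical, a macroblock is assumed to be a list of 16 rows of 16 pixel values, and the assignment of even-indexed rows to the "first" field is an assumption made only for illustration.

```python
def split_fields(mb):
    """Split a 16x16 macroblock into two 16x8 field blocks (8 rows of 16 pixels).

    Assumption: rows 0, 2, ..., 14 form one field and rows 1, 3, ..., 15 the other.
    """
    first_field = mb[0::2]
    second_field = mb[1::2]
    return first_field, second_field


def split_8x8(mb):
    """Split a 16x16 macroblock into four 8x8 blocks in raster order."""
    return [[row[c:c + 8] for row in mb[r:r + 8]]
            for r in (0, 8) for c in (0, 8)]


def block_sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(pa - pb)
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))
```

Because the SAD is a plain sum over pixels, the SAD of a full macroblock against a candidate equals the sum of the SADs of its partitions against the correspondingly partitioned candidate, whichever of the two partitionings is used.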
A block matching procedure and its hardware mapping on a conventional architecture are described below. The overall computation flow of full-search block matching can be expressed as

    SADmin = MAXVALUE;
    Vmin = (0, 0);
    for m = −K to K−1
      for n = −L to L−1
        SAD(m, n) = 0;
        for i = 0 to M−1
          for j = 0 to N−1
            SAD(m, n) = SAD(m, n) + |x(i, j) − y(i+m, j+n)|;
          endfor
        endfor
        if SAD(m, n) < SADmin then
          SADmin = SAD(m, n);
          Vmin = (m, n);
        endif
      endfor
    endfor
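The full-search flow above can be sketched as runnable Python. This is a software illustration only, not the hardware described here: the function names, the `top`/`left` block origin, and the rule of skipping candidates that fall outside the reference frame are assumptions added so the example is self-contained. The block is taken to be M rows by N columns.

```python
def sad(cur, ref, top, left, m, n, M, N):
    """SAD between the current block at (top, left) and the reference block
    displaced by (m, n)."""
    total = 0
    for i in range(M):
        for j in range(N):
            total += abs(cur[top + i][left + j] - ref[top + i + m][left + j + n])
    return total


def full_search(cur, ref, top, left, M, N, K, L):
    """Return (Vmin, SADmin) over displacements m in [-K, K-1], n in [-L, L-1]."""
    sad_min = float("inf")  # plays the role of MAXVALUE
    v_min = (0, 0)
    rows, cols = len(ref), len(ref[0])
    for m in range(-K, K):
        for n in range(-L, L):
            # Assumption: candidates reaching outside the reference frame are skipped.
            if top + m < 0 or left + n < 0:
                continue
            if top + m + M > rows or left + n + N > cols:
                continue
            s = sad(cur, ref, top, left, m, n, M, N)
            if s < sad_min:
                sad_min, v_min = s, (m, n)
    return v_min, sad_min
```

With the strict comparison, the first candidate in scan order wins when several displacements give the same SAD, matching the `if ... < SADmin` test in the flow above.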
The widely accepted block distortion measure is the sum of absolute differences (SAD). The operations involved in computing SAD(m, n) and SADmin are associative, and thus the order in which the index spaces (i, j) and (m, n) are explored is arbitrary. The block matching computation is massively repetitive and thus well suited to realization in a systolic array processor. See [reference 2], [reference 10] to [reference 12], and S. Y. Kung: ‘VLSI array processors’ (Englewood Cliffs, N.J.: Prentice Hall, 1988) 1st edn. [reference 13]. Block matching with a systolic array can be expressed as follows. First, in the overall computation flow of full-search block matching, the i and j loops are parallelized and mapped onto hardware. All absolute difference values belonging to one distortion measure are calculated concurrently in M×N processing elements (PEs).
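The mapping of the i and j loops onto PEs can be illustrated with a small sequential model in Python. It is only a sketch under stated assumptions: each PE adds its own absolute difference to a partial sum handed on by its neighbour, rows are chained independently, and the data skewing and clocking of a real systolic array are not modelled. All names are hypothetical.

```python
def pe(x, y, partial):
    """One processing element: add |x - y| to the incoming partial sum."""
    return partial + abs(x - y)


def sad_pe_array(x_block, y_block):
    """Accumulate one SAD by passing partial sums along each row of PEs,
    then adding the row results."""
    row_sums = []
    for row_x, row_y in zip(x_block, y_block):
        partial = 0
        for x, y in zip(row_x, row_y):
            partial = pe(x, y, partial)
        row_sums.append(partial)
    return sum(row_sums)


def sad_column_major(x_block, y_block):
    """The same SAD accumulated column by column, to show that the
    order of accumulation does not change the result."""
    total = 0
    for j in range(len(x_block[0])):
        for i in range(len(x_block)):
            total += abs(x_block[i][j] - y_block[i][j])
    return total
```

Both functions return the same value for any pair of equally sized blocks, which is the property that lets the hardware evaluate the M×N differences concurrently and combine them in whatever order the array wiring dictates.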
The arrangement of the PEs and the computation flow in the systolic array are illustrated in FIG. 1. FIG. 2 shows an example of a conventional two-dimensional systolic array for block matching and the internal structure of the PE (see [reference 2] and [reference 5]). By the conventional architecture, we mean the typical two-dimensional systolic array architecture shown in [reference 2] and [reference 9], which has been the base architecture for systolic array block matching processors. The PE computes differences between pixels in the current frame X and the previous frame Y and collectively accumulates them to produce the block distortion SAD(m, n) for each matched block whose displacement vector is (m, n). The PE is symbolically denoted as AD, as shown in FIG. 2, and can be decomposed into two sub-PEs, i.e., A and D, as shown in FIGS. 3B and 3C, following the operation shown in FIG. 3A. In FIG. 2, an operator M stands for the comparator shown in FIG. 3D, which keeps the minimum block distortion.
FIG. 4 is a detailed block diagram of the PE. One PE 100 includes a difference part 102 with an inverter 108 and an adder 110, an absolute part 104 with an inverter 112 and an exclusive-OR gate 114, and an accumulation part 106 with an adder 116 and a register 118. The accumulation part 106 corresponds to the operator A shown in FIG. 3B, and the difference part 102 and the absolute part 104 correspond to the operator D shown in FIG. 3C. In FIG. 4, the pixel data is 8 bits wide. Reference data representing the pixels of a reference frame and current data representing the pixels of a current frame correspond to X and Y, respectively, in FIG. 3A, and an intermediate result received from a previous PE corresponds to a in FIG. 3A.
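The three parts of the PE can be modelled at bit level in Python. The text names the components but does not spell out how they are wired, so the wiring below is an assumption based on the usual two's-complement technique: subtraction as x + NOT(y) + 1, conditional inversion by XOR with the sign, and the remaining +1 supplied as a carry-in to the accumulating adder. The function names are hypothetical, and the accumulator width is left unbounded.

```python
MASK8 = 0xFF  # pixel data is 8 bits wide


def difference_part(x, y):
    """Compute x - y as x + NOT(y) + 1; return (low 8 bits, sign).

    Assumption: the sign is the inverted carry-out of the 8-bit addition.
    """
    total = x + (~y & MASK8) + 1
    carry = total >> 8
    sign = carry ^ 1  # no carry-out means x < y
    return total & MASK8, sign


def absolute_part(diff8, sign):
    """Invert every bit of the difference when it is negative (XOR with sign)."""
    return diff8 ^ (MASK8 if sign else 0)


def accumulation_part(acc, mag8, sign):
    """Add the magnitude to the running sum; the sign acts as carry-in,
    completing the two's-complement negation of a negative difference."""
    return acc + mag8 + sign


def pe_step(x, y, acc):
    """One PE operation: acc + |x - y| for 8-bit x and y."""
    diff8, sign = difference_part(x, y)
    mag8 = absolute_part(diff8, sign)
    return accumulation_part(acc, mag8, sign)
```

Deferring the +1 to the accumulator's carry-in avoids a separate incrementer in the absolute part; whether the actual circuit of FIG. 4 does this is not stated in the text.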
As stated above, the PE AD in the systolic mesh is decomposed into the individual elements A and D, so that both can operate simultaneously to speed up the computation (see [reference 9] and [reference 12]).
To deal with blocks of various sizes in the different motion vector modes, however, additional special-purpose hardware is needed, which imposes extra area and control overhead.