One application of video data compression is in a video teleconferencing system, and a simple block diagram of such a system is shown in FIG. 1. A video teleconferencing system allows two or more remote parties to both see and talk to each other. Each party has: a video camera 1; a video encoder 2; a video decoder 3; a video display 4; a microphone 5; an audio encoder 6; an audio decoder 7; and an audio speaker 8. Encoded video and audio are transmitted across a data channel 9.
Most video data compression applications, of which video teleconferencing is an example, have either or both a video encoder 2 and a video decoder 3. Example schemes for a video encoder 2 and a video decoder 3 are shown in FIG. 2 and FIG. 3, respectively.
Efficient video compression involves the combination of transform coding and predictive coding. This type of compression scheme is sometimes called full motion compression and has been incorporated into, for example, the Consultative Committee for International Telegraphy and Telephony (CCITT) H.261 standard. In this scheme, a video encoder 2 predicts 19 the current picture frame using motion vectors 21 and the previous picture frame stored in the frame memory 23, then computes the error 25 of the predicted picture frame 19 when compared to the current picture frame 27. Motion vectors 21 are generated in the motion vector estimator 28, which compares the current frame 27 with the previous frame (stored in frame memory 23) and finds a "best" match, generally on a block-by-block basis. The encoder then codes the prediction error using a Discrete Cosine Transform (DCT) based transform 29, quantizes the data in a quantizer (Q) 31, and sends it along with the motion vectors 21 over a transmission channel 9 after sending them through a variable length coder (VLC) 33. At the receiving end, the decoder 3 uses the motion vectors 21 and the previous picture frame stored in the frame memory 35 to predict the current picture frame 37 and then adds the prediction error from the inverse DCT (IDCT) 39 to the predicted frame before displaying it 41.
The above description of a video encoder 2 and a video decoder 3 is very simple and generalized, and is included here to show the critical role played by motion vectors in video compression applications. The process of motion vector estimation is an essential and computationally expensive part of inter-frame predictive coding, which is needed to achieve high compression ratios. The present patent presents a system and method for cross correlation, which can be used in the motion vector estimator 28 and, therefore, be included in a video encoder 2.
A picture frame is made of many blocks of pixels, commonly called macro blocks. For example, in the Common Interchange Format (CIF), the picture frame has 22×18 macro blocks, where each macro block consists of 16×16 pixels. In inter-frame encoding mode, a motion vector for each macro block must be estimated and sent along with the encoded prediction error for the macro block. Motion vector estimation is done by comparing a macro block in the present frame to a search block in the previous frame and finding a "best" match. The motion vector is estimated to be the offset between the position of the macro block in the present frame and the location of the "best" matching area within the search block. Typically, the search block is restricted to 32×32 pixels and is centered at the same location within the picture frame as the macro block. The "best" match is subject to the cost function used for evaluating the various choices, and a variety of cost functions have been discussed in the literature. In the book "Discrete Cosine Transform Algorithms, Advantages, Applications", by K. R. Rao and P. Yip, published by Academic Press, Inc., San Diego, Calif., 1990, cost functions are described on pages 242 to 247. A commonly used cost function is the sum of absolute differences, which is mathematically expressed as Equation 1. ##EQU1##
In Equation 1, x(i,l) with i,l=0, . . . , 15 is the macro block in the current picture frame, and y(i,l) with i,l=0, . . . , 31 is the search block in the previous frame. For each motion vector estimate, a total of 17×17=289 two-dimensional summations are computed, and the smallest value determines the "best" motion vector estimate. This algorithm is sometimes referred to as the mean absolute error (MAE) algorithm. The MAE algorithm has been implemented in an integrated circuit, the L64720 Video Motion Estimation Processor (MEP) from LSI Logic Corporation, Milpitas, Calif. The computation for the MAE is relatively simple, and the algorithm is effective in providing motion vector estimates for video transmission where a high data rate, and hence a low compression ratio, is allowed. For example, in video conferencing using transmission lines with 128 kbits per second or more, this is quite adequate. However, for lower data rate transmissions, which require higher compression ratios, a more accurate motion vector estimation process is required, since it provides better prediction and hence a reduction of the error which is transmitted or stored in inter-frame coding.
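The exhaustive search described above can be sketched as follows. Equation 1 is not reproduced in this text, so the sketch assumes the plain (unnormalized) sum of absolute differences over the macro block, evaluated at each of the 17×17 candidate offsets; the function names sad_search and motion_vector are hypothetical and serve only as illustration.

```python
def sad_search(macro, search):
    """Exhaustive block matching with the sum of absolute differences.

    macro  : N x N list of lists (macro block from the current frame)
    search : S x S list of lists (search block from the previous frame)
    Returns ((m, n), cost), where (m, n) is the top-left position of the
    best-matching area inside the search block.
    """
    n_blk = len(macro)
    n_off = len(search) - n_blk + 1  # 32 - 16 + 1 = 17 offsets per axis
    best_offset, best_cost = None, None
    for m in range(n_off):
        for n in range(n_off):
            # Assumed form of the cost: sum over i,l of |x(i,l) - y(i+m, l+n)|
            cost = sum(
                abs(macro[i][l] - search[i + m][l + n])
                for i in range(n_blk)
                for l in range(n_blk)
            )
            # The smallest value determines the "best" match.
            if best_cost is None or cost < best_cost:
                best_offset, best_cost = (m, n), cost
    return best_offset, best_cost


def motion_vector(offset, n_blk=16, n_search=32):
    """Express an offset relative to the centered macro block position."""
    center = (n_search - n_blk) // 2  # 8 for a centered 32x32 search block
    return (offset[0] - center, offset[1] - center)
```

With a 16×16 macro block and a 32×32 search block, the two nested loops evaluate the 289 two-dimensional summations mentioned above, each over 256 pixel pairs.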
An alternative, but better, cost function is the cross correlation function (CCF). The CCF is a more powerful cost function for determining data block matches; however, it has not been implemented in any real time systems due to its computational complexity. The CCF is given by Equation 2. ##EQU2##
Again, a total of 289 values must be computed, and the largest one determines the "best" motion vector estimate. Clearly, the computation of the CCF involves multiplication and is, therefore, more computationally expensive than the MAE, which needs only addition and absolute value. The CCF, as defined above in Equation 2, also requires a square root, which can be avoided by squaring the entire expression. The squared version of the CCF, called CCF2, is shown in Equation 3. ##EQU3##
The CCF or CCF2 approach can estimate the match more precisely, and hence, a more accurate motion vector can be estimated. This is crucial if a higher data compression ratio is to be achieved. For example, it is especially important to achieve a higher compression ratio when a standard analog telephone line, which, at present, supports up to only 19.2 kilobits per second, is used to transmit video signals.
An additional variation of the CCF, which can be used for motion vector estimation, is the cross covariance, CCV, as shown in Equation 4. The CCV is simply the numerator of the CCF, thus eliminating the need to compute the denominator and its square roots. Similarly, the largest value of CCV determines the "best" match. ##EQU4##
Another approach to motion vector estimation is the normalized mean square error, NMSE, which is shown in Equation 5, and the smallest value determines the "best" match. ##EQU5##
The NMSE can be simplified by eliminating its denominator to form the mean square error approach, MSE, shown as Equation 6, and the smallest value determines the "best" match. ##EQU6##
The present invention describes a method and apparatus that efficiently compute the computationally expensive summation common to Equation 2 through Equation 6, which is shown in Equation 7, and hereinafter called simply the "cross correlation". Cross correlation can itself be used as a cost function for motion vector estimation. ##EQU7##
Direct calculation of the cross correlation requires 256 multiplies and 255 adds for each resulting value. The innovative method presented here converts the two blocks, both the macro block and the search block, to their frequency domain representation via the Discrete Fourier Transform (DFT) and then multiplies the two blocks together to form a product matrix in the frequency domain. The frequency domain result is then converted back to the spatial domain, and the 289 cross correlation values are simultaneously available to be searched for the maximum value. The present patent describes a complete method for this computation and includes several particularly innovative "tricks" which make the method attractive for implementation in hardware.
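The frequency domain approach can be illustrated with a minimal software sketch. It is not the hardware method of this patent; it assumes the textbook correlation theorem (the macro block is zero-padded to the search block size, and the complex conjugate of its transform is multiplied element by element with the transform of the search block), and it assumes the cross correlation has the form c(m,n) = sum over i,l of x(i,l)·y(i+m,l+n), since Equation 7 is not reproduced in this text. All function names are hypothetical.

```python
import cmath


def fft(a, inverse=False):
    """Unscaled recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fft2(mat, inverse=False):
    """2-D transform: 1-D FFT of every row, then of every column."""
    rows = [fft(list(r), inverse) for r in mat]
    cols = [fft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def cross_correlation_fft(macro, search):
    """All (S-N+1)^2 cross correlation values via the frequency domain."""
    n_blk, n_s = len(macro), len(search)
    # Zero-pad the macro block to the size of the search block.
    padded = [
        [macro[i][l] if i < n_blk and l < n_blk else 0 for l in range(n_s)]
        for i in range(n_s)
    ]
    X = fft2(padded)
    Y = fft2(search)
    # Element-by-element product; the conjugate turns convolution into correlation.
    prod = [[X[u][v].conjugate() * Y[u][v] for v in range(n_s)] for u in range(n_s)]
    c = fft2(prod, inverse=True)
    scale = n_s * n_s  # the inverse transform above is unscaled
    n_off = n_s - n_blk + 1  # offsets 0..16 never wrap around
    return [[c[m][n].real / scale for n in range(n_off)] for m in range(n_off)]


def cross_correlation_direct(macro, search):
    """Reference: N*N multiplies per output value."""
    n_blk = len(macro)
    n_off = len(search) - n_blk + 1
    return [
        [
            sum(macro[i][l] * search[i + m][l + n]
                for i in range(n_blk) for l in range(n_blk))
            for n in range(n_off)
        ]
        for m in range(n_off)
    ]
```

Because the largest shifted index is 15+16=31, the circular correlation produced by the 32×32 transforms coincides with the desired linear correlation over the 17×17 offsets, so no additional padding is needed in this sketch.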
In the detailed description of the invention, the method is illustrated graphically and each stage is precisely described. Since various block sizes can be chosen and also since cross correlation for block matching has applications in many areas, such as pattern recognition, signature analysis, and feature extraction, the dimensions in the figures and description are kept in general terms.
In this Digital Signal Processing (DSP) application, namely cross correlation computation, there is a need to perform numerical data processing at a very high rate. In many DSP architectures, high data rate processing is achieved through the use of multiple Arithmetic Units (AU) such as combinations of adders, multipliers, accumulators, dividers, shifters, and multiplexors. However, there are two major difficulties with using multiple AUs in parallel: first, many control signals are needed to control multiple AUs; and, second, it is difficult to get the correct data words to each AU input on every clock cycle.
Some DSP architectures are Single Instruction Multiple Data (SIMD) architectures. A SIMD architecture, as defined here, groups its AUs into identical Processors, and these Processors perform the same operation on every clock cycle, except on different data. Hence, a Processor within a SIMD architecture can have multiple AUs, with each AU operating in bit parallel fashion (some definitions of SIMD architectures have AUs operating in bit serial fashion, which is not the case here). The SIMD architecture is applicable to video image compression because images are split into identically sized blocks which can be processed in parallel. If there are four Processors, then four blocks within the image can be processed in parallel, and the Processors are controlled in parallel with the same instruction stream.
Many DSP applications, especially real-time applications such as video compression, perform the same set of operations over and over. In video compression, successive images are compressed continuously with the same set of operations. Within the complex function of video compression, there are simpler functions such as convolution, Fourier transform, and correlation, to name only a few. These simpler functions can be viewed as subroutines of the complex function of video compression. These simple functions can be broken down further into elementary subroutines; for example, the 2-dimensional Discrete Fourier Transform (DFT) of a 32×32 data matrix can be implemented with 64 calls to an elementary subroutine which performs a 32-point DFT. The cross correlation method of the present patent includes a SIMD architecture which efficiently performs a wide variety of elementary subroutines, including those needed for cross correlation computation.
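The decomposition of a 2-dimensional DFT into elementary 1-dimensional DFTs can be sketched as follows. The sketch assumes the standard row-column decomposition (one call per row, then one call per column) and uses a direct, unoptimized N-point DFT as the elementary subroutine; the names dft_1d and dft_2d are hypothetical.

```python
import cmath


def dft_1d(vec):
    """Elementary subroutine: direct N-point DFT of one row or column."""
    n = len(vec)
    return [
        sum(vec[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def dft_2d(mat):
    """2-D DFT of a square matrix by row-column decomposition.

    Returns (result, n_calls), where n_calls counts the calls made to
    the elementary 1-D subroutine.
    """
    n_calls = 0
    rows = []
    for r in mat:  # one 1-D DFT per row
        rows.append(dft_1d(list(r)))
        n_calls += 1
    cols = []
    for c in zip(*rows):  # one 1-D DFT per column of the row results
        cols.append(dft_1d(list(c)))
        n_calls += 1
    # cols is indexed [column][row]; transpose back to [row][column].
    result = [list(r) for r in zip(*cols)]
    return result, n_calls
```

For a 32×32 data matrix, the row pass makes 32 calls and the column pass makes another 32, giving the 64 calls to a 32-point DFT stated above.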