Video encoding systems are known in which an image to be encoded is divided into blocks. These blocks are then encoded and transmitted to a decoding device or stored in a storage medium. To reduce the amount of information to be transmitted, different compression methods have been developed, such as MPEG-2 (Moving Picture Experts Group). In the transmission of video images, image compression can be performed as interframe compression, intraframe compression, or a combination of these. In interframe compression, the aim is to eliminate redundant information in successive image frames. Typically, images contain a large amount of non-varying information, for example a motionless background, or slowly changing information, for example when an object moves slowly. In interframe compression, it is also possible to utilise motion compensation, wherein the aim is to detect larger elements in the image which are moving, and wherein the motion vector and some kind of difference information of such an element are transmitted instead of the pixels representing the whole element. Thus, the direction of the motion and the speed of the element in question are determined in order to establish this motion vector. For compression, the transmitting and the receiving video terminals are required to have such a high processing rate that it is possible to perform compression and decompression in real time.
Typically, image blocks are grouped together to form macro blocks. A macro block usually contains 16 rows by 16 pixels of luminance samples, mode information, and possible motion vectors. The macro block is divided into four 8×8 luminance blocks and two 8×8 chrominance blocks. Scanning (and encoding/decoding) proceeds macro block by macro block, conventionally from the top-left to the bottom-right corner of the frame. Inside one macro block the scanning (and encoding/decoding) order is from the top-left to the bottom-right corner of the macro block.
In MPEG-2 compression, an image is Discrete Cosine Transform (DCT)-coded in blocks so that the block size is 8×8 pixels. The luminance level to be transformed is in full resolution. Both chrominance signals are subsampled, for example a field of 16×16 pixels is subsampled into a field of 8×8 pixels. The differences in the block sizes are primarily due to the fact that the eye does not discern changes in chrominance as well as changes in luminance, wherein a field of 2×2 pixels is encoded with the same chrominance value.
MPEG-2 defines three frame types: an I-frame (Intra), a P-frame (Predicted), and a B-frame (Bi-directional). The I-frame is generated solely on the basis of information contained in the image itself, wherein at the receiving end, this I-frame can be used to form the entire image. The P-frame is formed on the basis of a preceding I-frame or P-frame, wherein at the receiving stage the preceding I-frame or P-frame is correspondingly used together with the received P-frame. In the composition of P-frames, motion compensation, for instance, is used to compress the quantity of information. B-frames are formed on the basis of the preceding I-frame and the following P- or I-frame. Correspondingly, at the receiving stage it is not possible to compose the B-frame until the corresponding I-frame and P- or I-frame have been received. Furthermore, at the transmission stage, the order of these P- and B-frames is usually changed, wherein the P-frame following the B-frame is received first, which accelerates the reconstruction of the image in the receiver.
Of these three image types, the highest efficiency is achieved in the compression of B-frames. It should be mentioned that the number of I-frames, P-frames and B-frames can be varied in the application used at a given time. It must, however, be noted here that at least one I-frame must be received at the receiving end before it is possible to reconstruct a proper image in the display device of the receiver.
The aim of the motion estimation is to find such a block (a reference block) within a search area of some reference frame in a video sequence that is most similar to a given block within the current frame (block under examination). Among the variety of motion estimation algorithms, the most popular are those based on block matching where a sum of absolute differences (SAD) is used as the similarity criterion between frame blocks. Given two ordered sets of data X={x1, . . . , xK} and Y={y1, . . . , yK}, the value of the SAD is defined as:
$$\mathrm{SAD}(X, Y) = \sum_{i=1}^{K} \left| x_i - y_i \right|, \qquad (1)$$
In some publications SAD is defined as the sum SAD(X,Y) divided by the number K of its addends. In that case it may also be called the mean absolute error (MAE). Since in most cases K is a power of two, these two definitions are substantially equivalent from the implementation point of view, because the latter may simply be obtained by shifting the value of the former to the right by log2 K bits.
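As a minimal sketch of equation (1) and of the shift-based relation to the MAE, the following Python fragment may be considered. The function names `sad` and `mae_pow2` are hypothetical and are not taken from any cited publication; the truncating shift is an assumption about how integer hardware would round.

```python
def sad(x, y):
    """Sum of absolute differences between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("X and Y must contain the same number of samples")
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def mae_pow2(x, y):
    """Mean absolute error for K a power of two, obtained by a right shift.

    The shift discards the fractional part (assumed rounding behaviour).
    """
    k = len(x)
    if k == 0 or k & (k - 1):
        raise ValueError("K must be a power of two")
    shift = k.bit_length() - 1  # log2(K)
    return sad(x, y) >> shift


x = [10, 20, 30, 40]
y = [12, 18, 33, 40]
print(sad(x, y))       # 2 + 2 + 3 + 0
print(mae_pow2(x, y))  # the same sum shifted right by log2(4) = 2 bits
```

For a 256-point SAD of 8-bit samples the shift amounts to 8 bits, so no divider is needed to move between the two definitions.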
In a video encoding context, the SAD is computed between every (16×16) block X of every current interframe (in practice almost every frame of a video sequence) and a plurality of (16×16) blocks Y, Y′ within a search area of one or more reference frame(s) (see FIG. 9). Thus, SAD is applied many times, and even the smallest improvement in the execution time of one SAD operation leads to significant savings in total video processing time. On the other hand, the hardware utilized for the computation of SAD should not be too large or power-consuming, especially in portable/wireless video processing applications.
There are many different motion estimation algorithms utilizing different search strategies in order to reduce the number and/or the size of SAD operations with as little degradation as possible in the quality of the encoded video. They can roughly be grouped into two categories: data independent search, where the choice of the next pair of X and Y blocks does not depend on the SAD value obtained at the previous step, and data dependent search. Normally, the data dependent search strategies require fewer SAD operations. However, most hardware implementations are based on data independent motion estimation algorithms due to the simplicity of organizing the regular data movements typical of such algorithms. Common to data dependent strategies is that there are several options for choosing the next pair of X and Y blocks, and which pair is chosen depends on the current SAD value.
According to recent investigations, different motion estimation algorithms consume approximately 40%–80% of the total video encoding time when implemented in a General-Purpose Processor (GPP). The basic operation in the block matching motion estimation algorithms is the SAD, which is applied many times during the video encoding process. In typical fast motion estimation algorithms, SAD computation is repeated approximately 30 times for almost every block (usually of the size (16×16)) within the video sequence. Even for a QCIF (Quarter Common Intermediate Format) resolution video sequence at 15 frames per second, this would mean at least 44550 256-point (16×16) SAD computations per second. In a purely software implementation on, for example, an ARM9E microprocessor, which is a typical microprocessor in embedded systems, computing one 256-point SAD takes several thousand clock cycles. This means that hundreds of millions of cycles per second may be spent on motion estimation alone in a software implementation of video encoding.
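The figure of 44550 SAD computations per second can be reproduced with a short calculation. The sketch below assumes the standard QCIF luminance resolution of 176×144 pixels; the value of 3000 cycles per SAD is purely an illustrative stand-in for "several thousand" and is not a measured number.

```python
QCIF_WIDTH, QCIF_HEIGHT = 176, 144  # QCIF luminance resolution in pixels
BLOCK = 16                          # block edge length in pixels
FPS = 15                            # frame rate quoted in the text
SADS_PER_BLOCK = 30                 # repetitions per block quoted in the text

blocks_per_frame = (QCIF_WIDTH // BLOCK) * (QCIF_HEIGHT // BLOCK)
sads_per_second = blocks_per_frame * FPS * SADS_PER_BLOCK


def cycles_per_second(cycles_per_sad):
    """Cycles spent on SAD alone for an assumed per-SAD software cost."""
    return sads_per_second * cycles_per_sad


print(blocks_per_frame)         # 11 * 9 blocks per QCIF frame
print(sads_per_second)          # 256-point SAD computations per second
print(cycles_per_second(3000))  # illustrative cost, assumption only
```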
Due to the importance of the problem, many motion estimation devices have been reported in the literature recently. They can be classified into two categories: those supporting one or another search strategy for motion estimation without considering the details of the SAD implementation, and those which essentially propose specialized architectures for SAD computation irrespective of the motion estimation strategy. For example, U.S. Pat. No. 5,864,372 discloses an apparatus for implementing a block matching algorithm for motion estimation in video image processing. The apparatus receives the pixel data of an original image block and the pixel data of a compared image block selected from a number of compared image blocks during video image processing. The selected image blocks are compared to determine a movement vector. The apparatus has a multistage pipelined tree-architecture that includes four stages. The first pipeline stage (computational stage) produces corresponding pairs of difference data and sign data. The second pipeline stage (compression stage) includes a compression array that receives all the difference data and sign data, which are added together to produce two (sum and carry term) rows of compressed summation and sign data. The third pipeline stage (summation stage) receives the compressed summation and sign data and produces a mean absolute error for each of the compared image block pixels. A last pipeline stage (minimization stage) receives the mean absolute error for each of the compared image blocks and determines a minimum mean absolute error from among them. The compression array includes a number of full and half adders or a number of 4/2 compressors arranged in a multi-level configuration in which none of the adder operand inputs and the carry-in inputs is left unconnected.
The apparatus disclosed in U.S. Pat. No. 5,864,372 is illustrated in FIG. 1. The first pipeline stage consists of several (m) computational units (DS, Difference-Sign). FIG. 1 corresponds to the case of m=4. The computational unit structure is shown in FIG. 2. The ith computational unit, i=1, . . . , m, has two n-bit inputs Xi and Yi, one n-bit output Ai and one single-bit output Bi. The output Bi (sign data) is the sign bit of the difference Xi−Yi and the output Ai (difference data) is formed from the n least significant bits of the difference, which are either inverted if Bi=1 (the difference is negative) or not if Bi=0 (the difference is non-negative). Thus, the input-output relation of a computational unit is such that
$$a + b = |x - y|, \qquad (2)$$
where x and y are the values at the inputs of the computational unit, a is the value at its n-bit output (difference data) and b is the value at its 1-bit output (sign data).
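Relation (2) can be checked with a small behavioural model of one computational unit. The sketch below is an illustration only: the function name `ds_unit` is hypothetical, and the model describes the input-output behaviour stated above rather than the gate-level structure of FIG. 2.

```python
def ds_unit(x, y, n=8):
    """Behavioural model of one difference-sign unit for n-bit unsigned inputs.

    Returns (a, b): b is the sign bit of x - y, and a is the n least
    significant bits of the difference, inverted when b == 1.
    """
    mask = (1 << n) - 1
    diff = x - y
    b = 1 if diff < 0 else 0
    low = diff & mask  # n LSBs of the two's complement difference
    a = (low ^ mask) if b else low
    return a, b


# Exhaustive check of a + b == |x - y| for all 8-bit input pairs.
ok = all(sum(ds_unit(x, y)) == abs(x - y)
         for x in range(256) for y in range(256))
print(ok)
```

Inverting the bits of a negative two's complement difference yields |x−y|−1, which is why the sign bit has to be added as a separate unit term in the compression array.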
The second pipeline stage is a compression array which is essentially a carry save adder tree having 2m inputs Ai and Bi, i=1, . . . , m, coming from the first pipeline stage, and two feedback inputs from sum and carry outputs of the array itself. The compression array may be constructed either from full adders (FAs) or 4/2-ratio compressors. Its width and depth (number of levels) and, therefore, the delay essentially depend on the number m of parallel channels (computational units) of the first stage. This dependency is presented in Table 1. In this table, NFA and N4/2 represent the number of levels in the compression array for the full adder- and 4/2-ratio compressor based configurations, respectively. DFA and D4/2 represent estimated time delays of the corresponding compression array configurations expressed in units of the basic time delay amount, τ, for one two-input NAND logic gate. Note that it is assumed that one full adder has the delay of two series connected NAND gates, and one 4/2-ratio compressor element has the delay of three series connected NAND gates.
The third pipeline stage is essentially an adder for adding the final values of sum and carry outputs of the compression array. In fact the SAD is obtained at the output of the third stage. Let us note that in order to compute the correct SAD value the adder of the third stage should have the precision of n+log2 K bits (practically, 16 bits in video encoding context).
TABLE 1

m      NFA    DFA    N4/2   D4/2
4      3      6τ     2      6τ
8      5      10τ    3      9τ
16     6      12τ    4      12τ
32     8      16τ    5      15τ
64     10     20τ    6      18τ
128    11     22τ    7      21τ
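The delay columns of Table 1 follow from the level counts under the assumptions stated for the table, namely a delay of 2τ per full adder level and 3τ per 4/2-ratio compressor level. A minimal sketch of this arithmetic is given below; the level counts are copied from the table, and the names `LEVELS` and `array_delays` are illustrative only.

```python
FA_DELAY = 2   # NAND-gate delays (tau) per full adder level, as assumed in the text
C42_DELAY = 3  # NAND-gate delays (tau) per 4/2-ratio compressor level

# m -> (number of full adder levels, number of 4/2 compressor levels)
LEVELS = {4: (3, 2), 8: (5, 3), 16: (6, 4), 32: (8, 5), 64: (10, 6), 128: (11, 7)}


def array_delays(m):
    """Return (D_FA, D_4/2) in units of tau for m computational units."""
    n_fa, n_42 = LEVELS[m]
    return n_fa * FA_DELAY, n_42 * C42_DELAY


for m in sorted(LEVELS):
    d_fa, d_42 = array_delays(m)
    print(f"m={m:3d}  D_FA={d_fa}tau  D_4/2={d_42}tau")
```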
The fourth stage of the apparatus is the minimum evaluation stage. Every time a new SAD value is obtained at the third pipeline stage, it is compared to the current minimum SAD value held in the minimum value evaluator unit M. The smaller of the two values is selected and stored in the minimum evaluator unit as the new minimum value. Once the computations of SADs between a given block X(c) within the current frame and all the corresponding blocks Y(r,c) within the search area of the reference frame are complete, the relative shift between X(c) and the block Y(r,c) for which the minimum has been achieved is identified as the motion vector for X(c).
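The minimum evaluation described above can be illustrated in software by an exhaustive (data independent) search. The following is a sketch under stated assumptions rather than a model of the apparatus: frames are plain nested lists, the names `block_sad` and `full_search` are hypothetical, and the motion vector is taken as the position of the reference block minus the position of the current block.

```python
def block_sad(cur, ref, cx, cy, rx, ry, size):
    """SAD between the block at (cx, cy) in cur and the block at (rx, ry) in ref."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def full_search(cur, ref, cx, cy, size, radius):
    """Visit every candidate in the search area and keep the minimum SAD."""
    height, width = len(ref), len(ref[0])
    best_sad, best_mv = None, None
    for my in range(-radius, radius + 1):
        for mx in range(-radius, radius + 1):
            rx, ry = cx + mx, cy + my
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue  # candidate block lies partly outside the frame
            s = block_sad(cur, ref, cx, cy, rx, ry, size)
            if best_sad is None or s < best_sad:  # minimum evaluator update
                best_sad, best_mv = s, (mx, my)
    return best_mv, best_sad


def pattern(x, y):
    """Synthetic image content chosen so that every block is distinct."""
    return x * x + 10 * y * y


ref = [[pattern(x, y) for x in range(12)] for y in range(12)]
cur = [[pattern(x + 2, y + 1) for x in range(12)] for y in range(12)]
mv, best = full_search(cur, ref, cx=4, cy=4, size=4, radius=3)
print(mv, best)
```

In this synthetic example the current frame is the reference content displaced by two pixels horizontally and one pixel vertically, so the search recovers that displacement with a SAD of zero.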
During the operation of the apparatus, the input data sets {x_1, . . . , x_K} and {y_1, . . . , y_K} enter the first pipeline stage portion by portion. At the t-th operating step, t=1, . . . , ⌈K/m⌉, the data portions {x_{(t−1)m+1}, . . . , x_{tm}} and {y_{(t−1)m+1}, . . . , y_{tm}} enter the inputs X1, . . . , Xm and Y1, . . . , Ym, respectively. At the next operating step, the corresponding difference and sign data are formed at the outputs of the computational units and enter the compression array, where they are accumulated with the current values of the sum and carry outputs of the array. After ⌈K/m⌉+3 operating steps the final sum and carry terms will be formed at the output of the compression array, and after one more operating step the SAD value will be computed at the output of the adder in the third pipeline stage. The minimum evaluation unit will consume another operating step to select the coordinates of the current motion vector.
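The data flow through the first three stages can be sketched behaviourally as follows. This is an illustration under assumptions: the names `compress_3_to_2` and `sad_pipeline` are hypothetical, the compression is modelled with 3:2 carry-save steps without reproducing the level structure of the actual array, and only the ⌈K/m⌉ data-entry steps are counted, not the additional pipeline latency.

```python
def compress_3_to_2(p, q, r):
    """Carry-save (full adder) compression: p + q + r == s + c."""
    s = p ^ q ^ r
    c = ((p & q) | (p & r) | (q & r)) << 1
    return s, c


def sad_pipeline(x, y, m=4, n=8):
    """Return (SAD, number of data-entry steps) for m parallel units."""
    mask = (1 << n) - 1
    sum_row, carry_row = 0, 0
    steps = 0
    for start in range(0, len(x), m):
        # Stage 1: difference data a and sign data b for each input pair.
        operands = []
        for xi, yi in zip(x[start:start + m], y[start:start + m]):
            diff = xi - yi
            b = 1 if diff < 0 else 0
            low = diff & mask
            operands += [(low ^ mask) if b else low, b]
        # Stage 2: compress the operands and the fed-back rows into two rows.
        rows = operands + [sum_row, carry_row]
        while len(rows) > 2:
            p, q, r = rows[:3]
            rows = rows[3:] + list(compress_3_to_2(p, q, r))
        sum_row, carry_row = rows
        steps += 1
    # Stage 3: a single carry-propagate addition yields the SAD.
    return sum_row + carry_row, steps


x = [(7 * i) % 256 for i in range(256)]
y = [(13 * i + 5) % 256 for i in range(256)]
total, steps = sad_pipeline(x, y, m=4)
print(total == sum(abs(a - b) for a, b in zip(x, y)), steps)
```

The carry-propagate addition is deferred until all portions have been accumulated, which is the reason the compression array keeps separate sum and carry rows.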
The duration of the operating step is determined by the throughput of the slowest pipeline stage. In the apparatus according to U.S. Pat. No. 5,864,372, the slowest part is considered to be the minimum evaluation unit. However, this unit, as well as the adder of the third stage, operates for only two cycles for a given pair of input data sets, while the first two stages operate for ⌈K/m⌉+3 cycles. If K is sufficiently large with respect to m (which is the practical case), it is more beneficial to halt the first two stages after ⌈K/m⌉+3 cycles, when the last two stages start operating, instead of immediately starting to process the next pair of input data sets. This way, the clock cycle duration is determined by the throughput of the slower of the first two stages only.
The throughput of the first pipeline stage is essentially the throughput of an n-bit (8-bit) adder/subtracter. Different adder/subtracters may be used, resulting in different throughputs. For ease of description it is considered here that standard 8-bit carry-ripple adders are used. It is assumed in U.S. Pat. No. 5,864,372 that the delay of a full adder is substantially equivalent to the delay of two series connected NAND gates. Since an 8-bit carry-ripple adder chains eight full adders, the delay between two successive outputs of the computational units is substantially equal to 16τ, where τ is the basic delay unit (the delay of one two-input NAND gate). Comparing this to the delay of the compression array given in Table 1, it can be seen that the first pipeline stage is slower than the second one for the cases of up to 32 computational units within the first pipeline stage. In the cases of more computational units (which are, in fact, impractical due to the large silicon area and input bus width needed) the compression array is split into two pipeline stages so that the first pipeline stage remains the slowest. Thus the duration of the operating step of the apparatus is 16τ, irrespective of how many computational units are involved in the first pipeline stage. The compression array must also be clocked at the same operating step, even though it could be clocked faster in most practical cases, because the delay of the compression array is less than 16τ when fewer than 32 computational units feed the compression array, as shown in Table 1.
The prior art apparatus has several drawbacks. Pipeline stages of the apparatus, more importantly, the first two of them, are poorly balanced since they have essentially different delays. For the cases of reasonable numbers of parallel computational units within the first stage, say m=4, 8, 16, the compression array of the second stage is, respectively, 2.7, 1.6, and 1.3 times faster than the first stage if standard carry-ripple adders are used within computational units as can be seen from Table 1. Thus, the compression array of the apparatus is utilized at approximately only 37%, 62.5%, and 77% of its capacity.
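The balance figures quoted above can be recomputed from the 16τ delay of the first stage and the full adder based array delays of Table 1. The sketch below uses hypothetical names (`ARRAY_DELAY_FA`, `balance`) and only restates that arithmetic.

```python
FIRST_STAGE_DELAY = 16                   # tau, 8-bit carry-ripple adder at 2 tau per full adder
ARRAY_DELAY_FA = {4: 6, 8: 10, 16: 12}   # tau, full adder based array delays from Table 1


def balance(m):
    """Return (speed ratio of array to first stage, utilization of the array)."""
    d = ARRAY_DELAY_FA[m]
    return FIRST_STAGE_DELAY / d, d / FIRST_STAGE_DELAY


for m in sorted(ARRAY_DELAY_FA):
    ratio, utilization = balance(m)
    print(f"m={m:2d}  ratio={ratio:.2f}  utilization={utilization:.1%}")
```

The exact quotients are 37.5%, 62.5%, and 75%; the approximate figures of 37% and 77% quoted above for m=4 and m=16 correspond to the reciprocals of the rounded speed ratios 2.7 and 1.3.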
Adjusting the delays of the (first two) pipeline stages of the apparatus in order to achieve better balancing between them is only possible either by using faster adders/subtracters within the computational units or by increasing the number of pipeline stages. Both options lead to a significant increase in silicon area and power consumption.
The width and the depth of the compression array grow with the number of computational units within the first stage (see Table 1). Because the compression array is faster than the computational units, it is not effectively utilized in prior art systems. Reducing the size of the compression array would reduce not only the gate count but also the delay, because the array would have fewer levels. The possibility of reducing the size of the compression array would also add flexibility in balancing the first two pipeline stages.
The input bus width of the prior art apparatuses grows proportionally with the number of parallel computational units in the first stage. This restricts their practical use, since most general purpose processors and digital signal processors (DSP) provide rather narrow busses for interconnection with an accelerator. Though including input buffers in the apparatus could solve the problem, this would mean a significant increase in the gate count.