Video encoding systems are known in which an image to be encoded is divided into video blocks. These blocks are then encoded and transmitted to a decoding device or stored on a storage medium. To reduce the amount of information to be transmitted, different compression methods have been developed, such as MPEG-2 (Moving Picture Experts Group). In the transmission of video images, image compression can be performed as interframe compression, intraframe compression, or a combination of these. In interframe compression, the aim is to eliminate redundant information in successive image frames. Typically, images contain a large amount of non-varying information, for example a motionless background, or slowly changing information, for example when an object moves slowly. In interframe compression, it is also possible to utilise motion compensation, in which the aim is to detect larger elements in the image that are moving, so that a motion vector and some kind of difference information for such an element are transmitted instead of the pixels representing the whole element. The direction and the speed of the motion of the element in question are thus determined in order to establish this motion vector. For compression, the transmitting and the receiving video terminals are required to have a processing rate high enough to perform compression and decompression in real time. Typically, image blocks are grouped together to form macroblocks. A macroblock usually contains 16 rows by 16 columns of luminance samples, mode information, and possible motion vectors. The macroblock is divided into four 8×8 luminance blocks and two 8×8 chrominance blocks. Scanning (and encoding/decoding) proceeds macroblock by macroblock, conventionally from the top-left to the bottom-right corner of the frame. Inside one macroblock, the scanning (and encoding/decoding) order is from the top-left to the bottom-right corner of the macroblock.
In MPEG-2 compression, an image is coded with the Discrete Cosine Transform (DCT) in blocks of 8×8 pixels. The luminance level to be transformed is in full resolution. Both chrominance signals are subsampled; for example, a field of 16×16 pixels is subsampled into a field of 8×8 pixels. The difference in block sizes is primarily due to the fact that the eye does not discern changes in chrominance as well as changes in luminance, so that a field of 2×2 pixels is encoded with the same chrominance value.
MPEG-2 defines three frame types: an I-frame (Intra), a P-frame (Predicted), and a B-frame (Bi-directional). The I-frame is generated solely on the basis of information contained in the image itself, so that at the receiving end this I-frame alone can be used to form the entire image. The P-frame is formed on the basis of a preceding I-frame or P-frame, so that at the receiving stage the preceding I-frame or P-frame is correspondingly used together with the received P-frame. In the composition of P-frames, motion compensation, for instance, is used to compress the quantity of information. B-frames are formed on the basis of the preceding I-frame and the following P- or I-frame. Correspondingly, at the receiving stage it is not possible to compose the B-frame until the corresponding I-frame and P- or I-frame have been received. Furthermore, at the transmission stage, the order of these P- and B-frames is usually changed so that the P-frame following the B-frame is received first, which accelerates the reconstruction of the image in the receiver.
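The reordering described above can be illustrated with a minimal sketch. It assumes a simplified model in which every B-frame is held back until the next I- or P-frame (its forward reference) has been sent; the function name `transmission_order` and the frame labels are hypothetical and are not taken from the MPEG-2 specification.

```python
def transmission_order(display_order):
    """Reorder frames so that each B-frame follows the reference frame it needs.

    Simplified model (an assumption): every B-frame is held back until the
    next I- or P-frame in display order has been emitted.
    """
    out = []
    pending_b = []
    for frame in display_order:
        if frame.startswith("B"):
            pending_b.append(frame)   # wait for the following reference frame
        else:
            out.append(frame)         # I- or P-frame is sent first
            out.extend(pending_b)     # then the B-frames that depend on it
            pending_b = []
    out.extend(pending_b)             # trailing B-frames with no later reference
    return out


display = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]
print(transmission_order(display))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```

In this sketch the receiver already holds both reference frames when a B-frame arrives, which is the property the reordering is meant to provide.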
Of these three frame types, the highest efficiency is achieved in the compression of B-frames. It should be mentioned that the number of I-frames, P-frames and B-frames can be varied in the application used at a given time. It must, however, be noted that at least one I-frame must be received at the receiving end before it is possible to reconstruct a proper image in the display device of the receiver.
The aim of motion estimation is to find a block (a reference block) within a search area of some reference frame in a video sequence that is most similar to a given block within the current frame (the block under examination). Among the variety of motion estimation algorithms, the most popular are those based on block matching, where a sum of absolute differences (SAD) is used as the similarity criterion between frame blocks. Given two ordered sets of data X={x1, ..., xK} and Y={y1, ..., yK}, the value of the SAD is defined as:
$$\mathrm{SAD}(X,Y)=\sum_{i=1}^{K}\left|x_i-y_i\right|, \qquad (1)$$
In some publications, SAD is defined as the sum SAD(X,Y) divided by the number K of its addends. In that case it may also be called the mean absolute error (MAE). Since K is a power of two in most cases, these two definitions are substantially equivalent from the implementation point of view, because the latter may simply be obtained by shifting the value of the former to the right by a certain number of bits (log2 K).
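As a minimal sketch of the two definitions, the following Python fragment computes the SAD of equation (1) and derives the MAE by a right shift when K is a power of two. The function names `sad` and `mae_by_shift` are illustrative only, and the shift truncates the quotient toward zero, as a plain hardware shift would.

```python
def sad(x, y):
    """Sum of absolute differences between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("X and Y must contain the same number of samples")
    return sum(abs(a - b) for a, b in zip(x, y))


def mae_by_shift(x, y):
    """Mean absolute error for K a power of two, obtained by a right shift."""
    k = len(x)
    if k == 0 or k & (k - 1):
        raise ValueError("K must be a power of two")
    shift = k.bit_length() - 1        # log2(K)
    return sad(x, y) >> shift         # truncating division by K


x = [10, 20, 30, 40]
y = [12, 18, 33, 40]
print(sad(x, y))           # 2 + 2 + 3 + 0 = 7
print(mae_by_shift(x, y))  # 7 >> 2 = 1
```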
In a video encoding context, the SAD is computed between every (16×16) block X(c) of every current interframe (in practice, almost every frame of a video sequence) and a plurality of (16×16) blocks Y(c,r), Y(c,r′) within a search area S(c) of one or more reference frames (see FIG. 5). The block Y(c,R) that corresponds to the minimum SAD value among the SADs between X(c) and the blocks Y(c,r), Y(c,r′) within the search area S(c) is then used to form motion information. Thus, the SAD is applied many times, and even the smallest improvement in the execution time of one SAD operation leads to significant savings in total video processing time. On the other hand, the hardware used for computing the SAD should not be too large or power consuming, especially in portable/wireless video processing applications.
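The search for the block with the minimum SAD can be sketched as an exhaustive (full) search, which is only one of the possible search strategies. The sketch assumes frames stored as lists of rows of integer samples and a square search area given by a radius in pixels; the names `block_sad` and `full_search` and the small 8×8 test frame are hypothetical.

```python
def block_sad(cur, ref, cx, cy, rx, ry, n):
    """SAD between the n-by-n block of cur at (cx, cy) and of ref at (rx, ry)."""
    total = 0
    for j in range(n):
        for i in range(n):
            total += abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
    return total


def full_search(cur, ref, cx, cy, n, radius):
    """Return (min_sad, (dx, dy)) over all candidates within +/- radius pixels."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate block falls outside the reference frame
            s = block_sad(cur, ref, cx, cy, rx, ry, n)
            if best is None or s < best[0]:
                best = (s, (dx, dy))
    return best


# Synthetic reference frame and a current frame whose 4x4 block at (2, 2)
# is a copy of the reference block at (3, 1).
ref = [[(7 * x + 13 * y) % 31 for x in range(8)] for y in range(8)]
cur = [[0] * 8 for _ in range(8)]
for j in range(4):
    for i in range(4):
        cur[2 + j][2 + i] = ref[1 + j][3 + i]

print(full_search(cur, ref, 2, 2, 4, 2))  # (0, (1, -1))
```

The returned displacement (dx, dy) plays the role of the motion vector in this sketch; a real encoder would use 16×16 blocks and a larger search area.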
There are many different motion estimation algorithms that use different search strategies in order to reduce the number and/or the size of SAD operations with as little degradation as possible in the quality of the encoded video. They can roughly be grouped into two categories: data independent search, where the choice of the next pair of X and Y blocks does not depend on the SAD value obtained at the previous step, and data dependent search. Normally, data dependent search strategies require fewer SAD operations. However, most hardware implementations are based on data independent motion estimation algorithms because of the simplicity of organizing the regular data movements typical of such algorithms. Common to data dependent strategies is that there are several options for choosing the next pair of X and Y blocks, and which pair is chosen depends on the current SAD value.
According to recent investigations, different motion estimation algorithms consume approximately 40%–80% of the total video encoding time when implemented in a General-Purpose Processor (GPP). The basic operation in block matching motion estimation algorithms is the SAD, which is applied many times during the video encoding process. In typical fast motion estimation algorithms, SAD computation is repeated approximately 30 times for almost every block (usually of the size 16×16) within the video sequence. Even for a QCIF (Quarter Common Intermediate Format) video sequence at 15 frames per second, this would mean at least 44550 256-point (16×16) SAD computations per second. In a purely software implementation on, e.g., an ARM9E microprocessor, which is a typical microprocessor in embedded systems, computing one 256-point SAD takes several thousand clock cycles. This means that hundreds of millions of cycles per second may be spent on motion estimation alone in a software implementation of video encoding.
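The figure of 44550 SAD computations per second can be reproduced with a short calculation, assuming the QCIF luminance resolution of 176×144 pixels, i.e. 11×9 = 99 blocks of 16×16 pixels per frame. The cycle count per SAD used below (3000) is a purely hypothetical value chosen to stand for "several thousand" and is not a measured figure.

```python
def sad_operations_per_second(width, height, block, fps, sads_per_block):
    """Number of block SAD computations needed per second of video."""
    blocks_per_frame = (width // block) * (height // block)
    return blocks_per_frame * fps * sads_per_block


rate = sad_operations_per_second(176, 144, 16, 15, 30)
print(rate)  # 99 blocks * 15 frames/s * 30 SADs = 44550

CYCLES_PER_SAD = 3000  # hypothetical value standing in for "several thousand"
print(rate * CYCLES_PER_SAD)  # 133650000 cycles per second
```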
Due to the importance of the problem, many motion estimation devices have been reported in the literature recently. One class of architectures for SAD computation used in such devices is that of cascade-connected architectures. An example of such an architecture is disclosed in U.S. Pat. No. 6,154,492.
Another type of architecture may, in general, be described according to FIG. 6 and may be referred to as a "parallel/iterative accumulation SAD architecture". Examples of such architectures are found e.g. in U.S. Pat. No. 5,864,372 and U.S. Pat. No. 5,652,625 and in the paper "The sum-absolute-difference motion estimation accelerator" by S. Vassiliadis, E. A. Hakkennes, J. S. S. M. Wong, and G. G. Pechanek, published in Proceedings of Euromicro Conference, vol. 2, in 1998 (pages 559–566). In these architectures, comparison values representing the absolute difference between every pair of data values (one from a current block and another from a reference block) are calculated at every step within a block of computational unit(s). These values are then accumulated one by one or portion by portion (iteratively), or all at once (in parallel), within a summation block which may have an internal feedback. After the comparison values of all the pairs for two given blocks have been accumulated, the SAD between these two blocks is obtained. The SADs obtained between a given block X(c) and a plurality of blocks Y(c,r) within a search area S(c) are then analysed within the minimum evaluator block, and the block producing the smallest SAD is selected to produce the motion estimation information.
In practical use for motion compensation in video encoding, an SAD value that is too large is of no interest. Thus the accumulation means within SAD architectures may be implemented with a lower precision (bit-width) than would be needed for the correct SAD value in the worst case. While larger SAD values would then be computed incorrectly, this normally does not affect the motion estimation result.
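The effect of a narrower accumulator can be illustrated as follows. The sketch assumes a saturating accumulator, which is only one possible way of handling overflow; the text above does not state how the reduced-precision accumulation behaves when it overflows. The name `sad_saturating` and the 12-bit width are hypothetical. For 8-bit samples, the worst-case 256-point SAD is 256 × 255 = 65280, which needs 16 bits.

```python
def sad_saturating(x, y, bits):
    """SAD accumulated in a register of the given width, clamping on overflow.

    Saturation is an assumption of this sketch, not a property stated for any
    particular architecture.
    """
    limit = (1 << bits) - 1
    acc = 0
    for a, b in zip(x, y):
        acc = min(acc + abs(a - b), limit)
    return acc


similar_x, similar_y = [10] * 256, [12] * 256     # exact SAD = 512
distant_x, distant_y = [255] * 256, [0] * 256     # exact SAD = 65280

print(sad_saturating(similar_x, similar_y, 12))   # 512, still exact
print(sad_saturating(distant_x, distant_y, 12))   # 4095, clamped (incorrect)
print(sad_saturating(distant_x, distant_y, 16))   # 65280, exact
```

The small SAD, which is the one that matters for selecting the reference block, is unaffected by the reduced width, while the large one is merely reported as "large".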
Some interrupt mechanisms have been introduced, e.g. in U.S. Pat. No. 6,154,492, to be used in connection with SAD calculation means. The patent publication discloses a motion vector detection apparatus which comprises cascade-connected processor elements. The processor elements calculate the absolute value of the difference between each of a plurality of pixels which compose a picture and a corresponding one of the same number of pixels included in a block, and also perform cumulative addition of the absolute difference values in the block. These operations are performed for each of the blocks within the predetermined search area. A comparison device repeatedly compares the cumulative addition values of two blocks obtained sequentially in the processor element at the final stage, and selects the smaller one of the cumulative addition values. A subtracter compares the smaller cumulative addition value with a setting value. When the smaller cumulative addition value is smaller than the setting value, a control circuit halts the supply of clock signals to the processor elements and the comparison device, so as to halt the entire operation of the apparatus. Since only smaller SAD values are important for determining the motion vectors, the larger SAD values may be computed incorrectly. This means that the functional units within the apparatus may have a lower precision than is necessary for correct SAD computation in the worst case, which leads to additional savings in silicon area and in power consumption. However, in this solution only a part of the architecture is halted, which partially saves power but not execution time. In fact, the architecture of U.S. Pat. No. 6,154,492 is constructed from K=256 processing elements, each consisting of three adders. Such an architecture does not appear feasible at the current state of technology, or at least it appears to be too large to be incorporated into mobile video encoding systems.
In addition, this architecture supports only regular, data independent search strategies for motion estimation, since it is heavily pipelined (256 pipeline stages) and a full interruption of the architecture would mean a full pipeline reload.
There are several situations in the SAD computation for motion estimation where the calculation of an SAD between two given blocks X(c) and Y(c,r) may be terminated before it is complete, and a new calculation of the sum of absolute differences between X(c) and another block Y(c,r′) from the search area S(c) may be started substantially immediately after the early termination. Examples of such situations are cases where, in the middle of the calculation of the SAD between X(c) and Y(c,r), a temporary SAD value is obtained which already exceeds a predetermined threshold value or an earlier obtained value of the SAD between X(c) and some block Y(c,r″) within the search area S(c). In some other situations, the search for a reference block within a given search area S(c) may be terminated and motion compensation information for X(c) may be formed before the process has been completed in the normal way. An example of such a situation is the case where an SAD value between two blocks X(c) and Y(c,R) (not shown) is smaller than another predetermined threshold value. Therefore, integrating interrupt mechanisms that allow pre-termination of the SAD computation and/or of the reference block search into devices for motion estimation would be advantageous.
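Both kinds of pre-termination can be sketched in software as follows. This is an illustration of the two situations described above under simplifying assumptions (blocks given as flat lists of samples, candidates examined in list order), not a description of any particular device; the names `sad_early_exit`, `search_with_termination` and `good_enough` are hypothetical.

```python
def sad_early_exit(x, y, limit):
    """Return the SAD, or None as soon as the partial sum exceeds limit."""
    acc = 0
    for a, b in zip(x, y):
        acc += abs(a - b)
        if acc > limit:
            return None  # this candidate cannot be chosen; stop accumulating
    return acc


def search_with_termination(x, candidates, good_enough):
    """Return (best_sad, best_index), stopping once an SAD below good_enough is found."""
    best_sad, best_index = None, None
    for index, y in enumerate(candidates):
        # The earlier obtained minimum acts as the threshold for early exit.
        limit = float("inf") if best_sad is None else best_sad
        s = sad_early_exit(x, y, limit)
        if s is None:
            continue                 # SAD computation was pre-terminated
        if best_sad is None or s < best_sad:
            best_sad, best_index = s, index
        if best_sad < good_enough:
            break                    # reference block search is pre-terminated
    return best_sad, best_index


x = [5, 5, 5, 5]
candidates = [[9, 9, 9, 9], [5, 6, 5, 5], [5, 5, 5, 5]]
print(sad_early_exit(x, candidates[0], 6))              # None (4, then 8 > 6)
print(search_with_termination(x, candidates, 2))        # (1, 1)
```

In the usage example the third candidate is never examined, because the second one already yields an SAD below the hypothetical `good_enough` threshold.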