Motion-compensated temporal analysis (MCTA) is a useful tool for a variety of applications that include optimization of compression performance/efficiency, filtering, and video content analysis and classification. The premise behind MCTA is the exploitation of the temporal correlation that characterizes video signals. Often, a picture in a video will share similar content with the previous picture. This has profound consequences for both compression and filtering. Compression benefits because a block in a current picture may be predicted as a displaced, warped, or weighted block in some previous picture. The displacement parameters are called motion vectors and are needed in order to create the motion-compensated prediction of the current block. If the motion model that is used to predict the current block is efficient enough, then the difference between the current block and its motion-compensated prediction will be low, and hence easy to compress. Filtering can benefit as well. If the prediction is close enough to the current block in the picture, then it can be surmised that the prediction block is none other than the current original block with different noise characteristics. The current block, however, is also assumed to be a distorted version of the original source block, again with a different set of noise characteristics. If the noise in each block is assumed to have zero mean and to be uncorrelated with the noise in the other block, and the two noise energies are comparable, then simply averaging the current block with its prediction block from some other reference picture will create a new block with halved error/noise energy, which is closer to the original source block. This can be extended to weighted combinations of an arbitrary number of prediction blocks that originate from multiple reference pictures.
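The noise-halving argument can be checked numerically. The following is a minimal sketch, assuming a synthetic source block and independent zero-mean Gaussian noise of equal variance in the current block and its prediction; the function names, the noise level, and the block size are illustrative assumptions and are not taken from this disclosure.

```python
import random


def noise_energy(block, source):
    """Mean squared error of a block relative to the noise-free source."""
    return sum((b - s) ** 2 for b, s in zip(block, source)) / len(source)


def averaging_demo(num_pixels=20000, sigma=4.0, seed=0):
    """Return (noise energy of current block, noise energy after averaging)."""
    rng = random.Random(seed)
    source = [rng.uniform(0.0, 255.0) for _ in range(num_pixels)]
    # Two observations of the same source with independent zero-mean noise
    # of equal variance (an assumption of this sketch).
    current = [s + rng.gauss(0.0, sigma) for s in source]
    prediction = [s + rng.gauss(0.0, sigma) for s in source]
    averaged = [(c + p) / 2.0 for c, p in zip(current, prediction)]
    return noise_energy(current, source), noise_energy(averaged, source)


if __name__ == "__main__":
    before, after = averaging_demo()
    print(f"noise energy before: {before:.2f}, after averaging: {after:.2f}")
```

With these assumptions the ratio of the two energies comes out close to one half. If the prediction is poorly matched or its noise is correlated with that of the current block, the gain is smaller.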
Motion-compensated temporal analysis has also been used within the context of temporal wavelets for video compression. See, for example, Y. Andreopoulos, A. Munteanu, J. Barbarien, M. van der Schaar, J. Cornelis, and P. Schelkens, “In-band motion compensated temporal filtering,” Signal Processing: Image Communication, vol. 19, pp. 653-673 and D. S. Turaga, M. van der Schaar, Y. Andreopoulos, A. Munteanu, and P. Schelkens, “Unconstrained motion compensated temporal filtering (UMCTF) for efficient and flexible interframe wavelet video coding,” Signal Processing: Image Communication, Volume 20, Issue 1, pp. 1-19. Motion-compensated temporal filtering has been applied both on the original pixel values (see “Unconstrained motion compensated temporal filtering (UMCTF) for efficient and flexible interframe wavelet video coding” cited above) as well as to values that have been transformed to the frequency domain (see “In-band motion compensated temporal filtering” cited above). The video sequence is divided into groups of pictures each of which is coded independently. Within those groups, motion-compensated temporal analysis is used to provide motion-compensated predictions for a subset of the pictures. The motion-compensated prediction errors are then used to refine the remaining pictures, which are again predicted using motion compensation. The final motion-compensated prediction errors are coded. Even though MCTA within a video coder is not addressed by this disclosure, some of the methods presented in this disclosure may be applicable on video coders that use motion-compensated temporal filtering.
Filtering is one of the applications that benefits from the use of motion-compensated temporal analysis. An early algorithm for denoising based on motion-compensated temporal filtering is found in E. Dubois and S. Sabri, “Noise reduction in image sequences using motion-compensated temporal filtering,” IEEE Transactions on Communications, Vol. COM-32, no. 7, pp. 826-831. A review of the first contributions in this field is presented in J. C. Brailean, R. P. Kleihorst, S. Efstratiadis, A. K. Katsaggelos, and R. L. Lagendijk, “Noise reduction filters for dynamic image sequences: A review,” Proceedings of the IEEE, vol. 83, pp. 1272-1292, September 1995. More recent approaches for pre-filtering based on MCTA are presented in J. Llach and J. M. Boyce, “H.264 encoder with low complexity noise pre-filtering,” Proc. SPIE, Applications of Digital Image Processing XXVI, vol. 5203, pp. 478-489, August 2003; A. McInnis and S. Zhong, “Method and system for noise reduction with a motion compensated temporal filter,” United States Patent Application Publication No. 20070014368; and H.-Y. Cheong, A. M. Tourapis, J. Llach, and J. Boyce, “Advanced Spatio-Temporal Filtering for Video De-Noising,” in Proc. IEEE Int. Conf. on Image Processing, vol. 2, pp. 965-968. “H.264 encoder with low complexity noise pre-filtering” (cited above) describes the use of the motion compensation module within an H.264/AVC video coder to perform temporal filtering. Multiple motion-compensated predictions from past pictures were generated, averaged, and blended with the current picture to implement temporal filtering. The picture was also spatially filtered with a threshold-based 3×3 pixel-average filter. A more advanced and general approach is proposed in “Advanced Spatio-Temporal Filtering for Video De-Noising” (cited above), which takes into account both past and future pictures.
The combination of the multiple motion-compensated predictions that originate from different pictures is done using a weighted average that adapts to the characteristics of the source signal. Furthermore, spatial filtering adopts a combination of wavelet filtering and Wiener filtering. The motion-compensated temporal analysis module that follows the architecture presented in “Advanced Spatio-Temporal Filtering for Video De-Noising” (cited above) is described in more detail below.
FIG. 1 shows a block diagram of a Motion-Compensated Spatio-Temporal Filter which implements Motion-Compensated Temporal Analysis. The inputs to the MCTA module shown in FIG. 1 are image pixels and, optionally, motion and spatial filtering parameters that initialize motion modeling and spatial filtering in the analysis module. The processing arrangement consists of the following main components:
1. Spatial filters (wavelets, Wiener filter, among others).
2. Motion estimation and compensation with an arbitrary motion model.
3. Spatio-temporal de-blocking filter (optional).
4. Texture analysis (e.g. through spatial frequency analysis).
5. Luminance and chrominance information module.
The bi-predictive motion estimation (BME) modules 110 in FIG. 1 perform bi-predictive motion estimation, while the motion estimation (ME) modules 120 perform uni-predictive motion estimation. The subscripts denote the temporal distance of the reference pictures with respect to the current picture. The bi-predictive motion-compensation (BMC) modules 130 perform bi-predictive motion compensation using the motion vectors derived at the respective BME modules 110. Similarly, the motion compensation (MC) modules 140 perform uni-predictive motion compensation with the motion vectors from the respective ME modules 120. The spatial (SP) filters 151, 153, 155, 157 perform a variety of functions that include high- and low-pass filtering and de-blocking, among others. Buffers Buff1 161 and Buff2 163 contain previous and future spatially and temporally filtered pictures. The weights w are adjusted to minimize the prediction error. The input picture may be spatially filtered by one of three available spatial filters 151, 153, 155, whose parameters are tunable depending on the statistics of pictures that have already been processed by the MEMC component. Note that spatio-temporal filtering topologies other than that specifically depicted in FIG. 1 may be used. For example, the BME modules 110 may operate on frames at different temporal distances, such as −M, +N.
Each input picture undergoes motion estimation with some reference picture, to yield a motion-compensated prediction of that picture. The input image is divided into pixel blocks or areas that may have an arbitrary size (e.g. 8×8 pixels). For this disclosure, the terms block, region, and area of the picture are used interchangeably. A block in the current picture n is matched using motion estimation with a prediction block that is generated from some part of a reference picture n−k. The ME component determines the motion parameters that point to the prediction block. To generate this prediction block, the MC module 140 requires the motion parameters that are passed on by the ME module 120. The selected motion parameters minimize some cost between the original current block and the derived prediction block. Among many possible costs, one that may be used is the Mean Absolute Difference (MAD) between the original and the predicted block. An alternative cost could involve the sum of the MAD plus a value that represents motion field similarity. Motion field smoothness or similarity requires that motion parameters belonging to neighboring blocks are similar or correlated. Motion field similarity lowers the number of bits required to code the motion parameters, and can reduce blocking artifacts when applied to produce a motion-compensated prediction of the current picture.
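As an illustration of block matching with a MAD cost, the following is a minimal sketch of an exhaustive (full) search with a translational motion model and integer-pixel accuracy. The function names, the search range, and the synthetic test data are assumptions made for this example; the disclosure does not prescribe a particular search strategy, and the motion field similarity term is omitted.

```python
import random


def mad(cur, ref, bx, by, dx, dy, size):
    """MAD between the current block at (bx, by) and the reference block
    displaced by the candidate motion vector (dx, dy)."""
    total = 0
    for j in range(size):
        for i in range(size):
            total += abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
    return total / (size * size)


def full_search(cur, ref, bx, by, size=8, search_range=4):
    """Return (dx, dy, cost) minimizing MAD over a square search window."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x0, y0 = bx + dx, by + dy
            # Skip candidates that fall outside the reference picture.
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue
            cost = mad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def make_shifted_pair(width, height, dx, dy, seed=1):
    """Hypothetical test data: a random-texture reference and a current
    picture whose content at (x, y) is the reference at (x + dx, y + dy)."""
    rng = random.Random(seed)
    ref = [[rng.randrange(256) for _ in range(width)] for _ in range(height)]

    def clamp(v, limit):
        return max(0, min(limit - 1, v))

    cur = [[ref[clamp(y + dy, height)][clamp(x + dx, width)]
            for x in range(width)] for y in range(height)]
    return cur, ref
```

For example, `full_search(*make_shifted_pair(24, 24, dx=-1, dy=2), 8, 8)` searches for the block whose top-left corner is at (8, 8). A cost that also rewards motion field similarity could be sketched by adding a penalty on the difference between the candidate vector and the vectors of neighboring blocks.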
In general, the motion-compensated (MC) prediction of picture n from picture n−k creates a prediction block that is drawn from picture n−k. Then the MC component takes the prediction blocks from reference picture n−k and combines them to form a motion-compensated picture that is the best approximation to picture n. Note that the motion model used in the ME and MC modules 120, 140 may utilize any known global and local motion model, such as the affine and translational motion models.
Motion estimation and compensation are not constrained to the previous picture alone as shown in FIG. 1. In fact, k can take positive and negative values, and motion compensation can utilize multiple reference pictures, as shown in FIG. 2. FIG. 2 shows prediction of the current picture using a weighted linear combination of blocks originating from pictures in the past and the future. One hypothesis uses picture n−1 as a reference, while another will use picture n−N. Pictures n+1 through n+N are used as well. Note that using reference pictures from the future entails delay, as up to N future pictures will have to be buffered prior to completing the motion estimation of picture n. For low delay applications, one could constrain motion compensation to employ only past pictures as references.
The motion-compensated prediction of a block in picture n may also be a linear weighted combination of more than one prediction block, each originating from a different reference picture. In one possible arrangement, the current block in picture n could be predicted as the linear weighted combination of a prediction block derived from picture n−2 and a prediction block derived from picture n+1. This particular prediction structure is also known as bidirectional prediction. In another possible configuration, the prediction block could be a linear weighted combination of a prediction block derived from picture n−1 and another prediction block derived from picture n−2. The generalized prediction (weighted prediction with a translational motion model) is shown in Eq. 1 below as:
\[ \tilde{p}_n(i,j) = \sum_{k=-m}^{+m} \left( \alpha_k \times p_{n-k}\left(i + v_{x,k},\; j + v_{y,k}\right) \right) + o \qquad \text{(Eq. 1)} \]
Disregarding fractional-pixel motion-compensated prediction, pixels p_n(i,j) of a block in picture n can be predicted as a linear weighted combination of displaced blocks in pictures n−m through n+m. In Eq. 1, α_k denotes the weight applied to the prediction from picture n−k, (v_x,k, v_y,k) denotes the corresponding motion vector, and o is an offset. Note that m is a positive integer. In other possible realizations, the combination need not be linear.
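Eq. 1 can be evaluated directly for a single pixel. The following is a minimal sketch, assuming integer motion vectors, pictures stored as row-major lists indexed as picture[j][i] (so that i and v_x index columns), and a dictionary of hypotheses keyed by the temporal offset k. The function name and the data layout are assumptions made for this example.

```python
def predict_pixel(pictures, n, i, j, hypotheses, offset=0.0):
    """Evaluate Eq. 1 at pixel (i, j) of picture n.

    pictures:   dict mapping a picture index to a 2-D list indexed [row][col]
    hypotheses: dict mapping k to (alpha_k, vx_k, vy_k); only the values of k
                that are actually used need to be present (others have
                weight zero)
    offset:     the additive term o of Eq. 1
    """
    value = offset
    for k, (alpha, vx, vy) in hypotheses.items():
        ref = pictures[n - k]
        # Displaced sample p_{n-k}(i + v_x, j + v_y), weighted by alpha_k.
        value += alpha * ref[j + vy][i + vx]
    return value


# Bidirectional example: picture 5 predicted from pictures 4 (k = 1)
# and 6 (k = -1) with equal weights and an offset of 1.
pictures = {4: [[10, 20], [30, 40]], 6: [[50, 60], [70, 80]]}
hypotheses = {1: (0.5, 1, 0), -1: (0.5, 0, 1)}
print(predict_pixel(pictures, 5, 0, 0, hypotheses, offset=1.0))  # 46.0
```

In the example, the hypothesis from picture 4 reads the sample one column to the right (value 20), and the hypothesis from picture 6 reads the sample one row down (value 70), so the prediction is 0.5 × 20 + 0.5 × 70 + 1 = 46.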
Note that a special case of motion estimation and compensation with multiple hypotheses as described in Eq. 1 is the so-called overlapped block motion estimation and compensation. FIG. 18 depicts an example of overlapped block motion compensation and estimation. In FIG. 18, the center part of the block is predicted as a single prediction block using a single motion vector (MV). The block boundaries, however, are all weighted linear averages of the prediction samples produced by using the current block MV and the samples produced by using the MVs of neighboring blocks. For example, the top overlapping area is a weighted average of samples predicted using the current MV and the MV of the block above the current block. The overlapping area at the top left is similarly a weighted average of samples predicted using four MVs, those of the current, left, top-left, and top blocks. Such techniques can reduce blocking artifacts at block edges, among other benefits.
Motion estimation schemes may also adopt hierarchical strategies. Hierarchical strategies may improve estimation performance by avoiding local minima and may also improve estimation speed. In general, these schemes perform some kind of spatial sub-sampling, resulting in an image pyramid where at each level the input image may be sub-sampled by a constant ratio, e.g., 2. Motion estimation is first performed at the highest hierarchy level (the lowest resolution). Then the MVs derived at this level are scaled to the next lower level (e.g. multiplied by 2) and are used as predictors or constraints for the next level. ME is performed again at the next level, using the scaled MVs as predictors and constraints. This process iterates until one derives MVs at the original, highest resolution. By using previous levels as predictors, one may limit the search range for the next level.
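The hierarchical strategy can be sketched as follows. This is a minimal illustration, assuming a three-level pyramid built by 2×2 averaging, a MAD cost, integer motion vectors, and a small search window around the scaled predictor at every level. All names, the pyramid filter, and the parameter values are assumptions made for this example.

```python
import random


def downsample(pic):
    """Halve the resolution by averaging each 2x2 neighborhood."""
    return [[(pic[2 * y][2 * x] + pic[2 * y][2 * x + 1]
              + pic[2 * y + 1][2 * x] + pic[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(len(pic[0]) // 2)]
            for y in range(len(pic) // 2)]


def mad(cur, ref, bx, by, dx, dy, size):
    total = 0.0
    for j in range(size):
        for i in range(size):
            total += abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
    return total / (size * size)


def local_search(cur, ref, bx, by, size, predictor, radius):
    """Full search restricted to a small window centered on a predictor MV."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = predictor, float("inf")
    for dy in range(predictor[1] - radius, predictor[1] + radius + 1):
        for dx in range(predictor[0] - radius, predictor[0] + radius + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue
            cost = mad(cur, ref, bx, by, dx, dy, size)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv


def hierarchical_search(cur, ref, bx, by, size=8, levels=3, radius=2):
    """Coarse-to-fine search; returns the MV at the original resolution."""
    cur_pyr, ref_pyr = [cur], [ref]
    for _ in range(levels - 1):
        cur_pyr.append(downsample(cur_pyr[-1]))
        ref_pyr.append(downsample(ref_pyr[-1]))
    mv = (0, 0)
    for level in range(levels - 1, -1, -1):  # lowest resolution first
        scale = 2 ** level
        mv = local_search(cur_pyr[level], ref_pyr[level], bx // scale,
                          by // scale, max(size // scale, 1), mv, radius)
        if level > 0:
            mv = (2 * mv[0], 2 * mv[1])  # scale the MV to the next finer level
    return mv


def make_shifted_pair(width, height, dx, dy, seed=1):
    """Hypothetical test data: current picture = reference shifted by (dx, dy)."""
    rng = random.Random(seed)
    ref = [[rng.randrange(256) for _ in range(width)] for _ in range(height)]
    cur = [[ref[max(0, min(height - 1, y + dy))][max(0, min(width - 1, x + dx))]
            for x in range(width)] for y in range(height)]
    return cur, ref
```

With a per-level radius of 2, the sketch can reach displacements of up to 2 × (4 + 2 + 1) = 14 pixels while evaluating only 25 candidates per level, whereas a single-level search with the same reach would evaluate 29 × 29 candidates. The synthetic shift used for checking is a multiple of 4 so that the coarse levels match exactly; real content offers no such guarantee, and a coarse-level mismatch can propagate to the finer levels.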
The MEMC framework can generate multiple MAD prediction error metrics as shown in FIG. 3 and FIG. 4. FIG. 3 shows MAD calculation using the MEMC framework and one reference from the past. FIG. 4 shows MAD calculation using the MEMC framework and two references from the future. One set of prediction error metrics is generated during motion estimation and corresponds to each reference block or combination of reference blocks, which in turn can originate from different pictures. The second set can be calculated after motion compensation has been completed for all blocks in the current picture. Motion compensation may create unwanted blocking artifacts. These artifacts can be reduced by applying a de-blocking filter on the final motion-compensated picture. Furthermore, the blocks constituting the final prediction picture do not necessarily originate from the same frame (blocks may be selected from among several reference frames). For example, one block could be the weighted combination of blocks in pictures n−1 and n−2, while another block could be predicted from picture n+2. Consequently, the MAD prediction error between this final prediction picture and the original picture may not be the same as the sum of the ME prediction errors. For example, the application of de-blocking on the final motion-compensated prediction picture may result in a difference between the MAD prediction error and the sum of the ME prediction errors.
The motion-compensated temporal analysis module can be used to improve compression performance and the quality of filtering. The module may improve picture and scene complexity classification (pre-analysis). Pre-analysis can affect compression performance and visual quality considerably. It may be used to classify scenes and shots, detect scene changes and gradual scene transitions such as fades. It is also useful for pre-estimating scene complexity, which can then be used to optimize bit rate allocation and the motion-compensated prediction structure used at the video coder (e.g. if, and how many, and where to place bi-predictive coded pictures).
The complexity of a temporal analysis system may be considerable. Consider the example where each input picture is predicted using motion compensation from two past pictures and two future pictures. Initially, each picture block may be predicted from a single block from one of the four possible reference pictures. This will require conducting costly motion estimation four times. However, as shown in Eq. 1, a prediction block may be formed by linearly combining an arbitrary number of prediction blocks originating from different (or even the same) reference pictures. For multiple hypotheses, e.g. predicting a block as a linear combination of multiple prediction blocks corresponding to different MVs and even different reference pictures, one has to jointly estimate multiple motion vectors. Note here that a motion vector that is optimal when used for uni-prediction may not be the same as the optimal motion vector for the same reference picture when it is one of the multiple averaged references. One may reuse the uni-predictive motion vectors to simplify estimation, but the result will be suboptimal. Only a joint estimation of all MVs will provide the optimal performance. However, computationally this is often infeasible. Even if this is constrained to bi-predictive motion estimation, this will require joint optimization of motion estimation for two blocks, 0 and 1 (which will essentially entail testing all the pairwise combinations of the reference frames). To make this tractable, it has been proposed to apply iterative motion estimation where prediction block 0 is fixed and motion estimation is applied to find the best prediction block 1. In the next step, block 1 is fixed, and motion estimation is applied to find a new and better block 0. Then again, block 0 is fixed and motion estimation is applied to refine block 1, and so on.
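The alternating refinement described above can be sketched as follows. This is a minimal illustration, assuming equal weights of one half for the two hypotheses, a MAD cost, integer motion vectors, a fixed search window centered on zero motion, and a stopping rule based on the cost no longer decreasing. The function names and parameter values are assumptions made for this example and do not reproduce any specific published method.

```python
def get_block(pic, x, y, size):
    """Extract the size x size block whose top-left corner is (x, y)."""
    return [row[x:x + size] for row in pic[y:y + size]]


def bi_cost(cur_blk, blk0, blk1):
    """MAD between the current block and the average of two prediction blocks."""
    size = len(cur_blk)
    total = 0.0
    for j in range(size):
        for i in range(size):
            total += abs(cur_blk[j][i] - (blk0[j][i] + blk1[j][i]) / 2.0)
    return total / (size * size)


def refine(cur_blk, fixed_blk, ref, bx, by, size, radius):
    """Search `ref` for the block that, averaged with `fixed_blk`,
    best predicts `cur_blk`. Returns (mv, cost)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, float("inf")
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue
            cost = bi_cost(cur_blk, fixed_blk, get_block(ref, x0, y0, size))
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


def iterative_bi_search(cur, ref0, ref1, bx, by, size=4, radius=2, max_iters=4):
    """Alternate between the two hypotheses; returns (mv0, mv1, cost)."""
    cur_blk = get_block(cur, bx, by, size)
    blk0 = get_block(ref0, bx, by, size)  # start block 0 at zero motion
    best_cost = float("inf")
    for _iteration in range(max_iters):
        # Fix block 0 and search reference 1 for the best block 1.
        mv1, _cost1 = refine(cur_blk, blk0, ref1, bx, by, size, radius)
        blk1 = get_block(ref1, bx + mv1[0], by + mv1[1], size)
        # Fix block 1 and search reference 0 for a better block 0.
        mv0, cost = refine(cur_blk, blk1, ref0, bx, by, size, radius)
        blk0 = get_block(ref0, bx + mv0[0], by + mv0[1], size)
        if cost >= best_cost:  # no further improvement
            break
        best_cost = cost
    return mv0, mv1, cost
```

Each refinement step includes the previously selected candidate in its search window, so the cost never increases from one step to the next. Each step costs one uni-predictive search, so the total work grows linearly with the number of iterations rather than with the product of the two search windows. The result may still be a local optimum rather than the jointly optimal pair.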
Hence, even though temporal filtering can improve compression performance and visual quality, it is very complex and is usually reserved for high-end applications such as DVD and broadcast encoding, where computational complexity is not a big issue.
The complexity cost of motion-compensated pre-analysis becomes prohibitive for applications that are power- and memory-constrained. Power usage suffers due to the large number of motion estimation calculations that have to be performed for each combination of input picture and its possible reference pictures. Furthermore, memory complexity is high due to the large number of past and future reference pictures that have to be maintained in memory during the motion estimation and compensation process. Memory complexity also suffers because the size of motion compensation references may be far larger than the original input size. If, for example, quarter-pixel motion compensation is used to predict a block, then the memory required to store the quarter-pixel accurate picture reference will be 4×4=16 times the memory required to store the original input picture.
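The memory figure can be made concrete with a short calculation. The following is a minimal sketch, assuming that every sub-pixel position of a reference is precomputed and stored, that only one color plane is counted, and that each sample occupies one byte. The picture size of 1920×1080 and the function name are illustrative assumptions.

```python
def reference_bytes(width, height, subpel_factor=1, bytes_per_sample=1):
    """Bytes needed to store one reference plane when every sub-pixel
    position is precomputed (subpel_factor = 4 for quarter-pixel accuracy)."""
    return width * height * subpel_factor * subpel_factor * bytes_per_sample


full_pel = reference_bytes(1920, 1080)
quarter_pel = reference_bytes(1920, 1080, subpel_factor=4)
print(full_pel, quarter_pel, quarter_pel // full_pel)
```

Under these assumptions a single quarter-pixel reference plane takes 33,177,600 bytes instead of 2,073,600, and the total scales linearly with the number of past and future references that are kept.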
Therefore, methods and systems that reduce the computational and memory complexity of motion-compensated temporal pre-analysis while at the same time taking care to achieve high performance pre-analysis, filtering, and motion parameter generation are desirable.