The present invention relates to a method and a device for simultaneous motion estimation and segmentation of digitized images, in particular moving pictures.
In compression of digitized motion pictures the use of motion estimation is a widely used and powerful tool. One commonly used technique for motion estimation is Block Matching (BM) and variants, see for instance A. N. Netravali, B. G. Haskell xe2x80x9cDigital Pictures Representation, Compression and Standardsxe2x80x9d, Planum Press, ISBN 0-306-44917-X.
Block matching techniques usually make the following assumptions:
1) A block moves with translational motion in the image plane.
2) There is only one moving object within a block.
3) The intensities of the pixels are preserved in time.
There are several problems with the block Matching technique and its variants:
i) They give an erroneous estimate if a block covers several moving objects or uncovered background.
ii) They are sensitive to noise in the input frames.
iii) The accuracy of the estimate is generally poor, especially if no sub-pixel matching is performed. If subpixel matching is performed the technique is computationally expensive and real-time implementation is difficult to obtain.
iv) The estimate is poor for blocks moving with non-translatory motion.
v) The estimates are sensitive to illumination changes.
Other frequently used techniques for motion estimation are based on phase-correlation. Their deficiencies are similar to the ones of block matching, i.e. a poor estimate accuracy and robustness to noise, and inability to cope with complex motions, i.e. non-translational and multimodal motions.
Another possible technique is to perform motion estimation based on an image which as been segmented into regions. The image can for example be segmented into regions of slowly varying intensity.
A region based technique, which is conceptually based on the Hough transform and uses Robust Statistical Kernels has been presented in the papers M. Bober and J. Kittler, xe2x80x9cRobust Motion Analysisxe2x80x9d, in the CVPR Conference proceedings, 1994, pages 947-952; M. Bober and J. Kittler, xe2x80x9cEstimation of Complex Multi-modal Motion: An Approach based on Robust Statistics and Hough Transformxe2x80x9d, Proceedings of the British Machine Vision Conference, 1993 pages 239-248 and M. Bober and J. Kittler, xe2x80x9cEstimation of general multimodal motion: an approach based on robust statistics and Hough transformxe2x80x9d, Image and Vision Computing, 1994, vol 12, No 12, pages 661-668, which are incorporated herein by reference.
This method performs simultaneous motion estimation and segmentation by iteratively maximising the support defined by a sum of errors weighted by a robust kernel between two regions: one in a reference frame and the other in the consecutive frame. The position and size of the reference block are arbitrary; whereas the position of the consecutive region is determined by a geometric transformation of the reference block. This prior art technique is also illustrated by the flow diagram in FIG. 8, which shows how input frames are low-pass filtered in a block 801. Then a coarse resolution is set in a block 803, and motion parameters are initialised in a block 805. Thereafter derivatives are calculated in a block 807, which are used for updating the estimate in a block 809. Based on the estimate a new scale is computed in a block 811. After this it is checked in block 813 if the path has retraced. If that condition is false the process returns to the block 807, else a new check is made in a block 815 if the finest resolution is reached. If that is the case the process is terminated in a block 819, and else the resolution is changed to a finer resolution in a block 817 and then the process returns to the block 807.
The parameters of the transformation which maximise the support measure are assumed to describe the motion of the block or image. According to the prior art technique described in the papers by M. Bober and J. Kitler, cited above, the statistical properties of the transformed frame difference are computed after each iteration and the shape of the kernel function is adjusted accordingly.
Contributions to the support measure from pixels with large motion compensation errors are weighted down or removed by the robust kernel. Such pixels usually belong to objects moving independently or a covered/occluded background and are termed outliers. Thus, a mechanism for simultaneous motion estimation and segmentation is introduced by the robust kernel.
The technique described in the above cited prior art papers solved some of the problems associated with the block matching and correlation-based techniques. Hence, the technique can cope with non-translatory motions, e.g. affine models, and with multiple moving objects within a block. However, the technique still exhibits several deficiencies, such as failing to converge for regions with complicated motions and still being computationally expensive.
It is an object of the present invention to overcome the problems associated with the prior art technique and to increase its robustness and the quality of the motion segmentation.
This object is obtained by adding new steps and by modification of some parts of the prior art technique.
In particular, the invention aims at solving the problems with the convergence of the prior art technique as described in the above cited papers by M. Bober and J. Kittler, and to increase the computational efficiency thereof. According to the invention, a modified gradient-based search technique is used and the centre of the coordinate system is placed in the centre of the region, image or block, where the estimation is applied. The use of such modified gradients, i.e. gradients scaled by different factors reduce the number of iterations needed, and thus also the complexity of the technique. It also improves the convergence, i.e. the modified search is more likely to converge to the true motion parameters.
Further, the median absolute deviation is used in the prior art technique as scale estimate. This has turned out to cause problems with convergence and accuracy. When there is only one moving object within a block, during iterations on finer resolutions, the values of scale become very small and many pixels, which are not outliers, are removed from the estimation process. To overcome this deficiency, a small constant is added to the MAD (Median Absolute Deviation) estimate of scale.
Further, in the prior art technique, the scale of residual errors has to be recomputed after each update of motion parameters. This has been found to be very inefficient, and the efficiency of the method has been found to be significantly improved by introducing the following modification. When iterations proceed on the coarse resolution in the parameter space, the scale is updated after each step. However, on finer resolutions, i.e. subpixel resolution in case of a translational motion model, the scale is only updated every k steps, k being an integer greater than one.
Moreover, the optimisation on the discrete grid, as described in the prior art, does not specify how to select a good starting point for the interations after a change of resolution in the parameter space. The description only defines the condition for change of the resolution in the parameter space when the path retraces. It has now been found that if during iterations, a record of the point in the parameter space corresponding to the lowest value of the error function is saved and used as the starting point at finer resolution, a significant improvement of the computational efficiency is obtained.
In addition, a good iteration termination condition is not stated in the prior art. However, it has been found that the iteration can be terminated, and provide a good result when one of the following conditions is fulfilled:
i) the number of iterations at a given resolution exceeds the iteration threshold, or
ii) the value of the median estimate of scale is below the scale threshold, or
iii) the path is retraced on the finest resolution:
Furthermore, the speed and convergence of the prior art technique have been found to be possible to improve, e.g. for sequences with large motion. In order to improve the speed and convergence, a hybrid scheme where the initial estimate at coarse resolution is determined by a non-iterative technique, such as the phase-based technique, can be used. That is, first at the low resolution stage a fast non-iterative technique is used. Then, at the medium or fine resolution stages the modified Hough Transform using Robust Statistical Kernels is used.
Further, the prior art technique is only applied to a block, i.e. regions of rectangular size, and the centre of the coordinate system is placed in the corner of the image. However, it has been found that the framework is not limited to block shaped regions, but can be extended to regions of any shape. In the latter case, the placement of the centre of the coordinate system however becomes crucial for the performance of the technique. It has been found that by using the centre of gravity of the reference region as the origin of the coordinate system, a good performance is obtained.
Further, in the prior art technique, it is not specified what the specific resolution for motion parameters a1. . . a6 nor what the relation between parameters should be. A solution to this problem is to relate the resolution in parameter space to the maximum displacement which the estimator is expected to estimate and to the size of the region.
Finally, the accuracy of the prior art technique has been found to be improved by prefiltering the initial image with a low-pass filter, preferably Gaussian shaped, and then storing the resulting image in an array with increased dynamic range. All successive processing is performed on the images stored with increased dynamic range.
The technique can employ any parametric motion model. It is particularly suited to be used in:
1) a translational motion model, that is:
dx=a1
dy=a2
2) a four-parameter motion model, that is:
dx=a1xxe2x88x92a2y+a3
dy=a2x+a1y+a4
3) an affine motion model, that is:
dx=a1x+a2y+a3
dy=a4x+a5y+a6
Where dx, dy are the components of the displacement vector d=(dx, dy) for a pixel at location (x, y), and where a=(a1. . . a6) are the motion parameters.