1. Field of the Invention
The present invention relates to VLSI architectures and, more particularly, to VLSI architectures for real-time, low-complexity motion estimation procedures.
2. Description of the Related Art
Motion estimation is a key issue in video coding techniques (for example H.263/MPEG) and in image filtering. In particular, algorithms based on predictive spatio-temporal techniques achieve high coding quality with reasonable computational power by exploiting the spatio-temporal correlation of the video motion field.
Especially in the last decade, multimedia communication techniques have experienced rapid growth and great market success in several applications, including videotelephony, videoconferencing, distance working and learning, tele-medicine, and home or kiosk banking, to name but a few. This trend is becoming increasingly marked with the most recent information and communication technologies based on VDSL, ADSL and ISDN lines, also for residential use, and in third generation personal communication systems (IMT-2000). In this scenario, compression of the video signal, aimed at reducing the large quantities of data and the transmission bandwidth required to store and transmit signals, plays a fundamental role. Hence several compression standards, such as H.261, H.263 and MPEG (versions 1, 2, 4), were developed, especially by ISO and ITU-T.
In this respect the following papers/works may be usefully referred to:
MPEG2, “Generic Coding of Moving Pictures and Associated Audio”, ISO/IEC 13818-2, March 1994;
Telecommunication standardization sector of ITU, “Video Coding for Low Bit rate Communication”, Draft 21 ITU-T, Recommendation H.263 Version 2, January 1998;
F. Kossentini et al., “Towards MPEG-4: an Improved H.263 Based Video Coder”, Signal Processing: Image Communic, Special Journal, Issue on MPEG-4, vol. 10, pp. 143-148, July 1997.
The relevant codecs call for very powerful hardware architectures. In that respect the use of DSP techniques and advanced VLSI technology is of paramount importance to meet the requirements of low cost and real-time operation, which are almost mandatory to obtain truly satisfying results on the market.
In this respect, to reduce both time-to-market and the design costs typical of VLSI technology, design reuse methodologies are used extensively. This solution requires the development of VLSI IP (Intellectual Property) cells that are configurable, parametric and synthesizable and thus adapted to be reused in a wide range of applications. In particular, the design of high performance ASIC structures is required for Motion Estimation (ME) applications, which are the most complex part of ISO/ITU-T codecs.
Motion Estimation (ME) systems exploit the temporal correlation between adjacent frames in a video sequence to reduce the interframe data redundancy.
In this regard, works such as:
P. Pirsch et al., “VLSI architectures for video compression, A Survey”, Proc. of IEEE, vol. 83, n. 2, pp. 220-246, February 1995;
Uramoto et al., “A Half Pel Precision Motion Estimation Processor for NTSC Resolution Video”, Proc. IEEE Custom Integ. Circ. Conf., 1993;
Tokuno et al., “A motion Video Compression LSI with Distributed Arithmetic Architecture”, Proc. IEEE Custom Integ. Circ. Conf., 1993;
H. Nam and M. K. Lee, “High Throughput BM VLSI Architecture with low Memory Bandwidth”, IEEE Trans. on Circ. And Syst., vol. 45, n. 4, pp. 508-512, April 1998;
L. Fanucci, L. Bertini, S. Saponara et al., “High Throughput, Low Complexity, Parametrizable VLSI Architecture for FS-BM Algorithm for Advance Multimedia Applications”, Proc. of the ICECS '99, vol. 3, pp. 1479-1482, September 1999,
describe solutions based on the technique currently named Full Search Block-Matching or, in short, FS-BM.
In this technique the current frame of a video sequence is divided into N×N blocks (reference blocks) and, for each of them, an N×N block in the previous frame (candidate block), addressed by a motion vector (MV), is exhaustively searched for within a search area of (2ph+N)×(2pv+N) to find the best match according to a given cost function.
This technique achieves a high coding quality at the expense of a high computational load, and hence it limits a practical real-time and low power/cost implementation of the motion estimation.
For example, for a typical image format such as CIF (352×288 pixels) at 30 frames/s, with N=16 and ph=pv=16, and adopting the Sum of Absolute Differences (SAD) cost function, about 3×10^9 absolute difference (AD) operations per second are required.
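The order of magnitude quoted above can be checked with a short calculation. The following Python sketch is illustrative only: the function name is hypothetical, and it assumes that 2ph×2pv candidate positions are evaluated per block (counting 2p+1 positions per axis instead gives a slightly larger figure of the same order).

```python
def fsbm_ad_ops_per_second(width, height, fps, n, ph, pv):
    """Approximate number of absolute-difference (AD) operations per second
    for full-search block matching (hypothetical helper, for illustration)."""
    blocks_per_frame = (width // n) * (height // n)   # N x N reference blocks
    # Assumption: 2*ph horizontal and 2*pv vertical displacements are tested.
    candidates_per_block = (2 * ph) * (2 * pv)
    ads_per_candidate = n * n                         # one AD per pixel of the block
    return blocks_per_frame * candidates_per_block * ads_per_candidate * fps


# CIF at 30 frames/s with N = 16 and ph = pv = 16
ops = fsbm_ad_ops_per_second(352, 288, 30, 16, 16, 16)
```

With these parameters the result is roughly 3.1×10^9 operations per second, in line with the figure given above.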
The SAD is defined by the formula:
SAD(m,n) = ΣΣ |a(i,j,T) − a(i+n,j+m,T−1)|
where the two sums are extended to all i and j values from 0 to N−1, a (i, j, T) represents the intensity of a generic pixel of the reference block, and a (i+n, j+m, T−1) represents the intensity of the corresponding pixel in the candidate block, shifted by a motion vector of coordinates (m, n).
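A minimal Python sketch of the SAD and of an exhaustive block-matching search follows. It is an illustration under stated assumptions, not the implementation of any of the cited works: frames are plain lists of lists indexed as frame[i][j], the second index j is taken as the horizontal direction (so m is bounded by ph and n by pv), candidates falling outside the frame are skipped, and the function names are hypothetical.

```python
def sad(cur, prev, bi, bj, m, n, N):
    """SAD between the N x N reference block whose first pixel is (bi, bj)
    in `cur` and the candidate block in `prev` displaced by vector (m, n)."""
    total = 0
    for i in range(N):
        for j in range(N):
            # First index shifted by n, second by m, as in the SAD formula.
            total += abs(cur[bi + i][bj + j] - prev[bi + i + n][bj + j + m])
    return total


def full_search(cur, prev, bi, bj, N, ph, pv):
    """Test every displacement in the search range and return
    ((m, n), SAD) for the best match; ties keep the first one found."""
    rows, cols = len(prev), len(prev[0])
    best_mv, best_cost = None, None
    for n in range(-pv, pv + 1):
        for m in range(-ph, ph + 1):
            # Assumption: candidates that leave the frame are not evaluated.
            if not (0 <= bi + n and bi + n + N <= rows):
                continue
            if not (0 <= bj + m and bj + m + N <= cols):
                continue
            cost = sad(cur, prev, bi, bj, m, n, N)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (m, n), cost
    return best_mv, best_cost
```

The nested loops make the cost of the exhaustive search explicit: every candidate displacement costs N×N absolute differences.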
To reduce this computational load while maintaining the same coding quality, several fast motion estimation algorithms were proposed in the literature.
In this regard, in addition to the aforesaid work by F. Kossentini et al., useful reference may be made to the following works:
M. Ghanbari, “The Cross Search Algorithm for Motion Estimation”, IEEE Trans. Communic., Vol. 38, pp. 950-953, July 1990;
C. -C. J. Kuo et al., “Fast Motion Vector Estimation Using Multiresolution Spatio-Temporal Correlations”, IEEE Trans. on Circ. and Syst. for Video Technology, Vol. 7, No. 3 pp. 477-488, June 1997;
A. Ortega et al., “A Novel Computationally Scalable Algorithm for Motion Estimation”, VCIP '98, January 1998;
F. Kossentini, R. K. Ward et al., “Predictive RD Optimized Motion Estimation for Very Low Bit-Rate Video Coding”, IEEE Journal on Selected Areas in Communications, Vol. 15, No. 9, pp. 1752-1763, December 1997,
and to European patent application 00830332.3.
Other relevant documents are: EP97830605.8, EP98830163.6, EP98830484.6, EP97830591.0, EP98830689.0, EP98830600.7, EP99830751.6.
Other relevant publications are:
F. Rovati, D. Pau, E. Piccinelli, L. Pezzoni, J-M. Bard, “An innovative, high quality and search window independent motion estimation algorithm and architecture for MPEG-2 encoding”, IEEE 2000 international conference on consumer electronics.
F. Scalise, A. Zuccaro, A. Cremonesi, “Motion estimation on very high speed signals. A flexible and cascadable block matching processor”, international workshop on HDTV, Turin 1994.
F. Scalise, A. Zuccaro, M. G. Podesta, A Cremonesi, G. G. Rizzotto, “PMEP: a single chip user-configurable motion estimator processor for block-matching techniques”, 135th SMPTE technical conference.
A. Artieri, F. Jutland, “A versatile and Powerful chip for real time motion estimation”, ICASSP 1989.
STMicroelectronics, “ST13220 motion estimation processor”, datasheet, July 1990.
Many of the solutions described in the aforesaid works, based on predictive spatio-temporal algorithms, achieve very good performance in terms of reduced computational load and high coding quality by exploiting the spatial and temporal correlation of the motion vectors field.
In a video sequence, particularly in low bit rate applications, the motion field usually varies slowly with a high correlation along both horizontal and vertical directions: in this regard, see the work mentioned above by C. -C. J. Kuo et al.
By exploiting this correlation, the motion vector of a given block can be predicted from a set of initial candidate motion vectors (MVS) selected from its spatio-temporal neighbors, according to a certain law.
This first step is called the predictive phase.
To further reduce the residual estimation error, a refinement process is performed using the predicted motion vector as the starting point.
This second step is called the refinement phase.
Several of the works mentioned above are based on this approach. They differ from each other both in the predictive phase (MVS structure or predictor selection law) and in the refinement phase (size and shape of the refinement grid, stop search conditions, use of half-pixel accuracy).
In the work by Kuo et al. mentioned several times above, both spatial and spatio-temporal predictive algorithms were proposed. In the former, for each reference block B (i, j, T) the MVS is made up of the motion vectors of the four spatial neighboring blocks (B (i, j−1, T), B (i−1, j, T), B (i−1, j−1, T), B (i−1, j+1, T)), while in the latter the MVS is made up of the motion vectors of its two spatial and three temporal neighboring blocks (B (i, j−1, T), B (i−1, j, T), B (i, j, T−1), B (i, j+1, T−1), B (i+1, j, T−1)).
In both cases, the motion vector in the MVS that minimizes a given cost function is chosen. One example is the cost function obtained by dividing the SAD by the number of pixels in a block, which is defined as the Mean of Absolute Differences (MAD).
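The predictive phase described above can be sketched in Python as follows. This is an illustrative reading of the text, not code from the cited work: the motion vector fields are assumed to be dictionaries mapping a block index (i, j) to a vector, neighbors missing at frame borders are skipped, a zero vector is used when no neighbor exists, and the cost function is passed in as a callable. All names are hypothetical.

```python
# Neighbor offsets (di, dj, dt): dt = 0 is the current frame T, dt = -1 is T-1.
SPATIAL = [(0, -1, 0), (-1, 0, 0), (-1, -1, 0), (-1, 1, 0)]
SPATIO_TEMPORAL = [(0, -1, 0), (-1, 0, 0), (0, 0, -1), (0, 1, -1), (1, 0, -1)]


def candidate_set(i, j, mv_cur, mv_prev, neighbours):
    """Build the MVS of block B(i, j, T) from already estimated vectors."""
    mvs = []
    for di, dj, dt in neighbours:
        field = mv_cur if dt == 0 else mv_prev
        mv = field.get((i + di, j + dj))      # None if outside the frame
        if mv is not None and mv not in mvs:  # drop duplicate candidates
            mvs.append(mv)
    # Assumption: fall back to the zero vector when no neighbor is available.
    return mvs or [(0, 0)]


def best_predictor(mvs, cost):
    """Return V0, the candidate that minimizes the cost function (e.g. MAD)."""
    return min(mvs, key=cost)
```

Only a handful of candidates are evaluated per block, which is where the complexity saving with respect to an exhaustive search comes from.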
The vector chosen as the best predictor (V0) is used as the starting point for further refinement. In the refinement process, the MAD (V0) is compared with a threshold (TH1). If it is lower, then V0 is chosen as the final motion vector and the search stops; otherwise, an exhaustive search in a 3×3 pixel grid, centered on V0, is performed. If the new minimum MAD corresponds to the center of the grid or is lower than TH1, the procedure stops. Otherwise, the refinement iterates until one of the above stopping criteria is reached, centering the search grid on the point which minimizes the MAD.
The algorithms also fix the maximum number of iterations (Smax) beyond which the search is stopped. For example, in the work by Kuo et al. mentioned previously, the average number of search steps is two for most practical applications.
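The iterative refinement just described can be sketched in Python as follows. The sketch assumes integer-pixel vectors and a cost callable standing in for the MAD; the function and parameter names are hypothetical, and border handling is omitted.

```python
def refine(v0, mad, th1, s_max):
    """Iterative 3x3 refinement around the predictor v0.

    Stops when the cost drops below th1, when the minimum stays at the
    grid center, or after s_max search steps.
    Returns (final vector, its cost, number of search steps performed).
    """
    center, best = v0, mad(v0)
    steps = 0
    while best >= th1 and steps < s_max:
        steps += 1
        # The eight neighbors of the current center on the 3x3 grid.
        grid = [(center[0] + dm, center[1] + dn)
                for dm in (-1, 0, 1) for dn in (-1, 0, 1)
                if (dm, dn) != (0, 0)]
        candidate = min(grid, key=mad)
        cost = mad(candidate)
        if cost >= best:
            break                      # minimum is at the grid center
        center, best = candidate, cost  # recenter the grid and iterate
    return center, best, steps
```

Because the number of steps depends on the data, the computational load of this scheme varies from block to block, bounded only by the maximum number of iterations.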
Alternatively, the work by Ortega et al., mentioned previously, is based on a so-called baseline motion estimation technique.
The main difference regards the cost function adopted, the SAD, and the possibility of performing a half-pixel resolution search as foreseen by the main coding standards.
The European patent application mentioned previously describes a spatio-temporal predictive technique which does not iterate in the refinement phase and hence yields a constant-complexity algorithm. The initial predictive phase, in which V0 is selected from a set of four motion vectors using the SAD as the cost function, is followed by a refinement phase on a grid centered on the position pointed to by V0 and made up of four points in the cross directions and four points in the diagonal directions. Because this algorithm works with half-pixel precision, the points in the cross directions are at a ½ pixel distance from the center, while the points in the diagonal directions are at a 1 or 3 pixel distance. The amplitude of the grid corner points is selected according to the following rule: if SAD (V0) is greater than TH1, V0 is likely to be a poor predictor and so the search area must be enlarged.
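A possible reading of this single-step, flexible-grid refinement is sketched below in Python. Several points are assumptions made for illustration and are not stated in the text: vectors are expressed in half-pixel units, the diagonal distance is applied along each axis, and the center V0 competes with the eight grid points. The names are hypothetical.

```python
def flexible_grid(v0, sad_v0, th1):
    """Eight refinement points around v0, in half-pixel units."""
    # Corner amplitude: 3 pixels (6 half-pels) when V0 looks like a poor
    # predictor, 1 pixel (2 half-pels) otherwise.
    d = 6 if sad_v0 > th1 else 2
    cross = [(1, 0), (-1, 0), (0, 1), (0, -1)]      # 1/2 pixel from center
    diagonal = [(d, d), (d, -d), (-d, d), (-d, -d)]
    return [(v0[0] + dx, v0[1] + dy) for dx, dy in cross + diagonal]


def single_step_refine(v0, sad, th1):
    """One refinement step: evaluate v0 and the flexible grid, keep the best."""
    candidates = [v0] + flexible_grid(v0, sad(v0), th1)
    return min(candidates, key=sad)
```

Whatever the value of SAD (V0), exactly eight additional points are evaluated, so the number of cost-function evaluations per block is constant.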
With this approach the refinement phase based on the iterative application of a fixed grid is replaced by the one based on a single step with a flexible grid.
As foreseen by the main coding standards, at the end of the motion estimation procedure, the residual matching error must be compared with a threshold TH2 (obviously greater than TH1) to evaluate the convenience of adopting an inter coding strategy rather than an intra coding one. In any case this class of algorithms allows for a noteworthy complexity reduction (for some algorithms up to 98%) compared with techniques based on a full search (FS-BM).
The coding quality is almost the same: the reduction in PSNR (Peak Signal-to-Noise Ratio) in the worst cases is limited to a tenth of a dB, and the increase in Mean Square Error (MSE) is less than a few per cent.
From the above, it is evident that spatio-temporal predictive algorithms are potentially very suitable for real time and low complexity motion estimation in multimedia applications with the application of ASIC structures and the use of VLSI technologies.
Therefore, the disclosed embodiments of the present invention provide a VLSI architecture that can be utilized for motion estimation applications in real time and with a lower implementation cost.
In the currently preferred embodiment, the architecture proposed in the invention, obtained using a design-reuse methodology, is parametric and configurable and hence it enables the implementation of different predictive algorithms.
Preferably, it features hardware complexity scalability and is suitable for the design of ASICs optimized for a wide range of possible multimedia applications.
In a particularly preferred form of implementation of the invention, foreseen for synthesis with a 0.25 micron CMOS technology, the solution proposed by the invention achieves a computational power up to 740×10^6 absolute differences per second, for a maximum 0.96 mm² core size, and permits the processing of typical video sequences at clock frequencies of a few MHz.
In accordance with one embodiment of the invention, a VLSI architecture, particularly for motion estimation applications of video sequences having subsequent frames organized in blocks by means of identification of motion vectors that minimize a given cost function is provided. The architecture is adapted to cooperate with an external frame memory and includes a motion estimation engine configured to process the cost function and identify a motion vector that minimizes it; an internal memory configured to store sets of initial candidate vectors for the blocks of a reference frame; a first controller to manage motion vectors, the first controller configured to provide, starting from the internal memory, the estimation engine with the sets of initial candidate vectors and to update the internal memory with the motion vectors identified by the estimation engine; a second controller to manage the external frame memory, the second controller configured to provide the estimation engine with the candidate blocks; and a reference synchronizer to align, at the input of the estimation engine, the data relevant to the reference blocks with the data relevant to the candidate blocks coming from the second controller.