1. Field of Invention
The invention relates to the field of video sequence processing and more specifically to memory control for a motion estimation processor and associated search pattern.
2. Description Relative to the Prior Art
In a typical video sequence, neighboring pictures represent snapshots of a scene separated by a very short time interval. There is a great amount of similarity between consecutive pictures, particularly in the background areas. A well-known technique in video sequence coding to reduce the bit rate, called interframe coding, is to transmit the differences between pictures or frames. In an ideal situation, this technique can avoid the need to repeatedly transmit the information corresponding to the static background. There is a well-known advancement in video sequence coding, called the Block Matching Algorithm (BMA) for motion estimation. The BMA was developed by J. R. Jain and A. K. Jain, and the details are described in their publication entitled “Displacement Measurement and Its Application in Interframe Image Coding,” in IEEE Trans. on Communications, vol. COM-29, pp. 1799–1808, December 1981.
The objective of the BMA is to further improve the efficiency of interframe coding by taking into consideration the effect of object movement in the video sequence. Instead of forming the direct difference between consecutive frames, the BMA shifts the previous picture to compensate for the object movement and then takes the difference between the current picture and the shifted previous picture. Such a coding system is commonly called motion-compensated interframe coding.
In practice, it would be very computationally difficult to derive the horizontal and vertical displacements, called the motion vector, for an arbitrarily shaped object. The BMA simplifies the situation by dividing the picture into small rectangular blocks and assuming that the object undergoes a planar movement only. This simplified model works satisfactorily when the block is inside the object boundary and the time interval between two pictures is small enough so that any movement (3D rotation or spin) can be reasonably modeled as a planar movement. Due to the effectiveness of bit rate reduction, the BMA and its variations have been widely used in various video coding standards. The BMA has to compute the block difference BDk,l(x,y) defined as:
$$BD_{k,l}(x,y) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} \mathrm{Dist}\left( I_{k,l}(m,n) - I'_{k,l}(m-x,\, n-y) \right),$$
where Ik,l(m,n) is the current block to be motion compensated, I′k,l(m,n) is the corresponding reference block from a previously reconstructed picture, Dist(●) is a distortion measurement applied to the pixel difference, M and N are the horizontal and vertical dimensions of the block respectively, (x,y) is the displacement, and k and l are the block indexes in the horizontal and vertical directions. In practice, either the absolute value or the squared value has often been used as the distortion measure. The BMA searches a region in the previous picture corresponding to the underlying block in the current picture. The size of the search region, also called a search window, depends on the anticipated largest displacement between two pictures. In order to find the best match, every location in the search window has to be processed. In other words, the BMA computes and compares BDk,l(●,●) for all (x,y) in the window and selects the (x,y) that achieves the minimum block distortion as the motion vector for the block.
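The search just described can be illustrated with a minimal sketch in Python. It assumes the absolute value as the distortion measure, pictures stored as lists of rows, and a block located by its top-left corner (top, left). The function names, the argument layout, and the choice to skip candidates that fall outside the reference picture are assumptions made for illustration and are not taken from the cited publication.

```python
def block_difference(cur, ref, top, left, M, N, x, y):
    """Sum of absolute differences for displacement (x, y).

    cur, ref: pictures as lists of rows (hypothetical layout).
    (top, left): top-left corner of the current block.
    M, N: horizontal and vertical block dimensions.
    """
    total = 0
    for n in range(N):          # vertical index within the block
        for m in range(M):      # horizontal index within the block
            current = cur[top + n][left + m]
            reference = ref[top + n - y][left + m - x]  # I'(m - x, n - y)
            total += abs(current - reference)
    return total


def full_search(cur, ref, top, left, M, N, I, J):
    """Evaluate every (x, y) with |x| <= I and |y| <= J; return the best."""
    height, width = len(ref), len(ref[0])
    best_vector, best_bd = None, None
    for y in range(-J, J + 1):
        for x in range(-I, I + 1):
            # Assumption: skip candidates whose reference block leaves the picture.
            if top - y < 0 or top - y + N > height:
                continue
            if left - x < 0 or left - x + M > width:
                continue
            bd = block_difference(cur, ref, top, left, M, N, x, y)
            if best_bd is None or bd < best_bd:
                best_vector, best_bd = (x, y), bd
    return best_vector, best_bd
```

Calling full_search(cur, ref, top, left, M, N, I, J) returns the displacement with the smallest block difference together with that difference. Ties are resolved in favor of the first candidate visited, which is one arbitrary choice among several.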
While the BMA is very useful for video coding, its computational complexity is extremely high. The complexity for calculating the block difference is proportional to MN, where M and N are the dimensions of the block. If the search region covers from −I to +I pixels horizontally and from −J to +J pixels vertically, the total number of locations to be searched is (2I+1)(2J+1). A straightforward implementation would search every location in the window and this method is referred to as a “full” search. The total number of computations required for each block is roughly proportional to 4IJMN. It would be extremely challenging to perform this task in real time especially for large search windows required for high quality video sequences.
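The operation counts above can be checked with a short calculation. The following sketch is illustrative only; the function name and the example values (a 16 by 16 block and a window of −16 to +16) are assumptions chosen for the arithmetic, not parameters stated for any particular system.

```python
def full_search_cost(I, J, M, N):
    """Return (search locations, pixel-difference operations) per block."""
    locations = (2 * I + 1) * (2 * J + 1)   # every (x, y) in the window
    pixel_ops = locations * M * N           # one distortion term per pixel
    return locations, pixel_ops


def approximate_cost(I, J, M, N):
    """The rough 4IJMN estimate, which drops the lower-order terms."""
    return 4 * I * J * M * N


if __name__ == "__main__":
    # Hypothetical example: 16x16 block, window of -16..+16 in both directions.
    print(full_search_cost(16, 16, 16, 16))
    print(approximate_cost(16, 16, 16, 16))
```

For these example values the exact count is 1089 locations and 278784 pixel-difference operations per block, against 262144 from the 4IJMN approximation.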
Over the years, there has been considerable development activity in the area of “fast block matching algorithms”, which address the issue of reducing the number of required search locations. In general, such a method starts out with a small number of candidate locations, including the original location, and computes the block difference for each candidate. Based on the outcomes, the method either moves to a new location or stays at the original location, whichever results in the smallest block difference. If a new location results in the smallest block difference, the search origin is moved to this new location and the process repeats. If the original location results in the smallest block difference, the method narrows the search area by examining surrounding locations that are closer than the previous candidate locations. If the search area has been reduced to a minimum or the block difference is smaller than a pre-determined threshold, the search stops.
The fast search algorithms can substantially reduce the number of searches. However, they may sometimes miss the best match, which has a negative impact on the coding efficiency. Among the fast search algorithms, the well-known “three-step search” was developed by T. Koga, et al., and is described in the publication entitled “Motion-compensated Interframe Coding for Video-Conferencing,” in Proceedings of IEEE National Telecommunication Conference (New Orleans, La.), pp. G5.3.1–G5.3.5, November 1981. The three-step search has shown the capability to reduce the number of searches by a factor of more than 10 with some loss in coding efficiency. The three-step search covers only a small search window in the original publication. It is possible that the three-step search could be expanded to cover larger search windows. However, the coding efficiency would probably be greatly compromised. The three-step search and its variations are more popular for software- or DSP-based implementations than for hardware-based implementations. Nevertheless, dedicated hardware for the three-step search has also been reported, such as the invention in U.S. Pat. No. 6,160,850 and the publication by T-H Chen entitled “A Cost-Effective Three-Step Hierarchical Search Block-Matching Chip for Motion Estimation,” in IEEE Journal of Solid State Circuits, vol. 33, no. 8, August 1998.
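The general procedure described above can be sketched as follows. This is a generic illustration under stated assumptions, not the algorithm of any particular publication: the eight-neighbor candidate pattern, the starting step of 4, halving the step to narrow the search, and the names sad and fast_search are all hypothetical choices. The block at zero displacement is assumed to lie inside the reference picture.

```python
def sad(cur, ref, top, left, M, N, x, y):
    """Sum of absolute differences for displacement (x, y)."""
    return sum(abs(cur[top + n][left + m] - ref[top + n - y][left + m - x])
               for n in range(N) for m in range(M))


def fast_search(cur, ref, top, left, M, N, max_disp, start_step=4, threshold=0):
    """Greedy search: move the origin while it helps, otherwise narrow the step."""
    height, width = len(ref), len(ref[0])

    def allowed(x, y):
        # Stay inside the search window and inside the reference picture.
        return (abs(x) <= max_disp and abs(y) <= max_disp
                and 0 <= top - y and top - y + N <= height
                and 0 <= left - x and left - x + M <= width)

    cx, cy = 0, 0                                   # current search origin
    best = sad(cur, ref, top, left, M, N, cx, cy)
    step = start_step
    # Stop when the step is exhausted or the difference drops below threshold.
    while step >= 1 and best >= threshold:
        bx, by = cx, cy
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                x, y = cx + dx, cy + dy
                if (dx, dy) == (0, 0) or not allowed(x, y):
                    continue
                bd = sad(cur, ref, top, left, M, N, x, y)
                if bd < best:
                    best, bx, by = bd, x, y
        if (bx, by) != (cx, cy):
            cx, cy = bx, by          # a candidate won: move the origin, repeat
        else:
            step //= 2               # the origin won: examine closer locations
    return (cx, cy), best
```

Because only a few candidates are evaluated at each stage, the result is never worse than the zero-displacement match but is not guaranteed to equal the full-search minimum.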
There is another category of approaches to the high computational complexity issue, which is a hardware solution using massively parallel processing elements. Due to the advancement in VLSI technology, it has become more affordable to incorporate multiple processing elements on a single chip to perform the same task in parallel. The computation of the block difference consists of computations of the difference for individual pixels within the block. This computation has long been recognized as an ideal place to utilize parallel processors, and there have been many technical publications on this subject over the years. One of the frequently referenced publications is entitled “A Family of VLSI Design for the Motion Compensation Block-Matching Algorithm”, by K-M Yang, et al., in IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1317–1325, October 1989. This publication presents a modular VLSI architecture based on data-flow design that allows sequential data inputs but performs parallel processing. Another frequently referenced article is entitled “A Novel Modular Systolic Array Architecture for Full-Search Block Matching Motion Estimation,” by Yeo and Hu in IEEE Transactions on Circuits and Systems for Video Technology, vol. 5, no. 5, pp. 407–416, October 1995. They present a scalable systolic architecture that allows cascading multiple parallel processors of smaller size to form parallel processors of larger size. Both of the above-mentioned techniques use sequential input data that matches the pipelined processing of their system architecture. Furthermore, Yeo and Hu's method is intended to deal with smaller search windows, since a search window much larger than the block size would complicate the interconnections among parallel processors.
VLSI fabrication technology today is capable of squeezing millions of transistors into a single chip. It has become more affordable to utilize one processing unit for each pixel in a block when computing the block difference, in order to achieve the maximum possible processing speed. In a conventional parallel processor approach to high-speed motion estimation, the reference memory arrangement is not optimized for the situation in which a full set of processing units is used. It is also not optimized to conserve power. Though the conventional approach is not optimized for memory access speed, it may be adequate for some real-time applications where the search window is relatively small, for example, from −16 to +16 pixels in both the horizontal and vertical directions. If the search window is extended by a factor of 3, i.e., from −48 to +48, in both the horizontal and vertical directions, the number of search locations increases by a factor of roughly 3², or about 9 times as many. To accommodate the search over large windows in real time, it is necessary to employ more and more processing elements in parallel. A search over large windows also increases the number of memory accesses to the reference picture, which results in much higher power consumption. It is therefore crucial for the commercial success of a block-matching motion estimation subsystem to achieve high speed and to conserve power.