1. Field of the Invention
The present invention relates generally to a method of fast motion estimation, and more particularly to a method employing data prediction and data reuse techniques for fast motion estimation.
2. The Prior Arts
In order to save the storage medium space for storing image data and reduce the bandwidth used for transmitting the image data, original image data is often compressed to obtain compressed image data. When the image data is to be displayed, the compressed image data is recovered to displayable image data by executing a decompression process. The compression process is known as a coding process, while the decompression process is known as a decoding process.
FIG. 1 is a block diagram schematically illustrating the operation of a conventional image data coding system. Referring to FIG. 1, the image data coding system includes motion estimation S10, motion compensation S12, block codes S14, and variable length codes S16, by which a P-frame bitstream, i.e., the compressed data, is generated. Among the foregoing, the motion estimation S10 consumes a large share of system resources, such as memory space, computation time, and power. Generally speaking, the motion estimation may account for 76% of memory accesses, 77% of memory bandwidth, and 78% of computation time. As such, it is highly desirable to enhance the efficiency of the motion estimation S10 and thereby improve the overall coding efficiency.
FIG. 2 is a schematic diagram illustrating the motion prediction of the conventional technology. Referring to FIG. 2, a search range 50 is selected from a reference frame 40 according to a current block 30 in a current frame 20. Then, a best matching algorithm (BMA) is utilized to find a best matching block 60 among all reference blocks in the search range 50, thus obtaining a corresponding motion vector for the subsequent variable length codes S16. Suppose that the current block 30 is an N×N block, in which N represents the side length of the current block, e.g., 16 in this example. The BMA is then defined by the following equation.
$$\mathrm{SAD}(i,j)=\sum_{m=0}^{15}\sum_{n=0}^{15}\left|X(m,n)-Y(m+i,n+j)\right|$$
In the equation, SAD represents the sum of absolute differences, X(m, n) represents the image data of the current block 30 at coordinates (m, n), and Y(m+i, n+j) represents the image data of the reference block at coordinates (m+i, n+j), in which i is a horizontal coordinate, j is a vertical coordinate, and i and j are integers. The best matching block 60 is the reference block having the minimum SAD value. MV(i, j) shown in FIG. 2 represents a motion vector directed from coordinates (m, n) to coordinates (m+i, n+j).
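The search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any particular coding system: the function names `sad` and `full_search` are hypothetical, the offsets (i, j) are assumed to be measured from the top-left corner of the search window so that all indices are non-negative, and the pixel data is toy data.

```python
def sad(cur, window, i, j, n):
    """Sum of absolute differences for the candidate at offset (i, j).

    cur    -- n x n current block, indexed cur[row][col]
    window -- search window, indexed window[row][col]
    i, j   -- horizontal and vertical offsets (assumed non-negative here)
    """
    total = 0
    for row in range(n):          # vertical coordinate (n in the equation)
        for col in range(n):      # horizontal coordinate (m in the equation)
            total += abs(cur[row][col] - window[row + j][col + i])
    return total


def full_search(cur, window, n):
    """Return (min_sad, i, j) over every candidate position in the window."""
    sr_v = len(window) - n + 1      # number of vertical candidate positions
    sr_h = len(window[0]) - n + 1   # number of horizontal candidate positions
    best = None
    for j in range(sr_v):
        for i in range(sr_h):
            cost = sad(cur, window, i, j, n)
            if best is None or cost < best[0]:
                best = (cost, i, j)
    return best


# Toy data: a 6 x 6 window and a 4 x 4 block copied from offset (i=2, j=1).
window = [[10 * r + c for c in range(6)] for r in range(6)]
block = [row[2:6] for row in window[1:5]]
print(full_search(block, window, 4))  # prints (0, 2, 1)
```

Every candidate position costs N×N absolute differences, which is why an exhaustive search of this kind dominates the computation and memory traffic of the encoder.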
FIG. 3 is a functional block diagram illustrating a conventional image data coding system. Referring to FIG. 3, a conventional image data coding system 1 includes an encoder 70 for searching for a best matching block 60 in the search range 50 of the reference frame 40. The encoder 70 loads data stored in an external memory 84 via an external bus 90 and a memory interface 82. The data stored in the external memory 84 is the data of the reference blocks in the search range 50. The encoder 70 includes an encoding engine 72, an internal memory 74, and a computation engine 76. The internal memory 74 is adapted for storing the data loaded from the external memory 84. The computation engine 76 executes a logical computation to obtain the SAD values. The encoding engine 72 identifies the best matching block 60 having the minimum SAD according to the SAD values obtained by the computation engine 76.
To calculate the SAD value, data must be loaded very frequently from the external memory 84 into the internal memory 74. As such, the external bus 90 requires a large data bandwidth, and the computation engine 76 has to handle a very heavy load, so that the overall coding efficiency is drastically impaired. Further, a longer computation time means higher power consumption, which shortens the operation time of a handheld apparatus powered by a battery system. Moreover, the more data that must be loaded, the larger the memory capacity required, which inevitably increases the hardware cost of the coding system. As such, several data access schemes for accessing data of the search range have been proposed in the conventional technology to reduce data transmission and enhance data reuse. The data access schemes include Level A, Level B, Level C, Level D, and Level C+.
FIG. 4 is a schematic diagram illustrating the search range of the conventional BMA. Referring to FIG. 4, the search range 50 has a width SRH+N−1, a height SRV+N−1, a horizontal searching range SRH, and a vertical searching range SRV. A reference block 61 positioned at the center point of the search range 50 is an N×N block. All of these values are measured in pixels, with SRH=2PH and SRV=2PV.
FIG. 5 is a schematic diagram illustrating the Level A scheme of the conventional technology. Referring to FIG. 5, in the search range 50, an overlap region 62 between two successive reference blocks is shown as the dashed region in FIG. 5. As such, whenever a next reference block is searched, N×1 pixels of data must be loaded from the external memory 84 in advance. Therefore, the size of the internal memory 74 is N×(N−1). However, because data is accessed so frequently, the external bus 90 suffers a very heavy load, and the data is not effectively reused.
FIG. 6 is a schematic diagram illustrating the Level B scheme of the conventional technology. Referring to FIG. 6, a search band 51 of the search range 50 in the external memory 84 is retrieved as a whole by the coding system. The search band 51 has a width SRH+N−1 and a height N. The coding system obtains the SAD value of a corresponding reference block from the search band 51. An overlap region 62 of two successive search bands 51 and 52 is shown as the dashed region in FIG. 6. The overlap region 62 occupies a size (N−1)×(SRH+N−1) of the internal memory 74. The data in the overlap region 62 can be reused according to the Level B scheme. In other words, when the coding system executes the next SAD calculation, the data in the overlap region 62 does not need to be reloaded into the internal memory. Such data has been loaded in advance, and only data of 1×(SRH+N−1) is required to be loaded. Therefore, the data load bandwidth can be drastically reduced.
FIG. 7 is a schematic diagram illustrating the Level C scheme of the conventional technology. Referring to FIG. 7, the coding system loads the data of the search range 50 into the internal memory 74 in two stages. First, a search band 51 is loaded. The search band 51 has a width SRH+N−1 and a height SRV+N−1. The SAD value is then calculated, with the two successive current blocks CB0 and CB1 selected from left to right as indicated by the arrow in FIG. 7. Then, another search band 52 is loaded. The search band 52 has a width SRH+N−1 and a height SRV+N−1. However, there is an overlap region 62 between the search band 51 and the search band 52, shown as the dashed region in FIG. 7. As such, only data of (N+SRV−1)×(N+SRH−1) is required to be loaded. In other words, the size of the internal memory 74 is (N+SRV−1)×(N+SRH−1). Compared with the Level B scheme, the Level C scheme only needs to retrieve data from the external memory twice, and therefore the data load bandwidth can be drastically reduced.
FIG. 8 is a schematic diagram illustrating the Level D scheme of the conventional technology. Referring to FIG. 8, the Level D scheme is similar to the Level C scheme discussed above. The coding system loads the data of the search range 50 into the internal memory 74 in two stages, i.e., search bands 51 and 52. Unlike the Level C scheme shown in FIG. 7, in which the search bands are vertically partitioned, the Level D scheme shown in FIG. 8 horizontally partitions the search bands. The search bands 51 and 52 have a width SRH+W−1 and a height SRV+N−1, in which W is the width of an image. The overlap region 62 of the search bands 51 and 52 is shown as the dashed region in FIG. 8. Further, two successive current blocks CB0 and CB1 are selected from top to bottom. Therefore, the size of the internal memory 74 is (SRH+W−1)×(SRV−1).
FIG. 9 is a schematic diagram illustrating the Level C+ scheme of the conventional technology. Referring to FIG. 9, the Level C+ scheme is similar to the Level C scheme and the Level D scheme discussed above. According to the Level C+ scheme, the search range 50 is partitioned both horizontally and vertically into four parts for loading. As such, the size of the internal memory 74 is (SRH+N−1)×(SRV+nN−1), in which n=2. The four successive current blocks CB0, CB1, CB2, and CB3 are selected in a zigzag manner, as indicated by the arrow shown in FIG. 9.
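To give a feel for the internal memory sizes stated above for the five schemes, the following sketch simply evaluates those expressions. The function name `internal_memory_pixels` and the numeric parameters (N=16, SRH=SRV=32, W=720, n=2) are assumptions chosen for illustration only and do not come from the text.

```python
def internal_memory_pixels(n, sr_h, sr_v, frame_width, n_blocks=2):
    """Internal memory size, in pixels, as stated for each data access scheme."""
    return {
        "Level A": n * (n - 1),
        "Level B": (n - 1) * (sr_h + n - 1),
        "Level C": (n + sr_v - 1) * (n + sr_h - 1),
        "Level D": (sr_h + frame_width - 1) * (sr_v - 1),
        "Level C+": (sr_h + n - 1) * (sr_v + n_blocks * n - 1),
    }


# Hypothetical parameters: 16 x 16 blocks, 32-pixel search ranges, 720-pixel-wide frame.
sizes = internal_memory_pixels(n=16, sr_h=32, sr_v=32, frame_width=720)
for level, pixels in sizes.items():
    print(f"{level:8s} {pixels:6d} pixels")
```

Under these assumed parameters the schemes that reuse more data also hold noticeably more pixels in the internal memory.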
The load bandwidth BW of the external bus 90 is represented by the following equation:
$$BW = f \times W \times H \times N \times Ra,$$
in which f represents the frame rate, N here represents the number of searched reference frames (not the block side length used elsewhere), W represents the frame width, H represents the frame height, and Ra represents the average external pixel access count for each current pixel in its motion estimation process, which can be defined as:
$$Ra = \frac{\text{total number of external memory accesses in task}}{\text{current pixel count in task}}$$
As such, Ra of the Level A scheme can be expressed as:
$$Ra = SR_V \times \left(1 + \frac{SR_H}{N}\right);$$
Ra of the Level B scheme can be expressed as:
$$Ra = \left(1 + \frac{SR_V}{N}\right) \times \left(1 + \frac{SR_H}{N}\right);$$
Ra of the Level C scheme can be expressed as:
$$Ra = \frac{N \times (SR_V + N - 1)}{N \times N} \approx 1 + \frac{SR_V}{N};$$
Ra of the Level D scheme can be expressed as: Ra=1; and
Ra of the Level C+ scheme can be expressed as:
$$Ra = \frac{N \times (SR_V + nN - 1)}{N \times nN} \approx 1 + \frac{SR_V}{nN}.$$
Comparing the load bandwidths corresponding to the foregoing Level A, Level B, Level C, Level D, and Level C+ schemes shows that the Level C scheme and the Level C+ scheme have relatively good data reusability.
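The comparison can be made concrete with a short sketch that evaluates the Ra expressions and the load bandwidth equation given above. The function names `access_ratio` and `load_bandwidth` are hypothetical, and the parameters (N=16, SRH=SRV=32, n=2, a 720×480 frame at 30 frames per second, one reference frame) are assumed for illustration only.

```python
def access_ratio(n, sr_h, sr_v, n_blocks=2):
    """Average external accesses per current pixel (Ra) for each scheme."""
    return {
        "Level A": sr_v * (1 + sr_h / n),
        "Level B": (1 + sr_v / n) * (1 + sr_h / n),
        "Level C": 1 + sr_v / n,
        "Level D": 1.0,
        "Level C+": 1 + sr_v / (n_blocks * n),
    }


def load_bandwidth(frame_rate, width, height, ref_frames, ra):
    """BW = f x W x H x N x Ra, in pixel accesses per second."""
    return frame_rate * width * height * ref_frames * ra


# Hypothetical parameters, not taken from the text.
ratios = access_ratio(n=16, sr_h=32, sr_v=32)
for level, ra in ratios.items():
    bw = load_bandwidth(30, 720, 480, 1, ra)
    print(f"{level:8s} Ra = {ra:5.1f}  BW = {bw:,.0f} pixels/s")
```

With these assumed values, Ra is 96 for Level A, 9 for Level B, 3 for Level C, 1 for Level D, and 2 for Level C+, so the bandwidth scales by the same factors.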
However, the Level C and Level C+ schemes require more internal memory space to hold a large amount of image data of the search ranges, which is how they reduce the frequency of accesses to the external memory. In other words, hardware cost is traded for reduced computation time and load bandwidth. Therefore, the Level C and Level C+ schemes have not solved the problems as expected.
Further, according to the Level C and Level C+ schemes, the best matching blocks are found by a full search block matching algorithm (FSBMA). Although easy to apply, the FSBMA does not offer high search efficiency or an improved search speed. As such, the coding system still consumes too much power.
Therefore, a fast motion estimation method that reduces the load bandwidth and saves hardware cost is highly desired. Such a method should employ a fast searching method with high data reusability to overcome the disadvantages of the conventional technology, i.e., high data load bandwidth and slow searching speed, without changing the architecture of the coding system.