(1) Field of the Invention
The present invention relates to a motion estimation device, a motion estimation method, a motion estimation integrated circuit, and a picture coding device, which perform motion estimation for blocks in a picture.
(2) Description of the Related Art
Recently, with the arrival of the age of multimedia in which audio, video and other pixel values are integrally handled, existing information media, i.e., newspapers, journals, TVs, radios, telephones and other means through which information is conveyed to people, have come under the scope of multimedia. Generally speaking, multimedia refers to something that is represented by associating not only characters but also graphics, audio and especially images and the like together. However, in order to include the aforementioned existing information media in the scope of multimedia, it is a prerequisite to represent such information in digital form.
However, when the amount of information contained in each of the aforementioned information media is estimated as an amount of digital information, the amount of information per character is 1 to 2 bytes, whereas audio requires more than 64 Kbits (telephone quality) per second, and a moving picture requires more than 100 Mbits (present television reception quality) per second. Therefore, it is not realistic for the information media to handle such an enormous amount of information as it is in digital form. For example, although video phones are already in actual use via Integrated Services Digital Network (ISDN), which offers a transmission speed of 64 Kbit/s to 1.5 Mbit/s, it is impossible to transmit images on televisions and images taken by cameras directly through ISDN.
This therefore requires information compression techniques, and for instance, in the case of the videophone, video compression techniques compliant with H.261 and H.263 standards recommended by International Telecommunication Union-Telecommunication Standardization Sector (ITU-T) are employed. According to the information compression techniques compliant with the MPEG-1 standard, image information as well as audio information can be stored in an ordinary music Compact Disc (CD).
Here, Moving Picture Experts Group (MPEG) is an international standard for compression of moving picture signals standardized by International Standards Organization/International Electrotechnical Commission (ISO/IEC), and MPEG-1 is a standard to compress moving picture signals down to 1.5 Mbps, that is, to compress information of TV signals approximately down to a hundredth. The transmission rate within the scope of the MPEG-1 standard is set to about 1.5 Mbps to achieve middle-quality pictures. Therefore, MPEG-2, which was standardized with a view to meeting the requirements of high-quality pictures, allows data transmission of moving picture signals at a rate of 2 to 15 Mbps to achieve the quality of TV broadcasting. In the present circumstances, a working group (ISO/IEC JTC1/SC29/WG11) in charge of the standardization of the MPEG-1 and the MPEG-2 has achieved a compression rate which goes beyond what the MPEG-1 and the MPEG-2 have achieved, further enabled coding/decoding operations on a per-object basis, and standardized MPEG-4 in order to realize new functions required by the era of multimedia. In the process of the standardization of the MPEG-4, the initial aim was to standardize a coding method for low bit rates. However, the aim is presently extended to a more versatile coding of moving pictures at high bit rates, including interlaced pictures.
Furthermore, MPEG-4 AVC and H.264, developed jointly by the ISO/IEC and the ITU-T, have been standardized since 2003 as a picture coding method with a higher compression rate. Currently, regarding H.264, a draft of its revised standard in compliance with a High Profile, which is suited for High Definition (HD) pictures, has been developed. As with the MPEG-2 and the MPEG-4, applications in compliance with the H.264 standard are expected to extend to digital broadcast, a Digital Versatile Disk (DVD) player/recorder, a hard disc player/recorder, a camcorder, a video phone and the like.
In general, in coding of a moving picture, the amount of information is compressed by reducing redundancy in temporal and spatial directions. Therefore, an inter-picture prediction coding, which aims at reducing the temporal redundancy, estimates a motion and generates a predictive picture on a block-by-block basis with reference to prior and/or subsequent pictures, and then codes a differential value between the obtained predictive picture and a current picture to be coded. Here, “picture” is a term to represent a single screen, and it represents a frame when used for a progressive picture whereas it represents a frame or fields when used for an interlaced picture. The interlaced picture here is a picture in which a single frame consists of two fields, each having a different time. For encoding and decoding an interlaced picture, a single frame can be processed either as a frame, as two fields, or as a frame/field structure depending on a block in the frame.
A picture to which an intra-picture prediction coding is performed without reference pictures is referred to as an “I-picture”. A picture to which the inter-picture prediction coding is performed with reference to a single picture is referred to as a “P-picture”. A picture to which the inter-picture prediction coding is performed by referring simultaneously to two pictures is referred to as a “B-picture”. The B-picture can refer to two pictures, selected as an arbitrary combination from the pictures whose display time is either forward or backward to that of a current picture to be coded. The reference pictures can be specified for each macroblock, which is a fundamental unit of coding, and are distinguished as a first reference picture and a second reference picture. Here, the first reference picture is the reference picture described first in a coded bit stream, and the second reference picture is the reference picture described after the first reference picture in the coded bit stream. As a condition for coding these pictures, however, the reference pictures need to be already coded.
A motion compensation inter-picture prediction coding is used for coding the P-picture or the B-picture. The motion compensation inter-picture prediction coding is a coding method which applies motion compensation to an inter-picture prediction coding. The motion compensation is a method of reducing the amount of data while increasing prediction precision by estimating an amount of motion (this is referred to as a motion vector, hereinafter) of each part in a picture and performing prediction in consideration of the estimated amount of motion, instead of simply predicting a picture from a pixel value of a reference frame. For example, the amount of data is reduced by estimating a motion vector of a current picture to be coded and coding a predictive difference between a predicted value which is shifted as much as the estimated motion vector and the current picture. Since this method requires information about the motion vector at the time of decoding, the motion vector is also coded, and recorded or transmitted.
The motion vector is estimated on a macroblock basis. Specifically, a motion vector is estimated by fixing a macroblock (target block) of the current picture, moving a macroblock (reference block) of the reference picture within a range in which the reference block is referred to by the target block (hereinafter, referred to as “motion estimation range”), and finding the position of the reference block which most closely approximates the target block.
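The search described above can be illustrated by the following minimal sketch of an exhaustive block-matching search. The text does not name a matching criterion; the sum of absolute differences (SAD) used here, as well as the function names and default sizes, are assumptions made only for illustration.

```python
def sad(cur, ref, bx, by, rx, ry, size):
    """Sum of absolute differences between the target block at (bx, by) in
    `cur` and the candidate reference block at (rx, ry) in `ref`."""
    return sum(
        abs(cur[by + y][bx + x] - ref[ry + y][rx + x])
        for y in range(size)
        for x in range(size)
    )


def full_search(cur, ref, bx, by, size=16, search=16):
    """Return (mv_x, mv_y, cost) with the lowest SAD inside a +/-`search`
    pixel motion estimation range around the target block position."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall outside the reference picture.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

Pictures are passed as lists of pixel rows. Every candidate position is read from the reference picture, which is why the size of the motion estimation range directly determines how many reference pixels must be available to the motion estimation unit.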
FIG. 1 is a block diagram which shows a structure of a conventional inter-picture prediction coding device.
This inter-picture prediction coding device 800 includes a motion estimation unit 801, a multi-frame memory 802, a subtractor 803, a subtractor 804, a motion compensation unit 805, a coding unit 806, an adder 807, a motion vector memory 808, and a motion vector prediction unit 809.
The motion estimation unit 801 compares a motion estimation reference pixel MEp outputted from the multi-frame memory 802 with an image signal Vin, and outputs a motion vector MV and a reference frame number RN. The reference frame number RN is an identification signal for identifying a reference picture to be selected from among plural reference pictures as a reference picture for a current picture to be coded. The motion vector MV is temporarily stored in the motion vector memory 808, and then outputted as a neighboring motion vector PvMV. This neighboring motion vector PvMV is referred to by the motion vector prediction unit 809 for predicting a predictive motion vector PrMV. The subtractor 804 subtracts the predictive motion vector PrMV from the motion vector MV, and outputs the difference as the motion vector predictive difference DMV.
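The role of the subtractor 804 can be sketched as follows. How the motion vector prediction unit 809 derives PrMV from the neighboring motion vectors PvMV is not stated here; the component-wise median used below is an assumption for illustration, and all function names are hypothetical.

```python
def predict_mv(neighbors):
    """PrMV as the component-wise median of the neighboring motion vectors
    PvMV (assumed predictor; the text does not specify one)."""
    xs = sorted(mv[0] for mv in neighbors)
    ys = sorted(mv[1] for mv in neighbors)
    mid = len(neighbors) // 2
    return (xs[mid], ys[mid])


def mv_difference(mv, pr_mv):
    """DMV = MV - PrMV, the value handed to the coding unit."""
    return (mv[0] - pr_mv[0], mv[1] - pr_mv[1])


def mv_reconstruct(dmv, pr_mv):
    """Inverse step on the decoding side: MV = PrMV + DMV."""
    return (pr_mv[0] + dmv[0], pr_mv[1] + dmv[1])


pr_mv = predict_mv([(4, -2), (6, 0), (5, 3)])   # (5, 0)
dmv = mv_difference((7, 1), pr_mv)              # (2, 1)
```

Because neighboring blocks tend to move together, DMV is usually small, and a small value costs fewer bits under variable-length coding than the motion vector itself.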
On the other hand, the multi-frame memory 802 outputs a pixel indicated by the reference frame number RN and the motion vector MV as a motion compensation reference pixel MCp1, and the motion compensation unit 805 generates a reference pixel in sub-pixel precision and outputs a reference picture pixel MCp2. The subtractor 803 subtracts the reference picture pixel MCp2 from the image signal Vin, and outputs a picture predictive difference DP.
The coding unit 806 performs variable-length coding on the picture predictive difference DP, the motion vector predictive difference DMV, and the reference frame number RN, and outputs the coded stream Str. It should be noted that, upon coding, a decoded picture predictive difference RDP, which is a result of decoding the picture predictive difference DP, is simultaneously outputted. The decoded picture predictive difference RDP is the picture predictive difference DP on which a coding error is superimposed, and is the same as the inter-picture predictive difference which an inter-picture prediction decoding device obtains by decoding the coded stream Str.
The adder 807 adds the decoded picture predictive difference RDP to the reference picture pixel MCp2, and stores the resultant into the multi-frame memory 802 as a decoded picture RP. However, for an effective use of the capacity of the multi-frame memory 802, an area of the picture stored in the multi-frame memory 802 is released when it is not necessary, and the decoded picture RP of the picture which is not necessary to be stored in the multi-frame memory 802 is not stored into the multi-frame memory 802.
FIG. 2 is a block diagram for explaining a conventional inter-picture prediction decoding device. Note that the same reference characters in FIG. 1 are assigned to the identical constituent elements in FIG. 2, so that the details of those elements are the same as described above.
The conventional inter-picture prediction decoding device 900 shown in FIG. 2 outputs a decoded image signal Vout by decoding the coded stream Str coded by the conventional inter-picture prediction coding device 800 shown in FIG. 1. The inter-picture prediction decoding device 900 includes a multi-frame memory 901, a motion compensation unit 902, an adder 903, an adder 904, a motion vector memory 905, a motion vector prediction unit 906, and a decoding unit 907.
The decoding unit 907 decodes the coded stream Str, and outputs a decoded picture predictive difference RDP, a motion vector predictive difference DMV, and a reference frame number RN. The adder 904 adds a predictive motion vector PrMV outputted from the motion vector prediction unit 906 and the motion vector predictive difference DMV, and decodes a motion vector MV.
The multi-frame memory 901 outputs a pixel indicated by the reference frame number RN and the motion vector MV as a motion compensation reference pixel MCp1. The motion compensation unit 902 generates a reference pixel with a sub-pixel precision and outputs a reference picture pixel MCp2. The adder 903 adds the decoded picture predictive difference RDP to the reference picture pixel MCp2, and stores the sum into the multi-frame memory 901 as a decoded picture RP (a decoded image signal Vout). However, for an effective use of the capacity of the multi-frame memory 901, an area of the picture stored in the multi-frame memory 901 is released when it is not necessary, and the decoded picture RP of a picture which is not necessary to be stored in the multi-frame memory 901 is not stored into the multi-frame memory 901. Accordingly, the decoded image signal Vout, that is the decoded picture RP, can be correctly decoded from the coded stream Str.
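The reason the coding device and the decoding device arrive at the same decoded picture RP can be seen in the following sketch. The lossy stage inside the coding unit is not described in this text, so a uniform quantizer with a hypothetical step QSTEP stands in for it; the function names are likewise illustrative.

```python
QSTEP = 8  # hypothetical quantizer step standing in for the lossy coding stage


def code_and_decode(dp):
    """Stand-in for the coding unit: returns RDP, i.e. DP with a coding
    error superimposed (here, rounding to a multiple of QSTEP)."""
    return [QSTEP * round(d / QSTEP) for d in dp]


def encode_block(vin, mcp2):
    """Coding side: DP = Vin - MCp2 (subtractor), RP = MCp2 + RDP (adder)."""
    dp = [v - p for v, p in zip(vin, mcp2)]
    rdp = code_and_decode(dp)
    rp = [p + r for p, r in zip(mcp2, rdp)]
    return dp, rdp, rp


def decode_block(rdp, mcp2):
    """Decoding side: RP = MCp2 + RDP, using only data available from Str."""
    return [p + r for p, r in zip(mcp2, rdp)]


vin = [100, 104, 90, 77]
mcp2 = [98, 100, 95, 80]
dp, rdp, rp = encode_block(vin, mcp2)
```

Both sides add the same RDP to the same reference picture pixels MCp2, so the pictures held in the two multi-frame memories stay identical even though RP differs slightly from Vin.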
By the way, Japanese Patent No. 2963269, for example, suggests a structure in which the conventional inter-picture prediction coding device 800 shown in FIG. 1 is embedded into a Large Scale Integration (LSI). As disclosed in the patent, in the case where the inter-picture prediction coding device is embedded in an LSI or the like, the multi-frame memory 802 of the conventional inter-picture prediction coding device 800 shown in FIG. 1 is separated into (i) an external frame memory outside the LSI and (ii) a local memory inside the LSI to be directly accessed when the motion estimation unit 801 performs motion estimation for macroblocks.
FIG. 3 is a block diagram showing an example of a structure of the multi-frame memory 802, in which the inter-picture prediction coding device 800 is connected with an external frame memory. Note that the same reference characters in FIG. 1 are assigned to the identical constituent elements of FIG. 3, so that the details of those elements are the same as described above. The multi-frame memory 802 has an external frame memory 820 and a reference local memory 811 which is embedded in an LSI. The external frame memory 820 is a memory which is connected to the LSI having the inter-picture prediction coding device. The reference local memory 811 is a memory inside the LSI and is accessed directly by the motion estimation unit 801 for motion estimation for macroblocks. In FIG. 3, the constituent elements in the LSI other than the reference local memory 811 and the motion estimation unit 801 are not shown.
In FIG. 3, when motion estimation is performed, a picture range to be applied with the motion estimation is firstly transferred from the external frame memory 820 to the reference local memory 811 via an external connection bus Bus1. Next, data is read out from the reference local memory 811 via an internal bus Bus2, and motion estimation is performed by the motion estimation unit 801. With such a structure, a memory capacity of the LSI can be reduced.
FIG. 4 is a schematic diagram showing how pixels in one reference picture are to be transferred. The upper diagram shows an entire reference picture stored in the external frame memory 820. The lower diagram shows an image area which is transferred from the external frame memory 820 to the reference local memory 811 to be used for motion estimation, and a further image area which is transferred for next motion estimation. Assuming that the motion estimation is applied to each macroblock (MB) of 16×16 pixels, FIG. 4 shows that, for motion estimation for macroblocks in one row, pixels of (vertical length of motion estimation range)×(horizontal width of one picture) are transferred to the reference local memory 811. FIG. 4 also shows that, for motion estimation for macroblocks in one picture, the above-calculated pixels×(the number of MBs in a column in the picture) are transferred to the reference local memory 811. In more detail, if the picture is a Standard Definition (SD) picture in MPEG-2 or the like of 720×480 pixels, 45×30 MBs, in which a motion estimation range has macroblocks shifting each single MB from a position of a target macroblock (in other words, the motion estimation range has one macroblock at a position of the target macroblock and eight neighbor macroblocks surrounding the position), then total (16+16×2)×720×30=1,036,800 pixels are transferred to the reference local memory 811 for motion estimation for one picture.
However, if a SD picture in H.264 is managed by the reference local memory 811, more pixels surrounding the position are required than the above conventional MPEG-2 case, since in H.264, a 6-tap filter is used for motion estimation with sub-pixel precision, which is disclosed, for example, in “Information technology—Coding of audio-visual objects—Part 10: Advanced video coding” ISO/IEC 14496-10, International Standard, 2004-10-01. The reason is explained in more detail below. In MPEG-2, a sub-pixel is created using 4 pixels surrounding a position of a sub-pixel-precision pixel. In the case of H.264 using the 6-tap filter, however, a sub-pixel is created using 36 pixels. Therefore, if the motion estimation is assumed to be performed in the same range in both of MPEG-2 and H.264, H.264 requires pixels in two above rows, two below rows, two left columns, two right columns, in addition to pixels used in MPEG-2. As a result, if the picture is a SD picture in H.264 or the like, in which a motion estimation range has macroblocks shifting each single MB from a position of a target macroblock, then total (16+16×2+4)×720×30=1,123,200 pixels are transferred to the reference local memory 811 for motion estimation for one picture.
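The difference in the number of surrounding pixels can be illustrated with a one-dimensional slice of the two interpolation schemes. The tap weights (1, -5, 20, 20, -5, 1)/32 are those of the H.264 luma half-sample filter and are taken from that standard rather than from this text; the function names are hypothetical. In two dimensions the same reasoning gives 2×2=4 pixels versus 6×6=36 pixels.

```python
def clip255(value):
    """Clamp an interpolated value to the 8-bit pixel range."""
    return max(0, min(255, value))


def half_pel_bilinear(row, i):
    """Half-sample between row[i] and row[i + 1] from the two nearest pixels."""
    return (row[i] + row[i + 1] + 1) >> 1


def half_pel_6tap(row, i):
    """Half-sample between row[i] and row[i + 1] using six pixels,
    row[i - 2] .. row[i + 3], weighted (1, -5, 20, 20, -5, 1) / 32."""
    taps = (1, -5, 20, 20, -5, 1)
    acc = sum(t * row[i - 2 + k] for k, t in enumerate(taps))
    return clip255((acc + 16) >> 5)


row = [0, 10, 20, 30, 40, 50]
print(half_pel_bilinear(row, 2), half_pel_6tap(row, 2))  # prints "25 25"
```

The 6-tap filter reads pixels beyond the two that bracket the half-sample position, so a reference block at the edge of the motion estimation range needs extra rows and columns outside that range.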
Moreover, if the picture is a High Definition (HD) picture of 1920×1088 pixels, 120×68 macroblocks, and especially if it is coded in H.264, the above-described pixel transfer amount for one picture is significantly increased, so that such a huge amount cannot be transferred within the capacity of the external connection bus Bus1 shown in FIG. 3.
Examples of such a huge transfer amount are given below. Here, it is assumed that a HD picture of MPEG-2 is managed by the reference local memory 811. Under this assumption, since a HD picture has about 6 times as many pixels as a SD picture, the motion estimation range is taken, for the sake of simplified explanation, to be 2.5 times larger vertically and horizontally than the range of a SD picture, so that the motion estimation range extends 40 pixels vertically and horizontally from a target position. As a result, a total of (16+40×2)×1,920×68=12,533,760 pixels are transferred to the reference local memory 811 for motion estimation for one picture.
Furthermore, if it is assumed that a HD picture of H.264 is managed by the reference local memory 811, a total of (16+40×2+4)×1,920×68=13,056,000 pixels are transferred to the reference local memory 811 for motion estimation for one picture, in the same manner as described above.
As explained above, especially if a HD picture of H.264 is processed, the resulting transfer amount is far larger than that of a SD picture of MPEG-2. Therefore, a technique has been proposed for reducing the image transfer amount at the sacrifice of an area cost. FIG. 5 is a schematic diagram showing how the reference local memory 811 is updated from the external frame memory 820, in order to reduce a transfer amount of reference pixels.
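The four transfer amounts quoted above follow from one expression, which the following sketch evaluates. The function and parameter names are illustrative only; the picture sizes, ranges, and the four extra rows for the 6-tap filter are the values given in the text.

```python
MB = 16  # macroblock height in pixels


def pixels_per_picture(width, mb_rows, v_range, filter_rows=0):
    """(MB + 2 * v_range + filter_rows) rows of `width` pixels are
    transferred once for every macroblock row of the picture."""
    return (MB + 2 * v_range + filter_rows) * width * mb_rows


cases = {
    "SD MPEG-2": pixels_per_picture(720, 30, 16),
    "SD H.264": pixels_per_picture(720, 30, 16, filter_rows=4),
    "HD MPEG-2": pixels_per_picture(1920, 68, 40),
    "HD H.264": pixels_per_picture(1920, 68, 40, filter_rows=4),
}
for name, pixels in cases.items():
    print(f"{name}: {pixels:,} pixels")
# SD MPEG-2: 1,036,800 pixels
# SD H.264: 1,123,200 pixels
# HD MPEG-2: 12,533,760 pixels
# HD H.264: 13,056,000 pixels
```

Under these assumptions the HD H.264 case moves roughly 12.6 times as many reference pixels per picture as the SD MPEG-2 case.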
If one picture Pic included in a to-be-coded stream has a frame structure, a SD picture has a width PW and a height PH which are 45 MB (=720 pixels) and 30 MB (=480 pixels), respectively, and a HD picture has a width PW and a height PH which are 120 MB (=1,920 pixels) and 68 MB (=1,088 pixels), respectively. Hereinafter, respective values of the width PW and the height PH are referred to as M (MB) and N (MB), respectively.
When the motion estimation unit 801 performs motion estimation for macroblocks in the n-th row of an original picture, the reference local memory 811 stores pixel data of (width PW of a reference picture)×(height PH of a motion estimation range for macroblocks in the n-th row of the original picture). More specifically, in the case of a SD picture, the reference local memory 811 stores reference pixel data of (i) macroblocks in a row corresponding to the n-th row in the original picture (PW) and (ii) macroblocks in the immediately above row and the immediately below row of that row (PH). On the other hand, in the case of a HD picture, the reference local memory 811 stores reference pixel data of (i) macroblocks in a row corresponding to the n-th row in the original picture (PW) and (ii) respective 40 pixels immediately above the macroblocks and respective 40 pixels immediately below the macroblocks (PH). Note that a center of motion estimation (motion estimation center) meCnt in a reference picture for each to-be-coded macroblock in the n-th row and the m-th column in the original picture may be at the same position as the to-be-coded macroblock, or may be at a different position which is shifted from the to-be-coded macroblock position.
As described above, by adding a sub memory area to keep an area larger than the actual motion estimation range, it is possible to reduce the image transfer amount to about (1 MB unit height)/(vertical height of the motion estimation range) of the original amount.
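The ratio can be checked with a short calculation, reading it as the fraction of the original transfer amount that remains. The sketch below assumes that, once the sub memory area holds the rows already transferred, only one new macroblock-high strip is fetched per macroblock row, and it ignores the initial fill; the function names are illustrative.

```python
def full_reload(width, mb_rows, range_rows):
    """Whole vertical range transferred again for every macroblock row."""
    return range_rows * width * mb_rows


def incremental(width, mb_rows, mb=16):
    """Only the newly needed strip of `mb` rows is transferred per
    macroblock row (assumed; initial fill of the memory is ignored)."""
    return mb * width * mb_rows


before = full_reload(1920, 68, 16 + 40 * 2 + 4)  # HD picture of H.264
after = incremental(1920, 68)
print(before, after, after / before)  # 13056000 2088960 0.16
```

For the HD picture of H.264 the remaining fraction is 16/100, so the transfer amount drops to about one picture's worth of pixels.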
Furthermore, FIG. 6 is a schematic diagram showing how stored pixels are managed, in order to reduce a capacity of the reference local memory 811. A reference area RefArea is an area which is used as reference for a current target macroblock in the motion estimation unit 801. A sub memory area SubArea is an area which is not used as reference for the target macroblock in the current motion estimation, but used in subsequent motion estimation. A next renewed area NxtArea is an area which is used as reference for a next target macroblock. A next released area RelArea is an area which becomes unnecessary in and after motion estimation for the next target macroblock, and to which a next renewed area NxtArea is overwritten as a physical memory area. The increase of the area cost can be restrained by deleting the sub memory area SubArea in the range stored in the reference local memory 811, as shown in FIG. 6.
However, as shown in FIG. 6, if memory addresses are processed by a first-in, first-out (FIFO) method in an area in which these rectangular areas are combined in the reference local memory 811, address management becomes quite difficult. FIG. 7 is a diagram showing a physical address layout around a logical boundary in the reference local memory 811, when the FIFO management is used. For the sake of simplified explanation, in FIG. 7, it is assumed that the picture is a Quarter Video Graphics Array (QVGA) picture of horizontally 320×vertically 240 pixels, that a motion estimation range has ±16 pixels in horizontal and vertical directions, and that each word has 8 pixels. Under the assumption, FIG. 7 shows addresses around a boundary of address 0, when address mapping is performed as raster addresses from top left.
In FIG. 7 (a), an area HLA enclosed by a dotted line is an area whose addresses are all able to be stored in the reference local memory 811, from address 0 to the last address. In this figure, the addresses are sequentially allocated from top left of the picture from the address 0. In this example, it is assumed that the area HLA has a total of 1408 words consisting of (i) an area (a right-down shaded portion, a horizontally lined portion, and a lattice portion) in which horizontal 40 words (320 pixels)×vertical 32 words (32 pixels) are arranged, and (ii) an area (a dotted portion) in which horizontal 6 words (48 pixels)×vertical 16 words (16 pixels) as a part of a motion estimation range, and horizontal 2 words (16 pixels)×vertical 16 words (16 pixels) as a part of an update area are arranged. FIG. 7 (b) shows physical address numbers included in the first macroblock. Since one macroblock horizontally has 2 words (16 pixels) and one picture horizontally has 40 words (320 pixels), as shown in FIG. 7 (b), the address numbers in the first macroblock advance by 40 from one row to the next.
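The word counts and the address pattern of FIG. 7 (b) can be reproduced with the following sketch, which assumes plain raster addressing of 8-pixel words from the top left of the QVGA picture. The constant and function names are illustrative.

```python
WORD_PIXELS = 8
WORDS_PER_ROW = 320 // WORD_PIXELS  # 40 words per picture row
MB_WORDS = 16 // WORD_PIXELS        # 2 words per macroblock row


def macroblock_addresses(mb_x, mb_y):
    """Raster word addresses covered by the macroblock at (mb_x, mb_y)."""
    base = mb_y * 16 * WORDS_PER_ROW + mb_x * MB_WORDS
    return [
        base + row * WORDS_PER_ROW + col
        for row in range(16)
        for col in range(MB_WORDS)
    ]


first_mb = macroblock_addresses(0, 0)
print(first_mb[:6])   # [0, 1, 40, 41, 80, 81]
print(first_mb[-2:])  # [600, 601]

# Full-width rows, plus the partial motion estimation range and update area.
hla_words = 40 * 32 + 6 * 16 + 2 * 16
print(hla_words)      # 1408
```

The first macroblock therefore occupies 32 words whose addresses step by 40 between rows, and these are the addresses that become reusable once that macroblock is no longer referred to.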
FIG. 7 (c) shows physical addresses in a pixel space around a part of the boundary of the area HLA enclosed by the dotted line. When all addresses in the area HLA are filled by pixel transfer for a macroblock, and after that, a next macroblock is to be transferred and stored, the reference local memory 811 uses the FIFO method, and the physical addresses of the next released area RelArea at top left are used for addresses of a next renewed area NxtArea. In more detail, a left-down shaded portion is shown beyond the area HLA, since this portion is not able to be stored at the same time in the reference local memory 811. After the physical addresses in the top-left macroblock shown in FIG. 7 (b) have been used for motion estimation and can be released, the left-down shaded portion for the next macroblock is written over these addresses.
Therefore, when pixels positioned in and around the circle of FIG. 7 (c) are transferred, addresses around the boundary of the area HLA become inconsistent, so that data access by general raster addresses fails. Moreover, the horizontal position of the address 0 is not uniquely determined, because the horizontal position depends on the vertical position. As a result, the address calculation becomes more difficult.
As described above, if the FIFO method is used in the reference local memory 811 to manage physical addresses in the area in which rectangular areas are combined, the addresses are re-used at ill-defined pixel space positions, so that address management becomes significantly difficult, requiring various calculations such as division and modulo operations in addition to multiplication operations. Such complicated address calculation causes various problems. For example, in the case of hardware implementation, the circuit area is increased, and it becomes difficult to meet operation timing. In the case of software implementation, a huge number of processing cycles is required.
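The following sketch illustrates why FIFO reuse brings in division and modulo operations. It is a simplified model rather than the layout of FIG. 7: words are assumed to arrive in plain raster order (the text transfers them macroblock by macroblock), and the capacity is the 1408 words of the QVGA example. All names are hypothetical.

```python
WORDS_PER_ROW = 40  # 320 pixels / 8 pixels per word
CAPACITY = 1408     # words held by the area HLA in the example


def raster_address(x_word, y):
    """Plain raster addressing: one multiplication and one addition."""
    return y * WORDS_PER_ROW + x_word


def fifo_address(x_word, y):
    """Physical address when the oldest words are overwritten in turn
    (assumed raster arrival order). CAPACITY is not a power of two,
    so the modulo cannot be replaced by a bit mask."""
    return raster_address(x_word, y) % CAPACITY


def positions_of_address_zero(wraps):
    """(x_word, y) positions that map to physical address 0."""
    result = []
    for k in range(wraps):
        y, x_word = divmod(k * CAPACITY, WORDS_PER_ROW)
        result.append((x_word, y))
    return result


print(positions_of_address_zero(3))  # [(0, 0), (8, 35), (16, 70)]
```

Even in this simplified model, physical address 0 lands at a different horizontal word position on each pass through the memory, because 1408 is not a multiple of the 40-word row length. This is consistent with the observation above that the horizontal position of an address depends on the vertical position.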