1. Field of the Invention
The present invention relates to the field of image processing of moving pictures, such as H.264 and Moving Picture Experts Group-4 (MPEG-4). More particularly, the present invention relates to a method for accessing a memory in which images are processed by an apparatus for processing moving pictures, which uses motion estimation techniques and motion compensation techniques.
2. Description of the Related Art
H.264, or MPEG-4 Advanced Video Coding (AVC), is a standard technology established by the Joint Video Team (JVT), a partnership project of the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) MPEG and the International Telecommunication Union-Telecommunication standardization sector Video Coding Experts Group (ITU-T VCEG), wherein ISO/IEC is an international standardization organization. The H.264 standard provides improved technologies that distinguish it from existing coding schemes in order to raise the level of coding efficiency. In addition, H.264 permits the use of coding tools such as intra prediction using a flexible block size, an in-loop de-blocking filter, and quarter-pixel motion compensation as typical technologies.
In order to design/employ a COder/DECoder (CODEC) in accordance with the above standard in real-time, it is necessary not only to reduce the number of execution cycles, but also to minimize the number of times a memory is accessed. Since frame storage reference buffers of the majority of video CODECs are located at external memories, the frame storage reference buffers require a significant amount of access time relative to other functions performed by the CODECs. The following table (TABLE 1) shows the relative memory access ratio in each function module included in an H.264 decoder.
TABLE 1
name of module           | max bytes of memory access                      | ratio [%]
reference picture store  | W × H + 2 × (W/2) × (H/2)                       | 10
de-blocking filter       | (W/16) × (H/16 − 1) × 16 × 4 × 2 × 2            | 5
display factor           | W × H + 2 × (W/2) × (H/2)                       | 10
motion compression       | (W/16) × (H/16) × 16 × (9 × 9 + 2 × 3 × 3) × 2  | 75
total                    | ~16 × W × H                                     |
With regard to the above table, W and H denote Width and Height, respectively.
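The formulas of TABLE 1 can be evaluated for a concrete frame size as a plausibility check. The following is a minimal sketch only: it reads the de-blocking row as (W/16) × (H/16 − 1) × 16 × 4 × 2 × 2 and the motion compression row as (W/16) × (H/16) × 16 × (9 × 9 + 2 × 3 × 3) × 2, and it assumes a 176×144 frame, which is not stated in the table. The function names are illustrative.

```python
def table1_bytes(width, height):
    """Evaluate the TABLE 1 formulas (max bytes of memory access) per module."""
    mb_cols = width // 16   # macroblocks per row
    mb_rows = height // 16  # macroblock rows per frame
    frame_bytes = width * height + 2 * (width // 2) * (height // 2)
    return {
        "reference picture store": frame_bytes,
        "de-blocking filter": mb_cols * (mb_rows - 1) * 16 * 4 * 2 * 2,
        "display factor": frame_bytes,
        "motion compression": mb_cols * mb_rows * 16 * (9 * 9 + 2 * 3 * 3) * 2,
    }


def access_ratios(width, height):
    """Return (ratio in percent per module, total bytes) for one frame."""
    per_module = table1_bytes(width, height)
    total = sum(per_module.values())
    ratios = {name: 100.0 * count / total for name, count in per_module.items()}
    return ratios, total


if __name__ == "__main__":
    # 176x144 is an assumed example frame size, not a value from the table.
    ratios, total = access_ratios(176, 144)
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.1f} %")
    print(f"total / (W x H) = {total / (176 * 144):.2f}")
```

Under these assumptions the total comes out close to 16 × W × H and the motion compression row accounts for roughly three quarters of the accesses, in line with the ratios listed in TABLE 1.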
As shown in TABLE 1, the greater part of the memory accesses is generated by the motion compensation part, which performs the motion compression and has by far the largest access ratio. In particular, in a mobile environment, the rate of inter prediction increases faster than the rate of intra prediction, so there is a growing need for a scheme that embodies a more efficient decoder than known heretofore by reducing the number of times memory is accessed in the inter prediction part.
Because motion compensation in H.264 has a tree structure, a single macroblock is partitioned into groups of 16×16, 16×8, 8×16, or 8×8 pixels, wherein a relevant motion vector is sought in each case, and an image value is predicted at different points in time. In addition, a block of 8×8 size may be sub-divided into sub-macroblocks of 8×4, 4×8, or 4×4 size in order to accurately sense detailed motion. When a half-pixel or a quarter-pixel is to be found, a basic image is enlarged two or four times, respectively, and then the motion prediction is performed. In order to enlarge the images, in H.264, pixels are fetched from reference frames and passed through a six-tap filter, e.g., a six-tap Finite Impulse Response (FIR) filter, and then the prediction is performed.
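The partition tree described above determines how many blocks, and hence how many motion vectors, one macroblock can carry. The following is a minimal sketch of that count; the names and the (width, height) ordering of each size are assumptions made for illustration.

```python
MACROBLOCK_SIZE = 16

# Partition sizes named in the text, written as (width, height).
MACROBLOCK_PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8)]
SUB_MACROBLOCK_PARTITIONS = [(8, 4), (4, 8), (4, 4)]

# Enlargement factor of the basic image for each interpolation precision.
ENLARGEMENT = {"half-pixel": 2, "quarter-pixel": 4}


def blocks_per_macroblock(block_w, block_h):
    """Number of block_w x block_h blocks that tile one 16x16 macroblock."""
    return (MACROBLOCK_SIZE // block_w) * (MACROBLOCK_SIZE // block_h)


if __name__ == "__main__":
    for w, h in MACROBLOCK_PARTITIONS + SUB_MACROBLOCK_PARTITIONS:
        print(f"{w}x{h}: {blocks_per_macroblock(w, h)} block(s) per macroblock")
```

The finest partition, 4×4, yields sixteen blocks per macroblock, which is the case that drives the largest number of reference pixel fetches.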
FIG. 1 is a view illustrating the pixels necessary for a 4×4 luminance inter prediction made during general moving picture processing. With reference to FIG. 1, in the case of a block of 4×4 size (hereinafter, referred to as “4×4 block”), in order to find interpolation pixels, such as the half-pixel or the quarter-pixel, by using the six-tap filter, pixels beyond those of the 4×4 block itself are additionally required, namely two or three more rows/columns adjacent to the upper and lower sides and to the left and right sides of the relevant block.
For example, in FIG. 1, when the half-pixel (i.e., an interpolation pixel A0′) that is to be vertically interpolated between pixels A0 and A1 is found (vertical interpolation) by using the six-tap filter, pixels A_2, A_1, A0, A1, A2, and A3 are used in the interpolation. At this time, weights (i.e., tap values of the six-tap filter) are given to the pixels participating in the interpolation. The tap values (weights) are set to 1, −5, 20, 20, −5, and 1, respectively. Of course, the number of the pixels participating in the interpolation and the tap values given to those pixels can be set in various ways. The following EXPRESSION 1 shows a formula for evaluating a value of the half-pixel A0′ by use of a six-tap filter.
A0′ = (1×A_2 − 5×A_1 + 20×A0 + 20×A1 − 5×A2 + 1×A3)/32    EXPRESSION 1
As in EXPRESSION 1, the interpolation pixel A0′ corresponds to the weighted mean of A_2, A_1, A0, A1, A2, and A3 to which adequate weights are assigned.
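EXPRESSION 1 can be illustrated with a short sketch. The tap values and the division by 32 (the sum of the tap values) are taken from the text; the function names and the sample values are assumptions for illustration. A practical decoder would typically also round the result and clip it to the valid pixel range, which is omitted here.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # tap values (weights) of the six-tap filter


def half_pixel(samples):
    """Weighted mean of six neighbouring pixels, as in EXPRESSION 1.

    samples is ordered A_2, A_1, A0, A1, A2, A3; the result is A0'.
    """
    if len(samples) != len(TAPS):
        raise ValueError("the six-tap filter needs exactly six samples")
    return sum(tap * s for tap, s in zip(TAPS, samples)) / 32


def vertical_half_pixels(column):
    """Slide the filter down one column of pixels.

    A column of 9 pixels (4 block rows plus 2 above and 3 below)
    yields the 4 half-pixels lying below the 4 block rows.
    """
    return [half_pixel(column[i:i + 6]) for i in range(len(column) - 5)]


if __name__ == "__main__":
    print(half_pixel([10, 20, 30, 40, 50, 60]))
    print(vertical_half_pixels([0, 10, 20, 30, 40, 50, 60, 70, 80]))
```

On a flat region the filter returns the pixel value unchanged, and on a linear ramp it returns the midpoint of the two central pixels, which is the behaviour expected of a half-pixel interpolator.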
However, since the reference buffer is located at an external memory, during an inter prediction operation the data bytes of each necessary pixel must be loaded into memories L1 and L2, which correspond to the internal cache memory of a Central Processing Unit (CPU). In a worst-case scenario, sixteen 4×4 blocks exist when an inter prediction of one 16×16 macroblock is performed, so that a total of 1296 [bytes] (from 9×9×16=1296) must be fetched and read.
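The worst-case figure of 1296 [bytes] follows directly from the size of the reference window needed by the six-tap filter. The following is a minimal sketch of that arithmetic, assuming one byte per luminance pixel; the function names are illustrative.

```python
def reference_window(block_w, block_h, taps=6):
    """Size of the reference area needed to interpolate one block.

    A filter with `taps` taps needs taps - 1 extra rows and columns
    (2 on one side and 3 on the other for a six-tap filter).
    """
    extra = taps - 1
    return block_w + extra, block_h + extra


def worst_case_fetch_bytes(macroblock=16, block=4, taps=6, bytes_per_pixel=1):
    """Bytes fetched when a macroblock is split entirely into small blocks."""
    win_w, win_h = reference_window(block, block, taps)
    blocks = (macroblock // block) ** 2
    return win_w * win_h * blocks * bytes_per_pixel


if __name__ == "__main__":
    print(reference_window(4, 4))      # window for one 4x4 block
    print(worst_case_fetch_bytes())    # sixteen 4x4 blocks per macroblock
```

For comparison, the same sketch gives 21×21 = 441 bytes for an undivided 16×16 macroblock, which shows how strongly the fine partitioning inflates the fetch volume.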
Also, in the case of vertical interpolation, filtering must be performed on data that has been loaded into the memories L1 and L2, and because the data is not arranged contiguously in memory, the number of times the memory is accessed increases. Namely, in the case of a 4×4 block, one interpolation pixel is generated with six loads, and since most CPUs do not have more than twenty registers, more than a maximum of 20 [bytes] cannot be loaded at once. As the total number of bytes necessary for the vertical interpolation corresponds to 36 [bytes] (from 9×4=36), when the 36 [bytes] are loaded into registers so as to perform filtering on them, the memory must be accessed again for a reload, so that the filtering is accomplished by two loads. Moreover, when considering a case where registers are also used for other instructions, many more reloads are needed to perform the operation. This relatively large number of reloads adversely affects the operation time.
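The load counts discussed above can be reproduced with a few lines of arithmetic. The following is a minimal sketch that takes the 20 [bytes] register capacity and the six loads per interpolation pixel from the text; the function names and the assumption of one byte per pixel are illustrative, and the model ignores registers occupied by other instructions.

```python
import math


def vertical_window_bytes(block_w=4, block_h=4, taps=6):
    """Bytes needed to vertically interpolate one block (one byte per pixel)."""
    return block_w * (block_h + taps - 1)


def load_rounds(total_bytes, register_capacity_bytes=20):
    """Rounds of register loads needed when only part of the data fits."""
    return math.ceil(total_bytes / register_capacity_bytes)


def naive_loads(block_w=4, block_h=4, taps=6):
    """Loads needed if every interpolation pixel reloads all of its taps."""
    return taps * block_w * block_h


if __name__ == "__main__":
    needed = vertical_window_bytes()
    print(needed, "bytes needed")
    print(load_rounds(needed), "rounds of register loads")
    print(naive_loads(), "loads without any reuse")
```

With 36 bytes needed and room for at most 20, the data cannot be held in one pass, which is the source of the reloads described above.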