In video coding, detecting motion vectors requires an enormous amount of computation. Thus, in order to speed up this processing, motion search apparatuses using a systolic array have been developed and put into practical use (refer to non-patent document 1, patent document 1, non-patent document 2, and patent document 2).
The systolic array is a computing apparatus in which a plurality of processor elements (to be referred to as PE hereinafter) are arranged regularly, and calculation target data flows through the PEs like a pipeline so that computational processing is executed by the PEs in parallel and with high speed. The computing apparatus is also called a PE array.
In particular, in a motion search apparatus for which high-speed video coding processing is required, the PE array is used to repeatedly calculate the sum of absolute difference (SAD) of pixel values between a coding target block of the original image and the reference image within a motion search range of the reference image, so that motion vector detection is sped up.
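As an illustration only, the repeated SAD calculation can be sketched in software as follows. The function names sad and full_search, the row-major list layout, and the raster search order are assumptions introduced for this sketch; they do not describe the PE array hardware.

```python
def sad(block, ref, dx, dy):
    """Sum of absolute difference between block and the same-sized
    region of ref whose top-left corner is at (dx, dy)."""
    return sum(
        abs(block[y][x] - ref[y + dy][x + dx])
        for y in range(len(block))
        for x in range(len(block[0]))
    )


def full_search(block, ref):
    """Evaluate the SAD at every position where the block fits inside
    ref and return (smallest SAD, dx, dy). Illustrative sketch only."""
    block_h, block_w = len(block), len(block[0])
    ref_h, ref_w = len(ref), len(ref[0])
    best = None
    for dy in range(ref_h - block_h + 1):
        for dx in range(ref_w - block_w + 1):
            cost = sad(block, ref, dx, dy)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Every candidate position costs one subtraction, one absolute value and one addition per pixel of the block, which is the repeated calculation that the PE array performs in parallel.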
FIG. 9A shows a configuration example of the PE array in a conventional motion search apparatus. This example is configured to calculate the sum of absolute difference between the original image of 4×2 pixels and the reference image by using 8 processor elements PE00-PE31. By increasing the number of PEs, the PE array can be also configured to calculate the sum of absolute difference in units of 4×4 pixels or 8×8 pixels, for example. In addition, by combining a plurality of PE arrays 40 shown in FIG. 9A, a calculation circuit can be configured for calculating the sum of absolute difference of n×m pixels (n≧4, m≧4).
The PE array 40 includes input terminals of original image input data SMB, 4 pieces of reference image input data RA00, RA01, RA10 and RA11, and a reference image switch control input signal RASW for controlling a selector for selecting reference image input data to be calculated. As an output terminal, the PE array 40 includes an output terminal for an output ADOUT of a result of accumulation of the sum of absolute difference.
As shown in FIG. 9B, each PE includes an input terminal MBin for inputting original image input data, an input terminal ADDin for inputting a sum value from a left adjacent PE, an output terminal ADDout for outputting a sum value to a right adjacent PE, and input terminals RAin0 and RAin1 for inputting two pieces of reference image input data.
FIGS. 10A-10C are diagrams for explaining operation of the PE array 40 shown in FIG. 9A. For example, the PE array 40 performs calculation for searching a reference image (pixel values x00, x01, . . . ) shown in FIG. 10B for the part at which the sum of absolute difference becomes the smallest with respect to a 4×2 pixel group (pixel values c00-c31) of the original image shown in FIG. 10A.
In FIG. 10C, pixel values c00, c10 and c20 . . . of the original image are sequentially input to PE00, PE10 and PE20 . . . in initial 8 cycles (clock CLKs), and held. In cycle 1, PE00 receives a pixel value c00 of the original image and a pixel value x00 of the reference image, and calculates an absolute difference S00=|c00−x00|.
In the next cycle 2, PE00 receives a pixel value x10 of the reference image, and calculates an absolute difference S01=|c00−x10|. PE10 calculates a value S10 by adding the absolute difference between a pixel value c10 of the original image and a pixel value x10 of the reference image to the value S00 calculated by the PE00 in cycle 1.
In the next cycle 3, PE00 receives a pixel value x20 of the reference image, and calculates an absolute difference S02=|c00−x20|. PE10 calculates a value S11 by adding the absolute difference between a pixel value c10 of the original image and a pixel value x20 of the reference image to the value S01 calculated by the PE00 in cycle 2. PE20 calculates a value S20 by adding the absolute difference between a pixel value c20 of the original image and a pixel value x20 of the reference image to the value S10 calculated by the PE10 in cycle 2.
As mentioned above, each of PE00-PE31 executes calculation like a pipeline, so that a sum of absolute difference between c00-c31 and x00-x31 is output from the output terminal ADOUT of the PE array 40 initially. In the next cycle, the sum of absolute difference between c00-c31 and x10-x41 is output, and in the next cycle, the sum of absolute difference between c00-c31 and x20-x51 is output. Accordingly, the sum of absolute difference within the search range of the motion vector is sequentially output in each cycle (refer to non-patent documents 1 and 2, and patent documents 1 and 2 for more details).
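The pipelined accumulation described for cycles 1 to 3 can be sketched for a single row of the original image as follows. This is a simplified software model under stated assumptions: only the four PEs of the first row (PE00-PE30) are modeled, the reference pixel of each cycle is supplied to every PE, and the second row, the inputs RAin0 and RAin1, and the selector are omitted. The function name systolic_row_sad is hypothetical.

```python
def systolic_row_sad(block_row, ref_row):
    """Simplified model of one row of PEs (assumption: a single row only).

    PE k holds the original pixel block_row[k]. In each cycle every PE
    receives the same reference pixel, adds its absolute difference to
    the partial sum its left neighbour produced in the previous cycle,
    and passes the result to the right.
    """
    n = len(block_row)
    partial = [0] * n      # value on ADDout of each PE after the last cycle
    outputs = []
    for cycle, x in enumerate(ref_row):
        updated = [0] * n
        for k in range(n):
            add_in = partial[k - 1] if k > 0 else 0   # leftmost PE adds 0
            updated[k] = add_in + abs(block_row[k] - x)
        partial = updated
        # The rightmost PE holds a complete row sum once the pipeline is full.
        if cycle >= n - 1:
            outputs.append(partial[-1])
    return outputs


def direct_row_sad(block_row, ref_row, offset):
    """Reference calculation used to check the pipelined model."""
    return sum(abs(c - ref_row[offset + k]) for k, c in enumerate(block_row))
```

After the first n−1 cycles fill the pipeline, the model emits one complete sum per cycle, each for a position shifted by one pixel, which mirrors the sequential output at ADOUT described above.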
FIG. 11 shows a timing chart in the PE array 40. In FIG. 11, HOLDMB indicates a start signal that instructs each of PE00-PE31 to hold original image input data SMB and to start calculation. CLK indicates a clock, and HOLDSEL indicates a reference image switch control input signal. In FIG. 11, a pixel value of the reference image is represented by the pixel coordinates (x, y) in the reference image. For example, (0, 0) corresponds to the pixel value x00 shown in FIG. 10B. Each pixel value of the reference image is sequentially supplied to the PE array 40. Normally, however, pixel values of a plurality of pixels are read together from the reference image memory, both because of the way pixels are stored in the memory and in order to reduce the number of memory accesses. In the example shown in FIG. 11, reference image pixel values of (0, 0)-(6, 0) and (0, 1)-(6, 1) are simultaneously input from the reference image memory at CLK0, and reference image pixel values of (0, 2)-(6, 2) and (0, 3)-(6, 3) are simultaneously input from the reference image memory at CLK8.
Of the pieces of data of 7 pixels×2 read at CLK0, the first 7 pixels are sequentially supplied to the PE array 40 in 7 clocks starting from CLK1, and the remaining 7 pixels are sequentially supplied to the PE array 40 in 7 clocks starting from CLK5. At CLK9, a result of accumulation of the sum of absolute difference at the search origin position coordinates (0, 0) is output.
FIG. 12 shows read timing for reading data from the reference image memory. As mentioned before, in the first cycle, the pieces of data of 7 pixels×2 are read from the reference image memory. Since two pieces of data cannot be read simultaneously from the same memory bank, each line of the image is assigned to a bank in rotation, such as Bank0, Bank1, Bank2, Bank0, . . . , for example. Accordingly, bank conflict can be avoided, so that pieces of data in Bank0 and Bank1 can be read simultaneously from the reference image memory, for example.
In the above-mentioned motion search apparatus, which reads data from the reference image memory in this way, when the process moves from the lowermost line in the search range of the reference image to the uppermost line for the next motion search, it is necessary to read pieces of data of 7 pixels×3 simultaneously; that is, the data of the lowermost line must be read together with the data of the two uppermost lines so that processing time in the PE array 40 is not wasted. This corresponds to the read in cycle 32 shown in FIG. 12. That is, in cycle 32, it is necessary to simultaneously read the lowermost line, which starts at (0, 8), and the two uppermost lines, which start at (4, 0) and (4, 1). Although this example explains a case where the number of lines in the search range is odd, a similar process can also be performed when the number is even.
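The bank assignment and the simultaneous reads described above can be checked with a short sketch. It assumes three banks assigned to lines in rotation (following the example Bank0, Bank1, Bank2, Bank0, . . . ) and a search range of lines 0 to 8; the intermediate line pairs in READ_GROUPS are inferred from the two-lines-per-read pattern rather than stated in the description, and all names are hypothetical.

```python
NUM_BANKS = 3  # assumption taken from the example "Bank0, Bank1, Bank2, Bank0, ..."


def bank_of_line(y, num_banks=NUM_BANKS):
    """Bank that holds image line y when lines are assigned in rotation."""
    return y % num_banks


def has_bank_conflict(lines, num_banks=NUM_BANKS):
    """True if two of the lines read together are stored in the same bank."""
    banks = [bank_of_line(y, num_banks) for y in lines]
    return len(set(banks)) < len(banks)


# Lines read together in each access for a search range of lines 0-8.
# The last group is the 7 pixels x 3 read: the lowermost line (8) together
# with the two uppermost lines (0 and 1) of the next search position.
READ_GROUPS = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 0, 1)]

if __name__ == "__main__":
    for group in READ_GROUPS:
        banks = [bank_of_line(y) for y in group]
        print(group, "-> banks", banks, "conflict:", has_bank_conflict(group))
```

Under these assumptions every group, including the three-line read, touches each bank at most once, whereas two lines whose numbers differ by a multiple of three, such as lines 0 and 3, would collide.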
In motion search for conventional image coding schemes such as MPEG-2, it is not necessary to perform a search in which a reference position in the reference image goes out of the screen. Therefore, in the conventional schemes, even when pieces of data of the lowermost line and the two uppermost lines are read simultaneously, the problem of delay in memory reading due to bank conflict does not occur, since the banks are different for each piece of data.
[Non-patent document 1] Toshihiro MINAMI, Toshio KONDO, Kazuhito SUGURI and Ryota KASAI, “A Proposal of a One-dimensional Systolic Array Architecture for the Full-search Block Matching Algorithm”, IEICE Trans. D-I, Vol. J78-D-I, No. 12, pp. 913-925, December 1995.
[Non-patent document 2] Toshihiro MINAMI and Jiro NAGANUMA, “A Proposal of the Construction Method of the Motion Vector Detector Suitable for the Telescopic Search”, IEICE Trans. D-II, Vol. J87-D-II, No. 11, pp. 2007-2024, November 2004.
[Patent document 1] Japanese Patent No. 3127980
[Patent document 2] Japanese Laid-Open Patent Application No. 2005-136455